Matching or searching for strings, ranging from literal strings through regular expressions. The focus is on implementation, not introductions to automata or formal grammars.
Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences
Overview of string matching algorithms from 2002, and the book I have learned/am learning the most from. For each type of matching, it covers a half dozen or fewer algorithms that perfom well in practice, omitting many textbook algorithms with good asymptotic bounds. Covers exact and inexact match of literals, extended literals and full regular expressions.
If you're interested in regular expressions, I would recommend reading this after an introduction to automata and regular languages. This book will teach you much more interesting techniques for implementation, but less of the big picture.
Algorithms, Techniques, and Design
Regular Expression Matching Can Be Simple and Fast (But is Slow in Java, Perl, PHP, Python, Ruby, ...)
Many regular expression engines use backtracking as an implementation technique, causing exponential slowdowns in the worst case.
Observations on the cases where Aho-Corasick performs well or poorly in practice. He states that it typically has decent predictable performance.
I do also know that Paul Wankadia (current maintainer of RE2) has done some really cool work on removing the various split bytecodes from the normal Thompson construction, at the cost of doing a bit more work translating the regex to bytecodes. But I still don’t think it will approach to speed of Aho-Corasick. To a first approximation, I would somewhat expect Aho-Corasick to be about an order of magnitude faster than the typical Thompson construction with a VM. If you JIT the Thompson construction (which has been done), then all bets are off. :-)
A brief description of the algorithm plus a visualization of how the algorithm executes.
The design of re2.
Character classes are represented not as a simple list of ranges or as a bitmap but as a balanced binary tree of ranges. This representation is more complex than a simple list but crucial for handling large Unicode character classes.
Case-insensitive matching uses a special flag and special case for the ASCII range rather than two-element character classes: (?i)abc turns into abc with the case-insensitive bit instead of [Aa][Bb][Cc]. RE2 originally used the latter, but it was too memory-intensive, especially with the tree-based character classes.
One thing that surprised me is the variety of ways that real users write the same regular expression. For example, it is common to see singleton character classes used instead of escaping—[.] instead of —or alternations instead of character classes—a|b|c|d instead of [a-d]. The parser takes special care to use the most efficient form for these, so that [.] is still a single literal character and a|b|c|d is still a character class. It applies these simplifications during parsing, rather than in a second pass, to avoid a larger-than-necessary intermediate memory footprint.
The RE2 compiler has one interesting twist, which I learned from Thompson's grep. It compiles UTF-8 character classes down to an automaton that reads the input one byte at a time. In other words, the UTF-8 decoding is built into the automaton.
Blog post precis of the hyperscan paper. Geoff Langdale also posted a note on updating the teddy algorithm.
Attempting to skip bytes, like Boyer-Moore does, no longer seems to be as effective as using
memchr to process multiple bytes at once.
Alternative to backtracking implementations of regular expressions.
Rust regex crate. Design-wise, it uses many ideas from Re2. At one point in the past, it used macros for compile-time creation of efficient code for regexes, but removed that behavior. Also incorporates the teddy algorithm for SIMD substring matching derived from Hyperscan.
High-performance regular expression matching library
the open source version is x86 only.
How various languages' standard libraries do string searches. Java's
String.indexOf was the only surveyed implementation that uses the naive
O(nm) implementation. However, as of Java 9, this is a Hotspot intrinsic, which means that on supported architectures, Hotspot will replace the Java code with an optimized assembly routine. My implementations of Horspool/BNDM in Java were slower than this routine.
You can construct finite automata to find strings with a given levenshtein distance from an initial string.
Recent work, no idea how it compares to other work.
Variant of Shift-Or that is supposedly faster than BNDM. (2006)
An optimization in grep's Boyer-Moore implementation that improves best case search time at the expense of worst case search time. The optimization works by unrolling the loop to do three additions to the index between checks that there was a partial match. In the cases where there is a partial match, each check adds 0 to the index, temporarily stalling.
A multi-language regex benchmark. Haven't looked much at details--though there are nits to pick in the Java benchmark, for instance.
A suffix array construction algorithm with unicode support
2017 benchmark of Onigumura, Re2, Tre, PCRE2, Hyperscan and Rust's regex crate. One fact it shows is that it has examples where DFAs will struggle, and backtracking implementations will work well. The Regex crate maintains a set of benchmarks, but I don't know if they live anywhere public.
Original paper introducing Commentz-Walter. An algorithm that combines some ideas of Boyer-Moore and Aho-Corasick. Works best with a smallish number of patterns (where does this claim come from?).
Discusses the NFA-DFA conversion, and how it's different from the exponential blowup in backtracking implementations. Suggests that someone needs to write a condensed version of Russ Cox's work on regexes.
Somewhat out of date, see Burntsushi's comments.
Bitparallel algorithms and hopcroft minimization. A little hard for me to relate to other work--need to reread.