String Matching

Matching or searching for strings, ranging from literal strings through regular expressions. The focus is on implementation, not introductions to automata or formal grammars.

Overview

Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences

Overview of string matching algorithms from 2002, and the book I have learned/am learning the most from. For each type of matching, it covers a half dozen or fewer algorithms that perfom well in practice, omitting many textbook algorithms with good asymptotic bounds. Covers exact and inexact match of literals, extended literals and full regular expressions.

If you're interested in regular expressions, I would recommend reading this after an introduction to automata and regular languages. This book will teach you much more interesting techniques for implementation, but less of the big picture.

Algorithms, Techniques, and Design

Regular Expression Matching Can Be Simple and Fast (But is Slow in Java, Perl, PHP, Python, Ruby, ...)

Many regular expression engines use backtracking as an implementation technique, causing exponential slowdowns in the worst case.

Andrew Gallant on Aho-Corasick

Observations on the cases where Aho-Corasick performs well or poorly in practice. He states that it typically has decent predictable performance.

Andrew Gallant on Aho-Corasick and the Thompson Construction

I do also know that Paul Wankadia (current maintainer of RE2) has done some really cool work on removing the various split bytecodes from the normal Thompson construction, at the cost of doing a bit more work translating the regex to bytecodes. But I still don’t think it will approach to speed of Aho-Corasick. To a first approximation, I would somewhat expect Aho-Corasick to be about an order of magnitude faster than the typical Thompson construction with a VM. If you JIT the Thompson construction (which has been done), then all bets are off. :-)

Aho Corasick Visualization

A brief description of the algorithm plus a visualization of how the algorithm executes.

Regular Expression Language Matching in the Wild

The design of re2.

Character classes are represented not as a simple list of ranges or as a bitmap but as a balanced binary tree of ranges. This representation is more complex than a simple list but crucial for handling large Unicode character classes.

Case-insensitive matching uses a special flag and special case for the ASCII range rather than two-element character classes: (?i)abc turns into abc with the case-insensitive bit instead of [Aa][Bb][Cc]. RE2 originally used the latter, but it was too memory-intensive, especially with the tree-based character classes.

One thing that surprised me is the variety of ways that real users write the same regular expression. For example, it is common to see singleton character classes used instead of escaping—[.] instead of —or alternations instead of character classes—a|b|c|d instead of [a-d]. The parser takes special care to use the most efficient form for these, so that [.] is still a single literal character and a|b|c|d is still a character class. It applies these simplifications during parsing, rather than in a second pass, to avoid a larger-than-necessary intermediate memory footprint.

The RE2 compiler has one interesting twist, which I learned from Thompson's grep. It compiles UTF-8 character classes down to an automaton that reads the input one byte at a time. In other words, the UTF-8 decoding is built into the automaton.

Hyperscan: A Fast Multi-pattern Regex Matcher for Modern CPUs

Blog post precis of the hyperscan paper. Geoff Langdale also posted a note on updating the teddy algorithm.

Implementations

Re2

Alternative to backtracking implementations of regular expressions.

Regex Crate

Rust regex crate. Design-wise, it uses many ideas from Re2. At one point in the past, it used macros for compile-time creation of efficient code for regexes, but removed that behavior. Also incorporates the teddy algorithm for SIMD substring matching derived from Hyperscan.

Hyperscan

High-performance regular expression matching library

the open source version is x86 only.

Surveying Various Languages' String Search

How various languages' standard libraries do string searches. Java's String.indexOf was the only surveyed implementation that uses the naive O(nm) implementation. However, as of Java 9, this is a Hotspot intrinsic, which means that on supported architectures, Hotspot will replace the Java code with an optimized assembly routine. My implementations of Horspool/BNDM in Java were slower than this routine.

Damn Cool Algorithms: Levenshtein Automata

You can construct finite automata to find strings with a given levenshtein distance from an initial string.

Heap

SIMD-Friendly Algorithms for Substring Searching

Recent work, no idea how it compares to other work.

Practical and Optimal String Matching

Variant of Shift-Or that is supposedly faster than BNDM. (2006)

The Treacherous Optimization

An optimization in grep's Boyer-Moore implementation that improves best case search time at the expense of worst case search time. The optimization works by unrolling the loop to do three additions to the index between checks that there was a partial match. In the cases where there is a partial match, each check adds 0 to the index, temporarily stalling.

Regex-Benchmark

A multi-language regex benchmark. Haven't looked much at details--though there are nits to pick in the Java benchmark, for instance.

Suffix

A suffix array construction algorithm with unicode support

A Comparison of Regex Engines

2017 benchmark of Onigumura, Re2, Tre, PCRE2, Hyperscan and Rust's regex crate. One fact it shows is that it has examples where DFAs will struggle, and backtracking implementations will work well. The Regex crate maintains a set of benchmarks, but I don't know if they live anywhere public.