How to Implement

Regular Expressions

in functional JavaScript

using Derivatives

If you’re like most programmers working today, you’ve probably written a few regular expressions in your time: short (or possibly long), almost inscrutable mixtures of letters and symbols which let you check if a string looks like you expect it to or not.

But have you ever given much thought to how you might actually write a program which matches regular expressions? Regular expressions are the most successful form of declarative programming there is, but underneath, computers are still imperative beasts. There must be some way of matching them in imperative code!

This article will show you the theory behind one way of implementing regular expressions, then walk through building a toy regular expression engine in JavaScript in a more-or-less purely functional style. We’ll use JavaScript because it’s a nice lingua franca with syntax familiar to most programmers, although for programs like ours which do string manipulation it does have a few gotchas which you can read about by unfolding the next little section.

At the end there will be some exercises for you to work through yourself, and a list of further reading with articles which review some other approaches to implementing the same features.

This article assumes you’re familiar with using regular expressions in code, but not that you’re deeply familiar with the history or theory behind them. If you’re interested in that, unfold this short history.

Just to make sure we’re all on the same page, here’s the terminology I’ll be using to talk about different parts of regexp syntax:

ab concatenation of the two regexps a and b
a|b alternation of the two alternatives a and b
a* repetition (specifically, repetition at least zero times)
a? optionality of a; a is optional

I mention this because you might have got used to thinking of | as being ‘or’ and * as being ‘any number’ or some other words. These are the terms I’ll be using below, so you’ll need to understand them before moving on.

Our toy regular expression system will implement grouping, alternation, and repetition zero or more times only, alongside basic string search for a literal character and a wildcard character. Believe it or not, with the exception of the zero-width assertions, with only these features it’s possible to match any POSIX regular expression (or indeed any PCRE regular expression that doesn’t use backreferences) after applying the appropriate transformation.

Perl/POSIX syntax Equivalent
[abc] (character class) alternation of every character in the class (e.g. (a|b|c))
a? (optionality) alternation of a with the empty string (i.e. (a|))
a+ (a repeated one or more times) aa*
a{m,} (a repeated at least m times) a literally repeated m times, followed by a*
a{m,n} (a repeated between m and n times) a literally repeated m times, followed by a? literally repeated n − m times
a{,n} (a repeated between m and n times) a? literally repeated n times
mode i (case insensitive match) replace every character in the regexp with the alternation of all possible case folds (a with (a|A), σ with (σ|ς|Σ), etc.)

Note that these aren’t necessarily the most efficient ways of implementing these features — character classes especially will need their own implementation if you’re going to implement negated classes or efficiently handle ranges. But on a theoretical level, the primitive syntax features are sufficient so they’re all we’ll implement for our toy. (exercise)

Note I said we’re also not going to implement zero-width assertions. Because they’re zero-width, they complicate a process which can otherwise assume that each matching token in the regular expression corresponds to advancing the input string by one character, so it’s easier just to ignore them for teaching purposes. (exercise)

We’re also not going to implement a parser for POSIX regular expression syntax. Instead we’ll design a friendly JavaScript notation for regexps. (exercise) These we’ll transform into a structure that’s easier to match against.

Two more things that will be left as exercises. The first is finding whether a particular regexp matches a substring of an input string. The matcher we’ll build will only be able to tell us if an entire string matches an entire regexp. (exercise)

Lastly, we’re not going to implement submatch extraction — the feature that, in POSIX syntax, lets you put (parentheses) around part of a regexp and later pull out only the part of the input string which matched that part of the expression. (exercise)

Starting at the bottom

In defiance of conventional software design wisdom which favours top-down design, we’ll start our implementation at the bottom and progressively implement more layers of abstraction, so that we can see our regexp engine being built up from nothing and get useful, working behaviour at every stage. The lowest layer of our system is the structure which our input regexp will be transformed into.

Theory

Our regular expression engine will work by iteratively computing the derivative of the regular expression with respect to the input string. The derivative of a regular expression is a modified version of that regular expression after it has matched a single character. For example,

Step (Remaining) input (Derivative) regexp
0 abd a(b|c)d
... derive a(b|c)d with respect to the character a ...
1 bd (b|c)d
... derive (b|c)d with respect to the character b ...
2 d d
... derive d with respect to the character d ...
3 (empty) (empty)
Since both the input string and the derivative regexp are empty, the regexp matches!

Taking the derivative of a regexp with respect to a character that doesn’t start the match of that regexp produces a special null regexp which doesn’t match anything. For instance, trying the same regexp against the different input string ‘aed’ proceeds like this:

Step (Remaining) input (Derivative) regexp
0 aed a(b|c)d
... derive a(b|c)d with respect to the character a ...
1 ed (b|c)d
... derive (b|c)d with respect to the character e ...
2 d (never matches)
Since the derivative regexp will never match anything, we can give up and report failure.

Notice how we move over the input string one character at a time, and each character in the input string corresponds to one character in the regular expression. We will thus represent patterns we want to match against as a list of characters – specifically, as a singly-linked list where some of the nodes have extra outward pointers to represent alternation. There’s a special list entry for the empty string regexp returned at the very end, which is (theoretically) guaranteed to match anything and represents a completed match. The example regexp above can be represented by the following linked list:

It works fine if one of the options in the alternation is longer than the others:

Step (Remaining) input (Derivative) regexp
0 abbd a(bb|c)d
... derive a(bb|c)d with respect to the character a ...
1 bbd (bb|c)d
... derive (bb|c)d with respect to the character b ...
2 bd bd
Matching proceeds as normal as above.

And even when the two alternations both successfully derive with respect to a particular character — we just return another alternation from the derivation process.

Step (Remaining) input (Derivative) regexp
0 abcd a(bb|bc)d
... derive a(bb|c)d with respect to the character a ...
1 bcd (bb|bc)d
... derive (bb|bc)d with respect to the character b ...
2 cd (b|c)d
... derive (b|c)d with respect to the character c ...
3 d d
... derive d with respect to the character d ...
4 (empty) (empty)

In the case that an alternation group is the first thing in the regexp, we simply derive against each of the alternations in turn and return the derivatives of those which matched successfully.

Practice

Instead of alternations existing only as other nodes with multiple outward pointers as in the theory, in practice we’ll have a special alternation node in our tree which deals with deriving against multiple match possibilities. You could in theory handle that elsewhere and return a list of derivatives from every derive operation, but doing it this way lets us centralize some cleanup and operation routines within this bottom layer of abstraction, and we’ll need different node types to deal with repetition later anyway.

To begin with we’ll create a node superclass from which our other two types of nodes — characters and alternations — will descend. We’ll stub out a derive method which most subclasses will override.

class RegexpNode {
    derive (char) {
        return NeverMatches;
    }
}

We’ll also define the NeverMatches object as seen above (which represents an unsuccessful match), and the EmptyString (which represents a successful match) as singleton direct instances of the RegexpNode class.

const EmptyString = new RegexpNode();
const NeverMatches = new RegexpNode();

Then we need to define the two node types that actually do the matching. The CharacterNode class is the more basic one, and contains the essence of the ‘linked list’-esque structure that we traverse while matching:

class CharacterNode extends RegexpNode {
    constructor (char, next) {
        super(); // JavaScript makes us call this on all our subclasses even though it doesn't do anything
        this.char = char;
        this.next = next;
    }

    derive (char) {
        if (char === this.char) {
            return this.next;
        } else {
            return NeverMatches;
        }
    }
}

Hopefully that’s clear — basically, the derivative of a character node is the next node in the list if the character it’s being derived against is a match, otherwise it’s the null (NeverMatches) regexp.

The alternation class is slightly more complex. Let’s take it a bit at a time.

class AlternationNode extends RegexpNode {
    constructor (alternatives) {
        super();
        let _alternatives = alternatives.filter((alt) => alt !== NeverMatches);

First thing we do during the constructor is filter out all the alternatives which are the null regexp. It might seem strange that someone would give NeverMatches as an alternative, but remember from the theory that we’ll sometimes generate an AlternationNode as the derivative of another AlternationNode. Handling this case here lets us simplify the derive implementation below, and helps ensure that we never ever generate an invalid alternation node.

        if (_alternatives.length === 0) {
            return NeverMatches;
        } else if (_alternatives.length === 1) {
            return _alternatives[0];

Having filtered the list of alternatives that were passed to the constructor, we now need to do some sanity checking on the filtered list. If there were no valid alternatives left at the end, we must have derived an alternation which will never match. (For example, when we match the string ‘a’ against the regexp b|c.) If there was only one left, we can just skip the AlternationNode structure altogether (like above when we matched ‘bbd’ against (bb|c)d to give bd).

To handle these cases we can use the nifty ES6 class feature where you can return an object of a different type while constructing a class.

        } else {
            this.alternatives = _alternatives;
        }
        return this;
    }

Lastly, if there were still more than two valid alternatives, we construct a normal instance of the class. Notice how there is no next here, unlike on the character nodes. The character nodes themselves contain the pointers to whatever comes after the group containing the alternation.

    derive (char) {
        return new AlternationNode(this.alternatives.map((alt) => alt.derive(char)));
    }
}

Lastly, because of the sanity-checking we did in the constructor above, we can just define the derivative of an alternation very simply as the alternation of the derivatives of all the alternatives.

We don’t have repetition yet, but we can already match real regexps with this, provided we’re willing to do some minor fiddling around to compile the regexps ourselves. There’s actually a name for the kind of language we’ve just implemented — a star-free language, meaning a regexp that doesn’t use the * quantifier and is thus guaranteed to terminate matching. Here’s how we can create and test against the regexp a(b|c)d we saw earlier:

let commonTail = new CharacterNode('d', EmptyString),
    alternation = new AlternationNode([
        new CharacterNode('b', commonTail),
        new CharacterNode('c', commonTail)
    ]),
    head = new CharacterNode('a', alternation);

head.derive('a'); //=> the AlternationNode in the middle
head.derive('a').derive('b'); //=> the CharacterNode 'd'
head.derive('a').derive('e'); //=> NeverMatches
head.derive('a').derive('b').derive('d'); //=> EmptyString; match complete

Even if you can immediately see how that works, I hope you’ll agree that it’s a bit of a faff to set up your regexp like that and to match character-by-character! We’ll move on to a higher-level API later. Before we do, we’ll just quickly look at two more node types: the AnyCharacterNode and the RepetitionNode.

The AnyCharacterNode, equivalent to ‘.’ in classic regular expression syntax, is like the CharacterNode but simpler, since it doesn’t match a particular character but rather any character. It just advances the state of the match by one.

class AnyCharacterNode extends RegexpNode {
    constructor (next) {
        super();
        this.next = next;
    }

    derive (char) { return this.next; }
}

Lastly, let’s take a quick dip back into theory to understand how we’ll implement the RepetitionNode. Suppose we have the regexp (abc)*d and we want to find the derivative with respect to the character ‘a’. The resulting derivative should match strings of the forms ‘bc’ as well as ‘bcabcabcabc …’, so the answer is bc(abc)*d. To understand how we’ll represent this, let’s quickly dip back into theory and look at the diagram for this regexp, this time with an explicit ‘start’ node to make things clearer.

The repetition has two inward pointers: one from the start node, and one from the end of the repetition itself back to the start. At both points it’s possible to either go to node ‘a’ or to node ‘d’, either skipping over the repetition entirely or breaking out of it.

Like any cyclical linked list, we’ll need to use side-effects to construct this node type. The RepetitionNode will always have two outward pointers: one to the ‘head’ of the repetition (a in the example above), as well as one to the next node after the repetition as usual. The same RepetitionNode will appear both at the start and end of the repetition to handle both skipping over and breaking out of the loop.

class RepetitionNode extends RegexpNode {
    constructor (next) {
        super();
        // head will be set properly later by modification
        this.head = NeverMatches;
        this.next = next;
    }

    derive (char) {
        return new AlternationNode([
            this.head.derive(char),
            this.next.derive(char),
        ]);
    }
}

Let’s just test it with a manually-constructed ‘compiled’ regexp before we move on to making a friendlier API.

let tail = new CharacterNode('d', EmptyString),
    repetition = new RepetitionNode(tail),
    repetitionBody = new CharacterNode('a', new CharacterNode('b', new CharacterNode('c', repetition)));

// this is the side-effectual part I mentioned which sets up the cycle
repetition.head = repetitionBody;

repetition.derive('a'); //=> the CharacterNode b
repetition.derive('d'); //=> EmptyString; match complete
let repeatedOnce = repetition.derive('a').derive('b').derive('c'); // => the same RepetitionNode again
repeatedOnce.derive('a') // => back to b
repeatedOnce.derive('d') // => EmptyString again

Hey, it works! Now let’s work on making a nicer higher-level function interface so that we can actually use this without sticking all those nodes together manually.

A regular expression compiler (sans parser)

What’s a nice way to write regular expressions in JavaScript syntax? Let’s break out of the way of thinking about regular expressions that traps us in their Unix-style syntax as strings and design a way to build regexps using JavaScript values and function calls. I think the following is quite nice:

"abc" (a literal string) matches the characters in the string
[a, b, c, ...] (an ordinary array) concatenation of the regexps in the array
Or([a, b, c, ...]) (the function Or) alternation of the regexps in the array
ZeroOrMore(re) (the function ZeroOrMore) allow repeating the given regexp a number of times
Any (the variable Any) match any single character

Note that, for convenience and cleanliness, I forgo the new instantiator for Or and ZeroOrMore nodes. First let’s create the data types, functions and objects we need to use to make this work:

class _Or {
    constructor (alternatives) {
        this.alternatives = alternatives;
    }
}
function Or(alternatives) {
    if (!(alternatives instanceof Array)) {
        throw new TypeError("alternatives passed to Or must be an Array");
    } else {
        return new _Or(alternatives);
    }
}

class _ZeroOrMore {
    constructor (repeatable) {
        this.repeatable = repeatable;
    }
}
function ZeroOrMore(repeatable) {
    return new _ZeroOrMore(repeatable);
}

const Any = Symbol('Any');

… okay, so how do we actually turn our prettier syntax into the nodes we created above? Well, recall that our compiled regular expression structure resembles a linked list, with alternatives. The most efficient way to build up a linked list is from right to left, so we have to iterate over our input in reverse order. (Unfortunately JavaScript insists on making this difficult.) We’ll assume in the next few examples that we already have a generic compile which we don’t actually write until the very end.

We’ll start by writing a compiler that turns literal strings into chains of character nodes. This is quite simple.

function compileString(str, tail=EmptyString) {
    // the following is, as far as I can tell, the only way to reverse the codepoints of a string in JavaScript
    let reversedStr = Array.from(str).reverse();
    for (let char of reversedStr) {
        tail = new CharacterNode(char, tail);
    }
    return tail;
}

On each iteration, we construct a new character node for the current character, having the previous iteration’s character node (i.e. the next character in the string, since we’re going backwards) as its ‘next’ pointer.

The compiler for arrays is very similar, except we recursively dispatch back to compile for each sub-expression.

function compileArray(arr, tail=EmptyString) {
    for (let expr of arr.reverse()) {
        tail = compile(expr, tail);
    }
    return tail;
}

Compiling an Or expression is slightly different in that we only construct a single alternation node in the end, each of which shares the same tail instead of being iteratively replaced. We don’t need to go backwards this time, and we’ll simply use .map to construct all the alternatives to pass to the AlternationNode.

function compileOr(or, tail=EmptyString) {
    return new AlternationNode(or.alternatives.map((alternative) => compile(alternative, tail)));
}

Because of the aforementioned side-effectual nature of constructing a RepetitionNode, the ZeroOrMore compiler has to perform a little switcheroo on the other compilation functions, telling them to compile with a RepetitionNode tail that isn’t actually valid yet. Then it makes the node valid by adding a head pointer, once the node’s contents are compiled.

function compileZeroOrMore(zeroOrMore, tail=EmptyString) {
    let repetition = new RepetitionNode(tail),
        contents = compile(zeroOrMore.repeatable, repetition);
    repetition.head = contents;
    return repetition;
}

Which only leaves the utterly trivial compilation of Any with respect to a particular tail.

function compileAny(tail=EmptyString) {
    return new AnyCharacterNode(tail);
}

Lastly, let’s implement that compile function we’ve been recursing into so much.

function compile(expr, tail=EmptyString) {
    if ((typeof expr) === 'string') {
        return compileString(expr, tail);
    } else if (expr instanceof Array) {
        return compileArray(expr, tail);
    } else if (expr instanceof _Or) {
        return compileOr(expr, tail);
    } else if (expr instanceof _ZeroOrMore) {
        return compileZeroOrMore(expr, tail);
    } else if (expr === Any) {
        return compileAny(tail);
    } else {
        throw new TypeError("tried to compile something that's not a valid regexp datum");
    }
}

Next we’ll create a wrapper class RE for our various compiled node types, which will run the compiler. (We won’t call it Regexp because JavaScript’s built-in regular expression class is called RegExp, and having two class names distinguished only by capitalization is a bad idea.) You’ll instantiate this class as, for instance, new RE(["hello ", Or(["world", "dpk"])]);. We’ll also create a simple function that iterates over the characters of a string and reports true if the regexp and the string were exhausted at the same time.

class RE {
    constructor (regexp) {
        this.start = compile(regexp);
    }

    match (str) {
        let state = this.start,
            chars = Array.from(str);
        if ((chars.length === 0) && (state === EmptyString)) { return true; }
        for (let i = 0; i < chars.length; i++) {
            let char = chars[i];
            state = state.derive(char);
            if (state === EmptyString) {
                // reached the end of the regexp ...
                if (i === (chars.length - 1)) {
                    // if this is the end of the string too, it's a match
                    return true;
                } else {
                    // otherwise there's extra stuff in the string that isn't in the regexp, so it's not a perfect match
                    return false;
                }
            } else if (state === NeverMatches) {
                return false;
            }
        }
        // we must have run out of characters before reaching the end of the regexp, so it's not a match
        return false;
    }
}

Now let’s give it a go!

new RE(["a", Or(["a", "b"]), "d"]).match("abd"); //=> true
new RE(["a", Or(["a", "b"]), "d"]).match("aed"); //=> false
new RE([ZeroOrMore("abc"), "d"]).match("d"); //=> true

It seems to work!

Hmm, except …

new RE([ZeroOrMore("abc")]).match("abc"); //=> false?
new RE([ZeroOrMore("abc")]).match("abcabc"); //=> false ... wtf?
new RE([ZeroOrMore("abc")]).start.derive('a').derive('b').derive('c'); //=> ahh, of course, we're back at the repetition node

There’s a bug in our match function! Because a repetition always comes back to itself at the end, a regexp ending in one will never have a pure EmptyString as its final return value. We could check for this within the match function, special-casing it when the state is a RepetitionNode. But I found the match function pretty unwieldy anyway, what with all those nested if statements and so on … and besides, in the future we might want more node types which might have the same problem as RepetitionNode.

Instead we’ll go back down to the nodes and implement a new method on each of them which will tell them when we’ve matched.

Back down the ladder of abstraction

This is actually pretty easy, though we do have to change how EmptyString is implemented.

class RegexpNode {
    derive (char) {
        return NeverMatches;
    }
    matchEnd() { return false; }
    canMatchMore() { return !this.matchEnd(); }
}

class _EmptyString extends RegexpNode {
    matchEnd() { return true; }
}
const EmptyString = new _EmptyString();

The function matchEnd tells us if we can stop here; the function canMatchMore tells us if we must stop here. In practice, no function returning false from matchEnd can ever return false from canMatchMore, but the opposite might be true as we’ll see below.

For the basic node types CharacterNode and AnyCharacterNode we can just inherit the behaviour of RegexpNode, since they’ll always be able to match more. But AlternationNode is a bit more subtle. As long as it has at least one alternative that is at a matchEnd, it too is at a possible matchEnd. But also, as long as it has at least one alternative that canMatchMore, it also canMatchMore.

class AlternationNode extends RegexpNode {
    // ...
    matchEnd() { return this.alternatives.some((alt) => alt.matchEnd()); }
    canMatchMore() { return this.alternatives.some((alt) => alt.canMatchMore()); }
}

Similar logic applies for the RepetitionNode, but in practice the head of the repetition is always guaranteed not to be at the matchEnd; the next of it might be, though. So we can just call out to the repetition node’s next pointer’s matchEnd method, and return true from canMatchMore.

class RepetitionNode extends RegexpNode {
    // ...
    matchEnd() { return this.next.matchEnd(); }
    canMatchMore() { return true; }
}

Okay, now we’re ready to fix our higher-level match function to use the new node features.

Back up to the higher-level API

Our new match function looks like this:

class RE {
    // ...
    match (str) {
        let state = this.start,
            chars = Array.from(str);
        if ((chars.length === 0) && state.matchEnd()) { return true; }
        for (let i = 0; i < chars.length; i++) {
            let char = chars[i];
            state = state.derive(char);
            if (state.matchEnd() && (i === (chars.length - 1))) {
                // end of the regexp and of the string, successful match
                return true;
            } else if (state.matchEnd() && !state.canMatchMore()) {
                // end of the regexp but not of the string, doesn't match
                return false;
            }
        }
        // end of the string but not of the regexp, doesn't match
        return false;
    }
}

Does it work?

new RE(ZeroOrMore('abc')).match('abcabc'); //=> true
new RE([ZeroOrMore("abc"), "d"]).match("d"); //=> true
new RE(["a", Or(["a", "b"]), "d"]).match("abd"); //=> true
new RE(["a", Or(["a", "b"]), "d"]).match("aed"); //=> false

// because we're dealing with normal JS objects we can even define impromptu character classes ...
let Digit = Or(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']),
    usPhoneNumber = new RE([Or([['(', Digit, Digit, Digit, ') '], '']), Digit, Digit, Digit, '-', Digit, Digit, Digit, Digit]);;

usPhoneNumber.match('(415) 555-1212'); //=> true
usPhoneNumber.match('555-1212'); //=> true
usPhoneNumber.match('squirrel'); //=> false

Yes, it does!

Conclusion

In this article you’ve seen how to implement a small regular expression engine in purely-functional style with JavaScript. The completed implementation is just shy of 150 lines of code, excluding whitespace and comments. If you exclude brace lines too, it’s just 111 lines.

I stress that this is only one way to implement regexps. Most implementations you’ve probably ever used don’t actually work this way — instead they use implementations which try each alternative at a given match point in turn then backtrack if they got it wrong. As you can imagine, this can make them quite slow for some regular expressions.

Exercises for the reader

The marks given here are based on the difficulty scoring system used by Donald Knuth, although I don’t score for creativity separate to time taken. An ugly, obviously hacked-together solution will score low marks, no matter how long it took you; a creative and elegant piece of code that fits well with the rest of the library will score highly. Any solution which uses JavaScript’s built-in regexp capability in any way gets 0 marks — the point here is to learn to do it yourself.

The exercises are intended to be completed in order, although most are fairly independent of one another.

  1. Write a paragraph or two to briefly explain a possible optimization of the regexp matcher as it’s given above. [10 marks]
  2. Add a method to the RE class which returns true or false if only part of a string matches the given regexp. [10 marks]
  3. For five marks extra on the previous exercise, make another method which reports the start and end positions (as codepoint positions, not code units) within the string at which the start and end were found. Efficiency is no object for either of these two exercises. [+5 marks]
  4. Implement a case-insensitive matching or compilation mode. (It is okay for the case insensitivity only to work on ASCII characters.) [15 marks]
  5. Implement some of the POSIX features as more convenient syntax items in our input structure. Specifically, implement at-least-one repetition, between-n-and-m repetition (including m = ∞), optionality, and character classes (including ranges and negation; each type of character class can be a separate API function and a separate node type if you like). [20 marks]
  6. Implement zero-width assertions for the start and end of the line or the whole text. Then add a third zero-width assertion equivalent to POSIX’s \b, matching the beginning or end of a word (you can consider word characters to only consist of A–Z, a–z, 0–9, and underscore if you like). (Hint: you’ll need to modify what the derive function takes as input and/or what it returns, and adjust the higher-level matching functions as appropriate.) [20 marks]
  7. Add a method which makes it possible to extract submatch groups from the regular expression. [25 marks]
  8. Create a hand-hacked parser for Plan 9 or POSIX extended regular expressions that turns them into the JavaScript syntax objects which previously had to be created manually. (The engine would thus transform the input string to syntax objects, and then to the derivable multi-branch linked list form.) Syntax guides: Plan 9; POSIX (you don’t have to parse the [:…:] character classes or implement the \n backreference syntax). [35 marks]
  9. Port the entire regular expression engine to another programming language of your choice. [50 marks]

As a trial to see how much interest I get, you can send your solutions to each of these problems to me and I’ll give you feedback and a mark. If you’re having trouble, you can ask for help in #dregs on Freenode. Please only email me completed solutions — use the IRC channel exclusively to get help. Also, I may not actually be able to mark the last exercise constructively because (shock-horror) I don’t know every single programming language. The total excluding the last exercise is marks.

Further reading

I first started learning about implementing regular expressions from Russ Cox’s excellent series on NFA- and DFA-based implementations. But though I understood in the abstract what was going on with the automata, the explosion of states from NFA to DFA still confused the heck out of me in practice. I never intuitively grokked this implementation strategy — it always felt to me like there was some deep magic happening which was difficult to inspect and understand.

Compared to DFAs, derivatives are easier to understand and implement, and much quicker to compile, but probably quite a bit slower to match: where m is the length of the input string and n is the number of nodes in the regular expression, derivative-based implementations compile in time O(n) but in the worst case will, I believe, match in O(mn) (though that’s for pathological input regexps, and can be mitigated by clever implementation techniques — typical regexps will be closer to O(m)); DFA-based implementations require O(2n) time to compile but will always match in O(m). Both of these numbers are much nicer than typical backtracking implementations like Perl, PCRE, Python, and Ruby regexps, which tend to be more like O(n) to compile and O(2m) to match, in the worst case. Derivatives can also carry around small bits of state with them which can help when implementing limited repetition without excessive memory usage, for example — and with a little bit of laziness, it’s possible for derivative-based parsing engines to handle any context-free grammar.

I first learned how to implement regexps with derivatives thanks to Matt Might’s blog post, which is quite a bit shorter than my guide here. Might implemented his matcher in Scheme and thus gets away without a compilation step, because Scheme has convenient notation and procedures for linked lists built in.

Embarassingly, I never read Brzozozwksi’s original regexp derivatives paper until I’d already understood and implemented the concept myself. There’s also the Owens–Reppy–Turon paper which revitalized the idea.

I’m implementing a production-ready derivative-based regexp library called Dregs. When it’s ready, I’ll post the link here.