How to Implement

Regular Expressions

in functional JavaScript

using Derivatives

If you’re like most programmers working today, you’ve probably written a few regular expressions in your time: short (or possibly long), almost inscrutable mixtures of letters and symbols which let you check if a string looks like you expect it to or not.

But have you ever given much thought to how you might actually write a program which matches regular expressions? Regular expressions are the most successful form of declarative programming there is, but underneath, computers are still imperative beasts. There must be some way of matching them in imperative code!

This article will show you the theory behind one way of implementing regular expressions, then walk through building a toy regular expression engine in JavaScript in a more-or-less purely functional style. We’ll use JavaScript because it’s a nice lingua franca with syntax familiar to most programmers, although for programs like ours which do string manipulation it does have a few gotchas which you can read about by unfolding the next little section.

Nota bene: On string representation in JavaScript

You may be aware that JavaScript strings are arrays of the most useless kind of character encoding unit possible: UTF-16 code units. In a way even EBCDIC was better, because at least there you knew you were working with something exotic. UTF-16 units appear to work until you encounter an emoji, or a Chinese person using a word in that’s in Unicode’s Supplementary Ideographic Plane, then suddenly your code is dealing with only part of a character at once and string indexes are mysteriously out by one from where you expect them to be.

The ECMAscript people made an attempt to improve this in ES6 by adding some methods to the string type to let you access proper codepoints instead of 16-bit code units. Unfortunately they kind of half-arsed it, providing methods to iterate over the codepoints of a string, and to index a string by codepoint, but no codepoint-based equivalent of substr or length.

Nonetheless the implementation here attempts to use codepoints and codepoint indexing whenever possible, in the hope that one day, these gaps will be filled and JS implementations will start to adjust to a world in which codepoint access is the norm rather than the exception. (Or, in a crazy dream world, where extended grapheme cluster access is the norm. Nah, too crazy, I know …)

At the end there will be some exercises for you to work through yourself, and a list of further reading with articles which review some other approaches to implementing the same features.

This article assumes you’re familiar with using regular expressions in code, but not that you’re deeply familiar with the history or theory behind them. If you’re interested in that, unfold this short history.

Some info on the history and theory of regular expressions …

Fair warning: I’m not an expert on the history of regexps, so take some of the following with a pinch of salt. (The history could really do with being looked into in-depth by someone …)

Regular expressions, also called regular grammars, were invented in around 1951 by mathematician Stephen Kleene (ˈkleɪni) as a way to formally describe a set of strings that were within a ‘regular language’. Later on, famed structural linguist and political wonk Noam Chomsky realized that every regular language was also a context-free language, and thus placed regular grammars at the bottom of his Chomsky hierarchy which ranks types of formal languages by their complexity. Regular expressions are, in the strict sense, a mathematical idea: in CS jargon, regular languages are the set of languages which can be parsed with no more than a finite state automaton; in slightly simpler language, a string can always be tested against a regular language by making no more than a single pass over the characters of the string, and using no more memory than the string itself.

The regular expressions that programmers know and love today appeared first in Unix in 1969. The inventors of Unix created a way to write mathematical regular expressions in ASCII and implemented programs which could match against them. As Unix got more complicated, so did regular expression syntax, and application authors sometimes implemented their own varieties with additional features and altered syntax. Not for nothing did Donald Knuth once define Unix as ‘30 definitions of regular expressions living under one roof’. Worse, Unix’s standard library itself eventually grew two different incompatible regular expression syntaxes, basic and extended, which was eventually copied into POSIX and spread to any OS which used the C standard library.

By far the biggest expansion of the capabilities of regular expressions came when Larry Wall invented Perl. Perl added a tonne of new features to regexps while trying not to break syntactic compability with Unix’s standard extended regular expressions. As a result, the syntax got more confusing, and with the addition of backreferences and arbitrary zero-width assertions and recursive groups, Perl’s regular expressions actually went beyond what is formally allowed by the strict mathematical definition of a regular grammar. Nonetheless third-party implementations of Perl’s special non-regular regular expressions started appearing, and regexp users began to expect that regular expressions would be Perl-like.

Regular expressions got more complicated and started being widely used to parse languages which were not and could never be regular. And having moved beyond matching only regular languages, regular expression engines became deadly performance traps due to the inefficient implementation styles they were forced to adopt to support these new features.

These problems did provoke a minor backlash and an attempt to improve things. Firstly, Larry Wall himself, realizing the monster he unleashed, atoned for his sins and announced that Perl 6 would completely revamp regular expressions to fix the mistakes he had made in trying to remain compatible with POSIX regexps.

Perl 6 improves on old-school regexps by embracing the fact that they’re no longer regular, instead choosing to totally revamp the syntax and integrate them closely with an actual parsing system with a sane API for doing actual parsing jobs.

Meanwhile the creators of Unix were making a new operating system called Plan 9 and decided to take a different approach to fixing people’s broken idea of regexp matching. Plan 9 took the opposite approach, deliberately taking the syntax almost all the way back to version 1 Unix’s regexps. You get ‘.’ to match any character, brackets for grouping, basic character classes, repetition, alternation, optionality, and that’s it. Nonetheless, it’s still possible to match any regular language with this regexp — they’re exactly as powerful as POSIX extended REs and even Perl regexps (excluding the aforementioned backreferences, lookarounds, and recursive subpatterns).

Latterly, after Plan 9 failed to take off as an operating system, one of its developers created re2, an implementation of most of Perl regexps but without the features which made Perl regexps irregular, meaning they could once again be matched without that dastardly exponential-time worst case.

Just to make sure we’re all on the same page, here’s the terminology I’ll be using to talk about different parts of regexp syntax:

`ab`	concatenation of the two regexps `a` and `b`
`a\|b`	alternation of the two alternatives `a` and `b`
`a*`	repetition (specifically, repetition at least zero times)
`a?`	optionality of `a`; `a` is optional

I mention this because you might have got used to thinking of | as being ‘or’ and * as being ‘any number’ or some other words. These are the terms I’ll be using below, so you’ll need to understand them before moving on.

Our toy regular expression system will implement grouping, alternation, and repetition zero or more times only, alongside basic string search for a literal character and a wildcard character. Believe it or not, with the exception of the zero-width assertions, with only these features it’s possible to match any POSIX regular expression (or indeed any PCRE regular expression that doesn’t use backreferences) after applying the appropriate transformation.

Perl/POSIX syntax	Equivalent
`[abc]`	(character class) alternation of every character in the class (e.g. `(a\|b\|c)`)
`a?`	(optionality) alternation of `a` with the empty string (i.e. `(a\|)`)
`a+`	(`a` repeated one or more times) `aa*`
`a{m,}`	(`a` repeated at least `m` times) `a` literally repeated `m` times, followed by `a*`
`a{m,n}`	(`a` repeated between `m` and `n` times) `a` literally repeated `m` times, followed by `a?` literally repeated `n` − `m` times
`a{,n}`	(`a` repeated between `m` and `n` times) `a?` literally repeated `n` times
mode `i`	(case insensitive match) replace every character in the regexp with the alternation of all possible case folds (`a` with `(a\|A)`, `σ` with `(σ\|ς\|Σ)`, etc.)

Note that these aren’t necessarily the most efficient ways of implementing these features — character classes especially will need their own implementation if you’re going to implement negated classes or efficiently handle ranges. But on a theoretical level, the primitive syntax features are sufficient so they’re all we’ll implement for our toy. (exercise)

Note I said we’re also not going to implement zero-width assertions. Because they’re zero-width, they complicate a process which can otherwise assume that each matching token in the regular expression corresponds to advancing the input string by one character, so it’s easier just to ignore them for teaching purposes. (exercise)

We’re also not going to implement a parser for POSIX regular expression syntax. Instead we’ll design a friendly JavaScript notation for regexps. (exercise) These we’ll transform into a structure that’s easier to match against.

Two more things that will be left as exercises. The first is finding whether a particular regexp matches a substring of an input string. The matcher we’ll build will only be able to tell us if an entire string matches an entire regexp. (exercise)

Lastly, we’re not going to implement submatch extraction — the feature that, in POSIX syntax, lets you put (parentheses) around part of a regexp and later pull out only the part of the input string which matched that part of the expression. (exercise)

Starting at the bottom

In defiance of conventional software design wisdom which favours top-down design, we’ll start our implementation at the bottom and progressively implement more layers of abstraction, so that we can see our regexp engine being built up from nothing and get useful, working behaviour at every stage. The lowest layer of our system is the structure which our input regexp will be transformed into.

Theory

Our regular expression engine will work by iteratively computing the derivative of the regular expression with respect to the input string. The derivative of a regular expression is a modified version of that regular expression after it has matched a single character. For example,

Step	(Remaining) input	(Derivative) regexp
0	abd	`a(b\|c)d`
... derive `a(b\|c)d` with respect to the character a ...
1	bd	`(b\|c)d`
... derive `(b\|c)d` with respect to the character b ...
2	d	`d`
... derive `d` with respect to the character d ...
3	(empty)	(empty)
Since both the input string and the derivative regexp are empty, the regexp matches!

Taking the derivative of a regexp with respect to a character that doesn’t start the match of that regexp produces a special null regexp which doesn’t match anything. For instance, trying the same regexp against the different input string ‘aed’ proceeds like this:

Step	(Remaining) input	(Derivative) regexp
0	aed	`a(b\|c)d`
... derive `a(b\|c)d` with respect to the character a ...
1	ed	`(b\|c)d`
... derive `(b\|c)d` with respect to the character e ...
2	d	(never matches)
Since the derivative regexp will never match anything, we can give up and report failure.

Notice how we move over the input string one character at a time, and each character in the input string corresponds to one character in the regular expression. We will thus represent patterns we want to match against as a list of characters – specifically, as a singly-linked list where some of the nodes have extra outward pointers to represent alternation. There’s a special list entry for the empty string regexp returned at the very end, which is (theoretically) guaranteed to match anything and represents a completed match. The example regexp above can be represented by the following linked list:

It works fine if one of the options in the alternation is longer than the others:

Step	(Remaining) input	(Derivative) regexp
0	abbd	`a(bb\|c)d`
... derive `a(bb\|c)d` with respect to the character a ...
1	bbd	`(bb\|c)d`
... derive `(bb\|c)d` with respect to the character b ...
2	bd	`bd`
Matching proceeds as normal as above.

And even when the two alternations both successfully derive with respect to a particular character — we just return another alternation from the derivation process.

Step	(Remaining) input	(Derivative) regexp
0	abcd	`a(bb\|bc)d`
... derive `a(bb\|c)d` with respect to the character a ...
1	bcd	`(bb\|bc)d`
... derive `(bb\|bc)d` with respect to the character b ...
2	cd	`(b\|c)d`
... derive `(b\|c)d` with respect to the character c ...
3	d	`d`
... derive `d` with respect to the character d ...
4	(empty)	(empty)

In the case that an alternation group is the first thing in the regexp, we simply derive against each of the alternations in turn and return the derivatives of those which matched successfully.

Practice

Instead of alternations existing only as other nodes with multiple outward pointers as in the theory, in practice we’ll have a special alternation node in our tree which deals with deriving against multiple match possibilities. You could in theory handle that elsewhere and return a list of derivatives from every derive operation, but doing it this way lets us centralize some cleanup and operation routines within this bottom layer of abstraction, and we’ll need different node types to deal with repetition later anyway.

To begin with we’ll create a node superclass from which our other two types of nodes — characters and alternations — will descend. We’ll stub out a derive method which most subclasses will override.

class RegexpNode {
    derive (char) {
        return NeverMatches;
    }
}

We’ll also define the NeverMatches object as seen above (which represents an unsuccessful match), and the EmptyString (which represents a successful match) as singleton direct instances of the RegexpNode class.

const EmptyString = new RegexpNode();
const NeverMatches = new RegexpNode();

Then we need to define the two node types that actually do the matching. The CharacterNode class is the more basic one, and contains the essence of the ‘linked list’-esque structure that we traverse while matching:

class CharacterNode extends RegexpNode {
    constructor (char, next) {
        super(); // JavaScript makes us call this on all our subclasses even though it doesn't do anything
        this.char = char;
        this.next = next;
    }

    derive (char) {
        if (char === this.char) {
            return this.next;
        } else {
            return NeverMatches;
        }
    }
}

Hopefully that’s clear — basically, the derivative of a character node is the next node in the list if the character it’s being derived against is a match, otherwise it’s the null (NeverMatches) regexp.

The alternation class is slightly more complex. Let’s take it a bit at a time.

class AlternationNode extends RegexpNode {
    constructor (alternatives) {
        super();
        let _alternatives = alternatives.filter((alt) => alt !== NeverMatches);

First thing we do during the constructor is filter out all the alternatives which are the null regexp. It might seem strange that someone would give NeverMatches as an alternative, but remember from the theory that we’ll sometimes generate an AlternationNode as the derivative of another AlternationNode. Handling this case here lets us simplify the derive implementation below, and helps ensure that we never ever generate an invalid alternation node.

        if (_alternatives.length === 0) {
            return NeverMatches;
        } else if (_alternatives.length === 1) {
            return _alternatives[0];

Having filtered the list of alternatives that were passed to the constructor, we now need to do some sanity checking on the filtered list. If there were no valid alternatives left at the end, we must have derived an alternation which will never match. (For example, when we match the string ‘a’ against the regexp b|c.) If there was only one left, we can just skip the AlternationNode structure altogether (like above when we matched ‘bbd’ against (bb|c)d to give bd).

To handle these cases we can use the nifty ES6 class feature where you can return an object of a different type while constructing a class.

        } else {
            this.alternatives = _alternatives;
        }
        return this;
    }

Lastly, if there were still more than two valid alternatives, we construct a normal instance of the class. Notice how there is no next here, unlike on the character nodes. The character nodes themselves contain the pointers to whatever comes after the group containing the alternation.

    derive (char) {
        return new AlternationNode(this.alternatives.map((alt) => alt.derive(char)));
    }
}

Lastly, because of the sanity-checking we did in the constructor above, we can just define the derivative of an alternation very simply as the alternation of the derivatives of all the alternatives.

We don’t have repetition yet, but we can already match real regexps with this, provided we’re willing to do some minor fiddling around to compile the regexps ourselves. There’s actually a name for the kind of language we’ve just implemented — a star-free language, meaning a regexp that doesn’t use the * quantifier and is thus guaranteed to terminate matching. Here’s how we can create and test against the regexp a(b|c)d we saw earlier:

let commonTail = new CharacterNode('d', EmptyString),
    alternation = new AlternationNode([
        new CharacterNode('b', commonTail),
        new CharacterNode('c', commonTail)
    ]),
    head = new CharacterNode('a', alternation);

head.derive('a'); //=> the AlternationNode in the middle
head.derive('a').derive('b'); //=> the CharacterNode 'd'
head.derive('a').derive('e'); //=> NeverMatches
head.derive('a').derive('b').derive('d'); //=> EmptyString; match complete

Even if you can immediately see how that works, I hope you’ll agree that it’s a bit of a faff to set up your regexp like that and to match character-by-character! We’ll move on to a higher-level API later. Before we do, we’ll just quickly look at two more node types: the AnyCharacterNode and the RepetitionNode.

The AnyCharacterNode, equivalent to ‘.’ in classic regular expression syntax, is like the CharacterNode but simpler, since it doesn’t match a particular character but rather any character. It just advances the state of the match by one.

class AnyCharacterNode extends RegexpNode {
    constructor (next) {
        super();
        this.next = next;
    }

    derive (char) { return this.next; }
}

Lastly, let’s take a quick dip back into theory to understand how we’ll implement the RepetitionNode. Suppose we have the regexp (abc)*d and we want to find the derivative with respect to the character ‘a’. The resulting derivative should match strings of the forms ‘bc’ as well as ‘bcabcabcabc …’, so the answer is bc(abc)*d. To understand how we’ll represent this, let’s quickly dip back into theory and look at the diagram for this regexp, this time with an explicit ‘start’ node to make things clearer.

The repetition has two inward pointers: one from the start node, and one from the end of the repetition itself back to the start. At both points it’s possible to either go to node ‘a’ or to node ‘d’, either skipping over the repetition entirely or breaking out of it.

Like any cyclical linked list, we’ll need to use side-effects to construct this node type. The RepetitionNode will always have two outward pointers: one to the ‘head’ of the repetition (a in the example above), as well as one to the next node after the repetition as usual. The same RepetitionNode will appear both at the start and end of the repetition to handle both skipping over and breaking out of the loop.

class RepetitionNode extends RegexpNode {
    constructor (next) {
        super();
        // head will be set properly later by modification
        this.head = NeverMatches;
        this.next = next;
    }

    derive (char) {
        return new AlternationNode([
            this.head.derive(char),
            this.next.derive(char),
        ]);
    }
}

Let’s just test it with a manually-constructed ‘compiled’ regexp before we move on to making a friendlier API.

let tail = new CharacterNode('d', EmptyString),
    repetition = new RepetitionNode(tail),
    repetitionBody = new CharacterNode('a', new CharacterNode('b', new CharacterNode('c', repetition)));

// this is the side-effectual part I mentioned which sets up the cycle
repetition.head = repetitionBody;

repetition.derive('a'); //=> the CharacterNode b
repetition.derive('d'); //=> EmptyString; match complete
let repeatedOnce = repetition.derive('a').derive('b').derive('c'); // => the same RepetitionNode again
repeatedOnce.derive('a') // => back to b
repeatedOnce.derive('d') // => EmptyString again

Hey, it works! Now let’s work on making a nicer higher-level function interface so that we can actually use this without sticking all those nodes together manually.

A regular expression compiler (sans parser)

What’s a nice way to write regular expressions in JavaScript syntax? Let’s break out of the way of thinking about regular expressions that traps us in their Unix-style syntax as strings and design a way to build regexps using JavaScript values and function calls. I think the following is quite nice:

`"abc"`	(a literal string) matches the characters in the string
`[a, b, c, ...]`	(an ordinary array) concatenation of the regexps in the array
`Or([a, b, c, ...])`	(the function `Or`) alternation of the regexps in the array
`ZeroOrMore(re)`	(the function `ZeroOrMore`) allow repeating the given regexp a number of times
`Any`	(the variable `Any`) match any single character

Note that, for convenience and cleanliness, I forgo the new instantiator for Or and ZeroOrMore nodes. First let’s create the data types, functions and objects we need to use to make this work:

class _Or {
    constructor (alternatives) {
        this.alternatives = alternatives;
    }
}
function Or(alternatives) {
    if (!(alternatives instanceof Array)) {
        throw new TypeError("alternatives passed to Or must be an Array");
    } else {
        return new _Or(alternatives);
    }
}

class _ZeroOrMore {
    constructor (repeatable) {
        this.repeatable = repeatable;
    }
}
function ZeroOrMore(repeatable) {
    return new _ZeroOrMore(repeatable);
}

const Any = Symbol('Any');

… okay, so how do we actually turn our prettier syntax into the nodes we created above? Well, recall that our compiled regular expression structure resembles a linked list, with alternatives. The most efficient way to build up a linked list is from right to left, so we have to iterate over our input in reverse order. (Unfortunately JavaScript insists on making this difficult.) We’ll assume in the next few examples that we already have a generic compile which we don’t actually write until the very end.

We’ll start by writing a compiler that turns literal strings into chains of character nodes. This is quite simple.

function compileString(str, tail=EmptyString) {
    // the following is, as far as I can tell, the only way to reverse the codepoints of a string in JavaScript
    let reversedStr = Array.from(str).reverse();
    for (let char of reversedStr) {
        tail = new CharacterNode(char, tail);
    }
    return tail;
}

On each iteration, we construct a new character node for the current character, having the previous iteration’s character node (i.e. the next character in the string, since we’re going backwards) as its ‘next’ pointer.

The compiler for arrays is very similar, except we recursively dispatch back to compile for each sub-expression.

function compileArray(arr, tail=EmptyString) {
    for (let expr of arr.reverse()) {
        tail = compile(expr, tail);
    }
    return tail;
}

Compiling an Or expression is slightly different in that we only construct a single alternation node in the end, each of which shares the same tail instead of being iteratively replaced. We don’t need to go backwards this time, and we’ll simply use .map to construct all the alternatives to pass to the AlternationNode.

function compileOr(or, tail=EmptyString) {
    return new AlternationNode(or.alternatives.map((alternative) => compile(alternative, tail)));
}

Because of the aforementioned side-effectual nature of constructing a RepetitionNode, the ZeroOrMore compiler has to perform a little switcheroo on the other compilation functions, telling them to compile with a RepetitionNode tail that isn’t actually valid yet. Then it makes the node valid by adding a head pointer, once the node’s contents are compiled.

function compileZeroOrMore(zeroOrMore, tail=EmptyString) {
    let repetition = new RepetitionNode(tail),
        contents = compile(zeroOrMore.repeatable, repetition);
    repetition.head = contents;
    return repetition;
}

Which only leaves the utterly trivial compilation of Any with respect to a particular tail.

function compileAny(tail=EmptyString) {
    return new AnyCharacterNode(tail);
}

Lastly, let’s implement that compile function we’ve been recursing into so much.

function compile(expr, tail=EmptyString) {
    if ((typeof expr) === 'string') {
        return compileString(expr, tail);
    } else if (expr instanceof Array) {
        return compileArray(expr, tail);
    } else if (expr instanceof _Or) {
        return compileOr(expr, tail);
    } else if (expr instanceof _ZeroOrMore) {
        return compileZeroOrMore(expr, tail);
    } else if (expr === Any) {
        return compileAny(tail);
    } else {
        throw new TypeError("tried to compile something that's not a valid regexp datum");
    }
}

Next we’ll create a wrapper class RE for our various compiled node types, which will run the compiler. (We won’t call it Regexp because JavaScript’s built-in regular expression class is called RegExp, and having two class names distinguished only by capitalization is a bad idea.) You’ll instantiate this class as, for instance, new RE(["hello ", Or(["world", "dpk"])]);. We’ll also create a simple function that iterates over the characters of a string and reports true if the regexp and the string were exhausted at the same time.

class RE {
    constructor (regexp) {
        this.start = compile(regexp);
    }

    match (str) {
        let state = this.start,
            chars = Array.from(str);
        if ((chars.length === 0) && (state === EmptyString)) { return true; }
        for (let i = 0; i < chars.length; i++) {
            let char = chars[i];
            state = state.derive(char);
            if (state === EmptyString) {
                // reached the end of the regexp ...
                if (i === (chars.length - 1)) {
                    // if this is the end of the string too, it's a match
                    return true;
                } else {
                    // otherwise there's extra stuff in the string that isn't in the regexp, so it's not a perfect match
                    return false;
                }
            } else if (state === NeverMatches) {
                return false;
            }
        }
        // we must have run out of characters before reaching the end of the regexp, so it's not a match
        return false;
    }
}

Now let’s give it a go!

new RE(["a", Or(["a", "b"]), "d"]).match("abd"); //=> true
new RE(["a", Or(["a", "b"]), "d"]).match("aed"); //=> false
new RE([ZeroOrMore("abc"), "d"]).match("d"); //=> true

It seems to work!

Hmm, except …

new RE([ZeroOrMore("abc")]).match("abc"); //=> false?
new RE([ZeroOrMore("abc")]).match("abcabc"); //=> false ... wtf?
new RE([ZeroOrMore("abc")]).start.derive('a').derive('b').derive('c'); //=> ahh, of course, we're back at the repetition node

There’s a bug in our match function! Because a repetition always comes back to itself at the end, a regexp ending in one will never have a pure EmptyString as its final return value. We could check for this within the match function, special-casing it when the state is a RepetitionNode. But I found the match function pretty unwieldy anyway, what with all those nested if statements and so on … and besides, in the future we might want more node types which might have the same problem as RepetitionNode.

Instead we’ll go back down to the nodes and implement a new method on each of them which will tell them when we’ve matched.

Back down the ladder of abstraction

This is actually pretty easy, though we do have to change how EmptyString is implemented.

class RegexpNode {
    derive (char) {
        return NeverMatches;
    }
    matchEnd() { return false; }
    canMatchMore() { return !this.matchEnd(); }
}

class _EmptyString extends RegexpNode {
    matchEnd() { return true; }
}
const EmptyString = new _EmptyString();

The function matchEnd tells us if we can stop here; the function canMatchMore tells us if we must stop here. In practice, no function returning false from matchEnd can ever return false from canMatchMore, but the opposite might be true as we’ll see below.

For the basic node types CharacterNode and AnyCharacterNode we can just inherit the behaviour of RegexpNode, since they’ll always be able to match more. But AlternationNode is a bit more subtle. As long as it has at least one alternative that is at a matchEnd, it too is at a possible matchEnd. But also, as long as it has at least one alternative that canMatchMore, it also canMatchMore.

class AlternationNode extends RegexpNode {
    // ...
    matchEnd() { return this.alternatives.some((alt) => alt.matchEnd()); }
    canMatchMore() { return this.alternatives.some((alt) => alt.canMatchMore()); }
}

Similar logic applies for the RepetitionNode, but in practice the head of the repetition is always guaranteed not to be at the matchEnd; the next of it might be, though. So we can just call out to the repetition node’s next pointer’s matchEnd method, and return true from canMatchMore.

class RepetitionNode extends RegexpNode {
    // ...
    matchEnd() { return this.next.matchEnd(); }
    canMatchMore() { return true; }
}

Okay, now we’re ready to fix our higher-level match function to use the new node features.

Back up to the higher-level API

Our new match function looks like this:

class RE {
    // ...
    match (str) {
        let state = this.start,
            chars = Array.from(str);
        if ((chars.length === 0) && state.matchEnd()) { return true; }
        for (let i = 0; i < chars.length; i++) {
            let char = chars[i];
            state = state.derive(char);
            if (state.matchEnd() && (i === (chars.length - 1))) {
                // end of the regexp and of the string, successful match
                return true;
            } else if (state.matchEnd() && !state.canMatchMore()) {
                // end of the regexp but not of the string, doesn't match
                return false;
            }
        }
        // end of the string but not of the regexp, doesn't match
        return false;
    }
}

Does it work?

new RE(ZeroOrMore('abc')).match('abcabc'); //=> true
new RE([ZeroOrMore("abc"), "d"]).match("d"); //=> true
new RE(["a", Or(["a", "b"]), "d"]).match("abd"); //=> true
new RE(["a", Or(["a", "b"]), "d"]).match("aed"); //=> false

// because we're dealing with normal JS objects we can even define impromptu character classes ...
let Digit = Or(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']),
    usPhoneNumber = new RE([Or([['(', Digit, Digit, Digit, ') '], '']), Digit, Digit, Digit, '-', Digit, Digit, Digit, Digit]);;

usPhoneNumber.match('(415) 555-1212'); //=> true
usPhoneNumber.match('555-1212'); //=> true
usPhoneNumber.match('squirrel'); //=> false

Yes, it does!

Conclusion

In this article you’ve seen how to implement a small regular expression engine in purely-functional style with JavaScript. The completed implementation is just shy of 150 lines of code, excluding whitespace and comments. If you exclude brace lines too, it’s just 111 lines.

I stress that this is only one way to implement regexps. Most implementations you’ve probably ever used don’t actually work this way — instead they use implementations which try each alternative at a given match point in turn then backtrack if they got it wrong. As you can imagine, this can make them quite slow for some regular expressions.

Exercises for the reader

The marks given here are based on the difficulty scoring system used by Donald Knuth, although I don’t score for creativity separate to time taken. An ugly, obviously hacked-together solution will score low marks, no matter how long it took you; a creative and elegant piece of code that fits well with the rest of the library will score highly. Any solution which uses JavaScript’s built-in regexp capability in any way gets 0 marks — the point here is to learn to do it yourself.

The exercises are intended to be completed in order, although most are fairly independent of one another.

Write a paragraph or two to briefly explain a possible optimization of the regexp matcher as it’s given above. [10 marks]
Add a method to the RE class which returns true or false if only part of a string matches the given regexp. [10 marks]
For five marks extra on the previous exercise, make another method which reports the start and end positions (as codepoint positions, not code units) within the string at which the start and end were found. Efficiency is no object for either of these two exercises. [+5 marks]
Implement a case-insensitive matching or compilation mode. (It is okay for the case insensitivity only to work on ASCII characters.) [15 marks]
Implement some of the POSIX features as more convenient syntax items in our input structure. Specifically, implement at-least-one repetition, between-n-and-m repetition (including m = ∞), optionality, and character classes (including ranges and negation; each type of character class can be a separate API function and a separate node type if you like). [20 marks]
Implement zero-width assertions for the start and end of the line or the whole text. Then add a third zero-width assertion equivalent to POSIX’s \b, matching the beginning or end of a word (you can consider word characters to only consist of A–Z, a–z, 0–9, and underscore if you like). (Hint: you’ll need to modify what the derive function takes as input and/or what it returns, and adjust the higher-level matching functions as appropriate.) [20 marks]
Add a method which makes it possible to extract submatch groups from the regular expression. [25 marks]
Create a hand-hacked parser for Plan 9 or POSIX extended regular expressions that turns them into the JavaScript syntax objects which previously had to be created manually. (The engine would thus transform the input string to syntax objects, and then to the derivable multi-branch linked list form.) Syntax guides: Plan 9; POSIX (you don’t have to parse the [:…:] character classes or implement the \n backreference syntax). [35 marks]
Port the entire regular expression engine to another programming language of your choice. [50 marks]