Incremental NFAs with monoids

Let’s say you have a large balancing tree of text buffers, each buffer about 8 kilobytes long, so that you can modify the text in logarithmic time instead of having to shift around huge substrings. Now, suppose you want to match a regular expression against this text whenever it changes, perhaps to check for validity or to maintain search results. But regular expressions take linear time, because they have to look at every character in the string before admitting a match. We have a lot of data; this is not good. We do not want to run the regular expression on potentially gigabytes of text every time we change anything.

Regular expressions match regular languages, and therefore can be compiled into a Non-deterministic Finite Automaton (or alternatively a much larger deterministic one). An NFA is a finite state machine in which each state transition, in this case corresponding to a character read by the regex, can result in a number of possible states. Only at the end of the input does this non-determinism collapse into a termination state. For example, suppose we have the regex /.*a.../. When we read some character ‘a’ we don’t know whether this is a part of /.*/ or the /a/ three characters from the end of the input, and we can’t know until we reach the end of the string. As a result we fork the decision, transitioning to both states simultaneously.

/.*a.../ as an NFA

11, 211, 4

Going back to our buffer tree, suppose one of the middle buffers contained only the string “a”. Of course, the state we end up in when we read this depends on which state we were in at the start, and therefore at the end of the previous buffer. But for now let’s ignore that, and work out the possible output states for all possible input states. If we were in state 1 we’re in either state 1 or 2, and for all other states we progress one step through the NFA, with state 5 resulting in failure because we weren’t expecting this extra character ‘a’.

Now let’s suppose the very next buffer contained “bb”. If we were in state 1 we must still be, since we won’t have escaped the closure, and if we were in states 2 or 3 we would have taken two steps to 4 or 5, respectively. These state transition tables are actually monoids: m(a) ⊕ m(bb) = m(abb). If we follow m(a)’s transition (1 → 1, 2) through m(bb)’s (1 → 1; 2 → 4), we end up with (1, 4). This is the composition (or concatenation) of the two transition tables. We can use this monoidal property to match against the text in the tree incrementally, such that each update means recomputing the regular expression match for only the updated buffers.

So we label each buffer in the tree with one of these monoids, which we update with a regular expression match when we update that specific buffer. Each node in the tree also has a monoidal label, which composes its own buffer’s monoid with those of its child nodes: if this node x, with buffer b, has children y and z, then m(x) = m(y) ⊕ m(b) ⊕ m(z). This means that if y were to change its buffer, and thus its monoid, we would only need to re-compose m(x) and its ancestors, not recompute the regular expression match for the entire text. Then, at the root of the tree, we simply look to see what state 1 transitions to, and if it’s a termination state we have a match. (If you want to know what the matches are you have to use semirings.)

And now we have incremental NFAs! Oh frabjous day.