patricia extensions

Kragen Sitaker kragen@pobox.com
Wed, 3 May 2000 16:16:42 -0400 (EDT)


It seems that it should be possible to use Patricia to search for
strings matching a regular expression in a fairly straightforward
fashion reminiscent of Dijkstra's three-color incremental GC
algorithm.

The idea is that we spread a wave of known results out through the
Patricia tree; at the beginning, all the nodes are colored "unknown".
We examine the bit specified by the root node (the one with the first
branching); if our regular expression allows it to be set both ways, we
color the root node "possible" and look at both of its children
(coloring them "to look at").  If our regular expression requires that
particular bit to have a particular value, we color the excluded child
"impossible" and the other child "to look at".

We repeat this process on each node that is marked "to look at" until
no more nodes are "to look at".  At this point, all the paths leading
to matches to the regex are colored "possible", and all the nodes
colored "unknown" or "impossible" are not part of paths to matches to
the regex.

So are a bunch of other paths.  To avoid this, whenever we find an
upward pointer in the Patricia tree when we hoped to find a child to
color "to look at", we can look to see if the string pointed to by the
upward pointer does indeed match the regex.  If it does not, a bit
omitted from the Patricia tree caused the mismatch; we can find this
bit, can walk back up the tree from where we found the upward pointer
to find the highest node that tests a bit after the mismatched bit, and
mark it and all of its descendants "impossible".

When the regex matches only a prefix of the string Patricia indexed,
there are multiple matches with that prefix, so we must enumerate the
descendants of the node where the regex match ended.  This can
conveniently be done by treating all regexes as ending with ".*".

The order in which to-look-at nodes are processed determines the order
in which matches are returned; I think LIFO would return them in
lexicographic order (unless you went to some trouble to prevent this),
while FIFO would return them in order of length.

With LIFO, it's just the standard backtracking early-pruning search
beloved of old AI mavens.

(Looks like Baeza-Yates and Gonnet might have beat me to it by four
years, the bastards:
http://www.acm.org/pubs/citations/journals/jacm/1996-43-6/p915-baeza-yates/
Unfortunately, the abstract doesn't really describe the algorithm, and the 

A similar algorithm can be used to find matches with up to N character
substitutions --- and, I think, insertions and deletions, too.

The reason I was originally thinking about this, though, is UnQL and
XQL:
http://www.research.att.com/~suciu/strudel/external/NodeExternal,internal.genoid_3.html

UnQL's queries are mostly regular expressions, but they are matched not
against substrings of some large string but against all possible
sequences of edge traversals in a directed graph.  I think it is
possible to index these sequences of edge traversals in a relatively
straightforward way with a modification of Patricia, but I haven't
figured out how yet.  (No doubt Gonnet has beat me to it --- he's been
working on genetic databases since 1991 or earlier.)

-- 
<kragen@pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
The Internet stock bubble didn't burst on 1999-11-08.  Hurrah!
<URL:http://www.pobox.com/~kragen/bubble.html>
The power didn't go out on 2000-01-01 either.  :)