Patricia, UnQL, and NFA to DFA conversion

Kragen Sitaker kragen@pobox.com
Tue, 3 Feb 2004 12:06:21 -0500 (EST)


(I apologize in advance for the jargon.  Explaining this idea without
the jargon would take a book.  I also apologize for the lack of
references, but I don't have network access at the moment.)

I posted in kragen-tol a few years ago expressing my intuition that
you could index paths on arbitrary directed graphs in a way that
corresponded to indexing substrings of a long string of text with a
Patricia or "PAT" tree, but I didn't know how to do it.  I happened to
be thinking about graph transformation (thinking about the
nondeterministic mapping stuff I posted here a year or two ago, and in
particular how transpose wasn't exactly its own inverse) as I walked
to the train tonight, and I finally had the insight that explained how
to do it.

It's well known that nondeterministic finite automata (even with
epsilon-transitions) are equivalent in power to deterministic finite
automata, and there is a simple procedure to convert an NFA into an
equivalent DFA.  The DFA may be arbitrarily smaller or exponentially
larger than the NFA, but in the usual case is about the same size, and
once constructed, may be anywhere from the same speed to exponentially
faster.

Given a text, you can easily convert it into a nondeterministic finite
automaton to recognize suffixes of itself; trivially, you can
alternate all the suffixes:
(trivially|rivially|ivially|vially|ially|ally|lly|ly|y|), or you can
create a state for each suffix, then an initial state with
epsilon-transitions to each of those:
((((((((t?r)?i)?v)?i)?a)?l)?l)?y?).  You can also easily convert
either of these into an automaton to recognize substrings of the text
by adding epsilon-transitions to an accepting state; the NFAs aren't
much bigger, but I think the regular expressions for them are.

Matching a piece of text with one of these NFAs (to see if it occurred
in the original text) is pretty trivial, but slow --- it corresponds
exactly to the naive nested-loop string search algorithm.

If you convert one of these NFAs into the equivalent DFA, that DFA is
a suffix trie for the text you're searching.

(You can apply this process to the needle text instead of the haystack
text and you get Knuth-Morris-Pratt.)

Now, UnQL's edge-labeled-graph data model is homeomorphic to a
nondeterministic finite automaton.  The automaton it's homeomorphic to
recognizes the set of XPath-like paths through the data starting from
the root.  (I know UnQL is more powerful than path-expressions, but
they play a significant role in its evaluation, and I think they could
play an even more significant role in a more optimized implementation
of it.)

If you apply the same NFA-to-DFA transformation on an UnQL data graph,
you get a DFA that recognizes the same set of paths, but can recognize
a path of length N in O(N) time (with a small constant factor) instead
of O(M^N) time, where M is a branching factor characterizing the data
graph; in other words, you have something similar of a Patricia tree
for this arbitrary data graph.  It's more similar if you add an
initial state with epsilon-transitions to all the data nodes before
the NFA->DFA conversion.

That doesn't directly allow us to evaluate path-expressions,
i.e. given a regular expression describing paths into the data graph,
find the set of paths that exist in the data store matching that
regular expression (or, more tractably, their terminal nodes.)  It
only allows us to answer the question, does the literal path /A/B/C/D
exist in the data graph, and where does it terminate?

But regular languages are closed under intersection, and the set of
paths in the data graph matching a path-expression is the intersection
of the regular language generated by the path-expression and the
regular language generated by the data graph.  In my minimal
acquaintance with automata theory, I haven't seen the computation of a
finite automaton to recognize this intersection language discussed,
but it appears to me that if you have two DFAs and want to compute a
DFA that recognizes the intersection of the languages they recognize,
you can do so straightforwardly by just walking them in parallel.
Given a pair of states in the two DFAs, you can intersect the lists of
labels on their transitions (or the states the transitions lead to, if
you're dealing with that kind of DFA) to get the list of available
labels on outbound transitions, and for each possible label, there's a
single pair of states it will get you to from there.  I think the
resulting DFA takes time linear in the size of the output to compute,
and while it could theoretically contain one state for each pair of
states in the input automata (consider intersecting (MMMM)* with
(MMMMM)* --- you end up with (MMMMMMMMMMMMMMMMMMMM)*) I think that
won't be the usual case.

You can use the same technique to do regular expression searches on
the text indexed by a suffix trie.  I'm pretty sure Baeza-Yates or
Gonnet has published that technique for use on Patricia trees.

To make the results useful, of course, you may have to carry along a
list of pointers to the original data nodes that got merged into each
DFA node.  But a regular-expression representation of the intersection
of the languages might suffice as human-readable output.

When I first had this insight, I thought it was pretty cool because it
made it possible to evaluate path-expressions quickly, and you can
transform a lot of different problems into evaluating
path-expressions.  But as I thought about it, I realized that most of
the difficulties of those other problems still exist here.  For
example:

- you can transform a relational equijoin into path-expression
  evaluation by adding epsilon-transitions (or transitions with some
  token like "join") from field values in one table to record nodes
  from another table that they can be joined with, or drawing edges
  from fields to value nodes and then edges from those value nodes
  back to fields (or records).  But that doesn't really help with
  query optimization; the query-optimization problem of spending a lot
  of time looking at records that don't join against anything in the
  other table because you did a SEQ SCAN on the wrong table simply
  becomes the problem of computing a lot of DFA nodes that aren't on
  any path from the initial state to the final state.

- it doesn't, by itself, give you any help with restructuring
  tree-structured data --- just extracting subsets of it.

- an XQL query that refers to multiple subtrees under a single node
  can't be transformed into a simple path-expression.  (I can't
  remember if this capability is in XPath too, or just XQL.)  Worse,
  the DFA abstraction of the data throws away the very distinction
  such queries seek.  The NFAs (AX|AY) and (A(X|Y)) are distinct;
  their DFAs are the same.  (In fact, the second one is the DFA.)  If
  you're looking for an A followed by both an X and a Y, the second
  will answer, but the first will not.

- The general problem of answering relational queries with multiple
  joins with the tables all connected linearly (joined only against
  the table before or after them in some sequence) is a lot like the
  problem of concatenating regular expressions.  If you want to search
  for ( |\n)+(zzyzzyva)( |\n)+ in some text with Patricia, you might
  be be better off if you start by searching your tree for 'zzyzzyva',
  which must be somewhere in any substring matching that regexp, than
  if you start by finding all the matches for ( |\n)+, then proceeding
  to eliminate the ones that don't precede 'zzyzzyva'.  But you can't
  know that without knowing something about the data.

- relational queries in which one table is joined against three or
  more other tables are closely analogous to XQL queries looking for
  multiple subtrees under a single node.  

Despite all its sham and drudgery, it is still a beautiful algorithm.
Be careful.  Strive to be happy.

[Is an edge-labeled NFA Moore or Mealy?  Is it homeomorphic,
homologous, or something else?  Who published the results on regex
search on Patricia trees?]