semistructured data: summary of six years of wishes

Kragen Sitaker kragen at pobox.com
Thu Dec 1 03:37:01 EST 2005


Now I am feeling the need for a personal semistructured store more than
ever.  This is a searchable database that supports "data first,
structure later" <http://www.betaversion.org/~stefano/linotype/news/93/>
(i.e. you can add a schema incrementally to existing data entered
without one, or with a very primitive one), but supports enough
structure to render web pages and things like that.  Here are some
options:

- RDF/Sniki-style: label edges in a directed graph with nodes, which
  have names.  Define tables with Sniki-style queries.  Add field
  metadata with edges from the field label node.  Queries are different
  from edges; perhaps stored in nodes, perhaps defined in some manner by
  a local graph structure.  Adding data to a query result is the most
  convenient way to add it.
- UnQL-style: label edges in a directed graph with strings; nodes are
  anonymous; define queries by replacing regular expressions of edges
  with edges; display tables perhaps by going two levels deep from a
  particular node.  (OEM, object exchange model, is essentially
  identical to this.)
- Python/Perl/JavaScript-style: nodes are anonymous, but edges are
  uniquely labeled within a node with strings, and some nodes are
  special primitive types, like strings.  Some edges are 'virtual' and
  computed on demand by arbitrary code that walks the neighborhood.
  There are nodes that are lists.
- askSam-style: nodes are small text files, with a syntactic convention
  to specify fields.  Queries are full-text searches, perhaps limited to
  certain fields; reports can extract values of certain fields to get
  tabular output.
- XML-style: data is stored in a single hierarchy in which the children
  of each node are ordered, and each node is labeled with an element
  type and possibly some attributes; XPath or XQuery is the query
  language.  Maybe you have hyperlinks by id to other nodes.
- Wiki-style: a node is a small text file with a name; text formatted
  in a certain way inside the file represents a reference to another
  file.  Very similar to AskSam-style, but existing Wikis either don't
  have a facility for fields with values or don't use it much.
- wowbarbts-style: just like Wiki-style, but the node is a set of
  name-value pairs, and each value is potentially a small text file
  (which may contain references to other nodes.) Past versions are
  logged.
- Prolog-style: data is viewed in a set of relations of fixed arity; any
  relation may contain both data items directly entered there by the
  user as well as data items "inferred", i.e. produced as query results,
  from other relations, and there might actually be an infinite number
  of such other data items.  The data items may be atomic (i.e. strings)
  or node-labeled trees of atoms.
- Lotus-Agenda-style: there's an implication hierarchy of categories
  (with the universe at the top), and any data item (a short string of
  text) may belong to any subset of the hierarchy, subject to a few
  validity constraints.  Sets of categories within the tree form column
  values in reporting screens, categories form section headers in
  reporting screens, and the data items themselves are always displayed
  on reporting screens, which are also where you add new data items.  In
  a few cases, belonging to a category may have associated with it some
  numeric or date value.  Most category memberships are inferred from
  the text content of the data item, according to rules attached to the
  categories --- e.g. categorize anything containing the word "car" or
  "automotive" in the "automotive" category.
- Excel-style: you have a big rectangular grid of data values; you can
  automatically hide rows in some subset that don't meet your criteria,
  and you have "quick filters" that give you menus of possible data
  values.  Also you can hide whatever columns you want.  Really simple
  and not very powerful but apparently sufficient for many people's
  uses.
- mutated-Agenda-style: you have a big hierarchy of data items, each
  labeled either with a string or a hyperlink to some other data item.
  Category membership is denoted by a child labeled with a hyperlink to
  the category.  Reports/queries can optionally be run over just the
  descendants or children of a single node, minimizing the
  global-data-store-interference problem observed with several of these
  models.
- rumor-oriented style: data items are sets of name-value pairs; you
  can't modify old items, but only add new ones, which closely resemble
  input events; queries use the standard relational operations on some
  set of relations, one of which is the entire set of data items, and
  are evaluated in some namespace where you probably have defined most
  of the other names with standard queries.  By evaluating a query or a
  set of queries in a different namespace, you can get effects like
  undo, deletion, update, or the elimination of all actions from a
  particular IP address.  Using relational operations provides
  sequence-independence; the rumorset is not inherently sequenced.  Must
  use lazily-updated materialized views for most queries in order to
  achieve reasonable performance.
- zzstructure-style: each data item is a short string or other similar
  typed object; they are connected along 'axes', which have a linearity
  constraint --- if X follows Y along axis A, then nothing else can
  follow Y along that axis, and X must also precede Y along A.  Tabular
  display is achieved by mapping two of the axes onto the screen's
  dimensions.  You could probably do reasonable queries with regular
  expressions over axis directions and node tests, sort of like XPath or
  XQL.  However, Ted Nelson is a patent pirate who wishes that such
  projects would stop, and he may have the legal means to kill them
  <http://lists.gnu.org/archive/html/gzz-dev/2003-02/msg00042.html>
- del.icio.us-style: similar to Agenda, but without a hierarchy among
  categories and without any automatic categorization.
- unix-filesystem-style: nodes are either strings, string-indexed
  dictionaries of nodes, or transparent hyperlinks to other nodes.
  Queries (tree regular expressions on path traversals with full-text
  search, or XPath, would probably suffice) produce lists of node
  pathnames, perhaps with associated excerpts.  (It would be interesting
  to see full-fledged tools for viewing real Unix filesystems this way.)

To essentially any of these styles, you could add any of the following
features:
- full-text search
- incremental, interactive query result display
  (note that this probably implies that the query syntax should put
  field names after the field values being searched for; this seems
  obvious once stated, but does not seem to be a property of any
  existing query language)
- arithmetic on fields (for queries and reports)
- computed attributes (similar to arithmetic on fields, but perhaps more
  general, including query results)
- imperative programming
- Agenda-style automatic categorization
- RDF-style namespacing
- optimistically synchronized transactions
  - or at least locking
    - or at least some minimal atomic update
- access to old versions of the data
- indexing with Nilesh Dalvi's partial match index
  <ftp://ftp.cs.washington.edu/tr/2004/01/UW-CSE-04-01-01.pdf>
and probably some other features I haven't thought of yet.  These
features may actually matter as much as the underlying data
representation, and Excel shows that a nice UI for the simplest tasks
compensates for many weaknesses.

Into most of them, you could probably hack in some variant of the old Notes
synchronization approach, recently codified by Ray Ozzie at Microsoft as
Simple Sharing Extensions <http://msdn.microsoft.com/xml/rss/sse>; the
rumor-oriented style
<http://lists.canonical.org/pipermail/kragen-tol/2004-January/000749.html>
gives you stronger guarantees about synchronization, at the cost of
unbounded storage and unpredictable performance.

Thoughts I've had on this problem over the years
------------------------------------------------

<http://del.icio.us/kragen/semistructured-data>
38 relevant web pages and my brief comments on them over the last year

<http://lists.canonical.org/pipermail/kragen-tol/1999-August/000466.html>
XML editors
(how to make XML editing convenient so you can stand to take notes in it)

<http://lists.canonical.org/pipermail/kragen-tol/1999-November/000486.html>
XML query languages
(abortive notes on a better path language for XML)

<http://lists.canonical.org/pipermail/kragen-tol/2000-May/000580.html>
patricia extensions
(groping toward the idea I finally understood in the NFA-DFA post below)

<http://lists.canonical.org/pipermail/kragen-tol/2000-September/000633.html>
AskSam & outliners
(sorry about the terrible formatting; will fix it sometime)

<http://lists.canonical.org/pipermail/kragen-tol/2002-February/000689.html>
grep on RFC-822 headers and stuff
(a previous post expressing a wish for a semistructured data store, with
six things to store in it)

<http://lists.canonical.org/pipermail/kragen-discuss/2002-February/000750.html>
patricia indexes on directed graphs
(groping more toward the idea I finally understood in the NFA-DFA post)

<http://lists.canonical.org/pipermail/kragen-tol/2003-January/000737.html>
mvfs --- multivalued functions --- and query languages
(me groping toward an understanding of semistructured query optimization)

<http://lists.canonical.org/pipermail/kragen-hacks/2004-January/000384.html>
viewing filesystems as filterable outlines
(very rough initial prototype of a incremental full-text search UI)

<http://lists.canonical.org/pipermail/kragen-tol/2004-February/000752.html>
Patricia, UnQL, and NFA to DFA conversion
(how to build a really large index of a graph to make path regexes fast)

<http://lists.canonical.org/pipermail/kragen-tol/2005-October/000795.html>
optimizing SQL in web apps by waiting until the last minute
(proposing how to make web apps on a triple store fast)

<http://lists.canonical.org/pipermail/kragen-hacks/2005-November/000420.html>
semistructured data land: sorting paragraphs by aisle number

<http://lists.canonical.org/pipermail/kragen-tol/2005-November/000809.html>
folkschemas and semistructured data
(how to make it easy to select field names and values)

Other related stuff
-------------------

wowbarbts embodies some of these ideas as well; it would be nice if it
could end up getting published too, but I wouldn't be surprised if
circumstances prevented that.

The KB editor we used at Alpiri also embodied some of these ideas, but
I didn't write it.

Last year, a prominent XQuery researcher recommended Galax as the best
available open-source XQuery implementation.


More information about the Kragen-tol mailing list