semistructured data: summary of six years of wishes
Kragen Sitaker
kragen at pobox.com
Thu Dec 1 03:37:01 EST 2005
Now I am feeling the need for a personal semistructured store more than
ever. This is a searchable database that supports "data first,
structure later" <http://www.betaversion.org/~stefano/linotype/news/93/>
(i.e. you can add a schema incrementally to existing data entered
without one, or with a very primitive one), but supports enough
structure to render web pages and things like that. Here are some
options:
- RDF/Sniki-style: label edges in a directed graph with nodes, which
have names. Define tables with Sniki-style queries. Add field
metadata with edges from the field label node. Queries are different
from edges; perhaps stored in nodes, perhaps defined in some manner by
a local graph structure. Adding data to a query result is the most
convenient way to add it.
- UnQL-style: label edges in a directed graph with strings; nodes are
anonymous; define queries by replacing regular expressions of edges
with edges; display tables perhaps by going two levels deep from a
particular node. (OEM, object exchange model, is essentially
identical to this.)
- Python/Perl/JavaScript-style: nodes are anonymous, but edges are
uniquely labeled within a node with strings, and some nodes are
special primitive types, like strings. Some edges are 'virtual' and
computed on demand by arbitrary code that walks the neighborhood.
There are nodes that are lists.
- askSam-style: nodes are small text files, with a syntactic convention
to specify fields. Queries are full-text searches, perhaps limited to
certain fields; reports can extract values of certain fields to get
tabular output.
- XML-style: data is stored in a single hierarchy in which the children
of each node are ordered, and each node is labeled with an element
type and possibly some attributes; XPath or XQuery is the query
language. Maybe you have hyperlinks by id to other nodes.
- Wiki-style: a node is a small text file with a name; text formatted
in a certain way inside the file represents a reference to another
file. Very similar to AskSam-style, but existing Wikis either don't
have a facility for fields with values or don't use it much.
- wowbarbts-style: just like Wiki-style, but the node is a set of
name-value pairs, and each value is potentially a small text file
(which may contain references to other nodes.) Past versions are
logged.
- Prolog-style: data is viewed in a set of relations of fixed arity; any
relation may contain both data items directly entered there by the
user as well as data items "inferred", i.e. produced as query results,
from other relations, and there might actually be an infinite number
of such other data items. The data items may be atomic (i.e. strings)
or node-labeled trees of atoms.
- Lotus-Agenda-style: there's an implication hierarchy of categories
(with the universe at the top), and any data item (a short string of
text) may belong to any subset of the hierarchy, subject to a few
validity constraints. Sets of categories within the tree form column
values in reporting screens, categories form section headers in
reporting screens, and the data items themselves are always displayed
on reporting screens, which are also where you add new data items. In
a few cases, belonging to a category may have associated with it some
numeric or date value. Most category memberships are inferred from
the text content of the data item, according to rules attached to the
categories --- e.g. categorize anything containing the word "car" or
"automotive" in the "automotive" category.
- Excel-style: you have a big rectangular grid of data values; you can
automatically hide rows in some subset that don't meet your criteria,
and you have "quick filters" that give you menus of possible data
values. Also you can hide whatever columns you want. Really simple
and not very powerful but apparently sufficient for many people's
uses.
- mutated-Agenda-style: you have a big hierarchy of data items, each
labeled either with a string or a hyperlink to some other data item.
Category membership is denoted by a child labeled with a hyperlink to
the category. Reports/queries can optionally be run over just the
descendants or children of a single node, minimizing the
global-data-store-interference problem observed with several of these
models.
- rumor-oriented style: data items are sets of name-value pairs; you
can't modify old items, but only add new ones, which closely resemble
input events; queries use the standard relational operations on some
set of relations, one of which is the entire set of data items, and
are evaluated in some namespace where you probably have defined most
of the other names with standard queries. By evaluating a query or a
set of queries in a different namespace, you can get effects like
undo, deletion, update, or the elimination of all actions from a
particular IP address. Using relational operations provides
sequence-independence; the rumorset is not inherently sequenced. Must
use lazily-updated materialized views for most queries in order to
achieve reasonable performance.
- zzstructure-style: each data item is a short string or other similar
typed object; they are connected along 'axes', which have a linearity
constraint --- if X follows Y along axis A, then nothing else can
follow Y along that axis, and X must also precede Y along A. Tabular
display is achieved by mapping two of the axes onto the screen's
dimensions. You could probably do reasonable queries with regular
expressions over axis directions and node tests, sort of like XPath or
XQL. However, Ted Nelson is a patent pirate who wishes that such
projects would stop, and he may have the legal means to kill them
<http://lists.gnu.org/archive/html/gzz-dev/2003-02/msg00042.html>
- del.icio.us-style: similar to Agenda, but without a hierarchy among
categories and without any automatic categorization.
- unix-filesystem-style: nodes are either strings, string-indexed
dictionaries of nodes, or transparent hyperlinks to other nodes.
Queries (tree regular expressions on path traversals with full-text
search, or XPath, would probably suffice) produce lists of node
pathnames, perhaps with associated excerpts. (It would be interesting
to see full-fledged tools for viewing real Unix filesystems this way.)
To essentially any of these styles, you could add any of the following
features:
- full-text search
- incremental, interactive query result display
(note that this probably implies that the query syntax should put
field names after the field values being searched for; this seems
obvious once stated, but does not seem to be a property of any
existing query language)
- arithmetic on fields (for queries and reports)
- computed attributes (similar to arithmetic on fields, but perhaps more
general, including query results)
- imperative programming
- Agenda-style automatic categorization
- RDF-style namespacing
- optimistically synchronized transactions
- or at least locking
- or at least some minimal atomic update
- access to old versions of the data
- indexing with Nilesh Dalvi's partial match index
<ftp://ftp.cs.washington.edu/tr/2004/01/UW-CSE-04-01-01.pdf>
and probably some other features I haven't thought of yet. These
features may actually matter as much as the underlying data
representation, and Excel shows that a nice UI for the simplest tasks
compensates for many weaknesses.
Into most of them, you could probably hack in some variant of the old Notes
synchronization approach, recently codified by Ray Ozzie at Microsoft as
Simple Sharing Extensions <http://msdn.microsoft.com/xml/rss/sse>; the
rumor-oriented style
<http://lists.canonical.org/pipermail/kragen-tol/2004-January/000749.html>
gives you stronger guarantees about synchronization, at the cost of
unbounded storage and unpredictable performance.
Thoughts I've had on this problem over the years
------------------------------------------------
<http://del.icio.us/kragen/semistructured-data>
38 relevant web pages and my brief comments on them over the last year
<http://lists.canonical.org/pipermail/kragen-tol/1999-August/000466.html>
XML editors
(how to make XML editing convenient so you can stand to take notes in it)
<http://lists.canonical.org/pipermail/kragen-tol/1999-November/000486.html>
XML query languages
(abortive notes on a better path language for XML)
<http://lists.canonical.org/pipermail/kragen-tol/2000-May/000580.html>
patricia extensions
(groping toward the idea I finally understood in the NFA-DFA post below)
<http://lists.canonical.org/pipermail/kragen-tol/2000-September/000633.html>
AskSam & outliners
(sorry about the terrible formatting; will fix it sometime)
<http://lists.canonical.org/pipermail/kragen-tol/2002-February/000689.html>
grep on RFC-822 headers and stuff
(a previous post expressing a wish for a semistructured data store, with
six things to store in it)
<http://lists.canonical.org/pipermail/kragen-discuss/2002-February/000750.html>
patricia indexes on directed graphs
(groping more toward the idea I finally understood in the NFA-DFA post)
<http://lists.canonical.org/pipermail/kragen-tol/2003-January/000737.html>
mvfs --- multivalued functions --- and query languages
(me groping toward an understanding of semistructured query optimization)
<http://lists.canonical.org/pipermail/kragen-hacks/2004-January/000384.html>
viewing filesystems as filterable outlines
(very rough initial prototype of a incremental full-text search UI)
<http://lists.canonical.org/pipermail/kragen-tol/2004-February/000752.html>
Patricia, UnQL, and NFA to DFA conversion
(how to build a really large index of a graph to make path regexes fast)
<http://lists.canonical.org/pipermail/kragen-tol/2005-October/000795.html>
optimizing SQL in web apps by waiting until the last minute
(proposing how to make web apps on a triple store fast)
<http://lists.canonical.org/pipermail/kragen-hacks/2005-November/000420.html>
semistructured data land: sorting paragraphs by aisle number
<http://lists.canonical.org/pipermail/kragen-tol/2005-November/000809.html>
folkschemas and semistructured data
(how to make it easy to select field names and values)
Other related stuff
-------------------
wowbarbts embodies some of these ideas as well; it would be nice if it
could end up getting published too, but I wouldn't be surprised if
circumstances prevented that.
The KB editor we used at Alpiri also embodied some of these ideas, but
I didn't write it.
Last year, a prominent XQuery researcher recommended Galax as the best
available open-source XQuery implementation.
More information about the Kragen-tol
mailing list