XML semistructured query by example
kragen at pobox.com
Thu Dec 8 03:37:01 EST 2005
Semistructured data and QBE in XML
One of the nice things about spreadsheets is that everything is in one
place: your input data, your program, and the output of your program.
Because of this, you have the closure property: the output of your
program can be used as the input for another program. It's nice to
have the closure property along other axes as well (that your program
itself can be the input or output of another program) but that's a
Compare to an ordinary RDBMS: your data is in the database, your
queries are in text files or your client GUI, and your query results
are spewing out at you. Creating a view is quite different from, and
quite a bit more trouble than, running a query.
Suppose you have a big XML file that has all your data, and you'd like
some of the data replicated around the doc in a controlled way (a
summary here, a detail there), so you can edit it wherever (and
presumably follow hyperlinks between the views).
You could select a bunch of subtrees by providing a QBE node, which
would have some common subset of the subtrees you wanted, identified
by text contents, tag and attribute names, and ancestor relationships.
For example, <person/> would match all the <person> elements in the
document; <person><firstname/></person> would match all the person
elements with firstname subelements; <person firstname=""/> would
match all the person elements with firstname attributes;
<person firstname="Bob"/> would match all the person elements whose
firstname attribute has content exactly "Bob"; and
<person><firstname>Bob</firstname></person> would match all the person
elements containing firstname children with content exactly "Bob".
(Maybe to start with you'd want this to only pick children of some
particular node, like the root.)
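
The matching rules above can be sketched in a few lines of Python with
the standard library's ElementTree. This is only one possible reading
of those rules, not a worked-out design: the function names (matches,
select) and the sample <people> document are made up for illustration,
"content exactly" is taken to mean the element's own leading text with
surrounding whitespace stripped, and only children of one parent node
are searched.

```python
import xml.etree.ElementTree as ET


def matches(example, candidate):
    """Return True if `candidate` satisfies the QBE node `example`."""
    if example.tag != candidate.tag:
        return False
    for name, wanted in example.attrib.items():
        # An empty value only asks that the attribute exist;
        # a non-empty value has to match exactly.
        if name not in candidate.attrib:
            return False
        if wanted and candidate.attrib[name] != wanted:
            return False
    # Text in the example has to equal the candidate's own leading text.
    wanted_text = (example.text or "").strip()
    if wanted_text and (candidate.text or "").strip() != wanted_text:
        return False
    # Each child of the example needs some matching child in the candidate.
    return all(any(matches(want, have) for have in candidate)
               for want in example)


def select(example_xml, parent):
    """Return the children of `parent` that match the QBE node."""
    example = ET.fromstring(example_xml)
    return [child for child in parent if matches(example, child)]


# Hypothetical sample data, not from the original post.
people = ET.fromstring("""
<people>
  <person firstname="Bob"/>
  <person firstname="Alice"/>
  <person><firstname>Bob</firstname></person>
  <person><lastname>Smith</lastname></person>
  <pet/>
</people>
""")

for query in ('<person/>',
              '<person><firstname/></person>',
              '<person firstname=""/>',
              '<person firstname="Bob"/>',
              '<person><firstname>Bob</firstname></person>'):
    print(query, "->", len(select(query, people)), "match(es)")
```

Treating an empty attribute value as "must exist" and a non-empty one
as "must equal" keeps the query itself a well-formed XML fragment,
which is the point of QBE here.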
You'd want to be able to elaborate on these selection queries in
various ways: case-insensitivity, substring queries ("person contains
text string Bob"), descendant queries, order relationships, negation,
predicates on ancestors, etc. Maybe you'd like full XQuery and/or
XPath. But I'll ignore all that for now.
First you need to be able to get results! You could start with a
special attribute, "like", that points to the ID of a query. So if
you put the following text in your document:
<wibble like="firstnameguy" />
then your editor would instantly expand out your wibble element to
have all kinds of crazy contents:
<person>I know <firstname>Bob</firstname> <lastname>Smith</lastname>.</person>
<person>My cat <firstname>Slink</firstname> likes mice.</person>
These would be linked back to their original sources in the document,
maybe with magically recomputed XPaths in some attribute, or maybe
with some kind of IDREF. Either way, you could hit some key to go to
the original, or you could edit the copy right where it is.
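
Here is a rough sketch of that expansion step, again in Python with
ElementTree. Several things in it are assumptions rather than part of
the idea as stated: the function name expand_likes, the "source"
attribute used for the back-link, the choice of a positional path
like person[2] relative to the root, and the restriction to children
of the root. The query node and the placeholder are skipped so they do
not match themselves.

```python
import copy
import xml.etree.ElementTree as ET


def matches(example, candidate):
    """Simplified QBE match on tag, attributes, text, and children."""
    if example.tag != candidate.tag:
        return False
    for name, wanted in example.attrib.items():
        if name == "id":  # the id names the query; it is not a condition
            continue
        if name not in candidate.attrib:
            return False
        if wanted and candidate.attrib[name] != wanted:
            return False
    wanted_text = (example.text or "").strip()
    if wanted_text and (candidate.text or "").strip() != wanted_text:
        return False
    return all(any(matches(want, have) for have in candidate)
               for want in example)


def expand_likes(root):
    """Fill every element carrying like="ID" with copies of the matches."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    holders = [el for el in root.iter() if el.get("like")]
    for holder in holders:
        example = by_id[holder.get("like")]
        seen = {}  # tag -> number of root children with that tag so far
        for child in list(root):
            seen[child.tag] = seen.get(child.tag, 0) + 1
            if child is example or child is holder:
                continue
            if matches(example, child):
                clone = copy.deepcopy(child)
                # Hypothetical back-link: positional path relative to root.
                clone.set("source", "%s[%d]" % (child.tag, seen[child.tag]))
                holder.append(clone)


doc = ET.fromstring("""
<doc>
  <person id="firstnameguy"><firstname/></person>
  <person>I know <firstname>Bob</firstname> <lastname>Smith</lastname>.</person>
  <person>My cat <firstname>Slink</firstname> likes mice.</person>
  <person><lastname>Nameless</lastname></person>
  <wibble like="firstnameguy" />
</doc>
""")
expand_likes(doc)
for clone in doc.find("wibble"):
    print(clone.get("source"), ET.tostring(clone, encoding="unicode").strip())
```

A positional path goes stale as soon as someone inserts a sibling,
which is why the paths would have to be "magically recomputed"; an
IDREF would survive edits but needs every source node to carry an ID.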
Now suppose you want to restructure the tree in your query --- maybe
you just want a list of firstnames. You could give an example of the
result you want and reference it, too, by ID:
<person id="firstnameguy"><firstname id="hisfirstname"/></person>
<p id="firstname_table">Name is <n from="hisfirstname" />.</p>
<div like="firstnameguy" format="firstname_table" />
And your div would expand out into
<div like="firstnameguy" format="firstname_table">
<p>Name is <n>Bob</n>.</p>
<p>Name is <n>Slink</n>.</p>
</div>
Still with the links back to the source, and still with the ability to
edit. Inserting more elements into the query results is a pretty
straightforward thing to handle --- you just copy the QBE tree into
someplace. Deleting them isn't so obvious; does the person want the
original node to stop existing, or just to stop satisfying the query,
perhaps by one of its subtrees being renamed, deleted, or something?
You can always jump to the original item itself to delete it.
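
The "format" and "from" attributes described above amount to a match
that remembers which source node each ID in the query landed on,
followed by a template fill. A minimal Python sketch of the read-only
half follows, under assumed names (bind, render, expand) and with no
attempt at the back-links or the editing questions just raised. It
strips the id and from attributes from the output copies, which is a
guess at the intended result based on the example expansion.

```python
import copy
import xml.etree.ElementTree as ET


def bind(example, candidate):
    """Match `candidate` against `example`.

    Returns {id in example: matched element}, or None if there is no match.
    """
    if example.tag != candidate.tag:
        return None
    for name, wanted in example.attrib.items():
        if name == "id":  # ids label nodes for later reference
            continue
        if name not in candidate.attrib:
            return None
        if wanted and candidate.attrib[name] != wanted:
            return None
    wanted_text = (example.text or "").strip()
    if wanted_text and (candidate.text or "").strip() != wanted_text:
        return None
    bindings = {}
    if example.get("id"):
        bindings[example.get("id")] = candidate
    for want in example:
        for have in candidate:
            found = bind(want, have)
            if found is not None:
                bindings.update(found)
                break
        else:
            return None  # no child of the candidate satisfied `want`
    return bindings


def render(template, bindings):
    """Copy the template, filling each from="ID" slot with the bound text."""
    out = copy.deepcopy(template)
    out.attrib.pop("id", None)
    for slot in out.iter():
        key = slot.attrib.pop("from", None)
        if key is not None:
            slot.text = bindings[key].text
    return out


def expand(root):
    """Expand every element that has both like="ID" and format="ID"."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    holders = [el for el in root.iter()
               if el.get("like") and el.get("format")]
    for holder in holders:
        example = by_id[holder.get("like")]
        template = by_id[holder.get("format")]
        for candidate in list(root):
            if candidate is example or candidate is holder:
                continue
            found = bind(example, candidate)
            if found is not None:
                holder.append(render(template, found))


doc = ET.fromstring("""
<doc>
  <person id="firstnameguy"><firstname id="hisfirstname"/></person>
  <p id="firstname_table">Name is <n from="hisfirstname" />.</p>
  <person>I know <firstname>Bob</firstname> <lastname>Smith</lastname>.</person>
  <person>My cat <firstname>Slink</firstname> likes mice.</person>
  <div like="firstnameguy" format="firstname_table" />
</doc>
""")
expand(doc)
for row in doc.find("div"):
    print(ET.tostring(row, encoding="unicode").strip())
```

Because bind returns the matched source elements rather than copies
of their text, an editor built on this could in principle write an
edited <n> back to the <firstname> it came from.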
So already, we have basic CRUD here in our hypothetical
XML-editor/semistructured-data-store, without too much syntax or
hassle, with enough smarts to do some vaguely useful HTML parsing and
reporting. (We have a kind of 'select' and a kind of 'project' into
some simple templates, but no join, intersect, difference, or union.)
You could add some glue to transclude arbitrary HTML documents in your
semistructured data store (which would be polled from time to time);
you could export parts of your document into URL-space with some other
magic attribute. You could even map POSTs to some part of URL-space
into some element of your document, so HTML form posts to that URL
would get encoded in XML and added to your document.
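
As a small illustration of the last idea, here is one way a form post
body could be encoded as XML and appended to a chosen element. The
element names <post> and <field>, the function name append_post, and
the <inbox> target are all invented for this sketch; nothing above
fixes an encoding. Field names go into an attribute rather than
becoming tag names, since form field names are not always legal XML
names.

```python
from urllib.parse import parse_qsl
import xml.etree.ElementTree as ET


def append_post(target, body):
    """Append one <post> record to `target` from a urlencoded form body."""
    record = ET.SubElement(target, "post")
    for name, value in parse_qsl(body, keep_blank_values=True):
        field = ET.SubElement(record, "field", {"name": name})
        field.text = value
    return record


# Hypothetical element that the POST URL has been mapped onto.
inbox = ET.Element("inbox")
append_post(inbox, "firstname=Bob&lastname=Smith+Jr")
print(ET.tostring(inbox, encoding="unicode"))
```

Once the record is in the document, a QBE node such as
<post><field name="firstname"/></post> could pick it up like any other
subtree.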
Probably you'd want at some point to be able to control which node was
the "context node" for a query --- that is, to use a query to look for
results in just one particular subtree of the document.
Maybe use a special tag, like <query>Bob</query>, instead of just an
What about parameters? Linking? Detail records?
Maybe it would be better to attach semantics to tag names instead of
using an IDREF attribute. For example, you could say something like
(sorry for the confused example):
<name id="personsname"><firstname/> <lastname/></name>
to specify that whenever there's a <person>, it should have a <name>
containing <firstname> and <lastname> elements, and a <friendof>
element containing the results of a query over other <person> objects.
Maybe some elements or queries could reach out into their environment
to get query parameters, in the same relative place that they got them
from in the place where they were defined.
Anyway, I'm getting bogged down in details of the infinite
possibilities of how to design this system so that it could possibly
make sense, without really focusing on the main point, which is that
embedding some kind of specification of queries into your giant XML
document and including editable results of those queries immediately
inline could give you a relatively useful end-user database with little
This relates to my feeling that I don't have a good semistructured data
store (see "semistructured data: summary of six years of wishes" <insert
ref>); to my desire to produce an end-user-programmable web-page
compositing system (see "Lossless HTML template expansion"
This does not relate to Meredith Patterson's AI project in Postgres