XML query languages
Kragen Sitaker
kragen@pobox.com
Wed, 3 Nov 1999 10:51:55 -0500 (EST)
I was pondering the shortcomings of XQL last night. I think I came up
with something simpler to understand, simpler to implement, and more
concise. It has capabilities XQL doesn't, and XQL has capabilities it
doesn't.
(Here XQL means Texcel/webMethods/Microsoft XQL. See e.g.
http://www.w3.org/TandS/QL/QL98/pp/xql.html,
http://metalab.unc.edu/xql/xql-tutorial.html,
http://metalab.unc.edu/xql/, and
http://www.texcel.no/whitepapers/xql-design.html, which describe
significantly different languages.)
The basic concept is essentially the same as that of XQL. The query
language maps sets of nodes to sets of nodes. Normally when you query
a document, your starting set of nodes is simply the root node of the
document.
I believe this language can do everything required by Tim Bray's
Element Sets desiderata, described at
http://www.w3.org/TandS/QL/QL98/pp/sets.html, but not much more. It
provides element sets, inclusion and containment, union and (I believe)
intersection and set subtraction, tests for attributes and text
contents, and character-based proximity searching. Every feature of
this language, with the exception of ';', is required to satisfy one or
more of these requirements; some other useful things fall out of their
combination.
Definition
----------
A query is (almost) simply a path specification -- a sequence of moves
through the tree. If there are any nodes reachable by that path
specification from any of the starting nodes, those nodes constitute
the query results; otherwise, the query fails.
The path specification is a regular expression. It has a grammar like this:
pathspec ::= pathspec-alt
| pathspec-alt "|" pathspec
pathspec-alt ::= quantified-regex
| quantified-regex pathspec-alt
quantified-regex ::= move
| "(" pathspec ")"
| quantified-regex "*" | quantified-regex "+" |
| quantified-regex "?"
| quantified-regex "{" number "," number "}"
| quantified-regex "{" number "," "}"
| quantified-regex "{" "," number "}"
| quantified-regex "{" number "}"
| "[" pathspec "]"
| "[!" pathspec "]"
Most of these things have their standard meanings, except that (a)
they're matching a sequence of moves through the tree, not a string of
characters, and (b) "[ ]" and "[! ]".
A "move" can be:
"/" -- move to a child element of the current element
"@" -- move to an attribute of the current element
name -- assert that the current element or attribute is of the named type
"=" quoted-string -- assert that the character data content of the
current element or attribute is the quoted string. I'm not sure how
to define this for elements that have child elements.
"=~" quoted-char-regex -- assert that the cdata content of the current element
or attribute matches the quoted character regular expression, which uses
syntax similar to the standard Unix syntax.
";" -- move to the next element whose parent is the same as the current
element's parent.
The "[ ]" and "[! ]" constructs require a little explanation. "["
pushes the current node on a stack; "]" pops it off. So "[/foo]" will
eliminate any nodes that do not have child elements that are of type
foo. "[/]" will eliminate any nodes that do not have child elements.
"[! ]" is provided to make negative assertions. I'm not sure how to
describe this in the same simple terms I've described everything else;
the idea is that nodes which pass [! ] will fail the query. So
"[!foo]" will eliminate nodes of type foo; "[!/foo]" will eliminate
nodes with children of type foo; "[!/]" will eliminate nodes with any
children; "/[!foo]" will give you the child nodes that are not of type
foo; and "[/[!foo]]" will give you the nodes that have children that
are not of type foo.
Sample expressions
------------------
/ -- return the set of children of the starting nodes
/* -- return the set of descendants of the starting nodes (including the
starting nodes themselves)
/*char -- return the set of descendants of the starting nodes of type 'char'
/*char@ord=2 -- return the set of attributes of descendants of the
starting nodes that
- belong to elements of type 'char'
- are of type 'ord'
- have the value '2'
/*char[=""][@ord=2] -- return the set of descendants of the starting nodes that
have no content and have an attribute 'ord' whose value is 2
/*[/header[!/x-mimeole][/importance] -- of the descendants of the starting
nodes, find those that are parents of 'header' elements that contain
'importance' elements but not 'x-mimeole' elements.
/*[/header[!/@type='x-mimeole'][/@type='importance'] -- similar, but now
instead of elements of type x-mimeole and importance, we're testing
for elements with attribute 'type' saying x-mimeole and importance.
/invoice/customer[/address=~/alifornia/](|/address=~/alifornia/) --
find customers that are children of an invoice that is a child
of a starting node and which contain an address containing
'alifornia', and return those customers and the matching addresses.
/*(ol|ul) -- return descendants of the starting nodes of type 'ol' or 'ul'
/*[!ol|ul][/li] -- return descendants of the starting nodes that are not of
type 'ol' or 'ul', but which nevertheless contain children of type 'li'
/*(ol|ul)[!/li] -- return ol or ul nodes without children of type 'li'
(from among the descendants of the starting nodes)
/*(ol|ul)/[!li] -- return ol or ul nodes with children of other than type 'li'
/*[!dt][;dd] -- return descendants of the starting nodes that are followed by
'dd' elements, but which are not of type 'dt'
/*dt;dd -- return dd elements that follow dt elements that are descendants of
the starting nodes
/(header/(from|subject|date)|body) -- return header/from,
header/subject, header/date, and body elements of the starting
elements. (I could state this more formally, but I don't want to.)
/lines/line[!;line] -- return the last line of 'lines'. (Not sure how to
generalize this so you can pick the first line.)
/lines/line[(;line;line)*[!;line]] -- pick the last line of lines, the
third-from-last line of lines, the fifth-from-last line of
lines, etc.
@ -- return all attributes of all starting elements.
(/(body|p|blockquote|h1|h2|h3|h4))* -- return all elements reachable
from the starting elements by descending through chains of only
elements of the types named.
Fitting into the larger world
-----------------------------
The question of what to do with the returned set of nodes is
difficult. The XQL idea of returning a document containing the
selected nodes, preserving hierarchy and sequence, is a good one for
many applications, I think. In this case, there are at least three
reasonable choices you might want to make about the inclusion of the
content of the selected nodes:
- include no content;
- include CDATA content;
- include all content;
I'm not sure how you should express your choice -- in the query
language or elsewhere.
I agree with the position of David Beech of Oracle (for which see
http://www.w3.org/TandS/QL/QL98/pp/oracle.html) that it is likely to be
useful to treat the query results as a set of references into the
original document, rather than a document in itself.
It might be possible to extend this language to select text strings as
well as elements and attributes. If that were done, the language could
express the choices mentioned above, and more besides.
Possible things to add
----------------------
Inverse operators for '@', '/', and ';'. This would enable asking for
'the first x' or 'the root of the document', among other things.
Something similar to XQL's x[n], where n is a number, which returns the
nth member of the set defined by x. x[end()] returns the last.
Some way to do queries like the XQL //book[@author = /guest/author],
which is intended to find books whose author attribute is the same as
the text of the author element of the guest element of the root element.
Some way to do joins, which XQL also doesn't have. The XQL syntax
mentioned in the previous paragraph is a poor attempt at joins.
Perhaps other database stuff: transactions, updates.
Better text searching?
Namespace stuff?
Presumably elements in which character data is interspersed with child
elements do so because some meaning is conveyed. The language at
present offers no way to explore these relationships.
Some way to select strings of character data, not just elements.
There's also XML-QL (http://www.w3.org/TR/1998/NOTE-xml-ql-19980819/),
which tries to adapt SQL/OQL-style query languages to XML. I haven't
looked at it.
XQL equivalences
----------------
The XQL ?? operator is unnecessary; x??y becomes x[y].
The XQL // has become /+ or /*/, and the XQL /*/ has become //.
Similarly, XQL ;; has become ;+ or ;*;, and XQL ;*; has become ;;.
There is no equivalent to XQL's leading / and //.
XQL /@ becomes @.
You can't compare content and attribute values to anything besides
fixed strings. This could be a serious obstacle! Fixing this would
require an addition at least as large as [! ], and I'm not sure how to
do it.
XQL [x] becomes [/x].
XQL [x $and$ y] becomes [x][y].
XQL [x $or$ y] becomes [x|y].
XQL $not$ x beomces [!x].
XQL x != y becomes, roughly, x[!=y]. But the semantics of the 'x' part
are different.
XQL x [.y] becomes, usually, x[y].
XQL methods and data typing smell bad. index(), datatype(), value(),
textNode(), comment(), pi(), element(), attribute(), node(),
ancestor(), etc., seem like warts.
XQL x[n], where n is a number, I cannot see how to do at the moment.
$ge$, $lt$, etc., can't be done.
XQL x[$all$ y] becomes a double negative: x[!/[!y]].
XQL $union$ becomes |. Interestingly, XQL has defined | as a synonym
for $union$ too.
I haven't thought about how to handle comments, PIs, DTDs, etc.
XQL x?y becomes something like x[y](|y), except that XQL's ? expresses
some semantics of what to do with the selected set of nodes which this
language doesn't. It might be useful to add.
XQL x $intersect$ y is generally something like x[y], but I'm not clear
on some of the XQL aggregate functions, so maybe I've misunderstood
$intersect$.
--
<kragen@pobox.com> Kragen Sitaker <http://www.pobox.com/~kragen/>
Tue Nov 02 1999
6 days until the Internet stock bubble bursts on Monday, 1999-11-08.
<URL:http://www.pobox.com/~kragen/bubble.html>