mvfs --- multivalued functions --- and query languages
Kragen Sitaker
kragen@pobox.com
Thu, 30 Jan 2003 11:24:02 -0500 (EST)
Let's consider a SQL database table called ab with two columns, a and
b, both holding integers. We can think of this table as a mapping
from row-IDs to pairs of integers. Let's call this mapping "ab" too,
just like the table; I hope context makes it clear when "ab" means the
table and when it means the mapping.
I'll talk a lot about mappings, but these mappings differ somewhat
from mathematical mappings; a particular domain value ("key") can map
to more than one range value, and we even keep track of the number of
times each range value occurs in the output. To reduce confusion,
I'll call these "mappings" "multivalued functions", or "mvfs" for
short.
I'll write the mvf that maps 1 to 2, 2, and 3, and also maps 4 to 5,
as follows: {1 -> 2, 1 -> 2, 1 -> 3, 4 -> 5}.
Other than that, I'll mostly use Python notation here.
So, anyway, the mvf ab looks something like this: {1 -> (2, 3), 2 ->
(2, 4), 3 -> (3, 2)}. Maybe it has thousands or hundreds of thousands
of rows instead of just three, but anyway, each key maps to just one
pair of integers --- it's just like a mathematical mapping.
A lot of this text defines operations on mvfs. I intend that a
straightforward implementation of most of these operations should be
reasonably efficient, unlike, say, some of the primitive operations of
the relational algebra.
Also, mvfs are simpler than relations, and it may be that fewer
fundamental mvf operations suffice to give them the same expressive
power as the relational algebra.
.values() gives the values
--------------------------
I define an operation ".values()" on mvfs; ab.values() is [(2, 3), (2,
4), (3, 2)], if we take the example value of ab given above. For any
mvf x = {k0 -> v0, k1 -> v1...kn -> vn}, x.values() gives the list
[v0, v1...vn].
So we can write the SQL query
select * from ab
as
ab.values()
.cut() splits into columns
--------------------------
However, we don't yet have any way to write
select a from ab
For this, we need an operation I call ".cut()", defined only on mvfs
whose values are tuples all of the same length. (By tuples, I mean
the conventional or Python kind, not the relational-algebra kind.) For
our example ab, ab.cut() gives a 2-tuple of mvfs:
({1 -> 2, 2 -> 2, 3 -> 3}, {1 -> 3, 2 -> 4, 3 -> 2}). This lets us
write the above SQL query as
ab.cut()[0].values()
Likewise, ab.cut()[1].values() gives column b.
For convenience, in the rest of this document, I will take
a, b = ab.cut()
A more general definition (but still by example, sorry):
For some mvf x = {d0 -> (r[0,0], r[0,1], ... r[0, m]),
d1 -> (r[1,0], r[1,1], ... r[1, m]),
...
dn -> (r[n,0], r[n,1], ... r[n, m])}, we define
x.cut() = ({d0 -> r[0,0], d1 -> r[1,0], ... dn -> r[n,0]},
{d0 -> r[0,1], d1 -> r[1,1], ... dn -> r[n,1]},
...
{d0 -> r[0,m], d1 -> r[1,m], ... dn -> r[n,m]})
.cut() has nothing to do with Prolog cut '!' and everything to do with
the Unix cut(1) command.
.invert() produces the inverse of an mvf
----------------------------------------
We have b = {1 -> 3, 2 -> 4, 3 -> 2}. I'll call
{3 -> 1, 4 -> 2, 2 -> 3} b.invert(). In this case, b.invert() maps
values of column b into row-ids of table ab.
a.invert() is {2 -> 1, 2 -> 2, 3 -> 3} --- the key 2 maps to two
different values.
In general, {d0 -> r0, d1 -> r1, ... dn -> rn}.invert() =
{r0 -> d0, r1 -> d1, ... rn -> dn}.
compose() chains together mvfs
------------------------------
compose() is an operation on an arbitrary number of mvfs; its value is
another mvf. compose() with zero arguments is not well-defined;
compose(x) = x;
compose(..., x) = compose(compose(...), x); and
compose(x, ...) = compose(x, compose(...)).
This leaves only the two-argument form of compose to define.
compose({a0 -> b0, a1 -> b1, ... an -> bn}, {c0 -> d0, c1 -> d1, ... cn -> dn})
is the mvf composed of exactly those pairs ai -> dj where bi = cj.
For example:
b = {1 -> 3, 2 -> 4, 3 -> 2};
compose(b, b) = {1 -> 2, 3 -> 4}
a.invert() = {2 -> 1, 2 -> 2, 3 -> 3}
compose(a.invert(), b) = {2 -> 3, 2 -> 4, 3 -> 2}
You may notice that this has a close relationship to the original ab
table.
A third example. For the SQL query:
select y.* from ab as x, ab as y where x.b = y.a
we can write:
compose(b, a.inverse(), ab).values()
Lists act as mvfs with one key: "None"
--------------------------------------
This one puzzled me for a long time. I had several possibilities:
- provide a .lookup() operation that takes a list of keys as an
argument and returns a list of values those keys map to
- allow the use of lists in the compose() function, but what should
[a, b, c] act like?
- {a -> a, b -> b, c -> c} (as in kjBuckets)
- {1 -> a, 2 -> b, 3 -> c}
- {None -> a, None -> b, None -> c} --- this is what I chose.
(Python's None is SQL's NULL, Lisp's NIL, or Perl's undef.)
So this allows us to write the SQL query
select * from ab where b in (4, 2)
as
compose([4, 2], b.invert(), ab).values()
paste() is the inverse of .cut()
---------------------------------
The operation paste(), like compose(), takes an arbitrary number of
mvfs as arguments and has an mvf as a value.
apply(paste, x.cut()) = x if every key in x maps to only one value.
We need paste() for some join operations. As noted in the compose()
section, .invert() and compose() suffice for some joins, like
select y.* from ab as x, ab as y where x.b = y.a
But consider
select y.b from ab as x, ab as y where x.b = y.a
If we fudge a little, we can write this as
compose(b, a.inverse(), b).values()
But this gives us a list of integers, while our previous query results
have all been lists of tuples. A list of 1-tuples is much cleaner, so
paste(compose(b, a.inverse(), b)).values()
does the right thing. This also lets us write
select x.a, y.b from ab as x, ab as y where x.b = y.a
as
paste(a, compose(b, a.inverse(), b)).values()
So far, I haven't said anything about how paste() handles the
following two conditions:
1. When a key that maps to something in some of its arguments maps to
nothing in one of them.
2. When a key maps to more than one value in one of its arguments.
This example motivates the following decisions:
1. Omit the key from the output entirely.
2. Take the Cartesian product of the sets of values the key maps to in the
different arguments; in paste()'s result, the key maps to every value in
the Cartesian product.
Note that
select x.a, y.b from ab as x, ab as y where x.b = y.a
can also be written as
paste(compose(a, b.inverse(), a), b).values()
by starting from y's row-ids instead of x's. Ideally, a program could
determine that these two expressions will give the same set of
values.
Aggregate operations
--------------------
SQL contains a number of aggregate operations: avg, count, max, min,
stddev, sum, and variance. MySQL contains an additional implicit
aggregate operation: pick one of the values, which it applies to your
data when you select some columns you didn't group by. (PostgreSQL
has this, too, if you use "select distinct on".)
I don't want to define all these aggregate operations right now, but
I'll define an operation "aggregate". aggregate operates on an mvf
that may have multiple values per key, along with some aggregate
operations, and produces an mvf that has one value per key; that value
contains a tuple of aggregate values computed on the values for that
key in the input mvf.
So aggregate({x -> 1, x -> 2, y -> 3, z -> 4}, count, sum, pick) might
produce the mvf {x -> (2, 3, 1), y -> (1, 3, 3), z -> (1, 4, 4)}.
(Here "pick" is the MySQL "pick one" operation mentioned earlier.)
.tclosure()
-----------
x.tclosure() produces the transitive closure of x: the set of pairs
produced by compose(x, x), compose(x, x, x), compose(x, x, x, x), etc.
Not useful for implementing SQL.
Operations that might be useful
-------------------------------
x.keys() to give a list of the mvf's keys. Same as x.inverse().values().
A mapping from lists to mvfs with unique keys. I think we need this if we
Union, intersection, set-subtraction. You can define intersection in
terms of subtraction (this is conventional in the relational algebra)
but I fear that expressions expressed in terms of set-subtraction may
be too hard to manipulate and optimize. We probably need these to
support both SQL and efficiently-updatable datasets.
What I haven't explained yet
----------------------------
I think these things are possible with the operations described above,
but I'm not sure yet, because I haven't worked out exactly how.
Query optimization. Caching intermediate composed results. Outer
joins. Optimized execution. Indices for cross-table joins --- normal
SQL indices just map column values to row IDs, but an mvf can map any
column in any table to any other column in any other table through
some set of joins.
select distinct/group by --- something like:
aggregate(paste(compose(resultrowids, groupbycol1),
compose(resultrowids, groupbycol2)).invert(), pick)
Changing the number of levels of indirection in data. Another
representation for ab is:
{1 -> ('a', 2), 1 -> ('b', 3), 2 -> ('a', 2), 2 -> ('b', 3),
3 -> ('a', 3), 3 -> ('b', 2)}
I think the operations outlined above suffice to convert it from this
representation to the one we've been using, and back again, but I
don't know for sure.
RDF, XML, and UnQL databases, with various kinds of graph queries.
I hope I have defined this "mvf" data type in such a way that I could
reimplement its elementary operations in a faster way (for example,
coding them in C, optimizing them with Psyco, parallelizing them on a
cluster, evaluating them lazily, and compiling sets of mvf operations
into machine code at run-time) and programs using mvfs to manipulate a
lot of data would get a lot faster. For this to happen, Python
programs using mvfs should not have to iterate over their contents;
the mvf aggregate operations must provide sufficient power to remove
this need.
What mvfs can't do, in their current form
-----------------------------------------
Everything mentioned so far is purely functional --- I haven't
mentioned changing an existing mvf. All the operations mentioned here
that take mvfs for input don't change them, although they may spit out
new fully-formed mvfs.
I haven't discussed ordering (or ordering predicates like < and >) at
all. I think they could fit in neatly --- we could define that
values() is always sorted by the keys --- but that drastically
confines possible implementations, so I don't know if that's the right
thing to do.
--
<kragen@pobox.com> Kragen Sitaker <http://www.pobox.com/~kragen/>
Edsger Wybe Dijkstra died in August of 2002. The world has lost a great
man. See http://advogato.org/person/raph/diary.html?start=252 and
http://www.kode-fu.com/geek/2002_08_04_archive.shtml for details.