"oerlap": interactive analysis of tuple-element frequencies
Kragen Sitaker
kragen@pobox.com
Mon, 21 Jan 2002 12:54:01 -0500 (EST)
I wanted to call this "colapse", because it seemed like the perfect
term, but that seems to be a widely-used word. Google finds 5560
hits, both misspellings and other things.
"oerlap" isn't quite as apt a term, but it isn't used for anything
else, except as a rare misspelling of "overlap".
This is sort of OLAPish (see http://www.olapreport.com/fasmi.htm) but
not full OLAP. (The FASMI criteria are "fast, analysis, shared,
multidimensional, information"; "fast" means "simplest analyses
under one second, most responses within five seconds, very few more
than 20 seconds"; "analysis" means end-users can program it to do
business logic, statistical analysis, and other ad hoc calculations;
"shared" means it supports reasonable security with shared
read-write access; "multidimensional" means it must provide a
multidimensional conceptual view of the data with hierarchies and
multiple hierarchies; and "information" means it handles lots of
information. oerlap is a half-assed hack at all of these.)
I often have a bunch of tabular data (tens or hundreds of thousands
of rows) that I want to explore interactively, and I don't have a
good way to do that.
I envision "oerlap": a simple UI that makes this easy. You feed it
tabular data; it presents you with a table.
Initially, the table has one row, with one cell for each field in the
input data. Each cell contains a list of the three most frequent
values in that field, with their respective numbers of occurrences.
There is an extra cell that indicates the number of input rows.
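A minimal sketch of that initial one-row view, written in present-day Python 3 rather than the Python of the listing further down. The names `summarize` and `top` are made up for illustration and are not part of oerlap.

```python
from collections import Counter

def summarize(rows, top=3):
    """Return (row_count, cells); each cell lists the `top` most
    frequent values of one field as (value, count) pairs."""
    rows = list(rows)
    if not rows:
        return 0, []
    width = max(len(row) for row in rows)
    # One independent counter per field.
    counters = [Counter() for _ in range(width)]
    for row in rows:
        for counter, value in zip(counters, row):
            counter[value] += 1
    return len(rows), [c.most_common(top) for c in counters]

rows = [('a', 1, 32), ('a', 1, 33), ('b', 1, 31), ('c', 2, 30), ('a', 0, 30)]
count, cells = summarize(rows)
# count is 5; the most frequent value of the first field is 'a', seen 3 times
```

The extra cell is just `count`; everything else is one short (value, count) list per field.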
Clicking on a cell causes the table to expand until it has one row for each
value of that field; it is sorted by the number of occurrences of those
values, so that the first few rows are the ones that represent most of the
input data records. The extra cell indicating the number of input records is
still there, but now it's an entire column, indicating the number of input
records represented by each row. The remaining un-broken-out columns are
displayed as before: each cell contains the three most frequent values for
that field, with their respective numbers of occurrences.
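Breaking out a single column can be sketched roughly like this, again in Python 3. `break_out` is a hypothetical helper, and it assumes every input row has the same number of fields.

```python
from collections import Counter, defaultdict

def break_out(rows, field, top=3):
    """Return one display row per distinct value of `field`, as
    (input_row_count, cells), sorted by descending row count."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[field]].append(row)
    table = []
    for value, members in groups.items():
        cells = []
        for i in range(len(members[0])):
            if i == field:
                cells.append(value)  # broken-out column: the value itself
            else:
                # summary column: the most frequent values within this group
                cells.append(Counter(m[i] for m in members).most_common(top))
        table.append((len(members), cells))
    table.sort(key=lambda entry: entry[0], reverse=True)
    return table

rows = [('a', 1, 32), ('a', 1, 33), ('b', 1, 31), ('c', 2, 30), ('a', 0, 30)]
table = break_out(rows, 0)
# The first display row is for 'a', which represents 3 of the 5 input rows.
```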
So each column is in one of two states, broken-out or summary; there is one
row in the displayed table for each distinct tuple of values from the
broken-out columns. Clicking on a value in a column switches it between
broken-out and summary state.
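One way to model the two column states is as a tuple of broken-out field indices, which is the same shape as the `breakoutby` argument in the listing further down. The sketch below is Python 3; `toggle` and `row_counts` are illustrative names, not part of oerlap.

```python
from collections import Counter

def toggle(broken_out, field):
    """Return a new tuple of broken-out field indices with `field`
    switched between broken-out and summary state."""
    if field in broken_out:
        return tuple(f for f in broken_out if f != field)
    return broken_out + (field,)

def row_counts(rows, broken_out):
    """Count input rows per distinct tuple of broken-out field values;
    each key corresponds to one row of the displayed table."""
    return Counter(tuple(row[f] for f in broken_out) for row in rows)

rows = [('a', 1, 32), ('a', 1, 33), ('b', 1, 31), ('c', 2, 30), ('a', 0, 30)]
state = toggle((), 0)          # break out the first field
state = toggle(state, 1)       # and the second
display = row_counts(rows, state)
# display has four keys: ('a', 1), ('b', 1), ('c', 2) and ('a', 0)
```

With nothing broken out, the only key is the empty tuple, which gives the single-row initial table.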
Clicking on a column header causes the table to be sorted by the values in
that column; by default, it's sorted by the extra column indicating
number of input rows.
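The sorting could look something like the following Python 3 sketch. It only covers sorting by the row count or by a broken-out column, since the text above doesn't say how a summary column should be ordered. `sort_rows` is a made-up name, and values within one column are assumed to be mutually comparable.

```python
def sort_rows(counts, column=None):
    """Sort (key, count) pairs. With column=None, sort by descending
    count (the default view); otherwise sort by that position of the key."""
    items = list(counts.items())
    if column is None:
        return sorted(items, key=lambda item: item[1], reverse=True)
    return sorted(items, key=lambda item: item[0][column])

# Keys are tuples of broken-out values; values are input-row counts.
counts = {('a', 1): 2, ('b', 1): 1, ('c', 2): 1, ('a', 0): 1}
by_count = sort_rows(counts)
by_second = sort_rows(counts, 1)
```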
In its current state, it only does the analysis; it doesn't provide
the sorting, HTML interface, and interactivity I envision. Maybe
soon.
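# incredibly powerful secret web log analysis tool
import string
def oerlap(datasrc, breakoutby):
    """Analyze data.
    Given a data source that yields tuples or None when .next() is called,
    and a sequence 'breakoutby' that specifies which fields of the tuples to
    break out by, count frequencies.
    """
    results = {}
    while 1:
        line = datasrc.next()
        if line is None: return results
        # one result entry per distinct tuple of broken-out field values
        key = tuple(map(lambda f, line=line: line[f], breakoutby))
        # each entry is a list with one value -> count dictionary per field
        r = results.setdefault(key, map(lambda x: {}, range(len(line))))
        # a line longer than any seen so far for this key needs more
        # dictionaries; build distinct ones, since [{}] * n would repeat
        # a single shared dictionary
        if len(r) < len(line):
            r.extend(map(lambda x: {}, range(len(line) - len(r))))
        # map(None, ...) pads the shorter sequence with None
        for dict, value in map(None, r, line):
            dict[value] = dict.get(value, 0) + 1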
class filelines:
    "Return lines from a file."
    def __init__(self, somefile):
        self.file = somefile
    def next(self):
        line = self.file.readline()
        if line == "": return None
        return tuple(map(lambda x: intern(x), string.split(line)))
class arrayitems:
    "For testing. Return tuples from an array."
    def __init__(self, somearray):
        self.array = somearray
        self.ii = 0
    def next(self):
        if self.ii == len(self.array): return None
        try: return self.array[self.ii]
        finally: self.ii = self.ii + 1
testdata = [('a', 1, 32),
            ('a', 1, 33),
            ('b', 1, 31),
            ('c', 2, 30),
            ('a', 0, 30)]
def test(bb=[]): return oerlap(arrayitems(testdata), bb)
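The listing above is Python 2 of its era (`string.split`, the builtin `intern`, `map(None, ...)`). For anyone trying it today, here is a rough Python 3 rendering of the analysis step with the test data run through it. `oerlap3` is a made-up name, and it takes any iterable of tuples instead of an object with a `.next()` method. One difference to be aware of: `zip` stops at the end of a short line, where `map(None, ...)` would count `None` for the missing fields.

```python
def oerlap3(rows, breakoutby):
    """Count value frequencies per field, grouped by the tuple of
    values found in the `breakoutby` fields."""
    results = {}
    for line in rows:
        key = tuple(line[f] for f in breakoutby)
        r = results.setdefault(key, [])
        # Give every field its own dictionary, growing the list as needed.
        while len(r) < len(line):
            r.append({})
        for counts, value in zip(r, line):
            counts[value] = counts.get(value, 0) + 1
    return results

testdata = [('a', 1, 32),
            ('a', 1, 33),
            ('b', 1, 31),
            ('c', 2, 30),
            ('a', 0, 30)]
by_first = oerlap3(testdata, [0])
# by_first[('a',)] == [{'a': 3}, {1: 2, 0: 1}, {32: 1, 33: 1, 30: 1}]
```

Breaking out by no fields at all, `oerlap3(testdata, [])`, puts everything under the empty-tuple key, which corresponds to the initial one-row table.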