my new full-text indexer

Kragen Javier Sitaker kragen at pobox.com
Thu Aug 17 22:45:19 EDT 2006


On a 1.83GHz Intel Core Duo under Mac OS X, it:
- indexes more than 100 megabytes an hour (slowing down for larger
  corpuses)
- indexes in a working set of around 32MB (sometimes much less)
- produces indexes of around 17-19% of the size of the original files
- starts producing query results in less than a second for every
  query I've tried on 400MB of mail (except for "how old is your cat
  doctor cracking", which took 2, and goes down to 570ms if you take out
  'is' and 'your')
- embodies an innovative new data structure for its index, a variant of
  the Bloom filter, a special case of Boldi and Vigna's "compact
  approximation of lattice functions"
- supports incremental reindexing
- is 100% Python

Basic usage
-----------

Save it to 'textindex.py' and make it executable.  Make sure you have
Python and Unix installed.  To create a new index:

	./textindex.py mynewindex add myfile1 myfile2 myfile3
or, once the index exists,
	mynewindex/add myfile1 myfile2 myfile3

where myfile[123] are text files you want to add to the index.  To add a
standard Berkeley mbox file to the index:

	./textindex.py mynewindex addmbox mboxfilename
or
	mynewindex/addmbox mboxfilename

To search for some words:

	./textindex.py mynewindex excerpt some words
or
	mynewindex/excerpt some words

You can replace 'excerpt' with 'lookup' to simply list the documents
(mail messages or text files) in which the words were found, or with
'cat' to spew out the entire contents of the documents.

The background story
--------------------

This is the fifth inverted-index-based search engine I've written; the
first one delegated the task of index structure management to the
filesystem (which was too slow to be useful, even when it was reiserfs),
the second one tried to delegate it to Berkeley DB (which was likewise
too slow), the third one was a Python engine I used for my email, and
the fourth one was a C version of the Python one.  I've also been
inspired by my experience with Lucene and by Dave Long's very impressive
shell-script inverted-index manager.

After my old laptop crashed, I thought I'd take another whack at the
problem of indexing my email.  This is probably a little easier for me
than for most people, because I generally read my email in fairly
primitive ways, so there's not a lot of integration headache-work to do.

My old mail indexer was written in C, and by some quirk of fate, the
laptop I'm using at the moment has lots of programming languages on it,
but no C compiler.  The previous version of that mail indexer was
written in Python, but it was painfully slow, and didn't support
incremental reindexing.  Incremental reindexing is pretty hard in my
environment anyway, because here's how I download my email:

	#!/bin/sh
	ssh panacea.canonical.org tail -400000000c \
		/var/mail/kragen \> tmp.mboxtail && \
	rsync -Pavzz panacea.canonical.org:tmp.mboxtail /Users/kragen/mail

So every time I download my email, any local changes to the mailbox are
lost, and all the messages are generally found at a different offset in
the mbox file tmp.mboxtail afterwards.  Also, tmp.mboxtail isn't
strictly an mbox file, because it generally starts in the middle of a
message, usually a large binary attachment to a virus-induced spam.

My old mail indexer had some other disadvantages; to sum up:
- it was written in C, so it was hard to run on this machine;
- it could only index one file;
- it had to reindex that file from scratch every time I ran it, which
  took 10 or 20 minutes on my old computer (although this laptop is
  about five times faster)
- in order to avoid bloating its index with nonwords from mail
  attachments, it didn't index words of more than (IIRC) 20 letters;
- it used so much memory that my machine was unusable while it was
  running;
- because it mmapped the entire mailbox file ("for simplicity and
  speed") it couldn't handle mailboxes bigger than the largest free
  section of virtual memory space; when my mailboxes approached 1GiB,
  this meant that it would only run sporadically under Linux with PaX,
  and it could never have handled mailboxes bigger than a few gigabytes
  on a 32-bit machine.
- the index it produced was a text file, so it was about 60% of the size
  of the original data.
- it was so awkward to use that I wrote shell scripts to tie the various
  parts together and do the things I usually wanted.

But it was pretty quick.  The indexing phase indexed about 5 megabytes
per second, or 18 gigabytes per hour, on my 600MHz laptop, if I remember
correctly.

This new indexer removes all of the above problems:
- it is written in Python, so it should be easy to run anywhere;
- it can index any number of files;
- it SHA-1 hashes the documents it indexes so that it can avoid
  reindexing them if they haven't changed (even if they have moved);
- it indexes everything that looks word-like;
- the indexer has a working set up to about 36MB ("RPRVT" from Mac OS X
  top) when indexing my mailbox;
- it reads the mailbox file in a conventional fashion, so it should be
  able to index mailboxes up to 2^64 bytes without trouble, or possibly
  more on a machine where Python supports bigger files than that;
- its index is binary, requiring about 1.5 bytes per posting in its
  present format, although it could be shrunk, and it has a 1MiB
  overhead.
- it's pretty easy to use on the command line.

Interesting features
--------------------

The Bloom-filter-like structure of the index has some interesting
properties.  For example, the inverted index does not contain the words
that are indexed, and indeed it is impossible to recover those words
from the index.  I anticipate using this feature to support fielded
search.

This program makes heavy use of Python generators, largely in order to
allow the first search results to be displayed before they are all
available.  It probably shouldn't do this as deeply as it does --- it
would probably have faster lookup performance and be more debuggable.

The ad-hoc compression scheme it uses for document ID lists does a
reasonable job, although not as good as huffman coding would do.  This
probably needs to be revisited and tested against Lucene and MG4J, since
the whole point of this data structure is that it takes up less space
than a straightforward inverted index.  But I think, right now, it
doesn't, actually.

It uses SHA-1 hashes to identify documents it's seen before to avoid
reindexing them again, so as to be able to handle the case of my
tmp.mboxtail and similar difficult cases.  However, it can only compute
SHA-1 over my mailbox at about 5 megabytes per second, about a quarter
of the speed that my old text indexer could build a full-text index of
it.

I considered using the same Bloom-filter-like data structure for the
"hashes" file that tells which files (or byte ranges) have which hashes;
it would have been a lot smaller.  However, when the program is updating
the index, it needs to find out which hashes were previously found in
the file it's indexing, so it can delete them if they're no longer to be
found anywhere.  Bloom filters, and this data structure as well, make it
impossible to retrieve the keys with which some value was stored.  So I
used a structure similar to what I used for my inverted index in the
last rev of the indexer.

Magic numbers
-------------

It's full of magic numbers (14 of them by my count), most of which
represent some kind of tradeoff:

- 7, 0x7f, and 0x80 in unpacknums and packnum: 7 bits of data per byte,
  one bit of framing
- 12 bytes of SHA-1 and 4 bytes of "page number": make it easy to read
  the index file in a hex dump that has 16 bytes per line, support more
  documents than anybody but Google has, and documents bigger than
  almost anybody's disk, but shave 0.1% or so off the index file size
  (which really doesn't matter) by not using all 20 bytes of the SHA-1.
- range(256) in init_hexchars: self-explanatory.
- 32768 bytes at a time in filehash(): tradeoff between working-set size
  and the overhead of executing Python bytecodes once per chunk.
- 4096 bytes at a time in words() and friends: same thing.
- 4096 bytes at a time in mboxsplit(): same thing.
- three hashes: tradeoff between false-positive rate and size, on one
  hand, and speed (of both lookup and indexing) on the other
- 32 bits (4 bytes) of SHA-1 per hash: 16 wasn't enough, and it's easier
  to decode 32 bits into a number than 24, although you could
  theoretically make arguments about upgradability
- 131072 buckets: tradeoff between index size and false positive rate
  (or either of these and number of hashes and therefore speed)
- 8 bytes (64 bits) per bucket offset ('>q'): it's easier to decode 8
  bytes than, say, 5, and I didn't think 4 gigabytes of index file would
  be enough
- max 4096 words per page: each page has a separate document_id in the
  index, and allowing any page to have a number of words that is a
  substantial fraction of the number of buckets could cause it to come
  up frequently as a false positive.  4096 --- 1/32 of the number of
  buckets --- keeps the false positive rate, per document, to 1/(32^3),
  or 1/32768, which is good enough as long as your number of
  documents isn't a large multiple of 32768.  The downside of a small
  number of words per page is that it makes the index bigger and result
  sets larger.
- 71 bytes of excerpt context: to fit after a standard 8-space tab in a
  standard 80-column punched card^W^WVT100 CRT^W^Wterminal window,
  without wrapping.
- 2 097 152 words indexed between checkpoints when adding a mailbox
  file: tradeoff between the wasted time writing new index files
  (containing mostly the contents of the old index file) and working set
  size.
- 4096 bytes at a time for copying indexed documents to the output in
  the 'cat' command: tradeoff between working set size and overhead of
  Python bytecodes

Why not Microsoft Windows
-------------------------

For reliability, the indexer depends on the ability to make an atomic
update to the filesystem by renaming a file over the top of an existing
file, thus atomically replacing the original file.  Microsoft Windows
does not support this operation.

BUGS
----

This indexer still shares a few limitations with my previous one:
- there's no fielded search (yet --- should be much easier and more
  efficient to add to this one than to my previous attempts)
- there's no relevancy ranking
- there's no good summary of each search result (beyond its filename,
  byte offsets, and excerpts containing the search terms) along the
  lines of the title of an HTML document, or the subject, sender, and
  date on a mail message
- it doesn't understand HTML, MIME multipart, base64, quoted-printable,
  PDF, PostScript, TeX, any programming languages, etc.  Its
  understanding of mbox files consists of this: that they are divided
  into messages.

And it has a few new limitations of its own:
- it indexes slowly.  200 megabytes an hour on this machine is maybe 200
  times slower than my previous indexer.  It will probably get several
  times faster when I add support for multiple index files (at lookup
  time) and merging index files.
- it searches slowly.  It often takes several seconds to find all the
  documents that satisfy a query, although it is usually comes up with
  some of them within less than a second.
- it displays the earliest-added documents first; in a mailbox file,
  these will be the oldest.
- it loads certain information about an index at startup time; as the
  index grows, this can make command-line usage slow.
- the novel index structure keeps the index small, and it will be even
  smaller if I get a little bit smarter about compression, but it has
  some inherent disadvantages of its own:
  - it's difficult (prohibitively expensive, I think) to support
    sub-word queries ("donk*"), which has important implications for
    incremental UI feedback
  - because the index file doesn't actually contain the indexed terms,
    it's generally impossible to translate an index file when you change
    some of the parameters of the index file structure; you have to
    rebuild it from scratch.  Parameters I've changed that had this
    problem in the past included hash algorithm, number of hashes; it's
    possible to adjust page size and number of buckets and adapt the old
    index, but generally the previously indexed documents won't get the
    benefit of the changes.
  - it's not really possible to do negative queries: documents
    containing "beatrice" but not "murch", for example, because the
    index may produce false positives for "murch".  You must inspect
    each document to make sure.
- there's no support for negative queries

Also:
- I haven't tried it on any data sets over 400MB, partly because it's so slow
  to index; I don't know if it might get slow to retrieve as well.
- if you have a mailbox file and then you use 'add' on it, it throws
  away all the information it had about the individual messages in favor
  of index entries for the file as a whole
- I just added code to remove obsolete hash hints, so that the hash
  hints file doesn't grow every time I reindex my mailbox after it has
  changed.  I don't know how well tested that code is.
- Because the index file can't be updated in place, and because there's
  only one of them, adding a single posting to a 100MB index requires
  writing a new 100MB index, which takes many minutes.  This makes it
  O(N^2) indexing large mailboxes (although that's probably fixable with
  a tweak.)  Lucene-like, this program should support multiple index
  files, merging them when it's convenient.
- the excerpting is kind of lame
- there's no output mode analogous to 'grep' or 'grep -l'

The code of textindex.py
------------------------

#!/usr/bin/python
# vim:tw=80:sw=4:ai:et:ts=4
# Full-text indexing hack to see if my hash inverted indexing idea plays out.
"""Full-text indexer for text files using hash inverted indexing.

The unusual thing about this indexer is that it doesn't store the actual
words the postings are for, only their hashes.  It's somewhat similar to
a Bloom filter; we hash each word with two different hash functions and
stick the corresponding docid into the resulting hash buckets.  By
intersecting the lists of docids, we can achieve a reasonably small
false-positive rate.

I've only tried it on small datasets so far, because that's all I have
handy.

There are a few software artifacts in here that should have broad
applicability, although I've left them in this same file to facilitate
easy distribution:

PeekableIterator: an iterator wrapper that simplifies certain algorithms
Struct: a better interface to the struct module
Subfile: a file-like interface for a file byte range (cf. email._Subfile)
bin_to_hex, hex_to_bin: these aren't already in the stdlib?!
filehash, sha1: returns SHA-1 of file contents, or a string, respectively
merge_intersection, merge_union: incremental set operations on sorted iters
ignoring_enoent: run some code ignoring ENOENT
ITAMReader, ITAMWriter: fast sorted on-disk dictionaries stored in text files

TODO: (D means "done")
D hey, I bet we never sort the iterators of FileByteRanges, so their
  merging could produce wrong results, although I've been lucky so far
  because I tend to add things in sorted order; fix that!
D fix problem introduced by that bug fix: far-apart terms won't
  intersect properly
D fix problem with hash hints file being grossly inefficient at load
  time --- this is the last killer speed problem!
  D fix ITAM to be usable for hash hints
    D add doc strings :)
    D make it do multimap stuff:
      D go back and look slightly earlier in headwords file
    D make it do tuples
    D make it do atomic updates
  D make HashHints use ITAM
    D move ITAM into textindex.py
      X augment HashHints tests to ensure that hints aren't loaded?
        X perhaps measure load time for HashHints object?
    D use ITAM file for HashHints
    D don't load ITAM file for lookups
D fix problem with endlessly growing hash hints file
- post to kragen-hacks again at this point
- create multiple index files, merging them later if need be, rather
  than just one index file.
- add logging of requests to support analyses like fraction of false
  positives
- prefer looking at one short posting list (and the documents) over one
  short and one very long (maybe?)
- clean up excerpting so it's not so inefficient (merge with ensure
  pass); also make it produce saner output
- handle non-text files (e.g. PDFs)
- have some way to reindex a file only if it has changed (perhaps the
  current SHA-1 check will suffice?)
- have some way to reindex all the files that have changed
- add fielded search
- better summaries for email messages
- rewrite in Squeak

"""
import os, sys, sets, StringIO, struct, cgitb, errno, sha, urllib, re, stat
import time
# On speed, well, it's not that fast at indexing, but the index isn't
# that big, either, and lookups are reasonably fast again, beginning to
# provide results to the questions I've posed in less than a second.
# Indexing my Safari cache directory (16232kiB in 1643 files) 500 files at a
# time took only:
# real    4m47.684s
# user    4m28.884s
# sys     0m15.704s
# (newly using 16% more time than before because of using three hash
# functions instead of two, and 7% more time than that due to ITAM)
# and all 1643 files at once:
# real    2m19.048s
# user    2m6.978s
# sys     0m5.702s
# (18% more time, or 15% less speed, because of using three hash
# functions instead of two; another 9% more time due to using a more
# complicated way of keeping track of the file hashes, which is
# nevertheless faster on lookups.)
# This is all on Beatrice's 1.83 GHz Intel Core Duo Mac.  This is X
# seconds.  That's X files per second or X kiB per second, or perhaps
# 30kiB per second on my old 600MHz laptop, roughly 200 times slower
# than the text-file mailbox indexer I've been using.  The resulting
# index was 3700kiB, 23% of the size of the original data, and much of
# that is the fixed 1MiB overhead of the hash bucket pointers.  That's a
# lot smaller than my text-file index.  Most of the hash buckets are
# still empty.

# Back when I was using only two hash functions instead of three, it
# indexed 300M of email in
# real    121m13.152s
# user    118m4.057s
# sys     1m30.127s
# which is painfully slow but 45% faster than before.  Rerunning the
# same index command was pretty quick to recognize that it didn't need
# to index anything, but it took 5 minutes to write the index
# repeatedly, a bug that has since been fixed.

# Writing a new index is really slow, too.  Up to several minutes.  One
# solution is to write the index-mangling code in C so as to be able to
# do it in a few minutes; maybe a better solution is to support multiple
# index segments, like Lucene, and merge them when there are several
# large index segments of around the same size.

# A little profiling of the self-test says:
#    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
#    262157   11.220    0.000   32.080    0.000 textindex.py:345(binbucket)
#         5   11.110    2.222   87.680   17.536 textindex.py:382(write_index)
#    524310   10.750    0.000   19.100    0.000 textindex.py:119(read)
#    327701    9.440    0.000   46.890    0.000 textindex.py:367(read_bucket)
#    327713    7.140    0.000   54.030    0.000 textindex.py:223(merge_union)
# but this is of dubious value, because the self-test normally runs in
# 24 seconds total, and the indices in the test are nearly empty.

### Serialization:

def unpacknums(strbuf):
    """Turns a binary string into an iterator of unsigned integers.

    Partial inverse of packnum.

    """
    rv = 0
    for c in strbuf:
        n = ord(c)
        rv = (rv << 7) + (n & 0x7f)
        if n & 0x80:  # high bit set: terminates
            yield rv
            rv = 0
    assert rv == 0

def packnum(n):
    """Turns an unsigned integer into a self-terminating binary string.

    Partial inverse of unpacknums; concatenate packnum results without
    terminators.

    """
    assert n >= 0   # won't terminate otherwise
    rv = []
    while 1:  # at least one byte
        rv.append(n & 0x7f)
        n >>= 7
        if n == 0: break
    rv[0] |= 0x80
    rv.reverse()
    return ''.join(map(chr, rv))

def delta_decode(nums):
    """Turns a list of numbers into a list of partial sums.

    Inverse of delta_encode.
    
    """
    n = 0
    for num in nums:
        n += num
        yield n

def delta_encode(nums):
    """Turns a list of partial sums into a list of numbers.

    Or, to look at it another way, turns a list of numbers into a list of
    differences between them.  Inverse of delta_decode.
    
    """
    last = 0
    for num in nums:
        yield num - last
        last = num

class Struct:
    """A more convenient interface to the struct module."""
    def __init__(self, fmt): self.fmt = fmt
    def calcsize(self): 
        """See struct.calcsize for documentation."""
        return struct.calcsize(self.fmt)
    def unpack(self, data): 
        """See struct.unpack for documentation."""
        return struct.unpack(self.fmt, data)
    def pack(self, *data): 
        """See struct.pack for documentation."""
        return struct.pack(self.fmt, *data)
    # for file-like objects:
    def read(self, fo):
        """Read this data structure from a file-like object and unpack."""
        data = fo.read(self.calcsize())
        if not data: raise EOFError(fo)
        return self.unpack(data)
    def write(self, fo, *data):
        """Write this data to this file-like object."""
        fo.write(self.pack(*data))

### tests for serialization stuff
def ok(a, b):
    """From Perl's Test::Simple IIRC.  Verbosely assert a == b."""
    assert a == b, (a, b)
def test_lists():
    """Unit tests for delta_encode, delta_decode, packnum, unpacknums."""
    def testlist(mylist):
        """Small utility function for testing delta_encode, unpacknums, etc."""
        ok(list(delta_decode(delta_encode(mylist))), mylist)
        ok(list(unpacknums(''.join(map(packnum, mylist)))), mylist)
    testlist([3, 5, 6, 10, 215])
    testlist([0, 1, 2])
    testlist([0, 0, 1, 2])
    tmp = Struct('>q')
    ok(tmp.calcsize(), 8)
    ok(tmp.unpack(tmp.pack(1380)), (1380,))
    ok(tmp.read(StringIO.StringIO(tmp.pack(1380))), (1380,))
test_lists()

### Generic file management
def ignoring_enoent(thunk):
    """Run a thunk and ignore IOError with ENOENT.

    Returns None in case of ENOENT, otherwise the return value of the
    thunk.  Propagates any exceptions other than IOError, and any
    IOError for a reason other than ENOENT.
    """
    try: return thunk()
    except IOError, e: 
        if e.errno == errno.ENOENT: return
        else: raise

### docid management

page_no_representation = Struct('>L')
def make_doc_hash(sha1, page_no = 0):
    """Given a SHA-1 hash and a 'page number', return a 16-byte 'hash'.

    The doc ID store (DocIds, .doc_id_store) uses this 'hash'; HashHints
    just wants the first 12 bytes.

    10 bytes gives birthday collisions prob. 50% at 2^48 documents, and
    at 2^38 documents, probability is very small; 2^38 documents is an
    exabyte or a little less.

    Each 'page' is 4096 words; the first page is 0; large documents are
    divided into pages to make the Bloom-filter-like data structure work
    better.
    """
    return sha1[:12] + page_no_representation.pack(page_no)

def parse_doc_hash(ahash):
    """Given a 16-byte hash, split into document uniquer and page no."""
    return ahash[:12], page_no_representation.unpack(ahash[12:])[0]

class DocIds:
    """Interface to the list of SHA-1 hashes* stored in the index file.

    * actually we only store the first 12 bytes of the SHA-1, plus four
    bytes of "page number".
    
    To keep the index small, we don't store filenames in the hash
    buckets of the index; instead, we store indices into this list.  We
    only ever append to this list.  There's another file that maps SHA-1
    hashes to filenames and offsets.
    """
    hashsize = 16
    def __init__(self, fo):
        """fo is a file-like object from which the hashes can be read."""
        self.hashes = []
        while 1:  # read to EOF
            next_hash = fo.read(self.hashsize)
            if not next_hash: return
            self.hashes.append(next_hash)
    def __getitem__(self, ahash):
        """Return the doc_id for the given binhash, or raise KeyError."""
        assert len(ahash) == self.hashsize
        try: return self.hashes.index(ahash)
        except ValueError: raise KeyError(ahash)
    def add_hash(self, ahash):
        """Add a new hash to the document ID database, returning new doc_id."""
        assert ahash not in self
        self.hashes.append(ahash)
        return self[ahash]  # infinite recursion here would suggest a bug
    def __contains__(self, ahash):
        """Test for the existence of a hash in the list."""
        return ahash in self.hashes
    def hash(self, doc_id):
        """Return the 16-byte binary hash for the given doc_id."""
        return self.hashes[doc_id]
    def write(self, fo):
        """Write a representation that can be read back."""
        for ahash in self.hashes:
            assert len(ahash) == self.hashsize, (len(ahash), ahash)
            fo.write(ahash)
    def remove(self, doc_id):
        """Forget the hash for document number doc_id."""
        self.hashes[doc_id] = 'deleted document'

### File byte range stuff

class Subfile:
    """Similar to email._Subfile: a file-like object for part of a 'r' file.
    
    The underlying file-like object must support .seek(pos) and
    .read(length), both of which need to speak bytes.  Supports pretty
    much exactly the same protocol.

    If email._Subfile were documented, I probably wouldn't have written
    this.
    """
    def __init__(self, fo, start, end):
        """fo: the file-like object to represent a part of.
        start: the offset where this subfile should begin.
        end: the offset where this subfile should end.
        """
        self.fo, self.start, self.end = fo, start, end
        self.pos = start
    def read(self, length):
        """Read specified number of bytes.

        Will likely affect current file position in underlying file object.
        """
        self.fo.seek(self.pos)
        length = min(self.end - self.pos, length)
        rv = self.fo.read(length)
        self.pos += len(rv)
        return rv
    def seek(self, pos):
        """Similar to file.seek, but doesn't support "whence".
        
        Sets current position to "pos" bytes from the start of the
        subfile.
        """
        self.pos = self.start + pos
    def tell(self): 
        """Similar to file.tell: current pos in bytes from subfile start."""
        return self.pos - self.start

class FileByteRange(object):
    """Similar to Subfile, but corresponds to a filename, not an open file."""
    # In retrospect this may not have been such a hot idea: a few lines
    # of code were simplified slightly, and interactive use is a little
    # easier, in exchange for the added complexity of 23 more lines of code.
    __slots__ = ['fname', 'start', 'end']
    def __init__(self, fname, start, end):
        """start and end are byte offsets in the file named fname."""
        self.fname, self.start, self.end = fname, start, end
    def open(self):
        """Return a Subfile."""
        return Subfile(file(self.fname), self.start, self.end)
    def __repr__(self): 
        return 'FileByteRange(fname=%r, start=%r, end=%r)' % self.as_tuple()
    def __str__(self): return '%s %d:%d' % self.as_tuple()
    def as_tuple(self): return self.fname, self.start, self.end
    def __cmp__(self, other):
        return (cmp(type(self), type(other)) or 
                cmp(self.as_tuple(), other.as_tuple()))
    def __hash__(self): return hash(self.as_tuple()) ^ 1
    def __iter__(self): 
        """Unpack like a (fname, start, end) tuple."""
        return iter(self.as_tuple())

### ITAM files --- used for the hash hints file
# These are like ISAM files, but are text and not updatable in place, although
# they are atomically updatable.
# The strategy for atomic updates is as follows:
# The basic ITAM file itself gets atomically renamed into place; it may have a
# headwords file, also an ITAM file.
# Each ITAM file begins with "ITAM; headwords would be in:
# foo.32021.headwords\n", where 32021 is a somewhat random number that will
# hopefully be different for each update.  Then, we write all our headwords
# files first, then rename ourselves into place, then delete all the old
# headwords files.   Anyone who started reading the old file would use the old
# headwords files, although there's a race condition where they might get
# deleted first --- which could result in the other guy's lookup taking a long
# time.  This is undesirable, but I think it's acceptable.

class ITAMWriter:
    """Indexed text access method, writer class.

    Stores a string->(tuple of strings) multimap in a sorted text file
    (with potentially some other files indexing it) for quick retrieval.
    Not updatable; you have to write the whole file again if you want to
    update it.
    """
    def __init__(self, fname, skipsize=10240):
        """fname is the base filename to store data in.
        skipsize is the maximum number of bytes for the reader to have to
            read from any level of the hierarchy.  Generally this should
            be around the number of bytes that the reader can decode in
            one disk seek time; if it's too small, there will be lots of
            levels of hierarchy and the disk will spend lots of time
            seeking, but it being small enough is what keeps us from
            spending lots of CPU time on the problem.  In the tests,
            skipsize=10240 lets the primary file get to 20 megabytes
            with only three levels of hierarchy.
        """
        self.fname = fname
        self.skipsize = skipsize
        self.headwords = None
        year = 86400 * 365
        self.unique_id = '%s.%s' % (int(time.time() * 100) % year, os.getpid())
        self.outfile = file(fname + '.new', 'w')
        self.written_bytes = 0
        self.last_bytes_in_headwords = 0
        self.prev_key = ''
        self.write_bytes('ITAM; headwords would be in: %s\n' %
                         self.headwords_fname())
        self.files_to_delete = []
        try: reader = ITAMReader(self.fname)
        except: pass
        else: self.files_to_delete = reader.all_fnames()
    def headwords_fname(self):
        """The name for my headwords file."""
        return '%s.%s.headwords' % (self.fname, self.unique_id)
    def encode(self, key, values):
        """Return a string to represent key->values in the file."""
        parts = map(urllib.quote, [key] + list(values))
        for item in parts:
            assert '\n' not in item
            assert ' ' not in item
        return '%s\n' % ' '.join(parts)
    def open_headwords(self):
        """Start writing to a headwords file.

        This isn't done in __init__ because we only need a headwords
        file if we write more than self.skipsize bytes of data.

        """
        self.headwords = ITAMWriter(self.headwords_fname())
    def write_bytes(self, outstr):
        """Write some bytes to the outfile."""
        self.written_bytes += len(outstr)
        self.outfile.write(outstr)
    def write(self, key, *values):
        """Put a key->(values) pair into the file.

        Needs to be called in order on all the keys.
        """
        assert key >= self.prev_key, (key, self.prev_key)
        self.prev_key = key
        old_bytes = self.written_bytes
        self.write_bytes(self.encode(key, values))
        if self.written_bytes - self.last_bytes_in_headwords > self.skipsize:
            if not self.headwords: self.open_headwords()
            self.headwords.write(key, str(old_bytes))
            self.last_bytes_in_headwords = old_bytes
    def close(self):
        """Commit the new version of the file."""
        # XXX I wish I knew how to make this atomic.  Using either a new
        # headwords file with the old main file or vice versa could
        # undetectably give bad results.
        # XXX I know how.  Make up funky names for the headwords file,
        # and put them into the main file.  Then we just have to
        # manually delete the old headwords files when we get done
        # writing.  Simultaneous updates will lose one of the new sets
        # of updates and will leave small trash files lying around, but
        # won't lose both sets of updates.
        self.outfile.close()
        os.rename(self.fname + '.new', self.fname)
        if self.headwords: self.headwords.close()
        self.outfile = None
        for fname in self.files_to_delete:
            if os.path.exists(fname) and fname != self.fname: os.unlink(fname)

class ITAMReader:
    """Indexed text access method, reader.

    Retrieves items from files written by ITAMWriter, either by key or
    by iterating over the whole file.
    """
    def __init__(self, fname):
        """fname is the name of the file containing the data."""
        self.fname = fname
        self.infile = file(self.fname)

        firstline = self.infile.readline()
        magic = 'ITAM; headwords would be in: '
        assert firstline.startswith(magic), firstline
        self._headwords_fname = firstline[len(magic):-1]

        self.tuples_start_at = len(firstline)

        if os.path.exists(self.headwords_fname()): 
            self.headwords = ITAMReader(self.headwords_fname())
        else: self.headwords = None
        self.decodes = 0
    def headwords_fname(self):
        """The filename under which our headwords are stored."""
        return self._headwords_fname
    def all_fnames(self):
        """All the filenames storing information --- for deletion."""
        if self.headwords: return [self.fname] + self.headwords.all_fnames()
        else: return [self.fname]
    def decode(self, line):
        """Decode a line read from the file into a tuple, including key."""
        assert line.endswith('\n')
        rv = tuple(map(urllib.unquote, line[:-1].split(' ')))
        self.decodes += 1
        return rv
    def find(self, key):
        """Find the beginning of 'key' in the file.

        Returns the last tuple in the file less than the key, the first
        tuple in the file after that (or a dummy tuple at EOF), and an
        iterator over any remaining tuples.

        Not really intended for external use; interface is too awkward.
        You probably want .values_for_key().

        The iterator is ready to iterate over the tuples with the key,
        if any, and then the tuples with greater keys.
        """
        pos = self.tuples_start_at
        if self.headwords:
            headtuple, _, _ = self.headwords.find(key)
            assert headtuple[0] < key or headtuple[0] == ''
            if len(headtuple) > 1:   # headtuple may be ('',)
                pos = int(headtuple[1])  
            else: pos = self.tuples_start_at
        prev_tuple = ('',)
        iterator = self.iterate_from_offset(pos)
        for cur_tuple in iterator:
            found_key = cur_tuple[0]
            if found_key >= key: return prev_tuple, cur_tuple, iterator
            prev_tuple = cur_tuple
            assert prev_tuple[0] <= key, (prev_tuple, key)
        return prev_tuple, ('',), iterator  # it's beyond EOF
    def __getitem__(self, key): 
        """Return a tuple from the file with key 'key', or raise KeyError."""
        found_tuple, result_tuple, iterator = self.find(key)
        if result_tuple[0] == key: return result_tuple[1:]
        else: raise KeyError(key)
    def values_for_key(self, key):
        """Iterate over the tuples attached to 'key'."""
        _, first_tuple, iterator = self.find(key)
        if first_tuple[0] != key: return
        yield first_tuple[1:]
        for each_tuple in iterator:
            if each_tuple[0] != key: return
            yield each_tuple[1:]
    def iterate_from_offset(self, byte_offset):
        """Iterate over the tuples in the file, starting at specified offset.

        This is not really intended for external use.
        """
        while 1:
            self.infile.seek(byte_offset)
            line = self.infile.readline()
            byte_offset += len(line)
            if not line: return
            yield self.decode(line)
    def __iter__(self):
        """Iterate over all tuples in the file, in sorted order."""
        return self.iterate_from_offset(self.tuples_start_at)
    def total_decodes(self):
        """Return the total number of tuples decoded by this reader.

        Includes tuples decoded in the headwords files.
        """
        if self.headwords: 
            return self.headwords.total_decodes() + self.decodes
        else: return self.decodes

def filesize(fname):
    """Return the number of bytes in the named file."""
    return os.stat(fname)[stat.ST_SIZE]

def assert_ioerror(thunk):
    """Assertion failure if thunk doesn't raise IOError when run."""
    try: x = thunk()
    except IOError: return
    else: assert 0, x

def test_itam():
    """Very slow unit tests (30s) for ITAM."""
    fname = 'tmp.itam.%s' % os.getpid()
    try:
        ### basic tests
        assert_ioerror(lambda: ITAMReader(fname))

        writer = ITAMWriter(fname)
        assert_ioerror(lambda: ITAMReader(fname))

        for ii in range(1000):
            writer.write('%04d' % ii, '%s%d' % ('x' * 100, ii))
        writer.write('1000', 'has', 'three', 'tuple items')
        assert_ioerror(lambda: ITAMReader(fname))

        writer.close()
        reader = ITAMReader(fname)

        val32 = 'x' * 100 + '32'
        ok(reader['0032'], (val32,))

        val31 = 'x' * 100 + '31'
        find_rv = ('0031', val31), ('0032', val32)
        ok(reader.find('0032')[0:2], find_rv)
        ok(reader.find('00320')[0:2], 
            (('0032', val32), ('0033', 'x' * 100 + '33')))
        ok(reader.find('003a')[0], ('0039', 'x' * 100 + '39'))
        ok(reader['0032'], (val32,))
        ok(reader['0999'], ('x' * 100 + '999',))
        ok(reader['1000'], ('has', 'three', 'tuple items'))
        assert 105000 <= filesize(fname) <= 210000, filesize(fname)
        hwsize = filesize(reader.headwords_fname())
        assert 30 <= hwsize <= 300, hwsize
        assert not os.path.exists(reader.headwords.headwords_fname())

        x = iter(reader)
        ok(x.next(), ('0000', 'x' * 100 + '0'))
        ok(x.next(), ('0001', 'x' * 100 + '1'))
        ok(x.next(), ('0002', 'x' * 100 + '2'))
        ok(len(list(x)), 998)

        ### test with big files

        writer = ITAMWriter(fname)
        for ii in range(200000):
            writer.write('%06d' % ii, '%s%d' % ('y' * 100, ii))
        writer.close()
        assert 21000000 <= filesize(fname) <= 42000000, filesize(fname)
        hwsize = filesize(writer.headwords_fname())
        assert 10000 <= hwsize <= 40000, hwsize
        assert filesize(writer.headwords.headwords_fname()) < 10000

        ### efficiency (if not speed) test
        for x in [3, 184, 121203, 123456, 199121, 35005, 135243, 108323,
                  24262, 170322]:
            #start_time = time.time()
            reader = ITAMReader(fname)
            ok(reader['%06d' % x], ('y' * 100 + str(x),))
            #print "time used", time.time() - start_time, reader.total_decodes()
            # absolute maximum should be 1 in .hw.hw, 10240/(6+1+5+1) +
            # 1 = 788 in .hw, and 10240/(6+1+100+1) + 1 = 95 in main
            # file, total 884.  Worst I've seen is actually 751 (for
            # 121203), and that took 19 ms, which was also the longest time.
            assert reader.total_decodes() < 900, reader.total_decodes()

        ### multiple values for one key

        writer = ITAMWriter(fname)
        for ii in range(1000):  # make the file big so we have to use headwords
            writer.write('%03d' % ii, 'x' * 100)
        writer.write('beatrice', 'fact')
        writer.write('beatrice', 'internet archive')
        writer.write('ben', 'bittorrent')
        writer.write('kragen', 'airwave')
        writer.write('kragen', 'commercenet')
        writer.write('marilyn', 'mvchd')
        writer.close()

        # if there were a .headwords.headwords file left over from
        # earlier, it could potentially cause trouble
        assert not os.path.exists(writer.headwords.headwords_fname())

        reader = ITAMReader(fname)
        ok(list(reader.values_for_key('beatrice')), 
           [('fact',), ('internet archive',)])
        ok(list(reader.values_for_key('kragen')),
           [('airwave',), ('commercenet',)])
        ok(list(reader.values_for_key('ben')), [('bittorrent',)])
    finally:
        for base_fname in reader.all_fnames():
            for fname in [base_fname, base_fname + '.new']:
                if os.path.exists(fname): os.unlink(fname)
 
### managing the hash hints file, which tells us what files have which hash

# There must be an existing way to do this in the standard library!
char_to_hex = {}
hex_to_char = {}
def init_hexchars():
    """Runs at import time to initialize char_to_hex and hex_to_char."""
    for ii in range(256): 
        hexval = '%02x' % ii
        char_to_hex[chr(ii)] = hexval
        hex_to_char[hexval] = chr(ii)
init_hexchars()

def bin_to_hex(astring):
    """Convert a binary string to hexadecimal."""
    return ''.join(map(char_to_hex.get, astring))
def hex_to_bin(hexstring): 
    """Convert a hexadecimal string to binary."""
    return ''.join(map(hex_to_char.get, re.findall('..', hexstring)))

ok(bin_to_hex('\r\n'), '0d0a')
ok(hex_to_bin(bin_to_hex('we are here')), 'we are here')

class HashHints:
    """Keeps track of where to find data with certain hashes.

    Stores its data in a text file on the disk: hash, filename, start
    and end offsets.  "Hints" because it's possibly out of date.
    """
    hashsize = 12  # only the first 12 bytes, not the page number
    def __init__(self, fname):
        """fname: the name of the file that contains the data."""
        self.fname = fname
        self.hints = None
        self.current_hash = None
    def read_file(self):
        """Initialize state from contents of self.fname."""
        start_time = time.time()
        self.hints = {}
        self.current_hash = {}
        for ahash, start, end, fname in ITAMReader(self.fname):
            br = FileByteRange(fname, int(start), int(end))
            self.add_hash(hex_to_bin(ahash), br)
        #print 'wasted', time.time() - start_time
    def read_file_if_necessary(self):
        """Read the whole file from disk if we haven't already."""
        if self.hints is None: ignoring_enoent(self.read_file)
    def add_hash(self, ahash, byte_range):
        """Call to indicate the specified byte range has hash 'ahash'."""
        # we don't write to the file until .flush()
        self.read_file_if_necessary()
        assert len(ahash) == self.hashsize
        self.hints.setdefault(ahash, []).append(byte_range)
        filename, start, end = byte_range
        try: oldhash = self.current_hash[filename][start, end]
        except KeyError: pass
        else: self.hints[oldhash].remove(byte_range)
        self.current_hash.setdefault(filename, {})[start, end] = ahash
    def __getitem__(self, ahash):
        """Return which filename,start,end triplets had hash 'ahash'."""
        assert len(ahash) == self.hashsize
        if self.hints is not None:
            return self.hints.get(ahash, [])
        else:
            rv = ignoring_enoent(lambda: self.read_from_file(ahash))
            if rv is None: return []
            else: return rv
    def read_from_file(self, ahash):
        """Read relevant data from underlying data file.
        
        May raise IOError, and won't return anything from memory.  Not public.
        """
        rv = []
        reader = ITAMReader(self.fname)
        for start, end, fname in reader.values_for_key(bin_to_hex(ahash)):
            rv.append(FileByteRange(fname, int(start), int(end)))
        return rv
    def flush(self):
        """Atomically update the file on the filesystem."""
        self.read_file_if_necessary()
        outfo = ITAMWriter(self.fname)
        keys = self.hints.keys()
        keys.sort()
        for key in keys:
            hexkey = bin_to_hex(key)  # I need some Allen screws...
            for filename, start, end in self.hints[key]:
                outfo.write(hexkey, str(start), str(end), filename)
        outfo.close()
        self.hints = None
        self.current_hash = None
    def hashes(self, fname):
        """Return the hashes found somewhere in fname, for deindexing purposes.
        """
        self.read_file_if_necessary()
        return self.current_hash.get(fname, {}).values()
    def hash_no_longer_in_file(self, filename, ahash):
        self.read_file_if_necessary()
        new_hints = []
        for byte_range in self.hints[ahash]:
            if byte_range.fname != filename:
                new_hints.append(byte_range)
            else:
                range = (byte_range.start, byte_range.end)
                del self.current_hash[byte_range.fname][range]
        self.hints[ahash] = new_hints

def filehash(fo, chunksize=32768):
    """Return the binary SHA-1 hash of the data in the file-like object fo."""
    ho = sha.sha()
    while 1:
        data = fo.read(chunksize)
        if not data: break
        ho.update(data)
    return ho.digest()

def sha1(astr):
    """Return the binary SHA-1 of the string."""
    return sha.sha(astr).digest()

def test_hash_hints():
    """Unit tests for HashHints."""
    ok(filehash(StringIO.StringIO('hello' * 10000)), sha1('hello' * 10000))
    fname = 'tmp.hash_hints.%s' % os.getpid()
    hh = HashHints(fname)
    try:
        a = '\n' * 12  # binary hash --- this test verifies it's stored
                       # safely, e.g. in hexadecimal
        range_a = FileByteRange(fname='somefile', start=0, end=100)
        hh.add_hash(a, range_a)
        ok(hh[a], [range_a])

        hh = HashHints(fname)
        ok(hh[a], [])  # didn't flush
        hh.add_hash(a, range_a)
        ok(hh[a], [range_a])
        hh.flush()
        ok(hh[a], [range_a])

        # reload from the file
        hh = HashHints(fname)
        ok(hh[a], [range_a])
        range_a2 = FileByteRange(fname='someotherfile', start=5, end=105)
        hh.add_hash(a, range_a2)
        ok(hh[a], [range_a, range_a2])
        hh.flush()
        ok(hh[a], [range_a, range_a2])
        ok(hh.hashes('somefile'), [a])

        # hash_no_longer_in_file
        hh.hash_no_longer_in_file('someotherfile', a)
        ok(hh[a], [range_a])
        assert a in hh.hashes('somefile')
        assert a not in hh.hashes('someotherfile')
        hh.flush()
        ok(hh[a], [range_a])
        assert a in hh.hashes('somefile')
        assert a not in hh.hashes('someotherfile')

        # filename encoding
        b = 'b' * 12
        uglyfname = 'a,b=c d\ne!f g'
        uglyrange = FileByteRange(fname=uglyfname, start=0, end=100)
        hh.add_hash(b, uglyrange)
        hh.flush()
        ok(hh[b], [uglyrange])

        hh = HashHints(fname)
        ok(hh[b], [uglyrange])

        # a file should only be able to have one hash at a time for the same
        # byte range:
        c = 'c' * 12
        hh.add_hash(c, uglyrange)
        hh.flush()
        ok(hh[c], [uglyrange])
        ok(hh[b], [])

        hh = HashHints(fname)
        ok(hh[c], [uglyrange])
        ok(hh[b], [])
        ok(hh.hashes(uglyfname), [c])
        # yet should be able to have different hashes for different byte ranges:
        uglyrange2 = FileByteRange(fname=uglyfname, start=100, end=200)
        hh.add_hash(b, uglyrange2)
        ok(hh[c], [uglyrange])
        ok(hh[b], [uglyrange2])
        hh.flush()
        ok(hh[c], [uglyrange])
        ok(hh[b], [uglyrange2])
        hh = HashHints(fname)
        ok(hh[c], [uglyrange])
        ok(hh[b], [uglyrange2])
        assert c in hh.hashes(uglyfname)
        assert b in hh.hashes(uglyfname)
    finally:
        if os.path.exists(fname): 
            for fname in ITAMReader(fname).all_fnames():
                if os.path.exists(fname): os.remove(fname)
                if os.path.exists(fname + '.new'): os.remove(fname + '.new')
        if os.path.exists(fname + '.new'): os.remove(fname + '.new')

### word extraction

def _words(fo, bufsiz=4096):
    """Iterate over the words in a file.
    
    For each word found, yields a byte offset into the file and a match
    object whose offsets are measured from that byte offset.  This weird
    return value is intended to support two kinds of clients without doing
    unnecessary work: clients who want the byte offset of the word in
    the file, and clients who just want the word.
    """
    word_re = re.compile(r'[\w\x80-\xff]+')
    buf = ''
    start = 0
    while 1:  # until EOF actually
        data = fo.read(bufsiz)
        if not data:
            for mo in word_re.finditer(buf): yield start, mo
            return
        buf += data
        mos = list(word_re.finditer(buf))
        end = len(buf)
        if mos and mos[-1].end(0) == len(buf): 
            mo = mos.pop()  # don't index partial word, but
            end = mo.start(0)  # don't discard it either
        for mo in mos: yield start, mo
        start += end
        buf = buf[end:]

def words(fo, bufsiz=4096):
    """Iterate over just the words in a file."""
    for start, mo in _words(fo, bufsiz): yield mo.group(0)

def words_for_excerpting(fo, bufsiz=4096):
    """Iterate over the words in a file along with their byte offsets."""
    for start, mo in _words(fo, bufsiz): yield mo.group(0), start + mo.start(0)

def test_words():
    """Unit test for words()."""
    f1 = 'hi there\n09 words  and 3jane RULZ! there'
    ok(list(words(StringIO.StringIO(f1))),
       ['hi', 'there', '09', 'words', 'and', '3jane', 'RULZ', 'there'])
    ok(list(words_for_excerpting(StringIO.StringIO(f1))),
       [('hi', 0), ('there', 3), ('09', 9), ('words', 12), 
        ('and', f1.index('and')), ('3jane', f1.index('3jane')), 
        ('RULZ', f1.index('RULZ')), ('there', f1.rindex('there'))])
    # This test is needed to test the second return path:
    ok(list(words(StringIO.StringIO('So it goes...'))), ['So', 'it', 'goes'])

    # treat chars with high bit set as word chars (works for utf-8 and usu.
    # latin-1)
    ok(list(words(StringIO.StringIO('x\xb5y'))), ['x\xb5y'])
    # make sure buffer size doesn't matter
    for bufsiz in range(1, len(f1) + 1):
        ok(list(words(StringIO.StringIO(f1), bufsiz=bufsiz)), 
           list(words(StringIO.StringIO(f1))))
        ok(list(words_for_excerpting(StringIO.StringIO(f1), bufsiz=bufsiz)),
           list(words_for_excerpting(StringIO.StringIO(f1))))

### Berzerkeley mailbox management

class Buf:
    """A buffer for reading a mailbox while computing SHA-1s of bits of it.

    The basic problem this solves is that the logic for parsing the
    mailbox file in small pieces, without using too much memory at a
    time (_mboxsplit), is hairy and bug-prone already, even without
    intertwining it with the logic for updating the SHA-1 hash and the
    current file offset.  So this object keeps track of the current file
    offset and the current SHA-1 hash.

    It's not particularly general.

    Public instance variables:
    start: the file offset of the beginning of the buffer
    """
    def __init__(self, fo, blocksize):
        """fo: the mailbox.
        blocksize: the size of chunks to read from it.
        """
        self.fo = fo
        self.start = fo.tell()  # an offset into the file
        self.buf = ''   # bytes starting at offset .start in the file
        self.h = sha.sha()  # always contains SHA-1 of data up to .start
        self.blocksize = blocksize
    def end(self): 
        """Returns the file offset of the end of the buffer."""
        return self.start + len(self)
    def msgend(self):
        """Return the byte offset and SHA-1 hash up to the start of the buffer.

        Also resets the SHA-1 hash to be ready for the next message.
        """
        try: return self.start, self.h.digest()
        finally: self.h = sha.sha()
    def grow(self):
        """Appending more data read from the file or return false.

        Used when the caller needs more data in the buffer to figure out
        what to do next.
        """
        # I imagine this to be expensive, so it's disabled:
        #tell = self.fo.tell()
        #assert tell == self.start + len(self), (tell, self.start, len(self))
        data = self.fo.read(self.blocksize)
        self.buf += data
        return data
    def __len__(self): 
        """Returns the amount of data currently in the buffer."""
        return len(self.buf)
    def find(self, astr):
        """Returns the byte offset (in the file) of the string, if in buffer.
        
        Otherwise, returns -1.
        """
        rv = self.buf.find(astr)
        if rv == -1: return -1
        return self.start + rv
    def consume_to(self, pos):
        """Throw away data up to (file offset) pos.
        """
        pos -= self.start
        assert 0 <= pos <= len(self), (self.start, pos, len(self))
        self.h.update(self.buf[:pos])
        self.buf = self.buf[pos:]
        self.start += pos

def _mboxsplit(fo, blocksize):
    """Parse a mailbox, yielding (offset, SHA1) pairs.

    The SHA-1 is binary, and is the SHA-1 of the data since the last
    offset, or beginning of file.  Yields one pair for every new message
    start (including one at the beginning of the file, if the mailbox
    starts with a message, as is usual for mbox files) plus a pair at
    EOF.  So a file with N messages will yield N+1 pairs.
    """
    b = Buf(fo, blocksize)
    sep = 'From '
    while len(b) < len(sep):
        if not b.grow():  # unlikely!
            b.consume_to(b.end())
            yield b.msgend()
            return
    if b.find(sep) == 0: yield b.msgend()
    sep = '\n' + sep
    while 1:
        while len(b) >= len(sep):
            pos = b.find(sep)
            if pos == -1: 
                b.consume_to(b.end() - len(sep) + 1)
                assert len(b) < len(sep)
            else:
                b.consume_to(pos + 1)  # consume '\n' too
                yield b.msgend()
        if not b.grow():
            b.consume_to(b.end())
            yield b.msgend()
            return

def mboxsplit(fo, blocksize=4096):
    """Yield (start_byte_offset, end_byte_offset, sha1) per message.

    Parses mbox file 'fo'.
    """
    msgs = _mboxsplit(fo, blocksize)
    start, _ = msgs.next()
    for end, hash in msgs:
        yield start, end, hash
        start = end

def test_mbox():
    """Unit tests for mboxsplit and friends."""
    uglier_mbox = '''leading garbage\nFrom somebody
        some text From me\nfrom you
        more text\nFrom somebody else
        yet more text
    '''
    pm = list(_mboxsplit(StringIO.StringIO(uglier_mbox), blocksize=3))
    ok(pm[0][0], uglier_mbox.index('From somebody'))
    ok(pm[1][0], uglier_mbox.index('From somebody else'))
    ok(pm[2][0], len(uglier_mbox))
    ok(len(pm), 3)

    first_msg = '''From me\nTo: you\n\nhi\n''' + '\n' * 100000
    second_msg = '''From you\nwho cares'''
    saner_mbox = first_msg + second_msg
    ok(list(_mboxsplit(StringIO.StringIO(saner_mbox), blocksize=3)),
       [(0, sha1('')), (len(first_msg), sha1(first_msg)),
        (len(saner_mbox), sha1(second_msg))])
    ok(list(_mboxsplit(StringIO.StringIO(saner_mbox), blocksize=3000)),
       [(0, sha1('')), (len(first_msg), sha1(first_msg)),
        (len(saner_mbox), sha1(second_msg))])

    ok(list(mboxsplit(StringIO.StringIO(saner_mbox), blocksize=3)),
       [(0, len(first_msg), sha1(first_msg)),
        (len(first_msg), len(saner_mbox), sha1(second_msg))])

    mb = mboxsplit(StringIO.StringIO(uglier_mbox))
    ii = iter(mb)
    first_msg_start, first_msg_end, sha1hash = ii.next()
    assert uglier_mbox[first_msg_start:].startswith('From somebody\n')
    assert uglier_mbox[first_msg_end:].startswith('From somebody else\n')
    next_msg_start, next_msg_end, sha1hash = ii.next()
    ok(next_msg_start, first_msg_end)
    ok(list(_mboxsplit(StringIO.StringIO('asdf'), 1)), [(4, sha1('asdf'))])

    desired = list(mboxsplit(StringIO.StringIO(uglier_mbox)))
    for ii in range(1, len(uglier_mbox)):
        ok(list(mboxsplit(StringIO.StringIO(uglier_mbox), blocksize=ii)),
           desired)

### Merging sorted lists to get union or intersection.

class PeekableIterator:
    """A wrapper for an iterator that behaves identically, but with peek.

    Merge algorithms are a lot easier if you can peek at the next value
    in each stream.  Using this class reduced my merge_union (the kind
    of merge you use in mergesort) from 42 lines down to 14.  You wrap
    an iterator with this class and get the same iterator, except with
    added functionality.  
    
    There are two public instance variables:
    .empty tells whether there is a .peek, 
    .peek tells what you will get the next time you call .next().
    
    It's guaranteed that .next() will raise StopIteration iff .empty,
    and there is a .peek iff not .empty.
    """
    def __init__(self, aniter):
        self.aniter = aniter
        self.empty = False
        self.peek = None
        self.next()
    def __iter__(self): return self
    def next(self):
        if self.empty: raise StopIteration
        rv = self.peek
        try: self.peek = self.aniter.next()
        except StopIteration:
            self.empty = True
            del self.peek
        return rv

def merge_union(itera, iterb):
    """Return the union of two sorted iterators, as a sorted iterator."""
    a, b = PeekableIterator(iter(itera)), PeekableIterator(iter(iterb))
    while 1:
        if a.empty and b.empty: return
        if a.empty: yield b.next()
        elif b.empty: yield a.next()
        elif a.peek < b.peek: yield a.next()
        elif a.peek > b.peek: yield b.next()
        else:
            b.next()
            yield a.next()

def merge_intersection(itera, iterb):
    """Return the intersection of two sorted iterators, as a sorted iterator."""
    a, b = PeekableIterator(iter(itera)), PeekableIterator(iter(iterb))
    while 1:
        if a.empty or b.empty: return
        if a.peek < b.peek: a.next()
        elif a.peek > b.peek: b.next()
        else:
            b.next()
            yield a.next()

# Each of these merge union tests (except the first one) actually used to test a
# different code path.  How gross is that?
ok(list(merge_union([], [])), [])
ok(list(merge_union([], [1, 2, 3])), [1, 2, 3])
ok(list(merge_union([1, 2, 3], [])), [1, 2, 3])
ok(list(merge_union([0, 1, 3], [0, 2, 6])), [0, 1, 2, 3, 6])
ok(list(merge_union([0, 2, 6], [0, 1, 3])), [0, 1, 2, 3, 6])
ok(list(merge_union([0, 2, 6], [0, 1, 2])), [0, 1, 2, 6])
ok(list(merge_union([0, 1, 2], [0, 2, 6])), [0, 1, 2, 6])
ok(list(merge_intersection([0, 2, 4], [1, 2, 3, 4])), [2, 4])

### Hash inverted index file management.

gethash = Struct(">L").unpack
def hashes(ob): 
    """Need multiple hash functions to make this work.  Here they are.
    """
    bighash = sha1(ob)
    return (gethash(bighash[0:4])[0], gethash(bighash[4:8])[0],
            gethash(bighash[8:12])[0])

class Index:
    """Interface to a single index."""
    nbuckets = 131072
    header_offset_struct = Struct('>q')
    def __init__(self, fname, hashfuncs):
        """fname is the name of the index file.
        hashfuncs is a hash function returning a tuple of hash values.
        """
        self.hashes = hashfuncs
        self.fname = fname
        self.header_offset_size = self.header_offset_struct.calcsize()
        self.clear()
    def remove_doc_hash(self, bare_hash):
        """Remove all postings for the specified document bare hash.

        bare_hash is the 12-byte chunk of SHA-1, not the 16-byte thing
        including the page number.

        No-op if the document hash isn't known.

        Does it lazily, and doesn't remove the postings added in this object
        since the last flush --- I don't expect that will be a problem, but 
        I thought I should document the stupidity here.
        """
        page_no = 0
        while 1:
            ahash = make_doc_hash(bare_hash, page_no)
            try: doc_id = self.doc_id_store()[ahash]
            except KeyError: return
            self.removed_doc_ids.add(doc_id)
            self.doc_id_store().remove(doc_id)
            page_no += 1
    def add(self, doc_id, word):
        """Add a posting to be written out later."""
        for bucketnum in self.bucketnums(word):
            try: bucket = self.additions[bucketnum]
            except KeyError: 
                self.additions[bucketnum] = bucket = [doc_id]
                self.pending_postings += 1
            else: 
                if bucket[-1] != doc_id: 
                    bucket.append(doc_id)
                    self.pending_postings += 1
    def flush(self):
        """Write index updates back to the filesystem."""
        # note race condition on two concurrent updates:
        newfile = self.fname + '.new'  
        newfo = file(newfile, 'wb')
        self.write_index(newfo)
        newfo.close()  # unnecessary in RefCountedLand, but...
        # Atomic rename for atomic index update.
        os.rename(newfile, self.fname)
        self.clear()
    def clear(self):
        """Forget any pending updates."""
        self.fo = None
        self._doc_id_store = None
        self.additions = {}
        self.removed_doc_ids = sets.Set()  # new in 2.3
        self.pending_postings = 0
    def sorted_additions(self, bucketnum):
        """The doc_ids currently pending addition to this bucket number.
        
        In order, as implied by the name.
        
        We could be lazier about this if we used a heap, but in the current
        program, we never do anything but write this whole list out to the
        index, so that would be premature optimization.
        """
        try: rv = list(sets.Set(self.additions[bucketnum]))
        except KeyError: return []
        rv.sort()
        return rv
    def doc_ids(self, bucketnum):
        """Return the doc_ids for this bucket number.

        This includes any pending updates as well as those from the file.

        """
        if self.additions.has_key(bucketnum): 
            return merge_union(self.read_bucket(bucketnum),
                               self.sorted_additions(bucketnum))
        else: return self.read_bucket(bucketnum)
    def bucket_end_offset(self, fo, bucketnum):
        """Read the index file fo to tell what byte offset the bucket ends."""
        fo.seek(bucketnum * self.header_offset_size)
        (end_offset,) = self.header_offset_struct.read(fo)
        return end_offset
    def binbucket(self, fo, bucketnum):
        """The binary string containing the doc_ids in the index file.

        This may be of zero length, or it may be quite large.  It won't 
        contain currently pending additions.
        """
        # So here, to find the length of the bucket, I need the starts of two
        # buckets.  So I need a special case either for the first bucket or for
        # the last bucket, or I need nbuckets+1 items in the header.  I chose
        # to have a special case for the first bucket, and store the offsets of
        # the *ends* of the buckets in the header.
        if bucketnum == 0: 
            start_offset = self.header_offset_size * self.nbuckets
        else:
            start_offset = self.bucket_end_offset(fo, bucketnum - 1)
        end_offset = self.bucket_end_offset(fo, bucketnum)
        fo.seek(start_offset)
        return fo.read(end_offset - start_offset)
    def open_fo(self):
        """Open a filehandle on the index file, saving it in self.fo.

        Originally this was a separate method just to facilitate the use
        of ignoring_enoent, but now it has two callers: read_bucket and
        doc_id_store.
        """
        self.fo = file(self.fname, 'rb')
        return self.fo
    def read_bucket(self, bucketnum):
        """The doc_ids in bucketnum in the file, minus those pending deletion.

        Deletion is processed here (rather than in .doc_ids()) so that it 
        won't affect newly added postings.
        """
        if not self.fo:
            # we ignore specifically only ENOENT here because we wouldn't want
            # EPERM to wipe out the index
            if not ignoring_enoent(self.open_fo): return  # no index yet!
        fo = self.fo
        for doc_id in delta_decode(unpacknums(self.binbucket(fo, bucketnum))):
            if doc_id not in self.removed_doc_ids: yield doc_id
    def doc_id_store(self):
        """Return DocIds object, opening file if necessary."""
        if not self._doc_id_store: 
            if ignoring_enoent(self.open_fo):
                # seek one byte past end of last bucket
                self.fo.seek(self.bucket_end_offset(self.fo, self.nbuckets-1))
                self._doc_id_store = DocIds(self.fo)
            else:
                self._doc_id_store = DocIds(StringIO.StringIO('')) # EW!
        return self._doc_id_store
    def write_index(self, fo):
        """Given a filelike object, write the current index to it."""
        header_offsets = []
        fo.seek(self.nbuckets * self.header_offset_size)
        for ii in xrange(self.nbuckets):
            if (not self.additions.has_key(ii) and not self.removed_doc_ids and 
                self.fo):
                bucketstr = self.binbucket(self.fo, ii)
            else:
                bucketstr = ''.join(map(packnum, delta_encode(self.doc_ids(ii))))
            fo.write(bucketstr)
            header_offsets.append(fo.tell())
        fo.seek(0)
        for item in header_offsets: self.header_offset_struct.write(fo, item)
        fo.seek(header_offsets[-1])  # EOF
        self.doc_id_store().write(fo)  # may open and seek self.fo
    def lookup(self, word):
        """Main entry point: return doc_ids that might contain 'word'."""
        bucketnums = self.bucketnums(word)
        rv = self.doc_ids(bucketnums[0])
        for bucketnum in bucketnums[1:]:
            # This builds a kind of one-sided tree of intersections, which
            # results in more comparisons than necessary when there are more
            # than two hash functions, but the difference is small when there
            # are less than five or so
            more = self.doc_ids(bucketnum)
            rv = merge_intersection(rv, more)
        return rv
    def bucketnums(self, word):
        """Return the bucket numbers in which postings for 'word' would be."""
        # note that x % y returns positive numbers in Python even when x is
        # negative
        return [hashval % self.nbuckets for hashval in self.hashes(word)]

def test_index():
    """Slow unit tests for Index and DocIds."""
    fname = 'tmp.index.%d' % os.getpid()
    try:
        idx = Index(fname, hashes)
        assert not os.path.isfile(fname)

        di = idx.doc_id_store()
        a = make_doc_hash('a' * 20)
        ok(a, 'a' * 12 + '\0' * 4)
        b = make_doc_hash('b' * 20)
        ok(di.add_hash(a), 0)
        ok(di[a], 0)
        ok(di.add_hash(b), 1)
        ok(di[b], 1)
        ok(di[a], 0)
        assert not os.path.isfile(fname)
        idx.flush()
        assert os.path.isfile(fname)

        idx = Index(fname, hashes)
        di = idx.doc_id_store()
        ok(di[b], 1)
        ok(di[a], 0)

        did1 = idx.doc_id_store()[a]
        did2 = idx.doc_id_store()[b]
        c = make_doc_hash('c' * 20)
        did3 = idx.doc_id_store().add_hash(c)
        ok(len(sets.Set([did1, did2, did3])), 3)
        ok(list(idx.lookup('foo')), [])
        idx.flush()
        ok(list(idx.lookup('foo')), [])  # No documents added yet!
        idx.add(did1, 'foo')
        ok(list(idx.lookup('foo')), [did1])  # No chance for hash collision yet
        idx.flush()

        idx = Index(fname, hashes)
        ok(list(idx.lookup('foo')), [did1])
        ok(idx.doc_id_store().hash(did1), a)
        assert a in idx.doc_id_store()
        idx.remove_doc_hash(a)
        assert a not in idx.doc_id_store()
        ok(list(idx.lookup('foo')), [])
        idx.flush()
        # XXX note that the following assumes doc_ids are stable across
        # flushes, which won't be true forever

        idx = Index(fname, hashes)
        ok(list(idx.lookup('foo')), [])
        assert idx.doc_id_store().hash(did1) != a
        assert a not in idx.doc_id_store()
        idx.add(did2, 'foo')
        idx.add(did3, 'foo')
        idx.add(did3, 'bar')
        ok(list(idx.lookup('foo')), [did2, did3])
        idx.add(did3, 'foo')
        ok(list(idx.lookup('foo')), [did2, did3])  # no dups!
        assert did3 in list(idx.lookup('bar'))   # could have did2 if collision
        idx.flush()

        idx = Index(fname, hashes)
        ok(list(idx.lookup('foo')), [did2, did3])
        idx.remove_doc_hash(c)
        ok(list(idx.lookup('foo')), [did2])
        idx.add(did3, 'bar')
        assert did3 in list(idx.lookup('bar'))   # could have did2 if collision
        idx.flush()

        idx = Index(fname, hashes)
        if idx.bucketnums('foo') != idx.bucketnums('bar'):
            ok(list(idx.lookup('foo')), [did2])
            ok(list(idx.lookup('bar')), [did3])
        else:
            print 'hash collision, skipping some tests'

    finally:
        for name in [fname, fname + '.new']:
            if os.path.exists(name): os.remove(name)

### Main entry point

class DB:
    """Interface to a directory containing an index of some files."""
    explanation = """This directory contains a full-text index of some files.

    The index is stored in the following files:
    README: this text file, explaining the format of the index.
    hashes: a text file containing the hexadecimal SHA-1 hashes of the
        files that have been indexed, along with the start and end
        offsets in that file (as ASCII decimal numbers), and the
        filename, separated by spaces.  The filename is URL-encoded with
        '%' signs to escape strange characters (including spaces and
        newlines).  This file exists so that it's possible to produce
        human-readable search results, rather than just spitting out
        SHA-1 hashes.  The first line points at a 'headwords' file that
        can be used as an index for fast access into this file.
    hashes.*.headwords: an index of the 'hashes' file for faster access.
        Every line but the first contains a hash and a file offset (in
        decimal) into the 'hashes' file where that the information for
        that hash can be found.  Because 'hashes' is lexicographically
        sorted, this is enough information to make access to 'hashes'
        rapid.  The first line gives the name of a
        hashes.*.headwords.*.headwords file, which may or may not exist,
        for faster access to this file.
    hashes.*.headwords.*.headwords: see above.
    index: a hash table of 131 072 buckets.  Starts with 131 072
        big-endian 64-bit offsets into this file; each points one byte
        past the end of the postings for the corresponding hash bucket.
        The postings for the 0th hash bucket starts after the end of
        this header (after the 1 048 576th byte), and the postings of
        consecutive hash buckets are simply concatenated together with
        no delimiters.  Each word is hashed with the three hash
        functions described below under "hash functions".  Words in the
        indexed files are hashed into these hash buckets.  The postings
        are merely document IDs (zero-based line numbers from the
        doc_ids file) encoded as follows.  First, the document IDs for a
        bucket are sorted; second, each one except the first is replaced
        with its difference from the one before it; third, each number
        in the resulting list is encoded in a variable number of bytes,
        seven bits per byte, with the last byte having the high bit set.
        Accordingly [0, 1, 2, 3, 5] encodes as (hexadecimal) 80 81 81 81
        82.  The actual words are not stored anywhere, which makes the
        results somewhat inexact.  Following the end of the hash buckets
        (the file offset for this location is encoded as the last 8
        bytes of the first 1 048 576) are the hashes of the indexed
        documents, in binary, 16 bytes each --- the first 12 bytes of
        the SHA-1 hash, followed by 4 bytes encoding a "page number"
        within a document --- concatenated together without delimiters
        up to the end of the file.  The order of hashes in this list
        defines the meaning of the document IDs in the index file:
        document ID 0 has the first hash in the list, document ID 1 has
        the second hash, and so forth.
    excerpt, cat, lookup, add, deindex, and addmbox: shell scripts to
        query and modify the index.
    
    The headers (the offsets) are stored in the same files as the
        postings to make index updates atomic.

    Hash functions: All three are derived from the SHA-1 digest of the
        lowercase* version of the word: one from the first 32 bits,
        another from the next 32 bits, and a third from the third 32
        bits; from each of these sections of the SHA-1 digest, we take
        the low 17 bits.

    * I'm not sure what "lowercase" means.
    """
    def __init__(self, dbdir):
        """dbdir is the path (relative or absolute) to store the index at."""
        self.index = Index(os.path.join(dbdir, 'index'), hashes)
        self.dbdir = dbdir
        self.hash_hints = None
    def create_if_necessary(self):
        """Create the index directory if necessary and open files.

        Like open_files, this is not in __init__ because, depending on
        what use is to be made of the DB object, we may want to create
        the directory if it doesn't already exist --- or not.

        """
        if not os.path.isdir(self.dbdir): os.makedirs(self.dbdir)
        self.open_files()
    def open_files(self):
        """Open files in the index directory if not already done.

        Like create_if_necessary, this is not in __init__ because,
        depending on what use is to be made of the DB object, we may
        want to create the directory if it doesn't already exist --- or
        not.

        """
        if self.hash_hints is None:
            self.hash_hints = HashHints(os.path.join(self.dbdir, 'hashes'))
    def add(self, filename):
        """Update the indexes with the current contents of the named file."""
        fo = file(filename)
        self.create_if_necessary()
        ahash = make_doc_hash(filehash(fo))  # sadly we read the file twice
        bare_hash, page_no = parse_doc_hash(ahash)  # stupid API
        end = fo.tell()
        if ahash in self.index.doc_id_store():  # already indexed it
            # but we may have indexed it in a different file, so we add
            # it to hash_hints
            self.hash_hints.add_hash(bare_hash, FileByteRange(filename, 0, end))
            return
        self.deindex(filename)
        # note that we have to add the new hash *after* we deindex ---
        # otherwise deindexing will remove the new hash!
        self.hash_hints.add_hash(bare_hash, FileByteRange(filename, 0, end))
        doc_id = self.index.doc_id_store().add_hash(ahash)
        fo.seek(0)
        word_no = 0
        for word in words(fo): 
            self.index.add(doc_id, word.lower())
            word_no += 1
            if word_no == 4096:  # max words per page
                page_no += 1
                ahash = make_doc_hash(bare_hash, page_no)
                doc_id = self.index.doc_id_store().add_hash(ahash)
                word_no = 0
        fo.close()  # not really necessary in Refcounted Land
    def addmbox(self, filename):
        """Update the indexes with the contents of the named mailbox file."""
        fo = file(filename)
        filesize = os.fstat(fo.fileno())[stat.ST_SIZE]
        self.create_if_necessary()
        hashes_possibly_missing = sets.Set(self.hash_hints.hashes(filename))
        ii = 0
        for start, end, ahash in mboxsplit(fo):
            fancy_hash = make_doc_hash(ahash)
            bare_hash, page_no = parse_doc_hash(fancy_hash)
            hashes_possibly_missing.discard(bare_hash)
            self.hash_hints.add_hash(bare_hash, 
                                     FileByteRange(filename, start, end))
            if fancy_hash not in self.index.doc_id_store():
                saved_pos = fo.tell()
                #myfrom = Subfile(fo, start, end).read(5)
                #assert myfrom == 'From ', (myfrom, ii, start, end)
                doc_id = self.index.doc_id_store().add_hash(fancy_hash)
                word_no = 0
                # We don't use byte_range.open here because that would
                # involve opening the file for every mail message.
                for word in words(Subfile(fo, start, end)):
                    self.index.add(doc_id, word.lower())
                    word_no += 1
                    if word_no == 4096:  # max words per page
                        page_no += 1
                        fancy_hash = make_doc_hash(ahash, page_no)
                        doc_id = self.index.doc_id_store().add_hash(fancy_hash)
                        word_no = 0
                fo.seek(saved_pos)
            yield ii, end, filesize
            ii += 1
        fo.close()
        for each_hash in hashes_possibly_missing:
            # XXX I've never verified that this is deleting the right
            # things.
            self.hash_hints.hash_no_longer_in_file(filename, each_hash)
            if not len(self.hash_hints[each_hash]):
                self.index.remove_doc_hash(each_hash)
    def deindex(self, filename):
        """Remove a file from the indexes if it is indexed.
        
        Does not remove the file from the doc_id list or the hashes
        list.
        """
        self.open_files()
        for each_hash in self.hash_hints.hashes(filename):
            self.hash_hints.hash_no_longer_in_file(filename, each_hash)
            if not len(self.hash_hints[each_hash]):
                self.index.remove_doc_hash(each_hash)
    def flush(self):
        """Flush indexes to disk after updates."""
        self.hash_hints.flush()
        self.index.flush()
        f = file(os.path.join(self.dbdir, 'README'), 'w')
        f.write(self.explanation)
        f.close()
        self.write_shell_scripts()
    def write_shell_scripts(self):
        commands = 'excerpt cat lookup add deindex addmbox'.split()
        program = os.path.abspath(__import__('__main__').__file__)
        shell_script_template = """#!/bin/bash
        dbdir="$(dirname "$0")"
        %(program)s "$dbdir" %(command)s "$@"
        """
        for command in commands:
            scriptpath = os.path.join(self.dbdir, command)
            f = file(scriptpath, 'w')
            f.write(shell_script_template % {
                'program': program,
                'command': command,
            })
            f.close()
            os.chmod(scriptpath, 0755)
    def ensure(self, terms, bare_hashes):
        """Filter out documents missing a term."""
        for bare_hash in bare_hashes:
            for byte_range in self.hash_hints[bare_hash]:
                if self.byte_range_contains(terms, byte_range):
                    yield byte_range
    def byte_range_contains(self, some_words, byte_range):
        """Determine whether a document contains all the required words."""
        fo = ignoring_enoent(byte_range.open)
        if not fo: return False  # deleted
        termset = sets.Set(some_words)
        for word in words(fo):
            termset.discard(word.lower())
            if not termset: return True
        return False
    def bare_hashes(self, term):
        """Return sorted bare hashes of documents containing term."""
        rv = sets.Set()
        dis = self.index.doc_id_store()
        # XXX note that the following undoes all the work to keep the
        # results streaming lazily out of the index:
        for doc_id in self.index.lookup(term):
            rv.add(parse_doc_hash(dis.hash(doc_id))[0])
        rv = list(rv)
        rv.sort()
        return rv
    def lookup(self, terms):
        """Iterate over some filenames (etc.) containing terms."""
        terms = [term.lower() for term in terms]
        self.open_files()
        results = self.bare_hashes(terms[0])
        # This results in a very one-sided intersection tree, resulting
        # in extra comparisons, but that's probably OK for three or four
        # terms.
        for term in terms[1:]:
            results = merge_intersection(results, self.bare_hashes(term))
        return self.ensure(terms, results)
    def excerpts(self, terms):
        """Return filenames and excerpts for specified terms."""
        # Fairly gross brute force approach --- we end up reading the 
        # whole file twice; but it does work, more or less.
        terms = [term.lower() for term in terms]
        for byte_range in self.lookup(terms):
            termset = sets.Set(terms)
            excerpts = []
            contextlen = 71
            def get_excerpts():
                fo = byte_range.open()
                for word, startpos in words_for_excerpting(fo):
                    if word.lower() in termset:
                        termset.discard(word.lower())
                        saved_pos = fo.tell()
                        fo.seek(max(0, startpos - contextlen/2))
                        textexcerpt = fo.read(contextlen)
                        excerpts.append(textexcerpt.replace('\n', ' ')
                                        .replace('\t', ' '))
                        if not termset: break
                        fo.seek(saved_pos)
            ignoring_enoent(get_excerpts)
            yield byte_range, excerpts

def test_db():
    """Really basic unit tests for DB."""
    dbdir = 'tmp.db.%d' % os.getpid()
    try:
        # We use this source file as the sample file to index, simply
        # because we can avoid writing another file and cleaning up
        # later that way.
        import textindex
        textindex_file = textindex.__file__
        # If we index the .pyc instead of the .py, 'wibble' becomes
        # 'wibblei' and consequently the test fails :)
        if textindex_file.endswith('pyc'): textindex_file = textindex_file[:-1]
        if not os.path.exists(textindex_file):
            print "Skipping db tests because %s doesn't exist" % textindex_file
            return
        filesize = os.stat(textindex_file)[stat.ST_SIZE]
        db = DB(dbdir)
        ok(list(db.lookup(['wibble'])), [])
        db.add(textindex_file)
        db.flush()
        results = list(db.lookup(['wibble']))
        ok(len(results), 1)
        filename, start, end = results[0]
        ok(filename, textindex_file)
        ok(start, 0)
        ok(end, filesize)

        ok(len(list(db.lookup(['the truth']))), 0)  # impossible search
        ok(len(list(db.lookup(['import', 'truth']))), 1)
        excerpt_results = list(db.excerpts(['import', 'truth']))
        ok(len(excerpt_results), 1)
        ok(excerpt_results[0][0], results[0])
        excerpts = excerpt_results[0][1]
        # 'import' occurs earlier in this file than 'truth', so its
        # excerpt should be first
        assert 'import' in excerpts[0], excerpts
        assert 'truth' in excerpts[1], excerpts
        # Like, maybe it would be nice to have more than one document in the
        # database, and to test mailbox indexing, and to test deindexing, but
        # these tests would already have caught an awful lot of embarrassing
        # bugs I checked in recently.

        # terms far apart in the file:
        ok(len(list(db.lookup(['un' + 'usual', 'truth']))), 1)
    finally:
        os.spawnv(os.P_WAIT, '/bin/rm', ['rm', '-rf', dbdir])
    
def test_all():
    """Run tests that might take a while or write to the filesystem."""
    test_words()
    test_hash_hints()
    test_index()
    test_mbox()
    test_db()
    test_itam()

def main(args):
    """Command-line entry point."""
    if len(args) < 2: return usage()
    db = DB(args[0])
    cmd = args[1]
    if cmd == 'add':
        for filename in args[2:]:
            if not os.path.isabs(filename):
                filename = os.path.abspath(filename)
            print 'indexing', filename
            db.add(filename)
        print 'updating index'
        db.flush()
    elif cmd == 'addmbox':
        for filename in args[2:]:
            if not os.path.isabs(filename):
                filename = os.path.abspath(filename)
            for msgno, bytes, totalbytes in db.addmbox(filename):
                percentage = int(100 * bytes / totalbytes + 0.5)
                sys.stdout.write('indexed msg #%d in %s (%d%%)\r' 
                                 % (msgno, filename, percentage))
                sys.stdout.flush()
                # the 2^11 postings in the next line is fairly arbitrary; it's
                # intended to strike a reasonable balance between working set
                # size (under 30M or so) and time wasted updating indexes.
                if db.index.pending_postings > 2097152:
                    print
                    print 'updating index'
                    db.flush()
            print
        print 'updating index'
        db.flush()
    elif cmd == 'lookup':
        for byte_range in db.lookup(args[2:]): print byte_range
    elif cmd == 'cat':
        for byte_range in db.lookup(args[2:]):
            fo = byte_range.open()
            while 1:
                data = fo.read(4096)
                if not data: break
                sys.stdout.write(data)
    elif cmd == 'excerpt':
        for byte_range, excerpts in db.excerpts(args[2:]):
            print byte_range
            for excerpt in excerpts:
                print '\t', excerpt
    elif cmd == 'test':
        test_all()
        print 'Tests OK :)'
    elif cmd == 'deindex':
        for filename in args[2:]:
            if not os.path.isabs(filename):
                filename = os.path.abspath(filename)
            print 'deindexing', filename
            db.deindex(filename)
        print 'updating index'
        db.flush()
    else:
        usage()

def usage():
    """Provide usage help for command-line interface."""
    sys.stderr.write('''usage: 
    textindex.py dbdir add file1 file2..., or 
    textindex.py dbdir addmbox mboxfile1 mboxfile2..., or 
    textindex.py dbdir lookup term1 term2..., or
    textindex.py dbdir excerpt term1 term2..., or
    textindex.py dbdir cat term1 term2..., or
    textindex.py dbdir deindex file1 file2..., or
    textindex.py dbdir test\n''')

if __name__ == '__main__':
    cgitb.enable(format='text')
    main(sys.argv[1:])


More information about the Kragen-hacks mailing list