my new full-text indexer
Kragen Javier Sitaker
kragen at pobox.com
Thu Aug 17 22:45:19 EDT 2006
On a 1.83GHz Intel Core Duo under Mac OS X, it:
- indexes more than 100 megabytes an hour (slowing down for larger
corpuses)
- indexes in a working set of around 32MB (sometimes much less)
- produces indexes of around 17-19% of the size of the original files
- starts producing query results in less than a second for every
query I've tried on 400MB of mail (except for "how old is your cat
doctor cracking", which took 2, and goes down to 570ms if you take out
'is' and 'your')
- embodies an innovative new data structure for its index, a variant of
the Bloom filter, a special case of Boldi and Vigna's "compact
approximation of lattice functions"
- supports incremental reindexing
- is 100% Python
Basic usage
-----------
Save it to 'textindex.py' and make it executable. Make sure you have
Python and Unix installed. To create a new index:
./textindex.py mynewindex add myfile1 myfile2 myfile3
or, once the index exists,
mynewindex/add myfile1 myfile2 myfile3
where myfile[123] are text files you want to add to the index. To add a
standard Berkeley mbox file to the index:
./textindex.py mynewindex addmbox mboxfilename
or
mynewindex/addmbox mboxfilename
To search for some words:
./textindex.py mynewindex excerpt some words
or
mynewindex/excerpt some words
You can replace 'excerpt' with 'lookup' to simply list the documents
(mail messages or text files) in which the words were found, or with
'cat' to spew out the entire contents of the documents.
The background story
--------------------
This is the fifth inverted-index-based search engine I've written; the
first one delegated the task of index structure management to the
filesystem (which was too slow to be useful, even when it was reiserfs),
the second one tried to delegate it to Berkeley DB (which was likewise
too slow), the third one was a Python engine I used for my email, and
the fourth one was a C version of the Python one. I've also been
inspired by my experience with Lucene and by Dave Long's very impressive
shell-script inverted-index manager.
After my old laptop crashed, I thought I'd take another whack at the
problem of indexing my email. This is probably a little easier for me
than for most people, because I generally read my email in fairly
primitive ways, so there's not a lot of integration headache-work to do.
My old mail indexer was written in C, and by some quirk of fate, the
laptop I'm using at the moment has lots of programming languages on it,
but no C compiler. The previous version of that mail indexer was
written in Python, but it was painfully slow, and didn't support
incremental reindexing. Incremental reindexing is pretty hard in my
environment anyway, because here's how I download my email:
#!/bin/sh
ssh panacea.canonical.org tail -400000000c \
/var/mail/kragen \> tmp.mboxtail && \
rsync -Pavzz panacea.canonical.org:tmp.mboxtail /Users/kragen/mail
So every time I download my email, any local changes to the mailbox are
lost, and all the messages are generally found at a different offset in
the mbox file tmp.mboxtail afterwards. Also, tmp.mboxtail isn't
strictly an mbox file, because it generally starts in the middle of a
message, usually a large binary attachment to a virus-induced spam.
My old mail indexer had some other disadvantages; to sum up:
- it was written in C, so it was hard to run on this machine;
- it could only index one file;
- it had to reindex that file from scratch every time I ran it, which
took 10 or 20 minutes on my old computer (although this laptop is
about five times faster)
- in order to avoid bloating its index with nonwords from mail
attachments, it didn't index words of more than (IIRC) 20 letters;
- it used so much memory that my machine was unusable while it was
running;
- because it mmapped the entire mailbox file ("for simplicity and
speed") it couldn't handle mailboxes bigger than the largest free
section of virtual memory space; when my mailboxes approached 1GiB,
this meant that it would only run sporadically under Linux with PaX,
and it could never have handled mailboxes bigger than a few gigabytes
on a 32-bit machine.
- the index it produced was a text file, so it was about 60% of the size
of the original data.
- it was so awkward to use that I wrote shell scripts to tie the various
parts together and do the things I usually wanted.
But it was pretty quick. The indexing phase indexed about 5 megabytes
per second, or 18 gigabytes per hour, on my 600MHz laptop, if I remember
correctly.
This new indexer removes all of the above problems:
- it is written in Python, so it should be easy to run anywhere;
- it can index any number of files;
- it SHA-1 hashes the documents it indexes so that it can avoid
reindexing them if they haven't changed (even if they have moved);
- it indexes everything that looks word-like;
- the indexer has a working set up to about 36MB ("RPRVT" from Mac OS X
top) when indexing my mailbox;
- it reads the mailbox file in a conventional fashion, so it should be
able to index mailboxes up to 2^64 bytes without trouble, or possibly
more on a machine where Python supports bigger files than that;
- its index is binary, requiring about 1.5 bytes per posting in its
present format, although it could be shrunk, and it has a 1MiB
overhead.
- it's pretty easy to use on the command line.
Interesting features
--------------------
The Bloom-filter-like structure of the index has some interesting
properties. For example, the inverted index does not contain the words
that are indexed, and indeed it is impossible to recover those words
from the index. I anticipate using this feature to support fielded
search.
This program makes heavy use of Python generators, largely in order to
allow the first search results to be displayed before they are all
available. It probably shouldn't do this as deeply as it does --- it
would probably have faster lookup performance and be more debuggable.
The ad-hoc compression scheme it uses for document ID lists does a
reasonable job, although not as good as huffman coding would do. This
probably needs to be revisited and tested against Lucene and MG4J, since
the whole point of this data structure is that it takes up less space
than a straightforward inverted index. But I think, right now, it
doesn't, actually.
It uses SHA-1 hashes to identify documents it's seen before to avoid
reindexing them again, so as to be able to handle the case of my
tmp.mboxtail and similar difficult cases. However, it can only compute
SHA-1 over my mailbox at about 5 megabytes per second, about a quarter
of the speed that my old text indexer could build a full-text index of
it.
I considered using the same Bloom-filter-like data structure for the
"hashes" file that tells which files (or byte ranges) have which hashes;
it would have been a lot smaller. However, when the program is updating
the index, it needs to find out which hashes were previously found in
the file it's indexing, so it can delete them if they're no longer to be
found anywhere. Bloom filters, and this data structure as well, make it
impossible to retrieve the keys with which some value was stored. So I
used a structure similar to what I used for my inverted index in the
last rev of the indexer.
Magic numbers
-------------
It's full of magic numbers (14 of them by my count), most of which
represent some kind of tradeoff:
- 7, 0x7f, and 0x80 in unpacknums and packnum: 7 bits of data per byte,
one bit of framing
- 12 bytes of SHA-1 and 4 bytes of "page number": make it easy to read
the index file in a hex dump that has 16 bytes per line, support more
documents than anybody but Google has, and documents bigger than
almost anybody's disk, but shave 0.1% or so off the index file size
(which really doesn't matter) by not using all 20 bytes of the SHA-1.
- range(256) in init_hexchars: self-explanatory.
- 32768 bytes at a time in filehash(): tradeoff between working-set size
and the overhead of executing Python bytecodes once per chunk.
- 4096 bytes at a time in words() and friends: same thing.
- 4096 bytes at a time in mboxsplit(): same thing.
- three hashes: tradeoff between false-positive rate and size, on one
hand, and speed (of both lookup and indexing) on the other
- 32 bits (4 bytes) of SHA-1 per hash: 16 wasn't enough, and it's easier
to decode 32 bits into a number than 24, although you could
theoretically make arguments about upgradability
- 131072 buckets: tradeoff between index size and false positive rate
(or either of these and number of hashes and therefore speed)
- 8 bytes (64 bits) per bucket offset ('>q'): it's easier to decode 8
bytes than, say, 5, and I didn't think 4 gigabytes of index file would
be enough
- max 4096 words per page: each page has a separate document_id in the
index, and allowing any page to have a number of words that is a
substantial fraction of the number of buckets could cause it to come
up frequently as a false positive. 4096 --- 1/32 of the number of
buckets --- keeps the false positive rate, per document, to 1/(32^3),
or 1/32768, which is good enough as long as your number of
documents isn't a large multiple of 32768. The downside of a small
number of words per page is that it makes the index bigger and result
sets larger.
- 71 bytes of excerpt context: to fit after a standard 8-space tab in a
standard 80-column punched card^W^WVT100 CRT^W^Wterminal window,
without wrapping.
- 2 097 152 words indexed between checkpoints when adding a mailbox
file: tradeoff between the wasted time writing new index files
(containing mostly the contents of the old index file) and working set
size.
- 4096 bytes at a time for copying indexed documents to the output in
the 'cat' command: tradeoff between working set size and overhead of
Python bytecodes
Why not Microsoft Windows
-------------------------
For reliability, the indexer depends on the ability to make an atomic
update to the filesystem by renaming a file over the top of an existing
file, thus atomically replacing the original file. Microsoft Windows
does not support this operation.
BUGS
----
This indexer still shares a few limitations with my previous one:
- there's no fielded search (yet --- should be much easier and more
efficient to add to this one than to my previous attempts)
- there's no relevancy ranking
- there's no good summary of each search result (beyond its filename,
byte offsets, and excerpts containing the search terms) along the
lines of the title of an HTML document, or the subject, sender, and
date on a mail message
- it doesn't understand HTML, MIME multipart, base64, quoted-printable,
PDF, PostScript, TeX, any programming languages, etc. Its
understanding of mbox files consists of this: that they are divided
into messages.
And it has a few new limitations of its own:
- it indexes slowly. 200 megabytes an hour on this machine is maybe 200
times slower than my previous indexer. It will probably get several
times faster when I add support for multiple index files (at lookup
time) and merging index files.
- it searches slowly. It often takes several seconds to find all the
documents that satisfy a query, although it is usually comes up with
some of them within less than a second.
- it displays the earliest-added documents first; in a mailbox file,
these will be the oldest.
- it loads certain information about an index at startup time; as the
index grows, this can make command-line usage slow.
- the novel index structure keeps the index small, and it will be even
smaller if I get a little bit smarter about compression, but it has
some inherent disadvantages of its own:
- it's difficult (prohibitively expensive, I think) to support
sub-word queries ("donk*"), which has important implications for
incremental UI feedback
- because the index file doesn't actually contain the indexed terms,
it's generally impossible to translate an index file when you change
some of the parameters of the index file structure; you have to
rebuild it from scratch. Parameters I've changed that had this
problem in the past included hash algorithm, number of hashes; it's
possible to adjust page size and number of buckets and adapt the old
index, but generally the previously indexed documents won't get the
benefit of the changes.
- it's not really possible to do negative queries: documents
containing "beatrice" but not "murch", for example, because the
index may produce false positives for "murch". You must inspect
each document to make sure.
- there's no support for negative queries
Also:
- I haven't tried it on any data sets over 400MB, partly because it's so slow
to index; I don't know if it might get slow to retrieve as well.
- if you have a mailbox file and then you use 'add' on it, it throws
away all the information it had about the individual messages in favor
of index entries for the file as a whole
- I just added code to remove obsolete hash hints, so that the hash
hints file doesn't grow every time I reindex my mailbox after it has
changed. I don't know how well tested that code is.
- Because the index file can't be updated in place, and because there's
only one of them, adding a single posting to a 100MB index requires
writing a new 100MB index, which takes many minutes. This makes it
O(N^2) indexing large mailboxes (although that's probably fixable with
a tweak.) Lucene-like, this program should support multiple index
files, merging them when it's convenient.
- the excerpting is kind of lame
- there's no output mode analogous to 'grep' or 'grep -l'
The code of textindex.py
------------------------
#!/usr/bin/python
# vim:tw=80:sw=4:ai:et:ts=4
# Full-text indexing hack to see if my hash inverted indexing idea plays out.
"""Full-text indexer for text files using hash inverted indexing.
The unusual thing about this indexer is that it doesn't store the actual
words the postings are for, only their hashes. It's somewhat similar to
a Bloom filter; we hash each word with two different hash functions and
stick the corresponding docid into the resulting hash buckets. By
intersecting the lists of docids, we can achieve a reasonably small
false-positive rate.
I've only tried it on small datasets so far, because that's all I have
handy.
There are a few software artifacts in here that should have broad
applicability, although I've left them in this same file to facilitate
easy distribution:
PeekableIterator: an iterator wrapper that simplifies certain algorithms
Struct: a better interface to the struct module
Subfile: a file-like interface for a file byte range (cf. email._Subfile)
bin_to_hex, hex_to_bin: these aren't already in the stdlib?!
filehash, sha1: returns SHA-1 of file contents, or a string, respectively
merge_intersection, merge_union: incremental set operations on sorted iters
ignoring_enoent: run some code ignoring ENOENT
ITAMReader, ITAMWriter: fast sorted on-disk dictionaries stored in text files
TODO: (D means "done")
D hey, I bet we never sort the iterators of FileByteRanges, so their
merging could produce wrong results, although I've been lucky so far
because I tend to add things in sorted order; fix that!
D fix problem introduced by that bug fix: far-apart terms won't
intersect properly
D fix problem with hash hints file being grossly inefficient at load
time --- this is the last killer speed problem!
D fix ITAM to be usable for hash hints
D add doc strings :)
D make it do multimap stuff:
D go back and look slightly earlier in headwords file
D make it do tuples
D make it do atomic updates
D make HashHints use ITAM
D move ITAM into textindex.py
X augment HashHints tests to ensure that hints aren't loaded?
X perhaps measure load time for HashHints object?
D use ITAM file for HashHints
D don't load ITAM file for lookups
D fix problem with endlessly growing hash hints file
- post to kragen-hacks again at this point
- create multiple index files, merging them later if need be, rather
than just one index file.
- add logging of requests to support analyses like fraction of false
positives
- prefer looking at one short posting list (and the documents) over one
short and one very long (maybe?)
- clean up excerpting so it's not so inefficient (merge with ensure
pass); also make it produce saner output
- handle non-text files (e.g. PDFs)
- have some way to reindex a file only if it has changed (perhaps the
current SHA-1 check will suffice?)
- have some way to reindex all the files that have changed
- add fielded search
- better summaries for email messages
- rewrite in Squeak
"""
import os, sys, sets, StringIO, struct, cgitb, errno, sha, urllib, re, stat
import time
# On speed, well, it's not that fast at indexing, but the index isn't
# that big, either, and lookups are reasonably fast again, beginning to
# provide results to the questions I've posed in less than a second.
# Indexing my Safari cache directory (16232kiB in 1643 files) 500 files at a
# time took only:
# real 4m47.684s
# user 4m28.884s
# sys 0m15.704s
# (newly using 16% more time than before because of using three hash
# functions instead of two, and 7% more time than that due to ITAM)
# and all 1643 files at once:
# real 2m19.048s
# user 2m6.978s
# sys 0m5.702s
# (18% more time, or 15% less speed, because of using three hash
# functions instead of two; another 9% more time due to using a more
# complicated way of keeping track of the file hashes, which is
# nevertheless faster on lookups.)
# This is all on Beatrice's 1.83 GHz Intel Core Duo Mac. This is X
# seconds. That's X files per second or X kiB per second, or perhaps
# 30kiB per second on my old 600MHz laptop, roughly 200 times slower
# than the text-file mailbox indexer I've been using. The resulting
# index was 3700kiB, 23% of the size of the original data, and much of
# that is the fixed 1MiB overhead of the hash bucket pointers. That's a
# lot smaller than my text-file index. Most of the hash buckets are
# still empty.
# Back when I was using only two hash functions instead of three, it
# indexed 300M of email in
# real 121m13.152s
# user 118m4.057s
# sys 1m30.127s
# which is painfully slow but 45% faster than before. Rerunning the
# same index command was pretty quick to recognize that it didn't need
# to index anything, but it took 5 minutes to write the index
# repeatedly, a bug that has since been fixed.
# Writing a new index is really slow, too. Up to several minutes. One
# solution is to write the index-mangling code in C so as to be able to
# do it in a few minutes; maybe a better solution is to support multiple
# index segments, like Lucene, and merge them when there are several
# large index segments of around the same size.
# A little profiling of the self-test says:
# ncalls tottime percall cumtime percall filename:lineno(function)
# 262157 11.220 0.000 32.080 0.000 textindex.py:345(binbucket)
# 5 11.110 2.222 87.680 17.536 textindex.py:382(write_index)
# 524310 10.750 0.000 19.100 0.000 textindex.py:119(read)
# 327701 9.440 0.000 46.890 0.000 textindex.py:367(read_bucket)
# 327713 7.140 0.000 54.030 0.000 textindex.py:223(merge_union)
# but this is of dubious value, because the self-test normally runs in
# 24 seconds total, and the indices in the test are nearly empty.
### Serialization:
def unpacknums(strbuf):
"""Turns a binary string into an iterator of unsigned integers.
Partial inverse of packnum.
"""
rv = 0
for c in strbuf:
n = ord(c)
rv = (rv << 7) + (n & 0x7f)
if n & 0x80: # high bit set: terminates
yield rv
rv = 0
assert rv == 0
def packnum(n):
"""Turns an unsigned integer into a self-terminating binary string.
Partial inverse of unpacknums; concatenate packnum results without
terminators.
"""
assert n >= 0 # won't terminate otherwise
rv = []
while 1: # at least one byte
rv.append(n & 0x7f)
n >>= 7
if n == 0: break
rv[0] |= 0x80
rv.reverse()
return ''.join(map(chr, rv))
def delta_decode(nums):
"""Turns a list of numbers into a list of partial sums.
Inverse of delta_encode.
"""
n = 0
for num in nums:
n += num
yield n
def delta_encode(nums):
"""Turns a list of partial sums into a list of numbers.
Or, to look at it another way, turns a list of numbers into a list of
differences between them. Inverse of delta_decode.
"""
last = 0
for num in nums:
yield num - last
last = num
class Struct:
"""A more convenient interface to the struct module."""
def __init__(self, fmt): self.fmt = fmt
def calcsize(self):
"""See struct.calcsize for documentation."""
return struct.calcsize(self.fmt)
def unpack(self, data):
"""See struct.unpack for documentation."""
return struct.unpack(self.fmt, data)
def pack(self, *data):
"""See struct.pack for documentation."""
return struct.pack(self.fmt, *data)
# for file-like objects:
def read(self, fo):
"""Read this data structure from a file-like object and unpack."""
data = fo.read(self.calcsize())
if not data: raise EOFError(fo)
return self.unpack(data)
def write(self, fo, *data):
"""Write this data to this file-like object."""
fo.write(self.pack(*data))
### tests for serialization stuff
def ok(a, b):
"""From Perl's Test::Simple IIRC. Verbosely assert a == b."""
assert a == b, (a, b)
def test_lists():
"""Unit tests for delta_encode, delta_decode, packnum, unpacknums."""
def testlist(mylist):
"""Small utility function for testing delta_encode, unpacknums, etc."""
ok(list(delta_decode(delta_encode(mylist))), mylist)
ok(list(unpacknums(''.join(map(packnum, mylist)))), mylist)
testlist([3, 5, 6, 10, 215])
testlist([0, 1, 2])
testlist([0, 0, 1, 2])
tmp = Struct('>q')
ok(tmp.calcsize(), 8)
ok(tmp.unpack(tmp.pack(1380)), (1380,))
ok(tmp.read(StringIO.StringIO(tmp.pack(1380))), (1380,))
test_lists()
### Generic file management
def ignoring_enoent(thunk):
"""Run a thunk and ignore IOError with ENOENT.
Returns None in case of ENOENT, otherwise the return value of the
thunk. Propagates any exceptions other than IOError, and any
IOError for a reason other than ENOENT.
"""
try: return thunk()
except IOError, e:
if e.errno == errno.ENOENT: return
else: raise
### docid management
page_no_representation = Struct('>L')
def make_doc_hash(sha1, page_no = 0):
"""Given a SHA-1 hash and a 'page number', return a 16-byte 'hash'.
The doc ID store (DocIds, .doc_id_store) uses this 'hash'; HashHints
just wants the first 12 bytes.
10 bytes gives birthday collisions prob. 50% at 2^48 documents, and
at 2^38 documents, probability is very small; 2^38 documents is an
exabyte or a little less.
Each 'page' is 4096 words; the first page is 0; large documents are
divided into pages to make the Bloom-filter-like data structure work
better.
"""
return sha1[:12] + page_no_representation.pack(page_no)
def parse_doc_hash(ahash):
"""Given a 16-byte hash, split into document uniquer and page no."""
return ahash[:12], page_no_representation.unpack(ahash[12:])[0]
class DocIds:
"""Interface to the list of SHA-1 hashes* stored in the index file.
* actually we only store the first 12 bytes of the SHA-1, plus four
bytes of "page number".
To keep the index small, we don't store filenames in the hash
buckets of the index; instead, we store indices into this list. We
only ever append to this list. There's another file that maps SHA-1
hashes to filenames and offsets.
"""
hashsize = 16
def __init__(self, fo):
"""fo is a file-like object from which the hashes can be read."""
self.hashes = []
while 1: # read to EOF
next_hash = fo.read(self.hashsize)
if not next_hash: return
self.hashes.append(next_hash)
def __getitem__(self, ahash):
"""Return the doc_id for the given binhash, or raise KeyError."""
assert len(ahash) == self.hashsize
try: return self.hashes.index(ahash)
except ValueError: raise KeyError(ahash)
def add_hash(self, ahash):
"""Add a new hash to the document ID database, returning new doc_id."""
assert ahash not in self
self.hashes.append(ahash)
return self[ahash] # infinite recursion here would suggest a bug
def __contains__(self, ahash):
"""Test for the existence of a hash in the list."""
return ahash in self.hashes
def hash(self, doc_id):
"""Return the 16-byte binary hash for the given doc_id."""
return self.hashes[doc_id]
def write(self, fo):
"""Write a representation that can be read back."""
for ahash in self.hashes:
assert len(ahash) == self.hashsize, (len(ahash), ahash)
fo.write(ahash)
def remove(self, doc_id):
"""Forget the hash for document number doc_id."""
self.hashes[doc_id] = 'deleted document'
### File byte range stuff
class Subfile:
"""Similar to email._Subfile: a file-like object for part of a 'r' file.
The underlying file-like object must support .seek(pos) and
.read(length), both of which need to speak bytes. Supports pretty
much exactly the same protocol.
If email._Subfile were documented, I probably wouldn't have written
this.
"""
def __init__(self, fo, start, end):
"""fo: the file-like object to represent a part of.
start: the offset where this subfile should begin.
end: the offset where this subfile should end.
"""
self.fo, self.start, self.end = fo, start, end
self.pos = start
def read(self, length):
"""Read specified number of bytes.
Will likely affect current file position in underlying file object.
"""
self.fo.seek(self.pos)
length = min(self.end - self.pos, length)
rv = self.fo.read(length)
self.pos += len(rv)
return rv
def seek(self, pos):
"""Similar to file.seek, but doesn't support "whence".
Sets current position to "pos" bytes from the start of the
subfile.
"""
self.pos = self.start + pos
def tell(self):
"""Similar to file.tell: current pos in bytes from subfile start."""
return self.pos - self.start
class FileByteRange(object):
"""Similar to Subfile, but corresponds to a filename, not an open file."""
# In retrospect this may not have been such a hot idea: a few lines
# of code were simplified slightly, and interactive use is a little
# easier, in exchange for the added complexity of 23 more lines of code.
__slots__ = ['fname', 'start', 'end']
def __init__(self, fname, start, end):
"""start and end are byte offsets in the file named fname."""
self.fname, self.start, self.end = fname, start, end
def open(self):
"""Return a Subfile."""
return Subfile(file(self.fname), self.start, self.end)
def __repr__(self):
return 'FileByteRange(fname=%r, start=%r, end=%r)' % self.as_tuple()
def __str__(self): return '%s %d:%d' % self.as_tuple()
def as_tuple(self): return self.fname, self.start, self.end
def __cmp__(self, other):
return (cmp(type(self), type(other)) or
cmp(self.as_tuple(), other.as_tuple()))
def __hash__(self): return hash(self.as_tuple()) ^ 1
def __iter__(self):
"""Unpack like a (fname, start, end) tuple."""
return iter(self.as_tuple())
### ITAM files --- used for the hash hints file
# These are like ISAM files, but are text and not updatable in place, although
# they are atomically updatable.
# The strategy for atomic updates is as follows:
# The basic ITAM file itself gets atomically renamed into place; it may have a
# headwords file, also an ITAM file.
# Each ITAM file begins with "ITAM; headwords would be in:
# foo.32021.headwords\n", where 32021 is a somewhat random number that will
# hopefully be different for each update. Then, we write all our headwords
# files first, then rename ourselves into place, then delete all the old
# headwords files. Anyone who started reading the old file would use the old
# headwords files, although there's a race condition where they might get
# deleted first --- which could result in the other guy's lookup taking a long
# time. This is undesirable, but I think it's acceptable.
class ITAMWriter:
"""Indexed text access method, writer class.
Stores a string->(tuple of strings) multimap in a sorted text file
(with potentially some other files indexing it) for quick retrieval.
Not updatable; you have to write the whole file again if you want to
update it.
"""
def __init__(self, fname, skipsize=10240):
"""fname is the base filename to store data in.
skipsize is the maximum number of bytes for the reader to have to
read from any level of the hierarchy. Generally this should
be around the number of bytes that the reader can decode in
one disk seek time; if it's too small, there will be lots of
levels of hierarchy and the disk will spend lots of time
seeking, but it being small enough is what keeps us from
spending lots of CPU time on the problem. In the tests,
skipsize=10240 lets the primary file get to 20 megabytes
with only three levels of hierarchy.
"""
self.fname = fname
self.skipsize = skipsize
self.headwords = None
year = 86400 * 365
self.unique_id = '%s.%s' % (int(time.time() * 100) % year, os.getpid())
self.outfile = file(fname + '.new', 'w')
self.written_bytes = 0
self.last_bytes_in_headwords = 0
self.prev_key = ''
self.write_bytes('ITAM; headwords would be in: %s\n' %
self.headwords_fname())
self.files_to_delete = []
try: reader = ITAMReader(self.fname)
except: pass
else: self.files_to_delete = reader.all_fnames()
def headwords_fname(self):
"""The name for my headwords file."""
return '%s.%s.headwords' % (self.fname, self.unique_id)
def encode(self, key, values):
"""Return a string to represent key->values in the file."""
parts = map(urllib.quote, [key] + list(values))
for item in parts:
assert '\n' not in item
assert ' ' not in item
return '%s\n' % ' '.join(parts)
def open_headwords(self):
"""Start writing to a headwords file.
This isn't done in __init__ because we only need a headwords
file if we write more than self.skipsize bytes of data.
"""
self.headwords = ITAMWriter(self.headwords_fname())
def write_bytes(self, outstr):
"""Write some bytes to the outfile."""
self.written_bytes += len(outstr)
self.outfile.write(outstr)
def write(self, key, *values):
"""Put a key->(values) pair into the file.
Needs to be called in order on all the keys.
"""
assert key >= self.prev_key, (key, self.prev_key)
self.prev_key = key
old_bytes = self.written_bytes
self.write_bytes(self.encode(key, values))
if self.written_bytes - self.last_bytes_in_headwords > self.skipsize:
if not self.headwords: self.open_headwords()
self.headwords.write(key, str(old_bytes))
self.last_bytes_in_headwords = old_bytes
def close(self):
"""Commit the new version of the file."""
# XXX I wish I knew how to make this atomic. Using either a new
# headwords file with the old main file or vice versa could
# undetectably give bad results.
# XXX I know how. Make up funky names for the headwords file,
# and put them into the main file. Then we just have to
# manually delete the old headwords files when we get done
# writing. Simultaneous updates will lose one of the new sets
# of updates and will leave small trash files lying around, but
# won't lose both sets of updates.
self.outfile.close()
os.rename(self.fname + '.new', self.fname)
if self.headwords: self.headwords.close()
self.outfile = None
for fname in self.files_to_delete:
if os.path.exists(fname) and fname != self.fname: os.unlink(fname)
class ITAMReader:
"""Indexed text access method, reader.
Retrieves items from files written by ITAMWriter, either by key or
by iterating over the whole file.
"""
def __init__(self, fname):
"""fname is the name of the file containing the data."""
self.fname = fname
self.infile = file(self.fname)
firstline = self.infile.readline()
magic = 'ITAM; headwords would be in: '
assert firstline.startswith(magic), firstline
self._headwords_fname = firstline[len(magic):-1]
self.tuples_start_at = len(firstline)
if os.path.exists(self.headwords_fname()):
self.headwords = ITAMReader(self.headwords_fname())
else: self.headwords = None
self.decodes = 0
def headwords_fname(self):
"""The filename under which our headwords are stored."""
return self._headwords_fname
def all_fnames(self):
"""All the filenames storing information --- for deletion."""
if self.headwords: return [self.fname] + self.headwords.all_fnames()
else: return [self.fname]
def decode(self, line):
"""Decode a line read from the file into a tuple, including key."""
assert line.endswith('\n')
rv = tuple(map(urllib.unquote, line[:-1].split(' ')))
self.decodes += 1
return rv
def find(self, key):
"""Find the beginning of 'key' in the file.
Returns the last tuple in the file less than the key, the first
tuple in the file after that (or a dummy tuple at EOF), and an
iterator over any remaining tuples.
Not really intended for external use; interface is too awkward.
You probably want .values_for_key().
The iterator is ready to iterate over the tuples with the key,
if any, and then the tuples with greater keys.
"""
pos = self.tuples_start_at
if self.headwords:
headtuple, _, _ = self.headwords.find(key)
assert headtuple[0] < key or headtuple[0] == ''
if len(headtuple) > 1: # headtuple may be ('',)
pos = int(headtuple[1])
else: pos = self.tuples_start_at
prev_tuple = ('',)
iterator = self.iterate_from_offset(pos)
for cur_tuple in iterator:
found_key = cur_tuple[0]
if found_key >= key: return prev_tuple, cur_tuple, iterator
prev_tuple = cur_tuple
assert prev_tuple[0] <= key, (prev_tuple, key)
return prev_tuple, ('',), iterator # it's beyond EOF
def __getitem__(self, key):
"""Return a tuple from the file with key 'key', or raise KeyError."""
found_tuple, result_tuple, iterator = self.find(key)
if result_tuple[0] == key: return result_tuple[1:]
else: raise KeyError(key)
def values_for_key(self, key):
"""Iterate over the tuples attached to 'key'."""
_, first_tuple, iterator = self.find(key)
if first_tuple[0] != key: return
yield first_tuple[1:]
for each_tuple in iterator:
if each_tuple[0] != key: return
yield each_tuple[1:]
def iterate_from_offset(self, byte_offset):
"""Iterate over the tuples in the file, starting at specified offset.
This is not really intended for external use.
"""
while 1:
self.infile.seek(byte_offset)
line = self.infile.readline()
byte_offset += len(line)
if not line: return
yield self.decode(line)
def __iter__(self):
"""Iterate over all tuples in the file, in sorted order."""
return self.iterate_from_offset(self.tuples_start_at)
def total_decodes(self):
"""Return the total number of tuples decoded by this reader.
Includes tuples decoded in the headwords files.
"""
if self.headwords:
return self.headwords.total_decodes() + self.decodes
else: return self.decodes
def filesize(fname):
"""Return the number of bytes in the named file."""
return os.stat(fname)[stat.ST_SIZE]
def assert_ioerror(thunk):
"""Assertion failure if thunk doesn't raise IOError when run."""
try: x = thunk()
except IOError: return
else: assert 0, x
def test_itam():
"""Very slow unit tests (30s) for ITAM."""
fname = 'tmp.itam.%s' % os.getpid()
try:
### basic tests
assert_ioerror(lambda: ITAMReader(fname))
writer = ITAMWriter(fname)
assert_ioerror(lambda: ITAMReader(fname))
for ii in range(1000):
writer.write('%04d' % ii, '%s%d' % ('x' * 100, ii))
writer.write('1000', 'has', 'three', 'tuple items')
assert_ioerror(lambda: ITAMReader(fname))
writer.close()
reader = ITAMReader(fname)
val32 = 'x' * 100 + '32'
ok(reader['0032'], (val32,))
val31 = 'x' * 100 + '31'
find_rv = ('0031', val31), ('0032', val32)
ok(reader.find('0032')[0:2], find_rv)
ok(reader.find('00320')[0:2],
(('0032', val32), ('0033', 'x' * 100 + '33')))
ok(reader.find('003a')[0], ('0039', 'x' * 100 + '39'))
ok(reader['0032'], (val32,))
ok(reader['0999'], ('x' * 100 + '999',))
ok(reader['1000'], ('has', 'three', 'tuple items'))
assert 105000 <= filesize(fname) <= 210000, filesize(fname)
hwsize = filesize(reader.headwords_fname())
assert 30 <= hwsize <= 300, hwsize
assert not os.path.exists(reader.headwords.headwords_fname())
x = iter(reader)
ok(x.next(), ('0000', 'x' * 100 + '0'))
ok(x.next(), ('0001', 'x' * 100 + '1'))
ok(x.next(), ('0002', 'x' * 100 + '2'))
ok(len(list(x)), 998)
### test with big files
writer = ITAMWriter(fname)
for ii in range(200000):
writer.write('%06d' % ii, '%s%d' % ('y' * 100, ii))
writer.close()
assert 21000000 <= filesize(fname) <= 42000000, filesize(fname)
hwsize = filesize(writer.headwords_fname())
assert 10000 <= hwsize <= 40000, hwsize
assert filesize(writer.headwords.headwords_fname()) < 10000
### efficiency (if not speed) test
for x in [3, 184, 121203, 123456, 199121, 35005, 135243, 108323,
24262, 170322]:
#start_time = time.time()
reader = ITAMReader(fname)
ok(reader['%06d' % x], ('y' * 100 + str(x),))
#print "time used", time.time() - start_time, reader.total_decodes()
# absolute maximum should be 1 in .hw.hw, 10240/(6+1+5+1) +
# 1 = 788 in .hw, and 10240/(6+1+100+1) + 1 = 95 in main
# file, total 884. Worst I've seen is actually 751 (for
# 121203), and that took 19 ms, which was also the longest time.
assert reader.total_decodes() < 900, reader.total_decodes()
### multiple values for one key
writer = ITAMWriter(fname)
for ii in range(1000): # make the file big so we have to use headwords
writer.write('%03d' % ii, 'x' * 100)
writer.write('beatrice', 'fact')
writer.write('beatrice', 'internet archive')
writer.write('ben', 'bittorrent')
writer.write('kragen', 'airwave')
writer.write('kragen', 'commercenet')
writer.write('marilyn', 'mvchd')
writer.close()
# if there were a .headwords.headwords file left over from
# earlier, it could potentially cause trouble
assert not os.path.exists(writer.headwords.headwords_fname())
reader = ITAMReader(fname)
ok(list(reader.values_for_key('beatrice')),
[('fact',), ('internet archive',)])
ok(list(reader.values_for_key('kragen')),
[('airwave',), ('commercenet',)])
ok(list(reader.values_for_key('ben')), [('bittorrent',)])
finally:
for base_fname in reader.all_fnames():
for fname in [base_fname, base_fname + '.new']:
if os.path.exists(fname): os.unlink(fname)
### managing the hash hints file, which tells us what files have which hash
# There must be an existing way to do this in the standard library!
char_to_hex = {}
hex_to_char = {}
def init_hexchars():
"""Runs at import time to initialize char_to_hex and hex_to_char."""
for ii in range(256):
hexval = '%02x' % ii
char_to_hex[chr(ii)] = hexval
hex_to_char[hexval] = chr(ii)
init_hexchars()
def bin_to_hex(astring):
"""Convert a binary string to hexadecimal."""
return ''.join(map(char_to_hex.get, astring))
def hex_to_bin(hexstring):
"""Convert a hexadecimal string to binary."""
return ''.join(map(hex_to_char.get, re.findall('..', hexstring)))
ok(bin_to_hex('\r\n'), '0d0a')
ok(hex_to_bin(bin_to_hex('we are here')), 'we are here')
class HashHints:
"""Keeps track of where to find data with certain hashes.
Stores its data in a text file on the disk: hash, filename, start
and end offsets. "Hints" because it's possibly out of date.
"""
hashsize = 12 # only the first 12 bytes, not the page number
def __init__(self, fname):
"""fname: the name of the file that contains the data."""
self.fname = fname
self.hints = None
self.current_hash = None
def read_file(self):
"""Initialize state from contents of self.fname."""
start_time = time.time()
self.hints = {}
self.current_hash = {}
for ahash, start, end, fname in ITAMReader(self.fname):
br = FileByteRange(fname, int(start), int(end))
self.add_hash(hex_to_bin(ahash), br)
#print 'wasted', time.time() - start_time
def read_file_if_necessary(self):
"""Read the whole file from disk if we haven't already."""
if self.hints is None: ignoring_enoent(self.read_file)
def add_hash(self, ahash, byte_range):
"""Call to indicate the specified byte range has hash 'ahash'."""
# we don't write to the file until .flush()
self.read_file_if_necessary()
assert len(ahash) == self.hashsize
self.hints.setdefault(ahash, []).append(byte_range)
filename, start, end = byte_range
try: oldhash = self.current_hash[filename][start, end]
except KeyError: pass
else: self.hints[oldhash].remove(byte_range)
self.current_hash.setdefault(filename, {})[start, end] = ahash
def __getitem__(self, ahash):
"""Return which filename,start,end triplets had hash 'ahash'."""
assert len(ahash) == self.hashsize
if self.hints is not None:
return self.hints.get(ahash, [])
else:
rv = ignoring_enoent(lambda: self.read_from_file(ahash))
if rv is None: return []
else: return rv
def read_from_file(self, ahash):
"""Read relevant data from underlying data file.
May raise IOError, and won't return anything from memory. Not public.
"""
rv = []
reader = ITAMReader(self.fname)
for start, end, fname in reader.values_for_key(bin_to_hex(ahash)):
rv.append(FileByteRange(fname, int(start), int(end)))
return rv
def flush(self):
"""Atomically update the file on the filesystem."""
self.read_file_if_necessary()
outfo = ITAMWriter(self.fname)
keys = self.hints.keys()
keys.sort()
for key in keys:
hexkey = bin_to_hex(key) # I need some Allen screws...
for filename, start, end in self.hints[key]:
outfo.write(hexkey, str(start), str(end), filename)
outfo.close()
self.hints = None
self.current_hash = None
def hashes(self, fname):
"""Return the hashes found somewhere in fname, for deindexing purposes.
"""
self.read_file_if_necessary()
return self.current_hash.get(fname, {}).values()
def hash_no_longer_in_file(self, filename, ahash):
self.read_file_if_necessary()
new_hints = []
for byte_range in self.hints[ahash]:
if byte_range.fname != filename:
new_hints.append(byte_range)
else:
range = (byte_range.start, byte_range.end)
del self.current_hash[byte_range.fname][range]
self.hints[ahash] = new_hints
def filehash(fo, chunksize=32768):
"""Return the binary SHA-1 hash of the data in the file-like object fo."""
ho = sha.sha()
while 1:
data = fo.read(chunksize)
if not data: break
ho.update(data)
return ho.digest()
def sha1(astr):
"""Return the binary SHA-1 of the string."""
return sha.sha(astr).digest()
def test_hash_hints():
"""Unit tests for HashHints."""
ok(filehash(StringIO.StringIO('hello' * 10000)), sha1('hello' * 10000))
fname = 'tmp.hash_hints.%s' % os.getpid()
hh = HashHints(fname)
try:
a = '\n' * 12 # binary hash --- this test verifies it's stored
# safely, e.g. in hexadecimal
range_a = FileByteRange(fname='somefile', start=0, end=100)
hh.add_hash(a, range_a)
ok(hh[a], [range_a])
hh = HashHints(fname)
ok(hh[a], []) # didn't flush
hh.add_hash(a, range_a)
ok(hh[a], [range_a])
hh.flush()
ok(hh[a], [range_a])
# reload from the file
hh = HashHints(fname)
ok(hh[a], [range_a])
range_a2 = FileByteRange(fname='someotherfile', start=5, end=105)
hh.add_hash(a, range_a2)
ok(hh[a], [range_a, range_a2])
hh.flush()
ok(hh[a], [range_a, range_a2])
ok(hh.hashes('somefile'), [a])
# hash_no_longer_in_file
hh.hash_no_longer_in_file('someotherfile', a)
ok(hh[a], [range_a])
assert a in hh.hashes('somefile')
assert a not in hh.hashes('someotherfile')
hh.flush()
ok(hh[a], [range_a])
assert a in hh.hashes('somefile')
assert a not in hh.hashes('someotherfile')
# filename encoding
b = 'b' * 12
uglyfname = 'a,b=c d\ne!f g'
uglyrange = FileByteRange(fname=uglyfname, start=0, end=100)
hh.add_hash(b, uglyrange)
hh.flush()
ok(hh[b], [uglyrange])
hh = HashHints(fname)
ok(hh[b], [uglyrange])
# a file should only be able to have one hash at a time for the same
# byte range:
c = 'c' * 12
hh.add_hash(c, uglyrange)
hh.flush()
ok(hh[c], [uglyrange])
ok(hh[b], [])
hh = HashHints(fname)
ok(hh[c], [uglyrange])
ok(hh[b], [])
ok(hh.hashes(uglyfname), [c])
# yet should be able to have different hashes for different byte ranges:
uglyrange2 = FileByteRange(fname=uglyfname, start=100, end=200)
hh.add_hash(b, uglyrange2)
ok(hh[c], [uglyrange])
ok(hh[b], [uglyrange2])
hh.flush()
ok(hh[c], [uglyrange])
ok(hh[b], [uglyrange2])
hh = HashHints(fname)
ok(hh[c], [uglyrange])
ok(hh[b], [uglyrange2])
assert c in hh.hashes(uglyfname)
assert b in hh.hashes(uglyfname)
finally:
if os.path.exists(fname):
for fname in ITAMReader(fname).all_fnames():
if os.path.exists(fname): os.remove(fname)
if os.path.exists(fname + '.new'): os.remove(fname + '.new')
if os.path.exists(fname + '.new'): os.remove(fname + '.new')
### word extraction
def _words(fo, bufsiz=4096):
"""Iterate over the words in a file.
For each word found, yields a byte offset into the file and a match
object whose offsets are measured from that byte offset. This weird
return value is intended to support two kinds of clients without doing
unnecessary work: clients who want the byte offset of the word in
the file, and clients who just want the word.
"""
word_re = re.compile(r'[\w\x80-\xff]+')
buf = ''
start = 0
while 1: # until EOF actually
data = fo.read(bufsiz)
if not data:
for mo in word_re.finditer(buf): yield start, mo
return
buf += data
mos = list(word_re.finditer(buf))
end = len(buf)
if mos and mos[-1].end(0) == len(buf):
mo = mos.pop() # don't index partial word, but
end = mo.start(0) # don't discard it either
for mo in mos: yield start, mo
start += end
buf = buf[end:]
def words(fo, bufsiz=4096):
"""Iterate over just the words in a file."""
for start, mo in _words(fo, bufsiz): yield mo.group(0)
def words_for_excerpting(fo, bufsiz=4096):
"""Iterate over the words in a file along with their byte offsets."""
for start, mo in _words(fo, bufsiz): yield mo.group(0), start + mo.start(0)
def test_words():
"""Unit test for words()."""
f1 = 'hi there\n09 words and 3jane RULZ! there'
ok(list(words(StringIO.StringIO(f1))),
['hi', 'there', '09', 'words', 'and', '3jane', 'RULZ', 'there'])
ok(list(words_for_excerpting(StringIO.StringIO(f1))),
[('hi', 0), ('there', 3), ('09', 9), ('words', 12),
('and', f1.index('and')), ('3jane', f1.index('3jane')),
('RULZ', f1.index('RULZ')), ('there', f1.rindex('there'))])
# This test is needed to test the second return path:
ok(list(words(StringIO.StringIO('So it goes...'))), ['So', 'it', 'goes'])
# treat chars with high bit set as word chars (works for utf-8 and usu.
# latin-1)
ok(list(words(StringIO.StringIO('x\xb5y'))), ['x\xb5y'])
# make sure buffer size doesn't matter
for bufsiz in range(1, len(f1) + 1):
ok(list(words(StringIO.StringIO(f1), bufsiz=bufsiz)),
list(words(StringIO.StringIO(f1))))
ok(list(words_for_excerpting(StringIO.StringIO(f1), bufsiz=bufsiz)),
list(words_for_excerpting(StringIO.StringIO(f1))))
### Berzerkeley mailbox management
class Buf:
"""A buffer for reading a mailbox while computing SHA-1s of bits of it.
The basic problem this solves is that the logic for parsing the
mailbox file in small pieces, without using too much memory at a
time (_mboxsplit), is hairy and bug-prone already, even without
intertwining it with the logic for updating the SHA-1 hash and the
current file offset. So this object keeps track of the current file
offset and the current SHA-1 hash.
It's not particularly general.
Public instance variables:
start: the file offset of the beginning of the buffer
"""
def __init__(self, fo, blocksize):
"""fo: the mailbox.
blocksize: the size of chunks to read from it.
"""
self.fo = fo
self.start = fo.tell() # an offset into the file
self.buf = '' # bytes starting at offset .start in the file
self.h = sha.sha() # always contains SHA-1 of data up to .start
self.blocksize = blocksize
def end(self):
"""Returns the file offset of the end of the buffer."""
return self.start + len(self)
def msgend(self):
"""Return the byte offset and SHA-1 hash up to the start of the buffer.
Also resets the SHA-1 hash to be ready for the next message.
"""
try: return self.start, self.h.digest()
finally: self.h = sha.sha()
def grow(self):
"""Appending more data read from the file or return false.
Used when the caller needs more data in the buffer to figure out
what to do next.
"""
# I imagine this to be expensive, so it's disabled:
#tell = self.fo.tell()
#assert tell == self.start + len(self), (tell, self.start, len(self))
data = self.fo.read(self.blocksize)
self.buf += data
return data
def __len__(self):
"""Returns the amount of data currently in the buffer."""
return len(self.buf)
def find(self, astr):
"""Returns the byte offset (in the file) of the string, if in buffer.
Otherwise, returns -1.
"""
rv = self.buf.find(astr)
if rv == -1: return -1
return self.start + rv
def consume_to(self, pos):
"""Throw away data up to (file offset) pos.
"""
pos -= self.start
assert 0 <= pos <= len(self), (self.start, pos, len(self))
self.h.update(self.buf[:pos])
self.buf = self.buf[pos:]
self.start += pos
def _mboxsplit(fo, blocksize):
"""Parse a mailbox, yielding (offset, SHA1) pairs.
The SHA-1 is binary, and is the SHA-1 of the data since the last
offset, or beginning of file. Yields one pair for every new message
start (including one at the beginning of the file, if the mailbox
starts with a message, as is usual for mbox files) plus a pair at
EOF. So a file with N messages will yield N+1 pairs.
"""
b = Buf(fo, blocksize)
sep = 'From '
while len(b) < len(sep):
if not b.grow(): # unlikely!
b.consume_to(b.end())
yield b.msgend()
return
if b.find(sep) == 0: yield b.msgend()
sep = '\n' + sep
while 1:
while len(b) >= len(sep):
pos = b.find(sep)
if pos == -1:
b.consume_to(b.end() - len(sep) + 1)
assert len(b) < len(sep)
else:
b.consume_to(pos + 1) # consume '\n' too
yield b.msgend()
if not b.grow():
b.consume_to(b.end())
yield b.msgend()
return
def mboxsplit(fo, blocksize=4096):
"""Yield (start_byte_offset, end_byte_offset, sha1) per message.
Parses mbox file 'fo'.
"""
msgs = _mboxsplit(fo, blocksize)
start, _ = msgs.next()
for end, hash in msgs:
yield start, end, hash
start = end
def test_mbox():
"""Unit tests for mboxsplit and friends."""
uglier_mbox = '''leading garbage\nFrom somebody
some text From me\nfrom you
more text\nFrom somebody else
yet more text
'''
pm = list(_mboxsplit(StringIO.StringIO(uglier_mbox), blocksize=3))
ok(pm[0][0], uglier_mbox.index('From somebody'))
ok(pm[1][0], uglier_mbox.index('From somebody else'))
ok(pm[2][0], len(uglier_mbox))
ok(len(pm), 3)
first_msg = '''From me\nTo: you\n\nhi\n''' + '\n' * 100000
second_msg = '''From you\nwho cares'''
saner_mbox = first_msg + second_msg
ok(list(_mboxsplit(StringIO.StringIO(saner_mbox), blocksize=3)),
[(0, sha1('')), (len(first_msg), sha1(first_msg)),
(len(saner_mbox), sha1(second_msg))])
ok(list(_mboxsplit(StringIO.StringIO(saner_mbox), blocksize=3000)),
[(0, sha1('')), (len(first_msg), sha1(first_msg)),
(len(saner_mbox), sha1(second_msg))])
ok(list(mboxsplit(StringIO.StringIO(saner_mbox), blocksize=3)),
[(0, len(first_msg), sha1(first_msg)),
(len(first_msg), len(saner_mbox), sha1(second_msg))])
mb = mboxsplit(StringIO.StringIO(uglier_mbox))
ii = iter(mb)
first_msg_start, first_msg_end, sha1hash = ii.next()
assert uglier_mbox[first_msg_start:].startswith('From somebody\n')
assert uglier_mbox[first_msg_end:].startswith('From somebody else\n')
next_msg_start, next_msg_end, sha1hash = ii.next()
ok(next_msg_start, first_msg_end)
ok(list(_mboxsplit(StringIO.StringIO('asdf'), 1)), [(4, sha1('asdf'))])
desired = list(mboxsplit(StringIO.StringIO(uglier_mbox)))
for ii in range(1, len(uglier_mbox)):
ok(list(mboxsplit(StringIO.StringIO(uglier_mbox), blocksize=ii)),
desired)
### Merging sorted lists to get union or intersection.
class PeekableIterator:
"""A wrapper for an iterator that behaves identically, but with peek.
Merge algorithms are a lot easier if you can peek at the next value
in each stream. Using this class reduced my merge_union (the kind
of merge you use in mergesort) from 42 lines down to 14. You wrap
an iterator with this class and get the same iterator, except with
added functionality.
There are two public instance variables:
.empty tells whether there is a .peek,
.peek tells what you will get the next time you call .next().
It's guaranteed that .next() will raise StopIteration iff .empty,
and there is a .peek iff not .empty.
"""
def __init__(self, aniter):
self.aniter = aniter
self.empty = False
self.peek = None
self.next()
def __iter__(self): return self
def next(self):
if self.empty: raise StopIteration
rv = self.peek
try: self.peek = self.aniter.next()
except StopIteration:
self.empty = True
del self.peek
return rv
def merge_union(itera, iterb):
"""Return the union of two sorted iterators, as a sorted iterator."""
a, b = PeekableIterator(iter(itera)), PeekableIterator(iter(iterb))
while 1:
if a.empty and b.empty: return
if a.empty: yield b.next()
elif b.empty: yield a.next()
elif a.peek < b.peek: yield a.next()
elif a.peek > b.peek: yield b.next()
else:
b.next()
yield a.next()
def merge_intersection(itera, iterb):
"""Return the intersection of two sorted iterators, as a sorted iterator."""
a, b = PeekableIterator(iter(itera)), PeekableIterator(iter(iterb))
while 1:
if a.empty or b.empty: return
if a.peek < b.peek: a.next()
elif a.peek > b.peek: b.next()
else:
b.next()
yield a.next()
# Each of these merge union tests (except the first one) actually used to test a
# different code path. How gross is that?
ok(list(merge_union([], [])), [])
ok(list(merge_union([], [1, 2, 3])), [1, 2, 3])
ok(list(merge_union([1, 2, 3], [])), [1, 2, 3])
ok(list(merge_union([0, 1, 3], [0, 2, 6])), [0, 1, 2, 3, 6])
ok(list(merge_union([0, 2, 6], [0, 1, 3])), [0, 1, 2, 3, 6])
ok(list(merge_union([0, 2, 6], [0, 1, 2])), [0, 1, 2, 6])
ok(list(merge_union([0, 1, 2], [0, 2, 6])), [0, 1, 2, 6])
ok(list(merge_intersection([0, 2, 4], [1, 2, 3, 4])), [2, 4])
### Hash inverted index file management.
gethash = Struct(">L").unpack
def hashes(ob):
"""Need multiple hash functions to make this work. Here they are.
"""
bighash = sha1(ob)
return (gethash(bighash[0:4])[0], gethash(bighash[4:8])[0],
gethash(bighash[8:12])[0])
class Index:
"""Interface to a single index."""
nbuckets = 131072
header_offset_struct = Struct('>q')
def __init__(self, fname, hashfuncs):
"""fname is the name of the index file.
hashfuncs is a hash function returning a tuple of hash values.
"""
self.hashes = hashfuncs
self.fname = fname
self.header_offset_size = self.header_offset_struct.calcsize()
self.clear()
def remove_doc_hash(self, bare_hash):
"""Remove all postings for the specified document bare hash.
bare_hash is the 12-byte chunk of SHA-1, not the 16-byte thing
including the page number.
No-op if the document hash isn't known.
Does it lazily, and doesn't remove the postings added in this object
since the last flush --- I don't expect that will be a problem, but
I thought I should document the stupidity here.
"""
page_no = 0
while 1:
ahash = make_doc_hash(bare_hash, page_no)
try: doc_id = self.doc_id_store()[ahash]
except KeyError: return
self.removed_doc_ids.add(doc_id)
self.doc_id_store().remove(doc_id)
page_no += 1
def add(self, doc_id, word):
"""Add a posting to be written out later."""
for bucketnum in self.bucketnums(word):
try: bucket = self.additions[bucketnum]
except KeyError:
self.additions[bucketnum] = bucket = [doc_id]
self.pending_postings += 1
else:
if bucket[-1] != doc_id:
bucket.append(doc_id)
self.pending_postings += 1
def flush(self):
"""Write index updates back to the filesystem."""
# note race condition on two concurrent updates:
newfile = self.fname + '.new'
newfo = file(newfile, 'wb')
self.write_index(newfo)
newfo.close() # unnecessary in RefCountedLand, but...
# Atomic rename for atomic index update.
os.rename(newfile, self.fname)
self.clear()
def clear(self):
"""Forget any pending updates."""
self.fo = None
self._doc_id_store = None
self.additions = {}
self.removed_doc_ids = sets.Set() # new in 2.3
self.pending_postings = 0
def sorted_additions(self, bucketnum):
"""The doc_ids currently pending addition to this bucket number.
In order, as implied by the name.
We could be lazier about this if we used a heap, but in the current
program, we never do anything but write this whole list out to the
index, so that would be premature optimization.
"""
try: rv = list(sets.Set(self.additions[bucketnum]))
except KeyError: return []
rv.sort()
return rv
def doc_ids(self, bucketnum):
"""Return the doc_ids for this bucket number.
This includes any pending updates as well as those from the file.
"""
if self.additions.has_key(bucketnum):
return merge_union(self.read_bucket(bucketnum),
self.sorted_additions(bucketnum))
else: return self.read_bucket(bucketnum)
def bucket_end_offset(self, fo, bucketnum):
"""Read the index file fo to tell what byte offset the bucket ends."""
fo.seek(bucketnum * self.header_offset_size)
(end_offset,) = self.header_offset_struct.read(fo)
return end_offset
def binbucket(self, fo, bucketnum):
"""The binary string containing the doc_ids in the index file.
This may be of zero length, or it may be quite large. It won't
contain currently pending additions.
"""
# So here, to find the length of the bucket, I need the starts of two
# buckets. So I need a special case either for the first bucket or for
# the last bucket, or I need nbuckets+1 items in the header. I chose
# to have a special case for the first bucket, and store the offsets of
# the *ends* of the buckets in the header.
if bucketnum == 0:
start_offset = self.header_offset_size * self.nbuckets
else:
start_offset = self.bucket_end_offset(fo, bucketnum - 1)
end_offset = self.bucket_end_offset(fo, bucketnum)
fo.seek(start_offset)
return fo.read(end_offset - start_offset)
def open_fo(self):
"""Open a filehandle on the index file, saving it in self.fo.
Originally this was a separate method just to facilitate the use
of ignoring_enoent, but now it has two callers: read_bucket and
doc_id_store.
"""
self.fo = file(self.fname, 'rb')
return self.fo
def read_bucket(self, bucketnum):
"""The doc_ids in bucketnum in the file, minus those pending deletion.
Deletion is processed here (rather than in .doc_ids()) so that it
won't affect newly added postings.
"""
if not self.fo:
# we ignore specifically only ENOENT here because we wouldn't want
# EPERM to wipe out the index
if not ignoring_enoent(self.open_fo): return # no index yet!
fo = self.fo
for doc_id in delta_decode(unpacknums(self.binbucket(fo, bucketnum))):
if doc_id not in self.removed_doc_ids: yield doc_id
def doc_id_store(self):
"""Return DocIds object, opening file if necessary."""
if not self._doc_id_store:
if ignoring_enoent(self.open_fo):
# seek one byte past end of last bucket
self.fo.seek(self.bucket_end_offset(self.fo, self.nbuckets-1))
self._doc_id_store = DocIds(self.fo)
else:
self._doc_id_store = DocIds(StringIO.StringIO('')) # EW!
return self._doc_id_store
def write_index(self, fo):
"""Given a filelike object, write the current index to it."""
header_offsets = []
fo.seek(self.nbuckets * self.header_offset_size)
for ii in xrange(self.nbuckets):
if (not self.additions.has_key(ii) and not self.removed_doc_ids and
self.fo):
bucketstr = self.binbucket(self.fo, ii)
else:
bucketstr = ''.join(map(packnum, delta_encode(self.doc_ids(ii))))
fo.write(bucketstr)
header_offsets.append(fo.tell())
fo.seek(0)
for item in header_offsets: self.header_offset_struct.write(fo, item)
fo.seek(header_offsets[-1]) # EOF
self.doc_id_store().write(fo) # may open and seek self.fo
def lookup(self, word):
"""Main entry point: return doc_ids that might contain 'word'."""
bucketnums = self.bucketnums(word)
rv = self.doc_ids(bucketnums[0])
for bucketnum in bucketnums[1:]:
# This builds a kind of one-sided tree of intersections, which
# results in more comparisons than necessary when there are more
# than two hash functions, but the difference is small when there
# are less than five or so
more = self.doc_ids(bucketnum)
rv = merge_intersection(rv, more)
return rv
def bucketnums(self, word):
"""Return the bucket numbers in which postings for 'word' would be."""
# note that x % y returns positive numbers in Python even when x is
# negative
return [hashval % self.nbuckets for hashval in self.hashes(word)]
def test_index():
"""Slow unit tests for Index and DocIds."""
fname = 'tmp.index.%d' % os.getpid()
try:
idx = Index(fname, hashes)
assert not os.path.isfile(fname)
di = idx.doc_id_store()
a = make_doc_hash('a' * 20)
ok(a, 'a' * 12 + '\0' * 4)
b = make_doc_hash('b' * 20)
ok(di.add_hash(a), 0)
ok(di[a], 0)
ok(di.add_hash(b), 1)
ok(di[b], 1)
ok(di[a], 0)
assert not os.path.isfile(fname)
idx.flush()
assert os.path.isfile(fname)
idx = Index(fname, hashes)
di = idx.doc_id_store()
ok(di[b], 1)
ok(di[a], 0)
did1 = idx.doc_id_store()[a]
did2 = idx.doc_id_store()[b]
c = make_doc_hash('c' * 20)
did3 = idx.doc_id_store().add_hash(c)
ok(len(sets.Set([did1, did2, did3])), 3)
ok(list(idx.lookup('foo')), [])
idx.flush()
ok(list(idx.lookup('foo')), []) # No documents added yet!
idx.add(did1, 'foo')
ok(list(idx.lookup('foo')), [did1]) # No chance for hash collision yet
idx.flush()
idx = Index(fname, hashes)
ok(list(idx.lookup('foo')), [did1])
ok(idx.doc_id_store().hash(did1), a)
assert a in idx.doc_id_store()
idx.remove_doc_hash(a)
assert a not in idx.doc_id_store()
ok(list(idx.lookup('foo')), [])
idx.flush()
# XXX note that the following assumes doc_ids are stable across
# flushes, which won't be true forever
idx = Index(fname, hashes)
ok(list(idx.lookup('foo')), [])
assert idx.doc_id_store().hash(did1) != a
assert a not in idx.doc_id_store()
idx.add(did2, 'foo')
idx.add(did3, 'foo')
idx.add(did3, 'bar')
ok(list(idx.lookup('foo')), [did2, did3])
idx.add(did3, 'foo')
ok(list(idx.lookup('foo')), [did2, did3]) # no dups!
assert did3 in list(idx.lookup('bar')) # could have did2 if collision
idx.flush()
idx = Index(fname, hashes)
ok(list(idx.lookup('foo')), [did2, did3])
idx.remove_doc_hash(c)
ok(list(idx.lookup('foo')), [did2])
idx.add(did3, 'bar')
assert did3 in list(idx.lookup('bar')) # could have did2 if collision
idx.flush()
idx = Index(fname, hashes)
if idx.bucketnums('foo') != idx.bucketnums('bar'):
ok(list(idx.lookup('foo')), [did2])
ok(list(idx.lookup('bar')), [did3])
else:
print 'hash collision, skipping some tests'
finally:
for name in [fname, fname + '.new']:
if os.path.exists(name): os.remove(name)
### Main entry point
class DB:
"""Interface to a directory containing an index of some files."""
explanation = """This directory contains a full-text index of some files.
The index is stored in the following files:
README: this text file, explaining the format of the index.
hashes: a text file containing the hexadecimal SHA-1 hashes of the
files that have been indexed, along with the start and end
offsets in that file (as ASCII decimal numbers), and the
filename, separated by spaces. The filename is URL-encoded with
'%' signs to escape strange characters (including spaces and
newlines). This file exists so that it's possible to produce
human-readable search results, rather than just spitting out
SHA-1 hashes. The first line points at a 'headwords' file that
can be used as an index for fast access into this file.
hashes.*.headwords: an index of the 'hashes' file for faster access.
Every line but the first contains a hash and a file offset (in
decimal) into the 'hashes' file where that the information for
that hash can be found. Because 'hashes' is lexicographically
sorted, this is enough information to make access to 'hashes'
rapid. The first line gives the name of a
hashes.*.headwords.*.headwords file, which may or may not exist,
for faster access to this file.
hashes.*.headwords.*.headwords: see above.
index: a hash table of 131 072 buckets. Starts with 131 072
big-endian 64-bit offsets into this file; each points one byte
past the end of the postings for the corresponding hash bucket.
The postings for the 0th hash bucket starts after the end of
this header (after the 1 048 576th byte), and the postings of
consecutive hash buckets are simply concatenated together with
no delimiters. Each word is hashed with the three hash
functions described below under "hash functions". Words in the
indexed files are hashed into these hash buckets. The postings
are merely document IDs (zero-based line numbers from the
doc_ids file) encoded as follows. First, the document IDs for a
bucket are sorted; second, each one except the first is replaced
with its difference from the one before it; third, each number
in the resulting list is encoded in a variable number of bytes,
seven bits per byte, with the last byte having the high bit set.
Accordingly [0, 1, 2, 3, 5] encodes as (hexadecimal) 80 81 81 81
82. The actual words are not stored anywhere, which makes the
results somewhat inexact. Following the end of the hash buckets
(the file offset for this location is encoded as the last 8
bytes of the first 1 048 576) are the hashes of the indexed
documents, in binary, 16 bytes each --- the first 12 bytes of
the SHA-1 hash, followed by 4 bytes encoding a "page number"
within a document --- concatenated together without delimiters
up to the end of the file. The order of hashes in this list
defines the meaning of the document IDs in the index file:
document ID 0 has the first hash in the list, document ID 1 has
the second hash, and so forth.
excerpt, cat, lookup, add, deindex, and addmbox: shell scripts to
query and modify the index.
The headers (the offsets) are stored in the same files as the
postings to make index updates atomic.
Hash functions: All three are derived from the SHA-1 digest of the
lowercase* version of the word: one from the first 32 bits,
another from the next 32 bits, and a third from the third 32
bits; from each of these sections of the SHA-1 digest, we take
the low 17 bits.
* I'm not sure what "lowercase" means.
"""
def __init__(self, dbdir):
"""dbdir is the path (relative or absolute) to store the index at."""
self.index = Index(os.path.join(dbdir, 'index'), hashes)
self.dbdir = dbdir
self.hash_hints = None
def create_if_necessary(self):
"""Create the index directory if necessary and open files.
Like open_files, this is not in __init__ because, depending on
what use is to be made of the DB object, we may want to create
the directory if it doesn't already exist --- or not.
"""
if not os.path.isdir(self.dbdir): os.makedirs(self.dbdir)
self.open_files()
def open_files(self):
"""Open files in the index directory if not already done.
Like create_if_necessary, this is not in __init__ because,
depending on what use is to be made of the DB object, we may
want to create the directory if it doesn't already exist --- or
not.
"""
if self.hash_hints is None:
self.hash_hints = HashHints(os.path.join(self.dbdir, 'hashes'))
def add(self, filename):
"""Update the indexes with the current contents of the named file."""
fo = file(filename)
self.create_if_necessary()
ahash = make_doc_hash(filehash(fo)) # sadly we read the file twice
bare_hash, page_no = parse_doc_hash(ahash) # stupid API
end = fo.tell()
if ahash in self.index.doc_id_store(): # already indexed it
# but we may have indexed it in a different file, so we add
# it to hash_hints
self.hash_hints.add_hash(bare_hash, FileByteRange(filename, 0, end))
return
self.deindex(filename)
# note that we have to add the new hash *after* we deindex ---
# otherwise deindexing will remove the new hash!
self.hash_hints.add_hash(bare_hash, FileByteRange(filename, 0, end))
doc_id = self.index.doc_id_store().add_hash(ahash)
fo.seek(0)
word_no = 0
for word in words(fo):
self.index.add(doc_id, word.lower())
word_no += 1
if word_no == 4096: # max words per page
page_no += 1
ahash = make_doc_hash(bare_hash, page_no)
doc_id = self.index.doc_id_store().add_hash(ahash)
word_no = 0
fo.close() # not really necessary in Refcounted Land
def addmbox(self, filename):
"""Update the indexes with the contents of the named mailbox file."""
fo = file(filename)
filesize = os.fstat(fo.fileno())[stat.ST_SIZE]
self.create_if_necessary()
hashes_possibly_missing = sets.Set(self.hash_hints.hashes(filename))
ii = 0
for start, end, ahash in mboxsplit(fo):
fancy_hash = make_doc_hash(ahash)
bare_hash, page_no = parse_doc_hash(fancy_hash)
hashes_possibly_missing.discard(bare_hash)
self.hash_hints.add_hash(bare_hash,
FileByteRange(filename, start, end))
if fancy_hash not in self.index.doc_id_store():
saved_pos = fo.tell()
#myfrom = Subfile(fo, start, end).read(5)
#assert myfrom == 'From ', (myfrom, ii, start, end)
doc_id = self.index.doc_id_store().add_hash(fancy_hash)
word_no = 0
# We don't use byte_range.open here because that would
# involve opening the file for every mail message.
for word in words(Subfile(fo, start, end)):
self.index.add(doc_id, word.lower())
word_no += 1
if word_no == 4096: # max words per page
page_no += 1
fancy_hash = make_doc_hash(ahash, page_no)
doc_id = self.index.doc_id_store().add_hash(fancy_hash)
word_no = 0
fo.seek(saved_pos)
yield ii, end, filesize
ii += 1
fo.close()
for each_hash in hashes_possibly_missing:
# XXX I've never verified that this is deleting the right
# things.
self.hash_hints.hash_no_longer_in_file(filename, each_hash)
if not len(self.hash_hints[each_hash]):
self.index.remove_doc_hash(each_hash)
def deindex(self, filename):
"""Remove a file from the indexes if it is indexed.
Does not remove the file from the doc_id list or the hashes
list.
"""
self.open_files()
for each_hash in self.hash_hints.hashes(filename):
self.hash_hints.hash_no_longer_in_file(filename, each_hash)
if not len(self.hash_hints[each_hash]):
self.index.remove_doc_hash(each_hash)
def flush(self):
"""Flush indexes to disk after updates."""
self.hash_hints.flush()
self.index.flush()
f = file(os.path.join(self.dbdir, 'README'), 'w')
f.write(self.explanation)
f.close()
self.write_shell_scripts()
def write_shell_scripts(self):
commands = 'excerpt cat lookup add deindex addmbox'.split()
program = os.path.abspath(__import__('__main__').__file__)
shell_script_template = """#!/bin/bash
dbdir="$(dirname "$0")"
%(program)s "$dbdir" %(command)s "$@"
"""
for command in commands:
scriptpath = os.path.join(self.dbdir, command)
f = file(scriptpath, 'w')
f.write(shell_script_template % {
'program': program,
'command': command,
})
f.close()
os.chmod(scriptpath, 0755)
def ensure(self, terms, bare_hashes):
"""Filter out documents missing a term."""
for bare_hash in bare_hashes:
for byte_range in self.hash_hints[bare_hash]:
if self.byte_range_contains(terms, byte_range):
yield byte_range
def byte_range_contains(self, some_words, byte_range):
"""Determine whether a document contains all the required words."""
fo = ignoring_enoent(byte_range.open)
if not fo: return False # deleted
termset = sets.Set(some_words)
for word in words(fo):
termset.discard(word.lower())
if not termset: return True
return False
def bare_hashes(self, term):
"""Return sorted bare hashes of documents containing term."""
rv = sets.Set()
dis = self.index.doc_id_store()
# XXX note that the following undoes all the work to keep the
# results streaming lazily out of the index:
for doc_id in self.index.lookup(term):
rv.add(parse_doc_hash(dis.hash(doc_id))[0])
rv = list(rv)
rv.sort()
return rv
def lookup(self, terms):
"""Iterate over some filenames (etc.) containing terms."""
terms = [term.lower() for term in terms]
self.open_files()
results = self.bare_hashes(terms[0])
# This results in a very one-sided intersection tree, resulting
# in extra comparisons, but that's probably OK for three or four
# terms.
for term in terms[1:]:
results = merge_intersection(results, self.bare_hashes(term))
return self.ensure(terms, results)
def excerpts(self, terms):
"""Return filenames and excerpts for specified terms."""
# Fairly gross brute force approach --- we end up reading the
# whole file twice; but it does work, more or less.
terms = [term.lower() for term in terms]
for byte_range in self.lookup(terms):
termset = sets.Set(terms)
excerpts = []
contextlen = 71
def get_excerpts():
fo = byte_range.open()
for word, startpos in words_for_excerpting(fo):
if word.lower() in termset:
termset.discard(word.lower())
saved_pos = fo.tell()
fo.seek(max(0, startpos - contextlen/2))
textexcerpt = fo.read(contextlen)
excerpts.append(textexcerpt.replace('\n', ' ')
.replace('\t', ' '))
if not termset: break
fo.seek(saved_pos)
ignoring_enoent(get_excerpts)
yield byte_range, excerpts
def test_db():
"""Really basic unit tests for DB."""
dbdir = 'tmp.db.%d' % os.getpid()
try:
# We use this source file as the sample file to index, simply
# because we can avoid writing another file and cleaning up
# later that way.
import textindex
textindex_file = textindex.__file__
# If we index the .pyc instead of the .py, 'wibble' becomes
# 'wibblei' and consequently the test fails :)
if textindex_file.endswith('pyc'): textindex_file = textindex_file[:-1]
if not os.path.exists(textindex_file):
print "Skipping db tests because %s doesn't exist" % textindex_file
return
filesize = os.stat(textindex_file)[stat.ST_SIZE]
db = DB(dbdir)
ok(list(db.lookup(['wibble'])), [])
db.add(textindex_file)
db.flush()
results = list(db.lookup(['wibble']))
ok(len(results), 1)
filename, start, end = results[0]
ok(filename, textindex_file)
ok(start, 0)
ok(end, filesize)
ok(len(list(db.lookup(['the truth']))), 0) # impossible search
ok(len(list(db.lookup(['import', 'truth']))), 1)
excerpt_results = list(db.excerpts(['import', 'truth']))
ok(len(excerpt_results), 1)
ok(excerpt_results[0][0], results[0])
excerpts = excerpt_results[0][1]
# 'import' occurs earlier in this file than 'truth', so its
# excerpt should be first
assert 'import' in excerpts[0], excerpts
assert 'truth' in excerpts[1], excerpts
# Like, maybe it would be nice to have more than one document in the
# database, and to test mailbox indexing, and to test deindexing, but
# these tests would already have caught an awful lot of embarrassing
# bugs I checked in recently.
# terms far apart in the file:
ok(len(list(db.lookup(['un' + 'usual', 'truth']))), 1)
finally:
os.spawnv(os.P_WAIT, '/bin/rm', ['rm', '-rf', dbdir])
def test_all():
"""Run tests that might take a while or write to the filesystem."""
test_words()
test_hash_hints()
test_index()
test_mbox()
test_db()
test_itam()
def main(args):
"""Command-line entry point."""
if len(args) < 2: return usage()
db = DB(args[0])
cmd = args[1]
if cmd == 'add':
for filename in args[2:]:
if not os.path.isabs(filename):
filename = os.path.abspath(filename)
print 'indexing', filename
db.add(filename)
print 'updating index'
db.flush()
elif cmd == 'addmbox':
for filename in args[2:]:
if not os.path.isabs(filename):
filename = os.path.abspath(filename)
for msgno, bytes, totalbytes in db.addmbox(filename):
percentage = int(100 * bytes / totalbytes + 0.5)
sys.stdout.write('indexed msg #%d in %s (%d%%)\r'
% (msgno, filename, percentage))
sys.stdout.flush()
# the 2^11 postings in the next line is fairly arbitrary; it's
# intended to strike a reasonable balance between working set
# size (under 30M or so) and time wasted updating indexes.
if db.index.pending_postings > 2097152:
print
print 'updating index'
db.flush()
print
print 'updating index'
db.flush()
elif cmd == 'lookup':
for byte_range in db.lookup(args[2:]): print byte_range
elif cmd == 'cat':
for byte_range in db.lookup(args[2:]):
fo = byte_range.open()
while 1:
data = fo.read(4096)
if not data: break
sys.stdout.write(data)
elif cmd == 'excerpt':
for byte_range, excerpts in db.excerpts(args[2:]):
print byte_range
for excerpt in excerpts:
print '\t', excerpt
elif cmd == 'test':
test_all()
print 'Tests OK :)'
elif cmd == 'deindex':
for filename in args[2:]:
if not os.path.isabs(filename):
filename = os.path.abspath(filename)
print 'deindexing', filename
db.deindex(filename)
print 'updating index'
db.flush()
else:
usage()
def usage():
"""Provide usage help for command-line interface."""
sys.stderr.write('''usage:
textindex.py dbdir add file1 file2..., or
textindex.py dbdir addmbox mboxfile1 mboxfile2..., or
textindex.py dbdir lookup term1 term2..., or
textindex.py dbdir excerpt term1 term2..., or
textindex.py dbdir cat term1 term2..., or
textindex.py dbdir deindex file1 file2..., or
textindex.py dbdir test\n''')
if __name__ == '__main__':
cgitb.enable(format='text')
main(sys.argv[1:])
More information about the Kragen-hacks
mailing list