full-text indexing with hash hints (Bloom-filter-like)

Kragen Javier Sitaker kragen at pobox.com
Thu Aug 3 03:37:01 EDT 2006


I was thinking about how I wasn't able to use my full-text mail indexer
to find a particular person's mail because their email address was
oneverylongwordwithlotsofletters at hotmail.com.  The indexer leaves such
long words out of the index, since they're usually false hits from some
binary file, and they take up a lot of space in the index.

Then it hit me: what if I didn't store the words in the index at all?
Maybe I could just store lists of postings in a fixed number of hash
buckets.  Hash collisions would retrieve extra documents, but the
indexer can examine the documents before displaying them anyway, so
that's just an efficiency issue.

The number of hash buckets just needs to be much larger than the number
of words in a typical document (in this case, a mail message).  If there
are some messages that contain a lot of words, they'll usually show up
as false hits, but that's OK as long as the number of them is small
compared to the number that would be in the posting list for the word
actually being searched for.

If you had two different indices using different hash functions, you
could take the intersection of the appropriate buckets in both indices
--- or just use the shorter bucket if the size difference were large
enough.  This technique squares the probability that a particular
false-hit message needn't be retrieved.

I'm going to get more concrete below to see if this is actually a
reasonable scheme.

(There's a roughly concurrent kragen-hacks posting implementing this
scheme, although I haven't yet gotten it to the point where I can use it
on my entire mailbox.)

Concretely, I was thinking of using 32-bit hashpjw, and using each
16-bit part of the hash to index into a separate 65536-bucket index.
Each index would have a 65536-entry header of fixed size --- probably 8
bytes per entry at the moment, so half a mebibyte --- that pointed at a
file offset in the index.  The bytes from there to the beginning of the
next bucket would encode the document IDs in sorted order --- or rather
the deltas between them --- in a variable-length format, where the byte
1xxx xxxx represented the least-significant 7 bits of a number (and
marked the end of the number), and the byte 0xxx xxxx represented seven
more significant bits.

I vaguely remember, from my current text indexer, that the index size in
its current ASCII form is about two thirds of the mailbox size, that
an average message is about 10 kilobytes, and that each posting consumes
about 11 bytes in the index file.  So a gigabyte of mail is about 
100 000 messages and has about 600 megabytes of index, which breaks down
to about 54 million postings, or about 540 per message.  In the best
case, this gigabyte would stick about 823 postings in each bucket, which
would be docIDs differing by about 121 --- so almost half of the deltas
would need two bytes, so the bucket would be about 1200 bytes long.  So
the whole index would be 2*(1200*65536 + 524288) = 158 334 976 bytes,
about a quarter of the size of my current index, which is better than I
can get with the minor tweaks I'd previously thought of.

(The figures above, I think, exclude nonsense words from base64.  They
would be worse if they didn't.)

If you looked up a word that didn't exist in the mailbox, the lookup
would hash it, fetch two header entries (two disk accesses, probably),
fetch two buckets (two more disk accesses), intersect two 823-item
sorted lists, and come up with an expected number of false hits
(823/100000)*(823/100000)*100000 = 6.77, which would then require
another 7 disk accesses and retokenization to reject.  Ten or eleven
disk accesses is slightly worse than my current Lucene-inspired scheme,
but this is sort of the worst case for a lookup (even though I've
assumed the best case for the indexing).

If this isn't the first search you've done lately, the headers can
already be in RAM (they're only twice the bandwidth-delay product on a
disk, each, so it makes some sense to cache them), and if there are
actual hits for your search, they may come before the false hits, so the
false hits don't slow down your seeing the results.  If there are a lot
of actual hits, they are almost certain to.  If you're doing a
multi-word search, the expected number of false hits drops well below 1.
These are good, because these are things that slow down my current
indexer with more disk accesses, and they are the usual case.

The real word distribution in my email is not uniform, of course, so I
think the index won't perform that well on average in real life.  But
this thought experiment suggested that it's worth trying the algorithm
in real life.


More information about the Kragen-tol mailing list