[kragen at canonical.org: Re: radix-sorting rational numbers with an efficient serialization of continued fractions]
Kragen Javier Sitaker
kragen at canonical.org
Tue Oct 18 03:04:52 EDT 2011
----- Forwarded message from Kragen Javier Sitaker <kragen at canonical.org> -----
Date: Tue, 18 Oct 2011 02:21:20 -0400
From: Kragen Javier Sitaker <kragen at canonical.org>
To: Darius Bacon <darius at wry.me>
Subject: Re: radix-sorting rational numbers with an efficient serialization
of continued fractions
Message-ID: <20111018062120.GA9104 at canonical.org>
References: <20111017225537.GA1783 at canonical.org>
<201110180002.p9I02Y5H003270 at wry.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <201110180002.p9I02Y5H003270 at wry.me>
User-Agent: Mutt/1.5.20 (2009-06-14)
Status: RO
Content-Length: 2559
Lines: 48
On Mon, Oct 17, 2011 at 05:02:34PM -0700, Darius Bacon wrote:
> Only skimmed this yet, but I'll note that your E(n) appeared in
> _Managing Gigabytes_ as the 'gamma code' and is said to perform
> reasonably for encoding postings lists.
Thank you very much! I had no idea.
It looks like Elias's original gamma code uses 00001 instead of 11110 to
represent "five-bit number", which means his code isn't lexicographically
sortable, but the Managing Gigabytes version of it (on p.117, table 3.5,
"Example codes for integers") is indeed identical to my E(n). The LingPipe
implementation of Elias gamma coding is documented to use Elias's version.
src.it.unimi.dsi.mg4j.io.ByteArrayPostingList.writeGamma calls writeUnary,
which uses Elias's non-sortable version (`while( i-- != 0 ) write( 0 );`)
instead of the MG book's sortable version.
<http://www.java2s.com/Open-Source/Java-Document/Search-Engine/mg4j/it/unimi/dsi/mg4j/io/ByteArrayPostingList.java.htm>
> Delta code is a bit fancier: the number of bits used for the 'actual number'
> is encoded with the gamma code.
Right. Which one is better depends on your input distribution and your utility
function.
They're actually encoding posting-list *gaps*, i.e. the differences between
adjacent document IDs in a posting list. I suspect that Golomb coding, with
the divisor M set to (posting list length / max document id), would provide
better compression than Elias's gamma or delta codes. WP says that Golomb
coding is the optimal prefix code for alphabets following a geometric
distribution, such as the intervals between successes in a Bernoulli process.
If your document IDs are randomly assigned, then the posting list for a
particular term is guaranteed to be a Bernoulli process. The only question is
whether some nonrandom distribution might be able to provide better compression
using gamma or delta coding.
They mention Golomb coding on pp.121-2. They say it "performs only marginally
better than global γ or δ codes," then go on to talk about hybrid coding.
I think the search engine I wrote in 2006
<http://lists.canonical.org/pipermail/kragen-hacks/2006-August/000432.html>
might benefit from using Golomb coding or something similar for its
pseudo-posting-lists. It currently uses the "Altavista trick" to encode
integers in variable numbers of bytes, 7 bits per byte, with one bit per byte
used as a termination marker, but I suspect that it can probably use
substantially less space with Golomb coding.
Do you mind if I forward your mail and this one to kragen-discuss?
Kragen
----- End forwarded message -----
More information about the Kragen-discuss
mailing list