fast mbox inverted indexing in C
Dave Long
dl at silcom.com
Fri Apr 16 20:24:16 EDT 2004
> this program often runs 2-4 times as long as it should
> because, even though the disk is capable of more read bandwidth
> than this program can use, the kernel is stupidly choosing to read
> chunks that are too small. madvise() is supposed to be able to
> solve this problem (with MADV_WILLNEED or at least
> MADV_SEQUENTIAL), but while MADV_SEQUENTIAL seemed to improve
> performance somewhat, it didn't really solve the problem.
> ...
> This is the program's biggest performance problem right now. It
> spends 30%-75% of its time waiting for the disk for no good reason.
Since you're reading sequentially,
wouldn't it be much easier to just
do explicit reads? That'd probably
give you better control over chunk
size than fiddling with mmap() and
madvise().
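
A minimal sketch of what I mean by explicit reads, not code from
maildex.c. The helper name scan_file_chunked and the chunk size are
hypothetical; the point is only that the caller picks the request size
instead of leaving it to the kernel's page-fault readahead.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: read `path` front to back in `chunk`-byte
 * requests and return the number of bytes seen, or -1 on error.
 * A real indexer would tokenize buf[0..n) where marked below. */
long long scan_file_chunked(const char *path, size_t chunk)
{
    if (chunk == 0)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char *buf = malloc(chunk);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    long long total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, chunk);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, retry the read */
            total = -1;         /* real I/O error */
            break;
        }
        if (n == 0)
            break;              /* end of file */
        /* process buf[0..n) here; note that a term may straddle
         * the boundary between two consecutive chunks */
        total += n;
    }

    free(buf);
    close(fd);
    return total;
}
```

Something like scan_file_chunked("64meg", 1 << 20) would ask for 1 MiB
per read(); the best size is a guess and would need measuring.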
I tried running the sequential lex
program given in:
"Re: faster full-text mbox indexing"
http://lists.canonical.org/pipermail/kragen-discuss/2004-March/000923.html
against a copy of maildex.c, sysgen'd
to fit on my box with:
> arena_ptr = arena = malloc(64 * 1024 * 1024);
yielding the following results:
% time ./maildex 1meg > 1m.w
1.22user 0.06system 0:01.30elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (330major+427minor)pagefaults 0swaps
% time ./a.out 1meg
1.11user 0.10system 0:01.21elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (420major+372minor)pagefaults 0swaps
% time ./maildex 64meg > 64m.w
45.29user 3.60system 6:19.64elapsed 12%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (46779major+8997minor)pagefaults 5552swaps
% time ./a.out 64meg
89.21user 20.01system 3:26.72elapsed 52%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (17966major+2458minor)pagefaults 0swaps
1) your box seems to have much cheaper
i/o than mine
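2) not hitting swap is a big win (on
64meg, maildex swapped 5552 times and
took 6:19 elapsed; a.out never swapped
and took 3:26, despite using over
twice the CPU time)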
3) in the 1meg case, I'm probably cheating
by being pickier about index terms (of
lengths between 2 and 20, starting with
an alphabetic character)
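
The term filter in 3) can be sketched roughly as below. This is an
illustration, not the lex program's actual rule: the name is_index_term
is hypothetical, and treating the 2 and 20 bounds as inclusive is an
assumption.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical predicate: accept a candidate term only if its length
 * is between 2 and 20 (assumed inclusive) and its first character is
 * alphabetic. Returns 1 to index the term, 0 to skip it. */
int is_index_term(const char *s, size_t len)
{
    if (len < 2 || len > 20)
        return 0;
    /* cast avoids undefined behavior for bytes >= 0x80 */
    return isalpha((unsigned char)s[0]) ? 1 : 0;
}
```

With this rule "ab" and "x86" are kept, while "a" (too short) and
"1ab" (leading digit) are dropped, so fewer postings get written.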
-Dave