big document database

Kragen Sitaker
7 Jul 1999 15:27:44 -0000

OK, suppose we want to set up a database of the sort I described the
other week, where documents are named with a 128-bit secure hash and are
located "on the network" somewhere.

Suppose we have a system similar to the web: about a billion files,
about a billion users, each user requests an average of 100 documents a
day.  (The actual Web is an order of magnitude or so smaller than this.)

The simplest way to connect a user who wants document X with a server
that has document X is to develop a big centralized database mapping
document IDs to servers.  (Not the most reliable, mind you, but the
safest.)  How feasible is this?

Well, we need storage, retrieval, CPU, and bandwidth for the document
database queries.  (For a while I will ignore update requirements.)


Suppose each document exists on an average of three servers; suppose
that the servers can find the document given just the document ID, so
storing something like the local-path part of a URL is not needed;
suppose that the servers' names are stored as 128-bit IPv6 addresses.

Now, each document record holds an average of four 128-bit quantities:
the document's hash and its three server addresses.  This is a total of
64 bytes.

This means the document database is 64 gigabytes, which is about 64*$40 =
$2560 worth of disk.  No problem.
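
The storage arithmetic above can be checked in a few lines.  The
constants (a billion documents, three servers each, 16-byte hashes and
addresses, $40/GB disk) all come from the text:

```python
# Back-of-envelope check of the storage estimate above.
DOCS = 10**9                      # "about a billion files"
FIELDS_PER_RECORD = 4             # docid hash + three server addresses
BYTES_PER_FIELD = 128 // 8        # 128-bit quantities = 16 bytes each

record_size = FIELDS_PER_RECORD * BYTES_PER_FIELD   # 64 bytes
db_size_gb = DOCS * record_size / 10**9             # 64 GB
disk_cost = db_size_gb * 40                         # at $40/GB (1999 prices)

print(record_size, db_size_gb, disk_cost)   # → 64 64.0 2560.0
```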

We'll probably have a significant amount of overhead for indexing;
suppose we use a simple hash scheme wherein the first or last few bytes
of the document ID are used as the hash, and wherein the disk array is
divided into buckets.  Each bucket will contain a number of entries.  We
can do a simple and robust thing: store in each document record a
pointer to the next record that hashes to the same bucket, but try very
hard to put all entries that belong to a given bucket in that bucket
itself, to reduce disk accesses.  64 gigabytes is 4 billion 16-byte
chunks, so the pointer could conceivably be only 32 bits, but it would
be a better idea to let it be 64 bits.

With this scheme, we can have quick retrieval times (1 disk read on
average) if our table is less than about 90% full.
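
Here is a minimal in-memory model of that bucket scheme.  The
parameters (one-byte bucket index, four slots per bucket) are toy
assumptions for illustration; the real table would use more index bits,
and the overflow list stands in for the on-disk next-record pointer
chain:

```python
# Toy model of the hash-bucket scheme described above.  Parameters are
# illustrative assumptions, not from the text.
BUCKET_BITS = 8           # toy: first byte of the docid picks the bucket
NUM_BUCKETS = 1 << BUCKET_BITS
BUCKET_SLOTS = 4          # records per bucket before we must chain

buckets = [[] for _ in range(NUM_BUCKETS)]
overflow = []             # stands in for records chained outside their bucket

def bucket_of(docid: bytes) -> int:
    return docid[0] % NUM_BUCKETS   # "first few bytes of the ID as the hash"

def insert(docid: bytes, servers):
    b = buckets[bucket_of(docid)]
    if len(b) < BUCKET_SLOTS:
        b.append((docid, servers))         # home bucket: 1 disk read to find
    else:
        overflow.append((docid, servers))  # reached via the next-pointer chain

def lookup(docid: bytes):
    for rec_id, servers in buckets[bucket_of(docid)]:
        if rec_id == docid:
            return servers                 # hit in the home bucket
    for rec_id, servers in overflow:       # following the chain: extra reads
        if rec_id == docid:
            return servers
    return None
```

The point of trying hard to keep entries in their home bucket is
visible here: a hit in `buckets` costs one block read, while anything
in `overflow` costs at least one more.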


64 gigabytes of disk could reasonably be built out of eight 8G drives
(although, of course, mirroring would be a good idea).  But here's the
bad news -- each disk can only do about 100 random seeks a second.  That
means our eight-drive server can only handle about 800 queries a second.

A billion users each doing 100 queries per day means 1.15 million
queries per second, which is three orders of magnitude larger.
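
Checking that arithmetic, again using only figures from the text:

```python
# Query load vs. seek capacity, from the figures above.
USERS = 10**9
QUERIES_PER_USER_PER_DAY = 100
SECONDS_PER_DAY = 86400

qps = USERS * QUERIES_PER_USER_PER_DAY / SECONDS_PER_DAY
server_qps = 8 * 100                # eight drives, ~100 random seeks/sec each
servers_needed = qps / server_qps

print(round(qps), round(servers_needed))   # → 1157407 1447
```

So a seek-bound design needs on the order of 1500 such servers, which
matches the "three orders of magnitude" gap above.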

You could probably improve this by a factor of three or four by caching
stuff in memory.  But you could improve it by a much larger factor by
increasing latency drastically and sorting the retrievals -- doing a
single disk-arm sweep for each batch of retrievals.  I don't really
know how to analyze the performance gain of such a scheme.  It seems
plausible to me that it could produce an order-of-magnitude
improvement, but that's just a guess.
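
The batched-sweep idea can be sketched like this.  The "disk" here is
just a dict standing in for block reads, and the whole thing is an
illustrative assumption, not a worked-out design:

```python
# Sketch of batched retrieval: queue incoming lookups, then service each
# batch in a single sweep sorted by disk position, so the arm moves in
# one direction instead of seeking randomly.
disk = {pos: f"record-{pos}" for pos in range(100)}   # fake block device

def serve_batch(pending):
    """pending: disk positions requested since the last sweep.
    Returns (position, record) pairs in sweep order."""
    return [(pos, disk[pos]) for pos in sorted(set(pending))]

batch = [42, 7, 99, 7, 13]
print(serve_batch(batch))   # positions visited as 7, 13, 42, 99
```

The win comes from turning random seeks into mostly-sequential arm
motion, at the cost of every query waiting for its batch's sweep.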

So you might well need 100 or 1000 servers like this to deal with the
query load.  Caching might be able to reduce this drastically.


1.15 million queries per second on a single-CPU machine would mean the
CPU had about 870 ns to handle each query.  Current machines execute on
the order of one instruction per nanosecond, thus roughly 870
instructions per query.  Fulfilling the actual requests would be quite
simple: look up
how many bits we're using for hash indices today, fetch the block from
disk, loop comparing hashids (how many iterations depends on the size of
the hash buckets), then send the block out to the network.  I suspect
that all of this except for the network stuff could be done in 400
instructions or less.

This is disregarding the effect of slow RAM.
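
The per-query CPU budget, recomputed from the query rate above:

```python
# Per-query CPU budget at the full query load.
qps = 10**9 * 100 / 86400        # ~1.16 million queries/second
ns_per_query = 10**9 / qps       # nanoseconds available per query

print(round(ns_per_query))       # → 864
```

At about one instruction per nanosecond, that is roughly 860
instructions of budget, so the 400-instruction estimate leaves some
headroom for the network stack.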


1.15 million queries per second multiplied by 64 bytes of data plus 30
bytes of header per response comes to about 108 megabytes per second of
network bandwidth, which is almost a gigabit.  This is a significant
amount of bandwidth, and probably costs about ten cents a second, or
$8000 per day (wild guess based on extrapolating the cost of a T1).
But it is actually available today, and there are network subsystems
that can handle this level of performance.
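
The bandwidth and cost figures check out as follows (the ten cents a
second is the text's own wild guess, carried through):

```python
# Bandwidth and cost arithmetic from the figures above.
qps = 10**9 * 100 / 86400
bytes_per_response = 64 + 30                     # data + header
mb_per_sec = qps * bytes_per_response / 10**6    # ~109 MB/s
gbit = mb_per_sec * 8 / 1000                     # ~0.87 gigabit
cost_per_day = 0.10 * 86400                      # ten cents/sec → $8640/day

print(round(mb_per_sec), round(gbit, 2), cost_per_day)   # → 109 0.87 8640.0
```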


The limiting single-server feasibility factor is likely to be disk
speed.  The limiting cost factor is likely to be bandwidth.

It might be possible to spread this out among 100 or 1000 servers and
get a feasible system.  $80 per day or $8 per day for bandwidth is
reasonable for one person to spend.

Spreading it out among several servers is desirable anyway for
reliability.

Update requirements need to be considered.