distributed peer-to-peer full-text web searching
Kragen Sitaker
kragen@pobox.com
Wed, 18 Feb 2004 01:47:00 -0500
Discussed decentralized peer-to-peer web search engines with Praveen
Sinha, Ben Sittler, and David Braginsky last night. (David seemed to
think they were a silly idea.)
This records the results of our discussion, and a few thoughts I had.
To build a full-text web search engine, you need these pieces:
* the crawler
* the index storage
* index lookup
* merging index results to evaluate boolean queries (most queries
these days include multiple terms)
* relevance and quality ranking
Bits and pieces of the solution follow
History
-------
I wrote about this in http://pobox.com/~kragen/bigdb.html several
years ago --- 1997 is the "last updated" date, but I think I started
that page in 1996. At the time, I marveled at the machine resources
AltaVista required to accomplish this task, and wondered if it were
possible to do it in a decentralized fashion. I still don't know ---
certainly Google has a hard enough time doing it in a centralized
fashion, and centralization makes a lot of the engineering problems
easier.
At the time, it didn't occur to me that web search was undergoing a
kind of reverse Malthusian catastrophe: bandwidth, storage space, and
computation per dollar had grown exponentially, with high exponents,
for a long time, while human authorship grew mostly linearly, or with
a small exponential term related to the world population. The logical
conclusion is that one day my cell phone, or my hearing aid, will be
able to crawl, index, and search the entire Web in real time.
Advantages
----------
P2P text search, if implemented well, would have a number of
advantages over the centralized search engines we see today. If your
search engine ran on your machine, you could stream in results as you
found them, rather than displaying results only after it finished
searching. If you had a say in how the search engine worked, you
wouldn't have to worry that laws in some other country might censor
the information you could find on the search engine --- as it has with
Google. You wouldn't have to worry (as much) about who saw what you
were searching for, because most of the queries your machine processed
and relayed wouldn't come from you. And you wouldn't have to worry
that your search results were tainted by bribes, as you must with
Hotbot, AltaVista, Lycos, and others.
Basic numbers
-------------
Google's index contains 6 billion items, mostly web pages; the average
web page contains about 10 kibibytes, so the Web contains perhaps 60
tebibytes of data. English contains about 100 000 different words;
the Web might contain a lot more, perhaps 1 000 000 or more. Each Web
page contains some links.
The Crawler
-----------
Download bandwidth is expanding much faster than the available text on
the Web. On a Japanese or Korean 10Mbps home DSL connection, 60
tebibytes takes maybe 6 million person-seconds to download, or 2-3
person-months, which is comparable to the mean web page lifetime. So
you could run a kinda crappy webcrawler on a single residential DSL
connection in a well-connected country, and a really good one on a
couple of dozen connections.
Software to suck down that much data that quickly poses engineering
challenges, but not insuperable ones. Heritrix looks promising.
In summary: cheap solutions are known.
Index Storage
-------------
60 tebibytes is 61 440 gibibytes, or roughly $61K of hard disk space
today. Many indexing schemes add on the order of 100% overhead to
simply storing the pages. A robust distributed index will necessarily
replicate itself, so it might occupy $300K-$1M worth of disk space.
At present, this makes it unlikely to occur by casual participation
(the way electricsheep, SETI@Home, and Napster work) unless it
provides a very tangible benefit to its participants, or unless it
gets 10K or more participants. It will take about 13 more doublings
in storage space per money to bring this index within the compass of a
single casual user. Doublings have happened every 12-18 months for
the last 15 years or so, so here's a timeline:
As we'll see later, it is likely to be beneficial to have many times
this amount of storage, in order to compensate for the inefficiencies
of distribution.
I think it's often possible to make the index much smaller, but not
always.
In summary: solutions are known, but they require significant
resources, therefore subject to first-mover advantage for a long time.
Index Lookup
------------
Looking up a term in an index held on local disk or local RAM is
pretty trivial, you can use a hash table or radix trie or whatever.
Distributing the index so you can find the piece you need isn't so
trivial, but Akamai-style consistent hashing or Kademlia will work.
Looking up a term in a distributed index generally requires you to
send your search to a machine or some machines that have the relevant
pieces of the index.
In summary: cheap solutions are known.
Merging Index Results
---------------------
Lists of documents --- even lists of high-quality documents ---
containing terms often run to thousands or hundreds of thousands of
URLs, but often intersections of those lists contain only a few
documents. Computing these intersections takes a lot of bandwidth
between the lists. With great cleverness, it can take a bit less
bandwidth. Latency and bandwidth trade off against one another here:
if you can afford to spend several round trips, you can dramatically
reduce your bandwidth usage.
Worst-case performance here matters a lot more than average-case
performance, because the queries that matter most tend to frustrate
optimizations like caching.
The baseline algorithm is simply to send a copy of the shortest
posting list to the machines that have the other, larger posting
lists, then winnow it by filtering out items that don't occur in the
larger posting lists; repeat until the results are small enough in
aggregate to send them to the client. A million-item posting list
might require 10 mebibytes to represent, and therefore take ten
seconds to send from one 10Mbps machine to another. I think posting
lists larger than that will rarely be the smallest posting list for a
query, but posting lists about that size commonly will be.
There are several ways to improve this algorithm.
- round-robin merge
The round-robin merge algorithm Dave Long suggested some years ago
allows you to use a fraction of the bandwidth in exchange for more
round trips. Supposing we use URIs to identify documents, posting
list A sends its alphabetically first URI to posting list B; posting
list B sends the same URI if it's listed, otherwise the alphabetically
next URI, to posting list C; which sends the same URI, or otherwise
the alphabetically next URI, back to posting list A. If it's the same
URI, it's in the intersection, and posting list A repeats the process.
This uses, I think, a fraction of the bandwidth of sending the entire
contents of the smallest list across the net; my reasoning is that the
worst case is that the lists were all equally dense and evenly spaced,
but staggered, so every item gets sent across the internet. From this
point, adding items to any list, or all but one list, doesn't increase
the amount of data sent or the number of round trips. Making the
lists randomly distributed rather than evenly spaced will, I think,
reduce the number of round trips by a small factor, which I SWAG at
between 1.5 and 3. Clustering will tend to increase the efficiency of
this method. I haven't experimented enough to know any of this for
sure, though.
The naive algorithm described above takes O(N * M) packet-times to
merge N lists of M terms. As Dave described, if you divide the
URI-space into two halves, though, you can merge sublists from each
half in parallel, thus roughly halving the number of packet-times, at
the expense of doubling the size of each packet. You can subdivide
the lists into more pieces in this manner in order to reduce latency
without increasing bandwidth, but eventually dividing into smaller
pieces won't reduce latency as fast as it increases packet size, which
I think will be around the time that your average segment size in some
sublist approaches 0-10.
- Bloom filters
If you send the actual list of a million URIs, you probably need
several bytes per URI --- perhaps 10. But if you send a Bloom filter
describing the small million-URI list instead, you may be able to get
away with sending many fewer bits. You can reduce the larger list in
size by filtering it with the Bloom filter and get a much smaller list
of possible intersection items.
How much you win depends on just how sparse the intersection set is;
you want your real hits to be a significant fraction of the
winnowed-down larger list, so your Bloom filter false-hit rate can't
be much more than the density of results in the larger list. For
example, if you seek the thousand-document intersection between a
million-document list and a ten-million-document list, your Bloom
filter representing the million documents has to winnow the
ten-million-document list down by a factor of about 10 000. A 50%
full Bloom filter is most space-efficient at rejecting false hits; a
50% full Bloom filter that has one false hit in ten thousand requires
13-14 hashes per item entered into it. Naively, you might think this
means you need 27 bits per item in the million-item list, so 3.4MB;
but actually you only need about 20 bits per item (2.5MB) to get 50%
density with 14 hashes. (That is, (1 - 1/20 000 000)^14 000 000 =~
0.5.)
That's perhaps a fourfold or eightfold win over just gzipping the URL
list.
The Bloom filter size is jointly proportional to
* the size of the posting list it encodes
* the logarithm of the size of the posting list it needs to filter
* the reciprocal of the logarithm of the size of the required
intersection result
It might be practical to use a smaller Bloom filter --- perhaps one
with five or six bits per item in the smaller posting list --- to
reduce the ten-million-document list down to one with only 100K
documents, which can then be sent back the other way as a much smaller
Bloom filter to winnow the million-document list down to a close
approximation to the eventual result.
- top results first
If you have a predetermined ranking that isn't enormously
term-dependent, like Google's PageRank, you can save yourself a lot of
merge work by dividing the posting list for common terms into
power-of-ten units: for a million-item posting list, you might have a
list of the "best" 100K documents, the best 10K, and the best 1K. If
merging the best 1K documents from two posting lists yields ten hits,
you don't need to do the extra work of merging the best 10K documents
for those same terms.
- substring merge
You can perform the merge in stages: first on, say, the first 15
characters of each URI, then on the subset thus selected of the full
URIs. Since many URIs that will co-occur in posting lists have the
same initial prefix, the URI lists being merged in both stages should
be much smaller than the entire URI lists.
You can do this in more stages, and indeed you can do it
character-by-character: first the one-character prefixes, then the
two-character prefixes, etc. At that point, it's essentially the
radix-search equivalent of the round-robin sorted merge. I don't know
whether this technique, either in this extreme form or in the more
practical form described in the previous paragraph, has advantages
over that one.
- reordering URL components
It seems likely that reordering URL components could make the
substring merge and round-robin merge techniques work better,
essentially by clustering URIs of similar documents; for example,
moving the http: scheme to the end, perhaps reversing the components
of the domain name, or reversing the entire URI. This might speed up
round-robin merge by orders of magnitude, but I don't know.
- posting list colocation
Merging gets a lot faster if every machine in the grid has a complete
copy of the index; URI lists need only be shipped across the
high-bandwidth links inside a computer, rather than across the
lower-bandwidth internet, in order to be merged. However, probably
not every machine can afford to host 120 tebibytes of data, at least
not yet. Still, if you should happen to find a machine that contains
the posting lists for two or more of your search terms, you can expect
things to work a lot faster.
If each machine can only afford to hold 1 tebibyte of data, it can
hold about 1% of the index. Suppose we divide the index into about
200 equal pieces and have 20 000 machines to work with. Now each
machine can contain a copy of two of these pieces of the index, and
there are enough machines that every possible pair of two pieces can
be found on one machine. Thus any two-word query has a unique machine
that can answer the query all by itself.
The number of machines you need to make this happen is roughly half of
the square of the number of pieces you have to divide the index into,
which is just the ratio between the index size and the size of an
individual machine's storage. This is a case where one quarter as
many machines, each twice as big, can do just as much --- that is, it
strongly favors centralization, and it strongly favors index
compression.
For this purpose, though, machines in the same house or neighborhood
might qualify as being "the same machine" --- if you have 100Mbps and
5ms between you, you can afford to talk to each other a lot more than
if you have 5Mbps and 100ms.
If you just distribute posting lists randomly, as most distributed
hashtable algorithms would, then you might get better numbers, but not
dramatically better. If you have 100 times the aggregate capacity you
need to store the indices, then each posting list will live on about
100 machines. Two randomly chosen posting lists will have about one
machine out of ten thousand in common, so the number of machines needs
to be a few times larger than that to ensure that at least one machine
will have a copy of both posting lists.
- summary
Summary: the baseline algorithm seems likely to be about an order of
magnitude too slow to produce acceptable results, but I have listed
here six directions for improvement, each of which could plausibly
improve its performance by a small integer factor. It seems likely
that two or more of these directions could make distributed
posting-list merging reasonably efficient.
In any case, another factor-of-ten increase in bandwidth per
participant will probably make even the baseline algorithm acceptably
fast.
Relevance and quality ranking
-----------------------------
Google has a lot of trouble with this, since many people find
attracting attention attractive, so they go to a lot of effort to
improve their own pages' relevance and quality ranking by exploiting
weaknesses in Google's ranking systems. In order to build a large
distributed search engine, you have to reduce the barriers to
participation as much as possible; unfortunately, this means you have
to allow spammers and SEOs to participate, which means they can
corrupt whatever parts of your database you give them responsibility
for.
I don't know how to solve this problem, although I have some ideas.
- Moore's law
This problem will get less severe as running a search engine gets
cheaper. First, ten times as many different search engines require
ten times as much effort to spam, and spamming even Google seems to be
a full-time job. Second, when it's possible for 200 participants to
constitute a search engine for the whole web, they can all know each
other and kick out the evildoers.
- cellular search engines
If it takes 1000 participants to store an index of the entire web,
then groups of 200 people can still store indices of 20% of it. If
some of these search engines are accessible to outsiders, a
metasearcher (like metacrawler or dogpile) that searches ten of them
and collates the results can get 90% coverage or better. A smart
metasearcher will derate results from engines that produce lousy
results, and eventually drop them entirely; so spammers' effects will
be largely limited to their own cells.
- referer metrics
Someone recently did a backscatter analysis of the SCO group
denial-of-service attack; by observing the number of reply packets
from SCO to a small sampling of nonexistent IP addresses in allocated
address space, they estimated the number of request packets flooding
the site.
Analogously, you can estimate the traffic on a web page, once it links
to you, from the number of your visitors whose Referer: headers point
to it.
This isn't very sturdy, because an SEO or spammer can forge requests
with specific Referer: headers; but you can take measures to make this
attack more expensive:
* omit data from requests that didn't receive a full response
* count one vote per client-IP-address/Referer pair, not per request;
you can go farther and count one vote per class-C subnet
* check the referring URL for an actual link to the referred address
You can run PageRank with Referer data rather than link data; this has
the major advantage of eliminating "dead-end" pages that don't link to
any other pages, or any off-site pages.
- proxy metrics
JupiterMedia or MediaMetrix used to use data from HTTP caching proxies
to compose lists of the top N web sites. Maybe they still do.
Anyway, caching proxies, or even passive snoopers, can collect two
interesting kinds of popularity data:
* what URLs people visit
* what URLs pages people visit link to
- searcher feedback
If you log which search-result links people click on and how soon, you
can tell which search results appear appealing in the search summary
view. The summary appeal will correlate closely and positively to
the actual relevance of the page, if your summary is any good.
- permanence and freshness
Web crawlers can tell when pages change. If a page changes into an
error message, it's usually a bad sign for nearby pages. If a domain
is recently registered, it probably contains nothing but garbage. If
a page never changes, it may be bad news for that page --- depending
on what you're searching for. I'm not sure about this.
- classifiers, other AI techniques
A naive Bayesian classifier of the kind presently widely used to
filter out email spam should work well to filter out SEO spam too. I
suspect I'm not doing much more than naive Bayesian and
finite-state-machine logic when I skip over most of the SEO spam I see
on Google these days.
- egocentric ranking
Given some sources of quality metrics, you can see which ones give
your own pages high rankings --- or, perhaps, pages you like. On this
basis, you can weight the quality metrics to accord more closely with
your own judgment.
- interstatistic consistency
If most of the quality metrics are pretty consistent in their ranking,
but a particular one is wildly different, derating that one may give
better search results. In some cases --- like Referer metrics,
searcher feedback, and proxy metrics --- you might be able to filter
out only a part of the input to a metric and improve that metric.
I fear this is a kind of negative attestation, and I fear that
negative attestations are prone to path-dependence --- who you trust
in the end may depend on who you start out trusting. Ways to compute
trust metrics that depend entirely on positive attestations, such as
PageRank or Advogato's trust metric, can more easily be made to
converge independent of a starting point.
- in summary
Summary: I have lots of ideas about how to solve this problem, but I
don't know if any of them are any good.