Reclaiming the Oxford English Dictionary for the public

kragen at pobox.com
Mon Oct 3 03:37:02 EDT 2005


The Oxford English Dictionary, generously supported by the Oxford
University Press, is one of the earliest instances of what are now
called "pro-am" or "commons-based peer production" projects.  From
1857 to 1928, thousands of readers collected examples of uses of words
their dictionaries didn't define; they mailed these examples on slips
of paper to a small number of editors, who undertook to collate them
into a dictionary.  From 1884 to 1928, these editors published their
work in fascicles, mostly in alphabetical order.
<http://en.wikipedia.org/wiki/Oxford_English_Dictionary --- Wikipedia
article "Oxford English Dictionary">

In recent years, with the advent of public access to the internet, it
has become apparent that commons-based peer production works best when
no single party can restrict the uses of the end product; more people
can use it, it can be put to more uses, poor coordinators can be
replaced, and contributors have assurance that they will be able to
use their own work.  <http://perens.com/Articles/Economic.html ---
"The Emerging Economic Paradigm of Open Source", by Bruce Perens;
http://www.benkler.org/CoasesPenguin.html --- "Coase's Penguin, or
Linux and the Nature of the Firm", by Yochai Benkler>

This form of commons-based peer production of information, in which
the end product can be studied, copied, modified, and used freely, is
often called "Open Source
development". <http://opensource.org/docs/definition.php --- "The Open
Source Definition, Version 1.9", promulgated by the Open Source
Initiative; http://www.catb.org/~esr/writings/cathedral-bazaar/ ---
"The Cathedral and the Bazaar", by Eric S. Raymond> It got this name
because it started with software whose source code was freely
available for all these purposes, also known as "free software"
<http://www.fsf.org/ --- the Free Software Foundation>.

Tim Bray, the world-famous hacker who co-invented XML, explains how
the OED is not currently open source:

    Well, literally thousands of people around the world diligently
    read books looking for usages of words and writing them on slips
    and sending them to Oxford. Many, many millions of these things
    are in filing cabinets in the basement of Oxford. Then Oxford, of
    course, turned them around to do a commercial product. It's not
    as though the underlying citation store or the dictionary itself
    are open for free access to anybody except for Oxford.

    So I don't think it's really open source in some of the essential
    characteristics. It is certainly community-based and
    community-driven. And it clearly became the case that some of the
    unpaid volunteers became thought leaders in terms of how you go
    about finding things.

<http://www.acmqueue.com/modules.php?name=Content&pa=printer_friendly&pid=282&page=1
--- "A Conversation with Tim Bray", ACM Queue, Vol. 3, No. 1, February 2005>

If the Oxford English Dictionary were Open Source, we could expect the
following improvements:
- Definitions would be available in many contexts; for example, within
  a word processor, at the command line, in a web browser.
- OED definitions and etymologies would be available to many more
  people, so many more people would think about how they needed
  improvement.
- When a person noticed a bad definition or an opportunity for
  improvement, they could immediately fix it in their local copy of
  the dictionary, and later, share their improvement with others who
  were interested.  This is particularly important because the OED is
  quite out of date, especially the parts in the public domain.
- Definitions and etymologies could be augmented with unlimited
  examples of use, drawn from the English literary canon (via Project
  Gutenberg) on demand.
- People could develop innovative software for looking up definitions;
  for example, it could disambiguate misspellings according to context
  of use, and preferentially display word senses that might apply in
  context (noun versus verb, for example, or by the publication year
  or country of origin of the work containing the unknown word, if
  that's available).
- Web sites such as http://www.snopes.com/ could link to authoritative
  definitions and etymologies of words, and even quote them in full
  without fear of copyright infringement.
- Its English-language definitions could be translated into other
  languages (perhaps incrementally, as people requested them) to
  supplement existing inter-lingual dictionaries, or perhaps even
  create new ones.
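
As a rough illustration of the lookup software imagined in the list
above, here is a minimal Python sketch that tolerates misspellings and
shows the senses matching an expected part of speech first.  The
entries, the `lookup` function, and the `expected_pos` parameter are
all made up for the example; a real tool would need real dictionary
data and a much better notion of context.

```python
import difflib

# Toy stand-in for dictionary data: headword -> list of (part of speech, gloss).
# These entries are invented for illustration only.
ENTRIES = {
    "wash": [("verb", "to cleanse with water"), ("noun", "an act of washing")],
    "watch": [("noun", "a small timepiece"), ("verb", "to observe")],
    "wasp": [("noun", "a stinging insect")],
}

def lookup(word, expected_pos=None):
    """Return (headword, senses) or None if nothing is close enough.

    A misspelled word is mapped to the closest headword; senses whose
    part of speech equals expected_pos are listed first.
    """
    if word in ENTRIES:
        headword = word
    else:
        matches = difflib.get_close_matches(word, list(ENTRIES), n=1, cutoff=0.6)
        if not matches:
            return None
        headword = matches[0]
    # sorted() is stable, so senses keep their original order within each group.
    senses = sorted(ENTRIES[headword], key=lambda sense: sense[0] != expected_pos)
    return headword, senses

print(lookup("wassh", expected_pos="noun"))
```

Here the context is reduced to a single part-of-speech hint; the
publication year or country of origin mentioned above could be used as
further sort keys in the same way.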

I have been investigating what would be required to make the OED Open
Source.  Much of the first edition is out of copyright; in general,
anything published before 1923 is in the public domain in the US and
in Berne Convention countries.  Someone could take this
out-of-copyright text and create a public-domain or
open-source-licensed version thereof.

The fascicle 'W-Wash' was published in 1921
<http://www.colbycosh.com/old/december02.html ---
http://oed.com/pdfs/oed-news-2002-06.pdf --- article "J.R.R. Tolkien
and the OED", Oxford English Dictionary News, Series 2, Number 21,
June 2002, by Peter Gilliver, pp.1-3>; this suggests that nearly the
entire dictionary is out of copyright, in the form in the fascicles.
However, I don't know how to get hold of them, and the Wikipedia
article cited above mentions that the first one sold only 4000 copies
--- so there may be fascicles of which no copies survive.  For
example, none are listed on http://www.abebooks.com/ as far as I can
tell; I searched on "new english dictionary historical" and "fascicle
dictionary english" with little luck.  Searching for "new english
dictionary historical principles" in the title, however, I did find
several volumes supposedly published before 1921, at very reasonable
prices --- US$30-US$130.  (I found advertisements for volumes D-E,
H-K, L-N, Q-R, S-SH, V-Z, X-ZYXT, all claiming to be from before 1923,
comprising nearly half of the first edition OED.)

If someone were to take the original pages, of which I guess there are
around ten thousand, photograph each one with a cheap five-megapixel
digital camera, and compress the result, each page image would
probably be around a megabyte and take about ten seconds to produce;
the entire set would require only about ten gigabytes and about 30
hours of labor to produce.  It could then be distributed by BitTorrent
and DVD-Rs.
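
The estimate above can be checked with a few lines of Python.  This is
only a sketch of the arithmetic; the function name is made up, and the
defaults are the guesses from the paragraph above (ten thousand pages,
a megabyte and ten seconds per page), not measurements.

```python
def scan_estimate(pages=10_000, mb_per_page=1.0, seconds_per_page=10):
    """Return (total gigabytes, total hours) for photographing every page."""
    total_gb = pages * mb_per_page / 1000          # decimal gigabytes
    total_hours = pages * seconds_per_page / 3600  # seconds -> hours
    return total_gb, total_hours

gb, hours = scan_estimate()
print(f"{gb:.0f} GB, {hours:.1f} hours")
```

With these guesses the result is ten gigabytes and a little under 28
hours, which is where the round figure of about 30 hours comes from.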

The Internet Archive's Texts Collection's Million Books Project
<http://www.archive.org/details/texts> consists of books scanned in
more or less this manner, although they are using expensive
sixteen-megapixel cameras because they have a wider range of uses in
mind.  The sample book I'm looking at is one bit deep and is
compressed by about a factor of 24 --- 558 pages, 8.5 megapixels each,
which would be about 600 megabytes uncompressed (around a megabyte per
page), but is 24.4 megabytes compressed with DjVu, or about 44
kilobytes per page.
<http://www.archive.org/details/ChurchDictionary --- Walter Farquhar
Hook's 1842 Church Dictionary>.
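
The figures for the sample book can be worked through the same way.
The sketch below uses only the numbers quoted above (558 pages, 8.5
megapixels each, one bit per pixel, 24.4 megabytes compressed); the
function and key names are made up for the example.

```python
def djvu_figures(pages=558, megapixels=8.5, bits_per_pixel=1, compressed_mb=24.4):
    """Return uncompressed size, compression ratio, and per-page sizes."""
    # megapixels * bits per pixel / 8 gives megabytes per page
    uncompressed_mb = pages * megapixels * bits_per_pixel / 8
    return {
        "uncompressed_mb": uncompressed_mb,
        "ratio": uncompressed_mb / compressed_mb,
        "mb_per_page_uncompressed": uncompressed_mb / pages,
        "kb_per_page_compressed": compressed_mb * 1000 / pages,
    }

for name, value in djvu_figures().items():
    print(f"{name}: {value:.1f}")
```

This gives roughly 593 megabytes uncompressed, a compression factor of
a little over 24, about one megabyte per page before compression, and
roughly 44 kilobytes per page after it.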

By itself, such a collection of images would be slightly less useful
than the original books, although much more easily reproduced.  You
would still have to page through the collection of numbered pages one
by one to find the page containing the word you wanted, and the images
would be displayed on a conventional small low-resolution computer
monitor rather than a large, high-resolution book page.  Consequently
it would be somewhat slower to access than the original books.

However, once the page images were available in the public domain, it
would be possible at any later time, and in any small increment, to
annotate them with OCR results or hand-written transcriptions, which
could in turn be corrected by the people consulting them.
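
One very small kind of incremental annotation would be to record, for
any page someone happens to look at, the first headword on that page.
The Python sketch below is only an illustration of that idea; the
`PageIndex` class and its methods are hypothetical, and it assumes the
pages are numbered in alphabetical order of their headwords.

```python
import bisect

class PageIndex:
    """First-headword annotations for page images, added in any order."""

    def __init__(self):
        self._headwords = []  # sorted first headwords of annotated pages
        self._pages = []      # page numbers, parallel to _headwords

    def annotate(self, page, first_headword):
        """Record that the given page starts with first_headword."""
        key = first_headword.lower()
        i = bisect.bisect_left(self._headwords, key)
        self._headwords.insert(i, key)
        self._pages.insert(i, page)

    def nearest_page(self, word):
        """Return the last annotated page starting at or before word, or None."""
        i = bisect.bisect_right(self._headwords, word.lower())
        return self._pages[i - 1] if i else None

index = PageIndex()
index.annotate(120, "dog")
index.annotate(5, "abacus")
index.annotate(300, "heart")
print(index.nearest_page("elephant"))
```

Even with only a few pages annotated, a reader can jump to the nearest
annotated page and leaf forward from there, and every further
annotation shortens that search.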

(My experiments with free-software OCR have not been terribly
encouraging so far.)

