Towards a Superdictionary by Computer Lexicography

David Crystal’s essay and lecture on the mythical prospect of a ‘superdictionary’ propose a dictionary which would, like no other before it, be complete: one that includes every single word known to its creators, bound neither by the space limitations of a physical medium such as paper (which for simple economic and practical reasons must exclude very rare or very new words) nor by the practical limitations that arise from the difficulty of finding any given word.

Erin McKean makes the precise extent of any dictionary’s deficiency clear in her 2007 talk on the nature and future of dictionaries and lexicography. She observes that, given the sheer volume of English-language material, even if only every tenth book, newspaper, magazine, and so on contained an ‘un-dictionaried’ word, you would need a dictionary several times the size of the OED to describe them all; in fact, she has found that almost every book she reads contains at least one word, or use of a word, that is not in any dictionary. And that, naturally, leaves aside the new vocabulary to be found on the Internet.


Cataloguing this vast world of words is probably an impossible task for humans. As McKean observes, though, both the process of compiling a dictionary and the form that the end result takes when the job is done have hardly changed since the time of James Murray — and, in fact, since the time of Johnson.

While computers have sped up the process, and vastly increased the amount of material lexicographers can call upon while researching any particular word, the fundamental procedure remains the same: find examples of the word in question in use; sort those examples by the sense of the word the author intended; work out how the different senses found are related, and organize them semantically; write the definitions, and (depending on the style of dictionary) show a small selection of the evidence for each sense alongside them.
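
To make that workflow concrete, here is a minimal sketch of it as a pipeline, in Python. Every name in it is my own invention for illustration, and each of the four steps is passed in as a function precisely because, today, each one is a human task.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A toy data model for the workflow above; the names are illustrative only.

@dataclass
class Citation:
    source: str   # where the example was found
    year: int     # when it was written
    text: str     # the quotation itself

@dataclass
class Sense:
    definition: str
    evidence: List[Citation] = field(default_factory=list)

@dataclass
class Entry:
    headword: str
    senses: List[Sense] = field(default_factory=list)

def compile_entry(headword: str,
                  find_evidence: Callable[[str], List[Citation]],
                  sort_by_sense: Callable[[List[Citation]], List[List[Citation]]],
                  write_definition: Callable[[List[Citation]], str],
                  quotations_to_show: int = 3) -> Entry:
    """Run the four editorial steps in order."""
    citations = find_evidence(headword)       # 1. gather evidence
    sense_groups = sort_by_sense(citations)   # 2. sort it by intended sense
    # 3. organise the senses (here crudely, by how well attested each one is)
    sense_groups.sort(key=len, reverse=True)
    # 4. write a definition and attach a small selection of the evidence
    senses = [Sense(write_definition(group), group[:quotations_to_show])
              for group in sense_groups]
    return Entry(headword, senses)
```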


The only part of this process which computers can perform automatically is the first, finding evidence (usually in a full-text search database, whether a general one such as Google Books or a specialized corpus like the BNC). The other principal advantage of computers in lexicography is the possibility of looking up words in ways other than mere alphabetical headword search: you can search by sense by searching the definition text, or (if the dictionary records the information) by dates of usage, and so on.
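
For what it’s worth, that automated first step is simple enough to sketch: a keyword-in-context search over a folder of plain-text files. The folder layout and the context width here are assumptions made for the example; a real system would query a proper full-text index, like the ones behind Google Books or the BNC, rather than re-reading files.

```python
import re
from pathlib import Path

def concordance(corpus_dir: str, word: str, context: int = 40):
    """Collect keyword-in-context snippets for `word` from UTF-8 .txt files."""
    pattern = re.compile(r"\b{}\b".format(re.escape(word)), re.IGNORECASE)
    snippets = []
    for path in Path(corpus_dir).glob("**/*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for match in pattern.finditer(text):
            start = max(0, match.start() - context)
            end = min(len(text), match.end() + context)
            snippets.append((path.name, " ".join(text[start:end].split())))
    return snippets

# e.g. evidence = concordance("corpus/", "literally")
```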

It seems to me that the only way a superdictionary could be achieved is if the computer were to take over the majority of the lexicographer’s work: if it were able to distinguish senses and usage information in the evidence for each word; able to reason about the semantic and developmental relationships between those senses; able to actually write its own definitions for the senses it finds.

While this may sound dangerously like a pipe-dream, the technology we have today is already approaching it. IBM’s Watson, while it looks like a stunt machine designed to score publicity for the company by winning a high-profile television game-show, actually appears to be usable as a general-purpose system for assessing evidence and making decisions; it’s already being marketed as a medical tool to help doctors produce diagnoses. Applying it to lexicographizing a particular word would mean using that same technique to assess the word’s usage, deciding which of a number of sense-categories each quotation belongs to, and then working out how the categories it ends up with for that word are related.
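
I have no inside knowledge of how Watson actually represents its evidence, so the following is only a crude gesture at the decision just described: given a pile of quotations for a headword, group together the ones whose surrounding words look alike, and treat each group as a candidate sense-category. The bag-of-words representation and the similarity threshold are assumptions of the sketch, nothing more.

```python
import math
from collections import Counter

def context_vector(quotation: str, headword: str) -> Counter:
    """Bag-of-words profile of a quotation, with the headword itself removed."""
    tokens = [t.lower().strip(".,;:!?\"'()") for t in quotation.split()]
    return Counter(t for t in tokens if t and t != headword.lower())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def group_by_sense(quotations, headword, threshold=0.2):
    """Greedily assign each quotation to the most similar existing group,
    or start a new group; each group is a candidate sense for review."""
    groups = []  # list of (centroid vector, member quotations)
    for quotation in quotations:
        vec = context_vector(quotation, headword)
        best, best_sim = None, threshold
        for centroid, members in groups:
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = (centroid, members), sim
        if best is not None:
            best[0].update(vec)   # fold the quotation into the chosen group
            best[1].append(quotation)
        else:
            groups.append((vec, [quotation]))
    return [members for _, members in groups]
```

A production system would use far richer representations and proper clustering; the point is only that ‘which sense does this quotation belong to?’ is the kind of evidence-weighing decision a machine can take a first pass at.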

That would seem to be of little use at the definition-writing stage, but machine writing also now seems to be an imminent technology. Some have hailed the rise of the computer-written news article; writing dictionary definitions is certainly within the same technology’s scope.


I do not envisage the complete absence of human input from the lexicographical process: the computer will need help. It will probably need more help with word-sense disambiguation than with any other stage: as a programmer, for instance, I can’t see any algorithmic approach ever catching the distinction between the literal and figurative/emphatic uses of ‘literally’ unaided. But with the majority of the work offloaded to machines, the possibility of complete coverage is quite real.
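
To show what I mean, here is the sort of naive rule a programmer reaches for first, with an invented list of ‘hyperbolic’ collocates: it happily flags ‘literally dying of laughter’ as emphatic, and then flags ‘literally dying of cholera’ too, where the same verb is meant literally.

```python
# A deliberately naive classifier for 'literally', to illustrate the problem
# rather than solve it. The marker list is invented for this example.
HYPERBOLE_MARKERS = {"dying", "died", "exploding", "melted", "starving", "forever"}

def looks_emphatic(sentence: str) -> bool:
    words = sentence.lower().split()
    if "literally" not in words:
        return False
    i = words.index("literally")
    window = " ".join(words[max(0, i - 4): i + 5])
    return any(marker in window for marker in HYPERBOLE_MARKERS)

print(looks_emphatic("I was literally dying of laughter"))                # True
print(looks_emphatic("Within a week he was literally dying of cholera"))  # also True, wrongly
```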

Given the hugely increased amount of material a computer would be able to process compared to a human, the dictionary could well and truly become completely inclusive. Most daily newspapers, local and national, are now published digitally as a matter of course. The vast majority of books coming out now are available as e-books; even those which aren’t are generally processed as PDFs before hitting the press, so a digital text exists. Lots and lots of older books are available from Project Gutenberg, archive.org, and Google Books. Imagine feeding all this to the lexicographer program, and giving it more material every day as it becomes available. You’d end up with quite an incredible resource.

It seems likely to me that the main human input into such a process would be re-typing or scanning old books for the lexicographer program, to give it historical coverage along the lines of the OED. Plenty of old books are already digitized, but you’d want even more.

And with such a huge amount of material, new ways of exploring the lexicon would become available. You could see precise charts of usage frequency through time, associated with individual senses and with forms — effectively like Google N-grams, but with the additional possibility of sense analysis. You could choose how you’d like dictionary entries sorted. Given the way the lexicographer program works, you could even search for a particular sense of a word by giving an example usage (in which case the program would run its usual word-sense disambiguation over the sentence you gave it, then find the sense(s) in the dictionary most closely corresponding to the word as you used it).
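
Once every citation carries a date and a machine-assigned sense label, those charts fall out of a simple aggregation. The record layout here is hypothetical, and relative frequencies would also need per-decade corpus totals, but the core of the query is just this:

```python
from collections import defaultdict

def frequency_by_sense(citations, bucket_years=10):
    """Count citations per sense per decade: the raw material for an
    N-gram-style chart broken down by sense."""
    counts = defaultdict(lambda: defaultdict(int))
    for year, sense_id in citations:   # hypothetical (year, sense) records
        bucket = (year // bucket_years) * bucket_years
        counts[sense_id][bucket] += 1
    return {sense: dict(sorted(buckets.items())) for sense, buckets in counts.items()}

# e.g. frequency_by_sense([(1865, "emphatic"), (1871, "emphatic"), (1872, "literal")])
#      -> {'emphatic': {1860: 1, 1870: 1}, 'literal': {1870: 1}}
```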


Naturally this project is hugely ambitious. The technology, optimistic though I am about Watson and machine writing, is not quite there yet. It ought to be worked on, though.

The Oxford English Dictionary took nearly 80 years to finish — and the first 30 of those were spent preparing enough materials for it, before the dictionary even started being written. Other large timescales apply to similar dictionary projects, such as the Middle English Dictionary, the Dictionary of Old English, the Dictionary of American Regional English, etc. It would seem reasonable to assume that a similar timescale would apply to the development of the lexicographer program.

To my knowledge nobody has even begun to attempt a software project like this before — one forecast, from the outset, to take several decades to reach a usable state. But I believe that if the right team set out on this project today, with enough funding and enough expert input on lexicographical and linguistic matters, it would be quite firmly achievable within this century. (It might even be finished before OED3.)


There remains the issue of what the resulting database might be called. Crystal’s ‘superdictionary’ is a fine coinage, but it’s rather long-winded, and it seems likely that the ‘super’ part will someday no longer make sense (when there are no ‘ordinary’ dictionaries for it to be ‘super’ in comparison to).

I propose hyperlex, as a portmanteau of hypertext and lexicon, which also happens to work as a combination of the Greek hyper and léxis. I hope one day to see it in the hyperlex itself, even if some other word ends up being the usual name for such a database.