Digital Variants: Searching historical texts is broken

Searching is the mainstay of digital editions of historical texts. If nothing more, editors will usually supply readable transcriptions of the sources and an index-based search. Indeed, the fact that digital scholarly editions only ever had these features has been a complaint since the start of the century¹. Since then, generic XML publishing tools for historical texts like SADE² and TAPAS³ have scarcely moved beyond that model. So Search is something that everyone thinks can be easily added to their edition. But when we look "under the hood" we find that current search technology falls well short of what is needed.

Searching = Lucene

Just about the only search engine used in the digital humanities seems to be Lucene, and its various aliases Solr, Nutch and ElasticSearch. However, Lucene was never designed with transcriptions of historical texts in mind, and certainly not XML. What drove its design was the plain text document and forms used in a business context. It was first published in 1999, and probably written a bit before that. In other words it was effectively pre-XML. So Lucene was not based around the idea of searching document trees, but plain text documents.

Inverted indices

Like many search engines Lucene does not actually search text when you type in a query. Instead it searches a prepared index of words that tells it in which documents a particular search-term may be found, instead of which words are in the document. This is called an "inverted index". Since documents may be quite large the idea of "fields" was introduced to increase the precision of a search. Words would still be in the same document but they might belong to different sub-sections, so allowing the user to drill down and find the term, without knowing precisely where it actually was.

XML and HTML files are organised rather differently. Instead of fields, segments of text are arranged into an explicit hierarchy, each segment being assigned a name, which may be qualified by attributes. So we might describe a segment as being a "division" of type "chapter" to indicate a chapter. However, Lucene knows nothing about all this. It may be told that a particular word occurs inside a "field" called "division" but it doesn't know that the division is a chapter, or that it is part of a book called "Nostromo". In most cases knowing the "field" or element in which a search term occurs, such as "paragraph" or "line", will not prove very illuminating. We can't say "find all the quotations in letters" or some such hierarchical query, because Lucene does not understand hierarchies and does not even understand XML very well either. What it understands are documents and fields.

Non-linear text

You may think: so what? Text is the most important thing, and that is what Lucene retrieves. The problem is that digital scholarly editions use XML to structure the text in non-linear ways. Take the theoretical example: <l>The quick <del>red</del><add>brown</add> fox jumps over the lazy dog.</l>. Depending how the parser reads this document the Lucene indexer may or may not insert a space between "red" and "brown". But at best it will see "The quick red brown fox jumps over the lazy dog". If we query that nonsensical sentence then Lucene will retrieve it, but it is not in the text. The text is supposed to be "The quick red fox" or "The quick brown fox" but Lucene will not find either expression. Admittedly, a non-literal search will retrieve all documents in which the words "red", "fox" and "brown", "fox" occur, but that's not the same thing. And for longer variants the text simply becomes incomprehensible.

Take this real-world example from a single line of a poem in a manuscript by Charles Harpur:

<l>Yet even <app><rdg>that one <del>subject is to</del> starts</rdg> 
<rdg><del>that one's prone to starts of wrong</del></rdg> 
<rdg><emph>he</emph> shall sometimes prove insure:</rdg></app></l>

Lucene will see the nonsensical:

Yet even that one subject is to starts 
that one’s prone to starts of wrong 
he shall sometimes prove insure:

as the text of this line. Not only will the word "starts" be retrieved in two separate hits, but the reader will be highly puzzled by what Lucene returns when it formats and displays the result.

Hyphenation

Unfortunately, this is not the end of it. When we transcribe historical documents it is vital to record the line-breaks of the original source. If we don't do that we can't reference the "chapter and verse" of a passage. We can't display the text with line-breaks next to its page-facsimile, and we can't synchro-scroll with any precision two versions of the same work side by side. If we leave out line-breaks we may as well abandon all precision in the transcription altogether.

But line-breaks often occur in the middle of words. We may transcribe "quick-", "ly" on separate lines, but Lucene will see this as two words. OK, so what if we just join up all hyphens to the text of the next line when indexing? That's hard because in XML line-breaks will be marked by tags like </l>, and other tags representing page-breaks may intervene. But let's say that somehow we manage it. Then what about "sugar-cane" or "dog-house" or "avant-garde"? Hyphenated words may equally be split over a line. And what about authors who insist on hyphenating words that need not be, such as Conrad's use of "thunder-head"? The problem is, most technicians don't give a damn about these subtleties, and will index whatever is in the files, because it is too much work to fix the problem properly. But humanists who are interested in texts are pedantic as to the correctness of a text to the last full stop. And yet, when they search their magnum digital opus they seem content to find that the most common two-letter word is "ly".

OK, so how do we fix it?

Lucene can be forced to retrieve the exact location of words in a document, but this makes the indices enormous.

It is possible to write a program to tease apart the internal structures of an XML document so that we can separately index "The quick brown fox" and "The quick red fox", but then Lucene will return one hit for each copy of the non-different text we make, like "The quick". Such a program is also far harder to write than is generally supposed, and would only work for the specific set of XML texts it was designed for.⁴

With XML there is no nice way around the hyphenation problem.

"OK, so what's your solution?" I hear you ask. For that, I'm afraid, you'll have to wait for the next instalment of this blog.