Tuesday, 17 November 2015

Searching historical texts is broken

Searching is the mainstay of digital editions of historical texts. If nothing more, editors will usually supply readable transcriptions of the sources and an index-based search. Indeed, the fact that digital scholarly editions only ever had these features has been a complaint since the start of the century1. Since then, generic XML publishing tools for historical texts like SADE2 and TAPAS3 have scarcely moved beyond that model. So search is something that everyone thinks can be easily added to their edition. But when we look "under the hood" we find that current search technology falls well short of what is needed.

Searching = Lucene

Just about the only search engine used in the digital humanities seems to be Lucene, in one guise or another: Solr, Nutch and Elasticsearch are all built on it. However, Lucene was never designed with transcriptions of historical texts in mind, and certainly not XML. What drove its design was the plain text documents and forms used in a business context. It was first published in 1999, and probably written a bit before that. In other words it was effectively pre-XML. So Lucene was not based around the idea of searching document trees, but plain text documents.

Inverted indices

Like many search engines Lucene does not actually search text when you type in a query. Instead it searches a prepared index that records, for each word, the documents in which it may be found, rather than recording for each document the words it contains. This is called an "inverted index". Since documents may be quite large the idea of "fields" was introduced to increase the precision of a search. Words would still be in the same document but they might belong to different sub-sections, so allowing the user to drill down and find the term, without knowing precisely where it actually was.
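To make the idea concrete, here is a minimal sketch of an inverted index with fields. It is not Lucene's implementation, only an illustration of the principle; the two-document collection and the names build_inverted_index and search are invented for the example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each (field, term) pair to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, fields in documents.items():
        for field, text in fields.items():
            for term in text.lower().split():
                index[(field, term)].add(doc_id)
    return index


def search(index, field, term):
    """Look the term up in the prepared index; no document text is scanned."""
    return sorted(index.get((field, term.lower()), set()))


# Hypothetical collection: each document has a "title" and a "body" field.
documents = {
    "doc1": {"title": "Nostromo", "body": "the silver of the mine"},
    "doc2": {"title": "Lord Jim", "body": "the ship and the silver sea"},
}
index = build_inverted_index(documents)

print(search(index, "body", "silver"))
print(search(index, "title", "silver"))
```

The field narrows the search, but it is only a flat label attached to each word. Nothing in the index says how one field relates to another.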

XML and HTML files are organised rather differently. Instead of fields, segments of text are arranged into an explicit hierarchy, each segment being assigned a name, which may be qualified by attributes. So we might describe a segment as being a "division" of type "chapter" to indicate a chapter. However, Lucene knows nothing about all this. It may be told that a particular word occurs inside a "field" called "division" but it doesn't know that the division is a chapter, or that it is part of a book called "Nostromo". In most cases knowing the "field" or element in which a search term occurs, such as "paragraph" or "line", will not prove very illuminating. We can't say "find all the quotations in letters" or some such hierarchical query, because Lucene does not understand hierarchies and does not even understand XML very well either. What it understands are documents and fields.

Non-linear text

You may think: so what? Text is the most important thing, and that is what Lucene retrieves. The problem is that digital scholarly editions use XML to structure the text in non-linear ways. Take the theoretical example: <l>The quick <del>red</del><add>brown</add> fox jumps over the lazy dog.</l>. Depending on how the parser reads this document the Lucene indexer may or may not insert a space between "red" and "brown". But at best it will see "The quick red brown fox jumps over the lazy dog". If we query that nonsensical sentence then Lucene will retrieve it, but it is not in the text. The text is supposed to be "The quick red fox" or "The quick brown fox" but Lucene will not find either expression. Admittedly, a non-literal search will retrieve all documents in which the words "red" and "fox", or "brown" and "fox", occur, but that's not the same thing. And for longer variants the text simply becomes incomprehensible.
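The effect is easy to reproduce. The sketch below uses Python's standard xml.etree.ElementTree as a stand-in for whatever parser feeds the indexer (an assumption: a real Lucene pipeline will use something else), and flattens the line in two plausible ways.

```python
import xml.etree.ElementTree as ET

line = ET.fromstring(
    "<l>The quick <del>red</del><add>brown</add> fox jumps over the lazy dog.</l>"
)

# Variant 1: concatenate the text nodes exactly as they occur in the file.
flat = "".join(line.itertext())

# Variant 2: separate the text nodes with a space, then normalise whitespace.
spaced = " ".join(" ".join(line.itertext()).split())

print(flat)    # "red" and "brown" run together
print(spaced)  # both readings appear side by side

# Neither flattening contains either of the two real readings.
for phrase in ("quick red fox", "quick brown fox"):
    print(phrase, phrase in flat, phrase in spaced)
```

Whichever variant the indexer happens to use, the phrase a reader would actually search for is not there.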

Take this real-world example from a single line of a poem in a manuscript by Charles Harpur:

<l>Yet even <app><rdg>that one <del>subject is to</del> starts</rdg> 
<rdg><del>that one's prone to starts of wrong</del></rdg> 
<rdg><emph>he</emph> shall sometimes prove insure:</rdg></app></l> 

Lucene will see the nonsensical:

Yet even that one subject is to starts 
that one’s prone to starts of wrong 
he shall sometimes prove insure: 

as the text of this line. Not only will the word "starts" be retrieved in two separate hits, but the reader will be highly puzzled by what Lucene returns when it formats and displays the result.

Hyphenation

Unfortunately, this is not the end of it. When we transcribe historical documents it is vital to record the line-breaks of the original source. If we don't do that we can't reference the "chapter and verse" of a passage. We can't display the text with line-breaks next to its page-facsimile, and we can't synchro-scroll with any precision two versions of the same work side by side. If we leave out line-breaks we may as well abandon all precision in the transcription altogether.

But line-breaks often occur in the middle of words. We may transcribe "quick-", "ly" on separate lines, but Lucene will see this as two words. OK, so what if we just join up all hyphens to the text of the next line when indexing? That's hard because in XML line-breaks will be marked by tags like </l>, and other tags representing page-breaks may intervene. But let's say that somehow we manage it. Then what about "sugar-cane" or "dog-house" or "avant-garde"? Hyphenated words may equally be split over a line. And what about authors who insist on hyphenating words that need not be, such as Conrad's use of "thunder-head"? The problem is, most technicians don't give a damn about these subtleties, and will index whatever is in the files, because it is too much work to fix the problem properly. But humanists who are interested in texts are pedantic as to the correctness of a text to the last full stop. And yet, when they search their magnum digital opus they seem content to find that the most common two-letter word is "ly".
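Here is a small sketch of both failure modes, on three invented lines of plain text. The function join_line_end_hyphens is hypothetical and deliberately naive: it assumes the XML tags have already been stripped, which as noted above is the hard part.

```python
def join_line_end_hyphens(lines):
    """Naively rejoin words split over a line-break by deleting the hyphen."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text.endswith("-"):
            # Drop the hyphen and glue the next line straight on.
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text


# Invented sample: one soft hyphen ("quick-ly") and one hard one ("sugar-cane").
lines = ["The fox ran quick-", "ly over the sugar-", "cane field"]

# Indexing the lines as they stand produces the bogus word "ly".
as_transcribed = " ".join(lines).split()
print("ly" in as_transcribed)

# Joining every line-end hyphen repairs "quickly" but corrupts "sugar-cane".
print(join_line_end_hyphens(lines))
```

Telling the two kinds of hyphen apart needs knowledge that is not in the file: a dictionary, or the author's own habits.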

OK, so how do we fix it?

Lucene can be forced to retrieve the exact location of words in a document, but this makes the indices enormous.
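Why the indices grow can be seen from a toy positional index. This is a sketch of the general technique, not of Lucene's storage format, and the sample sentence is invented.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map each term to {doc_id: [word positions]} instead of just doc ids."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


documents = {"doc1": "the quick brown fox and the lazy dog and the cat"}
positional = build_positional_index(documents)

# A document-level index needs one entry per (term, document) pair.
document_level_entries = sum(len(postings) for postings in positional.values())

# A positional index needs one entry per occurrence of every word.
positional_entries = sum(
    len(positions)
    for postings in positional.values()
    for positions in postings.values()
)

print(positional["the"]["doc1"])
print(document_level_entries, positional_entries)
```

The positional index grows with the length of the text rather than with its vocabulary, and in a long document most words are repeats.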

It is possible to write a program to tease apart the internal structures of an XML document so that we can separately index "The quick brown fox" and "The quick red fox", but then Lucene will return one hit for each copy we make of the text that does not differ, like "The quick". Such a program is also far harder to write than is generally supposed, and would only work for the specific set of XML texts it was designed for.4
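For the simplest case, a single deletion paired with an addition, such a program might look like the sketch below. The function reading is hypothetical, and it makes no attempt at nested readings like those in the Harpur line above, which is exactly where the real difficulty starts.

```python
import xml.etree.ElementTree as ET


def reading(element, skip_tag):
    """Collect the text of `element`, leaving out any `skip_tag` elements."""
    parts = [element.text or ""]
    for child in element:
        if child.tag != skip_tag:
            parts.append(reading(child, skip_tag))
        # Text following a child belongs to the parent, so it is always kept.
        parts.append(child.tail or "")
    return "".join(parts)


line = ET.fromstring(
    "<l>The quick <del>red</del><add>brown</add> fox jumps over the lazy dog.</l>"
)

original = reading(line, skip_tag="add")  # the state before correction
revised = reading(line, skip_tag="del")   # the state after correction

print(original)
print(revised)

# The shared words now exist twice, so a search for them gets two hits.
hits = [text for text in (original, revised) if "The quick" in text]
print(len(hits))
```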

With XML there is no nice way around the hyphenation problem.

"OK, so what's your solution?" I hear you ask. For that, I'm afraid, you'll have to wait for the next instalment of this blog.

References

1 Peter Robinson, 2003. "Where we are with Electronic Scholarly Editions, and where we want to be" Jahrbuch für Computerphilologie 5.

2 SADE Publish Tool, 2015.

3 Julia Flanders and Scott Hamlin, 2013. "TAPAS: Building a TEI Publishing and Repository Service", JTEI 5.

4 Desmond Schmidt, 2014. "Towards an Interoperable Digital Scholarly Edition" JTEI 7, Section 5.

Tuesday, 10 November 2015

Why digital editions lack longevity (and how to fix it)

If there is one point on which all theorists agree it is that the software component of the digital scholarly edition is ephemeral:

'compared to books -- in particular, compared to scholarly editions -- software lives out its life in but the twinkling of an eye.'1
'computer related technologies are hardly renowned for their longevity'2;
'... the price seems to be the interface: while the digital scholarly community has developed meaningful ways to support the longevity of the dataset, the same cannot be said about interfaces.'3
'how can the continued—and useful—existence of a system or tool be guaranteed, or at least facilitated, once a project's funding has been spent?'4.

The client-server model

But it may be a mistake to point the finger of blame so quickly at the graphical user interface, when the real culprit is the ubiquitous client-server model, in which the function of the application is split between the web-server and the browser.

In Web 1.0 almost all the action happened on the server. Since rendering in the client was unreliable it was considered virtuous to actually create Web pages on the server and then send them down to the client. Users were even told to 'turn off Javascript' for safety reasons, because interactivity was just a superfluous enhancement.

Client-server model: Web 1.0

With the arrival of consistent cross-browser functionality in Web 2.0 developers began to build much more interactivity into the browser. Now the user could participate in the creation of Web content. Facebook, blogs, Twitter and all the rest happened. Man went mobile, but the humble Web application and its client-server model weren't all that deeply changed. The same tools were available to developers as in Web 1.0, and they did the same things with them: building secret functionality into the server component, and leaving the interactive stuff for the browser and UI-guys to figure out. It was kind of a 50-50 division of labour.

Client-server model: Web 2.0

So, what's this got to do with digital editions?

A lot, as it turns out. The Web is under constant attack from hackers trying to gain control of servers. As a result, the server software is constantly updated. It is practically guaranteed that a digital edition built for the Web will fail within six months to two years of the end of the project. Digital humanists get money in grants. And when the grant money runs out, as the guys above say, there is no one left to maintain the edition. Then 'poof!' goes the digital edition. It drops off the Net, the code breaks and all that is left are the source data files, which need that very software to process them. And only the authors have access to them. So they are nearly as useless as the software.

But the problem here is not the lack of longevity in the GUI software, because that is all based on HTML, CSS and Javascript. To my knowledge none of those technologies has yet undergone a change that is not backwards-compatible. The original web-page made by Tim Berners-Lee still works perfectly in modern browsers, which have a zillion more features than he had back then. Extrapolating from that, a digital edition made today with just those technologies and no others could be expected to live on more or less indefinitely. Why? Because those technologies are too big to fail. There are around 40 billion web pages on the Internet, and a sizeable portion of them would break if there were any change to the underlying technology that was not backwards-compatible. So as they say, 'it ain't gonna happen'.

On the other hand, server software changes in incompatible ways all the time. My old Java programs don't work any more because 'deprecated' features are eventually withdrawn. But the worst problems concern security changes which are automatically applied to the server operating system by administrators. Failure to keep those security measures up to date will result, sooner or later, in the hacking of your digital edition. So it will die if you do not constantly spend money on it. Corporate users love the client-server model because it allows them to keep one step ahead of the competition. The constant updates delight them, and they have permanent staff to carry them out. But we don't. And as digital humanists we are left with their tools to do our work.

What can be done about it?

The answer lies in moving all the functionality into the client. With open source software there is no need to hide anything on the server. So we rewrite all the server code in Javascript, move it to the client, and turn the server into a dumb service that just gets and saves data. This gives us three advantages: archivability, interoperability and durability.
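How little the server has to do can be shown in a few lines. The sketch below is written in Python with the standard http.server module purely for illustration; the names DataStore, DumbHandler and make_server are invented, the store lives in memory, and a real service would need at least authentication before accepting writes.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class DataStore:
    """In-memory stand-in for the edition's data files (hypothetical)."""

    def __init__(self):
        self._items = {}

    def save(self, path, body):
        self._items[path] = body

    def get(self, path):
        return self._items.get(path)


STORE = DataStore()


class DumbHandler(BaseHTTPRequestHandler):
    """Gets and saves raw bytes; all interpretation is left to the client."""

    def do_GET(self):
        body = STORE.get(self.path)
        if body is None:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_PUT(self):
        length = int(self.headers.get("Content-Length", 0))
        STORE.save(self.path, self.rfile.read(length))
        self.send_response(204)
        self.end_headers()


def make_server(port=8000):
    """Build the server; call serve_forever() on the result to run it."""
    return HTTPServer(("127.0.0.1", port), DumbHandler)
```

Everything that makes the edition an edition (formatting, comparison, search) would then live in the Javascript sent to the browser.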

Client-server model: Web 4?.0

For archiving, digital editions can be taken off the server and saved as archives that work directly off the file-system (on a USB stick or DVD etc.), without the need for a Web-server, but only for reading. Editions could also be shared or sent to other locations. We can say: 'Here's my edition: tell me what you think of it.' That is like giving someone a copy of your book to read: you won't and can't change it.

But since these editions use only globally interoperable data formats we and our friends can also collaborate in creating and updating them through a master copy on a server. We can update their editions in our browsers and they can update ours in theirs, and everything will work perfectly. We can even annotate and understand the semantic content of an edition because the formats it uses are standard, and the tools to do that already exist.

On durability, if we leave it on the server it should run more or less indefinitely because there is almost nothing to hack, and nothing to break. It would be an interesting experiment to make one and see how long it lasts, adding a clock on the website to show how many years, months, days it has been running correctly without modification.

How to archive a digital edition

If we want to archive our edition we must first save the website's static content, by following its internal links, just as you do when you save a web-page as 'complete' in your browser. But to preserve the dynamic pages is harder. We can easily save the database files, but all the server software will have to be converted into a form that will run in the browser. This is hard for XML-based editions because almost all XML software only works on the server: eXist-db, XQuery, Cocoon, XSLT, Lucene (and Solr, Elasticsearch). Since we didn't write it we don't control it, so converting it to run in the browser is practically impossible. Which is another reason not to use XML. But with our XML-free approach, by writing all the software we need for the edition ourselves, this becomes absolutely achievable. Since our server code is written in Java it is not too hard to convert it all into Javascript. And we can also use that new client software to run the live, updatable server version of the edition.
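The first step, saving the static content by following internal links, is the easy part. As a rough sketch, the core of it is working out which links on a page belong to the edition itself. The host name edition.example and the helper internal_links are invented for the example, and fetching and saving the pages is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the values of href and src attributes found on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def internal_links(page_url, html):
    """Return absolute URLs on the same host as page_url, without repeats."""
    collector = LinkCollector()
    collector.feed(html)
    site = urlparse(page_url).netloc
    found = []
    for link in collector.links:
        absolute = urljoin(page_url, link)
        if urlparse(absolute).netloc == site and absolute not in found:
            found.append(absolute)
    return found


# Hypothetical page from an edition hosted at edition.example.
page = (
    '<a href="poems/1.html">One</a>'
    '<img src="/img/p1.png">'
    '<a href="http://other.example/x">Elsewhere</a>'
    '<a href="poems/1.html">One again</a>'
)
print(internal_links("http://edition.example/index.html", page))
```

An archiver would repeat this for every page it finds until no new internal links turn up.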

However, I don't expect people will rush out to do this because they all love XML too much. But they also don't have any answer to the longevity problem, which is far and away the biggest problem we face. Solve that, and digital editions that last nearly as long as books can become a reality.

References

1 C. Michael Sperberg-McQueen, 1994. “Textual Criticism and the Text Encoding Initiative.” Proceedings of MLA '94, San Diego.

2 Lou Burnard, 2013. "The Evolution of the Text Encoding Initiative: From Research Project to Research Infrastructure" JTEI 5.

3 Elena Pierazzo, 2014. Digital Scholarly Editing: Theories, Models and Methods, p.15.

4 Mark Hedges, Heike Neuroth, Kathleen M. Smith, Tobias Blanke, Laurent Romary, Marc Küster and Malcolm Illingworth, 2013. "TextGrid, TEXTvre, and DARIAH: Sustainability of Infrastructures for Textual Scholarship" JTEI 5.