Wednesday 9 December 2015

Fixing Search

The previous post explained that searching historical documents is fraught with problems that industrial search engines simply cannot handle. They can't handle them because they treat the underlying data as if it were a digitally-authored word-processor file, rather than a historical, manually written physical artefact.

The most serious of these problems is how to deal with versions – both internal and external. An internal version is created whenever an author or scribe changes something in the text, through deletion, replacement or insertion. After each round of corrections the author could, if he or she liked, write out the text in full as a clean draft. This kind of version hidden inside a document may be termed a layer. But when a new physical copy is produced, the differences between copies are termed external variation.

Some people seem to think that internal variation can be represented as mere formatting: a crossing-out or insertion in the same document rather than a whole new text. Not so. Consider the last three lines of this poem:

Take out the markup and you will get:

Yet even that one subject is to one's prone to starts of wrong 
Of evil As ever So he shall sometimes prove insure: 
  in the clearest well thus fountain ever lies
A sediment—disturb it, and 'twill rise.

This is pure nonsense. The author never wrote that. It is not a text to be searched, viewed or compared, but that is what is being recorded when people treat internal variants as if they were formats, and that is what is being indexed by industrial search engines.

So, in order to search reliably, the internal states of the document must first be separated out into coherent layers.

The scrapbook analogy

Let's say you have three copies of a novel. I'm thinking of a favourite sci-fi novel of mine, but there are plenty of other similar cases. The first is a serialisation in a magazine. The second is the first American edition, and the third is the British edition, which was abridged by the publisher. I want to make one edition of all three. So I photocopy each page, and wherever the text is either unique to one copy or shared across several versions I cut out that portion and paste it into a scrapbook. As I do so I preserve the order of the fragments, so that each bit of each version precedes the next bit of the same version. There are printed books that do this already, such as Michael Warren's 'Parallel King Lear', where the various quartos and the folio are laid out side by side so the reader can see the insertions, deletions and variants. But returning to the scrapbook idea: by copying each shared piece of text, however small (maybe just one letter), all duplication between the copies can be eliminated.

Well, nearly all. If a section was the same but transposed between two versions then the scrapbook idea won't work: the text will have to be copied from the 'before' to the 'after' location.

A digital scrapbook

If we do the same thing with three digital copies of the novel, finding all the bits in common electronically, we can also eliminate the copying of transposed text. The first time the transposed text occurs we record it as normal; each subsequent time it is reused we simply refer to it, without copying it. So now our digital scrapbook is just one document, but it records all the text of the work from the three copies just once. By labelling each fragment with the set of versions it belongs to, say "1,2" or "2,3" or "1" or "2" etc., it is possible to reconstruct the text of any copy by reading, in order, only those fragments that belong to it. So reading all the fragments labelled "2" or "1,2" or "2,3" etc. will reproduce the text of version 2, and the same goes for the other versions.
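The fragment-and-label scheme can be sketched in a few lines of code. This is only an illustration of the idea, not any real implementation; the fragments, version numbers and function name are invented for the example.

```python
# A minimal sketch of a "digital scrapbook": the whole work is stored once,
# as an ordered list of fragments, each labelled with the set of versions
# it belongs to. The example text and version numbers are invented.

fragments = [
    ({1, 2, 3}, "The quick "),
    ({1},       "red "),
    ({2, 3},    "brown "),
    ({1, 2, 3}, "fox jumps over the lazy dog."),
]

def read_version(fragments, version):
    """Reconstruct one version by concatenating, in order, only the
    fragments whose label contains that version number."""
    return "".join(text for versions, text in fragments if version in versions)

print(read_version(fragments, 1))  # The quick red fox jumps over the lazy dog.
print(read_version(fragments, 2))  # The quick brown fox jumps over the lazy dog.
```

Note that the shared text ("The quick ", "fox jumps...") is stored exactly once, however many versions contain it.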

And as explained above, internal variations are inherently no different from the external ones. They can be treated as separate editions of the text, as long as we can tease apart the internal versions and produce coherent copies from them. This is always possible, although the arguments are too long to include here.

Indexing a digital scrapbook

The benefits for search can only be realised if we can produce an index of this digital scrapbook, by treating it as just another kind of document. In a real world scenario some texts will be scrapbooks and some won't. We need one way to index them all uniformly. The scrapbook idea is cool, but it is so radical that it threatens to break industrial search engines. How can this be avoided?

The poetry example given above when represented as a digital scrapbook might look internally as four layers:

[1-4] Yet even [1,2] that one [2] 's [1] subject is [2] prone [1,2] to starts [1] Of evil [2] As ever [1-4] : in the clearest [3,4] he shall sometimes prove insure [3] So [1,2] well thus [3,4] fountain ever [1-4] lies [1-4] A sediment—disturb it, and 'twill rise.

The versions to which each fragment of text belongs are represented here as numbers in square brackets before the fragment they refer to. But forgetting about that for now, this 'document' can be treated like any other. Each word has a position that can be measured by counting characters from the start of the file. The position of 'subject' is 20, counting all the preceding characters of "Yet even ", "that one" and "'s ", without regard to the versions they belong to. This is a kind of global, cross-version position for each word. For example, the word "one's", in version 2, starts at position 14, which is the same as the word "one" in version 1. But so long as we have a position for each word in our index this causes no problem, because we already know that they are two different words. All we need is a program that can read digital scrapbooks, which is no big deal.

Admittedly, the index will not record which versions "one" and "one's" belong to, but this can be deduced by reading the digital scrapbook at position 14. Following the fragments for "one's" reveals that this text is in version 2, whereas the word "one" (followed by a space) belongs to version 1.
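This global-position lookup can be sketched as follows. The code is an illustration only, reusing invented 'quick brown fox' fragments rather than the Harpur example; the function name is hypothetical.

```python
# Sketch of global, cross-version positions: a fragment's offset is found by
# counting all preceding characters, regardless of version labels. The index
# stores these offsets; to learn which versions a hit belongs to, we re-read
# the scrapbook at the stored offset. Fragments here are invented examples.

fragments = [
    ({1, 2, 3}, "The quick "),                     # global offset 0
    ({1},       "red "),                           # global offset 10
    ({2, 3},    "brown "),                         # global offset 14
    ({1, 2, 3}, "fox jumps over the lazy dog."),   # global offset 20
]

def versions_at(fragments, position):
    """Return the version set of the fragment covering a global offset."""
    pos = 0
    for versions, text in fragments:
        if pos <= position < pos + len(text):
            return versions
        pos += len(text)
    return set()

print(versions_at(fragments, 10))  # {1}    - "red" is only in version 1
print(versions_at(fragments, 14))  # {2, 3} - "brown" is in versions 2 and 3
```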

So no reinvention of the wheel is needed to index a digital scrapbook, or an ordinary file, and the positions of words in both types of file can be stored in the same index.

Finding the text

Now that we have our index, it should be easy enough to find something in it. The index tells us in which documents a particular word can be found, and at what position(s) in the file. Any standard search engine could be used for this purpose, but in practice it is probably better to make your own, because of what comes after.

First the 'hits' have to be arranged into 'digests', which are just short summaries of the relevant bits of the source documents. Doing this naturally requires the search engine to read the source documents again, so it has to be aware of digital scrapbooks. But that can be done, since it is just another format. Finally the hits have to be displayed. That also requires knowledge of the digital scrapbook format, but the beauty is that a single hit in a single digital scrapbook will be displayed as one hit, not as 20 hits in 20 versions or layers. And the user can move around inside the scrapbook, read the text of any version and see how the hits propagate across the versions. Take a look for yourself.

Tuesday 17 November 2015

Searching historical texts is broken

Searching is the mainstay of digital editions of historical texts. If nothing more, editors will usually supply readable transcriptions of the sources and an index-based search. Indeed, the fact that digital scholarly editions only ever had these features has been a complaint since the start of the century1. Since then, generic XML publishing tools for historical texts like SADE2 and TAPAS3 have scarcely moved beyond that model. So Search is something that everyone thinks can be easily added to their edition. But when we look "under the hood" we find that current search technology falls well short of what is needed.

Searching = Lucene

Just about the only search engine used in the digital humanities seems to be Lucene, in its various guises Solr, Nutch and Elasticsearch. However, Lucene was never designed with transcriptions of historical texts in mind, and certainly not XML. What drove its design was the plain-text documents and forms used in a business context. It was first published in 1999, and probably written a little before that. In other words it was effectively pre-XML. So Lucene was based not around the idea of searching document trees, but around plain text documents.

Inverted indices

Like many search engines, Lucene does not actually scan the text when you type in a query. Instead it consults a prepared "inverted index": rather than listing which words are in each document, the index records, for each word, the documents in which it may be found. Since documents may be quite large, the idea of "fields" was introduced to increase the precision of a search. Words would still be in the same document but they might belong to different sub-sections, allowing the user to drill down and find the term without knowing precisely where it actually was.
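A toy version of such an inverted index can be built in a few lines. The documents and ids here are invented for illustration; real engines like Lucene add term positions, fields, scoring and compression on top of this basic structure.

```python
# A minimal inverted index: for each word, the set of documents in which it
# occurs. Querying is then a dictionary lookup, not a scan of the texts.

from collections import defaultdict

docs = {
    "doc1": "the quick brown fox",
    "doc2": "the lazy dog",
    "doc3": "the quick dog",
}

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

index = build_index(docs)
print(sorted(index["quick"]))  # ['doc1', 'doc3']
```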

XML and HTML files are organised rather differently. Instead of fields, segments of text are arranged into an explicit hierarchy, each segment being assigned a name, which may be qualified by attributes. So we might describe a segment as a "division" of type "chapter" to indicate a chapter. However, Lucene knows nothing about all this. It may be told that a particular word occurs inside a "field" called "division", but it doesn't know that the division is a chapter, or that it is part of a book called "Nostromo". In most cases knowing the "field" or element in which a search term occurs, such as "paragraph" or "line", will not prove very illuminating. We can't say "find all the quotations in letters" or pose some similar hierarchical query, because Lucene does not understand hierarchies, and it does not understand XML very well either. What it understands are documents and fields.

Non-linear text

You may think: so what? Text is the most important thing, and that is what Lucene retrieves. The problem is that digital scholarly editions use XML to structure the text in non-linear ways. Take the theoretical example: <l>The quick <del>red</del><add>brown</add> fox jumps over the lazy dog.</l>. Depending on how the parser reads this document the Lucene indexer may or may not insert a space between "red" and "brown". But at best it will see "The quick red brown fox jumps over the lazy dog". If we query that nonsensical sentence then Lucene will retrieve it, but it is not in the text. The text is supposed to be "The quick red fox" or "The quick brown fox", but Lucene will not find either expression. Admittedly, a non-literal search will retrieve all documents in which the words "red", "brown" and "fox" occur, but that's not the same thing. And for longer variants the text simply becomes incomprehensible.
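What an indexer would need to do instead can be sketched with the standard library's XML parser: produce one linear reading per state of the text, rather than concatenating deletion and addition. This is only a toy for the single <del>/<add> case above; real TEI apparatus is far messier, and the function name is invented.

```python
# Sketch: extract one coherent reading per textual state from <del>/<add>
# markup, instead of letting both variants run together in the index.

import xml.etree.ElementTree as ET

line = '<l>The quick <del>red</del><add>brown</add> fox jumps over the lazy dog.</l>'

def reading(xml_text, keep):
    """Rebuild the line keeping either 'del' (the earlier state) or
    'add' (the later state), and dropping the other."""
    root = ET.fromstring(xml_text)
    drop = "add" if keep == "del" else "del"
    parts = [root.text or ""]
    for child in root:
        if child.tag != drop:
            parts.append(child.text or "")
        parts.append(child.tail or "")
    return "".join(parts)

print(reading(line, keep="del"))  # The quick red fox jumps over the lazy dog.
print(reading(line, keep="add"))  # The quick brown fox jumps over the lazy dog.
```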

Take this real-world example from a single line of a poem in a manuscript by Charles Harpur:

<l>Yet even <app><rdg>that one <del>subject is to</del> starts</rdg> 
<rdg><del>that one's prone to starts of wrong</del></rdg> 
<rdg><emph>he</emph> shall sometimes prove insure:</rdg></app></l> 

Lucene will see the nonsensical:

Yet even that one subject is to starts 
that one’s prone to starts of wrong 
he shall sometimes prove insure: 

as the text of this line. Not only will the word "starts" be retrieved in two separate hits, but the reader will be highly puzzled by what Lucene returns when it formats and displays the result.

Hyphenation

Unfortunately, this is not the end of it. When we transcribe historical documents it is vital to record the line-breaks of the original source. If we don't do that we can't reference the "chapter and verse" of a passage. We can't display the text with line-breaks next to its page-facsimile, and we can't synchro-scroll two versions of the same work side by side with any precision. If we leave out line-breaks we may as well abandon all precision in the transcription altogether.

But line-breaks often occur in the middle of words. We may transcribe "quick-", "ly" on separate lines, but Lucene will see this as two words. OK, so what if we just join up all hyphens to the text of the next line when indexing? That's hard, because in XML line-breaks will be marked by tags like </l>, and other tags representing page-breaks may intervene. But let's say that somehow we manage it. Then what about "sugar-cane" or "dog-house" or "avant-garde"? Hyphenated words may equally be split over a line. And what about authors who insist on hyphenating words that need not be, such as Conrad's use of "thunder-head"? The problem is, most technicians don't give a damn about these subtleties, and will index whatever is in the files, because it is too much work to fix the problem properly. But humanists who are interested in texts are pedantic about the correctness of a text down to the last full stop. And yet, when they search their magnum digital opus, they seem content to find that the most common two-letter word is "ly".
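The trap can be shown in a few lines. A naive joiner that welds every line-final hyphen onto the next line mends "quick-"/"ly" but silently destroys "sugar-"/"cane". The lines and function name here are invented for illustration.

```python
# Sketch of the hyphenation problem: joining line-final hyphens mechanically
# is right for soft hyphens ("quick-/ly") and wrong for genuine hyphenated
# words split over a line ("sugar-/cane"). No purely mechanical rule can
# tell the two cases apart; a word list or editorial markup is needed.

lines = ["The quick-", "ly running fox ate the sugar-", "cane"]

def naive_join(lines):
    """Join any line ending in '-' directly onto the next line."""
    text = ""
    for line in lines:
        if line.endswith("-"):
            text += line[:-1]      # drop the hyphen: right for "quickly"...
        else:
            text += line + " "
    return text.strip()

print(naive_join(lines))
# The quickly running fox ate the sugarcane
# ...but the author wrote "sugar-cane", which has now been lost.
```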

OK, so how do we fix it?

Lucene can be forced to retrieve the exact location of words in a document, but this makes the indices enormous.

It is possible to write a program to tease apart the internal structures of an XML document so that we can separately index "The quick brown fox" and "The quick red fox", but then Lucene will return one hit for each copy we make of the unvaried text, like "The quick". Such a program is also far harder to write than is generally supposed, and would only work for the specific set of XML texts it was designed for.4

With XML there is no nice way around the hyphenation problem.

"OK, so what's your solution?" I hear you ask. For that, I'm afraid, you'll have to wait for the next instalment of this blog.

References

1 Peter Robinson, 2003. "Where we are with Electronic Scholarly Editions, and where we want to be" Jahrbuch für Computerphilologie 5.

2 SADE Publish Tool, 2015.

3 Julia Flanders and Scott Hamlin, 2013. "TAPAS: Building a TEI Publishing and Repository Service", JTEI 5.

4 Desmond Schmidt, 2014. "Towards an Interoperable Digital Scholarly Edition" JTEI 7, Section 5.

Tuesday 10 November 2015

Why digital editions lack longevity (and how to fix it)

If there is one point on which all theorists agree it is that the software component of the digital scholarly edition is ephemeral:

'compared to books -- in particular, compared to scholarly editions -- software lives out its life in but the twinkling of an eye'1
'computer related technologies are hardly renowned for their longevity'2
'... the price seems to be the interface: while the digital scholarly community has developed meaningful ways to support the longevity of the dataset, the same cannot be said about interfaces.'3
'how can the continued—and useful—existence of a system or tool be guaranteed, or at least facilitated, once a project's funding has been spent?'4

The client-server model

But it may be a mistake to point the finger of blame so quickly at the graphical user interface, when the real culprit is the ubiquitous client-server model, in which the function of the application is split between the web-server and the browser.

In Web 1.0 almost all the action happened on the server. Since rendering in the client was unreliable it was considered virtuous to actually create Web pages on the server and then send them down to the client. Users were even told to 'turn off Javascript' for safety reasons, because interactivity was just a superfluous enhancement.

Client-server model: Web 1.0

With the arrival of consistent cross-browser functionality in Web 2.0, developers began to build much more interactivity into the browser. Now the user could participate in the creation of Web content. Facebook, blogs, Twitter and all the rest happened. The Web went mobile, but the humble Web application, built on the client-server model, wasn't all that deeply changed. The same tools were available to developers as in Web 1.0, and they did the same things with them: building secret functionality into the server component, and leaving the interactive stuff for the browser and the UI guys to figure out. It was roughly a 50-50 division of labour.

Client-server model: Web 2.0

So, what's this got to do with digital editions?

A lot, as it turns out. The Web is under constant attack from hackers trying to gain control of servers. As a result, server software is constantly updated. It is practically guaranteed that, once a digital edition built for the Web is finished, it will fail within six months to two years of the end of the project. Digital humanists get their money in grants, and when the grant money runs out, as the authors quoted above observe, there is no one left to maintain the edition. Then 'poof!' goes the digital edition. It drops off the Net, the code breaks, and all that is left are the source data files, which actually need that software to process them. And since only the authors have access to them, they are nearly as useless as the software.

But the problem here is not a lack of longevity in the GUI software, because that is all based on HTML, CSS and Javascript. To my knowledge none of those technologies has yet undergone a change that is not backwards-compatible. The original web page made by Tim Berners-Lee still works perfectly in modern browsers with a zillion more features than he had back then. Extrapolating from that, a digital edition made today with just those technologies and no others could be expected to live on more or less indefinitely. Why? Because those technologies are too big to fail. There are around 40 billion web pages on the Internet, and a sizeable portion of them would break if there were any change to the underlying technology that was not backwards-compatible. So, as they say, 'it ain't gonna happen'.

On the other hand, server software changes in incompatible ways all the time. My old Java programs don't work any more because 'deprecated' features are eventually withdrawn. But the worst problems concern security changes which are automatically applied to the server operating system by administrators. Failure to keep those security measures up to date will result, sooner or later, in the hacking of your digital edition. So it will die if you do not constantly spend money on it. Corporate users love the client-server model because it allows them to keep one step ahead of the competition. The constant updates delight them, and they have permanent staff to carry them out. But we don't. And as digital humanists we are left with their tools to do our work.

What can be done about it?

The answer lies in moving all the functionality into the client. With open source software there is no need to hide anything on the server. So we rewrite all the server code in Javascript, move it to the client, and turn the server into a dumb service that just gets and saves data. This gives us three advantages: archivability, interoperability and durability.
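The 'dumb service' this proposes can be sketched with Python's standard http.server. This is purely an illustrative stand-in (the post's own server code is Java, to be rewritten in Javascript), with no authentication or path checking; the class name and store path are invented.

```python
# Sketch of a "dumb service" that only gets and saves data: all edition
# logic lives in the browser. Illustration only - no security, no path
# sanitisation - not any project's actual server code.

import http.server
import pathlib

STORE = pathlib.Path("editions")   # one file per resource, named by URL path

class DumbStore(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        path = STORE / self.path.lstrip("/")
        if path.is_file():
            self.send_response(200)
            self.end_headers()
            self.wfile.write(path.read_bytes())
        else:
            self.send_error(404)

    def do_PUT(self):
        length = int(self.headers["Content-Length"])
        path = STORE / self.path.lstrip("/")
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(self.rfile.read(length))
        self.send_response(204)
        self.end_headers()

    def log_message(self, *args):   # keep the sketch quiet
        pass

# To serve: http.server.HTTPServer(("", 8000), DumbStore).serve_forever()
```

Because the server knows nothing about the edition's content, there is almost nothing in it to patch, hack or break.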

Client-server model: Web 4?.0

For archiving, digital editions can be taken off the server and saved as archives that work directly off the file-system (on a USB stick or DVD etc.), without the need for a Web-server, but only for reading. Editions could also be shared or sent to other locations. We can say: 'Here's my edition: tell me what you think of it.' That is like giving someone a copy of your book to read: you won't and can't change it.

But since these editions use only globally interoperable data formats we and our friends can also collaborate in creating and updating them through a master copy on a server. We can update their editions in our browsers and they can update ours in theirs, and everything will work perfectly. We can even annotate and understand the semantic content of an edition because the formats it uses are standard, and the tools to do that already exist.

As for durability: if we leave the edition on the server it should run more or less indefinitely, because there is almost nothing to hack and nothing to break. It would be an interesting experiment to make one and see how long it lasts, adding a clock to the website to show how many years, months and days it has been running correctly without modification.

How to archive a digital edition

If we want to archive our edition we must first save the website's static content, by following its internal links, just as you do when you save a web page as 'complete' in your browser. But preserving the dynamic pages is harder. We can easily save the database files, but all the server software will have to be converted into a form that will run in the browser. This is hard for XML-based editions because almost all XML software runs only on the server: eXistDB, XQuery, Cocoon, XSLT, Lucene (and Solr, Elasticsearch). Since we didn't write it we don't control it, so converting it to run in the browser is practically impossible. Which is another reason not to use XML. But with our XML-free approach, in which we write all the software we need for the edition ourselves, this becomes absolutely achievable. Since our server code is written in Java it is not too hard to convert it all into Javascript. And we can then use that new client software to run the live, updatable server version of the edition as well.
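The link-following step can be sketched with the standard library's HTML parser. The class name, page snippet and URLs below are invented for illustration; actually fetching the files and writing them to disk is left out.

```python
# Sketch of the first archiving step: walk a page's internal links so the
# static files they point to can be saved. External links are skipped,
# since only the edition's own files belong in the archive.

from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base = base_url
        self.internal = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        url = attrs.get("href") if tag in ("a", "link") else attrs.get("src")
        if not url:
            return
        absolute = urljoin(self.base, url)
        # keep only links on the same host - the edition's own files
        if urlparse(absolute).netloc == urlparse(self.base).netloc:
            self.internal.append(absolute)

page = '<a href="/texts/poem1.html">Poem</a> <a href="http://example.org/">out</a>'
collector = LinkCollector("http://edition.example.com/index.html")
collector.feed(page)
print(collector.internal)  # ['http://edition.example.com/texts/poem1.html']
```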

However, I don't expect people will rush out to do this because they all love XML too much. But they also don't have any answer to the longevity problem, which is far and away the biggest problem we face. Solve that, and digital editions that last nearly as long as books can become a reality.

References

1 C. Michael Sperberg-McQueen, 1994. “Textual Criticism and the Text Encoding Initiative.” Proceedings of MLA '94, San Diego.

2 Lou Burnard, 2013. "The Evolution of the Text Encoding Initiative: From Research Project to Research Infrastructure" JTEI 5.

3 Elena Pierazzo, 2014. Digital Scholarly Editing: Theories, Models and Methods, p.15.

4 Mark Hedges, Heike Neuroth, Kathleen M. Smith, Tobias Blanke, Laurent Romary, Marc Küster and Malcolm Illingworth, 2013. "TextGrid, TEXTvre, and DARIAH: Sustainability of Infrastructures for Textual Scholarship" JTEI 5.