Wednesday 9 December 2015

Fixing Search

The previous post explained that searching historical documents is fraught with problems that industrial search engines simply cannot handle. And the reason they can't is because they treat the underlying data as if it was a digitally-authored word-processor file, rather than a historical and manually written physical artefact.

The most serious of these problems is how to deal with versions – both internal and external. An internal version is created whenever an author or scribe changes something in the text, through deletion, replacement or insertion. After each round of corrections the author could, if he or she liked, write out the text in full as a clean draft. This kind of version hidden inside a document may be termed a layer. But when a new physical copy is produced, the differences between copies is termed external variation.

Some people seem to think that internal variation can be represented as a format: it is a crossing-out or insertion in the same document rather than a whole new text. Not so. Consider the last three lines of this poem:

Take out the markup and you will get:

Yet even that one subject is to one's prone to starts of wrong 
Of evil As ever So he shall sometimes prove insure: 
  in the clearest well thus fountain ever lies
A sediment—disturb it, and 'twill rise.

This is pure nonsense. The author never wrote that. It is not a text to be searched, viewed or compared, but that is what is being recorded when people treat internal variants as if they were formats, and that is what is being indexed by industrial search-engines.

So in order to search reliably the internal states of the document must first be separated out into coherent layers.

The scrapbook analogy

Let's say you have three copies of a novel. I'm thinking about a favourite sci-fi novel of mine, but there are plenty of other similar cases. The first one is a serialisation in a magazine. The second is the inaugural American edition, the third is the British edition, which was abridged by the publisher. I want to make one edition of all three. So I photocopy each page, and where the text is either unique in one copy or shared across several versions I cut out that portion and paste it into a scrapbook. As I do so I can preserve the order of the fragments so that each bit of each version precedes the next bit in the same version. There are printed books that do this already. Things like Michael Warren's 'Parallel King Lear', where the various quartos and the folio versions are laid down side by side so the reader can see the insertions, deletions and variants. But returning to the scrapbook idea, by copying each shared piece of text, however small (maybe just one letter) all duplication between the copies can be eliminated.

Well, nearly all. If a section was the same but transposed between two versions then the scrapbook idea won't work: the text will have to be copied from the 'before' to the 'after' location.

A digital scrapbook

If we do the same thing with three digital copies of the novel by finding all the bits in common electronically we can also eliminate the copies in the transposed text. The first time the transposed text occurs we record it as normal. The second or third times it is reused we simply refer to it, without copying it. So now our digital scrapbook is just one document, but it records all the text of the work from the three copies just once. By labelling each fragment with the set of versions it belongs to, say "1,2" or "2,3" or "1" or "2" etc. it will be possible to reconstruct the text of any copy by reading in order only those fragments that belong to it. So reading all the fragments labelled with a "2" or "1,2" or "2,3" etc will reproduce the text of version 2. And the same goes for the other versions.

And as explained above, internal variations are inherently no different from the external ones. They can be treated as separate editions of the text, as long as we can tease apart the internal versions and produce coherent copies from them. This is always possible, although the arguments are too long to include here.

Indexing a digital scrapbook

The benefits for search can only be realised if we can produce an index of this digital scrapbook, by treating it as just another kind of document. In a real world scenario some texts will be scrapbooks and some won't. We need one way to index them all uniformly. The scrapbook idea is cool, but it is so radical that it threatens to break industrial search engines. How can this be avoided?

The poetry example given above when represented as a digital scrapbook might look internally as four layers:

[1-4] Yet even [1,2] that one [2] 's [1] subject is [2] prone [1,2] to starts [1] Of evil [2] As ever [1-4] : in the clearest [3,4] he shall sometimes prove insure [3] So [1,2] well thus [3,4] fountain ever [1-4] lies [1-4] A sediment—disturb it, and 'twill rise.

The versions to which each fragment of text belong are represented here as some numbers in square-brackets before the fragment they refer to. But forgetting about that for now this 'document' can be treated like any other. The location of each word has a position that can be measured by counting characters from the start of the file. The position of 'subject' is 20, counting all the preceding characters of "Yet even ", "that one" and "'s ", without regard to the versions they belong to. This is a kind of global, cross-version position of each word. For example, the word "one's", in version 2, starts at position 14, which is the same as the word "one" in version 1. But so long as we have a position for each word in our index this causes no problem because we already know that they are two different words. All we need is some program that can read digital scrapbooks, which is no big deal.

Admittedly, the index will not record which versions "one" and "one's" belong to, but this can be deduced by reading the digital scrapbook at position 14. Following the fragments for "one's" reveals that this text is in version 2, whereas the word "one" (followed by a space) belongs to version 1.

So no reinvention of the wheel is needed to index a digital scrapbook, or an ordinary file, and the positions of words in both types of file can be stored in the same index.

Finding the text

Now we have our index it should be easy enough to find something in it. The index tells us in which documents a particular word can be found, and at what position(s) in the file. Any standard search engine could be used for this purpose, but in practice it is probably better to make your own, because of what comes after.

First the 'hits' have to be arranged into 'digests' which are just short summaries of the relevant bits of the source documents. To do this naturally requires the search engine to read the source documents again. So a search engine would have to be aware of digital scrapbooks. But that can be done, since it is just another format. Finally the hits have got to be displayed. That also requires knowledge of the digital scrapbook format, but the beauty is that a single hit in a single digital scrapbook will be displayed as one hit, not as 20 hits in 20 versions or layers. And the user can move around inside the scrapbook and read the text of any version and see how the hits propagate across the versions. Take a look for yourself.