Tuesday, 26 July 2016

Sync-scrolling images and text when editing transcriptions

When transcribing original documents it is helpful to have the page image next to the transcription as it is being written. That way the transcriber can see what the next word to transcribe is, or quickly check for mistakes in someone else's transcription. Most documents contain more than one page, so this gives rise to a problem: how can the relevant page image be placed next to the relevant portion of the transcription so that the transcriber can easily see what corresponds to what?

One obvious solution is to transcribe page by page: for each image, show only the transcription of that page. However, this creates a problem for technician and user alike: the transcription of the document is now divided into parts which must be stitched back together, not only by the computer when the document is saved, but also mentally by the transcriber. Pages rarely end at a paragraph boundary; more often they split in the middle of a sentence or even a word. The transcriber may change only one page, and then the computer must reassemble the document with that one altered page in its middle. If the page is marked up in some way, its transcription may not be complete or well-formed, which would hamper editing. All this is both technically messy and counter-intuitive for the user.

A better method is to display the entire document for editing: the transcription as a continuous running text, and the page images to which it corresponds as a scrolling list. The reason this is not often done is an intrinsic alignment problem: how to find the part of an image that corresponds to the currently displayed piece of text. To be readable, page-images may need to be taller than the screen, while the transcription of a page is typically much shorter. As a general rule the centre of a page's transcription should be aligned with the centre of the corresponding page-image across the centre of the screen; this is what the user expects. However, the first and last pages cannot possibly conform to this rule. The first page must be aligned so that the top of the page image lines up with the top of the transcription text, and likewise at the bottom: on the last page the end of the transcription must correspond with the bottom of the last page-image.

Live sync-scrolling

To achieve live sync-scrolling we need a table or function that gives the left-hand scroll position for each possible right-hand scroll position.

What we mustn't do is make the list of images itself scrollable. If we did that we would have to link its own scrolling with that of the scrolling text, and since the right-hand side (RHS) controls the scrolling of the left-hand side (LHS), linking the scrolling in both directions would produce infinite feedback. 'Scrolling' on the LHS can instead be achieved by other means, for example by varying the CSS 'top' property of the overall list of images.

The scrolling positions for the LHS are just the mid-points of each image in the overall list. These correspond to the mid-points of each page of text in the RHS, which can be found easily by parsing the text: in my case, since I use a minimal markup language (MML), page-starts are marked by [NNN] on a line by itself, where NNN is the page-number. Any scrolling position between two of these corresponding values can be computed by linear interpolation. However, this does not work for the first and last pages, because the desirable alignments in those cases are the top of the first image with the top of the first page of text, and the end of the text with the bottom of the last image. So my solution was simply to replace the mid-points of the first page in both the LHS and RHS with half the window-height, and likewise to replace the last mid-points with the length of the text and the length of the image-list, each minus half the window-height. In some cases 'half the window-height from the top or bottom' may fall in the middle of a page that is not the first or last. In that case the overlapping values can simply be removed, so long as the ones at the extreme ends of the list are preserved.
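As a rough sketch in plain JavaScript (the function and variable names here are mine, not the actual Ecdosis code), the anchor lists and the interpolation might look like this:

    // textMids and imageMids are the mid-points of each page in the text (RHS)
    // and in the image list (LHS), in pixels from the top of each. The first
    // and last anchors are replaced so that the extremes align top-to-top and
    // bottom-to-bottom, and interior anchors that overlap the adjusted
    // end-points are dropped.
    function buildAnchors(textMids, imageMids, textHeight, listHeight, winHeight) {
        var half = winHeight / 2;
        var rhsEnd = textHeight - half, lhsEnd = listHeight - half;
        var rhs = [half], lhs = [half];               // first page: align tops
        for (var i = 1; i < textMids.length - 1; i++) {
            if (textMids[i] > half && textMids[i] < rhsEnd
                    && imageMids[i] > half && imageMids[i] < lhsEnd) {
                rhs.push(textMids[i]);
                lhs.push(imageMids[i]);
            }
        }
        rhs.push(rhsEnd);                             // last page: align bottoms
        lhs.push(lhsEnd);
        return { rhs: rhs, lhs: lhs };
    }

    // given the current RHS scroll position (measured at the centre of the
    // window), interpolate the corresponding LHS offset between the two
    // nearest anchors
    function lhsPositionFor(rhsScroll, anchors) {
        var rhs = anchors.rhs, lhs = anchors.lhs;
        if (rhsScroll <= rhs[0])
            return lhs[0];
        for (var i = 1; i < rhs.length; i++) {
            if (rhsScroll <= rhs[i]) {
                var t = (rhsScroll - rhs[i - 1]) / (rhs[i] - rhs[i - 1]);
                return lhs[i - 1] + t * (lhs[i] - lhs[i - 1]);
            }
        }
        return lhs[lhs.length - 1];
    }

In the RHS scroll handler the result is not fed to another scrollbar but used to set the CSS 'top' of the image list, for example something like imageList.style.top = (winHeight / 2 - lhsPositionFor(scrollTop + winHeight / 2, anchors)) + 'px', which avoids the feedback loop described above.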

There is a demo of this method on the Charles Harpur site, in the test interface for the letter from W.A. Duncan to Henry Parkes, dated around 1841.

Tuesday, 23 February 2016

Improvements to events editor

Events are things that happen in the life of your author. One of the successes of the AustESE project was the realisation that such events would best be represented as database records, so that biographical information could be rearranged into various useful forms. Events have a 'fuzzy' date more often than a precise one. So 'ca. 1834' or 'before February 1865' is what you would expect as the date of an event, not 26/12/1845. And events can have a description and a list of references. These two are represented as simple HTML, a globally interoperable standard for mixed content. So forget about XML, which only serves as a preliminary to making HTML.
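Purely as an illustration (the field names and example content below are mine, not necessarily the AustESE or Ecdosis schema), an event record might look something like this:

    // a hypothetical event record: a fuzzy date plus two simple-HTML fields
    var event = {
        title: 'Appointed to a new post',        // invented example
        date: 'ca. 1859',                        // fuzzy, human-readable date
        description: '<p>Simple HTML; links here can carry title or ' +
            'data- attributes to drive popup images.</p>',
        references: '<ul><li>A citation, also as simple HTML</li></ul>'
    };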

To WYSIWYG or not?

This is where the problems started. AustESE used a sophisticated HTML editor that filtered out potentially dangerous HTML constructs that hackers could use to implant code exploits. But those tags and attributes also happen to be quite useful for building a website. For example, the title attribute or special data-attributes on a link could be used to animate a popup image. Unfortunately the editor stripped all these out when the user saved, and the images would disappear. So I swapped it for a simpler HTML editor, on the grounds that users would still want to see a WYSIWYG preview of their HTML before saving it. But that didn't work out any better. Since the text already reverts to a rendered preview as soon as the user clicks on another element, and nothing is saved until the user clicks the 'save' button, the elaborate WYSIWYG editor could be replaced by a simple textarea. Sophisticated, huh?

But what about the 'dangerous' HTML constructs I am no longer filtering out? Since the events editor is not publicly accessible and all editors of the content are guaranteed to be trustworthy, this extra security measure is quite worthless.

The loss of the WYSIWYG editing environment is also not a problem, since editors are mostly sophisticated enough to handle this. After all, what we need to put into the HTML goes beyond mere formatting, and for our purposes a WYSIWYG environment simply doesn't suffice.

Which brings me to my main point: The best ideas come when you decide to delete something, not when you add some shiny new GUI component you probably don't even need. Less is truly more. But finding out what to throw away is the problem.

Events editor before the user clicks on the description area

Events editor with the textarea enabled for the description field

Events editor after the user clicks on the references field

Sunday, 14 February 2016

Improvements to table view

Table view seems to be much liked by my two expert editors. But they did request some changes, which I have now implemented and which require some explanation.

First, they wanted to see all the text of each version, rather than restricting it to the base row in those columns where all the versions were the same. Second, they wanted some way to reorder the versions, and third, they needed a way to reduce clutter by removing rows. I also replaced the clumsy slider with a conventional scrollbar, to enable swiping on tablets. These changes have made table view much more useful, without adding significantly to its complexity.

Moving up and down

If the user clicks on a siglum in the leftmost column, two small buttons appear for raising or lowering that row; after 5 seconds they disappear again in any case. The disappearing buttons are cool because they only appear when needed and free up the display when they are not. Clicking the up button moves that row above the one immediately above it, and the down button moves it down one row. Only the up button appears on the bottom row, and only the down button on the top row.
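In jQuery terms (the selectors and class names here are purely illustrative, not the actual table-view code) the behaviour amounts to something like:

    // show the move buttons for the clicked siglum, hide them after 5 seconds
    $('.siglum').click(function () {
        var buttons = $(this).find('.move-buttons');
        $('.move-buttons').hide();                 // only one set visible at a time
        buttons.show();
        setTimeout(function () { buttons.hide(); }, 5000);
    });
    // move the row one position up or down in the table
    $(document).on('click', '.move-up', function () {
        var row = $(this).closest('tr');
        row.prev().before(row);                    // swap with the row above
    });
    $(document).on('click', '.move-down', function () {
        var row = $(this).closest('tr');
        row.next().after(row);                     // swap with the row below
    });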

Selecting some versions

Normally the user wants to see all the versions, but if that is overwhelming the rows can be reduced by deselecting versions from a simple dropdown menu in the toolbar that has been added at the foot of the display. Selected versions are shown with a tick-mark after their names; deselecting a version removes the tick, and choosing it again restores it. The usual user-interface method for showing a set of options is to use checkboxes, but in this case there may be very many versions and it would get too confusing, so a select dropdown is used instead. The 'rebuild' button resubmits the newly selected versions and builds a reduced (or expanded) table.

Tuesday, 9 February 2016

Table view

One way to demonstrate the flexibility of multi-version documents is to display the same information in several ways. Charles-harpur.org and Ecdosis now boast a table view, which displays all the versions of a work in stacked form, so an editor can quickly scan through a text to see what is a variant of what, no matter how complex the variation.

My XML-based rivals are still struggling to produce such views, but I doubt they will succeed. Their problem is that they record internal variants (deletions, additions, substitutions) inline as part of the text, and to produce a table view you have to tease apart these changes into separate layers, which is almost impossible due to markup variability. So this display, although it doesn't look all that earth-shattering, is actually unique. Also it is what textual editors have long been bugging me for.

Table view of part of The Creek of the Four Graves

To try it with different poems, select a poem from the Browse menu, then click on the "table view" tab. Some poems are not uploaded yet and may not work, but most are OK. Some minor features: the table of sigla on the left is anchored, and mousing over the sigla shows their full names in case they are partly obscured. The spacing could be improved, but it is basically all there.

Monday, 18 January 2016

Tree View

I added Tree view to the Ecdosis-front toolset. Since a multi-version document (MVD) represents multiple versions of the same work, it is much like a set of variants of a genome, and a phylogenetic tree is very close to a stemma describing the relationships between witnesses in a multiple-manuscript tradition. A genome is a sequence of nucleotides expressed in a four-letter vocabulary (GACT), but a single version of a historical document can equally be expressed as a sequence of letters in the Unicode character set. Like genomes, historical texts are subject to insertions, deletions, substitutions and transpositions. Hence the same tools used by geneticists ought to work for humanists also.

Distance matrices

The question arises how to generate a stemma or tree from a set of versions. Many of the phylogenetic approaches use a distance matrix: a table that describes how different each version is from every other version. Since in an MVD all the bits of text that are shared by versions and the bits that are different have already been computed, making a distance matrix is easy. The distance matrix for the four versions/layers of Abed Ben Haroun by Charles Harpur looks like this:

        A        B        C        D
A       0.0      0.05344  0.10688  0.11508
B       0.05344  0.0      0.13370  0.08087
C       0.10688  0.13370  0.0      0.19323
D       0.11508  0.08087  0.19323  0.0

Obviously the edit-distance between each version and itself is 0, which explains the diagonal of zeros in the table. The larger the number, the more 'distant' the two versions are, so here the biggest difference is between versions C and D. Only half of the other values are really needed, since the distance between versions A and D is the same as the distance between D and A, but the full matrix is the traditional format and can be fed directly into a tree-drawing algorithm. There are many of these, but one of the best distance-based methods is neighbour-joining. The version I chose is a refinement of that technique published some years ago by Desper and Gascuel called 'FastME'. The tree view itself is produced by the 'drawgram' program in the Phylip package, which allows different visualisations of rooted trees, since these most closely resemble the humanist's stemma.
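The exact metric is not spelled out here, but one plausible way to derive such values from an MVD's fragments (each labelled with the set of versions it belongs to, as in the 'digital scrapbook' described in the Fixing Search post below) is the proportion of the two versions' combined text that they do not share. A sketch of that guess:

    // fragments: array of { versions: [...], text: "..." } in document order.
    // This is only a guess at the metric actually used, not the Ecdosis code.
    function distance(fragments, a, b) {
        var shared = 0, lenA = 0, lenB = 0;
        fragments.forEach(function (f) {
            var inA = f.versions.indexOf(a) !== -1;
            var inB = f.versions.indexOf(b) !== -1;
            if (inA) lenA += f.text.length;
            if (inB) lenB += f.text.length;
            if (inA && inB) shared += f.text.length;
        });
        return (lenA + lenB - 2 * shared) / (lenA + lenB);
    }

With this definition the diagonal is necessarily 0 and two completely disjoint versions score 1.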

Stemmatic trees are useful even in cases where all the sources were written by the same person. In that case what the tree shows is which version derived from which – something that would otherwise require a lot of manual labour to discover. Often it is unclear in a collection of manuscript versions exactly which preceded which, but a phylogenetic tree makes this easy. Here's an example from Harpur's The Creek of the Four Graves. The h-numbers indicate physical versions, and internal layers are indicated by the suffix added to the name after a '/'.

How to make them

It is only really possible to make these trees from within Ecdosis. A back-end Java web service called Tree reads the MVD from the database, computes the distance matrix, builds the tree and streams it back to the web browser directly as an image. The controls at the bottom of the screen are contained in a jQuery module, wrapped up as a Drupal module called 'tree', which now forms part of the Ecdosis-front collection. There are some examples on the Charles Harpur site. Many of the 700 poems have more than one version, so you should be able to select other poems from the Browse menu.

Thursday, 14 January 2016

Multi-version documents and standoff properties

I have written two new papers for Digital Scholarship in the Humanities, one on 'standoff properties as an alternative to XML' and a second on 'Automating textual variation with multi-version documents'. Together they form the basis of a model of how I think historical documents should be encoded. The now 25-year-old drive for 'standardisation' has led to something of a dead-end: people have begun to realise that it is not in fact possible to standardise the encoding of documents written on analogue media. Instead of reusability, sharability and durability, such 'standards' provide only a fertile ground for embedding private technology and interpretations into texts that cannot then be reused for any other purpose. 'Standard' encoding also fails to propose a usable solution to textual variation, which is the one feature that all historical documents share. Rather than attempting to create a new standard, this model reuses existing formats already in use worldwide: HTML, CSS, RDFa, Unicode. Although the model can be fully expressed in these formats, its internal representation predisposes the data into a form that facilitates the things digital humanists want to do with it, rather than throwing up barriers to its processing and reuse. What is needed is something simple that works. These papers are my attempt to explain how that can be achieved.

Wednesday, 9 December 2015

Fixing Search

The previous post explained that searching historical documents is fraught with problems that industrial search engines simply cannot handle. And the reason they can't is that they treat the underlying data as if it were a digitally-authored word-processor file, rather than a historical, manually written physical artefact.

The most serious of these problems is how to deal with versions – both internal and external. An internal version is created whenever an author or scribe changes something in the text, through deletion, replacement or insertion. After each round of corrections the author could, if he or she liked, write out the text in full as a clean draft. This kind of version hidden inside a document may be termed a layer. When a new physical copy is produced, on the other hand, the differences between copies are termed external variation.

Some people seem to think that internal variation can be represented as a format: it is a crossing-out or insertion in the same document rather than a whole new text. Not so. Consider the last three lines of this poem:

Take out the markup and you will get:

Yet even that one subject is to one's prone to starts of wrong 
Of evil As ever So he shall sometimes prove insure: 
  in the clearest well thus fountain ever lies
A sediment—disturb it, and 'twill rise.

This is pure nonsense. The author never wrote that. It is not a text to be searched, viewed or compared, but that is what is being recorded when people treat internal variants as if they were formats, and that is what is being indexed by industrial search-engines.

So in order to search reliably the internal states of the document must first be separated out into coherent layers.

The scrapbook analogy

Let's say you have three copies of a novel. I'm thinking about a favourite sci-fi novel of mine, but there are plenty of other similar cases. The first is a serialisation in a magazine, the second is the first American edition, and the third is the British edition, which was abridged by the publisher. I want to make one edition of all three. So I photocopy each page, and wherever the text is either unique to one copy or shared across several, I cut out that portion and paste it into a scrapbook. As I do so I preserve the order of the fragments, so that each bit of each version precedes the next bit of the same version. There are printed books that do this already, such as Michael Warren's 'Parallel King Lear', where the various quarto and folio versions are laid out side by side so the reader can see the insertions, deletions and variants. But returning to the scrapbook idea: by including each shared piece of text, however small (maybe just one letter), only once, all duplication between the copies can be eliminated.

Well, nearly all. If a section was the same but transposed between two versions then the scrapbook idea won't work: the text will have to be copied from the 'before' to the 'after' location.

A digital scrapbook

If we do the same thing with three digital copies of the novel, by finding all the bits in common electronically, we can also eliminate the copies of the transposed text. The first time the transposed text occurs we record it as normal; the second or third time it is reused we simply refer to it, without copying it. So now our digital scrapbook is just one document, but it records all the text of the work from the three copies just once. By labelling each fragment with the set of versions it belongs to, say "1,2" or "2,3" or "1" or "2" etc., it becomes possible to reconstruct the text of any copy by reading, in order, only those fragments that belong to it. So reading all the fragments labelled "2" or "1,2" or "2,3" etc. will reproduce the text of version 2, and the same goes for the other versions.
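As a minimal sketch (the fragment structure and example text are mine, not the actual MVD file format), reading one version back out of such a scrapbook is just a filtered concatenation:

    // a digital scrapbook: fragments in document order, each labelled with
    // the set of versions it belongs to (contents invented for illustration)
    var fragments = [
        { versions: [1, 2, 3], text: "It was a dark " },
        { versions: [1],       text: "and stormy " },
        { versions: [1, 2, 3], text: "night" },
        { versions: [2, 3],    text: " indeed" }
    ];

    // reconstruct one version by keeping, in order, only the fragments
    // that belong to it
    function readVersion(fragments, v) {
        return fragments
            .filter(function (f) { return f.versions.indexOf(v) !== -1; })
            .map(function (f) { return f.text; })
            .join('');
    }

    // readVersion(fragments, 1) gives "It was a dark and stormy night"
    // readVersion(fragments, 2) gives "It was a dark night indeed"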

And as explained above, internal variations are inherently no different from the external ones. They can be treated as separate editions of the text, as long as we can tease apart the internal versions and produce coherent copies from them. This is always possible, although the arguments are too long to include here.

Indexing a digital scrapbook

The benefits for search can only be realised if we can produce an index of this digital scrapbook, by treating it as just another kind of document. In a real world scenario some texts will be scrapbooks and some won't. We need one way to index them all uniformly. The scrapbook idea is cool, but it is so radical that it threatens to break industrial search engines. How can this be avoided?

The poetry example given above, when represented as a digital scrapbook, might look internally like this, with its four layers interleaved:

[1-4] Yet even [1,2] that one [2] 's [1] subject is [2] prone [1,2] to starts [1] Of evil [2] As ever [1-4] : in the clearest [3,4] he shall sometimes prove insure [3] So [1,2] well thus [3,4] fountain ever [1-4] lies [1-4] A sediment—disturb it, and 'twill rise.

The versions to which each fragment of text belongs are represented here as numbers in square brackets before the fragment they refer to. But forgetting about that for now, this 'document' can be treated like any other. Each word has a position that can be measured by counting characters from the start of the file. The position of 'subject' is 20, counting all the preceding characters of "Yet even ", "that one" and "'s ", without regard to the versions they belong to. This gives a kind of global, cross-version position for each word. For example, the word "one's", in version 2, starts at position 14, which is the same position as the word "one" in version 1. But so long as we have a position for each word in our index this causes no problem, because we already know that they are two different words. All we need is some program that can read digital scrapbooks, which is no big deal.

Admittedly, the index will not record which versions "one" and "one's" belong to, but this can be deduced by reading the digital scrapbook at position 14. Following the fragments for "one's" reveals that this text is in version 2, whereas the word "one" (followed by a space) belongs to version 1.
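Here is a sketch of both operations, assigning global positions and deducing the versions in play at a given position, using the same illustrative fragment structure as above (it ignores the complication of words that span fragment boundaries):

    // assign each word a global, cross-version position by counting characters
    // across all fragments, regardless of which versions they belong to
    function indexScrapbook(fragments) {
        var entries = [], offset = 0;
        fragments.forEach(function (f) {
            var re = /\S+/g, m;
            while ((m = re.exec(f.text)) !== null)
                entries.push({ word: m[0], position: offset + m.index });
            offset += f.text.length;
        });
        return entries;
    }

    // deduce which versions the text at a global position belongs to by
    // walking the fragments again
    function versionsAt(fragments, position) {
        var offset = 0;
        for (var i = 0; i < fragments.length; i++) {
            if (position < offset + fragments[i].text.length)
                return fragments[i].versions;
            offset += fragments[i].text.length;
        }
        return [];
    }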

So no reinvention of the wheel is needed to index a digital scrapbook, or an ordinary file, and the positions of words in both types of file can be stored in the same index.

Finding the text

Now we have our index it should be easy enough to find something in it. The index tells us in which documents a particular word can be found, and at what position(s) in the file. Any standard search engine could be used for this purpose, but in practice it is probably better to make your own, because of what comes after.

First the 'hits' have to be arranged into 'digests', which are just short summaries of the relevant bits of the source documents. Doing this naturally requires the search engine to read the source documents again, so it has to be aware of digital scrapbooks. But that can be done, since a scrapbook is just another format. Finally the hits have to be displayed. That also requires knowledge of the digital scrapbook format, but the beauty is that a single hit in a single digital scrapbook will be displayed as one hit, not as 20 hits in 20 versions or layers. And the user can move around inside the scrapbook, read the text of any version, and see how the hits propagate across the versions. Take a look for yourself.