Tuesday, 20 September 2016

The fall of XML

Talk to software developers today and they will tell you that 'XML is toast'. XML has not been replaced by any single technology. It is not JSON that has killed off XML; it is the mobile Web and associated technologies. Digital humanists who think that XML is here to stay, and imagine that they can continue to build software on top of it, should take a look at the following graphic, derived from Stack Overflow, one of the most popular discussion forums for software developers. HTML, JavaScript, JSON and CSS have collectively supplanted XML, and these technologies no longer have any need of it. You may say 'Who cares what software developers think?' But they are the people who build and maintain the tools that digital humanists use. If they abandon XML then those tools will soon perish or become obsolete, disconnected from the services they were designed to support.

When the World's way is running East,
    Keep your way running West;
And it is two to one, at least,
    That yours will be the best.

Charles Harpur

Tuesday, 26 July 2016

Sync-scrolling images and text when editing transcriptions

When transcribing original documents it is helpful to have the page image next to the transcription as it is being written. That way the transcriber can see what the next word to transcribe is, or quickly check for mistakes in someone else's transcription. Most documents contain more than one page, so this gives rise to a problem: how can the relevant page image be placed next to the relevant portion of the transcription so that the transcriber can easily see what corresponds to what?

One obvious solution is to transcribe page by page. For each image show only the transcription of that page. However, this creates a problem for the technician and user alike: now the transcription of the document is divided into parts which must be stitched back together not only by the computer when the document is saved, but also mentally by the transcriber. Pages rarely end at paragraph end. More often they split in the middle of a sentence or even a word. The transcriber may change only one page and then the computer must reassemble the document with that one altered page in its middle. If the page is marked up in some way the page's transcription may not be complete or well-formed, which would hamper editing. All this is both technically messy and counter-intuitive for the user.

A better method is to display the entire document for editing: both the transcription as a continuous running text and the page images to which it corresponds as a scrolling list. This is not often done because of an intrinsic alignment problem: how to find the part of an image that corresponds to the currently displayed piece of text. To be readable, page-images may need to be higher than the screen. Typically the transcription of a page is much shorter. But as a general rule the centre of the page's transcription should be aligned with the centre of the corresponding page-image across the centre of the screen. This is what the user expects. However, this creates a problem: the first and last pages cannot possibly conform to this rule. The first page must be aligned so that the top of the page image aligns with the top of the transcription text. And likewise at the bottom: on the last page the end of the transcription must correspond with the bottom of the last page-image.

Live sync-scrolling

To achieve live sync-scrolling we need a table or function that gives the left-hand scroll position for each possible right-hand scroll position.

What we mustn't do is make the list of images itself scrollable. If we do that then we will have to link its own scrolling with that of the scrolling text. Since the right-hand-side (RHS) controls the scrolling of the left-hand-side (LHS) we will get infinite feedback if we link the scrolling in both directions. 'Scrolling' on the LHS can be achieved by other means, for example, by varying the CSS 'top' property of the overall list of images.

The scrolling positions for the LHS are just the mid-points of each image in the overall list. These correspond to the mid-points for each page of text in the RHS. The latter can be found easily by parsing the text. In my case, since I use a minimal markup language (MML), page-starts are marked by [NNN] on a line by itself, where NNN is the page-number. Any scrolling position in-between two of these corresponding values can be interpolated by scaling. However, this does not work for the first and last pages because the desirable alignments in these cases are the top of the first image with the top of the first page of text, and likewise the end of the text with the bottom of the last image. So my solution was just to replace the mid-points of the first page in both the LHS and RHS with half the window-height. Likewise, for the last mid-points I used the length of the text and the length of the image-list, each minus half the window-height. In some cases 'half-way from the top or bottom of the window' may be in the middle of a page that is not the first or last. In that case the overlapping values can simply be removed, as long as the ones at the extreme ends of the list are preserved.
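
The table-and-interpolation idea can be sketched roughly as follows. This is only an illustration in Python, not the code running on the site: the function names are hypothetical, and it assumes that every line of text has the same pixel height and that both columns are taller than the window. In a browser the text positions would be measured from the rendered page instead.

```python
import re
from bisect import bisect_right

# A page-start is [NNN] on a line by itself, as described above.
PAGE_MARK = re.compile(r"^\[(\d+)\]$")

def text_midpoints(text, line_height):
    """Pixel mid-point of each page of text, plus the total text height.
    Assumes (for illustration only) a fixed pixel height per line."""
    lines = text.split("\n")
    starts = [i for i, line in enumerate(lines) if PAGE_MARK.match(line.strip())]
    bounds = starts + [len(lines)]
    mids = [(bounds[k] + bounds[k + 1]) / 2 * line_height
            for k in range(len(starts))]
    return mids, len(lines) * line_height

def image_midpoints(heights):
    """Pixel mid-point of each image in the stacked list, plus total height."""
    mids, top = [], 0
    for h in heights:
        mids.append(top + h / 2)
        top += h
    return mids, top

def build_table(text_mids, text_len, img_mids, img_len, window_height):
    """Pairs of (text centre, image centre) with the end pages re-anchored."""
    half = window_height / 2
    first = (half, half)                        # top aligned with top
    last = (text_len - half, img_len - half)    # bottom aligned with bottom
    pairs = list(zip(text_mids, img_mids))[1:-1]
    # Drop interior mid-points that overlap the two end anchors.
    inner = [(t, i) for t, i in pairs
             if first[0] < t < last[0] and first[1] < i < last[1]]
    return [first] + inner + [last]

def image_scroll(table, text_scroll_top, window_height):
    """Image-list offset for a given text scroll position (linear scaling)."""
    half = window_height / 2
    centre = text_scroll_top + half
    xs = [t for t, _ in table]
    if centre <= xs[0]:
        return table[0][1] - half
    if centre >= xs[-1]:
        return table[-1][1] - half
    k = bisect_right(xs, centre) - 1
    (t0, i0), (t1, i1) = table[k], table[k + 1]
    return i0 + (centre - t0) * (i1 - i0) / (t1 - t0) - half
```

The value returned by image_scroll would be negated and applied as the CSS 'top' of the image list each time the text is scrolled, so that only the RHS ever generates scroll events.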

There is a demo of this method on Charles Harpur, in the test-interface for the letter from W.A. Duncan to Henry Parkes dated around 1841.

Tuesday, 23 February 2016

Improvements to events editor

Events are things that happen in the life of your author. One of the successes of the AustESE project was the realisation that such events would best be represented as database records, so that biographical information could be rearranged into various useful forms. Events have a 'fuzzy' date more often than a precise one. So 'ca. 1834' or 'before February 1865' is what you would expect as the date of an event, not 26/12/1845. And events can have a description and a list of references. These two are represented as simple HTML, a globally interoperable standard for mixed content. So forget about XML, which only serves as a preliminary to making HTML.
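
To make the idea of a 'fuzzy' date concrete, here is a minimal sketch of how such a record might be held and sorted. It is not the AustESE schema: the FuzzyDate fields, the list of qualifiers and the parse_fuzzy_date helper are all assumptions made for illustration, and only dates written like the examples above are handled.

```python
from dataclasses import dataclass
from typing import Optional

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]
# Hypothetical set of qualifiers; a real events editor may allow others.
QUALIFIERS = {"ca.", "before", "after"}

@dataclass(frozen=True)
class FuzzyDate:
    year: int
    month: Optional[int] = None   # 1-12, or None if unknown
    day: Optional[int] = None     # 1-31, or None if unknown
    qualifier: str = ""           # "", "ca.", "before" or "after"

    def sort_key(self):
        # Missing parts sort first within their year or month.
        return (self.year, self.month or 0, self.day or 0)

def parse_fuzzy_date(text):
    """Parse strings such as 'ca. 1834' or 'before February 1865'."""
    tokens = text.split()
    qualifier = ""
    if tokens and tokens[0].lower() in QUALIFIERS:
        qualifier = tokens.pop(0).lower()
    year = int(tokens.pop())      # the year is always the last token
    month = day = None
    if tokens and tokens[-1].lower() in MONTHS:
        month = MONTHS.index(tokens.pop().lower()) + 1
        if tokens and tokens[-1].isdigit():
            day = int(tokens.pop())
    return FuzzyDate(year, month, day, qualifier)

events = [parse_fuzzy_date(s)
          for s in ("before February 1865", "ca. 1834", "26 December 1845")]
timeline = sorted(events, key=FuzzyDate.sort_key)
```

Keeping the qualifier separate from the numeric parts is what lets the same records be rearranged, for example into a timeline, without losing the uncertainty that the editor recorded.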

To WYSIWYG or not?

This is where the problems started. AustESE used a sophisticated HTML editor that filtered out potentially dangerous HTML constructs that hackers could use to implant code exploits. But those tags and attributes also happen to be quite useful for building a website. For example, the title attribute or special data-attributes on a link could be used to animate a popup image. Unfortunately the editor stripped all these out when the user saved, and the images would disappear. So I swapped it for a simple HTML editor, on the grounds that users would still want to see a WYSIWYG preview of their HTML before saving it. But that didn't work out any better. Since a preview is created when the user clicks on another element it already reverts to an effective 'preview' that won't be saved until the user clicks the 'save' button. So the elaborate WYSIWYG editor could be replaced by a simple textarea. Sophisticated, huh?

But what about the 'dangerous' HTML constructs I am no longer filtering out? Since the events editor is not publicly accessible and all editors of the content are guaranteed to be trustworthy, this extra security measure is quite worthless.

The loss of the WYSIWYG editing environment is also not a problem, since editors are mostly sophisticated enough to handle this. After all, what we need to put into the HTML goes beyond mere formatting, and for our purposes a WYSIWYG environment simply doesn't suffice.

Which brings me to my main point: The best ideas come when you decide to delete something, not when you add some shiny new GUI component you probably don't even need. Less is truly more. But finding out what to throw away is the problem.

Events editor before the user clicks on the description area

Events editor with the textarea enabled for the description field

Events editor after the user clicks on the references field

Sunday, 14 February 2016

Improvements to table view

Table view seems to be much liked by my two expert editors. But they did request some changes, which I have now implemented and which require some explanation.

First, they wanted to see all the text of each version, and not restrict it to the base row in those columns where it was all the same. Second, they wanted some way to order or reorder the versions, and third, they needed a way to reduce clutter by deleting rows. Also I replaced the clumsy slider with a conventional scrollbar to enable swiping on tablets. These changes have made table view much more useful, without adding significantly to its complexity.

Moving up and down

If the user clicks on a siglum in the leftmost column two small buttons appear to raise or lower that row. After 5 seconds the buttons disappear in any case. The disappearing buttons are cool because they only appear when needed and free up the display when not. Clicking on the up button moves that row above the one immediately above it, and clicking on the down button moves it down one row. Only the up button appears on the bottom row and only the down button on the top row.

Selecting some versions

Normally the user wants to see all the versions, but if that is a bit overwhelming the rows can be reduced by deselecting them from a simple dropdown menu in the toolbar that has been added at the foot of the display. Selected versions are shown by adding a tick-mark after their names. A deselected version loses its tick-mark, which reappears if the version is chosen again. The usual user interface method for showing a set of options is to use checkboxes, but in this case there may be very many, and it would get too confusing. So a select dropdown is used instead. The 'rebuild' button resubmits the newly selected versions and builds a reduced (or expanded) table.

Tuesday, 9 February 2016

Table view

One way to demonstrate the flexibility of multi-version documents is to display the same information in several ways. Charles-harpur.org and Ecdosis now boast a table view, which displays all the versions of a work in stacked form, so an editor can quickly scan through a text to see what is a variant of what, no matter how complex the variation.

My XML-based rivals are still struggling to produce such views, but I doubt they will succeed. Their problem is that they record internal variants (deletions, additions, substitutions) inline as part of the text, and to produce a table view you have to tease apart these changes into separate layers, which is almost impossible due to markup variability. So this display, although it doesn't look all that earth-shattering, is actually unique. Also it is what textual editors have long been bugging me for.

Table view of part of The Creek of the Four Graves

To try with different poems, select a poem from the Browse menu, then click on the "table view" tab. Some poems are not uploaded yet and may not work, but most are OK. Some minor features: the table of sigla on the left is anchored. Mousing over the sigla shows their full name in case they are partly obscured. The spacing could be improved, but it is basically all there.

Monday, 18 January 2016

Tree View

I added Tree view to the Ecdosis front toolset. Since a multi-version document (MVD) represents multiple versions of the same work it is pretty much like multiple variants of a genome. A phylogenetic tree is very close to a stemma describing the relationships between witnesses in a multiple-manuscript tradition. A genome is a sequence of nucleotides expressed in a four-letter vocabulary (GACT) but a single version of a historical document can also be expressed as a sequence of letters in the Unicode character set. Like genomes, historical texts are also subject to insertions, deletions, substitutions and transpositions. Hence the same tools used by geneticists ought to work for humanists also.

Distance matrices

The question arises how to generate a stemma or tree from a set of versions. Many of the phylogenetic approaches use a distance matrix: a table that describes how different each version is from every other version. Since in an MVD all the bits of text that are shared by versions and the bits that are different have already been computed, making a distance matrix is easy. The distance matrix for the four versions/layers of Abed Ben Haroun by Charles Harpur looks like this:


Obviously the edit-distance between each version and itself is 0, which explains the diagonal of zeros in the table. The larger the number the more 'distant' it is from the other version. So here the biggest difference is between versions C and D. Of the other values only half are needed, since the distance between versions A and D is the same as the distance between D and A. But the format is traditional, and can be fed into a tree-drawing algorithm. There are many of these but one of the best distance-based methods is neighbour-joining. The version I chose is a refinement of that technique published some years ago by Desper and Gascuel called 'FastME'. The tree-view is provided by the 'drawgram' program in the Phylip package, which allows different visualisations of rooted trees, since these represent most closely the humanist's stemma.
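
For readers who want to see the shape of such a matrix, the following sketch builds one from plain strings. It is not how Ecdosis does it: the Tree service derives distances from the shared and differing text already recorded in the MVD, whereas this example substitutes a plain pairwise Levenshtein edit-distance. The version texts are invented toy data, and the square output layout (a count, then one row per name padded to ten characters) is an assumption about the kind of input PHYLIP-style tools expect.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programme)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def distance_matrix(versions):
    """Symmetric matrix of pairwise distances with a zero diagonal."""
    n = len(versions)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Only half the values are computed; the rest are mirrored.
            matrix[i][j] = matrix[j][i] = edit_distance(versions[i], versions[j])
    return matrix

def to_square_format(names, matrix):
    """Lay the matrix out as a count line followed by one named row each."""
    lines = [str(len(names))]
    for name, row in zip(names, matrix):
        lines.append(name[:10].ljust(10) + " ".join(str(d) for d in row))
    return "\n".join(lines)

# Invented toy versions, not Harpur's text.
toy = {"A": "the creek", "B": "the creak", "C": "a creek"}
matrix = distance_matrix(list(toy.values()))
print(to_square_format(list(toy.keys()), matrix))
```

A matrix in this form is what a distance-based method such as neighbour-joining or FastME takes as its starting point.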

Stemmatic trees are useful even in cases where all the sources are written by the same person. What they show in this case is which version derived from which, something that would require a lot of manual labour to discover. Often it is unclear in a collection of manuscript versions exactly which preceded which, but a phylogenetic tree makes this easy. Here's an example of Harpur's The Creek of the Four Graves. The h-numbers indicate physical versions, and internal layers are indicated by the suffix added to the name after a /.

How to make them

It is only really possible to make these trees from within Ecdosis. A back-end Java Web-service called Tree reads the MVD from the database and computes the distance matrix. It then builds the tree and streams it back to the Web-browser directly as an image. The controls at the bottom of the screen are contained in a jQuery module wrapped up as a Drupal module, which now forms part of the Ecdosis-front collection. It is called 'tree'. There are some examples on the Charles Harpur site. Many of the 700 poems have more than one version so you should be able to select other poems from the Browse menu.

Thursday, 14 January 2016

Multi-version documents and standoff properties

I have written two new papers for Digital Scholarship in the Humanities: one on 'standoff properties as an alternative to XML', and a second on 'Automating textual variation with multi-version documents'. Together they form the basis of a model of how I think historical documents should be encoded. The now 25-year-old drive for 'standardisation' has led to something of a dead-end: people have begun to realise that it is not in fact possible to standardise the encoding of documents written on analogue media. Instead of reusability, shareability and durability, such 'standards' provide only a fertile ground for embedding private technology and interpretations into texts that cannot then be reused for any other purpose. 'Standard' encoding also fails to propose a usable solution to textual variation, which is the one feature that all historical documents share. Rather than attempting to create a new standard, this model reuses existing formats already in use worldwide: HTML, CSS, RDFa, Unicode. Although the model can be fully expressed in these formats, its internal representation predisposes the data into a form that facilitates the things that digital humanists want to do with it, rather than throwing up barriers to its processing and reuse. What is needed is something simple that works. This is my attempt to explain how that can be achieved.