Monday 18 January 2016

Tree View

I added a Tree view to the Ecdosis front toolset. Since a multi-version document (MVD) represents multiple versions of the same work, it is pretty much like multiple variants of a genome. A phylogenetic tree is very close to a stemma describing the relationships between witnesses in a multiple-manuscript tradition. A genome is a sequence of nucleotides expressed in a four-letter vocabulary (GACT), but a single version of a historical document can likewise be expressed as a sequence of letters in the Unicode character set. Like genomes, historical texts are subject to insertions, deletions, substitutions and transpositions. Hence the same tools used by geneticists ought to work for humanists too.

Distance matrices

The question arises of how to generate a stemma or tree from a set of versions. Many phylogenetic approaches use a distance matrix: a table that describes how different each version is from every other version. Since in an MVD all the bits of text that are shared by versions and the bits that are different have already been computed, making a distance matrix is easy. The distance matrix for the four versions/layers of Abed Ben Haroun by Charles Harpur looks like this:

     A        B        C        D
A    0.00000  0.05344  0.10688  0.11508
B    0.05344  0.00000  0.13370  0.08087
C    0.10688  0.13370  0.00000  0.19323
D    0.11508  0.08087  0.19323  0.00000
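A distance of this kind can also be approximated directly from the raw version texts. Here is a rough sketch in Python using the standard library's difflib – an illustration only, not the MVD's internal alignment, so the numbers it produces will differ from those in the table:

```python
from difflib import SequenceMatcher

def distance(a, b):
    """Edit-style distance in [0, 1]: the fraction of characters the
    two versions do not share. SequenceMatcher.ratio() returns 2*M/T,
    where M is the number of matching characters and T is the combined
    length of both strings."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def distance_matrix(versions):
    """Full symmetric matrix over a dict of {name: text}."""
    names = sorted(versions)
    return {(p, q): distance(versions[p], versions[q])
            for p in names for q in names}
```

Identical versions score 0.0 and versions with nothing in common score 1.0, giving a matrix with the same shape as the one above.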

Obviously the edit-distance between each version and itself is 0, which explains the diagonal of zeros in the table. The larger the number, the more 'distant' one version is from another. So here the biggest difference is between versions C and D. Only half of the off-diagonal values are needed, since the distance between versions A and D is the same as the distance between D and A, but the full-matrix format is traditional and can be fed into a tree-drawing algorithm. There are many of these, but one of the best distance-based methods is neighbour-joining. The version I chose is a refinement of that technique called 'FastME', published some years ago by Desper and Gascuel. The tree view itself is provided by the 'drawgram' program in the Phylip package, which allows different visualisations of rooted trees, since a rooted tree most closely resembles the humanist's stemma.
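For readers curious about what neighbour-joining actually does, here is a minimal sketch of the classic algorithm (Saitou and Nei's original, not FastME's refinement) in Python. The frozenset-keyed matrix and the Newick output are my own choices for the illustration:

```python
import itertools

def neighbour_join(dist, labels):
    """Classic neighbour-joining on a symmetric distance matrix.

    `dist` maps frozenset({i, j}) -> distance for each pair of labels;
    returns a Newick-style string with branch lengths."""
    nodes = list(labels)
    d = dict(dist)  # working copy, extended as nodes are merged
    while len(nodes) > 2:
        n = len(nodes)
        # r[i]: sum of distances from i to every other active node
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i)
             for i in nodes}
        # Q-criterion: join the pair minimising (n-2)*d(i,j) - r_i - r_j
        i, j = min(itertools.combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)]
                                 - r[p[0]] - r[p[1]])
        dij = d[frozenset((i, j))]
        li = dij / 2 + (r[i] - r[j]) / (2 * (n - 2))  # branch to i
        lj = dij - li                                 # branch to j
        u = f"({i}:{li:.4f},{j}:{lj:.4f})"            # new internal node
        # distances from the new node to every remaining node
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = (d[frozenset((i, k))]
                                        + d[frozenset((j, k))] - dij) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    # connect the last two nodes; put the whole length on one branch
    a, b = nodes
    return f"({a}:{d[frozenset((a, b))]:.4f},{b}:0.0000);"

harpur = {frozenset(p): v for p, v in [
    (("A", "B"), 0.05344), (("A", "C"), 0.10688), (("A", "D"), 0.11508),
    (("B", "C"), 0.13370), (("B", "D"), 0.08087), (("C", "D"), 0.19323)]}
print(neighbour_join(harpur, ["A", "B", "C", "D"]))
```

Run on the matrix above, this pairs A with C and B with D: it is the Q-criterion, not the raw distance, that decides which versions are neighbours.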

Stemmatic trees are useful even in cases where all the sources were written by the same person. In that case the tree shows which version derived from which – something that would otherwise require a lot of manual labour to discover. Often it is unclear in a collection of manuscript versions exactly which preceded which, but a phylogenetic tree makes this easy to see. Here's an example of Harpur's The Creek of the Four Graves. The h-numbers indicate physical versions, and internal layers are indicated by a suffix added to those names, introduced by a '/'.

How to make them

It is only really possible to make these trees from within Ecdosis. A back-end Java Web service called Tree reads the MVD from the database and computes the distance matrix. It then builds the tree and streams it back to the Web browser directly as an image. The controls at the bottom of the screen are contained in a jQuery module wrapped up as a Drupal module, which now forms part of the Ecdosis-front collection. It is called 'tree'. There are some examples on the Charles Harpur site. Many of the 700 poems have more than one version, so you should be able to select other poems from the Browse menu.

Thursday 14 January 2016

Multi-version documents and standoff properties

I have written two new papers for Digital Scholarship in the Humanities, the first on 'standoff properties as an alternative to XML' and the second on 'Automating textual variation with multi-version documents'. Together they form the basis of a model of how I think historical documents should be encoded. The now 25-year-old drive for 'standardisation' has led to something of a dead end: people have begun to realise that it is not in fact possible to standardise the encoding of documents written on analogue media. Instead of reusability, sharability and durability, such 'standards' provide only fertile ground for embedding private technology and interpretations into texts, which cannot then be reused for any other purpose. 'Standard' encoding also fails to propose a usable solution to textual variation, which is the one feature that all historical documents share. Rather than attempting to create a new standard, this model reuses existing formats already in use worldwide: HTML, CSS, RDFa and Unicode. Although the model can be fully expressed in these formats, its internal representation predisposes the data into a form that facilitates the things digital humanists want to do with it, rather than throwing up barriers to its processing and reuse. What is needed is something simple that works. This is my attempt to explain how that can be achieved.
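The core standoff idea is easy to sketch. In the toy Python below (an illustration of the general technique only, not the actual format proposed in the papers) the text is stored as plain Unicode, and each property is a separate record naming a range by offset and length:

```python
# Illustration of standoff properties (a toy format, not the one
# proposed in the papers): the text is plain Unicode, and every
# annotation lives outside it as a (name, offset, length) record.
text = "The quick brown fox"

properties = [
    {"name": "emphasis", "offset": 4, "len": 5},   # covers "quick"
    {"name": "colour",   "offset": 10, "len": 5},  # covers "brown"
]

def annotated(text, properties):
    """Pair each property name with the substring its range covers."""
    return [(p["name"], text[p["offset"]:p["offset"] + p["len"]])
            for p in properties]

print(annotated(text, properties))
# prints [('emphasis', 'quick'), ('colour', 'brown')]
```

Because the ranges are independent records rather than nested tags, overlapping properties – a classic stumbling block of inline XML markup – need no special treatment, and the plain text remains reusable on its own.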