Monday 18 January 2016

Tree View

I added Tree view to the Ecdosis front toolset. Since a multi-version document (MVD) represents multiple versions of the same work it is pretty much like multiple variants of a genome. A phylogenetic tree is very close to a stemma describing the relationships between witnesses in a multiple-manuscript tradition. A genome is a sequence of nucleotides expressed in a four-letter vocabulary (GACT) but a single version of a historical document can also be expressed as a sequence of letters in the Unicode character set. Like genomes, historical texts are also subject to insertions, deletions, substitutions and transpositions. Hence the same tools used by geneticists ought to work for humanists also.

Distance matrices

The question arises how to generate a stemma or tree from a set of versions. Many of the phylogenetic approaches use a distance matrix: a table that describes how different each version is from every other version. Since in an MVD all the bits of text that are shared by versions and the bits that are different have already been computed, making a distance matrix is easy. The distance matrix for the four versions/layers of Abed Ben Haroun by Charles Harpur looks like this:

ABCD
A0.00.053440.106880.11508
B0.053440.00.133700.08087
C0.106880.133700.00.19323
D0.115080.080870.193230.0

Obviously the edit-distance between each version with itself is 0, which explains the diagonal of zeros in the table. The larger the number the more 'distant' it is from the other version. So here the biggest difference is between versions C and D. Of the other values only half are needed, since the distance between versions A and D is the same as the distance between D and A. But the format is traditional, and can be fed into a tree-drawing algorithm. There are many of these but one of the best distance-based methods is neighbour-join. The version I chose is a refinement of that technique published some years ago by Desper and Gascuel called 'FastME'. The tree-view is provided by the 'drawgram' program in the Phylip package, which allows different visualisations of rooted trees, since these represent most closely the humanist's stemma.

Stemmatic trees are useful even in cases where all the sources are written by the same person. What it shows in this case is which version derived from which – something that would require a lot of manual labour to discover. Often it is unclear in a collection of manuscript versions exactly which preceded which, but a phylogenetic tree makes this easy. Here's an example of Harpur's The Creek of the Four Graves. The h-numbers indicate physical versions and the internal layers are the bit added to the name introduced by a /.

How to make them

It is only really possible to make these trees from within Ecdosis. A back-end Java Web-service called Tree reads the MVD from the database and computes the distance matrix. It then builds the tree and streams it back to the Web-browser directly as an image. The controls at the bottom of the screen are contained in a JQuery module wrapped up as a Drupal module, which now form part of the Ecdosis-front collection. It is called 'tree'. There are some examples on the Charles Harpur site. Many of the 700 poems have more than one version so you should be able to select other poems from the Browse menu.

No comments:

Post a Comment

Note: only a member of this blog may post a comment.