Wednesday, 7 May 2014

nmergec: a better way to collate and merge variant texts

The original idea of multi-version documents (MVDs) in 2009 was intended to address a common problem when editing cultural heritage texts: often the source documents are not cleanly reproduced texts that only exist in one version. In reality most historical documents exist in multiple drafts, have internal corrections, or are published as a set of editions or manuscript copies. An MVD is designed to gather all this information into a single digital bundle that can be commented on, or displayed in a variety of ways, because its structure mimics that of the documents it represents.

The basic principle of an MVD is that text occurring multiple times is only represented once, whereas the generally accepted method is to transcribe each physical document representing a work as a separate digital file, and to record internal variations within each document via markup. But this raises three serious difficulties:

  1. Markup was originally designed to represent formats as applied to linear text. Using it to describe non-linear text, as found in internal variations, leads to a conundrum: how can markup be used to describe changes in formats or textual structure through variation, when markup is already used to describe those very features? For example, two paragraphs joined up by an editorial stroke, or a cancelled format. Markup can't describe itself.
  2. There is a lot of redundancy between the versions represented as separate documents. If each of 20 versions has the same text at the same location, then why record it 20 times? And every time you copy or edit that piece of text you have to make sure it is the same throughout the 20 copies. This makes a lot of work for no reason. Computers are supposed to reduce work, not increase it.
  3. Comparison and annotation of the multiple files creates a logistical problem: each time two texts are compared requires the repetition of a costly computation, whose result is the same each time. Comparing two documents with internal versions is practically impossible, unless you throw away the internal variation, in which case why record it in the first place? And to annotate a piece of text shared between versions the same annotation has to be applied to each copy separately.

There has to be a better way to reorganise the underlying data of a work so that the information structure facilitates the kinds of operations the user wishes to perform upon it. This is the motivation behind the MVD.

Variant graphs

The basic idea of the MVD is the variant graph. Instead of representing text as a linear stream of characters it represents it as a branching stream that diverges or merges as required. Consider this short segment of Charles Harpur's Social Charity, MS C376 (and this is just one of five realisations of this poem):

In this one manuscript one can discern five layers:

Yet even that one subject is to starts
Of evil: in the clearest well thus lies

Yet even that one's prone to starts of wrong
As ever: in the clearest well their lies

Yet even he shall sometimes prove insure:
So: in the clearest fountain their lies

Yet even he shall sometimes prove insure:
So: in the clearest fountain there lies

Yet even he shall sometimes prove insure:
As in the clearest fountain ever lies

The variant graph of all the above looks like this:

This covers the text from "Yet even" to "in the clearest", which are shared by all five versions. Note that the resolution of this variant graph is the character. Although for some uses a word-granularity is better, for side-by-side display the character-level granularity is superior. It doesn't leave the user guessing as to exactly what is different in any two versions. Also, if desired, it is possible to increase granularity to word-level, but if you start from word-level it is harder to descend to character level.

It is important to realise that this is an internal data structure of a program written out to explain how it works, not a display format. Its purpose is to increase efficiency by removing redundancy, which it clearly does. Even better from the efficiency point of view the above complex graph with all its nodes and arcs can be reduced to a simple list of textual fragments belonging to versions:

[1-5]Yet even
[3-5]he s
[3-5]ll sometime
[1-2]t one
[2-5]s pro
[3-5]ve in
[1]bject is
[1-2] to starts
[2] of wrong
[1-2] ev
[1-5] in the clearest

THIS is what an MVD is, not the graph as such, although this is also a graph, just expressed differently. The list can be converted back into the graph by following three simple rules:

  1. Forced: if one fragment's versions intersect with the versions of the following fragment, then they define a node in the graph, to which the first and the second fragments attach as incoming and outgoing arcs respectively. For example, "to starts" (versions 1,2) intersects with "of wrong" (version 2). In the graph above this corresponds to a node between these two arcs.
  2. Overhang: If a fragment is not attached as outgoing to the preceding fragment by the forced rule, it attaches to the last fragment with which it intersects. For example, "bject is" (version 1) attaches as outgoing to the same node as the earlier "su" (versions 1,3-5), forming the words "subject is".
  3. Reverse overhang: All fragments not attached as incoming by rule 1 attach to the node of the next fragment in the list with which they intersect. So "ve in " (versions 3-5) attaches to "su" (versions a and 3-5) several fragments lower.

Any list of fragments so composed no matter how many versions it contains, no matter how complex, can be turned into a variant graph using these three rules. This is a lot simpler than the earlier nmerge program, which required five rules. The difference is that nmergec generates restricted variant graphs that have this property. So not any variant graph can be decomposed into such a list, only those generated by nmergec.

Some digital humanists have recognised that the variant graph is an intuitive way to conceive textual variation, but they only see it as a display format. But it is so much more than that. In fact the only way to represent complex cases of variation is to use the list of fragments, because in that way complexity does not increase as more versions are added. An MVD is thus best understood as an efficient data representation of the underlying content of a work.

Of course, getting the list of fragments in the first place is the problem. And that is what nmergec is designed to solve.

How nmergec works

Collation programs examine two versions at a time within a window of 50 words or a page. The first matching word they find between the two versions is taken as a position where the two texts are the same, and then the window is moved on by one word. Anything that did not match is a variant, insertion or deletion. The problem with this is that you can only see alignments within the window, and often they are outside it. And alignment word by word is incredibly error-prone. You can easily have transpositions, insertions or deletions that throw it out of alignment. That's why some early collation programs needed human input to help them.

CollateX, Juxta and Juxta Commons use comparison tools adapted from computer code comparison, which were designed to work at the granularity of lines, not words. They have been adapted to work with words, and as a result they are slow, and can only recognise small transpositions over short distances. The fact is, many transpositions occur on a much greater scale. Sometimes whole chapters in a novel are rearranged between drafts.For example, Wittgenstein rewrote his manuscripts by cutting them up into bits and rearranging them. We need something that is much more powerful and designed for the purpose.

MUM to the rescue!

A Maximal Unique Match or MUM is the longest unique common sequence of matching characters between two versions. It has to be unique because the longest match might occur several times, and aligning on one of those would probably be a mistake. Imagine a text with 20 spaces at the start of each line. Once the longest sections of actual text had been aligned those sequences of spaces would be the longest matches left. So it has to be unique. Nmergec uses the powerful Ukkonnen suffixtree algorithm to find MUMs, and it uses that to find the best point to align two texts. In the example above the MUM between versions 1 and 2 is marked in bold:

Yet even that one subject is to starts
Of evil: in the clearest well thus lies

Yet even that one's prone to starts of wrong
As ever: in the clearest well their lies

If we imagine that the first version is the MVD and the second one is the one we are adding to it then the leftovers after accepting this alignment will be "Yet even that one's prone to starts of wrong As ever" and "eir lies". Now all we have to do is to repeat the process on each of these fragments recursively until all that can be aligned is aligned. This is called a greedy algorithm: it gobbles up the biggest bits first, then progressively smaller sections. It's perhaps not perfect but it is undeniably fast, or as fast as these things get. Extending it to work on a full MVD with already many versions is not hard, we just choose a match the same way, but between the whole graph and the new version. I won't draw it because it will be too messy, but it is the same principle.


After the first alignment subsequent alignments don't always match with the "opposite" text. An example is:

I had written him a letter which I had, for want of better
Knowledge, sent to where I met him, years ago, down the Lachlan.

I had written him a letter which I had, for want of better
Knowledge, sent to where I met him down the Lachlan, years ago,

I've marked the transposition in bold. After aligning the longer text that surrounds it, nmergec eventually computes the alignment ", years ago". But it only fits on the other side of an already aligned section:

The big change in nmergec is that it treats transpositions in just the same way as direct alignments. The only difference is that transpositions are assessed based on length and distance between the two halves. If they are too far apart for a given length they are rejected. For example, the "transposition" of the word "is" between the beginning and end of a chapter is not a transposition at all, but the transpostion of "is" between "is there" and "there is" is a transposition. Transpositions are not copied, but the half in the new version "points to" the first half in the already existing MVD, like a child to its parent.


As we sail down the rapids, bits of our former boat can be lashed together to form a raft that may save our lives. Similarly, matches between versions are often not exact. There may be a different type of quotation mark, or a small change, but the match basically goes on after that. The problem is, with suffixtrees or any kind of literal match, these small variations break up the matching process into chunks that, for example, mess up the assessment of whether something is a transposition or not. Also the greedy algorithm described above relies on choosing the biggest chunk first to get the best alignment between two texts, and this influences which bits are transposed and which bits get aligned "directly". So rafting is chaining together matches that have a short section of mismatch, but continue afterwards. Our Clancy of the Overflow example continues:

He was shearing sheep when I knew him, so I sent the letter to him, Just `on spec', addressed as follows, `Clancy, of The Overflow'.

He was shearing when I knew him, so I sent the letter to him, Just 'on spec', addressed as follows, 'Clancy, of The Overflow'.
"In this case the MUM is composed of three sections: 1) "when I knew him, so I sent the letter to him, Just ", 2) "on spec', addressed as follows, " and 3) "Clancy, of The Overflow'.". In-between there is variation in using back-quotes for opening quotes rather than the single quotes of version 2. So these three MUMs get assessed as one block, but are aligned as three separate blocks. So "matches" in the variant graph or MVD are still literal matches, with little variants in between. Up to two characters may mismatch in the MVD or in the new version and the match will be allowed to continue.

Working with any language

Collation programs are usually designed to work on texts in the language of their author. Most of them work OK with European languages that can be expressed in 8-bit characters, but fail on anything more complex. Nowadays everyone uses Unicode, which may convert many characters like accented letters, punctuation etc into multi-byte sequences in UTF-8, the most widespread encoding in modern computing. This plays havoc with programs designed only to work with 8-bit characters, so I decided from the start to make nmergec work with 16-bit Unicode. That way it could handle Chinese and Indian languages, ancient Greek, Russian, etc with ease. To do this I had to use ICU, a library maintained by IBM and now the embodiment of the Unicode standard in code form. ICU can convert just about any encoding into another. An MVD can be in any encoding you like, but internally nmergec uses 16-bit Unicode for all comparisons. Unless you are editing a text in an obscure dead language like ancient Phoenician you don't need more than 16 bits, and even then nmergec will continue to work because it uses UTF-16, so 32-bit characters will just get split into multi-character sequences. I think multi-language handling is very important and as far as I am aware no other merging/collation program supports it.

Immortal code

When Thucydides wrote his History of the Peloponesian War in the fourth century BC he said he was creating a "possession for ever", and it turned out to be true. But when writing computer programs that is a hard act to follow. Technologies come and go, and any program written today is likely to be useless in two years' time. I didn't want to keep coming back to my program just because some dimwit in some mega company thought it cool to change their programming language or tools just to roll out some new feature. So I wrote nmergec in C, one of the oldest and still the most popular programming language on the planet. Why? Who uses it any more? Surely C will soon die because it is so old. C is popular for several reasons, which happen also to be my reasons for choosing it:

C is incredibly stable. I have programs on my computer written in 1980 in C that still compile flawlessly. C is the language of Linux, OSX, iOS, BSD and Android, and all other flavours of UNIX, and variations of it are used to write mobile phone software and virtually all device drivers. C is going to be around for a very long time.

Also C is as fast as anything out there, and I needed speed. C is compiled directly into machine code, whereas almost all the new fancy languages like Node, Python, Haskell, Ruby, etc. are basically scripting languages that get compiled every time they run. That makes them about 100 to 1000 times slower than C. True, Java may be a bit faster in some test cases, but I have found that C is dramatically faster than Java when you have a complex program.

C allocates and frees memory as it goes. Java uses a garbage collector, which slows it down at crucial moments, and also makes it a memory-hog. You have to declare in advance how much memory you want to use in Java for your program to run, but in my case this is inconvenient because it depends on the size of the files. Since Java wastes so much memory this was the main reason it was limited to 70K files in the old nmerge.

Almost any language already in use can link to a C program. In Java you have JNI, and parts of Java itself already use C for speed. In PHP you can define an extension which is written in C. The same goes for most other languages. C is versatile, and will compile in any programming environment.

You may say that few people will be able to understand what I have written because it is complex and in a language not much used in the digital humanities. True, but those who are serious to extend or repair the tool will either make the effort or hire someone who has the skill. I have documented it well, it is carefully organised into manageable chunks, and it has a comprehensive test suite. I can do no more.

Size matters

Cultural heritage texts can be big things. Often alterations between versions can span entire works, such as Joseph Furphy's Such is life:

Notice how the author's original manuscript was revised by cutting out two entire chapters, which were replaced by two new ones in the printed book, but then the excised chapters were revised and published as two separate books. Variation occurs on this scale in real-world examples, not just at the microscopic level of individual sentences. Nmergec was designed to compare texts of several megabytes. I haven't tested the whole program yet on files that large, at this time of writing, but it is designed to do it. No other comparison program can handle variation over spans this large.

NMergeC is modular

One of the problems with the old nmerge was that it kept growing as more and more features had to be added. I needed a way to generate stemmas or trees showing the ancestors and children of versions. I also needed to produce an apparatus aligned by words, to compute the degree of difference between each version. So in nmergec I decided to break each of the program's functions into modules. The nmergec program itself just loads and saves MVDs, nothing more. mvd_add adds one version to an existing MVD. There are other modules, as yet incomplete, which do a great variety of tasks, but operate on MVDs. If you want nmergec to do something new just write a plugin. There is a simple application-programming interface to do this, though I haven't documented it yet. The following plugins are already defined or will soon be adapted from parts of the old nmerge:

  • create an new empty MVD
  • generate a table suitable for an apparatus
  • compute the variants of a section of one version
  • add a new version
  • delete one version and its contents from an MVD
  • list the versions of an MVD
  • compute a phylogenetic tree using neighbour-join
  • save an MVD as an archive of separate versions
  • read one version
  • read in an archive and convert to an MVD
  • compare two versions
  • search for a string
  • rename an MVD
  • replace an existing version with a new one

Keeping it simple (or the KISS principle)

One way to visualise an MVD is as a deck of cards. Each textual fragment corresponds to the value of a card, and the suit corresponds to the set of versions to which the fragment belongs. We also sort the deck into a particular order, like a card-shark cheating in some way. NMergec differs from nmerge in that it operates directly on the cards or fragments, and never builds an explicit graph. Most of the complexity of nmerge was caused by this simple difference. NMergec can be understood as a card-shuffling program, that splits cards into segments and shuffles them into a very particular order. The variant graph still exists, but it is implicit, and doing it this way is a lot simpler. For example, let's say you have 2,000 versions of a work. That is quite possible with New Testament texts. If you create a variant graph of even a short section of such a work it is incomprehensible. It is also, unsurprisingly, just as complex for the programmer to make sense of. But a list of fragments remains simple no matter how many versions are added to it, and never becomes a "huge hairy mess".

NB: "nmergec" is pronounced "en-merge-see", not "en-merge-ec"

No comments:

Post a Comment