Monday, 30 September 2013

Java stack size and the Nmerge memory limitation

My Nmerge program runs within a JVM whose maximum memory is set to 1GB. That seemed to be plenty but it still ran out of something whenever I gave it more than about 50K of text to digest. I always thought this was because Java was such a memory-hog. And it is. But that wasn't the reason it was failing for years. The simple answer was that increasing memory space in Java via the -Xmx option doesn't increase the stack space in proportion. Stack space is controlled by the -Xss option, and the default stack size is just 512K. Now that's plenty for most applications but nmerge was no ordinary program. At its heart were two mutually recursive procedures. I proved mathematically that they would terminate and produce a valid result, but what I couldn't estimate was the maximum stack size they would require. It turns out that the stack size needed is proportional to the length of the file. Or more precisely it was the number of 'nodes' in the graph. A short file with a complex graph structure (and lots of variation) might need more stack space than a longer file with relatively little variation. Whatever the reason, increasing the stack space to 4MB allowed the program to deal with much larger files. And I had a Homer Simpson D'oh! moment.

Saturday, 7 September 2013

Splitting TEI files into coherent layers

Embedded markup prevents transcriptions from being interoperable, because the tags are chosen on the basis of human interpretation, not predictable machine instructions. Removing embedded markup from text, however, is fraught with problems, particularly if tags are used to record variant versions. Recently I have had to revisit this problem that I thought was already solved. It wasn't. I needed to rethink it and hammer out a reliable algorithm for the mountains of very variable data it was likely to deal with. It's important because without it we'll never be able to build interoperable tools in the digital humanities because transcriptions deal with multiple versions all of the time.

Taking the markup out of a text that contains alternatives naïvely produces garbage. A simple example: <del>dog</del><add>cat</add> would produce 'dogcat'. It is much better to store markup and text separately, and to separate out the versions into complete, readable texts. If desired, differences between versions can then be computed, and merged, so that a compact representation of a document can be created to act as a focus for queries. For example: 'what are the changes carried out by the author as the first layer of correction, or the second layer?' 'What are the differences between the first and second complete drafts, the differences between MS A and MS B?' 'Markup the text of the Bloggs edition using this markup set and that markup set combined and send it to a book format'. All these operations are at best inefficient using embedded versions, and at worst impossible. But it is not actually necessary to merge the variant texts. Just storing them as separate versions, although less efficient than the merged representation, still allows applications to access coherent layers of a multi-version document using simple applications.

Taking versions out of the text

One of the main functions of the Versioning Machine is to do exactly this. But it's hard to do well. What I am going to describe is a fourth attempt to do it properly, using only clearly defined properties of a TEI-encoded text. So here's my main concept of version 4 of the splitting algorithm:

Clusters and siblings


A cluster is a fragment of a DOM tree containing tags that describe embedded versions of a text. In the TEI schema the tags <mod>, <subst>, <choice>, <sic> and <corr>, <add> and <del>, <abbr> and <expan>, <orig> and <reg>, <app>, <lem> and <rdg> describe such textual phenomena. Of these <mod>, <subst>, <choice> and <app> can be more or less ignored, because all they do is bracket the other tags. <subst> is used, for example, to group <add>s and <del>s. I'm going to call the other tags siblings, even though the term may also describe ordinary siblings of unrelated elements in a DOM tree. In my case a sibling means an element that is one of <sic> and <corr>, <add> and <del>, <abbr> and <expan>, <orig> and <reg>, <lem> or <rdg>. The set is not fixed, however, and further elements that behave in the same way could be defined, or these could be renamed. A sibling can either be adjacent to another sibling or be its descendant or ancestor, because that's all you can do in a tree. (I'm discounting interlinking elements via attributes because this is non-computable in the general case when used to record versions.)

Identifying a cluster in a tree is easy. A sibling, once encountered, is constructed depth-first (that is, children first, then siblings at the same level). It doesn't matter if a direct child is not a member of the sibling set. That doesn't break the cluster, because even if no children are themselves siblings, they are still affected by the choice instigated by their parents. They are still conditional text. What breaks a cluster and sets it off from the rest of the text is the occurrence of a non-sibling element next to a sibling at the top-level of a cluster. Or the end of the list of siblings itself. So when, for example, a <rdg> is the last in an <app> series. A cluster may also be a single element, such as a lone <del>. It's still a cluster of size one.

Siblings are also numbered using the depth-first ordering. The first <rdg> encountered can be called 'rdg0', and the second 'rdg1' etc. But unconnected siblings at the same level (such as two independent deletions in the same line) are given the same version-tag. All non-sibling elements outside of any cluster automatically receive the 'base' version to indicate that they belong together implicitly to 'all' versions.

Four rules are needed to separate versions for display or extraction.

Rule 1

All the ancestors of a sibling belonging to version v must also belong to version v, because otherwise ancestors would not be visible if that version was selected.

Rule 2

In any cluster the first alternative at the top level, except if it is added, inherits the 'base' version. If the first alternative has child elements that are not siblings then those elements inherit the base version likewise. If they are sibling elements then Rule 2 applies to them recursively.

Rule 3

In any cluster the final uncancelled or uncorrected alternative at the top level inherits all uncancelled versions in the document that do not already occur in the cluster itself. Like Rule 2 this applies recursively to child elements.

Rule 3 does not apply if a set of alternatives does not have any uncancelled alternative at the top level or if all its alternatives use the wit attribute to manually specify versions.

Rule 4

The first sibling in each sibling-sequence inherits the versions of its parents within its own cluster.