Digital Variants: January 2013

C versus Java speed and memory test

I have now translated the functions that load and save MVDs on disk from Java to C. Both are moderately complex and use several classes, so they make a good real-world test. Here is the preliminary speed test for the very similar functions loadMVD and saveMVD, run on the file kinglear.mvd and averaged over 10 iterations, discounting the first result, which is always slower:

Language	Time (microseconds)	Memory (bytes)
Java (load)	46127	1104072
C (load)	1608	364544
Java (save)	56294	348752
C (save)	4844	36864

This surprises me. On these figures the C version loads about 29 times faster and saves about 12 times faster. I had been led to believe that Java was nearly as fast as C, and sometimes faster. Or so the philosophy of the JIT compiler goes. And no, I didn't include JIT compilation or JVM load time: in both cases I took the time at the start and end of the relevant function. And no, it is not a matter of inexperience: I am very experienced in writing in both languages. The C version is just a translation of the Java program, but with this speed difference who wants to use Java?

The memory results were kind of expected, though. Everyone knows that Java is a memory hog.

The loadMVD function reads a binary file from disk, then parses it into an in-memory form. The saveMVD function reverses the process but is a bit more complicated because it must reconnect transposed segments.

Technique of measurement

In C I used getrusage (the ru_maxrss field) for memory and gettimeofday for timing. In Java I used System.nanoTime() and Runtime.getRuntime().freeMemory() (after calling System.gc() the first time). The Java code was compiled with debugging off, the C code with -O3 and no debugging symbols. The Java version was 1.6.0_35 on Mac OS X 10.8.1, and the C compiler was gcc 4.2.1 on the same platform. Both programs were run on the command line.
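
As a rough sketch of how the C side of such a measurement might look: this is not the code that produced the figures above, and the names now_usec, peak_rss, mean_usec and busy_work are invented here for illustration. It assumes one extra warm-up run whose time is thrown away, and it relies on ru_maxrss being reported in bytes on Mac OS X (on Linux the same field is in kilobytes).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Wall-clock time in microseconds, taken from gettimeofday. */
long long now_usec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long long)tv.tv_sec * 1000000LL + tv.tv_usec;
}

/* Peak resident set size of this process as reported by getrusage.
   Units depend on the platform: bytes on Mac OS X, kilobytes on Linux.
   Returns -1 if getrusage fails. */
long peak_rss(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_maxrss;
}

/* Call fn(arg) `runs` times and return the mean time in microseconds,
   leaving the first (warm-up) run out of the average. */
long long mean_usec(void (*fn)(void *), void *arg, int runs)
{
    long long total = 0;
    int i;
    assert(fn != NULL && runs > 1);
    for (i = 0; i < runs; i++) {
        long long start = now_usec();
        fn(arg);
        long long elapsed = now_usec() - start;
        if (i > 0)              /* discard the first result */
            total += elapsed;
    }
    return total / (runs - 1);
}

/* Hypothetical stand-in workload; a real test would call loadMVD or saveMVD. */
void busy_work(void *arg)
{
    volatile unsigned long sum = 0;
    int n = *(int *)arg;
    int i;
    for (i = 0; i < n; i++)
        sum += (unsigned long)i;
}
```

Calling mean_usec(busy_work, &n, 11) would time eleven runs and average the last ten; peak_rss() can then be read once at the end, since ru_maxrss is a high-water mark for the whole process rather than a per-call figure.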

Comparison and alignment in literary texts

Humanists have always wanted to compare texts, but the computational techniques they are still using are based on a method devised in the 1960s for very early computers with severely limited RAM. They call it 'collation', because it reminds them of the manual process of comparing texts in the print era. Automatic collation examines the text word by word within a window of at most a page. The first word or line encountered closest to the expected position is taken as an alignment position and the window is moved on by that amount. In this way it proceeds like a man trying to piece together a giant jigsaw puzzle using a torch in a field at night. It cannot see alignments outside of its window, and mistakes, once made, lead to errors that must be manually corrected.

Suffix trees

But computer science has moved on a long way since 1962, when the method was first devised. The idea of aligning the entire text of two versions became possible with the invention of practical methods for constructing suffix trees in the 1990s. The principle of the Maximal Unique Match, as used in bioinformatics, is something that digital humanists should be more aware of. Texts can be compared efficiently by aligning them greedily on the longest unique match shared by two versions, rather than word by word. This virtually eliminates misalignments or mistakes during collation. A 'window' is not needed because modern computers have enough memory to take in the whole text at a time. The suffix tree can quickly tell us whether any given substring is found in a text, in a time proportional to the length of that substring. This represents such a big speedup over conventional techniques that I wish more people would sit up and pay attention.
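
To make the idea of a unique match concrete, here is a deliberately naive sketch in C. It does not build a suffix tree and is far too slow for real texts; it simply searches by brute force for the longest substring that occurs exactly once in each of two versions, which is the first match a greedy aligner of this kind would pick. The function names count_occurrences and longest_unique_match are invented for this example and are not taken from nmerge.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count (possibly overlapping) occurrences of pat[0..len) in text[0..n). */
size_t count_occurrences(const char *text, size_t n,
                         const char *pat, size_t len)
{
    size_t i, count = 0;
    if (len == 0 || len > n)
        return 0;
    for (i = 0; i + len <= n; i++)
        if (memcmp(text + i, pat, len) == 0)
            count++;
    return count;
}

/* Brute-force search for the longest substring that occurs exactly once
   in a and exactly once in b. Returns its length and stores its offsets
   in *pos_a and *pos_b, or returns 0 if there is no such substring.
   A suffix tree answers the same question far more quickly. */
size_t longest_unique_match(const char *a, const char *b,
                            size_t *pos_a, size_t *pos_b)
{
    size_t n, m, len, i, j;
    assert(a != NULL && b != NULL && pos_a != NULL && pos_b != NULL);
    n = strlen(a);
    m = strlen(b);
    /* Try candidate lengths from longest to shortest. */
    for (len = (n < m ? n : m); len > 0; len--) {
        for (i = 0; i + len <= n; i++) {
            if (count_occurrences(a, n, a + i, len) != 1)
                continue;       /* not unique in a */
            if (count_occurrences(b, m, a + i, len) != 1)
                continue;       /* absent from b, or not unique there */
            /* Locate the single occurrence in b. */
            for (j = 0; j + len <= m; j++)
                if (memcmp(b + j, a + i, len) == 0)
                    break;
            *pos_a = i;
            *pos_b = j;
            return len;
        }
    }
    return 0;
}
```

Given "the quick brown fox" and "a quick brown dog", this finds the 13-character match " quick brown " at offset 3 in the first string and offset 1 in the second. A full aligner would then recurse on the text to the left and right of that match.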

Revising nmerge

When I wrote the nmerge program in 2009 it was nearly a first. Although MEDITE had used suffix trees for alignment on literary texts, it was limited to two versions at a time. Also, it had no format to store differences. nmerge could handle an arbitrary number of versions and store them in a Multi-Version-Document (MVD) format so they could be efficiently read back later. But I never claimed that nmerge was anything but a 'first cut at a difficult problem'. Since 2009 we have learned a lot about what it should do and how to improve it. And now, after several false starts, a realistic rewrite is in progress.

The plan is as follows:

Simplify the alignment process by splitting up the existing functionality of nmerge into a series of plugins. The only tasks that the core nmerge program will perform will be to load and save MVD files, and to manage the plugins, of course.
Rewrite the entire program from scratch in the C language. This should overcome the memory problems with Java by allocating memory dynamically, only as needed, instead of wastefully as at present. C also gives the program great longevity and portability as well as speed. Language wrappers will be provided so it can be called natively from PHP (as an extension) and Java (via JNI).
Use multi-threading to improve performance. Individual sub-alignments and building of suffix trees can carry on simultaneously.
Compute transpositions using a technique that exploits the adjacency of short transposed sections. In this way even transpositions containing minor revisions can be detected. This should improve alignment quality.
Align by shuffling the fragments of the MVD, not by pasting differences into an explicit variant graph. This should make the program much simpler.
Change the MVD file format so that versions and groups are merged into version-IDs. This should make version specification simpler by using a hierarchical naming system based on paths like /folios/F1, or /C376/add0, rather than on tables of separate groups and versions.
Change the default text encoding from UTF-8 to UTF-16. This will allow easy comparison between texts in Chinese and in other languages like Bengali, where UTF-8 splits almost all characters across byte boundaries.
Provide a test-suite to verify every aspect of the program as it is being written and to insulate it from damage if any changes are made later.
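
The point about UTF-8 and UTF-16 in the list above can be checked with a little arithmetic. The sketch below is only an illustration, with function names (utf8_length, utf16_units, utf8_string_bytes) invented here rather than taken from nmerge. It computes how many bytes UTF-8 and how many 16-bit units UTF-16 need for a given Unicode code point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode code point cp in UTF-8,
   or 0 if cp is a surrogate or lies beyond U+10FFFF. */
int utf8_length(uint32_t cp)
{
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp < 0x10000) return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

/* 16-bit code units needed to encode cp in UTF-16,
   or 0 if cp is a surrogate or lies beyond U+10FFFF. */
int utf16_units(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp < 0x10000) return 1;
    if (cp <= 0x10FFFF) return 2;
    return 0;
}

/* Total UTF-8 bytes needed for an array of n code points. */
size_t utf8_string_bytes(const uint32_t *cps, size_t n)
{
    size_t i, total = 0;
    assert(cps != NULL || n == 0);
    for (i = 0; i < n; i++)
        total += (size_t)utf8_length(cps[i]);
    return total;
}
```

For the Chinese character U+4E2D and the Bengali letter U+0995, utf8_length gives 3 while utf16_units gives 1, so a comparison that works on 16-bit units sees each such character as a single symbol. Characters outside the Basic Multilingual Plane still take two units in UTF-16, so the problem is reduced rather than removed entirely.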

I have already made a good start refactoring the tool into a series of plugins: 16 to be precise. There is even one for adding a version to an existing MVD. Since it is proceeding by plugins rather than as a single monolithic block of code, I anticipate early results in a few weeks at most. I have high hopes for this new tool, as it will lift MVDs into areas where they have not gone before: into general use.

Monday 21 January 2013

C versus Java speed and memory test

Wednesday 16 January 2013

Comparison and alignment in literary texts
