Wednesday, 16 January 2013

Comparison and alignment in literary texts

Humanists have always wanted to compare texts, but the computational techniques they still use are based on a method devised in the 1960s for very early computers with severely limited RAM. They call it 'collation', because it reminds them of the manual process of comparing texts in the print era. Automatic collation examines the text word by word within a window of at most a page. The first word or line encountered closest to the expected position is taken as an alignment point, and the window is moved on by that amount. In this way it proceeds like a man trying to piece together a giant jigsaw puzzle in a field at night by torchlight. It cannot see alignments outside its window, and mistakes, once made, cascade into errors that must be corrected by hand.
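
To make the limitation concrete, here is a minimal sketch of that windowed matching. It is not the code of any real collation program; the function name and the window size are illustrative only.

    /* Illustrative sketch of window-limited collation: a word from
     * version A is sought in version B only within WINDOW words of
     * the expected position. A genuine match beyond the window is
     * invisible and gets recorded as a spurious variant. */
    #include <string.h>

    #define WINDOW 250   /* roughly a page of words */

    /* Return the index in b (nb words) of the word matching a_word,
     * searching outward from the expected position; -1 if the match
     * lies outside the window. */
    static int find_in_window(const char *a_word, char **b, int nb,
                              int expected)
    {
        for (int d = 0; d <= WINDOW; d++) {
            if (expected + d < nb && strcmp(a_word, b[expected + d]) == 0)
                return expected + d;
            if (d > 0 && expected - d >= 0
                      && strcmp(a_word, b[expected - d]) == 0)
                return expected - d;
        }
        return -1;   /* outside the torch beam */
    }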

Suffix trees

But computer science has moved on a long way since 1962, when the method was first devised. Aligning the entire text of two versions became possible with the invention of practical methods for constructing suffix trees in the 1990s. The principle of the Maximal Unique Match (MUM), as used in bioinformatics, is something digital humanists should be more aware of. Instead of proceeding word by word, the texts are aligned greedily on the longest match shared by the two versions that occurs exactly once in each. This virtually eliminates misalignments during collation. No 'window' is needed, because modern computers have enough memory to take in the whole text at once. A suffix tree can tell us whether any given substring occurs in a text in time proportional to the length of that substring. This represents such a big speedup over conventional techniques that I wish more people would sit up and pay attention.
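
The greedy principle fits in a few lines. The sketch below is mine, not nmerge's code: for brevity it finds the longest shared match by brute force and omits the uniqueness test of a true MUM, both of which a suffix tree would supply in linear time. The point is the greedy step: fix the longest match, then recurse on the text either side of it.

    /* Greedy alignment on the longest shared match: an illustrative
     * sketch only. A suffix tree would find the match in time
     * proportional to its length and check uniqueness; brute force
     * keeps the sketch short. */
    #include <stdio.h>
    #include <string.h>

    /* Find the longest substring shared by a[0..la) and b[0..lb);
     * return its length, store its start offsets in *pa and *pb. */
    static int longest_match(const char *a, int la, const char *b, int lb,
                             int *pa, int *pb)
    {
        int best = 0;
        for (int i = 0; i < la; i++)
            for (int j = 0; j < lb; j++) {
                int k = 0;
                while (i + k < la && j + k < lb && a[i + k] == b[j + k])
                    k++;
                if (k > best) { best = k; *pa = i; *pb = j; }
            }
        return best;
    }

    /* Fix the longest match, then align what lies to its left and to
     * its right independently. */
    static void align(const char *a, int la, const char *b, int lb)
    {
        int pa = 0, pb = 0;
        int len = longest_match(a, la, b, lb, &pa, &pb);
        if (len < 4)              /* arbitrary minimum for the sketch */
            return;
        align(a, pa, b, pb);
        printf("aligned: \"%.*s\"\n", len, a + pa);
        align(a + pa + len, la - pa - len, b + pb + len, lb - pb - len);
    }

    int main(void)
    {
        const char *v1 = "It was the best of times, it was the worst of times";
        const char *v2 = "It was the worst of times, it was the best of times";
        align(v1, (int)strlen(v1), v2, (int)strlen(v2));
        return 0;
    }

Run on the two versions above, it fixes "st of times, it was the " first and then aligns the remainders on either side, with no window anywhere in sight.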

Revising nmerge

When I wrote the nmerge program in 2009 it was very nearly a first. MEDITE had already used suffix trees to align literary texts, but it was limited to two versions at a time, and it had no format in which to store the differences it found. nmerge could handle an arbitrary number of versions and store them in a Multi-Version-Document (MVD) format so they could be efficiently read back later. But I never claimed that nmerge was anything but a 'first cut at a difficult problem'. Since 2009 we have learned a lot about what it should do and how to improve it. And now, after several false starts, a realistic rewrite is in progress.

Here is what the rewrite plans to do:

  1. Simplify the alignment process by splitting the existing functionality of nmerge into a series of plugins. The only tasks the core nmerge program will perform are to load and save MVD files, and, of course, to manage the plugins (a sketch of one possible plugin interface follows this list).
  2. Rewrite the entire program from scratch in C. This should overcome the memory problems of the Java version by allocating memory dynamically, only as needed, rather than wastefully as at present. C also gives the program longevity and portability as well as speed. Language wrappers will be provided so it can be called natively from PHP (as an extension) and from Java (via JNI; a minimal sketch follows the list).
  3. Use multi-threading to improve performance. Individual sub-alignments and the building of suffix trees can proceed simultaneously (see the threading sketch after the list).
  4. Compute transpositions using a technique that exploits the adjacency of short transposed sections. In this way even transpositions containing minor revisions can be detected, which should improve alignment quality.
  5. Align by shuffling the fragments of the MVD, rather than by pasting differences into an explicit variant graph. This should greatly simplify the program (the fragment list is sketched after this list).
  6. Change the MVD file format so that versions and groups are merged into version-IDs. This should make version specification simpler: a hierarchical naming system based on paths like /folios/F1 or /C376/add0 replaces the current tables of separate groups and versions.
  7. Change the default text encoding from UTF-8 to UTF-16. This should ease comparison of texts in Chinese, Bengali and similar languages, whose characters almost all occupy several bytes in UTF-8 and so get split across byte boundaries during byte-level alignment.
  8. Provide a test-suite that verifies every aspect of the program as it is being written and protects it from regressions when changes are made later.
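
On point 1, the plugin API is still being designed, so the following is purely an illustration of the shape such an interface might take in C; every name in it is hypothetical, not nmerge's actual API. Each plugin shared library would export one descriptor that the core looks up, for instance with dlopen() and dlsym().

    /* Hypothetical plugin descriptor: all names are illustrative,
     * not nmerge's actual API. The core loads an MVD, hands it to
     * the plugin, and saves the result. */
    typedef struct mvd MVD;   /* opaque: only the core loads and saves MVDs */

    typedef struct {
        const char *name;          /* e.g. "add", "compare", "list" */
        const char *description;   /* one-line help text */
        /* run the plugin on an MVD with user options; 0 on success */
        int (*process)(MVD *mvd, const char *options, char **output);
    } mvd_plugin;

    /* the one well-known symbol each plugin library would export */
    const mvd_plugin *mvd_plugin_descriptor(void);

Keeping the core down to loading, saving and dispatching means that adding a new operation is just one more shared library.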
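
On point 2, the JNI side is routine C. The Java class and method below are invented for illustration; only the JNI mechanics are standard.

    /* Hypothetical JNI wrapper: the class org.mvd.NMerge and its
     * method are invented; the JNI mechanics are standard. Compiled
     * into the shared library, this lets Java call the C core
     * directly. */
    #include <jni.h>

    JNIEXPORT jstring JNICALL
    Java_org_mvd_NMerge_version(JNIEnv *env, jclass cls)
    {
        (void)cls;   /* unused */
        return (*env)->NewStringUTF(env, "nmerge rewrite (sketch)");
    }

The PHP extension would wrap the same C entry points in the same way.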
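
On point 3, POSIX threads make the concurrency straightforward. In this sketch build_suffix_tree() is a stand-in for the real construction work, and the version names reuse the path style of point 6.

    /* Sketch of building per-version suffix trees on worker threads.
     * build_suffix_tree() is a stand-in: Ukkonen construction and
     * sub-alignment would happen inside it. */
    #include <pthread.h>
    #include <stdio.h>

    static void *build_suffix_tree(void *arg)
    {
        const char *version = arg;
        /* ... construct the suffix tree for this version's text ... */
        printf("built suffix tree for version %s\n", version);
        return NULL;
    }

    int main(void)
    {
        const char *versions[] = { "/folios/F1", "/folios/F2", "/folios/F3" };
        pthread_t workers[3];
        for (int i = 0; i < 3; i++)
            pthread_create(&workers[i], NULL, build_suffix_tree,
                           (void *)versions[i]);
        for (int i = 0; i < 3; i++)   /* wait for all the trees */
            pthread_join(workers[i], NULL);
        return 0;
    }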
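
On point 5, an MVD is at bottom a list of pairs, each a fragment of text plus the set of versions that share it. A minimal rendering (the field names are mine) shows why shuffling such a list is simpler than editing a graph in place:

    /* Minimal rendering of the MVD pair list; field names are mine.
     * Aligning a new version means splitting and reordering pairs in
     * this one flat list, never editing an explicit variant graph. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct pair {
        uint64_t versions;   /* bit i set => the pair is in version i+1 */
        const char *data;    /* the fragment's text */
        size_t len;          /* its length in bytes (or UTF-16 units) */
        struct pair *next;   /* pairs kept in document order */
    } pair;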

I have already made a good start refactoring the tool into a series of plugins: 16 to be precise. There is even one for adding a version to an existing MVD. Since the work proceeds plugin by plugin rather than as a single monolithic block of code, I anticipate early results in a few weeks at most. I have high hopes for this new tool, as it should lift MVDs into territory they have not reached before: general use.
