Thursday, 19 December 2013

Distributed Digital Scholarly Editions

A certain scholar in the field has been talking about distributed DSEs for years. I make no claim as to the originality of the basic idea, but the implementation is a custom extension to AustESE. The basic problem is this: when you create a DSE you rarely have ALL the data in one place. If I give you a DSE of Charles Harpur, do I include all the images of all of his manuscripts? That would be several gigabytes, and you probably wouldn't like that. But suppose I gave you 3 megabytes instead, containing only the text and commentary. The images could then be represented by links, or pointers, to the real data. This is already the case with manuscripts in Europe, because so many of them are stored on Europeana.

The specific situation I am thinking of is a biography of an author that is stored on a different server. Making a copy of it is simply out of the question, because it is subject to constant editing and we want only one copy to exist. But what if, on MY server, I stored a link that pointed to it? I could then treat that link just the same as the real resources that reside there. In other words I could have a virtual file system that covered both real files, such as "english/harpur/The nevers of poetry", which would contain an MVD and the text of the poem, and a file called "english/harpur/biography", which would be a LINK to "http://austese.net/sites/all/modules/austese_repository/api/events/", with a parameter of project=21 and the username and password needed to fetch it. If I put all that information into the LINK then I could treat the LINK as if IT were the biography and forget about the details of how to fetch it.
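
As a sketch only (not the actual AustESE implementation), a virtual file entry might look something like this in Java, where an entry is either a real, locally stored resource or a LINK carrying everything needed to fetch the remote one:

// Sketch of a virtual file entry: either a real resource held locally,
// or a LINK that records everything needed to fetch the remote resource
// on demand. Not the actual AustESE data structure.
public class VirtualFile {
    public enum Kind { REAL, LINK }

    private final String path;      // e.g. "english/harpur/biography"
    private final Kind kind;
    private final byte[] content;   // only for REAL files
    private final String url;       // only for LINK files
    private final String query;     // e.g. "project=21"
    private final String user;      // credentials needed to fetch the link
    private final String password;

    private VirtualFile(String path, Kind kind, byte[] content, String url,
                        String query, String user, String password) {
        this.path = path;
        this.kind = kind;
        this.content = content;
        this.url = url;
        this.query = query;
        this.user = user;
        this.password = password;
    }

    /** A real file held in the local repository */
    public static VirtualFile real(String path, byte[] content) {
        return new VirtualFile(path, Kind.REAL, content, null, null, null, null);
    }

    /** A link that stands in for a remote resource */
    public static VirtualFile link(String path, String url, String query,
                                   String user, String password) {
        return new VirtualFile(path, Kind.LINK, null, url, query, user, password);
    }
}

The biography would then be stored as something like VirtualFile.link("english/harpur/biography", "http://austese.net/sites/all/modules/austese_repository/api/events/", "project=21", user, pass), and everything downstream could treat it like any other file in the repository.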

I realise this sounds a lot like handles, intermediary URLs like doi:, or like a file system on disk with soft and hard links. The difference is that it is implemented as a virtual file system that forms part of a digital scholarly edition. One advantage is that I can annotate those virtual files even if their address changes, and I can keep all the files that make up the 'distributed' edition, real and virtual, in one place. The real advantage, though, is that local and remote files can be treated as the same thing for the purposes of editing, updating, annotating, reading, and so on.

Monday, 30 September 2013

Java stack size and the Nmerge memory limitation

My Nmerge program runs within a JVM whose maximum memory is set to 1GB. That seemed to be plenty, but it still ran out of something whenever I gave it more than about 50K of text to digest. I always thought this was because Java was such a memory-hog. And it is. But that wasn't the reason it had been failing all those years. The simple answer is that increasing heap space in Java via the -Xmx option doesn't increase the stack space in proportion. Stack space is controlled by the -Xss option, and the default stack size is just 512K. That's plenty for most applications, but nmerge is no ordinary program. At its heart are two mutually recursive procedures. I proved mathematically that they would terminate and produce a valid result, but what I couldn't estimate was the maximum stack size they would require. It turns out that the stack space needed is proportional to the length of the file, or, more precisely, to the number of 'nodes' in the variant graph. A short file with a complex graph structure (and lots of variation) might need more stack space than a longer file with relatively little variation. Whatever the reason, increasing the stack size to 4MB allowed the program to deal with much larger files. And I had a Homer Simpson D'oh! moment.
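
For the record, the fix amounts to raising the stack limit, either with -Xss4m on the command line or, per thread, via the Thread constructor that takes a stack size. Here is a minimal sketch of the second approach; the deeply recursive task is just a stand-in for nmerge's mutually recursive procedures.

// Sketch: give one worker thread a 4MB stack (the per-thread equivalent of
// -Xss4m) so that a deeply recursive task does not overflow the default
// 512K stack. The recursive method is only a stand-in for the real work.
public class BigStackLauncher {

    // stand-in for a deeply recursive computation
    static int depth(int n) {
        return n == 0 ? 0 : 1 + depth(n - 1);
    }

    public static void main(String[] args) throws InterruptedException {
        long stackSize = 4L * 1024 * 1024;   // 4MB; treated as a hint by the JVM
        Thread worker = new Thread(null, new Runnable() {
            public void run() {
                System.out.println("recursed to depth " + depth(50000));
            }
        }, "nmerge-worker", stackSize);
        worker.start();
        worker.join();
    }
}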

Saturday, 7 September 2013

Splitting TEI files into coherent layers

Embedded markup prevents transcriptions from being interoperable, because the tags are chosen on the basis of human interpretation, not predictable machine instructions. Removing embedded markup from text, however, is fraught with problems, particularly if the tags are used to record variant versions. Recently I have had to revisit this problem, which I thought was already solved. It wasn't. I needed to rethink it and hammer out a reliable algorithm for the mountains of highly variable data it is likely to deal with. This matters because transcriptions deal with multiple versions all the time, and without a reliable way to separate them we will never be able to build interoperable tools in the digital humanities.

Taking the markup out of a text that contains alternatives naïvely produces garbage. A simple example: <del>dog</del><add>cat</add> would produce 'dogcat'. It is much better to store markup and text separately, and to separate the versions out into complete, readable texts. If desired, differences between versions can then be computed and merged, so that a compact representation of the document can be created to act as a focus for queries. For example: 'What are the changes carried out by the author as the first layer of correction, or the second?' 'What are the differences between the first and second complete drafts, or between MS A and MS B?' 'Mark up the text of the Bloggs edition using this markup set and that markup set combined and send it to a book format.' All these operations are at best inefficient with embedded versions, and at worst impossible. But it is not actually necessary to merge the variant texts. Just storing them as separate versions, although less efficient than the merged representation, still allows simple applications to access coherent layers of a multi-version document.
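
As a minimal illustration of the difference, here is a sketch that splits the <del>/<add> example into two complete layers instead of stripping the tags. It handles only this simplest case; the cluster rules described below deal with the general one.

// Minimal sketch: split a TEI-like fragment containing <del>/<add> into two
// plain-text layers, instead of naively stripping tags (which yields "dogcat").
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class SplitLayers {
    // gather the text of one layer: the base layer keeps <del> and skips <add>,
    // the corrected layer does the opposite
    static void layer(Node node, boolean corrected, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue());
        } else if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            if (name.equals("del") && corrected) return;
            if (name.equals("add") && !corrected) return;
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling())
                layer(c, corrected, out);
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<p>the <del>dog</del><add>cat</add> sat on the mat</p>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        StringBuilder base = new StringBuilder(), corr = new StringBuilder();
        layer(doc.getDocumentElement(), false, base);
        layer(doc.getDocumentElement(), true, corr);
        System.out.println("base:      " + base);   // the dog sat on the mat
        System.out.println("corrected: " + corr);   // the cat sat on the mat
    }
}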

Taking versions out of the text

One of the main functions of the Versioning Machine is to do exactly this. But it's hard to do well. What I am going to describe is a fourth attempt to do it properly, using only clearly defined properties of a TEI-encoded text. So here's my main concept of version 4 of the splitting algorithm:

Clusters and siblings

Clusters

A cluster is a fragment of a DOM tree containing tags that describe embedded versions of a text. In the TEI schema the tags <mod>, <subst>, <choice>, <sic> and <corr>, <add> and <del>, <abbr> and <expan>, <orig> and <reg>, <app>, <lem> and <rdg> describe such textual phenomena. Of these <mod>, <subst>, <choice> and <app> can be more or less ignored, because all they do is bracket the other tags. <subst> is used, for example, to group <add>s and <del>s. I'm going to call the other tags siblings, even though the term may also describe ordinary siblings of unrelated elements in a DOM tree. In my case a sibling means an element that is one of <sic> and <corr>, <add> and <del>, <abbr> and <expan>, <orig> and <reg>, <lem> or <rdg>. The set is not fixed, however, and further elements that behave in the same way could be defined, or these could be renamed. A sibling can either be adjacent to another sibling or be its descendant or ancestor, because that's all you can do in a tree. (I'm discounting interlinking elements via attributes because this is non-computable in the general case when used to record versions.)
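
As a sketch, the sibling and wrapper sets described above might look like this in Java. The class name is mine; the tag lists are exactly those given in the text.

// Sketch of the sibling test: an element is a "sibling" if its tag is one of
// the alternative-bearing TEI elements listed above. The set is deliberately
// easy to extend or rename.
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.w3c.dom.Node;

public class Siblings {
    static final Set<String> SIBLINGS = new HashSet<String>(Arrays.asList(
        "sic", "corr", "add", "del", "abbr", "expan", "orig", "reg", "lem", "rdg"));

    // elements that merely bracket siblings and can be more or less ignored
    static final Set<String> WRAPPERS = new HashSet<String>(Arrays.asList(
        "mod", "subst", "choice", "app"));

    static boolean isSibling(Node n) {
        return n.getNodeType() == Node.ELEMENT_NODE && SIBLINGS.contains(n.getNodeName());
    }

    static boolean isWrapper(Node n) {
        return n.getNodeType() == Node.ELEMENT_NODE && WRAPPERS.contains(n.getNodeName());
    }
}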

Identifying a cluster in a tree is easy. Once a sibling is encountered, the cluster is built up depth-first (that is, children first, then siblings at the same level). It doesn't matter if a direct child is not a member of the sibling set. That doesn't break the cluster, because even if no children are themselves siblings, they are still affected by the choice instigated by their parents: they are still conditional text. What breaks a cluster, and sets it off from the rest of the text, is the occurrence of a non-sibling element next to a sibling at the top level of the cluster, or the end of the list of siblings itself, as when, for example, a <rdg> is the last in an <app> series. A cluster may also be a single element, such as a lone <del>. It's still a cluster of size one.
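
Building on the Siblings sketch above, collecting a cluster might look roughly like this; the whitespace handling and class names are my own, not part of the algorithm as described.

// Sketch of cluster detection: starting from the first sibling in a run,
// collect consecutive sibling (or wrapper) elements at the same level.
// Everything below them belongs to the cluster implicitly; a non-sibling
// element at the top level, or the end of the run, closes the cluster.
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Node;

public class ClusterFinder {
    /** Collect the top-level members of the cluster that starts at 'first'. */
    static List<Node> cluster(Node first) {
        List<Node> members = new ArrayList<Node>();
        Node n = first;
        while (n != null) {
            if (Siblings.isSibling(n) || Siblings.isWrapper(n)) {
                members.add(n);          // descendants come along implicitly
                n = n.getNextSibling();
            } else if (n.getNodeType() == Node.TEXT_NODE
                && n.getNodeValue().trim().isEmpty()) {
                n = n.getNextSibling();  // ignore whitespace between siblings
            } else {
                break;                   // a non-sibling element ends the cluster
            }
        }
        return members;
    }
}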

Siblings are also numbered using the depth-first ordering. The first <rdg> encountered can be called 'rdg0', the second 'rdg1', and so on. But unconnected siblings at the same level (such as two independent deletions in the same line) are given the same version-tag. All non-sibling elements outside of any cluster automatically receive the 'base' version, to indicate that they implicitly belong to all versions.

Four rules are needed to separate versions for display or extraction.

Rule 1

All the ancestors of a sibling belonging to version v must also belong to version v, because otherwise ancestors would not be visible if that version was selected.

Rule 2

In any cluster the first alternative at the top level, unless it is an addition, inherits the 'base' version. If the first alternative has child elements that are not siblings then those elements likewise inherit the base version. If they are sibling elements then Rule 2 applies to them recursively.

Rule 3

In any cluster the final uncancelled or uncorrected alternative at the top level inherits all uncancelled versions in the document that do not already occur in the cluster itself. Like Rule 2 this applies recursively to child elements.

Rule 3 does not apply if a set of alternatives does not have any uncancelled alternative at the top level or if all its alternatives use the wit attribute to manually specify versions.

Rule 4

The first sibling in each sibling-sequence inherits the versions of its parents within its own cluster.
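
As a concrete illustration, here is a minimal sketch of Rule 1 only, the simplest of the four, assuming the versions assigned so far are held in a map from DOM node to a set of version names. The other rules would be layered on top of the same structure.

// Sketch of Rule 1: whatever versions a sibling carries are propagated up to
// all of its ancestors, so the enclosing elements remain visible when that
// version is selected. The map from node to version names is an assumption
// of this sketch, not a prescribed data structure.
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.w3c.dom.Node;

public class Rule1 {
    static void propagateToAncestors(Node sibling, Map<Node, Set<String>> versions) {
        Set<String> own = versions.get(sibling);
        if (own == null)
            return;
        for (Node p = sibling.getParentNode(); p != null
                && p.getNodeType() == Node.ELEMENT_NODE; p = p.getParentNode()) {
            Set<String> parentVersions = versions.get(p);
            if (parentVersions == null) {
                parentVersions = new HashSet<String>();
                versions.put(p, parentVersions);
            }
            parentVersions.addAll(own);   // ancestor now belongs to these versions too
        }
    }
}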

Thursday, 6 June 2013

Sustainability of software in the digital humanities

Everyone knows that it is very difficult to maintain whatever you create in this field, let alone make any progress. And once progress has been made, the result breaks very quickly. We have seen this scenario played out countless times, but I'm not going to point fingers. That's just how it is. I am reminded very much of Through the Looking-Glass, where the Red Queen takes Alice by the hand and runs with her for some time as fast as she can go. But when they finally stop, Alice realises the horrible truth:

'Why, I do believe we've been under this tree the whole time! Everything's just as it was!'

'Of course it is,' said the Queen, 'what would you have it?'

'Well, in OUR country,' said Alice, still panting a little, 'you'd generally get to somewhere else—if you ran very fast for a long time, as we've been doing.'

'A slow sort of country!' said the Queen. 'Now, HERE, you see, it takes all the running YOU can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that!'

The answer to this dilemma is not to complain that you don't have enough resources, but simply to program economically. First, put everything you can into the most stable software platform you know: in my case the C language. Much of the computation you need to do can be expressed this way: programs that compute phylogenetic trees, compare and format texts, load and save file formats; in fact anything that performs basic operations you can code once, debug and forget. I have C programs from the 1980s that still compile and run flawlessly. The modern programmer seems to have forgotten that C is the bedrock of everything they do: every scripting language they use and every operating system (OK, except Windows, which I believe is C++) is written in it. It cannot go away, or the entire world would cease to be. And it doesn't change. If you don't have any dependencies, if your code just computes something, then you can write, debug and forget. That saves a lot of maintenance work.

Secondly, leverage existing open source code. Write your extra stuff to use existing CMSes and other platforms, with as little glue code as you can manage. Let the wider community maintain that platform for you. Don't expect that your service will last. It won't. But you can call the C-code from the service; in fact you can call C-code from any language and compile it for any platform, so that the service need only be a wrapper around core functionality. Be ready for change: expect it. Then when you have to rewrite your code it will just be a matter of rewrapping it all in the latest style.

Thirdly, don't expect that the open source community is going to fix your code or maintain it for you, because they won't. You have to be prepared to do it all yourself, at least until you have a great product that everyone is using, like Linux. So far no one in the digital humanities has got that far, and maybe no one ever will. You have to believe in what you do, and not depend on grant money to keep working.

Monday, 6 May 2013

Exporting a digital scholarly edition

One of the most often requested features of a digital scholarly edition service is the ability to export the data produced within it. This is a more complex problem than importing, which typically happens in stages: one adds some source texts, then marks them up, then adds annotations, images and formats to change the edition's appearance and function over time. Getting all that data back out again, into one archive that can be transported to a new site or transferred to a different program, can be a bit messy. Adopting a radically new way to store texts for editing, viewing and comparing also makes it harder to just "export" the stuff.

So to overcome this I have added a PDEF (portable digital edition format) export function to Calliope. This queries the server for all the relevant files, markup views, annotations, images and formats, and downloads them all in a single .zip or .tar.gz archive. At the moment this is just a service in Calliope: the user invokes it, e.g. with curl:

curl -X GET "http://localhost:8080/pdef/?DOC_ID=english%2Fshakespeare%2Fkinglear.*&FORMAT=TEXT&FORMAT=MVD" -o archive.tar.gz

This downloads a single archive containing the texts, markup, annotations, images and formats for the matched documents.

This archive can then be uploaded using the mmpupload tool (or will be, when I have modified it) to a new repository, effectively installing the DSE in a new location. All the user has to provide is a wildcard document identifier, in this case "english/shakespeare/kinglear.*", and the service will return all the files with that docid prefix. (Because this is a URL, the "/"s have to be escaped as "%2F".) Upload supports two other formats: XML and MIXED. MIXED allows text and XML files to be freely intermingled.

Sunday, 28 April 2013

Calliope 0.2.1: Moving to Mongo and Tomcat

This project is about Digital Scholarly Editions, or DSEs: Web-based publications of cultural heritage texts. My design for DSEs calls for a REST Web service, originally called hritserver but now renamed Calliope, which provides the components for building a DSE within a Web authoring environment like a CMS. The infrastructure is layered as follows.

At the bottom there is a database, including an images handler. Currently Calliope uses CouchDB and stores the images in the file system for speed. But I have now made it possible to define a single Java class that extends Calliope to handle other databases. MongoDB, a rival NoSQL database to CouchDB, is supposed to be faster. Unlike Couch it handles images well, through the GridFS module, which makes replication of the entire database easier. In Couch you have to attach images to documents, and word has it that when those images are large it becomes sluggish. One advantage of Couch, though, is its revision control, which Mongo lacks: in Mongo, your latest copy of a document is your only copy.
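
For illustration, storing and fetching a page image through GridFS with the 2.x Java driver looks roughly like this; the database name, bucket name and file names here are placeholders, not what Calliope actually uses.

// Sketch (2.x MongoDB Java driver): store a page image in GridFS and read it
// back. The names "calliope", "corpix" and the file paths are placeholders.
import java.io.File;
import com.mongodb.DB;
import com.mongodb.MongoClient;
import com.mongodb.gridfs.GridFS;
import com.mongodb.gridfs.GridFSDBFile;
import com.mongodb.gridfs.GridFSInputFile;

public class ImageStore {
    public static void main(String[] args) throws Exception {
        MongoClient mongo = new MongoClient("localhost", 27017);
        DB db = mongo.getDB("calliope");
        GridFS images = new GridFS(db, "corpix");

        // store: GridFS chunks the file for us, however large it is
        GridFSInputFile in = images.createFile(new File("kinglear-act1.png"));
        in.setFilename("english/shakespeare/kinglear/act1.png");
        in.setContentType("image/png");
        in.save();

        // fetch it back by name and write it to a local file
        GridFSDBFile out = images.findOne("english/shakespeare/kinglear/act1.png");
        out.writeTo("copy-of-act1.png");
        mongo.close();
    }
}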

The architecture also has a number of canonical views, which are only suggestions: they can be extended, customised and redecorated. They also know nothing about which database they are running on. They communicate with the Calliope service, which gives them partial Web pages that they use to compose the views. This is thus a pure example of the familiar Model-View-Controller paradigm.

Another improvement in version 0.2.1 is the ability to run equally well in embedded Jetty or in Tomcat. The difference is that in Jetty the web application is part of a runnable application, which makes it easy to debug, set up and launch. Tomcat uses a plugin style of architecture in which Web applications are stored in WAR files (really just ZIP archives containing the application and its configuration). Tomcat acts as a Web server and hands off control of everything in the WebApp's context to the stuff unpacked from the WAR file. So one difference is that the URL for the Jetty test service is http://localhost/tests/ while for Tomcat at the moment it is http://localhost:8080/calliope/tests/ (Calliope is the name of the WAR file). What I have now is two ways to launch Calliope, via Jetty or Tomcat, and two databases to run it on, Couch and Mongo, so there are four possible configurations in total. This kind of flexibility helps others to consider adopting your software.
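
For comparison, the embedded-Jetty launch path is roughly the following sketch; the port, context path and WAR name are placeholders rather than Calliope's actual configuration. The point is that the whole server starts from a main() method, which is what makes it so convenient to debug.

// Sketch of an embedded-Jetty launcher: the service starts from main().
// Port, context path and WAR name are placeholders for this illustration.
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.webapp.WebAppContext;

public class EmbeddedLauncher {
    public static void main(String[] args) throws Exception {
        Server server = new Server(8080);

        WebAppContext webapp = new WebAppContext();
        webapp.setContextPath("/calliope");
        webapp.setWar("calliope.war");    // the same WAR that Tomcat would deploy

        server.setHandler(webapp);
        server.start();                   // now at http://localhost:8080/calliope/
        server.join();                    // block until the server is stopped
    }
}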

Monday, 21 January 2013

C versus Java speed and memory test

I have now translated the functions that load or save MVDs from disk from Java to C. They are both moderately complex and use several classes – a good real world test. Here is the preliminary speed test for the very similar functions loadMVD and saveMVD, when run on the file kinglear.mvd averaged over 10 iterations, discounting the first result, which is always slower:

Language      Time (microseconds)   Memory (bytes)
Java (load)   46127                 1104072
C (load)      1608                  364544
Java (save)   56294                 348752
C (save)      4844                  36864

This surprises me. I had been led to believe that Java was nearly as fast as C, and sometimes faster; or so the philosophy of the JIT compiler goes. And no, I didn't include JIT compilation or JVM startup in the load or save times: in both cases I took the time at the start and end of the relevant function. And no, it's not a question of experience: I have written extensively in both languages. The C version is just a translation of the Java program, but with this speed difference who wants to use Java?

The memory results were kind of expected, though. Everyone knows that Java is a memory hog.

The loadMVD function loads a binary file from disk and parses it into an in-memory form. The saveMVD function reverses the process, but is a bit more complicated because it must reconnect transposed segments.

Technique of measurement

In C I used getrusage (the ru_maxrss field) for memory and gettimeofday for timing. In Java I used System.nanoTime() and Runtime.getRuntime().freeMemory() (after calling System.gc() the first time). The Java code was compiled with debugging off; the C code with -O3 and no debugging symbols. The Java version was 1.6.0_35 on MacOSX 10.8.1, and the C compiler was gcc 4.2.1 on the same platform. Both programs were run on the command line.
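
The Java side of that measurement looks roughly like this; loadMVD is a stand-in here so the sketch compiles on its own.

// The Java side of the measurement described above: force a GC, note free
// heap and the nanosecond clock, run the function, then read both again.
public class Measure {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.gc();                              // settle the heap first
        long memBefore = rt.freeMemory();
        long timeBefore = System.nanoTime();

        Object mvd = loadMVD("kinglear.mvd");     // the function under test

        long timeAfter = System.nanoTime();
        long memAfter = rt.freeMemory();
        System.out.println("time (microseconds): " + (timeAfter - timeBefore) / 1000);
        System.out.println("memory (bytes):      " + (memBefore - memAfter));
    }

    // stand-in so the sketch compiles; the real loadMVD parses the file
    static Object loadMVD(String path) {
        return new byte[1 << 20];
    }
}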

Wednesday, 16 January 2013

Comparison and alignment in literary texts

Humanists have always wanted to compare texts, but the computational techniques they are still using are based on a method devised in the 1960s for very early computers with severely limited RAM. They call it 'collation', because it reminds them of the manual process of comparing texts in the print era. Automatic collation examines the text word by word within a window of at most a page. The first word or line encountered closest to the expected position is taken as an alignment position, and the window is moved on by that amount. In this way it proceeds like a man trying to piece together a giant jigsaw puzzle with a torch in a field at night. It cannot see alignments outside its window, and mistakes, once made, lead to errors that must be corrected by hand.

Suffix trees

But computer science has moved on a long way since 1962, when the method was first devised. The idea of aligning the entire text of two versions became possible with the invention of practical methods for constructing suffix trees in the 1990s. The principle of the Maximal Unique Match (MUM), as used in bioinformatics, is something digital humanists should be more aware of. Instead of proceeding word by word, the texts can be compared by greedily aligning on the longest match shared by two versions that is unique in both. This virtually eliminates misalignments during collation. A 'window' is not needed, because modern computers have enough memory to take in the whole text at once. A suffix tree can tell us whether any given substring occurs in a text in time proportional to the length of that substring. This represents such a big speedup over conventional techniques that I wish more people would sit up and pay attention.
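
To illustrate the strategy (though not the efficient data structure), here is a sketch of greedy longest-match alignment. For brevity it uses a quadratic longest-common-substring search instead of a suffix tree and skips the uniqueness test that defines a MUM, so it shows the shape of the recursion, not the real algorithm.

// Illustrative sketch of greedy alignment: find the longest match shared by
// two versions, align on it, then recurse on the text to its left and right.
// A real implementation would use a suffix tree and Maximal Unique Matches.
import java.util.ArrayList;
import java.util.List;

public class GreedyAligner {
    /** A shared block: start offsets in each version plus its length. */
    static class Match {
        final int posA, posB, len;
        Match(int posA, int posB, int len) { this.posA = posA; this.posB = posB; this.len = len; }
        public String toString() { return "A@" + posA + " = B@" + posB + " len " + len; }
    }

    /** Longest common substring of a and b by dynamic programming, or null. */
    static Match longestMatch(String a, String b) {
        int bestLen = 0, bestA = 0, bestB = 0;
        int[] prev = new int[b.length() + 1], curr = new int[b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                curr[j] = (a.charAt(i - 1) == b.charAt(j - 1)) ? prev[j - 1] + 1 : 0;
                if (curr[j] > bestLen) { bestLen = curr[j]; bestA = i; bestB = j; }
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return bestLen == 0 ? null : new Match(bestA - bestLen, bestB - bestLen, bestLen);
    }

    /** Align greedily: take the longest match, then recurse left and right. */
    static void align(String a, int offA, String b, int offB, List<Match> out) {
        Match m = longestMatch(a, b);
        if (m == null) return;                      // nothing shared: pure variation
        align(a.substring(0, m.posA), offA, b.substring(0, m.posB), offB, out);
        out.add(new Match(offA + m.posA, offB + m.posB, m.len));
        align(a.substring(m.posA + m.len), offA + m.posA + m.len,
              b.substring(m.posB + m.len), offB + m.posB + m.len, out);
    }

    public static void main(String[] args) {
        List<Match> matches = new ArrayList<Match>();
        align("the quick brown fox", 0, "the slow brown fox", 0, matches);
        for (Match m : matches) System.out.println(m);
    }
}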

Revising nmerge

When I wrote the nmerge program in 2009 it was nearly a first. Although MEDITE had used suffix trees for alignment on literary texts, it was limited to two versions at a time. Also, it had no format to store differences. nmerge could handle an arbitrary number of versions and store them in a Multi-Version-Document (MVD) format so they could be efficiently read back later. But I never claimed that nmerge was anything but a 'first cut at a difficult problem'. Since 2009 we have learned a lot about what it should do and how to improve it. And now, after several false starts, a realistic rewrite is in progress.

The plan for the rewrite is to:

  1. Simplify the alignment process by splitting up the existing functionality of nmerge into a series of plugins. The only tasks that the core nmerge program will perform will be to load and save MVD files, and to manage the plugins, of course.
  2. Rewrite the entire program from scratch in the C language. This should overcome the memory problems of the Java version by allocating memory dynamically, only as needed, rather than wastefully as at present. C also gives the program longevity and portability as well as speed. Language wrappers will be provided so that it can be called natively from PHP (as an extension) and from Java (via JNI; see the sketch after this list).
  3. Use multi-threading to improve performance. Individual sub-alignments and building of suffix trees can carry on simultaneously.
  4. Compute transpositions using a technique that exploits the adjacency of short transposed sections, so that even transpositions containing minor revisions can be detected. This should improve alignment quality.
  5. Align by shuffling the fragments of the MVD, not by pasting differences into an explicit variant graph. This should greatly improve the program's simplicity.
  6. Change the MVD file format so that versions and groups are merged into version-IDs. This should make version specification simpler by using a hierarchical naming system based on paths like /folios/F1 or /C376/add0, rather than tables of separate groups and versions.
  7. Change the default text encoding from UTF-8 to UTF-16. This will make it easier to compare texts in languages like Chinese and Bengali, whose characters UTF-8 splits across byte boundaries.
  8. Provide a test-suite to verify every aspect of the program as it is being written and to insulate it from damage if any changes are made later.
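
What the Java side of the JNI wrapper mentioned in point 2 might look like is sketched below. This is only a sketch: the library name "nmerge", the class name and the method signatures are all hypothetical, and the native methods would be implemented by the C core.

// Sketch of the Java side of a JNI wrapper for the C rewrite (point 2 above).
// Library name, class name and signatures are hypothetical.
public class NMerge {
    static {
        System.loadLibrary("nmerge");   // loads libnmerge.so / libnmerge.dylib
    }

    /** Load an MVD file and return an opaque handle to the in-memory form. */
    public static native long loadMVD(String path);

    /** Add a new version to the MVD identified by the handle. */
    public static native void addVersion(long mvd, String versionId, byte[] text);

    /** Save the MVD back to disk and release the handle. */
    public static native void saveMVD(long mvd, String path);
}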

I have already made a good start refactoring the tool into a series of plugins: 16 to be precise. There is even one for adding a version to an existing MVD. Since the work is proceeding plugin by plugin rather than as a single monolithic block of code, I anticipate early results in a few weeks at most. I have high hopes for this new tool, as it will lift MVDs into areas where they have not gone before: into general use.