Sunday, 23 February 2014

Rewriting TILT

Linking areas on an image to segments of text, so that you can highlight one or the other and see which fragment of an image produced which piece of transcription, sounds like a crazy, pedantic idea. At least that is what I thought when I first heard about it. But the fact is, if you display a facsimile image next to a transcription, the user has no easy way to work out what corresponds to what. They spend their whole time scrolling up and down, scanning the image with their eyes, going back to the text, losing their place on the image and starting again. Following the HCI notion of minimising 'excise', the user effort needed to get a task done, text-to-image links make it easy to read a manuscript facsimile. Not so crazy after all!

TILT (Text-Image Linking Tool) was a pun on TILE and, like TILE, was intended to allow semi-automatic selection of areas on an image and linking them to segments of text. The problem with TILE was that it relied too much on JavaScript. JavaScript may be the up-and-coming child of the Web but it is still relatively immature and slow. TILT instead used Java in the form of an applet, which had access to all the amazing image manipulation tools of the Java class library for free, and could do things like highlight areas as the mouse moved over them, resize regions, and recognise lines when the image was tilted (hence TILT, get it?). The main problem with this design is that the Java has to be downloaded to the browser, and browser compatibility with applets is not good. The download also puts a strain on the Internet link, and drawing, refreshing and resizing are a problem.


TILT2 comes to the rescue. I find that second goes at a design often work better because you have behind you the experience of the initial failures. TILT2 gets around the problems of TILE and TILT1 by using HTML5 to do all the drawing and JavaScript to handle the clicking and dragging events. Java is still used to do the image transforms and word detection, but it all runs on the server; only the results are sent down the wire as JSON, while the remaining interaction is handled directly in the browser. Here's how the dataflow and various components will look in TILT2 (this is just a design for now):

The images are stored in MongoDB's GridFS. In the same database are stored the plain text and the markup overlays, called CorTex and CorCode respectively. CorCode has the advantage over standard standoff markup or directly embedded markup that you can overlay any number of markup sets onto the same text and it still produces valid HTML. The HTML then gets sent to the browser, which has a simple window with two panels. The left one shows the image inside an HTML5 canvas. The right one has a transcription of the image's contents. As the user moves over regions occupied by words in the image those regions turn pink, and the corresponding region in the text on the right is also highlighted. It also works the other way round.
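
To make the overlay idea concrete, here is a minimal TypeScript sketch. It is not the real CorCode format: the StandoffRange shape, its field names and the applyOverlay function are all made up for illustration. It takes plain text plus any number of standoff ranges and wraps each range in a span. Unlike what is claimed for CorCode above, this toy version assumes the ranges are either nested or disjoint, and that range names are simple identifiers.

```typescript
// Hypothetical standoff range: points into the plain text, holds no text itself.
interface StandoffRange {
  name: string;   // used as the class of the generated span
  offset: number; // start position in the plain text
  len: number;    // number of characters covered
}

function escapeHtml(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Merge plain text and ranges into HTML. Assumes ranges nest or are disjoint.
function applyOverlay(text: string, ranges: StandoffRange[]): string {
  // Outer ranges must open first: sort by start, then longest first.
  const sorted = [...ranges].sort((a, b) => a.offset - b.offset || b.len - a.len);
  const openEnds: number[] = []; // end offsets of the spans currently open
  let html = "";
  let pos = 0;

  // Close every open span that ends at or before `limit`.
  const closeUpTo = (limit: number): void => {
    let top = openEnds[openEnds.length - 1];
    while (top !== undefined && top <= limit) {
      openEnds.pop();
      html += escapeHtml(text.slice(pos, top)) + "</span>";
      pos = top;
      top = openEnds[openEnds.length - 1];
    }
  };

  for (const r of sorted) {
    closeUpTo(r.offset);
    html += escapeHtml(text.slice(pos, r.offset)) + `<span class="${r.name}">`;
    pos = r.offset;
    openEnds.push(r.offset + r.len);
  }
  closeUpTo(text.length);
  return html + escapeHtml(text.slice(pos));
}

// Two separate markup sets overlaid on the same text.
const structure: StandoffRange[] = [{ name: "line", offset: 0, len: 13 }];
const links: StandoffRange[] = [{ name: "word", offset: 4, len: 5 }];
console.log(applyOverlay("the quick fox", [...structure, ...links]));
// <span class="line">the <span class="word">quick</span> fox</span>
```

The point of keeping the ranges outside the text is that the same plain text can be combined with different sets of ranges on demand, without ever editing the text itself.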


This works by drawing the highlighted regions using JavaScript. Mouse movements are captured as events on the HTML canvas object, which is then redrawn using JavaScript commands. The image itself is a bare-bones custom image representation that was originally downloaded from the server, created by merging the CorCode for the image with the image itself (see top of drawing). The right-hand side is likewise composed of structural markup and spans that are activated when the mouse moves over them. The markup to achieve this is also supplied from the CorCode, which points to regions inside the plain text (CorTex).
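
The lookup behind the two-way highlighting can be sketched roughly as follows. This is only an illustration under assumptions: the WordRegion shape and the shared id field are hypothetical, not TILT2's actual data model. In the browser the coordinates would come from a mousemove listener on the canvas (adjusted using getBoundingClientRect()), and the rectangle that is found would be painted with the canvas context's fillRect().

```typescript
// A word region on the page image. The `id` field is a hypothetical key
// shared with the matching span in the transcription.
interface WordRegion {
  id: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

// Image-to-text direction: return the first region containing (px, py), if any.
function regionAt(regions: WordRegion[], px: number, py: number): WordRegion | undefined {
  for (const r of regions) {
    const insideX = px >= r.x && px < r.x + r.width;
    const insideY = py >= r.y && py < r.y + r.height;
    if (insideX && insideY) {
      return r;
    }
  }
  return undefined;
}

// Text-to-image direction: look a region up by the id of the hovered span.
function regionForSpan(regions: WordRegion[], spanId: string): WordRegion | undefined {
  for (const r of regions) {
    if (r.id === spanId) {
      return r;
    }
  }
  return undefined;
}

const sampleRegions: WordRegion[] = [
  { id: "w1", x: 10, y: 10, width: 40, height: 12 },
  { id: "w2", x: 55, y: 10, width: 60, height: 12 },
];
console.log(regionAt(sampleRegions, 60, 15)?.id);      // w2
console.log(regionAt(sampleRegions, 5, 5));            // undefined
console.log(regionForSpan(sampleRegions, "w1")?.width); // 40
```

A linear scan is plenty for the few hundred words on a manuscript page; anything cleverer would only matter for much larger images.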

When the user drags the corner of a region on the left, JavaScript is used to track its movements, and highlighting is instant. The user can also select a region on the left for recognition by single-clicking with the region selection tool. On the right-hand side a corresponding piece of text can be selected by just dragging the mouse over the text (ierange is used to get the text selection). When the user clicks the "recognise" button in the toolbar the contents of the page are sent up to the server for analysis, and word-regions on the left are matched up with words on the right. The results are then converted into CorCode/CorTex form and sent back to the browser for rendering.

TILT Automation

The really cool bit in TILT1 is that the user can quickly refine the guesstimate made by the server by selecting a region already recognised on each side. Re-recognising does the same thing but starts at two known good end-points on either side. In the most fine-grained case one word on each side could be chosen, but in most cases great swathes of text can be selected in one go. TILT1 uses a clever alignment algorithm adapted from textual diff tools to align the word-shapes on the left with words of corresponding length on the right by taking account of their order. When the user is satisfied with the alignment they can press "next" or "prev" to go on to a new page or to refine a previously done page, and the work is automatically saved.
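
To give a feel for this kind of order-preserving alignment, here is a rough TypeScript sketch. It is not TILT's actual algorithm: it swaps in a standard Needleman-Wunsch-style dynamic programming alignment, and the function name alignShapesToWords, the use of pixel widths as the only shape feature, and the gap penalty of 1 are all assumptions. Each word-shape width and each word length is expressed relative to the average on its own side, pairing costs the difference between the two, and a shape or word with no partner costs one gap.

```typescript
// Align word-shape widths (pixels, left to right) with transcribed words,
// preserving order. Returns [shapeIndex, wordIndex] pairs.
function alignShapesToWords(
  widths: number[],
  words: string[],
  gap: number = 1 // assumed penalty for leaving a shape or a word unmatched
): Array<[number, number]> {
  const n = widths.length;
  const m = words.length;
  if (n === 0 || m === 0) {
    return [];
  }
  const avgWidth = widths.reduce((a, b) => a + b, 0) / n || 1;
  const avgLen = words.reduce((a, w) => a + w.length, 0) / m || 1;
  // Cost of pairing shape i with word j: difference of their relative sizes.
  const mismatch = (i: number, j: number): number =>
    Math.abs(widths[i] / avgWidth - words[j].length / avgLen);

  // cost[i][j] = cheapest alignment of the first i shapes with the first j words.
  const cost: number[][] = [];
  for (let i = 0; i <= n; i++) {
    const row: number[] = [];
    for (let j = 0; j <= m; j++) {
      if (i === 0) {
        row.push(j * gap);
      } else if (j === 0) {
        row.push(i * gap);
      } else {
        row.push(Math.min(
          cost[i - 1][j - 1] + mismatch(i - 1, j - 1), // pair them
          cost[i - 1][j] + gap,                        // skip the shape
          row[j - 1] + gap                             // skip the word
        ));
      }
    }
    cost.push(row);
  }

  // Walk back from the end to recover which shapes were paired with which words.
  const pairs: Array<[number, number]> = [];
  let i = n;
  let j = m;
  while (i > 0 && j > 0) {
    if (cost[i][j] === cost[i - 1][j - 1] + mismatch(i - 1, j - 1)) {
      pairs.push([i - 1, j - 1]);
      i--;
      j--;
    } else if (cost[i][j] === cost[i - 1][j] + gap) {
      i--;
    } else {
      j--;
    }
  }
  return pairs.reverse();
}

// Four detected shapes (the third is a small blot) against three words.
console.log(JSON.stringify(alignShapesToWords([30, 52, 8, 31], ["the", "quick", "fox"])));
// [[0,0],[1,1],[3,2]]
```

Selecting known good end-points, as described above, would simply mean running the same alignment on the slice of shapes and words between them, which keeps each run small and limits how far an early mistake can spread.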

The problem with this is that it is still just a design. But I need it for two projects: the De Roberto I Viceré and the Charles Harpur critical archive, both of which have extensive manuscript facsimiles to complement the texts. Without automation such text-image alignment would be infeasible on this scale. The thing I like about this design is that each software component does what it is good at, and delegates the rest to the other components.

Yes, it will require HTML5, but all modern browsers support this. Without HTML5 it becomes very messy: the drawing has to be done one way on IE (using VML) and another way on other browsers. If it doesn't work for you and you need it, just update your browser or, if you can't, buy a new computer or tablet. I haven't got time to support every damn browser out there.