Thursday, 29 December 2011

Adding a directory to java.library.path

In addition to what I said before, there is actually a cool way to add a directory to java.library.path. If you launch your application via a script, you can execute a commandline Java tool to output the current library path. For example, this simple Java program prints that path to the console without a trailing newline, so that the script below can append "/usr/local/lib" to it:

public class LibPath
{
    public static void main( String[] args )
    {
        // print the current library path with no trailing newline
        System.out.print( System.getProperty("java.library.path") );
    }
}
To use it just compile it, and then invoke it in the script:

LIBPATH=`java LibPath`:/usr/local/lib
java -Djava.library.path=$LIBPATH ....

And as if by magic the java library path acquires a new directory before your program is run.

Wednesday, 28 December 2011

Posting multipart form data

For testing I needed to simulate a web-browser doing a file-upload. I tried to Google this but I couldn't find a suitable answer that worked. So I rolled my own. This might save other people some trouble too. First comes the MIMEMultipart object, which stores the body of a multipart post:


import java.io.File;
import java.io.FileInputStream;

public class MIMEMultipart
{
    StringBuilder text;
    static String CRLF = "\r\n";
    String boundary;
    public MIMEMultipart()
    {
        text = new StringBuilder();
        // any string unlikely to occur in the content will do
        boundary = Long.toHexString( System.currentTimeMillis() );
    }
    public String getContent()
    {
        return text.toString();
    }
    public String getBoundary()
    {
        return boundary;
    }
    public int getLength()
    {
        return text.length();
    }
    public void putStandardParam( String name, 
        String value, String encoding )
    {
        StringBuilder sb = new StringBuilder();
        sb.append("--" + boundary).append(CRLF);
        sb.append("Content-Disposition: form-data; "
            + "name=\"" + name + "\"").append(CRLF);
        sb.append("Content-Type: text/plain; charset=" 
            + encoding ).append(CRLF);
        sb.append(CRLF);    // blank line ends the part header
        sb.append( value ).append(CRLF);
        text.append( sb.toString() );
    }
    public void putBinaryFileParam( String name, 
        String fileName, String mimeType, 
        String encoding ) throws Exception
    {
        // compose the header
        StringBuilder sb = new StringBuilder();
        sb.append( "--"+boundary );
        sb.append( CRLF );
        sb.append("Content-Disposition: form-data; "
            +"name=\"" );
        sb.append( name );
        sb.append( "\"; filename=\"");
        sb.append( fileName );
        sb.append( "\"" );
        sb.append( CRLF );
        sb.append("Content-Type: "+mimeType ); 
        sb.append( CRLF );
        sb.append("Content-Transfer-Encoding: binary");
        sb.append( CRLF ); // need two of these
        sb.append( CRLF );
        text.append( sb.toString() );
        // now for the file
        File input = new File( fileName );
        FileInputStream fis = new FileInputStream( input );
        byte[] data = new byte[(int)input.length()];
        fis.read( data );
        fis.close();
        text.append( new String(data,encoding) );
        text.append( CRLF );
    }
    public void finish()
    {
        text.append( "--" );
        text.append( boundary );
        text.append( "--" );
        text.append( CRLF );
    }
}

To call it, open a standard Java URLConnection:

private static void printResponse( URLConnection conn )
{
    try
    {
        InputStream is = conn.getInputStream();
        // naive read: assumes the response is already buffered
        while ( is.available() != 0 )
        {
            byte[] data = new byte[is.available()];
            is.read( data );
            System.out.println( new String(data,"UTF-8") );
        }
    }
    catch ( Exception e )
    {
        e.printStackTrace( System.out );
    }
}

URL url2 = new URL("http://localhost:8080/strip");
URLConnection conn = url2.openConnection();
conn.setDoOutput( true );
MIMEMultipart mmp = new MIMEMultipart();
mmp.putStandardParam( Params.FORMAT, Formats.STIL,
    "UTF-8" );
mmp.putStandardParam( Params.STYLE, "TEI/drama",
    "UTF-8" );
mmp.putBinaryFileParam( Params.RECIPE, "recipe.xml",
    "application/xml", "UTF-8" );
mmp.putBinaryFileParam( Params.XML,
    "input.xml", // the filename was lost from the original post
    "application/xml", "UTF-8" );
mmp.finish();
conn.setRequestProperty("Accept-Charset", "UTF-8");
conn.setRequestProperty("Content-Type",
    "multipart/form-data; boundary=" + mmp.getBoundary());
OutputStream output = conn.getOutputStream();
output.write( mmp.getContent().getBytes("UTF-8") );
output.close();
// get response and print it
printResponse( conn );

Extend MIMEMultipart if you like by adding a method for plain text files and other types of part.
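One obvious extension is a method for plain-text file parts. The sketch below builds such a part as a standalone helper so it can be tried in isolation; the class name TextFilePart, the build signature, and the body argument (standing in for file contents read from disk) are my own invention, not part of the class above:

```java
class TextFilePart
{
    static final String CRLF = "\r\n";

    // Build the header and body of a plain-text file part, using the
    // same layout as putBinaryFileParam but a text/plain content type
    static String build( String boundary, String name,
        String fileName, String encoding, String body )
    {
        StringBuilder sb = new StringBuilder();
        sb.append("--").append(boundary).append(CRLF);
        sb.append("Content-Disposition: form-data; name=\"")
          .append(name).append("\"; filename=\"")
          .append(fileName).append("\"").append(CRLF);
        sb.append("Content-Type: text/plain; charset=")
          .append(encoding).append(CRLF);
        sb.append(CRLF);              // blank line ends the part header
        sb.append(body).append(CRLF);
        return sb.toString();
    }
}
```

Inside MIMEMultipart the result would simply be appended to text, as the other put methods do.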

Sunday, 25 December 2011

Loading native libraries in Java

This is an old problem, so I thought I'd write down my current experiences to save others, and myself, pain in future.

You write a native library or foo.dll or libFoo.dylib etc. for Java. And you store it in a convenient location, not a system location, because your software shouldn't meddle with that. To use it you need to call System.loadLibrary("foo");. This will probably give you: "Exception in thread "main" java.lang.UnsatisfiedLinkError: no foo in java.library.path". Where did I go wrong?

System.loadLibrary looks in the Java library path, "java.library.path". Cool, let's set that in the program, just before we call loadLibrary. We can get some platform-independence by passing in the library path on the commandline:

String old = System.getProperty("java.library.path");
System.setProperty( "java.library.path", old + ":/usr/local/lib" );
System.loadLibrary( "foo" );

It doesn't find the library because you can't change "java.library.path" after starting the JVM. It just ignores your additional directory.

Everyone says set the environment variable LD_LIBRARY_PATH to (in my case) /usr/local/lib. This doesn't work either. On Linux and OSX at least Java ignores that variable when setting up java.library.path. In any case setting LD_LIBRARY_PATH globally for your application will screw up something else on your system. Not cool.

Third attempt. Set java.library.path on the java commandline:

java -Djava.library.path=/usr/local/lib MyApp

Now you've changed the JVM, so things could go wrong: instead of having the system default library path where everything is, you've redefined it to a custom location. Unfortunately there's no universal way to ADD /usr/local/lib to java.library.path. So the best you can do is find the java library path on your system (by writing a Java program that outputs System.getProperty("java.library.path")), then add /usr/local/lib to that value, and finally specify the entire string to java:

java -Djava.library.path=<default-library-path>:/usr/local/lib MyApp

This is what I have to do on Mac OSX. Of course it's entirely platform-specific, which is stupid for a programming language that is supposed to be platform-independent. On the other hand, this yucky solution is the best one on offer. And since, once you've finished developing, you'll be running it time and again on the same platform, it probably doesn't matter much.

Alternatively, you could just put your library in the current directory, which will work on Windows (reportedly) and Mac OSX but not Linux.

Saturday, 24 December 2011

Installing couchdb on Mac OSX

Installing couchdb on Linux is a breeze, at least with the Debian package manager. But on OSX, where I was stuck over the Christmas break, installation is a pain. Of course if you believe the hype, homebrew will save us: all you have to do is install it and type brew install couchdb. Except that it doesn't work. Packages have to be maintained, and unless the homebrew authors do all that work, their packages will soon break. So I had to do it myself. This is the formula on 64-bit systems:

  1. Download the latest Erlang source. Configure with --enable-darwin-64bit . Otherwise it compiles in 32 bit and it won't work with the other components, especially icu. Then make, make install as usual.
  2. Download and install ICU (configure, make, make install)
  3. Download and install couchdb. Configure, make, make install. And it all should work.

Now that Apple's great leader has passed on maybe someone in charge will see that a proper package manager would be a good idea for OSX.

Saturday, 6 August 2011

A new kind of standoff markup

When markup was first invented it was always embedded in the text. The earliest forms were like formatting instructions to tell us how the text should look when it was printed:

.indent 5
blah blah blah
.indent 0

This meant that the text should be indented five spaces, say, for a bit of quoted text. This might be a problem if you later wanted all quotes indented 10 spaces. Then in the 1980s some folks had the bright idea of embedding only abstract information about textual structure rather than specific formats:

.q
blah blah blah
.endq

Then when it was printed a separate program would read a stylesheet and convert the .q and .endq instructions into their correct indents. This was called 'generalised markup'. For a while people thought this was wonderful. It was used in the first web-pages in the early 1990s to make HTML, the Hypertext Markup Language. Every web page you read today is still encoded this way - well, nearly. But the problems persisted. The tags were all embedded in the text, which made it difficult to read. As more and more information had to be added eventually there was so much markup that it was almost impossible to read the text without first formatting it. Which was a problem if you wanted to edit it. Then, in the mid-1990s some people invented standoff markup.

Standoff markup

Standoff markup removes the embedded tags and stores them separately from the text. Each of the externally stored tags is associated with:

  • an offset into the text where it could be reinserted
  • the distance between pairs of start and end-tags as measured in the underlying text
  • any attributes that belonged to the original start-tag

Standoff markup doesn't change the markup structure; it just separates it from the text and makes it possible to combine different markup sets with the same underlying text. For the reader it also removes the confusion arising from the embedding of complex tags. But there were also drawbacks:

  • If the text was altered then all the offsets in the standoff markup file would have to be updated.
  • It was still not possible for the textual properties described by the tags to overlap.
  • Markup sets could be exchanged but not combined. You couldn't, for example, add both metrical and formatting structure to the same document, or merge markup sets written by different people.
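The first drawback can at least be mechanised: when characters are inserted into the text, every stored offset at or after the edit point must shift. A minimal sketch of that bookkeeping (my own illustration; the Range class and insertText are invented names, not any particular standoff tool's API):

```java
import java.util.List;

class OffsetUpdater
{
    static class Range
    {
        String name; int start, len;
        Range( String name, int start, int len )
        {
            this.name = name; this.start = start; this.len = len;
        }
    }

    // Adjust standoff ranges after n characters are inserted at pos
    static void insertText( List<Range> ranges, int pos, int n )
    {
        for ( Range r : ranges )
        {
            if ( pos <= r.start )
                r.start += n;           // insertion before the range: shift it
            else if ( pos < r.start + r.len )
                r.len += n;             // insertion inside the range: grow it
            // insertion after the range: nothing to do
        }
    }
}
```

Deletion needs the symmetric update, and this is exactly the maintenance burden the drawback describes.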

Standoff markup is suited to the corpus linguistics applications it was originally developed for, where various natural language tagging tools produced differently marked-up versions of the same underlying text. But in the humanities this method has proved less popular because of the need to edit the text, at least in the initial stages of preparing an electronic edition - which is what I am interested in.

Standoff properties

To distinguish our solution from conventional standoff markup we call it 'standoff properties'. It can also be used, as in our software, to allow the editing of either the text or the markup while keeping the other half automatically in sync.

Standoff properties extends standoff markup in two significant ways:

  1. By allowing overlap between properties
  2. By allowing sets of markup properties to be freely combined


Software engineers use different kinds of data structures for various tasks, such as graphs, arrays, hash tables etc. Humanists and corpus linguists have unnecessarily limited themselves to embedded markup languages and thereby limited representations of their data to tree structures. Moving to a data structure that resides outside of the text liberates them from that restriction.

Mixing markup sets

A set of markup tags, stripped from a standard XML file is always properly nested because it is sorted by the order in which tags were encountered in the original XML file. This can be used to reconstruct the document tree in cases where several tags start at the same point in the underlying text. For example:

<book name="my book"><chapter n="1"><text>...

In standoff form these properties/tags might be represented as:

"book" start: 0 length: 12345
"chapter" start: 0 length: 234
"text" start: 0 length: 234

Here the three tags 'book', 'chapter' and 'text' all start at the same point in the text they enclose. The order can be used to decide that 'text' goes inside 'chapter' and not the other way around, even though they may describe equal ranges in the base text.
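That order-based rule can be made concrete with a stack: scan the properties in document order and pop until the top of the stack contains the current one. A sketch under that reading of the rule (the class and method names are invented, and for simplicity it assumes property names are unique):

```java
import java.util.*;

class TreeBuilder
{
    record Prop(String name, int start, int len) {}

    // Assign each property a parent, using document order to break
    // ties between properties with identical ranges
    static Map<String,String> parents( List<Prop> props )
    {
        Map<String,String> parent = new HashMap<>();
        Deque<Prop> stack = new ArrayDeque<>();
        for ( Prop p : props )
        {
            while ( !stack.isEmpty() )
            {
                Prop top = stack.peek();
                boolean contains = top.start() <= p.start()
                    && top.start()+top.len() >= p.start()+p.len();
                if ( contains ) break;
                stack.pop();
            }
            parent.put( p.name(), stack.isEmpty()?null:stack.peek().name() );
            stack.push( p );
        }
        return parent;
    }
}
```

On the book/chapter/text example this yields text inside chapter inside book, exactly as the encounter order dictates.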

But if we were to allow different sets of markup tags to be combined, what would the order be? Since the two trees, which represent two sets of markup tags, can't be combined, this reliance on the order of the tags will have to be abandoned if we want truly mixable tagsets.

Lifting the restrictions

Digital humanists have been calling for these relaxations in the strictness of markup for many years. But the reality is that the external world of markup has not changed. Web pages are still represented as tree-structured HTML. So if 'standoff properties' are adopted as an alternative representation for markup, how is it possible to convert them into a well-formed and valid formatted HTML file? The conventional solution is to use XSL Transformations (XSLT), but that requires embedded XML as input. So a new technique is needed to perform the conversion. That's what this posting is all about.

Standoff properties to HTML

The reader may well be sceptical that this is possible at all. Anyone who has worked in this field over the past 20 years will know that there have been a large number of attempts, more or less unsuccessful, to allow overlap in marked-up texts. It may also seem impossible to convert sets of arbitrarily overlapping properties with no real structure into rigid tree-formatted HTML. But I am going to prove it, although to make the proof short enough to read I'm only going to outline it. For the more sceptical there is a program to back it up that has already been extensively tested. You can access it on the Digital Humanities Testbed website and try it out for yourself.

To prove that standoff properties can be converted to HTML we can omit certain confusing details. Firstly we can ignore attributes because a set of attributes is associated with each property or tag on a one-to-one basis. We can also forget about how properties are turned into HTML tags because this is likewise a one-for-one mapping. For example, we can just specify that the property 'stage' should always be converted into the HTML tag 'p'. Finally, since property sets can be freely combined, it suffices to prove that one set of standoff properties can be converted into a HTML document tree.

Deriving nesting information

In order to build a tree from data that has no such structure, some information about which properties may be allowed to nest inside other properties is obviously required. This information can be derived from two sources: firstly, by scanning the list of properties and their ranges in the text it is possible to compute how often a particular property is entirely 'inside' another property; secondly, from the nesting rules of the target HTML schema, described below. If the property-set was derived from stripping an XML file, this nesting information cannot violate any requirement of the XML file's original schema, if the file conformed to it. In fact XML can perfectly well do without any form of 'schema', or syntax recipe, and can derive one from its 'well-formedness', that is, the strict nesting of its paired tags. We will be going one step beyond that by not even mandating well-formedness.

First of all some definitions:

  • An 'element' in XML means a pair of start and end-tags and their intervening content. The XML element '<em>really true</em>' projects the property 'em' over the text-range 'really true'. An element can also contain other elements.
  • A 'property' on the other hand is just a name for a range in the text. Unlike elements, properties don't have any intrinsic ability to nest.

By scanning a file containing a set of property names and their ranges for a particular text – a property-set – a matrix of how often each property was observed to be inside itself and every other property can be derived. A table from a simple TEI-XML file for a play might look like this:


Table 1

Reading from left to right, the numbers mean that 'head', 'stage', 'speech' etc. are found so many times inside the properties named in the various columns. So the property named 'stage' is found inside the property named 'text' a total of seven times.
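Deriving such a table is a single pass over all pairs of ranges in the property-set. A sketch, assuming 'inside' means the other range covers this one and is strictly longer (the class and method names are invented):

```java
import java.util.*;

class NestingTable
{
    record Range(String name, int start, int len) {}

    // freq[a][b] = how often a range named a was found inside one named b
    static Map<String,Map<String,Integer>> build( List<Range> rs )
    {
        Map<String,Map<String,Integer>> freq = new HashMap<>();
        for ( Range a : rs )
            for ( Range b : rs )
                if ( a != b
                        && b.start() <= a.start()
                        && b.start()+b.len() >= a.start()+a.len()
                        && b.len() > a.len() )   // b strictly larger
                    freq.computeIfAbsent( a.name(), k -> new HashMap<>() )
                        .merge( b.name(), 1, Integer::sum );
        return freq;
    }
}
```

Run over a stripped TEI property-set, this produces exactly the kind of frequency matrix shown in Table 1.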

A similar table can be drawn up for HTML because all the nesting rules can be found in its schema. So, for example, the tag <span> may appear inside <p>, but not vice-versa. So the corresponding fragment of the HTML matrix would look like this:


Table 2

In reality the full HTML table has 107 rows and columns and is too big to show here. Binary values are used instead of the frequencies of Table 1 because this is general nesting information derived from a schema, not from actual texts. Reading the table from left to right, the binary value '1' means 'may nest inside' and '0' means 'may not nest inside'. Now by mapping each of the property names from Table 1 to their HTML tag equivalents, we can look up the corresponding values in a full-sized version of Table 2 with the 107 tags of HTML5. If there is a 1 in the corresponding location of the HTML table then the frequency count in Table 1 will be left as is, and if the HTML table has a 0, meaning that the HTML equivalents may not nest, then the frequency, however large, will be zeroed. This guarantees that the nesting information derived from the property set, when converted to HTML, specifies HTML-compatible nesting rules.
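The zeroing step is then a cell-by-cell mask of Table 1 against Table 2: map each property name to its HTML tag and keep a frequency only where HTML permits the nesting. A sketch with illustrative mappings (none of the names below come from the actual 107-tag HTML table):

```java
import java.util.*;

class HtmlMask
{
    // Keep freq[a][b] only if the HTML equivalent of a may nest
    // inside the HTML equivalent of b; otherwise zero it
    static void mask( Map<String,Map<String,Integer>> freq,
        Map<String,String> toHtml,
        Map<String,Set<String>> htmlMayNestIn )
    {
        for ( var aEntry : freq.entrySet() )
        {
            String aTag = toHtml.get( aEntry.getKey() );
            aEntry.getValue().replaceAll( (b,count) ->
                htmlMayNestIn.getOrDefault( aTag, Set.of() )
                    .contains( toHtml.get(b) ) ? count : 0 );
        }
    }
}
```

The surviving non-zero frequencies are, by construction, HTML-compatible nesting rules.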

Enclosing versus nesting

Recall that from the definition above a 'property' consists of a name and a range in the underlying text. What Table 1 records are the nesting characteristics of the property-names. To say property-name A nests inside property-name B is a general statement about those two properties. On the other hand to say that property A encloses property B means that the range of property A is outside the range of property B by at least one character on the left or right. So, for example, the property 'speech' encloses the property 'speaker' even though they both start at the same point:

<sp><speaker>Bast.</speaker> Nothing my Lord.</sp>
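Under these definitions 'encloses' reduces to a simple predicate on range coordinates; a sketch (the class and method names are invented):

```java
class Enclosure
{
    // a encloses b: a covers b and extends past it by at least one
    // character on the left or the right
    static boolean encloses( int aStart, int aLen, int bStart, int bLen )
    {
        int aEnd = aStart + aLen, bEnd = bStart + bLen;
        return aStart <= bStart && aEnd >= bEnd
            && (aStart < bStart || aEnd > bEnd);
    }
}
```

So two properties with identical ranges never enclose one another, which is precisely why the document order of the tags is needed to decide their nesting.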

Document Object Model

The idea of a document object model (DOM) is borrowed from SGML/XML. HTML is an instance of an SGML language, and a HTML document can be described by a tree-structure of 'nodes', each containing other descendant nodes called 'children', and nodes at the same level called 'siblings'. Figure 1 shows the structure of a basic DOM tree.

To be continued ...

Sunday, 29 May 2011

Extensions to CorCode

I've been thinking about the exact role of CorCode and how we render it on screen as HTML. There are some problems with the simple CorTex, CorCode, CorPix model that need figuring out. Hopefully they're just details.


For TEI embedded markup people write XSLT stylesheets to transform the text, typically into HTML. Since a particular encoding of a text is embedded in it, and is highly customised, the XSLT stylesheet that transforms it cannot be fully portable to other people's texts:

  1. they may want to render the same data differently, or
  2. they may have custom tags and attributes that need rendering according to their local GUI requirements

The reason that TEI tries to mandate particular names for tags and attributes is so that standard stylesheets can be used. In practice this does not work very well because just as the encoding is customised the stylesheets also need customisation as required by these two points. On the other hand, if we no longer mandate any standard encoding and just allow the user to specify fully the names of properties (aka tags) then it is up to the encoder to specify what is to be done with them.

How to do this in HRIT?

One objective of designing HRIT is to make things easy and to make them work smoothly. So the user should download Cortex and CorCode and they should automatically merge locally and produce output that looks good in their local GUI tool. But how?

Both functions of the transformation - the conversion into a visual appearance and its interpretation by the application - are specific to the encoding itself. So I am thinking that the transforming css file that performs both duties in HRIT needs to be closely associated with the CorCode. It shouldn't need to be transformed locally for a particular application, because how could that transformation do anything without knowledge of the format? Imagine I write a css file to render King Lear, and use it within single view. Cool. But what happens if I download Little Dorrit by Charles Dickens in the same application, and my css file no longer works? So the css file can't belong to the application, or at least most of it can't. It has to belong to the CorCode. This is more or less what TEI already does: an XSLT stylesheet is designed for a particular collection of customised text encodings and is useless outside of that domain.

Ergo: Corform

So maybe we need a fourth type of basic data: corform, which is as tightly bound to corcode as corcode is itself bound to cortex. That sounds less evil to me than extending the corcode format to support formatting. The good news is that we wouldn't need as many corforms as corcodes, because they would be version- and even file-independent. But a corform would be functionally and visually specific to a particular application. We might assign each corcode a "style" and then specify corforms that render that style. We'd want many corforms for each style but not the other way around. So a suitable REST resource url might look like "/corform/style-name/app-name/render-name": "app-name" could be "default", meaning any application, or "hritsingle", meaning the single view application; a suitable style-name might be "TEI"; "render-name" might be "freds-format". An individual corform resource would be in css.

So the way it would work in practice the user would specify a particular corform resource manually, and it would be saved until changed. But if none was specified, the application name combined with the corcode's style name should identify a default rendering in every case.

Writing a basic php5 extension

I wanted to put formatter into a php extension because the CorCodeDemo currently has to write the CorCode and CorTex returned by the server to temporary files because formatter is currently a C commandline tool, and the CorTex and CorCode responses are too long to be commandline arguments. Also making formatter directly accessible from within php will effectively make it a replacement for XSLT, which is my intention.

The trouble with "hello world" as a php extension

I followed the instructions, but there were several problems. Documenting them might help others.

The first gotcha I discovered was that you're supposed to build php from the latest sources and install that on your system. This is actually quite dangerous on a Linux system because doing so invalidates the version installed by the package manager. The default deb package on Ubuntu puts files here, there and everywhere, and making the source code build put everything in exactly the same places is rather difficult and would destroy the package manager information about php. On the other hand, putting it elsewhere gives you two versions and the actual one invoked depends on the order directories are searched. The main problem this leads to is that building an extension in any version other than the one it is run in will prevent the extension from loading. My original php version was 5.3.2, which is what my server is still using. The latest version is currently 5.3.6, which became my commandline version. I didn't want to delete the old version because I had spent some effort installing and configuring it with xdebug, mysql and xslt.

The second gotcha was that the commandline php.ini file is not the same as the one used by the web server. On my system the apache2 one is in /etc/php5/apache2 and the commandline one is in /etc/php5/cli. You really must check the following:

  1. that the extension_dir used during execution is the one you put your extension in
  2. that the php.ini file that loads your extension is the one actually being used by php on the commandline (or server)

You can check the php.ini file by invoking on the commandline:

php -i | grep php.ini

And the extension_dir similarly:

php -i | grep extension_dir

Like me you'll probably be surprised by what it says. But match them up, and match the compile and the execution versions and it will work. Remember though, that the server will use a different php.ini file, and a different extension_dir. The best way around this dilemma is to download the same version of the source as is currently installed on your system, and build with that.

Wednesday, 25 May 2011

HritSample Module

I've managed to create a CMS-independent Hello World! Hrit module that runs in Joomla!

So what, huh? Well, this will make it easy for anyone to write a Hrit module in future, which they can load into any popular CMS, starting with Joomla! The cross-CMS compatibility is provided by a single php file, which is the Joomla! system's first point of entry into the extension. Also needed is a Joomla!-specific xml file describing the extension. Thereafter the php entry point delegates everything to the 'module', which is a folder containing a real controller and a real view class that doesn't refer to Joomla! in any way, shape or form.

A HRIT Module wrapper tool

Eventually we will need a wrapper program to take such a module and prepare it for the various CMSes automatically. For that purpose an XML file would have to be added to each module to describe its contents, which the wrapper would read and turn into a CMS-specific form.

Hritcore module

To make writing Hrit modules really easy all of the Hrit-specific functionality is contained in another CMS-independent module called hritcore, which is added to the php include-path and from which not only superclasses for Hrit but also the mvdguilib.js Javascript functions, which provide cross-browser functionality for interactivity, can be called.

One of these support files is a surrogate or imitation HritRest service, which just accesses a local database for experimental purposes, rather than a remote service over the web. There is only one public method: get, which takes a REST-style URL as its only argument, and spews back whatever the caller requested. So if users request:

/cortex/english/shakespeare/kinglear/act1/scene1/folios/F1

they will get the F1 version of act 1, scene 1 of Shakespeare's King Lear, etc. And if they replace the first bit of the url by 'corcode' they will get the corcode for that version, and similarly for corpix.

Where this still needs to go

Later, when this kind of url is possible on the real REST service the HritRest.php file can be done away with. Shouldn't be long before the two example views: SingleView and CompareView are ready to demonstrate.

Next step, though, is to incorporate formatter (and why not also splitter and stripper?) into the php service and do away with XSLT forever. Oh, what a joy!

Saturday, 21 May 2011

Formatter inside PHP

The formatter is a program to replace XSLT, but it uses our standoff markup solution called CorTex/CorCode. At the moment the CorCodeDemo calls formatter on the command line as an external script. But it struck me that because it replaces XSLT it ought to be called the same way that XSLT is - within the PHP daemon as a PHP extension. That way we avoid the inefficiency of running it as a commandline tool. Instead it will just be available, once installed, for any HRIT-modules we design. I don't yet know how to do it, but since everything is open-source it should be easy.

Another refinement is the reuse of MvdCore, a Joomla! component I wrote to facilitate writing a robust set of Joomla! plugins for the MVD-GUI. But since it doesn't refer to Joomla!, except in order to exist as a module, I thought I would use it as a basis for every HRIT module, as a kind of framework library. That way, writing new modules should be dead easy and they will automatically work in every browser. Now that's cool.

Wednesday, 18 May 2011

Brave New World

Now we've thrashed out a pretty good picture of how the HRIT system will look, I decided to make the MVD-GUI in its image. That is, to use the HRIT REST service, the HRIT API, and write a HRIT compatibility layer for all my modules to run in. The BIG advantage of this is that, provided a compatibility layer is available for a particular CMS, the same module will run unmodified in that CMS (Content Management System). The only catch is that I'd have to find somewhere to host the HRIT Java REST service (cloud, left). So here's my current design for the compatibility layer:

In the Model-View-Controller design the Model handles all interactions with the database. Give me this, here is that, etc. So an implementation of the HRIT API fits that design pretty well (the pink box). It wouldn't have to access any databases directly, because it would get all the data it needs from the REST Service. So it could be the same on all CMSes.

The Controller on the other hand (orange boxes) deals with all the user-clicks in the view and decides what to do with them. It interposes between the view and the model to decide when to call the model and what to return to the view. It's the 'business logic' of the application. An individual CMS can't and does not in practice mandate an MVC design, but it does define conventions for calling modules. So in Drupal and Joomla!, HRIT modules will be defined with a library of functions that will act as the master controller for the module. Whereas the module itself will have its own private controller that will call the master one. The private controller will include the master controller with a require statement and assume the existence of certain functions or static methods inside that controller, say HRITController::doSomething().

The View (green box) can do what it likes, so long as it only uses PHP functionality, and doesn't call any CMS-specific routines or rely on any CMS-specific data. If it needs something like that it will have to get it from the compatibility layer via its private controller.

Saturday, 16 April 2011

HRIT standoff format

The HRIT standoff XML format is designed to replace embedded markup by using standoff properties. It is not intended to be definitive, and the same information could easily be recorded using another format. All that matters is that it suffices. It supports the following features:

  1. The text is in a separate file, which may be referred to by multiple HRIT markup files.
  2. Several markup files can be combined to enrich the text simultaneously.
  3. Properties may overlap
  4. Both HRIT markup files and the underlying text files can be freely edited

The format doesn't perform these functions; it just allows them to happen. So it is actually pretty simple.


The header provides some essential structure and information:

<?xml version="1.0" encoding="UTF-8"?>
<hrit-markup style="TEI">

A CorCode is of course closely bound to a CorTex version. In the HRIT system each cortex and corcode is uniquely identified by a URN in the FRBR mould as a sequence of hierarchical names, e.g. "/corcode/english/shakespeare/kinglear/act1/scene1/folios/F1". A CorCode URN is the same as the corresponding CorTex URN except that the first component is "/corcode" instead of "/cortex". So there is no need to record that relationship explicitly in the markup.
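Since the two URNs differ only in their first component, the mapping between them is a one-line string operation. A sketch (the class and method names are invented):

```java
class UrnMap
{
    // derive the CorTex URN corresponding to a given CorCode URN
    static String corcodeToCortex( String corcodeUrn )
    {
        if ( !corcodeUrn.startsWith("/corcode/") )
            throw new IllegalArgumentException( corcodeUrn );
        return "/cortex/" + corcodeUrn.substring( "/corcode/".length() );
    }
}
```

This is why the relationship never needs to be recorded in the markup itself.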

The "style" attribute on the other hand is a reference to the URN identifying a CSS format file. This is used to render a CorCode into an external HTML form for display. It's not a complete name because the same corcode/cortex might be rendered in multiple ways, depending on the application and the user's preferences. Each CSS file points to just one CorCode, and hence the unique style name here. A full format identifier would look more like "/corform/TEI/hritsingle/luc-style", which would mean 'the core format for "TEI" corcodes in the hrit-single application in the LUC style'. The CorForm concept is new and we are still working out the details, but it seems clear that this kind of information has to be stored in the "cloud" with the cortexes and corcodes.


The body of a HRIT markup file is composed of an ordered series of ranges. Each range points to a span in the base text and is specified by its relative offset from the previous range and its length. This facilitates editing, since changing one range only invalidates the immediately following one. It also has a name because it represents a property. Finally it may contain annotations, which are mostly leftover attributes from XML, although they can also be used to store invisible programming information such as links. But these will need to be backed up by direct functionality in a GUI; they are not intended to be edited by the user.

<range name="pb" reloff="0" len="0">
  <annotation name="ed" value="F3"/>
  <annotation name="n" value="765"/>
</range>
<range name="sp" reloff="271" len="20"/>
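To see how the relative offsets work, here is a hypothetical Java sketch (the Range class and output format are illustrative only, not the HRIT implementation) that converts absolute start offsets into the reloff values used above:

```java
import java.util.*;

public class RelOff {
    // Hypothetical absolute range: name, start offset, length.
    static class Range {
        String name; int start, len;
        Range( String name, int start, int len )
        { this.name = name; this.start = start; this.len = len; }
    }

    // Convert absolute start offsets to relative offsets: each reloff
    // is measured from the previous range's start, so editing one
    // range only invalidates the immediately following one.
    static List<String> toRelative( List<Range> ranges ) {
        List<String> out = new ArrayList<>();
        int prev = 0;
        for ( Range r : ranges ) {
            out.add( r.name+" reloff="+(r.start-prev)+" len="+r.len );
            prev = r.start;
        }
        return out;
    }
}
```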

Removed ranges

If a range is labelled removed="true", it can't be formatted and cannot specify a span in the base text. An example is the teiHeader element in TEI. Since this is metadata, its content has a fundamentally different status to the text of the body. So the default stripper recipe specifies that this element and all its children are removed. A removed element may have its own private content that is stored as part of the standoff markup. This 'content' can belong to any point in the file, but usually to the start, so it can have a reloff attribute, though not a len:

<range name="title" removed="true" reloff="0">
  <content>Mr. William Shakespeares Comedies, Histories, &amp; 
  Tragedies. Published according to the True Originall Copies.</content>
</range>

This may sound like a hack, but the information has to be stored somehow if we later want to reverse the conversion from HRIT back to TEI. Of course files originating in the HRIT system itself won't have any removed elements.

Nesting of ranges

If a HRIT markup file has been imported from XML then its ranges will effectively be nested. But this information is not used during formatting, since ranges can freely overlap.




This format is still subject to development and may be changed in future.

Sunday, 13 March 2011

How fast can we diff two strings?

On this question I have yet to read of any advance on Esko Ukkonen's 'Algorithms for Approximate String Matching' which appeared in Information and Control, 64, 100-118 in 1985. As per his predecessors, he uses the edit-graph to model the matching process, laying out the B-string across the top and the A-string down the left hand side (see diagram below). Finding the shortest edit script is akin to moving through this graph one square at a time. The x coordinates are just indices into the B-string, and the y-coordinates are expressed as x-i, where i is the index of the relevant diagonal. (The diagonals are numbered 0 at the origin, with increasing negative values on the 'down' side and increasing positive values on the 'right' side.) From each square there are three possible moves: right, corresponding to a deletion from the B-string; down, corresponding to an insertion from the A string; diagonally down and right, corresponding to a match (if the characters at that point are equal) or to an exchange if they are not.

In this diagram we transform 'aback' into 'beak'. First we delete 'a'. Then we match 'b', insert 'e', match 'a', delete 'c' and match 'k'. Note that each move takes us one beyond the square it was performed in.
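These three moves map directly onto the classic O(NM) dynamic-programming recurrence that the diagonal algorithms improve on. As a sketch (not pdiff itself): right is a deletion, down is an insertion, and the diagonal is a match (cost 0) or an exchange (cost 1):

```java
public class EditDist {
    // Classic O(N*M) dynamic programming over the edit graph:
    // right = deletion (cost 1), down = insertion (cost 1),
    // diagonal = match (cost 0) or exchange (cost 1).
    static int dist( String a, String b ) {
        int n = a.length(), m = b.length();
        int[][] d = new int[n+1][m+1];
        for ( int i = 0; i <= n; i++ ) d[i][0] = i;   // i deletions
        for ( int j = 0; j <= m; j++ ) d[0][j] = j;   // j insertions
        for ( int i = 1; i <= n; i++ )
            for ( int j = 1; j <= m; j++ ) {
                int diag = d[i-1][j-1]
                    + (a.charAt(i-1)==b.charAt(j-1) ? 0 : 1);
                d[i][j] = Math.min( diag,
                    Math.min( d[i-1][j]+1, d[i][j-1]+1 ) );
            }
        return d[n][m];
    }
}
```

Running it on the example above gives a cost of 3 for 'aback' to 'beak': one deletion, one insertion and one more deletion, with matches free.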

Progressive refinement of the p-band

The basic idea of Ukkonen's (and Myers') algorithm is that the only positions of interest as we traverse the graph are the most advanced positions on each diagonal. So instead of storing NxM positions (as in the dynamic programming method of Needleman and Wunsch, for example) we only need to store N+M diagonal positions – quite a saving. Ukkonen further reduces the search space to a narrow diagonal strip down the centre of the graph – the p-band (p.105f) – which can gradually narrow as we proceed toward the end (p.106). Both ideas have been republished as alleged improvements to his algorithm (e.g. Wu 1989 – cited 88 times). Ukkonen defines p as ⌊½(d(m,n)/Δ − |n−m|)⌋. In plain English: the worst possible cost is incurred by exchanging all the characters on the 0 diagonal and then deleting all the characters horizontally until you get to M,N. Any path on the graph that is more than half that cost away from the goal has no chance of reaching it more efficiently than that. What is more, we can update p as we get closer to the goal, since we get a better idea of the cost of reaching d(m,n).

Based on this idea my program assesses dynamically, for each move in the graph, whether it could result in a shorter path to the goal than the worst result of the currently best diagonal. If not, then it is dropped. My algorithm is also very space-efficient. Rather than allocating an array of size M+N (the lengths of the two strings) to hold the diagonals, I use a linked list to store only the active ones.

Moving through the graph

We construct a linked list of diagonals and number them, starting with the one between the two strings at 0,0 (the origin), numbered 0. The other diagonals along the y-axis are numbered -1,-2,-3 etc and the ones along the x-axis are numbered 1,2,3 etc. Then we assess the even and odd numbered diagonals in the list in alternate passes. The reason is that the computation of the even-diagonals interferes with that of the odd-diagonals. For each diagonal at index i and horizontal position x we compute the maximum x-value that would result from:

an insertion from the i+1 diagonal
a deletion from the i-1 diagonal
an exchange on the i diagonal

Whichever takes us further east along the diagonal (and by implication also further down) will be the best move. All things being equal we favour the exchange move because it promises to be the cheapest overall. After processing the even and odd diagonals we increment the d-value: our record of how much it has cost to transform B into A so far. If we are at one or the other edge of the linked list, and it is necessary, we also create a new diagonal to the left or right. This won't be processed in the current pass because it is odd when we are even and vice versa.
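To make the mechanics concrete, here is a minimal Java sketch of the furthest-reaching-diagonal idea. It is not pdiff: for simplicity it uses a plain array with a fresh copy per pass instead of the linked list and the even/odd trick described above, and it only returns the distance, not the edit script:

```java
import java.util.Arrays;

public class DiagDist {
    static final int NEG = Integer.MIN_VALUE/2;  // "unreached" sentinel

    // Follow a run of matching characters (a 'snake') along diagonal i,
    // starting at column x; the row y is x - i.
    static int snake( String a, String b, int x, int i ) {
        int y = x - i;
        while ( x < b.length() && y < a.length()
            && b.charAt(x) == a.charAt(y) ) { x++; y++; }
        return x;
    }

    // Edit distance (insert, delete and exchange each cost 1), keeping
    // only the furthest column reached on each diagonal per cost d.
    static int dist( String a, String b ) {
        int n = a.length(), m = b.length(), off = n+1;
        int[] f = new int[n+m+3];
        Arrays.fill( f, NEG );
        int x0 = snake( a, b, 0, 0 );
        if ( x0 == m && x0 == n ) return 0;
        f[off] = x0;
        for ( int d = 1; d <= n+m; d++ ) {
            int[] g = new int[f.length];
            Arrays.fill( g, NEG );
            for ( int i = -Math.min(d,n); i <= Math.min(d,m); i++ ) {
                int best = NEG, p;
                p = f[off+i];       // exchange: one diagonal step
                if ( p > NEG && p < m && p-i < n ) best = p+1;
                p = f[off+i-1];     // deletion: one step right
                if ( p > NEG && p < m && p+1 > best ) best = p+1;
                p = f[off+i+1];     // insertion: one step down
                if ( p > NEG && p-i <= n && p > best ) best = p;
                if ( best == NEG ) continue;
                int x = snake( a, b, best, i );  // slide along any snake
                if ( x == m && x-i == n ) return d;
                g[off+i] = x;
            }
            f = g;
        }
        return n+m;
    }
}
```

New diagonals appear automatically as the i-range widens with each pass, which plays the role of extending the linked list at its edges.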

Finding the snakes

A match, or 'snake' as Myers called them, is just a run of matching characters in the edit graph, and snakes give the diagonal method all its speed. But we have to be clear about when we look for them. The very first square of the edit graph at 0,0 may or may not contain a matching character. If it does, we have to match the snake before checking for an exchange, but we still have to try a deletion and an insertion from 0,0, because otherwise we might miss an optimal snake starting on the -1 or 1 diagonal. Another square that needs care is the very last one at M-1,N-1. Until we have checked that square for a match as well as for an insertion, deletion and exchange we won't have reached the real 'end', which is one character beyond M-1,N-1. For all other squares we just test for snakes after each edit operation, so we can be certain at the start of the next pass that the characters at the head of each diagonal do not match.

Exchanges vs insertions and deletions

Traditionally diff algorithms don't compute exchanges. But an exchange move from x,y to x+1,y+1 costs only 1, whereas a deletion+insertion pair reaching the same square costs 2. So exchanges are often much cheaper. And no information is lost: replacing xxx with yyy is the same as deleting xxx and inserting yyy, so the insertion/deletion pair can always be recovered from an exchange. Here's the result of my implementation of Ukkonen's algorithm. When you consider how many comparisons the program could have made (the entire graph) as opposed to the ones it did make (the squares marked with 'x'), you can see how efficient it is:

Recording the path

Finding the optimal path through the graph is no use if it doesn't allow us to construct a shortest edit script. One simple way to do this is to associate a path with each diagonal. The path is implemented as a linked list. Each component of the path has an x,i origin, a length and a type (nothing [for the initial move], inserted, deleted, exchanged or matched). If the type changes from, say, matched to exchanged, we create a new path-component and link it backwards to the last one. (If we went forwards we would need multi-way branches.) But if the type is unchanged we simply increase the length of the existing one. When we finally get to the end, the backwards-pointing path of the successful diagonal contains the shortest edit script. To print it we just reverse the order. Here's the edit script for the above graph:

matched 11 chars at x=0 y=0
exchanged 4 chars at x=11 y=11
matched 25 chars at x=15 y=15
exchanged 3 chars at x=40 y=40
matched 1 chars at x=43 y=43
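The same run-merging can be sketched in Java, though for brevity this version backtracks a full DP table rather than the backward-linked path components described above; the component names and the x/y origin format follow the output shown:

```java
import java.util.*;

public class EditScript {
    // Sketch only: recover a merged edit script ("matched n chars at
    // x=.. y=..") by backtracking a full DP table (exchange cost 1),
    // preferring diagonal moves so matches come out as long runs.
    static List<String> script( String a, String b ) {
        int n = a.length(), m = b.length();
        int[][] d = new int[n+1][m+1];
        for ( int i = 0; i <= n; i++ ) d[i][0] = i;
        for ( int j = 0; j <= m; j++ ) d[0][j] = j;
        for ( int i = 1; i <= n; i++ )
            for ( int j = 1; j <= m; j++ ) {
                int diag = d[i-1][j-1]
                    + (a.charAt(i-1)==b.charAt(j-1) ? 0 : 1);
                d[i][j] = Math.min( diag,
                    Math.min( d[i-1][j]+1, d[i][j-1]+1 ) );
            }
        List<int[]> ops = new ArrayList<>();   // {type, y, x} at origin
        int i = n, j = m;
        while ( i > 0 || j > 0 ) {
            if ( i > 0 && j > 0 && d[i][j] == d[i-1][j-1]
                    + (a.charAt(i-1)==b.charAt(j-1) ? 0 : 1) ) {
                ops.add(new int[]{a.charAt(i-1)==b.charAt(j-1)?0:1,i-1,j-1});
                i--; j--;                      // match or exchange
            }
            else if ( i > 0 && d[i][j] == d[i-1][j]+1 ) {
                ops.add(new int[]{2,i-1,j}); i--;  // deletion
            }
            else {
                ops.add(new int[]{3,i,j-1}); j--;  // insertion
            }
        }
        Collections.reverse( ops );
        // merge adjacent ops of the same type into path components
        String[] names = {"matched","exchanged","deleted","inserted"};
        List<String> out = new ArrayList<>();
        for ( int k = 0; k < ops.size(); ) {
            int t = ops.get(k)[0], start = k;
            while ( k < ops.size() && ops.get(k)[0] == t ) k++;
            int[] o = ops.get(start);
            out.add(names[t]+" "+(k-start)+" chars at x="+o[2]+" y="+o[1]);
        }
        return out;
    }
}
```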


pdiff prints out an edit script but also generates a trace of moves taken in the edit graph if both strings are smaller than 100 bytes. It's free under the GPLv3 and is written in C++. It can also diff entire files.

Monday, 7 February 2011

Joomla! 1.6: An Attack of Biggerism

I was quite relieved to try out Joomla! 1.6. So it's not just me who releases software before it is ready. Tellingly, the link to download its more stable predecessor, version 1.5, is displayed at about one tenth of the size. So the users are being recruited to do the testing. The danger of this strategy is that they might just move off somewhere else, say to Drupal.

On my Linux laptop running Firefox the default installation gave me a 20mm white border at the top of the page. Heaven only knows what for. Below that a cavernous banner that consumes another 40 mm, leaving a thin strip for content at the bottom. In the administrator pages gratuitous icons the size of surfboards float around in the limited space, dwarfing the innumerable input controls, which seem more bewildering than ever before.

Gone is the list of components. Instead you have a melange of components, plugins and modules, which you have to manually filter to get what you want. The old way of doing this, while not ideal, was better and required fewer clicks on the part of the user. What if, after filtering, I want to see just the modules I installed? More mouse-clicks, which destroy my previous configuration. I think they need to take a hard look at how to design a usable interface, not go wild on artwork.

Gone, too, are the polished templates of 1.5. Joni Mitchell was right: 'You don't know what you've got till it's gone'. In their place is a collection of ugly and nearly useless ones. On the banner we read: 'We are volunteers', which sounds more like an apology than a boast. They are at serious risk of losing their market dominance. My rating so far: 2/10.

Saturday, 5 February 2011

Pixel-perfect positioning of elements in HTML

The web is full of advice, most of it bad, on how to find out the precise position of any element on an HTML page in any browser. Well the answer is simpler than I thought.

The Box Model

Each element on an HTML page is surrounded first by its padding, then by its border, and finally by its margin. Or, to put it another way, the margin is on the outside of the border and the padding is on the inside. Padding is usually included in an element's height and width calculation, border may be, and margin almost never is.

Computing the offset of an element from the top of its window

First you need to determine the top and left offset of an element. I believe this works in all browsers.

function getTopOffset( elem )
{
    var offset = 0;
    while ( elem != null )
    {
        offset += elem.offsetTop;
        elem = elem.offsetParent;
    }
    return offset;
}

This works because the use of a fixed point on each element ensures that all the other margins, padding and border values add up correctly. As for the margin of the body element: although different browsers use different values, the offsetTop property is measured to the outer edge of the border, just like the value stored in offsetHeight, and so includes the margin. So if, as in IE, the margin-top value of body is 15, then the offsetTop property of the first element inside body will be 15. A similar routine works for the left offset.

Measuring the height of an element

The height of an element (or its width) is more complex because we usually don't include the border width. The cross-browser property clientHeight is only cross-browser in name, because in IE it is only set for formatted elements. For unformatted elements like div and p its value is 0. You can force it to have a clientHeight property by giving it the CSS property "display: inline-block", as recommended on many websites. But this is a hack that changes the way the element is displayed, which could play havoc with your layout. A cleaner way is to use the offsetHeight property, which is always set, even in IE. The only problem is subtracting the border-top-width and border-bottom-width, but these can be reliably computed from the CSS style:

function getHeight( elem, inclBorder )
{
    var borderHeight = getStyleValue(elem,"border-top-width")
        + getStyleValue(elem,"border-bottom-width");
    if ( elem.clientHeight )
        return (inclBorder)?borderHeight+elem.clientHeight
            :elem.clientHeight;
    else
        return (inclBorder)?elem.offsetHeight
            :elem.offsetHeight-borderHeight;
}
function getStyleValue( elem, prop )
{
    var value = getStyle( elem, prop );
    if ( value )
        return parseInt( value );
    else
        return 0;
}
function getStyle( elem, prop )
{
    var y;
    // test if in IE
    if ( elem.currentStyle )
        y = elem.currentStyle[cssToIE(prop)];
    else if ( window.getComputedStyle )
        y = window.getComputedStyle(elem,null)
            .getPropertyValue(prop);
    return y;
}
function cssToIE( prop )
{
    var parts = prop.split("-");
    if ( parts.length > 0 )
    {
        var ccProp = parts[0];
        for ( var i=1;i<parts.length;i++ )
            ccProp += parts[i].substr(0,1).toUpperCase()
                + parts[i].substr(1);
        return ccProp;
    }
    else
        return prop;
}

Again, width is computed similarly. The CSS properties are always set, even if you add them via a style= attribute on the element. getComputedStyle (Mozilla) and the element's currentStyle property (IE) both give you the computed CSS styles of an element. The only problem is that IE uses camelCase names for the hyphenated property names. That's easily fixed.

Getting Window height

The Firefox way to get the window height is just window.innerHeight. But this doesn't work in IE, of course. There you have to use document.body.offsetHeight or document.documentElement.offsetHeight. (Everyone says to use the clientHeight property, but that doesn't include the borders, and you can thus get the wrong answer in IE):

function getWindowHeight()
{
    var myHeight = 0;
    if ( typeof(window.innerHeight) == 'number' )
        myHeight = window.innerHeight;
    else if ( document.documentElement
        && document.documentElement.offsetHeight )
        //IE 6+ in 'standards compliant mode'
        myHeight = document.documentElement.offsetHeight;
    else if ( document.body && document.body.offsetHeight )
        //IE 4+ non-compliant mode
        myHeight = document.body.offsetHeight;
    return myHeight;
}

Using these properties my tests indicate that any browser will give you pixel-perfect readouts of the position of elements on the page.

Wednesday, 2 February 2011

Joomla! 1.6 released

Joomla! 1.6 is finally out for real. The installer script I wrote won't actually work in that version because the table names have been changed. I'll fix it as soon as possible. Also, I'll need to iron out any failures of the MVD-GUI in the new system, though I don't expect anything major.

Sunday, 23 January 2011

Uninstalling multiple extensions in Joomla!

Here's the uninstaller code. It depends on the extensions.txt file written by the installer script.

/**
 * Uninstall a set of previously installed extensions
 * @package Packager
 */
// No direct access
defined( '_JEXEC' ) or die( 'Restricted access' );
define( 'PACKAGER','Packager');
$exts = array();
$types = array();
// 1. read in the extensions.txt file
$extensions_file = $this->parent->_paths[
  'extension_administrator'].DS."extensions.txt";
$file_len = filesize( $extensions_file );
$handle = fopen( $extensions_file, "r" );
if ( $handle )
{
  $contents = fread( $handle, $file_len );
  if ( $contents )
  {
    $lines = split( "\n", $contents );
    foreach ( $lines as $line )
    {
      $extension = split( "\t", $line );
      if ( count($extension)==2 )
      {
        $types[] = $extension[0];
        $exts[] = $extension[1];
      }
    }
  }
  fclose( $handle );
  // 2. for each extension, retrieve its ID from the database
  $db = JFactory::getDBO();
  $prefix = $db->getPrefix();
  $installer = new JInstaller();
  for ( $i=0;$i<count($types);$i++ )
  {
    $table = $prefix.$types[$i]."s";
    $name = ($types[$i]=="module")?"title":"name";
    $query = "select id from $table where $name='".$exts[$i]."';";
    $db->setQuery( $query );
    $result = $db->loadObjectList();
    if ( count($result) == 1 )
    {
      // 3. uninstall it
      $installer->uninstall( $types[$i], $result[0]->id, 0 );
    }
  }
}
else
  error_log(PACKAGER.": failed to find extensions.txt");

Friday, 21 January 2011

Installing multiple Joomla! extensions in one go

The problem

Joomla! has a fairly good component architecture but what it lacks is a way to break up a custom component into smaller modules, plugins and other components. This is essential for the development of complex components. An example is my own mvd-gui, which has multiple views of the same data. I realise that I can supply multiple controllers, models and view files and coordinate them all from a master controller. That's what I did before, but it gets unmanageable after more than a few views. I had eight with more planned.

However, breaking up a Joomla! application into many small components requires the end-user to manually install multiple components, modules and plugins. Since scripting of the installation isn't possible, how can many separately developed extensions be unified into a single package?

The Solution

I have seen two other approaches to this problem. One is a relatively low level customisation of the JInstaller code used in the Joomdle package. Another is this post on "Jeff Channel" which is higher level if also incomplete. It's very simple and I used it as the basis for my own approach. But whereas Jeff adds the components to the manifest file, so they get copied to the installation directory of one of the components to be installed, I decided not to copy any of them and instead created an empty Packager component that would just take care of the installation. The user drops any number of components, plugins and modules into an "extensions" folder within a generic Packager component, zips it up, and installs it. The install script then copies them directly from the tmp directory during installation.


Likewise, when the user wants to uninstall the package, removal of the Packager component will trigger removal of all sub-extensions automatically. To do this the installer script saves a list of the extensions it originally installed in "extensions.txt" in the administrator directory. It will also need a small uninstall script for each sub-extension to warn the user against (but not prevent) uninstalling it separately. The master Packager uninstall script can always skip any extensions it can't find, so this doesn't have to generate an error.

So far I've got the install part working well, but not uninstall yet. Here is the manifest file of the Packager component. The current state of this package is on Googlecode.

<install type="component" version="1.5.0">
 <name>Packager</name>
 <author>Desmond Schmidt</author>
 <copyright>Desmond Schmidt 2011</copyright>
 <license>GPL v3</license>
 <description>Packager to install/uninstall multiple 
 components/modules/plugins in one go</description>
 <installfile>install.packager.php</installfile>
 <files folder="admin">
  <filename>install.packager.php</filename>
  <!-- plus the other, more or less empty, admin files -->
 </files>
</install>

Simple eh? Other than the "install.packager.php" file the rest are more or less empty files. Here's that installer script:

/**
 * All-in-one installer/uninstaller component. To install 
 * a set of components, modules and plugins just drop them 
 * into the "extensions" directory within this component. 
 * Then zip up the component and install it via the Joomla 
 * interface. Similarly when uninstalling the packager
 * all previously installed components and modules etc will
 * be uninstalled (by the uninstaller script).
 * Copyright 2011 Desmond Schmidt
 * License: GPLv3.
 * @package packager
 */
// No direct access
defined( '_JEXEC' ) or die( 'Restricted access' );
// rename this to something you like
define( 'PACKAGER','Packager');
$installer = new JInstaller();
$installer->_overwrite = true;
$config =& JFactory::getConfig();
$tmp_dir = $config->getValue('config.tmp_path');
$dir_handle = opendir( $tmp_dir );
$jroot = JURI::root( true );
// look for the installation directory of Packager in tmp
if ( $dir_handle )
{
  $found_dirs = array();
  while ( $file = readdir($dir_handle) )
  {
    if ( strncmp("install_",$file,8)==0 )
    {
      if ( file_exists($tmp_dir.DS.$file.DS.PACKAGER) )
        $found_dirs[] = $tmp_dir.DS.$file;
    }
  }
  if ( count($found_dirs) > 0 )
  {
    $best_dir = $found_dirs[0];
    $best_ctime = filectime( $found_dirs[0] );
    for ($i=1;$i<count($found_dirs);$i++ )
    {
      if ( filectime($found_dirs[$i])>$best_ctime )
      {
        $best_dir = $found_dirs[$i];
        $best_ctime = filectime( $found_dirs[$i] );
      }
    }
    // so $best_dir is our best candidate directory
    $extensions_dir = $best_dir.DS.PACKAGER.DS."extensions";
    if ( file_exists($extensions_dir) )
    {
      // save record of installed extensions
      $exts = array();
      $types = array();
      // look for and install all extensions
      $zip_handle = opendir($extensions_dir);
      if ( $zip_handle )
      {
        while ( $zip = readdir($zip_handle) )
        {
          // ends in ".zip"?
          if ( strrpos($zip,".zip")==strlen($zip)-4 )
          {
            $zip_file = $extensions_dir.DS.$zip;
            $package = JInstallerHelper::unpack( $zip_file );
            $msgtext = "";
            $msgcolor = "";
            $pkgname = substr( $zip, 0, strlen($zip)-4 );
            $image = $jroot."/administrator/images/tick.png";
            if ( $installer->install( $package['dir'] ) )
            {
              $msgcolor = "#E0FFE0";
              $msgtext  = "$pkgname successfully installed.";
              if ( count($installer->_adapters)>0 )
              {
                $type = $package['type'];
                $exts[] = $installer->_adapters[$type]->name;
                $types[] = $type;
              }
            }
            else
            {
              $msgcolor = "#FFD0D0";
              $msgtext  = "ERROR: Could not install the ".
                $pkgname.". Please install manually.";
              $image = $jroot."/administrator/images/publish_x.png";
            }
            echo "<table bgcolor=\"$msgcolor\" width =\"100%\">";
            echo "<tr style=\"height:30px\">";
            echo "<td width=\"50px\"><img src=\"$image\" height="
              ."\"20px\" width=\"20px\"></td>";
            echo "<td><font size=\"2\"><b>$msgtext</b></font></td>";
            echo "</tr></table>";
            JInstallerHelper::cleanupInstall(
               $package['packagefile'], $package['extractdir'] );
          }
        }
        closedir( $zip_handle );
      }
      // save record of installed extensions
      if ( count($exts)> 0 )
      {
        $handle = fopen( $this->parent->_paths[
          'extension_administrator'].DS."extensions.txt", "w" );
        if ( $handle )
        {
          for ( $i=0;$i<count($exts)&&$i<count($types);$i++ )
            fwrite( $handle, $types[$i]."\t".$exts[$i]."\n" );
          fclose( $handle );
        }
      }
    }
    else
      error_log(PACKAGER.": missing extensions directory!");
  }
  else
    error_log(PACKAGER.": couldn't find a suitable install directory!");
  closedir( $dir_handle );
}
else
  error_log(PACKAGER.": couldn't open $tmp_dir!");

You also need to create a folder called "extensions" within the Packager component where you can put sub-components and plugins etc. You will also probably want to rename the "Packager" component as something else. I'll post the uninstall script and any modifications to the installer script in my next post.

Friday, 14 January 2011

mvd-core component complete

It's not fully tested yet, but the backend component that governs how the mvd-gui works as a whole is finished. Here's a picture of it:

I've come to the conclusion that the only way to write a successful Joomla GUI is to break it up into small components and modules. Otherwise it rapidly gets too complex to handle. This way I can add new views at will, or take them away and it will all still work. Each view hangs off the mvd-core component, which only has this simple backend, and provides access to nmerge. And that's it. With components this small fixing bugs should be a breeze.

Monday, 10 January 2011

Writing an admin backend for a Joomla! component

I wanted to add an admin back-end to my mvd-gui component. There are few instructions about how to do this and all of them are messy. Basically you are supposed to define a model-view-controller interface as in the site part of the component. But in the Joomla application they have shortcuts that only require a few lines of code, so I wanted to do it that way. Also it would look more consistent and reformat properly when the user changed the admin template.

So I took the massemail component's admin interface, which was quite simple, and deconstructed it. Customising the icons that appear in the toolbar was easy: just modify the calls to JToolBarHelper in toolbar.massemail.html.php. (You should rename this file to something else.) I then got stuck on changing the icon that is displayed in the toolbar. Basically you call JToolBarHelper::title with two arguments: the first is the text you want displayed. The second argument is supposed to be an image, but in reality it has to be one of the images in the images directory of the admin template. Since I can't add to that without making my changes non-portable I decided just to use the generic icon. It looks OK:

Or, in code-form:

class TOOLBAR_mvdcore
{
 /**
  * Draws the menu for a New Contact
  */
 function _DEFAULT() {
  JToolBarHelper::title( 'mvd-core', 'generic.png' );
  JToolBarHelper::help( 'screen.users.massmail' );
 }
}

Next step is to supply a help file. That's the last line in the code above. Then I still have to provide code for the save button and reformat the HTML form so that it displays my stuff. I'll save that for tomorrow's post.

Sunday, 2 January 2011

Debugging php in Netbeans on Ubuntu

I used to be an avid fan of Eclipse. But setting it up to debug PHP drove me to try Netbeans yet again. And no, it is not much easier in that IDE either. What they don't seem to understand is that all the programmer really wants to do is install the package and have it just work. So here are the problems I had getting the debugger to work, and how I overcame them.

  1. First, you have to install the xdebug package and apache2 with php etc. This is straightforward using Synaptic or apt-get etc.
  2. Next you can just run a php application you create in Netbeans and it will tell you to add certain lines to your php.ini file. Cool. Do that.
  3. Third, if it still does not work, and inexplicably gives the same error, the reason is probably that you haven't specified the script to debug. Right click on the project and check that the "index file" in the "run configuration" section is set to your root script.
  4. Fourth, if you created a source folder in your project, remember to set the "source folder" in the project properties under "Sources". And no, of course you can't edit it directly in the GUI, Silly, but you can in the properties file in the nbproject folder (look under the "files" tab on the left of the IDE). Now that's what I call a useful feature.
  5. If it debugs but the line-numbers are wrong, remember that the executing script is the one copied to the server, while the files visible in your debugger are the ones in the project. To keep them in sync click on "Sources" in the project properties and select "Copy sources from Sources folder to another location", and remember to specify the folder on your web server. Of course you'll have to set the owner of that folder to your user name, not www-data (if you're using Apache). But you knew that.
  6. So it all works but you don't see any local variables, just global ones and the current object? The problem is with xdebug. You need to upgrade it manually. There are some instructions here.

So now it all should work. Thank heaven for Linux or geeks would be extinct.