Digital Variants: May 2011

Sunday 29 May 2011

Extensions to CorCode

I've been thinking about the exact role of CorCode and how we render it on screen as HTML. There are some problems with the simple CorTex, CorCode, CorPix model that need figuring out. Hopefully they're just details.

TEI and XSLT

For TEI embedded markup people write XSLT stylesheets to transform the text, typically into HTML. Since a particular encoding of a text is embedded in it, and is highly customised, the XSLT stylesheet that transforms it cannot be fully portable to other people's texts:

they may want to render the same data differently, or
they may have custom tags and attributes that need rendering according to their local GUI requirements

The reason that TEI tries to mandate particular names for tags and attributes is so that standard stylesheets can be used. In practice this does not work very well because just as the encoding is customised the stylesheets also need customisation as required by these two points. On the other hand, if we no longer mandate any standard encoding and just allow the user to specify fully the names of properties (aka tags) then it is up to the encoder to specify what is to be done with them.

How to do this in HRIT?

One objective of designing HRIT is to make things easy and to make them work smoothly. So the user should download Cortex and CorCode and they should automatically merge locally and produce output that looks good in their local GUI tool. But how?

Both functions of the transformation - the one into a visual appearance and its interpretation by the application are specific to the encoding itself. So I am thinking that the transforming css file that performs both duties in HRIT needs to be closely associated with the CorCode. It shouldn't need to be transformed locally for a particular application because how can that transformation do anything without knowledge of the format? Imagine I write a css file to render King Lear, and use it within single view. Cool. But what happens if I download Little Dorrit by Charles Dickens in the same application, and my css file no longer works? So the css file can't belong to the application, or at least most of it can't. It has to belong to the CorCode. This is more or less what TEI already does: an XSLT stylesheet is designed for a particular collection of customised text encodings and is useless outside of that domain.

Ergo: Corform

So maybe we need a fourth type of basic data: corform that is as tightly bound to corcode as corcode is itself bound to cortex. That sounds less evil to me than extending the corcode format to support formatting. The good news is that we wouldn't need as many corforms as corcodes, because they would be version- and even file-independent. But it would be both functionally and visually specific to a specific application. We might assign each corcode a "style" and then specify corforms like render that style. We'd want many corforms for each style but not the other way around. So a suitable rest resource url might look like: http://luc.edu/hrit/corform/style-name/app-name/render-name. "app-name" could be "default", meaning any application, or it could be "hritsingle" meaning the single view application; a suitable style-name might be "TEI"; "render-name" might be "freds-format". An individual corform resource would be in css.

So the way it would work in practice the user would specify a particular corform resource manually, and it would be saved until changed. But if none was specified, the application name combined with the corcode's style name should identify a default rendering in every case.

Writing a basic php5 extension

I wanted to put formatter into a php extension because the CorCodeDemo currently has to write the CorCode and CorTex returned by the server to temporary files because formatter is currently a C commandline tool, and the CorTex and CorCode responses are too long to be commandline arguments. Also making formatter directly accessible from within php will effectively make it a replacement for XSLT, which is my intention.

The trouble with "hello world" as a php extension

I followed the instructions at http://devzone.zend.com/article/1021 but there were several problems. Documenting them might help others.

The first gotcha I discovered was that you're supposed to build php from the latest sources and install that on your system. This is actually quite dangerous on a Linux system because doing so invalidates the version installed by the package manager. The default deb package on Ubuntu puts files here, there and everywhere, and making the source code build put everything in exactly the same places is rather difficult and would destroy the package manager information about php. On the other hand, putting it elsewhere gives you two versions and the actual one invoked depends on the order directories are searched. The main problem this leads to is that building an extension in any version other than the one it is run in will prevent the extension from loading. My original php version was 5.3.2, which is what my server is still using. The latest version is currently 5.3.6, which became my commandline version. I didn't want to delete the old version because I had spent some effort installing and configuring it with xdebug, mysql and xslt.

The second gotcha was that the commandline php.ini file is not the same as the one used by the web server. On my system the apache2 one is in /etc/php5/apache2 and the commandline one is in /etc/php5/cli. You really must check the following:

that the extension_dir used during execution is the one you put your extension in
that the php.ini file that loads your extension is the one actually being used by php on the commandline (or server)

You can check the php.ini file by invoking on the commandline:

php -i | grep php.ini

And the extension_dir similarly:

php -i | grep extension_dir

Like me you'll probably be surprised by what it says. But match them up, and match the compile and the execution versions and it will work. Remember though, that the server will use a different php.ini file, and a different extension_dir. The best way around this dilemma is to download the same version of the source as is currently installed on your system, and build with that.

Wednesday 25 May 2011

HritSample Module

I've managed to create a CMS-independent Hello World! Hrit module that runs in Joomla!

So what, huh? Well, this will make it easy for anyone to write a Hrit module in future, which they can load into any popular CMS, starting with Joomla! The cross-CMS compatibility is provided by a single php file, that is the first point of entry in the extension by the Joomla! system. Also needed is a Joomla!-specific xml file for describing the extension. Thereafter the php entry point delegates everything to the 'module', which is a folder containing a real controller and a real view class that doesn't refer to Joomla! in any way, shape or form.

A HRIT Module wrapper tool

Eventually we will need a wrapper program to take such a module and prepare it for the various CMSes automatically. For that purpose an XML file would have to be added to each module to describe its contents, which the wrapper would read and turn into a CMS-specific form.

Hritcore module

To make writing Hrit modules really easy all of the Hrit-specific functionality is contained in another CMS-independent module called hritcore, which is added to the php include-path and from which not only superclasses for Hrit but also the mvdguilib.js Javascript functions, which provide cross-browser functionality for interactivity, can be called.

One of these support files is a surrogate or imitation HritRest service, which just accesses a local database for experimental purposes, rather than a remote service over the web. There is only one public method: get, which takes a REST-style URL as its only argument, and spews back whatever the caller requested. So if users request:

http://localhost/cortex/english/shakespeare/kinglear/act1/scene1/F1

they will get the F1 version of act 1, scene 1 of Shakespeare's King Lear, etc. And if they replace the first bit of the url by 'corcode' they will get the corcode for that version, and similarly for corpix.

Where this still needs to go

Later, when this kind of url is possible on the real REST service the HritRest.php file can be done away with. Shouldn't be long before the two example views: SingleView and CompareView are ready to demonstrate.

Next step, though, is to incorporate formatter (and why not also splitter and stripper?) into the php service and do away with XSLT forever. Oh, what a joy!

Saturday 21 May 2011

Formatter inside PHP

The formatter is a program to replace XSLT, but it uses our standoff markup solution called CorTex/CorCode. At the moment the CorCodeDemo calls formatter on the command line as an external script. But it struck me that because it replaces XSLT it ought to be called the same way that XSTL is - within the PHP daemon as a PHP extension. That way we won't get the hit of inefficiency by running it as a commandline tool. Instead it will just be available, once installed, for any HRIT-modules we design. I don't yet know how to do it, but since everything is open-source it should be easy.

Another refinement is the reuse of MvdCore, a Joomla! component I wrote to facilitate writing a robust set of Joomla! plugins for the MVD-GUI. But since it doesn't refer to Joomla!, except in order to exist as a module, I thought I would use it as a basis for every HRIT module, as a kind of framework library. That way, writing new modules should be dead easy and they will automatically work in every browser. Now that's cool.

Wednesday 18 May 2011

Brave New World

Now we've thrashed out a pretty good picture of how the HRIT system will look, I decided to make the MVD-GUI in its image. That is, to use the HRIT REST service, the HRIT API, and write a HRIT compatibility layer for all my modules to run in. The BIG advantage of this is that, provided a compatibility layer is available for a particular CMS, the same module will run unmodified in that CMS (Content Management System). The only catch is that I'd have to find somewhere to host the HRIT Java REST service (cloud, left). So here's my current design for the compatibility layer:

In the Model-View-Controller design the Model handles all interactions with the database. Give me this, here is that, etc. So an implementation of the HRIT API fits that design pretty well (the pink box). It wouldn't have to access any databases directly, because it would get all the data it needs from the REST Service. So it could be the same on all CMSes.

The Controller on the other hand (orange boxes) deals with all the user-clicks in the view and decides what to do with them. It interposes between the view and the model to decide when to call the model and what to return to the view. It's the 'business logic' of the application. An individual CMS can't and does not in practice mandate an MVC design, but it does define conventions for calling modules. So in Drupal and Joomla!, HRIT modules will be defined with a library of functions that will act as the master controller for the module. Whereas the module itself will have its own private controller that will call the master one. The private controller will include the master controller with a require statement and assume the existence of certain functions or static methods inside that controller, say HRITController::doSomething().

The View (green box) can do what it likes, so long as it only uses PHP functionality, and doesn't call any CMS-specific routines or rely on any CMS-specific data. If it needs something like that it will have to get it from the compatibility layer via its private controller.

About this blog

This blog is a technical record of my attempts to create a first class website for ecdosis.net. This will be a revision of www.digitalvariants.org and is intended to incorporate genetic texts in the MVD (Multi-Version Document) format. It will be the first website to allow the user to view and edit original texts with all their raw corrections, revisions, and variant versions as they were truly meant to be: as multi-version texts. A lot of people have talked about the theoretical possibility of doing this but the tools they choose are not up to the task. In fact the history of Digital Humanities is all about shoehorning humanistic problems into off-the-shelf technical solutions that don't fit. This project, on the other hand, is about breaking free from the limitations of mere markup and database structures to represent the true nature of originally analog documents.