Tuesday, 25 December 2012

Last piece in the import puzzle

People wanting to import XML into the HRIT system probably have XSLT scripts that transform the files into some other form and then format them as HTML. Perhaps the two steps are not even separated. If they try to use the current HRIT import system, TEI constructs like the following (taken from the TEI-Lite manual) will fail:

<list>
 <head>A short list</head>
 <label>1</label>
 <item>First item in list.</item>
 <label>2</label>
 <item>Second item in list.</item>
 <label>3</label>
 <item>Third item in list.</item>
</list>

The reason is that the <head> element is inside the <list> element. If we translate the elements of XML one-for-one into elements of HTML we will have to delete the <head> element, because none of the <h1>, <h2>, <h3> elements in HTML can appear inside <ul> or <ol>. What we really want to do is move it outside <list> and give it an attribute like type="list". But that's manipulation of the XML DOM, which neither stripper nor formatter (my two import tools) currently performs.

Brainwave

Whatever new facilities I add to stripper to allow such transforms, you can bet that someone will need something in XSLT that isn't supported. My stripper program would just keep on getting more and more complex, and I would waste more and more time. So I had the simple idea not to modify stripper or formatter at all, but just to add a further step to the import process. All we have to do is allow XSLT transforms to be applied to the imported XML files as the first step of importation, using a tool like Xalan. By default a TEI-Lite stylesheet could perform the necessary transforms to turn the input into sane XML that converts easily into standoff properties. Not only is this a trivial change to implement, it is also extremely powerful. Although existing stylesheets may have to be modified for this to work, nobody can any longer claim that the HRIT system loses functionality compared with existing XML-based digital editions. A neat result, indeed.

It seems that the only standalone XSLT processor that still works on Mac OS X and Linux is libxslt. XML may not be dead yet, but its tools at least are dying. A sign of the times?

D'Oh!

I forgot that Java has an XSLT processor built in (the javax.xml.transform API), so after getting libxslt to work via JNI I had to scrap that and redo the step more simply. Which just goes to show that having a coffee and a walk in the garden before you code something is often time well spent, even though it looks and feels like you're loafing.
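
For what it's worth, here is a minimal sketch of what such a pre-processing step might look like with the XSLT 1.0 processor that ships with Java. It is not the actual HRIT import code. The class name ListHeadLifter, the embedded stylesheet and the <div> wrapper in the sample input are all made up for illustration, and it assumes un-namespaced TEI-Lite elements like those in the list example above. The stylesheet copies everything unchanged, except that a <head> inside a <list> is moved in front of the list and given type="list".

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical pre-import step: lift <head> out of <list> using XSLT. */
public class ListHeadLifter {

    // Illustrative XSLT 1.0 stylesheet (an assumption, not the HRIT default).
    private static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "  <xsl:output method='xml' omit-xml-declaration='yes'/>"
      // Identity template: copy every node and attribute as-is.
      + "  <xsl:template match='@*|node()'>"
      + "    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "  </xsl:template>"
      // A list with a head: emit the head first, then the list without it.
      + "  <xsl:template match='list[head]'>"
      + "    <xsl:for-each select='head'>"
      + "      <xsl:copy>"
      + "        <xsl:apply-templates select='@*'/>"
      + "        <xsl:attribute name='type'>list</xsl:attribute>"
      + "        <xsl:apply-templates select='node()'/>"
      + "      </xsl:copy>"
      + "    </xsl:for-each>"
      + "    <xsl:copy>"
      + "      <xsl:apply-templates select='@*|node()[not(self::head)]'/>"
      + "    </xsl:copy>"
      + "  </xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to an XML string and returns the result. */
    public static String transform(String xml) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer = factory.newTransformer(
                new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrap the checked exception so callers need no throws clause.
            throw new IllegalStateException("XSLT transform failed", e);
        }
    }

    public static void main(String[] args) {
        String tei = "<div><list><head>A short list</head>"
                   + "<label>1</label><item>First item in list.</item>"
                   + "</list></div>";
        // Prints the XML with <head type="list"> placed before <list>.
        System.out.println(transform(tei));
    }
}
```

In a real import the stylesheet would be read from a file chosen by the user rather than embedded in the code, and the output would be handed on to stripper. Note also that TEI P5 files put their elements in the TEI namespace, so the match patterns would need a namespace prefix to work there.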
