Saturday 16 April 2011

HRIT standoff format

The HRIT standoff XML format is designed to replace embedded markup by using standoff properties. It is not intended to be definitive, and the same information could easily be recorded using another format. All that matters is that it suffices. It supports the following features:

  1. The text is in a separate file, which may be referred to by multiple HRIT markup files.
  2. Several markup files can be combined to enrich the text simultaneously.
  3. Properties may overlap.
  4. Both HRIT markup files and the underlying text files can be freely edited.

The format doesn't perform these functions itself; it just allows them to happen. So it is actually pretty simple.

Header

The header provides some essential structure and information:

<?xml version="1.0" encoding="UTF-8"?>
<hrit-markup style="TEI">

A CorCode is of course closely bound to a CorTex version. In the HRIT system each cortex and corcode is uniquely identified by a URN in the FRBR mould as a sequence of hierarchical names, e.g. "/corcode/english/shakespeare/kinglear/act1/scene1/folios/F1". A CorCode URN is the same as the corresponding CorTex URN except that the first component is "/corcode" instead of "/cortex". So there is no need to record that relationship explicitly in the markup.
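
The URN correspondence is mechanical enough to sketch in a few lines of Python. This is only an illustration of the naming rule described above; the function name corcode_urn_for is made up for the example and is not part of any HRIT tool.

```python
def corcode_urn_for(cortex_urn: str) -> str:
    """Derive a CorCode URN from a CorTex URN by swapping the first component.

    Hypothetical helper: it only illustrates the naming rule in the text.
    """
    prefix = "/cortex/"
    if not cortex_urn.startswith(prefix):
        raise ValueError(f"not a CorTex URN: {cortex_urn!r}")
    # Keep every component after "/cortex" unchanged.
    return "/corcode/" + cortex_urn[len(prefix):]


if __name__ == "__main__":
    print(corcode_urn_for(
        "/cortex/english/shakespeare/kinglear/act1/scene1/folios/F1"))
```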

The "style" attribute, on the other hand, is a reference to the URN identifying a CSS format file. This is used to render a CorCode into an external HTML form for display. It's not a complete name because the same corcode/cortex might be rendered in multiple ways, depending on the application and the user's preferences. Each CSS file points to just one CorCode, and hence the unique style name here. A full format identifier would look more like "/corform/TEI/hritsingle/luc-style", which would mean 'the core format for "TEI" corcodes in the hrit-single application in the LUC style'. The CorForm concept is new and we are still working out the details, but it seems clear that this kind of information has to be stored in the "cloud" with the cortexes and corcodes.

Ranges

The body of a HRIT markup file is composed of an ordered series of ranges. Each range points to a span in the base text and is specified by its relative offset from the previous range and its length. This facilitates editing, since changing one range only invalidates the immediately following one. Each range also has a name, because it represents a property. Finally, it may contain annotations, which are mostly leftover attributes from XML, although they can also be used to store invisible programming information such as links. But these will need to be backed up by direct functionality in a GUI; they are not intended to be edited by the user.

<range name="pb" reloff="0" len="0">
<annotation name="ed" value="F3"/>
<annotation name="n" value="765"/>
</range>
<range name="sp" reloff="271" len="20"/>
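
To show how a reader of the format might turn relative offsets back into absolute ones, here is a rough Python sketch using the standard xml.etree.ElementTree module. It assumes that reloff is measured from the start of the previous range (and from 0 for the first range); the text above only says "relative offset from the previous range", so treat that as an assumption. The function name absolute_ranges is invented for the example.

```python
import xml.etree.ElementTree as ET

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<hrit-markup style="TEI">
<range name="pb" reloff="0" len="0">
<annotation name="ed" value="F3"/>
<annotation name="n" value="765"/>
</range>
<range name="sp" reloff="271" len="20"/>
</hrit-markup>
"""


def absolute_ranges(xml_bytes):
    """Return (name, absolute_start, length, annotations) for each range.

    Assumption: reloff counts from the START of the previous range.
    """
    root = ET.fromstring(xml_bytes)
    result = []
    prev_start = 0
    for elem in root.findall("range"):
        start = prev_start + int(elem.get("reloff", "0"))
        length = int(elem.get("len", "0"))
        annotations = {a.get("name"): a.get("value")
                       for a in elem.findall("annotation")}
        result.append((elem.get("name"), start, length, annotations))
        prev_start = start
    return result


if __name__ == "__main__":
    for item in absolute_ranges(SAMPLE):
        print(item)
```

Under that assumption, moving one range changes only its own reloff and that of the range after it, which is the editing property mentioned above.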

Removed ranges

If a range is labelled as removed="true", it can't be formatted and cannot specify a range in the base text. An example is the teiHeader element in TEI. Since this is metadata its content has a fundamentally different status to the text of the body. So the default recipe of stripper specifies that this element and all its children are removed. A removed element may have its own private content that is stored as part of the standoff markup. This 'content' can belong to any point in the file, but usually to the start, so it can have a reloff attribute, though not a len:

<range name="title" removed="true" reloff="0">
<content>Mr. William Shakespeares Comedies, Histories, &amp; 
Tragedies. Published according to the True Originall Copies.
</content>
</range>

This may sound like a hack, but the information has to be stored somehow if we later want to reverse the conversion from HRIT back to TEI. Of course files originating in the HRIT system itself won't have any removed elements.
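
A program reading the markup would presumably want to set removed ranges aside before formatting, while keeping their private content for a later round trip. The following Python sketch shows one way to do that with xml.etree.ElementTree; the function name split_removed and the shortened sample are invented for illustration and are not how stripper or any HRIT formatter actually works.

```python
import xml.etree.ElementTree as ET

SAMPLE = b"""<hrit-markup style="TEI">
<range name="title" removed="true" reloff="0">
<content>Mr. William Shakespeares Comedies, Histories, &amp; Tragedies.</content>
</range>
<range name="sp" reloff="271" len="20"/>
</hrit-markup>
"""


def split_removed(xml_bytes):
    """Separate removed ranges (with their stored content) from the rest.

    Returns (removed, formattable): removed is a list of (name, content)
    pairs, formattable is a list of range names that can be formatted.
    """
    root = ET.fromstring(xml_bytes)
    removed, formattable = [], []
    for elem in root.findall("range"):
        if elem.get("removed") == "true":
            content = (elem.findtext("content") or "").strip()
            removed.append((elem.get("name"), content))
        else:
            formattable.append(elem.get("name"))
    return removed, formattable


if __name__ == "__main__":
    print(split_removed(SAMPLE))
```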

Nesting of ranges

If a HRIT markup file has been imported from XML then its ranges will effectively be nested. But this information is not used during formatting, since ranges can freely overlap.
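
For the curious, the difference between nested and genuinely overlapping ranges can be expressed as a small test on absolute offsets. This Python sketch is only an illustration; the function name and the (start, length) tuple representation are assumptions made for the example.

```python
def overlaps_without_nesting(a, b):
    """True if two ranges partially overlap, with neither containing the other.

    a and b are (start, length) pairs using absolute start offsets.
    """
    a_start, a_end = a[0], a[0] + a[1]
    b_start, b_end = b[0], b[0] + b[1]
    # Order the pair so that a starts first.
    if a_start > b_start:
        a_start, a_end, b_start, b_end = b_start, b_end, a_start, a_end
    # b begins inside a and finishes after a ends.
    return a_start < b_start < a_end < b_end


if __name__ == "__main__":
    print(overlaps_without_nesting((0, 10), (5, 10)))  # partial overlap
    print(overlaps_without_nesting((0, 10), (2, 3)))   # nested
```

Ranges imported from XML would never pass this test against each other, whereas ranges created in a standoff system may.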

Tail

This is just:

</hrit-markup>

This format is still subject to development and may be changed in future.