Saturday, 6 August 2011

A new kind of standoff markup

When markup was first invented it was always embedded in the text. The earliest forms were like formatting instructions to tell us how the text should look when it was printed:

.indent 5
blah blah blah
.indent 0

This meant that the text should be indented five spaces, say, for a bit of quoted text. This might be a problem if you later wanted all quotes indented 10 spaces. Then in the 1980s some folks had the bright idea of embedding only abstract information about textual structure rather than specific formats:

.q
blah blah blah
.endq

Then when it was printed a separate program would read a stylesheet and convert the .q and .endq instructions into their correct indents. This was called 'generalised markup'. For a while people thought this was wonderful. It was used in the early 1990s to make HTML, the Hypertext Markup Language, in which the first web pages were written. Every web page you read today is still encoded this way - well, nearly. But the problems persisted. The tags were all embedded in the text, which made it difficult to read. As more and more information was added there was eventually so much markup that it was almost impossible to read the text without first formatting it, which was a problem if you wanted to edit it. Then, in the mid-1990s some people invented standoff markup.

Standoff markup

Standoff markup removes the embedded tags and stores them separately from the text. Each of the externally stored tags is associated with:

  • an offset into the text where it could be reinserted
  • the distance between pairs of start and end-tags as measured in the underlying text
  • any attributes that belonged to the original start-tag
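
To make this concrete, here is a minimal Python sketch of what an externally stored tag might look like and how it could be reinserted into the text. The names StandoffTag and reinsert are my own, invented for illustration, and the sketch assumes the tags are properly nested, non-empty and listed in their original document order. It also does no escaping of attribute values.

```python
from dataclasses import dataclass, field


@dataclass
class StandoffTag:
    name: str      # tag name, e.g. "speaker"
    offset: int    # where the start-tag would be reinserted in the text
    length: int    # distance to the matching end-tag, measured in the text
    attrs: dict = field(default_factory=dict)  # attributes of the start-tag


def reinsert(text, tags):
    """Re-embed properly nested standoff tags into the plain text."""
    opens = {}   # offset -> start-tags, outermost first
    closes = {}  # offset -> end-tags, innermost first
    for tag in tags:  # assumed to be in original document order
        attrs = "".join(f' {k}="{v}"' for k, v in tag.attrs.items())
        opens.setdefault(tag.offset, []).append(f"<{tag.name}{attrs}>")
        closes.setdefault(tag.offset + tag.length, []).insert(0, f"</{tag.name}>")
    out = []
    for pos in range(len(text) + 1):
        out.extend(closes.get(pos, []))  # close finished elements first
        out.extend(opens.get(pos, []))   # then open the new ones
        if pos < len(text):
            out.append(text[pos])
    return "".join(out)


text = "Bast. Nothing my Lord."
tags = [StandoffTag("sp", 0, 22), StandoffTag("speaker", 0, 5)]
print(reinsert(text, tags))
```

The plain text stays readable on its own, and the embedded form can be regenerated from the two halves whenever it is needed.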

Standoff markup doesn't change the markup structure; it just separates it from the text and makes it possible to combine different markup sets with the same underlying text. For the reader it also removes the confusion arising from the embedding of complex tags. But there were also drawbacks:

  • If the text was altered then all the offsets in the standoff markup file would have to be updated.
  • It was still not possible for the textual properties described by the tags to overlap.
  • Markup sets could be exchanged but not combined. You couldn't, for example, add both metrical and formatting structure to the same document, or merge markup sets written by different people.
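
The first drawback can be illustrated with a small Python sketch of what has to happen to every stored tag when characters are inserted into the text. The function name shift_for_insertion is hypothetical, and the treatment of insertions that fall exactly on a tag boundary is an assumption on my part: here an insertion at the very start of a tag pushes the tag along, and an insertion at its very end leaves it alone.

```python
def shift_for_insertion(tags, pos, count):
    """Adjust (name, offset, length) tuples for `count` characters inserted at `pos`."""
    adjusted = []
    for name, offset, length in tags:
        if pos <= offset:
            offset += count   # tag starts at or after the insertion point
        elif pos < offset + length:
            length += count   # insertion falls inside the tag's range
        # otherwise the tag ends at or before `pos` and is unaffected
        adjusted.append((name, offset, length))
    return adjusted


tags = [("sp", 0, 22), ("speaker", 0, 5)]
# Insert three characters at offset 6, just after the speaker name.
print(shift_for_insertion(tags, 6, 3))
```

Every edit to the text means a pass like this over the whole markup file, which is why conventional standoff markup suits texts that no longer change.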

Standoff markup is suited to the corpus linguistics applications it was originally developed for, where various natural language tagging tools produced differently marked-up versions of the same underlying text. But in the humanities this method has proved less popular because of the need to edit the text, at least in the initial stages of preparing an electronic edition - which is what I am interested in.

Standoff properties

To distinguish our solution from conventional standoff markup we call it 'standoff properties'. It can also be used, as in our software, to allow the editing of either the text or the markup while keeping the other half automatically in sync.

Standoff properties extends standoff markup in two significant ways:

  1. By allowing overlap between properties
  2. By allowing sets of markup properties to be freely combined

Overlap

Software engineers use different kinds of data structures for various tasks, such as graphs, arrays, hash tables etc. Humanists and corpus linguists have unnecessarily limited themselves to embedded markup languages and thereby limited representations of their data to tree structures. Moving to a data structure that resides outside of the text liberates them from that restriction.

Mixing markup sets

A set of markup tags stripped from a standard XML file is always properly nested, because it is sorted by the order in which the tags were encountered in the original XML file. This order can be used to reconstruct the document tree in cases where several tags start at the same point in the underlying text. For example:

<book name="my book"><chapter n="1"><text>...

In standoff form these properties/tags might be represented as:

"book" start: 0 length: 12345
"chapter" start: 0 length: 234
"text" start: 0 length: 234

Here the three tags 'book', 'chapter' and 'text' all start at the same point in the text they enclose. The order can be used to decide that 'text' goes inside 'chapter' and not the other way around, even though they may describe equal ranges in the base text.
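
As a rough illustration of how the order does the work, here is a Python sketch that rebuilds a tree from such a list using a stack. The function build_tree and the dictionary layout of the nodes are assumptions of mine, not part of any existing tool, and the sketch only works if the tags really are properly nested and in document order.

```python
def build_tree(tags):
    """tags: list of (name, start, length) in original document order."""
    root = {"name": None, "end": float("inf"), "children": []}
    stack = [root]
    for name, start, length in tags:
        # Leave every open element that has already ended.
        while len(stack) > 1 and stack[-1]["end"] <= start:
            stack.pop()
        node = {"name": name, "end": start + length, "children": []}
        stack[-1]["children"].append(node)  # nest inside the innermost open element
        stack.append(node)
    return root["children"]


tags = [("book", 0, 12345), ("chapter", 0, 234), ("text", 0, 234)]
tree = build_tree(tags)
print(tree[0]["name"], ">", tree[0]["children"][0]["name"])
```

If 'chapter' and 'text' swap places in the list, the same code puts 'chapter' inside 'text', because their ranges alone cannot settle the question.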

But if we were to allow different sets of markup tags to be combined, what would the order be? Since the two trees, which represent two sets of markup tags, can't be combined, this reliance on the order of the tags will have to be abandoned if we want truly mixable tagsets.

Lifting the restrictions

Digital humanists have been calling for these relaxations in the strictness of markup for many years. But the reality is that the external world of markup has not changed. Web pages are still represented as tree-structured HTML. So if 'standoff properties' are adopted as an alternative representation for markup, how is it possible to convert them into a well-formed and valid HTML file? The conventional solution is to use Extensible Stylesheet Language Transformations (XSLT), but that requires embedded XML as input. So a new technique is needed to perform the conversion. That's what this posting is all about.

Standoff properties to HTML

The reader may well be sceptical that this is possible at all. Anyone who has worked in this field over the past 20 years will know that there have been a large number of attempts, more or less unsuccessful, to allow overlap in marked-up texts. It may also seem impossible to convert sets of arbitrarily overlapping properties with no real structure into rigid tree-formatted HTML. But I am going to prove that it can be done, although to make the proof short enough to read I'm only going to outline it. For the more sceptical there is a program to back it up that has already been extensively tested. You can access it on the Digital Humanities Testbed website and try it out for yourself.

To prove that standoff properties can be converted to HTML we can omit certain confusing details. Firstly we can ignore attributes because a set of attributes is associated with each property or tag on a one-to-one basis. We can also forget about how properties are turned into HTML tags because this is likewise a one-for-one mapping. For example, we can just specify that the property 'stage' should always be converted into the HTML tag 'p'. Finally, since property sets can be freely combined, it suffices to prove that one set of standoff properties can be converted into an HTML document tree.

Deriving nesting information

In order to build a tree from data that has no such structure, some information about which properties may nest inside other properties is obviously required. This information can be derived from two sources. The first is the property-set itself: by scanning the list of properties and their ranges in the text it is possible to compute how often a particular property is entirely 'inside' another property. If the property-set was derived by stripping an XML file that conformed to a schema, this nesting information cannot violate any requirement of that schema. The second source is the nesting rules of HTML, which are described below. In fact XML can perfectly well do without any form of 'schema', or syntax recipe, and rely instead on its 'well-formedness', that is the strict nesting of its paired tags. We will be going one step beyond that by not even mandating well-formedness.

First of all some definitions:

  • An 'element' in XML means a pair of start and end-tags and their intervening content. The XML element '<em>really true</em>' projects the property 'em' over the text-range 'really true'. An element can also contain other elements.
  • A 'property' on the other hand is just a name for a range in the text. Unlike elements, properties don't have any intrinsic ability to nest.

By scanning a file containing a set of property names and their ranges for a particular text – a property-set – a matrix of how often each property was observed to be inside itself and every other property can be derived. A table from a simple TEI-XML file for a play might look like this:

          head  stage  speech  speaker  line  para  italics  text
head         0      0       0        0     0     0        0     0
stage        0      0       0        0     0     0        0     7
speech       0      0       0        0     0     0        0     0
speaker      0      0       0        0     0     0        0     0
line         0      0       0        0     0     0        0     0
para         0      0       0        0     0     0        0     0
italics      0      0       0        0     0     0        0     0
text         0      0       0        0     0     0        0     0

Table 1

Reading from left to right, the numbers mean that 'head', 'stage', 'speech' etc. are found that many times inside the properties named in the various columns. So the property named 'stage' is found inside the property named 'text' a total of seven times.
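
A table of this kind could be computed along the following lines. This is only a Python sketch: the function nesting_counts is hypothetical, it compares every pair of properties (which is slow for big files), and it rests on two assumptions of mine, namely that a property counts as 'inside' every larger property that contains it, not just the nearest one, and that two properties with identical ranges are not counted as inside one another.

```python
from collections import defaultdict


def nesting_counts(props):
    """props: list of (name, start, length). Returns counts[inner][outer]."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, (inner, i_start, i_len) in enumerate(props):
        for j, (outer, o_start, o_len) in enumerate(props):
            if i == j:
                continue
            contained = o_start <= i_start and i_start + i_len <= o_start + o_len
            if contained and i_len < o_len:  # strictly smaller, so not an equal range
                counts[inner][outer] += 1
    return counts


props = [
    ("text", 0, 100),
    ("stage", 10, 5),
    ("stage", 30, 8),
    ("speech", 50, 20),
    ("speaker", 50, 5),
]
counts = nesting_counts(props)
print(counts["stage"]["text"])  # row 'stage', column 'text'
```

The rows of the result correspond to the rows of Table 1 and the inner dictionaries to its columns.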

A similar table can be drawn up for HTML because all the nesting rules can be found in its schema. So, for example, the tag <span> may appear inside <p>, but not vice-versa. So the corresponding fragment of the HTML matrix would look like this:

       span  p
span      1  1
p         0  1

Table 2

In reality the full HTML table has 107 rows and columns and is too big to show here. Binary values are used instead of the frequencies of Table 1 because this is general nesting information derived from a schema, not from actual texts. Reading the table from left to right, the binary value '1' means 'may nest inside' and '0' means 'may not nest inside'. Now by mapping each of the property names from Table 1 to their HTML tag equivalents, we can look up the corresponding values in a full-sized version of Table 2 with the 107 tags of HTML5. If there is a 1 in the corresponding location of the HTML table then the frequency count in Table 1 will be left as is, and if the HTML table has a 0, meaning that the HTML equivalents may not nest, then the frequency, however large, will be zeroed. This guarantees that the nesting information derived from the property set, when converted to HTML, specifies HTML-compatible nesting rules.
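
The zeroing step might be sketched in Python as follows. The function mask_counts and the argument names are invented for illustration, the may_nest table holds only the fragment shown in Table 2 rather than the full HTML table, and the mapping of 'italics' to 'span' is an assumption of mine (the mapping of 'stage' to 'p' is the one used as an example earlier).

```python
def mask_counts(counts, to_html, may_nest):
    """Zero every frequency whose HTML equivalents may not nest."""
    masked = {}
    for inner, row in counts.items():
        masked[inner] = {}
        for outer, freq in row.items():
            # Look up 'inner may nest inside outer' for the HTML equivalents.
            allowed = may_nest.get(to_html[inner], {}).get(to_html[outer], 0)
            masked[inner][outer] = freq if allowed else 0
    return masked


# Fragment of the binary HTML table: may_nest[inner][outer]
may_nest = {"span": {"span": 1, "p": 1}, "p": {"span": 0, "p": 1}}
to_html = {"stage": "p", "italics": "span"}  # 'italics' -> 'span' is assumed
# Made-up frequencies, not taken from a real text.
counts = {"italics": {"stage": 4}, "stage": {"italics": 2}}
print(mask_counts(counts, to_html, may_nest))
```

In this made-up example the count for 'italics' inside 'stage' survives, because <span> may appear inside <p>, while the count for 'stage' inside 'italics' is zeroed.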

Enclosing versus nesting

Recall from the definition above that a 'property' consists of a name and a range in the underlying text. What Table 1 records are the nesting characteristics of the property-names. To say that property-name A nests inside property-name B is a general statement about those two names. On the other hand, to say that property A encloses property B is a statement about two particular ranges: the range of A contains the range of B and extends beyond it by at least one character on the left or right. So, for example, the property 'speech' encloses the property 'speaker' even though they both start at the same point:

<sp><speaker>Bast.</speaker> Nothing my Lord.</sp>
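
The enclosing test is simple enough to write down directly. In this Python sketch the function name encloses and the (start, length) tuples are my own choices for illustration:

```python
def encloses(a, b):
    """True if range a contains range b and is at least one character longer."""
    a_start, a_len = a
    b_start, b_len = b
    a_end, b_end = a_start + a_len, b_start + b_len
    return a_start <= b_start and b_end <= a_end and a_len > b_len


speech = (0, 22)   # "Bast. Nothing my Lord."
speaker = (0, 5)   # "Bast."
print(encloses(speech, speaker), encloses(speaker, speech))
```

Two properties with exactly the same range do not enclose one another by this definition, which is the case where general nesting information about the names is needed.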

Document Object Model

The idea of a document object model (DOM) is borrowed from SGML/XML. HTML began as an application of SGML, and an HTML document can be described by a tree-structure of 'nodes'. Each node may contain descendant nodes called 'children', and nodes at the same level are called 'siblings'. Figure 1 shows the structure of a basic DOM tree.

To be continued ...