Saturday 25 February 2012

The importance of deletions

Standoff properties describe what happens to runs of text, but what about runs of text you don't want, or that need to be moved into attributes when the text is formatted as HTML? This problem surfaced when I tried to use formatter to generate a dropdown menu from a list of versions spat out by nmerge. What nmerge gave me was the description of the work on one line, then the name of the version-group, followed by the short name and long name of each version:

King Lear
Base F1 Version F1
 F2 Version F2
 F3 Version F3
 F4 Version F4
 Q1 Version Q1
 Q2 Version Q2

But I needed to turn this into HTML that looked like this:

<select name="versions" class="list">
<option title="Version F1" class="version">F1</option> 
<option title="Version F2" class="version">F2</option> 
<option title="Version F3" class="version">F3</option> 
<option title="Version F4" class="version">F4</option> 
<option title="Version Q1" class="version">Q1</option> 
<option title="Version Q2" class="version">Q2</option> 
</select>

It's obvious that strings like "Version F1" need to be moved from the text into attributes, and strings like "Base" and "King Lear" need to be deleted outright. No doubt XSLT wizards are laughing at me, pointing out that if only I had embraced XML I would be able to transform XML into exactly this format easily. But choosing a simpler model for representing textual properties doesn't mean any loss of functionality. In fact "less is more". All it needs is for these problems to be thought through and solved.

All you have to do is allow each standoff property to be "removed". I already had this feature, but the length had to be zero: I used it for removing the TEI header from TEI documents and saving the content in an annotation of an empty range, and the text the elements formerly enclosed was removed permanently from the base text. Since I wanted to reuse the same text for other applications, I didn't want to remove any text permanently. So I just allowed the length of a "removed" property to be greater than zero. A removed property now removes the text it covers from the output, along with any ranges it encloses and the parts of any ranges it overlaps. So in the JSON all you do is define some "empty" ranges that have the "removed" property set to true:

{ "name": "list", "annotations": [ { "name": "versions" } ], "len": 94, "reloff": 10 },
{ "name": "empty", "removed": true, "len": 5, "reloff": 0 },
{ "name": "top-group", "annotations": [ { "name": "Base" } ], "len": 88, "reloff": 5 },
{ "name": "version", "annotations": [ { "description": "Version F1" } ], "len": 2, "reloff": 0 },
{ "name": "empty", "removed": true, "len": 10, "reloff": 3 }...

What this machine-generated JSON says is that the string "Base" should be deleted, and that long version names like "Version F1" should be taken out of the text and attached to the short names as attributes called "description". The HTML given above is exactly what formatter outputs. So now we can also pick and choose which text gets selected and where it appears.
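
To make the mechanics concrete, here is a minimal Python sketch of how "removed" ranges could be applied to the base text. It is not formatter's actual code. It assumes that each "reloff" is measured from the start of the preceding range (which is consistent with the numbers in the fragment above), and the function names resolve_offsets and strip_removed are made up for illustration. Only the five ranges visible in the fragment are used, and the sketch drops text only; it does not trim or discard the ranges that a removed range encloses or overlaps.

```python
def resolve_offsets(ranges):
    """Attach absolute 'start' and 'end' offsets to each range.

    Assumption: 'reloff' is relative to the start of the previous range.
    """
    resolved = []
    pos = 0
    for r in ranges:
        pos += r["reloff"]
        resolved.append({**r, "start": pos, "end": pos + r["len"]})
    return resolved


def strip_removed(text, ranges):
    """Return text without the characters covered by 'removed' ranges.

    The base text itself is left untouched, so it can be reused elsewhere.
    """
    deleted = [False] * len(text)
    for r in resolve_offsets(ranges):
        if r.get("removed"):
            for i in range(r["start"], min(r["end"], len(text))):
                deleted[i] = True
    return "".join(ch for ch, gone in zip(text, deleted) if not gone)


# The nmerge listing shown earlier in this post.
sample_text = (
    "King Lear\n"
    "Base F1 Version F1\n"
    " F2 Version F2\n"
    " F3 Version F3\n"
    " F4 Version F4\n"
    " Q1 Version Q1\n"
    " Q2 Version Q2\n"
)

# Only the ranges visible in the JSON fragment; the rest is elided there.
sample_ranges = [
    {"name": "list", "annotations": [{"name": "versions"}], "len": 94, "reloff": 10},
    {"name": "empty", "removed": True, "len": 5, "reloff": 0},
    {"name": "top-group", "annotations": [{"name": "Base"}], "len": 88, "reloff": 5},
    {"name": "version", "annotations": [{"description": "Version F1"}], "len": 2, "reloff": 0},
    {"name": "empty", "removed": True, "len": 10, "reloff": 3},
]

stripped = strip_removed(sample_text, sample_ranges)
print(stripped)
```

With just these five ranges, "Base " and the first "Version F1" disappear from the output while sample_text stays as it was. The remaining long names would go the same way once their "empty" ranges are supplied.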