Thursday, 22 March 2018

The end of XML

XML is now 20 years old. We might expect the first author of the XML 1.0 specification, Tim Bray, to be enthusiastic about XML's achievements and excited about its prospects for the future. Not a bit of it. In a limp endorsement, Tim tries diplomatically to think up some nice things to say about XML. But by the end of the article he lets out his true feelings:

People did a lot of that with XML just because there was no other alternative and, well… while it worked, you could do better, and in fact we have done better, for weak values of “better”. I wonder if we’ll ever do better still? As the editor of the IETF JSON RFCs, I’m a pessimist.

It’s been OK

Seriously; XML made a lot of things possible that previously weren’t. It has extended the lifetime of big chunks of our intellectual and legal heritage. It’s created a lot of interesting jobs and led to the production of a lot of interesting software. We could have done better, but you always could have done better.

Happy birthday!

I don't think there are any candles on the cake.

HTML is the new XML

More evidence of the disappearance of XML can be found in the new HTML Imports standard published by the W3C. Remember XInclude?

This document specifies a processing model and syntax for general purpose inclusion. Inclusion is accomplished by merging a number of XML information sets into a single composite infoset.

Now we have HTML Imports:

HTML Imports are a way to include and reuse HTML documents in other HTML documents.

Sound familiar? Of course they are not the same, because XML and HTML are not the same. But the same basic need that existed when the XML standards were being defined is now being met in HTML, which provides yet another case where HTML is taking on the capabilities of XML. The others are RDFa being redefined for HTML, HTML5 being independent of SGML/XML, and the CSS 3 Paged Media Module replacing XSL formatting objects. Does the list go on beyond what I know? Probably. One thing is clear: HTML is being made more and more into a replacement for XML in all things. In a couple more years people will even be asking: 'what is XML?' And the Museum Guide will point with a stick to a funny page of complex markup and everyone will go 'Ooooh!'

Saturday, 23 December 2017

Ecdosis and the Charles Harpur Critical Archive

Now that we are close to finishing our first historical digital edition, the Charles Harpur Critical Archive, it is time to articulate the technical design that led to its realisation. It is also worth reflecting on what we achieved. The extant papers of Charles Harpur (1813-1868) consist of 5,225 manuscript pages ranging in difficulty from easy to diabolically complex, 674 published newspaper poems, 140 letters on 403 manuscript pages, and 250 published pages in book form. To give you some idea of how large that is: just to print the last version of each poem took Elizabeth Perkins 1000 pages. We have included all the versions, which is three times that much, plus all the notes to the poems and the letters, which is double that again. So think 6,000 pages of printed matter. And we did it, including an elaborate user interface, in just 3 years. We recorded every last deleted full-stop. Here's a sample, in case you thought it was easy:

The technical design that made this possible is now described in outline on the CHCA website. It is a general system that can be reused to create a wide range of other editions.

Saturday, 17 June 2017

Preserving soft and hard hyphens in transcriptions of historical documents

Like all documents, historical texts contain line-breaks. An obvious case where preservation of line-breaks is essential is poetry. And yet on the Web, HTML assumes that all text is flowed. That is, line-breaks are converted into spaces unless the text is broken by a <br> tag. Or you can just specify that line-breaks are preserved, as in the <pre> element, or by using the white-space:pre CSS property. What is needed though is some way to easily switch between the two. Flowed text is easier to read, but for historical accuracy line-breaks and the inevitable hyphens must be preserved. In spite of this requirement, in many digitised versions of historical texts hyphens are permanently removed and the text is flowed for readability. This prevents ever showing the text as it really is, which is needed, for example, when displaying a text next to its page-image, or when citing a historical document by its line-number.

Hard and soft hyphens

What's needed is some way to record the line-breaks but to hide or show them on demand. The easiest way to do this is in the browser by flipping a switch in the CSS stylesheet. One problem with this is the existence of hard and soft hyphens. In heavily hyphenated languages like English and French, hyphens occur not just when an unhyphenated word is split over a line but also between parts of the one word, as in double-barrelled names like 'Normington-Rawling' or compound words like 'the high-glooming mountain'. When such compounds are split over a line the hyphen is regarded as 'hard', that is, it will not disappear if the line-break is removed. Whereas a 'soft-hyphen' disappears along with the line-break when the text is reflowed. So what is really needed are two sets of CSS styles for flowed and unflowed text. Another complication is that 'hyphens' come in various flavours. Sometimes writers use characters other than '-'. One common variant is use of the colon, or an equal-sign. And sometimes the hyphen is repeated on the next line. So we need a way to switch off these as well.

The two CSS styles

Here are my two styles. I've tried them in Firefox, Chrome and Opera and they appear to work perfectly. First the flow styles:

.soft-hyphen { display:none }
.hard-hyphen { word-spacing:-.25em; }

CSS offers no direct way to hide spaces, or line-breaks that are automatically turned into spaces, but you can vary the amount of horizontal spacing between words. The default is, according to the W3C, equal to .25em. So setting it to -.25em should eliminate it altogether. Here are the corresponding definitions of soft and hard hyphens when preserving line-breaks:

.soft-hyphen,.hard-hyphen { white-space:pre }

An example

Here is a short example text in three formats.

Source HTML
<p>"When they will not give
a doit to relieve a lame beggar, they
will lay out ten to see a dead Indian",—
the device which aimed at converting
to the benefit of a living author, the 
expense they were only disposed to throw
away upon a dead one, if not praise<span class="hard-hyphen">-
</span><span class="soft-hyphen">-</span>worthy, was at least pardonable.</p>
<p>In fine, Chatterton was stung to the
quick by neglect, and rendered de<span class="soft-hyphen">-
-</span>fiant by the apparent blindness of
fortune.</p>

Note that in the first case, when the hyphen is doubled, this has to be dealt with somehow when the HTML encoding is generated, so that the first hyphen and its line-break are encoded as a hard-hyphen and the second as soft. This works no matter which characters are used for the actual hyphens.

Flowed

"When they will not give a doit to relieve a lame beggar, they will lay out ten to see a dead Indian",— the device which aimed at converting to the benefit of a living author, the expense they were only disposed to throw away upon a dead one, if not praise- -worthy, was at least pardonable.

In fine, Chatterton was stung to the quick by neglect, and rendered de- -fiant by the apparent blindness of fortune.

With line-breaks preserved

"When they will not give a doit to relieve a lame beggar, they will lay out ten to see a dead Indian",— the device which aimed at converting to the benefit of a living author, the expense they were only disposed to throw away upon a dead one, if not praise- -worthy, was at least pardonable.

In fine, Chatterton was stung to the quick by neglect, and rendered de- -fiant by the apparent blindness of fortune.

I cheated a little bit in the last example because it is not possible to have two definitions of the same style on one HTML page. Since this can be manipulated in JavaScript easily I don't see that as a problem. Other than that, I used the same source encoding for the three formats.
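As a minimal sketch of that JavaScript switch, the two sets of rules from above can be kept as strings and written into a single <style> element on demand. The function names, the mode names "flowed" and "lines", and the element id "hyphen-style" are all my own inventions for illustration; only the CSS rules themselves come from this post.

```javascript
// The two rule sets quoted earlier, keyed by a (hypothetical) mode name.
var HYPHEN_STYLES = {
    flowed: ".soft-hyphen { display:none }\n.hard-hyphen { word-spacing:-.25em; }",
    lines: ".soft-hyphen,.hard-hyphen { white-space:pre }"
};

// Return the CSS text for one mode, rejecting anything unknown.
function hyphenCss(mode) {
    if (!Object.prototype.hasOwnProperty.call(HYPHEN_STYLES, mode))
        throw new Error("unknown hyphen mode: " + mode);
    return HYPHEN_STYLES[mode];
}

// Replace the contents of one <style> element, creating it on first use,
// so the page never holds both definitions at once.
// "doc" is normally the browser's document object.
function setHyphenMode(doc, mode) {
    var style = doc.getElementById("hyphen-style"); // assumed id
    if (!style) {
        style = doc.createElement("style");
        style.id = "hyphen-style";
        doc.head.appendChild(style);
    }
    style.textContent = hyphenCss(mode);
    return style;
}
```

In a browser you would call setHyphenMode(document, "flowed") or setHyphenMode(document, "lines") from a button handler. Passing the document in as a parameter is just a convenience so the function can be tried outside a browser.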

Friday, 24 February 2017

Refinements to Twin-view

Twin view is a side by side view of an historic print or manuscript document and its transcription. The idea was to scroll the images of each page on the left in sync with the formatted text on the right. Although I already described this view earlier, I have since made several refinements that are worth expounding in a separate post.

To recap on what twin-view already achieved: it aims to align the transcription with the corresponding page image so that the text is aligned across the middle of the display. Of course, without scanning the page image for word-shapes precise alignment is impossible. But approximate alignment can be achieved for any document and its transcription by following some simple rules. So long as we can measure the height of each page image and the height of the corresponding text on the right, the two can be scrolled in sync fairly accurately.
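To make the idea concrete, here is a rough sketch of one way such a rule could work. It is not the actual twin-view code: it assumes we already have, for each page, the pixel height of its image and of its transcription (the property names imageHeight and textHeight are made up), and it maps a scroll position on the left to one on the right by simple proportion within the current page.

```javascript
// Map a scroll offset in the image column to the matching offset in the
// text column. "pages" is an array of { imageHeight, textHeight } objects
// in reading order; both property names are hypothetical.
function syncScroll(pages, imageOffset) {
    var offset = Math.max(0, imageOffset); // ignore negative offsets
    var imageTop = 0; // top of the current page image
    var textTop = 0;  // top of the current page's transcription
    for (var i = 0; i < pages.length; i++) {
        var page = pages[i];
        if (offset < imageTop + page.imageHeight) {
            // How far down this page image are we, as a fraction?
            var fraction = (offset - imageTop) / page.imageHeight;
            return textTop + fraction * page.textHeight;
        }
        imageTop += page.imageHeight;
        textTop += page.textHeight;
    }
    return textTop; // scrolled past the last page: bottom of the text
}
```

So with two pages whose images are each 1000 pixels high and whose transcriptions are 400 and 800 pixels high, scrolling the images to 1500 lands halfway through the second transcription, at 800. The alignment is only as good as the assumption that text is spread evenly down each page.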

1. Partial pages

Problems however arise whenever only part of a page is transcribed – say the end of a poem, which may then be followed by another work. In our case since individual poems were taken from both printed and handwritten anthologies, many poems begin or end some way down the page.

My first idea was to keep the page images intact, but to outline that part of the image connected with a particular poem, so it could be electronically sliced into segments just before display to the user. Then only relevant portions of the page would be visible, and the original images would remain intact. However, this proved impractical for several reasons. First how could such areas be determined? Only manually. And that meant a lot of work and a recording of the data in some format that would have to be customised for our website. Also the slicing would be computationally expensive, and the part-images would have to be cached to improve performance. That gave me the idea of manually slicing the images into segments, while keeping a copy of the original page for other purposes. So the part-page of the transcription would be connected with a part-image of the page. And the only technology needed to achieve this would be the web-server's innate ability to serve images and HTML.

I have done this now for 104 poems out of 700 in the collection. The result mimics to some extent earlier attempts by others to produce complex 'diplomatic' layouts of original documents containing blocks of text that may be rotated or written on other pages that are then displayed as such. Such views are pretty hard to read even though the text has been transcribed. Instead, twin-view simply connects a series of derotated part-images and their corresponding textual transcriptions into a continuous and easy to read document on both sides of the display. The zoom feature then takes care of the user's need for closer examination.

2. Full-screen

Another refinement was the provision of a full-screen view. Nowadays many people have access to large monitors with gargantuan resolutions. Why not make use of that, while retaining a fallback of adequate display for smaller screens? Content management systems typically don't allow this. They confine the text to a narrow band in the centre of the screen, in the belief that screen sizes must have some minimum. Typically this is 600-800 pixels wide. In a responsive layout, on the other hand, text and images are scaled to fit the available screen-width. So I thought: why not use all of the screen for twin-view? The result is a view that enables the user to see the text and its images in minute detail, while retaining the sync-scrolling of the main view within the CMS.

Twin view of Harpur's 'Creek of the Four Graves' MS C384

3. Layers

Manuscript documents, especially of modern works, often contain erasures, substitutions, insertions and transpositions. These are usually encoded into the text as formats: crossed out text is displayed in a crossed out format, inserted text is displayed over the line in smaller type, blocks of rotated text displayed as rotated blocks etc. This is complex and expensive to do, and the result is not much more readable than the original manuscript. Layers offer a way around this problem. Since each local change to the text belongs to a clear temporal sequence in almost all cases, it is possible to code for time instead of layout. A layer is a combination of each of these local independent changes. All local changes that occurred one unit of time after that of the baseline text appear in layer 2, and changes to layer 2 in layer 3 etc. Layers aren't versions and the non-final text is therefore displayed in red. Only the final text is displayed in black. 'Layer-final' is the last layer in the temporal sequence representing the final state of the document as the author left it. Layers thus provide a diachronic view of the text. They are also mostly coherent – meaning we can read them – as opposed to the direct diplomatic approach where the text is shown with erasures inline, making it unreadable for humans and computers alike.
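The idea of coding for time instead of layout can be illustrated with a toy data structure. This is only a sketch under my own assumptions, not the format the archive actually uses: a text is a list of segments, where a plain string is text that never changed and an array holds the successive readings at one place of local change, oldest first.

```javascript
// Number of layers: the longest sequence of readings at any one place.
// A text with no changes at all still has its baseline, layer 1.
function layerCount(segments) {
    var count = 1;
    segments.forEach(function (seg) {
        if (Array.isArray(seg) && seg.length > count) count = seg.length;
    });
    return count;
}

// Text of layer n (1 = baseline). A place with fewer readings than n
// keeps its last reading, so every layer is a complete, readable text.
function layerText(segments, n) {
    var layer = Math.max(1, n);
    return segments.map(function (seg) {
        if (!Array.isArray(seg)) return seg; // unchanged text
        return seg[Math.min(layer, seg.length) - 1];
    }).join("");
}

// True when layer n is the final state, which would be shown in black.
function isLayerFinal(segments, n) {
    return n >= layerCount(segments);
}
```

With an invented example such as ["The ", ["quick", "swift", "fleet"], " fox ", ["", "jumped"]], layer 1 reads "The quick fox ", layer 2 reads "The swift fox jumped" and layer 3, which is layer-final, reads "The fleet fox jumped". An empty string stands for a place where nothing had yet been inserted.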

Take a look for yourself on the Harpur website. The full-screen button is next to the tabs for layers. Only the poems in the title index in Browse from A-D are enabled for twin view presently.

Monday, 31 October 2016

More about the decline of XML

At ESTS Antwerp recently (5-7 October 2016) some XML aficionados thought that the StackOverflow graphic in my previous post was somehow misleading, that attendees at the conference needn't worry about the decline of XML, because it wasn't really happening. But they didn't offer any facts to counter the evidence.

XML Web services

Five years ago, in 2011, ProgrammableWeb posted a news story, based on the APIs submitted to their index, which stated: 1 in 5 APIs say "bye" to XML. That is, 1 in 5 APIs were already JSON only: they offered no support for XML.

What's an 'API'? It's an index into the functionality offered by a web-service. Typically data is sent to the service in some format and returned via some other (or the same) format, such as XML or JSON.

The 2011 story was widely quoted, so I went back to the site and created my own analysis of all JSON and/or XML APIs in their registry from 2005 to October 2016. That's a total of 4,453 APIs. Since 2011 1 in 5 has now become 4 in 5:

The trend is clear: designers of web services are going for REST/JSON and only supporting XML legacy applications when they can afford to do so. Very few APIs are now pure XML and, judging by this rate of decline, XML in Web services will be all but dead in 12 months' time.

XML tools

According to the effective creator of XML, James Clark, Web services were the biggest motivation for XML in the first place. The disappearance of such a big use case for XML will inevitably result in the withdrawal of vendor support for XML products and for the open source development projects that they patronize. Without the support of key open-source XML building blocks, which are not being adequately maintained as shown in the graph below, new commercial products based on XML will no longer be possible, and existing ones will break.

Releases of 8 key open-source XML tools [1]

General popularity of XML

Another possible source of information about XML's decline can be found in the archives of the xml-dev newsgroup, which is sponsored by industry giants IBM and Microsoft, and hosted by Oasis. It documents a marked decline of developer interest in XML since its inception, as this graph of the number of monthly posts to the group between February 1997 and October 2016 clearly shows:

A corroboration of this trend can be found in posts to the popular news site which mention either an XML language or XML itself:

XML 'Mixed content' [2]

The use of XML for mixed content seems likely to succumb to the same trend eventually. Its decline is evidenced by falling interest in DocBook and TEI. Unlike SGML, XML never was designed to be typed manually, even in an XML editor. While interest in DocBook has plummeted to 1% of what it was 10 years ago, simpler markup languages like Markdown have risen dramatically in popularity. Niche XML vocabularies like TEI would thus seem to have no future; their survival will depend on the continued maintenance of XML tools by a tech community that is rapidly losing interest in them.

Popularity of DocBook (blue) vs Markdown (red) 2004-present

XML databases

Interest in XML document databases like Sedna, BaseX, eXist and MarkLogic (now a dual XML/JSON database) is roughly flat, but at a very low level. For every user of exist-db (XML document database) there are 3,500 mongo-db (JSON document database) users. The only possible reason to choose an XML over a JSON document database can be that the data is already in an XML format. All four surviving XML databases show a slight decline since 2014, probably due to companies gradually migrating their documents to other formats.

Native XML databases popularity 2012-2016

XSLT transformation language

XSLT, the XML transformation language, seems to be dying even faster than XML itself. The clumsiness of its syntax, combined with its inability to directly transform HTML5 (although it can write it) has doubtless contributed to its demise.


In conclusion, building software based on XML for the future is a risky business. In spite of its ubiquitous use only a few years ago, XML has been cut off from mainstream development by a tech community focused on HTML5/JSON/CSS/Javascript. HTML and RDFa were once big use cases for XML that now no longer require it. Adrift on its own, having to justify itself on its few merits and many drawbacks, the future of XML looks bleak.

[1] Cocoon, fop, libxml, libxml2, xalan-c, xalan-j, xerces-c, xerces-j

[2] 'Mixed content' is when tags may appear in text and text in tags

Tuesday, 25 October 2016

Formatting poetry for the Web

HTML was designed for the display of business documents, not poetry. In HTML, text is composed out of a succession of flow elements, each of which contains a series of phrasing elements. So an element like <p> is a flow-element, and <span> is phrasing content. <div> is like the joker in cards: it can enclose anything.


Let's consider the first problem, how to encode stanzas:

    Ah! Daniel mine, some Muse malign
        Hath skimm'd thy judgments cream away
    But take a slice of "good advice"---
        Even that I proffer thee today.

If the stanza here was enclosed in a <pre> then all the lines would have the same indentation, and this could not be corrected using CSS. You could use spaces at the start of each line, but with variable-width fonts this looks awful and you have no control over indentation when fonts are substituted by the browser. In CSS you can instead use the white-space: pre property to make any flow element behave like <pre> anyhow, so <pre> is not needed, especially as it uses a monospaced font by default.

An obvious alternative would seem to be <div>, which can enclose anything. So a <div class="stanza"> would be a good choice. Equally <p> is also possible, so long as it encloses only phrasing content. (A <p> can't enclose another <p>, so we can't use <p> to represent lines if <p> is already used for stanzas.) However, <p> has the distinct advantage of being the direct result of translation of Markdown's double line-breaks to separate paragraphs. This allows us to type poems online using a very simple Markdown-like syntax, and then translate it into suitable HTML. Stanzas can be styled using a CSS selector that selects all <p>-elements enclosed in a single <div class="poem">, so you don't have to keep typing "<div class="stanza">" every four lines or so.


Often in poems headings are centred. Unfortunately, poetic lines are typically much shorter than the screen-width, and since the enclosing <div> will fill all the available width on screen, a centred heading will be pushed to the right of the text. So it won't be centred in any meaningful sense. The fix is to write a little JavaScript to measure each line of poetry, then adjust the width of the <div> so that it is slightly wider than the longest line. Something like this:

<script src="
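<script type="text/javascript">
    $(function() { // wait until the poem is in the DOM
        var maxWidth=0;
        var lines = $("div p span");
        for ( var i=0;i<lines.length;i++ ) {
            var w = lines.eq(i).width();
            if ( w > maxWidth )
                maxWidth = w;
        }
        // assumed completion: make the poem slightly wider than its longest line
        $("div.poem").width( maxWidth+20 );
    });
</script>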
<div class="poem">


Neither the <pre>-tag, which is used to format computer-code, nor the line-break tag, <br>, provides control over the indenting of specific lines. It is a common mistake of XML technicians to encode lines as "<lb/>", rather than <l>...</l>, to avoid common issues of overlap with other elements like <del> (deleted). Once a poetic line has been encoded using an empty XML-element like this it can only be translated into HTML's <br> tag, which incurs the problems just mentioned. Hence lines are better represented as <span>s within a stanza (a <p> or <div>) where white-space has been set to "pre". Lines of various indentations can then be represented by defining classes of spans such as <span class="line"> or <span class="line-indent1"> etc.
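Here is a rough sketch of how typed lines might be turned into such spans. It is not the converter the archive uses. It assumes that stanzas are separated by blank lines, that indentation is typed with spaces only, and that every four extra spaces mean one more level of indent. The class names line and line-indent1 are the ones mentioned above; the function names are mine.

```javascript
var INDENT_WIDTH = 4; // assumed number of spaces per indent level

// Escape the characters that would otherwise be read as markup.
function escapeHtml(s) {
    return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Convert plain typed text into <p> stanzas of <span> lines.
function poemToHtml(source) {
    var stanzas = [], current = [];
    source.split(/\r?\n/).forEach(function (line) {
        if (line.trim() === "") {          // a blank line ends the stanza
            if (current.length) { stanzas.push(current); current = []; }
        } else {
            current.push(line.replace(/\s+$/, ""));
        }
    });
    if (current.length) stanzas.push(current);

    // The smallest indent in the poem counts as "no indent".
    var indents = [];
    stanzas.forEach(function (stanza) {
        stanza.forEach(function (line) { indents.push(line.match(/^ */)[0].length); });
    });
    var base = indents.length ? Math.min.apply(null, indents) : 0;

    var body = stanzas.map(function (stanza) {
        var spans = stanza.map(function (line) {
            var indent = line.match(/^ */)[0].length;
            var level = Math.floor((indent - base) / INDENT_WIDTH);
            var cls = level === 0 ? "line" : "line-indent" + level;
            return '<span class="' + cls + '">' + escapeHtml(line.slice(indent)) + "</span>";
        });
        // The newline between spans becomes the line-break under white-space:pre.
        return "<p>" + spans.join("\n") + "</p>";
    });
    return '<div class="poem">\n' + body.join("\n") + "\n</div>";
}
```

The sketch deliberately ignores character formats such as italics, which would need a further pass over each line.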

Italics etc

For character formats ("phrasing content") you need to use classes, so <span class="underlined">that</span> can be styled to be in italics. If you use <i> or <em> you can't control the text appearance so well. For example, you might have stage-directions, foreign words etc that need different formatting.

Special characters

To get a really professional look simple typewriter codes like " and ' need to be translated into their curly equivalents. The same goes for dashes like ---, which becomes —.
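A simple-minded version of that translation might look like the following. It is a sketch only, and the rule it uses is my own assumption: a quote mark at the start of the text, or after white-space or an opening bracket, is taken to be an opening quote, and every other one a closing quote or apostrophe. The characters are written as Unicode escapes so that the code survives any encoding mishaps.

```javascript
// Translate typewriter codes into their typographic equivalents.
function curlify(text) {
    return text
        .replace(/---/g, "\u2014")                 // three hyphens: em dash
        .replace(/(^|[\s(\[{])"/g, "$1\u201C")     // opening double quote
        .replace(/"/g, "\u201D")                   // any other: closing double
        .replace(/(^|[\s(\[{])'/g, "$1\u2018")     // opening single quote
        .replace(/'/g, "\u2019");                  // any other: closing or apostrophe
}
```

Under this rule the apostrophe in skimm'd and the one in 'twere after a dash both come out as the closing form, which is what you want. A word-initial apostrophe after a space, or an opening quote straight after a dash, would be guessed wrongly and need correcting by hand.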

Putting it all together

The whole design looks like this. You can change the indents, or define extra classes for a wider variety of indented lines. I use up to six. To copy the design just use "display source" in your browser.

To Twank.1

Ah! Daniel mine, some Muse malign
    Hath skimm’d thy judgments cream away
But take a slice of “good advice”—
    Even that I proffer thee today.

Again read Shakespear by the hour,—
    Read Milton more—McDonald less—
And Wordsworth for his simple power,
    Not for his namby-pamby-ness.

And know,—’twere better to esteem
    What’s best in Byron’s godless “Don”
Than with crude Browning much to dream,
    Or wire-draw through with Tennyson.

And better at the “woes of Moore”
    To shed the artifical tear,
Than doat with eunuch passion o’er
    The feeble beauties of De Vere.

The above poem was automatically formatted by simply translating a Markdown representation into HTML. This is what the user actually typed:

To Twank.

    Ah! Daniel mine, some Muse malign
        Hath skimm'd thy judgments cream away
    But take a slice of "good advice"---
        Even _that_ I proffer thee today.

    Again read Shakespear by the hour,---
        Read Milton more---M^c^Donald less---
    And Wordsworth for his simple power,
        Not for his namby-pamby-ness.

    And know,---'twere better to esteem
        What's best in Byron's godless "Don"
    Than with crude Browning much to dream,
        Or wire-draw through with Tennyson.

    And better at the "woes of Moore"
        To shed the artifical tear,
    Than doat with eunuch passion o'er
        The _feeble_ beauties of De Vere.

Now I call that easy.

1 Charles Harpur, 'To Twank' Empire 14 March 1860. 'Twank' was a nickname of Daniel Deniehy

Tuesday, 20 September 2016

The fall of XML

Talk to software developers today and they will tell you that 'XML is toast'. XML has not been replaced by any single technology. It is not JSON that has killed off XML; it is the mobile Web and associated technologies. Digital humanists who think that XML is here to stay, and imagine that they can continue to build software on top of it, should take a look at the following graphic, derived from Stackoverflow, one of the most popular discussion forums for software developers. HTML, Javascript, JSON and CSS have collectively supplanted XML, and these technologies no longer have any need of it. You may say 'Who cares what software developers think?' But they are the guys who build and maintain the tools that digital humanists use. If they abandon XML then those tools will soon perish or become obsolete, disconnected from the services they were designed to support.

When the World's way is running East,
    Keep your way running West;
And it is two to one, at least,
    That yours will be the best.

Charles Harpur