Monday 31 October 2016

More about the decline of XML

At ESTS Antwerp recently (5-7 October 2016) some XML aficionados thought that the StackOverflow graphic in my previous post was somehow misleading, that attendees at the conference needn't worry about the decline of XML, because it wasn't really happening. But they didn't offer any facts to counter the evidence.

XML Web services

Five years ago in 2011 on ProgrammableWeb there was a news story posted based on APIs submitted to their index, that stated: 1 in 5 APIs say "bye" to XML, or, 1 in 5 APIs were already JSON only: that is, they offered no support for XML.

What's an 'API'? It's an index into the functionality offered by a web-service. Typically data is sent to the service in some format and returned via some other (or the same) format, such as XML or JSON.

The 2011 story was widely quoted, so I went back to the site and created my own analysis of all JSON and/or XML APIs in their registry from 2005 to October 2016. That's a total of 4,453 APIs. Since 2011 1 in 5 has now become 4 in 5:

The trend is clear: designers of web services are going for REST/JSON and only supporting XML legacy applications when they can afford to do so. Very few APIs are now pure XML and judging by this rate of decline, XML in Web services will be all but dead in 12 months time.

XML tools

According to the effective creator of XML, James Clark, Web services were the biggest motivation for XML in the first place. The disappearance of such a big usage case for XML will inevitably result in the withdrawal of vendor support for XML products and open source development projects that they patronize. Without the support of key open-source XML building blocks, which are not being adequately maintained as shown in the graph below, commercial new products based on XML will no longer be possible, and existing ones will break.

Releases of 8 key open-source XML tools1

General popularity of XML

Another possible source of information about xml's decline can be found in the archives of xml.org, which is sponsored by industry giants IBM and Microsoft, and hosted by Oasis. The xml-dev newsgroup documents a marked decline of developer interest in XML since its inception, as this graph of the number of monthly posts to the group between February 1997 to October 2016 clearly shows:

A corroboration of this trend can be found in posts to the popular Slashdot.org news site which mention either an XML language or XML itself:

XML 'Mixed content'2

The use of XML for mixed content seems likely to succumb to the same trend eventually. Its decline is evidenced by falling interest in DocBook and TEI. Unlike SGML, XML never was designed to be typed manually, even in an XML editor. While interest in DocBook has plummeted to 1% of what it was 10 years ago, simpler markup languages like Markdown have risen dramatically in popularity. Niche XML vocabularies like TEI would thus seem to have no future; their survival will depend on the continued maintenance of XML tools by a tech community that is rapidly losing interest in them.

Popularity of DocBook (blue) vs Markdown (red) 2004-present

XML databases

Interest in XML document databases like Sedna, BaseX, eXist and MarkLogic (now a dual XML/JSON database) is roughly flat, but at a very low level. For every user of exist-db (XML document database) there are 3,500 mongo-db (JSON document database) users. The only possible reason to choose an XML over a JSON document database can be that the data is already in an XML format. All four surviving XML databases show a slight decline since 2014, probably due to companies gradually migrating their documents to other formats.

Native XML databases popularity 2012-2016

XSLT transformation language

XSLT, the XML transformation language, seems to be dying even faster than XML itself. The clumsiness of its syntax, combined with its inability to directly transform HTML5 (although it can write it) has doubtless contributed to its demise.

Conclusion

In conclusion, building software based on XML for the future is a risky business. In spite of its ubiquitous use only a few years ago, XML has been cut off from mainstream development by a tech community focused on HTML5/JSON/CSS/Javascript. HTML and RDFa were once big use cases for XML that now no longer require it. Adrift on its own, having to justify itself on its few merits and many drawbacks, the future of XML looks bleak.

[1] Cocoon, fop, libxml, libxml2, xalan-c, xalan-j, xerces-c, xerces-j

[2] 'Mixed content' is when tags may appear in text and text in tags

Tuesday 25 October 2016

Formatting poetry for the Web

HTML was designed for the display of business documents, not poetry. In HTML, text is composed out of a succession of flow elements, each of which contains a series of phrasing elements. So an element like <p> is a flow-element, and <span> is phrasing content. <div> is like the joker in cards: it can enclose anything.

Stanzas

Let's consider the first problem, how to encode stanzas:

Ah! Daniel mine, some Muse malign Hath skimm'd thy judgments cream away But take a slice of "good advice"--- Even that I proffer thee today.

If the stanza here was enclosed in a <pre> then all the lines would have the same indentation, and this could not be corrected using CSS. You could use spaces at the start of each line, but with variable-width fonts this looks awful and you have no control over indentation when fonts are substituted by the browser. In CSS you can instead use the white-space: pre property to make any flow element behave like <pre> anyhow, so <pre> is not needed, especially as it uses a monospaced font by default.

An obvious alternative would seem to be <div>, which can enclose anything. So a <div class="stanza"> would be a good choice. Equally <p> is also possible, so long as it encloses only phrasing content. (A <p> can't enclose another <p>, so we can't use <p> to represent lines if <p> is already used for stanzas.) However, <p> has the distinct advantage of being the direct result of translation of Markdown's double line-breaks to separate paragraphs. This allows us to type poems online using a very simple Markdown-like syntax, and then translate it into suitable HTML. Stanzas can be styled using a CSS selector that selects all <p>-elements enclosed in a single <div class="poem">, so you don't have to keep typing "<div class="stanza">" every four lines or so.

Headings

Often in poems headings are centred. Unfortunately, poetic lines are typically much shorter than the screen-width, and since the enclosing <div> will fill all the available width on screen, it will push the heading to the right of the text. So, it won't be centered in any meaningful sense. The fix is to write a little Javascript to measure each line of poetry, then adjust the width of the <div> so that it is slightly wider than the longest line. Something like this:

<head>
...
<script src="https://ajax.googleapis.com/ajax/libs/
  jquery/1.12.4/jquery.min.js"></script>
<script type="text/javascript">
$(document).ready(function(){
    var maxWidth=0;
    var lines = $("div p span");
    for ( var i=0;i<lines.length;i++ ) {
        var w = lines.eq(i).width();
        if ( w > maxWidth )
            maxWidth = w;
    }
    $("div.poem").width(maxWidth+10);
})
</head>
<body>
<div class="poem">
...
</div>
</body>

Lines

Neither the <pre>-tag, which is used to format computer-code, nor the line-break tag, <br>, provides control over the indenting of specific lines. It is a common mistake of XML technicians to encode lines as "<lb/>", rather than <l>...</l>, to avoid common issues of overlap with other elements like <del> (deleted). Once a poetic line has been encoded using an empty XML-element like this it can only be translated into HTML's <br> tag, which incurs the problems just mentioned. Hence lines are better represented as <span>s within a stanza (a <p> or <div>) where white-space has been set to "pre". Lines of various indentations can then be represented by defining classes of spans such as <span class="line"> or <span class="line-indent1"> etc.

Italics etc

For character formats ("phrasing content") you need to use classes, so <span class="underlined">that</span> can be styled to be in italics. If you use <i> or <em> you can't control the text appearance so well. For example, you might have stage-directions, foreign words etc that need different formatting.

Special characters

To get a really professional look simple typewriter codes like " and ' need to be translated into their curly equivalents. The same goes for dashes like ---, which becomes —.

Putting it all together

The whole design looks like this. You can change the indents, define extra classes for a wider variety of indented lines. I use up to six. To copy the design just use "display source" in your browser.

To Twank.1

Ah! Daniel mine, some Muse malign Hath skimm’d thy judgments cream away But take a slice of “good advice”— Even that I proffer thee today.

Again read Shakespear by the hour,— Read Milton more—McDonald less— And Wordsworth for his simple power, Not for his namby-pamby-ness.

And know,—’twere better to esteem What’s best in Byron’s godless “Don” Than with crude Browning much to dream, Or wire-draw through with Tennyson.

And better at the “woes of Moore” To shed the artifical tear, Than doat with eunuch passion o’er The feeble beauties of De Vere.

The above poem was automatically formatted by simply translating a Markdown representation into HTML. This is what the user actually typed:

To Twank.
==========

    Ah! Daniel mine, some Muse malign
        Hath skimm'd thy judgments cream away
    But take a slice of "good advice"---
        Even _that_ I proffer thee today.

    Again read Shakespear by the hour,---
        Read Milton more---M^c^Donald less---
    And Wordsworth for his simple power,
        Not for his namby-pamby-ness.

    And know,---'twere better to esteem
        What's best in Byron's godless "Don"
    Than with crude Browning much to dream,
        Or wire-draw through with Tennyson.
    
    And better at the "woes of Moore"
        To shed the artifical tear,
    Than doat with eunuch passion o'er
        The _feeble_ beauties of De Vere.

Now I call that easy.

1 Charles Harpur, 'To Twank' Empire 14 March 1860. 'Twank' was a nickname of Daniel Deniehy

Tuesday 20 September 2016

The fall of XML

Talk to software developers today and they will tell you that 'XML is toast'. XML has not been replaced by any single technology. It is not JSON that has killed off XML; it is the mobile Web and associated technologies. Digital humanists who think that XML is here to stay, and imagine that they can continue to build software on top of it, should take a look at the following graphic, derived from Stackoverflow, one of the most popular discussion forums for software developers. HTML, Javascript, JSON and CSS have collectively supplanted XML, and these technologies no longer have any need of it. You may say 'Who cares what software developers think?' But they are the guys who build and maintain the tools that digital humanists use. If they abandon XML then those tools will soon perish or become obsolete, disconnected from the services they were designed to support.

When the World's way is running East,
    Keep your way running West;
And it is two to one, at least,
    That yours will be the best.

Charles Harpur

Tuesday 26 July 2016

Sync-scrolling images and text when editing transcriptions

When transcribing original documents it is helpful to have the page image next to the transcription as it is being written. That way the transcriber can see what the next word to transcribe is, or quickly check for mistakes in some one else's transcription. Most documents contain more than one page, so this gives rise to a problem: how can the relevant page image be placed next to the relevant portion of the transcription so that the transcriber can easily see what corresponds to what?

One obvious solution is to transcribe page by page. For each image show only the transcription of that page. However, this creates a problem for the technician and user alike: now the transcription of the document is divided into parts which must be stitched back together not only by the computer when the document is saved, but also mentally by the transcriber. Pages rarely end at paragraph end. More often they split in the middle of a sentence or even a word. The transcriber may change only one page and then the computer must reassemble the document with that one altered page in its middle. If the page is marked up in some way the page's transcription may not be complete or well-formed, which would hamper editing. All this is both technically messy and counter-intuitive for the user.

A better method is to display the entire document for editing: both the transcription as a continuous running text and the page images to which it corresponds as a scrolling list. The reason this is not often done is because of an intrinsic alignment problem: how to find the part of an image that corresponds to the currently displayed piece of text. To be readable page-images may need to be higher than the screen. Typically the transcription of a page is much shorter. But as a general rule the centre of the page's transcription should be aligned with the centre of the corresponding page-image across the centre of the screen. This is what the user expects. However, this creates a problem: the first and last pages cannot possibly conform to this rule. The first page must be aligned so that the top of the page image aligns with the top of the transcription text. And likewise at the bottom: on the last page the end of the transcription must correspond with the bottom of the last page-image.

Live sync-scrolling

To achieve live sync scrolling we need a table or function that gives the left-hand scroll position for each possible right hand scroll position.

What we musn't do is make the list of images itself scrollable. If we do that then we will have to link its own scrolling with that of the scrolling text. Since the right-hand-side (RHS) controls the scrolling of the left-hand-side (LHS) we will get infinite feedback if we link the scrolling in both directions. 'Scrolling' on the LHS can be achieved by other means, for example, by varying the CSS 'top' property of the overall list of images.

The scrolling positions for the LHS are just the mid-points of each image in the overall list. These correspond to the mid-points for each page of text in the RHS. The latter can be found easily by parsing the text. In my case, since I use a minimal markup language (MML) page-starts are marked by [NNN] where NNN is the page-number on a line by itself. Any scrolling position in-between two of these corresponding values can be interpolated by scaling. However, this does not work for the first and last pages because the desirable alignments in these cases are the top of the first image with the top of the first page of text, and likewise the end of the text with the bottom of the last image. So my solution was just to replace the mid-points of the first page in both the LHS and RHS with half the window-height. Likewise for the last mid-points I used the length of the text and the length of the image-list minus half the window height. In some cases 'half-way from the top or bottom of the window' may be in the middle of a page that is not the first or last. In that case the overlapping values can simply be removed, as long as the ones at the extreme ends of the list are preserved.

There is a demo of this method on the Charles Harpur Critical Archive, in Twin View for Harpur's poem Shelley.

Tuesday 23 February 2016

Improvements to events editor

Events are things that happen in the life of your author. One of the successes of the AustESE project was the realisation that such events would best be represented as database records, so that biographical information could be rearranged into various useful forms. Events have a 'fuzzy' date more often than a precise one. So 'ca. 1834' or 'before February 1865' is what you would expect as the date of an event, not 26/12/1845. And events can have a description and a list of references. These two are represented as simple HTML, a globally interoperable standard for mixed content. So forget about XML, which only serves as a preliminary to making HTML.

To WYSIWYG or not?

This is where the problems started. AustESE used a sophisticated HTML editor that filtered out potentially dangerous HTML constructs that hackers could use to implant code exploits. But those tags and attributes also happen to be quite useful for building a website. For example, the title attribute or special data-attributes on a link could be used to animate a popup image. Unfortunately the editor stripped all these out when the user saved, and the images would disappear. So I swapped it for a simple HTML editor, on the grounds that users would still want to see a WYSIWYG preview of their HTML before saving it. But that didn't work out any better. Since a preview is created when the user clicks on another element it already reverts to an effective 'preview' that won't be saved until the user clicks the 'save' button. So the elaborate WYSIWYG editor could be replaced by a simple textarea. Sophisticated, huh?

But what about the 'dangerous' HTML constructs I am no longer filtering out? Since the events editor is not publicly accessible and all editors of the content are guaranteed to be trustworthy, this extra security measure is quite worthless.

The loss of the WYSIWYG editing environment is also not a problem, since editors are mostly sophisticated enough to handle this. After all, what we need to put into the HTML goes beyond mere formatting, and for our purposes a WYSIWYG environment simply doesn't suffice.

Which brings me to my main point: The best ideas come when you decide to delete something, not when you add some shiny new GUI component you probably don't even need. Less is truly more. But finding out what to throw away is the problem.

Events editor before the user clicks on the description area

Events editor with the textarea enabled for the description field

Events editor after the user clicks on the references field

Sunday 14 February 2016

Improvements to table view

Table view seems to be much liked by my two expert editors. But they did request some changes, which I have now implemented and which require some explanation.

First, they wanted to see all the text of each version, and not restrict it to the base row in those columns where it was all the same. Second, they wanted some way to order or reorder the versions, and third, they needed a way to reduce clutter by deleting rows. Also I replaced the clumsy slider with a conventional scrollbar to enable swiping on tablets. These changes have made table view much more useful, without adding significantly to its complexity.

Moving up and down

If the user clicks on a siglum in the leftmost column two small buttons appear to raise or lower that row. After 5 seconds the buttons disappear in any case. The disappearing buttons are cool because they only appear when needed and free up the display when not. But clicking on the up button moves that row above the one immediately above, and down moves it down one row. Only the up button appears on the bottom row and only the down button on the top row.

Selecting some versions

Normally the user wants to see all the versions, but if that is a bit overwhelming the rows can be reduced by deselecting them from a simple dropdown menu in the toolbar that has been added at the foot of the display. Selected versions are shown by adding a tick-mark after their names. This reappears if it is chosen again. The usual user interface method for showing a set of options is to use checkboxes, but in this case there may be very many, and it would get too confusing. So a select dropdown is used instead. The 'rebuild' button resubmits the newly selected versions and builds a reduced (or expanded) table.

Tuesday 9 February 2016

Table view

One way to demonstrate the flexibility of multi-version documents is to display the same information in several ways. Charles-harpur.org and Ecdosis now boast a table view, which displays all the versions of a work in stacked form, so an editor can quickly scan through a text to see what is a variant of what, no matter how complex the variation.

My XML-based rivals are still struggling to produce such views, but I doubt they will succeed. Their problem is that they record internal variants (deletions, additions, substitutions) inline as part of the text, and to produce a table view you have to tease apart these changes into separate layers, which is almost impossible due to markup variability. So this display, although it doesn't look all that earth-shattering, is actually unique. Also it is what textual editors have long been bugging me for.

Table view of part of The Creek of the Four Graves

To try with different poems, select a poem from the Browse menu, then click on the "table view" tab. Some poems are not uploaded yet and may not work, but most are OK. Some minor features: the table of sigla on the left is anchored. Mousing over the sigla shows their full name in case they are partly obscured. The spacing could be improved, but it is basically all there.

Monday 18 January 2016

Tree View

I added Tree view to the Ecdosis front toolset. Since a multi-version document (MVD) represents multiple versions of the same work it is pretty much like multiple variants of a genome. A phylogenetic tree is very close to a stemma describing the relationships between witnesses in a multiple-manuscript tradition. A genome is a sequence of nucleotides expressed in a four-letter vocabulary (GACT) but a single version of a historical document can also be expressed as a sequence of letters in the Unicode character set. Like genomes, historical texts are also subject to insertions, deletions, substitutions and transpositions. Hence the same tools used by geneticists ought to work for humanists also.

Distance matrices

The question arises how to generate a stemma or tree from a set of versions. Many of the phylogenetic approaches use a distance matrix: a table that describes how different each version is from every other version. Since in an MVD all the bits of text that are shared by versions and the bits that are different have already been computed, making a distance matrix is easy. The distance matrix for the four versions/layers of Abed Ben Haroun by Charles Harpur looks like this:

ABCD
A0.00.053440.106880.11508
B0.053440.00.133700.08087
C0.106880.133700.00.19323
D0.115080.080870.193230.0

Obviously the edit-distance between each version with itself is 0, which explains the diagonal of zeros in the table. The larger the number the more 'distant' it is from the other version. So here the biggest difference is between versions C and D. Of the other values only half are needed, since the distance between versions A and D is the same as the distance between D and A. But the format is traditional, and can be fed into a tree-drawing algorithm. There are many of these but one of the best distance-based methods is neighbour-join. The version I chose is a refinement of that technique published some years ago by Desper and Gascuel called 'FastME'. The tree-view is provided by the 'drawgram' program in the Phylip package, which allows different visualisations of rooted trees, since these represent most closely the humanist's stemma.

Stemmatic trees are useful even in cases where all the sources are written by the same person. What it shows in this case is which version derived from which – something that would require a lot of manual labour to discover. Often it is unclear in a collection of manuscript versions exactly which preceded which, but a phylogenetic tree makes this easy. Here's an example of Harpur's The Creek of the Four Graves. The h-numbers indicate physical versions and the internal layers are the bit added to the name introduced by a /.

How to make them

It is only really possible to make these trees from within Ecdosis. A back-end Java Web-service called Tree reads the MVD from the database and computes the distance matrix. It then builds the tree and streams it back to the Web-browser directly as an image. The controls at the bottom of the screen are contained in a JQuery module wrapped up as a Drupal module, which now form part of the Ecdosis-front collection. It is called 'tree'. There are some examples on the Charles Harpur site. Many of the 700 poems have more than one version so you should be able to select other poems from the Browse menu.

Thursday 14 January 2016

Multi-version documents and standoff properties

I have written two new papers for Digital Scholarship in the Humanities on 'standoff properties as an alternative to XML', and a second on 'Automating textual variation with multi-version documents'. Together they form the basis of a model of how I think historical documents should be encoded. The now 25 year old drive for 'standardisation' has led to something of a dead-end: people have begun to realise that it is not in fact possible to standardise the encoding of documents written on analogue media. Instead of reusability, sharability and durability, such 'standards' provide only a fertile ground for embedding private technology and interpretations into texts that cannot then be reused for any other purpose. 'Standard' encoding also fails to propose a usable solution to textual variation, which is the one feature that all historical documents share. Rather than attempting to create a new standard, this model reuses existing formats already in use worldwide: HTML, CSS, RDFa, Unicode. Although the model can be fully expressed in these formats its internal representation predisposes the data into a form that facilitates the things that digital humanists want to do with it, rather than throwing up barriers to its processing and reuse. What is needed is something simple that works. This is my attempt to explain how that can be achieved.