Monday 31 October 2016

More about the decline of XML

At ESTS Antwerp recently (5-7 October 2016) some XML aficionados thought that the StackOverflow graphic in my previous post was somehow misleading, that attendees at the conference needn't worry about the decline of XML, because it wasn't really happening. But they didn't offer any facts to counter the evidence.

XML Web services

Five years ago, in 2011, ProgrammableWeb posted a news story based on the APIs submitted to its index. It was headlined: 1 in 5 APIs say "bye" to XML. In other words, 1 in 5 APIs were already JSON-only: they offered no support for XML.

What's an 'API'? It's an index into the functionality offered by a web-service. Typically data is sent to the service in some format and returned via some other (or the same) format, such as XML or JSON.

The 2011 story was widely quoted, so I went back to the site and made my own analysis of all the JSON and/or XML APIs in their registry from 2005 to October 2016. That's a total of 4,453 APIs. Since 2011, 1 in 5 has become 4 in 5:

The trend is clear: designers of web services are going for REST/JSON, and supporting XML for legacy applications only when they can afford to do so. Very few APIs are now pure XML and, judging by this rate of decline, XML in Web services will be all but dead in 12 months' time.
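The script behind the tally above isn't shown. As a rough sketch only, this is one way the JSON-only share per year could be computed, assuming each registry entry has been reduced to a record with a `year` and a list of `formats` (both field names are hypothetical, not ProgrammableWeb's):

```javascript
// Hypothetical record shape: { year: 2016, formats: ["JSON", "XML"] }
// Returns an object mapping each year to the fraction of that year's
// JSON-and/or-XML APIs that offer JSON only.
function jsonOnlyShareByYear(apis) {
    const totals = new Map();   // year -> { all, jsonOnly }
    for (const api of apis) {
        const hasJson = api.formats.includes("JSON");
        const hasXml = api.formats.includes("XML");
        if (!hasJson && !hasXml) continue;   // other formats are not counted
        const t = totals.get(api.year) || { all: 0, jsonOnly: 0 };
        t.all += 1;
        if (hasJson && !hasXml) t.jsonOnly += 1;
        totals.set(api.year, t);
    }
    const shares = {};
    for (const [year, t] of totals) {
        shares[year] = t.jsonOnly / t.all;
    }
    return shares;
}
```

A share of 0.2 for a given year would correspond to "1 in 5", and 0.8 to "4 in 5".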

XML tools

According to the effective creator of XML, James Clark, Web services were the biggest motivation for XML in the first place. The disappearance of such a big use case for XML will inevitably result in the withdrawal of vendor support for XML products and for the open-source development projects that vendors patronize. Without the support of key open-source XML building blocks, which are not being adequately maintained as shown in the graph below, new commercial products based on XML will no longer be possible, and existing ones will break.

Releases of 8 key open-source XML tools¹

General popularity of XML

Another possible source of information about XML's decline can be found in the archives of xml.org, which is sponsored by industry giants IBM and Microsoft, and hosted by OASIS. The xml-dev newsgroup documents a marked decline of developer interest in XML since its inception, as this graph of the number of monthly posts to the group between February 1997 and October 2016 clearly shows:

A corroboration of this trend can be found in posts to the popular Slashdot.org news site which mention either an XML language or XML itself:

XML 'Mixed content'²

The use of XML for mixed content seems likely to succumb to the same trend eventually. Its decline is evidenced by falling interest in DocBook and TEI. Unlike SGML, XML was never designed to be typed manually, even in an XML editor. While interest in DocBook has plummeted to 1% of what it was 10 years ago, simpler markup languages like Markdown have risen dramatically in popularity. Niche XML vocabularies like TEI would thus seem to have no future; their survival will depend on the continued maintenance of XML tools by a tech community that is rapidly losing interest in them.

Popularity of DocBook (blue) vs Markdown (red) 2004-present

XML databases

Interest in XML document databases like Sedna, BaseX, eXist and MarkLogic (now a dual XML/JSON database) is roughly flat, but at a very low level. For every user of exist-db (XML document database) there are 3,500 mongo-db (JSON document database) users. The only possible reason to choose an XML over a JSON document database can be that the data is already in an XML format. All four surviving XML databases show a slight decline since 2014, probably due to companies gradually migrating their documents to other formats.

Native XML databases popularity 2012-2016

XSLT transformation language

XSLT, the XML transformation language, seems to be dying even faster than XML itself. The clumsiness of its syntax, combined with its inability to directly transform HTML5 (although it can write it) has doubtless contributed to its demise.

Conclusion

In conclusion, building software based on XML for the future is a risky business. In spite of its ubiquitous use only a few years ago, XML has been cut off from mainstream development by a tech community focused on HTML5/JSON/CSS/JavaScript. HTML and RDFa were once big use cases for XML that now no longer require it. Adrift on its own, having to justify itself on its few merits and many drawbacks, the future of XML looks bleak.

¹ Cocoon, fop, libxml, libxml2, xalan-c, xalan-j, xerces-c, xerces-j

² 'Mixed content' is when tags may appear in text and text in tags

Tuesday 25 October 2016

Formatting poetry for the Web

HTML was designed for the display of business documents, not poetry. In HTML, text is composed out of a succession of flow elements, each of which contains a series of phrasing elements. So an element like <p> is a flow-element, and <span> is phrasing content. <div> is like the joker in cards: it can enclose anything.

Stanzas

Let's consider the first problem, how to encode stanzas:

Ah! Daniel mine, some Muse malign
    Hath skimm'd thy judgments cream away
But take a slice of "good advice"---
    Even that I proffer thee today.

If the stanza here were enclosed in a <pre> then all the lines would have the same indentation, and this could not be corrected using CSS. You could use spaces at the start of each line, but with variable-width fonts this looks awful, and you have no control over indentation when fonts are substituted by the browser. In CSS you can instead use the white-space: pre property to make any flow element behave like <pre> anyhow, so <pre> is not needed, especially as it uses a monospaced font by default.

An obvious alternative would seem to be <div>, which can enclose anything. So a <div class="stanza"> would be a good choice. Equally <p> is also possible, so long as it encloses only phrasing content. (A <p> can't enclose another <p>, so we can't use <p> to represent lines if <p> is already used for stanzas.) However, <p> has the distinct advantage of being the direct result of translation of Markdown's double line-breaks to separate paragraphs. This allows us to type poems online using a very simple Markdown-like syntax, and then translate it into suitable HTML. Stanzas can be styled using a CSS selector that selects all <p>-elements enclosed in a single <div class="poem">, so you don't have to keep typing "<div class="stanza">" every four lines or so.
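The double-line-break rule can be sketched in a few lines of JavaScript. This is only an illustration, not the translator used on this site: the function names `escapeHtml` and `poemToStanzas` are made up, and the only thing assumed about the input is that blank lines separate stanzas.

```javascript
// Escape the characters that are special in HTML text.
function escapeHtml(s) {
    return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Split a plain-text poem into stanzas at blank lines and wrap each
// stanza in <p>, all inside a single <div class="poem">.
function poemToStanzas(text) {
    const stanzas = text
        .replace(/\r\n/g, "\n")              // normalise Windows line-endings
        .split(/\n(?:[ \t]*\n)+/)            // one or more blank lines
        .map(s => s.replace(/^\n+|\n+$/g, ""))
        .filter(s => s.trim().length > 0);
    const paras = stanzas.map(s => "<p>" + escapeHtml(s) + "</p>");
    return '<div class="poem">\n' + paras.join("\n") + "\n</div>";
}
```

Line-breaks and leading spaces inside each stanza are left untouched, so they only show up on screen if the <p> elements are styled with white-space: pre.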

Headings

Often in poems headings are centred. Unfortunately, poetic lines are typically much shorter than the screen-width, and since the enclosing <div> will fill all the available width on screen, it will push the heading to the right of the text. So it won't be centred in any meaningful sense. The fix is to write a little JavaScript to measure each line of poetry, then adjust the width of the <div> so that it is slightly wider than the longest line. Something like this:

<head>
...
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js"></script>
<script type="text/javascript">
// Once the page has loaded, shrink the poem's <div> to fit its longest line
$(document).ready(function(){
    var maxWidth=0;
    var lines = $("div p span");
    for ( var i=0;i<lines.length;i++ ) {
        var w = lines.eq(i).width();
        if ( w > maxWidth )
            maxWidth = w;
    }
    // leave a little slack so that the longest line does not wrap
    $("div.poem").width(maxWidth+10);
});
</script>
</head>
<body>
<div class="poem">
...
</div>
</body>

Lines

Neither the <pre>-tag, which is used to format computer-code, nor the line-break tag, <br>, provides control over the indenting of specific lines. It is a common mistake of XML technicians to encode lines as "<lb/>", rather than <l>...</l>, to avoid common issues of overlap with other elements like <del> (deleted). Once a poetic line has been encoded using an empty XML-element like this it can only be translated into HTML's <br> tag, which incurs the problems just mentioned. Hence lines are better represented as <span>s within a stanza (a <p> or <div>) where white-space has been set to "pre". Lines of various indentations can then be represented by defining classes of spans such as <span class="line"> or <span class="line-indent1"> etc.
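As a sketch of how a typed line might be turned into one of these spans: the function below counts the leading spaces and picks a class from them. The name `lineToSpan`, the four-spaces-per-level rule and the numbering of the classes are all assumptions made for the example.

```javascript
// Turn one line of a poem into a <span> whose class reflects its
// indentation. indentUnit is the number of leading spaces per level
// (an assumed convention).
function lineToSpan(line, indentUnit = 4) {
    const leading = line.match(/^ */)[0].length;
    const level = Math.floor(leading / indentUnit);
    const cls = level === 0 ? "line" : "line-indent" + level;
    // escape HTML specials so the text cannot be mistaken for markup
    const text = line.trim()
        .replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
    return '<span class="' + cls + '">' + text + "</span>";
}
```

The indent itself is then supplied by CSS on each class, so it no longer depends on the width of a space in whatever font the browser ends up using.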

Italics etc

For character formats ("phrasing content") you need to use classes, so <span class="underlined">that</span> can be styled to be in italics. If you use <i> or <em> you can't control the text appearance so well. For example, you might have stage-directions, foreign words etc that need different formatting.
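If the poem is typed in a Markdown-like syntax, the same idea can be applied when translating inline marks. A minimal sketch, assuming `_..._` marks underlined text and `^...^` marks a superscript; the function name and the "superscript" class are invented for the example:

```javascript
// Replace simple inline marks with classed spans:
//   _text_  -> <span class="underlined">text</span>
//   ^text^  -> <span class="superscript">text</span>  (class name assumed)
function inlineToSpans(text) {
    return text
        .replace(/_([^_]+)_/g, '<span class="underlined">$1</span>')
        .replace(/\^([^^]+)\^/g, '<span class="superscript">$1</span>');
}
```

How "underlined" actually looks (italics, say) is then decided in the stylesheet, not in the text.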

Special characters

To get a really professional look simple typewriter codes like " and ' need to be translated into their curly equivalents. The same goes for dashes like ---, which becomes —.
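A rough sketch of such a translation follows. It is a simple heuristic, not a full "smart quotes" implementation: a quote at the start of the text, or after whitespace or an opening bracket, is treated as opening, and everything else as closing or as an apostrophe. The function name `curlify` is made up.

```javascript
// Convert typewriter quotes and triple hyphens to their typographic forms.
function curlify(text) {
    return text
        .replace(/---/g, "\u2014")                 // three hyphens -> em dash
        .replace(/(^|[\s(\[{])"/g, "$1\u201C")     // opening double quote
        .replace(/"/g, "\u201D")                   // remaining doubles are closing
        .replace(/(^|[\s(\[{])'/g, "$1\u2018")     // opening single quote
        .replace(/'/g, "\u2019");                  // apostrophes and closing singles
}
```

One known weakness: a word-initial apostrophe that follows a space (as in 'twas) would be taken for an opening quote, so cases like that still need checking by hand.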

Putting it all together

The whole design looks like this. You can change the indents, or define extra classes for a wider variety of indented lines. I use up to six. To copy the design just use "view source" in your browser.

To Twank.¹

Ah! Daniel mine, some Muse malign
    Hath skimm’d thy judgments cream away
But take a slice of “good advice”—
    Even that I proffer thee today.

Again read Shakespear by the hour,—
    Read Milton more—McDonald less—
And Wordsworth for his simple power,
    Not for his namby-pamby-ness.

And know,—’twere better to esteem
    What’s best in Byron’s godless “Don”
Than with crude Browning much to dream,
    Or wire-draw through with Tennyson.

And better at the “woes of Moore”
    To shed the artifical tear,
Than doat with eunuch passion o’er
    The feeble beauties of De Vere.

The above poem was automatically formatted by simply translating a Markdown representation into HTML. This is what the user actually typed:

To Twank.
==========

    Ah! Daniel mine, some Muse malign
        Hath skimm'd thy judgments cream away
    But take a slice of "good advice"---
        Even _that_ I proffer thee today.

    Again read Shakespear by the hour,---
        Read Milton more---M^c^Donald less---
    And Wordsworth for his simple power,
        Not for his namby-pamby-ness.

    And know,---'twere better to esteem
        What's best in Byron's godless "Don"
    Than with crude Browning much to dream,
        Or wire-draw through with Tennyson.
    
    And better at the "woes of Moore"
        To shed the artifical tear,
    Than doat with eunuch passion o'er
        The _feeble_ beauties of De Vere.

Now I call that easy.

¹ Charles Harpur, 'To Twank' Empire 14 March 1860. 'Twank' was a nickname of Daniel Deniehy

About this blog

This blog is a technical record of my attempts to create a first-class website for ecdosis.net. This will be a revision of www.digitalvariants.org and is intended to incorporate genetic texts in the MVD (Multi-Version Document) format. It will be the first website to allow the user to view and edit original texts with all their raw corrections, revisions, and variant versions as they were truly meant to be: as multi-version texts. A lot of people have talked about the theoretical possibility of doing this, but the tools they choose are not up to the task. In fact the history of Digital Humanities is all about shoehorning humanistic problems into off-the-shelf technical solutions that don't fit. This project, on the other hand, is about breaking free from the limitations of mere markup and database structures to represent the true nature of originally analog documents.
