Tuesday, 25 October 2016

Formatting poetry for the Web

HTML, the language of most documents on the Web, was designed for the display of business documents, not poetry. In HTML, documents are basically composed out of a succession of flow elements, each of which contains a series of phrasing elements. So an element like <p> is a flow-element, and <span> is phrasing content. <div> is like the joker in cards: it can enclose anything.


Let's consider the first problem, how to encode stanzas:

Ah! Daniel mine, some Muse malign Hath skimm'd thy judgments cream away But take a slice of "good advice"--- Even that I proffer thee today.

If the stanza here was enclosed in a <pre> then all the lines would have the same indentation, and this could not be corrected using CSS. You could use spaces at the start of each line, but with variable-width fonts this looks awful and you have no control over indentation when fonts are substituted by the browser. In CSS you can instead use the white-space: pre property to make any flow element behave like <pre> anyhow, so <pre> is not needed, especially as it uses a monospaced font by default.

An obvious alternative would seem to be <div>, which can enclose anything. So a <div class="stanza"> would be a good choice. Equally <p> is also possible, so long as it encloses only phrasing content. (A <p> can't enclose another <p>, so we can't use <p> to represent lines if <p> is already used for stanzas.) However, <p> has the distinct advantage of being the direct result of translation of Markdown's double line-breaks to separate paragraphs. This allows us to type poems online using a very simple Markdown-like syntax, and then translate it into suitable HTML. Stanzas can be styled using a CSS selector that selects all <p>-elements enclosed in a single <div class="poem">, so you don't have to keep typing "<div class="stanza">" every four lines or so.


Often in poems headings are centred. Unfortunately, poetic lines are typically much shorter than the screen-width, and since the enclosing <div> will fill all the available width on screen, it will push the heading to the right of the text. So, it won't be centered in any meaningful sense. The fix is to write a little Javascript to measure each line of poetry, then adjust the width of the <div> so that it is slightly wider than the longest line. Something like this:

<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js"></script>
<script type="text/javascript">
    var maxWidth=0;
    var lines = $("div p span");
    for ( var i=0;i<lines.length;i++ ) {
        var w = lines.eq(i).width();
        if ( w > maxWidth )
            maxWidth = w;
<div class="poem">


Neither the <pre>-tag, which is used to format computer-code, nor the line-break tag, <br>, provides control over the indenting of specific lines. It is a common mistake of XML technicians to encode lines as "<lb/>", rather than <l>...</l>, to avoid common issues of overlap with other elements like <del> (deleted). Once a poetic line has been encoded using an empty XML-element like this it can only be translated into HTML's <br> tag, which incurs the problems just mentioned. Hence lines are better represented as <span>s within a stanza (a <p> or <div>) where white-space has been set to "pre". Lines of various indentations can then be represented by defining classes of spans such as <span class="line"> or <span class="line-indent1"> etc.

Italics etc

For character formats ("phrasing content") you need to use classes, so <span class="underlined">that</span> can be styled to be in italics. If you use <i> or <em> you can't control the text appearance so well. For example, you might have stage-directions, foreign words etc that need different formatting.

Special characters

To get a really professional look simple typewriter codes like " and ' need to be translated into their curly equivalents. The same goes for dashes like ---, which becomes —.

Putting it all together

The whole design looks like this. You can change the indents, define extra classes for a wider variety of indented lines. I use up to six. To copy the design just use "display source" in your browser.

To Twank.1

Ah! Daniel mine, some Muse malign Hath skimm’d thy judgments cream away But take a slice of “good advice”— Even that I proffer thee today.

Again read Shakespear by the hour,— Read Milton more—McDonald less— And Wordsworth for his simple power, Not for his namby-pamby-ness.

And know,—’twere better to esteem What’s best in Byron’s godless “Don” Than with crude Browning much to dream, Or wire-draw through with Tennyson.

And better at the “woes of Moore” To shed the artifical tear, Than doat with eunuch passion o’er The feeble beauties of De Vere.

The above poem was automatically formatted by simply translating a Markdown representation into HTML. This is what the user actually typed:

To Twank.

    Ah! Daniel mine, some Muse malign
        Hath skimm'd thy judgments cream away
    But take a slice of "good advice"---
        Even _that_ I proffer thee today.

    Again read Shakespear by the hour,---
        Read Milton more---M^c^Donald less---
    And Wordsworth for his simple power,
        Not for his namby-pamby-ness.

    And know,---'twere better to esteem
        What's best in Byron's godless "Don"
    Than with crude Browning much to dream,
        Or wire-draw through with Tennyson.
    And better at the "woes of Moore"
        To shed the artifical tear,
    Than doat with eunuch passion o'er
        The _feeble_ beauties of De Vere.

Now I call that easy.

1 Charles Harpur, 'To Twank' Empire 14 March 1860. 'Twank' was a nickname of Daniel Deniehy

Monday, 10 October 2016

More about the fall of XML

At ESTS Antwerp recently (5-7 October 2016) some XML aficionados thought that the StackOverflow graphic in my previous post was somehow misleading, that attendees at the conference needn't worry about the decline of XML, because it wasn't really happening. But they didn't offer any facts to counter the evidence.

Five years ago in 2011 on ProgrammableWeb there was a news story posted based on APIs submitted to their index, that stated: 1 in 5 APIs say "bye" to XML, that is, 1 in 5 APIs were already JSON only: that is, they offered no support for XML.

What's an 'API'? It's an index into the functionality offered by a web-service. Typically data is sent to the service in some format and returned via some other (or the same) format, such as XML or JSON.

The 2011 story was widely quoted, so I went back to the site and created my own breakdown of their latest 2016 data. Since 2011 1 in 5 has now become 3 in 5:

The trend is clear: designers of web services are going for REST/JSON and only supporting XML legacy applications when they can afford to do so. Only 1 new API was pure XML and that was in fact written in March 2015 (NASA GIBS API) in conformance with earlier XML services.

According to the effective creator of XML, James Clark, Web services were the biggest raison d'ĂȘtre for XML in the first place. The disappearance of such a big usage case for XML will inevitably result in the withdrawal of vendor support for XML products and open source development projects that they patronize.

Judging by this rate of decline, XML in Web services at least will be all but dead in 5 years time, or sooner. There seems little chance of it "bottoming out", since the prevailing technology will pull other vendors into conformance, just as XML did 10 years ago.

The use of XML as a pure document format seems likely to succumb to the same trend eventually. Its decline is evidenced by falling interest in DocBook and TEI. These technologies are basically by-products of XML's main use in web-services, and would seem to be unviable on their own.

Interest in XML document databases like Sedna, BaseX, eXist and MarkLogic (now a dual XML/JSON database) is roughly flat, but at a very low level. They don't even begin to compete with their JSON database counterparts like Mongo.

The use of XML as an office document format is a special case, since users are mostly unaware that their data is being read and written in XML. Microsoft and Ecma's OOXML may never become an ISO standard. However, there is no obvious substitute for XML in office applications, other than a return to the binary formats they replaced.

XSLT, the XML transformation language, seems to be dying even faster than XML itself. The clumsiness of its syntax, combined with its inability to directly transform HTML5 (although it can write it) has doubtless contributed to its demise.

In conclusion, building software based on XML for the future is a risky business. In spite of its ubiquitous use only a dozen years ago, XML has been cut off from mainstream development by a tech community focused on JSON/CSS/Javascript. HTML and RDFa were once big use cases for XML that now no longer require it. Adrift on its own, having to justify itself on its own merits, the future of XML looks bleak.

Tuesday, 20 September 2016

The fall of XML

Talk to software developers today and they will tell you that 'XML is toast'. XML has not been replaced by any single technology. It is not JSON that has killed off XML; it is the mobile Web and associated technologies. Digital humanists who think that XML is here to stay, and imagine that they can continue to build software on top of it, should take a look at the following graphic, derived from Stackoverflow, one of the most popular discussion forums for software developers. HTML, Javascript, JSON and CSS have collectively supplanted XML, and these technologies no longer have any need of it. You may say 'Who cares what software developers think?' But they are the guys who build and maintain the tools that digital humanists use. If they abandon XML then those tools will soon perish or become obsolete, disconnected from the services they were designed to support.

When the World's way is running East,
    Keep your way running West;
And it is two to one, at least,
    That yours will be the best.

Charles Harpur

Tuesday, 26 July 2016

Sync-scrolling images and text when editing transcriptions

When transcribing original documents it is helpful to have the page image next to the transcription as it is being written. That way the transcriber can see what the next word to transcribe is, or quickly check for mistakes in some one else's transcription. Most documents contain more than one page, so this gives rise to a problem: how can the relevant page image be placed next to the relevant portion of the transcription so that the transcriber can easily see what corresponds to what?

One obvious solution is to transcribe page by page. For each image show only the transcription of that page. However, this creates a problem for the technician and user alike: now the transcription of the document is divided into parts which must be stitched back together not only by the computer when the document is saved, but also mentally by the transcriber. Pages rarely end at paragraph end. More often they split in the middle of a sentence or even a word. The transcriber may change only one page and then the computer must reassemble the document with that one altered page in its middle. If the page is marked up in some way the page's transcription may not be complete or well-formed, which would hamper editing. All this is both technically messy and counter-intuitive for the user.

A better method is to display the entire document for editing: both the transcription as a continuous running text and the page images to which it corresponds as a scrolling list. The reason this is not often done is because of an intrinsic alignment problem: how to find the part of an image that corresponds to the currently displayed piece of text. To be readable page-images may need to be higher than the screen. Typically the transcription of a page is much shorter. But as a general rule the centre of the page's transcription should be aligned with the centre of the corresponding page-image across the centre of the screen. This is what the user expects. However, this creates a problem: the first and last pages cannot possibly conform to this rule. The first page must be aligned so that the top of the page image aligns with the top of the transcription text. And likewise at the bottom: on the last page the end of the transcription must correspond with the bottom of the last page-image.

Live sync-scrolling

To achieve live sync scrolling we need a table or function that gives the left-hand scroll position for each possible right hand scroll position.

What we musn't do is make the list of images itself scrollable. If we do that then we will have to link its own scrolling with that of the scrolling text. Since the right-hand-side (RHS) controls the scrolling of the left-hand-side (LHS) we will get infinite feedback if we link the scrolling in both directions. 'Scrolling' on the LHS can be achieved by other means, for example, by varying the CSS 'top' property of the overall list of images.

The scrolling positions for the LHS are just the mid-points of each image in the overall list. These correspond to the mid-points for each page of text in the RHS. The latter can be found easily by parsing the text. In my case, since I use a minimal markup language (MML) page-starts are marked by [NNN] where NNN is the page-number on a line by itself. Any scrolling position in-between two of these corresponding values can be interpolated by scaling. However, this does not work for the first and last pages because the desirable alignments in these cases are the top of the first image with the top of the first page of text, and likewise the end of the text with the bottom of the last image. So my solution was just to replace the mid-points of the first page in both the LHS and RHS with half the window-height. Likewise for the last mid-points I used the length of the text and the length of the image-list minus half the window height. In some cases 'half-way from the top or bottom of the window' may be in the middle of a page that is not the first or last. In that case the overlapping values can simply be removed, as long as the ones at the extreme ends of the list are preserved.

There is a demo of this method on Charles Harpur, in the test-interface for the letter from W.A. Duncan to Henry Parkes dated around 1841.

Tuesday, 23 February 2016

Improvements to events editor

Events are things that happen in the life of your author. One of the successes of the AustESE project was the realisation that such events would best be represented as database records, so that biographical information could be rearranged into various useful forms. Events have a 'fuzzy' date more often than a precise one. So 'ca. 1834' or 'before February 1865' is what you would expect as the date of an event, not 26/12/1845. And events can have a description and a list of references. These two are represented as simple HTML, a globally interoperable standard for mixed content. So forget about XML, which only serves as a preliminary to making HTML.

To WYSIWYG or not?

This is where the problems started. AustESE used a sophisticated HTML editor that filtered out potentially dangerous HTML constructs that hackers could use to implant code exploits. But those tags and attributes also happen to be quite useful for building a website. For example, the title attribute or special data-attributes on a link could be used to animate a popup image. Unfortunately the editor stripped all these out when the user saved, and the images would disappear. So I swapped it for a simple HTML editor, on the grounds that users would still want to see a WYSIWYG preview of their HTML before saving it. But that didn't work out any better. Since a preview is created when the user clicks on another element it already reverts to an effective 'preview' that won't be saved until the user clicks the 'save' button. So the elaborate WYSIWYG editor could be replaced by a simple textarea. Sophisticated, huh?

But what about the 'dangerous' HTML constructs I am no longer filtering out? Since the events editor is not publicly accessible and all editors of the content are guaranteed to be trustworthy, this extra security measure is quite worthless.

The loss of the WYSIWYG editing environment is also not a problem, since editors are mostly sophisticated enough to handle this. After all, what we need to put into the HTML goes beyond mere formatting, and for our purposes a WYSIWYG environment simply doesn't suffice.

Which brings me to my main point: The best ideas come when you decide to delete something, not when you add some shiny new GUI component you probably don't even need. Less is truly more. But finding out what to throw away is the problem.

Events editor before the user clicks on the description area

Events editor with the textarea enabled for the description field

Events editor after the user clicks on the references field

Sunday, 14 February 2016

Improvements to table view

Table view seems to be much liked by my two expert editors. But they did request some changes, which I have now implemented and which require some explanation.

First, they wanted to see all the text of each version, and not restrict it to the base row in those columns where it was all the same. Second, they wanted some way to order or reorder the versions, and third, they needed a way to reduce clutter by deleting rows. Also I replaced the clumsy slider with a conventional scrollbar to enable swiping on tablets. These changes have made table view much more useful, without adding significantly to its complexity.

Moving up and down

If the user clicks on a siglum in the leftmost column two small buttons appear to raise or lower that row. After 5 seconds the buttons disappear in any case. The disappearing buttons are cool because they only appear when needed and free up the display when not. But clicking on the up button moves that row above the one immediately above, and down moves it down one row. Only the up button appears on the bottom row and only the down button on the top row.

Selecting some versions

Normally the user wants to see all the versions, but if that is a bit overwhelming the rows can be reduced by deselecting them from a simple dropdown menu in the toolbar that has been added at the foot of the display. Selected versions are shown by adding a tick-mark after their names. This reappears if it is chosen again. The usual user interface method for showing a set of options is to use checkboxes, but in this case there may be very many, and it would get too confusing. So a select dropdown is used instead. The 'rebuild' button resubmits the newly selected versions and builds a reduced (or expanded) table.

Tuesday, 9 February 2016

Table view

One way to demonstrate the flexibility of multi-version documents is to display the same information in several ways. Charles-harpur.org and Ecdosis now boast a table view, which displays all the versions of a work in stacked form, so an editor can quickly scan through a text to see what is a variant of what, no matter how complex the variation.

My XML-based rivals are still struggling to produce such views, but I doubt they will succeed. Their problem is that they record internal variants (deletions, additions, substitutions) inline as part of the text, and to produce a table view you have to tease apart these changes into separate layers, which is almost impossible due to markup variability. So this display, although it doesn't look all that earth-shattering, is actually unique. Also it is what textual editors have long been bugging me for.

Table view of part of The Creek of the Four Graves

To try with different poems, select a poem from the Browse menu, then click on the "table view" tab. Some poems are not uploaded yet and may not work, but most are OK. Some minor features: the table of sigla on the left is anchored. Mousing over the sigla shows their full name in case they are partly obscured. The spacing could be improved, but it is basically all there.