Saturday 8 December 2018

The trouble with DIVs

A <div> is a division in an HTML or XML document that can be nested. It is typically used to provide a higher-level structure to a document that would otherwise be just a succession of paragraphs and character ranges. But here is the key point: word-processors don't use DIVs. Everyone has used a word-processor like Word, and we don't notice any restrictions on what we can do in those programs. Quite the contrary: usually there are too many formatting options to choose from. So what exactly are DIVs good for, and can we get rid of them altogether?

DIVs would appear to be useful for two main reasons:

  1. They provide a logical organisation of an otherwise complexly marked-up document that allows the encoder to apply a divide-and-conquer strategy to getting the job done. Each DIV corresponds to some kind of logical unit: a section of notes added to a poem, a prologue, a title page etc.
  2. They can be used to provide extra formatting to sections of a document. We might want to add extra white space at the end of a poem and the start of its notes. So we can attach that white space to the DIVs in question.

The first point is a way to overcome the inherent complexity of markup, of which DIVs themselves form a part. They are thus self-justifying. In a WYSIWYG editing environment DIVs are not only unnecessary but also greatly increase the complexity of the user interface. That is bad news for documents that need to be created online in a crowd-sourcing scenario.

The second use of DIVs can be met in CSS by just adding extra space or special formats to the last or first instance of a class of paragraph.

So neither requirement adds any significant functionality to editing itself, and DIVs would thus appear to be entirely dispensable. Now this runs counter to what all the XML aficionados keep telling me: that plain text doesn't have enough expressiveness. They use this as a justification for complex XML. But in our textual model variants and other alternatives – the main reason for complex markup – are expressed through layers and versions, leaving each version of a document (a draft, a stage in its correction etc.) quite simple already. If we can also dispense with DIVs, then at least 95% of the complexity of XML can be dispensed with, and hence we do not need XML at all.

A different model of text

HTML uses a textual model built on a weak hierarchy of DIVs, paragraphs and characters. For cultural heritage (historical) texts what you actually need is a hierarchy of paragraphs, lines and character-ranges. And this applies not only in poetry, which is structured around lines and stanzas, but also in prose where transcriptions of historical texts should preserve the line-breaks of the originals. This greatly facilitates transcription and checking, and also allows a digital reconstruction of the textual content.

In our WYSIWYG web-editor we use single line-breaks to indicate a new line and double line-breaks to indicate a new paragraph. There are no DIVs. Formats can be applied to paragraphs, lines and character ranges.
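The line-break convention described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the editor's actual code: the function name parse_paragraphs and the choice to ignore whitespace-only lines between paragraphs are my assumptions.

```python
import re
from typing import List


def parse_paragraphs(text: str) -> List[List[str]]:
    """Split plain text into paragraphs, each one a list of lines.

    Hypothetical helper: a single line-break starts a new line,
    a double (or longer) line-break starts a new paragraph.
    """
    # Normalise Windows and old Mac line endings so that only "\n" remains.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A paragraph ends at a line-break followed by one or more blank
    # (empty or whitespace-only) lines.
    blocks = re.split(r"\n(?:[ \t]*\n)+", text.strip())
    # Inside a paragraph every remaining line-break separates two lines.
    return [block.split("\n") for block in blocks if block]


sample = "First line\nSecond line\n\nA new paragraph"
for number, lines in enumerate(parse_paragraphs(sample), start=1):
    print(number, lines)
```

Note that the result is a flat list of paragraphs, each holding its lines: there is no level above the paragraph, which is exactly the point about DIVs.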

Ecdosis WYSIWYG editor

On the top left there is a dropdown list of documents to edit. Below that is a list of applicable formats divided into the three categories already mentioned. Next to that is the version menu, by which one can select a version of the document to edit. (There might be several, e.g. a newspaper printing of a poem, a manuscript and a book version.) On the right-hand side there is a layer tab. This represents the final state of the text. Other, earlier states of the same version can be created by clicking the plus-tab. Layers are edited individually. The save button saves one layer of one version of one document. Publish combines all the versions and layers into one document for viewing on the web. I am building a sandbox tutorial site where I am putting up examples for training in the Ecdosis editing system.
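The relationship between documents, versions and layers can be pictured with a small Python sketch. It is a toy model only: the class names Version and Document, the methods save and publish, and the use of plain strings for layers are all assumptions made for illustration, and say nothing about how Ecdosis actually stores or merges text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Version:
    # One plain-text string per layer (hypothetical representation).
    layers: List[str] = field(default_factory=list)


@dataclass
class Document:
    # Versions keyed by a short name such as "manuscript" or "newspaper".
    versions: Dict[str, Version] = field(default_factory=dict)

    def save(self, version_id: str, layer_index: int, text: str) -> None:
        """Save one layer of one version, leaving everything else untouched."""
        version = self.versions.get(version_id)
        if version is None:
            version = Version()
        if layer_index < 0 or layer_index > len(version.layers):
            raise IndexError("layer index out of range")
        if layer_index == len(version.layers):
            version.layers.append(text)        # a new layer (the plus-tab)
        else:
            version.layers[layer_index] = text  # overwrite an existing layer
        self.versions[version_id] = version

    def publish(self) -> Dict[str, List[str]]:
        """Gather all versions and layers into one structure for viewing."""
        return {name: list(v.layers) for name, v in self.versions.items()}


poem = Document()
poem.save("manuscript", 0, "Frist draft")
poem.save("manuscript", 1, "First draft, corrected")
poem.save("newspaper", 0, "Printed text")
print(poem.publish())
```

Each save touches exactly one layer, so every individual text stays simple, and the combination of versions and layers happens only at publish time.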

Saturday 20 October 2018

The Current State of the Digital Scholarly Edition and Three Challenges

An examination of the leading DSEs on the Web reveals that the DSE is moving away from providing a reliable text for scholarly purposes and towards a collection of interactive tools that facilitate the kinds of queries scholars wish to make about texts. Over the past 16 years not as much progress in the development of the DSE has been achieved as might have been expected during a time of significant developments in interactive media on the Web. Leading DSEs have established a suite of seven interactive components:

  1. Text and facsimile side by side
  2. A timeline of events in the life of the author
  3. Side by side textual comparison
  4. Table view (stacking of variants analogous to the critical apparatus)
  5. Searching
  6. Manuscript viewer
  7. Annotations.

Although there exist several editions that display a number of these features, generic tools for creating DSEs have not yet implemented them. To break out of this limitation three problems first need to be overcome: 1. The bias inherent in current approaches towards encoding the document at the expense of the work; 2. The achievement of true interoperability of texts and tools; 3. The pending obsolescence of the main encoding format, XML. While the design and composition of the modern DSE has been broadly mapped out over the past 30 years, its future development must take account of the limitations encountered in reaching this goal.

For the full text see: The Current State of the Digital Scholarly Edition and Three Challenges, in Domenico Fiormonte, Per una critica del testo digitale, Rome: Bulzoni, pp. 181-199.

Thursday 22 March 2018

The end of XML

XML is now 20 years old. We might expect the first author of the XML 1.0 specification, Tim Bray, to be enthusiastic about XML's achievements and excited about its prospects for the future. Not a bit of it. In a limp endorsement on xml.com Tim tries diplomatically to think up some nice things to say about XML. But by the end of the article he lets out his true feelings:

People did a lot of that with XML just because there was no other alternative and, well… while it worked, you could do better, and in fact we have done better, for weak values of “better”. I wonder if we’ll ever do better still? As the editor of the IETF JSON RFCs, I’m a pessimist.

It’s been OK

Seriously; XML made a lot of things possible that previously weren’t. It has extended the lifetime of big chunks of our intellectual and legal heritage. It’s created a lot of interesting jobs and led to the production of a lot of interesting software. We could have done better, but you always could have done better.

Happy birthday!

I don't think there are any candles on the cake.

HTML is the new XML

More evidence of the disappearance of XML can be found in the new HTML Imports standard published by the W3C. Remember XInclude?

This document specifies a processing model and syntax for general purpose inclusion. Inclusion is accomplished by merging a number of XML information sets into a single composite infoset.

Now we have HTML Imports:

HTML Imports are a way to include and reuse HTML documents in other HTML documents.

Sound familiar? Of course they are not the same, because XML and HTML are not the same. But the basic need that existed when they were busy defining XML standards is still there, and this is yet another case where HTML is taking on the capabilities of XML. The others are RDFa being redefined for HTML, HTML5 being independent of SGML/XML, and the CSS 3 Paged Media Module replacing XSL formatting objects. Does the list go on beyond what I know? Probably. One thing is clear: HTML is being made more and more into a replacement for XML in all things. In a couple more years people will even be asking: 'what is XML?' And the Museum Guide will point with a stick to a funny page of complex markup and everyone will go 'Ooooh!'