Tuesday 10 November 2015

Why digital editions lack longevity (and how to fix it)

If there is one point on which all theorists agree, it is that the software component of the digital scholarly edition is ephemeral:

'compared to books -- in particular, compared to scholarly editions -- software lives out its life in but the twinkling of an eye.'1
'computer related technologies are hardly renowned for their longevity'2
'... the price seems to be the interface: while the digital scholarly community has developed meaningful ways to support the longevity of the dataset, the same cannot be said about interfaces.'3
'how can the continued—and useful—existence of a system or tool be guaranteed, or at least facilitated, once a project's funding has been spent?'4

The client-server model

But it may be a mistake to point the finger of blame so quickly at the graphical user interface, when the real culprit is the ubiquitous client-server model, in which the function of the application is split between the web-server and the browser.

In Web 1.0 almost all the action happened on the server. Since rendering in the client was unreliable, it was considered virtuous to create Web pages on the server and send them down to the client ready-made. Users were even told to 'turn off Javascript' for safety reasons; interactivity was regarded as a superfluous enhancement anyway.

Client-server model: Web 1.0

With the arrival of consistent cross-browser functionality in Web 2.0, developers began to build much more interactivity into the browser. Now the user could participate in the creation of Web content. Facebook, blogs, Twitter and all the rest happened. The Web went mobile, but the humble Web application, and with it the client-server model, wasn't all that deeply changed. The same tools were available to developers as in Web 1.0, and they did the same things with them: putting the hidden functionality into the server component, and leaving the interactive stuff for the browser and the UI people to figure out. It was roughly a 50-50 division of labour.

Client-server model: Web 2.0

So, what's this got to do with digital editions?

A lot, as it turns out. The Web is under constant attack from hackers trying to gain control of servers, and as a result the server software is constantly updated. It is practically guaranteed that, once a digital edition built for the Web is finished, it will fail within six months to two years of the end of the project. Digital humanists are funded by grants, and when the grant money runs out, as the authors quoted above point out, there is no one left to maintain the edition. Then 'poof!' goes the digital edition. It drops off the Net, the code breaks, and all that is left are the source data files, which need that very software to process them. And since only the authors have access to them, they are nearly as useless as the software.

But the problem here is not a lack of longevity in the GUI software, because that is all based on HTML, CSS and Javascript. To my knowledge none of those technologies has yet undergone a change that is not backwards-compatible. The original web page made by Tim Berners-Lee still works perfectly in modern browsers that have a zillion more features than he had back then. Extrapolating from that, a digital edition made today with just those technologies and no others could be expected to live on more or less indefinitely. Why? Because those technologies are too big to fail. There are around 40 billion web pages on the Internet, and a sizeable portion of them would break if there were any change to the underlying technology that was not backwards-compatible. So, as they say, 'it ain't gonna happen'.

On the other hand, server software changes in incompatible ways all the time. My old Java programs don't work any more because 'deprecated' features are eventually withdrawn. But the worst problems concern security changes which are automatically applied to the server operating system by administrators. Failure to keep those security measures up to date will result, sooner or later, in the hacking of your digital edition. So it will die if you do not constantly spend money on it. Corporate users love the client-server model because it allows them to keep one step ahead of the competition. The constant updates delight them, and they have permanent staff to carry them out. But we don't. And as digital humanists we are left with their tools to do our work.

What can be done about it?

The answer lies in moving all the functionality into the client. With open source software there is no need to hide anything on the server. So we rewrite all the server code in Javascript, move it to the client, and turn the server into a dumb service that just gets and saves data. This gives us three advantages: archivability, interoperability and durability.
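
To make this concrete, here is a minimal sketch of what such a 'dumb' service might look like, written in Node.js purely for illustration; the folder name, URL handling and port are assumptions, not the code of any actual project:

    // dumb-server.js -- a sketch of a server that only gets and saves files.
    // All of the edition's real logic (rendering, search, navigation) lives in the browser.
    const http = require('http');
    const fs = require('fs');
    const path = require('path');

    const ROOT = path.join(__dirname, 'edition');   // hypothetical folder holding the edition

    http.createServer((req, res) => {
      // crude path handling: normalising an absolute URL path keeps it inside ROOT
      const target = path.join(ROOT, path.normalize(req.url === '/' ? '/index.html' : req.url));
      if (req.method === 'PUT') {
        // save: write the request body to the matching file
        let body = '';
        req.on('data', chunk => { body += chunk; });
        req.on('end', () => {
          fs.writeFile(target, body, err => {
            res.writeHead(err ? 500 : 200);
            res.end();
          });
        });
      } else {
        // get: return the file verbatim, no processing at all
        fs.readFile(target, (err, data) => {
          res.writeHead(err ? 404 : 200);
          res.end(err ? 'not found' : data);
        });
      }
    }).listen(8080);

Because the server does so little, there is almost nothing in it for a security update to break, which is the whole point.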

Client-server model: Web 4.0?

For archiving, digital editions can be taken off the server and saved as archives that work directly off the file-system (on a USB stick, a DVD, etc.), without any need for a Web-server, though only for reading. Editions could also be shared or sent to other locations. We can say: 'Here's my edition: tell me what you think of it.' That is like giving someone a copy of your book to read: once given, it won't and can't be changed.
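
What makes this possible is avoiding anything that needs a live server at reading time. One simple approach, sketched below with invented file and variable names, is to ship the edition's data as ordinary script files that the page loads with script tags, since browsers will happily do that straight off the file-system where most of them would block an XMLHttpRequest on a file:// URL:

    // data/edition-data.js -- generated when the edition is archived (names are illustrative).
    // Assigning to a global lets the page load this with a plain <script> tag,
    // which works from a web server and straight off a USB stick or DVD alike.
    window.EDITION = {
      title: 'My Edition',
      versions: ['base', 'corrected'],
      texts: { base: '...', corrected: '...' }
    };

    // viewer.js -- runs entirely in the browser, no server round trips
    document.addEventListener('DOMContentLoaded', () => {
      document.title = window.EDITION.title;
      const div = document.createElement('div');
      div.textContent = window.EDITION.texts.base;   // render the reading text
      document.body.appendChild(div);
    });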

But since these editions use only globally interoperable data formats we and our friends can also collaborate in creating and updating them through a master copy on a server. We can update their editions in our browsers and they can update ours in theirs, and everything will work perfectly. We can even annotate and understand the semantic content of an edition because the formats it uses are standard, and the tools to do that already exist.
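
When the same edition sits behind a live server rather than on a memory stick, saving a change needs nothing more than one extra call from the browser. A sketch, with the URL path and content type as assumptions:

    // save the edited text back to the master copy on the server (illustrative path)
    async function saveText(version, text) {
      const response = await fetch('/texts/' + version + '.txt', {
        method: 'PUT',
        headers: { 'Content-Type': 'text/plain' },
        body: text
      });
      if (!response.ok) {
        throw new Error('save failed: ' + response.status);
      }
    }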

On durability: if we leave such an edition on the server it should run more or less indefinitely, because there is almost nothing to hack and nothing to break. It would be an interesting experiment to make one and see how long it lasts, adding a clock to the website to show how many years, months and days it has been running correctly without modification.
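
The clock itself needs nothing from the server: a launch date baked into the page and a few lines of arithmetic in the browser are enough. A sketch, with the date, element id and approximate month lengths invented for illustration:

    // uptime.js -- show how long the edition has run unmodified (approximate month lengths)
    const LAUNCHED = new Date('2015-11-10');          // hypothetical launch date

    function uptimeMessage(now) {
      const days = Math.floor((now - LAUNCHED) / (24 * 60 * 60 * 1000));
      const years = Math.floor(days / 365);
      const months = Math.floor((days % 365) / 30);
      const rest = (days % 365) % 30;
      return years + ' years, ' + months + ' months, ' + rest + ' days without modification';
    }

    // assumes the page contains an element like <span id="uptime"></span>
    document.getElementById('uptime').textContent = uptimeMessage(new Date());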

How to archive a digital edition

If we want to archive our edition we must first save the website's static content by following its internal links, just as your browser does when you save a web page as 'complete'. Preserving the dynamic pages is harder. We can easily save the database files, but all the server software has to be converted into a form that will run in the browser. This is hard for XML-based editions, because almost all XML software only works on the server: eXistDB, XQuery, Cocoon, XSLT, Lucene (and Solr, Elasticsearch). Since we didn't write that software we don't control it, so converting it to run in the browser is practically impossible, which is another reason not to use XML. But with our XML-free approach, in which we write all the software the edition needs ourselves, this becomes entirely achievable. Since our server code is written in Java it is not too hard to convert it all into Javascript. And we can then use that new client software to run the live, updatable server version of the edition as well.
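
The static half can even be done with a short script rather than by hand. Below is a rough sketch of such a crawler in Node.js (version 18 or later, for the built-in fetch); the start URL and output folder are assumptions, it only follows same-origin links found in href attributes, and it makes no attempt to cover every edge case:

    // archive.js -- naive crawler: follow internal links and save each page to disk
    const fs = require('fs/promises');
    const path = require('path');

    const START = 'http://localhost:8080/';        // hypothetical edition URL
    const OUT = 'archive';                          // output folder

    async function crawl() {
      const origin = new URL(START).origin;
      const queue = [START];
      const seen = new Set(queue);
      while (queue.length > 0) {
        const url = queue.shift();
        const res = await fetch(url);
        const html = await res.text();
        // write the page under a path that mirrors its URL
        const rel = new URL(url).pathname.replace(/\/$/, '/index.html');
        const file = path.join(OUT, rel);
        await fs.mkdir(path.dirname(file), { recursive: true });
        await fs.writeFile(file, html);
        // queue every same-origin link we have not seen yet (crude href scan)
        for (const m of html.matchAll(/href="([^"]+)"/g)) {
          const link = new URL(m[1], url);
          link.hash = '';                           // ignore fragment-only differences
          if (link.origin === origin && !seen.has(link.href)) {
            seen.add(link.href);
            queue.push(link.href);
          }
        }
      }
    }

    crawl();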

However, I don't expect people will rush out to do this because they all love XML too much. But they also don't have any answer to the longevity problem, which is far and away the biggest problem we face. Solve that, and digital editions that last nearly as long as books can become a reality.

References

1 C. Michael Sperberg-McQueen, 1994. "Textual Criticism and the Text Encoding Initiative." Proceedings of MLA '94, San Diego.

2 Lou Burnard, 2013. "The Evolution of the Text Encoding Initiative: From Research Project to Research Infrastructure" JTEI 5.

3 Elena Pierazzo, 2014. Digital Scholarly Editing: Theories, Models and Methods, p.15.

4 Mark Hedges, Heike Neuroth, Kathleen M. Smith, Tobias Blanke, Laurent Romary, Marc Küster and Malcolm Illingworth, 2013. "TextGrid, TEXTvre, and DARIAH: Sustainability of Infrastructures for Textual Scholarship" JTEI 5.
