Tuesday, 26 July 2016

Sync-scrolling images and text when editing transcriptions

When transcribing original documents it is helpful to have the page image next to the transcription as it is being written. That way the transcriber can see what the next word to transcribe is, or quickly check for mistakes in someone else's transcription. Most documents contain more than one page, so this gives rise to a problem: how can the relevant page image be placed next to the relevant portion of the transcription so that the transcriber can easily see what corresponds to what?

One obvious solution is to transcribe page by page: for each image, show only the transcription of that page. However, this creates a problem for technician and user alike: the transcription of the document is now divided into parts which must be stitched back together, not only by the computer when the document is saved, but also mentally by the transcriber. Pages rarely end at a paragraph boundary; more often they split in the middle of a sentence or even a word. The transcriber may change only one page, and then the computer must reassemble the document with that one altered page in its middle. If the page is marked up in some way, its transcription may not be complete or well-formed, which would hamper editing. All this is both technically messy and counter-intuitive for the user.

A better method is to display the entire document for editing: the transcription as a continuous running text, and the page images to which it corresponds as a scrolling list. The reason this is not often done is an intrinsic alignment problem: how to find the part of an image that corresponds to the currently displayed piece of text. To be readable, page-images may need to be taller than the screen, while the transcription of a page is typically much shorter. As a general rule the centre of a page's transcription should be aligned with the centre of the corresponding page-image across the centre of the screen; this is what the user expects. However, the first and last pages cannot possibly conform to this rule. The first page must be aligned so that the top of the page image lines up with the top of the transcription text, and likewise at the bottom: on the last page the end of the transcription must correspond with the bottom of the last page-image.

Live sync-scrolling

To achieve live sync-scrolling we need a table or function that gives the left-hand scroll position for each possible right-hand scroll position.

What we mustn't do is make the list of images itself scrollable. If we did that we would have to link its own scrolling with that of the scrolling text, and since the right-hand side (RHS) controls the scrolling of the left-hand side (LHS), linking the scrolling in both directions would produce infinite feedback. 'Scrolling' on the LHS can instead be achieved by other means, for example by varying the CSS 'top' property of the overall list of images.

The scrolling positions for the LHS are just the mid-points of each image in the overall list. These correspond to the mid-points of each page of text in the RHS, which can be found easily by parsing the text: in my case, since I use a minimal markup language (MML), page-starts are marked by [NNN] on a line by itself, where NNN is the page-number. Any scrolling position between two of these corresponding values can be computed by linear interpolation. However, this does not work for the first and last pages, because the desirable alignments in those cases are the top of the first image with the top of the first page of text, and the end of the text with the bottom of the last image. So my solution was simply to replace the mid-points of the first page in both the LHS and RHS with half the window-height, and likewise to replace the last mid-points with the length of the text and the length of the image-list, each minus half the window-height. In some cases 'half the window-height from the top or bottom' may fall in the middle of a page that is not the first or last. In that case the overlapping values can simply be removed, so long as the ones at the extreme ends of the list are preserved.
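As a rough sketch in plain JavaScript (the function and variable names here are mine, not the actual Ecdosis code), the anchor lists and the interpolation might look like this:

    // textMids and imageMids are the mid-points of each page in the text (RHS)
    // and in the image list (LHS), in pixels from the top of each. The first
    // and last anchors are replaced so that the extremes align top-to-top and
    // bottom-to-bottom, and interior anchors that overlap the adjusted
    // end-points are dropped.
    function buildAnchors(textMids, imageMids, textHeight, listHeight, winHeight) {
        var half = winHeight / 2;
        var rhsEnd = textHeight - half, lhsEnd = listHeight - half;
        var rhs = [half], lhs = [half];               // first page: align tops
        for (var i = 1; i < textMids.length - 1; i++) {
            if (textMids[i] > half && textMids[i] < rhsEnd
                    && imageMids[i] > half && imageMids[i] < lhsEnd) {
                rhs.push(textMids[i]);
                lhs.push(imageMids[i]);
            }
        }
        rhs.push(rhsEnd);                             // last page: align bottoms
        lhs.push(lhsEnd);
        return { rhs: rhs, lhs: lhs };
    }

    // given the current RHS scroll position (measured at the centre of the
    // window), interpolate the corresponding LHS offset between the two
    // nearest anchors
    function lhsPositionFor(rhsScroll, anchors) {
        var rhs = anchors.rhs, lhs = anchors.lhs;
        if (rhsScroll <= rhs[0])
            return lhs[0];
        for (var i = 1; i < rhs.length; i++) {
            if (rhsScroll <= rhs[i]) {
                var t = (rhsScroll - rhs[i - 1]) / (rhs[i] - rhs[i - 1]);
                return lhs[i - 1] + t * (lhs[i] - lhs[i - 1]);
            }
        }
        return lhs[lhs.length - 1];
    }

In the RHS scroll handler the result is not fed to another scrollbar but used to set the CSS 'top' of the image list, for example something like imageList.style.top = (winHeight / 2 - lhsPositionFor(scrollTop + winHeight / 2, anchors)) + 'px', which avoids the feedback loop described above.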

There is a demo of this method on the Charles Harpur site, in the test interface for the letter from W.A. Duncan to Henry Parkes, dated around 1841.

Tuesday, 23 February 2016

Improvements to events editor

Events are things that happen in the life of your author. One of the successes of the AustESE project was the realisation that such events would best be represented as database records, so that biographical information could be rearranged into various useful forms. Events have a 'fuzzy' date more often than a precise one. So 'ca. 1834' or 'before February 1865' is what you would expect as the date of an event, not 26/12/1845. And events can have a description and a list of references. These two are represented as simple HTML, a globally interoperable standard for mixed content. So forget about XML, which only serves as a preliminary to making HTML.
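Purely as an illustration (the field names and example content below are mine, not necessarily the AustESE or Ecdosis schema), an event record might look something like this:

    // a hypothetical event record: a fuzzy date plus two simple-HTML fields
    var event = {
        title: 'Appointed to a new post',        // invented example
        date: 'ca. 1859',                        // fuzzy, human-readable date
        description: '<p>Simple HTML; links here can carry title or ' +
            'data- attributes to drive popup images.</p>',
        references: '<ul><li>A citation, also as simple HTML</li></ul>'
    };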

To WYSIWYG or not?

This is where the problems started. AustESE used a sophisticated HTML editor that filtered out potentially dangerous HTML constructs that hackers could use to implant code exploits. But those tags and attributes also happen to be quite useful for building a website. For example, the title attribute or special data-attributes on a link could be used to animate a popup image. Unfortunately the editor stripped all these out when the user saved, and the images would disappear. So I swapped it for a simpler HTML editor, on the grounds that users would still want to see a WYSIWYG preview of their HTML before saving it. But that didn't work out any better. Since the text already reverts to a rendered preview as soon as the user clicks on another element, and nothing is saved until the user clicks the 'save' button, the elaborate WYSIWYG editor could be replaced by a simple textarea. Sophisticated, huh?

But what about the 'dangerous' HTML constructs I am no longer filtering out? Since the events editor is not publicly accessible and all editors of the content are guaranteed to be trustworthy, this extra security measure is quite worthless.

The loss of the WYSIWYG editing environment is also not a problem, since editors are mostly sophisticated enough to handle this. After all, what we need to put into the HTML goes beyond mere formatting, and for our purposes a WYSIWYG environment simply doesn't suffice.

Which brings me to my main point: The best ideas come when you decide to delete something, not when you add some shiny new GUI component you probably don't even need. Less is truly more. But finding out what to throw away is the problem.

Events editor before the user clicks on the description area

Events editor with the textarea enabled for the description field

Events editor after the user clicks on the references field

Sunday, 14 February 2016

Improvements to table view

Table view seems to be much liked by my two expert editors. But they did request some changes, which I have now implemented and which require some explanation.

First, they wanted to see all the text of each version, rather than restricting it to the base row in those columns where all the versions were the same. Second, they wanted some way to reorder the versions, and third, they needed a way to reduce clutter by removing rows. I also replaced the clumsy slider with a conventional scrollbar, to enable swiping on tablets. These changes have made table view much more useful, without adding significantly to its complexity.

Moving up and down

If the user clicks on a siglum in the leftmost column, two small buttons appear for raising or lowering that row; after 5 seconds they disappear again in any case. The disappearing buttons are cool because they only appear when needed and free up the display when they are not. Clicking the up button moves that row above the one immediately above it, and the down button moves it down one row. Only the up button appears on the bottom row, and only the down button on the top row.
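In jQuery terms (the selectors and class names here are purely illustrative, not the actual table-view code) the behaviour amounts to something like:

    // show the move buttons for the clicked siglum, hide them after 5 seconds
    $('.siglum').click(function () {
        var buttons = $(this).find('.move-buttons');
        $('.move-buttons').hide();                 // only one set visible at a time
        buttons.show();
        setTimeout(function () { buttons.hide(); }, 5000);
    });
    // move the row one position up or down in the table
    $(document).on('click', '.move-up', function () {
        var row = $(this).closest('tr');
        row.prev().before(row);                    // swap with the row above
    });
    $(document).on('click', '.move-down', function () {
        var row = $(this).closest('tr');
        row.next().after(row);                     // swap with the row below
    });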

Selecting some versions

Normally the user wants to see all the versions, but if that is overwhelming the rows can be reduced by deselecting versions from a simple dropdown menu in the toolbar that has been added at the foot of the display. Selected versions are shown with a tick-mark after their names; deselecting a version removes the tick, and choosing it again restores it. The usual user-interface method for showing a set of options is to use checkboxes, but in this case there may be very many versions and it would get too confusing, so a select dropdown is used instead. The 'rebuild' button resubmits the newly selected versions and builds a reduced (or expanded) table.

Tuesday, 9 February 2016

Table view

One way to demonstrate the flexibility of multi-version documents is to display the same information in several ways. Charles-harpur.org and Ecdosis now boast a table view, which displays all the versions of a work in stacked form, so an editor can quickly scan through a text to see what is a variant of what, no matter how complex the variation.

My XML-based rivals are still struggling to produce such views, but I doubt they will succeed. Their problem is that they record internal variants (deletions, additions, substitutions) inline as part of the text, and to produce a table view you have to tease apart these changes into separate layers, which is almost impossible due to markup variability. So this display, although it doesn't look all that earth-shattering, is actually unique. Also it is what textual editors have long been bugging me for.

Table view of part of The Creek of the Four Graves

To try it with different poems, select a poem from the Browse menu, then click on the "table view" tab. Some poems are not uploaded yet and may not work, but most are OK. Some minor features: the table of sigla on the left is anchored, and mousing over the sigla shows their full names in case they are partly obscured. The spacing could be improved, but it is basically all there.

Monday, 18 January 2016

Tree View

I added Tree view to the Ecdosis-front toolset. Since a multi-version document (MVD) represents multiple versions of the same work, it is much like a set of variants of a genome, and a phylogenetic tree is very close to a stemma describing the relationships between witnesses in a multiple-manuscript tradition. A genome is a sequence of nucleotides expressed in a four-letter vocabulary (GACT), but a single version of a historical document can equally be expressed as a sequence of letters in the Unicode character set. Like genomes, historical texts are subject to insertions, deletions, substitutions and transpositions. Hence the same tools used by geneticists ought to work for humanists also.

Distance matrices

The question arises how to generate a stemma or tree from a set of versions. Many of the phylogenetic approaches use a distance matrix: a table that describes how different each version is from every other version. Since in an MVD all the bits of text that are shared by versions and the bits that are different have already been computed, making a distance matrix is easy. The distance matrix for the four versions/layers of Abed Ben Haroun by Charles Harpur looks like this:

        A        B        C        D
A       0.0      0.05344  0.10688  0.11508
B       0.05344  0.0      0.13370  0.08087
C       0.10688  0.13370  0.0      0.19323
D       0.11508  0.08087  0.19323  0.0

Obviously the edit-distance between each version and itself is 0, which explains the diagonal of zeros in the table. The larger the number, the more 'distant' the two versions are, so here the biggest difference is between versions C and D. Only half of the other values are really needed, since the distance between versions A and D is the same as the distance between D and A, but the full matrix is the traditional format and can be fed directly into a tree-drawing algorithm. There are many of these, but one of the best distance-based methods is neighbour-joining. The version I chose is a refinement of that technique published some years ago by Desper and Gascuel called 'FastME'. The tree view itself is produced by the 'drawgram' program in the Phylip package, which allows different visualisations of rooted trees, since these most closely resemble the humanist's stemma.
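The exact metric is not spelled out here, but one plausible way to derive such values from an MVD's fragments (each labelled with the set of versions it belongs to, as in the 'digital scrapbook' described in the Fixing Search post below) is the proportion of the two versions' combined text that they do not share. A sketch of that guess:

    // fragments: array of { versions: [...], text: "..." } in document order.
    // This is only a guess at the metric actually used, not the Ecdosis code.
    function distance(fragments, a, b) {
        var shared = 0, lenA = 0, lenB = 0;
        fragments.forEach(function (f) {
            var inA = f.versions.indexOf(a) !== -1;
            var inB = f.versions.indexOf(b) !== -1;
            if (inA) lenA += f.text.length;
            if (inB) lenB += f.text.length;
            if (inA && inB) shared += f.text.length;
        });
        return (lenA + lenB - 2 * shared) / (lenA + lenB);
    }

With this definition the diagonal is necessarily 0 and two completely disjoint versions score 1.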

Stemmatic trees are useful even in cases where all the sources were written by the same person. In that case what the tree shows is which version derived from which – something that would otherwise require a lot of manual labour to discover. Often it is unclear in a collection of manuscript versions exactly which preceded which, but a phylogenetic tree makes this easy. Here's an example from Harpur's The Creek of the Four Graves. The h-numbers indicate physical versions, and internal layers are indicated by the suffix added to the name after a '/'.

How to make them

It is only really possible to make these trees from within Ecdosis. A back-end Java web service called Tree reads the MVD from the database, computes the distance matrix, builds the tree and streams it back to the web browser directly as an image. The controls at the bottom of the screen are contained in a jQuery module, wrapped up as a Drupal module called 'tree', which now forms part of the Ecdosis-front collection. There are some examples on the Charles Harpur site. Many of the 700 poems have more than one version, so you should be able to select other poems from the Browse menu.

Thursday, 14 January 2016

Multi-version documents and standoff properties

I have written two new papers for Digital Scholarship in the Humanities, one on 'standoff properties as an alternative to XML' and a second on 'Automating textual variation with multi-version documents'. Together they form the basis of a model of how I think historical documents should be encoded. The now 25-year-old drive for 'standardisation' has led to something of a dead-end: people have begun to realise that it is not in fact possible to standardise the encoding of documents written on analogue media. Instead of reusability, sharability and durability, such 'standards' provide only a fertile ground for embedding private technology and interpretations into texts that cannot then be reused for any other purpose. 'Standard' encoding also fails to propose a usable solution to textual variation, which is the one feature that all historical documents share. Rather than attempting to create a new standard, this model reuses existing formats already in use worldwide: HTML, CSS, RDFa, Unicode. Although the model can be fully expressed in these formats, its internal representation predisposes the data into a form that facilitates the things digital humanists want to do with it, rather than throwing up barriers to its processing and reuse. What is needed is something simple that works. These papers are my attempt to explain how that can be achieved.

Wednesday, 9 December 2015

Fixing Search

The previous post explained that searching historical documents is fraught with problems that industrial search engines simply cannot handle. And the reason they can't is that they treat the underlying data as if it were a digitally-authored word-processor file, rather than a historical, manually written physical artefact.

The most serious of these problems is how to deal with versions – both internal and external. An internal version is created whenever an author or scribe changes something in the text, through deletion, replacement or insertion. After each round of corrections the author could, if he or she liked, write out the text in full as a clean draft. This kind of version hidden inside a document may be termed a layer. When a new physical copy is produced, on the other hand, the differences between copies are termed external variation.

Some people seem to think that internal variation can be represented as a format: it is a crossing-out or insertion in the same document rather than a whole new text. Not so. Consider the last three lines of this poem:

Take out the markup and you will get:

Yet even that one subject is to one's prone to starts of wrong 
Of evil As ever So he shall sometimes prove insure: 
  in the clearest well thus fountain ever lies
A sediment—disturb it, and 'twill rise.

This is pure nonsense. The author never wrote that. It is not a text to be searched, viewed or compared, but that is what is being recorded when people treat internal variants as if they were formats, and that is what is being indexed by industrial search-engines.

So in order to search reliably the internal states of the document must first be separated out into coherent layers.

The scrapbook analogy

Let's say you have three copies of a novel. I'm thinking about a favourite sci-fi novel of mine, but there are plenty of other similar cases. The first is a serialisation in a magazine, the second is the first American edition, and the third is the British edition, which was abridged by the publisher. I want to make one edition of all three. So I photocopy each page, and wherever the text is either unique to one copy or shared across several, I cut out that portion and paste it into a scrapbook. As I do so I preserve the order of the fragments, so that each bit of each version precedes the next bit of the same version. There are printed books that do this already, such as Michael Warren's 'Parallel King Lear', where the various quarto and folio versions are laid out side by side so the reader can see the insertions, deletions and variants. But returning to the scrapbook idea: by including each shared piece of text, however small (maybe just one letter), only once, all duplication between the copies can be eliminated.

Well, nearly all. If a section was the same but transposed between two versions then the scrapbook idea won't work: the text will have to be copied from the 'before' to the 'after' location.

A digital scrapbook

If we do the same thing with three digital copies of the novel, by finding all the bits in common electronically, we can also eliminate the copies of the transposed text. The first time the transposed text occurs we record it as normal; the second or third time it is reused we simply refer to it, without copying it. So now our digital scrapbook is just one document, but it records all the text of the work from the three copies just once. By labelling each fragment with the set of versions it belongs to, say "1,2" or "2,3" or "1" or "2" etc., it becomes possible to reconstruct the text of any copy by reading, in order, only those fragments that belong to it. So reading all the fragments labelled "2" or "1,2" or "2,3" etc. will reproduce the text of version 2, and the same goes for the other versions.
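As a minimal sketch (the fragment structure and example text are mine, not the actual MVD file format), reading one version back out of such a scrapbook is just a filtered concatenation:

    // a digital scrapbook: fragments in document order, each labelled with
    // the set of versions it belongs to (contents invented for illustration)
    var fragments = [
        { versions: [1, 2, 3], text: "It was a dark " },
        { versions: [1],       text: "and stormy " },
        { versions: [1, 2, 3], text: "night" },
        { versions: [2, 3],    text: " indeed" }
    ];

    // reconstruct one version by keeping, in order, only the fragments
    // that belong to it
    function readVersion(fragments, v) {
        return fragments
            .filter(function (f) { return f.versions.indexOf(v) !== -1; })
            .map(function (f) { return f.text; })
            .join('');
    }

    // readVersion(fragments, 1) gives "It was a dark and stormy night"
    // readVersion(fragments, 2) gives "It was a dark night indeed"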

And as explained above, internal variations are inherently no different from the external ones. They can be treated as separate editions of the text, as long as we can tease apart the internal versions and produce coherent copies from them. This is always possible, although the arguments are too long to include here.

Indexing a digital scrapbook

The benefits for search can only be realised if we can produce an index of this digital scrapbook, by treating it as just another kind of document. In a real world scenario some texts will be scrapbooks and some won't. We need one way to index them all uniformly. The scrapbook idea is cool, but it is so radical that it threatens to break industrial search engines. How can this be avoided?

The poetry example given above, when represented as a digital scrapbook, might look internally like this, with its four layers interleaved:

[1-4] Yet even [1,2] that one [2] 's [1] subject is [2] prone [1,2] to starts [1] Of evil [2] As ever [1-4] : in the clearest [3,4] he shall sometimes prove insure [3] So [1,2] well thus [3,4] fountain ever [1-4] lies [1-4] A sediment—disturb it, and 'twill rise.

The versions to which each fragment of text belongs are represented here as numbers in square brackets before the fragment they refer to. But forgetting about that for now, this 'document' can be treated like any other. Each word has a position that can be measured by counting characters from the start of the file. The position of 'subject' is 20, counting all the preceding characters of "Yet even ", "that one" and "'s ", without regard to the versions they belong to. This gives a kind of global, cross-version position for each word. For example, the word "one's", in version 2, starts at position 14, which is the same position as the word "one" in version 1. But so long as we have a position for each word in our index this causes no problem, because we already know that they are two different words. All we need is some program that can read digital scrapbooks, which is no big deal.

Admittedly, the index will not record which versions "one" and "one's" belong to, but this can be deduced by reading the digital scrapbook at position 14. Following the fragments for "one's" reveals that this text is in version 2, whereas the word "one" (followed by a space) belongs to version 1.
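Here is a sketch of both operations, assigning global positions and deducing the versions in play at a given position, using the same illustrative fragment structure as above (it ignores the complication of words that span fragment boundaries):

    // assign each word a global, cross-version position by counting characters
    // across all fragments, regardless of which versions they belong to
    function indexScrapbook(fragments) {
        var entries = [], offset = 0;
        fragments.forEach(function (f) {
            var re = /\S+/g, m;
            while ((m = re.exec(f.text)) !== null)
                entries.push({ word: m[0], position: offset + m.index });
            offset += f.text.length;
        });
        return entries;
    }

    // deduce which versions the text at a global position belongs to by
    // walking the fragments again
    function versionsAt(fragments, position) {
        var offset = 0;
        for (var i = 0; i < fragments.length; i++) {
            if (position < offset + fragments[i].text.length)
                return fragments[i].versions;
            offset += fragments[i].text.length;
        }
        return [];
    }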

So no reinvention of the wheel is needed to index a digital scrapbook, or an ordinary file, and the positions of words in both types of file can be stored in the same index.

Finding the text

Now we have our index it should be easy enough to find something in it. The index tells us in which documents a particular word can be found, and at what position(s) in the file. Any standard search engine could be used for this purpose, but in practice it is probably better to make your own, because of what comes after.

First the 'hits' have to be arranged into 'digests', which are just short summaries of the relevant bits of the source documents. Doing this naturally requires the search engine to read the source documents again, so it has to be aware of digital scrapbooks. But that can be done, since a scrapbook is just another format. Finally the hits have to be displayed. That also requires knowledge of the digital scrapbook format, but the beauty is that a single hit in a single digital scrapbook will be displayed as one hit, not as 20 hits in 20 versions or layers. And the user can move around inside the scrapbook, read the text of any version, and see how the hits propagate across the versions. Take a look for yourself.