Saturday, 17 June 2017

Preserving soft and hard hyphens in transcriptions of historical documents

Like all documents, historical texts contain line-breaks. An obvious case where preserving line-breaks is essential is poetry. And yet on the Web, HTML assumes that all text is flowed. That is, line-breaks are converted into spaces unless the text is broken by a <br> tag. Alternatively, you can specify that line-breaks are preserved, as in the <pre> element, or by using the white-space:pre CSS property. What is needed, though, is some way to switch easily between the two. Flowed text is easier to read, but for historical accuracy line-breaks and the inevitable hyphens must be preserved. In spite of this requirement, in many digitised versions of historical texts hyphens are permanently removed and the text is flowed for readability. This makes it impossible ever to show the text as it really is, which you need to do, for example, when displaying a text next to its page-image, or when citing a historical document by its line-number.

Hard and soft hyphens

What's needed is some way to record the line-breaks but to hide or show them on demand. The easiest way to do this is in the browser, by flipping a switch in the CSS stylesheet. One problem with this is the existence of hard and soft hyphens. In heavily hyphenated languages like English and French, hyphens occur not just when an unhyphenated word is split over a line but also between the parts of a single word, as in double-barrelled names like 'Normington-Rawling' or compound words like 'the high-glooming mountain'. When such compounds are split over a line at the hyphen, the hyphen is regarded as 'hard', that is, it will not disappear if the line-break is removed. A 'soft' hyphen, by contrast, disappears along with the line-break when the text is reflowed. So what is really needed are two sets of CSS styles, one for flowed and one for unflowed text. Another complication is that 'hyphens' come in various flavours. Sometimes writers use characters other than '-': common variants are the colon and the equals sign. And sometimes the hyphen is repeated at the start of the next line. So we need a way to switch off these as well.

The two CSS styles

Here are my two styles. I've tried them in Firefox, Chrome and Opera and they appear to work perfectly. First the flow styles:

.soft-hyphen { display:none }
.hard-hyphen { word-spacing:-.25em; }

CSS offers no direct way to hide spaces, or line-breaks that are automatically turned into spaces, but you can vary the amount of horizontal spacing between words. The default is, according to the W3C, equal to .25em. So setting it to -.25em should eliminate it, or very nearly so, since the exact width of a space depends on the font. Here are the corresponding definitions of soft and hard hyphens when preserving line-breaks:

.soft-hyphen,.hard-hyphen { white-space:pre }

An example

Here is a short example text in three formats.

Source HTML
<p>"When they will not give
a doit to relieve a lame beggar, they
will lay out ten to see a dead Indian",—
the device which aimed at converting
to the benefit of a living author, the 
expense they were only disposed to throw
away upon a dead one, if not praise<span class="hard-hyphen">-
</span><span class="soft-hyphen">-</span>worthy, was at least pardonable.</p>
<p>In fine, Chatterton was stung to the
quick by neglect, and rendered de<span class="soft-hyphen">-
-</span>fiant by the apparent blindness of
fortune.</p>

Note that in the first case, where the hyphen is doubled, this has to be dealt with somehow when the HTML encoding is generated, so that the first hyphen and its line-break are encoded as a hard-hyphen and the second hyphen as soft. This works no matter which characters are used for the actual hyphens.
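How the encoder tells a hard hyphen from a soft one is left open above. As a minimal sketch, assuming the decision is made by looking up the rejoined word in a list of known hyphenated compounds, the encoding step might look something like the JavaScript below. The function name encodeHyphens, the hardCompounds set and the restriction to '-' and '=' as hyphen characters are all assumptions made for illustration (the colon is left out because a line-final colon is usually ordinary punctuation). It is not the encoder that produced the example.

```javascript
// Sketch only: encodeHyphens and hardCompounds are hypothetical names.
// Input is plain text (no markup) whose lines are separated by "\n".
function encodeHyphens(text, hardCompounds) {
  // word, line-final hyphen, line-break, optional repeated hyphen, word
  const pattern = /(\p{L}+)([-=])\n([-=]?)(\p{L}+)/gu;
  return text.replace(pattern, (match, before, hyphen, repeated, after) => {
    const compound = (before + "-" + after).toLowerCase();
    if (hardCompounds.has(compound)) {
      // Hard: the first hyphen and its line-break stay visible when flowed;
      // a repeated hyphen on the next line is encoded as soft.
      const second = repeated
        ? '<span class="soft-hyphen">' + repeated + "</span>"
        : "";
      return before + '<span class="hard-hyphen">' + hyphen + "\n</span>" +
        second + after;
    }
    // Soft: hyphen, line-break and any repeated hyphen all vanish when flowed.
    return before + '<span class="soft-hyphen">' + hyphen + "\n" + repeated +
      "</span>" + after;
  });
}

// Example call, with an assumed one-entry list of compounds.
const encoded = encodeHyphens(
  "if not praise-\n-worthy, and rendered de-\n-fiant",
  new Set(["praise-worthy"])
);
console.log(encoded);
```

With that one-entry list, 'praise-worthy' comes out with the same hard-plus-soft markup as in the source HTML above, and 'de-fiant' with a single soft-hyphen span. A real encoder would need a better source of truth than a fixed list, such as an editor's decision recorded in the transcription.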

Flowed

"When they will not give a doit to relieve a lame beggar, they will lay out ten to see a dead Indian",— the device which aimed at converting to the benefit of a living author, the expense they were only disposed to throw away upon a dead one, if not praise-worthy, was at least pardonable.

In fine, Chatterton was stung to the quick by neglect, and rendered defiant by the apparent blindness of fortune.

With line-breaks preserved

"When they will not give a doit to relieve a lame beggar, they will lay out ten to see a dead Indian",— the device which aimed at converting to the benefit of a living author, the expense they were only disposed to throw away upon a dead one, if not praise-
-worthy, was at least pardonable.

In fine, Chatterton was stung to the quick by neglect, and rendered de-
-fiant by the apparent blindness of fortune.

I cheated a little bit in the last example because it is not possible to have two definitions of the same style on one HTML page. Since this can be manipulated in JavaScript easily I don't see that as a problem. Other than that, I used the same source encoding for the three formats.
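For what it's worth, here is a rough sketch of how the switch might be done in JavaScript. It keeps the two sets of rules given earlier as strings and returns whichever one the current mode calls for. The names HYPHEN_STYLES, hyphenCss and toggleMode, and the "hyphen-style" element id mentioned in the comment, are invented for this illustration.

```javascript
// Sketch only: all names here are hypothetical.
// The rule text is copied from the two sets of styles defined above.
const HYPHEN_STYLES = {
  flowed: [
    ".soft-hyphen { display:none }",
    ".hard-hyphen { word-spacing:-.25em; }"
  ],
  preserved: [
    ".soft-hyphen,.hard-hyphen { white-space:pre }"
  ]
};

// Return the stylesheet text for one mode ("flowed" or "preserved").
function hyphenCss(mode) {
  if (!Object.prototype.hasOwnProperty.call(HYPHEN_STYLES, mode)) {
    throw new Error("Unknown mode: " + mode);
  }
  return HYPHEN_STYLES[mode].join("\n");
}

// Flip between the two modes.
function toggleMode(mode) {
  return mode === "flowed" ? "preserved" : "flowed";
}

// In a browser the result could be written into a <style> element, e.g.
//   document.getElementById("hyphen-style").textContent = hyphenCss(mode);
// (assuming the page contains <style id="hyphen-style"></style>).
```

Because only one set of rules is active at a time, the same class names can be reused for both views without the definitions clashing.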

Friday, 24 February 2017

Refinements to Twin-view

Twin-view is a side-by-side view of an historic print or manuscript document and its transcription. The idea was to scroll the images of each page on the left in sync with the formatted text on the right. Although I have already described this view in an earlier post, I have since made several refinements that are worth describing separately.

To recap on what twin-view already achieved: it aims to align the transcription with the corresponding page image so that the two are aligned across the middle of the display. Of course, without scanning the page image for word-shapes, precise alignment is impossible. But approximate alignment can be achieved for any document and its transcription by following some simple rules. So long as we can measure the height of each page image and the height of the corresponding text on the right, the two can be scrolled in sync fairly accurately.
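The idea can be sketched in JavaScript roughly as follows. This is an illustration rather than the twin-view code itself: it assumes the pixel height of every page image and of its transcription has already been measured, that pages are stacked vertically in the same order on both sides, and that alignment is taken at the vertical middle of the viewport. The function and parameter names are hypothetical.

```javascript
// Sketch only: names and the mid-viewport rule are assumptions.
// imageHeights[i] and textHeights[i] are the measured heights of page i
// on the left (image) and right (transcription) sides.
function syncedScrollTop(imageHeights, textHeights, imageScrollTop, viewportHeight) {
  // Position, within the stack of images, of the middle of the viewport.
  const mid = imageScrollTop + viewportHeight / 2;
  let imageOffset = 0; // top of the current page image
  let textOffset = 0;  // top of the corresponding transcription
  for (let i = 0; i < imageHeights.length; i++) {
    const h = imageHeights[i];
    const isLast = i === imageHeights.length - 1;
    if (mid < imageOffset + h || isLast) {
      // How far down this page the mid-line falls, clamped to 0..1.
      const fraction = h > 0 ? Math.min(1, Math.max(0, (mid - imageOffset) / h)) : 0;
      // Put the same fraction of the transcription at the mid-line.
      const textMid = textOffset + fraction * textHeights[i];
      return Math.max(0, textMid - viewportHeight / 2);
    }
    imageOffset += h;
    textOffset += textHeights[i];
  }
  return 0; // no pages
}

// In a browser this might be called from the left panel's scroll handler:
//   right.scrollTop = syncedScrollTop(imgH, txtH, left.scrollTop, left.clientHeight);
```

Interpolating within each page, rather than across the whole document, keeps a long transcription beside a short image (or the reverse) from throwing the later pages out of step.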

1. Partial pages

Problems arise, however, whenever only part of a page is transcribed – say the end of a poem, which may then be followed by another work. In our case, since individual poems were taken from both printed and handwritten anthologies, many poems begin or end some way down the page.

My first idea was to keep the page images intact, but to outline that part of the image connected with a particular poem, so it could be electronically sliced into segments just before display to the user. Then only relevant portions of the page would be visible, and the original images would remain intact. However, this proved impractical for several reasons. First, how could such areas be determined? Only manually. And that meant a lot of work, and recording the data in some format that would have to be customised for our website. Also, the slicing would be computationally expensive, and the part-images would have to be cached to improve performance. That gave me the idea of manually slicing the images into segments, while keeping a copy of the original page for other purposes. So the part-page of the transcription would be connected with a part-image of the page. And the only technology needed to achieve this would be the web-server's innate ability to serve images and HTML.

I have done this now for 104 poems out of 700 in the collection. The result mimics to some extent earlier attempts by others to produce complex 'diplomatic' layouts of original documents containing blocks of text that may be rotated or written on other pages that are then displayed as such. Such views are pretty hard to read even though the text has been transcribed. Instead, twin-view simply connects a series of derotated part-images and their corresponding textual transcriptions into a continuous and easy to read document on both sides of the display. The zoom feature then takes care of the user's need for closer examination.

2. Full-screen

Another refinement was the provision of a full-screen view. Nowadays many people have access to large monitors with gargantuan resolutions. Why not make use of that, while retaining a fallback of adequate display for smaller screens? Content management systems typically don't allow this. They confine the text to a narrow band in the centre of the screen, in the belief that screen sizes must have some minimum. Typically this is 600-800 pixels wide. In a responsive layout, on the other hand, text and images are scaled to fit the available screen-width. So I thought: why not use all of the screen for twin-view? The result is a view that enables the user to see the text and its images in minute detail, while retaining the sync-scrolling of the main view within the CMS.

Twin view of Harpur's 'Creek of the Four Graves' MS C384

3. Layers

Manuscript documents, especially of modern works, often contain erasures, substitutions, insertions and transpositions. These are usually encoded into the text as formats: crossed out text is displayed in a crossed out format, inserted text is displayed over the line in smaller type, blocks of rotated text displayed as rotated blocks etc. This is complex and expensive to do, and the result is not much more readable than the original manuscript. Layers offer a way around this problem. Since each local change to the text belongs to a clear temporal sequence in almost all cases, it is possible to code for time instead of layout. A layer is a combination of each of these local independent changes. All local changes that occurred one unit of time after that of the baseline text appear in layer 2, and changes to layer 2 in layer 3 etc. Layers aren't versions and the non-final text is therefore displayed in red. Only the final text is displayed in black. 'Layer-final' is the last layer in the temporal sequence representing the final state of the document as the author left it. Layers thus provide a diachronic view of the text. They are also mostly coherent – meaning we can read them – as opposed to the direct diplomatic approach where the text is shown with erasures inline, making it unreadable for humans and computers alike.
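To make the idea of coding for time rather than layout concrete, here is a small JavaScript sketch. The data model is an assumption made purely for illustration and is not the encoding used on the Harpur site: the text is a list of segments, each either an unchanging string or a non-empty array holding the successive states of one local change, baseline first. A deletion is simply an empty final state. The names layerCount, renderLayer and layerText are hypothetical.

```javascript
// Sketch only: the segment model and all names are assumptions.
// A segment is either a string (never changed) or a non-empty array of
// states in temporal order: [baseline, first revision, second revision, ...].

// The number of layers is the length of the longest chain of revisions.
function layerCount(segments) {
  return segments.reduce(
    (max, seg) => (Array.isArray(seg) ? Math.max(max, seg.length) : max),
    1
  );
}

// Build layer n (1 = baseline). Each piece records whether its text is the
// final state of that segment, so a display can colour non-final text red.
function renderLayer(segments, layer) {
  return segments.map((seg) => {
    if (!Array.isArray(seg)) return { text: seg, final: true };
    // A change with fewer states than the layer number keeps its last state.
    const index = Math.min(layer, seg.length) - 1;
    return { text: seg[index], final: index === seg.length - 1 };
  });
}

// Plain readable text of one layer.
function layerText(segments, layer) {
  return renderLayer(segments, layer).map((piece) => piece.text).join("");
}

// Invented example: one word revised twice, another deleted.
const line = ["The ", ["quick", "swift", "fleet"], " fox ", ["ran", ""]];
for (let n = 1; n <= layerCount(line); n++) {
  console.log(n + ": " + layerText(line, n));
}
```

Each layer produced this way is a continuous run of text, which is what makes it readable in a way that an inline diplomatic rendering of the same changes is not.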

Take a look for yourself on the Harpur website. The full-screen button is next to the tabs for layers. Only the poems in the title index in Browse from A-D are enabled for twin view presently.