Saturday 17 June 2017

Preserving soft and hard hyphens in transcriptions of historical documents

Like all documents, historical texts contain line-breaks. An obvious case where preserving them is essential is poetry. And yet on the Web, HTML assumes that all text is flowed. That is, line-breaks are converted into spaces unless the text is broken by a <br> tag. Alternatively, you can specify that line-breaks are preserved, as in the <pre> element, or by setting the CSS white-space property to pre. What is needed, though, is some way to switch easily between the two. Flowed text is easier to read, but for historical accuracy line-breaks and the inevitable hyphens must be preserved. In spite of this requirement, in many digitised versions of historical texts hyphens are permanently removed and the text is flowed for readability. This makes it impossible ever to show the text as it really is, which you need to do, for example, when displaying a text next to its page-image, or when citing a historical document by line-number.

Hard and soft hyphens

What's needed is some way to record the line-breaks but to hide or show them on demand. The easiest way to do this is in the browser, by flipping a switch in the CSS stylesheet. One problem with this is the existence of hard and soft hyphens. In heavily hyphenated languages like English and French, hyphens occur not just when an unhyphenated word is split over a line but also between the parts of a single word, as in double-barrelled names like 'Normington-Rawling' or compound words like 'the high-glooming mountain'. When such a compound is split over a line at its hyphen, the hyphen is regarded as 'hard': it will not disappear if the line-break is removed. A 'soft' hyphen, by contrast, disappears along with the line-break when the text is reflowed. So what is really needed are two sets of CSS styles, one for flowed and one for unflowed text, each covering both kinds of hyphen. Another complication is that 'hyphens' come in various flavours. Sometimes writers use characters other than '-': common variants are the colon and the equals sign. And sometimes the hyphen is repeated at the start of the next line. So we need a way to switch these off as well.

The two CSS styles

Here are my two styles. I've tried them in Firefox, Chrome and Opera and they appear to work perfectly. First the flow styles:

.soft-hyphen { display:none }
.hard-hyphen { word-spacing:-.25em; }

CSS offers no direct way to hide spaces, or the line-breaks that get automatically turned into spaces, but you can vary the amount of horizontal spacing between words. The word-spacing property adds its value to the normal width of a space, which depends on the font but is commonly about .25em. So setting it to -.25em should more or less eliminate the space. Here are the corresponding definitions of soft and hard hyphens when preserving line-breaks:

.soft-hyphen,.hard-hyphen { white-space:pre }
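
To make the intended effect of the two style sets concrete, here is a rough sketch that imitates them on the markup as plain text. The function renderPlain and its mode names "flowed" and "lines" are my own inventions for illustration; a browser applying the CSS above does the real work, and this only approximates it.

```javascript
// Approximate, as plain text, what the two style sets do to the encoded spans.
// renderPlain and the mode names are hypothetical helpers, not part of any API.
function renderPlain(html, mode) {
  const spans = /<span class="(soft|hard)-hyphen">([\s\S]*?)<\/span>/g;
  const BREAK = "\u0000"; // placeholder for a preserved line-break
  let text = html.replace(spans, (match, kind, content) => {
    if (mode === "flowed") {
      // soft: display:none removes the whole span;
      // hard: keep the hyphen, drop the line-break (the negative
      // word-spacing is what closes the gap in a browser)
      return kind === "soft" ? "" : content.replace(/\s+/g, "");
    }
    // white-space:pre keeps the line-break inside the span as typed
    return content.replace(/\n/g, BREAK);
  });
  text = text.replace(/<[^>]+>/g, ""); // strip the remaining tags
  text = text.replace(/\s+/g, " ");    // collapse whitespace as HTML does
  return text.split(BREAK).join("\n").trim();
}
```

For instance, renderPlain('de<span class="soft-hyphen">-\n-</span>fiant', "flowed") gives "defiant", while the same call with "lines" keeps "de-" and "-fiant" on separate lines.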

An example

Here is a short example text in three formats.

Source HTML
<p>"When they will not give
a doit to relieve a lame beggar, they
will lay out ten to see a dead Indian",—
the device which aimed at converting
to the benefit of a living author, the 
expense they were only disposed to throw
away upon a dead one, if not praise<span class="hard-hyphen">-
</span><span class="soft-hyphen">-</span>worthy, was at least pardonable.</p>
<p>In fine, Chatterton was stung to the
quick by neglect, and rendered de<span class="soft-hyphen">-
-</span>fiant by the apparent blindness of
fortune.</p>

Note that in the first case, where the hyphen is doubled, this has to be dealt with when the HTML encoding is generated, so that the first hyphen and its line-break are encoded as a hard hyphen and the second hyphen as a soft one. This works no matter which characters are used for the actual hyphens.
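
How the encoding gets generated will depend on the transcription format, but the rule itself is small enough to sketch. The function encodeSplitWord and its options (hard, endMark, startMark) are hypothetical names of mine, and the sketch assumes the marks are plain characters that need no HTML escaping.

```javascript
// Build the markup for one word that is split across a line-break.
// encodeSplitWord and its option names are hypothetical, for illustration only.
function encodeSplitWord(before, after, opts = {}) {
  // endMark: the character at the end of the line ("-", ":", "=", ...)
  // startMark: the mark repeated at the start of the next line, if any
  const { hard = false, endMark = "-", startMark = "" } = opts;
  const span = (cls, text) => `<span class="${cls}">${text}</span>`;
  if (hard) {
    // The line-end mark and its line-break go in the hard span;
    // a repeated mark on the next line is always soft.
    const repeat = startMark ? span("soft-hyphen", startMark) : "";
    return before + span("hard-hyphen", endMark + "\n") + repeat + after;
  }
  // Soft: the mark(s) and the line-break all disappear on reflow.
  return before + span("soft-hyphen", endMark + "\n" + startMark) + after;
}
```

Calling encodeSplitWord("praise", "worthy", { hard: true, startMark: "-" }) reproduces the 'praise-worthy' markup in the source above, and encodeSplitWord("de", "fiant", { startMark: "-" }) reproduces the 'defiant' one.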

Flowed

"When they will not give a doit to relieve a lame beggar, they will lay out ten to see a dead Indian",— the device which aimed at converting to the benefit of a living author, the expense they were only disposed to throw away upon a dead one, if not praise-worthy, was at least pardonable.

In fine, Chatterton was stung to the quick by neglect, and rendered defiant by the apparent blindness of fortune.

With line-breaks preserved

"When they will not give a doit to relieve a lame beggar, they will lay out ten to see a dead Indian",— the device which aimed at converting to the benefit of a living author, the expense they were only disposed to throw away upon a dead one, if not praise-
-worthy, was at least pardonable.

In fine, Chatterton was stung to the quick by neglect, and rendered de-
-fiant by the apparent blindness of fortune.

I cheated a little bit in the last example because it is not possible to have two conflicting definitions of the same style in force on one HTML page at once. Since the styles can easily be swapped in JavaScript, I don't see that as a problem. Other than that, I used the same source encoding for the three formats.
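
One possible way to do the switching in JavaScript is to keep both rule sets in a single stylesheet, scope each under a class on a container element, and toggle that class. This is only a sketch: the class names "flowed" and "lines" and the function setLineMode are assumptions of mine, not something the styles above require.

```javascript
// One stylesheet holding both rule sets, each scoped by a class on a container.
// The class names "flowed" and "lines" are assumptions made for this sketch.
const HYPHEN_CSS = [
  ".flowed .soft-hyphen { display:none }",
  ".flowed .hard-hyphen { word-spacing:-.25em; }",
  ".lines .soft-hyphen, .lines .hard-hyphen { white-space:pre }",
].join("\n");

// Switch a container between the two modes. Works on any object that
// exposes a DOM-style classList, for example document.body.
function setLineMode(container, preserveBreaks) {
  container.classList.toggle("lines", preserveBreaks);
  container.classList.toggle("flowed", !preserveBreaks);
}
```

In a browser you would put HYPHEN_CSS into a style element once and then call setLineMode(document.body, true) to show the hyphenated line-breaks, or setLineMode(document.body, false) to flow the text again.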