Monday, 6 January 2014

Paratexts in PSEF format

My reading about digital scholarly editions suggests that most people think of the scholarly edition as composed not only of a multi-version core-text, which is the thing being studied, but also of a collection of single-version paratexts. An example is the WoolfOnline site, where the distinction is made explicitly by the category 'contextual', as distinct from texts, images and bibliography. Contextual here is a collection of postcards and biographical information, but it could be anything: articles about the core-texts, photographs that are not facsimiles of the texts, a formal biography etc. There needs to be some way to wrap up this data into the digital blob that travels with the main edition. Otherwise there is a danger that paratexts will become detached from their roots, since they may be explicitly linked to files in the archive or online edition. The updated top-level structure of the current PSEF (portable scholarly edition format) looks roughly like this:

  • Cortexs and CorCodes: Each of these is identified by a docID, which is a relative path such as "english/shakespeare/kinglear/act1/scene1". Each version within a document is given a version ID as well, expressed the same way, such as "/folios/F1", which is the F1 version of the folios of that play. Put them together and you get a complete path that uniquely identifies that text and its markup. Documents are themselves split into two parts:
    • Cortexs are the basic plain text files. For each physical version there are potentially many sub-files, each representing a layer such as a normalised spelling layer, a version with abbreviations expanded, the base text after the first level of corrections etc. Each layer is a coherent plain text.
    • CorCodes are the standoff markup files, potentially many per plain text file. If you want to represent, say, structural markup and a set of references to locations in the text, then a set of links to an external image, each can be stored in separate files, and later merged to produce a single HTML file for viewing in a browser. This also keeps the text cleanly separated from the markup.
  • CorPix is a collection of facsimile images of pages of specific texts. Several layers within a Cortex/CorCode document might refer to the same set of facsimile images, or to different images. CorPix files are referred to in the same way, via docIDs.
  • Corform is a collection of formats in some stylesheet language such as CSS or XSLT. They also have docIDs, and are referred to by the CorCodes, which need a default rendering format (although you can substitute another or add an extra one).
  • Config is a collection of JSON files with configuration parameters for any part of the edition, usually for importing. These files can be used to specify import filters, or to provide long version names for the short versions used in docIDs.
  • Misc: this stores the paratextual files, again using the simple docID scheme, but without the complication of multiple versions.
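
To make the Cortex/CorCode split more concrete, here is a minimal sketch of how several standoff layers might be merged with a plain text to give one HTML string. It is not the psef-tool's own code or file format: the (start, end, tag) tuples and the function name merge_standoff are hypothetical, and the sketch assumes that the ranges are non-empty and properly nested, so it does nothing clever about overlap.

```python
from html import escape

def merge_standoff(text, *layers):
    """Merge standoff layers into one HTML string.

    Hypothetical representation: each layer is a list of
    (start, end, tag) tuples holding character offsets into text.
    Assumes non-empty ranges that are nested or disjoint.
    """
    spans = [span for layer in layers for span in layer]
    opens, closes = {}, {}
    for i, (start, end, tag) in enumerate(spans):
        opens.setdefault(start, []).append((end, i, tag))
        closes.setdefault(end, []).append((start, i, tag))
    out = []
    for pos in range(len(text) + 1):
        # Close the innermost range first: latest start, then latest listed.
        for _, _, tag in sorted(closes.get(pos, []), key=lambda c: (-c[0], -c[1])):
            out.append(f"</{tag}>")
        # Open the outermost range first: furthest end, then earliest listed.
        for _, _, tag in sorted(opens.get(pos, []), key=lambda o: (-o[0], o[1])):
            out.append(f"<{tag}>")
        if pos < len(text):
            out.append(escape(text[pos]))
    return "".join(out)

# One plain text, two independent markup layers kept apart from it.
text = "Blow, winds, and crack"
structure = [(0, len(text), "p")]
emphasis = [(6, 11, "em")]
print(merge_standoff(text, structure, emphasis))
# prints: <p>Blow, <em>winds</em>, and crack</p>
```

The point of the design is that the text itself is never touched: a layer can be dropped or added simply by leaving it out of, or passing it to, the merge.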

In addition, the Cortexs and Corcodes can be entirely replaced by the format of your choice. Currently the psef-tool supports HTML, TEXT (described above), XML and MVD. Using the plain text and HTML formats alone, any ordinary tool can read a psef-archive, so the format can claim to be interoperable, that is, usable in many programs. No special tools are needed.
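
As a rough illustration of the "no special tools" claim, the sketch below reads an unpacked archive with nothing but the Python standard library. The on-disk layout is an assumption of mine, not something specified above: I assume the archive has been unpacked into a directory with one folder per top-level category, and that each file's path relative to that folder is its docID followed by its version ID. The folder name "cortex" and the function names are hypothetical.

```python
from pathlib import Path

def list_doc_paths(archive_root, section):
    """List every file in one section of an unpacked archive.

    Assumed layout (hypothetical): <archive_root>/<section>/<docID>/<versionID>.
    Returns paths relative to the section, using forward slashes.
    """
    base = Path(archive_root) / section
    return sorted(
        p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_file()
    )

def read_layer(archive_root, section, doc_path):
    """Read one plain text file, identified by its relative path."""
    return (Path(archive_root) / section / doc_path).read_text(encoding="utf-8")
```

With a layout like that, list_doc_paths("my-edition", "cortex") would return entries such as "english/shakespeare/kinglear/act1/scene1/folios/F1", and read_layer would hand back the text as an ordinary string.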
