Friday, December 10, 2010

Painting a Beach: TEI P5, Paleography, Editing, and Digitising

One of the primary features of T-PEN will be its “auto-encoding”: transcribers will be able to insert commonly used XML markup into their transcriptions without needing to understand the technicalities. At the CCL, we chose to use TEI P5 for our encoding, and our first major release of T-PEN will be oriented toward TEI P5 markup for the auto-encoding. The hope is that projects involving large quantities of transcribed text will no longer be crushed under the burden of encoding, which can demand extensive labour. Our premise is that it is the scholar, who is often the transcriber, who has the appropriate expertise to identify the components to be marked, although perhaps not the expertise in TEI.

TEI P5 is rich, perhaps overly rich, in choices, and as TEI modules multiply, so do the choices. One can agonise for a long time about which tag to use, especially as many of the distinctions are subtle. Different TEI modules can offer different possibilities for the same feature, as the TEI documentation makes clear. Working within the “Manuscript Description” module (number 10) gives options that differ somewhat from the “Representation of Primary Sources” module (number 11), which in turn differs from the “Critical Apparatus” module (number 12), and one might have been working on other major areas of TEI as well (“verse”, “certainty, uncertainty, and responsibility”, etc.).

We anticipate that T-PEN users will often combine several roles: they will be to some degree paleographers interested in editing, or historians working with images of primary sources that they hope to edit. How will we make intelligent decisions about which tags are most commonly used, and most useful? Although we have designed our development process to include extensive testing on a range of use cases, we welcome additional comments from anyone with experience in using TEI P5 in a project centred on manuscripts or handwritten documents.
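
To make the idea of auto-encoding concrete, here is a minimal sketch in Python. It is not T-PEN's implementation: the function `wrap_selection`, the `TAG_PALETTE` dictionary, and the default attribute values are all hypothetical, chosen only for illustration. The element names themselves (`unclear`, `supplied`, `del`, `add`) are real TEI P5 elements. The point is simply that a transcriber picks a span of text and a label, and the tool supplies the markup.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical palette of "one-click" tags. The element names are real TEI P5
# elements; which tags appear here and their default attributes are assumptions.
TAG_PALETTE = {
    "unclear": {"reason": "faded"},
    "supplied": {"reason": "omitted"},
    "del": {"rend": "overstrike"},
    "add": {"place": "above"},
}


def wrap_selection(line, start, end, tag, **overrides):
    """Wrap line[start:end] in a TEI element chosen from TAG_PALETTE."""
    if tag not in TAG_PALETTE:
        raise ValueError(f"unknown tag: {tag}")
    if not 0 <= start < end <= len(line):
        raise ValueError("selection out of range")
    # Defaults from the palette, optionally overridden by the transcriber.
    attrs = {**TAG_PALETTE[tag], **overrides}
    attr_text = "".join(f" {name}={quoteattr(value)}" for name, value in attrs.items())
    # Escape the transcribed text so that characters such as & and < stay valid XML.
    return (
        escape(line[:start])
        + f"<{tag}{attr_text}>"
        + escape(line[start:end])
        + f"</{tag}>"
        + escape(line[end:])
    )


# The transcriber selects the word "erat" (characters 13 to 17) and clicks "unclear".
encoded = wrap_selection("In principio erat verbum", 13, 17, "unclear")
print(encoded)
# In principio <unclear reason="faded">erat</unclear> verbum
```

Even in a sketch this small, the hard question is the one raised above: which handful of tags, with which default attributes, deserves a place in the palette.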

I referred in the title of this post to “Painting a Beach”. That is because granularity is also an issue, and it is potentially one of the greatest differences between the preparation of a print edition on the one hand and a digital presentation of material on the other. In print editions, it is normal to have an introductory study that might report, say, that a manuscript has severe water damage on fols. 8-12, 24-32, and 118-140, and that trimming interferes with reading the marginal glosses throughout. TEI P5 allows one to record on a word-by-word (or even letter-by-letter) basis what is obscure, why it is obscure, how obscure it is... So, instead of painting a beach with a few strokes of sand-coloured paint, one can paint every grain of sand. Accurate, but efficient? Do paleographers and editors opt for significantly different levels of detail in their markup practices? If the point of markup is to provide a foundation for display programming, what sort of features are likely to be displayed? Or to what extent is markup just a form of recording information, with no use of digital tools foreseen? These are some of the larger theoretical questions upon which we meditate as we develop the auto-encoding feature.
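
The contrast between the two levels of granularity can be sketched in a few lines of Python using the standard `xml.etree.ElementTree` module. This is an illustration under stated assumptions, not a recommended encoding: the helper names `coarse_note` and `fine_line` and the sample Latin words are invented, while `condition`, `ab`, and `damage` (with its `agent` attribute) are real TEI P5 elements.

```python
import xml.etree.ElementTree as ET


def coarse_note(folio_ranges, agent):
    """A few strokes of paint: one summary statement for the whole manuscript."""
    condition = ET.Element("condition")
    condition.text = f"Severe {agent} damage on fols. {', '.join(folio_ranges)}."
    return condition


def fine_line(words, agent):
    """Every grain of sand: one <damage> element per affected word.

    `words` is a list of (text, damaged) pairs for a single transcribed line.
    """
    block = ET.Element("ab")
    block.text = ""
    last = None  # most recent <damage> child; later plain text goes in its tail
    for index, (text, damaged) in enumerate(words):
        separator = " " if index < len(words) - 1 else ""
        if damaged:
            last = ET.SubElement(block, "damage", agent=agent)
            last.text = text
            last.tail = separator
        elif last is None:
            block.text += text + separator
        else:
            last.tail += text + separator
    return block


coarse = ET.tostring(coarse_note(["8-12", "24-32", "118-140"], "water"), encoding="unicode")
fine = ET.tostring(
    fine_line(
        [("In", False), ("principio", True), ("erat", True), ("verbum", False)],
        "water",
    ),
    encoding="unicode",
)
print(coarse)
# <condition>Severe water damage on fols. 8-12, 24-32, 118-140.</condition>
print(fine)
# <ab>In <damage agent="water">principio</damage> <damage agent="water">erat</damage> verbum</ab>
```

The coarse version costs one sentence for the whole codex; the fine version costs a decision, and an element, for every affected word on every affected folio.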


  1. I love this project! I haven't (yet) signed up for an account, but I'm really interested in crowd-sourced transcription projects and the kinds of tools and applications that are being developed.

    I'm a project assistant for the William Blake Archive. The Blake Archive has only recently started transcribing and encoding manuscripts, and while we're not exactly TEI-compliant, we did modify our tag set when P5 was released. A colleague and I recently published an article about our experiences editing a MS of Blake's that you might be interested in -- it goes into quite a bit of detail about how our XML tag set evolved. The entire volume (all about editing Blake) is here:

    Our article is here:

    Looking forward to seeing how your project develops!

  2. Thanks so much for your comment! We have been watching the Blake Archive (its reputation precedes it), but had not caught up to the article. It is just the sort of thing we are interested in reading. We hope you will stay in touch as we post more about our tag set for T-PEN.