One of the primary features of T-PEN will be its “auto-encoding”: transcribers will be able to insert commonly used XML markup into their transcriptions without needing to understand the technicalities. At the CCL, we chose TEI P5 for our own encoding, and the first major release of T-PEN will orient its auto-encoding toward TEI P5 markup. The hope is that projects involving large quantities of transcribed text will no longer be crushed under the burden of encoding, which can demand extensive labour. Our premise is that the scholar, who is often also the transcriber, has the expertise to identify the components to be marked, though perhaps not the expertise in TEI.

TEI P5 is rich, perhaps overly rich, in choices, and as TEI modules multiply, so do the choices. One can agonise for a long time over which tag to use, especially as many of the distinctions are subtle. Different TEI modules can offer different possibilities, as the TEI documentation makes clear. Working within the “Manuscript Description” module (number 10) gives options that differ somewhat from those of the “Transcription of Primary Sources” module (number 11), which in turn differ from those of the “Critical Apparatus” module (number 12), and one might be working with other major areas of TEI as well (“verse”, “certainty, uncertainty, and responsibility”, etc.). We anticipate that T-PEN users will often combine roles: paleographers with an interest in editing, or historians working with images of primary sources that they hope to edit. How will we make intelligent decisions about which tags are most commonly used, and most useful? Although we have designed our development process to include extensive testing on a range of use cases, we welcome additional comments from anyone with experience of using TEI P5 in a project centred on manuscripts or handwritten documents.
In the title of this post I referred to “Painting a Beach”. That is because granularity is also an issue, and it is potentially one of the greatest shifts between preparing a print edition and preparing a digital presentation of the same material. In a print edition, it is normal to have an introductory study that might report, say, that a manuscript has severe water damage on fols. 8-12, 24-32, and 118-140, and that trimming interferes with reading the marginal glosses throughout. TEI P5 allows one to record on a word-by-word (or even letter-by-letter) basis what is obscure, why it is obscure, and how obscure it is. So, instead of painting a beach with a few strokes of sand-coloured paint, one can paint every grain of sand. Accurate, but efficient? Do paleographers and editors opt for significantly different levels of detail in their markup practices? If the point of markup is to provide a foundation for display programming, what sort of features are likely to be displayed? Or to what extent is markup just a form of recording information, with no use of digital tools foreseen? These are some of the larger theoretical questions upon which we meditate as we develop the auto-encoding feature.
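To make the contrast concrete, here is a minimal Python sketch of the two levels of granularity. It is an illustration only and does not describe how T-PEN itself works. The function names and the sample Latin words are hypothetical, the TEI namespace is omitted for brevity, and the choice of `<condition>` for the manuscript-level note and `<unclear>` (with `reason`, `agent`, and `cert`) for the word-level markup is one plausible reading of TEI P5 among several.

```python
import xml.etree.ElementTree as ET


def coarse_condition(note):
    """A few strokes of paint: one prose statement for the whole manuscript."""
    condition = ET.Element("condition")
    condition.text = note
    return condition


def fine_grained_line(words):
    """Every grain of sand: mark each obscure word individually.

    `words` is a list of (text, attrs) pairs. `attrs` is None for a clearly
    legible word, or a dict of attributes for an <unclear> element.
    """
    line = ET.Element("ab")
    last = None  # most recently added <unclear> child, if any
    for text, attrs in words:
        if attrs is None:
            # Legible text goes into the parent's text or the last child's tail.
            if last is None:
                line.text = (line.text or "") + text + " "
            else:
                last.tail = (last.tail or "") + text + " "
        else:
            last = ET.SubElement(line, "unclear", attrs)
            last.text = text
            last.tail = " "
    return line


# Hypothetical sample data, for illustration only.
summary = coarse_condition("Severe water damage on fols. 8-12, 24-32, and 118-140.")
detailed = fine_grained_line([
    ("In", None),
    ("principio", {"reason": "illegible", "agent": "water", "cert": "low"}),
    ("erat", None),
])

print(ET.tostring(summary, encoding="unicode"))
print(ET.tostring(detailed, encoding="unicode"))
```

The first function costs the editor one sentence per manuscript; the second costs a decision about reason, agent, and certainty for every affected word, which is exactly the labour that the questions above put under scrutiny.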