Monday, December 13, 2010

How Do You Transcribe?

As the planning and development of T-PEN begins to take practical shape, I have begun to think about my transcription methodology – the manner in which I have transcribed manuscripts in the past – and its implications for T-PEN’s user interface and usability.

Transcription is a very practical activity, and it is usually something I have done rather than thought too deeply about. However, when I sat back and reflected on my methodology, I noticed that I tend to take advantage of the digital medium and work from digitized manuscripts, with copies of all the manuscripts containing the text open on my computer as I transcribe. Furthermore, I tend to work by sense-unit, transcribing a few short words from my base manuscript and then checking this reading against all the other manuscripts before I move on to the next unit of text. In effect, I transcribe only one manuscript and collate, sense-unit by sense-unit, as I transcribe.

I am certainly not claiming that this is the ideal method of transcription, but it struck me that it is quite different from the traditional approach of transcribing the entire text from one manuscript, then moving to the next manuscript and transcribing or collating the entire text once again. Of course this approach originally resulted from the fact that the requisite manuscripts were often housed in different libraries, making it impossible to compare all the manuscripts as one transcribed. This situation certainly changed with the advent of microfilm and other methods of reproduction, but it has been transformed radically with the arrival of digitization.

Nonetheless, the basic paradigm around which we have built the T-PEN prototype is still the old approach of transcribing the entire text one manuscript at a time. My reflections on my own practices have demonstrated that there are clearly other ways of transcribing a text; and T-PEN will be a far better tool if it is adaptable to the user’s preferred approach to transcription, whatever that may be.

As a result we are currently considering whether we should enable users to move between different manuscripts as they transcribe and/or facilitate other transcription techniques. I would therefore like to invite you to reflect on your own approach to transcription and on what kind of digital tool would best fit your preferred practices. How flexible or rigid would you like such a tool to be: should it provide a variety of options or be based on what we consider best practice? Even more fundamentally: what is your own transcription methodology – exactly how do you transcribe? All comments welcome below!

(Manuscript images from the e-codices collection and Codices Electronici Ecclesiae Coloniensis. Used in accordance with conditions of use.)

Friday, December 10, 2010

Painting a Beach: TEI P5, Paleography, Editing, and Digitising

One of the primary features of T-PEN will be its “auto-encoding”: transcribers will be able to insert commonly used XML markup into their transcriptions without needing to understand the technicalities. At the CCL, we chose TEI P5 for our encoding, and the first major release of T-PEN will orient its auto-encoding toward TEI P5 markup. The hope is that projects involving large quantities of transcribed text will no longer be crushed under the burden of encoding, for which the labour can certainly be extensive. Our premise is that the scholar -- who is often the transcriber -- has the expertise to identify the components to be marked, though perhaps not expertise in TEI.

TEI P5 is rich, perhaps overly rich, in choices, and as TEI modules multiply, so do the choices. One can agonise for a long time about which tag to use, especially as many of the distinctions are subtle. Different TEI modules can offer different possibilities, as the TEI documentation makes clear. Working within the “Manuscript Description” module (number 10) gives options that differ somewhat from those of the “Transcription of Primary Sources” module (number 11), which in turn differs from the “Critical Apparatus” module (number 12); and one might be working in other major areas of TEI as well (“verse”; “certainty, uncertainty, and responsibility”; etc.).

We anticipate that T-PEN users will likely be integrated persons: paleographers interested in editing, or historians working with images of primary sources that they hope to edit. How will we make intelligent decisions about which tags are most commonly used, and most useful? Although we have designed our development process to include extensive testing on a range of use cases, we welcome additional comments from anyone with experience using TEI P5 in a project centred on manuscripts or handwritten documents.

I referred in the title of this post to “Painting a Beach”. That is because granularity is also an issue, and it marks perhaps one of the greatest shifts between the preparation of a print edition, on the one hand, and a digital presentation of material on the other. In print editions, it is normal to have an introductory study that might report, say, that a manuscript has severe water damage on fol. 8-12, 24-32, and 118-140, and that trimming interferes with reading the marginal glosses throughout. TEI P5 allows one to record on a word-by-word (or even letter-by-letter) basis what is obscure, why it is obscure, and how obscure it is. So, instead of painting a beach with a few strokes of sand-coloured paint, one can paint every grain of sand. Accurate, but efficient? Do paleographers and editors opt for significantly different levels of detail in their markup practices? If the point of markup is to provide a foundation for display programming, what sort of features are likely to be displayed? Or to what extent is markup just a form of recording information, with no use of digital tools foreseen? These are some of the larger theoretical questions upon which we meditate as we develop the auto-encoding feature.
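To make the contrast concrete, here is a hypothetical TEI P5 fragment (the tag and attribute choices would of course vary by project and editor) recording a single water-damaged word at word level, rather than a blanket introductory note about whole folios:

```xml
<!-- One obscured word, recorded grain by grain: the damage, its agent
     and degree, the conjectured reading, and the editor's confidence -->
<p>in principio
  <damage agent="water" degree="medium">
    <unclear reason="water damage" cert="medium">creavit</unclear>
  </damage>
  deus caelum et terram</p>
```

Multiplied across a 200-folio manuscript, this is precisely the every-grain-of-sand labour in question.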

Tuesday, November 30, 2010

Of lines and columns

TPEN’s current image parsing process begins by identifying columns, then proceeds to identify the lines of text within those columns. I worked with an early version of another tool this week which asks the user to draw bounding boxes around each line of text without specifying columns first. This method has an upside and a downside. The upside is that each individual line’s left and right boundaries are more precise than when they simply inherit those values from a column. In the TPEN UI, we have to display a reasonable amount of the image beyond the column boundaries (currently four times the mean height of lines on the page) to account for cases where a line extends outside the column, or the column is skewed. A precise bounding of each line doesn’t have this issue. The downside of not having columns is that it is not easy to order the lines properly for transcription: you cannot just take the top line from the leftmost column, continue to the bottom of that column, and then move to the next column. This particular tool currently displays lines based solely on their vertical starting point, meaning that in a two-column document you will get the first line of one column, then the first line of the other.
Having seen the upsides and downsides of both of these methods, it seems a hybrid method would be best: something that either groups individually boxed lines into columns, or that searches for lines within columns but then attempts to find the left and right boundaries for each particular line, allowing those values to be used in the UI without losing the line-to-column relationship. I am going to prototype the latter method this week and see if it ends up being the best of both worlds.
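The first of those hybrid options could be sketched roughly like this (a hypothetical illustration, not T-PEN's actual code): cluster individually boxed lines into columns by horizontal overlap, then read the columns left to right and the lines within each column top to bottom.

```javascript
// Hypothetical sketch of grouping individually boxed lines into columns.
// A line box is { x, y, width, height } in image coordinates.

// Two boxes belong to the same column if their x-ranges overlap.
function overlapsHorizontally(a, b) {
  return a.x < b.x + b.width && b.x < a.x + a.width;
}

function orderLines(lineBoxes) {
  const columns = [];
  for (const line of lineBoxes) {
    // Attach the line to the first column it overlaps; otherwise start a new one.
    const col = columns.find(c => c.some(l => overlapsHorizontally(l, line)));
    if (col) col.push(line); else columns.push([line]);
  }
  // Columns read left to right...
  columns.sort((c1, c2) =>
    Math.min(...c1.map(l => l.x)) - Math.min(...c2.map(l => l.x)));
  // ...and lines within a column read top to bottom.
  columns.forEach(c => c.sort((l1, l2) => l1.y - l2.y));
  return columns.flat();
}
```

Each line keeps its own precise left and right boundaries for the UI, while the grouping preserves the line-to-column relationship needed for transcription order. Skewed or protruding lines would need a fuzzier overlap test than this sketch uses.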

Wednesday, November 24, 2010

Defining Digital Humanities

For the last month or so I've been on an evangelistic kick at my home institution promoting digital humanities. It has been a lot of fun learning what my colleagues in the humanities departments think when they hear the phrase "digital humanities." Inevitably, the tables get turned on me and I am asked: what do you think Digital Humanities is? This has heated up a bit in the last few days since the New York Times published an article on how Digital Humanities could unlock the humanities' riches. Reading that article, along with the almost 100 comments and the response by Martin Foys and Asa Mittman, has prodded me further to reflect on defining what I do.

It is typical in both the humanities and the sciences to engage in research without first identifying its context in absolute terms. Indeed, one should always be wary of any a priori definition of research fields, since the more accurate accounts of academic work come out of praxis. So I haven't been particularly embarrassed that I had never properly defined what I do as a digital humanist. But with the question dangling, I had to articulate my vision. Here's what I have come up with:

"As a form of scholarly work digital humanities  offers methods and resources that can strengthen the established methods of humanities research.  It can also help make the boundaries between the constituent disciplines more porous and thus bring together different groups of scholars and students.  Digital humanities comes closest to how our students engage the world on a daily basis as consumers of digital information. 

"Such possibilities may seem incommensurate with the common (mis)perception of digital humanities, namely that it is defined solely by the task of digitization. This is certainly a principal task of digital humanities, but it hardly accounts for all that it is. In sum, digital humanities comprises three general tasks: preservation, aggregation, and integration.

"To begin with, digital humanists engage in the digitization of existing artifacts as an act of preservation, from texts to images to 3D objects. Digitization projects, however, preserve artifacts for the sake of access. There is no point in digitizing if it does not change the scope of access; digitization ought to be about broadening access.

"As for aggregation, the methods of digital humanities allow for a wider consultation and analysis of large data sets through automation. Even if computer algorithms assist in this analysis, it is based on the core method of "pattern matching" that most humanists use: finding common ideas or words/phrases in a set of texts, matching textual accounts to other cultural artifacts or practices, and so on. Digital humanities permits this type of work to occur on a larger scale, and often supports complex forms of pattern matching. This data, when brought together, can give the humanist scholar ample evidence to draw significant conclusions. I would include here the role that "crowdsourcing" has come to play recently: many projects can achieve far more with a large number of scholars and interested parties working collaboratively.

"Finally, the ample evidence from the aggregate sources is analyzed within an interdisciplinary context. On a very basic level, this encourages the integration of text and image, but it can also provide ideal opportunities to integrate different textual types (or different sets of images) in a way that assists the reader or user in engaging the complexity without becoming confused.

"Digital humanities can provide the virtual means to bring together disparate sources, ones that have never been connected before (and perhaps can never be physically present together), and give the humanist scholar the tools to develop a more complex picture of the topic under study. It can also bring together disparate scholars, which can open new ways to study and interpret cultural artifacts. One example is how historians are using GIS technology to contextualize historical narratives within a specific geographical space (paying attention to meteorology and other environmental conditions) or to map the path of documents as they traveled from reader to reader.

"Digital humanities can therefore breathe new life into the world of scholarship and teaching, without snuffing out how humanities scholars currently function.  Additionally, it can provide a mechanism for the critical evaluation of technology as both a pedagogical tool and a form of research.  As our culture demands immediate access to larger and larger sets of data, and also seeks ways to integrate that data in multivalent ways, digital humanities can assist students and professors in this complex, wired world."

I'm not suggesting that I have developed the definitive account of digital humanities.  Some of my fellow DHers might object to the way I privilege text -- although I employ a rather elastic notion of text as simply a container of information.   And, I'm sure I've not accounted for everything, but this is my modest contribution to trying to understand what we do as digital humanists and why we do it.

Tuesday, November 9, 2010

Recruiting a GUI Web Developer

T-PEN is currently advertising for a GUI Developer. The advertisement is posted online, where applications must also be submitted.

Here are the basic details:

Web Developer (Theological Studies)
Job Summary: Under general direction, assists in the creation and deployment of a web-based digital tool for scholars working with digital images of unpublished manuscripts; performs Graphical User Interface (GUI) related duties with responsibility for the user interface.
May include any and/or all of the following:

  1. Participates as a member of a design team, working with the senior developer, project directors, and other personnel.
  2. Designs a set of user-centered, interactive Web pages to integrate various APIs that will comprise the digital tool.
  3. Contributes to testing and bug tracking of various iterations of the digital tool.
  4. Responds to formal usability testing and makes appropriate changes in implementation.
  5. Oversees the Center's Web pages and ensures they are up to date and functional.
  6. Performs other duties as assigned.
Knowledge, skills, and abilities:
  • Knowledge of Web development, applications, and technology
  • Knowledge of HTML/CSS and Javascript
  • Knowledge of programming and graphic design
  • Project management skills
  • Interpersonal/human relations skills
  • Written and verbal communication skills
  • Ability to provide site management solutions
  • Ability to perform work that is technically oriented, working on various platforms, including Microsoft and Apple
  • Ability to recognize trends in Web development
  • Ability to maintain confidentiality
Education and experience equivalent to:
Bachelor's degree; supplemented with two (2) years related work experience

T-PEN's image

Branding in Digital Humanities is important, since it helps users easily identify a specific tool or methodology -- especially if that tool interoperates with or can be integrated into larger frameworks. At T-PEN, we've started thinking about this, and a more serious attempt at branding our tool will come in the next few months. For now, we offer this basic image:



I’m Jim Ginther and I am the Principal Investigator for T-PEN. I have been working in Digital Humanities for over a decade, and T-PEN is my seventh research project in DH. I am very excited about this project for two reasons. First, T-PEN is one step closer to a major dream of mine: an editing suite that assists the editor from the transcription stage, through editing, collating, and annotation, to the final digital publication – all in a digital workspace. Given how many other teams are working towards this same dream, the T-PEN team is committed to interoperability: we want to ensure that users can take their transcriptions and import them easily into other tools. The second reason I am excited about T-PEN is that I am providing one of the regular use cases during development. I will begin a critical edition of the Super Psalterium of Robert Grosseteste (ca. 1170-1253). I have been studying the life and works of this English thinker since my doctoral studies. Grosseteste was one of the few masters we know by name at the University of Oxford in the early thirteenth century. He was also a polymath: a leader in natural philosophy (in the areas we now identify as cosmology and mathematical physics), an outstanding theologian, and bishop of Lincoln. His commentary on the Psalter is the last major work from his days at Oxford. Editing this text is not without its challenges, since its textual history is a complicated story. During T-PEN’s development I will be transcribing one of the manuscript witnesses of this large text (well over 200,000 words in length!). Given the unique character of the text, I will definitely be putting T-PEN through its paces.

Monday, November 8, 2010

YouTube video of T-PEN's Basic Features

One of Tomás O'Sullivan's responsibilities as T-PEN's Research Fellow is to document the tool's development and feature set. Here is his first YouTube video, which describes T-PEN's basic features. The voice belongs to Tomás himself.

Tuesday, November 2, 2010


Hello! My name is Tomás O’Sullivan, and I am a research fellow on the T-PEN project.
I hail from Bantry, Co. Cork, Ireland, and hold degrees in Medieval History and in Theology from University College Cork and Mary Immaculate College, University of Limerick. I am currently based in the Department of Theological Studies and the Center for Digital Theology at Saint Louis University, where I cut my digital teeth over the last few years working on the Electronic Norman Anonymous Project with Jim Ginther and Jon Deering.
My research interests focus on the ecclesiastical culture of early medieval Ireland within its Insular and European contexts, with particular concentration on homilies, eschatology (conceptions of the end of the world and the afterlife) and hagiography (writings about the saints). My PhD dissertation examines a distinct collection of Insular homilies which survives in four manuscripts copied on the Continent in the ninth century; these manuscripts will form the basis for my test-case to run T-PEN through its paces as development proceeds. I’ll also function as the technical writer for the project, creating a user manual to accompany the final product.
Transcribing and editing the anonymous Latin homilies of the early Middle Ages is a daunting task, as these sermons were often composed from a variety of textual extracts and images which could be combined and recombined in a kaleidoscope of patterns; I’ve taken to using the phrase “microtexts in motion” to describe this situation where, very often, there is no such thing as a “stable” text. I’m excited about the possibility of using T-PEN’s automated encoding and personalized mark-up features to rein in these mobile microtexts. I’m confident that if T-PEN can help me tame these anonymous homilies, it should be able to handle anything!

Tuesday, October 26, 2010

quick hello from the Co-PI (Abigail Firey, Univ. of Kentucky)

I come to T-PEN as the director of a project that has as one of its central activities the transcription of unedited, usually unprinted manuscripts: the Carolingian Canon Law project. It is a highly collaborative project, designed to receive contributions from scholars present and future, known and unknown, in order to build a “conceptual corpus” of the legal texts known to Carolingian jurists. Because of its open nature and the need for transcriptions prepared to the highest editorial standards, the project will benefit enormously from a tool that allows easy transcription and simultaneous preparation of an encoded file (we are using TEI P5) without any knowledge of markup required on the part of the contributor, and that also allows ready verification and correction of transcriptions. Transcribing is – as experienced scholars know! – a demanding task, and there are dangers lurking. We had a strange experience on the CCL when we were reviewing a transcription that had been prepared from an existing electronic version of a text and then altered to match the manuscript readings: until we started line-by-line proofreading, we did not notice that very familiar words in the first rubric were, in fact, missing in the manuscript! If the transcriber had been using T-PEN to create a new transcription, there likely would not have been the error; even if the presumed reading had crept in, it would have been easy to check and correct the transcription. Our other challenge has been to keep up with encoding transcriptions (we haven’t, is the short answer). We cannot wait to implement the “auto-encoding” function of T-PEN, so our research assistants can then dig into serious scholarship, instead of encoding all the time! (Some readers may remember Stan Rogers’ “White Collar Holler”: “Can you code it? Program it right!”)


I'm Jon Deering, the Senior Developer working on TPEN.
As Jim said, I'm getting started on TPEN today. The first thing I needed to do was build a bit of a front page for the tool. The transcription prototype was built for ENAP, which involved only a single manuscript. While it was built to allow other manuscripts to be available, including manuscripts from other repositories, we didn't have a good way to browse the manuscripts available in the tool. The front page I built lists the available manuscripts along with the name of the hosting repository, each serving as a link to the first page of that MS available for transcribing. During our Monday meeting, I'm going to see if we can get the go-ahead for at least one full repository to be made available in the prototype, maybe more. By then I think we will have our domain, and anyone will be able to go in and test transcribing with those few hundred manuscripts.
I'm also working on the customizable hotkeys, which allow a transcriber to set up a number of non-standard characters they use often in transcribing (Þ, Ð, and Æ in Middle English, for example). Those characters will be clickable on a toolbar at the bottom and will also have control combinations assigned to them, starting with Ctrl+1 through Ctrl+9. We had a static set for Middle English when a paleography class used our prototype last fall, and the feedback we received was very positive.
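The core of such a scheme is simple enough to sketch (a hypothetical illustration of the idea, not TPEN's actual implementation): a transcriber-configurable list of characters is bound to Ctrl+1 through Ctrl+9, and a keydown handler looks up which character, if any, to insert.

```javascript
// Hypothetical sketch of configurable transcription hotkeys.
// Bind the first nine user-chosen characters to Ctrl+1 .. Ctrl+9.
function makeHotkeyMap(chars) {
  const map = {};
  chars.slice(0, 9).forEach((ch, i) => { map[String(i + 1)] = ch; });
  return map;
}

// Given a keydown-style event, return the character to insert, or null.
function charForKey(map, event) {
  return (event.ctrlKey && map[event.key]) ? map[event.key] : null;
}

// Example: a static Middle English set like the one used by the
// paleography class mentioned above.
const hotkeys = makeHotkeyMap(['Þ', 'Ð', 'Æ', 'þ', 'ð', 'æ']);
// A real UI would call charForKey from a keydown listener and insert
// the result at the caret in the transcription box; the same map can
// drive the clickable toolbar.
```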
As far as introducing myself goes, there isn't much to say. I keep a rather low profile online, not using Facebook and only using Twitter for work-related communication. While I don't have a humanities background, I have been caught puzzling over the peculiarities of manuscripts now and then, just because they end up in front of me while I work on projects. This was particularly true of the image processing project we did over the summer, which I'll post about in the future when I can include links back to the results. I really enjoy the image processing side of the work we do, and the way setting the computer loose on images of a manuscript can aid not only in the editing but in the exploration of the document. I would say that is my favorite aspect of working on both ENAP and TPEN.
I attended THATCamp London in July, an unconference for digital humanities held right before DH2010. I enjoyed the opportunity to share my work, and found that others in the field had solved some of the problems I was dreading. THATCamp provides a less formal, very open and collaborative forum where such things can happen. In particular, while I was showing our transcription prototype to a few people, one of them showed me an in-house tool used for transcription at the library where he worked. Their method of representing overlapping tags with colored underlining is an elegant solution, and something I expect to use in TPEN.

Welcome Back!

Jon Deering, T-PEN's Senior Developer, begins work on the project today. Jon was the sole developer for the Electronic Norman Anonymous Project (ENAP), which finished in July 2010. It is great to have him back. His enormous energy and creative approach to software engineering have garnered him a great deal of respect and praise here at Saint Louis University and amongst our collaborating institutional partners. We know we are going to see great things from him as we all get hip-deep into T-PEN development.

Jon will introduce himself, along with the other project staff, in the near future.

Wednesday, October 20, 2010

T-PEN Strategy Meeting

Yesterday the core design team for T-PEN met to plan out the workflow for the next 18 months. We worked through a hefty agenda that took almost six hours to complete, but we covered everything! That included a terrific "on the fly" schematic outline of T-PEN's basic functionality that Jon Deering, T-PEN's senior developer, put on the whiteboard (right). It was exciting to start talking about the practical realities of a project we've been dreaming about for a few years now.

We've got lots to do, but there is going to be some fun work ahead.

Sunday, October 10, 2010

What is T-PEN?

T-PEN (Transcription for Paleographical and Editorial Notation) is a digital tool for scholars who use digital images of unpublished manuscripts that are housed in digital repositories throughout the world. T-PEN will provide a fully-equipped digital workspace in which the scholar -- while constantly viewing the manuscript images -- transcribes line by line, makes notes about problematic paleographic features, documents glosses and corrections or revisions to the manuscript, and may—either during transcription or after further research—add interpretative or bibliographic information pertaining to particular lines or larger sections of the text. With this tool, the transcribed text can also be immediately encoded with XML markup to indicate any given feature of the text (e.g., a rubric, colophon, gloss, lemma, correction, quire signature, citation, etc.).

This blog will track the 18 months of software development and testing we have planned. This project is being funded by both the National Endowment for the Humanities and the Andrew W. Mellon Foundation.