TPEN’s current image parsing process begins by identifying columns, then identifies the lines of text within those columns. This week I worked with an early version of another tool that asks the user to draw a bounding box around each line of text without specifying columns first. This method has an upside and a downside. The upside is that each individual line’s left and right boundaries are more precise than when they simply inherit those values from a column. In the TPEN UI, we have to display a reasonable amount of the image beyond the column boundaries (currently 4 times the mean height of lines on the page) to account for cases where a line extends outside the column or the column is skewed. Precisely bounding each line avoids this issue. The downside of not having columns is that it is not easy to properly order the lines for transcription: you cannot simply take the top line of the leftmost column, continue to the bottom of that column, and then move on to the next column. This particular tool currently orders lines solely by their vertical starting point, meaning that in a two-column document you get the first line of one column, then the first line of the other.
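To make the ordering problem concrete, here is a minimal sketch. The line boxes and labels are invented for illustration, not taken from either tool; each line carries only the coordinates relevant to the example.

```python
# Hypothetical two-column page: each boxed line has a left edge (x) and a
# vertical starting point (y). Labels are invented for the example.
lines = [
    {"label": "col1-line1", "x": 10,  "y": 102},
    {"label": "col1-line2", "x": 10,  "y": 150},
    {"label": "col2-line1", "x": 300, "y": 100},
    {"label": "col2-line2", "x": 300, "y": 148},
]

# Ordering solely by vertical starting point interleaves the columns,
# because the lines of the two columns sit at nearly the same heights.
by_y = sorted(lines, key=lambda ln: ln["y"])
print([ln["label"] for ln in by_y])
# → ['col2-line1', 'col1-line1', 'col2-line2', 'col1-line2']
```

The reader would expect both lines of the first column before any line of the second, which is exactly what the vertical-only sort fails to produce.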
Having seen the upsides and downsides of both methods, it seems a hybrid approach would be best: something that either groups individually boxed lines into columns, or that searches for lines within columns but then finds the left and right boundaries of each particular line, allowing those values to be used in the UI without losing the line-to-column relationship. I am going to prototype the latter method this week and see if it ends up being the best of both worlds.