Encoding for Visibility

During my final term of Oxford’s Digital Scholarship MSc, I was given the chance to work on an ongoing Digital Humanities project of my choosing. I decided to work with CLARIN-UK to create a digital edition of the preface, dedication, and first six chapters of The Cambrian Register 1795-1796. This report details the process from start to finish.

My reasons for choosing the Register as the focus of the placement were twofold. First, the Register could add to CLARIN’s repository of Celtic material, which (whilst hardly empty) has fewer entries than the vast repositories of English or French. This is not an unusual situation, since the Celtic languages of the UK and Ireland, such as Welsh, Gaelic, and Irish, are generally not well represented in digital resources, which is something that the CLARIN-UK DR-LIB K-Centre hopes to address.

Second, the eccentricities of the Register made it an especially interesting text to encode, and a great exercise for developing my skills in digital preservation and text encoding. As a bilingual text with multiple sections of poetry and prose, the Register posed several unique questions about how to go about encoding: should I focus on structural fidelity? Are the sentences in the two languages properly aligned, and if not, how would this affect my tagging? These questions were always on my mind as I mapped out the encoding process, which would eventually be split into two broad phases: pre-processing and encoding.

The pre-processing stage of the project consisted of three photography sessions, followed by a first pass of transcription using OCR. The pictures were taken primarily for personal reference and as input for OCR; even though the images served only as working material, their quality was key to ensuring that the OCR ran smoothly. I initially used Transkribus for OCR, but soon changed to Google’s OCR via Google Docs because it was more effective and not limited by a monetised token system. In many cases, however, I still needed to manually transcribe and edit pages that Google Docs struggled with. Google Docs often failed to detect paragraph boundaries, especially in sections of parallel English-Welsh translation. The mixture of prose and poetry, the column layout of the paragraphs, and the multiple end notes that accompanied many passages also posed challenges to automated transcription. Of all these structural or syntactic hurdles, the one that prolonged the pre-processing stage by far the most was Google Docs’ inability to recognise the ‘long s’ (ſ). The ‘ſ’ was continually recognised as an ‘f,’ and each instance had to be corrected by hand. Once the transcriptions were cleaned up and organised by page number, I began to get an understanding of the shape of the text, and to realise the sheer potential of encoding.

Page 3 from The Cambrian Register 1795-1796

I began encoding by attempting to construct a digital edition that maintained structural fidelity to the source text. I wanted to focus on page layout because of the varied nature of the source; the shifting structure of the Register was, I believed, interesting enough as a point of study in its own right, and so I wanted a digital edition that recreated that structure as closely as possible. I tagged new pages using <pb/>, gave each chapter a separate heading with <head>, and ensured that the prose paragraphs were exactly the same length in the XML file as in the original. Early on, I even used line breaks (<lb/>) to mimic the column format of the prose section, although I later abandoned this when I realised that XML alone cannot recreate page layout: doing so would require building a web interface to interpret and style the XML, which was unfortunately beyond the scope of my assignment. Struck by this revelation, which more experienced encoders would probably have seen from a mile off, I took a step back to assess what I had worked through up to that point and to find another route for encoding. This isn’t to say that I went back and dropped every tag that built up page structure; the page breaks and headings are all still there, and I continued to tag new pages for the sake of consistency.

My new focus became tagging the links between the English and Welsh portions of the text, as the parallel translations were what drew me to the Register in the first place (even more so than its structure). Anyone who’s worked on an extended project will be familiar with the natural way work shifts and adapts over time. A project’s focus shifts, limitations are learned (and hopefully overcome), and the final product may not resemble what was originally envisioned. All of this holds true for making a digital edition, with the added element that you naturally discover more points of interest as you read the source text. That being said, as I began working through the Welsh sections of the Register, I wouldn’t call my transcription work ‘reading’ because, well, I don’t know Welsh.

In the section of the Register that I digitised there are two large excerpts containing parallel English and Welsh texts: a stanza from the Dedication’s opening ode, and the sequence from chapter VI that outlines the history of Wales as told by Geoffrey of Monmouth. While my transcription of the Welsh sections is accurate to the text, there are naturally many weaknesses in transcribing a language you don’t understand. Syntactic nuance is entirely missed because I am only looking at the letters on the page. Were I to expand my work with the Register, I would do so while consulting a Welsh speaker, to capture the depth of the text as well as its raw contents. Despite this linguistic barrier, the Register’s Welsh sections were still a rich source for encoding, and my aim was to highlight the direct links between the original Welsh and the English translation. I used the <div>, <seg>, and <linkGrp> tags to create distinct subsections of English and Welsh paragraphs, labelling those subsections with matching ids that were then listed in a link group at the end of the XML file (e.g., section e_1 would be paired with w_1, e_2 with w_2, and so on).
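
For readers less familiar with TEI, the pairing scheme described above can be sketched roughly as follows. This is a minimal illustration rather than an excerpt from the edition itself: the bracketed text is placeholder content, and the nesting of the <div> elements, the type values, and the xml:lang codes are assumptions made for the example. Only the e_1/w_1 naming pattern and the use of <div>, <seg>, and <linkGrp> come from the project.

```xml
<!-- Hypothetical fragment: placeholder text, assumed nesting and attribute values -->
<body>
  <div type="parallel">
    <!-- English paragraphs, each wrapped in a <seg> carrying an e_N identifier -->
    <div xml:lang="en">
      <p><seg xml:id="e_1">[first English paragraph]</seg></p>
      <p><seg xml:id="e_2">[second English paragraph]</seg></p>
    </div>
    <!-- Welsh paragraphs, each wrapped in a <seg> carrying the matching w_N identifier -->
    <div xml:lang="cy">
      <p><seg xml:id="w_1">[first Welsh paragraph]</seg></p>
      <p><seg xml:id="w_2">[second Welsh paragraph]</seg></p>
    </div>
  </div>

  <!-- Link group collected at the end: each <link> pairs one English segment with its Welsh parallel -->
  <linkGrp type="translation">
    <link target="#e_1 #w_1"/>
    <link target="#e_2 #w_2"/>
  </linkGrp>
</body>
```

Keeping the links in a single group, separate from the text, means the alignment can be revised without touching the transcription, and a segment with no counterpart simply has no <link> entry.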

The clear flaw with this method, made visible in the XML file, is that some English paragraphs do not have a clear Welsh parallel. In the physical text, this appears as English paragraphs occasionally sitting beside blank space where one would expect Welsh text. This is not due to damage rendering those Welsh sections illegible, though; I speculate that the English translation is longer because the translator took many creative liberties, so that the equivalence between the two texts is not one-to-one but many-to-one, with several English paragraphs sometimes answering to a single Welsh one. The paragraphs that lack a direct Welsh parallel are e_9, e_15, e_23, e_25, e_31, and e_34. These absences speak to an intriguing question implicit in much of my work with the Register: how do we study a work that professes to champion one language, while primarily being written in another? As previously mentioned, the translator seems to have taken a very liberal approach, which makes the text problematic as a piece of reference material for historic Welsh. At the same time, this is what made the Register such an intriguing text to work with; in balancing these conflicting elements during the encoding process, we can gain a greater appreciation for the place of Welsh in the late eighteenth century.

Overall, digitising The Cambrian Register and working with CLARIN has been an incredibly insightful experience. While I have only been able to digitise a portion of the text due to the time constraints of the placement, I would strongly encourage others to consider delving into the Register, as its many eccentricities could prove a fruitful area of study. Working with XML encoding, and dealing directly with questions around how we make a digital edition, has been equally stimulating. While we may find encoding a thankless, pedantic task, it can form the backbone of literary and linguistic research. So next time you’re poring over language data or trawling through CLARIN’s database, take a moment to think of the sleep-deprived encoder who made your research possible!