Tools for Digitising, Encoding, and Publishing Texts
Lessons from the Oxford Text Archive’s Pilot Training Programme
At the Oxford Text Archive (OTA), we work extensively with tools for digitising, processing and analysing texts. Consequently, it only made sense when we were designing our Winter Pilot Training Programme to focus on some of the leading tools currently available. The following is an overview of the two tools we focused on, and a summary of their strengths, features, and uses, which will hopefully help you design your methods for your next research project.
Create a Digital Edition Using LEAF Commons
Our first workshop, led by James Cummings (Newcastle University), explored LEAF Commons – a suite of independent but interoperable tools for creating, encoding, and publishing cultural and scholarly materials.
We focused on LEAF-Writer, a web-based semantic text editor that enables users to view their XML documents simultaneously as human-readable text and as XML with complex annotation. This helps users see what the text is and how it relates to the annotation, making it ideal for teaching those new to XML, who may find its syntax intimidating.

Screenshot of the LEAF-Writer user interface from the Dante exercise set during the Create a Digital Edition Using LEAF Commons workshop.
It also provides easy ways to add XML annotations: when a new element is added, it launches a wizard that guides users through selecting allowed attributes and explains their significance. Again, this is useful for the uninitiated – more advanced users will probably find it more useful to edit the XML directly. This is also doable in LEAF-Writer, though the experience is not as smooth as editing in Oxygen or even VS Code.
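To make the two views concrete, here is a minimal sketch in Python of the same idea: one TEI fragment read once as plain text and once as a list of annotations. The fragment, the `ref` URI, and the function names are invented for illustration; this is not how LEAF-Writer itself is implemented.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# A made-up TEI fragment (not taken from the workshop materials).
# The ref URI is a placeholder, not a real authority record.
SAMPLE = (
    f'<p xmlns="{TEI_NS}">'
    '<persName ref="https://example.org/entity/dante">Dante</persName>'
    ' was born in <placeName>Florence</placeName>.'
    '</p>'
)


def reading_text(xml_string):
    """Return the human-readable text with all markup stripped."""
    root = ET.fromstring(xml_string)
    return "".join(root.itertext())


def annotations(xml_string):
    """Return (element name, tagged text, attributes) for each child element."""
    root = ET.fromstring(xml_string)
    found = []
    for element in root.iter():
        if element is root:
            continue
        # Drop the "{namespace}" prefix that ElementTree adds to tag names.
        name = element.tag.split("}", 1)[-1]
        found.append((name, element.text, dict(element.attrib)))
    return found


print(reading_text(SAMPLE))  # Dante was born in Florence.
for name, text, attributes in annotations(SAMPLE):
    print(name, text, attributes)
```

Seeing both outputs side by side is roughly what the split view offers: the sentence a reader sees, and the elements and attributes an encoder has added to it.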
Getting Started with Transkribus
Our second workshop focused on Transkribus – a helpful tool that can automatically transcribe a range of documents – including handwritten, historical documents that pose extra challenges. This event was led by Joe Nockels (University of Sheffield), an expert in Automatic Text Recognition (ATR).
Transkribus trains and runs AI models to recognise text. Unlike Optical Character Recognition (OCR), which identifies specific characters, Transkribus recognises line structures and is trained on specific languages to recognise whole words and phrases. Users can choose from public models tailored to different languages and historical periods, train their own models, or refine existing ones.
Transkribus also recognises aspects of page layouts including marginalia, borders or different paragraph structures through its layout recognition models and has newly developed table recognition models as well.

From Joe Nockels’s slides for the Getting Started with Transkribus workshop.
The platform includes an online interface where you can easily manually transcribe text – ostensibly for producing training data for a model – but it is also useful for editing automatic transcriptions. Indeed, the most effective method of implementing ATR – according to Joe – is to run a model that captures the text near enough and then manually edit the output. This is an approach that has only recently been made possible with the array of public models now available.
In this interface, users can view images of their text and type a transcription alongside, define aspects of the page’s layout, such as lines and regions, and tag specific areas of the text, like catchwords.
A limitation of Transkribus is its cost. Running models requires credits, purchased via subscriptions. However, free accounts offer fifty monthly credits, which is roughly enough for recognising fifty pages of handwriting, or training your own model. Larger projects may require a paid plan, which also grants access to more advanced AI tools.
Which tool should you use?
While each of these tools was created for a different purpose – LEAF-Writer for XML editing and encoding, Transkribus for ATR – these tools share some functionalities. So, which tool should you use for which task?
Manual NER
Transkribus and LEAF-Writer both provide the option to tag named entities within a text, but in different ways. Named Entity Recognition (NER) is one of the real strengths of LEAF-Writer, as it provides a friendly user interface for the manual identification of people, places, organisations, works, and things with reference to authorities such as Wikidata, VIAF, and DBpedia. This will improve in the near future with the incorporation of NERVE (Named Entity Relationship and Vetting Environment). LEAF-Writer generates RDF JSON-LD for these entities and embeds them into the XML document, integrating the text into a Linked Open Data (LOD) framework. Transkribus enables tagging for custom categories, but does not currently allow for RDF annotations. However, the Transkribus team are in the process of developing an automated NER model – so watch this space!
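For readers unfamiliar with JSON-LD, the sketch below shows the general shape a linked-data annotation for a tagged entity could take. It borrows the W3C Web Annotation vocabulary as an assumption; the exact structure LEAF-Writer emits may well differ, and the function name, entity type label, and URIs here are all hypothetical.

```python
import json


def entity_annotation(entity_type, authority_uri, document_uri, xpath):
    """Build an illustrative JSON-LD annotation linking a text span to an authority.

    Hypothetical helper: the field layout follows the W3C Web Annotation
    Data Model, not a documented LEAF-Writer format.
    """
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "motivation": "identifying",
        # The body points at the authority record for the entity.
        "body": {"id": authority_uri, "type": entity_type},
        # The target points at the element in the XML document being tagged.
        "target": {
            "source": document_uri,
            "selector": {"type": "XPathSelector", "value": xpath},
        },
    }


annotation = entity_annotation(
    entity_type="Person",  # placeholder type label
    authority_uri="https://example.org/authority/dante",  # placeholder URI
    document_uri="https://example.org/editions/sample.xml",  # placeholder URI
    xpath="/TEI/text/body/p[1]/persName[1]",
)
print(json.dumps(annotation, indent=2))
```

The point of the structure is that the entity is identified by a URI rather than a string, which is what allows the same person or place to be matched across editions and datasets.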

Text export
Both of these tools offer handy export options, but at different points in a digital text workflow. Transkribus’s export is ideal for moving a transcribed text into an XML editor like LEAF-Writer. LEAF-Writer’s export is more for transforming an XML document into a web publication. Transkribus exports to multiple formats, including TEI XML – though James Cummings recommends exporting in the PAGE XML format and then using Dario Kampkaspar's page2tei XSLT conversion instead. The LEAF Turning Engine, which is integrated into LEAF-Writer, can currently import plain text and TEI XML and export HTML, XML, and Markdown, but more imports/exports are planned.
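If you want to inspect a PAGE XML export before converting it, a few lines of Python are enough to list the transcribed lines. This is only a sketch for a quick look at the data, not a substitute for page2tei: the sample document is invented and heavily simplified (real exports also carry coordinates and metadata), and the namespace shown is the 2013-07-15 PAGE schema, which is an assumption about your export and may need adjusting.

```python
import xml.etree.ElementTree as ET

# Assumed namespace: the 2013-07-15 PAGE content schema.
PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# Invented, simplified sample; real exports include Coords, Metadata, etc.
SAMPLE_PAGE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page1.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <TextLine id="r1l1">
        <TextEquiv><Unicode>Nel mezzo del cammin</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <TextEquiv><Unicode>di nostra vita</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def page_lines(page_xml):
    """Return (line id, transcribed text) pairs in document order."""
    root = ET.fromstring(page_xml)
    lines = []
    for line in root.iterfind(".//pc:TextLine", PAGE_NS):
        unicode_el = line.find("pc:TextEquiv/pc:Unicode", PAGE_NS)
        # A line with no transcription yet yields an empty string.
        text = unicode_el.text or "" if unicode_el is not None else ""
        lines.append((line.get("id"), text))
    return lines


for line_id, text in page_lines(SAMPLE_PAGE):
    print(line_id, text)
```

Because PAGE XML keeps each line tied to its region and identifier, it preserves layout information that a direct plain-text export would discard.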
Digital edition publication
These digital tools are not only useful for digitising and annotating texts, but for publishing them as well. Their interfaces, which are designed to enable easy transcription and encoding, are also ideal for reading, especially reading across different manifestations of text. LEAF-Writer in particular can act as a portable user interface for XML documents published on repositories like GitHub.

From James Cummings’s slides for the Create a Digital Edition Using LEAF Commons workshop.
However, both platforms also offer dedicated tools for publishing digital texts. Transkribus has Transkribus Sites, which creates individual mini websites for transcription projects that allow users to read and search through images and transcriptions. Since Transkribus Sites uses the Transkribus API, it is also very easy for users to update and correct their content using the transcription interface. No coding is required to set up a site – though it does require a paid subscription.
LEAF Commons has the Dynamic Table of Contexts (DToC) tool, which combines traditional features of printed texts – such as the table of contents and index – with digital features like full-text search and tagging. This enables close reading, distant reading, and hypertextual reading, acting in effect like a textual observatory. Unlike Transkribus Sites, this tool is free to use; however, it does require some setup and formatting of textual materials to work.
Can you feel the LEAF? Are you lost in Transkribus?
Both Transkribus and LEAF Commons tools are valuable for digitising, encoding, and publishing texts, with unique strengths. LEAF-Writer provides a user-friendly, free XML editor, as well as an out-of-the-box, easy-to-implement user interface for digital editions – both of which are badly needed in a market dominated by proprietary software and difficult-to-implement open-access solutions. Transkribus, likewise, makes technical processes like ATR more accessible to the humanities community.
As is ever the motto in digital humanities, the tool you use depends on your research question and practical constraints. For the Oxford Text Archive, LEAF-Writer is best suited to our current needs, given that our deposits are already digitised, and many of them are encoded using TEI. In the future we may explore embedding LEAF's Dynamic Table of Contexts as an e-reader for texts deposited in the Oxford Text Archive, to provide both an archiving and publishing solution for digital editing projects.
However, both tools are to be recommended – not least for the robust scholarly communities behind them that continue to develop and support these resources. The LEAF Commons team have an exciting road map laid out, with plans to support project-specific authority look-ups, prosopographies, customised HTML output, and easier ways to create Linked Open Data (LOD) – amongst other features! Likewise, Transkribus continues to be expanded with developments like layout recognition models and a move to a completely online interface. Such efforts ensure that these tools will remain essential resources for digital humanities research.
You can view James Cummings's slides on LEAF Commons and Joe Nockels's slides on Transkribus via Zenodo.
Are you interested in future Oxford Text Archive training events and news? Sign up to our mailing list by emailing oxford-text-archive-subscribe@maillist.ox.ac.uk.
Do you have textual data that you would like to archive and share with the community? Consider depositing it at the Oxford Text Archive. Email megan.bushnell@ling-phil.ox.ac.uk to express interest.