Working with NLP and Holocaust Testimonies

CLARIN Student Placement

As part of the first-ever MSc Digital Scholarship programme at the University of Oxford, students were asked to complete a practicum placement with an ongoing Digital Humanities project at the University. This post is the report of my placement with CLARIN.

Caitlin Wilson presenting her work at the CLARIN-EHRI workshop

With Martin Wynne as my supervisor, we decided that the main goal of the placement should be to produce a multilingual corpus of oral testimonies of the Holocaust to which corpus linguistics research methods could be applied. A beta version of the project was presented at the EHRI-CLARIN workshop in London in May. The workshop brought together researchers from a variety of fields, spanning historical, archival, computational, and linguistic studies, to establish a reproducible workflow for testimony analysis and to outline the steps needed to go from a collection of taped interviews with Holocaust survivors and witnesses to a corpus of written transcripts that could be analysed using distant reading methods.

Our project started with around 100 testimonies of Holocaust survivors that were kindly shared with CLARIN by the United States Holocaust Memorial Museum. The first task was to establish which of these testimonies were accompanied by complete transcripts, as the project would focus on text only and would not undertake manual transcription or make use of automatic speech-to-text technology.

United States Holocaust Memorial Museum logo

Following inspection of the files, and after filtering out those which did not include an easily identifiable and usable interview transcript, around 50 transcripts remained in five languages: English, Russian, Polish, Czech, and Hungarian. These could then move on to the next stage of preparation: cleaning. The transcript files, originally in JSON format, were cleaned to remove any unnecessary text or metadata (USHMM had occasionally included time stamps and rights and restriction notices at various intervals in the text). Cleaning was done with a combination of XSLT scripts and global ‘find and replace’ functions. The output files contained only the text ID and the transcript of the interview, structured using sentence and utterance boundaries. All other metadata was extracted and stored in separate files.
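To make this stage concrete, here is a minimal Python sketch of the kind of cleaning applied to each file. The JSON field names and the time-stamp pattern are hypothetical simplifications of the actual USHMM structure, and the real pipeline used XSLT scripts and find-and-replace rather than this exact code.

```python
import json
import re
from pathlib import Path

# Minimal sketch of the cleaning step, assuming a simplified transcript schema.
# The field names ("id", "text") and the time-stamp pattern are hypothetical;
# the actual files were cleaned with XSLT scripts and global find-and-replace.
TIMESTAMP = re.compile(r"\[\d{1,2}:\d{2}(?::\d{2})?\]")  # e.g. [01:23:45]

def clean_transcript(path: Path, text_dir: Path, meta_dir: Path) -> None:
    record = json.loads(path.read_text(encoding="utf-8"))
    text = TIMESTAMP.sub("", record.get("text", ""))  # drop inline time stamps
    # Keep only the text ID and the transcript; store everything else as metadata.
    (text_dir / f"{record['id']}.txt").write_text(text, encoding="utf-8")
    meta = {k: v for k, v in record.items() if k != "text"}
    (meta_dir / f"{record['id']}.json").write_text(
        json.dumps(meta, ensure_ascii=False, indent=2), encoding="utf-8")

text_dir, meta_dir = Path("clean_text"), Path("metadata")
text_dir.mkdir(exist_ok=True)
meta_dir.mkdir(exist_ok=True)
for path in Path("transcripts_json").glob("*.json"):
    clean_transcript(path, text_dir, meta_dir)
```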

The next step involved parsing the files with various tagging tools so that syntactic information could be added for each word in the transcript. The tools used were NLTK, spaCy, TreeTagger, RNNTagger, and Stanza, implemented either via command-line scripts or through Python packages. The output of this process was a set of files in VRT format (one word per line), with each word accompanied by its part-of-speech tag and lemma. Non-English files were also translated using DeepL, and the translations parsed with the same tools. The files were then uploaded to CQPWeb (translated files were aligned with the originals) in sub-corpora organised by language.
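As an illustration of the tagging step, the sketch below shows what the English branch could look like with spaCy, writing one token per line with its part-of-speech tag and lemma in VRT style. The actual pipeline mixed several taggers depending on the language, so this is an approximation rather than the code that was used.

```python
import spacy

# Sketch of VRT output (one token per line: word, POS tag, lemma) using spaCy
# for English; other languages were handled with tools such as TreeTagger,
# RNNTagger, or Stanza.
nlp = spacy.load("en_core_web_sm")

def to_vrt(text_id: str, text: str) -> str:
    lines = [f'<text id="{text_id}">']
    for sent in nlp(text).sents:
        lines.append("<s>")
        for tok in sent:
            if not tok.is_space:
                lines.append(f"{tok.text}\t{tok.tag_}\t{tok.lemma_}")
        lines.append("</s>")
    lines.append("</text>")
    return "\n".join(lines)

print(to_vrt("ushmm_0001", "We were taken to the ghetto. My mother hid us."))
```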


Concordance from the Polish-English aligned corpus of testimonies

The resultant corpora uploaded to CQPWeb allowed us to demonstrate at the workshop how Holocaust research can utilise corpus linguistics tools to perform searches across a large number of testimonies at once. Whether researching individual words and the contexts in which they are spoken, or investigating the type of language that survivors use when discussing traumatic events, corpus linguistics can allow researchers to gain a better understanding of overall themes and trends in testimonies. This in turn allows users to zoom in on individual texts and perform a more informed close reading.
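To give a flavour of such searches, here are two illustrative queries in CQP syntax, the query language behind CQPWeb (the attribute names assume the word, lemma, and part-of-speech annotation described above):

```
[lemma="remember"]
[lemma="hide"] []{0,3} [lemma="child"]
```

The first returns every inflected form of ‘remember’ in context; the second finds forms of ‘hide’ occurring within three tokens of a form of ‘child’.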

Further discussion has since led us to identify areas of improvement for the corpus, notably the need for a parser that produces part-of-speech tags that are identical across all languages. Adopting the Universal Dependencies guidelines was deemed the most appropriate course of action, as it would allow all texts in the corpus to be searchable by syntactic category at once. Other options, including semantic tagging and named entity recognition, were also considered as possible ways of enriching the corpus. Lastly, far more than 50 testimonies are needed to allow researchers to gain a real understanding of the full breadth of realities experienced by those who lived under the Third Reich.
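As a sketch of what a universal tag set could look like in practice, the snippet below uses Stanza to tag a Polish sentence with Universal Dependencies UPOS tags; the same pipeline call works for any language with a UD model, which is what would make a single cross-language query possible. The example sentence is invented for illustration.

```python
import stanza

# Sketch: Universal Dependencies UPOS tagging with Stanza. The UPOS values
# (NOUN, VERB, ADP, ...) are identical across languages, so one query could
# cover the whole multilingual corpus.
stanza.download("pl", verbose=False)
nlp = stanza.Pipeline(lang="pl", processors="tokenize,pos,lemma", verbose=False)

doc = nlp("Ukrywaliśmy się w piwnicy przez wiele miesięcy.")
for sent in doc.sentences:
    for word in sent.words:
        print(f"{word.text}\t{word.upos}\t{word.lemma}")
```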

My time working with CLARIN has taught me a lot about the intricacies and complicated politics of research infrastructures. The workshop in particular shone a light on the importance of interdisciplinary research. A linguist or a historian alone could not have produced the outcomes of this project. Rather, close collaboration, discussion, and sharing of ideas and data allowed us to envision a new way in which to approach Holocaust and oral history studies.

Working with CLARIN has also taught me that language data and tools are not limited to linguists: researchers from many fields across the Humanities and beyond can make use of this type of data to enrich their research. While Holocaust historians may shy away from corpus work and computational tools for fear of losing the individual voices of their subjects, I do believe that implementing distant reading methods and gathering quantitative data can help researchers improve their understanding of individual cases and study each testimony in a more holistic way. The overall outcome of the placement was highly positive: the pilot project was well received at the workshop, and the hope is that with time, more feedback, and more data, an improved version of the corpus can be uploaded to CQPWeb.