Grand Challenges

Based on part of a submission made in April 2019 to a consultation on requirements in the arts and humanities for the UK research infrastructure. The following is an attempt to explain why it is useful to make large-scale computing facilities available for corpus linguistic research.

Update March 2022: the text below was included in the UKRI science case for UK supercomputing in 2020.

1. Data-driven linguistics: Over the past fifty years corpus linguistics has led the way in developing data-driven methods in the humanities and social sciences. This has not only enabled researchers to test assertions about language against real evidence of usage; it has also opened up new types of research question involving larger scales of data and longer time-frames, stimulated new theories of grammar, and revolutionized numerous areas of applied linguistics, such as language learning, translation studies, and lexicography, as well as making a major contribution to commercial activities involving language processing. Language continues to evolve and change as societies change, and data-driven research into trends in language usage is therefore a constant, ongoing challenge.

2. Text mining: Large datasets of text and speech are not just for linguists but for everyone, although extracting reliable information from texts relies on linguistic knowledge, methods, tools and datasets. Mining texts for information is an increasingly important process in almost all disciplines, and in many areas of business and public life. Corpus linguistics is therefore a key enabler as well as a valuable area of research in its own right, offering the knowledge and tools necessary for sophisticated search based on linguistic knowledge, and methods for interpreting the results.

3. Exploiting historical text collections: If the outputs of the mass digitization projects currently underway were consistently made available not just as page images in digital libraries, but as full-text datasets accompanied by processing power and interfaces for exploration and analysis, then many more researchers could participate in new forms of digital research. A student with a desktop computer can now access more texts than, only a few years ago, a senior researcher could have tracked down in a lifetime, and there is huge potential for democratizing research and applying the wisdom of crowds to the understanding of history.

4. Freeing the speech archives: Historical records of speech are sparse, but hugely important as holders of information about language use, accents and dialects, and of cultural content. Recordings of speech in oral history projects, broadcast media, home recordings and business archives contain information about people's lives and experiences and can provide unique insights, but they are mostly held on analogue media, and very few are available digitally to researchers. Computing facilities to digitize, store, transcribe, annotate, make available and preserve these archives could create a renaissance in oral history and transform our understanding of the recent past.

5. ‘Social climate change’: Nowadays digital text and speech are being produced everywhere, potentially searchable, downloadable and analysable in real time. New facilities and instruments are necessary to handle this data deluge, which presents opportunities not only to understand language change in real time, but to understand how society might be changing, by capturing information about the ‘social climate’ through discourse, in the same way that millions of digital sensors allow us to track the weather and understand it better.

Computing requirements
In order to address the above challenges effectively, there are specific requirements for the following: 
    1. Long-term data storage, since virtually all original data is of high value and should be preserved in perpetuity. Recordings of speech or writing are not reproducible datasets; they are unique records of human behaviour and part of our cultural record.
    2. Secure access and authorization in a wide domain of trust, involving millions of users and commercial data providers, since many of the most valuable datasets for research are also commercially valuable and not freely available. To exploit the possibilities of text mining more widely and thoroughly, the domain of trust for secure access to protected resources needs to include data producers and owners, such as publishers and social media enterprises, allowing a much wider range of datasets to be made available for data mining and other research purposes, with users numbering in the millions worldwide. Access to data and interfaces in the humanities and social sciences cannot be restricted to a core team of more or less fixed size or duration, as might happen in the experimental sciences; it is likely to be of interest to large and growing numbers of researchers, teachers, other professionals, students and members of the public. This requires access and authorization mechanisms which are sustainable in the long term and scalable to millions of data items and users.