Infrastructure for Digital Language Resources and Tools
Open access and open source tools for corpus linguistics
Wmatrix version 7 and PyMUSAS
Sunday 29 June
University of Birmingham, Edgbaston
Paul Rayson, Daisy Lal, John Vidler
An afternoon pre-conference workshop at the Corpus Linguistics 2025 conference.
This half-day (3-hour) workshop will provide a practical, hands-on tutorial on the new version of the web-based Wmatrix corpus analysis and comparison software (https://ucrel.lancs.ac.uk/wmatrix/). Version 7 of Wmatrix is now open access for academic researchers and incorporates PyMUSAS, the open source Python (Apache Licence 2.0) version of the multilingual UCREL Semantic Analysis System, which automatically assigns semantic fields to words and multiword expressions in corpora. Via PyMUSAS, Wmatrix7 supports 8 languages (https://pypi.org/project/pymusas/) and facilitates the extension of the key semantic domains method (Rayson, 2008) to those languages. Wmatrix7 represents the most significant update to the online software since the first version was presented at the ICAME 2001 conference (Louvain-la-Neuve, Belgium), and it is now free to use. It has a completely new indexing system, implemented in the open source SQLite database, that allows corpora of tens of millions of words to be indexed. The semantic lexicons used in PyMUSAS are also now freely available under a Creative Commons CC BY-NC-SA 4.0 licence (https://github.com/UCREL/Multilingual-USAS). Open access and open source tools are vital for the replicability and reproducibility of future corpus linguistics studies. They also support the explainability of annotation and analysis methods in corpus linguistics and NLP software, especially in light of the rapid uptake of new generative AI methods and large language models (LLMs), some of which are not open source or do not declare their training materials. Open tools also facilitate the exchange of methods and techniques, enabling further developments to be built on existing groundwork, as has been done in the Australian Text Analytics Platform (Jufri & Sun, 2022), which builds on PyMUSAS.
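The key semantic domains method compares how often each semantic tag occurs in a study corpus against a reference corpus. The Python sketch below is a minimal illustration of that idea and should not be read as Wmatrix's own implementation: the table schema, the function names and the hand-tagged toy tokens are all hypothetical, and it assumes the log-likelihood statistic that is commonly used for keyness comparisons. It stores tagged tokens in an in-memory SQLite database, indexes them by tag, and ranks the tags by log-likelihood.

```python
import math
import sqlite3

# Toy data (hypothetical): (corpus, word, semantic tag), tagged by hand for illustration.
TOY_TOKENS = [
    ("study", "money", "I1"), ("study", "bank", "I1"), ("study", "tax", "I1"),
    ("study", "the", "Z5"), ("study", "of", "Z5"), ("study", "happy", "E4.1+"),
    ("reference", "the", "Z5"), ("reference", "of", "Z5"), ("reference", "a", "Z5"),
    ("reference", "happy", "E4.1+"), ("reference", "glad", "E4.1+"),
    ("reference", "money", "I1"),
]


def build_index(tagged_tokens):
    """Load tagged tokens into SQLite and index them by corpus and tag."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tokens (corpus TEXT, word TEXT, semtag TEXT)")
    conn.executemany("INSERT INTO tokens VALUES (?, ?, ?)", tagged_tokens)
    conn.execute("CREATE INDEX idx_tokens_semtag ON tokens (corpus, semtag)")
    conn.commit()
    return conn


def log_likelihood(a, b, n1, n2):
    """Log-likelihood for frequency a in a corpus of n1 tokens vs b in n2 tokens."""
    expected_1 = n1 * (a + b) / (n1 + n2)
    expected_2 = n2 * (a + b) / (n1 + n2)
    total = 0.0
    for observed, expected in ((a, expected_1), (b, expected_2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            total += observed * math.log(observed / expected)
    return 2 * total


def key_domains(conn, study, reference):
    """Rank semantic tags by log-likelihood, highest first."""
    sizes = dict(conn.execute("SELECT corpus, COUNT(*) FROM tokens GROUP BY corpus"))
    counts = {}
    query = "SELECT corpus, semtag, COUNT(*) FROM tokens GROUP BY corpus, semtag"
    for corpus, tag, freq in conn.execute(query):
        counts.setdefault(tag, {})[corpus] = freq
    n1, n2 = sizes[study], sizes[reference]
    rows = []
    for tag, by_corpus in counts.items():
        a, b = by_corpus.get(study, 0), by_corpus.get(reference, 0)
        overused = a / n1 > b / n2  # relatively more frequent in the study corpus
        rows.append((tag, a, b, log_likelihood(a, b, n1, n2), overused))
    return sorted(rows, key=lambda row: row[3], reverse=True)


conn = build_index(TOY_TOKENS)
ranking = key_domains(conn, "study", "reference")
for tag, a, b, ll, overused in ranking:
    print(f"{tag:6} study={a} reference={b} LL={ll:.2f} {'+' if overused else '-'}")
```

With corpora this small the scores are far too low to be meaningful; the sketch only shows the shape of the calculation. In practice the tags would come from an automatic tagger such as PyMUSAS rather than being assigned by hand.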
New and ongoing developments and features will also be highlighted, including the future integration with large-scale parallel processing using the UCREL-hex facility at Lancaster, a hybrid multiprocessor system that includes shared GPUs (https://www.lancaster.ac.uk/scc/research/research-facilities/hex/). Facilities like hex have been used to greatly speed up the annotation of very large corpora, e.g. cutting the time needed to annotate the 1.2 billion words of the ParlaMint II corpus of comparable parliamentary data from across Europe (Erjavec et al., 2024) from 18 days to around 7 hours. We will also describe further development of the English, Spanish, Dutch and Danish PyMUSAS taggers and lexicons as part of the 4D Picture project (https://4dpicture.eu/).
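To put the reported ParlaMint II timings in perspective, the short Python calculation below converts them into a speed-up factor and an approximate throughput. It is a back-of-the-envelope sketch: the function names are hypothetical, the inputs are the rounded figures quoted above (1.2 billion words, 18 days, around 7 hours), and it assumes the whole corpus was processed in both runs.

```python
CORPUS_WORDS = 1_200_000_000  # ParlaMint II size as quoted above
BEFORE_HOURS = 18 * 24        # 18 days expressed in hours
AFTER_HOURS = 7               # "around 7 hours", so results are approximate


def speedup(before_hours, after_hours):
    """How many times faster the second run is than the first."""
    return before_hours / after_hours


def words_per_second(words, hours):
    """Average annotation throughput over the whole run."""
    return words / (hours * 3600)


factor = speedup(BEFORE_HOURS, AFTER_HOURS)
print(f"speed-up: roughly {factor:.0f}x")
print(f"before: roughly {words_per_second(CORPUS_WORDS, BEFORE_HOURS):,.0f} words/s")
print(f"after:  roughly {words_per_second(CORPUS_WORDS, AFTER_HOURS):,.0f} words/s")
```

On these rounded figures the reduction corresponds to a speed-up of about 62 times, from under 800 words per second to well over 47,000.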