#LancsBox X

On Friday 24 February 2023, the ESRC Centre for Corpus Approaches to Social Science (CASS), together with CLARIN-UK, organised an online workshop featuring the newly released #LancsBox X. More than 1,300 participants attended the event. Many users are already familiar with previous versions of #LancsBox (see Figure 1), which has been used as a research tool in over 1,000 academic publications. #LancsBox is also a primary corpus analysis tool introduced in the corpus linguistics MOOC ‘Corpus Linguistics: Method, Analysis, Interpretation’, a free online course that has provided training to over 72,000 participants in the last ten years. So, what’s new in #LancsBox X, and should you be interested?

Screenshots of Lancsbox windows from 2015, 2020 and 2023

Figure 1. Brief history of #LancsBox

#LancsBox X is in many ways a game changer. Built on a completely new architecture, with a Lucene index in the background and a simplified, flexible user interface (UI, see Figure 2), #LancsBox X can efficiently process and analyse corpora of millions or even billions of words. It natively supports XML (see Figure 3) and can also load data in other formats (txt, docx, pdf etc.). With increasing demands in corpus linguistics on the complexity of data, including different levels of annotation, and on more sophisticated statistical analysis, corpus tools such as #LancsBox X need to meet a range of user expectations. #LancsBox X responds to these challenges by offering flexibility in terms of the data and its size as well as the statistical analyses. #LancsBox X incorporates the statistical package R and allows users to run customisable R scripts. This feature is currently available for the association measures (AMs) used to identify collocations, and future releases of the tool will build on it to support automated statistical analyses.

Screenshot of lancsbox on a mac

Figure 2. #LancsBox X on a Mac

Let us focus on the key functionalities of #LancsBox X, which offers tools to create concordances, wordlists and collocation analyses. At the top, #LancsBox X shows a powerful search bar, where users can type simple words (research), phrases (I don't know), smart searches (NOUN PASSIVE ADVERB) or complex CQL queries ([word="cat"] [pos="V.*"] [sem="N.*"]). Each of these is automatically recognised as a different category of search and treated appropriately; #LancsBox X also highlights query syntax, helping users to formulate valid queries. So no matter how simple or complex a query is, and regardless of the user's experience with formulating queries, #LancsBox X can provide adequate support for a wide range of research questions or casual queries. It is thus suitable for both research and classroom purposes.
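To give a feel for what a CQL query such as [word="cat"] [pos="V.*"] expresses, here is a minimal Python sketch. It is an illustration only, not how #LancsBox X evaluates queries: the functions `parse_cql` and `find_matches`, the token layout and the tag values are all assumptions made for this example, and only the simplest `[attribute="regex"]` form is handled.

```python
import re

def parse_cql(query):
    """Turn '[word="cat"] [pos="V.*"]' into a list of (attribute, regex) pairs.

    Hypothetical helper: supports only the basic [attribute="regex"] form.
    """
    pairs = re.findall(r'\[(\w+)="([^"]*)"\]', query)
    return [(attr, re.compile(pattern)) for attr, pattern in pairs]

def find_matches(tokens, query):
    """Return start positions where consecutive tokens satisfy each constraint in order."""
    constraints = parse_cql(query)
    if not constraints:
        return []
    n = len(constraints)
    hits = []
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        # Each constraint must match the whole attribute value of its token.
        if all(rx.fullmatch(tok.get(attr, ""))
               for (attr, rx), tok in zip(constraints, window)):
            hits.append(start)
    return hits

# A tiny hand-annotated sentence (the tags are invented for the example).
tokens = [
    {"word": "the", "pos": "DET"},
    {"word": "cat", "pos": "NN"},
    {"word": "sat", "pos": "VBD"},
    {"word": "down", "pos": "RP"},
]

print(find_matches(tokens, '[word="cat"] [pos="V.*"]'))
```

Each bracket constrains one token position, and the regular expression inside the quotes is tested against the named annotation layer. This is why richer annotation makes richer queries possible.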

Image of lots of XML text with annotations

Figure 3. An example of an XML corpus file with annotation, which #LancsBox X can process
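As a rough illustration of what token-level XML annotation looks like to a program, the following Python sketch reads a small annotated snippet with the standard library. The element name `w` and the attributes `pos` and `lemma` are assumptions chosen for the example. They are not a description of the format shown in Figure 3 or of any schema required by #LancsBox X.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated corpus fragment: one sentence, one <w> element per token.
SAMPLE = """<text id="demo">
  <s><w pos="DET" lemma="the">The</w> <w pos="NOUN" lemma="cat">cats</w> <w pos="VERB" lemma="sit">sat</w></s>
</text>"""

def read_tokens(xml_string):
    """Collect every <w> element as a dict of its surface form and annotations."""
    root = ET.fromstring(xml_string)
    return [
        {"word": w.text, "pos": w.get("pos"), "lemma": w.get("lemma")}
        for w in root.iter("w")
    ]

tokens = read_tokens(SAMPLE)
for token in tokens:
    print(token["word"], token["pos"], token["lemma"])
```

Once each token carries its annotation layers in this way, searches can target the word form, the part of speech or the lemma independently.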

The UI offers users the option to have multiple analyses open at the same time in separate windows, which can be resized, maximised and re-positioned as required. If multiple windows are selected, #LancsBox X will search in all of these at once.

#LancsBox X allows users to analyse and visualise frequency and word association data. Information about word frequencies and distributions is available in the Words tool, while word associations (collocations) are analysed using the GraphColl tool (Figure 4). GraphColl produces dynamic graphs, which display rich information about collocational relationships. Each graph property, such as the edge length or the size, colour and position of a data point (collocate), can be assigned a numerical value based on frequency, distribution or one or more of the available association measures (AMs). Instead of computing only one AM, #LancsBox X always offers users a choice of 13 standard AMs, which are all computed at once. By default, Log Dice is displayed as the primary AM in the graph, but users can easily switch between AMs without waiting for the results to be recomputed. This makes collocation analysis much more flexible and transparent.

Screenshot of a visualization with the GraphColl tool

Figure 4. GraphColl tool for the analysis and visualization of collocations
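To make the idea of an association measure more concrete, here is a small Python sketch of two widely used AMs: Log Dice, the default mentioned above, and a basic form of mutual information (MI). The function names and the frequencies are invented for illustration. The MI version shown is the simplest textbook form, with no correction for the size of the collocation window, so the exact equations used in #LancsBox X may differ.

```python
import math

def log_dice(f_xy, f_x, f_y):
    """Log Dice: 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy: co-occurrence frequency of node and collocate
    f_x:  corpus frequency of the node
    f_y:  corpus frequency of the collocate
    """
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

def mutual_information(f_xy, f_x, f_y, corpus_size):
    """Basic MI: log2(observed / expected), with expected = f_x * f_y / corpus_size.

    Simplified sketch; window-size corrections are deliberately left out.
    """
    expected = f_x * f_y / corpus_size
    return math.log2(f_xy / expected)

# Invented frequencies for a node/collocate pair in a one-million-word corpus.
f_xy, f_x, f_y, corpus_size = 50, 1000, 200, 1_000_000

print("Log Dice:", round(log_dice(f_xy, f_x, f_y), 2))
print("MI:", round(mutual_information(f_xy, f_x, f_y, corpus_size), 2))
```

Log Dice does not depend on corpus size and has a theoretical maximum of 14, reached when the two words occur only together. MI compares the observed co-occurrence with what chance would predict, and so tends to favour rare, exclusive combinations. Seeing several AMs side by side, as GraphColl allows, helps reveal such differences.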

More details about #LancsBox X features can be found in the detailed manual. The manual is currently also available in Japanese, and versions in more languages are in preparation.

So where to start? If you are interested in exploring #LancsBox X, your first port of call will be the new website with download links, video tutorials and a manual. If you would like to get in touch or find instant help in the frequently asked questions (FAQ), you can use the helpdesk on the #LancsBox website. #LancsBox X is available for all major operating systems (Windows, Mac and Linux) and can be used to analyse data in any language (UTF-8 encoded).