Digital Resources for the Languages in Ireland and Britain

In September 2024, a new CLARIN knowledge centre, Digital Resources for the Languages in Ireland and Britain (DR-LIB), was launched to support researchers searching for resources on the languages of Britain and Ireland in all their varieties: native and non-native, contemporary and historic, standard and non-standard. DR-LIB is a virtual and distributed network that acts as a point of contact for all questions relating to digital resources and research on these languages.

One of DR-LIB’s first goals is to compile a list of the digital resources (such as corpora, lexicons, and language taggers) currently available for the study of the languages in Ireland and Britain, and to share these resources with CLARIN so that they adhere more closely to the FAIR principles, i.e., so that they become more findable, accessible, interoperable, and reusable. CLARIN, as a European consortium that provides access to language data and tools to support research, is the ideal organisation for this effort. It offers two infrastructures that can help: the CLARIN Resource Families, which are collections of known resources organised by type and language, and the CLARIN Virtual Language Observatory, which is an interface for searching across and within resources known to CLARIN.

Below is a list of the resources that we have found so far and confirmed to be active. Please email us if you would like us to add something to the list or if you find that something is no longer active. We will update this page regularly.

 

| Language | Name | Description |
| --- | --- | --- |
| Breton | An Drouizig | Tools for translation, spellcheckers, Breton keyboard, Breton fonts, Breton dictionaries. |
| Breton | Porched niverel ar brezhoneg | Breton language technology portal, promoting various digital tools and resources. |
| Cornish | BBC news in Cornish | |
| Cornish | Gerlyver Kernewek | Cornish dictionary. |
| Cornish | Korpus kernewek | Cornish corpus. |
| English | DANTE lexical database | Corpus-based description of the core vocabulary of English. |
| English, Welsh, etc. | PymUSAS | Python Multilingual UCREL Semantic Analysis System. |
| English, Irish, Welsh | Seamless Communication | Translation and speech-to-text (S2T) models. |
| Hiberno-English | CORVIZ: CORIECOR visualised | A publicly accessible, sustainable electronic correspondence corpus. |
| Irish | ABAIR | Project developing synthetic voices for Irish. |
| Irish | ainm.ie | The National Irish Language Biographical Database. |
| Irish | An Bunachar Náisiúnta Téarmaíochta don Ghaeilge | The National Terminology Database for Irish. |
| Irish | An Gramadóir | Open source grammar checking engine. |
| Irish | Bardic Poetry Database | |
| Irish, Manx, Scottish Gaelic | Cadhan Aonair | Private company that provides tools to the Irish-language community. Tools hosted on its site include An Gramadóir, Caighdeánaitheoir Gaeilge, Foclóir Gàidhlig-Gaeilge, Foclóir Manainnis-Gaeilge, GaelSpell, Historical Irish Corpus, Intergaelic, Líonra Séimeantach na Gaeilge, and the Cadhan Aonair UD treebank, amongst others. |
| Irish | Cadhan Aonair UD treebank | Treebank for Irish. |
| Irish | Caighdeánaitheoir Gaeilge | Irish Language Standardiser. |
| Irish | CODECS: Collaborative Online Database and e-Resources for Celtic Studies | Comprehensive database of sources of interest to Celtic studies. |
| Irish | Corpas Náisiúnta na Gaeilge | National Corpus of Irish. |
| Irish | DCU-NLP Research Group | ELECTRA- and BERT-based NLP models. |
| Irish | Digital Plan for the Irish Language | A roadmap for Irish-language technology developments 2023-2027. |
| Irish | dúchas.ie | National Folklore Collection UCD Digitisation Project. |
| Irish | eDIL - Electronic Dictionary of the Irish Language | Dictionary of Irish. |
| Irish | focloir.ie | English-Irish dictionary. |
| Irish, Scottish Gaelic | Foclóir Gàidhlig-Gaeilge | A bilingual dictionary between Irish and Scottish Gaelic. |
| Irish, Manx | Foclóir Manainnis-Gaeilge | A bilingual dictionary between Irish and Manx. |
| Irish | GaelSpell | Irish language spellchecker. |
| Irish | GAOIS | Gaois Research Group; contains numerous corpora and resources related to terminology, idioms, surnames, etc. |
| Irish | Gioraíonn BERT bóthar | Repository containing datasets and code for measuring progress in Irish language NLP. Includes datasets for author identification, bilingual lexicon induction, chunking, etc. |
| Irish, Manx, Scottish Gaelic | Grammatch | Code repository related to Universal Dependencies corpora for Irish, Manx, and Scottish Gaelic. |
| Irish | Historical Irish Corpus | Over 3000 texts published in Irish between 1600 and 1926. |
| Irish, Manx, Scottish Gaelic | Intergaelic | Dictionary and translation engine between Irish, Scottish Gaelic and Manx Gaelic. |
| Irish | Irish (Gaeilge) part-of-speech tagset | Tagset developed specifically for Irish. |
| Irish | Irish Script on Screen | Digital repository of Irish manuscripts. |
| Irish | Irish UD Treebank (IUDT) | A Universal Dependencies 4910-sentence treebank for modern Irish. |
| Irish | Líonra Séimeantach na Gaeilge | The Irish Language Semantic Network. |
| Irish | logainm.ie | Placenames Database of Ireland. |
| Irish | Ríomhacadamh | Group of translators and computer scientists creating Irish language versions of software. |
| Irish, Welsh | TALKBANK | Language development data. |
| Irish | teanglann.ie | Dictionary and language library. |
| Irish | téarma.ie | The National Terminology Database for Irish. |
| Irish, Scottish Gaelic | Tobar na Gaedhilge | A searchable textbase of 20th-century Gaelic texts (mostly Irish, with some Scottish), best described as ‘continuity Gaelic’. |
| Irish | UD Irish-IDT | A Universal Dependencies 4910-sentence treebank for modern Irish. |
| Manx | Cadhan Aonair UD treebank | Treebank for Manx Gaelic. |
| Manx | Gaelg Corpus Search | Online corpus and search. |
| Scottish Gaelic | ARCOSG | Annotated Reference Corpus of Scottish Gaelic. |
| Scottish Gaelic | Corpas na Gàidhlig | |
| Scottish Gaelic | Crùbadàn | An NLTK corpus reader for ngram files; supports several languages. |
| Scottish Gaelic | Dachaigh airson Stòras na Gàidhlig | Digital archive of Scottish Gaelic. |
| Scottish Gaelic | Faclair na Gàidhlig | A historical dictionary. |
| Scottish Gaelic | GLA | The Gaelic Linguistic Analyser. |
| Scottish Gaelic | NLS Matheson collection | Digitised collection. |
| Scottish Gaelic | Sabhal Mòr Ostaig | Digital library. |
| Scottish Gaelic | UD Scottish Gaelic ARCOSG | A treebank of Scottish Gaelic based on the Annotated Reference Corpus of Scottish Gaelic (ARCOSG). |
| Welsh | CorCenCC Corpus | National Corpus of Contemporary Welsh. |
| Welsh | CorCenCC Explore | National Corpus of Contemporary Welsh KWIC tool. |
| Welsh | cyfieithu.techiaith.cymru | Machine translation tool. |
| Welsh | CySemTagger | Welsh semantic tagger. |
| Welsh | Cysgliad | Software package that includes the Cysill Welsh-language grammar and spelling checker as well as the Cysgeir collection of dictionaries. |
| Welsh | Cysill Arlein | Welsh spellchecker. |
| Welsh | DigiGrid | Online collection of freely available digital resources designed to support the exploration, analysis, learning, and referencing of the Welsh language. |
| Welsh | Geirfan | Dictionary for adult learners of Welsh. |
| Welsh | GPC – Geiriadur Prifysgol Cymru | Dictionary of Welsh. |
| Welsh | Macsen | Open source Welsh language voice assistant similar to Alexa or the Google Assistant. |
| Welsh | Open Translation Memories | Public translation memory sharing service. |
| Welsh | Porth Technolegau Iaith Cenedlaethol Cymru | Welsh National Language Technologies Portal. |
| Welsh | Set ddata’r Adnodd Creu Crynodebau | Welsh summarization dataset. |
| Welsh | Termau | Standardized terminology to use in teaching and learning. |
| Welsh | Trawsgrifiwr | Welsh transcriber. |
| Welsh | Welsh National Corpora Portal | A collection of on-line written Welsh and bilingual corpora in an easily searchable format. |
| Welsh | Welsh Natural Language Toolkit | GATE-based NLP pipeline. |
| Welsh | Welsh Word Embeddings | |
| Welsh | Y Tiwtiadur | National Corpus of Contemporary Welsh pedagogic toolkit. |
| Welsh | Y Termiadur Addysg | Standardized terminology for the field of education. |

 

Thanks to Dr Mo El-Haj (VinUniversity) and others in the CLIDA network for starting to map these resources in 2024.