Project Overview
The main aim of this project is to develop an open-access, freely available online thesaurus of the Welsh language, for Welsh speakers and learners alike. Users will be able to use an interface on this website to search for synonyms (similar words). For example, searching for the word ‘to search’ could show synonyms like ‘look for’, ‘pursue’, and ‘explore’.
The project team intend to draw on (1) pre-existing word embeddings to find related words without relying on human lexicographers and (2) the Welsh Semantic Tagger and human evaluators to refine the tool. This methodology has seen some success with languages such as French but has yet to be applied to under-resourced languages, where an automated and less costly approach to thesaurus compilation is arguably more necessary.
The creation of word embeddings is a relatively recent development in Natural Language Processing (NLP) and involves transforming each word in a corpus (a collection of written or spoken language) into a vector. Words which are similar in meaning (synonyms) or association lie closer together in the vector space, so embeddings can be used to map the links between individual lexemes. For the language user, this represents a valuable resource which goes beyond traditional thesauri.
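To make the idea of "closeness in the vector space" concrete, here is a minimal Python sketch that ranks words by cosine similarity, one common way of comparing embedding vectors. It is an illustration only and not the project's code: the three-dimensional vectors are invented, the English glosses stand in for Welsh words, and the names `TOY_EMBEDDINGS`, `cosine_similarity` and `nearest_words` are hypothetical.

```python
import math

# Toy 3-dimensional vectors, invented purely for illustration.
# Real embeddings are learned from a corpus and typically have
# hundreds of dimensions.
TOY_EMBEDDINGS = {
    "search": [0.9, 0.1, 0.0],
    "explore": [0.8, 0.3, 0.1],
    "pursue": [0.7, 0.2, 0.3],
    "table": [0.0, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_words(query, embeddings, top_n=3):
    """Rank every other word by cosine similarity to the query word."""
    query_vec = embeddings[query]
    scored = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in embeddings.items()
        if word != query
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# With these toy vectors, "explore" and "pursue" rank above "table".
print(nearest_words("search", TOY_EMBEDDINGS, top_n=2))
```

Because the ranking reflects association as well as synonymy, the nearest neighbours of a word are best treated as candidate synonyms that still need checking.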
The project will use pre-existing word embeddings for Welsh to find similar words. The Welsh Semantic Tagger can then be used to refine the similarities.
Following this, human evaluators (Welsh speakers) will be recruited in order to refine the output.
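The two refinement stages could be sketched in Python as follows. This is a sketch under stated assumptions, not a description of the project's actual method: it assumes that refinement means keeping only candidates that share at least one semantic tag with the query word, and then keeping only those that human evaluators have approved. The tag labels, the word lists and the function `refine_candidates` are all hypothetical and are not real output from the Welsh Semantic Tagger.

```python
def refine_candidates(query, candidates, tags_by_word, approved=None):
    """Filter candidate synonyms for a query word.

    Stage 1 (assumed): keep candidates sharing a semantic tag with the query.
    Stage 2 (optional): keep only candidates approved by human evaluators.
    """
    query_tags = tags_by_word.get(query, set())
    kept = [
        word for word in candidates
        if tags_by_word.get(word, set()) & query_tags
    ]
    if approved is not None:
        kept = [word for word in kept if word in approved]
    return kept


# Placeholder tag labels, invented for this example only.
TOY_TAGS = {
    "search": {"TAG_INVESTIGATE"},
    "explore": {"TAG_INVESTIGATE", "TAG_TRAVEL"},
    "pursue": {"TAG_INVESTIGATE"},
    "table": {"TAG_FURNITURE"},
}

# Candidates as they might come out of an embedding lookup.
candidates = ["explore", "pursue", "table"]

# Stage 1 only: "table" is dropped because it shares no tag with "search".
tag_filtered = refine_candidates("search", candidates, TOY_TAGS)

# Stages 1 and 2: only words the (hypothetical) evaluators approved remain.
final = refine_candidates(
    "search", candidates, TOY_TAGS, approved={"explore"}
)

print(tag_filtered)
print(final)
```

Keeping the human check as a separate, optional step means the automated output can be inspected on its own before any evaluators are involved.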
The resource will be publicly available on this page, and the accompanying Python code will be available through our GitHub repository.
Project Team
Jonathan Morris, Cardiff University (project PI, Principal Investigator)
Dr. Jonathan Morris is a Senior Lecturer in Welsh linguistics at Cardiff University. Jonathan’s research focuses on sociolinguistic aspects of bilingualism. His publications include work on cross-linguistic phonological interactions and sociophonetic variation in Welsh-English bilinguals’ speech and research on the use of the Welsh language among young people and families.
Dawn Knight, Cardiff University (project CI, Co-Investigator)
Dr. Dawn Knight is a Reader in Applied Linguistics at Cardiff University, UK. She was the Principal Investigator (PI) of the CorCenCC (National Corpus of Contemporary Welsh) project and is the Co-Principal Investigator of the Interactional Variation Online project (https://ivohub.com). Dawn has expertise in corpus linguistics, discourse analysis, digital interaction and non-verbal communication and is a former Chair of the British Association for Applied Linguistics (BAAL).
Mahmoud El-Haj, Lancaster University (project CI, Co-Investigator)
Dr. Mahmoud El-Haj, also known as Mo, is an NLP Lecturer in Computer Science at the School of Computing and Communications at Lancaster University. Mo received his PhD in Computer Science from the University of Essex, where he worked on multi-document summarization. His work focuses mainly on summarization, information extraction, financial NLP and multilingual NLP, and has been applied to many languages including English, Arabic, Spanish, Portuguese and Welsh. He has an interest in under-resourced languages and building NLP datasets.
Elin Arfon, Cardiff University (project RA)
Elin Arfon is an ESRC and Welsh Government-funded PhD student at the School of Modern Languages, Cardiff University. Elin’s research focuses on the concept of multilingualism within the Curriculum for Wales. Her doctoral study explores how secondary school teachers of international languages across Wales understand multilingualism in relation to language teaching and assessment. Elin has a keen interest in the multilingual context of Wales.