FreeTxt – CorCenCC – National Corpus of Contemporary Welsh

Supporting bilingual free-text survey and questionnaire data analysis

Project summary

In a modern consumer-led culture, obtaining and responding to feedback is embedded in the professional practice of many walks of life. Surveys, for example, are used in staff development and professional training, in product design and testing, and in various forms of service provision across the public and private sector.

Surveys and questionnaires often produce a combination of quantitative and qualitative data. Quantitative forms, such as rating scales, multiple-choice questions and rank-order questions, can be quantified with ease and analysed in a systematic and often automated way. Qualitative free-text responses pose more of a challenge to many companies/institutions, which often lack the expertise to analyse such data with ease. While a range of sophisticated tools does exist for the analysis of qualitative text data, these are often expensive, difficult to use and/or inaccessible to non-expert users. Such tools also lack support for the analysis of bilingual text, which can be a particular challenge in the context of Wales, as survey respondents should always be given the opportunity to respond in English and/or Welsh. On St David’s Day (1st March) in 2022, work began on a project which aimed to respond to these needs through the development of the novel ‘FreeTxt’ toolkit.

Co-designed and co-built with researchers at Cardiff University and Lancaster University, and project partners Museum Wales, Cadw, National Trust Wales, WJEC and National Centre for Learning Welsh, FreeTxt has been designed to support the analysis and visualisation of multiple forms of open-ended, free-text data in both English and Welsh. It is designed for non-expert users, with a focus on making the toolkit as widely accessible and intuitive as possible.

Key features of FreeTxt/TestunRhydd

FreeTxt draws on existing open-source bilingual corpus-based utilities and methodologies, as well as a range of Natural Language Processing (NLP) tools, repackaging these and taking them in a new direction so that they are relevant to new (and non-expert) audiences/user-groups. It encourages users to become budding corpus analysts without needing to know what a corpus or concordance is.

FreeTxt includes CorCenCC’s (National Corpus of Contemporary Welsh) semantic and part-of-speech taggers and tagsets, and corpus functionalities for the querying of language, as well as sentiment analysis and text summarisation tools. These tools have been integrated into a user-friendly, online interface that users can paste/upload their texts into, to:

search for patterns of meaning that emerge in survey responses and feedback,
see which words are most often used in relation to a given theme, place or topic, and
understand what visitors particularly enjoyed about a service or attraction, and what they think could be improved.

To try FreeTxt visit: https://freetxt.app/

Project Team

Dawn Knight, Cardiff University (project PI, Principal Investigator)

Dr. Dawn Knight is a Reader in Applied Linguistics at Cardiff University, UK. She was the Principal Investigator (PI) of the CorCenCC (National Corpus of Contemporary Welsh) project and is the Co-Principal Investigator of the Interactional Variation Online project (https://ivohub.com). Dawn has expertise in corpus linguistics, discourse analysis, digital interaction and non-verbal communication and is a former Chair of the British Association for Applied Linguistics (BAAL). Dawn is the PI of the FreeTxt/TestunRhydd project.

Paul Rayson, Lancaster University (project CI, Co-Investigator)

Professor Paul Rayson works in the School of Computing and Communications at Lancaster University, and is Director of the UCREL interdisciplinary research centre which carries out research in corpus linguistics and natural language processing (NLP). A long-term focus of his work is semantic multilingual NLP in extreme circumstances where language is noisy e.g. in historical, learner, speech, email, txt and other CMC varieties.

Mahmoud El-Haj, Lancaster University (project CI, Co-Investigator)

Dr. Mahmoud El-Haj, also known as Mo, is an NLP Lecturer in Computer Science at the School of Computing and Communications at Lancaster University. Mo received his PhD in Computer Science from The University of Essex, working on Multi-document Summarization. His work focuses mainly on Summarization, Information Extraction, Financial NLP and multilingual NLP, and has been applied to many languages including English, Arabic, Spanish, Portuguese and Welsh. He has an interest in under-resourced languages and building NLP datasets.

Ignatius Ezeani, Lancaster University (project RA, Research Associate)

Dr Ignatius Ezeani is a Senior Teaching/Research Associate at Lancaster University. He is interested in the application of NLP techniques in building resources for low-resource languages including Igbo and Welsh. He works on the efficient adaptation of existing NLP tools and techniques for creating task-oriented systems for low-resource languages.

Steve Morris, Cardiff University (Senior Research Associate)

Steve Morris is an Honorary Research Fellow in Applied Linguistics at Swansea University where previously he worked as an Associate Professor in Applied Linguistics and Welsh. Together with Dr Dawn Knight and Professor Tess Fitzpatrick, he was a co-creator of the CorCenCC (National Corpus of Contemporary Welsh) project on which he was also a Co-Investigator. The interdisciplinary interface between Applied Linguistics and the Welsh Language continues to be the prime focus of his work.

Project Advisory Group

National Trust Wales
Cadw
National Museum Wales
Emyr Davies, CBAC | WJEC
Efa Gruffudd Jones, Chief Executive, National Centre for Learning Welsh

Accessing FreeTxt/TestunRhydd

To try FreeTxt visit: www.freetxt.app

Funding Acknowledgement

FreeTxt was developed as part of the AHRC-funded collaborative research project ‘FreeTxt supporting bilingual free-text survey and questionnaire data analysis’, involving colleagues from Cardiff University and Lancaster University (Grant Number AH/W004844/1).

FreeTxt logo design by Katie Rayson