Adnodd Creu Crynodebau: Welsh Automatic Text Summarisation
Project overview
Adnodd Creu Crynodebau (ACC) is a publicly available Welsh-language automatic text summarisation tool. ACC contributes to the automated tools available in the Welsh language and facilitates the work of those involved in document preparation, proof-reading, and (in certain circumstances) translation. ACC also allows professionals to quickly summarise long documents for efficient presentation. For instance, ACC allows educators to adapt long documents for use in the classroom. It is envisaged that ACC will also benefit the wider public, who may prefer to read a summary of complex information presented on the internet or who may have difficulties reading long documents.
What is text summarisation?
Text summarisation is a digital approach to identifying the ‘key’ information contained within a text and creating a shortened version of the text based on that content. The aim is to provide users with succinct and coherent summaries, which are often time-consuming and difficult to produce manually. Summarisation is useful in the modern digital world, where the creation and sharing of text is ever-increasing, as it enables users to navigate, and make sense of, the wealth of information available with ease.
Approaches to text summarisation
The main approaches to text summarisation are extraction-based summarisation and abstraction-based summarisation. The former builds the summary from specific words/phrases extracted directly from the text, while the latter provides paraphrased summaries of the source text (i.e. not directly extracted from it). The successful extraction/abstraction of content, when using summarisation tools/approaches, depends on the accuracy of the automatic algorithms, which require training using hand-coded gold-standard datasets.
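To make the extraction-based approach concrete, the following is a minimal illustrative sketch in Python. It is not the method used by ACC. It assumes a naive sentence splitter and a simple word-frequency score, and the function names (`split_sentences`, `tokenize`, `extractive_summary`) are hypothetical and chosen for illustration only. Each sentence is scored by the average frequency of its words across the whole text, and the highest-scoring sentences are returned in their original order.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lower-case word tokens; \\w is Unicode-aware, so Welsh letters such as ŵ and ŷ are kept."""
    return re.findall(r"\w+", sentence.lower())


def extractive_summary(text, max_sentences=2):
    """Return up to max_sentences sentences copied verbatim from the text."""
    sentences = split_sentences(text)
    # Word frequencies over the whole document.
    freqs = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average word frequency, so long sentences are not favoured by length alone.
        return sum(freqs[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, highest first (ties keep document order).
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]),
                    reverse=True)
    # Keep the top sentences, then restore their original order.
    chosen = sorted(ranked[:max_sentences])
    return [sentences[i] for i in chosen]


if __name__ == "__main__":
    sample = "Cats sleep. Cats eat fish. Dogs bark loudly at night."
    for sentence in extractive_summary(sample, max_sentences=2):
        print(sentence)
```

Because every output sentence is copied unchanged from the input, this is extraction rather than abstraction. An abstraction-based system would instead generate new wording, which typically requires a trained language model and is beyond the scope of a short sketch.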
Work on automatic text summarisation has a long history in NLP (Natural Language Processing). This work originally focused only on English, but summarisation is now applied in a range of other languages, including French, Spanish, Hindi and Arabic, amongst others. The ‘MultiLing’ project and its associated conference series are noteworthy champions of developing text summarisation across a range of the world’s 7000+ languages. The website, http://multiling.iit.demokritos.gr, provides an open repository of test/training data for summarisation tasks and model summaries, amongst other resources. ACC fills a gap in previous resources by providing summarisation tools that work with the Welsh language.
Project Team
Dawn Knight, Cardiff University (project PI, Principal Investigator)
Dr. Dawn Knight is a Reader in Applied Linguistics at Cardiff University, UK. She was the Principal Investigator (PI) of the CorCenCC (National Corpus of Contemporary Welsh) project and is the Co-Principal Investigator of the Interactional Variation Online project (https://ivohub.com). Dawn has expertise in corpus linguistics, discourse analysis, digital interaction and non-verbal communication, and is a former Chair of the British Association for Applied Linguistics (BAAL). Dawn is the PI of the Welsh Automatic Text Summarisation project.
Jonathan Morris, Cardiff University (project CI, Co-Investigator)
Dr. Jonathan Morris is a Senior Lecturer in Welsh linguistics at Cardiff University. Jonathan’s research focuses on sociolinguistic aspects of bilingualism. His publications include work on cross-linguistic phonological interactions and sociophonetic variation in Welsh-English bilinguals’ speech and research on the use of the Welsh language among young people and families.
Mahmoud El-Haj, Lancaster University (project CI, Co-Investigator)
Dr. Mahmoud El-Haj, also known as Mo, is an NLP Lecturer in Computer Science at the School of Computing and Communications at Lancaster University. Mo received his PhD in Computer Science from the University of Essex, working on multi-document summarisation. His work focuses mainly on summarisation, information extraction, financial NLP and multilingual NLP, and has been applied to many languages including English, Arabic, Spanish, Portuguese and Welsh. He has an interest in under-resourced languages and building NLP datasets.
Ignatius Ezeani, Lancaster University (project RA, Research Associate)
Dr. Ignatius Ezeani is a Senior Teaching/Research Associate at Lancaster University. He is interested in the application of NLP techniques in building resources for low-resource languages including Igbo and Welsh. He works on the efficient adaptation of existing NLP tools and techniques for creating task-oriented systems for low-resource languages.
Research Assistants: Ianto Gruffydd, Katharine Young, Nia Eyre and Lynne Davies
Summarisers: Heledd Ainsworth, Aur Bleddyn, Esyllt Einion, Bethan Evans, Madlen Evans, Lisa Evans, Emma Herbert, Mali Hire, Megan Huws, Sian Morgan, Daniel O’Callaghan, Dafydd Orritt, Cêt Roberts, Hari Timms and Rhianwen Williams
Technical details
To learn more about the technical development of ACC, and for access to the tools and dataset being created as part of this project, please visit our GitHub site.
Accessing ACC
ACC is available here.
Outputs
- Ezeani, I., El-Haj, M., Morris, J. and Knight, D. (2022). Introducing the Welsh Text Summarisation Dataset and Baseline Systems. Proceedings of the LREC (Language Resources and Evaluation Conference) 2022, June 2022, Marseille, France.
- Morris, Jonathan, Ignatius Ezeani, Ianto Gruffydd, Katharine Young, Lynne Davies, Mahmoud El-Haj and Dawn Knight. 2022. Welsh Automatic Text Summarisation. Wales Academic Symposium on Language Technologies 2022, Bangor University, 28 January 2022.
- Morris, Jonathan, Ignatius Ezeani, Ianto Gruffydd, Katharine Young, Lynne Davies, Mahmoud El-Haj and Dawn Knight. Forthcoming. Welsh automatic text summarisation. Language and Technology in Wales: Volume II, ed. D. Prys. Bangor: Canolfan Bedwyr.
Contact us
If you would like more information about this project, please contact us at: crynodebau@caerdydd.ac.uk
Funding acknowledgement
This project, ‘Welsh Automatic Text Summarisation’, runs from 2021 to 2022 and is funded by the Welsh Government.