Downloads:
There are different ways in which CorCenCC can be accessed. Click the links below to:
Access full corpus
The published CorCenCC dataset includes 13,487,210 tokens (circa 11 million words). Tokens are the smallest units contained within a corpus and comprise words (i.e. items starting with a letter of the alphabet) and nonwords (i.e. items starting with a character that is not a letter of the alphabet).
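The word/nonword distinction described above can be illustrated with a minimal Python sketch. It applies only the definition given here (a word starts with a letter, a nonword does not); the function name and the sample tokens are hypothetical, and the sketch is not CorCenCC's actual tokeniser.

```python
def classify_token(token: str) -> str:
    """Return 'word' if the token starts with a letter, otherwise 'nonword'."""
    # str.isalpha() is Unicode-aware, so accented Welsh letters such as 'ŵ' count as letters.
    # Slicing with [:1] keeps the function safe for empty strings.
    return "word" if token[:1].isalpha() else "nonword"


# Hypothetical sample tokens, chosen only to show both categories.
tokens = ["Mae", "'r", "ŵyn", "2021", ",", "ysgol"]

counts = {"word": 0, "nonword": 0}
for tok in tokens:
    counts[classify_token(tok)] += 1

print(counts)
```

Counting tokens in both categories, rather than words alone, is what produces a token total larger than the word total.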
The data in CorCenCC represents a wide range of contexts, genres and topics. For a detailed breakdown of this composition, see Knight, Morris and Fitzpatrick (2021). As far as possible, the data has been anonymised using a combination of manual and automated techniques, and it has been fully tagged for part-of-speech (POS) and semantic categories. The POS and semantic tagging was carried out using the CyTag and CySemTagger tools, available from CorCenCC’s GitHub site.
To request a copy of the CorCenCC corpus, please click here. The CorCenCC dataset is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. The associated software tools are licensed under Creative Commons CC-BY-SA v4 and thus are freely available for use by professional communities and individuals with an interest in language. When reporting information derived from the CorCenCC corpus data and/or tools, CorCenCC should be appropriately acknowledged. Citation details are available here. Full documentation for this corpus, including details of the CorCenCC transcription conventions, metadata descriptors and corpus taxonomy, is also available on CorCenCC’s GitHub site.
Existing corpus analysis tools can be used to carry out some basic analyses of CorCenCC (although please note that they may not support all functionalities for Welsh-language data). Such tools include: AntConc, WMatrix, CQPWeb, and #LancsBox, all of which are freely available.
A variety of different Welsh language utilities, many of which can be used with the CorCenCC corpus dataset, are available from Canolfan Bedwyr at Bangor University.
Explore CorCenCC online
A beta version of CorCenCC’s bilingual corpus query tools, with complete user guide, is available through the Explore tab of this website. This includes the following functionalities:
- Simple Query: used to explore any word and/or lemma form in the corpus, along with one or more part-of-speech (POS) tags, mutation types, or semantic category tags of a specific word and/or lemma. A randomised selection of results is presented in a KWIC (Key Word in Context) output. Results can then be filtered by mode, geographical area, context, genre, topic, target audience and source.
- Full Query: used to search for longer sequences of patterns (multi-word expressions) separated by spaces, using CorCenCC’s bespoke query syntax. Results are presented in a KWIC (Key Word in Context) output, which can be filtered according to mode, geographical area, context, genre, topic, target audience and source.
- Frequency List: produces a list of words or lemmas in the corpus, ranked according to frequency of occurrence.
- N-Gram Analysis: lists patterns of n-grams/clusters of 2-7 words, lemmas, or POS in the corpus, ranked according to frequency of occurrence.
- Keyword Analysis: displays words that are unusually frequent in one sub-set of the corpus compared with a different ‘reference’ sub-set of the corpus.
- Collocation Analysis: displays information on the relationships between word types that appear together within a given context window. [Functionality available soon]
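To give a sense of what the Frequency List and N-Gram Analysis functions compute, here is a minimal Python sketch using only the standard library. It assumes an already tokenised, lowercased input; the sample sentence and function names are hypothetical, and the sketch does not reproduce how CorCenCC's query tools are implemented.

```python
from collections import Counter


def frequency_list(tokens):
    """Return (item, count) pairs ranked by frequency of occurrence."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample text, split on whitespace for simplicity.
tokens = "mae hi yn braf ac mae hi yn oer".split()

freq = frequency_list(tokens)
top_bigrams = Counter(ngrams(tokens, 2)).most_common(2)

print(freq[:3])
print(top_bigrams)
```

The same `ngrams` helper works for any n; the online tool covers sequences of 2 to 7 items, and could equally be fed lemmas or POS tags instead of word forms.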
CorCenCC’s accompanying pedagogic tools are available through the Y Tiwtiadur tab of this website.
All data in CorCenCC has been fully tagged in terms of part-of-speech (POS) and semantic category. These tags are fully searchable within the corpus and, in the case of Simple and Full Queries, POS tags are also colour coded to ease the examination of patterns in query results. All data is also categorised according to its context of use, genre, topic etc., enabling users to examine patterns within and across specific types of text and demographic information in the corpus. Details of the tags and taxonomies used are available in the user guide on the main query tools page and via CorCenCC’s GitHub site.
Results from analyses using the query tools may contain tags where data has been anonymised, or (for spoken data) where transcription conventions have been used. Anonymisation tags include:
- Personal names: <anon> enwg1 </anon> (first male name); <anon> enwb1 </anon> (first female name)
- Phone numbers: <anon> Rhif ffôn </anon>
- Email addresses: <anon> cyfeiriad e-bost </anon>
- Personal addresses: <anon> cyfeiriad </anon>
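When working with exported results, it can help to locate these anonymisation tags automatically. The Python sketch below is one possible approach, not part of the CorCenCC tools: the function names are hypothetical, and it assumes that the number in placeholders such as enwg1 or enwb3 is simply an index for distinct names.

```python
import re

# Matches <anon> ... </anon> with or without spaces around the placeholder.
ANON = re.compile(r"<anon>\s*(.*?)\s*</anon>")

# Placeholders listed above that have a fixed wording.
FIXED = {
    "Rhif ffôn": "phone number",
    "cyfeiriad e-bost": "email address",
    "cyfeiriad": "personal address",
}


def anon_category(placeholder):
    """Map an anonymisation placeholder to the category it stands for."""
    # Assumption: the trailing number only distinguishes one name from another.
    if re.fullmatch(r"enwg\d+", placeholder):
        return "male name"
    if re.fullmatch(r"enwb\d+", placeholder):
        return "female name"
    return FIXED.get(placeholder, "unknown")


def find_anonymised(text):
    """Return (placeholder, category) pairs for every <anon> tag in text."""
    return [(p, anon_category(p)) for p in ANON.findall(text)]


print(find_anonymised("Ti’n cofio hyna <anon>enwb3</anon>?"))
print(find_anonymised("<anon> cyfeiriad e-bost </anon>"))
```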
Spoken data was transcribed using CorCenCC’s bespoke transcription conventions. Examples include:
<S4> Rydym ni yn defnyddio ein trwyna’ i arogli. <arogli i mewn yn sydyn> Pan ym mae ‘da fi anwyd mae fy nhrwyn i’n mynd yn goch ac <=> mae </=> mae fel yn rhedag trwy’r amser.
Here, <S4> denotes the speaker in the conversation, and <=> mae </=> indicates a repeated word.
<S1> Boeth. A’r hen athrawon ‘na’n mynd fyny ac i lawr yn mynd <griddfan>.
<S2> <Chwerthin>. Gwrando ar y+
<S1> Ti’n cofio hyna <anon>enwb3</anon>?
<S2> +Gwrando ar y cloc yn tician.
Here, the use of ‘+’ indicates where one speaker interrupts another, so that they talk at the same time. The use of <anon>enwb3</anon> signals that a personal name has been anonymised. Finally, <Chwerthin> indicates that the speaker is laughing and <griddfan> indicates a groan.
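The conventions illustrated above are regular enough to be picked out with simple patterns. The following Python sketch is a rough illustration only: it covers just the conventions shown in these examples, the function and field names are hypothetical, and it treats a trailing ‘+’ as a turn that is cut off and a leading ‘+’ as its continuation, which is an interpretation of the examples rather than the full specification on the GitHub site.

```python
import re

SPEAKER = re.compile(r"^<S(\d+)>\s*")          # e.g. <S4>
REPEAT = re.compile(r"<=>\s*(.*?)\s*</=>")     # e.g. <=> mae </=>
# Any other opening tag is treated as a non-verbal event, e.g. <Chwerthin>.
EVENT = re.compile(r"<(?!S\d+>|anon>|/|=>)([^<>]+)>")


def parse_turn(line):
    """Extract the speaker, repeated words, events and overlap markers from one turn."""
    match = SPEAKER.match(line)
    text = line[match.end():] if match else line
    return {
        "speaker": f"S{match.group(1)}" if match else None,
        "repeats": REPEAT.findall(text),
        "events": EVENT.findall(text),
        "cut_off": text.rstrip().endswith("+"),    # turn interrupted by another speaker
        "resumes": text.lstrip().startswith("+"),  # continuation of an interrupted turn
    }


# Turns copied from the examples above.
turns = [
    "<S4> Rydym ni yn defnyddio ein trwyna’ i arogli. <arogli i mewn yn sydyn> "
    "Pan ym mae ‘da fi anwyd mae fy nhrwyn i’n mynd yn goch ac <=> mae </=> "
    "mae fel yn rhedag trwy’r amser.",
    "<S2> <Chwerthin>. Gwrando ar y+",
    "<S1> Ti’n cofio hyna <anon>enwb3</anon>?",
    "<S2> +Gwrando ar y cloc yn tician.",
]

parsed = [parse_turn(t) for t in turns]
for turn in parsed:
    print(turn)
```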
To familiarise yourself with the conventions and tags used in the corpus, please read the transcription conventions and taxonomy information available on CorCenCC’s GitHub site.
Publications:
- Knight, D., Fitzpatrick, T., Morris, S., Tovey-Walsh, B., Prosser, H. and Davies, E. (2023). Corpus to curriculum: Developing word lists for adult learners of Welsh. Applied Corpus Linguistics.
- Knight, D., Tovey-Walsh, B., Davies, E., Morris, S. and Prosser, H. (2022). The Geirfan wordlist: A vocabulary list for adult learners of Welsh, Cardiff University, DOI: 10.17035/d.2022.0234583226.
- Knight, D., Morris, S., Arman, L., Needs, J. and Rees, M. (2021). Building a National Corpus: A Welsh Language Case Study. London: Palgrave.
- Knight, D., Morris, S. and Fitzpatrick, T. (2021). Corpus Design and Construction in Minoritised Language Contexts: The National Corpus of Contemporary Welsh. London: Palgrave.
- Corcoran, P., Palmer, G., Arman, L., Knight, D. and Spasić, I. (2021). Creating Welsh Language Word Embeddings. Applied Sciences 11(15): 6896.
- Knight, D., Loizides, F., Neale, S., Anthony, L. and Spasić, I. (2020). Developing computational infrastructure for the CorCenCC corpus – the National Corpus of Contemporary Welsh. Language Resources and Evaluation (LREV).
- Muralidaran, V., Knight, D. and Spasić, I. (2020). A systematic review of unsupervised approaches to usage-based grammar induction. Natural Language Engineering.
- Spasić, I., Owen, D., Knight, D. and Artemiou, A. (2019). Data-driven terminology alignment in parallel corpora. Proceedings of the Celtic Language Technology Workshop 2019, Dublin, Ireland.
- Piao, S., Rayson, P., Knight, D. and Watkins, G. (2018). Towards a Welsh Semantic Annotation System. Proceedings of the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan.
- Neale, S., Donnelly, K., Watkins, G. and Knight, D. (2018). Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh. Poster presented at the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan.
- Rayson, P. (2018). Increasing Interoperability for Embedding Corpus Annotation Pipelines in Wmatrix and other corpus retrieval tools. Proceedings of the Challenges in the Management of Large Corpora workshop at the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan.
- Rayson, P. and Piao, S. (2017). Creating and Validating Multilingual Semantic Representations for Six Languages: Expert versus Non-Expert Crowds. Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications held at the European Chapter of the Association for Computational Linguistics 2017 (EACL) conference, April, Valencia.
- Piao, S., Rayson, P., Archer, D., Bianchi, F., Dayrell, C., El-Haj, M., Jiménez, R-M., Knight, D., Křen, M., Löfberg, L., Nawab, R. M. A., Shafi, J., Teh, P-L., and Mudraya, O. (2016). Lexical Coverage Evaluation of Large-scale Multilingual Semantic Lexicons for Twelve Languages. Proceedings of the LREC (Language Resources Evaluation) 2016 Conference, May 2016, Portorož, Slovenia.
Keynotes and Conference Presentations:
CorCenCC Tools and Software:
The CorCenCC corpus and its associated tools are open source so are freely available via the CorCenCC GitHub site. To access the site, please click here.
Please cite these outputs as follows:
- CorCenCC corpus:
- Knight, D., Morris, S., Fitzpatrick, T., Rayson, P., Spasić, I., Thomas, E-M., Lovell, A., Morris, J., Evas, J., Stonelake, M., Arman, L., Davies, J., Ezeani, I., Neale, S., Needs, J., Piao, S., Rees, M., Watkins, G., Williams, L., Muralidaran, V., Tovey-Walsh, B., Anthony, L., Cobb, T., Deuchar, M., Donnelly, K., McCarthy, M. and Scannell, K. (2020). CorCenCC: Corpws Cenedlaethol Cymraeg Cyfoes – the National Corpus of Contemporary Welsh. Cardiff University. http://doi.org/10.17035/d.2020.0119878310
- The CorCenCC project report:
- Knight, D., Morris, S., Fitzpatrick, T., Rayson, P., Spasić, I. and Thomas, E. M. (2020). The National Corpus of Contemporary Welsh: Project Report | Y Corpws Cenedlaethol Cymraeg Cyfoes: Adroddiad y Prosiect. arXiv:2010.05542, October 2020.
- CorCenCC’s infrastructure and crowdsourcing app:
- Knight, D., Loizides, F., Neale, S., Anthony, L. and Spasić, I. (2020). Developing computational infrastructure for the CorCenCC corpus – the National Corpus of Contemporary Welsh. Language Resources and Evaluation (LREV).
- CorCenCC’s part-of-speech (POS) tagger ‘CyTag’:
- Neale, S., Donnelly, K., Watkins, G. and Knight, D. (2018). Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh. Poster presented at the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan.
- CorCenCC’s semantic tagger ‘CySemTagger’:
- Piao, S., Rayson, P., Knight, D. and Watkins, G. (2018). Towards a Welsh Semantic Annotation System. Proceedings of the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan.
- Piao, S., Rayson, P., Knight, D., Watkins, G. and Donnelly, K. (2017). Towards a Welsh Semantic Tagger: Creating Lexicons for A Resource Poor Language. In Proceedings of The Corpus Linguistics 2017 Conference, held from 24-28 July 2017 at University of Birmingham, Birmingham, UK.
- CorCenCC’s pedagogic toolkit ‘Y Tiwtiadur’:
- Davies, J., Thomas, E-M., Fitzpatrick, T., Needs, J., Anthony, L., Cobb, T. and Knight, D. (2020). Y Tiwtiadur. [Digital Resource]. Available at: https://ytiwtiadur.corcencc.org
- CorCenCC’s word frequency lists ‘Yr Amliadur’:
- Knight, D., Morris, S., Tovey-Walsh, B., Fitzpatrick, T. and Anthony, L. (2020). Yr Amliadur: Frequency Lists for Contemporary Welsh. Cardiff University, http://doi.org/10.17035/d.2020.0120164107
Please visit our GitHub site to access CyTag, CySemTagger, links to the CorCenCC dataset and details of the CorCenCC transcription and coding conventions: https://github.com/CorCenCC
Related projects and software:
Below are details of all externally funded satellite projects of CorCenCC:
Start Date | Funder | Amount | Description [with PI] |
Feb 2017 | British Council | £2,000 | Funding to support the public launch of the CorCenCC project at the Pierhead Building, Cardiff [Knight] |
Oct 2017 | Welsh Government | £24,992 | Competitive commission from Welsh Government to provide a rapid evidence assessment of effective second language teaching approaches and methods. For more information, click here. [Fitzpatrick] |
Jan 2018 | Cymraeg 2050 2017-2018 Grant Scheme (GC2050/17-18/20) | £19,964 | A project which focused on automatically constructing a WordNet for Welsh, a lexical database in which words are grouped into sets of synonyms (synsets), which are then organised into a network of lexico-semantic relationships. To access the WordNet Cymru website, click here. [Spasić] |
Jan 2018 | Welsh Joint Education Committee (WJEC) | £1,968 | Research grant (including intramural programme) to complete work on producing a B1 core vocabulary for Welsh for Adults (Canolradd level). For more information, click here. [Morris] |
Jan 2019 | Welsh Government Technology Funding | £20,000 | Funding to support the development of a Welsh language stemmer. For more information, click here. [Spasić] |
Aug 2019 | Welsh Government Technology Funding | £90,000 | Project entitled: ‘Welsh language processing infrastructure: Welsh word embeddings’. The project focused on word embeddings for Welsh (primarily on creating a lexicon and Welsh word and term embeddings) and contributes to the Welsh Language Technology Action Plan’s aim to ‘promote Welsh language technology and coding resources to teachers and children and others’. [Spasić] |
May 2020 | Welsh Government Technology Funding | £90,000 | Project entitled: ‘Learning English-Welsh bilingual embeddings and applications in text categorisation’. This project aims to extend the results of the previous one by creating cross-lingual representations of words in a joint embedding space for Welsh and English. [Knight] |
CorCenCC newsletter (archive):
Click below to view the archived editions of the newsletters that were published during the CorCenCC project:
- Issue 1: April 2016
- Issue 2: May 2016
- Issue 3: June 2016
- Issue 4: July 2016
- Issue 5: August 2016
- Issue 6: September 2016
- Issue 7: October 2016
- Issue 8: November 2016
- Issue 9: January 2017
- Issue 10: March 2017
- Issue 11: May 2017
- Issue 12: July 2017
- Issue 13: September 2017
- Issue 14: November 2017
- Issue 15: January 2018
- Issue 16: March 2018
- Issue 17: May 2018
- Issue 18: July 2018
- Issue 19: September 2018
- Issue 20: November 2018
- Issue 21: January 2019
- Issue 22: March 2019
- Issue 23: May 2019
- Issue 24: August 2019