Downloads:

There are different ways in which CorCenCC can be accessed. Click the links below to:

Access full corpus

The published CorCenCC dataset contains 13,487,210 tokens (circa 11 million words). Tokens are the smallest units in a corpus and comprise words (i.e. items starting with a letter of the alphabet) and nonwords (i.e. items starting with a character that is not a letter of the alphabet).
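The word/nonword distinction can be illustrated with a minimal Python sketch. It applies only the rule stated above (the first character decides) to an already-tokenised list. The function names and the sample tokens are hypothetical, and the corpus's own tokenisation rules are not reproduced here.

```python
def is_word(token: str) -> bool:
    """Return True if the token starts with a letter of the alphabet."""
    # str.isalpha() also accepts accented letters such as 'ŵ' or 'ô'.
    return bool(token) and token[0].isalpha()


def count_words_and_nonwords(tokens):
    """Return (word_count, nonword_count) for a list of tokens."""
    words = sum(1 for t in tokens if is_word(t))
    return words, len(tokens) - words


# Hypothetical, already-tokenised sample (not taken from the corpus).
sample = ["Mae", "'r", "cloc", "yn", "tician", ".", "2021", "ŵyr"]
print(count_words_and_nonwords(sample))
```

Under this rule punctuation, numerals and apostrophe-initial clitics count as nonwords, which is consistent with the token count being larger than the word count.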

The data in CorCenCC represents a wide range of contexts, genres and topics. For a detailed breakdown of this composition, see Knight, Morris and Fitzpatrick (2021). This data has, as far as possible, been anonymised using a combination of manual and automated techniques, and has been fully tagged for part-of-speech (POS) and semantic category. The POS and semantic tagging was carried out using the CyTag and CySemTagger tools, available from CorCenCC’s GitHub site.

To request a copy of the CorCenCC corpus, please click here. The CorCenCC dataset is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. The associated software tools are licensed under Creative Commons CC-BY-SA v4 and are thus freely available for use by professional communities and individuals with an interest in language. When reporting information derived from the CorCenCC corpus data and/or tools, CorCenCC should be appropriately acknowledged. Citation details are available here. Full documentation for this corpus, including details of the CorCenCC transcription conventions, metadata descriptors and corpus taxonomy, is also available on CorCenCC’s GitHub site.

Existing corpus analysis tools can be used to carry out some basic analyses of CorCenCC (although please note that they may not support all functionalities for Welsh-language data). Such tools include: AntConc, WMatrix, CQPWeb, and #LancsBox, all of which are freely available.

A variety of different Welsh language utilities, many of which can be used with the CorCenCC corpus dataset, are available from Canolfan Bedwyr at Bangor University.

Explore CorCenCC online

A beta version of CorCenCC’s bilingual corpus query tools, with a complete user guide, is available through the Explore tab of this website. The tools offer the following functionalities:

  • Simple Query: used to explore any word and/or lemma form in the corpus, and one or more part-of-speech (POS) tags, mutation types, or semantic category tags of a specific word and/or lemma. A randomised selection of results is presented in a KWIC (Key Word in Context) output. Results can then be filtered by mode, geographical area, context, genre, topic, target audience and source.
  • Full Query: used to search for longer sequences of patterns (multi-word expressions) separated by spaces, using CorCenCC’s bespoke query syntax. Results are presented in a KWIC (Key Word in Context) output, which can be filtered according to mode, geographical area, context, genre, topic, target audience and source.
  • Frequency List: produces a list of words or lemmas in the corpus, ranked according to frequency of occurrence.
  • N-Gram Analysis: lists patterns of n-grams/clusters of 2-7 words, lemmas, or POS in the corpus, ranked according to frequency of occurrence.
  • Keyword Analysis: displays words that are unusually frequent in one sub-set of the corpus compared with a different ‘reference’ sub-set of the corpus.
  • Collocation Analysis: displays information on the relationships between word types that appear together within a given context window. [Functionality available soon]
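To make the Frequency List and N-Gram Analysis functions more concrete, here is a minimal Python sketch of the underlying counting. It is an illustration only and not the implementation used by the CorCenCC query tools. The function names and the token sequence are hypothetical.

```python
from collections import Counter


def frequency_list(tokens):
    """Rank items by frequency of occurrence, most frequent first."""
    return Counter(tokens).most_common()


def ngram_counts(tokens, n):
    """Count contiguous n-grams of length n (2 to 7, as in the list above)."""
    if not 2 <= n <= 7:
        raise ValueError("n must be between 2 and 7")
    # zip over n shifted copies of the token list yields each n-gram as a tuple.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common()


# Hypothetical lower-cased token sequence used only for illustration.
tokens = "gwrando ar y cloc yn tician gwrando ar y radio".split()
print(frequency_list(tokens)[:3])
print(ngram_counts(tokens, 2)[:2])
```

The same two functions work on lemmas or POS tags instead of word forms, provided the input list holds those values.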

CorCenCC’s accompanying pedagogic tools are available through the Y Tiwtiadur tab of this website.

All data in CorCenCC has been fully tagged for part-of-speech (POS) and semantic category. These tags are fully searchable within the corpus and, in the case of Simple and Full Queries, POS tags are also colour-coded to ease the examination of patterns in query results. All data is also categorised according to its context of use, genre, topic etc., enabling users to examine patterns within and across specific types of text and demographic information in the corpus. Details of the tags and taxonomies used are available in the user guide on the main query tools page and via CorCenCC’s GitHub site.

Results from analyses using the query tools may contain tags where data has been anonymised, or (for spoken data) where transcription conventions have been used. Anonymisation tags include:

  • Personal names: <anon> enwg1 </anon> (first male name); <anon> enwb1 </anon> (first female name)
  • Phone numbers: <anon> Rhif ffôn </anon>
  • Email addresses: <anon> cyfeiriad e-bost </anon>
  • Personal addresses: <anon> cyfeiriad </anon>
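When processing query results, the anonymisation tags listed above can be detected with a regular expression. The Python sketch below is one possible approach, not part of the CorCenCC tools. The function names are hypothetical, the sample line is made up for illustration, and the pattern assumes that name placeholders always consist of `enwg` or `enwb` followed by a number.

```python
import re

# Matches <anon> ... </anon>, with or without spaces around the placeholder.
ANON_RE = re.compile(r"<anon>\s*(.*?)\s*</anon>")

# Fixed placeholders from the list above, mapped to English labels.
LABELS = {
    "rhif ffôn": "phone number",
    "cyfeiriad e-bost": "email address",
    "cyfeiriad": "personal address",
}


def classify_anon(placeholder: str) -> str:
    """Map a placeholder to a category (assumed naming pattern for names)."""
    if re.fullmatch(r"enwg\d+", placeholder):
        return "male name"
    if re.fullmatch(r"enwb\d+", placeholder):
        return "female name"
    return LABELS.get(placeholder.lower(), "other")


def find_anon_tags(text: str):
    """Return (placeholder, category) pairs for every anonymisation tag."""
    return [(m.group(1), classify_anon(m.group(1)))
            for m in ANON_RE.finditer(text)]


# Made-up line for illustration; not a corpus extract.
line = "Ti'n cofio hyna <anon>enwb3</anon>? Ffonia <anon> Rhif ffôn </anon>."
print(find_anon_tags(line))
```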

Spoken data was transcribed using CorCenCC’s bespoke transcription conventions. Examples include:

<S4> Rydym ni yn defnyddio ein trwyna’ i arogli. <arogli i mewn yn sydyn> Pan ym mae ‘da fi anwyd mae fy nhrwyn i’n mynd yn goch ac <=> mae </=> mae fel yn rhedag trwy’r amser.

Here, <S4> identifies the speaker, and <=> mae </=> indicates a word that is repeated in the conversation.

<S1> Boeth. A’r hen athrawon ‘na’n mynd fyny ac i lawr yn mynd <griddfan>.

<S2> <Chwerthin>. Gwrando ar y+

<S1> Ti’n cofio hyna <anon>enwb3</anon>?

<S2> +Gwrando ar y cloc yn tician.

Here, ‘+’ marks the point at which one speaker interrupts another, so that the two talk at the same time. The use of <anon>enwb3</anon> signals that a personal name has been anonymised. Finally, <Chwerthin> indicates that the speaker is laughing and <griddfan> indicates a groan.
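The conventions illustrated above can be picked out of a transcript line by line. The following Python sketch is a hedged illustration rather than an official parser. It covers only the speaker label, the <=> … </=> repetition marker and the ‘+’ marker, the function and field names are hypothetical, and the full set of conventions is documented on CorCenCC’s GitHub site.

```python
import re

# A turn starts with a speaker label such as <S1> or <S4>.
SPEAKER_RE = re.compile(r"^<(S\d+)>\s*(.*)$")
# A repeated word is wrapped in <=> ... </=>.
REPEAT_RE = re.compile(r"<=>\s*(.*?)\s*</=>")


def parse_turn(line: str):
    """Return a dict describing one transcript line, or None if unlabelled."""
    match = SPEAKER_RE.match(line.strip())
    if match is None:
        return None
    speaker, text = match.groups()
    return {
        "speaker": speaker,
        "text": text,
        "repeats": REPEAT_RE.findall(text),
        "ends_with_plus": text.endswith("+"),      # turn broken off by '+'
        "starts_with_plus": text.startswith("+"),  # turn picked up after '+'
    }


# Lines taken from the examples above (the <S4> line is shortened).
lines = [
    "<S4> ac <=> mae </=> mae fel yn rhedag trwy’r amser.",
    "<S2> <Chwerthin>. Gwrando ar y+",
    "<S1> Ti’n cofio hyna <anon>enwb3</anon>?",
    "<S2> +Gwrando ar y cloc yn tician.",
]
turns = [parse_turn(line) for line in lines]
for turn in turns:
    print(turn["speaker"], turn["repeats"],
          turn["ends_with_plus"], turn["starts_with_plus"])
```

Keeping the two ‘+’ flags separate makes it possible to rejoin the two halves of an interrupted turn from the same speaker in a later step.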

To familiarise yourself with the conventions and tags used in the corpus, please read the transcription conventions and taxonomy information available on CorCenCC’s GitHub site.

Publications:

  • Knight, D., Fitzpatrick, T., Morris, S., Tovey-Walsh, B., Prosser, H. and Davies, E. (2023). Corpus to curriculum: Developing word lists for adult learners of Welsh. Applied Corpus Linguistics.
  • Knight, D., Tovey-Walsh, B., Davies, E., Morris, S. and Prosser, H. (2022). The Geirfan wordlist: A vocabulary list for adult learners of Welsh, Cardiff University, DOI: 10.17035/d.2022.0234583226.
  • Knight, D., Morris, S., Arman, L., Needs, J. and Rees, M. (2021). Building a National Corpus: A Welsh Language Case Study. London: Palgrave.
  • Knight, D., Morris, S. and Fitzpatrick, T. (2021). Corpus Design and Construction in Minoritised Language Contexts: The National Corpus of Contemporary Welsh. London: Palgrave.
  • Corcoran, P., Palmer, G., Arman, L., Knight, D. and Spasić, I. (2021). Creating Welsh Language Word Embeddings. Applied Sciences 11(15): 6896.
  • Knight, D., Loizides, F., Neale, S., Anthony, L. and Spasić, I. (2020). Developing computational infrastructure for the CorCenCC corpus – the National Corpus of Contemporary Welsh. Language Resources and Evaluation (LREV).
  • Muralidaran, V., Knight, D. and Spasić, I. (2020). A systematic review of unsupervised approaches to usage-based grammar induction. Natural Language Engineering.
  • Spasić, I., Owen, D., Knight, D. and Artemiou, A. (2019). Data-driven terminology alignment in parallel corpora. Proceedings of the Celtic Language Technology Workshop 2019, Dublin, Ireland.
  • Piao, S., Rayson, P., Knight, D. and Watkins, G. (2018). Towards a Welsh Semantic Annotation System. Proceedings of the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan.
  • Neale, S., Donnelly, K., Watkins, G. and Knight, D. (2018). Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh. Poster presented at the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan.
  • Rayson, P. (2018). Increasing Interoperability for Embedding Corpus Annotation Pipelines in Wmatrix and other corpus retrieval tools. Proceedings of the Challenges in the Management of Large Corpora workshop at the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan.
  • Rayson, P. and Piao, S. (2017). Creating and Validating Multilingual Semantic Representations for Six Languages: Expert versus Non-Expert Crowds. Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications held at the European Chapter of the Association for Computational Linguistics 2017 (EACL) conference, April, Valencia.
  • Piao, S., Rayson, P., Archer, D., Bianchi, F., Dayrell, C., El-Haj, M., Jiménez, R-M., Knight, D., Křen, M., Löfberg, L., Nawab, R. M. A., Shafi, J., Teh, P-L., and Mudraya, O. (2016). Lexical Coverage Evaluation of Large-scale Multilingual Semantic Lexicons for Twelve Languages. Proceedings of the LREC (Language Resources Evaluation) 2016 Conference, May 2016, Portorož, Slovenia.


Keynotes and Conference Presentations:


CorCenCC Tools and Software:

The CorCenCC corpus and its associated tools are open source and freely available via the CorCenCC GitHub site. To access the site, please click here.

Please cite these outputs as follows:

  • CorCenCC corpus:
    • Knight, D., Morris, S., Fitzpatrick, T., Rayson, P., Spasić, I., Thomas, E-M., Lovell, A., Morris, J., Evas, J., Stonelake, M., Arman, L., Davies, J., Ezeani, I., Neale, S., Needs, J., Piao, S., Rees, M., Watkins, G., Williams, L., Muralidaran, V., Tovey-Walsh, B., Anthony, L., Cobb, T., Deuchar, M., Donnelly, K., McCarthy, M. and Scannell, K. (2020). CorCenCC: Corpws Cenedlaethol Cymraeg Cyfoes – the National Corpus of Contemporary Welsh. Cardiff University. http://doi.org/10.17035/d.2020.0119878310
  • The CorCenCC project report:
    • Knight, D., Morris, S., Fitzpatrick, T., Rayson, P., Spasić, I. and Thomas, E. M. (2020). The National Corpus of Contemporary Welsh: Project Report | Y Corpws Cenedlaethol Cymraeg Cyfoes: Adroddiad y Prosiect. arXiv:2010.05542, October 2020.
  • CorCenCC’s infrastructure and crowdsourcing app:
    • Knight, D., Loizides, F., Neale, S., Anthony, L. and Spasić, I. (2020). Developing computational infrastructure for the CorCenCC corpus – the National Corpus of Contemporary Welsh. Language Resources and Evaluation (LREV).
  • CorCenCC’s part-of-speech (POS) tagger ‘CyTag’:
    • Neale, S., Donnelly, K., Watkins, G. and Knight, D. (2018). Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh. Poster presented at the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan.
  • CorCenCC’s semantic tagger ‘CySemTagger’:
    • Piao, S., Rayson, P., Knight, D. and Watkins, G. (2018). Towards a Welsh Semantic Annotation System. Proceedings of the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan.
    • Piao, S., Rayson, P., Knight, D., Watkins, G. and Donnelly, K. (2017). Towards a Welsh Semantic Tagger: Creating Lexicons for A Resource Poor Language. In Proceedings of The Corpus Linguistics 2017 Conference, held from 24-28 July 2017 at University of Birmingham, Birmingham, UK.
  • CorCenCC’s pedagogic toolkit ‘Y Tiwtiadur’:
    • Davies, J., Thomas, E-M., Fitzpatrick, T., Needs, J., Anthony, L., Cobb, T. and Knight, D. (2020). Y Tiwtiadur. [Digital Resource]. Available at: https://ytiwtiadur.corcencc.org
  • CorCenCC’s word frequency lists ‘Yr Amliadur’:

Please visit our GitHub site to access CyTag, CySemTagger, links to the CorCenCC dataset and details of the CorCenCC transcription and coding conventions: https://github.com/CorCenCC


Related projects and software

Below are details of all externally funded satellite projects of CorCenCC:

Start date | Funder | Amount | Description [with PI]
Feb 2017 | British Council | £2,000 | Funding to support the public launch of the CorCenCC project at the Pierhead Building, Cardiff. [Knight]
Oct 2017 | Welsh Government | £24,992 | Competitive commission from Welsh Government to provide a rapid evidence assessment of effective second language teaching approaches and methods. For more information, click here. [Fitzpatrick]
Jan 2018 | Cymraeg 2050 2017-2018 Grant Scheme (GC2050/17-18/20) | £19,964 | A project focused on automatically constructing a WordNet for Welsh, a lexical database in which words are grouped into sets of synonyms (synsets), which are then organised into a network of lexico-semantic relationships. To access the WordNet Cymru website, click here. [Spasić]
Jan 2018 | Welsh Joint Education Committee (WJEC) | £1,968 | Research grant (including intramural programme) to complete work on producing a B1 core vocabulary for Welsh for Adults (Canolradd level). For more information, click here. [Morris]
Jan 2019 | Welsh Government Technology Funding | £20,000 | Funding to support the development of a Welsh language stemmer. For more information, click here. [Spasić]
Aug 2019 | Welsh Government Technology Funding | £90,000 | Project entitled ‘Welsh language processing infrastructure: Welsh word embeddings’. The project focused on word embeddings for Welsh (primarily on creating a lexicon and Welsh word and term embeddings) and contributes to the Welsh Language Technology Action Plan’s aim to ‘promote Welsh language technology and coding resources to teachers and children and others’. [Spasić]
May 2020 | Welsh Government Technology Funding | £90,000 | Project entitled ‘Learning English-Welsh bilingual embeddings and applications in text categorisation’. This project aims to extend the results of the previous one by creating cross-lingual representations of words in a joint embedding space for Welsh and English. [Knight]


CorCenCC newsletter (archive)

Click below to view the archived editions of the newsletters that were published during the CorCenCC project:
