Developing computational infrastructure for the CorCenCC corpus: The National Corpus of Contemporary Welsh
Language Resources and Evaluation (IF 2.7), Pub Date: 2020-07-29, DOI: 10.1007/s10579-020-09501-9
Dawn Knight, Fernando Loizides, Steven Neale, Laurence Anthony, Irena Spasić

CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes, the National Corpus of Contemporary Welsh) is the first comprehensive corpus of Welsh designed to reflect language use across communication types, genres, speakers, language varieties (regional and social) and contexts. This article focuses on the computational infrastructure that we have designed to support data collection for CorCenCC and the subsequent uses of the corpus, which include lexicography, pedagogical research and corpus analysis. We adopted a grass-roots approach to design, adapting and extending previous corpus-building work and introducing new features as required for this specific context and language. The key pillars of the infrastructure are a framework that supports metadata collection, an innovative mobile application that collects spoken data through crowdsourcing, a backend database that stores curated data, and a web-based interface that allows users to query the data online. A usability study was conducted to evaluate the user-facing tools and to suggest directions for future improvements. Although the infrastructure was developed for Welsh language collection, its design can be reused to support corpus development in other minority or major language contexts, broadening the potential utility and impact of this work.




Updated: 2020-07-29