The Corpus of American Danish: a language resource of spoken immigrant Danish in North and South America
Language Resources and Evaluation (IF 1.7), Pub Date: 2019-08-23, DOI: 10.1007/s10579-019-09473-5
Karoline Kühl, Jan Heegård Petersen, Gert Foget Hansen

This paper describes the ‘Corpus of American Danish’ (CoAmDa), a newly established corpus of spoken immigrant Danish in North and South America. The CoAmDa amounts to approx. 1.7 million tokens, making it one of the largest heritage language corpora at present. With regard to text type, the CoAmDa is a non-standard multilingual spoken language resource, as Danish is mixed with American English, Canadian English or Argentine Spanish, respectively, in every recording. The aim of this note is to document relevant aspects and specifications of the CoAmDa, viz. the audio data, the sociodemographic metadata of the speakers, the digitization process of analog data, the transcription procedures, the format and tagging of the speech files and the internal validation procedures. In so doing, we wish to share our experience and best practices for achieving a high-quality spoken language resource with the interested public, in particular other researchers working on and with multilingual speech corpora.

Updated: 2019-08-23