Creating a Universal Dependencies Treebank of Spoken Frisian-Dutch Code-switched Data
arXiv - CS - Computation and Language Pub Date : 2021-02-22 , DOI: arxiv-2102.11152
Anouck Braggaar, Rob van der Goot

This paper explores the difficulties of annotating transcribed spoken Dutch-Frisian code-switched utterances with Universal Dependencies. We use data from the FAME! corpus, which consists of transcriptions and audio data. Besides the usual annotation difficulties, this dataset is especially challenging because Frisian is a low-resource language, the data is informal, and it contains code-switching and non-standard sentence segmentation. As a starting point, two annotators annotated 150 random utterances in three stages of 50 utterances each. After each stage, disagreements were discussed and resolved. Agreement between the annotators increased by 7.8 UAS and 10.5 LAS points between the first and third rounds. The paper focuses on the issues that arise when annotating a transcribed speech corpus and proposes several solutions to resolve them.
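
UAS (unlabeled attachment score) and LAS (labeled attachment score) are the standard dependency metrics behind the agreement figures in the abstract. The following is a minimal sketch of how such scores could be computed between two annotators' analyses of the same utterance. The `(head, label)` token representation, the function name `attachment_scores`, and the toy data are assumptions made for illustration; they are not taken from the paper or from the FAME! corpus.

```python
from typing import List, Tuple

# Hypothetical representation: one (head index, dependency label) pair per token,
# with head index 0 standing for the root.
Token = Tuple[int, str]


def attachment_scores(first: List[Token], second: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages for two annotations of the same tokens."""
    if len(first) != len(second):
        raise ValueError("both annotations must cover the same tokens")
    if not first:
        raise ValueError("cannot score an empty annotation")

    # UAS counts tokens whose head matches; LAS also requires the label to match.
    same_head = sum(1 for a, b in zip(first, second) if a[0] == b[0])
    same_head_and_label = sum(1 for a, b in zip(first, second) if a == b)

    total = len(first)
    return 100.0 * same_head / total, 100.0 * same_head_and_label / total


# Made-up four-token utterance, not real corpus data.
annotator_1 = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
annotator_2 = [(2, "nsubj"), (0, "root"), (4, "amod"), (3, "obj")]

uas, las = attachment_scores(annotator_1, annotator_2)
print(f"UAS = {uas:.1f}, LAS = {las:.1f}")
```

In this toy case the annotators agree on three of four heads and on two of four head-and-label pairs, so LAS can never exceed UAS.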

Updated: 2021-02-23