Representing variation in a spoken corpus of an endangered dialect: the case of Torlak



Language Resources and Evaluation (IF 1.7), Pub Date: 2021-01-09, DOI: 10.1007/s10579-020-09522-4
Teodora Vuković

This paper presents a spoken corpus of the endangered Torlak dialect from the Timok area of Southeast Serbia. Under the influence of standard Serbian (SSr), the dialect exhibits a great deal of variation in the use of non-standard features. To account for this variation, a specific methodology was selected for collection, sampling, transcription and annotation. Between 2015 and 2017, semi-structured field interviews were conducted to elicit spontaneous speech in the form of long narratives about traditional culture and history. The corpus comprises 500,697 tokens of semi-orthographic transcripts representing 80 hours of recordings from locations evenly distributed across the Timok area of the Torlak dialect zone, thus enabling spatial contrastive analysis. Most speakers in the corpus are older people whose language represents the highly non-standard variety. To allow analysis of language change under the influence of SSr, the corpus also includes a number of younger people whose speech is closer to SSr. Tools for automatic PoS annotation and lemmatization, which were previously lacking, were developed on the basis of existing resources for SSr. For tagger training, a dialect sample of 27,000 manually verified tokens was merged with an existing training set for SSr.
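
The merging step mentioned at the end of the abstract can be illustrated with a minimal sketch. The abstract does not say how the two sets were combined, so everything here is an assumption made for illustration: the in-memory sentence format, the function names `merge_training_sets` and `token_count`, the reproducible shuffle, and the placeholder tokens and tags, none of which come from the corpus or its tagset.

```python
import random
from typing import List, Tuple

# A tagged sentence is a list of (token, PoS tag) pairs.
# This representation is an assumption, not the corpus's actual file format.
Sentence = List[Tuple[str, str]]


def merge_training_sets(dialect: List[Sentence],
                        standard: List[Sentence],
                        seed: int = 0) -> List[Sentence]:
    """Concatenate two lists of tagged sentences and shuffle them reproducibly.

    Hypothetical helper: the shuffle only keeps the small dialect sample from
    sitting in one contiguous block of the merged training data.
    """
    merged = list(dialect) + list(standard)
    random.Random(seed).shuffle(merged)
    return merged


def token_count(sentences: List[Sentence]) -> int:
    """Total number of tokens across all sentences."""
    return sum(len(sentence) for sentence in sentences)


# Placeholder data only; these are not real Torlak or SSr tokens or tags.
dialect_sample: List[Sentence] = [[("d1", "NOUN"), ("d2", "VERB")]]
standard_set: List[Sentence] = [[("s1", "PRON"), ("s2", "VERB"), ("s3", "NOUN")]]

training_set = merge_training_sets(dialect_sample, standard_set)
print(len(training_set), token_count(training_set))
```

In practice the merged list would be written out in whatever format the chosen tagger expects; that step is omitted because the abstract does not name the tagger.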


Updated: 2021-01-10
