


From 0 to 10 million annotated words: part-of-speech tagging for Middle High German
Language Resources and Evaluation (IF 1.7), Pub Date: 2019-04-08, DOI: 10.1007/s10579-019-09462-8
Sarah Schulz, Nora Ketschik

By building a part-of-speech (POS) tagger for Middle High German (MHG), we investigate strategies for dealing with a low-resource, diverse and non-standard language in natural language processing. We highlight aspects such as the quantity of data needed for training and the influence of data quality on tagger performance. Since the lack of annotated resources poses a problem for training a tagger, we show how existing resources can be adapted to serve as additional training data. The resulting POS model achieves a tagging accuracy of about 91% on a diverse test set representing the different genres, time periods and varieties of MHG. To verify its general applicability, we evaluate performance on different genres, authors and varieties of MHG separately. We also explore self-learning techniques, which have the advantage that unannotated data can be used to improve tagging performance on specific subcorpora.
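The self-learning idea mentioned in the abstract can be sketched roughly as follows. This is not the authors' tagger or pipeline: the word-plus-suffix frequency model, the tag names, the confidence threshold and the toy sentences are all assumptions made purely for illustration. The sketch trains on annotated sentences, tags unannotated ones, adds the confidently tagged sentences to the training data, and retrains.

```python
from collections import Counter, defaultdict

SUFFIX_LEN = 2  # assumption: a 2-character suffix backs off for unseen word forms

def train(sentences):
    """Count tags per word form and per word suffix from (word, tag) sentences."""
    words, suffixes = defaultdict(Counter), defaultdict(Counter)
    for sent in sentences:
        for word, tag in sent:
            w = word.lower()
            words[w][tag] += 1
            suffixes[w[-SUFFIX_LEN:]][tag] += 1
    return words, suffixes

def tag_word(model, word):
    """Return (tag, confidence); confidence is the tag's relative frequency."""
    words, suffixes = model
    w = word.lower()
    counts = words.get(w) or suffixes.get(w[-SUFFIX_LEN:])
    if not counts:
        return "UNK", 0.0  # neither the word nor its suffix has been seen
    tag, n = counts.most_common(1)[0]
    return tag, n / sum(counts.values())

def accuracy(model, gold):
    """Token-level tagging accuracy on gold-annotated sentences."""
    pairs = [(tag_word(model, w)[0], t) for sent in gold for w, t in sent]
    return sum(p == t for p, t in pairs) / len(pairs) if pairs else 0.0

def self_train(labeled, unlabeled, threshold=0.9, rounds=2):
    """Add automatically tagged sentences whose weakest token confidence
    reaches the threshold, then retrain; repeat for a few rounds."""
    train_set, pool = list(labeled), list(unlabeled)
    model = train(train_set)
    for _ in range(rounds):
        remaining = []
        for sent in pool:
            tagged = [(w, *tag_word(model, w)) for w in sent]
            if tagged and min(c for _, _, c in tagged) >= threshold:
                train_set.append([(w, t) for w, t, _ in tagged])
            else:
                remaining.append(sent)
        if len(remaining) == len(pool):
            break  # nothing new was added in this round
        pool = remaining
        model = train(train_set)
    return model

# Toy data for illustration only (hypothetical tags and spellings).
labeled = [
    [("der", "DET"), ("künec", "NOUN"), ("sprach", "VERB")],
    [("diu", "DET"), ("vrouwe", "NOUN"), ("sprach", "VERB")],
]
unlabeled = [["der", "kunec", "sprach"]]

base = train(labeled)
model = self_train(labeled, unlabeled)
print("kunec" in base[0], "kunec" in model[0])
print(accuracy(model, labeled))
```

In this toy setting the loop only adds the unseen spelling to the lexicon and reinforces predictions the model already makes, so no accuracy gain should be read into it. A tagger with contextual features and a held-out test set per subcorpus would be needed to study effects like those the abstract describes.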


Updated: 2019-04-08
