In no uncertain terms: a dataset for monolingual and multilingual automatic term extraction from comparable corpora
Language Resources and Evaluation (IF 2.7), Pub Date: 2019-03-26, DOI: 10.1007/s10579-019-09453-9
Ayla Rigouts Terryn, Véronique Hoste, Els Lefever

Automatic term extraction is a productive field of research within natural language processing, but it still faces significant obstacles regarding datasets and evaluation, which require manual term annotation. This is an arduous task, made even more difficult by the lack of a clear distinction between terms and general language, which results in low inter-annotator agreement. There is a large need for well-documented, manually validated datasets, especially in the rising field of multilingual term extraction from comparable corpora, which presents a unique new set of challenges. In this paper, a new approach is presented for both monolingual and multilingual term annotation in comparable corpora. The detailed guidelines with different term labels, the domain- and language-independent methodology and the large volumes annotated in three different languages and four different domains make this a rich resource. The resulting datasets are not just suited for evaluation purposes but can also serve as a general source of information about terms and even as training data for supervised methods. Moreover, the gold standard for multilingual term extraction from comparable corpora contains information about term variants and translation equivalents, which allows an in-depth, nuanced evaluation.

Updated: 2019-03-26