The Trade-off between Quantity and Quality. Comparing a Large Crawled Corpus and a Small Focused Corpus for Medical Terminology Extraction
Across Languages and Cultures (IF 1.0), Pub Date: 2019-12-01, DOI: 10.1556/084.2019.20.2.3
Veronique Hoste, Klaar Vanopstal, Ayla Rigouts Terryn, Els Lefever

We investigate the cost-effectiveness of special-purpose crawled corpora versus more focused corpora for automatic terminology extraction (ATE). We focus on medical terminology related to heart failure in two languages: English, for which more web and specialized resources are available, and the less-resourced Dutch. We show that, although term density is higher in the dedicated corpora for both languages, the potential for term extraction is higher in the crawled corpora. Furthermore, in a set of experiments in which we evaluate both types of corpora while keeping size constant, we observe that the "noisy" crawled corpus covers more Gold Standard (GS) terms than a dedicated corpus of the same size.
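
The two quantities contrasted in the abstract, term density and Gold Standard coverage, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names `tokenize` and `term_stats`, the naive tokenizer, exact n-gram matching, and the definitions (density as gold-term occurrences per token, coverage as the share of distinct gold terms found) are not taken from the paper, and the two example strings are invented toy text, not the authors' corpora.

```python
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters/digits (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_stats(tokens, gold_terms):
    """Return (density, coverage) for a token list against a gold-standard term list.

    density  = number of gold-term occurrences / number of tokens
    coverage = number of distinct gold terms found / number of gold terms
    Both definitions are illustrative assumptions, not the paper's exact metrics.
    """
    gold = {tuple(tokenize(term)) for term in gold_terms}
    gold.discard(())  # ignore terms that tokenize to nothing
    lengths = {len(term) for term in gold}
    found = set()
    hits = 0
    for i in range(len(tokens)):
        for n in lengths:
            gram = tuple(tokens[i:i + n])
            if gram in gold:  # exact match of a (possibly multi-word) gold term
                found.add(gram)
                hits += 1
    density = hits / len(tokens) if tokens else 0.0
    coverage = len(found) / len(gold) if gold else 0.0
    return density, coverage


if __name__ == "__main__":
    # Invented toy data, for illustration only.
    gold = ["heart failure", "ejection fraction", "diuretic"]
    focused = tokenize("Heart failure with reduced ejection fraction. Heart failure is common.")
    crawled = tokenize(
        "Click here to read more about heart failure, a diuretic, "
        "and ejection fraction on our site today."
    )
    for name, toks in (("focused", focused), ("crawled", crawled)):
        density, coverage = term_stats(toks, gold)
        print(f"{name}: density={density:.2f} coverage={coverage:.2f}")
```

To compare corpora at constant size, one simple option (again an assumption, not the authors' sampling procedure) is to slice both token lists to the length of the shorter one before calling `term_stats`.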

Last updated: 2019-12-01