Two approaches to compilation of bilingual multi-word terminology lists from lexical resources
Natural Language Engineering (IF 2.3), Pub Date: 2020-01-28, DOI: 10.1017/s1351324919000615
Branislava Šandrih, Cvetana Krstev, Ranka Stanković

In this paper, we present two approaches to bilingual terminology extraction, together with the implemented system. Both rely on an aligned bilingual domain corpus, a terminology extractor for the target language, and a tool for chunk alignment. The two approaches differ in how terminology for the source language is obtained: the first relies on an existing domain terminology lexicon, while the second uses a term extraction tool. For both approaches, four experiments were performed in which two parameters were varied. In the experiments presented in this paper, the source language was English, the target language was Serbian, and the selected domain was Library and Information Science, for which both an aligned corpus and a bilingual terminological dictionary exist. For term extraction, we used the FlexiTerm tool for the source language and a shallow parser for the target language, while for word alignment we used GIZA++. The evaluation results show that for the first approach the F1 score varies from 29.43% to 51.15%, while for the second it varies from 61.03% to 71.03%. On the basis of the evaluation results, we developed a binary classifier that decides whether a candidate pair, composed of aligned source and target terms, is valid. We trained and evaluated different classifiers on a list of manually labeled candidate pairs produced by our extraction system. The best results in a fivefold cross-validation setting were achieved with the Radial Basis Function Support Vector Machine classifier, giving an F1 score of 82.09% and an accuracy of 78.49%.
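
To make the reported F1 figures concrete, the following is a minimal sketch of how extracted term pairs could be scored against a reference list. It is not the authors' evaluation code: the function names, the normalization step, and the toy English-Serbian pairs are all assumptions made for illustration, and the toy pairs are not taken from the paper's dictionary or corpus.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]  # (source term, target term)


def normalize(pair: Pair) -> Pair:
    """Lowercase both terms and collapse whitespace (an assumed normalization)."""
    src, tgt = pair
    return (" ".join(src.lower().split()), " ".join(tgt.lower().split()))


def precision_recall_f1(candidates: Iterable[Pair],
                        gold: Iterable[Pair]) -> Tuple[float, float, float]:
    """Score candidate pairs against a reference list using exact matching."""
    cand: Set[Pair] = {normalize(p) for p in candidates}
    ref: Set[Pair] = {normalize(p) for p in gold}
    true_positives = len(cand & ref)
    precision = true_positives / len(cand) if cand else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical reference list (invented for this sketch).
gold = [
    ("information retrieval", "pronalaženje informacija"),
    ("library catalogue", "bibliotečki katalog"),
    ("digital library", "digitalna biblioteka"),
    ("subject heading", "predmetna odrednica"),
]

# Hypothetical system output: three correct pairs and one wrong alignment.
candidates = [
    ("Information retrieval", "pronalaženje informacija"),
    ("library catalogue", "bibliotečki katalog"),
    ("digital library", "digitalna biblioteka"),
    ("subject heading", "digitalna biblioteka"),
]

precision, recall, f1 = precision_recall_f1(candidates, gold)
```

Exact matching after normalization is the simplest possible criterion; for a highly inflected target language such as Serbian, an evaluation would more plausibly compare lemmatized forms, which this sketch does not attempt.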

Updated: 2020-01-28