A Link Prediction Approach for Accurately Mapping a Large-scale Arabic Lexical Resource to English WordNet
ACM Transactions on Asian and Low-Resource Language Information Processing (IF 2), Pub Date: 2020-10-13, DOI: 10.1145/3404854
Gilbert Badaro 1 , Hazem Hajj 1 , Nizar Habash 2

The success of Natural Language Processing (NLP) models, like that of all advanced machine learning models, relies heavily on large-scale lexical resources. For English, English WordNet (EWN) is a leading example of a large-scale resource that has enabled advances in Natural Language Understanding (NLU) tasks such as word sense disambiguation, question answering, sentiment analysis, and emotion recognition. EWN consists of sets of cognitive synonyms called synsets, each of which expresses a distinct concept and which are interlinked by conceptual-semantic and lexical relations. Other languages, however, still lack large-scale, rich lexical resources comparable to EWN. In this article, we focus on enabling the development of such resources for Arabic. Although there have been efforts to develop an Arabic WordNet (AWN), the current version of AWN is limited in size and lacks transliteration standards, which are important for compatibility with Arabic NLP tools. Previous efforts to extend AWN produced a lexicon called ArSenL, which overcame the size and transliteration-standard limitations but was limited in accuracy: its heuristic approach considered only surface matching between the English definitions from the Standard Arabic Morphological Analyzer (SAMA) and EWN synset terms, which resulted in inaccurate mapping of Arabic lemmas to EWN's synsets. Furthermore, other expansion methods have seen limited exploration because of the expensive manual validation they require. To achieve large scale, high accuracy, and standard representations at the same time, we formulate the mapping problem as a link prediction problem between a large-scale Arabic lexicon and EWN, where a word in one lexicon is linked to a word in the other if the two words are semantically related.
We use a semi-supervised approach to create a training dataset by finding the terms common to the large-scale Arabic resource and AWN. This set of data is implicitly linked to EWN and can be used for training and evaluating prediction models. We propose a two-step Boosting method, in which the first step aims at linking English translations of SAMA's terms to EWN's synsets and the second step uses surface similarity between SAMA's glosses and EWN's synsets. The method results in a new large-scale Arabic lexicon that we call ArSenL 2.0, as a sequel to the previously developed sentiment lexicon ArSenL. A comprehensive study covering both intrinsic and extrinsic evaluations shows the superiority of the method over several baseline and state-of-the-art link prediction methods. Compared to the previously developed ArSenL, ArSenL 2.0 included a larger set of sentimentally charged adjectives and verbs. It also showed higher linking accuracy on the ground truth data. For extrinsic evaluation, ArSenL 2.0 was used for sentiment analysis, where it again showed higher accuracy than the previous ArSenL.
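As a rough illustration of the surface-similarity idea mentioned in the abstract, the sketch below ranks candidate synsets by token overlap between an English gloss and each synset's terms. The abstract does not name the similarity measure, so the Jaccard overlap used here is an assumption; the function names, synset identifiers, and toy data are also hypothetical and are not taken from SAMA, EWN, or ArSenL 2.0.

```python
import re


def tokens(text):
    """Lowercase a string and return its alphabetic tokens as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap of two token sets; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_synsets(gloss, synsets):
    """Rank candidate synsets by surface overlap with an English gloss.

    `synsets` maps a (hypothetical) synset id to its list of terms.
    Returns (synset_id, score) pairs, highest score first.
    """
    gloss_tokens = tokens(gloss)
    scored = [
        (synset_id, jaccard(gloss_tokens, tokens(" ".join(terms))))
        for synset_id, terms in synsets.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data, invented for illustration only.
toy_synsets = {
    "syn_book_n": ["book", "volume"],
    "syn_write_v": ["write", "compose", "pen"],
    "syn_happy_a": ["happy", "glad"],
}

ranking = rank_synsets("book; volume", toy_synsets)
```

Surface overlap alone is the kind of heuristic the abstract identifies as a source of inaccurate mappings in the original ArSenL, which is why the proposed method uses such a score as one step of a learned link predictor and not as the final decision.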

Updated: 2020-10-13