Arabic real time entity resolution using inverted indexing
Language Resources and Evaluation (IF 1.7), Pub Date: 2020-10-07, DOI: 10.1007/s10579-020-09504-6
Marwah Alian, Ghazi Al-Naymat, Banda Ramadan

Arabic datasets that contain two or more records for the same real-world entity (e.g. a person or an object) leave institutions with low data quality and degraded performance when no mechanism is in place to detect these duplicates. The operation that identifies records referring to the same real-world entity is called Entity Resolution (ER). It serves as a tool for linking records across databases and for matching query records against existing databases in real time. Indexing is a major step in the ER process that aims to reduce the search space. Several indexing techniques are available for ER on English databases, but they have not been validated on other languages such as Arabic. The Dynamic Similarity Aware Inverted Index (DySimII) is one indexing technique used with dynamic databases to match query records in real time, and it has been shown to work well with English. In this paper, we propose a framework, Arabic Real Time Entity Resolution (ARTER), that uses DySimII with Arabic databases to perform real-time ER. We also examine different string similarity functions for comparing records in the matching process, in order to evaluate which function is most suitable for comparing Arabic strings. Our experimental evaluation uses a real-world Arabic database, with two stemmers and three similarity functions, to observe their effect on DySimII with an Arabic dataset. The results show that matching accuracy improves with the Asem stemmer as the number of corrupted attributes increases. Testing the three similarity functions shows that the Winkler similarity function provides better matching accuracy, while N-gram provides better results when used with the Asem stemmer.
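To make the role of indexing more concrete, the following is a minimal sketch of how an inverted index can shrink the search space before string similarity is computed. It is not the DySimII data structure or the ARTER framework from the paper. The class `ToyInvertedIndex`, the first-character blocking key, the bigram Dice similarity, and the 0.5 threshold are all assumptions chosen for illustration, and no stemming is applied.

```python
from collections import defaultdict


def bigrams(text):
    """Return the set of character bigrams of a string (any Unicode text)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def dice_similarity(a, b):
    """Dice coefficient over character bigrams, a value in [0, 1]."""
    if a == b:
        return 1.0
    grams_a, grams_b = bigrams(a), bigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


class ToyInvertedIndex:
    """Hypothetical toy index, not the DySimII structure from the paper."""

    def __init__(self):
        self.record_ids = defaultdict(set)  # attribute value -> record ids
        self.blocks = defaultdict(set)      # blocking key -> distinct values

    @staticmethod
    def block_key(value):
        # Assumption: block on the first character; real systems typically
        # use a phonetic or otherwise more robust encoding.
        return value[:1]

    def add(self, record_id, values):
        """Index a record so later queries can be matched against it."""
        for value in values:
            self.record_ids[value].add(record_id)
            self.blocks[self.block_key(value)].add(value)

    def query(self, values, threshold=0.5):
        """Rank stored records by summed similarity to the query values."""
        scores = defaultdict(float)
        for value in values:
            # Only values sharing the query value's block are compared.
            for candidate in self.blocks.get(self.block_key(value), ()):
                sim = dice_similarity(value, candidate)
                if sim >= threshold:
                    for record_id in self.record_ids[candidate]:
                        scores[record_id] += sim
        # Highest score first; ties broken by record id.
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = ToyInvertedIndex()
index.add(1, ["محمد", "عمان"])
index.add(2, ["محمود", "اربد"])
index.add(3, ["احمد", "عمان"])
print(index.query(["محمد", "عمان"]))
```

Because each distinct value is stored once and compared only within its block, a query touches a small set of candidate strings instead of every stored record. The trade-off in this sketch is that a first-character key misses variants that differ in their first letter, which is one reason the choice of encoding, stemmer, and similarity function matters for Arabic text.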




Updated: 2020-10-07