Network metrics for assessing the quality of entity resolution between multiple datasets, Semantic Web



Semantic Web (IF 3.0), Pub Date: 2020-10-23, DOI: 10.3233/sw-200410
Al Idrissou (1, 2), Frank van Harmelen (1), Peter van den Besselaar (2)


Matching entities between datasets is a crucial step for combining multiple datasets on the semantic web. A rich literature exists on different approaches to this entity resolution problem. However, much less work has been done on how to assess the quality of such entity links once they have been generated. Evaluation methods for link quality are typically limited to either comparison with a ground truth dataset (which is often not available), manual work (which is cumbersome and prone to error), or crowd sourcing (which is not always feasible, especially if expert knowledge is required). Furthermore, the problem of link evaluation is greatly exacerbated for links between more than two datasets, because the number of possible links grows rapidly with the number of datasets. In this paper, we propose a method to estimate the quality of entity links between multiple datasets. We exploit the fact that the links between entities from multiple datasets form a network, and we show how simple metrics on this network can reliably predict their quality. We verify our results in a large experimental study using six datasets from the domain of science, technology and innovation studies, for which we created a gold standard. This gold standard, available online, is an additional contribution of this paper. In addition, we evaluate our metric on a recently published gold standard to confirm our findings.
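To make the idea of "simple metrics on this network" concrete, here is a minimal sketch that treats entity links as edges of an undirected graph, groups them into connected clusters, and computes two standard graph metrics per cluster. The choice of density and diameter, the function names, and the sample identifiers are illustrative assumptions; the abstract does not say which metrics the authors use, so this should not be read as their method.

```python
from collections import defaultdict, deque

def build_graph(links):
    """Build an undirected adjacency map from (entity_a, entity_b) link pairs."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def clusters(graph):
    """Return the connected components of the link network as sets of nodes."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        result.append(component)
    return result

def density(graph, component):
    """Fraction of possible node pairs in the component that are linked."""
    n = len(component)
    if n < 2:
        return 0.0
    edges = sum(len(graph[node] & component) for node in component) // 2
    return edges / (n * (n - 1) / 2)

def diameter(graph, component):
    """Longest shortest path (in hops) between any two nodes of the component."""
    longest = 0
    for source in component:
        dist, queue = {source: 0}, deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        longest = max(longest, max(dist.values()))
    return longest

# Hypothetical links between entities of three made-up datasets (A, B, C).
links = [
    ("A:1", "B:1"), ("B:1", "C:1"), ("A:1", "C:1"),  # fully connected triangle
    ("A:2", "B:2"), ("B:2", "C:2"), ("C:2", "A:3"),  # chain of four entities
]
graph = build_graph(links)
report = {
    frozenset(c): (round(density(graph, c), 2), diameter(graph, c))
    for c in clusters(graph)
}
```

In this toy input, the triangle is a fully linked cluster while the chain is sparse and stretched out. One plausible reading, offered here only as an assumption, is that links in dense, compact clusters corroborate each other more than links in long chains.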


Updated: 2020-10-30
