Entity deduplication in big data graphs for scholarly communication
Data Technologies and Applications (IF 1.6), Pub Date: 2020-06-26, DOI: 10.1108/dta-09-2019-0163
Paolo Manghi, Claudio Atzori, Michele De Bonis, Alessia Bardi

Purpose

Several online services offer access to information from “big research graphs” (e.g. Google Scholar, OpenAIRE, Microsoft Academic Graph), which link scholarly/scientific communication entities such as publications, authors, datasets, organizations, projects and funders. Depending on the target users, access ranges from searching and browsing content to consuming statistics for monitoring and providing feedback. Such graphs are populated over time by aggregating multiple sources and therefore suffer from major entity-duplication problems. Although graph deduplication is a known and current problem, existing solutions are dedicated to specific scenarios, operate on flat collections or address local topology-driven challenges, and therefore cannot be reused in other contexts.

Design/methodology/approach

This work presents GDup, an integrated, scalable, general-purpose system that can be customized to address deduplication over arbitrarily large information graphs. The paper presents its high-level architecture and its implementation as a service used within the OpenAIRE infrastructure system, and reports figures from real-case experiments.

Findings

GDup provides the functionality required to deliver a full-fledged entity deduplication workflow over a generic input graph. Out of the box, the system offers Ground Truth management, acquisition of feedback from data curators, and algorithms for identifying and merging duplicates, yielding a disambiguated output graph.
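
To make the "identify and merge duplicates" step concrete, the following is a minimal sketch of a generic deduplication pass. It is not GDup's actual implementation, which the abstract does not describe in detail. The record layout, the `blocking_key` rule, the title-only comparison and the 0.9 threshold are all assumptions chosen for illustration. The sketch groups records into blocks to avoid comparing every pair, compares titles within each block, and merges matching records into groups with a union-find structure.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking rule: first three characters of the
    # lowercased title with spaces removed. Only records that share
    # a key are compared, which keeps the number of pairs small.
    return record["title"].lower().replace(" ", "")[:3]


def similar(a, b, threshold=0.9):
    # Assumed similarity test: character-level ratio of the two titles.
    ratio = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    return ratio >= threshold


def find(parent, x):
    # Union-find lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def deduplicate(records):
    # 1. Blocking: bucket records by key.
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)

    # 2. Pairwise matching inside each block, merging matches.
    parent = {record["id"]: record["id"] for record in records}
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if similar(a, b):
                root_a, root_b = find(parent, a["id"]), find(parent, b["id"])
                if root_a != root_b:
                    # Keep the smaller identifier as the representative.
                    parent[max(root_a, root_b)] = min(root_a, root_b)

    # 3. Collect groups of equivalent records.
    groups = defaultdict(list)
    for record in records:
        groups[find(parent, record["id"])].append(record["id"])
    return sorted(sorted(group) for group in groups.values())


records = [
    {"id": "p1", "title": "Entity deduplication in big data graphs"},
    {"id": "p2", "title": "Entity Deduplication in Big Data Graphs."},
    {"id": "p3", "title": "A survey of graph databases"},
]
print(deduplicate(records))  # [['p1', 'p2'], ['p3']]
```

In a real graph setting each group would then be collapsed into a single representative node that inherits the relationships of its members. That merge step is omitted here because it depends on the graph model.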

Originality/value

To our knowledge, GDup is the only system in the literature that offers an integrated, general-purpose solution for the deduplication of graphs while targeting big data scalability issues. GDup is today one of the key modules of the OpenAIRE infrastructure production system, which monitors Open Science trends on behalf of the European Commission, national funders and institutions.
