Entity deduplication in big data graphs for scholarly communication
Data Technologies and Applications (IF 1.6), Pub Date: 2020-06-26, DOI: 10.1108/dta-09-2019-0163
Paolo Manghi, Claudio Atzori, Michele De Bonis, Alessia Bardi

Purpose

Several online services offer access to information from “big research graphs” (e.g. Google Scholar, OpenAIRE, Microsoft Academic Graph), which correlate scholarly/scientific communication entities such as publications, authors, datasets, organizations, projects and funders. Depending on the target users, access ranges from searching and browsing content to consuming statistics for monitoring and providing feedback. Such graphs are populated over time as aggregations of multiple sources and therefore suffer from major entity-duplication problems. Although graph deduplication is a well-known and current problem, existing solutions are dedicated to specific scenarios, operate on flat collections or address local topology-driven challenges, and therefore cannot be reused in other contexts.

Design/methodology/approach

This work presents GDup, an integrated, scalable, general-purpose system that can be customized to address deduplication over arbitrarily large information graphs. The paper presents its high-level architecture and its implementation as a service used within the OpenAIRE infrastructure system, and reports figures from real-case experiments.

Findings

GDup provides the functionalities required to deliver a fully fledged entity deduplication workflow over a generic input graph. The system offers out-of-the-box Ground Truth management, acquisition of feedback from data curators, and algorithms for identifying and merging duplicates, producing a disambiguated output graph.
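
The abstract does not describe GDup's algorithms in detail. As a rough illustration only of what "identifying and merging duplicates" can involve, the following minimal sketch groups records by a blocking key, compares titles pairwise within each block, and merges matches into groups with a union-find structure. The function names, the three-character blocking key, the use of `difflib.SequenceMatcher` and the 0.9 threshold are all assumptions made for this example, not part of GDup.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(title):
    # Hypothetical blocking key: first three alphanumeric characters, lowercased.
    cleaned = "".join(ch for ch in title.lower() if ch.isalnum())
    return cleaned[:3]


def similar(a, b, threshold=0.9):
    # Assumed similarity test: case-insensitive character-level ratio.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def find(parent, x):
    # Union-find lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def deduplicate(records, threshold=0.9):
    """records maps an id to a title; returns sorted groups of duplicate ids."""
    # Step 1: candidate identification, so only records in the same block are compared.
    blocks = defaultdict(list)
    for rid, title in records.items():
        blocks[blocking_key(title)].append(rid)

    # Step 2: pairwise matching inside each block, linking matches together.
    parent = {rid: rid for rid in records}
    for ids in blocks.values():
        for a, b in combinations(ids, 2):
            if similar(records[a], records[b], threshold):
                parent[find(parent, a)] = find(parent, b)

    # Step 3: merging, where every connected set of matches becomes one group.
    groups = defaultdict(list)
    for rid in records:
        groups[find(parent, rid)].append(rid)
    return sorted(sorted(group) for group in groups.values())


records = {
    "p1": "Entity deduplication in big data graphs",
    "p2": "Entity Deduplication in Big Data Graphs.",
    "p3": "A survey of graph databases",
}
print(deduplicate(records))  # [['p1', 'p2'], ['p3']]
```

Blocking keeps the number of comparisons well below the quadratic all-pairs case, which is the usual reason such a step appears in large-scale deduplication; a production system would need a far more robust key and matching function than the ones assumed here.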

Originality/value

To our knowledge, GDup is the only system in the literature that offers an integrated and general-purpose solution for the deduplication of graphs while targeting big data scalability issues. GDup is today one of the key modules of the OpenAIRE infrastructure production system, which monitors Open Science trends on behalf of the European Commission, national funders and institutions.




Updated: 2020-08-26