IPDS: A semantic mediator‐based system using Spark for the integration of heterogeneous proteomics data sources



Concurrency and Computation: Practice and Experience (IF 1.5), Pub Date: 2020-05-23, DOI: 10.1002/cpe.5814
Chaimaa Messaoudi, Rachida Fissoune, Hassan Badir


With the constant rise of data volumes in many disciplines, many new Big Data management systems have emerged to provide scalable tools for efficient data integration, processing, and analysis. In this article, we provide an overview of biomedical data integration systems, focusing on ontology‐based semantic systems and on systems built on Big Data technologies such as Apache Spark. We also propose a new semantic data integration system, called Integrated Proteomics Data System (IPDS), which uses a mediator approach. IPDS provides users with a unified interface for query processing and data exploration. The system takes advantage of the Apache Spark framework to perform the query transformation and execution needed to query the integrated data sources. We develop a domain ontology that allows users to formulate their queries using terms defined in the ontology. IPDS serves as a case study in semantic proteomics data integration, linking four data sources: UniProt (protein annotation), STRING (protein‐protein interaction), PDB (protein structure), and PubMed (biomedical citations).
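To make the mediator idea in the abstract more concrete, the following is a minimal sketch in plain Python, not the authors' implementation and not Spark code. It assumes that each ontology term is mapped to a wrapper over one source, and that a query phrased in ontology terms is rewritten into per-source lookups whose results are merged. The term names (`hasAnnotation`, `interactsWith`, `hasStructure`, `citedIn`), the `mediate` function, and all records and identifiers are hypothetical placeholders rather than real database entries.

```python
from typing import Callable, Dict, List

# Hypothetical toy records standing in for the four sources. The identifiers
# and values are made-up placeholders, not real database entries.
UNIPROT = {"X0001": {"protein_name": "toy kinase"}}   # protein annotation
STRING = {"X0001": ["X0002"]}                         # interaction partners
PDB = {"X0001": ["0XXX"]}                             # structure identifiers
PUBMED = {"X0001": ["00000001"]}                      # citation identifiers

# Assumed mapping from ontology terms to the source wrapper that answers them.
MAPPINGS: Dict[str, Callable[[str], object]] = {
    "hasAnnotation": lambda acc: UNIPROT.get(acc, {}),
    "interactsWith": lambda acc: STRING.get(acc, []),
    "hasStructure": lambda acc: PDB.get(acc, []),
    "citedIn": lambda acc: PUBMED.get(acc, []),
}


def mediate(accession: str, terms: List[str]) -> Dict[str, object]:
    """Rewrite a query over ontology terms into per-source lookups and merge."""
    unknown = [t for t in terms if t not in MAPPINGS]
    if unknown:
        # The user may only query with terms defined in the ontology.
        raise KeyError(f"terms not in ontology: {unknown}")
    # One sub-query per term; the merged dict is the unified answer.
    return {term: MAPPINGS[term](accession) for term in terms}
```

A call such as `mediate("X0001", ["hasAnnotation", "interactsWith"])` returns a single dictionary that combines the annotation and interaction records, which is the role the unified interface plays for the user. In a Spark-based system the per-source lookups would presumably be distributed queries rather than dictionary reads; that part is outside this sketch.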


Updated: 2020-05-23
