Efficiently Processing and Storing Library Linked Data using Apache Spark and Parquet
Information Technology and Libraries (IF 1.5), Pub Date: 2018-09-26, DOI: 10.6017/ital.v37i3.10177
Kumar Sharma, Ujjal Marjit, Utpal Biswas

Resource Description Framework (RDF) is a commonly used data model in the Semantic Web environment. Libraries and various other communities have been using the RDF data model to store valuable data after extracting it from traditional storage systems. However, because of the large volume of this data, processing and storing it is becoming a nightmare for traditional data-management tools. This challenge demands a scalable, distributed system that can manage data in parallel. In this article, a distributed solution is proposed for efficiently processing and storing the large volume of library linked data held in traditional storage systems. Apache Spark is used for parallel processing of large data sets, and a column-oriented schema is proposed for storing RDF data. The storage system is built on top of the Hadoop Distributed File System (HDFS) and uses the Apache Parquet format to store data in a compressed form. The experimental evaluation showed that storage requirements were reduced significantly compared with Jena TDB, Sesame, RDF/XML, and N-Triples file formats. SPARQL queries are processed using Spark SQL to query the compressed data. The experimental evaluation showed good query response times, which decrease significantly as the number of worker nodes increases.
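As a rough illustration of the pipeline the abstract describes, the minimal Scala sketch below reads an N-Triples file, writes the triples to compressed Parquet on HDFS, and answers a simple triple-pattern query through Spark SQL. This is not the authors' implementation: the flat subject/predicate/object layout, the HDFS paths, and the dc:title predicate are illustrative assumptions, and the line-splitting parser only handles simple triples (no blank nodes or escaped literals).

    import org.apache.spark.sql.SparkSession

    object RdfToParquet {
      // "object" is a reserved word in Scala, so the third column is named "obj".
      case class Triple(subject: String, predicate: String, obj: String)

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("LibraryLinkedData")
          .getOrCreate()
        import spark.implicits._

        // Parse an N-Triples file into (subject, predicate, obj) rows.
        // Hypothetical input path; a real parser would also handle blank
        // nodes, typed literals, and escape sequences.
        val triples = spark.read.textFile("hdfs:///data/library.nt")
          .filter(line => line.trim.nonEmpty && !line.startsWith("#"))
          .map { line =>
            val parts = line.trim.stripSuffix(".").trim.split("\\s+", 3)
            Triple(parts(0), parts(1), parts(2))
          }

        // Store the triples as compressed, column-oriented Parquet on HDFS.
        triples.write.mode("overwrite").parquet("hdfs:///data/library.parquet")

        // Query the compressed data with Spark SQL, analogous to the SPARQL
        // triple pattern: ?s dc:title ?o
        spark.read.parquet("hdfs:///data/library.parquet")
          .createOrReplaceTempView("triples")
        spark.sql(
          """SELECT subject, obj
            |FROM triples
            |WHERE predicate = '<http://purl.org/dc/elements/1.1/title>'""".stripMargin)
          .show(10, truncate = false)

        spark.stop()
      }
    }

Storing the three triple components as separate Parquet columns means a query that filters on the predicate only needs to scan that one column, which is where a column-oriented layout saves both storage and I/O compared with row-oriented formats such as N-Triples or RDF/XML.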
