Triple storage for random-access versioned querying of RDF archives
Introduction
In the area of data analysis, there is an ongoing need for maintaining the history of datasets. Such archives can be used for looking up data at certain points in time, for requesting evolving changes, or for checking the temporal validity of these data [1]. With the continuously increasing number of Linked Open Datasets [2], archiving has become an issue for RDF [3] data as well. While the RDF data model itself is atemporal, Linked Datasets typically change over time [4] on dataset, schema, and/or instance level [5]. Such changes can include additions, modifications, or deletions of complete datasets, ontologies, and separate facts. While some evolving datasets, such as DBpedia [6], are published as separate dumps per version, more direct and efficient access to prior versions is desired.
Consequently, RDF archiving systems emerged that, for instance, support query engines that use the standard SPARQL query language [7]. In 2015, however, a survey on archiving Linked Open Data [1] illustrated the need for improved versioning capabilities, as current approaches have scalability issues at Web-scale. They either perform well for versioned query evaluation, at the cost of large storage space requirements, or require less storage space, at the cost of slower query evaluation. Furthermore, no existing solution performs well for all versioned query types, namely querying at, between, and for different versions. An efficient RDF archive solution should have a scalable storage model, efficient compression, and indexing methods that enable expressive versioned querying [1].
In this article, we argue that supporting both RDF archiving and SPARQL at once is difficult to scale due to their combined complexity. Instead, we propose an elementary but efficient versioned triple pattern index. Since triple patterns are the basic element of SPARQL, such indexes can serve as an entry point for query engines. Our solution is applicable as: (a) an alternative index with efficient triple-pattern-based access for existing engines, in order to improve the efficiency of more expressive SPARQL queries; and (b) a data source for the Web-friendly Triple Pattern Fragments [8] (TPF) interface, i.e., a Web API that provides access to RDF datasets by triple pattern and partitions the results in pages. We focus on the performance-critical features of stream-based results, query result offsets, and cardinality estimation. Stream-based results allow more memory-efficient processing when query results are plentiful. The capability to efficiently offset (and limit) a large stream reduces processing time if only a subset is needed. Cardinality estimation is essential for efficient query planning [8,9] in many query engines.
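The three performance-critical features named above can be illustrated with a minimal in-memory sketch. The `TriplePatternIndex` class and its method names are hypothetical, chosen for illustration only; they are not part of OSTRICH or the TPF specification:

```python
from itertools import islice

# Minimal in-memory sketch (hypothetical API) of a triple pattern index
# that supports result streaming, offsets, limits, and cardinality counts.
class TriplePatternIndex:
    def __init__(self, triples):
        self.triples = list(triples)

    def _matches(self, triple, pattern):
        # None acts as a variable; any bound term must match exactly.
        return all(p is None or p == t for t, p in zip(triple, pattern))

    def query(self, pattern, offset=0, limit=None):
        """Lazily stream matching triples, skipping `offset` results."""
        stream = (t for t in self.triples if self._matches(t, pattern))
        stop = None if limit is None else offset + limit
        return islice(stream, offset, stop)

    def count(self, pattern):
        """Cardinality estimate; exact here, often approximate in practice."""
        return sum(1 for t in self.triples if self._matches(t, pattern))

idx = TriplePatternIndex([
    (":alice", ":knows", ":bob"),
    (":alice", ":knows", ":carol"),
    (":bob", ":knows", ":carol"),
])
print(list(idx.query((":alice", ":knows", None), offset=1)))
# [(':alice', ':knows', ':carol')]
print(idx.count((None, ":knows", None)))  # 3
```

A query engine can first call `count` to plan join orders, then consume `query` as a lazy stream, which is the access pattern TPF exposes over HTTP.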
Concretely, this work introduces a storage technique with the following contributions:
a scalable versioned and compressed RDF index with offset support and result streaming;
efficient query algorithms to evaluate triple pattern queries and perform cardinality estimation at, between, and for different versions, with optional offsets;
an open-source implementation of this approach called OSTRICH;
an extensive evaluation of OSTRICH compared to other approaches using an existing RDF archiving benchmark.
The main novelty of this work is the combination of efficient offset-enabled queries over a new index structure for RDF archives. We do not aim to compete with existing versioned SPARQL engines: full access to the language can instead be offered by engines built on top of our index, or through alternative RDF publication and querying methods such as the HTTP interface-based TPF approach. Optional versioning capabilities are possible for TPF through VTPF [10], or via datetime content negotiation [11] using Memento [12].
This article is structured as follows. In the next section, we discuss related work, followed by our problem statement in Section 3. Next, in Section 4, we introduce the basic concepts of our approach, followed by our storage approach in Section 5, our ingestion algorithms in Section 6, and the accompanying querying algorithms in Section 7. After that, we present and discuss the evaluation of our implementation in Section 8. Finally, we present our conclusions in Section 9.
Related work
In this section, we discuss existing solutions and techniques for indexing and compression in RDF storage, without archiving support. Then, we compare different RDF archiving solutions. Finally, we discuss suitable benchmarks and different query types for RDF archives. This section does not contain an exhaustive list of all relevant solutions and techniques; instead, only those most relevant to this work are mentioned.
Problem statement
As mentioned in Section 1, no RDF archiving solutions exist that allow efficient triple pattern querying at, between, and for different versions, in combination with a scalable storage model and efficient compression. In the context of query engines, streams are typically used to return query results, on which offsets and limits can be applied to reduce processing time if only a subset is needed. Offsets are used to skip a certain number of elements, while limits restrict the number of elements returned.
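The cost difference between the two kinds of offsets can be made concrete with a small Python sketch (hypothetical data layout): a plain result stream must still produce every skipped element, whereas an index that keeps a pattern's results in a contiguous sorted sequence can jump to the offset position directly:

```python
# Sketch contrasting two offset strategies over 10 000 sorted results.
results = sorted("s%04d" % i for i in range(10000))

def stream_offset(gen, offset):
    # O(offset): every skipped element is still generated and discarded.
    for _ in range(offset):
        next(gen)
    return gen

def indexed_offset(sorted_results, offset, limit):
    # O(1) positional jump: results for a pattern stored contiguously,
    # so the offset translates directly into a slice boundary.
    return sorted_results[offset:offset + limit]

print(indexed_offset(results, 9990, 3))  # ['s9990', 's9991', 's9992']
print(next(stream_offset(iter(results), 5)))  # 's0005'
```

Supporting the second strategy for versioned results, without materializing every version, is one of the core difficulties addressed in this work.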
Overview of storage and querying approach
In this section, we lay the groundwork for the following sections. We introduce fundamental concepts that are required in our storage approach and its accompanying querying algorithms, which will be explained in Section 5 (Hybrid multiversion storage approach) and Section 7 (Versioned query algorithms), respectively.
To combine smart use of storage space with efficient processing of VM, DM, and VQ triple pattern queries, we employ a hybrid approach between the individual copies (IC), change-based (CB), and timestamp-based (TB) approaches.
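As a point of reference, the three versioned query types can be sketched over a toy archive. The per-version sets below are an illustrative representation only, not the storage model used in this work:

```python
# Toy archive: one set of triples per version (illustration only).
versions = [
    {("s", "p", "o1")},                      # version 0
    {("s", "p", "o1"), ("s", "p", "o2")},    # version 1
    {("s", "p", "o2")},                      # version 2
]

def vm(match, v):
    """Version Materialization: triple pattern results in one version."""
    return {t for t in versions[v] if match(t)}

def dm(match, v1, v2):
    """Delta Materialization: additions/deletions between two versions."""
    a, b = vm(match, v1), vm(match, v2)
    return {"+": b - a, "-": a - b}

def vq(match):
    """Version Query: results annotated with the versions they occur in."""
    out = {}
    for v, triples in enumerate(versions):
        for t in triples:
            if match(t):
                out.setdefault(t, []).append(v)
    return out

match = lambda t: t[0] == "s"
print(dm(match, 0, 2))  # {'+': {('s','p','o2')}, '-': {('s','p','o1')}}
print(vq(match))
```

The challenge addressed by the hybrid approach is answering all three query types efficiently without storing each version as a full copy, as this naive sketch does.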
Hybrid multiversion storage approach
In this section, we introduce our hybrid IC/CB/TB storage approach for storing multiple versions of an RDF dataset. Fig. 3 shows an overview of the main components. Our approach consists of an initial dataset snapshot, stored in HDT [23], followed by a delta chain (similar to TailR [40]). The delta chain uses multiple compressed B+Trees for a TB-storage strategy (similar to Dydra [39]), applies dictionary encoding to triples, and stores additional metadata to improve lookup times. The remainder of this section discusses each of these components in detail.
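The interplay of a snapshot, a delta chain, and dictionary encoding can be sketched as follows. The `Archive` class is a deliberately simplified illustration under assumed semantics: it applies deltas sequentially and omits the compressed B+Tree indexes and lookup metadata of the actual approach:

```python
# Sketch (hypothetical layout): an initial snapshot plus a chain of
# deltas, with terms dictionary-encoded so each is stored once as an ID.
class Dictionary:
    def __init__(self):
        self.term_to_id, self.id_to_term = {}, []

    def encode(self, term):
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

class Archive:
    def __init__(self, snapshot_triples):
        self.dict = Dictionary()
        enc = lambda t: tuple(map(self.dict.encode, t))
        self.snapshot = {enc(t) for t in snapshot_triples}
        self.deltas = []  # one (additions, deletions) pair per new version

    def append(self, additions, deletions):
        enc = lambda ts: {tuple(map(self.dict.encode, t)) for t in ts}
        self.deltas.append((enc(additions), enc(deletions)))

    def materialize(self, version):
        # Version 0 is the snapshot; later versions apply deltas in order.
        triples = set(self.snapshot)
        for adds, dels in self.deltas[:version]:
            triples = (triples - dels) | adds
        dec = lambda t: tuple(self.dict.id_to_term[i] for i in t)
        return {dec(t) for t in triples}

arc = Archive([("s", "p", "o1")])
arc.append(additions={("s", "p", "o2")}, deletions=set())
print(arc.materialize(1))
```

Note the trade-off this sketch makes visible: materializing version `n` touches every delta up to `n`, which motivates the additional metadata and TB-indexed delta chain of the actual storage approach.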
Changeset ingestion algorithms
In this section, we discuss two ingestion algorithms: a memory-intensive batch algorithm and a memory-efficient streaming algorithm. These algorithms both take a changeset, containing additions and deletions, as input, and append it as a new version to the store. Note that the ingested changesets are regular changesets: they are relative to one another according to Fig. 1. Furthermore, we assume that the ingested changesets are valid changesets: they do not contain impossible triple sequences.
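The validity assumption can be captured by a small check. The exact rules below (no contradictory changes, no re-additions of present triples, no deletions of absent triples) are our reading of what makes a triple sequence impossible, stated here as an assumption:

```python
# Sketch of a changeset validity check (assumed rules): a valid changeset
# never both adds and deletes the same triple, never adds a triple that
# is already present, and never deletes a triple that is absent.
def is_valid_changeset(current_triples, additions, deletions):
    if additions & deletions:
        return False                      # contradictory change
    if additions & current_triples:
        return False                      # re-adding an existing triple
    if not deletions <= current_triples:
        return False                      # deleting an absent triple
    return True

v0 = {("s", "p", "o1")}
print(is_valid_changeset(v0, {("s", "p", "o2")}, set()))  # True
print(is_valid_changeset(v0, set(), {("s", "p", "o2")}))  # False
```

An ingestion pipeline could run such a check before appending a version, so the store never has to handle contradictory states.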
Versioned query algorithms
In this section, we introduce algorithms for performing VM, DM, and VQ triple pattern queries based on the storage structure introduced in Section 5. Each of these querying algorithms is based on result streams, enabling efficient offsets and limits by exploiting the index structure from Section 5. Furthermore, we provide algorithms that produce count estimates for each query type.
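The stream-based evaluation idea can be sketched for VM queries: the sorted snapshot results are lazily merged with sorted additions while deletions are skipped. Note that this illustration still skips offset elements one by one, whereas the algorithms in this work exploit stored metadata to avoid that linear cost:

```python
import heapq

# Sketch (assumed layout) of a streaming VM lookup: lazily merge sorted
# snapshot results with sorted additions, filtering out deletions, so
# offsets apply without materializing the full result set.
def vm_stream(snapshot_sorted, additions_sorted, deletions, offset=0):
    merged = heapq.merge(snapshot_sorted, additions_sorted)
    produced = 0
    for triple in merged:
        if triple in deletions:
            continue
        if produced >= offset:
            yield triple
        produced += 1

snap = [("a", "p", "o"), ("b", "p", "o"), ("c", "p", "o")]
adds = [("d", "p", "o")]
dels = {("b", "p", "o")}
print(list(vm_stream(snap, adds, dels, offset=1)))
# [('c', 'p', 'o'), ('d', 'p', 'o')]
```

Because `heapq.merge` is lazy, a consumer that stops after a limit never forces the remainder of either input stream to be read.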
Evaluation
In this section, we evaluate our proposed storage technique and querying algorithms. We start by introducing OSTRICH, an implementation of our proposed solution. After that, we describe the setup of our experiments, followed by presenting our results. Finally, we discuss these results.
Conclusions
In this article, we introduced an RDF archive storage method with accompanying algorithms for evaluating VM, DM, and VQ queries, with efficient result offsets. Our novel storage technique is a hybrid of the IC/CB/TB approaches, because we store sequences of snapshots followed by delta chains. The evaluation of our OSTRICH implementation shows that this technique offers a new trade-off in terms of ingestion time, storage size, and lookup times. By preprocessing and storing additional data during ingestion, lookups can be made significantly more efficient.
Acknowledgments
We would like to thank Christophe Billiet for providing his insights into temporal databases. We thank Giorgos Flouris for his comments on the structure and contents of this article, and Javier D. Fernández for his help in setting up and running the BEAR benchmark. The described research activities were funded by Ghent University, imec, Flanders Innovation & Entrepreneurship (AIO), and the European Union. Ruben Verborgh is a postdoctoral fellow of the Research Foundation – Flanders (FWO).
References (53)
- J.D. Fernández et al., Binary RDF representation for publication and exchange (HDT), Web Semant. Sci. Serv. Agents World Wide Web (2013)
- Y. Guo et al., LUBM: A benchmark for OWL knowledge base systems, Web Semant. Sci. Serv. Agents World Wide Web (2005)
- J.D. Fernández, A. Polleres, J. Umbrich, Towards efficient archiving of dynamic linked open data, in: J. Debattista, M. ...
- C. Bizer et al., Linked data - the story so far, Semant. Serv. Interoper. Web Appl.: Emerging Concepts (2009)
- R. Cyganiak, D. Wood, M. Lanthaler, RDF 1.1: Concepts and Abstract Syntax. W3C, 2014, ...
- J. Umbrich, S. Decker, M. Hausenblas, A. Polleres, A. Hogan, Towards dataset dynamics: Change frequency of linked open ...
- M. Meimaris et al., A query language for multi-version data web archives, Expert Syst. (2016)
- S. Auer et al., DBpedia: A nucleus for a Web of open data
- S. Harris, A. Seaborne, SPARQL 1.1 Query Language. W3C, 2013, ...
- R. Verborgh et al., Triple pattern fragments: a low-cost knowledge graph interface for the web, J. Web Semant. (2016)
- RDF-3X: a RISC-style engine for RDF, Proc. VLDB Endowment
- Towards sustainable publishing and querying of distributed linked data archives, J. Doc.
- Virtuoso: RDF support in a native RDBMS
- Qualitative spatial representation and reasoning in the SparQ-toolbox
- Deriving an emergent relational schema from RDF data
- Exploiting emergent schemas to make RDF systems more efficient
- Extended characteristic sets: graph indexing for SPARQL query optimization
- The Odyssey approach for optimizing federated SPARQL queries
- Hexastore: sextuple indexing for semantic web data management, Proc. VLDB Endowment
- TripleBit: a fast and compact system for large scale RDF data, Proc. VLDB Endowment
- A compact RDF store using suffix arrays
- Exchange and consumption of huge RDF data
- LOD laundromat: a uniform way of publishing other people’s dirty data