Triple storage for random-access versioned querying of RDF archives
Introduction
In the area of data analysis, there is an ongoing need for maintaining the history of datasets. Such archives can be used for looking up data at certain points in time, for requesting evolving changes, or for checking the temporal validity of these data [1]. With the continuously increasing number of Linked Open Datasets [2], archiving has become an issue for RDF [3] data as well. While the RDF data model itself is atemporal, Linked Datasets typically change over time [4] on dataset, schema, and/or instance level [5]. Such changes can include additions, modifications, or deletions of complete datasets, ontologies, and separate facts. While some evolving datasets, such as DBpedia [6], are published as separate dumps per version, more direct and efficient access to prior versions is desired.
Consequently, RDF archiving systems emerged that, for instance, support query engines that use the standard SPARQL query language [7]. In 2015, however, a survey on archiving Linked Open Data [1] illustrated the need for improved versioning capabilities, as current approaches have scalability issues at Web-scale. They either perform well for versioned query evaluation, at the cost of large storage space requirements, or require less storage space, at the cost of slower query evaluation. Furthermore, no existing solution performs well for all versioned query types, namely querying at, between, and for different versions. An efficient RDF archive solution should have a scalable storage model, efficient compression, and indexing methods that enable expressive versioned querying [1].
In this article, we argue that supporting both RDF archiving and SPARQL at once is difficult to scale due to their combined complexity. Instead, we propose an elementary but efficient versioned triple pattern index. Since triple patterns are the basic element of SPARQL, such indexes can serve as an entry point for query engines. Our solution is applicable as: (a) an alternative index with efficient triple-pattern-based access for existing engines, in order to improve the efficiency of more expressive SPARQL queries; and (b) a data source for the Web-friendly Triple Pattern Fragments [8] (TPF) interface, i.e., a Web API that provides access to RDF datasets by triple pattern and partitions the results in pages. We focus on the performance-critical features of stream-based results, query result offsets, and cardinality estimation. Stream-based results allow more memory-efficient processing when query results are plentiful. The capability to efficiently offset (and limit) a large stream reduces processing time if only a subset is needed. Cardinality estimation is essential for efficient query planning [8,9] in many query engines.
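The three performance-critical features named above can be illustrated with a minimal in-memory sketch. The `TriplePatternIndex` class and its method names are hypothetical, chosen for illustration only; they are not part of OSTRICH or the TPF specification:

```python
from itertools import islice

# Minimal in-memory sketch (hypothetical API) of a triple pattern index
# that supports result streaming, offsets, limits, and cardinality counts.
class TriplePatternIndex:
    def __init__(self, triples):
        self.triples = list(triples)

    def _matches(self, triple, pattern):
        # None acts as a variable; any bound term must match exactly.
        return all(p is None or p == t for t, p in zip(triple, pattern))

    def query(self, pattern, offset=0, limit=None):
        """Lazily stream matching triples, skipping `offset` results."""
        stream = (t for t in self.triples if self._matches(t, pattern))
        stop = None if limit is None else offset + limit
        return islice(stream, offset, stop)

    def count(self, pattern):
        """Cardinality estimate; exact here, often approximate in practice."""
        return sum(1 for t in self.triples if self._matches(t, pattern))

idx = TriplePatternIndex([
    (":alice", ":knows", ":bob"),
    (":alice", ":knows", ":carol"),
    (":bob", ":knows", ":carol"),
])
print(list(idx.query((":alice", ":knows", None), offset=1)))
# [(':alice', ':knows', ':carol')]
print(idx.count((None, ":knows", None)))  # 3
```

A query engine can first call `count` to plan join orders, then consume `query` as a lazy stream, which is the access pattern TPF exposes over HTTP.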
Concretely, this work introduces a storage technique with the following contributions:
a scalable versioned and compressed RDF index with offset support and result streaming;
efficient query algorithms to evaluate triple pattern queries and perform cardinality estimation at, between, and for different versions, with optional offsets;
an open-source implementation of this approach called OSTRICH;
an extensive evaluation of OSTRICH compared to other approaches using an existing RDF archiving benchmark.
The main novelty of this work is the combination of efficient offset-enabled queries over a new index structure for RDF archives. We do not aim to compete with existing versioned SPARQL engines: full access to the language can instead be offered by engines built on top of our index, or through alternative RDF publication and querying methods such as the HTTP interface-based TPF approach. Optional versioning capabilities are possible for TPF through VTPF [10], or via datetime content negotiation [11] using Memento [12].
This article is structured as follows. In the next section, we discuss related work, followed by our problem statement in Section 3. Next, in Section 4, we introduce the basic concepts of our approach, followed by our storage approach in Section 5, our ingestion algorithms in Section 6, and the accompanying querying algorithms in Section 7. After that, we present and discuss the evaluation of our implementation in Section 8. Finally, we present our conclusions in Section 9.
Related work
In this section, we discuss existing solutions and techniques for indexing and compression in RDF storage, without archiving support. Then, we compare different RDF archiving solutions. Finally, we discuss suitable benchmarks and different query types for RDF archives. This section does not contain an exhaustive list of all relevant solutions and techniques; instead, only those most relevant to this work are mentioned.
Problem statement
As mentioned in Section 1, no RDF archiving solutions exist that allow efficient triple pattern querying at, between, and for different versions, in combination with a scalable storage model and efficient compression. In the context of query engines, streams are typically used to return query results, on which offsets and limits can be applied to reduce processing time if only a subset is needed. Offsets are used to skip a certain number of elements, while limits restrict the number of elements returned.
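The cost difference between the two kinds of offsets can be made concrete with a small Python sketch (hypothetical data layout): a plain result stream must still produce every skipped element, whereas an index that keeps a pattern's results in a contiguous sorted sequence can jump to the offset position directly:

```python
# Sketch contrasting two offset strategies over 10 000 sorted results.
results = sorted("s%04d" % i for i in range(10000))

def stream_offset(gen, offset):
    # O(offset): every skipped element is still generated and discarded.
    for _ in range(offset):
        next(gen)
    return gen

def indexed_offset(sorted_results, offset, limit):
    # O(1) positional jump: results for a pattern stored contiguously,
    # so the offset translates directly into a slice boundary.
    return sorted_results[offset:offset + limit]

print(indexed_offset(results, 9990, 3))  # ['s9990', 's9991', 's9992']
print(next(stream_offset(iter(results), 5)))  # 's0005'
```

Supporting the second strategy for versioned results, without materializing every version, is one of the core difficulties addressed in this work.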
Overview of storage and querying approach
In this section, we lay the groundwork for the following sections. We introduce fundamental concepts that are required in our storage approach and its accompanying querying algorithms, which will be explained in Section 5 (Hybrid multiversion storage approach) and Section 7 (Versioned query algorithms), respectively.
To combine smart use of storage space with efficient processing of VM, DM, and VQ triple pattern queries, we employ a hybrid approach between the individual copies (IC), change-based (CB), and timestamp-based (TB) approaches.
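As a point of reference, the three versioned query types can be sketched over a toy archive. The per-version sets below are an illustrative representation only, not the storage model used in this work:

```python
# Toy archive: one set of triples per version (illustration only).
versions = [
    {("s", "p", "o1")},                      # version 0
    {("s", "p", "o1"), ("s", "p", "o2")},    # version 1
    {("s", "p", "o2")},                      # version 2
]

def vm(match, v):
    """Version Materialization: triple pattern results in one version."""
    return {t for t in versions[v] if match(t)}

def dm(match, v1, v2):
    """Delta Materialization: additions/deletions between two versions."""
    a, b = vm(match, v1), vm(match, v2)
    return {"+": b - a, "-": a - b}

def vq(match):
    """Version Query: results annotated with the versions they occur in."""
    out = {}
    for v, triples in enumerate(versions):
        for t in triples:
            if match(t):
                out.setdefault(t, []).append(v)
    return out

match = lambda t: t[0] == "s"
print(dm(match, 0, 2))  # {'+': {('s','p','o2')}, '-': {('s','p','o1')}}
print(vq(match))
```

The challenge addressed by the hybrid approach is answering all three query types efficiently without storing each version as a full copy, as this naive sketch does.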
Hybrid multiversion storage approach
In this section, we introduce our hybrid IC/CB/TB storage approach for storing multiple versions of an RDF dataset. Fig. 3 shows an overview of the main components. Our approach consists of an initial dataset snapshot, stored in HDT [23], followed by a delta chain (similar to TailR [40]). The delta chain uses multiple compressed B+Trees for a TB-storage strategy (similar to Dydra [39]), applies dictionary encoding to triples, and stores additional metadata to improve lookup times. The remainder of this section discusses each of these components in detail.
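The interplay of a snapshot, a delta chain, and dictionary encoding can be sketched as follows. The `Archive` class is a deliberately simplified illustration under assumed semantics: it applies deltas sequentially and omits the compressed B+Tree indexes and lookup metadata of the actual approach:

```python
# Sketch (hypothetical layout): an initial snapshot plus a chain of
# deltas, with terms dictionary-encoded so each is stored once as an ID.
class Dictionary:
    def __init__(self):
        self.term_to_id, self.id_to_term = {}, []

    def encode(self, term):
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

class Archive:
    def __init__(self, snapshot_triples):
        self.dict = Dictionary()
        enc = lambda t: tuple(map(self.dict.encode, t))
        self.snapshot = {enc(t) for t in snapshot_triples}
        self.deltas = []  # one (additions, deletions) pair per new version

    def append(self, additions, deletions):
        enc = lambda ts: {tuple(map(self.dict.encode, t)) for t in ts}
        self.deltas.append((enc(additions), enc(deletions)))

    def materialize(self, version):
        # Version 0 is the snapshot; later versions apply deltas in order.
        triples = set(self.snapshot)
        for adds, dels in self.deltas[:version]:
            triples = (triples - dels) | adds
        dec = lambda t: tuple(self.dict.id_to_term[i] for i in t)
        return {dec(t) for t in triples}

arc = Archive([("s", "p", "o1")])
arc.append(additions={("s", "p", "o2")}, deletions=set())
print(arc.materialize(1))
```

Note the trade-off this sketch makes visible: materializing version `n` touches every delta up to `n`, which motivates the additional metadata and TB-indexed delta chain of the actual storage approach.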
Changeset ingestion algorithms
In this section, we discuss two ingestion algorithms: a memory-intensive batch algorithm and a memory-efficient streaming algorithm. These algorithms both take a changeset, containing additions and deletions, as input, and append it as a new version to the store. Note that the ingested changesets are regular changesets: they are relative to one another according to Fig. 1. Furthermore, we assume that the ingested changesets are valid changesets: they do not contain impossible triple sequences.
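The validity assumption can be captured by a small check. The exact rules below (no contradictory changes, no re-additions of present triples, no deletions of absent triples) are our reading of what makes a triple sequence impossible, stated here as an assumption:

```python
# Sketch of a changeset validity check (assumed rules): a valid changeset
# never both adds and deletes the same triple, never adds a triple that
# is already present, and never deletes a triple that is absent.
def is_valid_changeset(current_triples, additions, deletions):
    if additions & deletions:
        return False                      # contradictory change
    if additions & current_triples:
        return False                      # re-adding an existing triple
    if not deletions <= current_triples:
        return False                      # deleting an absent triple
    return True

v0 = {("s", "p", "o1")}
print(is_valid_changeset(v0, {("s", "p", "o2")}, set()))  # True
print(is_valid_changeset(v0, set(), {("s", "p", "o2")}))  # False
```

An ingestion pipeline could run such a check before appending a version, so the store never has to handle contradictory states.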
Versioned query algorithms
In this section, we introduce algorithms for performing VM, DM, and VQ triple pattern queries based on the storage structure introduced in Section 5. Each of these querying algorithms is based on result streams, enabling efficient offsets and limits by exploiting the index structure from Section 5. Furthermore, we provide algorithms that produce count estimates for each query type.
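The stream-based evaluation idea can be sketched for VM queries: the sorted snapshot results are lazily merged with sorted additions while deletions are skipped. Note that this illustration still skips offset elements one by one, whereas the algorithms in this work exploit stored metadata to avoid that linear cost:

```python
import heapq

# Sketch (assumed layout) of a streaming VM lookup: lazily merge sorted
# snapshot results with sorted additions, filtering out deletions, so
# offsets apply without materializing the full result set.
def vm_stream(snapshot_sorted, additions_sorted, deletions, offset=0):
    merged = heapq.merge(snapshot_sorted, additions_sorted)
    produced = 0
    for triple in merged:
        if triple in deletions:
            continue
        if produced >= offset:
            yield triple
        produced += 1

snap = [("a", "p", "o"), ("b", "p", "o"), ("c", "p", "o")]
adds = [("d", "p", "o")]
dels = {("b", "p", "o")}
print(list(vm_stream(snap, adds, dels, offset=1)))
# [('c', 'p', 'o'), ('d', 'p', 'o')]
```

Because `heapq.merge` is lazy, a consumer that stops after a limit never forces the remainder of either input stream to be read.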
Evaluation
In this section, we evaluate our proposed storage technique and querying algorithms. We start by introducing OSTRICH, an implementation of our proposed solution. After that, we describe the setup of our experiments, followed by presenting our results. Finally, we discuss these results.
Conclusions
In this article, we introduced an RDF archive storage method with accompanying algorithms for evaluating VM, DM, and VQ queries, with efficient result offsets. Our novel storage technique is a hybrid of the IC/CB/TB approaches, because we store sequences of snapshots followed by delta chains. The evaluation of our OSTRICH implementation shows that this technique offers a new trade-off in terms of ingestion time, storage size, and lookup times. By preprocessing and storing additional data during ingestion, lookups can be made significantly more efficient.
Acknowledgments
We would like to thank Christophe Billiet for providing his insights into temporal databases. We thank Giorgos Flouris for his comments on the structure and contents of this article, and Javier D. Fernández for his help in setting up and running the BEAR benchmark. The described research activities were funded by Ghent University, imec, Flanders Innovation & Entrepreneurship (AIO), and the European Union. Ruben Verborgh is a postdoctoral fellow of the Research Foundation – Flanders (FWO).
References (53)
- J.D. Fernández et al., Binary RDF representation for publication and exchange (HDT), Web Semant. Sci. Serv. Agents World Wide Web (2013)
- Y. Guo et al., LUBM: A benchmark for OWL knowledge base systems, Web Semant. Sci. Serv. Agents World Wide Web (2005)
- J.D. Fernández, A. Polleres, J. Umbrich, Towards efficient archiving of dynamic linked open data, in: J. Debattista, M. ...
- C. Bizer et al., Linked data - the story so far, Semant. Serv. Interoper. Web Appl.: Emerging Concepts (2009)
- R. Cyganiak, D. Wood, M. Lanthaler, RDF 1.1: Concepts and Abstract Syntax. W3C, 2014, ...
- J. Umbrich, S. Decker, M. Hausenblas, A. Polleres, A. Hogan, Towards dataset dynamics: Change frequency of linked open ...
- M. Meimaris et al., A query language for multi-version data web archives, Expert Syst. (2016)
- S. Auer et al., DBpedia: A nucleus for a Web of open data
- S. Harris, A. Seaborne, SPARQL 1.1 Query Language. W3C, 2013, ...
- R. Verborgh et al., Triple pattern fragments: a low-cost knowledge graph interface for the web, J. Web Semant. (2016)
- RDF-3X: a RISC-style engine for RDF, Proc. VLDB Endowment
- Towards sustainable publishing and querying of distributed linked data archives, J. Doc.
- Virtuoso: RDF support in a native RDBMS
- Qualitative spatial representation and reasoning in the SparQ-toolbox
- Deriving an emergent relational schema from RDF data
- Exploiting emergent schemas to make RDF systems more efficient
- Extended characteristic sets: graph indexing for SPARQL query optimization
- The Odyssey approach for optimizing federated SPARQL queries
- Hexastore: sextuple indexing for semantic web data management, Proc. VLDB Endowment
- TripleBit: a fast and compact system for large scale RDF data, Proc. VLDB Endowment
- A compact RDF store using suffix arrays
- Exchange and consumption of huge RDF data
- LOD laundromat: a uniform way of publishing other people’s dirty data