Elsevier

Journal of Web Semantics

Volume 65, December 2020, 100614

Schema-agnostic SPARQL-driven faceted search benchmark generation

https://doi.org/10.1016/j.websem.2020.100614

Abstract

In this work, we present a schema-agnostic faceted browsing benchmark generation framework for RDF data and SPARQL engines. Faceted search is a technique for narrowing down sets of information items by applying constraints over their properties, where facets correspond to properties of these items. While our work can be used to realise real-world faceted search user interfaces, our focus lies on the construction and benchmarking of faceted search queries over knowledge graphs. The RDF model exhibits several traits that seemingly make it a natural foundation for faceted search: all information items are represented as RDF resources, property values typically already correspond to meaningful semantic classifications, and with SPARQL there is a standard language for uniformly querying instance and schema information.

However, although faceted search is ubiquitous today, it is typically not performed on the RDF model directly. Two major sources of concern are the complexity of query generation and query performance. To overcome the former, our framework comes with an intermediate domain-specific language. Our approach is thereby SPARQL-driven, which means that every faceted search information need is intensionally expressed as a single SPARQL query. With regard to the latter, we investigate the possibilities and limits of real-time SPARQL-driven faceted search on contemporary triple stores. We report our findings by evaluating the performance and correctness characteristics of several systems when executing a benchmark generated with our framework.

All components, namely the benchmark generator, the benchmark runners and the underlying faceted search framework, are freely available as open source.

Introduction

Faceted browsing is ubiquitous on the Web today. Most if not all major online shops and media platforms provide at least some faceted browsing features to navigate their products or – more specifically – the data records about them. Typical examples include support for filtering videos by length, music by genre, or more generally, products by relevant features. Faceted search is a technique that facilitates exploratory search by narrowing down sets of information items through constraints over their property values, where facets correspond to properties of these items. However, many of these faceted search interfaces are based on systems that require tailoring of datasets – i.e. manual specification of what facets and values to show to users and how the input data relates to them. Such approaches hinder flexible ad-hoc exploration of datasets.

In contrast, the RDF model exhibits several traits that seemingly make it a natural foundation for faceted search: all information items are represented as RDF resources, property values typically already correspond to meaningful semantic classifications, and with SPARQL there is a standard language for uniformly querying instance and schema information. Furthermore, RDF was designed to enable the construction of knowledge graphs (KG) that capture relations between items of arbitrary type thereby exploiting web technology.

The idea of Semantic Faceted Search (SFS) systems is exactly to utilise the flexibility of the RDF model for faceted search. However, although several SFS systems with different features and degrees of expressivity have been proposed, there are only a few works on benchmarking faceted search performance on RDF. Among the work concerned with benchmarking, to the best of our knowledge, each of the existing approaches is tied to a specific schema. Conversely, none is schema-agnostic, i.e. can operate on an arbitrary given schema. However, w.r.t. usability, it is beneficial to know in advance whether faceted search can be performed interactively on a given dataset, which limits acceptable response times to roughly one second.

Additionally, most of the available SFS systems are primarily designed as applications rather than libraries, which makes re-use and evaluation of existing tools difficult. Furthermore, SFS systems are generally not interoperable due to the lack of common APIs or intermediate languages. Yet, SPARQL as a query language facilitates interoperability between RDF stores and is suitable to express the information needs of faceted search (cf. Section 4). Hence, in this work we focus on the generation of benchmarks for assessing the performance and correctness of triple stores w.r.t. given datasets and SPARQL query loads generated from simulated interaction with a real-world SFS engine. In contrast to other benchmarks that assess triple stores, our goal is to specifically study the performance of triple stores w.r.t. workloads of SPARQL queries tied to the faceted search paradigm.

This work builds upon the ideas presented in [1], which describes several types of possible interactions with an SFS. There, the outcome was a set of manually crafted query templates for simulating a faceted search user session on a specific schema. In this work we present significant advances featuring a comprehensive automatic benchmark generator that explores a dataset in a schema-agnostic way based on a library of functions that advance the state of a faceted search session in various ways. The state of such a session determines the set of SPARQL queries that are generated. Thus, the sequences of SPARQL queries obtained by repeatedly advancing the session state form a generated benchmark.
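The idea of deriving a benchmark from a simulated session can be illustrated with a minimal Python sketch. It is not the Facete implementation: the `Session` class, the two transition functions, the weights in `TRANSITIONS` and the `example.org` IRIs are all hypothetical, and the query rendering is deliberately simplistic. It only shows how weighted transitions advance a session state and how each state yields a SPARQL query.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical session state: the (property, value) constraints applied so far."""
    constraints: list = field(default_factory=list)


def to_sparql(session):
    """Render a query for the items matching the current state (illustrative only)."""
    patterns = " ".join(f"?s <{p}> <{o}> ." for p, o in session.constraints)
    if not patterns:
        patterns = "?s ?p ?o ."  # unconstrained state: every subject matches
    return f"SELECT DISTINCT ?s WHERE {{ {patterns} }}"


def add_constraint(session, rng, candidates):
    """Transition: apply one constraint that is not active yet."""
    unused = [c for c in candidates if c not in session.constraints]
    if unused:
        session.constraints.append(rng.choice(unused))


def remove_constraint(session, rng, candidates):
    """Transition: drop one active constraint, if any."""
    if session.constraints:
        session.constraints.pop(rng.randrange(len(session.constraints)))


# Assumed distribution over transition types; the weights are made up.
TRANSITIONS = {add_constraint: 0.7, remove_constraint: 0.3}


def generate_benchmark(candidates, steps, seed=0):
    """Advance the session `steps` times and collect one query per state."""
    rng = random.Random(seed)
    session = Session()
    queries = []
    for _ in range(steps):
        transition = rng.choices(
            list(TRANSITIONS), weights=list(TRANSITIONS.values())
        )[0]
        transition(session, rng, candidates)
        queries.append(to_sparql(session))
    return queries


CANDIDATES = [
    ("http://example.org/genre", "http://example.org/Jazz"),
    ("http://example.org/length", "http://example.org/Short"),
]

if __name__ == "__main__":
    for query in generate_benchmark(CANDIDATES, steps=5, seed=42):
        print(query)
```

Seeding the random generator makes the emitted query sequence reproducible, which is what allows the same generated workload to be replayed against different triple stores.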

For this purpose, we built a comprehensive framework for SPARQL-driven faceted search named Facete. The most essential components are the framework core and the benchmark generator. To test the validity of our framework, a faceted search framework application for end-users is also available. The latter features a text-based user interface. The framework’s core features a model for faceted search queries together with several translations to SPARQL queries in order to satisfy essential information needs of the faceted search paradigm. The model and the translations are detailed in Section 4. The benchmark generator and the user application are both built on the same core and thus make use of the same model for faceted search queries and the corresponding SPARQL query generation capabilities. As a result, in the context of this work the user application serves as a demonstrator that our system indeed allows for real-world SPARQL-driven faceted search and thus testifies to the relevance of the described system. Furthermore, the user application not only enables a user to browse facets, facet values and matching values of a given dataset but it also allows for viewing the underlying SPARQL query strings which are the same ones the benchmark generator emits. Although the focus lies on the Facete benchmark generator, the Facete user application can be seen as a complementary interactive verification and debugging tool that allows one to manually inspect the generated queries. Note that the framework’s core is independent of any user interface. Whereas many related works on faceted search have strong ties to graphical user interfaces, in this work we describe a model-driven approach to faceted search. This model is intended to enable (SPARQL-driven) exploratory search over RDF data also for machines. Our benchmark generator is one such implementation.

In detail, our contributions are as follows:

  • A formal description of a model for faceted search with corresponding translations to SPARQL queries that satisfy faceted search information needs. Most notably, we detail the construction of SPARQL queries that intensionally capture facet counts, facet value counts and matching values under a given set of constraints.

  • Implementation of these techniques in the core of the SPARQL-driven faceted search framework Facete which is used as a building block to realise the benchmark.

  • Design and implementation of a schema-agnostic benchmark generation framework within Facete, which allows for highly configurable query generation based on customisable distributions of transition types on arbitrary datasets.

  • Performance and correctness evaluation of contemporary triple stores with regard to the faceted browsing paradigm.

  • As a side contribution, we also present a text mode user interface for faceted search which is also built on the Facete framework’s core. This demonstrates that the engine is suitable for real-world applications and the generated SPARQL queries actually conform to the faceted search paradigm.

The remainder of the paper is structured as follows: First, in Section 2 we present related work and position our approach in it. Afterwards, in Section 3 we introduce RDF and SPARQL and on this basis formalise fundamental notions for faceted search query generation as used in our benchmarking framework. Subsequently, in Section 4 we first propose a model for faceted search and detail the generation of SPARQL queries from it. The actual benchmark generation is described in Section 5, where we first present the conceptual grounding followed by a description of the implementation. Our findings when executing an exemplary benchmark generated by our system are reported on in Section 6. Finally, we conclude in Section 7 and also point out directions for future work.

Section snippets

Related work

There are two lines of evaluation for faceted search systems in general: performance benchmarking and usability studies. The latter requires a user interface and the former is typically tied to a specific user interface and/or a specific dataset and is thus difficult to generalise.

There is a considerable number of benchmarks available to test the general performance of triple stores (e.g., LUBM [2], SP²Bench [3], BSBM [4], WatDiv [5] and Geographica [6]). However, specifically for benchmarking faceted

Preliminaries: RDF and SPARQL

The Resource Description Framework (RDF) is a W3C standard for data interchange. The most fundamental notions are as follows: Let there be pairwise disjoint sets of IRIs $I$, blank nodes $B$ and literals $L$. Further, let the set of RDF terms be $T := I \cup B \cup L$. The set of concrete RDF terms is denoted by $IL := I \cup L$. An RDF graph $G$ is defined as $G \subseteq (I \cup B) \times I \times T$, where the elements of this set are called RDF triples.
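The position constraints on RDF triples can be made concrete with a small Python sketch. The class and function names are illustrative and not taken from any RDF library; the sketch assumes that subjects are IRIs or blank nodes, predicates are IRIs, objects are arbitrary RDF terms, and that concrete terms are the IRIs and literals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    lexical: str


def is_rdf_triple(s, p, o):
    """Check membership in (I ∪ B) × I × T for a candidate triple."""
    return (
        isinstance(s, (IRI, BlankNode))
        and isinstance(p, IRI)
        and isinstance(o, (IRI, BlankNode, Literal))
    )


def is_concrete(term):
    """Concrete terms are assumed here to be IRIs and literals (no blank nodes)."""
    return isinstance(term, (IRI, Literal))
```

Using distinct classes for the three kinds of terms mirrors the requirement that the underlying sets are pairwise disjoint.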

Consequently, a triple t is a three-tuple whose components in order are

Semantic faceted search query generation model

In this section we first introduce fundamental definitions, especially that of a faceted query and a facet query configuration. On this basis, we define the information needs a faceted search system has to satisfy, namely matching values, facet value counts, facet counts.
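To give an intuition for these three information needs, the following Python sketch computes them over a tiny in-memory set of triples. It is a simplification under stated assumptions, not the paper's formal model: constraints are plain (property, value) pairs, "matching values" are taken to be the matching items, a facet count is taken to be the number of distinct values of a property among the matching items, and the `ex:` data is invented for illustration.

```python
from collections import defaultdict

# Invented example data: three videos with a genre and a length class.
TRIPLES = [
    ("ex:v1", "ex:genre", "ex:Jazz"),
    ("ex:v1", "ex:length", "ex:Short"),
    ("ex:v2", "ex:genre", "ex:Jazz"),
    ("ex:v2", "ex:length", "ex:Long"),
    ("ex:v3", "ex:genre", "ex:Rock"),
    ("ex:v3", "ex:length", "ex:Long"),
]


def matching_items(triples, constraints):
    """Items (subjects) that satisfy every (property, value) constraint."""
    items = {s for s, _, _ in triples}
    for p, o in constraints:
        items &= {s for s, tp, to in triples if (tp, to) == (p, o)}
    return items


def facet_value_counts(triples, constraints, facet):
    """For one facet: number of matching items per value of that facet."""
    items = matching_items(triples, constraints)
    subjects_by_value = defaultdict(set)
    for s, p, o in triples:
        if p == facet and s in items:
            subjects_by_value[o].add(s)
    return {o: len(subjects) for o, subjects in subjects_by_value.items()}


def facet_counts(triples, constraints):
    """Per facet: number of distinct values among the matching items (assumed definition)."""
    items = matching_items(triples, constraints)
    values_by_facet = defaultdict(set)
    for s, p, o in triples:
        if s in items:
            values_by_facet[p].add(o)
    return {p: len(values) for p, values in values_by_facet.items()}


if __name__ == "__main__":
    jazz = [("ex:genre", "ex:Jazz")]
    print(sorted(matching_items(TRIPLES, jazz)))
    print(facet_value_counts(TRIPLES, jazz, "ex:length"))
    print(facet_counts(TRIPLES, jazz))
```

In a SPARQL-driven system each of these functions would correspond to a single query evaluated by the triple store rather than to in-memory set operations; the sketch only fixes what the results are supposed to mean.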

The purpose is to present human and/or machine agents with statistics about available information items and their relations under a given set of constraints. This serves as a guide for data exploration because it provides

The faceted search benchmark generation framework

In this section we present our benchmark generation system. In a nutshell, the goal of the benchmark generator is to yield sequences of SPARQL queries that are the result of simulated sessions of interactions with a faceted search system. Because of the SPARQL-driven nature of our approach, the resulting queries correspond to specifications of essential faceted search information needs. In consequence, our approach is representative for SPARQL-driven systems that capture the relations of

Evaluation

In this section we present our empirical validation of the SPARQL-driven faceted search benchmark generation. We perform two related studies: First, we evaluate the performance of several triple stores w.r.t. the most basic query that computes facet counts. In this experiment we scale the amount of data that participates in the aggregation by using different limits for that query. The results provide insights about how much data a triple store can process in what time on certain hardware in order

Conclusions and future work

In this work, we presented a schema-agnostic faceted search benchmark generation framework for triple stores. In accordance with the Semantic Web vision, where autonomous agents are able to explore the Web of Data in order to solve tasks on someone’s behalf, efficient exploratory search mechanisms are needed. Faceted search is a form of exploratory search that enables systematic exploration with a-priori insights about the available data under a set of constraints. As a consequence, the class of

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by grants from the EU H2020 Programme for the projects HOBBIT (GA no. 688227) and QROWD (GA no. 732194) and the Federal Ministry of Transport and Digital Infrastructure (BMVI) for the LIMBO project (GA no. 19F2029G).

References (35)

  • Guo Y. et al., LUBM: A benchmark for OWL knowledge base systems, Web Semant.: Sci. Serv. Agents World Wide Web (2005)
  • Arenas M. et al., Faceted search over RDF-based knowledge graphs, J. Web Semant. (2016)
  • Petzka H. et al., Benchmarking faceted browsing capabilities of triplestores
  • Schmidt M. et al., SP²Bench: a SPARQL performance benchmark
  • Bizer C. et al., The Berlin SPARQL benchmark, Int. J. Semant. Web Inf. Syst. (IJSWIS) (2009)
  • Aluç G. et al., Diversified stress testing of RDF data management systems
  • Garbis G. et al., Geographica: A benchmark for geospatial RDF stores (long version)
  • Tunkelang D., Faceted search, Synth. Lect. Inf. Concepts Retr. Serv. (2009)
  • Bast H. et al., Broccoli: Semantic full-text search at your fingertips (2012)
  • Moreno-Vega J. et al., GraFa: Scalable faceted browsing for RDF graphs
  • Hildebrand M. et al., /facet: A browser for heterogeneous semantic web repositories
  • Heim P. et al., gFacet: A browser for the web of data
  • Oren E. et al., Extending faceted navigation for RDF data
  • Cheng G. et al., Falcons: searching and browsing entities on the semantic web
  • Davies J. et al., QuizRDF: Search technology for the semantic web
  • Hahn R. et al., Faceted Wikipedia search
  • Waitelonis J. et al., Towards exploratory video search using linked data, Multimedia Tools Appl. (2012)