Elsevier

Information Systems

Volume 102, December 2021, 101814
Keyword search over schema-less RDF datasets by SPARQL query compilation

https://doi.org/10.1016/j.is.2021.101814

Highlights

  • An algorithm to automatically translate keyword-based queries to SPARQL queries is introduced.

  • The algorithm does not rely on RDF schemas.

  • The algorithm outperforms a baseline schema-based RDF keyword search tool.

  • The algorithm outperforms two “virtual documents” RDF keyword search tools.

  • A measure named Graph Relevance Ratio (GRR) is proposed.

Abstract

This article introduces an algorithm to automatically translate a user-specified keyword-based query K to a SPARQL query Q so that the answers Q returns are also answers for K. The algorithm does not rely on an RDF schema; instead, it synthesizes SPARQL queries by exploring the similarity between the property domains and ranges and the class instance sets observed in the RDF dataset. It estimates set similarity based on set synopses, which can be efficiently pre-computed in a single pass over the RDF dataset. The article includes two sets of experiments with an implementation of the algorithm. The first set of experiments shows that the implementation outperforms a baseline RDF keyword search tool that explores the RDF schema, while the second set of experiments indicates that the implementation performs better than the state-of-the-art TSA+BM25 and TSA+VDP keyword search systems over RDF datasets based on the “virtual documents” approach.

Introduction

Keyword search is typically associated with information retrieval systems, especially those designed for the Web. The user specifies a few terms, called keywords, and it is up to the system to retrieve the documents, such as Web pages, that best match the keywords. Keyword search over relational databases, as well as over RDF datasets, has also been studied for some time. The challenge, in this case, is to retrieve the objects that the keywords specify and to discover how they are interrelated. That is, an answer for a keyword search over a relational database, or an RDF dataset, is not just a set of objects, but a set of objects and relationships between them.

The keyword search tools proposed in the literature for the relational and the RDF environments have points in common. However, RDF datasets pose an additional challenge when no schema is defined, which is never the case for relational databases. This is the primary motivation for the research reported in this article: keyword search over schema-less RDF datasets.

Indeed, an RDF dataset does not require a predefined schema, that is, the user may introduce new classes and properties, without defining them a priori, or changing the current schema declaration if any. Hence, when compared to the relational model, RDF offers the flexibility of schema-less datasets but, in this case, an RDF keyword search tool must face the problem of retrieving sets of objects and their relationships without resorting to schema information for guidance.

There are two other features of RDF that should be mentioned in the context of keyword search. First, the adoption of RDF imposes no strict distinction between data and metadata, that is, a keyword may match the name or description of a class or property in the same way that it may match a data value. Second, an RDF dataset is equivalent to a labeled graph, called an RDF graph, which allows addressing RDF keyword search as a graph search problem. We will use the terms RDF dataset and RDF graph interchangeably in what follows.

In more detail, a keyword-based query is a set K of literals, or keywords. An answer for K over an RDF dataset T is a subset A of T such that: (i) A has triples that match some keywords in K; and (ii) A induces a connected RDF graph. Note that we can then compare answers based on the number of keywords they match and on their number of triples, as defined in detail in [1]. Based on these remarks, we can informally state the RDF keyword search problem (the RDF KwS-Problem) as: “Given an RDF dataset T and a keyword-based query K, find an answer A for K over T, preferably with as many keyword matches as possible and with as few triples as possible”.
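The two conditions on an answer can be checked mechanically. The following Python sketch is only an illustration of the definition: the matching rule (case-insensitive substring match on any term of a triple), the `ex:` names, and the toy dataset are assumptions made here, not part of the article.

```python
from collections import defaultdict

def matches(triple, keyword):
    # Hypothetical matching rule: case-insensitive substring match on any term.
    kw = keyword.lower()
    return any(kw in str(term).lower() for term in triple)

def matched_keywords(answer, keywords):
    # Keywords of K matched by at least one triple of the candidate answer.
    return {k for k in keywords if any(matches(t, k) for t in answer)}

def is_connected(answer):
    # True if the undirected graph induced by the triples is connected.
    adj = defaultdict(set)
    for s, _p, o in answer:
        adj[s].add(o)
        adj[o].add(s)
    if not adj:
        return False
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == len(adj)

def is_answer(answer, dataset, keywords):
    # (i) some keyword is matched, (ii) the induced graph is connected.
    return (answer <= dataset
            and bool(matched_keywords(answer, keywords))
            and is_connected(answer))

# Toy dataset T and keyword-based query K (illustrative data only).
T = {
    ("ex:m1", "rdf:type", "ex:Movie"),
    ("ex:m1", "ex:title", "Casablanca"),
    ("ex:m1", "ex:director", "ex:p1"),
    ("ex:p1", "ex:name", "Michael Curtiz"),
    ("ex:m2", "ex:title", "Vertigo"),
}
K = {"casablanca", "curtiz"}

A = {
    ("ex:m1", "ex:title", "Casablanca"),
    ("ex:m1", "ex:director", "ex:p1"),
    ("ex:p1", "ex:name", "Michael Curtiz"),
}
B = {
    ("ex:m1", "ex:title", "Casablanca"),
    ("ex:m2", "ex:title", "Vertigo"),
}

print(is_answer(A, T, K), len(matched_keywords(A, K)), len(A))
print(is_answer(B, T, K))
```

Here A matches both keywords with three triples and is connected, whereas B matches a keyword but induces two disconnected components, so it is not an answer under the definition above.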

Let GA be the RDF graph induced by an answer A. If GA is a Steiner tree of T that covers the nodes that match keywords, then GA is connected and does not have unnecessary edges. Therefore, a basic strategy to solve the RDF KwS-Problem would be to construct an algorithm that tries to find as many keyword matches as possible and, at the same time, find a Steiner tree of T that covers the matching nodes, which is a challenging task.

The central contribution of this article is an algorithm to address the RDF KwS-Problem by automatically translating a keyword-based query K into a SPARQL query Φ so that the answers Φ returns are also answers for K. The novelty of the algorithm is that it neither relies on an RDF schema nor accesses the RDF graph during the compilation process, which is a relevant feature. Instead, it synthesizes SPARQL queries by exploring the similarity between the property domains and ranges and the class instance sets observed in the RDF dataset. To achieve good performance, even for large RDF datasets, the algorithm estimates set similarity based on KMV-synopses [2]. The KMV-synopses are pre-computed efficiently in a single pass over the RDF dataset and stored together with the RDF dataset for later use by the compilation process.
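To make the role of the synopses concrete, here is a minimal Python sketch of a k-minimum-values (KMV) synopsis and of a Jaccard estimate derived from two synopses. It follows the commonly described KMV construction and is not taken from the article: the hash function, the parameter k, the function names, and the handling of sets with fewer than k values are all assumptions.

```python
import hashlib

def unit_hash(value):
    # Map a value to a pseudo-uniform number in [0, 1) using SHA-1 (assumed hash).
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") / 2**160

def kmv_synopsis(values, k):
    # The k smallest distinct hash values. For clarity this hashes everything
    # and then truncates; a streaming version would keep a bounded heap.
    return sorted({unit_hash(v) for v in values})[:k]

def estimate_distinct(synopsis, k):
    # Fewer than k hash values means the set was seen entirely: count is exact.
    if len(synopsis) < k:
        return float(len(synopsis))
    # Otherwise use the usual (k - 1) / (k-th smallest hash) estimate.
    return (k - 1) / synopsis[-1]

def estimate_jaccard(syn_a, syn_b, k):
    # Combine the synopses, keep the k smallest values, and count how many
    # of them occur in both synopses.
    a, b = set(syn_a), set(syn_b)
    combined = sorted(a | b)[:k]
    if not combined:
        return 0.0
    shared = sum(1 for v in combined if v in a and v in b)
    return shared / len(combined)

# Illustrative sets standing in for an observed domain and an observed range.
domain_p = {f"ex:s{i}" for i in range(6)}
range_q = {f"ex:s{i}" for i in range(3, 9)}
syn_a = kmv_synopsis(domain_p, k=64)
syn_b = kmv_synopsis(range_q, k=64)
print(estimate_jaccard(syn_a, syn_b, k=64))
```

In this toy run both sets are smaller than k, so the estimate coincides with the exact Jaccard value (3 shared values out of 9). For sets much larger than k, the synopses stay at k values each and the result becomes an approximation.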

The second contribution consists of two sets of comprehensive experiments with an implementation of the algorithm. The first set of experiments shows that the implementation outperforms, in all metrics adopted, a baseline RDF keyword search tool that explores the RDF schema. These results suggest that schema information can indeed be replaced by pre-computed, concise KMV-synopses for the property domains and ranges, and class instance sets. These results are, to some extent, unexpected, since the lack of schema information seemed difficult to overcome from the outset. The second set of experiments indicates that the implementation performs better than the TSA+BM25 and TSA+VDP keyword search systems over RDF datasets based on the “virtual documents” approach, using the metrics and the benchmarks proposed originally to assess these systems.

As a third contribution, we propose the Graph Relevance Ratio (GRR) to establish when an answer graph is relevant w.r.t. a ground truth graph. It is based on the number of relevant and non-relevant triples in the RDF graph, but it penalizes the presence of non-relevant triples and does not memorize the relevant triples in previous rank positions.

The remainder of this article is organized as follows. Section 2 summarizes related work. Section 3 provides the necessary background. Section 4 describes the proposed algorithm to compile keyword-based queries into SPARQL queries. Section 5 covers the first set of experiments that compare an implementation of the proposed algorithm with a baseline RDF keyword search tool that explores the RDF schema. Section 6 describes the second set of experiments that compare the implementation with the state-of-the-art TSA+BM25 and TSA+VDP keyword search systems over RDF datasets based on the “virtual documents” approach. Finally, Section 7 contains the conclusions and suggests directions for future research.

Section snippets

Related work

A survey of keyword-based query processing tools over relational databases and RDF datasets can be found in [3].

Early relational keyword-based query processing tools [4], [5], [6], [7], [8] explored the foreign/primary keys declared in the relational schema to compile a keyword-based query into an SQL query with a minimal set of join clauses – and this is a key idea – based on the notion of candidate networks (CNs). This approach was also adopted in recent tools [8], [9]. In particular, QUEST 

RDF

An Internationalized Resource Identifier (IRI) is a global identifier that denotes a resource. RDF describes data as triples of the form (s,p,o), where s is the subject, p is the predicate or property, and o is the object of the triple [40]. The subject of a triple is an IRI or a blank node, the property is an IRI, and the object is an IRI, a blank node, or a literal. An RDF dataset is a set T of RDF triples; T is equivalent to a labeled graph GT whose set of nodes is the set of RDF terms that

A motivating example

This section describes a simplified example that illustrates how the proposed algorithm translates a keyword-based query K into a SPARQL query Φ so that the answers Φ returns are also answers for K.

Example 1

Let T be the RDF dataset whose graph is shown in Fig. 1. In what follows, let Dp and Rp denote the observed domain and the observed range of a property p observed in T.
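As an illustration of these notions, the Python sketch below collects observed domains and ranges in one scan over a set of triples. It assumes that the observed domain of p is the set of subjects of triples with property p and that the observed range is the set of their objects. The separate treatment of `rdf:type` to gather class instance sets, and all names and data, are assumptions made for this example.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # abbreviated IRI, used here for readability

def scan(dataset):
    # One pass over the triples, filling three maps at the same time.
    domains = defaultdict(set)    # property -> observed domain (subjects)
    ranges = defaultdict(set)     # property -> observed range (objects)
    instances = defaultdict(set)  # class -> observed instances
    for s, p, o in dataset:
        if p == RDF_TYPE:
            instances[o].add(s)
        else:
            domains[p].add(s)
            ranges[p].add(o)
    return domains, ranges, instances

# Illustrative triples only.
T = {
    ("ex:m1", "rdf:type", "ex:Movie"),
    ("ex:m1", "ex:director", "ex:p1"),
    ("ex:m2", "ex:director", "ex:p1"),
    ("ex:p1", "ex:name", "Michael Curtiz"),
}
domains, ranges, instances = scan(T)
print(sorted(domains["ex:director"]), sorted(ranges["ex:director"]))
```

In a setting with large datasets, each of these sets would presumably be summarized by a compact synopsis rather than stored in full.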

Pre-processing. Before processing any keyword-based query, the algorithm executes a single scan of T that simultaneously pre-computes

Comparison with a schema-based RDF keyword search tool

This section describes a set of experiments that compares the schema-less approach proposed in this article with a state-of-the-art schema-based RDF keyword search tool, adopted as baseline. The experiments are based on an RDF keyword search benchmark [45], with: two RDF datasets triplified from the IMDb and Mondial databases; the RDF schema of each of these RDF datasets; two lists of keyword-based queries, one for each of

Comparison with keyword search systems based on the “virtual documents” approach

This section describes a second set of experiments that compares the schema-less approach proposed in this article with the state-of-the-art TSA+BM25 and TSA+VDP keyword search systems over RDF datasets based on the “virtual documents” approach [30], adopted as baselines. The experiments are based on the same datasets and queries adopted in [30].

Conclusions and future work

This article addressed the problem of implementing keyword search for RDF datasets that do not necessarily feature an RDF schema. It introduced a novel algorithm to automatically translate a user-specified keyword-based query into a SPARQL query that returns answers with respect to the keywords. The algorithm synthesizes the SPARQL query by exploring the Jaccard and set containment similarity measures between the property domains and ranges and class instance sets, observed in the RDF dataset.
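For reference, the two set similarity measures named above can be written in a few lines of Python. This is a sketch of the standard definitions only; how the algorithm combines them to synthesize a query is not covered here, and the example sets are hypothetical.

```python
def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|, defined as 0 when both sets are empty.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def containment(a, b):
    # Fraction of A that also belongs to B: |A ∩ B| / |A|.
    return len(a & b) / len(a) if a else 0.0

# Hypothetical observed range of one property and observed domain of another.
range_director = {"ex:p1", "ex:p2"}
domain_name = {"ex:p1", "ex:p2", "ex:p3"}

print(jaccard(range_director, domain_name))
print(containment(range_director, domain_name))
```

In this example every object of the first property is also a subject of the second, so the containment is 1 even though the Jaccard similarity is only 2/3. This asymmetry is one reason to consider both measures.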

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was partly funded by grants CAPES/88881.134081/2016-01, CNPq/302303/2017-0, and FAPERJ/E-26-202.818/2017 and E-26/200.770/2019. Carlos Oliveira was partially supported by the Project CEMAPRE/REM - UIDB/05069/2020 - financed by FCT/MCTES through national funds. The authors also wish to thank Dennis Dosso and Gianmaria Silvello for their invaluable help with the benchmarks used in Section 6.

References (48)

  • V. Hristidis, Y. Papakonstantinou, DISCOVER: keyword search in relational databases, in: Proceedings of the 28th...
  • P. Oliveira, A. Silva, E. Moura, Ranking Candidate Networks of relations to improve keyword search over relational...
  • M.S. Vinay, et al., Operator implementation of result set dependent KWS scoring functions, Inf. Syst. (2020)
  • S. Bergamaschi, et al., Keyword search over relational databases: a metadata approach
  • A.C. Oliveira Filho, Benchmark Para Métodos de Consultas Por Palavras-Chave a Bancos de Dados Relacionais (2018)
  • Y.T. Izquierdo, et al., QUIOW: A keyword-based query processing tool for RDF datasets and relational databases
  • Q. Zhou, C. Wang, M. Xiong, H. Wang, Y. Yu, SPARK: Adapting keyword query to semantic search, in: Proceedings of the...
  • M. Rihany, Z. Kedad, S. Lopes, Keyword Search Over RDF Graphs Using WordNet, in: Proceedings of the 1st Int’l. Conf. on...
  • S. Elbassuoni, R. Blanco, Keyword search over RDF graphs, in: Proceedings of the 20th ACM International Conference on...
  • A. Ghanbarpour, et al., A model-based keyword search approach for detecting top-k effective answers, Comput. J. (2019)
  • S. Han, L. Zou, X. Yu, D. Zhao, Keyword Search on RDF Graphs - A Query Graph Assembly Approach, in: Proceedings of the...
  • T. Tran, H. Wang, S. Rudolph, P. Cimiano, Top-k exploration of query candidates for efficient keyword search on...
  • W. Le, et al., Scalable keyword search on large RDF data, IEEE Trans. Knowl. Data Eng. (2014)
  • W. Zheng, et al., Semantic SPARQL similarity search over RDF knowledge graphs, Proc. VLDB Endow. 9 (2016)