Keyword search over schema-less RDF datasets by SPARQL query compilation
Introduction
Keyword search is typically associated with information retrieval systems, especially those designed for the Web. The user specifies a few terms, called keywords, and it is up to the system to retrieve the documents, such as Web pages, that best match the keywords. Keyword search over relational databases, as well as over RDF datasets, has also been studied for some time. The challenge, in this case, is to retrieve the objects that the keywords specify and to discover how they are interrelated. That is, an answer for a keyword search over a relational database, or an RDF dataset, is not just a set of objects, but a set of objects and relationships between them.
The keyword search tools proposed in the literature for the relational and the RDF environments have points in common. However, RDF datasets pose an additional challenge when no schema is defined, which is never the case for relational databases. This is the primary motivation for the research reported in this article: keyword search over schema-less RDF datasets.
Indeed, an RDF dataset does not require a predefined schema, that is, the user may introduce new classes and properties, without defining them a priori, or changing the current schema declaration if any. Hence, when compared to the relational model, RDF offers the flexibility of schema-less datasets but, in this case, an RDF keyword search tool must face the problem of retrieving sets of objects and their relationships without resorting to schema information for guidance.
There are two other features of RDF that should be mentioned in the context of keyword search. First, the adoption of RDF imposes no strict distinction between data and metadata, that is, a keyword may match the name or description of a class or property in the same way that it may match a data value. Second, an RDF dataset is equivalent to a labeled graph, called an RDF graph, which allows addressing RDF keyword search as a graph search problem. We will use the terms RDF dataset and RDF graph interchangeably in what follows.
In more detail, a keyword-based query is a set K of literals, or keywords. An answer for K over an RDF dataset T is a subset A of T such that: (i) A has triples that match some keywords in K; and (ii) A induces a connected RDF graph. Note that we can then compare answers based on the number of keywords they match and on their number of triples, as defined in detail in [1]. Based on these remarks, we can informally state the RDF keyword search problem (the RDF KwS-Problem) as: “Given an RDF dataset T and a keyword-based query K, find an answer A for K over T, preferably with as many keyword matches as possible and with as few triples as possible”.
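Conditions (i) and (ii) can be made concrete with a minimal Python sketch that checks whether a candidate set of triples qualifies as an answer. The case-insensitive substring matching rule, the function names, and the toy `ex:` triples are assumptions made for illustration only; they are not taken from the article.

```python
from collections import defaultdict

def matches(triple, keyword):
    """Assumed matching rule: case-insensitive substring match on any term."""
    kw = keyword.lower()
    return any(kw in str(term).lower() for term in triple)

def matched_keywords(answer, keywords):
    """Return the keywords matched by at least one triple of the answer."""
    return {k for k in keywords if any(matches(t, k) for t in answer)}

def is_connected(answer):
    """Check that the triples induce a connected graph (edges taken as undirected)."""
    adj = defaultdict(set)
    for s, _, o in answer:
        adj[s].add(o)
        adj[o].add(s)
    if not adj:
        return False
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == len(adj)

def is_answer(answer, dataset, keywords):
    """A is an answer if it is a subset of T, matches some keyword, and is connected."""
    return (answer <= dataset
            and bool(matched_keywords(answer, keywords))
            and is_connected(answer))

# Hypothetical toy dataset T and keyword-based query K.
T = {
    ("ex:film1", "ex:title", "Casablanca"),
    ("ex:film1", "ex:director", "ex:curtiz"),
    ("ex:curtiz", "ex:name", "Michael Curtiz"),
    ("ex:film2", "ex:title", "Vertigo"),
}
K = {"casablanca", "curtiz"}

A1 = {t for t in T if "ex:film2" not in t}            # three connected triples
A2 = {t for t in T if t[1] == "ex:title"}             # two unrelated title triples
```

Here `A1` satisfies both conditions and matches every keyword in `K`, whereas `A2` matches a keyword but fails condition (ii) because its two triples share no node.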
Let G_A be the RDF graph induced by an answer A. If G_A is a Steiner tree of T that covers the nodes that match keywords, then G_A is connected and does not have unnecessary edges. Therefore, a basic strategy to solve the RDF KwS-Problem would be to construct an algorithm that tries to find as many keyword matches as possible and, at the same time, find a Steiner tree of T that covers the matching nodes, which is a challenging task.
The central contribution of this article is an algorithm to address the RDF KwS-Problem by automatically translating a keyword-based query K into a SPARQL query so that the answers the SPARQL query returns are also answers for K. The novelty of the algorithm is that it neither relies on an RDF schema nor accesses the RDF graph during the compilation process, which is a relevant feature. Instead, it synthesizes SPARQL queries by exploring the similarity between the property domains and ranges and the class instance sets observed in the RDF dataset. To achieve good performance, even for large RDF datasets, the algorithm estimates set similarity based on KMV-synopses [2]. The KMV-synopses are pre-computed efficiently in a single pass over the RDF dataset and stored together with the RDF dataset for later use by the compilation process.
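The idea behind KMV-synopses can be illustrated with a short Python sketch: each set is summarized by the k smallest hash values of its elements, and the Jaccard similarity of two sets is then estimated from the two summaries alone. The hash function (SHA-1 mapped to [0, 1)), the default k = 64, and all function names are assumptions chosen for illustration; this is not the article's implementation.

```python
import hashlib

def h(value):
    """Map a value to a pseudo-random float in [0, 1) (SHA-1 is an assumed choice)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def kmv_synopsis(values, k=64):
    """Keep the k smallest distinct hash values of the input set.

    For brevity this sorts all hashes; a streaming version would keep
    a bounded heap while scanning the data once.
    """
    return sorted({h(v) for v in values})[:k]

def estimate_distinct(synopsis, k=64):
    """Estimate the number of distinct values as (k - 1) / (k-th smallest hash)."""
    if len(synopsis) < k:
        return float(len(synopsis))  # fewer than k values: the count is exact
    return (k - 1) / synopsis[-1]

def estimate_jaccard(syn_a, syn_b):
    """Estimate |A ∩ B| / |A ∪ B| from two KMV synopses."""
    k = min(len(syn_a), len(syn_b))
    if k == 0:
        return 0.0
    union = sorted(set(syn_a) | set(syn_b))[:k]   # synopsis of A ∪ B
    both = set(syn_a) & set(syn_b)
    return sum(1 for x in union if x in both) / k

# Hypothetical sets standing in for a property range and a class instance set.
range_of_p = [f"ex:person{i}" for i in range(0, 1000)]
instances_of_c = [f"ex:person{i}" for i in range(500, 1500)]
similarity = estimate_jaccard(kmv_synopsis(range_of_p), kmv_synopsis(instances_of_c))
```

The true Jaccard similarity of the two example sets is 1/3; the estimate in `similarity` approximates it using only 64 hash values per set, which is what makes the synopses cheap to store alongside the dataset.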
The second contribution consists of two sets of comprehensive experiments with an implementation of the algorithm. The first set of experiments shows that the implementation outperforms, in all metrics adopted, a baseline RDF keyword search tool that explores the RDF schema. These results suggest that schema information can indeed be replaced by pre-computed, concise KMV-synopses for the property domains and ranges, and class instance sets. These results are, to some extent, unexpected, since the lack of schema information seemed difficult to overcome from the outset. The second set of experiments indicates that the implementation performs better than the TSABM25 and TSAVDP keyword search systems over RDF datasets based on the “virtual documents” approach, using the metrics and the benchmarks proposed originally to assess these systems.
As a third contribution, we propose the Graph Relevance Ratio (GRR) to establish when an answer graph is relevant w.r.t. a ground truth graph. It is based on the number of relevant and non-relevant triples in the RDF graph, but it penalizes the presence of non-relevant triples, and does not memorize the relevant triples in previous rank positions.
The remainder of this article is organized as follows. Section 2 summarizes related work. Section 3 provides the necessary background. Section 4 describes the proposed algorithm to compile keyword-based queries into SPARQL queries. Section 5 covers the first set of experiments that compare an implementation of the proposed algorithm with a baseline RDF keyword search tool that explores the RDF schema. Section 6 describes the second set of experiments that compare the implementation with the state-of-the-art TSABM25 and TSAVDP keyword search systems over RDF datasets based on the “virtual documents” approach. Finally, Section 7 contains the conclusions and suggests directions for future research.
Related work
A survey of keyword-based query processing tools over relational databases and RDF datasets can be found in [3].
Early relational keyword-based query processing tools [4], [5], [6], [7], [8] explored the foreign/primary keys declared in the relational schema to compile a keyword-based query into an SQL query with a minimal set of join clauses, a key idea that is based on the notion of candidate networks (CNs). This approach was also adopted in recent tools [8], [9]. In particular, QUEST
RDF
An Internationalized Resource Identifier (IRI) is a global identifier that denotes a resource. RDF describes data as triples of the form (s,p,o), where s is the subject, p is the predicate or property, and o is the object of the triple [40]. The subject of a triple is an IRI or a blank node, the property is an IRI, and the object is an IRI, a blank node, or a literal. An RDF dataset is a set T of RDF triples; T is equivalent to a labeled graph whose set of nodes is the set of RDF terms that
A motivating example
This section describes a simplified example that illustrates how the proposed algorithm translates a keyword-based query K into a SPARQL query so that the answers the SPARQL query returns are also answers for K.
Example 1 Let T be the RDF dataset whose graph is shown in Fig. 1. In what follows, let dom(p) and rng(p) denote the observed domain and the observed range of a property p observed in T.
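As a rough illustration of what "observed" means here, the Python sketch below collects, in one scan over a list of triples, the subjects and objects seen with each property and the instances seen for each class. It assumes that the observed domain of a property is the set of subjects of its triples, that the observed range is the set of its objects, and that class instances are declared with `rdf:type`; these readings, the function name, and the toy `ex:` data are assumptions for illustration rather than the article's definitions.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # abbreviated IRI, used here for readability

def scan(triples):
    """Single pass over the triples, collecting three families of sets."""
    domain = defaultdict(set)     # property -> observed subjects
    range_ = defaultdict(set)     # property -> observed objects
    instances = defaultdict(set)  # class    -> observed instances
    for s, p, o in triples:
        if p == RDF_TYPE:
            instances[o].add(s)
        else:
            domain[p].add(s)
            range_[p].add(o)
    return domain, range_, instances

# Hypothetical toy dataset.
triples = [
    ("ex:film1", "rdf:type", "ex:Film"),
    ("ex:film1", "ex:directedBy", "ex:p1"),
    ("ex:p1", "rdf:type", "ex:Person"),
    ("ex:p1", "ex:name", "Michael Curtiz"),
]
domain, range_, instances = scan(triples)
```

In this toy dataset the observed range of `ex:directedBy` coincides with the observed instance set of `ex:Person`, which is the kind of overlap that can stand in for a declared schema.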
Pre-processing. Before processing any keyword-based query, the algorithm executes a single scan of T that simultaneously pre-computes
Comparison with a schema-based RDF keyword search tool
This section describes a set of experiments that compares the schema-less approach proposed in this article with a state-of-the-art schema-based RDF keyword search tool, adopted as baseline. The experiments are based on an RDF keyword search benchmark [45], with: two RDF datasets triplified from the IMDb and Mondial databases; the RDF schema of each of these RDF datasets; two lists of keyword-based queries, one for each of
Comparison with keyword search systems based on the “virtual documents” approach
This section describes a second set of experiments that compares the schema-less approach proposed in this article with the state-of-the-art TSABM25 and TSAVDP keyword search systems over RDF datasets based on the “virtual documents” approach [30], adopted as baselines. The experiments are based on the same datasets and queries adopted in [30].
Conclusions and future work
This article addressed the problem of implementing keyword search for RDF datasets that do not necessarily feature an RDF schema. It introduced a novel algorithm to automatically translate a user-specified keyword-based query into a SPARQL query that returns answers with respect to the keywords. The algorithm synthesizes the SPARQL query by exploring the Jaccard and set containment similarity measures between the property domains and ranges and class instance sets, observed in the RDF dataset.
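For reference, the two similarity measures named above have simple exact definitions over finite sets, sketched below in Python. The function names and the example sets (a hypothetical property range and class instance set) are illustrative assumptions; the article estimates these quantities from synopses rather than computing them exactly.

```python
def jaccard(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B| (0.0 when both sets are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def containment(a, b):
    """Set containment of A in B: |A ∩ B| / |A| (0.0 when A is empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0

# Hypothetical observed range of a property and observed instances of a class.
range_directed_by = {"ex:d1", "ex:d2", "ex:d3"}
instances_person = {"ex:d1", "ex:d2", "ex:d3", "ex:a1", "ex:a2", "ex:a3"}

j = jaccard(range_directed_by, instances_person)      # 3 / 6
c = containment(range_directed_by, instances_person)  # 3 / 3
```

The example shows why both measures are useful: the Jaccard similarity is only 0.5 because the class has many other instances, while the containment of 1.0 reveals that every object of the property is an instance of the class.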
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was partly funded by grants CAPES/88881.134081/2016-01, CNPq/302303/2017-0, and FAPERJ/E-26-202.818/2017 and E-26/200.770/2019. Carlos Oliveira was partially supported by the Project CEMAPRE/REM - UIDB/05069/2020 - financed by FCT/MCTES through national funds. The authors also wish to thank Dennis Dosso and Gianmaria Silvello for their invaluable help with the benchmarks used in Section 6.
References (48)
- et al., Combining user and database perspective for solving keyword queries over relational databases, Inf. Syst. (2016)
- et al., From keywords to relational database content: A semantic mapping method, Inf. Syst. (2020)
- et al., From keywords to semantic queries - incremental query construction on the semantic web, Web Semant. Sci. Serv. Agents World Wide Web (2009)
- et al., LUBM: A benchmark for OWL knowledge base systems, J. Web Semantics (2005)
- G.M. García, Y.T. Izquierdo, E. Menendez, F. Dartayre, M.A. Casanova, RDF Keyword-based Query Technology Meets a...
- K. Beyer, P.J. Haas, B. Reinwald, Y. Sismanis, R. Gemulla, On synopses for distinct-value estimation under multiset...
- et al., Semantic search on text and knowledge bases, Found. and Trends Info. Retr. (2016)
- B. Aditya, G. Bhalotia, S. Chakrabarti, A. Hulgeri, C. Nakhe, Parag, S. Sudarshan, BANKS: Browsing and keyword...
- S. Agrawal, S. Chaudhuri, G. Das, DBXplorer: A system for keyword-based search over relational databases, in:...
- H. He, H. Wang, J. Yang, P.S. Yu, Blinks: Ranked keyword searches on graphs, in: Proceedings 2007 ACM SIGMOD...
- Operator implementation of result set dependent KWS scoring functions, Inf. Syst.
- Keyword search over relational databases: a metadata approach
- Benchmark Para Métodos de Consultas Por Palavras-Chave a Bancos de Dados Relacionais
- QUIOW: A keyword-based query processing tool for RDF datasets and relational databases
- A model-based keyword search approach for detecting top-k effective answers, Comput. J.
- Scalable keyword search on large RDF data, IEEE Trans. Knowl. Data Eng.
- Semantic SPARQL similarity search over RDF knowledge graphs, Proc. VLDB Endow. 9