Elsevier

Information Systems

Volume 93, November 2020, 101551

SchemaDecrypt++: Parallel on-line Versioned Schema Inference for Large Semantic Web Data sources

https://doi.org/10.1016/j.is.2020.101551

Highlights

  • A novel approach for parallel discovery of schema versions in a remote source.

  • Overcoming source querying restrictions and combinatorial explosion of versions.

  • SchemaDecrypt++, the implementation of our approach, is publicly available.

  • A powerful tool for discovering the hidden structure of Web data.

  • Experiments on real remote sources with access restrictions.

Abstract

A growing number of linked data sources are published on the Web. They form a single huge data space referred to as the Web of data. These data sources contain both the data and the schema describing them, but the data is not constrained by this schema. Indeed, two instances of the same class may be described by different properties. This flexibility for describing the data eases their evolution, but it comes at the cost of losing the description of the data, which can be useful in many contexts. The different structures of a class represent its versions. These versions provide useful information on property co-occurrence for a class, but their discovery can be very costly, and even impossible because the data sources are remote. Furthermore, they may have some access limitations, either on the query execution time, or on the number of queries, or on the size of the results.

In this paper, we present SchemaDecrypt++, a novel approach for the parallel discovery of a versioned schema for a remote data source. Our approach discovers the versions on-line, without uploading or browsing the data source. Broadly speaking, SchemaDecrypt++ discovers co-occurrences between properties from any set of properties: (i) specified by the user; (ii) describing the instances of a class; or (iii) specified in the schema. SchemaDecrypt++ relies on our previous approach for schema discovery, SchemaDecrypt; in the present work we introduce a new strategy for parallelizing the exploration of class versions, based on the discovery of a set of occurrence rules between the properties of the class. This strategy makes it possible to overcome the source querying restrictions and the combinatorial explosion of the candidate versions, and it improves performance. We present experimental evaluations on DBpedia to demonstrate the effectiveness of our approach.

Introduction

Modern applications dealing with huge collections of data have exposed the limitations of relational database management systems, leading both researchers and companies to explore non-traditional ways of storing data. This has motivated the development of a continuously growing number of new data models designed to meet the requirements of such applications. These requirements include a very flexible, schema-less data model, the ability to represent complex data, and scalability.

Users and applications are also provided with a huge amount of data on the Web. This Web of data is enabled by the standard languages provided by the W3C for describing data, such as RDF(S)/OWL. Data is made available through query endpoints, where users and applications can issue queries expressed in dedicated query languages such as SPARQL.

Languages used to describe data in the semantic Web provide high flexibility because they lack an explicit or strict schema for the data. RDF(S)/OWL data sources can store data with different structures for the same class, and data evolution is eased because no restrictions are imposed on the data structure. However, this lack of structure makes querying these data sources more difficult.

The different structures of the instances of a class represent the different versions of this class. Class versions could be viewed as a summary of the co-occurrence between the properties, which is useful for many purposes such as formulating queries, providing a description of the data, identifying the relevant sources for a specific usage, decomposing queries over distributed data sources and optimizing their execution plan.

Our goal is to infer a versioned schema for a remote RDF data source, i.e. versions of the classes defined in the schema. In our previous work [1], we proposed SchemaDecrypt, an on-line approach which discovers the versions of each class in the schema, along with the number of occurrences of each one. Our approach does not require uploading or browsing the data to find the class versions; it is therefore suitable for large, evolving data sources. In this paper, we propose SchemaDecrypt++, an extension of our approach enabling the parallel exploration of the candidate versions of a class. We have conducted experiments with both SchemaDecrypt and SchemaDecrypt++ on DBpedia, which is a real remote data source. The results show that our extended approach achieves a significant performance improvement.

The remainder of this paper is organized as follows. We motivate our approach for discovering a versioned schema in Section 2, then we present the baseline approach and its challenges in Section 3. In Section 4, we present SchemaDecrypt, our approach for discovering class versions. We propose a parallel exploration of class versions with SchemaDecrypt++ in Section 5. We discuss the cost of our approach in Section 6. In Section 7, we present our evaluation methodology and the results achieved on a real remote data source. We then discuss related work in Section 8 and conclude in Section 9.

Section snippets

Motivation

A data source described in RDF(S)/OWL is defined as a set of triples D ⊆ (R ∪ B) × Y × (R ∪ B ∪ L), where the sets R, B, Y and L represent resources, blank nodes, properties and literals respectively. Such data sources are subject to constant evolution, and the nature of the languages used to describe them does not impose any constraint on the structure of the data: instances of the same class may have different properties.
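To make the notion of differing structures concrete, the following minimal sketch groups the instances of a class by their set of properties. It is only an illustration: the toy triples, the `ex:` names and the helper `class_versions` are hypothetical, the data is held locally (whereas the approach described here works on-line without browsing the source), and only outgoing properties are considered.

```python
from collections import Counter, defaultdict

RDF_TYPE = "rdf:type"

# Hypothetical toy triples (subject, property, object); not taken from the paper.
TRIPLES = [
    ("ex:a", RDF_TYPE, "ex:Person"),
    ("ex:a", "ex:name", "Ann"),
    ("ex:a", "ex:birthDate", "1980"),
    ("ex:b", RDF_TYPE, "ex:Person"),
    ("ex:b", "ex:name", "Bob"),
    ("ex:c", RDF_TYPE, "ex:Person"),
    ("ex:c", "ex:name", "Cy"),
    ("ex:c", "ex:birthDate", "1975"),
]


def class_versions(triples, cls):
    """Count the distinct property sets found among the instances of cls."""
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    props = defaultdict(set)
    for s, p, _ in triples:
        if s in instances and p != RDF_TYPE:
            props[s].add(p)
    # Each distinct property set plays the role of one structural version.
    return Counter(frozenset(props[s]) for s in instances)


versions = class_versions(TRIPLES, "ex:Person")
for version, count in versions.items():
    print(sorted(version), count)
```

Here two instances share the property set {ex:name, ex:birthDate} and one has only ex:name, so the toy class has two structures with their occurrence counts.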

Fig. 1 shows an example of a user who wants to find the different descriptions of a

Baseline approach and challenges

To find the versioned schema of a data source, we have to find the different versions of each class. In this section, we first discuss the set of input properties according to the user’s needs. We then define the class versions and present the version discovery process as a combinatorial problem, which highlights the main challenges of discovering a versioned schema. Finally, we present the restrictions imposed by the data sources in our setting.
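The combinatorial nature of the problem can be sketched as follows, assuming that a candidate version is any subset of the input properties and that each candidate is tested with one counting query. The helper names `candidate_versions` and `count_query`, the `http://example.org/` IRIs and the query shape are assumptions made for illustration; they are not the queries used by the authors.

```python
from itertools import combinations


def candidate_versions(properties):
    """Yield every subset of the input properties, largest subsets first."""
    props = sorted(properties)
    for size in range(len(props), -1, -1):
        for combo in combinations(props, size):
            yield frozenset(combo)


def count_query(cls, version, properties):
    """Build a SPARQL query counting instances of cls that have exactly
    the properties in version among the input properties."""
    present = [f"  ?s <{p}> ?o{i} ." for i, p in enumerate(sorted(version))]
    missing = sorted(set(properties) - set(version))
    absent = [
        f"  FILTER NOT EXISTS {{ ?s <{p}> ?x{i} }}"
        for i, p in enumerate(missing)
    ]
    body = "\n".join([f"  ?s a <{cls}> ."] + present + absent)
    return f"SELECT (COUNT(DISTINCT ?s) AS ?n) WHERE {{\n{body}\n}}"


# Hypothetical class and property IRIs.
PROPS = [f"http://example.org/{name}" for name in ("p", "q", "r")]
candidates = list(candidate_versions(PROPS))
query = count_query("http://example.org/C", {PROPS[0]}, PROPS)
print(len(candidates))
print(query)
```

With n input properties this naive enumeration produces 2^n candidates, hence 2^n queries, which quickly conflicts with limits on the number of queries or on execution time.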

SchemaDecrypt: Enabling on-line discovery of schema versions

Finding the versioned schema of a data source consists in finding the different versions of its classes. In order to find the versions of a class from a large remote data source, we propose the SchemaDecrypt approach. It is based on the construction of a probabilistic class profile which makes it possible to: (i) guide the exploration of candidate versions by testing the most probable versions first; (ii) reduce the search space of candidate versions and (iii) define a stopping criterion for the
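A minimal sketch of profile-guided exploration is given below. It rests on assumptions that are not stated in the text above: the profile is taken to map each property to the fraction of class instances that have it, properties are treated as independent when scoring a candidate, and exploration stops once the counted instances cover the whole class. The names `version_probability`, `ranked_candidates`, `discover` and the callback `count_fn` are hypothetical.

```python
from itertools import combinations
from math import prod


def version_probability(version, profile):
    """Score a candidate, assuming independent properties (an assumption)."""
    return prod(
        p if prop in version else 1.0 - p for prop, p in profile.items()
    )


def ranked_candidates(profile):
    """Return all property subsets, most probable first."""
    props = sorted(profile)
    subsets = [
        frozenset(c)
        for k in range(len(props) + 1)
        for c in combinations(props, k)
    ]
    return sorted(
        subsets, key=lambda v: version_probability(v, profile), reverse=True
    )


def discover(profile, total_instances, count_fn):
    """Test candidates in ranked order; stop once every instance is covered."""
    found, covered = {}, 0
    for version in ranked_candidates(profile):
        if covered >= total_instances:
            break  # assumed stopping criterion: all instances accounted for
        n = count_fn(version)  # stands in for one query sent to the source
        if n > 0:
            found[version] = n
            covered += n
    return found


# Hypothetical profile and instance counts standing in for a remote source.
PROFILE = {"a": 0.9, "b": 0.2}
COUNTS = {frozenset({"a"}): 8, frozenset({"a", "b"}): 2}
calls = []


def fake_count(version):
    calls.append(version)
    return COUNTS.get(version, 0)


found = discover(PROFILE, total_instances=10, count_fn=fake_count)
print(found, len(calls))
```

In this toy run the two existing versions are the two most probable candidates, so only two of the four possible queries are issued before the stopping condition holds.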

SchemaDecrypt++: Parallel and on-line discovery of class versions

In this section, we present an extension of SchemaDecrypt which consists in parallelizing the exploration of candidate class versions. Two versions can be tested in parallel if their sets of instances are disjoint. In order to parallelize the discovery process, we propose to identify sets of versions that do not overlap, and we represent them using the notion of version template.

In Section 5.1, we present the generation of version templates that can be explored in parallel.
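The idea of exploring non-overlapping sets of versions in parallel can be sketched as follows. The sketch assumes one particular reading that the text above does not spell out: an exclusion rule states that two properties never co-occur, and a version template is a pair (required properties, forbidden properties). The helpers `templates_from_exclusion`, `versions_for_template`, `explore` and `explore_in_parallel`, and the callback `count_fn`, are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def templates_from_exclusion(p, q):
    """Split the search space using the rule 'p and q never co-occur'."""
    return [
        (frozenset({p}), frozenset({q})),   # p present, q absent
        (frozenset({q}), frozenset({p})),   # q present, p absent
        (frozenset(), frozenset({p, q})),   # neither present
    ]


def versions_for_template(template, properties):
    """Yield the candidate versions compatible with one template."""
    required, forbidden = template
    free = sorted(set(properties) - required - forbidden)
    for k in range(len(free) + 1):
        for combo in combinations(free, k):
            yield required | frozenset(combo)


def explore(template, properties, count_fn):
    """Test every candidate of one template; keep those with instances."""
    found = {}
    for version in versions_for_template(template, properties):
        n = count_fn(version)  # stands in for one query sent to the source
        if n > 0:
            found[version] = n
    return found


def explore_in_parallel(p, q, properties, count_fn, workers=3):
    """Explore the templates derived from one exclusion rule concurrently."""
    found = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(explore, t, properties, count_fn)
            for t in templates_from_exclusion(p, q)
        ]
        for future in futures:
            found.update(future.result())
    return found


# Hypothetical properties and counts standing in for a remote source.
PROPS = ["a", "b", "c"]
COUNTS = {frozenset({"a", "c"}): 5, frozenset({"b"}): 3, frozenset(): 1}
per_template = [
    set(versions_for_template(t, PROPS))
    for t in templates_from_exclusion("a", "b")
]
result = explore_in_parallel("a", "b", PROPS, lambda v: COUNTS.get(v, 0))
print(result)
```

Because the three templates share no candidate version, the workers never test the same version twice, and the candidates containing both excluded properties are never generated (6 candidates instead of 8 in this toy case).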

Analyzing the cost of version discovery

In this section, we discuss the cost of the proposed approach. Each exclusion rule makes it possible to parallelize the exploration of the candidate versions. However, for parallel exploration to actually improve performance, the data source must be able to process multiple queries in parallel. In this analysis of the cost of our approach, we consider the worst case, where the source can only process one query at a time.

The number of queries sent to a data source reflects the cost of an on-line approach

Evaluation

This section presents experimental results obtained using SchemaDecrypt++ to find the different versions of a class. We have evaluated the performance of SchemaDecrypt and compared it to that of SchemaDecrypt++, to show the effect of parallelism and dynamic pruning of the exploration graph on a real data source. We have also illustrated the usefulness of versions for the example presented in the motivation section.

Related work

Existing approaches for discovering the structural versions of a data source target local JSON data sources [12], [13], [14], local RDF data sources [15], [16], [17], [18], streamed RDF data [19] or distributed RDF data sources [20]. Unlike our approach, all of these approaches only consider the outgoing properties of the instances. In addition, they require browsing the data to find the structural versions, making their use impossible on remote data sources.

Some of these structural

Conclusion

We have proposed SchemaDecrypt, the first on-line approach for discovering the versioned schema of a large remote data source, without having to upload or browse the data. To find the different versions of a class, we propose to build a probabilistic class profile to guide the exploration of the candidate versions. We reduce the number of candidate versions by discovering inclusion and exclusion rules between the properties of a class.

We have also proposed a parallel exploration of versions

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was partially funded by the French National Research Agency (CAIR ANR-14-CE23-0006 project).

References (35)

  • M. Konrath et al., Schemex: efficient construction of a data catalogue by stream-based indexing of linked data, Web Semant. Sci. Serv. Agents World Wide Web (2012)
  • K. Kellou-Menouer, Z. Kedad, On-line versioned schema inference for large semantic web data sources, in: Proceedings of...
  • B. Quilitz et al., Querying distributed RDF data sources with SPARQL
  • K. Kellou-Menouer, Z. Kedad, Evaluating the gap between an RDF dataset and its schema, in: Conceptual Modeling - 34th...
  • K. Kellou-Menouer, Z. Kedad, Schema discovery in RDF data sources, in: Proceedings of the 34th International Conference...
  • H. Paulheim, C. Bizer, Type inference on noisy RDF data, in: The Semantic Web - ISWC 2013 - 12th International Semantic...
  • J. Völker et al., Statistical schema induction
  • K. Christodoulou et al., Structure inference for linked data sources using clustering, Trans. Large-Scale-Data Knowl.-Cent. Syst. (2015)
  • Q.Y. Wang, J.X. Yu, K. Wong, Approximate graph schema extraction for semi-structured data, in: Advances in Database...
  • A. Gangemi, A.G. Nuzzolese, V. Presutti, F. Draicchio, A. Musetti, P. Ciancarini, Automatic typing of DBpedia entities,...
  • C. Swenson, Modern Cryptanalysis: Techniques for Advanced Code Breaking (2012)
  • J. Lehmann et al., DBpedia–a large-scale, multilingual knowledge base extracted from wikipedia, Semant. Web (2015)
  • D.S. Ruiz, S.F. Morales, J.G. Molina, Inferring versioned schemas from NoSQL databases and its applications, in: ER,...
  • M.A. Baazizi, H.B. Lahmar, D. Colazzo, G. Ghelli, C. Sartiani, Schema inference for massive JSON datasets, in:...
  • M.A. Baazizi et al., Parametric schema inference for massive JSON datasets, VLDB J. (2019)
  • M. Zneika et al., RDF graph summarization based on approximate patterns
  • M. Zneika, C. Lucchese, D. Vodislav, D. Kotzinos, Summarizing linked data RDF graphs using approximate graph pattern...