SchemaDecrypt++: Parallel on-line Versioned Schema Inference for Large Semantic Web Data sources
Introduction
Modern applications dealing with huge collections of data have exposed the limitations of relational database management systems, leading both researchers and companies to explore non-traditional ways of storing data. This has motivated the development of a continuously growing number of new data models designed to meet the requirements of such applications. These requirements include a very flexible, schema-less data model, the ability to represent complex data, and scalability.
Users and applications are also provided with a huge amount of data on the Web. This Web of data is enabled by the standard languages provided by the W3C for describing data, such as RDF(S)/OWL. Data is made available through query endpoints, where users and applications can issue queries expressed in dedicated query languages such as SPARQL.
Languages used to describe data in the semantic Web provide high flexibility because they do not require an explicit or strict schema for the data. RDF(S)/OWL data sources can store data with different structures for the same class, and data evolution is eased by the lack of restrictions imposed on the data structure. However, this lack of structure makes these data sources more difficult to query.
The different structures of the instances of a class represent the different versions of this class. Class versions could be viewed as a summary of the co-occurrence between the properties, which is useful for many purposes such as formulating queries, providing a description of the data, identifying the relevant sources for a specific usage, decomposing queries over distributed data sources and optimizing their execution plan.
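As a minimal illustration of the notion of class version, the following Python sketch groups the instances of a class by the exact set of properties they use and counts each group. The instance identifiers and property names are hypothetical, and the sketch browses a small local collection, which is precisely what the on-line approach avoids; it only illustrates the definition.

```python
from collections import Counter

# Toy data (hypothetical): each instance of one class maps to the set of
# properties that describe it.
instances = {
    "ex:alice": {"name", "birthDate"},
    "ex:bob": {"name", "birthDate"},
    "ex:carol": {"name", "email"},
    "ex:dave": {"name"},
}


def class_versions(instance_properties):
    """Count how many instances share each distinct property set."""
    return Counter(frozenset(props) for props in instance_properties.values())


versions = class_versions(instances)
for props, count in versions.most_common():
    print(sorted(props), count)
```

Each distinct property set plays the role of one version of the class, and its count is the number of occurrences of that version.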
Our goal is to infer a versioned schema for a remote RDF data source, i.e. the versions of the classes defined in the schema. In our previous work [1], we proposed an on-line approach which discovers the versions of each class in the schema, along with the number of occurrences of each one. Our approach does not require uploading or browsing the data to find the class versions; it is therefore suitable for large evolving data sources. In this paper, we propose SchemaDecrypt++, an extension of our approach enabling the parallel exploration of the candidate versions of a class. We have conducted experiments with both SchemaDecrypt and SchemaDecrypt++ on DBpedia, which is a real remote data source. The results show that our extended approach achieves a significant performance improvement.
The remainder of this paper is organized as follows. We motivate our approach for discovering a versioned schema in Section 2, then we present the baseline approach and its challenges in Section 3. In Section 4, we present SchemaDecrypt, our approach for discovering class versions. We propose a parallel exploration of class versions with SchemaDecrypt++ in Section 5. We discuss the cost of our approach in Section 6. In Section 7, we present our evaluation methodology and the results achieved on a real remote data source. We then discuss related work in Section 8, and we conclude in Section 9.
Motivation
A data source described in RDF(S)/OWL is defined as a set of triples built over sets of resources, blank nodes, properties and literals. Such data sources are subject to constant evolution, and the nature of the languages used to describe them does not impose any constraint on the structure of the data: instances of the same class may have different properties.
Fig. 1 shows an example of a user who wants to find the different descriptions of a
Baseline approach and challenges
To find the versioned schema of a data source, we have to find the different versions of each class. In this section, we first discuss the set of input properties according to the user’s needs. We then define the class versions and present the version discovery process as a combinatorial problem, which highlights the main challenges of discovering a versioned schema. Finally, we present the restrictions imposed by the data sources in our setting.
SchemaDecrypt: Enabling on-line discovery of schema versions
Finding the versioned schema of a data source consists in finding the different versions of its classes. In order to find the versions of a class from a large remote data source, we propose the SchemaDecrypt approach. It is based on the construction of a probabilistic class profile which makes it possible to: (i) guide the exploration of candidate versions by testing the most probable versions first; (ii) reduce the search space of candidate versions and (iii) define a stopping criterion for the
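The idea of testing the most probable candidate versions first can be sketched as follows. This is only an illustration under strong assumptions that are not taken from the paper: the profile is a hypothetical dictionary giving, for each property, the estimated probability that an instance of the class has it, and properties are assumed to occur independently of one another.

```python
from itertools import combinations
from math import prod

# Hypothetical profile: estimated probability that an instance of the class
# has each property (values invented for illustration).
profile = {"name": 0.9, "email": 0.6, "phone": 0.2}


def candidate_probability(candidate, profile):
    """Probability of a candidate version, ASSUMING independent properties."""
    return prod(p if prop in candidate else 1.0 - p for prop, p in profile.items())


def ranked_candidates(profile):
    """Enumerate all non-empty property subsets, most probable first."""
    props = sorted(profile)
    candidates = [
        frozenset(combo)
        for size in range(1, len(props) + 1)
        for combo in combinations(props, size)
    ]
    return sorted(
        candidates,
        key=lambda cand: candidate_probability(cand, profile),
        reverse=True,
    )


ranking = ranked_candidates(profile)
for cand in ranking:
    print(sorted(cand), round(candidate_probability(cand, profile), 3))
```

With such an ordering, an exploration could query the source for the candidates at the head of the list first, and the probabilities of the remaining candidates could feed a stopping test.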
SchemaDecrypt++: Parallel and on-line discovery of class versions
In this section, we present an extension of SchemaDecrypt which consists in parallelizing the exploration of candidate class versions. Two versions can be tested in parallel if their sets of instances are disjoint. In order to parallelize the discovery process, we propose to identify sets of versions that do not overlap, and we represent them using the notion of version template.
We present the generation of version templates that can be explored in parallel in Section 5.1.
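The following sketch illustrates how an exclusion rule could split candidate versions into groups that are explored concurrently. It is a simplified illustration, not the template generation of Section 5.1: the function names, the grouping into three templates and the `count_instances` callback (a stand-in for a count query sent to the remote source) are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_by_exclusion(candidates, p, q):
    """Group candidates into templates using an exclusion rule between p and q.

    Assumes that no instance has both p and q, so candidates containing
    both properties are dropped.
    """
    templates = {"with_p": [], "with_q": [], "neither": []}
    for cand in candidates:
        if p in cand and q in cand:
            continue  # ruled out by the exclusion rule
        key = "with_p" if p in cand else "with_q" if q in cand else "neither"
        templates[key].append(cand)
    return templates


def explore(template, count_instances):
    """Test every candidate of one template; keep those with instances."""
    found = {}
    for cand in template:
        n = count_instances(cand)
        if n > 0:
            found[cand] = n
    return found


def explore_in_parallel(templates, count_instances):
    """Explore each template in its own worker thread and merge the results."""
    with ThreadPoolExecutor(max_workers=len(templates)) as pool:
        partials = list(
            pool.map(lambda t: explore(t, count_instances), templates.values())
        )
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged


# Stand-in for remote count queries (hypothetical data).
fake_counts = {frozenset({"name", "isbn"}): 3, frozenset({"name", "issn"}): 2}
candidates = [
    frozenset({"name", "isbn"}),
    frozenset({"name", "issn"}),
    frozenset({"name", "isbn", "issn"}),
    frozenset({"name"}),
]
templates = split_by_exclusion(candidates, "isbn", "issn")
versions = explore_in_parallel(templates, lambda c: fake_counts.get(c, 0))
print(versions)
```

Threads are used here because the work is dominated by waiting for remote answers; the gain depends on the source being able to process several queries at once.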
Analyzing the cost of version discovery
In this section, we discuss the cost of the proposed approach. Each exclusion rule makes it possible to parallelize the exploration of the candidate versions. However, for parallel exploration to actually improve performance, the data source must be able to process multiple queries in parallel. In this analysis of the cost of our approach, we consider the worst case, where the source can only process one query at a time.
The number of queries sent to a data source reflects the cost of an on-line approach
Evaluation
This section presents experimental results using SchemaDecrypt++ to find the different versions of a class. We have evaluated the performance of SchemaDecrypt++ and compared it to that of SchemaDecrypt, to show the effect of parallelism and dynamic pruning of the exploration graph on a real data source. We have also illustrated the usefulness of versions for the example presented in the motivation section.
Related work
Existing approaches for discovering the structural versions of a data source target local JSON data sources [12], [13], [14], local RDF data sources [15], [16], [17], [18], streamed RDF data [19] or distributed RDF data sources [20]. Unlike our approach, all of these approaches only consider the outgoing properties of the instances. In addition, they require browsing the data to find the structural versions, which makes them unusable on remote data sources.
Some of these structural
Conclusion
We have proposed SchemaDecrypt, the first on-line approach for discovering the versioned schema of a large remote data source, without having to upload or browse the data. To find the different versions of a class, we propose to build a probabilistic class profile to guide the exploration of the candidate versions. We reduce the number of candidate versions by discovering inclusion and exclusion rules between the properties of a class.
We have also proposed a parallel exploration of versions
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This work was partially funded by the French National Research Agency (CAIR ANR-14-CE23-0006 project).
References (35)
- et al., Schemex: efficient construction of a data catalogue by stream-based indexing of linked data, Web Semant. Sci. Serv. Agents World Wide Web (2012)
- K. Kellou-Menouer, Z. Kedad, On-line versioned schema inference for large semantic web data sources, in: Proceedings of...
- et al., Querying distributed RDF data sources with SPARQL
- K. Kellou-Menouer, Z. Kedad, Evaluating the gap between an RDF dataset and its schema, in: Conceptual Modeling - 34th...
- K. Kellou-Menouer, Z. Kedad, Schema discovery in RDF data sources, in: Proceedings of the 34th International Conference...
- H. Paulheim, C. Bizer, Type inference on noisy RDF data, in: The Semantic Web - ISWC 2013 - 12th International Semantic...
- et al., Statistical schema induction
- et al., Structure inference for linked data sources using clustering, Trans. Large-Scale-Data Knowl.-Cent. Syst. (2015)
- Q.Y. Wang, J.X. Yu, K. Wong, Approximate graph schema extraction for semi-structured data, in: Advances in Database...
- A. Gangemi, A.G. Nuzzolese, V. Presutti, F. Draicchio, A. Musetti, P. Ciancarini, Automatic typing of DBpedia entities,...