Analysis and evaluation of document-oriented structures

https://doi.org/10.1016/j.datak.2021.101893

Abstract

Document-oriented bases allow high flexibility in data representation, which facilitates rapid application development and enables many possibilities for data structuring. Unfortunately, because of this flexibility and the absence of data modelling, developers often neglect the choice of a data representation. This leads to issues affecting several aspects of document base and application quality, e.g., memory footprint, data redundancy, readability and maintainability. We aim at facilitating the study of data structuring alternatives and providing objective metrics to better reveal the advantages and disadvantages of a structure with respect to user needs. The main contributions of our approach are twofold: first, the semi-automatic generation of many suitable alternatives for data structuring given an initial UML model; second, the automatic computation of structural metrics, allowing a comparison of the alternatives for JSON-compatible schema abstraction. These metrics reflect the complexity of the structure and are intended to be used as decision criteria in the schema analysis and design process. This work capitalises on experiences with MongoDB, XML and software complexity metrics. The paper presents the schema generation and the metrics together with a validation scenario where we discuss how to use the results in a schema recommendation perspective.

Introduction

Developing and maintaining applications and information systems is crucial for most organisations. For that, software and data engineering play an important role in meeting user requirements. Data management remains challenging despite the existence of a large variety of data management systems. Today’s relational and NoSQL systems provide data management solutions with diverse characteristics regarding performance, scalability, flexibility in data structuring, and querying among others.

This work focuses on data engineering when using document-oriented systems, more specifically systems storing JSON-like documents such as MongoDB, one of the most used NoSQL systems. Unlike relational DBMSs, most NoSQL systems are “schema-free”: they store semi-structured data without the prior creation of a database schema [1]. Data can be represented as collections of documents with atomic and complex attributes, including other embedded documents. Such flexibility enables rapid initial development and permits many data structuring possibilities, even for simple information. Even though there is no explicitly managed database schema, the characteristics of the data structures are crucial because of their potential impact on the quality of the applications [2]. Indeed, each data structure may have advantages and disadvantages regarding several aspects such as the memory footprint of the document base, data redundancy, navigation cost, data access facility, and programme readability and maintainability. Thus, data and software engineering issues are tightly interlaced.
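As a minimal sketch of how the same simple information can be structured in more than one way, the following Python snippet represents one author and two books either as a single collection with embedded documents or as two collections linked by an identifier. The `Author`/`Book` data, the collection names and the helper functions are hypothetical and are not taken from the paper.

```python
# Hypothetical data: one author with two books.

# Alternative A: a single collection, books embedded in the author document.
authors_embedded = [
    {
        "_id": 1,
        "name": "Ada",
        "books": [
            {"title": "Notes", "year": 1843},
            {"title": "Letters", "year": 1850},
        ],
    }
]

# Alternative B: two collections, each book references its author by id.
authors = [{"_id": 1, "name": "Ada"}]
books = [
    {"_id": 10, "title": "Notes", "year": 1843, "author_id": 1},
    {"_id": 11, "title": "Letters", "year": 1850, "author_id": 1},
]


def titles_embedded(author_name):
    """Alternative A: find the author, then walk the nested array."""
    return [
        book["title"]
        for author in authors_embedded
        if author["name"] == author_name
        for book in author["books"]
    ]


def titles_referenced(author_name):
    """Alternative B: resolve the author id, then scan the books collection."""
    ids = {a["_id"] for a in authors if a["name"] == author_name}
    return [b["title"] for b in books if b["author_id"] in ids]
```

Both functions answer the same question, but the access code differs: the embedded form navigates a nested structure, while the referenced form needs a join-like step in the application. This is the kind of trade-off between navigation cost, redundancy and readability that the paragraph above refers to.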

Careful analysis of data structuring choices and conscious decision making are important tasks, but they are not easy in the context of document-oriented data. On the one hand, there are no well-defined “good design” criteria analogous to normalisation theory for the relational data model. On the other hand, there are potentially too many structuring possibilities to perform an exhaustive study.

Our SCORUS project aims to make it easier for developers to improve data design and tuning decisions. Our proposal relies on two key points: first, the introduction of automation for the quick generation and analysis of data structuring alternatives; second, the proposal of structural metrics that provide objective indicators of the complexity of data structures.

Although most document-oriented systems do not support a database schema, we propose to work with a “schema” abstraction to assist users in the data modelling process. This abstraction facilitates the comprehension, assessment, and comparison of document-oriented data structures. The purpose is to clarify the possibilities and characteristics of each “schema” and to provide objective criteria for evaluating their advantages and disadvantages. In [3], we first proposed a set of structural metrics for JSON-compatible schema abstractions. These metrics reflect the complexity of the data structures and can be used to analyse criteria such as readability and maintainability. In this paper, we present the whole SCORUS proposal for assisting developers in designing data structures. We propose to use the metrics early in the design phases, but they can also be used to analyse existing document bases for audit and tuning purposes. For that kind of use, a reverse engineering effort is necessary to abstract the data structures.
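To give a concrete feel for what a structural metric over a schema abstraction can look like, here is a small Python sketch. It is an illustration only: the dictionary-based schema encoding and the two metrics (`max_depth` and `count_attributes`) are assumptions made for this example and are not the metrics defined in [3].

```python
# Hypothetical schema abstraction: a dict maps attribute names to a type
# name (atomic), a nested dict (embedded document) or a list (array).
embedded_schema = {
    "_id": "int",
    "name": "string",
    "books": [{"title": "string", "year": "int"}],
}
flat_schema = {"_id": "int", "title": "string", "year": "int", "author_id": "int"}


def max_depth(schema):
    """Nesting depth: atomic types count 0, each dict or list level adds 1."""
    if isinstance(schema, dict):
        return 1 + max((max_depth(v) for v in schema.values()), default=0)
    if isinstance(schema, list):
        return 1 + max((max_depth(v) for v in schema), default=0)
    return 0


def count_attributes(schema):
    """Total number of named attributes at every nesting level."""
    if isinstance(schema, dict):
        return sum(1 + count_attributes(v) for v in schema.values())
    if isinstance(schema, list):
        return sum(count_attributes(v) for v in schema)
    return 0
```

Under these assumed definitions, `embedded_schema` has depth 3 and 5 attributes, while `flat_schema` has depth 1 and 4 attributes. Numbers of this kind can be compared across alternatives without inspecting any stored data.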

In the SCORUS approach [4], developers model the data using UML classes. This model is processed to automatically generate a set of document-oriented structuring alternatives. Such alternatives can be visualised as AJSchemas, a human-friendly format. In an evaluation phase, SCORUS automatically evaluates structural metrics for each AJSchema. An analysis phase completes the approach based on metrics and user preferences [3], [5]. The ScorusTool prototype supports the process. As schema generation and metric evaluation are fully automated, users benefit from data structuring information with very little effort.

In this paper, we provide a global presentation of SCORUS and the main ideas and algorithmic choices for each phase. We first provide (Section 2) background on MongoDB and the motivation of this work. Then, in Section 3, we present the overview of SCORUS and we introduce AJSchema, the schema abstraction used to represent document structures. Section 4 is devoted to our proposal for generating structure alternatives. In Section 5, we propose structural metrics to measure important characteristics of structures. Section 6 is devoted to the validation of our proposal. It presents a scenario involving all the phases of the SCORUS approach. Related work is discussed in Section 7. Finally, conclusions and research perspectives are presented in Section 8.

Section snippets

Background and motivation

Among NoSQL proposals, document-oriented systems are known for the flexibility they offer to store complex semi-structured data. Our work is motivated by data design issues in systems like MongoDB [6] that store JSON-like documents. Data is managed as collections of documents (see Fig. 1), where a document is a set of attribute:value pairs. The value type can be atomic or complex. Complex means either an array of values of any type or another nested document. An attribute value can be the

SCORUS: Approach and system

In this section, we provide an overview of our proposal, SCORUS, for helping users in data structuring when using document-oriented models. We introduce the approach in Section 3.1 and then describe the format we use for representing data structures. The main proposals are developed in Section 4 and Section 5.

Generating document-oriented structuring alternatives

The objective of proposing a generator of structuring alternatives is to let users consider several possibilities easily, so as to assist them in the design process. As previously mentioned, the input is the UML class diagram of the data to store. End-users do not have to deal with the complexity of the internal generation process.
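The following Python sketch illustrates why automation helps here. It assumes, purely for illustration, that each association of the class diagram can be realised in one of two ways, by embedding or by referencing, and enumerates every combination. The association list, the `CHOICES` tuple and `enumerate_alternatives` are hypothetical and do not reproduce the SCORUS generation algorithm, which may consider other options.

```python
from itertools import product

# Hypothetical associations taken from a small UML class diagram.
associations = [("Author", "Book"), ("Book", "Publisher")]

# Assumed structuring options per association (illustrative only).
CHOICES = ("embed", "reference")


def enumerate_alternatives(assocs):
    """Return one dict per alternative, mapping each association to a choice."""
    return [
        dict(zip(assocs, combo))
        for combo in product(CHOICES, repeat=len(assocs))
    ]


alternatives = enumerate_alternatives(associations)
for alternative in alternatives:
    print(alternative)
```

With two options per association, n associations already give 2**n candidate structures, so manual enumeration quickly becomes impractical even for modest models.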

To

Structural metrics

The way data is structured influences the maintainability and usability of the database and its applications. When choosing a data structure, priorities can diverge. It is therefore useful to consider several structures and choose among them depending on the case. To facilitate the study of alternatives, we proposed the automatic generation of structures presented in the previous section. These structures can be visualised in a human-friendly format. There remains the question of

Validation scenario

As mentioned, our work aims at assisting users in the choice of document-oriented data structures. The proposed metrics, together with application priorities, will be used to establish criteria for choosing and comparing schemas. This is primarily to bring out the most suitable schema according to certain criteria but also to exclude unsuitable choices or to consider alternative schemas that were not necessarily obvious or considered initially.

In the following, we present a validation scenario

Related work

Related work concerns several subjects including data modelling, metrics and quality. We studied works concerning NoSQL systems [6], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], complex data [15], [37], [38], [39], XML documents [20], [21] and software metrics [22], [23], [24], [25], [26], [40].

Regarding metrics for complex data, Klettke et al. [20] proposed five structural metrics for XML. They are based on the software quality model ISO 9126. Using a graph representation of

Conclusion and perspectives

This work is motivated by the growing use of NoSQL systems and more precisely by the impact of quality issues on current and future information systems. We focus on data structuring in JSON documents, supported by MongoDB. Indeed, MongoDB is widely used and will probably be used for a long time. In such systems, there are significant aspects to consider such as the absence of a conceptual schema, the great flexibility of representing data with complex structures and the way of accessing data.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

Many thanks to G. Vega, J. Chavarriaga, M. Cortés, C. Labbé, E. Perrier and P. Lago for their comments on this work. Thanks to the Laboratoire d’Informatique de Grenoble and the University of Grenoble Alpes for their support.

References (47)

  • Eclipse, Eclipse Modeling Framework Project (EMF), Accessed: 2018-09-21,...
  • OMG, Unified Modeling Language (UML) Version 2.0, Accessed: 2018-09-21,...
  • Acher M., Managing Multiple Feature Models: Foundations, Languages and Applications (2011)
  • Acher M. et al., Comparing approaches to implement feature model composition
  • Kang K.C. et al., Feature-Oriented Domain Analysis (FODA) Feasibility Study, Tech. rep. (1990)
  • Gómez P. et al., Automatic schema generation for document-oriented systems
  • jsonSchema, JSON Schema,...
  • Maiwald B. et al., What are real JSON schemas like? - an empirical analysis of structural properties
  • Burgueño L. et al., A systematic approach to generate diverse instantiations for conceptual schemas
  • Gomaa H., Designing Software Product Lines with UML (2005)
  • Kang K.C. et al., Feature-oriented product line engineering, IEEE Softw. Mag. (2002)
  • Klettke M. et al., Metrics for XML document collections
  • Pušnik M. et al., XML Schema metrics for quality evaluation, Comput. Sci. Inform. Syst. (2014)
Paola Gómez received her Ph.D., Master's and Engineering degrees in Computer Science from the Grenoble Alpes (France), Andes (Colombia) and Distrital (Colombia) universities, respectively. She has 14 years of experience in Java development, modelling, meta-modelling, business processes and data, as well as in university research and teaching. She is particularly interested in data modelling, enterprise architecture and business processes. She works as a consultant and senior developer with responsibilities for architecture and modelling.

Claudia Roncancio is a professor of Computer Science at the Grenoble Institute of Technology. She received her Ph.D. (1994) and “Habilitation à diriger des recherches” (2004) degrees from the University of Grenoble, France. Her research interests lie within the areas of data management and distributed systems. She has published more than 120 international papers on complex data management, stream and event processing, query optimisation, and personalisation, among others.

Rubby Casallas is a professor of Computer Science at the Universidad de los Andes, Bogotá, Colombia, and head of the School of Engineering. She received her Ph.D. from the Institut National Polytechnique of Grenoble, France. Her research interests lie in software engineering.