Information Systems

Volume 91, July 2020, 101495

Discovering and merging related analytic datasets

https://doi.org/10.1016/j.is.2020.101495

Highlights

  • Attribute graphs for literal functional dependencies in hierarchical dimensions.

  • Schema augmentation and reduction operations to eliminate tuple multiplication.

  • Quality criteria and automatic repair operations for merging schema augmentations.

  • Detailed description of the implementation within the SAP HANA platform.

  • Experimental evaluations on real datasets and usage scenarios.

Abstract

The production of analytic datasets is a significant big data trend and has gone well beyond the scope of traditional IT-governed dataset development. Analytic datasets are now created by data scientists and data analysts using big data frameworks and agile data preparation tools. However, despite the profusion of available datasets, it remains quite difficult for a data analyst to start from a dataset at hand and customize it with additional attributes coming from other existing datasets. This article describes a model and algorithms that exploit automatically extracted and user-defined semantic relationships for extending analytic datasets with new atomic or aggregated attribute values. Our framework is implemented as a REST service in SAP HANA and includes a careful theoretical analysis and practical solutions for several complex data quality issues.

Introduction

Enterprise Business Intelligence (BI) provides business users with solutions for managed data reporting, analysis, aggregation and visualization. These solutions heavily rely on trusted and well-documented analytic datasets comprising multidimensional facts that hold measures and refer to one or more hierarchical dimensions [1]. Traditionally, these analytic datasets are created by the IT department in the form of data warehouses and data marts [2], or by enterprise application software vendors in the form of predefined and customizable analytic models. For example, SAP provisions thousands of predefined and customizable analytic datasets, also known as “virtual data models”, for various business application domains (e.g., SCM, CRM, ERP) [3], [4]. These datasets are defined as views over the transactional data stored and managed by the SAP S/4HANA business suite and carry information, including sophisticated measures, that is easily understandable by business users and ready for consumption by BI tools. Recently, agile data preparation tools [5], [6], [7] have emerged to empower business users and data scientists to easily create their own high-quality analytic datasets from transactional data and existing analytic datasets, and to build insightful and interactive personalized data visualizations.

Data preparation tools can partially automate data cleaning and transformation operations, perform record linkage and duplicate elimination, and even extract structured information from unstructured data. Another typical requirement of business users and data scientists is to augment the schema of an existing analytic dataset with new attributes coming from one or more semantically related datasets, which may represent additional details on dimensions or new measures. This is a critical need in many scenarios, such as the creation of data mashups (e.g., adding demographics to sales information) or the engineering of features for datasets used to train predictive models (e.g., learning to identify customers that are likely to sign up for a credit card in the following quarter). Despite the importance of this task, existing data preparation tools poorly support users in augmenting the schema of an analytic dataset. This lack of assistance compels business users either to redefine similar analytic datasets multiple times, in a possibly inconsistent manner, or to depend on their IT department to create their customized datasets. This creates an important bottleneck in the IT organization and delays the availability of the desired datasets to business users.

Addressing the problem of agile schema augmentation in the context of building trusted analytic datasets for specific BI tasks is a major business opportunity for several reasons. First, companies generally manage large collections of analytic datasets, which keep growing with the need for new business data analysis tasks. At the scale of a large company, the facility offered by agile data preparation tools considerably increases the number of available analytic datasets with respect to those created by the IT organization or those predefined by enterprise software application vendors. Second, semantic relationships between analytic datasets, which are essential to support schema augmentation, can be accurately and automatically extracted from the dataset definitions. Indeed, analytic datasets are often defined over other datasets using queries, scripts, or views. For example, in BI applications supported by SAP, it is common to find analytic datasets with five or more levels of nested view definitions. By parsing these view definitions it is possible to discover the dimensions shared by different analytic datasets. Third, analytic datasets are generally complemented by rich and carefully designed metadata (e.g., about which attributes represent levels of hierarchical dimensions, or about the units and currencies of measure attributes) to support complex data analysis scenarios, which can be exploited during the schema augmentation process. Finally, IT organizations and business users invest time to create analytic datasets containing high-quality (i.e., clean) data for business process optimization and decision making. Reusing these datasets and their metadata for schema augmentation is therefore worthwhile. For all these reasons, analytic datasets are a “gold mine” of high-quality and interrelated data that is relevant to business users and data scientists, although under-exploited by current data preparation tools.

The manual augmentation of an analytic dataset is often a cumbersome and error-prone process, raising multiple challenges illustrated hereafter. Imagine a simple use case with two analytic datasets represented by fact tables Sales and Dem(ographics), as shown in Tables 1 and 2. Attributes ORG_ID, CITY and COUNTRY in table Sales come from dimension table SALESORG, attribute YEAR in tables Sales and Dem comes from dimension table TIME, and attributes CITY, STATE, COUNTRY in table Dem come from dimension table REGION. Dimension table names are in italic font to distinguish them from fact tables. Assume that attribute ORG_ID is unique in Sales while attributes CITY, STATE, COUNTRY, and YEAR together are unique in Dem. In addition, assume that SALESORG and REGION are related through the common attributes CITY and COUNTRY, which means that these attributes have the same meaning in both tables. Then, tables Sales and Dem are naturally related through their common attribute YEAR, which comes from the same dimension, and attributes CITY and COUNTRY, which come from different dimensions but are semantically equivalent.
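
Tables 1 and 2 are not reproduced in this preview. As a concrete stand-in, the following SQL sketch sets up the two fact tables with a few illustrative rows; the schemas and uniqueness assumptions follow the text, but the specific population, unemployment and revenue values are invented for illustration.

```sql
-- Hypothetical recreation of the running example (Tables 1 and 2).
-- Schemas follow the text; all numeric values are illustrative only.
CREATE TABLE Sales (
  ORG_ID  VARCHAR(10) PRIMARY KEY,   -- unique in Sales, as assumed in the text
  CITY    VARCHAR(30),
  COUNTRY VARCHAR(30),
  YEAR    INTEGER,
  REVENUE DECIMAL(12,2)              -- measure mentioned later in the text
);

CREATE TABLE Dem (
  CITY    VARCHAR(30),
  STATE   VARCHAR(30),
  COUNTRY VARCHAR(30),
  YEAR    INTEGER,
  POP     INTEGER,                   -- population measure
  UNEMP   DECIMAL(5,2),              -- unemployment rate measure
  PRIMARY KEY (CITY, STATE, COUNTRY, YEAR)  -- unique identifier in Dem
);

-- 'Dublin' exists in two US states, which causes the row
-- multiplication discussed below.
INSERT INTO Sales VALUES ('Oh_01', 'Dublin', 'USA', 2019, 100000.00);
INSERT INTO Sales VALUES ('Ca_01', 'Dublin', 'USA', 2019,  80000.00);

INSERT INTO Dem VALUES ('Dublin',   'Ohio',       'USA', 2019,   49000, 3.8);
INSERT INTO Dem VALUES ('Dublin',   'California', 'USA', 2019,   63000, 4.1);
INSERT INTO Dem VALUES ('San Jose', 'California', 'USA', 2019, 1030000, 4.5);
```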

In our example, all tables are defined as views over transactional data, as shown in Fig. 1. Dimension tables are represented by rounded rectangles and fact tables by square rectangles. Plain edges represent data dependencies, expressing for instance that the definition of Sales depends on SALESORG, TIME, and some transactional data. A data analyst might now want to build a new analytic table SalesDem by augmenting the schema of dataset Sales with the measure attributes POP(ulation) and UNEMP(loyment rate) of dataset Dem. This augmentation then corresponds to yet another fact table (view) SalesDem, defined by a left-outer join with Dem on the common attributes CITY, COUNTRY and YEAR, as sketched below.
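
Under these assumptions, the augmentation view could be written as follows; this is a minimal sketch whose table and attribute names mirror the running example.

```sql
-- SalesDem: left outer join of Sales with Dem on the semantically
-- equivalent attributes CITY and COUNTRY and the shared attribute YEAR.
CREATE VIEW SalesDem AS
SELECT s.ORG_ID, s.CITY, s.COUNTRY, s.YEAR, s.REVENUE,
       d.POP, d.UNEMP
FROM Sales s
LEFT OUTER JOIN Dem d
  ON  s.CITY    = d.CITY
  AND s.COUNTRY = d.COUNTRY
  AND s.YEAR    = d.YEAR;
```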

The goal of this article is to propose a solution for assisting the user in the definition of this new augmentation view. In particular, we will show that by exploiting all available information on analytic datasets, possibly extended by some user-defined metadata, it is possible to automatically generate, for a given dataset, a set of useful and correct schema augmentations. This is done by solving the several challenges described below.

The first challenge is to discover the relationships between analytic datasets that can be used for schema augmentation. As explained before, many useful relationships can be extracted from the definitions of analytic datasets. For example, the view definition of fact table SALES in Fig. 1 uses attributes from dimensions SALESORG and TIME. Similarly, fact table DEM uses attributes from dimension TIME. Thus, relationships can be extracted to identify semantically equivalent attributes of TIME in both fact tables and to generate left-outer join queries like the query for the augmentation SalesDem of table Sales.

In our example, the outer-join operation increases the number of rows in Sales (e.g., it generates two rows with ORG_ID = ‘Oh_01’) because the join attributes do not constitute a unique identifier in dataset Dem. However, for some application scenarios in feature engineering, data enrichment or data analysis, such a multiplication of rows in Sales is undesirable. In these scenarios, schema augmentation should be controlled to keep the number of rows in Sales constant, leading to the notion of schema complement introduced in [8]. This raises the challenge of identifying row multiplication by computing the unique, and possibly minimal, identifiers of analytic datasets.
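
The paper infers such identifiers from attribute graphs (Section 2); as a simple illustration of the underlying condition, the following query lists the join-attribute combinations that occur more than once in Dem, i.e., exactly those that multiply the matching Sales rows.

```sql
-- Join-attribute combinations that are not unique in Dem; any match
-- on such a combination duplicates the corresponding Sales row.
SELECT CITY, COUNTRY, YEAR, COUNT(*) AS n
FROM Dem
GROUP BY CITY, COUNTRY, YEAR
HAVING COUNT(*) > 1;
```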

A simple solution to avoid row multiplication is to transform table Dem by reducing the partitions identified by the attributes shared with Sales into single tuples. For instance, a possible reduction operation is to pre-aggregate the measures of table Dem along attribute STATE before performing the left outer join. However, this raises two new challenges. The first is to choose the correct aggregation functions for the aggregated attributes: we need the knowledge that POP can be summed and averaged, while only a maximal or minimal value can be computed on the unemployment-rate values of attribute UNEMP. Furthermore, if an averaged POP is used to augment Sales, we must determine which aggregation functions remain applicable to this new attribute in the augmented Sales dataset, to avoid future incorrect aggregation operations. The second challenge is the ambiguity of the aggregated measure attributes added to Sales, which, for instance, contain an aggregated value for all cities named ‘Dublin’ in country ‘USA’, thereby destroying the distinction between same-named cities in different states. Ambiguous measures should be detected and controlled, e.g., by assigning a null value to POP and UNEMP for the tuples with ORG_ID = ‘Oh_01’ and ORG_ID = ‘Ca_01’.
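
The reduction and the null-based repair can be combined in a single view, sketched below. SUM is applied to POP and MAX to UNEMP, as the text allows; the COUNT(DISTINCT STATE) guard is one illustrative way to detect and null out ambiguous aggregates, not necessarily the paper's exact repair operator.

```sql
-- Reduce Dem to one tuple per (CITY, COUNTRY, YEAR) before joining,
-- and null out aggregates that span several states (i.e., distinct
-- cities sharing the same name within a country).
CREATE VIEW SalesDemReduced AS
SELECT s.ORG_ID, s.CITY, s.COUNTRY, s.YEAR, s.REVENUE,
       CASE WHEN r.N_STATES = 1 THEN r.POP   END AS POP,
       CASE WHEN r.N_STATES = 1 THEN r.UNEMP END AS UNEMP
FROM Sales s
LEFT OUTER JOIN (
  SELECT CITY, COUNTRY, YEAR,
         SUM(POP)              AS POP,    -- POP is summable
         MAX(UNEMP)            AS UNEMP,  -- only MAX/MIN are valid for a rate
         COUNT(DISTINCT STATE) AS N_STATES
  FROM Dem
  GROUP BY CITY, COUNTRY, YEAR
) r
  ON  s.CITY    = r.CITY
  AND s.COUNTRY = r.COUNTRY
  AND s.YEAR    = r.YEAR;
```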

Another solution to avoid row multiplication would be to add STATE to the schema of Sales (although this is not necessary, since ORG_ID was assumed to be unique in Sales). Then the common join attributes would form a unique identifier in Dem, and a schema complement for dataset Sales could be obtained without any pre-aggregation. However, even in that case, summing POP along CITY in the augmented Sales dataset will not return the same result as summing POP along CITY in Dem, even for the same values of COUNTRY and STATE (e.g., the POP value of city ‘San Jose’ in ‘California’ will not be counted). This can be a problem if, in the augmented Sales dataset, REVENUE should be compared with POP in each STATE of a COUNTRY. This issue arises because the merge of Sales with Dem is incomplete with respect to Dem, and detecting incomplete merges is necessary to avoid such erroneous analyses.
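
Incompleteness of a merge with respect to Dem can be detected with an anti-join that lists the Dem rows contributing to no Sales row; the sketch below uses the original join attributes of the example.

```sql
-- Dem rows that do not participate in the merge (e.g., 'San Jose',
-- where no sales organization is located): summing POP over the
-- augmented Sales table silently omits their population.
SELECT d.*
FROM Dem d
WHERE NOT EXISTS (
  SELECT 1
  FROM Sales s
  WHERE s.CITY    = d.CITY
    AND s.COUNTRY = d.COUNTRY
    AND s.YEAR    = d.YEAR
);
```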

This paper presents a new solution to assist business users and data scientists in augmenting the schema of an analytic dataset with attributes coming from other datasets. This is achieved by automatically discovering related analytic datasets and suggesting ways of building schema augmentations, including schema complements, that resolve the challenges presented above.

More specifically, we make the following technical contributions.

  • We introduce attribute graphs as a novel, concise and natural way to define literal functional dependencies over the level types of hierarchical dimensions, from which we can easily infer unique identifiers in both dimension and fact tables, and minimal identifiers in the case of dimensions.

  • We give the formal definitions for schema augmentation, schema complement and merge query in the context of analytic tables. We then present several reduction operations for transforming schema augmentations with row multiplication into schema complements extending each row in the source table by a single row in the augmented table. These operations extend previous contributions on schema augmentation and schema complement (e.g., [8], [9], [10]) to the case of analytic datasets.

  • We also present formal quality criteria for schema augmentations, schema complements and merge queries. These criteria are used to define automatic repair operations (1) to signal the generation of ambiguous attributes, (2) to infer the aggregation functions applicable to new attributes, and (3) to supplement merge results obtained by incomplete schema augmentations.

  • We describe the implementation of our solution as a REST service within the SAP HANA platform and provide a detailed description of our algorithms. We separate the generic part of the algorithms from the specific implementation optimizations done by leveraging the capabilities of SAP HANA.

  • We evaluate the performance of our algorithms to compute unique identifiers in dimension and fact tables, and analyze the effectiveness of our REST service using two application scenarios.

The rest of this paper is structured as follows. In Section 2, we describe the multidimensional data model for analytic datasets, which captures all the usual concepts of hierarchical dimensions, facts, measures, and cubes [1]. This makes our results easily applicable to any other database system implementing the concepts of multidimensional models. We also introduce the new concept of attribute graph as a way to define literal functional dependencies between dimension and fact table attributes and to compute dimension and fact table identifiers. In Section 3, we describe the relationships between analytic tables that support schema augmentation and that can be automatically extracted from the definitions of fact and dimension tables. We then provide the formal definitions of schema augmentation and of natural and reduction-based schema complements. Section 4 introduces formal quality criteria for schema augmentations, schema complements and merge queries. Section 5 describes the implementation of our solution as a REST application service (not yet in production) within the SAP HANA platform, while Section 6 details our algorithms. Section 7 compares our research results with previous work on schema augmentation as well as work on querying single or multiple OLAP data cubes. Section 8 presents our experiments to evaluate the performance and effectiveness of our solution. We conclude in Section 9.

Section snippets

Data model

This section presents our extension of the relational data model to model analytic tables and their semantic relationships. We assume as input to our model any relational database containing a set of non-analytic tables. Analytic tables, i.e., dimension and fact tables, are then defined bottom-up as views over these non-analytic tables and other analytic tables. We use conventional relational database notations [11]. Each table T is a finite multiset of tuples over a set of domains of values S={

Schema augmentation and complement

In this section we first describe the relationships that can be automatically extracted from the definitions of dimension and fact tables and formally define the notion of schema augmentation between tables connected by such relationships. We then introduce natural and reduction-based schema complements as a special case of schema augmentation where the tuples of a table are augmented with new attribute values obtained from another table without generating new tuples in the result.

Quality guarantees

We now describe the quality guarantees managed by our system. Section 4.1 evaluates the correctness of aggregated attribute values produced by reduction operations and merge queries. Sections 4.2 (Non-ambiguous aggregable attributes) and 4.3 (Complete merge results) deal with the generation of ambiguous and incomplete attribute values during the construction of schema augmentations (and schema complements).

Architecture overview in SAP HANA

The contributions presented in this paper have been implemented and published within the SAP HANA platform [15] as a REST application service. The extended HANA architecture is depicted in Fig. 9. White boxes are new components that implement the algorithms described in Section 6 whereas gray boxes are existing HANA components that have been extended.

Analytic tables are defined in SAP HANA as non-recursive information views [16] over non-analytic tables and other information views using a set

Dimension and fact identifiers

In this section we describe how to compute and maintain dimension and fact identifiers, which play a central role in our work. We encode attribute graphs into two tables, ATtribute Graph Node (ATGN) and ATtribute Graph Edge (ATGE), whose schemas are, respectively:

ATGN (DT_ID, ATT_NAME, LEVEL_NUM, OPTIONAL)

ATGE (DT_ID, ATT_NAME, PARENT_ATT_NAME, LABEL)

Each hierarchy in a dimension table is identified by a distinct DT_ID value. Table ATGN stores attribute nodes, with their level in the hierarchy and
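
A minimal relational sketch of this encoding is given below; the column names come from the text, while the data types, level numbering and edge direction are assumptions made for illustration.

```sql
-- Attribute graph nodes: one row per attribute of a hierarchy,
-- with its level number and whether the attribute is optional.
CREATE TABLE ATGN (
  DT_ID     INTEGER      NOT NULL,   -- identifies the hierarchy
  ATT_NAME  VARCHAR(128) NOT NULL,
  LEVEL_NUM INTEGER      NOT NULL,
  OPTIONAL  BOOLEAN      NOT NULL,
  PRIMARY KEY (DT_ID, ATT_NAME)
);

-- Attribute graph edges: a labeled edge from an attribute to a parent
-- attribute, encoding a literal functional dependency between levels.
CREATE TABLE ATGE (
  DT_ID           INTEGER      NOT NULL,
  ATT_NAME        VARCHAR(128) NOT NULL,
  PARENT_ATT_NAME VARCHAR(128) NOT NULL,
  LABEL           VARCHAR(32),        -- edge label (semantics per the paper)
  PRIMARY KEY (DT_ID, ATT_NAME, PARENT_ATT_NAME)
);

-- Illustrative rows for the REGION dimension of the running example,
-- assuming STATE functionally determines COUNTRY.
INSERT INTO ATGN VALUES (1, 'CITY',    1, FALSE);
INSERT INTO ATGN VALUES (1, 'STATE',   2, FALSE);
INSERT INTO ATGN VALUES (1, 'COUNTRY', 3, FALSE);
INSERT INTO ATGE VALUES (1, 'STATE', 'COUNTRY', NULL);
```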

Related work

The notion of natural schema complement was first proposed by [8] for web tables (tables extracted from web pages), where join relationships between tables are discovered by computing schema matching similarities [20] using a combination of similarity in attribute names, data types, and values (through a variant of Jaccard similarity). In the context of web tables, [9] presents direct and holistic schema matching techniques to discover natural schema complements for a given web table and

Attribute graph and dimension identifier computation

All performance experiments for attribute graph generation (ATG) and dimension identifier computation (CDI) are conducted on an SAP HANA instance with 250 GB of main memory and 300 GB of disk space.

Conclusions and future work

In this paper, we propose a solution to the problem of discovering and merging schema augmentations for analytic datasets. We formally define efficient algorithms for building reduction-based schema complements, which generalize previously defined natural schema complements. We discuss various quality issues and their solutions for generating semantically correct schema augmentations and merged tables annotated with proper metadata, enabling their correct usage in future analytic queries. We also

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (37)

  • J.-N. Mazón et al., A survey on summarizability issues in multidimensional modeling, Data Knowl. Eng. (2009)

  • T. Niemi et al., Detecting summarizability in OLAP, Data Knowl. Eng. (2014)

  • C.S. Jensen et al., Multidimensional Databases and Data Warehousing, Synthesis Lectures on Data Management (2010)

  • R. Kimball et al., The Data Warehouse Toolkit (2013)

  • The Virtual Data Model in SAP S/4HANA, URL...

  • A. Pattanayak, High performance analytics with SAP HANA virtual models, J. Comput. Commun. (2017)

  • Data Wrangling Tools & Software | Trifacta, URL...

  • Paxata | Self-Service Data Preparation for Data Analytics, URL...

  • SAP Agile Data Preparation and Transformation Solution, URL...

  • A. Das Sarma et al., Finding related tables

  • M. Yakout et al., InfoGather: entity augmentation and attribute discovery by holistic matching with web tables

  • D. Deng, R.C. Fernandez, Z. Abedjan, S. Wang, M. Stonebraker, A.K. Elmagarmid, I.F. Ilyas, S. Madden, M. Ouzzani, N. ...

  • J.D. Ullman et al., Database Systems: The Complete Book (2009)

  • A. Badia et al., Functional dependencies with null markers, Comput. J. (2014)

  • J. Horner et al., An analysis of additivity in OLAP systems

  • J. Lee et al., High-performance transaction processing in SAP HANA, IEEE Data Eng. Bull. (2013)

  • SAP HANA Modeling Guide (2019)

  • SAP HANA Smart Data Integration and SAP HANA Smart Data Quality - SAP Help Portal, URL...

This work was supported by the French Association Nationale de la Recherche et de la Technologie (ANRT) [grant CIFRE number 2016/0644].
