Information Systems

Volume 92, September 2020, 101480

Feedback driven improvement of data preparation pipelines

https://doi.org/10.1016/j.is.2019.101480

Highlights

  • Feedback on results can inform changes to a complete data preparation pipeline.

  • The pipeline can include matching, mapping generation and data repair.

  • A statistical approach can establish which actions to take based on feedback.

  • The same statistical approach can be used to target results for feedback.

Abstract

Data preparation, whether for populating enterprise data warehouses or as a precursor to more exploratory analyses, is recognised as being laborious, and as a result is a barrier to cost-effective data analysis. Several steps that recur within data preparation pipelines are amenable to automation, but it seems important that automated decisions can be refined in the light of user feedback on data products. There has been significant work on how individual data preparation steps can be refined in the light of feedback. This paper goes further, by proposing an approach in which feedback on the correctness of values in a data product can be used to revise the results of diverse data preparation components. The approach uses statistical techniques, both in determining which actions should be applied to refine the data preparation process and to identify the values on which it would be most useful to obtain further feedback. The approach has been implemented to refine the results of matching, mapping and data repair components in the VADA data preparation system, and is evaluated using deep web and open government data sets from the real estate domain. The experiments have shown how the approach enables feedback to be assimilated effectively for use with individual data preparation components, and furthermore that synergies result from applying the feedback to several data preparation components.

Introduction

Data preparation is the process of transforming data from its original form into a representation that is more appropriate for analysis. In data warehouses, data preparation tends to be referred to as involving an Extract Transform Load (ETL) process [1], while for more ad hoc analyses carried out by data scientists it may be referred to as data wrangling [2]. In both cases, similar steps tend to be involved in data preparation, such as: discovery of relevant sources; profiling of these sources to better understand their individual properties and the potential relationships between them; matching to identify the relationships between source attributes; mapping to combine the data from multiple sources; format transformation to revise the representations of attribute values; and entity resolution to identify and remove duplicate records representing the same real world object.
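To make these recurring steps concrete, the following is a minimal sketch, in Python, of how such steps might be composed into a pipeline. The toy matcher, mapping, transformation and entity resolution functions, the attribute names and the target schema are all illustrative assumptions, not the behaviour of any particular system.

    from typing import Dict, List

    # A data source is modelled as a list of records (dicts) and the target schema
    # as a list of attribute names; all names and logic here are illustrative.
    Record = Dict[str, str]

    TARGET_SCHEMA = ["street", "city", "price"]

    def match(source: List[Record]) -> Dict[str, str]:
        """Toy matcher: align a source attribute with a target attribute
        when their lower-cased names are equal."""
        attrs = source[0].keys() if source else []
        return {a: a.lower() for a in attrs if a.lower() in TARGET_SCHEMA}

    def map_source(source: List[Record], matches: Dict[str, str]) -> List[Record]:
        """Toy mapping: project and rename matched attributes into the target schema."""
        return [{matches[a]: r.get(a, "") for a in matches} for r in source]

    def transform(record: Record) -> Record:
        """Toy format transformation: normalise whitespace and capitalisation."""
        return {k: " ".join(v.split()).title() for k, v in record.items()}

    def resolve_entities(records: List[Record]) -> List[Record]:
        """Toy entity resolution: drop exact duplicate records."""
        seen, out = set(), []
        for r in records:
            key = tuple(sorted(r.items()))
            if key not in seen:
                seen.add(key)
                out.append(r)
        return out

    def prepare(sources: List[List[Record]]) -> List[Record]:
        """Orchestrate matching, mapping, transformation and entity resolution."""
        integrated: List[Record] = []
        for s in sources:
            integrated += map_source(s, match(s))
        return resolve_entities([transform(r) for r in integrated])

    # Example: two small sources with differently named attributes.
    s1 = [{"Street": "10 Oak Rd", "City": "Oxford", "Price": "250000"}]
    s2 = [{"street": "10 oak rd", "city": "oxford", "price": "250000", "agent": "A1"}]
    print(prepare([s1, s2]))   # the duplicates collapse into a single target tuple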

This is a long list of steps, each of which can potentially involve data engineers: (i) deciding which data integration and cleaning operations to apply to which sources; (ii) deciding the order of application of the operations; and (iii) either configuring the individual operation applications or writing the rules that express the behaviour to be exhibited. Although there are many data preparation products, and the market for data preparation tools is estimated to be $2.9 billion [3], most of these products are essentially visual programming platforms, in which users make many, fine-grained decisions. The consequence of this is that data preparation is typically quoted as taking 80% of the time of data scientists, who would prefer to be spending their time on analysing and interpreting results.1

The high cost of data preparation has been recognised for a considerable period. For example, research into dataspaces [4] proposed a pay-as-you-go approach to data integration, in which an initial and automated bootstrapping phase was followed by an incremental improvement phase in which the user provided feedback on the data product. This gave rise to a collection of proposals for pay-as-you-go data integration and cleaning platforms [5], which in turn led to proposals for the use of crowds as a possible source of feedback [6]. This research provided experience with pay-as-you-go data management without leading to many end-to-end systems; for the most part, feedback was obtained for a particular task (e.g. mapping selection, entity resolution) and used for that task alone. Although such feedback serves its immediate task, collecting feedback on many individual components is itself expensive, and thus not especially helpful for the complete, many-step data preparation process.

This, therefore, leaves open the question as to how to make a multi-step data preparation process much more cost effective, for example through automation and widespread use of feedback on data products. There are now some results on automating comprehensive data preparation pipelines. For example, in Data Tamer [7], machine learning is used to support activities including the alignment of data sets and instance level integration through entity resolution and fusion. In some respects Data Tamer follows a pay-as-you-go approach, as the training data used by the learning components is revised in the light of experience. Furthermore, in VADA [8], [9], a collection of components (for matching, mapping generation, source selection, format transformation and data repair) are orchestrated automatically over data sources, informed by supplementary instance data drawn from the domain of the target schema [10]. However, to date, feedback has only been applied in VADA to inform the selection of mappings.

In this paper we investigate how feedback on the data product that results from the multi-component data preparation process in VADA can be used to revise the results of several of these wrangling components in a well-informed way. In particular, given feedback on the correctness of tuples in the data product, a feedback assimilation strategy explores a set of hypotheses about the reasons for problems with the result. The statistical significance of these hypotheses is then tested, giving rise to the generation of a revised data integration process. The proposed approach thus uses the same feedback to inform changes to many different data preparation components, thereby seeking to maximise the return on the investment made in the provision of feedback.
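As an illustration of the kind of statistical test such a strategy can apply, the sketch below uses a pooled two-proportion z-test to ask whether the tuples produced by one construct (for example, a particular match or mapping) are significantly less correct, according to the feedback, than the rest of the result. The function name, the counts and the choice of this particular test are assumptions for illustration, not a transcription of the procedure detailed in Section 4.

    from math import sqrt
    from statistics import NormalDist

    def two_proportion_p_value(correct_a: int, total_a: int,
                               correct_b: int, total_b: int) -> float:
        """Two-sided p-value for the null hypothesis that both groups of tuples
        have the same proportion of correct values (pooled two-proportion z-test)."""
        p_a, p_b = correct_a / total_a, correct_b / total_b
        pooled = (correct_a + correct_b) / (total_a + total_b)
        se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
        if se == 0:
            return 1.0
        z = (p_a - p_b) / se
        return 2 * (1 - NormalDist().cdf(abs(z)))

    # Hypothetical feedback counts: tuples that came through one suspect mapping
    # versus the other tuples in the end data product.
    p = two_proportion_p_value(correct_a=12, total_a=40,   # via the suspect mapping
                               correct_b=55, total_b=70)   # the rest of the result
    if p < 0.05:
        print(f"p = {p:.4f}: the mapping's tuples are significantly less correct")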

The contributions of the paper are as follows:

  1. A technique for applying feedback on a data product across a multi-step data preparation process that both identifies statistically significant issues and provides a mechanism for exploring the actions that may resolve these issues.

  2. An approach to feedback targeting that builds on the statistical analysis from (1) to identify the values on which it would be most useful to obtain additional feedback.

  3. A realisation of the techniques from (1) and (2) in a specific data preparation platform, where feedback is used to change the matches used in an integration, change which mappings are used, and change which data quality rules are applied.

  4. An empirical evaluation of the implementation of the approaches from (3) that investigates the effectiveness of the proposed approaches both for individual data preparation constructs (matches, mappings, and repairs in the form of conditional functional dependencies (CFDs), illustrated below) and for applying feedback across all these constructs together.
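As flagged in contribution (4), the following is a hedged illustration of a repair rule in the form of a conditional functional dependency: a CFD extends a functional dependency with constants, so that violations can be both detected and, naively, repaired. The attribute names, constants and the naive repair function below are invented for illustration and are not taken from the paper's data sets or repair algorithm.

    from typing import Dict, List, Tuple

    Record = Dict[str, str]

    # Illustrative CFD (attribute names and constants are invented):
    #   ([postcode_district] -> [city], ("M1" || "Manchester"))
    # i.e. every tuple whose postcode_district is "M1" must have city "Manchester".
    CFD = Tuple[str, str, str, str]   # (lhs_attr, lhs_constant, rhs_attr, rhs_constant)
    example_cfd: CFD = ("postcode_district", "M1", "city", "Manchester")

    def violations(records: List[Record], cfd: CFD) -> List[Record]:
        """Return the tuples that match the left-hand side constant
        but disagree with the right-hand side constant."""
        lhs, lval, rhs, rval = cfd
        return [r for r in records if r.get(lhs) == lval and r.get(rhs) != rval]

    def naive_repair(records: List[Record], cfd: CFD) -> List[Record]:
        """Naive repair: overwrite the right-hand side value of each violating
        tuple with the CFD's constant (real repair algorithms are more careful)."""
        lhs, lval, rhs, rval = cfd
        return [dict(r, **{rhs: rval}) if r.get(lhs) == lval and r.get(rhs) != rval
                else r for r in records]

    data = [{"postcode_district": "M1", "city": "Salford"},
            {"postcode_district": "M1", "city": "Manchester"}]
    print(violations(data, example_cfd))     # the first tuple violates the CFD
    print(naive_repair(data, example_cfd))   # its city is rewritten to "Manchester"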

The remainder of the paper is structured as follows. Section 2 outlines the data preparation pipeline on which we build, and provides a running example that will be used in the later sections. Section 3 provides a problem statement and an overview of the approach. Section 4 details the individual components in the realisation of the approach, and presents feedback assimilation algorithms. Section 5 describes how the feedback can be targeted on result values that are relevant to the approach in Section 4. Section 6 evaluates the technique in a real estate application. Section 7 reviews other approaches to increasing the cost-effectiveness of data preparation, and Section 8 concludes. This paper is an extended version of [11]; the extensions include a proposal for acting on feedback only when the action is predicted to provide a benefit, a technique for feedback targeting, and associated comparative evaluations.

Section snippets

A data preparation pipeline

This section provides an overview of the aspects of the VADA data preparation architecture that are relevant to the feedback assimilation approach that is the focus of the paper. The VADA architecture is described in more detail in earlier publications [8], [9], [10].

Problem statement

This section provides more details on the problem to be solved, along with an overview of the approach to be followed. The problem can be described as follows.

Assume we have a data preparation pipeline P that orchestrates a collection of data preparation steps {s1, …, sn} to produce an end data product E that consists of a set of tuples. The problem is, given a set of feedback instances F on tuples from E, to re-orchestrate some or all of the data preparation steps si, revised in the light of
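Although the statement above is only a snippet, the objects it refers to can be sketched as simple data structures. The class and field names below are assumptions introduced for illustration, with feedback represented as per-tuple correctness labels as described in the abstract.

    from dataclasses import dataclass
    from typing import Callable, Dict, List, Optional

    Record = Dict[str, str]
    Step = Callable[[List[Record]], List[Record]]   # one data preparation step s_i

    @dataclass
    class FeedbackInstance:
        """One element of the feedback set F: a correctness judgement
        on a tuple (or one of its values) in the end data product E."""
        tuple_id: int
        correct: bool
        attribute: Optional[str] = None   # set when the feedback concerns a single value

    @dataclass
    class Pipeline:
        """A pipeline P orchestrating steps s1, ..., sn."""
        steps: List[Step]

        def run(self, data: List[Record]) -> List[Record]:
            for step in self.steps:
                data = step(data)
            return data   # the end data product E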

Solution

This section provides additional details on how the steps from Section 3 are carried out in practice, and includes details on how the feedback can be used to inform actions on matches, mappings and repairs. In particular, Section 4.1 identifies hypotheses that may be suggested by the available feedback; Section 4.2 describes how the statistical significance of such hypotheses can be ascertained; Section 4.3 identifies some actions that may be taken in response to a hypothesis; Section 4.4

Targeted feedback collection

Feedback collection can follow one of two rather different approaches:

  • In targeted feedback collection, the system identifies what feedback should be collected, and prompts the user for that feedback.

  • In untargeted feedback collection, the user decides what feedback to provide, and quite likely when to provide it.

In the targeted approach, there is a specific interface for feedback collection, and feedback collection is separate from the browsing of results. In the untargeted approach, typically
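The snippet above is truncated, but the idea of reusing the statistical analysis to decide where feedback is most valuable can be illustrated with a simple heuristic (an assumption for illustration, not necessarily the paper's exact criterion): prefer tuples produced by the construct whose hypothesis currently sits closest to the significance threshold, since additional evidence there is most likely to change a decision.

    import random
    from typing import Dict, List

    # p-values per candidate construct (e.g. per match or mapping), as produced by a
    # significance test such as the one sketched earlier; the values are invented.
    p_values: Dict[str, float] = {"mapping_1": 0.001, "mapping_2": 0.07, "match_a": 0.45}

    # Tuples produced by each construct for which no feedback has been collected yet.
    unlabelled: Dict[str, List[int]] = {"mapping_1": [3, 9, 17],
                                        "mapping_2": [4, 11, 25, 31],
                                        "match_a": [6, 8]}

    def next_feedback_targets(alpha: float = 0.05, k: int = 3) -> List[int]:
        """Request feedback on tuples from the construct whose hypothesis is closest
        to the significance threshold, i.e. where extra evidence is most likely
        to change the decision about whether to act."""
        most_uncertain = min(p_values, key=lambda c: abs(p_values[c] - alpha))
        pool = unlabelled[most_uncertain]
        return random.sample(pool, min(k, len(pool)))

    print(next_feedback_targets())   # e.g. three tuples produced by mapping_2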

Experimental setup

For the evaluation, we used the following datasets:

  • 40 datasets with real-estate properties extracted from the web using OXpath [17], with contents similar to sources s1 to s3 in Fig. 1. Each dataset had from 6 to 16 attributes, with an average of 11 attributes per dataset. Their initial total size was 7.8k tuples. These datasets were used as source data.

  • English indices of deprivation data, downloaded from www.gov.uk, as shown in s4 in Fig. 1. The complete dataset had 6 attributes and 62.3k

Related work

In this section, we consider related work under four headings: pay-as-you-go data preparation, reducing manual effort in data preparation, applying feedback to multiple activities, and targeted feedback collection.

Conclusions

The development of data preparation processes is laborious, requiring sustained attention to detail from data engineers across a variety of tasks. Many of these tasks involve activities that are amenable to automation. However, automated approaches have partial knowledge, and thus cannot necessarily be relied upon to make the best integration decisions. When automation falls short, one way to improve the situation is through feedback on the candidate end data product. The successful combination

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work is funded by the UK Engineering and Physical Sciences Research Council (EPSRC) through the VADA Programme Grant (EP/M025268/1).

References (49)

  • Vassiliadis P., A survey of extract-transform-load technology, IJDWM (2011)

  • Rattenbury T. et al., Principles of Data Wrangling (2017)

  • Beyer M.A. et al., Magic Quadrant for Data Integration Tools, Tech. rep. (2018)

  • Franklin M.J. et al., From databases to dataspaces: a new abstraction for information management, SIGMOD Rec. (2005)

  • Hedeler C. et al., Dataspaces

  • Crescenzi V. et al., Crowdsourcing for data management, Knowl. Inf. Syst. (2017)

  • Stonebraker M., Bruckner D., Ilyas I.F., Beskales G., Cherniack M., Zdonik S.B., Pagan A., Xu S., Data curation at...

  • Konstantinou N. et al., The VADA architecture for cost-effective data wrangling

  • Konstantinou N. et al., VADA: an architecture for end user informed data preparation, J. Big Data (2019)

  • Koehler M. et al., Data context informed data wrangling

  • Konstantinou N. et al., Feedback driven improvement of data preparation pipelines

  • Abel E. et al., User driven multi-criteria source selection, Inform. Sci. (2018)

  • Mazilu L., Paton N.W., Fernandes A.A., Koehler M., Dynamap: Schema mapping generation in the wild, in: Proceedings of...

  • Fan W. et al., Discovering conditional functional dependencies, Proc. Int. Conf. Data Eng. (2011)

  • Bulmer M.G., Principles of Statistics (1979)

  • Ríos J.C.C., Paton N.W., Fernandes A.A.A., Belhajjame K., Efficient feedback collection for pay-as-you-go source...

  • Furche T. et al., OXPath: A language for scalable data extraction, automation, and crawling on the deep web, VLDB J. (2013)

  • Hedeler C. et al., DSToolkit: An architecture for flexible dataspace management, TLDKS (2012)

  • Dong X.L. et al., Data integration with uncertainty, VLDB J. (2009)

  • Blunschi L. et al., A dataspace odyssey: The iMeMeX personal dataspace management system (demo)

  • Crescenzi V. et al., Crowdsourcing large scale wrapper inference, Distrib. Parallel Databases (2015)

  • Hung N.Q.V. et al., SMART: a tool for analyzing and reconciling schema matching networks

  • Zhang C.J. et al., Reducing uncertainty of schema matching via crowdsourcing, PVLDB (2013)

  • Belhajjame K. et al., Feedback-based annotation, selection and refinement of schema mappings for dataspaces
