Feedback driven improvement of data preparation pipelines
Information Systems (IF 3.0), Pub Date: 2019-12-06, DOI: 10.1016/j.is.2019.101480
Nikolaos Konstantinou, Norman W. Paton

Data preparation, whether for populating enterprise data warehouses or as a precursor to more exploratory analyses, is recognised as being laborious, and as a result is a barrier to cost-effective data analysis. Several steps that recur within data preparation pipelines are amenable to automation, but it seems important that automated decisions can be refined in the light of user feedback on data products. There has been significant work on how individual data preparation steps can be refined in the light of feedback. This paper goes further, by proposing an approach in which feedback on the correctness of values in a data product can be used to revise the results of diverse data preparation components. The approach uses statistical techniques, both to determine which actions should be applied to refine the data preparation process and to identify the values on which it would be most useful to obtain further feedback. The approach has been implemented to refine the results of matching, mapping and data repair components in the VADA data preparation system, and is evaluated using deep web and open government data sets from the real estate domain. The experiments show that the approach enables feedback to be assimilated effectively for use with individual data preparation components, and furthermore that synergies result from applying the feedback to several data preparation components.
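
The abstract does not say which statistical techniques are used, so the following is only an illustrative sketch of the general idea, not the method of the paper or of the VADA system. It assumes that feedback arrives as correct/incorrect judgements on values, that each judgement can be attributed to one candidate (for example a match or a mapping), and that a Wilson score interval is an acceptable way to compare candidates. The names `Candidate`, `wilson_interval`, `significantly_worse` and `next_feedback_target` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical pipeline artefact (e.g. a match or mapping) with feedback counts."""
    name: str
    correct: int = 0     # values marked correct by the user
    incorrect: int = 0   # values marked incorrect by the user

    @property
    def total(self) -> int:
        return self.correct + self.incorrect


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for the proportion of correct values (z=1.96 is about 95%)."""
    if total == 0:
        return (0.0, 1.0)  # no feedback yet: nothing is known
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def significantly_worse(a: Candidate, b: Candidate) -> bool:
    """True when a's whole interval lies below b's, so a could be dropped in favour of b."""
    _, a_high = wilson_interval(a.correct, a.total)
    b_low, _ = wilson_interval(b.correct, b.total)
    return a_high < b_low


def next_feedback_target(candidates: list[Candidate]) -> Candidate:
    """Pick the candidate whose correctness estimate is most uncertain (widest interval)."""
    def width(c: Candidate) -> float:
        low, high = wilson_interval(c.correct, c.total)
        return high - low
    return max(candidates, key=width)


if __name__ == "__main__":
    a = Candidate("match_A", correct=18, incorrect=2)
    b = Candidate("match_B", correct=3, incorrect=17)
    c = Candidate("match_C")  # no feedback collected yet
    print(significantly_worse(b, a))
    print(next_feedback_target([a, b, c]).name)
```

In this sketch the same feedback counts serve both purposes mentioned in the abstract: deciding which action to take (dropping a candidate only when the evidence against it is statistically clear) and deciding where further feedback would be most useful (the candidate with the widest interval).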


Updated: 2019-12-06