SparkDQ: Efficient generic big data quality management on distributed data-parallel computation, Journal of Parallel and Distributed Computing



SparkDQ: Efficient generic big data quality management on distributed data-parallel computation
Journal of Parallel and Distributed Computing (IF 3.4), Pub Date: 2021-06-04, DOI: 10.1016/j.jpdc.2021.05.012
Rong Gu, Yang Qi, Tongyu Wu, Zhaokang Wang, Xiaolong Xu, Chunfeng Yuan, Yihua Huang

In the big data era, large amounts of data are being generated and accumulated in various industries. However, users are often hindered by data quality issues when extracting value from big data, and these issues are therefore gaining more and more attention from data quality management analysts. State-of-the-art solutions such as data ETL, data cleaning, and data quality monitoring systems have many deficiencies in capability and efficiency, which makes it difficult for them to cope with complicated big data scenarios. These problems inspire us to build SparkDQ, a generic distributed data quality management model and framework that provides a series of data quality detection and repair interfaces. By utilizing these interfaces, users can quickly build custom data quality computing tasks for various needs. In addition, SparkDQ implements a set of optimized parallel algorithms that target various data quality goals. We also propose several system-level optimizations, including job-level optimization with multi-task execution scheduling and data-level optimization with data state caching. The experimental evaluation shows that the proposed distributed algorithms in SparkDQ run up to 12 times faster than the corresponding stand-alone serial and multi-thread algorithms. Compared with the state-of-the-art distributed data quality solution Apache Griffin, SparkDQ has more features, and its execution time is on average only about half that of Apache Griffin. SparkDQ achieves near-linear data and node scalability.
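The abstract does not show SparkDQ's actual interfaces, so the following is only a minimal single-machine sketch of the general idea of pairing a detection step with a repair step and reusing cached data state across tasks. Every name here (`StateCache`, `detect_completeness`, `repair_missing`, the sample rows) is hypothetical and is not taken from SparkDQ or Apache Griffin; the real framework runs on distributed data-parallel computation rather than on Python lists.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def is_missing(value: Optional[str]) -> bool:
    """Treat None and blank strings as missing values."""
    return value is None or value.strip() == ""


class StateCache:
    """Hypothetical cache of per-column statistics shared by several tasks."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows
        self.scans = 0  # number of full passes made over the data
        self._missing: Dict[str, int] = {}

    def missing_count(self, column: str) -> int:
        # Scan the data only the first time a column is requested.
        if column not in self._missing:
            self.scans += 1
            self._missing[column] = sum(
                1 for r in self.rows if is_missing(r.get(column))
            )
        return self._missing[column]


def detect_completeness(cache: StateCache, column: str) -> float:
    """Detection step: fraction of rows with a non-missing value."""
    total = len(cache.rows)
    return 1.0 if total == 0 else 1.0 - cache.missing_count(column) / total


def repair_missing(rows: List[Row], column: str, default: str) -> List[Row]:
    """Repair step: return new rows with missing values replaced by a default."""
    return [
        {**r, column: default} if is_missing(r.get(column)) else dict(r)
        for r in rows
    ]


# Made-up sample data, for illustration only.
rows: List[Row] = [
    {"id": "1", "city": "Nanjing"},
    {"id": "2", "city": ""},
    {"id": "3", "city": None},
    {"id": "4", "city": "Suzhou"},
]
cache = StateCache(rows)
ratio = detect_completeness(cache, "city")
ratio_again = detect_completeness(cache, "city")  # served from the cache
fixed = repair_missing(rows, "city", "UNKNOWN")
```

In this sketch the second detection call reuses the cached count instead of rescanning the rows, which is the intuition behind caching data state when several quality tasks touch the same data. How SparkDQ schedules multiple tasks and what state it caches is described in the paper itself, not here.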


Updated: 2021-06-15
