A Unified Framework for Task-Driven Data Quality Management, arXiv - CS - Machine Learning



A Unified Framework for Task-Driven Data Quality Management
arXiv - CS - Machine Learning. Pub Date: 2021-06-10, DOI: arxiv-2106.05484
Tianhao Wang, Yi Zeng, Ming Jin, Ruoxi Jia

High-quality data is critical for training performant Machine Learning (ML) models, which highlights the importance of Data Quality Management (DQM). Existing DQM schemes often fail to improve ML performance satisfactorily because, by design, they are oblivious to the downstream ML task. They also cannot handle the full range of data quality issues (especially those caused by adversarial attacks), and they apply only to certain types of ML models. Recently, data valuation approaches (e.g., those based on the Shapley value) have been used for DQM; yet empirical studies have observed that their performance varies considerably with the underlying data and training process. In this paper, we propose DataSifter, a task-driven, multi-purpose, model-agnostic DQM framework that is optimized for a given downstream ML task, can effectively remove data points with various defects, and is applicable to diverse models. Specifically, we formulate DQM as an optimization problem and devise a scalable algorithm to solve it. Furthermore, we propose a theoretical framework for comparing the worst-case performance of different DQM strategies. Remarkably, our results show that the popular strategy based on the Shapley value may end up choosing the worst data subset in certain practical scenarios. Our evaluation shows that DataSifter matches, and most often significantly improves on, state-of-the-art performance across a wide range of DQM tasks, including backdoor, poisoning, and noisy/mislabeled data detection, data summarization, and data debiasing.
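For readers unfamiliar with the Shapley-value baseline that the abstract refers to, the following is a minimal sketch of exact Shapley-based data valuation on a toy problem. It is not DataSifter and is not taken from the paper: the data, the 1-nearest-neighbour utility, and the names `train`, `validation`, `utility`, and `shapley_values` are all illustrative assumptions. Each training point's value is its average marginal contribution to validation accuracy over all orderings of the training set.

```python
from itertools import permutations
from math import factorial

# Hypothetical toy data: (feature, label) pairs. The third training point is
# deliberately mislabeled: it sits near the class-0 region but carries label 1.
train = [(0.0, 0), (1.0, 1), (0.2, 1)]
validation = [(-0.2, 0), (0.3, 0), (0.9, 1), (1.2, 1)]


def utility(subset):
    """Validation accuracy of a 1-nearest-neighbour classifier built on subset."""
    if not subset:
        return 0.0  # assumption: an empty training set has zero utility
    correct = 0
    for x, y in validation:
        # Predict the label of the closest training point in the subset.
        _, predicted = min(subset, key=lambda p: abs(p[0] - x))
        correct += predicted == y
    return correct / len(validation)


def shapley_values(points, utility_fn):
    """Exact Shapley value of each point: mean marginal gain over all orderings."""
    n = len(points)
    totals = [0.0] * n
    for order in permutations(range(n)):
        chosen = []
        previous = utility_fn(chosen)
        for i in order:
            chosen.append(points[i])
            current = utility_fn(chosen)
            totals[i] += current - previous
            previous = current
    return [t / factorial(n) for t in totals]


values = shapley_values(train, utility)
for point, value in zip(train, values):
    print(point, round(value, 3))
```

In this toy setup the mislabeled point receives the lowest value, and the values sum to the utility of the full training set minus that of the empty set. Exact enumeration costs n! utility evaluations, so it is only feasible for a handful of points, and a well-behaved toy case like this says nothing about the worst-case behaviour the abstract describes.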


Updated: 2021-06-11
