Toward Data Cleaning with a Target Accuracy: A Case Study for Value Normalization, arXiv - CS - Databases


arXiv - CS - Databases. Pub Date: 2021-01-13, DOI: arxiv-2101.05308
Adel Ardalan, Derek Paulsen, Amanpreet Singh Saini, Walter Cai, AnHai Doan

Many applications need to clean data with a target accuracy. As far as we know, this problem has not been studied in depth. In this paper we take the first step toward solving it. We focus on value normalization (VN), the problem of replacing all strings that refer to the same entity with a unique string. VN is ubiquitous, and we often want to do VN with 100% accuracy. In industry today, this is typically done by automatically clustering the strings, then asking a user to verify and clean the clusters until reaching 100% accuracy. This solution has significant limitations. It does not tell the users how to verify and clean the clusters. This part also often takes a lot of time, e.g., days. Further, there is no effective way for multiple users to collaboratively verify and clean. In this paper we address these challenges. Overall, our work advances the state of the art in data cleaning by introducing a novel cleaning problem and describing a promising solution template.
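
To make the "cluster, then verify" workflow concrete, here is a minimal Python sketch of the automatic clustering and replacement steps. It is an illustration only, not the paper's method: the function names (`similarity`, `cluster_strings`, `normalize`), the greedy clustering strategy, the 0.8 threshold, and the choice of the first cluster member as the canonical string are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_strings(values, threshold=0.8):
    """Greedy single-pass clustering (hypothetical strategy).

    Each value joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    """
    clusters = []
    for value in values:
        for cluster in clusters:
            if similarity(value, cluster[0]) >= threshold:
                cluster.append(value)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([value])
    return clusters


def normalize(values, clusters):
    """Replace each value with its cluster's canonical string.

    The canonical string is assumed to be the cluster's first member.
    """
    canonical = {v: cluster[0] for cluster in clusters for v in cluster}
    return [canonical[v] for v in values]
```

A heuristic like this can merge strings that refer to different entities or split strings that refer to the same one, which is why the workflow described above still relies on a user to verify and clean the clusters before the replacement is trusted.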


Updated: 2021-01-15
