Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes,arXiv - CS - Databases


arXiv - CS - Databases. Pub Date: 2021-04-10, DOI: arxiv-2104.04659
Jie Song, Yeye He

Complex data pipelines are increasingly common in diverse applications such as BI reporting and ML modeling. These pipelines often recur regularly (e.g., daily or weekly), as BI reports need to be refreshed and ML models need to be retrained. However, it is widely reported that in complex production pipelines, upstream data feeds can change in unexpected ways, causing downstream applications to break silently in ways that are expensive to resolve. Data validation has thus become an important topic, as evidenced by notable recent efforts from Google and Amazon, where the objective is to catch data quality issues early as they arise in the pipelines. Our experience on production data suggests, however, that on string-valued data, these existing approaches yield high false-positive rates and frequently require human intervention. In this work, we develop a corpus-driven approach to auto-validate machine-generated data by inferring suitable data-validation "patterns" that accurately describe the underlying data domain, minimizing false positives while maximizing the number of data quality issues caught. Evaluations using production data from real data lakes suggest that Auto-Validate is substantially more effective than existing methods. Part of this technology ships as an Auto-Tag feature in Microsoft Azure Purview.
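To make the idea of a data-validation "pattern" concrete, the following is a minimal sketch of pattern-based validation for a string-valued column. It is not the paper's method: the abstract describes a corpus-driven approach that selects suitable patterns using data lakes, whereas this sketch assumes a single fixed generalization level (runs of digits, runs of ASCII letters, literal symbols) and learns only from one column's past values. The function names `to_pattern`, `infer_patterns`, and `validate` are hypothetical, chosen for illustration.

```python
import re
import string
from itertools import groupby


def _char_class(ch):
    """Classify one character; symbols keep their identity (illustrative choice)."""
    if ch in string.digits:
        return ("digit", "")
    if ch in string.ascii_letters:
        return ("letter", "")
    return ("literal", ch)


def to_pattern(value):
    """Generalize a string into a regex over runs of digits, letters, and literals."""
    parts = []
    for (kind, ch), run in groupby(value, key=_char_class):
        if kind == "digit":
            parts.append("[0-9]+")
        elif kind == "letter":
            parts.append("[A-Za-z]+")
        else:
            # Keep symbols exactly, repeated as many times as they occur.
            parts.append(re.escape(ch) * len(list(run)))
    return "".join(parts)


def infer_patterns(training_values):
    """Collect the set of patterns observed in past (assumed clean) data."""
    return {to_pattern(v) for v in training_values}


def validate(new_values, patterns):
    """Return the values in a new batch that match none of the known patterns."""
    compiled = [re.compile(p) for p in patterns]
    return [v for v in new_values if not any(c.fullmatch(v) for c in compiled)]


# Past feed: dates in a machine-generated column.
patterns = infer_patterns(["2021-04-10", "2021-04-11", "2021-4-9"])

# New feed: the upstream format silently changed for some rows.
print(validate(["2021-04-12", "04/13/2021", "N/A"], patterns))
# ['04/13/2021', 'N/A']
```

The trade-off the abstract points to is visible even here: a pattern that is too specific (for example, fixing the year to four particular digits) would flag valid future values as false positives, while one that is too general (for example, "any string") would catch nothing. This sketch picks one level by hand and does not attempt to choose among candidate patterns.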


Updated: 2021-04-13
