Efficient Discovery of Matching Dependencies
ACM Transactions on Database Systems (IF 2.2), Pub Date: 2020-07-07, DOI: 10.1145/3392778
Philipp Schirmer, Thorsten Papenbrock, Ioannis Koumarelas, Felix Naumann

Matching dependencies (MDs) are data profiling results that are often used for data integration, data cleaning, and entity matching. They are a generalization of functional dependencies (FDs) matching similar rather than same elements. As their discovery is very difficult, existing profiling algorithms find either only small subsets of all MDs or their scope is limited to only small datasets. We focus on the efficient discovery of all interesting MDs in real-world datasets. For this purpose, we propose HyMD, a novel MD discovery algorithm that finds all minimal, non-trivial MDs within given similarity boundaries. The algorithm extracts the exact similarity thresholds for the individual MDs from the data instead of using predefined similarity thresholds. For this reason, it is the first approach to solve the MD discovery problem in an exact and truly complete way. If needed, the algorithm can, however, enforce certain properties on the reported MDs, such as disjointness and minimum support, to focus the discovery on such results that are actually required by downstream use cases. HyMD is technically a hybrid approach that combines the two most popular dependency discovery strategies in related work: lattice traversal and inference from record pairs. Despite the additional effort of finding exact similarity thresholds for all MD candidates, the algorithm is still able to efficiently process large datasets, e.g., datasets larger than 3 GB.
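To make the notion of a matching dependency concrete, the following is a minimal sketch of what "extracting the similarity threshold from the data" can mean for a single MD candidate. It is not HyMD: it is a naive check over all record pairs of one relation, without lattice traversal or any pruning. The function names, the token-based Jaccard similarity, the sample records, and the reading of support as the number of record pairs that satisfy the left-hand side are all assumptions made for illustration.

```python
from itertools import combinations


def token_jaccard(a: str, b: str) -> float:
    """Hypothetical similarity measure: Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    if not union:
        return 1.0  # two empty values are treated as identical
    return len(ta & tb) / len(union)


def tightest_rhs_threshold(records, lhs_thresholds, rhs_attr, sim=token_jaccard):
    """Naive quadratic check of one MD candidate (illustration only).

    For every record pair whose similarity reaches the given threshold on
    each left-hand-side attribute, track the lowest similarity observed on
    the right-hand-side attribute. That minimum is the highest right-hand-side
    threshold for which the candidate holds on this data.

    Returns (threshold, support), where support is assumed here to be the
    number of record pairs that satisfy the left-hand side.
    """
    threshold, support = 1.0, 0
    for r, s in combinations(records, 2):
        if all(sim(r[a], s[a]) >= t for a, t in lhs_thresholds.items()):
            support += 1
            threshold = min(threshold, sim(r[rhs_attr], s[rhs_attr]))
    return threshold, support


# Made-up sample data.
records = [
    {"name": "anna maria schmidt", "zip": "14482"},
    {"name": "anna schmidt", "zip": "14482"},
    {"name": "bob jones", "zip": "10115"},
]

# Records with similar names (>= 0.6) always share a zip code in this sample.
print(tightest_rhs_threshold(records, {"name": 0.6}, "zip"))  # (1.0, 1)

# With no constraint on the names, nothing can be said about the zip code.
print(tightest_rhs_threshold(records, {"name": 0.0}, "zip"))  # (0.0, 3)
```

The first call shows a candidate that holds with a right-hand-side threshold of 1.0 but is supported by only one record pair, which is the kind of result a minimum-support constraint would filter out. The quadratic pair comparison is exactly the cost that makes exhaustive discovery hard on large datasets.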

Updated: 2020-07-07