Keeping the Data Lake in Form
ACM Transactions on Information Systems (IF 5.4), Pub Date: 2020-05-22, DOI: 10.1145/3388870
Ayman Alserafi 1, Alberto Abelló 2, Oscar Romero 2, Toon Calders 3

Data lakes (DLs) are large repositories of raw datasets from disparate sources. As more datasets are ingested into a DL, there is an increasing need for efficient techniques to profile them and to detect the relationships among their schemata, commonly known as holistic schema matching. Schema matching detects similarity between the information stored in the datasets to support information discovery and retrieval. At the volume of state-of-the-art DLs, this is currently computationally expensive. To handle this challenge, we propose a novel early-pruning approach to improve efficiency: we collect different types of content metadata and schema metadata about the datasets, and then use this metadata in early-pruning steps to pre-filter the schema matching comparisons. This involves computing proximities between datasets based on their metadata, discovering their relationships based on overall proximities, and proposing similar dataset pairs for schema matching. We improve the effectiveness of this task by introducing a supervised mining approach for detecting the similar datasets that are proposed for further schema matching. We conduct extensive experiments on a real-world DL that demonstrate the success of our approach in detecting similar datasets for schema matching, with recall rates of more than 85% and efficiency improvements above 70%. We empirically show the computational cost savings in space and time obtained by applying our approach, in comparison to instance-based schema matching techniques.
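
The early-pruning idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the metadata features (`n_attributes`, `numeric_ratio`, `avg_distinct_ratio`), the dataset names, the sample values, and the fixed `threshold` are all assumptions made for illustration, and a plain cutoff stands in for the supervised model that the paper uses to decide which pairs are similar. The sketch only shows the general shape of the step: compute a proximity per dataset pair from cheap metadata, then pass only the close pairs on to the expensive schema matching.

```python
from itertools import combinations

# Hypothetical dataset-level metadata profiles (illustrative values only).
profiles = {
    "sales_2019": {"n_attributes": 12, "numeric_ratio": 0.75, "avg_distinct_ratio": 0.40},
    "sales_2020": {"n_attributes": 13, "numeric_ratio": 0.70, "avg_distinct_ratio": 0.42},
    "tweets":     {"n_attributes": 4,  "numeric_ratio": 0.25, "avg_distinct_ratio": 0.95},
}

def proximity(a, b):
    """Mean per-feature similarity in [0, 1]; 1.0 means identical profiles.

    Assumes both profiles share the same non-negative numeric features.
    """
    sims = []
    for key in a:
        x, y = a[key], b[key]
        largest = max(abs(x), abs(y))
        # Relative difference, so features on different scales are comparable.
        sims.append(1.0 if largest == 0 else 1.0 - abs(x - y) / largest)
    return sum(sims) / len(sims)

def candidate_pairs(profiles, threshold=0.8):
    """Return (left, right, score) for pairs whose proximity meets the threshold.

    The fixed threshold is an assumption; it replaces the supervised
    classifier described in the abstract.
    """
    kept = []
    for left, right in combinations(sorted(profiles), 2):
        score = proximity(profiles[left], profiles[right])
        if score >= threshold:
            kept.append((left, right, score))
    return kept

# Only the surviving pairs would be handed to full schema matching.
pairs = candidate_pairs(profiles)
```

With these sample profiles, only the two `sales_*` datasets are close enough to be proposed for schema matching, so two of the three possible comparisons are pruned before any instance-level work is done.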

Updated: 2020-05-22