Bayesian Networks for Data Integration in the Absence of Foreign Keys, IEEE Transactions on Knowledge and Data Engineering

IEEE Transactions on Knowledge and Data Engineering (IF 8.9), Pub Date: 2020-04-01, DOI: 10.1109/tkde.2019.2940019
Bohan Zhang, Scott Sanner, Mohamed Reda Bouadjenek, Shagun Gupta

In the era of open data, a single data source rarely contains all of the attributes we need for inference in specific applications. For example, a marketing department may aim to integrate retailer-specific purchase data with separate demographic data for purposes of targeted advertising – a capability not possible with either dataset alone. In this work, we address two key desiderata of an automated framework for probabilistic data integration over multiple data sources: (1) we require that each relational data source share at least one attribute with another relational data source, but we do not require these attributes to be foreign keys (e.g., attributes such as gender, age, and postal code are not foreign keys because they do not uniquely identify individuals in a data source) and (2) we require inference to be probabilistic to reflect inherent uncertainty in population-level predictions given the absence of foreign keys. While some frameworks such as Probabilistic Relational Models (PRMs) address point (2), they do not address point (1) since they rely on foreign keys to link tables. To achieve both desiderata simultaneously, we develop an automated framework to construct Bayesian networks for data integration capable of answering any probabilistic query spanning the attributes of multiple relational data sources. We demonstrate that our framework is able to closely approximate the inference of a global Bayesian network over a single relation that has been projected onto multiple local relations and further investigate properties of local relations such as the number of shared attributes and their cardinality to understand how these properties affect the quality of inference.
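To make the idea of linking tables through a shared non-key attribute concrete, here is a minimal sketch in Python. It is not the authors' framework: it assumes two toy tables that share a single attribute (`age`), and it assumes the purchase attribute and the demographic attribute are conditionally independent given that shared attribute, so that P(product | income) = Σ_age P(product | age) · P(age | income). The table contents, attribute names, and the helper functions `conditional` and `integrate` are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

def conditional(rows, target, given):
    """Estimate P(target | given) from a list of dict rows by simple counting."""
    joint = Counter((r[given], r[target]) for r in rows)
    marginal = Counter(r[given] for r in rows)
    table = defaultdict(dict)
    for (g, t), n in joint.items():
        table[g][t] = n / marginal[g]
    return table

def integrate(purchases, demographics, income):
    """Approximate P(product | income) across two tables that share only 'age'.

    Assumption (ours, for illustration): product and income are conditionally
    independent given age, so we marginalise over the shared attribute.
    """
    p_age_given_income = conditional(demographics, "age", "income")[income]
    p_product_given_age = conditional(purchases, "product", "age")
    result = defaultdict(float)
    for age, p_age in p_age_given_income.items():
        for product, p_prod in p_product_given_age.get(age, {}).items():
            result[product] += p_prod * p_age
    return dict(result)

# Hypothetical toy data: neither table has a key identifying individuals.
purchases = (
    [{"age": "young", "product": "games"}] * 3
    + [{"age": "young", "product": "books"}] * 1
    + [{"age": "old", "product": "games"}] * 1
    + [{"age": "old", "product": "books"}] * 3
)
demographics = (
    [{"age": "young", "income": "low"}] * 3
    + [{"age": "old", "income": "low"}] * 1
    + [{"age": "young", "income": "high"}] * 1
    + [{"age": "old", "income": "high"}] * 3
)

if __name__ == "__main__":
    print(integrate(purchases, demographics, "low"))
```

The result is a population-level distribution rather than a statement about any individual, which is the kind of uncertainty the abstract says must be reflected when no foreign key is available. If a value of the shared attribute appears in one table but not the other, this sketch silently drops it and the output will not sum to one; a fuller treatment would need smoothing or an explicit model of the shared attributes.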

Updated: 2020-04-01
