A Practical Approach to Proper Inference with Linked Data, The American Statistician


The American Statistician (IF 1.8), Pub Date: 2022-03-23, DOI: 10.1080/00031305.2022.2041482
Andee Kaplan, Brenda Betancourt, Rebecca C. Steorts

Abstract

Entity resolution (ER), comprising record linkage and deduplication, is the process of merging noisy databases in the absence of unique identifiers to remove duplicate entities. One major challenge of analysis with linked data is identifying a representative record among determined matches to pass to an inferential or predictive task, referred to as the downstream task. Additionally, incorporating uncertainty from ER in the downstream task is critical to ensure proper inference. To bridge the gap between ER and the downstream task in an analysis pipeline, we propose five methods to choose a representative (or canonical) record from linked data, referred to as canonicalization. Our methods are scalable in the number of records, appropriate in general data scenarios, and provide natural error propagation via a Bayesian canonicalization stage. The proposed methodology is evaluated on three simulated datasets and one application – determining the relationship between demographic information and party affiliation in voter registration data from the North Carolina State Board of Elections. We first perform Bayesian ER and evaluate our proposed methods for canonicalization before considering the downstream tasks of linear and logistic regression. Bayesian canonicalization methods are empirically shown to improve downstream inference in both settings through prediction and coverage.
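To make the canonicalization step concrete, here is a minimal sketch in Python. It assumes that entity resolution has already assigned a cluster label to every record, and it picks one representative per cluster with a minimax rule: the record whose largest distance to any other record in its cluster is smallest. The distance function, the function names, and the toy records are all assumptions made for illustration; this is not presented as the authors' implementation or as any specific one of their five methods.

```python
from collections import defaultdict


def record_distance(a, b):
    """Fraction of fields on which two equal-length records disagree.

    A deliberately simple, assumed metric; a real pipeline would likely
    use field-specific string or numeric distances.
    """
    return sum(x != y for x, y in zip(a, b)) / len(a)


def canonicalize(records, cluster_labels):
    """Return {cluster label: representative record} using a minimax rule."""
    clusters = defaultdict(list)
    for record, label in zip(records, cluster_labels):
        clusters[label].append(record)

    representatives = {}
    for label, members in clusters.items():
        def worst_case(candidate):
            # Largest distance from the candidate to any member of its cluster.
            return max(record_distance(candidate, other) for other in members)

        # min() keeps the first record when several candidates tie.
        representatives[label] = min(members, key=worst_case)
    return representatives


# Hypothetical toy records: (first name, last name, birth year).
records = [
    ("john", "smith", 1980),
    ("jon", "smith", 1980),
    ("john", "smith", 1981),
    ("mary", "jones", 1975),
]
labels = [0, 0, 0, 1]  # assumed output of an ER step
reps = canonicalize(records, labels)

# Hypothetical posterior samples of the linkage structure. Canonicalizing
# each sample separately gives one canonical dataset per sample, so a
# downstream model can be fit on each and the results combined.
posterior_samples = [[0, 0, 0, 1], [0, 0, 1, 2]]
canonical_sets = [canonicalize(records, sample) for sample in posterior_samples]
```

Running the canonicalization once per posterior sample, rather than on a single point estimate of the linkage, is one simple way to let linkage uncertainty carry through to the downstream regression, in the spirit of the error propagation the abstract describes.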


Updated: 2022-03-23
