A Pseudo-Likelihood Approach to Linear Regression With Partially Shuffled Data
Journal of Computational and Graphical Statistics (IF 2.4), Pub Date: 2021-03-04, DOI: 10.1080/10618600.2020.1870482
Martin Slawski, Guoqing Diao, Emanuel Ben-David

Abstract

Recently, there has been significant interest in linear regression in the situation where predictors and responses are not observed in matching pairs corresponding to the same statistical unit, as a consequence of separate data collection and uncertainty in data integration. Mismatched pairs can considerably impact the model fit and disrupt the estimation of regression parameters. In this article, we present a method to adjust for such mismatches under “partial shuffling,” in which a sufficiently large fraction of (predictors, response) pairs are observed in their correct correspondence. The proposed approach is based on a pseudo-likelihood in which each term takes the form of a two-component mixture density. Expectation-maximization schemes are proposed for optimization, which (i) scale favorably in the number of samples, and (ii) achieve excellent statistical performance relative to an oracle that has access to the correct pairings, as certified by simulations and case studies. In particular, the proposed approach can tolerate a considerably larger fraction of mismatches than existing approaches, and enables estimation of the noise level as well as the fraction of mismatches. Inference for the resulting estimator (standard errors, confidence intervals) can be based on established theory for composite likelihood estimation. Along the way, we also propose a statistical test for the presence of mismatches and establish its consistency under suitable conditions. Supplemental files for this article are available online.
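
As a rough illustration of the idea in the abstract, the sketch below fits a two-component mixture pseudo-likelihood by expectation-maximization. It is not the authors' implementation. The abstract does not specify the mixture components, so the sketch assumes that a correctly matched pair follows a Gaussian linear model, that a mismatched response follows a Gaussian marginal density fitted to the observed responses, and that the mismatch fraction `alpha` is a free parameter. The function names, the default values, and the simulation helper are all hypothetical.

```python
import numpy as np


def gaussian_pdf(z, mean, var):
    """Density of N(mean, var) evaluated at z (elementwise)."""
    return np.exp(-0.5 * (z - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def em_partially_shuffled(X, y, n_iter=200, alpha_init=0.2):
    """EM for a two-component mixture pseudo-likelihood (illustrative sketch).

    Assumed model for each pair (x_i, y_i):
        (1 - alpha) * N(y_i; x_i' beta, sigma2) + alpha * N(y_i; mu_y, var_y)
    where the second component stands in for the marginal density of y.
    Returns estimates of beta, sigma2 and the mismatch fraction alpha.
    """
    # Assumption: the marginal of y is approximated by a Gaussian fit.
    mu_y, var_y = y.mean(), y.var()

    # Initialize from ordinary least squares on the (partly mismatched) data.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2 = np.mean((y - X @ beta) ** 2)
    alpha = alpha_init

    for _ in range(n_iter):
        # E-step: posterior probability that each pair is correctly matched.
        f_match = (1.0 - alpha) * gaussian_pdf(y, X @ beta, sigma2)
        f_mismatch = alpha * gaussian_pdf(y, mu_y, var_y)
        w = f_match / (f_match + f_mismatch + 1e-300)

        # M-step: weighted least squares, then weighted noise variance.
        Xw = X * w[:, None]
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)
        sigma2 = np.sum(w * (y - X @ beta) ** 2) / np.sum(w)
        alpha = 1.0 - w.mean()

    return beta, sigma2, alpha


def simulate_partially_shuffled(n, beta, sigma, frac, seed=0):
    """Simulate a linear model, then permute a fraction of the responses."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, len(beta)))
    y = X @ beta + sigma * rng.standard_normal(n)
    idx = rng.choice(n, size=int(frac * n), replace=False)
    y_shuffled = y.copy()
    y_shuffled[idx] = y[rng.permutation(idx)]
    return X, y_shuffled
```

Each iteration costs one weighted least-squares solve, so the work grows linearly in the number of samples for a fixed number of predictors. A call such as `em_partially_shuffled(*simulate_partially_shuffled(2000, np.array([1.0, -2.0, 0.5]), 0.1, 0.3))` returns the coefficient, noise-variance, and mismatch-fraction estimates for a simulated data set with 30% of responses shuffled.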




Updated: 2021-03-04