Data Integration by Combining Big Data and Survey Sample Data for Finite Population Inference, International Statistical Review


Data Integration by Combining Big Data and Survey Sample Data for Finite Population Inference
International Statistical Review ( IF 2 ) Pub Date : 2020-12-01 , DOI: 10.1111/insr.12434
Jae‐Kwang Kim, Siu‐Ming Tam


The statistical challenges in using big data for making valid statistical inference in the finite population have been well documented in the literature. These challenges are due primarily to statistical bias arising from under-coverage of the population of interest by the big data source and from measurement errors in the variables available in the data set. By stratifying the population into a big data stratum and a missing data stratum, we can estimate the missing data stratum by using a fully responding probability sample, and hence the population as a whole by using a data integration estimator. By expressing the data integration estimator as a regression estimator, we can handle measurement errors in the variables in big data and also in the probability sample. We also propose a fully nonparametric classification method for identifying the overlapping units and develop a bias-corrected data integration estimator under misclassification errors. Finally, we develop a two-step regression data integration estimator to deal with measurement errors in the probability sample. An advantage of the approach advocated in this paper is that we do not have to make unrealistic missing-at-random assumptions for the methods to work. The proposed method is applied to a real data example using 2015–2016 Australian Agricultural Census data.
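The stratification idea in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formula, so the form used here is an assumption: the population total is estimated as the observed total in the big data stratum plus a design-weighted total over the probability-sample units that do not appear in the big data. The function names, the simulated population, and the simple-random-sample weights are all hypothetical, and the sketch ignores the measurement errors and misclassification that the paper addresses.

```python
import random


def data_integration_total(big_data_y, sample):
    """Estimate a population total from two sources (illustrative form only).

    big_data_y: y-values of every unit in the big data stratum.
    sample: (y, design_weight, in_big_data) tuples from a probability sample.
    """
    # The big data stratum is fully observed, so its total is known exactly.
    total_big = sum(big_data_y)
    # The missing data stratum is estimated from sample units outside the big data.
    total_missing = sum(w * y for y, w, in_big in sample if not in_big)
    return total_big + total_missing


def simulate(seed=0, pop_size=10_000, sample_size=500):
    """Hypothetical population whose big data coverage depends on y itself."""
    rng = random.Random(seed)
    y = [rng.gauss(50, 10) for _ in range(pop_size)]
    # Units with larger y are more likely to be covered: not missing at random.
    covered = [rng.random() < (0.8 if yi > 50 else 0.3) for yi in y]
    big_data_y = [yi for yi, c in zip(y, covered) if c]
    # Simple random sample without replacement; every unit gets weight N / n.
    weight = pop_size / sample_size
    sample = [(y[i], weight, covered[i])
              for i in rng.sample(range(pop_size), sample_size)]
    return sum(y), data_integration_total(big_data_y, sample)


if __name__ == "__main__":
    true_total, estimated_total = simulate()
    print(f"true total: {true_total:.0f}, estimate: {estimated_total:.0f}")
```

Because the sample is used only to estimate the part of the population that the big data misses, the sketch needs no model for why units are missing, which is consistent with the abstract's remark about avoiding missing-at-random assumptions. It does assume that each sample unit can be correctly flagged as present or absent in the big data.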


Updated: 2020-12-01
