Optimal Distributed Subsampling for Maximum Quasi-Likelihood Estimators With Massive Data
Journal of the American Statistical Association (IF 3.7), Pub Date: 2020-07-07, DOI: 10.1080/01621459.2020.1773832
Jun Yu, HaiYing Wang, Mingyao Ai, Huiming Zhang

Abstract

Nonuniform subsampling methods are effective in reducing the computational burden while maintaining estimation efficiency for massive data. Existing methods mostly focus on subsampling with replacement because of its high computational efficiency. If the data volume is so large that the nonuniform subsampling probabilities cannot be calculated all at once, then subsampling with replacement is infeasible. This article solves this problem using Poisson subsampling. We first derive optimal Poisson subsampling probabilities in the context of quasi-likelihood estimation under the A- and L-optimality criteria. For a practically implementable algorithm with approximated optimal subsampling probabilities, we establish the consistency and asymptotic normality of the resultant estimators. To handle the case in which the full data are stored in different blocks or at multiple locations, we develop a distributed subsampling framework, in which statistics are computed simultaneously on smaller partitions of the full data. Asymptotic properties of the resultant aggregated estimator are investigated. We illustrate and evaluate the proposed strategies through numerical experiments on simulated and real datasets. Supplementary materials for this article are available online.
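
The following is a minimal Python sketch of the general idea of Poisson subsampling over data blocks, not the authors' algorithm. Each record is kept independently with its own inclusion probability, so the data can be scanned one block at a time without holding all probabilities in memory. The names `poisson_subsample`, `score`, and `psi` are hypothetical: `score` stands in for whatever importance measure is used (the article derives A- and L-optimal choices, which are not reproduced here), and `psi` is assumed to be an estimate of the average score over the full data, for example from a pilot sample.

```python
import random


def poisson_subsample(blocks, score, psi, r, n, seed=0):
    """Poisson subsampling in a single pass over blocks of records.

    blocks : iterable of lists of records (e.g. files or partitions)
    score  : hypothetical function, record -> nonnegative importance score
    psi    : assumed estimate of the mean score over the full data
    r      : target expected subsample size
    n      : number of records in the full data
    Returns a list of (record, weight) pairs, where weight is the
    inverse of the record's inclusion probability.
    """
    rng = random.Random(seed)
    kept = []
    for block in blocks:
        for rec in block:
            # Inclusion probability, truncated at 1.
            p = min(1.0, r * score(rec) / (n * psi))
            # Independent Bernoulli(p) draw for this record only.
            if rng.random() < p:
                kept.append((rec, 1.0 / p))
    return kept


# Usage: with a constant score and r equal to n, every record is kept.
sample = poisson_subsample([[1, 2], [3]], lambda rec: 1.0, psi=1.0, r=3, n=3)
```

The inverse-probability weights are what a weighted estimating equation would use to correct for the unequal inclusion probabilities. In a distributed setting, one such pass could run on each partition and the per-partition results could then be combined; how the article aggregates them is not shown here.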




Updated: 2020-07-07