Distributed Estimation for Principal Component Analysis: An Enlarged Eigenspace Analysis
Journal of the American Statistical Association (IF 3.7). Pub Date: 2021-04-06. DOI: 10.1080/01621459.2021.1886937
Xi Chen, Jason D. Lee, He Li, Yun Yang

Abstract

The growing size of modern datasets poses many challenges to existing statistical estimation approaches and calls for new distributed methodologies. This article studies distributed estimation for a fundamental problem in statistical machine learning: principal component analysis (PCA). Despite the massive literature on estimating the top eigenvector, far less work addresses estimation of the top-L-dim (L > 1) eigenspace, especially in a distributed setting. We propose a novel multi-round algorithm for constructing the top-L-dim eigenspace from distributed data. Our algorithm leverages shift-and-invert preconditioning and convex optimization; the resulting estimator is communication-efficient and achieves a fast convergence rate. In contrast to existing divide-and-conquer algorithms, our approach imposes no restriction on the number of machines. Theoretically, the traditional Davis–Kahan theorem requires an explicit eigengap assumption to estimate the top-L-dim eigenspace. To dispense with this assumption, we take a new route in our analysis: instead of exactly identifying the top-L-dim eigenspace, we show that our estimator covers the targeted top-L-dim population eigenspace. Our distributed algorithm can be applied to a wide range of PCA-based statistical problems, such as principal component regression and the single index model. Finally, we provide simulation studies to demonstrate the performance of the proposed distributed estimator.
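As a toy illustration of the shift-and-invert idea behind the algorithm (a minimal single-process sketch, not the paper's actual method: the multi-round communication pattern is simulated by averaging local covariance products, and the convex-optimization inner solve is replaced here by a simple Richardson iteration; all variable names and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic spiked-covariance data, split across m "machines" (toy simulation).
d, n, m, L = 20, 4000, 4, 3
spikes = np.array([3.0, 2.5, 2.0])
U_true = np.linalg.qr(rng.standard_normal((d, L)))[0]
X = (rng.standard_normal((n, L)) * spikes) @ U_true.T \
    + 0.5 * rng.standard_normal((n, d))
shards = np.array_split(X, m)
local_covs = [S.T @ S / S.shape[0] for S in shards]
weights = np.array([S.shape[0] for S in shards]) / n

def dist_matvec(V):
    """One communication round: weighted average of local products Sigma_i @ V."""
    return sum(w * (C @ V) for w, C in zip(weights, local_covs))

# Rough top-eigenvalue estimate via power iteration, to place the shift sigma
# slightly above lambda_1 so that sigma*I - Sigma is positive definite.
v = rng.standard_normal((d, 1))
for _ in range(100):
    v = dist_matvec(v)
    v /= np.linalg.norm(v)
sigma = 1.1 * (v.T @ dist_matvec(v)).item()

# Shift-and-invert subspace iteration: repeatedly apply (sigma*I - Sigma)^{-1}
# to an L-dim block, solving each linear system with Richardson steps that use
# only distributed matvecs (a stand-in for the paper's convex-optimization solve).
V = np.linalg.qr(rng.standard_normal((d, L)))[0]
for _ in range(30):
    W = V / sigma
    for _ in range(150):
        W = (V + dist_matvec(W)) / sigma  # fixed point: (sigma*I - Sigma) W = V
    V = np.linalg.qr(W)[0]  # re-orthonormalize the block
```

Because the shift sits just above the top eigenvalue, the preconditioned operator (sigma*I - Sigma)^{-1} separates the leading L modes from the rest far more sharply than plain power iteration, which is what drives the fast convergence; communication cost in this sketch is one round per `dist_matvec` call.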



