Abstract
Fan et al. (Ann Stat 47(6):3009–3031, 2019) constructed a distributed principal component analysis (PCA) algorithm that significantly reduces the communication cost between multiple servers. However, their algorithm is guaranteed only for sub-Gaussian data. Motivated by this limitation, this paper extends their distributed PCA algorithm to heavy-tailed data by using the robust covariance matrix estimators of Minsker (Ann Stat 46(6A):2871–2903, 2018) and Ke et al. (Stat Sci 34(3):454–471, 2019). The theoretical results show that when the sampling distribution has symmetric innovation with a bounded fourth moment, or is asymmetric with a finite sixth moment, the statistical error rate of the final estimator produced by the robust algorithm is similar to that obtained under sub-Gaussian tails. Extensive numerical experiments support the theoretical analysis and indicate that the algorithm is robust to heavy-tailed data and outliers.
Notes
The original data consists of 24017 instances and 2400 features. We employ the first 1000 features of each instance as the sample.
References
Anderson TW (1963) Asymptotic theory for principal component analysis. Ann Math Stat 34(1):122–148
Avella-Medina M, Battey HS, Fan J, Li Q (2018) Robust estimation of high-dimensional covariance and precision matrices. Biometrika 105(2):271–284
Bhaskara A, Wijewardena PM (2019) On distributed averaging for stochastic \(k\)-PCA. In: Advances in neural information processing systems, pp 11024–11033
Bickel PJ, Levina E (2008) Covariance regularization by thresholding. Ann Stat 36:2577–2604
Catoni O (2012) Challenging the empirical mean and empirical variance: a deviation study. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques 48(4):1148–1185
Catoni O (2016) PAC-Bayesian bounds for the Gram matrix and least squares regression with a random design. arXiv:1603.05229
Chen TL, Chang DD, Huang S-Y, Chen H, Lin C, Wang W (2016) Integrating multiple random sketches for singular value decomposition. arXiv:1608.08285
Chen X, Lee JD, Li H, Yang Y (2021) Distributed estimation for principal component analysis: an enlarged Eigenspace analysis. J Am Stat Assoc 47:1–31
Davis AW (1977) Asymptotic theory for principal component analysis: non-normal case. Aust J Stat 19(3):206–212
Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml
El Karoui N, d’Aspremont A (2010) Second order accurate distributed eigenvector computation for extremely large matrices. Electron J Stat 4:1345–1385
Fan J, Fan Y, Lv J (2008) High dimensional covariance matrix estimation using a factor model. J Econom 147:186–197
Fan J, Liu H, Wang W (2018) Large covariance estimation through elliptical factor models. Ann Stat 46(4):1383
Fan J, Wang D, Wang K, Zhu Z (2019a) Distributed estimation of principal eigenspaces. Ann Stat 47(6):3009–3031
Fan J, Wang W, Zhong Y (2019b) Robust covariance estimation for approximate factor models. J Econom 208(1):5–22
Fan J, Guo Y, Wang K (2021a) Communication-efficient accurate statistical estimation. J Am Stat Assoc (to appear)
Fan J, Wang W, Zhu Z (2021b) A shrinkage principle for heavy-tailed data: high-dimensional robust low-rank matrix recovery. Ann Stat 49(3):1239–1266
Han F, Liu H (2018) ECA: high-dimensional elliptical component analysis in non-Gaussian distributions. J Am Stat Assoc 113(521):252–268
Huber PJ (1964) Robust estimation of a location parameter. Ann Math Stat 35:73–101
Janzamin M, Sedghi H, Anandkumar A (2014) Score function features for discriminative learning: matrix and tensor framework. arXiv:1412.2863
Jordan MI, Lee JD, Yang Y (2018) Communication-efficient distributed statistical inference. J Am Stat Assoc 114(526):668–681
Ke Y, Minsker S, Ren Z, Sun Q, Zhou W-X (2019) User-friendly covariance estimation for heavy-tailed distributions. Stat Sci 34(3):454–471
Lee JD, Liu Q, Sun Y, Taylor JE (2017) Communication-efficient sparse regression. J Mach Learn Res 18:1–30
Mendelson S, Zhivotovskiy N (2018) Robust covariance estimation under \(L_{4}-L_{2}\) norm equivalence. Ann Stat 48(3):1648–1664
Minsker S (2018) Sub-gaussian estimators of the mean of a random matrix with heavy-tailed entries. Ann Stat 46(6A):2871–2903
Minsker S, Wei X (2017) Estimation of the covariance structure of heavy-tailed distributions. In: Advances in neural information processing systems, pp 2855–2864
Minsker S, Wei X (2020) Robust modifications of U-statistics and applications to covariance estimation problems. Bernoulli 26(1):694–727
Pearson K (1901) On lines and planes of closest fit to systems of points in space. Lond Edinb Dublin Philos Mag J Sci 2(11):559–572
Schizas ID, Aduroja A (2015) A distributed framework for dimensionality reduction and denoising. IEEE Trans Signal Process 63(23):6379–6394
Tian L, Gu Q (2017) Communication-efficient distributed sparse linear discriminant analysis. In: Artificial intelligence and statistics, pp 1178–1187
Wang W, Fan J (2017) Asymptotics of empirical eigenstructure for high dimensional spiked covariance. Ann Stat 45(3):1342–1374
Yang Z, Balasubramanian K, Liu H (2017) High-dimensional non-Gaussian single index models via thresholded score function estimation. In: International conference on machine learning, pp 3851–3860
Yu Y, Wang T, Samworth RJ (2014) A useful variant of the Davis–Kahan theorem for statisticians. Biometrika 102(2):315–323
This work was supported by grants from the NSF of China (Grant No. 11731012), Ten Thousands Talents Plan of Zhejiang Province (Grant No. 2018R52042) and the Fundamental Research Funds for the Central Universities.
Appendix
1.1 Proof of Lemma 1
Proof
Instead of \(\max (e^x-1,0)\) in the proof of Theorem 3.2 in Minsker (2018), we define \(\phi (x):= e^{x}-x-1\). For \(t \ge 0,\)
Following the proof of Lemma 3.1 in Minsker (2018), we obtain
Due to
and \(\log (1+x)\le x\), it yields
Therefore,
Because \(\frac{e^{x}-x-1}{x}=\sum _{i=1}^{\infty } \frac{x^{i}}{(i +1)!}\), we have
By \(\frac{e^{x}}{e^{x}-x-1} \le 1+\frac{2}{x}+\frac{2}{x^{2}}\) for \(x>0\), it yields
Therefore,
In the same way, we have
\(\square \)
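The two elementary facts used in the proof above, the series identity for \(\frac{e^{x}-x-1}{x}\) and the bound \(\frac{e^{x}}{e^{x}-x-1} \le 1+\frac{2}{x}+\frac{2}{x^{2}}\) for \(x>0\), can be checked numerically. The following is only a minimal sanity check on a finite grid, not a proof; the helper names `phi` and `series` and the grid itself are choices made here for illustration.

```python
import math

def phi(x: float) -> float:
    """phi(x) = e^x - x - 1, using expm1 to limit cancellation for small x."""
    return math.expm1(x) - x

def series(x: float, terms: int = 80) -> float:
    """Partial sum of sum_{i>=1} x^i / (i+1)! (truncated after `terms` terms)."""
    total, term = 0.0, 1.0  # term starts at x^0 / 1! = 1
    for i in range(1, terms + 1):
        term *= x / (i + 1)  # now term = x^i / (i+1)!
        total += term
    return total

# Geometric grid of positive x values, roughly from 0.01 to 15.5 (illustrative).
grid = [0.01 * 1.3 ** j for j in range(29)]

# Series identity: (e^x - x - 1) / x = sum_{i>=1} x^i / (i+1)!
series_ok = all(math.isclose(phi(x) / x, series(x), rel_tol=1e-9) for x in grid)

# Bound: e^x / (e^x - x - 1) <= 1 + 2/x + 2/x^2 for x > 0
bound_ok = all(math.exp(x) / phi(x) <= 1 + 2 / x + 2 / x ** 2 for x in grid)

print(series_ok, bound_ok)
```

The bound itself follows from \(e^{x}-x-1\ge x^{2}/2\) for \(x>0\), since \(\frac{e^{x}}{e^{x}-x-1}=1+\frac{x+1}{e^{x}-x-1}\).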
1.2 Proof of Theorem 1
Proof
For all \(k, s \in [d]\) and \(t>0\), writing \(\sigma _{k,s}:=\Sigma _{(k,s)}\), we have
Setting \(\tau _{k,s}=\left( \frac{2\mathbb {E}\left| {X_{i}}_{(k)}{X_{i}}_{(s)}\right| ^{\alpha }}{t}\right) ^{\frac{1}{\alpha -1}}\), it yields
When \(t=2\left( \mathbb {E}\left| {X_{i}}_{(k)}{X_{i}}_{(s)}\right| ^{\alpha }\right) ^{\frac{1}{\alpha }}\left( \frac{2\log d -\log \delta }{n}\right) ^{\frac{\alpha -1}{\alpha }}\), we have
Therefore,
By the union bound, it yields
\(\square \)
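As a worked simplification that is not spelled out in the original text, substituting the value of \(t\) chosen in the proof into the definition of \(\tau _{k,s}\) gives a closed form for the truncation level. The shorthand \(M_{k,s}:=\mathbb {E}|{X_{i}}_{(k)}{X_{i}}_{(s)}|^{\alpha }\) and \(L:=2\log d-\log \delta \) is introduced here only for readability, and the computation assumes \(\alpha >1\) and \(\delta \in (0,1)\) so that \(L>0\).

```latex
\begin{aligned}
\frac{2M_{k,s}}{t}
  &= \frac{2M_{k,s}}{2M_{k,s}^{1/\alpha}\,(L/n)^{(\alpha-1)/\alpha}}
   = M_{k,s}^{(\alpha-1)/\alpha}\left(\frac{n}{L}\right)^{(\alpha-1)/\alpha},\\[4pt]
\tau_{k,s}
  &= \left(\frac{2M_{k,s}}{t}\right)^{\frac{1}{\alpha-1}}
   = \left(\frac{M_{k,s}\,n}{2\log d-\log\delta}\right)^{1/\alpha}.
\end{aligned}
```

Under this choice the truncation level grows like \(n^{1/\alpha }\), up to the logarithmic factor.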
1.3 Proof of Lemma 3
Proof
Define \(D_{j}:=I-2 e_{j}e_{j}^{T}\) for all \(j \in [d]\). Suppose that \({\widehat{\lambda }} \in \mathbb {R}\) and \({\widehat{v}} \in \mathbb {S}^{d-1}\) are an eigenvalue and the corresponding eigenvector of
such that \({\widehat{\Sigma }}_{n}(\alpha , \tau ) {\hat{v}}={\widehat{\lambda }} {\hat{v}}\). Let \(\Sigma ^{(\ell )}=V^{(\ell )} \Lambda ^{(\ell )} V^{T(\ell )}\) be the eigendecomposition of \(\Sigma ^{(\ell )}\). For ease of notation, we drop the superscript \(\ell \), and define \(Z_{i}=\Lambda ^{-\frac{1}{2}} V^{T} X_{i}\) and \({\widehat{S}}=\frac{1}{n } \sum _{i=1}^{n} \psi _{\tau }\left( \left\| X_{i}\right\| _{2}^{2}\right) \frac{Z_{i} Z_{i}^{T}}{\left\| X_{i}\right\| _{2}^{2}}\). This gives \({\widehat{\Sigma }}_{n}(\alpha , \tau )=V \Lambda ^{\frac{1}{2}} {\widehat{S}} {\Lambda }^{\frac{1}{2}} {V}^{T}\). We denote the matrix \({\check{\Sigma }}:={V} {\Lambda }^{\frac{1}{2}} {D}_{j} \widehat{{S}} {D}_{j} {\Lambda }^{\frac{1}{2}} {V}^{T}\). Because \(\left\{ X_{i}\right\} _{i=1}^{n}\) have symmetric innovation, we have \({Z}_{i}{\mathop {=}\limits ^{d}}D_{j}{Z}_{i}:={{Z}_{i}}^{*}\), and
Note that \(\left\| {V} {\Lambda }^{\frac{1}{2}}Z_{i}\right\| _{2}^{2}=\left\| {V} {\Lambda }^{\frac{1}{2}}{Z_{i}}^{*}\right\| _{2}^{2}.\) Hence, we have
Therefore, we get that \({\widehat{\Sigma }}_{n}(\alpha ,\tau )\) and \({\check{\Sigma }}\) are identically distributed. The rest of the proof is the same as that of Theorem 2 in Fan et al. (2019a). \(\square \)
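The algebra behind the sign-flip argument can be illustrated numerically. The sketch below assumes the truncation function is \(\psi _{\tau }(x)=\min (x,\tau )\) for \(x\ge 0\), which is not defined in this appendix and is an assumption here, and uses \(\alpha =2\). The function name `truncated_cov`, the dimensions, the value of `tau` and the Student-t innovations are illustrative choices. It checks that flipping the sign of coordinate \(j\) of every \(Z_{i}\) leaves \(\Vert X_{i}\Vert _{2}\) unchanged and turns the estimator into \({\check{\Sigma }}\).

```python
import numpy as np

def truncated_cov(X: np.ndarray, tau: float) -> np.ndarray:
    """(1/n) sum_i psi_tau(||X_i||^2) X_i X_i^T / ||X_i||^2, with the ASSUMED
    truncation psi_tau(x) = min(x, tau) for x >= 0."""
    sq = np.sum(X ** 2, axis=1)        # squared norms ||X_i||_2^2
    w = np.minimum(sq, tau) / sq       # per-sample shrinkage weight in (0, 1]
    return (X * w[:, None]).T @ X / X.shape[0]

rng = np.random.default_rng(0)
n, d, j, tau = 200, 5, 2, 4.0          # illustrative sizes and truncation level

A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)            # an arbitrary positive definite Sigma
lam, V = np.linalg.eigh(Sigma)         # Sigma = V diag(lam) V^T
B = V @ np.diag(np.sqrt(lam))          # B = V Lambda^{1/2}

# Symmetric, heavy-tailed innovations with unit variance (t with 5 d.o.f.).
Z = rng.standard_t(df=5, size=(n, d)) * np.sqrt(3.0 / 5.0)
X = Z @ B.T                            # rows: X_i = V Lambda^{1/2} Z_i

D = np.eye(d)
D[j, j] = -1.0                         # D_j = I - 2 e_j e_j^T
X_star = (Z @ D) @ B.T                 # rows: V Lambda^{1/2} D_j Z_i

sq = np.sum(X ** 2, axis=1)
S_hat = (Z * (np.minimum(sq, tau) / sq)[:, None]).T @ Z / n
Sigma_check = B @ D @ S_hat @ D @ B.T  # the matrix called Sigma-check in the proof

print(np.allclose(truncated_cov(X_star, tau), Sigma_check))
```

Both identities hold for every realization of the data. The distributional statement in the proof additionally needs \(Z_{i}\) and \(D_{j}Z_{i}\) to have the same law.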
1.4 Proof of Theorem 2
Proof
By Lemma 2 and \(x<e^{x}\), we can get that for \(\tau =O\left( \sigma \cdot \sqrt{n}\right) \),
By the equivalent definition of sub-exponential random variable and \(\psi _{1}\)-norm,
Because \(\Gamma (k) \le k^{k}\) and for any \(k \ge 1\), \(k^{1 / k} \le e^{1 / e} \le 2\), we have
Hence, \(\left\| \left\| {\widehat{\Sigma }}_{n}(2,\tau )-\Sigma \right\| _{2}\right\| _{\psi _{1}}=\sup _{k \ge 1}\left( \mathbb {E}\left\| {\widehat{\Sigma }}_{n}(2,\tau )-\Sigma \right\| _{2}^{k}\right) ^{1/k}/k\le C{\bar{d}}{\frac{\sigma }{\sqrt{n}}}.\)
By the Davis–Kahan theorem (Yu et al. 2014),
By the robust covariance version of Lemma 1 and Theorem 2 in Fan et al. (2019a), if for all \(\ell \in [m]\), \(\Vert \mathbb {E}\widehat{{V}}_{K}^{(\ell )} \widehat{{V}}_{K}^{(\ell ) T}-{V}_{K} {V}_{K}^{T}\Vert _{2} \le 1 / 4\), the first term in (3) can be written as
Since
we obtain that if \(C_{1}\) is sufficiently large such that \(n \ge C_{1}K \max _{\ell \in [m]}\left( {\bar{d}}_{(\ell )}\frac{\sigma _{(\ell )}}{\Delta _{(\ell )}}\right) ^2\), (5) implies that \(\Vert \mathbb {E}\widehat{{V}}_{K}^{(\ell )} \widehat{{V}}_{K}^{(\ell ) T}-{V}_{K} {V}_{K}^{T}\Vert _{2} \le 1 / 4< 1/2\) for all \(\ell \in [m]\). Therefore, by (4) and Lemma 3, we have for some constant \(C_{2}\),
\(\square \)
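To make the objects in this proof concrete, here is a minimal sketch of a one-round distributed PCA in which each server forms a truncated covariance estimate, extracts its top-\(K\) eigenvectors \(\widehat{V}_{K}^{(\ell )}\), and the center averages the projectors \(\widehat{V}_{K}^{(\ell )}\widehat{V}_{K}^{(\ell )T}\) before taking the top-\(K\) eigenvectors of the average, in the spirit of Fan et al. (2019a). Several things are assumptions made here rather than taken from this appendix: the truncation \(\psi _{\tau }(x)=\min (x,\tau )\), the function names, the simulated spiked covariance, and the value of `tau`, which is an arbitrary tuning choice and not the theoretical \(\tau =O(\sigma \sqrt{n})\).

```python
import numpy as np

def truncated_cov(X: np.ndarray, tau: float) -> np.ndarray:
    """Truncated covariance with the ASSUMED psi_tau(x) = min(x, tau), alpha = 2."""
    sq = np.sum(X ** 2, axis=1)
    w = np.minimum(sq, tau) / sq
    return (X * w[:, None]).T @ X / X.shape[0]

def top_k_eigvecs(S: np.ndarray, k: int) -> np.ndarray:
    """Eigenvectors of symmetric S for its k largest eigenvalues (as columns)."""
    _, vecs = np.linalg.eigh(S)        # eigh returns eigenvalues in ascending order
    return vecs[:, ::-1][:, :k]

def distributed_robust_pca(blocks, taus, k: int) -> np.ndarray:
    """Average the local rank-k projectors, then take the top-k eigenvectors."""
    d = blocks[0].shape[1]
    avg_proj = np.zeros((d, d))
    for X, tau in zip(blocks, taus):
        V_loc = top_k_eigvecs(truncated_cov(X, tau), k)  # computed on each server
        avg_proj += V_loc @ V_loc.T / len(blocks)        # only d x k numbers are sent
    return top_k_eigvecs(avg_proj, k)

# Illustrative simulation: m servers, n samples each, spiked diagonal covariance.
rng = np.random.default_rng(1)
m, n, d, K = 8, 500, 10, 2
eigvals = np.array([10.0, 6.0] + [1.0] * (d - K))
blocks = [
    rng.standard_t(df=5, size=(n, d)) * np.sqrt(3.0 / 5.0) * np.sqrt(eigvals)
    for _ in range(m)
]
taus = [240.0] * m                     # arbitrary illustrative truncation level

V_hat = distributed_robust_pca(blocks, taus, K)
V_true = np.eye(d)[:, :K]              # true top-K eigenspace: first K coordinates
err = np.linalg.norm(V_hat @ V_hat.T - V_true @ V_true.T, 2)
print(err)
```

Averaging projectors rather than eigenvectors avoids the sign and rotation ambiguity of the local eigenvectors, which is why the proof works with \(\widehat{V}_{K}^{(\ell )}\widehat{V}_{K}^{(\ell )T}\) throughout.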
1.5 Proof of Lemma 4
Proof
For all \(v \in {\mathcal {S}}^{d-1}\), we have
Therefore, define \({\Omega }={\widehat{\Sigma }}_{n}^{(1)}(2, \tau _{(1)})-{\Sigma }^{(1)}\), \({\Gamma }={V}_{K} {V}_{K}^{T}\), \(\widehat{{\Gamma }}=\widehat{{V}}_{K}^{(1)} \widehat{{V}}_{K}^{(1) T}\), \({\Theta }=f\left( {\Omega V}_{K}\right) {V}_{K}^{T}+{V}_{K} f\left( {\Omega V}_{K}\right) ^{T}\) where f is a linear function defined in Lemma 2 of Fan et al. (2019a), \({\Phi } =\widehat{{\Gamma }}-{\Gamma }-{\Theta }\) and \(\omega =\Vert {\Omega }\Vert _{2} / \Delta \). Since
we have
By Theorem 3 in Fan et al. (2019a), it yields
Since \(\mathbb {E}(\Omega )=\mathbb {E}\Big (\psi _{\tau _{(1)}}\left( \left\| X_{i}\right\| _{2}^{2}\right) \frac{X_{i} X_{i}^{T}}{\left\| X_{i}\right\| _{2}^{2}}-X_{i} X_{i}^{T}\Big )=\mathbb {E}\big ((\psi _{\tau _{(1)}}(\Vert X_{i}\Vert _{2}^{2}) /\Vert X_{i}\Vert _{2}^{2}-1) X_{i} X_{i}^{T}\big )\), for \(\forall v \in {\mathcal {S}}^{d-1}\), we have
where the second and third inequalities follow from the Hölder and Markov inequalities, respectively. The last inequality follows from the \(C_{r}\) inequality. Hence, \(\Vert \mathbb {E}\left( \Omega \right) \Vert _{2}\lesssim R_{(1)}^{\prime }d^{2}/\left( \sigma _{(1)}^2 n\right) \) and
Finally, combining (6)–(8), it can be shown that
\(\square \)
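The bias bound \(\Vert \mathbb {E}(\Omega )\Vert _{2}\lesssim R_{(1)}^{\prime }d^{2}/(\sigma _{(1)}^{2}n)\) used above can be motivated by the following sketch. It is one possible route and not necessarily the authors' exact chain of inequalities. It assumes \(\psi _{\tau }(x)=\min (x,\tau )\) for \(x\ge 0\), which is not defined in this appendix, and a finite sixth moment.

```latex
\begin{aligned}
v^{T}\mathbb{E}(\Omega)v
 &= \mathbb{E}\Big[\Big(\frac{\psi_{\tau_{(1)}}(\|X_i\|_2^2)}{\|X_i\|_2^2}-1\Big)(v^{T}X_i)^2\Big]
  = -\,\mathbb{E}\Big[\Big(1-\frac{\tau_{(1)}}{\|X_i\|_2^2}\Big)(v^{T}X_i)^2\,
      \mathbb{1}\{\|X_i\|_2^2>\tau_{(1)}\}\Big] \le 0,\\[4pt]
\big|v^{T}\mathbb{E}(\Omega)v\big|
 &\le \mathbb{E}\big[(v^{T}X_i)^2\,\mathbb{1}\{\|X_i\|_2^2>\tau_{(1)}\}\big]\\
 &\le \big(\mathbb{E}|v^{T}X_i|^{6}\big)^{1/3}\,
      \mathbb{P}\big(\|X_i\|_2^2>\tau_{(1)}\big)^{2/3}
      \quad\text{(H\"older with exponents } 3 \text{ and } 3/2)\\
 &\le \frac{\big(\mathbb{E}|v^{T}X_i|^{6}\big)^{1/3}\big(\mathbb{E}\|X_i\|_2^{6}\big)^{2/3}}{\tau_{(1)}^{2}}
      \quad\text{(Markov applied to } \|X_i\|_2^{6}).
\end{aligned}
```

With a truncation level of order \(\sigma _{(1)}\sqrt{n}\), the factor \(\tau _{(1)}^{-2}\) is of order \(1/(\sigma _{(1)}^{2}n)\), which is consistent with the rate stated in the proof.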
1.6 Proof of Theorem 3
Proof
By Lemma 4 and (4), we obtain that when \(n \ge C_{2}K \max _{\ell \in [m]}\left( {\bar{d}}_{(\ell )}\frac{\sigma _{(\ell )}}{\Delta _{(\ell )}}\right) ^2\),
Therefore, when the requirement on m and n is satisfied, we have for a constant \(C_{3}\),
\(\square \)
Cite this article
Li, K., Bao, H. & Zhang, L. Robust covariance estimation for distributed principal component analysis. Metrika 85, 707–732 (2022). https://doi.org/10.1007/s00184-021-00848-9
DOI: https://doi.org/10.1007/s00184-021-00848-9