1 Introduction

The covariance matrix plays a fundamental role in multivariate analysis. Many methodologies, including discriminant analysis, principal component analysis and clustering analysis, rely critically on the knowledge of the covariance structure. Driven by a wide range of contemporary applications in many fields including genomics, signal processing, and financial econometrics, estimation of covariance matrices in the high-dimensional setting is of particular interest.

There have been significant recent advances on the estimation of a large covariance matrix and its inverse, the precision matrix. A variety of regularization methods, including banding, tapering, thresholding and penalization, have been introduced for estimating several classes of covariance and precision matrices with different structures. See, for example, [3, 4, 8, 12, 13, 17, 22, 26, 36, 41], among many others.

1.1 Sparse spiked covariance matrix model

In the present paper, we consider spiked covariance matrix models in the high-dimensional setting, which arise naturally from factor models with homoscedastic noise. To be concrete, suppose that we observe an \(n\times p\) data matrix \(\mathbf {X}\) with the rows \({\mathbf {X}}_{{1} *}\), ..., \({\mathbf {X}}_{{n} *}\) i.i.d. following a multivariate normal distribution with mean \({\varvec{0}}\) and covariance matrix \(\varvec{\Sigma }\), denoted by \(N({\varvec{0}},\varvec{\Sigma })\), where the covariance matrix \(\varvec{\Sigma }\) is given by

$$\begin{aligned} \varvec{\Sigma }= {\mathsf {Cov}}({\mathbf {X}}_{{i} *})= \mathbf {V}\varvec{\Lambda }\mathbf {V}' + \sigma ^2 \mathbf {I}_p, \end{aligned}$$
(1)

where \(\varvec{\Lambda }= \mathrm{diag}(\lambda _1,\dots , \lambda _r)\) with \(\lambda _1\ge \cdots \ge \lambda _r >0\), and \(\mathbf {V}= [{\mathbf {v}}_1,\dots ,{\mathbf {v}}_r]\) is \(p\times r\) with orthonormal columns. The \(r\) largest eigenvalues of \(\varvec{\Sigma }\) are \(\lambda _i+\sigma ^2,\,i=1, ..., r\), and the rest are all equal to \(\sigma ^2\). The \(r\) leading eigenvectors of \(\varvec{\Sigma }\) are given by the column vectors of \(\mathbf {V}\). Since the spectrum of \(\varvec{\Sigma }\) has \(r\) spikes, (1) is termed by [19] as the spiked covariance matrix model. This covariance structure and its variations have been widely used in signal processing, chemometrics, econometrics, population genetics, and many other fields. See, for instance, [16, 24, 32, 34]. In the high-dimensional setting, various aspects of this model have been studied by several recent papers, including but not limited to [2, 5, 10, 20, 21, 31, 33, 35]. For simplicity, we assume \(\sigma \) is known. Since \(\sigma \) can always be factored out by scaling \(\mathbf {X}\), without loss of generality, we assume \(\sigma = 1\). Data-based estimation of \(\sigma \) will be discussed in Sect. 6.
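For concreteness, the following short simulation sketches how data from model (1) can be generated. It is our own illustration rather than part of the paper's analysis; all variable names and numerical values (n, p, k, r, lam) are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k, r, lam = 200, 500, 10, 2, 5.0        # hypothetical sample size, dimension, sparsity, rank, spike size

# V: p x r with orthonormal columns, jointly k-sparse (supported on the first k rows)
V = np.zeros((p, r))
V[:k, :] = np.linalg.qr(rng.standard_normal((k, r)))[0]

Lam = np.diag(np.linspace(lam, lam / 2, r))   # spikes lambda_1 >= ... >= lambda_r > 0
Sigma = V @ Lam @ V.T + np.eye(p)             # model (1) with sigma^2 = 1

# Factor representation X = U D V' + Z with U, Z independent standard normal
U = rng.standard_normal((n, r))
Z = rng.standard_normal((n, p))
X = U @ np.sqrt(Lam) @ V.T + Z                # rows of X are i.i.d. N(0, Sigma)
S = X.T @ X / n                               # sample covariance matrix
```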

The primary focus of this paper is on the setting where \(\mathbf {V}\) and \(\varvec{\Sigma }\) are sparse, and our goal is threefold. First, we consider the minimax estimation of the spiked covariance matrix \(\varvec{\Sigma }\) under the spectral norm. The method as well as the optimal rates of convergence in this problem are considerably different from those for estimating other recently studied structured covariance matrices, such as bandable and sparse covariance matrices. Second, we are interested in rank detection. The rank \(r\) plays an important role in principal component analysis (PCA) and is also of significant interest in signal processing and other applications. Last but not least, we consider optimal estimation of the principal subspace \(\mathrm {span}(\mathbf {V})\) under the spectral norm, which is the main object of interest in PCA. Each of these three problems is important in its own right.

We now explain the sparsity model of \(\mathbf {V}\) and \(\varvec{\Sigma }\). The difficulty of estimation and rank detection depends on the joint sparsity of the columns of \(\mathbf {V}\). Let \({\mathbf {V}}_{{j} *}\) denote the \(j\)th row of \(\mathbf {V}\). The row support of \({\mathbf {V}}\) is defined by

$$\begin{aligned} \mathrm {supp}(\mathbf {V}) = \{j\in [p]: {\mathbf {V}}_{{j} *} \ne {\varvec{0}}\}, \end{aligned}$$
(2)

whose cardinality is denoted by \(|\mathrm {supp}(\mathbf {V})|\). Let the collection of \(p \times r\) matrices with orthonormal columns be \(O(p,r) = \left\{ \mathbf {V}\in {\mathbb {R}}^{p \times r}: \mathbf {V}'\mathbf {V}=\mathbf {I}_r \right\} \). Define the following parameter spaces for \(\varvec{\Sigma }\),

$$\begin{aligned} \varTheta _0(k,p,r,\lambda ,\tau )&= \{\varvec{\Sigma }= \mathbf {V}\varvec{\Lambda }\mathbf {V}' + \mathbf {I}_p: \lambda /\tau \le \lambda _r\le \cdots \le \lambda _1 \le \lambda , \nonumber \\&\quad \mathbf {V}\in O(p,r),\, |\mathrm {supp}(\mathbf {V})| \le k\}, \end{aligned}$$
(3)

where \(\tau \ge 1\) is a constant and \(r\le k\le p\) is assumed throughout the paper. Note that the condition number of \(\varvec{\Lambda }\) is at most \({\tau }\). Moreover, for each covariance matrix in \(\varTheta _0(k,p,r,\lambda ,\tau )\), the leading \(r\) singular vectors (columns of \(\mathbf {V}\)) are jointly \(k\)-sparse in the sense that the row support size of \(\mathbf {V}\) is upper bounded by \(k\). The structure of group sparsity has proved useful for high-dimensional regression; see, for example, [30]. In addition to (3), we also define the following parameter spaces by dropping the dependence on \(\tau \) and \(r\), respectively:

$$\begin{aligned} \varTheta _1(k,p,r,\lambda )&= \mathbf {cl} \bigcup _{\tau \ge 1} \varTheta _0(k,p,r,\lambda ,\tau ) \nonumber \\&= \{\varvec{\Sigma }= \mathbf {V}\varvec{\Lambda }\mathbf {V}' + \mathbf {I}_p: 0 \le \lambda _r\le \cdots \le \lambda _1 \le \lambda , \nonumber \\&\quad \mathbf {V}\in O(p,r),\, |\mathrm {supp}(\mathbf {V})| \le k \} \end{aligned}$$
(4)

and

$$\begin{aligned} \varTheta _2(k,p, \lambda ,\tau ) = \bigcup _{r = 0}^k \varTheta _0(k,p,r,\lambda ,\tau ). \end{aligned}$$
(5)

As a consequence of the group sparsity in \(\mathbf {V}\), a covariance matrix \(\varvec{\Sigma }\) in any of the above parameter spaces has at most \(k\) rows and \(k\) columns containing nonzero off-diagonal entries. We note that such a matrix is more structured than the so-called “\(k\)-sparse” matrices considered in [3, 8, 13], where each row (or column) has at most \(k\) nonzero off-diagonal entries.

1.2 Main contributions

In statistical decision theory, the minimax rate quantifies the difficulty of an inference problem and is frequently used as a benchmark for the performance of inference procedures. The main contributions of this paper include the sharp non-asymptotic minimax rates for estimating the covariance matrix \(\varvec{\Sigma }\) and the principal subspace \(\mathrm {span}(\mathbf {V})\) under the squared spectral norm loss, as well as for detecting the rank \(r\) of the principal subspace. In addition, we also establish the minimax rates for estimating the precision matrix \(\varvec{\Omega }= \varvec{\Sigma }^{-1}\) as well as the eigenvalues of \(\varvec{\Sigma }\) under the spiked covariance matrix model (1).

We establish the minimax rate for estimating the spiked covariance matrix \(\varvec{\Sigma }\) in (1) under the spectral norm

$$\begin{aligned} L(\varvec{\Sigma },\widehat{\varvec{\Sigma }}) = \Vert {\varvec{\Sigma }- \widehat{\varvec{\Sigma }}} \Vert ^2, \end{aligned}$$
(6)

where for a matrix \(\mathbf {A}\) its spectral norm is defined as \(\Vert \mathbf {A}\Vert =\sup _{\Vert \mathbf {x}\Vert _2=1}\Vert \mathbf {A}\mathbf {x}\Vert _2\) with \(\Vert \cdot \Vert _2\) the vector \(\ell _2\) norm. The minimax upper and lower bounds developed in Sects. 3 and 4 yield the following optimal rate for estimating sparse spiked covariance matrices under the spectral norm

$$\begin{aligned} \inf _{\widehat{\varvec{\Sigma }}}\sup _{\varvec{\Sigma }\in \varTheta _1(k,p,r,\lambda ) } {\mathsf {E}}\Vert \widehat{\varvec{\Sigma }}-\varvec{\Sigma }\Vert ^2 \asymp { \bigg [ \frac{(\lambda +1) k}{n} \log \frac{\mathrm{e}p}{k} + \frac{\lambda ^2 r}{n} \bigg ] } \wedge \lambda ^2, \end{aligned}$$
(7)

subject to certain mild regularity conditions. The two terms in the square brackets are contributed by the estimation error of the eigenvectors \(\mathbf {V}\) and that of the eigenvalues, respectively. Note that the second term can be dominant if \(\lambda \) is large.

An important quantity of the spiked model is the rank \(r\) of the principal subspace \(\mathrm {span}(\mathbf {V})\), or equivalently, the number of spikes in the spectrum of \(\varvec{\Sigma }\), which is of significant interest in chemometrics [24], signal array processing [25], and other applications. Our second goal is the minimax estimation of the rank \(r\) under the zero–one loss, or equivalently, the minimax detection of the rank \(r\). It is intuitively clear that the difficulty in estimating the rank \(r\) depends crucially on the magnitude of the minimum spike \(\lambda _r\). Results in Sects. 3 and 4 show that the optimal rank detection boundary over the parameter space \(\varTheta _2(k,p, \lambda ,\tau )\) is of order \(\sqrt{\frac{k}{n} \log \frac{\mathrm{e}p}{k}}\). Equivalently, the rank \(r\) can be exactly recovered with high probability if \(\lambda _r \ge \beta \sqrt{\frac{k}{n} \log \frac{e p}{k}}\) for a sufficiently large constant \(\beta \); on the other hand, reliable detection becomes impossible by any method if \(\lambda _r \le \beta _0 \sqrt{\frac{k}{n} \log \frac{e p}{k}}\) for some sufficiently small positive constant \(\beta _0\). Lying at the heart of the arguments is a careful analysis of the moment generating function of a squared symmetric random walk stopped at a hypergeometrically distributed time, which is summarized in Lemma 1. It is worth noting that the optimal rate for rank detection obtained in the current paper resolves a gap left open in a recent paper by Berthet and Rigollet [2], where the authors obtained the optimal detection rate \(\sqrt{\frac{k}{n} \log \frac{e p}{k}}\) for the rank-one case in the regime of \(k \ll \sqrt{p}\), but the lower bound deteriorates to \(\sqrt{\frac{p}{kn}}\) when \(k \gg \sqrt{p}\), which is strictly suboptimal.

In many statistical applications, instead of the covariance matrix itself, the object of direct interest is often a lower dimensional functional of the covariance matrix, e.g. , the principal subspace \(\mathrm {span}(\mathbf {V})\). This problem is known in the literature as sparse PCA [5, 10, 20, 31]. The third goal of the paper is the minimax estimation of the principal subspace \(\mathrm {span}(\mathbf {V})\). To this end, we note that the principal subspace can be uniquely identified with the associated projection matrix \(\mathbf {V}\mathbf {V}'\). Moreover, any estimator can be identified with a projection matrix \(\widehat{\mathbf {V}}\widehat{\mathbf {V}}'\), where the columns of \(\widehat{\mathbf {V}}\) constitute an orthonormal basis for the subspace estimator. Thus, estimating \(\mathrm {span}(\mathbf {V})\) is equivalent to estimating \(\mathbf {V}\mathbf {V}'\). We aim to optimally estimate \(\mathrm {span}(\mathbf {V})\) under the loss [37, Section II.4]

$$\begin{aligned} L(\mathbf {V},\widehat{\mathbf {V}}) = \Vert {\mathbf {V}\mathbf {V}' - \widehat{\mathbf {V}}\widehat{\mathbf {V}}'} \Vert ^2, \end{aligned}$$
(8)

which equals the squared sine of the largest canonical angle between the respective linear spans. In the sparse PCA literature, the loss (8) was first used in [31] for multi-dimensional subspaces. For this problem, we shall show that, under certain regularity conditions, the minimax rate of convergence is

$$\begin{aligned} \inf _{\widehat{\mathbf {V}}} \sup _{\varvec{\Sigma }\in \varTheta _0(k,p, r, \lambda ,\tau )} {\mathsf {E}}\Vert \widehat{\mathbf {V}}\widehat{\mathbf {V}}' -\mathbf {V}\mathbf {V}' \Vert ^2 \asymp \frac{ (\lambda +1)k}{n \lambda ^2} \log \frac{\mathrm{e}p}{k} \wedge 1. \end{aligned}$$
(9)

In the present paper we consider estimation of the principal subspace \(\mathrm {span}(\mathbf {V})\) under the spectral norm loss (8). It is interesting to compare the results with those for optimal estimation under the Frobenius norm loss [10, 40] \(\Vert \mathbf {V}\mathbf {V}' - \widehat{\mathbf {V}}\widehat{\mathbf {V}}'\Vert _\mathrm{F}^2\), whose ratio to (8) is between 1 and \(2r\). The optimal rate under the spectral norm loss given in (9) does not depend on the rank \(r\), whereas the optimal rate under the Frobenius norm loss has an extra term \(\frac{\lambda +1}{n\lambda ^2}r(k-r)\), which depends on the rank \(r\) quadratically through \(r(k - r)\) [10]. Therefore the rate under the Frobenius norm far exceeds (9) when \(r\gg \log \frac{\mathrm{e}p}{k}\). When \(r = 1\), both norms lead to the same rate and the result in (9) recovers earlier results on estimating the leading eigenvector obtained in [5, 29, 39].
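As a concrete illustration of the loss (8) and its Frobenius counterpart, the following sketch (our own helper, not part of the paper) evaluates both losses from two orthonormal bases.

```python
import numpy as np

def subspace_losses(V, V_hat):
    """Squared spectral-norm loss (8) and squared Frobenius-norm loss between
    span(V) and span(V_hat), for orthonormal bases V and V_hat."""
    D = V @ V.T - V_hat @ V_hat.T
    spec = np.linalg.norm(D, 2) ** 2       # squared sine of the largest canonical angle
    frob = np.linalg.norm(D, 'fro') ** 2   # between 1 and 2r times the spectral loss
    return spec, frob
```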

In addition to the optimal rates for estimating the covariance matrix \(\varvec{\Sigma }\), the rank \(r\) and the principal subspace \(\mathrm {span}(\mathbf {V})\), the minimax rates for estimating the precision matrix \(\varvec{\Omega }= \varvec{\Sigma }^{-1}\) as well as the eigenvalues of \(\varvec{\Sigma }\) are also established.

1.3 Other related work

Apart from the spiked covariance matrix model studied in this paper, other covariance matrix models have been considered in the literature. The most commonly imposed structural assumptions include “Toeplitz”, where each descending diagonal from left to right is constant, “bandable”, where the entries of the covariance matrix decay as they move away from the diagonal, and “sparse”, where only a small number of entries in each row/column are nonzero. The optimal rates of convergence were established in [11, 12] and [13] for estimating Toeplitz, bandable, and sparse covariance matrices, respectively. Estimation of sparse precision matrices has also been actively studied due to its close connection to Gaussian graphical models [9, 36, 41]. In addition, our work is also connected to the estimation of effective low-rank covariance matrices. See, for example, [7, 28] and the references therein.

1.4 Organization

The rest of the paper is organized as follows. Section 2 introduces basic notation and then gives a detailed description of the procedures for estimating the spiked covariance matrix \(\varvec{\Sigma }\), the rank \(r\) and the principal subspace \(\mathrm {span}(\mathbf {V})\). The rates of convergence of these estimators are given in Sect. 3. Section 4 presents the minimax lower bounds that match the upper bounds in Sect. 3 in terms of the convergence rates, thereby establishing the minimax rates of convergence and rate-optimality of the estimators constructed in Sect. 2. The minimax rates for estimating the eigenvalues and the precision matrix are given in Sect. 5. Section 6 discusses computational and other related issues. The proofs are given in Sect. 7.

2 Estimation procedure

We give a detailed description of the estimation procedure in this section and study its properties in Sect. 3. Throughout, we shall focus on minimax estimation and assume the sparsity \(k\) is known, while the rank \(r\) will be selected based on data. Adaptation to \(k\) will be discussed in Sect. 6.

Notation We first introduce some notation. For any matrix \(\mathbf {X}= (x_{ij})\) and any vector \(\mathbf {u}\), denote by \(\Vert \mathbf {X}\Vert \) the spectral norm, \(\Vert \mathbf {X}\Vert _\mathrm{F}\) the Frobenius norm, and \(\Vert \mathbf {u}\Vert \) the vector \(\ell _2\) norm. Moreover, the \(i\)th row of \(\mathbf {X}\) is denoted by \({\mathbf {X}}_{{i} *}\) and the \(j\)th column by \({\mathbf {X}}_{* {j}}\). Let \(\mathrm {supp}(\mathbf {X}) = \{i: {\mathbf {X}}_{{i} *} \ne 0\}\) denote the row support of \(\mathbf {X}\). For a positive integer \(p,\,[p]\) denotes the index set \(\{1, 2, ..., p\}\). For any set \(A,\,|A|\) denotes its cardinality, and \({A^{\mathrm{c}}}\) its complement. For two subsets \(I\) and \(J\) of indices, denote by \(\mathbf {X}_{IJ}\) the \(|I|\times |J|\) submatrix formed by \(x_{ij}\) with \((i,j) \in I \times J\). Let \({\mathbf {X}}_{{I} *} = \mathbf {X}_{I [p]}\) and \({\mathbf {X}}_{* {J}} = \mathbf {X}_{[n] J}\). For any square matrix \(\mathbf {A}= (a_{ij})\), we let \(\mathop {\mathsf{Tr}}(\mathbf {A}) = \sum _{i}a_{ii}\) be its trace. Define the inner product of any two matrices \(\mathbf {B}\) and \(\mathbf {C}\) of the same size by \(\langle \mathbf {B}, \mathbf {C} \rangle = \mathop {\mathsf{Tr}}(\mathbf {B}'\mathbf {C})\). For any matrix \(\mathbf {A}\), we use \(\sigma _i(\mathbf {A})\) to denote its \(i\)th largest singular value. When \(\mathbf {A}\) is positive semi-definite, \(\sigma _i(\mathbf {A})\) is also the \(i\)th largest eigenvalue of \(\mathbf {A}\). For any real numbers \(a\) and \(b\), set \(a\vee b = \max \{a,b\}\) and \(a\wedge b = \min \{a,b\}\). Let \({\mathbb {S}}^{p-1}\) denote the unit sphere in \({\mathbb {R}}^p\). For any event \(E\), we write \({\mathbf {1}_{\left\{ {E}\right\} }}\) for its indicator function.

For any set \(B\subset [p]\), let \({B^{\mathrm{c}}}\) be its complement. For any symmetric matrix \({\mathbf {A}}\in \mathbb {R}^{p\times p}\), we use \({\mathbf {A}}_B\) to denote the \(p\times p\) matrix whose \(B\times B\) block is \({\mathbf {A}}_{BB}\), the remaining diagonal elements are all ones and the remaining off-diagonal elements are all zeros, i.e. ,

$$\begin{aligned} ({\mathbf {A}}_B)_{ij} = a_{ij} {\mathbf {1}_{\left\{ {i \in B,j\in B}\right\} }} + {\mathbf {1}_{\left\{ {i=j \in {B^{\mathrm{c}}}}\right\} }}. \end{aligned}$$
(10)

In other words, after proper reordering of rows and columns, we have

$$\begin{aligned} {\mathbf {A}}_B = \begin{bmatrix} {\mathbf {A}}_{BB}&{\varvec{0}}\\ {\varvec{0}}&{\mathbf {I}}_{{B^{\mathrm{c}}}{B^{\mathrm{c}}}} \end{bmatrix}. \end{aligned}$$
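In code, the operator \(\mathbf {A}\mapsto {\mathbf {A}}_B\) in (10) can be realized as follows; this is a minimal sketch and the function name is ours.

```python
import numpy as np

def restrict(A, B):
    """Return A_B as in (10): keep the B x B block of A, set the remaining
    diagonal entries to one and all other entries to zero."""
    p = A.shape[0]
    B = np.asarray(sorted(B))
    out = np.eye(p)
    out[np.ix_(B, B)] = A[np.ix_(B, B)]
    return out
```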

Let \(P \otimes Q\) denote the product measure of \(P\) and \(Q\) and \(P^{\otimes n}\) the \(n\)-fold product of \(P\). For random variables \(X\) and \(Y\), we write \(X\,{\mathop {=}\limits ^\mathrm{(d)}}\,Y\) if they follow the same distribution, and \(X \mathop {\le }\limits ^\mathrm{{s.t.}} Y\) if \({\mathsf {P}}(X>t) \le {\mathsf {P}}(Y>t)\) for all \(t\in {\mathbb {R}}\). Throughout the paper, we use \(C\) to denote a generic positive absolute constant, whose actual value may depend on the context. For any two sequences \(\{a_n\}\) and \(\{b_n\}\) of positive numbers, we write \(a_n\lesssim b_n\) when \(a_n\le C b_n\) for some numeric constant \(C\), and \(a_n\gtrsim b_n\) when \(b_n \lesssim a_n\), and \(a_n \asymp b_n\) when both \(a_n\gtrsim b_n\) and \(a_n\lesssim b_n\) hold.

Estimators We are now ready to present the procedure for estimating the spiked covariance matrix \(\varvec{\Sigma }\), the rank \(r\), and the principal subspace \(\mathrm {span}(\mathbf {V})\).

Let

$$\begin{aligned} A = \mathrm {supp}(\mathbf {V}) \end{aligned}$$
(11)

be the row support of \(\mathbf {V}\). For any \(m\in [p]\), let

$$\begin{aligned} \eta (m,n,p,\gamma _1) = 2\left( \sqrt{\frac{m}{n}} + \sqrt{ \frac{\gamma _1}{n} \,m\log \frac{\mathrm{e}p}{m}} \right) + \left( \sqrt{\frac{m}{n}} + \sqrt{ \frac{\gamma _1}{n} \,m\log \frac{\mathrm{e}p}{m}} \right) ^2. \end{aligned}$$
(12)

Recall that the observed matrix \({\mathbf {X}}\) has i.i.d. rows \({{\mathbf {X}}}_{{i} *}\sim N({\varvec{0}}, \varvec{\Sigma })\). We define \(\mathbf {S}= \frac{1}{n}{\mathbf {X}}'{\mathbf {X}}\) as the sample covariance matrix. Also recall that we assume knowledge of the sparsity level \(k\) which is an upper bound for the support size \(|A|\). The first step in the estimation scheme is to select a subset \(\widehat{A}\) of \(k\) features based on the data. To this end, let

$$\begin{aligned} {\mathbb {B}}_k = \{ B\subset [p]:~ |B|&= k,~~\text{ and } \text{ for } \text{ all }\,D\subset {B^{\mathrm{c}}}\hbox { with }|D|\le k, \nonumber \\&\quad \Vert {\mathbf {S}_{D}-{\mathbf {I}}} \Vert \le \eta (|D|,n,p,\gamma _1),~ \Vert {\mathbf {S}_{DB}} \Vert \le 2\sqrt{\Vert {\mathbf {S}_{B}} \Vert }\, \eta (k,n,p,\gamma _1) \}. \end{aligned}$$
(13)

The appropriate value of \(\gamma _1\) will be specified later in the statement of the theorems. Intuitively speaking, the requirements in (13) aim to ensure that for any \(B\in {\mathbb {B}}_k\), there is no evidence in the data suggesting that \({B^{\mathrm{c}}}\) overlaps with the row support \(A\) of \(\mathbf {V}\). If \({\mathbb {B}}_k\ne \emptyset \), denote by \(\widehat{A}\) an arbitrary element of \({\mathbb {B}}_k\) (or we can let \(\widehat{A}= \mathop {\mathrm{argmax}}_{B\in {\mathbb {B}}_k} {\mathsf {Tr}}(\mathbf {S}_{BB})\) for concreteness). As we will show later, \({\mathbb {B}}_k\) is non-empty with high probability; see Lemma 3 in Sect. 7.1. The set \(\widehat{A}\) represents the collection of selected features, which turns out to be instrumental in constructing optimal estimators for the three objects we are interested in: the covariance matrix, the rank of the spiked model, and the principal subspace. The estimator \(\widehat{A}\) of the support set \(A\) is obtained through searching over all subsets of size \(k\). Such a global search, though computationally expensive, appears to be necessary in order for our procedure to optimally estimate \(\varvec{\Sigma }\) and \(\mathbf {V}\) under the spectral norm. For example, estimating row-by-row is not guaranteed to always yield optimal results. Whether there exist computationally efficient procedures attaining the optimal rate is currently unknown. See Sect. 6 for more discussion.
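The selection step can be transcribed directly, as in the brute-force sketch below. It is our own illustration (all helper names are ours), uses the identities \(\Vert \mathbf {S}_D-\mathbf {I}\Vert = \Vert \mathbf {S}_{DD}-\mathbf {I}\Vert \) and \(\Vert \mathbf {S}_B\Vert = \Vert \mathbf {S}_{BB}\Vert \vee 1\), and is computationally feasible only for toy sizes, consistent with the global-search caveat above.

```python
import numpy as np
from itertools import combinations

def eta(m, n, p, gamma1):
    """Threshold function (12)."""
    a = np.sqrt(m / n) + np.sqrt(gamma1 * m * np.log(np.e * p / m) / n)
    return 2 * a + a ** 2

def in_Bk(S, B, k, n, p, gamma1):
    """Check the conditions in (13) for a candidate support B with |B| = k."""
    Bc = [j for j in range(p) if j not in B]
    norm_SB = max(np.linalg.norm(S[np.ix_(B, B)], 2), 1.0)   # = ||S_B||
    for size in range(1, min(k, len(Bc)) + 1):
        for D in map(list, combinations(Bc, size)):
            if np.linalg.norm(S[np.ix_(D, D)] - np.eye(size), 2) > eta(size, n, p, gamma1):
                return False
            if np.linalg.norm(S[np.ix_(D, B)], 2) > 2 * np.sqrt(norm_SB) * eta(k, n, p, gamma1):
                return False
    return True

def select_support(S, k, n, gamma1=10.0):
    """Return the element of B_k maximizing Tr(S_BB), or None if B_k is empty."""
    p = S.shape[0]
    best, best_trace = None, -np.inf
    for B in map(list, combinations(range(p), k)):
        if in_Bk(S, B, k, n, p, gamma1):
            t = np.trace(S[np.ix_(B, B)])
            if t > best_trace:
                best, best_trace = B, t
    return best
```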

Given \(\widehat{A}\), the estimators for the above three objects are defined as follows. Recalling the notation in (10), we define the covariance matrix estimator as

$$\begin{aligned} \widehat{\varvec{\Sigma }} = \mathbf {S}_{\widehat{A}} {\mathbf {1}_{\left\{ {{\mathbb {B}}_k \ne \emptyset }\right\} }} + \mathbf {I}_p {\mathbf {1}_{\left\{ {{\mathbb {B}}_k = \emptyset }\right\} }}. \end{aligned}$$
(14)

The estimator for the rank is

$$\begin{aligned} \widehat{r} = \max \Big \{l: \sigma _l(\mathbf {S}_{\widehat{A}\widehat{A}}) \ge 1 + \gamma _2\,\sqrt{\Vert \mathbf {S}_{\widehat{A}} \Vert }\, \eta (k,n,p,\gamma _1) \Big \}. \end{aligned}$$
(15)

The appropriate value of \(\gamma _2\) will be specified later in the statement of the theorems. Last but not least, the estimator for the principal subspace is \(\mathrm {span}(\widehat{\mathbf {V}})\), where

$$\begin{aligned} \widehat{\mathbf {V}}= [\widehat{\mathbf {V}}_1,\dots ,\widehat{\mathbf {V}}_{\hat{r}}]\in O(p,\widehat{r}) \end{aligned}$$
(16)

with \(\widehat{\mathbf {V}}_l\) the \(l\)th eigenvector of \(\widehat{\varvec{\Sigma }}\). When \({\mathbb {B}}_k = \emptyset \), we set \(\widehat{r} = 0\) and \(\widehat{\mathbf {V}} = 0\) since \(\widehat{\varvec{\Sigma }} = \mathbf {I}_p\). Note that the estimator \(\widehat{\mathbf {V}}\) is based on the estimated rank \(\widehat{r}\). Whenever \(\widehat{r}\ne r\), the value of the loss function (8) equals 1.
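The three estimators can then be assembled as in the following sketch, which repeats the threshold (12) so that the block is self-contained. It is our own transcription of (14)–(16), with hypothetical function and variable names.

```python
import numpy as np

def eta(m, n, p, gamma1):
    """Threshold function (12), as in the selection sketch above."""
    a = np.sqrt(m / n) + np.sqrt(gamma1 * m * np.log(np.e * p / m) / n)
    return 2 * a + a ** 2

def spiked_estimators(S, A_hat, k, n, gamma1=10.0, gamma2=None):
    """Form Sigma_hat (14), r_hat (15) and V_hat (16) from the selected support
    A_hat; A_hat = None encodes the event that B_k is empty."""
    p = S.shape[0]
    if gamma2 is None:
        gamma2 = 8 * np.sqrt(gamma1) + 34          # as required in Theorems 2 and 3
    if A_hat is None:
        return np.eye(p), 0, np.zeros((p, 0))      # Sigma_hat = I_p, r_hat = 0, V_hat = 0

    idx = np.ix_(A_hat, A_hat)
    Sigma_hat = np.eye(p)
    Sigma_hat[idx] = S[idx]                        # Sigma_hat = S_{A_hat}, cf. (14)

    thresh = 1 + gamma2 * np.sqrt(np.linalg.norm(Sigma_hat, 2)) * eta(k, n, p, gamma1)
    r_hat = int(np.sum(np.linalg.eigvalsh(S[idx]) >= thresh))   # rank estimator (15)

    w, Q = np.linalg.eigh(Sigma_hat)               # leading r_hat eigenvectors of Sigma_hat
    V_hat = Q[:, np.argsort(w)[::-1][:r_hat]]
    return Sigma_hat, r_hat, V_hat
```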

For a brief discussion on the comparison between the foregoing estimation procedure and that in [40], see Sect. 6.

3 Minimax upper bounds

We now investigate the properties of the estimation procedure introduced in Sect. 2. Rates of convergence for estimating the whole covariance matrix and its principal subspace under the spectral norm as well as for rank detection are established. The minimax lower bounds given in Sect. 4 will show that these rates are optimal and together they yield the minimax rates of convergence.

We begin with the estimation error of the covariance matrix estimator (14). Note that here we do not require the ratio \(\lambda _1/\lambda _r\) to be bounded.

Theorem 1

Let \(\widehat{\varvec{\Sigma }}\) be defined in (14) with \({\mathbb {B}}_k\) given by (13) for some \(\gamma _1 \ge 10\). If \(\frac{k}{n}\log \frac{\mathrm{e}p}{k}\le c_0\) for a sufficiently small constant \(c_0 > 0\), then

$$\begin{aligned} \sup _{\varvec{\Sigma }\in \varTheta _1(k, p, r, \lambda )} {\mathsf {E}}\Vert \widehat{\varvec{\Sigma }}-\varvec{\Sigma }\Vert ^2 \lesssim \frac{(\lambda +1) k}{n} \log \frac{\mathrm{e}p}{k} + \frac{\lambda ^2 r}{n}, \end{aligned}$$
(17)

where the parameter space \(\varTheta _1(k,p,r,\lambda )\) is defined in (4).

As we shall show later in Theorem 4, the rates in (17) are optimal with respect to all the model parameters, namely \(k,\,p,\,r\) and \(\lambda \).

Remark 1

Since the parameter space \(\varTheta _1\) in (4) is contained in the set of \(k\)-sparse covariance matrices, it is of interest to compare the minimax rates for covariance matrix estimation in these two nested parameter spaces. For simplicity, we only consider the case where the spectral norms of covariance matrices are uniformly bounded by a constant. Cai and Zhou [13] showed that, under certain regularity conditions, the minimax rate of convergence for estimating \(k\)-sparse matrices is \(k^2 \frac{\log p}{n}\), while the rate over \(\varTheta _1\) in (7) reduces to \(\frac{k}{n}\log \frac{\mathrm{e}p}{k}\) when the spectral norm of the matrix and hence \(\lambda \) is bounded. Thus, ignoring the logarithmic terms, the rate over the smaller parameter space \(\varTheta _1\) is faster by a factor of \(k\). This faster rate can be achieved because the group \(k\)-sparsity considered in our parameter space imposes much more structure than the row-wise \(k\)-sparsity does for the general \(k\)-sparse matrices.

The next result concerns the detection rate of the rank estimator (15) under the extra assumption that the ratio of the largest spike to the smallest, i.e., \(\lambda _1/\lambda _r\), is bounded.

Theorem 2

Let \(\hat{r}=\hat{r}(\gamma _1,\gamma _2)\) be defined in (15) for some constants \(\gamma _1 \ge 10\) and \(\gamma _2 \ge 8\sqrt{\gamma _1}+34\). Assume that

$$\begin{aligned} \frac{\tau k}{n} \log \frac{\mathrm{e}p}{k} \le c_0 \end{aligned}$$
(18)

for a sufficiently small constant \(c_0\in (0,1)\) which depends on \(\gamma _1\). If \(\lambda \ge \beta \sqrt{\frac{k}{n} \log \frac{\mathrm{e}p}{k}}\) for some sufficiently large \(\beta \) depending only on \(\gamma _1,\gamma _2\) and \(\tau \), then

$$\begin{aligned} \sup _{\varvec{\Sigma }\in \varTheta _2(k,p, \lambda ,\tau )} { \mathsf {P}\left\{ \widehat{r} \ne \mathop {\mathsf{rank}}(\varvec{\Lambda }) \right\} } \le C p^{1-\gamma _1/2} \end{aligned}$$
(19)

where the parameter space \(\varTheta _2(k,p,\lambda ,\tau )\) is defined in (5).

By Theorem 5 to be introduced later, the detection rate of \(\sqrt{\frac{k}{n}\log \frac{\mathrm{e}p}{k}}\) is optimal. For more details, see the discussion in Sect. 4.

Finally, we turn to the risk of the principal subspace estimator. As in Theorem 2, we require \(\lambda _1/\lambda _r\) to be bounded.

Theorem 3

Suppose

$$\begin{aligned} M_1\log {p} \ge \log {n} \ge M_0\log \lambda \end{aligned}$$
(20)

holds for some absolute constants \(M_0, M_1 > 0\). Let \(\widehat{\mathbf {V}}\) be defined in (16) with constants \(\gamma _1 \ge 10\vee (1+\frac{2}{M_0})M_1\) and \(\gamma _2\ge 8\sqrt{\gamma _1}+34\) in (13) and (15). If (18) holds for a sufficiently small constant \(c_0\) depending on \(\gamma _1\), then

$$\begin{aligned} \sup _{\varvec{\Sigma }\in \varTheta _0(k,p,r,\lambda ,\tau )} {\mathsf {E}}\Vert \widehat{\mathbf {V}}\widehat{\mathbf {V}}' -\mathbf {V}\mathbf {V}' \Vert ^2 \lesssim \frac{k (\lambda +1)}{n \lambda ^2} \log \frac{\mathrm{e}p}{k} \wedge 1, \end{aligned}$$
(21)

where the parameter space \(\varTheta _0(k,p,r,\lambda ,\tau )\) is defined in (3).

Remark 2

To ensure that the choice of \(\gamma _1\) for achieving (21) is data-driven, we only need an under-estimate for \(M_0 = \log {n}/\log {\lambda }\), or equivalently an over-estimate for \(\lambda \). (Note that \(M_1 = \log {n}/\log {p}\) can be obtained directly given the data matrix.) To this end, we first estimate \(\varvec{\Sigma }\) by \(\widehat{\varvec{\Sigma }}\) in (14) with an initial \(\gamma _1 = 10\) in (13). Then we control \(\lambda \) by \(2\Vert \widehat{\varvec{\Sigma }} \Vert -1\). By the proof of Theorem 2, and in particular (75), this is an over-estimate of \(\lambda \) with high probability. The upper bound in (21) remains valid if we compute \(\widehat{\mathbf {V}}\) with a (possibly) new \(\gamma _1 = 10\vee (1+2/\widehat{M_0})M_1\) in (13), where \(\widehat{M_0} = \log {n}/ \log (2\Vert \widehat{\varvec{\Sigma }} \Vert -1)\).

It is worth noting that the rate in (21) does not depend on \(r\), and is optimal, by the lower bound given in Theorem 4 later.

The problems of estimating \(\varvec{\Sigma }\) and \(\mathbf {V}\) are clearly related, but they are also distinct from each other. To discuss their relationship, we first note the following result (proved in Appendix 7.5) which is a variation of the well-known sin-theta theorem [15]:

Proposition 1

Let \(\varvec{\Sigma }\) and \(\widehat{\varvec{\Sigma }}\) be \(p \times p\) symmetric matrices. Let \(r \in [p-1]\) be arbitrary and let \(\mathbf {V},\widehat{\mathbf {V}}\in O(p,r)\) be formed by the \(r\) leading singular vectors of \(\varvec{\Sigma }\) and \(\widehat{\varvec{\Sigma }}\), respectively. Then

$$\begin{aligned} \Vert {\widehat{\varvec{\Sigma }}-\varvec{\Sigma }} \Vert \ge \frac{1}{2} (\sigma _r(\varvec{\Sigma })- \sigma _{r+1}(\varvec{\Sigma })) \Vert {\widehat{\mathbf {V}}\widehat{\mathbf {V}}'-\mathbf {V}\mathbf {V}'} \Vert . \end{aligned}$$
(22)

In view of Proposition 1, the minimax risks for estimating the spiked covariance matrix \(\varvec{\Sigma }\) and the principal subspace \(\mathbf {V}\) under the spectral norm can be tied as follows:

$$\begin{aligned} { \inf _{\widehat{\varvec{\Sigma }}} \sup _{\varvec{\Sigma }\in \varTheta } {\mathsf {E}}\Vert \widehat{\varvec{\Sigma }}-\varvec{\Sigma }\Vert ^2 \gtrsim \lambda _r^2 \inf _{\widehat{\mathbf {V}}} \sup _{\varvec{\Sigma }\in \varTheta } {\mathsf {E}}\Vert \widehat{\mathbf {V}}\widehat{\mathbf {V}}' -\mathbf {V}\mathbf {V}' \Vert ^2 } , \end{aligned}$$
(23)

where \(\varTheta =\varTheta _0(k,p,r,\lambda ,\tau )\).

The results of Theorems 1 and 3 suggest, however, that the above inequality is not tight when \(\lambda \) is large. The optimal rate for estimating \(\mathbf {V}\) is not equivalent to that for estimating \(\varvec{\Sigma }\) divided by \(\lambda ^2\) when \(\frac{\lambda +1}{\lambda ^2} \log \frac{\mathrm{e}p}{k} \ll \frac{r}{k}\). Consequently, Theorem 3 cannot be directly deduced from Theorem 1 but requires a different analysis by introducing an intermediate matrix \(\mathbf {S}_0\) defined later in (50). This is because the estimation of \(\varvec{\Sigma }\) needs to take into account the extra error in estimating the eigenvalues in addition to those in estimating \(\mathbf {V}\). On the other hand, in proving Theorem 1 we need to contend with the difficulty that the loss function is unbounded.
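The inequality (22) is easy to check numerically; the sketch below (our own, purely illustrative, with arbitrary sizes) evaluates both sides of (22) for a randomly perturbed symmetric matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
p, r = 8, 3                                         # hypothetical sizes
M = rng.standard_normal((p, p))
Sigma = M @ M.T                                     # a symmetric test matrix
E = rng.standard_normal((p, p))
Sigma_hat = Sigma + 0.5 * (E + E.T)                 # a symmetric perturbation

def leading_singvecs(A, r):
    return np.linalg.svd(A)[0][:, :r]               # r leading (left) singular vectors

V, V_hat = leading_singvecs(Sigma, r), leading_singvecs(Sigma_hat, r)
s = np.linalg.svd(Sigma, compute_uv=False)
lhs = np.linalg.norm(Sigma_hat - Sigma, 2)
rhs = 0.5 * (s[r - 1] - s[r]) * np.linalg.norm(V_hat @ V_hat.T - V @ V.T, 2)
print(lhs, rhs)                                     # Proposition 1 asserts lhs >= rhs
```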

4 Minimax lower bounds and optimal rates of convergence

In this section we derive minimax lower bounds for estimating the spiked covariance matrix \(\varvec{\Sigma }\) and the principal subspace \(\mathrm {span}(\mathbf {V})\) as well as for the rank detection. These lower bounds hold for all parameters and are non-asymptotic. The lower bounds together with the upper bounds given in Sect. 3 establish the optimal rates of convergence for the three problems.

The technical analysis heavily relies on a careful study of a rank-one testing problem and analyzing the moment generating function of a squared symmetric random walk stopped at a hypergeometrically distributed time. This lower bound technique is of independent interest and can be useful for other related matrix estimation and testing problems.

4.1 Lower bounds and minimax rates for matrix and subspace estimation

We first consider the lower bounds for estimating the spiked covariance matrix \(\varvec{\Sigma }\) and the principal subspace \(\mathbf {V}\) under the spectral norm.

Theorem 4

For any \(1\le r\le k\le p\) and \(n \in {\mathbb {N}}\),

$$\begin{aligned} \inf _{\widetilde{\varvec{\Sigma }}} \sup _{\varvec{\Sigma }\in \varTheta _1(k,p,r,\lambda )} {\mathsf {E}}\Vert \widetilde{\varvec{\Sigma }}-\varvec{\Sigma }\Vert ^2 \gtrsim \left( \frac{(\lambda +1) k}{n} \log \frac{\mathrm{e}p}{k} + \frac{\lambda ^2 r}{n} \right) \wedge \lambda ^2, \end{aligned}$$
(24)

and

$$\begin{aligned} \inf _{\widetilde{\mathbf {V}}} \sup _{\varvec{\Sigma }\in \varTheta _0(k,p,r,\lambda ,\tau )} {\mathsf {E}}\Vert \widetilde{\mathbf {V}}\widetilde{\mathbf {V}}' -\mathbf {V}\mathbf {V}' \Vert ^2 \gtrsim \frac{(\lambda +1)(k-r)}{n \lambda ^2} \log \frac{\mathrm{e}(p-r)}{k-r} \wedge 1, \end{aligned}$$
(25)

where the parameter spaces \(\varTheta _1(k,p,r,\lambda )\) and \(\varTheta _0(k,p,r,\lambda ,\tau )\) are defined in (4) and (3), respectively.

To better understand the lower bound (24), it is helpful to write it equivalently as

$$\begin{aligned} \left( \frac{(\lambda +1) k}{n} \log \frac{\mathrm{e}p}{k} \right) \wedge \lambda ^2 + \frac{\lambda ^2 r}{n}, \end{aligned}$$

which can be proved by showing that the minimax risk is lower bounded by each of these two terms. The first term does not depend on \(r\) and is the same as the lower bound in the rank-one case. The second term is the oracle risk when the true support of \(\mathbf {V}\) is known. The key to the proof is the analysis of the rank-one case which will be discussed in more detail in Sect. 4.3. The proof of (25) is relatively straightforward by using known results on rank-one estimation.

In view of the upper bounds given in Theorems 1 and 3 and the lower bounds given in Theorem 4, we establish the following minimax rates of convergence for estimating the spiked covariance matrix \(\varvec{\Sigma }\) and the principal subspace \(\mathbf {V}\), subject to the constraints on the parameters given in Theorems 1 and 3:

$$\begin{aligned} \inf _{\widetilde{\varvec{\Sigma }}} \sup _{\varvec{\Sigma }\in \varTheta _1(k,p, r, \lambda ,\tau )} {\mathsf {E}}\Vert \widetilde{\varvec{\Sigma }}-\varvec{\Sigma }\Vert ^2&\asymp&\left( \frac{(\lambda +1) k}{n} \log \frac{\mathrm{e}p}{k} + \frac{\lambda ^2 r}{n} \right) \wedge \lambda ^2, \end{aligned}$$
(26)
$$\begin{aligned} \inf _{\widetilde{\mathbf {V}}} \sup _{\varvec{\Sigma }\in \varTheta _0(k,p, r, \lambda ,\tau )} {\mathsf {E}}\Vert \widetilde{\mathbf {V}}\widetilde{\mathbf {V}}' -\mathbf {V}\mathbf {V}' \Vert ^2&\asymp&\frac{(\lambda +1) k}{n \lambda ^2} \log \frac{\mathrm{e}p}{k} \wedge 1, \end{aligned}$$
(27)

where (27) holds under the additional condition that \(k -r \gtrsim k\). Therefore, the estimators of \(\varvec{\Sigma }\) and \(\mathbf {V}\) given in Sect. 2 are rate optimal. In (26), the trivial upper bound of \(\lambda ^2\) can always be achieved by using the identity matrix as the estimator.

4.2 Lower bound and minimax rate for rank detection

We now turn to the lower bound and minimax rate for the rank detection problem.

Theorem 5

Let \(\beta _0 < \frac{1}{36}\) be a constant. For any \(1\le r\le k\le p\) and \(n \in {\mathbb {N}}\), if \(\lambda \le 1 \wedge \sqrt{\frac{\beta _0 k}{n} \log \frac{e p}{k}}\), then

$$\begin{aligned} \inf _{\tilde{r}} \sup _{\varvec{\Sigma }\in \varTheta _2(k,p,\lambda ,\tau )} { \mathsf {P}\left\{ \tilde{r} \ne \mathop {\mathsf{rank}}(\varvec{\Sigma }) \right\} } \ge w(\beta _0), \end{aligned}$$
(28)

where the function \(w: (0,\frac{1}{36}) \rightarrow (0,1)\) satisfies \(w(0+)=1\).

The upper and lower bounds given in Theorems 2 and 5 show that the optimal detection boundary for the rank \(r\) is \(\sqrt{\frac{k}{n} \log \frac{e p}{k}}\). That is, the rank \(r\) can be estimated with an arbitrarily small error probability when \(\lambda _r \ge \beta \sqrt{\frac{k}{n} \log \frac{e p}{k}}\) for a sufficiently large constant \(\beta \), whereas this is impossible to achieve by any method if \(\lambda _r \le \sqrt{\frac{\beta _0 k}{n} \log \frac{e p}{k}}\) for some small positive constant \(\beta _0\). Note that Theorem 5 applies to the full range of sparsity including the non-sparse case \(k=p\), which requires \(\lambda _r \ge \sqrt{\frac{p}{n}}\). This observation turns out to be useful in proving the “parametric term” in the minimax lower bound for estimating \(\varvec{\Sigma }\) in Theorem 4.

The rank detection lower bound in Theorem 5 is in fact a direct consequence of the next proposition concerning testing the identity covariance matrix against rank-one alternatives,

$$\begin{aligned} H_{0}: \varvec{\Sigma }= \mathbf {I}_p, \quad \text{ versus }\quad H_{1}: \varvec{\Sigma }= \mathbf {I}_p + \lambda \mathbf {v}\mathbf {v}', \quad \mathbf {v}\in {\mathbb {S}}^{p-1} \cap B_0(k), \end{aligned}$$
(29)

where \(B_0(k) \triangleq \{\mathbf {x}\in {\mathbb {R}}^p: |\mathrm {supp}(\mathbf {x})| \le k\}\). Note that \(\varvec{\Sigma }\) is in the parameter space \(\varTheta _2\) under both the null and the alternative hypotheses. The rank-one testing problem (29) has been studied in [2], where there is a gap between the lower and upper bounds when \(k\gtrsim \sqrt{p}\). The following result shows that their lower bound is in fact sub-optimal in this case. We shall give below a dimension-free lower bound for the optimal probability of error and determine the optimal rate of separation. The proof is deferred to Sect. 7.2.2.

Proposition 2

Let \(\beta _0 < \frac{1}{36}\) be a constant. Let \(\mathbf {X}\) be an \(n \times p\) random matrix whose rows are independently drawn from \(N(0,\varvec{\Sigma })\). For any \(k \in [p]\) and \(n \in {\mathbb {N}}\), if \(\lambda \le 1 \wedge \sqrt{\frac{\beta _0 k}{n} \log \frac{\mathrm{e}p}{k}}\), the minimax sum of Type-I and Type-II error probabilities for the testing problem (29) satisfies

$$\begin{aligned} {\mathcal {E}}_n(k,p,\lambda ) \triangleq \inf _{\phi : {\mathbb {R}}^{n \times p} \rightarrow \{0,1\}} \left( \mathsf {P}_{\mathbf {I}_p}\{\phi (\mathbf {X})=1\} + \sup _{\varvec{\Sigma }\in H_1} \mathsf {P}_{\varvec{\Sigma }}\{\phi (\mathbf {X})=0\} \right) \ge w(\beta _0) \end{aligned}$$

where the function \(w: (0,\frac{1}{36}) \rightarrow (0,1)\) satisfies \(w(0+)=1\).

Proposition 2 shows that testing independence in the rank-one spiked covariance model can be achieved reliably only if the effective signal-to-noise ratio

$$\begin{aligned} \beta _0 = \frac{\lambda ^2 }{\frac{k}{n} \log \frac{\mathrm{e}p}{k}}\rightarrow \infty . \end{aligned}$$

Furthermore, the lower bound in Proposition 2 also captures the following phenomenon: if \(\beta _0\) vanishes, then the optimal probability of error converges to one. In fact, the lower bound in Proposition 2 is optimal in the sense that the following test succeeds with vanishing probability of error if \(\beta _0 \rightarrow \infty \):

$$\begin{aligned} \text {reject}\,H_0\,\text {if and only if } \max _{\begin{array}{c} J \subset [p]\\ |J|=k \end{array}} \Vert \mathbf {S}_{JJ} - \mathbf {I}_{JJ}\Vert \ge c \sqrt{\frac{k}{n} \log \frac{\mathrm{e}p}{k}} \end{aligned}$$

for some \(c\) that depends only on \(\beta _0\). See, e.g., [2, Section 4]. However, the above test has high computational complexity since one needs to enumerate all \(k\times k\) submatrices of \(\mathbf {S}\). It remains an open problem to construct tests that are both computationally feasible and minimax rate-optimal.
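For completeness, the exhaustive test above can be written out as follows; this is our own sketch (the function name and the constant c are illustrative), and its cost is exponential in \(k\), as noted.

```python
import numpy as np
from itertools import combinations

def rank_one_test(X, k, c):
    """Reject H_0: Sigma = I_p iff some k x k principal submatrix of S deviates
    from the identity by more than c * sqrt((k/n) log(ep/k)) in spectral norm."""
    n, p = X.shape
    S = X.T @ X / n
    stat = max(
        np.linalg.norm(S[np.ix_(J, J)] - np.eye(k), 2)
        for J in map(list, combinations(range(p), k))
    )
    return stat >= c * np.sqrt(k * np.log(np.e * p / k) / n)
```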

4.3 Testing rank-one spiked model

As mentioned earlier, a careful study of the rank-one testing problem (29) provides a major tool for the lower bound arguments. A key step in this study is the analysis of the moment generating function of a squared symmetric random walk stopped at a hypergeometrically distributed time. We present the main ideas in this section as the techniques can also be useful for other related matrix estimation and testing problems.

It is well-known that the minimax risk is given by the least-favorable Bayesian risk under mild regularity conditions on the model [27]. For the composite testing problem (29), it turns out that the rate-optimal least-favorable prior for \(\mathbf {v}\) is given by the distribution of the following random vector:

$$\begin{aligned} \mathbf {u}= \frac{1}{\sqrt{k}} \mathbf {J}_I \mathbf {w}, \end{aligned}$$
(30)

where \(\mathbf {w}= ({w_1,\ldots ,w_p})\) consists of iid Rademacher entries, and \(\mathbf {J}_I\) is a diagonal matrix given by \((\mathbf {J}_I)_{ii}={\mathbf {1}_{\left\{ {i \in I}\right\} }}\) with \(I\) uniformly chosen from all subsets of \([p]\) of size \(k\). In other words, \(\mathbf {u}\) is uniformly distributed on the collection of \(k\)-sparse vectors of unit length with equal-magnitude non-zeros. Hence \(\mathbf {u}\in {\mathbb {S}}^{p-1} \cap B_0(k)\). We set

$$\begin{aligned} \lambda ^2 = \frac{\beta _0 k}{n} \log \frac{\mathrm{e}p}{k}, \end{aligned}$$

where \(\beta _0>0\) is a sufficiently small absolute constant. The desired lower bound then follows if we establish that the following (Bayesian) hypotheses

$$\begin{aligned} H_0: \varvec{\Sigma }= \mathbf {I}_p \quad \text {v.s.} \quad H_1: \varvec{\Sigma }= \mathbf {I}_p + \lambda \mathbf {u}\mathbf {u}' \end{aligned}$$
(31)

cannot be separated with vanishing probability of error.
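The least-favorable prior (30) is straightforward to sample from; the following sketch (our own illustration, with hypothetical sizes and a small \(\beta _0\)) draws \(\mathbf {u}\) and the corresponding alternative covariance matrix in (31).

```python
import numpy as np

def sample_prior(p, k, lam, rng):
    """Draw u as in (30): uniform size-k support, i.i.d. Rademacher signs,
    magnitudes 1/sqrt(k); return u and Sigma = I_p + lam * u u' as in (31)."""
    I = rng.choice(p, size=k, replace=False)
    u = np.zeros(p)
    u[I] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)
    return u, np.eye(p) + lam * np.outer(u, u)

rng = np.random.default_rng(2)
n, p, k, beta0 = 100, 200, 10, 0.01                     # hypothetical values
lam = np.sqrt(beta0 * k * np.log(np.e * p / k) / n)     # lambda^2 = beta_0 (k/n) log(ep/k)
u, Sigma_alt = sample_prior(p, k, lam, rng)
```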

Remark 3

The composite testing problem (31) has also been considered in [2]. In particular, the following suboptimal lower bound is given in [2, Theorem 5.1]: If

$$\begin{aligned} \lambda \le 1 \wedge \sqrt{\frac{v k}{n} \log \left( 1+ \frac{p}{k^2} \right) } \end{aligned}$$
(32)

then the optimal error probability satisfies \({\mathcal {E}}_n(k,p,\lambda ) \ge C(v)\), where \(C(v) \rightarrow 1\) as \(v \rightarrow 0\). This result is established based on the following prior:

$$\begin{aligned} \mathbf {u}= \frac{1}{\sqrt{k}} \mathbf {J}_I \mathbf 1, \end{aligned}$$
(33)

which is a binary sparse vector with uniformly chosen support.

Compared to the result in Proposition 2, (32) is rate-optimal in the very sparse regime of \(k = o(\sqrt{p})\). However, since \(\log (1+x) \asymp x\) when \(x \lesssim 1\), in the moderately sparse regime of \(k \gtrsim \sqrt{p},\,\log (1+\frac{p}{k^2}) \asymp \frac{p}{k^2}\) and so the lower bound in (32) is substantially smaller than the optimal rate in Proposition 2 by a factor of \(\sqrt{\frac{k^2}{p}\log \frac{\mathrm{e}p}{k}}\), which is a polynomial factor in \(k\) when \(k \gtrsim p^\alpha \) for any \(\alpha > 1/2\). In fact, by strengthening the proof in [2], one can show that the optimal separation for discriminating (31) using the binary prior (33) is \(1 \wedge \sqrt{\frac{k}{n}\log \frac{\mathrm{e}p}{k}} \wedge \frac{p}{k \sqrt{n}}\). Therefore the prior (33) is rate-optimal only in the regime of \(k = o(p^{\frac{2}{3}})\), while (30) is rate-optimal for all \(k\). Examining the role of the prior (30) in the proof of Theorem 5, we see that it is necessary to randomize the signs of the singular vector in order to take advantage of the central limit theorem and Gaussian approximation. When \(k\gtrsim p^{\frac{2}{3}}\), the fact that the singular vector \(\mathbf {u}\) is positive componentwise reduces the difficulty of the testing problem.

The main technical tool for establishing the rank-detection lower bound in Proposition 2 is the following lemma, which can be of independent interest. It deals with the behavior of a symmetric random walk stopped after a hypergeometrically distributed number of steps. Moreover, note that Lemma 1 also incorporates the non-sparse case (\(k=p\) and \(H=k\)), which proves to be useful in establishing the minimax lower bound for estimating \(\varvec{\Sigma }\) in Theorem 4. The proof of Lemma 1 is deferred to Sect. 7.6.

Lemma 1

Let \(p \in {\mathbb {N}}\) and \(k \in [p]\). Let \({B_1,\ldots ,B_k}\) be independently Rademacher distributed. Denote the symmetric random walk on \({\mathbb {Z}}\) stopped at the \(m\)th step by

$$\begin{aligned} G_m \triangleq \sum _{i=1}^m B_i. \end{aligned}$$
(34)

Let \(H \sim \mathrm{Hypergeometric}(p,k,k)\) with \({ \mathsf {P}\left\{ H=i \right\} } = \frac{\left( {\begin{array}{c}k\\ i\end{array}}\right) \left( {\begin{array}{c}p-k\\ k-i\end{array}}\right) }{\left( {\begin{array}{c}p\\ k\end{array}}\right) }, i = {0,\ldots ,k}\). Then there exists a function \(g: (0,\frac{1}{36}) \rightarrow (1,\infty )\) with \(g(0+)=1\), such that for any \(a < \frac{1}{36}\),

$$\begin{aligned} \mathsf {E} \exp \left( t G_H^2 \right) \le g(a), \end{aligned}$$
(35)

where \(t = \frac{a}{k} \log \frac{\mathrm{e}p}{k}\).

Remark 4

(Tightness of Lemma 1) The purpose of Lemma 1 is to seek the largest \(t\), as a function of \((p,k)\), such that \(\mathsf {E} \exp \left( t G_H^2 \right) \) is upper bounded by a constant non-asymptotically. The condition that \(t \asymp \frac{1}{k} \log \frac{ep}{k}\) is in fact both necessary and sufficient. To see the necessity, note that \({ \mathsf {P}\left\{ G_H = H | H=i \right\} } = 2^{-i}\). Therefore

$$\begin{aligned} \mathsf {E} \exp \left( t G_H^2 \right) \ge \mathsf {E} \exp (t H^2) 2^{-H} \ge \exp (t k^2) 2^{-k} \, { \mathsf {P}\left\{ H=k \right\} } \ge \exp \left( tk^2 - k\log \frac{2p}{k} \right) , \end{aligned}$$

which cannot be upper bounded by an absolute constant unless \(t \lesssim \frac{1}{k} \log \frac{\mathrm{e}p}{k}\).
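A quick Monte Carlo computation illustrates Lemma 1 and the necessity argument above; the sketch is our own (parameter values are arbitrary) and simply estimates \(\mathsf {E}\exp (tG_H^2)\) by simulation.

```python
import numpy as np

def mgf_stopped_walk(p, k, a, n_mc=20000, seed=0):
    """Monte Carlo estimate of E exp(t G_H^2) with t = (a/k) log(ep/k),
    H ~ Hypergeometric(p, k, k) and G_m the stopped random walk (34)."""
    rng = np.random.default_rng(seed)
    t = a * np.log(np.e * p / k) / k
    H = rng.hypergeometric(ngood=k, nbad=p - k, nsample=k, size=n_mc)
    B = rng.choice([-1, 1], size=(n_mc, k))                       # Rademacher steps
    walk = np.concatenate([np.zeros((n_mc, 1), dtype=int), B.cumsum(axis=1)], axis=1)
    G = walk[np.arange(n_mc), H]                                  # G_H for each replication
    return np.exp(t * G.astype(float) ** 2).mean()

# For a < 1/36 the estimate stays bounded by a modest constant, as Lemma 1 asserts.
print(mgf_stopped_walk(p=500, k=20, a=0.02))
```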

5 Estimation of precision matrix and eigenvalues

We have so far focused on the optimal rates for estimating the spiked covariance matrix \(\varvec{\Sigma }\), the rank \(r\) and the principal subspace \(\mathrm {span}(\mathbf {V})\). The technical results and tools developed in the earlier sections turn out to be readily useful for establishing the optimal rates of convergence for estimating the precision matrix \(\varvec{\Omega }= \varvec{\Sigma }^{-1}\) as well as the eigenvalues of \(\varvec{\Sigma }\) under the spiked covariance matrix model (1).

Besides the covariance matrix \(\varvec{\Sigma }\), it is often of significant interest to estimate the precision matrix \(\varvec{\Omega }\), which is closely connected to the Gaussian graphical model as the support of the precision matrix \(\varvec{\Omega }\) coincides with the edges in the corresponding Gaussian graph. Let \(\widehat{\varvec{\Sigma }}\) be defined in (14) and let \(\sigma _i(\widehat{\varvec{\Sigma }})\) denote its \(i\)th largest eigenvalue for all \(i\in [p]\). Define the precision matrix estimator as

$$\begin{aligned} \widehat{\varvec{\Omega }} = {\left\{ \begin{array}{ll} (\widehat{\varvec{\Sigma }})^{-1}, &{} \sigma _p(\widehat{\varvec{\Sigma }}) \ge \frac{1}{2}, \\ \mathbf {I}_p, &{} \sigma _p(\widehat{\varvec{\Sigma }}) < \frac{1}{2}. \end{array}\right. } \end{aligned}$$
(36)
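A direct transcription of (36) reads as follows (a minimal sketch; the function name is ours). Since \(\widehat{\varvec{\Sigma }}\) is positive semi-definite, \(\sigma _p(\widehat{\varvec{\Sigma }})\) is its smallest eigenvalue.

```python
import numpy as np

def precision_estimator(Sigma_hat):
    """Estimator (36): invert Sigma_hat when its smallest eigenvalue is at
    least 1/2; otherwise fall back to the identity matrix."""
    if np.linalg.eigvalsh(Sigma_hat).min() >= 0.5:
        return np.linalg.inv(Sigma_hat)
    return np.eye(Sigma_hat.shape[0])
```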

The following result gives the optimal rates for estimating \(\varvec{\Omega }\) under the spectral norm.

Proposition 3

(Precision matrix estimation) Assume that \(\lambda \asymp 1\). If \(\frac{k}{n}\log \frac{\mathrm{e}p}{k}\le c_0\) for a sufficiently small constant \(c_0 > 0\), then

$$\begin{aligned} \inf _{\widehat{\varvec{\Omega }}} \sup _{\varvec{\Sigma }\in \varTheta _1(k,p,r,\lambda )} {\mathsf {E}}\Vert \widehat{\varvec{\Omega }}- \varvec{\Omega } \Vert ^2 \asymp \frac{k}{n}\log \frac{\mathrm{e}p}{k}, \end{aligned}$$
(37)

where the upper bound is attained by the estimator (36) with \(\gamma _1\ge 10\) in obtaining \(\widehat{\varvec{\Sigma }}\).

The upper bound follows along the lines of the proof of Theorem 1 once we control the smallest eigenvalue of \(\widehat{\varvec{\Sigma }}\) as in (36). Proposition 2 can be readily applied to yield the desired lower bound.

Note that the optimal rate in (37) is quite different from the minimax rate of convergence \(M^2k^2\frac{\log p}{n}\) for estimating \(k\)-sparse precision matrices where each row/column has at most \(k\) nonzero entries. Here \(M\) is the \(\ell _1\) norm bound for the precision matrices. See [9]. So the sparsity in the principal eigenvectors and the sparsity in the precision matrix itself have significantly different implications for estimation of \(\varvec{\Omega }\) under the spectral norm.

We now turn to estimation of the eigenvalues. Since \(\sigma \) is assumed to be equal to one, it suffices to estimate the eigenvalue matrix \(\mathbf {E}= \mathop {\text {diag}}(\lambda _i)\) where \(\lambda _i \triangleq 0\) for \(i > r\). For any estimator \(\widetilde{\mathbf {E}}= \mathop {\text {diag}}(\tilde{\lambda }_i)\), we quantify the estimation error by the loss function \(\Vert \widetilde{\mathbf {E}}-\mathbf {E}\Vert ^2 = \max _{i\in [p]} |\tilde{\lambda }_i - \lambda _i|^2\). The following result gives the optimal rate of convergence for this estimation problem.

Proposition 4

(Uniform estimation of spectra) Under the same conditions as in Proposition 3,

$$\begin{aligned} \inf _{\widehat{\mathbf {E}}} \sup _{\varvec{\Sigma }\in \varTheta _1(k,p,r,\lambda )} {\mathsf {E}}\Vert \widehat{\mathbf {E}}-\mathbf {E}\Vert ^2 \asymp \frac{k }{n} \log \frac{\mathrm{e}p}{k}, \end{aligned}$$
(38)

where the upper bound is attained by the estimator \(\widehat{\mathbf {E}}= \mathop {\text {diag}}(\sigma _i(\widehat{\varvec{\Sigma }})) - \mathbf {I}_p\) with \(\widehat{\varvec{\Sigma }}\) defined in (14) with \(\gamma _1 \ge 10\).

Hence, the spikes and the eigenvalues can be estimated uniformly at the rate of \(\frac{k}{n} \log \frac{\mathrm{e}p}{k}\) when \(k\ll n/\log p\). Proposition 4 is a direct consequence of Theorem 1 and Proposition 2. The proofs of Propositions 3 and 4 are deferred to Sect. 7.4.
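The eigenvalue estimator of Proposition 4 is equally simple to implement; the sketch below is our own transcription (function name ours).

```python
import numpy as np

def eigenvalue_estimator(Sigma_hat):
    """E_hat = diag(sigma_i(Sigma_hat)) - I_p, with the eigenvalues of the
    positive semi-definite Sigma_hat sorted in decreasing order."""
    w = np.sort(np.linalg.eigvalsh(Sigma_hat))[::-1]
    return np.diag(w) - np.eye(Sigma_hat.shape[0])
```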

6 Discussions

We have assumed the knowledge of the noise level \(\sigma = 1\) and the support size \(k\). For a given value of \(\sigma ^2\), one can always rescale the data and reduce to the case of unit variance. As a consequence of rescaling, the results in this paper remain valid for a general \(\sigma ^2\) by replacing each \(\lambda \) with \(\lambda /\sigma ^2\) in both the expressions of the rates and the definitions of the parameter spaces. When \(\sigma ^2\) is unknown, it can be easily estimated based on the data. Under the sparsity models (3)–(5), when \(k < p/2,\,\sigma ^2\) can be well estimated by \(\widehat{\sigma }^2 = \mathrm {median}(s_{jj})\) as suggested in [20], where \(s_{jj}\) is the \(j\)th diagonal element of the sample covariance matrix.
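In code, this noise-variance estimate takes one line; the sketch below (our own, with a hypothetical function name) computes the median of the diagonal of the sample covariance matrix.

```python
import numpy as np

def noise_variance(X):
    """Estimate sigma^2 by median(s_jj), the median of the diagonal of
    S = X'X / n, as suggested in [20]; intended for k < p/2."""
    n = X.shape[0]
    return np.median((X ** 2).sum(axis=0) / n)
```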

The knowledge of the support size \(k\) is much more crucial for our procedure. An interesting topic for future research is the construction of adaptive estimators which could achieve the minimax rates in Theorems 1–3 without knowing \(k\). One possibility is to define \({\mathbb {B}}_k\) in (13) for all \(k\in [p]\), find the smallest \(k\) such that \({\mathbb {B}}_k\) is non-empty, and then define estimators for that particular \(k\) similar to those in Sect. 2 with necessary adjustments accounting for the extra multiplicity in the support selection procedure.

For ease of exposition, we have assumed that the data are normally distributed and so various non-asymptotic tail bounds used in the proof follow. Since these bounds typically only rely on sub-Gaussianity assumptions on the data, we expect the results in the present paper to be readily extendable to data generated from distributions with appropriate sub-Gaussian tail conditions. The estimation procedure in Sect. 2 is different from that in [40]. Although both are based on enumerating all possible support sets of size \(k\), Vu and Lei [40] proposed to pick the support set which maximizes a quadratic form, while ours is based on picking the set satisfying certain deviation bounds.

A more important issue is the computational complexity required to obtain the minimax rate optimal estimators. The procedure described in Sect. 2 entails a global search for the support set \(A\), which can be computationally intensive. In many cases, this seems unavoidable since the spectral norm is not separable in terms of the entries/rows/columns. However, in some other cases, there are estimators that are computationally more efficient and can attain the same rates of convergence. For instance, in the low rank cases where \(r \lesssim \log \frac{\mathrm{e}p}{k}\), the minimax rates for estimating \(\mathrm {span}(\mathbf {V})\) under the spectral norm and under the Frobenius norm coincide with each other. See the discussion following Theorem 3 in Sect. 3. Therefore the procedure introduced in [10, Section 3] attains the optimal rates under both norms simultaneously. As shown in [10], this procedure is not only computationally efficient, but also adaptive to the sparsity \(k\). Finding the general complexity-theoretic limits for attaining the minimax rates under the spectral norm is an interesting and challenging topic for future research. Following the initial post of the current paper, Berthet and Rigollet [1] showed in a closely related sparse principal component detection problem that the minimax detection rate cannot be achieved by any computationally efficient algorithm in the highly sparse regime.

7 Proofs

We first collect a few useful technical lemmas in Sect. 7.1 before proving the main theorems in Sect. 7.2 in the order of Theorems 1–4. We then give the proofs of the propositions in the order of Propositions 2, 3, 4, and 1. As mentioned in Sect. 4, Theorem 5 on the rank detection lower bound is a direct consequence of Proposition 2. We complete this section with the proof of Lemma 1.

Recall that the row vectors of \(\mathbf {X}\) are i.i.d. samples from the \(N({\varvec{0}}, \varvec{\Sigma })\) distribution with \(\varvec{\Sigma }\) specified by (1). Equivalently, one can think of \(\mathbf {X}\) as an \(n\times p\) data matrix generated as

$$\begin{aligned} \mathbf {X}= \mathbf {U}\mathbf {D}\mathbf {V}' + \mathbf {Z}, \end{aligned}$$
(39)

where \(\mathbf {U}\) is an \(n\times r\) random effects matrix with i.i.d. \(N(0,1)\) entries, \(\mathbf {D}= \mathrm{diag}(\lambda _1^{1/2},\dots , \lambda _r^{1/2})\), \(\mathbf {V}\) is \(p\times r\) with orthonormal columns, and \(\mathbf {Z}\) has i.i.d. \(N(0,1)\) entries which are independent of \(\mathbf {U}\).

7.1 Technical lemmas

Lemma 2

Let \(\mathbf {S}= \frac{1}{n}\mathbf {X}'\mathbf {X}\) be the sample covariance matrix, then

$$\begin{aligned} {\mathsf {P}}\bigg ( \max _{|B|=k} \Vert {\mathbf {S}_{BB}} \Vert \ge (\lambda _1+1) \Big (1 +\sqrt{\frac{k}{n}}+\frac{t}{\sqrt{n}}\Big )^2 \bigg ) \le \left( {\begin{array}{c}p\\ k\end{array}}\right) \mathrm{e}^{-t^2/2}. \end{aligned}$$

Proof

Note that

$$\begin{aligned} \max _{|B|=k}\Vert {\mathbf {S}_{BB}} \Vert&\le \max _{|B|=k} \Vert {\varvec{\Sigma }_{BB}} \Vert \cdot \Vert {\varvec{\Sigma }_{BB}^{-1/2}\mathbf {S}_{BB}\varvec{\Sigma }_{BB}^{-1/2}} \Vert \\&\le (\lambda _1+1)\max _{|B|=k} \Vert {\varvec{\Sigma }_{BB}^{-1/2}\mathbf {S}_{BB}\varvec{\Sigma }_{BB}^{-1/2}} \Vert {\mathop {=}\limits ^\mathrm{(d)}}(\lambda _1+1) \max _{|B|=k} \Vert {{\mathbf {Z}}_{*B}} \Vert ^2. \end{aligned}$$

The result follows from the Davidson–Szarek bound [14, Theorem II.7] and the union bound. \(\square \)

Lemma 3

Let \({\mathbb {B}}_k\) be defined as in (13) with \(\gamma _1 \ge 3\). Then \({\mathsf {P}}(A\notin {\mathbb {B}}_k) \le 5(\mathrm{e}p)^{1-\gamma _1/2}\).

Proof

Note that by union bound

$$\begin{aligned} {\mathsf {P}}(A\notin {\mathbb {B}}_k)&\le {\mathsf {P}}(\exists D\subset {A^{\mathrm{c}}}, |D|\le k, \Vert {\mathbf {S}_D-{\mathbf {I}}} \Vert > \eta (|D|,n,p,\gamma _1))\\&\quad + {\mathsf {P}}(\exists D\subset {A^{\mathrm{c}}}, |D|\le k, \Vert {\mathbf {S}_{DA}} \Vert > 2\sqrt{\Vert {\mathbf {S}_A} \Vert }\,\eta (k,n,p,\gamma _1)). \end{aligned}$$

We now bound the two terms on the right-hand side separately.

For the first term, note that \(A=\mathrm {supp}(\mathbf {V})\). Then for any \(D \subset {A^{\mathrm{c}}},\,\mathbf {S}_D = \frac{1}{n} \mathbf {Z}_{*D}' \mathbf {Z}_{*D}\). Hence

$$\begin{aligned}&{\mathsf {P}}(\exists D\subset {A^{\mathrm{c}}}, |D|\le k, \Vert {\mathbf {S}_D-{\mathbf {I}}} \Vert > \eta (|D|,n,p,\gamma _1)) \\&\le \sum _{D\subset {A^{\mathrm{c}}}, |D|\le k} {\mathsf {P}}(\Vert {\mathbf {S}_D-{\mathbf {I}}} \Vert > \eta (|D|,n,p,\gamma _1)) \\&\le \sum _{l=1}^{k} \left( {\begin{array}{c}p-k\\ l\end{array}}\right) 2 \exp \left( -\frac{\gamma _1}{2}l \log \frac{\mathrm{e}p}{l}\right) \\&\le 2 \sum _{l=1}^k \left( \frac{\mathrm{e}p}{l}\right) ^{l(1-\gamma _1/2)} \le 4(\mathrm{e}p)^{1-\gamma _1/2}, \end{aligned}$$

where the second inequality is [10, Proposition 4], and the last inequality holds for all \(\gamma _1\ge 3\) and \(p\ge 2\).

For the second term, note that for any fixed \(D\subset {A^{\mathrm{c}}},\,\mathbf {S}_{DA} = \frac{1}{n}\mathbf {Z}_{*D}'\mathbf {X}_{*A}\), where \(\mathbf {Z}_{*D}\) and \(\mathbf {X}_{*A}\) are independent. Thus, letting \({\mathbf {W}}\) be the left singular vector matrix of \(\mathbf {X}_{*A}\), we obtain that

$$\begin{aligned} \Vert {\mathbf {S}_{DA}} \Vert \le \frac{1}{n} \Vert {\mathbf {Z}_{*D}'{\mathbf {W}}} \Vert \Vert {\mathbf {X}_{*A}} \Vert {\mathop {=}\limits ^\mathrm{(d)}}\frac{1}{\sqrt{n}}\Vert {{\mathbf {Y}}} \Vert \sqrt{\Vert {\mathbf {S}_A} \Vert }, \end{aligned}$$

where \({\mathbf {Y}}\) is a \(|D|\times k\) matrix with i.i.d. \(N(0,1)\) entries. Thus, we have

$$\begin{aligned}&{\mathsf {P}}(\exists D\subset {A^{\mathrm{c}}}, |D|\le k, \Vert {\mathbf {S}_{DA}} \Vert > 2\sqrt{\Vert {\mathbf {S}_A} \Vert }\,\eta (k,n,p,\gamma _1))\\&\le \sum _{D\subset {A^{\mathrm{c}}}, |D|\le k} {\mathsf {P}}(\Vert {\mathbf {S}_{DA}} \Vert > 2\sqrt{\Vert {\mathbf {S}_A} \Vert }\,\eta (k,n,p,\gamma _1))\\&\le \sum _{D\subset {A^{\mathrm{c}}}, |D|\le k} {\mathsf {P}}(\Vert {{\mathbf {Y}}} \Vert > 2\sqrt{n}\,\eta (k,n,p,\gamma _1) ) \\&\le \sum _{l=1}^k \left( {\begin{array}{c}p-k\\ l\end{array}}\right) \exp \left( -2\gamma _1 k\log \frac{\mathrm{e}p}{k}\right) \\&\le k \left( \mathrm{e}\frac{p}{k}\right) ^{k(1-2\gamma _1)} \le (\mathrm{e}p)^{1-\gamma _1/2}, \end{aligned}$$

where the third inequality is due to the Davidson-Szarek bound. Combining the two bounds completes the proof. \(\square \)

Lemma 4

Let \(\gamma _1\ge 3\). Suppose that \(\frac{k}{n}\log \frac{\mathrm{e}p}{k}\le c_0\) for a sufficiently small constant \(c_0 > 0\). Then with probability at least \(1-12 (\mathrm{e}p)^{1-\gamma _1/2},\,{\mathbb {B}}_k\ne \emptyset \) and

$$\begin{aligned} \Vert {\mathbf {S}_{\widehat{A}} - \mathbf {S}_A} \Vert \le C_0(\gamma _1) \sqrt{\lambda _1+1}\,\eta (k,n,p,\gamma _1), \end{aligned}$$
(40)

where \(C_0(\gamma _1) = 14+4\sqrt{\gamma _1}\).

Proof

We focus on the event \({\mathbb {B}}_k \ne \emptyset \). Define the sets

$$\begin{aligned} G = A\cap \widehat{A}, \quad M = A\cap {\widehat{A}^{\mathrm{c}}},\quad O = {A^{\mathrm{c}}}\cap \widehat{A}\,, \end{aligned}$$
(41)

which correspond, respectively, to the coordinates that are correctly identified, missed, and spuriously included by the selected support set \(\widehat{A}\). By the triangle inequality, we have

$$\begin{aligned} \Vert {\mathbf {S}_{\widehat{A}} - \mathbf {S}_{A}} \Vert \le \Vert {\mathbf {S}_{MM} - {\mathbf {I}}_{MM}} \Vert + \Vert {\mathbf {S}_{OO} - {\mathbf {I}}_{OO}} \Vert + 2\Vert {\mathbf {S}_{GM}} \Vert + 2\Vert {\mathbf {S}_{GO}} \Vert . \end{aligned}$$
(42)

We now provide high probability bounds for each term on the right-hand side.

For the first term, recall that \(\widehat{A}\in {\mathbb {B}}_k\) which is defined in (13). Since \(M\subset {\widehat{A}^{\mathrm{c}}}\), we have

$$\begin{aligned} \Vert {\mathbf {S}_{MM} - {\mathbf {I}}_{MM}} \Vert \le \eta (|M|,n,p,\gamma _1) \le \eta (k,n,p,\gamma _1). \end{aligned}$$
(43)

For the second term, by a calculation similar to that in the proof of Lemma 3, we have that when \(\gamma _1\ge 3\),

$$\begin{aligned} \Vert {\mathbf {S}_{OO} - {\mathbf {I}}_{OO}} \Vert \le \eta (k,n,p,\gamma _1) \end{aligned}$$
(44)

with probability at least \(1- 4(\mathrm{e}p)^{1-\gamma _1/2}\). For the third term, we turn to the definition of \({\mathbb {B}}_k\) in (13) again to obtain

$$\begin{aligned} \Vert {\mathbf {S}_{GM}} \Vert \le \Vert {\mathbf {S}_{\widehat{A}M}} \Vert \le 2\sqrt{\Vert {\mathbf {S}_{\widehat{A}\widehat{A}}} \Vert }\,\eta (k,n,p,\gamma _1). \end{aligned}$$

By Lemma 2, with probability at least \(1-(\mathrm{e}p)^{1-\gamma _1/2}\),

$$\begin{aligned} \Vert {\mathbf {S}_{\widehat{A}\widehat{A}}} \Vert ,\Vert {\mathbf {S}_{AA}} \Vert&\le \max _{|B|=k}\Vert {\mathbf {S}_{BB}} \Vert \le (\lambda _1 + 1) \Big (1 + \sqrt{\frac{k}{n}} + \sqrt{\frac{\gamma _1 k}{n}\log \frac{\mathrm{e}p}{k}}\Big )^2 \nonumber \\&\le (1+(\sqrt{\gamma _1}+1)\sqrt{c_0})^2 (\lambda _1+1) \nonumber \\&\le (\tfrac{1}{2}\sqrt{\gamma _1}+\tfrac{3}{2})^2(\lambda _1+1), \end{aligned}$$
(45)

where the second inequality holds when \(\frac{k}{n}\log \frac{\mathrm{e}p}{k}\le c_0\) and the last inequality holds for sufficiently small constant \(c_0\). Moreover, the last two displays jointly imply

$$\begin{aligned} \Vert {\mathbf {S}_{GM}} \Vert \le (\sqrt{\gamma _1}+3) \sqrt{\lambda _1+1}\,\eta (k,n,p,\gamma _1). \end{aligned}$$
(46)

For the fourth term, we obtain by similar arguments that with probability at least \(1-(ep)^{1-\gamma _1/2}\),

$$\begin{aligned} \Vert {\mathbf {S}_{GO}} \Vert \le \Vert {\mathbf {S}_{A O}} \Vert \le (\sqrt{\gamma _1}+3) \sqrt{\lambda _1+1}\,\eta (k,n,p,\gamma _1). \end{aligned}$$
(47)

Note that \({\mathsf {P}}({\mathbb {B}}_k = \emptyset )\le {\mathsf {P}}(A\notin {\mathbb {B}}_k)\), and by Lemma 3, the latter is bounded above by \(5(\mathrm{e}p)^{1-\gamma _1/2}\). So the union bound implies that the intersection of the event \(\{{\mathbb {B}}_k\ne \emptyset \}\) and the event that (43)–(47) all hold has probability at least \(1-12(\mathrm{e}p)^{1-\gamma _1/2}\). On this event, we assemble (42)–(47) to obtain

$$\begin{aligned} \Vert \mathbf {S}_{\widehat{A}}-\mathbf {S}_{A} \Vert&\le [2 + 4(\sqrt{\gamma _1}+3)\sqrt{\lambda _1+1}]\,\eta (k,n,p,\gamma _1)\\&\le [2 + 4(\sqrt{\gamma _1}+3)]\,\sqrt{\lambda _1+1}\,\eta (k,n,p,\gamma _1). \end{aligned}$$

This completes the proof. \(\square \)

Lemma 5

Let \(\mathbf {Y}\in \mathbb {R}^{n\times k}\) and \(\mathbf {Z}\in \mathbb {R}^{n\times m}\) be two independent matrices with i.i.d. \(N(0,1)\) entries. Then there exists an absolute constant \(C>0\), such that

$$\begin{aligned}&{\mathsf {E}}\Big \Vert \frac{1}{n}\mathbf {Y}'\mathbf {Y}- \mathbf {I}_k\Big \Vert ^2 \le C \left( \frac{k}{n} + \frac{k^2}{n^2} \right) ,\quad \text{ and } \end{aligned}$$
(48)
$$\begin{aligned}&{\mathsf {E}}\Vert \mathbf {Y}'\mathbf {Z} \Vert ^2 \le C (n(k+m) + km). \end{aligned}$$
(49)

Proof

The inequality (48) follows directly from integrating the high-probability upper bound in [10, Proposition 4].

Let the SVD of \(\mathbf {Y}\) be \(\mathbf {Y}=\mathbf {A}\mathbf {C}\mathbf {B}'\), where \(\mathbf {A}\) and \(\mathbf {B}\) are \(n \times (n\wedge k)\) and \(k \times (n\wedge k)\) matrices with orthonormal columns, distributed according to the Haar measure, and \(\mathbf {C}\) is an \((n\wedge k) \times (n\wedge k)\) diagonal matrix with \(\Vert \mathbf {C}\Vert = \Vert \mathbf {Y}\Vert \). Since \(\mathbf {A}\) and \(\mathbf {Z}\) are independent, \(\mathbf {A}'\mathbf {Z}\) has the same law as an \((n\wedge k) \times m\) matrix with i.i.d. \(N(0,1)\) entries. Therefore

$$\begin{aligned} \mathsf {E} \Vert \mathbf {Y}'\mathbf {Z}\Vert ^2 \le \mathsf {E} \Vert \mathbf {Y}\Vert ^2 \mathsf {E} \Vert \mathbf {A}'\mathbf {Z}\Vert ^2 \le C_0 (n+k) (n\wedge k + m). \end{aligned}$$

Here \(C_0\) is an absolute constant, and the last inequality follows from the Davidson–Szarek theorem [14, Theorem II.7]. Exchanging the roles of \(\mathbf {Y}\) and \(\mathbf {Z}\), we have \(\mathsf {E} \Vert \mathbf {Y}'\mathbf {Z}\Vert ^2 \le C_0 (n+m) (n\wedge m + k)\). Consequently,

$$\begin{aligned} \mathsf {E} \Vert \mathbf {Y}'\mathbf {Z}\Vert ^2 \!\le \! C_0 ((n+k) (n\wedge k \!+\! m) \wedge (n+m) (n\wedge m \!+\! k)) \le 2 C_0 (n(k+m) + km). \end{aligned}$$

This completes the proof. \(\square \)
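A quick Monte Carlo experiment, with illustrative sizes, suggests the orders of magnitude in (48) and (49); the unspecified absolute constant \(C\) is not estimated here.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, m, reps = 300, 5, 8, 500
val48, val49 = [], []
for _ in range(reps):
    Y = rng.standard_normal((n, k))
    Z = rng.standard_normal((n, m))
    val48.append(np.linalg.norm(Y.T @ Y / n - np.eye(k), 2) ** 2)
    val49.append(np.linalg.norm(Y.T @ Z, 2) ** 2)

print("E||Y'Y/n - I_k||^2 ~", np.mean(val48), "   rate k/n + k^2/n^2 =", k / n + (k / n) ** 2)
print("E||Y'Z||^2         ~", np.mean(val49), "   rate n(k+m) + km   =", n * (k + m) + k * m)
```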

In the proofs, the following intermediate matrix

$$\begin{aligned} \mathbf {S}_0 = \frac{1}{n} \mathbf {V}\mathbf {D}\mathbf {U}'\mathbf {U}\mathbf {D}\mathbf {V}'+\mathbf {I}_p \end{aligned}$$
(50)

plays a key role. In particular, the following results on \(\mathbf {S}_0\) will be used repeatedly.

Lemma 6

Suppose \(\lambda _1/\lambda _r\le \tau \) for some constant \(\tau \ge 1\). If \(\Vert \frac{1}{n}\mathbf {U}'\mathbf {U}- \mathbf {I}_r \Vert < 1/\tau \), then

$$\begin{aligned} \sigma _l(\mathbf {S}_0)>1,\quad \forall l\in [r], \quad \text{ and }\quad \sigma _l(\mathbf {S}_0)=1, \quad \forall l>r. \end{aligned}$$
(51)

Moreover, \(\mathbf {V}\mathbf {V}'\) is the projection matrix onto the rank \(r\) principal subspace of \(\mathbf {S}_0\).

Proof

It is straightforward to verify that

$$\begin{aligned} \mathbf {S}_0-\varvec{\Sigma }= \mathbf {V}\mathbf {D}\left( \frac{1}{n}\mathbf {U}'\mathbf {U}-\mathbf {I}_r \right) \mathbf {D}\mathbf {V}'. \end{aligned}$$
(52)

When \(\Vert \frac{1}{n}\mathbf {U}'\mathbf {U}- \mathbf {I}_r \Vert < 1/\tau \), we have \(\Vert \mathbf {S}_0-\varvec{\Sigma } \Vert \le \Vert \mathbf {D} \Vert ^2 \Vert \frac{1}{n}\mathbf {U}'\mathbf {U}- \mathbf {I}_r \Vert < \lambda _1 / \tau \le \lambda _r\). So Weyl’s inequality [18, Theorem 4.3.1] leads to

$$\begin{aligned} \sigma _r(\mathbf {S}_0) \ge \sigma _r(\varvec{\Sigma }) - \Vert \mathbf {S}_0-\varvec{\Sigma } \Vert > \lambda _r+1-\lambda _r =1. \end{aligned}$$

Note that \(\mathbf {S}_0\) always has 1 as its eigenvalue with multiplicity at least \(p-r\). We thus obtain (51).

When (51) holds, (50) shows that the rank \(r\) principal subspace of \(\mathbf {S}_0\) is equal to that of \(\frac{1}{n}\mathbf {V}\mathbf {D}\mathbf {U}'\mathbf {U}\mathbf {D}\mathbf {V}'\). Therefore, the subspace is spanned by the column vectors of \(\mathbf {V}\), and \(\mathbf {V}\mathbf {V}'\) is the projection matrix onto it since \(\mathbf {V}\in O(p,r)\). \(\square \)
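The conclusion of Lemma 6 is easy to visualize numerically: for a well-conditioned spike and \(n\) large enough that \(\Vert \frac{1}{n}\mathbf {U}'\mathbf {U}-\mathbf {I}_r \Vert <1/\tau \), the matrix \(\mathbf {S}_0\) has exactly \(r\) eigenvalues above 1 and its leading eigenspace coincides with \(\mathrm {span}(\mathbf {V})\). A short sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, r = 2000, 30, 3
lam = np.array([5.0, 4.0, 3.0])                  # lambda_1 / lambda_r = 5/3 =: tau
V, _ = np.linalg.qr(rng.standard_normal((p, r)))
D = np.diag(np.sqrt(lam))
U = rng.standard_normal((n, r))
print("||U'U/n - I_r|| =", np.linalg.norm(U.T @ U / n - np.eye(r), 2), "(should be < 1/tau = 0.6)")

S0 = V @ D @ (U.T @ U / n) @ D @ V.T + np.eye(p)
evals, evecs = np.linalg.eigh(S0)
print("top r eigenvalues of S0:", evals[-r:][::-1])          # all > 1
print("next eigenvalue of S0  :", evals[-r - 1])             # equal to 1
P_hat = evecs[:, -r:] @ evecs[:, -r:].T                      # projector onto the top-r eigenspace
print("||P_hat - V V'||       :", np.linalg.norm(P_hat - V @ V.T, 2))
```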

To prove the lower bound for rank detection, we need the following lemma concerning the \(\chi ^2\)-divergence in covariance models. Recall that the \(\chi ^2\)-divergence between two probability measures is defined as

$$\begin{aligned} \chi ^2(P \, || \, Q) \triangleq \int \left( \frac{\mathrm{d}P}{\mathrm{d}Q} - 1 \right) ^2 \mathrm{d}Q. \end{aligned}$$

For a distribution \(F\), we use \(F^{\otimes n}\) to denote the product distribution of \(n\) independent copies of \(F\).

Lemma 7

Let \(\nu \) be a probability distribution on the space of \(p \times p\) symmetric matrices such that \(\mathbf {M}\sim \nu \) satisfies \(\Vert \mathbf {M} \Vert \le 1\) almost surely. Consider the scale mixture distribution \(\mathsf {E}[N(0, \mathbf {I}_p + \mathbf {M})^{\otimes n}] = \int N(0, \mathbf {I}_p + \mathbf {M})^{\otimes n} \nu (\mathrm{d}\mathbf {M})\). Then

$$\begin{aligned} \chi ^2(\mathsf {E}[N(0, \mathbf {I}_p + \mathbf {M})^{\otimes n}] \, || N(0, \mathbf {I}_p)^{\otimes n})+1&= ~ \mathsf {E} \det (\mathbf {I}_p - \mathbf {M}_1 \mathbf {M}_2)^{-\frac{n}{2}} \end{aligned}$$
(53)
$$\begin{aligned}&\ge ~ \mathsf {E} \exp \left( \frac{n}{2}\left\langle \mathbf {M}_1, \mathbf {M}_2 \right\rangle \right) , \end{aligned}$$
(54)

where \(\mathbf {M}_1\) and \(\mathbf {M}_2\) are independently drawn from \(\nu \). Moreover, if \(\mathop {\mathsf{rank}}(\mathbf {M})=1\) a.s., then (54) holds with equality.

Proof

Denote by \(g_{i}\) the probability density function of \(N\left( 0,\varvec{\Sigma }_{i}\right) \) for \(i=0,1\) and 2, respectively. Then it is straightforward to verify that

$$\begin{aligned} \int \frac{g_{1}g_{2}}{g_{0}} =&~ \det (\varvec{\Sigma }_0)^{-1} \left[ \det \left( \varvec{\Sigma }_{0}(\varvec{\Sigma }_{1}+\varvec{\Sigma }_{2}) - \varvec{\Sigma }_1\varvec{\Sigma }_2\right) \right] ^{-\frac{1}{2}} \end{aligned}$$
(55)

if \( \varvec{\Sigma }_{0}(\varvec{\Sigma }_{1}+\varvec{\Sigma }_{2}) \ge \varvec{\Sigma }_1\varvec{\Sigma }_2\); otherwise, the integral on the left-hand side of (55) is infinite. Applying (55) to \(\varvec{\Sigma }_0=\mathbf {I}_p\) and \(\varvec{\Sigma }_i=\mathbf {I}_p+\mathbf {M}_i\) and using Fubini’s theorem, we have

$$\begin{aligned}&\chi ^2( \mathsf {E}[N(0, \mathbf {I}_p + \mathbf {M})^{\otimes n}] \, || N(0, \mathbf {I}_p)^{\otimes n}) + 1 \nonumber \\&\quad = ~ \mathsf {E} \det (\mathbf {I}_p - \mathbf {M}_1 \mathbf {M}_2)^{-\frac{n}{2}} \end{aligned}$$
(56)
$$\begin{aligned}&\quad = ~ \mathsf {E} \exp \left( -\frac{n}{2} \log \det (\mathbf {I}_p - \mathbf {M}_1 \mathbf {M}_2) \right) \end{aligned}$$
(57)
$$\begin{aligned}&\quad \ge ~ \mathsf {E} \exp \left( \frac{n}{2} \mathop {\mathsf{Tr}}(\mathbf {M}_1 \mathbf {M}_2) \right) , \end{aligned}$$
(58)

where (56) is due to \(\Vert \mathbf {M}_1 \mathbf {M}_2 \Vert \le \Vert \mathbf {M}_1 \Vert \Vert \mathbf {M}_2 \Vert \le 1\) a.s., and (58) is due to \(\log \det (\mathbf {I}+\mathbf {A}) \le \mathop {\mathsf{Tr}}(\mathbf {A})\), with equality if and only if \(\mathop {\mathsf{rank}}(\mathbf {A})=1\). \(\square \)
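The Gaussian integral identity (55) underlying the proof can be checked numerically. The sketch below takes \(\varvec{\Sigma }_0=\mathbf {I}_p\) and two small illustrative perturbations, and compares a plain Monte Carlo estimate of \(\int g_1g_2/g_0\) with the closed form \(\det (\mathbf {I}-\mathbf {M}_1\mathbf {M}_2)^{-1/2}\).

```python
import numpy as np

rng = np.random.default_rng(4)
p = 2
M1 = 0.3 * np.array([[1.0, 0.2], [0.2, 0.5]])
M2 = 0.2 * np.array([[0.8, -0.1], [-0.1, 1.0]])
S1, S2 = np.eye(p) + M1, np.eye(p) + M2

def gauss_pdf(x, S):
    """Density of N(0, S) evaluated at the rows of x."""
    q = np.einsum('ni,ij,nj->n', x, np.linalg.inv(S), x)
    return np.exp(-0.5 * q) / np.sqrt((2 * np.pi) ** p * np.linalg.det(S))

# Right-hand side of (55) with Sigma_0 = I_p:  det(S1 + S2 - S1 S2)^{-1/2} = det(I - M1 M2)^{-1/2}
rhs = np.linalg.det(np.eye(p) - M1 @ M2) ** (-0.5)

# Left-hand side: integral of g1 g2 / g0, estimated by sampling from g0 = N(0, I_p)
x = rng.standard_normal((200000, p))
lhs = np.mean(gauss_pdf(x, S1) * gauss_pdf(x, S2) / gauss_pdf(x, np.eye(p)) ** 2)
print("Monte Carlo LHS:", lhs, "   closed-form RHS:", rhs)
```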

7.2 Proofs of main results

7.2.1 Proofs of the upper bounds

Proof of Theorem 1

Let \(\gamma _1 \ge 10\) be a constant. Denote by \(E\) the event that (40) holds, which, in particular, contains the event \(\{{\mathbb {B}}_k \ne \emptyset \}\). By triangle inequality,

$$\begin{aligned} \mathsf {E} \Vert {\mathbf {S}_{\widehat{A}} - \varvec{\Sigma }} \Vert ^2 \mathbf 1_E \le&~ 2 \, \mathsf {E} \Vert {\mathbf {S}_{\widehat{A}} - \mathbf {S}_A} \Vert ^2 \mathbf 1_E + 2 \, \mathsf {E} \Vert {\mathbf {S}_A - \varvec{\Sigma }} \Vert ^2 \mathbf 1_E \nonumber \\ \lesssim&~ 2 (\lambda _1 +1) \eta ^2(k,n,p,\gamma _1) + 2 \, \mathsf {E} \Vert {\mathbf {S}_A - \varvec{\Sigma }} \Vert ^2 \nonumber \\ \asymp&~ (\lambda _1 + 1) \frac{k}{n} \log \frac{\mathrm{e}p}{k} + \mathsf {E} \Vert {\mathbf {S}_A - \varvec{\Sigma }} \Vert ^2 , \end{aligned}$$
(59)

where the last step holds because, in view of the definition of \(\eta \) in (12), we have \(\eta ^2(k,n,p,\gamma _1) \asymp \frac{k}{n} \log \frac{\mathrm{e}p}{k}\) whenever \(\frac{k}{n}\log \frac{\mathrm{e}p}{k}\le c_0\).

To bound the second term in (59), define \(\mathbf {J}_B\) as the diagonal matrix given by

$$\begin{aligned} (\mathbf {J}_B)_{ii} = {\mathbf {1}_{\left\{ {i\in B}\right\} }}. \end{aligned}$$
(60)

Then, for \(\mathbf {S}_0\) in (50),

$$\begin{aligned} \mathbf {S}_A =&~ \mathbf {J}_A (\mathbf {S}- \mathbf {I}) \mathbf {J}_A + \mathbf {I}= \mathbf {J}_A \mathbf {S}\mathbf {J}_A - \mathbf {I}_{AA} + \mathbf {I}\nonumber \\ =&~ \mathbf {S}_0 + \frac{1}{n} \mathbf {V}\mathbf {D}\mathbf {U}' \mathbf {Z}\mathbf {J}_A + \frac{1}{n} \mathbf {J}_A \mathbf {Z}' \mathbf {U}\mathbf {D}\mathbf {V}' + \mathbf {J}_A \left( \frac{1}{n} \mathbf {Z}'\mathbf {Z}-\mathbf {I} \right) \mathbf {J}_A. \end{aligned}$$
(61)

Therefore

$$\begin{aligned} \Vert \mathbf {S}_A - \varvec{\Sigma }\Vert \le&~ \Vert \mathbf {S}_0 - \varvec{\Sigma }\Vert + \frac{2}{n} \Vert {\mathbf {V}\mathbf {D}\mathbf {U}' \mathbf {Z}\mathbf {J}_A} \Vert + \left\| \mathbf {J}_A \left( \frac{1}{n} \mathbf {Z}'\mathbf {Z}-\mathbf {I} \right) \mathbf {J}_A \right\| . \end{aligned}$$
(62)

In view of (52) and Lemma 5, we have

$$\begin{aligned} {\mathsf {E}}\Vert \mathbf {S}_0 - \varvec{\Sigma }\Vert ^2 \le \Vert {\mathbf {D}} \Vert ^4 \, {\mathsf {E}}\Vert \frac{1}{n}\mathbf {U}'\mathbf {U}-\mathbf {I}_r \Vert ^2 \lesssim \lambda _1^2 \left( \frac{r}{n} + \frac{r^2}{n^2} \right) \asymp \frac{\lambda _1^2 r}{n}, \end{aligned}$$
(63)

where the last step is due to the fact that \(r\le k\le n\), which follows from the assumption \(\frac{k}{n}\log \frac{\mathrm{e}p}{k}\le c_0\) (so that \(n \ge \frac{1}{c_0}\, k \log \frac{\mathrm{e}p}{k} \ge k\)). Similarly,

$$\begin{aligned} {\mathsf {E}}\left\| \mathbf {J}_A \left( \frac{1}{n} \mathbf {Z}'\mathbf {Z}-\mathbf {I} \right) \mathbf {J}_A \right\| ^2 \lesssim \frac{k}{n}. \end{aligned}$$
(64)

Again by (49) in Lemma 5,

$$\begin{aligned} \frac{1}{n^2} {\mathsf {E}}\Vert {\mathbf {V}\mathbf {D}\mathbf {U}' \mathbf {Z}\mathbf {J}_A} \Vert ^2 \le \frac{\lambda _1}{n^2} {\mathsf {E}}\Vert {\mathbf {U}' \mathbf {Z}\mathbf {J}_A} \Vert ^2 \lesssim \frac{\lambda _1 (n(k+r)+kr)}{n^2} \asymp \frac{\lambda _1 k}{n} . \end{aligned}$$
(65)

Assembling (62)–(65), we have

$$\begin{aligned} {\mathsf {E}}\Vert \mathbf {S}_A - \varvec{\Sigma }\Vert ^2 \lesssim \frac{\lambda _1^2 r}{n} + \frac{(\lambda _1+1) k}{n} . \end{aligned}$$
(66)

Combining (59) and (66) yields

$$\begin{aligned} \mathsf {E} \Vert {\mathbf {S}_{\widehat{A}} - \varvec{\Sigma }} \Vert ^2 \mathbf 1_E \lesssim (\lambda _1+1) \frac{k}{n} \log \frac{\mathrm{e}p}{k} + \frac{\lambda _1^2 r}{n}. \end{aligned}$$
(67)

Next we control the estimation error on the complement of the event \(E\). First note that \(\mathbf {S}_{\widehat{A}}-\varvec{\Sigma }= \begin{bmatrix} \mathbf {S}_{\widehat{A}\widehat{A}}&{\varvec{0}}\\ {\varvec{0}}&{\mathbf {I}}_{{\widehat{A}^{\mathrm{c}}}{\widehat{A}^{\mathrm{c}}}} \end{bmatrix} - \varvec{\Sigma }\). Then \(\Vert \mathbf {S}_{\widehat{A}}- \varvec{\Sigma }\Vert \le \Vert \mathbf {S}\Vert + \Vert \varvec{\Sigma }\Vert + 1 \le \Vert {\varvec{\Sigma }} \Vert (\Vert {\mathbf {W}_p} \Vert + 1) + 1\), where \(\mathbf {W}_p\) is equal to \(\frac{1}{n}\) times a \(p \times p\) Wishart matrix with \(n\) degrees of freedom. Also, \(\Vert {\varvec{\Sigma }-\mathbf {I}} \Vert = \lambda _1\). In view of (14), we have \(\Vert {\widehat{\varvec{\Sigma }}-\varvec{\Sigma }} \Vert \le (1+\lambda _1)(\Vert {\mathbf {W}_p} \Vert + 2)\). Using the Cauchy–Schwarz inequality, we have

$$\begin{aligned} \mathsf {E} \Vert {\widehat{\varvec{\Sigma }}- \varvec{\Sigma }} \Vert ^2 \mathbf 1_{{E^{\mathrm{c}}}} \le&~ (1+\lambda _1)^2 \mathsf {E} (\Vert {\mathbf {W}_p} \Vert + 2)^2 \mathbf 1_{{E^{\mathrm{c}}}} \\ \le&~ (1+\lambda _1)^2 \sqrt{\mathsf {E} (\Vert {\mathbf {W}_p} \Vert + 2)^4 } \sqrt{{ \mathsf {P}\left\{ \mathbf 1_{{E^{\mathrm{c}}}} \right\} }}. \end{aligned}$$

Note that \(\Vert \mathbf {W}_p\Vert {\mathop {=}\limits ^\mathrm{(d)}}\frac{1}{n} \sigma _1^2(\mathbf {Z})\). By [14, Theorem II.7], \({ \mathsf {P}\left\{ \frac{1}{\sqrt{n}}\sigma _1(\mathbf {Z}) \ge 1 + \sqrt{\frac{p}{n}} + t \right\} } \le \exp (-nt^2/2)\). Then \(\mathsf {E} \Vert {\mathbf {W}_p} \Vert ^4 \lesssim \frac{1}{n^4} (1 + \frac{p^4}{n^4})\). By Lemma 4, we have \({ \mathsf {P}\left\{ {E^{\mathrm{c}}} \right\} } \le 12(\mathrm{e}p)^{1-\gamma _1/2}\). Therefore

$$\begin{aligned} \mathsf {E} \Vert {\widehat{\varvec{\Sigma }}- \varvec{\Sigma }} \Vert ^2 \mathbf 1_{{E^{\mathrm{c}}}} \lesssim (1+\lambda _1)^2 \frac{1}{n^2} \left( 1 + \frac{p^2}{n^2} \right) p^{\frac{1}{2}-\frac{\gamma _1}{4}}. \end{aligned}$$
(68)

Choose a fixed \(\gamma _1 \ge 10\). Assembling (67) and (68), we have

$$\begin{aligned} \mathsf {E} \Vert {\widehat{\varvec{\Sigma }}- \varvec{\Sigma }} \Vert ^2 =&~ \mathsf {E} \Vert {\mathbf {S}_{\widehat{A}} - \varvec{\Sigma }} \Vert ^2 \mathbf 1_E + \mathsf {E} \Vert {\mathbf {S}_{\widehat{A}} - \varvec{\Sigma }} \Vert ^2 \mathbf 1_{{E^{\mathrm{c}}}} \\ \lesssim&~ (1+\lambda _1) \frac{k}{n} \log \frac{\mathrm{e}p}{k} + \frac{\lambda _1^2 r}{n} + (1+\lambda _1)^2 \frac{1}{n^2} \asymp (1+\lambda _1) \frac{k}{n} \log \frac{\mathrm{e}p}{k} \!+\! \frac{\lambda _1^2 r}{n}. \end{aligned}$$

\(\square \)
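The oracle term \(\mathsf {E}\Vert \mathbf {S}_A-\varvec{\Sigma }\Vert ^2\) bounded in (66) is easy to probe by simulation. The following sketch (illustrative sizes and spikes, chosen by us) forms \(\mathbf {S}_A\) by keeping the \(A\times A\) block of \(\mathbf {S}\) and the identity elsewhere, as in the proof above, and compares the empirical risk with the rate on the right-hand side of (66) up to constants.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, k, r = 400, 200, 10, 2
lam = np.array([8.0, 4.0])

V = np.zeros((p, r))
V[:k, :] = np.linalg.qr(rng.standard_normal((k, r)))[0]    # V supported on A = {0, ..., k-1}
Sigma = V @ np.diag(lam) @ V.T + np.eye(p)
A = np.arange(k)

errs = []
for _ in range(200):
    X = rng.standard_normal((n, r)) @ np.diag(np.sqrt(lam)) @ V.T + rng.standard_normal((n, p))
    S = X.T @ X / n
    S_A = np.eye(p)
    S_A[np.ix_(A, A)] = S[np.ix_(A, A)]                    # keep the A-block of S, identity elsewhere
    errs.append(np.linalg.norm(S_A - Sigma, 2) ** 2)

rate = lam[0] ** 2 * r / n + (lam[0] + 1) * k / n          # right-hand side of (66), up to constants
print("E||S_A - Sigma||^2 ~", np.mean(errs), "   rate =", rate)
```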

Proof of Theorem 2

To prove the theorem, it suffices to show that with the desired probability,

$$\begin{aligned} \sigma _r(\mathbf {S}_{\widehat{A}}) > 1 + \gamma _2\,\sqrt{\Vert \mathbf {S}_{\widehat{A}} \Vert }\,\eta (k,n,p,\gamma _1) > \sigma _{r+1}(\mathbf {S}_{\widehat{A}}). \end{aligned}$$
(69)

Recall (50) and (52). By [10, Proposition 4], with probability at least \(1-2(\mathrm{e}p)^{-\gamma _1/2}\), \(\Vert \frac{1}{n}\mathbf {U}'\mathbf {U}-\mathbf {I}_r \Vert \le \eta (k,n,p,\gamma _1)\). Under (18), \(\eta (k,n,p,\gamma _1)\asymp \sqrt{\frac{k}{n}\log \frac{\mathrm{e}p}{k}}\), and so \(\eta (k,n,p,\gamma _1)\le 1/(2\tau )\) when \(c_0\) is sufficiently small. Thus,

$$\begin{aligned} \Vert \mathbf {S}_0 - \varvec{\Sigma } \Vert \le \Vert \mathbf {D} \Vert ^2 \Vert \frac{1}{n}\mathbf {U}'\mathbf {U}-\mathbf {I}_r \Vert \le \lambda _r/2. \end{aligned}$$
(70)

Therefore, Lemma 6 leads to (51). Moreover, Weyl’s inequality leads to

$$\begin{aligned} \sigma _r(\mathbf {S}_0) \ge \sigma _r(\varvec{\Sigma }) - |\sigma _r(\mathbf {S}_0) - \sigma _r(\varvec{\Sigma })| \ge \lambda _r + 1- \lambda _1\eta (k,n,p,\gamma _1)\ge \frac{1}{2}\lambda _r + 1 > 1. \end{aligned}$$
(71)

Next, we consider \(\mathbf {S}_A - \mathbf {S}_0\). By (61), we have

$$\begin{aligned} \Vert \mathbf {S}_A - \mathbf {S}_0\Vert \le \frac{2}{n} \Vert {\mathbf {V}\mathbf {D}\mathbf {U}' \mathbf {Z}\mathbf {J}_A} \Vert + \frac{1}{n} \Vert {\mathbf {J}_A (\mathbf {Z}'\mathbf {Z}-\mathbf {I})\mathbf {J}_A} \Vert . \end{aligned}$$

By [10, Proposition 4] and (18), when \(c_0\) is sufficiently small, with probability at least \(1-(\mathrm{e}p + 1)(\mathrm{e}p)^{-\gamma _1/2}\),

$$\begin{aligned} \Vert {\mathbf {V}\mathbf {D}\mathbf {U}' \mathbf {Z}\mathbf {J}_A} \Vert&\le \Vert \mathbf {D} \Vert \Vert \mathbf {U}'\mathbf {Z}\mathbf {J}_A \Vert \\&\le \sqrt{\lambda _1}\, n\sqrt{1+\frac{8\gamma _1\log p}{3n}} \Big (\sqrt{\frac{r}{n}}+\sqrt{\frac{k}{n}} + \sqrt{\frac{\gamma _1\log (\mathrm{e}p)}{n}}\Big )\\&\le n \sqrt{\lambda _1}\, \eta (k,n,p,\gamma _1). \end{aligned}$$

Moreover, [10, Proposition 3] implies that with probability at least \(1-2 (\mathrm{e}p)^{-\gamma _1/2}\),

$$\begin{aligned} \frac{1}{n} \Vert {\mathbf {J}_A (\mathbf {Z}'\mathbf {Z}-\mathbf {I})\mathbf {J}_A} \Vert \le \eta (k,n,p,\gamma _1). \end{aligned}$$

Assembling the last three displays, we obtain that with probability at least \(1-(\mathrm{e}p+3) (\mathrm{e}p)^{-\gamma _1/2}\),

$$\begin{aligned} \Vert \mathbf {S}_A - \mathbf {S}_0 \Vert \le (2\sqrt{\lambda _1}+1)\,\eta (k,n,p,\gamma _1). \end{aligned}$$
(72)

Last but not least, Lemma 4 implies that, under the condition of Theorem 2, (40) holds with probability at least \(1-12(\mathrm{e}p)^{1-\gamma _1/2}\). Together with (72), this implies that

$$\begin{aligned} \Vert \mathbf {S}_{\widehat{A}}-\mathbf {S}_0 \Vert \le [C_0(\gamma _1) +3 ] \sqrt{\lambda _1+1}\,\eta (k,n,p,\gamma _1), \end{aligned}$$
(73)

where \(C_0(\gamma _1) = 14 + 4\sqrt{\gamma _1}\). By (18), we can further upper bound the right-hand side by \(\lambda _1/4 \vee 1/2\). When \(\lambda _1 > 1\), we have \(\sqrt{\lambda _1+1}\le \sqrt{2\lambda _1} < \sqrt{2}\lambda _1\), so for sufficiently small \(c_0\), (18) implies that the right side of (73) is further bounded by \(\lambda _1/4\). When \(\lambda _1\le 1\), we have \(\sqrt{\lambda _1+1}\le \sqrt{2}\), and so the right side of (73) is further bounded by 1/2 for sufficiently small \(c_0\).

Thus, the last display, together with (70), implies

$$\begin{aligned} \Vert \mathbf {S}_{\widehat{A}}- \varvec{\Sigma } \Vert&\le \lambda _1 \eta (k,n,p,\gamma _1) + [C_0(\gamma _1) +3 ]\sqrt{\lambda _1+1}\,\eta (k,n,p,\gamma _1) \nonumber \\&< \frac{1}{2}(\lambda _1+1). \end{aligned}$$
(74)

Here, the last inequality comes from the above discussion, and the fact that \(\eta (k,n,p,\gamma _1) < 1/4\) for small \(c_0\). The triangle inequality further leads to

$$\begin{aligned} \frac{1}{2}(\lambda _1+1) < \Vert \varvec{\Sigma } \Vert \!-\! \Vert \mathbf {S}_{\widehat{A}}- \varvec{\Sigma } \Vert \le \Vert \mathbf {S}_{\widehat{A}} \Vert \le \Vert \varvec{\Sigma } \Vert \!+\! \Vert \mathbf {S}_{\widehat{A}}- \varvec{\Sigma } \Vert < \frac{3}{2}(\lambda _1+1).\qquad \end{aligned}$$
(75)

Set \(\gamma _2 \ge 2[C_0(\gamma _1)+3] = 8\sqrt{\gamma _1}+34\). Then (51), (73) and (75) jointly imply that, with probability at least \(1-(13\mathrm{e}p+5)(\mathrm{e}p)^{-\gamma _1/2}\), the second inequality in (69) holds. Moreover, (74) and the triangle inequality imply that, with the same probability,

$$\begin{aligned} \sigma _r(\mathbf {S}_{\widehat{A}})&\ge \sigma _r(\varvec{\Sigma }) - \Vert \mathbf {S}_{\widehat{A}}-\varvec{\Sigma } \Vert \\&\ge 1 + \lambda _r - \lambda _1 \eta (k,n,p,\gamma _1) - [C_0(\gamma _1) +3 ]\sqrt{\lambda _1+1}\,\eta (k,n,p,\gamma _1)\\&\ge 1 + \lambda _r\Big [1-\tau \eta (k,n,p,\gamma _1) - \frac{\sqrt{\lambda _1+1}}{2\lambda _1}\gamma _2 \tau \eta (k,n,p,\gamma _1)\Big ]\\&\ge 1 + \frac{3}{2}\gamma _2 (\lambda _1+1)\eta (k,n,p,\gamma _1). \end{aligned}$$

Here, the last inequality holds when \(\lambda _r \ge \beta \sqrt{\frac{k}{n}\log \frac{\mathrm{e}p}{k}}\) for a sufficiently large \(\beta \) which depends only on \(\gamma _1,\,\gamma _2\) and \(\tau \). In view of (75), the last display implies the first inequality of (69). This completes the proof of the upper bound. \(\square \)
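For concreteness, the rank estimator analyzed through (69) amounts to counting the eigenvalues of \(\mathbf {S}_{\widehat{A}}\) that exceed the data-driven threshold \(1+\gamma _2\sqrt{\Vert \mathbf {S}_{\widehat{A}} \Vert }\,\eta (k,n,p,\gamma _1)\). A schematic Python implementation is sketched below; the function \(\eta \), defined in (12) of the paper, is passed in as a precomputed number, and the function name is ours, not the paper's.

```python
import numpy as np

def detect_rank(S_Ahat: np.ndarray, gamma2: float, eta: float) -> int:
    """Count eigenvalues of the symmetric matrix S_Ahat above the threshold in (69)."""
    evals = np.linalg.eigvalsh(S_Ahat)
    threshold = 1.0 + gamma2 * np.sqrt(evals.max()) * eta   # ||S_Ahat|| is the top eigenvalue (PSD case)
    return int(np.sum(evals > threshold))
```

In the regime of Theorem 2, the first inequality in (69) guarantees that all \(r\) spiked eigenvalues clear this threshold, while the second guarantees that none of the remaining eigenvalues does.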

Proof of Theorem 3

Let \(E\) be the event on which the conclusions of Lemma 3, Lemma 4, the upper bound in Theorem 2, and (71) all hold. Then \({\mathsf {P}}({E^{\mathrm{c}}}) \le C(\mathrm{e}p)^{1-\gamma _1/2}\).

On the event \(E,\,\widehat{\varvec{\Sigma }} = \mathbf {S}_{\widehat{A}}\). Moreover, Lemma 6 shows that \(\mathbf {V}\mathbf {V}'\) is the projection matrix onto the principal subspace of \(\mathbf {S}_0\), and Theorem 2 ensures \(\widehat{\mathbf {V}}\in {\mathbb {R}}^{p\times r}\). Thus, Proposition 1 leads to

$$\begin{aligned} \begin{aligned} \Vert {\widehat{\varvec{\Sigma }}-\mathbf {S}_0} \Vert {\mathbf {1}_{\left\{ {E}\right\} }}&\ge \frac{1}{2} (\sigma _r(\mathbf {S}_0)- \sigma _{r+1}(\mathbf {S}_0)) \Vert {\widehat{\mathbf {V}}\widehat{\mathbf {V}}'-\mathbf {V}\mathbf {V}'} \Vert {\mathbf {1}_{\left\{ {E}\right\} }}\\&\ge \frac{\lambda _r}{4}\Vert {\widehat{\mathbf {V}}\widehat{\mathbf {V}}'-\mathbf {V}\mathbf {V}'} \Vert {\mathbf {1}_{\left\{ {E}\right\} }}, \end{aligned} \end{aligned}$$

and so

$$\begin{aligned} {\mathsf {E}}\Vert {\widehat{\mathbf {V}}\widehat{\mathbf {V}}'-\mathbf {V}\mathbf {V}'} \Vert ^2{\mathbf {1}_{\left\{ {E}\right\} }} \le \frac{16}{\lambda _r^2}\,{\mathsf {E}}\Vert {\widehat{\varvec{\Sigma }}-\mathbf {S}_0} \Vert ^2{\mathbf {1}_{\left\{ {E}\right\} }}. \end{aligned}$$

To further bound the right-hand side of the last display, we apply (61), (64), (65), Lemma 3 and Lemma 4 to obtain

$$\begin{aligned} \begin{aligned} {\mathsf {E}}\Vert {\widehat{\varvec{\Sigma }}-\mathbf {S}_0} \Vert ^2{\mathbf {1}_{\left\{ {E}\right\} }}&\le {\mathsf {E}}\Vert {\mathbf {S}_{\widehat{A}}-\mathbf {S}_0} \Vert ^2{\mathbf {1}_{\left\{ {E}\right\} }} \\&\lesssim {\mathsf {E}}\Vert {\mathbf {S}_{\widehat{A}}-\mathbf {S}_A} \Vert ^2{\mathbf {1}_{\left\{ {E}\right\} }} + {\mathsf {E}}\Vert {\mathbf {S}_A-\mathbf {S}_0} \Vert ^2 \lesssim (\lambda _1+1)\frac{k}{n}\log \frac{\mathrm{e}p}{k}. \end{aligned} \end{aligned}$$

Together with the second last display, this implies

$$\begin{aligned} {\mathsf {E}}\Vert {\widehat{\mathbf {V}}\widehat{\mathbf {V}}'-\mathbf {V}\mathbf {V}'} \Vert ^2{\mathbf {1}_{\left\{ {E}\right\} }} \le C\tau ^2 \frac{k(\lambda _1+1)}{n \lambda _1^2}\log \frac{\mathrm{e}p}{k}. \end{aligned}$$
(76)

Now consider the event \({E^{\mathrm{c}}}\). Note that \(\Vert \widehat{\mathbf {V}}\widehat{\mathbf {V}}' - \mathbf {V}\mathbf {V}' \Vert \le 1\) always holds. Thus,

$$\begin{aligned} {\mathsf {E}}\Vert {\widehat{\mathbf {V}}\widehat{\mathbf {V}}'-\mathbf {V}\mathbf {V}'} \Vert ^2{\mathbf {1}_{\left\{ {{E^{\mathrm{c}}}}\right\} }} \le {\mathsf {P}}({E^{\mathrm{c}}}) \le C(\mathrm{e}p)^{1-\gamma _1/2} \lesssim \frac{\lambda _1+1}{n\lambda _1^2}, \end{aligned}$$
(77)

where the last inequality holds under condition (20) for all \(\gamma _1 \ge (1+\frac{2}{M_0})M_1\). Assembling (76) and (77), we obtain the upper bounds. \(\square \)

7.2.2 Proofs of the lower bounds

Proof of Theorem 4

\(1^{\circ }\) The minimax lower bound for estimating \(\mathrm {span}(\mathbf {V})\) follows straightforwardly from previous results on estimating the leading singular vector, i.e., the rank-one case (see, e.g., [5, 39]). The desired lower bound (25) can be found in [10, Eq. (58)] in the proof of [10, Theorem 3].

\(2^{\circ }\) Next we establish the minimax lower bound for estimating the spiked covariance matrix \(\varvec{\Sigma }\) under the spectral norm, which is considerably more involved. Let \(\varTheta _1 = \varTheta _1(k,p,r,\lambda )\). In view of the fact that \(a \wedge (b+c) \le a \wedge b + a \wedge c \le 2 (a \wedge (b+c))\) for all \(a,b,c \ge 0\), it is convenient to prove the following equivalent lower bound

$$\begin{aligned} \inf _{\widehat{\varvec{\Sigma }}} \sup _{\varvec{\Sigma }\in \varTheta } {\mathsf {E}}\Vert \widehat{\varvec{\Sigma }}-\varvec{\Sigma }\Vert ^2 \gtrsim \lambda ^2 \wedge \left( \frac{(\lambda +1) k}{n} \log \frac{e p}{k} \right) + \lambda ^2 \left( 1 \wedge \frac{r}{n} \right) . \end{aligned}$$
(78)

To this end, we show that the minimax risk is lower bounded by the two terms on the right-hand side of (78) separately. In fact, the first term is the minimax rate in the rank-one case and the second term is the rate of the oracle risk when the estimator knows the true support of \(\mathbf {V}\).

\(2.1^{\circ }\) Consider the following rank-one subset of the parameter space

$$\begin{aligned} \varTheta '=\{\varvec{\Sigma }= \mathbf {I}_p + \lambda \mathbf {v}\mathbf {v}': \mathbf {v}\in B_0(k) \cap {\mathbb {S}}^{p-1} \} \subset \varTheta _1. \end{aligned}$$

Then \(\sigma _1(\varvec{\Sigma })=1+\lambda \) and \(\sigma _2(\varvec{\Sigma })=1\). For any estimator \(\widehat{\varvec{\Sigma }}\), denote by \(\widehat{\mathbf {V}}\) its leading singular vector. Applying Proposition 1 yields

$$\begin{aligned} \Vert {\widehat{\varvec{\Sigma }}-\varvec{\Sigma }} \Vert \ge \frac{\lambda }{2} \Vert {\widehat{\mathbf {V}}\widehat{\mathbf {V}}'-\mathbf {v}\mathbf {v}'} \Vert . \end{aligned}$$

Then

$$\begin{aligned} \inf _{\widehat{\varvec{\Sigma }}} \sup _{\varvec{\Sigma }\in \varTheta '} {\mathsf {E}}\Vert \widehat{\varvec{\Sigma }}-\varvec{\Sigma }\Vert ^2 \ge&~ \frac{\lambda ^2}{4} \inf _{\widehat{\mathbf {V}}} \sup _{\mathbf {v}\in B_0(k) \cap {\mathbb {S}}^{p-1}} {\mathsf {E}}\Vert {\widehat{\mathbf {V}}\widehat{\mathbf {V}}'-\mathbf {v}\mathbf {v}'} \Vert ^2 \nonumber \\ \ge&~ \frac{\lambda ^2}{8} \inf _{\widehat{\mathbf {V}}} \sup _{\mathbf {v}\in B_0(k) \cap {\mathbb {S}}^{p-1}} {\mathsf {E}}\Vert \widehat{\mathbf {V}}\widehat{\mathbf {V}}'-\mathbf {v}\mathbf {v}'\Vert _\mathrm{F}^2 \end{aligned}$$
(79)
$$\begin{aligned} \gtrsim&~ \lambda ^2 \left( 1 \wedge \frac{(\lambda +1) k}{n \lambda ^2} \log \frac{e p}{k} \right) , \end{aligned}$$
(80)

where (79) follows from \(\mathop {\mathsf{rank}}(\widehat{\mathbf {V}}\widehat{\mathbf {V}}'-\mathbf {v}\mathbf {v}') \le 2\) and (80) follows from the minimax lower bound in [39, Theorem 2.1] (see also [5, Theorem 2]) for estimating the leading singular vector.

\(2.2^{\circ }\) To prove the lower bound \(\lambda ^2 \left( 1 \wedge \frac{r}{n} \right) \), consider the following (composite) hypothesis testing problem:

$$\begin{aligned} H_{0}: \varvec{\Sigma }\! =\! \begin{bmatrix} \left( 1+\frac{\lambda }{2} \right) \mathbf {I}_r&{\varvec{0}}\\ {\varvec{0}}&\mathbf {I}_{p-r} \end{bmatrix}, ~\text{ vs. }~ H_{1}: \varvec{\Sigma }\!\!=\!\! \begin{bmatrix} \left( 1+\frac{\lambda }{2} \right) (\mathbf {I}_r + \rho \mathbf {v}\mathbf {v}')&{\varvec{0}}\\ {\varvec{0}}&\mathbf {I}_{p-r} \end{bmatrix}, \mathbf {v}\in {\mathbb {S}}^{r-1}, \nonumber \\ \end{aligned}$$
(81)

where \(\rho = \frac{b \lambda }{\lambda +2} (1 \wedge \sqrt{\frac{r}{n}})\) with a sufficiently small absolute constant \(b>0\). Since \(r \in [k]\) and \(\rho (1+\frac{\lambda }{2}) \le \lambda \), both the null and the alternative hypotheses belong to the parameter set \(\varTheta _1\) defined in (4). Following Le Cam’s two-point argument [38, Section 2.3], next we show that the minimal sum of Type-I and Type-II error probabilities of testing (81) is non-vanishing. Since any pair of covariance matrices in \(H_0\) and \(H_1\) differ in operator norm by at least \(\rho (1+\frac{\lambda }{2}) = \frac{b \lambda }{2} (1 \wedge \sqrt{\frac{r}{n}})\), we obtain a lower bound of rate \(\lambda ^2 (1 \wedge \frac{r}{n})\).

To this end, let \(\mathbf {X}\) consist of \(n\) iid rows drawn from \(N(0,\varvec{\Sigma })\), where \(\varvec{\Sigma }\) is either from \(H_0\) or \(H_1\). Since under both the null and the alternative, the last \(p-r\) columns of \(\mathbf {X}\) are standard normal and independent of the first \(r\) columns, we conclude that the first \(r\) columns form a sufficient statistic. Therefore the minimal Type-I+II error probability testing (81), denoted by \(\epsilon _n\), is equal to that of the following testing problem of dimension \(r\) and sample size \(n\):

$$\begin{aligned} H_{0}: \varvec{\Sigma }= \mathbf {I}_r, \quad \text{ versus } \quad H_{1}: \varvec{\Sigma }= \mathbf {I}_r + \rho \mathbf {v}\mathbf {v}', \mathbf {v}\in {\mathbb {S}}^{r-1}. \end{aligned}$$
(82)

Recall that the minimax risk is lower bounded by the Bayesian risk. For any random vector \(\mathbf {u}\) taking values in \({\mathbb {S}}^{r-1}\), denote by \(\mathsf {E}[N(0, \mathbf {I}_r + \rho \mathbf {u}\mathbf {u}')^{\otimes n}]\) the mixture alternative distribution with a prior equal to the distribution of \(\mathbf {u}\). Applying [38, Theorem 2.2 (iii)] we obtain the following lower bound in terms of the \(\chi ^2\)-divergence from the mixture alternative to the null:

$$\begin{aligned} \epsilon _n \ge&~ 1 - \left( \chi ^2\left( \mathsf {E}[N(0,\mathbf {I}_r+\rho \mathbf {u}\mathbf {u}')^{\otimes n}] \, || \, N(0,\mathbf {I}_r)^{\otimes n} \right) /2 \right) ^{1/2} \end{aligned}$$
(83)

Consider the unit random vector \(\mathbf {u}\) with iid coordinates taking values in \(\frac{1}{\sqrt{r}}\{\pm 1\}\) uniformly. Since \(\rho \le 1\), applying the equality case of Lemma 7 yields

$$\begin{aligned} 1+ \chi ^2\left( \mathsf {E}[N(0,\mathbf {I}_r+\rho \mathbf {u}\mathbf {u}')^{\otimes n}] \, || \, N(0,\mathbf {I}_r)^{\otimes n} \right) =&~ \mathsf {E} \exp \left( n \rho ^2 \langle \mathbf {u}, {\tilde{\mathbf {u}}} \rangle ^2 \right) \\ =&~ \mathsf {E} \exp \left( \frac{n \rho ^2}{r^2} G_r^2 \right) , \end{aligned}$$

where in the last step we recall that \(G_r\) is the symmetric random walk on \({\mathbb {Z}}\) at the \(r\)th step defined in (34), and we use that \(\langle \mathbf {u}, {\tilde{\mathbf {u}}} \rangle \) is distributed as \(G_r/r\). Since \(n \rho ^2/r \le b^2\), choosing \(b^2 \le \frac{1}{36}\) as a fixed constant and applying Lemma 1 with \(p=k=r\) (the non-sparse case), we conclude that

$$\begin{aligned} \chi ^2\left( \mathsf {E}[N(0,\mathbf {I}_r+\rho \mathbf {u}\mathbf {u}')^{\otimes n}] \, || \, N(0,\mathbf {I}_r)^{\otimes n} \right) \le g(b^2) - 1, \end{aligned}$$
(84)

where \(g\) is the function given in Lemma 1 and satisfies \(g>1\). Combining (83) and (84), we obtain the following lower bound for estimating \(\varvec{\Sigma }\):

$$\begin{aligned} \inf _{\widehat{\varvec{\Sigma }}} \sup _{\varvec{\Sigma }\in \varTheta } {\mathsf {E}}\Vert \widehat{\varvec{\Sigma }}-\varvec{\Sigma }\Vert ^2 \gtrsim \left( 1+\frac{\lambda }{2} \right) ^2 \rho ^2 \asymp \lambda ^2 \left( 1 \wedge \frac{r}{n} \right) . \end{aligned}$$
(85)

As we mentioned in Sect. 4, the rank-detection lower bound in Theorem 5 is a direct consequence of Proposition 2 concerning testing rank-zero versus rank-one perturbation, which we prove below. \(\square \)

7.3 Proof of Proposition 2

Proof

Analogously to (83), any random vector \(\mathbf {u}\) taking values in \({\mathbb {S}}^{p-1} \cap B_0(k)\) gives the Bayesian lower bound

$$\begin{aligned} {\mathcal {E}}_n(k,p,\lambda ) \ge 1 - \left( \chi ^2(\mathsf {E}[N(0,\mathbf {I}_p+\lambda \mathbf {u}\mathbf {u}')^{\otimes n}] \, || \, N(0,\mathbf {I}_p)^{\otimes n})/2 \right) ^{1/2}. \end{aligned}$$
(86)

Put

$$\begin{aligned} t \triangleq \frac{n \lambda ^2}{k^2} = \frac{\beta _0}{k} \log \frac{\mathrm{e}p}{k}. \end{aligned}$$

Let the random sparse vector \(\mathbf {u}\) be defined in (30). In view of Lemma 7 as well as the facts that \(\mathop {\mathsf{rank}}(\lambda \mathbf {u}\mathbf {u}')=1\) and \(\Vert \lambda \mathbf {u}\mathbf {u}' \Vert =\lambda \le 1\), we have

$$\begin{aligned} 1+ \chi ^2(\mathsf {E}[N(0,\mathbf {I}_p+\lambda \mathbf {u}\mathbf {u}')^{\otimes n}] \,&|| \, N(0,\mathbf {I}_p)^{\otimes n})\\ =&~ \mathsf {E} \exp \left( n \lambda ^2 \langle \mathbf {u}, {\tilde{\mathbf {u}}} \rangle ^2 \right) \\ =&~ \mathsf {E} \exp \left( n \lambda ^2 \left\langle \frac{1}{\sqrt{k}} \mathbf {J}_I \mathbf {w}, \frac{1}{\sqrt{k}} \mathbf {J}_{{\tilde{I}}} {\tilde{\mathbf {w}}} \right\rangle ^2 \right) \\ =&~ \mathsf {E} \exp \left( t (\mathbf {w}' \mathbf {J}_{I\cap {\tilde{I}}} {\tilde{\mathbf {w}}})^2 \right) \\ =&~ \mathsf {E} \exp \left( t G_H^2 \right) , \end{aligned}$$

where in the last step we have defined \(H \triangleq |I \cap {\tilde{I}}| \sim \text {Hypergeometric}(p,k,k)\) and \(\{G_m\}\) is the symmetric random walk on \({\mathbb {Z}}\) defined in (34). Now applying Lemma 1, we conclude that

$$\begin{aligned} \chi ^2\left( \mathsf {E}[N(0,\mathbf {I}_p+\lambda \mathbf {u}\mathbf {u}')^{\otimes n}] \, || \, N(0,\mathbf {I}_p)^{\otimes n} \right) \le g(\beta _0) - 1 \end{aligned}$$

where \(g\) is given in Lemma 1 and satisfies \(g(0+)=1\). In view of (86), we conclude that

$$\begin{aligned} {\mathcal {E}}_n(k,p,\lambda ) \ge \max \{0,1 - \sqrt{(g(\beta _0)-1)/2}\} \triangleq w(\beta _0). \end{aligned}$$

Note that the function \(w\) satisfies \(w(0+)=1\). \(\square \)
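The quantity \(\mathsf {E}\exp (tG_H^2)\) appearing above and controlled by Lemma 1 can be estimated directly by simulation. The sketch below (illustrative \(p\), \(k\) and \(\beta _0\), chosen by us) draws the overlap \(H\sim \text {Hypergeometric}(p,k,k)\) and the stopped random walk \(G_H\), and averages \(\exp (tG_H^2)\) with \(t = \frac{\beta _0}{k}\log \frac{\mathrm{e}p}{k}\).

```python
import numpy as np

rng = np.random.default_rng(6)
p, k, beta0, reps = 200, 8, 0.05, 200000
t = beta0 / k * np.log(np.e * p / k)

# Overlap H = |I ∩ Ĩ| of two independent uniform size-k subsets of [p]
H = rng.hypergeometric(k, p - k, k, size=reps)
# Symmetric random walk: G_m = sum of m independent +/-1 signs; stop it at the random time H
steps = rng.choice([-1, 1], size=(reps, k))
walk = np.concatenate([np.zeros((reps, 1), dtype=int), np.cumsum(steps, axis=1)], axis=1)
G_H = walk[np.arange(reps), H]

print("Monte Carlo estimate of E exp(t G_H^2):", np.mean(np.exp(t * G_H ** 2)))
```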

7.4 Proof of Propositions 3 and 4

We give here a joint proof of Propositions 3 and 4.

Proof

\(1^{\circ }\) (Upper bounds) By assumption, we have \(\lambda \asymp 1\) and \(\frac{k}{n}\log \frac{\mathrm{e}p}{k}\le c_0\). Since \(r \in [k]\), applying Theorem 1 yields

$$\begin{aligned} \sup _{\varvec{\Sigma }\in \varTheta _1(k, p, r, \lambda )} {\mathsf {E}}\Vert \widehat{\varvec{\Sigma }}-\varvec{\Sigma }\Vert ^2 \lesssim \frac{k}{n} \log \frac{\mathrm{e}p}{k}. \end{aligned}$$
(87)

Note that

$$\begin{aligned} \Vert {\widehat{\varvec{\Sigma }}^{-1}-\varvec{\Sigma }^{-1}} \Vert \le&~ \big \Vert \widehat{\varvec{\Sigma }}^{-1}\big \Vert \big \Vert \varvec{\Sigma }^{-1}\big \Vert \Vert {\widehat{\varvec{\Sigma }}-\varvec{\Sigma }} \Vert \le \frac{\Vert {\widehat{\varvec{\Sigma }}-\varvec{\Sigma }} \Vert }{1-\Vert {\widehat{\varvec{\Sigma }}-\varvec{\Sigma }} \Vert } , \end{aligned}$$
(88)

where the last inequality follows from \(\sigma _p(\varvec{\Sigma })=1\) and Weyl’s inequality. By Chebyshev’s inequality, \({ \mathsf {P}\left\{ \Vert \widehat{\varvec{\Sigma }}-\varvec{\Sigma }\Vert \ge t \right\} } \le \frac{1}{t^2} {\mathsf {E}}\Vert \widehat{\varvec{\Sigma }}-\varvec{\Sigma }\Vert ^2\). Let \(E=\{\Vert \widehat{\varvec{\Sigma }}-\varvec{\Sigma }\Vert \le \frac{1}{2}\}\). Then

$$\begin{aligned} { \mathsf {P}\left\{ {E^{\mathrm{c}}} \right\} } \lesssim \frac{k}{n} \log \frac{\mathrm{e}p}{k}. \end{aligned}$$
(89)

Moreover, again by Weyl’s inequality, we have \(E \subset \{\sigma _p(\widehat{\varvec{\Sigma }}) \ge \frac{1}{2}\}\) hence \(\widehat{\varvec{\Omega }}{\mathbf {1}_{\left\{ {E}\right\} }} = (\widehat{\varvec{\Sigma }})^{-1} {\mathbf {1}_{\left\{ {E}\right\} }}\) in view of (36). On the other hand, by definition we always have \(\Vert \widehat{\varvec{\Omega }} \Vert \le 2\). Therefore

$$\begin{aligned} {\mathsf {E}}\Vert {\widehat{\varvec{\Omega }}-\varvec{\Omega }} \Vert ^2 =&~ {\mathsf {E}}\Vert {\widehat{\varvec{\Sigma }}^{-1}-\varvec{\Sigma }^{-1}} \Vert ^2 {\mathbf {1}_{\left\{ {E}\right\} }} + {\mathsf {E}}\Vert {\widehat{\varvec{\Omega }}-\varvec{\Omega }} \Vert ^2 {\mathbf {1}_{\left\{ {{E^{\mathrm{c}}}}\right\} }} \nonumber \\ \le&~ {\mathsf {E}}\Vert {\widehat{\varvec{\Sigma }}^{-1}-\varvec{\Sigma }^{-1}} \Vert ^2 {\mathbf {1}_{\left\{ {E}\right\} }} + 3{ \mathsf {P}\left\{ {E^{\mathrm{c}}} \right\} } \nonumber \\ \le&~ 4 {\mathsf {E}}\Vert {\widehat{\varvec{\Sigma }}-\varvec{\Sigma }} \Vert ^2 + 3{ \mathsf {P}\left\{ {E^{\mathrm{c}}} \right\} } \end{aligned}$$
(90)
$$\begin{aligned} \lesssim&~\frac{k}{n} \log \frac{\mathrm{e}p}{k}. \end{aligned}$$
(91)

where (90) follows from (88) and (91) is due to (87) and (89).

Next we prove the upper bound for estimating \(\mathbf {E}\). Recall that \(\mathbf {E}+\mathbf {I}_p\) and \(\widehat{\mathbf {E}}+\mathbf {I}_p\) give the diagonal matrices formed by the ordered singular values of \(\varvec{\Sigma }\) and \(\widehat{\varvec{\Sigma }}\), respectively. Similar to the proof of Proposition 4, Weyl’s inequality implies that \(\Vert \widehat{\mathbf {E}}-\mathbf {E}\Vert = \Vert \widehat{\mathbf {E}}+ \mathbf {I}_p -(\mathbf {E}+ \mathbf {I}_p) \Vert = \max _{i} |\sigma _i(\widehat{\varvec{\Sigma }}) - \sigma _i(\varvec{\Sigma })| \le \Vert \widehat{\varvec{\Sigma }}-\varvec{\Sigma } \Vert \), where for any \(i\in [p]\), \(\sigma _i(\varvec{\Sigma })\) is the \(i{^{\mathrm{th}}}\) largest eigenvalue of \(\varvec{\Sigma }\). In view of Theorem 1, we have \({\mathsf {E}}\Vert \widehat{\mathbf {E}}-\mathbf {E}\Vert ^2 \lesssim \frac{k}{n} \log \frac{\mathrm{e}p}{k}\).

\(2^{\circ }\) (Lower bounds) The lower bound follows from the testing result in Proposition 2. Consider the testing problem (29); then both the null \((\varvec{\Sigma }=\mathbf {I}_p)\) and the alternatives \((\varvec{\Sigma }=\mathbf {I}_p + \lambda \mathbf {v}\mathbf {v}')\) are contained in the parameter space \(\varTheta _1\). By Proposition 2, they cannot be distinguished with vanishing error probability if \(\lambda ^2 \asymp \frac{k}{n} \log \frac{\mathrm{e}p}{k}\). The spectra differ by at least \(|\sigma _1(\mathbf {I}_p) - \sigma _1(\mathbf {I}_p + \lambda \mathbf {v}\mathbf {v}')| = \lambda \). By the Woodbury identity, \((\mathbf {I}_p + \lambda \mathbf {v}\mathbf {v}')^{-1} = \mathbf {I}_p - \frac{\lambda }{\lambda +1} \mathbf {v}\mathbf {v}'\), hence \(\Vert \mathbf {I}_p - (\mathbf {I}_p + \lambda \mathbf {v}\mathbf {v}')^{-1} \Vert = \frac{\lambda }{1+\lambda } \asymp \lambda \). The lower bound of order \(\frac{k}{n} \log \frac{\mathrm{e}p}{k}\) now follows from the usual two-point argument. \(\square \)
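The two explicit computations used in the lower bound, the spectral separation \(\lambda \) and the rank-one Woodbury identity for \((\mathbf {I}_p+\lambda \mathbf {v}\mathbf {v}')^{-1}\), can be verified in a few lines (illustrative \(p\) and \(\lambda \), chosen by us):

```python
import numpy as np

rng = np.random.default_rng(8)
p, lam = 15, 0.7
v = rng.standard_normal(p)
v /= np.linalg.norm(v)

Sigma = np.eye(p) + lam * np.outer(v, v)
Omega_closed = np.eye(p) - (lam / (lam + 1)) * np.outer(v, v)   # Woodbury / Sherman-Morrison form

print("max |inv(Sigma) - closed form| :", np.abs(np.linalg.inv(Sigma) - Omega_closed).max())
print("||Sigma - I||                  :", np.linalg.norm(Sigma - np.eye(p), 2), "(= lam)")
print("||inv(Sigma) - I||             :", np.linalg.norm(np.linalg.inv(Sigma) - np.eye(p), 2),
      "(= lam/(1+lam) =", lam / (1 + lam), ")")
```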

7.5 Proof of Proposition 1

Proof

Recall that \(\sigma _1(\varvec{\Sigma }) \ge \cdots \ge \sigma _p(\varvec{\Sigma }) \ge 0\) denote the ordered singular values of \(\varvec{\Sigma }\). If \(\sigma _r(\widehat{\varvec{\Sigma }}) \le \frac{\sigma _r(\varvec{\Sigma })+ \sigma _{r+1}(\varvec{\Sigma })}{2}\), then by Weyl’s inequality, we have \(\Vert {\widehat{\varvec{\Sigma }}-\varvec{\Sigma }} \Vert \ge \frac{1}{2} (\sigma _r(\varvec{\Sigma })- \sigma _{r+1}(\varvec{\Sigma }))\). If \(\sigma _r(\widehat{\varvec{\Sigma }}) \ge \frac{\sigma _r(\varvec{\Sigma })+ \sigma _{r+1}(\varvec{\Sigma })}{2}\), then by the Davis–Kahan sin-theta theorem [15] (see also [10, Theorem 10]), we have

$$\begin{aligned} \Vert {\widehat{\mathbf {V}}\widehat{\mathbf {V}}'-\mathbf {V}\mathbf {V}'} \Vert \le \frac{\Vert {\widehat{\varvec{\Sigma }}-\varvec{\Sigma }} \Vert }{|\sigma _r(\widehat{\varvec{\Sigma }}) - \sigma _{r+1}(\varvec{\Sigma })|} \le \frac{2 \Vert {\widehat{\varvec{\Sigma }}-\varvec{\Sigma }} \Vert }{\sigma _r(\varvec{\Sigma }) - \sigma _{r+1}(\varvec{\Sigma })}, \end{aligned}$$

completing the proof of (22) in view of the fact that \(\Vert {\widehat{\mathbf {V}}\widehat{\mathbf {V}}'-\mathbf {V}\mathbf {V}'} \Vert \le 1\). \(\square \)
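A numerical spot check of the bound (22), with a random perturbation and illustrative sizes of our choosing (the cap at 1 reflects the trivial bound \(\Vert {\widehat{\mathbf {V}}\widehat{\mathbf {V}}'-\mathbf {V}\mathbf {V}'} \Vert \le 1\)):

```python
import numpy as np

rng = np.random.default_rng(7)
p, r = 20, 3
lam = np.array([6.0, 5.0, 4.0])
V, _ = np.linalg.qr(rng.standard_normal((p, r)))
Sigma = V @ np.diag(lam) @ V.T + np.eye(p)       # sigma_r(Sigma) - sigma_{r+1}(Sigma) = lam[r-1]

E = rng.standard_normal((p, p))
E = 0.05 * (E + E.T)                             # a small symmetric perturbation
Sigma_hat = Sigma + E

evecs = np.linalg.eigh(Sigma_hat)[1]
V_hat = evecs[:, -r:]                            # leading r eigenvectors of Sigma_hat
lhs = np.linalg.norm(V_hat @ V_hat.T - V @ V.T, 2)
rhs = min(1.0, 2 * np.linalg.norm(E, 2) / lam[r - 1])
print("||Vhat Vhat' - V V'|| =", lhs, "   bound from (22) =", rhs)
```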

7.6 Proof of Lemma 1

Proof

First of all, we can safely assume that \(p \ge 5\), for otherwise the expectation in (35) is obviously upper bounded by an absolute constant. In the sequel we shall assume that

$$\begin{aligned} 0 < a \le \frac{1}{36} \, . \end{aligned}$$
(92)

We use a normal approximation of the random walk \(G_m\) for small \(m\) and a truncation argument to deal with large \(m\). To this end, we divide \([p]\), the whole range of \(k\), into three regimes.

Case I: Large \(k\). Assume that \(\frac{p}{2} \le k \le p\). Then \(t \le \frac{2a \log (2 \mathrm{e})}{p}\). By the following non-asymptotic version of Tusnády’s coupling lemma (see, for example, [6, Lemma 4, p. 242]), for each \(m\) there exists \(Z_m \sim N(0,m)\) such that

$$\begin{aligned} |G_m| \mathop {\le }\limits ^\mathrm{s.t.} |Z_m| + 2. \end{aligned}$$
(93)

Since \(H\le p\), in view of (92), we have

$$\begin{aligned} \mathsf {E} \exp \left( t G_H^2 \right) \le&~ \mathsf {E} \exp \left( t G_p^2 \right) \le \mathsf {E} \exp \left( t (|Z_p|+2)^2 \right) \nonumber \\ \le&~ \exp (8t) \mathsf {E} \exp \left( 2 t Z_p^2 \right) \end{aligned}$$
(94)
$$\begin{aligned} =&~ \frac{\exp (8t)}{\sqrt{1-4 pt}} \le \frac{\exp (7a)}{\sqrt{1 - 8 \log (2 \mathrm{e}) a}}. \end{aligned}$$
(95)

Case II: Small \(k\). Assume that \(1 \le k \le \log \frac{\mathrm{e}p}{k}\), which, in particular, implies that \(k \le p^{\frac{1}{3}} \wedge \frac{p}{2}\) since \(p \ge 5\). Using \(G_H^2 \le H^2 \le k H\), we have

$$\begin{aligned} \mathsf {E} \exp \left( t G_H^2 \right) \le&~ \mathsf {E} \left( \frac{\mathrm{e}p}{k} \right) ^{a H} \le \left( 1 + \frac{k}{p-k} \left( \left( \frac{\mathrm{e}p}{k} \right) ^{a} - 1 \right) \right) ^k \end{aligned}$$
(96)
$$\begin{aligned} \le&~ \exp \left\{ \frac{2 \sqrt{k}}{\sqrt{p}} \left( \left( \frac{\mathrm{e}p}{k} \right) ^{a}-1 \right) \right\} \end{aligned}$$
(97)
$$\begin{aligned} \le&~ \exp (8a) \,, \end{aligned}$$
(98)

where (96) follows from the stochastic dominance of hypergeometric distributions by binomial distributions (see, e.g. , [23, Theorem 1.1(d)])

$$\begin{aligned} \mathrm{Hypergeometric}(p,k,k) \mathop {\le }\limits ^\mathrm{s.t.} \text {Bin}\left( k,\frac{k}{p-k} \right) , \quad k\le \frac{p}{2} \end{aligned}$$
(99)

and the moment generating function of binomial distributions, (97) is due to \((1+x)^k \le \exp (kx)\) for \(x \ge 0\) and \(\frac{k^2}{p} \le \sqrt{\frac{k}{p}}\), (98) is due to

$$\begin{aligned} \sup _{y \ge 1} \frac{(\mathrm{e}y)^a-1}{\sqrt{y}} = 2 \sqrt{\mathrm{e}} a ((1-2 a))^{\frac{1}{2 a}-1} \le 4 a, \quad \forall \; a \le \frac{1}{2}, \end{aligned}$$

applied with \(y = p/k \ge 1\).

Case III: Moderate \(k\). Assume that \(\log \frac{\mathrm{e}p}{k} \le k \le \frac{p}{2}\). Define

$$\begin{aligned} A \triangleq \frac{k}{\sqrt{a} \log \frac{\mathrm{e}p}{k}} \wedge k \end{aligned}$$
(100)

and write

$$\begin{aligned} \mathsf {E} \exp \left( t G_H^2 \right) = \mathsf {E} \exp \left( t G_H^2 \right) {\mathbf {1}_{\left\{ {H \le A}\right\} }} + \mathsf {E} \exp \left( t G_H^2 \right) {\mathbf {1}_{\left\{ {H > A}\right\} }} . \end{aligned}$$
(101)

Next we bound the two terms in (101) separately: For the first term, we use normal approximation of \(G_m\). By (93) and (94), for each fixed \(m \le A\), we have

$$\begin{aligned} \mathsf {E} \exp \left( t G_m^2 \right) \le&~ \frac{\exp (8t)}{\sqrt{1-4 m t}} \le \frac{\exp (8a)}{\sqrt{1 - 4 \sqrt{a}}}, \end{aligned}$$
(102)

where the last inequality follows from \(k \ge \log \frac{\mathrm{e}p}{k}\) and (100).

To control the second term in (101), we may assume without loss of generality that \(A = \frac{k}{\sqrt{a} \log \frac{\mathrm{e}p}{k}} < k\), in which case \(A \ge \frac{1}{\sqrt{a}}\); otherwise \(H \le k = A\) and the second term vanishes. We proceed similarly to (96)–(98):

$$\begin{aligned}&~\mathsf {E} \exp \left( t G_H^2 \right) {\mathbf {1}_{\left\{ {H > A}\right\} }} \nonumber \\&\quad \le ~ \mathsf {E} \exp \left( tk H \right) {\mathbf {1}_{\left\{ {H > A}\right\} }} \nonumber \\&\quad \le ~ \sum _{m > A} \exp (t k m) \left( \frac{k}{p-k} \right) ^m \left( \frac{p-2k}{p-k} \right) ^{k-m} \left( {\begin{array}{c}k\\ m\end{array}}\right) \end{aligned}$$
(103)
$$\begin{aligned}&\quad \le ~ \sum _{m > A} \exp \left( a m \log \frac{p}{k} - m \log \frac{p}{2k} + m \log \left( \sqrt{a} \mathrm{e}\log \frac{\mathrm{e}p}{k} \right) \right) \end{aligned}$$
(104)
$$\begin{aligned}&\quad \le ~ \sum _{m > A} \exp \left\{ - \left( 1-a-\frac{1}{\mathrm{e}} \right) m \log 2 + m \log (2 \sqrt{a} \mathrm{e}) + \frac{m}{\mathrm{e}} \right\} \end{aligned}$$
(105)
$$\begin{aligned}&\quad \le ~ 7 \exp \left( -\frac{1}{\sqrt{a}} \right) , \end{aligned}$$
(106)

where (103) follows from (99), (104) uses \(\left( {\begin{array}{c}k\\ m\end{array}}\right) \le \left( \frac{\mathrm{e}k}{m} \right) ^m\), \(p \ge 2k\) and \(m > A\), (105) uses the fact that \(\mathrm{e}\log x \le x\) for all \(x \ge 1\), and (106) is due to (100) and our choice of \(a\) in (92). Plugging (102) and (106) into (101), we obtain

$$\begin{aligned} \mathsf {E} \exp \left( t G_H^2 \right) \le \frac{\exp (8a)}{\sqrt{1 - 4 \sqrt{a}}} + 7 \exp \left( -\frac{1}{\sqrt{a}} \right) . \end{aligned}$$
(107)

We complete the proof of (35), with \(g(a)\) defined as the maximum of the right-hand sides of (95), (98) and (107). \(\square \)