Published by De Gruyter, September 21, 2017

A Comparison of Methods for Estimating the Determinant of High-Dimensional Covariance Matrix

  • Zongliang Hu, Kai Dong, Wenlin Dai, and Tiejun Tong

Abstract

The determinant of the covariance matrix for high-dimensional data plays an important role in statistical inference and decision making. It has many real applications, including statistical tests and information theory. Due to the statistical and computational challenges of high dimensionality, little work has been done on estimating the determinant of a high-dimensional covariance matrix. In this paper, we estimate the determinant of the covariance matrix using some recent proposals for estimating the high-dimensional covariance matrix. Specifically, we consider a total of eight covariance matrix estimation methods for comparison. Through extensive simulation studies, we explore and summarize some interesting comparison results among all compared methods. We also provide practical guidelines, based on the sample size, the dimension, and the correlation of the data set, for estimating the determinant of a high-dimensional covariance matrix. Finally, from the perspective of the loss function, the comparison study in this paper may also serve as a proxy for assessing the performance of covariance matrix estimation.

1 Introduction

High-dimensional data are becoming more common in scientific research, including gene expression studies, financial engineering, and signal processing. One significant feature of such data is that the dimension p is larger than the sample size n, the so-called "large p small n" data. For example, gene microarrays often measure thousands of gene expression values simultaneously for each individual. However, due to cost or the limited availability of patients, the number of samples in microarray experiments is usually much smaller than the number of genes. It is common to see microarray data with fewer than 10 samples [1, 2, 3, 4, 5]. As seen in the literature, there are many statistical and computational challenges in analyzing the "large p small n" data.

Let $X_i = (x_{i1},\ldots,x_{ip})^T$, $i = 1,\ldots,n$, be independent and identically distributed (i.i.d.) random vectors from the multivariate normal distribution $N_p(\mu, \Sigma)$, where μ is a p-dimensional mean vector and Σ is a covariance matrix of size p×p. When p is larger than n, the sample covariance matrix $S_n$ is a singular matrix. To overcome the singularity problem, various methods for estimating Σ have been proposed in the recent literature, e.g., the ridge-type estimators in [6] and [7], and the sparse estimators in [8, 9, 10] and [11]. Recently, [12] and [13] considered sparse covariance matrix estimation for time series data based on certain dependence measures, which relaxes the independence assumption among samples. For more references, see also [14, 15] and [16].

Apart from the covariance matrix estimation, there are situations where one needs an estimate of the determinant (or the log-determinant) of the covariance matrix for high-dimensional data. To illustrate it, we write the log-likelihood function of the data as

\log(L) = -\frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^{n}(X_i - \mu)^T \Sigma^{-1}(X_i - \mu),

where |Σ| denotes the determinant of the covariance matrix Σ. In classical multivariate analysis, the determinant |Σ|, referred to as the generalized variance (GV), was introduced by [17] and [18] as a scalar measure of overall multidimensional scatter. It has many applications, such as outlier detection, hypothesis testing, and classification. To illustrate this demand, we present several examples below.

  1. Quadratic discriminant analysis (QDA) is an important classification method. Assuming that the data in class k follow $N_p(\mu_k, \Sigma_k)$, the quadratic discriminant scores are given by

    d_k(Y) = (Y - \mu_k)^T \Sigma_k^{-1}(Y - \mu_k) + \log|\Sigma_k| - 2\log\pi_k, \qquad k = 1,\ldots,K,

    where Y is the new sample, K is the total number of classes, and πk is the prior probability of observing a sample from class k. The classification rule is to assign Y to class k that minimizes dk(Y) among all classes. To implement QDA, it is obvious that we need an estimate of |Σk| or log|Σk|.

  2. To estimate the high-dimensional precision matrix $\Omega = \Sigma^{-1}$, [19] and [20] proposed to solve the following optimization problem:

    \hat{\Omega} = \arg\min_{\Omega > 0}\left\{\mathrm{tr}(S_n\Omega) - \log|\Omega| + \lambda\|\Omega\|_1\right\},

    where $\mathrm{tr}(\cdot)$ is the trace, $\|\cdot\|_1$ is the $\ell_1$ norm, and λ is a tuning parameter. The purpose of the term $-\log|\Omega| = \log|\Sigma|$ is to ensure that the optimization problem has a unique global positive definite minimizer [10]. Other proposals in this direction include [21], [22], [23], and [24], among others.

  3. In probability theory and information theory, the differential entropy extends the concept of entropy to continuous probability distributions [25, 26]. For a random vector from $N_p(\mu, \Sigma)$, the differential entropy is

    h(\Sigma) = \frac{p}{2} + \frac{p\log(2\pi)}{2} + \frac{\log|\Sigma|}{2}.

    A small numerical sketch of this formula is given after this list.
  4. The minimum covariance determinant (MCD) method developed by [27] and [28] is a robust estimator of multivariate scatter. MCD aims to find a subset of h samples (observations) whose sample covariance matrix has the smallest determinant. Specifically, let $\mathcal{S} = \{I \subset \{1,\ldots,n\} : \mathrm{card}(I) = h\}$ be the collection of all subsets with h samples, where $\mathrm{card}(I)$ is the cardinality of I. For any $I \in \mathcal{S}$, let $S_I$ be the corresponding sample covariance matrix. The subset with the minimum determinant is defined as

    I_m = \arg\min_{I \in \mathcal{S}}\{|S_I|\}.

    When p is larger than n, MCD is ill-defined as SI is singular. To generalize the MCD method to high-dimensional data, we need an estimate for the determinant of the high-dimensional covariance matrix. For instance, [29] replaced |SI| with |diag(SI)|, and [30] modified |SI| by shrinking the subset-based sample covariance matrix toward a target matrix.

  5. Multivariate analysis of variance (MANOVA) is a procedure for testing the equality of mean vectors across multiple groups. Wilks’ Λ statistic for the hypothesis test [31] is given as

    \Lambda = \frac{|E|}{|H + E|},

    where E is the within-group sum of squares and cross-product matrix, and H is the between-group sum of squares and cross-product matrix. However, E is singular under the “large p small n” setting. To apply MANOVA for high-dimensional data, [32] proposed replacing E with a shrinkage estimator, in which the shrinkage intensity is computed based on the method by [33]. Ullah and Jones [34] compared the powers of three types of regularized Wilks’ Λ statistics, in which E was replaced by the lasso, ridge and shrinkage estimator, respectively.
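
As a small illustration of how the log-determinant enters these quantities, the following Python sketch computes the Gaussian differential entropy of Example 3 for a toy AR(1)-type covariance matrix. The function name and the use of numpy.linalg.slogdet are our own choices and are not part of the examples above.

import numpy as np

def gaussian_differential_entropy(sigma):
    # h(Sigma) = p/2 + p*log(2*pi)/2 + log|Sigma|/2 for N_p(mu, Sigma);
    # the mean vector mu does not enter the entropy.
    p = sigma.shape[0]
    sign, logdet = np.linalg.slogdet(sigma)  # numerically stabler than log(det(sigma))
    if sign <= 0:
        raise ValueError("sigma must be positive definite")
    return 0.5 * p + 0.5 * p * np.log(2 * np.pi) + 0.5 * logdet

# toy example: p = 4, AR(1)-type correlation with rho = 0.5
p, rho = 4, 0.5
sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
print(gaussian_differential_entropy(sigma))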

From the above examples, it is evident that an estimator of GV, or log|Σ|, plays an important role in high-dimensional data analysis. For ease of notation, we let

θ=log|Σ|

throughout the paper. In contrast to covariance matrix estimation, the estimation of θ has been relatively overlooked in the literature. In practice, one often estimates the covariance matrix first and then uses it to compute the log-determinant. Chiu et al. [35] considered a regression model that allows the covariance matrix of the response vector $X_i = (x_{i1},\ldots,x_{ip})^T$ to vary with explanatory variables. Specifically, they proposed modeling each element of $\log\Sigma$ as a linear function of the explanatory variables. One property of this transformation is that the log-determinant $\log|\Sigma|$ equals $\mathrm{tr}(\log\Sigma)$, a sum of the log-eigenvalues of Σ. Recently, [36] investigated the estimation of θ under various settings. In a "moderate" setting where p is smaller than n, they proposed estimating θ by the log-determinant of the sample covariance matrix, i.e., $\log|S_n|$. A central limit theorem was also established for $\log|S_n|$ in the setting where p can grow with n. For the "large p small n" data, however, they showed that it is impossible to estimate θ consistently unless some structural assumption, such as sparsity, is imposed on the parameter.

In this paper, we conduct a comprehensive simulation study that evaluates the performance of existing methods for estimating θ. We follow a two-step procedure: we first estimate Σ with an existing method, and then estimate θ by the plug-in estimator $\hat{\theta} = \log|\hat{\Sigma}|$. In Section 2, we consider a total of eight methods for estimating θ and give a brief review of each. In Section 3, we conduct simulation studies to evaluate and compare their performance under various settings. In particular, we consider different types of correlation structures, including a non-positive definite covariance matrix that is often ignored in the existing literature. We then explore and summarize some useful findings and provide some practical guidelines for scientists in Section 4. Finally, we conclude the paper in Section 5 with some discussion. Technical details are provided in the Appendix.

2 Methods for estimating θ

In this section, we review eight representative methods for estimating the covariance matrix, and then estimate the log-determinant θ using the eight estimates of Σ, respectively. We also propose a new method for estimating θ under the assumption of a diagonal covariance matrix. For ease of presentation, we divide the eight methods into four categories: diagonal estimation, shrinkage estimation, sparse estimation, and factor model estimation.

2.1 Diagonal estimation

Method 1: Diagonal Estimator (DE)

Under the “large p small n” setting, one naive approach is to estimate Σ by the diagonal sample covariance matrix, i.e., $\mathrm{diag}(S_n)$. This estimator was first considered in [37] to construct a diagonal linear discriminant analysis, and was further studied in [38], where the authors demonstrated that a diagonal covariance matrix estimate can sometimes be reasonable when p is much larger than n. Let $\mathrm{diag}(\Sigma) = \mathrm{diag}(\sigma_1^2,\ldots,\sigma_p^2)$, where $\sigma_j^2$ are the covariate-specific variances for $j = 1,\ldots,p$, and let $\mathrm{diag}(S_n) = \mathrm{diag}(s_1^2,\ldots,s_p^2)$, where $s_j^2$ are the corresponding sample variances. By letting $\hat{\Sigma} = \mathrm{diag}(S_n)$, we define the first estimator of θ as

\hat{\theta}^{(1)} = \log|\mathrm{diag}(S_n)| = \sum_{j=1}^{p}\log s_j^2. \qquad (1)

We refer to $\hat{\theta}^{(1)}$ as the diagonal estimator (DE). Strictly speaking, DE estimates $\log|\mathrm{diag}(\Sigma)|$ rather than $\log|\Sigma|$.
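
A minimal Python sketch of DE is given below, assuming an n-by-p data matrix X; the function name is ours, and the sample variances are the usual unbiased estimates (ddof = 1).

import numpy as np

def de_log_determinant(X):
    # theta_hat^(1) = sum_j log s_j^2 = log|diag(S_n)|, eq. (1)
    s2 = X.var(axis=0, ddof=1)   # covariate-specific sample variances s_j^2
    return np.sum(np.log(s2))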

Method 2: Improved Diagonal Estimator (IDE)

It is noteworthy that DE may not perform well as an estimate of $\log|\mathrm{diag}(\Sigma)|$ when the sample size is small, mainly due to unreliable estimates of the sample variances. Various approaches have been proposed to improve variance estimation in the literature. See, for example, [39, 40, 41, 42], and [43].

To improve DE, we consider the optimal shrinkage estimator in [42],

\hat{\sigma}_j^2 = \{h_p(1)\, s_{\mathrm{pool}}^2\}^{\alpha}\,\{h_1(1)\, s_j^2\}^{1-\alpha},

where $s_{\mathrm{pool}}^2 = \prod_{j=1}^{p}(s_j^2)^{1/p}$, $h_p(1) = (\nu/2)\{\Gamma(\nu/2)/\Gamma(\nu/2 + 1/p)\}^p$ with $\nu = n - 1$, $\Gamma(\cdot)$ is the gamma function, and $\alpha \in [0, 1]$ is the shrinkage parameter. Replacing $s_j^2$ in DE by $\hat{\sigma}_j^2$, we have

\hat{\theta} = \sum_{j=1}^{p}\log\hat{\sigma}_j^2 = \hat{\theta}^{(1)} + C, \qquad (2)

where $C = \log\{h_p^{\alpha p}(1)\, h_1^{(1-\alpha)p}(1)\}$ is a constant.

The structure of eq. (2) shows that the DE estimator $\hat{\theta}^{(1)}$ can be further improved. Specifically, since the variance of $\hat{\theta}^{(1)} + C$ does not depend on C, the value $C_0$ satisfying $E(\hat{\theta}^{(1)} + C_0) = \log|\mathrm{diag}(\Sigma)|$ is the optimal choice of C, in the sense that the estimator $\hat{\theta}^{(1)} + C_0$ minimizes the mean squared error within the family of estimators $\{\hat{\theta}^{(1)} + C : C \in (-\infty, \infty)\}$.

Theorem 1

Let $s_j^2 = \sigma_j^2\chi_{\nu,j}^2/\nu$, where the $\chi_{\nu,j}^2$ are i.i.d. random variables with a common chi-squared distribution with ν degrees of freedom, and let $C_0 = -p\{\log(2/\nu) + \psi(\nu/2)\}$, where $\psi(x) = \Gamma'(x)/\Gamma(x)$ is the digamma function. Then for any fixed $\nu > 0$, we have

  1. $\hat{\theta}^{(1)} + C_0$ is an unbiased estimator of $\log|\mathrm{diag}(\Sigma)|$.

  2. Assume also that the $\sigma_j^2$ are i.i.d. random variables from a common distribution F with $E|\log\sigma_1^2| < \infty$. Then

    \frac{1}{p}\left\{\hat{\theta}^{(1)} + C_0 - \log|\mathrm{diag}(\Sigma)|\right\} \xrightarrow{\text{a.s.}} 0 \quad \text{as } p \to \infty,

    where a.s. denotes almost sure convergence.

The proof of Theorem 1 is given in the Appendix. By eq. (2) and Theorem 1, we define the second estimator of θ as

\hat{\theta}^{(2)} = \sum_{j=1}^{p}\log s_j^2 - p\{\log(2/\nu) + \psi(\nu/2)\}.

We refer to θˆ(2) as the improved diagonal estimator (IDE).
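
A minimal Python sketch of IDE, under the same conventions as the DE sketch above, simply adds the bias correction $C_0 = -p\{\log(2/\nu) + \psi(\nu/2)\}$ from Theorem 1, with the digamma function taken from scipy; the function name is ours.

import numpy as np
from scipy.special import digamma

def ide_log_determinant(X):
    # theta_hat^(2) = sum_j log s_j^2 - p*(log(2/nu) + digamma(nu/2)), nu = n - 1,
    # which is unbiased for log|diag(Sigma)| under normality (Theorem 1)
    n, p = X.shape
    nu = n - 1
    s2 = X.var(axis=0, ddof=1)
    return np.sum(np.log(s2)) - p * (np.log(2.0 / nu) + digamma(nu / 2.0))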

2.2 Shrinkage estimation

Recall that the sample covariance matrix Sn is singular when the dimension is larger than the sample size. To overcome the singularity problem, other than the diagonal methods in Section 2.1, one may also estimate the covariance matrix by the following convex combination:

S = \delta T + (1 - \delta)S_n,

where T is the target matrix, and δ[0,1] is the shrinkage parameter. Both the target matrix and the shrinkage parameter play an important role in the shrinkage estimation. For instance, if we let T=diag(Sn) and δ=1, then S reduces to the DE estimator.

The appropriate choice of the target matrix has been extensively studied in the literature. See, for example, [6, 33, 44, 45], and [7] and the references therein. Note that T is often chosen to be positive definite and well-conditioned, and consequently, the final estimate S is also guaranteed positive definite and well-conditioned for any dimensionality. As suggested in [33] and [7], we consider a popular target matrix for nonhomogeneous variances: the “diagonal, unequal variance” matrix, i.e., the diagonal sample covariance matrix diag(Sn).

We also note that, given the target matrix, the estimation of the shrinkage parameter δ is also crucial to the final estimate. Two main approaches are available for estimating the shrinkage parameter: (1) unbiased estimation, which replaces the unknown terms in the optimal value by their unbiased estimators [33]; and (2) consistent estimation, which replaces the unknown terms in the optimal shrinkage parameter with (n, p)-consistent estimators [7]. Taken together, we present below four methods for estimating the covariance matrix and, consequently, for estimating θ.
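
The following Python sketch shows the plug-in log-determinant for a generic shrinkage estimator $S = \delta T + (1-\delta)S_n$. The shrinkage parameter delta is taken as given here; the unbiased rule of [33] and the consistent rule of [7] that define Methods 3–6 below are not reproduced, and the function name is ours.

import numpy as np

def shrinkage_log_determinant(X, delta, target="identity"):
    # Plug-in estimator log|delta*T + (1 - delta)*S_n| with T = I or T = diag(S_n).
    S = np.cov(X, rowvar=False)                 # sample covariance matrix S_n
    if target == "identity":
        T = np.eye(S.shape[0])                  # target for USIE/CSIE
    else:
        T = np.diag(np.diag(S))                 # target for USDE/CSDE
    S_shrunk = delta * T + (1.0 - delta) * S
    # delta > 0 keeps the estimate positive definite even when p > n
    # (assuming all sample variances are positive)
    _, logdet = np.linalg.slogdet(S_shrunk)
    return logdet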

Method 3: Unbiased Shrinkage Estimator with T=I (USIE)

Letting the target matrix be T = I, [33] proposed an unbiased estimator for the shrinkage parameter, denoted by $\hat{\delta}_1$. This leads to $S = \hat{\delta}_1 I + (1 - \hat{\delta}_1)S_n$. We then define the third estimator of θ as

\hat{\theta}^{(3)} = \log|\hat{\delta}_1 I + (1 - \hat{\delta}_1)S_n|. \qquad (3)

Method 4: Consistent Shrinkage Estimator with T=I (CSIE)

Letting the target matrix be T = I, [7] proposed a consistent estimator for the shrinkage parameter, denoted by $\hat{\delta}_2$. This leads to $S = \hat{\delta}_2 I + (1 - \hat{\delta}_2)S_n$. We then define the fourth estimator of θ as

\hat{\theta}^{(4)} = \log|\hat{\delta}_2 I + (1 - \hat{\delta}_2)S_n|. \qquad (4)

Method 5: Unbiased Shrinkage Estimator with T=diag(Sn) (USDE)

Letting $T = \mathrm{diag}(S_n)$, [33] also proposed an unbiased estimator for the shrinkage parameter, denoted by $\hat{\delta}_3$. This leads to $S = \hat{\delta}_3\,\mathrm{diag}(S_n) + (1 - \hat{\delta}_3)S_n$. We then define the fifth estimator of θ as

\hat{\theta}^{(5)} = \log|\hat{\delta}_3\,\mathrm{diag}(S_n) + (1 - \hat{\delta}_3)S_n|. \qquad (5)

Method 6: Consistent Shrinkage Estimator with T=diag(Sn) (CSDE)

Letting $T = \mathrm{diag}(S_n)$, [7] also proposed a consistent estimator for the shrinkage parameter, denoted by $\hat{\delta}_4$. This leads to $S = \hat{\delta}_4\,\mathrm{diag}(S_n) + (1 - \hat{\delta}_4)S_n$. We then define the sixth estimator of θ as

\hat{\theta}^{(6)} = \log|\hat{\delta}_4\,\mathrm{diag}(S_n) + (1 - \hat{\delta}_4)S_n|. \qquad (6)

2.3 Sparse estimation

When p is much larger than n, the shrinkage methods in Section 2.2 may not achieve a significant improvement over $S_n$. In such settings, to obtain a good estimate of Σ, one may have to impose some structural assumptions, such as sparsity, on the parameters. Recently, [15] reviewed methods for estimating structured high-dimensional covariance and precision matrices. A typical sparsity assumption is that most of the off-diagonal elements of the covariance matrix are zero. To estimate the covariance matrix under a sparsity condition, various thresholding-based methods that aim to locate the "large" off-diagonal elements have been proposed in the literature. See, for example, [8, 9, 46, 47, 48, 49, 50, 51], and [52]. In particular, the adaptive thresholding estimator proposed by [49] achieves the optimal rate of convergence over a large class of sparse covariance matrices under the spectral norm. Besides, it can be shown that the adaptive thresholding estimator also attains the optimal convergence rate under Bregman divergence losses over a large parameter class [15, 50]. Therefore, we also consider the sparsity methods as representatives and use them to estimate θ, i.e., the log-determinant of the covariance matrix.

Method 7: Adaptive Thresholding Estimator (ATE)

Bickel and Levina [8] proposed a universal thresholding method where all entries in the sample covariance matrix are thresholded by a common value γ. They required that the variances σj2 are uniformly bounded by a constant K, and consequently, the variances of the entries of the sample covariance matrix are also uniformly bounded. However, it was shown that a universal thresholding method is suboptimal over a certain class of sparse covariance matrices.

To improve the method above, [49] proposed an adaptive thresholding estimator for the covariance matrix:

\hat{\Sigma} = (\tilde{\sigma}_{ij})_{p\times p} \quad \text{with} \quad \tilde{\sigma}_{ij} = s_{\gamma_{ij}}(s_{ij}),

where $\gamma_{ij}$ is the entry-specific threshold for $\tilde{\sigma}_{ij}$, and $s_{\gamma_{ij}}(\cdot)$ is a generalized thresholding operator [47], specified as soft thresholding throughout our simulations. With properly chosen $\gamma_{ij}$, the estimator $\hat{\Sigma}$ adaptively achieves the optimal rate of convergence over a large class of sparse covariance matrices under the spectral norm. Using $\hat{\Sigma}$, the seventh estimator of θ is

\hat{\theta}^{(7)} = \log|\hat{\Sigma}|. \qquad (7)

We refer to θˆ(7) as the adaptive thresholding estimator (ATE).
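
As a rough sketch of the thresholding step, the Python function below applies entry-wise soft thresholding to the off-diagonal entries of $S_n$ and leaves the diagonal untouched. The entry-wise thresholds gamma_ij are taken as given, whereas in ATE [49] they are data-driven; the resulting matrix is not guaranteed to be positive definite, so the plug-in log-determinant should be computed with a sign check (e.g., via numpy.linalg.slogdet). The function name is ours.

import numpy as np

def soft_threshold_covariance(X, thresholds):
    # Generalized (soft) thresholding of the off-diagonal entries of S_n.
    # `thresholds` is a p-by-p array of entry-wise thresholds gamma_ij.
    S = np.cov(X, rowvar=False)
    thresholded = np.sign(S) * np.maximum(np.abs(S) - thresholds, 0.0)
    np.fill_diagonal(thresholded, np.diag(S))   # diagonal entries are not thresholded
    return thresholded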

2.4 Factor model estimation

The sparsity condition on the covariance matrix assumes that most covariates are uncorrelated with each other. Note, however, that this assumption may not be realistic in practice. Recently, under the assumption of conditional sparsity, [54] introduced a principal orthogonal complement thresholding method based on a factor model. In this section, we briefly review their method and then apply it to estimate the log-determinant of the covariance matrix.

Method 8: Principal Orthogonal Complement Thresholding Estimator (POET)

Fan et al. [54] considered the approximate factor model:

y_g = B f_g + u_g, \qquad g = 1,\ldots,G,

where $y_g = (y_{1g},\ldots,y_{pg})^T$ is the observed response, $B = (b_1,\ldots,b_p)^T$ is the loading matrix, $f_g$ is a $Q \times 1$ vector of common factors, and $u_g = (u_{1g},\ldots,u_{pg})^T$ is the error vector. In this model, only $y_g$ is observed. Let

\Sigma = B\,\mathrm{cov}(f_g)\,B^T + \Sigma_u, \qquad g = 1,\ldots,G,

where Σu is the covariance matrix of ug. To estimate Σ, [54] applied the spectral decomposition on the sample covariance matrix:

S_n = \sum_{j=1}^{Q}\hat{\lambda}_j\hat{\xi}_j\hat{\xi}_j^T + \hat{R}_Q,

where $\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \cdots \ge \hat{\lambda}_p$ are the eigenvalues of $S_n$, $\hat{\xi}_j$, $j = 1,\ldots,p$, are the corresponding eigenvectors, and $\hat{R}_Q = \sum_{j=Q+1}^{p}\hat{\lambda}_j\hat{\xi}_j\hat{\xi}_j^T$ is the principal orthogonal complement. In this decomposition, the first Q principal components are kept and the thresholding is applied to $\hat{R}_Q$; again, the generalized thresholding operator can be used. In addition, [54] also introduced a method to estimate Q, with the estimate denoted by $\hat{Q}$. Their final estimator of Σ is

\hat{\Sigma}_{\hat{Q}} = \sum_{j=1}^{\hat{Q}}\hat{\lambda}_j\hat{\xi}_j\hat{\xi}_j^T + \hat{R}_{\hat{Q}}^{\mathcal{T}}, \qquad (8)

where $\hat{R}_{\hat{Q}}^{\mathcal{T}}$ denotes the thresholded version of $\hat{R}_{\hat{Q}}$. Now by eq. (8), we define the last estimator of θ as

\hat{\theta}^{(8)} = \log|\hat{\Sigma}_{\hat{Q}}|. \qquad (9)

We refer to θˆ(8) as the principal orthogonal complement thresholding estimator (POET).
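
A rough Python sketch of the POET construction follows: keep the top Q principal components of $S_n$, soft-threshold the off-diagonal entries of the principal orthogonal complement, and take the log-determinant of the sum. Here Q and the threshold are supplied by the user, whereas [54] estimate Q and use data-driven thresholds; the function name is ours.

import numpy as np

def poet_log_determinant(X, Q, threshold):
    S = np.cov(X, rowvar=False)
    eigval, eigvec = np.linalg.eigh(S)            # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1]              # re-order to descending
    eigval, eigvec = eigval[order], eigvec[:, order]
    low_rank = (eigvec[:, :Q] * eigval[:Q]) @ eigvec[:, :Q].T   # top-Q component
    R = S - low_rank                              # principal orthogonal complement
    R_thr = np.sign(R) * np.maximum(np.abs(R) - threshold, 0.0) # soft thresholding
    np.fill_diagonal(R_thr, np.diag(R))           # keep the diagonal of R
    sign, logdet = np.linalg.slogdet(low_rank + R_thr)
    if sign <= 0:
        raise ValueError("thresholded estimate is not positive definite")
    return logdet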

3 Simulation studies

In this section, we compare the numerical performance of the aforementioned eight estimators. We consider five different setups. In the first setup, we generate data from the multivariate normal distribution $N_p(0, \Sigma)$. In the second setup, we generate data from a mixture distribution where the covariance matrix is highly sparse. In the third setup, we simulate data from the log-normal distribution to assess the robustness of the eight methods under heavy-tailed data. In the fourth setup, we consider a special case where the covariance matrix is degenerate and the data are generated from a degenerate multivariate normal distribution. In the final setup, we use a realistic covariance matrix structure obtained from a real data set. To compare these methods, we compute the mean squared error (MSE) as follows:

\mathrm{MSE}(\theta, \hat{\theta}) = \frac{1}{Mp}\sum_{m=1}^{M}(\hat{\theta}_m - \theta)^2,

where M is the number of Monte Carlo replications. Throughout the simulations, we take M = 500.
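
A short Python helper following the displayed criterion (note the normalization by both M and p) might look as follows; the function name is ours.

import numpy as np

def monte_carlo_mse(theta_hats, theta, p):
    # MSE(theta, theta_hat) = (1/(M*p)) * sum_m (theta_hat_m - theta)^2
    theta_hats = np.asarray(theta_hats, dtype=float)
    M = theta_hats.size
    return np.sum((theta_hats - theta) ** 2) / (M * p)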

Figure 1: Log MSEs for data from the normal distribution with p = 50. The sample size ranges from 5 to 50. In all figures, “1” to “8” represent the eight methods: DE [38], IDE, USIE [33], CSIE [7], USDE [33], CSDE [7], ATE [49], and POET [54], respectively.

Figure 2: Log MSEs for data from the normal distribution with p = 300. The sample size ranges from 10 to 200. In all figures, “1” to “8” represent the eight methods: DE [38], IDE, USIE [33], CSIE [7], USDE [33], CSDE [7], ATE [49], and POET [54], respectively.

Figure 3: Log MSEs for data from the normal distribution with p = 300 and ρ ranging from 0 to 0.9. In all figures, “1” to “8” represent the eight methods: DE [38], IDE, USIE [33], CSIE [7], USDE [33], CSDE [7], ATE [49], and POET [54], respectively.

3.1 Normal data

In this setup, we consider a block diagonal structure for the covariance matrix. This structure is widely adopted in the literature, e.g., [55] and [56]. Specifically, we let

\Sigma_2 = D^{1/2} R(\rho) D^{1/2},

where $D = \mathrm{diag}(\sigma_1^2,\ldots,\sigma_p^2)$ with the $\sigma_j^2$ being i.i.d. from the distribution $\chi_5^2/5$, and $R(\rho)$ follows a block diagonal structure:

R(\rho) = \begin{pmatrix} \Sigma_\rho & 0 & \cdots & 0 \\ 0 & \Sigma_\rho & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Sigma_\rho \end{pmatrix}_{p\times p}.

In our simulations, we consider $\Sigma_\rho = (\sigma_{ij}(\rho))_{q\times q}$ with $\sigma_{ij}(\rho) = \rho^{|i-j|}$ for $1 \le i, j \le q$. In addition, we set ρ = 0, 0.3, 0.6 or 0.9, to represent different levels of dependence, and (p, q) = (50, 5) or (300, 10), respectively.
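
For concreteness, a Python sketch of this covariance construction (assuming p is a multiple of q) is given below; the function name and random-number interface are our own choices.

import numpy as np

def block_ar1_covariance(p, q, rho, rng):
    # Sigma_2 = D^{1/2} R(rho) D^{1/2}: R(rho) is block diagonal with p/q blocks
    # Sigma_rho, where sigma_ij(rho) = rho^|i-j|, and D has i.i.d. chi^2_5 / 5 variances.
    block = rho ** np.abs(np.subtract.outer(np.arange(q), np.arange(q)))
    R = np.zeros((p, p))
    for k in range(p // q):
        R[k * q:(k + 1) * q, k * q:(k + 1) * q] = block
    d = rng.chisquare(df=5, size=p) / 5.0
    D_half = np.diag(np.sqrt(d))
    return D_half @ R @ D_half

rng = np.random.default_rng(0)
Sigma2 = block_ar1_covariance(50, 5, 0.6, rng)   # one of the (p, q, rho) settings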

Table 1

MSEs of $\hat{\theta}$ for data from the normal distribution with ρ = 0.3, 0.6, 0.9, n = 10, 40, and p = 50, 100, respectively. The number of factors K is either fixed (K = 0, 1, 2, 4, 6 for n = 10; K = 0, 1, 4, 8, 12 for n = 40) or estimated by the method in [54], denoted by $\hat{K}$. Rows correspond to the combinations of n ∈ {10, 40}, p ∈ {50, 100}, and ρ ∈ {0.3, 0.6, 0.9}. All MSEs are rounded to integers, and the minimum MSE in each row is highlighted.

Figures 1 and 2 display the log(MSE) of the eight methods for different levels of dependence, dimension and sample size. From these figures, we have the following findings. When the covariates are uncorrelated, IDE gives the best performance under a high dimension (e.g., p=300). However, if the dimension is not large (e.g., p=50), and the covariates are uncorrelated or weakly correlated, shrinking the covariance matrix toward an identity matrix leads to a better performance under a small sample size. This is because when the sample size is small, the variances of the entries of the sample covariance matrix are large. Hence, CSIE and USIE stabilize both diagonal and off-diagonal entries and, at the same time, an identity target possesses an explicit structure which in turn requires little data to fit. Consequently, the resulting estimators have a good bias–variance tradeoff. In addition, when the correlation and dimension are both large, imposing additional structure assumptions is necessary. Under this situation, ATE and POET turn out to be the best two methods among the eight methods unless the sample size is relatively small. When the sample size is small, the pattern of ATE is very similar to that of DE. When the sample size and dimension are both large, ATE outperforms all other methods except for POET.

Figure 3 displays the performance of the eight methods for different levels of dependence with p = 300. The pattern is consistent with Figure 2. In particular, when the correlation and sample size are large, the performance of POET is satisfactory. From Figures 1 and 2, however, we note that the log(MSE) of POET tends to oscillate as the sample size increases. This may be because POET depends on the estimated number of factors K. In [54], the authors used a consistent estimator for K and showed that POET is robust to an over-estimated number of factors under the spectral norm. Our simulations in Table 1, however, show that this robustness for estimating the covariance matrix may no longer hold when the purpose is to estimate the determinant. In particular, for small sample sizes, either an over-estimated or an under-estimated K leads to a large bias in the determinant estimator.

Figure 4: Log MSEs for data from the mixture normal distribution with p = 50 and ρ ranging from 0 to 0.9. In all figures, “1” to “8” represent the eight methods: DE [38], IDE, USIE [33], CSIE [7], USDE [33], CSDE [7], ATE [49], and POET [54], respectively.

Figure 5: Log MSEs for data from the mixture normal distribution with p = 300 and ρ ranging from 0 to 0.9. In all figures, “1” to “8” represent the eight methods: DE [38], IDE, USIE [33], CSIE [7], USDE [33], CSDE [7], ATE [49], and POET [54], respectively.

3.2 Mixture normal data

In this setup, we consider a mixture model where the random vectors are generated from

X \sim \alpha_1 f_1(X) + \alpha_2 f_2(X),

where f1(X) and f2(X) are the density functions of Np(μ3,Σ3) and Np(μ4,Σ4), respectively. For the covariance matrices, we consider a sparse block diagonal structure as follows:

\Sigma_3 = D^{1/2} R(\rho) D^{1/2} \quad \text{and} \quad \Sigma_4 = D^{1/2} R(-\rho) D^{1/2},

where $D = \mathrm{diag}(\sigma_1^2,\ldots,\sigma_p^2)$ with the $\sigma_j^2$ being i.i.d. from the distribution $\chi_5^2/5$, and $R(\rho)$ is the same as in Section 3.1. For simplicity, we set $\alpha_1 = \alpha_2 = 1/2$ and $\mu_3 = \mu_4 = 0$. Under this setting, the covariance matrix of X simplifies to $(\Sigma_3 + \Sigma_4)/2$, which is a highly sparse matrix whose odd off-diagonals within the diagonal blocks are zero. We set (p, q) = (50, 5) or (300, 10), and ρ = 0, 0.3, 0.6 or 0.9.

Figure 6: Log MSEs for data from the heavy-tailed distribution with p = 50 and ρ ranging from 0 to 0.9. In all figures, “1” to “8” represent the eight methods: DE [38], IDE, USIE [33], CSIE [7], USDE [33], CSDE [7], ATE [49], and POET [54], respectively.

Figure 7: Log MSEs for data from the heavy-tailed distribution with p = 300 and ρ ranging from 0 to 0.9. In all figures, “1” to “8” represent the eight methods: DE [38], IDE, USIE [33], CSIE [7], USDE [33], CSDE [7], ATE [49], and POET [54], respectively.

Figures 4 and 5 display the log(MSE) of the eight methods under different levels of dependence and sample size. When the sample size is large and the covariates are uncorrelated, IDE gives the best performance. When the sample size is small and the dimension is not large (e.g., n = 5, p = 50), shrinking the covariance matrix toward an identity matrix (e.g., USIE and CSIE) outperforms the other methods unless the correlation is very large (e.g., ρ = 0.9). However, when the sample size and dimension are both large, the shrinkage methods become suboptimal. Instead, if the correlation is also large (e.g., ρ = 0.6), ATE and POET outperform the other methods in most settings. As mentioned above, the performance of POET is not stable and may not be satisfactory when the sample size is not large.

3.3 Heavy-tailed data

In this setup, we simulate heavy-tailed data from a log-normal distribution, $\mathrm{lnN}(\mu, \sigma^2)$, whose mean and variance are $e^{\mu + \sigma^2/2}$ and $(e^{\sigma^2} - 1)e^{2\mu + \sigma^2}$, respectively. First, we generate n independent random vectors $Z_i = (z_{i1},\ldots,z_{ip})^T$, where all components of $Z_i$ are sampled independently from $\mathrm{lnN}(0, 1)$. Let $X_i = \Sigma^{1/2} Z_i^{\ast}$ with $Z_i^{\ast} = (z_{i1} - e^{1/2},\ldots,z_{ip} - e^{1/2})^T/\{e(e-1)\}^{1/2}$, where Σ is a p×p positive definite matrix. Consequently, the mean vector and covariance matrix of $X_i$ are $0_{p\times 1}$ and $\Sigma_{p\times p}$, respectively. For the covariance matrix, we consider the block diagonal structure described in Section 3.1. We set (p, q) = (50, 5) or (300, 10), and ρ = 0, 0.3, 0.6 or 0.9.
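
A Python sketch of this data-generating step is given below; the symmetric square root is computed with scipy.linalg.sqrtm, and the function name is ours.

import numpy as np
from scipy.linalg import sqrtm

def lognormal_data(n, sigma, rng):
    # X_i = Sigma^{1/2} Z_i*, where Z_i* standardizes i.i.d. lnN(0, 1) entries:
    # (z - e^{1/2}) / sqrt(e * (e - 1)), so each X_i has mean 0 and covariance Sigma.
    p = sigma.shape[0]
    z = rng.lognormal(mean=0.0, sigma=1.0, size=(n, p))
    z_std = (z - np.exp(0.5)) / np.sqrt(np.e * (np.e - 1.0))
    root = np.real(sqrtm(sigma))                 # symmetric square root Sigma^{1/2}
    return z_std @ root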

Figures 6 and 7 display the log(MSE) of the eight methods under different levels of dependence and sample size. When the dimension and correlation are both small, USIE and CSIE outperform the other methods. The reason is similar to the discussion in Section 3.1: heavy-tailed data may lead to unstable estimates of the entries of Σ, so shrinking toward a simple identity target, which requires little data to fit, stabilizes the sample covariance matrix. In addition, as shown in Figure 7, when the dimension is large and the correlation is not small, ATE and POET outperform the other methods unless the sample size is small. Finally, we also note that IDE does not provide a satisfactory performance even when the covariates are uncorrelated. As demonstrated in Theorem 1, the IDE estimator is derived under the normal distribution and may not be robust to heavy-tailed data.

3.4 Degenerate normal data

To further investigate the performance of the eight methods, we consider a non-positive definite covariance matrix in which the positive definite assumption of the covariance matrix is violated. Note that this new setting is often overlooked in the literature. To construct a non-positive definite covariance matrix, we define the affine transformation C as

C = \begin{pmatrix} I_{p-1} & 0_{(p-1)\times 1} \\ c^T & 0 \end{pmatrix}_{p\times p}, \qquad c^T = \Big(0,\; 0,\; 0,\; \tfrac{1}{p-4},\; \ldots,\; \tfrac{1}{p-4}\Big),

where $I_{p-1}$ is the $(p-1)\times(p-1)$ identity matrix. We then apply this affine transformation to the covariance matrix $\Sigma_2$ from Section 3.1 and form

\Sigma_5 = C\,\Sigma_2\,C^T.
Figure 8: Log MSEs for data from the degenerate normal distribution with p = 50. The sample size ranges from 5 to 50. In all figures, “1” to “8” represent the eight methods: DE [38], IDE, USIE [33], CSIE [7], USDE [33], CSDE [7], ATE [49], and POET [54], respectively.

It is obvious that $|\Sigma_5| = 0$ since $|C| = 0$. We set (p, q) = (50, 5), and ρ = 0, 0.3, 0.6 or 0.9. Note that the log-determinant of $\Sigma_5$ is negative infinity. Hence, for this degenerate setting, the MSE is defined on the determinant rather than on the log-determinant. Specifically, it is

\mathrm{MSE}(e^{\theta}, e^{\hat{\theta}}) = \frac{1}{Mp}\sum_{m=1}^{M}\left(e^{\hat{\theta}_m} - e^{\theta}\right)^2.

Figure 8 shows the log(MSE) of all eight methods for different levels of dependence and sample size. We can see that the simulation results are different from those in the previous three setups. POET gives the best performance among the eight methods. In addition, we note that, under the non-positive definite setting, POET performs extremely well when the sample size is very small. For this phenomenon, we explore the possible reasons in the next paragraph.

To estimate Σ, [54] applied the spectral decomposition on the sample covariance matrix:

S_n = \sum_{j=1}^{Q}\hat{\lambda}_j\hat{\xi}_j\hat{\xi}_j^T + \hat{R}_Q.

If the sample size is much smaller than the dimension p, most eigenvalues of $S_n$ are zero. This means that $\hat{R}_Q$, the principal orthogonal complement of the largest Q eigenvalues, is nearly a zero matrix. Consequently, the final POET estimator, $\hat{\Sigma}_{\hat{Q}} = \sum_{j=1}^{\hat{Q}}\hat{\lambda}_j\hat{\xi}_j\hat{\xi}_j^T + \hat{R}_{\hat{Q}}^{\mathcal{T}}$, tends to be nearly degenerate when the sample size is small, which is not the case for large sample sizes.

Finally, it is noteworthy that when the correlation is strong, the log(MSE) of POET also fluctuates as the sample size increases. This again verifies that both the correlation and the sample size have a large impact on the performance of POET.

3.5 Real data

In this setup, we generate a realistic covariance matrix from the Myeloma data [57], a real microarray data set with a total of 54,675 genes, 351 samples in the first group, and 208 samples in the second group. To generate the covariance matrix, we first select 100 genes randomly from the first group and then compute the sample covariance matrix of the selected genes, denoted by $\Sigma_r$. Next, to evaluate the performance of the estimators under different levels of dependence, we follow [58] and define the true covariance matrix as

\Sigma_1 = (1-\rho)\,\mathrm{diag}(\Sigma_r) + \rho\,\Sigma_r,

where ρ controls the level of dependence. We set ρ=0, 1/3, 2/3 or 1. Note that ρ=0 corresponds to a diagonal covariance matrix, and ρ=1 treats the generated sample covariance matrix as the true covariance matrix.
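
A one-line Python sketch of this construction, with Sigma_r the sample covariance matrix of the 100 selected genes, is given below; the function name is ours.

import numpy as np

def blended_covariance(sigma_r, rho):
    # (1 - rho) * diag(Sigma_r) + rho * Sigma_r: rho = 0 gives a diagonal matrix,
    # rho = 1 uses the full sample covariance of the selected genes.
    return (1.0 - rho) * np.diag(np.diag(sigma_r)) + rho * sigma_r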

Figure 9 shows the log(MSE) of the eight methods for different levels of dependence and sample size. The comparison results are summarized as follows. When the sample size and correlation are both small, the methods that shrink the covariance matrix toward the identity matrix (e.g., USIE and CSIE) perform well. When the covariates are uncorrelated and the sample size is large, IDE has the best performance. In addition, when the sample size is large and the correlation is moderate (e.g., n = 80 and ρ = 2/3), shrinking the sample covariance matrix toward a diagonal target matrix (e.g., USDE and CSDE) performs well. When the correlation and sample size are both large, ATE outperforms or is at least comparable to USDE and CSDE. Finally, POET is not stable and is very sensitive to both the correlation and the sample size. When the correlation and sample size are not large, POET may fail to provide a satisfactory performance owing to its much larger bias compared with the other methods.

Figure 9: Log MSEs for real data with p = 100. The sample size ranges from 10 to 80. In all figures, “1” to “8” represent the eight methods: DE [38], IDE, USIE [33], CSIE [7], USDE [33], CSDE [7], ATE [49], and POET [54], respectively.

4 Conclusion

In this section, we summarize some useful findings of the comparison results and also provide some practical guidelines for researchers.

  1. Diagonal estimation

    The diagonal estimator, DE, is the simplest method for estimating the determinant of a high-dimensional covariance matrix. It assumes that all covariates are uncorrelated. For independent normal data, IDE is an unbiased estimator of log|diag(Σ)| and also provides the best performance, especially when the dimension is large. For such settings, IDE can be recommended for estimating the determinant of a high-dimensional covariance matrix. We note, however, that IDE is not robust and may lead to an unsatisfactory performance when the independent normal assumption is violated.

    Table 2

    Computation time for $\hat{\theta}$ with DE [38], IDE, USIE [33], CSIE [7], USDE [33], CSDE [7], ATE [49], and POET [54], respectively. For ATE and POET, the tuning parameter was selected by 5-fold cross-validation. The data are generated as described in Section 3.1. Timings (in seconds) are for 10 runs on an Intel Core(TM) 3.20 GHz processor.

    n = 100, p = 300 | DE | IDE | USIE | CSIE | USDE | CSDE | ATE | POET
    ρ = 0.0 | 0.52 | 0.59 | 16.3 | 0.70 | 15.8 | 0.71 | 258 | 359
    ρ = 0.9 | 0.50 | 0.55 | 16.1 | 0.71 | 16.0 | 0.67 | 259 | 361

  2. Shrinkage estimation

    For the shrinkage estimation, different choices of the target matrix and shrinkage parameter result in different performances for the determinant estimation. In general, when the dimension is not large (e.g., p = 50), shrinkage toward an identity target matrix (e.g., CSIE and USIE) performs well under small sample sizes and weak correlation. This pattern is more evident for heavy-tailed data. With a diagonal target matrix, CSDE, the consistent estimator of [7], has a performance similar to that of USDE. However, CSDE and USDE are seldom the best methods, especially when the sample size is not large.

    For the shrinkage estimators, the optimal shrinkage intensity can be specified without any further tuning parameters. Consequently, time-consuming procedures such as cross-validation or the bootstrap can be avoided. Table 2 shows the computational time of the eight methods. As we can see, the shrinkage methods are much faster than ATE and POET. More importantly, if the sample size is as small as n = 5 or 10, selecting the tuning parameters in ATE and POET by cross-validation may result in a large bias. In this situation, the shrinkage estimators (e.g., shrinkage toward an explicit target matrix) can be very attractive. Nevertheless, as the sample size increases or when the correlation is strong, the performance of the shrinkage methods may not be as competitive as that of the sparse method and the factor model method.

  3. Sparse estimation

    ATE is robust across our settings. Specifically, when the sample size is not very small, ATE performs better than or comparably to the other seven methods under various data structures and different levels of dependence. In practice, if the sample size is not very small and we have no prior information about the dependence level of the covariates, the sparse estimator can be recommended for estimating the determinant of a high-dimensional covariance matrix.

    As shown in the simulations, when the sample size is very small, the performance of ATE is not as attractive as that of the shrinkage estimators or even the diagonal estimators. One possible reason is that ATE requires adaptive thresholding parameters in practice; when the sample size is very small, the proposed cross-validation method may not provide a reliable estimate of the optimal threshold values.

  4. Factor model estimation

    The factor model estimator, POET, is very attractive for strongly correlated data sets when the sample size is not small. Fan et al. [54] assumed that the data are only weakly correlated after the common factors, which can induce high levels of dependence among the covariates, are extracted. This implies that POET may provide a good performance if the data are strongly correlated. Note also that POET can select K = 0 automatically if the true covariance matrix is sparse; consequently, their method then degenerates to a sparse estimator such as the hard thresholding estimator in [8] or ATE in [49].

    POET, however, depends on the number of factors K, which is unknown in practice. To investigate the impact of the number of factors under different sample sizes and different levels of dependence, we simulated the MSE of POET for the log-determinant of the covariance matrix under the setup of Section 3.1. Results in Table 1 show that K has a large impact on the determinant estimation. When the correlation is strong, $\hat{K}$, a consistent estimator of K, usually leads to a large MSE. [54] demonstrated that POET is robust to over-estimated and sensitive to under-estimated factors; for finite sample sizes, they suggested choosing a relatively large K (e.g., not less than 8). However, our simulation studies show that this robustness for estimating the covariance matrix may no longer hold when estimating the determinant. In particular, for small sample sizes, both an under-estimated and an over-estimated number of factors degrade the performance of POET. In view of this, we believe that future research is needed on selecting the optimal K when the factor model method is applied to estimate the determinant of the covariance matrix.

To conclude, the sample size, the dependence level, and the dimension of the data set have a great impact on the accuracy of the estimation. In practice, we may need to select an appropriate estimation method according to the sample size and the prior information on the correlation structure of the covariates. When such prior information is not available, we recommend using ATE [49] to estimate the determinant of a high-dimensional covariance matrix, as it is robust to various correlations and data structures.

5 Discussion

In this paper, we have compared a total of eight methods for estimating the log-determinant of a high-dimensional covariance matrix. The performance of the eight methods depends on the sample size, the dependence structure, and the dimension of the data. When the sample size is not small, we note that ATE [49] is always able to provide an average or above-average performance among the eight methods. Hence, if there is little prior information about the structure of the covariance matrix, we recommend using ATE to estimate the log-determinant θ, or the GV, in practice. In terms of computational time, the shrinkage methods are more convenient than ATE and POET because the latter two methods need to select their penalty parameters via cross-validation.

Since the log-determinant of a covariance matrix is a scalar, the two-step procedure may not provide the best estimation of θ. One possible future direction is to circumvent the full covariance matrix estimation and estimate the log-determinant directly. Note that $\log|\Sigma| = \mathrm{tr}(\log\Sigma)$, which is essentially a sum of the log-eigenvalues of Σ. This suggests that random matrix theory or spectrum analysis may provide feasible solutions for estimating the log-determinant more accurately. The comparison study in this paper may also serve as a proxy to assess the performance of covariance matrix estimation. Specifically, from the perspective of the loss function, if we define the loss function as

\mathrm{Loss}(\hat{\Sigma}, \Sigma) = (\log|\hat{\Sigma}| - \log|\Sigma|)^2 \quad \text{or} \quad \mathrm{Loss}(\hat{\Sigma}, \Sigma) = (|\hat{\Sigma}| - |\Sigma|)^2,

then the simulations conducted in Section 3 essentially provide a comparison of the eight methods for estimating Σ rather than θ. Of course, we do not intend to claim that the above loss functions should always be recommended. In contrast, for evaluating covariance matrix estimation, other popular criteria are also available in the literature. For instance, letting L be the likelihood function and $\hat{L}$ the corresponding estimated likelihood, we may consider the distance between the log-likelihood and the estimated log-likelihood as the evaluation criterion:

D(L, \hat{L}) = \{\log(L) - \log(\hat{L})\}^2.

In addition, we can also consider any of the following loss functions:

  1. $\mathrm{Loss}(\hat{\Sigma}, \Sigma) = \|\hat{\Sigma} - \Sigma\|_2 = \sqrt{\lambda_{\max}\{(\hat{\Sigma} - \Sigma)^T(\hat{\Sigma} - \Sigma)\}}$, where $\lambda_{\max}(\cdot)$ denotes the maximum eigenvalue [46, 47, 59].

  2. $\mathrm{Loss}(\hat{\Sigma}, \Sigma) = \|\hat{\Sigma} - \Sigma\|_F = \sqrt{\sum_{i,j}(\hat{\sigma}_{ij} - \sigma_{ij})^2}$, where $\Sigma = (\sigma_{ij})_{p\times p}$ and $\hat{\Sigma} = (\hat{\sigma}_{ij})_{p\times p}$ [49, 54].

  3. $\mathrm{Loss}(\hat{\Sigma}, \Sigma) = \|\hat{\Sigma} - \Sigma\|_{\max} = \max_{i,j}|\hat{\sigma}_{ij} - \sigma_{ij}|$ [54].

Further research is needed to investigate which loss function provides the best criterion for evaluating the estimation methods of the covariance matrix.
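
For reference, the three matrix-norm losses listed above can be computed in Python as follows; the function name is ours.

import numpy as np

def covariance_losses(sigma_hat, sigma):
    diff = sigma_hat - sigma
    spectral = np.linalg.norm(diff, ord=2)        # spectral norm (largest singular value)
    frobenius = np.linalg.norm(diff, ord="fro")   # Frobenius norm
    max_norm = np.max(np.abs(diff))               # entry-wise max norm
    return spectral, frobenius, max_norm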

Finally, it is noteworthy that there is another category of publications in the literature on computing the log-determinant of the covariance matrix [53, 60, 61, 62, 63, 64, 65]. We point out that these works are very different from the study in our paper. Specifically, these papers assume that the covariance matrix Σ is known; yet when the dimension is very large, the canonical methods (e.g., the Cholesky decomposition) for computing log|Σ| require a total of $O(p^3)$ operations and may not be feasible in practice. The above papers have proposed more efficient algorithms, including random matrix theory and spectrum analysis, for fast computation of log|Σ|.

Appendix

A proof of Theorem 1

(1) From $s_j^2 = \sigma_j^2\chi_{\nu,j}^2/\nu$, we have $\log s_j^2 = \log\sigma_j^2 + \log(\chi_{\nu,j}^2/\nu)$. Then $\sum_{j=1}^{p}\log s_j^2 = \sum_{j=1}^{p}\log\sigma_j^2 + \sum_{j=1}^{p}\log\chi_{\nu,j}^2 - p\log\nu$. Further, since $E\log\chi_{\nu,j}^2 = \log 2 + \psi(\nu/2)$,

E\left(\sum_{j=1}^{p}\log s_j^2\right) = \sum_{j=1}^{p}\log\sigma_j^2 + p\{\log 2 + \psi(\nu/2)\} - p\log\nu.

This leads to

E\left\{\hat{\theta}^{(1)} + C_0\right\} = E\left(\sum_{j=1}^{p}\log s_j^2\right) - p\{\log 2 + \psi(\nu/2)\} + p\log\nu = \sum_{j=1}^{p}\log\sigma_j^2 = \log|\mathrm{diag}(\Sigma)|.

Hence, θˆ(1)+C0 is an unbiased estimator of log|diag(Σ)|.

(2) Since $E|\log\sigma_1^2| < \infty$, the strong law of large numbers gives

\frac{1}{p}\sum_{j=1}^{p}\log\sigma_j^2 \xrightarrow{\text{a.s.}} E(\log\sigma_1^2) \quad \text{as } p \to \infty.

Since $E(\log s_1^2) = E\{E(\log s_1^2 \mid \sigma_1^2)\} = E(\log\sigma_1^2) + \log(2/\nu) + \psi(\nu/2)$, we also have

\frac{1}{p}\sum_{j=1}^{p}\log s_j^2 - \log(2/\nu) - \psi(\nu/2) \xrightarrow{\text{a.s.}} E(\log\sigma_1^2) \quad \text{as } p \to \infty.

Combining the above two results yields

\frac{1}{p}\sum_{j=1}^{p}\log s_j^2 - \log(2/\nu) - \psi(\nu/2) - \frac{1}{p}\sum_{j=1}^{p}\log\sigma_j^2 \xrightarrow{\text{a.s.}} 0 \quad \text{as } p \to \infty.

Finally, we have

\frac{1}{p}\left\{\hat{\theta}^{(1)} + C_0 - \log|\mathrm{diag}(\Sigma)|\right\} = \frac{1}{p}\sum_{j=1}^{p}\log s_j^2 - \log(2/\nu) - \psi(\nu/2) - \frac{1}{p}\sum_{j=1}^{p}\log\sigma_j^2 \xrightarrow{\text{a.s.}} 0 \quad \text{as } p \to \infty.

Acknowledgements:

Tiejun Tong’s research was supported by the National Natural Science Foundation of China grant (No. 11671338), and the Hong Kong Baptist University grants FRG2/15-16/019, FRG2/15-16/038 and FRG1/16-17/018. The authors thank the editor, the associate editor and two reviewers for their constructive comments that have led to a substantial improvement of the paper.

References

1. Kaur S, Archer KJ, Devi MG, Kriplani A, Strauss JF, Singh R. Differential gene expression in granulosa cells from polycystic ovary syndrome patients with and without insulin resistance: identification of susceptibility gene sets through network analysis. J Clin Endocrinol Metab 2012;97:E2016–E2021. doi: 10.1210/jc.2011-3441.

2. Kuster DW, Merkus D, Kremer A, van IJcken WF, de Beer VJ, Verhoeven AJ, et al. Left ventricular remodeling in swine after myocardial infarction: a transcriptional genomics approach. Basic Res Cardiol 2011;106:1269–1281. doi: 10.1007/s00395-011-0229-1.

3. Mokry M, Hatzis P, Schuijers J, Lansu N, Ruzius FP, Clevers H, et al. Integrated genome-wide analysis of transcription factor occupancy, RNA polymerase II binding and steady-state RNA levels identify differentially regulated functional gene classes. Nucleic Acids Res 2012;40:148–158. doi: 10.1093/nar/gkr720.

4. Richard AC, Lyons PA, Peters JE, Biasci D, Flint SM, Lee JC, et al. Comparison of gene expression microarray data with count-based RNA measurements informs microarray interpretation. BMC Genomics 2014;15:649–659. doi: 10.1186/1471-2164-15-649.

5. Schurch NJ, Schofield P, Gierliński M, Cole C, Sherstnev A, Singh V, et al. How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA 2016;22:839–851. doi: 10.1261/rna.053959.115.

6. Ledoit O, Wolf M. Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. J Empirical Finance 2003;10:603–621. doi: 10.1016/S0927-5398(03)00007-0.

7. Fisher TJ, Sun X. Improved Stein-type shrinkage estimators for the high-dimensional multivariate normal covariance matrix. Comput Stat Data Anal 2011;55:1909–1918. doi: 10.1016/j.csda.2010.12.006.

8. Bickel PJ, Levina E. Covariance regularization by thresholding. Ann Stat 2008;36:2577–2604. doi: 10.1214/08-AOS600.

9. Cai T, Yuan M. Adaptive covariance matrix estimation through block thresholding. Ann Stat 2012;40:2014–2042. doi: 10.1214/12-AOS999.

10. Rothman AJ. Positive definite estimators of large covariance matrices. Biometrika 2012;99:733–740. doi: 10.1093/biomet/ass025.

11. Cai T, Ren Z, Zhou H. Optimal rates of convergence for estimating Toeplitz covariance matrices. Probab Theory Relat Fields 2013;156:101–143. doi: 10.1007/s00440-012-0422-7.

12. Chen X, Xu M, Wu WB. Covariance and precision matrix estimation for high-dimensional time series. Ann Stat 2013;41:2994–3021. doi: 10.1214/13-AOS1182.

13. Basu S, Michailidis G. Regularized estimation in sparse high-dimensional time series models. Ann Stat 2015;43:1535–1567. doi: 10.1214/15-AOS1315.

14. Tong T, Wang C, Wang Y. Estimation of variances and covariances for high-dimensional data: a selective review. WIREs Comput Stat 2014;6:255–264. doi: 10.1002/wics.1308.

15. Cai T, Ren Z, Zhou H. Estimating structured high-dimensional covariance and precision matrices: optimal rates and adaptive estimation. Electron J Stat 2016;10:1–59. doi: 10.1214/15-EJS1081.

16. Fan J, Liao Y, Liu H. An overview of the estimation of large covariance and precision matrices. Econometrics J 2016;19:C1–C32. doi: 10.1111/ectj.12061.

17. Wilks SS. Certain generalizations in the analysis of variance. Biometrika 1932;24:471–494. doi: 10.1093/biomet/24.3-4.471.

18. Wilks S. Multidimensional statistical scatter. In: Anderson TW, editor. Collected papers: contributions to mathematical statistics. New York: John Wiley & Sons, 1967:597–614.

19. Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika 2007;94:19–35. doi: 10.1093/biomet/asm018.

20. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 2008;9:432–441. doi: 10.1093/biostatistics/kxm045.

21. Banerjee O, El Ghaoui L, d’Aspremont A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J Mach Learn Res 2008;9:485–516.

22. Witten DM, Tibshirani R. Covariance-regularized regression and classification for high dimensional problems. J R Stat Soc Ser B 2009;71:615–636. doi: 10.1111/j.1467-9868.2009.00699.x.

23. Ravikumar P, Wainwright MJ, Raskutti G, Yu B. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron J Stat 2011;5:935–980. doi: 10.1214/11-EJS631.

24. Yin J, Li H. Adjusting for high-dimensional covariates in sparse precision matrix estimation by ℓ1-penalization. J Multivariate Anal 2013;116:365–381. doi: 10.1016/j.jmva.2013.01.005.

25. Bishop CM. Pattern recognition and machine learning. New York: Springer, 2006.

26. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. New York: Springer, 2002. doi: 10.1007/978-0-387-21606-5.

27. Rousseeuw PJ. Multivariate estimation with high breakdown point. Math Stat Appl 1985;8:283–297. doi: 10.1007/978-94-009-5438-0_20.

28. Rousseeuw PJ, Driessen KV. A fast algorithm for the minimum covariance determinant estimator. Technometrics 1999;41:212–223. doi: 10.1080/00401706.1999.10485670.

29. Ro K, Zou C, Wang Z, Yin G. Outlier detection for high-dimensional data. Biometrika 2015;102:589–599. doi: 10.1093/biomet/asv021.

30. Boudt K, Rousseeuw P, Vanduffel S, Verdonck T. The minimum regularized covariance determinant estimator, 2017. arXiv preprint arXiv:1701.07086. doi: 10.2139/ssrn.2905259.

31. Anderson TW. An introduction to multivariate statistical analysis. New York: Wiley, 1984.

32. Tsai CA, Chen JJ. Multivariate analysis of variance test for gene set analysis. Bioinformatics 2009;25:897–903. doi: 10.1093/bioinformatics/btp098.

33. Schäfer J, Strimmer K. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat Appl Genet Mol Biol 2005;4:32. doi: 10.2202/1544-6115.1175.

34. Ullah I, Jones B. Regularised MANOVA for high-dimensional data. Aust N Z J Stat 2015;57:377–389. doi: 10.1111/anzs.12126.

35. Chiu TY, Leonard T, Tsui KW. The matrix-logarithmic covariance model. J Am Stat Assoc 1996;91:198–210. doi: 10.1080/01621459.1996.10476677.

36. Cai T, Liang T, Zhou H. Law of log determinant of sample covariance matrix and optimal estimation of differential entropy for high-dimensional Gaussian distributions. J Multivariate Anal 2015;137:161–172. doi: 10.1016/j.jmva.2015.02.003.

37. Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 2002;97:77–87. doi: 10.1198/016214502753479248.

38. Bickel PJ, Levina E. Some theory of Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations. Bernoulli 2004;10:989–1010. doi: 10.3150/bj/1106314847.

39. Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 2001;17:509–519. doi: 10.1093/bioinformatics/17.6.509.

40. Wright GW, Simon RM. A random variance model for detection of differential gene expression in small microarray experiments. Bioinformatics 2003;19:2448–2455. doi: 10.1093/bioinformatics/btg345.

41. Cui X, Hwang JT, Qiu J, Blades NJ, Churchill GA. Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics 2005;6:59–75. doi: 10.1093/biostatistics/kxh018.

42. Tong T, Wang Y. Optimal shrinkage estimation of variances with applications to microarray data analysis. J Am Stat Assoc 2007;102:113–122. doi: 10.1198/016214506000001266.

43. Tong T, Jang H, Wang Y. James-Stein type estimators of variances. J Multivariate Anal 2012;107:232–243. doi: 10.1016/j.jmva.2012.01.019.

44. Warton DI. Penalized normal likelihood and ridge regularization of correlation and covariance matrices. J Am Stat Assoc 2008;103:340–349. doi: 10.1198/016214508000000021.

45. Warton DI. Regularized sandwich estimators for analysis of high-dimensional data using generalized estimating equations. Biometrics 2011;67:116–123. doi: 10.1111/j.1541-0420.2010.01438.x.

46. Karoui NE. Operator norm consistent estimation of large-dimensional sparse covariance matrices. Ann Stat 2008;36:2717–2756. doi: 10.1214/07-AOS559.

47. Rothman AJ, Levina E, Zhu J. Generalized thresholding of large covariance matrices. J Am Stat Assoc 2009;104:177–186. doi: 10.1198/jasa.2009.0101.

48. Lam C, Fan J. Sparsistency and rates of convergence in large covariance matrix estimation. Ann Stat 2009;37:42–54. doi: 10.1214/09-AOS720.

49. Cai T, Liu W. Adaptive thresholding for sparse covariance matrix estimation. J Am Stat Assoc 2011;106:672–684. doi: 10.1198/jasa.2011.tm10560.

50. Cai T, Zhou H. Optimal rates of convergence for sparse covariance matrix estimation. Ann Stat 2012;40:2389–2420. doi: 10.1214/12-AOS998.

51. Mitra R, Zhang C. Multivariate analysis of nonparametric estimates of large correlation matrices, 2014. arXiv preprint arXiv:1403.6195.

52. Wang T, Berthet Q, Samworth RJ. Statistical and computational trade-offs in estimation of sparse principal components. Ann Stat 2016;44:1896–1930. doi: 10.1214/15-AOS1369.

53. Barry RP, Pace RK. Monte Carlo estimates of the log determinant of large sparse matrices. Linear Algebra Appl 1999;289:41–54. doi: 10.1016/S0024-3795(97)10009-X.

54. Fan J, Liao Y, Mincheva M. Large covariance estimation by thresholding principal orthogonal complements (with discussion). J R Stat Soc Ser B 2013;75:603–680. doi: 10.1111/rssb.12016.

55. Guo Y, Hastie T, Tibshirani R. Regularized linear discriminant analysis and its application in microarrays. Biostatistics 2007;8:86–100. doi: 10.1093/biostatistics/kxj035.

56. Pang H, Tong T, Zhao H. Shrinkage-based diagonal discriminant analysis and its applications in high-dimensional data. Biometrics 2009;65:1021–1029. doi: 10.1111/j.1541-0420.2009.01200.x.

57. Zhan F, Barlogie B, Arzoumanian V, Huang Y, Williams DR, Hollmig K, et al. Gene-expression signature of benign monoclonal gammopathy evident in multiple myeloma is linked to good prognosis. Blood 2007;109:1692–1700. doi: 10.1182/blood-2006-07-037077.

58. Tong T, Feng Z, Hilton JS, Zhao H. Estimating the proportion of true null hypotheses using the pattern of observed p-values. J Appl Stat 2013;40:1949–1964. doi: 10.1080/02664763.2013.800035.

59. Fan J, Liao Y, Mincheva M. High dimensional covariance matrix estimation in approximate factor models. Ann Stat 2011;39:3320–3356. doi: 10.1214/11-AOS944.

60. Boutsidis C, Drineas P, Kambadur P, Kontopoulou E-M, Zouzias A. A randomized algorithm for approximating the log determinant of a symmetric positive definite matrix. Linear Algebra Appl 2017, in press. doi: 10.1016/j.laa.2017.07.004.

61. Fitzsimons J, Cutajar K, Osborne M, Roberts S, Filippone M. Bayesian inference of log determinants, 2017a. arXiv preprint arXiv:1704.01445.

62. Fitzsimons J, Granziol D, Cutajar K, Osborne M, Filippone M, Roberts S. Entropic trace estimates for log determinants, 2017b. arXiv preprint arXiv:1704.07223. doi: 10.1007/978-3-319-71249-9_20.

63. Han I, Malioutov D, Shin J. Large-scale log-determinant computation through stochastic Chebyshev expansions. In: Proceedings of the 32nd International Conference on Machine Learning, 2015:908–917.

64. Peng W, Wang H. Large-scale log-determinant computation via weighted l2 polynomial approximation with prior distribution of eigenvalues. In: International Conference on High Performance Computing and Applications. Springer, 2015:120–125. doi: 10.1007/978-3-319-32557-6_12.

65. Zhang Y, Leithead WE. Approximate implementation of the logarithm of the matrix determinant in Gaussian process regression. J Stat Comput Simul 2007;77:329–348. doi: 10.1080/10629360600569279.

Received: 2017-2-7
Accepted: 2017-8-16
Published Online: 2017-9-21

© 2017 Walter de Gruyter GmbH, Berlin/Boston
