Promote sign consistency in the joint estimation of precision matrices

https://doi.org/10.1016/j.csda.2021.107210

Abstract

The Gaussian graphical model is a popular tool for inferring the relationships among random variables, where the precision matrix provides a natural interpretation of conditional independence. With high-dimensional data, sparsity of the precision matrix is often assumed, and various regularization methods have been applied for estimation. In several scenarios, it is desirable to conduct the joint estimation of multiple precision matrices. In joint estimation, entries corresponding to the same element of multiple precision matrices form a group, and group regularization methods have been applied for the estimation and identification of sparsity structures. In many practical examples, it can be difficult to interpret the results when parameters within the same group have conflicting signs. Unfortunately, existing methods lack an explicit mechanism regarding the sign consistency of group parameters. To tackle this problem, a novel regularization method is developed for the joint estimation of multiple precision matrices. It effectively enhances the sign consistency of group parameters and hence can lead to more interpretable results, while still allowing for conflicting signs to achieve full flexibility. The method's consistency properties are rigorously established. Simulations show that the proposed method outperforms competing alternatives under a variety of settings. For the two data examples, the proposed approach leads to interpretable results that differ from those of the alternatives.

Introduction

The Gaussian graphical model has emerged as a popular tool for studying the dependence among random variables (Friedman, 2004, Wainwright and Jordan, 2008). In the Gaussian graphical model analysis, the precision matrix plays a pivotal role: two variables are conditionally independent given all other variables if and only if the corresponding element in the precision matrix is zero. When the data dimensionality is high, sparsity of the precision matrix is often assumed, and identification of nonzero elements (edges) needs to be conducted along with estimation. For this identification, regularization is needed, with the most popular technique perhaps being lasso (Bickel and Levina, 2008, Friedman et al., 2008). To reduce bias caused by the ℓ1 penalty, Lam and Fan (2009) apply the nonconvex SCAD penalty. Other existing methods include neighborhood selection with the lasso (Meinshausen and Bühlmann, 2006), neighborhood Dantzig selection (Yuan, 2010), the alternating linearization method (Scheinberg et al., 2010), the penalized D-trace loss method (Zhang and Zou, 2014), constrained ℓ1 minimization (Cai et al., 2011), and others. We refer to Fan et al. (2016) for a comprehensive overview of sparse estimators.
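As a concrete illustration of the zero-entry characterization (a sketch, not from the paper; the chain coefficients are hypothetical), one can invert the covariance matrix of a Gaussian chain X1 → X2 → X3: since X1 and X3 are conditionally independent given X2, the (1, 3) entry of the precision matrix vanishes.

```python
import numpy as np

# Illustrative sketch: Gaussian chain X1 -> X2 -> X3 with hypothetical
# coefficients a, b and unit noise variances. X1 and X3 are conditionally
# independent given X2, so Theta[0, 2] is zero (up to floating point).
a, b = 0.8, 0.8
Sigma = np.array([
    [1.0,   a,                a * b],
    [a,     a**2 + 1,         b * (a**2 + 1)],
    [a * b, b * (a**2 + 1),   b**2 * (a**2 + 1) + 1],
])
Theta = np.linalg.inv(Sigma)  # precision matrix
print(np.round(Theta, 6))
print("Theta[0, 2] =", Theta[0, 2])  # ~0 up to floating point error
```

The tridiagonal pattern of `Theta` mirrors the edge set of the chain: only adjacent variables are conditionally dependent.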

The aforementioned studies as well as many others in the literature are mainly concerned with analyzing a single dataset. However, in many important scenarios, it is desirable to conduct joint estimation with multiple independent datasets. Below we present two examples in genetics, which match the data analyzed in Section 4, and note that similar examples arise in many other fields.

  • Example 1. In genetic data analysis, it has been recognized that the results generated from a single dataset are often unsatisfactory (Zhao et al., 2015). For many important problems, there have been multiple independent datasets generated with similar designs. Pooling and jointly analyzing these datasets can effectively increase sample size and estimation/identification accuracy (Ma et al., 2011).

  • Example 2. Different diseases can share common molecular mechanisms. To identify these commonalities, it is essential to jointly analyze multiple datasets. This kind of analysis has been conducted by Ma et al. (2012), Liu et al. (2013), and others.

The idea of integrating multiple data sources and learning parameter homogeneity/heterogeneity goes beyond the two examples above and has been studied extensively using penalized methods, for example by Tang and Song (2016), Cheng et al. (2015), Tang et al. (2020), and others. A popular way to integrate information from multiple datasets is meta-analysis. Some meta-analysis approaches operate on summary statistics, while individual participant data meta-analysis approaches directly analyze individual-level data (Riley et al., 2010). The random-effects model has been proposed to handle heterogeneity in magnitudes, but so far there has been a lack of work on promoting sign consistency. Although some individual participant data meta-analysis approaches can handle high-dimensional data, our literature review suggests that the penalization technique has not been commonly adopted. For example, Waldron et al. (2014) apply fixed-effects and random-effects models on the C-index, and Emura et al. (2018) propose to use a weighted sum of gene expressions for dimension reduction. We note that although the object of estimation in the aforementioned studies is not a precision matrix, the settings and considerations are expected to be similar for precision matrices.

In recent studies, the joint estimation of multiple precision matrices has been investigated. One framework is to incorporate a target matrix such that the estimation can be encouraged to shrink to this target matrix (Bilgrau et al., 2020), or update the precision matrix repeatedly over time with prior information available (van Wieringen et al., 2020). Another framework, which is the framework we adopt, is to jointly analyze multiple raw datasets without any accessible target matrix or prior information. Building on the maximum penalized likelihood, Chiquet et al. (2011) assume a common structure and encourage estimation of multiple networks towards this common structure. Guo et al. (2011) propose a hierarchical penalization approach to jointly estimate multiple graphical models. Danaher et al. (2014) examine the joint graphical lasso using group lasso and fused lasso penalties. Along this line, Saegusa and Shojaie (2016) take a Laplacian penalization approach. Zhu et al. (2014) develop a nonconvex method to search for a sparse representation for each matrix and identify clusters of entries across matrices. Different from the penalized likelihood approaches, Cai et al. (2016) develop a weighted constrained ℓ1 minimization approach. With all these methods, a popular strategy is to treat the elements corresponding to the same edge in multiple datasets as a group, and regularization is imposed on the (norms of) group parameters. Despite considerable successes, a common limitation shared by most of the existing studies is a lack of consideration of the relationships across datasets (matrices). In data analysis, it is often sensible to expect that the same edges have the same signs in different datasets although the magnitudes can be different.
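A minimal sketch of the grouping strategy described above, with hypothetical 2 × 2 estimates: the (j, k) entries of the M matrices form one group, a group-lasso-type penalty acts on each group's Euclidean norm, and a separate check flags groups whose members carry conflicting signs.

```python
import numpy as np

# Hypothetical sketch (not the paper's estimator): treat the (j, k)
# entries of M precision-matrix estimates as one group. Compute each
# group's L2 norm (as used by group-lasso-type penalties) and flag
# groups whose entries disagree in sign across datasets.
def group_norms_and_sign_conflicts(thetas):
    """thetas: list of M (p x p) symmetric matrices."""
    stack = np.stack(thetas)                  # shape (M, p, p)
    norms = np.sqrt((stack**2).sum(axis=0))   # groupwise L2 norm per edge
    signs = np.sign(stack)
    # conflict: one matrix has a positive entry, another a negative one
    conflict = (signs.max(axis=0) > 0) & (signs.min(axis=0) < 0)
    return norms, conflict

T1 = np.array([[1.0, 0.3], [0.3, 1.0]])
T2 = np.array([[1.2, -0.2], [-0.2, 1.1]])   # edge (0, 1) flips sign
norms, conflict = group_norms_and_sign_conflicts([T1, T2])
print(conflict[0, 1])   # True: the signs disagree across the two datasets
```

Existing group penalties shrink `norms` without looking at `conflict`; the motivation above is precisely that such sign conflicts are left unpenalized.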

In this study, we conduct the joint estimation of multiple precision matrices under high-dimensional settings. Motivated by the above discussions, our goal is to promote sign consistency across precision matrices while flexibly allowing for conflicting signs. The rest of the article is organized as follows. In Section 2, we develop a new joint estimation approach to promote sign consistency. An effective computational algorithm is developed, and the theoretical properties are rigorously established. In Section 3, we conduct extensive simulations and compare our approach with the competing alternatives. Two data examples are analyzed in Section 4. The article concludes with discussions in Section 5. Proofs and additional numerical results are presented in the Supplementary File.


Methods

Consider joint estimation with M independent datasets with sample sizes n_1, …, n_M, respectively, and denote n = n_1 + … + n_M. Assume the same set of p random variables in the M datasets. Denote X^(1), …, X^(M) as the M data matrices. Under the Gaussian graphical model framework, for X^(m), its ith row X_i^(m) is a p-dimensional random vector from N(0_{p×1}, Σ_0^(m)). Denote S^(m) as the maximum likelihood estimator of the covariance matrix, that is, S^(m) = (X^(m))^⊤ X^(m)/n_m. Let Θ^(m) = (Σ^(m))^{−1} be the precision matrix. Denote Θ_0^(m) = (
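The setup above can be sketched as follows (M, p, the sample sizes, and Σ_0^(m) are illustrative; since the data are mean zero, no centering is applied when forming S^(m)):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: M mean-zero Gaussian datasets over the same p
# variables, with S^(m) = (X^(m))^T X^(m) / n_m as each sample covariance.
M, p = 3, 5
sizes = [100, 120, 80]            # n_1, ..., n_M (hypothetical)
Sigma0 = np.eye(p)                # hypothetical common truth Sigma_0^(m)
X = [rng.multivariate_normal(np.zeros(p), Sigma0, size=n) for n in sizes]
S = [x.T @ x / n for x, n in zip(X, sizes)]  # no centering: mean is zero
n_total = sum(sizes)              # n = n_1 + ... + n_M
print(n_total, S[0].shape)
```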

Simulation

Simulation is conducted to assess the performance of the proposed approach (denoted as SIGN) and to compare it with three alternatives: (1) IND, which analyzes each dataset individually using MCP; IND is representative of single-dataset methods, and numerical studies in the literature have shown that MCP can outperform the popular lasso penalty. (2) PLAIN, which jointly analyzes multiple datasets using the composite MCP; this is equivalent to the proposed approach with λ2 = 0.
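For reference, a sketch of the MCP penalty underlying IND and PLAIN (the λ and γ values are illustrative): MCP coincides with the lasso near zero but levels off at γλ²/2, which reduces the bias on large coefficients.

```python
import numpy as np

# Sketch of the minimax concave penalty (MCP); lambda and gamma are
# illustrative tuning parameters, not values used in the paper.
# MCP(t) = lam*|t| - t^2/(2*gamma)  for |t| <= gamma*lam,
#        = gamma*lam^2/2            otherwise (constant: no further shrinkage).
def mcp(t, lam=1.0, gamma=3.0):
    t = np.abs(t)
    return np.where(t <= gamma * lam,
                    lam * t - t**2 / (2 * gamma),
                    gamma * lam**2 / 2)

print(mcp(0.0))   # 0.0
print(mcp(5.0))   # 1.5, i.e. capped at gamma*lam^2/2
```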

Lung cancer datasets

Lung cancer poses a major public health concern, with non-small cell lung cancer (NSCLC) accounting for the majority of lung cancer incidences. In 2019, an estimated 228,150 new cases of lung and bronchial cancer and 147,510 deaths were projected, ranking second in new cases and first in deaths among all cancer sites (Siegel et al., 2019). Gene profiling studies have been extensively conducted on lung cancer. We use three independent studies on non-small cell

Discussion

Under the Gaussian graphical model framework, extensive research has been conducted on estimating the precision matrix. The joint estimation of multiple precision matrices is desirable under several important settings and has attracted considerable attention in the recent literature. This is especially true for the high-dimensional setting but may also be needed for the “classic” low-dimensional setting. In most of the existing joint estimation research, there is a lack of direct attention on

Acknowledgments

We thank the editor and reviewers for careful reviews and insightful comments, which have led to a significant improvement of this article.

Funding

Zhang was supported by the National Natural Science Foundation of China (Grant No. 11971404; Basic Scientific Center, China Project No. 71988101); the Ministry of Education Project of Humanities and Social Sciences, China (Grant No. 19YJC910010); the 111 Project, China (Grant No. B13028).

References (44)

  • Fang, K., et al. Integrative sparse principal component analysis. J. Multivariate Anal. (2018).

  • van Wieringen, W.N., et al. Updating of the Gaussian graphical model through targeted penalized estimation. J. Multivariate Anal. (2020).

  • Barabási, A.-L., et al. Emergence of scaling in random networks. Science (1999).

  • Bickel, P.J., et al. Regularized estimation of large covariance matrices. Ann. Statist. (2008).

  • Bilgrau, A.E., et al. Targeted fused ridge estimation of inverse covariance matrices from multiple high-dimensional data classes. J. Mach. Learn. Res. (2020).

  • Breheny, P., et al. Penalized methods for bi-level variable selection. Stat. Interface (2009).

  • Cai, T., et al. Joint estimation of multiple high-dimensional precision matrices. Statist. Sinica (2016).

  • Cai, T., et al. A constrained ℓ1 minimization approach to sparse precision matrix estimation. J. Amer. Statist. Assoc. (2011).

  • Cheng, X., et al. Identification of homogeneous and heterogeneous variables in pooled cohort studies. Biometrics (2015).

  • Chiquet, J., et al. Inferring multiple graphical structures. Stat. Comput. (2011).

  • Danaher, P., et al. The joint graphical lasso for inverse covariance estimation across multiple classes. J. R. Stat. Soc. Ser. B Stat. Methodol. (2014).

  • Dicker, L., et al. Variable selection and estimation with the seamless-L0 penalty. Statist. Sinica (2013).

  • Emura, T., et al. Personalized dynamic prediction of death according to tumour progression and high-dimensional genetic factors: meta-analysis with a joint model. Stat. Methods Med. Res. (2018).

  • Fan, J., et al. An overview of the estimation of large covariance and precision matrices. Econom. J. (2016).

  • Friedman, N. Inferring cellular networks using probabilistic graphical models. Science (2004).

  • Friedman, J., et al. Sparse inverse covariance estimation with the graphical lasso. Biostatistics (2008).

  • Guo, J., et al. Joint estimation of multiple graphical models. Biometrika (2011).

  • Huang, Y., et al. Promoting similarity of sparsity structures in integrative analysis with penalization. J. Amer. Statist. Assoc. (2017).

  • Lam, C., et al. Sparsistency and rates of convergence in large covariance matrix estimation. Ann. Statist. (2009).

  • Liu, J., et al. Integrative analysis of multiple cancer genomic datasets under the heterogeneity model. Stat. Med. (2013).

  • Liu, J., et al. Integrative analysis of cancer diagnosis studies with composite penalization. Scand. J. Stat. (2014).

  • Ma, S., et al. Integrative analysis and variable selection with multiple high-dimensional data sets. Biostatistics (2011).