Parallel integrative learning for large-scale multi-response regression with incomplete outcomes
Introduction
Multi-task learning has been widely used in various fields, such as bioinformatics (Kim et al., 2009; Hilafu et al., 2020), econometrics (Fan et al., 2019), social network analysis (Zhu et al., 2020), and recommender systems (Zhu et al., 2016), when one is interested in uncovering the association between multiple responses and a single set of predictor variables. Multi-response regression is one of the most important tools in multi-task learning. For example, investigating the relationship between several measures of a patient's health (e.g., cholesterol, blood pressure, and weight) and the patient's eating habits, or simultaneously predicting asset returns for several companies via vector autoregression models, both lead to multi-response regression problems.
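In matrix form, such problems fit the multi-response linear model Y = XC + E, where the columns of the coefficient matrix C link each response to the shared predictors. A minimal synthetic illustration in Python (the sizes and noise level are our own toy choices, not from the paper):

```python
import numpy as np

# Multi-response linear model: Y = X C + E, where Y (n x q) stacks q
# responses, X (n x p) holds the shared predictors, and C (p x q) is the
# coefficient matrix encoding the response-predictor associations.
rng = np.random.default_rng(0)
n, p, q = 100, 5, 3
X = rng.standard_normal((n, p))
C = rng.standard_normal((p, q))
Y = X @ C + 0.1 * rng.standard_normal((n, q))

# In this low-dimensional regime (n >> p), ordinary least squares
# recovers C; the high-dimensional case (p large) is where the
# regularization methods discussed below become necessary.
C_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

With many predictors and few observations this least-squares fit breaks down, which motivates the sparse and reduced-rank methods reviewed next.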
In the high-dimensional setting where the number of predictors is large, it is challenging to infer the association between predictors and responses because the responses may depend on only a subset of the predictors. To address this issue and recover sparse response-predictor associations, many regularization methods for multi-response regression models have been proposed; see, for example, Rothman et al. (2010), Bunea et al. (2011, 2012), Chen and Huang (2012), Chen et al. (2012), Chen and Chan (2016), Uematsu et al. (2019), and the references therein. In particular, Chen et al. (2012) and Chen and Chan (2016) proposed sparse reduced-rank regression approaches, which combine regularization with reduced-rank regression techniques (Izenman, 1975; Velu and Reinsel, 2013), and Uematsu et al. (2019) proposed sparse orthogonal factor regression, which recovers a sparse singular value decomposition through orthogonality-constrained optimization to uncover the underlying association networks.
In the era of big data, the coexistence of missing values, a large number of responses, and high dimensionality in the predictors is increasingly common. When the numbers of responses and predictors are both large, the aforementioned methods can become inefficient because they are computationally intensive. Moreover, they are not applicable to incomplete data because they mainly target complete-data problems. To obtain scalable estimation of sparse reduced-rank regression, several approaches based on sequential estimation techniques have been developed in recent years. For instance, Mishra et al. (2017) proposed a sequential extraction procedure that extracts unit-rank factorizations one by one, each time removing the previously extracted components from the current response matrix. Although Mishra et al. (2017) also considered extensions to incomplete outcomes, they did not provide theoretical justification for that case. In addition, the sequential steps in their procedure may lead to error accumulation. Alternatively, Zheng et al. (2019) converted the sparse and low-rank regression problem into a sparse generalized eigenvalue problem and recovered the underlying coefficient matrix in a similar sequential fashion. Although this method enjoys desirable theoretical properties, it cannot be applied directly to missing data.
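The sequential extraction idea can be sketched as follows. This is our own toy illustration of unit-rank extraction with deflation, not the implementation of Mishra et al. (2017); it also makes the error-accumulation mechanism visible, since each layer is fitted to a residual that carries the estimation errors of all earlier layers.

```python
import numpy as np

def sequential_unit_rank(X, Y, rank, lam=0.0):
    """Toy sequential extraction: fit one unit-rank layer at a time,
    removing each fitted layer from the response matrix (deflation).
    Errors in early layers propagate into all later fits."""
    p, q = X.shape[1], Y.shape[1]
    C_hat = np.zeros((p, q))
    R = Y.copy()
    for _ in range(rank):
        # Least-squares fit to the current residual, then its best
        # rank-one approximation via the SVD; soft-threshold the top
        # singular value if lam > 0 to mimic regularization.
        B, *_ = np.linalg.lstsq(X, R, rcond=None)
        U, s, Vt = np.linalg.svd(B, full_matrices=False)
        d = max(s[0] - lam, 0.0)
        layer = d * np.outer(U[:, 0], Vt[0])
        C_hat += layer
        R = R - X @ layer  # deflate: remove the extracted component
    return C_hat
```

Each pass recovers one singular layer of the coefficient matrix; summing the layers reconstructs a rank-`rank` estimate.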
In this paper, we propose a new methodology, parallel integrative learning regression (PEER), for large-scale multi-task learning with incomplete outcomes, where both the responses and the predictors may be high-dimensional. PEER is a two-step procedure: in the first step, we solve a constrained optimization problem with an iterative singular value thresholding algorithm to obtain initial estimates; in the second step, we convert the multi-response regression into a set of univariate-response regressions that can be efficiently fitted in parallel.
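A schematic of this two-step structure, under our own simplifying assumptions (a soft-impute-style singular value thresholding loop in step one and plain per-response least squares in step two; the actual PEER estimator uses a constrained optimization and regularized fits):

```python
import numpy as np

def svt_complete(Y, tau=1.0, n_iter=100):
    """Step 1 (sketch): iterative singular value thresholding to fill
    missing responses (NaNs), keeping observed entries fixed."""
    mask = ~np.isnan(Y)
    Z = np.where(mask, Y, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        low_rank = (U * np.maximum(s - tau, 0.0)) @ Vt
        Z = np.where(mask, Y, low_rank)  # observed entries stay as-is
    return Z

def parallel_univariate_fits(X, Y_filled):
    """Step 2 (sketch): one univariate-response fit per column of the
    completed response matrix; the columns are independent, so this
    loop is embarrassingly parallel."""
    fits = [np.linalg.lstsq(X, Y_filled[:, k], rcond=None)[0]
            for k in range(Y_filled.shape[1])]
    return np.column_stack(fits)
```

Because step two decouples across responses, its cost scales with the number of responses divided by the number of workers, which is the source of PEER's scalability.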
The major contributions of this paper are threefold. First, PEER provides a scalable and computationally efficient approach to large-scale multi-response regression with incomplete outcomes: it uncovers the association between multiple responses and a single set of predictors while simultaneously achieving dimension reduction and variable selection. Second, PEER avoids the error accumulation of existing sequential estimation approaches by converting the multi-response regression into a set of parallel univariate-response regressions. Third, we provide theoretical guarantees for PEER by establishing oracle inequalities for estimation and prediction. Our analysis shows that, under mild conditions, PEER consistently estimates the singular vectors, the latent factors, and the regression coefficient matrix, and accurately predicts the multivariate response vector. To the best of our knowledge, these are the first theoretical results on large-scale multi-response regression with incomplete outcomes.
The rest of this paper is organized as follows. Section 2 introduces the model setting and our new procedure PEER. Section 3 establishes non-asymptotic properties of PEER in high dimensions. Section 4 illustrates the advantages of our method via extensive simulation studies. Section 5 presents the results of a real data example. Section 6 concludes with some discussions. All the proofs are relegated to the Appendix.
Section snippets
Model and methodology
In this section, we first introduce our model setting and briefly review sparse orthogonal factor regression framework for high-dimensional multi-response regression models. We then present our new approach PEER.
Theoretical properties
In this section, we investigate the theoretical properties of PEER. We first list some mild regularity conditions that facilitate our technical analysis.
Simulation studies
In this section, we evaluate the finite-sample performance of the proposed approach PEER through two simulation studies. The main difference between the two studies lies in the right singular vectors: they are sparse in the second study but not necessarily sparse in the first.
Yeast cell cycle data analysis
In this section, we apply the proposed method to multivariate yeast cell cycle data, where our goal is to identify the association between transcription factors (TFs) and RNA transcript levels within the eukaryotic cell cycle. The dataset includes the yeast cell cycle data originally collected by Spellman et al. (1998) and the chromatin immunoprecipitation (ChIP) data of Lee et al. (2002). The yeast cell cycle data in Spellman et al. (1998) consist of RNA levels measured every
Discussion
In this paper, we have proposed PEER, a new and efficient approach for scalable and accurate estimation in large-scale multi-response regression with incomplete outcomes, where both responses and predictors may be high-dimensional. Our theoretical properties and numerical studies show that PEER achieves desirable estimation and prediction accuracy.
Here we have focused on multi-response linear models with incomplete outcomes, where all the responses are continuous
Acknowledgements
Li's research is supported by 2020 individual Award (0358220) from the Innovative Research and Creative Activities Grant at California State University, Fullerton. Zheng's research is supported by National Natural Science Foundation of China (Grants 72071187, 11671374, 71731010, and 71921001) and Fundamental Research Funds for the Central Universities (Grants WK3470000017 and WK2040000027). The authors sincerely thank the Co-Editor, Associate Editor, and anonymous referees for their valuable
References (48)
- Generalized high-dimensional trace regression via nuclear norm regularization. J. Econom. (2019).
- Reduced-rank regression for the multivariate linear model. J. Multivar. Anal. (1975).
- Leveraging mixed and incomplete outcomes via reduced-rank modeling. J. Multivar. Anal. (2018).
- Estimation and inference in semiparametric quantile factor models. J. Econom. (2021).
- Estimation and hypothesis test for partial linear multiplicative models. Comput. Stat. Data Anal. (2018).
- Multivariate spatial autoregressive model for large scale social networks. J. Econom. (2020).
- Simultaneous analysis of lasso and Dantzig selector. Ann. Stat. (2009).
- Optimal selection of reduced rank estimators of high-dimensional matrices. Ann. Stat. (2011).
- Joint variable and rank selection for parsimonious estimation of high-dimensional matrices. Ann. Stat. (2012).
- Sparsity oracle inequalities for the lasso. Electron. J. Stat. (2007).
- The Dantzig selector: statistical estimation when p is much larger than n. Ann. Stat.
- A note on rank reduction in sparse multivariate regression. J. Stat. Theory Pract.
- Reduced rank stochastic regression with a sparse singular value decomposition. J. R. Stat. Soc., Ser. B, Stat. Methodol.
- Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. J. Am. Stat. Assoc.
- Optimal designs for multi-response generalized linear models with applications in thermal spraying.
- Least angle regression. Ann. Stat.
- Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc.
- Asymptotic equivalence of regularization methods in thresholded parameter space. J. Am. Stat. Assoc.
- Tuning parameter selection in high dimensional penalized likelihood. J. R. Stat. Soc., Ser. B, Stat. Methodol.
- Pathwise coordinate optimization. Ann. Appl. Stat.
- High-dimensional generalized linear models and the lasso. Ann. Stat.
- Dimensionality reduction and variable selection in multivariate varying-coefficient models with a large number of covariates. J. Am. Stat. Assoc.
- Sparse reduced-rank regression for integrating omics data. BMC Bioinform.
1 Dong and Li are co-first authors.