Parallel integrative learning for large-scale multi-response regression with incomplete outcomes

https://doi.org/10.1016/j.csda.2021.107243

Abstract

Multi-task learning is increasingly used to investigate the association structure between multiple responses and a single set of predictor variables in many applications. In the era of big data, the coexistence of incomplete outcomes, a large number of responses, and high-dimensional predictors poses unprecedented challenges in estimation, prediction, and computation. In this paper, we propose a scalable and computationally efficient procedure, called PEER, for large-scale multi-response regression with incomplete outcomes, where both the number of responses and the number of predictors can be high-dimensional. Motivated by sparse factor regression, we convert the multi-response regression into a set of univariate-response regressions, which can be efficiently implemented in parallel. Under mild regularity conditions, we show that PEER enjoys nice sampling properties, including consistency in estimation, prediction, and variable selection. Extensive simulation studies show that our proposal compares favorably with several existing methods in estimation accuracy, variable selection, and computational efficiency.

Introduction

Multi-task learning has been widely used in various fields, such as bioinformatics (Kim et al., 2009; Hilafu et al., 2020), econometrics (Fan et al., 2019), social network analysis (Zhu et al., 2020), and recommender systems (Zhu et al., 2016), when one is interested in uncovering the association between multiple responses and a single set of predictor variables. Multi-response regression is one of the most important tools in multi-task learning. For example, investigating the relationship between several measures of a patient's health (e.g., cholesterol, blood pressure, and weight) and the patient's eating habits, or simultaneously predicting asset returns for several companies via vector autoregression models, both lead to multi-response regression problems.

In the high-dimensional setting where the number of predictors is large, it is challenging to infer the association between predictors and responses because the responses may depend on only a subset of predictors. To address this issue and recover sparse response-predictor associations, many regularization methods for multi-response regression models have been proposed; see, for example, Rothman et al. (2010), Bunea et al. (2011, 2012), Chen and Huang (2012), Chen et al. (2012), Chen and Chan (2016), Uematsu et al. (2019), and the references therein. In particular, Chen et al. (2012) and Chen and Chan (2016) proposed sparse reduced-rank regression approaches, which combine regularization and reduced-rank regression techniques (Izenman, 1975; Velu and Reinsel, 2013), and Uematsu et al. (2019) proposed sparse orthogonal factor regression, which recovers the underlying association networks through a sparse singular value decomposition obtained via orthogonality-constrained optimization.

In the era of big data, the coexistence of missing values, a large number of responses, and high-dimensional predictors is increasingly common in many applications. When both the numbers of responses and predictors are large, the aforementioned methods may become inefficient because they are computationally intensive. Moreover, they are not applicable to incomplete data because they mainly focus on fully observed data. To obtain scalable estimation of sparse reduced-rank regression, several approaches based on sequential estimation techniques have been developed in recent years. For example, Mishra et al. (2017) proposed a sequential extraction procedure for model estimation, which extracts unit-rank factorizations one by one, each time removing the previously extracted components from the current response matrix. Although Mishra et al. (2017) also considered extensions to incomplete outcomes, they did not provide theoretical justification for that case. In addition, the sequential steps in their procedure may lead to error accumulation. Alternatively, Zheng et al. (2019) converted the sparse and low-rank regression problem into a sparse generalized eigenvalue problem and recovered the underlying coefficient matrix in a similar sequential fashion. Although this method enjoys desirable theoretical properties, it cannot be applied directly to missing data.
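To make the error-accumulation concern concrete, the following toy numpy sketch mimics sequential unit-rank extraction by deflation. It is an illustration under simplifying assumptions, not the penalized estimator of Mishra et al. (2017): it uses plain least squares on fully observed data, and the data-generating model is made up for the example. The key point is structural: each step fits the current residual matrix, so any estimation error from one step is carried into the residual that the next step fits.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q, r = 100, 10, 8, 2
X = rng.standard_normal((n, p))
C = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))  # rank-r coefficient matrix
Y = X @ C + 0.1 * rng.standard_normal((n, q))                  # fully observed responses

# Sequential unit-rank extraction (deflation): fit the current residual
# matrix, keep the best rank-one piece of the fit, subtract its contribution,
# and repeat. Errors made in early steps propagate into later residuals.
C_hat = np.zeros((p, q))
R = Y.copy()
for _ in range(r):
    B = np.linalg.lstsq(X, R, rcond=None)[0]      # least-squares fit to current residual
    U, d, Vt = np.linalg.svd(B, full_matrices=False)
    C1 = d[0] * np.outer(U[:, 0], Vt[0])          # best rank-one approximation of the fit
    C_hat += C1
    R = R - X @ C1                                # deflate: remove the extracted layer
```

In this benign low-noise setting the deflation recovers the rank-r coefficient matrix well; the concern in high dimensions is that each subtraction injects the previous step's estimation error into the next fit.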

In this paper, we propose a new methodology, parallel integrative learning regression (PEER), for large-scale multi-task learning with incomplete outcomes, where both responses and predictors are possibly high-dimensional. PEER is a two-step procedure: in the first step, we solve a constrained optimization problem with an iterative singular value thresholding algorithm to obtain initial estimates; in the second step, we convert the multi-response regression into a set of univariate-response regressions, which can be efficiently implemented in parallel.
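The conversion in the second step can be illustrated with a minimal numpy sketch. This is not the actual PEER estimator: it uses unpenalized least squares on fully observed data, whereas PEER uses singular value thresholding and regularized fits to handle missingness and high dimensionality; the dimensions and data below are invented for the example. The idea it shows is that, given an initial coefficient estimate and its right singular vectors, each latent factor yields one univariate-response regression, and these regressions are independent of one another, so they parallelize trivially.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(0)
n, p, q, r = 100, 10, 8, 2
X = rng.standard_normal((n, p))
C = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))  # rank-r coefficient matrix
Y = X @ C + 0.1 * rng.standard_normal((n, q))                  # fully observed responses

# Step 1 (stand-in): an initial estimate by least squares. With incomplete
# outcomes this would instead come from iterative singular value thresholding.
C0 = np.linalg.lstsq(X, Y, rcond=None)[0]
U, d, Vt = np.linalg.svd(C0, full_matrices=False)

# Step 2: one univariate-response regression per right singular vector,
# fitted independently and hence runnable in parallel.
def fit_factor(k):
    yk = Y @ Vt[k]                                 # project responses onto k-th singular vector
    return np.linalg.lstsq(X, yk, rcond=None)[0]   # penalized in the real method

with ThreadPoolExecutor() as pool:
    cols = list(pool.map(fit_factor, range(r)))

C_hat = np.column_stack(cols) @ Vt[:r]             # reassemble the coefficient matrix
```

Because each `fit_factor(k)` touches only the projected response `Y @ Vt[k]`, no step depends on another step's output, which is what removes the sequential error accumulation.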

The major contributions of this paper are threefold. First, the proposed procedure PEER provides a scalable and computationally efficient approach to large-scale multi-response regression models with incomplete outcomes. PEER can uncover the association between multiple responses and a single set of predictor variables while simultaneously achieving dimension reduction and variable selection. Second, PEER addresses the error accumulation problem of existing sequential estimation approaches by converting the multi-response regression into a set of parallel univariate-response regressions. Third, we provide theoretical guarantees for PEER by establishing oracle inequalities in estimation and prediction. Our theoretical analysis shows that PEER can consistently estimate the singular vectors, latent factors, and regression coefficient matrix, and accurately predict the multivariate response vector under mild conditions. To the best of our knowledge, these are the first theoretical results on large-scale multi-response regression with incomplete outcomes.

The rest of this paper is organized as follows. Section 2 introduces the model setting and our new procedure PEER. Section 3 establishes non-asymptotic properties of PEER in high dimensions. Section 4 illustrates the advantages of our method via extensive simulation studies. Section 5 presents the results of a real data example. Section 6 concludes with some discussions. All the proofs are relegated to the Appendix.

Section snippets

Model and methodology

In this section, we first introduce our model setting and briefly review the sparse orthogonal factor regression framework for high-dimensional multi-response regression models. We then present our new approach, PEER.

Theoretical properties

In this section, we investigate the theoretical properties of PEER. We first list some mild regularity conditions that facilitate our technical analysis.

Simulation studies

In this section, we evaluate the finite-sample performance of the proposed approach PEER through two simulation studies. The main difference between the two studies lies in the right singular vectors v_k: they are sparse in the second study but not necessarily sparse in the first.

Yeast cell cycle data analysis

In this section, we apply the proposed method to multivariate yeast cell cycle data, where our goal is to identify the association between transcription factors (TFs) and RNA transcript levels within the eukaryotic cell cycle. The dataset we use combines the yeast cell cycle data originally collected by Spellman et al. (1998) and the chromatin immunoprecipitation (ChIP) data of Lee et al. (2002). The yeast cell cycle data in Spellman et al. (1998) consist of RNA levels measured every

Discussion

In this paper, we have proposed PEER, a new and efficient approach for scalable and accurate estimation in large-scale multi-response regression with incomplete outcomes, where both responses and predictors are possibly high-dimensional. Our theoretical analysis and numerical studies show that PEER achieves good estimation and prediction accuracy.

Here we have focused on multi-response linear models with incomplete outcomes, where all the responses are continuous

Acknowledgements

Li's research is supported by 2020 individual Award (0358220) from the Innovative Research and Creative Activities Grant at California State University, Fullerton. Zheng's research is supported by National Natural Science Foundation of China (Grants 72071187, 11671374, 71731010, and 71921001) and Fundamental Research Funds for the Central Universities (Grants WK3470000017 and WK2040000027). The authors sincerely thank the Co-Editor, Associate Editor, and anonymous referees for their valuable

References (48)

  • E. Candès et al. The Dantzig selector: statistical estimation when p is much larger than n. Ann. Stat. (2007)
  • Chen, K. rrpack: reduced-rank regression. R package version... (2019)
  • K. Chen et al. A note on rank reduction in sparse multivariate regression. J. Stat. Theory Pract. (2016)
  • K. Chen et al. Reduced rank stochastic regression with a sparse singular value decomposition. J. R. Stat. Soc., Ser. B, Stat. Methodol. (2012)
  • L. Chen et al. Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. J. Am. Stat. Assoc. (2012)
  • H. Dette et al. Optimal designs for multi-response generalized linear models with applications in thermal spraying
  • B. Efron et al. Least angle regression. Ann. Stat. (2004)
  • J. Fan et al. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. (2001)
  • Y. Fan et al. Asymptotic equivalence of regularization methods in thresholded parameter space. J. Am. Stat. Assoc. (2013)
  • Y. Fan et al. Tuning parameter selection in high dimensional penalized likelihood. J. R. Stat. Soc., Ser. B, Stat. Methodol. (2013)
  • J. Friedman et al. Pathwise coordinate optimization. Ann. Appl. Stat. (2007)
  • S.A. Van de Geer. High-dimensional generalized linear models and the lasso. Ann. Stat. (2008)
  • K. He et al. Dimensionality reduction and variable selection in multivariate varying-coefficient models with a large number of covariates. J. Am. Stat. Assoc. (2018)
  • H. Hilafu et al. Sparse reduced-rank regression for integrating omics data. BMC Bioinform. (2020)
1. Dong and Li are co-first authors.
