Abstract
We consider variable selection in high-dimensional sparse multiresponse linear regression models, in which a q-dimensional response vector has a linear relationship with a p-dimensional covariate vector through a sparse coefficient matrix \(B\in R^{p\times q}\). We propose a consistent procedure for identifying the nonzeros in B. The procedure consists of two major steps: the first detects all the nonzero rows of B, and the second further identifies its individual nonzero cells. The first step is an extension of Orthogonal Matching Pursuit (OMP) and the second adopts a bootstrap strategy. Theoretical properties of the proposed procedure are established. Extensive numerical studies are presented to compare its performance with that of representative existing methods.
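As a rough illustration of the row-screening idea in the first step (a greedy multiresponse analogue of OMP), the following sketch, entirely our own and not the paper's exact algorithm, adds at each iteration the covariate whose aggregate squared correlation with the current residuals, summed over the q responses, is largest; the name `momp` and all implementation details are assumptions:

```python
import numpy as np

def momp(X, Y, K):
    """Greedy multiresponse forward selection: at each step, add the
    covariate whose column has the largest total squared inner product
    with the current residual matrix across all q responses."""
    n, p = X.shape
    residual = Y.copy()
    selected = []
    for _ in range(K):
        # aggregate score of covariate j over all q responses
        scores = np.sum((X.T @ residual) ** 2, axis=1)
        scores[selected] = -np.inf          # never re-select a covariate
        selected.append(int(np.argmax(scores)))
        # refit on the current selected set and update the residuals
        Xs = X[:, selected]
        coef, *_ = np.linalg.lstsq(Xs, Y, rcond=None)
        residual = Y - Xs @ coef
    return selected
```

In the paper, the stopping index along this path is then chosen by an EBIC-type criterion rather than a fixed K.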
References
Buehlmann P (2006) Boosting for high-dimensional linear models. Ann. Stat. 34(2):559–583
Cai TT, Li H, Liu W, Xie J (2013) Covariate-adjusted precision matrix estimation with an application in genetical genomics. Biometrika 100(1):139–156
Chun H, Keleş S (2009) Expression quantitative trait loci mapping with multivariate sparse partial least squares regression. Genetics 182(1):79–90
Chun H, Keleş S (2010) Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J. R. Stat. Soc. 72(1):3–25
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann. Stat. 32(2):407–499
Ing C, Lai T (2011) A stepwise regression method and consistent model selection for high-dimensional sparse linear models. Stat. Sin. 21(4):1473
Jia Z, Xu S (2007) Mapping quantitative trait loci for expression abundance. Genetics 176(1):611–623
Johnsson T (1992) A procedure for stepwise regression analysis. Stat. Pap. 33(1):21–29
Liu J, Ma S, Huang J (2012) Penalized methods for multiple outcome data in genome-wide association studies. Technical report
Luo S, Chen Z (2013) Extended bic for linear regression models with diverging number of relevant features and high or ultra-high feature spaces. J. Stat. Plan. Infer. 143(3):494–504
Luo S, Chen Z (2014) Sequential lasso cum ebic for feature selection with ultra-high dimensional feature space. J. Am. Stat. Assoc. 109(507):1229–1240
Lutoborski A, Temlyakov V (2003) Vector greedy algorithms. J. Complex. 19(4):458–473
Ma S, Huang J, Song X (2011) Integrative analysis and variable selection with multiple high-dimensional data sets. Biostatistics 12(4):763–775
Mammen E (1993) Bootstrap and wild bootstrap for high dimensional linear models. Ann. Stat. 21(1):255–285
Obozinski G, Wainwright MJ, Jordan MI (2011) Support union recovery in high-dimensional multivariate regression. Ann. Stat. 39(1):1–47
Özkale MR (2015) Predictive performance of linear regression models. Stat. Pap. 56(2):531–567
Peng J, Zhu J, Bergamaschi A, Han W, Noh D-Y, Pollack JR, Wang P (2010) Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. Ann. Appl. Stat. 4(1):53–77
Rothe G (1986) Some remarks on bootstrap techniques for constructing confidence intervals. Stat. Pap. 27(1):165–172
Similä T, Tikka J (2006) Common subset selection of inputs in multiresponse regression. In: Proceedings of the 2006 International Joint Conference on Neural Networks (IJCNN'06). IEEE, pp 1908–1915
Similä T, Tikka J (2007) Input selection and shrinkage in multiresponse linear regression. Comput. Stat. Data Anal. 52(1):406–422
Temlyakov VN (2000) Weak greedy algorithms. Adv. Comput. Math. 12(2):213–227
Tropp J, Gilbert A, Strauss M (2006) Algorithms for simultaneous sparse approximation. Part I: Greedy pursuit. Signal Process. 86(3):572–588
Turlach B, Venables W, Wright S (2005) Simultaneous variable selection. Technometrics 47(3):349–363
Wang J (2015) Joint estimation of sparse multivariate regression and conditional graphical models. Stat. Sin. 831–851
Yang C, Wang L, Zhang S, Zhao H (2013) Accounting for non-genetic factors by low-rank representation and sparse regression for eQTL mapping. Bioinformatics
This research was supported by National Natural Science Foundation of China (NSFC): 11401378 and Shanghai Jiao Tong University start-up fund for special researchers: WF220407103.
Appendix: Technical proofs
In this section, we provide technical proofs of our main theorems. For the convenience of the reader, we restate some useful notations and conditions.
For given K, if there exists a \(k\le K\) such that \({S}_{0}\subseteq s_k\), define \(\tilde{K}=\min \{k\le K:{S}_{0}\subseteq s_k\}\); otherwise, \(\tilde{K}\) is defined to be K.
We assume the following conditions:
- (C1):
\(\ln p=o(n^{1/3})\);
- (C2):
The predictor vector \({x}\) and the error vector \({e}\) satisfy
- (C2.1):
the covariates in \({x}\) have a constant variance 1 and pairwise correlations bounded away from \(\pm 1\).
- (C2.2):
\(\sigma _{\max }\equiv \max _{1 \le j,k\le p} \sigma ({x}_j{x}_k) < \infty \) where \(\sigma ({x}_j{x}_k)\) denotes the standard deviation of \({x}_j{x}_k\).
- (C2.3):
\(\max _{1\le j,k\le p}{E}\exp (t{x}_j{x}_k) \) and \(\max _{1\le i\le q,1\le j\le p}{E}\exp (t{x}_j{e}^i) \) are finite for t in a neighborhood of zero.
- (C3):
\(\sum _{1\le i\le p}\sum _{1\le j\le q}|{\beta }_{ij}|\le c\) where c is a positive constant.
- (C4):
There exists a constant \(\delta >0\) such that
$$\begin{aligned} \min _{1\le |A|\le K}\lambda _{\min }(\varGamma _{A,A})\ge \delta \end{aligned}$$(8)for K satisfying \(K=O\left( q^{-1}\sqrt{n/\ln p}\right) .\)
- (C5):
For the K in C4, there exists a \(0<\kappa <1\) satisfying \(n^{-\kappa }K\rightarrow +\infty \), \(n^{1-2\kappa }/\ln p\rightarrow +\infty \) and
$$\begin{aligned} \lim \inf _{n\rightarrow +\infty }n^{\kappa }\min \limits _{j\in {S}_{0}}\Vert {\beta }_j\Vert _2^2>0. \end{aligned}$$
Denote \(\tilde{{y}}_{A,i}={x}_{A}\varGamma _{A,A}^{-1}\text{ E }(x_{A}^{\top }y_i),\;\; \hat{{y}}_{A,i}={x}_{A}\hat{\varGamma }_{A,A}^{-1}X_{A}^{\top }Y_i\) and \(\tilde{{y}}_{k,i}=\tilde{{y}}_{s_k,i}, \hat{{y}}_{k,i}=\hat{{y}}_{s_k,i}\).
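As a sanity check on this notation, note that with \(\hat{\varGamma }_{A,A}=n^{-1}X_A^{\top }X_A\) (and the matching \(n^{-1}\) factor on \(X_A^{\top }Y_i\)), \(\hat{{y}}_{A,i}\) is nothing but the orthogonal projection \({H}_0(A)Y_i\) of \(Y_i\) onto the column space of \(X_A\). A minimal numerical check (our own sketch, assuming that normalization):

```python
import numpy as np

def fitted_values(X_A, y_i):
    """Compute X_A * Gamma_hat^{-1} * (X_A^T y_i / n) with
    Gamma_hat = X_A^T X_A / n; the two n's cancel, so this equals the
    orthogonal projection of y_i onto the span of the columns of X_A."""
    n = X_A.shape[0]
    gamma_hat = X_A.T @ X_A / n
    return X_A @ np.linalg.solve(gamma_hat, X_A.T @ y_i / n)
```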
Theorem 1
Under assumptions C1–C5, we have
When \(m=O(q^{-1}\sqrt{n/\ln p})\), the right hand side is \(O_p(m^{-1}).\)
Proof of Theorem 1
Denote
Firstly, we focus on \(\text{ I }\). For \(A\subseteq \{1,2,\ldots ,p\},1\le i\le q, 1\le j\le p,\) denote by \(E_i\) the ith column of the error matrix E and
for simplicity, denote them by \({\mu }_{k,j,i}, \hat{{\mu }}_{k,j,i}\) when \(A=s_k\). Since \({E}\left( {x}_{s_k}^{\top }\left( y_i-\tilde{{y}}_{k,i}\right) |(Y,X)\right) =0\), we have, for any k,
Now it suffices to estimate the first term of the right hand side in the last inequality. Note that by definition of \(\hat{j}_k\),
Hence, the triangle inequality implies
on the other hand,
For any \(0<\xi <1\) and \(C>0\), let \(\tilde{\xi }=2/(1-\xi )\) and define
then
the second inequality is implied by Theorem 3 in Temlyakov (2000). By direct computation, we have
From Lemma 1 in Luo and Chen (2014), under C1 to C5, when \(\max (\ln m,\ln q)=O(\ln p)\), we have
and also,
Consequently,
Now we focus on \(\text{ II }\). Note that
from the above discussion, we see that all components of \(\hat{\varGamma }_{s_k,s_k}^{-1}X_{s_k}^{\top }Y_i-\varGamma _{s_k,s_k}^{-1}\text{ E }({x}_{s_k}y_i)\) are uniformly \(O_p(\sqrt{\ln p/n})\). Therefore, \(\text{ II }=O_p\left( m \ln p/n\right) .\) The desired result is obtained. \(\square \)
Theorem 2
Under assumptions C1–C5, MOMP possesses the sure screening property, that is,
for K defined in C4.
Proof of Theorem 2
Let \(\tilde{{\beta }}_{ji}(A)\) be the coefficient of \({x}_A\) in the best linear predictor \(\tilde{{y}}_{A,i}\) as defined in (9), and set \(\tilde{{\beta }}_{ji}(A)=0\) if \(j\notin A\). Note that
The inequality
together with C3 and C5 implies that \(|{S}_{0}|=O(n^{\kappa /2})\), so that \(|{S}_{0}\cup s_m| = O(m+n^{\kappa /2})\); it then follows from the above inequality that, if \(s_m^c\cap {S}_{0}\ne \emptyset \) and \(m=K\), then
for some positive constant C when n is sufficiently large. Since \(mn^{-\kappa }\rightarrow +\infty \) by C5, this contradicts Theorem 1. Therefore, \(P\left( {S}_{0}\subseteq s_{K}\right) \rightarrow 1\) as \(n\rightarrow \infty .\)\(\square \)
Theorem 3
Under the assumptions C1–C5, when \({e}\) follows a multivariate normal distribution, we have
if \(\gamma \) in (2) is larger than \(1-\ln n/(2\ln p)\).
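For concreteness, one hedged reading of the \(\text{ EBIC }_{\gamma }\) criterion implicit in (17), namely n times the log of the total residual sum of squares over the q responses plus a penalty of \(\ln n+2\gamma \ln p\) per selected covariate, can be sketched as follows; the function name and implementation details are our own assumptions:

```python
import numpy as np

def ebic_gamma(X, Y, support, gamma):
    """EBIC_gamma for a candidate support (list of covariate indices):
    n * log(total RSS over the q responses)
    + |support| * (log n + 2 * gamma * log p)."""
    n, p = X.shape
    if support:
        Xs = X[:, list(support)]
        coef, *_ = np.linalg.lstsq(Xs, Y, rcond=None)
        rss = np.sum((Y - Xs @ coef) ** 2)
    else:
        rss = np.sum(Y ** 2)   # empty model: residual is Y itself
    return n * np.log(rss) + len(support) * (np.log(n) + 2 * gamma * np.log(p))
```

\(\hat{K}\) is then taken as the minimizer of this criterion along the MOMP path \(s_1,\ldots ,s_K\).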
Proof of Theorem 3
Theorem 2 implies that there exists a constant \(a>0\) such that
Suppose \(j\notin A\) and \(A_1\subsetneq A_2\); the following two identities
are key to the proof of this theorem.
Without loss of generality, we assume that all features and errors \(e_i\) have sample mean 0 and sample variance 1. Define
From Lemma 1 in Luo and Chen (2014), it is straightforward to have the following conclusions,
- (A):
On one hand, \(\max \limits _{1\le k\le n}|A_{k-1}-\varGamma _{j_k,j_k|s_{k-1}}|=o_p(1)\); on the other hand,
$$\begin{aligned} A_{k-1}\ge & {} \lambda _{\min }\left( \hat{\varGamma }_{s_k,s_k}\right) \left[ 1+\Vert \left( X^{\top }_{s_{k-1}}X_{s_{k-1}}\right) ^{-1}X^{\top }_{s_{k-1}}X_{j_{k}}\Vert _2^2\right] \\\ge & {} \lambda _{\min }\left( \hat{\varGamma }_{s_k,s_k}\right) \end{aligned}$$When \(\lambda _{\min }\left( \varGamma _{s_k,s_k}\right) \ge \delta \), for any k-dimensional unit vector \({w}\),
$$\begin{aligned} \begin{aligned} \min \left( {w}^{\top }\hat{\varGamma }_{s_k,s_k}{w}\right) =&\min \left( {w}^{\top }\left\{ \left[ \hat{\varGamma }_{s_k,s_k}-\varGamma _{s_k,s_k}\right] +\varGamma _{s_k,s_k}\right\} {w}\right) \\ \ge&\min \left( {w}^{\top }\varGamma _{s_k,s_k}{w}\right) -\max \left( {w}^{\top }\left[ \hat{\varGamma }_{s_k,s_k}-\varGamma _{s_k,s_k}\right] {w}\right) \\ \ge&\lambda _{\min }\left( \varGamma _{s_k,s_k}\right) -\Vert {w}\Vert _1^2\max _{1\le i,j\le p}|n^{-1}X_i^{\top }X_j-\varGamma _{i,j}|\\ \ge&\delta -O_p\left( k\sqrt{\dfrac{\ln p}{n}}\right) . \end{aligned} \end{aligned}$$(16)That is, \(P(A_{k-1}\ge \delta )\rightarrow 1\) as \(n\rightarrow +\infty \) provided \(k\sqrt{\ln p/n}=o(1).\)
- (B):
\(\max \limits _{k=O(q^{-1}\sqrt{n/\ln p}),1\le i\le q}|B_{k-1,i}|=O_p\left( \sqrt{\dfrac{\ln p+\ln q}{n}}\right) =o_p\left( \min \limits _{j\in {S}_{0}}\Vert {\beta }_j\Vert _2\right) .\)
- (C):
\(\max \limits _{k=O(q^{-1}\sqrt{n/\ln p}),1\le i\le q}|C_{k-1,i}|=O_p\left( \sqrt{\dfrac{\ln p+\ln q}{n}}\right) =o_p\left( \min \limits _{j\in {S}_{0}}\Vert {\beta }_j\Vert _2\right) .\)
- (i):
If \(\hat{K}<\tilde{K}\), \(\text{ EBIC }_{\gamma }(s_{\hat{K}})\le \text{ EBIC }_{\gamma }(s_{\tilde{K}}) \) implies
$$\begin{aligned} n\ln \dfrac{\sum \limits _{1\le i\le q}\Vert ({I}-{H}_0(s_{\hat{K}}))y_i\Vert _2^2}{\sum \limits _{1\le i\le q}\Vert ({I}-{H}_0(s_{\tilde{K}}))y_i\Vert _2^2}+(|\hat{K}| -|\tilde{K}|)(\ln n+2\gamma \ln p)\le 0. \end{aligned}$$(17)If we can show
$$\begin{aligned} P\left( n\ln \dfrac{\sum \limits _{1\le i\le q}\Vert ({I}-{H}_0(s_{\tilde{K}-1}))y_i\Vert _2^2}{\sum \limits _{1\le i\le q}\Vert ({I}-{H}_0(s_{\tilde{K}}))y_i\Vert _2^2}-|\tilde{K}|(\ln n+2\gamma \ln p)\le 0\right) \rightarrow 0, \end{aligned}$$(18)then we will have \(P(\hat{K}<\tilde{K})\rightarrow 0.\) In the following, we aim to prove (18):
$$\begin{aligned} \begin{aligned}&\sum \limits _{1\le i\le q}\left\{ \dfrac{\Vert ({I}-{H}_0(s_{\tilde{K}-1}))y_i\Vert _2^2}{n}-\dfrac{\Vert ({I}-{H}_0(s_{\tilde{K}}))y_i\Vert _2^2}{n}\right\} \\&\quad =\sum \limits _{1\le i\le q}\left\{ \left( {\beta }_{j_{\tilde{K}}i}\right) ^2A_{\tilde{K}-1}+2{\beta }_{j_{\tilde{K}}i}B_{\tilde{K}-1,i}+(A_{\tilde{K}-1})^{-1}(B_{\tilde{K}-1,i})^2\right\} , \end{aligned} \end{aligned}$$(19)And furthermore,
$$\begin{aligned} \sum \limits _{1\le i\le q}\dfrac{\Vert ({I}-{H}_0(s_{\tilde{K}}))y_i\Vert _2^2}{n} =\sum \limits _{1\le i\le q}\left\{ C_{\tilde{K},i}+1\right\} =q\left( 1+O_p\left( \sqrt{\dfrac{\ln p+\ln q}{n}}\right) \right) . \end{aligned}$$(20)Hence, with probability tending to 1,
$$\begin{aligned} n\ln \dfrac{\sum \limits _{1\le i\le q}\Vert ({I}-{H}_0(s_{\tilde{K}-1}))y_i\Vert _2^2}{\sum \limits _{1\le i\le q}\Vert ({I}-{H}_0(s_{\tilde{K}}))y_i\Vert _2^2} \ge n\ln \left( 1+\dfrac{\min \limits _{j\in {S}_{0}}\Vert {\beta }_j\Vert _2^2}{q}\right) \ge Cq^{-1}n^{1-\kappa },\nonumber \\ \end{aligned}$$(21)for some \(0<C<1\) while \(|\tilde{K}|(\ln n+2\gamma \ln p)\le q^{-1}\sqrt{n\ln p}\). Combined with C5, (18) is thus proved.
- (ii):
If \(\hat{K}>\tilde{K}\), \(\text{ EBIC }_{\gamma }(s_{\hat{K}})\le \text{ EBIC }_{\gamma }(s_{\tilde{K}}) \) implies,
$$\begin{aligned} \begin{aligned}&n\ln \left( 1+\dfrac{\sum \limits _{1\le i\le q}\Vert ({I}-{H}_0(s_{\tilde{K}}))y_i\Vert _2^2-\Vert ({I}-{H}_0(s_{\hat{K}}))y_i\Vert _2^2}{\sum \limits _{1\le i\le q}\Vert ({I}-{H}_0(s_{\hat{K}}))y_i\Vert _2^2}\right) \\&-(|\hat{K}| -|\tilde{K}|)(\ln n+2\gamma \ln p)\ge 0. \end{aligned} \end{aligned}$$(22)If we can prove that this inequality holds with a probability converging to 0, then \(P(\hat{K}>\tilde{K})=o(1).\) Note that
$$\begin{aligned} \begin{aligned} \sum \limits _{1\le i\le q}\Vert ({I}-{H}_0(s_{\hat{K}}))y_i\Vert _2^2=&\sum \limits _{1\le i\le q}\Vert ({I}-{H}_0(s_{\hat{K}})){E}_i\Vert _2^2\\ \sum \limits _{1\le i\le q}\Vert ({I}-{H}_0(s_{\tilde{K}}))y_i\Vert _2^2=&\sum \limits _{1\le i\le q}\Vert ({I}-{H}_0(s_{\tilde{K}})){E}_i\Vert _2^2\\ \end{aligned} \end{aligned}$$From Lemma 2 in Luo and Chen (2013), we have
$$\begin{aligned} \dfrac{\sum \limits _{1\le i\le q}\Vert ({I}-{H}_0(s_{\tilde{K}}))y_i\Vert _2^2-\sum \limits _{1\le i\le q}\Vert ({I}-{H}_0(s_{\hat{K}}))y_i\Vert _2^2}{\sum \limits _{1\le i\le q}\Vert ({I}-{H}_0(s_{\hat{K}}))y_i\Vert _2^2}=\dfrac{2|\hat{K}-\tilde{K}|}{n}(1+o_p(1)). \end{aligned}$$Hence, by applying the conclusions in (i),
$$\begin{aligned} \begin{aligned}&n\ln \left( 1+\dfrac{\sum \limits _{1\le i\le q}\Vert ({I}-{H}_0(s_{\tilde{K}}))y_i\Vert _2^2-\Vert ({I}-{H}_0(s_{\hat{K}}))y_i\Vert _2^2}{\sum \limits _{1\le i\le q}\Vert ({I}-{H}_0(s_{\hat{K}}))y_i\Vert _2^2}\right) \\&\quad \le 2|\hat{K}-\tilde{K}|\ln p. \end{aligned} \end{aligned}$$When \(\gamma >1-\ln n/(2\ln p)\), the desired result is obtained.\(\square \)
Luo, S. Variable selection in high-dimensional sparse multiresponse linear regression models. Stat Papers 61, 1245–1267 (2020). https://doi.org/10.1007/s00362-018-0989-x