Abstract
This paper concerns folded concave penalized sparse linear regression (FCPSLR), a class of popular sparse recovery methods. Although FCPSLR yields desirable recovery performance when solved globally, computing a global solution is NP-hard. Despite existing statistical performance analyses of local minimizers or of specific FCPSLR-based learning algorithms, two questions remain open: whether local solutions that are known to admit fully polynomial-time approximation schemes (FPTAS) may already suffice to ensure the statistical performance, and whether that statistical performance can be independent of the specific design of the computing procedure. To address these questions, this paper presents the following three results: (1) Any local solution (stationary point) is a sparse estimator, under some conditions on the parameters of the folded concave penalties. (2) Perhaps more importantly, any local solution satisfying a significant subspace second-order necessary condition (S\(^3\)ONC), which is weaker than the second-order KKT condition, yields a bounded error in approximating the true parameter with high probability. In addition, if the minimal signal strength is sufficient, the S\(^3\)ONC solution likely recovers the oracle solution. This result also shows that the goal of improving the statistical performance is consistent with the optimization criterion of minimizing the suboptimality gap in solving the non-convex programming formulation of FCPSLR. (3) We apply (2) to the special case of FCPSLR with the minimax concave penalty (MCP) and show that, under the restricted eigenvalue condition, any S\(^3\)ONC solution with a better objective value than the Lasso solution entails the strong oracle property. In addition, such a solution generates a model error (ME) comparable to that of the optimal but exponential-time sparse estimator given a sufficient sample size, while the worst-case ME is comparable to that of the Lasso in general.
Furthermore, a solution satisfying the S\(^3\)ONC can be computed by procedures that admit an FPTAS.
Notes
Throughout this paper, a “local solution” refers to a solution that at least satisfies the first-order KKT condition, and may or may not satisfy a second-order necessary condition.
References
Adamczak, R., Litvak, A., Pajor, A., Tomczak-Jaegermann, N.: Quantitative estimates of the convergence of the empirical covariance matrix in log-concave ensembles. J. Am. Math. Soc. 23(2), 535–561 (2010)
Bertsimas, D., Mazumder, R.: Least quantile regression via modern optimization. Ann. Stat. 42, 2494–2525 (2014)
Bian, W., Chen, X.: Optimality conditions and complexity for non-Lipschitz constrained optimization problems. http://www.polyu.edu.hk/ama/staff/xjchen/OCT26 (2014)
Bian, W., Chen, X., Ye, Y.: Complexity analysis of interior point algorithms for non-Lipschitz and non-convex minimization. Math. Program. A 149, 301–327 (2015)
Bickel, P.J., Ritov, Y., Tsybakov, A.B.: Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 37, 1705–1732 (2009)
Candès, E., Tao, T.: Decoding by linear programming. IEEE Trans. Inf. Theory 51(12), 4203–4215 (2005)
Cartis, C., Gould, N.I.M., Toint, P.L.: Adaptive cubic regularization methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Math. Program. A 127, 245–295 (2011)
Chen, X., Ge, D., Wang, Z., Ye, Y.: Complexity of unconstrained L\(_2\)-L\(_{\mathbf{p}}\) minimization. Math. Program. A 143, 371–383 (2014)
Chen, X., Xu, F., Ye, Y.: Lower bound theory of non-zero entries in solutions of L\(_2\)-L\(_{\mathbf{p}}\) minimization. SIAM J. Sci. Comput. 32(5), 2832–2852 (2010)
Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360 (2001)
Fan, J., Lv, J.: Nonconcave penalized likelihood with NP-dimensionality. IEEE Trans. Inf. Theory 57, 5467–5484 (2011)
Fan, J., Lv, J., Qi, L.: Sparse high dimensional models in economics. Annu. Rev. Econ. 3, 291–317 (2011)
Fan, J., Xue, L., Zou, H.: Strong oracle optimality of folded concave penalized estimation. Ann. Stat. 42(3), 819–849 (2014)
Ge, D., Wang, Z., Ye, Y., Yin, H.: Strong NP-hardness result for regularized \(L_q\)-minimization problems with concave penalty functions. arxiv:1501.00622v1 (2015)
Hunter, D., Li, R.: Variable selection using MM algorithms. Ann. Stat. 33, 1617–1642 (2005)
Hsu, D., Kakade, S.M., Zhang, T.: Random design analysis of ridge regression. arXiv:1106.2363v2. (2014)
Hsu, D., Kakade, S.M., Zhang, T.: A tail inequality for quadratic forms of subgaussian random vectors. Electron. Commun. Probab. 17(52), 1–6 (2012)
Huo, X., Chen, J.: Complexity of penalized likelihood estimation. J. Stat. Comput. Simul. 80(7), 747–759 (2010)
Liu, H., Yao, T., Li, R.: Global solutions for folded concave penalized nonconvex learning. Ann. Stat. 44(2), 629–659 (2016)
Liu, H., Yao, T., Li, R., Ye, Y.: Electronic Companion to: Folded Concave Penalized Sparse Linear Regression: Sparsity, Statistical Performance, and Algorithmic Theory for Local Solutions (2017)
Loh, P.-L., Wainwright, M.J.: Regularized M-estimators with nonconvexity: statistical and algorithmic theory for local optima. J. Mach. Learn. Res. 16, 559–616 (2015)
Negahban, S.N., Ravikumar, P., Wainwright, M.J., Yu, B.: A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Stat. Sci. 27(4), 538–557 (2012)
Nesterov, Yu., Polyak, B.T.: Cubic regularization of Newton’s method and its global performance. Math. Program. 108(1), 177–205 (2006)
Raskutti, G., Wainwright, M., Yu, B.: Restricted nullspace and eigenvalue properties for correlated Gaussian designs. J. Mach. Learn. Res. 11, 2241–2259 (2010)
Rudelson, M., Zhou, S.: Reconstruction from anisotropic random measurements. IEEE Trans. Inf. Theory 59(6), 3434–3447 (2013)
Raskutti, G., Wainwright, M.J., Yu, B.: Minimax rates of estimation for high-dimensional linear regression over \(\ell _q\)-balls. IEEE Trans. Inf. Theory 57(10), 6976–6994 (2011)
Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. B 58(1), 267–288 (1996)
van de Geer, S.A., Bühlmann, P.: On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 3, 1360–1392 (2009)
Vavasis, S.A.: Quadratic programming is in NP. Inf. Process. Lett. 36, 73–77 (1990)
Vershynin, R.: How close is the sample covariance matrix to the actual covariance matrix. arXiv:1004.3484v2 (2010)
Wang, L., Kim, Y., Li, R.: Calibrating non-convex penalized regression in ultra-high dimension. Ann. Stat. 41(5), 2505–2536 (2013)
Wang, Z., Liu, H., Zhang, T.: Optimal computational and statistical rates of convergence for sparse non-convex learning problems. Ann. Stat. 42(6), 2164–2201 (2014)
Ye, Y.: On affine scaling algorithms for non-convex quadratic programming. Math. Program. 56, 285–300 (1992)
Ye, Y.: On the complexity of approximating a KKT point of quadratic programming. Math. Program. 80, 195–211 (1998)
Zhang, C.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38(2), 894–942 (2010)
Zhang, Y., Wainwright, M.J., Jordan, M.I.: Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. JMLR: Worksh. Conf. Proc. 35, 1–18 (2014)
Zhang, C., Zhang, T.: A general theory of concave regularization for high dimensional sparse estimation problems. Stat. Sci. 27(4), 576–593 (2012)
Zhou, S.: Restricted eigenvalue conditions on subgaussian random matrices. arXiv:0912.4045v2 (2009)
Zou, H., Li, R.: One-step sparse estimation in non-concave penalized likelihood models. Ann. Stat. 36, 1509–1533 (2008)
Acknowledgements
The authors thank the AE and referees for their valuable comments, which significantly improved the paper. This work was supported by Penn State Grace Woodward Collaborative Research Grant, NSF grants CMMI 1300638 and DMS 1512422, NIH grants P50 DA036107 and P50 DA039838, Marcus PSU-Technion Partnership grant, Air Force Office of Scientific Research grant FA9550-12-1-0396, and Mid-Atlantic University Transportation Centers grant. This work was also partially supported by NNSFC grants 11690014 and 11690015. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NSF, the NIDA, the NIH, the AFOSR, the MAUTC or the NNSFC.
Appendix
1.1 Some useful lemmas
Lemma 6
For any \(\mathbf x^{true}\in \mathfrak {R}^p\), \(\mathbf A\in \mathfrak {R}^{n\times p}\), \(W\in \mathfrak {R}^n\), \(\mathbf b=\mathbf A\mathbf x^{true}+W\), consider \(f\) as defined in (3) with either \(P_{\lambda }=P_{\lambda ,SCAD}\) or \(P_{\lambda }=P_{\lambda ,MCP}\). Let \(\mathbf x^0\in \mathfrak {R}^p\) be a feasible solution to (3). If \( f(\mathbf x^{0})\le f(\mathbf x^{lasso})\), where \(\mathbf x^{lasso}\) is defined in (4) with the same problem data \( \mathbf x^{true}\), \(\mathbf A\), and \(\mathbf b\) as (3) and with an arbitrary penalty parameter \(\lambda _{lasso}>0\), then \( f(\mathbf x^{0})-f(\mathbf x^{true})\le (\lambda _{lasso}+\lambda )\left| \mathbf x^{lasso} - \mathbf x^{true}\right| . \)
Proof
Denote that \( f_{lasso}(\mathbf x)=(2n)^{-1}\Vert \mathbf A\mathbf x-\mathbf b\Vert ^2+\sum _{i=1}^p\lambda _{lasso}\vert x_i\vert \) for any \(\mathbf x=(x_i)\in \mathfrak {R}^p\).
Firstly, notice that by the definition of \(\mathbf x^{lasso}\) in (4), \( f_{lasso}(\mathbf x^{lasso})\le f_{lasso}(\mathbf x^{true}).\) We then know that \( (2n)^{-1}\Vert \mathbf A\mathbf x^{lasso}-\mathbf b\Vert ^2-(2n)^{-1}\Vert \mathbf A\mathbf x^{true}-\mathbf b\Vert ^2\le \sum _{i=1}^p\lambda _{lasso}\vert x_i^{true}\vert -\sum _{i=1}^p\lambda _{lasso}\vert x_i^{lasso}\vert \le \sum _{i=1}^p\lambda _{lasso}\vert x_i^{true}- x_i^{lasso}\vert = \lambda _{lasso}\vert \mathbf x^{true}- \mathbf x^{lasso}\vert .\)
Secondly, due to the concavity and differentiability of \(P_\lambda (\cdot )\) on \(\mathfrak {R}_+\) and the fact that \(0\le P'_\lambda (\vert x\vert )\le \lambda \) for all \(x\in \mathfrak {R}\), \(\sum _{i=1}^{p}P_\lambda (\vert x_i^{lasso}\vert )- \sum _{i=1}^{p}P_\lambda (\vert x_i^{true}\vert )\le \sum _{i=1}^p P_\lambda '(\vert x_i^{true}\vert )\cdot \left( \vert x_i^{lasso}\vert -\vert x_i^{true}\vert \right) \le \sum _{i=1}^p P_\lambda '(\vert x_i^{true}\vert )\cdot \vert x_i^{lasso} - x_i^{true}\vert \le \lambda \left| \mathbf x^{lasso} - \mathbf x^{true}\right| \).
Combining the above and the assumption that \(f(\mathbf x^{0})\le f(\mathbf x^{lasso})\), we know that \(f(\mathbf x^{0})-f(\mathbf x^{true})\le f(\mathbf x^{lasso})-f(\mathbf x^{true}) =(2n)^{-1}\Vert \mathbf A\mathbf x^{lasso}-\mathbf b\Vert ^2+\sum _{i=1}^{p}P_\lambda (\vert x_i^{lasso}\vert )-(2n)^{-1}\Vert \mathbf A\mathbf x^{true}-\mathbf b\Vert ^2- \sum _{i=1}^{p}P_\lambda (\vert x_i^{true}\vert ) \le (\lambda _{lasso}+\lambda )\left| \mathbf x^{lasso} - \mathbf x^{true}\right| ,\) as claimed. \(\square \)
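The only property of the folded concave penalty used in the second step above is that \(P_\lambda \) is concave and differentiable on \(\mathfrak {R}_+\) with \(0\le P'_\lambda (\vert x\vert )\le \lambda \). As a sketch, this can be checked directly for the MCP using its standard integral definition (notation here is the standard one from the MCP literature and may differ slightly from the paper's equation (3)):

```latex
P_{\lambda,\mathrm{MCP}}(x)
  = \lambda \int_0^{x} \Bigl( 1 - \frac{t}{a\lambda} \Bigr)_{+} \, dt
  \quad (x \ge 0),
\qquad
P'_{\lambda,\mathrm{MCP}}(x)
  = \lambda \Bigl( 1 - \frac{x}{a\lambda} \Bigr)_{+} \in [0, \lambda].
```

In particular, \(P'_{\lambda ,\mathrm{MCP}}(x)=0\) and \(P_{\lambda ,\mathrm{MCP}}(x)=P_{\lambda ,\mathrm{MCP}}(a\lambda )=a\lambda ^2/2\) for all \(x\ge a\lambda \); this flatness beyond \(a\lambda \) is the same property invoked later in the proof of Lemma 7.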
Lemma 7
Assume that Condition B holds with initial solution \(\mathbf x^0\in \mathfrak {R}^p\). For any \(\mathbf x^{true}\in \mathfrak {R}^p\), \(\mathbf A\in \mathfrak {R}^{n\times p}\), \(W\in \mathfrak {R}^n\), \(\mathbf b=\mathbf A\mathbf x^{true}+W\), and for any \(\mathbf x^*=(x_i^*)\in \mathfrak {R}^p\) that satisfies (i) the S\(^3\)ONC for (3) with either \(P_\lambda =P_{\lambda ,SCAD}\) or \(P_\lambda =P_{\lambda ,MCP}\), and (ii) the inequality \(f(\mathbf x^*)\le f(\mathbf x^0)\), the following inequality holds: \((2n)^{-1}\Vert \mathbf A(\mathbf x^{*}-\mathbf x^{true})\Vert ^2\le \,n^{-1}W^\top \mathbf A(\mathbf x^{*}-\mathbf x^{true}) +\min \left\{ \sum _{i\in \mathcal S}P'_\lambda (\vert x_i^{*}\vert )\vert x_i^{true}\vert ,\,\sum _{i\in \mathcal S}P'_\lambda (\vert x_i^{*}\vert )\vert x_i^{*}- x_i^{true}\vert \right\} \). If, in addition, \(f(\mathbf x^*)\le f(\mathbf x^{true})+{\varGamma }\) for an arbitrary \({\varGamma }\ge 0\), then \( \frac{1}{2n}\Vert \mathbf A(\mathbf x^{*}-\mathbf x^{true})\Vert ^2\le \,\frac{1}{n}W^\top \mathbf A(\mathbf x^{*}-\mathbf x^{true})+\min \left\{ \sum _{i\in \mathcal S}P'_\lambda (\vert x_i^{*}\vert )\vert x_i^{true}\vert ,\,\sum _{i\in \mathcal S}P'_\lambda (\vert x_i^{*}\vert )\vert x_i^{*}- x_i^{true}\vert ,\, P_{\lambda }(a\lambda )\cdot (\vert \mathcal S\vert -\Vert \mathbf x^{*}\Vert _0)+{\varGamma }\right\} .\)
Proof
Notice that \(\mathbf b=\mathbf A\mathbf x^{true}+W\). Then for any \(\mathbf x=(x_i)\in \mathfrak {R}^p\): \((2n)^{-1}\Vert \mathbf A\mathbf x-\mathbf b\Vert ^2+\sum _{i=1}^pP_{\lambda }'(\vert x^{*}_i\vert )\vert x_i\vert =(2n)^{-1}\Vert \mathbf A(\mathbf x-\mathbf x^{true})\Vert ^2+(2n)^{-1}W^\top W-n^{-1}W^\top \mathbf A(\mathbf x-\mathbf x^{true})+\sum _{i=1}^pP_{\lambda }'(\vert x^{*}_i\vert )\vert x_i\vert \).
Since \(\mathbf x^*\) satisfies the S\(^3\)ONC, which implies the FONC, we know that \(\mathbf x^*\in \arg \inf \{ \frac{1}{2n}\Vert \mathbf A\mathbf x-\mathbf b\Vert ^2+\sum _{i=1}^pP_{\lambda }'(\vert x^{*}_i\vert )\vert x_i\vert :\,\mathbf x\in \mathfrak {R}^p\}.\) Therefore, \(\frac{1}{2n}\Vert \mathbf A\mathbf x^*-\mathbf b\Vert ^2+\sum _{i=1}^pP_{\lambda }'(\vert x^{*}_i\vert )\vert x_i^*\vert \le \frac{1}{2n}\Vert \mathbf A\mathbf x^{true}-\mathbf b\Vert ^2+\sum _{i=1}^pP_{\lambda }'(\vert x^{*}_i\vert )\vert x_i^{true}\vert \). Combining the above, we know that \((2n)^{-1}\Vert \mathbf A(\mathbf x^{*}-\mathbf x^{true})\Vert ^2-n^{-1}W^\top \mathbf A(\mathbf x^{*}-\mathbf x^{true})+\sum _{i=1}^pP_{\lambda }'(\vert x^{*}_i\vert )\vert x_i^{*}\vert \le \sum _{i=1}^pP'_\lambda (\vert x_i^{*}\vert )\vert x_i^{true}\vert \). Further invoking the definitions of \(\mathbf x^{true}\) and \(\mathcal S\), as well as the triangle inequality and the fact that \(P'_{\lambda }(\vert x\vert )\ge 0\) for any \(x\in \mathfrak {R}\), we have \((2n)^{-1}{\Vert \mathbf A(\mathbf x^{*}-\mathbf x^{true})\Vert ^2} \le \, n^{-1}{W^\top \mathbf A(\mathbf x^{*}-\mathbf x^{true})}+\sum _{i\in \mathcal S}P'_\lambda (\vert x_i^{*}\vert )\vert x_i^{true}\vert -\sum _{i=1}^pP_{\lambda }'(\vert x^{*}_i\vert )\vert x_i^{*}\vert \le \, n^{-1}W^\top \mathbf A(\mathbf x^{*}-\mathbf x^{true})+\sum _{i\in \mathcal S}P'_\lambda (\vert x_i^{*}\vert )\vert x_i^{true}- x_i^{*}\vert -\sum _{i\in \mathcal S^c}P_{\lambda }'(\vert x^{*}_i\vert )\vert x_i^{*}\vert \). We then obtain the claimed result in the first part of the lemma.
To show the second part, by assumption, \(f(\mathbf x^*)\le f(\mathbf x^{true})+{\varGamma }\), we know \((2n)^{-1}\Vert \mathbf A(\mathbf x^{*}-\mathbf x^{true})\Vert ^2-n^{-1}{W^\top \mathbf A(\mathbf x^{*}-\mathbf x^{true})}+(2n)^{-1}{\Vert W\Vert ^2}+\sum _{i=1}^{p}P_{\lambda }(\vert x_i^*\vert )\le (2n)^{-1}{\Vert W\Vert ^2}+\sum _{i =1}^{p}P_{\lambda }(\vert x_i^{true}\vert )+{\varGamma }\). Noticing the fact that (i) \(0\le P_{\lambda }(\vert x\vert )\le P_{\lambda }(a\lambda )\) for any \(x\in \mathfrak {R}\), (ii) \(P_{\lambda }(\vert 0\vert )=0\), and (iii) by definition of \(\mathcal S^c\), \(x_i^{true}=0\) for all \(i\in \mathcal S^c\), we hence know \((2n)^{-1}\Vert \mathbf A(\mathbf x^{*}-\mathbf x^{true})\Vert ^2-n^{-1}{W^\top \mathbf A(\mathbf x^{*}-\mathbf x^{true})}\le P_{\lambda }(a\lambda )\cdot \vert \mathcal S\vert -\sum _{i=1}^{p}P_{\lambda }(\vert x_i^*\vert )+{\varGamma }\). Invoking Corollaries 3 and 4 under Condition B and the assumption that \(f(\mathbf x^*)\le f(\mathbf x^0)\), we know that \(x_i^*\ne 0\Longrightarrow \vert x_i^*\vert \ge a\lambda \). Also notice that \(P_{\lambda }(\vert x\vert )=P_{\lambda }(a\lambda )\) for all \(x\in \mathfrak {R}:\,\vert x\vert \ge a\lambda \). Therefore, the above implies \(\sum _{i=1}^{p}P_{\lambda }(\vert x_i^*\vert )= {P_{\lambda }(a\lambda )\cdot \Vert \mathbf x^{*}\Vert _0}\) and \((2n)^{-1}\Vert \mathbf A(\mathbf x^{*}-\mathbf x^{true})\Vert ^2-n^{-1}{W^\top \mathbf A(\mathbf x^{*}-\mathbf x^{true})}\le P_{\lambda }(a\lambda )\cdot (\vert \mathcal S\vert -\Vert \mathbf x^{*}\Vert _0)+{\varGamma }\). Combined with the results from the first part of this lemma, we have the claimed result in the second part. \(\square \)
Lemma 8
Consider a subgaussian random vector \(\tilde{W}\in \mathfrak {R}^{\tilde{n}}\) that satisfies \(Prob[\vert \langle \tilde{W},\, \upsilon \rangle \vert \ge t]\le 2\exp \left( -{t^2}(2\sigma ^2)^{-1}\right) \) for any \(\upsilon \in \mathfrak {R}^{\tilde{n}}:\, \Vert \upsilon \Vert =1\). Then for any \(V\in \mathfrak {R}^{\tilde{n}\times \tilde{n}}\) and \({\varSigma }_{v}=V^\top V\), \( Prob[\Vert V \tilde{W}\Vert ^2\le \sigma ^2(\mathbf{Tr}({\varSigma }_v)+2\sqrt{\mathbf{Tr}({\varSigma }_v^2)\,t}+2\Vert {\varSigma }_v\Vert t)]\ge 1-\exp (-t)\) for any \(t>0\), where \(\mathbf{Tr}(\cdot )\) denotes the trace of a matrix.
Proof
Evident from Theorem 2.1 in [17]. \(\square \)
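As a quick numerical sanity check of Lemma 8 (illustrative, not part of the paper), the tail bound can be verified by Monte Carlo in the Gaussian special case, since a standard Gaussian vector is subgaussian with \(\sigma =1\); the dimension, trial count, \(t\), and the choice of \(V\) below are arbitrary illustrative values:

```python
import numpy as np

# Monte Carlo sanity check of the tail bound in Lemma 8.  W~ is taken
# standard Gaussian (subgaussian with sigma = 1); n_dim, n_trials, t,
# and V are illustrative choices, not values from the paper.
rng = np.random.default_rng(0)
n_dim, n_trials, sigma, t = 20, 20_000, 1.0, 2.0

V = rng.standard_normal((n_dim, n_dim)) / np.sqrt(n_dim)
Sigma_v = V.T @ V  # Sigma_v = V^T V as in the lemma

# Right-hand side: sigma^2 (Tr(Sigma_v) + 2 sqrt(Tr(Sigma_v^2) t) + 2 ||Sigma_v|| t)
bound = sigma**2 * (
    np.trace(Sigma_v)
    + 2.0 * np.sqrt(np.trace(Sigma_v @ Sigma_v) * t)
    + 2.0 * np.linalg.norm(Sigma_v, 2) * t
)

W = rng.standard_normal((n_trials, n_dim))   # i.i.d. draws of W~
vals = np.sum((W @ V.T) ** 2, axis=1)        # ||V W~||^2, one value per trial

empirical = np.mean(vals <= bound)
# Lemma 8 guarantees coverage of at least 1 - exp(-t).
print(f"empirical coverage {empirical:.4f} vs guaranteed {1 - np.exp(-t):.4f}")
```

The empirical coverage typically far exceeds the guaranteed level \(1-e^{-t}\), reflecting that the bound, like most concentration inequalities, is conservative for any fixed design.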
Cite this article
Liu, H., Yao, T., Li, R. et al. Folded concave penalized sparse linear regression: sparsity, statistical performance, and algorithmic theory for local solutions. Math. Program. 166, 207–240 (2017). https://doi.org/10.1007/s10107-017-1114-y