Abstract
The Lasso has become a benchmark data analysis procedure, and numerous variants have been proposed in the literature. Although the Lasso formulations are stated so that the overall prediction error is optimized, no full control over the prediction accuracy on certain individuals of interest is allowed. In this work we propose a novel version of the Lasso in which quadratic performance constraints are added to Lasso-based objective functions, in such a way that threshold values are set to bound the prediction errors in the different groups of interest (not necessarily disjoint). As a result, a constrained sparse regression model is defined by a nonlinear optimization problem. This cost-sensitive constrained Lasso has a direct application to heterogeneous samples where data are collected from distinct sources, as is standard in many biomedical contexts. Both theoretical properties and empirical studies concerning the new method are explored in this paper. In addition, two illustrations of the method in biomedical and sociological contexts are considered.
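As a concrete illustration of the formulation described above, the following minimal sketch states a Lasso objective subject to group-wise quadratic performance constraints using the cvxpy modelling package. This is not the paper's implementation; the data, groups, thresholds and penalty parameter are synthetic placeholders, and the intercept is omitted.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.5 * rng.standard_normal(n)

# Two (possibly overlapping) groups of individuals whose mean squared
# prediction error must stay below given thresholds.
groups = [np.arange(0, 80), np.arange(60, 200)]
thresholds = [1.2, 1.0]          # hypothetical bounds, one per group

lam = 0.1                        # sparsity parameter
beta = cp.Variable(p)
objective = cp.Minimize(cp.sum_squares(y - X @ beta) / n + lam * cp.norm1(beta))
constraints = [cp.sum_squares(y[g] - X[g] @ beta) / len(g) <= f
               for g, f in zip(groups, thresholds)]
cp.Problem(objective, constraints).solve()
print(np.round(beta.value, 3))
```

Each constraint bounds the in-group mean squared error separately, so overlapping groups require no special treatment.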
References
Bradford JP, Kunz C, Kohavi R, Brunk C, Brodley CE (1998) Pruning decision trees with misclassification costs. In: Nédellec C, Rouveirol C (eds) Machine learning: ECML-98. Springer, Berlin, pp 131–136
Bühlmann P, van de Geer S (2011) Statistics for high-dimensional data. Springer, Berlin
Carrizosa E, Romero-Morales D (2001) Combining minsum and minmax: a goal programming approach. Oper Res 49(1):169–174
Carrizosa E, Martín-Barragán B, Romero-Morales D (2008) Multi-group support vector machines with measurement costs: a biobjective approach. Discrete Appl Math 156:950–966
Datta S, Das S (2015) Near-Bayesian support vector machines for imbalanced data classification with equal or unequal misclassification costs. Neural Netw 70:39–52
Donoho DL, Johnstone IM, Kerkyacharian G, Picard D (1995) Wavelet shrinkage: Asymptopia? J R Stat Soc Ser B (Methodol) 57(2):301–369
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360
Freitas A, Costa-Pereira A, Brazdil P (2007) Cost-sensitive decision trees applied to medical data. In: Song IY, Eder J, Nguyen TM (eds) Data warehousing and knowledge discovery. Springer, Berlin, pp 303–312
Friedman J, Hastie T, Tibshirani R (2001) The elements of statistical learning. Springer, Heidelberg
Gaines BR, Kim J, Zhou H (2018) Algorithms for fitting the constrained Lasso. J Comput Graph Stat 27(4):861–871
Garside MJ (1965) The best sub-set in multiple regression analysis. J R Stat Soc Ser C (Appl Stat) 14(2–3):196–200
Gurobi Optimization, LLC (2018) Gurobi optimizer reference manual. http://www.gurobi.com
Hastie T, Tibshirani R, Wainwright M (2015) Statistical learning with sparsity. Chapman and Hall/CRC, New York
He H, Ma Y (2013) Imbalanced learning: foundations, algorithms, and applications. Wiley, Hoboken
Hoerl AE, Kennard RW (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12(1):55–67
Hu Q, Zeng P, Lin L (2015) The dual and degrees of freedom of linearly constrained generalized lasso. Comput Stat Data Anal 86:13–26
James GM, Paulson C, Rusmevichientong P (2019) Penalized and constrained optimization: an application to high-dimensional website advertising. J Am Stat Assoc 1–31
Kouno T, de Hoon M, Mar JC, Tomaru Y, Kawano M, Carninci P, Suzuki H, Hayashizaki Y, Shin JW (2013) Temporal dynamics and transcriptional control using single-cell gene expression analysis. Genome Biol 14(10):R118
Lee W, Jun CH, Lee JS (2017) Instance categorization by support vector machines to adjust weights in AdaBoost for imbalanced data classification. Inf Sci 381(Supplement C):92–103
Lichman M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml
Ollier E, Viallon V (2017) Regression modelling on stratified data with the lasso. Biometrika 104(1):83–96
Prati RC, Batista GEAPA, Silva DF (2015) Class imbalance revisited: a new experimental setup to assess the performance of treatment methods. Knowl Inf Syst 45(1):247–270
Redmond M, Baveja A (2002) A data-driven software tool for enabling cooperative information sharing among police departments. Eur J Oper Res 141(3):660–678
Rockafellar RT (1972) Convex analysis. Princeton University Press, Princeton
Shapiro A, Dentcheva D, Ruszczyński A (2009) Lectures on stochastic programming: modeling and theory. SIAM, Philadelphia
Simon N, Friedman J, Hastie T, Tibshirani R (2011) Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw 39(5):1–13
Stamey TA, Kabalin JN, McNeal JE, Johnstone IM, Freiha F, Redwine EA, Yang N (1989) Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate: II. Radical prostatectomy treated patients. J Urol 141(5):1076–1083
Sun Y, Wong AK, Kamel MS (2009) Classification of imbalanced data: a review. Int J Pattern Recognit Artif Intell 23:687–719
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodol) 58(1):267–288
Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K (2005) Sparsity and smoothness via the fused lasso. J R Stat Soc Ser B (Stat Methodol) 67(1):91–108
Tibshirani RJ, Taylor J (2011) The solution path of the generalized Lasso. Ann Stat 39(3):1335–1371
Torres-Barrán A, Alaíz CM, Dorronsoro JR (2018) \(\nu \)-SVM solutions of constrained Lasso and elastic net. Neurocomputing 275:1921–1931
U.S. Department of Commerce, Bureau of the Census (1992) Census of population and housing 1990 United States: summary tape file 1a & 3a (computer files). U.S. Department of Commerce, Bureau of the Census (producer), Washington, DC; Inter-university Consortium for Political and Social Research, Ann Arbor, MI
U.S. Department of Justice, Bureau of Justice Statistics (1992) Law enforcement management and administrative statistics (computer file). U.S. Department of Commerce, Bureau of the Census (producer), Washington, DC; Inter-university Consortium for Political and Social Research, Ann Arbor, MI
U.S. Department of Justice, Federal Bureau of Investigation (1995) Crime in the United States (computer file)
Yu G, Liu Y (2016) Sparse regression incorporating graphical structure among predictors. J Am Stat Assoc 111(514):707–720
Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B (Stat Methodol) 68(1):49–67
Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B (Stat Methodol) 67(2):301–320
Acknowledgements
Research partially supported by grants and projects MTM2015-65915-R (Ministerio de Economía y Competitividad, Spain), FQM-329 and P18-FR-2369 (Junta de Andalucía, Spain), Fundación BBVA, and the EC H2020 MSCA RISE NeEDS project (grant agreement ID: 822214). In addition, we would like to thank the associate editor and two anonymous reviewers for carefully reading this work and for their insightful comments, which have helped to improve the quality of this paper.
Appendices
Appendix: Proofs
1.1 Proof of Proposition 1
Given \(\lambda \ge 0\), consider problem (11). If \(\varvec{\beta } = \varvec{\beta }^+ - \varvec{\beta }^-\) with \(\varvec{\beta }^+ \ge 0\) and \(\varvec{\beta }^- \ge 0\), and \(\varvec{\lambda }=(0,\lambda ,\ldots ,\lambda )'\) is a vector of length \(p+1\), then the differentiable version of that problem turns out to be
\(\min _{\varvec{\beta }^+\ge 0,\,\varvec{\beta }^-\ge 0}\, \dfrac{1}{n_0}\Vert {\mathbf {y}}_0-{\mathbf {X}}_0(\varvec{\beta }^+-\varvec{\beta }^-)\Vert ^2 + \varvec{\lambda }'(\varvec{\beta }^++\varvec{\beta }^-) \quad \text {s.t.} \quad \dfrac{1}{n_1}\Vert {\mathbf {y}}_1-{\mathbf {X}}_1(\varvec{\beta }^+-\varvec{\beta }^-)\Vert ^2 \le f_1,\)
where \(f_1\) denotes the threshold of the performance constraint in (11).
From the Karush–Kuhn–Tucker conditions, with multiplier \(\eta \ge 0\) associated with the performance constraint and multipliers \(\varvec{\theta }^+,\varvec{\theta }^- \ge 0\) associated with the sign constraints,
\(-\dfrac{2}{n_0}{\mathbf {X}}_0'({\mathbf {y}}_0-{\mathbf {X}}_0\varvec{\beta }) + \varvec{\lambda } -\dfrac{2}{n_1}\eta {\mathbf {X}}_1'({\mathbf {y}}_1 - {\mathbf {X}}_1\varvec{\beta }) - \varvec{\theta }^+ = {\mathbf {0}}, \qquad \dfrac{2}{n_0}{\mathbf {X}}_0'({\mathbf {y}}_0-{\mathbf {X}}_0\varvec{\beta }) + \varvec{\lambda } +\dfrac{2}{n_1}\eta {\mathbf {X}}_1'({\mathbf {y}}_1 - {\mathbf {X}}_1\varvec{\beta }) - \varvec{\theta }^- = {\mathbf {0}},\)
together with the complementarity conditions \((\varvec{\theta }^+)'\varvec{\beta }^+ = 0\) and \((\varvec{\theta }^-)'\varvec{\beta }^- = 0\).
Thus,
- if \(\varvec{\beta }>0 \Rightarrow \varvec{\beta }^+>0, \varvec{\beta }^- =0 \Rightarrow \varvec{\theta }^+=0 \Rightarrow -\dfrac{2}{n_0}{\mathbf {X}}_0'({\mathbf {y}}_0-{\mathbf {X}}_0\varvec{\beta }) + \varvec{\lambda } -\dfrac{2}{n_1}\eta {\mathbf {X}}_1'({\mathbf {y}}_1 - {\mathbf {X}}_1\varvec{\beta }) = 0\)
- if \(\varvec{\beta }<0 \Rightarrow \varvec{\beta }^+=0, \varvec{\beta }^- >0 \Rightarrow \varvec{\theta }^-=0 \Rightarrow \dfrac{2}{n_0}{\mathbf {X}}_0'({\mathbf {y}}_0-{\mathbf {X}}_0\varvec{\beta }) + \varvec{\lambda } +\dfrac{2}{n_1}\eta {\mathbf {X}}_1'({\mathbf {y}}_1 - {\mathbf {X}}_1\varvec{\beta }) = 0\)
Therefore,
\(\dfrac{2}{n_0}{\mathbf {X}}_0'({\mathbf {X}}_0\varvec{{\hat{\beta }}}-{\mathbf {y}}_0) + \dfrac{2}{n_1}\eta (\lambda ){\mathbf {X}}_1'({\mathbf {X}}_1\varvec{{\hat{\beta }}}-{\mathbf {y}}_1) + {\mathbf {b}}(\lambda ) = {\mathbf {0}}, \qquad (18)\)
where \(\eta (\lambda )\) is the Lagrange multiplier associated with the first constraint and \({\mathbf {b}}(\lambda )\) is a \((p+1)\)-dimensional vector whose s-th component, \(s=0,1,\ldots ,p\), takes the value \(b_s(\lambda )=\lambda _s\) if \({\hat{\beta }}_s>0\), \(b_s(\lambda )=-\lambda _s\) if \({\hat{\beta }}_s<0\), and \(b_s(\lambda )\in [-\lambda _s,\lambda _s]\) if \({\hat{\beta }}_s=0\). Then, since \({\mathbf {X}}_0\) and \({\mathbf {X}}_1\) are maximum rank matrices, one obtains from (18) the following implicit expression for the solution \(\varvec{{\hat{\beta }}}^{CSCLasso}(\lambda )\) of Problem (11):
\(\varvec{{\hat{\beta }}}^{CSCLasso}(\lambda ) = \left( \dfrac{1}{n_0}{\mathbf {X}}_0'{\mathbf {X}}_0 + \dfrac{\eta (\lambda )}{n_1}{\mathbf {X}}_1'{\mathbf {X}}_1\right) ^{-1}\left( \dfrac{1}{n_0}{\mathbf {X}}_0'{\mathbf {y}}_0 + \dfrac{\eta (\lambda )}{n_1}{\mathbf {X}}_1'{\mathbf {y}}_1 - \dfrac{1}{2}{\mathbf {b}}(\lambda )\right) .\)
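As a numerical sanity check of the system (18), not part of the original proof, one may solve a single-constraint instance with an off-the-shelf conic solver, read the multiplier \(\eta (\lambda )\) off the dual value of the performance constraint, and verify the stationarity condition on the nonzero components. The sketch below uses cvxpy; the data, \(\lambda \) and threshold \(f_1\) are synthetic, and the intercept is omitted.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
n0, n1, p = 120, 60, 5
X0 = rng.standard_normal((n0, p))
X1 = rng.standard_normal((n1, p))
beta_true = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y0 = X0 @ beta_true + 0.3 * rng.standard_normal(n0)
y1 = X1 @ beta_true + 0.3 * rng.standard_normal(n1)

lam, f1 = 0.2, 0.1               # f1 chosen so the constraint is likely active
beta = cp.Variable(p)
perf = [cp.sum_squares(y1 - X1 @ beta) / n1 <= f1]
cp.Problem(cp.Minimize(cp.sum_squares(y0 - X0 @ beta) / n0
                       + lam * cp.norm1(beta)), perf).solve()

eta = perf[0].dual_value          # Lagrange multiplier eta(lambda)
b = beta.value
grad = (2 / n0) * X0.T @ (X0 @ b - y0) + (2 / n1) * eta * X1.T @ (X1 @ b - y1)

# On components with b_s != 0, stationarity forces grad_s = -lam * sign(b_s);
# on zero components, grad_s only has to lie in [-lam, lam].
nz = np.abs(b) > 1e-6
print(grad[nz] + lam * np.sign(b[nz]))   # approximately zero
```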
1.2 Proof of Theorem 1
Consider the function \(h:\varvec{\beta } \mapsto \dfrac{1}{n}\Vert {\mathbf {y}} - {\mathbf {X}}\varvec{\beta }\Vert ^2=\dfrac{1}{n}({\mathbf {y}}- {\mathbf {X}}\varvec{\beta })'({\mathbf {y}}-{\mathbf {X}}\varvec{\beta })\). Since \({\mathbf {X}}\) is a maximum rank matrix by hypothesis, the Hessian matrix \(H_h(\varvec{\beta })=\dfrac{2}{n}{\mathbf {X}}'{\mathbf {X}}\) is positive definite, from where we conclude that \(h(\varvec{\beta })\) is strictly convex, and hence \(h(\varvec{\beta }) + \lambda \Vert {\mathcal {A}}\varvec{\beta }\Vert _1\) is also a strictly convex function.
We next prove that \(h(\varvec{\beta })\) is a coercive function. Since \({\mathbf {X}}'{\mathbf {X}}\) is positive definite, its eigenvalues are all positive. In particular, the smallest eigenvalue, say \(\gamma _r\), will be nonzero. Moreover, using the spectral decomposition of a symmetric matrix,
\(h(\varvec{\beta }) = \dfrac{1}{n}\left( {\mathbf {y}}'{\mathbf {y}} - 2({\mathbf {X}}'{\mathbf {y}})'\varvec{\beta } + \varvec{\beta }'{\mathbf {X}}'{\mathbf {X}}\varvec{\beta }\right) \ge \dfrac{1}{n}\left( \Vert {\mathbf {y}}\Vert ^2 - 2\Vert {\mathbf {X}}'{\mathbf {y}}\Vert \,\Vert \varvec{\beta }\Vert + \varvec{\beta }'{\mathbf {X}}'{\mathbf {X}}\varvec{\beta }\right) \ge \dfrac{1}{n}\left( \Vert {\mathbf {y}}\Vert ^2 - 2\Vert {\mathbf {X}}'{\mathbf {y}}\Vert \,\Vert \varvec{\beta }\Vert + \gamma _r\Vert \varvec{\beta }\Vert ^2\right) ,\)
where, in the second-to-last step, the Cauchy-Schwarz inequality has been used. As \(\Vert \varvec{\beta }\Vert \rightarrow +\infty \), then \(h(\varvec{\beta })\rightarrow +\infty \) too, and thus \(h(\varvec{\beta })\) is a coercive function.
Now we show that (13) has an optimal solution. Let \(\varvec{\beta }^*\in \varvec{B}\). As \(h(\varvec{\beta })\) is coercive, there exists \(R>0\) such that
\(h(\varvec{\beta }) > h(\varvec{\beta }^*) + \lambda \Vert {\mathcal {A}}\varvec{\beta }^*\Vert _1\)
for all \(\varvec{\beta }\) such that \(\Vert \varvec{\beta }\Vert > R\). For that reason, the problem can be reduced to the feasible compact region \(\varvec{B}\cap \{\varvec{\beta }: \,\, \Vert \varvec{\beta }\Vert \le R\}\), which implies that the optimal solution is reached. Finally, the uniqueness of the solution follows from the fact that \(h(\varvec{\beta }) + \lambda \Vert {\mathcal {A}}\varvec{\beta }\Vert _1\) is strictly convex.
1.3 Proof of Proposition 2
Let us consider the optimization problem (2) and let \(\varvec{{\hat{\beta }}}^{Lasso}(\lambda )\) denote its optimal solution. The necessary and sufficient optimality condition is:
From the properties of the subdifferential (see Theorem 23.9 of Rockafellar (1972)) it follows that
which implies that (19) becomes
Consequently, the necessary and sufficient condition (20) at \(\varvec{{\hat{\beta }}}^{Lasso}(\lambda )={\mathbf {0}}\) is
since \(\partial \Vert {\mathbf {0}}\Vert _1\) is the unit ball of the norm \(\Vert \cdot \Vert _{\infty }\). Equivalently,
Therefore, the solution of the problem
will provide the minimum \(\lambda \) from which \(\varvec{{\hat{\beta }}}^{Lasso}(\lambda )={\mathbf {0}}\) is the optimal solution. If \({\mathbf {q}}=\lambda {\mathbf {t}}\), then Problem (21) becomes
The constraint \(\Vert {\mathbf {q}}\Vert _{\infty } \le \lambda \) is equivalent to \(|q_s| \le \lambda \), \(s=0,1,\ldots ,p\), and the result follows.
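Although the displays of this proof are not reproduced here, a familiar specialization is easy to check numerically: for a Lasso without intercept and with the \(1/n\) scaling used throughout, the smallest \(\lambda \) for which \(\varvec{{\hat{\beta }}}^{Lasso}(\lambda )={\mathbf {0}}\) is \(\Vert (2/n){\mathbf {X}}'{\mathbf {y}}\Vert _{\infty }\). The sketch below verifies this threshold behaviour on synthetic data, assuming the normalization matches that of problem (2).

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
n, p = 100, 8
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

lam_max = np.max(np.abs(2 * X.T @ y / n))   # smallest lambda giving beta = 0

for lam in [0.99 * lam_max, 1.01 * lam_max]:
    beta = cp.Variable(p)
    cp.Problem(cp.Minimize(cp.sum_squares(y - X @ beta) / n
                           + lam * cp.norm1(beta))).solve()
    # The fitted vector vanishes (up to solver tolerance) only above lam_max.
    print(lam, np.max(np.abs(beta.value)))
```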
1.4 Proof of Proposition 3
The proof follows very closely that of Theorem 1. First, it shall be proven that \(h: \varvec{\beta } \mapsto E[(Y-X'\varvec{\beta })^{2}]\) is strictly convex and coercive. It is strictly convex in \(\varvec{\beta }\) since its Hessian matrix, \(2E[XX']\), is positive definite, because X is an absolutely continuous p-dimensional random variable: for any \(u \ne {\mathbf {0}}\),
\(u'E[XX']u = E[(X'u)^{2}] > 0,\)
since \(P(X'u=0)=0\). Moreover, \(\lambda \Vert {\mathcal {A}}\varvec{\beta }\Vert _1\) is a convex function of \(\varvec{\beta }\) and, therefore, \(E[( Y-X' \varvec{\beta })^{2}+ \lambda \Vert {\mathcal {A}}\varvec{\beta }\Vert _1]\) is also a strictly convex function of \(\varvec{\beta }\).
On the one hand, the eigenvalues of \(E[XX']\) are all positive and, in particular, the smallest eigenvalue, say \(\gamma _r\), will be non-zero. On the other hand, using the spectral decomposition of a symmetric matrix,
\(E[(Y-X'\varvec{\beta })^{2}] = E[Y^{2}] - 2E[YX]'\varvec{\beta } + \varvec{\beta }'E[XX']\varvec{\beta } \ge E[Y^{2}] - 2\Vert E[YX]\Vert \,\Vert \varvec{\beta }\Vert + \varvec{\beta }'E[XX']\varvec{\beta } \ge E[Y^{2}] - 2\Vert E[YX]\Vert \,\Vert \varvec{\beta }\Vert + \gamma _r\Vert \varvec{\beta }\Vert ^{2},\)
where, in the second-to-last step, the Cauchy-Schwarz inequality was used. As \(\Vert \varvec{\beta }\Vert \rightarrow +\infty \), then \(E[(Y-X' \varvec{\beta })^{2}]\rightarrow +\infty \); that is, the quadratic function \(h(\varvec{\beta })=E[(Y-X' \varvec{\beta })^{2}]\) is coercive. The next step in the proof is to transform the original true problem (17) into an equivalent one with a feasible compact region \(\varvec{B}^*\). Given \(\varvec{\beta }^*\in \varvec{B}\), since \(h(\varvec{\beta })=E[(Y-X'\varvec{\beta })^{2}]\) is coercive, there exists \(R>0\) such that
\(h(\varvec{\beta }) > h(\varvec{\beta }^*) + \lambda \Vert {\mathcal {A}}\varvec{\beta }^*\Vert _1\)
for all \(\varvec{\beta }\) with \(\Vert \varvec{\beta }\Vert > R\). For that reason, the problem (17) can be reduced to the feasible compact region \(\varvec{B}^* = \varvec{B}\cap \{\varvec{\beta }: \,\, \Vert \varvec{\beta }\Vert \le R\}\), which implies that the optimal solution is reached.
Finally, the uniqueness of solution is a consequence of the strict convexity of the objective function.
1.5 Proof of Theorem 2
For the sake of simplicity, \(\varvec{\beta }^{CSCLasso}(\lambda )\) and \(\varvec{{\hat{\beta }}}^{CSCLasso}(\lambda )\) will be denoted henceforth by \(\varvec{\beta }\) and \(\varvec{{\hat{\beta }}}\), respectively. In addition, let us consider the nonempty compact set \(C=\varvec{B}\cap \{\varvec{\beta }: \,\, \Vert \varvec{\beta }\Vert \le R\}\), where R is chosen according to the proof of Theorem 1.
Theorem 2 is a direct consequence of Theorem 5.3 in Shapiro et al. (2009) under some technical conditions, namely:
- C1. The expected value function \(E[(Y-X'\mathbf {\varvec{\beta }})^{2}+ \lambda \Vert {\mathcal {A}}\mathbf {\varvec{\beta }}\Vert _1]\) is finite valued and continuous on C.
- C2. \(\dfrac{1}{n}\sum _{i=1}^{n}((y_{i}-x^{'}_{i} \varvec{\beta })^{2}+ \lambda \Vert {\mathcal {A}}\varvec{\beta }\Vert _1)\) converges to \(E[(Y-X'\mathbf {\varvec{\beta }})^{2}+ \lambda \Vert {\mathcal {A}}\mathbf {\varvec{\beta }}\Vert _1]\) w.p. 1, as \(n\rightarrow \infty \), uniformly in \(\varvec{\beta }\in C\).
Let us denote \(F(\varvec{\beta },(Y,X))=(Y-X'\mathbf {\varvec{\beta }})^{2}+ \lambda \Vert {\mathcal {A}}\mathbf {\varvec{\beta }}\Vert _1\). Then, the previous conditions C1 and C2 are consequences of Theorem 7.48 in Shapiro et al. (2009) provided that
- A1. for any \(\varvec{\beta } \in C\), the function \(F(\cdot ,(Y,X))\) is continuous at \(\varvec{\beta }\) for almost every (Y, X),
- A2. the function \(F(\varvec{\beta },(Y,X))\), with \(\varvec{\beta }\in C\), is dominated by an integrable function,
- A3. the sample is i.i.d.
Given (Y, X), the function \((Y-X'\mathbf {\varvec{\beta }})^{2}+ \lambda \Vert {\mathcal {A}}\mathbf {\varvec{\beta }}\Vert _1\) is continuous at \(\varvec{\beta }\) for any \(\varvec{\beta } \in C\), and therefore A1 is fulfilled. The sample is i.i.d. by hypothesis, and thus A3 holds too. Finally, in order to prove A2, it is necessary to find a measurable function \(g(Y, X)>0\) such that \(E[g(Y,X)]< \infty \) and, for every \(\varvec{\beta }\in C\), \(\mid F(\varvec{\beta }, (Y,X))\mid \le g(Y,X)\) w.p. 1. Using the Cauchy-Schwarz inequality, one has
\(\mid F(\varvec{\beta },(Y,X))\mid \le Y^{2} + 2\Vert YX\Vert \,\Vert \varvec{\beta }\Vert + \Vert X\Vert ^{2}\Vert \varvec{\beta }\Vert ^{2} + \lambda \Vert {\mathcal {A}}\varvec{\beta }\Vert _1.\)
Let \(M_1\) and \(M_2\) be given by
\(M_1 = \max _{\varvec{\beta }\in C}\Vert \varvec{\beta }\Vert , \qquad M_2 = \max _{\varvec{\beta }\in C}\lambda \Vert {\mathcal {A}}\varvec{\beta }\Vert _1,\)
which are well defined due to the compactness of C. Therefore, g can be chosen as
\(g(Y,X) = Y^{2} + 2M_1\Vert YX\Vert + M_1^{2}\Vert X\Vert ^{2} + M_2,\)
which is positive and, since \(E(\Vert X\Vert ^2)<\infty \), \(E(Y^2)<\infty \), \(E(\Vert YX\Vert )<\infty \), its expected value is finite. In consequence, A2 holds and the proof is concluded.
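The sample average approximation argument underlying Theorem 2 can also be illustrated numerically. In an unconstrained setting with standard normal covariates, the population minimizer of \(E[(Y-X'\varvec{\beta })^{2}] + \lambda \Vert \varvec{\beta }\Vert _1\) is the componentwise soft-thresholding of the true coefficients at \(\lambda /2\), so the empirical minimizer should approach it as n grows. The sketch below (synthetic setup, no performance constraints, no intercept, cvxpy assumed available) shows the shrinking distance.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
p, lam = 4, 0.1
beta_true = np.array([1.5, 0.0, -1.0, 0.0])

def fit(n):
    """Empirical penalized estimate from a sample of size n."""
    X = rng.standard_normal((n, p))
    y = X @ beta_true + 0.5 * rng.standard_normal(n)
    b = cp.Variable(p)
    cp.Problem(cp.Minimize(cp.sum_squares(y - X @ b) / n
                           + lam * cp.norm1(b))).solve()
    return b.value

# Population minimizer when E[XX'] = I: soft-threshold beta_true at lam/2.
pop = np.sign(beta_true) * np.maximum(np.abs(beta_true) - lam / 2, 0.0)
for n in [100, 1000, 10000]:
    print(n, np.linalg.norm(fit(n) - pop))   # distance shrinks with n
```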
Appendix: Further results
To fully understand how the computation time behaves as a function of \(n_k\) and p, a grid over both parameters has been inspected. Figure 11 displays the logarithm of the user times in seconds obtained under the Lasso and CSCLasso models when \(n_k\) and p change. The perspective drawn in the top left figure shows that the Lasso model (bottom surface) is solved faster and in a smoother way. Besides, whereas smaller times are obtained for both methods when \(n_k\) and p are small, the longest times are associated with \(n_k=300\) and \(p=500\).
Cite this article
Blanquero, R., Carrizosa, E., Ramírez-Cobo, P. et al. A cost-sensitive constrained Lasso. Adv Data Anal Classif 15, 121–158 (2021). https://doi.org/10.1007/s11634-020-00389-5
Keywords
- Performance constraints
- Cost-sensitive learning
- Sparse solutions
- Sample average approximation
- Heterogeneity
- Lasso