A cost-sensitive constrained Lasso

  • Regular Article
  • Published in Advances in Data Analysis and Classification

Abstract

The Lasso has become a benchmark data analysis procedure, and numerous variants have been proposed in the literature. Although the Lasso formulations are stated so that the overall prediction error is optimized, they allow no full control over the prediction accuracy on certain individuals of interest. In this work we propose a novel version of the Lasso in which quadratic performance constraints are added to Lasso-based objective functions, in such a way that threshold values are set to bound the prediction errors in the different groups of interest (not necessarily disjoint). As a result, a constrained sparse regression model is defined by a nonlinear optimization problem. This cost-sensitive constrained Lasso has a direct application to heterogeneous samples where data are collected from distinct sources, as is standard in many biomedical contexts. Both theoretical properties and empirical studies concerning the new method are explored in this paper. In addition, two illustrations of the method in biomedical and sociological contexts are considered.


References

  • Bradford JP, Kunz C, Kohavi R, Brunk C, Brodley CE (1998) Pruning decision trees with misclassification costs. In: Nédellec C, Rouveirol C (eds) Machine learning: ECML-98. Springer, Berlin, pp 131–136

  • Bühlmann P, Van-De Geer S (2011) Statistics for high-dimensional data. Springer, Berlin

  • Carrizosa E, Romero-Morales D (2001) Combining minsum and minmax: a goal programming approach. Oper Res 49(1):169–174

  • Carrizosa E, Martín-Barragán B, Morales DR (2008) Multi-group support vector machines with measurement costs: a biobjective approach. Discrete Appl Math 156:950–966

  • Datta S, Das S (2015) Near-Bayesian support vector machines for imbalanced data classification with equal or unequal misclassification costs. Neural Netw 70:39–52

  • Donoho DL, Johnstone IM, Kerkyacharian G, Picard D (1995) Wavelet shrinkage: Asymptopia? J R Stat Soc Ser B (Methodol) 57(2):301–369

  • Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499

  • Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360

  • Freitas A, Costa-Pereira A, Brazdil P (2007) Cost-sensitive decision trees applied to medical data. In: Song IY, Eder J, Nguyen TM (eds) Data warehousing and knowledge discovery. Springer, Berlin, pp 303–312

  • Friedman J, Hastie T, Tibshirani R (2001) The elements of statistical learning. Springer, Heidelberg

  • Gaines BR, Kim J, Zhou H (2018) Algorithms for fitting the constrained Lasso. J Comput Graph Stat 27(4):861–871

  • Garside MJ (1965) The best sub-set in multiple regression analysis. J R Stat Soc Ser C (Appl Stat) 14(2–3):196–200

  • Gurobi Optimization LLC (2018) Gurobi optimizer reference manual. http://www.gurobi.com

  • Hastie T, Tibshirani R, Wainwright M (2015) Statistical learning with sparsity. Chapman and Hall/CRC, New York

  • He H, Ma Y (2013) Imbalanced learning: foundations, algorithms, and applications. Wiley, Hoboken

  • Hoerl AE, Kennard RW (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12(1):55–67

  • Hu Q, Zeng P, Lin L (2015) The dual and degrees of freedom of linearly constrained generalized lasso. Comput Stat Data Anal 86:13–26

  • James GM, Paulson C, Rusmevichientong P (2019) Penalized and constrained optimization: an application to high-dimensional website advertising. J Am Stat Assoc 1–31

  • Kouno T, de Hoon M, Mar JC, Tomaru Y, Kawano M, Carninci P, Suzuki H, Hayashizaki Y, Shin JW (2013) Temporal dynamics and transcriptional control using single-cell gene expression analysis. Genome Biol 14(10):R118

  • Lee W, Jun CH, Lee JS (2017) Instance categorization by support vector machines to adjust weights in adaboost for imbalanced data classification. Inf Sci 381(Supplement C):92–103

  • Lichman M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml

  • Ollier E, Viallon V (2017) Regression modelling on stratified data with the lasso. Biometrika 104(1):83–96

  • Prati RC, Batista GEAPA, Silva DF (2015) Class imbalance revisited: a new experimental setup to assess the performance of treatment methods. Knowl Inf Syst 45(1):247–270

  • Redmond M, Baveja A (2002) A data-driven software tool for enabling cooperative information sharing among police departments. Eur J Oper Res 141(3):660–678

  • Rockafellar RT (1972) Convex analysis. Princeton University Press, Princeton

  • Shapiro A, Dentcheva D, Ruszczyński A (2009) Lectures on stochastic programming: modeling and theory. SIAM, Philadelphia

  • Simon N, Friedman J, Hastie T, Tibshirani R (2011) Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw 39(5):1–13

  • Stamey TA, Kabalin JN, McNeal JE, Johnstone IM, Freiha F, Redwine EA, Yang N (1989) Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate: II. Radical prostatectomy treated patients. J Urol 141(5):1076–1083

  • Sun Y, Wong AK, Kamel MS (2009) Classification of imbalanced data: a review. Int J Pattern Recognit Artif Intell 23:687–719

  • Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodol) 58(1):267–288

  • Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K (2005) Sparsity and smoothness via the fused lasso. J R Stat Soc Ser B (Stat Methodol) 67(1):91–108

  • Tibshirani RJ, Taylor J (2011) The solution path of the generalized Lasso. Ann Stat 39(3):1335–1371

  • Torres-Barrán A, Alaíz CM, Dorronsoro JR (2018) \(\nu \)-SVM solutions of constrained Lasso and elastic net. Neurocomputing 275:1921–1931

  • U.S. Department of Commerce, Bureau of the Census (1992) Census of Population and Housing 1990 United States: Summary Tape File 1a & 3a (computer files). U.S. Department of Commerce, Bureau of the Census (producer), Washington, DC, and Inter-university Consortium for Political and Social Research, Ann Arbor, MI

  • U.S. Department of Justice, Bureau of Justice Statistics (1992) Law Enforcement Management and Administrative Statistics (computer file). U.S. Department of Commerce, Bureau of the Census (producer), Washington, DC, and Inter-university Consortium for Political and Social Research, Ann Arbor, MI

  • U.S. Department of Justice, Federal Bureau of Investigation (1995) Crime in the United States (computer file)

  • Yu G, Liu Y (2016) Sparse regression incorporating graphical structure among predictors. J Am Stat Assoc 111(514):707–720

  • Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B (Stat Methodol) 68(1):49–67

  • Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429

  • Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B (Stat Methodol) 67(2):301–320

Acknowledgements

Research partially supported by grants and projects MTM2015-65915-R (Ministerio de Economía y Competitividad, Spain), FQM-329 and P18-FR-2369 (Junta de Andalucía, Spain), Fundación BBVA and the EC H2020 MSCA RISE NeEDS project (Grant agreement ID: 822214). In addition, we would like to thank the associate editor and the two anonymous reviewers for carefully reading this work and for their insightful comments, which have helped to improve the quality of this paper.

Author information

Corresponding author

Correspondence to M. Remedios Sillero-Denamiel.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix: Proofs

1.1 Proof of Proposition 1

Given \(\lambda \ge 0\), consider problem (11). Writing \(\varvec{\beta } = \varvec{\beta }^+ - \varvec{\beta }^-\) with \(\varvec{\beta }^+ \ge 0\) and \(\varvec{\beta }^- \ge 0\), and letting \(\varvec{\lambda }=(0,\lambda ,\ldots ,\lambda )'\) be a vector of length \(p+1\), the differentiable version of that problem becomes

$$\begin{aligned} \begin{aligned} \underset{\varvec{\beta ^+,\beta ^-}}{\min }&\dfrac{1}{n_0}\Vert {\mathbf {y}}_0-{\mathbf {X}}_0(\varvec{\beta }^+ - \varvec{\beta }^-)\Vert ^{2} + \varvec{\lambda }'\varvec{\beta }^+ + \varvec{\lambda }'\varvec{\beta }^-\\ \text{ s.t. }&\dfrac{1}{n_1}\Vert {\mathbf {y}}_1-{\mathbf {X}}_1(\varvec{\beta }^+ - \varvec{\beta }^-)\Vert ^{2} -(1+ \tau )MSE_1(\varvec{{\hat{\beta }}}^{ols}) \le 0,\\&\varvec{\beta }^+ \ge \varvec{0} \Leftrightarrow -\varvec{\beta }^+ \le \varvec{0},\\&\varvec{\beta }^- \ge \varvec{0} \Leftrightarrow -\varvec{\beta }^- \le \varvec{0}. \end{aligned} \end{aligned}$$
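As a purely illustrative aside, the split formulation above can be handed directly to a generic conic solver. The sketch below, on synthetic data and with illustrative variable names (it is not the implementation used in the paper, which relies on Gurobi), models it with the cvxpy library; the dual value of the group constraint plays the role of the multiplier \(\eta \) appearing in the Karush–Kuhn–Tucker conditions below.

```python
import numpy as np
import cvxpy as cp

# Synthetic data: group 0 drives the objective, group 1 the performance constraint.
rng = np.random.default_rng(0)
n0, n1, p = 60, 30, 8
X0 = np.column_stack([np.ones(n0), rng.normal(size=(n0, p))])  # intercept column included
X1 = np.column_stack([np.ones(n1), rng.normal(size=(n1, p))])
beta_true = np.concatenate([[1.0], rng.normal(size=p)])
y0 = X0 @ beta_true + 0.5 * rng.normal(size=n0)
y1 = X1 @ beta_true + 0.5 * rng.normal(size=n1)

# Reference value playing the role of MSE_1(beta_ols) in the display above
# (here, OLS fitted on the pooled sample; purely illustrative).
beta_ols = np.linalg.lstsq(np.vstack([X0, X1]), np.concatenate([y0, y1]), rcond=None)[0]
mse1_ols = np.mean((y1 - X1 @ beta_ols) ** 2)

lam, tau = 0.1, 0.2
lam_vec = np.concatenate([[0.0], np.full(p, lam)])  # lambda vector: intercept unpenalized

beta_p = cp.Variable(p + 1, nonneg=True)  # beta^+
beta_m = cp.Variable(p + 1, nonneg=True)  # beta^-
beta = beta_p - beta_m

objective = cp.Minimize(cp.sum_squares(y0 - X0 @ beta) / n0
                        + lam_vec @ beta_p + lam_vec @ beta_m)
group_constraint = cp.sum_squares(y1 - X1 @ beta) / n1 <= (1 + tau) * mse1_ols
problem = cp.Problem(objective, [group_constraint])
problem.solve()

print("beta_hat:", np.round(beta.value, 3))
print("multiplier of the group constraint (eta):", group_constraint.dual_value)
```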

From the Karush–Kuhn–Tucker conditions,

$$\begin{aligned}&{\mathcal {L}}(\varvec{\beta }^+, \varvec{\beta }^-, \varvec{\theta }^+, \varvec{\theta }^-, \eta )= \dfrac{1}{n_0}\Vert {\mathbf {y}}_0-{\mathbf {X}}_0 (\varvec{\beta }^+-\varvec{\beta }^-)\Vert ^2 + \varvec{\lambda }'\varvec{\beta }^+ + \varvec{\lambda }'\varvec{\beta }^- - (\varvec{\theta }^+)' \varvec{\beta }^+ - (\varvec{\theta }^-)'\varvec{\beta }^- \\&\quad + \eta \left( \dfrac{1}{n_1}\Vert {\mathbf {y}}_1 -{\mathbf {X}}_1(\varvec{\beta }^+-\varvec{\beta }^-)\Vert ^2 - (1+\tau )MSE_1(\varvec{{\hat{\beta }}}^{ols})\right) \\&\quad \dfrac{\partial }{\partial \varvec{\beta }^+}: -\dfrac{2}{n_0}{\mathbf {X}}_0'({\mathbf {y}}_0-{\mathbf {X}}_0(\varvec{\beta }^+ - \varvec{\beta }^-)) + \varvec{\lambda } - \varvec{\theta }^+ - \dfrac{2}{n_1}\eta {\mathbf {X}}_1'({\mathbf {y}}_1 - {\mathbf {X}}_1(\varvec{\beta }^+ - \varvec{\beta }^-)) = 0 \\&\quad \dfrac{\partial }{\partial \varvec{\beta }^-}: \dfrac{2}{n_0}{\mathbf {X}}_0'({\mathbf {y}}_0-{\mathbf {X}}_0(\varvec{\beta }^+ - \varvec{\beta }^-)) + \varvec{\lambda } - \varvec{\theta }^- + \dfrac{2}{n_1}\eta {\mathbf {X}}_1'({\mathbf {y}}_1 - {\mathbf {X}}_1(\varvec{\beta }^+ - \varvec{\beta }^-)) = 0 \\&\quad \varvec{\theta }^+, \varvec{\theta }^-, \eta \ge 0 \\&\quad (\varvec{\theta }^+)' \varvec{\beta }^+ = 0 \\&\quad (\varvec{\theta }^-)' \varvec{\beta }^- = 0 \\&\quad \eta \left( \dfrac{1}{n_1}\Vert {\mathbf {y}}_1 - {\mathbf {X}}_1(\varvec{\beta }^+ - \varvec{\beta }^-)\Vert ^2 -(1+\tau )MSE_1(\varvec{{\hat{\beta }}}^{ols})\right) = 0 \end{aligned}$$

Thus,

  • if \(\varvec{\beta }>0 \Rightarrow \varvec{\beta }^+>0, \varvec{\beta }^- =0 \Rightarrow \varvec{\theta }^+=0 \Rightarrow -\dfrac{2}{n_0}{\mathbf {X}}_0'({\mathbf {y}}_0-{\mathbf {X}}_0\varvec{\beta }) + \varvec{\lambda } -\dfrac{2}{n_1}\eta {\mathbf {X}}_1'({\mathbf {y}}_1 - {\mathbf {X}}_1\varvec{\beta }) = 0\)

  • if \(\varvec{\beta }<0 \Rightarrow \varvec{\beta }^+=0, \varvec{\beta }^- >0 \Rightarrow \varvec{\theta }^-=0 \Rightarrow \dfrac{2}{n_0}{\mathbf {X}}_0'({\mathbf {y}}_0-{\mathbf {X}}_0\varvec{\beta }) + \varvec{\lambda } +\dfrac{2}{n_1}\eta {\mathbf {X}}_1'({\mathbf {y}}_1 - {\mathbf {X}}_1\varvec{\beta }) = 0\)

Therefore,

$$\begin{aligned} \dfrac{2}{n_0}{\mathbf {X}}_0'({\mathbf {y}}_0-{\mathbf {X}}_0\varvec{\beta }) +\dfrac{2}{n_1}\eta (\lambda ) {\mathbf {X}}_1'({\mathbf {y}}_1 - {\mathbf {X}}_1\varvec{\beta }) = {\mathbf {b}}(\lambda ), \end{aligned}$$
(18)

where \(\eta (\lambda )\) is the Lagrange multiplier associated with the first constraint and \({\mathbf {b}}(\lambda )\) is a \((p+1)\)-dimensional vector whose s-th component, \(s=0,1,\ldots ,p\), takes the following value

$$\begin{aligned} b_s(\lambda )=\begin{cases} \lambda , &{} \text {if } \beta _s>0, \\ -\lambda , &{} \text {if } \beta _s<0, \\ 0, &{} \text {otherwise.} \end{cases} \end{aligned}$$

Then, since \({\mathbf {X}}_0\) and \({\mathbf {X}}_1\) are maximum rank matrices, one obtains from (18) the following implicit expression for the solution \(\varvec{{\hat{\beta }}}^{CSCLasso}(\lambda )\) of Problem (11)

$$\begin{aligned} \varvec{{\hat{\beta }}}^{CSCLasso}(\lambda )= & {} \left( \dfrac{1}{n_0}{\mathbf {X}}_0'{\mathbf {X}}_0 + \dfrac{1}{n_1}\eta (\lambda ){\mathbf {X}}_1'{\mathbf {X}}_1\right) ^{-1} \left( \dfrac{1}{n_0}{\mathbf {X}}_0'{\mathbf {y}}_0 + \dfrac{1}{n_1}\eta (\lambda ){\mathbf {X}}_1'{\mathbf {y}}_1\right) \\&- \dfrac{1}{2}\left( \dfrac{1}{n_0}{\mathbf {X}}_0'{\mathbf {X}}_0 + \dfrac{1}{n_1}\eta (\lambda ){\mathbf {X}}_1'{\mathbf {X}}_1\right) ^{-1} {\mathbf {b}}(\lambda ). \end{aligned}$$
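Since \(\eta (\lambda )\) and the sign pattern in \({\mathbf {b}}(\lambda )\) depend on the solution itself, the expression above is only implicit. For completeness, a direct transcription into code (a sketch only; \(\eta \) and the sign vector must be supplied externally, for instance read off from a solver run such as the one sketched earlier):

```python
import numpy as np

def csclasso_implicit(X0, y0, X1, y1, lam, eta, signs):
    """Transcription of the implicit expression above (illustrative sketch).

    eta   : value of the multiplier eta(lambda) of the group-1 constraint.
    signs : array in {-1, 0, +1} with the sign pattern of beta, so that
            b(lambda) = lam * signs (use 0 for the unpenalized intercept
            and for zero coefficients).
    """
    n0, n1 = X0.shape[0], X1.shape[0]
    M = X0.T @ X0 / n0 + eta * (X1.T @ X1) / n1
    rhs = X0.T @ y0 / n0 + eta * (X1.T @ y1) / n1
    b = lam * np.asarray(signs, dtype=float)
    return np.linalg.solve(M, rhs) - 0.5 * np.linalg.solve(M, b)
```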

1.2 Proof of Theorem 1

Consider the function \(h:\varvec{\beta } \mapsto \dfrac{1}{n}\Vert {\mathbf {y}} - {\mathbf {X}}\varvec{\beta }\Vert ^2=\dfrac{1}{n}({\mathbf {y}}- {\mathbf {X}}\varvec{\beta })'({\mathbf {y}}-{\mathbf {X}}\varvec{\beta })\), where \({\mathbf {X}}\) has maximum rank by hypothesis. Hence the Hessian matrix \(H_h(\varvec{\beta })=\dfrac{2}{n}{\mathbf {X}}'{\mathbf {X}}\) is positive definite, from which we conclude that \(h(\varvec{\beta })\) is strictly convex and, therefore, that \(h(\varvec{\beta }) + \lambda \Vert {\mathcal {A}}\varvec{\beta }\Vert _1\) is also strictly convex.

We next prove that \(h(\varvec{\beta })\) is coercive. Since \({\mathbf {X}}'{\mathbf {X}}\) is positive definite, its eigenvalues are all positive; in particular, the smallest eigenvalue, say \(\gamma _r\), is strictly positive. Moreover, using the spectral decomposition of a symmetric matrix,

$$\begin{aligned} \dfrac{1}{n}\Vert {\mathbf {y}} - {\mathbf {X}}\varvec{\beta }\Vert ^2= & {} \dfrac{1}{n}({\mathbf {y}} -{\mathbf {X}}\varvec{\beta })'({\mathbf {y}}-{\mathbf {X}} \varvec{\beta })=\dfrac{1}{n}\varvec{\beta }'{\mathbf {X}}' {\mathbf {X}}\varvec{\beta } - \dfrac{2}{n}{\mathbf {y}}'{\mathbf {X}}\varvec{\beta } + \dfrac{1}{n}{\mathbf {y}}'{\mathbf {y}}\\= & {} \dfrac{1}{n}\varvec{\beta }'Q'DQ\varvec{\beta } -\dfrac{2}{n}{\mathbf {y}}'{\mathbf {X}}\varvec{\beta }+ \dfrac{1}{n}{\mathbf {y}}'{\mathbf {y}} \\\ge & {} \dfrac{1}{n}\varvec{\beta }'Q'DQ\varvec{\beta } -\left| \dfrac{2}{n}{\mathbf {y}}'{\mathbf {X}}\varvec{\beta }\right| + \dfrac{1}{n}{\mathbf {y}}'{\mathbf {y}} \ge \dfrac{\gamma _r}{n}\Vert Q\varvec{\beta }\Vert ^2 - \left\| \dfrac{2}{n}{\mathbf {y}}'{\mathbf {X}}\right\| \Vert \varvec{\beta }\Vert +\dfrac{1}{n} {\mathbf {y}}'{\mathbf {y}}\\= & {} \dfrac{\gamma _r}{n}\Vert \varvec{\beta }\Vert ^2 - \left\| \dfrac{2}{n}{\mathbf {y}}'{\mathbf {X}}\right\| \Vert \varvec{\beta }\Vert + \dfrac{1}{n} {\mathbf {y}}'{\mathbf {y}}, \end{aligned}$$

where, in the second-to-last step, the Cauchy-Schwarz inequality has been used. As \(\Vert \varvec{\beta }\Vert \rightarrow +\infty \), then \(h(\varvec{\beta })\rightarrow +\infty \) too, and thus \(h(\varvec{\beta })\) is a coercive function.

Now we show that (13) has an optimal solution. Let \(\varvec{\beta }^*\in \varvec{B}\). Since \(h(\varvec{\beta })\) is coercive, there exists \(R>0\) such that

$$\begin{aligned} \dfrac{1}{n}\Vert {\mathbf {y}} - {\mathbf {X}}\varvec{\beta }\Vert ^2 \,\,> \dfrac{1}{n}\Vert {\mathbf {y}} - {\mathbf {X}}\varvec{\beta }^*\Vert ^2 + \lambda \Vert {\mathcal {A}}\varvec{\beta }^*\Vert _1, \end{aligned}$$

for all \(\varvec{\beta }\) such that \(\Vert \varvec{\beta }\Vert > R\); since the penalty term is nonnegative, the objective value at any such \(\varvec{\beta }\) also exceeds the objective value at \(\varvec{\beta }^*\). The problem can therefore be reduced to the compact feasible region \(\varvec{B}\cap \{\varvec{\beta }: \,\, \Vert \varvec{\beta }\Vert \le R\}\), on which the minimum is attained. Finally, the uniqueness of the solution follows from the strict convexity of \(h(\varvec{\beta }) + \lambda \Vert {\mathcal {A}}\varvec{\beta }\Vert _1\).

1.3 Proof of Proposition 2

Let us consider the optimization problem (2) and let \(\varvec{{\hat{\beta }}}^{Lasso}(\lambda )\) denote its optimal solution. The necessary and sufficient optimality condition is:

$$\begin{aligned} \nabla \dfrac{1}{n}\Vert {\mathbf {y}}-{\mathbf {X}} \varvec{{\hat{\beta }}}^{Lasso}(\lambda )\Vert ^{2} + \lambda \partial \Vert {\mathcal {A}}\varvec{{\hat{\beta }}}^{Lasso}(\lambda )\Vert _1 \ni \varvec{0}. \end{aligned}$$
(19)

From the properties of the subdifferential (see Theorem 23.9 of Rockafellar 1972) it follows that

$$\begin{aligned} \partial \Vert {\mathcal {A}}\varvec{{\hat{\beta }}}^{Lasso}(\lambda )\Vert _1 = {\mathcal {A}}' \partial \Vert .\Vert _{1_{\mid {\mathcal {A}}\varvec{{\hat{\beta }}}^{Lasso}(\lambda )}}, \end{aligned}$$

which implies that (19) becomes

$$\begin{aligned} -\dfrac{2}{n}{\mathbf {X}}'({\mathbf {y}}-{\mathbf {X}} \varvec{{\hat{\beta }}}^{Lasso}(\lambda )) + \lambda {\mathcal {A}}' \partial \Vert .\Vert _{1_{\mid {\mathcal {A}}\varvec{{\hat{\beta }}}^{Lasso}(\lambda )}} \ni {\mathbf {0}}. \end{aligned}$$
(20)

Consequently, the necessary and sufficient condition (20) at \(\varvec{{\hat{\beta }}}^{Lasso}(\lambda )={\mathbf {0}}\) reads

$$\begin{aligned} -\dfrac{2}{n}{\mathbf {X}}'{\mathbf {y}} + \lambda \{{\mathcal {A}}'{\mathbf {t}} : \Vert {\mathbf {t}}\Vert _{\infty } \le 1\} \ni {\mathbf {0}}, \end{aligned}$$

since \(\partial \Vert {\mathbf {0}}\Vert _1\) is the unit ball of the \(\Vert .\Vert _{\infty }\) norm. Equivalently,

$$\begin{aligned} \dfrac{2}{n}{\mathbf {X}}'{\mathbf {y}} \in \{{\mathcal {A}}'\lambda {\mathbf {t}} : \Vert {\mathbf {t}}\Vert _{\infty } \le 1\}. \end{aligned}$$

Therefore, the solution of the problem

$$\begin{aligned} \begin{aligned} \underset{\lambda , {\mathbf {t}}}{\min }&\lambda \\ \text{ s.t. }&\dfrac{2}{n}{\mathbf {X}}'{\mathbf {y}} = {\mathcal {A}}'\lambda {\mathbf {t}},\\&\Vert {\mathbf {t}}\Vert _{\infty } \le 1,\\&\lambda \ge 0, \end{aligned} \end{aligned}$$
(21)

will provide the minimum \(\lambda \) from which onwards \(\varvec{{\hat{\beta }}}^{Lasso}(\lambda )={\mathbf {0}}\) is the optimal solution. Setting \({\mathbf {q}}=\lambda {\mathbf {t}}\), Problem (21) becomes

$$\begin{aligned} \begin{aligned} \underset{\lambda , {\mathbf {q}}}{\min }&\lambda \\ \text{ s.t. }&\dfrac{2}{n}{\mathbf {X}}'{\mathbf {y}} = {\mathcal {A}}'{\mathbf {q}},\\&\Vert {\mathbf {q}}\Vert _{\infty } \le \lambda .\\ \end{aligned} \end{aligned}$$

The constraint \(\Vert {\mathbf {q}}\Vert _{\infty } \le \lambda \) is equivalent to \(|q_s| \le \lambda \), \(s=0,1,\ldots ,p\), and the result follows.
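This last problem is a small linear program. As an illustration only, the following sketch (using the cvxpy library, with illustrative names and synthetic data not taken from the paper) solves it for a given design matrix and constraint matrix \({\mathcal {A}}\); when \({\mathcal {A}}\) is the identity the optimal value reduces to \(\Vert \frac{2}{n}{\mathbf {X}}'{\mathbf {y}}\Vert _{\infty }\), which the example checks numerically.

```python
import numpy as np
import cvxpy as cp

def lasso_zero_threshold(X, y, A):
    """Smallest lambda for which the (generalized) Lasso estimate is 0,
    obtained by solving the linear program above with cvxpy.
    Assumes the equality system is feasible (e.g. A square and nonsingular)."""
    n = X.shape[0]
    lam = cp.Variable(nonneg=True)
    q = cp.Variable(A.shape[0])
    constraints = [(2.0 / n) * X.T @ y == A.T @ q,
                   cp.norm(q, "inf") <= lam]
    cp.Problem(cp.Minimize(lam), constraints).solve()
    return lam.value

# Sanity check: with A = identity the LP value matches ||(2/n) X'y||_inf.
rng = np.random.default_rng(1)
X, y = rng.normal(size=(50, 6)), rng.normal(size=50)
print(lasso_zero_threshold(X, y, np.eye(6)),
      np.max(np.abs(2.0 / 50 * X.T @ y)))
```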

1.4 Proof of Proposition 3

The proof follows very closely that of Theorem 1. First, we prove that \(h: \varvec{\beta } \mapsto E[(Y-X'\varvec{\beta })^{2}]\) is strictly convex and coercive. It is strictly convex in \(\varvec{\beta }\) since its Hessian matrix, \(2E[XX']\), is positive definite, because X is an absolutely continuous p-dimensional random variable:

$$\begin{aligned} u'E[XX']u=E[u'XX'u]=E[(X'u)^{2}]>0, \end{aligned}$$

for every \(u \ne 0\), since \(P(X'u=0)=0\). Moreover, \(\lambda \Vert {\mathcal {A}}\varvec{\beta }\Vert _1\) is a convex function of \(\varvec{\beta }\) and, therefore, \(E[( Y-X' \varvec{\beta })^{2}+ \lambda \Vert {\mathcal {A}}\varvec{\beta }\Vert _1]\) is also a strictly convex function of \(\varvec{\beta }\).

Regarding coercivity, the eigenvalues of the Hessian matrix are all positive; in particular, the smallest eigenvalue, say \(\gamma _r\), is strictly positive. Using the spectral decomposition of a symmetric matrix,

$$\begin{aligned} E[(Y-X'\varvec{\beta })^{2}]= & {} \varvec{\beta }'E[XX']\varvec{\beta } - 2E[YX]'\varvec{\beta } + E[Y^2]=\varvec{\beta }'Q'DQ\varvec{\beta } - 2E[YX]'\varvec{\beta } + E[Y^2]\\\ge & {} \varvec{\beta }'Q'DQ\varvec{\beta } - \mid 2E[YX]'\varvec{\beta } \mid + E[Y^2] \ge \gamma _r\Vert Q\varvec{\beta } \Vert ^{2}- 2\Vert E[YX]\Vert \Vert \varvec{\beta } \Vert + E[Y^2] \\= & {} \gamma _r\Vert \varvec{\beta }\Vert ^{2}- 2\Vert E[YX]\Vert \Vert \varvec{\beta } \Vert + E[Y^2], \end{aligned}$$

where the Cauchy-Schwarz inequality was used in the second-to-last step. As \(\Vert \varvec{\beta }\Vert \rightarrow +\infty \), \(E[(Y-X' \varvec{\beta })^{2}]\rightarrow +\infty \) as well; that is, the quadratic function \(h(\varvec{\beta })=E[(Y-X' \varvec{\beta })^{2}]\) is coercive. The next step of the proof is to transform the true problem (17) into an equivalent one over a compact feasible region \(\varvec{B}^*\). Given \(\varvec{\beta }^*\in \varvec{B}\), since \(h(\varvec{\beta })=E[(Y-X'\varvec{\beta })^{2}]\) is coercive, there exists \(R>0\) such that

$$\begin{aligned} E[( Y - X'\varvec{\beta })^2] \,\,> E[( Y - X'\varvec{\beta }^*)^2 + \lambda \Vert {\mathcal {A}}\varvec{\beta }^*\Vert _1], \end{aligned}$$

for all \(\varvec{\beta }\) with \(\Vert \varvec{\beta }\Vert > R\). For that reason, problem (17) can be reduced to the compact feasible region \(\varvec{B}^* = \varvec{B}\cap \{\varvec{\beta }: \,\, \Vert \varvec{\beta }\Vert \le R\}\), on which the minimum is attained.

Finally, the uniqueness of solution is a consequence of the strict convexity of the objective function.

1.5 Proof of Theorem 2

For the sake of simplicity, \(\varvec{\beta }^{CSCLasso}(\lambda )\) and \(\varvec{{\hat{\beta }}}^{CSCLasso}(\lambda )\) will be denoted henceforth by \(\varvec{\beta }\) and \(\varvec{{\hat{\beta }}}\), respectively. In addition, let us consider the nonempty compact set \(C=\varvec{B}\cap \{\varvec{\beta }: \,\, \Vert \varvec{\beta }\Vert \le R\}\), where R is chosen according to the proof of Theorem 3.1.

Theorem 2 is a direct consequence of Theorem 5.3 in Shapiro et al. (2009) under some technical conditions, namely:

  • C1. The expected value function \(E[(Y-X'\varvec{\beta })^{2}+ \lambda \Vert {\mathcal {A}}\varvec{\beta }\Vert _1]\) is finite valued and continuous on C.

  • C2. \(\dfrac{1}{n}\sum _{i=1}^{n}((y_{i}-x^{'}_{i} \varvec{\beta })^{2}+ \lambda \Vert {\mathcal {A}}\varvec{\beta }\Vert _1)\) converges to \(E[(Y-X'\varvec{\beta })^{2}+ \lambda \Vert {\mathcal {A}}\varvec{\beta }\Vert _1]\) w.p. 1, as \(n\rightarrow \infty \), uniformly in \(\varvec{\beta }\in C\).

Let us denote \(F(\varvec{\beta },(Y,X))=(Y-X'\mathbf {\varvec{\beta }})^{2}+ \lambda \Vert {\mathcal {A}}\mathbf {\varvec{\beta }}\Vert _1\). Then, the previous conditions C1 and C2 are consequences of Theorem 7.48 in Shapiro et al. (2009) provided that

  • A1. For any \(\varvec{\beta } \in C\), the function \(F(\cdot ,(Y,X))\) is continuous at \(\varvec{\beta }\) for almost every \((Y,X)\).

  • A2. The function \(F(\varvec{\beta },(Y,X))\), with \(\varvec{\beta }\in C\), is dominated by an integrable function.

  • A3. The sample is i.i.d.

Given \((Y,X)\), the function \((Y-X'\varvec{\beta })^{2}+ \lambda \Vert {\mathcal {A}}\varvec{\beta }\Vert _1\) is continuous at \(\varvec{\beta }\) for any \(\varvec{\beta } \in C\), and therefore A1 is fulfilled. The sample is i.i.d. by hypothesis, and thus A3 holds too. Finally, in order to prove A2, it is necessary to find a measurable function \(g(Y, X)>0\) such that \(E[g(Y,X)]< \infty \) and, for every \(\varvec{\beta }\in C\), \(\mid F(\varvec{\beta }, (Y,X))\mid \le g(Y,X)\) w.p. 1. Using the Cauchy-Schwarz inequality, one has

$$\begin{aligned} \mid F(\varvec{\beta }, (Y, X))\mid= & {} \mid (Y-X'\varvec{\beta })^2 + \lambda \Vert {\mathcal {A}}\varvec{\beta }\Vert _1\mid \\= & {} \mid Y^2 - 2YX'\varvec{\beta } + \varvec{\beta }'XX'\varvec{\beta } + \lambda \Vert {\mathcal {A}}\varvec{\beta }\Vert _1\mid \\\le & {} Y^2 + (X'\varvec{\beta })^2 + 2\mid YX'\varvec{\beta }\mid + \lambda \Vert {\mathcal {A}}\varvec{\beta }\Vert _1 \\= & {} Y^2 + \mid X'\varvec{\beta }\mid ^2 + 2\mid YX'\varvec{\beta }\mid + \lambda \Vert {\mathcal {A}}\varvec{\beta }\Vert _1 \\\le & {} Y^2 + \Vert X\Vert ^2 \Vert \varvec{\beta }\Vert ^2 + 2 \Vert YX\Vert \Vert \varvec{\beta }\Vert + \lambda \Vert {\mathcal {A}}\varvec{\beta }\Vert _1. \end{aligned}$$

Let \(M_1\) and \(M_2\) be given by

$$\begin{aligned} M_1 = \max _{\varvec{\beta }\in C} \Vert \varvec{\beta }\Vert \qquad \qquad M_2 = \max _{\varvec{\beta }\in C} \Vert {\mathcal {A}}\varvec{\beta }\Vert _1, \end{aligned}$$

which are well defined due to the compactness of C. Therefore, g can be chosen as

$$\begin{aligned} g(Y,X)= Y^2 + M_1^2\Vert X\Vert ^2 + 2M_1 \Vert YX\Vert + \lambda M_2, \end{aligned}$$

which is positive and, since \(E(\Vert X\Vert ^2)<\infty \), \(E(Y^2)<\infty \), \(E(\Vert YX\Vert )<\infty \), its expected value is finite. In consequence, A2 holds and the proof is concluded.
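As an informal illustration of conditions C1-C2 (and of the consistency statement they support), the following sketch, with made-up distributions and parameter values, compares the sample objective \(\dfrac{1}{n}\sum _{i=1}^{n}(y_{i}-x'_{i}\varvec{\beta })^{2}+\lambda \Vert {\mathcal {A}}\varvec{\beta }\Vert _1\) at a fixed \(\varvec{\beta }\) with its population counterpart as n grows:

```python
import numpy as np

rng = np.random.default_rng(2)
p, lam, sigma = 5, 0.1, 1.0
A = np.eye(p)
beta = rng.normal(size=p)                          # fixed evaluation point
beta_star = np.array([1.0, -2.0, 0.0, 0.5, 0.0])   # "true" coefficients

def sample_objective(n):
    X = rng.normal(size=(n, p))
    y = X @ beta_star + sigma * rng.normal(size=n)
    return np.mean((y - X @ beta) ** 2) + lam * np.sum(np.abs(A @ beta))

# Population objective: for X ~ N(0, I_p), E[(Y - X'beta)^2] = ||beta_star - beta||^2 + sigma^2.
population = np.sum((beta_star - beta) ** 2) + sigma ** 2 + lam * np.sum(np.abs(A @ beta))

for n in (100, 1_000, 10_000, 100_000):
    print(n, round(sample_objective(n), 4), round(population, 4))
```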

Appendix: Further results

See Figs. 7, 8, 9 and 10.

Fig. 7: Heat maps of \(\varvec{{\hat{\beta }}}^{CSCLasso}(\lambda ) = ({\hat{\beta }}^{CSCLasso}_1(\lambda ),\ldots ,{\hat{\beta }}^{CSCLasso}_8(\lambda ))\) using the prostate dataset

Fig. 8: Median \(MSE_k\) over the test sets for \(k=7,\ldots ,20\) under \(p=20\) features and the two \(n_k\) options. Each subgraph represents one group and the Y-axis shows the different percentages of improvement

Fig. 9: Median \(MSE_k\) over the test sets for \(k=7,\ldots ,20\) under \(p=500\) features and the two \(n_k\) options. Each subgraph represents one group and the Y-axis shows the different percentages of improvement

Fig. 10: Median overall MSE over the test sets and NZ percentage under the choice \(p=20\) (top) and \(p=500\) (bottom)

To fully understand how the computation time behaves depending on the \(n_k\) and p values, a grid over both parameters has been inspected. Figure 11 displays the logarithm of the user times in seconds obtained under the Lasso and CSCLasso models as \(n_k\) and p change. The perspective drawn in the top left panel shows that the Lasso model (bottom surface) is solved faster and its running times vary more smoothly. Moreover, whereas the smallest times are obtained for both methods when \(n_k\) and p are small, the largest times correspond to \(n_k=300\) and \(p=500\).

Fig. 11: Four perspectives of the logarithm of the user times in seconds for the Lasso (bottom surface in the four graphics) and CSCLasso (top surfaces) models across a grid in \(n_k\) and p


Cite this article

Blanquero, R., Carrizosa, E., Ramírez-Cobo, P. et al. A cost-sensitive constrained Lasso. Adv Data Anal Classif 15, 121–158 (2021). https://doi.org/10.1007/s11634-020-00389-5
