Abstract
Prediction tasks in personalized medicine require models that are both accurate and interpretable. We propose an integer optimization approach for building sparse regression models with enforced coordination across the leaves of a prediction tree, fitting each model to the data partitioned into its leaf. We show that the method recovers the true underlying relationship between observations and target variables on large-scale synthetic data in seconds. We apply our method to several real-world medical prediction problems and observe that the imposed structure provides a substantial gain in interpretability at a low cost in accuracy.
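The coordination idea admits a compact mixed-integer formulation. The Julia/JuMP sketch below is a hypothetical, minimal rendering under one plausible reading of "enforced coordination": each leaf (cluster) gets its own least-squares coefficients, while shared binary indicators z, limited by a sparsity budget k, force every leaf to draw on the same small feature set through a big-M constraint. The function name sparclur_sketch, the solver choice (Gurobi), and the big-M encoding are illustrative assumptions, not the authors' formulation.

using JuMP, Gurobi

# Hypothetical sketch of sparse regression with a support shared across
# clusters; not the authors' exact formulation. Xs[c] is the n_c x p design
# matrix of cluster c, and ys[c] is its response vector.
function sparclur_sketch(Xs, ys; k::Int = 5, M::Float64 = 10.0)
    C = length(Xs)                      # number of clusters (tree leaves)
    p = size(Xs[1], 2)                  # number of features
    model = Model(Gurobi.Optimizer)
    @variable(model, beta[1:C, 1:p])    # separate coefficients per cluster
    @variable(model, z[1:p], Bin)       # shared support indicators
    @constraint(model, sum(z) <= k)     # at most k features in total
    # Coordination: feature j may appear in any cluster only if z[j] = 1.
    @constraint(model, [c = 1:C, j = 1:p], beta[c, j] <= M * z[j])
    @constraint(model, [c = 1:C, j = 1:p], -M * z[j] <= beta[c, j])
    # Minimize the total squared residual over all clusters.
    @objective(model, Min,
        sum(sum((ys[c] .- Xs[c] * beta[c, :]) .^ 2) for c in 1:C))
    optimize!(model)
    return value.(beta), value.(z)
end

A big-M model is the most direct encoding but requires a valid bound M on the coefficient magnitudes; cutting-plane and outer-approximation schemes, as used in related work on exact sparse regression, are the usual route to the large-scale running times reported above.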
Data availability
All synthetic and publicly available datasets are available to interested readers. Medical data are protected under privacy rules and are not available.
Acknowledgements
This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. 1122374.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Bertsimas, D., Dunn, J., Kapelevich, L. et al. Sparse regression over clusters: SparClur. Optim Lett 16, 433–448 (2022). https://doi.org/10.1007/s11590-021-01770-9