A scalable surrogate L0 sparse regression method for generalized linear models with applications to large scale data
Introduction
L0-penalized regression has been widely used in classical variable selection through well-known information criteria such as Mallows’s Cp (Mallows, 1973), Akaike’s information criterion (AIC) (Akaike, 1974), the Bayesian information criterion (BIC) (Schwarz et al., 1978, Chen and Chen, 2008), and the risk inflation criterion (RIC) (Foster and George, 1994). It directly penalizes the cardinality of a model and has been shown to possess some optimal properties for variable selection and parameter estimation (Shen et al., 2012). On the other hand, L0-penalization is computationally NP-hard and thus not scalable to high-dimensional covariates. It can also be unstable for variable selection (Breiman et al., 1996). The broken adaptive ridge (BAR) method has recently been studied as a scalable surrogate to L0 penalization for simultaneous variable selection and parameter estimation (Dai et al., 2018a, Dai et al., 2018b, Frommlet and Nuel, 2016, Liu and Li, 2016). Defined as the limit of an iteratively reweighted L2 penalization algorithm, the BAR estimator has been shown to enjoy the best of both L0 and L2 penalizations (Dai et al., 2018a, Dai et al., 2018b) while avoiding their shortcomings. For instance, it scales easily to high-dimensional covariates and, for the linear model, has been shown to be consistent for variable selection, to have the oracle property for parameter estimation, and to possess a grouping property for highly correlated covariates (Dai et al., 2018b). As a surrogate to L0-penalized regression, BAR tends to yield more parsimonious models than an L1-type penalization method, with comparable prediction performance, in empirical studies (Dai et al., 2018b).
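To make the phrase "limit of an iteratively reweighted penalization algorithm" concrete, the following is a minimal sketch of the BAR iteration for the linear model only. It assumes the usual form of the update, in which each step solves a ridge problem whose penalty on coefficient j is divided by the square of its previous estimate. The function name `bar_linear`, the initial ridge parameter `xi`, and the stopping rule are illustrative choices and are not taken from the BrokenAdaptiveRidge package.

```python
import numpy as np


def bar_linear(X, y, lam, xi=1.0, max_iter=200, tol=1e-10):
    """Illustrative broken adaptive ridge (BAR) iteration for the linear model."""
    p = X.shape[1]
    XtX = X.T @ X
    Xty = X.T @ y
    # Initial ridge estimate; xi is an assumed tuning parameter.
    beta = np.linalg.solve(XtX + xi * np.eye(p), Xty)
    for _ in range(max_iter):
        # Reweighted ridge step: minimize ||y - X b||^2 + lam * sum_j b_j^2 / beta_j^2.
        # Substituting b = beta * u turns the penalty into lam * ||u||^2, so no
        # division by coefficients that are shrinking toward zero is needed.
        A = XtX * np.outer(beta, beta) + lam * np.eye(p)
        new_beta = beta * np.linalg.solve(A, beta * Xty)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 500, 10
    X = rng.standard_normal((n, p))
    true_beta = np.zeros(p)
    true_beta[:2] = [2.0, -1.5]
    y = X @ true_beta + rng.standard_normal(n)
    # lam = log(n) is a BIC-like choice used here purely for illustration.
    print(np.round(bar_linear(X, y, lam=np.log(n)), 3))
```

In this sketch, coefficients whose estimates are small relative to the penalty are driven toward zero across iterations, while large coefficients are only lightly shrunk, which is the sense in which the limit behaves like a cardinality penalty.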
The purpose of this paper is to fill some critical gaps in the theoretical and computational development of the BAR methodology for the generalized linear model and to extend its application to large scale data. First, although the asymptotic properties of the BAR approach have been established for the linear model, they have yet to be fully investigated for the generalized linear model. One contribution of this paper is to rigorously establish the asymptotic statistical guarantees of the BAR estimator for the generalized linear model. In particular, we establish its consistency for variable selection and parameter estimation and its grouping property for highly correlated covariates. Secondly, as discussed later in Remark 1 of Section 2, the current BAR algorithm and implementation for the generalized linear model become practically infeasible for large scale data when both the sample size and the number of covariates are large, because (1) data stored in the standard dense format will exceed a computer’s memory and (2) the computational algorithms will become prohibitively costly. Another key contribution of this paper is to develop a scalable implementation of BAR for the generalized linear model for sparse high-dimensional and massive sample-size (sHDMSS) data, which have the following characteristics: (1) high-dimensional, with thousands of baseline covariates, (2) massive in sample size, with up to millions or even hundreds of millions of patient records, and (3) sparse, with only a small portion of covariates being nonzero for each subject. sHDMSS data are commonly encountered in large scale health studies using massive electronic health record (EHR) databases. An example of sHDMSS data is given in Section 4 from the Truven MarketScan Medicare (MDCR) database, which includes 73,206 patients with 17,032 baseline covariates for studying the safety of dabigatran versus warfarin for treatment of nonvalvular atrial fibrillation in elderly patients.
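The memory argument can be checked with simple arithmetic. The sketch below compares the footprint of a dense double-precision matrix with that of a compressed sparse row layout for the dimensions quoted for the MDCR example, and shows a linear predictor computed from nonzero entries only. The 1% density is an assumption made purely for illustration (the text does not report it), and the helper names are hypothetical.

```python
def dense_bytes(n_rows, n_cols, bytes_per_value=8):
    """Memory needed to store every entry of an n_rows x n_cols matrix."""
    return n_rows * n_cols * bytes_per_value


def csr_bytes(n_rows, nnz, bytes_per_value=8, bytes_per_index=4):
    """Memory for a compressed sparse row layout: one value and one column
    index per nonzero entry, plus n_rows + 1 row pointers."""
    return nnz * (bytes_per_value + bytes_per_index) + (n_rows + 1) * bytes_per_index


def sparse_linear_predictor(rows, beta):
    """rows[i] lists (column, value) pairs for the nonzero covariates of subject i."""
    return [sum(value * beta[col] for col, value in row) for row in rows]


if __name__ == "__main__":
    n, p = 73_206, 17_032      # dimensions quoted for the MDCR example
    density = 0.01             # ASSUMED share of nonzero entries, for illustration only
    nnz = round(n * p * density)
    print(f"dense: {dense_bytes(n, p) / 1e9:.2f} GB")
    print(f"CSR at assumed 1% density: {csr_bytes(n, nnz) / 1e9:.2f} GB")
    # Two subjects, three covariates: only nonzero entries are touched.
    print(sparse_linear_predictor([[(0, 1.0), (2, 2.0)], []], [0.5, 9.0, -1.0]))
```

At these dimensions the dense layout already needs close to 10 GB, and that figure grows linearly with the number of patients, whereas the sparse cost depends only on the number of nonzero entries.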
Our scalable implementation of the BAR method for sHDMSS data exploits the data sparsity and is conveniently implemented by taking advantage of a recently developed Cyclops package by Suchard et al. (2013) for fitting massive penalized regression for the generalized linear model. Lastly, as a byproduct, our developed asymptotic theory implies that coupling the BAR method with an appropriate sure screening method will lead to an oracle sparse regression method for ultrahigh dimensional settings when the number of predictors far exceeds the sample size. We have developed an R package BrokenAdaptiveRidge and made it available to readers at https://github.com/OHDSI/BrokenAdaptiveRidge.
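As an illustration of the screening step mentioned above, the sketch below implements plain marginal-correlation screening in the spirit of sure independence screening for the linear model. It is one possible choice and not necessarily the screening method the authors pair with BAR; the function name `sis_screen` and the retained model size `d` are assumptions.

```python
import numpy as np


def sis_screen(X, y, d):
    """Return the sorted indices of the d covariates with the largest
    absolute marginal correlation with the response."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    norms = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    # Constant columns have zero norm; dividing by inf gives them correlation 0.
    corr = np.abs(Xc.T @ yc) / np.where(norms > 0, norms, np.inf)
    keep = np.argsort(corr)[::-1][:d]
    return np.sort(keep)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 200, 1000
    X = rng.standard_normal((n, p))
    y = 3.0 * X[:, 5] - 3.0 * X[:, 700] + rng.standard_normal(n)
    d = int(n / np.log(n))  # a commonly used screening size, assumed here
    kept = sis_screen(X, y, d)
    print(len(kept), 5 in kept, 700 in kept)
```

A sparse regression method would then be fitted to the retained columns `X[:, kept]` only, so that the number of candidate predictors is below the sample size.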
The paper is organized as follows. In Section 2, we describe the BAR estimator, state our main results on its asymptotic statistical guarantees, discuss how to adapt it to sHDMSS data by taking advantage of existing efficient computation techniques for massive L2-penalized generalized linear models, and combine it with dimension reduction methods for the analysis of ultrahigh dimensional data where the number of predictors far exceeds the sample size. Section 3 presents simulation studies to examine the performance of the BAR estimator for both small and massive sample sizes. We illustrate the proposed approach using a real-world sHDMSS data set in Section 4. Discussion and concluding remarks are given in Section 5. Technical proofs are provided in Appendix A.
Section snippets
The estimator
Consider the generalized linear model (GLM) with a response vector and a design matrix . Assume that the observations , , are mutually independent. Conditional on , the distribution of belongs to the exponential family with the following density where is the canonical parameter, is the dispersion parameter satisfying , and , and are known functions. Assuming is
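For orientation, the setup can be written in standard textbook notation as below. The symbols are assumptions of this sketch and may differ from the authors' own notation, in particular the scaling of the log-likelihood and the subscripting of the tuning parameter. The first display is the usual exponential-family density with a canonical link; the second is the reweighted ridge step whose limit defines a BAR-type estimator, started from an ordinary ridge fit.

```latex
% Exponential-family density with canonical parameter \theta_i and dispersion \phi
% (standard notation, assumed here)
f(y_i;\theta_i,\phi)
  = \exp\!\left\{ \frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i,\phi) \right\},
  \qquad \theta_i = \mathbf{x}_i^{\top}\boldsymbol{\beta}.

% Iteratively reweighted ridge step; \ell_n is the log-likelihood and
% \lambda_n > 0 a tuning parameter (scaling by -2 is an assumption)
\hat{\boldsymbol{\beta}}^{(k)}
  = \arg\min_{\boldsymbol{\beta}}
    \left\{ -2\,\ell_n(\boldsymbol{\beta})
      + \lambda_n \sum_{j=1}^{p}
        \frac{\beta_j^{2}}{\bigl(\hat{\beta}_j^{(k-1)}\bigr)^{2}} \right\},
  \qquad k = 1, 2, \ldots
```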
Simulation results
A series of simulations were conducted to illustrate the performance of GLM-BAR for logistic regression in both low and high dimensional settings with small and massive sample sizes. All computations were carried out in . We considered two scenarios: (I) small samples with , and (II) large scale data sets with sparse features to mimic the real data we analyzed in Section 4.
In scenario I, we simulated data under two settings in which the true signals are either large or small to moderate. We
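A small-sample experiment of this kind can be mimicked with the sketch below, which applies a BAR-type iteration to logistic regression on simulated dense data. It is a simplified illustration under stated assumptions, not the GLM-BAR implementation evaluated in the paper: each reweighted ridge problem is solved by plain Newton steps without step-size control, and the function names, the initial ridge parameter `xi`, the tuning value `lam = log(n)`, and the simulation design are all hypothetical.

```python
import numpy as np


def _newton_ridge_logistic(Z, y, lam, u, max_iter=100, tol=1e-10):
    """Minimize -2 * loglik(Z @ u) + lam * ||u||^2 for logistic regression
    using plain Newton steps (no line search)."""
    p = Z.shape[1]
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-(Z @ u)))
        w = mu * (1.0 - mu)
        grad = Z.T @ (y - mu) - lam * u                 # minus half the objective gradient
        hess = Z.T @ (Z * w[:, None]) + lam * np.eye(p)  # half the objective Hessian
        step = np.linalg.solve(hess, grad)
        u = u + step
        if np.max(np.abs(step)) < tol:
            break
    return u


def bar_logistic(X, y, lam, xi=1.0, max_iter=100, tol=1e-8):
    """Illustrative BAR-type iteration for logistic regression."""
    p = X.shape[1]
    beta = _newton_ridge_logistic(X, y, xi, np.zeros(p))  # initial ridge fit
    for _ in range(max_iter):
        # With b = beta * u, the penalty lam * sum_j b_j^2 / beta_j^2 equals
        # lam * ||u||^2, and u = 1 corresponds to the previous estimate.
        u = _newton_ridge_logistic(X * beta, y, lam, np.ones(p))
        new_beta = beta * u
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, p = 2000, 8
    X = rng.standard_normal((n, p))
    true_beta = np.zeros(p)
    true_beta[:2] = [1.5, -1.5]
    prob = 1.0 / (1.0 + np.exp(-(X @ true_beta)))
    y = (rng.random(n) < prob).astype(float)
    print(np.round(bar_logistic(X, y, lam=np.log(n)), 3))
```

The reparameterization inside the loop keeps the linear systems well conditioned even when some coefficients have already shrunk to numerical zero, which is where a direct division by squared coefficients would fail.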
A real data example
Over the last decade, the U.S. Food and Drug Administration (FDA) has invested hundreds of millions of dollars to develop the Sentinel Initiative, a national electronic medical system, to monitor the safety of its regulated products as a major part of its mission to protect public health. A hallmark Sentinel study investigates the safety of dabigatran versus warfarin for treatment of nonvalvular atrial fibrillation in elderly patients enrolled in Medicare between October 2010 and December
Discussion
We have established GLM-BAR as a viable tool for variable selection and parameter estimation for generalized linear models with diverging dimension by rigorously establishing its consistency for variable selection and parameter estimation and its grouping property for highly correlated covariates. This paper considers GLMs with a canonical link. As pointed out by a referee, there are practical situations where a non-canonical link function is preferred. It seems straightforward to extend the
Software
GLM-BAR has been implemented in the R package BrokenAdaptiveRidge (https://github.com/OHDSI/BrokenAdaptiveRidge). Key computation details are given in Section 2.3.
Acknowledgments
Gang Li’s research was supported in part by National Institutes of Health Grants P30CA16042, UL1TR001881, and CA211015. Xiaoling Peng’s research was supported by Guangdong Natural Science Foundation No. 2018A0303130231.
References (39)
- et al. Broken adaptive ridge regression and its asymptotic properties. J. Multivariate Anal. (2018)
- et al. Model free feature screening for ultrahigh dimensional data with responses missing at random. Comput. Statist. Data Anal. (2017)
- et al. Adaptive conditional feature screening. Comput. Statist. Data Anal. (2016)
- A new look at the statistical model identification. IEEE Trans. Autom. Control (1974)
- et al. The use of the propensity score for estimating treatment effects: administrative versus clinical data. Statist. Med. (2005)
- Heuristics of instability and stabilization in model selection. Ann. Statist. (1996)
- et al. Extended Bayesian information criteria for model selection with large model spaces. Biometrika (2008)
- et al. The broken adaptive ridge procedure and its applications. Stat Sin. (2018)
- et al. Nonparametric independence screening in sparse ultra-high-dimensional additive models. J. Amer. Statist. Assoc. (2011)
- et al. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B Stat. Methodol. (2008)
- Nonconcave penalized likelihood with a diverging number of parameters. Ann. Statist.
- Ultrahigh dimensional feature selection: beyond the linear model. J. Mach. Learn. Res.
- Sure independence screening in generalized linear models with NP-dimensionality. Ann. Statist.
- Strong oracle optimality of folded concave penalized estimation. Ann. Statist.
- The risk inflation criterion for multiple regression. Ann. Statist.
- An adaptive ridge procedure for L0 regularization. PLoS One
- Coordinate Descent Methods for the Penalized Semiparametric Additive Hazards Model
- Cardiovascular, bleeding, and mortality risks in elderly Medicare patients treated with dabigatran or warfarin for nonvalvular atrial fibrillation. Circulation
- Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Ann. Statist.