A scalable surrogate L0 sparse regression method for generalized linear models with applications to large scale data

https://doi.org/10.1016/j.jspi.2020.12.001

Highlights

  • The broken adaptive ridge (BAR) estimator for GLMs (GLM-BAR) is consistent for selection.

  • The GLM-BAR estimator enjoys an oracle property for estimation of the nonzero coefficients.

  • GLM-BAR has a grouping property for highly correlated covariates.

  • GLM-BAR is scalable for sparse high dimensional massive sample size data.

  • GLM-BAR generally produces a sparser model than L1-based penalization methods, with a lower false positive rate.

Abstract

This paper rigorously studies the large sample properties of a surrogate L0 penalization method that iteratively performs reweighted L2-penalized regressions for generalized linear models, and develops a scalable implementation of the method for sparse high-dimensional massive sample-size (sHDMSS) data. We show that for generalized linear models, the limit of the algorithm, referred to as the broken adaptive ridge (BAR) estimator, is consistent for variable selection, enjoys an oracle property for parameter estimation, and possesses a grouping property for highly correlated covariates. We further demonstrate that, by taking advantage of an existing efficient implementation of massive L2-penalized generalized linear models, the proposed BAR method can be conveniently implemented for sHDMSS data. An illustration is given using a large sHDMSS data set from the Truven MarketScan Medicare (MDCR) database to investigate the safety of dabigatran versus warfarin for treatment of nonvalvular atrial fibrillation in elderly patients.

Introduction

L0-penalized regression underlies the classical variable selection methods based on well-known information criteria such as Mallows' Cp (Mallows, 1973), Akaike's information criterion (AIC) (Akaike, 1974), the Bayesian information criterion (BIC) (Schwarz, 1978, Chen and Chen, 2008), and the risk inflation criterion (RIC) (Foster and George, 1994). It directly penalizes the cardinality of a model and has been shown to possess some optimal properties for variable selection and parameter estimation (Shen et al., 2012). On the other hand, L0 penalization is computationally NP-hard and thus not scalable to high-dimensional covariates; it can also be unstable for variable selection (Breiman, 1996). The broken adaptive ridge (BAR) method has recently been studied as a scalable surrogate to L0 penalization for simultaneous variable selection and parameter estimation (Dai et al., 2018a, Dai et al., 2018b, Frommlet and Nuel, 2016, Liu and Li, 2016). Defined as the limit of an iteratively reweighted L2 penalization algorithm, the BAR estimator has been shown to enjoy the best of both L0 and L2 penalization (Dai et al., 2018a, Dai et al., 2018b) while avoiding their shortcomings. For instance, it scales easily to high-dimensional covariates and, for the linear model, has been shown to be consistent for variable selection, oracle for parameter estimation, and to possess a grouping property for highly correlated covariates (Dai et al., 2018b). As a surrogate to L0-penalized regression, BAR tends to yield more parsimonious models than L1-type penalization methods, with comparable prediction performance in empirical studies (Dai et al., 2018b).
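For intuition, the iteratively reweighted L2 scheme that defines BAR can be sketched for the linear model as follows. This is a toy illustration under our own choices of function name, tuning values, and numerical thresholds, not the authors' implementation:

```python
import numpy as np

def bar_linear(X, y, lam=1.0, xi=0.1, n_iter=50, tol=1e-8):
    """Illustrative broken adaptive ridge sketch for the linear model.

    Starts from an ordinary ridge fit (penalty `xi`), then repeatedly
    refits a ridge regression in which coefficient j's L2 penalty is
    scaled by 1 / (beta_j from the previous iterate)^2, so small
    coefficients receive ever larger penalties and collapse to zero.
    """
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    beta = np.linalg.solve(XtX + xi * np.eye(p), Xty)  # initial ridge fit
    for _ in range(n_iter):
        w = lam / (beta ** 2 + 1e-12)          # adaptive ridge weights
        beta_new = np.linalg.solve(XtX + np.diag(w), Xty)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    beta[np.abs(beta) < 1e-6] = 0.0            # set numerical zeros to zero
    return beta
```

Because each step is just a weighted ridge regression, any fast L2-penalized solver can serve as the inner engine, which is exactly the property the scalable implementation below exploits.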

The purpose of this paper is to fill some critical gaps in the theoretical and computational development of the BAR methodology for the generalized linear model and to extend its application to large scale data. First, although the asymptotic properties of the BAR approach have been established for the linear model, they have yet to be fully investigated for the generalized linear model. One contribution of this paper is to rigorously establish the asymptotic statistical guarantees of the BAR estimator for the generalized linear model: in particular, its consistency for variable selection and parameter estimation and its grouping property for highly correlated covariates. Second, as discussed later in Remark 1 of Section 2, the current BAR algorithm and implementation for the generalized linear model become practically infeasible for large scale data when both p and n are large, because (1) data stored in the standard dense format will exceed a computer's memory and (2) the computational algorithms become prohibitively costly. Another key contribution of this paper is a scalable implementation of BAR for the generalized linear model on sparse high-dimensional massive sample-size (sHDMSS) data, which has the following characteristics: (1) high-dimensional, with thousands of baseline covariates; (2) massive in sample size, with up to millions or even hundreds of millions of patient records; and (3) sparse, with only a small proportion of the covariates being nonzero for each subject. sHDMSS data are commonly encountered in large scale health studies using massive electronic health record (EHR) databases. An example of sHDMSS data from the Truven MarketScan Medicare (MDCR) database is given in Section 4; it includes 73,206 patients with 17,032 baseline covariates and is used to study the safety of dabigatran versus warfarin for treatment of nonvalvular atrial fibrillation in elderly patients.
Our scalable implementation of the BAR method for sHDMSS data exploits the data sparsity and is conveniently realized by taking advantage of the recently developed Cyclops package of Suchard et al. (2013) for fitting massive L2-penalized regressions for the generalized linear model. Lastly, as a byproduct, our asymptotic theory implies that coupling the BAR method with an appropriate sure screening method yields an oracle sparse regression method for ultrahigh-dimensional settings where the number of predictors far exceeds the sample size. We have developed an R package, BrokenAdaptiveRidge, available at https://github.com/OHDSI/BrokenAdaptiveRidge.
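To make the memory argument concrete, the following sketch (our own illustration, not the Cyclops internals) compares dense storage against a compressed sparse row (CSR) layout for a design matrix with sHDMSS-like dimensions:

```python
import numpy as np
from scipy import sparse

# Hypothetical sHDMSS-scale design matrix: many subjects, many binary
# covariates, but only a handful of nonzero entries per subject.
rng = np.random.default_rng(0)
n, p, k = 100_000, 10_000, 10          # subjects, covariates, nonzeros/row

rows = np.repeat(np.arange(n), k)
cols = rng.integers(0, p, size=n * k)
X = sparse.csr_matrix((np.ones(n * k), (rows, cols)), shape=(n, p))

dense_bytes = n * p * 8                # float64 dense storage
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(f"dense: {dense_bytes / 1e9:.1f} GB, "
      f"sparse: {sparse_bytes / 1e6:.1f} MB")
```

At these dimensions the dense representation needs about 8 GB, while the CSR arrays occupy well under one percent of that, which is why exploiting sparsity is essential at this scale.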

The paper is organized as follows. In Section 2, we describe the BAR estimator, state our main results on its asymptotic statistical guarantees, discuss how to adapt it to sHDMSS data by taking advantage of existing efficient computational techniques for massive L2-penalized generalized linear models, and combine it with dimension reduction methods for the analysis of ultrahigh-dimensional data where the number of predictors far exceeds the sample size. Section 3 presents simulation studies examining the performance of the BAR estimator for both small and massive sample sizes. We illustrate the proposed approach on a real-world sHDMSS data set in Section 4. Discussion and concluding remarks are given in Section 5. Technical proofs are provided in Appendix A.

Section snippets

The estimator

Consider the generalized linear model (GLM) with an n×1 response vector y = (y_1, y_2, …, y_n)^T and an n×p design matrix X = (x_1, x_2, …, x_n)^T. Assume that the observations v_i = (x_i^T, y_i)^T, i = 1, …, n, are mutually independent. Conditional on x_i, the distribution of y_i belongs to the exponential family with density f_Y(y_i; θ_i, ϕ) = exp{(y_i θ_i − a(θ_i))/b(ϕ) + c(y_i, ϕ)}, where θ_i is the canonical parameter, ϕ is the dispersion parameter satisfying 0 < ϕ < ∞, and a(·), b(·), and c(·) are known functions. Assuming a(·) is
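Although the snippet is truncated here, the reweighted L2 iteration described in the Introduction can be written in this GLM notation (our rendering; the precise form and tuning of λ_n follow Dai et al., 2018a,b) as:

```latex
\hat{\boldsymbol{\beta}}^{(k+1)}
  = \operatorname*{arg\,min}_{\boldsymbol{\beta}}
    \left\{ -2\,\ell_n(\boldsymbol{\beta})
      + \lambda_n \sum_{j=1}^{p}
        \frac{\beta_j^{2}}{\bigl(\hat{\beta}_j^{(k)}\bigr)^{2}} \right\},
\qquad k = 0, 1, 2, \ldots,
```

where ℓ_n(β) is the GLM log-likelihood and the initial value β̂^(0) is an ordinary ridge estimate. The GLM-BAR estimator is the limit β̂ = lim_{k→∞} β̂^(k): coefficients whose iterates shrink toward zero receive ever larger ridge weights and are driven to exactly zero in the limit, which is what makes the procedure a surrogate for L0 penalization.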

Simulation results

A series of simulations was conducted to illustrate the performance of GLM-BAR for logistic regression in both low- and high-dimensional settings with small and massive sample sizes. All computations were carried out in R. We considered two scenarios: (I) small samples with p<n, and (II) large scale data sets with sparse features to mimic the real data we analyzed in Section 4.

In scenario I, we simulated data under two settings in which the true signals are either large or small to moderate. We

A real data example

Over the last decade, the U.S. Food and Drug Administration (FDA) has invested hundreds of millions of dollars to develop the Sentinel Initiative, a national electronic medical system, to monitor the safety of its regulated products as a major part of its mission to protect public health. A hallmark Sentinel study investigates the safety of dabigatran versus warfarin for treatment of nonvalvular atrial fibrillation in elderly patients enrolled in Medicare between October 2010 and December

Discussion

We have established GLM-BAR as a viable tool for variable selection and parameter estimation in generalized linear models with diverging dimension by rigorously establishing its consistency for variable selection and parameter estimation and its grouping property for highly correlated covariates. This paper considers GLMs with a canonical link. As pointed out by a referee, there are practical situations where a non-canonical link function is preferred. It seems straightforward to extend the

Software

GLM-BAR has been implemented in the R package BrokenAdaptiveRidge (https://github.com/OHDSI/BrokenAdaptiveRidge). Key computation details are given in Section 2.3.

Acknowledgments

Gang Li’s research was supported in part by National Institutes of Health Grants P30CA16042, UL1TR001881, and CA211015. Xiaoling Peng’s research was supported by Guangdong Natural Science Foundation No. 2018A0303130231.

References (39)

  • Dai, L. et al. Broken adaptive ridge regression and its asymptotic properties. J. Multivariate Anal. (2018)

  • Lai, P. et al. Model free feature screening for ultrahigh dimensional data with responses missing at random. Comput. Statist. Data Anal. (2017)

  • Lin, L. et al. Adaptive conditional feature screening. Comput. Statist. Data Anal. (2016)

  • Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control (1974)

  • Austin, P.C. et al. The use of the propensity score for estimating treatment effects: administrative versus clinical data. Statist. Med. (2005)

  • Breiman, L. Heuristics of instability and stabilization in model selection. Ann. Statist. (1996)

  • Chen, J. et al. Extended Bayesian information criteria for model selection with large model spaces. Biometrika (2008)

  • Dai, L. et al. The broken adaptive ridge procedure and its applications. Statist. Sinica (2018)

  • Fan, J. et al. Nonparametric independence screening in sparse ultra-high-dimensional additive models. J. Amer. Statist. Assoc. (2011)

  • Fan, J. et al. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B Stat. Methodol. (2008)

  • Fan, J. et al. Nonconcave penalized likelihood with a diverging number of parameters. Ann. Statist. (2004)

  • Fan, J. et al. Ultrahigh dimensional feature selection: beyond the linear model. J. Mach. Learn. Res. (2009)

  • Fan, J. et al. Sure independence screening in generalized linear models with NP-dimensionality. Ann. Statist. (2010)

  • Fan, J. et al. Strong oracle optimality of folded concave penalized estimation. Ann. Statist. (2014)

  • Foster, D. et al. The risk inflation criterion for multiple regression. Ann. Statist. (1994)

  • Frommlet, F. et al. An adaptive ridge procedure for L0 regularization. PLoS One (2016)

  • Gorst-Rasmussen, A. et al. Coordinate descent methods for the penalized semiparametric additive hazards model (2011)

  • Graham, D.J. et al. Cardiovascular, bleeding, and mortality risks in elderly Medicare patients treated with dabigatran or warfarin for nonvalvular atrial fibrillation. Circulation (2015)

  • He, X. et al. Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Ann. Statist. (2013)