Information-based optimal subdata selection for big data logistic regression

https://doi.org/10.1016/j.jspi.2020.03.004

Highlights

  • Derive an upper bound on the information from random subsampling methods.

  • Propose a better trade-off between computational and statistical efficiency.

  • Prove that the information from the new method increases with the full data size.

  • Compare the performance of the proposed approaches with widely applied methods.

Abstract

Technological advances have enabled an exponential growth in data volumes, and proven statistical methods are no longer applicable to extraordinarily large data sets due to computational limitations. Subdata selection is an effective strategy to address this issue. In this study, we investigate existing sampling approaches and propose a novel framework for selecting subsets of data for logistic regression models. We show that, while the information contained in subdata obtained by random sampling approaches is limited by the size of the subset, the information contained in subdata obtained under the new framework increases as the size of the full data set increases. The performance of the proposed approach and that of other existing methods are compared under various criteria via extensive simulation studies.

Introduction

Technological advances have enabled an exponential growth in data collection and in the size of data sets. For example, the cross-continental Square Kilometer Array, the next generation of astronomical telescopes, will generate 700 TB of data per second (Mattmann et al., 2014). While these extraordinary data sizes provide researchers with golden opportunities for scientific discoveries, they also bring tremendous challenges when attempting to analyze such large data sets: proven statistical methods are no longer applicable due to computational limitations. Recent advances in statistical analysis to deal with these challenges arguably follow two major strategies: the divide-and-conquer approach and the subdata selection approach.

The divide-and-conquer approach takes advantage of parallel computing technology. A large data set is split into chunks of manageable size, the analysis is carried out separately on each chunk, and a specified aggregation method merges the pieces of information from the chunks to produce the final result. The analysis and aggregation methods depend on the structure of the data set and the model assumptions. For the linear regression model, the full-data least squares estimate can be decomposed exactly into a weighted average of the least squares estimates based on the individual chunks; this has become the standard aggregation method for merging solutions from blocks under linear models. For nonlinear models, several aggregation methods have been proposed. Lin and Xi (2011) proposed an approach that approximates the estimating equation estimator using a first-order Taylor expansion; under certain conditions, the final aggregated estimator is provably close to the estimator computed directly from the full data. Chen and Xie (2014) considered a divide-and-conquer approach for generalized linear models (GLMs) in which both the number of observations n and the number of covariates p are large. They incorporated variable selection via penalized regression into the subset processing step and showed that, under certain regularity conditions, the aggregated estimator is model-selection consistent and asymptotically equivalent to the penalized estimator based on the full data set. Schifano et al. (2016) proposed a related online-updating approach, in which parameter estimators accumulated from previously processed data chunks are updated as new data arrive. The divide-and-conquer approach gains efficiency mainly from parallel computing, and it may not reduce computational time if implemented with a single core. The subdata approach instead reduces the computational burden by downsizing the data volume; the key question is how to select informative subdata that retain as much information as possible. As noted in a recent NSF program guideline, “Tradeoffs between computational costs and statistical efficiency” is one of six research directions that need to be addressed for the theoretical foundation of data science (NSF, 2016).
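To make the linear-model aggregation concrete, the following Python sketch recovers the full-data least squares estimate from chunk-level summaries. It is a minimal illustration under the setup described above; the function and variable names are made up, not code from the paper. The weighted-average form of the aggregated estimator is computed equivalently from the per-chunk sufficient statistics $X_k^T X_k$ and $X_k^T y_k$.

```python
import numpy as np

def chunk_ols_stats(X_chunk, y_chunk):
    """Per-chunk sufficient statistics: X'X and X'y."""
    return X_chunk.T @ X_chunk, X_chunk.T @ y_chunk

def aggregate_ols(stats):
    """Combine chunk-level statistics; equals the full-data least squares fit,
    i.e. the weighted average of chunk estimates with weights X_k'X_k."""
    XtX = sum(s[0] for s in stats)
    Xty = sum(s[1] for s in stats)
    return np.linalg.solve(XtX, Xty)

# Toy illustration: split a simulated full data set into 4 chunks and aggregate.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.arange(1.0, 6.0) + rng.normal(size=10_000)
stats = [chunk_ols_stats(Xc, yc)
         for Xc, yc in zip(np.array_split(X, 4), np.array_split(y, 4))]
beta_hat = aggregate_ols(stats)  # agrees with np.linalg.lstsq(X, y, rcond=None)[0]
```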

Existing subdata approaches are mainly based on random subsampling. Combining the methods of subsampling (Politis et al., 1999) and bootstrapping (Efron, 1979; Bickel et al., 1997), Kleiner et al. (2014) proposed a novel approach called the bag of little bootstraps (BLB) to achieve computational efficiency. Liang et al. (2013) proposed a mean log-likelihood approach that uses Monte Carlo averages of estimates from subsamples to approximate the quantities needed in the analysis. The BLB and mean log-likelihood methods select subsamples using simple random sampling. Another line of subsampling methods is based on leverage sampling algorithms, in which a sampling probability is assigned to each data point according to its leverage score. Ma et al. (2015) reviewed existing subsampling methods in the context of linear regression, termed them leveraging algorithms, studied their statistical properties, and proposed a shrinkage algorithmic leveraging method.
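For illustration, here is a minimal sketch of leverage-based subsampling in a linear regression setting. The helper names are illustrative, and the shrinkage variant of Ma et al. (2015), which mixes leverage-based probabilities with uniform weights, differs in its details.

```python
import numpy as np

def leverage_scores(X):
    """Leverage scores h_ii, the diagonal of the hat matrix X (X'X)^{-1} X',
    computed via a thin QR decomposition."""
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)

def leverage_subsample(X, y, r, rng=np.random.default_rng(0)):
    """Draw r rows with probabilities proportional to their leverage scores."""
    h = leverage_scores(X)
    prob = h / h.sum()
    idx = rng.choice(len(y), size=r, replace=True, p=prob)
    # The selected probabilities are returned so a weighted estimator can be used.
    return X[idx], y[idx], prob[idx]
```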

A major limitation of random subsampling methods is that the amount of information in the resulting subdata is proportional to the size of the subdata, which is often much smaller than the full data size. Wang et al. (2019) proved that, in linear regression, the variance of an estimator based on random subsampling converges to zero at a rate proportional to the inverse of the subdata size. Is it possible for the information contained in subdata to be tied to the size of the full data rather than only to that of the subdata? Ideally we would choose the subdata with the maximum amount of information among all possible subdata sets. However, this is infeasible in practice, since there are $\binom{n}{r}$ subsets of size $r$ from a full data set of size $n$, a combinatorial number that quickly becomes intractable even for moderate $n$ and $r$, so an alternative approach has to be employed. Under linear models, Wang et al. (2019) proposed a novel approach called Information-Based Optimal Subdata Selection (IBOSS) to select subdata. Unlike random subsampling methods, IBOSS is deterministic: it selects subdata based on a characterization of the D-optimal design. Under certain conditions, Wang et al. (2019) showed that the variance of the resulting estimator converges to zero at a rate corresponding to the size of the full data. Simulation studies demonstrated that the IBOSS approach significantly outperforms random subsampling approaches.
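For reference, a simplified sketch of the D-optimality-motivated selection rule for linear models is shown below. It follows the spirit of Wang et al. (2019), keeping for each covariate the observations with the most extreme values, but it uses a full sort for clarity; the original algorithm uses partition-based selection so that the cost stays linear in n.

```python
import numpy as np

def iboss_linear(Z, r):
    """Simplified IBOSS-style selection for linear models (after Wang et al., 2019).

    Z : (n, m) covariate matrix; r : subdata size, assumed divisible by 2*m.
    For each covariate, keep the k smallest and k largest not-yet-selected rows,
    where k = r // (2 * m). Returns the indices of the selected rows."""
    n, m = Z.shape
    k = r // (2 * m)
    selected = np.zeros(n, dtype=bool)
    for j in range(m):
        avail = np.where(~selected)[0]
        order = np.argsort(Z[avail, j])
        selected[avail[order[:k]]] = True    # k smallest values of covariate j
        selected[avail[order[-k:]]] = True   # k largest values of covariate j
    return np.where(selected)[0]
```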

While the IBOSS approach effectively addresses the trade-off between computational complexity and statistical efficiency, it was developed in the linear-model context. Does this strategy also work for nonlinear models? Unlike linear models, whose information matrices are relatively simple and have an explicit form, nonlinear models have information matrices that are much more complicated and depend on unknown parameters. Consequently, the problem under nonlinear models is considerably harder than that under linear models.
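To make the contrast concrete, under the linear model $Y_i=X_i^T\beta+\varepsilon_i$ with error variance $\sigma^2$ the information matrix does not involve $\beta$, while under the logistic model (introduced formally in Section 2) it is weighted by success probabilities that depend on the unknown $\beta$:
$$M_{\text{lin}}=\frac{1}{\sigma^{2}}\sum_{i=1}^{n}X_iX_i^{T},\qquad M_{\text{logistic}}(\beta)=\sum_{i=1}^{n}p_i(\beta)\{1-p_i(\beta)\}X_iX_i^{T},\qquad p_i(\beta)=\frac{e^{X_i^{T}\beta}}{1+e^{X_i^{T}\beta}}.$$
Selecting subdata to maximize a functional of $M_{\text{logistic}}(\beta)$ therefore requires dealing with this dependence on $\beta$, which is the main source of the additional difficulty.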

Nonlinear models, however, are widely applied in practice. In particular, logistic regression models play an important role in categorical data analysis and are used in various fields, such as finance, medicine, and the social sciences. Unlike linear models, whose estimators have closed-form solutions, estimators for logistic regression models generally have no closed form, and iterative procedures must be used to compute the estimates numerically. Compared with linear regression models, the computational cost for logistic regression models is therefore much higher for big data sets. There is limited research on how to choose subdata from a full data set for a logistic regression model, perhaps due to the complexity introduced by the nonlinearity. Wang et al. (2018) proposed the optimal subsampling method under the A-optimality criterion (OSMAC), in which the probability weights are specified according to the A-optimality criterion from optimal design theory (Kiefer, 1959). However, as with many other random subsampling approaches for linear models, we show in the next section that the information extracted by the OSMAC approach is limited by the subsample size.
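As an illustration of the iterative computation, a generic Newton-Raphson sketch for the logistic maximum likelihood estimator is given below; it is not the paper's code, and X is assumed to already include the intercept column. Each iteration requires a full pass over the n observations at a cost of roughly O(n m^2), which is what makes full-data fitting expensive and subdata selection attractive.

```python
import numpy as np

def logistic_mle(X, y, max_iter=25, tol=1e-8):
    """Newton-Raphson for logistic regression; each iteration costs O(n m^2)."""
    n, m = X.shape
    beta = np.zeros(m)
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities p_i(beta)
        grad = X.T @ (y - p)                  # score vector
        w = p * (1.0 - p)                     # weights p_i (1 - p_i)
        hess = X.T @ (X * w[:, None])         # information matrix at current beta
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.linalg.norm(step) < tol:
            break
    return beta
```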

In this paper, we study subdata selection under logistic regression models using the IBOSS strategy. A new algorithm for selecting subdata is proposed. Compared with existing subsampling approaches, the new algorithm has two advantages: its estimation efficiency is significantly higher, and its computational cost is competitive.

The key contribution of this paper is that, under logistic regression models, it (i) proves that the information from random-subsampling-based subdata selection methods is limited by the size of the subdata, (ii) proposes a new approach to the trade-off between computational complexity and statistical efficiency, and (iii) proves that the information from the new algorithm increases with the size of the full data. These results give a theoretical justification for information-based subdata selection under nonlinear models. Since “data reduction is perhaps the most critical component in retrieving information in big data” (Yildirim et al., 2014), this is a significant step in big data analysis under nonlinear models.

The rest of the paper is organized as follows. Section 2 introduces notation, summarizes existing methods, and presents lower bounds on the variance-covariance matrices of subsampling-based estimators. Section 3 introduces the new algorithm and discusses its asymptotic properties. Section 4 compares the performance of the new algorithm, the OSMAC algorithm, and simple random sampling under various simulation settings. Section 5 provides a brief summary of the paper and its possible extensions. All technical details are provided in the supplementary material.

Section snippets

Notations and existing methods

We present the model setup and existing methods in this section. Let $\mathcal{F}_n=\{(Y_i,Z_i),\ i=1,\ldots,n\}$ denote the full data, where $Y_i$ is a binary response variable and $Z_i=(z_{i1},\ldots,z_{im})^T$ is an $m$-dimensional explanatory variable. Assume the logistic regression model
$$\mathrm{Prob}(Y_i=1\mid X_i)=p_i(\beta)=\frac{e^{X_i^T\beta}}{1+e^{X_i^T\beta}},$$
where $\beta=(\beta_0,\beta_1,\ldots,\beta_m)^T$ and $X_i=(1,Z_i^T)^T=(1,z_{i1},\ldots,z_{im})^T$. Here, $\beta_0$ is the intercept parameter and $(\beta_1,\ldots,\beta_m)^T$ is the $m$-dimensional slope parameter. As in linear models, $\beta$ is frequently estimated by the maximum…

IBOSS algorithm for logistic regression models

Recently, Wang et al. (2019) proposed the novel IBOSS approach for linear models. Unlike random subsampling approaches, which select subdata according to some sampling distribution, the IBOSS approach directly utilizes the structure of the D-optimal design under linear models and deterministically selects informative subsets. Based on both simulated and real data, Wang et al. (2019) showed that the resulting estimator from this procedure has significantly higher estimation…

Simulation settings and results

In this section, the IBOSS procedure is evaluated under various distributions of $Z_i$. The distributions used to generate the $Z_i$'s are listed below; a small data-generation sketch follows the list.

  • MzNormal: Multivariate normal distribution with mean vector $\mu=(0,\ldots,0)^T$ and variance-covariance matrix $\Sigma$, where $\Sigma_{ij}=0.5$ if $i\neq j$ and $\Sigma_{ij}=1$ if $i=j$.

  • NzNormal: Multivariate normal distribution with mean vector $\mu=(1,\ldots,1)^T$ and variance-covariance matrix $\Sigma$ as defined above.

  • MixNormal: Mixture normal distribution $\tfrac{1}{2}N(\mu,\Sigma)+\tfrac{1}{2}N(-\mu,\Sigma)$, where $\mu=(1,\ldots,1)^T$ and $\Sigma$ is the same as…
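A minimal sketch of generating covariates for the three cases listed above is given below. The names are illustrative, and the MixNormal case assumes the mixture has means $\mu$ and $-\mu$, as in the related IBOSS literature.

```python
import numpy as np

def make_sigma(m, rho=0.5):
    """Covariance matrix with unit variances and off-diagonal entries rho."""
    return np.full((m, m), rho) + (1.0 - rho) * np.eye(m)

def generate_Z(case, n, m, rng=np.random.default_rng(0)):
    """Generate an (n, m) covariate matrix for one of the listed distributions."""
    Sigma = make_sigma(m)
    if case == "MzNormal":                 # mean vector (0, ..., 0)
        return rng.multivariate_normal(np.zeros(m), Sigma, size=n)
    if case == "NzNormal":                 # mean vector (1, ..., 1)
        return rng.multivariate_normal(np.ones(m), Sigma, size=n)
    if case == "MixNormal":                # 0.5 N(mu, Sigma) + 0.5 N(-mu, Sigma)
        mu = np.ones(m)
        signs = rng.choice([-1.0, 1.0], size=n)
        return rng.multivariate_normal(np.zeros(m), Sigma, size=n) + signs[:, None] * mu
    raise ValueError(f"unknown case: {case}")
```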

Discussion

In this paper, we study subdata selection under logistic regression models. We show that, for random sampling-based strategies, such as the mVc strategy and uniform sampling, the information in the subdata is bounded by the size of the subdata. A novel information-based optimal subdata selection approach is proposed. For the new approach, we show that at least one eigenvalue of the information matrix goes to infinity as the full data size increases, even when the subdata size is fixed. The results…

CRediT authorship contribution statement

Qianshun Cheng: Methodology, Formal analysis, Writing - original draft. HaiYing Wang: Methodology, Writing - review & editing. Min Yang: Methodology, Project administration.

Acknowledgments

The authors are grateful for many insightful comments and suggestions from an anonymous referee, an associate editor, and the editor, which helped to improve the article. Wang's research was supported by the United States National Science Foundation, grant DMS-1812013, and Yang's research was supported by the United States National Science Foundation, grant DMS-1811291.

References (18)

  • Gourieroux, C., et al. Asymptotic properties of the maximum likelihood estimator in dichotomous models. J. Econometrics (1981)

  • Bickel, P.J., et al. Resampling fewer than n observations: Gains, losses, and remedies for losses. Statist. Sinica (1997)

  • Chen, X., et al. A split-and-conquer approach for analysis of extraordinarily large data. Statist. Sinica (2014)

  • Efron, B. Bootstrap methods: Another look at the jackknife. Ann. Statist. (1979)

  • Hosmer, D.W., et al. Applied Logistic Regression (2000)

  • Kiefer, J. Optimum experimental designs. Journal of the Royal Statistical Society. Series B (Methodological) (1959)

  • Kleiner, A., et al. A scalable bootstrap for massive data. J. R. Stat. Soc. Ser. B Stat. Methodol. (2014)

  • Liang, F., et al. A resampling-based stochastic approximation method for analysis of large geostatistical data. J. Amer. Statist. Assoc. (2013)

  • Lin, N., et al. Aggregated estimating equation estimation. Stat. Interface (2011)