Online updating method to correct for measurement error in big data streams
Introduction
Continued advances in science and technology have led to a constantly evolving definition of “big data”. Regardless of the formal definition, the amount of data being collected across scientific fields continues to grow at a remarkably fast pace. Applying statistical models and methods to such big data can impose an excessive computational burden, not only straining computer memory due to sheer volume, but also straining computational efficiency, since even seemingly simple tasks can take an inordinate amount of time to compute (e.g., Wang et al., 2016a). To overcome these barriers, statistical and computational methodologies have largely focused on either subsampling-based approaches (e.g., Kleiner et al., 2014, Ma et al., 2015, Wang et al., 2019, Wang et al., 2018b), divide-and-conquer approaches (e.g., Lin and Xi, 2011, Chen and Xie, 2014, Song and Liang, 2015), or online updating approaches (e.g., Schifano et al., 2016, Wang et al., 2016a, Wang et al., 2018a, Wu et al., 2018, Xue et al., 2020).
The online updating approach for big data analysis is different from the other two approaches since the data is not assumed to exist all at once, but rather arrives sequentially in large chunks from a data stream. In this framework for regression-type analyses, Schifano et al. (2016) developed online updating algorithms that update the regression coefficient estimators and their variances as new data arrive; these algorithms are computationally efficient and minimally storage-intensive. Wang et al. (2018a) expanded the scope of the online updating method by accommodating the arrival of new predictor variables mid-way along the data stream. Furthermore, Wu et al. (2018) developed an online updating method for survival analysis under the Cox proportional hazards models, while Xue et al. (2020) proposed an online updating-based test to evaluate the proportional hazards assumption.
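The cumulative-updating idea behind such algorithms can be illustrated for ordinary least squares: only the accumulated sufficient statistics X′X and X′y need to be stored, not the raw data. The sketch below is a minimal illustration of this principle, not the full variance-updating algorithm of Schifano et al. (2016); the chunk sizes and coefficient values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3
beta = np.array([1.0, -2.0, 0.5])  # true coefficients (illustrative)

# Accumulated sufficient statistics; no raw data chunk is retained.
XtX = np.zeros((p, p))
Xty = np.zeros(p)

for k in range(20):                       # 20 chunks arriving from the stream
    X = rng.normal(size=(1000, p))
    y = X @ beta + rng.normal(size=1000)
    XtX += X.T @ X                        # update X'X with the new chunk
    Xty += X.T @ y                        # update X'y with the new chunk
    beta_hat = np.linalg.solve(XtX, Xty)  # current online estimate

print(np.round(beta_hat, 2))
```

After all chunks are processed, beta_hat coincides with the full-data least-squares estimator, since X′X and X′y decompose additively over chunks.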
In this paper, we focus on online updating in the context of the linear errors-in-variables model. Errors-in-variables cause bias in the estimators for the true parameters in statistical models and a loss of power for statistical inference (e.g., Carroll et al., 2006). To solve these problems, measurement error models have been discussed extensively under different assumptions and settings: linear models (e.g., Carroll and Ruppert, 1996, Fuller, 2009, Zhang et al., 2017), generalized linear models (e.g., Stefanski and Carroll, 1987, Carroll, 1989, Liang and Ren, 2005), nonlinear models (e.g., Stefanski et al., 1985, Carroll and Li, 1992, Carroll et al., 1993, Wang, 2003, Wang et al., 2004), varying-coefficient partially linear models (e.g., Wang et al., 2012, Wang et al., 2013, Wang et al., 2016b), and additive partial linear models (e.g., Liang et al., 2008).
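The bias induced by covariate measurement error can be seen in a small simulation of the classical additive-error model, where observing W = X + U instead of X attenuates the naive slope estimate by the factor Var(X)/(Var(X) + Var(U)); a method-of-moments correction with known error variance recovers the true slope. All numeric values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta1 = 100_000, 2.0
sigma_u2 = 0.5                       # measurement-error variance (assumed known here)

x = rng.normal(size=n)               # true covariate, variance 1
y = beta1 * x + rng.normal(size=n)
w = x + rng.normal(scale=np.sqrt(sigma_u2), size=n)  # error-prone observation

naive = (w @ y) / (w @ w)                      # regress y on w: attenuated
corrected = (w @ y) / (w @ w - n * sigma_u2)   # method-of-moments correction

print(round(naive, 2), round(corrected, 2))    # naive ≈ beta1/(1 + sigma_u2) ≈ 1.33
```

The naive estimate converges to beta1/(1 + sigma_u2) rather than beta1, which is the attenuation phenomenon referenced above.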
Unlike previous studies, we assume that the online-updating process begins with a subset of covariates unknowingly measured with error and that, after a particular known point along the data stream, these covariates are measured precisely. Such a scenario can arise in many fields of application when instruments for data measurement improve. For example, Sapuppo et al. (2007) developed improved instruments for real-time measurement of blood flow velocity that allow a wider range of velocities to be measured than previous instruments. Similarly, Zhang et al. (2016) improved the psychrometer, a sensor for measuring relative humidity, achieving higher accuracy and stability. Under the online updating framework, the online-updated estimators will in general be biased if any of the covariates are measured with error. Once the covariates are no longer measured with error, continuing to naively update the previous estimates (ignoring the measurement error) will also lead to biased parameter estimators. Thus, we propose to correct the bias of the estimators once the covariates are no longer measured with error, and then proceed with the traditional online updating algorithm after the correction. We further derive the asymptotic distribution of the corrected estimators.
The rest of the paper is organized as follows. In Section 2, we briefly review the online updating method for data streams assuming no covariate measurement error, and then propose our method to correct the bias due to covariate measurement error under the linear model framework. In Section 3, simulation studies and a real data analysis are conducted. Section 4 concludes with a discussion; technical details are provided in the online Supplementary Materials.
Model and method
In this section, we first briefly review the online updating method for linear models assuming no covariate measurement error. We then propose an online updating method for linear models that corrects for covariate measurement error, assuming that, after a specific point along the data stream, the covariates are no longer measured with error.
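The correction idea can be sketched schematically under the classical additive-error model with known error variance: once error-free covariates begin to arrive, the accumulated cross-product from the error-prone blocks is debiased, and standard online updating then resumes. This is only an illustration of the principle under those assumptions, not the paper's full estimator; all numeric values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, beta1 = 50_000, 50_000, 2.0
sigma_u2 = 0.5                                    # known error variance (illustrative)

# Block 1: covariate observed with additive measurement error.
x1 = rng.normal(size=n1)
w1 = x1 + rng.normal(scale=np.sqrt(sigma_u2), size=n1)
y1 = beta1 * x1 + rng.normal(size=n1)
WtW, Wty = w1 @ w1, w1 @ y1                       # accumulated statistics (biased)

# Switch point: debias the accumulated cross-product for the error variance.
WtW -= n1 * sigma_u2

# Block 2: covariate now measured precisely; resume standard online updating.
x2 = rng.normal(size=n2)
y2 = beta1 * x2 + rng.normal(size=n2)
WtW += x2 @ x2
Wty += x2 @ y2

beta_corrected = Wty / WtW
print(round(beta_corrected, 2))
```

Without the subtraction at the switch point, the estimate would remain attenuated even as error-free data accumulate, which motivates performing the correction once rather than discarding the first block.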
Simulation study
In the simulation, we consider two blocks of data of sizes n₁ and n₂, respectively. The data are generated from a linear regression model yᵢ = xᵢᵀβ + εᵢ, where the εᵢ's are uncorrelated error terms following a normal distribution with mean zero and variance σ². Covariates are generated from a multivariate normal distribution with mean vector μ and a covariance matrix with ρ as off-diagonal entries and 1 as diagonal entries. At the first block of data, the
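The covariate-generation step described in this section can be sketched as follows; the block sizes, dimension, correlation, and mean vector below are illustrative placeholders, since the values used in the paper are not shown in this snippet.

```python
import numpy as np

rng = np.random.default_rng(3)
p, rho = 3, 0.3                    # placeholder dimension and off-diagonal correlation
n1, n2 = 500, 500                  # placeholder block sizes
mu = np.zeros(p)                   # placeholder mean vector

Sigma = np.full((p, p), rho)       # rho as off-diagonal entries ...
np.fill_diagonal(Sigma, 1.0)       # ... and 1 as diagonal entries

# Covariates for both blocks drawn from the multivariate normal design.
X = rng.multivariate_normal(mu, Sigma, size=n1 + n2)
print(X.shape)
```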
Discussion
The online updating method is useful for data arriving sequentially in a stream. In this paper, we studied a method to sequentially update estimators when some covariates were initially measured with error along the data stream. At the point at which we first observe the covariates measured without error, we could consider discarding the previously updated estimators (as they are biased) and starting the online updating process anew with the precisely measured covariates, or keep
Acknowledgments
The authors wish to thank the anonymous reviewers and associate editor for their helpful comments, which have greatly improved this manuscript. The second author was partially supported by NSF Grant 1812013.
References (31)
- Wang et al. (2016). Statistical methods and computing for big data. Stat. Interface.
- Wang et al. (2013). Adaptive LASSO for varying-coefficient partially linear measurement error models. J. Statist. Plann. Inference.
- Carroll (1989). Covariance analysis in generalized linear measurement error models. Stat. Med.
- Carroll et al. (1993). Case-control studies with errors in covariates. J. Amer. Statist. Assoc.
- Carroll and Li (1992). Measurement error regression with unknown link: dimension reduction and data visualization. J. Amer. Statist. Assoc.
- Carroll and Ruppert (1996). The use and misuse of orthogonal regression in linear errors-in-variables models. Amer. Statist.
- Carroll et al. (2006). Measurement Error in Nonlinear Models: A Modern Perspective.
- Chen and Xie (2014). A split-and-conquer approach for analysis of extraordinarily large data. Statist. Sinica.
- Field et al. (2012). Discovering Statistics Using R.
- Fuller (2009). Measurement Error Models, Vol. 305.
- Kleiner et al. (2014). A scalable bootstrap for massive data. J. R. Stat. Soc. Ser. B Stat. Methodol.
- Liang and Ren (2005). Generalized partially linear measurement error models. J. Comput. Graph. Statist.
- Liang et al. (2008). Additive partial linear models with measurement errors. Biometrika.
- Lin and Xi (2011). Aggregated estimating equation estimation. Stat. Interface.
- Ma et al. (2015). A statistical perspective on algorithmic leveraging. J. Mach. Learn. Res.