Online updating method to correct for measurement error in big data streams
Introduction
Continued advances in science and technology have led to a constantly evolving definition of “big data”. Regardless of the formal definition, the amount of data being collected across scientific fields continues to grow at a remarkably fast pace. Applying statistical models and methods to such big data can impose an excessive computational burden, not only straining computer memory due to sheer volume, but also straining computational efficiency, since even seemingly simple tasks can take an inordinate amount of time to compute (e.g., Wang et al., 2016a). To overcome these barriers, statistical and computational methodologies have largely focused on either subsampling-based approaches (e.g., Kleiner et al., 2014, Ma et al., 2015, Wang et al., 2019, Wang et al., 2018b), divide-and-conquer approaches (e.g., Lin and Xi, 2011, Chen and Xie, 2014, Song and Liang, 2015), or online updating approaches (e.g., Schifano et al., 2016, Wang et al., 2016a, Wang et al., 2018a, Wu et al., 2018, Xue et al., 2020).
The online updating approach for big data analysis is different from the other two approaches since the data is not assumed to exist all at once, but rather arrives sequentially in large chunks from a data stream. In this framework for regression-type analyses, Schifano et al. (2016) developed online updating algorithms that update the regression coefficient estimators and their variances as new data arrive; these algorithms are computationally efficient and minimally storage-intensive. Wang et al. (2018a) expanded the scope of the online updating method by accommodating the arrival of new predictor variables mid-way along the data stream. Furthermore, Wu et al. (2018) developed an online updating method for survival analysis under the Cox proportional hazards models, while Xue et al. (2020) proposed an online updating-based test to evaluate the proportional hazards assumption.
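The cumulative-updating idea behind such algorithms can be illustrated for ordinary least squares: only the accumulated sufficient statistics X′X and X′y need to be stored, not the raw data. The sketch below is a minimal illustration of this principle, not the full variance-updating algorithm of Schifano et al. (2016); the chunk sizes and coefficient values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3
beta = np.array([1.0, -2.0, 0.5])  # true coefficients (illustrative)

# Accumulated sufficient statistics; no raw data chunk is retained.
XtX = np.zeros((p, p))
Xty = np.zeros(p)

for k in range(20):                       # 20 chunks arriving from the stream
    X = rng.normal(size=(1000, p))
    y = X @ beta + rng.normal(size=1000)
    XtX += X.T @ X                        # update X'X with the new chunk
    Xty += X.T @ y                        # update X'y with the new chunk
    beta_hat = np.linalg.solve(XtX, Xty)  # current online estimate

print(np.round(beta_hat, 2))
```

After all chunks are processed, beta_hat coincides with the full-data least-squares estimator, since X′X and X′y decompose additively over chunks.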
In this paper, we focus on online updating in the context of the linear errors-in-variables model. Errors-in-variables cause bias in the estimators for the true parameters in statistical models and a loss of power for statistical inference (e.g., Carroll et al., 2006). To solve these problems, measurement error models have been discussed extensively under different assumptions and settings: linear models (e.g., Carroll and Ruppert, 1996, Fuller, 2009, Zhang et al., 2017), generalized linear models (e.g., Stefanski and Carroll, 1987, Carroll, 1989, Liang and Ren, 2005), nonlinear models (e.g., Stefanski et al., 1985, Carroll and Li, 1992, Carroll et al., 1993, Wang, 2003, Wang et al., 2004), varying-coefficient partially linear models (e.g., Wang et al., 2012, Wang et al., 2013, Wang et al., 2016b), and additive partial linear models (e.g., Liang et al., 2008).
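The bias induced by covariate measurement error can be seen in a small simulation of the classical additive-error model, where observing W = X + U instead of X attenuates the naive slope estimate by the factor Var(X)/(Var(X) + Var(U)); a method-of-moments correction with known error variance recovers the true slope. All numeric values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta1 = 100_000, 2.0
sigma_u2 = 0.5                       # measurement-error variance (assumed known here)

x = rng.normal(size=n)               # true covariate, variance 1
y = beta1 * x + rng.normal(size=n)
w = x + rng.normal(scale=np.sqrt(sigma_u2), size=n)  # error-prone observation

naive = (w @ y) / (w @ w)                      # regress y on w: attenuated
corrected = (w @ y) / (w @ w - n * sigma_u2)   # method-of-moments correction

print(round(naive, 2), round(corrected, 2))    # naive ≈ beta1/(1 + sigma_u2) ≈ 1.33
```

The naive estimate converges to beta1/(1 + sigma_u2) rather than beta1, which is the attenuation phenomenon referenced above.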
Unlike previous studies, we assume that the online-updating process begins with a subset of covariates unknowingly measured with error and that, after a particular known point along the data stream, these covariates are measured precisely. Such a scenario can arise in many fields of application when instruments for data measurement improve. For example, Sapuppo et al. (2007) developed improved instruments for real-time measurement of blood flow velocity that allow a wider range of velocities to be measured than previous instruments. Similarly, Zhang et al. (2016) improved the psychrometer, a sensor for measuring relative humidity, achieving higher accuracy and stability. Under the online updating framework, the online-updated estimators will in general be biased if any of the covariates are measured with error. Once the covariates are no longer measured with error, continuing to naively update the previous estimates (ignoring the measurement error) will also lead to biased parameter estimators. Thus, we propose to correct the bias of the estimators once the covariates are no longer measured with error, and then proceed with the traditional online updating algorithm after the correction. We further derive the asymptotic distribution of the corrected estimators.
The rest of the paper is organized as follows. In Section 2, we briefly review the online updating method for data streams assuming no covariate measurement error, and then propose our method to correct the bias due to covariate measurement error under the linear model framework. In Section 3, simulation studies and a real data analysis are conducted. Section 4 concludes with a discussion; technical details are provided in the online Supplementary Materials.
Model and method
In this section, we first briefly review the online updating method for linear models assuming no covariate measurement error. We then propose an online updating method for linear models that corrects for covariate measurement error, assuming that, after a specific point along the data stream, the covariates are no longer measured with error.
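The correction idea can be sketched schematically under the classical additive-error model with known error variance: once error-free covariates begin to arrive, the accumulated cross-product from the error-prone blocks is debiased, and standard online updating then resumes. This is only an illustration of the principle under those assumptions, not the paper's full estimator; all numeric values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, beta1 = 50_000, 50_000, 2.0
sigma_u2 = 0.5                                    # known error variance (illustrative)

# Block 1: covariate observed with additive measurement error.
x1 = rng.normal(size=n1)
w1 = x1 + rng.normal(scale=np.sqrt(sigma_u2), size=n1)
y1 = beta1 * x1 + rng.normal(size=n1)
WtW, Wty = w1 @ w1, w1 @ y1                       # accumulated statistics (biased)

# Switch point: debias the accumulated cross-product for the error variance.
WtW -= n1 * sigma_u2

# Block 2: covariate now measured precisely; resume standard online updating.
x2 = rng.normal(size=n2)
y2 = beta1 * x2 + rng.normal(size=n2)
WtW += x2 @ x2
Wty += x2 @ y2

beta_corrected = Wty / WtW
print(round(beta_corrected, 2))
```

Without the subtraction at the switch point, the estimate would remain attenuated even as error-free data accumulate, which motivates performing the correction once rather than discarding the first block.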
Simulation study
In the simulation, we consider two blocks of data of sizes n₁ and n₂, respectively. The data are generated from a linear regression model yᵢ = xᵢᵀβ + εᵢ, where the εᵢ's are uncorrelated error terms following a normal distribution with mean zero and variance σ². Covariates are generated from a multivariate normal distribution with mean vector μ and a covariance matrix with ρ as off-diagonal entries and 1 as diagonal entries. At the first block of data, the
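The covariate-generation step described in this section can be sketched as follows; the block sizes, dimension, correlation, and mean vector below are illustrative placeholders, since the values used in the paper are not shown in this snippet.

```python
import numpy as np

rng = np.random.default_rng(3)
p, rho = 3, 0.3                    # placeholder dimension and off-diagonal correlation
n1, n2 = 500, 500                  # placeholder block sizes
mu = np.zeros(p)                   # placeholder mean vector

Sigma = np.full((p, p), rho)       # rho as off-diagonal entries ...
np.fill_diagonal(Sigma, 1.0)       # ... and 1 as diagonal entries

# Covariates for both blocks drawn from the multivariate normal design.
X = rng.multivariate_normal(mu, Sigma, size=n1 + n2)
print(X.shape)
```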
Discussion
The online updating method is useful for data arriving sequentially in a stream. In this paper, we studied a method to sequentially update estimators when some covariates were initially measured with error along the data stream. At the point at which we first observe the covariates measured without error, we could consider discarding the previously updated estimators (as they are biased) and starting the online updating process anew with the precisely measured covariates, or keep
Acknowledgments
The authors wish to thank the anonymous reviewers and associate editor for their helpful comments, which have greatly improved this manuscript. The second author was partially supported by NSF Grant 1812013.
References (31)
- Wang et al. (2016). Statistical methods and computing for big data. Stat. Interface.
- Wang et al. (2013). Adaptive LASSO for varying-coefficient partially linear measurement error models. J. Statist. Plann. Inference.
- Carroll (1989). Covariance analysis in generalized linear measurement error models. Stat. Med.
- Carroll et al. (1993). Case-control studies with errors in covariates. J. Amer. Statist. Assoc.
- Carroll and Li (1992). Measurement error regression with unknown link: dimension reduction and data visualization. J. Amer. Statist. Assoc.
- Carroll and Ruppert (1996). The use and misuse of orthogonal regression in linear errors-in-variables models. Amer. Statist.
- Carroll et al. (2006). Measurement Error in Nonlinear Models: A Modern Perspective.
- Chen and Xie (2014). A split-and-conquer approach for analysis of extraordinarily large data. Statist. Sinica.
- Field et al. (2012). Discovering Statistics Using R.
- Fuller (2009). Measurement Error Models, Vol. 305.
- Kleiner et al. (2014). A scalable bootstrap for massive data. J. R. Stat. Soc. Ser. B Stat. Methodol.
- Liang and Ren (2005). Generalized partially linear measurement error models. J. Comput. Graph. Statist.
- Liang et al. (2008). Additive partial linear models with measurement errors. Biometrika.
- Lin and Xi (2011). Aggregated estimating equation estimation. Stat. Interface.
- Ma et al. (2015). A statistical perspective on algorithmic leveraging. J. Mach. Learn. Res.