Online updating method with new variables for big data streams.
The Canadian Journal of Statistics ( IF 0.8 ) Pub Date : 2017-08-09 , DOI: 10.1002/cjs.11330
Chun Wang 1 , Ming-Hui Chen 2 , Jing Wu 3 , Jun Yan 2 , Yuping Zhang 2 , Elizabeth Schifano 2

For big data arriving in streams, online updating is an important statistical method that breaks the storage barrier and the computational barrier under certain circumstances. In the regression context, online updating algorithms assume that the set of predictor variables does not change, and consequently they cannot incorporate new variables that may become available midway through the data stream. A naive approach would be to discard all previous information and start updating with the new variables from scratch. We propose a method that uses the information from earlier data in the online updating algorithm, with bias corrections, to improve efficiency. The method is developed for linear models first and then extended to estimating equations for generalized linear models. Closed-form expressions for the efficiency gain over the naive approach are derived in a particular linear model setting. We compare the performance of our proposed bias-correcting approach and the naive approach in simulation studies with data generated from a normal linear model and a logistic regression model. The method is applied to a study on airline delay, where reasons for delays have been available only since 2003.
The Canadian Journal of Statistics 46: 123–146; 2018 © 2017 Statistical Society of Canada
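
To make the baseline concrete, here is a minimal sketch of online updating for a linear model and of the naive restart described in the abstract. It is not the authors' bias-corrected estimator, which the abstract does not specify. It only illustrates the standard idea of keeping the cumulative sums X'X and X'y so that each raw block can be discarded after use. The names `OnlineOLS` and `solve` and the toy data are assumptions made for illustration.

```python
class OnlineOLS:
    """Least squares from cumulative sufficient statistics X'X and X'y."""

    def __init__(self, p):
        self.p = p
        self.xtx = [[0.0] * p for _ in range(p)]
        self.xty = [0.0] * p

    def update(self, X, y):
        """Fold one block of rows into the running sums; the block is not stored."""
        for row, yi in zip(X, y):
            for j in range(self.p):
                self.xty[j] += row[j] * yi
                for k in range(self.p):
                    self.xtx[j][k] += row[j] * row[k]

    def coef(self):
        """Solve the normal equations (X'X) beta = X'y for the current estimate."""
        return solve(self.xtx, self.xty)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


# Toy, noise-free stream: early blocks carry an intercept and x1 only.
full = OnlineOLS(p=2)
full.update([[1, 0], [1, 1], [1, 2]], [1, 3, 5])   # y = 1 + 2*x1
full.update([[1, 3], [1, 4]], [7, 9])              # same model, later block

# Naive approach: once x2 becomes available, start a fresh accumulator
# with three columns and ignore everything seen before.
naive = OnlineOLS(p=3)
naive.update([[1, 0, 1], [1, 1, 0]], [4, 3])       # y = 1 + 2*x1 + 3*x2
naive.update([[1, 2, 2], [1, 3, 1]], [11, 10])

print(full.coef())
print(naive.coef())
```

The `naive` accumulator uses only the blocks observed after the new variable appears, which is the loss of efficiency the proposed method aims to recover by reusing the earlier blocks with a bias correction.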

Updated: 2017-08-09