Online multiple learning with working sufficient statistics for generalized linear models in big data
Statistics and Its Interface (IF 0.8), Pub Date: 2021-07-08, DOI: 10.4310/20-sii661
Tonglin Zhang, Baijian Yang

The article proposes an online multiple learning approach for generalized linear models (GLMs) in big data. The approach relies on a new concept called working sufficient statistics (WSS), formulated under traditional iteratively reweighted least squares (IRWLS) for maximum likelihood estimation in GLMs. Because traditional IRWLS must access the entire data set multiple times, it cannot be applied directly to big data. To overcome this difficulty, a new approach, called one-step IRWLS, is proposed within an online-learning framework. The work investigates two methods. The first uses only the current data block to formulate the objective function; the second also uses information from the previous data blocks. Simulation studies show that the results given by the second method can be as precise and accurate as those given by the exact maximum likelihood estimator. A useful property is that one-step IRWLS avoids the memory and computational-efficiency barriers caused by the volume of big data. Because the size of the WSS does not vary with the sample size, the proposed approach can be used even when the size of the data far exceeds the memory of the computing system.
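The second method described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes logistic regression as the GLM, assumes the data arrive in blocks, and accumulates the IRWLS normal-equation quantities X'WX and X'Wz (a fixed p-by-p matrix and p-vector, so storage does not grow with the sample size) across blocks, solving once per block. The function name and chunking scheme are illustrative assumptions.

```python
import numpy as np

def one_step_irwls_logistic(chunks, p):
    """Single pass over data blocks: accumulate working sufficient
    statistics (sums of X'WX and X'Wz) and take one IRWLS solve per
    block. Storage is O(p^2) regardless of the total sample size."""
    beta = np.zeros(p)
    A = np.zeros((p, p))   # accumulated X'WX over all blocks seen so far
    b = np.zeros(p)        # accumulated X'Wz over all blocks seen so far
    for X, y in chunks:
        eta = np.clip(X @ beta, -30.0, 30.0)   # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))        # mean under logit link
        w = np.clip(mu * (1.0 - mu), 1e-10, None)  # IRWLS weights
        z = eta + (y - mu) / w                 # working response
        A += X.T @ (w[:, None] * X)
        b += X.T @ (w * z)
        beta = np.linalg.solve(A, b)           # one-step update
    return beta
```

Because A and b retain contributions from all previous blocks, each solve uses the information of the earlier data, matching the spirit of the second method; dropping the accumulation (resetting A and b per block) would correspond to the first, current-data-only method.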

Updated: 2021-07-09