Statistical methods and computing for big data
Statistics and Its Interface (IF 0.8), Pub Date: 2016-01-01, DOI: 10.4310/sii.2016.v9.n4.a1
Chun Wang, Ming-Hui Chen, Elizabeth Schifano, Jing Wu, Jun Yan

Big data are data on a massive scale in terms of volume, intensity, and complexity that exceed the capacity of standard analytic tools. They present opportunities as well as challenges to statisticians. The role of computational statisticians in scientific discovery from big data analyses has been under-recognized, even by peer statisticians. This article summarizes recent methodological and software developments in statistics that address the challenges of big data. Methodologies are grouped into three classes: subsampling-based, divide-and-conquer, and online updating for stream data. As a new contribution, the online updating approach is extended to variable selection with commonly used criteria, and the performance of these criteria is assessed in a simulation study with stream data. Software packages are summarized with a focus on open-source R and R packages, covering recent tools that help break the barriers of computer memory and computing power. Some of the tools are illustrated in a case study that fits a logistic regression for the chance of airline delay.
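
To make the idea of online updating for stream data concrete, here is a minimal sketch for the simplest case, ordinary least squares, where the running totals of X'X and X'y can be updated chunk by chunk without storing earlier data. It is an illustration under stated assumptions, not the article's own algorithm: the article works with open-source R tools and a logistic regression case study, whereas this sketch uses plain Python, a linear model, and simulated data. The names `OnlineLeastSquares` and `simulate_chunk` are hypothetical.

```python
import random


class OnlineLeastSquares:
    """Hypothetical helper: keeps running X'X and X'y, never the raw data."""

    def __init__(self, p):
        self.p = p
        self.xtx = [[0.0] * p for _ in range(p)]
        self.xty = [0.0] * p
        self.n = 0

    def update(self, rows, ys):
        # Fold one chunk of the stream into the running sufficient statistics.
        for x, y in zip(rows, ys):
            for i in range(self.p):
                self.xty[i] += x[i] * y
                for j in range(self.p):
                    self.xtx[i][j] += x[i] * x[j]
            self.n += 1

    def coef(self):
        # Solve (X'X) beta = X'y by Gaussian elimination with partial pivoting.
        p = self.p
        a = [row[:] + [b] for row, b in zip(self.xtx, self.xty)]
        for col in range(p):
            piv = max(range(col, p), key=lambda r: abs(a[r][col]))
            a[col], a[piv] = a[piv], a[col]
            for r in range(col + 1, p):
                f = a[r][col] / a[col][col]
                for c in range(col, p + 1):
                    a[r][c] -= f * a[col][c]
        beta = [0.0] * p
        for i in reversed(range(p)):
            s = a[i][p] - sum(a[i][j] * beta[j] for j in range(i + 1, p))
            beta[i] = s / a[i][i]
        return beta


def simulate_chunk(rng, size, beta):
    # Simulated stream chunk (assumption): intercept plus two Gaussian predictors.
    rows = [[1.0, rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(size)]
    ys = [sum(b * v for b, v in zip(beta, x)) + rng.gauss(0, 0.1) for x in rows]
    return rows, ys


rng = random.Random(0)
true_beta = [1.0, 2.0, -0.5]
model = OnlineLeastSquares(p=3)
for _ in range(20):
    rows, ys = simulate_chunk(rng, 500, true_beta)
    model.update(rows, ys)  # each chunk can be discarded after this call
estimate = model.coef()
print(model.n, [round(b, 2) for b in estimate])
```

For the linear model this chunk-wise update reproduces the full-data least squares fit, because the sufficient statistics simply add across chunks. Models such as logistic regression have no such closed form, so an online update for them would need an approximation and is not covered by this sketch.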

Updated: 2016-01-01