Random Forests for Big Data
Big Data Research (IF 3.5), Pub Date: 2017-08-23, DOI: 10.1016/j.bdr.2017.07.003
Robin Genuer, Jean-Michel Poggi, Christine Tuleau-Malot, Nathalie Villa-Vialaneix

Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data problems always involve massive datasets, but they also often include online data and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods, and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that handles, within a single and versatile framework, regression problems as well as two-class and multi-class classification problems. Focusing on classification problems, this paper offers a selective review of proposals for scaling random forests to Big Data. These proposals rely on parallel environments or on online adaptations of random forests. We also describe how the out-of-bag error is addressed in these methods. We then formulate various remarks on random forests in the Big Data context. Finally, we experiment with five variants on two massive datasets (15 and 120 million observations), one simulated and one drawn from real-world data. One variant relies on subsampling, while three others are related to parallel implementations of random forests and involve either various adaptations of the bootstrap to Big Data or "divide-and-conquer" approaches. The fifth variant relies on online learning of random forests. These numerical experiments highlight the relative performance of the different variants, as well as some of their limitations.
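The abstract describes the variants only at a high level; the following is a minimal, hypothetical sketch in Python (using scikit-learn and a synthetic dataset, not the authors' code, implementations, or datasets) of two of the ideas: growing a forest on a subsample of a large dataset, and a simple "divide-and-conquer" scheme that grows one forest per disjoint data chunk and combines their predictions, reporting each sub-forest's out-of-bag (OOB) error along the way.

```python
# Illustrative sketch only: subsampling and divide-and-conquer random forests
# on synthetic data, with OOB error reported as a proxy for generalization error.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# (1) Subsampling variant: grow a single forest on a random 10% subsample.
idx = rng.choice(len(X_train), size=len(X_train) // 10, replace=False)
rf_sub = RandomForestClassifier(
    n_estimators=100, oob_score=True, n_jobs=-1, random_state=0
)
rf_sub.fit(X_train[idx], y_train[idx])
print("subsampled forest OOB score :", rf_sub.oob_score_)
print("subsampled forest test score:", rf_sub.score(X_test, y_test))

# (2) Divide-and-conquer variant: split the data into disjoint chunks,
# grow one forest per chunk, then combine predictions with a soft vote
# (average of the class-1 probabilities across chunk forests).
n_chunks = 4
chunk_forests = []
for chunk_idx in np.array_split(rng.permutation(len(X_train)), n_chunks):
    rf = RandomForestClassifier(
        n_estimators=50, oob_score=True, n_jobs=-1, random_state=0
    )
    rf.fit(X_train[chunk_idx], y_train[chunk_idx])
    chunk_forests.append(rf)
    print("chunk forest OOB score:", rf.oob_score_)

votes = np.mean([rf.predict_proba(X_test)[:, 1] for rf in chunk_forests], axis=0)
y_pred = (votes >= 0.5).astype(int)
print("divide-and-conquer test score:", np.mean(y_pred == y_test))
```

Note that each chunk forest only ever sees its own chunk, so its OOB error estimates performance with respect to that chunk rather than the full dataset, which is one of the OOB subtleties the paper discusses for these variants.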



Update date: 2017-08-23