Trees, forests, chickens, and eggs: when and why to prune trees in a random forest
Statistical Analysis and Data Mining (IF 2.1), Pub Date: 2022-08-25, DOI: 10.1002/sam.11594
Siyu Zhou, Lucas Mentch

Owing to their long-standing reputation as excellent off-the-shelf predictors, random forests (RFs) remain a go-to model for applied statisticians and data scientists. Despite this widespread use, until recently little was known about their inner workings or about which aspects of the procedure drive their success. Two competing hypotheses have recently emerged: one based on interpolation and the other on regularization. This work argues for the latter, using the regularization framework to reexamine the decades-old question of whether individual trees in an ensemble ought to be pruned. Although the default RF constructions in most popular software packages use trees of nearly full depth, we provide strong evidence that tree depth should be seen as a natural form of regularization across the entire procedure. In particular, our work suggests that RFs with shallow trees are advantageous when the signal-to-noise ratio in the data is low. In building this argument, we also critique the newly popular notion of “double descent” in RFs by drawing parallels to U-statistics and arguing that the noticeable jumps in random forest accuracy result from simple averaging rather than interpolation.
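
The depth-as-regularization claim can be explored with a small simulation. The following is only a minimal sketch, not the authors' experimental setup: it assumes scikit-learn's `RandomForestRegressor`, uses `max_depth` as a stand-in for pruning, and relies on a hypothetical linear signal and a hypothetical helper `compare_depths`. It fits a shallow and a full-depth forest to data simulated at a chosen signal-to-noise ratio and reports each forest's error against the noiseless signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def compare_depths(snr, depths=(2, None), n_train=200, n_test=1000, p=10, seed=0):
    """Return {max_depth: test MSE against the noiseless signal} for one simulated data set.

    Everything here is an illustrative assumption: the linear signal, the
    Gaussian features, and the use of max_depth to control tree depth.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical signal: the first five features matter, the rest are noise.
    beta = np.zeros(p)
    beta[:5] = 1.0

    X_train = rng.normal(size=(n_train, p))
    X_test = rng.normal(size=(n_test, p))
    signal_train = X_train @ beta
    signal_test = X_test @ beta

    # Choose the noise scale so that Var(signal) / Var(noise) equals snr.
    sigma = np.sqrt(signal_train.var() / snr)
    y_train = signal_train + rng.normal(scale=sigma, size=n_train)

    results = {}
    for depth in depths:
        # max_depth=None grows each tree to (near) full depth, the usual default.
        forest = RandomForestRegressor(
            n_estimators=100, max_depth=depth, random_state=seed
        )
        forest.fit(X_train, y_train)
        error = np.mean((forest.predict(X_test) - signal_test) ** 2)
        results[depth] = float(error)
    return results


if __name__ == "__main__":
    for snr in (0.05, 5.0):
        print(f"SNR={snr}: {compare_depths(snr)}")
```

Running the script at a low and a high signal-to-noise ratio gives one way to check whether the shallow forest is competitive in the noisy regime. A single seed proves nothing on its own, so any comparison should be averaged over many seeds and settings.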

Updated: 2022-08-25