Booster: An Accelerator for Gradient Boosting Decision Trees, arXiv - CS - Hardware Architecture


arXiv - CS - Hardware Architecture, Pub Date: 2020-11-03, DOI: arxiv-2011.02022
Mingxuan He, T. N. Vijaykumar, and Mithuna Thottethodi

We propose Booster, a novel accelerator for gradient boosting trees based on the unique characteristics of gradient boosting models. We observe that the dominant steps of gradient boosting training (accounting for 90-98% of training time) involve simple, fine-grained, independent operations on small-footprint data structures (e.g., accumulating and comparing values in the structures). Unfortunately, existing multicores and GPUs are unable to harness this parallelism because they do not support massively parallel data structure accesses that are irregular and data-dependent. By employing a scalable sea-of-small-SRAMs approach and an SRAM bandwidth-preserving mapping of data record fields to the SRAMs, Booster achieves significantly more parallelism (e.g., 3200-way parallelism) than multicores and GPUs. In addition, Booster employs a redundant data representation that significantly lowers the memory bandwidth demand. Our simulations reveal that Booster achieves an 11.4x speedup over an ideal 32-core multicore and a 6.4x speedup over an ideal GPU. Based on ASIC synthesis of FPGA-validated RTL using 45 nm technology, we estimate a Booster chip to occupy 60 mm^2 of area and dissipate 23 W when operating at a 1 GHz clock speed.
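To make the "accumulate and compare" operations concrete, the following is a minimal software sketch of one feature's split search in gradient boosting training. It assumes the histogram-based formulation used by common gradient boosting libraries; the abstract does not say which data structures Booster targets, so the function names `build_histogram` and `best_split`, the `reg_lambda` parameter, and the XGBoost-style gain score (with constant factors and the complexity penalty omitted) are all illustrative assumptions and not the authors' design.

```python
# Illustrative sketch only. Histogram-based split finding is ASSUMED here;
# the abstract does not name the data structures that Booster accelerates.

def build_histogram(bin_ids, grads, hess, num_bins):
    """Accumulate per-bin gradient and hessian sums for one feature."""
    g_hist = [0.0] * num_bins
    h_hist = [0.0] * num_bins
    for b, g, h in zip(bin_ids, grads, hess):
        # Irregular, data-dependent access: the bin index comes from the record.
        g_hist[b] += g
        h_hist[b] += h
    return g_hist, h_hist


def best_split(g_hist, h_hist, reg_lambda=1.0):
    """Scan bins left to right and compare the gain of each candidate split."""
    g_total, h_total = sum(g_hist), sum(h_hist)
    parent_score = g_total ** 2 / (h_total + reg_lambda)
    best_gain, best_bin = 0.0, None
    g_left = h_left = 0.0
    for b in range(len(g_hist) - 1):
        g_left += g_hist[b]
        h_left += h_hist[b]
        g_right = g_total - g_left
        h_right = h_total - h_left
        gain = (g_left ** 2 / (h_left + reg_lambda)
                + g_right ** 2 / (h_right + reg_lambda)
                - parent_score)
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain


if __name__ == "__main__":
    # Eight records already quantized into four bins (hypothetical data).
    bin_ids = [0, 0, 1, 1, 2, 2, 3, 3]
    grads = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
    hess = [1.0] * 8
    g_hist, h_hist = build_histogram(bin_ids, grads, hess, num_bins=4)
    print(best_split(g_hist, h_hist))
```

Under this assumed formulation, each histogram holds only a few hundred small entries, and every feature's histogram can be updated independently of the others, which is the kind of small-footprint, fine-grained, independent work the abstract describes.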


Updated: 2020-11-06
