CatBoost for big data: an interdisciplinary review
Journal of Big Data (IF 8.6), Pub Date: 2020-11-04, DOI: 10.1186/s40537-020-00369-8
John T Hancock, Taghi M Khoshgoftaar

Gradient Boosted Decision Trees (GBDTs) are a powerful tool for classification and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current GBDT implementations in order to use them effectively and make successful contributions. CatBoost is a member of the GBDT family of machine learning ensemble techniques. Since its debut in late 2018, researchers have successfully used CatBoost for machine learning studies involving Big Data. We take this opportunity to review recent research on CatBoost as it relates to Big Data, and to learn best practices both from studies that cast CatBoost in a positive light and from studies where CatBoost does not outshine other techniques, since both types of scenarios offer lessons. Furthermore, as a Decision Tree-based algorithm, CatBoost is well suited to machine learning tasks involving categorical, heterogeneous data. Recent work across multiple disciplines illustrates CatBoost’s effectiveness and shortcomings in classification and regression tasks. Another important issue we expose in the literature on CatBoost is its sensitivity to hyper-parameters and the importance of hyper-parameter tuning. One contribution we make is to take an interdisciplinary approach to cover studies related to CatBoost in a single work. This provides researchers with an in-depth understanding that helps clarify the proper application of CatBoost in solving problems. To the best of our knowledge, this is the first survey that studies all works related to CatBoost in a single publication.




Updated: 2020-11-04