Boosting methods for multi-class imbalanced data classification: an experimental review
Journal of Big Data ( IF 8.1 ) Pub Date : 2020-09-01 , DOI: 10.1186/s40537-020-00349-y
Jafar Tanha , Yousef Abdi , Negin Samadi , Nazila Razzaghi , Mohammad Asadpour

Since canonical machine learning algorithms assume that the dataset has an equal number of samples in each class, even binary classification becomes a very challenging task when the minority-class samples must be discriminated efficiently in imbalanced datasets. For this reason, researchers have paid considerable attention to this problem and have proposed many methods to address it, which can be broadly categorized into data-level and algorithm-level approaches. Moreover, multi-class imbalanced learning is much harder than the binary case and is still an open problem. Boosting algorithms are a class of ensemble learning methods in machine learning that improve the performance of individual base learners by combining them into a composite whole. This paper aims to review the most significant published boosting techniques for multi-class imbalanced datasets. A thorough empirical comparison is conducted to analyze the performance of binary and multi-class boosting algorithms on various multi-class imbalanced datasets. In addition, based on the obtained results for the performance evaluation metrics and recently proposed criteria for comparing metrics, the selected metrics are compared to determine a suitable performance metric for multi-class imbalanced datasets. The experimental studies show that the CatBoost and LogitBoost algorithms are superior to the other boosting algorithms on multi-class imbalanced conventional and big datasets, respectively. Furthermore, the MMCC is a better evaluation metric than the MAUC and G-mean in multi-class imbalanced data domains.
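To make the metric comparison concrete, the following is a minimal pure-Python sketch (not the paper's code; function names are our own) of the two metrics the abstract contrasts, assuming MMCC denotes the standard multi-class generalization of the Matthews correlation coefficient (Gorodkin's R_K statistic) and multi-class G-mean denotes the geometric mean of per-class recalls:

```python
from math import sqrt, prod

def confusion_counts(y_true, y_pred, classes):
    """Per-class counts needed for multi-class MCC and G-mean."""
    s = len(y_true)                                  # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))  # total correct
    t_k = {k: y_true.count(k) for k in classes}      # true occurrences of class k
    p_k = {k: y_pred.count(k) for k in classes}      # predictions of class k
    tp_k = {k: sum(t == p == k for t, p in zip(y_true, y_pred))
            for k in classes}                        # true positives for class k
    return s, c, t_k, p_k, tp_k

def multiclass_mcc(y_true, y_pred, classes):
    """Multi-class MCC (Gorodkin's R_K statistic)."""
    s, c, t_k, p_k, _ = confusion_counts(y_true, y_pred, classes)
    num = c * s - sum(p_k[k] * t_k[k] for k in classes)
    den = sqrt((s * s - sum(v * v for v in p_k.values())) *
               (s * s - sum(v * v for v in t_k.values())))
    return num / den if den else 0.0

def multiclass_gmean(y_true, y_pred, classes):
    """Geometric mean of per-class recalls (classes absent from y_true are skipped)."""
    _, _, t_k, _, tp_k = confusion_counts(y_true, y_pred, classes)
    recalls = [tp_k[k] / t_k[k] for k in classes if t_k[k]]
    return prod(recalls) ** (1 / len(recalls))

# Imbalanced toy example: 6 / 3 / 1 samples; the classifier misses class 2 entirely.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
# Accuracy is 0.8, yet G-mean collapses to 0.0 (one recall is zero),
# while MCC stays informative (here ≈ 0.63) — illustrating why plain
# accuracy is unsuitable for imbalanced data.
```

The toy run shows the behavior the abstract alludes to: a multiplicative metric like G-mean zeroes out whenever any single class is never recovered, whereas the MCC degrades gradually, which is one intuition for preferring MMCC-style correlation metrics in multi-class imbalanced domains.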

Updated: 2020-09-01