Sample and feature selecting based ensemble learning for imbalanced problems
Applied Soft Computing ( IF 7.2 ) Pub Date : 2021-09-08 , DOI: 10.1016/j.asoc.2021.107884
Zhe Wang 1, 2 , Peng Jia 2 , Xinlei Xu 2 , Bolu Wang 2 , Yujin Zhu 2 , Dongdong Li 1, 2
The imbalanced problem concerns the performance of classifiers on data sets with a severely skewed class distribution. Traditional methods are misled by the majority samples into incorrect predictions and fail to make full use of the minority samples. This paper designs a novel hybrid ensemble learning strategy named Sample and Feature Selection Hybrid Ensemble Learning (SFSHEL) and combines it with random forests to improve classification performance on imbalanced data. Specifically, SFSHEL uses cluster-based stratification to undersample the majority class and, simultaneously, a sliding-window mechanism to generate diverse feature subsets. Weights trained through validation are then assigned to the different base learners, and SFSHEL makes its final prediction by weighted voting. In this manner, SFSHEL not only guarantees acceptable performance but also saves computational time. Furthermore, the weighting process lets SFSHEL interpret the importance of each selected feature set, which matters in real-world scenarios. The contributions of the proposed strategy are: (1) reducing the impact of the imbalanced class distribution, (2) assigning base-learner weights only once, after the training process, and (3) generating feature weights that help interpret the importance of clinical features. In practice, the random forest is adopted as the base learner of SFSHEL, yielding a classifier abbreviated as SFSHEL-RF. Experiments show that the average performance of the proposed SFSHEL-RF on part of the KEEL datasets reaches 91.37%, which is comparable to our previous best ECUBoost-RF method and higher than the other eleven methods. On the clinical heart-failure datasets, SFSHEL-RF stably reaches the top three on three indicators. The experimental results on both the standard imbalanced and the clinical heart-failure datasets validate the effectiveness and stability of SFSHEL-RF.
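The abstract names three ingredients: cluster-based stratified undersampling of the majority class, sliding-window feature subsets, and weighted voting over base learners. The paper's own implementation is not reproduced here; the following is a minimal pure-Python sketch of those three steps under simplifying assumptions. All function names are illustrative, and the "clustering" is a crude stand-in (sorted chunks by feature sum) rather than the authors' actual clustering method.

```python
import random


def cluster_undersample(majority, n_clusters, n_keep, seed=0):
    """Illustrative cluster-based stratified undersampling: form crude
    pseudo-clusters by sorting rows on a feature summary, then sample each
    cluster proportionally so the kept subset reflects the majority-class
    structure instead of a purely random draw."""
    rng = random.Random(seed)
    ordered = sorted(majority, key=lambda row: sum(row))
    size = max(1, len(ordered) // n_clusters)
    clusters = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    per_cluster = max(1, n_keep // len(clusters))
    kept = []
    for cluster in clusters:
        kept.extend(rng.sample(cluster, min(per_cluster, len(cluster))))
    return kept[:n_keep]


def sliding_feature_windows(n_features, width, step):
    """Overlapping feature-index windows; each window defines the feature
    subset seen by one base learner, which is what creates diversity."""
    return [list(range(i, i + width))
            for i in range(0, n_features - width + 1, step)]


def weighted_vote(learners, weights, x):
    """Final prediction as the sign of the weight-summed base-learner votes
    (binary labels assumed to be 0/1)."""
    score = sum(w * (1.0 if predict(x) == 1 else -1.0)
                for predict, w in zip(learners, weights))
    return 1 if score >= 0 else 0
```

In SFSHEL-RF each entry of `learners` would be a random forest trained on one balanced sample paired with one feature window, and `weights` would be fitted once on a validation split after training, matching contribution (2) above.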




Updated: 2021-09-15