Sample and feature selecting based ensemble learning for imbalanced problems

https://doi.org/10.1016/j.asoc.2021.107884

Highlights

  • SFSHEL-RF adopts the re-sampling strategy CSRUS and the feature selection strategy SWFSS to deal with imbalanced problems.

  • SFSHEL-RF reduces the computational overhead by training the base learners in parallel.

  • SFSHEL-RF generates weights to interpret the importance of different features in medical datasets.

Abstract

The imbalanced classification problem concerns the performance of classifiers on data sets with a severely skewed class distribution. Traditional methods are misled by the majority samples into making incorrect predictions and fail to make full use of the minority samples. This paper designs a novel hybrid ensemble learning strategy named Sample and Feature Selection Hybrid Ensemble Learning (SFSHEL) and combines it with the random forest to improve classification performance on imbalanced data. Specifically, SFSHEL uses cluster-based stratification to undersample the majority samples and simultaneously adopts a sliding-window mechanism to generate diverse feature subsets. Weights trained through validation are then assigned to the different base learners, and SFSHEL finally makes its prediction by weighted voting. In this manner, SFSHEL not only guarantees acceptable performance but also saves computational time. Furthermore, the weighting process enables SFSHEL to interpret the importance of each selected feature set, which matters in real-world scenarios. The contributions of the proposed strategy are: (1) reducing the impact of the class imbalance distribution, (2) assigning base-learner weights only once, after the training process, and (3) generating feature weights that help interpret the importance of clinical features. In practice, the random forest is adopted as the base learner of SFSHEL, so as to build a classifier abbreviated as SFSHEL-RF. The experiments show that the average performance of the proposed SFSHEL-RF on a part of the KEEL datasets reaches 91.37%, which is comparable to our previous best method ECUBoost-RF and higher than the other eleven methods. On the clinical heart failure datasets, the performance of SFSHEL-RF stably reaches the top three on three indicators. The experimental results on both the standard imbalanced and the clinical heart failure datasets validate the effectiveness and stability of SFSHEL-RF.

Introduction

The last decade has witnessed comprehensive research and implementations for imbalanced classification problems [1], including face recognition [2], feature rotation [3], feature selection [4], online learning [5], remote sensing [6], text categorization [7], e-mail filtering [8], and so on. A training dataset is called imbalanced if at least one of its classes is represented by significantly fewer instances than the others [9]. In practice, the class with fewer samples generally carries a higher misclassification risk and is therefore called the positive class, while the remaining classes are negative. Imbalanced problems usually pose severe challenges to traditional methods, because the negative samples may keep the classifier from learning the correct data distribution [1], [10], [11]. The abbreviations used in this paper, with their descriptions and full forms, are listed in Table 1.

As solutions, many dedicated strategies have been proposed; they can be categorized into three main groups. First, re-sampling or editing strategies try to re-balance the class proportions [12]. Three frequently adopted re-sampling categories are oversampling, such as random over-sampling [13], the Synthetic Minority Over-sampling Technique (SMOTE) [14], [15], and multi-objective optimization undersampling [15]; undersampling, such as random under-sampling [16], Entropy and Confidence-based Undersampling Boosting (ECUBoost) [17], and one-sided selection [18]; and hybrid techniques [12]. Second, cost-sensitive learning emphasizes the positive samples during optimization by imposing heavier penalties on their misclassification [19], [20], [21]. Third, ensemble learning stays effective in imbalanced cases through improvements at several steps, including splitting the samples into different structures for training [22], [23], re-assigning the weights of base learners [24], introducing novel distribution-updating formulations [25], or optimizing the hyper-parameters during iterations [26]. Victor Henrique Alves et al. [27] studied different multi-objective optimization design approaches for ensemble learning.
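To make the re-sampling idea concrete, the sketch below implements plain random under-sampling of the majority class with NumPy. It is only an illustration of the general technique discussed above, not part of the proposed method; the function name and the assumption of binary 0/1 labels are ours, and libraries such as imbalanced-learn provide full-featured implementations.

```python
import numpy as np

def random_undersample(X, y, minority_label=1, random_state=0):
    """Randomly drop majority samples until both classes have equal size.

    Minimal illustration of re-sampling; assumes binary labels and that the
    minority class is the positive (label 1) class.
    """
    rng = np.random.default_rng(random_state)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    keep_majority = rng.choice(majority_idx, size=minority_idx.size, replace=False)
    keep = np.concatenate([minority_idx, keep_majority])
    return X[keep], y[keep]
```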

In practice, ensemble methods, whether boosting- or bagging-based, seem to be used more frequently than the other groups in dealing with imbalanced problems, which can be attributed to the following advantages. Firstly, ensemble learning can conveniently be combined with both re-sampling and cost-sensitive methods. For the former, the re-sampling method helps balance the class distribution in each iteration, so as to avoid the deterioration of the current base learner on imbalanced samples [28]. In turn, training more than one learner can alleviate either the overfitting caused by oversampling or the structural loss caused by undersampling. Typical combinations are SMOTEBoost [11] and RUSBoost [29]. For the latter, an ensemble method can easily become cost-sensitive if it separately considers the weights or errors of samples from different classes and imposes heavier penalties on the positive class [24], [30]. Representative methods of this kind include MetaCost [31] and AdaCost [25]. Secondly, ensemble learning can make significant improvements by adopting a new aggregating strategy in the testing process or by introducing a special validation step in each iteration [26], [32]. Lastly, ensemble learning can be integrated into another ensemble method to form a hybrid one. For instance, when AdaBoost is wrapped into an undersampling-based bagging strategy, a new method called EasyEnsemble is born [33]. Existing work shows that such combinations can be quite effective and efficient. For instance, a hybrid method named Asymmetric Bagging and Random Subspace for Support Vector Machines (ABRS-SVM), which performs bagging twice, once over the samples and once over the feature set, has demonstrated high performance on imbalanced datasets [34]. Moreover, a cascade method named cascade interpolation learning with double subspaces and confidence disturbance integrates base classifiers through double subspaces and a random under-sampling strategy [23], which weakens the interference caused by imbalanced data and enhances the generalization ability of the model. In summary, all the mentioned variants of ensemble methods have proved themselves on different imbalanced tasks [3], [4], [5], [6], [21].
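To illustrate how re-sampling and bagging can be combined, the following sketch trains each base tree on the whole minority class plus a freshly undersampled majority subset and aggregates predictions by majority vote, in the spirit of undersampling-based bagging such as EasyEnsemble. The function names, the choice of decision trees, and the binary 0/1 labels are assumptions for illustration, not a reproduction of any cited algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def undersampling_bagging(X, y, n_estimators=10, random_state=0):
    """Each member sees the full minority class plus a random majority subset."""
    rng = np.random.default_rng(random_state)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    learners = []
    for _ in range(n_estimators):
        sampled = rng.choice(majority_idx, size=minority_idx.size, replace=False)
        idx = np.concatenate([minority_idx, sampled])
        learners.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return learners

def majority_vote(learners, X):
    """Aggregate binary predictions of all members by simple majority."""
    preds = np.stack([clf.predict(X) for clf in learners])
    return (preds.mean(axis=0) >= 0.5).astype(int)
```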

However, a dilemma exists among the above ensemble ideas. Boosting-based methods commonly achieve higher accuracy, but during their training the distribution of the data used in the current iteration is adjusted according to the results of the previous iteration; therefore boosting-based methods cannot run in parallel, which results in an expensive time cost. Bagging-based methods are simpler to implement and seem to generalize better [35], but their performance is unstable. Above all, the existing ensemble methods do not always achieve superior performance on imbalanced problems. To overcome these drawbacks, one modified bagging model is embedded into another to replace the traditional bootstrap process with different ideas, so as to propose a new hybrid ensemble strategy named Sample and Feature Selection Hybrid Ensemble Learning (SFSHEL). The main contributions of SFSHEL can be highlighted as follows.
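The parallelism argument can be made concrete: bagging-style members share no state across iterations, so they can be fitted concurrently, whereas a boosting round must wait for the updated sample distribution of the previous round. Below is a minimal sketch using joblib, assuming the index subsets for the members have already been prepared; it illustrates the general point rather than the proposed method.

```python
from joblib import Parallel, delayed
from sklearn.tree import DecisionTreeClassifier

def fit_member(X, y, idx, seed):
    """Fit one base learner on its own subset; independent of the others."""
    return DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx])

def fit_bagging_in_parallel(X, y, subsets, n_jobs=-1):
    """Train all bagging members concurrently; a boosting loop could not be
    parallelized this way because each round depends on the previous one."""
    return Parallel(n_jobs=n_jobs)(
        delayed(fit_member)(X, y, idx, seed) for seed, idx in enumerate(subsets)
    )
```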

SFSHEL uses clustering-based stratification to undersample the majority samples and simultaneously adopts a sliding-window mechanism to generate diverse feature subsets, thus reducing the impact of the class imbalance distribution.

SFSHEL inspects the base learners and assigns their weights only once, after the training process. At the same time, the base learners can be trained in parallel, so the computational overhead is greatly reduced.

SFSHEL generates weights that help interpret the importance of different features in medical datasets, which is beneficial for exploring the important clinical features of patients to aid medical diagnosis.

Further, a strong balanced classifier is advised as the base learner, so that enough information can be learned during each iteration of SFSHEL. In this paper, the random forest [36], [37] is chosen as the base learner in consideration of its effectiveness and efficiency, yielding a new classifier abbreviated as SFSHEL-RF. Both standard imbalanced benchmarks and real-world clinical datasets are used to validate the performance of SFSHEL-RF. The experimental results show that the proposed SFSHEL-RF performs well on both the KEEL and the clinical heart failure datasets. The average performance of SFSHEL-RF on a part of the KEEL datasets reaches 91.37%; compared with our previous best method ECUBoost-RF, SFSHEL-RF is only 0.29% lower in average performance, a negligible gap, and compared with the other eleven methods it ranks first in average performance. On the clinical heart failure datasets, the performance of SFSHEL-RF stably reaches the top three on three indicators, and even in the cases where SFSHEL-RF does not enter the top three it is more stable than the other methods.
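For illustration only, the sketch below gives one possible reading of cluster-based stratified undersampling of the majority class in the spirit of CSRUS: the majority samples are clustered, and each cluster contributes samples in proportion to its size so the reduced majority set preserves its overall structure. The cluster count, the sampling quotas, and the function names are our assumptions, not the authors' exact algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_stratified_undersample(X, y, n_clusters=5, random_state=0):
    """Undersample the majority class (label 0) via per-cluster quotas."""
    rng = np.random.default_rng(random_state)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X[majority_idx])
    target = minority_idx.size  # shrink the majority to the minority size
    kept = []
    for c in range(n_clusters):
        members = majority_idx[labels == c]
        if members.size == 0:
            continue
        quota = max(1, round(target * members.size / majority_idx.size))
        kept.append(rng.choice(members, size=min(quota, members.size), replace=False))
    keep = np.concatenate([minority_idx, *kept])
    return X[keep], y[keep]
```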

The rest of the paper is organized as follows. Section 2 reviews the two data-level strategies for handling imbalanced problems. Section 3 describes the detailed construction of SFSHEL-RF and explains why the random forest is used in SFSHEL. Section 4 presents the experimental settings and plans. Section 5 reports the corresponding results. Finally, conclusions are given in Section 6.

Section snippets

Related works

In general, there are two strategies for handling imbalanced problems at the data level: the data sampling strategy, which re-samples at the sample level, and the feature selection strategy, which selects at the feature level. This section briefly reviews both strategies.

Sample and feature selection hybrid ensemble learning strategy

In this section, the Cluster-based Stratified Random Undersampling Strategy (CSRUS) and the Sliding Window Feature Selection Strategy (SWFSS) are first described in the first two parts. Then the detailed hybrid process that combines these two strategies into the proposed SFSHEL is described. Finally, the reasons why SFSHEL-RF adopts the random forest as the base learner are given and the novelty of SFSHEL-RF is shown.
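As a rough illustration of how sliding-window feature subsets, per-learner validation weights, and weighted voting could be combined with random forest base learners, consider the sketch below. The window size, the step, and the use of validation accuracy as the weight are assumptions made for illustration and may differ from the paper's exact SWFSS and weighting formulation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def sliding_window_subsets(n_features, window=5, step=2):
    """Generate overlapping feature-index windows, one per base learner."""
    return [np.arange(start, min(start + window, n_features))
            for start in range(0, n_features - 1, step)]

def fit_weighted_ensemble(X_tr, y_tr, X_val, y_val, window=5, step=2):
    """Fit one random forest per feature window; weight it once by its
    accuracy on a held-out validation set."""
    members = []
    for cols in sliding_window_subsets(X_tr.shape[1], window, step):
        rf = RandomForestClassifier(n_estimators=100, random_state=0)
        rf.fit(X_tr[:, cols], y_tr)
        weight = rf.score(X_val[:, cols], y_val)
        members.append((cols, rf, weight))
    return members

def predict_weighted(members, X):
    """Weighted soft vote over all members (binary labels 0/1 assumed)."""
    scores = sum(w * rf.predict_proba(X[:, cols])[:, 1] for cols, rf, w in members)
    total = sum(w for _, _, w in members)
    return (scores / total >= 0.5).astype(int)
```

In such a scheme, the per-window weights also indicate which feature windows contribute most to the final vote, which is the sense in which the weighting can support feature-importance interpretation.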

Problem domain

In the experiments, both standard and real-world imbalanced datasets are adopted; they are introduced in the next two parts. All datasets are partitioned through stratified 5-fold cross-validation [55]. All experiments are repeated ten times, and the average results under the optimal hyper-parameters are recorded.
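A minimal scikit-learn sketch of this evaluation protocol (stratified 5-fold cross-validation, repeated ten times, results averaged) is shown below; the synthetic toy dataset and the balanced-accuracy metric are placeholders rather than the paper's actual datasets and indicators.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy imbalanced data standing in for one benchmark dataset.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# Stratified 5-fold CV repeated 10 times: every split keeps the class ratio,
# and the reported figure is the average over all repetitions.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = []
for tr, te in cv.split(X, y):
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[tr], y[tr])
    scores.append(balanced_accuracy_score(y[te], clf.predict(X[te])))
print(f"mean balanced accuracy: {np.mean(scores):.4f}")
```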

Experimental results

In this section, the results of the experiments designed according to the targets given in Section 4.2 are reported and discussed in the following four parts.

Conclusion

In this paper, a new strategy for selecting suitable subsets of both samples and features during the training process of ensemble learning is designed and abbreviated as SFSHEL. Further, by adopting the random forest as the base learner in practice, a classifier named SFSHEL-RF is proposed to solve not only standard imbalanced but also real-world classification problems. SFSHEL-RF selects subsets of samples and features simultaneously and is thus efficient at addressing the

CRediT authorship contribution statement

Zhe Wang: Experiment, Writing – review & editing. Peng Jia: Methodology, Supervision. Xinlei Xu: Visualization. Bolu Wang: Investigation. Yujin Zhu: Proof reading. Dongdong Li: Proof reading.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by Shanghai Science and Technology Program “Distributed and generative few-shot algorithm and theory research” under Grant No. 20511100600, Shanghai Science and Technology Program “Federated based cross-domain and cross-task incremental learning” under Grant No. 21511100800, Natural Science Foundation of China under Grant No. 62076094, National Key Research and Development Project of Ministry of Science and Technology of China under Grant No. 2018AAA0101302, Natural

References (63)

  • S. Del Río et al., On the use of MapReduce for imbalanced big data using random forest, Inform. Sci. (2014)
  • Y. Qian et al., A resampling ensemble algorithm for classification of imbalance problems, Neurocomputing (2014)
  • L. Yin et al., Feature selection for high-dimensional imbalanced data, Neurocomputing (2013)
  • H. He et al., Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. (2009)
  • J. Lu et al., Cost-sensitive subspace learning for face recognition
  • R. Blaser et al., Random rotation ensembles, J. Mach. Learn. Res. (2016)
  • P. Yang et al., Ensemble-based wrapper methods for feature selection and class imbalance learning
  • S. Wang et al., Resampling-based ensemble methods for online class imbalance learning, IEEE Trans. Knowl. Data Eng. (2014)
  • N. García-Pedrajas et al., OligoIS: scalable instance selection for class-imbalanced data sets, IEEE Trans. Cybern. (2012)
  • A. Cano et al., Weighted data gravitation classification for standard and imbalanced data, IEEE Trans. Cybern. (2013)
  • N.V. Chawla et al., SMOTEBoost: improving prediction of the minority class in boosting
  • N.V. Chawla et al., SMOTE: synthetic minority over-sampling technique, J. Artificial Intelligence Res. (2002)
  • Z. Wang et al., Entropy and confidence-based undersampling boosting random forests for imbalanced problems, IEEE Trans. Neural Netw. Learn. Syst. (2020)
  • M. Kubat, S. Matwin, et al., Addressing the curse of imbalanced training sets: one-sided selection, in: ...
  • Y.-F. Li et al., Cost-sensitive semi-supervised support vector machine
  • Y. Zhang et al., Cost-sensitive face recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2009)
  • Y. Zhu et al., Boundary-eliminated pseudoinverse linear discriminant for imbalanced problems, IEEE Trans. Neural Netw. Learn. Syst. (2017)
  • Z. Zhu et al., Geometric structural ensemble learning for imbalanced problems, IEEE Trans. Cybern. (2018)
  • M.V. Joshi et al., Evaluating boosting algorithms to classify rare classes: comparison and improvements
  • W. Fan et al., AdaCost: misclassification cost-sensitive boosting
  • M. Galar et al., A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches, IEEE Trans. Syst. Man Cybern. C (Appl. Rev.) (2011)