Sample and feature selecting based ensemble learning for imbalanced problems
Introduction
The last decade has witnessed extensive research on and applications of imbalanced classification problems [1], including face recognition [2], feature rotation [3], feature selection [4], online learning [5], remote sensing [6], text categorization [7], e-mail filtering [8], and so on. A training dataset is called imbalanced if at least one of its classes is represented by significantly fewer instances than the others [9]. In practice, the class with fewer samples generally carries a higher misclassification risk and is thus called the positive class, while the remaining classes are negative. Imbalanced problems usually pose severe challenges for traditional methods, because the negative samples might keep the classifier from learning the correct data distribution [1], [10], [11]. The abbreviations used in this paper, together with their descriptions and full forms, are listed in Table 1.
As solutions, many dedicated strategies have been proposed, which can be categorized into three main groups. First, re-sampling or editing strategies try to re-balance the class proportions [12]. Frequently adopted re-sampling techniques include oversampling, such as random over-sampling [13] and the Synthetic Minority Over-sampling Technique (SMOTE) [14]; undersampling, such as random under-sampling [16], multi-objective optimization undersampling [15], Entropy and Confidence-based Undersampling Boosting (ECUBoost) [17], and one-sided selection [18]; and hybrid techniques [12]. Second, cost-sensitive learning emphasizes the positive samples during optimization by imposing heavier penalties on their misclassification [19], [20], [21]. Third, ensemble learning remains effective in imbalanced cases through improvements at several steps, including splitting the samples into different structures for training [22], [23], re-assigning the weights of base learners [24], introducing novel distribution-updating formulations [25], or optimizing the hyper-parameters during iterations [26]. Victor Henrique Alves et al. [27] studied different multi-objective optimization design approaches for ensemble learning.
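To make the oversampling idea concrete, below is a minimal SMOTE-style sketch. It illustrates only the interpolation principle of [14], not the original implementation; plain Euclidean distance, toy 2-D points, and the function name `smote_like` are assumptions for this example.

```python
import random

def smote_like(minority, k=2, n_new=4, seed=0):
    """Create synthetic minority samples by interpolating between a random
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2)]
new_points = smote_like(minority)  # 4 synthetic points inside the minority region
```

Each synthetic point lies on the segment between two real minority samples, so the minority region is densified without duplicating instances, which is why SMOTE tends to overfit less than random over-sampling.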
In practice, ensemble methods, whether boosting- or bagging-based, seem to be used more frequently than the other groups in dealing with imbalanced problems, which can be attributed to the following advantages. Firstly, ensemble learning can be conveniently combined with both re-sampling and cost-sensitive methods. For the former, re-sampling helps balance the class distribution in each iteration, so as to avoid the deterioration of the current base learner on imbalanced samples [28]. Moreover, training more than one learner can alleviate either the overfitting caused by oversampling or the structural loss due to undersampling. Typical combinations are SMOTEBoost [11] and RUSBoost [29]. For the latter, an ensemble method easily becomes cost-sensitive if it separately considers the weights or errors of samples from different classes and imposes heavier penalties on the positive class [24], [30]; representative methods include MetaCost [31] and AdaCost [25]. Secondly, ensemble learning can make significant improvements by adopting a new aggregation strategy in the testing process, or by introducing a special validation step in each iteration [26], [32]. Lastly, an ensemble method can be integrated into another to generate a hybrid one. For instance, when AdaBoost is wrapped into an undersampling-based bagging strategy, a new method called EasyEnsemble emerges [33]. Existing works reveal that such combinations can achieve quite effective and efficient results. For instance, a hybrid method named Asymmetric Bagging and Random Subspace for Support Vector Machines (ABRS-SVM), which performs bagging twice, once over samples and once over features, has demonstrated high performance on imbalanced datasets [34].
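The undersampling-plus-bagging combination can be illustrated with an EasyEnsemble-style sketch that only prepares the balanced bags; the per-bag base learners (AdaBoost in EasyEnsemble [33]) are omitted, and labels 1/0 for positive/negative are assumed.

```python
import random

def balanced_bags(y, n_bags=3, seed=0):
    """Draw several independent under-samples of the negative (majority)
    class, each paired with the full positive (minority) set, so that
    every bag is class-balanced; one base learner is trained per bag."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    bags = []
    for _ in range(n_bags):
        sampled_neg = rng.sample(neg, len(pos))  # undersample the majority
        bags.append(sorted(pos + sampled_neg))
    return bags

y = [0] * 8 + [1] * 2          # 8 negatives, 2 positives
bags = balanced_bags(y)        # 3 bags, each with 2 positives and 2 negatives
```

Because every bag keeps all positives but only a fresh random slice of the negatives, the structural loss of a single undersampling run is averaged out across the ensemble.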
Moreover, a cascade method named cascade interpolation learning with double subspaces and confidence disturbance integrates base classifiers through double subspaces and a random under-sampling strategy [23], which weakens the interference caused by imbalanced data and enhances the generalization ability of the model. In summary, all the mentioned variants of ensemble methods have proven effective on different imbalanced tasks [3], [4], [5], [6], [21].
However, a dilemma exists among the mentioned ensemble ideas. Boosting-based methods commonly achieve higher accuracy, but during training the distribution of the data used in the current iteration is adjusted according to the results of the previous one, so these methods cannot run in parallel, resulting in expensive time costs. Bagging-based methods are simpler to implement and seem to generalize better [35], but their performance is unstable. Above all, the existing ensemble methods do not always achieve superior performance on imbalanced problems. To overcome these drawbacks, one modified bagging model is embedded into another to replace the traditional bootstrap process with different ideas, yielding a new hybrid ensemble strategy named Sample and Feature Selection Hybrid Ensemble Learning (SFSHEL). The main contributions of SFSHEL can be highlighted as follows.
SFSHEL adopts clustering-based stratification to undersample the majority class, and simultaneously uses a sliding-window mechanism to generate diverse feature subsets, thus reducing the impact of the imbalanced class distribution.
SFSHEL inspects the base learners and assigns their weights only once, after the training process, and the base learners can be trained in parallel. Thus, the computational overhead is greatly reduced.
SFSHEL generates weights that help interpret the importance of different features in medical datasets, which is beneficial for exploring the important clinical features of patients to aid medical diagnosis.
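The two sampling ideas behind these contributions can be roughly sketched as follows; the cluster assignment is assumed to come from any off-the-shelf clustering algorithm, and windows over consecutive feature indices are assumed for the sliding-window step (both functions are simplified stand-ins for CSRUS and SWFSS, not the paper's implementation).

```python
import random

def cluster_stratified_undersample(neg_idx, clusters, n_keep, seed=0):
    """Undersample the majority class proportionally from each cluster,
    so the kept subset preserves the majority's internal structure."""
    rng = random.Random(seed)
    groups = {}
    for i in neg_idx:
        groups.setdefault(clusters[i], []).append(i)
    kept = []
    for members in groups.values():
        # each cluster contributes in proportion to its size (at least one)
        share = max(1, round(n_keep * len(members) / len(neg_idx)))
        kept += rng.sample(members, min(share, len(members)))
    return kept

def sliding_windows(n_features, width, step):
    """Feature subsets as windows of `width` consecutive feature
    indices, sliding by `step`; each window feeds one base learner."""
    return [list(range(s, s + width))
            for s in range(0, n_features - width + 1, step)]

kept = cluster_stratified_undersample(list(range(8)), {i: i % 2 for i in range(8)}, 4)
windows = sliding_windows(6, 3, 1)   # [[0,1,2], [1,2,3], [2,3,4], [3,4,5]]
```

Together, the two routines yield one balanced sample subset and several overlapping feature subsets per base learner, which is the source of diversity in the ensemble.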
Further, a strong balanced classifier is advised as the base learner, so that enough information is learned during each iteration of SFSHEL. In this paper, the random forest [36], [37] serves as the base learner in consideration of its effectiveness and efficiency; the resulting classifier is abbreviated as SFSHEL-RF. Both benchmark imbalanced and real-world clinical datasets are used to validate the performance of SFSHEL-RF. The experimental results show that SFSHEL-RF performs well on both the KEEL and clinical heart failure datasets. Its average performance on a subset of the KEEL datasets reaches 91.37%, only 0.29% below our previous best method, ECUBoost-RF, a gap small enough to be ignored. Compared with the other eleven methods, the proposed method ranks first in average performance. On the clinical heart failure datasets, SFSHEL-RF stably reaches the top three on three indicators, and even in the cases where it does not enter the top three, it is more stable than the other methods.
The rest of the paper is organized as follows. Section 2 reviews the two data-level strategies for handling imbalanced problems. Section 3 describes the detailed construction of SFSHEL-RF and justifies the use of the random forest in SFSHEL. Section 4 presents the experimental settings and plans. Section 5 reports the corresponding results. Finally, conclusions are given in Section 6.
Related works
In general, there are two data-level strategies for handling imbalanced problems: the data sampling strategy, which re-samples at the sample level, and the feature selection strategy, which selects at the feature level. This section briefly reviews both.
Sample and feature selection hybrid ensemble learning strategy
In this section, the Cluster-based Stratified Random Undersampling Strategy (CSRUS) and the Sliding Window Feature Selection Strategy (SWFSS) are described in the first two parts, respectively. Then, the detailed process of hybridizing the two strategies into the proposed SFSHEL is described. Finally, the reasons why SFSHEL-RF adopts the random forest as its base learner are given and the novelty of SFSHEL-RF is shown.
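The final aggregation step can be illustrated with a minimal weighted-voting sketch; the weights are assumed to come from the one-off validation of each base learner, and the tie-breaking rule is an assumption of this example, not necessarily SFSHEL's.

```python
def weighted_vote(predictions, weights):
    """Combine binary base-learner outputs (1 = positive, 0 = negative)
    by a validation-weighted signed sum; ties go to the positive class."""
    score = sum(w * (1 if p == 1 else -1) for p, w in zip(predictions, weights))
    return 1 if score >= 0 else 0

# three base learners vote 1, 0, 1 with validation weights 0.5, 0.2, 0.3
label = weighted_vote([1, 0, 1], [0.5, 0.2, 0.3])   # -> 1
```

Because the weights are fixed once after training, prediction requires no further coordination between base learners, which is what allows them to be trained and queried in parallel.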
Problem domain
In the experiments, both benchmark and real-world imbalanced datasets are adopted, which are introduced in the next two parts. All datasets are partitioned through stratified 5-fold cross-validation [55]. All experiments are repeated ten times, and the average results under the optimal hyper-parameters are recorded.
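The stratified split can be sketched as follows, assuming a shuffle-then-round-robin assignment within each class (in practice a library routine such as scikit-learn's StratifiedKFold would be used):

```python
import random
from collections import defaultdict

def stratified_kfold(y, k=5, seed=0):
    """Shuffle the indices of each class and deal them round-robin,
    so every fold preserves the overall class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    for idx in by_class.values():
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

y = [0] * 20 + [1] * 5        # 4:1 imbalance
folds = stratified_kfold(y)   # 5 folds, each with 4 negatives and 1 positive
```

Stratification matters especially for imbalanced data: a plain random split could leave a fold with no positive samples at all, making the evaluation metrics undefined.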
Experimental results
In this section, the results of the experiments designed according to the targets given in Section 4.2 are reported and discussed in the following four subsections.
Conclusion
In this paper, a new strategy for selecting suitable subsets of both samples and features during the training process of ensemble learning is designed and abbreviated as SFSHEL. Further, when adopting the random forest as the base learner in practice, a classifier named SFSHEL-RF is proposed to solve not only standard imbalanced but also real-world classification problems. SFSHEL-RF selects subsets of samples and features simultaneously and is thus efficient in addressing the
CRediT authorship contribution statement
Zhe Wang: Experiment, Writing – review & editing. Peng Jia: Methodology, Supervision. Xinlei Xu: Visualization. Bolu Wang: Investigation. Yujin Zhu: Proofreading. Dongdong Li: Proofreading.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by Shanghai Science and Technology Program “Distributed and generative few-shot algorithm and theory research” under Grant No. 20511100600, Shanghai Science and Technology Program “Federated based cross-domain and cross-task incremental learning” under Grant No. 21511100800, Natural Science Foundation of China under Grant No. 62076094, National Key Research and Development Project of Ministry of Science and Technology of China under Grant No. 2018AAA0101302, Natural
References (63)
- Exploring issues of training data imbalance and mislabelling on random forest performance for large area land cover classification using the ensemble margin, ISPRS J. Photogramm. Remote Sens. (2015)
- Forestexter: An efficient random forest algorithm for imbalanced text categorization, Knowl.-Based Syst. (2014)
- Class imbalance learning via a fuzzy total margin based support vector machine, Appl. Soft Comput. (2015)
- A method for resampling imbalanced datasets in binary classification tasks for real-world problems, Neurocomputing (2014)
- On the effectiveness of preprocessing methods when dealing with different levels of class imbalance, Knowl.-Based Syst. (2012)
- Improvement of bagging performance for classification of imbalanced datasets using evolutionary multi-objective optimization, Eng. Appl. Artif. Intell. (2020)
- Cluster-based under-sampling approaches for imbalanced data distributions, Expert Syst. Appl. (2009)
- Cascade interpolation learning with double subspaces and confidence disturbance for imbalanced problems, Neural Netw. (2019)
- Tree-based space partition and merging ensemble learning framework for imbalanced problems, Inform. Sci. (2019)
- Cost-sensitive boosting for classification of imbalanced data, Pattern Recognit. (2007)
- On the use of mapreduce for imbalanced big data using random forest, Inform. Sci.
- A resampling ensemble algorithm for classification of imbalance problems, Neurocomputing
- Feature selection for high-dimensional imbalanced data, Neurocomputing
- Learning from imbalanced data, IEEE Trans. Knowl. Data Eng.
- Cost-sensitive subspace learning for face recognition
- Random rotation ensembles, J. Mach. Learn. Res.
- Ensemble-based wrapper methods for feature selection and class imbalance learning
- Resampling-based ensemble methods for online class imbalance learning, IEEE Trans. Knowl. Data Eng.
- Oligois: scalable instance selection for class-imbalanced data sets, IEEE Trans. Cybern.
- Weighted data gravitation classification for standard and imbalanced data, IEEE Trans. Cybern.
- Smoteboost: Improving prediction of the minority class in boosting
- SMOTE: synthetic minority over-sampling technique, J. Artificial Intelligence Res.
- Entropy and confidence-based undersampling boosting random forests for imbalanced problems, IEEE Trans. Neural Netw. Learn. Syst.
- Cost-sensitive semi-supervised support vector machine
- Cost-sensitive face recognition, IEEE Trans. Pattern Anal. Mach. Intell.
- Boundary-eliminated pseudoinverse linear discriminant for imbalanced problems, IEEE Trans. Neural Netw. Learn. Syst.
- Geometric structural ensemble learning for imbalanced problems, IEEE Trans. Cybern.
- Evaluating boosting algorithms to classify rare classes: Comparison and improvements
- Adacost: Misclassification cost-sensitive boosting
- A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches, IEEE Trans. Syst. Man Cybern. C (Appl. Rev.)