On the class overlap problem in imbalanced data classification
Introduction
Learning from datasets with skewed class distributions remains a challenge in machine learning. Such datasets are known as imbalanced datasets and are widely seen in many applications, for example, anomaly detection [1], medical prediction [2], [3], [4], object recognition [5], [6] and business management [7]. In these domains, the minority class is usually the class of interest and has a higher misclassification cost than the majority class. Standard learning algorithms generally build classification models that maximise accuracy, which often leads to classification biased towards the majority class and misclassification of minority class instances [8], [9]. However, such failure in classification of imbalanced datasets is not always caused solely by class imbalance. In fact, a linearly separable dataset can be perfectly classified by a typical classification algorithm no matter how skewed the class distribution is [10]. Conversely, when class overlap is present, even a balanced dataset can be difficult for a learning task.
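The contrast between separability and overlap can be illustrated with a minimal sketch. The one-dimensional toy data and the helper best_threshold_errors below are our own assumptions for illustration, not part of the experiment reported in this paper. The helper searches for the single threshold that gives the fewest training errors, standing in for a simple linear classifier.

```python
def best_threshold_errors(negatives, positives):
    """Fewest misclassifications achievable by the rule
    'predict positive if x > t', over all candidate thresholds t."""
    points = sorted(set(negatives) | set(positives))
    # Also try a threshold below every point (everything predicted positive).
    candidates = [points[0] - 1.0] + points
    best = None
    for t in candidates:
        false_pos = sum(1 for x in negatives if x > t)
        false_neg = sum(1 for x in positives if x <= t)
        errors = false_pos + false_neg
        if best is None or errors < best:
            best = errors
    return best

# Hypothetical data: highly imbalanced (100:2) but linearly separable.
separable_neg = [i / 10 for i in range(100)]   # 0.0 ... 9.9
separable_pos = [20.0, 21.0]

# Hypothetical data: perfectly balanced (5:5) but overlapping.
overlap_neg = [1.0, 2.0, 3.0, 4.0, 5.0]
overlap_pos = [3.5, 4.5, 5.5, 6.5, 7.5]

print(best_threshold_errors(separable_neg, separable_pos))  # 0
print(best_threshold_errors(overlap_neg, overlap_pos))      # 2
```

In this toy setting the skewed but separable data is classified without error, whereas the balanced but overlapping data cannot be separated by any threshold.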
When dealing with classification of imbalanced data, rebalancing class distribution is among the most common approaches that researchers consider. Many traditional and recent resampling methods only aim at getting a more balanced version of the training data and do not factor in the problem of class overlap [11], [12], [13]. Some methods deal with instances in the overlapping region, especially those near the borderline areas; however, their resampling rates are controlled by the degree of class imbalance [14], [15]. Thus, in some scenarios, results can be highly influenced by class imbalance rather than class overlap. For instance, when a dataset suffers from high class overlap but its classes are slightly imbalanced, insufficient resampling may result in class overlap not being properly addressed. On the other hand, with low class overlap and high class imbalance, excessive resampling may occur.
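To make the dependence on class imbalance concrete, the following is a minimal sketch of plain random oversampling. It is a generic illustration under our own assumptions (the function name random_oversample and the toy data are hypothetical) and is not the procedure of any method cited above. The number of added instances is determined by the class counts alone, so the degree of class overlap plays no role.

```python
import random

def random_oversample(majority, minority, seed=0):
    """Duplicate randomly chosen minority instances until both classes
    have the same size. Generic illustration, not a cited method."""
    rng = random.Random(seed)
    # The amount of resampling is fixed by the class counts alone.
    n_extra = max(len(majority) - len(minority), 0)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(majority), list(minority) + extra

# Hypothetical case A: slight imbalance (10:8) with heavy overlap.
maj_a = [float(i) for i in range(10)]
min_a = [float(i) + 0.5 for i in range(8)]

# Hypothetical case B: severe imbalance (100:5) with no overlap.
maj_b = [float(i) for i in range(100)]
min_b = [1000.0 + i for i in range(5)]

_, resampled_a = random_oversample(maj_a, min_a)
_, resampled_b = random_oversample(maj_b, min_b)
print(len(resampled_a) - len(min_a))  # 2 instances added
print(len(resampled_b) - len(min_b))  # 95 instances added
```

Case A receives almost no resampling despite its heavy overlap, while case B receives extensive resampling despite being trivially separable, which mirrors the two scenarios described above.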
The impacts of class imbalance, class overlap and other characteristics such as small disjuncts and dataset size have been investigated [16], [17], [18], [19]. Among the potential factors, including class imbalance, class overlap frequently shows the highest negative influence [9], [18]. This raises some important questions in handling classification of imbalanced datasets: (1) Are the solutions that mainly aim to rebalance the class distribution sufficiently effective? (2) Should the problem of class overlap be the main concern in developing new algorithms?
Although several reviews on the problem of imbalanced data in classification exist [9], [20], [21], the problem of class overlap in imbalanced data was not emphasised as the main issue and the discussions often lacked sufficient experimental evidence. Das et al. [9] proposed that the two key challenges for standard learning algorithms are class imbalance and class overlap. They suggested the likely nature of learning outcomes in different scenarios of class imbalance and class overlap, depending on the dataset size; however, no experimental evidence was given. The authors also investigated other data irregularities such as small disjuncts and missing features; thus, the discussion on the class overlap problem was limited. In [20], only a brief description of other studies on the effect of class overlap in relation to class imbalance was given. The authors paid particular attention to the discussion of different techniques used in existing methods for handling imbalanced classification. Stefanowski [21] encouraged the research community to develop new algorithms for imbalanced data that take data factors into account, including overlapping between classes. The author presented analyses of the characteristics of the minority class, which was divided into sub-regions of safe, borderline, rare and outlier samples. This was studied along with the behaviours of different learning algorithms; however, this cannot yet be mathematically verified on real-world datasets. As in many other reviews [22], [23], Kaur et al. [24] conducted a comparative analysis of methods, which was mainly organised into data preprocessing and algorithmic approaches, and the problem of class overlap was barely discussed. Some other reviews focused on the issue of imbalanced data classification in specific contexts such as big data [25], [26], multi-class problems [25], [27] and neural networks [28], [29]. These reviews clearly show that there is still a gap in the study of class overlap in the context of class imbalance.
In this paper, the importance of handling class overlap in imbalanced data classification is investigated. This was carried out through an extensive experiment and a critical review of solutions to imbalanced learning. The experiment provides an objective measurement of the impact of class overlap versus the impact of class imbalance. Unlike previous studies [16], [17], [18], [19], which were based on limited ranges of class imbalance and class overlap degrees, we carried out a full-scale experiment using over 1,000 synthetic datasets. The in-depth review of existing solutions to classification of imbalanced datasets is presented from an alternative perspective, rather than at the data and algorithm levels commonly used to organise other review papers [8], [9], [20], [30], [31]. We considered the main objective of the solutions and categorised them into class distribution-based and class overlap-based approaches so that the two can be better compared and contrasted. Class distribution-based methods are mainly concerned with, and aim to suppress, the problem of imbalanced class distribution. Class overlap-based methods focus on improving the visibility of instances, especially positive instances, in the overlapping region. In addition, recent and emerging methods that do not specifically deal with the class imbalance or class overlap problems are also discussed. These include, for example, the use of one of the latest techniques in machine learning, Generative Adversarial Networks (GANs) [32], [33].
The main contributions of this review are listed below.
1. A technical discussion with advantages and disadvantages of evaluation metrics including how some of them can be misleading in certain imbalanced contexts
2. An extensive experiment illustrating the scales of impact of class overlap and class imbalance on imbalanced dataset classification
3. A critical discussion of methods and literature selected from leading peer-reviewed publications in the perspective of class overlap-based and class distribution-based approaches, as well as recent emerging technologies
4. An overview of benchmarking methods in the literature showing commonly-used ones that can be considered as good standards, but at the same time suggesting a need for comparing against recent and state-of-the-art methods for more convincing and reliable evaluation
The remainder of this paper is organised as follows. In Section 2, we give the definitions of class imbalance and class overlap. Section 3 contains an in-depth discussion of evaluation metrics used in imbalanced learning. Section 4 provides the experimental results and discussion on the effects of class imbalance and class overlap on the learner’s performance in an extensive range of scenarios. In Section 5, we critically review existing approaches for handling classification of imbalanced datasets. Finally, the conclusion is delivered in Section 6.
Section snippets
Class imbalance
An imbalanced dataset is a dataset with an unequal distribution of classes. This is depicted in Fig. 1, where majority and minority class instances are represented by circles and triangles, respectively. In machine learning, class imbalance becomes an issue when the minority class is significantly smaller in size and is the primary class of interest with a relatively high misclassification cost. Thus, in a binary-class problem, the minority class is also referred to as the positive class whereas
Evaluation metrics
Some typical evaluation metrics for classification are not affected by skewed class distributions while others can be misleading with biases towards the majority class. Common metrics for classification of imbalanced datasets such as sensitivity, specificity, balanced accuracy, G-mean, AUC and F1-score will be discussed in detail. For other assessment measures, the reader may refer to [42], [43], [44], [45].
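As a brief illustration of how a metric can mislead under class imbalance, the sketch below computes several of the measures named above from confusion-matrix counts using their standard definitions. The function name imbalance_metrics and the example counts are our own assumptions. AUC is omitted because it requires ranked scores rather than a single confusion matrix.

```python
import math

def imbalance_metrics(tp, fn, tn, fp):
    """Standard metrics from binary confusion-matrix counts,
    with the minority class treated as the positive class."""
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "g_mean": math.sqrt(sensitivity * specificity),
        "f1": 2 * precision * sensitivity / denom if denom else 0.0,
    }

# Hypothetical example: 990 negatives, 10 positives, and a classifier
# that labels every instance as negative.
scores = imbalance_metrics(tp=0, fn=10, tn=990, fp=0)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

In this example accuracy is 0.99 even though no minority instance is detected, whereas sensitivity, G-mean and F1-score are all 0 and balanced accuracy is 0.5.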
In imbalanced problems, accurate detection of minority class instances is crucial. This
Impacts of class overlap vs class imbalance
When handling classification of imbalanced data, rebalancing the class distribution is often an approach that researchers take. However, it should also be realised that class overlap is another common issue in classification tasks, which becomes more serious when it occurs in an imbalanced context. Many traditional and recent resampling methods for handling imbalanced datasets only aim at making the class distribution balanced and do not factor in the problem of class overlap [11], [12], [13].
Existing solutions
Existing literature often discussed solutions to imbalanced datasets as data-level and algorithm-level methods [64], [65], [66]. Oversampling and undersampling are among the most common data-level techniques. At the algorithm level, new learning algorithms and modifications of standard learning algorithms are developed. Algorithm-level methods have an advantage of incorporating user’s requirements into the model [20]. However, as opposed to data resampling methods, they do not allow flexible
Conclusion
In this paper, we provided a comprehensive review on the impact of class overlap in classification of imbalanced datasets. This was presented through an extensive experiment, an in-depth discussion on existing solutions, a technical discussion on evaluation metrics, and an overview of benchmarking methods. The experiment was carried out at the full scale of class overlap and extreme degrees of class imbalance. Results showed that classification errors increased with the degree of class overlap
CRediT authorship contribution statement
Pattaramon Vuttipittayamongkol: Conceptualization, Methodology, Resources, Software, Formal analysis, Investigation, Writing - original draft, Writing - review & editing, Visualization. Eyad Elyan: Conceptualization, Validation, Resources, Writing - review & editing, Visualization, Supervision. Andrei Petrovski: Proofreading, Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (144)
- et al., Evolutionary undersampling boosting for imbalanced classification of breast cancer malignancy, Appl. Soft Comput. (2016)
- et al., An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics, Inform. Sci. (2013)
- et al., Clustering-based undersampling in class-imbalanced data, Inform. Sci. (2017)
- et al., Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets, Expert Syst. Appl. (2016)
- et al., Learning from class-imbalanced data: Review of methods and applications, Expert Syst. Appl. (2017)
- et al., An empirical comparison on state-of-the-art multi-class imbalance learning algorithms and a new diversified ensemble learning scheme, Knowl.-Based Syst. (2018)
- et al., A systematic study of the class imbalance problem in convolutional neural networks, Neural Netw. (2018)
- et al., Coupling different methods for overcoming the class imbalance problem, Neurocomputing (2015)
- et al., Effective data generation for imbalanced learning using conditional generative adversarial networks, Expert Syst. Appl. (2018)
- et al., MFC-GAN: Class-imbalanced dataset classification using Multiple Fake Class Generative Adversarial Network, Neurocomputing (2019)
- Neighbourhood-based undersampling approach for handling imbalanced and overlapped data, Inform. Sci.
- A priori synthetic over-sampling methods for increasing classification sensitivity in imbalanced data sets, Expert Syst. Appl.
- Redundancy-driven modified Tomek-link based undersampling: A solution to class imbalance, Pattern Recognit. Lett.
- A simple plug-in bagging ensemble based on threshold-moving for classifying binary and multiclass imbalanced data, Neurocomputing
- Comparing classifiers when the misallocation costs are uncertain, Pattern Recognit.
- Cluster-based under-sampling approaches for imbalanced data distributions, Expert Syst. Appl.
- Fast-CBUS: A fast clustering-based undersampling method for addressing the class imbalance problem, Neurocomputing
- Boosting the performance of over-sampling algorithms through under-sampling the minority class, Neurocomputing
- SMOTE based class-specific extreme learning machine for imbalanced learning, Knowl.-Based Syst.
- Underbagging based reduced kernelized weighted extreme learning machine for class imbalance learning, Eng. Appl. Artif. Intell.
- Class imbalance learning using UnderBagging based kernelized extreme learning machine, Neurocomputing
- A novel ensemble method for classifying imbalanced data, Pattern Recognit.
- Inverse random under sampling for class imbalance problem and its application to multi-label classification, Pattern Recognit.
- NI-MWMOTE: An improving noise-immunity majority weighted minority oversampling technique for imbalanced classification problems, Expert Syst. Appl.
- LR-SMOTE–An improved unbalanced data set oversampling based on K-means and SVM, Knowl.-Based Syst.
- SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering, Inform. Sci.
- Affinity and class probability-based fuzzy support vector machine for imbalanced data sets, Neural Netw.
- Entropy-based fuzzy support vector machine for imbalanced datasets, Knowl.-Based Syst.
- A new sampling method for classifying imbalanced data based on support vector machine ensemble, Neurocomputing
- Evolutionary inversion of class distribution in overlapping areas for multi-class imbalanced learning, Inform. Sci.
- Improving classification rate constrained to imbalanced data between overlapped and non-overlapped regions by hybrid algorithms, Neurocomputing
- Classifying imbalanced data sets using similarity based hierarchical decomposition, Pattern Recognit.
- Parallel selective sampling method for imbalanced and large data classification, Pattern Recognit. Lett.
- Random balance: ensembles of variable priors classifiers for imbalanced data, Knowl.-Based Syst.
- EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling, Pattern Recognit.
- Evolutionary-based selection of generalized instances for imbalanced classification, Knowl.-Based Syst.
- EPRENNID: An evolutionary prototype reduction based ensemble for nearest neighbor classification of imbalanced data, Neurocomputing
- Anomaly detection: A survey, ACM Comput. Surv.
- Overlap-based undersampling method for classification of imbalanced medical datasets
- Improved overlap-based undersampling for imbalanced dataset classification with application to epilepsy and Parkinson’s disease, Int. J. Neural Syst.
- Transfer boosting with synthetic instances for class imbalanced object recognition, IEEE Trans. Cybern.
- Deep learning for symbols detection and classification in engineering drawings, Neural Netw.
- An ensemble random forest algorithm for insurance big data analysis, IEEE Access
- Handling data irregularities in classification: Foundations, trends, and future challenges, Pattern Recognit.
- Balancing strategies and class overlapping
- SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res.
- Improving imbalanced learning through a heuristic oversampling method based on K-means and SMOTE, Inform. Sci.
- Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning
- The class imbalance problem: A systematic study, Intell. Data Anal.
- On the k-NN performance in a challenging scenario of imbalance and overlapping, Pattern Anal. Appl.