On the class overlap problem in imbalanced data classification

doi:10.1016/j.knosys.2020.106631

Knowledge-Based Systems

Volume 212, 5 January 2021, 106631

https://doi.org/10.1016/j.knosys.2020.106631

Abstract

Class imbalance is an active research area in the machine learning community. However, existing and recent literature has shown that class overlap has a higher negative impact on the performance of learning algorithms. This paper provides a detailed critical discussion and objective evaluation of class overlap in the context of imbalanced data and its impact on classification accuracy. First, we present a thorough experimental comparison of class overlap and class imbalance. Unlike previous work, our experiment was carried out on the full scale of class overlap and an extreme range of class imbalance degrees. Second, we provide an in-depth critical technical review of existing approaches to handling imbalanced datasets. Existing solutions from selected literature are critically reviewed and categorised as class distribution-based and class overlap-based methods. Emerging techniques and the latest developments in this area are also discussed in detail. Experimental results in this paper are consistent with existing literature and show clearly that the performance of the learning algorithm deteriorates across varying degrees of class overlap, whereas class imbalance does not always have an effect. The review emphasises the need for further research towards handling class overlap in imbalanced datasets to effectively improve learning algorithms’ performance.

Introduction

Learning from datasets with skewed class distributions remains a challenge in machine learning. Such datasets are known as imbalanced datasets and are widely seen in many applications, for example, anomaly detection [1], medical prediction [2], [3], [4], object recognition [5], [6] and business management [7]. In these domains, the minority class is usually the class of interest and has a higher misclassification cost than the majority class. Standard learning algorithms generally build classification models that maximise overall accuracy, which often leads to classification biased towards the majority class and misclassification of minority class instances [8], [9]. However, such failure in classification of imbalanced datasets is not always caused by class imbalance alone. In fact, a linearly separable dataset can be perfectly classified by a typical classification algorithm no matter how skewed the class distribution is [10]. Conversely, when class overlap is present, even a balanced dataset can be difficult to learn from.
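The contrast between a separable but skewed dataset and a balanced but overlapping one can be illustrated with a small sketch. This is not the experiment reported in this paper: the two-Gaussian generator, the single-feature threshold rule, and the names `make_two_class` and `fit_max_accuracy_threshold` are assumptions chosen only for illustration.

```python
import numpy as np


def make_two_class(n_major, n_minor, gap, rng):
    """Two 2-D unit-variance Gaussian classes whose means lie `gap` apart on the x-axis.

    Label 0 is the majority class, label 1 the minority (positive) class.
    """
    major = rng.normal(loc=(0.0, 0.0), scale=1.0, size=(n_major, 2))
    minor = rng.normal(loc=(gap, 0.0), scale=1.0, size=(n_minor, 2))
    X = np.vstack([major, minor])
    y = np.concatenate([np.zeros(n_major, dtype=int), np.ones(n_minor, dtype=int)])
    return X, y


def fit_max_accuracy_threshold(X, y):
    """Choose the x-axis cut-off that maximises plain accuracy on (X, y).

    Returns (accuracy, minority_recall) of the rule "predict minority if x >= cut-off".
    """
    x = X[:, 0]
    # Candidate cut-offs: every observed x value, plus +inf ("always predict majority").
    candidates = np.append(np.sort(x), np.inf)
    preds = x[None, :] >= candidates[:, None]  # shape: (n_candidates, n_samples)
    accuracies = (preds == (y == 1)[None, :]).mean(axis=1)
    best = int(np.argmax(accuracies))
    recall = preds[best][y == 1].mean()
    return float(accuracies[best]), float(recall)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scenarios = {
        "separable, 100:1 imbalance": make_two_class(1000, 10, 12.0, rng),
        "overlapping, balanced": make_two_class(500, 500, 1.0, rng),
        "overlapping, 50:1 imbalance": make_two_class(1000, 20, 1.0, rng),
    }
    for name, (X, y) in scenarios.items():
        acc, rec = fit_max_accuracy_threshold(X, y)
        print(f"{name}: accuracy={acc:.3f}, minority recall={rec:.3f}")
```

Under these assumptions, the widely separated classes are split without error despite the skew, while the overlapping cases lose minority recall; an accuracy-maximising rule is used on purpose, since it mirrors the bias of standard learners described above.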

When dealing with classification of imbalanced data, rebalancing class distribution is among the most common approaches that researchers consider. Many traditional and recent resampling methods only aim at getting a more balanced version of the training data and do not factor in the problem of class overlap [11], [12], [13]. Some methods deal with instances in the overlapping region, especially those near the borderline areas; however, their resampling rates are controlled by the degree of class imbalance [14], [15]. Thus, in some scenarios, results can be highly influenced by class imbalance rather than class overlap. For instance, when a dataset suffers from high class overlap but its classes are slightly imbalanced, insufficient resampling may result in class overlap not being properly addressed. On the other hand, with low class overlap and high class imbalance, excessive resampling may occur.
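To make the mismatch concrete, the sketch below compares the amount of resampling that a balance-only method would apply with a rough indicator of class overlap. Both helpers are hypothetical: `removals_to_balance` simply counts the majority instances that must be dropped to reach a 1:1 ratio, and `minority_overlap_fraction` is an assumed k-nearest-neighbour proxy for overlap, not a measure defined in this paper.

```python
import numpy as np


def removals_to_balance(y):
    """Majority instances (label 0) a balance-only undersampler drops to reach 1:1."""
    n_major = int(np.sum(y == 0))
    n_minor = int(np.sum(y == 1))
    return max(n_major - n_minor, 0)


def minority_overlap_fraction(X, y, k=5):
    """Fraction of minority instances (label 1) with a majority instance among
    their k nearest neighbours. An assumed proxy for overlap, for illustration only."""
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]
    has_major_neighbour = (y[nearest] == 0).any(axis=1)
    return float(has_major_neighbour[y == 1].mean())


def gaussian_pair(n_major, n_minor, gap, rng):
    """Two 2-D unit-variance Gaussian classes with means `gap` apart (toy data)."""
    X = np.vstack([
        rng.normal(loc=(0.0, 0.0), scale=1.0, size=(n_major, 2)),
        rng.normal(loc=(gap, 0.0), scale=1.0, size=(n_minor, 2)),
    ])
    y = np.concatenate([np.zeros(n_major, dtype=int), np.ones(n_minor, dtype=int)])
    return X, y


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    cases = {
        "high overlap, slight imbalance": gaussian_pair(300, 200, 1.0, rng),
        "low overlap, high imbalance": gaussian_pair(1000, 20, 20.0, rng),
    }
    for name, (X, y) in cases.items():
        print(name,
              "| removals to balance:", removals_to_balance(y),
              "| minority overlap fraction:", round(minority_overlap_fraction(X, y), 3))
```

In this toy setting the heavily overlapped dataset receives little resampling and the well-separated one receives a great deal, which is the situation the paragraph above warns about.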

The impacts of class imbalance, class overlap and other characteristics such as small disjuncts and dataset size have been investigated [16], [17], [18], [19]. Class overlap frequently shows the highest negative influence among potential factors, including class imbalance [9], [18]. This raises some important questions in handling classification of imbalanced datasets: (1) Are the solutions that mainly aim to rebalance the class distribution sufficiently effective? (2) Should the problem of class overlap be the main concern in developing new algorithms?

Although several reviews on the problem of imbalanced data in classification exist [9], [20], [21], they did not emphasise class overlap in imbalanced data as the main issue, and their discussions often lacked sufficient experimental evidence. Das et al. [9] proposed that the two key challenges for standard learning algorithms are class imbalance and class overlap. They suggested the likely nature of learning outcomes in different scenarios of class imbalance and class overlap, based on the dataset size; however, no experimental evidence was given. The authors also investigated other data irregularities such as small disjuncts and missing features; thus, the discussion of the class overlap problem was limited. In [20], only a brief description of other studies on the effect of class overlap in relation to class imbalance was given; the authors paid particular attention to the different techniques used in existing methods for handling imbalanced classification. Stefanowski [21] encouraged the research community to develop new algorithms for imbalanced data that take data factors into account, including overlapping between classes. The author presented analyses of the characteristics of the minority class, which was divided into sub-regions of safe, borderline, rare and outlier samples. This was studied along with the behaviours of different learning algorithms; however, it cannot yet be mathematically verified on real-world datasets. As in many other reviews [22], [23], Kaur et al. [24] conducted a comparative analysis of methods, organised mainly into data preprocessing and algorithmic approaches, and barely discussed the problem of class overlap. Some other reviews focused on imbalanced data classification in specific contexts such as big data [25], [26], multi-class problems [25], [27] and neural networks [28], [29]. Together, these reviews show that there is still a gap in the study of class overlap in the context of class imbalance.

In this paper, the importance of handling class overlap in imbalanced data classification is investigated through an extensive experiment and a critical review of solutions to imbalanced learning. The experiment provides an objective measurement of the impact of class overlap versus the impact of class imbalance. Unlike previous studies [16], [17], [18], [19], which were based on limited ranges of class imbalance and class overlap degrees, we carried out a full-scale experiment using over 1,000 synthetic datasets. The in-depth review of existing solutions to classification of imbalanced datasets is presented from an alternative perspective, rather than at the data and algorithm levels commonly used in other review papers [8], [9], [20], [30], [31]. We considered the main objective of the solutions and categorised them into class distribution-based and class overlap-based approaches to better compare and contrast the two. Class distribution-based methods are mainly concerned with, and aim to suppress, the problem of imbalanced class distribution. Class overlap-based methods focus on improving the visibility of instances, especially positive instances, in the overlapping region. In addition, recent and emerging methods that do not specifically deal with the class imbalance or class overlap problems are also discussed. These include, for example, the use of one of the latest techniques in machine learning, Generative Adversarial Networks (GANs) [32], [33].

The main contributions of this review are listed below.

1. A technical discussion with advantages and disadvantages of evaluation metrics including how some of them can be misleading in certain imbalanced contexts
2. An extensive experiment illustrating the scales of impact of class overlap and class imbalance on imbalanced dataset classification
3. A critical discussion of methods and literature selected from leading peer-reviewed publications in the perspective of class overlap-based and class distribution-based approaches, as well as recent emerging technologies
4. An overview of benchmarking methods in the literature showing commonly-used ones that can be considered as good standards, but at the same time suggesting a need for comparing against recent and state-of-the-art methods for more convincing and reliable evaluation

The remainder of this paper is organised as follows. In Section 2, we give the definitions of class imbalance and class overlap. Section 3 contains an in-depth discussion of evaluation metrics used in imbalanced learning. Section 4 provides the experimental results and discussion on the effects of class imbalance and class overlap on the learner’s performance in an extensive range of scenarios. In Section 5, we critically review existing approaches for handling classification of imbalanced datasets. Finally, the conclusion is delivered in Section 6.

Section snippets

Class imbalance

An imbalanced dataset is a dataset with an unequal distribution of classes. This is depicted in Fig. 1, where majority and minority class instances are represented by circles and triangles, respectively. In machine learning, class imbalance becomes an issue when the minority class is significantly smaller in size and is the primary class of interest with a relatively high misclassification cost. Thus, in a binary-class problem, the minority class is also realised as the positive class whereas

Evaluation metrics

Some typical evaluation metrics for classification are not affected by skewed class distributions while others can be misleading with biases towards the majority class. Common metrics for classification of imbalanced datasets such as sensitivity, specificity, balanced accuracy, G-mean, AUC and F1-score will be discussed in detail. For other assessment measures, the reader may refer to [42], [43], [44], [45].
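As a minimal sketch of how the threshold-based metrics named above relate to one another, the function below computes them from the four confusion-matrix counts using their standard definitions. The function name `imbalance_metrics` and the example counts are illustrative assumptions, not values from this paper; AUC is omitted because it requires ranking scores rather than hard predictions.

```python
import math


def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def imbalance_metrics(tp, fn, tn, fp):
    """Standard metrics from confusion-matrix counts (positive = minority class)."""
    sensitivity = _ratio(tp, tp + fn)  # true positive rate (recall)
    specificity = _ratio(tn, tn + fp)  # true negative rate
    precision = _ratio(tp, tp + fp)
    return {
        "accuracy": _ratio(tp + tn, tp + fn + tn + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "g_mean": math.sqrt(sensitivity * specificity),
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
    }


if __name__ == "__main__":
    # Hypothetical case: 990 negatives, 10 positives, and a classifier
    # that labels every instance as negative.
    for name, value in imbalance_metrics(tp=0, fn=10, tn=990, fp=0).items():
        print(f"{name}: {value:.3f}")
```

In this hypothetical case accuracy is high even though no minority instance is detected, whereas sensitivity, G-mean and F1-score all fall to zero and balanced accuracy to 0.5, which is the kind of majority-class bias the paragraph above refers to.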

In imbalanced problems, accurate detection of minority class instances is crucial. This

Impacts of class overlap vs class imbalance

When handling classification of imbalanced data, rebalancing the class distribution is often an approach that researchers take. However, it should also be realised that class overlap is another common issue in classification tasks, which becomes more serious when it occurs in an imbalanced context. Many traditional and recent resampling methods for handling imbalanced datasets only aim at making the class distribution balanced and do not factor in the problem of class overlap [11], [12], [13].

Existing solutions

Existing literature often discussed solutions to imbalanced datasets as data-level and algorithm-level methods [64], [65], [66]. Oversampling and undersampling are among the most common data-level techniques. At the algorithm level, new learning algorithms and modifications of standard learning algorithms are developed. Algorithm-level methods have an advantage of incorporating user’s requirements into the model [20]. However, as opposed to data resampling methods, they do not allow flexible

Conclusion

In this paper, we provided a comprehensive review on the impact of class overlap in classification of imbalanced datasets. This was presented through an extensive experiment, an in-depth discussion on existing solutions, a technical discussion on evaluation metrics, and an overview of benchmarking methods. The experiment was carried out at the full scale of class overlap and extreme degrees of class imbalance. Results showed that classification errors increased with the degree of class overlap

CRediT authorship contribution statement

Pattaramon Vuttipittayamongkol: Conceptualization, Methodology, Resources, Software, Formal analysis, Investigation, Writing - original draft, Writing - review & editing, Visualization. Eyad Elyan: Conceptualization, Validation, Resources, Writing - review & editing, Visualization, Supervision. Andrei Petrovski: Proofreading, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (144)

Krawczyk B. et al. Evolutionary undersampling boosting for imbalanced classification of breast cancer malignancy. Appl. Soft Comput. (2016)
López V. et al. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Inform. Sci. (2013)
Lin W.-C. et al. Clustering-based undersampling in class-imbalanced data. Inform. Sci. (2017)
Nekooeimehr I. et al. Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets. Expert Syst. Appl. (2016)
Haixiang G. et al. Learning from class-imbalanced data: Review of methods and applications. Expert Syst. Appl. (2017)
Bi J. et al. An empirical comparison on state-of-the-art multi-class imbalance learning algorithms and a new diversified ensemble learning scheme. Knowl.-Based Syst. (2018)
Buda M. et al. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. (2018)
Nanni L. et al. Coupling different methods for overcoming the class imbalance problem. Neurocomputing (2015)
Douzas G. et al. Effective data generation for imbalanced learning using conditional generative adversarial networks. Expert Syst. Appl. (2018)
Ali-Gombe A. et al. MFC-GAN: Class-imbalanced dataset classification using Multiple Fake Class Generative Adversarial Network. Neurocomputing (2019)
Vuttipittayamongkol P. et al. Neighbourhood-based undersampling approach for handling imbalanced and overlapped data. Inform. Sci. (2020)
Rivera W.A. et al. A priori synthetic over-sampling methods for increasing classification sensitivity in imbalanced data sets. Expert Syst. Appl. (2016)
Devi D. et al. Redundancy-driven modified Tomek-link based undersampling: A solution to class imbalance. Pattern Recognit. Lett. (2017)
Collell G. et al. A simple plug-in bagging ensemble based on threshold-moving for classifying binary and multiclass imbalanced data. Neurocomputing (2018)
Adams N.M. et al. Comparing classifiers when the misallocation costs are uncertain. Pattern Recognit. (1999)
Yen S.-J. et al. Cluster-based under-sampling approaches for imbalanced data distributions. Expert Syst. Appl. (2009)
Ofek N. et al. Fast-CBUS: A fast clustering-based undersampling method for addressing the class imbalance problem. Neurocomputing (2017)
de Morais R.F. et al. Boosting the performance of over-sampling algorithms through under-sampling the minority class. Neurocomputing (2019)
Raghuwanshi B.S. et al. SMOTE based class-specific extreme learning machine for imbalanced learning. Knowl.-Based Syst. (2020)
Raghuwanshi B.S. et al. Underbagging based reduced kernelized weighted extreme learning machine for class imbalance learning. Eng. Appl. Artif. Intell. (2018)
Raghuwanshi B.S. et al. Class imbalance learning using UnderBagging based kernelized extreme learning machine. Neurocomputing (2019)
Sun Z. et al. A novel ensemble method for classifying imbalanced data. Pattern Recognit. (2015)
Tahir M.A. et al. Inverse random under sampling for class imbalance problem and its application to multi-label classification. Pattern Recognit. (2012)
Wei J. et al. NI-MWMOTE: An improving noise-immunity majority weighted minority oversampling technique for imbalanced classification problems. Expert Syst. Appl. (2020)
Liang X. et al. LR-SMOTE–An improved unbalanced data set oversampling based on K-means and SVM. Knowl.-Based Syst. (2020)
Sáez J.A. et al. SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inform. Sci. (2015)
Tao X. et al. Affinity and class probability-based fuzzy support vector machine for imbalanced data sets. Neural Netw. (2020)
Fan Q. et al. Entropy-based fuzzy support vector machine for imbalanced datasets. Knowl.-Based Syst. (2017)
Jian C. et al. A new sampling method for classifying imbalanced data based on support vector machine ensemble. Neurocomputing (2016)
Fernandes E.R. et al. Evolutionary inversion of class distribution in overlapping areas for multi-class imbalanced learning. Inform. Sci. (2019)
Vorraboot P. et al. Improving classification rate constrained to imbalanced data between overlapped and non-overlapped regions by hybrid algorithms. Neurocomputing (2015)
Beyan C. et al. Classifying imbalanced data sets using similarity based hierarchical decomposition. Pattern Recognit. (2015)
D’Addabbo A. et al. Parallel selective sampling method for imbalanced and large data classification. Pattern Recognit. Lett. (2015)
Díez-Pastor J.F. et al. Random balance: ensembles of variable priors classifiers for imbalanced data. Knowl.-Based Syst. (2015)
Galar M. et al. EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling. Pattern Recognit. (2013)
García S. et al. Evolutionary-based selection of generalized instances for imbalanced classification. Knowl.-Based Syst. (2012)
Vluymans S. et al. EPRENNID: An evolutionary prototype reduction based ensemble for nearest neighbor classification of imbalanced data. Neurocomputing (2016)
Chandola V. et al. Anomaly detection: A survey. ACM Comput. Surv. (2009)
Vuttipittayamongkol P. et al. Overlap-based undersampling method for classification of imbalanced medical datasets.
Vuttipittayamongkol P. et al. Improved overlap-based undersampling for imbalanced dataset classification with application to epilepsy and parkinson’s disease. Int. J. Neural Syst. (2020)
Zhang X. et al. Transfer boosting with synthetic instances for class imbalanced object recognition. IEEE Trans. Cybern. (2018)
Elyan E. et al. Deep learning for symbols detection and classification in engineering drawings. Neural Netw. (2020)
Lin W. et al. An ensemble random forest algorithm for insurance big data analysis. IEEE Access (2017)
Das S. et al. Handling data irregularities in classification: Foundations, trends, and future challenges. Pattern Recognit. (2018)
Batista G.E. et al. Balancing strategies and class overlapping.
Chawla N.V. et al. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. (2002)
Douzas G. et al. Improving imbalanced learning through a heuristic oversampling method based on K-means and SMOTE. Inform. Sci. (2018)
Han H. et al. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning.
Japkowicz N. et al. The class imbalance problem: A systematic study. Intell. Data Anal. (2002)
García V. et al. On the k-NN performance in a challenging scenario of imbalance and overlapping. Pattern Anal. Appl. (2008)

