Knowledge-Based Systems
Volume 212, 5 January 2021, 106631

On the class overlap problem in imbalanced data classification

https://doi.org/10.1016/j.knosys.2020.106631

Abstract

Class imbalance is an active research area in the machine learning community. However, existing and recent literature shows that class overlap has a greater negative impact on the performance of learning algorithms than class imbalance itself. This paper provides a detailed critical discussion and objective evaluation of class overlap in the context of imbalanced data and of its impact on classification accuracy. First, we present a thorough experimental comparison of class overlap and class imbalance. Unlike previous work, our experiment covers the full scale of class overlap and an extreme range of class imbalance degrees. Second, we provide an in-depth critical technical review of existing approaches to handling imbalanced datasets. Existing solutions from selected literature are critically reviewed and categorised as class distribution-based and class overlap-based methods. Emerging techniques and the latest developments in this area are also discussed in detail. The experimental results in this paper are consistent with existing literature and show clearly that the performance of a learning algorithm deteriorates across varying degrees of class overlap, whereas class imbalance does not always have an effect. The review emphasises the need for further research towards handling class overlap in imbalanced datasets to effectively improve learning algorithms’ performance.

Introduction

Learning from datasets with skewed class distributions remains a challenge in machine learning. Such datasets are known as imbalanced datasets and are widely seen in many applications, for example anomaly detection [1], medical prediction [2], [3], [4], object recognition [5], [6] and business management [7]. In these domains, the minority class is usually the class of interest and has a higher misclassification cost than the majority class. Standard learning algorithms generally build classification models by maximising overall accuracy, which often leads to classification biased towards the majority class and to misclassification of minority class instances [8], [9]. However, such failures in classifying imbalanced datasets are not always caused by class imbalance alone. In fact, a linearly separable dataset can be perfectly classified by a typical classification algorithm no matter how skewed the class distribution is [10]. Conversely, when class overlap is present, even a balanced dataset can be difficult to learn.
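This separability-versus-overlap point can be illustrated with a minimal sketch; the datasets and parameters below are illustrative assumptions, not those used in the paper’s experiments. A highly imbalanced but linearly separable dataset is classified almost perfectly, whereas a balanced but heavily overlapping one is not.

```python
# Illustrative sketch only; datasets and parameters are arbitrary,
# not the paper's experimental setup.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Highly imbalanced (100:1) but well-separated classes.
X_sep, y_sep = make_blobs(n_samples=[5000, 50], centers=[[0, 0], [10, 10]],
                          cluster_std=1.0, random_state=42)

# Perfectly balanced but heavily overlapping classes.
X_ovl, y_ovl = make_blobs(n_samples=[2500, 2500], centers=[[0, 0], [1, 1]],
                          cluster_std=2.0, random_state=42)

clf = LogisticRegression(max_iter=1000)
print("separable, imbalanced:",
      cross_val_score(clf, X_sep, y_sep, scoring="balanced_accuracy", cv=5).mean())
print("overlapping, balanced:",
      cross_val_score(clf, X_ovl, y_ovl, scoring="balanced_accuracy", cv=5).mean())
```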

When dealing with classification of imbalanced data, rebalancing class distribution is among the most common approaches that researchers consider. Many traditional and recent resampling methods only aim at getting a more balanced version of the training data and do not factor in the problem of class overlap [11], [12], [13]. Some methods deal with instances in the overlapping region, especially those near the borderline areas; however, their resampling rates are controlled by the degree of class imbalance [14], [15]. Thus, in some scenarios, results can be highly influenced by class imbalance rather than class overlap. For instance, when a dataset suffers from high class overlap but its classes are slightly imbalanced, insufficient resampling may result in class overlap not being properly addressed. On the other hand, with low class overlap and high class imbalance, excessive resampling may occur.
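To make the point about resampling rates concrete, the hedged sketch below uses the imbalanced-learn library (the dataset and parameters are illustrative assumptions). With the default sampling_strategy, both SMOTE and Borderline-SMOTE generate exactly as many synthetic minority instances as are needed to equalise the class counts, so the amount of resampling is dictated by the imbalance ratio rather than by the degree of overlap.

```python
# Illustrative sketch; dataset parameters are assumptions, not from the paper.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, BorderlineSMOTE

X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05],
                           class_sep=0.5,   # low separation -> substantial overlap
                           random_state=0)
print("original:", Counter(y))

# Default sampling_strategy='auto' balances the classes; the number of synthetic
# samples is dictated purely by the imbalance ratio, not by the overlap degree.
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
print("SMOTE:", Counter(y_sm))

# Borderline-SMOTE focuses generation near the borderline, but the total amount
# generated is still governed by the class imbalance.
X_bl, y_bl = BorderlineSMOTE(random_state=0).fit_resample(X, y)
print("Borderline-SMOTE:", Counter(y_bl))
```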

The impacts of class imbalance, class overlap and other data characteristics such as small disjuncts and dataset size have been investigated [16], [17], [18], [19]. Among these potential factors, including class imbalance, class overlap frequently shows the strongest negative influence [9], [18]. This raises two important questions for the classification of imbalanced datasets: (1) Are solutions that mainly aim to rebalance the class distribution sufficiently effective? (2) Should the problem of class overlap be the main concern in developing new algorithms?

Although several reviews on the problem of imbalanced data in classification exist [9], [20], [21], class overlap in imbalanced data was not emphasised as the main issue and the discussions often lacked sufficient experimental evidence. Das et al. [9] proposed that the two key challenges for standard learning algorithms are class imbalance and class overlap. They suggested the likely learning outcomes under different scenarios of class imbalance and class overlap depending on dataset size; however, no experimental evidence was given. The authors also investigated other data irregularities such as small disjuncts and missing features; thus, the discussion of the class overlap problem was limited. In [20], only a brief description of other studies on the effect of class overlap in relation to class imbalance was given, and the authors paid particular attention to the different techniques used in existing methods for handling imbalanced classification. Stefanowski [21] motivated the research community to develop new algorithms for imbalanced data that take data factors, including overlapping between classes, into account. The author presented analyses of the characteristics of the minority class, which was divided into sub-regions of safe, borderline, rare and outlier samples. This was studied alongside the behaviours of different learning algorithms; however, it cannot yet be mathematically verified on real-world datasets. Like many other reviews [22], [23], Kaur et al. [24] conducted a comparative analysis of methods, organised mainly into data preprocessing and algorithmic approaches, and the problem of class overlap was barely discussed. Other reviews focused on imbalanced data classification in specific contexts such as big data [25], [26], the multi-class problem [25], [27] and neural networks [28], [29]. This clearly shows that there is still a gap in the study of class overlap in the context of class imbalance.

In this paper, the importance of handling class overlap in imbalanced data classification is investigated. This was carried out through an extensive experiment and a critical review of solutions to imbalanced learning. The experiment provides an objective measurement of the impact of class overlap versus that of class imbalance. Unlike previous studies [16], [17], [18], [19], which were based on limited ranges of class imbalance and class overlap degrees, we carried out a full-scale experiment using over 1,000 synthetic datasets. The in-depth review of existing solutions to the classification of imbalanced datasets is presented from an alternative perspective to the data-level and algorithm-level view commonly adopted in other review papers [8], [9], [20], [30], [31]. We considered the main objective of each solution and categorised the solutions into class distribution-based and class overlap-based approaches to better compare and contrast the two. Class distribution-based methods mainly aim to suppress the problem of imbalanced class distribution. Class overlap-based methods focus on improving the visibility of instances, especially positive instances, in the overlapping region. In addition, recent and emerging methods that do not specifically target the class imbalance or class overlap problems are also discussed. These include, for example, the use of one of the latest techniques in machine learning, Generative Adversarial Networks (GANs) [32], [33].
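A hedged sketch of how such a grid of synthetic datasets with controlled imbalance and overlap could be generated is given below; the generator (make_classification), the parameter ranges and the use of class_sep as an overlap proxy are assumptions for illustration, not the generation procedure used in the paper.

```python
# Illustrative sketch of a grid of overlap x imbalance datasets.
# Parameter ranges and the generator are assumptions, not the paper's setup.
import numpy as np
from sklearn.datasets import make_classification

imbalance_ratios = [1, 10, 50, 100, 500]        # majority:minority
overlap_levels = np.linspace(0.1, 2.0, 10)      # lower class_sep => more overlap

datasets = {}
for ir in imbalance_ratios:
    minority_frac = 1.0 / (ir + 1)
    for sep in overlap_levels:
        X, y = make_classification(
            n_samples=5000, n_features=2, n_informative=2, n_redundant=0,
            weights=[1 - minority_frac, minority_frac],
            class_sep=sep, flip_y=0, random_state=0)
        datasets[(ir, round(float(sep), 2))] = (X, y)

print(len(datasets), "synthetic datasets generated")
```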

The main contributions of this review are listed below.

1. A technical discussion of the advantages and disadvantages of evaluation metrics, including how some of them can be misleading in certain imbalanced contexts.

2. An extensive experiment illustrating the scale of the impact of class overlap and class imbalance on imbalanced dataset classification.

3. A critical discussion of methods and literature selected from leading peer-reviewed publications, from the perspective of class overlap-based and class distribution-based approaches, as well as recent emerging technologies.

4. An overview of benchmarking methods in the literature, identifying commonly used ones that can be considered good standards, while also suggesting the need to compare against recent and state-of-the-art methods for more convincing and reliable evaluation.

The remainder of this paper is organised as follows. In Section 2, we give the definitions of class imbalance and class overlap. Section 3 contains an in-depth discussion of evaluation metrics used in imbalanced learning. Section 4 provides the experimental results and discussion on the effects of class imbalance and class overlap on the learner’s performance in an extensive range of scenarios. In Section 5, we critically review existing approaches for handling classification of imbalanced datasets. Finally, the conclusion is delivered in Section 6.

Section snippets

Class imbalance

An imbalanced dataset is a dataset with an unequal distribution of classes. This is depicted in Fig. 1, where majority and minority class instances are represented by circles and triangles, respectively. In machine learning, class imbalance becomes an issue when the minority class is significantly smaller in size and is the primary class of interest with a relatively high misclassification cost. Thus, in a binary-class problem, the minority class is also referred to as the positive class, whereas the majority class is referred to as the negative class.
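The degree of class imbalance is commonly quantified by the imbalance ratio (IR), the ratio of the majority class size to the minority class size; a minimal sketch (the helper function name is ours) is shown below.

```python
# Minimal sketch: imbalance ratio (IR) = n_majority / n_minority.
from collections import Counter

def imbalance_ratio(y):
    """Return the majority-to-minority class size ratio for a binary label vector."""
    counts = Counter(y)
    n_min, n_maj = min(counts.values()), max(counts.values())
    return n_maj / n_min

print(imbalance_ratio([0] * 950 + [1] * 50))  # -> 19.0
```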

Evaluation metrics

Some typical evaluation metrics for classification are not affected by skewed class distributions while others can be misleading with biases towards the majority class. Common metrics for classification of imbalanced datasets such as sensitivity, specificity, balanced accuracy, G-mean, AUC and F1-score will be discussed in detail. For other assessment measures, the reader may refer to [42], [43], [44], [45].

In imbalanced problems, accurate detection of minority class instances is crucial.
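All of the metrics named above can be derived from the confusion matrix or from class probability scores; the following hedged sketch computes them with scikit-learn on arbitrary placeholder labels, not on experimental results.

```python
# Sketch of the metrics discussed above, computed with scikit-learn.
# The labels and scores below are arbitrary placeholders.
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, confusion_matrix,
                             f1_score, roc_auc_score)

y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y_pred  = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0])
y_score = np.array([.9, .8, .4, .3, .2, .1, .3, .2, .1, .2, .1, .3, .2, .1, .6, .2])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)            # recall of the positive (minority) class
specificity = tn / (tn + fp)            # recall of the negative (majority) class
g_mean = np.sqrt(sensitivity * specificity)

print("sensitivity:", sensitivity)
print("specificity:", specificity)
print("G-mean:", g_mean)
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_score))
```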

Impacts of class overlap vs class imbalance

When handling classification of imbalanced data, rebalancing the class distribution is often an approach that researchers take. However, it should also be realised that class overlap is another common issue in classification tasks, which becomes more serious when it occurs in an imbalanced context. Many traditional and recent resampling methods for handling imbalanced datasets only aim at making the class distribution balanced and do not factor in the problem of class overlap [11], [12], [13].
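By contrast, overlap-aware data cleaning methods such as Tomek-link removal and edited nearest neighbours act on instances in the overlapping region rather than on the class ratio. The sketch below contrasts the two behaviours using imbalanced-learn; the dataset parameters are illustrative assumptions.

```python
# Sketch contrasting distribution-driven resampling with overlap-aware cleaning.
# Dataset parameters are illustrative assumptions.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import TomekLinks, EditedNearestNeighbours

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                           class_sep=0.5, flip_y=0, random_state=0)
print("original:", Counter(y))

# Distribution-driven: duplicates minority instances until the classes are balanced.
print("RandomOverSampler:", Counter(RandomOverSampler(random_state=0).fit_resample(X, y)[1]))

# Overlap-aware: removes majority instances that form Tomek links (pairs of nearest
# neighbours from opposite classes), i.e. instances in the overlapping region.
print("TomekLinks:", Counter(TomekLinks().fit_resample(X, y)[1]))

# Overlap-aware: removes majority instances misclassified by their nearest neighbours.
print("EditedNearestNeighbours:", Counter(EditedNearestNeighbours().fit_resample(X, y)[1]))
```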

Existing solutions

Existing literature often discussed solutions to imbalanced datasets as data-level and algorithm-level methods [64], [65], [66]. Oversampling and undersampling are among the most common data-level techniques. At the algorithm level, new learning algorithms and modifications of standard learning algorithms are developed. Algorithm-level methods have the advantage of incorporating the user’s requirements into the model [20]. However, as opposed to data resampling methods, they do not allow flexible use across different learning algorithms.
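The conventional split can be sketched as follows; the dataset, classifier and pipeline choices are illustrative assumptions, not the methods evaluated in this paper. The data-level variant resamples the training data, while the algorithm-level variant keeps the data unchanged and makes the learner cost-sensitive through class weights.

```python
# Sketch of the two conventional families of solutions.
# Dataset and model choices are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=3000, weights=[0.95, 0.05],
                           class_sep=0.8, random_state=0)

# Data-level: resample the training data (here, random undersampling).
data_level = make_pipeline(RandomUnderSampler(random_state=0),
                           RandomForestClassifier(random_state=0))

# Algorithm-level: leave the data untouched and make the learner cost-sensitive.
algo_level = RandomForestClassifier(class_weight="balanced", random_state=0)

for name, model in [("data-level", data_level), ("algorithm-level", algo_level)]:
    score = cross_val_score(model, X, y, scoring="balanced_accuracy", cv=5).mean()
    print(name, round(score, 3))
```

Wrapping the resampler in an imbalanced-learn pipeline ensures that resampling is fitted only on the training folds during cross-validation, avoiding leakage into the evaluation folds.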

Conclusion

In this paper, we provided a comprehensive review of the impact of class overlap in the classification of imbalanced datasets. This was presented through an extensive experiment, an in-depth discussion of existing solutions, a technical discussion of evaluation metrics, and an overview of benchmarking methods. The experiment was carried out at the full scale of class overlap and extreme degrees of class imbalance. Results showed that classification errors increased with the degree of class overlap, whereas class imbalance did not always have an effect.

CRediT authorship contribution statement

Pattaramon Vuttipittayamongkol: Conceptualization, Methodology, Resources, Software, Formal analysis, Investigation, Writing - original draft, Writing - review & editing, Visualization. Eyad Elyan: Conceptualization, Validation, Resources, Writing - review & editing, Visualization, Supervision. Andrei Petrovski: Proofreading, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (144)

  • Vuttipittayamongkol, P., et al., Neighbourhood-based undersampling approach for handling imbalanced and overlapped data, Inform. Sci. (2020)
  • Rivera, W.A., et al., A priori synthetic over-sampling methods for increasing classification sensitivity in imbalanced data sets, Expert Syst. Appl. (2016)
  • Devi, D., et al., Redundancy-driven modified Tomek-link based undersampling: A solution to class imbalance, Pattern Recognit. Lett. (2017)
  • Collell, G., et al., A simple plug-in bagging ensemble based on threshold-moving for classifying binary and multiclass imbalanced data, Neurocomputing (2018)
  • Adams, N.M., et al., Comparing classifiers when the misallocation costs are uncertain, Pattern Recognit. (1999)
  • Yen, S.-J., et al., Cluster-based under-sampling approaches for imbalanced data distributions, Expert Syst. Appl. (2009)
  • Ofek, N., et al., Fast-CBUS: A fast clustering-based undersampling method for addressing the class imbalance problem, Neurocomputing (2017)
  • de Morais, R.F., et al., Boosting the performance of over-sampling algorithms through under-sampling the minority class, Neurocomputing (2019)
  • Raghuwanshi, B.S., et al., SMOTE based class-specific extreme learning machine for imbalanced learning, Knowl.-Based Syst. (2020)
  • Raghuwanshi, B.S., et al., Underbagging based reduced kernelized weighted extreme learning machine for class imbalance learning, Eng. Appl. Artif. Intell. (2018)
  • Raghuwanshi, B.S., et al., Class imbalance learning using UnderBagging based kernelized extreme learning machine, Neurocomputing (2019)
  • Sun, Z., et al., A novel ensemble method for classifying imbalanced data, Pattern Recognit. (2015)
  • Tahir, M.A., et al., Inverse random under sampling for class imbalance problem and its application to multi-label classification, Pattern Recognit. (2012)
  • Wei, J., et al., NI-MWMOTE: An improving noise-immunity majority weighted minority oversampling technique for imbalanced classification problems, Expert Syst. Appl. (2020)
  • Liang, X., et al., LR-SMOTE–An improved unbalanced data set oversampling based on K-means and SVM, Knowl.-Based Syst. (2020)
  • Sáez, J.A., et al., SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering, Inform. Sci. (2015)
  • Tao, X., et al., Affinity and class probability-based fuzzy support vector machine for imbalanced data sets, Neural Netw. (2020)
  • Fan, Q., et al., Entropy-based fuzzy support vector machine for imbalanced datasets, Knowl.-Based Syst. (2017)
  • Jian, C., et al., A new sampling method for classifying imbalanced data based on support vector machine ensemble, Neurocomputing (2016)
  • Fernandes, E.R., et al., Evolutionary inversion of class distribution in overlapping areas for multi-class imbalanced learning, Inform. Sci. (2019)
  • Vorraboot, P., et al., Improving classification rate constrained to imbalanced data between overlapped and non-overlapped regions by hybrid algorithms, Neurocomputing (2015)
  • Beyan, C., et al., Classifying imbalanced data sets using similarity based hierarchical decomposition, Pattern Recognit. (2015)
  • D’Addabbo, A., et al., Parallel selective sampling method for imbalanced and large data classification, Pattern Recognit. Lett. (2015)
  • Díez-Pastor, J.F., et al., Random balance: Ensembles of variable priors classifiers for imbalanced data, Knowl.-Based Syst. (2015)
  • Galar, M., et al., EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling, Pattern Recognit. (2013)
  • García, S., et al., Evolutionary-based selection of generalized instances for imbalanced classification, Knowl.-Based Syst. (2012)
  • Vluymans, S., et al., EPRENNID: An evolutionary prototype reduction based ensemble for nearest neighbor classification of imbalanced data, Neurocomputing (2016)
  • Chandola, V., et al., Anomaly detection: A survey, ACM Comput. Surv. (2009)
  • Vuttipittayamongkol, P., et al., Overlap-based undersampling method for classification of imbalanced medical datasets
  • Vuttipittayamongkol, P., et al., Improved overlap-based undersampling for imbalanced dataset classification with application to epilepsy and Parkinson’s disease, Int. J. Neural Syst. (2020)
  • Zhang, X., et al., Transfer boosting with synthetic instances for class imbalanced object recognition, IEEE Trans. Cybern. (2018)
  • Elyan, E., et al., Deep learning for symbols detection and classification in engineering drawings, Neural Netw. (2020)
  • Lin, W., et al., An ensemble random forest algorithm for insurance big data analysis, IEEE Access (2017)
  • Das, S., et al., Handling data irregularities in classification: Foundations, trends, and future challenges, Pattern Recognit. (2018)
  • Batista, G.E., et al., Balancing strategies and class overlapping
  • Chawla, N.V., et al., SMOTE: Synthetic minority over-sampling technique, J. Artif. Intell. Res. (2002)
  • Douzas, G., et al., Improving imbalanced learning through a heuristic oversampling method based on K-means and SMOTE, Inform. Sci. (2018)
  • Han, H., et al., Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning
  • Japkowicz, N., et al., The class imbalance problem: A systematic study, Intell. Data Anal. (2002)
  • García, V., et al., On the k-NN performance in a challenging scenario of imbalance and overlapping, Pattern Anal. Appl. (2008)