Elsevier

Information Fusion

Volume 69, May 2021, Pages 81-102

Full length article
Adaptive ensemble of classifiers with regularization for imbalanced data classification

https://doi.org/10.1016/j.inffus.2020.10.017

Highlights

  • A novel AER method is proposed for imbalanced data classification.

  • The method addresses overfitting from the perspective of implicit regularization.

  • The proposed algorithm prevents the numerical instability (NaN) problem.

  • Time and memory complexities are analyzed, and theoretical proofs are provided.

Abstract

Dynamic ensemble selection of classifiers is an effective approach for label-imbalanced data classification. However, such techniques are prone to overfitting, owing to the lack of regularization methods and the dependence on the local geometry of the data. In this study, focusing on binary imbalanced data classification, a novel dynamic ensemble method, namely the adaptive ensemble of classifiers with regularization (AER), is proposed to overcome these limitations. The method addresses the overfitting problem from a new perspective of implicit regularization. Specifically, it leverages the properties of stochastic gradient descent to obtain the solution with the minimum norm, thereby achieving regularization; furthermore, it interpolates the ensemble weights by exploiting the global geometry of the data to further prevent overfitting. According to our theoretical proofs, the seemingly complicated AER paradigm, in addition to its regularization capabilities, can actually reduce the asymptotic time and memory complexities of several other algorithms. We evaluate the proposed AER method on seven benchmark imbalanced datasets from the UCI machine learning repository and one artificially generated GMM-based dataset with five variations. The results show that the proposed algorithm outperforms the major existing algorithms based on multiple metrics in most cases, and two hypothesis tests (McNemar's and Wilcoxon tests) further verify the statistical significance. In addition, the proposed method has other preferred properties, such as special advantages in dealing with highly imbalanced data, and it pioneers research on regularization for dynamic ensemble methods.

Introduction

Imbalanced data classification refers to the classification of datasets with significantly different instance numbers across classes [1]. Specifically, in the binary imbalanced data classification problem, there is usually a dominating number of instances from one class (the majority class) and only a few instances belonging to the other class (the minority class). Binary imbalanced data classification is common in engineering and scientific practice [2], [3], [4]. The problem is non-trivial, because most general-purpose classification methods overwhelmingly favor the majority class in label-imbalanced scenarios, leading to significant performance degradation. Consequently, the development of binary imbalanced classification algorithms has become an independent and active research area.

Among the popular algorithms for binary imbalanced classification, the dynamic ensemble of classifiers has attracted significant attention. It works by training multiple classifiers on different subsets of the data and dynamically selecting from, or combining, them during inference. By picking the most competent classifier(s) for each specific test instance, this approach can mitigate the "majority favoritism" in imbalanced data classification [5], [6]. Various advanced algorithms build on the dynamic ensemble strategy, and the novelty of most of them lies in the techniques they use to combine ('ensemble') the models. For instance, [7] proposes a generalized mixture function to combine different classifiers, and [8] proposes an adaptive ensemble method based on the classification problem. We review a few similar methods in this paper, and more details are presented in Section 2.
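The selection step of a dynamic ensemble can be illustrated with a minimal sketch. The snippet below implements one simple, well-known selection rule (overall local accuracy): for each test instance, the classifier that is most accurate on the instance's nearest training neighbors is chosen. The function name and the toy classifiers are ours, for illustration only; the paper's own selection scheme differs.

```python
import math

def select_by_local_accuracy(classifiers, X_train, y_train, x, k=3):
    """Dynamic selection via overall local accuracy: return the classifier
    that is most accurate on the k training points nearest to x."""
    # Indices of the k nearest training points (Euclidean distance).
    order = sorted(range(len(X_train)), key=lambda i: math.dist(x, X_train[i]))
    neighborhood = order[:k]
    # Competence of a classifier = its accuracy on the local neighborhood.
    def competence(clf):
        return sum(clf(X_train[i]) == y_train[i] for i in neighborhood) / k
    return max(classifiers, key=competence)

# Toy data: class 1 lives on the positive axis, class 0 on the negative axis.
X_train = [(-2.0,), (-1.5,), (-1.0,), (1.0,), (1.5,), (2.0,)]
y_train = [0, 0, 0, 1, 1, 1]
always_majority = lambda x: 0          # blindly favors the majority class
sign_rule = lambda x: int(x[0] > 0)    # respects the local geometry

chosen = select_by_local_accuracy([always_majority, sign_rule],
                                  X_train, y_train, x=(1.2,))
```

Near the test point (1.2,) the sign rule is locally perfect, so it is selected over the majority-favoring classifier; this is exactly the mechanism that mitigates majority favoritism for individual instances.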

Despite the success of dynamic ensembles of classifiers on various tasks, we are unaware of any existing model that addresses the overfitting exhibited by such classifiers. Overfitting is a common problem wherein a classifier fits the training data too closely. This adversely affects the performance on the test data, because not all the information in the training data is useful (e.g., noise). At first glance, it appears that dynamic ensembles of classifiers can safely circumvent the curse of overfitting, because they utilize the test data during the selection of classifiers. However, because each classifier is usually trained on a small subset of the data (which contains information from the local geometry only), dynamically picking the most competent of them can lead to overfitting of the local geometry by these classifiers. Even if we interpolate the dynamic ensemble with a set of (fixed) trained weights for the classifiers, the overfitting problem persists, as the weights are obtained purely from the training data. Hence, there appears to be no simple solution to the overfitting problem of the dynamic ensemble of classifiers.

We solve the aforementioned problems using the regularization effects arising from Gaussian mixture model (GMM)-based resampling and the stochastic gradient descent (SGD) algorithm. The proposed method is called the adaptive ensemble of classifiers with regularization (AER), where "with regularization" refers to the two regularization schemes developed in this study. The AER method first performs data resampling based on the GMM [9], [10], generating two types of subsets. The first type has a broader inclusion of points from the majority class, and the second has an almost balanced number of instances from the two classes. The former type of subset forces the classifiers to consider the global geometry; this is regarded as the first regularization to alleviate the overfitting problem. The latter type of subset provides information on the local geometries, on which sufficiently powerful classifiers are fitted. After resampling, one individual classifier is learned for each sampled subset, and we explicitly learn a set of fixed coefficients/weights by optimizing the cross-entropy loss of the combined model with SGD. The adoption of SGD is the second regularization, and its effectiveness has been verified by numerous studies [11], [12], [13], [14]. During inference, the normalized coefficient of each individual classifier is determined by a combination of the on-the-fly likelihood and the trained classifier coefficients.
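The second regularization, learning fixed combination weights by SGD on the cross-entropy of the combined model, can be sketched as follows. This is a minimal illustration under our own simplifying assumptions (a sigmoid over a weighted sum of the base classifiers' positive-class probabilities, zero initialization, a fixed number of epochs); the paper's exact parameterization may differ.

```python
import math
import random

def train_ensemble_weights(P, y, lr=0.1, epochs=50, seed=0):
    """Learn fixed combination weights for m base classifiers by SGD on the
    cross-entropy of the combined model. P[i][j] is classifier j's predicted
    probability of the positive class for instance i; y[i] is 0 or 1.
    Starting from zero and running finitely many stochastic steps biases
    the solution toward small norm (implicit regularization)."""
    m = len(P[0])
    w = [0.0] * m                      # zero init keeps the iterate near min-norm
    rng = random.Random(seed)
    idx = list(range(len(P)))
    for _ in range(epochs):
        rng.shuffle(idx)               # one stochastic pass over the instances
        for i in idx:
            z = sum(w[j] * P[i][j] for j in range(m))
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid of the combined score
            g = p - y[i]                      # gradient of cross-entropy w.r.t. z
            for j in range(m):
                w[j] -= lr * g * P[i][j]
    return w

# Classifier 0 tracks the labels closely; classifier 1 is uninformative.
P = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.6], [0.2, 0.3]]
y = [1, 1, 0, 0]
w = train_ensemble_weights(P, y)
```

As expected, the informative classifier receives the larger weight, so the combined model leans on the base learner whose predictions align with the labels.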

We evaluate the proposed AER both theoretically and empirically. From a theoretical perspective, we analyze the time and space complexity of the AER model and prove that the seemingly complicated AER model actually requires less time and memory to train. From an empirical perspective, we test the performance of the AER model using the XGBoost classifier [15] (we refer to the combined method as AER-XGBoost) on seven imbalanced UCI machine learning datasets and a GMM-generated dataset with five variations. Based on multiple metrics, the experimental results reveal that the AER-XGBoost model exhibits competitive performance, outperforming multiple standard methods, such as the SVM and decision tree, as well as state-of-the-art methods, such as the focal loss neural network [16], vanilla XGBoost [15], focal loss XGBoost [17], and the LightGBM model [18]. McNemar's and Wilcoxon signed-rank tests are performed to further validate the superior performance of the AER, and the results are mostly sufficient to reject the null hypothesis of no performance difference. We note that the AER generally performs significantly better in severely label-imbalanced and complex decision boundary scenarios.
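McNemar's test used in these comparisons is computed from the paired disagreements of two classifiers on the same test set. A minimal sketch of the standard continuity-corrected statistic (the function name is ours):

```python
def mcnemar_statistic(y_true, pred_a, pred_b):
    """McNemar's chi-squared statistic (with continuity correction) for
    comparing two classifiers evaluated on the same test set. Under the
    null hypothesis of equal error rates, it is approximately chi-squared
    with one degree of freedom."""
    # b: instances A classifies correctly and B misclassifies; c: the reverse.
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    if b + c == 0:
        return 0.0                    # no disagreements: nothing to test
    return (abs(b - c) - 1) ** 2 / (b + c)

# Example: A is correct on three instances where B errs; they agree elsewhere.
stat = mcnemar_statistic([1, 1, 1, 0], [1, 1, 1, 0], [0, 0, 0, 0])
```

Only the off-diagonal disagreement counts enter the statistic, which is why the test is well suited to paired classifier comparisons on a shared test set.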

The rest of the paper is structured as follows: Section 2 reviews the related work and motivates our idea. Section 3 introduces the algorithm and its properties in detail. Section 4 analyzes the advantageous time and memory complexity of the proposed algorithm. The experimental framework and the results analysis are presented in Sections 5 and 6, respectively, and related discussions appear in Section 7. Lastly, Section 8 concludes the paper.

Section snippets

Related work

Imbalanced data classification refers to the classification problem where the number of samples for each class label is not balanced, i.e., where the class distribution is biased or skewed [1]. Since most standard classifiers assume relatively balanced class distributions and equal misclassification costs, class imbalance can be perceived as a form of data irregularity [19], and it can significantly deteriorate the performance of classifiers. Performing high-accuracy classification

Methods

In this section, we introduce the details of the proposed AER model. The structure is laid out as follows: Section 3.1 introduces the GMM fitting and the generation of the two types of subsets; Section 3.2 discusses the specific implementation with XGBoost, which serves as the individual 'base' classifier in the experiments; the SGD training for the ensemble of classifiers is illustrated in Section 3.3; and finally, the weight interpolation/combination and probabilistic prediction will

Theoretical analysis of the AER

In this section, we demonstrate that the proposed AER method has advantageous time and memory complexity. Specifically, we show theoretically that, under certain assumptions and for any classifier implemented within the AER framework, the time complexity is asymptotically at least as good as that of the original implementation, and the asymptotic memory complexity is always better than that of full-batch implementations.

To begin with, let us recap the notations used in the AER model.

Experimental analysis: the framework

In this section, we introduce the framework of our empirical analysis for the AER model. We introduce the datasets in Section 5.1 with their backgrounds and characteristics. The methods compared against the AER model are discussed in Section 5.2, and the metrics to evaluate the results are presented in Section 5.3. Finally, we discuss our approaches for statistical testing to validate the significance of the results in Section 5.4.

Experimental analysis: the results and discussions

In this section, we present and analyze the experimental results of the proposed AER method. As introduced in Sections 5.1 (Datasets) and 5.2 (Compared methods), seven compared methods are implemented on twelve imbalanced datasets. Owing to space limitations, the UCI Bioassay and Abalone 19 datasets are selected for the primary demonstration, including the performance evaluation of the AER with respect to changes in the related parameters, as well as a comprehensive table comparing the performance of the AER with other

Discussions

We dedicate this section to discussing the implications of the foregoing theoretical and empirical analyses, as well as details omitted from the experiments. Specifically, we discuss the following aspects: 1. The effectiveness of the regularization; 2. The problems best suited to the AER and the choice between logarithm- and exponential-based AERs; 3. The practical training time and training dynamics of the AER; and 4. Natural improvements and extensions of the AER.

From the experiments in

Conclusion

In this paper, a novel method, the adaptive ensemble of classifiers with regularization (AER), has been proposed for binary imbalanced data classification. The details of the method, including an implementation with XGBoost, are provided, and the related training formulas are derived. In addition to its regularization properties, we show that the method has favorable time and memory complexity. The performance of the proposed algorithm is tested on multiple datasets, and empirical evidence

CRediT authorship contribution statement

Chen Wang: Worked out the technical details, Performed the experiments, Wrote and revised the manuscript, Discussed the results and contributed to the manuscript. Chengyuan Deng: Performed the experiments, Discussed the results and contributed to the manuscript. Zhoulu Yu: Performed the experiments, Wrote and revised the manuscript, Discussed the results and contributed to the manuscript. Dafeng Hui: Wrote and revised the manuscript, Discussed the results and contributed to the manuscript.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We appreciate the constructive suggestions of Qin Yu, Hang Zhang, Yanmei Yu, and Chao Sun for the paper. We also thank Michael Tan of University College London for his writing suggestions.

Funding statement

This work is supported by the Sichuan Science and Technology Program, China (2020YFG0051), and the University-Enterprise Cooperation Projects, China (17H1199, 19H0355, 19H1121).

References (68)

  • Bhagat Singh Raghuwanshi et al., Class-specific extreme learning machine for handling binary class imbalance problem, Neural Netw. (2018)

  • Sanyam Shukla et al., Online sequential class-specific extreme learning machine for binary imbalanced learning, Neural Netw. (2019)

  • Mikel Galar et al., An overview of ensemble methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and one-vs-all schemes, Pattern Recognit. (2011)

  • Qi Wang et al., A novel ensemble method for imbalanced data learning: bagging of extrapolation-SMOTE SVM, Comput. Intell. Neurosci. (2017)

  • Michał Woźniak et al., A survey of multiple classifier systems as hybrid systems, Inf. Fusion (2014)

  • Albert H.R. Ko et al., From dynamic classifier selection to dynamic ensemble selection, Pattern Recognit. (2008)

  • Chen Lin et al., LibD3C: ensemble classifiers with a clustering and dynamic selection strategy, Neurocomputing (2014)

  • Rafael M.O. Cruz et al., META-DES: a dynamic ensemble selection framework using meta-learning, Pattern Recognit. (2015)

  • Jin Xiao et al., Dynamic classifier ensemble model for customer classification with imbalanced class distribution, Expert Syst. Appl. (2012)

  • Bartosz Krawczyk et al., Dynamic ensemble selection for multi-class classification with one-class classifiers, Pattern Recognit. (2018)

  • Han Kyu Lee et al., An overlap-sensitive margin classifier for imbalanced and overlapping data, Expert Syst. Appl. (2018)

  • Bartosz Krawczyk et al., Evolutionary undersampling boosting for imbalanced classification of breast cancer malignancy, Appl. Soft Comput. (2016)

  • Xin Xia et al., ELBlocker: Predicting blocking bugs with ensemble imbalance learning, Inf. Softw. Technol. (2015)

  • Zhenxiang Chen et al., Machine learning based mobile malware detection using highly imbalanced network traffic, Inform. Sci. (2018)

  • Shamsul Huda et al., A hybrid feature selection with ensemble classification for imbalanced healthcare data: A case study for brain tumor diagnosis, IEEE Access (2016)

  • Atefeh Dehghani Ashkezari et al., Application of fuzzy support vector machine for determining the health index of the insulation system of in-service power transformers, IEEE Trans. Dielectr. Electr. Insul. (2013)

  • Rafael M.O. Cruz et al., On dynamic ensemble selection and data preprocessing for multi-class imbalance learning, Int. J. Pattern Recognit. Artif. Intell. (2019)

  • David Barber, Bayesian Reasoning and Machine Learning (2012)

  • Douglas A. Reynolds et al., Robust text-independent speaker identification using Gaussian mixture speaker models, IEEE Trans. Speech Audio Process. (1995)

  • Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals, Understanding deep learning requires...

  • Léon Bottou et al., Optimization methods for large-scale machine learning, SIAM Rev. (2018)

  • Tianqi Chen, Carlos Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD...

  • Tsung-Yi Lin et al., Focal loss for dense object detection, IEEE Trans. Pattern Anal. Mach. Intell. (2018)

  • Chen Wang et al., Imbalance-XGBoost: Leveraging weighted and focal losses for binary label-imbalanced classification with XGBoost, Pattern Recognit. Lett. (2020)