A Membership Probability–Based Undersampling Algorithm for Imbalanced Data

Journal of Classification

Abstract

Classifiers trained on a highly imbalanced dataset tend to be biased toward the majority class and, as a result, minority class samples are often misclassified as majority class. One way to overcome this is an undersampling technique that removes some of the majority samples. We propose a simple and efficient undersampling method for imbalanced datasets and show, through several illustrative experiments, that it outperforms other methods with respect to four different performance measures, especially on highly imbalanced datasets.
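
As a point of reference for what "removing some majority samples" means in practice, the following is a minimal sketch of plain random undersampling. It is not the membership probability-based method proposed in the article, whose details are not given in this abstract. The function name `random_undersample`, the toy data, and the choice to shrink every class to the size of the smallest one are assumptions made only for illustration.

```python
import random
from collections import Counter


def random_undersample(samples, labels, seed=0):
    """Baseline random undersampling (illustrative, not the paper's method).

    Every class is reduced to the size of the smallest class by
    dropping randomly chosen samples; the smallest class is kept whole.
    """
    rng = random.Random(seed)

    # Group samples by their class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    # Target size is the size of the smallest (minority) class.
    target = min(len(xs) for xs in by_class.values())

    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        kept = xs if len(xs) == target else rng.sample(xs, target)
        out_samples.extend(kept)
        out_labels.extend([y] * len(kept))
    return out_samples, out_labels


# Hypothetical toy data: 95 majority samples (label 0), 5 minority (label 1).
samples = [[float(i)] for i in range(100)]
labels = [0] * 95 + [1] * 5

X_bal, y_bal = random_undersample(samples, labels)
counts = Counter(y_bal)
print(counts)
```

Random removal ignores where the discarded majority samples lie relative to the minority class, which is the kind of limitation that informed undersampling methods aim to address.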


Funding

This work has been supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2019R1A2C1088255), and by the Research Project Program for Newly-Recruited Personnel funded by the Ministry of Science and Technology of Taiwan, R.O.C. (MOST 108-2218-E-027-008-MY2).

Author information

Corresponding author

Correspondence to Sun Hur.

About this article

Cite this article

Ahn, G., Park, YJ. & Hur, S. A Membership Probability–Based Undersampling Algorithm for Imbalanced Data. J Classif 38, 2–15 (2021). https://doi.org/10.1007/s00357-019-09359-9

  • DOI: https://doi.org/10.1007/s00357-019-09359-9
