ODBOT: Outlier detection-based oversampling technique for imbalanced datasets learning

IBRAHIM, Mohammed H.

doi:10.1007/s00521-021-06198-x

Original Article
Published: 21 June 2021

Neural Computing and Applications, volume 33, pages 15781–15806 (2021)

Mohammed H. IBRAHIM ORCID: orcid.org/0000-0002-6093-6105¹

Abstract

In many real-world problems, datasets are imbalanced: the majority classes contain far more samples than the minority classes. In general, machine learning and data mining classification algorithms perform poorly on imbalanced datasets. In recent years, various oversampling techniques have been developed in the literature to solve the class imbalance problem. Unfortunately, few of these techniques take the relationships between classes into account or exploit the correlation between attributes. Moreover, in most cases, the existing oversampling techniques do not handle multi-class imbalanced datasets. To this end, this paper proposes a simple but effective outlier detection-based oversampling technique (ODBOT) to handle the multi-class imbalance problem. In ODBOT, outlier samples are detected by clustering within the minority class(es), and synthetic samples are then generated with these outlier samples taken into consideration. ODBOT generates efficient and consistent synthetic samples for the minority class(es) by analyzing the dissimilarity relationships among the attribute values of all classes. Moreover, ODBOT can reduce the risk of overlap among different class regions and can yield a better classification model. The performance of ODBOT is evaluated in extensive experiments on 60 commonly used imbalanced datasets with five classification algorithms. The experimental results show that ODBOT consistently outperformed other common and state-of-the-art oversampling techniques in terms of various evaluation criteria.
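
The two stages named in the abstract (clustering within each minority class to find outliers, then generating synthetic samples with those outliers in mind) can be illustrated with a minimal sketch. This is not the ODBOT algorithm itself, whose details are not given in the abstract. The function names, the outlier rule (members of clusters holding less than `min_frac` of a class are treated as outliers), the choice to exclude outliers as seeds, and the generation rule (interpolating between a seed and its cluster centroid) are all assumptions made for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    centroids = [list(p) for p in rng.sample(points, k)]

    def assign():
        # Label each point with the index of its nearest centroid.
        return [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                for p in points]

    for _ in range(iters):
        labels = assign()
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign()


def seeds_without_outliers(points, k, rng, min_frac=0.2):
    """Assumed rule: members of clusters smaller than min_frac of the
    class are outliers and are not used as seeds."""
    k = min(k, len(points))
    centroids, labels = kmeans(points, k, rng)
    sizes = Counter(labels)
    pairs = [(p, centroids[lab]) for p, lab in zip(points, labels)]
    seeds = [pair for pair, lab in zip(pairs, labels)
             if sizes[lab] >= min_frac * len(points)]
    # Fall back to every sample if the rule would discard the whole class.
    return seeds or pairs


def oversample(X, y, k=2, seed=0):
    """Grow every class to the size of the largest one (multi-class)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(list(row))
    target = max(len(rows) for rows in by_class.values())
    X_new, y_new = [list(r) for r in X], list(y)
    for label, rows in by_class.items():
        need = target - len(rows)
        if need == 0:
            continue
        seeds = seeds_without_outliers(rows, k, rng)
        for _ in range(need):
            # Assumed generation rule: a random point on the segment
            # between a non-outlier seed and its cluster centroid.
            p, centroid = rng.choice(seeds)
            t = rng.random()
            X_new.append([a + t * (b - a) for a, b in zip(p, centroid)])
            y_new.append(label)
    return X_new, y_new
```

Calling `oversample(X, y)` on a list of feature rows and a parallel list of class labels returns the original rows followed by the synthetic ones, with every class brought up to the size of the largest class. Because each synthetic point lies between an existing sample and a centroid of the same class, it stays inside that class's region, which is one simple way to limit overlap with other classes.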

Author information

Authors and Affiliations

Department of Computer Engineering, Necmettin Erbakan University, Konya, 42090, Turkey
Mohammed H. IBRAHIM

Corresponding author

Correspondence to Mohammed H. IBRAHIM.

Ethics declarations

Conflict of interest

The author declares that he has no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Experimental results (%) for the 52 imbalanced datasets

See Table 17.

Table 17 List of experimental results for imbalanced datasets

About this article

Cite this article

IBRAHIM, M.H. ODBOT: Outlier detection-based oversampling technique for imbalanced datasets learning. Neural Comput & Applic 33, 15781–15806 (2021). https://doi.org/10.1007/s00521-021-06198-x

Received: 02 February 2021
Accepted: 02 June 2021
Published: 21 June 2021
Issue Date: November 2021
DOI: https://doi.org/10.1007/s00521-021-06198-x
