Abstract
Imbalanced data arise in many domains, including medical science, the Internet, finance, and surveillance. Learning from imbalanced data, also called the imbalanced learning problem, remains a major challenge and deserves more attention. In this paper, we focus on class overlap, one of the most important inherent factors that hinder learning from imbalanced data. We introduce the overlapping degree (OD) and group data sets into two types: high OD (HOD) and low OD (LOD). Experiments show that LOD data sets can achieve good results without any under-sampling algorithm, even though some of them are highly imbalanced, and that under-sampling improves their results only marginally. We then propose a new under-sampling algorithm, the random forest cleaning rule (RFCL), which removes the majority-class instances that cross a new classification boundary defined by a margin threshold, thereby reducing both overlap and imbalance. The threshold is searched by maximizing the F1-score of the final classifier. Experimental results show that RFCL outperforms seven classic and two recent under-sampling methods in terms of F1-score and area under the curve, whether random forest or support vector machine is used as the final classifier.
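The cleaning step described above can be sketched in a few lines. This is a minimal, hypothetical illustration of the idea, not the authors' implementation: it assumes a binary problem, a trained ensemble whose per-instance vote counts are already available, and a margin threshold chosen elsewhere by maximizing the final classifier's F1-score. The function names (`margin`, `rfcl_clean`) and the toy data are our own.

```python
# Hypothetical sketch of the RFCL cleaning idea (not the authors' code).

def margin(votes_true, n_trees):
    """Ensemble vote margin in [-1, 1]: fraction of trees voting the
    instance's true class minus the fraction voting the other class."""
    return (2 * votes_true - n_trees) / n_trees

def rfcl_clean(X, y, margins, threshold, majority_label=0):
    """Drop majority-class instances whose margin falls below the
    threshold (i.e. those crossing the new classification boundary);
    keep every minority-class instance."""
    kept = [(x, yi) for x, yi, m in zip(X, y, margins)
            if yi != majority_label or m >= threshold]
    return [x for x, _ in kept], [yi for _, yi in kept]

# Toy example: four majority (label 0) and two minority (label 1)
# instances; the two low-margin majority instances are removed,
# reducing both imbalance and overlap near the boundary.
X = [[0.1], [0.4], [0.2], [0.6], [0.9], [0.8]]
y = [0, 0, 0, 0, 1, 1]
margins = [0.8, -0.2, 0.5, -0.6, 0.9, -0.1]
X_clean, y_clean = rfcl_clean(X, y, margins, threshold=0.0)
```

In the full method, this cleaning would be repeated for each candidate threshold, and the threshold giving the best F1-score of the retrained final classifier would be kept.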
Acknowledgements
The authors thank the UCI repository for providing the data sets, and the developers of the R language and its packages.
Cite this article
Zhang, R., Zhang, Z. & Wang, D. RFCL: A new under-sampling method of reducing the degree of imbalance and overlap. Pattern Anal Applic 24, 641–654 (2021). https://doi.org/10.1007/s10044-020-00929-x