Abstract
Imbalanced data arise in many domains, including medical science, the Internet, finance, and surveillance. Learning from imbalanced data, also called the imbalanced learning problem, remains a major challenge and deserves more attention. In this paper, we focus on class overlap, one of the most important inherent factors that hinder learning from imbalanced data. We introduce the overlapping degree (OD) and group data sets into two types: high OD (HOD) and low OD (LOD). Experiments show that LOD data sets can achieve good results without any under-sampling algorithm, even though some of them are highly imbalanced, and that under-sampling improves their results only marginally. We then propose a new under-sampling algorithm, the random forest cleaning rule (RFCL), which removes the majority-class instances that cross a new classification boundary defined by a margin threshold, thereby reducing both overlap and imbalance. The threshold is searched by maximizing the F1-score of the final classifier. Experimental results show that RFCL outperforms seven classic and two recent under-sampling methods in terms of F1-score and area under the curve, whether random forest or support vector machine is used as the final classifier.
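The cleaning step described above can be sketched in a few lines. This is a minimal, hypothetical illustration of the idea, not the authors' implementation: it assumes a binary problem, a trained ensemble whose per-instance vote counts are already available, and a margin threshold chosen elsewhere by maximizing the final classifier's F1-score. The function names (`margin`, `rfcl_clean`) and the toy data are our own.

```python
# Hypothetical sketch of the RFCL cleaning idea (not the authors' code).

def margin(votes_true, n_trees):
    """Ensemble vote margin in [-1, 1]: fraction of trees voting the
    instance's true class minus the fraction voting the other class."""
    return (2 * votes_true - n_trees) / n_trees

def rfcl_clean(X, y, margins, threshold, majority_label=0):
    """Drop majority-class instances whose margin falls below the
    threshold (i.e. those crossing the new classification boundary);
    keep every minority-class instance."""
    kept = [(x, yi) for x, yi, m in zip(X, y, margins)
            if yi != majority_label or m >= threshold]
    return [x for x, _ in kept], [yi for _, yi in kept]

# Toy example: four majority (label 0) and two minority (label 1)
# instances; the two low-margin majority instances are removed,
# reducing both imbalance and overlap near the boundary.
X = [[0.1], [0.4], [0.2], [0.6], [0.9], [0.8]]
y = [0, 0, 0, 0, 1, 1]
margins = [0.8, -0.2, 0.5, -0.6, 0.9, -0.1]
X_clean, y_clean = rfcl_clean(X, y, margins, threshold=0.0)
```

In the full method, this cleaning would be repeated for each candidate threshold, and the threshold giving the best F1-score of the retrained final classifier would be kept.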
Acknowledgements
The authors thank the UCI repository for providing the data sets, and the developers of the R language and its packages.
Cite this article
Zhang, R., Zhang, Z. & Wang, D. RFCL: A new under-sampling method of reducing the degree of imbalance and overlap. Pattern Anal Applic 24, 641–654 (2021). https://doi.org/10.1007/s10044-020-00929-x