robROSE: A robust approach for dealing with imbalanced data in fraud detection

Baesens, Bart; Höppner, Sebastiaan; Ortner, Irene; Verdonck, Tim

doi:10.1007/s10260-021-00573-7

Original Paper
Published: 07 June 2021

Statistical Methods & Applications, volume 30, pages 841–861 (2021)

Bart Baesens¹,
Sebastiaan Höppner²,
Irene Ortner³ &
Tim Verdonck ORCID: orcid.org/0000-0003-1105-2028⁴

Abstract

A major challenge in fraud detection is that fraudulent activities form a minority class that makes up a very small proportion of the data set. In most data sets, fraud typically occurs in less than \(0.5\%\) of the cases. Detecting fraud in such a highly imbalanced data set typically leads to predictions that favor the majority group, causing fraud to remain undetected. We discuss some popular oversampling techniques that address the problem of imbalanced data by creating synthetic samples that mimic the minority class. A frequent problem when analyzing real data is the presence of anomalies or outliers. When such atypical observations are present in the data, most oversampling techniques are prone to creating synthetic samples that distort the detection algorithm and spoil the resulting analysis. A useful tool for anomaly detection is robust statistics, which aims to find the outliers by first fitting the majority of the data and then flagging observations that deviate from it. In this paper, we present a robust version of ROSE, called robROSE, which combines several promising approaches to cope simultaneously with the problem of imbalanced data and the presence of outliers. The proposed method enhances the presence of the fraud cases while ignoring anomalies. The good performance of our new sampling technique is illustrated on simulated and real data sets, and it is shown that robROSE can provide better insight into the structure of the data. The source code of the robROSE algorithm is made freely available.
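
The idea summarized in the abstract, flagging atypical minority observations with a robust fit and then oversampling only the remaining ones, can be illustrated with a small Python sketch. This is not the authors' robROSE code. The function names, the coordinate-wise median/MAD rule (used here as a simple stand-in for a multivariate robust estimator), the cutoff of 3.0, and the rule-of-thumb bandwidth are all assumptions made for illustration.

```python
import numpy as np


def robust_inlier_mask(X, cutoff=3.0):
    """Return True for rows that look typical in every feature.

    Assumption: a coordinate-wise median/MAD rule stands in for a
    multivariate robust estimator; the cutoff is illustrative.
    """
    med = np.median(X, axis=0)
    # 1.4826 makes the MAD consistent with the standard deviation
    # under a normal distribution.
    mad = 1.4826 * np.median(np.abs(X - med), axis=0)
    mad = np.where(mad > 0, mad, 1.0)  # guard against zero spread
    z = np.abs(X - med) / mad
    return np.all(z <= cutoff, axis=1)


def smoothed_bootstrap(X, n_new, rng):
    """Draw n_new synthetic rows: resample rows, then add Gaussian noise.

    Assumption: a per-feature rule-of-thumb bandwidth for a Gaussian
    kernel; other bandwidth choices are possible.
    """
    n, d = X.shape
    h = (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4)) * X.std(axis=0, ddof=1)
    centers = X[rng.integers(0, n, size=n_new)]
    return centers + rng.standard_normal((n_new, d)) * h


def robust_oversample(X_min, n_new, seed=0):
    """Oversample the minority class using only rows flagged as inliers."""
    rng = np.random.default_rng(seed)
    mask = robust_inlier_mask(X_min)
    return smoothed_bootstrap(X_min[mask], n_new, rng), mask


# Hypothetical minority class: 50 fraud cases, 3 of them planted outliers.
data_rng = np.random.default_rng(42)
fraud = data_rng.normal(0.0, 1.0, size=(50, 2))
fraud[:3] = [[25.0, 25.0], [30.0, -28.0], [-27.0, 26.0]]

synthetic, mask = robust_oversample(fraud, n_new=200)
print(synthetic.shape, int(mask.sum()))
```

In this toy setup the three planted outliers are excluded before resampling, so no synthetic point is generated around them. A plain smoothed bootstrap over all 50 rows would instead place some synthetic fraud cases near the anomalies, which is the distortion the abstract warns about.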

Acknowledgements

This work was supported by the BNP Paribas Fortis Chair in Fraud Analytics and Internal Funds KU Leuven under Grant C16/15/068.

Author information

Authors and Affiliations

Faculty of Economics and Business, KU Leuven, Naamsestraat 69, 3000, Leuven, Belgium
Bart Baesens
Department of Mathematics, KU Leuven, Celestijnenlaan 200B, 3001, Leuven, Belgium
Sebastiaan Höppner
Applied Statistics GmbH, Taubstummengasse 4/10, 1040, Vienna, Austria
Irene Ortner
Department of Mathematics, University of Antwerp, Middelheimlaan 1, 2020, Antwerp, Belgium
Tim Verdonck

Corresponding author

Correspondence to Tim Verdonck.

About this article

Cite this article

Baesens, B., Höppner, S., Ortner, I. et al. robROSE: A robust approach for dealing with imbalanced data in fraud detection. Stat Methods Appl 30, 841–861 (2021). https://doi.org/10.1007/s10260-021-00573-7

Accepted: 24 May 2021
Published: 07 June 2021
Issue Date: September 2021
DOI: https://doi.org/10.1007/s10260-021-00573-7
