Abstract
A major challenge when trying to detect fraud is that the fraudulent activities form a minority class which make up a very small proportion of the data set. In most data sets, fraud occurs in typically less than \(0.5\%\) of the cases. Detecting fraud in such a highly imbalanced data set typically leads to predictions that favor the majority group, causing fraud to remain undetected. We discuss some popular oversampling techniques that solve the problem of imbalanced data by creating synthetic samples that mimic the minority class. A frequent problem when analyzing real data is the presence of anomalies or outliers. When such atypical observations are present in the data, most oversampling techniques are prone to create synthetic samples that distort the detection algorithm and spoil the resulting analysis. A useful tool for anomaly detection is robust statistics, which aims to find the outliers by first fitting the majority of the data and then flagging data observations that deviate from it. In this paper, we present a robust version of ROSE, called robROSE, which combines several promising approaches to cope simultaneously with the problem of imbalanced data and the presence of outliers. The proposed method achieves to enhance the presence of the fraud cases while ignoring anomalies. The good performance of our new sampling technique is illustrated on simulated and real data sets and it is shown that robROSE can provide better insight in the structure of the data. The source code of the robROSE algorithm is made freely available.
Similar content being viewed by others
References
Bahnsen Alejandro Correa, Stojanovic Aleksandar, Aouada Djamila, Ottersten Björn (2013) Cost sensitive credit card fraud detection using bayes minimum risk. In 2013 12th international conference on machine learning and applications, volume 1, pages 333–338. IEEE
Barua Sukarna, Islam Md Monirul, Yao Xin, Murase Kazuyuki (2012) Mwmote–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans Knowl Data Eng 26(2):405–425
Bowman Adrian W, Azzalini Adelchi (1997) Applied smoothing techniques for data analysis: the kernel approach with S-Plus illustrations, volume 18. OUP Oxford
Breiman Leo, Friedman Jerome, Olshen Richard, Stone Charles (1984) Classification and regression trees. wadsworth int. Group 37(15):237–251
Cantoni Eva, Ronchetti Elvezio (2001) Robust inference for generalized linear models. J Am Statistical Assoc 96(455):1022–1030
Cerioli Andrea, Perrotta Domenico (2014) Robust clustering around regression lines with high density regions. Adv Data Anal Classification 8(1):5–26
Chawla Nitesh V, Bowyer Kevin W, Hall Lawrence O, Kegelmeyer W Philip (2002) Smote: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
Davis Jesse, Goadrich Mark (2006) The relationship between precision-recall and roc curves. In Proceedings of the 23rd international conference on Machine learning, pages 233–240. ACM
Fawcett Tom (2004) Roc graphs: Notes and practical considerations for researchers. Mach Learn 31(1):1–38
Fawcett Tom (2006) An introduction to roc analysis. Patt Recog Lett 27(8):861–874
Han Hui, Wang Wen-Yuan, Mao Bing-Huan (2005) Borderline-smote: a new over-sampling method in imbalanced data sets learning. In International conference on intelligent computing, pages 878–887. Springer
Hand David J, Whitrow Christopher, Adams Niall M, Juszczak Piotr, Weston Dave (2008) Performance criteria for plastic card fraud detection tools. J Operational Res Soc 59(7):956–962
He Haibo, Bai Yang, Garcia Edwardo A, Li Shutao (2008) Adasyn: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pages 1322–1328. IEEE
He Haibo, Garcia Edwardo A (2009) Learning from imbalanced data. IEEE Trans knowl Data Eng 21(9):1263–1284
Holte Robert C, Acker Liane, Porter Bruce W, et al (1989) Concept learning and the problem of small disjuncts. In IJCAI, volume 89, pages 813–818. Citeseer
Krawczyk Bartosz (2016) Learning from imbalanced data: open challenges and future directions. Prog Artif Intell 5(4):221–232
Krzanowski Wojtek J, Hand David J (2009) ROC curves for continuous data. Chapman and Hall/CRC
Ling Charles X, Huang Jin, Zhang Harry, et al. (2003) Auc: a statistically consistent and more discriminating measure than accuracy. In Ijcai, volume 3, pages 519–524
Liu Xu-Ying, Wu Jianxin, Zhou Zhi-Hua (2008) Exploratory undersampling for class-imbalance learning. IEEE Trans Syst, Man, Cybernetics, Part B (Cybernetics) 39(2):539–550
Maechler M, Rousseeuw PJ, Croux C, Todorov V, Ruckstuhl A, Salibian-Barrera M, Verbeke T, Koller M, Conceicao ELT, Anna di Palma M (2018) robustbase: Basic Robust Statistics. R package version 0.93-3
Marqués Ana Isabel, García Vicente, Sánchez José Salvador (2013) On the suitability of resampling techniques for the class imbalance problem in credit scoring. J Operational Res Soci 64(7):1060–1070
Menardi Giovanna, Torelli Nicola (2014) Rose: random over-sampling examples. Data Min Knowl Dis 28(1):92–122
Ngai Eric WT, Hu Yong, Wong Yiu Hing, Chen Yijun, Sun Xin (2011) The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Syst 50(3):559–569
Phua Clifton, Lee Vincent, Smith Kate, Gayler Ross (2010) A comprehensive survey of data mining-based fraud detection research. arXiv preprint arXiv:1009.6119
Provost F Fawcett T, kohavi r (1998) the case against accuracy estimation for comparing classifiers. In Proceedings of the Fifteenth International Conference on Machine Learning,
Rousseeuw Peter J, Driessen Katrien Van (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41(3):212–223
Swets John A (2014) Signal detection theory and ROC analysis in psychology and diagnostics: Collected papers. Psychology Press,
Valdora Marina, Yohai Víctor J (2014) Robust estimators for generalized linear models. J Statistical Plan Inference 146:31–48
Van Vlasselaer Véronique, Eliassi-Rad Tina, Akoglu Leman, Snoeck Monique, Baesens Bart (2016) Gotcha! network-based fraud detection for social security fraud. Manag Sci 63(9):3090–3110
Weiss Gary M, Provost Foster (2001) The effect of class distribution on classifier learning: an empirical study. Technical Report ML- TR-43, Dept. of Computer Science, Rutgers Univ
Zhu Bing, Baesens Bart, Broucke Seppe KLM vanden (2017) An empirical comparison of techniques for the class imbalance problem in churn prediction. Inform Sci 408:84–99
Zhu Bing, Gao Zihan, Zhao Junkai, Broucke Seppe KLM vanden (2019) Iric: An r library for binary imbalanced classification. SoftwareX 10:100341
Acknowledgements
This work was supported by the BNP Paribas Fortis Chair in Fraud Analytics and Internal Funds KU Leuven under Grant C16/15/068.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Baesens, B., Höppner, S., Ortner, I. et al. robROSE: A robust approach for dealing with imbalanced data in fraud detection. Stat Methods Appl 30, 841–861 (2021). https://doi.org/10.1007/s10260-021-00573-7
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10260-021-00573-7