robROSE: A robust approach for dealing with imbalanced data in fraud detection

  • Original Paper
  • Statistical Methods & Applications

Abstract

A major challenge in fraud detection is that fraudulent activities form a minority class that makes up a very small proportion of the data set. In most data sets, fraud typically occurs in less than \(0.5\%\) of the cases. Training a detection model on such highly imbalanced data typically leads to predictions that favor the majority class, so fraud remains undetected. We discuss some popular oversampling techniques that address the problem of imbalanced data by creating synthetic samples that mimic the minority class. A frequent problem when analyzing real data is the presence of anomalies or outliers. When such atypical observations are present, most oversampling techniques are prone to creating synthetic samples that distort the detection algorithm and spoil the resulting analysis. A useful tool for anomaly detection is robust statistics, which aims to find outliers by first fitting the majority of the data and then flagging observations that deviate from that fit. In this paper, we present a robust version of ROSE, called robROSE, which combines several promising approaches to cope simultaneously with imbalanced data and the presence of outliers. The proposed method succeeds in enhancing the presence of the fraud cases while ignoring anomalies. The good performance of our new sampling technique is illustrated on simulated and real data sets, and it is shown that robROSE can provide better insight into the structure of the data. The source code of the robROSE algorithm is made freely available.
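
The idea of combining outlier flagging with oversampling can be illustrated with a minimal sketch. It is not the robROSE implementation: the outlier rule below is a simple coordinate-wise median/MAD score used as a stand-in for whatever robust estimator the paper employs, and the bandwidth is a Gaussian-reference rule assumed here for the smoothed bootstrap. The function names `flag_outliers` and `smoothed_bootstrap`, the cutoff of 3, and the toy data are all hypothetical.

```python
import numpy as np

def flag_outliers(X, cutoff=3.0):
    """Flag rows whose robust z-score (median/MAD) exceeds `cutoff` in any coordinate.

    Stand-in for a proper robust multivariate fit (assumption, not the paper's rule).
    """
    med = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - med), axis=0)
    mad = np.where(mad > 0, mad, 1.0)  # guard against zero spread
    z = np.abs(X - med) / mad
    return (z > cutoff).any(axis=1)

def smoothed_bootstrap(X, n_new, rng):
    """Draw n_new synthetic points: resample rows, then add Gaussian kernel noise."""
    n, d = X.shape
    # Assumed Gaussian-reference bandwidth, one value per coordinate.
    h = (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4)) * X.std(axis=0, ddof=1)
    idx = rng.integers(0, n, size=n_new)
    return X[idx] + rng.normal(size=(n_new, d)) * h

rng = np.random.default_rng(0)

# Toy minority ("fraud") class with two planted anomalies.
fraud = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(30, 2))
fraud[:2] = [[15.0, -12.0], [-14.0, 16.0]]

is_outlier = flag_outliers(fraud)
clean = fraud[~is_outlier]

# Oversample only from the non-flagged fraud cases.
synthetic = smoothed_bootstrap(clean, n_new=200, rng=rng)
```

Because the synthetic points are generated around the non-flagged cases only, the planted anomalies do not seed new samples. Oversampling the unfiltered `fraud` array instead would place a share of the synthetic points around the two anomalies.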

Acknowledgements

This work was supported by the BNP Paribas Fortis Chair in Fraud Analytics and Internal Funds KU Leuven under Grant C16/15/068.

Author information

Corresponding author

Correspondence to Tim Verdonck.

About this article

Cite this article

Baesens, B., Höppner, S., Ortner, I. et al. robROSE: A robust approach for dealing with imbalanced data in fraud detection. Stat Methods Appl 30, 841–861 (2021). https://doi.org/10.1007/s10260-021-00573-7

  • DOI: https://doi.org/10.1007/s10260-021-00573-7
