robROSE: A robust approach for dealing with imbalanced data in fraud detection
Statistical Methods & Applications (IF 1.1), Pub Date: 2021-06-07, DOI: 10.1007/s10260-021-00573-7
Bart Baesens, Sebastiaan Höppner, Irene Ortner, Tim Verdonck

A major challenge in fraud detection is that fraudulent activities form a minority class that makes up a very small proportion of the data set. In most data sets, fraud typically occurs in less than \(0.5\%\) of the cases. Detecting fraud in such a highly imbalanced data set typically leads to predictions that favor the majority group, so fraud remains undetected. We discuss some popular oversampling techniques that address the problem of imbalanced data by creating synthetic samples that mimic the minority class. A frequent problem when analyzing real data is the presence of anomalies or outliers. When such atypical observations are present in the data, most oversampling techniques are prone to creating synthetic samples that distort the detection algorithm and spoil the resulting analysis. A useful tool for anomaly detection is robust statistics, which aims to find the outliers by first fitting the majority of the data and then flagging observations that deviate from that fit. In this paper, we present a robust version of ROSE, called robROSE, which combines several promising approaches to cope simultaneously with the problem of imbalanced data and the presence of outliers. The proposed method enhances the presence of the fraud cases while ignoring anomalies. The good performance of our new sampling technique is illustrated on simulated and real data sets, and it is shown that robROSE can provide better insight into the structure of the data. The source code of the robROSE algorithm is made freely available.
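
The two ideas in the abstract (flag atypical minority observations with a robust fit, then create synthetic minority samples around the remaining ones) can be illustrated with a minimal sketch. This is not the authors' robROSE implementation. It assumes a coordinate-wise median and MAD as a stand-in robust fit and Gaussian noise as the smoothing kernel, and the function names `robust_scale` and `robust_oversample` and the parameters `cutoff` and `shrink` are hypothetical.

```python
import random
import statistics


def robust_scale(values):
    """Return the median and the MAD of one feature.

    The MAD is scaled by 1.4826 so that it estimates the standard
    deviation when the data are normally distributed.
    """
    med = statistics.median(values)
    mad = 1.4826 * statistics.median(abs(v - med) for v in values)
    # Fall back to 1.0 if the feature is (almost) constant.
    return med, (mad if mad > 0 else 1.0)


def robust_oversample(minority, n_new, cutoff=3.0, shrink=0.5, seed=0):
    """Illustrative robust oversampling (an assumption, not robROSE itself).

    minority: list of rows (lists of floats) from the minority class.
    Returns (synthetic_rows, keep), where keep[i] is False when row i
    was flagged as an outlier and excluded from resampling.
    """
    rng = random.Random(seed)
    # Robust location and scale for each feature of the minority class.
    stats = [robust_scale(col) for col in zip(*minority)]

    def is_regular(row):
        # A row is regular if every feature lies within `cutoff`
        # robust standard deviations of the feature median.
        return all(abs(v - med) / mad <= cutoff
                   for v, (med, mad) in zip(row, stats))

    keep = [is_regular(row) for row in minority]
    clean = [row for row, ok in zip(minority, keep) if ok]
    if not clean:
        raise ValueError("every minority observation was flagged as an outlier")

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(clean)  # resample a non-outlying minority case
        # Smooth the resampled case with noise proportional to the robust scale.
        synthetic.append([v + rng.gauss(0.0, shrink * mad)
                          for v, (_med, mad) in zip(base, stats)])
    return synthetic, keep
```

Because outliers are dropped before resampling, no synthetic sample is centered on a flagged observation. A coordinate-wise rule like this one ignores correlations between features, so a multivariate robust estimator would be a more faithful choice in practice.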




Updated: 2021-06-07