LoRAS: an oversampling approach for imbalanced datasets
Machine Learning ( IF 4.3 ) Pub Date : 2020-11-12 , DOI: 10.1007/s10994-020-05913-4
Saptarshi Bej , Narek Davtyan , Markus Wolfien , Mariam Nassar , Olaf Wolkenhauer

The Synthetic Minority Oversampling TEchnique (SMOTE) is widely used for the analysis of imbalanced datasets. It is known that SMOTE frequently over-generalizes the minority class, leading to misclassifications of majority-class samples and affecting the overall balance of the model. In this article, we present an approach that overcomes this limitation of SMOTE, employing Localized Random Affine Shadowsampling (LoRAS) to oversample from an approximated data manifold of the minority class. We benchmarked our algorithm on 14 publicly available imbalanced datasets using three different Machine Learning (ML) algorithms, comparing the performance of LoRAS against SMOTE and several SMOTE extensions that share with LoRAS the concept of oversampling via convex combinations of minority-class data points. We observed that LoRAS, on average, generates better ML models in terms of F1-Score and Balanced accuracy. Another key observation is that while most of the SMOTE extensions we tested improve the F1-Score relative to SMOTE on average, they compromise the Balanced accuracy of the classification model. LoRAS, in contrast, improves both F1-Score and Balanced accuracy, and thus produces better classification models. Moreover, to explain the success of the algorithm, we constructed a mathematical framework to prove that the LoRAS oversampling technique provides a better estimate of the mean of the underlying local data distribution of the minority-class data space.
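To illustrate the distinction the abstract draws, the following is a minimal sketch (not the authors' implementation) contrasting SMOTE-style interpolation, which places a synthetic point on the segment between one minority sample and a single neighbor, with a LoRAS-style convex combination over a whole local neighborhood, which concentrates synthetic points nearer the local mean. The function names are illustrative, and the sketch omits LoRAS's Gaussian-noise "shadowsamples" for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like_sample(x, neighbors):
    """SMOTE-style: interpolate between x and one randomly chosen neighbor."""
    nb = neighbors[rng.integers(len(neighbors))]
    lam = rng.random()  # position on the segment [x, nb]
    return x + lam * (nb - x)

def loras_like_sample(x, neighbors):
    """LoRAS-style (simplified): a random convex combination of x and its
    whole local neighborhood, i.e. affine weights drawn from a Dirichlet
    distribution, so they are non-negative and sum to 1."""
    pts = np.vstack([x, neighbors])
    w = rng.dirichlet(np.ones(len(pts)))
    return w @ pts

# Toy minority class: three points in 2D.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
x, neighbors = minority[0], minority[1:]
print(smote_like_sample(x, neighbors))  # lies on an edge of the triangle
print(loras_like_sample(x, neighbors))  # lies inside the triangle
```

Because the LoRAS-style weights are convex, each synthetic point is an average over many neighborhood points, which is the intuition behind the paper's claim that LoRAS better estimates the local mean of the minority-class distribution.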

Updated: 2020-11-12