Severely imbalanced Big Data challenges: investigating data sampling approaches
Journal of Big Data (IF 8.1), Pub Date: 2019-11-30, DOI: 10.1186/s40537-019-0274-4
Tawfiq Hasanin, Taghi M. Khoshgoftaar, Joffrey L. Leevy, Richard A. Bauder

Severe class imbalance between majority and minority classes in Big Data can bias the predictive performance of Machine Learning algorithms toward the majority (negative) class. Where the minority (positive) class holds greater value than the majority (negative) class and the occurrence of false negatives incurs a greater penalty than false positives, the bias may lead to adverse consequences. Our paper incorporates two case studies, each utilizing three learners, six sampling approaches, two performance metrics, and five sampled distribution ratios, to uniquely investigate the effect of severe class imbalance on Big Data analytics. The learners (Gradient-Boosted Trees, Logistic Regression, Random Forest) were implemented within the Apache Spark framework. The first case study is based on a Medicare fraud detection dataset. The second case study, unlike the first, includes training data from one source (SlowlorisBig Dataset) and test data from a separate source (POST dataset). Results from the Medicare case study are not conclusive regarding the best sampling approach using Area Under the Receiver Operating Characteristic Curve and Geometric Mean performance metrics. However, it should be noted that the Random Undersampling approach performs adequately in the first case study. For the SlowlorisBig case study, Random Undersampling convincingly outperforms the other five sampling approaches (Random Oversampling, Synthetic Minority Over-sampling TEchnique, SMOTE-borderline1, SMOTE-borderline2, ADAptive SYNthetic) when measuring performance with Area Under the Receiver Operating Characteristic Curve and Geometric Mean metrics. Based on its classification performance in both case studies, Random Undersampling is the best choice as it results in models built from a significantly smaller number of samples, thus reducing computational burden and training time.
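The abstract does not include code, but the two techniques it leans on most, Random Undersampling and the Geometric Mean metric, are simple to state. The following is an illustrative plain-Python sketch, not the authors' Spark implementation; the function names, the `ratio` parameter, and the 100:1 toy imbalance are assumptions for illustration only.

```python
import math
import random

def random_undersample(samples, labels, ratio=1.0, seed=42):
    """Randomly discard majority-class (label 0) samples until
    len(majority) == ratio * len(minority). ratio=1.0 gives 50:50."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    keep = rng.sample(majority, int(ratio * len(minority)))
    idx = sorted(minority + keep)
    return [samples[i] for i in idx], [labels[i] for i in idx]

def geometric_mean(y_true, y_pred):
    """G-mean = sqrt(TPR * TNR); unlike accuracy, it stays low when
    either class is misclassified, which suits imbalanced data."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)

# Toy dataset: 1000 negatives, 10 positives (a severe 100:1 imbalance).
X = list(range(1010))
y = [0] * 1000 + [1] * 10
Xb, yb = random_undersample(X, y, ratio=1.0)
print(len(Xb), sum(yb))  # 20 samples total, 10 of them positive
```

Note how undersampling shrinks the training set (here from 1010 samples to 20), which is exactly the source of the reduced computational burden and training time the paper attributes to Random Undersampling; oversampling methods such as SMOTE instead grow the minority class and therefore the dataset.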
