Investigating class rarity in big data
Journal of Big Data ( IF 8.1 ) Pub Date : 2020-03-16 , DOI: 10.1186/s40537-020-00301-0
Tawfiq Hasanin , Taghi M. Khoshgoftaar , Joffrey L. Leevy , Richard A. Bauder

In machine learning, when one class has a significantly larger number of instances (the majority) than the other (the minority), this condition is defined as class imbalance. Class imbalance can bias the predictive capabilities of machine learning algorithms towards the majority (negative) class, and in situations where false negatives incur a greater penalty than false positives, this bias may lead to adverse consequences. Our paper incorporates two case studies, each utilizing three learners (gradient-boosted trees, logistic regression, random forest) and three performance metrics (Area Under the Receiver Operating Characteristic Curve, Area Under the Precision-Recall Curve, Geometric Mean) to investigate class rarity in big data. Class rarity, a notably extreme degree of class imbalance, was effected in our experiments by randomly removing minority (positive) instances to artificially generate eight subsets of gradually decreasing positive class instances. All model evaluations were performed through cross-validation. In the first case study, which uses a Medicare Part B dataset, performance scores for the learners generally improve with the Area Under the Receiver Operating Characteristic Curve metric as the rarity level decreases, while corresponding scores with the Area Under the Precision-Recall Curve and Geometric Mean metrics show no improvement. In the second case study, which uses a dataset built from Distributed Denial of Service (DDoS) attack data (POSTSlowloris Combined), the Area Under the Receiver Operating Characteristic Curve metric produces very high performance scores for the learners across all subsets of positive class instances. For the second study, scores for the learners generally improve with the Area Under the Precision-Recall Curve and Geometric Mean metrics as the rarity level decreases. Overall, with regard to both case studies, the gradient-boosted trees (GBT) learner performs the best.
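The experimental design described above can be sketched as follows. This is a minimal illustration, assuming scikit-learn; the synthetic dataset, the positive-instance counts, and the 0.5 decision threshold are placeholders, not the paper's Medicare or DDoS data or its exact protocol.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             roc_auc_score)
from sklearn.model_selection import cross_val_predict

# Illustrative imbalanced dataset (stand-in for the paper's real data).
X, y = make_classification(n_samples=4000, n_features=20,
                           weights=[0.975, 0.025], random_state=42)

rng = np.random.default_rng(42)
pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

# Hypothetical rarity levels: progressively fewer positive instances,
# obtained by randomly removing minority-class rows (all negatives kept).
for n_pos in (80, 40, 20):
    keep = np.concatenate([neg_idx,
                           rng.choice(pos_idx, n_pos, replace=False)])
    Xs, ys = X[keep], y[keep]

    # Cross-validated probability scores for a gradient-boosted trees learner.
    scores = cross_val_predict(
        GradientBoostingClassifier(n_estimators=50, random_state=42),
        Xs, ys, cv=5, method="predict_proba")[:, 1]

    auc = roc_auc_score(ys, scores)                 # AUC (ROC)
    auprc = average_precision_score(ys, scores)     # AUPRC
    tn, fp, fn, tp = confusion_matrix(ys, scores >= 0.5).ravel()
    gmean = np.sqrt(tp / (tp + fn) * tn / (tn + fp))  # sqrt(TPR * TNR)
    print(f"positives={n_pos:3d}  AUC={auc:.3f}  "
          f"AUPRC={auprc:.3f}  GM={gmean:.3f}")
```

The same loop would be repeated for the other two learners (logistic regression, random forest); the paper's eight subsets are truncated to three here for brevity.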




Updated: 2020-04-21