The Effects of Data Sampling with Deep Learning and Highly Imbalanced Big Data
Information Systems Frontiers (IF 6.9), Pub Date: 2020-06-21, DOI: 10.1007/s10796-020-10022-7
Justin M. Johnson, Taghi M. Khoshgoftaar

Training predictive models with class-imbalanced data has proven to be a difficult task. This problem is well studied, but the era of big data is producing more extreme levels of imbalance that are increasingly difficult to model. We use three data sets of varying complexity to evaluate data sampling strategies for treating high class imbalance with deep neural networks and big data. Sampling rates are varied to create training distributions with positive class sizes from 0.025% to 90%. The area under the receiver operating characteristic (ROC) curve is used to compare performance, and thresholding is used to maximize class performance. Random over-sampling (ROS) consistently outperforms random under-sampling (RUS) and baseline methods. The majority class proves susceptible to misrepresentation when using RUS, and results suggest that each data set is uniquely sensitive to imbalance and sample size. The hybrid ROS-RUS maximizes performance and efficiency, and is our preferred method for treating high imbalance within big data problems.
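The three sampling strategies named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `resample_indices`, its parameters, and the choice to realize the hybrid as "under-sample the majority first, then over-sample the minority" are all assumptions made for illustration. Labels are assumed to be binary, with 1 marking the positive (minority) class.

```python
import numpy as np

def resample_indices(y, target_pos_frac, method="ros", rus_keep=0.5, seed=0):
    """Return shuffled row indices whose positive fraction is ~target_pos_frac.

    Hypothetical helper for illustration only.
    method="rus":     randomly drop negatives.
    method="ros":     randomly duplicate positives.
    method="ros-rus": keep a `rus_keep` fraction of negatives, then
                      duplicate positives (assumed form of the hybrid).
    """
    if not 0.0 < target_pos_frac < 1.0:
        raise ValueError("target_pos_frac must be strictly between 0 and 1")
    if method not in {"ros", "rus", "ros-rus"}:
        raise ValueError(f"unknown method: {method}")

    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)

    if method == "ros-rus":
        # First stage of the hybrid: shrink the majority class.
        n_keep = max(1, int(round(rus_keep * len(neg))))
        neg = rng.choice(neg, size=n_keep, replace=False)

    if method == "rus":
        # Negatives needed so that pos / (pos + neg) == target_pos_frac.
        n_neg = int(round(len(pos) * (1 - target_pos_frac) / target_pos_frac))
        n_neg = min(n_neg, len(neg))  # cannot keep more than exist
        neg = rng.choice(neg, size=n_neg, replace=False)
    else:
        # Positives needed so that pos / (pos + neg) == target_pos_frac.
        n_pos = int(round(len(neg) * target_pos_frac / (1 - target_pos_frac)))
        n_pos = max(n_pos, len(pos))  # never discard real positives
        extra = rng.choice(pos, size=n_pos - len(pos), replace=True)
        pos = np.concatenate([pos, extra])

    idx = np.concatenate([pos, neg])
    rng.shuffle(idx)
    return idx
```

With 100 positives among 10,000 rows and a 50% target, this sketch yields 19,800 rows for ROS, 200 rows for RUS, and 9,900 rows for the hybrid at `rus_keep=0.5`. That size difference is one way to read the abstract's efficiency argument: RUS discards most of the majority class, ROS roughly doubles the training set, and the hybrid sits between the two.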

Updated: 2020-06-21