Imbalanced data classification based on diverse sample generation and classifier fusion
International Journal of Machine Learning and Cybernetics (IF 5.6), Pub Date: 2021-04-12, DOI: 10.1007/s13042-021-01321-9
Junhai Zhai, Jiaxing Qi, Sufang Zhang

Class imbalance problems are pervasive in real-world applications, yet classifying imbalanced data remains a very challenging task in machine learning. SMOTE is the most influential oversampling approach, and many variants have been built on it. However, SMOTE and its variants have three drawbacks: (1) the probability distribution of the minority class samples is not considered; (2) the generated minority class samples lack diversity; (3) the generated minority class samples overlap severely when oversampling is repeated many times to balance them against the majority class samples. To overcome these three drawbacks, this paper proposes a framework based on generative adversarial networks (GAN). The framework comprises an oversampling method and a two-class imbalanced data classification approach. The oversampling method is based on an improved GAN model. The classification approach is based on classifier fusion via fuzzy integral, which can model the interactions among the base classifiers trained on the balanced data subsets constructed by the proposed oversampling method. Extensive experiments compare the proposed methods with related methods on five measures: MMD-score, Silhouette-score, F-measure, G-means, and AUC-area. The experimental results demonstrate that the proposed methods are more effective and efficient than the compared approaches.
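
For readers unfamiliar with the baseline the abstract criticizes, the following is a minimal sketch of standard SMOTE-style interpolation together with two of the listed evaluation measures, F-measure and G-mean. It is not the GAN-based method proposed in the paper. The function names, parameters, and toy data are hypothetical and chosen only for illustration. Because every synthetic point lies on a segment between two existing minority samples, the sketch suggests why such samples may lack diversity and may overlap when many are generated.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation.

    Hypothetical helper for illustration; assumes at least two minority
    samples. Each synthetic point lies on the segment between a minority
    sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base sample (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def f_measure_and_g_mean(y_true, y_pred, positive=1):
    """Return (F-measure, G-mean), treating `positive` as the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    # G-mean: geometric mean of minority recall and majority specificity
    return f_measure, math.sqrt(recall * specificity)


# Toy usage: four minority samples at the corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = smote_like_oversample(minority, n_new=20, k=3)
f_measure, g_mean = f_measure_and_g_mean([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])
```

In this toy setup every synthetic point stays inside the square spanned by the original minority samples, since interpolation cannot leave their convex hull.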




Updated: 2021-04-12