Imbalanced Data Classification Algorithm Based on Clustering and SVM
Journal of Circuits, Systems and Computers (IF 0.9), Pub Date: 2020-05-20, DOI: 10.1142/s0218126621500365
Bo Huang, Yimin Zhu, Zhongzhen Wang, Zhijun Fang

Class-imbalance learning is one of the most significant research topics in data mining and machine learning. The imbalance problem arises when one class has far more samples than the others. To address low classification accuracy and high time complexity, this paper proposes a novel imbalanced data classification algorithm based on clustering and SVM. The algorithm under-samples the majority class according to the distribution characteristics of the minority samples. First, specific clusters are detected by cluster analysis on the minority class. Second, a cluster boundary strategy is proposed to eliminate the adverse influence of noise samples. To construct a balanced dataset from the imbalanced data, the paper proposes three principles for under-sampling the majority samples according to the characteristics of the samples in each cluster. Finally, the optimal classification model is obtained from the linear combination of the hybrid-kernel SVM. Experiments on datasets from the UCI and KEEL repositories show that the algorithm effectively reduces the interference of noise samples. Compared with SMOTE and Fast-CBUS, the proposed algorithm not only reduces the feature dimension but also generally improves the precision on the minority classes under different labeled sample rates.
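
The abstract does not spell out the three under-sampling principles or the kernels being combined, so the following is only a rough sketch of the general idea, not the authors' method. It assumes k-means for the minority clustering, a hypothetical selection rule (keep the majority samples nearest to any minority cluster center), and a hybrid kernel read as a weighted sum of an RBF and a polynomial kernel. All function names and parameters are illustrative.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns k cluster centers as tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; keep the old
        # center if a group ends up empty.
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers


def undersample_majority(majority, minority, k=2, seed=0):
    """ASSUMED rule, not the paper's three principles: keep as many
    majority samples as there are minority samples, preferring those
    closest to any minority cluster center."""
    centers = kmeans(minority, k, seed=seed)
    ranked = sorted(majority, key=lambda p: min(sq_dist(p, c) for c in centers))
    return ranked[: len(minority)]


def hybrid_kernel(x, y, weight=0.5, gamma=1.0, degree=2, coef0=1.0):
    """ASSUMED form: weight * RBF + (1 - weight) * polynomial kernel."""
    rbf = math.exp(-gamma * sq_dist(x, y))
    poly = (sum(a * b for a, b in zip(x, y)) + coef0) ** degree
    return weight * rbf + (1.0 - weight) * poly


minority = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
majority = [(0.2, 0.1), (5.2, 5.1), (10.0, 10.0), (20.0, 20.0),
            (-10.0, -10.0), (2.5, 2.5), (0.0, 0.2), (5.0, 5.2)]
balanced_majority = undersample_majority(majority, minority, k=2)
```

With `weight` between 0 and 1 the combination stays a valid (positive semi-definite) kernel, so a function of this shape could in principle be passed to any SVM solver that accepts a custom kernel; choosing `weight` would be part of model selection.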

Updated: 2020-05-20