An oversampling algorithm combining SMOTE and k-means for imbalanced medical data, Information Sciences



An oversampling algorithm combining SMOTE and k-means for imbalanced medical data
Information Sciences (IF 8.1), Pub Date: 2021-02-26, DOI: 10.1016/j.ins.2021.02.056
Zhaozhao Xu, Derong Shen, Tiezheng Nie, Yue Kou, Nan Yin, Xi Han

The C4.5 decision tree algorithm offers high classification accuracy, fast computation, and comprehensible classification rules, so it is widely used for medical data analysis. However, on imbalanced medical data, the classification accuracy of decision-tree-based models is not ideal. This paper therefore proposes a cluster-based oversampling algorithm (KNSMOTE) that combines the synthetic minority oversampling technique (SMOTE) with the k-means algorithm. The class labels assigned by k-means clustering are compared with the original class labels to select the “safe samples,” whose class labels have not changed. The “safe samples” are then linearly interpolated to synthesize new samples. The improved SMOTE sets the oversampling ratio according to the imbalance ratio of the original samples, so that the number of synthesized samples equals the number of original samples. Compared with other oversampling algorithms on six UCI datasets, the algorithm achieves significant advantages. Applied to medical datasets, it yields average Sensitivity and Specificity values of 97.58% and 97.12%, respectively, for the Random Forest (RF) algorithm.
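
The pipeline described in the abstract (cluster, keep the "safe samples", interpolate) can be illustrated with a rough Python sketch. This is not the authors' KNSMOTE implementation: the function names, the farthest-first initialisation, the majority-vote mapping from clusters to classes, the neighbour count, and the choice to synthesize just enough samples to balance the two classes are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random
from collections import Counter


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means; returns a cluster index for each point."""
    # Assumption: farthest-first initialisation, to keep the sketch deterministic.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        assign = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def safe_samples(points, labels, k=2):
    """Keep samples whose cluster's majority label matches their own label."""
    assign = kmeans(points, k)
    majority = {}
    for c in set(assign):
        votes = Counter(l for l, a in zip(labels, assign) if a == c)
        majority[c] = votes.most_common(1)[0][0]  # assumed cluster-to-class mapping
    return [(p, l) for p, l, a in zip(points, labels, assign) if majority[a] == l]


def oversample(points, labels, minority, n_neighbors=3, seed=0):
    """Interpolate between safe minority samples until the classes are balanced."""
    rng = random.Random(seed)
    safe = [p for p, l in safe_samples(points, labels) if l == minority]
    if len(safe) < 2:
        raise ValueError("need at least two safe minority samples to interpolate")
    counts = Counter(labels)
    # Assumption: synthesize just enough samples to match the largest class.
    n_new = max(counts.values()) - counts[minority]
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(safe)
        # Nearest safe minority neighbours of x, excluding x itself.
        others = sorted((q for q in safe if q is not x), key=lambda q: math.dist(x, q))
        nb = rng.choice(others[:n_neighbors])
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Toy data: 8 majority points near the origin, 3 minority points near (10, 10),
# and one minority-labelled point sitting inside the majority cluster.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (0.2, 0.8),
          (0.8, 0.2), (0.5, 0.1), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (0.4, 0.6)]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
new_points = oversample(points, labels, minority=1)
```

In this toy example the minority-labelled point at (0.4, 0.6) falls into a cluster dominated by the majority class, so it is not treated as safe and never serves as an interpolation endpoint. Every synthetic point therefore lies between two of the minority samples near (10, 10).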


Updated: 2021-02-26
