Breast Cancer Diagnosis Using Cluster-based Undersampling and Boosted C5.0 Algorithm
International Journal of Control, Automation and Systems (IF 2.5), Pub Date: 2021-02-18, DOI: 10.1007/s12555-019-1061-x
Jue Zhang, Li Chen, Jian-xue Tian, Fazeel Abid, Wusi Yang, Xiao-fen Tang

Learning from imbalanced data sets is a relatively new challenge for breast cancer diagnosis, where disease cases are often rare relative to the normal population. Traditional algorithms are accuracy-oriented, which biases their results towards the majority class. Combinations of sampling methods with ensemble classifiers have shown good performance. In this paper, a hybrid of cluster-based undersampling and boosted C5.0 is proposed. The proposed classification model consists of two phases: cluster analysis and classification. In the cluster analysis phase, the affinity propagation algorithm is used to determine the number of clusters, and k-means clustering is then used to select the border and informative samples. In the classification phase, the C5.0 algorithm is used in conjunction with boosting in order to leverage the strength of the individual classifiers. The proposed algorithm is assessed on 14 benchmark imbalanced data sets taken from the UCI dataset repository. Extensive experimental results on these imbalanced data sets demonstrate that the proposed algorithm achieves better classification performance in terms of the Matthews correlation coefficient (MCC) than other existing classification algorithms for imbalanced data sets.
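
To illustrate why the abstract evaluates with MCC rather than accuracy, the following is a minimal sketch in plain Python. It is not the authors' code: the function names, the toy labels (95 negative cases, 5 positive cases), and the two hypothetical classifiers are assumptions made only for illustration. MCC is computed from the confusion-matrix counts as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), with the common convention of returning 0 when the denominator is 0.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary labelling."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient, in [-1, 1]."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: MCC is taken as 0 when any marginal count is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical toy data: 95 majority-class (0) and 5 minority-class (1) cases.
y_true = [0] * 95 + [1] * 5

# Classifier A (hypothetical): always predicts the majority class.
pred_majority = [0] * 100

# Classifier B (hypothetical): 10 false alarms, finds 4 of the 5 positives.
pred_balanced = [1] * 10 + [0] * 85 + [1] * 4 + [0] * 1

acc_a, mcc_a = accuracy(y_true, pred_majority), mcc(y_true, pred_majority)
acc_b, mcc_b = accuracy(y_true, pred_balanced), mcc(y_true, pred_balanced)

print(f"A: accuracy={acc_a:.2f}, MCC={mcc_a:.2f}")
print(f"B: accuracy={acc_b:.2f}, MCC={mcc_b:.2f}")
```

In this toy setting, classifier A has the higher accuracy even though it never detects a positive case, while its MCC is 0; classifier B has lower accuracy but a clearly positive MCC. This is the kind of majority-class bias that an accuracy-oriented evaluation would hide.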




Updated: 2021-02-18