Dynamic clustering method for imbalanced learning based on AdaBoost
The Journal of Supercomputing (IF 2.5), Pub Date: 2020-03-02, DOI: 10.1007/s11227-020-03211-3
Xiaoheng Deng, Yuebin Xu, Lingchi Chen, Weijian Zhong, Alireza Jolfaei, Xi Zheng

This paper addresses learning from imbalanced data by means of ensemble learning. At present, the main solution is to combine under-sampling, over-sampling, or cost-sensitive learning with ensemble learning. However, these feature-space-based methods fail to reflect changes in the data distribution and usually come with high computational complexity and a risk of overfitting. We propose a dynamic clustering algorithm based on the coefficient of variation (or entropy), which learns the local spatial distribution of the data and hierarchically clusters the majority class. The algorithm has low complexity and dynamically adjusts the clusters at each AdaBoost iteration, staying synchronized with the changes in sample weights. We then design an index to measure the importance of each cluster and, based on this index, propose a dynamic sampling algorithm based on maximum weight. The effectiveness of the sampling algorithm is demonstrated through visual experiments. Finally, we propose a cost-sensitive algorithm based on Bagging and combine it with the dynamic sampling algorithm to obtain a multi-fusion imbalanced ensemble learning algorithm. In the experiments, our algorithms are validated on three artificial datasets, 22 KEEL datasets, and two gene expression cancer datasets, and show ideal performance, or performance better than state-of-the-art (SOTA) methods, in terms of AUC. This indicates that our algorithms are not only effective imbalanced learning algorithms but also offer potential for building reliable biological cyber-physical systems.
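
The abstract does not give the clustering criterion in detail, so the following is only a minimal sketch of the two ingredients it names: the standard discrete AdaBoost weight update, and the coefficient of variation (standard deviation divided by mean) of the sample weights inside each cluster. The function names, the fixed cluster assignment, and the idea of reading a high coefficient of variation as a signal that a cluster has become heterogeneous are assumptions made for illustration, not the authors' algorithm.

```python
import numpy as np

def adaboost_reweight(weights, y_true, y_pred):
    """One standard discrete AdaBoost weight update (labels in {-1, +1})."""
    miss = y_true != y_pred
    err = weights[miss].sum() / weights.sum()
    err = np.clip(err, 1e-10, 1 - 1e-10)          # avoid log(0) and division by zero
    alpha = 0.5 * np.log((1 - err) / err)          # weight of the weak learner
    new_w = weights * np.exp(-alpha * y_true * y_pred)
    return new_w / new_w.sum(), alpha              # renormalise to sum to 1

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    mean = values.mean()
    return float(values.std() / mean) if mean > 0 else 0.0

def cluster_weight_cv(weights, cluster_ids):
    """Coefficient of variation of the sample weights in each cluster.

    Hypothetical helper: the cluster assignment is taken as given here.
    """
    return {int(c): coefficient_of_variation(weights[cluster_ids == c])
            for c in np.unique(cluster_ids)}

# Toy data: six samples in three assumed clusters, one misclassified sample.
weights = np.full(6, 1.0 / 6.0)
y_true = np.array([-1, -1, -1, -1, 1, 1])
y_pred = np.array([-1, -1, -1, 1, 1, 1])           # sample 3 is misclassified
cluster_ids = np.array([0, 0, 1, 1, 2, 2])

new_w, alpha = adaboost_reweight(weights, y_true, y_pred)
cv_before = cluster_weight_cv(weights, cluster_ids)
cv_after = cluster_weight_cv(new_w, cluster_ids)
print(cv_before, cv_after)
```

In this toy run the misclassified sample gains weight, so the coefficient of variation of its cluster rises from zero while the other clusters stay uniform. A clustering step that tracks AdaBoost iterations could use such a signal to decide which clusters to revisit, although the paper's actual rule may differ.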

Updated: 2020-03-02