Adaptive weighted over-sampling for imbalanced datasets based on density peaks clustering with heuristic filtering
Information Sciences (IF 8.1), Pub Date: 2020-01-20, DOI: 10.1016/j.ins.2020.01.032
Xinmin Tao, Qing Li, Wenjie Guo, Chao Ren, Qing He, Rui Liu, JunRong Zou

Learning from imbalanced datasets poses a major challenge in the data mining community. Conventional classification algorithms generally perform poorly on imbalanced datasets because they were originally designed for balanced class distributions. Although various methods address this issue, sampling methods, especially over-sampling techniques, have shown great potential: they improve the dataset itself rather than the classifier, so they can be used with any classifier. In this paper, we propose a novel adaptive weighted over-sampling method for imbalanced datasets based on density peaks clustering with heuristic filtering. Unlike other clustering-based over-sampling methods, the proposed approach clusters the minority instances with a modified density peaks clustering rather than traditional k-means clustering, because it can accurately identify sub-clusters of different sizes and densities. This allows the method to address between-class and within-class imbalance simultaneously. Subsequently, the over-sampling size of each identified sub-cluster is determined adaptively from the sub-cluster's own size and density. The minority instances within each sub-cluster are then over-sampled with probabilities inversely proportional to their distances to the majority class and to their densities, so that more synthetic minority instances are generated for borderline and sparser ones. Finally, to avoid generating overlapping instances, a heuristic filtering strategy iteratively moves possibly overlapped minority instances away from the majority class. Extensive experimental results on different imbalanced datasets demonstrate that the proposed approach achieves better classification performance on most datasets than other existing over-sampling techniques.
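The weighting idea in the abstract can be illustrated with a small sketch. The abstract gives no formulas, so everything below is an assumption made for illustration and not the authors' method: the weight is taken as (mean distance to the k nearest minority neighbours) divided by (distance to the nearest majority instance), the density is approximated by a k-nearest-neighbour proxy, and synthetic points are generated by SMOTE-style interpolation. The function names `sampling_weights` and `oversample` are hypothetical. The density peaks clustering and the heuristic filtering steps are not covered here.

```python
import numpy as np


def _pairwise(a, b):
    """Euclidean distance matrix between the rows of a and the rows of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)


def sampling_weights(minority, majority, k=3):
    """Illustrative sampling probabilities for minority instances.

    Assumed form (not taken from the paper): weight is proportional to
    1 / (distance to nearest majority instance * local density), so
    borderline and sparse instances get higher probability.
    Requires len(minority) > k.
    """
    # Distance from each minority instance to its nearest majority instance.
    d_maj = _pairwise(minority, majority).min(axis=1)
    # Density proxy: inverse of the mean distance to the k nearest
    # minority neighbours (column 0 after sorting is the self-distance).
    d_min = np.sort(_pairwise(minority, minority), axis=1)
    knn_mean = d_min[:, 1:k + 1].mean(axis=1)
    density = 1.0 / (knn_mean + 1e-12)
    raw = 1.0 / ((d_maj + 1e-12) * density)
    return raw / raw.sum()


def oversample(minority, majority, n_new, k=3, seed=0):
    """Draw seed instances by weight, then interpolate toward a neighbour."""
    rng = np.random.default_rng(seed)
    weights = sampling_weights(minority, majority, k)
    seeds = rng.choice(len(minority), size=n_new, p=weights)
    # Indices of the k nearest minority neighbours of every instance.
    neighbours = np.argsort(_pairwise(minority, minority), axis=1)[:, 1:k + 1]
    partners = neighbours[seeds, rng.integers(0, k, size=n_new)]
    # SMOTE-style interpolation: a random point on the segment between
    # the seed instance and the chosen neighbour.
    gap = rng.random((n_new, 1))
    return minority[seeds] + gap * (minority[partners] - minority[seeds])


if __name__ == "__main__":
    minority = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
    majority = np.array([[5.0, 0.0]])
    print(sampling_weights(minority, majority, k=2))
    print(oversample(minority, majority, n_new=4, k=2))
```

In this toy layout the minority instance at (4, 0) is both closest to the majority class and at the sparse end of the line, so under the assumed formula it receives the largest sampling probability.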




Updated: 2020-01-20