Framework for extreme imbalance classification: SWIM—sampling with the majority class
Knowledge and Information Systems (IF 2.7), Pub Date: 2019-07-17, DOI: 10.1007/s10115-019-01380-z
Colin Bellinger, Shiven Sharma, Nathalie Japkowicz, Osmar R. Zaïane

The class imbalance problem is a pervasive issue in many real-world domains. Oversampling methods that inflate the rare class by generating synthetic data are among the most popular techniques for resolving class imbalance. However, they concentrate on the characteristics of the minority class and use them to guide the oversampling process. By completely overlooking the majority class, they lose a global view of the classification problem and, while alleviating the class imbalance, may negatively impact learnability by generating borderline or overlapping instances. This becomes even more critical under extreme class imbalance, where the minority class is strongly underrepresented and on its own does not contain enough information to drive the oversampling process. We propose a framework for synthetic oversampling that, unlike existing resampling methods, is robust in cases of extreme imbalance. The key feature of the framework is that it uses the density of the well-sampled majority class to guide the generation process. We demonstrate implementations of the framework using the Mahalanobis distance and a radial basis function. We evaluate over 25 benchmark datasets and show that the framework offers a distinct performance improvement over existing state-of-the-art oversampling techniques.
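As a rough illustration of how majority-class density could guide generation with the Mahalanobis distance, the sketch below whitens the data with the majority mean and covariance, perturbs randomly chosen minority points, and rescales each candidate so that it keeps its seed's Mahalanobis distance from the majority mean. This is one possible reading of the abstract, not the authors' published procedure. The function names, the `spread` parameter, and the small ridge term added to the covariance are all assumptions made for this example, and at least two features are assumed.

```python
import numpy as np


def majority_whitener(X_maj, ridge=1e-6):
    """Return the majority mean and a Cholesky factor L with cov = L @ L.T.

    The ridge term is an assumption of this sketch; it only keeps the
    covariance estimate invertible.
    """
    mu = X_maj.mean(axis=0)
    cov = np.atleast_2d(np.cov(X_maj, rowvar=False))
    cov = cov + ridge * np.eye(cov.shape[0])
    return mu, np.linalg.cholesky(cov)


def mahalanobis_to_majority(X, mu, L):
    """Mahalanobis distance of each row of X from the majority mean."""
    Z = np.linalg.solve(L, (X - mu).T).T  # whitened coordinates
    return np.linalg.norm(Z, axis=1)


def oversample_with_majority(X_maj, X_min, n_new, spread=0.25, seed=0):
    """Generate n_new synthetic minority points (hypothetical helper).

    Each synthetic point sits at the same Mahalanobis distance from the
    majority mean as the minority point it was seeded from.
    """
    rng = np.random.default_rng(seed)
    mu, L = majority_whitener(X_maj)

    # Whiten the minority points with the majority statistics.
    Z_min = np.linalg.solve(L, (X_min - mu).T).T

    # Pick seeds with replacement and jitter them in whitened space.
    seeds = Z_min[rng.integers(0, len(Z_min), size=n_new)]
    cand = seeds + rng.normal(scale=spread, size=seeds.shape)

    # Rescale each candidate back onto its seed's density contour.
    seed_norm = np.linalg.norm(seeds, axis=1, keepdims=True)
    cand_norm = np.linalg.norm(cand, axis=1, keepdims=True)
    cand = cand * seed_norm / np.maximum(cand_norm, 1e-12)

    # Map back to the original feature space.
    return cand @ L.T + mu
```

Because the candidates are moved along contours of the majority density rather than interpolated between minority neighbours, the sketch still works when only a handful of minority points are available, which is the extreme-imbalance setting the abstract targets.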

Last updated: 2019-07-17