RFCL: A new under-sampling method of reducing the degree of imbalance and overlap
Pattern Analysis and Applications (IF 3.7), Pub Date: 2020-11-12, DOI: 10.1007/s10044-020-00929-x
Rui Zhang, Zuoquan Zhang, Di Wang

Imbalanced data arise in many domains, including medical science, the Internet, finance, and surveillance. Learning from imbalanced data, also known as the imbalanced learning problem, remains a major challenge and deserves more attention. In this paper, we focus on overlap, one of the most important inherent factors that hinder learning from imbalanced data. We propose the overlapping degree (OD) and use it to group data sets into two types: high OD (HOD) and low OD (LOD). Experiments show that LOD data sets achieve good results without any under-sampling algorithm, even though some of them are highly imbalanced, and that under-sampling improves their results only slightly. We propose a new under-sampling algorithm, the random forest cleaning rule (RFCL), which removes the majority class instances that cross a new classification boundary defined by a threshold on the margin. This reduces both the degree of overlap and the degree of imbalance. The threshold is found by maximizing the F1-score of the final classifier. Experimental results show that RFCL outperforms seven classic and two recent under-sampling methods in terms of F1-score and area under the curve, whether random forest or support vector machine is used as the final classifier.
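The cleaning rule and threshold search described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. The abstract does not define the margin, so the sketch assumes the usual ensemble margin (vote share for the true class minus vote share for the other class). The per-instance minority vote fractions are hypothetical toy values standing in for the votes of a random forest, and a nearest-centroid rule stands in for the final classifier (the paper uses random forest or support vector machine). All function names, the toy data, and the threshold grid are illustrative assumptions.

```python
from typing import Callable, List, Sequence, Tuple

MAJORITY, MINORITY = 0, 1


def margin(label: int, minority_vote_frac: float) -> float:
    """Assumed margin in [-1, 1]: vote share of the true class minus the other class."""
    true_share = minority_vote_frac if label == MINORITY else 1.0 - minority_vote_frac
    return 2.0 * true_share - 1.0


def clean(labels: Sequence[int], vote_fracs: Sequence[float], threshold: float) -> List[int]:
    """Keep every minority instance and each majority instance with margin >= threshold."""
    return [i for i, (y, f) in enumerate(zip(labels, vote_fracs))
            if y == MINORITY or margin(y, f) >= threshold]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1-score with the minority class treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == MINORITY and p == MINORITY)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == MAJORITY and p == MINORITY)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == MINORITY and p == MAJORITY)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def search_threshold(labels: Sequence[int], vote_fracs: Sequence[float],
                     score_kept: Callable[[List[int]], float],
                     grid: Sequence[float]) -> Tuple[float, float]:
    """Return the grid threshold whose cleaned training set gives the highest score."""
    best_thr, best_f1 = grid[0], -1.0
    for thr in grid:
        score = score_kept(clean(labels, vote_fracs, thr))
        if score > best_f1:  # strict comparison keeps the mildest cleaning on ties
            best_thr, best_f1 = thr, score
    return best_thr, best_f1


# Hypothetical 1-D toy data: 8 majority and 3 minority instances that overlap near 0.6.
train_x = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.6, 0.8, 0.9]
train_y = [MAJORITY] * 8 + [MINORITY] * 3
# Hypothetical fraction of ensemble votes for the minority class, per instance.
vote_frac = [0.0, 0.0, 0.0, 0.1, 0.2, 0.4, 0.7, 0.8, 0.6, 0.9, 1.0]
val_x = [0.05, 0.25, 0.45, 0.55, 0.75, 0.85]
val_y = [MAJORITY] * 3 + [MINORITY] * 3


def centroid_f1(kept: List[int]) -> float:
    """Stand-in final classifier: nearest class centroid, scored by validation F1."""
    maj = [train_x[i] for i in kept if train_y[i] == MAJORITY]
    mino = [train_x[i] for i in kept if train_y[i] == MINORITY]
    c_maj, c_min = sum(maj) / len(maj), sum(mino) / len(mino)
    preds = [MINORITY if abs(x - c_min) < abs(x - c_maj) else MAJORITY for x in val_x]
    return f1_score(val_y, preds)


# A threshold of -1.0 removes nothing, so the grid includes the "no cleaning" baseline.
best_thr, best_f1 = search_threshold(train_y, vote_frac, centroid_f1, [-1.0, 0.0, 0.5, 0.9])
print(best_thr, best_f1)
```

On this toy data, the search selects the threshold 0.0, which drops the two majority instances that sit inside the minority region. An overly aggressive threshold (0.9) removes too many majority instances and lowers the F1-score again, which is why the threshold is searched rather than fixed.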


Updated: 2020-11-12
