Entropy-based Sampling Approaches for Multi-class Imbalanced Problems, IEEE Transactions on Knowledge and Data Engineering


Entropy-based Sampling Approaches for Multi-class Imbalanced Problems
IEEE Transactions on Knowledge and Data Engineering (IF 8.9), Pub Date: 2020-11-01, DOI: 10.1109/tkde.2019.2913859
Lusi Li, Haibo He, Jie Li

In data mining, large differences between class distributions in multi-class data, regarded as class imbalance issues, are known to hinder classification performance. Unfortunately, existing sampling methods have shown deficiencies: oversampling techniques can cause over-generation and overlapping, while undersampling techniques can lose an excessive amount of significant information. This paper presents three sampling approaches for imbalanced learning: entropy-based oversampling (EOS), entropy-based undersampling (EUS), and entropy-based hybrid sampling (EHS), which combines oversampling and undersampling. All three approaches are based on a new class imbalance metric, termed entropy-based imbalance degree (EID), which considers the differences in information content between classes instead of the traditional imbalance ratio. Specifically, to balance a data set after evaluating the information influence degree of each instance, EOS generates new instances around difficult-to-learn instances and retains only the informative ones, EUS removes easy-to-learn instances, and EHS does both simultaneously. Finally, we use all the generated and remaining instances to train several classifiers. Extensive experiments on synthetic and real-world data sets demonstrate the effectiveness of our approaches.
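The abstract contrasts the traditional imbalance ratio with an entropy-based view but does not give the EID formula. The sketch below is therefore not EID; it only illustrates, under that stated assumption, why a ratio of the largest to the smallest class can hide differences that an entropy of the class distribution picks up. The function names `imbalance_ratio` and `normalized_class_entropy` and the toy label lists are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def imbalance_ratio(labels):
    """Traditional imbalance ratio: largest class size / smallest class size."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def normalized_class_entropy(labels):
    """Shannon entropy of the class distribution, scaled to [0, 1].

    1.0 means perfectly balanced classes; values near 0 mean one class
    dominates. This is a generic measure, NOT the paper's EID.
    """
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    if k < 2:
        return 0.0  # a single class carries no distributional entropy
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(k)


# Two toy three-class label sets with the same imbalance ratio (90 : 1).
two_large_one_small = ["a"] * 90 + ["b"] * 90 + ["c"] * 1
one_large_two_small = ["a"] * 90 + ["b"] * 1 + ["c"] * 1

for name, labels in [("two large, one small", two_large_one_small),
                     ("one large, two small", one_large_two_small)]:
    print(name,
          "ratio =", imbalance_ratio(labels),
          "entropy =", round(normalized_class_entropy(labels), 3))
```

Both toy sets share the same ratio, yet the second has a much lower normalized entropy because two of its three classes are nearly empty. That is the kind of multi-class difference a single ratio cannot express.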

Updated: 2020-11-01
