Entropy difference and kernel-based oversampling technique for imbalanced data learning
Intelligent Data Analysis ( IF 0.9 ) Pub Date : 2020-12-18 , DOI: 10.3233/ida-194761
Xu Wu , Youlong Yang , Lingyu Ren

Class imbalance is a common problem in real-world datasets, where one class contains few instances and the other contains many. Without data preprocessing techniques to balance the dataset, it is notably difficult to develop an effective model with traditional data mining and machine learning algorithms. Oversampling is often used as a preprocessing step for imbalanced datasets. Specifically, synthetic oversampling techniques balance the number of training instances between the majority and minority classes by generating extra artificial minority-class instances. However, current oversampling techniques consider only the imbalance in quantity and pay no attention to whether the distribution itself is balanced. Therefore, this paper proposes an entropy-difference and kernel-based SMOTE (EDKS), which measures the imbalance degree of a dataset from its distribution via entropy difference and overcomes the limitation of SMOTE on nonlinear problems by oversampling in the feature space of a support vector machine classifier. First, EDKS maps the input data into a feature space to increase the separability of the data. Then EDKS calculates the entropy difference in the kernel space, determines the majority and minority classes, and finds the sparse regions within the minority class. Moreover, the proposed method balances the data distribution by synthesizing new instances and evaluating their retention capability. The algorithm can effectively distinguish datasets with the same imbalance ratio but different distributions. An experimental study evaluates and compares the performance of our method against state-of-the-art algorithms, demonstrating that the proposed approach is competitive on multiple benchmark imbalanced datasets.
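The two building blocks the abstract names, measuring distributional imbalance through class entropy and synthesizing minority instances by SMOTE-style interpolation, can be sketched in plain NumPy. This is a minimal illustration under simplifying assumptions (it works in the input space rather than the SVM kernel space, and the function names `class_entropy` and `smote_like_oversample` are hypothetical), not the authors' EDKS implementation.

```python
import numpy as np

def class_entropy(y):
    # Shannon entropy of the label distribution; a perfectly
    # balanced binary dataset attains the maximum value of 1 bit.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def smote_like_oversample(X_min, n_new, k=5, rng=None):
    # Basic SMOTE-style synthesis: interpolate between each chosen
    # minority point and one of its k nearest minority neighbors.
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self as neighbor
    nbrs = np.argsort(d, axis=1)[:, :k]  # k nearest neighbors per point
    out = []
    for _ in range(n_new):
        i = rng.integers(n)
        j = nbrs[i, rng.integers(k)]
        lam = rng.random()               # random point on the segment
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)
```

In an EDKS-like pipeline, the entropy of the class distribution before and after synthesis would indicate how far the dataset is from balance, while the interpolation step would be carried out on kernel-mapped features rather than raw inputs.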

Updated: 2020-12-23