A self‐adaptive synthetic over‐sampling technique for imbalanced classification
International Journal of Intelligent Systems (IF 5.0). Pub Date: 2020-06-01. DOI: 10.1002/int.22230
Xiaowei Gu, Plamen P. Angelov, Eduardo A. Soares

Traditionally, in supervised machine learning, a significant part of the available data (usually 50%-80%) is used for training and the rest for validation. In many problems, however, the data are highly imbalanced across classes or do not cover the feasible data space well, which, in turn, creates problems in the validation and usage phases. In this paper, we propose a technique for synthesizing feasible and likely data to help balance the classes as well as to boost performance, both overall and per class as measured by the confusion matrix. The idea, in a nutshell, is to synthesize data samples in close vicinity to the actual data samples, specifically for the less represented (minority) classes. This also has implications for the so-called fairness of machine learning. The proposed method is generic and can be applied with different base algorithms, for example, support vector machines, k-nearest neighbour classifiers, deep neural networks, rule-based classifiers, decision trees, and so forth. The results demonstrate that (a) significantly more balanced (and fair) classification results can be achieved and (b) both the overall performance and the per-class performance measured by the confusion matrix can be boosted. In addition, this approach can be very valuable in cases where the amount of available labelled data is small, which is itself one of the problems of contemporary machine learning.
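The core idea described above (synthesizing new minority-class samples in the close vicinity of existing ones) can be illustrated with a minimal SMOTE-style interpolation sketch. This is an assumption-laden illustration of the general over-sampling principle, not the paper's specific self-adaptive procedure; the function name and interpolation scheme are invented for demonstration only.

```python
import numpy as np

def oversample_minority(X_min, n_new, rng=None):
    """Synthesize n_new samples near existing minority-class points by
    interpolating between randomly chosen pairs of real samples.
    (A SMOTE-style sketch, NOT the paper's exact self-adaptive method.)"""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    idx_a = rng.integers(0, n, size=n_new)   # first endpoint of each pair
    idx_b = rng.integers(0, n, size=n_new)   # second endpoint of each pair
    t = rng.random((n_new, 1))               # interpolation factor in [0, 1]
    # Each synthetic sample lies on the segment between two real samples,
    # so it stays in the close vicinity of the actual minority data.
    return X_min[idx_a] + t * (X_min[idx_b] - X_min[idx_a])

# Example: 5 minority samples in 2-D, synthesize 20 more to rebalance
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_new = oversample_minority(X_min, 20, rng=0)
print(X_new.shape)  # (20, 2)
```

Because each synthetic point is a convex combination of two real minority samples, the generated data cannot stray outside the region spanned by the original class, which is one simple way to keep synthesized samples "feasible and likely".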

Updated: 2020-06-01