Data cleaning issues in class imbalanced datasets: instance selection and missing values imputation for one-class classifiers
Data Technologies and Applications (IF 1.7). Pub Date: 2021-05-14. DOI: 10.1108/dta-01-2021-0027
Zhenyuan Wang , Chih-Fong Tsai , Wei-Chao Lin

Purpose

Class imbalance learning, which arises in many domain problem datasets, is an important research topic in data mining and machine learning. One-class classification techniques, which identify anomalies (the minority class) against normal data (the majority class), are one representative solution for class imbalanced datasets. Since one-class classifiers are trained using only normal data to create a decision boundary for later anomaly detection, the quality of the training set, i.e. the majority class, is a key factor affecting the performance of one-class classifiers.

Design/methodology/approach

In this paper, we focus on two data cleaning (preprocessing) methods for class imbalanced datasets. The first method examines whether performing instance selection to remove noisy data from the majority class can improve the performance of one-class classifiers. The second method combines instance selection with missing value imputation, where the latter handles incomplete datasets that contain missing values.
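The first preprocessing step can be sketched as follows. This is an illustrative example, not the authors' implementation: the paper uses IB3, DROP3 and a genetic algorithm for instance selection, whereas here a simple nearest-neighbour distance filter stands in for them, followed by a one-class SVM trained only on the cleaned majority class.

```python
# Sketch: instance selection on the majority class before one-class training.
# The k-NN mean-distance filter below is a stand-in for IB3/DROP3/GA.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(0, 1, size=(200, 2))   # clean majority (normal) class
X_noise = rng.normal(0, 6, size=(10, 2))     # noisy majority instances
X_train = np.vstack([X_normal, X_noise])

# Instance selection: drop points whose mean distance to their k nearest
# neighbours is unusually large (a crude noise filter).
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_train)
dists, _ = nn.kneighbors(X_train)
mean_dist = dists[:, 1:].mean(axis=1)        # column 0 is the self-distance
keep = mean_dist < np.percentile(mean_dist, 95)
X_clean = X_train[keep]

# Train the one-class classifier only on the cleaned majority class.
ocsvm = OneClassSVM(nu=0.05, gamma="scale").fit(X_clean)
X_test = np.vstack([rng.normal(0, 1, size=(20, 2)),   # normal test points
                    rng.normal(10, 1, size=(5, 2))])  # obvious anomalies
pred = ocsvm.predict(X_test)                 # +1 = normal, -1 = anomaly
```

The filter threshold (95th percentile) is an arbitrary illustrative choice; the paper's point is that which selection algorithm is used matters.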

Findings

The experimental results, based on 44 class imbalanced datasets, three instance selection algorithms (IB3, DROP3 and a genetic algorithm (GA)), the CART decision tree for missing value imputation, and three one-class classifiers (OCSVM, iForest and LOF), show that if the instance selection algorithm is carefully chosen, this step can improve the quality of the training data, allowing one-class classifiers to outperform baselines trained without instance selection. Moreover, when class imbalanced datasets contain missing values, combining missing value imputation and instance selection, regardless of which step is performed first, can maintain data quality similar to that of datasets without missing values.
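The imputation-plus-one-class pipeline described above can be sketched as follows. This is a hedged approximation, not the authors' exact setup: scikit-learn's `IterativeImputer` with a `DecisionTreeRegressor` estimator stands in for CART-based imputation, and instance selection would be applied between imputation and training as in the previous step.

```python
# Sketch: CART-style missing value imputation before one-class training.
# IterativeImputer + DecisionTreeRegressor approximates CART imputation.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X = rng.normal(0, 1, size=(100, 3))
mask = rng.random(X.shape) < 0.1     # knock out ~10% of values
X_missing = X.copy()
X_missing[mask] = np.nan

# Tree-based imputation: each feature with missing values is predicted
# from the others by a regression tree, iterating until convergence.
imputer = IterativeImputer(estimator=DecisionTreeRegressor(max_depth=3),
                           random_state=1)
X_imputed = imputer.fit_transform(X_missing)

# Instance selection would go here; then train the one-class classifier.
iforest = IsolationForest(random_state=1).fit(X_imputed)
scores = iforest.decision_function(X_imputed)  # higher = more normal
```

Whether imputation or instance selection runs first is exactly the ordering question the Findings address.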

Originality/value

The novelty of this paper lies in investigating the effect of instance selection on the performance of one-class classifiers, which has not been done before. Moreover, this study is the first to consider the scenario in which the training set for one-class classifiers contains missing values. In this scenario, performing missing value imputation and instance selection in different orders is compared.


