A reconstruction error-based framework for label noise detection
Journal of Big Data (IF 8.6), Pub Date: 2021-04-15, DOI: 10.1186/s40537-021-00447-5
Zahra Salekshahrezaee, Joffrey L. Leevy, Taghi M. Khoshgoftaar

Label noise is an important data quality issue that negatively impacts machine learning algorithms. For example, label noise has been shown to increase the number of instances required to train effective predictive models. It has also been shown to increase model complexity and decrease model interpretability. In addition, label noise can cause the classification results of a learner to be poor. In this paper, we detect label noise with three unsupervised learners, namely principal component analysis (PCA), independent component analysis (ICA), and autoencoders. We evaluate these three learners on a credit card fraud dataset using multiple noise levels, and then compare results to the traditional Tomek links noise filter. Our binary classification approach, which considers label noise instances as anomalies, uniquely uses reconstruction errors for noisy data in order to identify and filter label noise. For detecting noisy instances, we discovered that the autoencoder algorithm was the top performer (highest recall score of 0.90), while Tomek links performed the worst (highest recall score of 0.62).
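
The idea of treating mislabeled instances as anomalies with a high reconstruction error can be illustrated with a minimal sketch. This is not the authors' implementation: it uses PCA only (computed with NumPy's SVD), synthetic data rather than the credit card fraud dataset, and the helper names `pca_reconstruction_error` and `flag_label_noise`, the quantile threshold, and the `contamination` parameter are all assumptions made for illustration.

```python
import numpy as np


def pca_reconstruction_error(X_fit, X_eval, n_components):
    """Fit PCA on X_fit via SVD; return squared reconstruction error per row of X_eval."""
    mean = X_fit.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X_fit - mean, full_matrices=False)
    components = vt[:n_components]
    centered = X_eval - mean
    # Project onto the retained components, then map back to feature space.
    reconstructed = centered @ components.T @ components
    return np.sum((centered - reconstructed) ** 2, axis=1)


def flag_label_noise(X, y, target_class, n_components, contamination):
    """Return indices of instances labeled `target_class` with the largest errors.

    `contamination` is an assumed fraction of noisy labels (hypothetical parameter).
    """
    idx = np.flatnonzero(y == target_class)
    errors = pca_reconstruction_error(X[idx], X[idx], n_components)
    threshold = np.quantile(errors, 1.0 - contamination)
    return idx[errors > threshold]


# Synthetic stand-in data: class 0 lies close to a 2-D subspace of a 10-D space.
rng = np.random.default_rng(0)
n_clean, n_flipped, n_features = 500, 25, 10
basis = rng.normal(size=(2, n_features))
X_clean = rng.normal(size=(n_clean, 2)) @ basis
X_clean += 0.05 * rng.normal(size=(n_clean, n_features))

# Instances from a different distribution that carry the (wrong) label 0.
X_flipped = 3.0 * rng.normal(size=(n_flipped, n_features))

X = np.vstack([X_clean, X_flipped])
y_noisy = np.zeros(n_clean + n_flipped, dtype=int)
is_noise = np.arange(len(X)) >= n_clean  # ground truth, known only because we injected it

flagged = flag_label_noise(
    X, y_noisy, target_class=0, n_components=2,
    contamination=n_flipped / len(X),
)

# Recall: fraction of injected noisy labels that were flagged.
recall = is_noise[flagged].sum() / is_noise.sum()
print(f"flagged {len(flagged)} instances, recall = {recall:.2f}")
```

In practice the noise rate is unknown, so the threshold would have to be chosen some other way (for example from a validation set); an ICA or autoencoder model could be substituted for the PCA step while keeping the same error-thresholding logic.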




Updated: 2021-04-15