Harmless label noise and informative soft-labels in supervised classification, Computational Statistics & Data Analysis



Computational Statistics & Data Analysis (IF 1.5), Pub Date: 2021-04-19, DOI: 10.1016/j.csda.2021.107253
Daniel Ahfock, Geoffrey J. McLachlan

Manual labelling of training examples is common practice in supervised learning. When the labelling task is of non-trivial difficulty, the supplied labels may not be equal to the ground-truth labels, and label noise is introduced into the training dataset. If the manual annotation is carried out by multiple experts, the same training example can be given different class assignments by different experts, which is indicative of label noise. In the framework of model-based classification, a simple but key observation is that when the manual labels are sampled using the posterior probabilities of class membership, the noisy labels are as valuable as the ground-truth labels in terms of statistical information. A relaxation of this process is a random effects model for imperfect labelling by a group that uses approximate posterior probabilities of class membership. The relative efficiency of logistic regression using the noisy labels compared to logistic regression using the ground-truth labels can then be derived. The main finding is that logistic regression can be robust to label noise when label noise and classification difficulty are positively correlated. In particular, when classification difficulty is the only source of label errors, multiple sets of noisy labels can supply more information for the estimation of a classification rule compared to the single set of ground-truth labels.
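The key observation in the abstract (labels sampled from the posterior class probabilities carry as much information as ground-truth labels) can be illustrated with a small simulation. The sketch below is not taken from the paper. It assumes a toy model chosen for illustration: two classes with equal priors, a single feature distributed as N(-1, 1) or N(+1, 1), so that the true posterior is sigmoid(2x). The function names `fit_logistic` and `simulate` are hypothetical. Under these assumptions, logistic regression fitted to the posterior-sampled labels should estimate the slope about as precisely as logistic regression fitted to the ground-truth labels, so the ratio of the two sampling variances should be close to 1.

```python
import math
import random
import statistics


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_logistic(xs, ys, iters=25):
    """Maximum-likelihood fit of P(y=1|x) = sigmoid(b0 + b1*x) by Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p          # gradient w.r.t. intercept
            g1 += (y - p) * x    # gradient w.r.t. slope
            h00 += w             # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve (information matrix) * step = gradient.
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (h00 * g1 - h01 * g0) / det
        b0 += s0
        b1 += s1
        if abs(s0) + abs(s1) < 1e-10:
            break
    return b0, b1


def simulate(n=300, reps=200, seed=0):
    """Return slope estimates from ground-truth labels and from posterior-sampled labels."""
    rng = random.Random(seed)
    slopes_truth, slopes_noisy = [], []
    for _ in range(reps):
        xs, ys, zs = [], [], []
        for _ in range(n):
            # Ground truth: equal priors, class means -1 and +1, unit variance (assumed toy model).
            y = 1 if rng.random() < 0.5 else 0
            x = rng.gauss(1.0 if y else -1.0, 1.0)
            # Noisy label: drawn from the true posterior P(Y=1|x) = sigmoid(2x), independently of y.
            z = 1 if rng.random() < sigmoid(2.0 * x) else 0
            xs.append(x)
            ys.append(y)
            zs.append(z)
        slopes_truth.append(fit_logistic(xs, ys)[1])
        slopes_noisy.append(fit_logistic(xs, zs)[1])
    return slopes_truth, slopes_noisy


if __name__ == "__main__":
    truth, noisy = simulate()
    # Empirical relative efficiency of the slope estimator (values near 1 mean equal information).
    ratio = statistics.variance(truth) / statistics.variance(noisy)
    print("mean slope (ground truth):", round(statistics.mean(truth), 3))
    print("mean slope (posterior-sampled):", round(statistics.mean(noisy), 3))
    print("variance ratio:", round(ratio, 3))
```

In this toy model the pair (x, z) has the same joint distribution as (x, y), which is why the two estimators behave alike even though z disagrees with y on a sizeable fraction of examples. The simulation only covers the idealised case of exact posterior sampling; the random effects relaxation described in the abstract is not modelled here.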


Updated: 2021-04-23
