Classification with imperfect training labels
Biometrika (IF 2.7), Pub Date: 2020-04-22, DOI: 10.1093/biomet/asaa011
Timothy I Cannings, Yingying Fan, Richard J Samworth

We study the effect of imperfect training data labels on the performance of classification methods. In a general setting, where the probability that an observation in the training dataset is mislabelled may depend on both the feature vector and the true label, we bound the excess risk of an arbitrary classifier trained with imperfect labels in terms of its excess risk for predicting a noisy label. This reveals conditions under which a classifier trained with imperfect labels remains consistent for classifying uncorrupted test data points. Furthermore, under stronger conditions, we derive detailed asymptotic properties for the popular $k$-nearest neighbour ($k$nn), support vector machine (SVM) and linear discriminant analysis (LDA) classifiers. One consequence of these results is that the $k$nn and SVM classifiers are robust to imperfect training labels, in the sense that the rate of convergence of the excess risks of these classifiers remains unchanged; in fact, our theoretical and empirical results even show that in some cases, imperfect labels may improve the performance of these methods. On the other hand, the LDA classifier is shown to be typically inconsistent in the presence of label noise unless the prior probabilities of each class are equal. Our theoretical results are supported by a simulation study.
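
The LDA claim in the abstract can be illustrated with a small simulation. The sketch below is not the authors' simulation design; it is a minimal example under assumptions chosen here for illustration: one-dimensional features, two unit-variance Gaussian classes with means -1 and 1, and homogeneous label noise in which each training label is flipped independently with probability `rho`. The function names (`simulate`, `lda_threshold`, `threshold_gap`) and all parameter values are hypothetical. It fits LDA once on the clean labels and once on the noisy labels, and reports the difference between the two decision thresholds.

```python
import math
import random


def simulate(prior1, rho, n, seed):
    """Draw n points from a two-class 1-D Gaussian model (assumed setup).

    Class 1 has prior `prior1` and mean +1; class 0 has mean -1; both have
    unit variance. Each label is flipped independently with probability rho.
    """
    rng = random.Random(seed)
    xs, clean, noisy = [], [], []
    for _ in range(n):
        y = 1 if rng.random() < prior1 else 0
        x = rng.gauss(1.0 if y == 1 else -1.0, 1.0)
        y_noisy = 1 - y if rng.random() < rho else y
        xs.append(x)
        clean.append(y)
        noisy.append(y_noisy)
    return xs, clean, noisy


def lda_threshold(xs, ys):
    """Fit 1-D LDA and return its decision threshold.

    Assuming the fitted class-1 mean exceeds the class-0 mean, the fitted
    rule predicts class 1 exactly when x is above the returned value.
    """
    x0 = [x for x, y in zip(xs, ys) if y == 0]
    x1 = [x for x, y in zip(xs, ys) if y == 1]
    n0, n1 = len(x0), len(x1)
    m0, m1 = sum(x0) / n0, sum(x1) / n1
    # Pooled within-class variance estimate.
    ss = sum((x - m0) ** 2 for x in x0) + sum((x - m1) ** 2 for x in x1)
    var = ss / (n0 + n1 - 2)
    # log(n0 / n1) equals the log ratio of the estimated class priors.
    return (m0 + m1) / 2 + var / (m1 - m0) * math.log(n0 / n1)


def threshold_gap(prior1, rho=0.3, n=200_000, seed=0):
    """Noisy-label LDA threshold minus clean-label LDA threshold."""
    xs, clean, noisy = simulate(prior1, rho, n, seed)
    return lda_threshold(xs, noisy) - lda_threshold(xs, clean)


if __name__ == "__main__":
    print("equal priors (0.5):  gap =", round(threshold_gap(0.5), 3))
    print("unequal priors (0.8): gap =", round(threshold_gap(0.8), 3))
```

Under these assumptions the gap should be close to zero when the priors are equal and clearly nonzero when they are not, which matches the abstract's statement that LDA is typically inconsistent under label noise unless the class priors are equal. A threshold comparison was chosen over a test-error comparison because it isolates the shift in the fitted decision rule without needing a separate test set.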

Updated: 2020-04-22