An apparent paradox: a classifier based on a partially classified sample may have smaller expected error rate than that if the sample were completely classified
Statistics and Computing (IF 2.2), Pub Date: 2020-09-05, DOI: 10.1007/s11222-020-09971-5
Daniel Ahfock, Geoffrey J. McLachlan

There has been increasing interest in using semi-supervised learning to form a classifier. As is well known, the (Fisher) information in an unclassified feature with an unknown class label is less (considerably less for weakly separated classes) than that of a classified feature with a known class label. Hence, in the case where the absence of class labels does not depend on the data, the expected error rate of a classifier formed from the classified and unclassified features in a partially classified sample is greater than if the sample were completely classified. We propose to treat the labels of the unclassified features as missing data and to introduce a framework for their missingness as in the pioneering work of Rubin (Biometrika 63:581–592, 1976) for missingness in incomplete data analysis. An examination of several partially classified data sets in the literature suggests that the unclassified features do not occur at random in the feature space, but rather tend to be concentrated in regions of relatively high entropy. This suggests that the missingness of the labels can be modelled by representing the conditional probability of a missing label for a feature via a logistic model with a covariate depending on the entropy of the feature or an appropriate proxy for it. We consider here the case of two normal classes with a common covariance matrix, where for computational convenience the square of the discriminant function is used as the covariate in the logistic model in place of the negative log entropy. Rather paradoxically, we show that the classifier so formed from the partially classified sample may have a smaller expected error rate than if the sample were completely classified.
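To make the missingness mechanism concrete, the following is a minimal Python sketch (not the authors' implementation) of the setting described above: two normal classes with a common covariance matrix, with labels hidden according to a logistic model whose covariate is the square of the linear discriminant function, used as a proxy for the negative log entropy. The class means, covariance matrix, sample size, and the logistic coefficients xi0 and xi1 below are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)

# Two bivariate normal classes with a common covariance matrix
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 0.0])
Sigma = np.eye(2)
n_per_class = 500

X = np.vstack([
    rng.multivariate_normal(mu1, Sigma, n_per_class),
    rng.multivariate_normal(mu2, Sigma, n_per_class),
])
y = np.repeat([0, 1], n_per_class)

# Fisher linear discriminant function for the known parameters:
# d(x) = (mu1 - mu2)' Sigma^{-1} (x - (mu1 + mu2) / 2)
beta = np.linalg.inv(Sigma) @ (mu1 - mu2)
d = (X - (mu1 + mu2) / 2) @ beta

# Logistic missingness model with covariate d(x)^2: labels are more
# likely to be missing where d(x)^2 is small, i.e. near the decision
# boundary where the entropy of the class posterior is high.
xi0, xi1 = 1.0, 1.5                      # hypothetical coefficients
p_missing = expit(xi0 - xi1 * d**2)
missing = rng.random(X.shape[0]) < p_missing

y_obs = np.where(missing, -1, y)         # -1 marks an unclassified feature

print(f"proportion of unclassified features: {missing.mean():.2f}")
print(f"mean |d(x)| among unclassified: {np.abs(d[missing]).mean():.2f}")
print(f"mean |d(x)| among classified:   {np.abs(d[~missing]).mean():.2f}")
```

Running the sketch shows that the unclassified features have a smaller |d(x)| on average, i.e. they concentrate near the decision boundary in the high-entropy region, which is the pattern the abstract reports for the partially classified data sets examined in the literature.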



Updated: 2020-09-07