Abstract
There has been increasing interest in using semi-supervised learning to form a classifier. As is well known, the (Fisher) information in an unclassified feature with unknown class label is less (considerably less for weakly separated classes) than that in a classified feature with known class label. Hence, in the case where the absence of class labels does not depend on the data, the expected error rate of a classifier formed from the classified and unclassified features in a partially classified sample is greater than it would be if the sample were completely classified. We propose to treat the labels of the unclassified features as missing data and to introduce a framework for their missingness, as in the pioneering work of Rubin (Biometrika 63:581–592, 1976) on missingness in incomplete data analysis. An examination of several partially classified data sets in the literature suggests that the unclassified features do not occur at random in the feature space, but rather tend to be concentrated in regions of relatively high entropy. This suggests that the missingness of the labels can be modelled by representing the conditional probability of a missing label for a feature via a logistic model with a covariate depending on the entropy of the feature, or an appropriate proxy for it. We consider here the case of two normal classes with a common covariance matrix, where for computational convenience the square of the discriminant function is used as the covariate in the logistic model in place of the negative log entropy. Rather paradoxically, we show that the classifier so formed from the partially classified sample may have a smaller expected error rate than if the sample were completely classified.
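The missingness mechanism described above can be illustrated with a minimal simulation. The sketch below is not the authors' implementation: the class means, mixing proportion, and logistic coefficients (`xi0`, `xi1`) are made-up values chosen only to show the qualitative effect, namely that when the probability of a missing label follows a logistic model in the squared discriminant d(x)^2, the unlabelled features concentrate near the decision boundary, i.e. in the high-entropy region.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two univariate normal classes with common variance (illustrative values,
# not taken from the paper).
n, mu0, mu1, sigma = 1000, -1.0, 1.0, 1.0
z = rng.binomial(1, 0.5, size=n)                    # true class labels
x = rng.normal(np.where(z == 1, mu1, mu0), sigma)   # observed features

# Linear discriminant for equal priors and common variance:
# d(x) = (mu1 - mu0)(x - (mu0 + mu1)/2) / sigma^2; d(x) = 0 on the boundary.
d = (mu1 - mu0) * (x - 0.5 * (mu0 + mu1)) / sigma**2

# Missingness model: P(label missing | x) is logistic in d(x)^2, which serves
# as a proxy for the negative log entropy. Coefficients are hypothetical.
xi0, xi1 = 1.0, -0.7
p_missing = 1.0 / (1.0 + np.exp(-(xi0 + xi1 * d**2)))
missing = rng.binomial(1, p_missing).astype(bool)

# Unlabelled features should lie closer to the boundary (smaller |d|),
# matching the pattern observed in real partially classified data sets.
print("mean |d|, unlabelled:", np.abs(d[missing]).mean())
print("mean |d|, labelled:  ", np.abs(d[~missing]).mean())
```

With `xi1 < 0`, the missingness probability is largest where d(x)^2 is small, so the mean |d(x)| among the unlabelled features comes out smaller than among the labelled ones.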
References
Aghaeepour, N., Finak, G., Hoos, H., Mosmann, T., Brinkman, R., Gottardo, R., Scheuermann, R.: FlowCAP consortium, dream consortium: critical assessment of automated flow cytometry data analysis techniques. Nat. Methods 10, 228–238 (2013)
Ahfock, D., McLachlan, G.J.: On missing data patterns in semi-supervised learning. arXiv ePreprint arXiv:1904.02883 (2019)
Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., Raffel, C.: MixMatch: a holistic approach to semi-supervised learning. In: Advances in Neural Information Processing Systems (2019)
Castelli, V., Cover, T.M.: The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Trans. Inf. Theory 42, 2102–2117 (1996)
Chapelle, O., Schölkopf, B., Zien, A.: Semi-Supervised Learning. The MIT Press, Cambridge (2010)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc. B 39, 1–22 (1977)
Efron, B.: The efficiency of logistic regression compared to normal discriminant analysis. J. Am. Stat. Assoc. 70, 892–898 (1975)
Efron, B., Tibshirani, R.: Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat. Sci. 1, 54–75 (1986)
Elkan, C., Noto, K.: Learning classifiers from only positive and unlabeled data. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 213–220 (2008)
Ganesalingam, S., McLachlan, G.J.: The efficiency of a linear discriminant function based on unclassified initial samples. Biometrika 65, 658–665 (1978)
Grandvalet, Y., Bengio, Y.: Semi-supervised learning by entropy minimization. In: Advances in Neural Information Processing Systems, pp. 529–536 (2005)
McLachlan, G.J.: Iterative reclassification procedure for constructing an asymptotically optimal rule of allocation in discriminant analysis. J. Am. Stat. Assoc. 70, 365–369 (1975)
McLachlan, G.J.: Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York (1992)
McLachlan, G.J., Gordon, R.D.: Mixture models for partially unclassified data: a case study of renal venous renin in hypertension. Stat. Med. 8, 1291–1300 (1989)
McLachlan, G.J., Scot, D.: Asymptotic relative efficiency of the linear discriminant function under partial nonrandom classification of the training data. J. Stat. Comput. Simul. 52, 415–426 (1995)
Mealli, F., Rubin, D.B.: Clarifying missing at random and related definitions, and implications when coupled with exchangeability. Biometrika 102, 995–1000 (2015)
Molenberghs, G., Fitzmaurice, G.M., Kenward, M.G., Tsiatis, A.A., Verbeke, G.: Handbook of Missing Data Methodology. CRC Press, Boca Raton (2014)
O’Neill, T.J.: Normal discrimination with unclassified observations. J. Am. Stat. Assoc. 73, 821–826 (1978)
Ratsaby, J., Venkatesh, S.S.: Learning from a mixture of labeled and unlabeled examples with parametric side information. In: Proceedings of the Eighth Annual Conference on Computational Learning Theory, pp. 412–417 (1995)
Rubin, D.B.: Inference and missing data. Biometrika 63, 581–592 (1976)
Shahshahani, B.M., Landgrebe, D.A.: The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon. IEEE Trans. Geosci. Remote Sens. 32, 1087–1095 (1994)
van Engelen, J., Hoos, H.: A survey on semi-supervised learning. Mach. Learn. 109, 373–440 (2020)
Vapnik, V.N.: Statistical Learning Theory. Wiley, New York (1998)
Zhang, T.: The value of unlabeled data for classification problems. In: Proceedings of the Seventeenth International Conference on Machine Learning, pp. 1191–1198 (2000)
Acknowledgements
The authors are indebted to the Co-ordinating Editor and two Reviewers for their comments that have improved the exposition of the manuscript.
This research was funded by the Australian Government through the Australian Research Council (Project Numbers DP170100907 and IC170100035).
Electronic supplementary material
Cite this article
Ahfock, D., McLachlan, G.J. An apparent paradox: a classifier based on a partially classified sample may have smaller expected error rate than that if the sample were completely classified. Stat Comput 30, 1779–1790 (2020). https://doi.org/10.1007/s11222-020-09971-5