An apparent paradox: a classifier based on a partially classified sample may have smaller expected error rate than that if the sample were completely classified

Abstract

There has been increasing interest in using semi-supervised learning to form a classifier. As is well known, the (Fisher) information in an unclassified feature with unknown class label is less (considerably less for weakly separated classes) than that in a classified feature with known class label. Hence, in the case where the absence of class labels does not depend on the data, the expected error rate of a classifier formed from the classified and unclassified features in a partially classified sample is greater than that if the sample were completely classified. We propose to treat the labels of the unclassified features as missing data and to introduce a framework for their missingness, as in the pioneering work of Rubin (Biometrika 63:581–592, 1976) on missingness in incomplete-data analysis. An examination of several partially classified data sets in the literature suggests that the unclassified features do not occur at random in the feature space, but rather tend to be concentrated in regions of relatively high entropy. This suggests that the missingness of the labels can be modelled by representing the conditional probability of a missing label for a feature via a logistic model whose covariate depends on the entropy of the feature, or an appropriate proxy for it. We consider here the case of two normal classes with a common covariance matrix, where for computational convenience the square of the discriminant function is used as the covariate in the logistic model in place of the negative log entropy. Rather paradoxically, we show that the classifier so formed from the partially classified sample may have a smaller expected error rate than that if the sample were completely classified.
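
To make the missingness mechanism concrete, here is a minimal simulation sketch in Python. It is illustrative only, not the authors' code: it assumes two equal-prior normal classes with identity covariance, computes the linear discriminant d(x), and draws missing labels from a logistic model in d(x)^2 with hypothetical coefficients xi0 and xi1 (a negative xi1 concentrates the missing labels near the decision boundary, where the entropy of the class posterior is highest).

```python
import numpy as np
from scipy.special import expit  # logistic function

rng = np.random.default_rng(0)

# Hypothetical setup: two p-variate normal classes with equal priors and
# common identity covariance, separated by `delta` along the first axis.
n, p, delta = 500, 2, 1.5
mu0, mu1 = np.zeros(p), np.r_[delta, np.zeros(p - 1)]

z = rng.binomial(1, 0.5, n)                        # true class labels
x = np.where(z[:, None] == 1, mu1, mu0) + rng.standard_normal((n, p))

# Linear discriminant d(x) = (mu1 - mu0)'x - 0.5 (mu1 + mu0)'(mu1 - mu0),
# the log posterior odds under this model (equal priors, covariance I).
beta = mu1 - mu0
d = x @ beta - 0.5 * (mu1 + mu0) @ beta

# Missingness model: logit P(label missing | x) = xi0 + xi1 * d(x)^2,
# with d(x)^2 serving as a proxy for the negative log entropy. A negative
# xi1 makes a label likeliest to be missing near the boundary (d(x) ~ 0),
# where the entropy of the class posterior is highest.
xi0, xi1 = 1.0, -1.0                               # hypothetical values
missing = rng.binomial(1, expit(xi0 + xi1 * d**2)).astype(bool)

print(f"proportion unlabelled:       {missing.mean():.2f}")
print(f"mean |d(x)|, labelled set:   {np.abs(d[~missing]).mean():.2f}")
print(f"mean |d(x)|, unlabelled set: {np.abs(d[missing]).mean():.2f}")
```

Under this mechanism the unlabelled features sit close to the decision boundary and the labelled ones lie mostly away from it, mirroring the pattern reported for real partially classified data sets. Fitting the logistic coefficients jointly with the class parameters by maximum likelihood is what allows the partially classified sample to recover, and in some cases exceed, the information in a completely classified one; the sketch above only generates data from the assumed mechanism.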

References

  • Aghaeepour, N., Finak, G., Hoos, H., Mosmann, T., Brinkman, R., Gottardo, R., Scheuermann, R., FlowCAP Consortium, DREAM Consortium: Critical assessment of automated flow cytometry data analysis techniques. Nat. Methods 10, 228–238 (2013)

  • Ahfock, D., McLachlan, G.J.: On missing data patterns in semi-supervised learning. arXiv preprint arXiv:1904.02883 (2019)

  • Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., Raffel, C.: MixMatch: a holistic approach to semi-supervised learning. In: Advances in Neural Information Processing Systems (2019)

  • Castelli, V., Cover, T.M.: The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Trans. Inf. Theory 42, 2102–2117 (1996)

  • Chapelle, O., Schölkopf, B., Zien, A.: Semi-Supervised Learning. The MIT Press, Cambridge (2010)

  • Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc. B 39, 1–22 (1977)

  • Efron, B.: The efficiency of logistic regression compared to normal discriminant analysis. J. Am. Stat. Assoc. 70, 892–898 (1975)

  • Efron, B., Tibshirani, R.: Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat. Sci. 1, 54–75 (1986)

  • Elkan, C., Noto, K.: Learning classifiers from only positive and unlabeled data. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 213–220 (2008)

  • Ganesalingam, S., McLachlan, G.J.: The efficiency of a linear discriminant function based on unclassified initial samples. Biometrika 65, 658–665 (1978)

  • Grandvalet, Y., Bengio, Y.: Semi-supervised learning by entropy minimization. In: Advances in Neural Information Processing Systems, pp. 529–536 (2005)

  • McLachlan, G.J.: Iterative reclassification procedure for constructing an asymptotically optimal rule of allocation in discriminant analysis. J. Am. Stat. Assoc. 70, 365–369 (1975)

  • McLachlan, G.J.: Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York (1992)

  • McLachlan, G.J., Gordon, R.D.: Mixture models for partially unclassified data: a case study of renal venous renin in hypertension. Stat. Med. 8, 1291–1300 (1989)

  • McLachlan, G.J., Scot, D.: Asymptotic relative efficiency of the linear discriminant function under partial nonrandom classification of the training data. J. Stat. Comput. Simul. 52, 415–426 (1995)

  • Mealli, F., Rubin, D.B.: Clarifying missing at random and related definitions, and implications when coupled with exchangeability. Biometrika 102, 995–1000 (2015)

  • Molenberghs, G., Fitzmaurice, G.M., Kenward, M.G., Tsiatis, A.A., Verbeke, G.: Handbook of Missing Data Methodology. CRC Press, Boca Raton (2014)

  • O’Neill, T.J.: Normal discrimination with unclassified observations. J. Am. Stat. Assoc. 73, 821–826 (1978)

  • Ratsaby, J., Venkatesh, S.S.: Learning from a mixture of labeled and unlabeled examples with parametric side information. In: Proceedings of the Eighth Annual Conference on Computational Learning Theory, pp. 412–417 (1995)

  • Rubin, D.B.: Inference and missing data. Biometrika 63, 581–592 (1976)

  • Shahshahani, B.M., Landgrebe, D.A.: The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon. IEEE Trans. Geosci. Remote Sens. 32, 1087–1095 (1994)

  • van Engelen, J., Hoos, H.: A survey on semi-supervised learning. Mach. Learn. 109, 373–440 (2020)

  • Vapnik, V.N.: Statistical Learning Theory. Wiley, New York (1998)

  • Zhang, T.: The value of unlabeled data for classification problems. In: Proceedings of the Seventeenth International Conference on Machine Learning, pp. 1191–1198 (2000)

Acknowledgements

The authors are indebted to the Co-ordinating Editor and two Reviewers for their comments that have improved the exposition of the manuscript.

Author information

Corresponding author

Correspondence to Geoffrey J. McLachlan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This research was funded by the Australian Government through the Australian Research Council (Project Numbers DP170100907 and IC170100035).

Electronic supplementary material

Supplementary material 1 (pdf 169 KB)

Cite this article

Ahfock, D., McLachlan, G.J. An apparent paradox: a classifier based on a partially classified sample may have smaller expected error rate than that if the sample were completely classified. Stat Comput 30, 1779–1790 (2020). https://doi.org/10.1007/s11222-020-09971-5
