Information Systems
Volume 95, January 2021, 101641

On the appropriateness of Platt scaling in classifier calibration

https://doi.org/10.1016/j.is.2020.101641

Highlights

  • Proof of the general parametric assumptions of Platt scaling.

  • Correcting incorrect statements from related work.

  • Equivalence proof of Platt scaling and beta calibration, up to a preprocessing step.

  • Analyzing evaluation metrics and showing that popular ones yield suboptimal results.

  • Supporting the theoretical findings in a simulation study with perfect information.

Abstract

Many applications using data mining and machine learning techniques require posterior probability estimates in addition to often highly accurate predictions. Classifier calibration is a separate branch of machine learning that aims at transforming classifier predictions into posterior class probabilities, which serve as a useful additional extension in the respective applications. Among the existing state-of-the-art classifier calibration techniques, Platt scaling (sometimes also called sigmoid or logistic calibration) is the only parametric one, while almost all of its competing methods do not rely on parametric assumptions. Platt scaling is controversially discussed in the classifier calibration literature: despite good empirical results reported in many domains, many authors criticize it. Interestingly, none of these criticisms properly deals with the underlying parametric assumptions; some statements are even incorrect. Thus, the first contribution of this work is to review such criticism and to present a proof of the true parametric assumptions. As an immediate consequence, these turn out to be more general and valid for different probability distributions. Next, the relationship between Platt scaling and a different, relatively new classifier calibration technique called beta calibration is analyzed, and it is shown that the two are actually equivalent: their only difference lies in the characteristics of the classifier whose predictions are calibrated. Thus, the proven validity of Platt scaling translates directly into a proven optimality of beta calibration. Furthermore, evaluating classifier calibration techniques is a highly non-trivial problem, as the true posteriors cannot be used as a reference. Hence, the existing evaluation metrics are reviewed as well, because some relatively popular evaluation criteria should not be used at all. Finally, the theoretical findings are supported by a simulation study.

Introduction

Accurately estimating probabilities about unknown states or events is a highly relevant task in many domains. Possible examples are investment management [1], [2], [3], [4], [5], in particular credit risk analysis [6], [7] and credit scoring [8], [9], [10], [11]. Other examples cover customer expenditure prediction [12], decision analysis [13], decision making systems [14], resource planning [15] and failure prediction in business processes [16], fraud or phishing detection [17], [18], automatic classification of internet contents [19], [20], [21] and traffic management [22]. Many other cost-sensitive applications exist in medicine [23], [24], security-related tasks like fingerprint detection [25] or computer vision tasks like face recognition [26], [27].

In particular, there usually is a discrete (or even binary) set of possible, mutually exclusive outcome states or events, and the application requires an estimate of the unknown, actually true one. Nowadays, a large variety of data mining and machine learning algorithms is available that often yields highly accurate predictions in the respective tasks. Possible examples are decision trees, random forests, naive Bayes classifiers, support vector machines or (deep) neural networks.

However, in many tasks there is a demand for optimizing the expected loss or profit in order to reasonably trade off risk and chance. For this, not only accurate predictions but also posterior probabilities are required. It is well known that in general, even highly accurate predictions are often miscalibrated, i.e. the predicted probabilities do not correlate well with the true (but unknown) ones [4], [5], [13], [14], [28], [29], [30], [31].

Additionally, there are highly accurate classification algorithms, such as the support vector machine, that do not even allow a probabilistic interpretation. In these cases, classifier calibration techniques can be used to approximate the posterior probabilities. Thus, it is an interesting and relevant task to transform accurate predictions into reliable probability estimates. Even though the binary case of only two different classes (or states/events, depending on the context) seems to be a restriction, it is of primary interest for several reasons: First, posterior probability estimation is generally a very hard problem that can only be solved accurately for very small feature dimensions and large sample sizes [32]. Next, many practically relevant prediction tasks are binary, for instance estimating a given person's credit default probability. Finally, there exist pairwise coupling strategies to generalize from two-class probabilities to the multi-class case.
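
To make the last point concrete, the following minimal sketch (in Python, with hypothetical function names and example values) shows one simple coupling heuristic that averages calibrated two-class probabilities into a multi-class estimate; more refined coupling schemes exist, and the paper does not prescribe a particular one.

```python
import numpy as np

def pairwise_coupling(pairwise_probs, n_classes):
    """Combine calibrated two-class probabilities into a multi-class estimate.

    pairwise_probs: dict mapping (i, j) with i < j to P(class i | class i or j)
    for a single instance. This is a simple averaging heuristic, shown only
    for illustration.
    """
    p = np.zeros(n_classes)
    for (i, j), r_ij in pairwise_probs.items():
        p[i] += r_ij          # evidence for class i from the (i, j) sub-problem
        p[j] += 1.0 - r_ij    # complementary evidence for class j
    return p / p.sum()        # normalize so the estimates sum to one

# Hypothetical three-class example with calibrated pairwise probabilities
print(pairwise_coupling({(0, 1): 0.9, (0, 2): 0.7, (1, 2): 0.4}, n_classes=3))
```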

Against this background, this work deals with supervised machine learning, in particular with an (arbitrary) binary classification task consisting of training samples $D=\{(x_i,y_i): i=1,\dots,r\}$ independently sampled from a generally unknown probability distribution over an input domain $X\times Y$. Here, $X\subseteq\mathbb{R}^n$ refers to the task-specific set of input features characterizing the input data, while $Y$ is the set of classes, with $|Y|=2$ in the binary case. Depending on the context, the labels are sometimes chosen as $Y=\{0,1\}$, while in this work $Y=\{-1,1\}$ is assumed without loss of generality. Further, an arbitrary, non-discrete prediction function or classifier is given that can be used to classify newly observed instances $x$ whose class value $y$ is unknown. Formally, this is assumed to be a non-discrete mapping $f:X\to\mathbb{R}$. The main aim of this work lies in classifier calibration, i.e. post-processing predictions $f(x)$ into posterior class estimates $\sigma(f(x))\in[0,1]$ that approximate the unknown posterior class probability $P(y=1\mid x)$. The focus lies particularly on Platt scaling, a parametric approach whose results are often good and just as often criticized. Thus, this work analyzes in full detail when Platt scaling is an appropriate choice as a classifier calibration technique. In particular, it will be proven that it is optimal for different families of score distributions and therefore generally applicable in more cases than current reference results suggest. Another contribution deals with the validity of calibration evaluation metrics. Here it is shown that certain metrics are theoretically unjustified and can lead to wrong conclusions in practice.
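
As a concrete illustration of this setup, the sketch below produces the real-valued scores $f(x_i)$ that a calibration map $\sigma$ would subsequently post-process. The synthetic data and the choice of a linear support vector classifier are assumptions made purely for demonstration and are not taken from the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in for the unknown distribution over X x Y with Y = {-1, 1}
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
y = 2 * y - 1  # relabel {0, 1} -> {-1, 1} as assumed in the text

# A non-discrete, non-probabilistic classifier f: X -> R
clf = LinearSVC().fit(X, y)
scores = clf.decision_function(X)  # real-valued predictions f(x_i)

# A calibration technique would now fit a map sigma so that sigma(f(x))
# approximates P(y = 1 | x), ideally using held-out data.
print(scores[:5])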

Classifier calibration, including different state-of-the-art techniques, is introduced in Section 2. Section 3 then thoroughly analyzes Platt scaling, including incomplete and sometimes even wrong statements about its parametric assumptions. Based on this, the true parametric assumptions are extended in the main result of Section 3, which reveals that Platt scaling is an optimal choice for different probability distributions. Furthermore, its relation to another, recently introduced technique called beta calibration is analyzed, as beta calibration and Platt scaling in fact prove to be two sides of the same coin. The next question that arises is how to properly evaluate classifier calibration techniques, which is discussed in Section 4. The theoretical findings are empirically supported in Section 5 using a simulation study. Finally, the paper is concluded in Section 6.

Section snippets

Related work

Classifier calibration by itself is studied relatively seldom in the data mining and machine learning literature in comparison with other areas. Despite the lack of a strictly formal definition of a classifier calibration technique, there is consensus that the term refers to any post-processing method that transforms classifier predictions into posterior probabilities intended to be well calibrated.

Formally, a classifier is well calibrated if the empirical class distribution $\frac{1}{r}\sum_{i=1}^{r}\mathbf{1}\left(y_i=1 \mid f(x_i)=p\right)$ of the instances receiving the prediction $p$ matches $p$ itself.
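
In practice this condition can only be checked on finitely many samples. The sketch below shows a common binned approximation, assuming labels in $\{-1,1\}$; the binning procedure itself is an illustrative choice, not part of the paper's definition.

```python
import numpy as np

def reliability_table(probs, labels, n_bins=10):
    """Binned comparison of predicted probabilities and empirical positive rates.

    probs:  calibrated estimates sigma(f(x_i)) in [0, 1]
    labels: class labels in {-1, 1}
    For a well calibrated classifier the two columns should roughly agree.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(probs, bins) - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((probs[mask].mean(), (labels[mask] == 1).mean()))
    return rows  # list of (mean predicted probability, empirical frequency)
```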

Platt scaling

Platt scaling, also known as sigmoid or logistic calibration, assumes a parametric, sigmoidal relationship between classification scores and posterior probabilities. Thus, it is essentially a one-dimensional logistic regression on $f$. Some authors point to minor implementation details when differentiating between Platt scaling and logistic regression [42]. Throughout this work, however, any transformation of the aforementioned form is considered a valid variant. The biggest advantage of Platt scaling clearly is
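
A minimal sketch of the transformation, implemented as the one-dimensional logistic regression mentioned above, could look as follows; Platt's original target-smoothing refinement and any regularization details are deliberately omitted here, and the function names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(scores, labels):
    """Fit the sigmoidal map sigma(s) = 1 / (1 + exp(a*s + b)) to scores f(x_i).

    labels are expected in {-1, 1}; the fit is a plain 1-D logistic regression
    on the scores, i.e. Platt's target smoothing is omitted in this sketch.
    """
    lr = LogisticRegression(C=1e6)  # large C: effectively unregularized
    lr.fit(scores.reshape(-1, 1), (labels == 1).astype(int))
    a = -lr.coef_[0, 0]             # convert to the sign convention above
    b = -lr.intercept_[0]
    return a, b

def platt_transform(scores, a, b):
    """Map raw scores to calibrated estimates of P(y = 1 | x)."""
    return 1.0 / (1.0 + np.exp(a * scores + b))
```

In practice, the two parameters should be fitted on data that was not used to train the classifier itself, since in-sample scores tend to be overly optimistic.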

Evaluation metrics

When applying classifier calibration in practice, there is a straightforward demand to evaluate a calibration technique's performance, for instance to choose the best from a set of available ones. For this purpose, an evaluation metric is required to compare the estimated probability to the true one. Since the true posterior probabilities are unknown, a direct error cannot be computed. Thus, surrogate error functions have to be used instead, which are based only on the probability estimates and the observed class labels.
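
Two widely used surrogates of this kind are the Brier score and the logarithmic loss; the sketch below computes both from estimates and labels only. Which of the popular metrics the paper ultimately recommends or rejects is not reproduced here.

```python
import numpy as np

def brier_score(probs, labels):
    """Mean squared error between estimates of P(y=1|x) and observed outcomes."""
    outcomes = (labels == 1).astype(float)     # labels in {-1, 1} -> {0, 1}
    return np.mean((probs - outcomes) ** 2)

def log_loss(probs, labels, eps=1e-12):
    """Negative average log-likelihood of the observed labels."""
    outcomes = (labels == 1).astype(float)
    p = np.clip(probs, eps, 1.0 - eps)         # avoid log(0)
    return -np.mean(outcomes * np.log(p) + (1.0 - outcomes) * np.log(1.0 - p))
```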

Empirical evaluation

Besides the theoretical insights gained in Sections 3 and 4, there are also a few points that have to be taken into account whenever classifier calibration is applied in practice. Based on the insights from Section 3.2, it is assumed without loss of generality that a real-valued, non-probabilistic classifier $f$ together with a set of predictions $\{(f(x_i),y_i): i=1,\dots,r\}$ is available.
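
The following sketch illustrates what a simulation with perfect information can look like, assuming (purely for illustration) equal-variance Gaussian class-conditional score distributions so that the true posterior $P(y=1\mid s)$ is available in closed form; the paper's concrete simulation design may differ.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Simulated scores with known class-conditional distributions (illustrative choice)
n, prior_pos = 5000, 0.5
y = np.where(rng.random(n) < prior_pos, 1, -1)
scores = np.where(y == 1,
                  rng.normal(1.0, 1.0, n),    # scores of the positive class
                  rng.normal(-1.0, 1.0, n))   # scores of the negative class

def true_posterior(s):
    """Exact P(y = 1 | s) under the simulated Gaussian score model."""
    pos = prior_pos * norm.pdf(s, loc=1.0, scale=1.0)
    neg = (1.0 - prior_pos) * norm.pdf(s, loc=-1.0, scale=1.0)
    return pos / (pos + neg)

# Any calibrated estimate sigma(f(x_i)) fitted on (scores, y) can now be
# compared directly against true_posterior(scores), e.g. via mean squared error.
print(true_posterior(np.array([-2.0, 0.0, 2.0])))
```

With equal variances, the true posterior is itself a sigmoid in $s$, so a correctly fitted Platt map can recover it exactly; comparing the calibrated estimates against true_posterior(scores) then yields a direct error instead of a surrogate one.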

To emphasize the practical relevance of the aforementioned points, this section applies

Conclusion

The contribution of this work is threefold: First, the current results on Platt scaling have been summarized in Section 3, where in particular two incorrect statements from the respective literature are corrected. It is shown that the validity of a monotonic calibration mapping is independent of the classifier's AUC, and in the main result, the parametric assumptions of Platt scaling are derived and shown to be generally valid for more than the currently

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

I would first like to thank my company, CSB-System AG, for supporting this work over the last months and for agreeing to publish the results. Furthermore, I thank Prof. Norbert Gronau, University of Potsdam, for useful discussions in preparing the manuscript. Finally, I gratefully thank the anonymous reviewers for their valuable feedback.

References (67)

  • X. Yang et al., The one-against-all partition based binary tree support vector machine algorithms for multi-class classification, Neurocomputing (2013)
  • P. Xu et al., Evidential calibration of binary SVM classifiers, Internat. J. Approx. Reason. (2016)
  • C. Ferri et al., An experimental comparison of performance measures for classification, Pattern Recognit. Lett. (2009)
  • M.P. Naeini et al., Binary classifier calibration using a Bayesian non-parametric approach
  • M.P. Naeini, G.F. Cooper, M. Hauskrecht, Obtaining well calibrated probabilities using Bayesian binning, in: ...
  • M.P. Naeini et al., Binary classifier calibration using an ensemble of near isotonic regression models (2015)
  • M.P. Naeini et al., Binary classifier calibration using an ensemble of linear trend estimation
  • M.P. Naeini et al., Binary classifier calibration using an ensemble of piecewise linear regression models, Knowl. Inf. Syst. (2018)
  • A. Bella et al., Calibration of machine learning models
  • P.G. Fonseca et al., Calibration of Machine Learning Classifiers for Probability of Default Modelling, Tech. rep. (2017)
  • A. Bella et al., Aggregative quantification for regression, Data Min. Knowl. Discov. (2014)
  • M.P. Naeini et al., Binary classifier calibration: A Bayesian non-parametric approach (2014)
  • C. Guo et al., On calibration of modern neural networks (2017)
  • W.W. Sun et al., Stability enhanced large-margin classifier selection, Statist. Sinica (2018)
  • T. Joachims, Text categorization with Support Vector Machines: Learning with many relevant features
  • H. Zhao et al., A multi-classification method of improved SVM-based information fusion for traffic parameters forecasting, PROMET - Traffic Transp. (2016)
  • X. Jiang et al., Calibrating predictive model estimates to support personalized medicine, J. Amer. Med. Inform. Assoc. (2012)
  • B. Connolly et al., A nonparametric Bayesian method of translating machine learning scores to probabilities in clinical decision support, BMC Bioinformatics (2017)
  • R. Jafri et al., A survey of face recognition techniques, J. Inf. Process. Syst. (2009)
  • I. Cohen et al., Properties and benefits of calibrated classifiers
  • A. Bella et al., Similarity-binning averaging: A generalisation of binning calibration
  • P.A. Flach, Classifier calibration
  • M. Kull et al., Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers, Proc. Mach. Learn. Res. (2017)