On the appropriateness of Platt scaling in classifier calibration
Introduction
Accurately estimating probabilities of unknown states or events is a highly relevant task in many domains. Possible examples are investment management [1], [2], [3], [4], [5], in particular credit risk analysis [6], [7] and credit scoring [8], [9], [10], [11]. Other examples include customer expenditure prediction [12], decision analysis [13], decision making systems [14], resource planning [15], failure prediction in business processes [16], fraud or phishing detection [17], [18], automatic classification of internet contents [19], [20], [21] and traffic management [22]. Many other cost-sensitive applications exist in medicine [23], [24], in security-related tasks like fingerprint detection [25], and in computer vision tasks like face recognition [26], [27].
In particular, there usually is a discrete (or even binary) set of possible, mutually exclusive outcome states or events, and the application requires an estimate of the unknown, actually true one. Nowadays a large variety of data mining and machine learning algorithms is available that often achieve high-accuracy predictions in the respective tasks. Possible examples are decision trees, random forests, naive Bayes classifiers, support vector machines and (deep) neural networks.
However, in many tasks there is a demand for optimizing the expected loss or profit to reasonably trade off risk and chance. For this, not only accurate predictions but also posterior probabilities are required. It is well known that, in general, even highly accurate predictions are often miscalibrated, i.e. predicted probabilities do not correlate well with the true (but unknown) ones [4], [5], [13], [14], [28], [29], [30], [31].
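To make the expected-loss argument concrete, the following sketch derives the cost-minimizing decision rule for a binary task from a posterior probability estimate; the cost values are hypothetical and only serve as an illustration of why calibrated probabilities, not just class labels, are needed.

```python
# Minimal illustration: expected-loss decisions require P(y=1|x).
# The costs (false positive = 1, false negative = 5) are hypothetical.
def expected_cost(p_pos, predict_pos, cost_fp=1.0, cost_fn=5.0):
    """Expected misclassification cost of a decision, given P(y=1|x) = p_pos."""
    if predict_pos:
        return (1.0 - p_pos) * cost_fp  # wrong only if the true class is 0
    return p_pos * cost_fn              # wrong only if the true class is 1

def bayes_decision(p_pos, cost_fp=1.0, cost_fn=5.0):
    """Predict the positive class iff its expected cost is not larger,
    which is equivalent to p_pos >= cost_fp / (cost_fp + cost_fn)."""
    return expected_cost(p_pos, True, cost_fp, cost_fn) <= \
        expected_cost(p_pos, False, cost_fp, cost_fn)
```

With these costs the decision threshold is 1/6 ≈ 0.167, far from the naive 0.5 cutoff, so a classifier whose probabilities are miscalibrated around the threshold produces systematically suboptimal decisions even if its accuracy is high.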
Additionally, there are highly accurate classification algorithms, such as the support vector machine, that do not even admit a probabilistic interpretation. In these cases, classifier calibration techniques can be used to approximate the posterior probabilities. Thus, it is an interesting and relevant task to transform accurate predictions into reliable probability estimates. Even though the binary case of only two classes (or states/events, depending on the context) seems to be a restriction, it is of primary interest for several reasons: First, posterior probability estimation is generally a very hard problem that can only be solved accurately for very small feature dimensions and large sample sizes [32]. Next, many practically relevant prediction tasks are binary, for instance estimating a given person's credit default probability. Finally, pairwise coupling strategies exist to generalize two-class probabilities to the multi-class case.
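The pairwise coupling idea mentioned above can be sketched as follows. This is a deliberately simple averaging heuristic under the convention r[i, j] + r[j, i] = 1, not one of the refined coupling methods from the literature, which instead solve an optimization problem over the class distribution.

```python
import numpy as np

def couple_pairwise(r):
    """Combine pairwise probabilities r[i, j] = P(class i | class i or j)
    into a multi-class distribution by simple averaging (a crude heuristic).
    r is a K x K array with r[i, j] + r[j, i] = 1 for i != j; the diagonal
    is ignored.  Because each pair contributes exactly 1 in total, the
    result sums to 1."""
    K = r.shape[0]
    off_diag_sums = r.sum(axis=1) - np.diag(r)
    return 2.0 * off_diag_sums / (K * (K - 1))
```

For K = 2 this reduces to the two-class probability itself, which is why reliable binary estimates are the building block for the multi-class case.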
Against this background, this work deals with supervised machine learning, in particular with an (arbitrary) binary classification task consisting of training samples (x_i, y_i) independently sampled from a generally unknown probability distribution over an input domain X × Y. Here, X refers to the task-specific set of input features characterizing the input data, while Y is the set of classes, Y = {0, 1} in the binary case. Depending on the context, the labels are sometimes selected as y ∈ {−1, +1}, while in this work, y ∈ {0, 1} is assumed without loss of generality. Further, an arbitrary, non-discrete prediction function or classifier s is given that can be used to classify newly observed instances whose class value is unknown. Formally, this is assumed as a non-discrete mapping s: X → ℝ. The main aim of this work lies in classifier calibration, i.e. post-processing predictions s(x) into posterior class estimates approximating the unknown posterior class probability P(y = 1 | x). The focus lies particularly on Platt scaling, a parametric approach whose results are often good and just as often criticized. Thus, this work analyzes in full detail when Platt scaling is an appropriate choice as a classifier calibration technique. In particular, it will be proven that it is optimal for different families of score distributions and is therefore applicable in more cases than current reference results suggest. Another contribution concerns the validity of calibration evaluation metrics. Here it is shown that certain metrics are theoretically unjustified and can lead to wrong conclusions in practice.
Classifier calibration, including different state-of-the-art techniques, is introduced in Section 2. Section 3 then thoroughly analyzes Platt scaling, including a review of incomplete and sometimes even incorrect statements about its parametric assumptions. Based on this, the true parametric assumptions are extended in the main result of Section 3, which reveals that Platt scaling is an optimal choice for different families of probability distributions. Furthermore, its relation to another, recently introduced technique called beta calibration is analyzed, as beta calibration and Platt scaling prove to be two sides of the same coin. The next question that arises is how to properly evaluate classifier calibration techniques, which is discussed in Section 4. The theoretical findings are empirically supported in Section 5 using a simulation study. Finally, the paper is concluded in Section 6.
Section snippets
Related work
Classifier calibration by itself is studied relatively seldom in the data mining and machine learning literature in comparison with other areas. Despite the lack of a strictly formal definition, there is consensus that a classifier calibration technique refers to any post-processing method that transforms classifier predictions into posterior probability estimates intended to be well calibrated.
Formally, a classifier is well calibrated if the empirical class distribution of
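Since, as the snippet above indicates, calibration is defined via the empirical class distribution, a common practical check is to bin the predicted probabilities and compare each bin's mean prediction against its empirical positive rate. The following is a minimal sketch of such a binned estimate; the equal-width binning and the bin count are arbitrary implementation choices, not prescribed by the paper.

```python
import numpy as np

def expected_calibration_error(p_hat, y, n_bins=10):
    """Binned surrogate for calibration error: the size-weighted average of
    |mean predicted probability - empirical positive rate| over equal-width
    probability bins.  A surrogate is needed because the true posterior
    probabilities are unknown; only 0/1 outcomes y are observed."""
    bins = np.clip((p_hat * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(p_hat[mask].mean() - y[mask].mean())
            ece += mask.mean() * gap
    return ece
```

A well-calibrated classifier drives this quantity towards zero as the sample grows, whereas systematic over- or under-confidence leaves a persistent gap.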
Platt scaling
Platt scaling, also known as sigmoid or logistic calibration, assumes a parametric, sigmoidal relationship between classification scores and posterior probabilities. Thus, it is essentially a one-dimensional logistic regression on the classification scores. Some authors point to minor implementation details when differentiating between Platt scaling and logistic regression [42]. However, throughout this work, any transformation of the aforementioned form is a valid variant. The biggest advantage of Platt scaling clearly is
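Viewed as one-dimensional logistic regression on the scores, Platt scaling can be sketched as follows. The sign convention sigmoid(a·s + b) and the plain Newton solver are implementation choices made here for brevity; Platt's original formulation additionally smooths the 0/1 targets, a detail omitted in this sketch.

```python
import numpy as np

def fit_platt(scores, y, n_iter=25):
    """Fit the sigmoid p(s) = 1 / (1 + exp(-(a*s + b))) by maximum
    likelihood, using Newton's method on the two parameters (a, b).
    y contains 0/1 labels.  (Platt's original paper also smooths the
    targets; that regularization detail is omitted here.)"""
    a, b = 1.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        w = p * (1.0 - p)                               # IRLS weights
        g = np.array([((p - y) * scores).sum(),         # gradient of the NLL
                      (p - y).sum()])
        H = np.array([[(w * scores ** 2).sum(), (w * scores).sum()],
                      [(w * scores).sum(),      w.sum()]])
        a, b = np.array([a, b]) - np.linalg.solve(H, g)
    return a, b

def platt_predict(scores, a, b):
    """Map raw scores to calibrated probability estimates."""
    return 1.0 / (1.0 + np.exp(-(a * scores + b)))
```

For illustration, if class-conditional scores are Gaussian with unit variance and means ±1 under equal priors, the true posterior is exactly sigmoid(2s), so the fitted parameters should approach a ≈ 2, b ≈ 0; this is one of the families of score distributions for which the sigmoid form is exact.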
Evaluation metrics
While applying classifier calibration in practice, there is a straightforward demand to evaluate a calibration technique’s performance, for instance to choose the best from a set of available ones. For this purpose, an evaluation metric is required to compare the estimated probability to the true one. Since the true posterior probabilities are unknown, a direct error cannot be computed. Thus, some surrogate error functions have to be used instead which are only based on a probability estimate
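Because only 0/1 outcomes are observed, the standard surrogates are strictly proper scoring rules, which are minimized in expectation by the true posterior. A minimal sketch of the two most common ones follows; which surrogates the paper ultimately endorses or criticizes is discussed in the full text, so these serve only as reference implementations.

```python
import numpy as np

def brier_score(p_hat, y):
    """Mean squared error between predicted probabilities and 0/1 outcomes.
    A strictly proper scoring rule: its expectation is minimized by the
    true posterior probability."""
    return np.mean((p_hat - y) ** 2)

def log_loss(p_hat, y, eps=1e-12):
    """Negative log-likelihood per sample.  Also strictly proper, but
    unbounded for confident wrong predictions, hence the clipping."""
    p = np.clip(p_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```

For instance, if the true posterior is uniform on [0, 1], the expected Brier score of the true posterior itself is E[p(1 − p)] = 1/6, and any systematic distortion of the probabilities increases both scores.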
Empirical evaluation
Besides the theoretical insights gained in Sections 3 and 4, there are also a few points that have to be taken into account whenever classifier calibration is applied in practice. Based on the insights from Section 3.2, it is assumed without loss of generality that a real-valued, non-probabilistic classifier together with a set of predictions is available.
To emphasize the practical relevance of the aforementioned points, this section applies
Conclusion
The contribution of this work is three-fold: First, the current results on Platt scaling have been summarized in Section 3, where especially two incorrect statements made in the respective literature are corrected. Here it is shown that the validity of a monotonic calibration mapping is independent of the classifier's and in the main result, the parametric assumptions of Platt scaling are derived and it has been shown that these are generally valid for more than the currently
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
I would first like to thank my company CSB-System AG for supporting this work over the last months and for agreeing to publish the results. Furthermore, I thank Prof. Norbert Gronau, University of Potsdam, for useful discussions in preparing the manuscript. Finally, I gratefully thank the anonymous reviewers for their valuable feedback.
References (67)
- et al., Approaches for credit scorecard calibration: An empirical analysis, Knowl.-Based Syst. (2017)
- et al., Dynamic classifier selection: Recent advances and perspectives, Inf. Fusion (2018)
- et al., Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research, European J. Oper. Res. (2015)
- et al., Ensemble classification based on supervised clustering for credit scoring, Appl. Soft Comput. (2016)
- et al., Predictive performance modeling for distributed computing using black-box monitoring and machine learning, Inf. Syst. (2019)
- et al., Event-based failure prediction in distributed business processes, Inf. Syst. (2019)
- et al., New one versus method: NOV@, Expert Syst. Appl. (2014)
- et al., Speech-acts based analysis for requirements discovery from online discussions, Inf. Syst. (2019)
- et al., Enhancing directed binary trees for multi-class classification, Inform. Sci. (2013)
- et al., Fingerprint classification using one-vs-all support vector machines dynamically ordered with naïve Bayes classifiers, Pattern Recognit. (2008)
- The one-against-all partition based binary tree support vector machine algorithms for multi-class classification, Neurocomputing
- Evidential calibration of binary SVM classifiers, Internat. J. Approx. Reason.
- An experimental comparison of performance measures for classification, Pattern Recognit. Lett.
- Binary classifier calibration using a Bayesian non-parametric approach
- Binary classifier calibration using an ensemble of near isotonic regression models
- Binary classifier calibration using an ensemble of linear trend estimation
- Binary classifier calibration using an ensemble of piecewise linear regression models, Knowl. Inf. Syst.
- Calibration of machine learning models
- Calibration of Machine Learning Classifiers for Probability of Default Modelling, Tech. rep.
- Aggregative quantification for regression, Data Min. Knowl. Discov.
- Binary classifier calibration: A Bayesian non-parametric approach
- On calibration of modern neural networks
- Stability enhanced large-margin classifier selection, Statist. Sinica
- Text categorization with Support Vector Machines: Learning with many relevant features
- A multi-classification method of improved SVM-based information fusion for traffic parameters forecasting, PROMET - Traffic Transp.
- Calibrating predictive model estimates to support personalized medicine, J. Amer. Med. Inform. Assoc.
- A nonparametric Bayesian method of translating machine learning scores to probabilities in clinical decision support, BMC Bioinformatics
- A survey of face recognition techniques, J. Inf. Process. Syst.
- Properties and benefits of calibrated classifiers
- Similarity-Binning averaging: A generalisation of binning calibration
- Classifier calibration
- Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers, Proc. Mach. Learn. Res.
Cited by (28)
- A customised down-sampling machine learning approach for sepsis prediction, International Journal of Medical Informatics (2024)
- CALIMERA: A new early time series classification method, Information Processing and Management (2023)
- MS-ACGAN: A modified auxiliary classifier generative adversarial network for schizophrenia's samples augmentation based on microarray gene expression data, Computers in Biology and Medicine (2023)
- Detecting and ranking pornographic content in videos, Forensic Science International: Digital Investigation (2022). Citation excerpt: "In our work we settle for a model re-calibration technique implemented as a post-processing operation, rather than attempting calibration within the deep learning model itself or during its training. We investigated a number of computationally-efficient techniques including temperature scaling (Mozafari et al., 2018), Platt scaling (Platt, 1999), and isotonic regression (Niculescu-Mizil and Caruana, 2005), opting for Platt scaling based on the empirical results obtained (see Fig. 3), and due to its optimality characteristics (Böken, 2021). Zhao et al. (2017) perform a sweep over a range of values for both the flooding parameter γ and the grouping criterion τ, generating action proposals to be then fed to specific action recognisers."