Model-free posterior inference on the area under the receiver operating characteristic curve

https://doi.org/10.1016/j.jspi.2020.03.008

Highlights

  • The proposed Gibbs posterior for AUC is model-free, hence robust.

  • We derive the Gibbs posterior concentration rate.

  • A calibration algorithm is employed so that credible sets have near-exact coverage.

  • The method performs well in simulations compared to existing Bayesian methods.

Abstract

The area under the receiver operating characteristic curve (AUC) serves as a summary of a binary classifier’s performance. For inference on the AUC, a common modeling assumption is binormality, which restricts the distribution of the score produced by the classifier. However, this assumption introduces an infinite-dimensional nuisance parameter and may be restrictive in certain machine learning settings. To avoid making distributional assumptions, and to avoid the computational challenges of a fully nonparametric analysis, we develop a direct and model-free Gibbs posterior distribution for inference on the AUC. We present the asymptotic Gibbs posterior concentration rate, and a strategy for tuning the learning rate so that the corresponding credible intervals achieve the nominal frequentist coverage probability. Simulation experiments and a real data analysis demonstrate the Gibbs posterior’s strong performance compared to existing Bayesian methods.

Introduction

First proposed during World War II to assess the performance of radar receiver operators (Calì and Longobardi, 2015), the receiver operating characteristic (ROC) curve is now an essential tool for analyzing the performance of binary classifiers in areas such as signal detection (Green and Swets, 1966), psychological testing (Swets, 1973, Swets, 1986), radiology (Lusted, 1960, Hanley and McNeil, 1982), medical diagnosis (Swets and Pickett, 1982, Hanley, 1989), and data mining (Spackman, 1989, Fawcett, 2006). One informative summary of the ROC curve is the corresponding area under the curve (AUC). This measure provides an overall assessment of a classifier's performance, independent of the choice of threshold, and is therefore a preferred criterion for evaluating classification algorithms (Provost and Fawcett, 1997, Provost et al., 1998, Bradley, 1997, Huang and Ling, 2005). The AUC is an unknown quantity, and our goal is to use the information contained in the data to make inference about it.

The specific setup is as follows. Consider a binary classifier that produces a random score indicating the propensity for membership in, say, Group 1: individuals with scores above a threshold are classified to Group 1, the rest to Group 0. Let U and V be independent scores corresponding to Group 1 and Group 0, respectively. Given a threshold t, define the specificity and sensitivity as spec(t) = P(V < t) and sens(t) = P(U > t). The ROC curve is then the plot of the parametric curve (1 − spec(t), sens(t)) as t ranges over all possible score values. While the ROC curve summarizes the classifier's tradeoff between sensitivity and specificity as the threshold varies, the AUC measures the probability that the classifier correctly orders the scores of two individuals drawn from the two groups, which equals P(U > V) (Bamber, 1975), independent of the choice of threshold. Consequently, the AUC is a functional of the joint distribution of (U, V), denoted by P, so the ROC curve itself is not needed to identify the AUC.
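Since the AUC equals P(U > V), it can be estimated directly from two samples of scores, without drawing the ROC curve at all. As a minimal illustration (a sketch, not code from the paper), the following Python snippet computes the empirical AUC, i.e., the proportion of (U, V) score pairs with U > V, which is the normalized Mann-Whitney statistic:

```python
import numpy as np

def empirical_auc(u, v):
    """Empirical AUC: fraction of (U, V) pairs with U > V.

    This is the normalized Mann-Whitney U statistic; ties are
    counted as one half, the usual convention for the ROC area.
    """
    u, v = np.asarray(u), np.asarray(v)
    # Compare every Group-1 score against every Group-0 score.
    greater = (u[:, None] > v[None, :]).mean()
    ties = (u[:, None] == v[None, :]).mean()
    return greater + 0.5 * ties

# Example: scores from two well-separated groups.
rng = np.random.default_rng(0)
u = rng.normal(2.0, 1.0, size=100)  # Group 1 scores
v = rng.normal(0.0, 1.0, size=100)  # Group 0 scores
print(empirical_auc(u, v))          # close to the true AUC, 0.9214
```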

In the context of inference on the AUC, when the scores are continuous, it is common to assume that P satisfies a so-called binormality assumption, which states that there exists a monotone increasing transformation that maps both U and V to normal random variables (Hanley, 1988). For most medical diagnostic tests, where the classifiers are simple and ready-to-use without training, such an assumption serves well (Hanley, 1988, Metz et al., 1998, Cai and Moskowitz, 2004), although it has been argued that other distributions can be more appropriate for some specific tests (e.g., Guignard and Salehi, 1983, Goddard and Hinberg, 1990). But for complicated classifiers which involve multiple predictors, as often arise in machine learning applications, binormality – or any other model assumption for that matter – becomes a burden. This motivates our pursuit of a “model-free” approach to inference about the AUC.

Specifically, our goal is the construction of a type of posterior distribution for the AUC. The most familiar such construction is via Bayes's formula, but this requires a likelihood function and, hence, a statistical model. The only way one can be effectively "model-free" within a Bayesian framework is to make the model extremely flexible, which requires many parameters. In the extreme case, a so-called Bayesian nonparametric approach would take the distribution P itself as the model parameter (e.g., Ghosal and van der Vaart, 2017, Gu et al., 2008). When the model includes many parameters, the analyst bears the burden of specifying prior distributions for all of them, based on little or no genuine prior information, as well as the burden of computing a high-dimensional posterior. But since the AUC is just a one-dimensional feature of this complicated set of parameters, there is no obvious return on the investment in prior specification and posterior computation. A better approach would be to construct the posterior distribution for the AUC directly, using available prior information about the AUC only, without specifying a model and without introducing artificial model parameters. That way, the data analyst avoids the burdens of prior specification and posterior computation, bias due to model misspecification, and issues that can arise as a result of non-linear marginalization (e.g., Martin, 2019, Fraser, 2011).

As an alternative to the traditional Bayesian approach, we consider here the construction of a so-called Gibbs posterior for the AUC. In general, the Gibbs posterior construction proceeds by defining the quantity of interest as the minimizer of a suitable risk function, treating a scaled empirical version of that risk like a negative log-likelihood, and then combining with a prior distribution as in Bayes's formula. General discussion of Gibbs posteriors can be found in Zhang, 2006a, Zhang, 2006b, Bissiri et al. (2016) and Alquier et al. (2016), and some statistical applications are discussed in Jiang and Tanner (2008) and Syring and Martin, 2017, Syring and Martin, 2019a, Syring and Martin, 2019b. Again, the advantage is that Gibbs posteriors avoid model misspecification bias and the need to deal with nuisance parameters. Moreover, under suitable conditions, Gibbs posteriors can be shown to have desirable asymptotic concentration properties (e.g., Syring and Martin, 2020, Bhattacharya and Martin, 2020, Chernozhukov and Hong, 2003), with theory that parallels that of Bayesian posteriors under model misspecification (e.g., Kleijn and van der Vaart, 2006, Kleijn and van der Vaart, 2012).

A subtle point is that, while the risk minimization problem that defines the quantity of interest is independent of the scale of the loss function, the Gibbs posterior is not. This scale factor is often referred to as the learning rate (e.g., Grünwald, 2012) and, because it controls the spread of the Gibbs posterior, its specification needs to be handled carefully. There are various approaches to specifying the learning rate (e.g., Grünwald, 2012, Grünwald and Van Ommen, 2017, Bissiri et al., 2016, Holmes and Walker, 2017, Lyddon et al., 2019). Here we adopt the approach in Syring and Martin (2019a), which sets the learning rate so that, in addition to the Gibbs posterior's robustness to model misspecification and its asymptotic concentration properties, its credible sets attain the nominal frequentist coverage probability. When the sample size is large, we recommend an (asymptotically) equivalent calibration method that is simpler to compute.
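To convey the flavor of this calibration, the sketch below searches for a learning rate omega at which the bootstrap coverage of the Gibbs credible interval matches the nominal level. It is a generic illustration of bootstrap calibration, not the authors' exact algorithm; the callable credible_interval, the multiplicative Robbins-Monro-style update, and all tuning constants are assumptions of this sketch.

```python
import numpy as np

def calibrate_omega(u, v, credible_interval, theta_hat, alpha=0.05,
                    n_boot=200, omega0=1.0, steps=20, seed=0):
    """Search for a learning rate omega such that 100(1 - alpha)% Gibbs
    credible intervals attain roughly nominal bootstrap coverage.

    `credible_interval(u, v, omega)` must return (lo, hi) for the Gibbs
    posterior built from scores (u, v) at learning rate omega; `theta_hat`
    is the point estimate (e.g., empirical AUC) from the original data.
    """
    rng = np.random.default_rng(seed)
    omega = omega0
    for t in range(1, steps + 1):
        covered = 0
        for _ in range(n_boot):
            ub = rng.choice(u, size=len(u), replace=True)
            vb = rng.choice(v, size=len(v), replace=True)
            lo, hi = credible_interval(ub, vb, omega)
            covered += (lo <= theta_hat <= hi)
        coverage = covered / n_boot
        # Coverage below nominal -> shrink omega, spreading the posterior;
        # coverage above nominal -> grow omega. Step size decays like 1/t.
        omega *= np.exp((coverage - (1 - alpha)) / t)
    return omega
```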

The present paper is organized as follows. In Section 2.1, we review some methods for making inference on the AUC based on the binormality assumption, in particular, the Bayesian approach in Gu and Ghosal (2009) that involves a suitable rank-based likelihood. In Section 2.2, we argue that the binormality assumption is generally inappropriate in machine learning applications, and provide one illustrative example involving a support vector machine. This difficulty with model specification leads us to the Gibbs posterior, a model-free alternative to a Bayesian posterior, which is reviewed in Section 2.3. In Section 3, we develop the Gibbs posterior for inference on the AUC, derive its asymptotic concentration properties, and investigate how to properly scale the risk function. Simulation experiments are carried out in Section 4, where the Gibbs posterior performs favorably compared with the Bayesian approach based on a rank-based likelihood and two other Bayesian nonparametric methods. We also apply the Gibbs posterior to a real dataset for evaluating the performance of a biomarker for pancreatic cancer and compare our results with those based on some existing Bayesian methods. Finally, we give some concluding remarks in Section 5.

Section snippets

Binormality and related methods

Following Hanley (1988), the scores U and V satisfy the binormality assumption if their distribution functions are Φ[b⁻¹{H(u) − a}] and Φ{H(v)}, respectively, where a > 0, b > 0, H is a monotone increasing function, and Φ denotes the N(0,1) distribution function; this implies that U and V can be transformed to N(a, b²) and N(0,1) via H. If P = P_{a,b,H} denotes the distribution of (U, V) under this assumption, then the ROC curve and the AUC are given, respectively, by t ↦ Φ[b⁻¹{a + Φ⁻¹(t)}] and Φ{a/√(b² + 1)}.
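As a quick numerical check of these formulas (a sketch under the parameterization above, not code from the paper):

```python
from scipy.stats import norm

def binormal_roc(t, a, b):
    """Binormal ROC curve: t -> Phi[b^{-1}{a + Phi^{-1}(t)}]."""
    return norm.cdf((a + norm.ppf(t)) / b)

def binormal_auc(a, b):
    """Binormal AUC: Phi{a / sqrt(b^2 + 1)}."""
    return norm.cdf(a / (b**2 + 1) ** 0.5)

# With a = 2, b = 1, i.e., H(U) ~ N(2, 1) and H(V) ~ N(0, 1):
print(binormal_auc(2.0, 1.0))        # Phi(2 / sqrt(2)) = 0.9214
print(binormal_roc(0.1, 2.0, 1.0))   # sensitivity at 10% false-positive rate
```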

Definition

As mentioned, the AUC is a functional of the joint distribution P of (U, V), i.e., θ = θ(P), given by θ = P(U > V). Recall that the data consist of independent copies (U₁, …, U_m) and (V₁, …, V_n) of U and V, respectively. To construct a Gibbs posterior distribution for θ as discussed above, we need an appropriate loss function. That is, we need a function ℓ_θ(u, v) such that the corresponding risk function, R(θ) = P ℓ_θ, is minimized at the true AUC, θ⋆. If we define ℓ_θ(u, v) = {θ − 1(u > v)}², θ ∈ [0, 1], then it is easy to check that R(θ) is minimized at θ⋆ = P(U > V).
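To make the construction concrete, here is a minimal sketch of the empirical risk and the resulting Gibbs posterior density; the grid approximation, the default flat prior, and the total-sample-size scaling in the exponent are assumptions of this sketch, not necessarily the paper's exact conventions:

```python
import numpy as np

def empirical_risk(theta, u, v):
    """R_n(theta): average of {theta - 1(u_i > v_j)}^2 over all (i, j) pairs.

    Because the loss is squared error against the indicator 1(u > v),
    R_n is minimized at the empirical AUC.
    """
    b = (u[:, None] > v[None, :]).astype(float).ravel()
    return ((theta - b) ** 2).mean()

def gibbs_posterior(u, v, omega, grid=None, prior=None):
    """Gibbs posterior density on a grid, proportional to
    exp(-omega * n * R_n(theta)) * prior(theta), normalized numerically;
    here n is taken to be the total sample size."""
    if grid is None:
        grid = np.linspace(0.001, 0.999, 999)
    n = len(u) + len(v)
    log_post = np.array([-omega * n * empirical_risk(t, u, v) for t in grid])
    if prior is not None:
        log_post += np.log(prior(grid))
    post = np.exp(log_post - log_post.max())  # stabilize before normalizing
    return grid, post / np.trapz(post, grid)

rng = np.random.default_rng(1)
u, v = rng.normal(2, 1, 80), rng.normal(0, 1, 100)
grid, dens = gibbs_posterior(u, v, omega=1.0)
print(grid[np.argmax(dens)])  # posterior mode near the empirical AUC
```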

Simulation studies

Since the AUC is invariant when the random variables U and V undergo the same monotone increasing transformation, we fix the distribution of V to be standard normal and consider four examples for the distribution of U:

    Example 1.

    U ∼ N(2, 1) and θ⋆ = 0.9214;

    Example 2.

    U ∼ SN(3, 1, −4) – skew normal – and θ⋆ = 0.9665;

    Example 3.

    U ∼ 0.2 N(−1, 1) + 0.8 N(2, 0.5²) and θ⋆ = 0.8185;

    Example 4.

    U ∼ 2 − Exp(1) and θ⋆ = 0.7895.

Fig. 2 provides a visualization of the two densities in each of the four examples. Note that these four examples include one case in which the binormality assumption holds exactly (Example 1) and three in which it fails.
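The stated true AUC values can be checked by Monte Carlo. A self-contained sketch (assuming scipy's location-scale-shape parameterization of the skew normal):

```python
import numpy as np
from scipy.stats import skewnorm

rng = np.random.default_rng(42)
N = 1_000_000  # Monte Carlo sample size

# V ~ N(0, 1) throughout; the four choices of U from Examples 1-4.
v = rng.normal(0, 1, N)
examples = {
    "Ex1: N(2,1), theta*=0.9214": rng.normal(2, 1, N),
    "Ex2: SN(3,1,-4), theta*=0.9665":
        skewnorm.rvs(-4, loc=3, scale=1, size=N, random_state=rng),
    "Ex3: 0.2 N(-1,1) + 0.8 N(2,0.5^2), theta*=0.8185":
        np.where(rng.random(N) < 0.2,
                 rng.normal(-1, 1, N), rng.normal(2, 0.5, N)),
    "Ex4: 2 - Exp(1), theta*=0.7895": 2 - rng.exponential(1, N),
}
for name, u in examples.items():
    print(name, "| MC estimate:", round(float((u > v).mean()), 4))
```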

Conclusion

In certain applications, the parameters of interest can be defined as minimizers of an appropriate risk function, separate from any statistical model. In such cases, one can avoid potential model misspecification biases by working with some kind of "model-free" approach. The present paper considered one such example, namely, inference on the AUC, where the state-of-the-art statistical model is one that depends on an infinite-dimensional nuisance parameter. As an alternative, we propose to construct a Gibbs posterior distribution for the AUC directly, without specifying a statistical model.

CRediT authorship contribution statement

Zhe Wang: Conceptualization, Methodology, Writing - original draft, Writing - review & editing. Ryan Martin: Conceptualization, Methodology, Funding acquisition, Writing - review & editing.

Acknowledgments

The authors thank the editors and anonymous reviewers for their helpful feedback on a previous version of the manuscript. This work is partially supported by the U.S. National Science Foundation, DMS-1811802.

References (55)

  • Brodersen, K.H., et al. The binormal assumption on precision-recall curves.

  • Cai, T., et al. Semi-parametric estimation of the binormal ROC curve for a continuous diagnostic test. Biostatistics (2004).

  • Calì, C., et al. Some mathematical properties of the ROC curve and their applications. Ricerche Mat. (2015).

  • de Carvalho, V.I., et al. Bayesian nonparametric ROC regression modeling. Bayesian Anal. (2013).

  • Fasiolo, M., et al. Fast calibrated additive quantile regression. (2017).

  • Fraser, D.A. Is Bayes posterior just quick and dirty confidence? Stat. Sci. (2011).
  • Ghosal, S., et al. Fundamentals of Nonparametric Bayesian Inference (2017).
  • Goddard, M., et al. Receiver operator characteristic (ROC) curves and non-normal data: an empirical study. Stat. Med. (1990).

  • Green, D.M., et al. Signal Detection Theory and Psychophysics, Vol. 1 (1966).

  • Grünwald, P. The safe Bayesian (2012).

  • Grünwald, P., et al. Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Anal. (2017).

  • Gu, J., et al. Bayesian bootstrap estimation of ROC curve. Stat. Med. (2008).

  • Guignard, P., et al. Validity of the Gaussian assumption in the analysis of ROC data obtained from scintigraphic-like images. Phys. Med. Biol. (1983).

  • Hanley, J.A. The robustness of the 'binormal' assumptions used in fitting ROC curves. Med. Decis. Mak. (1988).

  • Hanley, J.A. Receiver operating characteristic (ROC) methodology: the state of the art. Crit. Rev. Diagn. Imaging (1989).

  • Hanley, J.A., et al. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology (1982).

  • Hoeffding, W. A class of statistics with asymptotically normal distribution. Ann. Math. Stat. (1948).