The three-class ideal observer for univariate normal data: Decision variable and ROC surface properties

https://doi.org/10.1016/j.jmp.2012.05.003

Abstract

Although a fully general extension of ROC analysis to classification tasks with more than two classes has yet to be developed, the potential benefits to be gained from a practical performance evaluation methodology for classification tasks with three classes have motivated a number of research groups to propose methods based on constrained or simplified observer or data models. Here we consider an ideal observer in a task with underlying data drawn from three univariate normal distributions. We investigate the behavior of the resulting ideal observer’s decision variables and ROC surface. In particular, we show that the pair of ideal observer decision variables is constrained to a parametric curve in two-dimensional likelihood ratio space, and that the decision boundary line segments used by the ideal observer can intersect this curve in at most six places. From this, we further show that the resulting ROC surface has at most four degrees of freedom at any point, and not the five that would be required, in general, for a surface in a six-dimensional space to be non-degenerate. In light of the difficulties we have previously pointed out in generalizing the well-known area under the ROC curve performance metric to tasks with three or more classes, the problem of developing a suitable and fully general performance metric for classification tasks with three or more classes remains unsolved.

Highlights

► Extension of ROC analysis to tasks with more than two classes is difficult.
► We examine the three-class ideal observer for univariate trinormal data.
► This choice allows explicit analytic calculation of ideal observer behavior.
► We hope the insights gained will enable us to analyze more realistic data models.

Introduction

We are working to extend the well-known observer performance evaluation methodology of receiver operating characteristic (ROC) analysis (Egan, 1975; Metz, 1978) to classification tasks with three or more classes. This could conceivably be of benefit, for example, in a medical decision-making task in which a region of a patient image must be characterized as containing a malignant lesion, a benign lesion, or only normal tissue (Chan et al., 2003; Edwards et al., 2004); or in a task in which multiple types of abnormality or defect must be distinguished from each other and from benign or normal cases (He, Metz, Tsui, Links, & Frey, 2006).

Unfortunately, a fully general extension of ROC analysis to classification tasks with more than two classes has yet to be developed, for the following reasons. It is known that the performance of an observer in a classification task with $N$ classes ($N \geq 2$) can be completely described by a set of $N^2 - N$ conditional error probabilities (Van Trees, 1968), and that the performance of the ideal observer (that which minimizes Bayes risk (Van Trees, 1968)) is completely characterized by an ROC hypersurface in which these conditional error probabilities depend on a set of $N^2 - N - 1$ decision criteria (Edwards, Metz, & Kupinski, 2004). Although analytic expressions for the ideal observer’s conditional error probabilities given reasonable models for the underlying observational data have been worked out in the two-class case (Metz & Pan, 1999), this has not yet been accomplished in a fully general manner for tasks with three or more classes. Furthermore, we have shown that an obvious generalization of the area under the ROC curve (AUC) does not in fact yield a useful performance metric in tasks with three or more classes (Edwards, Metz, & Nishikawa, 2005). More recently, we showed that complicated constraining relationships exist among the decision boundaries themselves for the ideal observer (Edwards & Metz, 2005). These constraining relationships appear to imply that it is highly unlikely that analytical expressions for the conditional error probabilities in terms of the decision criteria can be developed which are as simple to interpret as those for the two-class task.
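As a quick check on the counting in the preceding paragraph, the following minimal sketch enumerates the (decision, truth) pairs that correspond to errors and evaluates the number of decision criteria for a few values of N. The function names are hypothetical and are introduced only for illustration.

```python
from itertools import product


def error_probability_labels(n_classes):
    """List every (decision, truth) pair in which the decision is wrong."""
    classes = range(1, n_classes + 1)
    return [(d, t) for d, t in product(classes, repeat=2) if d != t]


def n_decision_criteria(n_classes):
    """Number of decision criteria quoted in the text: N^2 - N - 1."""
    return n_classes**2 - n_classes - 1


for n in (2, 3, 4):
    # Columns: N, number of conditional error probabilities, number of criteria
    print(n, len(error_probability_labels(n)), n_decision_criteria(n))
```

For N = 3 this gives six conditional error probabilities and five decision criteria, which is why the three-class ROC surface lives in a six-dimensional space.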

Despite the difficulties just described, the potential benefits to be gained from a practical performance evaluation methodology for classification tasks with three classes have motivated a number of research groups to propose such methods. These practical methods reduce the number of degrees of freedom used to describe the observer’s performance, either by implicitly leaving the remaining degrees of freedom out of the analysis, or by explicitly imposing restrictions on the form of the observer’s decision rule or on the set of decision criteria used by the observer. An example of the former would be the work of Mossman (1999) and of Dreiseitl, Ohno-Machado, and Binder (2000), which proposed reporting performance only in terms of the three “sensitivities” (conditional classification rates for correctly identifying the three classes) in a three-class task, while examples of the latter would include the works of Scurfield (1996, 1998), of Chan et al. (2003) and Sahiner, Chan, and Hadjiiski (2006), of He et al. (2006) and He and Frey (2006), and of Schubert, Thorsen, and Oxley (2011). (Recently, He et al. have pointed out (He, Gallas, & Frey, 2010) that, given complete analytic knowledge of a three-class ideal observer’s “sensitivity” ROC surface, whose coordinates are the three correct classification probabilities, one can recover the probability density functions (pdfs) of the pair of likelihood ratios used by that observer. These pdfs, in turn, can be used to recover a complete description of the observer’s performance, including the misclassification ROC surface. However, the practical consequences of this theoretical result have yet to be fully investigated; for the present, we are concerned with a situation in which the underlying data pdfs are already assumed known, obviating the need for such an intermediate step via measurement of the sensitivity surface.)

The first work by Scurfield mentioned above (Scurfield, 1996), and that of Schubert et al. (2011), are of particular interest because they consider models which assume not only a particular form for the observer’s decision rule, but also that the underlying observational data, for which decisions are being made, are univariate. Such a univariate model is worth exploring for both theoretical and pragmatic reasons. Since a univariate model will be more tractable, in general, than a multivariate model, the former can serve as an important step in the development of a fully general classification and performance evaluation methodology, although the development of a suitable performance metric and evaluation methodology, which is our long-term goal, is beyond the scope of the present work. More pragmatically, a variety of medical decision-making tasks can be envisioned in which the underlying data consist of a single variable: the result of a particular blood test; or the size of a lesion in a context where other features of the lesion can be ignored, such as a skin test for allergies or tuberculosis. The psychophysical task considered by Scurfield was the ability of an observer to classify monotonic auditory stimuli into “low tone”, “medium tone”, and “high tone” classes based on perceived frequency; the clinical task considered by Schubert et al. was to distinguish chronic allograft nephropathology from normal kidney function and proteinuria based on protein markers present in patient urine.

The decision rule of interest here, considered in both papers just mentioned, is defined by a pair of thresholds on the underlying data; observations less than the lower threshold are assigned to one class, those between the two thresholds to a second class, and those above the higher threshold to a third class. This is a straightforward generalization of two-class classification models (such as the conventional (Metz, Herman, & Shen, 1998) and proper (Metz & Pan, 1999) binormal models; recall that the latter is motivated by an ideal observer approach to the underlying data model of the former) in which a single threshold is used to separate univariate data into two classes. While more complicated observer behavior is certainly conceivable, it is perhaps surprising that the three-class ideal observer itself is not limited to such a decision strategy. For normally distributed underlying data, in particular, we show here that, depending on the values of its five decision criteria and the distributional parameters of the underlying data, the three-class ideal observer operates by dividing the underlying data axis into as many as seven regions (i.e., by dividing the underlying data axis with as many as six thresholds). This is by no means intended as a criticism of the works already cited, which do not directly address the ideal observer; the particular observer models developed in those works were explicitly assumed to apply to arbitrary decision strategies, as opposed to specifically ideal observer decision strategies.
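To make the contrast with a two-threshold rule concrete, here is a minimal sketch that counts decision regions along the data axis for one simple likelihood-based observer. It is not the general five-criterion ideal observer analyzed in the paper: it assumes equal prior probabilities and a minimum-error (maximum a posteriori) rule, which is one special case of an ideal observer. The class parameters are hypothetical, chosen only so that the three normal densities are nested.

```python
import math

# Hypothetical (mean, standard deviation) pairs for classes pi_1, pi_2, pi_3.
CLASSES = [(0.0, 0.1), (0.0, 0.5), (0.0, 1.0)]


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def map_decision(x):
    """Assign x to the class with the largest density (equal priors assumed)."""
    densities = [normal_pdf(x, mu, sigma) for mu, sigma in CLASSES]
    return 1 + densities.index(max(densities))


def decision_regions(lo=-3.0, hi=3.0, n=6001):
    """Scan the data axis and record the class label of each contiguous region."""
    labels = []
    for k in range(n):
        x = lo + (hi - lo) * k / (n - 1)
        label = map_decision(x)
        if not labels or labels[-1] != label:
            labels.append(label)
    return labels


print(decision_regions())  # [3, 2, 1, 2, 3]
```

Even this restricted observer needs four thresholds (five regions) for these parameters, so a two-threshold rule cannot reproduce it. The larger counts stated in the text, up to six thresholds, arise for the general decision criteria considered in the paper.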

A univariate decision variable model in a task with more than two classes was investigated also by Kijewski, Swensson, and Judy (1989). However, evaluation was performed only using standard two-class ROC analysis on pairs of classes.

In Section 2, we review the theory of the three-class ideal observer, and develop explicit expressions for the ideal observer’s likelihood ratio decision variables in the case of normally distributed underlying data. In particular, because the two decision variables are functions of univariate data, it is shown that one of those decision variables can be expressed as a relation of the other (a curve, open or closed, in the decision variable space). Because the form of this relation between the decision variables is determined by the relative means and variances of the three pdfs of the underlying data, conditional on membership in each of the three classes, we consider three distinct situations: in Section 3, the case of all three variances being equal; in Section 4, the case in which two of the three variances are equal, and the third is distinct from them; and in Section 5, the case in which all three variances are distinct. In Section 6, these results are used to develop closed-form expressions for the ROC operating points of the ideal observer under consideration. In Section 7, we summarize some general properties of parametric surfaces that will be of use in subsequent sections. In Section 8, we apply these properties to the ROC surface at hand to obtain constraints on the number of degrees of freedom of that surface. In Section 9, we discuss the most salient qualitative features of the preceding results; and finally, the conclusions of the present work are given in Section 10.
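The idea that two likelihood ratios driven by a single scalar observation trace out a curve can be sketched numerically. The example below assumes that each likelihood ratio is taken relative to class π3, and that class π3 has been standardized to zero mean and unit variance; both are assumptions of this sketch rather than definitions quoted from the text. The function names and the parameter values in the usage lines are hypothetical.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def likelihood_ratios(x, mu1, var1, mu2, var2, mu3=0.0, var3=1.0):
    """Return (LR1, LR2), each defined here relative to class pi_3 (an assumption)."""
    p3 = normal_pdf(x, mu3, var3)
    return normal_pdf(x, mu1, var1) / p3, normal_pdf(x, mu2, var2) / p3


def lr_curve(xs, mu1, var1, mu2, var2):
    """Sample the parametric curve x -> (LR1(x), LR2(x))."""
    return [likelihood_ratios(x, mu1, var1, mu2, var2) for x in xs]


# Usage with arbitrary illustrative parameters (not taken from the paper).
xs = [-4.0 + 0.5 * k for k in range(17)]
for x, (lr1, lr2) in zip(xs, lr_curve(xs, mu1=1.0, var1=0.5, mu2=-1.0, var2=0.75)):
    print(f"x = {x:5.2f}  LR1 = {lr1:.4g}  LR2 = {lr2:.4g}")
```

Because both coordinates depend on the single parameter x, the sampled points lie on a one-dimensional curve in the (LR1, LR2) plane rather than filling a two-dimensional region.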

Section snippets

General theory

We define the actual class (the “truth”) to which an observation belongs as $\mathbf{t}$, and the class to which it is assigned (the “decision”) as $\mathbf{d}$, where $\mathbf{t}$ and $\mathbf{d}$ can take on any of the values $\pi_1, \ldots, \pi_i, \ldots, \pi_N$, the labels of the various classes. (We use boldface type to denote statistically variable quantities.) For simplicity, we will also write $\pi_k$ to denote the event $\mathbf{t} = \pi_k$, as in the a priori probability $P(\pi_k)$ and the conditional pdfs $p(x \mid \pi_k)$. It can be shown (Edwards et al., 2004; Van Trees, 1968) that

Case I: three equal variances

If all three variances are equal, then by the arguments preceding Eq. (6), we can assume without loss of generality that they are all unit variances. For the purpose of labeling the classes, however, it is more useful in this particular case to select the labels of the classes such that $\mu_2 < \mu_3 (= 0) < \mu_1$. We can safely assume here that no two means are equal, because otherwise at least two of the normal distributions of the underlying data would have both equal means and equal variances; those

Case II: two equal variances

Recall from the arguments preceding Eqs. (6), (7), (8) that we have arbitrarily, but without any loss of generality, adopted the convention that $\sigma_1^2 \leq \sigma_2^2 \leq \sigma_3^2 = 1$. This leaves us with only two ways in which exactly two of the variances can be equal: $\sigma_1^2 < \sigma_2^2 = \sigma_3^2 = 1$, which we address in Section 4.1; and $\sigma_1^2 = \sigma_2^2 < \sigma_3^2 = 1$, which we address in Section 4.2.

Case III: three different variances

We turn at last to the most general case, in which $\sigma_1^2 < \sigma_2^2 < \sigma_3^2 = 1$. The parametric expressions for the likelihood ratios, their first derivatives, and the curvature polynomial are as expressed in Eqs. (12), (14), (17).
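As a worked illustration of why ratios of a mean to one minus a variance appear naturally in this case, consider the following sketch. It assumes that class π3 has been standardized to zero mean and unit variance and that each likelihood ratio is taken relative to class π3; the symbol $x_i^{*}$ is introduced here for illustration and is not the paper's notation.

```latex
% Assumptions of this sketch: p(x | \pi_3) is standard normal, and
% LR_i(x) = p(x | \pi_i) / p(x | \pi_3) for i = 1, 2, with \sigma_i^2 < 1.
\begin{align}
  \ln LR_i(x) &= -\tfrac{1}{2}\ln\sigma_i^2
                 - \frac{(x-\mu_i)^2}{2\sigma_i^2} + \frac{x^2}{2}, \\
  \frac{d}{dx}\ln LR_i(x) &= -\frac{x-\mu_i}{\sigma_i^2} + x = 0
  \quad\Longrightarrow\quad
  x_i^{*} = \frac{\mu_i}{1-\sigma_i^2}, \\
  \frac{d^2}{dx^2}\ln LR_i(x) &= 1 - \frac{1}{\sigma_i^2} < 0 .
\end{align}
```

Under these assumptions each likelihood ratio has a single maximum, located at $x_i^{*}$, so whether the two maxima coincide or differ determines how the curve traced out by the pair of likelihood ratios doubles back on itself.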

In the more general case in which $\mu_1/(1-\sigma_1^2)$ and $\mu_2/(1-\sigma_2^2)$ are distinct, we adopt the convention $\mu_2/(1-\sigma_2^2) < \mu_1/(1-\sigma_1^2)$ (by transforming the underlying data variable to $-x$ if necessary); as mentioned in Section 4.2, this has the effect of causing the loop formed by the likelihood

ROC operating points

An example of a likelihood ratio curve, as described in the preceding sections, is shown in Fig. 8(a) with $\mu_1 = 1/8$, $\sigma_1^2 = 1/8$, $\mu_2 = 2/3$, and $\sigma_2^2 = 1/2$; and with an arbitrary choice of the decision criteria of Eqs. (3), (4), (5) such that the decision boundaries are segments of the lines $LR_2 = (1/4) LR_1$, $LR_2 = (3/10) LR_1 - 3/20$, and $LR_2 = (3/8) LR_1 + 17/8$, respectively. It is shown in Section 5 that one can represent this particular decision strategy in terms of six distinct values of $x$, namely $x_1, \ldots, x_6$, such that

Parametric surfaces and Jacobians

Consider an $n$-dimensional vector space with elements $\mathbf{v} \equiv (v_1, v_2, \ldots, v_n)$. A function such as $v_n = f(v_1, v_2, \ldots, v_{n-1})$ defines a surface in this space. If we now take a point $\mathbf{v}$ on this surface, and a closely neighboring point $\mathbf{v} + \Delta\mathbf{v}$ also assumed to lie on the surface, then the difference between those points will approach a tangent vector to the surface in the limit as $\Delta\mathbf{v}$ approaches zero in magnitude. In particular, the $n-1$ vectors $(0, 0, \ldots, 1_{(i)}, 0, \ldots, \partial f/\partial v_i)$ form a linearly independent set of such tangent

Degrees of freedom of ROC surfaces

In Sections 3, 4, and 5, and Appendix A, we show that, for univariate trinormal data, the decision boundary line segments used by the ideal observer can intersect the likelihood ratio curve in at most six places; an example of such a situation is shown in Fig. 8(a). More generally, the number of such intersections depends in a complicated way on the parameters of the underlying data (the means and

Discussion

As implied in Section 1, the behavior of the ideal observer even for a simple, restricted underlying data model can be quite complicated, particularly when viewed in terms of that underlying data. However, we would like to reiterate the point made in Section 4.1, that one of the goals of this paper was not to focus on the process of determining the numerical values of the data space thresholds on x, but rather to develop a more qualitative but broad understanding of the ideal observer’s

Conclusion

Although a fully general three-class extension to ROC analysis has yet to be developed, it is to be hoped that insights obtained in the consideration and analysis of constrained or simplified models will prove useful in the development of such fully general methodology. In the present paper, we have investigated the behavior of the three-class ideal observer under a particular constrained underlying data model, a univariate trinormal model. By categorizing the various possible situations given

Acknowledgments

The authors would like to thank Craig Abbey, whose questions and comments on this subject inspired us to investigate it more thoroughly than we had previously.

This work was supported by grant R01 EB000863 from the National Institutes of Health (Kevin Berbaum, the University of Iowa, principal investigator) through a subcontract from the University of Iowa to the University of Chicago (Charles Metz, subcontract principal investigator).
