1 Introduction

In some applications, acquiring covariates comes at a non-negligible cost. For example, in the medical domain, in order to classify whether a patient has diabetes, measuring glucose tolerance can be expensive. On the other hand, glucose tolerance is a good indicator of diabetes, i.e. it increases our chance of correctly predicting diabetes (or its absence).

The example illustrates that in the medical domain we often have to strike a balance between classification accuracy and the cost of acquiring the covariates. A rational criterion for deciding on the best trade-off is to minimize the expected total cost of classification: the expected cost of misclassification plus the cost of the acquired covariates.

In the first part of this article, we formalize the optimization of the expected total cost of classification using the (conditional) Bayes risk and describe the optimal solution using a recursive procedure. However, it turns out that the procedure is computationally infeasible for essentially two reasons: (1) calculating the Bayes risk requires estimating a high-dimensional integral, (2) the number of different covariate acquisition paths is exponential in the number of covariates.

As a consequence, we introduce two assumptions: (1) the optimal classifier can be represented by a generalized additive model (GAM), (2) the optimal sets of covariates are limited to a sequence of sets of increasing size. We show that under these two assumptions, a computationally efficient solution exists.

Our framework requires that the user can specify the cost of misclassification: the false positive cost (cost of wrongly classifying a healthy person as having diabetes), and false negative cost (cost of classifying a diabetes patient as healthy). However, we show that the requirement on the user to specify the false negative cost can be replaced by the specification of a lower bound on the recall. This is motivated by the medical domain where it is more common to specify the minimally acceptable recall (target recall), rather than specifying the false negative cost.

Our main contributions are as follows:

  1.

    We describe the optimal solution for minimizing the expected total cost of classification, which has not been clarified in previous works like Dulac-Arnold et al. (2012) and Shim et al. (2018).

  2.

    We prove that for a GAM, the estimation of the (conditional) Bayes risk reduces to a one-dimensional density estimation and integral, which can be computed efficiently.

  3.

    We propose an effective heuristic to estimate an optimal monotone increasing sequence of covariate sets by learning the regression coefficients of GAM with a group lasso penalty.

  4.

    We prove that our framework can be used to guarantee a user-specified recall level, like 95% which is common in the medical domain.

  5.

    We show on four medical datasets that the proposed method can lead to lower total cost of classification than the previous works in Ji and Carin (2007), Dulac-Arnold et al. (2012), Xu et al. (2012), Nan and Saligrama (2017) and Shim et al. (2018). Furthermore, evaluation under the requirement of a target recall shows that the proposed method achieves the target recall while minimizing the remaining costs and false discovery rate (FDR) better than previous works.

This article extends our preliminary work in Andrade and Okajima (2019) as follows:

  • Replacing the linear classifier with GAM to allow for non-linear relations between the covariates and the response (see Sect. 4.1).

  • Allowing for the specification of a minimal recall (see Sect. 6).

  • Determining the sequence of covariate sets by the solution path of a group lasso penalized convex optimization problem (see Sect. 5.2).

  • Additional experimental evaluations on two more medical datasets, and additional analysis in terms of false positive costs, recall, weighted average, and runtime (see Sect. 7). Furthermore, instead of symmetric misclassification costs, which were used in Andrade and Okajima (2019), our evaluation here focuses on asymmetric misclassification costs that are more realistic in practice.

In the next section, we formalize the optimal decision procedure to achieve, in expectation, the lowest total cost of classification. In Sect. 3, we introduce a (non-adaptive) method that minimizes an upper bound of the lowest achievable total cost, which we extend in Sect. 4 to an adaptive method. In Sect. 5, we explain two approximations for finding a sequence of monotone increasing covariate sets that is used by the proposed method. In Sect. 6, we show how the proposed framework can also be used for guaranteeing a target recall. Extensive empirical evaluations of our proposed method and previous methods are provided in Sect. 7. In Sect. 8, we give a concise review of related work. Finally, in Sect. 9, we summarize our findings.

2 A cost-rational selection criterion

In this section, we clarify and formally describe the optimal decision for acquiring covariates and for classification. For that purpose, we appropriately extend the definition of a Bayes procedure, which is known to be optimal for classification; see e.g. Anderson (2003), Berger (2013).

Let \(L := \{l_1, \ldots , l_c\}\) denote the set of class labels, and \(c_{y, y^*}\) the cost of classifying a sample as class \(y^*\), when the true label is y. A decision procedure \(\delta ^*: {\mathbb {R}}^p \rightarrow L\) for which

$$\begin{aligned} \forall \delta : \, {{{{\mathbb {E}}}}}_{{\mathbf {x}},y} [c_{y,\delta ({\mathbf {x}})}] \ge {\mathbb{E}}_{{\mathbf {x}}, y} [c_{y,\delta ^*({\mathbf {x}})}] \, \end{aligned}$$

is called a Bayes procedure. The following procedure \(\delta ^*\) is a Bayes procedure [for a proof see, for example, Theorem 6.7.1 in Anderson (2003)]:

$$\begin{aligned} \begin{aligned} \delta ^*({\mathbf {x}})&= \mathop {\hbox {arg min}}\limits _{y^* \in L} \sum _{y \in L} p(y | {\mathbf {x}}) \cdot c_{y, y^*} \\&= \mathop {\hbox {arg min}}\limits _{y^* \in L} {\mathbb{E}}_y [c_{y, y^*} ] . \end{aligned} \end{aligned}$$
(1)

The expected misclassification cost of the Bayes procedure, i.e. \({\mathbb{E}}_{{\mathbf {x}}, y} [c_{y,\delta ^*({\mathbf {x}})}]\), is called the Bayes risk.
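As a small illustration of the Bayes procedure in Eq. (1), the following sketch (our own toy example; the cost matrix and names are hypothetical, not from the paper) computes the decision that minimizes the posterior expected misclassification cost:

```python
import numpy as np

# cost[y, y_star] = c_{y, y*}: rows index the true label, columns the decision.
cost = np.array([[0.0, 1.0],    # true y = 0: correct costs 0, false positive costs 1
                 [5.0, 0.0]])   # true y = 1: false negative costs 5, correct costs 0

def bayes_decision(posterior, cost):
    """posterior[y] = p(y | x); returns argmin_{y*} sum_y p(y|x) c_{y, y*} as in Eq. (1)."""
    expected = posterior @ cost          # expected[y*] = E_y[c_{y, y*} | x]
    return int(np.argmin(expected))

# With p(y=1|x) = 0.3: predicting 1 costs 0.7 * 1 = 0.7, predicting 0 costs
# 0.3 * 5 = 1.5, so the asymmetric costs make the Bayes decision 1.
print(bayes_decision(np.array([0.7, 0.3]), cost))
```

Note how the large false-negative cost shifts the decision boundary away from the naive 0.5 threshold.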

Let us denote by \(V := \{1, \ldots , p\}\) the index set of covariates with \(V \cap L = \emptyset\). We denote the Bayes procedure for classifying a sample based only on the covariates \(S \subseteq V\) by \(\delta ^*_{S} : {\mathbb {R}}^{|S|} \rightarrow L\). That means

$$\begin{aligned} \delta ^*_S({\mathbf {x}}_S) = \mathop {\hbox {arg min}}\limits _{y^* \in L} \sum _{y \in L} p(y | {\mathbf {x}}_S) \cdot c_{y, y^*} . \end{aligned}$$
(2)

When it is clear from the context, we drop the index on \(\delta ^*_S\), and just write \(\delta ^*({\mathbf {x}}_S)\) instead of \(\delta ^*_S({\mathbf {x}}_S)\).

2.1 Optimal procedure

The classical definition of Bayes procedure does not consider the cost of covariate acquisition, and assumes that all covariates are acquired at once. Therefore, let us first formally extend the definition appropriately.

We use the following definition of a decision procedure.

Definition 1

A function of the form

$$\begin{aligned} \pi : {\mathbb {R}}^p \times 2^V \rightarrow L \cup V, \end{aligned}$$

which fulfills, \(\forall {\mathbf {x}} \in {\mathbb {R}}^p, S \subseteq V\):

$$\begin{aligned}&\pi ({\mathbf {x}}, S) = \pi ({\mathbf {x}} \odot {\mathbf {1}}_S, S) , \end{aligned}$$
(3)
$$\begin{aligned}&\pi ({\mathbf {x}}, S) \in L \cup (V \setminus S) , \end{aligned}$$
(4)

is called a decision procedure.

The condition in Eq. (3) means that a decision procedure uses only the covariates that are indexed by S; the condition in Eq. (4) means that a decision procedure cannot select a covariate that is already in S. In summary, the decision procedure \(\pi ({\mathbf {x}}, S)\) either classifies the current sample, or selects a new covariate based on the observations \({\mathbf {x}}_S\). To simplify the notation, we write \(\pi ({\mathbf {x}}_S)\) instead of \(\pi ({\mathbf {x}}, S)\). Furthermore, we denote the cost of acquiring covariate i by \(c_i\).

Given a sample \({\mathbf {x}}\) with class label y, we denote the loss of a decision procedure \(\pi\) as \(l( ({\mathbf {x}}, y), \pi )\). The loss can be computed recursively as follows. Let \(l( ({\mathbf {x}}, y), \pi ) := l( ({\mathbf {x}}, y), \pi , \emptyset )\), with

$$\begin{aligned}&l( ({\mathbf {x}}, y), \pi , S) =&{\left\{ \begin{array}{ll} c_{y, \pi ({\mathbf {x}}_S)} &\quad{\text {if}}\, \pi ({\mathbf {x}}_S) \in L , \\ c_{\pi ({\mathbf {x}}_S)} + l( ({\mathbf {x}}, y), \pi , S \cup \{ \pi ({\mathbf {x}}_S) \} ) &\quad{\text{else}}. \end{array}\right. } \end{aligned}$$
(5)

If not stated otherwise, we assume that all costs are non-negative, i.e. \(c_i \ge 0\), and \(c_{y,y^{\prime}} \ge 0\).
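The recursion in Eq. (5) can be sketched directly. The toy decision procedure and cost values below are our own illustration (any real procedure would come from the optimization that follows):

```python
# Recursive loss of Eq. (5): a decision procedure either classifies (pay the
# misclassification cost) or acquires a covariate (pay its cost and recurse).
def loss(x, y, pi, cov_cost, misclf_cost, S=frozenset()):
    a = pi(x, S)                          # either ('label', y*) or ('covariate', i)
    if a[0] == 'label':
        return misclf_cost[y][a[1]]       # c_{y, pi(x_S)}
    i = a[1]
    return cov_cost[i] + loss(x, y, pi, cov_cost, misclf_cost, S | {i})

misclf_cost = {0: {0: 0.0, 1: 1.0}, 1: {0: 5.0, 1: 0.0}}
cov_cost = {0: 0.2, 1: 0.7}

def toy_policy(x, S):
    if 0 not in S:
        return ('covariate', 0)           # always acquire x_0 first
    return ('label', 1 if x[0] > 0 else 0)  # then classify on x_0 alone

# Sample with x_0 = 1.3 and true label 1: cost = c_0 + c_{1,1} = 0.2.
print(loss((1.3, -0.4), 1, toy_policy, cov_cost, misclf_cost))
```

The recursion terminates because condition (4) forbids re-selecting an already acquired covariate.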

Theorem 1

Let \(\pi ^*\) be a decision procedure such that for all \(S \subseteq V\), we have

$$\begin{aligned} \begin{aligned} \pi ^*({\mathbf {x}}_S) = \mathop {\hbox {arg min}}\limits _{i \in L \cup (V \setminus S)} {\left\{ \begin{array}{ll} {\mathbb{E}}_{y} [c_{y,i} | {\mathbf {x}}_S ] &\quad{\text {if}}\,i \in L , \\ c_i + {\mathbb{E}}_{{\mathbf {x}}_{V \setminus S}, y} \big [l(({\mathbf {x}}, y), \pi ^*, S \cup \{i\}) | {\mathbf {x}}_{S} \big ] &\quad {\text{else}}. \end{array}\right. } \\ \end{aligned} \end{aligned}$$
(6)

Then \(\pi ^*\) is a Bayes procedure. That means for any other decision procedure \(\pi\) we have

$$\begin{aligned} {\mathbb{E}}_{{\mathbf {x}}, y} [l( ({\mathbf {x}}, y), \pi ^*) ] \le {\mathbb{E}}_{{\mathbf {x}}, y} [l( ({\mathbf {x}}, y), \pi ) ] . \end{aligned}$$

The proof is given in the “Appendix”. We note that, if the covariates are discrete, we can formulate the problem as a stationary Markov decision process (MDP) where every policy leads to a terminal state (Zubek et al., 2004; Bayer-Zubek 2004). The Bayes procedure from Theorem 1 is then equivalent to the optimal policy defined by the Bellman updates with the discounting factor set to 1 (Russell & Norvig, 2003).

For continuous covariates, implementing the exact decision procedure \(\pi ^*\) is, in general, intractable. The reason is that in order to recursively evaluate the loss, we need to evaluate a sequence of alternating minimizations and expectations. To illustrate this, consider the following example with only two covariates, \(V = \{1, 2\}\); we then have

$$\begin{aligned} {\mathbb{E}}_{{\mathbf {x}}, y} [ l( ({\mathbf {x}}, y), \pi ^*) ]&= \min \Big \{ c_1 + {\mathbb{E}}_{{\mathbf {x}}_1} \big [ \min \big \{c_2 + {\mathbb{E}}_{{\mathbf {x}}_2} [C({\mathbf {x}}_{\{1,2\}}) | {\mathbf {x}}_1], C({\mathbf {x}}_1) \big \} \big ], \nonumber \\&\quad \quad c_2 + {\mathbb{E}}_{{\mathbf {x}}_2} \big [ \min \big \{ c_1 + {\mathbb{E}}_{{\mathbf {x}}_1} [ C({\mathbf {x}}_{\{1,2\}}) | {\mathbf {x}}_2], C({\mathbf {x}}_2) \big \} \big ], \nonumber \\&\quad \quad C({\mathbf {x}}_{\emptyset }) \Big \} , \end{aligned}$$
(7)

where the first line corresponds to first acquiring covariate 1 and then either classifying or acquiring covariate 2, the second line corresponds to first acquiring covariate 2, and the last line corresponds to not acquiring any covariate. Note that, to simplify notation in the above example, we defined \(C({\mathbf {x}}_S)\) to be the expected cost of (mis-)classification given \({\mathbf {x}}_S\), i.e.

$$\begin{aligned} C({\mathbf {x}}_S) := \min _{i \in L} {\mathbb{E}}_{y} [c_{y,i} | {\mathbf {x}}_S ] . \end{aligned}$$

Therefore, we propose two relaxations and corresponding methods named Cost-sensitive Covariate Selection (COS) and Adaptive Cost-sensitive Forward Selection (AdaCOS) in Sects. 3 and 4, respectively.

3 Cost-sensitive covariate selection (COS)

Our first relaxation is to pull out all minimizations in the recursion of Eq. (6); see also the example in Eq. (7), which explicitly writes out all minimizations. This leads to the following upper bound:

$$\begin{aligned} {\mathbb{E}}_{{\mathbf {x}}, y} [ l( ({\mathbf {x}}, y), \pi ^*) ] \le \min _{S \subseteq V} \Big ( {\mathbb{E}}_{{\mathbf {x}}_S, y}[c_{y,\delta ^*({\mathbf {x}}_S)}] + \sum _{i \in S} c_i \Big ) . \end{aligned}$$

In the following, we denote this upper bound by \({\mathcal {U}}\). However, directly trying to minimize \({\mathcal {U}}\) is still computationally difficult due to the exponential number of possible sets S. Note that this is similar to covariate selection in logistic regression, as e.g. in Tibshirani (1996), O’Hara et al. (2009). Two important differences are that, in general, the costs associated with the covariates can differ from each other, and that the Bayes risk needs to be evaluated for all possible subsets \(S \subseteq V\). The situation is illustrated in Fig. 1.

We denote the method selecting the set \(S^*\) that minimizes \({\mathcal {U}}\) by Cost-sensitive Covariate Selection (COS). In order to approximately find the set \(S^*\), we will use the methods described in Sect. 5. One disadvantage of COS is that it always selects the same set of covariates \(S^*\) for any sample, though for some samples fewer/more covariates might be sufficient/necessary for good classification accuracy.
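For small p, COS can be sketched as brute-force enumeration of all subsets. The risk table below is synthetic, standing in for an estimate of the Bayes risk from data:

```python
from itertools import combinations

# Pick the subset S minimizing U(S) = (estimated) Bayes risk + covariate costs.
cov_cost = {1: 0.1, 2: 0.4, 3: 0.3}
risk = {frozenset(): 0.50, frozenset({1}): 0.35, frozenset({2}): 0.20,
        frozenset({3}): 0.30, frozenset({1, 2}): 0.15, frozenset({1, 3}): 0.25,
        frozenset({2, 3}): 0.18, frozenset({1, 2, 3}): 0.12}

def cos_select(V, risk, cov_cost):
    best_S, best_U = None, float('inf')
    for r in range(len(V) + 1):
        for S in combinations(sorted(V), r):
            S = frozenset(S)
            U = risk[S] + sum(cov_cost[i] for i in S)   # upper bound U for this S
            if U < best_U:
                best_S, best_U = S, U
    return best_S, best_U

S_star, U_star = cos_select({1, 2, 3}, risk, cov_cost)
```

Here covariate 1 alone wins: cheaper covariates with moderate risk reduction can beat more informative but expensive ones. The \(2^p\) enumeration is exactly why Sect. 5 resorts to approximations.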

Fig. 1

All possible decisions for an example with \(V = \{1, 2\}\). Each edge represents one decision: either asking for the value of a covariate, or classifying the sample based on the covariates observed so far using the Bayes procedure. Each leaf shows the upper bound \({\mathcal {U}}\) of the optimal expected total cost of classification when using the covariates selected on the path from the root to the leaf

4 Adaptive cost-sensitive forward selection

The method COS proposed in the previous section is not adaptive, i.e. it does not take into account the actually observed covariates in order to decide whether to proceed with the acquisition of additional covariates, or whether to classify based on the covariates observed so far. However, as discussed in Sect. 2.1, without additional assumptions, estimating the optimal procedure \(\pi ^*\) from Theorem 1 is computationally infeasible. We therefore introduce two assumptions:

  1.

    The optimal set S of acquired covariates belongs to \({\mathfrak {S}} = \{S_1, S_2, \ldots , S_q\}\), where \(S_1 \subset S_2 \subset \ldots \subset S_q \subseteq V\). An approximation method for finding the optimal sequence of subsets \({\mathfrak {S}}\) will be described in Sect. 5.

  2.

    The conditional class probability \(p(y | {\mathbf {x}}_{S})\), for \(S \in {\mathfrak {S}}\), belongs to a logistic generalized additive model.

Before we proceed, let us introduce our definition of future costs. Let \(A \subseteq V\) and \(S \subseteq V \setminus A\), then we define

$$\begin{aligned} F_{{\mathbf {x}}_{A}}(S) := \overbrace{{\mathbb{E}}_{{\mathbf {x}}_{S}, y} \big [ c_{y,\delta ^*({\mathbf {x}}_{A \cup S})} | {\mathbf {x}}_{A} \big ]}^{\text {(conditional) Bayes risk}} + \overbrace{\sum _{i \in S} c_i}^{\text {covariate costs}} . \end{aligned}$$
(8)

\(F_{{\mathbf {x}}_{A}}(S)\) is the expected total additional cost of classification when we have already acquired the covariates A, and are planning to acquire additionally the covariates S before classifying. In particular, the upper bound \({\mathcal {U}}\) can be expressed as \(\min _{S \subseteq V} F_{{\mathbf {x}}_{\emptyset }}(S)\).

Our approximation of the Bayes procedure \(\pi ^*\) from Theorem 1 is given in Algorithm 1. First, we acquire all covariates indexed by \(S_1\), and then check whether acquiring any additional covariates from \(S_2 \setminus S_1, \ldots , S_q \setminus S_1\) reduces the total cost of classification in expectation. If that is the case, we acquire the covariates in \(S_2 \setminus S_1\), and proceed analogously. If the total cost of classification is not expected to decrease with more covariates, we stop and classify based on the covariates acquired so far. An example of the procedure, where \(V = \{1, 2\}\), is shown in Fig. 2. Note that the example in Fig. 2 corresponds to the rightmost path of the tree in Fig. 1. Another example, with \(V = \{1, 2, 3, 4\}\), is shown in Fig. 3.

Algorithm 1

The algorithm is adaptive in the sense that the expected future costs \(F_{{\mathbf {x}}_{A}}(S)\) depend on the covariates \({\mathbf {x}}_{A}\) observed so far. Therefore, we see that the effectiveness of the algorithm hinges on the non-trivial task of calculating \(F_{{\mathbf {x}}_{A}}(S)\).
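Our reading of the acquisition loop can be sketched as follows. The functions `current_cost` and `future_cost` are placeholders for \(C({\mathbf {x}}_A)\) and \(F_{{\mathbf {x}}_{A}}(S)\) from Eq. (8); the toy cost values are made up:

```python
# AdaCOS-style loop: starting from S_1, keep acquiring the next block of
# covariates only while some future set is expected to lower the total cost.
def adacos(x, seq, current_cost, future_cost):
    acquired = set(seq[0])                     # acquire S_1 up front
    j = 0
    while j + 1 < len(seq):
        extras = [set(S) - acquired for S in seq[j + 1:]]
        best = min(future_cost(acquired, S, x) for S in extras)
        if best >= current_cost(acquired, x):  # no expected saving: stop and classify
            break
        acquired |= set(seq[j + 1])            # acquire S_{j+1} \ S_j
        j += 1
    return acquired                            # classify on these covariates

# Synthetic example: risk falls slowly with more covariates, each costs 0.3,
# so after S_1 further acquisition is not worth it and the loop stops.
risk = {0: 0.9, 1: 0.5, 2: 0.45, 3: 0.44}
cur = lambda A, x: risk[len(A)]
fut = lambda A, S, x: risk[len(A) + len(S)] + 0.3 * len(S)
print(adacos(None, [{1}, {1, 2}, {1, 2, 3}], cur, fut))
```

The adaptivity enters through `x`: in the real method the conditional Bayes risk inside \(F_{{\mathbf {x}}_{A}}(S)\) depends on the observed values \({\mathbf {x}}_{A}\), not just on \(|A|\) as in this toy.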

Finally, we note that in contrast to AdaCOS, the method COS does not use Algorithm 1, but selects only one fixed set of covariates \({\hat{S}} := \mathop {\hbox {arg min}}\limits _{S \subseteq V} F_{{\mathbf {x}}_{\emptyset }}(S)\), which is used for all samples at test time.

Fig. 2

Shows an example where \(S_1 = \{2\}\), and \(S_2 = \{2, 1\}\). Assume we have not acquired any covariates yet. In order to decide whether to start covariate acquisition, AdaCOS compares the current expected cost of misclassification marked in green with the expected future costs when acquiring covariates \(\{x_2\}\) or \(\{x_2, x_1\}\), marked in blue and orange, respectively

Fig. 3

Shows an example where \(S_1 = \{3\}\), \(S_2 = \{3, 1\}\), \(S_3 = \{3, 1, 2\}\), and \(S_4 = \{3, 1, 2, 4\}\). Assume we have already acquired the covariates \(S_2\). In order to decide whether to proceed with acquisition, AdaCOS compares the current expected cost of misclassification marked in green with the expected future costs when additionally acquiring covariates \(\{x_2\}\) or \(\{x_2, x_4\}\), marked in blue and orange, respectively

4.1 Bayes risk estimation

The main challenge in evaluating the future costs is to estimate the multi-dimensional integral in \({\mathbb{E}}_{{\mathbf {x}}_{S}, y} \big [ c_{y,\delta ^*({\mathbf {x}}_{A \cup S})} | {\mathbf {x}}_{A} \big ]\). By assuming that the conditional class probability \(p(y | {\mathbf {x}}_{A \cup S})\) can be modeled by a logistic generalized additive model, we will show that it is possible to reduce the multi-dimensional integral to a one-dimensional one via an effective approximation.

The logistic generalized additive model is defined as follows. Given the regression coefficients \(\varvec{\beta } = (\varvec{\beta }_1, \varvec{\beta }_2, \ldots , \varvec{\beta }_p) \in {\mathbb {R}}^{s \cdot p}\), and intercept \(\tau\), the conditional class probability \(p(y = 1 | {\mathbf {x}}, \varvec{\beta }, \tau )\) is modeled as

$$\begin{aligned} p(y = 1 | {\mathbf {x}}_{A \cup S}, \varvec{\beta }, \tau ) = g\Big (\tau + \sum _{i \in A \cup S} \varvec{\beta }_i^T f_i(x_i)\Big ) , \end{aligned}$$

where g denotes the sigmoid function, and the non-linear transformations \(f_i: {\mathbb {R}} \rightarrow {\mathbb {R}}^s\) are learned from the training data using penalized B splines, where s is the number of splines (Hastie et al., 2009).
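The model above can be sketched as follows. In the paper the transformations \(f_i\) are penalized B-splines learned from data; here we substitute a fixed cubic feature map (\(s = 3\)) with made-up coefficients, purely for illustration:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def f(x_i):
    """Stand-in for the learned basis expansion f_i: R -> R^s (here s = 3)."""
    return np.array([x_i, x_i ** 2, x_i ** 3])

def gam_prob(x, beta, tau):
    """p(y=1 | x) = g(tau + sum_i beta_i^T f_i(x_i)), the logistic GAM."""
    z = tau + sum(beta[i] @ f(x[i]) for i in beta)
    return sigmoid(z)

beta = {1: np.array([1.0, 0.0, -0.1]), 2: np.array([0.5, 0.2, 0.0])}
p1 = gam_prob({1: 1.0, 2: 0.0}, beta, tau=-0.4)   # g(-0.4 + 0.9 + 0.0) = g(0.5)
```

The additive structure over individual covariates is what later collapses the Bayes risk integral to one dimension: all unobserved covariates enter only through the scalar \(z = \sum_{i \in S} \varvec{\beta }_i^T f_i(x_i)\).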

Furthermore, we assume that there are only two class labels \(\{0, 1\}\), and that the costs of misclassification are larger than the cost of correct classification. We then have

$$\begin{aligned}&{\mathbb{E}}_{{\mathbf {x}}_{S}, y} \big [ c_{y,\delta ^*({\mathbf {x}}_{A \cup S})} | {\mathbf {x}}_{A} \big ] \\&\quad = {\mathbb{E}}_{{\mathbf {x}}_{S}} \big [ \sum _y c_{y,\delta ^*({\mathbf {x}}_{A \cup S})} p(y | {\mathbf {x}}_{A \cup S}) | {\mathbf {x}}_{A} \big ] \\&\quad = {\mathbb{E}}_{{\mathbf {x}}_{S}} \big [ c_{0,\delta ^*({\mathbf {x}}_{A \cup S})} p(y=0 | {\mathbf {x}}_{A \cup S}) | {\mathbf {x}}_{A} \big ] \\&\qquad + {\mathbb{E}}_{{\mathbf {x}}_{S}} \big [ c_{1,\delta ^*({\mathbf {x}}_{A \cup S})} p(y=1 | {\mathbf {x}}_{A \cup S}) | {\mathbf {x}}_{A} \big ] . \end{aligned}$$

Since

$$\begin{aligned} \delta ^*({\mathbf {x}}_{A \cup S})&= \mathop {\hbox {arg min}}\limits [ p(y=1 | {\mathbf {x}}_{A \cup S}) c_{1, 0} + p(y=0 | {\mathbf {x}}_{A \cup S}) c_{0, 0}, \\&\, p(y=0 | {\mathbf {x}}_{A \cup S}) c_{0, 1} + p(y=1 | {\mathbf {x}}_{A \cup S}) c_{1, 1}] , \end{aligned}$$

we have, where we defined \(p := p(y=1 | {\mathbf {x}}_{A \cup S})\),

$$\begin{aligned} \delta ^*({\mathbf {x}}_{A \cup S}) = 1&\Leftrightarrow p c_{1, 0} + (1-p) c_{0, 0} \ge (1-p) c_{0, 1} + p c_{1, 1} \\&\Leftrightarrow p (c_{1, 0} - c_{1, 1}) \ge (1-p) (c_{0, 1} - c_{0, 0}) \\&\Leftrightarrow \frac{p (c_{1, 0} - c_{1, 1})}{(1-p) (c_{0, 1} - c_{0, 0})} \ge 1 \\&\Leftrightarrow \frac{g(\sum _{i \in A \cup S} \varvec{\beta }_i^T f_i(x_i) + \tau ) \cdot (c_{1, 0} - c_{1, 1}) }{(1 - g(\sum _{i \in A \cup S} \varvec{\beta }_i^T f_i(x_i) + \tau )) \cdot (c_{0, 1} - c_{0, 0}) } \ge 1 \\&\Leftrightarrow e^{\sum _{i \in A \cup S} \varvec{\beta }_i^T f_i(x_i) + \tau } \ge \frac{c_{0, 1} - c_{0, 0}}{c_{1, 0} - c_{1, 1}} \\&\Leftrightarrow \sum _{i \in A \cup S} \varvec{\beta }_i^T f_i(x_i) \ge \log (\frac{c_{0, 1} - c_{0, 0}}{c_{1, 0} - c_{1, 1}}) - \tau \\&\Leftrightarrow \sum _{i \in S} \varvec{\beta }_i^T f_i(x_i) \ge \log (\frac{c_{0, 1} - c_{0, 0}}{c_{1, 0} - c_{1, 1}}) - \tau - \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) \\&\Leftrightarrow z \ge z^* , \end{aligned}$$

where we defined \(z := \sum _{i \in S} \varvec{\beta }_i^T f_i(x_i)\), and \(z^* := \log (\frac{c_{0, 1} - c_{0, 0}}{c_{1, 0} - c_{1, 1}}) - \tau - \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i)\). We see that \(\delta ^*({\mathbf {x}}_{A \cup S})\) depends only on z (random variable) and \(z^*\) (fixed). In the following, to simplify notation, let us denote by h(z) the conditional distribution \(p(z | {\mathbf {x}}_{A})\). We thus have
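The threshold \(z^*\) can be computed directly. The cost values below are made up for illustration:

```python
import math

# With p = g(z + offset), the Bayes decision is 1 exactly when z >= z*.
def z_star(c00, c01, c10, c11, tau, observed):
    """observed = sum_{i in A} beta_i^T f_i(x_i), the already-acquired part."""
    return math.log((c01 - c00) / (c10 - c11)) - tau - observed

def decide(z, zs):
    return 1 if z >= zs else 0

zs = z_star(c00=0.0, c01=1.0, c10=5.0, c11=0.0, tau=0.2, observed=0.5)
# Costly false negatives (c_{1,0} = 5) push the threshold down to about -2.31,
# so the classifier predicts 1 even for fairly negative scores z.
```

Note that \(z^*\) depends on the already observed covariates only through the fixed offset \(\sum_{i \in A} \varvec{\beta }_i^T f_i(x_i)\), so it is a constant once \({\mathbf {x}}_A\) is known.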

$$\begin{aligned}&{\mathbb{E}}_{{\mathbf {x}}_{S}} \big [ c_{1,\delta ^*({\mathbf {x}}_{A \cup S})} p(y=1 | {\mathbf {x}}_{A \cup S}) | {\mathbf {x}}_{A} \big ] \\&\quad = {\mathbb{E}}_{{\mathbf {x}}_{S}} \big [ c_{1,\delta ^*({\mathbf {x}}_{A \cup S})} g(\sum _{i \in S} \varvec{\beta }_i^T f_i(x_i) + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau ) | {\mathbf {x}}_{A} \big ] \\&\quad = {\mathbb{E}}_z \big [ c_{1,\delta ^*(z, z^*)} g(z + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau ) | {\mathbf {x}}_{A} \big ] \\&\quad = \int c_{1,\delta ^*(z, z^*)} g(z + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau ) h(z) dz \\&\quad = c_{1,0} \int _{- \infty }^{z^*} g(z + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau ) h(z) dz \\&\qquad + c_{1,1} \int _{z^*}^{\infty } g(z + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau ) h(z) dz . \end{aligned}$$

Analogously, we have

$$\begin{aligned}&{\mathbb{E}}_{{\mathbf {x}}_{S}} \big [ c_{0,\delta ^*({\mathbf {x}}_{A \cup S})} p(y=0 | {\mathbf {x}}_{A \cup S}) | {\mathbf {x}}_{A} \big ] \\&\quad = {\mathbb{E}}_{{\mathbf {x}}_{S}} \big [ c_{0,\delta ^*({\mathbf {x}}_{A \cup S})} (1- g(\sum _{i \in S} \varvec{\beta }_i^T f_i(x_i) + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau )) | {\mathbf {x}}_{A} \big ] \\&\quad = {\mathbb{E}}_z \big [ c_{0,\delta ^*(z, z^*)} (1- g(z + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau )) | {\mathbf {x}}_{A} \big ] \\&\quad = c_{0,0} \int _{- \infty }^{z^*} (1- g(z + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau )) h(z) dz \\&\qquad + c_{0,1} \int _{z^*}^{\infty } (1 - g(z + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau )) h(z) dz \\&\quad = { c_{0,0} \int _{-\infty }^{z^*} h(z) dz - c_{0,0} \int _{-\infty }^{z^*} g(z + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau ) h(z) dz }\\&\qquad + c_{0,1} \int _{z^*}^{\infty } h(z) dz - c_{0,1} \int _{z^*}^{\infty } g(z + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau ) h(z) dz . \end{aligned}$$

Thus the remaining task is to evaluate the following integral

$$\begin{aligned}&\int _{a^{\prime}}^{b^{\prime}} g(z + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau ) h(z) dz \nonumber \\&\quad = \int _{a^{\prime} + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau }^{b^{\prime} + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau } g(u) h(u - \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) - \tau ) du . \end{aligned}$$
(9)

We assume that \(h(z) = p(z | {\mathbf {x}}_{A})\) can be well approximated by a normal distribution with mean \(\mu _z\) and variance \(\sigma ^2\). We defer the explanation of how to estimate \(\mu _z\) and \(\sigma ^2\) to Sect. 4.1.1.

The integral in Eq. (9) has no analytic solution. One popular strategy is to approximate the sigmoid function g by the cumulative distribution function of the standard normal distribution \(\varPhi\), as in Gaussian process classification (Rasmussen & Williams, 2006). However, it turns out that this approximation is not applicable here, since \(a^{\prime}\) or \(b^{\prime}\) is a finite real number in our case. Instead, we use the fact that the sigmoid function can be well approximated by a small number of linear functions. In order to facilitate notation, let us introduce the following constants:

$$\begin{aligned} a&:= a^{\prime} + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau \, , \\ b&:= b^{\prime} + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau \, , \\ \mu&:= \mu _z + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau . \end{aligned}$$

Then we can write the integral in Eq. (9) as

$$\begin{aligned} \int _{a}^{b} g(u) \frac{1}{\sqrt{2 \pi \sigma ^2}} e^{- \frac{1}{2\sigma ^2} (u - \mu )^2} du . \end{aligned}$$
(10)

Let us define the following piece-wise linear approximation of the sigmoid function:

$$\begin{aligned} g(u) \approx \sum _{t = 1}^{\xi + 2} \Big ( 1_{[b_{t-1}, b_t]}(u) \big ( m_t u + v_t \big ) \Big ) , \end{aligned}$$

where for \(1 \le t \le \xi + 1\), we set \(b_{t} := -10 + \frac{20}{\xi } (t - 1) ,\) and for \(1 \le t \le \xi\), we set

$$\begin{aligned} m_{t+1} := \frac{g(b_{t+1}) - g(b_t)}{b_{t+1} - b_t} , \qquad v_{t+1} := g(b_t) - m_{t+1} b_t , \end{aligned}$$

and

$$\begin{aligned}&b_0 := - \infty , \, m_1 := 0 , \, v_1 := g(b_1) , \\&b_{\xi + 2} := + \infty , \, m_{\xi + 2} := 0 , \, v_{\xi + 2} := g(b_{\xi + 1}) , \end{aligned}$$

and \(\xi\) is the number of linear pieces, which we set, for example, to 40. A comparison with the approximation \(\varPhi (\sqrt{\frac{\pi }{8}} u)\) is shown in Fig. 4. That means that with relatively few linear pieces, we can achieve an approximation that is more accurate than the \(\varPhi\)-approximation. More importantly, as we show below, this allows for a tractable calculation of the integral in Eq. (10), which is not the case when using the \(\varPhi\)-approximation. Then we have

$$\begin{aligned}&\int _{a}^{b} g(u) \frac{1}{\sqrt{2 \pi \sigma ^2}} e^{- \frac{1}{2\sigma ^2} (u - \mu )^2} du \\&\quad = \int _{a}^{b} \sum _{t = 1}^{\xi + 2} \Big ( 1_{[b_{t-1}, b_t]}(u) \big ( m_t u + v_t \big ) \Big ) \frac{1}{\sqrt{2 \pi \sigma ^2}} e^{- \frac{1}{2\sigma ^2} (u - \mu )^2} du \\&\quad = \sum _{t = 1}^{\xi + 2} m_t \int _{\max (a, b_{t-1})}^{\min (b, b_t)} u \frac{1}{\sqrt{2 \pi \sigma ^2}} e^{- \frac{1}{2\sigma ^2} (u - \mu )^2} du \\&\qquad + v_t \varPhi _{\max (a, b_{t-1})}^{\min (b, b_t)} , \end{aligned}$$

where we define \(\varPhi _l^o := \int _l^o \frac{1}{\sqrt{2 \pi \sigma ^2}} e^{- \frac{1}{2\sigma ^2} (u - \mu )^2} du\), which can be well approximated with standard numerical libraries. The remaining integral can also be expressed via \(\varPhi\): using the substitution \(r := u - \mu\), we have

$$\begin{aligned}&\int _{l}^{o} u \frac{1}{\sqrt{2 \pi \sigma ^2}} e^{- \frac{1}{2\sigma ^2} (u - \mu )^2} du \\&\quad = \int _{l - \mu }^{o - \mu } r \frac{1}{\sqrt{2 \pi \sigma ^2}} e^{- \frac{1}{2\sigma ^2} r^2} dr + \mu \int _{l - \mu }^{o - \mu } \frac{1}{\sqrt{2 \pi \sigma ^2}} e^{- \frac{1}{2\sigma ^2} r^2} dr \\&\quad = \frac{\sigma }{\sqrt{2 \pi }} \big ( e^{- \frac{1}{2\sigma ^2} (l - \mu )^2} - e^{- \frac{1}{2\sigma ^2} (o - \mu )^2} \big ) + \mu \varPhi _{l}^{o} . \end{aligned}$$
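Putting the pieces together, Eq. (10) can be evaluated as follows. This is our own sketch with \(\xi = 40\) pieces; all function names are ours:

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def Phi(l, o, mu, s2):
    """Integral of the N(mu, s2) density from l to o, via the error function."""
    cdf = lambda t: 0.5 * (1.0 + math.erf((t - mu) / math.sqrt(2.0 * s2)))
    return cdf(o) - cdf(l)

def first_moment(l, o, mu, s2):
    """Integral of u * N(u; mu, s2) from l to o (closed form derived above)."""
    dens = lambda t: math.exp(-(t - mu) ** 2 / (2.0 * s2))
    return math.sqrt(s2 / (2.0 * math.pi)) * (dens(l) - dens(o)) + mu * Phi(l, o, mu, s2)

def integral_eq10(a, b, mu, s2, xi=40):
    bk = [-10.0 + 20.0 / xi * t for t in range(xi + 1)]       # b_1 .. b_{xi+1}
    pieces = [(-math.inf, bk[0], 0.0, sigmoid(bk[0]))]        # flat left tail
    for t in range(xi):
        m = (sigmoid(bk[t + 1]) - sigmoid(bk[t])) / (bk[t + 1] - bk[t])
        pieces.append((bk[t], bk[t + 1], m, sigmoid(bk[t]) - m * bk[t]))
    pieces.append((bk[-1], math.inf, 0.0, sigmoid(bk[-1])))   # flat right tail
    total = 0.0
    for lo, hi, m, v in pieces:                               # clip each piece to [a, b]
        l, o = max(a, lo), min(b, hi)
        if l < o:
            total += m * first_moment(l, o, mu, s2) + v * Phi(l, o, mu, s2)
    return total

val = integral_eq10(-math.inf, 1.0, mu=0.3, s2=0.8)
```

Each of the \(\xi + 2\) terms needs only exponentials and error-function evaluations, so the whole integral costs \(O(\xi)\) elementary operations.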
Fig. 4

Comparison of sigmoid function approximations. For both the linear function approximation (proposed) and the discrete bin approximation (Ji and Carin 2007), we set \(\xi = 40\). For the normal CDF approximation we use \(\varPhi (\sqrt{\frac{\pi }{8}} u)\). Note that the proposed linear function approximation is almost indistinguishable from the sigmoid function. In contrast, the error of the normal approximation is large around − 3.0 and around 3.0. The discrete approximation is clearly the worst

4.1.1 Estimation of \(\mu _z\) and \(\sigma ^2\)

Recall that we assumed that \(p(z | {\mathbf {x}}_{A})\) can be well approximated by a normal density with mean \(\mu _z\) and variance \(\sigma ^2\). In order to estimate \(\mu _z\) and \(\sigma ^2\), we propose to model z given \({\mathbf {x}}_{A}\) as a regression problem with additive noise, where z is the response variable and \({\mathbf {x}}_{A}\) are the explanatory variables. In detail, for learning the regression model from the training data \(\{{\mathbf {x}}^{(k)} \}_{k = 1}^{n}\), we prepare a collection of response and explanatory variable pairs of the form \(\{ (z^{(k)}, {\mathbf {x}}_A^{(k)}) \}_{k = 1}^{n}\), where \(z^{(k)} = \sum _{i \in S} \varvec{\beta }_i^T f_i(x_i^{(k)})\). We note that for training the regression model, we could additionally exploit unlabeled data as follows. Once we have learned \(\varvec{\beta }_i\) and \(f_i\) from training data with class label information y, we can use all data, with and without class labels, to create the data \(\{ (z^{(k)}, {\mathbf {x}}_A^{(k)}) \}_{k = 1}^{n + n_u}\), where \(n_u\) is the number of unlabeled samples.

For our experiments, we use a standard Bayesian linear regression model with a scaled inverse \(\chi ^2\) distribution prior on the noise variance (Gelman et al., 2013).
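The estimation step can be sketched with ordinary least squares in place of the Bayesian linear regression used in the paper (the residual variance serves as the plug-in \(\sigma^2\)); the data below is synthetic:

```python
import numpy as np

# Regress z = sum_{i in S} beta_i^T f_i(x_i) on the observed covariates x_A
# to get the conditional mean mu_z and noise variance sigma^2.
rng = np.random.default_rng(0)
n = 500
xA = rng.normal(size=(n, 2))                       # observed covariates x_A
z = 1.5 * xA[:, 0] - 0.7 * xA[:, 1] + rng.normal(scale=0.5, size=n)

X = np.hstack([np.ones((n, 1)), xA])               # design matrix with intercept
w, *_ = np.linalg.lstsq(X, z, rcond=None)
resid = z - X @ w
sigma2 = resid @ resid / (n - X.shape[1])          # unbiased noise variance estimate

def predict_mu_z(x_new):
    """Predicted mean mu_z of z given newly observed values of x_A."""
    return w @ np.concatenate([[1.0], x_new])

mu_z = predict_mu_z(np.array([0.2, -1.0]))
```

The Bayesian version additionally yields posterior uncertainty over the noise variance, which the point estimate above discards.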

5 Cost-aware non-linear covariate selection

In the previous section, we assumed that the covariates are acquired in a specific sequence. In this section, we discuss two different approximation strategies for finding the optimal sequence of subsets \({\mathfrak {S}} = \{S_1, S_2, \ldots , S_q\}\), where \(S_1 \subset S_2 \subset \ldots \subset S_q \subseteq V\), such that the expected total cost of classification tends to be minimal for some set \(S \in {\mathfrak {S}}\).

5.1 Forward selection

We suggest setting \(q := p + 1\), and using greedy forward selection as outlined in Algorithm 2.

Algorithm 2

Note that from the definition in Eq. (8), we have

$$\begin{aligned} {\mathbb{E}}_{{\mathbf {x}}_{S}} [F_{{\mathbf {x}}_{S}} ( \{ j \} )] = {\mathbb{E}}_{{\mathbf {x}}_{S}} \big [ {\mathbb{E}}_{{x_j}, y} [ c_{y,\delta ^*({\mathbf {x}}_{S \cup \{j\}})} | {\mathbf {x}}_{S} ] \big ] + \sum _{i \in S} c_i . \end{aligned}$$

We estimate \({\mathbb{E}}_{{\mathbf {x}}_{S}} [F_{{\mathbf {x}}_{S}} ( \{ j \} )]\) using 10-fold cross-validation of the labeled training data \(\{ (y^{(k)}, {\mathbf {x}}^{(k)}) \}_{k = 1}^{n}\). In particular, for one fold \({\mathscr {A}}_f \subseteq \{1, \ldots , n\}\), we have

$$\begin{aligned}&{\mathbb{E}}_{{\mathbf {x}}_{S}} \big [ {\mathbb{E}}_{{x_j}, y} [ c_{y,\delta ^*({\mathbf {x}}_{S \cup \{j\}})} | {\mathbf {x}}_{S} ]\big ] \\&\quad = {\mathbb{E}}_{{\mathbf {x}}_{S \cup \{j\}}, y} [ c_{y,\delta ^*({\mathbf {x}}_{S \cup \{j\}})} ] \\&\quad \approx \frac{1}{| {\mathscr {A}}_f |} \sum _{k \in {\mathscr {A}}_f} c_{y^{(k)},\delta _f({\mathbf {x}}_{S \cup \{j\}}^{(k)})}, \end{aligned}$$

where the model for the conditional probability \(p(y | {\mathbf {x}}_{S \cup \{j\}})\) used by \(\delta _f\) is trained using the samples in \(\{1, \ldots , n\} \setminus {\mathscr {A}}_f\). The final estimate is obtained by averaging over all folds.
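The greedy loop itself is simple; the sketch below replaces the cross-validated estimate of \({\mathbb{E}}_{{\mathbf {x}}_{S}} [F_{{\mathbf {x}}_{S}} ( \{ j \} )]\) with a synthetic oracle `est_cost`, purely for illustration:

```python
# Greedy forward selection: starting from the empty set, repeatedly add the
# covariate whose estimated expected future cost is smallest, yielding the
# nested sequence S_1 ⊂ S_2 ⊂ ... ⊂ S_{p+1} = V (q = p + 1 sets).
def forward_selection(V, est_cost):
    seq = [frozenset()]
    S = frozenset()
    while S != frozenset(V):
        j = min(V - S, key=lambda j: est_cost(S | {j}))
        S = S | {j}
        seq.append(S)
    return seq

# Synthetic estimate: each covariate i lowers the misclassification risk by
# gain[i] but adds its acquisition cost; covariate 2 is cheap and informative.
cov_cost = {1: 0.3, 2: 0.1, 3: 0.5}
gain = {1: 0.15, 2: 0.30, 3: 0.10}
est_cost = lambda S: 0.9 - sum(gain[i] for i in S) + sum(cov_cost[i] for i in S)
print(forward_selection({1, 2, 3}, est_cost))
```

Note that the sequence is always built all the way up to V; deciding where to stop along the sequence is left to COS or AdaCOS.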

An advantage of the above forward-selection procedure is that it uses an unbiased estimate of \({\mathbb{E}}_{{\mathbf {x}}_{S}} [F_{{\mathbf {x}}_{S}} ( \{ j \} )]\); assuming the variance is not too large, we can expect to find a good local minimum.

However, there are several disadvantages. First, if the variance of the estimator is high, we might get stuck in a bad local minimum. Second, the forward-selection procedure is extremely computationally expensive and, as a consequence, infeasible if p is large. Finally, a more subtle disadvantage is that it requires the full specification of the misclassification costs, i.e. the specification of \(c_{0,1}\) and \(c_{1,0}\). As a consequence, it is not applicable when we are provided only with \(c_{0,1}\), which we will discuss in Sect. 6.

5.2 Group Lasso penalty

Some of the disadvantages of the forward-selection method can be overcome by jointly training the model for the conditional probability \(p(y | {\mathbf {x}})\) with a sparsity-enforcing penalty on the regression coefficients. Here, this is possible since we assume a generalized additive model for \(p(y | {\mathbf {x}})\).

In particular, we propose to acquire the sets \(S_1 \subset S_2 \subset S_3 \ldots S_q\) by using the search path of a penalized logistic loss function, subject to a post-hoc correction to ensure strict monotonicity. In detail, for different values of \(\lambda\), we solve the following convex optimization problem

$$\begin{aligned} \mathop {\hbox {minimize}}\limits _{\varvec{\beta }, \tau } - \sum _{k = 1}^n \log p(y = y^{(k)} | {\mathbf {x}}^{(k)}, \varvec{\beta }, \tau ) + \lambda \sum _{i = 1}^p c_i ||\varvec{\beta }_i||_2 , \end{aligned}$$
(11)

where \(\varvec{\beta } = (\varvec{\beta }_1, \varvec{\beta }_2, \ldots , \varvec{\beta }_p) \in {\mathbb {R}}^{s \cdot p}\). The group lasso penalty ensures that the regression coefficients \(\varvec{\beta }_i\) are either all zero or all non-zero (Hastie et al., 2015). Note that in Eq. (11) each group is scaled by \(c_i\), which ensures that the regression coefficients of covariates with high cost are penalized more.Footnote 9 As a consequence, in order to be included in the final model, covariates with high cost are required to lower the negative log-likelihood term more than covariates with low cost. By inspecting the search path for different values of \(\lambda _1> \lambda _2> \ldots > \lambda _q\), we acquire the corresponding sets \(S_1, S_2, \ldots , S_{q}\). For \(\lambda\), we choose 100 values evenly spaced on the log-scale between 0.0001 and 0.02, that is \(q = 100\). In more detail, after training with \(\lambda _j\), covariate i is included in \(S_j\) if and only if \(\varvec{\beta }_i \ne 0\). In general, after removing duplicate sets, this leads to \(S_1 \subset S_2 \subset S_3 \ldots S_{q^{\prime}} \subset V\), where \(q^{\prime} \le q\). Furthermore, if not already present, we add the empty set \(\emptyset\) and the full covariate set V. In the rare case where a set violates strict monotonicity, we remove it. For example, for the Breast Cancer data, group lasso finds the following sets:

$$\begin{aligned} S_1&= \{\} \\ S_2&= \{6\} \\ S_3&= \{2, 6\} \\ S_4&= \{2, 3, 6\} \\ S_5&= \{3, 6, 8\} \\ S_6&= \{2, 3, 6, 8\} \\ \ldots \end{aligned}$$

and therefore remove \(S_5\).
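The post-hoc correction described above can be sketched as follows; this is a minimal sketch with hypothetical names, where `path_supports` would be the supports extracted from the fitted group-lasso models along the search path:

```python
def nested_sets_from_path(path_supports, all_covariates):
    """Turn the group-lasso search path into a strictly nested sequence
    S_1 ⊂ S_2 ⊂ ...: deduplicate, add ∅ and the full set V, and drop any
    set that violates monotonicity w.r.t. the sets kept so far.
    `path_supports`: supports (sets of covariate indices) for
    lambda_1 > lambda_2 > ..., i.e. ordered from sparsest to densest."""
    kept = [set()]  # always start from the empty set
    for S in path_supports:
        S = set(S)
        if S == kept[-1]:
            continue          # duplicate of the previous set
        if kept[-1] < S:      # strict superset: keep it
            kept.append(S)
        # otherwise S violates monotonicity (a covariate dropped out): skip
    V = set(all_covariates)
    if kept[-1] != V:
        kept.append(V)        # add the full covariate set if missing
    return kept
```

On the Breast Cancer example above, this keeps \(\emptyset, \{6\}, \{2,6\}, \{2,3,6\}, \{2,3,6,8\}\) and drops \(S_5 = \{3,6,8\}\).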

The conditional class probability \(p(y = 1 | {\mathbf {x}}, \varvec{\beta }, \tau )\) is modeled as

$$\begin{aligned} p(y = 1 | {\mathbf {x}}, \varvec{\beta }, \tau ) = g(\tau + \sum _{i = 1}^{p} \varvec{\beta }_i^T f_i(x_i)) . \end{aligned}$$
(12)

The non-linear transformations \(f_i: {\mathbb {R}} \rightarrow {\mathbb {R}}^s\), where s is the number of splines, are learned from the training data using penalized B-splines (Hastie et al., 2009).
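As an illustration of Eq. (12), the following sketch evaluates the GAM link with a simple truncated-power spline basis standing in for the penalized B-splines; all names and the choice of basis are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def spline_basis(x, knots):
    """Simple truncated-power spline basis f_i: R -> R^s with
    s = 1 + len(knots); a stand-in for penalized B-splines."""
    feats = [x] + [np.maximum(x - k, 0.0) for k in knots]
    return np.stack(feats, axis=-1)

def gam_proba(x_row, betas, tau, knots_per_cov):
    """Eq. (12): p(y=1|x) = g(tau + sum_i beta_i^T f_i(x_i)),
    with g the logistic sigmoid."""
    z = tau
    for xi, beta_i, knots in zip(x_row, betas, knots_per_cov):
        z += spline_basis(np.asarray(xi), knots) @ beta_i
    return 1.0 / (1.0 + np.exp(-z))
```

Setting \(\varvec{\beta }_i = 0\) for a covariate removes its contribution entirely, which is exactly what the group lasso penalty in Eq. (11) enforces.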

6 Extension to classification with recall guarantees

So far, we assumed that both misclassification costs \(c_{0, 1}\) and \(c_{1, 0}\) are given. Arguably, the false positive cost \(c_{0, 1}\) is relatively easy to specify. For example, in the medical domain, this might correspond to the price of a medicine which was unnecessarily prescribed to a healthy patient.

On the other hand, the specification of \(c_{1, 0}\) is more difficult. For example, specifying the cost of a dead patient (that might have been rescued) is difficult. Therefore, in the medical domain, it is more common to try to make a guarantee on the recall.Footnote 10 In particular, it is common practice to require that the recall is \(95\%\) (Kanao et al., 2009).

In the following, we show how to estimate the false negative cost \(c_{1, 0}\), given the false positive cost \(c_{0,1}\) and the requirement that the recall is larger or equal to some value r.

We denote by \(\mathbb {1}_M\) the indicator function, which is 1 if the expression M is true and 0 otherwise.

Given a distribution over \((y, {\mathbf {x}})\) such that \({\mathbb{E}}[\mathbb {1}_{y=1}] > 0\), the recall of a decision procedure \(\delta\) is defined as:

$$\begin{aligned} R_\delta&:= \int p({\mathbf {x}} | y = 1) \cdot \mathbb {1}_{\delta ({\mathbf {x}}) = 1} d {\mathbf {x}} . \end{aligned}$$

Assuming that \(\delta\) is a Bayes procedure, we have

$$\begin{aligned} \delta ({\mathbf {x}}) = 1&\Leftrightarrow p(y = 1 | {\mathbf {x}}) \cdot c_{1, 0} \ge p(y = 0 | {\mathbf {x}}) \cdot c_{0, 1} \\&\Leftrightarrow p(y = 1 | {\mathbf {x}}) \cdot c_{1, 0} \ge (1 - p(y = 1 | {\mathbf {x}})) \cdot c_{0, 1} \\&\Leftrightarrow p(y = 1 | {\mathbf {x}}) \ge \frac{c_{0, 1}}{c_{1, 0} + c_{0, 1}} . \end{aligned}$$

Setting

$$\begin{aligned} t := \frac{c_{0, 1}}{c_{1, 0} + c_{0, 1}} , \end{aligned}$$

we have that

$$\begin{aligned} R_\delta&= \int p({\mathbf {x}} | y = 1) \cdot \mathbb {1}_{p(y = 1 | {\mathbf {x}}) \ge t} d {\mathbf {x}} . \end{aligned}$$

In order to emphasize the dependence on t, we write in the following \(R_t\) instead of \(R_\delta\). In particular, let \(t^*\) be chosen such that

$$\begin{aligned} r = R_{t^*} \end{aligned}$$
(13)

where r is, for example, 0.95. Then the implicitly defined cost \(c_{1, 0}\) is given by

$$\begin{aligned} c_{1, 0} = \frac{1- t^*}{t^*} c_{0, 1} . \end{aligned}$$
(14)

It remains to show how \(t^*\) can be estimated. In general, Equation (13) does not have a solution (in terms of t). We therefore solve the following problem

$$\begin{aligned}&\max _t \, t \; , \, \text {subject to} \; \; r \le R_t , \end{aligned}$$
(15)

which has always a solution (since \(t = 0\) fulfills the constraint). Since \(p({\mathbf {x}} | y = 1)\) is unknown, we use the empirical training data distribution to estimate \(R_t\):

$$\begin{aligned} R_t&= \int p({\mathbf {x}} | y = 1) \cdot \mathbb {1}_{p(y = 1 | {\mathbf {x}}) \ge t} d {\mathbf {x}} \nonumber \\&\approx \frac{1}{n_1} \sum _{k : y^{(k)} = 1} \mathbb {1}_{p^{(-k)}(y = 1 | {\mathbf {x}}^{(k)}) \ge t} , \end{aligned}$$
(16)

where \(n_1\) is the number of positive samples (i.e. with label \(y = 1\)), and \(p^{(-k)}(y = 1 | {\mathbf {x}}^{(k)})\) is the estimate of \(p(y = 1 | {\mathbf {x}}^{(k)})\) of the classifier that was trained without sample k. In practice, since this type of leave-one-out estimation is computationally expensive, we use 10-fold cross-validation instead.

Since the expression in Eq. (16) is a monotone decreasing step function in t, we can easily solve the problem in (15) by sorting \(\big \{ p^{(-k)}(y = 1 | {\mathbf {x}}^{(k)}) \big \}_{k=1}^{n_1}\) in decreasing order.
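The sorting procedure, together with the cost conversion of Eq. (14), might look as follows; this is a sketch with hypothetical names, and it assumes the resulting threshold \(t^* > 0\) so that Eq. (14) is well defined:

```python
import numpy as np

def threshold_and_fn_cost(p_pos, r, c_fp):
    """Solve problem (15): the largest threshold t whose empirical recall,
    Eq. (16), is >= r, given the cross-validated probabilities
    p^{(-k)}(y=1 | x^{(k)}) of the positive samples; then recover the
    implied false negative cost c_{1,0} via Eq. (14)."""
    p_sorted = np.sort(np.asarray(p_pos))[::-1]  # decreasing order
    n1 = len(p_sorted)
    m = int(np.ceil(r * n1))       # need at least m positives above threshold
    t_star = p_sorted[m - 1]       # empirical recall m/n1 >= r at this t
    c_fn = (1.0 - t_star) / t_star * c_fp
    return t_star, c_fn
```

Since Eq. (16) is a step function that only changes at the sorted probabilities, the maximizer of (15) must be one of them, which the code exploits.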

6.1 Adaptive covariate acquisition with recall guarantees

So far, we discussed how to estimate \(c_{1,0}\) in the situation where only one classifier based on \(p(y = 1 | {\mathbf {x}})\) is used. However, in the adaptive acquisition setting, the situation is more complicated since, in general, for different observed sets of variables, the conditional class probabilities are different. In particular, let \(S \subset S^{\prime} \subseteq V\), where V is the set of all variables. Then, in general, we have

$$\begin{aligned} p(y = 1 | {\mathbf {x}}_S) \ne p(y = 1 | {\mathbf {x}}_{S^{\prime}}) , \end{aligned}$$

which means that, in general, the optimal threshold \(t^*\) which guarantees recall \(\ge r\) is different for different sets of observed variables S. Furthermore, in the adaptive setting, estimating the recall using Equation (16) is not valid anymore since the distribution of the samples, with label \(y = 1\), and for which we select the variable set \(S^{\prime}\) is, in general, different from \(p({\mathbf {x}} | y = 1)\), i.e. \(p({\mathbf {x}} | y = 1, \hbox { acquired}\ S^{\prime}) \ne p({\mathbf {x}} | y = 1)\).

Nevertheless, we show in the following that it is possible to define the cost \(c_{1,0}\) such that the recall requirement is fulfilled. First, let us introduce the following notations. Let \(S_1 \subset S_2 \subset S_3 \ldots S_q\), be the sets of variables that are considered for adaptive variable acquisition, i.e. first we acquire \(S_1\), and then we decide whether to classify or whether we acquire additionally the variables in \(S_2 \setminus S_1\), and so forth. Moreover, let \(\delta _{t, S}\) be the classifier based on the observed variables S and using threshold t, i.e.

$$\begin{aligned} \delta _{t, S}({\mathbf {x}}_{S}) := {\left\{ \begin{array}{ll} 1 &\quad{\text {if}}\,p(y = 1 | {\mathbf {x}}_{S}) \ge t , \\ 0 &\quad {\text{else}}. \end{array}\right. } \end{aligned}$$

To simplify the notation, we write in the following \(\delta _{t, S}\) short for \(\delta _{t, S}({\mathbf {x}}_{S})\).

Strict control of the recall can be achieved by requiring that

$$\begin{aligned} p(\delta _{t_{1}, S_{1}} = 1, \delta _{t_{2}, S_{2}} = 1, \ldots , \delta _{t_{q}, S_{q}} = 1 | y = 1) \ge r . \end{aligned}$$
(17)

This can be seen as follows. Suppose \(y = 1\). If any classifier \(\delta _{t_{j}, S_{j}}\) outputs label 0, an adversarial selection strategy could select that classifier. If, however, all classifiers output label 1, then even an adversarial selection strategy must select a classifier \(\delta _{t_{j}, S_{j}}\) whose output is 1. By the requirement of Inequality (17), the latter happens with probability at least r.

If we require that all thresholds are the same, i.e.

$$\begin{aligned} t = t_{1} = t_2 = \ldots = t_q, \end{aligned}$$

then we can proceed as before. That means, based on Inequality (17), we first calculate \(t^*\), and then specify the false negative cost \(c_{1,0}\) using Equation (14). Analogously to before for checking Inequality (17), we use the empirical training data estimate:

$$\begin{aligned}&p(\delta _{t, S_{1}} = 1, \delta _{t, S_{2}} = 1, \ldots , \delta _{t, S_{q}} = 1 | y = 1) \\&\quad \approx \frac{1}{n_1} \sum _{k : y^{(k)} = 1} \mathbb {1}_{p^{(-k)}(y = 1 | {\mathbf {x}}_{S_1}^{(k)}) \ge t} \cdot \mathbb {1}_{p^{(-k)}(y = 1 | {\mathbf {x}}_{S_2}^{(k)}) \ge t} \ldots \cdot \mathbb {1}_{p^{(-k)}(y = 1 | {\mathbf {x}}_{S_q}^{(k)}) \ge t} . \end{aligned}$$
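Since all q indicator factors share the same threshold t, a positive sample contributes to the joint recall if and only if its minimum probability over the nested sets is at least t, so the same sorting trick as before applies. A sketch with hypothetical names:

```python
import numpy as np

def shared_threshold(P, r):
    """P: (n1, q) matrix with P[k, j] = p^{(-k)}(y=1 | x_{S_j}^{(k)})
    for the positive training samples.  Returns the largest shared
    threshold t such that the empirical joint recall of Inequality (17)
    -- all q classifiers output 1 -- is >= r."""
    worst = P.min(axis=1)          # sample passes all sets iff min over sets >= t
    worst = np.sort(worst)[::-1]   # decreasing order
    m = int(np.ceil(r * len(worst)))
    return worst[m - 1]
```

The resulting \(t^*\) is then converted to \(c_{1,0}\) via Eq. (14), exactly as in the single-classifier case.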

7 Experiments

We evaluate our proposed method on four real datasets from the medical domain which are frequently used for cost-sensitive classification: Pima Diabetes dataset (p = 8, n = 768, positive class ratio \(\approx 0.35\)), the Wisconsin Breast Cancer dataset (p = 10, n = 683, positive class ratio \(\approx 0.35\)), Heart-disease dataset (p = 13, n = 303, positive class ratio \(\approx 0.46\)), and the PhysioNet dataset (p = 30, n = 12000, positive class ratio \(\approx 0.14\)). The first three datasets are all available at the UCI Machine Learning repository,Footnote 11 the PhysioNet data is available at https://archive.physionet.org/pn3/challenge/2012/.

Note that the PhysioNet data (Goldberger et al., 2000) contains for each patient several health check measures, like cholesterol, taken at different times during their stay in the intensive care unit. As in Shim et al. (2018), for each patient we use the last recorded value of each attribute to predict death (\(y = 1\)) or survival (\(y = 0\)). After filtering attributes which are mostly missing, we obtain a data set with 12000 patients and 30 attributes.

For Diabetes and Heart-disease we use the covariate costs as defined in Ji and Carin (2007), and Turney (1994), respectively. For the other datasets, we set the covariate costs uniformly to one.

Note that the Heart-disease and PhysioNet data contain missing values. For methods which cannot handle missing values (including our proposed methods) we assume that all covariates are jointly distributed according to a multivariate normal distribution, where the covariance matrix is estimated from all samples (including missing values) using the method from Lounici et al. (2014).

We compare the proposed method AdaCOS to the following methods:

COS The proposed method but fixing the covariate set \(S \in \{S_1, S_2,\ldots , S_q\}\) to the one which minimizes the total costs in expectation, i.e.

$$\begin{aligned} {\mathbb{E}}_{{\mathbf {x}}_{S}, y} \big [ c_{y,\delta ^*({\mathbf {x}}_{S})} \big ] + \sum _{i \in S} c_i , \end{aligned}$$

which is estimated using 10-fold cross-validation as in Sect. 5.1.Footnote 12

Full Model The logistic generalized additive model which always acquires (and uses) all covariates.

Shim2018 The cost-sensitive classification method based on deep reinforcement learning as proposed in Shim et al. (2018).Footnote 13

GreedyMiser The cost-sensitive tree construction method proposed in Xu et al. (2012).Footnote 14

AdaptGbrt The method proposed in Nan and Saligrama (2017), which requires the specification of a high accuracy classifier for which we use the Full Model.Footnote 15

For all methods, we estimate the hyperparameters with 10-fold cross-validation, except where this is too computationally expensive: for Shim2018 we use the hold-out data split as in their provided implementation, and for AdaptGbrt we use 5-fold cross-validation.

As evaluation measure, we use the average total cost of classification, defined as

$$\begin{aligned} \text {avg total cost} := \frac{1}{n_t} \sum _{k = 1}^{n_t} \big ( c_{y^{(k)}, y^{(k)}_*} + \sum _{i \in S^{(k)}} c_i \big ) , \end{aligned}$$

where \(n_t\) is the number of test samples; \(y^{(k)}\) and \(y^{(k)}_*\) is the k-th true test class and predicted test class, respectively; \(S^{(k)}\) is the set of covariates that were used by the prediction model for classifying the k-th sample.
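A direct implementation of this evaluation measure might look as follows (names are hypothetical):

```python
def avg_total_cost(y_true, y_pred, used_sets, c_fp, c_fn, feat_cost):
    """Average total cost over the test set: per-sample misclassification
    cost plus the acquisition cost of the covariates the model used."""
    total = 0.0
    for yt, yp, S in zip(y_true, y_pred, used_sets):
        if yt == 1 and yp == 0:
            total += c_fn          # false negative
        elif yt == 0 and yp == 1:
            total += c_fp          # false positive
        total += sum(feat_cost[i] for i in S)  # covariate acquisition cost
    return total / len(y_true)
```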

7.1 Results

For each dataset we use 5-fold cross-validation and report mean and standard deviation of the total costs. We evaluate all methods on two settings:

  • user-specified false positive and false negative misclassification costs.

  • user-specified false positive misclassification cost and target recall.

If not stated otherwise, we use group lasso, as explained in Sect. 5.2, for acquiring the sets \(S_1 \subset S_2 \subset \ldots \subset S_q\).

7.1.1 User-specified misclassification costs

In the first setting, we assume that the user specifies the false positive cost in \(\{1, 5, 10, 50, 100, 500, 1000\}\). The false negative cost is set to be 10 times the false positive cost, which reflects that it is more important to detect infected patients than avoiding wrongly classifying healthy patients.

The total cost of misclassification is shown in the top plots of Figs. 5, 6, 7, and 8 for Diabetes, Breast Cancer, PhysioNet, and Heart-disease, respectively. We observe that with respect to minimizing the total cost of classification (top plots), our proposed method AdaCOS performs better than all previously proposed methods.

Fig. 5

Results on Pima Diabetes dataset with user-specified false positive cost in \(\{1, 5, 10, 50, 100, 500, 1000\}\). The false negative cost is set to be 10 times the false positive cost

Fig. 6

Results on Breast Cancer dataset with user-specified false positive cost in \(\{1, 5, 10, 50, 100, 500, 1000\}\). The false negative cost is set to be 10 times the false positive cost

Fig. 7

Results on PhysioNet dataset with user-specified false positive cost in \(\{1, 5, 10, 50, 100, 500, 1000\}\). The false negative cost is set to be 10 times the false positive cost

Fig. 8

Results on Heart-disease dataset with user-specified false positive cost in \(\{1, 5, 10, 50, 100, 500, 1000\}\). The false negative cost is set to be 10 times the false positive cost

In each of those figures, the middle and bottom plots show the weighted accuracy and the number of acquired covariates, respectively, each plotted against the user-specified false positive cost. Since we assume that false negative classifications have 10 times higher cost than false positive classifications, we use the weighted accuracy defined by

$$\begin{aligned} {\text {weighted accuracy}} = \frac{{\text {true positives}}\cdot 10 + {\text {true negatives}}}{{\text {number of true labels}}\cdot 10 + {\text {number of false labels}}} . \end{aligned}$$
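The weighted accuracy above can be computed as follows (a sketch, with w = 10 reflecting the assumed false-negative-to-false-positive cost ratio):

```python
def weighted_accuracy(y_true, y_pred, w=10):
    """Weighted accuracy as defined above: true positives (and the number
    of positive labels in the denominator) are weighted by w."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return (tp * w + tn) / (n_pos * w + n_neg)
```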

From the bottom plots, as expected, we see that all methods start acquiring more covariates as the user-specified false positive cost increases. At the same time, all methods, except Shim2018, show an increase in (weighted) accuracy. In particular, Shim2018 underperforms on the smaller datasets Diabetes, Breast Cancer, and Heart-disease, which is likely due to the difficulty of adjusting the hyper-parameters of their deep neural network classifier on small hold-out validation data.

In terms of (weighted) accuracy, Full Model always performs best, i.e. even for small datasets we do not find any gain in predictive accuracy from using a sparser model. We conclude that if the covariate costs are zero or negligible, we may simply opt for the full model to obtain the lowest total costs. On the other hand, if the ratio of false positive cost to covariate cost is below around 100, the full model performs considerably worse than the proposed method in terms of total cost.

The (weighted) accuracy of AdaCOS and COS are similar, while the former achieves the same accuracy with fewer covariates. This demonstrates the effectiveness of estimating the expected cost of misclassification depending on what we have observed so far using the conditional Bayes risk as in Eq. (8).

Additionally, in Fig. 9, we show how the (weighted) accuracy changes as the number of acquired covariates increases.Footnote 16 The plot confirms that AdaCOS consistently improves over COS.

Fig. 9

Weighted accuracy vs. the number of acquired covariates. Results where the number of acquired covariates was larger than for any of the proposed methods are not shown. (All results were acquired with user-specified false positive cost in \(\{1, 5, 10, 50, 100, 500, 1000\}\), and false negative cost 10 times the false positive cost.)

Comparison of Group Lasso and Forward Selection Next, we compare the forward-selection strategy (Sect. 5.1) and group lasso (Sect. 5.2) when used for acquiring the covariate sets \(S_1 \subset S_2 \subset \ldots \subset S_q\). For Diabetes, Breast Cancer, and Heart-disease, the total costs of the proposed method AdaCOS with forward selection and group lasso are shown in Fig. 10. Due to the high computational costs for large p, it was not feasible to apply forward selection to the PhysioNet dataset. We find that, except for the Breast Cancer dataset, both covariate acquisition strategies lead to similar results. For Breast Cancer, forward selection appears to be superior to group lasso.

Fig. 10

Comparison of the proposed method AdaCOS when defining the covariates sets \(S_1, S_2, \ldots S_q\) using either the group lasso penalty or forward selection

Runtime Comparison In Table 1, first two rows, we compare the training time of AdaCOS with forward selection and AdaCOS with group lasso (Sect. 5.2).Footnote 17 The results clearly confirm that group lasso is considerably faster than forward selection, which allows us to apply our proposed method to the PhysioNet dataset.

For reference, in Table 1, rows 3 to 6, we also show the training time of all other methods, but we note that a fair comparison is difficult due to different types of implementations and the Matlab license conditions. In particular, the experiments of our proposed methods and Shim2018 were all run in a Python environment with 88 Intel Xeon 2.20 GHz CPUs, whereas GreedyMiser and AdaptGbrt ran on Matlab with 8 Intel Xeon 3.50 GHz CPUs.Footnote 18

Finally, in Table 2, we also show the runtime at test time of all methods. Though AdaCOS is slower than all other methods, it is still fast in practice, requiring less than 2 minutes to process all test samples of PhysioNet.

Table 1 Shows the average training time in minutes per 1000 samples (standard deviation in brackets) for forward selection and group lasso
Table 2 Shows the average test time in minutes per 1000 samples (standard deviation in brackets) for all methods

Symmetric misclassification costs on Diabetes dataset In order to compare to the results reported in Ji and Carin (2007), Dulac-Arnold et al. (2012), we also evaluate on the Diabetes dataset with symmetric misclassification costs (i.e. false negative and false positive costs are the same) and the cost for correct classification set to \(-50\). The results, shown in Table 3, suggest that in this setting, too, the proposed method can have an advantage over previously proposed methods. In particular, the proposed method AdaCOS with forward selection has the lowest total costs, though when using group lasso the proposed method underperforms.

Table 3 Shows the total cost of misclassification under the same cost setting as in Ji and Carin (2007), Dulac-Arnold et al. (2012): user-specified misclassification costs are symmetric (either 400 or 800), cost of correct classification equals \(-50\). The results for the methods DWSM and POMDP are taken from Dulac-Arnold et al. (2012) and Ji and Carin (2007), respectively

7.1.2 User-specified target recall

Finally, we investigate the setting where the user specifies the target recall r instead of the false negative cost. Here, we show the results for \(r = 0.95\); the results for target recall \(r = 0.99\) are similar and given in the supplementary material.

For this setting, we do not consider the method AdaptGbrt, since it does not allow the output of class probabilities or scores. For Shim2018 and GreedyMiser, we found that simply using the class probabilities/scores from the validation data to learn thresholds with recall \(\ge r\) tended to lead to recall less than r on the test data, as shown in Fig. 11. Therefore, in order to make all results comparable, we show the results for Shim2018 and GreedyMiser at the same recall level as the proposed method AdaCOS.

Fig. 11

Actually observed recall on test data when threshold on class probabilities was adjusted on validation data to have recall ≥ 0.95.

The recall of the proposed method AdaCOS on the test data, as shown in Fig. 11, never violates the target recall of 0.95.

Since no false negative costs are provided, we cannot evaluate in terms of total costs anymore. Instead, we evaluate in terms of average operation costs, defined as the average cost of false positives plus the costs for covariate acquisition:

$$\begin{aligned} \text {avg operation costs} := \frac{1}{n_t} \sum _{k = 1}^{n_t} \big ( (1 - y^{(k)}) \cdot y^{(k)}_* \cdot c_{0,1} + \sum _{i \in S^{(k)}} c_i \big ) . \end{aligned}$$
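Analogously to the average total cost, this measure might be computed as follows (hypothetical names):

```python
def avg_operation_cost(y_true, y_pred, used_sets, c_fp, feat_cost):
    """Average operation cost: false positive cost plus covariate
    acquisition cost per test sample; no false negative cost is needed."""
    total = 0.0
    for yt, yp, S in zip(y_true, y_pred, used_sets):
        # (1 - y) * y_pred picks out exactly the false positives
        total += (1 - yt) * yp * c_fp + sum(feat_cost[i] for i in S)
    return total / len(y_true)
```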

The results for all datasets are shown in Figs. 12, 13, 14, and 15. For completeness, in each of those figures, we also plot the false discovery rate (FDR) against the covariate costs (bottom plots), where crosses of different sizes mark the standard deviation of the FDR and the covariate costs.

Fig. 12

Results on Pima Diabetes dataset with user-specified false positive cost in {5, 10, 50, 100, 500, 1000}, and target recall ≥ 0.95.

Fig. 13

Results on Breast Cancer dataset with user-specified false positive cost in {1, 5, 10, 50, 100, 500, 1000}, and target recall ≥ 0.95.

Fig. 14

Results on PhysioNet dataset with user-specified false positive cost in {1, 5, 10, 50, 100, 500, 1000}, and target recall ≥ 0.95.

Fig. 15

Results on Heart-disease dataset with user-specified false positive cost in \(\{1, 5, 10, 50, 100, 500, 1000\}\), and target recall \(\ge 0.95\)

For all datasets, except Heart-disease, the proposed method AdaCOS has the smallest operation costs. Furthermore, AdaCOS tends to achieve a lower false discovery rate using fewer covariates.

8 Related work

Here, we briefly summarize various previous works for cost-sensitive classification.

Cost-sensitive Classification Without Covariate Costs The theory of optimal decision making given a (mis-)classification cost matrix and conditional class probabilities \(p(y | {\mathbf {x}})\) has been extensively studied in the statistical literature, see for example (Berger 2013), and the machine learning community, see for example (Zadrozny & Elkan, 2001; Zadrozny et al., 2003; Elkan, 2001). However, the costs of acquiring covariates are not addressed in these early works. The work in Zadrozny and Elkan (2001) also addresses the problem of an undefined cost matrix. However, in contrast to our work, they are concerned with sample-dependent costs and do not consider the recall constraints that we address.

Markov Decision Process (MDP) Framework The MDP formulation and solution using an action-utility representation (Q-learning) in Zubek et al. (2004), Bayer-Zubek (2004) is closest to our approach. Their method also leads to a Bayes procedure. However, they do not provide a formal proof and consider only discrete covariates. The work in Dulac-Arnold et al. (2011), Dulac-Arnold et al. (2012), Karayev et al. (2013) also uses the MDP framework. However, their proposed method cannot incorporate the uncertainty about the covariate distributions. The work in Ji and Carin (2007) tries to model such uncertainties by modeling the cost-sensitive classification problem as a partially observable Markov decision process (POMDP). However, their POMDP formulation can lead to repeatedly selecting the same covariates, and as a consequence they need to adapt the stopping criterion. In contrast, our proposed method, based on the extended definition of a Bayes procedure, leads to a rule for deciding when to stop acquiring new covariates that provably has the lowest expected total costs.

Reinforcement Learning Approaches Janisch et al. (2017), Shim et al. (2018) suggest using deep reinforcement learning with Q-learning. In contrast to the MDP approaches, a discriminative decision maker is learned, which does not require an environmental model. Their method performs promisingly in domains where huge amounts of labeled training data are available. Alternatively, the work in Benbouzid et al. (2012) suggests the use of SARSA. The method in Contardo et al. (2016) also addresses this problem with reinforcement learning. However, methods based on reinforcement learning are hard to optimize and can lead to suboptimal behavior, as our experimental comparison to Shim et al. (2018) demonstrated.

Discriminative Decision Approach The work in Wang et al. (2015) proposes an intriguing method for finding a decision procedure that is guaranteed to converge to the Bayes risk given sufficiently many training samples. Their idea is to create a Bayes optimal classifier for all possible subsets of covariates, connected by a directed acyclic graph. They formulate the problem as an empirical risk minimization (ERM) problem, and show that with infinitely many training samples the loss at each node converges to the Bayes risk. However, in order to remain scalable, their method requires acquiring covariates in batches. The work in Trapeznikov and Saligrama (2013), Wang et al. (2014b) uses a similar framework but is restricted to a fixed sequential order, in contrast to our method, which also learns the sequential order. The work in Kusner et al. (2014), Xu et al. (2013), Wang et al. (2014a) also uses the ERM framework together with budget constraints, but does not allow for asymmetric misclassification costs.

Cost-sensitive Tree Construction The works in Xu et al. (2012), Nan et al. (2015), Nan et al. (2016), Nan and Saligrama (2017), Peter et al. (2017) learn decision trees subject to budget constraints on the features. In particular, the methods in Nan and Saligrama (2017), Peter et al. (2017) are considered state of the art for this task. Their use of gradient boosted decision trees (Friedman 2001) makes them particularly effective for very large training data. Cost-sensitive decision trees for discrete covariates are also considered in Sheng and Ling (2006), and extended to Bayesian networks in Bilgic and Getoor (2007). In contrast to our work, these works do not explicitly optimize the expected total costs of misclassification, which can result in inferior performance, as we confirmed in Sect. 7.

Entropy-Based Approaches The works in Kanani and Melville (2008), Gao and Koller (2011), Kapoor and Horvitz (2009), Gong et al. (2019) optimize a criterion that combines the costs of features with an estimate of the class entropy of the resulting classifier. As such, their objective function is different from ours.

Density Estimation via Autoencoders The work in Kachuee et al. (2018) suggests acquiring the covariate which has the highest sensitivity to the output prediction y. In order to account for different covariate acquisition costs, the sensitivity scores are re-scaled appropriately. The sensitivity scores are estimated using a denoising autoencoder. Similarly, the work in Ma et al. (2019) uses as objective function the expected Shannon information, which is estimated via a variational autoencoder. Neither objective function is related to the minimization of the expected total cost.

Others The work in Greiner et al. (2002) extends the Probably Approximately Correct (PAC) framework to prove the existence of a cost-sensitive classifier that is with high probability optimal in the sense of providing minimal average total costs. However, they assume a probability distribution over only discrete covariates. The method in Lakkaraju and Rudin (2017) is additionally focused on interpretability and, as a consequence, optimizes an objective function that is different from ours. Imitation learning is also applied to this task by He et al. (2012), but their definition of loss is different from minimizing the total classification costs that we consider here. The work in Nan et al. (2014) assumes a margin-based classifier and uses a k-nearest neighbor approach to estimate the accuracy of the classifier in each step before deciding whether to acquire the next feature. However, their stopping criterion is solely based on an estimate of the classification accuracy and, as such, does not allow for a cost-optimal decision.

9 Conclusions

In this article, we addressed the problem of cost-sensitive classification where the goal is to minimize the total costs, defined as the expected cost of misclassification plus the cost for covariate acquisition.

In Sect. 2, we rigorously formalized this goal as the minimization of the (conditional) Bayes risk, which can change after the acquisition of a new covariate. However, solving this minimization problem is hard. First, the evaluation of the conditional Bayes risk requires estimating and integrating over a high-dimensional density. Second, the Bayes risk must be evaluated for all combinations of covariate sets, whose number is exponential in p, the number of covariates.

In order to overcome the computational difficulties, we introduced two working assumptions:

  1. The optimal classifier can be expressed as a generalized additive model (GAM).

  2. The optimal sets of covariates can be expressed as a sequence of monotone increasing sets, namely \(S_1 \subset S_2 \ldots \subset S_q\).

Using the first assumption, we showed, in Sect. 4, that the evaluation of the conditional Bayes risk reduces to a one dimensional density estimation and integration problem which can be efficiently estimated.

Furthermore, we showed that the sequence \(S_1 \subset S_2 \ldots \subset S_q\) can be acquired in a computationally efficient manner by inspecting the regression coefficient path when penalizing the GAM with group lasso.

Our experiments suggest that our proposed method AdaCOS achieves in most situations the lowest total costs of classification, when compared to the previous methods POMDP, DWSM, GreedyMiser, AdaptGbrt, and Shim2018 (Ji & Carin, 2007; Dulac-Arnold et al., 2012; Xu et al., 2012; Nan & Saligrama, 2017; Shim et al., 2018).

We note that some previous methods like Shim2018 (Shim et al., 2018) do not share our working assumptions, and instead use a very flexible classifier (a deep neural network) and a covariate acquisition strategy based on reinforcement learning. However, for small (\(n < 1000\)) datasets, and even for medium-sized (\(n > 10000\)) datasets like PhysioNet, we found that a generalized additive model is competitive with or even better than a neural network classifier, and the flexible reinforcement-learning strategy seems to suffer from high variance.

Finally, we considered the situation where not all misclassification costs are specified by the user. In particular, we considered the situation where the user specifies a target recall instead of the cost of false negative classification. We showed that it is possible to apply the proposed method by estimating the implicitly defined false negative cost. Our experiments showed that the resulting method indeed achieves the desired minimum recall, while minimizing the false discovery rate and covariate acquisition cost.

The source code of the proposed method and for reproducing all results is available at https://github.com/andrade-stats/AdaCOS_public.