1 Introduction

In some applications, acquiring covariates comes at a non-negligible cost. For example, in the medical domain, in order to classify whether a patient has diabetes, measuring glucose tolerance can be expensive. On the other hand, glucose tolerance is a good indicator of diabetes, i.e. it increases our chance of correctly predicting diabetes (or its absence).

The example illustrates that in the medical domain we often have to strike a balance between classification accuracy and the cost of acquiring the covariates. A rational criterion for deciding on the best trade-off is to minimize the expected total cost of classification: the expected cost of misclassification plus the cost of the acquired covariates.

In the first part of this article, we formalize the optimization of the expected total cost of classification using the (conditional) Bayes risk and describe the optimal solution using a recursive procedure. However, it turns out that the procedure is computationally infeasible for essentially two reasons: (1) calculating the Bayes risk requires estimating a high-dimensional integral, (2) the number of different covariate acquisition paths is exponential in the number of covariates.

As a consequence, we introduce two assumptions: (1) the optimal classifier can be represented by a generalized additive model (GAM), (2) the optimal sets of covariates are limited to a sequence of sets of increasing size. We show that under these two assumptions, a computationally efficient solution exists.

Our framework requires that the user can specify the cost of misclassification: the false positive cost (cost of wrongly classifying a healthy person as having diabetes), and false negative cost (cost of classifying a diabetes patient as healthy). However, we show that the requirement on the user to specify the false negative cost can be replaced by the specification of a lower bound on the recall. This is motivated by the medical domain where it is more common to specify the minimally acceptable recall (target recall), rather than specifying the false negative cost.

Our main contributions are as follows:

  1.

    We describe the optimal solution for minimizing the expected total cost of classification, which has not been clarified in previous works like Dulac-Arnold et al. (2012) and Shim et al. (2018).

  2.

    We prove that for a GAM, the estimation of the (conditional) Bayes risk reduces to a one-dimensional density estimation and integral, which can be computed efficiently.

  3.

    We propose an effective heuristic to estimate an optimal monotone increasing sequence of covariate sets by learning the regression coefficients of GAM with a group lasso penalty.

  4.

    We prove that our framework can be used to guarantee a user-specified recall level, like 95% which is common in the medical domain.

  5.

    We show on four medical datasets that the proposed method can lead to lower total cost of classification than the previous works in Ji and Carin (2007), Dulac-Arnold et al. (2012), Xu et al. (2012), Nan and Saligrama (2017) and Shim et al. (2018). Furthermore, evaluation under the requirement of a target recall shows that the proposed method achieves the target recall while minimizing the remaining costs and false discovery rate (FDR) better than previous works.

This article extends our preliminary work in Andrade and Okajima (2019) as follows:

  • Replacing the linear classifier with GAM to allow for non-linear relations between the covariates and the response (see Sect. 4.1).

  • Allowing for the specification of a minimal recall (see Sect. 6).

  • Determining the sequence of covariate sets by the solution path of a group lasso penalized convex optimization problem (see Sect. 5.2).

  • Additional experimental evaluations on two more medical datasets, and additional analysis in terms of false positive costs, recall, weighted average, and runtime (see Sect. 7). Furthermore, instead of symmetric misclassification costs, which were used in Andrade and Okajima (2019), our evaluation here focuses on asymmetric misclassification costs that are more realistic in practice.

In the next section, we formalize the optimal decision procedure to achieve, in expectation, the lowest total cost of classification. In Sect. 3, we introduce a (non-adaptive) method that minimizes an upper bound of the lowest achievable total cost, which we extend in Sect. 4 to an adaptive method. In Sect. 5, we explain two approximations for finding a sequence of monotone increasing covariate sets that is used by the proposed method. In Sect. 6, we show how the proposed framework can also be used for guaranteeing a target recall. Extensive empirical evaluations of our proposed method and previous methods are provided in Sect. 7. In Sect. 8, we give a concise review of related work. Finally, in Sect. 9, we summarize our findings.

2 A cost-rational selection criterion

In this section, we clarify and formally describe the optimal decision for acquiring covariates and for classification. For that purpose, we appropriately extend the definition of a Bayes procedure, which is known to be optimal for classification; see e.g. Anderson (2003), Berger (2013).

Let \(L := \{l_1, \ldots , l_c\}\) denote the set of class labels, and \(c_{y, y^*}\) the cost of classifying a sample as class \(y^*\), when the true label is y. A decision procedure \(\delta ^*: {\mathbb {R}}^p \rightarrow L\) for which

$$\begin{aligned} \forall \delta : \, {{{{\mathbb {E}}}}}_{{\mathbf {x}},y} [c_{y,\delta ({\mathbf {x}})}] \ge {\mathbb{E}}_{{\mathbf {x}}, y} [c_{y,\delta ^*({\mathbf {x}})}] \, \end{aligned}$$

is called a Bayes procedure. The following procedure \(\delta ^*\) is a Bayes procedure [for a proof see, for example, Theorem 6.7.1 in Anderson (2003)]:

$$\begin{aligned} \begin{aligned} \delta ^*({\mathbf {x}})&= \mathop {\hbox {arg min}}\limits _{y^* \in L} \sum _{y \in L} p(y | {\mathbf {x}}) \cdot c_{y, y^*} \\&= \mathop {\hbox {arg min}}\limits _{y^* \in L} {\mathbb{E}}_y [c_{y, y^*} ] . \end{aligned} \end{aligned}$$
(1)

The expected misclassification cost of the Bayes procedure, i.e. \({\mathbb{E}}_{{\mathbf {x}}, y} [c_{y,\delta ^*({\mathbf {x}})}]\), is called the Bayes risk.
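As a small illustration of the Bayes procedure in Eq. (1), the following sketch (our own toy example; the cost matrix and names are hypothetical, not from the paper) computes the decision that minimizes the posterior expected misclassification cost:

```python
import numpy as np

# cost[y, y_star] = c_{y, y*}: rows index the true label, columns the decision.
cost = np.array([[0.0, 1.0],    # true y = 0: correct costs 0, false positive costs 1
                 [5.0, 0.0]])   # true y = 1: false negative costs 5, correct costs 0

def bayes_decision(posterior, cost):
    """posterior[y] = p(y | x); returns argmin_{y*} sum_y p(y|x) c_{y, y*} as in Eq. (1)."""
    expected = posterior @ cost          # expected[y*] = E_y[c_{y, y*} | x]
    return int(np.argmin(expected))

# With p(y=1|x) = 0.3: predicting 1 costs 0.7 * 1 = 0.7, predicting 0 costs
# 0.3 * 5 = 1.5, so the asymmetric costs make the Bayes decision 1.
print(bayes_decision(np.array([0.7, 0.3]), cost))
```

Note how the large false-negative cost shifts the decision boundary away from the naive 0.5 threshold.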

Let us denote by \(V := \{1, \ldots , p\}\) the index set of covariates with \(V \cap L = \emptyset\). We denote the Bayes procedure for classifying a sample based only on the covariates \(S \subseteq V\) by \(\delta ^*_{S} : {\mathbb {R}}^{|S|} \rightarrow L\). That means

$$\begin{aligned} \delta ^*_S({\mathbf {x}}_S) = \mathop {\hbox {arg min}}\limits _{y^* \in L} \sum _{y \in L} p(y | {\mathbf {x}}_S) \cdot c_{y, y^*} . \end{aligned}$$
(2)

When it is clear from the context, we drop the index on \(\delta ^*_S\), and just write \(\delta ^*({\mathbf {x}}_S)\) instead of \(\delta ^*_S({\mathbf {x}}_S)\).

2.1 Optimal procedure

The classical definition of Bayes procedure does not consider the cost of covariate acquisition, and assumes that all covariates are acquired at once. Therefore, let us first formally extend the definition appropriately.

We use the following definition of a decision procedure.

Definition 1

A function of the form

$$\begin{aligned} \pi : {\mathbb {R}}^p \times 2^V \rightarrow L \cup V, \end{aligned}$$

which fulfills, \(\forall {\mathbf {x}} \in {\mathbb {R}}^p, S \subseteq V\):

$$\begin{aligned}&\pi ({\mathbf {x}}, S) = \pi ({\mathbf {x}} \odot {\mathbf {1}}_S, S) , \end{aligned}$$
(3)
$$\begin{aligned}&\pi ({\mathbf {x}}, S) \in L \cup (V \setminus S) , \end{aligned}$$
(4)

is called a decision procedure.

The condition in Eq. (3) means that a decision procedure uses only the covariates that are indexed by S; the condition in Eq. (4) means that a decision procedure cannot select a covariate that is already in S. In summary, the decision procedure \(\pi ({\mathbf {x}}, S)\) either classifies the current sample, or selects a new covariate based on the observations \({\mathbf {x}}_S\). To simplify the notation, we write \(\pi ({\mathbf {x}}_S)\) instead of \(\pi ({\mathbf {x}}, S)\). Furthermore, we denote the cost of acquiring covariate i by \(c_i\).

Given a sample \({\mathbf {x}}\) with class label y, we denote the loss of a decision procedure \(\pi\) as \(l( ({\mathbf {x}}, y), \pi )\). The loss can be computed recursively as follows. Let \(l( ({\mathbf {x}}, y), \pi ) := l( ({\mathbf {x}}, y), \pi , \emptyset )\), with

$$\begin{aligned}&l( ({\mathbf {x}}, y), \pi , S) =&{\left\{ \begin{array}{ll} c_{y, \pi ({\mathbf {x}}_S)} &\quad{\text {if}}\, \pi ({\mathbf {x}}_S) \in L , \\ c_{\pi ({\mathbf {x}}_S)} + l( ({\mathbf {x}}, y), \pi , S \cup \{ \pi ({\mathbf {x}}_S) \} ) &\quad{\text{else}}. \end{array}\right. } \end{aligned}$$
(5)

If not stated otherwise, we assume that all costs are non-negative, i.e. \(c_i \ge 0\), and \(c_{y,y^{\prime}} \ge 0\).
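The recursion in Eq. (5) can be sketched directly. The toy decision procedure and cost values below are our own illustration (any real procedure would come from the optimization that follows):

```python
# Recursive loss of Eq. (5): a decision procedure either classifies (pay the
# misclassification cost) or acquires a covariate (pay its cost and recurse).
def loss(x, y, pi, cov_cost, misclf_cost, S=frozenset()):
    a = pi(x, S)                          # either ('label', y*) or ('covariate', i)
    if a[0] == 'label':
        return misclf_cost[y][a[1]]       # c_{y, pi(x_S)}
    i = a[1]
    return cov_cost[i] + loss(x, y, pi, cov_cost, misclf_cost, S | {i})

misclf_cost = {0: {0: 0.0, 1: 1.0}, 1: {0: 5.0, 1: 0.0}}
cov_cost = {0: 0.2, 1: 0.7}

def toy_policy(x, S):
    if 0 not in S:
        return ('covariate', 0)           # always acquire x_0 first
    return ('label', 1 if x[0] > 0 else 0)  # then classify on x_0 alone

# Sample with x_0 = 1.3 and true label 1: cost = c_0 + c_{1,1} = 0.2.
print(loss((1.3, -0.4), 1, toy_policy, cov_cost, misclf_cost))
```

The recursion terminates because condition (4) forbids re-selecting an already acquired covariate.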

Theorem 1

Let \(\pi ^*\) be a decision procedure such that for all \(S \subseteq V\), we have

$$\begin{aligned} \begin{aligned} \pi ^*({\mathbf {x}}_S) = \mathop {\hbox {arg min}}\limits _{i \in L \cup (V \setminus S)} {\left\{ \begin{array}{ll} {\mathbb{E}}_{y} [c_{y,i} | {\mathbf {x}}_S ] &\quad{\text {if}}\,i \in L , \\ c_i + {\mathbb{E}}_{{\mathbf {x}}_{V \setminus S}, y} \big [l(({\mathbf {x}}, y), \pi ^*, S \cup \{i\}) | {\mathbf {x}}_{S} \big ] &\quad {\text{else}}. \end{array}\right. } \\ \end{aligned} \end{aligned}$$
(6)

Then \(\pi ^*\) is a Bayes procedure. That means for any other decision procedure \(\pi\) we have

$$\begin{aligned} {\mathbb{E}}_{{\mathbf {x}}, y} [l( ({\mathbf {x}}, y), \pi ^*) ] \le {\mathbb{E}}_{{\mathbf {x}}, y} [l( ({\mathbf {x}}, y), \pi ) ] . \end{aligned}$$

The proof is given in the “Appendix”. We note that, if the covariates are discrete, we can formulate the problem as a stationary Markov decision process (MDP) where every policy leads to a terminal state (Zubek et al., 2004; Bayer-Zubek 2004). The Bayes procedure from Theorem 1 is then equivalent to the optimal policy defined by the Bellman updates with the discounting factor set to 1 (Russell & Norvig, 2003).

For continuous covariates, implementing the exact decision procedure \(\pi ^*\) is, in general, intractable. The reason is that in order to recursively evaluate the loss, we need to evaluate a sequence of alternating minimizations and expectations. To illustrate this, consider the following example with only two covariates, \(V = \{1, 2\}\); we then have

$$\begin{aligned} {\mathbb{E}}_{{\mathbf {x}}, y} [ l( ({\mathbf {x}}, y), \pi ^*) ]&= \min \Big \{ c_1 + {\mathbb{E}}_{{\mathbf {x}}_1} \big [ \min \big \{c_2 + {\mathbb{E}}_{{\mathbf {x}}_2} [C({\mathbf {x}}_{\{1,2\}}) | {\mathbf {x}}_1], C({\mathbf {x}}_1) \big \} \big ], \nonumber \\&\quad \quad c_2 + {\mathbb{E}}_{{\mathbf {x}}_2} \big [ \min \big \{ c_1 + {\mathbb{E}}_{{\mathbf {x}}_1} [ C({\mathbf {x}}_{\{1,2\}}) | {\mathbf {x}}_2], C({\mathbf {x}}_2) \big \} \big ], \nonumber \\&\quad \quad C({\mathbf {x}}_{\emptyset }) \Big \} , \end{aligned}$$
(7)

where the first line corresponds to first acquiring covariate 1 and then either classifying or acquiring covariate 2, the second line corresponds to first acquiring covariate 2, and the last line corresponds to not acquiring any covariate. Note that, to simplify notation in the above example, we defined \(C({\mathbf {x}}_S)\) to be the expected cost of (mis-)classification given \({\mathbf {x}}_S\), i.e.

$$\begin{aligned} C({\mathbf {x}}_S) := \min _{i \in L} {\mathbb{E}}_{y} [c_{y,i} | {\mathbf {x}}_S ] . \end{aligned}$$

Therefore, we propose two relaxations and corresponding methods named Cost-sensitive Covariate Selection (COS) and Adaptive Cost-sensitive Forward Selection (AdaCOS) in Sects. 3 and 4, respectively.

3 Cost-sensitive covariate selection (COS)

Our first relaxation is to pull out all minimizations in the recursion of Eq. (6); see also the example in Eq. (7), which explicitly writes out all minimizations. This leads to the following upper bound:

$$\begin{aligned} {\mathbb{E}}_{{\mathbf {x}}, y} [ l( ({\mathbf {x}}, y), \pi ^*) ] \le \min _{S \subseteq V} \Big ( {\mathbb{E}}_{{\mathbf {x}}_S, y}[c_{y,\delta ^*({\mathbf {x}}_S)}] + \sum _{i \in S} c_i \Big ) . \end{aligned}$$

In the following, we denote this upper bound by \({\mathcal {U}}\). However, directly trying to minimize \({\mathcal {U}}\) is still computationally difficult due to the exponential number of possible sets S. Note that this is similar to covariate selection in logistic regression, as e.g. in Tibshirani (1996), O’Hara et al. (2009). Two important differences are that, in general, the costs associated with the covariates can differ from each other, and that the Bayes risk needs to be evaluated for all possible subsets \(S \subseteq V\). The situation is illustrated in Fig. 1.

We denote the method selecting the set \(S^*\) that minimizes \({\mathcal {U}}\) by Cost-sensitive Covariate Selection (COS). In order to approximately find the set \(S^*\), we will use the methods described in Sect. 5. One disadvantage of COS is that it always selects the same set of covariates \(S^*\) for any sample, though for some samples fewer/more covariates might be sufficient/necessary for good classification accuracy.
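For small p, COS can be sketched as brute-force enumeration of all subsets. The risk table below is synthetic, standing in for an estimate of the Bayes risk from data:

```python
from itertools import combinations

# Pick the subset S minimizing U(S) = (estimated) Bayes risk + covariate costs.
cov_cost = {1: 0.1, 2: 0.4, 3: 0.3}
risk = {frozenset(): 0.50, frozenset({1}): 0.35, frozenset({2}): 0.20,
        frozenset({3}): 0.30, frozenset({1, 2}): 0.15, frozenset({1, 3}): 0.25,
        frozenset({2, 3}): 0.18, frozenset({1, 2, 3}): 0.12}

def cos_select(V, risk, cov_cost):
    best_S, best_U = None, float('inf')
    for r in range(len(V) + 1):
        for S in combinations(sorted(V), r):
            S = frozenset(S)
            U = risk[S] + sum(cov_cost[i] for i in S)   # upper bound U for this S
            if U < best_U:
                best_S, best_U = S, U
    return best_S, best_U

S_star, U_star = cos_select({1, 2, 3}, risk, cov_cost)
```

Here covariate 1 alone wins: cheaper covariates with moderate risk reduction can beat more informative but expensive ones. The \(2^p\) enumeration is exactly why Sect. 5 resorts to approximations.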

Fig. 1

All possible decisions for an example with \(V = \{1, 2\}\). Each edge represents one decision: either asking for the value of a covariate, or classifying the sample based on the covariates observed so far using the Bayes procedure. Each leaf shows the upper bound \({\mathcal {U}}\) of the optimal expected total cost of classification when using the covariates selected on the path from the root to the leaf

4 Adaptive cost-sensitive forward selection

The method COS proposed in the previous section is not adaptive, i.e. it does not take into account the actually observed covariates in order to decide whether to proceed with the acquisition of additional covariates, or whether to classify based on the covariates observed so far. However, as discussed in Sect. 2.1, without additional assumptions, estimating the optimal procedure \(\pi ^*\) from Theorem 1 is computationally infeasible. We therefore introduce two assumptions:

  1.

    The optimal set S of acquired covariates belongs to \({\mathfrak {S}} = \{S_1, S_2, \ldots , S_q\}\), where \(S_1 \subset S_2 \subset \ldots \subset S_q \subseteq V\). An approximation method for finding the optimal sequence of subsets \({\mathfrak {S}}\) will be described in Sect. 5.

  2.

    The conditional class probability \(p(y | {\mathbf {x}}_{S})\), for \(S \in {\mathfrak {S}}\), belongs to a logistic generalized additive model.

Before we proceed, let us introduce our definition of future costs. Let \(A \subseteq V\) and \(S \subseteq V \setminus A\), then we define

$$\begin{aligned} F_{{\mathbf {x}}_{A}}(S) := \overbrace{{\mathbb{E}}_{{\mathbf {x}}_{S}, y} \big [ c_{y,\delta ^*({\mathbf {x}}_{A \cup S})} | {\mathbf {x}}_{A} \big ]}^{\text {(conditional) Bayes risk}} + \overbrace{\sum _{i \in S} c_i}^{\text {covariate costs}} . \end{aligned}$$
(8)

\(F_{{\mathbf {x}}_{A}}(S)\) is the expected total additional cost of classification when we have already acquired the covariates A, and are planning to acquire additionally the covariates S before classifying. In particular, the upper bound \({\mathcal {U}}\) can be expressed as \(\min _{S \subseteq V} F_{{\mathbf {x}}_{\emptyset }}(S)\).

Our approximation of the Bayes procedure \(\pi ^*\) from Theorem 1 is given in Algorithm 1. First, we acquire all covariates indexed by \(S_1\), and then check whether acquiring any additional covariates from \(S_2 \setminus S_1, \ldots , S_q \setminus S_1\) reduces the total cost of classification in expectation. If that is the case, we acquire the covariates in \(S_2 \setminus S_1\), and proceed analogously. If the total cost of classification is not expected to decrease with more covariates, we stop and classify based on the covariates acquired so far. An example of the procedure, where \(V = \{1, 2\}\), is shown in Fig. 2. Note that the example in Fig. 2 corresponds to the rightmost path of the tree in Fig. 1. Another example, with \(V = \{1, 2, 3, 4\}\), is shown in Fig. 3.

Algorithm 1

The algorithm is adaptive in the sense that the expected future costs \(F_{{\mathbf {x}}_{A}}(S)\) depend on the covariates \({\mathbf {x}}_{A}\) observed so far. Therefore, we see that the effectiveness of the algorithm hinges on the non-trivial task of calculating \(F_{{\mathbf {x}}_{A}}(S)\).
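Our reading of the acquisition loop can be sketched as follows. The functions `current_cost` and `future_cost` are placeholders for \(C({\mathbf {x}}_A)\) and \(F_{{\mathbf {x}}_{A}}(S)\) from Eq. (8); the toy cost values are made up:

```python
# AdaCOS-style loop: starting from S_1, keep acquiring the next block of
# covariates only while some future set is expected to lower the total cost.
def adacos(x, seq, current_cost, future_cost):
    acquired = set(seq[0])                     # acquire S_1 up front
    j = 0
    while j + 1 < len(seq):
        extras = [set(S) - acquired for S in seq[j + 1:]]
        best = min(future_cost(acquired, S, x) for S in extras)
        if best >= current_cost(acquired, x):  # no expected saving: stop and classify
            break
        acquired |= set(seq[j + 1])            # acquire S_{j+1} \ S_j
        j += 1
    return acquired                            # classify on these covariates

# Synthetic example: risk falls slowly with more covariates, each costs 0.3,
# so after S_1 further acquisition is not worth it and the loop stops.
risk = {0: 0.9, 1: 0.5, 2: 0.45, 3: 0.44}
cur = lambda A, x: risk[len(A)]
fut = lambda A, S, x: risk[len(A) + len(S)] + 0.3 * len(S)
print(adacos(None, [{1}, {1, 2}, {1, 2, 3}], cur, fut))
```

The adaptivity enters through `x`: in the real method the conditional Bayes risk inside \(F_{{\mathbf {x}}_{A}}(S)\) depends on the observed values \({\mathbf {x}}_{A}\), not just on \(|A|\) as in this toy.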

Finally, we note that in contrast to AdaCOS, the method COS does not use Algorithm 1, but selects only one fixed set of covariates \({\hat{S}} := \mathop {\hbox {arg min}}\limits _{S \subseteq V} F_{{\mathbf {x}}_{\emptyset }}(S)\), which is used for all samples at test time.

Fig. 2

Shows an example where \(S_1 = \{2\}\), and \(S_2 = \{2, 1\}\). Assume we have not acquired any covariates yet. In order to decide whether to start covariate acquisition, AdaCOS compares the current expected cost of misclassification marked in green with the expected future costs when acquiring covariates \(\{x_2\}\) or \(\{x_2, x_1\}\), marked in blue and orange, respectively

Fig. 3

Shows an example where \(S_1 = \{3\}\), \(S_2 = \{3, 1\}\), \(S_3 = \{3, 1, 2\}\), and \(S_4 = \{3, 1, 2, 4\}\). Assume we have already acquired the covariates \(S_2\). In order to decide whether to proceed with acquisition, AdaCOS compares the current expected cost of misclassification marked in green with the expected future costs when additionally acquiring covariates \(\{x_2\}\) or \(\{x_2, x_4\}\), marked in blue and orange, respectively

4.1 Bayes risk estimation

The main challenge in evaluating the future costs is to estimate the multi-dimensional integral in \({\mathbb{E}}_{{\mathbf {x}}_{S}, y} \big [ c_{y,\delta ^*({\mathbf {x}}_{A \cup S})} | {\mathbf {x}}_{A} \big ]\). By assuming that the conditional class probability \(p(y | {\mathbf {x}}_{A \cup S})\) can be modeled by a logistic generalized additive model, we will show that it is possible to reduce the multi-dimensional integral to a one-dimensional one via an effective approximation.

The logistic generalized additive model is defined as follows. Given the regression coefficients \(\varvec{\beta } = (\varvec{\beta }_1, \varvec{\beta }_2, \ldots , \varvec{\beta }_p) \in {\mathbb {R}}^{s \cdot p}\), and intercept \(\tau\), the conditional class probability \(p(y = 1 | {\mathbf {x}}, \varvec{\beta }, \tau )\) is modeled as

$$\begin{aligned} p(y = 1 | {\mathbf {x}}_{A \cup S}, \varvec{\beta }, \tau ) = g\Big (\tau + \sum _{i \in A \cup S} \varvec{\beta }_i^T f_i(x_i)\Big ) , \end{aligned}$$

where g denotes the sigmoid function, and the non-linear transformations \(f_i: {\mathbb {R}} \rightarrow {\mathbb {R}}^s\) are learned from the training data using penalized B splines, where s is the number of splines (Hastie et al., 2009).
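The model above can be sketched as follows. In the paper the transformations \(f_i\) are penalized B-splines learned from data; here we substitute a fixed cubic feature map (\(s = 3\)) with made-up coefficients, purely for illustration:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def f(x_i):
    """Stand-in for the learned basis expansion f_i: R -> R^s (here s = 3)."""
    return np.array([x_i, x_i ** 2, x_i ** 3])

def gam_prob(x, beta, tau):
    """p(y=1 | x) = g(tau + sum_i beta_i^T f_i(x_i)), the logistic GAM."""
    z = tau + sum(beta[i] @ f(x[i]) for i in beta)
    return sigmoid(z)

beta = {1: np.array([1.0, 0.0, -0.1]), 2: np.array([0.5, 0.2, 0.0])}
p1 = gam_prob({1: 1.0, 2: 0.0}, beta, tau=-0.4)   # g(-0.4 + 0.9 + 0.0) = g(0.5)
```

The additive structure over individual covariates is what later collapses the Bayes risk integral to one dimension: all unobserved covariates enter only through the scalar \(z = \sum_{i \in S} \varvec{\beta }_i^T f_i(x_i)\).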

Furthermore, we assume that there are only two class labels \(\{0, 1\}\), and that the costs of misclassification are larger than the cost of correct classification. We then have

$$\begin{aligned}&{\mathbb{E}}_{{\mathbf {x}}_{S}, y} \big [ c_{y,\delta ^*({\mathbf {x}}_{A \cup S})} | {\mathbf {x}}_{A} \big ] \\&\quad = {\mathbb{E}}_{{\mathbf {x}}_{S}} \big [ \sum _y c_{y,\delta ^*({\mathbf {x}}_{A \cup S})} p(y | {\mathbf {x}}_{A \cup S}) | {\mathbf {x}}_{A} \big ] \\&\quad = {\mathbb{E}}_{{\mathbf {x}}_{S}} \big [ c_{0,\delta ^*({\mathbf {x}}_{A \cup S})} p(y=0 | {\mathbf {x}}_{A \cup S}) | {\mathbf {x}}_{A} \big ] \\&\qquad + {\mathbb{E}}_{{\mathbf {x}}_{S}} \big [ c_{1,\delta ^*({\mathbf {x}}_{A \cup S})} p(y=1 | {\mathbf {x}}_{A \cup S}) | {\mathbf {x}}_{A} \big ] . \end{aligned}$$

Since

$$\begin{aligned} \delta ^*({\mathbf {x}}_{A \cup S})&= \mathop {\hbox {arg min}}\limits [ p(y=1 | {\mathbf {x}}_{A \cup S}) c_{1, 0} + p(y=0 | {\mathbf {x}}_{A \cup S}) c_{0, 0}, \\&\, p(y=0 | {\mathbf {x}}_{A \cup S}) c_{0, 1} + p(y=1 | {\mathbf {x}}_{A \cup S}) c_{1, 1}] , \end{aligned}$$

we have, where we defined \(p := p(y=1 | {\mathbf {x}}_{A \cup S})\),

$$\begin{aligned} \delta ^*({\mathbf {x}}_{A \cup S}) = 1&\Leftrightarrow p c_{1, 0} + (1-p) c_{0, 0} \ge (1-p) c_{0, 1} + p c_{1, 1} \\&\Leftrightarrow p (c_{1, 0} - c_{1, 1}) \ge (1-p) (c_{0, 1} - c_{0, 0}) \\&\Leftrightarrow \frac{p (c_{1, 0} - c_{1, 1})}{(1-p) (c_{0, 1} - c_{0, 0})} \ge 1 \\&\Leftrightarrow \frac{g(\sum _{i \in A \cup S} \varvec{\beta }_i^T f_i(x_i) + \tau ) \cdot (c_{1, 0} - c_{1, 1}) }{(1 - g(\sum _{i \in A \cup S} \varvec{\beta }_i^T f_i(x_i) + \tau )) \cdot (c_{0, 1} - c_{0, 0}) } \ge 1 \\&\Leftrightarrow e^{\sum _{i \in A \cup S} \varvec{\beta }_i^T f_i(x_i) + \tau } \ge \frac{c_{0, 1} - c_{0, 0}}{c_{1, 0} - c_{1, 1}} \\&\Leftrightarrow \sum _{i \in A \cup S} \varvec{\beta }_i^T f_i(x_i) \ge \log (\frac{c_{0, 1} - c_{0, 0}}{c_{1, 0} - c_{1, 1}}) - \tau \\&\Leftrightarrow \sum _{i \in S} \varvec{\beta }_i^T f_i(x_i) \ge \log (\frac{c_{0, 1} - c_{0, 0}}{c_{1, 0} - c_{1, 1}}) - \tau - \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) \\&\Leftrightarrow z \ge z^* , \end{aligned}$$

where we defined \(z := \sum _{i \in S} \varvec{\beta }_i^T f_i(x_i)\), and \(z^* := \log (\frac{c_{0, 1} - c_{0, 0}}{c_{1, 0} - c_{1, 1}}) - \tau - \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i)\). We see that \(\delta ^*({\mathbf {x}}_{A \cup S})\) depends only on z (random variable) and \(z^*\) (fixed). In the following, to simplify notation, let us denote by h(z) the conditional distribution \(p(z | {\mathbf {x}}_{A})\). We thus have
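The threshold \(z^*\) can be computed directly. The cost values below are made up for illustration:

```python
import math

# With p = g(z + offset), the Bayes decision is 1 exactly when z >= z*.
def z_star(c00, c01, c10, c11, tau, observed):
    """observed = sum_{i in A} beta_i^T f_i(x_i), the already-acquired part."""
    return math.log((c01 - c00) / (c10 - c11)) - tau - observed

def decide(z, zs):
    return 1 if z >= zs else 0

zs = z_star(c00=0.0, c01=1.0, c10=5.0, c11=0.0, tau=0.2, observed=0.5)
# Costly false negatives (c_{1,0} = 5) push the threshold down to about -2.31,
# so the classifier predicts 1 even for fairly negative scores z.
```

Note that \(z^*\) depends on the already observed covariates only through the fixed offset \(\sum_{i \in A} \varvec{\beta }_i^T f_i(x_i)\), so it is a constant once \({\mathbf {x}}_A\) is known.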

$$\begin{aligned}&{\mathbb{E}}_{{\mathbf {x}}_{S}} \big [ c_{1,\delta ^*({\mathbf {x}}_{A \cup S})} p(y=1 | {\mathbf {x}}_{A \cup S}) | {\mathbf {x}}_{A} \big ] \\&\quad = {\mathbb{E}}_{{\mathbf {x}}_{S}} \big [ c_{1,\delta ^*({\mathbf {x}}_{A \cup S})} g(\sum _{i \in S} \varvec{\beta }_i^T f_i(x_i) + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau ) | {\mathbf {x}}_{A} \big ] \\&\quad = {\mathbb{E}}_z \big [ c_{1,\delta ^*(z, z^*)} g(z + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau ) | {\mathbf {x}}_{A} \big ] \\&\quad = \int c_{1,\delta ^*(z, z^*)} g(z + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau ) h(z) dz \\&\quad = c_{1,0} \int _{- \infty }^{z^*} g(z + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau ) h(z) dz \\&\qquad + c_{1,1} \int _{z^*}^{\infty } g(z + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau ) h(z) dz . \end{aligned}$$

Analogously, we have

$$\begin{aligned}&{\mathbb{E}}_{{\mathbf {x}}_{S}} \big [ c_{0,\delta ^*({\mathbf {x}}_{A \cup S})} p(y=0 | {\mathbf {x}}_{A \cup S}) | {\mathbf {x}}_{A} \big ] \\&\quad = {\mathbb{E}}_{{\mathbf {x}}_{S}} \big [ c_{0,\delta ^*({\mathbf {x}}_{A \cup S})} (1- g(\sum _{i \in S} \varvec{\beta }_i^T f_i(x_i) + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau )) | {\mathbf {x}}_{A} \big ] \\&\quad = {\mathbb{E}}_z \big [ c_{0,\delta ^*(z, z^*)} (1- g(z + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau )) | {\mathbf {x}}_{A} \big ] \\&\quad = c_{0,0} \int _{- \infty }^{z^*} (1- g(z + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau )) h(z) dz \\&\qquad + c_{0,1} \int _{z^*}^{\infty } (1 - g(z + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau )) h(z) dz \\&\quad = { c_{0,0} \int _{-\infty }^{z^*} h(z) dz - c_{0,0} \int _{-\infty }^{z^*} g(z + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau ) h(z) dz }\\&\qquad + c_{0,1} \int _{z^*}^{\infty } h(z) dz - c_{0,1} \int _{z^*}^{\infty } g(z + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau ) h(z) dz . \end{aligned}$$

Thus the remaining task is to evaluate the following integral

$$\begin{aligned}&\int _{a^{\prime}}^{b^{\prime}} g(z + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau ) h(z) dz \nonumber \\&\quad = \int _{a^{\prime} + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau }^{b^{\prime} + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau } g(u) h(u - \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) - \tau ) du . \end{aligned}$$
(9)

We assume that \(h(z) = p(z | {\mathbf {x}}_{A})\) can be well approximated by a normal distribution with mean \(\mu _z\) and variance \(\sigma ^2\). We defer the explanation of how to estimate \(\mu _z\) and \(\sigma ^2\) to Sect. 4.1.1.

The integral in Eq. (9) has no analytic solution. One popular strategy is to approximate the sigmoid function g by the cumulative distribution function of the standard normal distribution \(\varPhi\), as in Gaussian process classification (Rasmussen & Williams, 2006). However, it turns out that this approximation is not applicable here, since \(a^{\prime}\) or \(b^{\prime}\) is a finite real number in our case. Instead, we use the fact that the sigmoid function can be well approximated by a small number of linear functions. In order to facilitate notation, let us introduce the following constants:

$$\begin{aligned} a&:= a^{\prime} + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau \, , \\ b&:= b^{\prime} + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau \, , \\ \mu&:= \mu _z + \sum _{i \in A} \varvec{\beta }_i^T f_i(x_i) + \tau . \end{aligned}$$

Then we can write the integral in Eq. (9) as

$$\begin{aligned} \int _{a}^{b} g(u) \frac{1}{\sqrt{2 \pi \sigma ^2}} e^{- \frac{1}{2\sigma ^2} (u - \mu )^2} du . \end{aligned}$$
(10)

Let us define the following piece-wise linear approximation of the sigmoid function:

$$\begin{aligned} g(u) \approx \sum _{t = 1}^{\xi + 2} \Big ( 1_{[b_{t-1}, b_t]}(u) \big ( m_t u + v_t \big ) \Big ) , \end{aligned}$$

where for \(1 \le t \le \xi + 1\), we set \(b_{t} := -10 + \frac{20}{\xi } (t - 1) ,\) and for \(1 \le t \le \xi\), we set

$$\begin{aligned} m_{t+1} := \frac{g(b_{t+1}) - g(b_t)}{b_{t+1} - b_t} , \qquad v_{t+1} := g(b_t) - m_{t+1} b_t , \end{aligned}$$

and

$$\begin{aligned}&b_0 := - \infty , \, m_1 := 0 , \, v_1 := g(b_1) , \\&b_{\xi + 2} := + \infty , \, m_{\xi + 2} := 0 , \, v_{\xi + 2} := g(b_{\xi + 1}) , \end{aligned}$$

and \(\xi\) is the number of linear pieces, which we set, for example, to 40. A comparison with the approximation \(\varPhi (\sqrt{\frac{\pi }{8}} u)\) is shown in Fig. 4. That means that with relatively few linear pieces, we can achieve an approximation that is more accurate than the \(\varPhi\)-approximation. More importantly, as we show below, this allows for a tractable calculation of the integral in Eq. (10), which is not the case when using the \(\varPhi\)-approximation. Then we have

$$\begin{aligned}&\int _{a}^{b} g(u) \frac{1}{\sqrt{2 \pi \sigma ^2}} e^{- \frac{1}{2\sigma ^2} (u - \mu )^2} du \\&\quad = \int _{a}^{b} \sum _{t = 1}^{\xi + 2} \Big ( 1_{[b_{t-1}, b_t]}(u) \big ( m_t u + v_t \big ) \Big ) \frac{1}{\sqrt{2 \pi \sigma ^2}} e^{- \frac{1}{2\sigma ^2} (u - \mu )^2} du \\&\quad = \sum _{t = 1}^{\xi + 2} m_t \int _{\max (a, b_{t-1})}^{\min (b, b_t)} u \frac{1}{\sqrt{2 \pi \sigma ^2}} e^{- \frac{1}{2\sigma ^2} (u - \mu )^2} du \\&\qquad + v_t \varPhi _{\max (a, b_{t-1})}^{\min (b, b_t)} , \end{aligned}$$

where we define \(\varPhi _l^o := \int _l^o \frac{1}{\sqrt{2 \pi \sigma ^2}} e^{- \frac{1}{2\sigma ^2} (u - \mu )^2} du\), which can be well approximated with standard numerical libraries. The remaining integral can also be expressed via \(\varPhi\): using the substitution \(r := u - \mu\), we have

$$\begin{aligned}&\int _{l}^{o} u \frac{1}{\sqrt{2 \pi \sigma ^2}} e^{- \frac{1}{2\sigma ^2} (u - \mu )^2} du \\&\quad = \int _{l - \mu }^{o - \mu } r \frac{1}{\sqrt{2 \pi \sigma ^2}} e^{- \frac{1}{2\sigma ^2} r^2} dr + \mu \int _{l - \mu }^{o - \mu } \frac{1}{\sqrt{2 \pi \sigma ^2}} e^{- \frac{1}{2\sigma ^2} r^2} dr \\&\quad = \frac{\sigma }{\sqrt{2 \pi }} \big ( e^{- \frac{1}{2\sigma ^2} (l - \mu )^2} - e^{- \frac{1}{2\sigma ^2} (o - \mu )^2} \big ) + \mu \varPhi _{l}^{o} . \end{aligned}$$
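Putting the pieces together, Eq. (10) can be evaluated as follows. This is our own sketch with \(\xi = 40\) pieces; all function names are ours:

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def Phi(l, o, mu, s2):
    """Integral of the N(mu, s2) density from l to o, via the error function."""
    cdf = lambda t: 0.5 * (1.0 + math.erf((t - mu) / math.sqrt(2.0 * s2)))
    return cdf(o) - cdf(l)

def first_moment(l, o, mu, s2):
    """Integral of u * N(u; mu, s2) from l to o (closed form derived above)."""
    dens = lambda t: math.exp(-(t - mu) ** 2 / (2.0 * s2))
    return math.sqrt(s2 / (2.0 * math.pi)) * (dens(l) - dens(o)) + mu * Phi(l, o, mu, s2)

def integral_eq10(a, b, mu, s2, xi=40):
    bk = [-10.0 + 20.0 / xi * t for t in range(xi + 1)]       # b_1 .. b_{xi+1}
    pieces = [(-math.inf, bk[0], 0.0, sigmoid(bk[0]))]        # flat left tail
    for t in range(xi):
        m = (sigmoid(bk[t + 1]) - sigmoid(bk[t])) / (bk[t + 1] - bk[t])
        pieces.append((bk[t], bk[t + 1], m, sigmoid(bk[t]) - m * bk[t]))
    pieces.append((bk[-1], math.inf, 0.0, sigmoid(bk[-1])))   # flat right tail
    total = 0.0
    for lo, hi, m, v in pieces:                               # clip each piece to [a, b]
        l, o = max(a, lo), min(b, hi)
        if l < o:
            total += m * first_moment(l, o, mu, s2) + v * Phi(l, o, mu, s2)
    return total

val = integral_eq10(-math.inf, 1.0, mu=0.3, s2=0.8)
```

Each of the \(\xi + 2\) terms needs only exponentials and error-function evaluations, so the whole integral costs \(O(\xi)\) elementary operations.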
Fig. 4

Comparison of sigmoid function approximations. For both the linear function approximation (proposed) and the discrete bin approximation (Ji and Carin 2007), we set \(\xi = 40\). For the normal CDF approximation we use \(\varPhi (\sqrt{\frac{\pi }{8}} u)\). Note that the proposed linear function approximation is almost indistinguishable from the sigmoid function. In contrast, the error of the normal approximation is large around − 3.0 and around 3.0. The discrete approximation is clearly the worst

4.1.1 Estimation of \(\mu _z\) and \(\sigma ^2\)

Recall that we assumed that \(p(z | {\mathbf {x}}_{A})\) can be well approximated by a normal density with mean \(\mu _z\) and variance \(\sigma ^2\). In order to estimate \(\mu _z\) and \(\sigma ^2\), we propose to model z given \({\mathbf {x}}_{A}\) as a regression problem with additive noise, where z is the response variable and \({\mathbf {x}}_{A}\) are the explanatory variables. In detail, for learning the regression model from the training data \(\{{\mathbf {x}}^{(k)} \}_{k = 1}^{n}\), we prepare a collection of response and explanatory variable pairs of the form \(\{ (z^{(k)}, {\mathbf {x}}_A^{(k)}) \}_{k = 1}^{n}\), where \(z^{(k)} = \sum _{i \in S} \varvec{\beta }_i^T f_i(x_i^{(k)})\). We note that for training the regression model, we could additionally exploit unlabeled data as follows. Once we have learned \(\varvec{\beta }_i\) and \(f_i\) from training data with class label information y, we can use all data, with and without class labels, to create the data \(\{ (z^{(k)}, {\mathbf {x}}_A^{(k)}) \}_{k = 1}^{n + n_u}\), where \(n_u\) is the number of unlabeled samples.

For our experiments, we use a standard Bayesian linear regression model with a scaled inverse \(\chi ^2\) distribution prior on the noise variance (Gelman et al., 2013).
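The estimation step can be sketched with ordinary least squares in place of the Bayesian linear regression used in the paper (the residual variance serves as the plug-in \(\sigma^2\)); the data below is synthetic:

```python
import numpy as np

# Regress z = sum_{i in S} beta_i^T f_i(x_i) on the observed covariates x_A
# to get the conditional mean mu_z and noise variance sigma^2.
rng = np.random.default_rng(0)
n = 500
xA = rng.normal(size=(n, 2))                       # observed covariates x_A
z = 1.5 * xA[:, 0] - 0.7 * xA[:, 1] + rng.normal(scale=0.5, size=n)

X = np.hstack([np.ones((n, 1)), xA])               # design matrix with intercept
w, *_ = np.linalg.lstsq(X, z, rcond=None)
resid = z - X @ w
sigma2 = resid @ resid / (n - X.shape[1])          # unbiased noise variance estimate

def predict_mu_z(x_new):
    """Predicted mean mu_z of z given newly observed values of x_A."""
    return w @ np.concatenate([[1.0], x_new])

mu_z = predict_mu_z(np.array([0.2, -1.0]))
```

The Bayesian version additionally yields posterior uncertainty over the noise variance, which the point estimate above discards.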

5 Cost-aware non-linear covariate selection

In the previous section, we assumed that the covariates are acquired in a specific sequence. In this section, we discuss two different approximation strategies for finding the optimal sequence of subsets \({\mathfrak {S}} = \{S_1, S_2, \ldots , S_q\}\), where \(S_1 \subset S_2 \subset \ldots \subset S_q \subseteq V\), such that the expected total cost of classification tends to be minimal for some set \(S \in {\mathfrak {S}}\).

5.1 Forward selection

We suggest setting \(q := p + 1\), and using greedy forward selection as outlined in Algorithm 2.

Algorithm 2

Note that from the definition in Eq. (8), we have

$$\begin{aligned} {\mathbb{E}}_{{\mathbf {x}}_{S}} [F_{{\mathbf {x}}_{S}} ( \{ j \} )] = {\mathbb{E}}_{{\mathbf {x}}_{S}} \big [ {\mathbb{E}}_{{x_j}, y} [ c_{y,\delta ^*({\mathbf {x}}_{S \cup \{j\}})} | {\mathbf {x}}_{S} ] \big ] + \sum _{i \in S} c_i . \end{aligned}$$

We estimate \({\mathbb{E}}_{{\mathbf {x}}_{S}} [F_{{\mathbf {x}}_{S}} ( \{ j \} )]\) using 10-fold cross-validation of the labeled training data \(\{ (y^{(k)}, {\mathbf {x}}^{(k)}) \}_{k = 1}^{n}\). In particular, for one fold \({\mathscr {A}}_f \subseteq \{1, \ldots , n\}\), we have

$$\begin{aligned}&{\mathbb{E}}_{{\mathbf {x}}_{S}} \big [ {\mathbb{E}}_{{x_j}, y} [ c_{y,\delta ^*({\mathbf {x}}_{S \cup \{j\}})} | {\mathbf {x}}_{S} ]\big ] \\&\quad = {\mathbb{E}}_{{\mathbf {x}}_{S \cup \{j\}}, y} [ c_{y,\delta ^*({\mathbf {x}}_{S \cup \{j\}})} ] \\&\quad \approx \frac{1}{| {\mathscr {A}}_f |} \sum _{k \in {\mathscr {A}}_f} c_{y^{(k)},\delta _f({\mathbf {x}}_{S \cup \{j\}}^{(k)})}, \end{aligned}$$

where the model for the conditional probability \(p(y | {\mathbf {x}}_{S \cup \{j\}})\) used by \(\delta _f\) is trained using the samples in \(\{1, \ldots , n\} \setminus {\mathscr {A}}_f\). The final estimate is obtained by averaging over all folds.
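The greedy loop itself is simple; the sketch below replaces the cross-validated estimate of \({\mathbb{E}}_{{\mathbf {x}}_{S}} [F_{{\mathbf {x}}_{S}} ( \{ j \} )]\) with a synthetic oracle `est_cost`, purely for illustration:

```python
# Greedy forward selection: starting from the empty set, repeatedly add the
# covariate whose estimated expected future cost is smallest, yielding the
# nested sequence S_1 ⊂ S_2 ⊂ ... ⊂ S_{p+1} = V (q = p + 1 sets).
def forward_selection(V, est_cost):
    seq = [frozenset()]
    S = frozenset()
    while S != frozenset(V):
        j = min(V - S, key=lambda j: est_cost(S | {j}))
        S = S | {j}
        seq.append(S)
    return seq

# Synthetic estimate: each covariate i lowers the misclassification risk by
# gain[i] but adds its acquisition cost; covariate 2 is cheap and informative.
cov_cost = {1: 0.3, 2: 0.1, 3: 0.5}
gain = {1: 0.15, 2: 0.30, 3: 0.10}
est_cost = lambda S: 0.9 - sum(gain[i] for i in S) + sum(cov_cost[i] for i in S)
print(forward_selection({1, 2, 3}, est_cost))
```

Note that the sequence is always built all the way up to V; deciding where to stop along the sequence is left to COS or AdaCOS.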

An advantage of the above forward-selection procedure is that it uses an unbiased estimate of \({\mathbb{E}}_{{\mathbf {x}}_{S}} [F_{{\mathbf {x}}_{S}} ( \{ j \} )]\); assuming the variance is not too large, we can expect to find a good local minimum.

However, there are several disadvantages. First, if the variance of the estimator is high, we might get stuck in a bad local minimum. Second, the forward-selection procedure is extremely computationally expensive and, as a consequence, infeasible if p is large. Finally, a more subtle disadvantage is that it requires the full specification of the misclassification costs, i.e. the specification of \(c_{0,1}\) and \(c_{1,0}\). As a consequence, it is not applicable when we are provided only with \(c_{0,1}\), which we will discuss in Sect. 6.

5.2 Group Lasso penalty

Some of the disadvantages of the forward-selection method can be overcome by jointly training the model for the conditional probability \(p(y | {\mathbf {x}})\) with a sparsity-enforcing penalty on the regression coefficients. Here, this is possible since we assume a generalized additive model for \(p(y | {\mathbf {x}})\).

In particular, we propose to acquire the sets \(S_1 \subset S_2 \subset S_3 \ldots S_q\) by using the search path of a penalized logistic loss function, subject to a post-hoc correction to ensure strict monotonicity. In detail, for different values of \(\lambda\), we solve the following convex optimization problem

$$\begin{aligned} \mathop {\hbox {minimize}}\limits _{\varvec{\beta }, \tau } - \sum _{k = 1}^n \log p(y = y^{(k)} | {\mathbf {x}}^{(k)}, \varvec{\beta }, \tau ) + \lambda \sum _{i = 1}^p c_i ||\varvec{\beta }_i||_2 , \end{aligned}$$
(11)

where \(\varvec{\beta } = (\varvec{\beta }_1, \varvec{\beta }_2, \ldots , \varvec{\beta }_p) \in {\mathbb {R}}^{s \cdot p}\). The group lasso penalty ensures that the regression coefficients \(\varvec{\beta }_i\) are either all zero or all non-zero (Hastie et al., 2015). Note that in Eq. (11) each group is scaled by \(c_i\), which ensures that the regression coefficients of covariates with high cost are penalized more.Footnote 9 As a consequence, in order to be included in the final model, covariates with high cost are required to lower the negative log-likelihood term more than covariates with low cost. By inspecting the search path for different values of \(\lambda _1> \lambda _2> \ldots > \lambda _q\), we acquire the corresponding sets \(S_1, S_2, \ldots , S_{q}\). For \(\lambda\), we choose 100 values evenly spaced on the log-scale between 0.0001 and 0.02, that is \(q = 100\). In more detail, after training with \(\lambda _j\), covariate i is included in \(S_j\) if and only if \(\varvec{\beta }_i \ne 0\). In general, after removing duplicate sets, this leads to \(S_1 \subset S_2 \subset S_3 \ldots S_{q^{\prime}} \subset V\), where \(q^{\prime} \le q\). Furthermore, if not already present, we add the empty set \(\emptyset\) and the full covariate set V. In the rare case where a set violates strict monotonicity, we remove it. For example, for the Breast Cancer data, group lasso finds the following sets:

$$\begin{aligned} S_1&= \{\} \\ S_2&= \{6\} \\ S_3&= \{2, 6\} \\ S_4&= \{2, 3, 6\} \\ S_5&= \{3, 6, 8\} \\ S_6&= \{2, 3, 6, 8\} \\ \ldots \end{aligned}$$

and therefore remove \(S_5\).
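The post-hoc correction described above can be sketched as follows; this is a minimal sketch with hypothetical names, where `path_supports` would be the supports extracted from the fitted group-lasso models along the search path:

```python
def nested_sets_from_path(path_supports, all_covariates):
    """Turn the group-lasso search path into a strictly nested sequence
    S_1 ⊂ S_2 ⊂ ...: deduplicate, add ∅ and the full set V, and drop any
    set that violates monotonicity w.r.t. the sets kept so far.
    `path_supports`: supports (sets of covariate indices) for
    lambda_1 > lambda_2 > ..., i.e. ordered from sparsest to densest."""
    kept = [set()]  # always start from the empty set
    for S in path_supports:
        S = set(S)
        if S == kept[-1]:
            continue          # duplicate of the previous set
        if kept[-1] < S:      # strict superset: keep it
            kept.append(S)
        # otherwise S violates monotonicity (a covariate dropped out): skip
    V = set(all_covariates)
    if kept[-1] != V:
        kept.append(V)        # add the full covariate set if missing
    return kept
```

On the Breast Cancer example above, this keeps \(\emptyset, \{6\}, \{2,6\}, \{2,3,6\}, \{2,3,6,8\}\) and drops \(S_5 = \{3,6,8\}\).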

The conditional class probability \(p(y = 1 | {\mathbf {x}}, \varvec{\beta }, \tau )\) is modeled as

$$\begin{aligned} p(y = 1 | {\mathbf {x}}, \varvec{\beta }, \tau ) = g(\tau + \sum _{i = 1}^{p} \varvec{\beta }_i^T f_i(x_i)) . \end{aligned}$$
(12)

The non-linear transformations \(f_i: {\mathbb {R}} \rightarrow {\mathbb {R}}^s\), where s is the number of splines, are learned from the training data using penalized B-splines (Hastie et al., 2009).
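As an illustration of Eq. (12), the following sketch evaluates the GAM link with a simple truncated-power spline basis standing in for the penalized B-splines; all names and the choice of basis are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def spline_basis(x, knots):
    """Simple truncated-power spline basis f_i: R -> R^s with
    s = 1 + len(knots); a stand-in for penalized B-splines."""
    feats = [x] + [np.maximum(x - k, 0.0) for k in knots]
    return np.stack(feats, axis=-1)

def gam_proba(x_row, betas, tau, knots_per_cov):
    """Eq. (12): p(y=1|x) = g(tau + sum_i beta_i^T f_i(x_i)),
    with g the logistic sigmoid."""
    z = tau
    for xi, beta_i, knots in zip(x_row, betas, knots_per_cov):
        z += spline_basis(np.asarray(xi), knots) @ beta_i
    return 1.0 / (1.0 + np.exp(-z))
```

Setting \(\varvec{\beta }_i = 0\) for a covariate removes its contribution entirely, which is exactly what the group lasso penalty in Eq. (11) enforces.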

6 Extension to classification with recall guarantees

So far, we assumed that both misclassification costs \(c_{0, 1}\) and \(c_{1, 0}\) are given. Arguably, the false positive cost \(c_{0, 1}\) is relatively easy to specify. For example, in the medical domain, this might correspond to the price of a medicine which was unnecessarily prescribed to a healthy patient.

On the other hand, the specification of \(c_{1, 0}\) is more difficult. For example, specifying the cost of a dead patient (that might have been rescued) is difficult. Therefore, in the medical domain, it is more common to try to make a guarantee on the recall.Footnote 10 In particular, it is common practice to require that the recall is \(95\%\) (Kanao et al., 2009).

In the following, we show how to estimate the false negative cost \(c_{1, 0}\), given the false positive cost \(c_{0,1}\) and the requirement that the recall is larger or equal to some value r.

We denote by \(\mathbb {1}_M\) the indicator function, which is 1 if the expression M is true and 0 otherwise.

Given a distribution over \((y, {\mathbf {x}})\) such that \({\mathbb{E}}[\mathbb {1}_{y=1}] > 0\), the recall of a decision procedure \(\delta\) is defined as:

$$\begin{aligned} R_\delta&:= \int p({\mathbf {x}} | y = 1) \cdot \mathbb {1}_{\delta ({\mathbf {x}}) = 1} d {\mathbf {x}} . \end{aligned}$$

Assuming that \(\delta\) is a Bayes procedure, we have

$$\begin{aligned} \delta ({\mathbf {x}}) = 1&\Leftrightarrow p(y = 1 | {\mathbf {x}}) \cdot c_{1, 0} \ge p(y = 0 | {\mathbf {x}}) \cdot c_{0, 1} \\&\Leftrightarrow p(y = 1 | {\mathbf {x}}) \cdot c_{1, 0} \ge (1 - p(y = 1 | {\mathbf {x}})) \cdot c_{0, 1} \\&\Leftrightarrow p(y = 1 | {\mathbf {x}}) \ge \frac{c_{0, 1}}{c_{1, 0} + c_{0, 1}} . \end{aligned}$$

Setting

$$\begin{aligned} t := \frac{c_{0, 1}}{c_{1, 0} + c_{0, 1}} , \end{aligned}$$

we have that

$$\begin{aligned} R_\delta&= \int p({\mathbf {x}} | y = 1) \cdot \mathbb {1}_{p(y = 1 | {\mathbf {x}}) \ge t} d {\mathbf {x}} . \end{aligned}$$

In order to emphasize the dependence on t, we write in the following \(R_t\) instead of \(R_\delta\). In particular, let \(t^*\) be chosen such that

$$\begin{aligned} r = R_{t^*} \end{aligned}$$
(13)

where r is, for example, 0.95. Then the implicitly defined cost \(c_{1, 0}\) is given by

$$\begin{aligned} c_{1, 0} = \frac{1- t^*}{t^*} c_{0, 1} . \end{aligned}$$
(14)

It remains to show how \(t^*\) can be estimated. In general, Equation (13) does not have a solution (in terms of t). We therefore solve the following problem

$$\begin{aligned}&\max _t \, t \; , \, \text {subject to} \; \; r \le R_t , \end{aligned}$$
(15)

which has always a solution (since \(t = 0\) fulfills the constraint). Since \(p({\mathbf {x}} | y = 1)\) is unknown, we use the empirical training data distribution to estimate \(R_t\):

$$\begin{aligned} R_t&= \int p({\mathbf {x}} | y = 1) \cdot \mathbb {1}_{p(y = 1 | {\mathbf {x}}) \ge t} d {\mathbf {x}} \nonumber \\&\approx \frac{1}{n_1} \sum _{k : y^{(k)} = 1} \mathbb {1}_{p^{(-k)}(y = 1 | {\mathbf {x}}^{(k)}) \ge t} , \end{aligned}$$
(16)

where \(n_1\) is the number of positive samples (i.e. with label \(y = 1\)), and \(p^{(-k)}(y = 1 | {\mathbf {x}}^{(k)})\) is the estimate of \(p(y = 1 | {\mathbf {x}}^{(k)})\) of the classifier that was trained without sample k. In practice, since this type of leave-one-out estimation is computationally expensive, we use 10-fold cross-validation instead.

Since the expression in Eq. (16) is a monotone decreasing step function in t, we can easily solve the problem in (15) by sorting \(\big \{ p^{(-k)}(y = 1 | {\mathbf {x}}^{(k)}) \big \}_{k=1}^{n_1}\) in decreasing order.
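The sorting procedure, together with the cost conversion of Eq. (14), might look as follows; this is a sketch with hypothetical names, and it assumes the resulting threshold \(t^* > 0\) so that Eq. (14) is well defined:

```python
import numpy as np

def threshold_and_fn_cost(p_pos, r, c_fp):
    """Solve problem (15): the largest threshold t whose empirical recall,
    Eq. (16), is >= r, given the cross-validated probabilities
    p^{(-k)}(y=1 | x^{(k)}) of the positive samples; then recover the
    implied false negative cost c_{1,0} via Eq. (14)."""
    p_sorted = np.sort(np.asarray(p_pos))[::-1]  # decreasing order
    n1 = len(p_sorted)
    m = int(np.ceil(r * n1))       # need at least m positives above threshold
    t_star = p_sorted[m - 1]       # empirical recall m/n1 >= r at this t
    c_fn = (1.0 - t_star) / t_star * c_fp
    return t_star, c_fn
```

Since Eq. (16) is a step function that only changes at the sorted probabilities, the maximizer of (15) must be one of them, which the code exploits.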

6.1 Adaptive covariate acquisition with recall guarantees

So far, we discussed how to estimate \(c_{1,0}\) in the situation where only one classifier based on \(p(y = 1 | {\mathbf {x}})\) is used. However, in the adaptive acquisition setting, the situation is more complicated since, in general, for different observed sets of variables, the conditional class probabilities are different. In particular, let \(S \subset S^{\prime} \subseteq V\), where V is the set of all variables. Then, in general, we have

$$\begin{aligned} p(y = 1 | {\mathbf {x}}_S) \ne p(y = 1 | {\mathbf {x}}_{S^{\prime}}) , \end{aligned}$$

which means that, in general, the optimal threshold \(t^*\) which guarantees recall \(\ge r\) is different for different sets of observed variables S. Furthermore, in the adaptive setting, estimating the recall using Equation (16) is not valid anymore since the distribution of the samples, with label \(y = 1\), and for which we select the variable set \(S^{\prime}\) is, in general, different from \(p({\mathbf {x}} | y = 1)\), i.e. \(p({\mathbf {x}} | y = 1, \hbox { acquired}\ S^{\prime}) \ne p({\mathbf {x}} | y = 1)\).

Nevertheless, we show in the following that it is possible to define the cost \(c_{1,0}\) such that the recall requirement is fulfilled. First, let us introduce the following notations. Let \(S_1 \subset S_2 \subset S_3 \ldots S_q\), be the sets of variables that are considered for adaptive variable acquisition, i.e. first we acquire \(S_1\), and then we decide whether to classify or whether we acquire additionally the variables in \(S_2 \setminus S_1\), and so forth. Moreover, let \(\delta _{t, S}\) be the classifier based on the observed variables S and using threshold t, i.e.

$$\begin{aligned} \delta _{t, S}({\mathbf {x}}_{S}) := {\left\{ \begin{array}{ll} 1 &\quad{\text {if}}\,p(y = 1 | {\mathbf {x}}_{S}) \ge t , \\ 0 &\quad {\text{else}}. \end{array}\right. } \end{aligned}$$

To simplify the notation, we write in the following \(\delta _{t, S}\) short for \(\delta _{t, S}({\mathbf {x}}_{S})\).

Strict control of the recall can be achieved by requiring that

$$\begin{aligned} p(\delta _{t_{1}, S_{1}} = 1, \delta _{t_{2}, S_{2}} = 1, \ldots , \delta _{t_{q}, S_{q}} = 1 | y = 1) \ge r . \end{aligned}$$
(17)

This can be seen as follows. Suppose \(y = 1\). If any classifier \(\delta _{t_{j}, S_{j}}\) outputs label 0, an adversarial selection strategy could select that classifier. If, however, all classifiers output label 1, then even an adversarial selection strategy must select a classifier \(\delta _{t_{j}, S_{j}}\) whose output is 1. By the requirement of Inequality (17), the latter happens with probability at least r.

If we require that all thresholds are the same, i.e.

$$\begin{aligned} t = t_{1} = t_2 = \ldots = t_q, \end{aligned}$$

then we can proceed as before. That means, based on Inequality (17), we first calculate \(t^*\), and then specify the false negative cost \(c_{1,0}\) using Equation (14). Analogously to before for checking Inequality (17), we use the empirical training data estimate:

$$\begin{aligned}&p(\delta _{t, S_{1}} = 1, \delta _{t, S_{2}} = 1, \ldots , \delta _{t, S_{q}} = 1 | y = 1) \\&\quad \approx \frac{1}{n_1} \sum _{k : y^{(k)} = 1} \mathbb {1}_{p^{(-k)}(y = 1 | {\mathbf {x}}_{S_1}^{(k)}) \ge t} \cdot \mathbb {1}_{p^{(-k)}(y = 1 | {\mathbf {x}}_{S_2}^{(k)}) \ge t} \ldots \cdot \mathbb {1}_{p^{(-k)}(y = 1 | {\mathbf {x}}_{S_q}^{(k)}) \ge t} . \end{aligned}$$
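Since all q indicator factors share the same threshold t, a positive sample contributes to the joint recall if and only if its minimum probability over the nested sets is at least t, so the same sorting trick as before applies. A sketch with hypothetical names:

```python
import numpy as np

def shared_threshold(P, r):
    """P: (n1, q) matrix with P[k, j] = p^{(-k)}(y=1 | x_{S_j}^{(k)})
    for the positive training samples.  Returns the largest shared
    threshold t such that the empirical joint recall of Inequality (17)
    -- all q classifiers output 1 -- is >= r."""
    worst = P.min(axis=1)          # sample passes all sets iff min over sets >= t
    worst = np.sort(worst)[::-1]   # decreasing order
    m = int(np.ceil(r * len(worst)))
    return worst[m - 1]
```

The resulting \(t^*\) is then converted to \(c_{1,0}\) via Eq. (14), exactly as in the single-classifier case.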

7 Experiments

We evaluate our proposed method on four real datasets from the medical domain which are frequently used for cost-sensitive classification: Pima Diabetes dataset (p = 8, n = 768, positive class ratio \(\approx 0.35\)), the Wisconsin Breast Cancer dataset (p = 10, n = 683, positive class ratio \(\approx 0.35\)), Heart-disease dataset (p = 13, n = 303, positive class ratio \(\approx 0.46\)), and the PhysioNet dataset (p = 30, n = 12000, positive class ratio \(\approx 0.14\)). The first three datasets are all available at the UCI Machine Learning repository,Footnote 11 the PhysioNet data is available at https://archive.physionet.org/pn3/challenge/2012/.

Note that the PhysioNet data (Goldberger et al., 2000) contains for each patient several health check measures, like cholesterol, taken at different times during their stay in the intensive care unit. As in Shim et al. (2018), for each patient we use the last recorded value of each attribute to predict death (\(y = 1\)) or survival (\(y = 0\)). After filtering attributes which are mostly missing, we obtain a data set with 12000 patients and 30 attributes.

For Diabetes and Heart-disease we use the covariate costs as defined in Ji and Carin (2007), and Turney (1994), respectively. For the other datasets, we set the covariate costs uniformly to one.

Note that the Heart-disease and PhysioNet data contain missing values. For methods which cannot handle missing values (including our proposed methods) we assume that all covariates are jointly distributed according to a multivariate normal distribution, where the covariance matrix is estimated from all samples (including missing values) using the method from Lounici et al. (2014).

We compare the proposed method AdaCOS to the following methods:

COS The proposed method but fixing the covariate set \(S \in \{S_1, S_2,\ldots , S_q\}\) to the one which minimizes the total costs in expectation, i.e.

$$\begin{aligned} {\mathbb{E}}_{{\mathbf {x}}_{S}, y} \big [ c_{y,\delta ^*({\mathbf {x}}_{S})} \big ] + \sum _{i \in S} c_i , \end{aligned}$$

which is estimated using 10-fold cross-validation as in Sect. 5.1.Footnote 12

Full Model The logistic generalized additive model which always acquires (and uses) all covariates.

Shim2018 The cost-sensitive classification method based on deep reinforcement learning as proposed in Shim et al. (2018).Footnote 13

GreedyMiser The cost-sensitive tree construction method proposed in Xu et al. (2012).Footnote 14

AdaptGbrt The method proposed in Nan and Saligrama (2017), which requires the specification of a high accuracy classifier for which we use the Full Model.Footnote 15

For all methods, we estimate the hyperparameters with 10-fold cross-validation, except where this is too computationally expensive: for Shim2018 we use the hold-out data split as in their provided implementation, and for AdaptGbrt we use 5-fold cross-validation.

As evaluation measure, we use the average total cost of classification, defined as

$$\begin{aligned} \text {avg total cost} := \frac{1}{n_t} \sum _{k = 1}^{n_t} \big ( c_{y^{(k)}, y^{(k)}_*} + \sum _{i \in S^{(k)}} c_i \big ) , \end{aligned}$$

where \(n_t\) is the number of test samples; \(y^{(k)}\) and \(y^{(k)}_*\) is the k-th true test class and predicted test class, respectively; \(S^{(k)}\) is the set of covariates that were used by the prediction model for classifying the k-th sample.
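A direct implementation of this evaluation measure might look as follows (names are hypothetical):

```python
def avg_total_cost(y_true, y_pred, used_sets, c_fp, c_fn, feat_cost):
    """Average total cost over the test set: per-sample misclassification
    cost plus the acquisition cost of the covariates the model used."""
    total = 0.0
    for yt, yp, S in zip(y_true, y_pred, used_sets):
        if yt == 1 and yp == 0:
            total += c_fn          # false negative
        elif yt == 0 and yp == 1:
            total += c_fp          # false positive
        total += sum(feat_cost[i] for i in S)  # covariate acquisition cost
    return total / len(y_true)
```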

7.1 Results

For each dataset we use 5-fold cross-validation and report mean and standard deviation of the total costs. We evaluate all methods on two settings:

  • user-specified false positive and false negative misclassification costs.

  • user-specified false positive misclassification cost and target recall.

If not stated otherwise, we use group lasso, as explained in Sect. 5.2, for acquiring the sets \(S_1 \subset S_2 \subset \ldots \subset S_q\).

7.1.1 User-specified misclassification costs

In the first setting, we assume that the user specifies the false positive cost in \(\{1, 5, 10, 50, 100, 500, 1000\}\). The false negative cost is set to be 10 times the false positive cost, which reflects that it is more important to detect infected patients than avoiding wrongly classifying healthy patients.

The total cost of misclassification is shown in the top plots of Figs. 5, 6, 7, and 8 for Diabetes, Breast Cancer, PhysioNet, and Heart-disease, respectively. We observe that with respect to minimizing the total cost of classification (top plots), our proposed method AdaCOS performs better than all previously proposed methods.

Fig. 5

Results on Pima Diabetes dataset with user-specified false positive cost in \(\{1, 5, 10, 50, 100, 500, 1000\}\). The false negative cost is set to be 10 times the false positive cost

Fig. 6

Results on Breast Cancer dataset with user-specified false positive cost in \(\{1, 5, 10, 50, 100, 500, 1000\}\). The false negative cost is set to be 10 times the false positive cost

Fig. 7

Results on PhysioNet dataset with user-specified false positive cost in \(\{1, 5, 10, 50, 100, 500, 1000\}\). The false negative cost is set to be 10 times the false positive cost

Fig. 8

Results on Heart-disease dataset with user-specified false positive cost in \(\{1, 5, 10, 50, 100, 500, 1000\}\). The false negative cost is set to be 10 times the false positive cost

In each of those figures, the middle and bottom plots show the weighted accuracy and the number of acquired covariates, respectively, each plotted against the user-specified false positive cost. Since we assume that false negative classifications have 10 times higher cost than false positive classifications, we use the weighted accuracy defined by

$$\begin{aligned} {\text {weighted accuracy}} = \frac{{\text {true positives}}\cdot 10 + {\text {true negatives}}}{{\text {number of true labels}}\cdot 10 + {\text {number of false labels}}} . \end{aligned}$$
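The weighted accuracy above can be computed as follows (a sketch, with w = 10 reflecting the assumed false-negative-to-false-positive cost ratio):

```python
def weighted_accuracy(y_true, y_pred, w=10):
    """Weighted accuracy as defined above: true positives (and the number
    of positive labels in the denominator) are weighted by w."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return (tp * w + tn) / (n_pos * w + n_neg)
```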

From the bottom plots, as expected, we see that all methods start acquiring more covariates as the user-specified false positive cost increases. At the same time, all methods, except Shim2018, show an increase in (weighted) accuracy. In particular, Shim2018 underperforms on the smaller datasets Diabetes, Breast Cancer, and Heart-disease, which is likely due to the difficulty of adjusting the hyper-parameters of their deep neural network classifier on small hold-out validation data.

In terms of (weighted) accuracy, Full Model always performs best, i.e. even for small datasets we do not find any gain in predictive accuracy from using a sparser model. We conclude that if the covariate costs are zero or negligible, we may simply opt for the full model to obtain the lowest total costs. On the other hand, if the ratio of false positive cost to covariate cost is below around 100, the full model performs considerably worse than the proposed method in terms of total cost.

The (weighted) accuracy of AdaCOS and COS are similar, while the former achieves the same accuracy with fewer covariates. This demonstrates the effectiveness of estimating the expected cost of misclassification depending on what we have observed so far using the conditional Bayes risk as in Eq. (8).

Additionally, in Fig. 9, we show how the (weighted) accuracy changes as the number of acquired covariates increases.Footnote 16 The plot confirms that AdaCOS consistently improves over COS.

Fig. 9

Weighted accuracy vs. the number of acquired covariates. Results where the number of acquired covariates was larger than for any of the proposed methods are not shown. (All results were acquired with user-specified false positive cost in \(\{1, 5, 10, 50, 100, 500, 1000\}\), and false negative cost 10 times the false positive cost.)

Comparison of Group Lasso and Forward Selection Next, we compare the forward-selection strategy (Sect. 5.1) and group lasso (Sect. 5.2) when used for acquiring the covariate sets \(S_1 \subset S_2 \subset \ldots \subset S_q\). For Diabetes, Breast Cancer, and Heart-disease, the total costs of the proposed method AdaCOS with forward selection and group lasso are shown in Fig. 10. Due to the high computational costs for large p, it was not feasible to apply forward selection to the PhysioNet dataset. We find that, except for the Breast Cancer dataset, both covariate acquisition strategies lead to similar results. For Breast Cancer, forward selection appears to be superior to group lasso.

Fig. 10

Comparison of the proposed method AdaCOS when defining the covariates sets \(S_1, S_2, \ldots S_q\) using either the group lasso penalty or forward selection

Runtime Comparison In Table 1, first two rows, we compare the training time of AdaCOS with forward selection and AdaCOS with group lasso (Sect. 5.2).Footnote 17 The results clearly confirm that group lasso is considerably faster than forward selection, which allows us to apply our proposed method to the PhysioNet dataset.

For reference, in Table 1, rows 3 to 6, we also show the training time of all other methods, but we note that a fair comparison is difficult due to different types of implementations and the Matlab license conditions. In particular, the experiments of our proposed methods and Shim2018 were all run in a Python environment with 88 Intel Xeon 2.20 GHz CPUs, whereas GreedyMiser and AdaptGbrt ran on Matlab with 8 Intel Xeon 3.50 GHz CPUs.Footnote 18

Finally, in Table 2, we also show the runtime at test time of all methods. Though AdaCOS is slower than all other methods, it is still fast in practice, requiring less than 2 minutes to process all test samples of PhysioNet.

Table 1 Shows the average training time in minutes per 1000 samples (standard deviation in brackets) for forward selection and group lasso
Table 2 Shows the average test time in minutes per 1000 samples (standard deviation in brackets) for all methods

Symmetric misclassification costs on Diabetes dataset In order to compare to the results reported in Ji and Carin (2007), Dulac-Arnold et al. (2012), we also evaluate on the Diabetes dataset with symmetric misclassification costs (i.e. false negative and false positive costs are the same) and the cost for correct classification set to \(-50\). The results, shown in Table 3, suggest that in this setting, too, the proposed method can have an advantage over previously proposed methods. In particular, the proposed method AdaCOS with forward selection has the lowest total costs, though when using group lasso the proposed method underperforms.

Table 3 Shows the total cost of misclassification under the same cost setting as in Ji and Carin (2007), Dulac-Arnold et al. (2012): user-specified misclassification costs are symmetric (either 400 or 800), cost of correct classification equals \(-50\). The results for the methods DWSM and POMDP are taken from Dulac-Arnold et al. (2012) and Ji and Carin (2007), respectively

7.1.2 User-specified target recall

Finally, we investigate the setting where the user specifies the target recall r instead of the false negative cost. Here, we show the results for \(r = 0.95\); the results for target recall \(r = 0.99\) are similar and given in the supplementary material.

For this setting, we do not consider the method AdaptGbrt, since it does not allow the output of class probabilities or scores. For Shim2018 and GreedyMiser, we found that simply using the class probabilities/scores from the validation data to learn thresholds with recall \(\ge r\) tended to lead to recall less than r on the test data, as shown in Fig. 11. Therefore, in order to make all results comparable, we show the results for Shim2018 and GreedyMiser at the same recall level as the proposed method AdaCOS.

Fig. 11

Actually observed recall on test data when threshold on class probabilities was adjusted on validation data to have recall ≥ 0.95.

The recall of the proposed method AdaCOS on the test data, as shown in Fig. 11, never violates the target recall of 0.95.

Since no false negative costs are provided, we cannot evaluate in terms of total costs anymore. Instead, we evaluate in terms of average operation costs, defined as the average cost of false positives plus the costs for covariate acquisition:

$$\begin{aligned} \text {avg operation costs} := \frac{1}{n_t} \sum _{k = 1}^{n_t} \big ( (1 - y^{(k)}) \cdot y^{(k)}_* \cdot c_{0,1} + \sum _{i \in S^{(k)}} c_i \big ) . \end{aligned}$$
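Analogously to the average total cost, this measure might be computed as follows (hypothetical names):

```python
def avg_operation_cost(y_true, y_pred, used_sets, c_fp, feat_cost):
    """Average operation cost: false positive cost plus covariate
    acquisition cost per test sample; no false negative cost is needed."""
    total = 0.0
    for yt, yp, S in zip(y_true, y_pred, used_sets):
        # (1 - y) * y_pred picks out exactly the false positives
        total += (1 - yt) * yp * c_fp + sum(feat_cost[i] for i in S)
    return total / len(y_true)
```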

The results for all datasets are shown in Figs. 12, 13, 14, and 15. For completeness, in each of those figures, we also plot the false discovery rate (FDR) against the covariate costs (bottom plots), where crosses of different sizes mark the standard deviation of the FDR and the covariate costs.

Fig. 12

Results on Pima Diabetes dataset with user-specified false positive cost in {5, 10, 50, 100, 500, 1000}, and target recall ≥ 0.95.

Fig. 13

Results on Breast Cancer dataset with user-specified false positive cost in {1, 5, 10, 50, 100, 500, 1000}, and target recall ≥ 0.95.

Fig. 14

Results on PhysioNet dataset with user-specified false positive cost in {1, 5, 10, 50, 100, 500, 1000}, and target recall ≥ 0.95.

Fig. 15

Results on Heart-disease dataset with user-specified false positive cost in \(\{1, 5, 10, 50, 100, 500, 1000\}\), and target recall \(\ge 0.95\)

For all datasets, except Heart-disease, the proposed method AdaCOS has the smallest operation costs. Furthermore, AdaCOS tends to achieve a lower false discovery rate using fewer covariates.

8 Related work

Here, we briefly summarize various previous works for cost-sensitive classification.

Cost-sensitive Classification Without Covariate Costs The theory of optimal decision making given a (mis-)classification cost matrix and conditional class probabilities \(p(y | {\mathbf {x}})\) has been extensively studied in the statistical literature, see for example (Berger 2013), and the machine learning community, see for example (Zadrozny & Elkan, 2001; Zadrozny et al., 2003; Elkan, 2001). However, the costs of acquiring covariates are not addressed in these early works. The work in Zadrozny and Elkan (2001) also addresses the problem of an undefined cost matrix. However, in contrast to our work, they are concerned with sample-dependent costs and do not consider the recall constraints that we address.

Markov Decision Process (MDP) Framework The MDP formulation and solution using an action-utility representation (Q-learning) in Zubek et al. (2004), Bayer-Zubek (2004) is closest to our approach. Their method also leads to a Bayes procedure. However, they do not provide a formal proof and consider only discrete covariates. The work in Dulac-Arnold et al. (2011), Dulac-Arnold et al. (2012), Karayev et al. (2013) also uses the MDP framework. However, their proposed method cannot incorporate the uncertainty about the covariate distributions. The work in Ji and Carin (2007) tries to model such uncertainties by modeling the cost-sensitive classification problem as a partially observable Markov decision process (POMDP). However, their POMDP formulation can lead to repeatedly selecting the same covariates, and as a consequence they need to adapt the stopping criterion. In contrast, our proposed method, based on the extended definition of a Bayes procedure, leads to a rule for deciding when to stop acquiring new covariates that provably has the lowest expected total costs.

Reinforcement Learning Approaches Janisch et al. (2017), Shim et al. (2018) suggest using deep reinforcement learning with Q-learning. In contrast to the MDP approaches, a discriminative decision maker is learned, which does not require an environmental model. Their method performs promisingly in domains where huge amounts of labeled training data are available. Alternatively, the work in Benbouzid et al. (2012) suggests the use of SARSA. The method in Contardo et al. (2016) also addresses this problem with reinforcement learning. However, methods based on reinforcement learning are hard to optimize and can lead to suboptimal behavior, as our experimental comparison to Shim et al. (2018) demonstrated.

Discriminative Decision Approach The work in Wang et al. (2015) proposes an intriguing method for finding a decision procedure that is guaranteed to converge to the Bayes risk given sufficiently many training samples. Their idea is to create a Bayes optimal classifier for all possible subsets of covariates, connected by a directed acyclic graph. They formulate the problem as an empirical risk minimization (ERM) problem, and show that with infinitely many training samples the loss at each node converges to the Bayes risk. However, in order to remain scalable, their method requires acquiring covariates in batches. The work in Trapeznikov and Saligrama (2013), Wang et al. (2014b) uses a similar framework but is restricted to a fixed sequential order, in contrast to our method, which also learns the sequential order. The work in Kusner et al. (2014), Xu et al. (2013), Wang et al. (2014a) also uses the ERM framework together with budget constraints, but does not allow for asymmetric misclassification costs.

Cost-sensitive Tree Construction The works in Xu et al. (2012), Nan et al. (2015), Nan et al. (2016), Nan and Saligrama (2017), Peter et al. (2017) learn decision trees subject to budget constraints on the features. In particular, the methods in Nan and Saligrama (2017), Peter et al. (2017) are considered state of the art for this task. Their use of gradient boosted decision trees (Friedman 2001) makes them particularly effective for very large training data. Cost-sensitive decision trees for discrete covariates are also considered in Sheng and Ling (2006), and extended to Bayesian networks in Bilgic and Getoor (2007). In contrast to our work, these works do not explicitly optimize the expected total costs of misclassification, which can result in inferior performance, as we confirmed in Sect. 7.

Entropy-Based Approaches The works in Kanani and Melville (2008), Gao and Koller (2011), Kapoor and Horvitz (2009), Gong et al. (2019) optimize a criterion that combines the costs of features with an estimate of the class entropy of the resulting classifier. As such, their objective function is different from ours.

Density Estimation via Autoencoders The work in Kachuee et al. (2018) suggests acquiring the covariate which has the highest sensitivity to the output prediction y. In order to account for different covariate acquisition costs, the sensitivity scores are re-scaled appropriately. The sensitivity scores are estimated using a denoising autoencoder. Similarly, the work in Ma et al. (2019) uses as objective function the expected Shannon information, which is estimated via a variational autoencoder. Neither objective function is related to the minimization of the expected total cost.

Others The work in Greiner et al. (2002) extends the Probably Approximately Correct (PAC) framework to prove the existence of a cost-sensitive classifier that is with high probability optimal in the sense of providing minimal average total costs. However, they assume a probability distribution over only discrete covariates. The method in Lakkaraju and Rudin (2017) is additionally focused on interpretability and, as a consequence, optimizes an objective function that is different from ours. Imitation learning is also applied to this task by He et al. (2012), but their definition of loss is different from minimizing the total classification costs that we consider here. The work in Nan et al. (2014) assumes a margin-based classifier and uses a k-nearest neighbor approach to estimate the accuracy of the classifier in each step before deciding whether to acquire the next feature. However, their stopping criterion is solely based on an estimate of the classification accuracy and, as such, does not allow for a cost-optimal decision.

9 Conclusions

In this article, we addressed the problem of cost-sensitive classification where the goal is to minimize the total costs, defined as the expected cost of misclassification plus the cost for covariate acquisition.

In Sect. 2, we rigorously formalized this goal as the minimization of the (conditional) Bayes risk, which can change after the acquisition of a new covariate. However, solving this minimization problem is hard. First, the evaluation of the conditional Bayes risk requires estimating and integrating over a high-dimensional density. Second, the Bayes risk must be evaluated for all combinations of covariate sets, whose number is exponential in p, the number of covariates.

In order to overcome the computational difficulties, we introduced two working assumptions:

  1. The optimal classifier can be expressed as a generalized additive model (GAM).

  2. The optimal sets of covariates can be expressed as a sequence of monotone increasing sets, namely \(S_1 \subset S_2 \ldots \subset S_q\).

Using the first assumption, we showed, in Sect. 4, that the evaluation of the conditional Bayes risk reduces to a one dimensional density estimation and integration problem which can be efficiently estimated.

Furthermore, we showed that the sequence \(S_1 \subset S_2 \ldots \subset S_q\) can be acquired in a computationally efficient manner by inspecting the regression coefficient path when penalizing the GAM with group lasso.

Our experiments suggest that our proposed method AdaCOS achieves in most situations the lowest total costs of classification, when compared to the previous methods POMDP, DWSM, GreedyMiser, AdaptGbrt, and Shim2018 (Ji & Carin, 2007; Dulac-Arnold et al., 2012; Xu et al., 2012; Nan & Saligrama, 2017; Shim et al., 2018).

We note that some previous methods like Shim2018 (Shim et al., 2018) do not share our working assumptions, and instead use a very flexible classifier (a deep neural network) and a covariate acquisition strategy based on reinforcement learning. However, for small (\(n < 1000\)) datasets, and even for medium-sized (\(n > 10000\)) datasets like PhysioNet, we found that a generalized additive model is competitive with or even better than a neural network classifier, and the flexible reinforcement-learning strategy seems to suffer from high variance.

Finally, we considered the situation where not all misclassification costs are specified by the user. In particular, we considered the situation where the user specifies a target recall instead of the cost of false negative classification. We showed that it is possible to apply the proposed method by estimating the implicitly defined false negative cost. Our experiments showed that the resulting method indeed achieves the desired minimum recall, while minimizing the false discovery rate and covariate acquisition cost.

The source code of the proposed method and for reproducing all results is available at https://github.com/andrade-stats/AdaCOS_public.