1 Introduction

Deep neural networks achieve state-of-the-art results on many tasks, especially when a large amount of training data is available. Their success highlights the importance of reducing the cost of collecting labels on a large scale. Active learning can be used to select the data samples that will most benefit a predictor’s training, thereby reducing the amount of labeled data needed without hurting the predictor’s accuracy. The effectiveness of uncertainty- and disagreement-based active learning methods has been demonstrated on several datasets for shallow predictors (Settles and Craven 2008; Settles 2009), and more recently also for deep learning predictors (Gal et al. 2017; Shen et al. 2018; Siddhant and Lipton 2018). Nevertheless, random sampling is still the most popular method for building new datasets in several domains, including natural language processing (Tomanek and Olsson 2009). This is due to the practical issues of deploying uncertainty-based active sampling (Settles 2011; Lowell et al. 2019), including its limited applicability, robustness, and transparency.

Applicability Uncertainty sampling selects samples with the lowest prediction confidence of a predictor and collects their labels. However, the uncertainty in prediction may be hard to estimate for some complex models. For example, a relation extraction system is often a pipeline that includes named entity extraction, entity linking, and sentence classification. Combining the uncertainty of different components is difficult and usually relies on ad hoc approaches (Reichart et al. 2008). Another common case is that state-of-the-art commercial software packages or online services may only provide final predictions (i.e., a black-box predictor) (Wang et al. 2017a). Our first research question is whether a general active learning method can be based only on the predictions from such black-box predictors.

Robustness Uncertainty sampling assumes that the labels of uncertain examples are the most informative training data. However, the strategy may select outliers or ambiguous examples for annotators due to their low confidence scores. This selection bias often introduces substantial labeling noise into the dataset (Mussmann and Liang 2018). In fact, outlier detection methods often improve prediction performance by adopting a sample selection strategy that is exactly opposite to uncertainty sampling (Bouguelia et al. 2018). It is usually hard to distinguish an informative training example from a misleading outlier, and blindly adopting uncertainty sampling may result in much worse performance than random sampling in practice. Our second research question is whether a general active learning method can be robust to labeling noise without relying on prior knowledge.

Transparency Active sampling introduces a sampling bias, but we often know little about its effects on different aspects of performance. This lack of transparency causes many practical issues. Not understanding why an active learning method works well for a dataset, we have few insights into whether the sampling method will work well enough on another, similar dataset to compensate for the effort of trying active learning, whether the deep predictors will perform worse in some crucial aspects/classes, whether the selection will emphasize undesirable biases (e.g., a person’s race), and whether the sampling efficiency improvement will vanish in the long run or after switching the deep predictor used for uncertainty estimation.

Furthermore, practitioners need to choose a sampling method for a new task and a given deep predictor before collecting labels (e.g., random, uncertainty, or diversification sampling, or a combination of the above). Existing deep active learning studies usually focus on improving prediction performance without exploring why and how much the performance gains depend on the predictor’s specific ways of modeling feature interactions. The lack of such insights makes the choice extremely difficult (Lowell et al. 2019). Thus, our third research question is whether we can have a general analysis tool that provides insights into when and why an active sampling method works, and forecasts the potential benefits in different aspects based on a small number of existing labels.

To answer the above research questions, we first illustrate three kinds of samples in Fig. 1. The blue triangle nodes are easy samples, which means their prediction errors are low and remain almost unchanged if we collect more such labels. The pink square nodes represent samples with noisy labels, which means their prediction errors or uncertainties are high but decrease slowly as more labels are seen. To select informative samples like the orange diamond nodes, we propose a batch active learning framework that maximizes the error reduction of each sample batch.

Fig. 1

Comparison of sampling strategies. Each node is a word in an NER problem and we plot its prediction error/uncertainty versus its count in the current training dataset. Random or diversification-based sampling (a) often selects samples irrelevant to the task, such as a blue triangle node. Uncertainty sampling (c) prefers to select samples with the highest error/uncertainty (e.g., pink square nodes). However, their error decay could be small because of some inherent label noise. Our method (b) is a novel balance in this spectrum which usually provides better performance than diversification while being more robust, explainable, and applicable than uncertainty sampling (+ diversification)

In this framework, we propose three sampling methods. The first method, called error decay on groups (EDG), applies when the uncertainty of the deep predictor is not accessible but labels of a validation dataset are available: we cluster the samples and predict the validation error reduction on each cluster, as shown in Fig. 2. Without a validation dataset, the second method approximates the error decay using prediction changes on groups. Finally, if uncertainty is available, we model the uncertainty reduction of each sample. In Sect. 3, we describe the first method in detail, and view the second and third methods as its extensions, denoting them EDG_ext1 and EDG_ext2, respectively.

Fig. 2

The uncertainty sampling for a black-box predictor is hard to interpret. Our proposed framework predicts the reduction of error/uncertainty on clusters of samples, optionally with the help of validation data. The resulting strategies provide a more direct explanation of the choices made by batch active learning

Analyzing error decay on clusters brings several practical benefits. We can model the error decay using only model predictions; noisy samples will not be selected due to their low error decay; the strategy of our active sampling is interpretable (e.g., sampling more uppercased words in a named entity recognition problem due to their larger prediction error decay); and we can predict the performance gains from active learning in different clusters/aspects. Nevertheless, we found that achieving these practical advantages with only a few labels sacrifices some sampling efficiency. To feasibly approximate the error reduction, we assume independence between samples in different clusters. For instance, if a sample is a word in a sentence, our selection method assumes that its neighboring words do not affect the error decay of the word. In practice, violation of this assumption causes sub-optimal sampling efficiency when compared to uncertainty-based active learning.

We apply our framework to named entity recognition (NER) problems where our independence assumption is famously violated, and comprehensively evaluate the pros and cons of the proposed framework. We summarize our results from NER experiments in Table 1, which shows that our framework provides novel ways of trading part of the sampling efficiency gain for better applicability, robustness, and transparency.

Table 1 Comparison of different sampling approaches

1.1 Summary of main contributions

  1. We propose a novel active learning method, EDG, which models validation error decay curves on clusters of samples. We extend the method by replacing validation error decay with prediction difference decay or uncertainty decay to avoid relying on validation data. This demonstrates the flexibility of the EDG framework.

  2. On one synthetic and three real-world NER datasets, we show that EDG significantly outperforms the diversification baseline for black-box models, with or without the help of validation data. This demonstrates the effectiveness and applicability of the EDG framework.

  3. Modeling the error decay on clusters can be used as an analysis tool for arbitrary active learning methods. The experimental results show that no single method always wins, and the proposed analysis tool provides intuitions and guidelines for selecting a specific method. This demonstrates the transparency of the EDG framework.

  4. We propose a new evaluation method based on pseudo labels. The method allows us to test active learning methods on a large sampling pool and to test their robustness to systematic labeling noise.

  5. In the experiments on pseudo labels and the synthetic dataset, we show that combining EDG with state-of-the-art uncertainty sampling methods (i.e., choosing the samples with the highest uncertainty decay rather than the highest uncertainty) improves sampling efficiency in the presence of systematic labeling noise, random labeling noise, and model change. This demonstrates the robustness of the EDG framework.

2 Related work

Several studies aim at making active learning practical (Settles 2011). For example, Phillips et al. (2018) extend the techniques of interpreting a classifier’s predictions to explain the active sampling process, but still rely on the uncertainty estimation of the classifier. Bloodgood and Vijay-Shanker (2009) survey and propose stopping criteria based on overall error decay. Instead, we propose a novel active sampling method by modeling error decay on groups of samples.

Recently, active learning on black-box models has attracted research attention due to practical needs. Wang et al. (2017a) focus on improving a black-box semantic role labeling model using another neural network with low transparency. Rubens et al. (2011) propose estimating the variance of predictions of a black-box regressor, which is computationally prohibitive unless applied to simple models such as a linear regressor.

Popular active learning methods such as uncertainty sampling are not robust to noise (Mussmann and Liang 2018), and there is a body of research in active learning (referenced below) to address this issue. However, the main focus of previous studies is to identify and post-process the noisy labels given some prior knowledge of the noise-to-signal ratio. Instead, we propose active learning methods which are inherently robust to noise by avoiding the selection of difficult samples for annotators in the first place.

The methods to identify noisy labels are similar to the methods that measure the uncertainty of each sample (Sheng et al. 2008; Bouguelia et al. 2018), which could be based on disagreement between workers (Sheng et al. 2008; Zhao et al. 2011), disagreement between manually created labels and automatically generated predictions  (Bouguelia et al. 2015; Khetan et al. 2018), model uncertainty (Sheng et al. 2008; Kremer et al. 2018; Bouguelia et al. 2018), or estimation of workers’ quality (Zhang et al. 2015; Khetan et al. 2018). After identifying noisy labels, different methods adopt different strategies such as relabeling (Sheng et al. 2008; Bouguelia et al. 2015), excluding/down-weighting the noisy labels (Khetan et al. 2018), or both (Zhao et al. 2011; Zhang et al. 2015; Bouguelia et al. 2018), or acquiring high-quality labels (Kremer et al. 2018).

Some approaches, such as Dasgupta (2011), cluster input features and diversify samples by choosing them from different clusters, without considering information from the predictors being trained, and they often show only limited improvements in sampling efficiency. Recent approaches (Settles and Craven 2008; Wei et al. 2015; Sener and Savarese 2018; Ravi and Larochelle 2018) have combined uncertainty and diversification (e.g., by multiplying the informativeness and representativeness scores). By directly modeling the error reduction, our proposed methods naturally balance the two criteria without relying on model uncertainty estimation. This makes our methods robust to labeling noise and applicable to black-box models.

Another direction for deep active learning is to learn an error reduction predictor or a sample selector. However, the selection model is either only applicable to a simple model like naïve Bayes (Roy and McCallum 2001; Fu et al. 2018), or requires a large amount of data to train a complex predictor or selector. The training data for the non-transparent selection model usually needs to come from a similar task; for instance, it could consist of images with different labels for image classification, other users for recommendation, or another language for NER (Bachman et al. 2017; Fang et al. 2017; Ravi and Larochelle 2018). Although transferring the error reduction predictor between different types of datasets is possible (Konyushkova et al. 2017), it is unclear on which dataset pairs such a transfer would work (Koshorek et al. 2019).

A challenge related to active learning is to discover blind spots of the predictor (also called unknown unknowns). Lakkaraju et al. (2017) view this problem as a multi-armed bandit problem; they first cluster the samples in a pool, and select the samples from a group more often when more unknown unknowns are discovered from the group. However, it is not clear whether this strategy yields better sampling efficiency in terms of the predictor’s performance.

Chen et al. (2018) define unfairness as the difference in classification accuracy between two groups and suggest additional data collection as one of the remedies for such unfairness. However, they do not study an active sampling method to efficiently reduce the classification error or the unfairness.

In the extensions of our method, we maximize the prediction change (i.e., EDG_ext1 in our experiments). The method is related to a class of active learning strategies based on maximizing model change (Settles et al. 2008; Settles and Craven 2008). However, these methods do not address practical challenges such as transparency, applicability to black-box models, and robustness to labeling noise.

3 Method

The main goal of batch active learning is to reduce the error \(E(C,D_U)\) of a classifier or tagger C on an unseen testing dataset \(D_U\) after labeling a fixed number of samples. To simplify the explanation, we first assume that testing error can be well-approximated by validation error, and the methods that do not require validation data will be described later as extensions in Sect. 4.

Based on external datasets, previous work maximizes the error reduction by modeling interactions among various types of signals such as uncertainty estimation, input features, and the state of the sampling process (Konyushkova et al. 2017; Bachman et al. 2017; Fang et al. 2017; Ravi and Larochelle 2018). To avoid relying on a large external dataset, we first cluster samples into multiple groups and assume that the validation error of the samples in each group depends only on the number of annotated samples in the same group. Estimating the error reduction is then decomposed into simple one-dimensional regression problems, each of which can be solved by observing only a few pairs of validation error and the corresponding number of annotated samples.

In the following subsections, we describe our framework in the context of solving NER problems. The framework is generally applicable to any classification or sequential tagging problems if the methods used to cluster samples (Sect. 3.2) and the error decay function (Eq. 4) are modified properly.

3.1 Error partition

Given a feature, we can derive a partition p by clustering the samples into \(J^p\) groups. For example, using the sentence embedding as our features, we can use K-means (MacQueen 1967) to cluster every sentence in the corpus into multiple groups containing sentences with similar embeddings. Then, the testing error \(E(C,D_U)\) can be partitioned using the sentence groups as:

$$\begin{aligned} E(C,D_U) = \sum \limits _{s_i \in D_U} \sum \limits _{j=1}^{J^p} P(g^p_j|s_i) \sum \limits _{l=1}^{|s_i|} {{\mathbb {1}}}( y_{i,l} \ne \hat{y}^C_{i,l}), \end{aligned}$$
(1)

where C is the current predictor, \(s_i\) is the ith sentence in the testing data \(D_U\), \(P(g^p_j|s_i)\) is the probability that the sentence \(s_i\) belongs to the jth group \(g^p_j\), and \(P(g^p_j|s_i)\) could be the indicator function \({{\mathbb {1}}}(s_i \in g^p_j)\) if a hard clustering method is used. \(\sum _{l=1}^{|s_i|} {{\mathbb {1}}}( y_{i,l} \ne \hat{y}^C_{i,l})\) is the error of the sentence \(s_i\) under predictor C, \(|s_i|\) is the length of the sentence, \(y_{i,l}\) is the ground-truth tag for the lth token in the sentence \(s_i\), and \(\hat{y}^C_{i,l}\) is the tag predicted by the predictor C.

By assuming that the error of sentence \(s_i\) could be approximated by the estimated average error of its groups \(\hat{E}(g^p_j,C)\) (i.e., \(\sum _{l=1}^{|s_i|} {{\mathbb {1}}}( y_{i,l} \ne \hat{y}^C_{i,l})\approx \hat{E}(g^p_j,C) |s_i|\)), we estimate the overall error as:

$$\begin{aligned} \hat{E}^p(C,D_U) = \sum \limits _{j=1}^{J^p} \hat{E}(g^p_j,C) m(g^p_j,D_U), \end{aligned}$$
(2)

where \(m(g^p_j,D_U) = \sum _{s_i \in D_U} P(g^p_j|s_i) |s_i|\) could be viewed as the number of times group \(g^p_j\) appears in \(D_U\).

In addition to sentence clusters, we can also rely on word features and word clusters to form a partition p. Consequently, the error becomes

$$\begin{aligned} E(C,D_U) = \sum \limits _{s_i \in D_U} \sum _{l=1}^{|s_i|} \sum \limits _{j=1}^{J^p} P(g^p_j|s_{i,l}) {{\mathbb {1}}}( y_{i,l} \ne \hat{y}^C_{i,l}), \end{aligned}$$
(3)

where \(s_{i,l}\) is the lth token in the sentence \(s_i\). Hence, \(m(g^p_j,D_U) = \sum \nolimits _{s_i \in D_U} \sum \nolimits _{l=1}^{|s_i|} P(g^p_j|s_{i,l})\) in Eq. (2).
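To make Eqs. (1)–(3) concrete, the following minimal sketch (not the authors’ code; the array layout and names are our assumptions) computes the partitioned error estimate \(\hat{E}^p(C,D_U)\) from per-group average errors and soft group memberships:

```python
import numpy as np

def estimated_error(group_errors, memberships, sentence_lengths=None):
    """Compute the estimate of E^p(C, D_U) as in Eq. (2).

    group_errors:     (J,) array of estimated average group errors.
    memberships:      (N, J) array; P(g^p_j | s_i) for a sentence partition,
                      or per-sentence sums of token memberships
                      sum_l P(g^p_j | s_{i,l}) for a word partition.
    sentence_lengths: (N,) array of |s_i|, needed only for sentence partitions.
    """
    if sentence_lengths is not None:
        # m(g^p_j, D_U) = sum_i P(g^p_j | s_i) * |s_i|   (sentence partition)
        group_mass = memberships.T @ np.asarray(sentence_lengths, float)
    else:
        # m(g^p_j, D_U) = sum_i sum_l P(g^p_j | s_{i,l})  (word partition)
        group_mass = memberships.sum(axis=0)
    return float(np.asarray(group_errors, float) @ group_mass)
```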

3.2 Clustering for NER

When we use different partitions p, we get different error estimates \(\hat{E}^p(C,D_U)\). To increase the robustness of our error estimation, we adopt multiple partitions based on different features and aggregate the testing error estimates for selecting the next batch of samples to be annotated.

In our experiments on real-world NER datasets, we build four partitions using different features of sentences and words as follows:

  • Sentence We compute sentence embeddings by averaging the word embeddings, and cluster all the sentence embeddings into 10 groups. Next, the cosine similarities between sentence embeddings and the cluster centers are passed through a softmax layer with temperature parameter 0.1 to compute \(P(g^p_j|s_i)\).

  • Word We perform a simple top-down hierarchical clustering on word embeddings, which first clusters the words into 10 groups and further partitions each group into 10 clusters. This step results in 100 clusters for words in total.

  • Word + Shape Instead of performing clustering on the lowest layer of the hierarchy, we partition the words in each group using four word shapes: all uppercase letters, all lowercase letters, first letter uppercase with the rest lowercase, and all other shapes. The same word shape features are also used in our tagger.

  • Word + Sentence Similarly, we partition each of the 10 word groups in the lowest layer of the hierarchy. For each word, we find the sentence \(s_i\) the word belongs to, and rely on the sentence group \(g^p_j\) with the highest \(P(g^p_j|s_i)\) to perform the partition.

Performing clustering on the concatenation of multiple feature spaces is less interpretable, so we choose to model the feature interdependency by hierarchical clustering (i.e., concatenating the clustering results). For example, in the third partition (i.e., Word + Shape), a cluster contains all words that have the same shape feature and belong to one of the 10 word embedding clusters.

Among the four partitions, the first one (i.e., Sentence) uses soft clustering because a sentence might contain multiple aspects that belong to different groups. We perform hard clustering on word features because it achieves similar performance when compared to soft clustering, and speeds up updating the cluster size when a new sample is added. For efficiency, all the clustering is done by mini-batch K-means (Sculley 2010) in \(D_A\), the union of training data, sampling pool, and validation data.
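As an illustration of the Sentence partition described above, the sketch below is a simplification under our assumptions (precomputed averaged word embeddings, scikit-learn’s MiniBatchKMeans), not the released implementation:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def sentence_partition(sentence_embeddings, n_groups=10, temperature=0.1, seed=0):
    """Soft sentence partition: cluster sentence embeddings (assumed to be
    averaged word embeddings), then map the cosine similarity to each cluster
    center through a temperature-scaled softmax to obtain P(g^p_j | s_i)."""
    X = np.asarray(sentence_embeddings, float)
    km = MiniBatchKMeans(n_clusters=n_groups, random_state=seed).fit(X)

    def unit(v):
        return v / np.clip(np.linalg.norm(v, axis=1, keepdims=True), 1e-12, None)

    sim = unit(X) @ unit(km.cluster_centers_).T        # cosine similarities, (N, J)
    logits = sim / temperature
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)    # rows are P(g^p_j | s_i)
```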

To simplify the method and to have better control over the experimental settings, we use the same method to form groups based on the tagger’s input features for all datasets, and use the same groups when selecting all batches in a dataset. Nevertheless, we note that the framework allows us to model error decay on more fine-grained clusters as more training data are collected, or to use other external features (e.g., the journal where the sentence is published) that might not be easily incorporated into the tagger or uncertainty sampling.

3.3 Error decay modeling

Within each group, we assume that the error depends only on the number of samples in the group being observed in the training data. This is to avoid a complicated and uninterpretable error decay model built by many pairs of training data subset and validation error. We model the error of predictor \(C_{T_t}\) on jth group \(\hat{E}(g^p_j,C_{T_t})\) in Eq. (2) using a one-dimensional function e(n), where \(n=m(g^p_j,T_t)\) is the size of group \(g^p_j\) in the training data \(T_t\) after tth batch is collected, and further constrain the class of decay functions e(n) using prior knowledge of the tasks.

The decay function of prediction error e(n) depends on the task (Si et al. 1992; Hestness et al. 2017). The error decay rate of many tasks has been shown to be \(1/n^k\), both theoretically and empirically (Hestness et al. 2017), and k is typically between 0.5 and 2 (Si et al. 1992).

In sequence tagging tasks, the error decay rate depends on the importance of context. To intuitively explain how the importance of context affects the error decay rate, we discuss the form of error decay functions in one case where context does not affect the label and in another case where context matters.


Case 1 (context does not matter) Assuming we are classifying each token in a sentence into two classes and its label does not depend on context (like predicting the outcome of a coin toss), we only make reducible errors when we observe the less-likely label more times than the other label. Applying Chernoff bounds (Mitzenmacher and Upfal 2017), we can show that the error decay rate is as fast as an exponential function.

Without loss of generality, we assume the probability q of observing the positive class (i.e., head) for the ith token is smaller than 0.5. Let n be the number of coin tosses and define the random variable \(X^i_j={{\mathbb {1}}}({{\text {jth toss on ith coin is head}}} )\). In order to classify the testing tokens optimally (i.e., predict tail whenever seeing the ith token), we would like to observe \(X^i = \sum _{j=1}^{n}X^i_j < \frac{n}{2}\). Therefore, the error rate is \(P(X^i \ge \frac{n}{2})(1-q)+q(1-P(X^i \ge \frac{n}{2}))=P(X^i \ge \frac{n}{2})(1-2q)+q\).

Since all \(X^i_j\) are assumed to be independent, we can use Chernoff bounds to model the decay of \(P(X^i \ge \frac{n}{2})= P(X^i \ge (1+\delta )\mu )\) as n increases, where \(\mu =q\cdot n\), and \(\delta =\frac{0.5-q}{q}\). Chernoff bounds tell us that \(P(X^i \ge \frac{n}{2}) \le \exp (-n \cdot h(q))\), where h(q) is an error decay speed function that depends on q. Different versions of Chernoff bounds lead to different \(h(\cdot )\), but all \(h(\cdot )\) increase as q decreases. That is, when coins are more biased, error rate decays faster.


Case 2 (context matters) If we assume that the influence of each context word on the label is independent, we need to estimate the probability of the label given a word in the context in order to predict the label accurately. For example, we want to know how likely the label is to be a person name when we observe “Dr” in the context, so that we can estimate how likely the Pepper in “Dr Pepper” should be labeled as a person. The error of the probability estimation decays with rate \(1/\sqrt{n}\) in the long run according to Chernoff bounds or the central limit theorem, so the error decay function is likely to be as slow as \(1/\sqrt{n}\) when the words in context affect the label.
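Both regimes can be checked numerically. The snippet below is purely illustrative (q = 0.3 is an arbitrary choice): it evaluates the exact Case 1 error rate via the binomial tail, whose reducible part shrinks exponentially, next to the \(1/\sqrt{n}\)-type estimation error of Case 2.

```python
import numpy as np
from scipy.stats import binom

def case1_error(n, q=0.3):
    """Case 1: each token's label is an independent biased coin with
    P(head) = q < 0.5.  The error rate is P(X >= n/2)(1 - 2q) + q, and the
    reducible part decays exponentially in n (Chernoff bound).
    Ties (exactly n/2 heads) are counted as errors here for simplicity."""
    p_flip = binom.sf(np.ceil(n / 2) - 1, n, q)   # P(X >= n/2)
    return p_flip * (1 - 2 * q) + q

def case2_error(n, q=0.3):
    """Case 2: the error of estimating P(label | context word) shrinks like
    the standard error of a Bernoulli mean, i.e. proportionally to 1/sqrt(n)."""
    return np.sqrt(q * (1 - q) / n)

for n in (10, 40, 160, 640):
    print(f"n={n:4d}  case1={case1_error(n):.6f}  case2={case2_error(n):.4f}")
```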

The error decay rate of most of the NER tasks should lie between the decay rates in the above two cases because the taggers will gradually learn to utilize longer contexts. Thus, we model the error decay \(\hat{E}(g^p_j,C_{T_t}) = e\left( m(g^p_j,T_t)\right)\) for NER by a fractional polynomial:

$$\begin{aligned} e(n) = c_j + b_j \left( \frac{a_{0.5}}{ (a_0 \cdot n)^{0.5}} + \sum \limits _{k=1}^3 \frac{a_k}{ (a_0 \cdot n)^k } \right) , \end{aligned}$$
(4)

where \(a_{0.5}\), \(a_{0-3}\), \(b_j\) and \(c_j\) are parameters to be optimized, and we constrain these parameters to be non-negative. \(c_j\) is an estimate of irreducible error in this model, \(b_j\) tries to predict the initial error (when the first batch is collected), \(a_{0.5}\) and \(a_{1-3}\) weight the curves with different decay rates, and \(a_0\) scales the number of training samples. If e(n) is proportional to \(1/\sqrt{n}\), the estimated \(a_{1-3}\) would be close to 0. If e(n) is proportional to an exponential function, \(a_{1-3}\) would become the coefficients in its Taylor expansion.

The parameters \(a_{0.5}\), \(a_{0-3}\), \(b_j\), and \(c_j\) are estimated by solving

$$\begin{aligned}&{\mathop{\mathrm{arg\,min}}\limits_{\begin{array}{c} \{a_{0-3},a_{0.5}, \\ b_j,c_j\} \in {{\mathbf {R}}}^M_{\ge 0} \end{array}}} \sum _{j=1}^{J^p} w_j \sum _{t=1}^{t_m} v_{tj} \left( \hat{E}(g^p_j,C_{T_t}) - \frac{E^p_j(C_{T_t},D_V)}{m(g^p_j,D_V)} \right) ^2, \end{aligned}$$
(5)

where \(t_m\) is the number of annotated batches and the estimated error \(\hat{E}(g^p_j,C_{T_t})=e\left( m(g^p_j,T_t) \right)\). The average error of the jth group in the validation dataset \(D_V\) is \(E^p_j(C_{T_t},D_V) = \sum _{s_i \in D_V} P(g^p_j|s_i) \sum _l {{\mathbb {1}}}( y_{i,l} \ne \hat{y}^{C_{T_t}}_{i,l})\) for a partition using sentence clusters and \(E^p_j(C_{T_t},D_V)= \sum _{s_i \in D_V} \sum _l P(g^p_j|s_{i,l}) {{\mathbb {1}}}( y_{i,l} \ne \hat{y}^{C_{T_t}}_{i,l})\) for a partition using word clusters. \(w_j\) and \(v_{tj}\) are constant weights, and \(M=2J^p+5\) is the number of parameters. Due to the small number of parameters, the error decay curves can be modeled by retraining the deep neural network only a few times (we set \(t_m=5\) when selecting the first batch in our experiments).
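A rough sketch of this fit is shown below; the box-constrained L-BFGS-B solver and the starting values are our assumptions, since the solver is not specified here. It shares \(a_0\), \(a_{0.5}\), and \(a_{1-3}\) across groups and fits per-group \(b_j\) and \(c_j\), all constrained to be non-negative:

```python
import numpy as np
from scipy.optimize import minimize

def fit_error_decay(sizes, errors, w=None, v=None):
    """Fit Eq. (4) under the objective (5) for one partition.

    sizes[j], errors[j]: arrays over batches t of m(g^p_j, T_t) and the
    empirical average group error.  Shared parameters a0, a05, a1..a3 and
    per-group b_j, c_j are all constrained to be non-negative."""
    J = len(sizes)
    w = np.ones(J) if w is None else w
    v = [np.ones(len(e)) for e in errors] if v is None else v

    def curve(n, a, b, c):
        a0, a05, a1, a2, a3 = a
        z = np.maximum(a0 * np.asarray(n, float), 1e-12)
        return c + b * (a05 / np.sqrt(z) + a1 / z + a2 / z**2 + a3 / z**3)

    def loss(theta):
        a, rest = theta[:5], theta[5:]
        total = 0.0
        for j in range(J):
            b, c = rest[2 * j], rest[2 * j + 1]
            r = curve(sizes[j], a, b, c) - np.asarray(errors[j], float)
            total += w[j] * np.sum(v[j] * r ** 2)
        return total

    theta0 = np.concatenate([np.full(5, 0.1), np.tile([0.5, 0.05], J)])
    bounds = [(0.0, None)] * len(theta0)                 # non-negativity
    return minimize(loss, theta0, method="L-BFGS-B", bounds=bounds).x
```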

3.4 Query batch selection

Modeling the error decay on each cluster based on different features can be used as an analysis tool to increase the transparency of existing active learning methods. Such an analysis reveals the weaknesses (i.e., the groups of samples with high validation error) of the current tagger and allows us to estimate the number of samples that need to be collected to reach a desirable error rate.

We propose a novel active learning method to actively address the fixable weaknesses of the tagger discovered by the analysis tool. When a single partition p is used, we select the next batch B by maximizing

$$\begin{aligned} H^p(B \cup T) = - \sum \limits _j e\left( m(g^p_j,B \cup T)\right) m(g^p_j,D_A), \end{aligned}$$
(6)

where T is the collected training data, and \(D_A\) is the union of the pool of candidate samples, the training data T, and the validation data \(D_V\), which is used to approximate the group occurrence statistics in the testing data \(D_U\). Note that we use \(e(m(g^p_j,B \cup T))\) to approximate \(\hat{E}(g^p_j,C_{B \cup T})\), so that we avoid retraining the predictor C within each batch selection.

Proposition 1

Suppose that \(\hat{E}(g^p_j,C_{T_t})\) is a twice differentiable, non-increasing and convex function with respect to \(m(g^p_j,T_t)\) for all j, then \(H^p(T)\) is non-decreasing and submodular.

The convexity of \(\hat{E}(g^p_j,C_{T_t})\) is a reasonable assumption because the error usually decays at a slower rate as more samples are collected. Since selecting more samples only decreases the value of adding other samples, \(H^p(T)\) is submodular (see “Appendix 2” for a rigorous proof).

Finding the optimal B in Eq. (6) is NP-complete because the set cover problem can be reduced to this optimization problem (Guillory and Bilmes 2010), but the submodularity implies that a greedy algorithm could achieve \(1-1/e\) approximation, which is the best possible approximation for a polynomial time algorithm (up to a constant factor) (Lund and Yannakakis 1994; Guillory and Bilmes 2010).

Algorithm 1

When having multiple partitions p based on different features, we select the next sentence in the batch according to:

$$\begin{aligned} {{\,\mathrm{arg\,max}\,}}_{s_i} \left( \prod \limits _{p} \left( \frac{H^p( S_i \cup T) - H^p(T)}{|s_i|} + \epsilon \right) \right) ^{\frac{1}{F}}, \end{aligned}$$
(7)

where \(\epsilon\) is a small smoothness term. If the partition p is a set of sentence clusters, then \(S_i = \{s_{i}\}\). If it is a set of word clusters, then \(S_i = \{s_{i,l}\}_{l=1}^{|s_i|}\), where \(s_{i,l}\) is the lth word in the ith sentence. We normalize the error reduction \(H^p( S_i \cup T) - H^p(T)\) by the sentence length \(|s_i|\) to avoid the bias of selecting longer sentences, as done in previous work (Settles and Craven 2008; Shen et al. 2018). After annotators label the whole batch, we retrain the tagger model and update the error decay prediction by solving (5) before selecting the next batch.

The selection process is summarized in Algorithm 1. In the first few batches, we perform random sampling to collect pairs of cluster size and prediction error for every cluster, which are needed to solve the one-dimensional regression problems. Once the number of collected batches \(t_m\) exceeds the number of burn-in epochs \(t_{b}\), we have sufficient size–error pairs to model the error decay in (5). Then, we can predict the future error \(H^p( S_i \cup T)\) and select the samples that minimize the error using (7).

Note that (7) naturally balances informativeness and representativeness. From the informativeness perspective, samples without error decay will not be selected. From the representativeness and diversification perspectives, the value of choosing a sample in a batch decreases once samples from the same clusters have been selected.
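The sketch below outlines the greedy selection of Eqs. (6)–(7) under our reading of the method: F is taken to be the number of partitions, retraining is avoided by reusing the fitted decay curves e, and the per-sentence group masses are assumed to be precomputed. It is an approximation of Algorithm 1, not the released code:

```python
import numpy as np

def greedy_batch(lengths, mass, partitions, batch_tokens, eps=1e-6):
    """Greedily build a batch that maximizes the estimated error reduction.

    lengths:    (N,) sentence lengths |s_i|.
    mass:       list over partitions; mass[k] is an (N, J_k) array of the group
                counts a sentence would add (soft sentence or token memberships).
    partitions: list of dicts with 'decay' (list of fitted e_j callables),
                'counts' (current m(g^p_j, T)) and 'pool_mass' (m(g^p_j, D_A))."""
    def H(p, counts):
        err = np.array([e(n) for e, n in zip(p['decay'], counts)])
        return -float(err @ p['pool_mass'])                      # Eq. (6)

    F = len(partitions)
    selected, used = [], 0
    remaining = set(range(len(lengths)))
    while used < batch_tokens and remaining:
        base = [H(p, p['counts']) for p in partitions]
        best, best_score = None, -np.inf
        for i in remaining:
            score = 1.0
            for k, p in enumerate(partitions):
                gain = H(p, p['counts'] + mass[k][i]) - base[k]
                score *= gain / lengths[i] + eps                 # Eq. (7) factor
            score **= 1.0 / F
            if score > best_score:
                best, best_score = i, score
        for k, p in enumerate(partitions):
            p['counts'] = p['counts'] + mass[k][best]
        selected.append(best)
        used += lengths[best]
        remaining.remove(best)
    return selected
```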

4 Method extensions

For some applications, a validation set is not large enough to be used to model the error decay curves, and our independence assumption may be too strong. To address these concerns, we also test two extensions of our method.

4.1 Prediction difference decay

We replace the ground truth labels in (5) with the prediction \(\hat{y}^{C_{T_{t_m}}}_{i,l}\) based on the current training data. That is, our sampling method computes the difference between the current prediction and the previous predictions \(C_{T_{1}}\), ..., \(C_{T_{t_m-1}}\) in each group, and models the decay of the difference to maximize the convergence rate of predictions. We denote this method as EDG_ext1 in our experiments.

4.2 Uncertainty or disagreement decay

When uncertainty or disagreement information is available, we can model its decay and choose the sentences with the highest uncertainty decay rather than the highest uncertainty. To avoid making the independence assumption, we skip the clustering step, assume that the future uncertainty decay is proportional to the previous uncertainty decay, and set the score of the ith sentence to be

$$\begin{aligned} \min ( \max (u^{t_f}_i-u^{t_m}_i,0), u^{t_m}_i), \end{aligned}$$
(8)

where \(u^{t_m}_i\) is the current uncertainty of the ith sentence, and \(u^{t_f}_i\) is its previous uncertainty. Note that we take the minimum between the difference and \(u^{t_m}_i\) to ensure that the predicted future uncertainty is always non-negative. This method is denoted as EDG_ext2 in our experiments.
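A minimal sketch of the EDG_ext2 score in Eq. (8), assuming the per-sentence uncertainties from the previous and current rounds are already available:

```python
import numpy as np

def uncertainty_decay_scores(u_prev, u_cur):
    """Eq. (8): predicted reducible uncertainty per sentence, clipped to be
    non-negative and no larger than the current uncertainty u_cur."""
    u_prev, u_cur = np.asarray(u_prev, float), np.asarray(u_cur, float)
    return np.minimum(np.maximum(u_prev - u_cur, 0.0), u_cur)

# Sentences are then ranked by this score instead of by raw uncertainty.
```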

5 Experimental setup

NER problems are often used as benchmarks to evaluate (deep) active learning methods (Settles and Craven 2008; Shen et al. 2018; Siddhant and Lipton 2018) because they are the foundation for many information extraction tasks, and acquiring tags for each token requires a large amount of human effort. We follow Strubell et al. (2017) and use phrase-level micro-F1 as the performance metric for NER tasks. Precision and recall are computed by counting the number of correct boundary and type predictions. Unless otherwise stated, we use a four-layer convolutional neural network (CNN) as our tagger (Strubell et al. 2017).

5.1 Simulation on gold labels

This is one of the most widely used setups for evaluating active learning methods. We compare the performance of an NER tagger trained on different data subsets chosen by different methods. In “Appendix 1”, we also compare the performance of applying different active learning methods to BiLSTM-CRF models.

5.1.1 Synthetic dataset

We synthesize a dataset with 100 words; each word can be tagged as one of four entity types or none (not an entity). There are three categories of words. The first category consists of half of the words, which are always tagged as none. This setup reflects the fact that a substantial number of words, such as verbs, are almost always tagged as none in NER tasks.

One-fourth of the words belong to the second category, where every word mention has equal probability of being tagged as one of the entity types or none. In real-world NER tasks, such noisy label assignment may be due to inherently ambiguous or difficult words.

The remaining 25 words are in the third category, where the labels are predictable and depend on the other context words. The likelihood of words in the third category being tagged as one of the four entity types is sampled from a Dirichlet distribution with \(\alpha _{1-4} = 1\), while the likelihood of being none is zero. Whenever one of these words w appears in the sentence, we check two of its preceding and succeeding words that are also in the third category, average their likelihoods of entity types, and assign the type with the highest likelihood to the word w.

When generating a sentence, the first word is picked randomly. The transition probability within each category is 0.9. Inside the first and second categories, the transition probability is uniformly distributed, while the probability of transitioning to each w inside the third category is proportional to a predetermined random number between 0.1 and 1. The sentence length is between 5 and 50: after each word generated within that range, the sentence ends with probability 0.1.

In this dataset, each word is a group in our method and no clustering is performed. The ith word has the word embedding vector \([{{\mathbb {1}}}(k=i)]_{k=1}^{100}\). When modeling error decay, we start from 1000 tokens and use a batch size of 500. When evaluating the sampling methods, we start with 3000 tokens and use a batch size of 1000. That is, after the first batch is selected, we update the error decay curves based on the predictions of taggers trained on 1000, 1500, 2000, 2500, 3000, and 4000 tokens.

5.1.2 Real-world datasets

We test the active sampling methods on the CoNLL 2003 English NER (Tjong Kim Sang and De Meulder 2003), NCBI disease (Doğan et al. 2014), and MedMentions (Murty et al. 2018) datasets. The sizes of these datasets are presented in Table 2. The CoNLL 2003 dataset has four entity types: person name (PER), organization name (ORG), location name (LOC), and other entities (MISC). The NCBI disease dataset has only one type (disease name). For MedMentions, we only consider semantic types that are at level 3 or 4 (higher means more specific) in UMLS (Bodenreider 2004). Any concept mapping to more abstract semantic types is removed, as was done by Greenberg et al. (2018), and this subset is called MedMentions ST19. The 19 concept types in MedMentions ST19 are virus, bacterium, anatomical structure, body substance, injury or poisoning, biologic function, health care activity, research activity, medical device, spatial concept, biomedical occupation or discipline, organization, professional or occupational group, population group, chemical, food, intellectual product, clinical attribute, and Eukaryote.

Table 2 The size of datasets for the simulation on gold labels

In all three datasets, the first 30,000 tokens are from randomly sampled sentences. To model error decay, we start from 10,000 tokens and retrain the tagger whenever 5000 new tokens are added. When evaluating the sampling methods, we start from 30,000 tokens, using a batch size of 10,000.

5.2 Simulation on pseudo labels

In practice, we often observe systematic noise from annotators. The noise could come from some inherently difficult or ambiguous cases in the task or from incapable workers in the crowdsourcing platforms. Thus, we propose a novel evaluation method to test the robustness of different sampling methods in the presence of such noise.

As shown in Fig. 3, we first train a high-quality tagger using all the training data and use it to tag a large sampling pool. Then, different active learning methods are used to optimize the tagger trained on these pseudo labels. The micro-F1 is measured by comparing the tagger’s predictions with pseudo labels or gold labels on unseen sentences. This evaluation method also allows us to perform sampling on a much larger sampling pool, which is usually the case in the actual deployment of active learning methods. We randomly select 100,000 abstracts from PubMed as our new sampling pool, which is around 180 times larger than the pool we used in the simulation on gold labels. Precisely, the sampling pool consists of 24,572,575 words and 921,327 sentences. The testing data with pseudo labels have 2,447,607 tokens and 91,591 sentences.

Fig. 3

Simulation on pseudo labels compares active learning methods on a large pool with noisy labels. In addition to the original validation and testing set, we also use a testing set with pseudo labels for evaluation

We also evaluate the sampling methods on two practical variations of the above setting. First, we use the data collected for optimizing a CNN to train BiLSTM-CRF models, which tests the robustness of active learning methods after switching tagger models (Lowell et al. 2019). Second, when collecting gold labels for biomedical NER, annotators often tag a whole abstract at a time, a setting that can only be tested using a large sampling pool.

For all sampling methods, the sampling score of an abstract is the average of the sampling scores of the sentences in the abstract weighted by the sentence length. That is, the selection criteria view an abstract as a bag of sentences similar to how a sentence is considered as a bag of words when clustering is performed on words. We greedily select the abstract with the highest sampling score.
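For illustration, this abstract-level score can be written as a length-weighted average of sentence scores (a small sketch; the function name is ours):

```python
import numpy as np

def abstract_score(sentence_scores, sentence_lengths):
    """Score of an abstract: average of its sentence scores weighted by
    sentence length, treating the abstract as a bag of sentences."""
    w = np.asarray(sentence_lengths, float)
    return float(np.asarray(sentence_scores, float) @ w / w.sum())
```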

5.3 Sampling strategies

We compare the following sampling methods:

  • Random (RND): We select sentences randomly with uniform probability.

  • Error Decay on Groups (EDG): This is our method where we optimize (7) using validation data.

  • EDG_ext1 (w/o Val): As described in Sect. 4.1, we replace the validation error in EDG with the prediction difference.

  • Maximum Normalized Log-Probability (US): We use least confidence sampling (Culotta and McCallum 2005). This variant of uncertainty sampling has been shown to be very effective in NER tasks (Shen et al. 2018). When applying maximum normalized log-probability to the CNN model, we select the sentences via \({{\,\mathrm{arg\,min}\,}}_{s_i} (1/|s_i|) \sum _{l=1}^{|s_i|} \max _{y_{il}} \log P(y_{il})\). A sketch of this score (and of the BALD disagreement score below) is given after this list.

  • Maximum Normalized Log-Probability with Diversification (US + Div): We diversify uncertain samples based on sentence embeddings (i.e., the average embedding of its words) (Wei et al. 2015; Shen et al. 2018). We implement the US + Div, also called filtered active submodular selection (FASS), described in Shen et al. (2018). We use cosine similarity to measure the similarity between sentence embeddings. The number of candidate sentences is the batch size times \(t=100\).

  • Diversification (Div): We use the same algorithm as US + Div, except that all samples are equally uncertain.

  • US + Div + EDG_ext2: This is the same algorithm as US + Div, but with uncertainty scores replaced with their difference in Eq. (8).

  • Bayesian Active Learning by Disagreement (BALD): We select samples based on the disagreement among forward passes with different dropouts (Gal et al. 2017). The prediction disagreement of lth token in ith sentence is computed by \(\frac{\sum _{k=1}^K {{\mathbb {1}}}(y^k_{il} \ne \text {mode}_{k'}(y^{k'}_{il}) ) }{K}\). The number of forward passes K is set to 10 in our experiments. The sentence disagreement is the average of tokens’ disagreement. We use the default hyperparameter values for the dropouts as in Strubell et al. (2017).

  • BALD + EDG_ext2: Here, disagreement scores are replaced with their difference in Eq. (8).
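The following sketch computes the US least-confidence score and the BALD disagreement score referenced in the list above; the token probabilities and MC-dropout predictions are assumed to be given, and the array layout is our own convention:

```python
import numpy as np

def least_confidence_score(token_probs):
    """US: maximum normalized log-probability of one sentence.
    token_probs: (L, K) array of per-token label probabilities.
    Sentences with the lowest score are selected (arg min)."""
    return float(np.mean(np.log(token_probs.max(axis=1))))

def bald_disagreement(mc_predictions):
    """BALD: per-token fraction of dropout passes that disagree with the
    majority vote, averaged over the sentence.
    mc_predictions: (K, L) integer array of tags from K forward passes."""
    K, L = mc_predictions.shape
    scores = []
    for l in range(L):
        votes = mc_predictions[:, l]
        mode = np.bincount(votes).argmax()
        scores.append(np.mean(votes != mode))
    return float(np.mean(scores))
```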

5.4 NER tagger details

We use the published hyperparameters for real-world datasets, and simplify the tagger for the synthetic dataset to decrease the standard deviation of micro-F1 scores. In the synthetic dataset, we reduce the number of layers of the CNN to two because the label depends only on the left two words and right two words. Furthermore, we change the learning rate from \(5 \times 10^{-4}\) to \(10^{-4}\), the batch size from 128 to 32, and the max epochs from 250 to 1000 to make the performance more stable. When training BiLSTM-CRF, we also use the implementation and its default hyperparameters from Strubell et al. (2017). In all the experiments, the number of epochs is chosen using validation data.

The word embeddings for CoNLL 2003 are vectors with 50 dimensions from SENNA (Collobert et al. 2011). The word embeddings for NCBI disease and MedMentions ST19 are word2vec (Mikolov et al. 2013) with 50 dimensions trained on randomly sampled 10% of all PubMed text. Before clustering, we normalize all the word embedding vectors such that the square of the \(\ell _2\) distance between two words is twice their cosine distance.
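As a small check of this normalization: scaling each embedding to unit length makes the squared \(\ell_2\) distance equal to twice the cosine distance, as in the sketch below (illustrative only).

```python
import numpy as np

def normalize_embeddings(E):
    """Unit-normalize rows so that, for any two embeddings a and b,
    ||a - b||^2 = 2 - 2*cos(a, b) = 2 * cosine_distance(a, b)."""
    E = np.asarray(E, float)
    return E / np.linalg.norm(E, axis=1, keepdims=True)
```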

5.5 Visualization of the error decay model

In addition to quantitatively evaluating our methods, we also visualize our error decay models. The visualization examines whether our error decay function in Eq. (4) can accurately model the empirical error decay in NER datasets, and whether our sample selection strategies are transparent and interpretable.

Given a dataset and a partition p, the empirical and predicted errors in each group \(g^p_j\) are plotted as two curves in Fig. 4. We compare the y value of the empirical error \(E^p_j(C_{T_t},D_V)/m(g^p_j,D_V)\) and the predicted error \(\hat{E}(g^p_j,C_{T_t})=e\left( m(g^p_j,T_t) \right)\) for different x values \(m(g^p_j,T_t)\), the cluster size of group \(g^p_j\) in the training set. Each curve connects six points corresponding to \(t=1 \ldots 6\), where the total number of words in the training set \(|T_t|\) is 10,000, 15,000, 20,000, 25,000, 30,000, and 40,000, respectively, in the real-world datasets. The first 30,000 words are selected randomly and the last 10,000 words are selected by EDG.

Fig. 4

Error decay on groups, modeled after the first batch is collected by EDG. The x markers on the curves are the empirical errors and \(\bullet\) markers are predictions from the fitted curve. The groups shown in the figure for NCBI disease and MedMentions ST19 are formed by clustering word and sentence embeddings, respectively

In different datasets, we visualize the different partitions derived from different features. In synthetic data, a group is a word. In CoNLL 2003, we plot each group that contains all the words with the same shape. In NCBI disease, we plot each group that contains words with similar word embeddings. In MedMentions ST19, we plot each group that contains sentences with similar sentence embeddings. Note that in the quantitative experiments, we actually use word + shape in CoNLL 2003 and 100 clusters in NCBI disease, but we illustrate only shape in CoNLL 2003 and 10 clusters in NCBI disease to simplify the figure.

6 Results and analysis

The results of the error decay visualization, the simulation on gold labels, and the simulation on pseudo labels are shown in Fig. 4, Fig. 5, and Table 3, respectively.

Fig. 5

Comparison of different sampling methods on the four NER tasks. The validation (first row) and testing scores (second row) are averaged from the micro-F1 (%) of three CNNs trained with different random initializations. The performance of methods which cannot be applied to black-box taggers is plotted using dotted curves

Table 3 Simulation on pseudo labels for NCBI disease dataset

As shown in Table 3, the scores on the gold validation set, the gold testing set, and the testing set using pseudo labels follow a similar trend and most active learning methods do better than random regardless of which test set is used. The observation indicates that the taggers do not overfit the label noise in the training data severely and justifies our pseudo label experiments.

We first qualitatively analyze the error decay modeling in Sect. 6.1. Next, we quantitatively compare different methods in Sects. 6.2, 6.3, 6.4, and 6.5.

6.1 Error decay visualization

In Fig. 4, the predicted error decay curves usually fit the empirical values well. Some deviations of the empirical error come from the randomness of training CNNs. For example, the fourth point in MedMentions ST19 has higher empirical error in almost all clusters because the parameters of the CNN trained on the 25,000-word set happened to converge to a worse state.

In the figures, we can see that the length between the fifth and sixth points in each curve varies because the last 10,000 words in the training set are actively selected. The clusters that might have a larger error decay (e.g., the orange curve in CoNLL 2003) would get more training instances in the last sample batch. The figures demonstrate that the one-dimensional regression problem for each cluster could be solved well even though the sampling process is not random and only six training pairs for each curve are observed.

We can interpret the sampling strategy of EDG from the different number of samples selected in different groups. For example, EDG improves the sampling efficiency when compared to random sampling by selecting more words whose first letter is uppercase in CoNLL 2003, and by selecting more disease name candidates than verbs or actions in NCBI disease.

The transparency of EDG explains some important empirical observations in previous work. For example, Lowell et al. (2019) observed that the benefit of active learning is usually more significant in NER than in sentence classification, and the improvement in NER is robust against the change of predictor model. Shen et al. (2018) observed that in some NER datasets (e.g., CoNLL 2003), we can train a neural tagger that reaches a similar performance using only a selected small portion of the training set compared to using all the data. Figure 4 indicates that one of the main reasons is that there are more lowercase words in CoNLL 2003 than uppercase words, and lowercase words are almost never tagged as names of people, organizations, or locations. Therefore, the active learning methods could easily achieve high sampling efficiency by selecting more uppercase words, and the selection tendency can benefit various kinds of predictor models.

In real-world datasets, the error decay rate usually follows the function \(1/\sqrt{n}\) when n is large, for most of the groups and regardless of the feature being used. For example, \(a_{0.5}\) in Fig. 4 is at least two times larger than \(a_{1}+a_{2}+a_{3}\) in CoNLL 2003, NCBI disease, and MedMentions ST19. The small difference between empirical and predicted error also justifies our assumption that the weight parameters of the terms are shared across all the groups (i.e., \(a_{0-3}\) and \(a_{0.5}\) do not depend on the cluster index j).

6.2 EDG versus Div and RND

Among the methods we compared against, only random (RND) and diversification (Div) can be applied to black-box taggers. Our method (EDG) significantly outperforms Div, and Div outperforms RND in synthetic, CoNLL 2003, and NCBI disease datasets, which demonstrates the effectiveness of EDG. This also justifies our assumptions and indicates that the error decay curves are modeled well enough for the purpose of active sampling.

6.3 US versus US + Div

Shen et al. (2018) found that diversification is surprisingly not helpful in batch active learning. However, our results suggest that this finding might be valid only when the sampling pool size is small and/or some groups of frequent words/sentences are clearly not helpful. When the pool is sufficiently large and the task is to jointly extract many different types of entities like MedMentions ST19, sampling almost all kinds of sentences can be helpful to the task as all sentence clusters have similar decay on the far right of Fig. 4. Then, the diversification approach (Div) can be as effective as US and EDG, while US + Div(+ EDG_ext2) provides the best result.

6.4 EDG versus uncertainty-based methods

As shown in Wang et al. (2017a), it is difficult for a black-box active learning method to match the sampling efficiency of uncertainty sampling (US). In real-world datasets, EDG achieves the part of the performance gain of US that is easily explainable (e.g., it comes from ignoring easy words) and controllable by humans, and it does not involve the specifics of the tagger to model the interaction between each word and its context.

Furthermore, US is not robust to labeling noise or ambiguous samples (Mussmann and Liang 2018), which have high errors but low error decay. For instance, US almost always selects difficult words with high irreducible errors in the synthetic data. In real-world datasets, we also observe ambiguous or difficult words. For example, insulin is a chemical, but insulin resistance could be a disease or a symptom in the NCBI disease dataset. In Fig. 4, we can see that EDG does not select many words in this group. Our error curves in the supplementary material A.4 show that US selects many such words with incorrect pseudo labels. This vulnerability makes EDG outperform US on the synthetic data and, on average, in Table 3.

US + Div and BALD are more robust to labeling noise than US, but still suffer from a similar problem. Thus, the sampling strategies that choose more samples with reducible uncertainty (i.e., US + Div + EDG_ext2 and BALD + EDG_ext2) could significantly improve the accuracy of taggers on noisy datasets such as our synthetic data and the NCBI disease dataset with pseudo labels, while having comparable performance on the other, clean datasets with gold labels.

6.5 EDG_ext1 (w/o Val) versus EDG

In all datasets, modeling the error decay using pseudo labels (EDG_ext1) achieves similar performance when compared to using gold validation data (EDG), and also outperforms Div. In addition, the micro-F1 scores of EDG on validation and testing data roughly show a similar trend, which suggests that our method does not overfit the validation data even though it has access to its gold labels during sampling.

7 Conclusions

We proposed a general active learning framework which is based only on the predictions from black-box predictors, is robust to labeling noise without relying on prior knowledge, and forecasts the potential error reduction in different aspects based on a small number of existing labels.

Our experimental results suggest that no single batch active learning method wins in all cases and every method has its own weaknesses. We recommend that practitioners analyze the error decay on groups in order to choose a proper sampling algorithm. If the sampling pool is small and the error decay analysis shows that many samples could be easily tagged, then uncertainty sampling methods are expected to perform well. Otherwise, diversification should be considered or combined with uncertainty sampling. Finally, error decay on groups (EDG) or its extensions should be adopted if there are practical deployment challenges such as issues of applicability (e.g., only a black-box predictor is available), robustness (e.g., labels are inherently noisy), or transparency (e.g., an interpretable sampling process or an error reduction estimation is desired).

8 Future work

In our experiments, we demonstrated that our methods are transparent and robust to labeling noise. However, we have not yet applied them to tasks other than NER. For example, when we annotate a corpus for relation extraction, we usually want to select a document that is informative for the named entity recognizer, the entity linker, and the sentence classifier. This challenge is also called multi-task active learning (Reichart et al. 2008; Settles 2011). Compared to heuristically combining uncertainty from different models (Reichart et al. 2008), our methods provide more flexibility because they allow us to assign weights to the error reduction of each task and select the next batch by considering all tasks jointly.

In addition to the above pipeline system, question answering (QA) is another example where uncertainty is difficult to estimate. Many reading comprehension models, such as pointer networks, predict the start and end positions of the answer in a paragraph (Wang et al. 2017b). However, higher uncertainty on the position prediction does not necessarily mean the model is uncertain about the answer: the correct answer may appear in many places in the paragraph, and the network may point to all of the right places with similarly low probability. By modeling the error decay directly, our methods avoid this issue.

Finally, we have not compared EDG with active learning methods that are designed for a specific task to solve a specific practical issue. For example, the active sampling methods proposed by Wang et al. (2017a) are designed for semantic role labeling and focus on the applicability issue (i.e., black-box setting). Due to the difficulty of adapting their methods to NER and making fair comparisons, we leave such comparisons for future work.