
Large deviations in the perceptron model and consequences for active learning

Hugo Cui, Luca Saglietti and Lenka Zdeborová

Published 15 July 2021 © 2021 The Author(s). Published by IOP Publishing Ltd
Citation: H Cui et al 2021 Mach. Learn.: Sci. Technol. 2 045001. DOI: 10.1088/2632-2153/abfbbb


Abstract

Active learning (AL) is a branch of machine learning that deals with problems where unlabeled data is abundant yet obtaining labels is expensive. The learning algorithm has the possibility of querying a limited number of samples to obtain the corresponding labels, subsequently used for supervised learning. In this work, we consider the task of choosing the subset of samples to be labeled from a fixed finite pool of samples. We assume the pool of samples to be a random matrix and the ground truth labels to be generated by a single-layer teacher random neural network. We employ replica methods to analyze the large deviations for the accuracy achieved after supervised learning on a subset of the original pool. These large deviations then provide optimal achievable performance boundaries for any AL algorithm. We show that the optimal learning performance can be efficiently approached by simple message-passing AL algorithms. We also provide a comparison with the performance of some other popular active learning strategies.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

1.1. Motivation

Supervised learning consists in presenting a parametric function (often a neural network) with a series of inputs (samples) and labels, and adjusting (training) the parameters (network weights) so as to match the network output with the labels as closely as possible. Active learning (AL) is concerned with choosing the most informative samples so that the training requires the least number of labeled samples to reach the same test accuracy. AL is relevant in situations where the potential set of samples is large, but obtaining the labels is expensive (computationally or otherwise). There exist many strategies for AL, see e.g. [1] for a review. In membership-based AL [2–4] the algorithm is allowed to query the label of any sample, most often one it generates itself. In stream-based AL [5] an infinite sequence of samples is presented to the learner, which can decide whether or not to query each label. In pool-based AL, which is the object of the present work, the learner can only query samples that belong to a pre-existing, fixed pool of samples. It therefore needs to choose, according to some strategy, which samples to query so as to obtain the best possible test accuracy.

Pool-based AL is relevant for many machine learning applications, e.g. because not every possible input vector is of relevance. A beautiful recent application of AL is in computational chemistry [6], where a neural network is trained to predict inter-atomic potentials. In this case the pool of data is large and consists of all possible alloys, but not of arbitrary input vectors, and labelling is extremely expensive, as it demands resource-intensive ab-initio simulations. Consequently, only a limited number of samples can be labeled, i.e. one only possesses a certain budget for the cardinality of the training set. Another setting where a cheap large pool of input data is readily available but labelling is expensive is drug discovery [7], where given a target molecule one aims to find new compounds among the pool able to bind it. Another example is text classification [8–10], where labelling a text requires non-negligible human input, while a large pool of texts is readily available on the internet. Establishing efficient pool-based AL procedures in this case requires selecting a priori the most informative data samples for labelling.

Mainstream works on AL focus on designing heuristic algorithmic strategies in a variety of settings, and on analyzing their performance. The information-theoretic limitations faced by an AL algorithm are very rarely known, and hence evaluating the distance from optimality is mostly an open question. The main contribution of the present work is to provide a toy model that is on the one hand challenging for AL, and in which, at the same time, the optimal performance of pool-based AL can be computed, so that heuristic algorithms can be evaluated and benchmarked against the optimal solution. To our knowledge, this is the first work in which optimal performance results for pool-based AL procedures are computed. More specifically we study the random perceptron model [11]. The available pool of samples is assumed to consist of i.i.d. vectors following a normal distribution, and the teacher generating the labels is taken to be also a perceptron, with the vector of teacher weights having i.i.d. normal components. We compute the large deviation function for how likely it is to find a subset of the samples that leads to a given learning accuracy. Our results are obtained through the non-rigorous yet asymptotically exact (under appropriate probabilistic assumptions on the data) replica method from theoretical statistical physics [12]. While the presented analysis is based on the so-called replica symmetric (RS) ansatz, we also provide a stability analysis of our results against possible symmetry-breaking effects. Providing a rigorous proof of the obtained results or turning them into rigorous bounds would be a natural, and rather challenging, next step. In the algorithmic part of this work we benchmark several existing algorithms and also propose two new algorithms relying on the approximate-message-passing algorithm for estimating the label uncertainty of yet-unlabeled samples, showing that they closely achieve, in the studied cases, the relevant information-theoretic limitations.

The paper is organized as follows: the problem is defined and related work discussed in section 1.2. In section 2, we propose a measure to quantify the informativeness of given subsets of samples. In section 3, we derive the large deviation function over all possible subset choices and deduce performance boundaries that apply to any pool-based AL strategy. In section 4 we then summarize our results on the large deviations and support them with numerical simulations. In section 5, we then compare these theoretical results with the performance of existing AL algorithms and propose two new ones, based on approximate-message-passing.

1.2. Definition of the problem and related work

A natural modeling framework for analyzing learning processes and generalization properties is the so-called teacher-student (or planted) perceptron model [13], where the input samples are assumed to be random i.i.d. vectors, and the ground truth labels are assumed to be generated by a neural network (denoted as the teacher) belonging to the same hypothesis class as the student-neural-network. In this work we will restrict to single-layer neural networks (without hidden units) for which this setting was defined and studied in [14]. Specifically we collect the input vectors into a matrix $\boldsymbol{F} \in {\mathbb R}^{P \times N}$ where N is the dimension of the input space and P is the number of samples. The teacher generating the labels, called teacher perceptron, is characterized by a teacher-vector of weights $\textbf{x}^0$ and produces the label vector $\boldsymbol{Y}\in\mathbb{R}^{P}$ according to $\textbf{Y} = \mathrm{sign} (\textbf{F}\cdot \textbf{x}^0)$. Learning is then done using a student perceptron and consists in finding a vector x so that for the training set $\textbf{F}$ we have as closely as possible $\textbf{Y} = \mathrm{sign} (\textbf{F}\cdot \textbf{x})$. The relevant notion of error is the test accuracy (generalization error) measuring the agreement between the teacher and the student on a new sample not presented in the training set. Since both teacher and student possess the same architecture, the training process can be rephrased in terms of an inference problem (as discussed for instance in [13]): the student aims to infer the teacher weights, used to generate the labels, from the knowledge of a set of input-output associations. This scenario allows for nice geometrical insights (see for example [15]), as the generalization properties are linked to the distance in weight space between teacher and student functions. Note that, in the case of a noiseless labelling process, the teacher–student scenario guarantees that perfect training is always possible.

Active learning was previously studied in the context of the teacher–student perceptron problem. Best known is the line of work on query by committee (QBC) [4, 16, 17], dealing with the membership-based AL setting, i.e. where the samples are picked one by one into the training set and can be absolutely arbitrary N-dimensional vectors. AL is in that case more a strategy for designing the samples rather than one for selecting them smartly from a predefined set. In the original work [4] the new samples are chosen so that a committee of several student-neural-networks has the maximum possible disagreement on the new sample. The paper shows that in this way one can reach a generalization error that decreases exponentially with the size of the training set, while for a random training set the generalization error can decrease only inversely proportionally to the size of the set [15]. However, in many practical applications the possible set of samples to be picked into the training set is not arbitrarily big, e.g. not every input vector represents an encoding of a molecular structure. We hence argue that pool-based AL, studied in the present paper, where the samples are selected from a pre-defined set, is of larger relevance to many applications.

The theoretical part of this paper is presented for a generalization of the perceptron model, specifically for the random teacher–student generalized linear model (GLM), see e.g. [18]. The teacher-student GLM setup is an inference problem over a dataset $\{\boldsymbol{F}^\mu,y^\mu\}_{\mu = 1}^{P}$. The labels $y^\mu$ are generated from the data $\boldsymbol{F}^\mu$ through a probability distribution $P_{\mathrm{out}}(y^\mu|\boldsymbol{F}^\mu\cdot \boldsymbol{x^0})$, where $\boldsymbol{x^0}$ is a fixed vector, generated once and for all from a prior measure PX (·). Given the dataset $\{\boldsymbol{F}^\mu,y^\mu\}_{\mu = 1}^{P}$, the goal of the inference problem is to infer the weight vector $\boldsymbol{x^0}$ that was used to generate the labels. A completely equivalent formulation of the teacher-student GLM setting is to consider that the labels $y^\mu$ have been generated by a single-layer 'teacher' neural network with weight vector $\boldsymbol{x^0}$. A 'student' network sharing the exact same architecture is then trained on the resulting dataset $\{\boldsymbol{F}^\mu,y^\mu\}_{\mu = 1}^{P}$. Since a common architecture is shared between the teacher and student networks, training is in this case equivalent to the student trying to match its own weight vector with the teacher weights $\boldsymbol{x^0}$. An instance of a GLM is thus specified by a prior measure PX (·) on the weights $\boldsymbol{x}$, from which the true generative model is assumed to be sampled, and an output channel measure $P_{\mathrm{out}}(\cdot|\cdot)$, defining the generative process for the labels $y^\mu$ given the pre-activations $\boldsymbol{F}^\mu\cdot\boldsymbol{x}$. In the results part of this work we focus on the prototypical case of the noiseless continuous perceptron, where $P_{X}(x) = e^{-\frac{x^{2}}{2}}/\sqrt{2\pi} $ and $P_{\mathrm{out}}(y|h) = \delta(y-\mathrm{sign}(h))$, where for each sample µ we have $h^\mu = \boldsymbol{F}^\mu \cdot \boldsymbol{x}$. Moreover, we will consider the setting where the learning model is matched to the generative model and thus the student has perfect knowledge of the correct form of the two above-defined measures.

The pool-based AL task can now be more formally stated as follows: given a set of N-dimensional samples $\mathcal{S} = \{\boldsymbol{F}^{\mu}\}$ of cardinality $|\mathcal{S}| = P = \alpha N$, the goal is to select and query the labels of a subset $S\subseteq \mathcal{S}$ of cardinality $|S| = nN$, $0\lt n\leqslant\alpha$, according to some AL criterion. We will refer to n as the budget of the student. The true labels are then obtained through $y^\mu\sim P_{\mathrm{out}}(y^\mu|\boldsymbol{F}^\mu \cdot \boldsymbol{x}^0)$, $\boldsymbol{x}^{0}\sim P_{X}(\boldsymbol{x}^0)$. Henceforth measures with vector arguments are understood to be products over the coordinates of the corresponding scalar measures. For technical reasons, we rely on the strong (but customary) assumption that the samples are i.i.d. Gaussian distributed, $F_{i}^{\mu}\sim\mathcal{N}(0,1)$, ∀i ∈ {1, ..., N},  ∀µ ∈ {1, ..., P}. Note that, while this assumption implies that the full set $\mathcal{S}$ of input data is generally unstructured and uncorrelated, it does not prevent non-trivial correlations from appearing in any smaller labeled subset S selected through an AL procedure.
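For concreteness, here is a minimal sketch of this data-generating process (in Python, with illustrative names of our own choosing): a Gaussian pool, a Gaussian teacher vector, and noiseless sign labels.

    import numpy as np

    def generate_pool(N=500, alpha=3.0, seed=0):
        """Sample a pool F (P x N, i.i.d. standard Gaussian entries), a Gaussian
        teacher vector x0, and the corresponding noiseless binary labels."""
        rng = np.random.default_rng(seed)
        P = int(alpha * N)
        F = rng.standard_normal((P, N))   # pool of unlabeled samples
        x0 = rng.standard_normal(N)       # teacher weights, P_X = N(0, 1)
        Y = np.sign(F @ x0)               # labels, P_out(y|h) = delta(y - sign(h))
        return F, x0, Y

    # a budget n means that at most n*N of the P = alpha*N labels may be queried
    F, x0, Y = generate_pool()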

In pool-based AL settings, it is assumed that the student has a fixed budget n for building its training set, i.e. that only up to nN labels can be queried for training. The AL goal is to select, among the pool $\mathcal{S}$ of available samples, the nN most informative labels to present to the student, so that the latter achieves the best possible generalization performance. While many criteria of informativeness have been considered in the literature, see e.g. [1], in the teacher-student setting there exists a natural measure of informativeness, which we shall define in the next section.

2. The Gardner volume as a measure of optimality

A natural strategy for ranking the possible subset selections is to evaluate the mutual information $\mathcal{I}(\boldsymbol{x}^{0};\boldsymbol{Y}|\boldsymbol{F})$ between the teacher vector random variable $\boldsymbol{x^{0}}$ and the random variable corresponding to the subset of labels $\boldsymbol{Y}$, conditioned on the realization of the corresponding inputs $\boldsymbol{F}$. The mutual information between two random variables $\boldsymbol{x^0},\boldsymbol{Y}$ quantifies the mutual dependence between these variables, or, in other words, how much information about one random variable is gained if the other is observed [19]. More precisely, $\mathcal{I}(\boldsymbol{x}^{0};\boldsymbol{Y}|\boldsymbol{F})$ is mathematically defined as:

Equation (1)

where

Equation (2)

Equation (3)

are so-called conditional entropies. $\mathbb{E}_{\boldsymbol{Y},\boldsymbol{x^0}}$ denotes the average with respect to the random variables $\boldsymbol{Y},\boldsymbol{x^0}$. The entropy of a random variable quantifies the uncertainty about its realization [19]. Therefore, the mutual information admits the very intuitive interpretation of the information gained about the realization of $\boldsymbol{Y}$ when, in addition to the selected data $\boldsymbol{F}$, the realization of $\boldsymbol{x^0}$ is also known. Good selections contain larger amounts of information about the ground truth, encoded in the labels, and make the associated inference problem for the student easier. Conversely, bad selections are characterized by less informative labels. In the case of the teacher-student perceptron, where the output channel $P_{\mathrm{out}}(\cdot|\cdot)$ is completely deterministic and binary, the mutual information can be rewritten (following [18]) as:

Equation (4)

In going from the first to the second line, we used the fact that $P(\boldsymbol{Y}|\boldsymbol{F},\boldsymbol{x^0}) = \delta(\boldsymbol{Y}-\mathrm{sign}(\boldsymbol{F}\cdot\boldsymbol{x^0}))$ and that the entropy associated to a point mass distribution is vanishing ($\mathcal{H}(\boldsymbol{Y}|\boldsymbol{F},\boldsymbol{x}^{0}) = 0$), see for example [19]. Equation (4) allows a connection with a quantity well-known in statistical physics, the so-called Gardner volume [11, 15, 20], denoted in the following by v:

Equation (5)

The Gardner volume differs from the mutual information (4) just by a trivial multiplicative factor −1/N. The Gardner volume represents the extent of the version space [21], i.e. the entropy of hypotheses in the model class consistent with the labeled training set. This provides a natural measure of the quality of the student training. A narrower volume implies less uncertainty about the ground truth $\boldsymbol{x^{0}}$ and is thus a desirable objective in an AL framework. We shall focus the rest of our discussion on the large deviation properties of the Gardner volume while inviting the reader to keep in mind its connection with the mutual information given in (4) and (5).
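For reference, and assuming the standard definitions implied by the discussion above (Bayes-optimal setting, student prior equal to the teacher prior), this connection can be summarized schematically as

$$\mathcal{I}(\boldsymbol{x}^{0};\boldsymbol{Y}|\boldsymbol{F}) = \mathcal{H}(\boldsymbol{Y}|\boldsymbol{F})-\underbrace{\mathcal{H}(\boldsymbol{Y}|\boldsymbol{F},\boldsymbol{x}^{0})}_{ = 0}, \qquad v = \frac{1}{N}\ln\int \mathrm{d}\boldsymbol{x}\, P_{X}(\boldsymbol{x})\,\delta\big(\boldsymbol{Y}-\mathrm{sign}(\boldsymbol{F}\cdot\boldsymbol{x})\big),$$

so that, upon averaging over the labels, $\mathbb{E}_{\boldsymbol{Y}}\, v = -\mathcal{I}(\boldsymbol{x}^{0};\boldsymbol{Y}|\boldsymbol{F})/N$.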

There exist other natural measures of informativeness, e.g. the student generalization error $\epsilon_{g}$ and the magnetization (or teacher/student overlap) $m = \boldsymbol{x}\cdot\boldsymbol{x^{0}}/N$. In the thermodynamic limit $N\uparrow\infty$, $\epsilon_{g}$ is a decreasing function of m (see appendix C for more details). Moreover we will show analytical and numerical evidence that all these measures co-vary, at least in the simple teacher-student setting studied in this work. A numerical check at finite N of the correlation between v and m can also be found in section 4.1. In contrast to the Gardner volume v, the computation of the best achievable values for the test error $\epsilon_{g}$ or the magnetization m is harder and, to the authors' knowledge, a methodologically open question even in the setting of the present article.

3. Large deviations of the Gardner volume

We consider the problem of sampling labeled subsets of cardinality nN, $0\lt n\leqslant \alpha$, from a fixed pool of data of cardinality αN, $\alpha\sim\mathcal{O}(1)$, and study the variations in the associated Gardner volumes. We will hereby assume (as is usual in statistical physics) that, for any fixed pool and subset size, the Gardner volume probability distribution follows a large deviation principle, i.e. that there exists an exponential number $e^{N\Sigma(n,v)}$ of subset choices that produce Gardner volumes equal to v. Employing statistical physics terminology, we will refer to the rate function, Σ(n, v), as the complexity of labeled subsets associated to a budget n and a volume v.

In the large N limit, the overwhelming majority of subsets will thus realize a Gardner volume $v^\star$, such that $v^\star = {\mathrm{argmax}_v}\Sigma(n,v)$. We will call $v^\star$ the typical Gardner volume, because for a randomly drawn subset the resulting Gardner volume is with high probability close to this value. Indeed, the large deviation principle implies that fluctuations around this typical value are exponentially rare; random sampling will thus almost certainly yield Gardner volumes extremely close to $v^\star$. However, the aim of AL is to find strategies for accessing the atypically informative subsets (i.e. the atypically small volumes $v\lt v^\star$), whence the necessity of analyzing the large deviation properties of the subset selection process.

It is convenient to introduce a vector of selection variables $\{\sigma_{\mu}\}_{1\leqslant \mu \leqslant \alpha N}\in\{0,1\}^{\alpha N}$, such that $\sigma_{\mu} = 1$ when the sample $\boldsymbol{F}_{\mu}\in\mathcal{S}$ is selected (and added to the labeled training set), while $\sigma_{\mu} = 0$ otherwise. In this notation the selected subset $S\subset\mathcal{S}$ is easily defined as $S = \{\boldsymbol{F}_{\mu}\in\mathcal{S}|\sigma_{\mu} = 1\}$.

Since a direct computation of the complexity is not straightforward, as is customary in this type of analysis [22] we derive it by first evaluating its Legendre transform. We introduce the (unnormalized) measure over the selection variables, for any reals β, φ:

Equation (6)

and the associated free entropy:

Equation (7)

From a statistical physics perspective, Ξ can be regarded as a grand-canonical partition function, with β playing the role of an inverse temperature, the Gardner volume being the associated energy function, and where φ is an effective chemical potential controlling the cardinality of the selection subset, $|S|$. In the thermodynamic limit $N\uparrow\infty$, by applying the saddle-point method one can easily see that Φ(β, φ) will be dominated by a subset of selection vectors $\{\sigma_{\mu}\}$ whose budget and energy, $n^\star$ and $v^\star$, are given by:

Equation (8)

Thus, inverting the Legendre transform yields the sought complexity:

Equation (9)

At fixed budget n, the range of values of the volume v associated to positive complexities, i.e. with $\Sigma(n,v)\gt0$, effectively spans all the achievable Gardner volumes for subsets of that given cardinality, agnostic of the actual strategy for selecting them. In particular, ${\mathrm{inf}}_v\{v|\Sigma(n,v)\gt0\}$ and ${\mathrm{sup}}_v\{v|\Sigma(n,v)\gt0\}$ define the minimal and maximal Gardner volumes and provide theoretical algorithmic boundaries for all realizable AL strategies. Note that this means that our prototypical model, albeit idealized, constitutes a nice benchmark for comparing known pool-based AL heuristics.
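As a minimal illustration, given a complexity curve computed numerically on a grid (the array names below are ours, for illustration only), these boundaries can be read off as follows:

    import numpy as np

    def achievable_volume_range(v_grid, sigma_vals):
        """Extrema of the volumes with positive complexity, i.e. the
        information-theoretic boundaries for any pool-based AL strategy."""
        feasible = np.asarray(v_grid)[np.asarray(sigma_vals) > 0]
        return feasible.min(), feasible.max()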

3.1. Replica symmetric free energy for a GLM

This section details the computation of the free entropy (7) for the generic case of a GLM, with arbitrary teacher/student priors and output channels. The specialization to the particular case of the teacher-student perceptron with no mismatch (Bayes-optimal) will be carried out in section 3.2. Our computation borrows from [18, 23], which study the standard (typical-case) measure. Because we study large deviations, our formalism bears some resemblance to one-step replica symmetry breaking (1RSB) equations, a discussion of which can be found for example in [24–26].

3.1.1. Notations and assumptions

We consider a student GLM [23] with N-dimensional weights learning from αN samples $\boldsymbol{F}^{\mu}\in\mathbb{R}^{N}$, stacked in a matrix $\boldsymbol{F}\in\mathcal{M}_{\alpha N,N}(\mathbb{R})$, and the corresponding labels $y^{\mu}$, stacked into $\boldsymbol{Y}\in \mathbb{R}^{\alpha N}$. We assume the teacher–student (or planted) setting [13], where the labels are generated from the ground truth (teacher weights) $\boldsymbol{x}^{0}$ through the channel measure $\overline{P_{\mathrm{out}}}(y^{\mu}|\boldsymbol{F}^{\mu}\cdot\boldsymbol{x^{0}})$. The teacher weight vector is itself drawn from the prior $\overline{P_{X}}(\cdot)$. Given $\boldsymbol{F}$ and $\boldsymbol{Y}$, the student perceptron is trained so that its own weight vector $\boldsymbol{x}$ tries to match the ground truth $\boldsymbol{x}^{0}$. The inference is carried out with the student prior PX (·) and likelihood $P_{\mathrm{out}}(\boldsymbol{Y}|\boldsymbol{F}\cdot\boldsymbol{x})$. Note that the cases where $P_{X}(\cdot)\ne\overline{P_{X}}(\cdot)$ or $P_{\mathrm{out}}(\cdot)\ne\overline{P_{\mathrm{out}}}(\cdot)$ mean that the student does not know the precise Markov chain from which the labels are generated, as discussed in section 1.2, see also [13]. The posterior weight assigned to a vector $\boldsymbol{x}$ being the ground truth vector is then proportional to $P_{X}(\boldsymbol{x})P_{\mathrm{out}}(\boldsymbol{Y}|\boldsymbol{F}\cdot\boldsymbol{x})$. In this case the Gardner volume reads:

Equation (10)

The smaller v, the easier the student inference, see section 1.2. The validity of the Gardner volume as a measure of informativeness is justified for the Bayes-optimal perceptron in section 2.

We consider here pool-based AL, where only a subset S of the pool $\mathcal{S} = \{\boldsymbol{F}^{\mu}\}_{1\leqslant \mu\leqslant\alpha N}$ is used for training. The choice of subset can be conveniently parametrized by the Boolean variables $\sigma_{\mu}\in\{0,1\}$, where $\sigma_{\mu} = 1$ means that sample $\boldsymbol{F}_{\mu}$ is used, while $\sigma_{\mu} = 0$ means that $\boldsymbol{F}_{\mu}$ is not selected. For a given budget $0\leqslant n = \frac{1}{N}|S| = \frac{1}{N}\sum\limits_{\mu = 1}^{\alpha N}\sigma_{\mu}\leqslant \alpha$, we intend to find the selection S that minimizes the Gardner volume $v(\{\sigma_{\mu}\})$, viz. the one that allows the best student guess. To do this we shall compute the complexity Σ(n, v), with $e^{N\Sigma(n,v)}$ the number of ways to select nN samples so that the Gardner volume associated with the training of the student is v, as in section 3.
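For reference, with these conventions the Gardner volume restricted to a selection (i.e. the analogue of equation (10) in which only the queried labels constrain the student) takes, schematically, the form

$$v(\{\sigma_{\mu}\}) = \frac{1}{N}\ln\int \mathrm{d}\boldsymbol{x}\, P_{X}(\boldsymbol{x})\prod_{\mu = 1}^{\alpha N}\Big[P_{\mathrm{out}}\big(y^{\mu}|\boldsymbol{F}^{\mu}\cdot\boldsymbol{x}\big)\Big]^{\sigma_{\mu}},$$

where samples with $\sigma_{\mu} = 0$ simply do not enter the product.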

To simplify, the samples are taken to be independently and identically distributed according to a normal distribution, $\forall(i,\mu),~F^{\mu}_{i}\overset{d}{ = }\mathcal{N}(0,\frac{1}{\sqrt{N}})$. Moreover all measures over vectors are assumed to be separable, that is, factorizable as a product of identical measures over the components. Notation-wise, $P_{X}(\boldsymbol{x})$ for example is therefore understood to mean $\prod\limits_{i = 1}^{N}P_{X}(x_{i})$.

3.1.2. Replica trick

The goal is to compute the averaged log partition function (free entropy in statistical physics terms):

Equation (11)

β can be seen as an inverse temperature and φ as a chemical potential, see section 3. Note that the selection variables $\{\sigma_{\mu}\}$ play in the grand-canonical partition function (11) the role of an annealed disorder in disordered systems terminology [26], and shall be sometimes referred to as such in the following.

The standard way of taking care of the logarithm in equation (11) is the replica trick [26–28],

Equation (12)

To compute $\mathbb{E}_{\boldsymbol{F},\boldsymbol{Y},\boldsymbol{x^{0}}}\Xi^{s}$, one needs to further replicate β times to take care of the power β involved in the summand in equation (11):

Equation (13)

In the present problem we have thus introduced two replication levels. Each replica is hence characterized by a set of two indices: the first index, a, runs from 1 to s and specifies the disorder replica; the second index, α, runs from 1 to β and is related to the replication in β. In total there are therefore s×β replicas. The teacher is set as replica 0. Implicitly, henceforth, replica indices that are summed over run over [1, s]×[1, β]∪{0}. Now,

Equation (14)

where we defined $h^{\mu}_{a\alpha}\equiv\boldsymbol{F}^{\mu}\cdot\boldsymbol{x}^{a\alpha}$, which is Gaussian because of the central limit theorem, and enforced the definition of its covariance matrix $\boldsymbol{Q}$ with integral representations of Dirac deltas. The conjugate matrix is $\hat{\boldsymbol{Q}}$. Matrix elements are denoted with lowercase q. Then,

Equation (15)

where we factorized both in i indices (first parenthesis) and in µ indices (second parenthesis). The free entropy defined in (11) then reads:

Equation (16)

with,

Equation (17)

Equation (18)

3.1.3. Replica symmetric (RS) ansatz

The extremization in equation (16) is hard to carry out. As is now standard in the disordered systems literature we can reduce the number of parameters to be extremized over by enforcing the so-called RS ansatz [26] on both replication levels:

Equation (19)

Equation (20)

Equation (21)

Equation (22)

Equation (23)

where $q\lt Q$. Physically, the ansatz (19)–(23) means that two replicas seeing the same realization of the disorder (i.e. possessing the same first index) have an overlap Q greater than the overlap between students seeing different realizations (and thus possessing different a-indices). The $-\frac{1}{2}$ in the definition of $\hat{r}$ (21) is just introduced for later convenience.
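Explicitly, and up to the precise conventions chosen in (19)–(23), the parametrization described above amounts to (in the notation of section 3.1.4)

$$Q_{(a\alpha),(a\alpha)} = r,\qquad Q_{(a\alpha),(a\beta)} = Q~(\alpha\neq\beta),\qquad Q_{(a\alpha),(b\beta)} = q~(a\neq b),\qquad Q_{0,(a\alpha)} = m,\qquad Q_{0,0} = r^{0},$$

with an analogous structure for the conjugate matrix $\hat{\boldsymbol{Q}}$ (including the $-\frac{1}{2}$ convention for $\hat{r}$ mentioned above).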

Note finally that, while the ansatz (19)–(23) is replica symmetric at both replication levels, it gives a set of equations that are formally those of a 1RSB problem [25]. This is also a reason why taking a 1RSB ansatz [26] in the present large deviation calculation would be rather involved, as it would lead to equations of the usual 2RSB form, which are numerically demanding to solve.

We plug the RS ansatz (19)–(23) into the three contributions that make up equation (16). The trace term is:

Equation (24)

We can decompose the exponent in (17) according to the ansatz (19)–(23):

Equation (25)

In the penultimate term the index 0 does not intervene. Introducing Hubbard–Stratonovich fields {λa } for the penultimate term and a Hubbard–Stratonovich field ξ for the last one, IX reads:

Equation (26)

To carry out the computation for IY (equation (18)) we need to explicitly compute the inverse of the Parisi matrix $\boldsymbol{Q}$ involved in equation (18). This is done in the following subsection.

3.1.4. Inverse of the overlap matrix

Denote by $\boldsymbol{\tilde{Q}}\equiv\boldsymbol{Q}^{-1}$ the inverse of the overlap matrix $\boldsymbol{Q}$. Since $\tilde{\boldsymbol{Q}}$ is clearly of the same form as $\boldsymbol{Q}$, we can parametrize its coefficients in an identical fashion as those of $\boldsymbol{Q}$, by $\tilde{r}^{0},\tilde{m},\tilde{r},\tilde{q},\tilde{Q}$. The identity $\boldsymbol{\tilde{Q}}\boldsymbol{Q} = \unicode{x1D7D9}_{\beta s+1}$ then means:

Equation (27)

Equation (28)

Equation (29)

Equation (30)

Equation (31)

Equation (32)

yielding:

Equation (33)

Equation (34)

Equation (35)

Equation (36)

Equation (37)

To compute the determinant of $\boldsymbol{Q}$, the simplest way is to guess the eigenvectors. For vectors of the form (x, 1, 1, ..., 1)T we get two eigenvalues $\lambda_{\pm}$ whose product is:

Equation (38)

Then come s(β − 1) eigenvectors $\boldsymbol{e_{i}}-\boldsymbol{e_{i+1}}, ~i\not\equiv 0~[\beta]$ (we are indexing starting from 0), with associated eigenvalue (r − Q). Then for $0\leqslant a\leqslant s-2$,

Equation (39)

is an eigenvector with eigenvalue $r+(\beta-1)Q-\beta q$. Then

Equation (40)

The same equality holds with tilde quantities in the right-hand side provided the signs are inverted, since $\mathrm{ln~det}\boldsymbol{Q} = -\mathrm{ln~det}\tilde{\boldsymbol{Q}}$. Identifying term by term results straightforwardly in a set of relations between tilde and non-tilde quantities (henceforth referred to as the determinant relations),

Equation (41)

Equation (42)

Equation (43)

3.1.5. Evaluating the replica symmetric free entropy for GLM

Now decomposing the exponent in $I_{Y}$ (18):

Equation (44)

Introducing Hubbard–Stratonovich fields {ζa } and η for the last two sums of (44) and factorizing in the index a:

Equation (45)

Now that all terms are computed, the next step is to divide by s and take the $s\rightarrow 0$ limit, as prescribed by the replica trick (12). First, we need to enforce that all non-vanishing order-0 contributions cancel out, since the free entropy should not diverge. Then, one needs to compute the first-order terms that contribute to Φ (16).

At order 0, IY  = 1 since

Equation (46)

The cancellation of order 0 terms imposes:

Equation (47)

where the first term comes from the trace term (24). It follows that $\hat{r}^{0} = 0$. Moreover, because of the saddle point equality:

Equation (48)

derived straightforwardly from (16) we also have

Equation (49)

The order-1 contribution of IX can be rewritten by carrying out a change of variables $\xi\rightarrow \xi+\hat{q}^{-\frac{1}{2}}\hat{m}x^{0}$ in equation (26); IX then assumes the compact form:

Equation (50)

where

Equation (51)

Equation (52)

Assessing the order-1 contribution from IY requires more work. Changing $\eta\rightarrow \eta-\frac{\tilde{m}}{\sqrt{-\tilde{q}}}h^{0}$ in (45) yields:

Equation (53)

with

Equation (54)

Equation (55)

Expanding $\mathrm{ln}I_{Y}$ to $\mathcal{O}(s)$ (subscripts in parentheses standing for order in s) gives

Equation (56)

We used the fact that at order 0 terms canceled, and the identity $\int D\eta \int dy g^{0}(y,\eta) = \sqrt{2\pi r^{0}}+\mathcal{O}(s)$. But

Equation (57)

thus

Equation (58)

It is actually possible to perform Gaussian changes of variables in the last term so as to exactly cancel the first three contributions in (58). To do this:

Equation (59)

Equation (60)

(we used the determinant relations (41)–(43)), allowing us to rewrite the last term in (58) as:

Equation (61)

with

Equation (62)

It is straightforward to see:

Equation (63)

so the last term in (61) is:

Equation (64)

Similarly the first terms in (61) are:

Equation (65)

We again used the determinant relations (41)–(43) in the last line. The tilde quantities in the s = 0 limit can then accordingly be replaced by their expressions (33)–(37):

Equation (66)

Equation (67)

Equation (68)

Equation (69)

Ultimately some changes of variables can be used to bring $g^{0}_{(0)}$ to a more compact form:

Equation (70)

Finally

Equation (71)

with

Equation (72)

Equation (73)

3.1.5.1. Replica symmetric free entropy for GLM

Putting everything together, the replica symmetric free entropy (16) reads:

Equation (74)

Equation (75)

Equation (76)

Equation (77)

Equation (78)

Equation (79)

3.2. Replica free energy for the perceptron

The Bayes-optimal teacher-student setting for the perceptron is defined by the following measures (see section 1.2):

Equation (80)

Equation (81)

3.2.1. Replica symmetric free entropy for the perceptron

We shall simply plug the particular perceptron prior and channel, (80) and (81), into the generic GLM expressions (74). First,

Equation (82)

Equation (83)

while two straightforward Gaussian integrals yield:

Equation (84)

Equation (85)

Equation (86)

Thus,

Equation (87)

Now defining the special function $H(x)\equiv\int\limits_{x}^{\infty}Dt$, with $Dt = \frac{dt}{\sqrt{2\pi}}e^{-\frac{t^{2}}{2}}$ the standard Gaussian measure,

Equation (88)

In writing this we took into account the fact that y = ±1, which also implies replacing the integral over y in the energetic part by a sum over {±1}. Furthermore,

Equation (89)

from which it follows that the energetic term in equation (74) reads:

Equation (90)

The y = 1 and y = −1 contributions are equal modulo a double change of variables $\zeta,\eta\rightarrow -\zeta,-\eta$, whence the factor 2. Thus for the perceptron:

Equation (91)

4. Large deviation results

In figure 1, we show the results of the large deviation analysis at α = 3. Note that the qualitative picture is unaltered when α is varied (e.g. equivalent results for α = 10 are shown in section 3.2). The different curves, obtained at fixed values of the budget n, show the complexity (i.e. the exponential rate of the number) of possible subset choices, Σ, that realize the corresponding Gardner volumes v. As expected, the maximum of each curve is observed at β = 0, and yields the typical Gardner volume of a teacher-student perceptron that has learned to correctly classify nN i.i.d. Gaussian input patterns. The associated complexity is simply the (rescaled) logarithm of the binomial coefficient counting the choices of nN samples among the αN in the pool.

Figure 1.

Figure 1. (Left) Complexity–volume curves Σ(n, v) for various budgets n, at pool size α = 3 extracted from the large deviation computations. These curves reach their maxima at a point with coordinates corresponding to the Gardner volume of randomly chosen nN samples, and log-number of choices of nN elements among αN ones. (Right) The magnetization order parameter m (in other words the teacher–student overlap) as a function of the Gardner volume v for a pool of cardinality α = 3, as extracted from the large deviation computations. As is physically intuitive, smaller Gardner volumes imply larger values of the magnetization.


The cases where the extremum in equation (9) is realized for positive values of β describe choices of the labeled subsets that induce atypically large Gardner volumes: these correspond to AL scenarios where the student query is worse than random sampling. The number of possible realizations of these scenarios decreases exponentially as one approaches the right-hand extremum of the region where the complexity curve is positive, describing the largest possible volume at that given budget n. An important remark is that as soon as $\beta\gt0$, the statistics of the input patterns in the labeled set are no longer i.i.d. but develop increasing correlations for larger β.

On the other side, negative values of β induce atypically small Gardner volumes and labeled subsets with high information content. Again, as one spans smaller and smaller volumes the associated complexity drops, making the problem of finding these desirable subsets harder and harder. The left positive-complexity extremum of the curves in the left plot of figure 1 corresponds to the smallest reachable Gardner volumes. We observe in the figure that for larger values of the budget the complexity curves saturate quickly, very close to the smallest possible Gardner volume, corresponding to the Gardner volume for the entire pool of samples v(α), suggesting that past a certain budget querying additional examples does not add much information.

In the right plot of figure 1, we also show the prediction for the typical value of the magnetization, i.e. the overlap between teacher and students, as the Gardner volume is varied. As mentioned in section 2, small Gardner volumes induce high magnetizations and thus low generalization errors.

In figure 2 the typical (purple) and corresponding minimum (orange, yellow, cyan) Gardner volumes are depicted as a function of the budget n for various pool sizes α = 3,  10,  100. Note that the qualitative picture is unaltered when α is varied. We further observe that the minimum volume becomes very close to the Gardner volume of the entire pool of samples v(α) already for very small budgets n.

Figure 2.

Figure 2. Typical (maximum complexity) Gardner volume (purple, decreasing linearly at large α) and information-theoretically smallest achievable one (orange, yellow and blue) extracted from the large deviation computation for α ∈ {3, 10, 100}. The horizontal lines depict the value of the Gardner volume corresponding to the whole pool; we see the fast saturation of the lowest volumes at these lines. The information-theoretic volume-halving limit $2^{-n}$ for label-agnostic AL procedures is plotted as a dotted line. We notice that the qualitative picture is essentially unchanged when α is varied.


4.1. Additional numerical confirmation

We supply numerical evidence for some assumptions made in this work, in particular the replica trick (12) and the use of the Gardner volume as a measure of informativeness (see section 2).

First, we sample numerically at random subsets of cardinalities n ∈ {0.3, 0.6, 0.9, 2.7} out of a pool of cardinality α = 3, and plot the complexity extracted therefrom in figure 3. The volumes were evaluated using the approximate message passing (AMP) algorithm 2, and simulations were performed at N = 20, with $10^{7}$ draws, for a fixed teacher. Such a large number of samples and such a small system size are needed in order to access exponentially rare events. For larger sizes the probability of a rare event is exponentially smaller and proportionally more samples would be needed, which is not computationally accessible. Because of the $\mathcal{O}(10^7\times N^2)$ complexity, N has been kept small, while AMP is known to be valid only in the $N\uparrow\infty$ limit, hence inducing errors due to finite size. Nevertheless, the agreement with the theoretical curves for Σ(n, v) is quite good.
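A sketch of how such an empirical complexity curve could be assembled from the sampled volumes (the array volumes would be produced by repeatedly running the AMP algorithm 2 and evaluating its Bethe free entropy on each random subset; function names and binning choices are ours):

    import numpy as np
    from scipy.special import gammaln

    def empirical_complexity(volumes, N, alpha, n, bins=50):
        """Estimate Sigma(n, v) from volumes of randomly drawn subsets:
        the number of subsets with volume near v is approximately
        binom(alpha*N, n*N) times the empirical probability of that bin."""
        counts, edges = np.histogram(volumes, bins=bins)
        p_hat = counts / counts.sum()
        log_binom = (gammaln(alpha * N + 1) - gammaln(n * N + 1)
                     - gammaln(alpha * N - n * N + 1))
        with np.errstate(divide="ignore"):
            sigma = (log_binom + np.log(p_hat)) / N
        centers = 0.5 * (edges[1:] + edges[:-1])
        keep = p_hat > 0
        return centers[keep], sigma[keep]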

Figure 3.

Figure 3. Complexity vs volume curves for α = 3 and n ∈ {0.3, 0.6, 0.9, 2.7}. The dots are the values extracted from numerical experiments performed at N = 20 by repeatedly ($10^{7}$ times) sampling passively a subset of cardinality n out of a fixed pool of size α = 3. Solid lines are the theoretical complexities as predicted by the large deviation computations, see also figure 1. Volumes were evaluated using the AMP algorithm 2. The agreement is rather good, given the discrepancies that are to be expected from running the AMP algorithm 2 at finite and small N.


Algorithm 1. Uncertainty sampling.
  Select heuristic strategy from table 1
   Define batch size k
   Initialize $S\subset\mathcal{S} = \{\boldsymbol{F}_{\mu}\}_{1\leqslant\mu\leqslant\alpha N}$ ($|S|\gt0$)
   while $|S|\lt nN$ do
     Obtain required estimates given S
     Obtain model predictions at data-points in $S^{c}$
     Sort predictions according to sorting criterion
     Add first k elements in the sorting permutation to S
   end while
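A compact Python rendering of this loop might look as follows (a sketch, assuming user-supplied fit and criterion functions corresponding to the entries of table 1; all names are illustrative and ours):

    import numpy as np

    def uncertainty_sampling(F, query_label, fit, criterion, n, k=20, seed=0):
        """Generic pool-based uncertainty sampling (cf. algorithm 1).
        F: (P, N) pool; query_label(mu) returns the true label of sample mu;
        fit(F_S, y_S) returns model estimates; criterion(estimates, F_rest)
        scores unlabeled samples (lower score = more informative)."""
        rng = np.random.default_rng(seed)
        P, N = F.shape
        labeled = list(rng.choice(P, size=k, replace=False))   # small random seed set
        labels = {mu: query_label(mu) for mu in labeled}
        while len(labeled) < int(n * N):
            estimates = fit(F[labeled], np.array([labels[mu] for mu in labeled]))
            rest = np.setdiff1d(np.arange(P), labeled)
            scores = criterion(estimates, F[rest])
            for mu in rest[np.argsort(scores)[:k]]:             # k most uncertain samples
                labels[mu] = query_label(mu)
                labeled.append(mu)
        return labeled, labels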

Algorithm 2. Single-instance AMP for the perceptron.
   Initialize $\boldsymbol{\hat{x}}\leftarrow 0$
   Initialize $\boldsymbol{\hat{\Delta}}\leftarrow 1$
   Initialize $\boldsymbol{g}_{\mathrm{out}}\leftarrow 0$
   while Convergence criterion not satisfied do
     $\Gamma_{\mu}^{t}\leftarrow\sum\limits_{i}(F_{i}^{\mu})^{2}\hat{\Delta}_{i}^{t-1}$
     $\omega^{t}_{\mu}\leftarrow\sum\limits_{i}F_{i}^{\mu}\hat{x}_{i}^{t-1} -\Gamma_{\mu}^{t}g_{\mathrm{out},\mu}^{t-1}$
     $g_{\mathrm{out},\mu}^{t}\leftarrow\frac{y^{\mu}}{\sqrt{2\pi \Gamma_{\mu}^{t}}}\frac{e^{-\frac{(\omega_{\mu}^{t})^{2}}{2\Gamma_{\mu}^{t}}}}{H\left(-\frac{y^{\mu}\omega_{\mu}^{t}}{\sqrt{\Gamma^{t}_{\mu}}}\right)}$
     $(\Sigma_{i}^{t})^{-1}\leftarrow-\sum\limits_{\mu}(F_{i}^{\mu})^{2} \left( -\frac{\omega^{t}_{\mu}}{\Gamma^{t}_{\mu}}g_{\mathrm{out},\mu}^{t}-(g_{\mathrm{out},\mu}^{t})^{2}\right)$
     $ R_{i}^{t}\leftarrow\hat{x}_{i}^{t-1}+\Sigma_{i}^{t} \sum\limits_{\mu}F_{i}^{\mu} g_{\mathrm{out},\mu}^{t}$
     $\hat{x}_{i}^{t}\leftarrow\frac{R_{i}^{t}}{1+\Sigma^{t}_{i}}$
     $\hat{\Delta}_{i}^{t}\leftarrow\frac{\Sigma_{i}^{t}}{1+\Sigma^{t}_{i}}$
   end while
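The iteration above translates directly into numpy; the following is our own vectorized sketch (a light damping option is included since, as noted in section 5, AMP run on correlated AL selections may need it to converge; no numerical safeguards are included):

    import numpy as np
    from scipy.special import erfc

    def H(x):
        # Gaussian tail function H(x) = int_x^infty Dt = 0.5*erfc(x/sqrt(2))
        return 0.5 * erfc(x / np.sqrt(2.0))

    def amp_perceptron(F, y, max_iter=200, tol=1e-7, damping=0.0):
        """Single-instance AMP for the noiseless perceptron (cf. algorithm 2).
        F: (P, N) matrix of selected inputs, y: (P,) vector of +-1 labels."""
        P, N = F.shape
        F2 = F ** 2
        x_hat, delta_hat, g_out = np.zeros(N), np.ones(N), np.zeros(P)
        for _ in range(max_iter):
            x_old = x_hat.copy()
            gamma = F2 @ delta_hat                      # output variances Gamma_mu
            omega = F @ x_hat - gamma * g_out           # Onsager-corrected means omega_mu
            g_out = (y / np.sqrt(2 * np.pi * gamma)) \
                    * np.exp(-omega ** 2 / (2 * gamma)) / H(-y * omega / np.sqrt(gamma))
            dg_out = -(omega / gamma) * g_out - g_out ** 2
            sigma = 1.0 / (-(F2.T @ dg_out))            # input variances Sigma_i
            R = x_hat + sigma * (F.T @ g_out)
            x_hat = (1 - damping) * (R / (1.0 + sigma)) + damping * x_old
            delta_hat = sigma / (1.0 + sigma)           # Gaussian-prior denoisers
            if np.linalg.norm(x_hat - x_old) / np.sqrt(N) < tol:
                break
        return x_hat, delta_hat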

We finally present a numerical check of the theoretical prediction for the m(v) curves, see figure 1. At large instance size, N = 2000, it is not computationally feasible to obtain sufficient statistics for observing the large deviations of the volume through passive subset sampling, as was done in the previous experiment. Thus, we resorted to the label-informed AL-AMP strategy (see section 5, algorithm 1 and table 1) for biasing the subset selection towards more/less informative subsets. In particular, we constructed each subset by mixing varying ratios of maximally informative samples (selected according to the informed AL-AMP procedure) and minimally informative samples (selected according to the same procedure but with the reversed sorting order). In figure 4, the pool size is α = 10 and the budget is fixed to n = 1.5. For each subset, the AMP algorithm 2 was run to get the estimator $\hat{\boldsymbol{x}}$ and the magnetization $m = \frac{\boldsymbol{x}^0\cdot \hat{\boldsymbol{x}}}{N}$ was deduced therefrom. This incidentally corroborates once more that using the Gardner volume instead of the magnetization to judge the informativeness of a selection is consistent.

Figure 4.

Figure 4. Magnetization m against Gardner volume v for various subsets. The experiments were performed at system size N = 2 × 10^3, pool size α = 10 and budget n = 1.5. Subsets covering a wide range of volumes were designed by varying the ratio of informative samples (using label-informed AL-AMP, see section 5) and uninformative samples (selected using simple passive learning). Magnetizations and volumes were evaluated for each subset by training the model using the AMP procedure 2. The solid line is the typical m(v) curve predicted by the large deviation computations, which agrees quite well with the numerical simulations.


Table 1. Summary of the specifics of the uncertainty sampling strategies considered in this paper.

Uncertainty sampling strategies
Heuristic | Required estimates | Sorting criterion
Agnostic AL-AMP | $\hat{\boldsymbol{x}}_\textrm{AMP}, \hat{\boldsymbol{\Delta}}_\textrm{AMP}$ | $\mathrm{argmin}_\mu \left|\mathrm{erf}\left(\frac{\boldsymbol{F}^{^{\prime} \mu}\hat{\boldsymbol{x}}_\textrm{AMP}}{\sqrt{2(\boldsymbol{F}^{^{\prime} \mu})^2\hat{\boldsymbol{\Delta}}_\textrm{AMP}}}\right)\right|$
Informed AL-AMP | $\hat{\boldsymbol{x}}_\textrm{AMP}, \hat{\boldsymbol{\Delta}}_\textrm{AMP}$ | $ \mathrm{argmax}_\mu \left|y^\mu - \mathrm{erf}\left(\frac{\boldsymbol{F}^{^{\prime} \mu}\hat{\boldsymbol{x}}_\textrm{AMP}}{\sqrt{2(\boldsymbol{F}^{^{\prime} \mu})^2\hat{\boldsymbol{\Delta}}_\textrm{AMP}}}\right)\right|$
Query by committee | $\{\boldsymbol{x}_\textrm{SGD}^k\}_{k = 1}^{K}$ | $\mathrm{argmin}_\mu |\sum_{k = 1}^{K} \mathrm{sign}\left(\boldsymbol{F}^{^{\prime} \mu}\cdot \boldsymbol{x}_\textrm{SGD}^k\right)|$
Logistic regression | $\boldsymbol{x_\textrm{log}}$ | $\mathrm{argmin}_\mu \left|\boldsymbol{F}^{^{\prime} \mu}\cdot \boldsymbol{x}_\textrm{log}\right|$
Perceptron learning | $\boldsymbol{x_\textrm{perc}}$ | $\mathrm{argmin}_\mu \left|\boldsymbol{F}^{^{\prime} \mu}\cdot \boldsymbol{x}_\textrm{perc}\right|$

4.2. Stability of the replica symmetric solution

We remark at this point that the presented replica calculation was obtained in the so-called RS ansatz. In general, it is possible for the RS result not to be asymptotically exact, requiring replica symmetry breaking (RSB) in order to evaluate the correct free entropy Φ(β, φ) [26]. In this model, while RSB is surely not needed close to the maximum of the complexity curves, as implied by the results in [18], it conversely has to be taken into account away from typical volumes. The volumes for which the RS ansatz (19)–(23) ceases to be valid can be evaluated by considering an infinitesimal perturbation thereof of one-step RSB form, as is detailed in appendix B, yielding the stability condition:

Equation (92)

Figure 5 shows in solid lines the regions in v space where the RS assumption holds. At the same time, the presence of RSB usually entails corrections that are very small in magnitude. We hence believe that the bounds reported here under the assumption of replica symmetry achieve a good degree of accuracy.

Figure 5.

Figure 5. Complexity vs volume curves for α = 3 and n ∈ {0.3, 0.6, 0.9, 2.7}, see also figure 1. Solid lines correspond to ranges of values for v where the replica symmetric ansatz is consistent (stable). As n increases the window of validity of the ansatz shrinks, signaling the emergence of non-trivial structures in the geometrical organization of the selection vectors associated with atypical Gardner volumes.


5. Algorithmic implications for active learning

5.1. Generic considerations

The setting investigated in this paper provides a unique chance to benchmark the algorithmic performance of any given pool-based AL algorithm against the optimal achievable performance, and to measure how closely the large deviation results are approached. This offers a controlled setup complementary to the usual one of simply comparing different algorithms on standard datasets, where no point of reference exists to evaluate the performance. The aim of the present section is hence to illustrate how the results on Gardner volumes reported above may serve to evaluate existing AL procedures. Before moving to such algorithmic performance comparisons, we should make a distinction between two possible classes of AL scenarios:

  • Label-agnostic settings, where the student has no prior knowledge of the ground truth labels. In other words, the AL selection must be based solely on the knowledge of the input patterns $\{\boldsymbol{F}^{\mu}\}$, and make no use of the true labels {y}. In this case, for binary labels there is a simple lower bound on the Gardner volume reachable with nN samples, $v\geqslant 2^{-n}$, which is obtained by the argument that every new sample can at best divide the current volume by a factor of two, see [4]. This idea is exploited in the famous QBC AL strategy, and the classical work [4] argues that the volume halving can actually be achieved when an unlimited set of samples is available. Plotting this volume-halving bound in figure 2 we see that, even though there exist subsets of the pool that would lead to smaller Gardner volumes, they cannot be found in a label-agnostic way.
  • Label-informed settings, where external knowledge on the true labels is available and can be used for extracting more information during the selection process. In many real-world applications the structure in the input data could be exploited (e.g. through clustering, transfer learning, etc) for making unsupervised guesses of the labels and for bootstrapping an AL strategy. A concrete example where external insight is available is drug discovery [7], where additional information can be inferred from the presence of chemical functional groups (or absence thereof) on the molecules in the data pool. In the present work, we study whether it is possible, with full access to the labels, to devise an efficient method for finding a subset of samples that achieves close to the minimal Gardner volume bound (note that this is still an algorithmically non-trivial problem).

In this section we will investigate both the label-agnostic and the label-informed strategies. We will benchmark several well-known AL algorithms on the model studied in the present paper, as well as design and test a new message-passing-based AL algorithm. Before doing that, let us describe the general strategy.

Many of the commonly used AL criteria rely on some form of label-uncertainty measure. Uncertainty sampling [1, 29] is an AL scheme based on the idea of iteratively selecting and labelling data-points where the prediction of the available trained model is the least confident. In general, the computational complexity associated to this type of scheme is of order $\mathcal{O}(N^{3})$, requiring an extensive number of runs of a training algorithm (which can scale as $\mathcal{O}(N^2)$ at best). Since even training a single model per pattern addition can become expensive in the large N setting, in all our numerical tests we opted for adding to the labeled set batches of k = 20 samples instead of a single sample per iteration. We remark that, despite the k-fold speed-up, the observed performance deterioration is negligible. The structure of this type of algorithm is sketched in algorithm 1.

5.2. Approximate message passing for AL (AL-AMP)

In general, estimating the Gardner volume on a given training set or the label-uncertainty of a new sample is a computationally hard problem. However, in perceptrons (or more general GLMs) with i.i.d. Gaussian input data $\boldsymbol{F}$, at large system size N one can rely on the estimate provided by a well known algorithm for approximate inference, AMP (historically also referred to as the Thouless–Anderson–Palmer (TAP) equations, see [30]). AMP is a standard iterative procedure used for Bayesian inference on the factor graph associated to a probability measure. We refer the interested reader to the ample literature dedicated to message-passing algorithms ([12, 23, 31] for example) for a more detailed discussion thereof. The AMP algorithm [13, 32, 33] yields (at convergence) an estimator of the posterior means, $\hat{\boldsymbol{x}}$, and variances, $\hat{\boldsymbol{\Delta}}$, thus accounting for the uncertainty in the inference process, including that on the label of a new sample. The Gardner volume v (corresponding to the so-called Bethe free entropy) can then be expressed as a simple function of the AMP fixed-point messages (see [34] for an example). We provide a pseudo-code of AMP in the case of the perceptron in algorithm 2. An important remark is that when the training set is not sampled randomly from the pool, as in the AL context, correlations can arise and AMP is no longer rigorously guaranteed to converge nor to provide a consistent estimate of the Gardner volume. In the present work, we can only argue that its employment seems to be justified a posteriori by the agreement between theoretical predictions and numerical experiments, for instance for the generalization error.

We use the AMP algorithm to introduce a new uncertainty sampling procedure relying on the information contained in the AMP messages, denoted as AL-AMP in the following. At each iteration, the single-instance AMP equations are run on the current training subset to yield posterior mean estimate $\hat{\boldsymbol{x}}$ and variance $\boldsymbol{\hat{\Delta}}$. These quantities can then be used to evaluate, for all the unlabeled samples, the output magnetization (i.e. the Bayesian prediction) defined as

Equation (93)

where we introduced the output overlaps $\boldsymbol{\omega} = \boldsymbol{F}^{^{\prime}}\cdot\hat{\boldsymbol{x}}$ and variances $\boldsymbol{\Gamma} = (\boldsymbol{F}^{^{\prime}}\odot \boldsymbol{F}^{^{\prime}})\cdot\hat{\boldsymbol{\Delta}}$, where $\odot$ is the component-wise product. The output magnetizations correspond to the weighted output average over all the estimators contained in the current version space, and their magnitude represents the overall confidence in the classification of the still unlabeled samples. This means that AMP provides an extremely efficient way of obtaining the information on uncertainty. The specifics of the algorithm can be found in table 1.

We also explore numerically the label-informed AL regime introduced in the previous section. We consider its limiting case by introducing the informed AL-AMP strategy, which can fully access the true labels $\boldsymbol{Y}$ in order to query the samples $\boldsymbol{F}^{\mu}$ whose output magnetisation $m_{\mathrm{out}}^\mu$ (93) is maximally distant from the correct classification $y^{\mu}$. This selection process can iteratively reduce the Gardner volume by factors larger than 2. Again, the relevant specifics of the informed AL-AMP algorithm are detailed in table 1.
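Both AL-AMP criteria of table 1 reduce to simple scoring rules once the AMP estimates $\hat{\boldsymbol{x}}$ and $\hat{\boldsymbol{\Delta}}$ on the current labeled set are available. A sketch (F_rest denotes the still-unlabeled samples; function names are ours, and lower scores are taken to mean 'query first', consistent with the uncertainty-sampling loop above):

    import numpy as np
    from scipy.special import erf

    def output_magnetization(F_rest, x_hat, delta_hat):
        """Bayesian prediction m_out = erf(omega / sqrt(2*Gamma)) at unlabeled samples."""
        omega = F_rest @ x_hat
        gamma = (F_rest ** 2) @ delta_hat
        return erf(omega / np.sqrt(2.0 * gamma))

    def agnostic_alamp_scores(F_rest, x_hat, delta_hat):
        # least confident Bayesian prediction first (argmin |m_out|)
        return np.abs(output_magnetization(F_rest, x_hat, delta_hat))

    def informed_alamp_scores(F_rest, y_rest, x_hat, delta_hat):
        # most "surprising" samples first (argmax |y - m_out|), negated for argmin sorting
        return -np.abs(y_rest - output_magnetization(F_rest, x_hat, delta_hat))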

5.3. Other tested measures of uncertainty

One of the most widely used uncertainty sampling procedures is the so-called QBC strategy [4, 16]. In QBC, at each time step, a committee of K students is sampled from the version space (e.g. via the Gibbs algorithm). The committee is then employed to choose the labels to be queried, by identifying the samples where maximum disagreement among the committee members' outputs is observed. The QBC algorithm was introduced as a proxy for doing bisection, i.e. cutting the version space into two equal-volume halves. As already mentioned, this constitutes the optimal information gain in a label-agnostic setting [35]. Note, however, that the QBC procedure can achieve volume-halving only in the infinite-size committee limit, $K\uparrow\infty$, with uniform version space sampling and with availability of infinitely many samples. Obviously, running a large number K of ergodic Gibbs sampling procedures quickly becomes computationally unfeasible. Moreover, in pool-based AL the pool of samples is limited. In order to allow comparison with other strategies at finite sizes, we approximated the uniform sampling with a set of greedy optimization procedures (e.g. stochastic gradient descent) from random initialization conditions, checking numerically that this yields a committee of students reasonably spread out in version space. It is possible to ensure a greater coverage of the version space by performing a short Monte-Carlo random walk for each committee member. The effect has been found to be small for computationally reasonable lengths of walk.

We also implemented an alternative uncertainty sampling strategy, relying on a single training procedure (e.g. training with the perceptron algorithm or logistic regression) per iteration: in this case, the uncertainty information is extracted from the magnitude of the pre-activations measured at the unlabeled samples after each training cycle. This strategy implements the intuitive geometric idea of looking for the samples that are most orthogonal to the available updated estimator, which are more likely to halve the version space independently of the value of the true label.
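These remaining heuristics of table 1 also amount to simple scoring rules; a sketch (names are ours; the committee members would be obtained, e.g., by independent SGD runs from random initializations, as described above):

    import numpy as np

    def qbc_scores(F_rest, committee):
        """Query-by-committee: smallest absolute vote sum = maximal disagreement.
        committee: array of shape (K, N) of trained weight vectors."""
        votes = np.sign(F_rest @ np.asarray(committee).T)   # shape (P_rest, K)
        return np.abs(votes.sum(axis=1))

    def margin_scores(F_rest, x_est):
        """Logistic-regression / perceptron uncertainty sampling:
        smallest pre-activation magnitude = most orthogonal to the estimator."""
        return np.abs(F_rest @ x_est)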

All the tested procedures, with the exception of QBC, display a complexity of approximately $\mathcal{O}(N^3)$, possibly to be corrected by a factor accounting for the number of iterations required at each training step. The adapted QBC algorithm similarly has complexity $\mathcal{O}(KN^3)$, as it involves a committee of K models. Ideally, the original QBC algorithm [4] would require at each step to sample uniformly the current version space, thus implying an exponentially costly Monte Carlo step.

5.4. Algorithmic results

In figure 6, we compare the minimum Gardner volume obtained from the large deviation calculation with the algorithmic performance obtained on synthetic data at finite size, N = 2 × 10^3, by the AL-AMP algorithms detailed in algorithm 1 and table 1. The data-pool size is fixed to α = 3. The large deviation analysis yields values for the minimum and maximum achievable Gardner volumes at any budget n. We compare the algorithmic results also with the prediction for the typical case and with the volume-halving curve $2^{-n}$. Since in the considered pool-based setting the volume-halving performance cannot be achieved for volumes smaller than the Gardner volume corresponding to the entire pool v(α), the relevant volume-halving bound is more precisely $\mathrm{max}(2^{-n},v(\alpha))$. Random sampling displays good agreement with the expected typical volumes. Most notably, the label-agnostic AL-AMP algorithm tightly follows the volume-halving bound $\mathrm{max}(2^{-n},v(\alpha))$, thus reaching close to the optimal possible performance. Since for large α one has $v(\alpha) = \mathrm{const.}/\alpha$ [15], we conclude that the AL-AMP algorithm will reach close to the minimum possible Gardner volumes for a budget $n \sim {\cal O}[\log(\alpha)]$.


Figure 6. (Left) Performance of the label-agnostic (yellow circles) and label-informed (blue circles) AL-AMP, plotted together with the minimum and maximum values of the Gardner volume extracted from the large deviation computation (purple and green) and the volume-halving curve (dotted black). For comparison we also plot the typical Gardner volume (cyan) and the one obtained by random sampling (orange squares). Numerical experiments were run for system size $N = 2\times10^3$ and pool size α = 3. For each algorithmic performance curve the average over 10 samples is presented. Fluctuations around the average were found to be negligible and are not shown. (Right) The same plot with the Gardner volume replaced by the Bayesian test accuracy, derived in appendix C. For the AL-AMP algorithm the accuracy is evaluated using a test set of size $P_{\mathrm{test}} = 5\times10^4$. The qualitative picture is very similar to the one for the Gardner volume curves (left), once more confirming that Gardner volumes and generalization errors both constitute good measures of informativeness.


Theoretical justification of this very good performance is however yet to be established: indeed, due to the purely sequential nature of the employed AL scheme, we are not accounting for possible correlations between subsequent queries. Moreover, in machine learning practice, higher-order methods for AL (like expected model change [36], with $\mathcal{O}(N^4)$ complexity) have been shown to potentially perform better than uncertainty sampling. We conjecture that the observed optimal performance of AL-AMP can be traced back to the noiseless nature of the considered learning setting, and we leave the study of AL in the presence of noise for future work. Note that, even in our pool-based setting, we thus obtain an exponential reduction in the number of samples, similar to the original QBC work [4].

The label-informed AL-AMP also approaches the theoretically minimal volume, but not as closely. We remark that an important limitation of the AL-AMP algorithm comes from the fact that AMP is not guaranteed to provide good estimators (or to converge at all) on correlated data. For example, in the numerical experiments for the informed AL-AMP curve, we had to resort to mild damping schemes in the message passing in order for fixed points to be reached. This effect was stronger for the label-informed algorithm than for the label-agnostic one.

In figure 7, we provide a numerical comparison of the performance of the agnostic AL-AMP and the other label-agnostic AL algorithms mentioned above. The finite-size experiments were run at $N = 2\times10^3$, while here we set α = 10. Note that, while the different AL strategies were employed for selecting the labeled subset, in all cases supervised learning and the related performance estimates were obtained by running AMP. In the plot, we can see that, while AL-AMP is able to extract very close to the maximum amount of information from each query (one bit per pattern, until the volume v(α) is saturated), other heuristics with the same computational complexity are sub-optimal. In particular, for the simplified QBC we observe that increasing the committee size K does not noticeably change its performance, most probably because the committee cannot cover a sufficient portion of the version space if the computational cost is to be kept reasonable. On the other hand, using the magnitude of the pre-activations allows better performance while also being more time-efficient, since only a single perceptron, rather than a committee thereof, has to be trained at each step. The logistic loss yields rather good performance, close to that of AL-AMP, while uncertainty sampling with the perceptron algorithm yields a more modest performance.


Figure 7. (Left) Performance of the label-agnostic algorithms presented in table 1, plotted against the budget n and compared to the volume-halving lower bound. Experiments were performed at system size $N = 2\times10^3$ and pool size α = 10. For each algorithm the average over 10 samples is presented. Note that error bars are smaller than the marker size. (Right) (Bayesian) test accuracy of the same heuristics for various budgets n. The test set size was chosen to be $P_{\mathrm{test}} = 5\times10^4$. In blue, the Bayesian test accuracy for a typical subset, see appendix C. Again, the qualitative picture is unchanged going from the Gardner volume to the test accuracy.


We leave a more systematic benchmarking of the many existing strategies for future work, stressing that, while there certainly exist more involved procedures that can outperform the presented heuristics, the absolute performance bounds still apply, regardless of the implemented AL strategy. Because the results reported in the present work are specific to the GLM setup with Gaussian data, future investigations should also extend the AL-AMP procedure to real-world datasets, as opposed to the synthetic data used here, and check whether the good performance generalizes. It should be stressed that such an endeavor would likely require a variant of AMP, namely vectorial AMP [37], which is more robust to mismatches between the data and the model assumptions.

6. Conclusions

Using the replica method for the large deviation calculation of the Gardner volume, we computed, for the teacher-student perceptron model, the minimum Gardner volume (equivalently, maximum mutual information) achievable by selecting a subset of fixed cardinality from a pre-existing pool of i.i.d. normal samples. We evaluated the large deviation function under the RS assumption; checking for RSB and evaluating the eventual corrections to the presented results is left for future work, as is a rigorous establishment of the presented results. Our result for the information-theoretic limit of pool-based AL in this setting complements the already known volume-halving bound for label-agnostic strategies. We hope our result may serve as a guideline to benchmark future heuristic algorithms on the present model, while our modus operandi for the derivation of the large deviations may help future theoretical analyses of AL in more realistic settings. We presented the performance of some known heuristics and proposed the AL-AMP algorithms to perform uncertainty-based AL. We showed numerically that, on the present model, the label-agnostic AL-AMP algorithm performs very close to the optimal bound, thus being able to achieve the accuracy corresponding to the entire pool of samples with exponentially fewer labeled samples.

Acknowledgments

We want to thank Guilhem Semerjian for clarifying discussions in the early stages of this work. This work is supported by the ERC under the European Union's Horizon 2020 Research and Innovation Program 714608-SMiLe.

Data availability statement

No new data were created or analysed in this study.

Appendix A.: Saddle point equations for the perceptron

A.1. Saddle-point equations for the perceptron

The canonical way of carrying out the extremization in equation (91) is to derive the saddle-point equations (zero-gradient conditions) and solve them. The saddle-point equations associated with Φ(β, φ) (equation (91)) read:

Equation (A.1)

Equation (A.2)

Equation (A.3)

Equation (A.4)

Equation (A.5)

Equation (A.6)

Equation (A.7)

Equation (A.8)

Equation (A.9)

In practice, equations (A.1)–(A.9) are iterated until convergence. The time indices are derived from an independent computation using AMP [13, 24], not shown here; they indicate the order in which the equations ought to be iterated in order to converge. Remark that the update schedule simply consists in updating in parallel all order parameters m, q, Q, r, then all auxiliary (hatted) order parameters $\hat{m},\hat{q},\hat{Q},\hat{r}$. After convergence, the order parameters can be plugged into equation (91) to evaluate the free entropy Φ and subsequently the complexity Σ(n, v), by inverting the Legendre transform (9).
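Schematically, the iteration can be organized as in the following sketch, where update_plain and update_hatted are placeholders for the right-hand sides of equations (A.1)–(A.9) (not spelled out here) and the mild damping is an optional numerical convenience; initial values and the convergence threshold are arbitrary.

```python
def solve_saddle_point(beta, phi, alpha, update_plain, update_hatted,
                       damping=0.3, tol=1e-8, max_iter=5000):
    """Fixed-point iteration of the saddle-point equations (A.1)-(A.9).

    `update_plain` and `update_hatted` stand for the right-hand sides of the
    non-hatted and hatted equations respectively; each takes the current value
    of the other set of order parameters and returns a dict with keys
    'm', 'q', 'Q', 'r'.
    """
    params = {'m': 0.1, 'q': 0.1, 'Q': 0.5, 'r': 0.1}   # arbitrary initialization
    hatted = {'m': 0.0, 'q': 0.0, 'Q': 0.0, 'r': 0.0}
    for _ in range(max_iter):
        new_params = update_plain(hatted, beta, phi, alpha)     # m, q, Q, r updated in parallel
        diff = max(abs(new_params[k] - params[k]) for k in params)
        params = {k: damping * params[k] + (1 - damping) * new_params[k] for k in params}
        new_hatted = update_hatted(params, beta, phi, alpha)    # then the hatted parameters
        hatted = {k: damping * hatted[k] + (1 - damping) * new_hatted[k] for k in hatted}
        if diff < tol:
            break
    return params, hatted
```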

For illustration, we present in figure A1 the complexity curves at α = 10 for several budgets n, and refer the interested reader to figure 1 for the same plot at pool size α = 3. As the budget n is increased, smaller values of the Gardner volume become accessible, provided sufficiently informative subsets are found. A detailed discussion can be found in section 3.


Figure A1. Complexity Σ(n, v) as a function of the Gardner volume v for budgets $n\leqslant 2$ at pool size α = 10, extracted from the large deviation computations; see also figure 1 for the same curves at a different pool size. For any budget n, the maximum complexity corresponds to the volume reached by random subset selection (passive learning), see the discussion in section 3 of the main text. Note that the qualitative shape of the complexity curves remains essentially unchanged as the pool size α is varied (see figure 1 for α = 3).


Appendix B.: Stability of the replica symmetric ansatz

In this appendix we investigate the stability of the RS ansatz (19)–(23) under an infinitesimal perturbation, which we choose to be of one-step RSB (1RSB) form. This is tantamount to ascertaining whether the extremum in the optimization problem (16) is reached outside of the subspace of RS matrices (19)–(23). We first give the expression of the 1RSB free entropy in the general case of a GLM, before specializing it to the perceptron. Finally, we analyze the stability of the 1RSB saddle-point equations under an infinitesimal departure from the RS ansatz, and give a condition for the RS assumption to be consistent.

B.1. 1RSB ansatz

We depart from the RS setting and assume the disorder (the selection variables σa) to be 1RSB, with Parisi parameter denoted τ [25, 26]. For convenience we replace the index $1\leqslant a \leqslant s$ by a double index $(a^{^{\prime}},a)$, with $1\leqslant a^{^{\prime}}\leqslant \frac{s}{\tau}$ the 1RSB cluster index and $1\leqslant a\leqslant \tau$ indexing the selection variables within the same block. As a consequence, the $\boldsymbol{x}$ variables carry a triple index $(a^{^{\prime}},a,\alpha)$. We also need to split the former overlap q into q1 (for students seeing different disorders pertaining to the same 1RSB cluster) and q0 (for students seeing disorders from different clusters).

B.2. Inverse and determinant of a 1RSB planted matrix

Let us denote as before $\boldsymbol{\tilde{Q}} = \boldsymbol{Q}^{-1}$, with elements $\tilde{r}^{0},\tilde{m},\tilde{r},\tilde{Q},\tilde{q_{1}},\tilde{q_{0}}$. Inverting the matrix is tantamount to solving the equations below (a small numerical check of this block structure is sketched at the end of this subsection):

Equation (B.1)

Equation (B.2)

Equation (B.3)

Equation (B.4)

Equation (B.5)

Equation (B.6)

Equation (B.7)

The solutions are:

Equation (B.8)

Equation (B.9)

Equation (B.10)

Equation (B.11)

Equation (B.12)

Equation (B.13)

One can also compute $\mathrm{det}\boldsymbol{Q}$ by finding the eigenvectors. There is a pair of eigenvectors whose eigenvalues have product

Equation (B.14)

and $\frac{s}{\tau}$ eigenvectors with eigenvalue:

Equation (B.15)

and $\frac{s}{\tau}(\tau-1)$ eigenvectors with eigenvalue:

Equation (B.16)

Thus,

Equation (B.17)
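The block structure exploited above can be checked numerically. The sketch below builds a simplified 1RSB matrix (the student block only, omitting the teacher row and column, so it does not reproduce the planted structure of the equations above) and verifies that its inverse retains the three-value 1RSB structure and that the determinant factorizes over a small number of distinct eigenvalues, which is the mechanism behind equation (B.17). Parameter values are arbitrary.

```python
import numpy as np

s, tau = 12, 3                       # s replicas grouped into s/tau clusters of size tau
Q_d, q1, q0 = 1.0, 0.6, 0.2          # arbitrary diagonal, intra-cluster and inter-cluster overlaps
blocks = s // tau

# Build the (unplanted) 1RSB overlap matrix
M = np.full((s, s), q0)
for c in range(blocks):
    sl = slice(c * tau, (c + 1) * tau)
    M[sl, sl] = q1
np.fill_diagonal(M, Q_d)

# The inverse keeps the same three-value structure:
# one diagonal value, one within-cluster value, one across-cluster value.
Minv = np.linalg.inv(M)
print(len(np.unique(np.round(Minv, 8))))      # -> 3 distinct entries

# The spectrum contains only a few distinct eigenvalues, so the determinant
# factorizes as a product of a few eigenvalue powers.
eig = np.linalg.eigvalsh(M)
print(len(np.unique(np.round(eig, 8))))       # -> 3 distinct eigenvalues
assert np.isclose(np.prod(eig), np.linalg.det(M))
```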

B.3. Computing the 1RSB free entropy

The trace term $\mathrm{Tr}\boldsymbol{\hat{Q}}\boldsymbol{Q}$ in (16) now reads, within the 1RSB ansatz,

Equation (B.18)

For the IX term we proceed in a very similar manner as in the RS case, see section 3.1. The end result is:

Equation (B.19)

Equation (B.20)

Equation (B.21)

Note that the results

Equation (B.22)

Equation (B.23)

also carry over to the 1RSB case. The treatment of the IY term requires, as in the non-RSB case, more care. The computation is very analogous (though lengthier) and is not further detailed here. We report only the end result:

Equation (B.24)

Equation (B.25)

Equation (B.26)

In summary the 1RSB free entropy reads:

Equation (B.27)

Equation (B.28)

Equation (B.29)

Equation (B.30)

Equation (B.31)

B.4. Specialization to the perceptron

We now specialize the 1RSB free entropy (B.27), derived above for a generic GLM, to the special case of the perceptron. The only term that differs non-trivially from the RS specialization reported in section 3.2 is $I_{X}^1$, for which one further (Gaussian) integration has to be carried out. We give the final formula for the perceptron free entropy:

Equation (B.32)

B.5. Stability analysis

Having derived the free entropy (B.32) in the 1RSB ansatz, we can now study the stability of the RS solution with respect to an infinitesimal 1RSB perturbation. To that end we consider a 1RSB ansatz departing infinitesimally from the RS form:

Equation (B.33)

Equation (B.34)

We denote by $q,\hat{q}$ the common zeroth-order value of these overlaps. The entropic part (trace term and IX) of the free entropy (B.32) then reads, to first order in $\epsilon, \hat{\epsilon}$,

Equation (B.35)

Prior to expanding the energetic part IY it is convenient to define the shorthand

Equation (B.36)

as we did in the RS computation, see section 3.2. Expanding $\log\int D\eta \left[ 1+e^{\phi}\int D\zeta\, H\left(\ldots\right)^{\beta} \right]^{\tau}$ in the perturbation:

Equation (B.37)

Equation (B.38)

Equation (B.39)

In going from the first to the second line we used that the $\mathcal{O}(\epsilon^{1/2})$ term vanishes upon integration over η. The first fraction in the second line is found to vanish using an integration by parts.

The variables $\epsilon,\hat{\epsilon}$ enter the saddle-point equations. If iterating the saddle-point equations drives $\epsilon,\hat{\epsilon}$ to grow, then the assumption that the RS fixed point is stable ceases to hold. To obtain the dynamical equations for the pair $\epsilon,\hat{\epsilon}$, one has to write the saddle-point equations associated with the zero-gradient conditions in the q and $\hat{q}$ directions. Note that the derivatives of terms not involving $\epsilon$ or $\hat{\epsilon}$ eventually sum to zero, since we assume we sit at the RS ($\epsilon = \hat{\epsilon} = 0$) fixed point. The first equation, $\partial_{\hat{q}}\Phi_{\mathrm{1RSB}} = 0$, reads

Equation (B.40)

while $\partial_{q}\Phi_{\mathrm{1RSB}} = 0$ implies:

Equation (B.41)

Equation (B.42)

We choose not to write out the q derivative explicitly, as the expression is rather lengthy. The stability condition, expressing the fact that the dynamical system (B.40) and (B.41) converges to $\epsilon = \hat{\epsilon} = 0$, then reads:

Equation (B.43)

The stability condition (92) can be evaluated numerically using the values of the order parameters obtained at convergence of the saddle-point equations (A.1)–(A.9). The results are presented in figure 5. As the budget n increases, the region of validity of the RS assumption shrinks. This suggests that the selections $\{\sigma_{\mu}\}_{1\leqslant\mu\leqslant\alpha N}$ leading to atypical enough volumes v are grouped into isolated clusters in configuration space, i.e. the landscape is 1RSB, this effect being more pronounced for larger budgets. That the RS assumption should break down away from typicality is expected. However, in most known settings, this simple ansatz proves to be a very good approximation nonetheless, and taking into account further steps of replica symmetry breaking usually yields only minor improvements [12]. We therefore believe that the bounds reported here have a good degree of accuracy, although they could certainly be improved by taking further symmetry breaking into account. The precise evaluation of these corrections is left for future investigation.
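In practice, the check amounts to verifying that the linearized two-dimensional map for $(\epsilon,\hat{\epsilon})$ defined by (B.40) and (B.41) is contracting. The following generic sketch assumes the linearization takes the usual product form of such perturbative analyses and that the two relevant partial derivatives have already been evaluated at the RS fixed point; the numerical values below are purely hypothetical.

```python
def rs_stable(d_eps_hat_d_eps, d_eps_d_eps_hat):
    """Stability of the RS fixed point of the map (eps, eps_hat) -> (eps', eps_hat').

    The two arguments stand for the partial derivatives appearing in the
    linearized versions of equations (B.40) and (B.41), evaluated at the RS
    fixed point (their explicit expressions are not reproduced here).
    Composing the two updates gives eps -> (d eps / d eps_hat)(d eps_hat / d eps) eps,
    so the fixed point is stable iff this product is smaller than 1 in modulus.
    """
    return abs(d_eps_hat_d_eps * d_eps_d_eps_hat) < 1.0

# Example with hypothetical derivative values: product 0.7 * 1.2 = 0.84 < 1,
# so the RS fixed point would be declared stable.
print(rs_stable(0.7, 1.2))
```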

Appendix C.: Optimal generalization error for the large deviation perceptron

We derive here the expression for the optimal generalization error $\epsilon_g$ (in the Bayesian sense) associated with a subset of volume v and budget n, as a function of the perceptron order parameters m and q, see sections 3.1 and 3.2. The Bayesian $\epsilon_g$ was introduced for example in [38] and, unlike the test error obtained by training the perceptron on some loss, it is independent of the training procedure and thus serves as a convenient measure of the informativeness of subsets. For the usual perceptron model, it is known that the optimal generalization error is achieved when the student classification is performed by averaging the predicted label over the student measure $P_{X}(\cdot)P_{\mathrm{out}}({\boldsymbol{Y}|\boldsymbol{F}\cdot})$ and taking the sign thereof. Note that the average predicted label is nothing but the output magnetization $\boldsymbol{m}_{\mathrm{out}}$ discussed in section 5 of the main text (a small numerical illustration of this decision rule is sketched at the end of this appendix). The transposition to the large deviation setting, which allows one to fix the budget n and the volume v, is straightforward provided one averages over the large deviation measure (11). By definition, the test error is the probability that a new sample $\boldsymbol{F}_{\mathrm{new}}\overset{d}{ = }\mathcal{N}(0,1)$ is misclassified by the student using the output magnetization $\mathbb{E}^{\beta,\phi}_{\boldsymbol{x}}\mathrm{sgn}(\boldsymbol{x}\cdot\boldsymbol{F}_{\mathrm{new}})$

Equation (C.1)

where $\mathbb{E}^{\beta,\phi}_{\boldsymbol{x}}$ denotes the average with respect to the large deviation posterior measure (11) with control parameters β and φ. Introducing an integral representation of a Dirac delta and expanding the resulting exponential:

Equation (C.2)

Equation (C.3)

For any fixed j, the computation of $\mathbb{E}_{\boldsymbol{x}^{0},\boldsymbol{Y}}\Theta[\mathrm{sgn}(\boldsymbol{x}^{0}\cdot\boldsymbol{F}_{\mathrm{new}})v]\mathbb{E}_{\boldsymbol{F}}(\mathbb{E}_{\boldsymbol{x}}^{\beta,\phi})^{j}$ is formally very similar to the one detailed in section 3.1 and follows the same lines. First notice that applying the replica trick to $\mathbb{E}_{\boldsymbol{x}}^{\beta,\phi}\mathrm{sgn}(\boldsymbol{x}\cdot\boldsymbol{F}_{\mathrm{new}})$ prescribes, as previously, the introduction of βs replicas

Equation (C.4)

Equation (C.5)

From which it follows that:

Equation (C.6)

The net effect is simply to transform the first index a into a double index (l, a), thereby introducing a third level of replication. Continuing from equation (C.2),

Equation (C.7)

Equation (C.8)

In going from the second to the last line we introduced overlap variables h as in section 3.1. The rest of the large deviation measure in equation (C.6) factorizes into $e^{Nsj\Phi(\beta,\phi)}$, with Φ the free entropy computed in sections 3.1 and 3.2, and tends to 1 as the $s\rightarrow0$ limit is taken. We also introduced an overlap matrix $\mathcal{Q}$ of RS form

Equation (C.9)

Equation (C.10)

Equation (C.11)

Equation (C.12)

The order parameters $r^0$, r, m and q were defined in the replica ansatz, equations (19)–(23). Note that the relevant overlap is q, rather than Q, since the j-fold replication in equation (C.6) also affects the selection variables σ, so that the variables $\boldsymbol{x}^{l11}$ see different disorders, see section 3.1. The inverse $\tilde{\mathcal{Q}} = \mathcal{Q}^{-1}$ is characterized by the coefficients:

Equation (C.13)

Equation (C.14)

Equation (C.15)

Equation (C.16)

The last integral in equation (C.8) can be handled in the usual manner, by decomposing the exponent and introducing a Hubbard–Stratonovich field η, see for example section 3.1. The expression can then be factorized over the l indices. The result is:

Equation (C.17)

Equation (C.18)

This completes the computation of $\epsilon_g$, since from equation (C.8):

Equation (C.19)

Equation (C.20)

Equation (C.21)

We used the fact that $x\rightarrow1-2H(x)$ is odd, and that $r^{0} = 1$ for the perceptron model with a Gaussian prior, see section 3.2.
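As an illustration of the Bayes-like decision rule discussed at the beginning of this appendix (classify a fresh input by the sign of the output magnetization), the following Monte Carlo sketch estimates a generalization error from posterior samples. Here the "posterior" samples are crude noisy copies of a teacher and merely stand in for samples from the large deviation measure (11); all sizes and the noise level are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P_test, n_samples = 200, 5000, 50

x0 = rng.standard_normal(N)                         # teacher weights (ground truth)
# Stand-in for samples from the (large deviation) posterior: noisy copies of the teacher.
posterior = x0[None, :] + 0.8 * rng.standard_normal((n_samples, N))

F_test = rng.standard_normal((P_test, N))           # fresh Gaussian test inputs
y_true = np.sign(F_test @ x0)

m_out = np.sign(F_test @ posterior.T).mean(axis=1)  # output magnetization per test point
y_bayes = np.sign(m_out)                            # decision: sign of the average predicted label

test_error = np.mean(y_bayes != y_true)
print(f"estimated generalization error: {test_error:.3f}")
```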
