1 Introduction

In the past few years, there has been growing interest in methods to train neural networks with discrete weights. Indeed, when it comes to implementations, discrete weights allow for considerably more efficient hardware, as they simplify the multiply-accumulate operations, with the extreme case being binary weights, where no multiplication needs to be performed at all. Unfortunately, training neural networks with discrete weights is hard in practice, since it boils down to an NP-hard optimization problem. To circumvent this difficulty, many works have introduced techniques that aim at finding reasonable approximations [6, 7, 13, 24].

Among these works, in a recent paper Baldassi et al. [2] discuss the learning process in artificial neural networks with discrete weights and try to explain why these networks work so efficiently. Their approach is based on an analysis of the learning procedure in artificial neural networks. In this process a huge number of connection weights is adjusted using some stochastic optimization algorithm for a given target function. Some of the resulting optima of the target or energy function have better computational performance and generalization properties, others worse. The authors in [2] propose that the better and more robust weight configurations lie in dense regions with many maxima or minima (depending on the sign) of the target function, while optimal configurations that are isolated, i.e. far away from the next optimum of the energy function, have poor computational performance. They propose a new measure, called the robust ensemble, that suppresses such configurations with bad computational performance. On the other hand, the robust ensemble amplifies the dense regions with many good configurations. In [2] the authors present various algorithms to sample from this robust ensemble; one of them is Replicated Simulated Annealing, also called Simulated Annealing with Scoping. This algorithm combines the replica approach from statistical physics with the simulated annealing algorithm, which is supposed to find the minima (or maxima) of a target function. The replica technique is used to regularize the highly non-convex target or energy function (another regularization idea was introduced recently in [4]), while the simulated annealing algorithm is afterwards used to minimize this new energy. We will define Replicated Simulated Annealing in Sect. 2.

To give a first impression of this algorithm, assume we have \(\varSigma :=\{-1,+1\}^N\) as our state space, where N is large. On \(\varSigma \) we have a very rugged energy function \(E:\varSigma \rightarrow {\mathbb {R}}\), and we assume that \(\min _{\sigma \in \varSigma }E(\sigma )=0\). To find the minima of E one could run a Metropolis algorithm for the Gibbs measure at inverse temperature \(\beta >0\):

$$\begin{aligned} \pi _\beta (\sigma ):= \frac{\exp (-\beta E(\sigma ))}{Z_\beta }, \qquad \sigma \in \varSigma . \end{aligned}$$

Here \(Z_\beta := \sum _{\sigma '} \exp (-\beta E(\sigma '))\) is the partition function of the model, a normalizing factor that makes \(\pi _\beta \) a probability measure. If one carefully lowers the temperature, i.e. if one increases \(\beta \) slowly enough during the process, the corresponding Markov chain will get stuck in one of the maxima of \(\pi _\infty \), which are easily seen to be the minima of E. This is the classical Simulated Annealing algorithm, cf. [25] or [15] for the seminal papers. The question how to choose the optimal dependence of \(\beta \) on time t and its convergence properties have been extensively discussed; we just mention [5, 8, 18, 20, 22] for a short and by far not complete list of references. The upshot is that a good ”cooling schedule” is of the form \(\beta _t=\frac{1}{m} \log (1+ C t)\), where m can be roughly described as the largest hill to be climbed to get from an arbitrary state to one of the global minima of the target function.
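To make this classical procedure concrete, the following minimal sketch implements Metropolis-based Simulated Annealing with such a logarithmic cooling schedule on \(\{-1,+1\}^N\). The toy energy function and all constants are purely illustrative choices of ours and are not taken from the literature cited above.

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(sigma):
    # Illustrative toy energy; any E: {-1,+1}^N -> R could be plugged in here.
    return float(np.abs(sigma.sum())) + 2.0 * (sigma[0] != sigma[-1])

def simulated_annealing(N=50, steps=20000, m=4.0, C=1.0):
    sigma = rng.choice([-1, 1], size=N)
    E = energy(sigma)
    for t in range(1, steps + 1):
        beta = np.log(1.0 + C * t) / m      # logarithmic cooling schedule
        i = rng.integers(N)                 # propose a single spin flip
        sigma[i] = -sigma[i]
        E_new = energy(sigma)
        # Metropolis rule: accept downhill moves always,
        # uphill moves with probability exp(-beta * (E_new - E))
        if E_new > E and rng.random() >= np.exp(-beta * (E_new - E)):
            sigma[i] = -sigma[i]            # reject: undo the flip
        else:
            E = E_new
    return sigma, E

sigma, E = simulated_annealing()
print("final energy:", E)
```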

However, sometimes not all the global minima are equally important; in particular, one may be interested in regions with many global minima (or almost global minima), so called dense regions. An obstacle may be that E exhibits many global minima, but only relatively few of them lie in dense regions. Let us motivate this question by a central example which we will often have in mind in this context and which is also one of the central objects in [2].

Example 1

Assume we have patterns \(\xi ^1, \ldots , \xi ^M\), \(\xi ^\mu \in \{\pm 1\}^N, \mu =1, \ldots , M\) where \(M=\alpha N\) for some \(\alpha >0\). Each of these patterns belongs to one of two groups, which we indicate by \(\vartheta (\xi ^\mu )= \vartheta ^\mu \in \{\pm 1\}.\)

The task is to classify these patterns. One of the standard methods to do this in machine learning is the perceptron, introduced by Rosenblatt in the late 1950s [34]. It is one of the first mathematical models of neural networks, inspired by the biological phenomenon of vision. In its simplest form, the one we are studying here, it is a single neuron, corresponding to the mathematical model of McCulloch and Pitts [30]. It is then a binary classifier, which separates two sets of points that are linearly separable by a hyperplane. In mathematical terms it maps its input x to f(x), which is either 1 or 0, and thus puts it into one of two classes. This decision is made with the help of a vector of weights

$$\begin{aligned} W=(W_1, \ldots , W_N) \in \{-1,+1\}^N. \end{aligned}$$

More precisely,

$$\begin{aligned} f(x)={\left\{ \begin{array}{ll}1&{}{\text {if }}\ \langle {W}, x \rangle >0,\\ 0&{}{\text {otherwise}}\end{array}\right. } \end{aligned}$$

where \(\langle W,x\rangle \) is the dot product \(\langle W,x\rangle =\sum _i W_i x_i\). These weights \(W=(W_1, \ldots , W_N) \in \{-1,+1\}^N\) have to be learned, and we want the classification to be perfect, i.e. we want that

$$\begin{aligned} \varTheta (\vartheta ^\mu \langle W, \xi ^\mu \rangle ):= \varTheta (\vartheta ^\mu \sum _{i=1}^N W_i \xi _i^\mu )=1 \end{aligned}$$

for all \(\mu =1, \ldots , M\). Here

$$\begin{aligned} \varTheta (x)=\left\{ \begin{array}{ll} 1 &{} \text{ if } x>0\\ 0 &{} \text{ otherwise }\end{array} \right. \end{aligned}$$

denotes the Heaviside-function. Hence our classification task is fulfilled if

$$\begin{aligned} \sum _{\mu =1}^M \varTheta (-\vartheta ^\mu \langle W, \xi ^\mu \rangle )=0 \end{aligned}$$

(where we assume that N is odd to avoid the specification of tie-breaking rules) or, equivalently

$$\begin{aligned} \prod _{\mu =1}^M \varTheta (\vartheta ^\mu \langle W, \xi ^\mu \rangle )=1. \end{aligned}$$
(1)

Note that in Rosenblatt’s initial model, the weights \((W_1,\ldots , W_N)\), called synaptic weights, are real-valued and not restricted to take their values in \(\{-1,+1\}^N\). The objective now is to find weights W such that (1) is true, i.e. we are searching for weights W such that \(\overline{E}(W)=-\prod _{\mu =1}^M \varTheta (\vartheta ^\mu \langle W, \xi ^\mu \rangle )\) is minimal. Obviously, this optimization problem is of the above mentioned form. However, one prefers weights W in so called dense regions, i.e. weights that are surrounded by weights \({\tilde{W}}\) that are also minima of \(\overline{E}\). The idea is that these states have good generalization properties, i.e. a small generalization error. This means that we want to find weights that still correctly classify input patterns which we have not seen in our training set \(\xi ^1, \ldots , \xi ^M\). It is at least plausible that weights with a small generalization error lie in dense regions of \(\{\pm 1\}^N\).
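As an illustration, the following snippet evaluates \(\overline{E}(W)\) on a toy instance with random patterns and labels; the setup (pattern count, dimension, variable names) is our own and purely illustrative. By construction \(\overline{E}(W)=-1\) exactly when all M patterns are classified correctly, and 0 otherwise.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 101, 20                          # N odd to avoid ties
xi = rng.choice([-1, 1], size=(M, N))   # patterns xi^mu
theta = rng.choice([-1, 1], size=M)     # group labels vartheta^mu

def E_bar(W):
    # E_bar(W) = -prod_mu Theta(vartheta^mu <W, xi^mu>), cf. (1)
    margins = theta * (xi @ W)
    return -1.0 if np.all(margins > 0) else 0.0

W = rng.choice([-1, 1], size=N)
print("E_bar(W) =", E_bar(W))
```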

In this work we are interested in making explicit the convergence properties of the Replicated Simulated Annealing algorithm. We also perform experiments using synthetic and real databases. The outline is as follows: in Sect. 2 we mathematically formalize and describe the algorithm of Replicated Simulated Annealing, and in Sect. 3 we study its convergence properties. Naturally, this convergence will be studied on an infinite time horizon. This is a slightly different set-up than in [2], where experiments are performed for finite time. However, the question whether or not Replicated Simulated Annealing converges is the first question that should be analyzed, before studying which choice of parameters yields the best results. This latter question is addressed in Sect. 4, where we perform experiments using synthetic and real databases. Of course, the time horizon for such experiments is finite. On the other hand, in finite time we can control the influence of the choice of the parameters on the performance, which is hard to control theoretically. Finally, Sect. 5 gives a conclusion.

2 Replicated Simulated Annealing

Recall that we are searching for the minima of a function \(E:\varSigma \rightarrow {\mathbb {R}}\), where \(\varSigma :=\{-1,+1\}^N\). To find minima of E in dense regions of the state space the authors in [2] propose a new measure given by

$$\begin{aligned} P_{y,\beta ,\gamma }(\sigma ):= \frac{\exp (y \varPhi _{\beta ,\gamma }(\sigma ))}{Z(y,\beta , \gamma )} \end{aligned}$$
(2)

where

$$\begin{aligned} Z(y,\beta , \gamma ):= \sum _{\sigma ''}\exp (y \varPhi _{\beta ,\gamma }(\sigma '')). \end{aligned}$$
(3)

\(P_{y,\beta ,\gamma }\) has, at least formally, the structure of a Gibbs measure at inverse temperature y. Its ”energy function” is given by

$$\begin{aligned} \varPhi _{\beta ,\gamma }(\sigma ):= \log \sum _{\sigma ' \in \varSigma } \exp (-\beta E(\sigma ')-\gamma d(\sigma , \sigma ')), \end{aligned}$$
(4)

where \(d(\cdot , \cdot )\) is some monotonically increasing function of a distance on \(\varSigma \). This distance will be chosen below; in any case, there are not too many reasonable, essentially different distance functions on \(\varSigma \).

Since \(\varPhi _{\beta ,\gamma }\) weights each configuration \(\sigma '\) by an exponential of its energy and of its distance to \(\sigma \), it is indeed plausible that \(\varPhi _{\beta ,\gamma }\) is much smoother than E and attains its maxima in dense regions (the maxima are the relevant extrema here, since \(P_{y,\beta ,\gamma }\) is proportional to \(\exp (y \varPhi _{\beta ,\gamma })\)). We will come back to this question in the next section.

However, a serious problem is how one could simulate from the measure \(P_{y,\beta ,\gamma }\). Indeed, computing the ”energy” \(\varPhi _{\beta ,\gamma }(\sigma )\) of a single configuration \(\sigma \) involves, among other things, computing \(E(\sigma ')\) for all \(\sigma ' \in \varSigma \). Computing these values is almost as hard as finding the minima of E (even though one might not be immediately able to tell which of these minima are in dense regions). To find a promising algorithm that does not rely on computing all the values of \(E(\sigma )\), Baldassi et al. [2] propose the following:

First of all, assume that \(y \ge 2\) is an integer. Second, take as function of the distance between two spin configurations \(\sigma \) and \(\sigma '\) the (negative) inner product: \(d(\sigma , \sigma ')=-\langle \sigma , \sigma '\rangle \). As a matter of fact, this is a natural choice, since two natural distance functions, the Hamming distance and the square of the Euclidean distance, are functions of the inner product: \( \sum _{i} (\sigma _i-\sigma '_i)^2=2N-2 \langle \sigma , \sigma '\rangle \) as well as \(d_H(\sigma , \sigma ')= \frac{N -\langle \sigma , \sigma '\rangle }{2} \), and the N-dependent terms cancel, because they also occur in \(Z(y,\beta , \gamma )\). Using the fact that y is an integer, we can now compute the partition function \(Z(y,\beta , \gamma )\) (replacing \(\sigma ''\) by \(\sigma \) in (3)) of this model:

$$\begin{aligned} Z(y,\beta , \gamma )&= \sum _{\sigma \in \varSigma } \exp (y \varPhi _{\beta , \gamma }(\sigma )) = \sum _{\sigma \in \varSigma } \exp (\sum _{a=1}^y \varPhi _{\beta , \gamma }(\sigma ))\\&= \sum _{\sigma \in \varSigma } \prod _{a=1}^y \sum _{\sigma ^a \in \varSigma } \exp (-\beta E(\sigma ^a)-\gamma d(\sigma , \sigma ^a))\\&= \sum _{\sigma \in \varSigma }\sum _{\sigma ^1 \in \varSigma }\ldots \sum _{\sigma ^y \in \varSigma } \exp (-\beta E(\sigma ^1)-\gamma d(\sigma , \sigma ^1))\ldots \exp (-\beta E(\sigma ^y)-\gamma d(\sigma , \sigma ^y))\\&= \sum _{\sigma \in \varSigma }\sum _{\{\sigma ^a\} \in \varSigma ^y}\exp (-\beta \sum _{a=1}^y E(\sigma ^a)-\gamma \sum _{a=1}^y d(\sigma , \sigma ^a)). \end{aligned}$$

Here \(\sum _{\{\sigma ^a\}}\) is the sum over all \(\sigma ^1, \ldots , \sigma ^y\). Hence \(Z(y,\beta , \gamma )\) can be considered as a partition function on the space of all \((\sigma , \{\sigma ^a\})\in \varSigma ^{y+1}\) of the measure

$$\begin{aligned} Q(\sigma , \{\sigma ^a\}):= \frac{\exp (-\beta \sum _{a=1}^y E(\sigma ^a)-\gamma \sum _{a=1}^y d(\sigma , \sigma ^a))}{Z(y,\beta ,\gamma )}. \end{aligned}$$
(5)

Its marginal with respect to the second coordinate \(\{\sigma ^a\}\) is given by

$$\begin{aligned} \overline{Q}(\{\sigma ^a\}):= \frac{\sum _{\sigma } \exp (-\beta \sum _{a=1}^y E(\sigma ^a)-\gamma \sum _{a=1}^y d(\sigma , \sigma ^a))}{Z(y,\beta ,\gamma )}. \end{aligned}$$
(6)

Making use of our choice \(d(\sigma , \sigma ')=-\langle \sigma , \sigma '\rangle \) we obtain for the numerator in \(\overline{Q}\):

$$\begin{aligned} {Z(y,\beta ,\gamma )}\overline{Q}(\{\sigma ^a\})= & {} \sum _{\sigma } \exp \left( -\beta \sum _{a=1}^y E(\sigma ^a)+\gamma \sum _{a=1}^y \langle \sigma , \sigma ^a\rangle \right) \\= & {} \frac{2^N}{2^N}\sum _{\sigma } \exp \left( -\beta \sum _{a=1}^y E(\sigma ^a)+\gamma \sum _{i=1}^N\sum _{a=1}^y \sigma _i \sigma ^a_i\right) \\= & {} 2^N \exp \left( -\beta \sum _{a=1}^y E(\sigma ^a)+\sum _{i=1}^N \log \cosh (\gamma \sum _{a=1}^y \sigma _i^a)\right) \end{aligned}$$

Putting the \(2^N\) into the normalizing constant we thus obtain that

$$\begin{aligned} \overline{Q}(\{\sigma ^a\})= \frac{\exp \left( -\beta \sum _{a=1}^y E(\sigma ^a)+\sum _{i=1}^N \log \cosh (\gamma \sum _{a=1}^y \sigma _i^a)\right) }{Z'(y,\beta ,\gamma )}. \end{aligned}$$

This form of the measure \(\overline{Q}\) is now accessible to a Simulated Annealing algorithm: being in \(\{\sigma ^a\}\) one picks one of the \(\sigma ^a\) at random and one coordinate \(\sigma _i^a\) of \(\sigma ^a\) at random and flips it to become \(-\sigma _i^a\). This new configuration is then accepted with the usual Simulated Annealing probabilities.
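The following sketch shows one such move for \(\overline{Q}\), with the acceptance probability computed from the log-weight \(-\beta \sum _a E(\sigma ^a)+\sum _i \log \cosh (\gamma \sum _a \sigma _i^a)\). It is a minimal illustration of ours: for clarity it recomputes the full log-weight after each proposed flip, whereas an efficient implementation would only update the terms affected by the flipped coordinate.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_cosh(x):
    # numerically stable log cosh: log(e^x + e^-x) - log 2
    return np.logaddexp(x, -x) - np.log(2)

def log_weight(replicas, E, beta, gamma):
    # log of the (unnormalized) weight of {sigma^a} under Q-bar
    interaction = log_cosh(gamma * replicas.sum(axis=0)).sum()
    return -beta * sum(E(s) for s in replicas) + interaction

def rsa_step(replicas, E, beta, gamma):
    y, N = replicas.shape
    a, i = rng.integers(y), rng.integers(N)   # pick one replica, one coordinate
    old = log_weight(replicas, E, beta, gamma)
    replicas[a, i] *= -1                      # propose the flip
    new = log_weight(replicas, E, beta, gamma)
    if new < old and rng.random() >= np.exp(new - old):
        replicas[a, i] *= -1                  # reject: undo the flip
    return replicas
```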

Example 2

(Example 1 continued) In our perceptron example we so far proposed the energy function

$$\begin{aligned} \overline{E}(W)=-\prod _{\mu =1}^M \varTheta (\vartheta ^\mu \langle W, \xi ^\mu \rangle )=-\prod _{\mu =1}^{\alpha N} \varTheta (\vartheta ^\mu \langle W, \xi ^\mu \rangle ). \end{aligned}$$

This function, however, may be a bit unwieldy when using Simulated Annealing, since it only tells whether all patterns have been classified correctly, but not whether we are moving in a ”good” or a ”bad” direction when the proposed configuration \(\{{W'}^a\}\) has the same energy as the old configuration \(\{{W}^a\}\). We therefore propose (as e.g. [2]) to use the energy function

$$\begin{aligned} E(W)= \sum _{\mu =1}^{\alpha N} E^\mu (W) \quad \text{ with } E^\mu (W)= R(-\vartheta ^\mu \langle W, \xi ^\mu \rangle ), \end{aligned}$$

instead. Here \(R(x)= \frac{x+1}{2} \varTheta (x)\) and we again assume that N is odd; otherwise we would need to take \(R(x)= \frac{x}{2} \varTheta (x)\). In other words, \(E^\mu \) is the number of bits that we need to change in order to classify \(\xi ^\mu \) correctly.
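A short sketch of this energy, reusing the toy patterns xi and labels theta from the snippet in Sect. 1 (all names are again our own):

```python
import numpy as np

def R(x):
    # R(x) = (x+1)/2 * Theta(x); counts the bit flips needed (N odd)
    return (x + 1) / 2 * (x > 0)

def E(W, xi, theta):
    # E(W) = sum_mu R(-vartheta^mu <W, xi^mu>)
    margins = theta * (xi @ W)
    return R(-margins).sum()
```

Each proposed flip now changes the energy by a small, informative amount, which gives the annealing chain a usable search direction.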

3 Convergence of the Annealing Process

In this section we want to discuss the convergence properties of the annealing procedure introduced above. The two major questions are: Does the process converge to an invariant measure, and if so, does this measure have the desired property of favoring dense regions? These questions are not addressed in [2]. However, we feel that they constitute the first problem that needs to be analyzed. Indeed, if the process does not converge to the desired distribution, the question is rather when to stop it than what the optimal choice of parameters is.

We will distinguish two cases: the first is when \(\gamma \) in the definition of the measures Q in (5) and \({\overline{Q}}\) in (6) does not depend on time, while the second is, when it does.

Before analyzing these two cases, we will slightly modify the annealing procedure, to make it accessible to the best results that are available for discrete time, see [1]. As a matter of fact, we find discrete time slightly more appropriate for computer simulations than the continuous time set-up in e.g. [18, 22], or [8]. To this end, we will study cooling schedules, where the inverse temperature \(\beta _n\) is fixed for \(T_n\) consecutive steps of the annealing process. Denote by \(\nu _n\) the distribution of the annealing process \(X_n\) at time

$$\begin{aligned} L_n := T_1 + \ldots + T_n. \end{aligned}$$

Note that \(\nu _n\) can be computed recursively: If \(S_\beta \) denotes the transition matrix of the Metropolis–Hastings chain (see [19] or [21]) at inverse temperature \(\beta \) (see (7) below), then

$$\begin{aligned} \nu _n = \nu _{n-1} S_{\beta _n}^{T_n} \qquad \text{ where } \nu _0 \text{ is } \text{ a } \text{ fixed } \text{ probability } \text{ measure } \text{ on } \varSigma ^y. \end{aligned}$$

Here, of course, \(S_{\beta _n}^{T_n}\) is the \(T_n\)’th power of the transition matrix \(S_{\beta _n}\) (which is constant for the last \(T_n\) steps, as described above).
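A toy numerical illustration of this recursion (energies, schedule, and the uniform proposal are illustrative choices of ours): the mass of \(\nu _n\) visibly accumulates at the minimum of the energy as \(\beta _n\) grows.

```python
import numpy as np

E = np.array([0.0, 2.0, 1.0, 3.0])   # toy energy on a 4-point state space
K = len(E)

def metropolis_matrix(beta):
    # uniform proposal, Metropolis acceptance
    S = np.zeros((K, K))
    for x in range(K):
        for y in range(K):
            if x != y:
                S[x, y] = np.exp(-beta * max(E[y] - E[x], 0.0)) / K
        S[x, x] = 1.0 - S[x].sum()
    return S

nu = np.full(K, 1.0 / K)             # nu_0: uniform starting distribution
for n, (beta_n, T_n) in enumerate([(0.5, 10), (1.0, 20), (2.0, 40)], 1):
    nu = nu @ np.linalg.matrix_power(metropolis_matrix(beta_n), T_n)
    print(f"nu_{n} =", np.round(nu, 4))
```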

3.1 Fixed \(\gamma \)

If \(\gamma \) is fixed it is convenient to split the Simulated Annealing algorithm introduced above into a \(\gamma \)-dependent part and a \(\beta \)-dependent part. To this end, let us introduce the following probability measure \(\mu _0\) on \(\varSigma ^y\):

$$\begin{aligned} \mu _0(\{\sigma ^a\}):=\frac{\exp \left( \sum _{i=1}^N \log \cosh (\gamma \sum _{a=1}^y \sigma _i^a)\right) }{\varGamma } \end{aligned}$$

with \(\varGamma := \sum _{\{{\tilde{\sigma }}^a\}} \exp \left( \sum _{i=1}^N \log \cosh (\gamma \sum _{a=1}^y {\tilde{\sigma }}_i^a)\right) \).

Next define a transition matrix \(\varPi \) on \(\varSigma ^y\). \(\varPi \) will only allow transitions from \(\{\sigma ^a\}\) to \(\{{\tilde{\sigma }}^a\}\) if there are exactly one \(a\in \{1, \ldots , y\}\) and one \(i\in \{1, \ldots , N\}\) such that \(\sigma ^a_i=-{\tilde{\sigma }}^a_i\), while for all other b and j we have \(\sigma ^b_j={\tilde{\sigma }}^b_j\). In this case, we define

$$\begin{aligned} \varPi _\gamma (\{\sigma ^a\},\{{\tilde{\sigma }}^a\}):=\varPi (\{\sigma ^a\},\{{\tilde{\sigma }}^a\}) \min \left( 1, \frac{\cosh (\gamma \sum _{a=1}^y \tilde{\sigma }_i^a)}{\cosh (\gamma \sum _{a=1}^y \sigma _i^a)}\right) . \end{aligned}$$

For all other configurations \(\{{\hat{\sigma }}^a\}\ne \{\sigma ^a\}\), we have \(\varPi _\gamma (\{\sigma ^a\},\{{\hat{\sigma }}^a\})=0\) and we set

$$\begin{aligned} \varPi _\gamma (\{\sigma ^a\},\{\sigma ^a\}):= 1 - \sum _{\{{\tilde{\sigma }}^a\}\ne \{\sigma ^a\}} \varPi _\gamma (\{\sigma ^a\},\{{\tilde{\sigma }}^a\}). \end{aligned}$$

Note that \(\varPi _\gamma \) is nothing but the Metropolis–Hastings algorithm for the measure \(\mu _0\) (see [21]). In particular, \(\varPi _\gamma \) is reversible with respect to the measure \(\mu _0\), i.e.

$$\begin{aligned} \mu _0(\{\sigma ^a\})\varPi _\gamma (\{\sigma ^a\},\{{\tilde{\sigma }}^a\})=\mu _0(\{{\tilde{\sigma }}^{a}\})\varPi _\gamma (\{{\tilde{\sigma }}^{a}\},\{\sigma ^a\}). \end{aligned}$$

Now consider the Metropolis–Hastings chain on \(\varSigma ^y\) with proposal chain \(\varPi _\gamma \) and transition probabilities

$$\begin{aligned}&S_\beta (\{\sigma ^a\},\{{\tilde{\sigma }}^a\})\nonumber \\&\quad :=\left\{ \begin{array}{ll} \exp \left( -\beta (\sum _{a=1}^y E({\tilde{\sigma }}^a)-\sum _{a=1}^y E(\sigma ^a))^+\right) \varPi _\gamma (\{\sigma ^a\},\{{\tilde{\sigma }}^a\})&{} \text{ if } \{\sigma ^a\} \ne \{{\tilde{\sigma }}^a\}\\ 1- \sum _{\{{\hat{\sigma }}^a\}\ne \{\sigma ^a\} } \exp \left( -\beta (\sum _{a=1}^y E({\hat{\sigma }}^a)-\sum _{a=1}^y E(\sigma ^a))^+\right) \varPi _\gamma (\{\sigma ^a\},\{{\hat{\sigma }}^a\})&{} \text{ if } \{\sigma ^a\} =\{{\tilde{\sigma }}^a\} \end{array} \right. \end{aligned}$$
(7)

Here \((x)^+:= \max \{x,0\}\). For an appropriate normalizing constant \({\hat{\varGamma }}\) this chain has as its invariant measure

$$\begin{aligned} \frac{\exp (-\beta \sum _{a=1}^y E(\sigma ^a))}{{\hat{\varGamma }}} \mu _0(\{\sigma ^a\})= \overline{Q}(\{\sigma ^a\})=:\overline{Q}_\beta (\{\sigma ^a\}). \end{aligned}$$
(8)

So indeed for each fixed \(\beta >0\), \(\gamma >0\) we have found a Metropolis chain for \({\overline{Q}}\).

If we now let \(\beta =\beta _n\) depend on n in the form described at the beginning of the section, we arrive at a Simulated Annealing algorithm with piecewise constant temperature.

We will quickly introduce some of Azencott’s notation [1]. The invariant measure of \(S_{\beta _n}\) is \(\overline{Q}_{\beta _n}\). Recall that we assumed that \(\min _{\sigma \in \varSigma } E(\sigma )=0\) and define

$$\begin{aligned} B:= \min _{\sigma : E(\sigma ) \ne 0} E(\sigma ). \end{aligned}$$
(9)

Next we bound

$$\begin{aligned} ||\overline{Q}_{\beta _n}-\overline{Q}_{\beta _{n+1}}||_\infty \le \kappa _1 \exp (-\beta _n B) \end{aligned}$$

for some constant \(\kappa _1\). Indeed such an estimate is true for any difference of Gibbs measures with respect to the same energy function. To see this let

$$\begin{aligned} \rho _{\beta _n}:= \frac{\exp (-\beta _n U(x))}{Z_n} \end{aligned}$$

be a sequence of Gibbs measures with respect to the energy function U on a discrete space of size K. We assume \(\min _x U(x)=0\) (without loss of generality, otherwise we subtract the minimum from U) and \(\min _{x: U(x)\ne 0} U(x)= B\). Let \(k:= |\{x: U(x)=0\}|.\) Then, for any x with \(U(x)=0\) we simply have

$$\begin{aligned} |\rho _{\beta _n}(x)-\rho _{\beta _{n+1}}(x)|= \frac{|\sum _{y: U(y) \ne 0} \left( \exp (-\beta _n U(y))- \exp (-\beta _{n+1} U(y))\right) |}{Z_n Z_{n+1}}\le 2 (K-k) e^{-\beta _n B}, \end{aligned}$$

since \(\beta _{n+1}\ge \beta _n\) and \(Z_n \ge 1\) for all n. Otherwise, if \(U(x)>0\), we trivially can compute

$$\begin{aligned} |\rho _{\beta _n}(x)-\rho _{\beta _{n+1}}(x)|\le e^{-\beta _n U(x)} + e^{-\beta _{n+1} U(x)} \le 2 e^{-\beta _n B}. \end{aligned}$$

To describe the spectral gap of \(S_{\beta _n}\), for any two \(\{\sigma ^a\},\{\tau ^a\} \in \varSigma ^y\) let \(\mathcal {P}(\{\sigma ^a\},\{\tau ^a\})\) be the set of all paths in \(\varSigma ^y\) from \(\{\sigma ^a\}\) to \(\{\tau ^a\}\). For \(p \in \mathcal {P}(\{\sigma ^a\},\{\tau ^a\})\) with vertices \(\{\nu ^{a,r}\}_{r=0}^M\) define

$$\begin{aligned} \mathrm {Elev}(p):= \max _{r=0, \ldots , M} \sum _{a=1}^y E(\nu ^{a,r}). \end{aligned}$$

Moreover define

$$\begin{aligned} H(\{\sigma ^a\},\{\tau ^a\})= \min _{p \in \mathcal {P}(\{\sigma ^a\},\{\tau ^a\})} \mathrm {Elev}(p) \end{aligned}$$

and

$$\begin{aligned} m:= \max _{\{\sigma ^a\},\{\tau ^a\}}\left( H(\{\sigma ^a\},\{\tau ^a\})-\sum _{a=1}^y E(\sigma ^a)-\sum _{a=1}^y E(\tau ^a)\right) . \end{aligned}$$
(10)

The quantity m is related to the optimal cooling schedule for Simulated Annealing as well as to the spectral gap of the associated Metropolis–Hastings algorithm \(S_\beta \). To understand this, define the operator \(L_\beta f(\{\sigma ^a\})\) for \(\{\sigma ^a\}\in \varSigma ^y\) and \(f:\varSigma ^y \rightarrow {\mathbb {R}}\) by

$$\begin{aligned} L_\beta f(\{\sigma ^a\})= \sum _{\{\tau ^a\}} \left( f(\{\sigma ^a\})-f(\{\tau ^a\})\right) S_\beta (\{\sigma ^a\},\{\tau ^a\}). \end{aligned}$$

Let \(\mathcal {E}_\beta \) be the associated Dirichlet form, i.e. for functions \(f,g:\varSigma ^y \rightarrow {\mathbb {R}}\)

$$\begin{aligned} \mathcal {E}_\beta (f,g):= & {} -\int \left( f, L_\beta g \right) d{\overline{Q}}_\beta \\= & {} \frac{1}{2 {\hat{\varGamma }}}\sum _{\{\sigma ^a\},\{\tau ^a\}}(f(\{\sigma ^a\})-f(\{\tau ^a\})) (g(\{\sigma ^a\})-g(\{\tau ^a\}))\\&\times \exp \left( -\beta (\sum _a E(\sigma ^a) \vee \sum _a E(\tau ^a))\right) \mu _0(\{\sigma ^a\})\varPi _\gamma (\{\sigma ^a\},\{\tau ^a\}). \end{aligned}$$

Then with

$$\begin{aligned} \psi (\beta ):= \inf \{\mathcal {E}_\beta (f,f): ||f||_{L^2({\overline{Q}}_\beta )}=1 \text{ and } \int f d{\overline{Q}}_\beta =0\} \end{aligned}$$

we have

Proposition 1

There are constants \(c>0\) and \(C < \infty \) such that for all \(\beta \ge 0\),

$$\begin{aligned} c e^{-\beta m} \le \psi (\beta ) \le C e^{-\beta m}. \end{aligned}$$

Proof

See [22, Theorem 2.1]. \(\square \)

But we also have that

$$\begin{aligned} \psi (\beta )=1-\lambda _1(\beta ) \end{aligned}$$

where \(\lambda _1(\beta )\) is the second largest eigenvalue of \(S_\beta \), cf. [23, p. 176] or [9, (1.2)]. This establishes the relation of m to the spectral gap of \(S_\beta \).
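This relation can be checked numerically on a small example. In the following sketch (with an illustrative one-dimensional double-well energy of ours, for which the constant from (10) equals m = 3), the rescaled gap \(\psi (\beta ) e^{\beta m}\) settles between the constants c and C of Proposition 1 as \(\beta \) grows.

```python
import numpy as np

E = np.array([0.0, 3.0, 1.0, 2.0, 0.0])   # double well: minima at 0 and 4, barrier 3
K = len(E)
m = 3.0                                    # largest value of H(x, y) - E(x) - E(y)

def metropolis_matrix(beta):
    # nearest-neighbour proposal (prob 1/2 each), Metropolis acceptance
    S = np.zeros((K, K))
    for x in range(K):
        for y in (x - 1, x + 1):
            if 0 <= y < K:
                S[x, y] = 0.5 * np.exp(-beta * max(E[y] - E[x], 0.0))
        S[x, x] = 1.0 - S[x].sum()
    return S

for beta in (1.0, 2.0, 4.0, 8.0):
    lam = np.sort(np.linalg.eigvals(metropolis_matrix(beta)).real)[::-1]
    gap = 1.0 - lam[1]                     # psi(beta) = 1 - lambda_1(beta)
    print(f"beta={beta}: gap * exp(beta*m) = {gap * np.exp(beta * m):.4f}")
```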

Introduce

$$\begin{aligned} \varepsilon _n := ||\overline{Q}_{\beta _n}-\nu _n ||_\infty . \end{aligned}$$

Then

Theorem 1

\(\varepsilon _n\) converges to 0, if

$$\begin{aligned} \lim _{n \rightarrow \infty } \left[ - \sum _{k=1}^n T_k \exp (-\beta _k m) + n \log \kappa _1\right] =-\infty . \end{aligned}$$
(11)

In particular, we need that \(\sum _{k=1}^n T_k \exp (-\beta _k m)\rightarrow \infty \). In this case \(\nu _n\) has the same limit as \(\overline{Q}_{\beta _n}\) and this is given by a distribution \(\overline{Q}_{\infty }\) on

$$\begin{aligned} \mathcal {N}_0:=\{\{\sigma ^a\}: \sum _{a=1}^y E(\sigma ^a)=0\} \end{aligned}$$
(12)

such that

$$\begin{aligned} {\overline{Q}}_\infty (\{\sigma ^a\}) \asymp \mu _0(\{\sigma ^a\}), \quad \text{ if } \{\sigma ^a\} \in \mathcal {N}_0 \end{aligned}$$

and \(\overline{Q}_\infty (\{\sigma ^a\})=0\), otherwise (here \(\asymp \) denotes proportionality). Of course \({\overline{Q}}_\infty \) is normalized in such a way that it is a probability measure on \(\mathcal {N}_0\).

Proof

The convergence part is basically the content of [1, Sect. 7]. Note that the computations there are done for a proposal chain \(\varPi \) that has the uniform measure as its invariant distribution. However, the proof on p. 231 of [1] carries over verbatim to our situation.

After that it is an easy matter to check that \(\overline{Q}_{\beta _n}\) has a limiting distribution \(\overline{Q}_{\infty }\) and that \(\overline{Q}_{\infty }\) charges every point in \(\mathcal {N}_0\) with a probability proportional to \(\mu _0(\{\sigma ^a\})\). \(\square \)

A choice for \(T_k\) where (11) holds is given by

$$\begin{aligned} T_k:= \frac{\exp (m \beta _k)}{C}(\log \kappa _1 +Cb) \end{aligned}$$

for a constant \(b>0\), m as given in (10), and C as given in Proposition 1.

As Azencott [1] points out, in this case

$$\begin{aligned} \beta _n \sim \frac{\alpha }{B}\log n+\frac{b}{B} n, \end{aligned}$$

for some \(\alpha >1\) and B defined as in (9),

$$\begin{aligned} T_n \sim n^{\frac{\alpha m}{B}} \exp (\frac{b m}{B} n), \end{aligned}$$

and \(T_n\) and \(L_n\) have the same order of magnitude, i.e. the algorithm spends most of the time in the lowest temperature band.

One also sees the logarithmic relation between \(\beta \) and T, i.e.

$$\begin{aligned} \beta _n \sim \frac{1}{m} \log (L_n) \sim \frac{1}{m} \log (T_n). \end{aligned}$$
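A small numerical sanity check of these asymptotics (all constants are illustrative choices of ours):

```python
import numpy as np

m, B, alpha, b = 3.0, 1.0, 1.5, 0.1     # illustrative constants
kappa1, C = 2.0, 1.0

def beta(n):
    return (alpha / B) * np.log(n) + (b / B) * n

def T(n):
    return np.exp(m * beta(n)) * (np.log(kappa1) + C * b) / C

L = np.cumsum([T(n) for n in range(1, 31)])
for n in (10, 20, 30):
    # beta_n and log(L_n) / m agree up to lower-order terms
    print(n, round(beta(n), 3), round(np.log(L[n - 1]) / m, 3))
```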

We now turn to the question whether this algorithm achieves that its typical samples lie in dense regions of \(\varSigma \). First of all, this notion needs to be defined:

Definition 1

Let \(\sigma \in \varSigma \) with \(E(\sigma )=0\) and let \(R>0\) and \(k \in \mathbb {N}\). The discrete ball \(B_R(\sigma ) \subset \varSigma \) with radius R, centered in \(\sigma \), is called (R, k)-dense with respect to E, if there are exactly k global minima \(\tau \) of E in \(B_R(\sigma )\). (Without loss of generality all balls considered here and henceforth are Hamming balls.)

\(\sigma \) is called R-isolated, if \(\sigma \) is the only global minimum of E in \(B_R(\sigma )\).

The authors in [2] are not very explicit about a definition of ”dense regions” and the situation where Replicated Simulated Annealing should be applied. However, from their examples, they seem to have in mind a situation close to the following caricature:

Situation 1

Given \(1<a<b<2\) and \(\alpha _N \rightarrow \infty \), \(\delta _N\rightarrow \infty \) with

$$\begin{aligned} \lim _{N\rightarrow \infty }\frac{\alpha _N}{\delta _N} =0 \quad \text{ as } \text{ well } \text{ as } \lim _{N \rightarrow \infty } \frac{\delta _N}{N} =0, \end{aligned}$$

we say that a sequence of energy functions \(E_N\) on \(\varSigma _N:=\varSigma =\{-1,+1\}^N\) is \((a,b,\alpha _N,\delta _N)\)-regular, if it has \(b^N\) global minima, if there exists \(\sigma \in \varSigma _N\) such that \(B_{\alpha _N}(\sigma )\) is \((\alpha _N, a^N)\)-dense, and if all the other \(b^N-a^N\) minima are \(\delta _N\)-isolated.

It is now rather obvious that \({\overline{Q}}_\infty \) prefers such dense regions:

Proposition 2

Assume we are in the situation described in Situation 1. Hence we have a sequence of energy functions that is \((a,b,\alpha _N,\delta _N)\)-regular. Then, given \(\varepsilon >0\), for any admissible choice of these parameters, there exist y, \(N_0\) and \(\gamma \) such that

$$\begin{aligned} {\overline{Q}}_\infty (\times _{i=1}^y B_{\alpha _N}(\sigma )):=\overline{Q}_\infty (\{(\sigma ^1, \ldots , \sigma ^y):\sigma ^a \in B_{\alpha _N}(\sigma ) \,\forall a\})\ge 1-\varepsilon \end{aligned}$$

for all \(N \ge N_0\).

Proof

Note that \({\overline{Q}}_\infty \) has its mass concentrated on the set \(\mathcal {N}_0\) (given by equation (12)) and the differences in the mass for the various configurations from this set stem from the factor

$$\begin{aligned} \mu _0(\{\sigma ^a\})=\frac{\exp \left( \sum _{i=1}^N \log \cosh (\gamma \sum _{a=1}^y \sigma _i^a)\right) }{\varGamma } \end{aligned}$$

Let us just consider the numerators of these weights.

Let \(\sigma \) be a \(\delta _N\)-isolated minimum of \(E_N\). If all \(\sigma ^1, \ldots , \sigma ^y\) are equal to \(\sigma \), then the numerator of \(\mu _0(\{\sigma ^a\})\) equals \(\exp \left( N \log \cosh (\gamma y)\right) \). Otherwise there is at least one \(\sigma ^a\) that is different from \(\sigma \), say equal to another global minimum \(\tau \) of \(E_N\). By assumption \(d_H(\sigma , \tau )\ge \delta _N\). Thus a configuration that has at least one \(\sigma ^a=\tau \) has a weight of at most

$$\begin{aligned} \exp ((N-\delta _N)\log \cosh (\gamma y)+\delta _N \log \cosh ((y-2)\gamma )). \end{aligned}$$

Now there are \(b^N-a^N\) many \(\delta _N\)-isolated minima. Hence the sum of the numerators of the probabilities of these isolated minima can be bounded from above by

$$\begin{aligned} \exp \left( N \log \cosh (\gamma y)\right) b^N \left( 1+ b^{N y}\frac{e^{(N-\delta _N)\log \cosh (\gamma y)+\delta _N \log \cosh ((y-2)\gamma )}}{e^{N \log \cosh (\gamma y)}}\right) . \end{aligned}$$

Here \(b^N\) is a bound on the number of isolated minima, \(\exp \left( N \log \cosh (\gamma y)\right) \) is the weight when all \(\sigma ^a\) are identical, \( b^{N y}\) is an upper bound on the number of choices we have when one \(\sigma ^a\) equals a given isolated minimum and at least one \(\sigma ^b\) is different, and finally \(e^{(N-\delta _N)\log \cosh (\gamma y)+\delta _N \log \cosh ((y-2)\gamma )}\) is a rough upper bound on the weight in that case.

Note that we will choose \(\gamma \) and y below in such a way that \(y \gamma \rightarrow \infty \) as \(N \rightarrow \infty \). We will therefore bound \(\log \cosh (y \gamma ) \le y \gamma \). Then the total contribution of the isolated minima becomes at most:

$$\begin{aligned} e^{N\gamma y} b^N \left( 1+ b^{N y}e^{-2\gamma \delta _N}\right) \end{aligned}$$

If we choose \(\gamma \ge \frac{Ny \log b}{2 \delta _N}\), the contribution of the isolated minima to the numerator of the probability will be at most \(2 \exp \left( N \gamma y\right) b^N\).

On the other hand, for the case that all \(\sigma ^a\) are in the dense region \(B_{\alpha _N}(\sigma )\) we have \(a^{Ny}\) choices. For each of these choices at least \(N-y \alpha _N\) of the coordinates of all \(\sigma ^1, \ldots , \sigma ^y\) are identical. Again, since \(y \gamma \rightarrow \infty \), when \(N \rightarrow \infty \), given \(\varepsilon '>0\) we may bound \(\log \cosh (y \gamma ) \ge y \gamma (1-\varepsilon ')\). Thus the overall weight (this is again the numerator of the corresponding probability) of the dense region is at least \(a^{Ny} e^{(N-y\alpha _N) \gamma y(1-\varepsilon ')}\).

To compare the two weights, let us see, if we can arrange the parameters in such a way that

$$\begin{aligned} a^{Ny} e^{(N-y\alpha _N) \gamma y(1-\varepsilon ')} \gg 2 e^{N \gamma y}b^N \end{aligned}$$

(by which we mean that

$$\begin{aligned} \frac{a^{Ny} e^{(N-y\alpha _N) \gamma y(1-\varepsilon ')}}{2 e^{N \gamma y}b^N} \rightarrow \infty \end{aligned}$$

as \(N \rightarrow \infty \)). Since \(\varepsilon '>0\) is fixed and can be chosen arbitrarily small, we may as well check whether

$$\begin{aligned} a^{Ny} e^{(N-y\alpha _N) \gamma y} \gg 2 e^{N \gamma y}b^N \end{aligned}$$

which is the case, if and only if

$$\begin{aligned} \exp \left( Ny \left( \log a-\frac{y\gamma \alpha _N}{N}-\frac{\log b}{y} \right) \right) \gg 1. \end{aligned}$$

To this end, substitute \(\gamma = \frac{Ny \log b}{2 \delta _N}\) (and note that indeed \(\gamma =\gamma _N \rightarrow \infty \) as \(N \rightarrow \infty \)) to obtain for the exponent:

$$\begin{aligned} Ny\left( \log a - \frac{y^2 \log b \, \alpha _N}{2 \delta _N} -\frac{\log b}{y} \right) . \end{aligned}$$

Now take \(y=\lceil 2\frac{\log b}{\log a}\rceil \), so that \(\log a -\frac{\log b}{y} \ge \frac{1}{2} \log a >0\). Since, by assumption, \(\frac{\alpha _N}{\delta _N} \rightarrow 0\) and y does not depend on N, also \(\frac{y^2 \log b \, \alpha _N}{2 \delta _N}\) converges to 0. This implies that the exponent eventually becomes positive and tends to infinity, hence the dense region carries an arbitrarily large share of the mass. \(\square \)

Remark 1

Reading [2] carefully, one may get the impression that for them a dense region is one with an exponential number of local minima of \(E_N\) (again, the authors in [2] are not very explicit about this). However, if we are taking the limit \(\beta \rightarrow \infty \) slowly enough as in a real Simulated Annealing schedule, the local minima that are not global minima will eventually get zero probability and hence are negligible. As a matter of fact, if one works with finite times as in our next section, this is, of course, not true. In this case however, one could equally well study a low temperature Metropolis chain, since most of the time in the annealing schedules is spent in the low temperature region, anyway, as remarked above. For this Metropolis–Hastings chain a result similar to Proposition 2 can be shown very similarly.

3.2 The Limit \(\gamma \rightarrow \infty \)

The situation where also \(\gamma \) depends on time and converges to infinity as time becomes large differs from the fixed \(\gamma \) situation. Even though this is not explicitly stated in [2], it seems to be the version of the algorithm that the authors there have in mind. Indeed, as mentioned, they only consider a finite time horizon, in which they, however, increase \(\gamma \).

In the situation with \(\gamma \rightarrow \infty \) we need to modify the considerations of the previous section. Again we will assume that we keep \(\beta _n, \gamma _n\) constant for \(T_n\) consecutive steps, i.e. on the interval \(L_{n-1} < t \le L_n\). For the algorithm in this fixed time interval, again, the invariant measure is given by \(\overline{Q}_\beta \) with \(\beta =\beta _n\) and \(\gamma =\gamma _n\) as given in (8). This is the case because during this interval the parameters of the Metropolis chain do not change. To stress the dependence on both parameters, we will from now on denote this measure by \(\overline{Q}_{\beta ,\gamma }\).

Following the arguments in the previous subsection we now see that there is a constant \(\kappa _2\), such that

$$\begin{aligned} ||\overline{Q}_{\beta _n,\gamma _n}-\overline{Q}_{\beta _{n+1},\gamma _{n+1}}||_\infty \le \kappa _2 e^{-(\beta _n B +\gamma _n B')}. \end{aligned}$$

Here again, \(B:= \min _{\sigma : E(\sigma ) \ne 0} E(\sigma )\). Analogously, the constant \(B'\) is defined as the gap between the maximum of the function

$$\begin{aligned} H(\{\sigma ^a\}):= \sum _{i=1}^N \log \cosh (\gamma \sum _{a=1}^y \sigma _i^a) \qquad \text{ on } \varSigma ^y \end{aligned}$$

and its second largest value. Hence

$$\begin{aligned} B':= & {} N \log \cosh (\gamma y)-((N-1)\log \cosh (\gamma y)+\log \cosh (\gamma (y-2)))\\= & {} \log \cosh (\gamma y)-\log \cosh (\gamma (y-2)). \end{aligned}$$

The maximum of H is realized when we take all \(\sigma ^a\) identical, while the second term in \(B'\) stems from the fact that we obtain the second largest value of H by flipping one spin of one \(\sigma ^a\) in a maximizing configuration. Since we will consider the limit \(\gamma \rightarrow \infty \) we may safely replace \(\log \cosh (\gamma y)\) by \(\gamma y -\log 2\) and \(\log \cosh (\gamma (y-2))\) by \(\gamma (y-2) -\log 2\) to obtain

$$\begin{aligned} B' \approx 2 \gamma . \end{aligned}$$
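This approximation is easy to check numerically (a quick illustration of ours):

```python
import numpy as np

def log_cosh(x):
    # numerically stable log cosh
    return np.logaddexp(x, -x) - np.log(2)

for gamma in (1.0, 2.0, 5.0):
    for y in (3, 5, 7):
        Bp = log_cosh(gamma * y) - log_cosh(gamma * (y - 2))
        print(f"gamma={gamma}, y={y}: B'={Bp:.4f}  vs  2*gamma={2 * gamma}")
```

Already for moderate \(\gamma \) the difference \(B'-2\gamma \) is negligible.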

To determine how the cooling schedule has to be chosen, we need to estimate the spectral gap of the Metropolis chain. Note that, if we use Proposition 1 to do so, we run into the problem, that the constants c and C there depend on time, because the energy function does. The solution is, of course, to include this time dependence into the definitions. Hence for a time t let

$$\begin{aligned} F_t(\{\sigma ^a\}):= \sum _{a=1}^y E(\sigma ^a)- \frac{1}{\beta _t}\sum _{i=1}^N \log \cosh (\gamma _t \sum _{a=1}^y \sigma _i^a). \end{aligned}$$

Then we can represent the Simulated Annealing chain, which we will now denote by \(S_{\beta , \gamma }\) and which is still given by (7) (with the only difference that now also \(\varPi _\gamma \) depends on time), as a Simulated Annealing algorithm with time-dependent energy function \(F_t\), see e.g. [28, 14]. Indeed, in this case we may change the proposal chain \(\varPi _\gamma \) in (7) to \(\varPi \). Here \(\varPi \), being in \(\{\sigma ^a\}\), picks one \(a\in \{1, \ldots , y\}\) and one index \(i\in \{1, \ldots , N\}\) at random and flips \(\sigma _i^a\) to \(-\sigma _i^a\). Then \(S_{\beta , \gamma }\) can be written as

$$\begin{aligned}&S_{\beta ,\gamma }(\{\sigma ^a\},\{{\tilde{\sigma }}^a\})\nonumber \\&\quad := \left\{ \begin{array}{ll} \exp \left( -\beta (F_t(\{{\tilde{\sigma }}^a\})-F_t(\{\sigma ^a\}))^+\right) \varPi (\{\sigma ^a\},\{{\tilde{\sigma }}^a\})&{} \text{ if } \{\sigma ^a\} \ne \{{\tilde{\sigma }}^a\}\\ 1- \sum _{\{{\hat{\sigma }}^a\}\ne \{\sigma ^a\}} \exp \left( -\beta (F_t(\{{\hat{\sigma }}^a\})-F_t(\{\sigma ^a\}))^+\right) \varPi (\{\sigma ^a\},\{{\hat{\sigma }}^a\})&{} \text{ if } \{\sigma ^a\} =\{{\tilde{\sigma }}^a\} \end{array} \right. \end{aligned}$$
(13)

Now we can use results from [28] (cf. [27] for related work) to compute the spectral gap of \(S_{\beta ,\gamma }\). In analogy to what we did in the previous subsection, define

$$\begin{aligned} m_t:= \max _{\{\sigma ^a\},\{\tau ^a\}}\left( H_t(\{\sigma ^a\},\{\tau ^a\})-F_t(\{\sigma ^a\})-F_t(\{\tau ^a\})\right) \end{aligned}$$

where

$$\begin{aligned} H_t(\{\sigma ^a\},\{\tau ^a\})= \min _{p \in \mathcal {P}(\{\sigma ^a\},\{\tau ^a\})} \mathrm {Elev}_t(p) \end{aligned}$$

and

$$\begin{aligned} \mathrm {Elev}_t(p):= \max _{r=0, \ldots , M} F_t(\{\nu ^{a,r}\}). \end{aligned}$$

Again, in analogy to the previous subsection, for functions \(f,g:\varSigma ^y \rightarrow {\mathbb {R}}\) let the Dirichlet form \(\mathcal {E}_{t,\beta ,\gamma }\) be given by

$$\begin{aligned} \mathcal {E}_{t,\beta ,\gamma } (f,g)= & {} \frac{1}{2 {\hat{\varGamma }}}\sum _{\{\sigma ^a\},\{\tau ^a\}}(f(\{\sigma ^a\})-f(\{\tau ^a\})) (g(\{\sigma ^a\})-g(\{\tau ^a\}))\\&\times \exp \left( -\beta (F_t(\{\sigma ^a\}) \vee F_t(\{\tau ^a\}))\right) \mu _0(\{\sigma ^a\}) \varPi (\{\sigma ^a\},\{\tau ^a\}). \end{aligned}$$

Then for

$$\begin{aligned} \psi _t(\beta ):= \inf \{\mathcal {E}_{t,\beta ,\gamma }(f,f): ||f||_{L^2({\overline{Q}}_{\beta ,\gamma })}=1 \text{ and } \int f d{\overline{Q}}_{\beta ,\gamma }=0\} \end{aligned}$$

we have

Proposition 3

There are constants \(c>0\) and \(C < \infty \) such that for all \(\beta \ge 0\),

$$\begin{aligned} c e^{-\beta m_t} \le \psi _t (\beta ) \le C e^{-\beta m_t}. \end{aligned}$$

Proof

This is the content of [28, Theorem 2.1]. \(\square \)

As before Proposition 3 implies for the second largest eigenvalue \(\lambda _1(\beta ,\gamma )\) of \(S_{\beta ,\gamma }\) that

$$\begin{aligned} |\lambda _1(\beta _n,\gamma _n)| \le 1-C e^{-\beta _n m_n}. \end{aligned}$$

From linear algebra we therefore obtain that for each \(n=1,2, \ldots \) and any probability measure \(\nu _0\) on \(\varSigma ^y\)

$$\begin{aligned} || \nu _0 S_{\beta _n,\gamma _n}^{T_n}||_\infty \le \kappa _3 \left( 1-C e^{-\beta _n m_n}\right) ^{T_n}||\nu _0||_\infty \end{aligned}$$

for some constant \(\kappa _3>0\) (cf. the very similar argument for ordinary Simulated Annealing in [1, (7.8)]). Writing again

$$\begin{aligned} \varepsilon _n := ||\overline{Q}_{\beta _n,\gamma _n}-\nu _n ||_\infty \end{aligned}$$

by the recursive structure of the annealing algorithm and the considerations above we obtain the estimate

$$\begin{aligned} \varepsilon _n\le \kappa _2 e^{-(\beta _n B +\gamma _n B')}+\kappa _3 \left( 1-C e^{-\beta _n m_n}\right) ^{T_n} \varepsilon _{n-1}. \end{aligned}$$

Solving this recursive inequality gives

$$\begin{aligned} \varepsilon _n \le \kappa _3^n \prod _{k=1}^n \left( 1-C e^{-\beta _k m_k}\right) ^{T_k}\left( \sum _{k=1}^n \frac{\kappa _2 e^{-(\beta _k B +\gamma _k B')}}{u_k}+\varepsilon _0\right) \end{aligned}$$
(14)

(cf. [1, (7.14)]). Here

$$\begin{aligned} u_k:=\kappa _3^k \prod _{j=1}^k \left( 1-C e^{-\beta _j m_j}\right) ^{T_j}. \end{aligned}$$

Hence we need to choose our parameters \(\beta _n,\gamma _n, T_n\) in such a way that the right hand side converges to zero. We have thus shown the following theorem.

Theorem 2

If

$$\begin{aligned} \kappa _3^n \prod _{k=1}^n \left( 1-C e^{-\beta _k m_k}\right) ^{T_k}\left( \sum _{k=1}^n \frac{\kappa _2 e^{-(\beta _k B +\gamma _k B')}}{u_k}+\varepsilon _0\right) \rightarrow 0 \end{aligned}$$
(15)

as \(n \rightarrow \infty \), the distribution \(\nu _n\) of the annealing chain has the same limit as \(\overline{Q}_{\beta _n,\gamma _n}\).

Following [1, (7.17)] a necessary condition for (15) is

$$\begin{aligned} \lim _{n \rightarrow \infty } \left( -\sum _{k=1}^n T_k e^{-\beta _k m_k}+n \log \kappa _3\right) =-\infty . \end{aligned}$$

Remark 2

If we are right with the assumption that the authors in [2] would take \(\gamma _n \rightarrow \infty \) when time gets large, the result of the theorem is, however, not what they seem to intend with their introduction of the Replicated Simulated Annealing algorithm. Indeed, when \(\beta _n \rightarrow \infty \) and \(\gamma _n \rightarrow \infty \), the measure \(\overline{Q}_{\beta _n,\gamma _n}\) converges to \({\overline{Q}}_{\infty ,\infty }\). However, the latter is nothing but the uniform distribution on

$$\begin{aligned} {\tilde{{\mathcal {N}}}}_0:=\{\{\sigma ^1, \ldots , \sigma ^y\}: E(\sigma ^a)=0 \, \text { for all } a=1, \ldots , y, \text{ and } \sigma ^1 = \ldots = \sigma ^y\}. \end{aligned}$$

In particular, \({\overline{Q}}_{\infty ,\infty }\) does not put higher probability on configurations in dense regions of the state space.

Remark 3

Note that for both Theorems 1 and 2 the cooling schedules have to be chosen very carefully. An anonymous referee remarked that there are simulation algorithms for Gibbs measures that do not use such a cooling strategy, such as parallel tempering [33], swapping [16, 17, 32], or equi-energy sampling [26]. We are grateful for this remark.

However, there are some issues with these algorithms. First of all, all of them simulate Gibbs measures at non-zero temperatures. That means we will obtain an impression of the energy landscape, but not necessarily convergence towards its global maxima or minima. Still, for the simulations in Sect. 4 this is an important remark.

The tempering algorithms usually suffer from the deficit that they require the computation of partition functions, which is as hard as finding the minima or maxima of the energies involved. Swapping circumvents these problems. However, the speed of convergence may be a problem (as it is for simulated annealing). In some situations the swapping algorithm converges rapidly (i.e. in polynomial time), see e.g. [12, 29, 31]; in others the convergence takes exponentially long, see [3] or [10]. The results in [11] show that equi-energy sampling typically does not overcome the problem of torpid mixing.

In the next section, we empirically study a slightly modified version of the Replicated Simulated Annealing algorithm, using both synthetic toy datasets and real databases.

4 Experiments

Throughout this section, we present various experiments we conducted to empirically study the effectiveness of Replicated Simulated Annealing. While the previous section emphasized theoretical results on the asymptotic behaviour of the algorithm, which from our point of view are necessary for its introduction, the current section analyzes its finite time behaviour and the role of the choice of parameters.

Hence, in this simulation section we necessarily stay closer to the setting in [2]. In particular, in contrast to the preceding theoretical section, we will not let \(\beta \) and \(\gamma \) tend to infinity (for the theoretical part this was necessary in order to obtain convergence results, while it is impossible in practical applications). The precise setting will be described below. We will put an emphasis on studying the effect of the choice of these hyperparameters on the performance of our algorithm, as well as the robustness of the found solutions.

We conduct our experiments using the MNIST dataset, also described below, and synthetic data.

4.1 MNIST Dataset

MNIST is a dataset of images depicting digits between 0 and 9. We randomly choose a learning set of 6000 examples per digit, i.e. these examples are used to calibrate the model. The aim is to train a classifier to correctly predict which digits are depicted in previously unseen images. This ability to generalize is measured using a test set containing 1000 examples per digit, distinct from those appearing in the training set. The proportion of correctly classified images in the training set (resp. test set) is called the training accuracy (resp. test accuracy). MNIST images are 28 \(\times \) 28 pixels and grey-scale. As such, they are typically represented by a 784-dimensional vector of integers between 0 and 255.

When training a binary (weights can only be − 1 or 1) logistic regression classifier on MNIST using Replicated Simulated Annealing, we typically achieve an 88% accuracy on the test set, which is on par with the performance obtained with continuous weights and gradient descent. Note that when training our models, we use the cross-entropy loss as our energy, which we refer to as the training loss in the following. In the case of classification with \(K=10\) classes, the output of the model associated to an input \(x_i\) is a probability vector \(y_i=(y_{i,1},\ldots , y_{i,K})\), and the cross-entropy is then

$$\begin{aligned} H(t,y) = - \sum _{i=1}^n \sum _{k=1}^K t_{i,k} \log (y_{i,k}), \end{aligned}$$

where \(t_i\in \{0,1\}^K\) is the one-hot encoding of the true class of \(x_i\). Besides, the cross-entropy loss on the test set is referred to as the test loss. We train the networks for a total of 300,000 iterations, starting from a random configuration. Our models contain a total of \(784\cdot 10 = 7840\) parameters, corresponding to a single \(784 \times 10\) matrix that maps a raw 784-dimensional image to a 10-dimensional output vector whose largest coordinate indicates the associated decision.
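A minimal sketch of this energy computation, under our reading of the setup (a single binary weight matrix followed by a softmax); all variable names and the random stand-in data are ours:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)       # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy_energy(W, X, T):
    # W: 784 x 10 matrix of +-1 weights, X: n x 784 inputs, T: n x 10 one-hot labels
    Y = softmax(X @ W)
    return -(T * np.log(Y + 1e-12)).sum()

rng = np.random.default_rng(3)
W = rng.choice([-1, 1], size=(784, 10))
X = rng.random((5, 784))                        # stand-in for 5 normalized images
T = np.eye(10)[rng.integers(10, size=5)]        # one-hot labels
print("energy:", cross_entropy_energy(W, X, T))
```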

4.2 Effect of the Initial and Final Values of \(\beta \)

As mentioned above, in our experiments we will always take \(\beta \) from a certain bounded range of values \([\beta _i, \beta _f]\) (these bounds will be used in the remainder of this work). We first explore the influence of the initial and final values for \(\beta \). Throughout our experiments, we change the value of \(\beta \) from \(\beta _i\) to \(\beta _f\) following an exponential interpolation where \(\beta = \beta _i \left( \beta _f/\beta _i\right) ^{it/it_{\max }}\), it being the current number of iterations and \(it_{\max }\) the total number of iterations. This choice of interpolation appeared to give the best and most consistent results among the interpolations we tried, including linear and quadratic ones with various parameters. Note that for this first series of experiments we only train one model (\(\gamma = 0\)). First, in Table 1, we indicate the number of active transitions (a transition is active when a potential flip of a value has actually been performed). Unsurprisingly, we observe that the higher the values of \(\beta \), the less likely we are to perform flips. We observe a range of two orders of magnitude with our selected parameters.
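For reference, a short sketch of this interpolation (parameter values taken from the experiments below):

```python
import numpy as np

def beta_schedule(beta_i, beta_f, it_max):
    # exponential interpolation: beta = beta_i * (beta_f / beta_i) ** (it / it_max)
    it = np.arange(it_max + 1)
    return beta_i * (beta_f / beta_i) ** (it / it_max)

betas = beta_schedule(100.0, 100_000.0, 300_000)
print(betas[0], betas[150_000], betas[-1])   # 100.0, ~3162.3, 100000.0
```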

Table 1 Number of active transitions during the learning of MNIST, as a function of \(\beta _i\) and \(\beta _f\)

In Table 2, we depict the corresponding training loss and training accuracy. Interestingly, we observe that the largest values of \(\beta \) do not necessarily give the best results, suggesting that allowing flips that immediately increase the loss slightly can be beneficial in the long run. We also observe that the results do not seem to be very sensitive to the choice of the initial and final values for \(\beta \), as a large range of these values yields a very similar performance. Together with Table 1, we can observe that \(\beta _i = 100\) and \(\beta _f = 100,000\) is a reasonable choice of parameters. This is also confirmed by the results given in Table 3 where we depict the corresponding test loss and test accuracy.

Table 2 Final MNIST train set loss and corresponding accuracies, as a function of \(\beta _i\) and \(\beta _f\)
Table 3 Final MNIST test set loss and corresponding accuracies, as a function of \(\beta _i\) and \(\beta _f\)

4.3 Influence of \(\gamma \)

To gain a better understanding of the influence of \(\gamma \), in the next series of experiments we reduce the number of training samples to accelerate computations. Namely, we use 10,000 arbitrary training samples. We perform 10 runs for each value of \(\gamma \), choosing the best values of \(\beta _i\) and \(\beta _f\) found in the previous section. We plot the error bars (confidence interval at 95%) for each value of \(\gamma \). In Fig. 1 we depict the evolution of the training accuracy and training loss, in Fig. 2 the evolution of the test accuracy and test loss, and in Fig. 3 the evolution of the number of active transitions. We observe that \(\gamma \) helps in finding better solutions, that is to say solutions with higher accuracies on both the training and the test set. This is only true for a limited range though, as increasing \(\gamma \) too much leads to a dramatic decrease in overall performance. This is not surprising, since too large a \(\gamma \) forbids many transitions that would result in reducing the loss. This may also be seen as being in agreement with the findings of Proposition 2, Theorem 2 and Remark 2 in the previous section (even though there we chose the parameters in such a way that convergence to an invariant measure was guaranteed).

Fig. 1

Evolution of the train accuracy and train loss as a function of \(\gamma \), and for various values of y

Fig. 2

Evolution of the test accuracy and test loss as a function of \(\gamma \), and for various values of y

Fig. 3

Evolution of the number of active transitions as a function of \(\gamma \), and for various values of y

4.4 Robustness of Trained Models

To study the robustness of trained models, we consider randomly perturbing a proportion p of the weights in the trained models, and evaluating the impact on the test accuracy. We average each point over 1000 runs of random perturbations, but since it takes a very long time to train the models on MNIST, we always use the same trained models (one for each value of \(\gamma \)). In Fig. 4, we depict the results for \(y=3\), in Fig. 5 for \(y=5\), and in Fig. 6 for \(y=7\). In order to add statistically more significant results, we also plot in Fig. 7 results obtained with synthetic data and \(y=10\). The synthetic data is created by generating 30 vectors uniformly drawn with repetition from all binary vectors of size 100. In this experiment, we average the results over 1000 tests for each point. For this additional experiment, we found that the best values are \(\beta _i = 0.1\) and \(\beta _f = 1000\).
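A sketch of this perturbation protocol (the trained weights W and the test data X_test, y_test are hypothetical placeholders here):

```python
import numpy as np

rng = np.random.default_rng(4)

def perturb(W, p):
    # flip a random proportion p of the +-1 weights
    mask = rng.random(W.shape) < p
    return np.where(mask, -W, W)

def accuracy(W, X, labels):
    return float((np.argmax(X @ W, axis=1) == labels).mean())

# average over many random perturbations of the same trained model, e.g.:
# accs = [accuracy(perturb(W, 0.05), X_test, y_test) for _ in range(1000)]
```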

Fig. 4

Robustness of trained models on MNIST as a function of the proportion of flipped parameters p (\(y=3\)). Shaded regions around the curves correspond to the confidence interval at 95%

Fig. 5

Robustness of trained models on MNIST as a function of the proportion of flipped parameters p (\(y=5\)). Shaded regions around the curves correspond to the confidence interval at 95%

Fig. 6

Robustness of trained models on MNIST as a function of the proportion of flipped parameters p (\(y=7\)). Shaded regions around the curves correspond to the confidence interval at 95%

Fig. 7

Robustness of trained models with synthetic data as a function of the proportion of flipped parameters p (\(y=10\)). Shaded regions around the curves correspond to the confidence interval at 95%

Interestingly, we observe that the most robust models are the ones obtained for a balanced value of \(\gamma \), typically 0.8 or 1.6. This is even true in the case of synthetic data, despite the fact that all the models start with a perfect accuracy of 100% when uncorrupted. This is in line with the claims of the authors of [2].

5 Conclusion

In this work, we have proposed to mathematically and empirically study the algorithm of Replicated Simulated Annealing, which is used to find good configurations of neural networks with discrete weights. Here the term “good configurations” refers to configurations in so called dense regions. We have proposed a definition of such dense regions, which are supposed to yield good generalization properties. We have given conditions that ensure convergence of the algorithm and discussed its ability to find good configurations in dense, robust regions of the search space. We have seen that to do so the parameter \(\beta \) always needs to be taken to infinity when time becomes large, while the parameter \(\gamma \) needs to stay finite.

We also performed experiments using both real datasets and synthetic data to illustrate the role of the choice of the parameters in finite time. Overall, our findings show that Replicated Simulated Annealing is able to find interesting, i.e. ”good”, configurations, but that the gain compared to simple Simulated Annealing is rather small, and sometimes even nonexistent in the asymptotic regime, depending on whether or not one lets \(\gamma \rightarrow \infty \).