1 Introduction

Many problems in machine learning and statistics, engineering design, physics, medicine, management science, and many other fields require finding a global minimum of a function without derivative information; see, e.g., the excellent survey on derivative-free optimization [32]. Some of the most successful approaches to derivative-free global optimization include deterministic methods based on recursively splitting the feasible space into hyper-rectangles, such as the DIRECT (DIvide a hyper-RECTangle) [22] and Multilevel Coordinate Search (MCS) [18] algorithms, and stochastic methods such as Particle Swarm Optimization (PSO) [12], genetic algorithms [43], and evolutionary algorithms [16].

The aforementioned methods can be very successful in reaching a global minimum without any assumption on convexity and smoothness of the function, but may require a large number of function evaluations during the execution of the algorithm. In many problems, however, the objective function is a black box that can be very time-consuming to evaluate. For example, in hyperparameter tuning of machine learning algorithms, one needs to run a large set of training tests per hyperparameter choice; in structural engineering design, testing the mechanical property resulting from a given choice of parameters may require several hours of computing solutions to partial differential equations; in control systems design, testing a combination of controller parameters involves running a real closed-loop experiment, which is time-consuming and costly. For this reason, many researchers have studied algorithms for black-box global optimization that aim at minimizing the number of function evaluations by replacing the function to minimize with a surrogate function [21]. The latter is obtained by sampling the objective function and interpolating the samples with a map that, compared to the original function, is very cheap to evaluate. The surrogate is then used to solve a (much cheaper) global optimization problem that decides the new point at which the original function must be evaluated. A better-quality surrogate is then created by also exploiting the new sample, and the procedure is iterated. For example, quadratic surrogate functions are used in the well-known global optimization method NEWUOA [30].

As underlined by several authors (see, e.g., [21]), purely minimizing the surrogate function may lead to convergence to a point that is not the global minimum of the black-box function. To account for the fact that the surrogate and the true objective function differ from each other in an unknown way, the surrogate is typically augmented by an extra term that captures this uncertainty. The resulting acquisition function is then minimized instead to generate a new sample of the optimization vector, trading off between seeking a new vector where the surrogate is small and exploring regions of the feasible space that have not yet been visited.

Bayesian Optimization (BO) is a popular class of global optimization methods based on surrogates that, by modeling the black-box function as a Gaussian process, enables one to quantify in statistical terms the discrepancy between the two functions, information that is used to drive the search. BO has been studied since the sixties in global optimization [25] and in geostatistics [26] under the name of Kriging methods; it became popular for solving problems of Design and Analysis of Computer Experiments (DACE) [34], see for instance the popular Efficient Global Optimization (EGO) algorithm [23]. It is nowadays very popular in machine learning for tuning the hyperparameters of different algorithms [9, 13, 35, 37].

Motivated by learning control systems from data [29] and by self-calibration of optimal control parameters [14], in this paper we propose an alternative approach for solving global optimization problems in which the objective function is expensive to evaluate, based on Inverse Distance Weighting (IDW) interpolation [24, 36] and Radial Basis Functions (RBFs) [17, 27]. The use of RBFs for solving global optimization problems was already adopted in [11, 15], in which the acquisition function is constructed by introducing a “measure of bumpiness”. The author of [15] shows that such a measure is related to the probability that the underlying function attains a value lower than a given threshold, as used in Bayesian optimization. RBFs were also adopted in [31], with additional constraints imposed to make sure that the feasible set is adequately explored. In this paper we use a different acquisition function based on two exploration components: an estimate of the confidence interval associated with the interpolant function, defined as in [24], and a new measure based on inverse distance weighting that is totally independent of the underlying black-box function and its surrogate. Both terms aim at exploring the domain of the optimization vector. Moreover, arbitrary constraints that are simple to evaluate are also taken into account, as they can be easily imposed during the minimization of the acquisition function.

Compared to Bayesian optimization, our non-probabilistic approach to global optimization is very competitive, as we show on a set of benchmark global optimization problems and on hyperparameter selection problems, and is also computationally lighter than off-the-shelf implementations of BO.

A preliminary version of this manuscript was made available in [4] and later extended in [5] to solve preference-based optimization problems. MATLAB and Python implementations of the proposed approach and of that of [5] are available for download at http://cse.lab.imtlucca.it/~bemporad/glis. For an application of the GLIS algorithm proposed in this paper to learning optimal calibration parameters in embedded model predictive control applications, the reader is referred to [14].

The paper is organized as follows. After stating the global optimization problem we want to solve in Sect. 2, Sects. 3 and 4 deal with the construction of the surrogate and acquisition functions, respectively. The proposed global optimization algorithm is detailed in Sect. 5 and several results are reported in Sect. 6. Finally, some conclusions are drawn in Sect. 7.

2 Problem formulation

Consider the following constrained global optimization problem

$$\begin{aligned} \begin{array}{rl} \min _x & f(x)\\ \mathop {\mathrm{s.t.}}\nolimits & \ell \le x\le u\\ & x\in \mathcal {X}\end{array} \end{aligned}$$
(1)

where \(f:{\mathbb R}^n\rightarrow {\mathbb R}\) is an arbitrary function of the optimization vector \(x\in {\mathbb R}^n\), \(\ell ,u\in {\mathbb R}^n\) are vectors of lower and upper bounds, and \(\mathcal {X}\subseteq {\mathbb R}^n\) imposes further arbitrary constraints on x. Typically \(\mathcal {X}=\{x\in {\mathbb R}^n:\ g(x)\le 0\}\), where the vector function \(g:{\mathbb R}^n\rightarrow {\mathbb R}^q\) defines inequality constraints, with \(q=0\) meaning that no inequality constraint is enforced; for example, linear inequality constraints are defined by setting \(g(x)=Ax-b\), with \(A\in {\mathbb R}^{q\times n}\), \(b\in {\mathbb R}^q\), \(q\ge 0\). We are particularly interested in problems of the form (1) in which f(x) is expensive to evaluate and its gradient is not available, while the condition \(x\in \mathcal {X}\) is easy to evaluate. Although not comprehensively addressed in this paper, we will show that our approach also tolerates noisy evaluations of f, that is, if we measure \(y=f(x)+\varepsilon\) instead of f(x), where \(\varepsilon\) is an unknown quantity. We will not make any assumption on f, g, and \(\varepsilon\). In (1) we do not include possible linear equality constraints \(A_{e}x=b_e\), as they can be eliminated beforehand by reducing the number of optimization variables, so that the exploration is performed in a lower-dimensional space.

3 Surrogate function

Assume that we have collected a vector \(F=[f_1\ \ldots \ f_N]'\), \(F\in {\mathbb R}^N\), of N samples \(f_i=f(x_i)\) of f at corresponding points \(X=[x_1\ \ldots \ x_N]'\), \(X\in {\mathbb R}^{N\times n}\), with \(x_i\ne x_j\), \(\forall i\ne j\), \(i,j=1,\ldots ,N\). We next consider two types of surrogate functions, namely Inverse Distance Weighting (IDW) functions [24, 36] and Radial Basis Functions (RBFs) [15, 27].

3.1 Inverse distance weighting functions

Given a generic new point \(x\in {\mathbb R}^n\), consider the squared Euclidean distance function \(d^2:{\mathbb R}^{n}\times {\mathbb R}^n\rightarrow {\mathbb R}\)

$$\begin{aligned} d^2(x,x_i)=(x_i-x)'(x_i-x),\quad i=1,\ldots ,N \end{aligned}$$
(2)

In standard IDW functions [36] the weight functions \(w_i:{\mathbb R}^n\backslash \{x_i\}\rightarrow {\mathbb R}\) are defined by the inverse squared distances

$$\begin{aligned} w_i(x)=\frac{1}{d^2(x,x_i)} \end{aligned}$$
(3a)

The alternative weighting function

$$\begin{aligned} w_i(x)=\frac{e^{-d^2(x,x_i)}}{d^2(x,x_i)} \end{aligned}$$
(3b)

suggested in [24] has the advantage of being similar to the inverse squared distance in (3a) for small values of \(d^2\), but makes the effect of points \(x_i\) located far from x fade out quickly due to the exponential term.

By defining for \(i=1,\ldots ,N\) the following functions \(v_i:{\mathbb R}^n\rightarrow {\mathbb R}\) as

$$\begin{aligned} v_i(x)=\left\{ \begin{array}{ll} 1 &\quad \hbox {if}\ x=x_i\\ 0 &\quad \hbox {if}\ x=x_j,\ j\ne i\\ \displaystyle {\frac{w_i(x)}{\sum _{j=1}^Nw_j(x)}} &\quad \hbox {otherwise}\end{array}\right. \end{aligned}$$
(4)

the surrogate function \({\hat{f}}:{\mathbb R}^n\rightarrow {\mathbb R}\)

$$\begin{aligned} {\hat{f}}(x)=\sum _{i=1}^Nv_i(x)f_i \end{aligned}$$
(5)

is an IDW interpolation of (X, F).

Lemma 1

The IDW interpolation function \({\hat{f}}\) defined in (5) enjoys the following properties:

P1. \({\hat{f}}(x_j)=f_j\), \(\forall j=1,\ldots ,N\);

P2. \(\min _j\{f_j\}\le {\hat{f}}(x)\le \max _j\{f_j\}\), \(\forall x\in {\mathbb R}^n\);

P3. \({\hat{f}}\) is differentiable everywhere on \({\mathbb R}^n\) and, in particular, \(\nabla {\hat{f}}(x_j)=0\) for all \(j=1,\ldots ,N\).

The proof of Lemma 1 is very simple and is reported in “Appendix A”.

Note that in [24] the authors suggest improving the surrogate function by adding a regression model to (5) to take global trends into account. In our numerical experiments we found, however, that adding such a term does not lead to significant improvements of the proposed global optimization algorithm.

A one-dimensional example of the IDW surrogate \({\hat{f}}\) sampled at five different points of the scalar function

$$\begin{aligned} f(x)=\left( 1+\frac{x\sin (2x)\cos (3x)}{1+x^2}\right) ^2+\frac{x^2}{12}+\frac{x}{10} \end{aligned}$$
(6)

is depicted in Fig. 1. The global optimizer is \(x^*\approx -0.9599\) corresponding to the global minimum \(f(x^*)\approx 0.2795\).

Fig. 1
figure 1

A scalar example of f(x) as in (6) (blue) sampled at \(N=5\) points (blue circles), IDW surrogate \({\hat{f}}(x)\) (orange) with \(w_i(x)\) as in (3b), RBF inverse quadratic with \(\epsilon =0.5\) (red), RBF thin plate spline surrogate with \(\epsilon =0.01\) (green), global minimum (purple diamond) (Color figure online)
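To make the construction (2)–(5) concrete, the following minimal Python sketch (NumPy assumed; the helper name is ours, not taken from the GLIS package) evaluates the IDW surrogate with the weights (3b):

```python
import numpy as np

def idw_surrogate(x, X, F):
    """Evaluate the IDW surrogate (5) at x, given samples X (N x n) and F (N,).

    Uses the weights (3b) and the normalization (4); returns f_i exactly
    when x coincides with a sample x_i.
    """
    d2 = np.sum((X - x) ** 2, axis=1)   # squared distances (2)
    hit = d2 == 0.0
    if hit.any():                       # x equals one of the samples
        return F[hit][0]
    w = np.exp(-d2) / d2                # weights (3b); use 1/d2 for (3a)
    v = w / w.sum()                     # normalized weights (4)
    return v @ F                        # interpolant (5)
```

Evaluating idw_surrogate on a fine grid with the five samples of Fig. 1 reproduces an interpolant like the orange curve shown there.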

3.2 Radial basis functions

A possible drawback of the IDW function \({\hat{f}}\) defined in (5) is due to property P3: as the number N of samples increases, the surrogate function tends to ripple, since its gradient must vanish at every sample. An alternative is to use a radial basis function (RBF) [15, 27] as a surrogate function. These are defined by setting

$$\begin{aligned} {\hat{f}}(x)=\sum _{i=1}^N\beta _i\phi (\epsilon d(x,x_i)) \end{aligned}$$
(7)

where \(d:{\mathbb R}^{n}\times {\mathbb R}^n\rightarrow {\mathbb R}\) is the function defining the Euclidean distance as in (2), \(d(x,x_i)=\Vert x-x_i\Vert _2\), \(\epsilon >0\) is a scalar parameter, \(\beta _i\) are coefficients to be determined as explained below, and \(\phi :{\mathbb R}\rightarrow {\mathbb R}\) is a RBF. Popular examples of RBFs are

$$\begin{aligned} \begin{array}{llll} \phi (\epsilon d)=\frac{1}{1+(\epsilon d)^2} & \hbox {(inverse quadratic)} & \phi (\epsilon d)=e^{-(\epsilon d)^2} & \hbox {(Gaussian)}\\ \phi (\epsilon d)=\sqrt{1+(\epsilon d)^2} & \hbox {(multiquadric)} & \phi (\epsilon d)=(\epsilon d)^2\log (\epsilon d) & \hbox {(thin plate spline)}\\ \phi (\epsilon d)=\epsilon d & \hbox {(linear)} & \phi (\epsilon d)=\frac{1}{\sqrt{1+(\epsilon d)^2}} & \hbox {(inverse multiquadric)} \end{array} \end{aligned}$$
(8)

The coefficient vector \(\beta =[\beta _1\ \ldots \ \beta _N]'\) is obtained by imposing the interpolation condition

$$\begin{aligned} {\hat{f}}(x_i)=f_i,\quad i=1,\ldots ,N \end{aligned}$$
(9)

Condition (9) leads to solving the linear system

$$\begin{aligned} M\beta = F \end{aligned}$$
(10a)

where M is the \(N\times N\) symmetric matrix whose (ij)-entry is

$$\begin{aligned} M_{ij}=\phi (\epsilon d(x_i,x_j)) \end{aligned}$$
(10b)

with \(M_{ii}=1\) for all the RBF types listed in (8) except the linear and the thin plate spline, for which \(M_{ii}=\lim _{d\rightarrow 0}\phi (\epsilon d)=0\). Note that if function f is evaluated at a new sample \(x_{N+1}\), updating M only requires appending the last row/column obtained by computing \(\phi (\epsilon d(x_{N+1},x_j))\) for all \(j=1,\ldots ,N+1\).

As highlighted in [15, 21], matrix M might be singular, even if \(x_i\ne x_j\) for all \(i\ne j\). To prevent issues due to a singular M, [15, 21] suggest using a surrogate function given by the sum of a RBF and a polynomial function of a certain degree. To also take into account unavoidable numerical issues when distances between sampled points get close to zero, which will easily happen as new samples are added towards finding a global minimum, in this paper we instead suggest using a singular value decomposition (SVD) \(M=U\Sigma V'\) of M. By neglecting singular values below a certain positive threshold \(\epsilon _{\mathrm{SVD}}\), we can approximate \(\Sigma =\left[ {\begin{matrix}\Sigma _1 & 0\\ 0 & 0 \end{matrix}} \right]\), where \(\Sigma _1\) collects all singular values \(\sigma _i\ge \epsilon _{\mathrm{SVD}}\), and accordingly split \(V=[V_1\ V_2]\), \(U=[U_1\ U_2]\) so that

$$\begin{aligned} \beta =V_1\Sigma _1^{-1}U_1'F \end{aligned}$$
(10c)

The threshold \(\epsilon _{\mathrm{SVD}}\) turns out to be useful when dealing with noisy measurements \(y=f(x)+\varepsilon\) of f. Figure 2 shows the approximation \({\hat{f}}\) obtained from 50 samples with \(\varepsilon\) normally distributed around zero with standard deviation 0.1, when \(\epsilon _\mathrm{SVD}=10^{-2}\).

Fig. 2
figure 2

Function f(x) as in (6) is sampled 50 times, with each sample corrupted by noise \(\varepsilon \sim {{\mathcal {N}}}(0,10^{-2})\) (blue). The RBF thin plate spline surrogate with \(\epsilon =0.01\) (green) is obtained by setting \(\epsilon _{\mathrm{SVD}}=10^{-2}\) (Color figure online)
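To illustrate the steps (10a)–(10c), the following sketch (again with helper names of our own) fits the coefficient vector \(\beta\) via the truncated SVD and evaluates the surrogate (7), here with the inverse quadratic RBF from (8):

```python
import numpy as np

def rbf_fit(X, F, phi, eps=1.0, eps_svd=1e-6):
    """Compute beta solving M beta = F as in (10a)-(10c), truncating the SVD."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    M = phi(eps * D)                                           # matrix (10b)
    U, s, Vt = np.linalg.svd(M)                                # M = U Sigma V'
    keep = s >= eps_svd                                        # drop small singular values
    return Vt[keep].T @ ((U[:, keep].T @ F) / s[keep])         # beta as in (10c)

def rbf_predict(x, X, beta, phi, eps=1.0):
    """Evaluate the RBF surrogate (7) at a point x."""
    return phi(eps * np.linalg.norm(X - x, axis=1)) @ beta

inv_quad = lambda t: 1.0 / (1.0 + t ** 2)   # inverse quadratic RBF from (8)
```

With noisy samples, raising eps_svd (e.g., to \(10^{-2}\) as in Fig. 2) filters the noise at the price of giving up exact interpolation.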

A drawback of RBFs, compared to IDW functions, is that property P2 is no longer satisfied, with the consequence that the surrogate may extrapolate large values \({\hat{f}}(x)\) where f(x) is actually small, and vice versa; see the examples plotted in Fig. 1. On the other hand, while differentiable everywhere, RBFs do not necessarily have zero gradients at sample points as in P3, which helps approximate the underlying function better from a limited number of samples. For the above reasons, we will mostly focus on RBF surrogates in our numerical experiments.

3.3 Scaling

To take into account that different components \(x^j\) of x may have different ranges \(u^j-\ell ^j\), we simply rescale the variables in optimization problem (1) so that they all range in \([-1,1]\). To this end, we first possibly tighten the given box constraints \(B_{\ell ,u}=\{x\in {\mathbb R}^n: \ell \le x\le u\}\) by computing the bounding box \(B_{\ell _g,u_g}\) of the set \(\{x\in {\mathbb R}^n:\ x\in \mathcal {X}\}\) and replacing \(B_{\ell ,u}\leftarrow B_{\ell ,u}\cap B_{\ell _g,u_g}\). The bounding box \(B_{\ell _g,u_g}\) is obtained by solving the following 2n optimization problems

$$\begin{aligned} \begin{array}{rcl} \ell _g^i &=& \min _{x\in \mathcal {X}} e_i'x\\ u_g^i &=& \max _{x\in \mathcal {X}} e_i'x \end{array} \end{aligned}$$
(11a)

where \(e_i\) is the ith column of the identity matrix, \(i=1,\ldots ,n\). In case of linear inequality constraints \(\mathcal {X}=\{x\in {\mathbb R}^n:\ Ax\le b\}\), the problems in (11a) can be solved by linear programming (LP), as shown later in (17). From now on, we assume that \(\ell ,u\) are replaced by

$$\begin{aligned} \begin{array}{rcl} \ell &\leftarrow & \max \{\ell ,\ell _g\}\\ u &\leftarrow & \min \{u,u_g\} \end{array} \end{aligned}$$
(11b)

where “\(\min\)” and “\(\max\)” in (11b) operate component-wise. Next, we introduce scaled variables \({\bar{x}}\in {\mathbb R}^n\) whose relation with x is

$$\begin{aligned} x^j({\bar{x}})=\bar{x}^j\left( \frac{u^j-\ell ^j}{2}\right) +\frac{u^j+\ell ^j}{2} \end{aligned}$$
(12a)

for all \(j=1,\ldots ,n\) and finally formulate the following scaled global optimization problem

$$\begin{aligned} \begin{array}{rl} \min & f_s({\bar{x}}) \\ \mathop {\mathrm{s.t.}}\nolimits & -1\le {\bar{x}}^j\le 1,\quad j=1,\ldots ,n\\ & {\bar{x}}\in \mathcal {X}_s \end{array} \end{aligned}$$
(12b)

where \(f_s:{\mathbb R}^n\rightarrow {\mathbb R}\), \(\mathcal {X}_s\) are defined as

$$\begin{aligned} \begin{array}{rcl} f_s({\bar{x}}) &=& f(x({\bar{x}})) \\ \mathcal {X}_s &=& \{{\bar{x}}\in {\mathbb R}^n:\ x({\bar{x}})\in \mathcal {X}\} \end{array} \end{aligned}$$

In case \(\mathcal {X}\) is a polyhedron we have

$$\begin{aligned} \mathcal {X}_s=\{{\bar{x}}:\ {\bar{A}}{\bar{x}}\le {\bar{b}}\} \end{aligned}$$
(12c)

where \({\bar{A}}\), \({\bar{b}}\) are a rescaled version of Ab defined as

$$\begin{aligned} \begin{array}{rcl} {\bar{A}} &=& A\mathop {\mathrm{diag}}\nolimits \left( \frac{u-\ell }{2}\right) \\ {\bar{b}} &=& b-A\left( \frac{u+\ell }{2}\right) \end{array} \end{aligned}$$
(12d)

and \(\mathop {\mathrm{diag}}\nolimits (\frac{u-\ell }{2})\) is the diagonal matrix whose diagonal elements are the components of \(\frac{u-\ell }{2}\).
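In code, the change of variables (12a) and the rescaling (12d) amount to a few lines; a sketch with hypothetical helper names:

```python
import numpy as np

def scale_vars(l, u):
    """Maps between x in [l, u] and xbar in [-1, 1]^n, as in (12a)."""
    half, mid = (u - l) / 2.0, (u + l) / 2.0
    to_x = lambda xbar: xbar * half + mid    # (12a)
    to_xbar = lambda x: (x - mid) / half     # inverse map
    return to_x, to_xbar

def scale_polyhedron(A, b, l, u):
    """Rescale Ax <= b into Abar xbar <= bbar, as in (12d)."""
    return A @ np.diag((u - l) / 2.0), b - A @ ((u + l) / 2.0)
```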

Note that, when approximating \(f_s\) with \({\hat{f}}_s\), we use the squared Euclidean distances

$$\begin{aligned} d^2({\bar{x}},{\bar{x}}_i)=\sum _{h=1}^n ({\bar{x}}^h-{\bar{x}}^h_i)^2=\sum _{h=1}^n \left( \theta ^h(x^h-x^h_i)\right) ^{p^h} \end{aligned}$$

where the scaling factors \(\theta ^h=\frac{u^h-\ell ^h}{2}\) and \(p^h\equiv 2\) are constant. Therefore, finding a surrogate \({\hat{f}}_s\) of \(f_s\) in \([-1,1]^n\) is equivalent to finding a surrogate \({\hat{f}}\) of f under scaled distances. This is a much simpler scaling approach than computing the scaling factors \(\theta ^h\) and powers \(p^h\) as is common in stochastic process model approaches such as Kriging methods [23, 34]. As highlighted in [21], Kriging methods use radial basis functions \(\phi (x_i,x_j)=e^{-\sum _{h=1}^n\theta ^h|x_i^h-x_j^h|^{p^h}}\), a generalization of Gaussian RBFs in which the scaling factors and powers are recomputed as the data set X changes.

Note also that the approach adopted in [5] for automatically scaling the surrogate function via cross-validation could also be used here, as well as other approaches specific to RBFs such as Rippa’s method [33]. In our numerical experiments we have found that adjusting the RBF parameter \(\epsilon\) via cross-validation, while increasing the computational effort, does not provide significant benefit. What is in fact most critical is the tradeoff between exploitation of the surrogate and exploration of the feasible set, which we discuss in the next section.

4 Acquisition function

As mentioned earlier, minimizing the surrogate function to get a new sample \(x_{N+1}=\arg \min {\hat{f}}(x)\) subject to \(\ell \le x\le u\) and \(x\in \mathcal {X}\), evaluating \(f(x_{N+1})\), and iterating over N may easily miss the global minimum of f. This is particularly evident when \({\hat{f}}\) is the IDW surrogate (5), which by Property P2 of Lemma 1 attains its global minimum at one of the existing samples \(x_i\). Besides exploiting the surrogate function \({\hat{f}}\), when looking for a new candidate optimizer \(x_{N+1}\) it is therefore necessary to add to \({\hat{f}}\) a term for exploring areas of the feasible space that have not yet been probed.

In Bayesian optimization, such an exploration term is provided by the covariance associated with the Gaussian process. A function measuring the “bumpiness” of a surrogate RBF was used in [15]. Here instead we propose two functions that provide exploration capabilities, which can be used as alternatives to each other or combined. First, as suggested in [24] for IDW functions, we consider the confidence interval function \(s:{\mathbb R}^n\rightarrow {\mathbb R}\) for \({\hat{f}}\) defined by

$$\begin{aligned} s(x)=\sqrt{\sum _{i=1}^Nv_i(x)(f_i-{\hat{f}}(x))^2} \end{aligned}$$
(13)

We will refer to function s as the IDW variance function associated with (X, F). Clearly, when \({\hat{f}}(x_i)=f(x_i)\) then \(s(x_i)=0\) for all \(i=1,\ldots ,N\) (no uncertainty at points \(x_i\) where f is evaluated exactly). See Fig. 3 for a noise-free example and Fig. 4 for the case of noisy measurements of f.

Fig. 3
figure 3

Plot of \({\hat{f}}(x)\pm s(x)\) and z(x) for the scalar example as in Fig. 1, with \(w_i(x)\) as in (3a) and f as in (6)

Fig. 4
figure 4

Zoomed plot of \({\hat{f}}(x)\pm s(x)\) for the scalar example as in Fig. 3 when 50 samples of f(x) are measured with noise \(\varepsilon \sim {{\mathcal {N}}}(0,10^{-2})\) and \(\epsilon _{\mathrm{SVD}}=10^{-2}\)

Second, we introduce the new IDW distance function \(z:{\mathbb R}^n\rightarrow {\mathbb R}\) defined by

$$\begin{aligned} z(x)=\left\{ \begin{array}{ll} 0 &\quad \hbox {if}\ x\in \{x_1,\ldots ,x_N\}\\ \frac{2}{\pi }\tan ^{-1}\left( \frac{1}{\sum _{i=1}^Nw_i(x)}\right) &\quad \hbox {otherwise}\end{array}\right. \end{aligned}$$
(14)

where \(w_i(x)\) is given by either (3a) or (3b). The rationale behind (14) is that z(x) is zero at sampled points and grows in between. The arctangent in (14) prevents z(x) from growing excessively when x is located far away from all sampled points. Figure 5 shows a scalar example of functions \(v_1\) and z.

Fig. 5
figure 5

A scalar example of functions \(v_1(x)\) and z(x) for \(x_1=-1\), \(x_2=2\), \(x_3=3\)
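Both exploration terms are cheap to evaluate. The sketch below is consistent with (13)–(14) and with the conventions of the previous snippets; it uses the weights (3b) (replace with \(1/d^2\) for (3a)), and the surrogate value at x is passed in, so that either the IDW or the RBF surrogate can be plugged in:

```python
import numpy as np

def idw_weights(x, X):
    """IDW weights (3b) and their normalization v_i(x) as in (4)."""
    d2 = np.sum((X - x) ** 2, axis=1)
    w = np.exp(-d2) / np.maximum(d2, 1e-16)   # guard against exact hits
    return w, w / w.sum()

def idw_variance(x, X, F, fhat_x):
    """IDW variance s(x) as in (13); fhat_x is the surrogate value at x."""
    _, v = idw_weights(x, X)
    return np.sqrt(v @ (F - fhat_x) ** 2)

def idw_distance(x, X):
    """IDW distance z(x) as in (14): zero at samples, growing in between."""
    w, _ = idw_weights(x, X)
    return (2.0 / np.pi) * np.arctan(1.0 / w.sum())
```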

Given parameters \(\alpha ,\delta \ge 0\) and N samples (X, F), we define the following acquisition function \(a:{\mathbb R}^n\rightarrow {\mathbb R}\)

$$\begin{aligned} a(x)={\hat{f}}(x)-\alpha s(x)-\delta \Delta F z(x) \end{aligned}$$
(15)

where \(\Delta F=\max \{\max _i\{f_i\}-\min _i\{f_i\},\epsilon _{\Delta F}\}\) is the range of the observed samples F and the threshold \(\epsilon _{\Delta F}>0\) is introduced to prevent the case in which f is not a constant function but, by chance, all sampled values \(f_i\) are equal. Scaling z by \(\Delta F\) eases the selection of the hyperparameter \(\delta\), as the amplitude of \(\Delta F z\) is comparable to that of \({\hat{f}}\).

As we will detail next, given N samples (X, F) a global minimum of the acquisition function (15) is used to define the \((N+1)\)th sample \(x_{N+1}\) by solving the global optimization problem

$$\begin{aligned} x_{N+1}=\arg \min _{\ell \le x\le u,\ x\in \mathcal {X}} a(x) \end{aligned}$$
(16)

The rationale behind choosing (15) as the acquisition function is the following. The term \({\hat{f}}\) directs the search towards a new sample \(x_{N+1}\) where the objective function f is expected to be optimal, assuming that f and its surrogate \({\hat{f}}\) have a similar shape, and therefore allows a direct exploitation of the samples F already collected. The other two terms account instead for the exploration of the feasible set, with the hope of finding better values of f: s promotes areas in which \({\hat{f}}\) is more uncertain, z areas that have not been explored yet. Both s and z provide exploration capabilities, but with an important difference: function z is totally independent of the samples F already collected and promotes a more uniform exploration, while s depends on F and on the surrogate \({\hat{f}}\). The coefficients \(\alpha\) and \(\delta\) determine the exploitation/exploration tradeoff one desires to adopt.
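Putting the pieces together, the acquisition (15) becomes the following sketch, where surrogate is, e.g., a closure around rbf_predict and eps_dF plays the role of \(\epsilon _{\Delta F}\):

```python
def acquisition(x, X, F, surrogate, alpha, delta, eps_dF=1e-6):
    """Acquisition a(x) = fhat(x) - alpha*s(x) - delta*DeltaF*z(x), as in (15)."""
    dF = max(F.max() - F.min(), eps_dF)   # range of the observed samples
    fhat_x = surrogate(x)
    return (fhat_x - alpha * idw_variance(x, X, F, fhat_x)
                   - delta * dF * idw_distance(x, X))
```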

For the example of scalar function f in (6) sampled at five random points, the acquisition function a obtained by setting \(\alpha =1\), \(\delta =\frac{1}{2}\), using a thin plate spline RBF with \(\epsilon _{\mathrm{SVD}}=10^{-6}\), and \(w_i(x)\) as in (3a), and the corresponding minimum are depicted in Fig. 6.

Fig. 6
figure 6

Plot of \({\hat{f}}(x)\) and acquisition function a(x) with \(\alpha =1\), \(\delta =\frac{1}{2}\), thin plate spline RBF with \(\epsilon _{\mathrm{SVD}}=10^{-6}\), for the scalar example as in Fig. 1, with \(w_i(x)\) as in (3a) and f as in (6)

The following result, whose simple proof is reported in “Appendix A”, highlights a nice property of the acquisition function a.

Lemma 2

Function a is differentiable everywhere on \({\mathbb R}^n\).

Problem (16) is a global optimization problem whose objective function and constraints are very easy to evaluate. It can be solved very efficiently using various global optimization techniques, either derivative-free [32] or, if \(\mathcal {X}=\{x: g(x)\le 0\}\) and g is also differentiable, derivative-based. In case some components of vector x are restricted to be integer, (16) can be solved by mixed-integer programming.

5 Global optimization algorithm

Algorithm 1, which we will refer to as GLIS (GLobal minimum using Inverse distance weighting and Surrogate radial basis functions), summarizes the proposed approach to solve the global optimization problem (1) using surrogate functions (either IDW or RBF) and the IDW acquisition function (15).

figure a
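Since Algorithm 1 is rendered above as an image, the following minimal sketch of its main loop may help fix ideas; it reuses the helpers sketched in Sects. 3–4 (rbf_fit, rbf_predict, inv_quad, acquisition), generates the initial samples by LHS via scipy, uses the pyswarm PSO solver mentioned in Sect. 6 for Step 6.2, and handles box constraints only:

```python
import numpy as np
from scipy.stats import qmc
from pyswarm import pso   # PSO solver, as in the experiments of Sect. 6

def glis(f, l, u, n_init, n_max, alpha, delta, eps=1.0, eps_svd=1e-6):
    """Minimal sketch of the main loop of Algorithm 1 (box constraints only)."""
    n = len(l)
    X = qmc.scale(qmc.LatinHypercube(d=n).random(n_init), l, u)  # Step 4: LHS
    F = np.array([f(x) for x in X])
    while len(F) < n_max:
        beta = rbf_fit(X, F, inv_quad, eps, eps_svd)             # Step 6.1
        surrogate = lambda xx: rbf_predict(xx, X, beta, inv_quad, eps)
        a = lambda xx: acquisition(np.asarray(xx), X, F, surrogate, alpha, delta)
        x_next, _ = pso(a, l, u)                                 # Step 6.2: solve (16)
        X = np.vstack([X, x_next])
        F = np.append(F, f(x_next))
    i_best = F.argmin()
    return X[i_best], F[i_best]
```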

As common in global optimization based on surrogate functions, in Step 4 Latin Hypercube Sampling (LHS) [28] is used to generate the initial set X of samples in the given range \(\ell ,u\). Note that the generated initial points may not satisfy the inequality constraints \(x\in \mathcal {X}\). We distinguish between two cases:

(i) the objective function f can be evaluated outside the feasible set \(\mathcal {F}\);

(ii) f cannot be evaluated outside \(\mathcal {F}\).

In the first case, initial samples of f falling outside \(\mathcal {F}\) are still useful to define the surrogate function and can therefore be kept. In the second case, since f cannot be evaluated at initial samples outside \(\mathcal {F}\), a possible approach is to generate more than \(N_{\mathrm{init}}\) samples and discard the infeasible ones before evaluating f. For example, the author of [6] suggests the simple method reported in Algorithm 2. This requires the feasible set \(\mathcal {F}\) to be full-dimensional. In case of linear inequality constraints \(\mathcal {X}=\{x:\ Ax\le b\}\), full-dimensionality of the feasible set \(\mathcal {F}\) can be easily checked by computing the Chebyshev radius \(r_{\mathcal {F}}\) of \(\mathcal {F}\) via the LP [8]

$$\begin{aligned} \begin{array}{rl} r_{\mathcal {F}} = \max _{r,x} & r \\ \mathop {\mathrm{s.t.}}\nolimits & A_ix\le b_i-\Vert A_i\Vert _2r,\quad i=1,\ldots ,q\\ & \ell _i+r\le x_i\le u_i-r,\quad i=1,\ldots ,n \end{array} \end{aligned}$$
(17)

where in (17) the subscript i denotes the ith row (component) of a matrix (vector). The polyhedron \(\mathcal {F}\) is full-dimensional if and only if \(r_{\mathcal {F}}>0\). Clearly, the smaller the ratio between the volume of \(\mathcal {F}\) and the volume of the bounding box \(B_{\ell _g,u_g}\), the larger, on average, the number of samples Algorithm 2 needs to generate.
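The LP (17) can be posed directly with scipy.optimize.linprog over the stacked decision vector (x, r), maximizing r by minimizing \(-r\) (a sketch):

```python
import numpy as np
from scipy.optimize import linprog

def chebyshev_radius(A, b, l, u):
    """Chebyshev radius of F = {l <= x <= u, Ax <= b} via the LP (17)."""
    q, n = A.shape
    A_ub = np.vstack([
        np.hstack([A, np.linalg.norm(A, axis=1, keepdims=True)]),  # A_i x + ||A_i|| r <= b_i
        np.hstack([-np.eye(n), np.ones((n, 1))]),                  # l_i + r <= x_i
        np.hstack([np.eye(n), np.ones((n, 1))]),                   # x_i <= u_i - r
    ])
    b_ub = np.concatenate([b, -l, u])
    c = np.zeros(n + 1)
    c[-1] = -1.0                                                   # maximize r
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (n + 1))
    return res.x[-1]
```

\(\mathcal {F}\) is full-dimensional if and only if the returned radius is positive.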

Note that, as an alternative to LHS, the IDW distance function (14) could also be used to generate \(N_{\mathrm{init}}\) feasible points by solving

$$\begin{aligned} x_{N+1}=\arg \max _{x\in \mathcal {F}}z(x) \end{aligned}$$

for \(N=1,\ldots ,N_{\mathrm{init}}-1\), starting from any \(x_1\in \mathcal {F}\).

figure b

The examples reported in this paper use the Particle Swarm Optimization (PSO) algorithm [41] to solve problem (16) at Step 6.2, although several other global optimization methods such as DIRECT [22] or others [18, 32] could be used instead. Inequality constraints \(\mathcal {X}=\{x:\ g(x)\le 0\}\) can be handled via penalty functions, for example by replacing (16) with

$$\begin{aligned} x_{N+1}=\arg \min _{\ell \le x\le u}\left\{ a(x)+\rho \Delta F\sum _{i=1}^q\max \{g_i(x),0\}^2\right\} \end{aligned}$$
(18)

where in (18) \(\rho \gg 1\). Note that due to the heuristic involved in constructing function a, it is not crucial to find global solutions of very high optimality accuracy when solving problem (16). Regarding feasibility, in case \(x_{N+1}\) violates the constraints and f cannot be evaluated outside \(\mathcal {X}\), a remedy would be to increase the penalty parameter \(\rho\) and/or to slightly tighten the constraints by penalizing \(\max \{g_i(x)+\epsilon _g,0\}\) in (18) instead of \(\max \{g_i(x),0\}\), for some small positive scalar \(\epsilon _g\).
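For instance, Step 6.2 with the penalty reformulation (18) can be sketched as follows, where g is assumed to return the vector \((g_1(x),\ldots ,g_q(x))\) and a is the acquisition function built above:

```python
import numpy as np
from pyswarm import pso

def solve_acquisition(a, g, l, u, dF, rho=1.0e3):
    """Minimize the penalized acquisition (18) over [l, u] with PSO."""
    def penalized(x):
        x = np.asarray(x)
        viol = np.maximum(g(x), 0.0)                  # positive parts of g_i(x)
        return a(x) + rho * dF * np.sum(viol ** 2)    # penalty term of (18)
    x_next, _ = pso(penalized, l, u)
    return x_next
```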

The exploration parameter \(\alpha\) promotes visiting points in \([\ell ,u]\) where the surrogate of the function has the largest variance; \(\delta\) instead promotes pure exploration, independently of the surrogate function approximation, as z is only based on the sampled points \(x_1,\ldots ,x_N\) and their mutual distances. For example, if \(\alpha =0\) and \(\delta \gg 1\), Algorithm 1 will try to explore the entire feasible region, with consequently slower detection of points x with low cost f(x). On the other hand, setting \(\delta =0\) will make GLIS proceed only based on the function surrogate and its variance, which may miss regions in \([\ell ,u]\) where a global optimizer is located. For \(\alpha =\delta =0\), GLIS will proceed based on pure minimization of \({\hat{f}}\), which, as observed earlier, can easily converge away from a global optimizer.

Figure 7 shows the first six iterations of the GLIS algorithm when applied to minimize the function f given in (6) in \([-3,3]\) with \(\alpha =1\), \(\delta =0.5\).

Fig. 7
figure 7

GLIS steps when applied to minimize the function f given in (6) using the same settings as in Fig. 6 and \(\epsilon _{\mathrm{SVD}}=10^{-6}\). The plots show function f (blue), its samples \(f_i\) (blue circles), the thin plate spline interpolation \({\hat{f}}\) with \(\epsilon =0.01\) (green), the acquisition function a (yellow), and the minimum of the acquisition function reached at \(x_{N+1}\) (purple diamond) (Color figure online)

5.1 Computational complexity

The complexity of Algorithm 1, as a function of the number \(N_{\mathrm{max}}\) of iterations and of the dimension n of the optimization space, and not counting the cost of evaluating f, depends on Steps 6.1 and 6.2. The complexity of Step 6.2 depends on the global optimizer used to solve Problem (16), which typically scales heavily with n. Step 6.1 involves computing \(N_{\mathrm{max}}(N_{\mathrm{max}}-1)\) RBF values \(\phi (\epsilon d(x_i,x_j))\), \(i,j=1,\ldots ,N_{\mathrm{max}}\), \(j\ne i\), computing the SVD of the \(N\times N\) symmetric matrix M in (10a), whose complexity is \(O(N^3)\), and solving the linear system in (10a) (\(O(N^2)\)) at each step \(N=N_{\mathrm{init}},\ldots ,N_{\mathrm{max}}\).

6 Numerical tests

In this section we report different numerical tests performed to assess the performance of the proposed algorithm (GLIS) and how it compares to Bayesian optimization (BO). For the latter, we have used the off-the-shelf algorithm bayesopt implemented in the Statistics and Machine Learning Toolbox for MATLAB [39], based on the lower confidence bound as the acquisition function. All tests were run on an Intel i7-8550 CPU @1.8 GHz machine. Algorithm 1 was run in MATLAB R2019b in interpreted code. The PSO solver [42] was used to solve problem (18). We focus our comparison on BO only, as it is one of the most efficient methods for the optimization of expensive black-box functions.

6.1 GLIS optimizing its own parameters

We first use GLIS to optimize its own hyperparameters \(\alpha\), \(\delta\), \(\epsilon\) when solving the minimization problem with f(x) as in (6) and \(x\in [-3,3]\). In what follows, we use the subscript \(()_H\) to denote the parameters/function used in the execution of the outer instance of Algorithm 1 that is optimizing \(\alpha\), \(\delta\), \(\epsilon\). To this end, we solve the global optimization problem (1) with \(x=[\alpha \ \delta \ \epsilon ]'\),

$$\begin{aligned} f_H(x)=\sum _{i=1}^{N_t}\sum _{h=0}^{N_{\mathrm{max}}/2}(h+1)\min \{ f(x_{i,1}),\ldots ,f(x_{i,N_{\mathrm{max}}/2+h})\} \end{aligned}$$
(19)

where f is the scalar function in (6) that we want to minimize in \([-3,3]\), and we set \(\ell _H=[0\ 0\ 0.1]'\), \(u_H=[3\ 3\ 3]'\), which is a reasonably large range according to our numerical experience. The \(\min\) in (19) provides the best objective value found up to iteration \(N_{\mathrm{max}}/2+h\), and the term \((h+1)\) penalizes high values of the best objective more heavily the later they occur during the iterations; \(N_t=20\) is the number of times Algorithm 1 is executed to minimize \(f_H\) for the same triplet \((\alpha ,\delta ,\epsilon )\), \(N_{\mathrm{max}}=20\) is the number of times f is evaluated per execution, and \(x_{i,N}\) is the sample generated by Algorithm 1 during the ith run at step N, \(i=1,\ldots ,N_t\), \(N=1,\ldots ,N_{\mathrm{max}}\). Clearly (19) penalizes failure to converge close to the global optimum \(f^*\) within \(N_{\mathrm{max}}\) iterations, regardless of how the algorithm performs during the first \(N_{\mathrm{max}}/2-1\) iterations.

In optimizing (19), the outer instance of Algorithm 1 is run with \(\alpha _H=2\), \(\delta _H=0.5\), \(\epsilon _H=0.5\), \(N_{\mathrm{init,H}}=8\), \(N_{\mathrm{max,H}}=100\) [which means that \(f_H\) in (19) is evaluated 100 times, each evaluation requiring executing Algorithm 1 \(N_t=20\) times to minimize function f], and PSO as the global optimizer of the acquisition function. The RBF inverse quadratic function is used in both the inner and outer instances of Algorithm 1. The resulting optimal selection is

$$\begin{aligned} \alpha =1.5078,\quad \delta =1.4246,\quad \epsilon =1.0775 \end{aligned}$$
(20)

Figure 8 compares the behavior of GLIS (Algorithm 1) when minimizing f(x) as in (6) in \([-3,3]\) with tentative parameters \(\alpha =1\), \(\delta =1\), \(\epsilon =0.5\) and with the optimal values in (20). The figure also shows the results obtained by using BO on the same problem.

Clearly the results of the hyper-optimization depend on the function f which is minimized in the inner loop. For a more comprehensive and general optimization of GLIS hyperparameters, one could alternatively consider in \(f_H\) the average performance with respect to a collection of problems instead of just one problem.

Fig. 8
figure 8

Minimization of f(x) as in (6) in \([-3,3]\): tentative hyperparameters (left) and optimal hyperparameters (right). The average performance obtained over \(N_{\mathrm{test}}=100\) runs as a function of evaluations of \(f_H\), along with the band defined by the best- and worst-case instances

6.2 Benchmark global optimization problems

We test the proposed global optimization algorithm on standard benchmark problems, summarized in Table 1. For each function the table shows the corresponding number of variables, upper and lower bounds, and the name of the example in [19] reporting the definition of the function. For lack of space, we only consider the GLIS algorithm implemented using inverse quadratic RBFs for the surrogate, leaving IDW only for exploration, as this was experimentally found to be a robust choice compared to other RBFs.

As a reference target for assessing the quality of the optimization results, for each benchmark problem the optimization algorithm DIRECT [22] was used to compute the global optimum of the function through the NLopt interface [20], except for the ackley and stepfunction2 benchmarks in which PSO is used instead due to the slow convergence of DIRECT on those problems. The corresponding global minima were validated, when possible, against results reported in [19] or, in case of one- or two-dimensional problems, by inspection.

Algorithm 1 is run by using the RBF inverse quadratic function with hyperparameters obtained by dividing the values in (20) by the number n of variables, with the rationale that exploration is more difficult in higher dimensions and it is therefore better to rely more on the surrogate function during acquisition. The threshold \(\epsilon _\mathrm{SVD}=10^{-6}\) is adopted to compute the RBF coefficients in (10c). The number of initial samples is \(N_\mathrm{init}=2n\).

For each benchmark, the problem is solved \(N_{\mathrm{test}}=100\) times to collect statistically significant results. The last two columns of Table 1 report the average CPU time spent for solving the \(N_{\mathrm{test}}=100\) instances of each benchmark using BO and GLIS. As the benchmark functions are very easy to compute, the CPU time spent on evaluating the \(N_{\mathrm{max}}\) function values F is negligible, so the time values reported in the table are practically those due to the execution of the algorithms. Algorithm 1 (GLIS) is between 4.6 and 9.4 times faster than Bayesian optimization (about 7 times faster on average). The execution time of GLIS in Python 3.7 on the same machine, using the PSO package pyswarm (https://pythonhosted.org/pyswarm) to optimize the acquisition function, is similar to that of the BO package GPyOpt [38].

The results of the tests are reported in Fig. 9, where each plot shows the average function value obtained over \(N_{\mathrm{test}}=100\) runs as a function of the number of function evaluations, along with the band defined by the best-case and worst-case instances, illustrating how the global optimum is approached.

Table 1 Benchmark problems considered in the comparison
Fig. 9
figure 9

Comparison between Algorithm 1 (GLIS) and Bayesian optimization (BO) on benchmark problems. Each plot reports the best function value obtained as a function of the number of function evaluations: average over \(N_{\mathrm{test}}=100\) runs (thick lines), band defined by the best- and worst-case instances (shadowed areas), and global optimum to be attained (black dashed line)

In order to test the algorithm in the presence of constraints, we consider the camelsixhumps problem and solve it under the following constraints

$$\begin{aligned} \begin{array}{ll} -2\le x_1\le 2,\quad -1\le x_2\le 1\\ \left[ \begin{matrix}1.6295 &\quad 1 \\ -1 &\quad 4.4553\\ -4.3023 &\quad -1\\ -5.6905 &\quad -12.1374\\ 17.6198 &\quad 1\end{matrix}\right] x\le \left[ \begin{matrix}3.0786\\ 2.7417\\ -1.4909\\ 1\\ 32.5198\end{matrix}\right] , &\quad x_1^2+(x_2+0.1)^2\le 0.5 \end{array} \end{aligned}$$

Algorithm 1 is run with hyperparameters set by dividing by \(n=2\) the values obtained in (20), with \(\epsilon _{\mathrm{SVD}}=10^{-6}\), \(N_{\mathrm{init}}=2n\), \(N_{\mathrm{max}}=20\) iterations, and penalty \(\rho =1000\) in (18). The results are plotted in Fig. 10. The two unconstrained global minima of the function are located at \(\left[ {\begin{matrix} -0.0898 \\ 0.7126 \end{matrix}} \right]\), \(\left[ {\begin{matrix}0.0898\\ -0.7126 \end{matrix}} \right]\).

Fig. 10
figure 10

Constrained camelsixhumps problem: constrained minimum (yellow dot) and unconstrained global minima (red diamonds) (Color figure online)

6.3 ADMM hyperparameter tuning for QP

The Alternating Direction Method of Multipliers (ADMM) [7] is a popular method for solving optimization problems such as the following convex Quadratic Program (QP)

$$\begin{aligned} \begin{array}{rl} \phi (\theta )=\min _z & \frac{1}{2}z'Qz+(c+F\theta )'z \\ \mathop {\mathrm{s.t.}}\nolimits & Az\le b+S\theta \end{array} \end{aligned}$$
(21)

where \(z\in {\mathbb R}^n\) is the optimization vector, \(\theta \in {\mathbb R}^p\) is a vector of parameters affecting the problem, and \(A\in {\mathbb R}^{q\times n}\), \(b\in {\mathbb R}^{q}\), \(S\in {\mathbb R}^{q\times p}\), and we assume \(Q=Q'\succ 0\). Problems of the form (21) arise for example in model predictive control applications [2, 3], where z represents a sequence of future control inputs to optimize and \(\theta\) collects signals that change continuously at runtime depending on measurements and set-point values. ADMM can be used effectively to solve QP problems (21), see for example the solver described in [1]. A very simple ADMM formulation for QP is summarized in Algorithm 3.

figure c
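Algorithm 3 above is rendered as an image; as a generic sketch, not necessarily identical to the displayed algorithm, a standard over-relaxed ADMM iteration for the QP (21) is shown below, with the penalty \({\bar{\rho }}\) and the relaxation \({\bar{\alpha }}\) playing the roles of the two hyperparameters being tuned (the cost vector is \(q=c+F\theta\) and the right-hand side is \(b+S\theta\)):

```python
import numpy as np

def admm_qp(Q, q, A, b, rho, alpha, n_iter=100):
    """Over-relaxed ADMM for min 0.5 z'Qz + q'z s.t. Az <= b (generic sketch)."""
    m, n = A.shape
    z, s, y = np.zeros(n), np.zeros(m), np.zeros(m)
    L = np.linalg.cholesky(Q + rho * A.T @ A)            # factorize once
    for _ in range(n_iter):
        rhs = -(q + rho * A.T @ (y - s))                 # z-update: (Q + rho A'A) z = rhs
        z = np.linalg.solve(L.T, np.linalg.solve(L, rhs))
        v = alpha * (A @ z) + (1.0 - alpha) * s          # over-relaxation with alpha
        s = np.minimum(v + y, b)                         # project onto {s <= b}
        y = y + v - s                                    # scaled dual update
    return z
```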

We consider a randomly generated QP test problem with \(n=5\), \(q=10\), \(p=3\) that is feasible for all \(\theta \in [-1,1]^3\), whose matrices are reported in “Appendix B” for reference. We set \(N=100\) in Algorithm 3, and generate \(M=2000\) samples \(\theta _j\) uniformly distributed in \([-1,1]^3\). The aim is to find the hyperparameters \(x=[{\bar{\rho }}\ {\bar{\alpha }}]'\) that provide the best QP solution quality after N ADMM iterations. Quality is expressed by the following objective function

$$\begin{aligned} f(x)=\log \left( \frac{1}{M}\sum _{j=1}^M\max \left\{ \frac{\phi ^*_j(x)-\phi ^*(x)}{1+|\phi ^*(x)|},0\right\} + \bar{\beta }\max \left\{ \max _i\left\{ \frac{A_iz_j^*(x)-b_i-S_i\theta _j}{1+|b_i+S_i\theta _j|}\right\} ,0\right\} \right) \end{aligned}$$

where \(\phi ^*_j(x)\), \(z^*_j(x)\) are the optimal value and optimizer found at run \(\#j\), respectively, and \(\phi ^*(x)\) is the solution of the QP problem obtained by running the very fast and accurate ODYS QP solver [10]. The first term in (22) measures relative optimality, the second term relative violation of the constraints, and we set \({\bar{\beta }}=1\) to equally weight relative optimality and relative infeasibility. Function f in (22) is minimized for \(\ell =\left[ {\begin{matrix}0.01\\ 0.01 \end{matrix}} \right]\) and \(u=\left[ {\begin{matrix}3\\ 3 \end{matrix}} \right]\) using GLIS with the same parameters used in Sect. 6.2 and, for comparison, by Bayesian optimization. Since ADMM provides suboptimal solutions, when acquiring the samples \(f_i\) the argument of the logarithm in (22) was always found to be positive in our tests. The test is repeated \(N_{\mathrm{test}}=100\) times and the results are depicted in Fig. 11. It is apparent that GLIS attains slightly better function values than BO for the same number of function evaluations, both on average and in the worst case. The resulting hyperparameter tuning that minimizes the selected ADMM performance index (22) is \({\bar{\rho }}=0.1566\), \({\bar{\alpha }}=1.9498\).

Fig. 11
figure 11

Hyperparameter optimization for ADMM. Average performance over \(N_{\mathrm{test}}=100\) runs and best/worst-case instance bands

7 Conclusions

This paper has proposed an approach based on surrogate functions to address global optimization problems whose objective function is expensive to evaluate, possibly under constraints that are inexpensive to evaluate. Contrary to Bayesian optimization methods, the approach is driven by deterministic arguments, based on radial basis functions (or inverse distance weighting) to create the surrogate, and on inverse distance weighting to characterize the uncertainty between the surrogate and the black-box function to optimize, as well as to promote the exploration of the feasible space. The computational burden associated with the algorithm is lighter than that of Bayesian optimization, while performance is comparable. Clearly, the main limitation of the algorithm is the dimension of the optimization vector it can cope with, as is the case for many other black-box global optimization algorithms.

Current research is devoted to extending the approach to include constraints that are also expensive to evaluate, and to exploring whether performance can be improved by adapting the parameters \(\alpha\) and \(\delta\) during the search. Future research should address theoretical issues of convergence of the approach, by investigating assumptions on the black-box function f and on the parameters \(\alpha ,\delta ,\epsilon _{\mathrm{SVD}},\epsilon _{\Delta F}>0\) of the algorithm that allow guaranteeing convergence, for example using the arguments in [15] based on the results in [40].