
Gaussian process optimization with failures: classification and convergence proof


Abstract

We consider the optimization of a computer model where each simulation either fails or returns a valid output performance. We first propose a new joint Gaussian process model for classification of the inputs (computation failure or success) and for regression of the performance function. We provide results that allow for a computationally efficient maximum likelihood estimation of the covariance parameters, with a stochastic approximation of the likelihood gradient. We then extend the classical improvement criterion to our setting of joint classification and regression. We provide an efficient computation procedure for the extended criterion and its gradient. We prove the almost sure convergence of the global optimization algorithm following from this extended criterion. We also study the practical performances of this algorithm, both on simulated data and on a real computer model in the context of automotive fan design.


Notes

  1. A constant mean function could be incorporated and estimated with no additional complexity.

References

  1. Azzimonti, D., Ginsbourger, D.: Estimating orthant probabilities of high-dimensional Gaussian vectors with an application to set estimation. J. Comput. Graph. Stat. 27(2), 255–267 (2018)

  2. Bect, J., Bachoc, F., Ginsbourger, D.: A supermartingale approach to Gaussian process based sequential design of experiments. Bernoulli 25(4A), 2883–2919 (2019)

  3. Benassi, R., Bect, J., Vazquez, E.: Robust Gaussian process-based global optimization using a fully Bayesian expected improvement criterion. In: International Conference on Learning and Intelligent Optimization, pp. 176–190. Springer (2011)

  4. Botev, Z.I.: The normal law under linear restrictions: simulation and estimation via minimax tilting. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 79(1), 125–148 (2017)

  5. Bull, A.D.: Convergence rates of efficient global optimization algorithms. J. Mach. Learn. Res. 12, 2879–2904 (2011)

  6. Gelbart, M.A., Snoek, J., Adams, R.P.: Bayesian optimization with unknown constraints. In: UAI (2014)

  7. Genz, A.: Numerical computation of multivariate normal probabilities. J. Comput. Graph. Stat. 1(2), 141–149 (1992)

  8. Ginsbourger, D., Le Riche, R., Carraro, L.: Kriging is well-suited to parallelize optimization. In: Computational Intelligence in Expensive Optimization Problems, pp. 131–162. Springer (2010)

  9. Ginsbourger, D., Roustant, O., Durrande, N.: On degeneracy and invariances of random fields paths with applications in Gaussian process modelling. J. Stat. Plan. Inference 170, 117–128 (2016)

  10. Gramacy, R., Lee, H.: Optimization under unknown constraints. Bayesian Stat. 9, 229 (2011)

  11. Gramacy, R.B., Gray, G.A., Le Digabel, S., Lee, H.K., Ranjan, P., Wells, G., Wild, S.M.: Modeling an augmented Lagrangian for blackbox constrained optimization. Technometrics 58(1), 1–11 (2016)

  12. Hernandez-Lobato, J.M., Gelbart, M., Hoffman, M., Adams, R., Ghahramani, Z.: Predictive entropy search for Bayesian optimization with unknown constraints. In: International Conference on Machine Learning, pp. 1699–1707 (2015)

  13. Jones, D., Schonlau, M., Welch, W.: Efficient global optimization of expensive black-box functions. J. Glob. Optim. 13, 455–492 (1998)

  14. Kallenberg, O.: Foundations of Modern Probability, 2nd edn. Springer, Berlin (2002)

  15. Kandasamy, K., Neiswanger, W., Schneider, J., Poczos, B., Xing, E.P.: Neural architecture search with Bayesian optimisation and optimal transport. In: Advances in Neural Information Processing Systems, pp. 2016–2025 (2018)

  16. Keane, A., Nair, P.: Computational Approaches for Aerospace Design: The Pursuit of Excellence. Wiley, Hoboken (2005)

  17. Lindberg, D.V., Lee, H.K.: Optimization under constraints by applying an asymmetric entropy measure. J. Comput. Graph. Stat. 24(2), 379–393 (2015)

  18. López-Lopera, A.F., Bachoc, F., Durrande, N., Roustant, O.: Finite-dimensional Gaussian approximation with linear inequality constraints. SIAM/ASA J. Uncertain. Quantif. 6(3), 1224–1255 (2018)

  19. Maatouk, H., Bay, X.: A New Rejection Sampling Method for Truncated Multivariate Gaussian Random Variables Restricted to Convex Sets, pp. 521–530. Springer, Cham (2016)

  20. Meyn, S.P., Tweedie, R.L.: Markov Chains and Stochastic Stability. Springer, Berlin (2012)

  21. Mockus, J.B., Tiesis, V., Žilinskas, A.: The application of Bayesian methods for seeking the extremum. In: Dixon, L.C.W., Szegö, G.P. (eds.) Towards Global Optimization, vol. 2, pp. 117–129. North Holland, New York (1978)

  22. Nickisch, H., Rasmussen, C.E.: Approximations for binary Gaussian process classification. J. Mach. Learn. Res. 9, 2035–2078 (2008)

  23. Pakman, A., Paninski, L.: Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. J. Comput. Graph. Stat. 23(2), 518–542 (2014)

  24. Picheny, V.: A stepwise uncertainty reduction approach to constrained global optimization. In: Artificial Intelligence and Statistics, pp. 787–795 (2014)

  25. Picheny, V., Gramacy, R.B., Wild, S., Le Digabel, S.: Bayesian optimization under mixed constraints with a slack-variable augmented Lagrangian. In: Advances in Neural Information Processing Systems, pp. 1435–1443 (2016)

  26. Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning. The MIT Press, Cambridge (2006)

  27. Roustant, O., Ginsbourger, D., Deville, Y.: DiceKriging, DiceOptim: two R packages for the analysis of computer experiments by Kriging-based metamodeling and optimization. J. Stat. Softw. 51(1), 1–55 (2012)

  28. Sacher, M., Duvigneau, R., Le Maitre, O., Durand, M., Berrini, E., Hauville, F., Astolfi, J.-A.: A classification approach to efficient global optimization in presence of non-computable domains. Struct. Multidiscip. Optim. 58(4), 1537–1557 (2018)

  29. Sasena, M.J., Papalambros, P., Goovaerts, P.: Exploration of metamodeling sampling criteria for constrained global optimization. Eng. Optim. 34(3), 263–278 (2002)

  30. Schonlau, M., Welch, W.J., Jones, D.R.: Global versus local search in constrained optimization of computer models. In: Lecture Notes-Monograph Series, pp. 11–25 (1998)

  31. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems, pp. 2951–2959 (2012)

  32. Srinivas, N., Krause, A., Kakade, S., Seeger, M.: Gaussian process optimization in the bandit setting: no regret and experimental design. In: Proceedings of the 27th International Conference on Machine Learning, pp. 1015–1022 (2010)

  33. Taylor, J., Benjamini, Y.: RestrictedMVN: multivariate normal restricted by affine constraints. https://cran.r-project.org/web/packages/restrictedMVN/index.html (2017). Accessed 2 Feb 2017

  34. Vazquez, E., Bect, J.: Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. J. Stat. Plan. Inference 140(11), 3088–3095 (2010)

  35. Vazquez, E., Bect, J.: Pointwise consistency of the kriging predictor with known mean and covariance functions. In: mODa 9 – Advances in Model-Oriented Design and Analysis, pp. 221–228. Springer (2010)

  36. Wu, J., Frazier, P.: The parallel knowledge gradient method for batch Bayesian optimization. In: Advances in Neural Information Processing Systems, pp. 3126–3134 (2016)

  37. Zhigljavsky, A., Žilinskas, A.: Selection of a covariance function for a Gaussian random field aimed for modeling global optimization problems. Optim. Lett. 13(2), 249–259 (2019)

  38. Žilinskas, A., Calvin, J.: Bi-objective decision making in global optimization based on statistical models. J. Glob. Optim. 74(4), 599–609 (2019)

Acknowledgements

This work was partly funded by the ANR project PEPITO. We are grateful to Jean-Marc Azaïs and David Ginsbourger for discussing the topics in this paper.

Author information

Corresponding author

Correspondence to François Bachoc.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material

The supplementary material contains additional figures and tables for Sections 5 and 6. (pdf 1,904 KB)

Appendices

A Proofs

Proof

(Proof of Lemma 1)

With \(\phi ^{Z_n}\) the p.d.f. of \(Z_n\) and with \(s_n = (i_1,\ldots ,i_n)^\top \), we have

$$\begin{aligned} \mathrm {P_{nf}}(x)&= \mathbb {P} \left( \left. Z(x)> 0 \right| \mathrm {sign}(Z_n) = s_n \right) \nonumber \\&\quad = \frac{ \mathbb {E} \left( \mathbf {1}_{Z(x)> 0} \prod _{j=1}^n \mathbf {1}_{ \mathrm {sign}(Z(x_j)) = i_j } \right) }{ \mathbb {P} \left( \mathrm {sign}(Z_n) = s_n \right) } \nonumber \\&\quad = \frac{ \int _{\mathbb {R}^n} \phi ^{Z_n}(z_1,\ldots ,z_n) \left( \prod _{j=1}^n \mathbf {1}_{ \mathrm {sign}(z_j) = i_j } \right) \mathbb {P} \left( \left. Z(x) > 0 \right| Z(x_1) = z_1 , \ldots , Z(x_n) = z_n \right) dz_1 \ldots dz_n }{ \mathbb {P} \left( \mathrm {sign}(Z_n) = s_n \right) } \nonumber \\&\qquad = \int _{\mathbb {R}^n} \phi ^{Z_n}_{s_n}(z_n) \bar{\varPhi } \left( \frac{-m_{n}^Z(x,z_n)}{ \sqrt{ k_{n}^Z(x) } } \right) dz_n. \end{aligned}$$
(16)

Equation (16) is obtained by observing that

$$\begin{aligned} \frac{ \phi ^{Z_n}(z_1,\ldots ,z_n) \left( \prod _{j=1}^n \mathbf {1}_{ \mathrm {sign}(z_j) = i_j } \right) }{ \mathbb {P} \left( \mathrm {sign}(Z_n) = s_n \right) } = \frac{ \phi ^{Z_n}(z_1,\ldots ,z_n) \mathbf {1}_{ \mathrm {sign}(Z_n) = s_n } }{ \mathbb {P} \left( \mathrm {sign}(Z_n) = s_n \right) } = \phi ^{Z_n}_{s_n}(z_n) \end{aligned}$$

by definition of \(\phi ^{Z_n}_{s_n}(z_n)\), and that

$$\begin{aligned} \mathbb {P} \left( \left. Z(x) > 0 \right| Z(x_1) = z_1 , \ldots , Z(x_n) = z_n \right) = \bar{\varPhi } \left( \frac{-m_{n}^Z(x,z_n)}{ \sqrt{ k_{n}^Z(x) } } \right) \end{aligned}$$

by Gaussian conditioning. \(\square \)

Proof

(Proof of Lemma 2) For any measurable function f, by the law of total expectation and using the independence of Y, \((W_1,\ldots ,W_n)\) and Z, we have

$$\begin{aligned} \mathbb {E}&\left[ f \left( I_1 , \ldots , I_n , V_1 , \ldots , V_n \right) \right] \\ =&\sum _{i_1,\ldots ,i_n \in \{ 0,1\}} \mathbb {P}_{\mu ^Z,\theta _Z} \left( I_1 = i_1 , \ldots , I_n = i_n \right) \\&\mathbb {E} \left[ f \left( i_1 , \ldots , i_n , i_1 Y(x_1) + (1-i_1) W_1 , \ldots , i_n Y(x_n) + (1-i_n) W_n \right) \right] \\ =&\sum _{i_1,\ldots ,i_n \in \{ 0,1\}} \mathbb {P}_{\mu ^Z,\theta _Z} \left( I_1 = i_1 , \ldots , I_n = i_n \right) \\&~ ~ \int _{\mathbb {R}^{n}} dv \phi _{\mu ^Y,\theta _Y,s_n}^Y \left( v_{s_n} \right) \left( \prod _{ \begin{array}{c} j = 1,\ldots ,n \\ i_j = 0 \end{array}} \phi (v_j) \right) f \left( i_1 , \ldots , i_n , v_1 , \ldots , v_n \right) . \end{aligned}$$

This concludes the proof by definition of a p.d.f. \(\square \)

Proof

(Proof of Lemma 3)

Consider measurable functions f(Y), g(Z), \(h( I_1, \ldots , I_n )\) and \(\psi ( I_1 Y(x_1), \ldots , I_n Y(x_n) )\). We have, by independence of Y and Z,

$$\begin{aligned} \mathbb {E}&\left[ f(Y) g(Z) h( I_1, \ldots , I_n ) \psi ( I_1 Y(x_1), \ldots , I_n Y(x_n) ) \right] \\ =&\sum _{i_1,\ldots ,i_n \in \{ 0 , 1\}} \mathbb {P} \left( I_1 = i_1, \ldots , I_n = i_n \right) \\&\mathbb {E} \left[ \left. f(Y) g(Z) h( i_1, \ldots , i_n ) \psi (i_1 Y(x_1), \ldots , i_n Y(x_n) ) \right| I_1 = i_1, \ldots , I_n = i_n \right] \\ =&\sum _{i_1,\ldots ,i_n \in \{ 0 , 1\}} \mathbb {P} \left( I_1 = i_1, \ldots , I_n = i_n \right) h( i_1, \ldots , i_n ) \\&\mathbb {E} \left[ f(Y) \psi (i_1 Y(x_1), \ldots , i_n Y(x_n) ) \right] \mathbb {E} \left[ \left. g(Z) \right| I_1 = i_1, \ldots , I_n = i_n \right] \\ =&\sum _{i_1,\ldots ,i_n \in \{ 0 , 1\}} \mathbb {P} \left( I_1 = i_1, \ldots , I_n = i_n \right) h( i_1, \ldots , i_n ) \\&\mathbb {E} \left[ \psi (i_1 Y(x_1), \ldots , i_n Y(x_n) ) \mathbb {E} \left[ \left. f(Y) \right| Y_{n,s_n} \right] \right] \mathbb {E} \left[ \left. g(Z) \right| I_1 = i_1, \ldots , I_n = i_n \right] . \end{aligned}$$

The last display can be written as, with \(\mathcal {L}_{n}\) the distribution of

$$\begin{aligned} I_1 ,\ldots ,I_n , I_1 Y(x_1) , \ldots , I_n Y(x_n), \end{aligned}$$
$$\begin{aligned}&\int _{\mathbb {R}^{2n}} d \mathcal {L}_n( i_1,\ldots ,i_n,i_1 y_1, \ldots , i_n y_n ) h(i_1,\ldots ,i_n) \psi ( i_1 y_1 , \ldots , i_n y_n ) \\&\quad \mathbb {E} \left[ \left. f(Y) \right| Y_{n,s_n} = Y_q \right] \mathbb {E} \left[ \left. g(Z) \right| I_1 = i_1, \ldots , I_n = i_n \right] , \end{aligned}$$

where \(Y_q\) is as defined in the statement of the lemma. This concludes the proof. \(\square \)

We now address the proof of Theorem 1. We let \((X_i)_{i \in \mathbb {N}}\) be the random observation points, such that \(X_i\) is obtained from (13) and (14) for \(i \in \mathbb {N}\). The next lemma shows that conditioning on the random observation points and observed values works “as if” the observation points \(X_1,\ldots ,X_n\) were non-random.

Lemma 4

For any \(x_1,\ldots ,x_k \in \mathcal {D}\), \(i_1,\ldots ,i_k \in \{ 0,1\}\) and \(y_1,\ldots ,y_k \in \mathbb {R}\), the conditional distribution of \((Y, Z)\) given

$$\begin{aligned}&X_1 = x_1,\mathrm {sign}(Z(X_1)) = i_1,\mathrm {sign}(Z(X_1)) Y(X_1) = i_1 y_1,\ldots , \\&X_k = x_k,\mathrm {sign}(Z(X_k)) = i_k,\mathrm {sign}(Z(X_k)) Y(X_k) = i_k y_k \end{aligned}$$

is the same as the conditional distribution of \((Y, Z)\) given

$$\begin{aligned} \mathrm {sign}(Z(x_1)) = i_1,\mathrm {sign}(Z(x_1)) Y(x_1) = i_1 y_1,\ldots ,\mathrm {sign}(Z(x_k)) = i_k,\mathrm {sign}(Z(x_k)) Y(x_k) = i_k y_k. \end{aligned}$$

Proof

This lemma can be shown similarly to Proposition 2.6 in [2]. \(\square \)

Proof

(Proof of Theorem 1)

For \(k \in \mathbb {N}\), we remark that \(\mathcal {F}_k\) is the sigma-algebra generated by

$$\begin{aligned} X_1,\mathrm {sign}(Z(X_1)),\mathrm {sign}(Z(X_1))Y(X_1) , \ldots , X_k,\mathrm {sign}(Z(X_k)),\mathrm {sign}(Z(X_k)) Y(X_k). \end{aligned}$$

We let \(\mathbb {E}_k\), \(\mathbb {P}_k\) and \(\mathrm {var}_k\) denote the expectation, probability and variance conditionally on \(\mathcal {F}_k\). For \(x \in \mathcal {D}\), we let \(\mathbb {E}_{k,x}\), \(\mathbb {P}_{k,x}\) and \(\mathrm {var}_{k,x}\) denote the expectation, probability and variance conditionally on

$$\begin{aligned}&X_1,\mathrm {sign}(Z(X_1)),\mathrm {sign}(Z(X_1))Y(X_1) , \ldots , X_k,\\&\qquad \mathrm {sign}(Z(X_k)),\mathrm {sign}(Z(X_k))Y(X_k), x,\mathrm {sign}(Z(x)),\mathrm {sign}(Z(x))Y(x). \end{aligned}$$

We let \(\sigma _k^2(u) = \mathrm {var}_k(Y(u))\), \(m_k(u) = \mathbb {E}_k[Y(u)]\) and \(P_k(u) = \mathbb {P}_k( Z(u) > 0 )\). We also let \(\sigma _{k,x}^2(u) = \mathrm {var}_{k,x}(Y(u))\), \(m_{k,x}(u) = \mathbb {E}_{k,x}[Y(u)]\) and \(P_{k,x}(u) = \mathbb {P}_{k,x}( Z(u) > 0 )\).

With these notations, the observation points satisfy, for \(k \in \mathbb {N}\),

$$\begin{aligned} X_{k+1} \in \mathrm {argmax}_{ x \in \mathcal {D} } \mathbb {E}_{k} \left( \max _{ \begin{array}{c} u: P_{k,x}(u) = 1 \\ \sigma _{k,x}(u) = 0 \end{array} } Y(u) - M_k \right) , \end{aligned}$$
(17)

where

$$\begin{aligned} M_k = \max _{ \begin{array}{c} u: P_k(u) = 1 \\ \sigma _k(u) = 0 \end{array} }Y(u). \end{aligned}$$

We first show that (17) can be defined as a stepwise uncertainty reduction (SUR) sequential design [2]. We have

$$\begin{aligned} X_{k+1} \in&\mathrm {argmax}_{x \in \mathcal {D}} \mathbb {E}_k \left( \max _{ \begin{array}{c} P_{k,x}(u) = 1 \\ \sigma _{k,x}(u) = 0 \end{array}} Y(u) - \max _{ \begin{array}{c} P_{k}(u) = 1 \\ \sigma _{k}(u) = 0 \end{array}} Y(u) \right) \\ \in&\mathrm {argmin}_{x \in \mathcal {D}} \mathbb {E}_k \left( \mathbb {E}_{k,x} \left( \max _{Z(u) > 0} Y(u) \right) - \max _{ \begin{array}{c} P_{k,x}(u) = 1 \\ \sigma _{k,x}(u) = 0 \end{array}} Y(u) \right) \nonumber \end{aligned}$$
(18)

since the second term in (18) does not depend on x, and by the law of total expectation. We let

$$\begin{aligned} H_k = \mathbb {E}_{k} \left( \max _{Z(u) > 0} Y(u) - \max _{ \begin{array}{c} P_{k}(u) = 1 \\ \sigma _{k}(u) = 0 \end{array}} Y(u) \right) \end{aligned}$$

and

$$\begin{aligned} H_{k,x} = \mathbb {E}_{k,x} \left( \max _{Z(u) > 0} Y(u) - \max _{ \begin{array}{c} P_{k,x}(u) = 1 \\ \sigma _{k,x}(u) = 0 \end{array}} Y(u) \right) . \end{aligned}$$

Then we have for \(k \ge 1\)

$$\begin{aligned} X_{k+1} \in \mathrm {argmin}_{x \in \mathcal {D}} \mathbb {E}_k \left( H_{k,x} \right) . \end{aligned}$$

We have, using the law of total expectation, and since \(\mathbb {E}_{k,x} \left[ \max _{ \begin{array}{c} P_{k,x}(u) = 1, \sigma _{k,x}(u) = 0 \end{array}} Y(u) \right] = \max _{ \begin{array}{c} P_{k,x}(u) = 1, \sigma _{k,x}(u) = 0 \end{array}} Y(u)\),

$$\begin{aligned} H_k - \mathbb {E}_k( H_{k+1} )&= \mathbb {E}_k \left( \max _{ \begin{array}{c} P_{k,X_{k+1}}(u) = 1 \\ \sigma _{k,X_{k+1}}(u) = 0 \end{array}} Y(u) - \max _{ \begin{array}{c} P_{k}(u) = 1 \\ \sigma _{k}(u) = 0 \end{array}} Y(u) \right) \\&\ge 0 \end{aligned}$$

since, for all \(u,x \in \mathcal {D}\), \(\sigma _{k,x}(u) \le \sigma _k(u) \), and \(P_k(u) = 1\) implies \(P_{k,x}(u) = 1\). Hence \((H_k)_{k \in \mathbb {N}}\) is a supermartingale and, of course, \(H_k \ge 0\) for all \(k \in \mathbb {N}\). Also, \(| H_1| \le 2 \mathbb {E}_1 \left[ \max _{u \in \mathcal {D}} |Y(u)| \right] \), so that \(H_1\) is bounded in \(L^1\), since the mean value of the maximum of a continuous Gaussian process on a compact set is finite. Hence, from Theorem 6.23 in [14], \(H_k\) converges a.s. as \(k \rightarrow \infty \) to a finite random variable. Hence, as in the proof of Theorem 3.10 in [2], \(H_k - \mathbb {E}_k(H_{k+1})\) goes to 0 a.s. as \(k \rightarrow \infty \). Hence, by definition of \(X_{k+1}\), we obtain \(\sup _{x \in \mathcal {D}} ( H_k - \mathbb {E}_k( H_{k,x} ) ) \rightarrow 0\) a.s. as \(k \rightarrow \infty \). This yields, from the law of total expectation,

$$\begin{aligned} 0 \longleftarrow _{k \rightarrow \infty }^{a.s.}&\sup _{x \in \mathcal {D}} \mathbb {E}_k \left( \max _{ \begin{array}{c} P_{k,x}(u) = 1 \\ \sigma _{k,x}(u) = 0 \end{array}} Y(u) - \max _{ \begin{array}{c} P_{k}(u) = 1 \\ \sigma _{k}(u) = 0 \end{array}} Y(u) \right) \\ \ge&\sup _{x \in \mathcal {D}} \mathbb {E}_k \left[ \mathbf {1}_{Z(x) > 0} \left( Y(x) - M_k \right) ^+ \right] \nonumber \\ \ge&\sup _{x \in \mathcal {D}} P_k(x) \gamma ( m_k(x) - M_k , \sigma _k(x) ), \nonumber \end{aligned}$$
(19)

from Lemma 3 and (12), where

$$\begin{aligned} \gamma ( a , b )&= a \Phi \left( \frac{a}{ b } \right) + b \phi \left( \frac{a}{ b } \right) . \end{aligned}$$

Recall from Section 3 in [34] that \(\gamma \) is continuous and satisfies \(\gamma (a, b) > 0\) if \(b > 0\) and \(\gamma (a, b) \ge a\) if \(a > 0\). We have, for \(k \in \mathbb {N}\), \(0 \le \sigma _k(u) \le \max _{v \in \mathcal {D}} \sqrt{\mathrm {var}(Y(v))} < \infty \). Also, with the same proof as that of Proposition 2.9 in [2], we can show that the sequence of random functions \((m_k)_{k \in \mathbb {N}}\) converges a.s. uniformly on \(\mathcal {D}\) to a continuous random function \(m_{\infty }\) on \(\mathcal {D}\). Thus, from (19), by compactness, we have, a.s. as \(k \rightarrow \infty \), \(\sup _{x \in \mathcal {D}} P_k(x) ( m_k(x) - M_k )^+ \rightarrow 0\) and \(\sup _{x \in \mathcal {D}} P_k(x) \sigma _k(x) \rightarrow 0\). Hence Part 1 is proved.
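To make the role of \(\gamma \) concrete, here is a minimal numerical sketch (ours, not the authors' code): \(\gamma \) is the classical closed-form expected improvement \(\mathbb {E}[(a + b\,\xi )^+]\) for \(\xi \sim \mathcal {N}(0,1)\), and the degenerate case \(b = 0\), where \(\gamma (a,0) = \max (a,0)\), is handled explicitly.

```python
import numpy as np
from scipy.stats import norm

def gamma_ei(a, b):
    """gamma(a, b) = a * Phi(a/b) + b * phi(a/b) for b > 0, i.e. the
    closed form of E[(a + b*xi)^+] with xi ~ N(0, 1); for b = 0 it
    degenerates to max(a, 0)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    safe_b = np.where(b > 0, b, 1.0)  # avoid division by zero when b = 0
    r = a / safe_b
    return np.where(b > 0, a * norm.cdf(r) + b * norm.pdf(r), np.maximum(a, 0.0))

# quick Monte Carlo sanity check of E[(a + b*xi)^+] with a = -0.3, b = 1.2
xi = np.random.default_rng(0).standard_normal(10**6)
print(gamma_ei(-0.3, 1.2), np.mean(np.maximum(-0.3 + 1.2 * xi, 0.0)))
```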

Let us address Part 2. For all \(\tau \in \mathbb {N}\), consider fixed \(v_1,\ldots ,v_{N_\tau } \in \mathcal {D}\) for which \(\max _{u \in \mathcal {D}} \min _{i=1,\ldots ,N_\tau } || u - v_i || \le 1/\tau \). Consider the event \(E_\tau = \{ \exists u \in \mathcal {D} ; \inf _{i \in \mathbb {N}} || X_i - u || \ge 2/\tau \}\). Then, \(E_\tau \) implies the event \(E_{v,\tau } = \cup _{ j=1 }^{N_\tau } E_{v,\tau ,j} \) where \( E_{v,\tau ,j} = \{ \inf _{i \in \mathbb {N}} || X_i - v_j || \ge 1/\tau \}\). Let us now show that \(\mathbb {P}( E_{v,\tau ,j} ) = 0\) for \( j = 1,\ldots , N_\tau \). Assume that \(E_{v,\tau ,j} \cap \mathcal {C}\) holds, where \(\mathcal {C}\) is the event in Part 1 of the theorem, with \(\mathbb {P} ( \mathcal {C} ) = 1\). Since Y has the NEB property, we have \(\liminf _{k \rightarrow \infty } \sigma _k(v_j) >0\). Hence, \(P_k( v_j ) \rightarrow 0\) as \(k \rightarrow \infty \) since \(\mathcal {C}\) is assumed. We then have

$$\begin{aligned} \mathrm {var}( \mathbf {1}_{ Z(v_j)> 0 } | \mathbf {1}_{ Z(X_1)> 0 }, \ldots , \mathbf {1}_{ Z(X_k) > 0 } ) = P_k( v_j ) (1 - P_k( v_j )) \rightarrow 0 \end{aligned}$$
(20)

a.s. as \(k \rightarrow \infty \). But we have

$$\begin{aligned} \mathrm {var}&( \mathbf {1}_{ Z(v_j)> 0 } | \mathbf {1}_{ Z(X_1)> 0 }, \ldots , \mathbf {1}_{ Z(X_k)> 0 } ) \\&\quad = \mathbb {E} \left[ \left. \left( \mathbf {1}_{ Z(v_j)> 0 } - P_k( v_j ) \right) ^2 \right| \mathbf {1}_{ Z(X_1)> 0 }, \ldots , \mathbf {1}_{ Z(X_k)> 0 } \right] \\&\quad = \mathbb {E} \left[ \left. \mathbb {E} \left[ \left. \left( \mathbf {1}_{ Z(v_j)> 0 } - P_k( v_j ) \right) ^2 \right| Z(x_1),\ldots ,Z(x_k) \right] \right| \mathbf {1}_{ Z(X_1)> 0 }, \ldots , \mathbf {1}_{ Z(X_k) > 0 } \right] . \end{aligned}$$

Since \(P_k(v_j)\) is a function of \(Z(x_1), \ldots , Z(x_k)\), we obtain

$$\begin{aligned} \mathrm {var}&( \mathbf {1}_{ Z(v_j)> 0 } | \mathbf {1}_{ Z(X_1)> 0 }, \ldots , \mathbf {1}_{ Z(X_k)> 0 } ) \\&\quad \ge \mathbb {E} \left[ \left. \mathrm {var} \left( \mathbf {1}_{ Z(v_j)> 0 } | Z(x_1), \ldots , Z(x_k) \right) \right| \mathbf {1}_{ Z(X_1)> 0 }, \ldots , \mathbf {1}_{ Z(X_k)> 0 } \right] \\&\quad = \mathbb {E} \left[ \left. g \left( \bar{\varPhi } \left( \frac{- m_k(v_j)}{ \sigma _k(v_j) } \right) \right) \right| \mathbf {1}_{ Z(X_1)> 0 }, \ldots , \mathbf {1}_{ Z(X_k) > 0 } \right] , \end{aligned}$$

with \(g(t) = t(1-t)\) and with \(\bar{\varPhi }\) as in Lemma 1. We let \(S = \sup _{k \in \mathbb {N}} |m_k(v_j)|\) and \(s = \inf _{k \in \mathbb {N}} \sigma _k(v_j)\). Then, from the uniform convergence of \(m_k\) discussed above and from the NEB property of Z, we have \(\mathbb {P}(E_{S,s}) = 1\) where \(E_{S,s} = \{ S < + \infty , s> 0 \}\). Then, if \(E_{v,\tau ,j} \cap \mathcal {C} \cap E_{S,s}\) holds, we have

$$\begin{aligned} \mathrm {var}&( \mathbf {1}_{ Z(v_j)> 0 } | \mathbf {1}_{ Z(X_1)> 0 }, \ldots , \mathbf {1}_{ Z(X_k)> 0 } ) \\&\quad \ge \mathbb {E} \left[ \left. g \left( \bar{\varPhi } \left( \frac{S}{s } \right) \right) \right| \mathbf {1}_{ Z(X_1)> 0 }, \ldots , \mathbf {1}_{ Z(X_k) > 0 } \right] \\&\quad \rightarrow _{k \rightarrow \infty }^{a.s.} \mathbb {E} \left[ \left. g \left( \bar{\varPhi } \left( \frac{S}{s } \right) \right) \right| \mathcal {F}_{Z,\infty } \right] , \end{aligned}$$

where \(\mathcal {F}_{Z,\infty } = \sigma ( \left\{ \mathbf {1}_{Z(X_i) > 0} \right\} _{i \in \mathbb {N}} )\) from Theorem 6.23 in [14]. Conditionally on \(\mathcal {F}_{Z,\infty }\), we have a.s. \(S < \infty \) and \(s >0\). Hence we obtain that, on the event \(E_{v,\tau ,j} \cap \mathcal {A}\) with \(\mathbb {P}( \mathcal {A} ) = 1\), \(\mathrm {var} ( \mathbf {1}_{ Z(v_j)> 0 } | \mathbf {1}_{ Z(X_1)> 0 }, \ldots , \mathbf {1}_{ Z(X_k) > 0 } ) \) does not go to zero. Hence, from (20), we have \(\mathbb {P}(E_{v,\tau ,j}) = 0\). This yields that \((X_i)_{i \in \mathbb {N}}\) is a.s. dense in \(\mathcal {D}\). Hence, since \(\{u ; Z(u) >0\}\) is an open set, we have \(\max _{i \le k; Z(X_i)> 0} Y(X_i) \rightarrow \max _{Z(u) >0} Y(u)\) a.s. as \(k \rightarrow \infty \). \(\square \)

B Stochastic approximation of the likelihood gradient for Gaussian process based classification

In Appendices B and C, for two matrices A and B of sizes \(a \times d\) and \(b \times d\), and for a function \(h: \mathbb {R}^d \times \mathbb {R}^d \rightarrow \mathbb {R}\), let h(A, B) be the \(a \times b\) matrix \([h(a_i,b_j)]_{i=1,\ldots ,a,j=1,\ldots ,b}\), where \(a_i\) and \(b_j\) are the rows i and j of A and B.

Let \(s_n = (i_1,\ldots ,i_n) \in \{0,1\}^n\) be fixed. Assume that the likelihood \(\mathbb {P}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n )\) has been estimated by \(\hat{\mathbb {P}}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n )\). Assume also that realizations \(z_n^{(1)},\ldots ,z_n^{(N)}\), approximately following the conditional distribution of \(Z_n\) given \(\mathrm {sign}(Z_n) = s_n\), are available.

Let \(\mathcal {Z} = \{ z_n \in \mathbb {R}^n: \mathrm {sign}(z_n) = s_n \}\). Treating \(x_1,\ldots ,x_n\) as d-dimensional row vectors, let \(\mathbf {x}\) be the matrix \( (x_1^\top ,\ldots ,x_n^\top )^\top \). Then we have

$$\begin{aligned} \mathbb {P}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n ) = \int _{\mathcal {Z}} \frac{1}{(2 \pi )^{n/2}} \frac{1}{\sqrt{|k_{\theta }^Z(\mathbf {x},\mathbf {x})|}} e^{ \frac{-1}{2} (z_n - \mu \mathrm {1}_n)^\top k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} (z_n - \mu \mathrm {1}_n) } dz_n, \end{aligned}$$

where \(\mathrm {1}_n = (1,\ldots ,1)^\top \in \mathbb {R}^n\) and |.| denotes the determinant.
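As an aside, this orthant probability can be estimated numerically. Below is a small sketch of ours (not the paper's implementation), using SciPy's multivariate normal CDF, which relies on Genz-type numerical integration [7]: the event \(\{\mathrm {sign}(Z_n) = s_n\}\) is rewritten as \(\{D Z_n \le 0\}\) for a diagonal sign matrix D.

```python
import numpy as np
from scipy.stats import multivariate_normal

def orthant_prob(mu, K, s_n):
    """Estimate P(sign(Z_n) = s_n) for Z_n ~ N(mu * 1_n, K).

    With d_j = -1 if s_n[j] = 1 (Z_j > 0) and d_j = +1 otherwise
    (Z_j <= 0), the event becomes {D Z_n <= 0}, and D Z_n is Gaussian
    with mean D mu 1_n and covariance D K D.
    """
    n = K.shape[0]
    d = np.where(np.asarray(s_n) == 1, -1.0, 1.0)
    mean = d * (mu * np.ones(n))
    cov = (d[:, None] * K) * d[None, :]
    return multivariate_normal(mean=mean, cov=cov, allow_singular=True).cdf(np.zeros(n))
```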

Differentiating with respect to \(\mu \) yields

$$\begin{aligned} \frac{\partial }{ \partial \mu } \mathbb {P}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n ) =&\int _{\mathcal {Z}} \frac{1}{(2 \pi )^{n/2}} \frac{1}{\sqrt{|k_{\theta }^Z(\mathbf {x},\mathbf {x})|}} e^{ \frac{-1}{2} (z_n - \mu \mathrm {1}_n)^\top k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} (z_n - \mu \mathrm {1}_n) } \\&( \mathrm {1}_n^\top k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} (z_n - \mu \mathrm {1}_n)) dz_n \\ =&\mathbb {P}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n ) \mathbb {E}_{\mu ,\theta } \left( \left. \mathrm {1}_n^\top k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} (Z_n - \mu \mathrm {1}_n) \right| \mathrm {sign}(Z_n) = s_n \right) , \end{aligned}$$

where \(\mathbb {E}_{\mu ,\theta }\) means that the conditional expectation is calculated under the assumption that Z has constant mean function \(\mu \) and covariance function \(k_{\theta }^Z\). Hence we have the stochastic approximation \(\hat{\nabla }_{\mu }\) for \(\partial /\partial \mu \mathbb {P}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n )\) given by

$$\begin{aligned} \hat{\nabla }_{\mu } = \hat{\mathbb {P}}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n ) \frac{1}{N} \sum _{i=1}^N \mathrm {1}_n^\top k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} (z^{(i)}_n - \mu \mathrm {1}_n). \end{aligned}$$
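A minimal sketch of this estimator follows (ours, with hypothetical variable names). It assumes the likelihood estimate `p_hat` and the conditional draws `z_samples` of shape \(N \times n\), obtained e.g. by minimax tilting [4] or exact Hamiltonian Monte Carlo [23], are already available, and `K` stands for \(k_{\theta }^Z(\mathbf {x},\mathbf {x})\).

```python
import numpy as np

def grad_mu_hat(p_hat, z_samples, K, mu):
    """Stochastic approximation of d/dmu P(sign(Z_n) = s_n).

    p_hat     : estimate of P(sign(Z_n) = s_n)
    z_samples : (N, n) draws approximately from Z_n | sign(Z_n) = s_n
    K         : (n, n) covariance matrix k_theta^Z(x, x)
    mu        : scalar mean of Z
    """
    n = K.shape[0]
    w = np.linalg.solve(K, np.ones(n))   # K^{-1} 1_n
    # 1_n^T K^{-1} (z^(i) - mu 1_n), averaged over the N samples
    scores = (z_samples - mu) @ w
    return p_hat * scores.mean()
```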

Differentiating with respect to \(\theta _i\) for \(i=1,\ldots ,p\) yields, with \(\mathrm {adj}(M) \) the adjugate of a matrix M,

$$\begin{aligned}&\frac{\partial }{ \partial \theta _i } \mathbb {P}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n )\\&\quad =\int _{\mathcal {Z}} \left( \frac{-1}{2} |k_{\theta }^Z(\mathbf {x},\mathbf {x})|^{-1} \mathrm {Tr} \left( \mathrm {adj}(k_{\theta }^Z(\mathbf {x},\mathbf {x})) \frac{\partial k_{\theta }^Z(\mathbf {x},\mathbf {x})}{ \partial \theta _i } \right) \right. \\&\qquad \left. + \frac{1}{2} (z_n - \mu \mathrm {1}_n)^\top k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} \frac{\partial k_{\theta }^Z(\mathbf {x},\mathbf {x})}{ \partial \theta _i } k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} (z_n - \mu \mathrm {1}_n) \right) \\&\qquad \times \frac{1}{(2 \pi )^{n/2}} \frac{1}{\sqrt{|k_{\theta }^Z(\mathbf {x},\mathbf {x})|}} e^{ \frac{-1}{2} (z_n - \mu \mathrm {1}_n)^\top k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} (z_n - \mu \mathrm {1}_n) } dz_n \\&\quad = \mathbb {P}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n ) \\&\qquad \mathbb {E}_{\mu ,\theta } \left( \frac{-1}{2} |k_{\theta }^Z(\mathbf {x},\mathbf {x})|^{-1} \mathrm {Tr} \left( \mathrm {adj}(k_{\theta }^Z(\mathbf {x},\mathbf {x})) \frac{\partial k_{\theta }^Z(\mathbf {x},\mathbf {x})}{ \partial \theta _i } \right) \right. \\&\qquad \left. \left. + \frac{1}{2} (Z_n - \mu \mathrm {1}_n)^\top k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} \frac{\partial k_{\theta }^Z(\mathbf {x},\mathbf {x})}{ \partial \theta _i } k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} (Z_n - \mu \mathrm {1}_n) \right| \mathrm {sign}(Z_n) = s_n \right) . \end{aligned}$$

Hence we have the stochastic approximation \(\hat{\nabla }_{\theta _i}\) for \(\partial /\partial \theta _i \mathbb {P}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n )\) given by

$$\begin{aligned} \hat{\nabla }_{\theta _i}&= \hat{\mathbb {P}}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n ) \frac{1}{N} \sum _{j=1}^N \left( \frac{-1}{2} |k_{\theta }^Z(\mathbf {x},\mathbf {x})|^{-1} \mathrm {Tr} \left( \mathrm {adj}(k_{\theta }^Z(\mathbf {x},\mathbf {x})) \frac{\partial k_{\theta }^Z(\mathbf {x},\mathbf {x})}{ \partial \theta _i } \right) \right. \\&\quad \left. + \frac{1}{2} (z^{(j)}_n - \mu \mathrm {1}_n)^\top k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} \frac{\partial k_{\theta }^Z(\mathbf {x},\mathbf {x})}{ \partial \theta _i } k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} (z^{(j)}_n - \mu \mathrm {1}_n) \right) . \end{aligned}$$
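The companion sketch for \(\hat{\nabla }_{\theta _i}\), under the same assumptions as above, with `dK` standing for \(\partial k_{\theta }^Z(\mathbf {x},\mathbf {x}) / \partial \theta _i\); it uses the identity \(|K|^{-1} \mathrm {Tr}( \mathrm {adj}(K)\, \partial K / \partial \theta _i ) = \mathrm {Tr}( K^{-1}\, \partial K / \partial \theta _i )\) to avoid forming the adjugate.

```python
import numpy as np

def grad_theta_hat(p_hat, z_samples, K, dK, mu):
    """Stochastic approximation of d/dtheta_i P(sign(Z_n) = s_n)."""
    K_inv = np.linalg.inv(K)
    # -1/2 |K|^{-1} Tr(adj(K) dK) = -1/2 Tr(K^{-1} dK)
    trace_term = -0.5 * np.trace(K_inv @ dK)
    centred = z_samples - mu                # (N, n)
    A = K_inv @ dK @ K_inv                  # K^{-1} dK K^{-1}
    # per-sample quadratic form (z - mu 1)^T A (z - mu 1)
    quad = 0.5 * np.einsum("ij,jk,ik->i", centred, A, centred)
    return p_hat * float(np.mean(trace_term + quad))
```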

Remark 2

Several implementations of algorithms are available to obtain the realizations \(z_n^{(1)},\ldots ,z_n^{(N)}\), as discussed after Algorithm 1. It may also be the case that some implementations provide both the estimate \(\hat{\mathbb {P}}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n )\) and the realizations \(z_n^{(1)},\ldots ,z_n^{(N)}\).

C Expressions of the mean and covariance of the conditional Gaussian process and of the gradient of the acquisition function

Let \(\mu ^Y\) and \(k^Y\) be the mean and covariance functions of Y. Treating \(x_1,\ldots ,x_n\) as d-dimensional row vectors, let \(\mathbf {x}_q\) be the matrix extracted from \( (x_1^\top ,\ldots ,x_n^\top )^\top \) by keeping only the rows whose indices j satisfy \(i_j = 1\).

We first recall the classical expressions of GP conditioning:

$$\begin{aligned} m_q^Y(x,Y_q)= & {} \mu ^Y + k^Y(x,\mathbf {x}_q) \left( k^Y(\mathbf {x}_q,\mathbf {x}_q) \right) ^{-1} \left( Y_q - \mu ^Y \right) \\ k_q^Y(x, x')= & {} k^Y(x,x') - k^Y(x,\mathbf {x}_q) \left( k^Y(\mathbf {x}_q,\mathbf {x}_q) \right) ^{-1} k^Y(\mathbf {x}_q,x'). \end{aligned}$$

\(\nabla _x m^Y_q(x,Y_q)\) and \(\nabla _x k_q^Y(x,x)\) are straightforward provided that \(\nabla _{x} k^Y({x},{y})\) is available:

$$\begin{aligned} \nabla _x m^Y_q(x,Y_q)= & {} [ \nabla _x k^Y(x,\mathbf {x}_q) ] \left( k^Y(\mathbf {x}_q,\mathbf {x}_q) \right) ^{-1} \left( Y_q - \mu ^Y \right) \\ \nabla _x k^Y_q(x, x)= & {} \nabla _x k^Y(x, x) -2 k^Y(x,\mathbf {x}_q)\left( k^Y(\mathbf {x}_q,\mathbf {x}_q) \right) ^{-1} \nabla _x k^Y(\mathbf {x}_q,x). \end{aligned}$$

Then:

$$\begin{aligned} \nabla _x EI_q(x) = \Phi \left( \frac{m^Y_q(x,Y_q) - M_q}{\sqrt{k^Y_q(x,x)}} \right) \nabla _x m^Y_q(x,Y_q) + \phi \left( \frac{M_q - m^Y_q(x,Y_q)}{\sqrt{k^Y_q(x,x)}} \right) \frac{1}{2 \sqrt{k^Y_q(x,x)}}\nabla _x k^Y_q(x, x). \end{aligned}$$
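As an illustration, here is a self-contained sketch of \(EI_q\) and \(\nabla _x EI_q\) built from these formulas. The helpers `k(A, B)` (pairwise kernel matrix, in the convention of Appendix B) and `dk(x, A)` (gradient of \(k^Y(x,\cdot )\) in its first argument) are our assumptions, and we assume a stationary kernel so that \(\nabla _x k^Y(x,x) = 0\).

```python
import numpy as np
from scipy.stats import norm

def ei_and_grad(x, Xq, Yq, mu, k, dk, M_q):
    """Expected improvement EI_q(x) and its gradient for a GP with
    constant mean mu, kernel k and kernel gradient dk.
    Xq: (q, d) successful inputs, Yq: (q,) outputs, M_q: current max."""
    Kq = k(Xq, Xq)                                 # (q, q)
    kx = k(x[None, :], Xq).ravel()                 # (q,)
    alpha = np.linalg.solve(Kq, Yq - mu)
    m = mu + kx @ alpha                            # m_q^Y(x, Y_q)
    beta = np.linalg.solve(Kq, kx)
    s2 = k(x[None, :], x[None, :])[0, 0] - kx @ beta   # k_q^Y(x, x)
    s = np.sqrt(max(s2, 1e-12))                    # numerical floor
    Dkx = dk(x, Xq)                                # (q, d): grad_x k(x, x_j)
    grad_m = Dkx.T @ alpha
    grad_s2 = -2.0 * Dkx.T @ beta                  # stationary: grad_x k(x,x) = 0
    u = (m - M_q) / s
    ei = (m - M_q) * norm.cdf(u) + s * norm.pdf(u)
    grad_ei = norm.cdf(u) * grad_m + norm.pdf(u) * grad_s2 / (2.0 * s)
    return ei, grad_ei
```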

For \(\mathrm {P_{nf}}(x)\), using the approximation of Algorithm 1, we have:

$$\begin{aligned} \widehat{\mathrm {P_{nf}}}(x) = \frac{1}{N} \sum _{i=1}^N \bar{\varPhi } \left( \frac{-m_n^Z(x, z_n^{(i)})}{ \sqrt{ k_n^Z(x,x) } } \right) , \end{aligned}$$

with \(k_n^Z(x,x)\) defined analogously to \(k_q^Y(x,x)\), using the kernel \(k^Z\) and all n points \(\mathbf {x}\), and

$$\begin{aligned} m_n^Z(x, z_n^{(i)}) = \mu ^Z + k^Z(x,\mathbf {x}) \left( k^Z(\mathbf {x},\mathbf {x}) \right) ^{-1} \left( z_n^{(i)} - \mu ^Z \right) . \end{aligned}$$

Applying the standard differentiation rules yields:

$$\begin{aligned} \nabla _x \widehat{\mathrm {P_{nf}}}(x)= & {} \frac{1}{N}\sum _{i=1}^N \phi \left( \frac{m_n^Z(x,z_n^{(i)})}{\sqrt{k^Z_n(x,x)}}\right) \left[ \frac{1}{\sqrt{k^Z_n(x,x)}} \nabla _x m_n^Z(x,z_n^{(i)}) - \frac{m_n^Z(x,z_n^{(i)})}{2[k^Z_n(x,x)]^{3/2}} \nabla _x k_n^Z(x,x) \right] . \end{aligned}$$

The gradient of the acquisition function can then be obtained using the product rule.
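Putting the pieces together, here is a sketch (ours) of \(\widehat{\mathrm {P_{nf}}}\), its gradient, and the product-rule assembly of the acquisition gradient; `m_nZ`, `grad_m_nZ`, `k_nZ` and `grad_k_nZ` are hypothetical routines implementing the conditional-moment formulas above, and `z_samples` are the conditional draws of Algorithm 1.

```python
import numpy as np
from scipy.stats import norm

def pnf_hat_and_grad(x, z_samples, m_nZ, grad_m_nZ, k_nZ, grad_k_nZ):
    """Monte Carlo estimate of P_nf(x) and of its gradient."""
    s2 = k_nZ(x)                        # k_n^Z(x, x)
    s = np.sqrt(s2)
    gs2 = np.asarray(grad_k_nZ(x))      # (d,)
    pnf = 0.0
    grad = np.zeros_like(gs2)
    for z in z_samples:                 # the N conditional draws
        m = m_nZ(x, z)
        u = m / s                       # bar_Phi(-m/s) = Phi(m/s)
        pnf += norm.cdf(u)
        grad += norm.pdf(u) * (np.asarray(grad_m_nZ(x, z)) / s
                               - m * gs2 / (2.0 * s ** 3))
    N = len(z_samples)
    return pnf / N, grad / N

# acquisition gradient by the product rule, with ei, grad_ei from
# ei_and_grad above:
#   grad[EI_q(x) * P_nf(x)] = grad_ei * pnf + ei * grad_pnf
```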

About this article

Cite this article

Bachoc, F., Helbert, C. & Picheny, V. Gaussian process optimization with failures: classification and convergence proof. J Glob Optim 78, 483–506 (2020). https://doi.org/10.1007/s10898-020-00920-0

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10898-020-00920-0

Keywords

Mathematics Subject Classification

Navigation