
Gaussian process optimization with failures: classification and convergence proof


Abstract

We consider the optimization of a computer model where each simulation either fails or returns a valid output performance. We first propose a new joint Gaussian process model for classification of the inputs (computation failure or success) and for regression of the performance function. We provide results that allow for a computationally efficient maximum likelihood estimation of the covariance parameters, with a stochastic approximation of the likelihood gradient. We then extend the classical improvement criterion to our setting of joint classification and regression. We provide an efficient computation procedure for the extended criterion and its gradient. We prove the almost sure convergence of the global optimization algorithm following from this extended criterion. We also study the practical performances of this algorithm, both on simulated data and on a real computer model in the context of automotive fan design.


Notes

  1. A constant mean function could be incorporated and estimated with no additional complexity.

References

  1. Azzimonti, D., Ginsbourger, D.: Estimating orthant probabilities of high-dimensional Gaussian vectors with an application to set estimation. J. Comput. Graph. Stat. 27(2), 255–267 (2018)

  2. Bect, J., Bachoc, F., Ginsbourger, D.: A supermartingale approach to Gaussian process based sequential design of experiments. Bernoulli 25(4A), 2883–2919 (2019)

  3. Benassi, R., Bect, J., Vazquez, E.: Robust Gaussian process-based global optimization using a fully Bayesian expected improvement criterion. In: International Conference on Learning and Intelligent Optimization, pp. 176–190. Springer (2011)

  4. Botev, Z.I.: The normal law under linear restrictions: simulation and estimation via minimax tilting. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 79(1), 125–148 (2017)

  5. Bull, A.D.: Convergence rates of efficient global optimization algorithms. J. Mach. Learn. Res. 12, 2879–2904 (2011)

  6. Gelbart, M.A., Snoek, J., Adams, R.P.: Bayesian optimization with unknown constraints. In: UAI (2014)

  7. Genz, A.: Numerical computation of multivariate normal probabilities. J. Comput. Graph. Stat. 1(2), 141–149 (1992)

  8. Ginsbourger, D., Le Riche, R., Carraro, L.: Kriging is well-suited to parallelize optimization. In: Computational Intelligence in Expensive Optimization Problems, pp. 131–162. Springer (2010)

  9. Ginsbourger, D., Roustant, O., Durrande, N.: On degeneracy and invariances of random fields paths with applications in Gaussian process modelling. J. Stat. Plan. Inference 170, 117–128 (2016)

  10. Gramacy, R., Lee, H.: Optimization under unknown constraints. Bayesian Stat. 9, 229 (2011)

  11. Gramacy, R.B., Gray, G.A., Le Digabel, S., Lee, H.K., Ranjan, P., Wells, G., Wild, S.M.: Modeling an augmented Lagrangian for blackbox constrained optimization. Technometrics 58(1), 1–11 (2016)

  12. Hernandez-Lobato, J.M., Gelbart, M., Hoffman, M., Adams, R., Ghahramani, Z.: Predictive entropy search for Bayesian optimization with unknown constraints. In: International Conference on Machine Learning, pp. 1699–1707 (2015)

  13. Jones, D., Schonlau, M., Welch, W.: Efficient global optimization of expensive black-box functions. J. Glob. Optim. 13, 455–492 (1998)

  14. Kallenberg, O.: Foundations of Modern Probability, 2nd edn. Springer, Berlin (2002)

  15. Kandasamy, K., Neiswanger, W., Schneider, J., Poczos, B., Xing, E.P.: Neural architecture search with Bayesian optimisation and optimal transport. In: Advances in Neural Information Processing Systems, pp. 2016–2025 (2018)

  16. Keane, A., Nair, P.: Computational Approaches for Aerospace Design: The Pursuit of Excellence. Wiley, Hoboken (2005)

  17. Lindberg, D.V., Lee, H.K.: Optimization under constraints by applying an asymmetric entropy measure. J. Comput. Graph. Stat. 24(2), 379–393 (2015)

  18. López-Lopera, A.F., Bachoc, F., Durrande, N., Roustant, O.: Finite-dimensional Gaussian approximation with linear inequality constraints. SIAM/ASA J. Uncertain. Quantif. 6(3), 1224–1255 (2018)

  19. Maatouk, H., Bay, X.: A New Rejection Sampling Method for Truncated Multivariate Gaussian Random Variables Restricted to Convex Sets, pp. 521–530. Springer, Cham (2016)

  20. Meyn, S.P., Tweedie, R.L.: Markov Chains and Stochastic Stability. Springer, Berlin (2012)

  21. Mockus, J.B., Tiesis, V., Žilinskas, A.: The application of Bayesian methods for seeking the extremum. In: Dixon, L.C.W., Szegö, G.P. (eds.) Towards Global Optimization, vol. 2, pp. 117–129. North Holland, New York (1978)

  22. Nickisch, H., Rasmussen, C.E.: Approximations for binary Gaussian process classification. J. Mach. Learn. Res. 9, 2035–2078 (2008)

  23. Pakman, A., Paninski, L.: Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. J. Comput. Graph. Stat. 23(2), 518–542 (2014)

  24. Picheny, V.: A stepwise uncertainty reduction approach to constrained global optimization. In: Artificial Intelligence and Statistics, pp. 787–795 (2014)

  25. Picheny, V., Gramacy, R.B., Wild, S., Le Digabel, S.: Bayesian optimization under mixed constraints with a slack-variable augmented Lagrangian. In: Advances in Neural Information Processing Systems, pp. 1435–1443 (2016)

  26. Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning. The MIT Press, Cambridge (2006)

  27. Roustant, O., Ginsbourger, D., Deville, Y.: DiceKriging, DiceOptim: two R packages for the analysis of computer experiments by Kriging-based metamodeling and optimization. J. Stat. Softw. 51(1), 1–55 (2012)

  28. Sacher, M., Duvigneau, R., Le Maitre, O., Durand, M., Berrini, E., Hauville, F., Astolfi, J.-A.: A classification approach to efficient global optimization in presence of non-computable domains. Struct. Multidiscip. Optim. 58(4), 1537–1557 (2018)

  29. Sasena, M.J., Papalambros, P., Goovaerts, P.: Exploration of metamodeling sampling criteria for constrained global optimization. Eng. Optim. 34(3), 263–278 (2002)

  30. Schonlau, M., Welch, W.J., Jones, D.R.: Global versus local search in constrained optimization of computer models. In: Lecture Notes-Monograph Series, pp. 11–25 (1998)

  31. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems, pp. 2951–2959 (2012)

  32. Srinivas, N., Krause, A., Kakade, S., Seeger, M.: Gaussian process optimization in the bandit setting: no regret and experimental design. In: Proceedings of the 27th International Conference on Machine Learning, pp. 1015–1022 (2010)

  33. Taylor, J., Benjamini, Y.: RestrictedMVN: multivariate normal restricted by affine constraints. https://cran.r-project.org/web/packages/restrictedMVN/index.html (2017). Accessed 2 Feb 2017

  34. Vazquez, E., Bect, J.: Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. J. Stat. Plan. Inference 140(11), 3088–3095 (2010)

  35. Vazquez, E., Bect, J.: Pointwise consistency of the kriging predictor with known mean and covariance functions. In: mODa 9 – Advances in Model-Oriented Design and Analysis, pp. 221–228. Springer (2010)

  36. Wu, J., Frazier, P.: The parallel knowledge gradient method for batch Bayesian optimization. In: Advances in Neural Information Processing Systems, pp. 3126–3134 (2016)

  37. Zhigljavsky, A., Žilinskas, A.: Selection of a covariance function for a Gaussian random field aimed for modeling global optimization problems. Optim. Lett. 13(2), 249–259 (2019)

  38. Žilinskas, A., Calvin, J.: Bi-objective decision making in global optimization based on statistical models. J. Glob. Optim. 74(4), 599–609 (2019)

Acknowledgements

This work was partly funded by the ANR project PEPITO. We are grateful to Jean-Marc Azaïs and David Ginsbourger for discussing the topics in this paper.

Author information

Corresponding author

Correspondence to François Bachoc.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material

The supplementary material contains additional figures and tables for Sections 5 and 6. (pdf 1,904 KB)

Appendices

A Proofs

Proof

(Proof of Lemma 1)

With \(\phi ^{Z_n}\) the p.d.f. of \(Z_n\) and with \(s_n = (i_1,\ldots ,i_n)^\top \), we have

$$\begin{aligned} \mathrm {P_{nf}}(x)&= \mathbb {P} \left( \left. Z(x)> 0 \right| \mathrm {sign}(Z_n) = s_n \right) \nonumber \\&\quad = \frac{ \mathbb {E} \left( \mathbf {1}_{Z(x)> 0} \prod _{j=1}^n \mathbf {1}_{ \mathrm {sign}(Z(x_j)) = i_j } \right) }{ \mathbb {P} \left( \mathrm {sign}(Z_n) = s_n \right) } \nonumber \\&\quad = \frac{ \int _{\mathbb {R}^n} \phi ^{Z_n}(z_1,\ldots ,z_n) \left( \prod _{j=1}^n \mathbf {1}_{ \mathrm {sign}(z_j) = i_j } \right) \mathbb {P} \left( \left. Z(x) > 0 \right| Z(x_1) = z_1 , \ldots , Z(x_n) = z_n \right) dz_1 \ldots dz_n }{ \mathbb {P} \left( \mathrm {sign}(Z_n) = s_n \right) } \nonumber \\&\qquad = \int _{\mathbb {R}^n} \phi ^{Z_n}_{s_n}(z_n) \bar{\varPhi } \left( \frac{-m_{n}^Z(x,z_n)}{ \sqrt{ k_{n}^Z(x) } } \right) dz_n. \end{aligned}$$
(16)

Equation (16) is obtained by observing that

$$\begin{aligned} \frac{ \phi ^{Z_n}(z_1,\ldots ,z_n) \left( \prod _{j=1}^n \mathbf {1}_{ \mathrm {sign}(z_j) = i_j } \right) }{ \mathbb {P} \left( \mathrm {sign}(Z_n) = s_n \right) } = \frac{ \phi ^{Z_n}(z_1,\ldots ,z_n) \mathbf {1}_{ \mathrm {sign}(Z_n) = s_n } }{ \mathbb {P} \left( \mathrm {sign}(Z_n) = s_n \right) } = \phi ^{Z_n}_{s_n}(z_n) \end{aligned}$$

by definition of \(\phi ^{Z_n}_{s_n}(z_n)\), and that

$$\begin{aligned} \mathbb {P} \left( \left. Z(x) > 0 \right| Z(x_1) = z_1 , \ldots , Z(x_n) = z_n \right) = \bar{\varPhi } \left( \frac{-m_{n}^Z(x,z_n)}{ \sqrt{ k_{n}^Z(x) } } \right) \end{aligned}$$

by Gaussian conditioning. \(\square \)

Proof

(Proof of Lemma 2) For any measurable function f, by the law of total expectation and using the independence of Y, \((W_1,\ldots ,W_n)\) and Z, we have

$$\begin{aligned} \mathbb {E}&\left[ f \left( I_1 , \ldots , I_n , V_1 , \ldots , V_n \right) \right] \\ =&\sum _{i_1,\ldots ,i_n \in \{ 0,1\}} \mathbb {P}_{\mu ^Z,\theta _Z} \left( I_1 = i_1 , \ldots , I_n = i_n \right) \\&\mathbb {E} \left[ f \left( i_1 , \ldots , i_n , i_1 Y(x_1) + (1-i_1) W_1 , \ldots , i_n Y(x_n) + (1-i_n) W_n \right) \right] \\ =&\sum _{i_1,\ldots ,i_n \in \{ 0,1\}} \mathbb {P}_{\mu ^Z,\theta _Z} \left( I_1 = i_1 , \ldots , I_n = i_n \right) \\&~ ~ \int _{\mathbb {R}^{n}} dv \phi _{\mu ^Y,\theta _Y,s_n}^Y \left( v_{s_n} \right) \left( \prod _{ \begin{array}{c} j = 1,\ldots ,n \\ i_j = 0 \end{array}} \phi (v_j) \right) f \left( i_1 , \ldots , i_n , v_1 , \ldots , v_n \right) . \end{aligned}$$

This concludes the proof by definition of a p.d.f. \(\square \)

Proof

(Proof of Lemma 3)

Consider measurable functions f(Y), g(Z), \(h( I_1, \ldots , I_n )\) and \(\psi ( I_1 Y(x_1), \ldots , I_n Y(x_n) )\). We have, by independence of Y and Z,

$$\begin{aligned} \mathbb {E}&\left[ f(Y) g(Z) h( I_1, \ldots , I_n ) \psi ( I_1 Y(x_1), \ldots , I_n Y(x_n) ) \right] \\ =&\sum _{i_1,\ldots ,i_n \in \{ 0 , 1\}} \mathbb {P} \left( I_1 = i_1, \ldots , I_n = i_n \right) \\&\mathbb {E} \left[ \left. f(Y) g(Z) h( i_1, \ldots , i_n ) \psi (i_1 Y(x_1), \ldots , i_n Y(x_n) ) \right| I_1 = i_1, \ldots , I_n = i_n \right] \\ =&\sum _{i_1,\ldots ,i_n \in \{ 0 , 1\}} \mathbb {P} \left( I_1 = i_1, \ldots , I_n = i_n \right) h( i_1, \ldots , i_n ) \\&\mathbb {E} \left[ f(Y) \psi (i_1 Y(x_1), \ldots , i_n Y(x_n) ) \right] \mathbb {E} \left[ \left. g(Z) \right| I_1 = i_1, \ldots , I_n = i_n \right] \\ =&\sum _{i_1,\ldots ,i_n \in \{ 0 , 1\}} \mathbb {P} \left( I_1 = i_1, \ldots , I_n = i_n \right) h( i_1, \ldots , i_n ) \\&\mathbb {E} \left[ \psi (i_1 Y(x_1), \ldots , i_n Y(x_n) ) \mathbb {E} \left[ \left. f(Y) \right| Y_{n,s_n} \right] \right] \mathbb {E} \left[ \left. g(Z) \right| I_1 = i_1, \ldots , I_n = i_n \right] . \end{aligned}$$

The last display can be written as, with \(\mathcal {L}_{n}\) the distribution of

$$\begin{aligned} I_1 ,\ldots ,I_n , I_1 Y(x_1) , \ldots , I_n Y(x_n), \end{aligned}$$
$$\begin{aligned}&\int _{\mathbb {R}^{2n}} d \mathcal {L}_n( i_1,\ldots ,i_n,i_1 y_1, \ldots , i_n y_n ) h(i_1,\ldots ,i_n) \psi ( i_1 y_1 , \ldots , i_n y_n ) \\&\quad \mathbb {E} \left[ \left. f(Y) \right| Y_{n,s_n} = Y_q \right] \mathbb {E} \left[ \left. g(Z) \right| I_1 = i_1, \ldots , I_n = i_n \right] , \end{aligned}$$

where \(Y_q\) is as defined in the statement of the lemma. This concludes the proof. \(\square \)

We now address the proof of Theorem 1. We let \((X_i)_{i \in \mathbb {N}}\) be the random observation points, such that \(X_i\) is obtained from (13) and (14) for \(i \in \mathbb {N}\). The next lemma shows that conditioning on the random observation points and observed values works “as if” the observation points \(X_1,\ldots ,X_n\) were non-random.

Lemma 4

For any \(x_1,\ldots ,x_k \in \mathcal {D}\), \(i_1,\ldots ,i_k \in \{ 0,1\}\) and \(y_1,\ldots ,y_k \in \mathbb {R}\), the conditional distribution of \((Y, Z)\) given

$$\begin{aligned}&X_1 = x_1,\mathrm {sign}(Z(X_1)) = i_1,\mathrm {sign}(Z(X_1)) Y(X_1) = i_1 y_1,\ldots , \\&X_k = x_k,\mathrm {sign}(Z(X_k)) = i_k,\mathrm {sign}(Z(X_k)) Y(X_k) = i_k y_k \end{aligned}$$

is the same as the conditional distribution of \((Y, Z)\) given

$$\begin{aligned} \mathrm {sign}(Z(x_1)) = i_1,\mathrm {sign}(Z(x_1)) Y(x_1) = i_1 y_1,\ldots ,\mathrm {sign}(Z(x_k)) = i_k,\mathrm {sign}(Z(x_k)) Y(x_k) = i_k y_k. \end{aligned}$$

Proof

This lemma can be shown similarly to Proposition 2.6 in [2]. \(\square \)

Proof

(Proof of Theorem 1)

For \(k \in \mathbb {N}\), we remark that \(\mathcal {F}_k\) is the sigma-algebra generated by

$$\begin{aligned} X_1,\mathrm {sign}(Z(X_1)),\mathrm {sign}(Z(X_1))Y(X_1) , \ldots , X_k,\mathrm {sign}(Z(X_k)),\mathrm {sign}(Z(X_k)) Y(X_k). \end{aligned}$$

We let \(\mathbb {E}_k\), \(\mathbb {P}_k\) and \(\mathrm {var}_k\) denote the expectation, probability and variance conditionally on \(\mathcal {F}_k\). For \(x \in \mathcal {D}\), we let \(\mathbb {E}_{k,x}\), \(\mathbb {P}_{k,x}\) and \(\mathrm {var}_{k,x}\) denote the expectation, probability and variance conditionally on

$$\begin{aligned}&X_1,\mathrm {sign}(Z(X_1)),\mathrm {sign}(Z(X_1))Y(X_1) , \ldots , X_k,\\&\qquad \mathrm {sign}(Z(X_k)),\mathrm {sign}(Z(X_k))Y(X_k), x,\mathrm {sign}(Z(x)),\mathrm {sign}(Z(x))Y(x). \end{aligned}$$

We let \(\sigma _k^2(u) = \mathrm {var}_k(Y(u))\), \(m_k(u) = \mathbb {E}_k[Y(u)]\) and \(P_k(u) = \mathbb {P}_k( Z(u) > 0 )\). We also let \(\sigma _{k,x}^2(u) = \mathrm {var}_{k,x}(Y(u))\), \(m_{k,x}(u) = \mathbb {E}_{k,x}[Y(u)]\) and \(P_{k,x}(u) = \mathbb {P}_{k,x}( Z(u) > 0 )\).

With these notations, the observation points satisfy, for \(k \in \mathbb {N}\),

$$\begin{aligned} X_{k+1} \in \mathrm {argmax}_{ x \in \mathcal {D} } \mathbb {E}_{k} \left( \max _{ \begin{array}{c} u: P_{k,x}(u) = 1 \\ \sigma _{k,x}(u) = 0 \end{array} } Y(u) - M_k \right) , \end{aligned}$$
(17)

where

$$\begin{aligned} M_k = \max _{ \begin{array}{c} u: P_k(u) = 1 \\ \sigma _k(u) = 0 \end{array} }Y(u). \end{aligned}$$

We first show that (17) can be defined as a stepwise uncertainty reduction (SUR) sequential design [2]. We have

$$\begin{aligned} X_{k+1} \in&\mathrm {argmax}_{x \in \mathcal {D}} \mathbb {E}_k \left( \max _{ \begin{array}{c} P_{k,x}(u) = 1 \\ \sigma _{k,x}(u) = 0 \end{array}} Y(u) - \max _{ \begin{array}{c} P_{k}(u) = 1 \\ \sigma _{k}(u) = 0 \end{array}} Y(u) \right) \\ \in&\mathrm {argmin}_{x \in \mathcal {D}} \mathbb {E}_k \left( \mathbb {E}_{k,x} \left( \max _{Z(u) > 0} Y(u) \right) - \max _{ \begin{array}{c} P_{k,x}(u) = 1 \\ \sigma _{k,x}(u) = 0 \end{array}} Y(u) \right) \nonumber \end{aligned}$$
(18)

since the second term in (18) does not depend on x, and by the law of total expectation. We let

$$\begin{aligned} H_k = \mathbb {E}_{k} \left( \max _{Z(u) > 0} Y(u) - \max _{ \begin{array}{c} P_{k}(u) = 1 \\ \sigma _{k}(u) = 0 \end{array}} Y(u) \right) \end{aligned}$$

and

$$\begin{aligned} H_{k,x} = \mathbb {E}_{k,x} \left( \max _{Z(u) > 0} Y(u) - \max _{ \begin{array}{c} P_{k,x}(u) = 1 \\ \sigma _{k,x}(u) = 0 \end{array}} Y(u) \right) . \end{aligned}$$

Then we have for \(k \ge 1\)

$$\begin{aligned} X_{k+1} \in \mathrm {argmin}_{x \in \mathcal {D}} \mathbb {E}_k \left( H_{k,x} \right) . \end{aligned}$$

We have, using the law of total expectation, and since \(\mathbb {E}_{k,x} \left[ \max _{ \begin{array}{c} P_{k,x}(u) = 1, \sigma _{k,x}(u) = 0 \end{array}} Y(u) \right] = \max _{ \begin{array}{c} P_{k,x}(u) = 1, \sigma _{k,x}(u) = 0 \end{array}} Y(u)\),

$$\begin{aligned} H_k - \mathbb {E}_k( H_{k+1} )&= \mathbb {E}_k \left( \max _{ \begin{array}{c} P_{k,X_{k+1}}(u) = 1 \\ \sigma _{k,X_{k+1}}(u) = 0 \end{array}} Y(u) - \max _{ \begin{array}{c} P_{k}(u) = 1 \\ \sigma _{k}(u) = 0 \end{array}} Y(u) \right) \\&\ge 0 \end{aligned}$$

since, for all \(u,x \in \mathcal {D}\), \(\sigma _{k,x}(u) \le \sigma _k(u) \), and \(P_k(u) = 1\) implies \(P_{k,x}(u) = 1\). Hence \((H_k)_{k \in \mathbb {N}}\) is a supermartingale and, of course, \(H_k \ge 0\) for all \(k \in \mathbb {N}\). Also, \(| H_1| \le 2 \mathbb {E}_1 \left[ \max _{u \in \mathcal {D}} |Y(u)| \right] \), so that \(H_1\) is bounded in \(L^1\), since the mean value of the maximum of a continuous Gaussian process on a compact set is finite. Hence, from Theorem 6.23 in [14], \(H_k\) converges a.s. as \(k \rightarrow \infty \) to a finite random variable. Hence, as in the proof of Theorem 3.10 in [2], \(H_k - \mathbb {E}_k(H_{k+1})\) goes to 0 a.s. as \(k \rightarrow \infty \). Hence, by definition of \(X_{k+1}\), we obtain \(\sup _{x \in \mathcal {D}} ( H_k - \mathbb {E}_k( H_{k,x} ) ) \rightarrow 0\) a.s. as \(k \rightarrow \infty \). This yields, from the law of total expectation,

$$\begin{aligned} 0 \longleftarrow _{k \rightarrow \infty }^{a.s.}&\sup _{x \in \mathcal {D}} \mathbb {E}_k \left( \max _{ \begin{array}{c} P_{k,x}(u) = 1 \\ \sigma _{k,x}(u) = 0 \end{array}} Y(u) - \max _{ \begin{array}{c} P_{k}(u) = 1 \\ \sigma _{k}(u) = 0 \end{array}} Y(u) \right) \\ \ge&\sup _{x \in \mathcal {D}} \mathbb {E}_k \left[ \mathbf {1}_{Z(x) > 0} \left( Y(x) - M_k \right) ^+ \right] \nonumber \\ \ge&\sup _{x \in \mathcal {D}} P_k(x) \gamma ( m_k(x) - M_k , \sigma _k(x) ), \nonumber \end{aligned}$$
(19)

from Lemma 3 and (12), where

$$\begin{aligned} \gamma ( a , b )&= a \Phi \left( \frac{a}{ b } \right) + b \phi \left( \frac{a}{ b } \right) . \end{aligned}$$

Recall from Section 3 in [34] that \(\gamma \) is continuous and satisfies \(\gamma (a, b) > 0\) if \(b > 0\) and \(\gamma (a, b) \ge a\) if \(a > 0\). We have, for \(k \in \mathbb {N}\), \(0 \le \sigma _k(u) \le \max _{v \in \mathcal {D}} \sqrt{\mathrm {var}(Y(v))} < \infty \). Also, with the same proof as that of Proposition 2.9 in [2], we can show that the sequence of random functions \((m_k)_{k \in \mathbb {N}}\) converges a.s. uniformly on \(\mathcal {D}\) to a continuous random function \(m_{\infty }\) on \(\mathcal {D}\). Thus, from (19), by compactness, we have, a.s. as \(k \rightarrow \infty \), \(\sup _{x \in \mathcal {D}} P_k(x) ( m_k(x) - M_k )^+ \rightarrow 0\) and \(\sup _{x \in \mathcal {D}} P_k(x) \sigma _k(x) \rightarrow 0\). Hence Part 1 is proved.
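To make the role of \(\gamma \) concrete, here is a minimal numerical sketch (ours, not the authors' code): \(\gamma \) is the classical closed-form expected improvement \(\mathbb {E}[(a + b\,\xi )^+]\) for \(\xi \sim \mathcal {N}(0,1)\), and the degenerate case \(b = 0\), where \(\gamma (a,0) = \max (a,0)\), is handled explicitly.

```python
import numpy as np
from scipy.stats import norm

def gamma_ei(a, b):
    """gamma(a, b) = a * Phi(a/b) + b * phi(a/b) for b > 0, i.e. the
    closed form of E[(a + b*xi)^+] with xi ~ N(0, 1); for b = 0 it
    degenerates to max(a, 0)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    safe_b = np.where(b > 0, b, 1.0)  # avoid division by zero when b = 0
    r = a / safe_b
    return np.where(b > 0, a * norm.cdf(r) + b * norm.pdf(r), np.maximum(a, 0.0))

# quick Monte Carlo sanity check of E[(a + b*xi)^+] with a = -0.3, b = 1.2
xi = np.random.default_rng(0).standard_normal(10**6)
print(gamma_ei(-0.3, 1.2), np.mean(np.maximum(-0.3 + 1.2 * xi, 0.0)))
```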

Let us address Part 2. For all \(\tau \in \mathbb {N}\), consider fixed \(v_1,\ldots ,v_{N_\tau } \in \mathcal {D}\) for which \(\max _{u \in \mathcal {D}} \min _{i=1,\ldots ,N_\tau } || u - v_i || \le 1/\tau \). Consider the event \(E_\tau = \{ \exists u \in \mathcal {D} ; \inf _{i \in \mathbb {N}} || X_i - u || \ge 2/\tau \}\). Then, \(E_\tau \) implies the event \(E_{v,\tau } = \cup _{ j=1 }^{N_\tau } E_{v,\tau ,j} \) where \( E_{v,\tau ,j} = \{ \inf _{i \in \mathbb {N}} || X_i - v_j || \ge 1/\tau \}\). Let us now show that \(\mathbb {P}( E_{v,\tau ,j} ) = 0\) for \( j = 1,\ldots , N_\tau \). Assume that \(E_{v,\tau ,j} \cap \mathcal {C}\) holds, where \(\mathcal {C}\) is the event in Part 1 of the theorem, with \(\mathbb {P} ( \mathcal {C} ) = 1\). Since Y has the NEB property, we have \(\liminf _{k \rightarrow \infty } \sigma _k(v_j) >0\). Hence, \(P_k( v_j ) \rightarrow 0\) as \(k \rightarrow \infty \) since \(\mathcal {C}\) is assumed. We then have

$$\begin{aligned} \mathrm {var}( \mathbf {1}_{ Z(v_j)> 0 } | \mathbf {1}_{ Z(X_1)> 0 }, \ldots , \mathbf {1}_{ Z(X_k) > 0 } ) = P_k( v_j ) (1 - P_k( v_j )) \rightarrow 0 \end{aligned}$$
(20)

a.s. as \(k \rightarrow \infty \). But we have

$$\begin{aligned} \mathrm {var}&( \mathbf {1}_{ Z(v_j)> 0 } | \mathbf {1}_{ Z(X_1)> 0 }, \ldots , \mathbf {1}_{ Z(X_k)> 0 } ) \\&\quad = \mathbb {E} \left[ \left. \left( \mathbf {1}_{ Z(v_j)> 0 } - P_k( v_j ) \right) ^2 \right| \mathbf {1}_{ Z(X_1)> 0 }, \ldots , \mathbf {1}_{ Z(X_k)> 0 } \right] \\&\quad = \mathbb {E} \left[ \left. \mathbb {E} \left[ \left. \left( \mathbf {1}_{ Z(v_j)> 0 } - P_k( v_j ) \right) ^2 \right| Z(x_1),\ldots ,Z(x_k) \right] \right| \mathbf {1}_{ Z(X_1)> 0 }, \ldots , \mathbf {1}_{ Z(X_k) > 0 } \right] . \end{aligned}$$

Since \(P_k(v_j)\) is a function of \(Z(x_1), \ldots , Z(x_k)\), we obtain

$$\begin{aligned} \mathrm {var}&( \mathbf {1}_{ Z(v_j)> 0 } | \mathbf {1}_{ Z(X_1)> 0 }, \ldots , \mathbf {1}_{ Z(X_k)> 0 } ) \\&\quad \ge \mathbb {E} \left[ \left. \mathrm {var} \left( \mathbf {1}_{ Z(v_j)> 0 } | Z(x_1), \ldots , Z(x_k) \right) \right| \mathbf {1}_{ Z(X_1)> 0 }, \ldots , \mathbf {1}_{ Z(X_k)> 0 } \right] \\&\quad = \mathbb {E} \left[ \left. g \left( \bar{\varPhi } \left( \frac{- m_k(v_j)}{ \sigma _k(v_j) } \right) \right) \right| \mathbf {1}_{ Z(X_1)> 0 }, \ldots , \mathbf {1}_{ Z(X_k) > 0 } \right] , \end{aligned}$$

with \(g(t) = t(1-t)\) and with \(\bar{\varPhi }\) as in Lemma 1. We let \(S = \sup _{k \in \mathbb {N}} |m_k(v_j)|\) and \(s = \inf _{k \in \mathbb {N}} \sigma _k(v_j)\). Then, from the uniform convergence of \(m_k\) discussed above and from the NEB property of Z, we have \(\mathbb {P}(E_{S,s}) = 1\) where \(E_{S,s} = \{ S < + \infty , s> 0 \}\). Then, if \(E_{v,\tau ,j} \cap \mathcal {C} \cap E_{S,s}\) holds, we have

$$\begin{aligned} \mathrm {var}&( \mathbf {1}_{ Z(v_j)> 0 } | \mathbf {1}_{ Z(X_1)> 0 }, \ldots , \mathbf {1}_{ Z(X_k)> 0 } ) \\&\quad \ge \mathbb {E} \left[ \left. g \left( \bar{\varPhi } \left( \frac{S}{s } \right) \right) \right| \mathbf {1}_{ Z(X_1)> 0 }, \ldots , \mathbf {1}_{ Z(X_k) > 0 } \right] \\&\quad \rightarrow _{k \rightarrow \infty }^{a.s.} \mathbb {E} \left[ \left. g \left( \bar{\varPhi } \left( \frac{S}{s } \right) \right) \right| \mathcal {F}_{Z,\infty } \right] , \end{aligned}$$

where \(\mathcal {F}_{Z,\infty } = \sigma ( \left\{ \mathbf {1}_{Z(X_i) > 0} \right\} _{i \in \mathbb {N}} )\) from Theorem 6.23 in [14]. Conditionally on \(\mathcal {F}_{Z,\infty }\), we have a.s. \(S < \infty \) and \(s >0\). Hence we obtain that, on the event \(E_{v,\tau ,j} \cap \mathcal {A}\) with \(\mathbb {P}( \mathcal {A} ) = 1\), \(\mathrm {var} ( \mathbf {1}_{ Z(v_j)> 0 } | \mathbf {1}_{ Z(X_1)> 0 }, \ldots , \mathbf {1}_{ Z(X_k) > 0 } ) \) does not go to zero. Hence, from (20), we have \(\mathbb {P}(E_{v,\tau ,j}) = 0\). This yields that \((X_i)_{i \in \mathbb {N}}\) is a.s. dense in \(\mathcal {D}\). Hence, since \(\{u ; Z(u) >0\}\) is an open set, we have \(\max _{i \le k; Z(X_i)> 0} Y(X_i) \rightarrow \max _{Z(u) >0} Y(u)\) a.s. as \(k \rightarrow \infty \). \(\square \)

B Stochastic approximation of the likelihood gradient for Gaussian process based classification

In Appendices B and C, for two matrices A and B of sizes \(a \times d\) and \(b \times d\), and for a function \(h: \mathbb {R}^d \times \mathbb {R}^d \rightarrow \mathbb {R}\), let h(A, B) be the \(a \times b\) matrix \([h(a_i,b_j)]_{i=1,\ldots ,a,j=1,\ldots ,b}\), where \(a_i\) and \(b_j\) are the rows i and j of A and B.

Let \(s_n = (i_1,\ldots ,i_n) \in \{0,1\}^n\) be fixed. Assume that the likelihood \(\mathbb {P}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n )\) has been estimated by \(\hat{\mathbb {P}}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n )\). Assume also that realizations \(z_n^{(1)},\ldots ,z_n^{(N)}\), approximately following the conditional distribution of \(Z_n\) given \(\mathrm {sign}(Z_n) = s_n\), are available.

Let \(\mathcal {Z} = \{ z_n \in \mathbb {R}^n: \mathrm {sign}(z_n) = s_n \}\). Treating \(x_1,\ldots ,x_n\) as d-dimensional row vectors, let \(\mathbf {x}\) be the matrix \( (x_1^\top ,\ldots ,x_n^\top )^\top \). Then we have

$$\begin{aligned} \mathbb {P}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n ) = \int _{\mathcal {Z}} \frac{1}{(2 \pi )^{n/2}} \frac{1}{\sqrt{|k_{\theta }^Z(\mathbf {x},\mathbf {x})|}} e^{ \frac{-1}{2} (z_n - \mu \mathrm {1}_n)^\top k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} (z_n - \mu \mathrm {1}_n) } dz_n, \end{aligned}$$

where \(\mathrm {1}_n = (1,\ldots ,1)^\top \in \mathbb {R}^n\) and |.| denotes the determinant.
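As an aside, this orthant probability can be estimated numerically. Below is a small sketch of ours (not the paper's implementation), using SciPy's multivariate normal CDF, which relies on Genz-type numerical integration [7]: the event \(\{\mathrm {sign}(Z_n) = s_n\}\) is rewritten as \(\{D Z_n \le 0\}\) for a diagonal sign matrix D.

```python
import numpy as np
from scipy.stats import multivariate_normal

def orthant_prob(mu, K, s_n):
    """Estimate P(sign(Z_n) = s_n) for Z_n ~ N(mu * 1_n, K).

    With d_j = -1 if s_n[j] = 1 (Z_j > 0) and d_j = +1 otherwise
    (Z_j <= 0), the event becomes {D Z_n <= 0}, and D Z_n is Gaussian
    with mean D mu 1_n and covariance D K D.
    """
    n = K.shape[0]
    d = np.where(np.asarray(s_n) == 1, -1.0, 1.0)
    mean = d * (mu * np.ones(n))
    cov = (d[:, None] * K) * d[None, :]
    return multivariate_normal(mean=mean, cov=cov, allow_singular=True).cdf(np.zeros(n))
```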

Differentiating with respect to \(\mu \) yields

$$\begin{aligned} \frac{\partial }{ \partial \mu } \mathbb {P}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n ) =&\int _{\mathcal {Z}} \frac{1}{(2 \pi )^{n/2}} \frac{1}{\sqrt{|k_{\theta }^Z(\mathbf {x},\mathbf {x})|}} e^{ \frac{-1}{2} (z_n - \mu \mathrm {1}_n)^\top k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} (z_n - \mu \mathrm {1}_n) } \\&( \mathrm {1}_n^\top k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} (z_n - \mu \mathrm {1}_n)) dz_n \\ =&\mathbb {P}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n ) \mathbb {E}_{\mu ,\theta } \left( \left. \mathrm {1}_n^\top k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} (Z_n - \mu \mathrm {1}_n) \right| \mathrm {sign}(Z_n) = s_n \right) , \end{aligned}$$

where \(\mathbb {E}_{\mu ,\theta }\) means that the conditional expectation is calculated under the assumption that Z has constant mean function \(\mu \) and covariance function \(k_{\theta }^Z\). Hence we have the stochastic approximation \(\hat{\nabla }_{\mu }\) for \(\partial /\partial \mu \mathbb {P}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n )\) given by

$$\begin{aligned} \hat{\nabla }_{\mu } = \hat{\mathbb {P}}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n ) \frac{1}{N} \sum _{i=1}^N \mathrm {1}_n^\top k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} (z^{(i)}_n - \mu \mathrm {1}_n). \end{aligned}$$
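A minimal sketch of this estimator follows (ours, with hypothetical variable names). It assumes the likelihood estimate `p_hat` and the conditional draws `z_samples` of shape \(N \times n\), obtained e.g. by minimax tilting [4] or exact Hamiltonian Monte Carlo [23], are already available, and `K` stands for \(k_{\theta }^Z(\mathbf {x},\mathbf {x})\).

```python
import numpy as np

def grad_mu_hat(p_hat, z_samples, K, mu):
    """Stochastic approximation of d/dmu P(sign(Z_n) = s_n).

    p_hat     : estimate of P(sign(Z_n) = s_n)
    z_samples : (N, n) draws approximately from Z_n | sign(Z_n) = s_n
    K         : (n, n) covariance matrix k_theta^Z(x, x)
    mu        : scalar mean of Z
    """
    n = K.shape[0]
    w = np.linalg.solve(K, np.ones(n))   # K^{-1} 1_n
    # 1_n^T K^{-1} (z^(i) - mu 1_n), averaged over the N samples
    scores = (z_samples - mu) @ w
    return p_hat * scores.mean()
```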

Differentiating with respect to \(\theta _i\) for \(i=1,\ldots ,p\) yields, with \(\mathrm {adj}(M) \) the adjugate of a matrix M,

$$\begin{aligned}&\frac{\partial }{ \partial \theta _i } \mathbb {P}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n )\\&\quad =\int _{\mathcal {Z}} \left( \frac{-1}{2} |k_{\theta }^Z(\mathbf {x},\mathbf {x})|^{-1} \mathrm {Tr} \left( \mathrm {adj}(k_{\theta }^Z(\mathbf {x},\mathbf {x})) \frac{\partial k_{\theta }^Z(\mathbf {x},\mathbf {x})}{ \partial \theta _i } \right) \right. \\&\qquad \left. + \frac{1}{2} (z_n - \mu \mathrm {1}_n)^\top k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} \frac{\partial k_{\theta }^Z(\mathbf {x},\mathbf {x})}{ \partial \theta _i } k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} (z_n - \mu \mathrm {1}_n) \right) \\&\qquad \times \frac{1}{(2 \pi )^{n/2}} \frac{1}{\sqrt{|k_{\theta }^Z(\mathbf {x},\mathbf {x})|}} e^{ \frac{-1}{2} (z_n - \mu \mathrm {1}_n)^\top k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} (z_n - \mu \mathrm {1}_n) } dz_n \\&\quad = \mathbb {P}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n ) \\&\qquad \mathbb {E}_{\mu ,\theta } \left( \frac{-1}{2} |k_{\theta }^Z(\mathbf {x},\mathbf {x})|^{-1} \mathrm {Tr} \left( \mathrm {adj}(k_{\theta }^Z(\mathbf {x},\mathbf {x})) \frac{\partial k_{\theta }^Z(\mathbf {x},\mathbf {x})}{ \partial \theta _i } \right) \right. \\&\qquad \left. \left. + \frac{1}{2} (Z_n - \mu \mathrm {1}_n)^\top k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} \frac{\partial k_{\theta }^Z(\mathbf {x},\mathbf {x})}{ \partial \theta _i } k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} (Z_n - \mu \mathrm {1}_n) \right| \mathrm {sign}(Z_n) = s_n \right) . \end{aligned}$$

Hence we have the stochastic approximation \(\hat{\nabla }_{\theta _i}\) for \(\partial /\partial \theta _i \mathbb {P}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n )\) given by

$$\begin{aligned} \hat{\nabla }_{\theta _i}&= \hat{\mathbb {P}}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n ) \frac{1}{N} \sum _{j=1}^N \left( \frac{-1}{2} |k_{\theta }^Z(\mathbf {x},\mathbf {x})|^{-1} \mathrm {Tr} \left( \mathrm {adj}(k_{\theta }^Z(\mathbf {x},\mathbf {x})) \frac{\partial k_{\theta }^Z(\mathbf {x},\mathbf {x})}{ \partial \theta _i } \right) \right. \\&\quad \left. + \frac{1}{2} (z^{(j)}_n - \mu \mathrm {1}_n)^\top k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} \frac{\partial k_{\theta }^Z(\mathbf {x},\mathbf {x})}{ \partial \theta _i } k_{\theta }^Z(\mathbf {x},\mathbf {x})^{-1} (z^{(j)}_n - \mu \mathrm {1}_n) \right) . \end{aligned}$$
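The companion sketch for \(\hat{\nabla }_{\theta _i}\), under the same assumptions as above, with `dK` standing for \(\partial k_{\theta }^Z(\mathbf {x},\mathbf {x}) / \partial \theta _i\); it uses the identity \(|K|^{-1} \mathrm {Tr}( \mathrm {adj}(K)\, \partial K / \partial \theta _i ) = \mathrm {Tr}( K^{-1}\, \partial K / \partial \theta _i )\) to avoid forming the adjugate.

```python
import numpy as np

def grad_theta_hat(p_hat, z_samples, K, dK, mu):
    """Stochastic approximation of d/dtheta_i P(sign(Z_n) = s_n)."""
    K_inv = np.linalg.inv(K)
    # -1/2 |K|^{-1} Tr(adj(K) dK) = -1/2 Tr(K^{-1} dK)
    trace_term = -0.5 * np.trace(K_inv @ dK)
    centred = z_samples - mu                # (N, n)
    A = K_inv @ dK @ K_inv                  # K^{-1} dK K^{-1}
    # per-sample quadratic form (z - mu 1)^T A (z - mu 1)
    quad = 0.5 * np.einsum("ij,jk,ik->i", centred, A, centred)
    return p_hat * float(np.mean(trace_term + quad))
```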

Remark 2

Several implementations of algorithms are available to obtain the realizations \(z_n^{(1)},\ldots ,z_n^{(N)}\), as discussed after Algorithm 1. It may also be the case that some implementations provide both the estimate \(\hat{\mathbb {P}}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n )\) and the realizations \(z_n^{(1)},\ldots ,z_n^{(N)}\).

C Expressions of the mean and covariance of the conditional Gaussian process and of the gradient of the acquisition function

Let \(\mu ^Y\) and \(k^Y\) be the mean and covariance functions of Y. Treating \(x_1,\ldots ,x_n\) as d-dimensional row vectors, let \(\mathbf {x}_q\) be the matrix extracted from \( (x_1^\top ,\ldots ,x_n^\top )^\top \) by keeping only the rows whose indices j satisfy \(i_j = 1\).

We first recall the classical expressions of GP conditioning:

$$\begin{aligned} m_q^Y(x,Y_q)= & {} \mu ^Y + k^Y(x,\mathbf {x}_q) \left( k^Y(\mathbf {x}_q,\mathbf {x}_q) \right) ^{-1} \left( Y_q - \mu ^Y \right) \\ k_q^Y(x, x')= & {} k^Y(x,x') - k^Y(x,\mathbf {x}_q) \left( k^Y(\mathbf {x}_q,\mathbf {x}_q) \right) ^{-1} k^Y(\mathbf {x}_q,x'). \end{aligned}$$

\(\nabla _x m^Y_q(x,Y_q)\) and \(\nabla _x k_q^Y(x,x)\) are straightforward provided that \(\nabla _{x} k^Y({x},{y})\) is available:

$$\begin{aligned} \nabla _x m^Y_q(x,Y_q)= & {} [ \nabla _x k^Y(x,\mathbf {x}_q) ] \left( k^Y(\mathbf {x}_q,\mathbf {x}_q) \right) ^{-1} \left( Y_q - \mu ^Y \right) \\ \nabla _x k^Y_q(x, x)= & {} \nabla _x k^Y(x, x) -2 k^Y(x,\mathbf {x}_q)\left( k^Y(\mathbf {x}_q,\mathbf {x}_q) \right) ^{-1} \nabla _x k^Y(\mathbf {x}_q,x). \end{aligned}$$

Then:

$$\begin{aligned} \nabla _x EI_q(x) = \Phi \left( \frac{m^Y_q(x,Y_q) - M_q}{\sqrt{k^Y_q(x,x)}} \right) \nabla _x m^Y_q(x,Y_q) + \phi \left( \frac{M_q - m^Y_q(x,Y_q)}{\sqrt{k^Y_q(x,x)}} \right) \frac{1}{2 \sqrt{k^Y_q(x,x)}}\nabla _x k^Y_q(x, x). \end{aligned}$$
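As an illustration, here is a self-contained sketch of \(EI_q\) and \(\nabla _x EI_q\) built from these formulas. The helpers `k(A, B)` (pairwise kernel matrix, in the convention of Appendix B) and `dk(x, A)` (gradient of \(k^Y(x,\cdot )\) in its first argument) are our assumptions, and we assume a stationary kernel so that \(\nabla _x k^Y(x,x) = 0\).

```python
import numpy as np
from scipy.stats import norm

def ei_and_grad(x, Xq, Yq, mu, k, dk, M_q):
    """Expected improvement EI_q(x) and its gradient for a GP with
    constant mean mu, kernel k and kernel gradient dk.
    Xq: (q, d) successful inputs, Yq: (q,) outputs, M_q: current max."""
    Kq = k(Xq, Xq)                                 # (q, q)
    kx = k(x[None, :], Xq).ravel()                 # (q,)
    alpha = np.linalg.solve(Kq, Yq - mu)
    m = mu + kx @ alpha                            # m_q^Y(x, Y_q)
    beta = np.linalg.solve(Kq, kx)
    s2 = k(x[None, :], x[None, :])[0, 0] - kx @ beta   # k_q^Y(x, x)
    s = np.sqrt(max(s2, 1e-12))                    # numerical floor
    Dkx = dk(x, Xq)                                # (q, d): grad_x k(x, x_j)
    grad_m = Dkx.T @ alpha
    grad_s2 = -2.0 * Dkx.T @ beta                  # stationary: grad_x k(x,x) = 0
    u = (m - M_q) / s
    ei = (m - M_q) * norm.cdf(u) + s * norm.pdf(u)
    grad_ei = norm.cdf(u) * grad_m + norm.pdf(u) * grad_s2 / (2.0 * s)
    return ei, grad_ei
```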

For \(\mathrm {P_{nf}}(x)\), using the approximation of Algorithm 1, we have:

$$\begin{aligned} \widehat{\mathrm {P_{nf}}}(x) = \frac{1}{N} \sum _{i=1}^N \bar{\varPhi } \left( \frac{-m_n^Z(x, z_n^{(i)})}{ \sqrt{ k_n^Z(x,x) } } \right) , \end{aligned}$$

with \(k_n^Z(x,x)\) defined analogously to \(k_q^Y(x,x)\), using the kernel \(k^Z\) and all n points \(\mathbf {x}\), and

$$\begin{aligned} m_n^Z(x, z_n^{(i)}) = \mu ^Z + k^Z(x,\mathbf {x}) \left( k^Z(\mathbf {x},\mathbf {x}) \right) ^{-1} \left( z_n^{(i)} - \mu ^Z \right) . \end{aligned}$$

Applying the standard differentiation rules yields:

$$\begin{aligned} \nabla _x \widehat{\mathrm {P_{nf}}}(x)= & {} \frac{1}{N}\sum _{i=1}^N \phi \left( \frac{m_n^Z(x,z_n^{(i)})}{\sqrt{k^Z_n(x,x)}}\right) \left[ \frac{1}{\sqrt{k^Z_n(x,x)}} \nabla _x m_n^Z(x,z_n^{(i)}) - \frac{m_n^Z(x,z_n^{(i)})}{2[k^Z_n(x,x)]^{3/2}} \nabla _x k_n^Z(x,x) \right] . \end{aligned}$$

The gradient of the acquisition function can then be obtained using the product rule.
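Putting the pieces together, here is a sketch (ours) of \(\widehat{\mathrm {P_{nf}}}\), its gradient, and the product-rule assembly of the acquisition gradient; `m_nZ`, `grad_m_nZ`, `k_nZ` and `grad_k_nZ` are hypothetical routines implementing the conditional-moment formulas above, and `z_samples` are the conditional draws of Algorithm 1.

```python
import numpy as np
from scipy.stats import norm

def pnf_hat_and_grad(x, z_samples, m_nZ, grad_m_nZ, k_nZ, grad_k_nZ):
    """Monte Carlo estimate of P_nf(x) and of its gradient."""
    s2 = k_nZ(x)                        # k_n^Z(x, x)
    s = np.sqrt(s2)
    gs2 = np.asarray(grad_k_nZ(x))      # (d,)
    pnf = 0.0
    grad = np.zeros_like(gs2)
    for z in z_samples:                 # the N conditional draws
        m = m_nZ(x, z)
        u = m / s                       # bar_Phi(-m/s) = Phi(m/s)
        pnf += norm.cdf(u)
        grad += norm.pdf(u) * (np.asarray(grad_m_nZ(x, z)) / s
                               - m * gs2 / (2.0 * s ** 3))
    N = len(z_samples)
    return pnf / N, grad / N

# acquisition gradient by the product rule, with ei, grad_ei from
# ei_and_grad above:
#   grad[EI_q(x) * P_nf(x)] = grad_ei * pnf + ei * grad_pnf
```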

About this article

Cite this article

Bachoc, F., Helbert, C. & Picheny, V. Gaussian process optimization with failures: classification and convergence proof. J Glob Optim 78, 483–506 (2020). https://doi.org/10.1007/s10898-020-00920-0

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10898-020-00920-0

Keywords

Mathematics Subject Classification

Navigation