Abstract
We consider the optimization of a computer model where each simulation either fails or returns a valid output performance. We first propose a new joint Gaussian process model for classification of the inputs (computation failure or success) and for regression of the performance function. We provide results that allow for a computationally efficient maximum likelihood estimation of the covariance parameters, with a stochastic approximation of the likelihood gradient. We then extend the classical improvement criterion to our setting of joint classification and regression. We provide an efficient computation procedure for the extended criterion and its gradient. We prove the almost sure convergence of the global optimization algorithm following from this extended criterion. We also study the practical performance of this algorithm, both on simulated data and on a real computer model in the context of automotive fan design.
Notes
A constant mean function could be incorporated and estimated with no additional complexity.
References
Azzimonti, D., Ginsbourger, D.: Estimating orthant probabilities of high-dimensional Gaussian vectors with an application to set estimation. J. Comput. Graph. Stat. 27(2), 255–267 (2018)
Bect, J., Bachoc, F., Ginsbourger, D.: A supermartingale approach to Gaussian process based sequential design of experiments. Bernoulli 25(4A), 2883–2919 (2019)
Benassi, R., Bect, J., Vazquez, E.: Robust Gaussian process-based global optimization using a fully Bayesian expected improvement criterion. In: International Conference on Learning and Intelligent Optimization, pp. 176–190. Springer (2011)
Botev, Z.I.: The normal law under linear restrictions: simulation and estimation via minimax tilting. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 79(1), 125–148 (2017)
Bull, A.D.: Convergence rates of efficient global optimization algorithms. J. Mach. Learn. Res. 12, 2879–2904 (2011)
Gelbart, M.A., Snoek, J., Adams, R.P.: Bayesian optimization with unknown constraints. In: UAI (2014)
Genz, A.: Numerical computation of multivariate normal probabilities. J. Comput. Graph. Stat. 1(2), 141–149 (1992)
Ginsbourger, D., Le Riche, R., Carraro, L.: Kriging is well-suited to parallelize optimization. In: Computational Intelligence in Expensive Optimization Problems, pp. 131–162. Springer (2010)
Ginsbourger, D., Roustant, O., Durrande, N.: On degeneracy and invariances of random fields paths with applications in Gaussian process modelling. J. Stat. Plan. Inference 170, 117–128 (2016)
Gramacy, R., Lee, H.: Optimization under unknown constraints. Bayesian Stat. 9, 229 (2011)
Gramacy, R.B., Gray, G.A., Le Digabel, S., Lee, H.K., Ranjan, P., Wells, G., Wild, S.M.: Modeling an augmented Lagrangian for blackbox constrained optimization. Technometrics 58(1), 1–11 (2016)
Hernandez-Lobato, J.M., Gelbart, M., Hoffman, M., Adams, R., Ghahramani, Z.: Predictive entropy search for Bayesian optimization with unknown constraints. In: International Conference on Machine Learning, pp. 1699–1707 (2015)
Jones, D., Schonlau, M., Welch, W.: Efficient global optimization of expensive black-box functions. J. Glob. Optim. 13, 455–492 (1998)
Kallenberg, O.: Foundations of Modern Probability, 2nd edn. Springer, Berlin (2002)
Kandasamy, K., Neiswanger, W., Schneider, J., Poczos, B., Xing, E.P.: Neural architecture search with Bayesian optimisation and optimal transport. In: Advances in Neural Information Processing Systems, pp. 2016–2025 (2018)
Keane, A., Nair, P.: Computational Approaches for Aerospace Design: The Pursuit of Excellence. Wiley, Hoboken (2005)
Lindberg, D.V., Lee, H.K.: Optimization under constraints by applying an asymmetric entropy measure. J. Comput. Graph. Stat. 24(2), 379–393 (2015)
López-Lopera, A.F., Bachoc, F., Durrande, N., Roustant, O.: Finite-dimensional Gaussian approximation with linear inequality constraints. SIAM/ASA J. Uncertain. Quantif. 6(3), 1224–1255 (2018)
Maatouk, H., Bay, X.: A New Rejection Sampling Method for Truncated Multivariate Gaussian Random Variables Restricted to Convex Sets, pp. 521–530. Springer, Cham (2016)
Meyn, S.P., Tweedie, R.L.: Markov Chains and Stochastic Stability. Springer, Berlin (2012)
Mockus, J.B., Tiesis, V., Žilinskas, A.: The application of Bayesian methods for seeking the extremum. In: Dixon, L.C.W., Szegö, G.P. (eds.) Towards Global Optimization, vol. 2, pp. 117–129. North Holland, New York (1978)
Nickisch, H., Rasmussen, C.E.: Approximations for binary Gaussian process classification. J. Mach. Learn. Res. 9, 2035–2078 (2008)
Pakman, A., Paninski, L.: Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. J. Comput. Graph. Stat. 23(2), 518–542 (2014)
Picheny, V.: A stepwise uncertainty reduction approach to constrained global optimization. In: Artificial Intelligence and Statistics, pp. 787–795 (2014)
Picheny, V., Gramacy, R.B., Wild, S., Le Digabel, S.: Bayesian optimization under mixed constraints with a slack-variable augmented Lagrangian. In: Advances in Neural Information Processing Systems, pp. 1435–1443 (2016)
Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning. The MIT Press, Cambridge (2006)
Roustant, O., Ginsbourger, D., Deville, Y.: DiceKriging, DiceOptim: two R packages for the analysis of computer experiments by Kriging-based metamodeling and optimization. J. Stat. Softw. 51(1), 1–55 (2012)
Sacher, M., Duvigneau, R., Le Maitre, O., Durand, M., Berrini, E., Hauville, F., Astolfi, J.-A.: A classification approach to efficient global optimization in presence of non-computable domains. Struct. Multidiscip. Optim. 58(4), 1537–1557 (2018)
Sasena, M.J., Papalambros, P., Goovaerts, P.: Exploration of metamodeling sampling criteria for constrained global optimization. Eng. Optim. 34(3), 263–278 (2002)
Schonlau, M., Welch, W.J., Jones, D.R.: Global versus local search in constrained optimization of computer models. In: Lecture Notes-Monograph Series, pp. 11–25 (1998)
Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems, pp. 2951–2959 (2012)
Srinivas, N., Krause, A., Kakade, S., Seeger, M.: Gaussian process optimization in the bandit setting: no regret and experimental design. In: Proceedings of the 27th International Conference on Machine Learning, pp. 1015–1022 (2010)
Taylor, J., Benjamini, Y.: RestrictedMVN: multivariate normal restricted by affine constraints. https://cran.r-project.org/web/packages/restrictedMVN/index.html (2017). Accessed 2 Feb 2017
Vazquez, E., Bect, J.: Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. J. Stat. Plan. Inference 140(11), 3088–3095 (2010)
Vazquez, E., Bect, J.: Pointwise consistency of the kriging predictor with known mean and covariance functions. In: mODa 9 – Advances in Model-Oriented Design and Analysis, pp. 221–228. Springer (2010)
Wu, J., Frazier, P.: The parallel knowledge gradient method for batch Bayesian optimization. In: Advances in Neural Information Processing Systems, pp. 3126–3134 (2016)
Zhigljavsky, A., Žilinskas, A.: Selection of a covariance function for a Gaussian random field aimed for modeling global optimization problems. Optim. Lett. 13(2), 249–259 (2019)
Žilinskas, A., Calvin, J.: Bi-objective decision making in global optimization based on statistical models. J. Glob. Optim. 74(4), 599–609 (2019)
Acknowledgements
This work was partly funded by the ANR project PEPITO. We are grateful to Jean-Marc Azaïs and David Ginsbourger for discussing the topics in this paper.
Electronic supplementary material
Supplementary material
The supplementary material contains additional figures and tables for Sections 5 and 6. (pdf 1,904 KB)
Appendices
A Proofs
Proof
(Proof of Lemma 1)
With \(\phi ^{Z_n}\) the p.d.f. of \(Z_n\) and with \(s_n = (i_1,\ldots ,i_n)^\top \), we have
Equation (16) is obtained by observing that
by definition of \(\phi ^{Z_n}_{s_n}(z_n)\), and that
by Gaussian conditioning. \(\square \)
Proof
(Proof of Lemma 2) For any measurable function f, by the law of total expectation and using the independence of Y, \((W_1,\ldots ,W_n)\) and Z, we have
This concludes the proof by definition of a p.d.f. \(\square \)
Proof
(Proof of Lemma 3)
Consider measurable functions f(Y), g(Z), \(h( I_1, \ldots , I_n )\) and \(\psi ( I_1 Y(x_1), \ldots , I_n Y(x_n) )\). We have, by independence of Y and Z,
The last display can be written as, with \(\mathcal {L}_{n}\) the distribution of
where \(Y_q\) is as defined in the statement of the lemma. This concludes the proof. \(\square \)
We now address the proof of Theorem 1. We let \((X_i)_{i \in \mathbb {N}}\) be the random observation points, such that \(X_i\) is obtained from (13) and (14) for \(i \in \mathbb {N}\). The next lemma shows that conditioning on the random observation points and observed values works “as if” the observation points \(X_1,\ldots ,X_n\) were non-random.
Lemma 4
For any \(x_1,\ldots ,x_k \in \mathcal {D}\), \(i_1,\ldots ,i_k \in \{ 0,1\}^k\) and \(i_1 y_1,\ldots ,i_k y_k \in \mathbb {R}^k\), the conditional distribution of (Y, Z) given
is the same as the conditional distribution of (Y, Z) given
Proof
This lemma can be shown similarly as Proposition 2.6 in [2]. \(\square \)
Proof
(Proof of Theorem 1)
For \(k \in \mathbb {N}\), we remark that \(\mathcal {F}_k\) is the sigma-algebra generated by
We let \(\mathbb {E}_k\), \(\mathbb {P}_k\) and \(\mathrm {var}_k\) denote the expectation, probability and variance conditionally on \(\mathcal {F}_k\). For \(x \in \mathcal {D}\), we let \(\mathbb {E}_{k,x}\), \(\mathbb {P}_{k,x}\) and \(\mathrm {var}_{k,x}\) denote the expectation, probability and variance conditionally on
We let \(\sigma _k^2(u) = \mathrm {var}_k(Y(u))\), \(m_k(u) = \mathbb {E}_k[Y(u)]\) and \(P_k(u) = \mathbb {P}_k( Z(u) > 0 )\). We also let \(\sigma _{k,x}^2(u) = \mathrm {var}_{k,x}(Y(u))\), \(m_{k,x}(u) = \mathbb {E}_{k,x}[Y(u)]\) and \(P_{k,x}(u) = \mathbb {P}_{k,x}( Z(u) > 0 )\).
With these notations, the observation points satisfy, for \(k \in \mathbb {N}\),
where
We first show that (17) can be defined as a stepwise uncertainty reduction (SUR) sequential design [2]. We have
by the law of total expectation and since the second term in (18) does not depend on x. We let
and
Then we have for \(k \ge 1\)
We have, using the law of total expectation, and since \(\mathbb {E}_{k,x} \left[ \max _{ \begin{array}{c} P_{k,x}(u) = 1, \sigma _{k,x}(u) = 0 \end{array}} Y(u) \right] = \max _{ \begin{array}{c} P_{k,x}(u) = 1, \sigma _{k,x}(u) = 0 \end{array}} Y(u)\),
since, for all \(u,x \in \mathcal {D}\), \(\sigma _{k,x}(u) \le \sigma _k(u) \) and \(P_k(u) = 1\) implies \(P_{k,x}(u) = 1\). Hence \((H_k)_{k \in \mathbb {N}}\) is a supermartingale, and clearly \(H_k \ge 0\) for all \(k \in \mathbb {N}\). Moreover, \(| H_1| \le 2 \mathbb {E}_1 \left[ \max _{u \in \mathcal {D}} |Y(u)| \right] \), so \(H_1\) is bounded in \(L^1\), since the expected maximum of a continuous Gaussian process on a compact set is finite. From Theorem 6.23 in [14], \(H_k\) thus converges a.s. as \(k \rightarrow \infty \) to a finite random variable. As in the proof of Theorem 3.10 in [2], it follows that \(H_k - \mathbb {E}_k(H_{k+1})\) goes to 0 a.s. as \(k \rightarrow \infty \). By definition of \(X_{k+1}\), we obtain \(\sup _{x \in \mathcal {D}} ( H_k - \mathbb {E}_k( H_{k,x} ) ) \rightarrow 0\) a.s. as \(k \rightarrow \infty \). This yields, from the law of total expectation,
Recall from Section 3 in [34] that \(\gamma \) is continuous and satisfies \(\gamma (a, b) > 0\) if \(b > 0\) and \(\gamma (a, b) \ge a\) if \(a > 0\). We have for \(k \in \mathbb {N}\), \(0 \le \sigma _k(u) \le \max _{v \in \mathcal {D}} \sqrt{\mathrm {var}(Y(v))} < \infty \). Also, with the same proof as that of Proposition 2.9 in [2], we can show that the sequence of random functions \((m_k)_{k \in \mathbb {N}}\) converges a.s. uniformly on \(\mathcal {D}\) to a continuous random function \(m_{\infty }\) on \(\mathcal {D}\). Thus, from (19), by compactness, we have, a.s. as \(k \rightarrow \infty \), \(\sup _{x \in \mathcal {D}} P_k(x) ( m_k(x) - M_k )^+ \rightarrow 0\) and \(\sup _{x \in \mathcal {D}} P_k(x) \sigma _k(x) \rightarrow 0\). Hence, Part 1 is proved.
Let us address Part 2. For all \(\tau \in \mathbb {N}\), consider fixed \(v_1,\ldots ,v_{N_\tau } \in \mathcal {D}\) for which \(\max _{u \in \mathcal {D}} \min _{i=1,\ldots ,N_\tau } || u - v_i || \le 1/\tau \). Consider the event \(E_\tau = \{ \exists u \in \mathcal {D} ; \inf _{i \in \mathbb {N}} || X_i - u || \ge 2/\tau \}\). Then, \(E_\tau \) implies the event \(E_{v,\tau } = \cup _{ j=1 }^{N_\tau } E_{v,\tau ,j} \) where \( E_{v,\tau ,j} = \{ \inf _{i \in \mathbb {N}} || X_i - v_j || \ge 1/\tau \}\). Let us now show that \(\mathbb {P}( E_{v,\tau ,j} ) = 0\) for \( j = 1,\ldots , N_\tau \). Assume that \(E_{v,\tau ,j} \cap \mathcal {C}\) holds, where \(\mathcal {C}\) is the event in Part 1. of the theorem, with \(\mathbb {P} ( \mathcal {C} ) = 1\). Since Y has the NEB property, we have \(\liminf _{k \rightarrow \infty } \sigma _k(v_j) >0\). Hence, \(P_k( v_j ) \rightarrow 0\) as \(k \rightarrow \infty \) since \(\mathcal {C}\) is assumed. We then have
a.s. as \(k \rightarrow \infty \). But we have
Since \(P_k(v_j)\) is a function of \(Z(x_1), \ldots , Z(x_n)\), we obtain
with \(g(t) = t(1-t)\) and with \(\bar{\varPhi }\) as in Lemma 1. We let \(S = \sup _{k \in \mathbb {N}} |m_k(v_j)|\) and \(s = \inf _{k \in \mathbb {N}} \sigma _k(v_j)\). Then, from the uniform convergence of \(m_k\) discussed above and from the NEB property of Z, we have \(\mathbb {P}(E_{S,s}) = 1\) where \(E_{S,s} = \{ S < + \infty , s> 0 \}\). Then, if \(E_{v,\tau ,j} \cap \mathcal {C} \cap E_{S,s}\) holds, we have
where \(\mathcal {F}_{Z,\infty } = \sigma ( \left\{ \mathbf {1}_{Z(X_i) > 0} \right\} _{i \in \mathbb {N}} )\) from Theorem 6.23 in [14]. Conditionally on \(\mathcal {F}_{Z,\infty }\), we have a.s. \(S < \infty \) and \(s >0\). Hence we obtain that, on the event \(E_{v,\tau ,j} \cap \mathcal {A}\) with \(\mathbb {P}( \mathcal {A} ) = 1\), \(\mathrm {var} ( \mathbf {1}_{ Z(v_j)> 0 } | \mathbf {1}_{ Z(X_1)> 0 }, \ldots , \mathbf {1}_{ Z(X_k) > 0 } ) \) does not go to zero. Hence, from (20), we have \(\mathbb {P}(E_{v,\tau ,j}) = 0\). This yields that \((X_i)_{i \in \mathbb {N}}\) is a.s. dense in \(\mathcal {D}\). Hence, since \(\{u ; Z(u) >0\}\) is an open set, we have \(\max _{i; Z(X_i)> 0} Y(X_i) \rightarrow \max _{Z(u) >0} Y(u)\) a.s. as \(n \rightarrow \infty \). \(\square \)
B Stochastic approximation of the likelihood gradient for Gaussian process based classification
In Appendices B and C, for two matrices A and B of sizes \(a \times d\) and \(b \times d\), and for a function \(h: \mathbb {R}^d \times \mathbb {R}^d \rightarrow \mathbb {R}\), let h(A, B) be the \(a \times b\) matrix \([h(a_i,b_j)]_{i=1,\ldots ,a,j=1,\ldots ,b}\), where \(a_i\) and \(b_j\) are the rows i and j of A and B.
Let \(s_n = (i_1,\ldots ,i_n) \in \{0,1\}^n\) be fixed. Assume that the likelihood \(\mathbb {P}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n )\) has been approximated by \(\hat{\mathbb {P}}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n )\). Assume also that realizations \(z_n^{(1)},\ldots ,z_n^{(N)}\), approximately following the conditional distribution of \(Z_n\) given \(\mathrm {sign}(Z_n) = s_n\), are available.
Let \(\mathcal {Z} = \{ z_n \in \mathbb {R}^n: \mathrm {sign}(z_n) = s_n \}\). Treating \(x_1,\ldots ,x_n\) as d-dimensional row vectors, let \(\mathbf {x}\) be the matrix \( (x_1^\top ,\ldots ,x_n^\top )^\top \). Then we have
where \(\mathrm {1}_n = (1,\ldots ,1)^\top \in \mathbb {R}^n\) and |.| denotes the determinant.
Differentiating with respect to \(\mu \) yields
where \(\mathbb {E}_{\mu ,\theta }\) means that the conditional expectation is calculated under the assumption that Z has constant mean function \(\mu \) and covariance function \(k_{\theta }^Z\). Hence we have the stochastic approximation \(\hat{\nabla }_{\mu }\) for \(\partial /\partial \mu \mathbb {P}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n )\) given by
Differentiating with respect to \(\theta _i\) for \(i=1,\ldots ,p\) yields, with \(\mathrm {adj}(M) \) the adjugate of a matrix M,
Hence we have the stochastic approximation \(\hat{\nabla }_{\theta _i}\) for \(\partial /\partial \theta _i \mathbb {P}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n )\) given by
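As a concrete illustration, the following is a minimal NumPy sketch of a Monte Carlo estimator of this kind. The function name is hypothetical, and the score-function form \(1_n^\top \varSigma ^{-1}(z - \mu 1_n)\) used here is our assumption, obtained by differentiating the Gaussian density under the integral as in the derivation above; it is a sketch, not the paper's implementation.

```python
import numpy as np

def grad_mu_estimate(P_hat, z_samples, mu, Sigma):
    """Monte Carlo approximation of d/dmu P_{mu,theta}(sign(Z_n) = s_n).

    Assumed identity (differentiation under the integral sign):
        d/dmu P = P * E[ 1_n^T Sigma^{-1} (Z_n - mu 1_n) | sign(Z_n) = s_n ],
    estimated by averaging the score over the available conditional draws.

    P_hat     : estimate of P(sign(Z_n) = s_n)
    z_samples : (N, n) draws, approximately from Z_n | sign(Z_n) = s_n
    mu        : scalar constant mean
    Sigma     : (n, n) covariance matrix k_theta^Z(x, x)
    """
    ones = np.ones(Sigma.shape[0])
    Sinv_ones = np.linalg.solve(Sigma, ones)   # Sigma^{-1} 1_n
    scores = (z_samples - mu) @ Sinv_ones      # one score per draw
    return P_hat * scores.mean()
```

For instance, with \(n = 1\), \(\mu = 0\), \(\varSigma = 1\) and \(s_n = (1)\), the exact probability is \(\varPhi (\mu ) = 1/2\) and its \(\mu \)-derivative is \(\phi (0) \approx 0.399\), which the estimator recovers from draws of \(Z \mid Z > 0\).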
Remark 2
Several implementations of algorithms are available to obtain the realizations \(z_n^{(1)},\ldots ,z_n^{(N)}\), as discussed after Algorithm 1. It may also be the case that some implementations provide both the estimate \(\hat{\mathbb {P}}_{\mu ,\theta }( \mathrm {sign}(Z_n) = s_n )\) and the realizations \(z_n^{(1)},\ldots ,z_n^{(N)}\).
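For illustration only, a minimal Gibbs sampler producing such realizations could look as follows. This is a sketch under our own naming conventions, not one of the implementations referenced above; dedicated samplers such as minimax tilting [10] or exact Hamiltonian Monte Carlo [29] are far more efficient and robust, in particular when some orthant probabilities are small (the per-coordinate rejection step below would then be slow).

```python
import numpy as np

def gibbs_truncated_mvn(mu, Sigma, s_n, n_iter, rng):
    """Gibbs sampler for Z_n conditionally on sign(Z_n) = s_n (sketch).

    Coordinate j is constrained to be > 0 when s_n[j] == 1 and <= 0
    otherwise. Each sweep resamples one coordinate from its univariate
    Gaussian conditional, truncated by simple rejection (clarity only).
    """
    n = len(mu)
    Sinv = np.linalg.inv(Sigma)                # precision matrix
    z = np.where(s_n == 1, 1.0, -1.0)          # feasible starting point
    samples = np.empty((n_iter, n))
    for t in range(n_iter):
        for j in range(n):
            others = np.arange(n) != j
            cond_var = 1.0 / Sinv[j, j]
            cond_mean = mu[j] - cond_var * (
                Sinv[j, others] @ (z[others] - mu[others]))
            while True:                        # rejection on the sign
                cand = cond_mean + np.sqrt(cond_var) * rng.standard_normal()
                if (cand > 0) == (s_n[j] == 1):
                    z[j] = cand
                    break
        samples[t] = z
    return samples
```

With an identity covariance the chain draws independent half-normals, so the per-coordinate sample mean approaches \(\sqrt{2/\pi } \approx 0.798\).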
C Expressions of the mean and covariance of the conditional Gaussian process and of the gradient of the acquisition function
Let \(\mu ^Y\) and \(k^Y\) be the mean and covariance functions of Y. Treating \(x_1,\ldots ,x_n\) as d-dimensional row vectors, let \(\mathbf {x}_q\) be the matrix extracted from \( (x_1^\top ,\ldots ,x_n^\top )^\top \) by keeping only the rows whose indices j satisfy \(i_j = 1\).
We first recall the classical expressions of GP conditioning:
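These classical conditioning formulas can be sketched in a few lines of NumPy. The function names and the Cholesky-based organization are our choices, not the paper's code; the squared-exponential kernel is included only so the sketch is self-contained, following the matrix convention h(A, B) stated at the start of this appendix.

```python
import numpy as np

def gp_condition(x, Xq, Yq, mu, k):
    """Conditional (kriging) mean and covariance of Y at points x, given
    Y(Xq) = Yq, for mean function mu and covariance function k, where
    k(A, B) returns the matrix [k(a_i, b_j)] (rows of A against rows of B).
    """
    K = k(Xq, Xq) + 1e-10 * np.eye(len(Xq))    # jitter for stability
    kx = k(x, Xq)                              # cross-covariances
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Yq - mu(Xq)))
    m_q = mu(x) + kx @ alpha                   # conditional mean
    v = np.linalg.solve(L, kx.T)
    k_q = k(x, x) - v.T @ v                    # conditional covariance
    return m_q, k_q

def sqexp(A, B):
    """Squared-exponential covariance, for illustration only."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)
```

As a sanity check, the conditional mean interpolates the observations and the conditional variance vanishes at the observed points, while remaining strictly between 0 and the prior variance elsewhere.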
The gradients \(\nabla _x m^Y_q(x,Y_q)\) and \(\nabla _x k_q^Y(x,x)\) follow directly, provided that \(\nabla _{x} k^Y({x},{y})\) is available:
Then:
For \(\mathrm {P_{nf}}(x)\), using the approximation of Algorithm 1, we have:
with \(k_n^Z(x,x)\) as \(k_n^Y(x,x)\) and
Applying standard differentiation rules yields:
The gradient of the acquisition function can then be obtained using the product rule.
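A minimal sketch of this product-rule computation, assuming the acquisition factorizes as \(\mathrm {P_{nf}}(x)\) times the classical expected improvement (all function names and the scalar interface below are our illustrative assumptions): with \(u = (m - \text {best})/\sigma \), the well-known EI gradient identities \(\partial \mathrm {EI}/\partial m = \varPhi (u)\) and \(\partial \mathrm {EI}/\partial \sigma = \phi (u)\) combine with the chain rule and the product rule.

```python
from math import erf, exp, pi, sqrt

def norm_pdf(t):
    return exp(-0.5 * t * t) / sqrt(2.0 * pi)

def norm_cdf(t):
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def ei(m, s, best):
    """Classical expected improvement (maximization) at one point."""
    if s <= 0.0:
        return max(m - best, 0.0)
    u = (m - best) / s
    return (m - best) * norm_cdf(u) + s * norm_pdf(u)

def acq_grad(P, dP, m, dm, s, ds, best):
    """Derivative of P_nf(x) * EI(x) by the product rule, using
    dEI = Phi(u) dm + phi(u) ds (scalar directional derivatives)."""
    u = (m - best) / s
    ei_val = (m - best) * norm_cdf(u) + s * norm_pdf(u)
    dei = norm_cdf(u) * dm + norm_pdf(u) * ds
    return dP * ei_val + P * dei
```

The EI derivative identities are easy to verify by finite differences, which also makes this a convenient unit test when implementing the full gradient.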
Bachoc, F., Helbert, C. & Picheny, V. Gaussian process optimization with failures: classification and convergence proof. J Glob Optim 78, 483–506 (2020). https://doi.org/10.1007/s10898-020-00920-0