
Zeroth-order methods for noisy Hölder-gradient functions


Abstract

In this paper, we prove new complexity bounds for zeroth-order methods in non-convex optimization with inexact observations of the objective function values. We use the Gaussian smoothing approach of Nesterov and Spokoiny (Found Comput Math 17(2):527–566, 2015. https://doi.org/10.1007/s10208-015-9296-2) and extend their results, obtained for zeroth-order methods for smooth non-convex problems, to the setting of minimization of functions with Hölder-continuous gradient under a noisy zeroth-order oracle, obtaining upper bounds on the admissible noise as well. We consider a finite-difference gradient approximation based on normally distributed random Gaussian vectors and prove that a gradient descent scheme based on this approximation converges to a stationary point of the smoothed function. We also consider convergence to a stationary point of the original (not smoothed) function and obtain bounds on the number of steps of the algorithm needed to make the norm of its gradient small. Additionally, we provide bounds on the level of noise in the zeroth-order oracle for which it is still possible to guarantee that the above bounds hold. Finally, we consider separately the case \(\nu = 1\) and show that in this case the dependence of the obtained bounds on the dimension can be improved.
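To make the scheme concrete, here is a minimal sketch of the noisy zeroth-order estimator and the resulting gradient descent step (not code from the paper: the test function, noise model, step size, and all names are illustrative assumptions, and the Euclidean case \(B = I\) is taken):

```python
import numpy as np

def zo_gradient(f_noisy, x, mu, rng):
    """One-sample finite-difference gradient estimate along a Gaussian direction u:
    g_mu(x, u, delta) = (f~(x + mu*u, delta) - f~(x, delta)) / mu * u  (case B = I)."""
    u = rng.standard_normal(x.shape)
    return ((f_noisy(x + mu * u) - f_noisy(x)) / mu) * u

def zo_gradient_descent(f_noisy, x0, mu, step, n_iters, seed=0):
    """Gradient descent on the Gaussian-smoothed function f_mu, driven only by
    noisy function values; the constant step size is a placeholder, not the
    theoretically prescribed one."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(n_iters):
        x -= step * zo_gradient(f_noisy, x, mu, rng)
    return x

# Hypothetical usage: a smooth test function with oracle noise bounded by delta.
delta = 1e-6
noise_rng = np.random.default_rng(1)
f_noisy = lambda z: np.sum(z ** 2) + delta * noise_rng.uniform(-1.0, 1.0)
x = zo_gradient_descent(f_noisy, x0=np.ones(10), mu=1e-2, step=1e-2, n_iters=5000)
print(np.linalg.norm(x))  # driven close to the minimizer at the origin
```

The analysis below quantifies how the smoothing parameter \(\mu \) and the oracle noise level \(\delta \) control the quality of this estimator.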


References

  1. Baydin, A.G., Pearlmutter, B.A., Radul, A.A., Siskind, J.M.: Automatic differentiation in machine learning: a survey (2018). arxiv:1502.05767

  2. Berahas, A.S., Cao, L., Choromanski, K., Scheinberg, K.: A theoretical and empirical comparison of gradient approximations in derivative-free optimization (2019). arxiv:1905.01332

  3. Berahas, A.S., Cao, L., Scheinberg, K.: Global convergence rate analysis of a generic line search algorithm with noise (2019). arxiv:1910.04055

  4. Bolte, J., Glaudin, L., Pauwels, E., Serrurier, M.: A Hölderian backtracking method for min-max and min-min problems (2020). arxiv:2007.08810

  5. Brent, R.: Algorithms for Minimization Without Derivatives. Dover Books on Mathematics, Dover Publications (1973)

  6. Conn, A.R., Scheinberg, K., Vicente, L.N.: Introduction to derivative-free optimization. Soc. Ind. Appl. Math. (2009). https://doi.org/10.1137/1.9780898718768

  7. Dvurechensky, P.: Gradient method with inexact oracle for composite non-convex optimization (2017). arxiv:1703.09180

  8. Fabian, V.: Stochastic approximation of minima with improved asymptotic speed. Ann. Math. Statist. 38(1), 191–200 (1967). https://doi.org/10.1214/aoms/1177699070

  9. Ghadimi, S., Lan, G.: Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013). https://doi.org/10.1137/120880811

  10. Kim, K., Nesterov, Y., Skokov, V., Cherkasskii, B.: Effektivnii algoritm vychisleniya proisvodnyh i ekstremalnye zadachi (Efficient algorithm for computing derivatives and extremal problems). Ekonomika i matematicheskie metody 20(2), 309–318 (1984)

  11. Larson, J., Menickelly, M., Wild, S.M.: Derivative-free optimization methods. Acta Numerica 28, 287–404 (2019). https://doi.org/10.1017/S0962492919000060

  12. Liu, S., Kailkhura, B., Chen, P.Y., Ting, P., Chang, S., Amini, L.: Zeroth-order stochastic variance reduction for nonconvex optimization. Adv. Neural Inf. Process. Syst. 31, 3727–3737 (2018)

  13. Nesterov, Y.: Universal gradient methods for convex optimization problems. Math. Program. 152(1), 381–404 (2015). https://doi.org/10.1007/s10107-014-0790-0

  14. Nesterov, Y., Spokoiny, V.: Random gradient-free minimization of convex functions. Found. Comput. Math. 17(2), 527–566 (2015). https://doi.org/10.1007/s10208-015-9296-2

  15. Rosenbrock, H.H.: An automatic method for finding the greatest or least value of a function. Comput. J. 3(3), 175–184 (1960). https://doi.org/10.1093/comjnl/3.3.175

  16. Ghadimi, S., Lan, G., Zhang, H.: Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program. (2013). https://doi.org/10.1007/s10107-014-0846-1

  17. Spall, J.C.: Introduction to Stochastic Search and Optimization, 1st edn. Wiley, New York, NY, USA (2003)

  18. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press (2018)

  19. Wang, J., Liu, Y., Li, B.: Reinforcement learning with perturbed rewards. Proc. AAAI Conf. Artif. Intell. 34, 6202–6209 (2020). https://doi.org/10.1609/aaai.v34i04.6086


Acknowledgements

The authors are grateful to K. Scheinberg and A. Beznosikov for several discussions on derivative-free methods.

Author information

Correspondence to Innokentiy Shibaev.

Ethics declarations

Funding

The research of A. Gasnikov and P. Dvurechensky was partially supported by the Ministry of Science and Higher Education of the Russian Federation (Goszadaniye) 075-00337-20-03, project no. 0714-2020-0005.

Conflict of interest

Not applicable

Availability of data and material

Not applicable

Code availability

Not applicable

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Appendix

1.1 Proofs of Lemmas 2.1–2.5

Proof

(Lemma 1) From (1) we get \(\Vert Bu\Vert _{*}^2 = \langle Bu,B^{-1}Bu\rangle =\langle Bu,u\rangle =\Vert u\Vert ^2\). Using this and Lemma 7 we obtain

$$\begin{aligned}&\Vert \nabla \tilde{f}_\mu (x, \delta ) - \nabla f_\mu (x)\Vert _{*} \\&\quad \overset{(14)}{=}\left\| \frac{1}{\kappa }\int \limits _{E}\frac{\tilde{f}(x+\mu u,\delta )\pm f(x+\mu u)}{\mu } e^{-\tfrac{1}{2}\Vert u\Vert ^2}Budu - \nabla f_\mu (x)\right\| _{*} \\&\quad \leqslant \left\| \frac{1}{\kappa } \int \limits _{E}\left( \frac{\tilde{f}(x+\mu u,\delta ) - f(x+\mu u)}{\mu }\right) e^{-\tfrac{1}{2}\Vert u\Vert ^2}Budu \right\| _{*} \\&\qquad + \left\| \frac{1}{\kappa } \int \limits _{E}\frac{f(x+\mu u)}{\mu } e^{-\tfrac{1}{2}\Vert u\Vert ^2}Budu - \nabla f_\mu (x) \right\| _{*} \\&\quad \overset{Asm.~1,~(9)}{\leqslant } \frac{1}{\kappa }\int \limits _{E}\frac{\delta }{\mu } \Vert u\Vert e^{-\tfrac{1}{2}\Vert u\Vert ^2}du + \left\| \nabla f_\mu (x) - \nabla f_\mu (x) \right\| _{*} \overset{Lem.~7}{\leqslant } \frac{\delta }{\mu }n^{1/2} \\ \end{aligned}$$

and

$$\begin{aligned}&\Vert \nabla f_\mu (x) - \nabla f(x)\Vert _{*} \overset{(11)}{=} \left\| \frac{1}{\kappa }\int \limits _{E}\left( \nabla f(x+\mu u) - \nabla f(x) \right) e^{-\tfrac{1}{2}\Vert u\Vert ^2}du \right\| _{*} \\&\quad \overset{Asm.~2}{\leqslant } \frac{1}{\kappa }\int \limits _{E}L_{\nu }\Vert \mu u\Vert ^{\nu }e^{-\tfrac{1}{2}\Vert u\Vert ^2}du \overset{Lem.~7}{\leqslant } \mu ^{\nu } L_{\nu }n^{\nu /2} \end{aligned}$$

thus, finally

$$\begin{aligned}&\Vert \nabla \tilde{f}_\mu (x, \delta ) - \nabla f(x)\Vert _{*} \leqslant \Vert \nabla \tilde{f}_\mu (x, \delta ) - \nabla f_\mu (x)\Vert _{*} + \Vert \nabla f_\mu (x) - \nabla f(x)\Vert _{*} \\&\quad \leqslant \frac{\delta }{\mu }n^{1/2} + \mu ^{\nu } L_{\nu }n^{\nu /2}. \end{aligned}$$

\(\square \)
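As a hedged numerical illustration of Lemma 1 (our own sketch, not part of the paper: the test function, the deterministic noise model, and the sample size are illustrative, with \(B = I\) so that primal and dual norms are both Euclidean), one can compare a Monte Carlo estimate of \(\nabla \tilde{f}_\mu (x, \delta )\) with \(\nabla f(x)\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, mu, delta, samples = 20, 1e-1, 1e-4, 200_000
x = rng.standard_normal(n)
w = rng.standard_normal(n)

# f(z) = sqrt(1 + ||z||^2) has a 1-Lipschitz gradient (nu = 1, L_1 = 1); the
# oracle adds a deterministic perturbation bounded by delta in absolute value.
f_tilde = lambda Z: np.sqrt(1.0 + np.sum(Z ** 2, axis=-1)) + delta * np.cos(Z @ w)
grad_f = x / np.sqrt(1.0 + x @ x)

# Monte Carlo average of g_mu(x, u, delta); its mean over u is grad f~_mu(x, delta).
U = rng.standard_normal((samples, n))
fd = (f_tilde(x + mu * U) - f_tilde(x)) / mu
g = (fd[:, None] * U).mean(axis=0)

bias = np.linalg.norm(g - grad_f)
bound = (delta / mu) * np.sqrt(n) + mu * np.sqrt(n)  # Lemma 1 with nu = L_1 = 1
print(bias, bound)  # empirically the bias stays below the bound, up to MC error
```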

Proof

(Lemma 2)

$$\begin{aligned}&\Vert \nabla f_{\mu }(y)-\nabla f_{\mu }(x)\Vert _{*} \\&\quad \overset{(8)}{=}\frac{1}{\kappa } \left\| \int \limits _{E}\left( \frac{f(y+\mu u) - f(y)}{\mu }-\frac{f(x+\mu u) - f(x)}{\mu }\right) Bu e^{-\tfrac{1}{2}\Vert u\Vert ^2}du \right\| _{*} \\&\quad \leqslant \frac{1}{\mu \kappa } \int \limits _{E}\left| \int \limits _{0}^{1}\langle \nabla f(\mu u + ty + (1-t)x) - \nabla f(ty + (1-t)x), y-x\rangle dt\right| \Vert u\Vert e^{-\tfrac{1}{2}\Vert u\Vert ^2}du \\&\quad \overset{Asm.~2}{\leqslant } \frac{1}{\mu \kappa } \int \limits _{E}L_{\nu }\mu ^{\nu } \Vert y-x\Vert \Vert u\Vert ^{1+\nu } e^{-\tfrac{1}{2}\Vert u\Vert ^2}du \overset{Lem.~7}{\leqslant } \frac{L_{\nu }}{\mu ^{1-\nu }}n^{\tfrac{1+\nu }{2}}\Vert y-x\Vert . \end{aligned}$$

Integrating this we obtain

$$\begin{aligned} f_{\mu }(y)-f_{\mu }(x)-\langle \nabla f_{\mu }(x),y-x\rangle \leqslant \frac{L_{\nu }}{2\mu ^{1-\nu }}n^{\tfrac{1+\nu }{2}}\Vert y-x\Vert ^2 \end{aligned}$$
(26)

Thus we have proved the lemma with \(A_1 = \frac{L_{\nu }}{\mu ^{1-\nu }}n^{\tfrac{1+\nu }{2}}\) and \(A_2 = 0\).

Another way to obtain \(A_1\) and \(A_2\) is to directly upper-bound \(f_{\mu }(y)-f_{\mu }(x)-\langle \nabla f_{\mu }(x),y-x\rangle \) by applying Lemma 8:

$$\begin{aligned}&f_{\mu }(y)-f_{\mu }(x)-\langle \nabla f_{\mu }(x),y-x\rangle \\&\overset{(6,11)}{=} \frac{1}{\kappa }\int \limits _{E}\left( f(y+\mu u) - f(x+\mu u) - \langle \nabla f(x+\mu u),y-x\rangle \right) e^{-\tfrac{1}{2}\Vert u\Vert ^2}du \\&\overset{Asm.~2}{\leqslant } \frac{L_{\nu }}{1+\nu }\Vert y-x\Vert ^{1+\nu } \overset{\text {Lem.~8}}{\leqslant } \frac{1}{2}\left[ \frac{1-\nu }{1+\nu }\frac{2}{\tilde{\delta }}\right] ^{\frac{1-\nu }{1+\nu }} L_{\nu }^{\frac{2}{1+\nu }}\Vert y-x\Vert ^2 + \tilde{\delta }. \end{aligned}$$

Setting \(\tilde{\delta } = \hat{\delta }\mu ^{1+\nu }L_\nu \) and using the upper bound \(\left[ 2\frac{1-\nu }{1+\nu }\right] ^{\frac{1-\nu }{1+\nu }}\leqslant 2\), we obtain

$$\begin{aligned}&f_{\mu }(y)-f_{\mu }(x)-\langle \nabla f_{\mu }(x),y-x\rangle \leqslant \left[ \frac{1}{\hat{\delta }}\right] ^{\frac{1-\nu }{1+\nu }}\frac{L_{\nu }}{\mu ^{1-\nu }}\Vert y-x\Vert ^2 + \hat{\delta }L_{\nu }\mu ^{1+\nu } \end{aligned}$$
(27)

which proves the lemma with \(A_1 = \left[ \frac{1}{\hat{\delta }}\right] ^{\frac{1-\nu }{1+\nu }}\frac{2L_{\nu }}{\mu ^{1-\nu }}\) and \(A_2 = \hat{\delta }L_{\nu }\mu ^{1+\nu }\). \(\square \)
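For instance (an illustrative instantiation of (27), not taken from the paper), choosing \(\hat{\delta } = 1\) yields constants with no explicit dependence on the dimension, in contrast with the factor \(n^{\tfrac{1+\nu }{2}}\) in the first choice (26):

$$\begin{aligned} A_1 = \frac{2L_{\nu }}{\mu ^{1-\nu }}, \qquad A_2 = L_{\nu }\mu ^{1+\nu }. \end{aligned}$$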

Proof

(Lemma 3) To prove this, first note that

$$\begin{aligned} \frac{1}{\kappa } \int \limits _{E}\langle \nabla f(x), u\rangle e^{-\tfrac{1}{2}\Vert u\Vert ^2}du = 0 \end{aligned}$$

thus

$$\begin{aligned}&|f_{\mu }(x)-f(x)| \overset{(6)}{=} \left| \frac{1}{\kappa }\int \limits _{E}\left( f(x+\mu u) - f(x)\right) e^{-\tfrac{1}{2}\Vert u\Vert ^2}du\right| \\&=\left| \frac{1}{\kappa }\int \limits _{E}\left( f(x+\mu u) - f(x) - \langle \nabla f(x), \mu u\rangle \right) e^{-\tfrac{1}{2}\Vert u\Vert ^2}du\right| \\&\overset{Asm.~2}{\leqslant } \frac{L_{\nu }}{1 + \nu }\mu ^{1 + \nu }\cdot \frac{1}{\kappa }\int \limits _{E}\Vert u\Vert ^{1+\nu } e^{-\tfrac{1}{2}\Vert u\Vert ^2}du \overset{Lem.~7}{\leqslant }\frac{L_{\nu }}{1 + \nu }\mu ^{1 + \nu }n^{\frac{1+\nu }{2}}. \end{aligned}$$

\(\square \)

Proof

(Lemma 4) Using the fact that \(a^2 \leqslant 2(a+b)^2 + 2b^2\) with \(a = \nabla f(x)\) and \(b = \nabla f_{\mu }(x) - \nabla f(x)\):

$$\begin{aligned} \Vert \nabla f(x)\Vert _{*}^2&\leqslant 2\Vert \nabla f_{\mu }(x)\Vert _{*}^2 + 2\Vert \nabla f(x) - \nabla f_{\mu }(x)\Vert _{*}^2 \\&\overset{Lem.~1}{\leqslant } 2\Vert \nabla f_{\mu }(x)\Vert _{*}^2 + 2\mu ^{2\nu }L_{\nu }^2 n^{\nu }. \end{aligned}$$

\(\square \)
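As a worked illustration of how this bound is typically used (our own calculation, not a claim from the paper): if the algorithm guarantees \(\Vert \nabla f_{\mu }(x)\Vert _{*}\leqslant \varepsilon /2\), then \(\Vert \nabla f(x)\Vert _{*}\leqslant \varepsilon \) follows once the second term is at most \(\varepsilon ^2/2\), i.e., for

$$\begin{aligned} 2\mu ^{2\nu }L_{\nu }^2 n^{\nu }\leqslant \frac{\varepsilon ^2}{2} \quad \Longleftrightarrow \quad \mu \leqslant \frac{1}{\sqrt{n}}\left( \frac{\varepsilon }{2L_{\nu }}\right) ^{1/\nu }. \end{aligned}$$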

Proof

(Lemma 5)

$$\begin{aligned} \mathbb {E}_{u}\left[ \Vert g_{\mu }(x,u,\delta )\Vert ^2_{*}\right] = \frac{1}{\kappa } \int \limits _{E}\left| \frac{\tilde{f}(x+\mu u,\delta ) - \tilde{f}(x,\delta )}{\mu }\right| ^2 \Vert u\Vert ^2 e^{-\tfrac{1}{2}\Vert u\Vert ^2}du \end{aligned}$$

Let us first bound \(|\tilde{f}(x+\mu u,\delta ) - \tilde{f}(x,\delta )|\):

$$\begin{aligned}&|\tilde{f}(x+\mu u,\delta ) - \tilde{f}(x,\delta )| \leqslant 2\delta + |f(x+\mu u) - f(x)| \\&\quad \leqslant 2\delta + |f(x+\mu u) - f_{\mu }(x+\mu u) - f(x) + f_{\mu }(x)| + |f_{\mu }(x+\mu u) - f_{\mu }(x)| \\&\quad \overset{Lem.~3}{\leqslant } 2\delta + \frac{2L_{\nu }}{1+\nu }\mu ^{1+\nu }n^{\frac{1+\nu }{2}} + |f_{\mu }(x+\mu u) - f_{\mu }(x) - \langle \nabla f_{\mu }(x),\mu u\rangle | \\&\quad + |\langle \nabla f_{\mu }(x),\mu u\rangle | \\&\quad \overset{Lem.~2}{\leqslant } 2\delta + \frac{2L_{\nu }}{1+\nu }\mu ^{1+\nu }n^{\frac{1+\nu }{2}} + \frac{\mu ^2 A_1}{2}\Vert u\Vert ^2 + A_2 + |\langle \nabla f_{\mu }(x),\mu u\rangle | \end{aligned}$$

Thus, from the fact that \(\left( \sum _{i=1}^{k}a_i\right) ^2\leqslant k\left( \sum _{i=1}^{k}a_i^2\right) \) with \(k=5\),

$$\begin{aligned}&|\tilde{f}(x+\mu u,\delta ) - \tilde{f}(x,\delta )|^2 \\&\quad \leqslant 5\left( 4\delta ^2 + \frac{4L_{\nu }^2}{(1+\nu )^2}\mu ^{2+2\nu }n^{1+\nu } + \frac{\mu ^4 A_1^2}{4}\Vert u\Vert ^4 + A_2^2 + \langle \nabla f_{\mu }(x),\mu u\rangle ^2\right) \end{aligned}$$

and applying Theorem 3 we finally obtain

$$\begin{aligned}&\mathbb {E}_{u}\left[ \Vert g_{\mu }(x,u,\delta )\Vert ^2_{*}\right] \leqslant 20(n+4)\Vert \nabla f_{\mu }(x)\Vert _{*}^2 \\&\quad + 5\left( \frac{4\delta ^2}{\mu ^2}n + \frac{4L_{\nu }^2}{(1+\nu )^2}\mu ^{2\nu }n^{2+\nu } + \frac{\mu ^2 A_1^2}{4}(n+6)^3 + \frac{A_2^2}{\mu ^2}n\right) . \end{aligned}$$

\(\square \)

1.2 External results

Lemma 7

(Lemma 1 from [14]) For \(p\geqslant 0\), we have

$$\begin{aligned} \frac{1}{\kappa }\int \limits _{E}\Vert u\Vert ^{p}e^{-\tfrac{1}{2}\Vert u\Vert ^2}du\leqslant {\left\{ \begin{array}{ll} n^{p/2}, &{}p\in [0,2]\\ (n+p)^{p/2}, &{}p>2 \end{array}\right. } \end{aligned}$$

Lemma 8

(Lemma 2 from [13]) Let the function f satisfy Assumption 2. Then, for any \(\tilde{\delta }>0\) and any \(t\geqslant 0\),

$$\begin{aligned} \frac{L_{\nu }}{1+\nu }t^{1+\nu }\leqslant \frac{1}{2}\left[ \frac{1-\nu }{1+\nu }\frac{2}{\tilde{\delta }}\right] ^{\frac{1-\nu }{1+\nu }}L_{\nu }^{\frac{2}{1+\nu }}t^2+\tilde{\delta } = \frac{L}{2}t^2+\tilde{\delta } \end{aligned}$$

where the last equality defines \(L = L(\tilde{\delta })\).

Theorem 3

(Theorem 3 from [14]) If f is differentiable at x and u is a standard random normal vector, then

$$\begin{aligned} \mathbb {E}_{u}\left[ \langle \nabla f(x), u\rangle ^2 \Vert u\Vert ^2\right] \leqslant (n+4)\Vert \nabla f(x)\Vert _{*}^2 \end{aligned}$$
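Both of these moment bounds are easy to probe numerically (a hedged sketch with illustrative dimension, exponents, and sample size; u is a standard Gaussian vector and the norm is Euclidean):

```python
import numpy as np

rng = np.random.default_rng(0)
n, samples = 10, 500_000
U = rng.standard_normal((samples, n))
norms = np.linalg.norm(U, axis=1)

# Lemma 7: E||u||^p <= n^{p/2} for p in [0, 2] and <= (n + p)^{p/2} for p > 2.
for p in (1.0, 4.0, 6.0):
    lhs = np.mean(norms ** p)
    rhs = n ** (p / 2) if p <= 2 else (n + p) ** (p / 2)
    print(p, lhs, rhs)  # lhs stays below rhs in each case

# Theorem 3: E[<g, u>^2 ||u||^2] <= (n + 4) ||g||^2 for a fixed vector g.
g = rng.standard_normal(n)
lhs = np.mean((U @ g) ** 2 * norms ** 2)
print(lhs, (n + 4) * (g @ g))  # exact value is (n + 2)||g||^2, below the bound
```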


Cite this article

Shibaev, I., Dvurechensky, P. & Gasnikov, A. Zeroth-order methods for noisy Hölder-gradient functions. Optim Lett 16, 2123–2143 (2022). https://doi.org/10.1007/s11590-021-01742-z
