Abstract
In this paper, we prove new complexity bounds for zeroth-order methods in non-convex optimization with inexact observations of the objective function values. We use the Gaussian smoothing approach of Nesterov and Spokoiny (Found Comput Math 17(2): 527–566, 2015. https://doi.org/10.1007/s10208-015-9296-2) and extend their results, obtained for smooth zeroth-order non-convex problems, to the minimization of functions with Hölder-continuous gradient under a noisy zeroth-order oracle, obtaining upper bounds on the admissible noise as well. We consider a finite-difference gradient approximation based on normally distributed random Gaussian vectors and prove that a gradient descent scheme based on this approximation converges to a stationary point of the smoothed function. We also consider convergence to a stationary point of the original (not smoothed) function and obtain bounds on the number of steps of the algorithm needed to make the norm of its gradient small. Additionally, we provide bounds on the level of noise in the zeroth-order oracle for which it is still possible to guarantee that the above bounds hold. Finally, we consider separately the case \(\nu = 1\) and show that in this case the dependence of the obtained bounds on the dimension can be improved.
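To make the scheme concrete, the following is a minimal sketch (not taken from the paper) of the Gaussian finite-difference gradient estimator and the gradient descent step described above. The function names, step size, smoothing parameter, and noise model are illustrative assumptions; in the paper these parameters are chosen from the Hölder constant \(L_\nu\), the exponent \(\nu\), the dimension \(n\), and the oracle noise level \(\delta\).

```python
import numpy as np

def gaussian_fd_gradient(f_noisy, x, mu, rng):
    """Two-point finite-difference estimator along a random Gaussian direction:
    g(x) = (f(x + mu*u) - f(x)) / mu * u, with u ~ N(0, I_n).
    f_noisy is assumed to return f(x) plus noise bounded by delta."""
    u = rng.standard_normal(x.shape)
    return (f_noisy(x + mu * u) - f_noisy(x)) / mu * u

def zeroth_order_gd(f_noisy, x0, mu=1e-3, step=1e-2, n_iters=1000, seed=0):
    """Plain gradient descent driven by the Gaussian finite-difference estimator.
    Step size and smoothing parameter here are illustrative only."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g = gaussian_fd_gradient(f_noisy, x, mu, rng)
        x = x - step * g
    return x

if __name__ == "__main__":
    # Example: a smooth test function observed with bounded noise.
    rng_noise = np.random.default_rng(1)
    delta = 1e-6
    f_noisy = lambda x: np.sum(x ** 2) + delta * rng_noise.uniform(-1, 1)
    x_hat = zeroth_order_gd(f_noisy, x0=np.ones(10))
    print(np.linalg.norm(x_hat))
```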
References
Baydin, A.G., Pearlmutter, B.A., Radul, A.A., Siskind, J.M.: Automatic differentiation in machine learning: a survey (2018). arXiv:1502.05767
Berahas, A.S., Cao, L., Choromanski, K., Scheinberg, K.: A theoretical and empirical comparison of gradient approximations in derivative-free optimization (2019). arXiv:1905.01332
Berahas, A.S., Cao, L., Scheinberg, K.: Global convergence rate analysis of a generic line search algorithm with noise (2019). arXiv:1910.04055
Bolte, J., Glaudin, L., Pauwels, E., Serrurier, M.: A Hölderian backtracking method for min-max and min-min problems (2020). arXiv:2007.08810
Brent, R.: Algorithms for Minimization Without Derivatives. Dover Books on Mathematics, Dover Publications (1973)
Conn, A.R., Scheinberg, K., Vicente, L.N.: Introduction to derivative-free optimization. Soc. Ind. Appl. Math. (2009). https://doi.org/10.1137/1.9780898718768
Dvurechensky, P.: Gradient method with inexact oracle for composite non-convex optimization (2017). arXiv:1703.09180
Fabian, V.: Stochastic approximation of minima with improved asymptotic speed. Ann. Math. Statist. 38(1), 191–200 (1967). https://doi.org/10.1214/aoms/1177699070
Ghadimi, S., Lan, G.: Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013). https://doi.org/10.1137/120880811
Kim, K., Nesterov, Y., Skokov, V., Cherkasskii, B.: Effektivnii algoritm vychisleniya proisvodnyh i ekstremalnye zadachi (efficient algorithm for calculation of derivatives and extreme problems). Ekonomika i matematicheskie metody 20(2), 309–318 (1984)
Larson, J., Menickelly, M., Wild, S.M.: Derivative-free optimization methods. Acta Numerica 28, 287–404 (2019). https://doi.org/10.1017/S0962492919000060
Liu, S., Kailkhura, B., Chen, P.Y., Ting, P., Chang, S., Amini, L.: Zeroth-order stochastic variance reduction for nonconvex optimization. Adv. Neural Inf. Process. Syst. 31, 3727–3737 (2018)
Nesterov, Y.: Universal gradient methods for convex optimization problems. Math. Program. 152(1), 381–404 (2015). https://doi.org/10.1007/s10107-014-0790-0
Nesterov, Y., Spokoiny, V.: Random gradient-free minimization of convex functions. Found. Comput. Math. 17(2), 527–566 (2015). https://doi.org/10.1007/s10208-015-9296-2
Rosenbrock, H.H.: An automatic method for finding the greatest or least value of a function. Comput. J. 3(3), 175–184 (1960). https://doi.org/10.1093/comjnl/3.3.175
Ghadimi, S., Lan, G., Zhang, H.: Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program. (2013). https://doi.org/10.1007/s10107-014-0846-1
Spall, J.C.: Introduction to Stochastic Search and Optimization, 1st edn. Wiley, New York, NY, USA (2003)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press (2018)
Wang, J., Liu, Y., Li, B.: Reinforcement learning with perturbed rewards. Proc. AAAI Conf. Artif. Intell. 34, 6202–6209 (2020). https://doi.org/10.1609/aaai.v34i04.6086
Acknowledgements
The authors are grateful to K. Scheinberg and A. Beznosikov for several discussions on derivative-free methods.
Ethics declarations
Funding
The research of A. Gasnikov and P. Dvurechensky was partially supported by the Ministry of Science and Higher Education of the Russian Federation (Goszadaniye) 075-00337-20-03, project no. 0714-2020-0005.
Conflict of interest
Not applicable
Availability of data and material
Not applicable
Code availability
Not applicable
Appendix
1.1 Proofs of Lemmas 2.1–2.5
Proof
(Lemma 1) From (1) we get \(\Vert Bu\Vert _{*}^2 = \langle Bu,B^{-1}Bu\rangle =\langle Bu,u\rangle =\Vert u\Vert ^2\). Using this and Lemma 7 we obtain
and
thus, finally
\(\square \)
Proof
(Lemma 2)
Integrating this, we obtain
so in this way we have proved the lemma with \(A_1 = \frac{L_{\nu }}{\mu ^{1-\nu }}n^{\tfrac{1+\nu }{2}}\) and \(A_2 = 0\).
Another way to obtain \(A_1\) and \(A_2\) is to directly upper bound \(f_{\mu }(y)-f_{\mu }(x)-\langle \nabla f_{\mu }(x),y-x\rangle \) by applying Lemma 8:
Setting \(\tilde{\delta } = \hat{\delta }\mu ^{1+\nu }L_\nu \) and using the upper bound \(\left[ 2\frac{1-\nu }{1+\nu }\right] ^{\frac{1-\nu }{1+\nu }}\leqslant 2\), we obtain
so we have proved the lemma with \(A_1 = \left[ \frac{1}{\hat{\delta }}\right] ^{\frac{1-\nu }{1+\nu }}\frac{2L_{\nu }}{\mu ^{1-\nu }}\) and \(A_2 = \hat{\delta }L_{\nu }\mu ^{1+\nu }\). \(\square \)
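For convenience, the two admissible pairs obtained in this proof can be written side by side:
\[
(A_1, A_2) = \left( \frac{L_{\nu }}{\mu ^{1-\nu }}\,n^{\frac{1+\nu }{2}},\; 0\right) \qquad \text{or}\qquad (A_1, A_2) = \left( \left[ \frac{1}{\hat{\delta }}\right] ^{\frac{1-\nu }{1+\nu }}\frac{2L_{\nu }}{\mu ^{1-\nu }},\; \hat{\delta }L_{\nu }\mu ^{1+\nu }\right) .
\]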
Proof
(Lemma 3) To prove this, notice that
thus
\(\square \)
Proof
(Lemma 4) From the fact that \(a^2 = ((a+b)-b)^2 \leqslant 2(a+b)^2 + 2b^2\):
\(\square \)
Proof
(Lemma 5)
Let us bound \(|\tilde{f}(x+\mu u,\delta ) - \tilde{f}(x,\delta )|\):
thus, from the fact that \(\left( \sum _{i=1}^{k}a_i\right) ^2\leqslant k\left( \sum _{i=1}^{k}a_i^2\right) \)
and applying Theorem 3 we finally obtain
\(\square \)
1.2 External results
Lemma 7
(Lemma 1 from [14]) For \(p\geqslant 0\), we have
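(For the reader's convenience: the bound in question, restated here from [14], controls the Gaussian moments \(M_p = E_u\left[ \Vert u\Vert ^p\right] \), namely \(M_p\leqslant n^{p/2}\) for \(p\in [0,2]\) and \(n^{p/2}\leqslant M_p\leqslant (n+p)^{p/2}\) for \(p\geqslant 2\); see [14] for the exact statement.)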
Lemma 8
(Lemma 2 from [13]) Let the function f satisfy Assumption 2. Then for any \(\tilde{\delta }>0\)
Theorem 3
(Theorem 3 from [14]) If f is differentiable at x and u is a standard random normal vector, then
Cite this article
Shibaev, I., Dvurechensky, P. & Gasnikov, A. Zeroth-order methods for noisy Hölder-gradient functions. Optim Lett 16, 2123–2143 (2022). https://doi.org/10.1007/s11590-021-01742-z