Zeroth-Order Nonconvex Stochastic Optimization: Handling Constraints, High Dimensionality, and Saddle Points

Abstract

In this paper, we propose and analyze zeroth-order stochastic approximation algorithms for nonconvex and convex optimization, with a focus on addressing constrained optimization, high-dimensional settings, and saddle-point avoidance. To handle constrained optimization, we first propose generalizations of the conditional gradient algorithm that achieve, using only zeroth-order information, rates similar to those of the standard stochastic gradient algorithm. To facilitate zeroth-order optimization in high dimensions, we explore the advantages of structural sparsity assumptions. Specifically, (i) we highlight an implicit regularization phenomenon in which the standard stochastic gradient algorithm with zeroth-order information adapts to the sparsity of the problem at hand simply by varying the step size, and (ii) we propose a truncated stochastic gradient algorithm with zeroth-order information whose rate of convergence depends only poly-logarithmically on the dimensionality. We next focus on avoiding saddle points in the nonconvex setting. Toward that end, we interpret the Gaussian smoothing technique for estimating the gradient from zeroth-order information as an instantiation of first-order Stein's identity. Building on this, we provide a novel linear-time (in the dimension) estimator of the Hessian matrix of a function using only zeroth-order information, based on second-order Stein's identity. We then provide a zeroth-order variant of the cubic regularized Newton method for avoiding saddle points and discuss its rate of convergence to local minima.
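To make the two Stein's-identity-based constructions mentioned above concrete, the following Python sketch (assuming NumPy) illustrates a Gaussian-smoothing gradient estimate built from finite differences of function values (first-order Stein's identity) and a Hessian estimate built from symmetric second differences (second-order Stein's identity). The function and parameter names (zo_gradient, zo_hessian, nu, n_samples) are illustrative choices, and the snippet is a minimal single-point sketch under these assumptions rather than the full stochastic approximation schemes analyzed in the paper.

    import numpy as np

    def zo_gradient(f, x, nu=1e-4, n_samples=50, rng=None):
        # Gaussian-smoothing gradient estimate: for u ~ N(0, I_d),
        # E[(f(x + nu*u) - f(x)) / nu * u] equals the gradient of the smoothed
        # surrogate f_nu(x) = E[f(x + nu*u)] (first-order Stein's identity).
        rng = np.random.default_rng() if rng is None else rng
        d = x.shape[0]
        fx = f(x)
        g = np.zeros(d)
        for _ in range(n_samples):
            u = rng.standard_normal(d)
            g += (f(x + nu * u) - fx) / nu * u
        return g / n_samples

    def zo_hessian(f, x, nu=1e-2, n_samples=200, rng=None):
        # Hessian estimate from function values via second-order Stein's identity:
        # E[(f(x + nu*u) + f(x - nu*u) - 2*f(x)) / (2*nu**2) * (u u^T - I)]
        # approximates the Hessian of the smoothed function. Each sample is a
        # scaled rank-one-plus-identity matrix, so Hessian-vector products can be
        # formed in O(d) time per sample without storing the full d x d matrix;
        # the dense matrix is assembled here only for illustration.
        rng = np.random.default_rng() if rng is None else rng
        d = x.shape[0]
        fx = f(x)
        H = np.zeros((d, d))
        for _ in range(n_samples):
            u = rng.standard_normal(d)
            c = (f(x + nu * u) + f(x - nu * u) - 2.0 * fx) / (2.0 * nu ** 2)
            H += c * (np.outer(u, u) - np.eye(d))
        return H / n_samples

    if __name__ == "__main__":
        # Sanity check on the quadratic f(x) = 0.5 x^T A x with A = diag(1, 2, 3):
        # the true gradient at x is A x and the true Hessian is A.
        A = np.diag([1.0, 2.0, 3.0])
        f = lambda z: 0.5 * z @ A @ z
        x = np.array([1.0, -1.0, 0.5])
        print(zo_gradient(f, x, n_samples=2000))    # roughly A @ x = [1, -2, 1.5]
        print(zo_hessian(f, x, n_samples=20000))    # roughly diag(1, 2, 3)

Averaging over many Gaussian directions reduces the variance of both estimators; in the stochastic setting, each function value f(x + nu*u) would itself be replaced by a noisy evaluation.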

Notes

  1. We remark that our step size choice requires knowledge of a rough upper bound on the true sparsity parameter.

  2. For a definition of almost-differentiable function, we refer the reader to Definition 1 in [75].

References

  1. Agarwal, A., Dekel, O., Xiao, L.: Optimal algorithms for online convex optimization with multi-point bandit feedback. In: Proceedings of The 23rd Conference on Learning Theory, pp. 28–40 (2010)

  2. Akhavan, A., Pontil, M., Tsybakov, A.: Exploiting higher order smoothness in derivative-free optimization and continuous bandits. In: Advances in Neural Information Processing Systems, vol. 33 (2020)

  3. Allen-Zhu, Z.: Natasha 2: Faster non-convex optimization than SGD. In: Advances in Neural Information Processing Systems, pp. 2680–2691 (2018)

  4. Bach, F., Perchet, V.: Highly-smooth zero-th order online optimization. In: V. Feldman, A. Rakhlin, O. Shamir (eds.) 29th Annual Conference on Learning Theory, Proceedings of Machine Learning Research, vol. 49, pp. 257–283. PMLR (2016)

  5. Beck, A.: First-Order Methods in Optimization, vol. 25. Society for Industrial and Applied Mathematics (SIAM) (2017)

  6. Belloni, A., Liang, T., Narayanan, H., Rakhlin, A.: Escaping the local minima via simulated annealing: Optimization of approximately convex functions. In: P. Grunwald, E. Hazan, S. Kale (eds.) Proceedings of The 28th Conference on Learning Theory, Proceedings of Machine Learning Research, vol. 40, pp. 240–265. PMLR (2015)

  7. Ben-Tal, A., Nemirovski, A.: Lectures on modern convex optimization: analysis, algorithms, and engineering applications, vol. 2. Society for Industrial and Applied Mathematics (SIAM) (2001)

  8. Bertsekas, D.P.: Nonlinear Programming. Athena Scientific, Belmont (2016)

  9. Bertsekas, D.P.: Convex Optimization Algorithms. Athena Scientific, Belmont (2015)

  10. Bhojanapalli, S., Neyshabur, B., Srebro, N.: Global optimality of local search for low rank matrix recovery. In: Advances in Neural Information Processing Systems, pp. 3873–3881 (2016)

  11. Boyd, S., Vandenberghe, L.: Convex optimization. Cambridge University Press (2004)

  12. Bubeck, S., Cesa-Bianchi, N.: Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning 5(1), 1–122 (2012)

  13. Bubeck, S., Lee, Y.T., Eldan, R.: Kernel-based methods for bandit convex optimization. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 72–85 (2017)

  14. Cai, H., Mckenzie, D., Yin, W., Zhang, Z.: Zeroth-order regularized optimization (ZORO): Approximately sparse gradients and adaptive sampling (2020)

  15. Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Accelerated methods for nonconvex optimization. SIAM Journal on Optimization 28(2), 1751–1772 (2018)

  16. Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization, Part I: Motivation, convergence and numerical results. Mathematical Programming 127(2), 245–295 (2011)

  17. Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization, Part II: Worst-case function- and derivative-evaluation complexity. Mathematical Programming 130(2), 295–319 (2011)

  18. Cartis, C., Gould, N.I., Toint, P.L.: Second-order optimality and beyond: Characterization and evaluation complexity in convexly constrained nonlinear optimization. Foundations of Computational Mathematics 18(5), 1073–1107 (2018)

  19. Chen, L., Zhang, M., Hassani, H., Karbasi, A.: Black box submodular maximization: Discrete and continuous settings. In: S. Chiappa, R. Calandra (eds.) Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 108, pp. 1058–1070 (2020)

  20. Chen, P.Y., Zhang, H., Sharma, Y., Yi, J., Hsieh, C.J.: ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In: Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15–26. ACM (2017)

  21. Choromanski, K., Rowland, M., Sindhwani, V., Turner, R., Weller, A.: Structured evolution with compact architectures for scalable policy optimization. In: Proceedings of the 35th International Conference on Machine Learning. PMLR (2018)

  22. Conn, A., Scheinberg, K., Vicente, L.: Introduction to derivative-free optimization, vol. 8. Society for Industrial and Applied Mathematics (SIAM) (2009)

  23. Dani, V., Kakade, S.M., Hayes, T.P.: The price of bandit information for online optimization. In: Advances in Neural Information Processing Systems, pp. 345–352 (2008)

  24. Demyanov, V., Rubinov, A.: Approximate methods in optimization problems. American Elsevier Publishing (1970)

  25. DeVore, R., Petrova, G., Wojtaszczyk, P.: Approximation of functions of few variables in high dimensions. Constructive Approximation 33(1), 125–143 (2011)

  26. Donoho, D.L.: Compressed sensing. IEEE Transactions on Information Theory 52(4), 1289–1306 (2006)

  27. Duchi, J., Jordan, M., Wainwright, M., Wibisono, A.: Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory 61(5), 2788–2806 (2015)

  28. Elibol, M., Lei, L., Jordan, M.I.: Variance reduction with sparse gradients. In: Proceedings of the 8th International Conference on Learning Representations (ICLR), pp. 1058–1070 (2020)

  29. Erdogdu, M.A.: Newton-Stein method: an optimization method for GLMs via Stein’s lemma. The Journal of Machine Learning Research 17(1), 7565–7616 (2016)

  30. Frank, M., Wolfe, P.: An algorithm for quadratic programming. Naval Research Logistics Quarterly 3, 95–110 (1956)

  31. Gasnikov, A.V., Krymova, E.A., Lagunovskaya, A.A., Usmanova, I.N., Fedorenko, F.A.: Stochastic online optimization. Single-point and multi-point non-linear multi-armed bandits. Convex and strongly-convex case. Automation and Remote Control 78(2), 224–234 (2017)

  32. Ge, R., Huang, F., Jin, C., Yuan, Y.: Escaping from saddle points: Online stochastic gradient for tensor decomposition. In: Conference on Learning Theory, pp. 797–842 (2015)

  33. Ge, R., Lee, J.D., Ma, T.: Matrix completion has no spurious local minimum. In: Advances in Neural Information Processing Systems, pp. 2973–2981 (2016)

  34. Ghadimi, S.: Conditional gradient type methods for composite nonlinear and stochastic optimization. Mathematical Programming (2018). https://doi.org/10.1007/s10107-017-1225-5

  35. Ghadimi, S., Lan, G.: Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23(4), 2341–2368 (2013)

  36. Han, C., Yuan, M.: Information based complexity for high dimensional sparse functions. Journal of Complexity 57, 101443 (2020)

  37. Hazan, E., Kale, S.: Projection-free online learning. In: Proceedings of the 29th International Conference on International Conference on Machine Learning, pp. 1843–1850 (2012)

  38. Hazan, E., Levy, K.: Bandit convex optimization: Towards tight bounds. In: Advances in Neural Information Processing Systems, pp. 784–792 (2014)

  39. Hazan, E., Luo, H.: Variance-reduced and projection-free stochastic optimization. In: International Conference on Machine Learning, pp. 1263–1271 (2016)

  40. Hearn, D.: The gap function of a convex program. Operations Research Letters 2, 95–110 (1982)

  41. Hu, X., Prashanth, L.A., György, A., Szepesvari, C.: (Bandit) Convex Optimization with Biased Noisy Gradient Oracles. In: The 19th International Conference on Artificial Intelligence and Statistics, pp. 3420–3428 (2016)

  42. Jaggi, M.: Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In: Proceedings of the 30th International Conference on International Conference on Machine Learning, pp. 427–435 (2013)

  43. Jain, P., Kar, P.: Non-convex optimization for machine learning. Foundations and Trends® in Machine Learning 10(3-4), 142–336 (2017)

  44. Jain, P., Tewari, A., Kar, P.: On iterative hard thresholding methods for high-dimensional M-estimation. In: Advances in Neural Information Processing Systems, pp. 685–693 (2014)

  45. Jamieson, K., Nowak, R., Recht, B.: Query complexity of derivative-free optimization. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2012)

  46. Jin, C., Ge, R., Netrapalli, P., Kakade, S.M., Jordan, M.I.: How to escape saddle points efficiently. In: International Conference on Machine Learning, pp. 1724–1732 (2017)

  47. Kawaguchi, K., Kaelbling, L.P.: Elimination of all bad local minima in deep learning. arXiv:1901.00279

  48. Lan, G., Zhou, Y.: Conditional gradient sliding for convex optimization. SIAM Journal on Optimization 26(2), 1379–1409 (2016)

  49. Lattimore, T.: Improved regret for zeroth-order adversarial bandit convex optimisation. arXiv:2006.00475

  50. Li, J., Balasubramanian, K., Ma, S.: Stochastic zeroth-order Riemannian derivative estimation and optimization. arXiv:2003.11238 (2020)

  51. Mania, H., Guy, A., Recht, B.: Simple random search provides a competitive approach to reinforcement learning. In: Advances in Neural Information Processing Systems (2018)

  52. Minsker, S.: Sub-gaussian estimators of the mean of a random matrix with heavy-tailed entries. The Annals of Statistics 46(6A), 2871–2903 (2018)

  53. Mockus, J.: Bayesian approach to global optimization: theory and applications, vol. 37. Springer Science & Business Media (2012)

  54. Mokhtari, A., Hassani, H., Karbasi, A.: Conditional gradient method for stochastic submodular maximization: Closing the gap. In: International Conference on Artificial Intelligence and Statistics, pp. 1886–1895 (2018)

  55. Mokhtari, A., Hassani, H., Karbasi, A.: Stochastic conditional gradient methods: From convex minimization to submodular maximization. Journal of Machine Learning Research 21, 1–49 (2020)

  56. Murty, K.G., Kabadi, S.N.: Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming 39(2), 117–129 (1987)

  57. Nemirovski, A.S., Yudin, D.: Problem complexity and method efficiency in optimization. Wiley-Interscience Series in Discrete Mathematics. John Wiley & Sons (1983)

  58. Nesterov, Y.: Introductory Lectures on Convex Optimization: a basic course. Kluwer Academic Publishers, Massachusetts (2004)

  59. Nesterov, Y.: Introductory lectures on convex optimization: A basic course, vol. 87. Springer Science & Business Media (2013)

  60. Nesterov, Y., Polyak, B.: Cubic regularization of Newton method and its global performance. Mathematical Programming 108(1), 177–205 (2006)

  61. Nesterov, Y., Spokoiny, V.: Random gradient-free minimization of convex functions. Foundations of Computational Mathematics 17, 527–566 (2017)

  62. Nesterov, Y.: Implementable tensor methods in unconstrained convex optimization. Mathematical Programming 186, 157–183 (2021)

  63. Nocedal, J., Wright, S.J.: Numerical optimization. Springer Science & Business Media (2006)

  64. Raskutti, G., Wainwright, M.J., Yu, B.: Minimax-optimal rates for sparse additive models over kernel classes via convex programming. The Journal of Machine Learning Research 13(1), 389–427 (2012)

  65. Reddi, S., Sra, S., Póczos, B., Smola, A.: Stochastic Frank-Wolfe Methods for Nonconvex Optimization. In: Proceedings of the 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1244–1251 (2016)

  66. Reddi, S., Zaheer, M., Sra, S., Poczos, B., Bach, F., Salakhutdinov, R., Smola, A.: A generic approach for escaping saddle points. In: International Conference on Artificial Intelligence and Statistics, pp. 1233–1242 (2018)

  67. Rio, E.: Moment inequalities for sums of dependent random variables under projective conditions. Journal of Theoretical Probability 22(1), 146–163 (2009)

  68. Rubinstein, R., Kroese, D.: Simulation and the Monte Carlo method, vol. 10. John Wiley & Sons, New Jersey (2016)

  69. Saha, A., Tewari, A.: Improved regret guarantees for online smooth convex optimization with bandit feedback. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 636–642 (2011)

  70. Salimans, T., Ho, J., Chen, X., Sidor, S., Sutskever, I.: Evolution strategies as a scalable alternative to reinforcement learning. arXiv:1703.03864

  71. Shamir, O.: On the complexity of bandit and derivative-free stochastic convex optimization. In: Conference on Learning Theory, pp. 3–24 (2013)

  72. Snoek, J., Larochelle, H., Adams, R.: Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems, pp. 2951–2959 (2012)

  73. Spall, J.: Introduction to stochastic search and optimization: estimation, simulation, and control, vol. 65. John Wiley & Sons, New Jersey (2005)

  74. Stein, C.: A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In: Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory. The Regents of the University of California (1972)

  75. Stein, C.M.: Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, pp. 1135–1151 (1981)

  76. Sun, J., Qu, Q., Wright, J.: When are nonconvex problems not scary? arXiv:1510.06096

  77. Sun, J., Qu, Q., Wright, J.: A geometric analysis of phase retrieval. Foundations of Computational Mathematics 18(5), 1131–1198 (2018)

  78. Tripuraneni, N., Stern, M., Jin, C., Regier, J., Jordan, M.: Stochastic cubic regularization for fast nonconvex optimization. In: Advances in Neural Information Processing Systems, pp. 2899–2908 (2018)

  79. Tropp, J.A.: The expected norm of a sum of independent random matrices: An elementary approach. In: High Dimensional Probability VII, pp. 173–202. Springer (2016)

  80. Tyagi, H., Kyrillidis, A., Gärtner, B., Krause, A.: Algorithms for learning sparse additive models with interactions in high dimensions. Information and Inference: A Journal of the IMA 7(2), 183–249 (2018)

  81. Wang, Y., Du, S., Balakrishnan, S., Singh, A.: Stochastic zeroth-order optimization in high dimensions. In: A. Storkey, F. Perez-Cruz (eds.) Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 84, pp. 1356–1365 (2018)

  82. Wojtaszczyk, P.: Complexity of approximation of functions of few variables in high dimensions. Journal of Complexity 27(2), 141–150 (2011)

  83. Xu, P., Roosta-Khorasani, F., Mahoney, M.W.: Newton-type methods for non-convex optimization under inexact hessian information. Mathematical Programming 184, 35–70 (2020)

Author information

Corresponding author

Correspondence to Saeed Ghadimi.

Additional information

Communicated by Francis Bach.

About this article

Cite this article

Balasubramanian, K., Ghadimi, S. Zeroth-Order Nonconvex Stochastic Optimization: Handling Constraints, High Dimensionality, and Saddle Points. Found Comput Math 22, 35–76 (2022). https://doi.org/10.1007/s10208-021-09499-8
