Conservative set valued fields, automatic differentiation, stochastic gradient methods and deep learning

Bolte, Jérôme; Pauwels, Edouard

doi:10.1007/s10107-020-01501-5

Full Length Paper
Series A
Published: 15 April 2020

Volume 188, pages 19–51, (2021)
Mathematical Programming

Jérôme Bolte¹ & Edouard Pauwels²,³

Abstract

Modern problems in AI or in numerical analysis require nonsmooth approaches with a flexible calculus. We introduce generalized derivatives called conservative fields for which we develop a calculus and provide representation formulas. Functions having a conservative field are called path differentiable: convex, concave, Clarke regular and any semialgebraic Lipschitz continuous functions are path differentiable. Using Whitney stratification techniques for semialgebraic and definable sets, our model provides variational formulas for nonsmooth automatic differentiation oracles, as for instance the famous backpropagation algorithm in deep learning. Our differential model is applied to establish the convergence in values of nonsmooth stochastic gradient methods as they are implemented in practice.
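To make the abstract's point about nonsmooth automatic differentiation concrete, here is a minimal sketch in plain Python. It is an illustration, not code from the paper. The helper names `relu`, `relu_prime`, `f` and `backprop_f`, and the parameter `s` (the value assigned to the derivative of ReLU at its kink), are all hypothetical. The function f(x) = relu(x) - relu(-x) equals x everywhere, so its true derivative is 1, yet applying the chain rule formally at x = 0 returns 2s, which depends on the convention chosen for `s`. Our reading is that outputs of this kind are what the conservative-field model is meant to describe.

```python
def relu(x: float) -> float:
    """Rectified linear unit."""
    return max(x, 0.0)


def relu_prime(x: float, s: float = 0.0) -> float:
    """Derivative of relu away from 0; at the kink x == 0 return the
    convention s, assumed to lie in [0, 1] (hypothetical parameter)."""
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return s


def f(x: float) -> float:
    """relu(x) - relu(-x), which is just the identity x."""
    return relu(x) - relu(-x)


def backprop_f(x: float, s: float = 0.0) -> float:
    """Formal chain rule on the expression of f, as an autodiff tool would do.

    d/dx relu(x)  = relu_prime(x)
    d/dx relu(-x) = relu_prime(-x) * (-1)
    """
    return relu_prime(x, s) * 1.0 - relu_prime(-x, s) * (-1.0)
```

Away from 0 the formal derivative agrees with the true derivative 1. At 0 it returns 0 with `s=0.0`, 2 with `s=1.0`, and the correct value 1 only with `s=0.5`. The disagreement is confined to a single point, which is why it goes unnoticed almost everywhere in practice.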

Notes

Although all results we provide are generalizable to complete Riemannian manifolds.
Which is possible thanks to Lemma 1.
Valadier’s terminology finds here a surprising justification, since “saine”, healthy in English, is chosen as an antonym to “pathological”.
We only consider embedded submanifolds.
In [12] the authors assume f to be arbitrary and obtain a similar result; for simplicity we restrict ourselves to the Lipschitz case.
From a practical point of view, qualification is hard to enforce or even check.
Which we shall not define formally since it is not essential to our purpose.
If a unique \(\sigma :{\mathbb {R}}\rightarrow {\mathbb {R}}\) is applied to each coordinate of each layer, this amounts to considering a conservative field for \(\sigma \), for example its Clarke subgradient.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., Zheng, X.: Tensorflow: a system for large-scale machine learning. In: Symposium on Operating Systems Design and Implementation, OSDI, vol. 6, pp. 265–283 (2016)
Adil, S.: Opérateurs monotones aléatoires et application à l’optimisation stochastique. PhD Thesis, Paris Saclay (2018)
Aliprantis, C.D., Border, K.C.: Infinite Dimensional Analysis, 3rd edn. Springer, Berlin (2005)
Attouch, H., Goudou, X., Redont, P.: The heavy ball with friction method, I. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system. Commun. Contemp. Math. 2(1), 1–34 (2000)
Aubin, J.P., Cellina, A.: Differential Inclusions: Set-valued Maps and Viability Theory, vol. 264. Springer, Berlin (1984)
Aubin, J.-P., Frankowska, H.: Set-Valued Analysis. Springer, Berlin (2009)
Barakat, A., Bianchi, P.: Convergence and Dynamical Behavior of the Adam Algorithm for Non Convex Stochastic Optimization (2018). arXiv preprint arXiv:1810.02263
Baydin, A., Pearlmutter, B., Radul, A., Siskind, J.: Automatic differentiation in machine learning: a survey. J. Mach. Learn. Res. 18(1), 5595–5637 (2018)
Benaïm, M.: Dynamics of stochastic approximation algorithms. In: Séminaire de Probabilités XXXIII, pp. 1–68. Springer, Berlin, Heidelberg (1999)
Benaïm, M., Hofbauer, J., Sorin, S.: Stochastic approximations and differential inclusions. SIAM J. Control Optim. 44(1), 328–348 (2005)
Bianchi, P., Hachem, W., Salim, A.: Constant step stochastic approximations involving differential inclusions: stability, long-run convergence and applications. Stochastics 91(2), 288–320 (2019)
Bolte, J., Daniilidis, A., Lewis, A., Shiota, M.: Clarke subgradients of stratifiable functions. SIAM J. Optim. 18(2), 556–572 (2007)
Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1–2), 459–494 (2014)
Borkar, V.: Stochastic Approximation: A Dynamical Systems Viewpoint, vol. 48. Springer, Berlin (2009)
Borwein, J., Lewis, A.S.: Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer, Berlin (2010)
Borwein, J.M., Moors, W.B.: Essentially smooth Lipschitz functions. J. Funct. Anal. 149(2), 305–351 (1997)
Borwein, J.M., Moors, W.B.: A chain rule for essentially smooth Lipschitz functions. SIAM J. Optim. 8(2), 300–308 (1998)
Borwein, J., Moors, W., Wang, X.: Generalized subdifferentials: a Baire categorical approach. Trans. Am. Math. Soc. 353(10), 3875–3893 (2001)
Bottou, L., Bousquet, O.: The tradeoffs of large scale learning. In: Platt, J.C., Koller, D., Singer, Y., Roweis, S.T. (eds.) Advances in Neural Information Processing Systems, vol. 20, pp. 161–168. Curran Associates, Inc. (2008)
Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)
Castera, C., Bolte, J., Févotte, C., Pauwels, E.: An inertial Newton algorithm for deep learning (2019). arXiv preprint arXiv:1905.12278
Clarke, F.H.: Optimization and Nonsmooth Analysis. SIAM, Philadelphia (1983)
Chizat, L., Bach, F.: On the global convergence of gradient descent for over-parameterized models using optimal transport. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31, pp. 3036–3046. Curran Associates, Inc. (2018)
Corliss, G., Faure, C., Griewank, A., Hascoet, L., Naumann, U. (eds.): Automatic Differentiation Of Algorithms: From Simulation to Optimization. Springer, Berlin (2002)
Correa, R., Jofre, A.: Tangentially continuous directional derivatives in nonsmooth analysis. J. Optim. Theory Appl. 61(1), 1–21 (1989)
Coste, M.: An Introduction to O-Minimal Geometry. RAAG notes, Institut de Recherche Mathématique de Rennes, p. 81 (1999)
Davis, D., Drusvyatskiy, D., Kakade, S., Lee, J.D.: Stochastic subgradient method converges on tame functions. Found. Comput. Math. 20(1), 119–154 (2020)
Evans, L.C., Gariepy, R.F.: Measure Theory and Fine Properties of Functions, Revised edn. Chapman and Hall/CRC, London (2015)
Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323 (2011)
Griewank, A., Walther, A.: Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, vol. 105. SIAM, Philadelphia (2008)
Griewank, A.: On stable piecewise linearization and generalized algorithmic differentiation. Optim. Methods Softw. 28(6), 1139–1178 (2013)
Griewank, A., Walther, A., Fiege, S., Bosse, T.: On Lipschitz optimization based on gray-box piecewise linearization. Math. Program. 158(1–2), 383–415 (2016)
Ioffe, A.D.: Nonsmooth analysis: differential calculus of nondifferentiable mappings. Trans. Am. Math. Soc. 266(1), 1–56 (1981)
Ioffe, A.D.: Variational Analysis of Regular Mappings. Springer Monographs in Mathematics. Springer, Cham (2017)
Kakade, S.M., Lee, J.D.: Provably correct automatic sub-differentiation for qualified programs. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31, pp. 7125–7135. Curran Associates, Inc. (2018)
Kurdyka, K.: On gradients of functions definable in o-minimal structures. Ann. l’inst. Fourier 48(3), 769–783 (1998)
Kurdyka, K., Mostowski, T., Parusinski, A.: Proof of the gradient conjecture of R. Thom. Ann. Math. 152(3), 763–792 (2000)
Kushner, H., Yin, G.G.: Stochastic Approximation and Recursive Algorithms and Applications, vol. 35. Springer, Berlin (2003)
Le Cun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
Ljung, L.: Analysis of recursive stochastic algorithms. IEEE Trans. Autom. Control 22(4), 551–575 (1977)
Majewski, S., Miasojedow, B., Moulines, E.: Analysis of nonsmooth stochastic approximation: the differential inclusion approach (2018). arXiv preprint arXiv:1805.01916
Mohammadi, B., Pironneau, O.: Applied Shape Optimization for Fluids. Oxford University Press, Oxford (2010)
Moulines, E., Bach, F.R.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 24, pp. 451–459. Curran Associates, Inc. (2011)
Moreau, J.-J.: Fonctionnelles sous-différentiables, Séminaire Jean Leray (1963)
Mordukhovich, B.S.: Variational Analysis and Generalized Differentiation I: Basic Theory. Springer, Berlin (2006)
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in Pytorch. In: NIPS Workshops (2017)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)
Rockafellar, R.T.: Convex functions and dual extremum problems. Doctoral dissertation, Harvard University (1963)
Rockafellar, R.: On the maximal monotonicity of subdifferential mappings. Pacific J. Math. 33(1), 209–216 (1970)
Rockafellar, R.T., Wets, R.J.B.: Variational Analysis. Springer, Berlin (1998)
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323, 533–536 (1986)
Speelpenning, B.: Compiling fast partial derivatives of functions given by algorithms (No. COO-2383-0063; UILU-ENG-80-1702; UIUCDCS-R-80-1002). Illinois Univ., Urbana (USA). Dept. of Computer Science (1980)
Thibault, L., Zagrodny, D.: Integration of subdifferentials of lower semicontinuous functions on Banach spaces. J. Math. Anal. Appl. 189(1), 33–58 (1995)
Thibault, L., Zlateva, N.: Integrability of subdifferentials of directionally Lipschitz functions. In: Proceedings of the American Mathematical Society, pp. 2939–2948 (2005)
Valadier, M.: Entraînement unilatéral, lignes de descente, fonctions lipschitziennes non pathologiques. C. R. l’Acad. Sci. 308, 241–244 (1989)
van den Dries, L., Miller, C.: Geometric categories and o-minimal structures. Duke Math. J. 84(2), 497–540 (1996)
Wang, X.: Pathological Lipschitz functions in \({\mathbb{R}}^n\). Master thesis, Simon Fraser University (1995)

Acknowledgements

The authors acknowledge the support of AI Interdisciplinary Institute ANITI funding, through the French “Investing for the Future – PIA3” program under the Grant Agreement ANR-19-PI3A-0004, Air Force Office of Scientific Research, Air Force Material Command, USAF, under Grant Numbers FA9550-19-1-7026, FA9550-18-1-0226, and ANR MasDol. J. Bolte acknowledges the support of ANR Chess, Grant ANR-17-EURE-0010 and ANR OMS. The authors would like to thank Lionel Thibault and Sylvain Sorin for a careful reading of an early version of this work, and Gersende Fort for her very valuable comments and discussions on stochastic approximation.

Author information

Authors and Affiliations

Toulouse School of Economics, Université Toulouse 1 Capitole, Toulouse, France
Jérôme Bolte
IRIT, Université de Toulouse, CNRS, Toulouse, France
Edouard Pauwels
DEEL IRT Saint Exupery, Toulouse, France
Edouard Pauwels

Corresponding author

Correspondence to Jérôme Bolte.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Bolte, J., Pauwels, E. Conservative set valued fields, automatic differentiation, stochastic gradient methods and deep learning. Math. Program. 188, 19–51 (2021). https://doi.org/10.1007/s10107-020-01501-5

Received: 05 November 2019
Accepted: 28 March 2020
Published: 15 April 2020
Issue Date: July 2021
DOI: https://doi.org/10.1007/s10107-020-01501-5
