Abstract
Minimizing a convex function of a measure with a sparsity-inducing penalty is a typical problem arising, e.g., in sparse spikes deconvolution or in training two-layer neural networks. We show that this problem can be solved by discretizing the measure and running non-convex gradient descent on the positions and weights of the particles. For measures on a d-dimensional manifold and under some non-degeneracy assumptions, this leads to a global optimization algorithm with a complexity scaling as \(\log (1/\epsilon )\) in the desired accuracy \(\epsilon \), instead of \(\epsilon ^{-d}\) for convex methods. The key theoretical tools are a local convergence analysis in Wasserstein space and an analysis of a perturbed mirror descent in the space of measures. Our bounds involve quantities that are exponential in d, which is unavoidable under our assumptions.
Notes
Extension of the metric and gradients to the whole of \(\Omega \) can be made on a case by case basis, see Sect. 2.2.
The pushforward measure \(T_\# \mu \) is characterized by \(\int \psi \mathrm {d}(T_\#\mu ) = \int (\psi \circ T) \mathrm {d}\mu \) for any continuous function \(\psi \).
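As a discrete illustration (our own toy example, with a hypothetical map `T` and test function `psi`, not taken from the paper), the defining identity can be checked directly for an atomic measure, where the pushforward simply moves the atoms:

```python
import numpy as np

# Toy check of the pushforward identity for an atomic measure
# mu = sum_k w_k delta_{x_k}: then T#mu = sum_k w_k delta_{T(x_k)},
# and integrating psi against T#mu equals integrating psi∘T against mu.
x = np.array([0.1, 0.5, 0.9])      # atom locations
w = np.array([0.2, 0.3, 0.5])      # atom masses

T = lambda t: t ** 2 + 1.0         # an arbitrary continuous map
psi = lambda t: np.cos(t)          # an arbitrary continuous test function

pushed_atoms = T(x)                # atoms of T#mu (masses are unchanged)
lhs = np.sum(w * psi(pushed_atoms))    # ∫ psi d(T#mu)
rhs = np.sum(w * psi(T(x)))            # ∫ (psi∘T) dmu
assert abs(lhs - rhs) < 1e-12
```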
References
Absil, P.-A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2009)
Amari, S.-I.: Natural gradient works efficiently in learning. Neural Comput. 10(2), 251–276 (1998)
Ambrosio, L., Gigli, N., Savaré, G.: Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Springer, Berlin (2008)
Bach, F.: Breaking the curse of dimensionality with convex neural networks. J. Mach. Learn. Res. 18(1), 629–681 (2017)
Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with sparsity-inducing penalties. Found. Trends® Mach. Learn. 4(1), 1–106 (2012)
Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31(3), 167–175 (2003)
Blanchet, A., Bolte, J.: A family of functional inequalities: Łojasiewicz inequalities and displacement convex functions. J. Funct. Anal. 275(7), 1650–1673 (2018)
Boyd, N., Schiebinger, G., Recht, B.: The alternating descent conditional gradient method for sparse inverse problems. SIAM J. Optim. 27(2), 616–639 (2017)
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
Boyer, C., Chambolle, A., De Castro, Y., Duval, V., De Gournay, F., Weiss, P.: On representer theorems and convex regularization. SIAM J. Optim. 29(2), 1260–1281 (2019)
Bredies, K., Pikkarainen, H.K.: Inverse problems in spaces of measures. ESAIM Control Optim. Calc. Var. 19(1), 190–218 (2013)
Burago, D., Burago, Y., Ivanov, S.: A Course in Metric Geometry, vol. 33. American Mathematical Society, Providence (2001)
Candès, E.J., Fernandez-Granda, C.: Towards a mathematical theory of super-resolution. Commun. Pure Appl. Math. 67(6), 906–956 (2014)
Catala, P., Duval, V., Peyré, G.: A low-rank approach to off-the-grid sparse deconvolution. J. Phys. Conf. Ser. 904, 012015 (2017)
Champagnat, F., Herzet, C.: Atom selection in continuous dictionaries: reconciling polar and SVD approximations. In: ICASSP 2019-IEEE 44th International Conference on Acoustics, Speech, and Signal Processing, pp. 1–5. IEEE (2019)
Chen, Y., Li, W.: Wasserstein natural gradient in statistical manifolds with continuous sample space. arXiv preprint arXiv:1805.08380 (2018)
Chizat, L., Bach, F.: On the global convergence of gradient descent for over-parameterized models using optimal transport. In: Advances in Neural Information Processing Systems, pp. 3040–3050 (2018)
Chizat, L., Oyallon, E., Bach, F.: On lazy training in differentiable programming. In: Advances in Neural Information Processing Systems, pp. 2937–2947 (2019)
Chizat, L., Peyré, G., Schmitzer, B., Vialard, F.-X.: An interpolating distance between optimal transport and Fisher–Rao metrics. Found. Comput. Math. 18(1), 1–44 (2018)
Chizat, L., Peyré, G., Schmitzer, B., Vialard, F.-X.: Unbalanced optimal transport: dynamic and Kantorovich formulations. J. Funct. Anal. 274(11), 3090–3123 (2018)
Cohn, D.L.: Measure Theory, vol. 165. Springer, Berlin (1980)
Daubechies, I., Defrise, M., De Mol, C.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math. 57(11), 1413–1457 (2004)
De Castro, Y., Gamboa, F.: Exact reconstruction using Beurling minimal extrapolation. J. Math. Anal. Appl. 395(1), 336–354 (2012)
De Castro, Y., Gamboa, F., Henrion, D., Lasserre, J.-B.: Exact solutions to super resolution on semi-algebraic domains in higher dimensions. IEEE Trans. Inf. Theory 63(1), 621–630 (2017)
Denoyelle, Q., Duval, V., Peyré, G., Soubies, E.: The sliding Frank–Wolfe algorithm and its application to super-resolution microscopy. Inverse Probl. 36(1), 014001 (2019)
Du, S.S., Zhai, X., Póczos, B., Singh, A.: Gradient descent provably optimizes over-parameterized neural networks. In: International Conference on Learning Representations (2019)
Dumitrescu, B.: Positive Trigonometric Polynomials and Signal Processing Applications. Springer, Berlin (2007)
Duval, V., Peyré, G.: Exact support recovery for sparse spikes deconvolution. Found. Comput. Math. 15(5), 1315–1355 (2015)
Flinth, A., de Gournay, F., Weiss, P.: On the linear convergence rates of exchange and continuous methods for total variation minimization. Math. Program. (2020). https://doi.org/10.1007/s10107-020-01530-0
Flinth, A., Weiss, P.: Exact solutions of infinite dimensional total-variation regularized problems. Inf. Inference 8(3), 407–443 (2019)
Gallouët, T., Monsaingeon, L.: A JKO splitting scheme for Kantorovich–Fisher–Rao gradient flows. SIAM J. Math. Anal. 49(2), 1100–1130 (2017)
Gautschi, W.: Numerical Analysis. Springer, Berlin (1997)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
Hauer, D., Mazón, J.: Kurdyka–Łojasiewicz–Simon inequality for gradient flows in metric spaces. Trans. Am. Math. Soc. (2019)
Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall, Englewood Cliffs (1994)
Jacot, A., Gabriel, F., Hongler, C.: Neural tangent kernel: convergence and generalization in neural networks. In: Advances in Neural Information Processing Systems, pp. 8571–8580 (2018)
Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer, Berlin (2016)
Kondratyev, S., Monsaingeon, L., Vorotnikov, D.: A new optimal transport distance on the space of finite Radon measures. Adv. Differ. Equ. 21(11/12), 1117–1164 (2016)
Krichene, W., Bayen, A., Bartlett, P.L.: Accelerated mirror descent in continuous and discrete time. In: Advances in Neural Information Processing Systems, pp. 2845–2853 (2015)
Kushner, H.J., Yin, G.G.: Stochastic Approximation and Recursive Algorithms and Applications, vol. 35. Springer, Berlin (2003)
Li, W., Montúfar, G.: Natural gradient via optimal transport. Inf. Geom. 1(2), 181–214 (2018)
Liero, M., Mielke, A., Savaré, G.: Optimal entropy-transport problems and a new Hellinger–Kantorovich distance between positive measures. Invent. Math. 211(3), 969–1117 (2018)
Mairal, J., Bach, F., Ponce, J.: Sparse modeling for image and vision processing. Found. Trends® Comput. Graph. Vis. 8(2–3), 85–283 (2014)
Maniglia, S.: Probabilistic representation and uniqueness results for measure-valued solutions of transport equations. J. Math. Pures Appl. 87(6), 601–626 (2007)
Mei, S., Montanari, A., Nguyen, P.-M.: A mean field view of the landscape of two-layer neural networks. Proc. Natl. Acad. Sci. 115(33), E7665–E7671 (2018)
Menz, G., Schlichting, A.: Poincaré and logarithmic Sobolev inequalities by decomposition of the energy landscape. Ann. Probab. 42(5), 1809–1884 (2014)
Nemirovsky, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, New York (1983)
Nitanda, A., Suzuki, T.: Stochastic particle gradient descent for infinite ensembles. arXiv preprint arXiv:1712.05438 (2017)
Polyak, B.T.: Gradient methods for minimizing functionals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki 3(4), 643–653 (1963)
Poon, C., Keriven, N., Peyré, G.: The geometry of off-the-grid compressed sensing. arXiv preprint arXiv:1802.08464 (2018)
Rotskoff, G., Jelassi, S., Bruna, J., Vanden-Eijnden, E.: Global convergence of neuron birth–death dynamics. In: Proceedings of the 36th International Conference on Machine Learning, PMLR (2019)
Rotskoff, G., Vanden-Eijnden, E.: Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks. In: Advances in Neural Information Processing Systems, pp. 7146–7155 (2018)
Santambrogio, F.: Optimal Transport for Applied Mathematicians. Birkhäuser, New York (2015)
Sirignano, J., Spiliopoulos, K.: Mean field analysis of neural networks: a law of large numbers. SIAM J. Appl. Math. 80(2), 725–752 (2020)
Tang, G., Bhaskar, B.N., Shah, P., Recht, B.: Compressed sensing off the grid. IEEE Trans. Inf. Theory 59(11), 7465–7490 (2013)
Traonmilin, Y., Aujol, J.-F.: The basins of attraction of the global minimizers of the non-convex sparse spike estimation problem. Inverse Probl. 36(4), 045003 (2020)
Trillos, N.G., Slepčev, D.: On the rate of convergence of empirical measures in \(\infty \)-transportation distance. Can. J. Math. 67(6), 1358–1383 (2015)
Wang, Y., Li, W.: Accelerated information gradient flow. arXiv preprint arXiv:1909.02102 (2019)
Weed, J., Bach, F.: Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli 25(4A), 2620–2648 (2019)
Wei, C., Lee, J.D., Liu, Q., Ma, T.: Regularization matters: generalization and optimization of neural nets vs their induced kernel. In: Advances in Neural Information Processing Systems, pp. 9712–9724 (2019)
Acknowledgements
The author thanks Francis Bach for fruitful discussions related to this work and the anonymous referees for their thorough reading and suggestions.
Appendices
Dealing with signed measures
Let us show that problems over signed measures with total variation regularization are covered by problem (1), after a suitable reformulation. Consider a function \({\tilde{\phi }}{:}\,{\tilde{\Theta }}\rightarrow {\mathcal {F}}\) and the functional on signed measures \({\tilde{J}}{:}\,{\mathcal {M}}({\tilde{\Theta }})\rightarrow {\mathbb {R}}\) defined as
where \(\vert \mu \vert ({\tilde{\Theta }})\) is the total variation of \(\mu \). This is a continuous version of the LASSO problem, known as BLASSO [23]. Define \(\Theta \) as the disjoint union of two copies \({\tilde{\Theta }}_+\) and \({\tilde{\Theta }}_-\) of \({\tilde{\Theta }}\) and define the symmetrized function \(\phi {:}\,\Theta \rightarrow {\mathcal {F}}\) as
With this choice of \(\phi \), minimizing (17) or minimizing (1) are equivalent, in a sense made precise in Proposition A.1. This symmetrization procedure, also suggested in [17], is simple to implement in practice: in Algorithm 1, we fix at initialization the sign attributed to each particle—depending on whether it belongs to \({\tilde{\Theta }}_+\) or \({\tilde{\Theta }}_-\)—and do not change it throughout the iterations.
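The fixed-sign scheme can be sketched numerically. The following is an illustrative toy example, not the paper's Algorithm 1: the feature map, data, step size, and the projected gradient step on the weights are all hypothetical choices made for the sketch.

```python
import numpy as np

# Sketch of the symmetrization: each particle gets a sign fixed at
# initialization -- encoding whether it lives on the copy Theta_+ or
# Theta_- -- and only its nonnegative weight and position are updated.
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 50)
sig = 0.1

def feats(theta):
    # rows: particles; columns: samples of a Gaussian feature on the grid
    return np.exp(-((grid[None, :] - theta[:, None]) ** 2) / (2 * sig ** 2))

# Signed ground truth +0.8*delta_{0.3} - 0.5*delta_{0.7}, seen through feats
y = 0.8 * feats(np.array([0.3]))[0] - 0.5 * feats(np.array([0.7]))[0]
lam = 1e-3                                          # total-variation penalty

n = 20
theta = rng.random(n)                               # particle positions
w = np.full(n, 1.0 / n)                             # nonnegative weights
sign = np.where(np.arange(n) < n // 2, 1.0, -1.0)   # signs, fixed once for all

def loss(theta, w):
    r = (sign * w) @ feats(theta) - y
    return 0.5 * r @ r + lam * w.sum()

loss0, lr = loss(theta, w), 0.005
for _ in range(4000):
    P = feats(theta)
    r = (sign * w) @ P - y
    grad_w = sign * (P @ r) + lam                   # gradient in the weights
    dP = P * (grid[None, :] - theta[:, None]) / sig ** 2
    grad_theta = sign * w * (dP @ r)                # gradient in the positions
    w = np.maximum(w - lr * grad_w, 0.0)            # weights stay nonnegative
    theta = theta - lr * grad_theta

assert loss(theta, w) < loss0 and (w >= 0).all()
```

The signs never change during the iterations, exactly as described above; only the nonnegative weights and the positions move.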
Proposition A.1
The infima of (17) and (1) are the same and:
(i) if \({\tilde{\mu }}\) is a minimizer of \({\tilde{J}}\) and \({\tilde{\mu }} = {\tilde{\mu }}_+-{\tilde{\mu }}_-\) is its Jordan decomposition, then the measure whose restriction to \({\tilde{\Theta }}_+\) (resp. \({\tilde{\Theta }}_-\)) coincides with \({\tilde{\mu }}_+\) (resp. \({\tilde{\mu }}_-\)) is a minimizer of J;
(ii) conversely, if \(\mu \) is a minimizer of J, then \(\mu _+-\mu _-\), where \(\mu _+\) (resp. \(\mu _-\)) is the restriction of \(\mu \) to \({\tilde{\Theta }}_+\) (resp. \({\tilde{\Theta }}_-\)), is a minimizer of \({\tilde{J}}\).
Proof
We recall that for any decomposition of a signed measure as a difference of nonnegative measures \({\tilde{\mu }} = {\tilde{\mu }}_+-{\tilde{\mu }}_-\), it holds \(\vert {\tilde{\mu }} \vert ({\tilde{\Theta }}) \le {\tilde{\mu }}_+({\tilde{\Theta }})+{\tilde{\mu }}_-({\tilde{\Theta }})\), with equality if and only if \(({\tilde{\mu }}_+,{\tilde{\mu }}_-)\) is the Jordan decomposition of \({\tilde{\mu }}\) [21, Sec. 4.1]. It follows that, starting from any \({\tilde{\mu }}\in {\mathcal {M}}({\tilde{\Theta }})\), the construction in (i) yields a measure \(\mu \in {\mathcal {M}}_+(\Theta )\) satisfying \({\tilde{J}}({\tilde{\mu }}) = J(\mu )\). Conversely, starting from any \(\mu \in {\mathcal {M}}_+(\Theta )\), the construction in (ii) yields a measure \({\tilde{\mu }}\in {\mathcal {M}}({\tilde{\Theta }})\) satisfying \({\tilde{J}}({\tilde{\mu }})\le J(\mu )\), with equality if and only if \((\mu _+,\mu _-)\) is a Jordan decomposition. The conclusion follows. \(\square \)
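The inequality and its equality case can be checked on a toy discrete example (our own, not from the paper), where a signed measure on finitely many atoms is just a vector of signed masses:

```python
import numpy as np

# For a discrete signed measure m, any decomposition m = p - q with
# p, q >= 0 satisfies p(Theta) + q(Theta) >= |m|(Theta), with equality
# exactly for the Jordan decomposition p = max(m, 0), q = max(-m, 0).
m = np.array([0.5, -1.2, 0.0, 2.0])          # signed masses of the atoms
tv = np.abs(m).sum()                          # total variation |m|(Theta)

mp, mn = np.maximum(m, 0.0), np.maximum(-m, 0.0)   # Jordan decomposition
assert np.isclose(mp.sum() + mn.sum(), tv)          # equality case

# Any other decomposition adds common nonnegative mass c to both parts
c = np.array([0.3, 0.1, 0.2, 0.0])
p, q = mp + c, mn + c
assert np.allclose(p - q, m)                  # still a valid decomposition
assert p.sum() + q.sum() > tv                 # strictly larger total mass
```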
Generic non-convex minimization
In this section, we show that any smooth optimization problem on a manifold is equivalent to solving a problem of the form (1). This corresponds to the case of a scalar-valued \(\phi \).
Proposition B.1
Let \(\phi {:}\,\Theta \rightarrow {\mathbb {R}}\) be a smooth function with minimum \(\phi ^\star <0\) that admits a global minimizer, and let
where \(0<\lambda <-2\phi ^\star \). Then \(\emptyset \ne {{\,\mathrm{spt}\,}}\nu ^\star \subset \arg \min \phi \) so minimizers of \(\phi \) can be built from \(\nu ^\star \). Reciprocally, from a minimizer of \(\phi \), one can build a minimizer for (18).
Proof
For a measure \(\nu \in {\mathcal {M}}_+(\Theta )\), we define \(f_\nu :=\int _\Theta \phi (\theta )\mathrm {d}\nu (\theta ) \in {\mathbb {R}}\). It holds
Now suppose that \(\nu \) is a global minimizer of J. Then the optimality condition in Proposition 3.1 implies that
Solving for \(f_\nu \) is possible if \(\lambda \nu (\Theta )<1\) and leads to \(f_\nu =\sqrt{1-\lambda \nu (\Theta )} -1\). We also deduce from the fact that \(f_\nu >-1\) that \(\arg \min J'_\nu = \arg \min \phi \), and so \({{\,\mathrm{spt}\,}}\nu \subset \arg \min \phi \). It remains to find under which condition \(\nu (\Theta )>0\). We use the fact that \(f_\nu = \phi ^\star \nu (\Theta )\) in Eq. (19), and get
which in particular satisfies \(\lambda \nu (\Theta )<1\). Thus, as long as \(-2\phi ^\star >\lambda \), we have \(\nu (\Theta )>0\). Finally, we verify that global minimizers exist, so that the above reasoning makes sense. If \(-2\phi ^\star -\lambda \le 0\), then \(\nu =0\) satisfies the global optimality conditions. Otherwise, choose \(\theta ^\star \) a minimizer for \(\phi ^\star \) and define \(\nu = \nu (\Theta )\delta _{\theta ^\star }\) with the value above for \(\nu (\Theta )\), which also satisfies the global optimality conditions. \(\square \)
Wasserstein gradient flow
In this section, we recall and adapt some results and proofs from [17], for the sake of completeness.
C.1 Existence
For this result, we assume (A1-2). For a compactly supported initial condition \(\mu _0\in {\mathcal {P}}_2(\Omega )\), the proof of existence for Wasserstein gradient flows [Eq. (7)] in [17] goes through, as it is based on compactness arguments which translate directly to this Riemannian setting (more precisely, we apply the Arzelà–Ascoli compactness criterion to curves in the Wasserstein space on the cone of \(\Theta \), which is a complete metric space [42]). Note that these arguments do not require convexity of R, but in order to guarantee global existence in time, we need to assume that \(\nabla R\) is bounded on sub-level sets of F.
For the existence of solutions for projected dynamics on \(\Theta \) for any \(\nu _0\in {\mathcal {M}}_+(\Theta )\), consider a measure \(\mu _0\in {\mathcal {M}}_+(\Omega )\) such that \({\mathsf {h}}\mu _0=\nu _0\) (see [42] for such a construction) and the corresponding Wasserstein gradient flow \((\mu _t)_{t\ge 0}\) for F. Then \({\mathsf {h}}\mu _t\) is a solution to (9).
For the existence of Wasserstein gradient flows [Eq. (7)] for F when \(\mu _0\) is not compactly supported, proceed as follows: there exists a Wasserstein–Fisher–Rao gradient flow \(\nu _t\) satisfying \(\nu _0={\mathsf {h}}\mu _0\). Now we can simply define \(\mu _t\) as the solution to \(\partial _t \mu _t = \mathrm {div}(\mu _t J'_{\nu _t})\). It can be directly checked that \({\mathsf {h}}\mu _t = \nu _t\) for \(t\ge 0\) and thus \(\mu _t\) is a solution to Eq. (7).
We do not attempt to show uniqueness in the present work. Note that it is proved in [17] for the case where \(\Theta \) is a sphere, by applying the theory developed in [3].
C.2 Asymptotic global convergence
In this section, we give a short proof of Theorem 2.2, adapted from [17]. The next lemma is the crux of the global convergence proof. It gives a criterion for escaping neighborhoods of measures that are not minimizers.
Lemma C.1
(Criterion to escape local minima) Under (A1-3), let \(\nu \in {\mathcal {M}}_+(\Theta )\) be such that \(v^\star :=\min _{\theta \in \Theta } J'_{\nu }(\theta )<0\). Then there exist \(v \in [2v^\star /3,v^\star /3]\) and \(\epsilon >0\) such that if \((\nu _t)_{t\ge 0}\) is a projected gradient flow of J satisfying \(\Vert \nu - \nu _{t_0}\Vert _\mathrm {BL}^*< \epsilon \) for some \(t_0\ge 0\) and \(\nu _{t_0}((J'_{\nu })^{-1}(]-\infty ,v]))>0\), then there exists \(t_1>t_0\) such that \(\Vert \nu - \nu _{t_1}\Vert _\mathrm {BL}^*\ge \epsilon \).
Proof
We first assume that \(J'_{\nu }\) also takes nonnegative values, and let \(v\in [2v^\star /3,v^\star /3]\) be a regular value of \(J'_{\nu }\), i.e. such that \(\Vert \nabla J'_{\nu }\Vert \) does not vanish on the v level-set of \(J'_\nu \). Such a v is guaranteed to exist by the Morse–Sard lemma and our assumption that \(\phi \) is d-times continuously differentiable, which implies that \(J'_{\nu }\) is as well. Let \(K_v = (J'_{\nu })^{-1}(]-\infty ,v])\subset \Theta \) be the corresponding sublevel set. By the regular value theorem, the boundary \(\partial K_v\) of \(K_v\) is a differentiable orientable compact submanifold of \(\Theta \) and is orthogonal to \(\nabla J'_{\nu }\). By construction, it holds for all \(\theta \in K_v\) that \(J'_{\nu }(\theta ) \le v^\star /3\) and, by the regular value property, there is \(u>0\) such that \(\nabla J'_\nu (\theta )\cdot \mathbf {n}_{\theta } > u\) for all \(\theta \in \partial K_v\), where \(\mathbf {n}_\theta \) is the unit normal vector to \(\partial K_v\) pointing outwards. Since the map \(\nu \mapsto J'_{\nu }\) is locally Lipschitz as a map \(({\mathcal {M}}_+(\Theta ), \Vert \cdot \Vert _{\mathrm {BL}}^*) \rightarrow ({\mathcal {C}}^1(\Theta ),\Vert \cdot \Vert _{\mathrm {BL}})\), there exists \(\epsilon >0\) such that if \(\nu _t\in {\mathcal {M}}_+(\Theta )\) satisfies \(\Vert \nu _t -\nu \Vert _{\mathrm {BL}}^*<\epsilon \), then
Now let us consider a projected gradient flow \((\nu _t)_{t\ge 0}\) such that \(\Vert \nu _{t_0} -\nu \Vert _{\mathrm {BL}}^*<\epsilon \) and let \(t_1>t_0\) be the first time such that \(\Vert \nu _{t_1}-\nu \Vert _{\mathrm {BL}}^*\ge \epsilon \), which might a priori be infinite. For \(t\in {[t_0,t_1[}\), it holds
where the first inequality can be seen by using the "characteristic" representation of solutions to (9), see [44]. It follows by Grönwall's lemma that \(\nu _t(K_v)\ge \exp (-\alpha v^\star (t-t_0))\nu _{t_0}(K_v)\); since \(-v^\star >0\), the mass of \(K_v\) grows exponentially while the total mass remains bounded along the flow, which implies that \(t_1\) is finite. Finally, if we had not assumed that 0 is in the range of \(J'_\nu \) in the first place, then we could simply take \(K_v=\Theta \) and conclude by similar arguments. \(\square \)
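For completeness, the Grönwall step can be spelled out; we take the preceding differential inequality in the generic form \(m'(t)\ge c\,m(t)\) with \(m(t):=\nu _t(K_v)\) and a constant \(c>0\) depending on \(\alpha \) and \(v^\star \) (this reading of the displayed inequality is our assumption):

```latex
% Gronwall step, assuming m(t) := \nu_t(K_v) satisfies m'(t) >= c m(t)
% for some constant c > 0 on [t_0, t_1[:
\frac{\mathrm{d}}{\mathrm{d}t}\Bigl(e^{-c(t-t_0)}\,m(t)\Bigr)
  = e^{-c(t-t_0)}\bigl(m'(t)-c\,m(t)\bigr) \;\ge\; 0
\quad\Longrightarrow\quad
m(t)\;\ge\;e^{\,c(t-t_0)}\,m(t_0), \qquad t\in[t_0,t_1[.
```

Since the total mass \(\nu _t(\Theta )\) stays bounded along the flow while the right-hand side grows exponentially, the escape time \(t_1\) must indeed be finite.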
Proof of Theorem 2.2
Let \(\nu _\infty \in {\mathcal {M}}_+(\Theta )\) be the weak limit of \((\nu _t)_t\). It satisfies the stationarity condition \(\int \vert J'_{\nu _\infty }\vert ^2\mathrm {d}\nu _{\infty }=0\). Then by the optimality conditions in Proposition 3.1, either \(\nu _\infty \) is a minimizer of J, or \(J'_{\nu _\infty }\) takes negative values. For the sake of contradiction, assume the latter. Let \(\epsilon \) be given by Lemma C.1 and let \(t_0 = \sup \{ t\ge 0;\; \Vert \nu _t-\nu _\infty \Vert _{\mathrm {BL}}^*\ge \epsilon \}\), which is finite since we have assumed that \(\nu _t\) weakly converges to \(\nu _\infty \). But \(\nu _{t_0}\) has full support since it can be written as the pushforward of a rescaled version of \(\nu _0\) by a diffeomorphism, see [44, Eq. (1.3)] (note that this step is considerably simplified here by the fact that we do not have a potentially non-smooth regularizer, unlike in [17] where topological degree theory comes into play). Then the conclusion of Lemma C.1 contradicts the definition of \(t_0\). \(\square \)
Proof of the gradient inequality
In this whole section, we consider without loss of generality \(\alpha =\beta =1\) (we explain in Sect. D.7 how to adapt the results to arbitrary \(\alpha ,\beta \)). For simplicity, we only track the dependencies in \(\nu \) and \(\tau \). Any quantity that is independent of \(\nu \) and \(\tau \) is treated as a constant and represented by \(C,C',C''>0\); the quantities these symbols refer to may change from line to line.
D.1 Bound on the transport distance to minimizers
Given a measure \(\nu \in {\mathcal {M}}_+(\Theta )\), we consider the local centered moments introduced in Definition 3.6 and in addition, for \(i\in \{1,\dots ,m^\star \}\),
Finally, we will quantify errors with the following quantity
which also controls the \({\widehat{W}}_2\) distance (introduced in Sect. 3.1) to the minimizer \(\nu ^\star \) of J, as shown in the next proposition.
Lemma D.1
It holds \({\widehat{W}}_2(\nu ,\nu ^\star )\le W_\tau (\nu )(1+O(\tau ^2)+O(W_\tau (\nu )^2))\).
Proof
Note that for \(W_\tau (\nu )\) small enough, it holds \(\nu (\Theta _i)>0\) for \(i\in \{1,\dots ,m^\star \}\). Let \(\mu \in {\mathcal {P}}_2(\Omega )\) be such that \({\mathsf {h}}\mu = \nu \) and consider the transport map \(T{:}\,\Omega \rightarrow \Omega \) defined as
By construction, it holds \({\mathsf {h}}(T_\#\mu ) = \nu ^\star \). Let us estimate the transport cost associated to this map
The geodesic distance associated to the cone metric is
Now, if we only consider points \(\theta \in \Theta _i\) with \({\tilde{\theta }}\) their coordinates in a normal frame centered at \(\theta _i\) (note that in all other proofs, we do not need to distinguish between \(\theta \) and \({\tilde{\theta }}\)), we have the approximation
Let us decompose \(T(r,\theta )\) as \((rT^r(\theta ),T^\theta (\theta ))\) and estimate the two contributions forming \({\mathcal {T}}\) separately. On the one hand, we have
On the other hand, we have
As a consequence, we have \({\mathcal {T}}= W_\tau (\nu )(1 + O(W_\tau (\nu )^2)+ O(\tau ^2))\). Remark that this estimate does not depend on the chosen lifting \(\mu \) satisfying \({\mathsf {h}}\mu =\nu \). We then conclude by using the characterization in [42, Thm. 7.20] for the distance \({\widehat{W}}_2\):
Thus \({\widehat{W}}_2(\nu ,\nu ^\star )^2 \le W_2(\mu ,T_\#(\mu ))^2\le {\mathcal {T}}\), and the result follows. \(\square \)
D.2 Local expansion lemma
Lemma D.2
(Expansion around \(\nu ^\star )\) Let \(\psi \) be any (vector or real-valued) smooth function on \(\Theta \) and \(\nu \in {\mathcal {M}}_+(\Theta )\). If \(\tau >0\) is an admissible radius, then the following first and second-order expansions hold
where \(M_{k,\psi }(\theta _i,\theta )\) is the remainder in the \((k-1)\)-th order Taylor expansion of \(\psi \) around \(\theta _i\) in local coordinates (and we recall that \({\bar{\nabla }} \psi := (2\psi ,\nabla \psi )\)).
Proof
By a Taylor expansion of \(\psi \) around \(\theta _i\) for \(i\in \{1,\dots ,m^\star \}\), it holds
and substracting \(\int _{\Theta _i}\psi \mathrm {d}\nu ^\star = r_i^2\phi (\theta _i)\) yields
where we have used a bias-variance decomposition for the quadratic term. The result follows by summing the integrals over each \(\Theta _i\) and using the expression of b. \(\square \)
D.3 Bound on the distance to minimizers
In the next lemma, we bound the quantity \(W_\tau (\nu )\) from Eq. (20) globally in terms of the function values. It involves the quantity \(v^\star >0\), defined so that for any local minimum \(\theta \) of \(J'_{\nu ^\star }\), either \(\theta = \theta _i\) for some \(i\in \{1,\dots ,m^\star \}\) or \(J'_{\nu ^\star }(\theta )\ge v^\star \) (this quantity is positive under (A5)). We also recall that \({\tilde{b}}^\theta _i = {\bar{r}}_i \delta \theta _i\), as defined in Appendix D.3.
Lemma D.3
(Global distance bound) Under (A1-5), let \(\tau _{\mathrm {adm}}\) be an admissible radius \(\tau \) as in Definition 3.6, fix some \(J_{\max }>0\) and let
Then there exists \(C, C'>0\) such that for all \(\tau \le \tau _0\) and \(\nu \in {\mathcal {M}}_+(\Theta )\) such that \(J(\nu )\le J_{\max }\), it holds
Proof
Let us write \(f_\nu :=\int \phi \mathrm {d}\nu \) and \(f^\star = \int \phi \mathrm {d}\nu ^\star \). By strong convexity of R at \(f^\star \), and optimality of \(\mu ^\star \), there exists \(C>0\) such that for all \(\nu \in {\mathcal {M}}_+(\Theta )\) it holds
To prove the first claim, we thus have to bound \(W_\tau (\nu )\) using the terms in the right-hand side of (21).
Step 1 By a Taylor expansion, one has for \(\theta \in \Theta _i\) for \(i\in \{1,\dots ,m^\star \}\),
Thus, if \(\Vert \theta - \theta _i\Vert \le 3\sigma _{\min }(H)/(2 \mathrm {Lip}( \nabla ^2 J'_{\nu ^\star }))\), then \(J'_{\nu ^\star }(\theta ) \ge \frac{1}{4} (\theta -\theta _i)^\intercal H_i (\theta -\theta _i)\) for \(\theta \in \Theta _i\). Decomposing the integral of this quadratic term into bias and variance, we get
and we deduce a first bound by summing the terms for \(i\in \{1,\dots ,m^\star \}\),
Step 2 In order to lower bound the integral over \(\Theta _0\), we first derive a lower bound for \(J'_{\nu ^\star }\) on \(\Theta _0\). This is a continuously differentiable and nonnegative function on the closed domain \(\Theta _0\), so its minimum is attained either at a local minimum in the interior of \(\Theta _0\) or on its boundary. Using the quadratic lower bound from the previous paragraph, it follows that for \(\theta \in \Theta _0\),
Thus, if we also assume that \(\tau \le 2\sqrt{v^\star /\sigma _{\min }(H)}\) then \(J'_{\nu ^\star }(\theta )\ge \tau ^2 \sigma _{\min }(H)/4\) for \(\theta \in \Theta _0\) and it follows that
Using inequality (21) we have shown so far that
Notice that \({\tilde{W}}_\tau (\nu )\) is similar to \(W_\tau (\nu )\) but it does not contain the terms controlling the deviations of mass \(\vert {\bar{r}}_i-r_i\vert \). These quantities can be controlled by using the coercivity of R, i.e. the last term in (21), as we do now.
Step 3 Using the first order expansion of Lemma D.2 then squaring gives
Since we have assumed that K is positive definite, it follows
and thus, after rearranging the terms
It follows that \(\Vert b\Vert \le C\Vert f_\nu -f^\star \Vert + C{\tilde{W}}_\tau (\nu )^2\). Also, by inequality (21), if \(J(\nu )\le J_{\max }\), then \(\Vert f_\nu -f^\star \Vert ^2\le C(J(\nu )-J^\star )\). Moreover, by inequality (22), we get
We finally combine with the bound on \({\tilde{W}}_\tau (\nu )\) to conclude, since \(W_\tau (\nu )^2\le {\tilde{W}}_\tau (\nu )^2+\Vert b\Vert ^2\). \(\square \)
D.4 Proof of the distance inequality (Proposition 3.2)
By Lemma D.1, it holds
Moreover, by Lemma D.3, there exists \(\tau _0>0\) and \(C>0\) such that
Combining these two lemmas, it follows that for some \(C'>0\), we have
This also implies a control on the Bounded-Lipschitz distance since it holds \((\Vert \nu - \nu ^\star \Vert _{\mathrm {BL}}^*)^2\le (2+\pi ^2/2)(\nu (\Theta )+\nu ^\star (\Theta )){\widehat{W}}_2(\nu ,\nu ^\star )^2\), see [42, Prop. 7.18].
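Spelling the combination out (taking the bound of Lemma D.3 in the form \(W_\tau (\nu )^2\le C\,(J(\nu )-J^\star )\), which is our reading of how it is used here):

```latex
% Combining Lemma D.1 with Lemma D.3: for J(\nu) \le J_{\max} and
% \tau \le \tau_0,
\widehat{W}_2(\nu,\nu^\star)^2
 \;\le\; W_\tau(\nu)^2\,\bigl(1+O(\tau^2)+O(W_\tau(\nu)^2)\bigr)^2
 \;\le\; C'\,\bigl(J(\nu)-J^\star\bigr).
```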
D.5 Local estimate of the objective
We now prove a local expansion formula for J.
Proposition D.4
(Local expansion) It holds
where \(\mathop {\mathrm {err}}(\tau ,\nu ) =O( \tau (\Vert {\tilde{b}}^\theta \Vert ^2 + \Vert s\Vert ^2) + W_\tau (\nu )^3)\). In particular, if \(\tau \) is fixed small enough,
Proof
Let us write \(f_\nu :=\int \phi \mathrm {d}\nu \) and \(f^\star = \int \phi \mathrm {d}\nu ^\star \). By a second order Taylor expansion of R around \(f^\star \), we have
Using the first order expansion of Lemma D.2 for \(\phi \), we get \( \Vert f_\nu -f^\star \Vert ^2_\star = b^\intercal Kb + O(W_{\tau }(\nu )^3) \). Also, using the second order expansion of Lemma D.2 for \(J'_{\nu ^\star }\) and using the fact that \(J'_{\nu ^\star }\) and its gradient vanish for all \(\theta _i\), we get
and the expansion follows. Notice also that in the expression of \(J(\nu )\), \({\bar{r}}_i\) and \(r_i\) are interchangeable up to introducing higher order error, since \(\vert r_i - {\bar{r}}_i\vert = O(\vert b^r_i\vert )\) (and also \(\Vert {\tilde{b}}^\theta \Vert = \Vert b^\theta \Vert (1+O(W_\tau (\nu )))\)). \(\square \)
D.6 Local estimate of the gradient norm
Proposition D.5
(Gradient estimate) For \(\nu \in {\mathcal {P}}_2(\Omega )\), it holds
where \(\mathop {\mathrm {err}}(\tau ,\nu )\lesssim \tau (\Vert {\tilde{b}}^\theta \Vert ^2 +\Vert s\Vert ^2) +W_\tau (\nu )^3\). In particular, if \(\tau \) is fixed small enough
Proof
For this proof, we write \(f_\nu - f^\star = \delta f_0 +\delta f_b + \delta f_{\mathrm {err}}\) where
where the decomposition follows from Lemma D.2. The expression for the norm of the gradient is as follows:
where \({\bar{\nabla }} J = (2J,\nabla J)\). We start with the following decomposition for \(\theta \in \Theta _i\) (recall that \(J'_\nu (\theta ) =\langle \phi (\theta ),\nabla R(\int \phi \mathrm {d}\nu )\rangle +\lambda \)):
Here we use the notation \(\langle \cdot ,\cdot \rangle _\star \) to denote the quadratic form associated to \(\nabla ^2 R(f^\star )\). Thanks to the optimality conditions \({\bar{\nabla }} J'_{\nu ^\star }(\theta _i)=0\) for \(i\in \{1,\dots ,m\}\), we get
where N collects the higher order terms and is defined as
where \(\Vert {\bar{\nabla }}_j M_{\phi ,3}(\theta _i,\theta )\Vert = O(\Vert \theta -\theta _i\Vert ^2)\) if \(j>0\) and \(O(\Vert \theta -\theta _i\Vert ^3)\) if \(j=0\). Expanding the square gives the following ten terms:
Terms (I)–(II) are the main terms in the expansion, while the other terms are of higher order. The term (I) is a local curvature term and can be expressed as \(\mathrm {(I)} = \sum _{i=1}^m {\bar{r}}_i^2 {{\,\mathrm{tr}\,}}\Sigma _i H_i^2\). The term (II) is a global interaction term which reads
where the entries of \({\bar{K}}\) and \({\bar{H}}\) differ from those of K and H by a factor \({\bar{r}}_i/r_i\). More precisely,
and similarly for \({\bar{H}}- H\). Since \(\vert {\bar{r}}_i/r_i-1\vert = O(\vert b^r_i\vert )\) we have \(\sigma _{\max }({\bar{K}}-K )=O(W_\tau (\nu ))\). It follows, by expanding the square, that
The remaining terms are error terms, that we estimate directly in terms of \(W_\tau (\nu )\) and \(\tau \). We use in particular the fact that by Hölder’s inequality, \(\int _{\Theta _i} \Vert \theta - {\bar{\theta }}_i\Vert \mathrm {d}\nu (\theta ) = O({\bar{r}}_i^2 {{\,\mathrm{tr}\,}}\Sigma _i^{\frac{1}{2}})\). One has
- \(\mathrm {(III)} =O\left( \sum _{i=1}^m {\bar{r}}_i^2 {\bar{r}}_0^4 \right) = O(W_{\tau }(\nu )^4)\);
- \(\mathrm {(IV)} = \mathrm {(V)} = 0\) because the integral of the terms \(H_i(\theta -{\bar{\theta }}_i)\) vanishes;
- \(\mathrm {(VI)} = O\left( \sum _{i=1}^m {\bar{r}}_i^2 (\Vert b\Vert +\Vert \delta \theta _i\Vert )\cdot {\bar{r}}_0^2\right) =O(W_{\tau }(\nu )^3)\);
- \(\mathrm {(VII)} = O(\tau (\Vert {\tilde{b}}^\theta \Vert ^2 + \Vert s\Vert ^2)) + O(W_{\tau }(\nu )^3)\);
- \(\mathrm {(VIII)} = O(W_{\tau }(\nu )^3)\);
- \(\mathrm {(IX)} = O(W_{\tau }(\nu )^4)\);
- \(\mathrm {(X)} = O(\tau ^2(\Vert {\tilde{b}}^\theta \Vert ^2+\Vert s\Vert ^2))+O(W_{\tau }(\nu )^4)\).
It follows that overall, the error term is in \(O(\tau (\Vert {\tilde{b}}^\theta \Vert ^2+\Vert s\Vert ^2)+W_{\tau }(\nu )^3)\). It remains to lower bound the norm of the gradient over \(\Theta _0\), which can be done as follows. As seen in the proof of Lemma D.3, if \(\tau \) is small enough then \(J'_{\nu ^\star }(\theta )\ge \tau ^2\sigma _{\min }(H)/4\) for \(\theta \in \Theta _0\). Considering only the first component of the gradient, it holds
Using the expansion \(J'_\nu (\theta )=J'_{\nu ^\star }(\theta )+\langle \phi (\theta ),M_{\nabla R,1}(f^\star ,f_\nu )\rangle \), we get
The result follows by collecting all the estimates above. \(\square \)
Proof of the sharpness inequality (Theorem 3.3)
By Proposition D.4 we have that for \(\tau >0\) small enough
where \(C = \sigma _{\max }(K+H) +\Vert J'_{\nu ^\star }\Vert _\infty \).
Similarly, by Proposition D.5, for \(\tau \) small enough, it holds
where \(C'= \frac{1}{8}\sigma _{\min }(H)^2\tau ^4\). Now fix \(\tau >0\) satisfying the hypothesis of Lemma D.3 and the two previous inequalities. By Lemma D.3, \(W_\tau (\nu ) = O((J(\nu )-J^\star )^\frac{1}{2})\). We deduce that there exist \(J_0>J^\star \) and \(\kappa _0>0\) such that whenever \(\nu \in {\mathcal {M}}_+(\Theta )\) satisfies \(J(\nu )<J_0\), one has
Finally, notice that if different metric factors \((\alpha ,\beta )\ne (1,1)\) are introduced, one can always lower bound the new gradient squared norm as
which proves the statement for any \((\alpha ,\beta )\). Note however that if one wants to make a more quantitative bound, then there are values \((\alpha _0,\beta _0)\) that would lead to a better conditioning and potentially higher values for \(J_0\). In this case, the factor appearing in the sharpness inequality should rather be \(\min \{\alpha /\alpha _0,\beta /\beta _0\}\).
Estimation of the mirror rate function
We provide an upper bound for the mirror rate function \(\mathcal {Qq}\) in the situation that is of interest to us, with \(\nu ^\star \) sparse. Note that this approach could be generalized to arbitrary \(\nu ^\star \).
Lemma E.1
Under (A1), there exists \(C_{\Theta }> 0\) that only depends on the curvature of \(\Theta \), such that for all \(\nu ^\star ,\nu _0\in {\mathcal {M}}_+(\Theta )\) with \(\nu ^\star =\sum _{i=1}^{m^\star } r_i^2\delta _{\theta _i}\) and \(\nu _0=\rho {{\,\mathrm{vol}\,}}\) where \(\log \rho \) is \(L\)-Lipschitz, it holds
Moreover, for any other \({\hat{\nu }}_0\in {\mathcal {M}}_+(\Theta )\), it holds \( \mathcal {Qq}_{\nu ^\star ,{\hat{\nu }}_0}(\tau )\le \mathcal {Qq}_{\nu ^\star ,\nu _0}(\tau ) + \nu ^\star (\Theta )\cdot W_\infty (\nu _0,{\hat{\nu }}_0). \)
In the context of Lemma E.1, we introduce the quantity,
which measures how good a prior \(\rho \) is for the (a priori unknown) minimizer \(\nu ^\star \). With this quantity, the conclusion of Lemma E.1 reads, for \(\tau \ge L\),
Proof
Let us build \(\nu _\epsilon \) in such a way that the quantity defining \(\mathcal {Qq}_{\nu ^\star ,\nu _0}(\tau )\) in Eq. (15) is small. For this, fix a radius \(\epsilon >0\) and consider the measure \(\nu _\epsilon \) defined as the normalized volume measure on each geodesic ball of radius \(\epsilon \) around each \(\theta _i\), carrying mass \(r_i^2\) on this ball and vanishing everywhere else. Using the transport map that sends each of these balls to its center \(\theta _i\), we get, if \(\Theta \) is flat,
where \(V^{(d)}(\epsilon )\) is the volume of a ball of radius \(\epsilon \) in \({\mathbb {R}}^d\), that scales as \(\epsilon ^d\). Using an integration by parts, it follows
thus \(W_1(\nu _\epsilon ,\nu ^\star ) \le \nu ^\star (\Theta )\epsilon \). In the general case where \(\Theta \) is a potentially curved manifold, this upper bound also depends on the curvature of \(\Theta \) around each \(\theta _i\), a dependency that we absorb into the multiplicative constant, so that \(W_1(\nu _\epsilon ,\nu ^\star ) \le C\nu ^\star (\Theta )\epsilon \). Let us now control the entropy term. Writing \(\rho _\epsilon =\mathrm {d}\nu _\epsilon /\mathrm {d}{{{\,\mathrm{vol}\,}}}\) and \(\Theta _i\) for the geodesic ball of radius \(\epsilon \) around \(\theta _i\), it holds
The integral term can be estimated as follows,
Recalling that \(-\log V^{(d)}(\epsilon ) \le -d\log (\epsilon )+C\) for some C that only depends on the curvature of \(\Theta \), we get that the right-hand side of (15) is bounded by
Let us fix \(\epsilon >0\) by minimizing \(C\nu ^\star (\Theta )\epsilon -\nu ^\star (\Theta )d\log (\epsilon )/\tau \), which gives \(\epsilon = d/(C\tau )\). The first claim follows by plugging this value for \(\epsilon \) in the expression above.
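The minimization over \(\epsilon \) above can be checked numerically; dropping the common factor \(\nu ^\star (\Theta )\), the objective per unit mass is \(g(\epsilon ) = C\epsilon - d\log (\epsilon )/\tau \), and the constants below are arbitrary choices for the check.

```python
import numpy as np

# Sanity check of the minimization step in the proof: g(eps) = C*eps - d*log(eps)/tau
# (the common factor nu_star(Theta) is dropped) is minimized at eps = d/(C*tau).
C, d, tau = 2.0, 3, 50.0
eps_grid = np.linspace(1e-4, 1.0, 200000)          # fine grid on (0, 1]
g = C * eps_grid - d * np.log(eps_grid) / tau
eps_num = eps_grid[np.argmin(g)]                   # numerical minimizer
eps_ana = d / (C * tau)                            # analytic minimizer, = 0.03 here
print(eps_num, eps_ana)
```

Plugging \(\epsilon = d/(C\tau )\) back gives \(g(\epsilon ) = (d/\tau )(1+\log (C\tau /d))\), which is the \(O(d\log (\tau )/\tau )\) behavior appearing in the lemma.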
For the second claim of the statement, let us build a suitable candidate \({\hat{\nu }}_\epsilon \) in order to upper bound the infimum that defines \(\mathcal {Qq}_{\nu ^\star ,{\hat{\nu }}_0}(\tau )\). Let T be an optimal transport map from \(\nu _0\) to \({\hat{\nu }}_0\) for \(W_\infty \), i.e. a measurable map \(T{:}\,\Theta \rightarrow \Theta \) satisfying \(T_\# \nu _0 = {\hat{\nu }}_0\) and \(\max \{{{\,\mathrm{dist}\,}}(\theta ,T(\theta ));\; \theta \in {{\,\mathrm{spt}\,}}\nu _0(=\Theta )\} = W_\infty (\nu _0,{\hat{\nu }}_0)\) (see [53, Sec. 3.2], the absolute continuity of \(\nu _0\) is sufficient for such a map to exist). Now we define \({\hat{\nu }}_\epsilon = T_\# \nu _\epsilon \) where \(\nu _\epsilon \) is such that \({\mathcal {H}}(\nu _\epsilon ,\nu _0)<\infty \). Since the relative entropy is non-increasing under pushforwards, it holds \({\mathcal {H}}({\hat{\nu }}_\epsilon ,{\hat{\nu }}_0)\le {\mathcal {H}}(\nu _\epsilon ,\nu _0)\). Moreover, it holds \(\Vert \nu _\epsilon -{\hat{\nu }}_\epsilon \Vert _{\mathrm {BL}}^*\le W_1(\nu _\epsilon ,{\hat{\nu }}_\epsilon )\le \nu ^\star (\Theta ) W_\infty (\nu _\epsilon ,{\hat{\nu }}_\epsilon )\). Thus we have
The claim follows by noticing that, by construction, \(W_\infty (\nu _\epsilon ,{\hat{\nu }}_\epsilon ) \le W_\infty (\nu _0,{\hat{\nu }}_0)\) and then by taking the infimum in \(\nu _\epsilon \). \(\square \)
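The map-to-center bound \(W_1(\nu _\epsilon ,\nu ^\star )\le \nu ^\star (\Theta )\epsilon \) used in the first part of the proof can be illustrated numerically in a flat one-dimensional setting (the weights and radius below are arbitrary choices for the check):

```python
import numpy as np

# Monte-Carlo check of the map-to-center coupling: transporting the uniform
# measure on each interval of radius eps around theta_i (carrying mass r_i^2)
# onto delta_{theta_i} costs sum_i r_i^2 * E|U_i| <= nu_star(Theta) * eps,
# where U_i is uniform on [-eps, eps].
rng = np.random.default_rng(1)
r2 = np.array([1.0, 0.5])          # weights r_i^2, so nu_star(Theta) = 1.5
eps = 0.05
cost = sum(w * np.mean(np.abs(rng.uniform(-eps, eps, 10**5))) for w in r2)
print(cost, r2.sum() * eps)        # in 1D the cost is about half the bound
```

Since \(\mathbb {E}\vert U_i\vert = \epsilon /2\) in one dimension, the coupling cost is roughly half the stated bound, confirming that the bound is conservative but of the right order in \(\epsilon \).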
Global convergence for gradient descent
In the following result, we study the non-convex gradient descent updates \(\mu _{k+1} = (T_k)_\# \mu _k\) and \(\nu _k={\mathsf {h}}\mu _k\) where
with step-sizes \(\alpha , \beta >0\). When \(\beta =0\), we recover mirror descent updates in \({\mathcal {M}}_+(\Theta )\) with the entropy mirror map (more specifically, this is true when \(\mathrm {Ret}\) is the “mirror” retraction defined in Sect. 2.3).
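As an illustration of these updates, here is a minimal NumPy sketch of gradient descent on the weights and positions of particles for a toy one-dimensional sparse deconvolution problem. The Gaussian feature map, the least-squares-plus-mass objective, and the flat (identity) retraction are assumptions of this sketch, not the paper's setting.

```python
import numpy as np

# Toy objective J(nu) = 0.5*||f_nu - f_star||^2 + lam*nu(Theta), where
# f_nu = sum_i r_i^2 * phi(theta_i) with Gaussian features sampled on a grid.
rng = np.random.default_rng(0)
G = np.linspace(0.0, 1.0, 200)                     # sampling grid for f
sig, lam = 0.05, 1e-3                              # feature width, penalty

def features(theta):                               # phi(theta_i) sampled on G
    return np.exp(-0.5 * ((G[None, :] - theta[:, None]) / sig) ** 2)

f_star = np.exp(-0.5*((G-0.3)/sig)**2) + 0.5*np.exp(-0.5*((G-0.7)/sig)**2)

def loss(r, theta):                                # J(nu): fit + mass penalty
    return 0.5*np.mean(((r**2) @ features(theta) - f_star)**2) + lam*np.sum(r**2)

m = 50                                             # over-parameterization
theta, r = rng.uniform(0, 1, m), np.full(m, 0.1)
alpha, beta = 0.5, 0.002                           # step-sizes on (r, theta)
init = loss(r, theta)
for _ in range(2000):
    Phi = features(theta)
    resid = ((r**2) @ Phi - f_star) / len(G)
    Jp = Phi @ resid + lam                         # J'_{nu_k}(theta_i)
    dJp = (Phi * (G[None, :] - theta[:, None]) / sig**2) @ resid
    r = r * (1 - 2 * alpha * Jp)                   # weight update
    theta = theta - beta * dJp                     # position update (flat case)
final = loss(r, theta)
print(init, final)
```

Starting from many small particles spread over the domain, the weights concentrate near the two true spikes and the objective decreases by orders of magnitude, as the global convergence theory predicts.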
Lemma F.1
Assume (A1-3) and that J admits a minimizer \(\nu ^\star \in {\mathcal {M}}_+(\Theta )\). Then there exist \(C,\eta _{\max }>0\) such that for all \(\nu _0\in {\mathcal {M}}_+(\Theta )\), denoting \(B = \sup _{J(\nu )\le J(\nu _0)} \Vert J'_{\nu }\Vert _{\mathrm {BL}}\), if \(\max \{\alpha ,\beta \}<\eta _{\max }\), it holds
Proof
As in the proof of Lemma 2.5, we define \((T_k^r(\theta ),T_k^\theta (\theta )) :=T_k(1,\theta )\) and we define recursively \(\nu ^\epsilon _{k+1}=(T^\theta _k)_\# \nu ^\epsilon _k\) where \(\nu ^\epsilon _0\) is such that \({\mathcal {H}}(\nu ^\epsilon _0,\nu _0)<\infty \). Using the invariance of the relative entropy under diffeomorphisms (indeed, \(T^\theta _k\) is a diffeomorphism of \(\Theta \) for \(\beta \) small enough), and doing a first order expansion of \(T^r_k = 1 -2\alpha J'_{\nu _k} +O(\alpha ^2)\) it holds for \(\beta \) small enough
where the term in \(O(\alpha )\) originates from a first order approximation of the retraction. Now, taking \(\max \{\alpha ,\beta \}\) small enough to ensure decrease of \((J(\nu _k))_k\) (by Lemma 2.5) so that C above can be chosen independently of k, it follows
by bounding each term \(\Vert \nu _{k'}^\epsilon -\nu _0^\epsilon \Vert \) by \(B\beta k'\). \(\square \)
Proof of Theorem 4.2 (gradient descent)
The proof follows closely that of Theorem 4.1 but we do not track the “constants” (this would be more tedious). By Lemma E.1, there exists \(C>0\) (that depends on \({\bar{{\mathcal {H}}}}\), the curvature of \(\Theta \) and \(\nu ^\star (\Theta )\)) such that \(\mathcal {Qq}_{\nu ^\star ,{\hat{\nu }}_0}(\tau )\le C(\log \tau )/\tau +\nu ^\star (\Theta )W_\infty (\nu _0,{\hat{\nu }}_0)\). Combining this with Lemma F.1, we get that when \(\max \{\alpha ,\beta \} \le \eta _{\max }\),
Our goal is to choose \(k_0,\alpha , \beta \) and \(W_\infty (\nu _0,{\hat{\nu }}_0)\) so that this quantity is smaller than \(\Delta _0:=J_0-J^\star \). With \(\alpha =1/\sqrt{k}\) and \(\beta = \beta _0/k\) we get
Then, using a bound \(\log (u)\le C_\epsilon u^\epsilon \), we may choose \(k \gtrsim \Delta _0^{-2-\epsilon }\), \(\beta _0 \le \frac{1}{3} \Delta _0/B^2\) and \(W_\infty (\nu _0,{\hat{\nu }}_0)\le \frac{1}{3} \Delta _0/(B\nu ^\star (\Theta ))\) in order to have \(J(\nu _k)-J^\star \le \Delta _0\). This gives \(\alpha \lesssim \Delta _0^{1+\epsilon /2}\), \(\beta \lesssim \Delta _0^{3+\epsilon }\), and the regime of exponential convergence kicks in after \(k\gtrsim \Delta _0^{-2-\epsilon }\) iterations. \(\square \)
Faster rate for mirror descent
In this section, we show that for a specific choice of retraction, the \(O(\log (t)/t)\) convergence rate of the gradient flow is preserved by gradient descent.
Proposition G.1
(Mirror flow, fast rate) Assume (A1-4) and consider the infinite dimensional mirror descent update
which corresponds to the so-called mirror retraction in Sect. 2.3 and \(\beta =0\). For any \(\nu _0 \in {\mathcal {M}}_+(\Theta )\), there exists \(\alpha _{\max }>0\) such that for \(\alpha \le \alpha _{\max }\) it holds, denoting \(B_{\nu _0} = \sup _{J(\nu )\le J(\nu _0)} \Vert J'_{\nu }\Vert _{\mathrm {BL}}\),
In particular, combining with Lemma E.1, if \(\nu _0 = \rho \!{{\,\mathrm{vol}\,}}\) has a smooth positive density, then \(J(\nu _k)-J^\star = O(\log (k)/k)\).
Proof
Consider \(\nu _\epsilon \in {\mathcal {M}}_+(\Theta )\) such that \({\mathcal {H}}(\nu _\epsilon ,\nu _0)<\infty \). It holds
where the first equality is obtained by rearranging terms in the definition of \({\mathcal {H}}\), and the second one is specific to the mirror retraction. Let us estimate the two terms in the right-hand side. Using convexity inequalities, we get
Here the term in \(O(\alpha \Vert g_{\nu _k}\Vert ^2_{L^2(\nu _k)})\) comes from the proof of Lemma 2.5 (note that the iterates remain in a sublevel of J for \(\alpha \) small enough). As for the relative entropy term, we have, using the convexity inequality \(\exp (u)\ge 1+ u\),
We use this inequality in place of the strong convexity of the mirror function used in the usual proof of mirror descent (because there is no Pinsker inequality on \({\mathcal {M}}_+(\Theta )\)). Coming back to the first equality we have derived, it holds,
Thus for \(\alpha \) small enough, it holds
Summing over K iterations and dividing by K, we get
Since for \(\alpha \) small enough \((J(\nu _k))_{k\ge 1}\) is decreasing (by Lemma 2.5), the result follows. \(\square \)
Convergence rate for lower bounded densities
In this section, we justify the claim made in Sect. 4.3 about the convergence without condition on \(\beta /\alpha \). Let us recall the result that we want to prove.
Proposition H.1
Under (A1-3), for any \(J_{\max }>J^\star \), there exists \(C>0\) such that for any \(\eta ,t>0\) and \(\nu _0\in {\mathcal {M}}_+(\Theta )\) satisfying \(J(\nu _0)\le J_{\max }\), if the projected gradient flow (9) satisfies for \(0\le s \le t\),
where \(S_t = \{\theta \in \Theta ;\; J'_{\nu _s}(\theta )\le 0 \text { for some } s\in {[0,t]}\}\), then \( J(\nu _t)-J^\star \le \frac{C}{\sqrt{\alpha \eta t}}. \)
Proof
Following [60], we start with the convexity inequality
Let us control these two terms separately. On the one hand, one has by Jensen’s inequality
Using the fact that on sublevels of J, \(\nu (\Theta )\) and \(\Vert g_{\nu }\Vert ^2_{L^2(\nu )}\) are bounded, we have, for some \(C>0\),
On the other hand, we have
where the last equality defines \(v_t\le 0\). Using the gradient flow structure, let us show that a non-zero \(v_t\) and a lower bound \(\eta \) on the density of \(\nu _t\) (at least on the set \(\{J'_{\nu _t}\le 0\}\)) guarantee a decrease of the objective. Indeed, letting \(\Theta _t = \{\theta \in \Theta ;\; J'_{\nu _t}(\theta ) \le v_t /2\}\) (which could be empty), we get
Moreover, the Lipschitz constant of \(J'_{\nu }\) is bounded on sublevels of J, and thus along gradient flow trajectories, so there exists \(C'>0\) such that \({{\,\mathrm{vol}\,}}(\Theta _t)\ge C'\cdot \vert v_t\vert \). It follows
Coming back to our first inequality, we have
for some \(C''>0\) that, given \(J(\nu _0)\), is independent of \(\alpha ,\eta \) and \(\nu _t\). It remains to remark that a continuously differentiable and positive function h that satisfies \(h(t)\le C^{-1/3} \cdot (-h'(t))^{1/3}\) satisfies \(C\le -h'(t)/h(t)^3 = \frac{1}{2} \frac{\mathrm {d}}{\mathrm {d}t}(h(t)^{-2})\) and, after integrating between 0 and t, \(h(t)\le \left( 2Ct +h(0)^{-2}\right) ^{-1/2}\le \frac{1}{\sqrt{2Ct}}\). We conclude by taking \(h(t)=J(\nu _t)-J^\star \) and \(C\propto \alpha \eta \). \(\square \)
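The integral-comparison step closing the proof can be checked numerically: taking the extremal case \(h'(t)=-Ch(t)^3\) of the differential inequality, an explicit Euler discretization stays below the claimed bound \((2Ct+h(0)^{-2})^{-1/2}\) (the constants below are arbitrary choices for the check).

```python
import numpy as np

# Check: a positive h with h'(t) = -C*h(t)**3 (worst case of
# h <= C^{-1/3} * (-h')^{1/3}) satisfies h(t) <= (2*C*t + h(0)**-2)**(-1/2),
# which is itself below 1/sqrt(2*C*t).
C, h0, dt = 4.0, 1.0, 1e-4
h, t, ok = h0, 0.0, True
for _ in range(200000):                  # integrate up to t = 20
    h += dt * (-C * h ** 3)              # explicit Euler step
    t += dt
    ok &= h <= (2 * C * t + h0 ** -2) ** -0.5 + 1e-6
print(ok, h, (2 * C * t + h0 ** -2) ** -0.5)
```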
Cite this article
Chizat, L. Sparse optimization on measures with over-parameterized gradient descent. Math. Program. 194, 487–532 (2022). https://doi.org/10.1007/s10107-021-01636-z