Abstract
Minimizing a convex function of a measure with a sparsity-inducing penalty is a typical problem arising, e.g., in sparse spikes deconvolution or in training two-layer neural networks. We show that this problem can be solved by discretizing the measure and running non-convex gradient descent on the positions and weights of the particles. For measures on a d-dimensional manifold and under some non-degeneracy assumptions, this leads to a global optimization algorithm with a complexity scaling as \(\log (1/\epsilon )\) in the desired accuracy \(\epsilon \), instead of \(\epsilon ^{-d}\) for convex methods. The key theoretical tools are a local convergence analysis in Wasserstein space and an analysis of a perturbed mirror descent in the space of measures. Our bounds involve quantities that are exponential in d, which is unavoidable under our assumptions.
Notes
Extension of the metric and gradients to the whole of \(\Omega \) can be made on a case by case basis, see Sect. 2.2.
The pushforward measure \(T_\# \mu \) is characterized by \(\int \psi \mathrm {d}(T_\#\mu ) = \int (\psi \circ T) \mathrm {d}\mu \) for any continuous function \(\psi \).
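As a discrete illustration (our own toy example, with a hypothetical map `T` and test function `psi`, not taken from the paper), the defining identity can be checked directly for an atomic measure, where the pushforward simply moves the atoms:

```python
import numpy as np

# Toy check of the pushforward identity for an atomic measure
# mu = sum_k w_k delta_{x_k}: then T#mu = sum_k w_k delta_{T(x_k)},
# and integrating psi against T#mu equals integrating psi∘T against mu.
x = np.array([0.1, 0.5, 0.9])      # atom locations
w = np.array([0.2, 0.3, 0.5])      # atom masses

T = lambda t: t ** 2 + 1.0         # an arbitrary continuous map
psi = lambda t: np.cos(t)          # an arbitrary continuous test function

pushed_atoms = T(x)                # atoms of T#mu (masses are unchanged)
lhs = np.sum(w * psi(pushed_atoms))    # ∫ psi d(T#mu)
rhs = np.sum(w * psi(T(x)))            # ∫ (psi∘T) dmu
assert abs(lhs - rhs) < 1e-12
```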
References
Absil, P.-A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2009)
Amari, S.-I.: Natural gradient works efficiently in learning. Neural Comput. 10(2), 251–276 (1998)
Ambrosio, L., Gigli, N., Savaré, G.: Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Springer, Berlin (2008)
Bach, F.: Breaking the curse of dimensionality with convex neural networks. J. Mach. Learn. Res. 18(1), 629–681 (2017)
Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with sparsity-inducing penalties. Found. Trends® Mach. Learn. 4(1), 1–106 (2012)
Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31(3), 167–175 (2003)
Blanchet, A., Bolte, J.: A family of functional inequalities: Łojasiewicz inequalities and displacement convex functions. J. Funct. Anal. 275(7), 1650–1673 (2018)
Boyd, N., Schiebinger, G., Recht, B.: The alternating descent conditional gradient method for sparse inverse problems. SIAM J. Optim. 27(2), 616–639 (2017)
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
Boyer, C., Chambolle, A., De Castro, Y., Duval, V., De Gournay, F., Weiss, P.: On representer theorems and convex regularization. SIAM J. Optim. 29(2), 1260–1281 (2019)
Bredies, K., Pikkarainen, H.K.: Inverse problems in spaces of measures. ESAIM Control Optim. Calc. Var. 19(1), 190–218 (2013)
Burago, D., Burago, Y., Ivanov, S.: A Course in Metric Geometry, vol. 33. American Mathematical Society, Providence (2001)
Candès, E.J., Fernandez-Granda, C.: Towards a mathematical theory of super-resolution. Commun. Pure Appl. Math. 67(6), 906–956 (2014)
Catala, P., Duval, V., Peyré, G.: A low-rank approach to off-the-grid sparse deconvolution. J. Phys. Conf. Ser. 904, 012015 (2017)
Champagnat, F., Herzet, C.: Atom selection in continuous dictionaries: reconciling polar and SVD approximations. In: ICASSP 2019-IEEE 44th International Conference on Acoustics, Speech, and Signal Processing, pp. 1–5. IEEE (2019)
Chen, Y., Li, W.: Wasserstein natural gradient in statistical manifolds with continuous sample space. arXiv preprint arXiv:1805.08380 (2018)
Chizat, L., Bach, F.: On the global convergence of gradient descent for over-parameterized models using optimal transport. In: Advances in Neural Information Processing Systems, pp. 3040–3050 (2018)
Chizat, L., Oyallon, E., Bach, F.: On lazy training in differentiable programming. In: Advances in Neural Information Processing Systems, pp. 2937–2947 (2019)
Chizat, L., Peyré, G., Schmitzer, B., Vialard, F.-X.: An interpolating distance between optimal transport and Fisher–Rao metrics. Found. Comput. Math. 18(1), 1–44 (2018)
Chizat, L., Peyré, G., Schmitzer, B., Vialard, F.-X.: Unbalanced optimal transport: dynamic and Kantorovich formulations. J. Funct. Anal. 274(11), 3090–3123 (2018)
Cohn, D.L.: Measure Theory, vol. 165. Springer, Berlin (1980)
Daubechies, I., Defrise, M., De Mol, C.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math. 57(11), 1413–1457 (2004)
De Castro, Y., Gamboa, F.: Exact reconstruction using Beurling minimal extrapolation. J. Math. Anal. Appl. 395(1), 336–354 (2012)
De Castro, Y., Gamboa, F., Henrion, D., Lasserre, J.-B.: Exact solutions to super resolution on semi-algebraic domains in higher dimensions. IEEE Trans. Inf. Theory 63(1), 621–630 (2017)
Denoyelle, Q., Duval, V., Peyré, G., Soubies, E.: The sliding Frank–Wolfe algorithm and its application to super-resolution microscopy. Inverse Probl. 36(1), 014001 (2019)
Du, S.S., Zhai, X., Póczos, B., Singh, A.: Gradient descent provably optimizes over-parameterized neural networks. In: International Conference on Learning Representations (2019)
Dumitrescu, B.: Positive Trigonometric Polynomials and Signal Processing Applications. Springer, Berlin (2007)
Duval, V., Peyré, G.: Exact support recovery for sparse spikes deconvolution. Found. Comput. Math. 15(5), 1315–1355 (2015)
Flinth, A., de Gournay, F., Weiss, P.: On the linear convergence rates of exchange and continuous methods for total variation minimization. Math. Program. (2020). https://doi.org/10.1007/s10107-020-01530-0
Flinth, A., Weiss, P.: Exact solutions of infinite dimensional total-variation regularized problems. Inf. Inference 8(3), 407–443 (2019)
Gallouët, T., Monsaingeon, L.: A JKO splitting scheme for Kantorovich–Fisher–Rao gradient flows. SIAM J. Math. Anal. 49(2), 1100–1130 (2017)
Gautschi, W.: Numerical Analysis. Springer, Berlin (1997)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
Hauer, D., Mazón, J.: Kurdyka–Łojasiewicz–Simon inequality for gradient flows in metric spaces. Trans. Am. Math. Soc. (2019)
Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall, Englewood Cliffs (1994)
Jacot, A., Gabriel, F., Hongler, C.: Neural tangent kernel: convergence and generalization in neural networks. In: Advances in Neural Information Processing Systems, pp. 8571–8580 (2018)
Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer, Berlin (2016)
Kondratyev, S., Monsaingeon, L., Vorotnikov, D.: A new optimal transport distance on the space of finite Radon measures. Adv. Differ. Equ. 21(11/12), 1117–1164 (2016)
Krichene, W., Bayen, A., Bartlett, P.L.: Accelerated mirror descent in continuous and discrete time. In: Advances in Neural Information Processing Systems, pp. 2845–2853 (2015)
Kushner, H.J., Yin, G.G.: Stochastic Approximation and Recursive Algorithms and Applications, vol. 35. Springer, Berlin (2003)
Li, W., Montúfar, G.: Natural gradient via optimal transport. Inf. Geom. 1(2), 181–214 (2018)
Liero, M., Mielke, A., Savaré, G.: Optimal entropy-transport problems and a new Hellinger–Kantorovich distance between positive measures. Invent. Math. 211(3), 969–1117 (2018)
Mairal, J., Bach, F., Ponce, J.: Sparse modeling for image and vision processing. Found. Trends® Comput. Graph. Vis. 8(2–3), 85–283 (2014)
Maniglia, S.: Probabilistic representation and uniqueness results for measure-valued solutions of transport equations. J. Math. Pures Appl. 87(6), 601–626 (2007)
Mei, S., Montanari, A., Nguyen, P.-M.: A mean field view of the landscape of two-layer neural networks. Proc. Natl. Acad. Sci. 115(33), E7665–E7671 (2018)
Menz, G., Schlichting, A.: Poincaré and logarithmic Sobolev inequalities by decomposition of the energy landscape. Ann. Probab. 42(5), 1809–1884 (2014)
Nemirovsky, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, New York (1983)
Nitanda, A., Suzuki, T.: Stochastic particle gradient descent for infinite ensembles. arXiv preprint arXiv:1712.05438 (2017)
Polyak, B.T.: Gradient methods for minimizing functionals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki 3(4), 643–653 (1963)
Poon, C., Keriven, N., Peyré, G.: The geometry of off-the-grid compressed sensing. arXiv preprint arXiv:1802.08464 (2018)
Rotskoff, G., Jelassi, S., Bruna, J., Vanden-Eijnden, E.: Global convergence of neuron birth–death dynamics. In: Proceedings of the 36th International Conference on Machine Learning, PMLR (2019)
Rotskoff, G., Vanden-Eijnden, E.: Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks. In: Advances in Neural Information Processing Systems, pp. 7146–7155 (2018)
Santambrogio, F.: Optimal Transport for Applied Mathematicians. Birkhäuser, New York (2015)
Sirignano, J., Spiliopoulos, K.: Mean field analysis of neural networks: a law of large numbers. SIAM J. Appl. Math. 80(2), 725–752 (2020)
Tang, G., Bhaskar, B.N., Shah, P., Recht, B.: Compressed sensing off the grid. IEEE Trans. Inf. Theory 59(11), 7465–7490 (2013)
Traonmilin, Y., Aujol, J.-F.: The basins of attraction of the global minimizers of the non-convex sparse spike estimation problem. Inverse Probl. 36(4), 045003 (2020)
Trillos, N.G., Slepčev, D.: On the rate of convergence of empirical measures in \(\infty \)-transportation distance. Can. J. Math. 67(6), 1358–1383 (2015)
Wang, Y., Li, W.: Accelerated information gradient flow. arXiv preprint arXiv:1909.02102 (2019)
Weed, J., Bach, F.: Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli 25(4A), 2620–2648 (2019)
Wei, C., Lee, J.D., Liu, Q., Ma, T.: Regularization matters: generalization and optimization of neural nets vs their induced kernel. In: Advances in Neural Information Processing Systems, pp. 9712–9724 (2019)
Acknowledgements
The author thanks Francis Bach for fruitful discussions related to this work and the anonymous referees for their thorough reading and suggestions.
Appendices
Dealing with signed measures
Let us show that problems over signed measures with total variation regularization are covered by problem (1), after a suitable reformulation. Consider a function \({\tilde{\phi }}{:}\,{\tilde{\Theta }}\rightarrow {\mathcal {F}}\) and the functional on signed measures \({\tilde{J}}{:}\,{\mathcal {M}}({\tilde{\Theta }})\rightarrow {\mathbb {R}}\) defined as
where \(\vert \mu \vert ({\tilde{\Theta }})\) is the total variation of \(\mu \). This is a continuous version of the LASSO problem, known as BLASSO [23]. Define \(\Theta \) as the disjoint union of two copies \({\tilde{\Theta }}_+\) and \({\tilde{\Theta }}_-\) of \({\tilde{\Theta }}\) and define the symmetrized function \(\phi {:}\,\Theta \rightarrow {\mathcal {F}}\) as
With this choice of \(\phi \), minimizing (17) or minimizing (1) are equivalent, in a sense made precise in Proposition A.1. This symmetrization procedure, also suggested in [17], is simple to implement in practice: in Algorithm 1, we fix at initialization the sign attributed to each particle—depending on whether it belongs to \({\tilde{\Theta }}_+\) or \({\tilde{\Theta }}_-\)—and do not change it throughout the iterations.
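The fixed-sign scheme can be sketched numerically. The following is an illustrative toy example, not the paper's Algorithm 1: the feature map, data, step size, and the projected gradient step on the weights are all hypothetical choices made for the sketch.

```python
import numpy as np

# Sketch of the symmetrization: each particle gets a sign fixed at
# initialization -- encoding whether it lives on the copy Theta_+ or
# Theta_- -- and only its nonnegative weight and position are updated.
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 50)
sig = 0.1

def feats(theta):
    # rows: particles; columns: samples of a Gaussian feature on the grid
    return np.exp(-((grid[None, :] - theta[:, None]) ** 2) / (2 * sig ** 2))

# Signed ground truth +0.8*delta_{0.3} - 0.5*delta_{0.7}, seen through feats
y = 0.8 * feats(np.array([0.3]))[0] - 0.5 * feats(np.array([0.7]))[0]
lam = 1e-3                                          # total-variation penalty

n = 20
theta = rng.random(n)                               # particle positions
w = np.full(n, 1.0 / n)                             # nonnegative weights
sign = np.where(np.arange(n) < n // 2, 1.0, -1.0)   # signs, fixed once for all

def loss(theta, w):
    r = (sign * w) @ feats(theta) - y
    return 0.5 * r @ r + lam * w.sum()

loss0, lr = loss(theta, w), 0.005
for _ in range(4000):
    P = feats(theta)
    r = (sign * w) @ P - y
    grad_w = sign * (P @ r) + lam                   # gradient in the weights
    dP = P * (grid[None, :] - theta[:, None]) / sig ** 2
    grad_theta = sign * w * (dP @ r)                # gradient in the positions
    w = np.maximum(w - lr * grad_w, 0.0)            # weights stay nonnegative
    theta = theta - lr * grad_theta

assert loss(theta, w) < loss0 and (w >= 0).all()
```

The signs never change during the iterations, exactly as described above; only the nonnegative weights and the positions move.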
Proposition A.1
The infima of (17) and (1) are the same and:
(i) if \({\tilde{\mu }}\) is a minimizer of \({\tilde{J}}\) and \({\tilde{\mu }} = {\tilde{\mu }}_+-{\tilde{\mu }}_-\) is its Jordan decomposition, then the measure whose restriction to \({\tilde{\Theta }}_+\) (resp. \({\tilde{\Theta }}_-\)) coincides with \({\tilde{\mu }}_+\) (resp. \({\tilde{\mu }}_-\)) is a minimizer of J;
(ii) conversely, if \(\mu \) is a minimizer of J, then \(\mu _+-\mu _-\), where \(\mu _+\) (resp. \(\mu _-\)) is the restriction of \(\mu \) to \({\tilde{\Theta }}_+\) (resp. \({\tilde{\Theta }}_-\)), is a minimizer of \({\tilde{J}}\).
Proof
We recall that for any decomposition of a signed measure as a difference of nonnegative measures \({\tilde{\mu }} = {\tilde{\mu }}_+-{\tilde{\mu }}_-\), it holds \(\vert {\tilde{\mu }} \vert ({\tilde{\Theta }}) \le {\tilde{\mu }}_+({\tilde{\Theta }})+{\tilde{\mu }}_-({\tilde{\Theta }})\), with equality if and only if \(({\tilde{\mu }}_+,{\tilde{\mu }}_-)\) is the Jordan decomposition of \({\tilde{\mu }}\) [21, Sec. 4.1]. It follows that, starting from any \({\tilde{\mu }}\in {\mathcal {M}}({\tilde{\Theta }})\), the construction in (i) yields a measure \(\mu \in {\mathcal {M}}_+(\Theta )\) satisfying \({\tilde{J}}({\tilde{\mu }}) = J(\mu )\). Conversely, starting from any \(\mu \in {\mathcal {M}}_+(\Theta )\), the construction in (ii) yields a measure \({\tilde{\mu }}\in {\mathcal {M}}({\tilde{\Theta }})\) satisfying \({\tilde{J}}({\tilde{\mu }})\le J(\mu )\), with equality if and only if \((\mu _+,\mu _-)\) is a Jordan decomposition. The conclusion follows. \(\square \)
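The inequality and its equality case can be checked on a toy discrete example (our own, not from the paper), where a signed measure on finitely many atoms is just a vector of signed masses:

```python
import numpy as np

# For a discrete signed measure m, any decomposition m = p - q with
# p, q >= 0 satisfies p(Theta) + q(Theta) >= |m|(Theta), with equality
# exactly for the Jordan decomposition p = max(m, 0), q = max(-m, 0).
m = np.array([0.5, -1.2, 0.0, 2.0])          # signed masses of the atoms
tv = np.abs(m).sum()                          # total variation |m|(Theta)

mp, mn = np.maximum(m, 0.0), np.maximum(-m, 0.0)   # Jordan decomposition
assert np.isclose(mp.sum() + mn.sum(), tv)          # equality case

# Any other decomposition adds common nonnegative mass c to both parts
c = np.array([0.3, 0.1, 0.2, 0.0])
p, q = mp + c, mn + c
assert np.allclose(p - q, m)                  # still a valid decomposition
assert p.sum() + q.sum() > tv                 # strictly larger total mass
```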
Generic non-convex minimization
In this section, we show that any smooth optimization problem on a manifold is equivalent to solving a problem of the form (1). This corresponds to the case of a scalar-valued \(\phi \).
Proposition B.1
Let \(\phi {:}\,\Theta \rightarrow {\mathbb {R}}\) be a smooth function with minimum \(\phi ^\star <0\) that admits a global minimizer, and let
where \(0<\lambda <-2\phi ^\star \). Then \(\emptyset \ne {{\,\mathrm{spt}\,}}\nu ^\star \subset \arg \min \phi \) so minimizers of \(\phi \) can be built from \(\nu ^\star \). Reciprocally, from a minimizer of \(\phi \), one can build a minimizer for (18).
Proof
For a measure \(\nu \in {\mathcal {M}}_+(\Theta )\), we define \(f_\nu :=\int _\Theta \phi (\theta )\mathrm {d}\nu (\theta ) \in {\mathbb {R}}\). It holds
Now suppose that \(\nu \) is a global minimizer of J. Then the optimality condition in Proposition 3.1 implies that
Solving for \(f_\nu \) is possible if \(\lambda \nu (\Theta )<1\) and leads to \(f_\nu =\sqrt{1-\lambda \nu (\Theta )} -1\). We also deduce from the fact that \(f_\nu >-1\) that \(\arg \min J'_\nu = \arg \min \phi \), and so \({{\,\mathrm{spt}\,}}\nu \subset \arg \min \phi \). It remains to find under which condition \(\nu (\Theta )>0\). We use the fact that \(f_\nu = \phi ^\star \nu (\Theta )\) in Eq. (19), and get
which in particular satisfies \(\lambda \nu (\Theta )<1\). Thus, as long as \(-2\phi ^\star >\lambda \), we have \(\nu (\Theta )>0\). Finally, we verify that global minimizers exist, so that the above reasoning makes sense. If \(-2\phi ^\star -\lambda \le 0\), then \(\nu =0\) satisfies the global optimality conditions. Otherwise, choose \(\theta ^\star \) a minimizer for \(\phi ^\star \) and define \(\nu = \nu (\Theta )\delta _{\theta ^\star }\) with the value above for \(\nu (\Theta )\), which also satisfies the global optimality conditions. \(\square \)
Wasserstein gradient flow
In this section, we recall and adapt some results and proofs from [17], for the sake of completeness.
C.1 Existence
For this result, we assume (A1-2). For a compactly supported initial condition \(\mu _0\in {\mathcal {P}}_2(\Omega )\), the proof of existence for Wasserstein gradient flows [Eq. (7)] in [17] goes through, as it is based on compactness arguments which translate directly to this Riemannian setting (more precisely, we apply the Arzelà–Ascoli compactness criterion to curves in the Wasserstein space on the cone of \(\Theta \), which is a complete metric space [42]). Note that these arguments do not require convexity of R, but in order to guarantee global existence in time, we need to assume that \(\nabla R\) is bounded on sub-level sets of F.
For the existence of solutions for projected dynamics on \(\Theta \) for any \(\nu _0\in {\mathcal {M}}_+(\Theta )\), consider a measure \(\mu _0\in {\mathcal {M}}_+(\Omega )\) such that \({\mathsf {h}}\mu _0=\nu _0\) (see [42] for such a construction) and the corresponding Wasserstein gradient flow \((\mu _t)_{t\ge 0}\) for F. Then \({\mathsf {h}}\mu _t\) is a solution to (9).
For the existence of Wasserstein gradient flows [Eq. (7)] for F when \(\mu _0\) is not compactly supported, proceed as follows: there exists a Wasserstein–Fisher–Rao gradient flow \(\nu _t\) satisfying \(\nu _0={\mathsf {h}}\mu _0\). Now we can simply define \(\mu _t\) as the solution to \(\partial _t \mu _t = \mathrm {div}(\mu _t J'_{\nu _t})\). It can be directly checked that \({\mathsf {h}}\mu _t = \nu _t\) for \(t\ge 0\) and thus \(\mu _t\) is a solution to Eq. (7).
We do not attempt to show uniqueness in the present work. Note that it is proved in [17] for the case where \(\Theta \) is a sphere, by applying the theory developed in [3].
C.2 Asymptotic global convergence
In this section, we give a short proof of Theorem 2.2, adapted from [17]. The next lemma is the crux of the global convergence proof. It gives a criterion for escaping neighborhoods of measures that are not minimizers.
Lemma C.1
(Criterion to escape local minima) Under (A1-3), let \(\nu \in {\mathcal {M}}_+(\Theta )\) be such that \(v^\star :=\min _{\theta \in \Theta } J'_{\nu }(\theta )<0\). Then there exist \(v \in [2v^\star /3,v^\star /3]\) and \(\epsilon >0\) such that if \((\nu _t)_{t\ge 0}\) is a projected gradient flow of J satisfying \(\Vert \nu - \nu _{t_0}\Vert _\mathrm {BL}^*< \epsilon \) for some \(t_0\ge 0\) and \(\nu _{t_0}((J'_{\nu })^{-1}(]-\infty ,v]))>0\), then there exists \(t_1>t_0\) such that \(\Vert \nu - \nu _{t_1}\Vert _\mathrm {BL}^*\ge \epsilon \).
Proof
We first assume that \(J'_{\nu }\) also takes nonnegative values, and let \(v\in [2v^\star /3,v^\star /3]\) be a regular value of \(J'_{\nu }\), i.e. such that \(\Vert \nabla J'_{\nu }\Vert \) does not vanish on the v level-set of \(J'_\nu \). Such a v is guaranteed to exist by the Morse–Sard lemma and our assumption that \(\phi \) is d-times continuously differentiable, which implies that \(J'_{\nu }\) is as well. Let \(K_v = (J'_{\nu })^{-1}(]-\infty ,v])\subset \Theta \) be the corresponding sublevel set. By the regular value theorem, the boundary \(\partial K_v\) of \(K_v\) is a differentiable orientable compact submanifold of \(\Theta \) and is orthogonal to \(\nabla J'_{\nu }\). By construction, it holds for all \(\theta \in K_v\) that \(J'_{\nu }(\theta ) \le v^\star /3\) and, by the regular value property, there is \(u>0\) such that \(\nabla J'_\nu (\theta )\cdot \mathbf {n}_{\theta } > u\) for all \(\theta \in \partial K_v\), where \(\mathbf {n}_\theta \) is the unit normal vector to \(\partial K_v\) pointing outwards. Since the map \(\nu \mapsto J'_{\nu }\) is locally Lipschitz as a map \(({\mathcal {M}}_+(\Theta ), \Vert \cdot \Vert _{\mathrm {BL}}^*) \rightarrow ({\mathcal {C}}^1(\Theta ),\Vert \cdot \Vert _{\mathrm {BL}})\), there exists \(\epsilon >0\) such that if \(\nu _t\in {\mathcal {M}}_+(\Theta )\) satisfies \(\Vert \nu _t -\nu \Vert _{\mathrm {BL}}^*<\epsilon \), then
Now let us consider a projected gradient flow \((\nu _t)_{t\ge 0}\) such that \(\Vert \nu _{t_0} -\nu \Vert _{\mathrm {BL}}^*<\epsilon \) and let \(t_1>t_0\) be the first time such that \(\Vert \nu _{t_1}-\nu \Vert _{\mathrm {BL}}^*\ge \epsilon \), which might a priori be infinite. For \(t\in {[t_0,t_1[}\), it holds
where the first inequality can be seen by using the "characteristic" representation of solutions to (9), see [44]. It follows by Grönwall's lemma that \(\nu _t(K_v)\ge \exp (-\alpha v^\star (t-t_0))\nu _{t_0}(K_v)\); since \(-v^\star >0\), the mass of \(K_v\) grows exponentially while the total mass remains bounded along the flow, which implies that \(t_1\) is finite. Finally, if we had not assumed that 0 is in the range of \(J'_\nu \) in the first place, then we could simply take \(K_v=\Theta \) and conclude by similar arguments. \(\square \)
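For completeness, the Grönwall step can be spelled out; we take the preceding differential inequality in the generic form \(m'(t)\ge c\,m(t)\) with \(m(t):=\nu _t(K_v)\) and a constant \(c>0\) depending on \(\alpha \) and \(v^\star \) (this reading of the displayed inequality is our assumption):

```latex
% Gronwall step, assuming m(t) := \nu_t(K_v) satisfies m'(t) >= c m(t)
% for some constant c > 0 on [t_0, t_1[:
\frac{\mathrm{d}}{\mathrm{d}t}\Bigl(e^{-c(t-t_0)}\,m(t)\Bigr)
  = e^{-c(t-t_0)}\bigl(m'(t)-c\,m(t)\bigr) \;\ge\; 0
\quad\Longrightarrow\quad
m(t)\;\ge\;e^{\,c(t-t_0)}\,m(t_0), \qquad t\in[t_0,t_1[.
```

Since the total mass \(\nu _t(\Theta )\) stays bounded along the flow while the right-hand side grows exponentially, the escape time \(t_1\) must indeed be finite.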
Proof of Theorem 2.2
Let \(\nu _\infty \in {\mathcal {M}}_+(\Theta )\) be the weak limit of \((\nu _t)_t\). It satisfies the stationarity condition \(\int \vert J'_{\nu _\infty }\vert ^2\mathrm {d}\nu _{\infty }=0\). Then by the optimality conditions in Proposition 3.1, either \(\nu _\infty \) is a minimizer of J, or \(J'_{\nu _\infty }\) takes negative values. For the sake of contradiction, assume the latter. Let \(\epsilon \) be given by Lemma C.1 and let \(t_0 = \sup \{ t\ge 0;\; \Vert \nu _t-\nu _\infty \Vert _{\mathrm {BL}}^*\ge \epsilon \}\), which is finite since we have assumed that \(\nu _t\) weakly converges to \(\nu _\infty \). But \(\nu _{t_0}\) has full support since it can be written as the pushforward of a rescaled version of \(\nu _0\) by a diffeomorphism, see [44, Eq. (1.3)] (note that this step is considerably simplified here by the fact that we do not have a potentially non-smooth regularizer, unlike in [17] where topological degree theory comes into play). Then the conclusion of Lemma C.1 contradicts the definition of \(t_0\). \(\square \)
Proof of the gradient inequality
In this whole section, we consider without loss of generality \(\alpha =\beta =1\) (we explain in Sect. D.7 how to adapt the results to arbitrary \(\alpha ,\beta \)). For simplicity, we only track the dependencies in \(\nu \) and \(\tau \). Any quantity that is independent of \(\nu \) and \(\tau \) is treated as a constant and represented by \(C,C',C''>0\); the quantities these symbols refer to may change from line to line.
D.1 Bound on the transport distance to minimizers
Given a measure \(\nu \in {\mathcal {M}}_+(\Theta )\), we consider the local centered moments introduced in Definition 3.6 and in addition, for \(i\in \{1,\dots ,m^\star \}\),
Finally, we will quantify errors with the following quantity
which also controls the \({\widehat{W}}_2\) distance (introduced in Sect. 3.1) to the minimizer \(\nu ^\star \) of J, as shown in the next proposition.
Lemma D.1
It holds \({\widehat{W}}_2(\nu ,\nu ^\star )\le W_\tau (\nu )(1+O(\tau ^2)+O(W_\tau (\nu )^2))\).
Proof
Note that for \(W_\tau (\nu )\) small enough, it holds \(\nu (\Theta _i)>0\) for \(i\in \{1,\dots ,m^\star \}\). Let \(\mu \in {\mathcal {P}}_2(\Omega )\) be such that \({\mathsf {h}}\mu = \nu \) and consider the transport map \(T{:}\,\Omega \rightarrow \Omega \) defined as
By construction, it holds \({\mathsf {h}}(T_\#\mu ) = \nu ^\star \). Let us estimate the transport cost associated to this map
The geodesic distance associated to the cone metric is
Now, if we only consider points \(\theta \in \Theta _i\) with \({\tilde{\theta }}\) their coordinates in a normal frame centered at \(\theta _i\) (note that in all other proofs, we do not need to distinguish between \(\theta \) and \({\tilde{\theta }}\)), we have the approximation
Let us decompose \(T(r,\theta )\) as \((rT^r(\theta ),T^\theta (\theta ))\) and estimate the two contributions forming \({\mathcal {T}}\) separately. On the one hand, we have
On the other hand, we have
As a consequence, we have \({\mathcal {T}}= W_\tau (\nu )(1 + O(W_\tau (\nu )^2)+ O(\tau ^2))\). Remark that this estimate does not depend on the chosen lifting \(\mu \) satisfying \({\mathsf {h}}\mu =\nu \). We then conclude by using the characterization in [42, Thm. 7.20] for the distance \({\widehat{W}}_2\):
Thus \({\widehat{W}}_2(\nu ,\nu ^\star )^2 \le W_2(\mu ,T_\#(\mu ))^2\le {\mathcal {T}}\), and the result follows. \(\square \)
D.2 Local expansion lemma
Lemma D.2
(Expansion around \(\nu ^\star )\) Let \(\psi \) be any (vector or real-valued) smooth function on \(\Theta \) and \(\nu \in {\mathcal {M}}_+(\Theta )\). If \(\tau >0\) is an admissible radius, then the following first and second-order expansions hold
where \(M_{k,\psi }(\theta _i,\theta )\) is the remainder in the \((k-1)\)-th order Taylor expansion of \(\psi \) around \(\theta _i\) in local coordinates (and we recall that \({\bar{\nabla }} \psi := (2\psi ,\nabla \psi )\)).
Proof
By a Taylor expansion of \(\psi \) around \(\theta _i\) for \(i\in \{1,\dots ,m^\star \}\), it holds
and substracting \(\int _{\Theta _i}\psi \mathrm {d}\nu ^\star = r_i^2\phi (\theta _i)\) yields
where we have used a bias-variance decomposition for the quadratic term. The result follows by summing the integrals over each \(\Theta _i\) and using the expression of b. \(\square \)
D.3 Bound on the distance to minimizers
In the next lemma, we bound the quantity \(W_\tau (\nu )\) from Eq. (20) globally in terms of the function values. It involves the quantity \(v^\star >0\), defined so that for any local minimum \(\theta \) of \(J'_{\nu ^\star }\), either \(\theta = \theta _i\) for some \(i\in \{1,\dots ,m^\star \}\) or \(J'_{\nu ^\star }(\theta )\ge v^\star \) (this quantity is positive under (A5)). We also recall that \({\tilde{b}}^\theta _i = {\bar{r}}_i \delta \theta _i\), as defined in Appendix D.3.
Lemma D.3
(Global distance bound) Under (A1-5), let \(\tau _{\mathrm {adm}}\) be an admissible radius \(\tau \) as in Definition 3.6, fix some \(J_{\max }>0\) and let
Then there exists \(C, C'>0\) such that for all \(\tau \le \tau _0\) and \(\nu \in {\mathcal {M}}_+(\Theta )\) such that \(J(\nu )\le J_{\max }\), it holds
Proof
Let us write \(f_\nu :=\int \phi \mathrm {d}\nu \) and \(f^\star = \int \phi \mathrm {d}\nu ^\star \). By strong convexity of R at \(f^\star \), and optimality of \(\mu ^\star \), there exists \(C>0\) such that for all \(\nu \in {\mathcal {M}}_+(\Theta )\) it holds
To prove the first claim, we thus have to bound \(W_\tau (\nu )\) using the terms in the right-hand side of (21).
Step 1 By a Taylor expansion, one has for \(\theta \in \Theta _i\) for \(i\in \{1,\dots ,m^\star \}\),
Thus, if \(\Vert \theta - \theta _i\Vert \le 3\sigma _{\min }(H)/(2 \mathrm {Lip}( \nabla ^2 J'_{\nu ^\star }))\), then \(J'_{\nu ^\star }(\theta ) \ge \frac{1}{4} (\theta -\theta _i)^\intercal H_i (\theta -\theta _i)\) for \(\theta \in \Theta _i\). Decomposing the integral of this quadratic term into bias and variance, we get
and we deduce a first bound by summing the terms for \(i\in \{1,\dots ,m^\star \}\),
Step 2 In order to lower bound the integral over \(\Theta _0\), we first derive a lower bound for \(J'_{\nu ^\star }\) on \(\Theta _0\). This is a continuously differentiable and nonnegative function on the closed domain \(\Theta _0\), so its minimum is attained either at a local minimum in the interior of \(\Theta _0\) or on its boundary. Using the quadratic lower bound from the previous paragraph, it follows that for \(\theta \in \Theta _0\),
Thus, if we also assume that \(\tau \le 2\sqrt{v^\star /\sigma _{\min }(H)}\) then \(J'_{\nu ^\star }(\theta )\ge \tau ^2 \sigma _{\min }(H)/4\) for \(\theta \in \Theta _0\) and it follows that
Using inequality (21) we have shown so far that
Notice that \({\tilde{W}}_\tau (\nu )\) is similar to \(W_\tau (\nu )\) but it does not contain the terms controlling the deviations of mass \(\vert {\bar{r}}_i-r_i\vert \). These quantities can be controlled by using the coercivity of R, i.e. the last term in (21), as we do now.
Step 3 Using the first order expansion of Lemma D.2 then squaring gives
Since we have assumed that K is positive definite, it follows
and thus, after rearranging the terms
It follows that \(\Vert b\Vert \le C\Vert f_\nu -f^\star \Vert + C{\tilde{W}}_\tau (\nu )^2\). Also, by inequality (21), if \(J(\nu )\le J_{\max }\), then \(\Vert f_\nu -f^\star \Vert ^2\le C(J(\nu )-J^\star )\). Moreover, by inequality (22), we get
We finally combine with the bound on \({\tilde{W}}_\tau (\nu )\) to conclude, since \(W_\tau (\nu )^2\le {\tilde{W}}_\tau (\nu )^2+\Vert b\Vert ^2\). \(\square \)
D.4 Proof of the distance inequality (Proposition 3.2)
By Lemma D.1, it holds
Moreover, by Lemma D.3, there exists \(\tau _0>0\) and \(C>0\) such that
Combining these two lemmas, it follows that for some \(C'>0\), we have
This also implies a control on the Bounded-Lipschitz distance since it holds \((\Vert \nu - \nu ^\star \Vert _{\mathrm {BL}}^*)^2\le (2+\pi ^2/2)(\nu (\Theta )+\nu ^\star (\Theta )){\widehat{W}}_2(\nu ,\nu ^\star )^2\), see [42, Prop. 7.18].
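Spelling the combination out (taking the bound of Lemma D.3 in the form \(W_\tau (\nu )^2\le C\,(J(\nu )-J^\star )\), which is our reading of how it is used here):

```latex
% Combining Lemma D.1 with Lemma D.3: for J(\nu) \le J_{\max} and
% \tau \le \tau_0,
\widehat{W}_2(\nu,\nu^\star)^2
 \;\le\; W_\tau(\nu)^2\,\bigl(1+O(\tau^2)+O(W_\tau(\nu)^2)\bigr)^2
 \;\le\; C'\,\bigl(J(\nu)-J^\star\bigr).
```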
D.5 Local estimate of the objective
We now prove a local expansion formula for J.
Proposition D.4
(Local expansion) It holds
where \(\mathop {\mathrm {err}}(\tau ,\nu ) =O( \tau (\Vert {\tilde{b}}^\theta \Vert ^2 + \Vert s\Vert ^2) + W_\tau (\nu )^3)\). In particular, if \(\tau \) is fixed small enough,
Proof
Let us write \(f_\nu :=\int \phi \mathrm {d}\nu \) and \(f^\star = \int \phi \mathrm {d}\nu ^\star \). By a second order Taylor expansion of R around \(f^\star \), we have
Using the first order expansion of Lemma D.2 for \(\phi \), we get \( \Vert f_\nu -f^\star \Vert ^2_\star = b^\intercal Kb + O(W_{\tau }(\nu )^3) \). Also, using the second order expansion of Lemma D.2 for \(J'_{\nu ^\star }\) and using the fact that \(J'_{\nu ^\star }\) and its gradient vanish for all \(\theta _i\), we get
and the expansion follows. Notice also that in the expression of \(J(\nu )\), \({\bar{r}}_i\) and \(r_i\) are interchangeable up to introducing higher order error, since \(\vert r_i - {\bar{r}}_i\vert = O(\vert b^r_i\vert )\) (and also \(\Vert {\tilde{b}}^\theta \Vert = \Vert b^\theta \Vert (1+O(W_\tau (\nu )))\)). \(\square \)
D.6 Local estimate of the gradient norm
Proposition D.5
(Gradient estimate) For \(\nu \in {\mathcal {P}}_2(\Omega )\), it holds
where \(\mathop {\mathrm {err}}(\tau ,\nu )\lesssim \tau (\Vert {\tilde{b}}^\theta \Vert ^2 +\Vert s\Vert ^2) +W_\tau (\nu )^3\). In particular, if \(\tau \) is fixed small enough
Proof
For this proof, we write \(f_\nu - f^\star = \delta f_0 +\delta f_b + \delta f_{\mathrm {err}}\) where
where the decomposition follows from Lemma D.2. The expression for the norm of the gradient is as follows:
where \({\bar{\nabla }} J = (2J,\nabla J)\). We start with the following decomposition for \(\theta \in \Theta _i\) (recall that \(J'_\nu (\theta ) =\langle \phi (\theta ),\nabla R(\int \phi \mathrm {d}\nu )\rangle +\lambda \)):
Here we use the notation \(\langle \cdot ,\cdot \rangle _\star \) to denote the quadratic form associated to \(\nabla ^2 R(f^\star )\). Thanks to the optimality conditions \({\bar{\nabla }} J'_{\nu ^\star }(\theta _i)=0\) for \(i\in \{1,\dots ,m\}\), we get
where N collects the higher order terms and is defined as
where \(\Vert {\bar{\nabla }}_j M_{\phi ,3}(\theta _i,\theta )\Vert = O(\Vert \theta -\theta _i\Vert ^2)\) if \(j>0\) and \(O(\Vert \theta -\theta _i\Vert ^3)\) if \(j=0\). Expanding the square gives the following ten terms:
Terms (I)–(II) are the main terms in the expansion, while the other terms are of higher order. The term (I) is a local curvature term and can be expressed as \(\mathrm {(I)} = \sum _{i=1}^m {\bar{r}}_i^2 {{\,\mathrm{tr}\,}}\Sigma _i H_i^2\). The term (II) is a global interaction term which reads
where the entries of \({\bar{K}}\) and \({\bar{H}}\) differ from those of K and H by a factor \({\bar{r}}_i/r_i\). More precisely,
and similarly for \({\bar{H}}- H\). Since \(\vert {\bar{r}}_i/r_i-1\vert = O(\vert b^r_i\vert )\) we have \(\sigma _{\max }({\bar{K}}-K )=O(W_\tau (\nu ))\). It follows, by expanding the square, that
The remaining terms are error terms, that we estimate directly in terms of \(W_\tau (\nu )\) and \(\tau \). We use in particular the fact that by Hölder’s inequality, \(\int _{\Theta _i} \Vert \theta - {\bar{\theta }}_i\Vert \mathrm {d}\nu (\theta ) = O({\bar{r}}_i^2 {{\,\mathrm{tr}\,}}\Sigma _i^{\frac{1}{2}})\). One has
- \(\mathrm {(III)} =O\left( \sum _{i=1}^m {\bar{r}}_i^2 {\bar{r}}_0^4 \right) = O(W_{\tau }(\nu )^4)\);
- \(\mathrm {(IV)} = \mathrm {(V)} = 0\) because the integral of the terms \(H_i(\theta -{\bar{\theta }}_i)\) vanishes;
- \(\mathrm {(VI)} = O\left( \sum _{i=1}^m {\bar{r}}_i^2 (\Vert b\Vert +\Vert \delta \theta _i\Vert )\cdot {\bar{r}}_0^2\right) =O(W_{\tau }(\nu )^3)\);
- \(\mathrm {(VII)} = O(\tau (\Vert {\tilde{b}}^\theta \Vert ^2 + \Vert s\Vert ^2)) + O(W_{\tau }(\nu )^3)\);
- \(\mathrm {(VIII)} = O(W_{\tau }(\nu )^3)\);
- \(\mathrm {(IX)} = O(W_{\tau }(\nu )^4)\);
- \(\mathrm {(X)} = O(\tau ^2(\Vert {\tilde{b}}^\theta \Vert ^2+\Vert s\Vert ^2))+O(W_{\tau }(\nu )^4)\).
It follows that overall, the error term is in \(O(\tau (\Vert {\tilde{b}}^\theta \Vert ^2+\Vert s\Vert ^2)+W_{\tau }(\nu )^3)\). It remains to lower bound the norm of the gradient over \(\Theta _0\), which can be done as follows. As seen in the proof of Lemma D.3, if \(\tau \) is small enough then \(J'_{\nu ^\star }(\theta )\ge \tau ^2\sigma _{\min }(H)/4\) for \(\theta \in \Theta _0\). Considering only the first component of the gradient, it holds
Using the expansion \(J'_\nu (\theta )=J'_{\nu ^\star }(\theta )+\langle \phi (\theta ),M_{\nabla R,1}(f^\star ,f_\nu )\rangle \), we get
The result follows by collecting all the estimates above. \(\square \)
Proof of the sharpness inequality (Theorem 3.3)
By Proposition D.4 we have that for \(\tau >0\) small enough
where \(C = \sigma _{\max }(K+H) +\Vert J'_{\nu ^\star }\Vert _\infty \).
Similarly, by Proposition D.5, for \(\tau \) small enough, it holds
where \(C'= \frac{1}{8}\sigma _{\min }(H)^2\tau ^4\). Now fix \(\tau >0\) satisfying the hypothesis of Lemma D.3 and the two previous inequalities. By Lemma D.3, \(W_\tau (\nu ) = O((J(\nu )-J^\star )^\frac{1}{2})\). We deduce that there exist \(J_0>J^\star \) and \(\kappa _0>0\) such that whenever \(\nu \in {\mathcal {M}}_+(\Theta )\) satisfies \(J(\nu )<J_0\), one has
Finally, notice that if different metric factors \((\alpha ,\beta )\ne (1,1)\) are introduced, one can always lower bound the new gradient squared norm as
which proves the statement for any \((\alpha ,\beta )\). Note however that if one wants to make a more quantitative bound, then there are values \((\alpha _0,\beta _0)\) that would lead to a better conditioning and potentially higher values for \(J_0\). In this case, the factor appearing in the sharpness inequality should rather be \(\min \{\alpha /\alpha _0,\beta /\beta _0\}\).
Estimation of the mirror rate function
We provide an upper bound for the mirror rate function \(\mathcal {Qq}\) in the situation that is of interest to us, with \(\nu ^\star \) sparse. Note that this approach could be generalized to arbitrary \(\nu ^\star \).
Lemma E.1
Under (A1), there exists \(C_{\Theta }> 0\) that only depends on the curvature of \(\Theta \), such that for all \(\nu ^\star ,\nu _0\in {\mathcal {M}}_+(\Theta )\) with \(\nu ^\star =\sum _{i=1}^{m^\star } r_i^2\delta _{\theta _i}\) and \(\nu _0=\rho {{\,\mathrm{vol}\,}}\) where \(\log \rho \) is \(L\)-Lipschitz, it holds
Moreover, for any other \({\hat{\nu }}_0\in {\mathcal {M}}_+(\Theta )\), it holds \( \mathcal {Qq}_{\nu ^\star ,{\hat{\nu }}_0}(\tau )\le \mathcal {Qq}_{\nu ^\star ,\nu _0}(\tau ) + \nu ^\star (\Theta )\cdot W_\infty (\nu _0,{\hat{\nu }}_0). \)
In the context of Lemma E.1, we introduce the quantity,
which measures how good a prior \(\rho \) is for the (a priori unknown) minimizer \(\nu ^\star \). With this quantity, the conclusion of Lemma E.1 reads, for \(\tau \ge L\),
Proof
Let us build \(\nu _\epsilon \) in such a way that the quantity defining \(\mathcal {Qq}_{\nu ^\star ,\nu _0}(\tau )\) in Eq. (15) is small. For this, fix a radius \(\epsilon >0\) and consider the measure \(\nu _\epsilon \) defined as the normalized volume measure on each geodesic ball of radius \(\epsilon \) around each \(\theta _i\), carrying mass \(r_i^2\) on this ball and vanishing everywhere else. Using the transport map that sends each of these balls to its center \(\theta _i\), we get, if \(\Theta \) is flat,
where \(V^{(d)}(\epsilon )\) is the volume of a ball of radius \(\epsilon \) in \({\mathbb {R}}^d\), that scales as \(\epsilon ^d\). Using an integration by parts, it follows
thus \(W_1(\nu _\epsilon ,\nu ^\star ) \le \nu ^\star (\Theta )\epsilon \). In the general case where \(\Theta \) is a potentially curved manifold, this upper bound also depends on the curvature of \(\Theta \) around each \(\theta _i\), a dependency that we absorb into the multiplicative constant, so that \(W_1(\nu _\epsilon ,\nu ^\star ) \le C\nu ^\star (\Theta )\epsilon \). Let us now control the entropy term. Writing \(\rho _\epsilon =\mathrm {d}\nu _\epsilon /\mathrm {d}{{{\,\mathrm{vol}\,}}}\) and \(\Theta _i\) for the geodesic ball of radius \(\epsilon \) around \(\theta _i\), it holds
The integral term can be estimated as follows,
Recalling that \(-\log V^{(d)}(\epsilon ) \le -d\log (\epsilon )+C\) for some C that only depends on the curvature of \(\Theta \), we get that the right-hand side of (15) is bounded by
Let us fix \(\epsilon >0\) by minimizing \(C\nu ^\star (\Theta )\epsilon -\nu ^\star (\Theta )d\log (\epsilon )/\tau \), which gives \(\epsilon = d/(C\tau )\). The first claim follows by plugging this value for \(\epsilon \) in the expression above.
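The minimization over \(\epsilon \) above can be checked numerically; dropping the common factor \(\nu ^\star (\Theta )\), the objective per unit mass is \(g(\epsilon ) = C\epsilon - d\log (\epsilon )/\tau \), and the constants below are arbitrary choices for the check.

```python
import numpy as np

# Sanity check of the minimization step in the proof: g(eps) = C*eps - d*log(eps)/tau
# (the common factor nu_star(Theta) is dropped) is minimized at eps = d/(C*tau).
C, d, tau = 2.0, 3, 50.0
eps_grid = np.linspace(1e-4, 1.0, 200000)          # fine grid on (0, 1]
g = C * eps_grid - d * np.log(eps_grid) / tau
eps_num = eps_grid[np.argmin(g)]                   # numerical minimizer
eps_ana = d / (C * tau)                            # analytic minimizer, = 0.03 here
print(eps_num, eps_ana)
```

Plugging \(\epsilon = d/(C\tau )\) back gives \(g(\epsilon ) = (d/\tau )(1+\log (C\tau /d))\), which is the \(O(d\log (\tau )/\tau )\) behavior appearing in the lemma.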
For the second claim of the statement, let us build a suitable candidate \({\hat{\nu }}_\epsilon \) in order to upper bound the infimum that defines \(\mathcal {Qq}_{\nu ^\star ,{\hat{\nu }}_0}(\tau )\). Let T be an optimal transport map from \(\nu _0\) to \({\hat{\nu }}_0\) for \(W_\infty \), i.e. a measurable map \(T{:}\,\Theta \rightarrow \Theta \) satisfying \(T_\# \nu _0 = {\hat{\nu }}_0\) and \(\max \{{{\,\mathrm{dist}\,}}(\theta ,T(\theta ));\; \theta \in {{\,\mathrm{spt}\,}}\nu _0(=\Theta )\} = W_\infty (\nu _0,{\hat{\nu }}_0)\) (see [53, Sec. 3.2], the absolute continuity of \(\nu _0\) is sufficient for such a map to exist). Now we define \({\hat{\nu }}_\epsilon = T_\# \nu _\epsilon \) where \(\nu _\epsilon \) is such that \({\mathcal {H}}(\nu _\epsilon ,\nu _0)<\infty \). Since the relative entropy is non-increasing under pushforwards, it holds \({\mathcal {H}}({\hat{\nu }}_\epsilon ,{\hat{\nu }}_0)\le {\mathcal {H}}(\nu _\epsilon ,\nu _0)\). Moreover, it holds \(\Vert \nu _\epsilon -{\hat{\nu }}_\epsilon \Vert _{\mathrm {BL}}^*\le W_1(\nu _\epsilon ,{\hat{\nu }}_\epsilon )\le \nu ^\star (\Theta ) W_\infty (\nu _\epsilon ,{\hat{\nu }}_\epsilon )\). Thus we have
The claim follows by noticing that, by construction, \(W_\infty (\nu _\epsilon ,{\hat{\nu }}_\epsilon ) \le W_\infty (\nu _0,{\hat{\nu }}_0)\) and then by taking the infimum in \(\nu _\epsilon \). \(\square \)
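The map-to-center bound \(W_1(\nu _\epsilon ,\nu ^\star )\le \nu ^\star (\Theta )\epsilon \) used in the first part of the proof can be illustrated numerically in a flat one-dimensional setting (the weights and radius below are arbitrary choices for the check):

```python
import numpy as np

# Monte-Carlo check of the map-to-center coupling: transporting the uniform
# measure on each interval of radius eps around theta_i (carrying mass r_i^2)
# onto delta_{theta_i} costs sum_i r_i^2 * E|U_i| <= nu_star(Theta) * eps,
# where U_i is uniform on [-eps, eps].
rng = np.random.default_rng(1)
r2 = np.array([1.0, 0.5])          # weights r_i^2, so nu_star(Theta) = 1.5
eps = 0.05
cost = sum(w * np.mean(np.abs(rng.uniform(-eps, eps, 10**5))) for w in r2)
print(cost, r2.sum() * eps)        # in 1D the cost is about half the bound
```

Since \(\mathbb {E}\vert U_i\vert = \epsilon /2\) in one dimension, the coupling cost is roughly half the stated bound, confirming that the bound is conservative but of the right order in \(\epsilon \).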
Global convergence for gradient descent
In the following result, we study the non-convex gradient descent updates \(\mu _{k+1} = (T_k)_\# \mu _k\) and \(\nu _k={\mathsf {h}}\mu _k\) where
with step-sizes \(\alpha , \beta >0\). When \(\beta =0\), we recover mirror descent updates in \({\mathcal {M}}_+(\Theta )\) with the entropy mirror map (more specifically, this is true when \(\mathrm {Ret}\) is the “mirror” retraction defined in Sect. 2.3).
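As an illustration of these updates, here is a minimal NumPy sketch of gradient descent on the weights and positions of particles for a toy one-dimensional sparse deconvolution problem. The Gaussian feature map, the least-squares-plus-mass objective, and the flat (identity) retraction are assumptions of this sketch, not the paper's setting.

```python
import numpy as np

# Toy objective J(nu) = 0.5*||f_nu - f_star||^2 + lam*nu(Theta), where
# f_nu = sum_i r_i^2 * phi(theta_i) with Gaussian features sampled on a grid.
rng = np.random.default_rng(0)
G = np.linspace(0.0, 1.0, 200)                     # sampling grid for f
sig, lam = 0.05, 1e-3                              # feature width, penalty

def features(theta):                               # phi(theta_i) sampled on G
    return np.exp(-0.5 * ((G[None, :] - theta[:, None]) / sig) ** 2)

f_star = np.exp(-0.5*((G-0.3)/sig)**2) + 0.5*np.exp(-0.5*((G-0.7)/sig)**2)

def loss(r, theta):                                # J(nu): fit + mass penalty
    return 0.5*np.mean(((r**2) @ features(theta) - f_star)**2) + lam*np.sum(r**2)

m = 50                                             # over-parameterization
theta, r = rng.uniform(0, 1, m), np.full(m, 0.1)
alpha, beta = 0.5, 0.002                           # step-sizes on (r, theta)
init = loss(r, theta)
for _ in range(2000):
    Phi = features(theta)
    resid = ((r**2) @ Phi - f_star) / len(G)
    Jp = Phi @ resid + lam                         # J'_{nu_k}(theta_i)
    dJp = (Phi * (G[None, :] - theta[:, None]) / sig**2) @ resid
    r = r * (1 - 2 * alpha * Jp)                   # weight update
    theta = theta - beta * dJp                     # position update (flat case)
final = loss(r, theta)
print(init, final)
```

Starting from many small particles spread over the domain, the weights concentrate near the two true spikes and the objective decreases by orders of magnitude, as the global convergence theory predicts.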
Lemma F.1
Assume (A1-3) and that J admits a minimizer \(\nu ^\star \in {\mathcal {M}}_+(\Theta )\). Then there exist \(C,\eta _{\max }>0\) such that for all \(\nu _0\in {\mathcal {M}}_+(\Theta )\), denoting \(B = \sup _{J(\nu )\le J(\nu _0)} \Vert J'_{\nu }\Vert _{\mathrm {BL}}\), if \(\max \{\alpha ,\beta \}<\eta _{\max }\), it holds
Proof
As in the proof of Lemma 2.5, we define \((T_k^r(\theta ),T_k^\theta (\theta )) :=T_k(1,\theta )\) and we define recursively \(\nu ^\epsilon _{k+1}=(T^\theta _k)_\# \nu ^\epsilon _k\) where \(\nu ^\epsilon _0\) is such that \({\mathcal {H}}(\nu ^\epsilon _0,\nu _0)<\infty \). Using the invariance of the relative entropy under diffeomorphisms (indeed, \(T^\theta _k\) is a diffeomorphism of \(\Theta \) for \(\beta \) small enough), and doing a first order expansion of \(T^r_k = 1 -2\alpha J'_{\nu _k} +O(\alpha ^2)\) it holds for \(\beta \) small enough
where the term in \(O(\alpha )\) originates from a first order approximation of the retraction. Now, taking \(\max \{\alpha ,\beta \}\) small enough to ensure decrease of \((J(\nu _k))_k\) (by Lemma 2.5) so that C above can be chosen independently of k, it follows
by bounding each term \(\Vert \nu _{k'}^\epsilon -\nu _0^\epsilon \Vert \) by \(B\beta k'\). \(\square \)
Proof of Theorem 4.2 (gradient descent)
The proof follows closely that of Theorem 4.1 but we do not track the “constants” (this would be more tedious). By Lemma E.1, there exists \(C>0\) (that depends on \({\bar{{\mathcal {H}}}}\), the curvature of \(\Theta \) and \(\nu ^\star (\Theta )\)) such that \(\mathcal {Qq}_{\nu ^\star ,{\hat{\nu }}_0}(\tau )\le C(\log \tau )/\tau +\nu ^\star (\Theta )W_\infty (\nu _0,{\hat{\nu }}_0)\). Combining this with Lemma F.1, we get that when \(\max \{\alpha ,\beta \} \le \eta _{\max }\),
Our goal is to choose \(k_0,\alpha , \beta \) and \(W_\infty (\nu _0,{\hat{\nu }}_0)\) so that this quantity is smaller than \(\Delta _0:=J_0-J^\star \). With \(\alpha =1/\sqrt{k}\) and \(\beta = \beta _0/k\) we get
Then, using a bound \(\log (u)\le C_\epsilon u^\epsilon \), we may choose \(k \gtrsim \Delta _0^{-2-\epsilon }\), \(\beta _0 \le \frac{1}{3} \Delta _0/B^2\) and \(W_\infty (\nu _0,{\hat{\nu }}_0)\le \frac{1}{3} \Delta _0/(B\nu ^\star (\Theta ))\) in order to have \(J(\nu _k)-J^\star \le \Delta _0\). This gives \(\alpha \lesssim \Delta _0^{1+\epsilon /2}\), \(\beta \lesssim \Delta _0^{3+\epsilon }\), and the regime of exponential convergence kicks in after \(k\gtrsim \Delta _0^{-2-\epsilon }\) iterations. \(\square \)
Faster rate for mirror descent
In this section, we show that for a specific choice of retraction, the \(O(\log (t)/t)\) convergence rate of the gradient flow is preserved by gradient descent.
Proposition G.1
(Mirror flow, fast rate) Assume (A1-4) and consider the infinite dimensional mirror descent update
which corresponds to the so-called mirror retraction in Sect. 2.3 and \(\beta =0\). For any \(\nu _0 \in {\mathcal {M}}_+(\Theta )\), there exists \(\alpha _{\max }>0\) such that for \(\alpha \le \alpha _{\max }\) it holds, denoting \(B_{\nu _0} = \sup _{J(\nu )\le J(\nu _0)} \Vert J'_{\nu }\Vert _{\mathrm {BL}}\),
In particular, combining with Lemma E.1, if \(\nu _0 = \rho \!{{\,\mathrm{vol}\,}}\) has a smooth positive density, then \(J(\nu _k)-J^\star = O(\log (k)/k)\).
Proof
Consider \(\nu _\epsilon \in {\mathcal {M}}_+(\Theta )\) such that \({\mathcal {H}}(\nu _\epsilon ,\nu _0)<\infty \). It holds
where the first equality is obtained by rearranging terms in the definition of \({\mathcal {H}}\), and the second one is specific to the mirror retraction. Let us estimate the two terms in the right-hand side. Using convexity inequalities, we get
Here the term in \(O(\alpha \Vert g_{\nu _k}\Vert ^2_{L^2(\nu _k)})\) comes from the proof of Lemma 2.5 (note that the iterates remain in a sublevel of J for \(\alpha \) small enough). As for the relative entropy term, we have, using the convexity inequality \(\exp (u)\ge 1+ u\),
We use this inequality in place of the strong convexity of the mirror function used in the usual proof of mirror descent (because there is no Pinsker inequality on \({\mathcal {M}}_+(\Theta )\)). Coming back to the first equality we have derived, it holds,
Thus for \(\alpha \) small enough, it holds
Summing over K iterations and dividing by K, we get
Since for \(\alpha \) small enough \((J(\nu _k))_{k\ge 1}\) is decreasing (by Lemma 2.5), the result follows. \(\square \)
Convergence rate for lower bounded densities
In this section, we justify the claim made in Sect. 4.3 about the convergence without condition on \(\beta /\alpha \). Let us recall the result that we want to prove.
Proposition H.1
Under (A1-3), for any \(J_{\max }>J^\star \), there exists \(C>0\) such that for any \(\eta ,t>0\) and \(\nu _0\in {\mathcal {M}}_+(\Theta )\) satisfying \(J(\nu _0)\le J_{\max }\), if the projected gradient flow (9) satisfies for \(0\le s \le t\),
where \(S_t = \{\theta \in \Theta ;\; J'_{\nu _s}(\theta )\le 0 \text { for some } s\in {[0,t]}\}\), then \( J(\nu _t)-J^\star \le \frac{C}{\sqrt{\alpha \eta t}}. \)
Proof
Following [60], we start with the convexity inequality
Let us control these two terms separately. On the one hand, one has by Jensen’s inequality
Using the fact that on sublevels of J, \(\nu (\Theta )\) and \(\Vert g_{\nu }\Vert ^2_{L^2(\nu )}\) are bounded, we have, for some \(C>0\),
On the other hand, we have
where the last equality defines \(v_t\le 0\). Using the gradient flow structure, let us show that a non-zero \(v_t\) and a lower bound \(\eta \) on the density of \(\nu _t\) (at least on the set \(\{J'_{\nu _t}\le 0\}\)) guarantee a decrease of the objective. Indeed, letting \(\Theta _t = \{\theta \in \Theta ;\; J'_{\nu _t}(\theta ) \le v_t /2\}\) (which could be empty), we get
Moreover, the Lipschitz constant of \(J'_{\nu }\) is bounded on sublevels of J, and thus along gradient flow trajectories, so there exists \(C'>0\) such that \({{\,\mathrm{vol}\,}}(\Theta _t)\ge C'\cdot \vert v_t\vert \). It follows
Coming back to our first inequality, we have
for some \(C''>0\) that, given \(J(\nu _0)\), is independent of \(\alpha ,\eta \) and \(\nu _t\). It remains to remark that a continuously differentiable and positive function h that satisfies \(h(t)\le C^{-1/3} \cdot (-h'(t))^{1/3}\) satisfies \(C\le -h'(t)/h(t)^3 = \frac{1}{2} \frac{\mathrm {d}}{\mathrm {d}t}(h(t)^{-2})\) and, after integrating between 0 and t, \(h(t)\le \left( 2Ct +h(0)^{-2}\right) ^{-1/2}\le \frac{1}{\sqrt{2Ct}}\). We conclude by taking \(h(t)=J(\nu _t)-J^\star \) and \(C\propto \alpha \eta \). \(\square \)
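The integral-comparison step closing the proof can be checked numerically: taking the extremal case \(h'(t)=-Ch(t)^3\) of the differential inequality, an explicit Euler discretization stays below the claimed bound \((2Ct+h(0)^{-2})^{-1/2}\) (the constants below are arbitrary choices for the check).

```python
import numpy as np

# Check: a positive h with h'(t) = -C*h(t)**3 (worst case of
# h <= C^{-1/3} * (-h')^{1/3}) satisfies h(t) <= (2*C*t + h(0)**-2)**(-1/2),
# which is itself below 1/sqrt(2*C*t).
C, h0, dt = 4.0, 1.0, 1e-4
h, t, ok = h0, 0.0, True
for _ in range(200000):                  # integrate up to t = 20
    h += dt * (-C * h ** 3)              # explicit Euler step
    t += dt
    ok &= h <= (2 * C * t + h0 ** -2) ** -0.5 + 1e-6
print(ok, h, (2 * C * t + h0 ** -2) ** -0.5)
```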
Cite this article
Chizat, L. Sparse optimization on measures with over-parameterized gradient descent. Math. Program. 194, 487–532 (2022). https://doi.org/10.1007/s10107-021-01636-z