1 Introduction

We consider optimization problems of the form

$$\begin{aligned} \min _{x \in {\mathcal {M}}} f(x) \end{aligned}$$
(P)

where f is lower-bounded and twice continuously differentiable on a Riemannian manifold \({\mathcal {M}}\). For the special case where \({\mathcal {M}}\) is a Euclidean space, problem (P) amounts to smooth, unconstrained optimization. The more general case is important for applications notably in scientific computing, statistics, imaging, learning, communications and robotics: see for example [1, 27].

For a general non-convex objective f, computing a global minimizer of (P) is hard. Instead, our goal is to compute approximate first- and second-order critical points of (P). A number of non-convex problems of interest exhibit the property that second-order critical points are optimal [7, 11, 14, 24, 30, 36, 49]. Several of these are optimization problems on nonlinear manifolds. Therefore, theoretical guarantees for approximately finding second-order critical points can translate to guarantees for approximately solving these problems.

It is therefore natural to ask for fast algorithms which find approximate second-order critical points on manifolds, within a tolerance \(\epsilon \) (see below). Existing algorithms include RTR [13], ARC [2] and perturbed RGD [20, 44]. Under some regularity conditions, ARC uses Hessian-vector products to achieve a rate of \(O(\epsilon ^{-7/4})\). In contrast, under the same regularity conditions, perturbed RGD uses only function value and gradient queries, but achieves a poorer rate of \(O(\epsilon ^{-2})\). Does there exist an algorithm which finds approximate second-order critical points with a rate of \(O(\epsilon ^{-7/4})\) using only function value and gradient queries? The answer was known to be yes in Euclidean space. Can it also be done on Riemannian manifolds, hence extending these guarantees to the applications treated in the aforementioned references? We resolve that question positively with the algorithm \(\mathtt {PTAGD}\) below.

From a different perspective, the recent success of momentum-based first-order methods in machine learning [42] has encouraged interest in momentum-based first-order algorithms for non-convex optimization which are provably faster than gradient descent [15, 28]. We show such provable guarantees can be extended to optimization under a manifold constraint. From this perspective, our paper is part of a body of work theoretically explaining the success of momentum methods in non-convex optimization.

There has been significant difficulty in accelerating geodesically convex optimization on Riemannian manifolds. See “Related literature” below for more details on best known bounds [3] as well as results proving that acceleration in certain settings is impossible on manifolds [26]. Given this difficulty, it is not at all clear a priori that it is possible to accelerate non-convex optimization on Riemannian manifolds. Our paper shows that it is in fact possible.

We design two new algorithms and establish worst-case complexity bounds under Lipschitz assumptions on the gradient and Hessian of f. Beyond a theoretical contribution, we hope that this work will provide an impetus to look for more practical fast first-order algorithms on manifolds.

More precisely, if the gradient of f is L-Lipschitz continuous (in the Riemannian sense defined below), it is known that Riemannian gradient descent can find an \(\epsilon \)-approximate first-order critical point in at most \(O(\Delta _f L / \epsilon ^2)\) queries, where \(\Delta _f\) upper-bounds the gap between initial and optimal cost value [8, 13, 47]. Moreover, this rate is optimal in the special case where \({\mathcal {M}}\) is a Euclidean space [16], but it can be improved under the additional assumption that the Hessian of f is \(\rho \)-Lipschitz continuous.

Recently, in Euclidean space, Carmon et al. [15] have proposed a deterministic algorithm for this setting (L-Lipschitz gradient, \(\rho \)-Lipschitz Hessian) which requires at most \({{\tilde{O}}}(\Delta _f L^{1/2} \rho ^{1/4} / \epsilon ^{7/4})\) queries (up to logarithmic factors), and is independent of dimension. This is a speed-up of Riemannian gradient descent by a factor of \({{\tilde{\Theta }}}(\sqrt{\frac{L}{\sqrt{\rho \epsilon }}})\). For the Euclidean case, it has been shown under these assumptions that first-order methods require at least \(\Omega (\Delta _f L^{3/7} \rho ^{2/7} / \epsilon ^{12/7})\) queries [17, Thm. 2]. This leaves a gap of merely \({{\tilde{O}}}(1/\epsilon ^{1/28})\) in the \(\epsilon \)-dependency.

Soon after, Jin et al. [28] showed how a related algorithm with randomization can find \((\epsilon , \sqrt{\rho \epsilon })\)-approximate second-order critical points with the same complexity, up to polylogarithmic factors in the dimension of the search space and in the reciprocal of the probability of failure.

Both the algorithm of Carmon et al. [15] and that of Jin et al. [28] fundamentally rely on Nesterov’s accelerated gradient descent method (AGD) [40], with safeguards against non-convexity. To achieve improved rates, AGD builds heavily on a notion of momentum which accumulates across several iterations. This makes it delicate to extend AGD to nonlinear manifolds, as it would seem that we need to transfer momentum from tangent space to tangent space, all the while keeping track of fine properties.

In this paper, we build heavily on the Euclidean work of Jin et al. [28] to show the following. Assume f has Lipschitz continuous gradient and Hessian on a complete Riemannian manifold satisfying some curvature conditions. With at most \(\tilde{O}(\Delta _f L^{1/2} {{\hat{\rho }}}^{1/4} / \epsilon ^{7/4})\) queries (where \({{\hat{\rho }}}\) is larger than \(\rho \) by an additive term affected by L and the manifold’s curvature),

  1. It is possible to compute an \(\epsilon \)-approximate first-order critical point of f with a deterministic first-order method,

  2. It is possible to compute an \((\epsilon , \sqrt{{{\hat{\rho }}} \epsilon })\)-approximate second-order critical point of f with a randomized first-order method.

In the first case, the complexity is independent of the dimension of \({\mathcal {M}}\). In the second case, the complexity includes polylogarithmic factors in the dimension of \({\mathcal {M}}\) and in the probability of failure. This parallels the Euclidean setting. In both cases (and in contrast to the Euclidean setting), the Riemannian curvature of \({\mathcal {M}}\) affects the complexity in two ways: (a) because \({{\hat{\rho }}}\) is larger than \(\rho \), and (b) because the results only apply when the target accuracy \(\epsilon \) is small enough in comparison with some curvature-dependent thresholds. It is an interesting open question to determine whether such a curvature dependency is inescapable.

We call our first algorithm \(\mathtt {TAGD}\) for tangent accelerated gradient descent, and the second algorithm \(\mathtt {PTAGD}\) for perturbed tangent accelerated gradient descent. Both algorithms and (even more so) their analyses closely mirror the perturbed accelerated gradient descent algorithm (PAGD) of Jin et al. [28], with one core design choice that facilitates the extension to manifolds: instead of transporting momentum from tangent space to tangent space, we run several iterations of AGD (safeguarded against non-convexity) in individual tangent spaces. After an AGD run in the current tangent space, we “retract” back to a new point on the manifold and initiate another AGD run in the new tangent space. In so doing, we only need to understand the fine behavior of AGD in one tangent space at a time. Since tangent spaces are linear spaces, we can capitalize on existing Euclidean analyses. This general approach is in line with prior work in [20], and is an instance of the dynamic trivializations framework of Lezcano-Casado [33].

In order to run AGD on the tangent space \(\mathrm {T}_x{\mathcal {M}}\) at x, we must “pull back” the cost function f from \({\mathcal {M}}\) to \(\mathrm {T}_x{\mathcal {M}}\). A geometrically pleasing way to do so is via the exponential map \(\mathrm {Exp}_x :\mathrm {T}_x{\mathcal {M}}\rightarrow {\mathcal {M}}\), whose defining feature is that for each \(v \in \mathrm {T}_x{\mathcal {M}}\) the curve \(\gamma (t) = \mathrm {Exp}_x(tv)\) is the geodesic of \({\mathcal {M}}\) passing through \(\gamma (0) = x\) with velocity \(\gamma '(0) = v\). Then, \({{\hat{f}}}_x = f \circ \mathrm {Exp}_x\) is a real function on \(\mathrm {T}_x{\mathcal {M}}\) called the pullback of f at x. To analyze the behavior of AGD applied to \({{\hat{f}}}_x\), the most pressing question is:

To what extent does \({{\hat{f}}}_x = f \circ \mathrm {Exp}_x\) inherit the Lipschitz properties of f?

In this paper, we show that if f has Lipschitz continuous gradient and Hessian and if the gradient of f at x is sufficiently small, then \({{\hat{f}}}_x\) restricted to a ball around the origin of \(\mathrm {T}_x{\mathcal {M}}\) has Lipschitz continuous gradient and retains partial Lipschitz-type properties for its Hessian. The norm condition on the gradient and the radius of the ball are dictated by the Riemannian curvature of \({\mathcal {M}}\). These geometric results are of independent interest.

Because \({{\hat{f}}}_x\) retains only partial Lipschitzness, our algorithms depart from the Euclidean case in the following ways: (a) at points where the gradient is still large, we perform a simple gradient step; and (b) when running AGD in \(\mathrm {T}_x{\mathcal {M}}\), we are careful not to leave the prescribed ball around the origin: if we ever do, we take appropriate action. For those reasons and also because we do not have full Lipschitzness but only radial Lipschitzness for the Hessian of \({{\hat{f}}}_x\), minute changes throughout the analysis of Jin et al. [28] are in order.

To be clear, in their current state, \(\mathtt {TAGD}\) and \(\mathtt {PTAGD}\) are theoretical constructs. As one can see from later sections, running them requires the user to know the value of several parameters that are seldom available (including the Lipschitz constants L and \(\rho \)); the target accuracy \(\epsilon \) must be set ahead of time; and the tuning constants as dictated here by the theory are (in all likelihood) overly cautious.

Moreover, to compute the gradient of \({{\hat{f}}}_x\) we need to differentiate through the exponential map (or a retraction, as the case may be). This is sometimes easy to do in closed form (see [33] for families of examples), but it could be a practical hurdle. On the other hand, our algorithms do not require parallel transport. It remains an interesting open question to develop practical accelerated gradient methods for non-convex problems on manifolds.
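
To make this concrete, here is a minimal numerical sketch (ours, for illustration; not an algorithm from this paper) on the unit sphere \(S^{n-1} \subset {{\mathbb {R}}}^n\), where \(\mathrm {Exp}_x(s) = \cos (\Vert s\Vert )\, x + \sin (\Vert s\Vert )\, s/\Vert s\Vert \) is available in closed form. It builds the pullback \({{\hat{f}}}_x = f \circ \mathrm {Exp}_x\) and approximates \(\nabla {{\hat{f}}}_x\) by central finite differences in an orthonormal basis of the tangent space; all helper names are ours.

```python
import numpy as np

def sphere_exp(x, s):
    """Exponential map on the unit sphere S^{n-1}: Exp_x(s) for s in T_x S^{n-1}."""
    ns = np.linalg.norm(s)
    if ns < 1e-16:
        return x
    return np.cos(ns) * x + np.sin(ns) * (s / ns)

def tangent_basis(x):
    """Orthonormal basis of T_x S^{n-1} = {v : <x, v> = 0}, obtained by QR."""
    n = x.size
    Q, _ = np.linalg.qr(np.column_stack([x, np.eye(n)[:, :n - 1]]))
    return Q[:, 1:]              # n x (n-1); columns orthonormal and orthogonal to x

def pullback_grad_fd(f, x, s, h=1e-6):
    """Central finite-difference approximation of the gradient of f_hat_x = f o Exp_x
    at s, returned as a tangent vector at x in ambient coordinates."""
    B = tangent_basis(x)
    g_coords = np.zeros(B.shape[1])
    for i in range(B.shape[1]):
        e = B[:, i]
        g_coords[i] = (f(sphere_exp(x, s + h * e)) - f(sphere_exp(x, s - h * e))) / (2 * h)
    return B @ g_coords

# Example: f(y) = <a, y> restricted to the sphere, evaluated at s = 0.
rng = np.random.default_rng(0)
a = rng.standard_normal(5)
x = rng.standard_normal(5); x /= np.linalg.norm(x)
g0 = pullback_grad_fd(lambda y: a @ y, x, np.zeros(5))
# At s = 0, the pullback gradient equals the Riemannian gradient (I - x x^T) a.
print(np.allclose(g0, a - (x @ a) * x, atol=1e-4))
```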

In closing this introduction, we give simplified statements of our main results. These are all phrased under the following assumption (see Sect. 2 for geometric definitions):

A 1

The Riemannian manifold \({\mathcal {M}}\) and the cost function \(f :{\mathcal {M}}\rightarrow {{\mathbb {R}}}\) have these properties:

  • \({\mathcal {M}}\) is complete, its sectional curvatures are in the interval \([-K, K]\), and the covariant derivative of its Riemann curvature endomorphism is bounded by F in operator norm; and

  • f is lower-bounded by \(f_{\mathrm {low}}\), has L-Lipschitz continuous gradient \(\mathrm {grad}f\) and \(\rho \)-Lipschitz continuous Hessian \(\mathrm {Hess}f\) on \({\mathcal {M}}\).

1.1 Main Geometry Results

As a geometric contribution, we show that pullbacks through the exponential map retain certain Lipschitz properties of f. Explicitly, at a point \(x \in {\mathcal {M}}\) we have the following statement.

Theorem 1.1

Let \(x \in {\mathcal {M}}\). Under A1, let \(B_x(b)\) be the closed ball of radius \(b \le \min \!\left( \frac{1}{4\sqrt{K}}, \frac{K}{4F} \right) \) around the origin in \(\mathrm {T}_x{\mathcal {M}}\). If \(\Vert \mathrm {grad}f(x)\Vert \le Lb\), then

  1. The pullback \({{\hat{f}}}_x = f \circ \mathrm {Exp}_x\) has 2L-Lipschitz continuous gradient \(\nabla {{\hat{f}}}_x\) on \(B_x(b)\), and

  2. For all \(s \in B_x(b)\), we have \(\Vert \nabla ^2 {{\hat{f}}}_x(s) - \nabla ^2 {{\hat{f}}}_x(0)\Vert \le {{\hat{\rho }}} \Vert s\Vert \) with \({{\hat{\rho }}} = \rho + L\sqrt{K}\).

(Above, \(\Vert \cdot \Vert \) denotes both the Riemannian norm on \(\mathrm {T}_x{\mathcal {M}}\) and the associated operator norm. Also, \(\nabla {{\hat{f}}}_x\) and \(\nabla ^2 {{\hat{f}}}_x\) are the gradient and Hessian of \({{\hat{f}}}_x\) on the Euclidean space \(\mathrm {T}_x{\mathcal {M}}\).)

We expect this result to be useful in several other contexts. Section 2 provides a more complete (and somewhat more general) statement. At the same time and independently, Lezcano-Casado [35] develops similar geometric bounds and applies them to study gradient descent in tangent spaces—see “Related literature” below for additional details.

1.2 Main Optimization Results

We aim to compute approximate first- and second-order critical points of f, as defined here:

Definition 1.2

A point \(x \in {\mathcal {M}}\) is an \(\epsilon \)-FOCP for (P) if \(\Vert \mathrm {grad}f(x)\Vert \le \epsilon \). A point \(x \in {\mathcal {M}}\) is an \((\epsilon _1, \epsilon _2)\)-SOCP for (P) if \(\Vert \mathrm {grad}f(x)\Vert \le \epsilon _1\) and \(\lambda _{\mathrm {min}}(\mathrm {Hess}f(x)) \ge -\epsilon _2\), where \(\lambda _{\mathrm {min}}(\cdot )\) extracts the smallest eigenvalue of a self-adjoint operator.
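
As a simple illustration (ours, not from the paper), when \(\mathrm {grad}f(x)\) and \(\mathrm {Hess}f(x)\) are represented as a vector and a symmetric matrix in an orthonormal basis of \(\mathrm {T}_x{\mathcal {M}}\), these two conditions amount to elementary numerical checks:

```python
import numpy as np

def is_focp(grad, eps):
    """epsilon-FOCP check: ||grad f(x)|| <= eps."""
    return np.linalg.norm(grad) <= eps

def is_socp(grad, hess, eps1, eps2):
    """(eps1, eps2)-SOCP check: ||grad f(x)|| <= eps1 and lambda_min(Hess f(x)) >= -eps2.
    Here `hess` is the symmetric matrix of Hess f(x) in an orthonormal tangent basis."""
    return np.linalg.norm(grad) <= eps1 and np.linalg.eigvalsh(hess).min() >= -eps2
```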

In Sect. 5, we define and analyze the algorithm \(\mathtt {TAGD}\). Resting on the geometric result above, that algorithm with the exponential retraction warrants the following claim about the computation of first-order points. The \(O(\cdot )\) notation is with respect to scaling in \(\epsilon \).

Theorem 1.3

If A1 holds, there exists an algorithm (\(\mathtt {TAGD}\)) which, given any \(x_0 \in {\mathcal {M}}\) and small enough tolerance \(\epsilon > 0\), namely,

$$\begin{aligned} \epsilon&\le \frac{1}{144} \min \!\left( \frac{1}{K} {{\hat{\rho }}}, \frac{K^2}{F^2} {{\hat{\rho }}}, \frac{36 \ell ^2}{{{\hat{\rho }}}}\right) \nonumber \\ {}&= \frac{1}{144} \min \!\left( \frac{1}{K}, \frac{K^2}{F^2}, \left( \frac{12 L}{\rho + L\sqrt{K}}\right) ^2 \right) (\rho + L\sqrt{K}), \end{aligned}$$
(1)

produces an \(\epsilon \)-FOCP for (P) using at most a constant multiple of T function and pullback gradient queries, and a similar number of evaluations of the exponential map, where

$$\begin{aligned} T&= (f(x_0) - f_{\mathrm {low}}) \frac{{\hat{\rho }}^{1/4} \ell ^{1/2}}{\epsilon ^{7/4}} \log \!\left( \frac{16\ell }{\sqrt{{\hat{\rho }} \epsilon }}\right) ^{6} \\&= O\!\left( (f(x_0) - f_{\mathrm {low}}) (\rho + L\sqrt{K})^{1/4} L^{1/2} \cdot \frac{1}{\epsilon ^{7/4}} \log \!\left( \frac{1}{\epsilon }\right) ^6 \right) , \end{aligned}$$

with \(\ell = 2L\) and \({{\hat{\rho }}} = \rho + L\sqrt{K}\). The algorithm uses no Hessian queries and is deterministic.

This result is dimension-free but not curvature-free because K and F constrain \(\epsilon \) and affect \({{\hat{\rho }}}\).

Remark 1.4

In the statements of all theorems and lemmas, the notations \(O(\cdot ), \Theta (\cdot )\) only hide universal constants, i.e., numbers like \(\frac{1}{2}\) or 100. They do not hide any parameters. Moreover, \({\tilde{O}}(\cdot ), {\tilde{\Theta }}(\cdot )\) only hide universal constants and logarithmic factors in the parameters.

Remark 1.5

If \(\epsilon \) is large enough (that is, if \(\epsilon > \Theta (\frac{\ell ^2}{{{\hat{\rho }}}})\)), then \(\mathtt {TAGD}\) reduces to vanilla Riemannian gradient descent with constant step-size. The latter is known to produce an \(\epsilon \)-FOCP in \(O(1/\epsilon ^2)\) iterations, yet our result here announces this same outcome in \(O(1/\epsilon ^{7/4})\) iterations. This is not a contradiction: when \(\epsilon \) is large, \(1/\epsilon ^{7/4}\) can be worse than \(1/\epsilon ^2\). In short: the rates are only meaningful for small \(\epsilon \), in which case \(\mathtt {TAGD}\) does use accelerated gradient descent steps.

In Sect. 6 we define and analyze the algorithm \(\mathtt {PTAGD}\). With the exponential retraction, the latter warrants the following claim about the computation of second-order points.

Theorem 1.6

If A1 holds, there exists an algorithm (\(\mathtt {PTAGD}\)) which, given any \(x_0 \in {\mathcal {M}}\), any \(\delta \in (0, 1)\) and small enough tolerance \(\epsilon > 0\) (same condition as in Theorem 1.3) produces an \(\epsilon \)-FOCP for (P) using at most a constant multiple of T function and pullback gradient queries, and a similar number of evaluations of the exponential map, where

$$\begin{aligned} T&= (f(x_0) - f_{\mathrm {low}}) \frac{{\hat{\rho }}^{1/4} \ell ^{1/2}}{\epsilon ^{7/4}} \log \!\left( \frac{d^{1/2} \ell ^{3/2} \Delta _f}{({{\hat{\rho }}} \epsilon )^{1/4} \epsilon ^2 \delta } \right) ^{6} + \frac{\ell ^{1/2}}{{{\hat{\rho }}}^{1/4} \epsilon ^{1/4}} \log \!\left( \frac{d^{1/2} \ell ^{3/2} \Delta _f}{({{\hat{\rho }}} \epsilon )^{1/4} \epsilon ^2 \delta } \right) \\&= O\!\left( (f(x_0) - f_{\mathrm {low}}) (\rho + L\sqrt{K})^{1/4} L^{1/2} \cdot \frac{1}{\epsilon ^{7/4}} \log \!\left( \frac{d}{\epsilon \delta }\right) ^6 \right) , \end{aligned}$$

with \(\ell = 2L\), \({{\hat{\rho }}} = \rho + L\sqrt{K}\), \(d = \dim {\mathcal {M}}\) and any \(\Delta _f \ge \max (f(x_0) - f_{\mathrm {low}}, \sqrt{\epsilon ^3 / {\hat{\rho }}})\). With probability at least \(1 - 2\delta \), that point is also \((\epsilon , \sqrt{{{\hat{\rho }}} \epsilon })\)-SOCP. The algorithm uses no Hessian queries and is randomized.

This result is almost dimension-free, and still not curvature-free for the same reasons as above.

1.3 Related Literature

At the same time and independently, Lezcano-Casado [35] develops geometric bounds similar to our own. Both papers derive the same second-order inhomogeneous linear ODE (ordinary differential equation) describing the behavior of the second derivative of the exponential map. Lezcano-Casado [35] then uses ODE comparison techniques to derive the geometric bounds, while the present work uses a bootstrapping technique. Lezcano-Casado [35] applies these bounds to study gradient descent in tangent spaces, whereas we study non-convex accelerated algorithms for finding first- and second-order critical points.

The technique of pulling back a function to a tangent space is frequently used in other settings within optimization on manifolds. See for example the recent papers of Bergmann et al. [9] and Lezcano-Casado [34]. Additionally, the use of Riemannian Lipschitz conditions in optimization as they appear in Section 2 can be traced back to [21, Def. 4.1] and [23, Def. 2.2].

Accelerating optimization algorithms on Riemannian manifolds has been well-studied in the context of geodesically convex optimization problems. Such problems can be solved globally, and usually the objective is to bound the suboptimality gap rather than finding approximate critical points. A number of papers have studied Riemannian versions of AGD; however, none of these papers have been able to achieve a fully accelerated rate for convex optimization. Zhang and Sra [48] show that if the initial iterate is sufficiently close to the minimizer, then acceleration is possible. Intuitively this makes sense, since manifolds are locally Euclidean. Ahn and Sra [3] pushed this further, developing an algorithm converging strictly faster than RGD, and which also achieves acceleration when sufficiently close to the minimizer.

Alimisis et al. [4,5,6] analyze the problem of acceleration on the class of non-strongly convex functions, as well as under weaker notions of convexity. Interestingly, they also show that in the continuous limit (using an ODE to model optimization algorithms) acceleration is possible. However, it is unclear whether the discretization of this ODE preserves a similar acceleration.

Recently, Hamilton and Moitra [26] have shown that full acceleration (in the geodesically convex case) is impossible in the hyperbolic plane, in the setting where function values and gradients are corrupted by a very small amount of noise. In contrast, in the analogous Euclidean setting, acceleration is possible even with noisy oracles [22].

2 Riemannian Tools and Regularity of Pullbacks

In this section, we build up to and state our main geometric result. As we do so, we provide a few reminders of Riemannian geometry. For more on this topic, we recommend the modern textbooks by Lee [31, 32]. For book-length, optimization-focused introductions see [1, 12]. Some proofs of this section appear in Appendices A and B.

We consider a manifold \({\mathcal {M}}\) with Riemannian metric \(\left\langle {\cdot },{\cdot }\right\rangle _x\) and associated norm \(\Vert \cdot \Vert _x\) on the tangent spaces \(\mathrm {T}_x{\mathcal {M}}\). (In other sections, we omit the subscript x.) The tangent bundle

$$\begin{aligned} \mathrm {T}{\mathcal {M}}= \{ (x, s) : x \in {\mathcal {M}}{\text { and }} s \in \mathrm {T}_x{\mathcal {M}}\} \end{aligned}$$

is itself a smooth manifold. The Riemannian metric provides a notion of gradient.

Definition 2.1

The Riemannian gradient of a differentiable function \(f :{\mathcal {M}}\rightarrow {{\mathbb {R}}}\) is the unique vector field \(\mathrm {grad}f\) on \({\mathcal {M}}\) which satisfies:

$$\begin{aligned} \mathrm {D}f(x)[s]&= \left\langle {\mathrm {grad}f(x)},{s}\right\rangle _x&{\text { for all }} (x, s) \in \mathrm {T}{\mathcal {M}}, \end{aligned}$$

where \(\mathrm {D}f(x)[s]\) is the directional derivative of f at x along s.

The Riemannian metric further induces a uniquely defined Riemannian connection \(\nabla \) (used to differentiate vector fields on \({\mathcal {M}}\)) and an associated covariant derivative \(\mathrm {D}_t\) (used to differentiate vector fields along curves on \({\mathcal {M}}\)). (The symbol \(\nabla \) here is not to be confused with its use elsewhere to denote differentiation of scalar functions on Euclidean spaces.) Applying the connection to the gradient vector field, we obtain Hessians.

Definition 2.2

The Riemannian Hessian of a twice differentiable function \(f :{\mathcal {M}}\rightarrow {{\mathbb {R}}}\) at x is the linear operator \(\mathrm {Hess}f(x)\) to and from \(\mathrm {T}_x{\mathcal {M}}\) defined by

$$\begin{aligned} \mathrm {Hess}f(x)[s]&= \nabla _s \mathrm {grad}f = \left. \mathrm {D}_t\mathrm {grad}f(c(t)) \right| _{t = 0}, \end{aligned}$$

where in the last equality c can be any smooth curve on \({\mathcal {M}}\) satisfying \(c(0) = x\) and \(c'(0) = s\). This operator is self-adjoint with respect to the metric \(\left\langle {\cdot },{\cdot }\right\rangle _x\).
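
As a standard concrete example (not specific to this paper), take \({\mathcal {M}}= S^{n-1} = \{x \in {{\mathbb {R}}}^n : \Vert x\Vert = 1\}\) with the metric inherited from \({{\mathbb {R}}}^n\), and let f be the restriction to \(S^{n-1}\) of a smooth \({\bar{f}} :{{\mathbb {R}}}^n \rightarrow {{\mathbb {R}}}\) with Euclidean gradient \(\nabla {\bar{f}}\) and Hessian \(\nabla ^2 {\bar{f}}\). Then, for \(x \in S^{n-1}\) and \(s \in \mathrm {T}_x S^{n-1} = \{v \in {{\mathbb {R}}}^n : \left\langle {x},{v}\right\rangle = 0\}\),

$$\begin{aligned} \mathrm {grad}f(x)&= (I - xx^\top ) \nabla {\bar{f}}(x),&\mathrm {Hess}f(x)[s]&= (I - xx^\top ) \nabla ^2 {\bar{f}}(x)\, s - \left\langle {x},{\nabla {\bar{f}}(x)}\right\rangle s. \end{aligned}$$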

We can also define the Riemannian third derivative \(\nabla ^3 f\) (a tensor of order three); see [12, Ch. 10] for details. We write \(\left\| {\nabla ^3 f(x)}\right\| \le \rho \) to mean \(\left| \nabla ^3 f(x)(u, v, w)\right| \le \rho \) for all unit vectors \(u, v, w \in \mathrm {T}_x{\mathcal {M}}\).

A retraction \(\mathrm {R}\) is a smooth map from (a subset of) \(\mathrm {T}{\mathcal {M}}\) to \({\mathcal {M}}\) with the following property: for all \((x, s) \in \mathrm {T}{\mathcal {M}}\), the smooth curve \(c(t) = \mathrm {R}(x, ts) = \mathrm {R}_x(ts)\) on \({\mathcal {M}}\) passes through \(c(0) = x\) with velocity \(c'(0) = s\). Such maps are used frequently in Riemannian optimization in order to move on a manifold. For example, a key ingredient of Riemannian gradient descent is the curve \(c(t) = \mathrm {R}_x(-t \mathrm {grad}f(x))\) which initially moves away from x along the negative gradient direction.

To a curve c, we naturally associate a velocity vector field \(c'\). Using the covariant derivative \(\mathrm {D}_t\), we differentiate this vector field along c to define the acceleration \(c'' = \mathrm {D}_tc'\) of c: this is also a vector field along c. In particular, the geodesics of \({\mathcal {M}}\) are the curves with zero acceleration.

The exponential map \(\mathrm {Exp}:{\mathcal {O}}\rightarrow {\mathcal {M}}\)—defined on an open subset \({\mathcal {O}}\) of the tangent bundle—is a special retraction whose curves are geodesics. Specifically, \(\gamma (t) = \mathrm {Exp}(x, ts) = \mathrm {Exp}_x(ts)\) is the unique geodesic on \({\mathcal {M}}\) which passes through \(\gamma (0) = x\) with velocity \(\gamma '(0) = s\). If the domain of \(\mathrm {Exp}\) is the whole tangent bundle, we say \({\mathcal {M}}\) is complete.

To compare tangent vectors in distinct tangent spaces, we use parallel transports. Explicitly, let c be a smooth curve connecting the points \(c(0) = x\) and \(c(1) = y\). We say a vector field Z along c is parallel if its covariant derivative \(\mathrm {D}_tZ\) is zero. Conveniently, for any given \(v \in \mathrm {T}_x{\mathcal {M}}\) there exists a unique parallel vector field along c whose value at \(t = 0\) is v. Therefore, the value of that vector field at \(t = 1\) is a well-defined vector in \(\mathrm {T}_y{\mathcal {M}}\): we call it the parallel transport of v from x to y along c. We introduce the notation

$$\begin{aligned} P_t^c :\mathrm {T}_{c(0)}{\mathcal {M}}\rightarrow \mathrm {T}_{c(t)}{\mathcal {M}}\end{aligned}$$

to denote parallel transport along a smooth curve c from c(0) to c(t). This is a linear isometry: \((P_t^c)^{-1} = (P_t^c)^*\), where the star denotes an adjoint with respect to the Riemannian metric. For the special case of parallel transport along the geodesic \(\gamma (t) = \mathrm {Exp}_x(ts)\), we write

$$\begin{aligned} P_{ts} :\mathrm {T}_x{\mathcal {M}}\rightarrow \mathrm {T}_{\mathrm {Exp}_x(ts)}{\mathcal {M}} \end{aligned}$$
(2)

with the meaning \(P_{ts} = P_t^\gamma \).
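
For instance, on the unit sphere \(S^{n-1}\), writing \(u = s/\Vert s\Vert \), parallel transport along the geodesic \(\gamma (t) = \mathrm {Exp}_x(ts) = \cos (t\Vert s\Vert )\, x + \sin (t\Vert s\Vert )\, u\) has the closed form (a standard computation, included only as an example)

$$\begin{aligned} P_{ts}(v)&= v + \left\langle {u},{v}\right\rangle \left( (\cos (t\Vert s\Vert ) - 1)\, u - \sin (t\Vert s\Vert )\, x \right)&{\text { for all }} v \in \mathrm {T}_x S^{n-1}: \end{aligned}$$

the component of v orthogonal to u is left untouched, while the component along u rotates together with the geodesic.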

Using these tools, we can define Lipschitz continuity of gradients and Hessians. Note that in the particular case where \({\mathcal {M}}\) is a Euclidean space we have \(\mathrm {Exp}_x(s) = x + s\) and parallel transports are identities, so that this reduces to the usual definitions.

Definition 2.3

The gradient of \(f :{\mathcal {M}}\rightarrow {{\mathbb {R}}}\) is L-Lipschitz continuous if

$$\begin{aligned} \Vert P_s^* \mathrm {grad}f(\mathrm {Exp}_x(s)) - \mathrm {grad}f(x)\Vert _x&\le L \Vert s\Vert _x&{\text { for all }} (x, s) \in {\mathcal {O}}, \end{aligned}$$
(3)

where \(P_s^*\) is the adjoint of \(P_s\) with respect to the Riemannian metric.

The Hessian of f is \(\rho \)-Lipschitz continuous if

$$\begin{aligned} \Vert P_s^* \circ \mathrm {Hess}f(\mathrm {Exp}_x(s)) \circ P_s - \mathrm {Hess}f(x)\Vert _x&\le \rho \Vert s\Vert _x&{\text { for all }} (x, s) \in {\mathcal {O}}, \end{aligned}$$
(4)

where \(\Vert \cdot \Vert _x\) denotes both the Riemannian norm on \(\mathrm {T}_x{\mathcal {M}}\) and the associated operator norm.

It is well known that these Lipschitz conditions are equivalent to convenient inequalities, often used to study the complexity of optimization algorithms. More details appear in [12, Ch. 10].

Proposition 2.4

If a function \(f :{\mathcal {M}}\rightarrow {{\mathbb {R}}}\) has L-Lipschitz continuous gradient, then

$$\begin{aligned} \left| f(\mathrm {Exp}_x(s)) - f(x) - \left\langle {\mathrm {grad}f(x)},{s}\right\rangle _x \right|&\le \frac{L}{2} \Vert s\Vert _x^2&{\text { for all }} (x, s) \in {\mathcal {O}}. \end{aligned}$$

If in addition f is twice differentiable, then \(\left\| {\mathrm {Hess}f(x)}\right\| \le L\) for all \(x \in {\mathcal {M}}\).

If f has \(\rho \)-Lipschitz continuous Hessian, then

$$\begin{aligned}&\left| f(\mathrm {Exp}_x(s)) - f(x) - \left\langle {\mathrm {grad}f(x)},{s}\right\rangle _x - \frac{1}{2} \left\langle {s},{\mathrm {Hess}f(x)[s]}\right\rangle _x \right| \\&\quad \le \frac{\rho }{6} \Vert s\Vert _x^3 \quad {\text { and}} \\&\quad \left\| P_{s}^* \mathrm {grad}f(\mathrm {Exp}_x(s)) - \mathrm {grad}f(x) - \mathrm {Hess}f(x)[s] \right\| _x \\&\quad \le \frac{\rho }{2} \Vert s\Vert _x^2 \quad {\text { for all }} (x, s) \in {\mathcal {O}}.&\end{aligned}$$

If in addition f is three times differentiable, then \(\left\| {\nabla ^3 f(x)}\right\| \le \rho \) for all \(x \in {\mathcal {M}}\).

The other way around, if f is three times continuously differentiable and the stated inequalities hold, then its gradient and Hessian are Lipschitz continuous with the stated constants.

For sufficiently simple algorithms, these inequalities may be all we need to track progress in a sharp way. As an example, the iterates of Riemannian gradient descent with constant step-size 1/L satisfy \(x_{k+1} = \mathrm {Exp}_{x_k}(s_k)\) with \(s_k = -\frac{1}{L}\mathrm {grad}f(x_k)\). It follows directly from the first inequality above that \(f(x_k) - f(x_{k+1}) \ge \frac{1}{2L} \Vert \mathrm {grad}f(x_k)\Vert ^2\). From there, it takes a brief argument to conclude that this method finds a point with gradient smaller than \(\epsilon \) in at most \(2L(f(x_0) - f_{\mathrm {low}})\frac{1}{\epsilon ^2}\) steps. A similar (but longer) story applies to the analysis of Riemannian trust regions and adaptive cubic regularization [2, 13].
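
For completeness, that brief argument is the usual telescoping sum: if \(\Vert \mathrm {grad}f(x_k)\Vert \ge \epsilon \) for \(k = 0, \ldots , T-1\), then

$$\begin{aligned} f(x_0) - f_{\mathrm {low}}&\ge f(x_0) - f(x_T) = \sum _{k=0}^{T-1} \left( f(x_k) - f(x_{k+1}) \right) \ge \sum _{k=0}^{T-1} \frac{1}{2L} \Vert \mathrm {grad}f(x_k)\Vert ^2 \ge \frac{T \epsilon ^2}{2L}, \end{aligned}$$

so that \(T \le 2L(f(x_0) - f_{\mathrm {low}})\frac{1}{\epsilon ^2}\): as soon as the number of steps exceeds this bound, one of the iterates produced must have gradient norm smaller than \(\epsilon \).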

However, the inequalities in Proposition 2.4 fall short when finer properties of the algorithms are only visible at the scale of multiple combined iterations. This is notably the case for accelerated gradient methods. For such algorithms, individual iterations may not achieve spectacular cost decrease, but a long sequence of them may accumulate an advantage over time (using momentum). To capture this advantage in an analysis, it is not enough to apply inequalities above to individual iterations. As we turn to assessing a string of iterations jointly by relating the various gradients and step directions we encounter, the nonlinearity of \({\mathcal {M}}\) generates significant hurdles.

For these reasons, we study the pullbacks of the cost function, namely, the functions

$$\begin{aligned} {{\hat{f}}}_x&= f \circ \mathrm {Exp}_x :\mathrm {T}_x{\mathcal {M}}\rightarrow {{\mathbb {R}}}. \end{aligned}$$
(5)

Each pullback is defined on a linear space, hence we can in principle run any Euclidean optimization algorithm on \({{\hat{f}}}_x\) directly: our strategy is therefore to apply a momentum-based method on \({\hat{f}}_x\). To this end, we now work towards showing that if f has Lipschitz continuous gradient and Hessian then \({{\hat{f}}}_x\) also has certain Lipschitz-type properties.

The following formulas appear in [2, Lem. 5]: we are interested in the case \(\mathrm {R}= \mathrm {Exp}\). (We use \(\nabla \) and \(\nabla ^2\) to designate gradients and Hessians of functions on Euclidean spaces: not to be confused with the connection \(\nabla \).)

Lemma 2.5

Given \(f :{\mathcal {M}}\rightarrow {{\mathbb {R}}}\) twice continuously differentiable and \((x, s)\) in the domain of a retraction \(\mathrm {R}\), the gradient and Hessian of the pullback \({{\hat{f}}}_x = f \circ \mathrm {R}_x\) at \(s \in \mathrm {T}_x{\mathcal {M}}\) are given by

$$\begin{aligned} \nabla {{\hat{f}}}_x(s)&= T_s^* \mathrm {grad}f(\mathrm {R}_x(s))&{\text { and }}&\nabla ^2 {{\hat{f}}}_x(s)&= T_s^* \circ \mathrm {Hess}f(\mathrm {R}_x(s)) \circ T_s + W_s, \end{aligned}$$
(6)

where \(T_s\) is the differential of \(\mathrm {R}_x\) at s (a linear operator):

$$\begin{aligned} T_s&= \mathrm {D}\mathrm {R}_x(s) :\mathrm {T}_{x}{\mathcal {M}}\rightarrow \mathrm {T}_{\mathrm {R}_x(s)}{\mathcal {M}}, \end{aligned}$$
(7)

and \(W_s\) is a self-adjoint linear operator on \(\mathrm {T}_x{\mathcal {M}}\) defined through polarization by

$$\begin{aligned} \left\langle {W_s[\dot{s}]},{\dot{s}}\right\rangle _x&= \left\langle {\mathrm {grad}f(\mathrm {R}_x(s))},{c''(0)}\right\rangle _{\mathrm {R}_x(s)}, \end{aligned}$$
(8)

with \(c''(0) \in \mathrm {T}_{\mathrm {R}_x(s)}{\mathcal {M}}\) the (intrinsic) acceleration on \({\mathcal {M}}\) of \(c(t) = \mathrm {R}_x(s+t\dot{s})\) at \(t = 0\).
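
As a sanity check on these formulas, consider the Euclidean case \({\mathcal {M}}= {{\mathbb {R}}}^n\) with \(\mathrm {R}_x(s) = x + s\): then \(T_s\) is the identity and \(c(t) = x + s + t\dot{s}\) is a straight line, so that \(c''(0) = 0\) and \(W_s = 0\), and (6) reduces to the familiar

$$\begin{aligned} \nabla {{\hat{f}}}_x(s)&= \nabla f(x + s)&{\text { and }}&\nabla ^2 {{\hat{f}}}_x(s)&= \nabla ^2 f(x + s). \end{aligned}$$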

Remark 2.6

Throughout, \(s, \dot{s}, \ddot{s}\) will simply denote tangent vectors.

We turn to curvature. The Lie bracket of two smooth vector fields X, Y on \({\mathcal {M}}\) is itself a smooth vector field, conveniently expressed in terms of the Riemannian connection as \([X, Y] = \nabla _X Y - \nabla _Y X\). Using this notion, the Riemann curvature endomorphism R of \({\mathcal {M}}\) is an operator which maps three smooth vector fields X, Y, Z of \({\mathcal {M}}\) to a fourth smooth vector field as:

$$\begin{aligned} R(X, Y) Z&= \nabla _X \nabla _Y Z - \nabla _Y \nabla _X Z - \nabla _{[X, Y]} Z. \end{aligned}$$
(9)

Whenever R is identically zero, we say \({\mathcal {M}}\) is flat: this is the case notably when \({\mathcal {M}}\) is a Euclidean space and when \({\mathcal {M}}\) has dimension one (e.g., a circle is flat, while a sphere is not).

Though it is not obvious from the definition, the value of the vector field R(X, Y)Z at \(x \in {\mathcal {M}}\) depends on X, Y, Z only through their values at x. Therefore, given \(u, v, w \in \mathrm {T}_x{\mathcal {M}}\) we can make sense of the notation R(u, v)w as denoting the vector in \(\mathrm {T}_x{\mathcal {M}}\) corresponding to R(X, Y)Z at x, where X, Y, Z are arbitrary smooth vector fields whose values at x are u, v, w, respectively. The map \((u, v, w) \mapsto R(u, v)w\) is linear in each input.

Two linearly independent tangent vectors u, v at x span a two-dimensional plane of \(\mathrm {T}_x{\mathcal {M}}\). The sectional curvature of \({\mathcal {M}}\) along that plane is a real number K(u, v) defined as

$$\begin{aligned} K(u, v)&= \frac{\left\langle {R(u, v)v},{u}\right\rangle _x}{\Vert u\Vert _x^2\Vert v\Vert _x^2 - \left\langle {u},{v}\right\rangle _x^2}. \end{aligned}$$
(10)

Of course, all sectional curvatures of flat manifolds are zero. Also, all sectional curvatures of a sphere of radius r are \(1/r^2\) and all sectional curvatures of the hyperbolic space with parameter r are \(-1/r^2\)—see [32, Thm. 8.34].

Using the connection \(\nabla \), we differentiate the curvature endomorphism R as follows. Given any smooth vector field U, we let \(\nabla _U R\) be an operator of the same type as R itself, in the sense that it maps three smooth vector fields X, Y, Z to a fourth one denoted \((\nabla _U R)(X, Y)Z\) through

$$\begin{aligned} (\nabla _U R)(X, Y)Z&= \nabla _U(R(X, Y) Z) - R(\nabla _U X, Y) Z \nonumber \\&\quad - R(X, \nabla _U Y) Z - R(X, Y) \nabla _U Z. \end{aligned}$$
(11)

Observe that this formula captures a convenient chain rule on \(\nabla _U(R(X, Y) Z)\). As for R, the value of \(\nabla R(X, Y, Z, U) \triangleq (\nabla _U R)(X, Y)Z\) at x depends on X, Y, Z, U only through their values at x. Therefore, \(\nabla R\) unambiguously maps \(u, v, w, z \in \mathrm {T}_x{\mathcal {M}}\) to \(\nabla R(u, v, w, z) \in \mathrm {T}_x{\mathcal {M}}\), linearly in all inputs. We say the operator norm of \(\nabla R\) at x is bounded by F if

$$\begin{aligned} \Vert \nabla R(u, v, w, z)\Vert _x&\le F \Vert u\Vert _x \Vert v\Vert _x \Vert w\Vert _x \Vert z\Vert _x \end{aligned}$$

for all \(u, v, w, z \in \mathrm {T}_x{\mathcal {M}}\). We say \(\nabla R\) has operator norm bounded by F if this holds for all x. If \(F = 0\) (that is, \(\nabla R \equiv 0\)), we say R is parallel and \({\mathcal {M}}\) is called locally symmetric. This is notably the case for manifolds with constant sectional curvature—Euclidean spaces, spheres and hyperbolic spaces—and (Riemannian) products thereof [41, pp. 219–221].

We are ready to state the main result of this section. Note that \({\mathcal {M}}\) need not be complete.

Theorem 2.7

Let \({\mathcal {M}}\) be a Riemannian manifold whose sectional curvatures are in the interval \([{K_{\mathrm {low}}}, {K_{\mathrm {up}}}]\), and let \(K = \max (|{K_{\mathrm {low}}}|, |{K_{\mathrm {up}}}|)\). Also assume \(\nabla R\)—the covariant derivative of the Riemann curvature endomorphism R—is bounded by F in operator norm.

Let \(f :{\mathcal {M}}\rightarrow {{\mathbb {R}}}\) be twice continuously differentiable and select \(b > 0\) such that

$$\begin{aligned} b \le \min \!\left( \frac{1}{4\sqrt{K}}, \frac{K}{4F} \right) . \end{aligned}$$

Pick any point \(x \in {\mathcal {M}}\) such that \(\mathrm {Exp}_x\) is defined on the closed ball \(B_x(b)\) of radius b around the origin in \(\mathrm {T}_x{\mathcal {M}}\). We have the following three conclusions:

  1. If f has L-Lipschitz continuous gradient and \(\Vert \mathrm {grad}f(x)\Vert _x \le Lb\), then \({{\hat{f}}}_x = f \circ \mathrm {Exp}_x\) has 2L-Lipschitz continuous gradient in \(B_x(b)\), that is, for all \(u, v \in B_x(b)\) it holds that \(\Vert \nabla {{\hat{f}}}_x(u) - \nabla {{\hat{f}}}_x(v)\Vert _x \le 2L \Vert u - v\Vert _x\).

  2. If moreover f has \(\rho \)-Lipschitz continuous Hessian, then \(\Vert \nabla ^2 {{\hat{f}}}_x(s) - \nabla ^2 {{\hat{f}}}_x(0)\Vert _x \le {{\hat{\rho }}} \Vert s\Vert _x\) for all \(s \in B_x(b)\), with \({{\hat{\rho }}} = \rho + L \sqrt{K}\).

  3. For all \(s \in B_x(b)\), the singular values of \(T_s = \mathrm {D}\mathrm {Exp}_x(s)\) lie in the interval [2/3, 4/3].

A few comments are in order:

  1. For locally symmetric spaces (\(F = 0\)), we interpret K/F as infinite (regardless of K).

  2. If \({\mathcal {M}}\) is compact, then it is complete and there necessarily exist finite K and F. See work by Greene [25] for a discussion on non-compact manifolds.

  3. If \({\mathcal {M}}\) is a homogeneous Riemannian manifold (not necessarily compact), then there exist finite K and F, and these can be assessed by studying a single point on the manifold. This follows directly from the definition of homogeneous Riemannian manifold [32, p. 55].

  4. All symmetric spaces are homogeneous and locally symmetric [32, Exercise 6–19, Exercise 7–3 and p. 78], so there exists a finite K, and \(F = 0\). Let \({\text {Sym}}(d)\) be the set of real \(d \times d\) symmetric matrices. The set of \(d\times d\) positive definite matrices

    $$\begin{aligned} {\mathcal {P}}_d = \{P \in {\text {Sym}}(d) : P \succ 0\} \end{aligned}$$

    endowed with the so-called affine invariant metric

    $$\begin{aligned} \left\langle {X},{Y}\right\rangle _P = \mathrm {Tr}(P^{-1} X P^{-1} Y) \quad \text {for } P \in {\mathcal {P}}_d \text { and } X, Y \in \mathrm {T}_P {\mathcal {P}}_d \cong {\text {Sym}}(d) \end{aligned}$$

    is a non-compact symmetric space of non-constant curvature. It is commonly used in practice [10, 37, 38, 43]. One can show that \(K = \frac{1}{2}\) and \(F=0\) are the right constants for this manifold; a small numerical illustration of this geometry appears after this list.

  5. The following statements are equivalent: (a) \({\mathcal {M}}\) is complete; (b) \(\mathrm {Exp}\) is defined on the whole tangent bundle: \({\mathcal {O}}= \mathrm {T}{\mathcal {M}}\); and (c) for some \(b > 0\), \(\mathrm {Exp}_x\) is defined on \(B_x(b)\) for all \(x \in {\mathcal {M}}\). In later sections, we need to apply Theorem 2.7 at various points of \({\mathcal {M}}\) with constant b, which is why we then assume \({\mathcal {M}}\) is complete.

  6. The properties of \(T_s\) are useful in combination with Lemma 2.5 to relate gradients and Hessians of the pullbacks to gradients and Hessians on the manifold. For example, if \(\nabla {{\hat{f}}}_x(s)\) has norm \(\epsilon \), then \(\mathrm {grad}f(\mathrm {Exp}_x(s))\) has norm somewhere between \(\frac{3}{4}\epsilon \) and \(\frac{3}{2}\epsilon \). Under the conditions of the theorem, \(W_s\) (8) is bounded as \(\Vert W_s\Vert _x \le \frac{9}{4} K \Vert \nabla {{\hat{f}}}_x(s)\Vert _x \Vert s\Vert _x\).

  7. We only get satisfactory Lipschitzness at points where the gradient is bounded by Lb. Fortunately, for the algorithms we study, whenever we encounter a point with gradient larger than that threshold, it is sufficient to take a simple gradient descent step.
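
As announced in comment 4, here is a small numerical sketch (ours, for illustration; not one of the paper's algorithms) of the affine-invariant geometry of \({\mathcal {P}}_d\). It implements the metric \(\left\langle {X},{Y}\right\rangle _P\) and the well-known closed-form exponential map \(\mathrm {Exp}_P(X) = P^{1/2} \exp \!\left( P^{-1/2} X P^{-1/2} \right) P^{1/2}\) for that metric.

```python
import numpy as np

def spd_sqrt_and_invsqrt(P):
    """Matrix square root and inverse square root of a symmetric positive definite P."""
    w, V = np.linalg.eigh(P)
    return (V * np.sqrt(w)) @ V.T, (V / np.sqrt(w)) @ V.T

def sym_expm(S):
    """Matrix exponential of a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.exp(w)) @ V.T

def affine_invariant_inner(P, X, Y):
    """<X, Y>_P = trace(P^{-1} X P^{-1} Y) for X, Y in T_P P_d (symmetric matrices)."""
    Pinv = np.linalg.inv(P)
    return np.trace(Pinv @ X @ Pinv @ Y)

def spd_exp(P, X):
    """Exponential map at P for the affine-invariant metric:
    Exp_P(X) = P^{1/2} expm(P^{-1/2} X P^{-1/2}) P^{1/2}."""
    Ph, Pih = spd_sqrt_and_invsqrt(P)
    return Ph @ sym_expm(Pih @ X @ Pih) @ Ph

# Quick check: Exp_P(X) stays positive definite even when the straight line P + X leaves the cone.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
P = A @ A.T + np.eye(4)                 # a positive definite "point"
X = -10.0 * np.eye(4)                   # a large tangent direction
Q = spd_exp(P, X)
print(np.all(np.linalg.eigvalsh(Q) > 0))        # True: geodesics never leave P_d
print(affine_invariant_inner(P, X, X) >= 0)     # the metric is positive on tangent vectors
```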

To prove Theorem 2.7, we must control \(\nabla ^2 {{\hat{f}}}_x(s)\). According to Lemma 2.5, this requires controlling both \(T_s\) (a differential of the exponential map) and \(c''(0)\) (the intrinsic initial acceleration of a curve defined via the exponential map, but which is not itself a geodesic in general). On both counts, we must study differentials of exponentials. Jacobi fields are the tool of choice for such tasks. As a first step, we use Jacobi fields to investigate the difference between \(T_s\) and \(P_s\): two linear operators from \(\mathrm {T}_x{\mathcal {M}}\) to \(\mathrm {T}_{\mathrm {Exp}_x(s)}{\mathcal {M}}\). We prove a general result in Appendix A (exact for constant sectional curvature) and state a sufficient particular case here. Control of \(T_s\) follows as a corollary because \(P_s\) (parallel transport) is an isometry.

Proposition 2.8

Let \({\mathcal {M}}\) be a Riemannian manifold whose sectional curvatures are in the interval \([{K_{\mathrm {low}}}, {K_{\mathrm {up}}}]\), and let \(K = \max (|{K_{\mathrm {low}}}|, |{K_{\mathrm {up}}}|)\). For any \((x, s) \in {\mathcal {O}}\) with \(\Vert s\Vert _x \le \frac{\pi }{\sqrt{K}}\),

$$\begin{aligned} \Vert (T_s - P_s)[\dot{s}]\Vert _{\mathrm {Exp}_x(s)}&\le \frac{1}{3} K \Vert s\Vert _x^2 \Vert \dot{s}_\perp \Vert _x, \end{aligned}$$
(12)

where \(\dot{s}_\perp = \dot{s} - \frac{\left\langle {s},{\dot{s}}\right\rangle _x}{\left\langle {s},{s}\right\rangle _x}s\) is the component of \(\dot{s}\) orthogonal to s.

Corollary 2.9

Let \({\mathcal {M}}\) be a Riemannian manifold whose sectional curvatures are in the interval \([{K_{\mathrm {low}}}, {K_{\mathrm {up}}}]\), and let \(K = \max (|{K_{\mathrm {low}}}|, |{K_{\mathrm {up}}}|)\). For any \((x, s) \in {\mathcal {O}}\) with \(\Vert s\Vert _x \le \frac{1}{\sqrt{K}}\),

$$\begin{aligned} \sigma _{{\text {min}}}(T_s)&\ge \frac{2}{3}&{\text { and }}&\sigma _{{\text {max}}}(T_s)&\le \frac{4}{3}. \end{aligned}$$
(13)

Proof

By Proposition 2.8, the operator norm of \(T_s - P_s\) is bounded above by \(\frac{1}{3} K \Vert s\Vert _x^2 \le \frac{1}{3}\). Furthermore, parallel transport \(P_s\) is an isometry: its singular values are equal to 1. Thus,

$$\begin{aligned} \sigma _{{\text {max}}}(T_s)&= \sigma _{{\text {max}}}(P_s + T_s - P_s) \le \sigma _{{\text {max}}}(P_s) + \sigma _{{\text {max}}}(T_s - P_s) \le 1 + \frac{1}{3} = \frac{4}{3}. \end{aligned}$$

Likewise, with min/max taken over unit-norm vectors \(u \in \mathrm {T}_x{\mathcal {M}}\) and writing \(y = \mathrm {Exp}_x(s)\),

$$\begin{aligned} \sigma _{{\text {min}}}(T_s)&= \min _{u} \Vert T_s u\Vert _y \ge \min _{u} \Vert P_s u\Vert _y - \Vert (T_s - P_s)u\Vert _y \\&= 1 - \max _{u} \Vert (T_s - P_s)u\Vert _y \ge \frac{2}{3}. \end{aligned}$$

\(\square \)

We turn to controlling the term \(c''(0)\) which appears in the definition of operator \(W_s\) in the expression for \(\nabla ^2 {{\hat{f}}}_x(s)\) provided by Lemma 2.5. We present a detailed proof in Appendix B for a general statement, and state a sufficient particular case here. The proof is fairly technical: it involves designing an appropriate nonlinear second-order ODE on the manifold and bounding its solutions. The ODE is related to the Jacobi equation, except we had to differentiate to the next order, and the equation is not homogeneous.

Proposition 2.10

Let \({\mathcal {M}}\) be a Riemannian manifold whose sectional curvatures are in the interval \([{K_{\mathrm {low}}}, {K_{\mathrm {up}}}]\), and let \(K = \max (|{K_{\mathrm {low}}}|, |{K_{\mathrm {up}}}|)\). Further assume \(\nabla R\) is bounded by F in operator norm.

Pick any \((x, s) \in {\mathcal {O}}\) such that

$$\begin{aligned} \Vert s\Vert _x&\le \min \!\left( \frac{1}{4\sqrt{K}}, \frac{K}{4F} \right) . \end{aligned}$$

For any \(\dot{s} \in \mathrm {T}_x{\mathcal {M}}\), the curve \(c(t) = \mathrm {Exp}_x(s + t \dot{s})\) has initial acceleration bounded as

$$\begin{aligned} \Vert c''(0)\Vert _{\mathrm {Exp}_x(s)}&\le \frac{3}{2} K \Vert s\Vert _x \Vert \dot{s}\Vert _x \Vert \dot{s}_\perp \Vert _x, \end{aligned}$$

where \(\dot{s}_\perp = \dot{s} - \frac{\left\langle {s},{\dot{s}}\right\rangle _x}{\left\langle {s},{s}\right\rangle _x} s\) is the component of \(\dot{s}\) orthogonal to s.

Equipped with all of the above, it is now easy to prove the main theorem of this section.

Proof of Theorem 2.7

Consider the pullback \({{\hat{f}}}_x = f \circ \mathrm {Exp}_x\) defined on \(\mathrm {T}_x{\mathcal {M}}\). Since \(\mathrm {T}_x{\mathcal {M}}\) is linear, it is a classical exercise to verify that \(\nabla {{\hat{f}}}_x\) is 2L-Lipschitz continuous in \(B_x(b)\) if and only if \(\Vert \nabla ^2 {{\hat{f}}}_x(s)\Vert _x \le 2L\) for all s in \(B_x(b)\). Using Lemma 2.5, we start bounding the Hessian as follows:

$$\begin{aligned} \Vert \nabla ^2 {{\hat{f}}}_x(s)\Vert _x&\le \sigma _{{\text {max}}}(T_s^*) \sigma _{{\text {max}}}(T_s) \Vert \mathrm {Hess}f(\mathrm {Exp}_x(s))\Vert _{\mathrm {Exp}_x(s)} + \Vert W_s\Vert _x, \end{aligned}$$

with operator \(W_s\) defined by (8). Since \(\mathrm {grad}f\) is L-Lipschitz continuous, \(\Vert \mathrm {Hess}f(y)\Vert _y \le L\) for all \(y \in {\mathcal {M}}\) (this follows fairly directly from Proposition 2.4). To bound \(W_s\), we start with a Cauchy–Schwarz inequality then we consider the worst case for the magnitude of \(c''(0)\):

$$\begin{aligned} \Vert W_s\Vert _x&\le \Vert \mathrm {grad}f(\mathrm {Exp}_x(s))\Vert _{\mathrm {Exp}_x(s)} \cdot \max _{\dot{s} \in \mathrm {T}_x{\mathcal {M}}, \Vert \dot{s}\Vert _x = 1} \Vert c''(0)\Vert _{\mathrm {Exp}_x(s)}. \end{aligned}$$

Combining these steps yields a first bound of the form

$$\begin{aligned} \Vert \nabla ^2 {{\hat{f}}}_x(s)\Vert _x&\le \sigma _{{\text {max}}}(T_s)^2 L \nonumber \\&\quad + \Vert \mathrm {grad}f(\mathrm {Exp}_x(s))\Vert _{\mathrm {Exp}_x(s)} \cdot \max _{\dot{s} \in \mathrm {T}_x{\mathcal {M}}, \Vert \dot{s}\Vert _x = 1} \Vert c''(0)\Vert _{\mathrm {Exp}_x(s)}. \end{aligned}$$
(14)

To proceed, we keep working on the \(W_s\)-terms: use Proposition 2.10, L-Lipschitz-continuity of the gradient, and our bounds on the norms of s and \(\mathrm {grad}f(x)\) to see that:

$$\begin{aligned} \Vert W_s\Vert _x&\le \max _{\dot{s} \in \mathrm {T}_x{\mathcal {M}}, \Vert \dot{s}\Vert _x = 1} \Vert c''(0)\Vert _{\mathrm {Exp}_x(s)} \cdot \Vert \mathrm {grad}f(\mathrm {Exp}_x(s))\Vert _{\mathrm {Exp}_x(s)} \nonumber \\&\le \frac{3}{2} K \Vert s\Vert _x \cdot \Vert P_s^*\mathrm {grad}f(\mathrm {Exp}_x(s)) - \mathrm {grad}f(x) + \mathrm {grad}f(x)\Vert _{x} \nonumber \\&\le \frac{3}{2} K \Vert s\Vert _x \cdot \left( L \Vert s\Vert _x + \Vert \mathrm {grad}f(x)\Vert _x \right) \nonumber \\&\le 3KLb\Vert s\Vert _x \le \frac{3}{4} L \sqrt{K} \Vert s\Vert _x \le \frac{3}{16}L. \end{aligned}$$
(15)

Returning to (14) and using Corollary 2.9 to bound \(T_s\) confirms that

$$\begin{aligned} \Vert \nabla ^2 {{\hat{f}}}_x(s)\Vert _x&\le \frac{16}{9} L + \frac{3}{16} L < 2L. \end{aligned}$$

Thus, \(\nabla {{\hat{f}}}_x\) is 2L-Lipschitz continuous in the ball of radius b around the origin in \(\mathrm {T}_x{\mathcal {M}}\).

To establish the second part of the claim, we use the same intermediate results and \(\rho \)-Lipschitz continuity of the Hessian. First, using Lemma 2.5 twice and noting that \(W_0 = 0\) so that \(\nabla ^2 {{\hat{f}}}_x(0) = \mathrm {Hess}f(x)\), we have:

$$\begin{aligned} \nabla ^2 {{\hat{f}}}_x(s) - \nabla ^2 {{\hat{f}}}_x(0)&= P_s^* \circ \mathrm {Hess}f(\mathrm {Exp}_x(s)) \circ P_s - \mathrm {Hess}f(x) \\&\quad + (T_s - P_s)^* \circ \mathrm {Hess}f(\mathrm {Exp}_x(s)) \circ T_s \\&\quad + P_s^* \circ \mathrm {Hess}f(\mathrm {Exp}_x(s)) \circ (T_s - P_s) \\&\quad + W_s. \end{aligned}$$

We bound this line by line calling upon Proposition 2.8, Corollary 2.9 and (15) to get:

$$\begin{aligned} \Vert \nabla ^2 {{\hat{f}}}_x(s) - \nabla ^2 {{\hat{f}}}_x(0)\Vert _x&\le \rho \Vert s\Vert _x + \frac{4}{9} L K \Vert s\Vert _x^2 + \frac{1}{3} L K \Vert s\Vert _x^2 + 3LKb\Vert s\Vert _x \\&\le \left( \rho + \frac{1}{9} L \sqrt{K} + \frac{1}{12} L \sqrt{K} + \frac{3}{4} L \sqrt{K} \right) \Vert s\Vert _x \\&\le \left( \rho + L \sqrt{K} \right) \Vert s\Vert _x. \end{aligned}$$

This shows a type of Lipschitz continuity of the Hessian of the pullback with respect to the origin, in the ball of radius b. \(\square \)

3 Assumptions and Parameters for \(\mathtt {TAGD}\) and \(\mathtt {PTAGD}\)

Our algorithms apply to the minimization of \(f :{\mathcal {M}}\rightarrow {{\mathbb {R}}}\) on a Riemannian manifold \({\mathcal {M}}\) equipped with a retraction \(\mathrm {R}\) defined on the whole tangent bundle \(\mathrm {T}{\mathcal {M}}\). The pullback of f at \(x \in {\mathcal {M}}\) is \({{\hat{f}}}_x = f \circ \mathrm {R}_x :\mathrm {T}_x{\mathcal {M}}\rightarrow {{\mathbb {R}}}\). In light of Sect. 2, we make the following assumptions.

A 2

There exists a constant \(f_{\mathrm {low}}\) such that \(f(x) \ge f_{\mathrm {low}}\) for all \(x \in {\mathcal {M}}\). Moreover, f is twice continuously differentiable and there exist constants \(\ell \), \({{\hat{\rho }}}\) and b such that, for all \(x \in {\mathcal {M}}\) with \(\Vert \mathrm {grad}f(x)\Vert \le \frac{1}{2} \ell b\),

  1. \(\nabla {{\hat{f}}}_x\) is \(\ell \)-Lipschitz continuous in \(B_x(3b)\) (in particular, \(\Vert \nabla ^2 {{\hat{f}}}_x(0)\Vert \le \ell \)),

  2. \(\Vert \nabla ^2 {{\hat{f}}}_x(s) - \nabla ^2 {{\hat{f}}}_x(0)\Vert \le {{\hat{\rho }}} \Vert s\Vert \) for all \(s \in B_x(3b)\), and

  3. \(\sigma _{{\text {min}}}(T_s) \ge \frac{1}{2}\) with \(T_s = \mathrm {D}\mathrm {R}_x(s)\) for all \(s \in B_x(3b)\),

where \(B_x(3b) = \{ u \in \mathrm {T}_x{\mathcal {M}}: \Vert u\Vert \le 3b \}\). Finally, for all \((x, s) \in \mathrm {T}{\mathcal {M}}\) it holds that

  4. \({{\hat{f}}}_x(s) \le {{\hat{f}}}_x(0) + \langle {\nabla {{\hat{f}}}_x(0)},{s}\rangle + \frac{\ell }{2}\Vert s\Vert ^2\).

The first three items in A2 confer Lipschitz properties on the derivatives of the pullbacks \({{\hat{f}}}_x\) restricted to balls around the origins of tangent spaces: these are the balls where we shall run accelerated gradient steps. We only need these guarantees at points where the gradient is below a threshold. For all other points, a regular gradient step provides ample progress: the last item in A2 serves that purpose only; see Proposition 5.2.

Section 2 tells us that A2 holds in particular when we use the exponential map as a retraction and f itself has appropriate (Riemannian) Lipschitz properties. This is the link between Theorems 1.3 and 1.6 in the introduction and Theorems 5.1 and 6.1 in later sections.

Corollary 3.1

If we use the exponential retraction \(\mathrm {R}= \mathrm {Exp}\) and A1 holds, then A2 holds with the same \(f_{\mathrm {low}}\) and with

$$\begin{aligned} \ell&= 2L,&{{\hat{\rho }}} = \rho + L\sqrt{K},&b = \frac{1}{12}\min \!\left( \frac{1}{\sqrt{K}}, \frac{K}{F}\right) . \end{aligned}$$
(16)

With constants as in A2, we further define a number of parameters. First, the user specifies a tolerance \(\epsilon \) which must not be too loose.

A 3

The tolerance \(\epsilon > 0\) satisfies \(\sqrt{{{\hat{\rho }}} \epsilon } \le \frac{1}{2}\ell \) and \(\epsilon \le b^2 {{\hat{\rho }}}\).

Then, we fix a first set of parameters (see [28] for more context; in particular, \(\kappa \) plays the role of a condition number; under A3, we have \(\kappa \ge 2\)):

$$\begin{aligned} \eta&= \frac{1}{4\ell },&\kappa&= \frac{\ell }{\sqrt{{{\hat{\rho }}} \epsilon }},&\theta&= \frac{1}{4\sqrt{\kappa }},&\gamma&= \frac{\sqrt{{{\hat{\rho }}} \epsilon }}{4},&s&= \frac{1}{32} \sqrt{\frac{\epsilon }{{{\hat{\rho }}}}}. \end{aligned}$$
(17)

We define a second set of parameters based on some \(\chi \ge 1\) (as set in some of the lemmas and theorems below) and a universal constant \(c > 0\) (implicitly defined as the smallest real satisfying a finite number of lower-bounds required throughout the paper):

$$\begin{aligned} r&= \eta \epsilon \chi ^{-5}c^{-8},&{\mathscr {T}}&= \sqrt{\kappa }\chi c,&{\mathscr {E}}&= \sqrt{\frac{\epsilon ^3}{{\hat{\rho }}}}\chi ^{-5}c^{-7},\nonumber \\ {\mathscr {L}}&= \sqrt{\frac{4\epsilon }{{\hat{\rho }}}} \chi ^{-2}c^{-3},&{\mathscr {M}}&= \frac{\epsilon \sqrt{\kappa }}{\ell }c^{-1}. \end{aligned}$$
(18)

When we say “with \(\chi \ge A \ge 1\)” (for example, in Theorems 5.1 and 6.1), we mean: “with \(\chi \) the smallest value larger than A such that \({\mathscr {T}}\) is a positive integer multiple of 4.”

Lemma C.1 in Appendix C lists useful relations between the parameters.
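
For the reader's convenience, the parameter settings (17)–(18) are transcribed below as a small helper (ours; a direct transcription, with \(\ell \), \({{\hat{\rho }}}\), \(\epsilon \), \(\chi \) and the universal constant c supplied by the user, since c is only implicitly defined in the paper).

```python
import math

def tagd_parameters(ell, rho_hat, eps, chi, c):
    """Parameters (17)-(18) for TAGD/PTAGD, given ell, rho_hat, the tolerance eps,
    chi >= 1 and the universal constant c > 0 (only implicitly defined in the paper)."""
    eta = 1.0 / (4.0 * ell)
    kappa = ell / math.sqrt(rho_hat * eps)       # plays the role of a condition number
    theta = 1.0 / (4.0 * math.sqrt(kappa))
    gamma = math.sqrt(rho_hat * eps) / 4.0
    s = math.sqrt(eps / rho_hat) / 32.0
    r = eta * eps * chi**-5 * c**-8
    T_script = math.sqrt(kappa) * chi * c        # the paper picks chi so that this is a
                                                 # positive integer multiple of 4
    E_script = math.sqrt(eps**3 / rho_hat) * chi**-5 * c**-7
    L_script = math.sqrt(4.0 * eps / rho_hat) * chi**-2 * c**-3
    M_script = (eps * math.sqrt(kappa) / ell) / c
    return dict(eta=eta, kappa=kappa, theta=theta, gamma=gamma, s=s, r=r,
                T=T_script, E=E_script, L=L_script, M=M_script)
```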

4 Accelerated Gradient Descent in a Ball of a Tangent Space

[Algorithm figure: \(\mathtt {TSS}\), the tangent space steps procedure.]

The main ingredient of algorithms \(\mathtt {TAGD}\) and \(\mathtt {PTAGD}\) is \(\mathtt {TSS}\): the tangent space steps algorithm. Essentially, the latter runs the classical accelerated gradient descent algorithm (AGD) from convex optimization on \({{\hat{f}}}_x\) in a tangent space \(\mathrm {T}_x{\mathcal {M}}\), with a few tweaks:

  1. Because \({{\hat{f}}}_x\) need not be convex, \(\mathtt {TSS}\) monitors the generated sequences for signs of non-convexity. If \({{\hat{f}}}_x\) happens to behave like a convex function along the sequence \(\mathtt {TSS}\) generates, then we reap the benefits of convexity. Otherwise, the direction along which \({{\hat{f}}}_x\) behaves in a non-convex way can be used as a good descent direction. This is the idea behind the “convex until proven guilty” paradigm developed by Carmon et al. [15] and also exploited by Jin et al. [28]. Explicitly, given \(x \in {\mathcal {M}}\) and \(s, u \in \mathrm {T}_x{\mathcal {M}}\), for a specified parameter \(\gamma > 0\), we check the negative curvature condition (one might also call it the non-convexity condition) (NCC):

    $$\begin{aligned} {\hat{f}}_x(s) < {\hat{f}}_x(u) + \langle {\nabla {\hat{f}}_x(u)},{s - u}\rangle - \frac{\gamma }{2}\left\| {s - u}\right\| ^2. \end{aligned}$$
    (NCC)

    If (NCC) triggers with a triplet (x, s, u) and s is not too large, we can exploit that fact to generate substantial cost decrease using the negative curvature exploitation algorithm, \(\mathtt {NCE}\): see Lemma 4.4. (This is about curvature of the cost function, not the manifold.)

  2. In contrast to the Euclidean case in [28], our assumption A2 provides Lipschitz-type guarantees only in a ball of radius 3b around the origin in \(\mathrm {T}_x{\mathcal {M}}\). Therefore, we must act if iterates generated by \(\mathtt {TSS}\) leave that ball. This is done in two places. First, the momentum step in step 4 of \(\mathtt {TSS}\) is capped so that \(\Vert u_j\Vert \) remains in the ball of radius 2b around the origin. Second, if \(s_{j+1}\) leaves the ball of radius b (as checked in step 10) then we terminate this run of \(\mathtt {TSS}\) by returning to the manifold. Lemma 4.1 guarantees that the iterates indeed remain in appropriate balls, that \(\theta _j\) (19) in the capped momentum step is uniquely defined, and that if a momentum step is capped, then immediately after that \(\mathtt {TSS}\) terminates.

The initial momentum \(v_0\) is always set to zero. By default, the AGD sequence is initialized at the origin: \(s_0 = 0\). However, for \(\mathtt {PTAGD}\) we sometimes want to initialize at a different point (a perturbation away from the origin): this is only relevant for Sect. 6.
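
Before stating the lemmas, the following sketch (ours, not the paper's \(\mathtt {TSS}\)) shows the un-safeguarded core: a standard AGD iteration with momentum, run on a pullback \({{\hat{f}}}_x\) in a single (linear) tangent space, together with the Hamiltonian \(E_j = {{\hat{f}}}_x(s_j) + \frac{1}{2\eta }\Vert v_j\Vert ^2\) of (20) below. It deliberately omits the ingredients that make \(\mathtt {TSS}\) work in general: the (NCC) check with \(\mathtt {NCE}\), the capped momentum step, and the ball checks.

```python
import numpy as np

def agd_in_tangent_space(f_hat, grad_f_hat, eta, theta, num_iters, s0):
    """Plain AGD on a pullback f_hat_x (a function on the linear space T_x M),
    with the momentum initialized to zero. No NCC check, no capping, no ball
    checks: only the un-safeguarded core iteration, for illustration."""
    s = np.array(s0, dtype=float)
    v = np.zeros_like(s)                          # v_0 = 0
    hamiltonians = []
    for _ in range(num_iters):
        hamiltonians.append(f_hat(s) + np.dot(v, v) / (2 * eta))   # E_j of (20)
        u = s + (1 - theta) * v                   # momentum step
        s_next = u - eta * grad_f_hat(u)          # gradient step from u
        v = s_next - s                            # new momentum
        s = s_next
    return s, hamiltonians

# Toy usage on a convex quadratic pullback f_hat(s) = 0.5 * s^T A s.
A = np.diag([1.0, 10.0])
ell = 10.0                                        # gradient Lipschitz constant of f_hat
eta, theta = 1.0 / (4 * ell), 0.25                # eta = 1/(4 ell) as in (17); theta illustrative
s_final, E = agd_in_tangent_space(lambda s: 0.5 * s @ A @ s, lambda s: A @ s,
                                  eta, theta, 200, np.array([1.0, 1.0]))
print(np.linalg.norm(s_final) < 1e-3,             # the iterates converge
      all(E[i + 1] <= E[i] + 1e-12 for i in range(len(E) - 1)))   # E_j never increases
```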

[Algorithm figure: \(\mathtt {NCE}\), the negative curvature exploitation procedure.]

In the remainder of this section, we provide four general purpose lemmas about \(\mathtt {TSS}\). Proofs are in Appendix D. We note that \(\mathtt {TAGD}\) and \(\mathtt {PTAGD}\) call \(\mathtt {TSS}\) only at points x where \(\Vert \mathrm {grad}f(x)\Vert \le \frac{1}{2} \ell b\). The first lemma below notably guarantees that, for such runs, all iterates \(u_j, s_j\) generated by \(\mathtt {TSS}\) remain (a fortiori) in balls of radius 3b, so that the strongest provisions of A2 always apply: we use this fact often without mention.

Lemma 4.1

(\(\mathtt {TSS}\) stays in balls) Fix parameters and assumptions as laid out in Sect. 3. Let \(x \in {\mathcal {M}}\) satisfy \(\Vert \mathrm {grad}f(x)\Vert \le \frac{1}{2}\ell b\). If \(\mathtt {TSS}(x)\) or \(\mathtt {TSS}(x, s_0)\) (with \(\Vert s_0\Vert \le b\)) defines vectors \(u_0, \ldots , u_q\) (and possibly more), then it also defines vectors \(s_0, \ldots , s_q\), and we have:

$$\begin{aligned} \Vert s_0\Vert , \ldots , \Vert s_q\Vert&\le b,&\Vert u_0\Vert , \ldots , \Vert u_q\Vert&\le 2b,&{\text { and }}&2\eta \gamma \le \theta \le \theta _j \le 1. \end{aligned}$$

If \(s_{q+1}\) is defined, then \(\Vert s_{q+1}\Vert \le 3b\) and, if \(\Vert u_q\Vert = 2b\), then \(\Vert s_{q+1}\Vert > b\) and \(u_{q+1}\) is undefined.

Along the iterates of AGD, the value of the cost function \({{\hat{f}}}_x\) may not monotonically decrease. Fortunately, there is a useful quantity which monotonically decreases along iterates: Jin et al. [28] call it the Hamiltonian. In several ways, it serves the purpose of a Lyapunov function. Importantly, the Hamiltonian decreases regardless of any special events that occur while running \(\mathtt {TSS}\). It is built as a combination of the cost function value and the momentum. The next lemma makes this precise: we use monotonic decrease of the Hamiltonian often without mention. This corresponds to [28, Lem. 9 and 20].

Lemma 4.2

(Hamiltonian decrease) Fix parameters and assumptions as laid out in Sect. 3. Let \(x \in {\mathcal {M}}\) satisfy \(\Vert \mathrm {grad}f(x)\Vert \le \frac{1}{2}\ell b\). For each pair \((s_j, v_j)\) defined by \(\mathtt {TSS}(x)\) or \(\mathtt {TSS}(x, s_0)\) (with \(\Vert s_0\Vert \le b\)), define the Hamiltonian

$$\begin{aligned} E_j&= {{\hat{f}}}_x(s_j) + \frac{1}{2\eta } \Vert v_j\Vert ^2. \end{aligned}$$
(20)

If \(E_{j+1}\) is defined, then \(E_j\), \(\theta _j\) and \(u_j\) are also defined and:

$$\begin{aligned} E_{j+1}&\le E_j - \frac{\theta _j}{2\eta } \Vert v_j\Vert ^2 - \frac{\eta }{4} \Vert \nabla {{\hat{f}}}_x(u_j)\Vert ^2 \le E_j. \end{aligned}$$

If moreover \(\Vert v_j\Vert \ge {\mathscr {M}}\), then \(E_j - E_{j+1} \ge \frac{4{\mathscr {E}}}{{\mathscr {T}}}\).
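To illustrate (20), the following sketch runs the plain (uncapped) AGD recursion on a toy convex quadratic pullback and prints the Hamiltonian at each iteration, so that the monotone decrease formalized by Lemma 4.2 can be observed; the quadratic, the momentum parameter and the identification of the tangent space with \(\mathbb {R}^2\) are illustrative assumptions, and the cap, the (NCC) check and \(\mathtt {NCE}\) are deliberately omitted.

import numpy as np

def hamiltonian(f_hat, s, v, eta):
    """Hamiltonian (20): E = f_hat(s) + ||v||^2 / (2 * eta)."""
    return f_hat(s) + v @ v / (2.0 * eta)

if __name__ == "__main__":
    # Toy convex quadratic pullback; ell is its gradient Lipschitz constant.
    A = np.diag([4.0, 1.0])
    f_hat = lambda w: 0.5 * w @ A @ w
    grad_f_hat = lambda w: A @ w
    ell = 4.0

    eta = 1.0 / (4.0 * ell)    # step size chosen so that ell * eta = 1/4
    theta = 0.2                # illustrative momentum parameter

    s = np.array([1.0, -2.0])  # s_0
    v = np.zeros(2)            # v_0 = 0

    for j in range(10):
        print(f"j = {j}, E_j = {hamiltonian(f_hat, s, v, eta):.6f}")
        u = s + (1.0 - theta) * v          # momentum step (here uncapped)
        s_next = u - eta * grad_f_hat(u)   # gradient step taken from u
        v, s = s_next - s, s_next          # new momentum, new iterate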

Jin et al. [28] formalize an important property of \(\mathtt {TSS}\) sequences in the Euclidean case, namely, the fact that “either the algorithm makes significant progress or the iterates do not move much.” They call this the improve or localize phenomenon. The next lemma states this precisely in our context. This corresponds to [28, Cor. 11].

Lemma 4.3

(Improve or localize) Fix parameters and assumptions as laid out in Sect. 3. Let \(x \in {\mathcal {M}}\) satisfy \(\Vert \mathrm {grad}f(x)\Vert \le \frac{1}{2}\ell b\). If \(\mathtt {TSS}(x)\) or \(\mathtt {TSS}(x, s_0)\) (with \(\Vert s_0\Vert \le b\)) defines vectors \(s_0, \ldots , s_q\) (and possibly more), then \(E_0, \ldots , E_q\) are defined by (20) and, for all \(0 \le q' \le q\),

$$\begin{aligned} \Vert s_q - s_{q'}\Vert ^2 \le (q-q') \sum _{j = q'}^{q-1} \Vert s_{j+1} - s_j\Vert ^2 \le 16 \sqrt{\kappa } \eta (q-q') (E_{q'} - E_q). \end{aligned}$$

For \(q' = 0\) in particular, using \(E_0 = {{\hat{f}}}_x(s_0)\) (recall that \(v_0 = 0\)) we can write \(E_q \le {{\hat{f}}}_x(s_0) - \frac{\Vert s_q - s_0\Vert ^2}{16 \sqrt{\kappa } \eta q}\).
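The first inequality in Lemma 4.3 is elementary: it follows from the triangle inequality and Cauchy–Schwarz applied to the telescoping sum, as spelled out below; the second inequality is established in the proof in Appendix D.

$$\begin{aligned} \Vert s_q - s_{q'}\Vert ^2 = \bigg \Vert \sum _{j = q'}^{q-1} (s_{j+1} - s_j) \bigg \Vert ^2 \le \bigg ( \sum _{j = q'}^{q-1} \Vert s_{j+1} - s_j\Vert \bigg )^{2} \le (q-q') \sum _{j = q'}^{q-1} \Vert s_{j+1} - s_j\Vert ^2. \end{aligned}$$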

As outlined earlier, in case the \(\mathtt {TSS}\) sequence witnesses non-convexity in \({{\hat{f}}}_x\) through the (NCC) check, we call upon the \(\mathtt {NCE}\) algorithm to exploit this event. The final lemma of this section formalizes the fact that this yields appropriate cost improvement. (Indeed, if \(\Vert s_j\Vert > {\mathscr {L}}\) one can argue that sufficient progress was already achieved; otherwise, the lemma applies and, combined with \(E_j \le E_0 = {{\hat{f}}}_x(s_0)\), yields the cost decrease made explicit after the lemma.) This corresponds to [28, Lem. 10 and 17].

Lemma 4.4

(Negative curvature exploitation) Fix parameters and assumptions as laid out in Sect. 3. Let \(x \in {\mathcal {M}}\) satisfy \(\Vert \mathrm {grad}f(x)\Vert \le \frac{1}{2}\ell b\). Assume \(\mathtt {TSS}(x)\) or \(\mathtt {TSS}(x, s_0)\) (with \(\Vert s_0\Vert \le b\)) defines \(u_j\), so that \(s_j, v_j\) are also defined, and \(E_j\) is defined by (20). If (NCC) triggers with \((x, s_j, u_j)\) and \(\Vert s_j\Vert \le {\mathscr {L}}\), then \({{\hat{f}}}_x(\mathtt {NCE}(x, s_j, v_j)) \le E_j - 2{\mathscr {E}}\).
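In particular, since the Hamiltonian only decreases along the run (Lemma 4.2) and \(E_0 = {{\hat{f}}}_x(s_0)\) because \(v_0 = 0\), the conclusion of Lemma 4.4 yields the cost decrease announced before the lemma:

$$\begin{aligned} {{\hat{f}}}_x(\mathtt {NCE}(x, s_j, v_j)) \le E_j - 2{\mathscr {E}} \le E_0 - 2{\mathscr {E}} = {{\hat{f}}}_x(s_0) - 2{\mathscr {E}}. \end{aligned}$$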

5 First-Order Critical Points

Our algorithm to compute \(\epsilon \)-approximate first-order critical points on Riemannian manifolds is \(\mathtt {TAGD}\): this is a deterministic algorithm which does not require access to the Hessian of the cost function. Our main result regarding \(\mathtt {TAGD}\), namely, Theorem 5.1, states that it does so in a bounded number of iterations. As worked out in Theorem 1.3, this bound scales as \(\epsilon ^{-7/4}\), up to polylogarithmic terms. The complexity is independent of the dimension of the manifold.

The proof of Theorem 5.1 rests on two propositions introduced hereafter in this section. Interestingly, it is only in the proof of Theorem 5.1 that we track the behavior of iterates of \(\mathtt {TAGD}\) across multiple points on the manifold. This is done by tracking decrease of the value of the cost function f. All supporting results (lemmas and propositions) handle a single tangent space at a time. As a result, lemmas and propositions fully benefit from the linear structure of tangent spaces. This is why we can salvage most of the Euclidean proofs of Jin et al. [28], up to mostly minor (but numerous and necessary) changes.

[Figure c: algorithm listing]

Theorem 5.1

Fix parameters and assumptions as laid out in Sect. 3, with

$$\begin{aligned} \chi \ge \log _2(\theta ^{-1}) \ge 1. \end{aligned}$$
(21)

Given \(x_0 \in {\mathcal {M}}\), \(\mathtt {TAGD}(x_0)\) returns \(x_t \in {\mathcal {M}}\) satisfying \(f(x_t) \le f(x_0)\) and \(\Vert \mathrm {grad}f(x_t)\Vert \le \epsilon \) with

$$\begin{aligned} t \le T_1 \triangleq \frac{f(x_0) - f_{\mathrm {low}}}{{\mathscr {E}}} {\mathscr {T}}. \end{aligned}$$
(22)

Running the algorithm requires at most \(2T_1\) pullback gradient queries and \(3T_1\) function queries (but no Hessian queries), and a similar number of calls to the retraction.

Proof of Theorem 5.1

The call to \(\mathtt {TAGD}(x_0)\) generates a sequence of points \(x_{t_0}, x_{t_1}, x_{t_2}, \ldots \) on \({\mathcal {M}}\), with \(t_0 = 0\). A priori, this sequence may be finite or infinite. Considering two consecutive indices \(t_i\) and \(t_{i+1}\), we either have \(t_{i+1} = t_i + 1\) (if the step from \(x_{t_i}\) to \(x_{t_{i+1}}\) is a single gradient step (Case 1)) or \(t_{i+1} = t_i + {\mathscr {T}}\) (if that same step is obtained through a call to \(\mathtt {TSS}\) (Case 2)). Moreover:

  • In Case 1, Proposition 5.2 applies and guarantees

    $$\begin{aligned} f(x_{t_i}) - f(x_{t_{i+1}}) \ge \frac{{\mathscr {E}}}{{\mathscr {T}}} = \frac{{\mathscr {E}}}{{\mathscr {T}}}(t_{i+1} - t_i). \end{aligned}$$
  • In Case 2, Proposition 5.3 applies and guarantees that if \(\Vert \mathrm {grad}f(x_{t_{i+1}})\Vert > \epsilon \) then

    $$\begin{aligned} f(x_{t_i}) - f(x_{t_{i+1}}) \ge {\mathscr {E}} = \frac{{\mathscr {E}}}{{\mathscr {T}}}(t_{i+1} - t_i). \end{aligned}$$

It is now clear that \(\mathtt {TAGD}(x_0)\) terminates after a finite number of steps. Indeed, if it does not, then the above reasoning shows that the algorithm produces an amortized decrease in the cost function f of \(\frac{{\mathscr {E}}}{{\mathscr {T}}}\) per unit increment of the counter t, yet the value of f cannot decrease by more than \(f(x_0) - f_{\mathrm {low}}\) because f is globally lower-bounded by \(f_{\mathrm {low}}\), a contradiction.

Accordingly, assume \(\mathtt {TAGD}(x_0)\) generates \(x_{t_0}, \ldots , x_{t_k}\) and terminates there, returning \(x_{t_k}\). We know that \(f(x_{t_k}) \le f(x_0)\) and \(\Vert \mathrm {grad}f(x_{t_k})\Vert \le \epsilon \). Moreover, from the discussion above and \(t_0 = 0\), we know that

$$\begin{aligned} f(x_0) - f_{\mathrm {low}}&\ge f(x_0) - f(x_{t_k}) = \sum _{i = 0}^{k-1} f(x_{t_i}) - f(x_{t_{i+1}}) \\&\ge \frac{{\mathscr {E}}}{{\mathscr {T}}} \sum _{i = 0}^{k-1} t_{i+1} - t_i = \frac{{\mathscr {E}}}{{\mathscr {T}}} t_k. \end{aligned}$$

Thus, \(t_k \le \frac{f(x_0) - f_{\mathrm {low}}}{{\mathscr {E}}} {\mathscr {T}} \triangleq T_1\).

How much work does it take to run the algorithm? Each (regular) gradient step requires one gradient query and increases the counter by one. Each run of \(\mathtt {TSS}\) requires at most \(2 {\mathscr {T}}\) gradient queries and \(2{\mathscr {T}} + 3 \le 3{\mathscr {T}}\) function queries (\(3 \le {\mathscr {T}}\) because \({\mathscr {T}}\) is a positive integer multiple of 4) and increases the counter by \({\mathscr {T}}\). Therefore, by the time \(\mathtt {TAGD}\) produces \(x_t\) it has used at most 2t gradient queries and 3t function queries. \(\square \)

The two following propositions form the backbone of the proof of Theorem 5.1. Each handles one of the two possible cases in one (outer) iteration of \(\mathtt {TAGD}\), namely: Case 1 is a “vanilla” Riemannian gradient descent step, while Case 2 is a call to \(\mathtt {TSS}\) to run (modified) AGD in the current tangent space. The former has a trivial and standard proof. The latter relies on all lemmas from Sect. 4 and on two additional lemmas introduced in Appendix F, all following Jin et al. [28].

Proposition 5.2

(Case 1) Fix parameters and assumptions as laid out in Sect. 3. Assume \(x \in {\mathcal {M}}\) satisfies \(\Vert \mathrm {grad}f(x)\Vert > 2\ell {\mathscr {M}}\). Then, \(x_+ = \mathrm {R}_{x}(-\eta \mathrm {grad}f(x))\) satisfies \(f(x) - f(x_+) \ge \frac{{\mathscr {E}}}{{\mathscr {T}}}\).

Proof of Proposition 5.2

This follows directly by property 4 in A2 with \({{\hat{f}}}_x = f \circ \mathrm {R}_x\) since \({{\hat{f}}}_x(0) = f(x)\) and \(\nabla {{\hat{f}}}_x(0) = \mathrm {grad}f(x)\) by properties of retractions, and also using \(\ell \eta = 1/4\) together with \(\Vert \mathrm {grad}f(x)\Vert > 2\ell {\mathscr {M}}\):

$$\begin{aligned} f(x_+) = {{\hat{f}}}_x(-\eta \mathrm {grad}f(x))&\le {{\hat{f}}}_x(0) - \eta \Vert \mathrm {grad}f(x)\Vert ^2 + \frac{\ell }{2} \Vert \eta \mathrm {grad}f(x)\Vert ^2 \\&= f(x) - \frac{7}{8} \eta \Vert \mathrm {grad}f(x)\Vert ^2 \le f(x) - (7/8) \ell {\mathscr {M}}^2. \end{aligned}$$

To conclude, it remains to use that \((7/8) \ell {\mathscr {M}}^2 \ge \frac{{\mathscr {E}}}{{\mathscr {T}}}\), as shown in Lemma C.1. \(\square \)

The next proposition corresponds mostly to [28, Lem. 12]. A proof is in Appendix F.

Proposition 5.3

(Case 2) Fix parameters and assumptions as laid out in Sect. 3, with

$$\begin{aligned} \chi \ge \log _2(\theta ^{-1}) \ge 1. \end{aligned}$$
(23)

If \(x \in {\mathcal {M}}\) satisfies \(\Vert \mathrm {grad}f(x)\Vert \le 2\ell {\mathscr {M}}\), then \(x_{{\mathscr {T}}} = \mathtt {TSS}(x)\) falls in one of two cases:

  1.

    Either \(\Vert \mathrm {grad}f(x_{{\mathscr {T}}})\Vert \le \epsilon \) and \(f(x) - f(x_{{\mathscr {T}}}) \ge 0\),

  2.

    Or \(\Vert \mathrm {grad}f(x_{{\mathscr {T}}})\Vert > \epsilon \) and \(f(x) - f(x_{{\mathscr {T}}}) \ge {\mathscr {E}}\).

6 Second-Order Critical Points

As discussed in the previous section, \(\mathtt {TAGD}\) produces \(\epsilon \)-approximate first-order critical points at an accelerated rate, deterministically. Such a point might happen to be an approximate second-order critical point, or it might not. In order to produce approximate second-order critical points, \(\mathtt {PTAGD}\) builds on top of \(\mathtt {TAGD}\) as follows.

Whenever \(\mathtt {TAGD}\) produces a point with gradient smaller than \(\epsilon \), \(\mathtt {PTAGD}\) generates a random vector \(\xi \) close to the origin in the current tangent space and runs \(\mathtt {TSS}\) starting from that perturbation. The run of \(\mathtt {TSS}\) itself is deterministic. However, the randomized initialization has the following effect: if the current point is not an approximate second-order critical point, then with high probability the sequence generated by \(\mathtt {TSS}\) produces significant cost decrease. Intuitively, this is because the current point is then (approximately) a saddle point, and gradient descent-type methods escape saddle points slowly but with high probability. If the significant decrease occurs, we simply proceed with the algorithm. Otherwise, we can be reasonably confident that the point from which we ran the perturbed \(\mathtt {TSS}\) is an approximate second-order critical point, and we terminate there.
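To fix ideas, here is a schematic NumPy sketch of this perturb-then-check step, with the tangent space identified with \(\mathbb {R}^d\) and the retraction left implicit; the helper names (sample_ball, run_tss, and script_E standing in for \({\mathscr {E}}\)) and the toy saddle in the demo are illustrative assumptions, and a few plain gradient steps stand in for the actual \(\mathtt {TSS}\) run.

import numpy as np

def sample_ball(d, r, rng):
    """Sample uniformly from the ball of radius r centered at the origin of R^d."""
    g = rng.standard_normal(d)
    direction = g / np.linalg.norm(g)
    radius = r * rng.random() ** (1.0 / d)  # radius ~ r * U^(1/d) for uniformity in the ball
    return radius * direction

def perturbed_tss_step(f, x, run_tss, d, r, script_E, rng):
    """One perturbed step: run TSS from a random perturbation and check the decrease.

    run_tss(x, xi) should run TSS from the perturbation xi in the tangent space at x
    and return the resulting point. Returns (next point, whether to terminate).
    """
    xi = sample_ball(d, r, rng)
    x_new = run_tss(x, xi)
    if f(x) - f(x_new) >= 0.5 * script_E:  # significant decrease: keep running
        return x_new, False
    return x, True                          # otherwise: likely second-order critical, stop here

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy Euclidean example (identity retraction): x0 is a strict saddle of f.
    f = lambda x: x[0] ** 2 - x[1] ** 2
    grad_f = lambda x: np.array([2.0 * x[0], -2.0 * x[1]])

    def run_tss(x, xi):
        # Stand-in for TSS: a few plain gradient steps from the perturbed point.
        y = x + xi
        for _ in range(20):
            y = y - 0.1 * grad_f(y)
        return y

    x0 = np.zeros(2)  # zero gradient, but the Hessian has a negative eigenvalue
    x1, stop = perturbed_tss_step(f, x0, run_tss, d=2, r=0.1, script_E=1e-3, rng=rng)
    print(stop, f(x0) - f(x1))  # with high probability: False, and a clear decrease

The termination test mirrors the check \(f(x_t) - f(x_{t+{\mathscr {T}}}) < \frac{1}{2}{\mathscr {E}}\) that \(\mathtt {PTAGD}\) performs after each perturbed call to \(\mathtt {TSS}\).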

[Figure d: algorithm listing]

Our main result regarding \(\mathtt {PTAGD}\), namely, Theorem 6.1, states that it computes approximate second-order critical points with high probability in a bounded number of iterations. As worked out in Theorem 1.6, this bound scales as \(\epsilon ^{-7/4}\), up to polylogarithmic terms which include a dependency in the dimension of the manifold and the probability of success.

Mirroring Sect. 5, the proof of Theorem 6.1 rests on the two propositions of that section and on an additional proposition introduced hereafter in this section.

Theorem 6.1

Pick any \(x_0 \in {\mathcal {M}}\). Fix parameters and assumptions as laid out in Sect. 3, with \(d = \dim {\mathcal {M}}\), \(\delta \in (0, 1)\), any \(\Delta _f \ge \max \!\left( f(x_0) - f_{\mathrm {low}}, \sqrt{\frac{\epsilon ^3}{{\hat{\rho }}}}\right) \) and

$$\begin{aligned} \chi \ge \log _2\!\left( \frac{d^{1/2} \ell ^{3/2} \Delta _f}{({\hat{\rho }} \epsilon )^{1/4} \epsilon ^2 \delta } \right) \ge \log _2(\theta ^{-1}) \ge 1. \end{aligned}$$

The call to \(\mathtt {PTAGD}(x_0)\) returns \(x_t \in {\mathcal {M}}\) satisfying \(f(x_t) \le f(x_0)\), \(\Vert \mathrm {grad}f(x_t)\Vert \le \epsilon \) and (with probability at least \(1-2\delta \)) also \(\lambda _{\mathrm {min}}(\nabla ^2 {{\hat{f}}}_{x_t}(0)) \ge -\sqrt{{{\hat{\rho }}} \epsilon }\) with

$$\begin{aligned} t + {\mathscr {T}} \le T_2 \triangleq \left( 2 + 4\frac{f(x_0) - f_{\mathrm {low}}}{{\mathscr {E}}} \right) {\mathscr {T}}. \end{aligned}$$
(24)

To reach termination, the algorithm requires at most \(2T_2\) pullback gradient queries and \(4T_2\) function queries (but no Hessian queries), and a similar number of calls to the retraction.

Notice how this result gives a (probabilistic) guarantee about the smallest eigenvalue of the Hessian of the pullback \({{\hat{f}}}_x\) at 0 rather than about the Hessian of f itself at x. Owing to Lemma 2.5, the two are equal in particular when we use the exponential retraction (more generally, when we use a second-order retraction): see also [13, §3.5].

Proof of Theorem 6.1

The proof starts the same way as that of Theorem 5.1. The call to \(\mathtt {PTAGD}(x_0)\) generates a sequence of points \(x_{t_0}, x_{t_1}, x_{t_2}, \ldots \) on \({\mathcal {M}}\), with \(t_0 = 0\). A priori, this sequence may be finite or infinite. Considering two consecutive indices \(t_i\) and \(t_{i+1}\), we either have \(t_{i+1} = t_i + 1\) (if the step from \(x_{t_i}\) to \(x_{t_{i+1}}\) is a single gradient step (Case 1)) or \(t_{i+1} = t_i + {\mathscr {T}}\) (if that same step is obtained through a call to \(\mathtt {TSS}\), with or without perturbation (Cases 3 and 2, respectively)). Moreover:

  • In Case 1, Proposition 5.2 applies and guarantees

    $$\begin{aligned} f(x_{t_i}) - f(x_{t_{i+1}}) \ge \frac{{\mathscr {E}}}{{\mathscr {T}}} = \frac{{\mathscr {E}}}{{\mathscr {T}}}(t_{i+1} - t_i). \end{aligned}$$

    The algorithm does not terminate here.

  • In Case 2, Proposition 5.3 applies and guarantees that if \(\Vert \mathrm {grad}f(x_{t_{i+1}})\Vert > \epsilon \) then

    $$\begin{aligned} f(x_{t_i}) - f(x_{t_{i+1}}) \ge {\mathscr {E}} = \frac{{\mathscr {E}}}{{\mathscr {T}}}(t_{i+1} - t_i), \end{aligned}$$

    and the algorithm does not terminate here.

    If, however, \(\Vert \mathrm {grad}f(x_{t_{i+1}})\Vert \le \epsilon \), then \(f(x_{t_i}) - f(x_{t_{i+1}}) \ge 0\) and the step from \(x_{t_{i+1}}\) to \(x_{t_{i+2}}\) does not fall in Case 2: it must fall in Case 3. (Indeed, it cannot fall in Case 1 because the fact that a Case 2 step occurred tells us \(\epsilon < 2\ell {\mathscr {M}}\).) The algorithm terminates with \(x_{t_{i+1}}\) unless \(f(x_{t_{i+1}}) - f(x_{t_{i+2}}) \ge \frac{1}{2}{\mathscr {E}}\). In other words, if the algorithm does not terminate with \(x_{t_{i+1}}\), then

    $$\begin{aligned} f(x_{t_{i}}) - f(x_{t_{i+2}})&= f(x_{t_{i}}) - f(x_{t_{i+1}}) + f(x_{t_{i+1}}) \\&\quad - f(x_{t_{i+2}}) \ge \frac{1}{2}{\mathscr {E}} = \frac{{\mathscr {E}}}{4{\mathscr {T}}}(t_{i+2} - t_{i}). \end{aligned}$$
  • In Case 3, the algorithm terminates with \(x_{t_{i}}\) unless

    $$\begin{aligned} f(x_{t_{i}}) - f(x_{t_{i+1}}) \ge \frac{1}{2}{\mathscr {E}} = \frac{{\mathscr {E}}}{2{\mathscr {T}}}(t_{i+1} - t_{i}). \end{aligned}$$

Clearly, \(\mathtt {PTAGD}(x_0)\) must terminate after a finite number of steps. Indeed, if it does not, then the above reasoning shows that the algorithm produces an amortized decrease in the cost function f of \(\frac{{\mathscr {E}}}{4{\mathscr {T}}}\) per unit increment of the counter t, yet the value of f cannot decrease by more than \(f(x_0) - f_{\mathrm {low}}\), a contradiction.

Accordingly, assume \(\mathtt {PTAGD}(x_0)\) generates \(x_{t_0}, \ldots , x_{t_{k+1}}\) and terminates there (returning \(x_{t_k}\)). The step from \(x_{t_k}\) to \(x_{t_{k+1}}\) necessarily falls in Case 3: \(t_{k+1} - t_k = {\mathscr {T}}\). The step from \(x_{t_{k-1}}\) to \(x_{t_k}\) could be of any type. If it falls in Case 2, it could be that \(f(x_{t_{k-1}}) - f(x_{t_k})\) is as small as zero and that \(t_{k} - t_{k-1} = {\mathscr {T}}\). (All other scenarios are better, in that the cost function decreases more, and the counter increases as much or less.) Moreover, for all steps prior to that, each unit increment of t brings about an amortized decrease in f of \(\frac{{\mathscr {E}}}{4{\mathscr {T}}}\). Thus, \(t_{k+1} \le t_{k-1} + 2{\mathscr {T}}\) and

$$\begin{aligned} f(x_0) - f_{\mathrm {low}}\ge f(x_{0}) - f(x_{t_{k-1}}) \ge \frac{{\mathscr {E}}}{4{\mathscr {T}}} t_{k-1}. \end{aligned}$$

Combining, we find

$$\begin{aligned} t_k + {\mathscr {T}} = t_{k+1} \le \left( 2 + 4\frac{f(x_0) - f_{\mathrm {low}}}{{\mathscr {E}}} \right) {\mathscr {T}} \triangleq T_2. \end{aligned}$$

What can we say about the point that is returned, \(x_{t_k}\)? Deterministically, \(f(x_{t_k}) \le f(x_0)\) and \(\Vert \mathrm {grad}f(x_{t_k})\Vert \le \epsilon \) (notice that we cannot guarantee the same about \(x_{t_{k+1}}\)). Let us now discuss the role of randomness.

In any run of \(\mathtt {PTAGD}(x_0)\), there are at most \(T_2/{\mathscr {T}}\) perturbations, that is, “Case 3” steps. By Proposition 6.2, the probability of any single one of those steps failing to prevent termination at a point where the smallest eigenvalue of the Hessian of the pullback at the origin is strictly less than \(-\sqrt{{{\hat{\rho }}} \epsilon }\) is at most \(\frac{\delta {\mathscr {E}}}{3\Delta _f}\). Thus, by a union bound, the probability of failure in any given run of \(\mathtt {PTAGD}(x_0)\) is at most (we use \(\Delta _f \ge \max \!\left( f(x_0) - f_{\mathrm {low}}, \sqrt{\frac{\epsilon ^3}{{\hat{\rho }}}}\right) \ge \max \!\left( f(x_0) - f_{\mathrm {low}}, 2^7{\mathscr {E}} \right) \) because \(\chi \ge 1\) and \(c \ge 2\)):

$$\begin{aligned} \frac{T_2}{{\mathscr {T}}} \cdot \frac{\delta {\mathscr {E}}}{3\Delta _f} = \left( 2 + 4\frac{f(x_0) - f_{\mathrm {low}}}{{\mathscr {E}}} \right) \frac{\delta {\mathscr {E}}}{3\Delta _f} \le \left( \frac{2{\mathscr {E}}}{3\Delta _f} + \frac{4}{3}\right) \delta \le 2\delta . \end{aligned}$$

In all other events, we have \(\lambda _{\mathrm {min}}(\nabla ^2 {{\hat{f}}}_{x_{t_k}}(0)) \ge -\sqrt{{{\hat{\rho }}} \epsilon }\).

To account for the maximal amount of work needed to run \(\mathtt {PTAGD}(x_0)\), we reason as at the end of the proof of Theorem 5.1, adding the cost of checking the condition “\(f(x_t) - f(x_{t+{\mathscr {T}}}) < \frac{1}{2}{\mathscr {E}}\)” after each perturbed call to \(\mathtt {TSS}\).

Note: the inequality \(\frac{d^{1/2} \ell ^{3/2} \sqrt{\epsilon ^3 / {\hat{\rho }}}}{({{\hat{\rho }}} \epsilon )^{1/4} \epsilon ^2 \delta } \ge \theta ^{-1}\) holds for all \(d \ge 1\) and \(\delta \in (0, 1)\) with \(c \ge 4\). \(\square \)

The next proposition corresponds mostly to [28, Lem. 13]. A proof is in Appendix G.

Proposition 6.2

(Case 3) Fix parameters and assumptions as laid out in Sect. 3, with \(d = \dim {\mathcal {M}}\), \(\delta \in (0, 1)\), any \(\Delta _f > 0\) and

$$\begin{aligned} \chi \ge \max \!\left( \log _2(\theta ^{-1}), \log _2\!\left( \frac{d^{1/2} \ell ^{3/2} \Delta _f}{({{\hat{\rho }}} \epsilon )^{1/4} \epsilon ^2 \delta } \right) \right) \ge 1. \end{aligned}$$

If \(x \in {\mathcal {M}}\) satisfies \(\Vert \mathrm {grad}f(x)\Vert \le \min (\epsilon , 2\ell {\mathscr {M}})\) and \(\lambda _{\mathrm {min}}(\nabla ^2 {{\hat{f}}}_x(0)) \le -\sqrt{{{\hat{\rho }}} \epsilon }\), and \(\xi \) is sampled uniformly at random from the ball of radius r around the origin in \(\mathrm {T}_x{\mathcal {M}}\), then \(x_{{\mathscr {T}}} = \mathtt {TSS}(x, \xi )\) satisfies \(f(x) - f(x_{{\mathscr {T}}}) \ge {\mathscr {E}}/2\) with probability at least \(1 - \frac{\delta {\mathscr {E}}}{3\Delta _f}\) over the choice of \(\xi \).

7 Conclusions and Perspectives

Our main complexity results for \(\mathtt {TAGD}\) and \(\mathtt {PTAGD}\) (Theorems 1.3 and 1.6) recover known Euclidean results when \({\mathcal {M}}\) is a Euclidean space. In particular, they retain the important properties of scaling essentially with \(\epsilon ^{-7/4}\) and of being either dimension-free (for \(\mathtt {TAGD}\)) or almost dimension-free (for \(\mathtt {PTAGD}\)). Those properties extend as is to the Riemannian case.

However, our Riemannian results are negatively impacted by the Riemannian curvature of \({\mathcal {M}}\), and also by the covariant derivative of the Riemann curvature endomorphism. We do not know whether such a dependency on curvature is necessary to achieve acceleration. In particular, the non-accelerated rates for Riemannian gradient descent, Riemannian trust-regions and Riemannian adaptive regularization with cubics under Lipschitz assumptions are unaffected by curvature [2, 13].

Curvature enters our complexity bounds through our geometric results (Theorem 2.7). For the latter, we do believe that curvature must play a role. Thus, it is natural to ask:

Can we achieve acceleration for first-order methods on Riemannian manifolds with weaker (or without) dependency on the curvature of the manifold?

For the geodesically convex case, all algorithms we know of are affected by curvature [3,4,5, 48]. Additionally, Hamilton and Moitra [26] show that curvature can significantly slow down convergence rates in the geodesically convex case with noisy gradients.

Adaptive regularization with cubics (ARC) may offer insights in that regard. ARC is a cubically regularized approximate Newton method with optimal iteration complexity on the class of cost functions with Lipschitz continuous Hessian, assuming access to gradients and Hessians [19, 39]. Specifically, assuming f has \(\rho \)-Lipschitz continuous Hessian, ARC finds an \((\epsilon , \sqrt{\rho \epsilon })\)-approximate second-order critical point in at most \({{\tilde{O}}}(\Delta _f \rho ^{1/2} / \epsilon ^{3/2})\) iterations, omitting logarithmic factors. This also holds on complete Riemannian manifolds [2, Cor. 3, eqs (16), (26)]. Note that this is dimension-free and curvature-free. Each iteration, however, requires solving a separate subproblem more costly than a gradient evaluation. Carmon and Duchi [18, §3] argue that it is possible to solve the subproblems accurately enough so as to find \(\epsilon \)-approximate first-order critical points with \(\sim 1/\epsilon ^{7/4}\) Hessian-vector products overall, with randomization and a logarithmic dependency on the dimension. Compared to \(\mathtt {TAGD}\), this has the benefit of being curvature-free, at the cost of randomization, of a logarithmic dependency on the dimension, and of requiring Hessian-vector products. The latter could conceivably be approximated with finite differences of the gradients. Perhaps that operation leads to losses tied to curvature? If not, then, since it is unclear why there ought to be a trade-off between curvature dependence and randomization, this may indicate that the curvature dependence is not necessary for acceleration.

On a distinct note and as pointed out in the introduction, \(\mathtt {TAGD}\) and \(\mathtt {PTAGD}\) are theoretical constructs. Although they have the theoretical upper hand in worst-case scenarios, we do not expect them to be competitive against time-tested algorithms such as Riemannian versions of nonlinear conjugate gradients or trust-region methods. It remains an interesting open problem to devise a truly practical accelerated first-order method on manifolds.

In the Euclidean case, Carmon et al. [15] showed that if one assumes not only the gradient and the Hessian of f but also the third derivative of f are Lipschitz continuous, then it is possible to find \(\epsilon \)-approximate first-order critical points in just \({{\tilde{O}}}(\epsilon ^{-5/3})\) iterations. We suspect that our proof technique could be used to prove a similar result on manifolds, possibly at the cost of also assuming a bound on the second covariant derivative of the Riemann curvature endomorphism.