1 Introduction

The numerical solution of nonsmooth optimization problems and the acceleration of their convergence have been regarded as fundamental issues over the past 10–20 years. This is mainly due to the development of image reconstruction and processing, data science and machine learning, which require the solution of large and highly nonlinear minimization problems. Two of the most popular approaches are forward–backward splittings [22, 23, 42], in particular the FISTA method [7, 8], and first-order primal–dual methods, first introduced in [32, 54] and further studied in [15, 17]. The common thread of all these methods is that they split the global objective into many elementary bricks which, individually, may be “easy” to optimize.

In their original version, all the above mentioned approaches require that the mathematical operations necessary in every step of the respective algorithms can be executed without errors. However, one may encounter situations in which these operations can only be performed up to a certain precision, e.g. due to an erroneous computation of a gradient or due to the application of a proximal operator lacking a closed-form solution. Examples where this problem arises are TV-type regularized inverse problems [6, 7, 30, 33, 59] or low-rank minimization and matrix completion [14, 44]. To address this issue, there has been a lot of work investigating the convergence of proximal point methods [24, 36, 37, 39, 56, 58], where the latter two also prove rates, and of proximal forward–backward splittings [23] in the presence of errors. The objectives of these publications range from general convergence proofs [21, 34, 43, 53, 64] and convergence up to some accuracy level [26, 27, 47] to convergence rates depending on the errors [5, 60, 61], also for the accelerated versions including the FISTA method. The recent paper [9] follows an approach similar to [61], however also extending to nonconvex problems using variable metric strategies and a linesearch.

For the recently popular class of first-order primal–dual algorithms mentioned above, the list remains short: to the best of our knowledge, the only work which considers inaccuracies in the proximal operators for this class of algorithms is the one of Condat [25], who explicitly models errors and proves convergence under mild assumptions on the decay of the errors. However, he does not show any convergence rates. We must also mention Nemirovski’s approach in [48], which is an extension of the extragradient method. This saddle-point optimization algorithm converges with an optimal O(1/N) convergence rate as soon as a particular inequality is satisfied at each iteration, possibly with a controlled error term (cf. Prop 2.2 in [48]).

The goal of this paper is twofold: Most importantly, we investigate the convergence of the primal–dual algorithms presented in [15, 17] for problems of the form

$$\begin{aligned} \min _{x \in {\mathcal {X}}} \max _{y \in {\mathcal {Y}}} ~ \langle Kx, y \rangle + f(x) + g(x) - h^*(y), \end{aligned}$$

for convex and lower semicontinuous g and h and convex and Lipschitz differentiable f, with errors occurring in the computation of \(\nabla f\) and the proximal operators for g and \(h^*\). Following the line of the preceding works on forward–backward splittings, we consider the different notions of inexact proximal points used in [60] and extended in [5, 58, 61] and, assuming an appropriate decay of the errors, establish the convergence rates of [15, 17] also for the perturbed algorithms. More precisely, we prove the well-known \(O\left( 1/N\right)\) rate for the basic version, a \(O\left( 1/N^2\right)\) rate if either f, g or \(h^*\) is strongly convex, and a linear convergence rate in case both the primal and dual objective are strongly convex. Moreover, we show that rates can still be established under a slower decay of the errors; unsurprisingly, these rates are then slower, depending on the errors.

In the spirit of [61] for inexact forward–backward algorithms, the second goal of this paper is to provide an interesting application for such inexact primal–dual algorithms. We put the focus on situations where one deliberately takes the path of inexactness in order to split the global objective more efficiently. A particular instance is given by problems of the form

$$\begin{aligned} \min _x ~ h(K_1x) + g(K_2x) = \min _x \max _y ~ \langle y,K_1 x\rangle + g(K_2x) - h^*(y). \end{aligned}$$
(1)

A popular example is the TV-\(L^1\) model with some imaging operator \(K_1\), where g and h are chosen to be \(L^1\)-norms and \(K_2 = \nabla\) is a gradient. It has e.g. been studied analytically by Alliney [2,3,4] and subsequently by Chan and Esedoglu [19], Kärkkäinen et al. [38] and Nikolova [50, 51]. However, due to the nonlinearity and nondifferentiability of the involved terms, solutions of the model are numerically hard to compute. One can find a variety of approaches to solve the TV-\(L^1\) model, ranging from (smoothed) gradient descent [19] over interior point methods [35] and primal–dual methods using a semi-smooth Newton method [28] to augmented Lagrangian methods [31, 62]. Interestingly, the inexact framework we propose in this paper provides a very simple and intuitive algorithmic approach to the solution of the TV-\(L^1\) model. More precisely, applying an inexact primal–dual algorithm to formulation (1), we obtain a nested algorithm in the spirit of [6, 7, 20, 30, 33, 59, 61],

$$\begin{aligned} y^{n+1}&= {{\mathrm {prox}}}_{\sigma h^*}(y^n + \sigma K_1(x^{n} + \theta (x^{n} - x^{n-1}))), \\ x^{n+1}&= {{\mathrm {prox}}}_{\tau (g \circ K_2)}(x^n - \tau K_1^*y^{n+1}), \end{aligned}$$

where \({{\mathrm {prox}}}_{\sigma h^*}\) denotes the proximal map with respect to \(h^*\) and step size \(\sigma\) (cf. Sect. 2). It requires solving the inner subproblem of the proximal step with respect to \(g \circ K_2\), i.e. a TV-denoising problem, which does not have a closed-form solution but has to be solved numerically (possibly with an initial guess which improves over the iterates). It has been observed in [7] that using this strategy for the primal TV-\(L^2\) deblurring problem can cause the FISTA algorithm to diverge if the inner step is not executed with sufficient precision. As a remedy, the authors of [61] demonstrated that the theoretical error bounds they established for inexact FISTA can also serve as a criterion for the necessary precision of the inner proximal problem and hence make the nested approach viable. We show that the bounds for inexact primal–dual algorithms established in this paper can be used to make the nested approach viable for entirely non-differentiable problems such as the TV-\(L^1\) model, while the results of [61] for partly smooth objectives can also be obtained as a special case of the accelerated versions.
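
To make the nested structure concrete, the following Python sketch implements the iteration above for the TV-\(L^1\) model \(\min _x \Vert K_1x - f\Vert _1 + \lambda \,\mathrm {TV}(x)\) with \(\theta = 1\). The operators `K1`, `K1T` and the inner solver `inexact_prox_tv` are hypothetical placeholders; the inner solver is assumed to return a type-2 approximation of the TV-denoising proximal point with precision `eps` (cf. Sect. 2), and the summable tolerance schedule anticipates the error bounds of Sect. 3.

```python
import numpy as np

def nested_inexact_pd_tvl1(f, K1, K1T, inexact_prox_tv, sigma, tau,
                           n_iter=300, C=1.0, q=1.1):
    """Sketch of the nested iteration above for min_x ||K1 x - f||_1 + lam*TV(x).

    The dual prox (for h = ||. - f||_1) is explicit, while the primal prox,
    a TV-denoising problem, is delegated to inexact_prox_tv(v, tau, eps),
    assumed to return a type-2 approximation with precision eps.
    """
    x = np.zeros_like(K1T(f))
    x_old = x.copy()
    y = np.zeros_like(f)
    for n in range(1, n_iter + 1):
        x_bar = 2 * x - x_old                     # extrapolation with theta = 1
        # prox of sigma*h^* for h = ||. - f||_1: shift by sigma*f, then project
        # onto the unit infinity-norm ball
        y = np.clip(y + sigma * K1(x_bar) - sigma * f, -1.0, 1.0)
        x_old = x
        eps_n = C / n ** q                        # summable inner tolerance (cf. Sect. 3)
        x = inexact_prox_tv(x - tau * K1T(y), tau, eps_n)
    return x
```

The weight \(\lambda\) of the total variation term is assumed to be absorbed into the inner solver.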

In the context of inexact and nested algorithms it is worthwhile mentioning the recent ‘Catalyst’ framework [40, 41], which uses nested inexact proximal point methods to accelerate a large class of generic optimization problems in the context of machine learning. The inexactness criterion applied there is the same as in [5, 60]. Our approach, however, is much closer to [5, 60, 61], stating convergence rates for perturbed algorithms, while [40, 41] focus entirely on nested algorithms, which we only consider as a particular instance of perturbed algorithms in the numerical experiments.

The remainder of the paper is organized as follows: in the next section we introduce the notions of inexact proxima that we will use throughout the paper. In the following Sect. 3 we formulate inexact versions of the primal–dual algorithms presented in [15] and [17] and prove their convergence including rates depending on the decay of the errors. In the numerical Sect. 4 we apply the above splitting idea for nested algorithms to some well-known imaging problems and show how inexact primal–dual algorithms can be used to improve their convergence behavior.

2 Inexact computations of the proximal point

In this section we introduce and discuss the idea of the proximal point and several ways for its approximation. For a proper, convex and lower semicontinuous function \(g :{\mathcal {X}}\rightarrow {(-\infty ,+\infty ]}\) mapping from a Hilbert space \({\mathcal {X}}\) to the real line extended with the value \(+\infty\) and \(y \in {\mathcal {X}}\) the proximal point [45, 46, 55, 56] is given by

$$\begin{aligned} {{\mathrm {prox}}}_{\tau g} (y) = \arg \min _{x \in {\mathcal {X}}} \left\{ \frac{1}{2\tau } \Vert x - y\Vert ^2 + g(x) \right\} . \end{aligned}$$
(2)

Since g is proper we directly obtain \({{\mathrm {prox}}}_{\tau g} (y) \in {{\mathrm {dom}}} g\). Since the objective in (2) is \(1/\tau\)-strongly convex, the proximal point exists and is unique; the resulting mapping \({{\mathrm {prox}}}_{\tau g} :{\mathcal {X}}\rightarrow {\mathcal {X}}\) is called the proximity operator of \(\tau g\). Letting

$$\begin{aligned} G_\tau :{\mathcal {X}}\rightarrow {(-\infty ,+\infty ]}, \quad x \mapsto \frac{1}{2\tau } \Vert x - y\Vert ^2 + g(x) \end{aligned}$$
(3)

be the proximity function, the first-order optimality condition for the proximum gives different characterizations of the proximal point:

$$\begin{aligned} z = {{\mathrm {prox}}}_{\tau g} (y) \iff 0 \in \partial G_\tau (z) \iff \frac{y-z}{\tau } \in \partial g(z). \end{aligned}$$
(4)
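
As a concrete example (a standard one, not specific to this paper), for \(g = \Vert \cdot \Vert _1\) on \({\mathbb {R}}^d\) the proximal point (2) is given in closed form by componentwise soft-thresholding, and the characterization (4) can be verified directly; a minimal Python sketch:

```python
import numpy as np

def prox_l1(y, tau):
    """Closed-form proximal point of tau*||.||_1 (componentwise soft-thresholding)."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

y, tau = np.array([1.5, -0.3, 0.0, 2.0]), 0.5
z = prox_l1(y, tau)
p = (y - z) / tau                   # candidate subgradient from (4)
# the subdifferential of |.| is {sign(z_i)} if z_i != 0 and [-1, 1] if z_i == 0
in_subdifferential = np.where(z != 0, np.isclose(p, np.sign(z)), np.abs(p) <= 1.0)
print(z, in_subdifferential.all())  # the optimality condition (4) holds componentwise
```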

Based on these equivalences we introduce four different types of inexact computations of the proximal point, which differ in how restrictive they are. Most have already been considered in the literature, and we give references next to the definitions. We recommend [58, 61] for illustrations of the different notions in the case of a projection. We start with the first expression in (4).

Definition 1

Let \(\varepsilon \ge 0\). We say that \(z \in {\mathcal {X}}\) is a type-0 approximation of the proximal point \({{\mathrm {prox}}}_{\tau g}(y)\) with precision \(\varepsilon\) if

$$\begin{aligned} z \approx _0^{\varepsilon } {{\mathrm {prox}}}_{\tau g}(y) \overset{{\mathrm {def}}}{\iff } \Vert z - {{\mathrm {prox}}}_{\tau g}(y) \Vert \le \sqrt{2 \tau \varepsilon }. \end{aligned}$$

This refers to choosing a proximal point from the \(\sqrt{2\tau \varepsilon }\)-ball around the true proximum. It is important to notice that a type-0 approximation is not necessarily feasible, i.e., it can occur that \(z \notin {{\mathrm {dom}}} g\). This can easily be verified e.g. for g being the indicator function of a convex set, and requires us to treat this approximation slightly differently from the following ones in “Type-0 approximations” section of “Appendix”. Observe that it is easy to check a posteriori the precision of a type-0 approximation provided \(\partial g\) is easy to evaluate. Indeed, if \(e\in \tau \partial g(z)+ z-y\), standard estimates show that \(\Vert z - {{\mathrm {prox}}}_{\tau g}(y) \Vert \le \Vert e\Vert\) and \(z\approx _0^{\varepsilon }{{\mathrm {prox}}}_{\tau g}(y)\) for \(\varepsilon =\Vert e\Vert ^2/(2\tau )\).
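
A minimal Python sketch of this a posteriori check, again for the standard example \(g = \Vert \cdot \Vert _1\) and a hypothetical, deliberately perturbed proximal point: any explicit subgradient \(p \in \partial g(z)\) yields \(e = \tau p + z - y\) and hence a certified precision \(\varepsilon = \Vert e\Vert ^2/(2\tau )\).

```python
import numpy as np

def prox_l1(y, tau):
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

rng = np.random.default_rng(0)
y, tau = rng.standard_normal(100), 0.5
z = prox_l1(y, tau) + 1e-3 * rng.standard_normal(100)  # perturbed proximal point

p = np.sign(z)                      # a subgradient of ||.||_1 at z (0 is valid at zeros)
e = tau * p + z - y                 # so that e lies in tau*dg(z) + z - y
eps = np.linalg.norm(e) ** 2 / (2 * tau)  # certified type-0 precision

# sanity check of the estimate ||z - prox(y)|| <= ||e|| = sqrt(2*tau*eps);
# the certificate may be loose, since another p in dg(z) could give a smaller e
assert np.linalg.norm(z - prox_l1(y, tau)) <= np.linalg.norm(e) + 1e-12
```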

In order to relax the second or third expression in (4), we need the notion of an \(\varepsilon\)-subdifferential of \(g :{\mathcal {X}}\rightarrow {(-\infty ,+\infty ]}\) at \(z \in {\mathcal {X}}\):

$$\begin{aligned} \partial _{\varepsilon } g(z) := \{ p \in {\mathcal {X}}~|~ g(x) \ge g(z) + \langle p, x-z \rangle - \varepsilon ~ \forall x \in {\mathcal {X}}\}. \end{aligned}$$

As a direct consequence of the definition we obtain a notion of \(\varepsilon\)-optimality

$$\begin{aligned} 0 \in \partial _{\varepsilon } g(z) \iff g(z) \le \inf g + \varepsilon . \end{aligned}$$
(5)

Based on this and the second expression in (4), we define a second notion of an inexact proximum. It has e.g. been considered in [5, 60] to prove the convergence of inexact proximal gradient methods in the presence of errors.

Definition 2

Let \(\varepsilon \ge 0\). We say that \(z \in {\mathcal {X}}\) is a type-1 approximation of the proximal point \({{\mathrm {prox}}}_{\tau g}(y)\) with precision \(\varepsilon\) if

$$\begin{aligned} z \approx _1^{\varepsilon } {{\mathrm {prox}}}_{\tau g}(y) \overset{{\mathrm {def}}}{\iff } 0 \in \partial _{\varepsilon } G_\tau (z). \end{aligned}$$

Hence, by (5), a type-1 approximation minimizes the energy of the proximity function (3) up to an error of \(\varepsilon\). It turns out that a type-0 approximation is weaker than a type-1 approximation:

Lemma 1

Let \(z \approx _1^{\varepsilon } {{\mathrm {prox}}}_{\tau g}(y)\). Then \(z \in {{\mathrm {dom}}} g\) and

$$\begin{aligned} \Vert z - {{\mathrm {prox}}}_{\tau g}(y) \Vert \le \sqrt{2 \tau \varepsilon }, \end{aligned}$$

that is, \(z \approx _0^{\varepsilon } {{\mathrm {prox}}}_{\tau g}(y)\).

The result is easy to verify and can be found e.g. in [36, 56, 58]. An even more restrictive version of an inexact proximum can be obtained by relaxing the third expression in (4), which has been introduced in [39] and subsequently been used in [24, 58] in the context of inexact proximal point methods.

Definition 3

Let \(\varepsilon \ge 0\). We say that \(z \in {\mathcal {X}}\) is a type-2 approximation of the proximal point \({{\mathrm {prox}}}_{\tau g}(y)\) with precision \(\varepsilon\) if

$$\begin{aligned} z \approx _2^{\varepsilon } {{\mathrm {prox}}}_{\tau g}(y) \overset{{\mathrm {def}}}{\iff } \frac{y-z}{\tau } \in \partial _{\varepsilon } g(z). \end{aligned}$$

Letting \(\phi _\tau (z) = \Vert z - y \Vert ^2/(2 \tau )\), the following characterization from [58] of the \(\varepsilon\)-subdifferential of the proximity function \(G_\tau = g + \phi _\tau\) sheds light on the difference between a type-1 and type-2 approximation:

$$\begin{aligned} \partial _\varepsilon G_\tau (z)&= \bigcup _{0 \le \varepsilon _1 + \varepsilon _2 \le \varepsilon } \partial _{\varepsilon _1} g(z) + \partial _{\varepsilon _2} \phi _\tau (z) \nonumber \\&= \bigcup _{0 \le \varepsilon _1 + \varepsilon _2 \le \varepsilon } \partial _{\varepsilon _1} g(z) + \left\{ \frac{z-y-e}{\tau } : \Vert e\Vert \le \sqrt{2 \tau \varepsilon _2} \right\} . \end{aligned}$$
(6)

Equation (6) decomposes the error in the \(\varepsilon\)-subdifferential of \(G_\tau\) into two parts related to g and \(\phi _\tau\), respectively. As a consequence, for a type-1 approximation it is not clear how the error is distributed between g and \(\phi _\tau\), while for a type-2 approximation the error occurs solely in g. Hence a type-2 approximation can be seen as an extreme case of a type-1 approximation with \(\varepsilon _2 = 0\).

In view of the decomposition (6), we define a fourth notion of an inexact proximum as the extreme case \(\varepsilon _1 = 0\), which has been considered in e.g. [36, 56].

Definition 4

Let \(\varepsilon \ge 0\). We say that \(z \in {\mathcal {X}}\) is a type-3 approximation of the proximal point \({{\mathrm {prox}}}_{\tau g}(y)\) with precision \(\varepsilon\) if

$$\begin{aligned} z \approx _3^{\varepsilon } {{\mathrm {prox}}}_{\tau g}(y) \overset{{\mathrm {def}}}{\iff } \exists e \in {\mathcal {X}},\quad \Vert e\Vert \le \sqrt{2 \tau \varepsilon } : z = {{\mathrm {prox}}}_{\tau _g}(y + e). \end{aligned}$$

Definition 4 gives the notion of a “correct” output for an incorrect input of the proximal operator. Being the two extreme cases, type-2 and type-3 proxima are also proxima of type 1. The decomposition (6) further leads to the following lemma from [60], which allows us to treat the type-1, -2 and -3 approximations in the same setting.

Lemma 2

If \(z \in {\mathcal {X}}\) is a type-1 approximation of \({{\mathrm {prox}}}_{\tau g}(y)\) with precision \(\varepsilon\), then there exists \(r \in {\mathcal {X}}\) with \(\Vert r\Vert \le \sqrt{2 \tau \varepsilon }\) such that

$$\begin{aligned} (y-z-r)/\tau \in \partial _\varepsilon g(z). \end{aligned}$$

Now note that letting \(r = 0\) in Lemma 2 gives a type-2 approximation, while replacing the \(\varepsilon\)-subdifferential by the standard subdifferential gives a type-3 approximation. We mention that there exist further notions of approximations of the proximal point, e.g. used in [36], and refer to [61, Section 2.2] for a compact discussion.

Even though we prove our results for different types of approximations, the most useful one in practice seems to be the approximation of type 2. This is due to the following insights obtained by Villa et al. [61]: Without loss of generality let \(g(x) = w(Bx),\) with proper, convex and l.s.c. \(w:{\mathcal {Z}}\rightarrow {(-\infty ,+\infty ]}\) and linear \(B :{\mathcal {X}}\rightarrow {\mathcal {Z}}\). Then the calculation of the proximum requires solving

$$\begin{aligned} \min _{x\in {\mathcal {X}}}~ G_\tau (x)&= \min _{x \in {\mathcal {X}}} ~ \frac{1}{2\tau } \Vert x - y \Vert ^2 + w(Bx). \end{aligned}$$
(7)

Now if there exists \(x_0 \in {\mathcal {X}}\) such that w is continuous at \(Bx_0\), the Fenchel–Moreau–Rockafellar duality formula [63, Corollary 2.8.5] states that

$$\begin{aligned} \min _{x\in {\mathcal {X}}}~ G_\tau (x) = - \min _{z \in {\mathcal {Z}}} ~ \frac{\tau }{2} \Vert B^*z\Vert ^2 - \langle B^*z,y\rangle + w^*(z), \end{aligned}$$

where we refer to the right hand side as the “dual” problem \(W_\tau (z)\). Furthermore we can always recover the primal solution \({\hat{x}}\) from the dual solution \({\hat{z}}\) via the relation \({\hat{x}}= y - \tau B^* {\hat{z}}\). Most importantly, we obtain that \({\hat{x}}\) and \({\hat{z}}\) solve the primal and dual problem, respectively, if and only if the duality gap \({\mathcal {G}}(x,z) := G_\tau (x) + W_\tau (z)\) vanishes, i.e.

$$\begin{aligned} 0 = \min _{(x,z) \in {\mathcal {X}}\times {\mathcal {Z}}} G_\tau (x) + W_\tau (z) = {\mathcal {G}}({\hat{x}},{\hat{z}}). \end{aligned}$$

The following result in [61] states that the duality gap can also be used as a criterion to assess admissible type-2 approximations of the proximal point:

Proposition 1

Let\(z \in {\mathcal {Z}}\). Then

$$\begin{aligned} {\mathcal {G}}(y - \tau B^* z,z) \le \varepsilon \Rightarrow y - \tau B^*z \approx _2^\varepsilon {{\mathrm {prox}}}_{\tau g}(y). \end{aligned}$$

Proposition 1 has an interesting implication: if one can construct a feasible dual variable z during the solution of (7), it is easy to check whether the corresponding primal variable x is an admissible type-2 approximation by evaluating the duality gap. We shall make use of this in the numerical experiments in Sect. 4.
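
As an illustration, the following Python sketch computes a type-2 approximation of \({{\mathrm {prox}}}_{\tau g}(y)\) for \(g = \Vert B\cdot \Vert _1\), so that \(w = \Vert \cdot \Vert _1\) and \(w^*\) is the indicator of the unit \(\ell ^\infty\)-ball, by projected gradient descent on the dual problem, stopping as soon as the duality gap certifies the prescribed precision. This is only a sketch of the principle: the matrix B, the solver and its step size are assumptions for illustration, not the implementation used in Sect. 4.

```python
import numpy as np

def inexact_prox_type2(y, B, tau, eps, max_iter=10000):
    """Type-2 approximation of prox_{tau*g}(y) for g(x) = ||Bx||_1, obtained by
    projected gradient descent on the dual of (7), stopped once the gap <= eps."""
    z = np.zeros(B.shape[0])                        # feasible dual start, ||z||_inf <= 1
    step = 1.0 / (tau * np.linalg.norm(B, 2) ** 2)  # 1 / Lipschitz constant of the dual gradient
    for _ in range(max_iter):
        Btz = B.T @ z
        x = y - tau * Btz                           # primal point recovered from z
        gap = (np.linalg.norm(x - y) ** 2 / (2 * tau) + np.linalg.norm(B @ x, 1)  # G_tau(x)
               + tau / 2 * np.linalg.norm(Btz) ** 2 - Btz @ y)                    # W_tau(z), w*(z) = 0
        if gap <= eps:                              # by Proposition 1, x is a type-2 approximation
            break
        grad = tau * (B @ Btz) - B @ y              # gradient of the smooth part of W_tau
        z = np.clip(z - step * grad, -1.0, 1.0)     # projection onto dom w* = {||z||_inf <= 1}
    return x
```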

Of course, since a type-2 approximation automatically is a type-1 and type-0 approximation, the above argumentation is also valid for finding feasible approximations in these cases. However, since type-1 and type-0 approximations are technically less restrictive, one would rather characterize when an approximation is of such a type without already being an approximation of type 2. An example of a type-0 approximation may be found in problems where the desired proximum is the projection onto the intersection of convex sets. The (inexact) proximum may be computed in a straightforward fashion using Dykstra’s algorithm [29], which has e.g. been done in [11] or [1, 17, Ex. 7.7] for Mumford–Shah-type segmentation problems. Depending on the involved sets, one may get an upper bound on the maximal distance of the current iterate of Dykstra’s algorithm to these sets, obtaining a bound on how far the current iterate is from the true proximum. These considerations, however, need to be worked out for the respective cases.

3 Inexact primal–dual algorithms

We can now prove the convergence of some primal–dual algorithms in the presence of the respective errors. We start with the type-1, -2 and -3 approximations and outline in “Type-0 approximations” section of “Appendix” how to handle the type-0 approximation as well. The convergence analysis in this section is based on a combination of techniques derived in previous works on the topic: similar results on the convergence of exact primal–dual algorithms can be found e.g. in [15, 17, 18], while the techniques to obtain error bounds for the inexact proximum are mainly taken from [5, 60]. We consider the saddle-point problem

$$\begin{aligned} \min _{x \in {\mathcal {X}}} \max _{y \in {\mathcal {Y}}} ~ {\mathcal {L}}(x,y) = \langle Kx, y \rangle + f(x) + g(x) - h^*(y), \end{aligned}$$
(8)

where we make the following assumptions:

  1.

    \(K :{\mathcal {X}}\rightarrow {\mathcal {Y}}\) is a linear and bounded operator between Hilbert spaces \({\mathcal {X}}\) and \({\mathcal {Y}}\) with norm \(L = \Vert K\Vert\),

  2.

    \(f :{\mathcal {X}}\rightarrow {(-\infty ,+\infty ]}\) is proper, convex, lower semicontinuous and differentiable with \(L_f\)-Lipschitz gradient,

    $$\begin{aligned} \Vert \nabla f(x) - \nabla f(x') \Vert \le L_f \Vert x - x' \Vert \quad {\text { for all }} x,x' \in {{\mathrm {dom}}} f, \end{aligned}$$
  3.

    \(g :{\mathcal {X}}\rightarrow {(-\infty ,+\infty ]}\) and \(h :{\mathcal {Y}}\rightarrow {(-\infty ,+\infty ]}\) are proper, lower semicontinuous and convex functions,

  4.

    problem (8) admits at least one solution \((x^\star ,y^\star ) \in {\mathcal {X}}\times {\mathcal {Y}}\).

It is well-known that taking the supremum over y in \({\mathcal {L}}(x,y)\) leads to the corresponding “primal” formulation of the saddle-point problem (8)

$$\begin{aligned} \min _{x \in {\mathcal {X}}} ~ f(x) + g(x) + h(Kx), \end{aligned}$$

which for a lot of variational problems might be the starting point. Analogously, taking the infimum over x leads to the dual problem. Given an algorithm producing iterates \((x^N,y^N)\) for the solution of (8), the goal of this section is to obtain estimates

$$\begin{aligned} {\mathcal {L}}(x^N,y) - {\mathcal {L}}(x,y^N) \le \frac{C(x,y,x^0,y^0)}{N^\alpha } \end{aligned}$$
(9)

for \(\alpha > 0\) and \((x,y) \in {\mathcal {X}}\times {\mathcal {Y}}\). If \((x,y) = (x^\star ,y^\star )\) is a saddle point, the left hand side vanishes if and only if the pair \((x^N,y^N)\) is a saddle point itself, yielding a convergence rate in terms of the primal–dual objective in \(O\left( 1/N^\alpha \right)\). Under mild additional assumptions one can then derive estimates e.g. for the error in the primal objective. If the supremum over y in \({\mathcal {L}}(x^N,y)\) is attained at some \({\tilde{y}}\), one easily sees that

$$\begin{aligned}&f(x^N) + g(x^N) + h(Kx^N) - [ f(x^\star ) + g(x^\star ) + h(Kx^\star ) ] \nonumber \\&\quad = \sup _{y \in {\mathcal {Y}}} ~ {\mathcal {L}}(x^N,y) - \sup _{y \in {\mathcal {Y}}} ~ {\mathcal {L}}(x^\star ,y) \le {\mathcal {L}}(x^N,{\tilde{y}}) - {\mathcal {L}}(x^\star ,y^N) \nonumber \\&\quad \le \frac{C(x^\star ,{\tilde{y}},x^0,y^0)}{N^\alpha }, \end{aligned}$$
(10)

giving a convergence estimate for the primal objective.

In the original versions of primal–dual algorithms (e.g. [15, 18]), the authors require \(h^*\) and g to have a simple structure such that their proximal operators (2) can be evaluated sufficiently easily. A particular feature of most of these operators is that they have a closed-form solution and can hence be evaluated exactly. We study the situation where the proximal operators for g or \(h^*\) can only be evaluated up to a certain precision in the sense of Sect. 2, and where also the gradient of f may contain errors. As opposed to the general iteration of an exact primal–dual algorithm [18]

$$\begin{aligned} \begin{aligned} {\hat{y}}&= {{\mathrm {prox}}}_{\sigma h^*} ({\bar{y}}+ \sigma K{\tilde{x}}), \\ {\hat{x}}&= {{\mathrm {prox}}}_{\tau g} ({\bar{x}}- \tau ( K^*{\tilde{y}}+ \nabla f({\bar{x}}))), \end{aligned} \end{aligned}$$
(11)

where \(({\bar{x}},{\bar{y}})\) and \(({\tilde{x}},{\tilde{y}})\) are the previous points, and \(({\hat{x}},{\hat{y}})\) are the updated exact points, we introduce the general iteration of an inexact primal–dual algorithm

$$\begin{aligned} \begin{aligned} \check{y}&\approx _2^\delta {{\mathrm {prox}}}_{\sigma h^*} ({\bar{y}}+ \sigma K{\tilde{x}}), \\ \check{x}&\approx _i^\varepsilon {{\mathrm {prox}}}_{\tau g} ({\bar{x}}- \tau ( K^*{\tilde{y}}+ \nabla f({\bar{x}}) + e)). \end{aligned} \end{aligned}$$
(12)

Here the updated points \((\check{x},\check{y})\) denote the inexact proximal points, which are only computed up to precision \(\varepsilon\) and \(\delta\), respectively, in the sense of a type-2 approximation from Sect. 2 for \(\check{y}\) and a type-i approximation with \(i\in \{1,2,3\}\) for \(\check{x}\). The vector \(e \in {\mathcal {X}}\) denotes a possible error in the gradient of f. We use two different pairs of input points \(({\bar{x}},{\bar{y}})\) and \(({\tilde{x}},{\tilde{y}})\) in order to include intermediate overrelaxed input points. It is clear, however, that we require \({\tilde{x}}\) to depend on \(\check{x}\) and \({\tilde{y}}\) on \(\check{y}\), respectively, in order to couple the primal and dual updates.

At first glance it seems counterintuitive that we allow errors of type 1, 2 and 3 in \(\check{x}\), while only allowing for type-2 errors in \(\check{y}\). The following general descent rule for the iteration (12) sheds some more light on this fact and forms the basis for all the following proofs. It can be derived using simple convexity results and resembles the classical energy descent rules for forward–backward splittings. It can then be used to obtain estimates on the decay of the objective of the form (9). We prove the descent rule for a type-1 approximation of the primal proximum since we always obtain the result for a type-2 or type-3 approximation as a special case.

Lemma 3

Let \(\tau , \sigma > 0\) and \((\check{x},\check{y})\) be obtained from \(({\bar{x}},{\bar{y}})\) and \(({\tilde{x}},{\tilde{y}})\) and the updates (12) for \(i = 1\). Then for every \((x,y) \in {\mathcal {X}}\times {\mathcal {Y}}\) we have

$$\begin{aligned} {\mathcal {L}}(\check{x},y) - {\mathcal {L}}(x,\check{y})&\le ~ \frac{\Vert x - {\bar{x}}\Vert ^2}{2 \tau } + \frac{\Vert y - {\bar{y}}\Vert ^2}{2 \sigma } - \frac{\Vert x - \check{x}\Vert ^2}{2 \tau }- \frac{1-\tau L_f}{2 \tau } \Vert {\bar{x}}- \check{x}\Vert ^2 \nonumber \\&\quad - \frac{\Vert y - \check{y}\Vert ^2}{2\sigma } - \frac{\Vert {\bar{y}}- \check{y}\Vert ^2}{2 \sigma } + \langle K(x-\check{x}), {\tilde{y}}- \check{y}\rangle \nonumber \\&\quad - \langle K ({\tilde{x}}- \check{x}), y- \check{y}\rangle + \left( \Vert e \Vert + \sqrt{2 \varepsilon / \tau } \right) \Vert x - \check{x}\Vert + \varepsilon + \delta . \end{aligned}$$
(13)

Proof

For the inexact type-2 proximum \(\check{y}\) we have by Definition 3 that \(({\bar{y}} + \sigma K {\tilde{x}} - \check{y})/\sigma \in \partial _\delta h^*(\check{y})\), so by the definition of the \(\delta\)-subdifferential we find

$$\begin{aligned} h^*(\check{y})&\le h^*(y) + \left\langle \frac{{\bar{y}}+ \sigma K {\tilde{x}}- \check{y}}{\sigma }, \check{y}-y \right\rangle \nonumber + \delta \nonumber \\&= h^*(y) + \left\langle \frac{{\bar{y}}- \check{y}}{\sigma }, \check{y}-y \right\rangle + \langle K {\tilde{x}}, \check{y}- y \rangle + \delta \nonumber \\&\le h^*(y) - \frac{\Vert {\bar{y}}- \check{y}\Vert ^2}{2\sigma } - \frac{\Vert y - \check{y}\Vert ^2}{2 \sigma } + \frac{\Vert {\bar{y}}- y \Vert ^2}{2 \sigma } + \langle K {\tilde{x}}, \check{y}- y \rangle + \delta . \end{aligned}$$
(14)

For the inexact type-1 primal proximum, from Definition 2 and Lemma 2 we have that there exists a vector r with \(\Vert r\Vert \le \sqrt{2 \tau \varepsilon }\) such that

$$\begin{aligned} ({\bar{x}} - \tau (K^* {\tilde{y}} + \nabla f({\bar{x}}) + e) - \check{x}- r)/\tau \in \partial _{\varepsilon } g(\check{x}). \end{aligned}$$

Hence we find that

$$\begin{aligned} g(\check{x})&\le g(x) + \left\langle \frac{{\bar{x}}- \tau (K^* {\tilde{y}}+ \nabla f({\bar{x}}) + e) - \check{x}- r}{\tau }, \check{x}- x \right\rangle + \varepsilon \nonumber \\&\le g(x) - \frac{\Vert {\bar{x}}- \check{x}\Vert ^2}{2 \tau } - \frac{\Vert x - \check{x}\Vert ^2}{2 \tau } + \frac{\Vert {\bar{x}}- x \Vert ^2}{2 \tau } + \langle {\tilde{y}}, K(x - \check{x}) \rangle \nonumber \\&\quad - \langle \nabla f({\bar{x}}), \check{x}- x \rangle + \left( \Vert e\Vert + \sqrt{2 \varepsilon / \tau } \right) \Vert x - \check{x}\Vert + \varepsilon , \end{aligned}$$
(15)

where we applied the Cauchy–Schwarz inequality to the error term. Further by the Lipschitz property and convexity of f we have (cf. [49])

$$\begin{aligned} f(\check{x}) \le f(x) + \langle \nabla f({\bar{x}}), \check{x}- x \rangle + \frac{L_f}{2} \Vert \check{x}- {\bar{x}}\Vert ^2. \end{aligned}$$
(16)

Now we add the Eqs. (14), (15) and (16), insert

$$\begin{aligned} \langle K\check{x},y \rangle - \langle K\check{x},y \rangle , \quad \langle Kx,\check{y}\rangle - \langle Kx,\check{y}\rangle , \quad \langle K\check{x},\check{y}\rangle - \langle K\check{x},\check{y}\rangle , \end{aligned}$$

and rearrange to arrive at the result. □

We point out that, as a special case, choosing a type-2 approximation for the primal proximum in Lemma 3 corresponds to dropping the square root in the estimate (13), while choosing a type-3 approximation corresponds to dropping the additional \(\varepsilon\) at the end. Also note that the above descent rule is the same as the one in [15, 18] except for the additional error terms in the last line of (13).

Lemma 3 has an immediate implication: in order to control the errors \(\Vert e\Vert\) and \(\varepsilon\) in the last line of (13) we obviously need a useful bound on \(\Vert x - \check{x}\Vert\). We shall obtain this bound using a linear extrapolation in the primal variable x [15]. However, if we allow e.g. a type-1 approximation also in \(\check{y}\), we obtain an additional error term in (13) involving \(\Vert y - \check{y}\Vert\) that we need to bound as well. Even though we shall be able to obtain such a bound in most cases, it will either be arbitrarily weak due to the asymmetric nature of the primal–dual algorithms used, or come with severe restrictions on the step sizes. This will become more apparent from the proofs in the following.

3.1 The convex case: no acceleration

We start with a proof for a basic version of algorithm (12) using a technical lemma taken from [60] (see “Two technical lemmas” section in “Appendix”). The following inexact primal–dual algorithm corresponds to the choice

$$\begin{aligned} (\check{x},\check{y}) = (x^{n+1},y^{n+1}), \quad ({\bar{x}},{\bar{y}}) = (x^n, y^n), \quad ({\tilde{x}}, {\tilde{y}}) = (2x^n - x^{n-1},y^{n+1}) \end{aligned}$$
(17)

in algorithm (12):

$$\begin{aligned} \begin{aligned} y^{n+1}&\approx _2^{\delta _{n+1}} {{\mathrm {prox}}}_{\sigma h^*} (y^n + \sigma K (2 x^n - x^{n-1} )) \\ x^{n+1}&\approx _i^{\varepsilon _{n+1}} {{\mathrm {prox}}}_{\tau g} ( x^n - \tau (K^* y^{n+1} + \nabla f(x^n) + e^{n+1})). \end{aligned} \end{aligned}$$
(18)
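
For concreteness, the following Python sketch implements iteration (18) with the error decay of Corollary 1 below built in. The inexact proximal operators and the (possibly inexact) gradient are passed as callables and are assumptions of this sketch, as is the step-size choice, which has to satisfy \(\tau L_f + \sigma \tau L^2 + \tau \beta L < 1\) as required by Theorem 1.

```python
import numpy as np

def inexact_pdhg(x0, y0, K, KT, grad_f, prox_g_inexact, prox_hstar_inexact,
                 tau, sigma, N, alpha=1.0):
    """Sketch of iteration (18): prox_g_inexact(v, tau, eps) and
    prox_hstar_inexact(w, sigma, delta) are assumed to return type-2
    approximations with the given precisions; grad_f may itself be inexact.
    The tolerance schedule follows Corollary 1 (type-2 case)."""
    x, x_old, y = x0.copy(), x0.copy(), y0.copy()
    X_sum, Y_sum = np.zeros_like(x0), np.zeros_like(y0)
    for n in range(1, N + 1):
        delta_n = 1.0 / n ** (2 * alpha)      # dual precision delta_n = O(n^{-2*alpha})
        eps_n = 1.0 / n ** (2 * alpha)        # primal precision (type-2 approximation)
        y = prox_hstar_inexact(y + sigma * K(2 * x - x_old), sigma, delta_n)
        # x^{n+1} from x^n and y^{n+1}; the tuple assignment keeps x^n as x_old
        x, x_old = prox_g_inexact(x - tau * (KT(y) + grad_f(x)), tau, eps_n), x
        X_sum += x
        Y_sum += y
    return X_sum / N, Y_sum / N               # ergodic averages X^N, Y^N
```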

Theorem 1

Let \(L = \Vert K\Vert\) and choose small \(\beta > 0\) and \(\tau , \sigma > 0\) such that \(\tau L_f + \sigma \tau L^2 + \tau \beta L< 1\), and let the iterates \((x^n, y^n)\) be defined by algorithm (18) for \(i \in \{1,2,3\}\). Then for any \(N \ge 1\) and \(X^N := \left( \sum _{n=1}^N x^n \right) /N\), \(Y^N := \left( \sum _{n=1}^N y^n \right) /N\) we have for a saddle point \((x^\star ,y^\star ) \in {\mathcal {X}}\times {\mathcal {Y}}\) that

$$\begin{aligned}&{\mathcal {L}}(X^N,y^\star ) - {\mathcal {L}}(x^\star , Y^N) \nonumber \\&\quad \le \frac{1}{2\tau N} \left( \Vert x^\star - x^0 \Vert + \sqrt{\frac{\tau }{\sigma }} \Vert y^\star - y^0 \Vert + 2 A_{N,i} + \sqrt{2 B_{N,i}} \right) ^2, \end{aligned}$$
(19)

where

$$\begin{aligned} A_{N,1}&= \sum _{n=1}^N \left( \tau \Vert e^n \Vert + \sqrt{2 \tau \varepsilon _n}\right) ,&\quad B_{N,1} = \sum _{n=1}^N \left( \tau \varepsilon _n + \tau \delta _n\right) , \\ A_{N,2}&= \sum _{n=1}^N \tau \Vert e^n \Vert ,&\quad B_{N,2} = \sum _{n=1}^N \left( \tau \varepsilon _n + \tau \delta _n\right) , \\ A_{N,3}&= \sum _{n=1}^N \left( \tau \Vert e^n \Vert + \sqrt{2 \tau \varepsilon _n}\right) ,&\quad B_{N,3} = \sum _{n=1}^N \tau \delta _n. \end{aligned}$$

Remark 1

The purpose of the parameter \(\beta > 0\) is only of a technical nature and is needed in order to show convergence of the iterates of algorithm (18). In practice, however, we did not observe any issues when setting it very small (or even to zero). Its role will become clear in the next theorem.

Proof

Using the choices (17) in Lemma 3 leads us to

$$\begin{aligned} {\mathcal {L}}(x^{n+1},y) - {\mathcal {L}}(x,y^{n+1})&\le \frac{\Vert x - x^n \Vert ^2}{2\tau } - \frac{\Vert x - x^{n+1} \Vert ^2}{2\tau } - \frac{1-\tau L_f}{2\tau } \Vert x^{n+1} - x^n \Vert ^2 \nonumber \\&\quad + \frac{\Vert y - y^n\Vert ^2}{2\sigma } - \frac{\Vert y - y^{n+1} \Vert ^2}{2\sigma } - \frac{\Vert y^{n+1} - y^n \Vert ^2}{2\sigma } \nonumber \\&\quad +\langle K((x^{n+1} - x^n) - (x^n - x^{n-1})),y-y^{n+1} \rangle \nonumber \\&\quad + \left( \Vert e^{n+1}\Vert + \sqrt{(2 \varepsilon _{n+1})/\tau } \right) \Vert x - x^{n+1} \Vert + \varepsilon _{n+1} + \delta _{n+1}. \end{aligned}$$
(20)

The goal of the proof is, as usual, to manipulate this inequality such that we obtain a recursion where most of the terms cancel when summing the inequality. The starting point is an extension of the scalar product on the right hand side:

$$\begin{aligned}&\langle K((x^{n+1} - x^n) - (x^n - x^{n-1})),y-y^{n+1} \rangle \\&\quad = ~\langle K(x^{n+1} - x^n), y - y^{n+1} \rangle - \langle K(x^n - x^{n-1}),y-y^n \rangle \\&\qquad + \langle K(x^n - x^{n-1}), y^{n+1} - y^n \rangle \\&\quad \le ~ \langle K(x^{n+1} - x^n), y - y^{n+1} \rangle - \langle K(x^n - x^{n-1}),y-y^n \rangle \\&\qquad + (\tau \sigma L^2 + \tau \beta L) \frac{\Vert x^n - x^{n-1} \Vert ^2}{2\tau } + \frac{\sigma L}{\sigma L + \beta } \frac{\Vert y^{n+1} - y^n \Vert ^2}{2\sigma } , \end{aligned}$$

where we used (for \(\alpha > 0\)) that by Young’s inequality for every \(x,x' \in {\mathcal {X}}\) and \(y,y' \in {\mathcal {Y}}\),

$$\begin{aligned} \langle K(x-x'),y-y' \rangle \le L \Vert x-x'\Vert \Vert y-y'\Vert \le \frac{L\alpha \tau }{2\tau } \Vert x-x'\Vert ^2 + \frac{L\sigma }{2\alpha \sigma } \Vert y-y'\Vert ^2, \end{aligned}$$
(21)

and \(\alpha = \sigma L + \beta\). This gives

$$\begin{aligned} {\mathcal {L}}(x^{n+1},y) - {\mathcal {L}}(x,y^{n+1})&\le \frac{\Vert x - x^n \Vert ^2}{2\tau } - \frac{\Vert x - x^{n+1} \Vert ^2}{2\tau } \nonumber \\&\quad - \frac{1 - \tau L_f}{2\tau } \Vert x^{n+1} - x^n \Vert ^2 \nonumber \\&\quad + \frac{\tau \sigma L^2 + \tau \beta L}{2\tau } \Vert x^n - x^{n-1} \Vert ^2 + \frac{\Vert y - y^n \Vert ^2}{2\sigma } \nonumber \\&\quad - \frac{\Vert y - y^{n+1} \Vert ^2}{2\sigma } - \frac{\beta }{\sigma L + \beta } \frac{\Vert y^{n+1} - y^n \Vert ^2}{2\sigma } \nonumber \\&\quad + \langle K(x^{n+1} - x^n), y - y^{n+1} \rangle - \langle K(x^n - x^{n-1}),y-y^n \rangle \nonumber \\&\quad + \left( \Vert e^{n+1}\Vert + \sqrt{(2 \varepsilon _{n+1})/\tau } \right) \Vert x - x^{n+1} \Vert + \varepsilon _{n+1} + \delta _{n+1}. \end{aligned}$$
(22)

Now we let \(x^0 = x^{-1}\) and sum (22) from \(n = 0, \dots , N-1\) to obtain

$$\begin{aligned} \sum _{n=1}^N {\mathcal {L}}(x^n,y) - {\mathcal {L}}(x,y^n)&\le \frac{\Vert x - x^0 \Vert ^2}{2\tau } + \frac{\Vert y - y^0 \Vert ^2}{2\sigma } - \frac{\Vert x - x^N \Vert ^2}{2 \tau } - \frac{\Vert y - y^N \Vert ^2}{2 \sigma } \\&\quad - \frac{1 - \tau L_f}{2\tau } \Vert x^N - x^{N-1} \Vert ^2 - \kappa \sum _{n=1}^{N-1} \frac{1}{2\tau } \Vert x^n - x^{n-1} \Vert ^2 \\&\quad - \frac{\beta }{\sigma L + \beta } \sum _{n=1}^N \frac{\Vert y^n - y^{n-1} \Vert ^2}{2\sigma } + \langle K(x^N - x^{N-1}), y - y^N \rangle \\&\quad + \sum _{n=1}^N \left( \Vert e^n\Vert + \sqrt{(2 \varepsilon _n)/\tau } \right) \Vert x - x^n \Vert + \sum _{n=1}^N (\varepsilon _n + \delta _n), \end{aligned}$$

where \(\kappa = 1 - \tau L_f - \tau \sigma L^2 - \tau \beta L\). With Young’s inequality on the inner product with \(\alpha = \frac{1 - \tau L_f}{L \tau }\) such that \(L \alpha \tau = 1 - \tau L_f\) and \((L \sigma )/\alpha = (\tau \sigma L^2)/(1 - \tau L_f) < 1\) we obtain

$$\begin{aligned}&\sum _{n=1}^N {\mathcal {L}}(x^n,y) - {\mathcal {L}}(x,y^n) + \frac{1}{2 \tau } \Vert x - x^N \Vert ^2 + \left( 1 - \frac{\tau \sigma L^2}{1 - \tau L_f}\right) \frac{\Vert y - y^N \Vert ^2}{2 \sigma } \nonumber \\&\quad \le \frac{1}{2\tau } \Vert x - x^0 \Vert ^2 + \frac{1}{2\sigma } \Vert y - y^0 \Vert ^2 + \sum _{n=1}^N \left( \Vert e^n\Vert + \sqrt{(2 \varepsilon _n)/\tau } \right) \Vert x - x^n \Vert \nonumber \\&\qquad + \sum _{n=1}^N ( \varepsilon _n + \delta _n ) - \kappa \sum _{n=1}^{N-1} \frac{1}{2\tau } \Vert x^n - x^{n-1} \Vert ^2 - \frac{\beta }{\sigma L + \beta } \sum _{n=1}^N \frac{1}{2\sigma }\Vert y^n - y^{n-1} \Vert ^2. \end{aligned}$$
(23)

Note that the introduction of the parameter \(\beta > 0\) allowed us to keep an additional term involving the difference of the dual iterates on the right hand side of the inequality. This will allow us to prove the convergence of the iterates later on in Theorem 2. The above inequality can now be used to bound the sum on the left hand side as well as \(\Vert x - x^N \Vert\) by only the initialization \((x^0,y^0)\) and the errors \(e^n\), \(\varepsilon _n\) and \(\delta _n\). We start with the latter and let \((x,y) = (x^\star ,y^\star )\) such that the sum on the left hand side is nonnegative, hence with \(\varDelta _0(x,y) := \Vert x - x^0 \Vert ^2/(2\tau ) + \Vert y - y^0 \Vert ^2/(2\sigma )\) we have

$$\begin{aligned} \frac{1}{2 \tau } \Vert x^\star - x^N \Vert ^2&\le \varDelta _0(x^\star ,y^\star ) + \sum _{n=1}^N \left( \Vert e^n\Vert + \sqrt{(2 \varepsilon _n)/\tau } \right) \Vert x^\star - x^n \Vert \\&\quad + \sum _{n=1}^N (\varepsilon _n + \delta _n), \end{aligned}$$

(note that the remaining terms on the right hand side of (23) are nonpositive). We multiply the inequality by \(2 \tau\) and continue with a technical result by Schmidt et al. [60, p. 12]. Using Lemma 4 with \(u_N = \Vert x^\star - x^N \Vert\), \(S_N = 2\tau \varDelta _0(x^\star ,y^\star ) + 2 \tau \sum _{n=1}^N (\varepsilon _n + \delta _n)\) and \(\lambda _n = 2 (\tau \Vert e^n \Vert + \sqrt{2 \tau \varepsilon _n})\) we obtain a bound on \(\Vert x^\star - x^N \Vert\):

$$\begin{aligned} \Vert x^N - x^\star \Vert \le A_N + \sqrt{ 2\tau \varDelta _0(x^\star ,y^\star ) + 2B_N + A_N^2 }, \end{aligned}$$

where we set \(A_N := \sum _{n=1}^N (\tau \Vert e^n \Vert + \sqrt{2 \tau \varepsilon _n})\) and \(B_N := \sum _{n=1}^N \tau (\varepsilon _n + \delta _n)\). Since \(A_N\) and \(B_N\) are increasing we find for all \(n \le N\):

$$\begin{aligned} \Vert x^n - x^\star \Vert&\le A_n + \sqrt{2\tau \varDelta _0(x^\star ,y^\star ) + 2B_n + A_n^2 } \nonumber \\&\le A_N + \sqrt{ 2\tau \varDelta _0(x^\star ,y^\star ) + 2B_N + A_N^2 } \nonumber \\&\le 2 A_N + \Vert x^0 - x^\star \Vert + \sqrt{\frac{\tau }{\sigma }} \Vert y^0 - y^\star \Vert + \sqrt{2B_N}. \end{aligned}$$
(24)

This finally gives

$$\begin{aligned}&\varDelta _0(x^\star ,y^\star ) + \sum _{n=1}^N \left( \Vert e^n\Vert + \sqrt{(2 \varepsilon _n)/\tau } \right) \Vert x^\star - x^n \Vert + \sum _{n=1}^N ( \varepsilon _n + \delta _n ) \nonumber \\&\quad \le \varDelta _0(x^\star ,y^\star ) + \frac{1}{\tau } B_N + \frac{1}{\tau } A_N \left( 2 A_N + \Vert x^0 - x^\star \Vert + \sqrt{\frac{\tau }{\sigma }} \Vert y^0 - y^\star \Vert + \sqrt{2B_N} \right) \nonumber \\&\quad = \frac{1}{2\tau } \left( \Vert x^0 - x^\star \Vert ^2 + \frac{\tau }{\sigma } \Vert y^0 - y^\star \Vert ^2 + 2B_N + 4 A_N^2 \right. \nonumber \\&\qquad \left. + 2 A_N \Vert x^0 - x^\star \Vert + 2 A_N \sqrt{\frac{\tau }{\sigma }} \Vert y^0 - y^\star \Vert + 2 A_N \sqrt{2 B_N} \right) \nonumber \\&\quad \le \frac{1}{2\tau } \left( \Vert x^0 - x^\star \Vert + \sqrt{\frac{\tau }{\sigma }} \Vert y^0 - y^\star \Vert + 2 A_N + \sqrt{2B_N} \right) ^2, \end{aligned}$$
(25)

and bounds the error terms. We now obtain from (23) that

$$\begin{aligned}&\sum _{n=1}^N {\mathcal {L}}(x^n,y^\star ) - {\mathcal {L}}(x^\star ,y^n) \\&\quad \le \frac{1}{2\tau } \left( \Vert x^0 - x^\star \Vert + \sqrt{\frac{\tau }{\sigma }} \Vert y^0 - y^\star \Vert + 2 A_N + \sqrt{2B_N} \right) ^2, \end{aligned}$$

which gives the assertion using the convexity of f, g and \(h^*\), the definition of the ergodic sequences \(X^N\) and \(Y^N\) and Jensen’s inequality. It remains to note that for a type-2 approximation the square root in \(A_N\) can be dropped, and for a type-3 approximation the \(\varepsilon _n\)-terms in \(B_N\) can be dropped, which gives the different \(A_{N,i}, B_{N,i}\). □

We can immediately deduce the following corollary.

Corollary 1

If \(i \in \{1,2,3\}\), \(\alpha > 0\) and

$$\begin{aligned} \Vert e^n\Vert = O\left( \frac{1}{n^{\alpha + \frac{1}{2}}}\right) , \quad \delta _n = O\left( \frac{1}{n^{2\alpha }}\right) , \quad \varepsilon _n = {\left\{ \begin{array}{ll} O\left( \frac{1}{n^{2\alpha + 1}}\right) , &{} \quad {\text {if}} \quad i \in \{1,3\} \\ O\left( \frac{1}{n^{2\alpha }}\right) , &{} \quad {\text {if}} \quad i = 2. \end{array}\right. } \end{aligned}$$

then

$$\begin{aligned} {\mathcal {L}}(X^N,y^\star ) - {\mathcal {L}}(x^\star ,Y^N) = {\left\{ \begin{array}{ll} O\left( N^{-1}\right) &{}\quad {\text {if}} \quad \alpha > 1/2, \\ O\left( \ln ^2(N)/N\right) &{}\quad {\text {if}} \quad \alpha = 1/2, \\ O\left( N^{-2\alpha }\right) &{}\quad {\text {if}} \quad \alpha \in (0,\frac{1}{2}). \\ \end{array}\right. } \end{aligned}$$

Proof

Under the assumptions of the corollary, if \(\alpha > 1/2\), the sequences \(\{\Vert e^n\Vert \}\), \(\{\varepsilon _n\}\) and \(\{\delta _n\}\) are summable and the error term on the right hand side of (19) is bounded, hence we obtain a convergence rate of O(1/N). If \(\alpha = 1/2\), all errors behave like O(1/n) (note the square root on \(\varepsilon _n\) for \(i \in \{1,3\}\)), hence \(A_{N,i}\) and \(B_{N,i}\) are \(O(\ln (N))\), which gives the second assertion. If \(0< \alpha < 1/2\), then by Lemma 5 we obtain \(A_{N,i}^2 = O(N^{1-2 \alpha })\) and \(B_{N,i} = O(N^{1-2 \alpha })\), which gives the third assertion. □

Before we establish a convergence result from Theorem 1 and Corollary 1, let us comment on these results: in many relevant situations they can be quite weak. Exact versions of such primal–dual algorithms [15, 18] guarantee inequality (19) for all \((x,y) \in {\mathcal {X}}\times {\mathcal {Y}}\), rather than just for a saddle point \((x^\star ,y^\star )\). This allows one (under some additional assumptions) to state a rate both for the primal and/or dual energy and for the primal–dual gap and, for infinite-dimensional \({\mathcal {X}}\) and \({\mathcal {Y}}\), to show that the cluster points of the ergodic averages \((X^N,Y^N)\) are saddle points and hence solutions to our initial problem. Theorem 1, however, due to the necessary bound on the error terms, establishes the desired inequality only for a saddle point \((x^\star ,y^\star )\). It only implies a rate in a more degenerate distance, namely a Bregman distance [12, 52]. This is standard and easily seen by rewriting the left hand side of (19), adding \(\langle Kx^\star ,y^\star \rangle - \langle x^\star , K^*y^\star \rangle\),

$$\begin{aligned} {\mathcal {L}}(X^N,y^\star ) - {\mathcal {L}}(x^\star , Y^N)&= \langle KX^N, y^\star \rangle + f(X^N) + g(X^N) - h^*(y^\star ) \nonumber \\&\quad - (\langle Kx^\star , Y^N \rangle + f(x^\star ) + g(x^\star ) - h^*(Y^N)) \nonumber \\&= f(X^N) + g(X^N) \nonumber \\&\quad - (f(x^\star ) + g(x^\star ) -\langle K^*y^\star ,X^N-x^\star \rangle ) \nonumber \\&\quad + h^*(Y^N) - (h^*(y^\star ) + \langle K x^\star , Y^N - y^\star \rangle ). \end{aligned}$$
(26)

Using

$$\begin{aligned} p = -K^* y^\star \in \partial g(x^\star ) + \nabla f(x^\star ), \quad q = Kx^\star \in \partial h^*(y^\star ), \end{aligned}$$

we obtain that (26) is the sum of two Bregman distances,

$$\begin{aligned} {\mathcal {L}}(X^N,y^\star ) - {\mathcal {L}}(x^\star , Y^N) = D_{f+g}^p (X^N,x^\star ) + D_{h^*}^q (Y^N,y^\star ), \end{aligned}$$

between the (ergodic) iterates \((X^N,Y^N)\) and the saddle point \((x^\star ,y^\star )\). Hence, Corollary 1 states the rate with respect to this distance.

As shown e.g. in [13], a vanishing Bregman distance, i.e.,

$$\begin{aligned} D_{f+g}^p (x,x^\star ) + D_{h^*}^q (y,y^\star ) = 0, \end{aligned}$$
(27)

for some \((x,y) \in {\mathcal {X}}\times {\mathcal {Y}}\), in general does not imply that \(x = x^\star\) or \(y = y^\star\), nor does it imply that the pair \((x,y)\) is even a saddle point. As a matter of fact, without any further assumptions on f, g and \(h^*\), the set of zeros of a Bregman distance can be arbitrarily large, and the left-hand side of the inequality in Corollary 1 could vanish even though we have not found a solution to our problem.

On the other hand, it is easy to show that (27) yields that \((x,y)\) is a saddle point whenever \(f+g\) and \(h^*\) are strictly convex (that is, \(f+g\) strictly convex and h \(C^1\) in the interior of \({{\mathrm {dom}}} h\), with \(\partial h\) empty elsewhere [57, Thm. 26.3]). In that case, obviously, (27) yields \((x,y)=(x^\star ,y^\star )\). Other situations might be tricky. One of the worst cases is a simple matrix game (cf. [17]),

$$\begin{aligned} \min _{x \in \varDelta _l} \max _{y \in \varDelta _k} ~ \langle Ax,y \rangle , \end{aligned}$$

where \(A \in {\mathbb {R}}^{k \times l}\) and \(\varDelta _l,\varDelta _k\) denote the unit simplices in \({\mathbb {R}}^l\) and \({\mathbb {R}}^k\), respectively. Quite obviously, here we have \(f = 0\), \(g = \delta _{\varDelta _l}\) and \(h^* = \delta _{\varDelta _k}\), such that we have to compute the Bregman distances with respect to a characteristic function, which can only be zero or infinity. Hence, every feasible point causes the Bregman distance to vanish, such that a rate in this distance is of no use. However, there is a simple workaround in such cases, whenever the primal (or even the dual) variable is restricted to some bounded set D, such that f and/or g have bounded domain. Note that this is a standard assumption also arising in similar works on the topic (e.g. [48]). As can be seen from the proof, one needs a bound on \(\Vert x^n - x^\star \Vert\) in order to control the errors. In this case one can estimate \(\Vert x^n - x^\star \Vert \le {{\mathrm {diam}}}(D)\), and following the line of the proof [cf. inequality (23)] we obtain for all \((x,y) \in {\mathcal {X}}\times {\mathcal {Y}}\) that

$$\begin{aligned} {\mathcal {L}}(X^N,y) - {\mathcal {L}}(x, Y^N)&\le \frac{1}{N} \bigg (\frac{\Vert x - x^0 \Vert ^2}{2\tau } + \frac{\Vert y - y^0 \Vert ^2}{2\sigma } \\&\quad + \frac{{{\mathrm {diam}}}(D)}{\tau } A_{N,i} + \frac{1}{\tau } B_{N,i} \bigg ). \end{aligned}$$

Eventually, this again allows deducing a rate for the primal–dual gap (e.g., along the lines in [17]).

Remark 2

Even in bad cases there might exist situations where a rate in a Bregman distance is useful. For instance, the basis pursuit problem aims primarily at finding the support of the solution, rather than its quantitative values (which are then recovered easily). As shown in [13], a Bregman distance with respect to the 1-norm can only vanish if the supports of both given arguments agree. Hence, given a vanishing left-hand side in Corollary 1, we might not have found a saddle point, but an element with the same support, so that our problem is solved.

As we have outlined, a rate in a Bregman distance can be difficult to interpret, and it depends on the particular situation whether it is useful or not. However, we can at least show the convergence of the iterates in case \({\mathcal {X}}\) and \({\mathcal {Y}}\) are finite dimensional.

Theorem 2

Let \({\mathcal {X}}\) and \({\mathcal {Y}}\) be finite dimensional and let the sequences \((x^n,y^n)\) and \((X^N,Y^N)\) be defined by Theorem 1. If the partial sums \(A_{N,i}\) and \(B_{N,i}\) in Theorem 1 converge, there exists a saddle point \((x^*,y^*) \in {\mathcal {X}}\times {\mathcal {Y}}\) such that \(x^n \rightarrow x^*\) and \(y^n \rightarrow y^*\).

Proof

Since by assumption the partial sums \(A_{N,i}\) and \(B_{N,i}\) converge, plugging \((x^\star ,y^\star )\) into inequality (23) and using (25) establishes the boundedness of the sequence \((x^n,y^n)\) for all \(n \in {\mathbb {N}}\). Hence there exists a subsequence \((x^{n_k},y^{n_k})\) (strongly) converging to a cluster point \((x^*,y^*)\). Using \((x,y) = (x^\star ,y^\star )\) in (23) and the boundedness of the error terms established in (25) we also find that \(\Vert x^{n-1} - x^n \Vert \rightarrow 0\) and \(\Vert y^{n-1} - y^n \Vert \rightarrow 0\) (note that this is precisely the reason for the introduction of \(\beta\) and the strict inequality in \(\tau L_f + \tau \sigma L^2 + \tau \beta L < 1\)). As a consequence we also have \(\Vert x^{n_k-1} - x^{n_k} \Vert \rightarrow 0\) and

$$\begin{aligned} \Vert x^{n_k-1} - x^* \Vert \le \Vert x^{n_k-1} - x^{n_k} \Vert + \Vert x^{n_k} - x^* \Vert \rightarrow 0, \quad k \rightarrow \infty , \end{aligned}$$

i.e. also \(x^{n_k-1} \rightarrow x^*\). Now let T denote the primal update of the exact algorithm (11), i.e. \({\hat{x}}^{n+1} = T({\hat{x}}^n)\), and let \(T_{\varepsilon _n}\) denote the primal update of the inexact algorithm (12), i.e. \(x^{n+1} = T_{\varepsilon _n}(x^n)\). Then, due to the continuity of T, we obtain

$$\begin{aligned} \Vert x^* - T(x^*) \Vert&= \lim _{k \rightarrow \infty } \Vert x^{n_k-1} - T(x^{n_k-1}) \Vert \\&\le \lim _{k \rightarrow \infty } \left( \Vert x^{n_k-1} - T_{\varepsilon _{n_k}}(x^{n_k-1}) \Vert + \Vert T_{\varepsilon _{n_k}}(x^{n_k-1}) - T(x^{n_k-1}) \Vert \right) \\&\le \lim _{k \rightarrow \infty } \left( \Vert x^{n_k-1} - x^{n_k} \Vert + \sqrt{2 \tau \varepsilon _{n_k}} \right) = 0. \end{aligned}$$

We apply the same argumentation to \(y^n\), which together implies that \((x^*,y^*)\) is a fixed point of the (exact) iteration (11) and hence a saddle point of our original problem (8). We now use \((x,y) = (x^*,y^*)\) in inequality (22) and sum from \(n = n_k, \dots , N-1\) (leaving out nonpositive terms on the right hand side) to obtain for \(N > n_k\)

$$\begin{aligned}&\frac{1}{2\tau } \Vert x^* - x^N \Vert ^2 + \frac{1}{2\sigma } \Vert y^* - y^N \Vert ^2 \\&\quad \le \langle K(x^N - x^{N-1}),y^*-y^N \rangle - \langle K(x^{n_k} - x^{n_k-1}),y^* - y^{n_k} \rangle \\&\qquad + \frac{\tau \sigma L^2 + \tau \beta L}{2\tau } \Vert x^{n_k} - x^{n_k-1} \Vert ^2 + \frac{1}{2\tau } \Vert x^* - x^{n_k} \Vert ^2 + \frac{1}{2\sigma } \Vert y^* - y^{n_k} \Vert ^2 \\&\qquad + \sum _{n = n_k+1}^N \left( \Vert e^n \Vert + \sqrt{(2 \varepsilon _n)/\tau } \right) \Vert x^* - x^n \Vert + \sum _{n = n_k+1}^N (\varepsilon _n + \delta _n). \end{aligned}$$

It remains to notice that, since \(\Vert e^n\Vert \rightarrow 0, \varepsilon _n \rightarrow 0, \delta _n \rightarrow 0\), together with the above observations the right hand side tends to zero for \(k \rightarrow \infty\), which implies that also \(x^N \rightarrow x^*\) and \(y^N \rightarrow y^*\) for \(N \rightarrow \infty\). □

3.2 The convex case: a stronger version

If we restrict ourselves to type-2 approximations, we can state a stronger version for the reduced problem with \(f=0\):

$$\begin{aligned} \min _{x \in {\mathcal {X}}} \max _{y \in {\mathcal {Y}}} ~ {\mathcal {L}}(x,y) = \langle y,Kx \rangle + g(x) - h^*(y), \end{aligned}$$
(28)

again assuming it has at least one saddle point \((x^\star ,y^\star )\). We consider the algorithm

$$\begin{aligned} \begin{aligned} y^{n+1}&\approx _2^{\delta _{n+1}} {{\mathrm {prox}}}_{\sigma h^*} (y^n + \sigma K(2x^n - x^{n-1})), \\ x^{n+1}&\approx _2^{\varepsilon _{n+1}} {{\mathrm {prox}}}_{\tau g} (x^n - \tau K^* y^{n+1}), \end{aligned} \end{aligned}$$
(29)

which is the inexact analog of the basic exact primal–dual algorithm presented in [15]. Simply speaking, the main difference to the previous section is that, choosing a type-2 approximation and \(f = 0\), no errors occur in the inputs of the proximal operators, such that we do not need a bound on \(\Vert x-x^n\Vert\). This allows us to obtain a rate for the objective for all \((x,y) \in {\mathcal {X}}\times {\mathcal {Y}}\) instead of only for a saddle point \((x^\star ,y^\star )\) (cf. Theorem 1). Following the line of proof in [15], we can state the following result:

Theorem 3

Let \(L = \Vert K\Vert\) and \(\tau , \sigma > 0\) such that \(\tau \sigma L^2 < 1\), and let the sequence \((x^n,y^n)\) be defined by algorithm (29). Then for \(X^N := \left( \sum _{n=1}^N x^n \right) /N\), \(Y^N := \left( \sum _{n=1}^N y^n \right) /N\) and every \((x,y) \in {\mathcal {X}}\times {\mathcal {Y}}\) we have

$$\begin{aligned} {\mathcal {L}}(X^N,y) - {\mathcal {L}}(x,Y^N) \le \frac{1}{N} \left( \frac{1}{2\tau } \Vert x-x^0\Vert ^2 + \frac{1}{2\sigma } \Vert y-y^0\Vert ^2 + \sum _{n=1}^N (\varepsilon _n + \delta _n) \right) . \end{aligned}$$
(30)

Furthermore, if \(\varepsilon _n = O\left( n^{-\alpha }\right)\) and \(\delta _n = O\left( n^{-\alpha }\right)\), then

$$\begin{aligned} {\mathcal {L}}(X^N,y) - {\mathcal {L}}(x,Y^N) = {\left\{ \begin{array}{ll} O\left( N^{-1}\right) , &{}\quad {\text {if}} \quad \alpha > 1, \\ O\left( \ln (N)/N\right) , &{}\quad {\text {if}} \quad \alpha = 1, \\ O\left( N^{-\alpha }\right) , &{}\quad {\text {if}} \quad \alpha \in (0,1). \end{array}\right. } \end{aligned}$$

Proof

The proof can be done exactly along the lines of [15, Theorem 1] (or along the proof of Theorem 1), so we just give the main steps. Letting \(f = 0\) and choosing a type-2 approximation gives \(L_f = 0\) and lets us drop the term \(( \Vert e^{n+1}\Vert + \sqrt{(2 \varepsilon _{n+1})/\tau }) \Vert x - x^{n+1} \Vert\) in inequality (20). This is the essential difference, since we do not have to establish a bound on \(\Vert x - x^{n+1} \Vert\). Choosing \(\alpha = \sqrt{\sigma /\tau }\) in Young’s inequality and proceeding as before then gives

$$\begin{aligned}&\sum _{n=1}^N ( {\mathcal {L}}(x^n,y) - {\mathcal {L}}(x,y^n) ) + (1-\tau \sigma L^2) \frac{\Vert y-y^N\Vert ^2}{2\sigma } + \frac{\Vert x-x^N\Vert ^2}{2\tau } \nonumber \\&\qquad + (1- \sqrt{\tau \sigma } L) \sum _{n=1}^N \frac{\Vert y^n - y^{n-1}\Vert ^2}{2\sigma } + (1- \sqrt{\tau \sigma } L) \sum _{n=1}^{N-1} \frac{\Vert x^n - x^{n-1}\Vert ^2}{2\tau } \nonumber \\&\quad \le \frac{1}{2\sigma } \Vert y-y^0\Vert ^2 + \frac{1}{2\tau } \Vert x-x^0\Vert ^2 + \sum _{n=1}^N (\varepsilon _n + \delta _n). \end{aligned}$$
(31)

The definition of the ergodic sequences and Jensen’s inequality yield the assertion. □

As before we can state convergence of the iterates if the errors \(\{\varepsilon _n\}\) and \(\{\delta _n\}\) decay fast enough. The proof is the same as for Theorem 2.

Theorem 4

Let the sequences \((x^n,y^n)\) and \((X^N,Y^N)\) be defined by algorithm (29) and Theorem 3, respectively. If the sequences \(\{\varepsilon _n\}\) and \(\{\delta _n\}\) in Theorem 3 are summable, then every weak cluster point \((x^*,y^*)\) of \((X^N,Y^N)\) is a saddle point of problem (28). Moreover, if the dimension of \({\mathcal {X}}\) and \({\mathcal {Y}}\) is finite, there exists a saddle point \((x^*,y^*) \in {\mathcal {X}}\times {\mathcal {Y}}\) such that \(x^n \rightarrow x^*\) and \(y^n \rightarrow y^*\).

Proof

Since by assumption \(\{\varepsilon _n\}\) and \(\{\delta _n\}\) are summable, plugging \((x^\star ,y^\star )\) into Eq. (31) establishes the boundedness of the sequence \(x^N\) for all \(N \in {\mathbb {N}}\), which also implies the boundedness of the ergodic average \(X^N\). Note that by the same argumentation as for \(x^N\), this also establishes a global bound on \(y^N\) and \(Y^N\). Hence there exists a subsequence \((X^{N_k},Y^{N_k})\) weakly converging to a cluster point \((x^*, y^*)\). Then, since g and \(h^*\) are l.s.c. (thus also weakly l.s.c.), we deduce from Eq. (30) that, for every fixed \((x,y) \in {\mathcal {X}}\times {\mathcal {Y}}\),

$$\begin{aligned} {\mathcal {L}}(x^*,y) - {\mathcal {L}}(x,y^*)&\le \liminf _{k \rightarrow \infty } {\mathcal {L}}(X^{N_k},y) - {\mathcal {L}}(x,Y^{N_k}) \\&\le \liminf _{k \rightarrow \infty } \frac{1}{N_k} \left( \frac{1}{2\tau } \Vert x-x^0\Vert ^2 + \frac{1}{2\sigma } \Vert y-y^0\Vert ^2 + \sum _{n=1}^{N_k} (\varepsilon _n + \delta _n) \right) \\&= 0. \end{aligned}$$

Taking the supremum over \((x,y)\) then implies that \((x^*,y^*)\) is a saddle point itself and establishes the first assertion. The rest of the proof follows analogously to the proof of Theorem 2. □

Remark 3

The main difference between Theorems 3 and 1 is that inequality (30) bounds the left hand side for all \(x,y \in {\mathcal {X}}\times {\mathcal {Y}}\) and not only for a saddle point \((x^\star ,y^\star )\). Following [15, Remark 2] and if \(\{\varepsilon _n\}, \{\delta _n\}\) are summable we can state the same \(O\left( 1/N\right)\) convergence of the primal energy, dual energy or the global primal–dual gap under the additional assumption that h has full domain, \(g^*\) has full domain or both have full domain. More precisely, if e.g. h has full domain, then it is classical that \(h^*\) is superlinear and that the supremum appearing in the conjugate is attained at some \({\tilde{y}}^N \in \partial h(KX^N)\), which is uniformly bounded in N due to (31) (if \((x,y) = (x^\star ,y^\star )\) then \((X^N,Y^N)\) is globally bounded), such that

$$\begin{aligned} \max _{y \in {\mathcal {Y}}} ~ {\mathcal {L}}(X^N,y) = ~ \langle {\tilde{y}}^N, KX^N \rangle -h^*({\tilde{y}}^N) + g(X^N) = h(KX^N) + g(X^N). \end{aligned}$$

Now evoking inequality (30) and proceeding exactly along (10) we can state that

$$\begin{aligned}&h(KX^N) + g(X^N) - [h(Kx^\star ) + g(x^\star )] \\&\quad \le \frac{1}{N} \left( \frac{1}{2\tau } \Vert x^\star -x^0\Vert ^2 + C + \sum _{n=1}^N (\varepsilon _n + \delta _n) \right) , \end{aligned}$$

with a constant C depending on \(x^0\), \(y^0\), h and \(\Vert K\Vert\). Analogously we can establish the convergence rates for the dual problem and also the global gap.

Remark 4

If \(h^*\) has bounded domain, e.g. if h is a norm, we can even state “mixed” rates for the primal energy if the errors are not summable. Since in this case \(\Vert y - y^0 \Vert \le {{\mathrm {diam}}}({{\mathrm {dom}}} h^*)\) we may take the supremum over all \(y \in {{\mathrm {dom}}} h^*\) and obtain

$$\begin{aligned}&h(KX^N) + g(X^N) - [h(Kx^\star ) + g(x^\star )] \\&\quad \le \frac{1}{N} \left( \frac{\Vert x^\star - x^0 \Vert ^2}{2\tau } + \frac{{{\mathrm {diam}}}({{\mathrm {dom}}} h^*)^2}{2 \sigma } + \sum _{n=1}^N (\varepsilon _n + \delta _n) \right) = O\left( N^{-\alpha }\right) , \end{aligned}$$

for \(\varepsilon _n,\delta _n \in O\left( n^{-\alpha }\right)\). The above result in particular holds for the aforementioned TV-\(L^1\) model, which we shall consider in the numerical section.

3.3 The strongly convex case: primal acceleration

We now turn our focus on possible accelerations of the scheme and consider again the full problem (8) with the additional assumption that g is \(\gamma\)-strongly convex, i.e. for any \(x \in {{\mathrm {dom}}} \partial g\)

$$\begin{aligned} g(x') \ge g(x) + \langle p, x' - x \rangle + \frac{\gamma }{2} \Vert x'-x\Vert ^2, \quad \forall p \in \partial g(x), \quad \forall x' \in {\mathcal {X}}. \end{aligned}$$

As g is \(\gamma\)-strongly convex, its conjugate \(g^*\) has \(1/\gamma\)-Lipschitz gradient, so that acceleration is possible. We mention that we obtain the same result if f (or both g and f) are strongly convex, since it is possible to transfer the strong convexity from f to g and vice versa [18, Section 5]. Hence for simplicity we focus on the case where g is strongly convex. Choosing

$$\begin{aligned} (\check{x},\check{y}) = (x^{n+1},y^{n+1}), ({\bar{x}},{\bar{y}}) = (x^n,y^n), ({\tilde{x}},{\tilde{y}}) = (x^n + \theta _n(x^n - x^{n-1}),y^{n+1}), \end{aligned}$$
(32)

in algorithm (12) we define an accelerated inexact primal–dual algorithm:

$$\begin{aligned} \begin{aligned} y^{n+1}&\approx _2^{\delta _{n+1}} {{\mathrm {prox}}}_{\sigma _n h^*} (y^n + \sigma _n K (x^n + \theta _n (x^n - x^{n-1})) \\ x^{n+1}&\approx _i^{\varepsilon _{n+1}} {{\mathrm {prox}}}_{\tau _n g} ( x^n - \tau _n (K^* y^{n+1} + \nabla f(x^n) + e^{n+1})) \\ \theta _{n+1}&= 1 / \sqrt{1 + \gamma \tau _n}, \quad \tau _{n+1} = \theta _{n+1} \tau _n, \quad \sigma _{n+1} = \sigma _n / \theta _{n+1}. \end{aligned} \end{aligned}$$
(33)

We prove the following theorem in “Proof of Theorem 5” of “Appendix”.

Theorem 5

Let\(L = \Vert K\Vert\)and\(\tau _n, \sigma _n, \theta _n\)such that

$$\begin{aligned} \tau _n L_f + \tau _n \sigma _n L^2 \le 1, \qquad \theta _{n+1} \sigma _{n+1} = \sigma _n, \qquad (1 + \gamma \tau _n) \tau _{n+1} \theta _{n+1} \ge \tau _n. \end{aligned}$$

Let\((x^n,y^n) \in {\mathcal {X}}\times {\mathcal {Y}}\)be defined by the above algorithmfor\(i \in \{1,2,3\}\). Then for any saddle point\((x^\star , y^\star ) \in {\mathcal {X}}\times {\mathcal {Y}}\) of (8) and

$$\begin{aligned} T_N := \sum _{n=1}^N \frac{\sigma _{n-1}}{\sigma _0}, \qquad X^N := \frac{1}{T_N} \sum _{n=1}^N \frac{\sigma _{n-1}}{\sigma _0} x^n, \qquad Y^N := \frac{1}{T_N} \sum _{n=1}^N \frac{\sigma _{n-1}}{\sigma _0} y^n, \end{aligned}$$

we have that

$$\begin{aligned} T_N (&{\mathcal {L}}(X^N,y^\star ) - {\mathcal {L}}(x^\star , Y^N)) \\&\le \frac{1}{2 \sigma _0} \left( \sqrt{\frac{\sigma _0}{\tau _0}} \Vert x^\star - x^0 \Vert + \Vert y^\star - y^0 \Vert + \sqrt{2 B_{N,i}} + 2 \sqrt{\frac{\tau _N}{\sigma _N}} A_{N,i} \right) ^2, \end{aligned}$$

and

$$\begin{aligned}&\frac{\sigma _N}{2\tau _N} \Vert x^\star - x^N \Vert ^2 \\&\quad \le \frac{1}{2} \left( \sqrt{\frac{\sigma _0}{\tau _0}} \Vert x^\star - x^0 \Vert + \Vert y^\star - y^0 \Vert + \sqrt{2 B_{N,i}} + 2 \sqrt{\frac{\tau _N}{\sigma _N}} A_{N,i} \right) ^2, \end{aligned}$$

where

$$\begin{aligned} A_ {N,1}&= \sum _{n=1}^N \sigma _{n-1} \Vert e^n\Vert + \sqrt{\frac{2 \sigma _{n-1}^2 \varepsilon _n}{\tau _{n-1}}},&\qquad B_{N,1} = 2 \sum _{n=1}^N \sigma _{n-1} (\varepsilon _n + \delta _n), \\ A_{N,2}&= \sum _{n=1}^N \sigma _{n-1} \Vert e^n\Vert ,&\qquad B_{N,2} = 2 \sum _{n=1}^N \sigma _{n-1} (\varepsilon _n + \delta _n), \\ A_{N,3}&= \sum _{n=1}^N \sigma _{n-1} \Vert e^n\Vert + \sqrt{\frac{2 \sigma _{n-1}^2 \varepsilon _n}{\tau _{n-1}}},&\qquad B_{N,3} = 2 \sum _{n=1}^N \sigma _{n-1} \delta _n. \end{aligned}$$

As a direct consequence of Theorem 5 we can state convergence rates of the accelerated algorithm (33) in dependence on the errors \(\{\Vert e^n\Vert \}, \{\delta _n\}\) and \(\{\varepsilon _n\}\).

Corollary 2

Let\(\tau _0 = 1/(2L_f)\), \(\sigma _0 = L_f/L^2\)and\(\tau _n,\sigma _n\)and\(\theta _n\)be given by (33). Let\(\alpha > 0\), \(i \in \{1,2,3\}\)and

$$\begin{aligned} \Vert e^n\Vert = O\left( \frac{1}{n^\alpha }\right) , \qquad \delta _n = O\left( \frac{1}{n^{2\alpha }}\right) , \qquad \varepsilon _n = {\left\{ \begin{array}{ll} O\left( \frac{1}{n^{1+2\alpha }}\right) , &{}\quad {\text {if }} i \in \{1,3\} \\ O\left( \frac{1}{n^{2\alpha }}\right) , &{}\quad {\text {if }} i =2. \end{array}\right. } \end{aligned}$$

Then

$$\begin{aligned} {\mathcal {L}}(X^N,y^\star ) - {\mathcal {L}}(x^\star ,Y^N) = {\left\{ \begin{array}{ll} O\left( N^{-2}\right) &{}\quad {\text {if}} \quad \alpha > 1, \\ O\left( \ln ^2(N)/N^2\right) &{}\quad {\text {if}} \quad \alpha = 1, \\ O\left( N^{-2\alpha }\right) &{}\quad {\text {if}} \quad \alpha \in (0,1). \\ \end{array}\right. } \end{aligned}$$

Proof

In [15] it has been shown that with this choice we have \(\tau _n \sim 2/(n\gamma )\). Since the product \(\tau _n \sigma _n = \tau _0 \sigma _0 = 1/(2L^2)\) stays constant over the course of the iterations, this implies that \(\sigma _n \sim (n\gamma )/(4L^2)\), from which we directly deduce that \(T_N \sim (\gamma N^2)/(8L_f)\), hence \(T_N = O\left( N^2\right)\). Moreover we find that \(\sqrt{\tau _N/\sigma _N} \sim (\sqrt{8}L)/(\gamma N)\). Now let \(i = 1\) and \(\alpha \in (0,1)\), then we have

$$\begin{aligned} A_{N,1} = \sum _{n=1}^N \sigma _{n-1} \Vert e^n\Vert + \sqrt{\frac{2 \sigma _{n-1}^2 \varepsilon _n}{\tau _{n-1}}} \sim \sum _{n=1}^N \frac{(n-1)\gamma }{4L^2} \Vert e^n\Vert + \sqrt{\frac{2\gamma ^3((n-1)^3 \varepsilon _n)}{32L^4}} \end{aligned}$$

Now by assumption \(\Vert e^n\Vert = O\left( n^{-\alpha }\right)\) and \(\varepsilon _n = O\left( n^{-(1+2\alpha )}\right)\) which implies that \(A_{N,1} = O\left( N^{2-\alpha }\right)\). By analogous reasoning we find \(B_{N,1} = O\left( N^{2-2\alpha }\right)\). Summing up we obtain that

$$\begin{aligned} \frac{\tau _N}{\sigma _N} \frac{A_{N,1}^2}{T_N} = O\left( N^{-2\alpha }\right) , \qquad {\text {and}} \qquad \frac{B_{N,1}}{T_N} = O\left( N^{-2\alpha }\right) , \end{aligned}$$

yielding the last row of the assertion. For \(\alpha = 1\) we see that \(\sqrt{\tau _N/\sigma _N} A_{N,1}\) is finite and \(B_{N,1} = O\left( \log (N)\right)\), for \(\alpha > 1\) also \(B_{N,1}\) is summable, implying the other two rates. It remains to notice that the cases \(i \in \{2,3\}\) can be obtained as special cases. □

3.4 The strongly convex case: dual acceleration

This section is devoted to the comparison of inexact primal–dual algorithms and inexact forward–backward splittings established in [5, 60, 61], considering the problem

$$\begin{aligned} \min _{x \in {\mathcal {X}}} ~ h(Kx) + g(x), \end{aligned}$$
(34)

with h having a \(1/\gamma\)-Lipschitz gradient and proximable g. The above mentioned works establish convergence rates for an inexact forward–backward splitting on this problem, where both the computation of the proximal operator with respect to g and the gradient of h might contain errors ([61] only considers errors in the proximum).

The corresponding primal–dual formulation of problem (34) reads

$$\begin{aligned} \min _{x \in {\mathcal {X}}} \max _{y \in {\mathcal {Y}}} ~ {\mathcal {L}}(x,y) = \langle Kx, y \rangle + g(x) - h^*(y), \end{aligned}$$
(35)

where now \(h^*\) is \(\gamma\)-strongly convex. Hence we know that the algorithm can be accelerated “à la” [15, 18] or as in the previous section, and we shall be able to essentially recover the results on inexact forward–backward splittings/inexact FISTA obtained by [5, 60, 61]. Choosing (note \(f = 0\) and \(e = 0\))

$$\begin{aligned} (\check{x},\check{y}) = (x^{n+1},y^{n+1}), ({\bar{x}},{\bar{y}}) = (x^n,y^n), ({\tilde{x}},{\tilde{y}}) = (x^n + \theta _n(x^n - x^{n-1}),y^{n+1}), \end{aligned}$$
(36)

in algorithm (12) we define an accelerated inexact primal–dual algorithm:

$$\begin{aligned} y^{n+1}&\approx _2^{\delta _{n+1}} {{\mathrm {prox}}}_{\sigma _n h^*} (y^n + \sigma _n K (x^n + \theta _n (x^n - x^{n-1})) \\ x^{n+1}&\approx _i^{\varepsilon _{n+1}} {{\mathrm {prox}}}_{\tau _n g} ( x^n - \tau _n K^* y^{n+1} ) \\ \theta _{n+1}&= 1 / \sqrt{1 + 2\gamma \sigma _n}, \quad \sigma _{n+1} = \theta _{n+1} \sigma _n, \quad \tau _{n+1} = \tau _n / \theta _{n+1}. \end{aligned}$$

We prove the following theorem in “Proof of Theorem 6” of "Appendix".

Theorem 6

Let\(L = \Vert K\Vert\)and\(\tau _n, \sigma _n, \theta _n\)such that

$$\begin{aligned} \tau _n \sigma _n \theta _n^2 L^2 \le 1, \qquad \theta _{n+1} \tau _{n+1} = \tau _n, \qquad (1 + \gamma \sigma _n) \sigma _{n+1} \theta _{n+1} \ge \sigma _n. \end{aligned}$$

Let\((x^n,y^n) \in {\mathcal {X}}\times {\mathcal {Y}}\)be defined by the above algorithmfor\(i \in \{1,2,3\}\). Then for a saddle point \((x^\star , y^\star ) \in {\mathcal {X}}\times {\mathcal {Y}}\)and

$$\begin{aligned} T_N := \sum _{n=1}^N \frac{\tau _{n-1}}{\tau _0}, \qquad X^N := \frac{1}{T_N} \sum _{n=1}^N \frac{\tau _{n-1}}{\tau _0} x^n, \qquad Y^N := \frac{1}{T_N} \sum _{n=1}^N \frac{\tau _{n-1}}{\tau _0} y^n, \end{aligned}$$

we have that

$$\begin{aligned}&T_N ( {\mathcal {L}}(X^N,y^\star ) - {\mathcal {L}}(x^\star , Y^N)) \\&\quad \le \frac{1}{2 \tau _0} \left( \Vert x^\star - x^0 \Vert + \sqrt{\frac{\tau _0}{\sigma _0}} \Vert y^\star - y^0 \Vert + \sqrt{2 B_{N,i}} + 2A_{N,i} \right) ^2, \end{aligned}$$

and

$$\begin{aligned} C_N \frac{\tau _N}{\sigma _N} \Vert y^\star - y^N \Vert ^2&\le \left( \Vert x^\star - x^0 \Vert + \sqrt{\frac{\tau _0}{\sigma _0}} \Vert y^\star - y^0 \Vert + \sqrt{2 B_{N,i}} + 2A_{N,i} \right) ^2, \end{aligned}$$

where\(C_N = 1 - \sigma _N \tau _N \theta _N^2 L^2\)and

$$\begin{aligned} A_{N,1}&= \sum _{n=1}^N \sqrt{2 \tau _{n-1} \varepsilon _n},&\quad B_{N,1} = 2 \sum _{n=1}^N \tau _{n-1} (\varepsilon _n + \delta _n), \\ A_{N,2}&= 0,&\quad B_{N,2} = 2 \sum _{n=1}^N \tau _{n-1} (\varepsilon _n + \delta _n), \\ \quad A_{N,3}&= \sum _{n=1}^N \sqrt{2 \tau _{n-1} \varepsilon _n},&\quad B_{N,3} = 2 \sum _{n=1}^N \tau _{n-1} \delta _n. \end{aligned}$$

We can once more establish convergence rates depending on the decay of the errors.

Corollary 3

Let\(\tau _0,\sigma _0\)such that\(\tau _0 \sigma _0 L^2 = 1\). Let\(\alpha > 0\), \(i \in \{1,2,3\}\)and

$$\begin{aligned} \delta _n = O\left( \frac{1}{n^{2\alpha }}\right) , \qquad \varepsilon _n = {\left\{ \begin{array}{ll} O\left( \frac{1}{n^{1+2\alpha }}\right) , &{}\quad {\text {if }} i \in \{1,3\} \\ O\left( \frac{1}{n^{2\alpha }}\right) , &{}\quad {\text {if }} i =2. \end{array}\right. } \end{aligned}$$

Then

$$\begin{aligned} {\mathcal {L}}(X^N,y^\star ) - {\mathcal {L}}(x^\star ,Y^N) = {\left\{ \begin{array}{ll} O\left( N^{-2}\right) &{}\quad {\text {if}} \quad \alpha > 1, \\ O\left( \ln ^2(N)/N^2\right) &{}\quad {\text {if}} \quad \alpha = 1, \\ O\left( N^{-2\alpha }\right) &{}\quad {\text {if}} \quad \alpha \in (0,1). \\ \end{array}\right. } \end{aligned}$$

Proof

We refer to [15] for a proof that using the step sizes in Theorem 6, it can be shown that \(\sigma _n \sim 1/(n\gamma )\) and accordingly \(\tau _n \sim (n\gamma )/L^2\). This directly implies that \(T_N \sim (\gamma N^2)/(2L)\). Now for \(i=1\) and \(\alpha \in (0,1)\) we have that

$$\begin{aligned} A_{N,1} = \sum _{n=1}^N \sqrt{2 \tau _{n-1} \varepsilon _n} \sim \sum _{n=1}^N \sqrt{\frac{2 \gamma (n-1) \varepsilon _n}{L^2}} = \frac{\sqrt{2 \gamma }}{L}\sum _{n=1}^N \sqrt{(n-1) \varepsilon _n}. \end{aligned}$$

Now by assumption \(\varepsilon _n = O\left( n^{-1-2\alpha }\right)\), which implies that \(\sqrt{(n-1)\varepsilon _n} = O\left( n^{-\alpha }\right)\) and we deduce \(A_{N,1} = O\left( N^{1-\alpha }\right)\) using Lemma 5. By an analogous argumentation

$$\begin{aligned} B_{N,1} = 2 \sum _{n=1}^N \tau _{n-1} (\varepsilon _n + \delta _n) \sim \frac{2 \gamma }{L^2} \sum _{n=1}^N (n-1) (\varepsilon _n + \delta _n). \end{aligned}$$

Now since \(\delta _n = O\left( n^{-2 \alpha }\right)\) we deduce that \(n \delta _n = O\left( n^{1-2\alpha }\right)\) and hence \(B_{N,1} = O\left( N^{2-2 \alpha }\right)\). Using \(T_N = O\left( N^{2}\right)\), we find

$$\begin{aligned} \frac{B_{N,1}}{T_N} = O\left( N^{-2\alpha }\right) , \quad {\text {and}} \quad \frac{A_{N,1}^2}{T_N} = O\left( N^{-2\alpha }\right) , \end{aligned}$$

which gives the result for \(i=1\) and \(\alpha \in (0,1)\). Choosing \(\alpha > 1\) will yield convergence for \(A_{N,1}\) and \(B_{N,1}\), which implies the fastest overall convergence rate \(O\left( 1/N^2\right)\), the case \(\alpha = 1\) gives \(A_{N,1} = O\left( \log (N)\right)\) and \(B_{N,1} = O\left( \log (N)\right)\). It remains to note that the results for \(i=2,3\) can be obtained as special cases. □

Corollary 3 essentially recovers the results given in [5, 60, 61], though the comparison is not exactly straightforward. For an optimal \(O\left( N^{-2}\right)\) convergence in objective with a type-1 approximation the authors of [60] require \(\varepsilon _n = O\left( 1/n^{4+\kappa }\right)\) for any \(\kappa > 0\), for the error \(d_n\) in the gradient of \(h \circ K\) they need \(\Vert d_n\Vert = O\left( 1/n^{4+\kappa }\right)\). Since a type-2 approximation of the proximum is more demanding, the authors of [61] obtain a weaker dependence of the convergence on the error and only require \(\varepsilon _n = O\left( n^{3+\kappa }\right)\). Note that they only consider the case \(d_n = 0\). The work in [5] essentially refines both results under the same assumptions on the errors. Corollary 3 now states that for an optimal \(O\left( N^{-2}\right)\) convergence we require \(\varepsilon _n = O\left( n^{3 + \kappa }\right)\) in case of a type-1 approximation and \(\varepsilon _n = O\left( n^{2 + \kappa }\right)\) in case of an error of type-2, which seems to be one order less than the other results. We do not have a precise mathematical explanation at this point. The main difference appears to be the changing step sizes \(\tau _n, \sigma _n\) in the proximal operators for the inexact primal–dual algorithm in Theorem 6, which behave like n respectively 1/n, while the step sizes remain fixed for inexact forward–backward. The numerical section, however, indeed confirms the weaker dependence of the inexact primal–dual algorithm on the errors.

Remark 5

We want to highlight that, in the spirit of Sect. 3.2 it is as well possible to state a stronger version in case the approximations are of type-2 in both the primal and dual proximal point, which then bounds the “gap” for all \((x,y) \in {\mathcal {X}}\times {\mathcal {Y}}\) instead of for a saddle point \((x^\star ,y^\star )\) in Theorem 6 [cf. inequality (54)]:

$$\begin{aligned}&{\mathcal {L}}(X^N,y) - {\mathcal {L}}(x,Y^N) \\&\quad \le \frac{1}{T_N} \left( \frac{1}{2\tau _0} \Vert x - x^0 \Vert ^2 + \frac{1}{2\sigma _0} \Vert y - y^0 \Vert ^2 + \sum _{n=1}^N \frac{\tau _{n-1}}{\tau _0} (\varepsilon _n + \delta _n) \right) . \end{aligned}$$

Under some additional assumptions we can then again derive estimates on the primal energy for every fixed \(N \in {\mathbb {N}}\). If again h has full domain, the supremum appearing in the conjugate is attained at some \({\tilde{y}}^N\) and exactly along (10) we derive

$$\begin{aligned}&h(KX^N) + g(X^N) + f(X^N) - \left[ h(Kx^\star ) + g(x^\star ) + f(x^\star ) \right] \\&\quad \le \frac{1}{T_N} \left( \frac{1}{2\tau _0} \Vert x^\star - x^0 \Vert ^2 + \frac{1}{2\sigma _0} \Vert {\tilde{y}}^N - y^0 \Vert ^2 + \sum _{n=1}^N \frac{\tau _{n-1}}{\tau _0} (\varepsilon _n + \delta _n) \right) . \end{aligned}$$

In case the errors are summable we again obtain that also \({\tilde{y}}^N\) is globally bounded (cf. Remark 3) and we obtain convergence in \(O\left( 1/N^2\right)\). If the errors are not summable there is no similar argument to obtain the global boundedness of the \({\tilde{y}}^N\), however at least on a heuristic level one can expect a convergence to \(y^*\) at a similar rate as \(X^N\). This is indeed confirmed in the numerical section where we observe the \(O\left( N^{-2\alpha }\right)\) decay from Corollary 3 also for the primal objective for nonsummable errors.

3.5 The smooth case

We finally discuss an accelerated primal–dual algorithm if both g and \(h^*\) are \(\gamma\)- respectively \(\mu\)-strongly convex. In this setting the primal objective is both smooth and strongly convex, and first-order algorithms can be accelerated to linear convergence. We consider the algorithm

$$\begin{aligned} \begin{aligned} y^{n+1}&\approx _2^{\delta _{n+1}} {{\mathrm {prox}}}_{\sigma h^*} (y^n + \sigma K (x^n + \theta (x^n - x^{n-1})) \\ x^{n+1}&\approx _i^{\varepsilon _{n+1}} {{\mathrm {prox}}}_{\tau g} ( x^n - \tau (K^* y^{n+1} + \nabla f(x^n) + e^{n+1})), \\ \frac{1}{\theta }&= 1+\gamma \tau = 1 + \mu \sigma , \quad \tau L_f + \tau \sigma \theta ^2 L^2 \le 1, \end{aligned} \end{aligned}$$
(37)

and prove the following result in “Proof of Theorem 7” of “Appendix

Theorem 7

Let\(L = \Vert K\Vert\)and\(\tau ,\sigma , \theta\)such that

$$\begin{aligned} 1+\gamma \tau = 1 + \mu \sigma = \frac{1}{\theta } \quad {\text {and}} \quad \tau L_f + \tau \sigma \theta ^2 L^2 \le 1. \end{aligned}$$
(38)

Let\((x^n,y^n) \in {\mathcal {X}}\times {\mathcal {Y}}\)be defined by algorithm (37) for\(i \in \{1,2,3\}\). Then for the uniquesaddle-point\((x^\star ,y^\star )\)and

$$\begin{aligned} T_N := \sum _{n=1}^N \frac{1}{\theta ^{n-1}} , \quad X^N := \frac{1}{T_N} \sum _{n=1}^N \frac{1}{\theta ^{n-1}} x^n, \quad Y^N := \frac{1}{T_N} \sum _{n=1}^N \frac{1}{\theta ^{n-1}} y^n \end{aligned}$$

we have

$$\begin{aligned}&T_N ({\mathcal {L}}(X^N,y^\star ) - {\mathcal {L}}(x^\star ,Y^N)) \\&\quad \le \frac{1}{2\tau } \left( \Vert x^\star -x^0\Vert + \sqrt{\frac{\tau }{\sigma }} \Vert y^\star -y^0\Vert + 2 \theta ^{\frac{N}{2}} A_{N,i} + \sqrt{2 B_{N,i}} \right) ^2 \end{aligned}$$

and

$$\begin{aligned} \Vert x^\star -x^N\Vert ^2 \le \theta ^N \left( \Vert x^\star -x^0\Vert + \sqrt{\frac{\tau }{\sigma }} \Vert y^\star -y^0\Vert + 2 \theta ^{\frac{N}{2}} A_N + \sqrt{2 B_N} \right) ^2 \end{aligned}$$

where

$$\begin{aligned} A_{N,1}&= \sum _{n=1}^N \frac{1}{\theta ^{n-1}} (\tau \Vert e^n\Vert + \sqrt{2\tau \varepsilon _n}),&\quad B_{N,1} = \sum _{n=1}^N \frac{\tau }{\theta ^{n-1}} (\varepsilon _n + \delta _n), \\ A_{N,2}&= \sum _{n=1}^N \frac{\tau \Vert e^n\Vert }{\theta ^{n-1}},&\quad B_{N,2} = \sum _{n=1}^N \frac{\tau }{\theta ^{n-1}} (\varepsilon _n + \delta _n), \\ A_{{N,3}}&= \sum _{n=1}^N \frac{1}{\theta ^{n-1}} (\tau \Vert e^n\Vert + \sqrt{2\tau \varepsilon _n}),&\quad B_{N,3} = \sum _{n=1}^N \frac{\tau }{\theta ^{n-1}} \delta _n. \end{aligned}$$

We can now state convergence rates, if the decay of the errors is also geometric.

Corollary 4

Let\(\alpha > 0\), \(i \in \{1,2,3\}\)and for\(0<q<1\)

$$\begin{aligned} \Vert e^n\Vert = O\left( \sqrt{q}^n\right) , \qquad \delta _n = O\left( q^n\right) , \qquad \varepsilon _n = O\left( q^n\right) . \end{aligned}$$

Then

$$\begin{aligned} {\mathcal {L}}(X^N,y^\star ) - {\mathcal {L}}(x^\star ,Y^N) + \frac{\Vert x^\star -x^N\Vert ^2}{2\tau } = {\left\{ \begin{array}{ll} O\left( \theta ^N\right) , &{}\quad {\text { if }} \theta > q, \\ O\left( N\theta ^N\right) , &{}\quad {\text { if }} \theta = q, \\ O\left( q^N\right) , &{}\quad {\text { if }} \theta < q. \end{array}\right. } \end{aligned}$$

Proof

It is clear that we need to investigate the decay of the term

$$\begin{aligned} C_{N,i} := \theta ^{2N} A_{N,i}^2 + \theta ^N B_{N,i} \end{aligned}$$

to obtain a convergence rate. In view of the specific form of \(A_{N,i}\) and \(B_{N,i}\) and the rate of \(\varepsilon _n\), \(\delta _n\) and \(\Vert e^n\Vert\) we consider

$$\begin{aligned} \theta ^N \sum _{n=1}^N \frac{q^{n-1}}{\theta ^{n-1}} = \theta ^N \sum _{n=0}^{N-1} \left( \frac{q}{\theta } \right) ^n = (\theta ^N - q^N) \frac{\theta }{\theta -q}. \end{aligned}$$
(39)

Now if \(q<\theta <1\), Eq. (39) implies that \(\theta ^N B_{N,i} = O\left( \theta ^N\right)\). For \(A_{N,i}\) we note the factor \(\theta ^{N}\) is squared, as opposed to the factor of \(B_{N,i}\), which implies that the decay of \(\Vert e^n\Vert\) and \(\sqrt{\varepsilon _n}\) can be less restrictive for \(A_{N,i}\) and explains the square root on the constant q for \(\Vert e^n\Vert\). We have to distinguish whether \(\sqrt{q} < \theta\) or \(\sqrt{q} > \theta\). In the former case we have by Eq. (39), now with \(\sqrt{q}\) instead of q, that

$$\begin{aligned} \theta ^{2N} A_{N,i}^2 = (\theta ^N A_{N,i})^2 = O\left( (\theta ^N - \sqrt{q}^N)^2\right) = O\left( \theta ^{2N}\right) = O\left( \theta ^N\right) , \end{aligned}$$

while in the latter we obtain \(\theta ^{2N} A_{N,i}^2 = O\left( q^N\right) = O\left( \theta ^N\right)\), which in sum gives \(C_{N,i} = O\left( \theta ^N\right)\). If \(\theta< q<1\), we have by analogous argumentation and (39) that \(\theta ^N B_{N,i} = O\left( q^N\right)\) and since \(\theta< q< \sqrt{q} < 1\) also \(\theta ^{2N} A_{N,i}^2 = O\left( q^N\right)\), which implies \(C_{N,i} = O\left( q^N\right)\). For the case \(\theta = q\) it is sufficient to notice that (39) is in \(O\left( N\theta ^N\right)\). □

It remains to give some explicit formulation of the step sizes that fulfill the conditions (38). Solving (38) for \(\tau ,\sigma\) and \(\theta\) gives [18]

$$\begin{aligned} \tau&= \frac{1 + \sqrt{1 + 4L^2/(\gamma \mu ) + L_f^2 / \gamma ^2 + 2L_f / \gamma } - L_f/\gamma }{2L_f + 2L^2/\mu }, \\ \sigma&= \frac{1 + \sqrt{1 + 4L^2/(\gamma \mu ) + L_f^2 / \gamma ^2 + 2L_f / \gamma } - L_f/\gamma }{2 L_f \mu / \gamma + 2 L^2 / \gamma }, \\ \theta&= 1 - \frac{\sqrt{1 + 4L^2/(\gamma \mu ) + L_f^2/\gamma ^2 + 2 L_f / \gamma } - L_f/\gamma - 1}{2L^2/(\gamma \mu )}. \end{aligned}$$

4 Numerical experiments

There exists a large variety of interesting optimization problems, e.g. in imaging, that could be investigated in the context of inexact primal–dual algorithms, and even creating numerical examples for all the discussed notions of inexact proxima and different versions of algorithms clearly goes beyond the scope of this paper. Instead, we want to discuss two different questions on two classical imaging problems and leave further studies to the interested reader. The main goal of this section is to confirm numerically, that the convergence rates we proved above are “sharp” in some sense, meaning that if the errors are close to the upper bounds we obtain the convergence rates predicted by the theory. The second point we want to address is whether one can actually benefit from the theory and employ different splitting strategies in order to obtain nested algorithms, which can then only be solved in an inexact fashion (cf. [61]).

We investigate both questions using problems of the form

$$\begin{aligned} \min _{x \in X} ~ h(K_1x) + g(K_2x) = \min _{x \in X} \max _{y \in Y} ~ \langle y,K_1 x\rangle + g(K_2x) - h^*(y), \end{aligned}$$
(40)

\(K_1 :X \rightarrow Y\), \(K_2 :X \rightarrow Z\), where we assume that the proximal operators of both g and \(h^*\) (or \(g^*\) and h by Moreau’s decomposition) have an exact closed form solution. The right hand side of (40) leads to a nested inexact primal–dual algorithm

$$\begin{aligned} y^{n+1}&= {{\mathrm {prox}}}_{\sigma h^*}(y^n + \sigma K_1(x^{n+1} + \theta (x^{n+1} - x^n))), \nonumber \\ x^{n+1}&\approx _2^{\varepsilon _{n+1}} {{\mathrm {prox}}}_{\tau (g \circ K_2)}(x^n - \tau K_1^*y^{n+1}). \end{aligned}$$
(41)

Hence the dual proximal operator can be evaluated exactly (i.e. \(\delta _n = 0\)), while the inner subproblem has to be computed in an inner loop up to the necessary precision \(\varepsilon _n\). We choose the type-2 approximation since in this case, according to Proposition 1, the precision of the proximum can be assessed by means of the duality gap. In order to be able to evaluate the gap, we solve the \(1/\tau\)-strongly convex dual problem

$$\begin{aligned} \min _{z \in Z} ~ \frac{\tau }{2} \Vert K_2^*z\Vert ^2 - \langle K_2^* z, y^{n+1} \rangle + g^*(z), \end{aligned}$$

using FISTA [8]. To distinguish between outer and inner problems for the splittings we denote the iteration number for the outer problem by n, while the iteration number of the inner problem is k. In order to achieve the necessary precision, we iterate the proximal problem until the primal–dual gap (cf. also Sect. 2) satisfies

$$\begin{aligned} {\mathcal {G}}(y^{n+1}-\tau B^*z^k,z^k) \le C\varepsilon _n, \end{aligned}$$
(42)

where \(\varepsilon _n = O\left( 1/n^\alpha \right)\), respectively \(\varepsilon _n = O\left( \theta ^n\right)\) for the last experiment. We vary the parameter \(\alpha\) in order to show the effect of the error decay rate on the algorithm (cf. Remark 4). While for the asymptotic results we proved in the previous section the constant C of the rate does not matter, it indeed does in practice. In order to use Proposition 1 as a criterion, C should correspond to the “natural” size of the duality gap of (41). In order not to choose the constraint too restrictive but still active we follow [61] and choose \(C = {\mathcal {G}}(y^0-\tau B^*y^0,0)\), which is the duality gap of the first proximal subproblem for \(n = 1\) evaluated at \(z = 0\).

For the sake of brevity we discuss only three problems: we start with the non-differentiable TV-\(L^1\) model for deblurring, a problem which cannot be accelerated, and continue with “standard” TV-\(L^2\) deblurring, which also serves as a prototype for a whole list of applications with a general operator instead of a blurring kernel (cf. e.g. [30, 59]). Since in this case the objective is Lipschitz-differentiable, the convex conjugate is strongly convex, which allows to accelerate the algorithm. The third problem we investigate is a “smoothed” version of the TV-\(L^2\) model, which can be accelerated to linear convergence.

We investigate two different setups: as already announced above, we want to confirm the convergence rates predicted by the theory numerically. We hence require the inexact proximal problem (41) to be solved with an error close to the accuracy level \(\varepsilon _n\). To achieve this we, where it is necessary, deliberately solve the inner problem suboptimally, meaning that we use a cold start (random initialization of the algorithm) and reduced step sizes for the inner problem, ensuring that the inner problem is not solved “accidentially” at a higher precision. We shall see that this is indeed necessary for the slow TV-\(L^1\) problem. In a second setup we investigate whether the obtained error bounds can also be used as a criterion to ensure (optimal) convergence of the nested algorithm (41). As observed in e.g. [7] for the TV-\(L^2\) model and the FISTA algorithm, insufficient precision of the inner proximum can cause the algorithm to diverge. Instead of performing a fixed high number of inner iterations as a remedy, we solve the inner problem only up to precision \(\varepsilon _n\) in every step, which by the theory ensures that the algorithm converges with the same rate as the decay of the errors. We then use the best possible step sizes and a warm start strategy (initialization by the previous solution) in order to minimize the computational costs of the inner loop. It has already been observed in [61], that such strategy may significantly speeds up the process. We use a standard primal–dual reconstruction (PDHG) after \(10^5\) iterations as a numerical “ground truth” \(u^*\) to compute the optimal energy \(F^* = F(u^*)\).

Fig. 1
figure 1

Inexact primal–dual on the TV-\(L^1\) problem. a, b loglog plots of the relative objective error versus the outer iteration number for different decay rates \(\alpha\) of the errors. a Cold start, error close to the bound \(O\left( 1/n^\alpha \right)\), b warm start. c, d Number of inner iterations respectively sum of inner iterations versus number of outer iterations for different decay rates \(\alpha\). One can observe in (a) that the predicted rates in the worst case are attained, while in practice the problem also converges for very few inner iterations (b), (c) and (d)

4.1 Nondifferentiable deblurring with the TV-\(L^1\) model

In this section we study the numerical solution of the TV-\(L^1\) model

$$\begin{aligned} u^* \in \arg \min _{u \in X} ~ F(u) = \Vert Au - f \Vert _1 + \lambda \Vert \nabla u \Vert _1, \end{aligned}$$
(43)

with a discrete blurring operator \(A :X \rightarrow X\). As already lined out in the introduction, there exist a variety of methods to solve the problem (e.g. [19, 28, 35, 62]), where most of them make use of the fact that the operator A can be written as a convolution. We use an easy strategy which does not rely on the structure of the operator and is hence also applicable to operators different from convolutions. Due to the nondifferentiability of both the data term and regularizer, a very simple approach is to dualize both terms (similar to ADMM [10] or ’PD-Split’ in [17]):

$$\begin{aligned} \min _{u \in X} \max _{y_1 \in X,y_2 \in Y} ~ \langle y_1, Au-f \rangle + \langle y_2, \nabla u \rangle - \delta _{P_1} (y_1) - \delta _{P_\lambda } (y_2), \end{aligned}$$

where \(P_\lambda\) denotes the convex set \(P_\lambda = \{ x \in X ~|~ \Vert x \Vert _\infty \le \lambda \}\). One can then employ a standard primal–dual method (PDHG [15]) which reads

$$\begin{aligned} y_1^{n+1}&= {{\mathrm {proj}}}_{P_1} (y_1^n + \sigma A (2 u^{n+1} - u^n)), \\ y_2^{n+1}&= {{\mathrm {proj}}}_{P_\lambda } (y_2^n + \sigma \nabla (2 u^{n+1} - u^n)), \\ u^{n+1}&= u^n - \tau (A^*y_1^{n+1} - {{\mathrm {div}}}(y_2^{n+1})). \end{aligned}$$

Unfortunately one can observe that whenever there is no primal term in the formulation of the problem, the energy tends to oscillate and convergence can be quite slow (even though of course in \(O\left( 1/N\right)\), cf. Fig. 1b). As an alternative we propose to split the problem differently and operate on the following primal–dual formulation:

$$\begin{aligned} \min _{u\in X} \max _{y \in Y} ~ \langle y,Au-f \rangle - \delta _{P_1}(y) + \lambda \Vert \nabla u \Vert _1. \end{aligned}$$

We employ algorithm (29), i.e. the non-accelerated basic inexact primal–dual algorithm (iPD) with type-2 errors and obtain

$$\begin{aligned} \begin{aligned} y^{n+1}&= {{\mathrm {proj}}}_{P_1}(y^n + \sigma A (2u^{n+1} - u^n)), \\ u^{n+1}&\approx _2^{\varepsilon _{n+1}} \arg \min _{u\in X} ~ \frac{1}{2\tau } \Vert u - (u^n - \tau A^*y^{n+1})\Vert ^2 + \lambda \Vert \nabla u \Vert _1. \end{aligned} \end{aligned}$$
(44)

Note that the dual proximum in this case can be evaluated error-free.

As a numerical study we perform deblurring on MATLAB’s Lily image in [0, 1] with resolution \(256 \times 192\), which has been corrupted by a Gaussian blur of approximately 12 pixels full width at half maximum (where we assume a pixel size of 1) and 50 percent salt-and-pepper noise, i.e. 50 percent of the pixels have been randomly set to either 0 or 1. Furthermore, we performed power iterations to determine the operator norm of \((A, \nabla )\) as \(L \approx \sqrt{8}\) and set \(\sigma = \tau = 0.99/\sqrt{8}\) for (PDHG). For (iPD) L can be determined analytically as \(L = \Vert A \Vert = 1\), hence \(\tau = \sigma = 0.99\) for (iPD).

At first, we want to confirm the convergence rates predicted by the theory numerically. One can easily observe that the decay of the relative objective is almost exactly as predicted: with higher \(\alpha\) it approaches \(O\left( N^{-1}\right)\), in fact for summable errors it even seems a little better. In the second setup we investigate whether the obtained error bounds can also be used as a criterion to ensure (optimal) convergence of the nested algorithm (44), and results for varying parameter \(\alpha\) can be found in Fig. 1b. Interestingly for this problem, the error bounds from the theory are indeed too pessimistic or, vice versa, the TV-\(L^1\) problem is “easier” than expected. As can be observed in Fig. 1b, the convergence rate for all choices of \(\alpha\) tends towards \(O\left( 1/N\right)\), with slight advantages for higher \(\alpha\), while the number of required inner iterations k (Fig. 1c, d) to reach the necessary precision is remarkably low. In fact, performing just a single inner iteration in every step of the algorithm resulted in a \(O\left( 1/N\right)\) convergence rate (cf. also Fig. 1d). The required number of inner iterations even decreases over the course of the outer iterations which suggests that the dual variable of the inner problem “converges” as well. Note that this does not contradict the theoretical findings of this paper, but the contrary: while the first study clearly confirms that in the worst case the proved worst-case estimates are reached, the second implies that in practice one might as well perform by far better.

4.2 Differentiable deblurring with the TV-\(L^2\) model

The second problem we investigate is the TV-\(L^2\) model for image deblurring

$$\begin{aligned} u^* \in \arg \min _{u \in X} ~ \frac{1}{2} \Vert Au - f \Vert _2^2 + \lambda \Vert \nabla u \Vert _1. \end{aligned}$$
(45)

Again, the easiest approach to solve (45) is to write down a primal–dual formulation

$$\begin{aligned} \min _{u \in X} \max _{y_1 \in X,y_2 \in Y} ~ \langle y_1, Au-f \rangle + \langle y_2, \nabla u \rangle - \frac{1}{2} \Vert y_1\Vert ^2 - \delta _{P_\lambda } (y_2). \end{aligned}$$

Since the above problem is not strongly convex in \(y_2\) it cannot be accelerated, so a basic primal–dual algorithm [15] (PDHG) for the solution reads

$$\begin{aligned} y_1^{n+1}&= \frac{y_1^n + \sigma (A (2 u^{n+1} - u^n) - f)}{1 + \sigma }, \\ y_2^{n+1}&= {{\mathrm {proj}}}_{P_\lambda } (y_2^n + \sigma \nabla (2 u^{n+1} - u^n) ), \\ u^{n+1}&= u^n - \tau (A^* y_1^{n+1} - {{\mathrm {div}}}(y_2^{n+1}) ). \end{aligned}$$

We remark that, due to the special relation between the Fourier transform and a convolution, the same problem can be solved without dualizing the data term, since the primal proximal operator admits a closed form solution [15]. The problem however stays non-strongly convex, and in order to keep this a general prototype for \(L^2\)-type problems, we do not use this formulation.

The inexact approach instead operates on a different primal–dual formulation given by

$$\begin{aligned} \min _{u \in X} \max _{y \in X} ~ \langle y, Au-f \rangle - \frac{1}{2} \Vert y\Vert ^2 + \lambda \Vert \nabla u \Vert _1, \end{aligned}$$

which is now 1-strongly convex in y and can be accelerated. Using the inexact primal–dual algorithm from Sect. 3.4 leads to

$$\begin{aligned} y^{n+1}&= (y^n + \sigma _n (A (u^{n+1} + \theta _n (u^{n+1} - u^n)) -f))/(1 + \sigma _n), \\ u^{n+1}&\approx _2^{\varepsilon _{n+1}} \arg \min _{u \in {\mathcal {X}}} ~ \frac{1}{2\tau _n} \Vert u - (u^n - \tau _n A^*y^{n+1}) \Vert ^2 + \Vert \nabla u \Vert _1, \end{aligned}$$

with \(\tau _n,\sigma _n,\theta _n\) as given in Theorem 6. We again perform deblurring on MATLAB’s Lily image in [0, 1] with resolution \(256 \times 192\), which has been corrupted by a Gaussian blur of approximately 12 pixels full width at half maximum (where we assume a pixel size of 1), and in this case Gaussian noise with standard deviation \(s = 0.01\) and zero mean. We allow errors of the size \(\varepsilon _n = C/n^{-2\alpha }\) for \(\alpha \in (0,1)\), which by Corollary 3 should result in a \(O\left( N^{-2\alpha }\right)\) rate respectively \(O\left( N^{-2}\right)\) for \(\alpha > 1\). The results can be found in Fig. 2. In contrast to the TV-\(L^1\) problem, in this experiment it was not necessary to employ a cold start strategy and reduced step sizes for the inner problem in order to obtain the worst case rates. Instead also for a warm start and best possible step sizes for the inner problem the bounds for the gap (42) were active for all choices of \(\alpha\). Figure 2 shows the error in relative objective for the ergodic sequence \(U^N\) (a) and the iterates \(u^n\) (b) for increasing \(\alpha\). It can be observed that the rate is almost exactly the one predicted, while the iterates themselves even decay a little faster than the ergodic sequence. The amount of inner iterations necessary to obtain the required precision of the proximum is unsurprisingly higher than in the non-accelerated case, though they stay reasonable for rather low outer iteration numbers.

Fig. 2
figure 2

Inexact primal–dual on the TV-\(L^2\) problem. a and b loglog plots of the relative objective error versus the outer iteration number for different precisions \(C/n^{-2\alpha }\) of the errors. a Ergodic sequence, b iterates. c and d number of inner iterations respectively sum of inner iterations versus number of outer iterations for different decay rates \(\alpha\). One can observe that the predicted rate of \(O\left( N^{-2\alpha }\right)\) is attained both for the ergodic sequence and the single iterates, exactly reflecting the influence of the errors/imprecision

4.3 Smooth deblurring with the TV-\(L^2\) model

The last problem we consider is a smoothed version of the TV-\(L^2\) model from the previous experiments:

$$\begin{aligned} u^* \in \arg \min _{u \in X} ~ \frac{1}{2} \Vert Au - f \Vert _2^2 + \lambda \Vert \nabla u \Vert _1 + \frac{\gamma }{2} \Vert u\Vert ^2, \end{aligned}$$
(46)

for small \(\gamma\), with primal–dual formulation

$$\begin{aligned} \min _{u \in X} \max _{y_1 \in X,y_2 \in Y} ~ \langle y_1, Au-f \rangle + \langle y_2, \nabla u \rangle - \frac{1}{2} \Vert y_1\Vert ^2 - \delta _{P_\lambda } (y_2) + \frac{\gamma }{2} \Vert u\Vert ^2. \end{aligned}$$
(47)

Since the above problem is \(\gamma\)-strongly convex in u (note that it is also \(L_f = \gamma\)-Lipschitz differentiable in the primal variable), a possible accelerated primal–dual algorithm [18] (PDHGacc) for the solution reads

$$\begin{aligned} y_1^{n+1}&= \frac{y_1^n + \sigma _n (A (u^{n+1} + \theta _n (u^{n+1} - u^n)) - f)}{1 + \sigma _n}, \\ y_2^{n+1}&= {{\mathrm {proj}}}_{P_\lambda } (y_2^n + \sigma _n \nabla (u^{n+1} + \theta _n (u^{n+1} - u^n)) ), \\ u^{n+1}&= (1-\tau _n \gamma ) u^n - \tau _n (A^* y_1^{n+1} - {{\mathrm {div}}}(y_2^{n+1}) ), \end{aligned}$$

with \(\tau _n, \sigma _n,\theta _n\) given by Theorem 5 (see also [18]). We choose \(\tau _0 = 0.99 / L\), \(\sigma _0 = (1 - \tau _0 L_f) / \tau _0 L^2\) such that \(\tau _0 L_f + \tau _0 \sigma _0 L^2 = 1\) as required, with \(L = \Vert (A,\nabla ) \Vert \approx \sqrt{8}\) (see also the previous section). We remark that the primal term involving \(\gamma\) could also be handled implicitly, leading to a linear proximal step instead of the explicit evaluation of the gradient which, however, did not substantially affect the results. In the spirit of the previous experiments we employ a different splitting on this problem:

$$\begin{aligned} \min _{u \in X} \max _{y \in Y} ~ \langle y,Au -f \rangle -\frac{1}{2} \Vert y\Vert ^2 + \lambda \Vert \nabla u \Vert _1 + \frac{\gamma }{2} \Vert u\Vert ^2. \end{aligned}$$
(48)

The benefit is that even for small \(\gamma\) this problem is \(\gamma\)-strongly convex in the primal and 1-strongly convex in the dual variable and hence can be accelerated to linear convergence, which provides a huge boost in performance. Note that the same is not possible in formulation (47), since the problem is not strongly convex in \(y_2\). We can handle the smooth primal term in (48) explicitly such that the associated inexact primal–dual algorithm (iPD) from Sect. 3.5 reads

$$\begin{aligned} y^{n+1}&= (y^n + \sigma (A (u^{n+1} + \theta (u^{n+1} - u^n)) -f))/(1 + \sigma ), \\ u^{n+1}&\approx _2^{\varepsilon _{n+1}} \arg \min _{u \in {\mathcal {X}}} ~ \frac{1}{2\tau } \Vert u - [(1 - \tau \gamma ) u^n - \tau A^*y^{n+1}] \Vert ^2 + \Vert \nabla u \Vert _1. \end{aligned}$$

with \(\tau ,\sigma ,\theta\) defined at the end of Sect. 3.5. In this case we have \(\gamma = L_f\), such that the formulas simplify to

$$\begin{aligned} \tau = \frac{\sqrt{4 + 4L^2/(\gamma \mu )}}{2 \gamma + 2 L^2/\mu }, \quad \sigma = \frac{\sqrt{4 + 4L^2/(\gamma \mu )}}{2 \mu + 2L^2 /\gamma }, \quad \theta = 1 - \frac{\sqrt{4 + 4L^2/(\gamma \mu )} -2}{2L^2 / (\gamma \mu )}. \end{aligned}$$

We revisit the experimental setting from Sect. 4.3, such that \(L = \Vert A\Vert = 1\), \(\lambda = 0.01\) and choose \(\gamma = 1e-3\). With this size of \(\gamma\) the results were barely distinguishable from the results of the non-smoothed model from Sect. 4.2. This leads to \(\theta \approx 0.96\) for the constant of the linear convergence. Figure 3 shows the results for (PDHGacc) and (iPD) using an error decay rate of \(q = 0.9\), i.e. according to Corollary 4 we expect a linear convergence with constant \(\theta > q\), which is indeed the case. One can observe that already after 250 iterations (iPD) reaches a relative objective error of \(1{\mathrm {e}}{-}10\), while the accelerated PD version has barely reached \(1{\mathrm {e}}{-}2\). It should however be mentioned that also (PDHGacc) reaches the \(O\left( N^{-2}\right)\) rate soon after these 250 iterations. Figure 3c shows the price we pay for the inner loop, i.e. the number of inner iterations which is necessary over the course of the 250 outer iterations. As one expects for linear convergence, the number of inner iterations explodes for high outer iteration numbers, which substantially slows down the algorithm. However, the algorithm reaches an error of \(1{\mathrm {e}}{-}6\) in relative objective already after approximately 100 iterations, in which case the number of inner iterations is still remarkably low (around 10–20), which makes the approach viable in practice. This is in particular interesting for problems with a very costly operator A, where the tradeoff between outer and inner iterations is high.

Fig. 3
figure 3

Inexact primal–dual on the smoothed TV-\(L^2\) problem. a and b Loglog plots of the relative objective error respectively relative error in norm versus the outer iteration numbers for accelerated primal–dual (PDHGacc) and inexact primal–dual (iPD) for \(q = 0.9\), c loglog plot of the inner iteration number vs. outer iteration number for \(q = 0.9\). One can observe that the predicted convergence rate of \(O\left( \theta ^N\right)\) is exactly attained, while for lower outer iteration numbers the necessary amount of inner iterations stays reasonably low

5 Conclusion and outlook

In this paper we investigated the convergence of the class of primal–dual algorithms developed in [15, 18, 54] under the presence of errors occurring in the computation of the proximal points and/or gradients. Following [5, 60, 61] we studied several types of errors and showed that under a sufficiently fast decay of these errors we can establish the same convergence rates as for the error-free algorithms. More precisely we proved the (optimal) \(O\left( 1/N\right)\) convergence to a saddle-point in finite dimensions for the class of non-smooth problems considered in this paper, and proved a \(O\left( 1/N^2\right)\) or even linear \(O\left( \theta ^N\right)\) convergence rate for partly smooth respectively entirely smooth problems. We demonstrated both the performance and the practical use of the approach on the example of nested algorithms, which can be used to split the global objective more efficiently in many situations. A particular example is the nondifferentiable TV-\(L^1\) model which can be very easily solved by our approach. A few questions remain open for the future: A very practical one is whether one can use the idea of nested algorithms to (heuristically) speed up the convergence of real life problems which are not possible to accelerate, such as TV-type methods in medical imaging. As demonstrated in the numerical section, using an inexact primal–dual algorithm one can often “introduce” strong convexity by splitting the problem differently and hence obtain the possibility to accelerate. This can in particular be interesting for problems with operators of very different costs, where the trade-off between inner and outer iterations is high and hence a lot of inner iterations are still feasible. Following the same line, it would furthermore be interesting to combine the convergence results for inexact algorithms with stochastic approaches as done in [16], which are also designed to speed up the convergence for this particular situation, which could provide an additional boost. Another point to investigate is whether one can combine the inexact approach with linesearch and variable metric strategies similar to [9].