1 Introduction

In many areas of science and engineering one is faced with the problem of having to compute a meaningful approximate solution, defined in an appropriate way, of linear systems of equations of the form

$$ A\mathbf{x}\approx\mathbf{b}, $$
(1)

where \(A\in {\mathbb R}^{m\times n}\) is a given matrix, \(\mathbf {x}\in {\mathbb R}^{n}\) is the desired solution, and \(\mathbf {b}\in {\mathbb R}^{m}\) is a data vector. The symbol ≈ indicates that we would like to determine a vector x such that Ax approximates b in a suitable way. We consider the case when the singular values of A decrease to zero quickly without a significant gap. Then the matrix A is severely ill-conditioned and, moreover, it is not meaningful to define the rank of A. Approximation problems (1) with a matrix of this kind are commonly referred to as discrete ill-posed problems; see, e.g., [17, 26, 27] for discussions.

In applications of interest to us, the data vector b is not available; instead a measured vector \(\mathbf {b}^{\delta }\in {\mathbb R}^{m}\), which is contaminated by noise \(\boldsymbol {\eta }\in {\mathbb R}^{m}\), is known, i.e.,

$$ \mathbf{b}^{\delta}=[b_{1}^{\delta},b_{2}^{\delta},\ldots,b_{m}^{\delta}]^{T}=\mathbf{b}+\boldsymbol{\eta} $$

replaces the vector b in (1). Here and throughout this paper the superscript T denotes transposition. We assume that η is made up of white Gaussian noise, impulse noise and/or salt-and-pepper noise. Impulse noise modifies only some entries of b and leaves the other entries unchanged. In detail, we have for 1 ≤ i ≤ m,

$$ b^{\delta}_{i}=\left\{\begin{array}{ll}d_{i}&\text{with probability } \sigma,\\ b_{i}&\text{with probability } 1-\sigma, \end{array}\right. $$

where the di are realizations of a random variable with uniform distribution in the interval \([d_{\min \limits },d_{\max \limits }]\), which is the dynamic range of bi. If \(d_{i}\in \{d_{\min \limits },d_{\max \limits }\}\), i.e., if the di only attain their maximal or minimal achievable values, then impulse noise is commonly referred to as salt-and-pepper noise. Impulse noise simulates the effect of broken sensors on a measuring device such as a CCD camera.
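For concreteness, the following sketch simulates the noise model just described. It is our own illustrative code, not part of the paper: the function name, the NumPy implementation, and the random number generator are our choices, and the dynamic range [dmin,dmax] is assumed to be supplied by the user.

```python
import numpy as np

def add_impulse_noise(b, sigma, d_min, d_max, salt_and_pepper=False, rng=None):
    """Corrupt each entry of b with probability sigma, as in the model above."""
    rng = np.random.default_rng() if rng is None else rng
    b_delta = b.copy()
    corrupted = rng.random(b.shape) < sigma              # entries hit by the noise
    if salt_and_pepper:
        # d_i attains only the extreme values d_min or d_max
        d = rng.choice([d_min, d_max], size=b.shape)
    else:
        # d_i is uniformly distributed in the dynamic range [d_min, d_max]
        d = rng.uniform(d_min, d_max, size=b.shape)
    b_delta[corrupted] = d[corrupted]
    return b_delta
```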

Since A is ill-conditioned and bδ is contaminated by noise, the least-squares solution A†bδ of minimal norm of (1) with b replaced by bδ is meaningless due to propagation and severe amplification of the error η in bδ into the solution; here A† denotes the Moore-Penrose pseudo-inverse of A. To reduce the propagated error in the computed solution, one typically modifies the problem to be solved. This modification is commonly referred to as regularization.

A regularization method that recently has received considerable attention is the ℓp-ℓq minimization method; see [14, 18, 30, 35]. This method solves the minimization problem

$$ \underset{\mathbf{x}\in{\mathbb R}^{n}}{\min}\left\{\frac{1}{p}\| A\mathbf{x}-\mathbf{b}^{\delta} \|_{p}^{p}+ \frac{\mu}{q}\| L\mathbf{x} \|^{q}_{q}\right\}, $$
(2)

where \(L\in \mathbb {R}^{r\times n}\) is a regularization matrix, μ > 0 a regularization parameter, and \(\| \mathbf {z} \|_{p}^{p}={\sum }_{i=1}^{k}|z_{i}|^{p}\) for \(\mathbf {z}=[z_{1},z_{2},\ldots ,z_{k}]^{T}\in {\mathbb R}^{k}\) with k ∈{m,n}. We are interested in parameters p and q in the interval (0,2]. Observe that z↦∥z∥p is a norm only if p ≥ 1; however, for notational convenience, we will refer to the quantity ∥z∥p as an ℓp-norm also for 0 < p < 1.

The regularization parameter μ > 0 determines the trade-off between the first and second terms in (2), and decides how sensitive the solution of (2) is to the error η in bδ. Moreover, the choice of μ affects how close the solution is to the desired vector xtrue = A†b. An imprudent choice of μ may result in the solution of (2) being a poor approximation of xtrue. It therefore is important to develop methods that are able to determine a suitable value of μ. This value will depend on the matrices A and L, the vectors bδ and η, as well as the parameters p and q. If p = q = 2, then the problem (2) reduces to Tikhonov regularization in general form; see, e.g., [17, 22, 27, 38] for discussions on Tikhonov regularization.

We now briefly discuss different choices for p and q, and first comment on the choice of q, which affects the second term in (2). This term is commonly referred to as the regularization term. In many situations the desired solution xtrue of (1) is sparse after some transformation. For instance, if we represent xtrue in terms of framelets or wavelets, then xtrue typically has a sparse representation, i.e., many coefficients in this representation vanish. Moreover, when L is a discretized gradient operator, the vector Lxtrue generally has many vanishing entries. To promote sparsity of the computed solution of (2), we may consider letting L = I and q = 0. Then the regularization term measures the size of the computed solution by the 0-norm. This “norm” counts the number of nonzero entries in the vector x in (2). Note that this norm is not convex. Similarly, to promote sparsity in the vector Lx, we may consider letting q = 0 in (2). However, the minimization problems so obtained are extremely difficult to solve. Therefore, it is common to approximate the 0-norm by the 1-norm. This approximation has the advantage that the 1-norm is convex, which makes it easier to solve (2) than when using the 0-norm. However, q-norms with 0 < q < 1 are better approximations of the 0-norm than the 1-norm. In particular, smaller values of q > 0 yield better approximations of the 0-norm than larger values; see Fig. 1 for an illustration. For 0 < q < 1 the resulting minimization problem (2) is not convex and its solution may be difficult to determine; see Lanza et al. [36] for a recent discussion on the choice of q in the context of image restoration.

Fig. 1 Comparison of different q-norms. The solid black graph represents the 0-norm, the dotted black graph shows the 1-norm, the dark gray solid graph displays the 0.5-norm, and the light gray solid graph depicts the 0.1-norm

The choice of p affects the first term in (2), known as the fidelity term, and should depend on the type of noise η in the data vector bδ. If the noise η is white Gaussian, then p = 2 is appropriate and a method for determining μ for this kind of noise, based on the discrepancy principle, is described in [7]. This method requires that a fairly accurate estimate of the norm of η be available and allows 0 < q < 1. However, for impulse noise, choosing p = 2 usually produces computed solutions of poor quality. It has been shown, see, e.g., [8, 30, 35], that the choice 0 < p < 1 leads to accurate restorations in the case of salt-and-pepper noise. In [8], the authors developed two strategies based on cross-validation for determining a suitable value of μ for any 0 < p ≤ 2 without any knowledge of the noise. These strategies are further commented on below.

Let \(\mathcal {N}(M)\) denote the null space of the matrix M. We will assume that the matrices A and L are such that

$$ \mathcal{N}(A)\cap\mathcal{N}(L)=\{\mathbf{0}\}. $$
(3)

Then the matrix ATA + μLTL is nonsingular for any μ > 0, and the minimization problem (2) for p = q = 2 has the unique solution

$$ \mathbf{x}_{\mu}=(A^{T}A+\mu L^{T}L)^{-1}A^{T}\mathbf{b}^{\delta}. $$
(4)
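As an illustration, for small dense matrices the solution (4) can be computed directly from the normal equations. The sketch below is our own code (the function name and the use of NumPy are our choices), not an implementation from the paper.

```python
import numpy as np

def tikhonov_general_form(A, L, b_delta, mu):
    """Return x_mu = (A^T A + mu L^T L)^{-1} A^T b_delta, cf. (4)."""
    M = A.T @ A + mu * (L.T @ L)
    return np.linalg.solve(M, A.T @ b_delta)
```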

We remark that the selection of both L and μ is important for the quality of the computed solution and has been widely discussed; see, e.g., [15, 19, 31,32,33, 37].

Generalized cross validation (GCV) introduced by Golub et al. [23] is a popular approach to choosing the regularization parameter for Tikhonov regularization; more recent discussions on this method can be found in [19, 20, 24, 29]. The GCV method is statistically based and chooses a regularization parameter that minimizes the GCV functional

$$ G(\mu)=\frac{\| \mathbf{b}^{\delta}-A\mathbf{x}_{\mu} \|_{2}^{2}}{\left( \text{\textsf{trace}}\left( I-A\left( A^{T}A+\mu L^{T}L\right)^{-1} A^{T}\right)\right)^{2}}. $$
(5)

Denote the minimizer of G(μ) by μGCV. In rare events, when the minimizer is not unique, we choose the largest one.

The minimization of G(μ) requires the computation, or at least estimation, of the trace of the matrix I − A(ATA + μLTL)−1AT, which is not explicitly known. This can make the computation of G(μ) expensive when the matrices A and L are large. Different ways to speed up the computations by application of a global block Lanczos method or Golub-Kahan bidiagonalization are described in [19, 20]. However, these works are limited to the situation when L = I.

In this paper, we combine ideas in [19] with the techniques described in [30] to derive an algorithm for computing the GCV parameter for (2) for general values of 0 < p,q ≤ 2. Specifically, we consider the AMM-GKS algorithm in [30] for the solution of (2) (see Section 3.1 for more details), and modify it so that at each iteration the regularization parameter is chosen adaptively by minimizing a reduced version of the function (5). The computation of the GCV parameter is fully automatic in the sense that no information about the noise η is required. The GCV method for determining the regularization parameter is a so-called heuristic method, see Kindermann [32], and therefore may fail for certain data vectors b. In our experience, failure is extremely rare. The GCV method is related to cross validation (CV), another heuristic method for determining a suitable regularization parameter; see [8, 39]. We will illustrate in Section 4 that GCV typically gives about the same value of the regularization parameter as CV for a much lower computational effort.

This paper is organized as follows: Section 2 reviews how the GCV parameter for Tikhonov regularization in general form can be computed fairly inexpensively by projection into a Krylov subspace, and in Section 3 we discuss how to determine the GCV parameter for large-scale minimization problems (2). A smoothed variant of GCV that can give better regularization parameter values also is presented. Section 4 contains numerical examples that illustrate the performance of the proposed approach. Finally, Section 5 contains some concluding remarks.

2 GCV for Tikhonov regularization in general form

This section discusses the computation of the GCV parameter for Tikhonov regularization in general form, i.e., when solving the minimization problem (2) with p = q = 2. We first describe an approach that requires the generalized singular value decomposition (GSVD) of the matrix pair {A,L}. This approach is computationally feasible only for small to medium-sized problems since the computation of the GSVD for a large matrix pair is very expensive. We then discuss how the computational burden for large-scale problems can be reduced by projecting the matrix pair {A,L} into a Krylov subspace of fairly small dimension.

2.1 GCV computation by using the GSVD

This subsection reviews material in [27], where further details can be found. Consider the Tikhonov regularization problem in general form with μ > 0,

$$ \mathbf{x}_{\mu}=\arg\underset{\mathbf{x}\in{\mathbb R}^{n}}{\min}\left\{\| A\mathbf{x}-\mathbf{b}^{\delta} \|_{2}^{2}+\mu\| L\mathbf{x} \|_{2}^{2}\right\}, $$
(6)

and introduce the GSVD of the matrix pair {A,L},

$$ \left\{\begin{array}{l} A=U{\varSigma}_{A}Y^{T},\\ L=V{\varSigma}_{L}Y^{T}, \end{array}\right. $$
(7)

where ΣA and ΣL are diagonal, possibly rectangular, matrices, the matrices U and V are orthogonal, and the matrix Y is square and non-singular. Further details of these matrices, including their sizes, can be found, e.g., in [16, 25, 27]. The evaluation of the GCV function G(μ), defined by (5), requires the computation of the quantities

$$ \begin{array}{@{}rcl@{}} r_{\mu}&=&\| A\mathbf{x}_{\mu}-\mathbf{b}^{\delta} \|_{2}^{2},\\ t_{\mu}&=&\text{\textsf{trace}}\left( I-A\left( A^{T}A+\mu L^{T}L\right)^{-1}A^{T}\right), \end{array} $$

for several values of μ > 0, where the vector xμ is given by (4). These quantities can be evaluated inexpensively as follows when the factorizations (7) are available. For notational simplicity, the formulas below assume that all matrices are square. This restriction can easily be removed; see, e.g., [16, 25, 27] for details.

The quantity rμ can be expressed as

$$ \begin{array}{@{}rcl@{}} r_{\mu}&=&\| \mathbf{b}^{\delta}-A\mathbf{x}_{\mu} \|_{2}^{2}\\ &\overset{(a)}{=}&\| \mathbf{b}^{\delta}-A(A^{T}A+\mu L^{T}L)^{-1}A^{T}\mathbf{b}^{\delta} \|_{2}^{2}\\ &\overset{(b)}{=}&\| \tilde{\mathbf{b}}^{\delta}- {\varSigma}_{A}({{\varSigma}_{A}^{T}}{\varSigma}_{A}+\mu {{\varSigma}_{L}^{T}}{\varSigma}_{L})^{-1}{{\varSigma}_{A}^{T}}\tilde{\mathbf{b}}^{\delta} \|_{2}^{2}, \end{array} $$

where (a) is obtained by using the definition (4) of xμ, and (b) is derived using the factorizations (7) with \(\tilde {\mathbf {b}}^{\delta }=U^{T}\mathbf {b}^{\delta }\). Analogously,

$$ \begin{array}{@{}rcl@{}} t_{\mu}&=&\text{\textsf{trace}}\left( I-A\left( A^{T}A+\mu L^{T}L\right)^{-1}A^{T}\right)\\ &=&\text{\textsf{trace}}\left( I-{\varSigma}_{A}({{\varSigma}_{A}^{T}}{\varSigma}_{A}+ \mu {{\varSigma}_{L}^{T}}{\varSigma}_{L})^{-1}{{\varSigma}_{A}^{T}}\right), \end{array} $$

where we have used the factorizations (7) and the fact that the trace is invariant under orthogonal similarity transformations. We have

$$ \mu_{\text{GCV}}=\arg\underset{\mu}{\min}\frac{r_{\mu}}{t_{\mu}^{2}}, $$

where, in the rare case of non-uniqueness, we let μGCV be the largest of the minimizers.

It is clear from the above expressions for rμ and tμ that the main computational cost for evaluating (5) is the computation of the GSVD (7) of the matrix pair {A,L}. Observe that, to determine μGCV we typically have to evaluate G(μ) for many values of μ. However, it suffices to compute the factorizations (7) only once, since they are independent of μ.
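To illustrate how cheap the evaluation of G(μ) becomes once the GSVD is available, the sketch below (our own code, not the authors') assumes that the diagonal entries of ΣA and ΣL and the transformed data vector b̃ = UTbδ have already been computed and are passed as NumPy arrays sA, sL, and bt; it treats the square case used in the formulas above.

```python
import numpy as np

def gcv_from_gsvd(mu, sA, sL, bt):
    """Evaluate G(mu) = r_mu / t_mu^2 from precomputed GSVD quantities."""
    f = sA**2 / (sA**2 + mu * sL**2)        # Tikhonov filter factors
    r_mu = np.sum(((1.0 - f) * bt)**2)      # residual norm squared
    t_mu = np.sum(1.0 - f)                  # trace term
    return r_mu / t_mu**2
```

Each evaluation costs only O(n) arithmetic operations, so scanning many values of μ is inexpensive once the GSVD has been computed.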

2.2 GCV computation in a Krylov subspace

When the matrices A and L are large, the computation of the GSVD (7) is unattractive due to its high cost; for instance, Paige’s algorithm generally requires at least 35.3n³ arithmetic floating point operations when \(A,L\in {\mathbb R}^{n\times n}\); see [1, Table 5.1] for details. We therefore now describe a more computationally attractive approach that is based on projecting the problem into a Krylov subspace of fairly small dimension. Our projection method extends the method used in [19] by allowing the matrix L to be different from the identity.

Let \(A\in {\mathbb R}^{m\times n}\) and \(L\in {\mathbb R}^{r\times n}\), and let the columns of \(V\in {\mathbb R}^{n\times d}\) form an orthonormal basis for the Krylov subspace \(\mathcal {V}\) of dimension \(1\leq d\ll \min \limits \{m,n,r\}\),

$$ \mathcal{V}=\mathcal{K}_{d}\left( A^{T}A,A^{T}\mathbf{b}^{\delta} \right)= \text{\textsf{span}}\left\{A^{T}\mathbf{b}^{\delta},A^{T}AA^{T}\mathbf{b}^{\delta},\ldots,(A^{T}A)^{d-1}A^{T}\mathbf{b}^{\delta} \right\}. $$

We assume here that d is small enough so that \(\dim (\mathcal {K}_{d}\left (A^{T}A,A^{T}\mathbf {b}^{\delta } \right ))=d\). This is the generic situation. It is convenient to compute the columns of V by applying d steps of Golub-Kahan bidiagonalization to A with initial vector bδ. This gives the decomposition

$$ AV=[U,u_{d+1}]B^{(A)}, $$
(8)

where the matrices \([U,\mathbf{u}_{d+1}]\in {\mathbb R}^{m\times (d+1)}\) and \(V\in {\mathbb R}^{n\times d}\) have orthonormal columns and \(B^{(A)}\in {\mathbb R}^{(d+1)\times d}\) is lower bidiagonal; see, e.g., [25]. We assume that d is small enough so that this decomposition exists.

We would like to compute an approximate solution of the Tikhonov minimization problem (6) in \(\mathcal {V}\). The restriction of the approximate solution to the subspace \(\mathcal {V}\) of low dimension d reduces the computational cost significantly. Thus, we would like to solve

$$ \mathbf{x}=\arg\underset{\mathbf{x}\in\mathcal{V}}{\min}\left\{\| A\mathbf{x}-\mathbf{b}^{\delta} \|^{2}_{2}+\mu\| L\mathbf{x} \|_{2}^{2}\right\}. $$

Substituting x = V y for some \(\mathbf {y}\in {\mathbb R}^{d}\) into the above equation gives

$$ \mathbf{y}=\arg\underset{\mathbf{y}\in{\mathbb R}^{d}}{\min}\left\{\| AV\mathbf{y}-\mathbf{b}^{\delta} \|_{2}^{2}+\mu\| LV\mathbf{y} \|_{2}^{2}\right\}. $$
(9)

Substituting the QR factorization

$$ LV=Q^{(L)}R^{(L)}, $$

where the matrix \(Q^{(L)}\in {\mathbb R}^{r\times d}\) has orthonormal columns and \(R^{(L)}\in {\mathbb R}^{d\times d}\) is upper triangular, and the decomposition (8) into (9) gives

$$ \mathbf{y}=\arg\underset{\mathbf{y}\in{\mathbb R}^{d}}{\min}\left\{\| B^{(A)}\mathbf{y}-\|\mathbf{b}^{\delta}\|_{2}\mathbf{e}_{1} \|_{2}^{2}+ \mu\| R^{(L)}\mathbf{y} \|_{2}^{2}\right\}, $$
(10)

with e1 = [1,0,…,0]T denoting the first axis vector. The projected problem (10) is quite inexpensive to compute and of fairly small dimension. In particular, it is feasible to compute the GSVD of the matrix pair {B(A),R(L)} and use it to calculate an approximation of the GCV function G(μ).
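The following sketch summarizes the computations of this subsection: d steps of Golub-Kahan bidiagonalization (8), the QR factorization of LV, and the solution of the projected problem (10) via its normal equations. It is a conceptual illustration in NumPy (our own function names, no reorthogonalization or breakdown checks), not the implementation used in the paper.

```python
import numpy as np

def golub_kahan(A, b, d):
    """d steps of Golub-Kahan bidiagonalization with initial vector b; cf. (8)."""
    m, n = A.shape
    U = np.zeros((m, d + 1)); V = np.zeros((n, d)); B = np.zeros((d + 1, d))
    beta = np.linalg.norm(b); U[:, 0] = b / beta
    for j in range(d):
        v = A.T @ U[:, j] - (B[j, j - 1] * V[:, j - 1] if j > 0 else 0.0)
        B[j, j] = np.linalg.norm(v); V[:, j] = v / B[j, j]
        u = A @ V[:, j] - B[j, j] * U[:, j]
        B[j + 1, j] = np.linalg.norm(u); U[:, j + 1] = u / B[j + 1, j]
    return U, B, V, beta                     # A @ V = U @ B and U.T @ b = beta * e_1

def projected_tikhonov(A, L, b_delta, d, mu):
    """Approximate solution of (6) in the Krylov subspace; cf. (9)-(10)."""
    U, B, V, beta = golub_kahan(A, b_delta, d)
    R = np.linalg.qr(L @ V, mode="r")        # the triangular factor R^(L)
    rhs = np.zeros(d + 1); rhs[0] = beta     # ||b_delta||_2 * e_1
    y = np.linalg.solve(B.T @ B + mu * (R.T @ R), B.T @ rhs)
    return V @ y
```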

3 GCV for ℓp-ℓq regularization

We describe how to combine the computations discussed in Section 2 for determining the regularization parameter by the GCV method with the algorithms presented in [30] for minimizing the expression (2).

3.1 Solution of the ℓp-ℓq minimization problem for fixed μ > 0

For convenience of the reader, we outline the AMM-GKS method for solving (2) proposed in [30]. Introduce a smoothed version of the q-norm \(\| \mathbf {x} \|_{q}^{q}\) for 0 < q ≤ 1 as follows. Let \({\varPhi }_{q}:{\mathbb R}\rightarrow {\mathbb R}\) be defined by

$$ {\varPhi}_{q}(t)=|t|^{q}, $$

and observe that, if 0 < q ≤ 1, then tΦq(t) is not differentiable at t = 0. We approximate Φq by the differentiable function

$$ {\varPhi}_{q,\varepsilon}(t)=\left( \sqrt{t^{2}+\varepsilon^{2}}\right)^{q}, $$
(11)

where ε > 0 is a (small) constant. Similarly, we define for \(\mathbf {x}=[x_{1},x_{2},\ldots ,x_{n}]^{T}\in {\mathbb R}^{n}\) the smoothed version of \(\| \mathbf {x} \|_{q}^{q}\) as

$$ \| \mathbf{x} \|_{q}^{q}\approx\sum\limits_{i=1}^{n} {\varPhi}_{q,\varepsilon}(x_{i}). $$

Finally, introduce the smoothed version of the functional that is minimized in (2),

$$ \mathcal{J}_{\varepsilon}(\mathbf{x}):=\frac{1}{p}\sum\limits_{i=1}^{m}{\varPhi}_{p,\varepsilon}((A\mathbf{x}-\mathbf{b}^{\delta})_{i})+ \frac{\mu}{q}\sum\limits_{i=1}^{r}{\varPhi}_{q,\varepsilon}((L\mathbf{x})_{i}). $$
(12)

This gives us the smoothed version

$$ \mathbf{x}^{*}:=\arg\min_{\mathbf{x}}\mathcal{J}_{\varepsilon}(\mathbf{x}) $$

of the minimization problem (2).
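The smoothed functional is straightforward to evaluate; the short sketch below (our own code, not the authors') implements Φq,ε from (11) and Jε from (12) with NumPy.

```python
import numpy as np

def phi(t, q, eps):
    """Smoothed modulus Phi_{q,eps}(t) of (11), applied entrywise."""
    return np.sqrt(t**2 + eps**2)**q

def J_eps(x, A, L, b_delta, mu, p, q, eps):
    """Smoothed functional (12)."""
    fid = np.sum(phi(A @ x - b_delta, p, eps)) / p
    reg = np.sum(phi(L @ x, q, eps)) / q
    return fid + mu * reg
```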

The AMM-GKS method described in [30] is a majorization-minimization method for computing a stationary point of the functional (12). This method determines a sequence of iterates x(k), k = 1,2,…, that converge to a stationary point of (12). Each iterate is computed in two steps: first the functional (12) is majorized by a quadratic tangent majorant \(\mathbf {x}\rightarrow \mathcal {Q}(\mathbf {x},\mathbf {x}^{(k)})\) for \(\mathcal {J}_{\varepsilon }\) at x(k). Then the unique minimizer of \(\mathbf {x}\rightarrow \mathcal {Q}(\mathbf {x},\mathbf {x}^{(k)})\) is calculated. This minimizer is the next iterate x(k+ 1). We outline this method in the remainder of this subsection.

Definition 1 ([30])

Consider the differentiable functional \(\mathcal {J}_{\varepsilon }(\mathbf {x}):{\mathbb R}^{n}\rightarrow {\mathbb R}\). We say that the functional \(\mathbf {x}\mapsto \mathcal {Q}(\mathbf {x},\mathbf {y}):{\mathbb R}^{n}\rightarrow {\mathbb R}\) is a quadratic tangent majorant for \(\mathcal {J}_{\varepsilon }(\mathbf {x})\) at \(\mathbf {y}\in {\mathbb R}^{n}\) if the following conditions are satisfied:

  • \(\mathcal {Q}(\mathbf {x},\mathbf {y})\) is quadratic,

  • \(\mathcal {Q}(\mathbf {x},\mathbf {y})\geq \mathcal {J}_{\varepsilon }(\mathbf {x})\) for all \(\mathbf {x}\in {\mathbb R}^{n}\),

  • \(\mathcal {Q}(\mathbf {y},\mathbf {y})=\mathcal {J}_{\varepsilon }(\mathbf {y})\) and \(\nabla _{\mathbf {x}}\mathcal {Q}(\mathbf {y},\mathbf {y})=\nabla _{\mathbf {x}}\mathcal {J}_{\varepsilon }(\mathbf {y})\),

where ∇x denotes the gradient with respect to the first argument of Q.

Majorization step.

The quadratic tangent majorant at the point x(k) constructed by the AMM-GKS method is determined as follows. Let

$$ \begin{array}{ll} \mathbf{v}^{(k)}&:=A\mathbf{x}^{(k)}-\mathbf{b}^{\delta},\\ \mathbf{u}^{(k)}&:=L\mathbf{x}^{(k)}, \end{array} $$

and compute the vectors

$$ \begin{array}{ll} \boldsymbol{\omega}_{\text{fid}}^{(k)}&:=((\mathbf{v}^{(k)})^{2}+\varepsilon^{2})^{p/2-1},\\ \boldsymbol{\omega}^{(k)}_{\text{reg}}&:=((\mathbf{u}^{(k)})^{2}+\varepsilon^{2})^{q/2-1}, \end{array} $$

where all the operations, including squaring, are meant element-wise. Define the diagonal matrices

$$ W^{(k)}_{\text{fid}}=\text{diag}(\boldsymbol{\omega}_{\text{fid}}^{(k)}),\quad W^{(k)}_{\text{reg}}=\text{diag}(\boldsymbol{\omega}_{\text{reg}}^{(k)}). $$

It is shown in [30] that the functional

$$ \mathcal{Q}(\mathbf{x},\mathbf{x}^{(k)})=\frac{1}{2}\| (W^{(k)}_{\text{fid}})^{1/2}(A\mathbf{x}-\mathbf{b}^{\delta}) \|_{2}^{2}+ \frac{\mu}{2}\| (W^{(k)}_{\text{reg}})^{1/2}L\mathbf{x} \|_{2}^{2}+c, $$
(13)

with c a suitable constant that is independent of x, is a quadratic tangent majorant for \(\mathcal {J}_{\varepsilon }\) at x(k); see [30] for details on the derivation of \(\boldsymbol {\omega }^{(k)}_{\text {fid}}\), \(\boldsymbol {\omega }^{(k)}_{\text {reg}}\), and c.
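In code, the majorization step amounts to a few element-wise operations. The sketch below (our own code and function name) computes the weight vectors ω(k)fid and ω(k)reg that define the diagonal matrices W(k)fid and W(k)reg.

```python
import numpy as np

def mm_weights(x_k, A, L, b_delta, p, q, eps):
    """Weights of the quadratic tangent majorant (13) at x_k."""
    v = A @ x_k - b_delta
    u = L @ x_k
    w_fid = (v**2 + eps**2)**(p / 2 - 1)     # omega_fid^(k), element-wise
    w_reg = (u**2 + eps**2)**(q / 2 - 1)     # omega_reg^(k), element-wise
    return w_fid, w_reg                       # W_fid = diag(w_fid), W_reg = diag(w_reg)
```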

Minimization step.

We describe the computation of the minimizer x(k+ 1) of (13). Since \(\mathcal {Q}\) is quadratic, x(k+ 1) can be determined as the zero of the gradient, i.e., by solving

$$ (A^{T}W^{(k)}_{\text{fid}}A+\mu L^{T}W^{(k)}_{\text{reg}}L)\mathbf{x}=A^{T}W^{(k)}_{\text{fid}}\mathbf{b}^{\delta}. $$
(14)

The system matrix is nonsingular and positive definite for all μ > 0 if and only if

$$ \mathcal{N}(A^{T}W^{(k)}_{\text{fid}}A)\cap\mathcal{N}(L^{T}W^{(k)}_{\text{reg}}L)=\{\mathbf{0}\}. $$

This condition typically is satisfied. Then the solution x(k+ 1) of (14) is the unique minimizer of \(\mathbf {x}\to \mathcal {Q}(\mathbf {x},\mathbf {x}^{(k)})\).

The computation of the solution of (14) can be demanding when the matrices A and L are large. An approximate solution can be computed efficiently by projecting the problem into a low-dimensional subspace \(\mathcal {V}_{d}\) of dimension \(1\leq d\ll \min \limits \{m,n,r\}\). Let the columns of the matrix \(V_{d}\in {\mathbb R}^{n\times d}\) form an orthonormal basis for \(\mathcal {V}_{d}\) and calculate the skinny QR factorizations

$$ \begin{array}{rclcll} (W^{(k)}_{\text{fid}})^{1/2}AV_{d}&=&Q_{A}R_{A}\quad&\text{with}\quad&Q_{A}\in{\mathbb R}^{m\times d},& ~~~R_{A}\in{\mathbb R}^{d\times d},\\ (W^{(k)}_{\text{reg}})^{1/2}LV_{d}&=&Q_{L}R_{L}\quad&\text{with}\quad&Q_{L}\in{\mathbb R}^{r\times d},& ~~~R_{L}\in{\mathbb R}^{d\times d}, \end{array} $$

where the matrices QA and QL have orthonormal columns and the matrices RA and RL are upper triangular. Using these factorizations, we obtain the reduced analogue of (14),

$$ ({R_{A}^{T}}R_{A}+\mu {R_{L}^{T}}R_{L})\mathbf{y}={R_{A}^{T}}{Q_{A}^{T}}(W^{(k)}_{\text{fid}})^{1/2}\mathbf{b}^{\delta}. $$
(15)

The solution y(k+ 1) of this equation gives, with a slight abuse of notation, the approximate solution

$$ \mathbf{x}^{(k+1)}=V_{d}\mathbf{y}^{(k+1)} $$

of (14). Substituting this expression into (14) gives the residual vector

$$ \mathbf{r}^{(k+1)}:=A^{T}W^{(k)}_{\text{fid}}(AV_{d}\mathbf{y}^{(k+1)}-\mathbf{b}^{\delta})+ \mu L^{T}W^{(k)}_{\text{reg}}LV_{d}\mathbf{y}^{(k+1)}. $$

Observe that this vector in general is nonvanishing, since we are solving (14) in the subspace \(\mathcal {V}_{d}\) of \({\mathbb R}^{n}\). At each step we enlarge the solution subspace by appending the scaled residual vector \(\mathbf {v}_{\text {new}}=\mathbf {r}^{(k+1)}/\|\mathbf {r}^{(k+1)}\|\). This vector is orthogonal to the columns of the matrix Vd. Define the matrix \(V_{d+1}=[V_{d},\mathbf {v}_{\text {new}}]\in {\mathbb R}^{n\times (d+1)}\). Its columns form an orthonormal basis for the new solution subspace \(\mathcal {V}_{d+1}\). The solution subspace determined in this way is referred to as a Generalized Krylov subspace (GKS); see, e.g., [34] for another application of this kind of subspace. The computations are initialized by choosing a d0-dimensional solution subspace of \({\mathbb R}^{n}\) and letting the columns of the matrix \(V_{d_{0}}\) form an orthonormal basis for this subspace. Typically, setting d0 = 1 and \(V_{d_{0}}=A^{T}\mathbf {b}^{\delta }/\| A^{T}\mathbf {b}^{\delta } \|_{2}\) is appropriate. An algorithm that carries out these computations and determines the regularization parameter μ by the GCV method is presented in the following subsection.
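A sketch of the enlargement step follows (our own code; the explicit reorthogonalization against Vd is a numerical safeguard that is not needed in exact arithmetic).

```python
import numpy as np

def enlarge_subspace(V, A, L, w_fid, w_reg, b_delta, y, mu):
    """Append the normalized residual of (14) to the basis V_d."""
    x = V @ y
    r = A.T @ (w_fid * (A @ x - b_delta)) + mu * (L.T @ (w_reg * (L @ x)))
    r -= V @ (V.T @ r)                        # reorthogonalize (safeguard)
    return np.column_stack([V, r / np.linalg.norm(r)])   # V_{d+1}
```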

3.2 GCV and MM-GKS

In the computations outlined in the previous subsection, the regularization parameter is kept fixed. We now describe how a value μk of the regularization parameter can be determined in a fairly simple manner before computing the next approximate solution x(k+ 1) of (2).

Equation (15) is the normal equation associated with the minimization problem

$$ \mathbf{y}^{(k+1)}=\arg\underset{\mathbf{y}\in{\mathbb R}^{d}}{\min}\left\{\| R_{A}\mathbf{y}-{Q_{A}^{T}}(W^{(k)}_{\text{fid}})^{1/2}\mathbf{b}^{\delta} \|_{2}^{2}+ \mu\| R_{L}\mathbf{y} \|_{2}^{2}\right\}. $$

This minimization problem is analogous to (10). Since it stems from projecting a large problem into a generalized Krylov subspace of modest dimension d, the matrices RA and RL are small enough to allow the computation of the GSVD of the matrix pair {RA,RL} at moderate expense. We therefore can use the technique described in Section 2.1 to determine a suitable value of the regularization parameter by GCV. In detail, let the GSVD of the matrix pair {RA,RL} be given by

$$ \begin{array}{l} R_{A}=U{\varSigma}_{A}Y^{T},\\ R_{L}=V{\varSigma}_{L}Y^{T}. \end{array} $$

As in Section 2.1, we define, for any μ > 0, the quantities

$$ \begin{array}{ll} r^{(k)}_{\mu}&=\| U^{T}{Q_{A}^{T}}(W^{(k)}_{\text{fid}})^{1/2}\mathbf{b}^{\delta}-{{\varSigma}_{A}^{2}}({{\varSigma}_{A}^{2}}+{\mu{\varSigma}_{L}^{2}})^{-1} U^{T}{Q_{A}^{T}}(W^{(k)}_{\text{fid}})^{1/2}\mathbf{b}^{\delta} \|_{2}^{2},\\ t^{(k)}_{\mu}&=\text{\textsf{trace}}\left( I-{{\varSigma}_{A}^{2}}({{\varSigma}_{A}^{2}}+{\mu{\varSigma}_{L}^{2}})^{-1}\right). \end{array} $$
(16)

This derivation uses that both matrices RA and RL are square. We compute the GCV parameter μ(k) as the solution of

$$ \mu^{(k)}=\arg\underset{\mu}{\min}\frac{r^{(k)}_{\mu}}{\big(t^{(k)}_{\mu}\big)^{2}}. $$

If the minimum is not unique, then we choose the largest one. Typically, the minimum is unique. This is assumed in Algorithm 1, which describes the computations. The parameter μ(k) is updated in each iteration.

Algorithm 1 (MM-GKS-GCV)

Algorithm 1 requires the computation of the GSVD of the matrix pair {RA,RL} in each iteration. Since the matrices RA and RL are not very large, this is not very expensive.
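To make the preceding description concrete, we include a schematic sketch of the MM-GKS-GCV iteration. It is our own reconstruction from the description above, not the authors' Algorithm 1: the reduced GCV function is evaluated directly by dense linear algebra instead of via the GSVD of {RA,RL} (affordable since these matrices are only d × d), and μ(k) is selected by a simple search over a logarithmic grid whose range is an arbitrary choice.

```python
import numpy as np

def reduced_gcv(mu, R_A, R_L, rhs):
    """Reduced GCV function built from (16); rhs = Q_A^T (W_fid)^{1/2} b_delta."""
    M = R_A.T @ R_A + mu * (R_L.T @ R_L)
    H = R_A @ np.linalg.solve(M, R_A.T)              # influence matrix
    r_mu = np.linalg.norm(rhs - H @ rhs)**2
    t_mu = R_A.shape[0] - np.trace(H)
    return r_mu / t_mu**2

def mm_gks_gcv(A, L, b_delta, p, q, eps=1.0, maxit=100, tol=1e-4):
    n = A.shape[1]
    v0 = A.T @ b_delta
    V = (v0 / np.linalg.norm(v0)).reshape(n, 1)      # initial subspace, d_0 = 1
    x = np.zeros(n)
    for k in range(maxit):
        # majorization step: weights at the current iterate
        w_fid = ((A @ x - b_delta)**2 + eps**2)**(p / 2 - 1)
        w_reg = ((L @ x)**2 + eps**2)**(q / 2 - 1)
        # projection onto the current generalized Krylov subspace
        Q_A, R_A = np.linalg.qr(np.sqrt(w_fid)[:, None] * (A @ V))
        Q_L, R_L = np.linalg.qr(np.sqrt(w_reg)[:, None] * (L @ V))
        rhs = Q_A.T @ (np.sqrt(w_fid) * b_delta)
        # GCV choice of the regularization parameter (grid search)
        mus = np.logspace(-10, 5, 60)
        mu = mus[np.argmin([reduced_gcv(m, R_A, R_L, rhs) for m in mus])]
        # minimization step: projected normal equations (15)
        y = np.linalg.solve(R_A.T @ R_A + mu * (R_L.T @ R_L), R_A.T @ rhs)
        x_new = V @ y
        # enlarge the solution subspace with the normalized residual of (14)
        r = A.T @ (w_fid * (A @ x_new - b_delta)) + mu * (L.T @ (w_reg * (L @ x_new)))
        r -= V @ (V.T @ r)
        V = np.column_stack([V, r / np.linalg.norm(r)])
        if k > 0 and np.linalg.norm(x_new - x) < tol * np.linalg.norm(x):
            return x_new, mu
        x = x_new
    return x, mu
```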

3.3 Smoothed GCV for impulse noise

We describe a modified version of the GCV method that, as we will see in Section 4, is well suited for impulse noise. The Euclidean norm of impulse noise is infinite, and therefore the numerator in G(μ), see (5), may be infinite; in computations in finite precision arithmetic it is extremely large. In either case it may be difficult to determine the minimum of G(μ). To remedy this issue, we smooth the data vector bδ by convolving it with a Gaussian filter with small variance. Let \(\mathbf {k}_{\nu ^{2}}\in {\mathbb R}^{m}\) represent a sampling of a Gaussian function with mean zero and variance ν2, and define the smoothed vector

$$ \mathbf{b}^{\delta}_{\text{smooth}}=\mathbf{k}_{\nu^{2}}*\mathbf{b}^{\delta}, $$

where ∗ denotes the convolution operator. It is well known that if we assume periodic boundary conditions, then convolution can be carried out easily with the aid of the Fourier transform \(\mathfrak {F}\). We have

$$ \mathfrak{F}(\mathbf{b}^{\delta}_{\text{smooth}})=\mathfrak{F}(\mathbf{k}_{\nu^{2}})\odot\mathfrak{F}(\mathbf{b}^{\delta}), $$

where ⊙ denotes element-wise multiplication. Moreover, if bδ represents the vectorization of a multi-dimensional array, e.g., an image, it is beneficial to let \(\mathbf {k}_{\nu ^{2}}\) be the vectorization of a discretized multi-dimensional Gaussian function.
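A sketch of the smoothing step for a one-dimensional data vector follows; it is our own code, and the construction of the periodic Gaussian kernel is an illustrative choice.

```python
import numpy as np

def smooth_periodic(b_delta, nu2=1.0):
    """Convolve b_delta with a Gaussian of variance nu2, periodic boundary conditions."""
    m = b_delta.size
    t = np.minimum(np.arange(m), m - np.arange(m))   # periodic distance to index 0
    k = np.exp(-t**2 / (2.0 * nu2))
    k /= k.sum()                                     # normalized kernel k_{nu^2}
    return np.real(np.fft.ifft(np.fft.fft(k) * np.fft.fft(b_delta)))
```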

Using \(\mathbf {b}^{\delta }_{\text {smooth}}\) we can define the smoothed version of the GCV functional by

$$ G_{\text{smooth}}(\mu)=\frac{\| \mathbf{b}^{\delta}_{\text{smooth}}-A\mathbf{x}_{\mu} \|_{2}^{2}} {\left( \text{\textsf{trace}}\left( I-A\left( A^{T}A+\mu L^{T}L\right)^{-1}A^{T}\right)\right)^{2}}. $$

This functional is the standard GCV functional (5) with the data vector bδ replaced by the smoothed data vector \(\mathbf {b}^{\delta }_{\text {smooth}}\). The MM-GKS-GCV-Smooth algorithm is obtained by minimizing Gsmooth at each iteration in Algorithm 1 instead of G.

4 Numerical examples

This section shows some numerical examples that illustrate the performance of the proposed methods. The examples are concerned with the restoration of images that have been contaminated by noise and space-invariant blur, as well as with computerized tomography; see [28] for details on image restoration and [9] for a description of computerized tomography.

We set q = 0.1 in all examples, while the value of p depends on the type of noise the data vector bδ is corrupted by. For white Gaussian noise, we set p = 2; for other kinds of noise, we let p = 0.8.

We would like to compute a sparse solution in the framelet domain and therefore use a two-level framelet analysis operator as regularization operator L. Framelets are extensions of wavelets. They are defined as follows:

Definition 2

Let \(\mathcal {W}\in {\mathbb R}^{r\times n}\) with \(n\leq r\). The set of the rows of \(\mathcal {W}\) is a framelet system for \(\mathbb {R}^{n}\) if \(\forall \mathbf {x}\in {\mathbb R}^{n}\) it holds

$$ \| \mathbf{x} \|_{2}^{2}=\sum\limits_{j=1}^{r}\left(\mathbf{y}_{j}^{T}\mathbf{x}\right)^{2}, $$

where \(\mathbf {y}_{j}\in \mathbb {R}^{n}\) is the j th row of \(\mathcal {W}\) (written as a column vector), i.e., \(\mathcal {W}=\left [\mathbf {y}_{1},\ldots ,\mathbf {y}_{r}\right ]^{T}\). The matrix \(\mathcal {W}\) is referred to as an analysis operator and \(\mathcal {W}^{T}\) as a synthesis operator. The matrix \(\mathcal {W}\) defines a tight frame if and only if \(\mathcal {W}^{T}\mathcal {W}=I\).

We remark that in general \(\mathcal {W}\mathcal {W}^{T}\ne I\), unless r = n and the framelets are orthonormal. Note that \({\mathcal N}({\mathcal W})=\{\mathbf {0}\}\). Therefore, property (3) holds. Tight frames have been used in many image restoration applications including inpainting and deblurring; see, e.g., [2,3,4,5,6, 10,11,12,13].

We use framelets determined by linear B-splines. In one space-dimension they are made up of a low-pass filter W0 and two high-pass filters W1 and W2, whose masks are

$$ \mathbf{w}^{(0)}=\frac{1}{4}\left[1, 2, 1\right], \mathbf{w}^{(1)}= \frac{\sqrt{2}}{4}\left[1, 0, -1\right], \mathbf{w}^{(2)}=\frac{1}{4}\left[-1, 2, -1\right], $$

respectively. We can derive the synthesis operator \(\mathcal {W}\) starting from these masks by imposing reflexive boundary conditions. These boundary conditions secure that \(\mathcal {W}^{T}\mathcal {W}= I\). Defining

$$ W_{0}=\frac{1}{4}\left[\begin{array}{ccccc} 3&1 &0 &\ldots&0\\ 1&2 &1 & & \\ &\ddots&\ddots&\ddots& \\ & & 1& 2&1\\ 0&\ldots& 0& 1&3 \end{array}\right],\quad W_{1}=\frac{\sqrt{2}}{4}\left[\begin{array}{ccccc} -1&1 &0 &\ldots& 0\\ -1&0 &1 & & \\ &\ddots&\ddots&\ddots& \\ & & -1& 0& 1\\ 0&\ldots& 0& -1& 1 \end{array}\right], $$

and

$$ W_{2}=\frac{1}{4}\left[\begin{array}{ccccc} 1&-1 &0 &\ldots& 0\\ -1&2 &-1 & & \\ &\ddots&\ddots&\ddots& \\ & & -1& 2&-1\\ 0&\ldots& 0& -1& 1 \end{array}\right]. $$

The analysis operator is obtained by stacking the three matrices above

$$ \mathcal{W}=\left[\begin{array}{c}W_{0}\\W_{1}\\W_{2} \end{array} \right]. $$

We can extend the operator above to two space-dimensions by means of tensor products

$$ W_{ij} = W_{i} \otimes W_{j}, \quad i, j = 0, 1, 2. $$

This yields the analysis operator

$$ \mathcal{W}=\left[\begin{array}{c}W_{00}\\W_{01}\\\vdots\\W_{22} \end{array} \right]. $$

The matrix W00 is a low-pass filter. All the other matrices Wij contain at least one high-pass filter in some direction. In our examples, we set \(L=\mathcal {W}\).
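The construction of the analysis operator is easy to reproduce. The sketch below (our own code, following the matrices displayed above) builds the one-dimensional operator for linear B-spline framelets with reflexive boundary conditions, extends it to two space-dimensions by Kronecker products, and verifies the tight-frame property WTW = I.

```python
import numpy as np

def framelet_1d(n):
    """Analysis operator [W_0; W_1; W_2] for linear B-spline framelets."""
    W0 = 2.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
    W0[0, 0] = W0[-1, -1] = 3.0
    W1 = np.eye(n, k=1) - np.eye(n, k=-1)
    W1[0, 0], W1[-1, -1] = -1.0, 1.0
    W2 = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    W2[0, 0] = W2[-1, -1] = 1.0
    return np.vstack([W0 / 4.0, np.sqrt(2.0) / 4.0 * W1, W2 / 4.0])

def framelet_2d(n):
    """Two-dimensional analysis operator built from the tensor products W_i x W_j."""
    Ws = np.vsplit(framelet_1d(n), 3)                # the blocks W_0, W_1, W_2
    return np.vstack([np.kron(Wi, Wj) for Wi in Ws for Wj in Ws])

W = framelet_1d(16)
print(np.allclose(W.T @ W, np.eye(16)))              # tight frame: prints True
```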

We compare generalized cross validation with two other approaches to selecting the regularization parameter in (2): cross validation as described in [8] and the discrepancy principle as discussed in [7]. We also compare the quality of the computed solutions determined by these methods with the solution obtained with the optimal choice of the regularization parameter. The latter choice of the regularization parameter is not practical, because it requires that the desired solution xtrue be known; the regularization parameter is then determined to yield a computed solution that minimizes the restoration error

$$ \text{RRE}(\mathbf{x})=\frac{\| \mathbf{x}-\mathbf{x}_{\text{true}} \|_{2}}{\| \mathbf{x}_{\text{true}} \|_{2}}, $$
(17)

where xtrue denotes the desired solution.

The method described in [7] determines a sequence of regularization parameters, in a similar fashion as in this work, by requiring that the computed approximate solution in each iteration satisfies the discrepancy principle (DP). To perform well, this method requires that a fairly accurate estimate of the norm of the error ∥η2 in the data vector bδ be known. This norm is not meaningful for impulse noise. Therefore, the discrepancy principle cannot be used when bδ is contaminated by impulse noise.

Two methods for determining the regularization parameter by cross-validation are described in [8]. The first method determines the regularization parameter so that missing data are well predicted. This classical approach to carrying out cross-validation is referred to as CV in the tables. The other method, which we refer to as modified cross-validation (MCV), seeks to determine the regularization parameter so that missing entries of the computed solution are well predicted. The MCV method typically gives approximations of the desired solution xtrue of somewhat higher quality than the CV method; see [8] for illustrations. This paper compares different parameter choice rules for the minimization problem (2). A comparison with other methods for solving the minimization problem (2), which require the regularization parameter to be hand-tuned, can be found in [30, 35].

The restoration error (17) and the structural similarity index (SSIM) are compared in our tests. The definition of the latter is somewhat involved; here we just note that the SSIM measures how well the overall structure of an image is recovered, and that the larger the index, the better the reconstruction. In particular, the highest value achievable is 1; see [40] for details.
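For completeness, the two quality measures can be computed as follows. This is our own sketch; the paper does not state which SSIM implementation was used, so the call to scikit-image below is an assumption.

```python
import numpy as np
from skimage.metrics import structural_similarity

def rre(x, x_true):
    """Relative restoration error (17)."""
    return np.linalg.norm(x - x_true) / np.linalg.norm(x_true)

def ssim(img, img_true):
    """Structural similarity index; 1 corresponds to a perfect reconstruction."""
    return structural_similarity(img, img_true,
                                 data_range=float(img_true.max() - img_true.min()))
```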

For all examples we set ε = 1 in (11); see [7] for a discussion on the choice of ε. We terminate the iterations either after 100 iterations or when

$$ \frac{\| \mathbf{x}^{(k+1)}-\mathbf{x}^{(k)} \|_{2}}{\| \mathbf{x}^{(k)} \|_{2}}<10^{-4}. $$

Finally, for the computations of \(\mathbf {b}^{\delta }_{\text {smooth}}\), we set ν2 = 1 in all examples.

All the computations were performed using Matlab R2018b running on a Windows 10 laptop computer with an Intel i7-8750H CPU @2.20 GHz and 16 GB of RAM.

Tomography

The first example considers a tomography problem. The example is constructed with the aid of the IR Tools Matlab package [21]. The phantom has 256 × 256 pixels, and we consider 90 equispaced angles in the interval [0,π) and 362 rays per angle. This results in a matrix \(A\in {\mathbb R}^{32580\times 65536}\). We added 1% of white Gaussian noise to the sinogram, i.e., ∥η∥2 = 0.01∥b∥2. Since the noise is Gaussian, we set p = 2 and, as in all considered examples, q = 0.1. We show the true phantom and the sinogram in Fig. 2.

Fig. 2 Tomography test case: a True phantom (256 × 256 pixels), b sinogram (90 equispaced angles in [0,π), 362 measurements per angle, and 1% white Gaussian noise)

Table 1 reports the RRE and SSIM for the computed solutions determined by all the parameter choice rules considered. All the rules can be seen to perform very similarly. Obviously, the optimal parameter provides the best reconstruction in terms of RRE. This parameter choice rule requires knowledge of the exact solution xtrue and therefore cannot be used in real applications. The DP provides the second best approximation of xtrue; however, it demands knowledge of the norm of the noise η. The CV, MCV, and GCV methods do not require any knowledge of the noise. We can observe that, although the MCV method slightly outperforms the GCV method in terms of the RRE (but not in terms of the SSIM), the GCV method demands much less CPU time than the CV and MCV methods. This is due to the fact that both the CV and MCV methods require several runs of the MM-GKS algorithm, while the GCV method only carries out a single run. Finally, we note that the GCV-Smooth method does not perform well. This is due to the fact that the noise in the data bδ is purely Gaussian. The smoothing of bδ leads to an underestimation of the amount of noise in bδ, which results in a too small value of the regularization parameter; the computed solution is under-regularized. These considerations are confirmed by the visual comparison of the reconstructions shown in Fig. 3.

Table 1 RRE, SSIM, and CPU time in seconds obtained with the different choice rules for the regularization parameter
Fig. 3 Tomography test case reconstructions: a Optimal, b DP, c GCV, d GCV-Smooth

Peppers

Our second example considers an image deblurring problem. We blur the image in Fig. 4a with the PSF shown in Fig. 4b, and add 25% of impulse noise. This yields the blurred and noisy image displayed in Fig. 4c. When constructing this example, we cut the boundaries of the blurred image to simulate realistic data; see [28] for more details. Since the image is generic, we impose reflexive boundary conditions. Moreover, since the blurred image is corrupted by impulse noise, we set p = 0.8.

Fig. 4 Peppers test case: a True image (246 × 246 pixels), b PSF (9 × 9 pixels), c Blurred and noisy image (25% of impulse noise)

We report the results obtained in Table 1. Since the noise is not Gaussian, we cannot apply the discrepancy principle to determine the regularization parameter. However, the CV, MCV, and GCV methods can be applied. We also report the RRE and SSIM for the optimal choice of the regularization parameter. We can see that the GCV and GCV-Smooth methods provide the best reconstructions. In particular, the reconstruction obtained with GCV is slightly better than the optimal one, and the one obtained with GCV-Smooth is significantly better. While this may appear strange, it is due to the construction of the generalized Krylov subspace. In particular, since the solution subspace depends on the approximate solutions generated during the computations, the solution subspaces determined by the various methods differ, and this may lead to different results. We remark that, while we can show convergence of the computed iterates for stationary methods (with μ fixed), we have no such result for the iterates determined by the CV and GCV methods. Finally, we observe that the use of the smoothed GCV function yields a very accurate reconstruction which, in particular, is more accurate than the one obtained with the standard GCV method. These observations are confirmed by visual inspection of the reconstructions displayed in Fig. 5.

Fig. 5 Peppers test case reconstructions: a Optimal, b MCV, c GCV, d GCV-Smooth

Cameraman

The last example describes another image deblurring problem. We consider the exact image in Fig. 6a and blur it with the PSF in Fig. 6b. Adding 20% of salt-and-pepper noise produces the blurred and noisy image in Fig. 6c. As in the previous example, we cut the boundary of the blurred image to simulate realistic data. Since the image is generic, we impose reflexive boundary conditions. Since the PSF is quadrantically symmetric, the matrix A obtained can be diagonalized by the discrete cosine transform. Because LTL = I, we could therefore avoid the application of generalized Krylov subspaces in the MM-GKS method. However, for consistency we use the same approach as in the previous examples. Since the noise is not Gaussian, we set p = 0.8, as in the example above.

Fig. 6 Cameraman test case: a True image (252 × 252 pixels), b PSF (12 × 12 pixels), c Blurred and noisy image (20% of salt-and-pepper noise)

Table 1 shows the RRE and SSIM values obtained with the different parameter choice rules. We observe that the GCV and GCV-Smooth methods provide more accurate approximations of xtrue in terms of RRE than both the CV and MCV methods, at a much lower computational cost. Moreover, the RRE obtained with the GCV method is close to the optimal one, and the RRE for the restoration determined by the GCV-Smooth method is significantly smaller. Figure 7 displays the obtained reconstructions. Visual inspection shows that the reconstruction obtained with the GCV-Smooth method is very detailed, but “grayish”. The latter is due to the presence of some pixels whose values are significantly larger than 255 (the value corresponding to white; 0 corresponds to black), which causes a rescaling before plotting. Projecting the reconstruction orthogonally onto the cube [0,255]n gives the image in Fig. 8. We can observe that the gray-scale level is now correct and the reconstruction is very close to the original image.

Fig. 7 Cameraman test case reconstructions: a Optimal, b MCV, c GCV, d GCV-Smooth

Fig. 8 Cameraman test case reconstruction obtained using the MM-GKS-GCV-Smooth method and projecting the approximation onto the cube [0,255]n

5 Conclusions

We propose the GCV method for choosing the regularization parameter in the ℓp-ℓq regularization method. This method is easy to implement and computationally fairly inexpensive, since the original minimization problem (2) is projected into a generalized Krylov subspace of small to moderate dimension. A comparison with some other available methods shows that the proposed approach is competitive and provides accurate approximations of the desired solution. Moreover, the proposed approach is completely automatic and does not require hand-tuning of any parameter. It therefore can be considered for real-world applications.