1 Introduction

1.1 Nonlinear Systems and Empirical Loss Minimization

A wide spectrum of science and engineering applications calls for solutions to a nonlinear system of equations. Imagine we have collected a set of data points \({\varvec{y}}=\{y_{j}\}_{1\le j \le m}\), generated by a nonlinear sensing system,

$$\begin{aligned} y_j \,\approx \,\mathcal {A}_j \big ({\varvec{x}}^{\star }\big ), \quad 1\le j\le m, \end{aligned}$$

where \({\varvec{x}}^{\star }\) is the unknown object of interest and the \(\mathcal {A}_j\)’s are certain nonlinear maps known a priori. Can we reconstruct the underlying object \({\varvec{x}}^{\star }\) in a faithful yet efficient manner? Problems of this kind abound in information and statistical science, prominent examples including low-rank matrix recovery [19, 64], robust principal component analysis [17, 21], phase retrieval [20, 59], neural networks [103, 132], to name just a few.

In principle, it is possible to attempt reconstruction by searching for a solution that minimizes the empirical loss, namely

$$\begin{aligned} \text {minimize}_{{\varvec{x}}}\quad f({\varvec{x}})= \sum _{j=1}^m \big | y_j -\mathcal {A}_j({\varvec{x}}) \big |^{2}. \end{aligned}$$
(1)

Unfortunately, this empirical loss minimization problem is, in many cases, nonconvex, making it NP-hard in general. This issue of nonconvexity comes up in, for example, several representative problems that epitomize the structures of nonlinear systems encountered in practice.Footnote 1

  • Phase retrieval/solving quadratic systems of equations Imagine we are asked to recover an unknown object \({\varvec{x}}^{\star }\in \mathbb {R}^{n}\), but are only given the square modulus of certain linear measurements about the object, with all sign/phase information of the measurements missing. This arises, for example, in X-ray crystallography [15], and in latent-variable models where the hidden variables are captured by the missing signs [33]. To fix ideas, assume we would like to solve for \({\varvec{x}}^{\star }\in \mathbb {R}^n \) in the following quadratic system of m equations

    $$\begin{aligned} y_{j}=\big ({\varvec{a}}_{j}^{\top }{\varvec{x}}^{\star }\big )^{2},\qquad 1\le j\le m, \end{aligned}$$

    where \(\{{\varvec{a}}_{j}\}_{1\le j \le m}\) are the known design vectors. One strategy is thus to solve the following problem

    $$\begin{aligned} \text {minimize}_{{\varvec{x}}\in \mathbb {R}^{n}}\quad f({\varvec{x}})=\frac{1}{4m}\sum _{j=1}^{m}\Big [y_{j}-\big ({\varvec{a}}_{j}^{\top }{\varvec{x}}\big )^{2}\Big ]^{2}. \end{aligned}$$
    (2)
  • Low-rank matrix completion In many scenarios such as collaborative filtering, we wish to make predictions about all entries of an (approximately) low-rank matrix \({\varvec{M}}^{\star }\in \mathbb {R}^{n\times n}\) (e.g., a matrix consisting of users’ ratings about many movies), yet only a highly incomplete subset of the entries are revealed to us [19]. For clarity of presentation, assume \({\varvec{M}}^{\star }\) to be rank-r (\(r\ll n\)) and positive semidefinite (PSD), i.e., \({\varvec{M}}^{\star }={\varvec{X}}^{\star }{\varvec{X}}^{\star \top }\) with \({\varvec{X}}^{\star }\in \mathbb {R}^{n\times r}\), and suppose we have only seen the entries

    $$\begin{aligned} Y_{j,k} = M^{\star }_{j,k} = ( {\varvec{X}}^{\star } {\varvec{X}}^{\star \top } )_{j,k} ,\qquad (j,k)\in \Omega \end{aligned}$$

    within some index subset \(\Omega \) of cardinality m. These entries can be viewed as nonlinear measurements about the low-rank factor \({\varvec{X}}^{\star }\). The task of completing the true matrix \({\varvec{M}}^{\star }\) can then be cast as solving

    $$\begin{aligned} \text {minimize}_{{\varvec{X}}\in \mathbb {R}^{n\times r}}\quad f({\varvec{X}}) = \frac{n^2}{4m} \sum _{(j,k)\in \Omega }\left( Y_{j,k}-{\varvec{e}}_{j}^{\top }{\varvec{X}}{\varvec{X}}^{\top }{\varvec{e}}_{k}\right) ^{2}, \end{aligned}$$
    (3)

    where the \({\varvec{e}}_{j}\)’s stand for the canonical basis vectors in \(\mathbb {R}^n\).

  • Blind deconvolution/solving bilinear systems of equations Imagine we are interested in estimating two signals of interest \({\varvec{h}}^{\star },{\varvec{x}}^{\star }\in \mathbb {C}^{K}\), but only get to collect a few bilinear measurements about them. This problem arises from the mathematical modeling of blind deconvolution [3, 76], which is frequently encountered in astronomy, imaging, communications, and beyond; the goal is to recover two signals from their convolution. Put more formally, suppose we have acquired m bilinear measurements taking the following form

    $$\begin{aligned} y_{j}={\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}^{\star }{\varvec{x}}^{\star \textsf {H} }{\varvec{a}}_{j},\qquad 1\le j\le m, \end{aligned}$$

    where \({\varvec{a}}_{j},{\varvec{b}}_{j}\in \mathbb {C}^{K}\) are distinct design vectors (e.g., Fourier and/or random design vectors) known a priori and \({\varvec{b}}_{j}^{\textsf {H} }\) denotes the conjugate transpose of \({\varvec{b}}_{j}\). In order to reconstruct the underlying signals, one asks for solutions to the following problem

    $$\begin{aligned} \text {minimize}_{{\varvec{h}},{\varvec{x}}\in \mathbb {C}^{K}}\quad f({\varvec{h}},{\varvec{x}})= \sum _{j=1}^{m}\big |y_{j}-{\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}{\varvec{x}}^{\textsf {H} }{\varvec{a}}_{j}\big |^{2}. \end{aligned}$$

1.2 Nonconvex Optimization via Regularized Gradient Descent

First-order methods have been a popular heuristic in practice for solving nonconvex problems including (1). For instance, a widely adopted procedure is gradient descent, which follows the update rule

$$\begin{aligned} {\varvec{x}}^{t+1}={\varvec{x}}^{t}-\eta _{t}\nabla f\big ({\varvec{x}}^{t}\big ),\qquad t\ge 0, \end{aligned}$$
(4)

where \(\eta _{t}\) is the learning rate (or step size) and \({\varvec{x}}^{0}\) is some proper initial guess. Given that it only performs a single gradient calculation \(\nabla f(\cdot )\) per iteration (which typically can be completed within near-linear time), this paradigm emerges as a candidate for solving large-scale problems. The question is whether \({\varvec{x}}^{t}\) converges to the global solution and, if so, how long convergence takes, especially since (1) is highly nonconvex.
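For concreteness, the update rule (4) amounts to only a few lines of code. The following is a minimal, generic sketch (not tied to any particular model); `grad_f`, `x0`, and the step-size schedule are placeholders to be supplied by the problem at hand.

```python
import numpy as np

def vanilla_gradient_descent(grad_f, x0, step_size, num_iters=200):
    """Run the plain update x^{t+1} = x^t - eta_t * grad f(x^t), cf. (4).

    grad_f    : callable returning the gradient at a point
    x0        : initial guess (e.g., produced by a spectral method)
    step_size : a constant, or a callable t -> eta_t
    """
    x = np.asarray(x0, dtype=float).copy()
    iterates = [x.copy()]
    for t in range(num_iters):
        eta = step_size(t) if callable(step_size) else step_size
        x = x - eta * grad_f(x)          # the update rule (4)
        iterates.append(x.copy())
    return x, iterates
```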

Fortunately, despite the worst-case hardness, appealing convergence properties have been discovered in various statistical estimation problems, the blessing being that the statistical models help rule out ill-behaved instances. For the average case, the empirical loss often enjoys benign geometry, in a local region (or at least along certain directions) surrounding the global optimum. In light of this, an effective nonconvex iterative method typically consists of two stages:

  1. a carefully designed initialization scheme (e.g., spectral method);

  2. an iterative refinement procedure (e.g., gradient descent).

This strategy has recently spurred a great deal of interest, owing to its promise of achieving computational efficiency and statistical accuracy at once for a growing list of problems (e.g., [18, 25, 32, 61, 64, 76, 78, 107]). However, rather than directly applying gradient descent (4), existing theory often suggests enforcing proper regularization. Such explicit regularization enables improved computational convergence by properly “stabilizing” the search directions. The following regularization schemes, among others, have been suggested to obtain or improve computational guarantees. We refer to these algorithms collectively as Regularized Gradient Descent.

  • Trimming/truncation, which discards/truncates a subset of the gradient components when forming the descent direction (see the illustrative sketch after this list). For instance, when solving quadratic systems of equations, one can modify the gradient descent update rule as

    $$\begin{aligned} {\varvec{x}}^{t+1}={\varvec{x}}^{t}-\eta _{t}\mathcal {T}\left( \nabla f\big ({\varvec{x}}^{t}\big )\right) , \end{aligned}$$
    (5)

    where \(\mathcal {T}\) is an operator that effectively drops samples bearing too much influence on the search direction. This strategy [25, 118, 126] has been shown to enable exact recovery with linear-time computational complexity and optimal sample complexity.

  • Regularized loss, which attempts to optimize a regularized empirical risk

    $$\begin{aligned} {\varvec{x}}^{t+1}= {\varvec{x}}^{t}-\eta _{t} \left( \nabla f\big ({\varvec{x}}^{t}\big )+\nabla R\big ({\varvec{x}}^{t}\big )\right) , \end{aligned}$$
    (6)

    where \(R({\varvec{x}})\) stands for an additional penalty term in the empirical loss. For example, in low-rank matrix completion \(R(\cdot )\) imposes penalty based on the \(\ell _{2}\) row norm [64, 107] as well as the Frobenius norm [107] of the decision matrix, while in blind deconvolution, it penalizes the \(\ell _{2}\) norm as well as certain component-wise incoherence measure of the decision vectors [58, 76, 82].

  • Projection, which projects the iterates onto certain sets based on prior knowledge, that is,

    $$\begin{aligned} {\varvec{x}}^{t+1}=\mathcal {P}\left( {\varvec{x}}^{t}-\eta _{t}\nabla f\big ({\varvec{x}}^{t}\big )\right) , \end{aligned}$$
    (7)

    where \(\mathcal {P}\) is a certain projection operator used to enforce, for example, incoherence properties. This strategy has been employed in both low-rank matrix completion [32, 131] and blind deconvolution [76].
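To make the trimming idea (5) concrete, here is a minimal sketch for the quadratic loss (2): per-sample gradients whose residuals are abnormally large are dropped before averaging. The particular rule below (keeping a fixed fraction of the samples with the smallest residuals) is an illustrative, hypothetical choice and is not the exact truncation rule analyzed in [25, 118, 126].

```python
import numpy as np

def truncated_gradient(x, A, y, keep_frac=0.9):
    """One illustrative trimming operator T for the loss (2).

    A : (m, n) matrix whose rows are the design vectors a_j
    y : (m,) vector of quadratic measurements
    Drops the (1 - keep_frac) fraction of samples with the largest
    residuals |y_j - (a_j^T x)^2| before averaging per-sample gradients.
    """
    m = A.shape[0]
    Ax = A @ x                                # a_j^T x for all j
    residuals = Ax**2 - y
    keep = np.argsort(np.abs(residuals))[: int(keep_frac * m)]
    # per-sample gradient of (2): ((a_j^T x)^2 - y_j) (a_j^T x) a_j
    g = (residuals[keep] * Ax[keep])[:, None] * A[keep]
    return g.mean(axis=0)

# a single trimmed update, cf. (5):  x_next = x - eta * truncated_gradient(x, A, y)
```

The regularized-loss update (6) and the projected update (7) can be sketched in the same way, by adding a penalty gradient \(\nabla R\) or composing each step with a projection \(\mathcal {P}\).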

Equipped with such regularization procedures, existing works uncover appealing computational and statistical properties under various statistical models. Table 1 summarizes the performance guarantees derived in the prior literature; for simplicity, only orderwise results are provided.

Remark 1

There is another role of regularization commonly studied in the literature, which exploits prior knowledge about the structure of the unknown object, such as sparsity to prevent over-fitting and improve statistical generalization ability. This is, however, not the focal point of this paper, since we are primarily pursuing solutions to (1) without imposing additional structures.

Table 1 Prior theory for gradient descent (with spectral initialization)

1.3 Regularization-Free Procedures?

The regularized gradient descent algorithms, while exhibiting appealing performance, usually introduce more algorithmic parameters that need to be carefully tuned based on the assumed statistical models. In contrast, vanilla gradient descent (cf. (4)), which is perhaps the very first method that comes to mind and requires minimal tuning, is far less understood (cf. Table 1). Take matrix completion and blind deconvolution as examples: to the best of our knowledge, there is currently no theoretical guarantee derived for vanilla gradient descent.

The situation is better for phase retrieval: the local convergence of vanilla gradient descent, also known as Wirtinger flow (WF), has been investigated in [18, 96]. Under i.i.d. Gaussian design and with near-optimal sample complexity, WF (combined with spectral initialization) provably achieves \(\epsilon \)-accuracy (in a relative sense) within \(O\big (n\log \left( {1} / {\epsilon }\right) \big )\) iterations. Nevertheless, this computational guarantee is significantly outperformed by the regularized version (called truncated Wirtinger flow [25]), which requires only \(O\big (\log \left( {1} / {\epsilon }\right) \big )\) iterations to converge with similar per-iteration cost. On closer inspection, the high computational cost of WF is largely due to the vanishingly small step size \(\eta _{t}=O\big ({1} / ({n\Vert {\varvec{x}}^{\star }\Vert _{2}^{2}})\big )\), and hence the slow movement, suggested by the theory in [18]. While this is already the largest step size allowed by the theory in [18], it is considerably more conservative than the choice \(\eta _{t}=O\big ({1} / {\Vert {\varvec{x}}^{\star }\Vert _{2}^{2}}\big )\) theoretically justified for the regularized version [25, 126].

The lack of understanding and suboptimal results about vanilla gradient descent raise a very natural question: Are regularization-free iterative algorithms inherently suboptimal when solving nonconvex statistical estimation problems of this kind?

1.4 Numerical Surprise of Unregularized Gradient Descent

Fig. 1

a Relative \(\ell _{2}\) error of \({\varvec{x}}^{t}\) (modulo the global phase) versus iteration count for phase retrieval under i.i.d. Gaussian design, where \(m=10n\) and \(\eta _{t}=0.1\). b Relative error of \({\varvec{X}}^{t}{\varvec{X}}^{t\top }\) (measured by \(\left\| \cdot \right\| _{\mathrm {F}},\left\| \cdot \right\| ,\left\| \cdot \right\| _{\infty }\)) versus iteration count for matrix completion, where \(n=1000\), \(r=10\), \(p=0.1\), and \(\eta _{t}=0.2\). c Relative error of \({\varvec{h}}^{t}{\varvec{x}}^{t\,\textsf {H} }\) (measured by \(\left\| \cdot \right\| _{\mathrm {F}}\)) versus iteration count for blind deconvolution, where \(m=10K\) and \(\eta _{t}=0.5\)

To answer the preceding question, it is perhaps best to first collect some numerical evidence. In what follows, we test the performance of vanilla gradient descent for phase retrieval, matrix completion, and blind deconvolution, using a constant step size. For all of these experiments, the initial guess is obtained by means of the standard spectral method. Our numerical findings are as follows:

  • Phase retrieval For each n, set \(m=10n\), take \({\varvec{x}}^{\star }\in \mathbb {R}^{n}\) to be a random vector with unit norm, and generate the design vectors \({\varvec{a}}_{j}\overset{\text {i.i.d.}}{\sim }\mathcal {N}({\varvec{0}},{\varvec{I}}_{n})\), \(1\le j\le m\). Figure 1a illustrates the relative \(\ell _{2}\) error \(\min \{\Vert {\varvec{x}}^{t}-{\varvec{x}}^{\star }\Vert _{2},\Vert {\varvec{x}}^{t}+{\varvec{x}}^{\star }\Vert _{2}\}/\Vert {\varvec{x}}^{\star }\Vert _{2}\) (modulo the unrecoverable global phase) versus the iteration count. The results are shown for \(n=20,100,200,1000\), with the step size taken to be \(\eta _{t}=0.1 \) in all settings.

  • Matrix completion Generate a random PSD matrix \({\varvec{M}}^{\star }\in \mathbb {R}^{n \times n}\) with dimension \(n=1000\), rank \(r=10\), and all nonzero eigenvalues equal to one. Each entry of \({\varvec{M}}^{\star }\) is observed independently with probability \(p=0.1\). Figure 1b plots the relative error \(\Vert {\varvec{X}}^{t}{\varvec{X}}^{t\top }-{\varvec{M}}^{\star }\Vert / \Vert {\varvec{M}}^{\star }\Vert \) versus the iteration count, where \(\Vert \cdot \Vert \) can either be the Frobenius norm \(\left\| \cdot \right\| _{\mathrm {F}}\), the spectral norm \(\Vert \cdot \Vert \), or the entrywise \(\ell _{\infty }\) norm \(\Vert \cdot \Vert _{\infty }\). Here, we pick the step size as \(\eta _{t}=0.2\).

  • Blind deconvolution For each \(K\in \left\{ 20,100,200,1000\right\} \) and \(m=10K\), generate the design vectors \({\varvec{a}}_{j}\overset{\text {i.i.d.}}{\sim }\mathcal {N}({\varvec{0}}, \frac{1}{2}{\varvec{I}}_{K})+i\mathcal {N}({\varvec{0}},\frac{1}{2}{\varvec{I}}_{K})\) for \(1\le j\le m\) independently,Footnote 2 and the \({\varvec{b}}_{j}\)’s are drawn from a partial discrete Fourier transform (DFT) matrix (to be described in Sect. 3.3). The underlying signals \({\varvec{h}}^{\star },{\varvec{x}}^{\star }\in \mathbb {C}^{K}\) are produced as random vectors with unit norm. Figure 1c plots the relative error \(\Vert {\varvec{h}}^{t}{\varvec{x}}^{t\textsf {H} }-{\varvec{h}}^{\star }{\varvec{x}}^{\star \textsf {H} }\Vert _{\mathrm {F}}/ \Vert {\varvec{h}}^{\star } {\varvec{x}}^{\star \textsf {H} }\Vert _{\mathrm {F}}\) versus the iteration count, with the step size taken to be \(\eta _{t} = 0.5\) in all settings.

In all of these numerical experiments, vanilla gradient descent enjoys remarkable linear convergence, always yielding an accuracy of \(10^{-5}\) (in a relative sense) within around 200 iterations. In particular, for the phase retrieval problem, the step size is taken to be \(\eta _{t}=0.1 \) even though the problem size varies from \(n=20\) to \(n=1000\). The consequence is that the convergence rates change very little as the problem size varies. In comparison, the theory published in [18] seems overly pessimistic, as it suggests a diminishing step size inversely proportional to n and, as a result, an iteration complexity that worsens as the problem size grows.
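These experiments are straightforward to reproduce. The sketch below follows the phase retrieval setup of Fig. 1a (i.i.d. Gaussian design, \(m=10n\), \(\eta _{t}=0.1\)); the spectral initializer used here (the leading eigenvector of \(\frac{1}{m}\sum _j y_j {\varvec{a}}_j{\varvec{a}}_j^{\top }\), rescaled by \(\sqrt{\frac{1}{m}\sum _j y_j}\)) is one common variant of the standard spectral method and stands in for the initialization used in our experiments.

```python
import numpy as np

def spectral_init(A, y):
    """One common spectral initializer: leading eigenvector of
    Y = (1/m) sum_j y_j a_j a_j^T, rescaled so that ||x0||^2 ~ mean(y)."""
    m = A.shape[0]
    Y = (A * y[:, None]).T @ A / m
    _, V = np.linalg.eigh(Y)
    return np.sqrt(np.mean(y)) * V[:, -1]

def wf_gradient(x, A, y):
    """Gradient of the quadratic loss (2)."""
    Ax = A @ x
    return ((Ax**2 - y) * Ax) @ A / A.shape[0]

def run_phase_retrieval(n=100, eta=0.1, num_iters=200, seed=0):
    rng = np.random.default_rng(seed)
    m = 10 * n
    x_star = rng.standard_normal(n)
    x_star /= np.linalg.norm(x_star)             # unit-norm ground truth
    A = rng.standard_normal((m, n))              # a_j ~ N(0, I_n)
    y = (A @ x_star) ** 2
    x = spectral_init(A, y)
    errs = []
    for _ in range(num_iters):
        x = x - eta * wf_gradient(x, A, y)
        # relative l2 error modulo the global sign
        errs.append(min(np.linalg.norm(x - x_star),
                        np.linalg.norm(x + x_star)) / np.linalg.norm(x_star))
    return errs
```

With these (illustrative) choices, the final relative error typically lands around \(10^{-5}\) or below after a couple hundred iterations, mirroring Fig. 1a.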

In addition, it has been empirically observed in prior literature [25, 76, 127] that vanilla gradient descent performs comparably with its regularized counterpart for phase retrieval and blind deconvolution. To complete the picture, we further conduct experiments on matrix completion. In particular, we follow the experimental setup for matrix completion used above. We vary p from 0.01 to 0.1 over 51 logarithmically spaced points. For each p, we apply vanilla gradient descent, projected gradient descent [32], and gradient descent with additional regularization terms [107], all with step size \(\eta = 0.2\), to 50 randomly generated instances. Successful recovery is declared if \(\Vert {\varvec{X}}^{t}{\varvec{X}}^{t\top } - {\varvec{M}}^{\star }\Vert _{\mathrm {F}} / \Vert {\varvec{M}}^{\star }\Vert _{\mathrm {F}} \le 10^{-5}\) within \(10^{4}\) iterations. Figure 2 reports the success rate versus the sampling rate. As can be seen, the phase transition of vanilla GD and that of GD with regularized cost are almost identical, whereas projected GD performs slightly better than the other two.

Fig. 2

Success rate versus sampling rate p over 50 Monte Carlo trials for matrix completion with \(n = 1000\) and \(r=10\)

In short, the above empirical results are surprisingly positive yet puzzling. Why was the computational efficiency of vanilla gradient descent unexplained or substantially underestimated in prior theory?

1.5 This Paper

The main contribution of this paper is toward demystifying the “unreasonable” effectiveness of regularization-free nonconvex iterative methods. As asserted in previous work, regularized gradient descent succeeds by properly enforcing/promoting certain incoherence conditions throughout the execution of the algorithm. In contrast, we discover that

  • Vanilla gradient descent automatically forces the iterates to stay incoherent with the measurement mechanism, thus implicitly regularizing the search directions.

This “implicit regularization” phenomenon is of fundamental importance, suggesting that vanilla gradient descent proceeds as if it were properly regularized. This explains the remarkably favorable performance of unregularized gradient descent in practice. Focusing on the three representative problems mentioned in Sect. 1.1, our theory guarantees both statistical and computational efficiency of vanilla gradient descent under random designs and spectral initialization. With near-optimal sample complexity, to attain \(\epsilon \)-accuracy,

  • Phase retrieval (informal) vanilla gradient descent converges in \(O\big (\log n\log \frac{1}{\epsilon }\big )\) iterations;

  • Matrix completion (informal) vanilla gradient descent converges in \(O\big (\log \frac{1}{\epsilon }\big )\) iterations;

  • Blind deconvolution (informal) vanilla gradient descent converges in \(O\big (\log \frac{1}{\epsilon }\big )\) iterations.

In other words, gradient descent provably achieves (nearly) linear convergence in all of these examples. Throughout this paper, an algorithm is said to converge (nearly) linearly to \({\varvec{x}}^{\star }\) in the noiseless case if the iterates \(\{{\varvec{x}}^t\}\) obey

$$\begin{aligned} \mathrm {dist}({\varvec{x}}^{t+1}, {\varvec{x}}^{\star }) \le (1-c)\, \mathrm {dist}({\varvec{x}}^{t}, {\varvec{x}}^{\star }), \qquad \forall t \ge 0 \end{aligned}$$

for some \(0<c\le 1\) that is (almost) independent of the problem size. Here, \(\mathrm {dist}(\cdot ,\cdot )\) can be any appropriate discrepancy measure.

As a by-product of our theory, gradient descent also provably controls the entrywise empirical risk uniformly across all iterations; for matrix completion, for instance, this yields entrywise control of the estimation error. Precise statements of these results are deferred to Sect. 3 and are briefly summarized in Table 2.

Notably, our study of implicit regularization suggests that the behavior of nonconvex optimization algorithms for statistical estimation needs to be examined in the context of statistical models, under which the objective function takes the form of a finite sum over samples. Our proof is accomplished via a leave-one-out perturbation argument, which is inherently tied to statistical models and leverages homogeneity across samples. Altogether, this allows us to localize benign landscapes for optimization and characterize finer dynamics not accounted for in generic gradient descent theory.

Table 2 Prior theory versus our theory for vanilla gradient descent (with spectral initialization)

1.6 Notations

Before continuing, we introduce several notations used throughout the paper. First of all, boldfaced symbols are reserved for vectors and matrices. For any vector \({\varvec{v}}\), we use \(\Vert {\varvec{v}}\Vert _2\) to denote its Euclidean norm. For any matrix \({\varvec{A}}\), we use \(\sigma _{j}({\varvec{A}})\) and \(\lambda _{j}({\varvec{A}})\) to denote its jth largest singular value and eigenvalue, respectively, and let \({\varvec{A}}_{j,\cdot }\) and \({\varvec{A}}_{\cdot ,j}\) denote its jth row and jth column, respectively. In addition, \(\Vert {\varvec{A}}\Vert \), \(\Vert {\varvec{A}}\Vert _{\mathrm {F}}\), \(\Vert {\varvec{A}}\Vert _{2,\infty }\), and \(\Vert {\varvec{A}}\Vert _{\infty }\) stand for the spectral norm (i.e., the largest singular value), the Frobenius norm, the \(\ell _2/\ell _{\infty }\) norm (i.e., the largest \(\ell _2\) norm of the rows), and the entrywise \(\ell _{\infty }\) norm (the largest magnitude of all entries) of a matrix \({\varvec{A}}\). Also, \({\varvec{A}}^{\top }\), \({\varvec{A}}^\textsf {H} \), and \(\overline{{\varvec{A}}}\) denote the transpose, the conjugate transpose, and the entrywise conjugate of \({\varvec{A}}\), respectively. \({\varvec{I}}_{n}\) denotes the identity matrix with dimension \(n\times n\). The notation \(\mathcal {O}^{n\times r}\) represents the set of all \(n\times r\) orthonormal matrices. The notation [n] refers to the set \(\{1,\cdots , n\}\). Also, we use \(\text {Re}(x)\) to denote the real part of a complex number x. Throughout the paper, we use the terms “samples” and “measurements” interchangeably.

Additionally, the standard notation \(f(n)=O\left( g(n)\right) \) or \(f(n)\lesssim g(n)\) means that there exists a constant \(c>0\) such that \(\left| f(n)\right| \le c|g(n)|\), \(f(n)\gtrsim g(n)\) means that there exists a constant \(c>0\) such that \(|f(n)|\ge c\left| g(n)\right| \), and \(f(n)\asymp g(n)\) means that there exist constants \(c_{1},c_{2}>0\) such that \(c_{1}|g(n)|\le |f(n)|\le c_{2}|g(n)|\). Also, \(f(n)\gg g(n)\) means that there exists some large enough constant \(c>0\) such that \(|f(n)|\ge c\left| g(n)\right| \). Similarly, \(f(n)\ll g(n)\) means that there exists some sufficiently small constant \(c>0\) such that \(|f(n)|\le c\left| g(n)\right| \).

2 Implicit Regularization: A Case Study

To reveal reasons behind the effectiveness of vanilla gradient descent, we first examine existing theory of gradient descent and identify the geometric properties that enable linear convergence. We then develop an understanding as to why prior theory is conservative, and describe the phenomenon of implicit regularization that helps explain the effectiveness of vanilla gradient descent. To facilitate discussion, we will use the problem of solving random quadratic systems (phase retrieval) and Wirtinger flow as a case study, but our diagnosis applies more generally, as will be seen in later sections.

2.1 Gradient Descent Theory Revisited

In the convex optimization literature, there are two standard conditions about the objective function—strong convexity and smoothness—that allow for linear convergence of gradient descent.

Definition 1

(Strong convexity) A twice continuously differentiable function \(f:\mathbb {R}^{n}\mapsto \mathbb {R}\) is said to be \(\alpha \)-strongly convex for \(\alpha > 0\) if

$$\begin{aligned} \nabla ^{2}f({\varvec{x}})\succeq \alpha {\varvec{I}}_{n},\qquad \forall {\varvec{x}}\in \mathbb {R}^n. \end{aligned}$$

Definition 2

(Smoothness) A twice continuously differentiable function \(f:\mathbb {R}^{n}\mapsto \mathbb {R}\) is said to be \(\beta \)-smooth for \(\beta > 0\) if

$$\begin{aligned} \left\| \nabla ^{2}f({\varvec{x}})\right\| \le \beta ,\qquad \forall {\varvec{x}}\in \mathbb {R}^n. \end{aligned}$$

It is well known that for an unconstrained optimization problem, if the objective function f is both \(\alpha \)-strongly convex and \(\beta \)-smooth, then vanilla gradient descent (4) enjoys \(\ell _{2}\) error contraction [9, Theorem 3.12], namely

$$\begin{aligned} \big \Vert {\varvec{x}}^{t+1}-{\varvec{x}}^{\star }\big \Vert _{2}\le \left( 1-\frac{2}{\beta /\alpha +1}\right) \big \Vert {\varvec{x}}^{t}-{\varvec{x}}^{\star }\big \Vert _{2}\quad \text {and}\quad \big \Vert {\varvec{x}}^{t}-{\varvec{x}}^{\star }\big \Vert _{2}\le \left( 1-\frac{2}{\beta /\alpha +1}\right) ^{t}\big \Vert {\varvec{x}}^{0}-{\varvec{x}}^{\star }\big \Vert _{2},\quad t\ge 0, \end{aligned}$$
(8)

as long as the step size is chosen as \(\eta _{t}=2/(\alpha +\beta )\). Here, \({\varvec{x}}^{\star }\) denotes the global minimum. This immediately reveals the iteration complexity for gradient descent: the number of iterations taken to attain \(\epsilon \)-accuracy (in a relative sense) is bounded by

$$\begin{aligned} O\left( \frac{\beta }{\alpha }\log \frac{1}{\epsilon }\right) . \end{aligned}$$

In other words, the iteration complexity is dictated by and scales linearly with the condition number—the ratio \(\beta /\alpha \) of smoothness to strong convexity parameters.
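For completeness, this iteration count follows from (8) via the elementary bound \(1-\rho \le e^{-\rho }\) with \(\rho :=2/(\beta /\alpha +1)\):

$$\begin{aligned} \big \Vert {\varvec{x}}^{t}-{\varvec{x}}^{\star }\big \Vert _{2}\le (1-\rho )^{t}\big \Vert {\varvec{x}}^{0}-{\varvec{x}}^{\star }\big \Vert _{2}\le e^{-\rho t}\big \Vert {\varvec{x}}^{0}-{\varvec{x}}^{\star }\big \Vert _{2}\le \epsilon \,\big \Vert {\varvec{x}}^{0}-{\varvec{x}}^{\star }\big \Vert _{2}\qquad \text {as soon as}\qquad t\ge \frac{1}{\rho }\log \frac{1}{\epsilon }=\frac{\beta /\alpha +1}{2}\log \frac{1}{\epsilon }. \end{aligned}$$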

Moving beyond convex optimization, one can easily extend the above theory to nonconvex problems with local strong convexity and smoothness. More precisely, suppose the objective function f satisfies

$$\begin{aligned} \nabla ^{2}f({\varvec{x}})\succeq \alpha {\varvec{I}}\qquad \text {and}\qquad \big \Vert \nabla ^{2}f({\varvec{x}})\big \Vert \le \beta \end{aligned}$$

over a local \(\ell _{2}\) ball surrounding the global minimum \({\varvec{x}}^{\star }\):

$$\begin{aligned} \mathcal {B}_{\delta }({\varvec{x}}^{\star }) := \big \{ {\varvec{x}} \mid \Vert {\varvec{x}}-{\varvec{x}}^{\star }\Vert _{2}\le \delta \Vert {\varvec{x}}^{\star }\Vert _{2} \big \}. \end{aligned}$$
(9)

Then the contraction result (8) continues to hold, as long as the algorithm is seeded with an initial point that falls inside \(\mathcal {B}_{\delta }({\varvec{x}}^{\star })\).

2.2 Local Geometry for Solving Random Quadratic Systems

To invoke generic gradient descent theory, it is critical to characterize the local strong convexity and smoothness properties of the loss function. Take the problem of solving random quadratic systems (phase retrieval) as an example. Consider the i.i.d. Gaussian design in which \({\varvec{a}}_{j}\overset{\mathrm {i.i.d.}}{\sim }\mathcal {N}({\varvec{0}},{\varvec{I}}_{n})\), \(1\le j\le m\), and suppose without loss of generality that the underlying signal obeys \(\Vert {\varvec{x}}^{\star }\Vert _{2}=1\). It is well known that \({\varvec{x}}^{\star }\) is the unique minimizer—up to global phase—of (2) under this statistical model, provided that the ratio m / n of equations to unknowns is sufficiently large. The Hessian of the loss function \(f({\varvec{x}})\) is given by

$$\begin{aligned} \nabla ^{2}f\left( {\varvec{x}}\right) =\frac{1}{m}\sum _{j=1}^{m}\left[ 3\left( {\varvec{a}}_{j}^{\top }{\varvec{x}}\right) ^{2} -y_{j}\right] {\varvec{a}}_{j}{\varvec{a}}_{j}^{\top }. \end{aligned}$$
(10)
  • Population-level analysis Consider the case with an infinite number of equations or samples, i.e., \(m\rightarrow \infty \), where \(\nabla ^{2}f({\varvec{x}})\) converges to its expectation. Simple calculation yields that

    $$\begin{aligned} \mathbb {E}\big [\nabla ^{2}f({\varvec{x}})\big ]=3\left( \Vert {\varvec{x}}\Vert _{2}^{2} {\varvec{I}}_{n}+2{\varvec{x}}{\varvec{x}}^{\top }\right) -\left( {\varvec{I}}_{n}+2{\varvec{x}}^{\star } {\varvec{x}}^{\star \top }\right) . \end{aligned}$$

    It is straightforward to verify that for any sufficiently small constant \(\delta >0\), one has the crude bound

    $$\begin{aligned} {\varvec{I}}_{n} \,\preceq \, \mathbb {E}\big [\nabla ^{2}f({\varvec{x}})\big ] \,\preceq \, 10{\varvec{I}}_{n},\qquad \forall {\varvec{x}}\in \mathcal {B}_{\delta }({\varvec{x}}^{\star }), \end{aligned}$$

    meaning that f is 1-strongly convex and 10-smooth within a local ball around \({\varvec{x}}^{\star }\). As a consequence, when we have infinite samples and an initial guess \({\varvec{x}}^{0}\) such that \(\Vert {\varvec{x}}^{0}-{\varvec{x}}^{\star }\Vert _{2}\le \delta \big \Vert {\varvec{x}}^{\star }\big \Vert _{2}\), vanilla gradient descent with a constant step size attains \(\epsilon \)-accuracy within \(O\left( \log (1/\epsilon )\right) \) iterations.

  • Finite-sample regime with \(m\asymp n\log n\) Now that f exhibits a favorable landscape at the population level, one hopes that the fluctuation can be well controlled so that this nice geometry carries over to the finite-sample regime. In the regime where \(m\asymp n\log n\) (which is the regime considered in [18]), the local strong convexity is still preserved, in the sense that

    $$\begin{aligned} \nabla ^{2}f({\varvec{x}}) \,\succeq \, \left( {1} / {2}\right) \cdot {\varvec{I}}_{n},\qquad \forall {\varvec{x}}:\text { }\big \Vert {\varvec{x}}-{\varvec{x}}^{\star }\big \Vert _{2}\le \delta \big \Vert {\varvec{x}}^{\star }\big \Vert _{2} \end{aligned}$$

    occurs with high probability, provided that \(\delta >0\) is sufficiently small (see [96, 101] and Lemma 1). The smoothness parameter, however, is not well controlled. In fact, it can be as large as (up to logarithmic factors)Footnote 3

    $$\begin{aligned} \big \Vert \nabla ^{2}f({\varvec{x}})\big \Vert \,\lesssim \,n \end{aligned}$$

    even when we restrict attention to the local \(\ell _{2}\) ball (9) with \(\delta >0\) being a fixed small constant. This means that the condition number \(\beta /\alpha \) (defined in Sect. 2.1) may scale as O(n), leading to the step size recommendation

    $$\begin{aligned} \eta _t\,\asymp \,1/n, \end{aligned}$$

    and, as a consequence, a high iteration complexity \(O\big (n\log (1 / {\epsilon } )\big )\). This underpins the analysis in [18].

In summary, the geometric properties of the loss function—even in the local \(\ell _{2}\) ball centering around the global minimum—are not as favorable as one anticipates, in particular in view of its population counterpart. A direct application of generic gradient descent theory leads to an overly conservative step size and a pessimistic convergence rate, unless the number of samples is enormously larger than the number of unknowns.

Remark 2

Notably, due to Gaussian designs, the phase retrieval problem enjoys more favorable geometry compared to other nonconvex problems. In matrix completion and blind deconvolution, the Hessian matrices are rank-deficient even at the population level. In such cases, the above discussions need to be adjusted, e.g., strong convexity is only possible when we restrict attention to certain directions.

2.3 Which Region Enjoys Nicer Geometry?

Interestingly, our theory identifies a local region surrounding \({\varvec{x}}^{\star }\) with a large diameter that enjoys much nicer geometry. This region does not mimic an \(\ell _{2}\) ball, but rather, the intersection of an \(\ell _{2}\) ball and a polytope. We term it the region of incoherence and contraction (RIC). For phase retrieval, the RIC includes all points \({\varvec{x}}\in \mathbb {R}^{n}\) obeying

$$\begin{aligned} \big \Vert {\varvec{x}}-{\varvec{x}}^{\star }\big \Vert _{2}&\le \delta \big \Vert {\varvec{x}}^{\star }\big \Vert _{2}\qquad \text {and} \end{aligned}$$
(11a)
$$\begin{aligned} \max _{1\le j\le m}\big |{\varvec{a}}_{j}^{\top }\big ({\varvec{x}}-{\varvec{x}}^{\star }\big )\big |&\lesssim \sqrt{\log n}\,\big \Vert {\varvec{x}}^{\star }\big \Vert _{2}, \end{aligned}$$
(11b)

where \(\delta >0\) is some small numerical constant. As will be formalized in Lemma 1, with high probability the Hessian matrix satisfies

$$\begin{aligned} \left( {1}/{2}\right) \cdot {\varvec{I}}_{n}\,\preceq \,\nabla ^{2}f({\varvec{x}}) \,\preceq \,O(\log n)\cdot {\varvec{I}}_n \end{aligned}$$

simultaneously for \({\varvec{x}}\) in the RIC. In words, the Hessian matrix is nearly well conditioned (with the condition number bounded by \(O(\log n)\)), as long as (i) the iterate is not very far from the global minimizer (cf. (11a)) and (ii) the iterate remains incoherentFootnote 4 with respect to the sensing vectors (cf. (11b)). Another way to interpret the incoherence condition (11b) is that the empirical risk needs to be well controlled uniformly across all samples. See Fig. 3a for an illustration of the above region.

Fig. 3

a The shaded region is an illustration of the incoherence region, which satisfies \(\big |{\varvec{a}}_{j}^{\top }({\varvec{x}}-{\varvec{x}}^{\star })\big |\lesssim \sqrt{\log n}\) for all points \({\varvec{x}}\) in the region. b When \({\varvec{x}}^{0}\) resides in the desired region, we know that \({\varvec{x}}^{1}\) remains within the \(\ell _{2}\) ball but might fall out of the incoherence region (the shaded region). Once \({\varvec{x}}^{1}\) leaves the incoherence region, we lose control and may overshoot. c Our theory reveals that with high probability, all iterates will stay within the incoherence region, enabling fast convergence

The following observation is thus immediate: one can safely adopt a far more aggressive step size (as large as \(\eta _{t}=O(1/\log n)\)) to achieve acceleration, as long as the iterates stay within the RIC. This, however, fails to be guaranteed by generic gradient descent theory. To be more precise, if the current iterate \({\varvec{x}}^{t}\) falls within the desired region, then in view of (8), we can ensure \(\ell _{2}\) error contraction after one iteration, namely

$$\begin{aligned} \Vert {\varvec{x}}^{t+1}-{\varvec{x}}^{\star }\Vert _{2}\le \Vert {\varvec{x}}^{t}-{\varvec{x}}^{\star }\Vert _{2}, \end{aligned}$$

and hence \({\varvec{x}}^{t+1}\) stays within the local \(\ell _{2}\) ball and thus satisfies (11a). However, it is not immediately obvious that \({\varvec{x}}^{t+1}\) would still stay incoherent with the sensing vectors and satisfy (11b). If \({\varvec{x}}^{t+1}\) leaves the RIC, it no longer enjoys the benign local geometry of the loss function, and the algorithm has to slow down in order to avoid overshooting. See Fig. 3b for a visual illustration. In fact, in almost all regularized gradient descent algorithms mentioned in Sect. 1.2, one of the main purposes of the proposed regularization procedures is to enforce such incoherence constraints.

2.4 Implicit Regularization

However, is regularization really necessary for the iterates to stay within the RIC? To answer this question, we plot in Fig. 4a (resp. Fig. 4b) the incoherence measure \(\frac{\max _{j}\left| {\varvec{a}}_{j}^{\top }{\varvec{x}}^{t}\right| }{\sqrt{\log n}\left\| {\varvec{x}}^{\star }\right\| _{2}}\) (resp. \(\frac{\max _{j}\left| {\varvec{a}}_{j}^{\top }({\varvec{x}}^{t}-{\varvec{x}}^\star )\right| }{\sqrt{\log n}\left\| {\varvec{x}}^{\star }\right\| _{2}}\)) versus the iteration count in a typical Monte Carlo trial, generated in the same way as for Fig. 1a. Interestingly, the incoherence measure remains bounded by 2 for all iterations \(t>1\). This important observation suggests that one may adopt a substantially more aggressive step size throughout the whole algorithm.
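The quantity plotted in Fig. 4 is cheap to monitor. A minimal helper (with `A` holding the design vectors \({\varvec{a}}_{j}\) as rows, matching the earlier sketches) is given below.

```python
import numpy as np

def incoherence_measure(x, x_star, A):
    """max_j |a_j^T (x - x*)| / (sqrt(log n) * ||x*||_2), cf. (11b) and Fig. 4b."""
    n = x_star.size
    return np.max(np.abs(A @ (x - x_star))) / (np.sqrt(np.log(n)) * np.linalg.norm(x_star))

# Logging this quantity inside the gradient loop of the Sect. 1.4 sketch
# reproduces the qualitative behavior of Fig. 4b.
```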

Fig. 4

The incoherence measure \(\frac{\max _{1\le j\le m}\left| {\varvec{a}}_{j}^{\top }{\varvec{x}}^{t}\right| }{\sqrt{\log n}\left\| {\varvec{x}}^{\star }\right\| _{2}}\) (in a) and \(\frac{\max _{1\le j\le m}\left| {\varvec{a}}_{j}^{\top }({\varvec{x}}^{t}-{\varvec{x}}^\star )\right| }{\sqrt{\log n}\left\| {\varvec{x}}^{\star }\right\| _{2}}\) (in b) of the gradient iterates versus iteration count for the phase retrieval problem. The results are shown for \(n\in \left\{ 20,100,200,1000\right\} \) and \(m=10n\), with the step size taken to be \(\eta _{t}=0.1\). The problem instances are generated in the same way as in Fig. 1a

The main objective of this paper is thus to provide a theoretical validation of the above empirical observation. As we will demonstrate shortly, with high probability all iterates along the execution of the algorithm (as well as the spectral initialization) are provably constrained within the RIC, implying fast convergence of vanilla gradient descent (cf. Fig. 3c). The fact that the iterates stay incoherent with the measurement mechanism automatically, without explicit enforcement, is termed “implicit regularization.”

2.5 A Glimpse of the Analysis: A Leave-One-Out Trick

In order to rigorously establish (11b) for all iterates, the current paper develops a powerful mechanism based on the leave-one-out perturbation argument, a trick rooted in and widely used across probability and random matrix theory. Note that the iterate \({\varvec{x}}^t\) is statistically dependent on the design vectors \(\{{\varvec{a}}_j\}\). Under such circumstances, one often resorts to generic bounds like the Cauchy–Schwarz inequality, which would not yield a desirable estimate. To address this issue, we introduce a sequence of auxiliary iterates \(\{{\varvec{x}}^{t,(l)}\}\) for each \(1\le l\le m\) (for analytical purposes only), obtained by running vanilla gradient descent using all but the lth sample. As one can expect, such auxiliary trajectories serve as extremely good surrogates of \(\{{\varvec{x}}^t\}\) in the sense that

$$\begin{aligned} {\varvec{x}}^{t} \,\approx \, {\varvec{x}}^{t,\left( l\right) }, \qquad 1\le l\le m, \quad t\ge 0, \end{aligned}$$
(12)

since their constructions only differ by a single sample. Most importantly, since \({\varvec{x}}^{t,(l)}\) is independent of the lth design vector, it is much easier to control its incoherence w.r.t. \({\varvec{a}}_l\) to the desired level:

$$\begin{aligned} \big |{\varvec{a}}_{l}^{\top }\big ({\varvec{x}}^{t,(l)}-{\varvec{x}}^{\star }\big )\big | \lesssim \sqrt{\log n}\,\big \Vert {\varvec{x}}^{\star }\big \Vert _{2}. \end{aligned}$$
(13)

Combining (12) and (13) then leads to (11b). See Fig. 5 for a graphical illustration of this argument. Notably, this technique is very general and applicable to many other problems. We invite the readers to Sect. 5 for more details.
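Although the leave-one-out sequences are purely an analysis device, they are easy to simulate. The sketch below mirrors the phase retrieval setup used earlier (all scalings are illustrative) and tracks the proximity (12) and the incoherence (13) for a single left-out index l.

```python
import numpy as np

def leave_one_out_demo(n=100, eta=0.1, num_iters=100, l=0, seed=0):
    """Numerically track the leave-one-out proximity (12) and incoherence (13)."""
    rng = np.random.default_rng(seed)
    m = 10 * n
    x_star = rng.standard_normal(n); x_star /= np.linalg.norm(x_star)
    A = rng.standard_normal((m, n))
    y = (A @ x_star) ** 2
    mask = np.arange(m) != l                      # all samples except the l-th

    def grad(x, A_, y_):                          # gradient of the loss (2)
        Ax = A_ @ x
        return ((Ax ** 2 - y_) * Ax) @ A_ / A_.shape[0]

    def init(A_, y_):                             # spectral initialization
        _, V = np.linalg.eigh((A_ * y_[:, None]).T @ A_ / A_.shape[0])
        return np.sqrt(np.mean(y_)) * V[:, -1]

    x, x_loo = init(A, y), init(A[mask], y[mask])
    if x @ x_loo < 0:                             # fix the arbitrary eigenvector sign
        x_loo = -x_loo
    for _ in range(num_iters):
        x, x_loo = x - eta * grad(x, A, y), x_loo - eta * grad(x_loo, A[mask], y[mask])

    gap = np.linalg.norm(x - x_loo)                                          # cf. (12)
    incoherence = abs(A[l] @ (x_loo - x_star)) / (np.sqrt(np.log(n))
                                                  * np.linalg.norm(x_star))  # cf. (13)
    return gap, incoherence

# Typically gap << ||x*||_2 while incoherence = O(1), mirroring (12) and (13).
```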

Fig. 5

Illustration of the leave-one-out sequence w.r.t. \({\varvec{a}}_{l}\). a The sequence \(\{{\varvec{x}}^{t,(l)}\}_{t\ge 0}\) is constructed without using the lth sample. b Since the auxiliary sequence \(\{{\varvec{x}}^{t,(l)}\}\) is constructed without using \({\varvec{a}}_{l}\), the leave-one-out iterates stay within the incoherence region w.r.t. \({\varvec{a}}_{l}\) with high probability. Meanwhile, \(\{{\varvec{x}}^{t}\}\) and \(\{{\varvec{x}}^{t,(l)}\}\) are expected to remain close, as their constructions differ only by a single sample

3 Main Results

This section formalizes the implicit regularization phenomenon underlying unregularized gradient descent and presents its consequences, namely near-optimal statistical and computational guarantees for phase retrieval, matrix completion, and blind deconvolution. Note that the discrepancy measure \(\text {dist}\left( \cdot , \cdot \right) \) may vary from problem to problem.

3.1 Phase Retrieval

Suppose the m quadratic equations

$$\begin{aligned} y_{j}=\big ({\varvec{a}}_{j}^{\top }{\varvec{x}}^{\star }\big )^{2},\qquad j=1,2,\ldots ,m \end{aligned}$$
(14)

are collected using random design vectors, namely \({\varvec{a}}_{j}\overset{\mathrm {i.i.d.}}{\sim }\mathcal {N}({\varvec{0}},{\varvec{I}}_{n})\), and the nonconvex problem to solve is

$$\begin{aligned} \text {minimize}_{{\varvec{x}}\in \mathbb {R}^{n}}\quad f({\varvec{x}}):=\frac{1}{4m}\sum _{j=1}^{m}\left[ \left( {\varvec{a}}_{j}^{\top }{\varvec{x}}\right) ^{2}-y_{j}\right] ^{2}. \end{aligned}$$
(15)

The Wirtinger flow (WF) algorithm, first introduced in [18], is a combination of spectral initialization and vanilla gradient descent; see Algorithm 1.

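The algorithm box is omitted in this rendering; in words, Algorithm 1 consists of (i) a spectral initialization built from the data matrix \({\varvec{Y}}=\frac{1}{m}\sum _{j=1}^{m}y_{j}{\varvec{a}}_{j}{\varvec{a}}_{j}^{\top }\) and (ii) the plain gradient updates (4) applied to the loss (15). One common formulation of these two steps (the exact scaling of the initializer may differ slightly from the one stated in the paper) reads

$$\begin{aligned} {\varvec{x}}^{0}=\sqrt{\lambda _{1}\left( {\varvec{Y}}\right) /3}\;\tilde{{\varvec{x}}}^{0}\quad \text {with }\big (\lambda _{1}({\varvec{Y}}),\tilde{{\varvec{x}}}^{0}\big )\text { the leading eigenpair of }{\varvec{Y}},\qquad \text {and}\qquad {\varvec{x}}^{t+1}={\varvec{x}}^{t}-\eta _{t}\,\nabla f\big ({\varvec{x}}^{t}\big ),\quad t\ge 0. \end{aligned}$$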

Recognizing that the global phase/sign is unrecoverable from quadratic measurements, we introduce the \(\ell _{2}\) distance modulo the global phase as follows

$$\begin{aligned} \mathrm {dist}({\varvec{x}},{\varvec{x}}^{\star }):=\min \left\{ \Vert {\varvec{x}}-{\varvec{x}}^{\star }\Vert _{2},\Vert {\varvec{x}}+{\varvec{x}}^{\star }\Vert _{2}\right\} . \end{aligned}$$
(18)

Our finding is summarized in the following theorem.

Theorem 1

Let \({\varvec{x}}^{\star }\in \mathbb {R}^{n}\) be a fixed vector. Suppose \({\varvec{a}}_{j}\overset{\mathrm {i.i.d.}}{\sim }\mathcal {N}\left( {\varvec{0}},{\varvec{I}}_{n}\right) \) for each \(1\le j\le m\) and \(m\ge c_{0}n\log n\) for some sufficiently large constant \(c_{0}>0\). Assume the step size obeys \(\eta _{t}\equiv \eta ={c_{1}} / \left( {\log n}\cdot \Vert {\varvec{x}}^{0}\Vert _{2}^{2}\right) \) for any sufficiently small constant \(c_{1}>0\). Then there exist some absolute constants \(0<\varepsilon <1\) and \(c_{2}>0\) such that with probability at least \(1-O\left( mn^{-5}\right) \), Algorithm 1 satisfies that for all \(t\ge 0\),

$$\begin{aligned} \mathrm {dist}({\varvec{x}}^{t},{\varvec{x}}^{\star })&\le \varepsilon (1-\eta \Vert {\varvec{x}}^{\star } \Vert _2^2 /2)^{t}\Vert {\varvec{x}}^{\star }\Vert _{2}, \end{aligned}$$
(19a)
$$\begin{aligned} \max _{1\le j\le m}\big |{\varvec{a}}_{j}^{\top }\big ({\varvec{x}}^{t}-{\varvec{x}}^{\star }\big )\big |&\le c_{2}\sqrt{\log n}\Vert {\varvec{x}}^{\star }\Vert _{2}. \end{aligned}$$
(19b)

Theorem 1 reveals a few intriguing properties of Algorithm 1.

  • Implicit regularization Theorem 1 asserts that the incoherence properties are satisfied throughout the execution of the algorithm (see (19b)), which formally justifies the implicit regularization feature we hypothesized.

  • Near-constant step size Consider the case where \(\Vert {\varvec{x}}^{\star } \Vert _2 =1\). Theorem 1 establishes near-linear convergence of WF with a substantially more aggressive step size \(\eta \asymp 1/\log n\). Compared with the choice \(\eta \lesssim 1/n\) admissible in [18, Theorem 3.3], Theorem 1 allows WF/GD to attain \(\epsilon \)-accuracy within \(O(\log n\log (1/\epsilon ))\) iterations. The resulting computational complexity of the algorithm is

    $$\begin{aligned} O\left( mn\log n\log \frac{1}{\epsilon }\right) , \end{aligned}$$

    which significantly improves upon the result \(O\big (mn^{2}\log \left( {1} / {\epsilon }\right) \big )\) derived in [18]. As a side note, if the sample size further increases to \(m\asymp n\log ^2 n\), then a constant step size \(\eta \asymp 1\) is also feasible, resulting in an iteration complexity of \(O\big (\log (1/\epsilon )\big )\). This follows since with high probability, the entire trajectory resides within a more refined incoherence region \(\max _{j}\big |{\varvec{a}}_{j}^{\top }\big ({\varvec{x}}^{t}-{\varvec{x}}^{\star }\big )\big | \lesssim \Vert {\varvec{x}}^{\star }\Vert _{2}\). We omit the details here.

  • Incoherence of spectral initialization We have also demonstrated in Theorem 1 that the initial guess \({\varvec{x}}^{0}\) falls within the RIC and is hence nearly orthogonal to all design vectors. This provides a finer characterization of spectral initialization, in comparison with prior theory that focuses primarily on the \(\ell _2\) accuracy [18, 90]. We expect our leave-one-out analysis to accommodate other variants of spectral initialization studied in the literature [12, 25, 83, 88, 118].

Remark 3

As it turns out, a carefully designed initialization is not pivotal in enabling fast convergence. In fact, randomly initialized gradient descent provably attains \(\varepsilon \)-accuracy in \(O(\log n + \log \tfrac{1}{\varepsilon })\) iterations; see [27] for details.

3.2 Low-Rank Matrix Completion

Let \({\varvec{M}}^{\star }\in \mathbb {R}^{n\times n}\) be a positive semidefinite matrixFootnote 5 with rank r, and suppose its eigendecomposition is

$$\begin{aligned} {\varvec{M}}^{\star }={\varvec{U}}^{\star }{\varvec{\Sigma }}^{\star }{\varvec{U}}^{\star \top }, \end{aligned}$$
(20)

where \({\varvec{U}}^{\star }\in \mathbb {R}^{n\times r}\) consists of orthonormal columns and \({\varvec{\Sigma }}^{\star }\) is an \(r\times r\) diagonal matrix with eigenvalues in descending order, i.e., \(\sigma _{\max }=\sigma _{1}\ge \cdots \ge \sigma _{r}=\sigma _{\min }>0\). Throughout this paper, we assume the condition number \(\kappa :=\sigma _{\max }/\sigma _{\min }\) is bounded by a fixed constant, independent of the problem size (i.e., n and r). Denoting \({\varvec{X}}^{\star }={\varvec{U}}^{\star }({\varvec{\Sigma }}^{\star })^{1/2}\) allows us to factorize \({\varvec{M}}^{\star }\) as

$$\begin{aligned} {\varvec{M}}^{\star }={\varvec{X}}^{\star }{\varvec{X}}^{\star \top }. \end{aligned}$$
(21)

Consider a random sampling model such that each entry of \({\varvec{M}}^{\star }\) is observed independently with probability \(0<p\le 1\), i.e., for \(1\le j\le k\le n\),

$$\begin{aligned} Y_{j,k}={\left\{ \begin{array}{ll} M_{j,k}^{\star }+E_{j,k},\quad &{} \text {with probability }p,\\ 0, &{} \text {else}, \end{array}\right. } \end{aligned}$$
(22)

where the entries of \({\varvec{E}}=[ E_{j,k} ] _{1\le j\le k\le n}\) are independent sub-Gaussian noise with sub-Gaussian norm \(\sigma \) (see [116, Definition 5.7]). We denote by \(\Omega \) the set of locations being sampled, and \(\mathcal {P}_\Omega ({\varvec{Y}})\) represents the projection of \({\varvec{Y}}\) onto the set of matrices supported in \(\Omega \). We note here that the sampling rate p, if not known, can be faithfully estimated by the sample proportion \(| \Omega |/n^2\).

To fix ideas, we consider the following nonconvex optimization problem

$$\begin{aligned} \text {minimize}_{{\varvec{X}}\in \mathbb {R}^{n\times r}}\quad f\left( {\varvec{X}}\right) :=\frac{1}{4p}\sum _{(j,k)\in \Omega }\big ({\varvec{e}}_{j}^{\top }{\varvec{X}}{\varvec{X}}^{\top }{\varvec{e}}_{k}-Y_{j,k}\big )^{2}. \end{aligned}$$
(23)

The vanilla gradient descent algorithm (with spectral initialization) is summarized in Algorithm 2.

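The algorithm box is again omitted in this rendering. The following is a minimal NumPy sketch of the two ingredients Algorithm 2 is described as combining: a rank-r spectral initialization computed from the rescaled observations \(p^{-1}\mathcal {P}_\Omega ({\varvec{Y}})\), followed by gradient descent on (23). The closed-form gradient \(\nabla f({\varvec{X}})=p^{-1}\mathcal {P}_\Omega ({\varvec{X}}{\varvec{X}}^{\top }-{\varvec{Y}}){\varvec{X}}\) used below assumes a symmetrized index set \(\Omega \), and the initializer scaling is a common choice; both may differ in minor details from the paper's exact statement.

```python
import numpy as np

def grad_f(X, Y, mask, p):
    """Gradient of (23) under a symmetric observation pattern:
    grad f(X) = (1/p) * P_Omega(X X^T - Y) X."""
    return (mask * (X @ X.T - Y)) @ X / p

def rank_r_spectral_init(Y, p, r):
    """Rank-r spectral initialization X^0 = U * Lambda^{1/2} from p^{-1} P_Omega(Y)."""
    w, V = np.linalg.eigh(Y / p)
    top = np.argsort(w)[::-1][:r]                 # indices of the top-r eigenvalues
    return V[:, top] * np.sqrt(np.maximum(w[top], 0.0))

def matrix_completion_gd(n=500, r=5, p=0.1, eta=0.2, num_iters=400, seed=0):
    rng = np.random.default_rng(seed)
    X_star = rng.standard_normal((n, r)) / np.sqrt(n)     # incoherent low-rank factor
    M_star = X_star @ X_star.T
    upper = np.triu(rng.random((n, n)) < p)               # observe entries with j <= k ...
    mask = upper | np.triu(upper, 1).T                    # ... and symmetrize the index set
    Y = mask * M_star                                     # noiseless observations (22)

    X = rank_r_spectral_init(Y, p, r)
    for _ in range(num_iters):
        X = X - eta * grad_f(X, Y, mask, p)
    return np.linalg.norm(X @ X.T - M_star) / np.linalg.norm(M_star)

# matrix_completion_gd() typically returns a relative Frobenius error of about 1e-5 or less.
```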

Before proceeding to the main theorem, we first introduce a standard incoherence parameter required for matrix completion [19].

Definition 3

(Incoherence for matrix completion) A rank-r matrix \({\varvec{M}}^{\star }\) with eigendecomposition \({\varvec{M}}^{\star }={\varvec{U}}^{\star }{\varvec{\Sigma }}^{\star }{\varvec{U}}^{\star \top }\) is said to be \(\mu \)-incoherent if

$$\begin{aligned} \left\| {\varvec{U}}^{\star }\right\| _{2,\infty }\le \sqrt{\frac{\mu }{n}}\left\| {\varvec{U}}^{\star }\right\| _{\mathrm {F}}=\sqrt{\frac{\mu r}{n}}. \end{aligned}$$
(25)

In addition, recognizing that \({\varvec{X}}^{\star }\) is identifiable only up to orthogonal transformation, we define the optimal transform from the tth iterate \({\varvec{X}}^t\) to \({\varvec{X}}^{\star }\) as

$$\begin{aligned} \widehat{{\varvec{H}}}^{t}:=\mathop {\mathrm {argmin}}_{{\varvec{R}}\in \mathcal {O}^{r\times r}}\left\| {\varvec{X}}^{t}{\varvec{R}}-{\varvec{X}}^{\star }\right\| _{\mathrm {F}}, \end{aligned}$$
(26)

where \(\mathcal {O}^{r\times r}\) is the set of \(r\times r\) orthonormal matrices. With these definitions in place, we have the following theorem.

Theorem 2

Let \({\varvec{M}}^{\star }\) be a rank-r, \(\mu \)-incoherent PSD matrix whose condition number \(\kappa \) is a fixed constant. Suppose the sample size satisfies \(n^{2}p\ge C\mu ^{3}r^{3}n\log ^{3}n\) for some sufficiently large constant \(C>0\), and the noise satisfies

$$\begin{aligned} \sigma \sqrt{\frac{n}{p}}\ll \frac{\sigma _{\min }}{\sqrt{\kappa ^3 \mu r\log ^{3}n}}. \end{aligned}$$
(27)

With probability at least \(1-O\left( n^{-3}\right) \), the iterates of Algorithm 2 satisfy

$$\begin{aligned} \big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big \Vert _{\mathrm {F}}&\le \left( C_{4}\rho ^{t}\mu r\frac{1}{\sqrt{np}}+C_{1}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\right) \big \Vert {\varvec{X}}^{\star }\big \Vert _{\mathrm {F}}, \end{aligned}$$
(28a)
$$\begin{aligned} \big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big \Vert _{2,\infty }&\le \left( C_{5}\rho ^{t}\mu r\sqrt{\frac{\log n}{np}}+C_{8}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n\log n}{p}}\right) \big \Vert {\varvec{X}}^{\star }\big \Vert _{2,\infty }, \end{aligned}$$
(28b)
$$\begin{aligned} \big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big \Vert&\le \left( C_{9}\rho ^{t}\mu r\frac{1}{\sqrt{np}}+C_{10}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\right) \big \Vert {\varvec{X}}^{\star }\big \Vert \end{aligned}$$
(28c)

for all \(0\le t\le T=O(n^{5})\), where \(C_{1}\), \(C_{4}\), \(C_{5}\), \(C_{8}\), \(C_{9}\), and \(C_{10}\) are some absolute positive constants and \(1-\left( {\sigma _{\min }} / {5}\right) \cdot \eta \le \rho <1\), provided that \(0<\eta _{t}\equiv \eta \le {2} / \left( {25\kappa \sigma _{\max }}\right) \).

Theorem 2 provides the first theoretical guarantee of unregularized gradient descent for matrix completion, demonstrating near-optimal statistical accuracy and computational complexity.

  • Implicit regularization In Theorem 2, we bound the \(\ell _{2}/\ell _{\infty }\) error of the iterates in a uniform manner via (28b). Note that \(\big \Vert {\varvec{X}}-{\varvec{X}}^{\star }\big \Vert _{2,\infty }=\max _{j}\big \Vert {\varvec{e}}_{j}^{\top }\big ({\varvec{X}}-{\varvec{X}}^{\star }\big )\big \Vert _{2}\), which implies the iterates remain incoherent with the sensing vectors throughout and have small incoherence parameters (cf. (25)). In comparison, prior works either include a penalty term on \(\{\Vert {\varvec{e}}_{j}^{\top }{\varvec{X}}\Vert _2\}_{1\le j\le n}\) [64, 107] and/or \(\Vert {\varvec{X}}\Vert _{\mathrm {F}}\) [107] to encourage an incoherent and/or low-norm solution, or add an extra projection operation to enforce incoherence [32, 131]. Our results demonstrate that such explicit regularization is unnecessary.

  • Constant step size Without loss of generality, we may assume that \(\sigma _{\max } = \Vert {\varvec{M}}^{\star } \Vert = O( 1 )\), which can be ensured by properly rescaling \({\varvec{M}}^{\star }\). Hence, we have a constant step size \(\eta _t \asymp 1\). Actually, it is more convenient to consider the scale-invariant parameter \(\rho \): Theorem 2 guarantees linear convergence of vanilla gradient descent at a constant rate \(\rho \). Remarkably, the convergence occurs simultaneously with respect to three different norms: the Frobenius norm \(\Vert \cdot \Vert _{\mathrm {F}}\), the \(\ell _{2}/\ell _{\infty }\) norm \(\Vert \cdot \Vert _{2,\infty }\), and the spectral norm \(\Vert \cdot \Vert \). As far as we know, the latter two are established for the first time. Note that our result even improves upon that for regularized gradient descent; see Table 1.

  • Near-optimal sample complexity When the rank \(r=O(1)\), vanilla gradient descent succeeds under a near-optimal sample complexity \(n^{2}p\gtrsim n\mathrm {poly}\log n\), which is statistically optimal up to some logarithmic factor.

  • Near-minimal Euclidean error In view of (28a), as t increases, the Euclidean error of vanilla GD converges to

    $$\begin{aligned} \big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big \Vert _{\mathrm {F}} \lesssim \frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\big \Vert {\varvec{X}}^{\star }\big \Vert _{\mathrm {F}}, \end{aligned}$$
    (29)

    which coincides with the theoretical guarantee in [32, Corollary 1] and matches the minimax lower bound established in [67, 89].

  • Near-optimal entrywise error The \(\ell _{2}/\ell _{\infty }\) error bound (28b) immediately yields entrywise control of the empirical risk. Specifically, as soon as t is sufficiently large (so that the first term in (28b) is negligible), we have

    $$\begin{aligned} \big \Vert {\varvec{X}}^{t}{\varvec{X}}^{t\top }-{\varvec{M}}^{\star }\big \Vert _{\infty }&\le \big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\big ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big )^{\top }\big \Vert _{\infty }+\big \Vert \big ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big ){\varvec{X}}^{\star \top }\big \Vert _{\infty }\\&\le \big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\big \Vert _{2,\infty }\big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big \Vert _{2,\infty }+\big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big \Vert _{2,\infty }\big \Vert {\varvec{X}}^{\star }\big \Vert _{2,\infty }\\&\lesssim \frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n\log n}{p}}\left\| {\varvec{M}}^{\star }\right\| _{\infty }, \end{aligned}$$

    where the last line follows from (28b) as well as the facts that \(\Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\Vert _{2,\infty }\le \Vert {\varvec{X}}^{\star }\Vert _{2,\infty }\) and \(\Vert {\varvec{M}}^{\star }\Vert _{\infty }=\Vert {\varvec{X}}^{\star }\Vert _{2,\infty }^{2}\). Compared with the Euclidean loss (29), this implies that when \(r=O(1)\), the entrywise error of \({\varvec{X}}^{t}{\varvec{X}}^{t\top }\) is uniformly spread out across all entries. As far as we know, this is the first result that reveals near-optimal entrywise error control for noisy matrix completion using nonconvex optimization, without resorting to sample splitting.

Remark 4

Theorem 2 remains valid if the total number T of iterations obeys \(T=n^{O(1)}\). In the noiseless case where \(\sigma = 0\), the theory allows arbitrarily large T.

Finally, we report the empirical statistical accuracy of vanilla gradient descent in the presence of noise. Figure 6 displays the squared relative error of vanilla gradient descent as a function of the signal-to-noise ratio (SNR), where the SNR is defined to be

$$\begin{aligned} \textsf {SNR} \,:=\, \frac{\sum _{(j,k)\in \Omega } \big (M^{\star }_{j,k}\big )^2}{\sum _{(j,k)\in \Omega } \textsf {Var} \left( E_{j,k}\right) } \,\approx \, \frac{ \Vert {\varvec{M}}^{\star }\Vert _{\mathrm {F}}^2}{n^2 \sigma ^2} , \end{aligned}$$
(30)

and the relative error is measured in terms of the square of the metrics in (28) as well as the squared entrywise prediction error. Both the relative error and the SNR are shown on a dB scale (i.e., \(10\log _{10}(\text {SNR})\) and \(10\log _{10}(\text {squared relative error})\) are plotted). The results are averaged over 20 independent trials. As one can see from the plot, the squared relative error is inversely proportional to the SNR, which is consistent with our theory.Footnote 6
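This inverse proportionality is exactly what (29) and (30) predict: substituting \(\sigma ^{2}\approx \Vert {\varvec{M}}^{\star }\Vert _{\mathrm {F}}^{2}/(n^{2}\,\textsf {SNR} )\) from (30) into (29) gives

$$\begin{aligned} \frac{\big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big \Vert _{\mathrm {F}}^{2}}{\big \Vert {\varvec{X}}^{\star }\big \Vert _{\mathrm {F}}^{2}} \,\lesssim \,\frac{\sigma ^{2}}{\sigma _{\min }^{2}}\cdot \frac{n}{p} \,\approx \,\frac{\Vert {\varvec{M}}^{\star }\Vert _{\mathrm {F}}^{2}}{np\,\sigma _{\min }^{2}}\cdot \frac{1}{\textsf {SNR} }, \end{aligned}$$

so that on a dB scale the squared relative error decreases linearly in the SNR with slope \(-1\), as observed in Fig. 6.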

Fig. 6

Squared relative error of the estimate \(\widehat{{\varvec{X}}}\) (measured by \(\left\| \cdot \right\| _{\mathrm {F}},\left\| \cdot \right\| ,\left\| \cdot \right\| _{2,\infty }\) modulo global transformation) and \(\widehat{{\varvec{M}}}=\widehat{{\varvec{X}}}\widehat{{\varvec{X}}}^{\top }\) (measured by \(\left\| \cdot \right\| _{\infty }\)) versus SNR for noisy matrix completion, where \(n=500\), \(r=10\), \(p=0.1\), and \(\eta _{t}=0.2\). Here \(\widehat{{\varvec{X}}}\) denotes the estimate returned by Algorithm 2 after convergence. The results are averaged over 20 independent Monte Carlo trials

3.3 Blind Deconvolution

Suppose we have collected m bilinear measurements

$$\begin{aligned} y_{j}={\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}^{\star }{\varvec{x}}^{\star \textsf {H} }{\varvec{a}}_{j},\qquad 1\le j\le m, \end{aligned}$$
(31)

where \({\varvec{a}}_{j}\) follows a complex Gaussian distribution, i.e., \({\varvec{a}}_{j}\overset{\text {i.i.d.}}{\sim }\mathcal {N}\left( {\varvec{0}},\frac{1}{2}{\varvec{I}}_{K}\right) +i\mathcal {N}\left( {\varvec{0}},\frac{1}{2}{\varvec{I}}_{K}\right) \) for \(1\le j \le m\), and \({\varvec{B}} :=\left[ {\varvec{b}}_1,\cdots ,{\varvec{b}}_m\right] ^\textsf {H} \in \mathbb {C}^{m\times K}\) is formed by the first K columns of a unitary discrete Fourier transform (DFT) matrix \({\varvec{F}}\in \mathbb {C}^{m\times m}\) obeying \({\varvec{F}}{\varvec{F}}^{\textsf {H} }={\varvec{I}}_m\) (see Appendix D.3.2 for a brief introduction to DFT matrices). This setup models blind deconvolution, where the two signals under convolution belong to known low-dimensional subspaces of dimension K [3].Footnote 7 In particular, the partial DFT matrix \({\varvec{B}}\) plays an important role in image blind deblurring. In this subsection, we consider solving the following nonconvex optimization problem

$$\begin{aligned} \text {minimize}_{{\varvec{h}},{\varvec{x}}\in \mathbb {C}^{K}}\quad f\left( {\varvec{h}},{\varvec{x}}\right) =\sum _{j=1}^{m}\left| {\varvec{b}}_{j}^{\textsf {H} } {\varvec{h}}{\varvec{x}}^{\textsf {H} }{\varvec{a}}_{j} -y_j \right| ^{2}. \end{aligned}$$
(32)

The (Wirtinger) gradient descent algorithm (with spectral initialization) is summarized in Algorithm 3; here, \(\nabla _{{\varvec{h}}} f({\varvec{h}},{\varvec{x}})\) and \(\nabla _{{\varvec{x}}} f({\varvec{h}},{\varvec{x}})\) stand for the Wirtinger gradient and are given in (77) and (78), respectively; see [18, Section 6] for a brief introduction to Wirtinger calculus.
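For intuition, the following Python sketch evaluates the loss (32) and performs one unregularized (Wirtinger) gradient step. The gradient expressions in the sketch are the standard ones obtained by differentiating (32) with respect to \(\overline{{\varvec{h}}}\) and \(\overline{{\varvec{x}}}\), and should match (77) and (78) up to normalization conventions; the data-generation step, the plain step size, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
m, K = 1024, 32

# Design: rows of A are the complex Gaussian a_j; rows of B are b_j^H,
# taken from the first K columns of a unitary DFT matrix (as in the text).
A = (rng.standard_normal((m, K)) + 1j * rng.standard_normal((m, K))) / np.sqrt(2)
B = np.fft.fft(np.eye(m), norm="ortho")[:, :K]

h_star = rng.standard_normal(K) + 1j * rng.standard_normal(K)
x_star = rng.standard_normal(K) + 1j * rng.standard_normal(K)
y = (B @ h_star) * (A @ np.conj(x_star))          # bilinear measurements (31)

def loss_and_wirtinger_grads(h, x):
    bh, xHa = B @ h, A @ np.conj(x)               # b_j^H h  and  x^H a_j
    c = bh * xHa - y                              # residuals
    f = np.sum(np.abs(c) ** 2)                    # the loss (32)
    grad_h = B.conj().T @ (c * np.conj(xHa))      # sum_j c_j (a_j^H x) b_j
    grad_x = A.T @ (np.conj(c) * bh)              # sum_j conj(c_j) (b_j^H h) a_j
    return f, grad_h, grad_x

# One vanilla gradient step (no regularization, no projection); eta is our choice.
h = h_star + 0.05 * rng.standard_normal(K)
x = x_star + 0.05 * rng.standard_normal(K)
eta = 0.1 / np.linalg.norm(x_star) ** 2
f, gh, gx = loss_and_wirtinger_grads(h, x)
h, x = h - eta * gh, x - eta * gx
```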

It is self-evident that \({\varvec{h}}^{\star }\) and \({\varvec{x}}^{\star }\) are only identifiable up to global scaling, that is, for any nonzero \(\alpha \in \mathbb {C}\),

$$\begin{aligned} {\varvec{h}}^{\star }{\varvec{x}}^{\star \textsf {H} }=\frac{1}{\overline{\alpha }} {\varvec{h}}^{\star }\left( \alpha {\varvec{x}}^{\star }\right) ^{\textsf {H} }. \end{aligned}$$

In light of this, we will measure the discrepancy between

$$\begin{aligned} {\varvec{z}}:=\begin{bmatrix} {\varvec{h}} \\ {\varvec{x}} \end{bmatrix}\in \mathbb {C}^{2K}\qquad \text {and} \qquad {\varvec{z}}^{\star }:=\begin{bmatrix} {\varvec{h}}^{\star } \\ {\varvec{x}}^{\star } \end{bmatrix}\in \mathbb {C}^{2K} \end{aligned}$$
(33)

via the following function

$$\begin{aligned} \mathrm {dist}\left( {\varvec{z}},{\varvec{z}}^{\star }\right) := \min _{\alpha \in \mathbb {C}} \sqrt{\left\| \frac{1}{\overline{\alpha }}{\varvec{h}}-{\varvec{h}}^{\star }\right\| _{2}^{2}+\left\| \alpha {\varvec{x}}-{\varvec{x}}^{\star }\right\| _{2}^{2}}. \end{aligned}$$
(34)
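Since the minimization over \(\alpha \) in (34) has no simple closed form, a convenient way to evaluate this metric numerically is to optimize over the real and imaginary parts of \(\alpha \); a minimal Python sketch (function name is ours) is given below. The minimizer also serves as the alignment parameter \(\alpha ^{t}\) appearing later in (38).

```python
import numpy as np
from scipy.optimize import minimize

def dist_blind_deconv(h, x, h_star, x_star):
    """Evaluate (34) by minimizing over alpha = u + i*v, starting from alpha = 1.
    Returns both the distance and the minimizing alpha (the alignment parameter)."""
    def objective(uv):
        alpha = uv[0] + 1j * uv[1]
        return (np.linalg.norm(h / np.conj(alpha) - h_star) ** 2
                + np.linalg.norm(alpha * x - x_star) ** 2)

    res = minimize(objective, x0=[1.0, 0.0], method="Nelder-Mead")
    return np.sqrt(res.fun), res.x[0] + 1j * res.x[1]
```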
[Algorithm 3: Wirtinger gradient descent with spectral initialization for blind deconvolution]

Before proceeding, we need to introduce the incoherence parameter [3, 76], which is crucial for blind deconvolution and plays a role similar to that of the incoherence parameter for matrix completion (cf. Definition 3).

Definition 4

(Incoherence for blind deconvolution) Let the incoherence parameter \(\mu \) of \({\varvec{h}}^{\star }\) be the smallest number such that

$$\begin{aligned} \max _{1\le j\le m}\left| {\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}^{\star }\right| \le \frac{\mu }{\sqrt{m}}\left\| {\varvec{h}}^{\star }\right\| _{2}. \end{aligned}$$
(36)
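Equivalently, \(\mu =\sqrt{m}\,\max _{1\le j\le m}\big |{\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}^{\star }\big |/\Vert {\varvec{h}}^{\star }\Vert _{2}\); the following short sketch (names are ours) evaluates it, taking as input the matrix whose jth row is \({\varvec{b}}_{j}^{\textsf {H} }\).

```python
import numpy as np

def incoherence_mu(B_rows, h_star):
    """mu = sqrt(m) * max_j |b_j^H h*| / ||h*||_2, with B_rows[j] = b_j^H (Definition 4)."""
    m = B_rows.shape[0]
    return np.sqrt(m) * np.max(np.abs(B_rows @ h_star)) / np.linalg.norm(h_star)
```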

The incoherence parameter describes the spectral flatness of the signal \({\varvec{h}}^{\star }\). With this definition in place, we have the following theorem, where for identifiability we assume that \(\left\| {\varvec{h}}^{\star }\right\| _{2}=\left\| {\varvec{x}}^{\star }\right\| _{2}\).

Theorem 3

Suppose the number of measurements obeys \(m\ge C\mu ^{2}K\log ^{9}m\) for some sufficiently large constant \(C>0\), and suppose the step size \(\eta >0\) is taken to be some sufficiently small constant. Then there exist constants \(c_{1},c_{2},C_{1},C_{3},C_{4}>0\) such that with probability exceeding \(1-c_{1}m^{-5}-c_{1}me^{-c_{2}K}\), the iterates in Algorithm 3 satisfy

$$\begin{aligned} \mathrm {dist}\left( {\varvec{z}}^{t},{\varvec{z}}^{\star }\right)&\le C_{1}\left( 1-\frac{\eta }{16}\right) ^{t}\frac{1}{\log ^{2}m}\left\| {\varvec{z}}^{\star }\right\| _{2}, \end{aligned}$$
(37a)
$$\begin{aligned} \max _{1\le l\le m}\left| {\varvec{a}}_{l}^{\textsf {H} }\left( \alpha ^{t}{\varvec{x}}^{t}-{\varvec{x}}^{\star }\right) \right|&\le C_{3}\frac{1}{\log ^{1.5}m}\left\| {\varvec{x}}^{\star }\right\| _{2}, \end{aligned}$$
(37b)
$$\begin{aligned} \max _{1\le l\le m}\left| {\varvec{b}}_{l}^{\textsf {H} }\frac{1}{\overline{\alpha ^{t}}}{\varvec{h}}^{t}\right|&\le C_{4}\frac{\mu }{\sqrt{m}}\log ^{2}m\left\| {\varvec{h}}^{\star }\right\| _{2} \end{aligned}$$
(37c)

for all \(t\ge 0\). Here, \(\alpha ^{t}\) denotes the alignment parameter

$$\begin{aligned} \alpha ^{t}:=\arg \min _{\alpha \in \mathbb {C}}\left\| \frac{1}{\overline{\alpha }}{\varvec{h}}^{t}-{\varvec{h}}^{\star }\right\| _{2}^{2}+\left\| \alpha {\varvec{x}}^{t}-{\varvec{x}}^{\star }\right\| _{2}^{2}. \end{aligned}$$
(38)

Theorem 3 provides the first theoretical guarantee of unregularized gradient descent for blind deconvolution at a near-optimal statistical and computational complexity. A few remarks are in order.

  • Implicit regularization Theorem 3 reveals that the unregularized gradient descent iterates remain incoherent with the sampling mechanism (see (37b) and (37c)). Recall that prior works operate upon a regularized cost function with an additional penalty term that regularizes the global scaling \(\{\Vert {\varvec{h}}\Vert _2,\Vert {\varvec{x}}\Vert _2\}\) and the incoherence \(\{|{\varvec{b}}_j^\textsf {H} {\varvec{h}}|\}_{1\le j\le m}\) [58, 76, 82]. In comparison, our theorem implies that it is unnecessary to regularize either the incoherence or the scaling ambiguity, which is somewhat surprising. This justifies the use of regularization-free (Wirtinger) gradient descent for blind deconvolution.

  • Constant step size Compared to the step size \(\eta _t \lesssim 1/m\) suggested in [76] for regularized gradient descent, our theory admits a substantially more aggressive step size (i.e., \(\eta _t\asymp 1\)) even without regularization. Similar to phase retrieval, the computational efficiency is boosted by a factor of m, attaining \(\epsilon \)-accuracy within \(O\left( \log (1/\epsilon )\right) \) iterations (vs. \(O\left( m\log (1/\epsilon )\right) \) iterations in prior theory).

  • Near-optimal sample complexity It is demonstrated that vanilla gradient descent succeeds at a near-optimal sample complexity up to logarithmic factors, although our requirement is slightly worse than [76] which uses explicit regularization. Notably, even under the sample complexity herein, the iteration complexity given in [76] is still \(O\left( m/\mathrm {poly}\log (m)\right) \).

  • Incoherence of spectral initialization As in phase retrieval, Theorem 3 demonstrates that the estimates returned by the spectral method are incoherent with respect to both \(\{{\varvec{a}}_j\}\) and \(\{{\varvec{b}}_j\}\). In contrast, [76] recommends a projection operation (via a linear program) to enforce incoherence of the initial estimates, which is dispensable according to our theory.

  • Contraction in \(\left\| \cdot \right\| _{\mathrm {F}}\) It is easy to check that the Frobenius norm error satisfies \(\left\| {\varvec{h}}^{t}{\varvec{x}}^{t\textsf {H} }-{\varvec{h}}^{\star }{\varvec{x}}^{\star \textsf {H} }\right\| _{\mathrm {F}}\lesssim \mathrm {dist}\left( {\varvec{z}}^{t},{\varvec{z}}^{\star }\right) \), and therefore, Theorem 3 corroborates the empirical results shown in Fig. 1c.

4 Related Work

Solving nonlinear systems of equations has received much attention in the past decade. Rather than directly attacking the nonconvex formulation, convex relaxation lifts the object of interest into a higher-dimensional space and then attempts recovery via semidefinite programming (e.g., [3, 19, 20, 94]). This has enjoyed great success in both theory and practice. Despite appealing statistical guarantees, semidefinite programming is in general prohibitively expensive when processing large-scale datasets.

Nonconvex approaches, on the other hand, have been under extensive study in the last few years, due to their computational advantages. There is a growing list of statistical estimation problems for which nonconvex approaches are guaranteed to find globally optimal solutions, including but not limited to phase retrieval [18, 25, 90], low-rank matrix sensing and completion [7, 32, 48, 115, 130], blind deconvolution and self-calibration [72, 76, 78, 82], dictionary learning [106], tensor decomposition [49], joint alignment [24], learning shallow neural networks [103, 132], and robust subspace learning [34, 74, 86, 91]. In several problems [40, 48, 49, 75, 77, 86, 87, 105, 106], it is further suggested that the optimization landscape is benign under sufficiently large sample complexity, in the sense that all local minima are globally optimal, and hence nonconvex iterative algorithms become promising in solving such problems. See [37] for a recent overview. Below we review the three problems studied in this paper in more detail. Some state-of-the-art results are summarized in Table 1.

  • Phase retrieval Candès et al. proposed PhaseLift [20] to solve quadratic systems of equations based on convex programming. Specifically, it lifts the decision variable \({\varvec{x}}^{\star }\) into a rank-one matrix \({\varvec{X}}^{\star }={\varvec{x}}^{\star }{\varvec{x}}^{\star \top }\) and translates the quadratic constraints on \({\varvec{x}}^{\star }\) in (14) into linear constraints on \({\varvec{X}}^{\star }\). By dropping the rank constraint, the problem becomes convex [11, 16, 20, 29, 113]. Another convex program, PhaseMax [5, 41, 50, 53], operates in the natural parameter space via linear programming, provided that an anchor vector is available. On the other hand, alternating minimization [90] with sample splitting has been shown to enjoy much better computational guarantees. In contrast, Wirtinger flow [18] provides the first global convergence result for nonconvex methods without sample splitting, whose statistical and computational guarantees are later improved by [25] via an adaptive truncation strategy. Several other variants of WF are also proposed [12, 68, 102], among which an amplitude-based loss function has been investigated [117, 118, 119, 127]. In particular, [127] demonstrates that the amplitude-based loss function has a better curvature, and vanilla gradient descent can indeed converge with a constant step size at the orderwise optimal sample complexity. A small sample of other nonconvex phase retrieval methods includes [6, 10, 22, 36, 43, 47, 92, 98, 100, 109, 122], which are beyond the scope of this paper.

  • Matrix completion Nuclear norm minimization was studied in [19] as a convex relaxation paradigm to solve the matrix completion problem. Under certain incoherence conditions imposed upon the ground truth matrix, exact recovery is guaranteed under near-optimal sample complexity [14, 23, 38, 51, 93]. Concurrently, several works [54, 55, 60, 61, 63, 64, 65, 71, 110, 123, 129] tackled the matrix completion problem via nonconvex approaches. In particular, the seminal work by Keshavan et al. [64, 65] pioneered the two-stage approach that is widely adopted by later works. Sun and Luo [107] demonstrated the convergence of gradient descent type methods for noiseless matrix completion with a regularized nonconvex loss function. Instead of penalizing the loss function, [32, 131] employed projection to enforce the incoherence condition throughout the execution of the algorithm. To the best of our knowledge, prior to this work no rigorous guarantees had been established for matrix completion without explicit regularization, with the notable exception of [63], which uses unregularized stochastic gradient descent for matrix completion in the online setting; the analysis there, however, is performed with fresh samples in each iteration. Our work closes the gap and makes the first contribution toward understanding implicit regularization in gradient descent without sample splitting. In addition, entrywise eigenvector perturbation has been studied by [1, 26, 60] in order to analyze spectral algorithms for matrix completion, which helps us establish theoretical guarantees for the spectral initialization step. Finally, it has recently been shown that the analysis of nonconvex gradient descent in turn yields near-optimal statistical guarantees for convex relaxation in the context of noisy matrix completion; see [28, 31].

  • Blind deconvolution In [3], Ahmed et al. first proposed to invoke similar lifting ideas for blind deconvolution, which translates the bilinear measurements (31) into a system of linear measurements of a rank-one matrix \({\varvec{X}}^{\star }={\varvec{h}}^{\star }{\varvec{x}}^{\star \textsf {H} }\). Near-optimal performance guarantees have been established for convex relaxation [3]. Under the same model, Li et al. [76] proposed a regularized gradient descent algorithm that directly optimizes the nonconvex loss function (32) with a few regularization terms that account for scaling ambiguity and incoherence. In [58], a Riemannian steepest descent method is developed that removes the regularization for scaling ambiguity, although regularization is still needed to promote incoherence. In [2], a linear program is proposed, but it requires exact knowledge of the signs of the signals. Blind deconvolution has also been studied for other models; interested readers are referred to [35, 72, 73, 81, 82, 120, 128].

On the other hand, our analysis framework is based on a leave-one-out perturbation argument. This technique has been widely used to analyze high-dimensional problems with random designs, including but not limited to robust M-estimation [44, 45], statistical inference for sparse regression [62], the likelihood ratio test in logistic regression [108], phase synchronization [1, 133], ranking from pairwise comparisons [30], community recovery [1], and covariance sketching [79]. In particular, this technique results in tight performance guarantees for the generalized power method [133], the spectral method [1, 30], and convex programming approaches [30, 44, 108, 133]; prior to this work, however, it had not been applied to the analysis of nonconvex optimization algorithms.

Finally, we note that the notion of implicit regularization (broadly defined) arises in settings far beyond the models and algorithms considered herein. For instance, it has been conjectured that in matrix factorization, over-parameterized stochastic gradient descent effectively enforces certain norm constraints, allowing it to converge to a minimal-norm solution as long as it starts from the origin [52]. Stochastic gradient methods have also been shown to implicitly enforce Tikhonov regularization in several statistical learning settings [80]. More broadly, this phenomenon seems crucial in enabling efficient training of deep neural networks [104, 125].

5 A General Recipe for Trajectory Analysis

In this section, we sketch a general recipe for establishing performance guarantees of gradient descent, which conveys the key idea for proving the main results of this paper. The main challenge is to demonstrate that appropriate incoherence conditions are preserved throughout the trajectory of the algorithm. This requires exploiting the statistical independence of the samples in a careful manner, in conjunction with generic optimization theory. Central to our approach is a leave-one-out perturbation argument, which allows us to decouple the statistical dependency while controlling the component-wise incoherence measures.


5.1 General Model

Consider the following problem where the samples are collected in a bilinear/quadratic form as

$$\begin{aligned} y_{j}={\varvec{\psi }}_{j}^{\textsf {H} } {\varvec{H}}^{\star }{\varvec{X}}^{\star \textsf {H} } {\varvec{\phi }}_{j} , \qquad 1\le j\le m, \end{aligned}$$
(39)

where the objects of interest \({\varvec{H}}^{\star }, {\varvec{X}}^{\star }\in \mathbb {C}^{n\times r}\) or \(\mathbb {R}^{n\times r}\) might be vectors or tall matrices taking either real or complex values. The design vectors \(\left\{ {\varvec{\psi }}_{j}\right\} \) and \(\{{\varvec{\phi }}_{j}\}\) are in either \(\mathbb {C}^n\) or \(\mathbb {R}^n\), and can be either random or deterministic. This model is quite general and entails all three examples in this paper as special cases:

  • Phase retrieval: \({\varvec{H}}^{\star }={\varvec{X}}^{\star }={\varvec{x}}^{\star }\in \mathbb {R}^{n}\), and \({\varvec{\psi }}_{j}={\varvec{\phi }}_{j}={\varvec{a}}_{j}\);

  • Matrix completion: \({\varvec{H}}^{\star }={\varvec{X}}^{\star }\in \mathbb {R}^{n\times r}\) and \({\varvec{\psi }}_{j},{\varvec{\phi }}_{j}\in \{{\varvec{e}}_{1},\cdots ,{\varvec{e}}_{n}\}\);

  • Blind deconvolution: \({\varvec{H}}^{\star }={\varvec{h}}^{\star }\in \mathbb {C}^{K}\), \({\varvec{X}}^{\star }={\varvec{x}}^{\star }\in \mathbb {C}^{K}\), \({\varvec{\phi }}_{j}={\varvec{a}}_{j},\) and \({\varvec{\psi }}_{j}={\varvec{b}}_{j}\).

For this setting, the empirical loss function is given by

$$\begin{aligned} f({\varvec{Z}}) := f({\varvec{H}},{\varvec{X}})=\frac{1}{m}\sum _{j=1}^{m}\Big | {\varvec{\psi }}_{j}^{\textsf {H} } {\varvec{H}}{\varvec{X}}^{\textsf {H} } {\varvec{\phi }}_{j}-y_{j}\Big |^{2}, \end{aligned}$$

where we denote \({\varvec{Z}}=({\varvec{H}},{\varvec{X}})\). To minimize \(f({\varvec{Z}})\), we proceed with vanilla gradient descent

$$\begin{aligned} {\varvec{Z}}^{t+1} = {\varvec{Z}}^{t}-\eta \nabla f\big ({\varvec{Z}}^{t}\big ), \qquad \forall t\ge 0 \end{aligned}$$

following a standard spectral initialization, where \(\eta \) is the step size. As a remark, for complex-valued problems, the gradient (resp. Hessian) should be understood as the Wirtinger gradient (resp. Hessian).
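To fix notation, the following Python sketch (conventions and names are ours: the jth rows of Psi and Phi store \({\varvec{\psi }}_{j}^{\textsf {H} }\) and \({\varvec{\phi }}_{j}^{\textsf {H} }\), respectively) evaluates the empirical loss \(f({\varvec{Z}})\) for the general model (39); the three examples correspond to different choices of the design matrices, as indicated in the comments.

```python
import numpy as np

def general_loss(H, X, Psi, Phi, y):
    """f(Z) = (1/m) * sum_j |psi_j^H H X^H phi_j - y_j|^2 for the model (39).

    Instantiations (cf. the bulleted list above):
      - phase retrieval:     H = X = x (n x 1),   Psi = Phi with rows a_j^T
      - matrix completion:   H = X (n x r),       rows of Psi/Phi are standard basis vectors
      - blind deconvolution: H = h, X = x (K x 1), rows of Psi are b_j^H, rows of Phi are a_j^H
    """
    preds = np.einsum("jr,jr->j", Psi @ H, np.conj(Phi @ X))  # psi_j^H H X^H phi_j
    return np.sum(np.abs(preds - y) ** 2) / len(y)

# Vanilla gradient descent then iterates Z^{t+1} = Z^t - eta * grad f(Z^t), with the
# (Wirtinger) gradient specialized to each model (see (43), (59), and (77)-(78)).
```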

It is clear from (39) that \({\varvec{Z}}^{\star } = ({\varvec{H}}^{\star },{\varvec{X}}^{\star })\) can only be recovered up to certain global ambiguity. For clarity of presentation, we assume in this section that such ambiguity has already been taken care of via proper global transformation.

5.2 Outline of the Recipe

We are now positioned to outline the general recipe, which entails the following steps.

  • Step 1: characterizing local geometry in the RIC Our first step is to characterize a region \(\mathcal {R}\), which we term the region of incoherence and contraction (RIC), such that the Hessian matrix \(\nabla ^{2}f({\varvec{Z}})\) obeys strong convexity and smoothness,

    $$\begin{aligned} {\varvec{0}} \,\prec \, \alpha {\varvec{I}} \,\preceq \, \nabla ^{2}f({\varvec{Z}}) \,\preceq \, \beta {\varvec{I}},\qquad \forall {\varvec{Z}}\in \mathcal {R}, \end{aligned}$$
    (40)

    or at least along certain directions (i.e., restricted strong convexity and smoothness), where \(\beta /\alpha \) scales slowly (or even remains bounded) with the problem size. As revealed by optimization theory, this geometric property (40) immediately implies linear convergence with the contraction rate \(1- O(\alpha /\beta )\) for a properly chosen step size \(\eta \), as long as all iterates stay within the RIC.

    A natural question then arises: What does the RIC \(\mathcal {R}\) look like? As it turns out, the RIC typically contains all points such that the \(\ell _2\) error \(\Vert {\varvec{Z}} - {\varvec{Z}}^{\star }\Vert _{\mathrm {F}}\) is not too large and

    $$\begin{aligned}&(\mathbf{incoherence }) \qquad \max _{j} {\big \Vert {\varvec{\phi }}_{j}^{\textsf {H} }({\varvec{X}}-{\varvec{X}}^{\star })\big \Vert _2}~~\text {and}~~ \max _{j} {\big \Vert {\varvec{\psi }}_{j}^{\textsf {H} }({\varvec{H}}-{\varvec{H}}^{\star })\big \Vert _2}\nonumber \\ {}&\quad \text { are well controlled}. \end{aligned}$$
    (41)

    In the three examples, the above incoherence condition translates to:

    • Phase retrieval: \(\max _{j} {\big |{\varvec{a}}_{j}^{\top }({\varvec{x}}-{\varvec{x}}^{\star })\big |}\) is well controlled;

    • Matrix completion: \(\big \Vert {\varvec{X}}- {\varvec{X}}^{\star }\big \Vert _{2,\infty }\) is well controlled;

    • Blind deconvolution: \(\max _{j} {\big |{\varvec{a}}_{j}^{\textsf {H} }({\varvec{x}}-{\varvec{x}}^{\star })\big |}\) and \(\max _{j} {\big |{\varvec{b}}_{j}^{\textsf {H} }({\varvec{h}}-{\varvec{h}}^{\star })\big |}\) are well controlled.

  • Step 2: introducing the leave-one-out sequences To justify that no iterates leave the RIC, we rely on the construction of auxiliary sequences. Specifically, for each l, produce an auxiliary sequence \(\{{\varvec{Z}}^{t,(l)}=({\varvec{X}}^{t,(l)},{\varvec{H}}^{t,(l)})\}\) such that \({\varvec{X}}^{t,(l)}\) (resp. \({\varvec{H}}^{t,(l)}\)) is independent of any sample involving \({\varvec{\phi }}_l\) (resp. \({\varvec{\psi }}_l\)). As an example, suppose that the \({\varvec{\phi }}_l\)’s and the \({\varvec{\psi }}_l\)’s are independently and randomly generated. Then for each l, one can consider a leave-one-out loss function

    $$\begin{aligned} f^{(l)}({\varvec{Z}}) : = \frac{1}{m}\sum _{j: j\ne l}\Big | {\varvec{\psi }}_{j}^{\textsf {H} } {\varvec{H}}{\varvec{X}}^{\textsf {H} } {\varvec{\phi }}_{j}-y_{j}\Big |^{2} \end{aligned}$$

    that discards the lth sample. One further generates \(\{{\varvec{Z}}^{t,(l)}\}\) by running vanilla gradient descent w.r.t. this auxiliary loss function, with a spectral initialization that similarly discards the lth sample. Note that this procedure is only introduced to facilitate analysis and is never implemented in practice.

  • Step 3: establishing the incoherence condition We are now ready to establish the incoherence condition with the assistance of the auxiliary sequences. Usually, the proof proceeds by induction, where our goal is to show that the next iterate remains within the RIC, given that the current one does.

    • Step 3(a): proximity between the original and the leave-one-out iterates As one can anticipate, \(\{{\varvec{Z}}^{t}\}\) and \(\{{\varvec{Z}}^{t,(l)}\}\) remain “glued” to each other along the whole trajectory, since their constructions differ by only a single sample. In fact, as long as the initial estimates stay sufficiently close, their gaps will never explode. To intuitively see why, use the fact \(\nabla f({\varvec{Z}}^{t})\approx \nabla f^{(l)}({\varvec{Z}}^{t})\) to discover that

      $$\begin{aligned} {\varvec{Z}}^{t+1}-{\varvec{Z}}^{t+1,(l)}&={\varvec{Z}}^{t}-\eta \nabla f({\varvec{Z}}^{t})-\big ({\varvec{Z}}^{t,(l)}-\eta \nabla f^{(l)}\big ({\varvec{Z}}^{t,(l)}\big )\big )\\&\approx {\varvec{Z}}^{t}-{\varvec{Z}}^{t,(l)}-\eta \nabla ^{2}f({\varvec{Z}}^{t})\big ({\varvec{Z}}^{t}-{\varvec{Z}}^{t,(l)}\big ), \end{aligned}$$

      which together with the strong convexity condition implies \(\ell _{2}\) contraction

      $$\begin{aligned} \big \Vert {\varvec{Z}}^{t+1}-{\varvec{Z}}^{t+1,(l)} \big \Vert _{\mathrm {F}} \approx \Big \Vert \big ({\varvec{I}}-\eta \nabla ^{2}f({\varvec{Z}}^{t})\big )\big ({\varvec{Z}}^{t}-{\varvec{Z}}^{t,(l)}\big )\Big \Vert _{\mathrm {F}}\le \big \Vert {\varvec{Z}}^{t}-{\varvec{Z}}^{t,(l)} \big \Vert _{\mathrm {F}}. \end{aligned}$$

      Indeed, (restricted) strong convexity is crucial in controlling the size of leave-one-out perturbations.

    • Step 3(b): incoherence condition of the leave-one-out iterates The fact that \({\varvec{Z}}^{t+1}\) and \({\varvec{Z}}^{t+1,(l)}\) are exceedingly close motivates us to control the incoherence of \({\varvec{Z}}^{t+1,(l)} - {\varvec{Z}}^{\star }\) instead, for \(1\le l\le m\). By construction, \({\varvec{X}}^{t+1,(l)}\) (resp. \({\varvec{H}}^{t+1,(l)}\)) is statistically independent of any sample involving the design vector \({\varvec{\phi }}_{l}\) (resp. \({\varvec{\psi }}_ l\)), a fact that typically leads to a more friendly analysis for controlling \(\left\| {\varvec{\phi }}_{l}^{\textsf {H} }\big ({\varvec{X}}^{t+1,(l)}-{\varvec{X}}^{\star }\big ) \right\| _2\) and \(\left\| {\varvec{\psi }}_{l}^{\textsf {H} }\big ({\varvec{H}}^{t+1,(l)}-{\varvec{H}}^{\star }\big ) \right\| _2\).

    • Step 3(c): combining the bounds With these results in place, apply the triangle inequality to obtain

      $$\begin{aligned} \big \Vert {\varvec{\phi }}_{l}^{\textsf {H} }\big ({\varvec{X}}^{t+1}-{\varvec{X}}^{\star }\big )\big \Vert _2&\le \big \Vert {\varvec{\phi }}_{l}\Vert _2 \big \Vert {\varvec{X}}^{t+1}-{\varvec{X}}^{t+1,(l)} \big \Vert _{\mathrm {F}} + \big \Vert {\varvec{\phi }}_{l}^{\textsf {H} }\big ({\varvec{X}}^{t+1,(l)}-{\varvec{X}}^{\star }\big )\big \Vert _2, \end{aligned}$$

      where the first term is controlled in Step 3(a) and the second term is controlled in Step 3(b). The term \(\big \Vert {\varvec{\psi }}_{l}^{\textsf {H} }\big ({\varvec{H}}^{t+1}-{\varvec{H}}^{\star }\big )\big \Vert _2\) can be bounded similarly. By choosing the bounds properly, this establishes the incoherence condition for all \(1\le l\le m\) as desired.

6 Analysis for Phase Retrieval

In this section, we instantiate the general recipe presented in Sect. 5 to phase retrieval and prove Theorem 1. Similar to Section 7.1 of [18], we use \(\eta _t = c_1/( \log n \cdot \Vert {\varvec{x}}^{\star } \Vert _2^2 )\) instead of \(c_1/( \log n \cdot \Vert {\varvec{x}}^{0} \Vert _2^2 )\) as the step size for the analysis. This is because, with high probability, \(\Vert {\varvec{x}}^{0} \Vert _2\) and \(\Vert {\varvec{x}}^{\star } \Vert _2\) are rather close in the relative sense. Without loss of generality, we assume throughout this section that \(\big \Vert {\varvec{x}}^{\star }\big \Vert _{2}=1\) and

$$\begin{aligned} \mathrm {dist}({\varvec{x}}^{0},{\varvec{x}}^{\star })=\Vert {\varvec{x}}^{0}-{\varvec{x}}^{\star }\Vert _{2} \le \Vert {\varvec{x}}^{0}+{\varvec{x}}^{\star }\Vert _{2}. \end{aligned}$$
(42)

In addition, the gradient and the Hessian of \(f(\cdot )\) for this problem (see (15)) are given, respectively, by

$$\begin{aligned} \nabla f\left( {\varvec{x}}\right)&=\frac{1}{m}\sum _{j=1}^{m}\left[ \left( {\varvec{a}}_{j}^{\top }{\varvec{x}}\right) ^{2} -y_{j}\right] \left( {\varvec{a}}_{j}^{\top }{\varvec{x}}\right) {\varvec{a}}_{j}, \end{aligned}$$
(43)
$$\begin{aligned} \nabla ^{2}f\left( {\varvec{x}}\right)&=\frac{1}{m}\sum _{j=1}^{m}\left[ 3 \left( {\varvec{a}}_{j}^{\top }{\varvec{x}}\right) ^{2}-y_{j}\right] {\varvec{a}}_{j} {\varvec{a}}_{j}^{\top }, \end{aligned}$$
(44)

which are useful throughout the proof.
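For reference, a direct Python implementation of (43) and (44) is given below (illustrative only; names and the parameter choices in the check are ours). It can also be used to numerically sanity-check the local geometry asserted in Lemma 1 by inspecting the spectrum of the Hessian at points near \({\varvec{x}}^{\star }\).

```python
import numpy as np

def grad_f(x, A, y):
    """Gradient (43): (1/m) sum_j [(a_j^T x)^2 - y_j] (a_j^T x) a_j; rows of A are a_j^T."""
    Ax = A @ x
    return A.T @ ((Ax ** 2 - y) * Ax) / len(y)

def hessian_f(x, A, y):
    """Hessian (44): (1/m) sum_j [3 (a_j^T x)^2 - y_j] a_j a_j^T."""
    Ax = A @ x
    return A.T @ (A * (3 * Ax ** 2 - y)[:, None]) / len(y)

# Illustrative check near the truth (with m of order n log n):
rng = np.random.default_rng(0)
n = 100
m = int(10 * n * np.log(n))
x_star = rng.standard_normal(n)
x_star /= np.linalg.norm(x_star)
A = rng.standard_normal((m, n))
y = (A @ x_star) ** 2
eigs = np.linalg.eigvalsh(hessian_f(x_star, A, y))
print(eigs.min(), eigs.max())   # expect min >= 1/2 and max well below C log n, consistent with Lemma 1
```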

6.1 Step 1: Characterizing Local Geometry in the RIC

6.1.1 Local Geometry

We start by characterizing the region that enjoys both strong convexity and the desired level of smoothness. This is supplied in the following lemma, which plays a crucial role in the subsequent analysis.

Lemma 1

(Restricted strong convexity and smoothness for phase retrieval) Fix any sufficiently small constant \(C_{1}>0\) and any sufficiently large constant \(C_{2}>0\), and suppose the sample complexity obeys \(m\ge c_{0}n\log n\) for some sufficiently large constant \(c_{0}>0\). With probability at least \(1-O(mn^{-10})\),

$$\begin{aligned} \nabla ^{2}f\left( {\varvec{x}}\right) \succeq \left( {1} / {2}\right) \cdot {\varvec{I}}_{n} \end{aligned}$$

holds simultaneously for all \({\varvec{x}}\in \mathbb {R}^{n}\) satisfying \(\left\| {\varvec{x}}-{\varvec{x}}^{\star }\right\| _{2}\le 2C_{1}\), and

$$\begin{aligned} \nabla ^{2}f\left( {\varvec{x}}\right) \preceq \left( 5C_{2}\left( 10+C_{2}\right) \log n\right) \cdot {\varvec{I}}_{n} \end{aligned}$$

holds simultaneously for all \({\varvec{x}}\in \mathbb {R}^{n}\) obeying

$$\begin{aligned} \left\| {\varvec{x}} -{\varvec{x}}^{\star }\right\| _{2}&\le 2C_{1}, \end{aligned}$$
(45a)
$$\begin{aligned} \max _{1\le j\le m}\left| {\varvec{a}}_{j}^{\top }\left( {\varvec{x}} -{\varvec{x}}^{\star }\right) \right|&\le C_{2}\sqrt{\log n}. \end{aligned}$$
(45b)

Proof

See Appendix A.1. \(\square \)

In words, Lemma 1 reveals that the Hessian matrix is positive definite and (almost) well conditioned, if one restricts attention to the set of points that are (i) not far away from the truth (cf. (45a)) and (ii) incoherent with respect to the measurement vectors \(\left\{ {\varvec{a}}_{j}\right\} _{1\le j\le m}\) (cf. (45b)).

6.1.2 Error Contraction

As pointed out before, the nice local geometry enables \(\ell _{2}\) contraction, which we formalize below.

Lemma 2

There exists an event that does not depend on t and has probability \(1-O(mn^{-10})\), such that when it happens and \({\varvec{x}}^t\) obeys conditions (45), one has

$$\begin{aligned} \left\| {\varvec{x}}^{t+1}-{\varvec{x}}^{\star }\right\| _{2}&\le \left( 1-\eta /2\right) \left\| {\varvec{x}}^{t}-{\varvec{x}}^{\star }\right\| _{2} \end{aligned}$$
(46)

provided that the step size satisfies \(0<\eta \le 1/ \left[ 5C_{2}\left( 10+C_{2}\right) \log n\right] \).

Proof

This proof applies the standard argument when establishing the \(\ell _{2}\) error contraction of gradient descent for strongly convex and smooth functions. See Appendix A.2. \(\square \)

With the help of Lemma 2, we can reduce the proof of Theorem 1 to ensuring that the trajectory \(\left\{ {\varvec{x}}^t\right\} _{0\le t \le n}\) lies in the RIC specified by (47).Footnote 8 This is formally stated in the next lemma.

Lemma 3

Suppose for all \(0\le t\le T_{0}:=n\), the trajectory \(\left\{ {\varvec{x}}^t\right\} \) falls within the region of incoherence and contraction (termed the RIC), namely

$$\begin{aligned} \left\| {\varvec{x}}^{t}-{\varvec{x}}^{\star }\right\| _{2}&\le C_{1}, \end{aligned}$$
(47a)
$$\begin{aligned} \max _{1\le l\le m}\left| {\varvec{a}}_{l}^{\top }\left( {\varvec{x}}^{t}-{\varvec{x}}^{\star }\right) \right|&\le C_{2}\sqrt{\log n}, \end{aligned}$$
(47b)

then the claims in Theorem 1 hold true. Here and throughout this section, \(C_{1},C_{2}>0\) are two absolute constants as specified in Lemma 1.

Proof

See Appendix A.3. \(\square \)

6.2 Step 2: Introducing the Leave-One-Out Sequences

In comparison with the \(\ell _{2}\) error bound (47a) that captures the overall loss, the incoherence hypothesis (47b)—which concerns sample-wise control of the empirical risk—is more complicated to establish. This is partly due to the statistical dependence between \({\varvec{x}}^{t}\) and the sampling vectors \(\{{\varvec{a}}_{l}\}\). As described in the general recipe, the key idea is the introduction of a leave-one-out version of the WF iterates, which removes a single measurement from consideration.

To be precise, for each \(1\le l\le m\), we define the leave-one-out empirical loss function as

$$\begin{aligned} f^{\left( l\right) }({\varvec{x}}):=\frac{1}{4m}\sum _{j:j\ne l}\left[ \left( {\varvec{a}}_{j}^{\top }{\varvec{x}}\right) ^{2}-y_{j}\right] ^{2}, \end{aligned}$$
(48)

and the auxiliary trajectory \(\left\{ {\varvec{x}}^{t,(l)}\right\} _{t\ge 0}\) is constructed by running WF w.r.t. \(f^{(l)}({\varvec{x}})\). In addition, the spectral initialization \({\varvec{x}}^{0,\left( l\right) }\) is computed based on the rescaled leading eigenvector of the leave-one-out data matrix

$$\begin{aligned} {\varvec{Y}}^{\left( l\right) }:=\frac{1}{m}\sum _{j:j\ne l}y_{j}{\varvec{a}}_{j}{\varvec{a}}_{j}^{\top }. \end{aligned}$$
(49)

Clearly, the entire sequence \(\left\{ {\varvec{x}}^{t,(l)}\right\} _{t\ge 0}\) is independent of the lth sampling vector \({\varvec{a}}_{l}\). This auxiliary procedure is formally described in Algorithm 4.

[Algorithm 4: the lth leave-one-out sequence for phase retrieval]
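The construction is mechanical; the following Python sketch is our own illustration of the procedure summarized in Algorithm 4. The \(\sqrt{\lambda _1({\varvec{Y}}^{(l)})/3}\) rescaling of the leading eigenvector is an assumption on our part (it is a common choice for this model, since \(\mathbb {E}[{\varvec{Y}}]=\Vert {\varvec{x}}^{\star }\Vert _2^2{\varvec{I}}+2{\varvec{x}}^{\star }{\varvec{x}}^{\star \top }\)); the exact rescaling is the one specified in Algorithm 1.

```python
import numpy as np

def leave_one_out_wf(A, y, l, eta, T):
    """Run WF on the leave-one-out loss (48), with spectral initialization based on
    Y^{(l)} in (49).  Rows of A are a_j^T; the l-th sample is discarded throughout,
    so the output trajectory is independent of a_l."""
    m, n = A.shape
    keep = np.arange(m) != l
    A_l, y_l = A[keep], y[keep]

    Y_l = (A_l.T * y_l) @ A_l / m                 # Y^{(l)} in (49); note the 1/m scaling
    eigvals, eigvecs = np.linalg.eigh(Y_l)
    # Rescaled leading eigenvector; the sqrt(lambda_1 / 3) factor is our assumption here.
    x = np.sqrt(max(eigvals[-1], 0.0) / 3) * eigvecs[:, -1]

    for _ in range(T):                            # gradient steps on f^{(l)} in (48)
        Ax = A_l @ x
        x = x - eta * A_l.T @ ((Ax ** 2 - y_l) * Ax) / m
    return x
```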

6.3 Step 3: Establishing the Incoherence Condition by Induction

As revealed by Lemma 3, it suffices to prove that the iterates \(\{{\varvec{x}}^t\}_{0\le t \le T_{0}}\) satisfy (47) with high probability. Our proof will be inductive in nature. For the sake of clarity, we list all the induction hypotheses:

$$\begin{aligned} \left\| {\varvec{x}}^{t}-{\varvec{x}}^{\star }\right\| _{2}&\le C_{1}, \end{aligned}$$
(51a)
$$\begin{aligned} \max _{1\le l\le m}\big \Vert {\varvec{x}}^{t}-{\varvec{x}}^{t,\left( l\right) }\big \Vert _{2}&\le C_{3}\sqrt{\frac{\log n}{n}} \end{aligned}$$
(51b)
$$\begin{aligned} \max _{1\le j\le m}\left| {\varvec{a}}_{j}^{\top }\left( {\varvec{x}}^{t}-{\varvec{x}}^{\star }\right) \right|&\le C_{2}\sqrt{\log n}. \end{aligned}$$
(51c)

Here \(C_{3} >0\) is some universal constant. For any \(t\ge 0\), define \(\mathcal {E}_{t}\) to be the event where the conditions in (51) hold for the tth iteration. According to Lemma 2, there exists some event \(\mathcal {E}\) with probability \(1-O(mn^{-10})\) such that on \(\mathcal {E}_t \cap \mathcal {E}\) one has

$$\begin{aligned} \left\| {\varvec{x}}^{t+1}-{\varvec{x}}^{\star }\right\| _{2} \le C_{1}. \end{aligned}$$
(52)

This subsection is devoted to establishing (51b) and (51c) for the \((t+1)\)th iteration, assuming that (51) holds true up to the tth iteration. We defer the justification of the base case (i.e., initialization at \(t=0\)) to Sect. 6.4.

  • Step 3(a): proximity between the original and the leave-one-out iterates The leave-one-out sequence \(\{{\varvec{x}}^{t,(l)}\}\) behaves similarly to the true WF iterates \(\{{\varvec{x}}^{t}\}\) while maintaining statistical independence from \({\varvec{a}}_{l}\), a key fact that allows us to control the incoherence of the lth leave-one-out sequence w.r.t. \({\varvec{a}}_{l}\). We will formally quantify the gap between \({\varvec{x}}^{t+1}\) and \({\varvec{x}}^{t+1,(l)}\) in the following lemma, which establishes the induction in (51b).

Lemma 4

Suppose that the sample size obeys \(m\ge Cn\log n\) for some sufficiently large constant \(C>0\) and that the step size obeys \(0<\eta <1/[5C_{2}(10+C_{2})\log n]\). Then on some event \(\mathcal {E}_{t+1,1}\subseteq \mathcal {E}_{t}\) obeying \(\mathbb {P}(\mathcal {E}_{t}\cap \mathcal {E}_{t+1,1}^{c} ) = O( mn^{-10})\), one has

$$\begin{aligned} \max _{1\le l\le m}\left\| {\varvec{x}}^{t+1}-{\varvec{x}}^{t+1,(l)}\right\| _{2}\le C_{3}\sqrt{\frac{\log n}{n}}. \end{aligned}$$
(53)

Proof

The proof relies heavily on the restricted strong convexity (see Lemma 1) and is deferred to Appendix A.4. \(\square \)

  • Step 3(b): incoherence of the leave-one-out iterates By construction, \({\varvec{x}}^{t+1,(l)}\) is statistically independent of the sampling vector \({\varvec{a}}_{l}\). One can thus invoke the standard Gaussian concentration results and the union bound to derive that on an event \(\mathcal {E}_{t+1,2} \subseteq \mathcal {E}_t\) obeying \(\mathbb {P}(\mathcal {E}_{t}\cap \mathcal {E}_{t+1,2}^{c} ) = O( mn^{-10})\),

    $$\begin{aligned} \max _{1\le l\le m}\left| {\varvec{a}}_{l}^{\top }\big ({\varvec{x}}^{t+1,\left( l\right) }-{\varvec{x}}^{\star }\big )\right|&\le 5\sqrt{\log n}\big \Vert {\varvec{x}}^{t+1,\left( l\right) }-{\varvec{x}}^{\star }\big \Vert _{2}\nonumber \\&\overset{(\text {i})}{\le }5\sqrt{\log n}\left( \big \Vert {\varvec{x}}^{t+1,\left( l\right) }-{\varvec{x}}^{t+1}\big \Vert _{2}+\left\| {\varvec{x}}^{t+1}-{\varvec{x}}^{\star }\right\| _{2}\right) \nonumber \\&\overset{(\mathrm {ii})}{\le }5\sqrt{\log n}\left( C_{3}\sqrt{\frac{\log n}{n}}+C_{1}\right) \nonumber \\&\le C_{4}\sqrt{\log n} \end{aligned}$$
    (54)

    holds for some constant \(C_{4}\ge 6 C_{1}>0\) and n sufficiently large. Here, (i) comes from the triangle inequality and (ii) arises from the proximity bound (53) and the condition (52).

  • Step 3(c): combining the bounds We are now prepared to establish (51c) for the (\(t+1\))th iteration. Specifically,

    $$\begin{aligned} \max _{1\le l\le m}\left| {\varvec{a}}_{l}^{\top }\left( {\varvec{x}}^{t+1}-{\varvec{x}}^{\star }\right) \right|&\le \max _{1\le l\le m}\left| {\varvec{a}}_{l}^{\top }\big ({\varvec{x}}^{t+1}-{\varvec{x}}^{t+1,\left( l\right) }\big ) \right| +\max _{1\le l\le m}\left| {\varvec{a}}_{l}^{\top }\big ({\varvec{x}}^{t+1,\left( l\right) }-{\varvec{x}}^{\star }\big ) \right| \nonumber \\&\overset{\left( \text {i}\right) }{\le }\max _{1\le l\le m}\Vert {\varvec{a}}_{l}\Vert _{2}\big \Vert {\varvec{x}}^{t+1}-{\varvec{x}}^{t+1,\left( l\right) } \big \Vert _{2}+C_{4}\sqrt{\log n}\nonumber \\&\overset{\left( \text {ii}\right) }{\le }\sqrt{6n}\cdot C_{3}\sqrt{\frac{\log n}{n}}+C_{4}\sqrt{\log n}\le C_{2}\sqrt{\log n}, \end{aligned}$$
    (55)

    where (i) follows from the Cauchy–Schwarz inequality and (54), the inequality (ii) is a consequence of (53) and (98), and the last inequality holds as long as \(C_{2}/(C_{3}+C_{4})\) is sufficiently large. From the deduction above, we easily see that \(\mathbb {P}(\mathcal {E}_{t}\cap \mathcal {E}_{t+1}^{c} ) = O( mn^{-10}) \).

Using mathematical induction and the union bound, we establish (51) for all \(t\le T_{0}=n\) with high probability. This in turn concludes the proof of Theorem 1, as long as the hypotheses are valid for the base case.

6.4 The Base Case: Spectral Initialization

In the end, we return to verify the induction hypotheses for the base case (\(t=0\)), i.e., the spectral initialization obeys (51). The following lemma justifies (51a) by choosing \(\delta \) sufficiently small.

Lemma 5

Fix any small constant \(\delta >0\), and suppose \(m>c_{0}n\log n\) for some large constant \(c_{0}>0\). Consider the two vectors \({\varvec{x}}^{0}\) and \(\widetilde{{\varvec{x}}}^{0}\) as defined in Algorithm 1, and suppose without loss of generality that (42) holds. Then with probability exceeding \(1-O(n^{-10})\), one has

$$\begin{aligned} \left\| {\varvec{Y}}-\mathbb {E}\left[ {\varvec{Y}}\right] \right\|\le & {} \delta , \end{aligned}$$
(56)
$$\begin{aligned} \Vert {\varvec{x}}^{0}-{\varvec{x}}^{\star }\Vert _{2}\le & {} 2\delta \qquad \text {and}\qquad \big \Vert \widetilde{{\varvec{x}}}^{0}-{\varvec{x}}^{\star }\big \Vert _{2}\le \sqrt{2}\delta . \end{aligned}$$
(57)

Proof

This result follows directly from the Davis–Kahan sin\(\Theta \) theorem. See Appendix A.5. \(\square \)

We then move on to justifying (51b), the proximity between the original and leave-one-out iterates for \(t=0\).

Lemma 6

Suppose \(m>c_{0}n\log n\) for some large constant \(c_{0}>0\). Then with probability at least \({1-O(mn^{-10})}\), one has

$$\begin{aligned} \max _{1\le l\le m}\big \Vert {\varvec{x}}^{0}-{\varvec{x}}^{0,\left( l\right) }\big \Vert _{2} \le C_{3}\sqrt{\frac{\log n}{n}}. \end{aligned}$$
(58)

Proof

This is also a consequence of the Davis–Kahan sin\(\Theta \) theorem. See Appendix A.6. \(\square \)

The final claim (51c) can be proved using the same argument as in deriving (55) and hence is omitted.

7 Analysis for Matrix Completion

In this section, we instantiate the general recipe presented in Sect. 5 to matrix completion and prove Theorem 2. Before continuing, we first gather a few useful facts regarding the loss function in (23), whose gradient is given by

$$\begin{aligned} \nabla f\left( {\varvec{X}}\right)&=\frac{1}{p}\mathcal {P}_{\Omega }\left[ {\varvec{X}}{\varvec{X}}^{\top }-\left( {\varvec{M}}^{\star }+{\varvec{E}}\right) \right] {\varvec{X}}. \end{aligned}$$
(59)

We define the expected gradient (with respect to the sampling set \(\Omega \)) to be

$$\begin{aligned} \nabla F\left( {\varvec{X}}\right) =\left[ {\varvec{X}}{\varvec{X}}^{\top }-\left( {\varvec{M}}^{\star }+{\varvec{E}}\right) \right] {\varvec{X}} \end{aligned}$$

and also the gradient and its expectation in the noiseless case to be

$$\begin{aligned} \nabla f_{\mathrm {clean}}\left( {\varvec{X}}\right) =\frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{X}}{\varvec{X}}^{\top }-{\varvec{M}}^{\star }\right) {\varvec{X}}\qquad \text {and}\qquad \nabla F_{\mathrm {clean}}\left( {\varvec{X}}\right) =\left( {\varvec{X}}{\varvec{X}}^{\top }-{\varvec{M}}^{\star }\right) {\varvec{X}}. \end{aligned}$$
(60)

In addition, we need the Hessian \(\nabla ^{2}f_{\mathrm {clean}}({\varvec{X}})\), which is represented by an \(nr\times nr\) matrix. Simple calculations reveal that for any \({\varvec{V}}\in \mathbb {R}^{n\times r}\),

$$\begin{aligned}&\mathrm {vec}\left( {\varvec{V}}\right) ^{\top }\nabla ^{2}f_{\mathrm {clean}}\left( {\varvec{X}}\right) \mathrm {vec}\left( {\varvec{V}}\right) =\frac{1}{2p}\left\| \mathcal {P}_{\Omega }\left( {\varvec{V}}{\varvec{X}}^{\top }+{\varvec{X}}{\varvec{V}}^{\top }\right) \right\| _{\mathrm {F}}^{2}\nonumber \\&\quad +\frac{1}{p}\left\langle \mathcal {P}_{\Omega }\left( {\varvec{X}}{\varvec{X}}^{\top }-{\varvec{M}}^{\star }\right) ,{\varvec{V}}{\varvec{V}}^{\top }\right\rangle , \end{aligned}$$
(61)

where \(\mathrm {vec}({\varvec{V}})\in \mathbb {R}^{nr}\) denotes the vectorization of \({\varvec{V}}\).
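In code, these objects are straightforward to evaluate; the following Python sketch (illustrative; the names and the boolean-mask representation of \(\Omega \) are ours) implements the gradient (59) and the Hessian quadratic form (61).

```python
import numpy as np

def grad_f(X, M_obs, mask, p):
    """Gradient (59): (1/p) * P_Omega[ X X^T - (M* + E) ] X.
    M_obs holds the observed noisy entries M* + E; mask is the boolean matrix of Omega."""
    residual = np.where(mask, X @ X.T - M_obs, 0.0)
    return residual @ X / p

def hessian_quadratic_form(X, V, M_star, mask, p):
    """Quadratic form (61) of the noiseless Hessian at X, evaluated in the direction V."""
    S = np.where(mask, V @ X.T + X @ V.T, 0.0)
    R = np.where(mask, X @ X.T - M_star, 0.0)
    return np.linalg.norm(S, "fro") ** 2 / (2 * p) + np.sum(R * (V @ V.T)) / p
```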

7.1 Step 1: Characterizing Local Geometry in the RIC

7.1.1 Local Geometry

The first step is to characterize the region where the empirical loss function enjoys restricted strong convexity and smoothness in an appropriate sense. This is formally stated in the following lemma.

Lemma 7

(Restricted strong convexity and smoothness for matrix completion) Suppose that the sample size obeys \(n^{2}p\ge C\kappa ^{2}\mu rn\log n\) for some sufficiently large constant \(C>0\). Then with probability at least \(1-O\left( n^{-10}\right) \), the Hessian \(\nabla ^{2}f_{\mathrm {clean}}({\varvec{X}})\) as defined in (61) obeys

$$\begin{aligned} \mathrm {vec}\left( {\varvec{V}}\right) ^{\top }\nabla ^{2}f_{\mathrm {clean}}\left( {\varvec{X}}\right) \mathrm {vec}\left( {\varvec{V}}\right)&\ge \frac{\sigma _{\min }}{2}\left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}\qquad \text {and}\qquad \left\| \nabla ^{2}f_{\mathrm {clean}}\left( {\varvec{X}}\right) \right\| \le \frac{5}{2}\sigma _{\max } \end{aligned}$$
(62)

for all \({\varvec{X}}\) and \({\varvec{V}} = {\varvec{Y}}{\varvec{H}}_{Y}-{\varvec{Z}}\), with \({\varvec{H}}_{Y}:=\arg \min _{{\varvec{R}}\in \mathcal {O}^{r\times r}}\left\| {\varvec{Y}}{\varvec{R}}-{\varvec{Z}}\right\| _{\mathrm {F}}\), satisfying:

$$\begin{aligned} \left\| {\varvec{X}}-{\varvec{X}}^{\star }\right\| _{2,\infty }&\le \epsilon \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }, \end{aligned}$$
(63a)
$$\begin{aligned} \Vert {\varvec{Z}}-{\varvec{X}}^{\star }\Vert&\le \delta \Vert {\varvec{X}}^{\star }\Vert , \end{aligned}$$
(63b)

where \(\epsilon \ll 1/\sqrt{\kappa ^{3}\mu r\log ^{2}n}\) and \(\delta \ll 1 / \kappa \).

Proof

See Appendix B.1. \(\square \)

Lemma 7 reveals that the Hessian matrix is well conditioned in a neighborhood of \({\varvec{X}}^{\star }\) consisting of points that remain incoherent when measured in the \(\ell _2/\ell _\infty \) norm (cf. (63a)), and along directions pointing toward points that are not far away from the truth in the spectral norm (cf. (63b)).

Remark 5

The second condition (63b) is characterized using the spectral norm \(\Vert \cdot \Vert \), while in previous works this is typically presented in the Frobenius norm \(\Vert \cdot \Vert _{\mathrm {F}}\). It is also worth noting that the Hessian matrix—even in the infinite-sample and noiseless case—is rank-deficient and cannot be positive definite. As a result, we resort to the form of strong convexity by restricting attention to certain directions (see the conditions on \({\varvec{V}}\)).

7.1.2 Error Contraction

Our goal is to demonstrate the error bounds (28) measured in three different norms. Notably, if the iterates satisfy (28) at the tth iteration, then \(\Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\Vert _{2,\infty }\) is sufficiently small. Under our sample complexity assumption, \({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\) then satisfies the \(\ell _{2}/\ell _{\infty }\) condition (63a) required in Lemma 7. Consequently, we can invoke Lemma 7 to arrive at the following error contraction result.

Lemma 8

(Contraction w.r.t. the Frobenius norm) Suppose that \(n^{2}p\ge C\kappa ^{3}\mu ^{3}r^{3}n\log ^{3}n\) for some sufficiently large constant \(C>0\), and the noise satisfies (27). There exists an event that does not depend on t and has probability \(1-O(n^{-10})\), such that when it happens and (28a), (28b) hold for the tth iteration, one has

$$\begin{aligned} \big \Vert {\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t+1}-{\varvec{X}}^{\star }\big \Vert _{\mathrm {F}}\le C_{4}\rho ^{t+1}\mu r\frac{1}{\sqrt{np}}\left\| {\varvec{X}}^{\star }\right\| _{\mathrm {F}}+C_{1}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{\mathrm {F}} \end{aligned}$$

provided that \(0<\eta \le {2} / ({25\kappa \sigma _{\max }})\), \(1-\left( {\sigma _{\min }} / {4}\right) \cdot \eta \le \rho <1\), and \(C_{1}\) is sufficiently large.

Proof

The proof is built upon Lemma 7. See Appendix B.2. \(\square \)

Further, if the current iterate satisfies all three conditions in (28), then we can derive a stronger sense of error contraction, namely contraction in terms of the spectral norm.

Lemma 9

(Contraction w.r.t. the spectral norm) Suppose \(n^{2}p\ge C\kappa ^{3}\mu ^{3}r^{3}n\log ^{3}n\) for some sufficiently large constant \(C>0\), and the noise satisfies (27). There exists an event that does not depend on t and has probability \(1-O(n^{-10})\), such that when it happens and (28) holds for the tth iteration, one has

$$\begin{aligned} \big \Vert {\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t+1}-{\varvec{X}}^{\star }\big \Vert \le C_{9}\rho ^{t+1}\mu r\frac{1}{\sqrt{np}}\left\| {\varvec{X}}^{\star }\right\| +C_{10}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\left\| {\varvec{X}}^{\star }\right\| \end{aligned}$$
(64)

provided that \(0<\eta \le {1} / \left( {2\sigma _{\max }}\right) \) and \(1-\left( {\sigma _{\min }} / {3}\right) \cdot \eta \le \rho <1\).

Proof

The key observation is this: the iterate that proceeds according to the population-level gradient reduces the error w.r.t. \(\Vert \cdot \Vert \), namely

$$\begin{aligned} \big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-\eta \nabla F_{\mathrm {clean}}\big ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\big )-{\varvec{X}}^{\star }\big \Vert <\big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big \Vert , \end{aligned}$$

as long as \({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\) is sufficiently close to the truth. Notably, the orthonormal matrix \(\widehat{{\varvec{H}}}^{t}\) is still chosen to be the one that minimizes the \(\Vert \cdot \Vert _{\mathrm {F}}\) distance (as opposed to \(\Vert \cdot \Vert \)), which yields a symmetry property \({\varvec{X}}^{\star \top }{\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}=\big ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\big )^{\top }{\varvec{X}}^{\star }\), crucial for our analysis. See Appendix B.3 for details. \(\square \)

7.2 Step 2: Introducing the Leave-One-Out Sequences

The incoherence property (28b) is difficult to establish directly for the entire trajectory, due to the complicated statistical dependence between the iterates and the sampling set \(\Omega \). To this end, we introduce a collection of leave-one-out versions of \(\left\{ {\varvec{X}}^{t}\right\} _{t\ge 0}\), denoted by \(\left\{ {\varvec{X}}^{t,(l)}\right\} _{t\ge 0}\) for each \(1\le l\le n\). Specifically, \(\left\{ {\varvec{X}}^{t,(l)}\right\} _{t\ge 0}\) denotes the iterates of gradient descent operating on the auxiliary loss function

$$\begin{aligned} f^{(l)}\left( {\varvec{X}}\right) :=\frac{1}{4p}\left\| \mathcal {P}_{\Omega ^{-l}}\left[ {\varvec{X}}{\varvec{X}}^{\top }-\left( {\varvec{M}}^{\star }+{\varvec{E}}\right) \right] \right\| _{\mathrm {F}}^{2}+\frac{1}{4}\left\| \mathcal {P}_{l}\left( {\varvec{X}}{\varvec{X}}^{\top }-{\varvec{M}}^{\star }\right) \right\| _{\mathrm {F}}^{2}. \end{aligned}$$
(65)

Here, \(\mathcal {P}_{\Omega _{l}}\) (resp. \(\mathcal {P}_{\Omega ^{-l}}\) and \(\mathcal {P}_{l}\)) represents the orthogonal projection onto the subspace of matrices which vanish outside of the index set \(\Omega _{l}:=\left\{ \left( i,j\right) \in \Omega \mid i=l\text { or }j=l\right\} \) (resp.  \(\Omega ^{-l}:=\left\{ \left( i,j\right) \in \Omega \mid i\ne l,j\ne l\right\} \) and \(\left\{ \left( i,j\right) \mid i=l\text { or }j=l\right\} \)); that is, for any matrix \({\varvec{M}}\),

$$\begin{aligned}&\left[ \mathcal {P}_{\Omega _{l}}\left( {\varvec{M}}\right) \right] _{i,j}={\left\{ \begin{array}{ll} M_{i,j}, &{} \text {if }\left( i=l\text { or }j=l\right) \text { and }(i,j)\in \Omega ,\\ 0, &{} \text {else}, \end{array}\right. } \end{aligned}$$
(66)
$$\begin{aligned}&\left[ \mathcal {P}_{\Omega ^{-l}}\left( {\varvec{M}}\right) \right] _{i,j}={\left\{ \begin{array}{ll} M_{i,j}, &{} \text {if }i\ne l\text { and }j\ne l\text { and }(i,j)\in \Omega ,\\ 0, &{} \text {else}, \end{array}\right. }\quad \text {and}\nonumber \\&\left[ \mathcal {P}_{l}\left( {\varvec{M}}\right) \right] _{i,j}={\left\{ \begin{array}{ll} 0, &{} \text {if }i\ne l\text { and }j\ne l,\\ M_{i,j}, &{} \text {if }i=l\text { or }j=l. \end{array}\right. } \end{aligned}$$
(67)

The gradient of the leave-one-out loss function (65) is given by

$$\begin{aligned} \nabla f^{\left( l\right) }\left( {\varvec{X}}\right)&=\frac{1}{p}\mathcal {P}_{\Omega ^{-l}}\left[ {\varvec{X}}{\varvec{X}}^{\top }-\left( {\varvec{M}}^{\star }+{\varvec{E}}\right) \right] {\varvec{X}}+\mathcal {P}_{l}\left( {\varvec{X}}{\varvec{X}}^{\top }-{\varvec{M}}^{\star }\right) {\varvec{X}}. \end{aligned}$$
(68)

The full algorithm to obtain the leave-one-out sequence \(\{{\varvec{X}}^{t,(l)}\}_{t\ge 0}\) (including spectral initialization) is summarized in Algorithm 5.

[Algorithm 5: the lth leave-one-out sequence for matrix completion, including its spectral initialization]

Remark 6

Rather than simply dropping all samples in the lth row/column, we replace the lth row/column with their respective population means. In other words, the leave-one-out gradient forms an unbiased surrogate for the true gradient, which is particularly important in ensuring high estimation accuracy.
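To illustrate (66)–(68) and Remark 6 concretely, the following Python sketch (names and the boolean-mask representation are ours) evaluates the leave-one-out gradient, in which the residual on the lth row/column is replaced by its noiseless population counterpart rather than simply dropped.

```python
import numpy as np

def loo_grad(X, M_obs, M_star, mask, p, l):
    """Leave-one-out gradient (68):
    (1/p) P_{Omega^{-l}}[X X^T - (M* + E)] X + P_l(X X^T - M*) X."""
    n = X.shape[0]
    row_or_col_l = np.zeros((n, n), dtype=bool)
    row_or_col_l[l, :] = True
    row_or_col_l[:, l] = True                                             # indices with i = l or j = l

    residual_obs = np.where(mask & ~row_or_col_l, X @ X.T - M_obs, 0.0)   # P_{Omega^{-l}} part
    residual_pop = np.where(row_or_col_l, X @ X.T - M_star, 0.0)          # P_l part (population)
    return residual_obs @ X / p + residual_pop @ X
```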

7.3 Step 3: Establishing the Incoherence Condition by Induction

We will continue the proof of Theorem 2 in an inductive manner. As seen in Sect. 7.1.2, the induction hypotheses (28a) and (28c) hold for the \((t+1)\)th iteration as long as (28) holds at the tth iteration. Therefore, we are left with proving the incoherence hypothesis (28b) for all \(0\le t\le T=O(n^{5})\). For clarity of analysis, it is crucial to maintain a list of induction hypotheses, which includes a few more hypotheses that complement (28), and is given below.

$$\begin{aligned} \big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big \Vert _{\mathrm {F}}&\le \left( C_{4}\rho ^{t}\mu r\frac{1}{\sqrt{np}}+C_{1}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\right) \left\| {\varvec{X}}^{\star }\right\| _{\mathrm {F}}, \end{aligned}$$
(70a)
$$\begin{aligned} \big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big \Vert _{2,\infty }&\le \left( C_{5}\rho ^{t}\mu r\sqrt{\frac{\log n}{np}}+C_{8}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n\log n}{p}}\right) \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }, \end{aligned}$$
(70b)
$$\begin{aligned} \big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big \Vert&\le \left( C_{9}\rho ^{t}\mu r\frac{1}{\sqrt{np}}+C_{10}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\right) \left\| {\varvec{X}}^{\star }\right\| , \end{aligned}$$
(70c)
$$\begin{aligned} \max _{1\le l \le n}\big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{t,\left( l\right) }{\varvec{R}}^{t,\left( l\right) }\big \Vert _{\mathrm {F}}&\le \left( C_{3}\rho ^{t}\mu r\sqrt{\frac{\log n}{np}}+C_{7}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n\log n}{p}}\right) \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }, \end{aligned}$$
(70d)
$$\begin{aligned} \max _{1\le l \le n}\big \Vert \big ({\varvec{X}}^{t,\left( l\right) }\widehat{{\varvec{H}}}^{t,\left( l\right) }-{\varvec{X}}^{\star }\big )_{l,\cdot }\big \Vert _{2}&\le \left( C_{2}\rho ^{t}\mu r\frac{1}{\sqrt{np}}+C_{6}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n\log n}{p}}\right) \left\| {\varvec{X}}^{\star }\right\| _{2,\infty } \end{aligned}$$
(70e)

hold for some absolute constants \(0<\rho <1\) and \(C_{1},\cdots ,C_{10}>0\). Here, \(\widehat{{\varvec{H}}}^{t,\left( l\right) }\) and \({\varvec{R}}^{t,(l)}\) are orthonormal matrices defined by

$$\begin{aligned} \widehat{{\varvec{H}}}^{t,\left( l\right) }&:=\arg \min _{{\varvec{R}}\in \mathcal {O}^{r\times r}}\left\| {\varvec{X}}^{t,\left( l\right) }{\varvec{R}}-{\varvec{X}}^{\star }\right\| _{\mathrm {F}}, \end{aligned}$$
(71)
$$\begin{aligned} {\varvec{R}}^{t,\left( l\right) }&:=\arg \min _{{\varvec{R}}\in \mathcal {O}^{r\times r}}\big \Vert {\varvec{X}}^{t,\left( l\right) }{\varvec{R}}-{\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\big \Vert _{\mathrm {F}}. \end{aligned}$$
(72)
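Both (71) and (72) are instances of the orthogonal Procrustes problem and hence admit a closed-form solution via an SVD; a short Python sketch follows (the function name and the variable names in the usage comments are placeholders of ours for \({\varvec{X}}^{t,(l)}\), \({\varvec{X}}^{\star }\), and \({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\)).

```python
import numpy as np

def best_rotation(X, Z):
    """argmin_{R in O^{r x r}} ||X R - Z||_F  (orthogonal Procrustes).
    If X^T Z = U S V^T is an SVD, the minimizer is R = U V^T."""
    U, _, Vt = np.linalg.svd(X.T @ Z)
    return U @ Vt

# e.g., (71):  H_hat_tl = best_rotation(X_tl, X_star)
#       (72):  R_tl     = best_rotation(X_tl, X_t_aligned)
```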

Clearly, the first three hypotheses (70a)–(70c) constitute the conclusion of Theorem 2, i.e., (28). The last two hypotheses (70d) and (70e) are auxiliary properties connecting the true iterates and the auxiliary leave-one-out sequences. Moreover, we summarize below several immediate consequences of (70), which will be useful throughout.

Lemma 10

Suppose \(n^{2}p \ge C\kappa ^{3}\mu ^{2}r^{2}n\log n\) for some sufficiently large constant \(C>0\), and the noise satisfies (27). Under hypotheses (70), one has

$$\begin{aligned} \left\| {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{t,(l)}\widehat{{\varvec{H}}}^{t,\left( l\right) }\right\| _{\mathrm {F}}&\le 5\kappa \left\| {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{t,(l)}{\varvec{R}}^{t,\left( l\right) }\right\| _{\mathrm {F}}, \end{aligned}$$
(73a)
$$\begin{aligned} \big \Vert {\varvec{X}}^{t,\left( l\right) }\widehat{{\varvec{H}}}^{t,\left( l\right) }-{\varvec{X}}^{\star }\big \Vert _{\mathrm {F}}&\le \left\| {\varvec{X}}^{t,\left( l\right) }{\varvec{R}}^{t,\left( l\right) }-{\varvec{X}}^{\star }\right\| _{\mathrm {F}}\le \left\{ 2C_{4}\rho ^{t}\mu r\frac{1}{\sqrt{np}}+2C_{1}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\right\} \left\| {\varvec{X}}^{\star }\right\| _{\mathrm {F}}, \end{aligned}$$
(73b)
$$\begin{aligned} \big \Vert {\varvec{X}}^{t,\left( l\right) }{\varvec{R}}^{t,\left( l\right) }-{\varvec{X}}^{\star }\big \Vert _{2,\infty }&\le \left\{ \left( C_{3}+C_{5}\right) \rho ^{t}\mu r\sqrt{\frac{\log n}{np}}+\left( C_{8}+C_{7}\right) \frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n\log n}{p}}\right\} \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }, \end{aligned}$$
(73c)
$$\begin{aligned} \big \Vert {\varvec{X}}^{t,\left( l\right) }\widehat{{\varvec{H}}}^{t,\left( l\right) }-{\varvec{X}}^{\star }\big \Vert&\le \left\{ 2C_{9}\rho ^{t}\mu r\frac{1}{\sqrt{np}}+2C_{10}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\right\} \left\| {\varvec{X}}^{\star }\right\| . \end{aligned}$$
(73d)

In particular, (73a) follows from hypotheses (70c) and (70d).

Proof

See Appendix B.4. \(\square \)

In the sequel, we follow the general recipe outlined in Sect. 5 to establish the induction hypotheses. We only need to establish (70b), (70d), and (70e) for the \((t+1)\)th iteration, since (70a) and (70c) are established in Sect. 7.1.2. Specifically, we resort to the leave-one-out iterates by showing that: first, the true and the auxiliary iterates remain exceedingly close throughout; second, the lth leave-one-out sequence stays incoherent with \({\varvec{e}}_{l}\) due to statistical independence.

  • Step 3(a): proximity between the original and the leave-one-out iterates We demonstrate that \({\varvec{X}}^{t+1}\) is well approximated by \({\varvec{X}}^{t+1,(l)}\), up to proper orthonormal transforms. This is precisely the induction hypothesis (70d) for the \((t+1)\)th iteration.

Lemma 11

Suppose the sample complexity satisfies \(n^{2}p\ge C \kappa ^{4}\mu ^{3}r^{3}n\log ^{3}n\) for some sufficiently large constant \(C>0\), and the noise satisfies (27). Let \(\mathcal {E}_t\) be the event where the hypotheses in (70) hold for the tth iteration. Then on some event \(\mathcal {E}_{t+1,1} \subseteq \mathcal {E}_t\) obeying \(\mathbb {P}( \mathcal {E}_t \cap \mathcal {E}_{t+1,1}^c ) = O(n^{-10})\), we have

$$\begin{aligned} \left\| {\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t+1}-{\varvec{X}}^{t+1,(l)}{\varvec{R}}^{t+1,\left( l\right) }\right\| _{\mathrm {F}}\le & {} C_{3}\rho ^{t+1}\mu r\sqrt{\frac{\log n}{np}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\nonumber \\&+\,C_{7}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n\log n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty } \end{aligned}$$
(74)

provided that \(0<\eta \le {2} / {\left( 25\kappa \sigma _{\max }\right) }\), \(1-\left( {\sigma _{\min }} / {5}\right) \cdot \eta \le \rho <1\), and \(C_{7} > 0\) is sufficiently large.

Proof

The fact that this difference is well controlled relies heavily on the benign geometric property of the Hessian revealed by Lemma 7. Two important remarks are in order: (1) both points \({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\) and \({\varvec{X}}^{t,(l)}{\varvec{R}}^{t,\left( l\right) }\) satisfy (63a); (2) the difference \({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{t,(l)}{\varvec{R}}^{t,\left( l\right) }\) forms a valid direction for restricted strong convexity. These two properties together allow us to invoke Lemma 7. See Appendix B.5. \(\square \)

  • Step 3(b): incoherence of the leave-one-out iterates Given that \({\varvec{X}}^{t+1,(l)}\) is sufficiently close to \({\varvec{X}}^{t+1}\), we turn our attention to establishing the incoherence of this surrogate \({\varvec{X}}^{t+1,(l)}\) w.r.t. \({\varvec{e}}_{l}\). This amounts to proving the induction hypothesis (70e) for the \((t+1)\)th iteration.

Lemma 12

Suppose the sample complexity meets \(n^{2}p\ge C \kappa ^{3}\mu ^{3}r^{3}n\log ^{3}n\) for some sufficiently large constant \(C>0\), and the noise satisfies (27). Let \(\mathcal {E}_t\) be the event where the hypotheses in (70) hold for the tth iteration. Then on some event \(\mathcal {E}_{t+1,2} \subseteq \mathcal {E}_t\) obeying \(\mathbb {P}( \mathcal {E}_t \cap \mathcal {E}_{t+1,2}^c ) = O(n^{-10})\), we have

$$\begin{aligned} \left\| \big ({\varvec{X}}^{t+1,\left( l\right) }\widehat{{\varvec{H}}}^{t+1,\left( l\right) }-{\varvec{X}}^{\star }\big )_{l,\cdot }\right\| _{2}\le C_{2}\rho ^{t+1}\mu r\frac{1}{\sqrt{np}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }+C_{6}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n\log n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty } \end{aligned}$$
(75)

so long as \(0<\eta \le 1/{\sigma _{\max }}\), \(1-\left( {\sigma _{\min }} / {3}\right) \cdot \eta \le \rho <1\), \( C_{2} \gg \kappa C_{9} \), and \( C_{6} \gg \kappa C_{10} / \sqrt{\log n} \).

Proof

The key observation is that \({\varvec{X}}^{t+1,(l)}\) is statistically independent of any sample in the lth row/column of the matrix. Since there are on the order of np samples in each row/column, we have enough information to establish the desired incoherence property. See Appendix B.6. \(\square \)
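As a quick numerical illustration of the sample-count heuristic used in this proof, the Python/NumPy snippet below checks that, under Bernoulli(p) sampling of an \(n\times n\) matrix, every row contains on the order of np observed entries; the values of n and p are arbitrary, and the symmetry of the sampling pattern is ignored for simplicity.

```python
import numpy as np

# Toy check: each row of a Bernoulli(p)-sampled n x n matrix holds roughly n*p
# observed entries, up to fluctuations of order sqrt(n*p*log n).
rng = np.random.default_rng(0)
n, p = 2000, 0.02
mask = rng.random((n, n)) < p
row_counts = mask.sum(axis=1)
print(n * p, row_counts.min(), row_counts.max())
```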

  • Step 3(c): combining the bounds Inequalities (70d) and (70e) taken collectively allow us to establish the induction hypothesis (70b). Specifically, for every \(1\le l\le n\), write

    $$\begin{aligned} \big ({\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t+1}-{\varvec{X}}^{\star }\big )_{l,\cdot } =&\,\big ({\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t+1}-{\varvec{X}}^{t+1,(l)}\widehat{{\varvec{H}}}^{t+1, \left( l\right) }\big )_{l,\cdot }\\&+\big ({\varvec{X}}^{t+1,\left( l\right) }\widehat{{\varvec{H}}}^{t+1, \left( l\right) }-{\varvec{X}}^{\star }\big )_{l,\cdot }, \end{aligned}$$

    and the triangle inequality gives

    $$\begin{aligned} \big \Vert \big ({\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t+1}-{\varvec{X}}^{\star }\big )_{l,\cdot }\big \Vert _{2}&\le \big \Vert {\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t+1}-{\varvec{X}}^{t+1,(l)} \widehat{{\varvec{H}}}^{t+1,\left( l\right) }\big \Vert _{\mathrm {F}}\nonumber \\&\quad +\big \Vert \big ({\varvec{X}}^{t+1, \left( l\right) }\widehat{{\varvec{H}}}^{t+1,\left( l\right) }-{\varvec{X}}^{\star }\big )_{l, \cdot }\big \Vert _{2}. \end{aligned}$$
    (76)

    The second term has already been bounded by (75). Since we have established the induction hypotheses (70c) and (70d) for the \((t+1)\)th iteration, the first term can be bounded by (73a) for the \((t+1)\)th iteration, i.e.,

    $$\begin{aligned} \left\| {\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t+1}-{\varvec{X}}^{t+1,(l)}\widehat{{\varvec{H}}}^{t+1,\left( l\right) } \right\| _{\mathrm {F}} \le 5\kappa \left\| {\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t+1}-{\varvec{X}}^{t+1,(l)}{\varvec{R}}^{t+1,\left( l\right) } \right\| _{\mathrm {F}}. \end{aligned}$$

    Plugging the above inequality, (74), and (75) into (76), we have

    $$\begin{aligned}&\left\| {\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t+1}-{\varvec{X}}^{\star }\right\| _{2,\infty }\\&\quad \le 5\kappa \left( C_{3}\rho ^{t+1}\mu r\sqrt{\frac{\log n}{np}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }+\frac{C_{7}}{\sigma _{\min }}\sigma \sqrt{\frac{n\log n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\right) \\&\qquad +C_{2}\rho ^{t+1}\mu r\frac{1}{\sqrt{np}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }+\frac{C_{6}}{\sigma _{\min }}\sigma \sqrt{\frac{n\log n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\\&\quad \le C_{5}\rho ^{t+1}\mu r\sqrt{\frac{\log n}{np}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }+\frac{C_{8}}{\sigma _{\min }}\sigma \sqrt{\frac{n\log n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty } \end{aligned}$$

    as long as \(C_{5}/(\kappa C_{3}+C_{2})\) and \(C_{8}/(\kappa C_{7}+C_{6})\) are sufficiently large. This establishes the induction hypothesis (70b). From the deduction above, we also see that \(\mathbb {P}( \mathcal {E}_t \cap \mathcal {E}_{t+1}^c ) = O(n^{-10})\), which finishes the proof.

7.4 The Base Case: Spectral Initialization

Finally, we return to check the base case; namely, we aim to show that the spectral initialization satisfies the induction hypotheses (70a)–(70e) for \(t=0\). This is accomplished via the following lemma.

Lemma 13

Suppose the sample size obeys \(n^{2}p\ge C \mu ^{2}r^{2}n\log n\) for some sufficiently large constant \(C>0\), the noise satisfies (27), and \(\kappa =\sigma _{\max }/\sigma _{\min }\asymp 1\). Then with probability at least \(1-O\left( n^{-10}\right) \), the claims in (70a)–(70e) hold simultaneously for \(t=0\).

Proof

This follows by invoking the Davis–Kahan sin\(\Theta \) theorem [39] as well as the entrywise eigenvector perturbation analysis in [1]. We defer the proof to Appendix B.7. \(\square \)
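For concreteness, here is a minimal Python/NumPy sketch of a rank-r spectral initialization of the kind analyzed in Lemma 13: rescale the observed entries by 1/p, take the top-r eigenpairs, and measure the estimation error after the best rotation. The dimensions are toy choices and the observations are taken to be noiseless, so this illustrates the mechanism rather than reproducing the paper's setting.

```python
import numpy as np

# Toy spectral initialization for PSD matrix completion:
# X0 comes from the top-r eigenpairs of p^{-1} P_Omega(M_star).
rng = np.random.default_rng(1)
n, r, p = 1000, 3, 0.2

X_star = rng.standard_normal((n, r))
M_star = X_star @ X_star.T

mask = rng.random((n, n)) < p
Omega = np.triu(mask) | np.triu(mask, 1).T      # symmetric Bernoulli(p) sampling pattern
M0 = np.where(Omega, M_star, 0.0) / p           # rescaled observations p^{-1} P_Omega(M_star)

evals, evecs = np.linalg.eigh(M0)
top = np.argsort(evals)[::-1][:r]               # indices of the top-r eigenvalues
X0 = evecs[:, top] * np.sqrt(np.maximum(evals[top], 0.0))

# rotation-invariant error: align X0 with X_star via orthogonal Procrustes
U, _, Vt = np.linalg.svd(X0.T @ X_star)
print(np.linalg.norm(X0 @ (U @ Vt) - X_star, "fro") / np.linalg.norm(X_star, "fro"))
```

The printed relative error shrinks as np grows, in line with the qualitative message of Lemma 13.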

8 Analysis for Blind Deconvolution

In this section, we instantiate the general recipe presented in Sect. 5 to blind deconvolution and prove Theorem 3. Without loss of generality, we assume throughout that \(\left\| {\varvec{h}}^{\star }\right\| _{2}=\left\| {\varvec{x}}^{\star }\right\| _{2}=1\).

Before presenting the analysis, we first gather some simple facts about the empirical loss function in (32). Recall the definition of \({\varvec{z}}\) in (33), and for notational simplicity, we write \(f\left( {\varvec{z}}\right) =f({\varvec{h}},{\varvec{x}})\). Since \({\varvec{z}}\) is complex-valued, we need to resort to Wirtinger calculus; see [18, Section 6] for a brief introduction. The Wirtinger gradients of (32) with respect to \({\varvec{h}}\) and \({\varvec{x}}\) are given, respectively, by

$$\begin{aligned} \nabla _{{\varvec{h}}}f\left( {\varvec{z}}\right) = \nabla _{{\varvec{h}}}f\left( {\varvec{h}},{\varvec{x}}\right)&=\sum _{j=1}^{m}\left( {\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}{\varvec{x}}^{\textsf {H} }{\varvec{a}}_{j}-y_{j}\right) {\varvec{b}}_{j}{\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}; \end{aligned}$$
(77)
$$\begin{aligned} \nabla _{{\varvec{x}}}f\left( {\varvec{z}}\right) = \nabla _{{\varvec{x}}}f\left( {\varvec{h}},{\varvec{x}}\right)&=\sum _{j=1}^{m}\overline{({\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}{\varvec{x}}^{\textsf {H} }{\varvec{a}}_{j}-y_{j})}{\varvec{a}}_{j}{\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}. \end{aligned}$$
(78)

It is worth noting that the formal Wirtinger gradient contains \(\nabla _{\overline{{\varvec{h}}}}f\left( {\varvec{h}},{\varvec{x}}\right) \) and \(\nabla _{\overline{{\varvec{x}}}}f\left( {\varvec{h}},{\varvec{x}}\right) \) as well. Nevertheless, since \(f\left( {\varvec{h}},{\varvec{x}}\right) \) is a real-valued function, the following identities always hold

$$\begin{aligned} \nabla _{{\varvec{h}}}f\left( {\varvec{h}},{\varvec{x}}\right) =\overline{\nabla _{\overline{{\varvec{h}}}} f\left( {\varvec{h}},{\varvec{x}}\right) }\qquad \text {and}\qquad \nabla _{{\varvec{x}}}f \left( {\varvec{h}},{\varvec{x}}\right) =\overline{\nabla _{\overline{{\varvec{x}}}}f\left( {\varvec{h}},{\varvec{x}}\right) }. \end{aligned}$$

In light of these observations, one often omits the gradient with respect to the conjugates; correspondingly, the gradient update rule (35) can be written as

$$\begin{aligned} {\varvec{h}}^{t+1}&={\varvec{h}}^{t}-\frac{\eta }{\left\| {\varvec{x}}^{t}\right\| _{2}^{2}}\sum _{j=1}^{m}\left( {\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}^{t}{\varvec{x}}^{t\textsf {H} }{\varvec{a}}_{j}-y_{j}\right) {\varvec{b}}_{j}{\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{t}, \end{aligned}$$
(79a)
$$\begin{aligned} {\varvec{x}}^{t+1}&={\varvec{x}}^{t}-\frac{\eta }{\left\| {\varvec{h}}^{t}\right\| _{2}^{2}}\sum _{j=1}^{m}\overline{({\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}^{t}{\varvec{x}}^{t\textsf {H} }{\varvec{a}}_{j}-y_{j})}{\varvec{a}}_{j}{\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}^{t}. \end{aligned}$$
(79b)
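To make these formulas concrete, the following Python/NumPy sketch implements the Wirtinger gradients (77)–(78) and performs one scaled gradient step (79) on a toy instance. The dimensions, the step size, the helper names, and the i.i.d. Gaussian \({\varvec{b}}_{j}\)'s (a stand-in for the deterministic partial-Fourier \({\varvec{b}}_{j}\)'s treated in the paper) are illustrative assumptions rather than the setting of Theorem 3.

```python
import numpy as np

# Toy instance of the blind-deconvolution loss (32): a_j ~ CN(0, I_K) and, as a
# simplification, b_j ~ CN(0, I_K / m), so that sum_j b_j b_j^H is close to I_K.
rng = np.random.default_rng(0)
K, m, eta = 32, 640, 0.01
A = (rng.standard_normal((m, K)) + 1j * rng.standard_normal((m, K))) / np.sqrt(2)      # rows: a_j
B = (rng.standard_normal((m, K)) + 1j * rng.standard_normal((m, K))) / np.sqrt(2 * m)  # rows: b_j

h_star = rng.standard_normal(K) + 1j * rng.standard_normal(K)
x_star = rng.standard_normal(K) + 1j * rng.standard_normal(K)
h_star, x_star = h_star / np.linalg.norm(h_star), x_star / np.linalg.norm(x_star)
y = (B.conj() @ h_star) * (A.conj() @ x_star).conj()       # y_j = b_j^H h* x*^H a_j

def loss(h, x):
    u = (B.conj() @ h) * (A.conj() @ x).conj() - y         # residuals b_j^H h x^H a_j - y_j
    return np.sum(np.abs(u) ** 2)

def wirtinger_grads(h, x):
    u = (B.conj() @ h) * (A.conj() @ x).conj() - y
    grad_h = B.T @ (u * (A.conj() @ x))                    # (77): sum_j u_j (a_j^H x) b_j
    grad_x = A.T @ (u.conj() * (B.conj() @ h))             # (78): sum_j conj(u_j) (b_j^H h) a_j
    return grad_h, grad_x

# one scaled step (79), started from a small perturbation of the truth
h = h_star + 0.02 * (rng.standard_normal(K) + 1j * rng.standard_normal(K))
x = x_star + 0.02 * (rng.standard_normal(K) + 1j * rng.standard_normal(K))
gh, gx = wirtinger_grads(h, x)
h_new = h - eta / np.linalg.norm(x) ** 2 * gh
x_new = x - eta / np.linalg.norm(h) ** 2 * gx
print(loss(h, x), ">", loss(h_new, x_new))                 # the loss decreases for small eta
```

For a sufficiently small step size the loss decreases along this direction, which is the elementary manifestation of the contraction quantified in Lemma 15 below.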

We can also compute the Wirtinger Hessian of \(f({\varvec{z}})\) as follows,

$$\begin{aligned} \nabla ^{2}f\left( {\varvec{z}}\right) =\left[ \begin{array}{cc} {\varvec{A}} &{} {\varvec{B}}\\ {\varvec{B}}^{\textsf {H} } &{} \overline{{\varvec{A}}} \end{array}\right] , \end{aligned}$$
(80)

where

$$\begin{aligned} {\varvec{A}}= & {} \left[ \begin{array}{cc} \sum _{j=1}^{m}\left| {\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}\right| ^{2}{\varvec{b}}_{j}{\varvec{b}}_{j}^{\textsf {H} } &{} \sum _{j=1}^{m}\left( {\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}{\varvec{x}}^{\textsf {H} }{\varvec{a}}_{j}-y_{j}\right) {\varvec{b}}_{j}{\varvec{a}}_{j}^{\textsf {H} }\\ \sum _{j=1}^{m}\left[ \left( {\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}{\varvec{x}}^{\textsf {H} }{\varvec{a}}_{j} -y_{j}\right) {\varvec{b}}_{j}{\varvec{a}}_{j}^{\textsf {H} }\right] ^{\textsf {H} } &{} \sum _{j=1}^{m} \left| {\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}\right| ^{2}{\varvec{a}}_{j}{\varvec{a}}_{j}^{\textsf {H} } \end{array}\right] \in \mathbb {C}^{2K\times 2K};\\ {\varvec{B}}= & {} \left[ \begin{array}{cc} {\varvec{0}} &{} \sum _{j=1}^{m}{\varvec{b}}_{j}{\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}\left( {\varvec{a}}_{j} {\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}\right) ^{\top }\\ \sum _{j=1}^{m}{\varvec{a}}_{j}{\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}\left( {\varvec{b}}_{j}{\varvec{b}}_{j}^{\textsf {H} } {\varvec{h}}\right) ^{\top } &{} {\varvec{0}} \end{array}\right] \in \mathbb {C}^{2K\times 2K}. \end{aligned}$$

Last but not least, we say \(\left( {\varvec{h}}_{1},{\varvec{x}}_{1}\right) \) is aligned with \(\left( {\varvec{h}}_{2},{\varvec{x}}_{2}\right) \), if the following holds,

$$\begin{aligned} \left\| {\varvec{h}}_{1}-{\varvec{h}}_{2}\right\| _{2}^{2}+\left\| {\varvec{x}}_{1}-{\varvec{x}}_{2}\right\| _{2}^{2}=\min _{\alpha \in \mathbb {C}}\left\{ \left\| \frac{1}{\overline{\alpha }}{\varvec{h}}_{1}-{\varvec{h}}_{2}\right\| _{2}^{2}+\left\| \alpha {\varvec{x}}_{1}-{\varvec{x}}_{2}\right\| _{2}^{2}\right\} . \end{aligned}$$

To simplify notations, define \(\widetilde{{\varvec{z}}}^t\) as

$$\begin{aligned} \widetilde{{\varvec{z}}}^t = \begin{bmatrix} \widetilde{{\varvec{h}}}^{t} \\ \widetilde{{\varvec{x}}}^{t} \end{bmatrix} :=\begin{bmatrix} \frac{1}{\overline{\alpha ^{t}}}{\varvec{h}}^{t}\\ \alpha ^{t}{\varvec{x}}^{t} \end{bmatrix} \end{aligned}$$
(81)

with the alignment parameter \(\alpha ^{t}\) given in (38). Then we can see that \(\widetilde{{\varvec{z}}}^t\) is aligned with \({\varvec{z}}^{\star }\) and

$$\begin{aligned} \text {dist}\left( {\varvec{z}}^{t},{\varvec{z}}^{\star }\right) =\text {dist}\left( \widetilde{{\varvec{z}}}^{t}, {\varvec{z}}^{\star }\right) =\left\| \widetilde{{\varvec{z}}}^t - {\varvec{z}}^{\star } \right\| _{2}. \end{aligned}$$
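Since this alignment is invoked repeatedly, here is a small Python/NumPy sketch of how the alignment parameter and \(\text {dist}(\cdot ,\cdot )\) can be evaluated numerically: for a fixed modulus \(|\alpha |=r\) the optimal phase has a closed form, leaving a one-dimensional search over r. The helper function, the grid range, and the toy vectors are illustrative choices, not a prescription from the paper.

```python
import numpy as np

def align(h, x, h_star, x_star, r_grid=np.linspace(0.2, 5.0, 20001)):
    """Return (alpha, dist) minimizing ||h/conj(alpha) - h*||^2 + ||alpha x - x*||^2."""
    c1 = np.vdot(h_star, h)                     # h*^H h
    c2 = np.vdot(x_star, x)                     # x*^H x
    # For |alpha| = r the cross terms equal 2 Re(e^{i theta}(c1/r + r c2)),
    # maximized at theta = -arg(c1/r + r c2); only r remains to be searched.
    vals = (np.linalg.norm(h) ** 2 / r_grid ** 2 + r_grid ** 2 * np.linalg.norm(x) ** 2
            + np.linalg.norm(h_star) ** 2 + np.linalg.norm(x_star) ** 2
            - 2 * np.abs(c1 / r_grid + r_grid * c2))
    r = r_grid[np.argmin(vals)]
    alpha = r * np.exp(-1j * np.angle(c1 / r + r * c2))
    dist = np.sqrt(np.linalg.norm(h / np.conj(alpha) - h_star) ** 2
                   + np.linalg.norm(alpha * x - x_star) ** 2)
    return alpha, dist

rng = np.random.default_rng(1)
K = 16
h_star = rng.standard_normal(K) + 1j * rng.standard_normal(K)
x_star = rng.standard_normal(K) + 1j * rng.standard_normal(K)
h_star, x_star = h_star / np.linalg.norm(h_star), x_star / np.linalg.norm(x_star)

beta = 1.3 * np.exp(1j * 0.4)                   # an arbitrary scaling ambiguity
alpha, d = align(beta * h_star, x_star / np.conj(beta), h_star, x_star)
print(d)                                        # ~0: alignment removes the ambiguity
print(alpha, np.conj(beta))                     # alpha recovers conj(beta)
```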

8.1 Step 1: Characterizing Local Geometry in the RIC

8.1.1 Local Geometry

The first step is to characterize the region of incoherence and contraction (RIC), where the empirical loss function enjoys restricted strong convexity and smoothness properties. To this end, we have the following lemma.

Lemma 14

(Restricted strong convexity and smoothness for blind deconvolution) Let \(c>0\) be a sufficiently small constant and

$$\begin{aligned} \delta = c/\log ^2 m. \end{aligned}$$

Suppose the sample size satisfies \(m\ge c_0 \mu ^{2}K\log ^{9}m\) for some sufficiently large constant \(c_0 >0\). Then with probability \(1-O\left( m^{-10} + e^{-K} \log m \right) \), the Wirtinger Hessian \(\nabla ^{2}f\left( {\varvec{z}}\right) \) obeys

$$\begin{aligned} {\varvec{u}}^{\textsf {H} }\left[ {\varvec{D}}\nabla ^{2}f\left( {\varvec{z}}\right) +\nabla ^{2}f\left( {\varvec{z}}\right) {\varvec{D}}\right] {\varvec{u}}\ge ({1}/{4})\cdot \left\| {\varvec{u}}\right\| _{2}^{2}\qquad \text {and}\qquad \left\| \nabla ^{2}f\left( {\varvec{z}}\right) \right\| \le 3 \end{aligned}$$

simultaneously for all

$$\begin{aligned} {\varvec{z}}=\left[ \begin{array}{c} {\varvec{h}}\\ {\varvec{x}} \end{array}\right] \qquad \text {and}\qquad {\varvec{u}} =\left[ \begin{array}{c} {\varvec{h}}_{1}-{\varvec{h}}_{2}\\ {\varvec{x}}_{1}-{\varvec{x}}_{2}\\ \overline{{\varvec{h}}_{1}-{\varvec{h}}_{2}}\\ \overline{{\varvec{x}}_{1}-{\varvec{x}}_{2}} \end{array}\right] \qquad \text {and}\qquad {\varvec{D}}=\left[ \begin{array}{cccc} \gamma _{1}{\varvec{I}}_{K}\\ &{} \gamma _{2}{\varvec{I}}_{K}\\ &{} &{} \gamma _{1}{\varvec{I}}_{K}\\ &{} &{} &{} \gamma _{2}{\varvec{I}}_{K} \end{array}\right] , \end{aligned}$$

where \({\varvec{z}}\) satisfies

$$\begin{aligned} \max \left\{ \left\| {\varvec{h}}-{\varvec{h}}^{\star }\right\| _{2},\left\| {\varvec{x}}-{\varvec{x}}^{\star }\right\| _{2}\right\}&\le \delta ; \end{aligned}$$
(82a)
$$\begin{aligned} \max _{1\le j\le m}\left| {\varvec{a}}_{j}^{\textsf {H} }\left( {\varvec{x}}-{\varvec{x}}^{\star }\right) \right|&\le 2C_{3}\frac{1}{\log ^{3/2}m}; \end{aligned}$$
(82b)
$$\begin{aligned} \max _{1\le j\le m}\left| {\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}\right|&\le 2C_{4}\frac{\mu }{\sqrt{m}}\log ^{2}m; \end{aligned}$$
(82c)

\(\left( {\varvec{h}}_{1},{\varvec{x}}_{1}\right) \) is aligned with \(\left( {\varvec{h}}_{2},{\varvec{x}}_{2}\right) \), and they satisfy

$$\begin{aligned} \max \left\{ \left\| {\varvec{h}}_{1}-{\varvec{h}}^{\star }\right\| _{2},\left\| {\varvec{h}}_{2}-{\varvec{h}}^{\star }\right\| _{2},\left\| {\varvec{x}}_{1}-{\varvec{x}}^{\star }\right\| _{2},\left\| {\varvec{x}}_{2}-{\varvec{x}}^{\star }\right\| _{2}\right\} \le \delta ; \end{aligned}$$
(83)

and finally, the scaling factors \(\gamma _{1},\gamma _{2}\in \mathbb {R}\) in \({\varvec{D}}\) satisfy

$$\begin{aligned} \max \left\{ \left| \gamma _{1}-1\right| ,\left| \gamma _{2}-1\right| \right\} \le \delta . \end{aligned}$$
(84)

Here, \(C_{3},C_{4}>0\) are numerical constants.

Proof

See Appendix C.1. \(\square \)

Lemma 14 characterizes the restricted strong convexity and smoothness of the loss function used in blind deconvolution. To the best of our knowledge, this provides the first characterization of the geometric properties of the Hessian matrix for blind deconvolution. A few interpretations are in order.

  • Conditions (82) specify the region of incoherence and contraction (RIC). In particular, (82a) specifies a neighborhood that is close to the ground truth in \(\ell _2\) norm, and (82b) and (82c) specify the incoherence region with respect to the sensing vectors \(\{{\varvec{a}}_j\}\) and \(\{{\varvec{b}}_j\}\), respectively.

  • Similar to matrix completion, the Hessian matrix is rank-deficient even at the population level; a short calculation following these remarks makes the degeneracy explicit. Consequently, we resort to a restricted form of strong convexity by focusing on certain directions. More specifically, these directions can be viewed as the difference between two pre-aligned points that are not far from the truth, as characterized by (83).

  • Finally, the diagonal matrix \({\varvec{D}}\) accounts for scaling factors that are not too far from 1 (see (84)), which allows us to account for different step sizes employed for \({\varvec{h}}\) and \({\varvec{x}}\).
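To make this degeneracy concrete, recall that in the noiseless model every residual vanishes at the truth, so \(\nabla f\left( {\varvec{z}}^{\star }\right) ={\varvec{0}}\). Since \(\big ((1+\varepsilon ){\varvec{h}}^{\star }\big )\big ((1+\varepsilon )^{-1}{\varvec{x}}^{\star }\big )^{\textsf {H} }={\varvec{h}}^{\star }{\varvec{x}}^{\star \textsf {H} }\) for every \(\varepsilon \), the map \(\varepsilon \mapsto f\big ((1+\varepsilon ){\varvec{h}}^{\star },(1+\varepsilon )^{-1}{\varvec{x}}^{\star }\big )\) is constant; differentiating twice at \(\varepsilon =0\) and using \(\nabla f\left( {\varvec{z}}^{\star }\right) ={\varvec{0}}\) then yields (up to an immaterial positive constant)

$$\begin{aligned} {\varvec{u}}_{\star }^{\textsf {H} }\,\nabla ^{2}f\left( {\varvec{z}}^{\star }\right) {\varvec{u}}_{\star }=0\qquad \text {with}\qquad {\varvec{u}}_{\star }=\left[ \begin{array}{c} {\varvec{h}}^{\star }\\ -{\varvec{x}}^{\star }\\ \overline{{\varvec{h}}^{\star }}\\ -\overline{{\varvec{x}}^{\star }} \end{array}\right] . \end{aligned}$$

Since \({\varvec{z}}^{\star }\) minimizes the noiseless loss, the quadratic form \({\varvec{u}}^{\textsf {H} }\nabla ^{2}f({\varvec{z}}^{\star }){\varvec{u}}\) is nonnegative over all conjugate-paired directions of this form, so \({\varvec{u}}_{\star }\) is a direction of exact flatness; taking expectations shows the same for the population Hessian. This calculation is not needed elsewhere, but it explains why strong convexity can only be hoped for in a restricted sense.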

8.1.2 Error Contraction

The restricted strong convexity and smoothness allow us to establish the contraction of the error measured in terms of \(\text {dist}(\cdot , {\varvec{z}}^{\star })\) as defined in (34) as long as the iterates stay in the RIC.

Lemma 15

Suppose the number of measurements satisfies \(m\ge C \mu ^{2}K\log ^{9}m\) for some sufficiently large constant \(C>0\), and the step size \(\eta >0\) is some sufficiently small constant. There exists an event that does not depend on t and has probability \(1-O\left( m^{-10} + e^{-K} \log m \right) \), such that when it happens and

$$\begin{aligned} \mathrm {dist}\left( {\varvec{z}}^{t},{\varvec{z}}^{\star }\right)&\le \xi , \end{aligned}$$
(85a)
$$\begin{aligned} \max _{1\le j\le m}\left| {\varvec{a}}_{j}^{\textsf {H} }\left( \widetilde{{\varvec{x}}}^{t}-{\varvec{x}}^{\star }\right) \right|&\le C_{3}\frac{1}{\log ^{1.5}m}, \end{aligned}$$
(85b)
$$\begin{aligned} \max _{1\le j\le m}\left| {\varvec{b}}_{j}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\right|&\le C_{4}\frac{\mu }{\sqrt{m}}\log ^{2}m \end{aligned}$$
(85c)

hold for some constants \(C_{3},C_{4}>0\), one has

$$\begin{aligned} \mathrm {dist}\left( {\varvec{z}}^{t+1},{\varvec{z}}^{\star }\right) \le (1-\eta /16)\, \mathrm {dist}\left( {\varvec{z}}^{t},{\varvec{z}}^{\star }\right) . \end{aligned}$$

Here, \(\widetilde{{\varvec{h}}}^{t}\) and \(\widetilde{{\varvec{x}}}^{t}\) are defined in (81), and \(\xi \ll 1/\log ^2 m\).

Proof

See Appendix C.2. \(\square \)

As a result, if \({\varvec{z}}^{t}\) satisfies condition (85) for all \(0\le t\le T\), then

$$\begin{aligned} \mathrm {dist}\left( {\varvec{z}}^{t},{\varvec{z}}^{\star }\right) \le \rho \, \mathrm {dist}\left( {\varvec{z}}^{t-1},{\varvec{z}}^{\star }\right) \le \rho ^{t} \mathrm {dist}\left( {\varvec{z}}^{0},{\varvec{z}}^{\star }\right) \le \rho ^{t}c_{1},\qquad 0<t\le T, \end{aligned}$$

where \(\rho := 1- \eta /16\). Furthermore, similar to the case of phase retrieval (i.e., Lemma 3), once we demonstrate that conditions (85) hold for all \(0\le t\le m\), Theorem 3 follows. The proof of this claim is exactly the same as that of Lemma 3 and is thus omitted for conciseness. In what follows, we focus on establishing (85) for all \(0\le t\le m\).

Before concluding this subsection, we make note of another important result that concerns the alignment parameter \(\alpha ^{t}\), which will be useful in the subsequent analysis. Specifically, the alignment parameter sequence \(\{\alpha ^{t}\}\) converges linearly to a constant whose magnitude is fairly close to 1, as long as the two initial vectors \({\varvec{h}}^0\) and \({\varvec{x}}^0\) have similar \(\ell _2\) norms and are close to the truth. Given that \(\alpha ^t\) determines the global scaling of the iterates, this reveals rapid convergence of both \(\Vert {\varvec{h}}^t\Vert _2\) and \(\Vert {\varvec{x}}^t\Vert _2\), which explains why there is no need to impose extra terms to regularize the \(\ell _2\) norm as employed in [58, 76].

Lemma 16

When \(m > 1\) is sufficiently large, the following two claims hold true.

  • If \(\big ||\alpha ^{t}|-1\big |\le 1/2\) and \(\mathrm {dist}({\varvec{z}}^{t},{\varvec{z}}^{\star })\le C_{1}/\log ^{2}m\), then

    $$\begin{aligned} \left| \frac{\alpha ^{t+1}}{\alpha ^{t}}-1\right| \le c\,\mathrm {dist}({\varvec{z}}^{t},{\varvec{z}}^{\star })\le \frac{cC_{1}}{\log ^{2}m} \end{aligned}$$

    for some absolute constant \(c>0\);

  • If \(\big ||\alpha ^{0}|-1\big |\le 1/4\) and \(\mathrm {dist}({\varvec{z}}^{s},{\varvec{z}}^{\star })\le C_{1}(1-\eta /16)^{s}/\log ^{2}m\) for all \(0\le s\le t\), then one has

    $$\begin{aligned} \big ||\alpha ^{s+1}|-1\big |\le 1/2,\qquad 0\le s\le t. \end{aligned}$$

Proof

See Appendix C.2. \(\square \)

The initial condition \(\big ||\alpha ^{0}|-1\big |<1/4\) will be guaranteed to hold with high probability by Lemma 19.

8.2 Step 2: Introducing the Leave-One-Out Sequences

As demonstrated by the assumptions in Lemma 15, the key is to show that the whole trajectory lies in the region specified by (85a)–(85c). Once again, the difficulty lies in the statistical dependency between the iterates \(\left\{ {\varvec{z}}^{t}\right\} \) and the measurement vectors \(\left\{ {\varvec{a}}_{j}\right\} \). We follow the general recipe and introduce the leave-one-out sequences, denoted by \(\left\{ {\varvec{h}}^{t,\left( l\right) },{\varvec{x}}^{t,(l)}\right\} _{t\ge 0}\) for each \(1\le l\le m\). Specifically, \(\left\{ {\varvec{h}}^{t,\left( l\right) },{\varvec{x}}^{t,(l)}\right\} _{t\ge 0}\) is the gradient sequence operating on the loss function

$$\begin{aligned} f^{(l)}\left( {\varvec{h}},{\varvec{x}}\right) :=\sum _{j:j\ne l}\left| {\varvec{b}}_{j}^{\textsf {H} }\left( {\varvec{h}}{\varvec{x}}^{\textsf {H} }-{\varvec{h}}^{\star }{\varvec{x}}^{\star \textsf {H} }\right) {\varvec{a}}_{j}\right| ^{2}. \end{aligned}$$
(86)

The whole sequence is constructed by running gradient descent with spectral initialization on the leave-one-out loss (86). The precise description is supplied in Algorithm 6.

For notational simplicity, we denote \({\varvec{z}}^{t,(l)}=\left[ \begin{array}{c} {\varvec{h}}^{t,(l)}\\ {\varvec{x}}^{t,(l)} \end{array}\right] \) and use \(f({\varvec{z}}^{t,(l)})=f({\varvec{h}}^{t,(l)},{\varvec{x}}^{t,(l)})\) interchangeably. Define similarly the alignment parameters

$$\begin{aligned} \alpha ^{t,\left( l\right) }:=\arg \min _{\alpha \in \mathbb {C}}\left\| \frac{1}{\overline{\alpha }}{\varvec{h}}^{t,\left( l\right) }-{\varvec{h}}^{\star }\right\| _{2}^{2}+\big \Vert \alpha {\varvec{x}}^{t,\left( l\right) }-{\varvec{x}}^{\star } \big \Vert _{2}^{2}, \end{aligned}$$
(87)

and denote \(\widetilde{{\varvec{z}}}^{t,(l)}=\left[ \begin{array}{c} \widetilde{{\varvec{h}}}^{t,\left( l\right) }\\ \widetilde{{\varvec{x}}}^{t,\left( l\right) } \end{array}\right] \) where

$$\begin{aligned} \widetilde{{\varvec{h}}}^{t,\left( l\right) }=\frac{1}{\overline{\alpha ^{t,\left( l\right) }}} {\varvec{h}}^{t,\left( l\right) }\qquad \text {and}\qquad \widetilde{{\varvec{x}}}^{t,\left( l\right) }=\alpha ^{t,\left( l\right) }{\varvec{x}}^{t,\left( l\right) }. \end{aligned}$$
(88)
[Algorithm 6: the leave-one-out sequences, obtained by running gradient descent with spectral initialization on the leave-one-out loss (86).]
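As a concrete (if simplified) illustration of this construction, the Python/NumPy sketch below evaluates the leave-one-out loss (86) and its Wirtinger gradients by masking out the lth residual in the formulas (77)–(78). The toy data, the Gaussian \({\varvec{b}}_{j}\)'s, the helper names, and the choice of l are illustrative assumptions; the actual sequences in Algorithm 6 are in addition initialized by a spectral method applied to the leave-one-out data.

```python
import numpy as np

# Leave-one-out loss (86): identical to the full loss except that the l-th
# summand is dropped, so its Wirtinger gradients are (77)-(78) with the l-th
# residual masked out. All data below are toy/illustrative.
rng = np.random.default_rng(3)
K, m, l = 16, 320, 7
A = (rng.standard_normal((m, K)) + 1j * rng.standard_normal((m, K))) / np.sqrt(2)      # rows: a_j
B = (rng.standard_normal((m, K)) + 1j * rng.standard_normal((m, K))) / np.sqrt(2 * m)  # rows: b_j
h_star = rng.standard_normal(K) + 1j * rng.standard_normal(K)
x_star = rng.standard_normal(K) + 1j * rng.standard_normal(K)
y = (B.conj() @ h_star) * (A.conj() @ x_star).conj()

keep = np.arange(m) != l                        # exclude only the l-th measurement

def loo_loss(h, x):
    u = (B.conj() @ h) * (A.conj() @ x).conj() - y
    return np.sum(np.abs(u[keep]) ** 2)

def loo_grads(h, x):
    u = ((B.conj() @ h) * (A.conj() @ x).conj() - y) * keep   # zero out the l-th residual
    grad_h = B.T @ (u * (A.conj() @ x))
    grad_x = A.T @ (u.conj() * (B.conj() @ h))
    return grad_h, grad_x
```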

8.3 Step 3: Establishing the Incoherence Condition by Induction

As usual, we continue the proof in an inductive manner. For clarity of presentation, we list below the set of induction hypotheses underlying our analysis:

$$\begin{aligned} \text {dist}\left( {\varvec{z}}^{t},{\varvec{z}}^{\star }\right)&\le C_{1}\frac{1}{\log ^{2}m}, \end{aligned}$$
(90a)
$$\begin{aligned} \max _{1\le l\le m}\text {dist}\big ({\varvec{z}}^{t,\left( l\right) },\widetilde{{\varvec{z}}}^{t}\big )&\le C_{2}\frac{\mu }{\sqrt{m}}\sqrt{\frac{\mu ^{2}K\log ^{9}m}{m}}, \end{aligned}$$
(90b)
$$\begin{aligned} \max _{1\le l\le m}\big |{\varvec{a}}_{l}^{\textsf {H} }\big (\widetilde{{\varvec{x}}}^{t}-{\varvec{x}}^{\star }\big )\big |&\le C_{3}\frac{1}{\log ^{1.5}m}, \end{aligned}$$
(90c)
$$\begin{aligned} \max _{1\le l\le m}\big |{\varvec{b}}_{l}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\big |&\le C_{4}\frac{\mu }{\sqrt{m}}\log ^{2}m, \end{aligned}$$
(90d)

where \(\widetilde{{\varvec{h}}}^{t},\; \widetilde{{\varvec{x}}}^{t}\), and \(\widetilde{{\varvec{z}}}^{t}\) are defined in (81). Here, \(C_{1},C_{3}>0\) are some sufficiently small constants, while \(C_{2},C_{4}>0\) are some sufficiently large constants. We aim to show that if these hypotheses (90) hold up to the tth iteration, then the same would hold for the (\(t+1\))th iteration with exceedingly high probability (e.g., \(1-O(m^{-10})\)). The first hypothesis (90a) has already been established in Lemma 15, and hence, the rest of this section focuses on establishing the remaining three. To justify the incoherence hypotheses (90c) and (90d) for the (\(t+1\))th iteration, we need to leverage the nice properties of the leave-one-out sequences and establish (90b) first. In the sequel, we follow the steps suggested in the general recipe.

  • Step 3(a): proximity between the original and the leave-one-out iterates We first justify hypothesis (90b) for the (\(t+1\))th iteration via the following lemma.

Lemma 17

Suppose the sample complexity obeys \(m\ge C \mu ^{2}K\log ^{9}m\) for some sufficiently large constant \(C>0\). Let \(\mathcal {E}_t\) be the event where hypotheses (90a)–(90d) hold for the tth iteration. Then on an event \(\mathcal {E}_{t+1,1} \subseteq \mathcal {E}_t\) obeying \(\mathbb {P}(\mathcal {E}_t \cap \mathcal {E}_{t+1,1}^c ) = O(m^{-10} + me^{-cK} )\) for some constant \(c>0\), one has

$$\begin{aligned} \max _{1\le l\le m}\mathrm {dist}\big ({\varvec{z}}^{t+1,\left( l\right) },\widetilde{{\varvec{z}}}^{t+1}\big )&\le C_{2}\frac{\mu }{\sqrt{m}}\sqrt{\frac{\mu ^{2}K\log ^{9}m}{m}}\\ \text {and}\qquad \max _{1\le l\le m}\big \Vert \widetilde{{\varvec{z}}}^{t+1,\left( l\right) }-\widetilde{{\varvec{z}}}^{t+1}\big \Vert _{2}&\lesssim C_{2}\frac{\mu }{\sqrt{m}}\sqrt{\frac{\mu ^{2}K\log ^{9}m}{m}}, \end{aligned}$$

provided that the step size \(\eta >0\) is some sufficiently small constant.

Proof

As usual, this result follows from the restricted strong convexity, which forces the distance between the two sequences of interest to be contractive. See Appendix C.3.\(\square \)

  • Step 3(b): incoherence of the leave-one-out iterate \({\varvec{x}}^{t+1,(l)}\) w.r.t. \({\varvec{a}}_{l}\) Next, we show that the leave-one-out iterate \(\widetilde{{\varvec{x}}}^{t+1,\left( l\right) }\)—which is independent of \({\varvec{a}}_{l}\)—is incoherent w.r.t. \({\varvec{a}}_{l}\) in the sense that

    $$\begin{aligned} \left| {\varvec{a}}_{l}^{\textsf {H} }\big (\widetilde{{\varvec{x}}}^{t+1,\left( l\right) } -{\varvec{x}}^{\star }\big )\right| \le 10C_{1}\frac{1}{\log ^{3/2}m} \end{aligned}$$
    (91)

    with probability exceeding \(1-O\left( m^{-10} + e^{-K} \log m \right) \). To see why, use the statistical independence and the standard Gaussian concentration inequality to show that

    $$\begin{aligned} \max _{1\le l\le m}\left| {\varvec{a}}_{l}^{\textsf {H} }\big (\widetilde{{\varvec{x}}}^{t+1,(l)}-{\varvec{x}}^{\star }\big )\right| \le 5\sqrt{\log m}\max _{1\le l\le m}\big \Vert \widetilde{{\varvec{x}}}^{t+1,(l)}-{\varvec{x}}^{\star }\big \Vert _{2} \end{aligned}$$

    with probability exceeding \(1-O(m^{-10})\). It then follows from the triangle inequality that

    $$\begin{aligned} \big \Vert \widetilde{{\varvec{x}}}^{t+1,(l)}-{\varvec{x}}^{\star }\big \Vert _{2}&\le \big \Vert \widetilde{{\varvec{x}}}^{t+1,(l)}-\widetilde{{\varvec{x}}}^{t+1}\big \Vert _{2}+\left\| \widetilde{{\varvec{x}}}^{t+1}-{\varvec{x}}^{\star }\right\| _{2}\\&\overset{\left( \text {i}\right) }{\le }CC_{2}\frac{\mu }{\sqrt{m}}\sqrt{\frac{\mu ^{2}K\log ^{9}m}{m}}+C_{1}\frac{1}{\log ^{2}m}\\&\overset{\left( \text {ii}\right) }{\le }2C_{1}\frac{1}{\log ^{2}m}, \end{aligned}$$

    where (i) follows from Lemmas 15 and 17 and (ii) holds as soon as \(m/( \mu ^{2}\sqrt{K}\log ^{13/2}m )\) is sufficiently large. Combining the preceding two bounds establishes (91).
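The \(\sqrt{\log m}\) factor in the concentration step above is easy to check numerically: for \({\varvec{a}}_{l}\) i.i.d. standard complex Gaussian and independent of a fixed unit vector \({\varvec{v}}\) (standing in for the leave-one-out differences), \(\max _{1\le l\le m}|{\varvec{a}}_{l}^{\textsf {H} }{\varvec{v}}|\) is on the order of \(\sqrt{\log m}\). The parameters in the short Python/NumPy check below are illustrative.

```python
import numpy as np

# Empirical check of the sqrt(log m) scaling used in the concentration step:
# a_l ~ CN(0, I_K) independent of a fixed unit vector v. Parameters are toy choices.
rng = np.random.default_rng(2)
K, m = 64, 20000
A = (rng.standard_normal((m, K)) + 1j * rng.standard_normal((m, K))) / np.sqrt(2)  # rows: a_l
v = rng.standard_normal(K) + 1j * rng.standard_normal(K)
v /= np.linalg.norm(v)
print(np.max(np.abs(A.conj() @ v)) / np.sqrt(np.log(m)))  # roughly 1, well below the factor 5
```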

  • Step 3(c): combining the bounds to show incoherence of \({\varvec{x}}^{t+1}\) w.r.t. \(\left\{ {\varvec{a}}_{l}\right\} \) The above bounds immediately allow us to conclude that

    $$\begin{aligned} \max _{1\le l\le m}\left| {\varvec{a}}_{l}^{\textsf {H} }\big (\widetilde{{\varvec{x}}}^{t+1}-{\varvec{x}}^{\star }\big )\right| \le C_{3}\frac{1}{\log ^{3/2}m} \end{aligned}$$

    with probability at least \(1-O\left( m^{-10} + e^{-K}\log m \right) \), which is exactly hypothesis (90c) for the (\(t+1\))th iteration. Specifically, for each \(1\le l\le m\), the triangle inequality yields

    $$\begin{aligned} \left| {\varvec{a}}_{l}^{\textsf {H} }\big (\widetilde{{\varvec{x}}}^{t+1}-{\varvec{x}}^{\star }\big )\right|&\le \left| {\varvec{a}}_{l}^{\textsf {H} }\big (\widetilde{{\varvec{x}}}^{t+1} -\widetilde{{\varvec{x}}}^{t+1,(l)}\big )\right| +\left| {\varvec{a}}_{l}^{\textsf {H} } \big (\widetilde{{\varvec{x}}}^{t+1,(l)}-{\varvec{x}}^{\star }\big )\right| \\&\overset{\left( \text {i}\right) }{\le }\left\| {\varvec{a}}_{l}\right\| _{2}\left\| \widetilde{{\varvec{x}}}^{t+1}-\widetilde{{\varvec{x}}}^{t+1,(l)} \right\| _{2}+\left| {\varvec{a}}_{l}^{\textsf {H} }\big (\widetilde{{\varvec{x}}}^{t+1,(l)}-{\varvec{x}}^{\star } \big )\right| \\&\overset{\left( \text {ii}\right) }{\le }3\sqrt{K}\cdot CC_{2}\frac{\mu }{\sqrt{m}}\sqrt{\frac{\mu ^{2}K\log ^{9}m}{m}}+10C_{1}\frac{1}{\log ^{3/2}m}\\&\overset{\left( \text {iii}\right) }{\le }C_{3}\frac{1}{\log ^{3/2}m}. \end{aligned}$$

    Here (i) follows from Cauchy–Schwarz; (ii) is a consequence of (190), Lemma 17, and bound (91); and the last inequality holds as long as \(m / ( \mu ^{2}K\log ^{6}m )\) is sufficiently large and \(C_{3}\ge 11C_{1}\).

  • Step 3(d): incoherence of \({\varvec{h}}^{t+1}\) w.r.t. \(\{{\varvec{b}}_{l}\}\) It remains to justify that \({\varvec{h}}^{t+1}\) is also incoherent w.r.t. its associated design vectors \(\{{\varvec{b}}_{l}\}\). The proof of this step, however, is much more involved and challenging, due to the deterministic nature of the \({\varvec{b}}_{l}\)'s. As a result, we need to “propagate” the randomness brought about by \(\{{\varvec{a}}_{l}\}\) to \({\varvec{h}}^{t+1}\) in order to facilitate the analysis. The result is summarized as follows.

Lemma 18

Suppose that the sample complexity obeys \(m\ge C \mu ^{2}K\log ^{9}m\) for some sufficiently large constant \(C>0\). Let \(\mathcal {E}_t\) be the event where hypotheses (90a)–(90d) hold for the tth iteration. Then on an event \(\mathcal {E}_{t+1,2} \subseteq \mathcal {E}_t\) obeying \(\mathbb {P}(\mathcal {E}_t \cap \mathcal {E}_{t+1,2}^c ) = O(m^{-10})\), one has

$$\begin{aligned} \max _{1\le l\le m}\left| {\varvec{b}}_{l}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t+1}\right| \le C_{4}\frac{\mu }{\sqrt{m}}\log ^{2}m \end{aligned}$$

as long as \(C_{4}\) is sufficiently large and \(\eta >0\) is taken to be some sufficiently small constant.

Proof

The key idea is to divide \(\{1,\cdots ,m\}\) into consecutive bins each of size \(\mathrm {poly}\log (m)\), and to exploit the randomness (namely, the randomness from \({\varvec{a}}_{l}\)) within each bin. This binning idea is crucial in ensuring that the incoherence measure of interest does not blow up as t increases. See Appendix C.4. \(\square \)

With these steps in place, we conclude the proof of Theorem 3 via induction and the union bound.

8.4 The Base Case: Spectral Initialization

In order to finish the induction steps, we still need to justify the induction hypotheses for the base cases; namely, we need to show that the spectral initializations \({\varvec{z}}^{0}\) and \(\left\{ {\varvec{z}}^{0,\left( l\right) }\right\} _{1\le l \le m} \) satisfy the induction hypotheses (90) at \(t=0\).

To start with, the initializations are sufficiently close to the truth when measured by the \(\ell _{2}\) norm, as summarized by the following lemma.

Lemma 19

Fix any small constant \(\xi >0\). Suppose the sample size obeys \(m\ge {C\mu ^{2}K\log ^{2}m}/{\xi ^{2}}\) for some sufficiently large constant \(C>0\). Then with probability at least \(1-O(m^{-10})\), we have

$$\begin{aligned} \min _{\alpha \in \mathbb {C},|\alpha |=1}\left\{ \big \Vert \alpha {\varvec{h}}^{0}-{\varvec{h}}^{\star }\big \Vert _{2}+\big \Vert \alpha {\varvec{x}}^{0}-{\varvec{x}}^{\star }\big \Vert _{2}\right\}&\le \xi \qquad \text {and} \end{aligned}$$
(92)
$$\begin{aligned} \min _{\alpha \in \mathbb {C},|\alpha |=1}\left\{ \big \Vert \alpha {\varvec{h}}^{0,(l)}-{\varvec{h}}^{\star }\big \Vert _{2}+\big \Vert \alpha {\varvec{x}}^{0,(l)} -{\varvec{x}}^{\star }\big \Vert _{2}\right\}&\le \xi ,\qquad 1\le l\le m, \end{aligned}$$
(93)

and \(\big ||\alpha ^{0}|-1\big |\le 1/4\).

Proof

This follows from Wedin’s sin\(\Theta \) theorem [121] and [76, Lemma 5.20]. See Appendix C.5. \(\square \)
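For intuition, here is a small Python/NumPy sketch of a spectral initialization of the kind analyzed in Lemma 19: form \(\sum _{j}y_{j}{\varvec{b}}_{j}{\varvec{a}}_{j}^{\textsf {H} }\), whose expectation is \({\varvec{h}}^{\star }{\varvec{x}}^{\star \textsf {H} }\), take its leading singular triple, and scale the singular vectors by the square root of the leading singular value. The i.i.d. Gaussian \({\varvec{b}}_{j}\)'s, the dimensions, and the use of the squared criterion in the final phase alignment are illustrative simplifications.

```python
import numpy as np

# Toy spectral initialization for blind deconvolution: leading singular triple
# of M = sum_j y_j b_j a_j^H, which concentrates around h* x*^H.
rng = np.random.default_rng(5)
K, m = 32, 5000
A = (rng.standard_normal((m, K)) + 1j * rng.standard_normal((m, K))) / np.sqrt(2)      # rows: a_j
B = (rng.standard_normal((m, K)) + 1j * rng.standard_normal((m, K))) / np.sqrt(2 * m)  # rows: b_j
h_star = rng.standard_normal(K) + 1j * rng.standard_normal(K)
x_star = rng.standard_normal(K) + 1j * rng.standard_normal(K)
h_star, x_star = h_star / np.linalg.norm(h_star), x_star / np.linalg.norm(x_star)
y = (B.conj() @ h_star) * (A.conj() @ x_star).conj()

M = B.T @ (y[:, None] * A.conj())               # sum_j y_j b_j a_j^H
U, s, Vh = np.linalg.svd(M)
h0 = np.sqrt(s[0]) * U[:, 0]
x0 = np.sqrt(s[0]) * Vh[0].conj()

# best unit-modulus phase (closed form for the squared analogue of (92))
c = np.vdot(h_star, h0) + np.vdot(x_star, x0)
alpha = np.conj(c) / np.abs(c)
print(np.linalg.norm(alpha * h0 - h_star) + np.linalg.norm(alpha * x0 - x_star))  # shrinks as m/K grows
```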

From the definition of \(\mathrm {dist}(\cdot ,\cdot )\) (cf. (34)), we immediately have

$$\begin{aligned} \mathrm {dist}\big ({\varvec{z}}^{0},{\varvec{z}}^{\star }\big )&=\min _{\alpha \in \mathbb {C}}\sqrt{\left\| \frac{1}{\overline{\alpha }}{\varvec{h}}^{0}-{\varvec{h}}^{\star }\right\| _{2}^{2}+\left\| \alpha {\varvec{x}}^{0}-{\varvec{x}}^{\star }\right\| _{2}^{2}}\nonumber \\&\overset{\left( \text {i}\right) }{\le }\min _{\alpha \in \mathbb {C}}\left\{ \left\| \frac{1}{\overline{\alpha }}{\varvec{h}}^{0}-{\varvec{h}}^{\star }\right\| _{2}+\left\| \alpha {\varvec{x}}^{0}-{\varvec{x}}^{\star }\right\| _{2}\right\} \nonumber \\&\overset{\left( \text {ii}\right) }{\le }\min _{\alpha \in \mathbb {C},|\alpha |=1}\left\{ \big \Vert \alpha {\varvec{h}}^{0}-{\varvec{h}}^{\star }\big \Vert _{2}+\big \Vert \alpha {\varvec{x}}^{0}-{\varvec{x}}^{\star }\big \Vert _{2}\right\} \overset{\left( \text {iii}\right) }{\le }C_{1}\frac{1}{\log ^{2}m}, \end{aligned}$$
(94)

as long as \( m\ge C\mu ^{2}K\log ^{6}m\) for some sufficiently large constant \(C>0\). Here (i) follows from the elementary inequality that \(a^{2}+b^{2}\le \left( a+b\right) ^{2}\) for positive a and b, (ii) holds since the feasible set of the latter one is strictly smaller, and (iii) follows directly from Lemma 19. This finishes the proof of (90a) for \(t=0\). Similarly, with high probability we have

$$\begin{aligned} \mathrm {dist}\big ({\varvec{z}}^{0,\left( l\right) },{\varvec{z}}^{\star }\big ) \le \min _{\alpha \in \mathbb {C},|\alpha |=1}\left\{ \big \Vert \alpha {\varvec{h}}^{0,(l)}-{\varvec{h}}^{\star }\big \Vert _{2}+\big \Vert \alpha {\varvec{x}}^{0,(l)}-{\varvec{x}}^{\star }\big \Vert _{2}\right\} \lesssim \frac{1}{\log ^{2}m},\qquad 1\le l\le m. \end{aligned}$$
(95)

Next, when properly aligned, the true initial estimate \({\varvec{z}}^0\) and the leave-one-out estimate \({\varvec{z}}^{0,(l)}\) are expected to be sufficiently close, as claimed by the following lemma. Along the way, we show that \({\varvec{h}}^{0}\) is incoherent w.r.t. the sampling vectors \(\left\{ {\varvec{b}}_{l}\right\} \). This establishes (90b) and (90d) for \(t=0\).

Lemma 20

Suppose that \(m\ge C \mu ^{2}K\log ^{3}m\) for some sufficiently large constant \(C>0\). Then with probability at least \(1-O(m^{-10})\), one has

$$\begin{aligned} \max _{1\le l \le m}\mathrm {dist}\big ({\varvec{z}}^{0,\left( l\right) },\widetilde{{\varvec{z}}}^{0}\big )\le C_{2}\frac{\mu }{\sqrt{m}}\sqrt{\frac{\mu ^{2}K\log ^{5}m}{m}} \end{aligned}$$
(96)

and

$$\begin{aligned} \max _{1\le l \le m}\big |{\varvec{b}}_{l}^{\textsf {H} }\widetilde{{\varvec{h}}}^{0}\big |&\le C_{4}\frac{\mu \log ^{2}m}{\sqrt{m}}. \end{aligned}$$
(97)

Proof

The key is to establish that \(\mathrm {dist}\big ({\varvec{z}}^{0,\left( l\right) },\widetilde{{\varvec{z}}}^{0}\big )\) can be upper bounded by some linear scaling of \(\big |{\varvec{b}}_{l}^{\textsf {H} }\widetilde{{\varvec{h}}}^{0}\big |\), and vice versa. This allows us to derive bounds simultaneously for both quantities. See Appendix C.6.\(\square \)

Finally, we establish (90c) regarding the incoherence of \({\varvec{x}}^{0}\) with respect to the design vectors \(\left\{ {\varvec{a}}_{l}\right\} \).

Lemma 21

Suppose that \(m\ge C \mu ^{2}K\log ^{6}m\) for some sufficiently large constant \(C>0\). Then with probability exceeding \(1-O(m^{-10})\), we have

$$\begin{aligned} \max _{1\le l\le m}\left| {\varvec{a}}_{l}^{\textsf {H} }\big (\widetilde{{\varvec{x}}}^{0}-{\varvec{x}}^{\star }\big )\right|&\le C_{3}\frac{1}{\log ^{1.5}m}. \end{aligned}$$

Proof

See Appendix C.7. \(\square \)

9 Discussion

This paper showcases an important phenomenon in nonconvex optimization: even without explicit enforcement of regularization, the vanilla form of gradient descent effectively achieves implicit regularization for a large family of statistical estimation problems. We believe this phenomenon arises in problems far beyond the three cases studied herein, and our results are initial steps toward understanding this fundamental phenomenon. There are numerous avenues open for future investigation, and we point out a few of them.

  • Improving sample complexity In the current paper, the required sample complexity \(O\left( \mu ^{3}r^{3}n\log ^{3}n\right) \) for matrix completion is suboptimal when the rank r of the underlying matrix is large. While this allows us to achieve a dimension-free iteration complexity, it is slightly higher than the sample complexity derived for regularized gradient descent in [32]. We expect our results to continue to hold under the lower sample complexity \(O\left( \mu ^{2}r^{2}n\log n\right) \), but this calls for a more refined analysis (e.g., a generic chaining argument).

  • Leave-one-out tricks for more general designs So far, our focus has been on independent designs, including the i.i.d. Gaussian design adopted in phase retrieval and partially in blind deconvolution, as well as the independent sampling mechanism in matrix completion. Such an independence property creates a form of “statistical homogeneity,” for which the leave-one-out argument works beautifully. It remains unclear how to generalize such leave-one-out tricks to more general designs (e.g., more general sampling patterns in matrix completion and more structured Fourier designs in phase retrieval and blind deconvolution). In fact, the reader can already get a flavor of this issue in the analysis of blind deconvolution, where the Fourier design vectors require a much more delicate treatment than purely Gaussian designs.

  • Uniform stability The leave-one-out perturbation argument is established upon a basic fact: when we exclude one sample from consideration, the resulting estimates/predictions do not deviate much from the original ones. This leave-one-out stability bears similarity to the notion of uniform stability studied in statistical learning theory [8]. We expect our analysis framework to be helpful for analyzing other learning algorithms that are uniformly stable.

  • Other iterative methods and other loss functions The focus of the current paper has been the analysis of vanilla GD tailored to the natural squared loss. This is by no means to advocate GD as the top-performing algorithm in practice; rather, we are using this simple algorithm to isolate some seemingly pervasive phenomena (i.e., implicit regularization) that generic optimization theory fails to account for. The simplicity of vanilla GD makes it an ideal object to initiate such discussions. That being said, practitioners should definitely explore as many algorithmic alternatives as possible before settling on a particular algorithm. Take phase retrieval for example: iterative methods other than GD and/or algorithms tailored to other loss functions have been proposed in the nonconvex optimization literature, including but not limited to alternating minimization, block coordinate descent, and sub-gradient and prox-linear methods tailored to nonsmooth losses. It would be interesting to develop a full theoretical understanding of a broader class of iterative algorithms, and to conduct a careful comparison regarding which loss functions lead to the most desirable practical performance.

  • Connections to deep learning? We have focused on nonlinear systems that are bilinear or quadratic in this paper. Deep learning formulations/architectures, which are highly nonlinear, are notorious for their daunting nonconvex geometry. However, iterative methods including stochastic gradient descent have enjoyed enormous practical success in learning neural networks (e.g., [46, 103, 132]), even when the architecture is significantly over-parameterized and no explicit regularization is imposed. We hope the message conveyed in this paper for several simple statistical models sheds light on why simple forms of gradient descent and its variants work so well in learning complicated neural networks.

Finally, while the present paper provides a general recipe for problem-specific analyses of nonconvex algorithms, we acknowledge that a unified theory of this kind has yet to be developed. As a consequence, each problem requires delicate and somewhat lengthy analyses of its own. It would certainly be helpful if one could single out a few stylized structural properties/elements (like sparsity and incoherence in compressed sensing [13]) that enable near-optimal performance guarantees through an overarching method of analysis; with this in place, one would not need to start each problem from scratch. Having said that, we believe that our current theory elucidates a few ingredients (e.g., the region of incoherence and leave-one-out stability) that might serve as crucial building blocks for such a general theory. We invite the interested readers to contribute toward this path forward.