Article

No Statistical-Computational Gap in Spiked Matrix Models with Generative Network Priors †

1 Department of Mathematics, Northeastern University, Boston, MA 02115, USA
2 Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
3 Helm.ai, Menlo Park, CA 94025, USA
* Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in NeurIPS 2020.
Entropy 2021, 23(1), 115; https://doi.org/10.3390/e23010115
Submission received: 1 December 2020 / Revised: 30 December 2020 / Accepted: 8 January 2021 / Published: 16 January 2021

Abstract

We provide a non-asymptotic analysis of the spiked Wishart and Wigner matrix models with a generative neural network prior. Spiked random matrices have the form of a rank-one signal plus noise and have been used as models for high dimensional Principal Component Analysis (PCA), community detection and synchronization over groups. Depending on the prior imposed on the spike, these models can display a statistical-computational gap between the information theoretically optimal reconstruction error that can be achieved with unbounded computational resources and the sub-optimal performances of currently known polynomial time algorithms. These gaps are believed to be fundamental, as in the emblematic case of Sparse PCA. In stark contrast to such cases, we show that there is no statistical-computational gap under a generative network prior, in which the spike lies on the range of a generative neural network. Specifically, we analyze a gradient descent method for minimizing a nonlinear least squares objective over the range of an expansive-Gaussian neural network and show that it can recover in polynomial time an estimate of the underlying spike with a rate-optimal sample complexity and dependence on the noise level.

1. Introduction

One of the fundamental problems in statistical inference and signal processing is the estimation of a signal given noisy high dimensional data. A prototypical example is provided by spiked matrix models, where a signal $y^\star \in \mathbb{R}^n$ is to be estimated from a matrix $Y$ taking one of the following forms (a minimal simulation of both models is sketched after this list):
  • Spiked Wishart Model, in which $Y \in \mathbb{R}^{N \times n}$ is given by
    $$Y = u \, y^{\star T} + \sigma Z, \qquad (1)$$
    where $\sigma > 0$, $u \sim \mathcal{N}(0, I_N)$, the $Z_{ij}$ are i.i.d. from $\mathcal{N}(0, 1)$, and $u$ and $Z$ are independent;
  • Spiked Wigner Model, in which $Y \in \mathbb{R}^{n \times n}$ is given by
    $$Y = y^\star y^{\star T} + \nu H, \qquad (2)$$
    where $\nu > 0$ and $H \in \mathbb{R}^{n \times n}$ is drawn from a Gaussian Orthogonal Ensemble $\mathrm{GOE}(n)$, that is, $H_{ii} \sim \mathcal{N}(0, 2/n)$ for all $1 \le i \le n$ and $H_{ij} = H_{ji} \sim \mathcal{N}(0, 1/n)$ for $1 \le j < i \le n$.
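As a concrete illustration (not part of the original paper), the following minimal NumPy sketch shows how data from the two models above can be simulated; all function and variable names here are ours and only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def spiked_wishart(y_star, N, sigma):
    """Spiked Wishart data (1): Y = u y*^T + sigma Z, u ~ N(0, I_N), Z_ij ~ N(0, 1)."""
    n = y_star.shape[0]
    u = rng.standard_normal(N)
    Z = rng.standard_normal((N, n))
    return np.outer(u, y_star) + sigma * Z            # shape (N, n)

def spiked_wigner(y_star, nu):
    """Spiked Wigner data (2): Y = y* y*^T + nu H, with H drawn from GOE(n)."""
    n = y_star.shape[0]
    A = rng.standard_normal((n, n)) / np.sqrt(n)
    H = (A + A.T) / np.sqrt(2)                        # H_ij ~ N(0, 1/n), H_ii ~ N(0, 2/n)
    return np.outer(y_star, y_star) + nu * H          # shape (n, n)

y_star = rng.standard_normal(20)
Y_wishart = spiked_wishart(y_star, N=100, sigma=0.5)
Y_wigner = spiked_wigner(y_star, nu=0.5)
```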
In the last 20 years, spiked random matrices have been extensively studied, as they serve as a mathematical model for many signal recovery problems such as PCA [1,2,3,4], synchronization over graphs [5,6,7] and community detection [8,9,10]. Furthermore, these models are archetypal examples of the trade-off between statistical accuracy and computational efficiency. From a statistical perspective, the objective is to understand how the choice of the prior on $y^\star$ determines the critical signal-to-noise ratio (SNR) and number of measurements above which it becomes information-theoretically possible to estimate the signal. From a computational perspective, the objective is to design efficient algorithms that leverage such prior information. A recent and vast body of literature has shown that, depending on the chosen prior, gaps can arise between the minimum SNR required to solve the problem and the one above which known polynomial-time algorithms succeed. An emblematic example is provided by the Sparse PCA problem, where the signal $y^\star$ in (1) is taken to be $s$-sparse. In this case $N = O(s \log n)$ samples are sufficient for estimating $y^\star$ [2,4], while the best known efficient algorithms require $N = O(s^2)$ [3,11,12]. This gap is believed to be fundamental. This "statistical-computational gap" has also been observed for Spiked Wigner models (2) and, in general, for other structured signal recovery problems where the imposed prior has a combinatorial flavor (see the next section and [13,14] for surveys).
Motivated by the recent advances of deep generative networks in learning complex data structures, in this paper we study the spiked random matrix models (1) and (2) where the planted signal $y^\star$ has a generative network prior. We assume that a generative neural network $G : \mathbb{R}^k \to \mathbb{R}^n$ with $k < n$ has been trained on a data set of spikes, and that the unknown spike $y^\star \in \mathbb{R}^n$ lies on the range of $G$, that is, we can write $y^\star = G(x^\star)$ for some $x^\star \in \mathbb{R}^k$. As a mathematical model for the trained $G$, we consider a network of the form
$$G(x) = \mathrm{relu}(W_d \cdots \mathrm{relu}(W_2\, \mathrm{relu}(W_1 x)) \cdots), \qquad (3)$$
with weight matrices $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$, where $\mathrm{relu}(x) = \max(x, 0)$ is applied entrywise. We furthermore assume that the network is expansive, that is, $n = n_d > n_{d-1} > \cdots > n_0 = k$, and that the weights have Gaussian entries. These modeling assumptions and their variants were used in [15,16,17,18,19,20]. A short forward-pass sketch of this model is given below.
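The following is a minimal sketch of the forward pass of such an expansive ReLU network, using the Gaussian weight scaling of Assumption 1 below ($W_i$ with i.i.d. $\mathcal{N}(0, 1/n_i)$ entries); the layer dimensions chosen here are illustrative only.

```python
import numpy as np

def make_network(layer_dims, rng):
    """Random expansive ReLU network: W_i in R^{n_i x n_{i-1}} with i.i.d. N(0, 1/n_i) entries."""
    return [rng.standard_normal((n_out, n_in)) / np.sqrt(n_out)
            for n_in, n_out in zip(layer_dims[:-1], layer_dims[1:])]

def G(weights, x):
    """G(x) = relu(W_d ... relu(W_2 relu(W_1 x)) ...), with relu applied entrywise."""
    for W in weights:
        x = np.maximum(W @ x, 0.0)
    return x

rng = np.random.default_rng(1)
dims = [10, 250, 1500]                 # expansive: k = n_0 < n_1 < n_2 = n
weights = make_network(dims, rng)
x_star = rng.standard_normal(dims[0])
y_star = G(weights, x_star)            # a planted spike lying on the range of G
```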
Enforcing generative network priors has led to substantially fewer measurements needed for signal recovery than with traditional sparsity priors for a variety of signal recovery problems [17,21,22]. In the case of phase retrieval, [17,23] have shown that under the generative prior (3), efficient compressive phase retrieval is possible with sample complexity proportional (up to log factors) to the underlying signal dimensionality $k$. In contrast, for a sparsity-based prior, the best known polynomial time algorithms (convex methods [24,25,26], iterative thresholding [27,28,29], etc.) require a sample complexity proportional to the square of the sparsity level for stable recovery. Given that generative priors lead to no computational-statistical gap for compressive phase retrieval, one might anticipate that they will close other computational-statistical gaps as well. Indeed, [30] analyzed the spiked models (1) and (2) under a generative network prior similar to (3) and observed no computational-statistical gap in the asymptotic limit $k, n, N \to \infty$ with $n/k = O(1)$ and $N/n = O(1)$. For more details on this work and on the comparison of sparsity and generative priors, see Section 2.2.

Our Contribution

In this paper we analyze the spiked matrix models (1) and (2) under a generative network prior in the nonasymptotic, finite data regime. We consider a $d$-layer feedforward generative network $G : \mathbb{R}^k \to \mathbb{R}^n$ with architecture (3). We furthermore assume that the planted spike $y^\star \in \mathbb{R}^n$ lies on the range of $G$, that is, there exists a latent vector $x^\star \in \mathbb{R}^k$ such that $y^\star = G(x^\star)$.
To estimate $y^\star$, we first find an estimate $\hat{x}$ of the latent vector $x^\star$ and then use $G(\hat{x})$ to estimate $y^\star$. We thus consider the following minimization problem (under the conditions on the generative network specified below, it was shown in [15] that $G$ is invertible and there exists a unique $x^\star$ that satisfies $y^\star = G(x^\star)$):
$$\min_{x \in \mathbb{R}^k} f(x) := \frac{1}{4} \bigl\| G(x) G(x)^T - M \bigr\|_F^2, \qquad (4)$$
where:
  • for the Wishart model (1) we take $M = \Sigma_N - \sigma^2 I_n$ with $\Sigma_N = Y^T Y / N$;
  • for the Wigner model (2) we take $M = Y$ (a short sketch of this objective for both choices of $M$ is given after this list).
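The objective (4) and the two choices of $M$ translate directly into code; the following is a minimal, self-contained sketch with our own (illustrative) naming.

```python
import numpy as np

def f_loss(M, weights, x):
    """Objective (4): f(x) = (1/4) || G(x) G(x)^T - M ||_F^2."""
    g = x
    for W in weights:                          # forward pass of G
        g = np.maximum(W @ g, 0.0)
    return 0.25 * np.linalg.norm(np.outer(g, g) - M, ord="fro") ** 2

def M_wishart(Y, sigma):
    """Wishart case: M = Sigma_N - sigma^2 I_n with Sigma_N = Y^T Y / N."""
    N, n = Y.shape
    return Y.T @ Y / N - sigma ** 2 * np.eye(n)

def M_wigner(Y):
    """Wigner case: M = Y."""
    return Y
```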
Despite the non-convexity and non-smoothness of the problem, our preliminary work in [31] shows that when the generative network $G$ is expansive and has Gaussian weights, (4) enjoys a favorable optimization geometry. Specifically, every nonzero point outside two small neighborhoods, around $x^\star$ and around a negative multiple of it, has a descent direction which is given a.e. by the gradient of $f$. Furthermore, in [31] it is shown that the global minimum of $f$ lies in the neighborhood around $x^\star$ and has optimal reconstruction error. This result suggests that a first order optimization algorithm can succeed in efficiently solving (4), and that no statistical-computational gap is present for the spiked matrix models with a (random) generative network prior in the finite data regime. In the current paper, we prove this conjecture by providing a polynomial-time subgradient method that minimizes the non-convex problem (4) and obtains information-theoretically optimal error rates.
Our main contribution can be summarized as follows. We analyze a subgradient method (Algorithm 1) for the minimization of (4) and show that after a polynomial number of steps $\tilde{T}$, and up to polynomial factors in the depth $d$ of the network, the iterate $x_{\tilde{T}}$ satisfies the following reconstruction errors:
  • in the Spiked Wishart Model:
    $$\| G(x_{\tilde{T}}) - y^\star \|_2 \lesssim \Bigl( 1 + \frac{\sigma^2}{\|y^\star\|_2^2} \Bigr) \sqrt{\frac{k \log n}{N}} \, \|y^\star\|_2 \qquad (5)$$
    in the regime $N \gtrsim k \log n$;
  • in the Spiked Wigner Model:
    $$\| G(x_{\tilde{T}}) - y^\star \|_2 \lesssim \frac{\nu}{\|y^\star\|_2^2} \sqrt{\frac{k \log n}{n}} \, \|y^\star\|_2. \qquad (6)$$
We notice that these bounds are information-theoretically optimal up to the log factors in n, and correspond to the best achievable in the case of a k-dimensional subspace prior. In particular, they imply that efficient recovery in the Wishart model is possible with a number of samples N proportional to the intrinsic dimension of the signal y 🟉 . Similarly, the bound in the Spiked Wigner Model implies that imposing a generative network prior leads to a reduction of the noise by a factor of k / n .
Algorithm 1: Subgradient method for the minimization problem (4)
[Algorithm 1 pseudocode (image in the original): at each iteration, negate the current iterate if the negated point has smaller loss, then take a subgradient step on $f$.]
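Since the pseudocode image is not reproduced here, the following Python sketch implements the procedure as described in the text (negation check followed by a subgradient step). The step size and iteration count are illustrative placeholders and do not reflect the constants of Theorem 1.

```python
import numpy as np

def forward_with_jacobian(weights, x):
    """Return G(x) and Lambda_x, the product of the 'active' layer matrices
    diag(pre-activation > 0) W_i, so that G(x) = Lambda_x @ x almost everywhere."""
    Lam = np.eye(x.shape[0])
    h = x
    for W in weights:
        pre = W @ h
        Lam = ((pre > 0).astype(float)[:, None] * W) @ Lam
        h = np.maximum(pre, 0.0)
    return h, Lam

def loss(M, weights, x):
    g, _ = forward_with_jacobian(weights, x)
    return 0.25 * np.linalg.norm(np.outer(g, g) - M, "fro") ** 2

def subgradient_method(M, weights, x0, mu=1e-2, n_steps=500):
    """Sketch of Algorithm 1: negate the iterate if -x has smaller loss, then step along -v_x."""
    x = x0.copy()
    for _ in range(n_steps):
        if loss(M, weights, x) > loss(M, weights, -x):   # negation step
            x = -x
        g, Lam = forward_with_jacobian(weights, x)
        v = Lam.T @ ((np.outer(g, g) - M) @ g)           # a.e. gradient of f at x
        x = x - mu * v
    return x
```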

2. Related Work

2.1. Sparse PCA and Other Computational-Statistical Gaps

A canonical problem in Statistics is finding the directions that explain most of the variance in a given cloud of data, and it is classically solved by Principal Component Analysis. Spiked covariance models were introduced in [1] to study the statistical performance of this algorithm in the high dimensional regime. Under a spiked covariance model it is assumed that the data are of the form:
$$y_i = u_i\, y^\star + \sigma z_i, \qquad (7)$$
where $\sigma > 0$, $u_i \sim \mathcal{N}(0, 1)$ and $z_i \sim \mathcal{N}(0, I_n)$ are independent and identically distributed, and $y^\star$ is the unit-norm planted spike. Each $y_i$ is an i.i.d. sample from a centered Gaussian $\mathcal{N}(0, \Sigma)$ with spiked covariance matrix $\Sigma = y^\star y^{\star T} + \sigma^2 I_n$, with $y^\star$ being the direction that explains most of the variance. The estimate of $y^\star$ provided by PCA is then given by the leading eigenvector $\hat{y}$ of the empirical covariance matrix $\Sigma_N = \frac{1}{N} \sum_{i=1}^N y_i y_i^T$, and standard techniques from high dimensional probability can be used to show that, as long as $N \gtrsim n$,
$$\min_{\epsilon = \pm 1} \| \epsilon \hat{y} - y^\star \|_2 \lesssim \sqrt{\frac{n}{N}}, \qquad (8)$$
with overwhelming probability (we write $f(n) \lesssim g(n)$ if $f(n) \le C\, g(n)$ for some constant $C > 0$ that might depend on $\sigma$ and $\|y^\star\|_2$, and similarly for $f(n) \gtrsim g(n)$). Note incidentally that the data matrix $Y \in \mathbb{R}^{N \times n}$ with rows $\{y_i^T\}_i$ can be written as (1).
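For reference, a short NumPy sketch of this classical PCA estimate (our own illustration, with arbitrary problem sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, sigma = 200, 1000, 1.0

y_star = rng.standard_normal(n)
y_star /= np.linalg.norm(y_star)                 # unit-norm planted spike

u = rng.standard_normal(N)                       # samples y_i = u_i y* + sigma z_i,
Z = rng.standard_normal((N, n))                  # stacked as the rows of Y in (1)
Y = np.outer(u, y_star) + sigma * Z

Sigma_N = Y.T @ Y / N                            # empirical covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma_N)
y_hat = eigvecs[:, -1]                           # leading eigenvector = PCA estimate

err = min(np.linalg.norm(y_hat - y_star), np.linalg.norm(-y_hat - y_star))
print(f"sign-invariant error: {err:.3f}   sqrt(n/N): {np.sqrt(n / N):.3f}")
```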
Bounds of the form (8), however, become uninformative in modern high dimensional regimes where the ambient dimension $n$ of the data is much larger than, or on the order of, the number of samples $N$. Even worse, in the asymptotic regime $n/N \to c > 0$ and for $\sigma^2$ large enough, the spike $y^\star$ and the estimate $\hat{y}$ become orthogonal [32], and minimax techniques show that no other estimator based solely on the data (7) can achieve better overlap with $y^\star$ [33].
In order to obtain consistent estimates and lower the sample complexity of the problem, additional prior information on the spike $y^\star$ therefore has to be enforced. For this reason, in recent years various priors have been analyzed, such as positivity [34], cone constraints [35] and sparsity [32,36]. In the latter case $y^\star$ is assumed to be $s$-sparse, and it can be shown (e.g., [33]) that for $N \gtrsim s \log n$ and $n \gtrsim s$, the $s$-sparse largest eigenvector $\hat{y}_s$ of $\Sigma_N$,
$$\hat{y}_s = \operatorname*{argmax}_{y \in \mathbb{S}^{n-1},\ \|y\|_0 \le s} \ y^T \Sigma_N y,$$
satisfies with high probability the condition
$$\min_{\epsilon = \pm 1} \| \epsilon \hat{y}_s - y^\star \|_2 \lesssim \sqrt{\frac{s \log n}{N}}.$$
This implies, in particular, that the signal $y^\star$ can be estimated with a number of samples that scales linearly with its intrinsic dimension $s$. These rates are also minimax optimal; see for example [4] for the mean squared error and [2] for support recovery. Despite these encouraging results, no currently known polynomial time algorithm achieves such optimal error rates: for example, the covariance thresholding algorithm of [37] requires $N \gtrsim s^2$ samples in order to obtain exact support recovery or the estimation rate
$$\min_{\epsilon = \pm 1} \| \epsilon \hat{y}_s - y^\star \|_2 \lesssim \sqrt{\frac{s^2 \log n}{N}},$$
as shown in [3]. In summary, only computationally intractable algorithms are known to reach the statistical limit N = Ω ( s ) for Sparse PCA, while polynomial time methods are only sub-optimal, requiring N = Ω ( s 2 ) . Notably, [38] provided a reduction of Sparse PCA to the planted clique problem which is conjectured to be computationally hard.
Further strong evidence for the hardness of sparse PCA has been given in a series of recent works [39,40,41,42,43]. Other computational-statistical gaps have also been found and studied in a variety of other contexts, such as sparse Gaussian mixture models [44], tensor principal component analysis [45], community detection [46] and synchronization over groups [47]. These works fit in the growing and important body of literature aiming at understanding the trade-offs between statistical accuracy and computational efficiency in statistical inverse problems.
We finally note that many of the above mentioned problems can be phrased as recovery of a spike vector from a spiked random matrix. The difficulty can be viewed as arising from simultaneously imposing low-rankness and additional prior information on the signal (sparsity in the case of Sparse PCA). This difficulty can be found in sparse phase retrieval as well. For example, [25] has shown that $m = O(s \log n)$ quadratic measurements are sufficient to ensure well-posedness of the estimation of an $s$-sparse signal of dimension $n$ lifted to a rank-one matrix, while $m = \Omega(s^2/\log^2 n)$ measurements are necessary for the success of natural convex relaxations of the problem. Similarly, [48] studied the recovery of simultaneously low-rank and sparse matrices, showing the existence of a gap between what can be achieved with convex and tractable relaxations and with nonconvex and intractable methods.

2.2. Inverse Problems with Generative Network Priors

Recently, in the wake of the successes of deep learning, generative networks have gained popularity as a novel approach for encoding and enforcing priors in signal recovery problems. In one deep-learning-based approach, a dataset of "natural signals" is used to train a generative network in an unsupervised manner. The range of this network defines a low-dimensional set which, if the network is successfully trained, contains, or approximately contains, target signals of interest [19,21]. Non-convex optimization methods are then used for recovery by optimizing over the range of the network. We notice that allowing the algorithms complete knowledge of the generative network architecture and of the learned weights is roughly analogous to allowing sparsity-based algorithms knowledge of the basis or frame in which the signal is modeled as sparse.
The use of generative networks for signal recovery has been successfully demonstrated in a variety of settings such as compressed sensing [21,49,50], denoising [16,51], blind deconvolution [22], inpainting [52] and many more [53,54,55,56]. In these papers, generative networks significantly outperform sparsity-based priors at signal reconstruction in the low-measurement regime. This fundamentally leverages the fact that a natural signal can be represented more concisely by a generative network than by a sparsity prior under an appropriate basis. This characteristic has been observed even in untrained generative networks, where the prior information is encoded only in the network architecture, and has been used to devise state-of-the-art signal recovery methods [57,58,59].
Parallel to these empirical successes, a recent line of works has investigated theoretical guarantees for various statistical estimation tasks with generative network priors. Following the work of [15], global guarantees were given in [21] for compressed sensing, followed then by many others for various inverse problems [19,20,50,51,55]. In particular, in [17] the authors have shown that $m = \Omega(k \log n)$ measurements are sufficient to recover a signal from random phaseless observations, assuming that the signal lies on the range of a generative network with latent dimension $k$. The same authors have then provided in [23] a polynomial time algorithm for recovery under the previous settings. Note that, contrary to the sparse phase retrieval problem, generative priors for phase retrieval allow for efficient algorithms with optimal sample complexity, up to logarithmic factors, with respect to the intrinsic dimension of the signal.
Further theoretical advances in signal recovery with generative network priors have been spurred by techniques from statistical physics. Recently, [30] analyzed the spiked matrix models (1) and (2) with $y^\star$ in the range of a generative network with random weights, in the asymptotic limit $k, n, N \to \infty$ with $n/k = O(1)$ and $N/n = O(1)$. The analysis is carried out mainly for networks with sign or linear activation functions in the Bayesian setting where the latent vector is drawn from a separable distribution. The authors of [30] provide an Approximate Message Passing and a spectral algorithm, and they numerically observe no statistical-computational gap, as these polynomial time methods are able to asymptotically match the information-theoretic optimum. In this asymptotic regime, [60] further provided precise statistical and algorithmic thresholds for compressed sensing and phase retrieval.

3. Algorithm and Main Result

In this section we present an efficient and statistically-optimal algorithm for the estimation of the signal y 🟉 given a spiked matrix Y of the form (1) or (2). The recovery method is detailed in Algorithm 1, and it is based on the direct optimization of the nonlinear least squares problem (4).
Applied in [16] for denoising and compressed sensing under generative network priors, and later used in [23] for phase retrieval, the first order optimization method described in Algorithm 1 leverages the theory of Clarke subdifferentials (the reader is referred to [61] for more details). As the objective function $f$ is continuous and piecewise smooth, at every point $x \in \mathbb{R}^k$ it has a Clarke subdifferential given by
$$\partial f(x) = \mathrm{conv}\{ v_1, v_2, \dots, v_T \}, \qquad (9)$$
where $\mathrm{conv}$ denotes the convex hull of the vectors $v_1, \dots, v_T$, which are the gradients of the $T$ smooth functions that are active at $x$. The vectors $v_x \in \partial f(x)$ are the subgradients of $f$ at $x$, and at a point $x$ where $f$ is differentiable it holds that $\partial f(x) = \{\nabla f(x)\}$.
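As a simple one-dimensional illustration (ours, not from the paper), the piecewise smooth function $\mathrm{relu}(t) = \max(t, 0)$ has the two smooth pieces $t \mapsto t$ and $t \mapsto 0$, both active at $t = 0$, so its Clarke subdifferential is

```latex
\[
  \partial\,\mathrm{relu}(t) =
  \begin{cases}
    \{1\},                            & t > 0, \\
    \mathrm{conv}\{0, 1\} = [0, 1],   & t = 0, \\
    \{0\},                            & t < 0,
  \end{cases}
\]
% and at every t \neq 0 the subdifferential reduces to the singleton
% containing the ordinary derivative, as stated above for f.
```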
The reconstruction method presented in Algorithm 1 is motivated by the landscape analysis of the minimization problem (4) for a network $G$ with sufficiently expansive Gaussian weight matrices. Under this assumption, we showed in [31] that (4) has a benign optimization geometry, and in particular that for any nonzero point outside a neighborhood of $x^\star$ and of a negative multiple of it, any subgradient of $f$ is a direction of strict descent. Furthermore, we showed that the points in the vicinity of the spurious negative multiple of $x^\star$ have function values strictly larger than those close to $x^\star$. Figure 1 shows the expected value of $f$ in the noiseless case, $\nu = 0$ and $N \to \infty$, for a generative network with latent dimension $k = 2$. This plot highlights the global minimum at $x^\star = [1, 1]$ and the flat region near a negative multiple of $x^\star$.
At each step, the subgradient method in Algorithm 1 checks if the current iterate x i has a larger loss value than its negative multiple, and if so negates x i . As we show in the proof of our main result, this step will ensure that the algorithm will avoid the neighborhood around the spurious negative multiple of x 🟉 and will converge to the neighborhood around x 🟉 in a polynomial number of steps.
Below we make the following assumptions on the weight matrices of G.
Assumption 1.
The generative network $G$ defined in (3) has weights $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$ with i.i.d. entries from $\mathcal{N}(0, 1/n_i)$, satisfying the expansivity condition with constant $\epsilon > 0$:
$$n_{i+1} \ge c\, n_i\, \epsilon^{-2} \log(1/\epsilon) \qquad (10)$$
for all $i$ and a universal constant $c > 0$.
We note that in [31] the expansivity condition was more stringent, requiring an additional log factor. Since the publication of our paper, [62] has shown that the more relaxed assumption (10) suffices for ensuring a benign optimization geometry. Under Assumption 1, our main theorem below shows that the subgradient method in Algorithm 1 can estimate the spike y 🟉 with optimal sample complexity and in a polynomial number of steps.
Theorem 1.
Let $x^\star \in \mathbb{R}^k$ be nonzero and $y^\star = G(x^\star)$, where $G$ is a generative network satisfying Assumption 1 with $\epsilon \le K_1 / d^{96}$. Consider the minimization problem (4) and assume that the noise level $\omega$ satisfies $\omega \le K_2 \|x^\star\|_2^2\, 2^{-d} / d^{44}$, where:
  • for the Spiked Wishart Model (1) take $M = \Sigma_N - \sigma^2 I_n$, and
    $$\omega := \bigl( \|y^\star\|_2^2 + \sigma^2 \bigr) \max\Biggl( 338 \sqrt{\frac{k \log( 3\, n_1^d n_2^{d-1} \cdots n_{d-1}^2\, n )}{N}},\ 156\, \frac{k \log( 3\, n_1^d n_2^{d-1} \cdots n_{d-1}^2\, n )}{N} \Biggr);$$
  • for the Spiked Wigner Model (2) take $M = Y$, and
    $$\omega := \nu\, 169 \sqrt{\frac{k \log( 3\, n_1^d n_2^{d-1} \cdots n_{d-1}^2\, n )}{n}}.$$
Consider Algorithm 1 with $x_0$ nonzero and $\|x_0\|_2 < R^\star$, where $R^\star \ge 5 \|x^\star\|_2 / (2\sqrt{2})$, and stepsize $\mu = 2^{2d} K_3 / (8 d^4 R^{\star 2})$. Then, with probability at least $1 - 2 e^{-k \log n} - \sum_{i=1}^d e^{-C n_{i-1}}$, it holds that $0 < \|x_i\|_2 < R^\star$ for any $i \ge 1$, and there exists an integer $T \le K_4 f(x_0)\, 2^{2d} / (R^{\star 4} d^4 \epsilon)$ such that for any $i \ge T$:
$$\| x_{i+1} - x^\star \|_2 \le \rho_1^{\,i+1-T} \| x_T - x^\star \|_2 + \rho_2\, 2^d\, \frac{\omega}{\|x^\star\|_2},$$
$$\| G(x_{i+1}) - y^\star \|_2 \le 1.2 \cdot 2^{-d/2} \rho_1^{\,i+1-T} \| x_T - x^\star \|_2 + 1.3\, \rho_2\, \frac{\omega}{\|y^\star\|_2},$$
where $C > 0$, $K_1, \dots, K_4 > 0$, $\rho_1 \in (0, 1)$ and $\rho_2 > 0$ are universal constants.
Note that the quantity $2^{2d}$ in the hypotheses and conclusions of the theorem is an artifact of the scaling of the network and should not be taken as requiring an exponentially small noise level or number of steps. Indeed, under Assumption 1 the ReLU activation zeros out roughly half of the entries of its argument, leading to an "effective" operator norm of each layer of approximately $1/2$. We furthermore notice that the dependence on the depth $d$ is likely quite conservative; it was not optimized in the proof, as the main objective was to obtain a tight dependence on the intrinsic dimension $k$ of the signal. As shown in the numerical experiments, the actual dependence on the depth is much better in practice. Finally, observe that despite the nonconvex nature of the objective function in (4), we obtain a rate of convergence which is not directly dependent on the dimension of the signal, reminiscent of what happens in the convex case.
The quantity ω in Theorem 1 can be interpreted as the intrinsic noise level of the problem (inverse SNR). The theorem guarantees that in a polynomial number of steps the iterates of the subgradient method will converge to x 🟉 up to ω . For T ˜ large enough G ( x T ˜ ) will satisfy the rate-optimal error bounds (5) and (6).

Numerical Experiments

We illustrate the predictions of our theory by providing results of Algorithm 1 on a set of synthetic experiments. We consider 2-layer generative networks with ReLU activation functions, hidden layer of dimension $n_1 = 500$, output dimension $n_2 = n = 1500$ and varying latent dimension $k \in \{40, 60, 100\}$. We sample the entries of the weight matrices independently from $\mathcal{N}(0, 2/n_i)$ (this scaling removes the $2^d$ factors in Theorem 1). We then generate data $Y$ according to the spiked models (1) and (2), where $x^\star \in \mathbb{R}^k$ is chosen so that $y^\star = G(x^\star)$ has unit norm. For the Wishart model we vary the number of samples $N$, and for the Wigner model we vary the noise level $\nu$, so that the following quantities remain constant across the networks with different latent dimension $k$ (a sketch of this setup is given below):
$$\theta_{\mathrm{WS}} := \sqrt{\frac{k \log(n_1^2 n)}{N}}, \qquad \theta_{\mathrm{WG}} := \nu \sqrt{\frac{k \log(n_1^2 n)}{n}}.$$
In Figure 2 we plot the reconstruction error $\| G(x) - y^\star \|_2$ against $\theta_{\mathrm{WS}}$ and $\theta_{\mathrm{WG}}$. As predicted by Theorem 1, the errors scale linearly with respect to these control parameters; moreover, the overlap of these plots confirms that the rates are tight with respect to the order of $k$.
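A sketch of this experimental setup for the Wigner case, reusing the subgradient method sketched after Algorithm 1; the value of $\theta_{\mathrm{WG}}$ below is arbitrary and only illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n = 500, 1500

for k in (40, 60, 100):
    # 2-layer ReLU network with N(0, 2/n_i) weights, as in the experiments
    W1 = rng.standard_normal((n1, k)) * np.sqrt(2.0 / n1)
    W2 = rng.standard_normal((n, n1)) * np.sqrt(2.0 / n)

    G = lambda x: np.maximum(W2 @ np.maximum(W1 @ x, 0.0), 0.0)
    x_star = rng.standard_normal(k)
    x_star /= np.linalg.norm(G(x_star))       # G is 1-homogeneous, so ||G(x_star)|| = 1
    y_star = G(x_star)

    # choose nu so that the control parameter theta_WG is the same for every k
    theta_WG = 0.05
    nu = theta_WG * np.sqrt(n / (k * np.log(n1 ** 2 * n)))

    A = rng.standard_normal((n, n)) / np.sqrt(n)
    Y = np.outer(y_star, y_star) + nu * (A + A.T) / np.sqrt(2)
    # ... run the subgradient method on M = Y and record ||G(x) - y_star||_2
```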

4. Recovery Under Deterministic Conditions

We will derive Theorem 1 from Theorem 3 below, which is based on a set of deterministic conditions on the weights of the network and on the noise. Specifically, we consider the minimization problem (4) with
$$M = G(x^\star) G(x^\star)^T + H$$
for an unknown symmetric matrix $H \in \mathbb{R}^{n \times n}$, a nonzero $x^\star \in \mathbb{R}^k$, and a given $d$-layer feedforward generative network $G$ as in (3).
In order to describe the main deterministic conditions on the generative network $G$, we begin by introducing some notation. For $W \in \mathbb{R}^{n \times k}$ and $x \in \mathbb{R}^k$, we define the operator $W_{+,x} := \mathrm{diag}(W x > 0)\, W$, so that $\mathrm{relu}(W x) = W_{+,x}\, x$. Moreover, we let $W_{1,+,x} = (W_1)_{+,x} = \mathrm{diag}(W_1 x > 0)\, W_1$, and for $2 \le i \le d$ we define recursively
$$W_{i,+,x} = \mathrm{diag}\bigl( W_i\, \Pi_{j=i-1}^{1} W_{j,+,x}\, x > 0 \bigr)\, W_i,$$
where $\Pi_{i=d}^{1} W_i = W_d W_{d-1} \cdots W_1$. Finally, we let $\Lambda_x = \Pi_{j=d}^{1} W_{j,+,x}$ and note that $G(x) = \Lambda_x x$. With this notation we next recall the following deterministic condition on the layers of the generative network.
Definition 2
(Weight Distribution Condition [15]). We say that $W \in \mathbb{R}^{n \times k}$ satisfies the Weight Distribution Condition (WDC) with constant $\epsilon > 0$ if for all nonzero $x_1, x_2 \in \mathbb{R}^k$:
$$\bigl\| W_{+,x_1}^T W_{+,x_2} - Q_{x_1, x_2} \bigr\| \le 2\epsilon,$$
where
$$Q_{x_1, x_2} = \frac{\pi - \theta_{x_1, x_2}}{2\pi}\, I_k + \frac{\sin \theta_{x_1, x_2}}{2\pi}\, M_{\hat{x}_1 \leftrightarrow \hat{x}_2}$$
and $\theta_{x_1, x_2} = \angle(x_1, x_2)$, $\hat{x}_1 = x_1 / \|x_1\|_2$, $\hat{x}_2 = x_2 / \|x_2\|_2$, $I_k$ is the $k \times k$ identity matrix and $M_{\hat{x}_1 \leftrightarrow \hat{x}_2}$ is the matrix that sends $\hat{x}_1 \mapsto \hat{x}_2$, $\hat{x}_2 \mapsto \hat{x}_1$, and whose kernel is the orthogonal complement of $\mathrm{span}(\{x_1, x_2\})$.
Note that $Q_{x_1, x_2}$ is the expected value of $W_{+,x_1}^T W_{+,x_2}$ when $W$ has rows $w_i \sim \mathcal{N}(0, I_k / n)$, and that if $x_1 = x_2$ then $Q_{x_1, x_2}$ is an isometry up to the scaling factor $1/2$. Below we will say that a $d$-layer generative network $G$ of the form (3) satisfies the WDC with constant $\epsilon > 0$ if every weight matrix $W_i$ satisfies the WDC with constant $\epsilon$ for all $i = 1, \dots, d$ (a numerical check of this condition is sketched below).
The WDC was originally introduced in [15]; it ensures that the angle between two vectors in the latent space is approximately preserved at the output layer and, in turn, guarantees the invertibility of the network. Assumption 1 guarantees that the generative network $G$ satisfies the WDC with high probability.
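The WDC can be checked numerically for a single Gaussian layer; the following sketch (our own construction, with illustrative dimensions) builds $Q_{x_1,x_2}$ explicitly and measures the spectral-norm deviation in Definition 2.

```python
import numpy as np

def W_plus(W, x):
    """W_{+,x} = diag(Wx > 0) W: rows of W where the ReLU is active at x."""
    return (W @ x > 0)[:, None] * W

def Q(x1, x2):
    """Expected value of W_{+,x1}^T W_{+,x2} for W with i.i.d. N(0, 1/n) rows."""
    k = x1.shape[0]
    x1h, x2h = x1 / np.linalg.norm(x1), x2 / np.linalg.norm(x2)
    theta = np.arccos(np.clip(x1h @ x2h, -1.0, 1.0))
    if np.isclose(theta, 0.0):
        return 0.5 * np.eye(k)                      # isometry up to the factor 1/2
    u = x1h
    w = (x2h - np.cos(theta) * u) / np.sin(theta)   # unit vector orthogonal to u
    # M swaps x1h and x2h and vanishes on the orthogonal complement of span{x1, x2}
    M = np.cos(theta) * (np.outer(u, u) - np.outer(w, w)) \
        + np.sin(theta) * (np.outer(u, w) + np.outer(w, u))
    return (np.pi - theta) / (2 * np.pi) * np.eye(k) + np.sin(theta) / (2 * np.pi) * M

rng = np.random.default_rng(4)
k, n = 5, 20000
W = rng.standard_normal((n, k)) / np.sqrt(n)
x1, x2 = rng.standard_normal(k), rng.standard_normal(k)
dev = np.linalg.norm(W_plus(W, x1).T @ W_plus(W, x2) - Q(x1, x2), 2)
print(f"spectral deviation from Q: {dev:.4f}")      # small when n >> k
```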
We are now able to state our recovery guarantees for a spike y 🟉 under deterministic conditions on the network G and noise H.
Theorem 3.
Let $d \ge 2$ and assume the generative network (3) has weights $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$ satisfying the WDC with constant $0 < \epsilon \le K_1 / d^{96}$. Consider Algorithm 1 with $M = G(x^\star) G(x^\star)^T + H$, $x^\star \in \mathbb{R}^k \setminus \{0\}$ and $H$ a symmetric matrix satisfying:
$$\| \Lambda_x^T H \Lambda_x \| \le \frac{\omega}{2^d}, \qquad \text{and} \qquad \omega \le K_2\, \frac{\|x^\star\|_2^2}{2^d\, d^{44}}. \qquad (13)$$
Take $x_0$ nonzero and with $\|x_0\|_2 < R^\star$, where $R^\star \ge 5 \|x^\star\|_2 / (2\sqrt{2})$, and $\mu = 2^{2d} K_3 / (8 d^4 R^{\star 2})$. Then the iterates $\{x_i\}_{i \ge 0}$ generated by Algorithm 1 satisfy $0 < \|x_i\|_2 < R^\star$ and obey the following:
(A) 
there exists an integer $T \le K_4\, f(x_0)\, \frac{2^{2d}}{R^{\star 4} d^4 \epsilon}$ such that
$$\| x_T - x^\star \|_2 \le K_5\, d^{14} \epsilon\, \|x^\star\|_2 + K_6\, 2^d d^{10}\, \omega\, \|x^\star\|_2^{-1};$$
(B) 
for any $i \ge T$:
$$\| x_{i+1} - x^\star \|_2 \le \rho_1^{\,i+1-T} \| x_T - x^\star \|_2 + \rho_2\, 2^d\, \frac{\omega}{\|x^\star\|_2}, \qquad (11)$$
$$\| G(x_{i+1}) - G(x^\star) \|_2 \le 1.2 \cdot 2^{-d/2} \rho_1^{\,i+1-T} \| x_T - x^\star \|_2 + 1.3\, \rho_2\, \frac{\omega}{\|y^\star\|_2}, \qquad (12)$$
where $K_1, \dots, K_6 > 0$, $\rho_1 \in (0, 1)$ and $\rho_2 > 0$ are universal constants.
Theorem 1 then follows from Theorem 3 after proving that, with high probability, the spectral norm of $\Lambda_x^T H \Lambda_x$, where $H = M - y^\star y^{\star T}$, can be upper bounded by $\omega / 2^d$, and that the weights of the network $G$ satisfy the WDC with high probability.
In the rest of this section we describe the main steps and tools needed to prove Theorem 3.

4.1. Technical Tools and Outline of the Proofs

Our proof strategy for Theorem 3 can be summarized as follows:
  • In Proposition A1 (Appendix A.3) we show that the iterates $\{x_i\}_{i \ge 1}$ of Algorithm 1 stay inside the Euclidean ball of radius $R^\star$ and remain nonzero for all $i \ge 1$.
  • We then identify two small Euclidean balls $\mathcal{B}_+$ and $\mathcal{B}_-$ around $x^\star$ and $-\rho_d x^\star$ respectively, where $\rho_d \in (0, 1)$ only depends on the depth of the network. In Proposition A2 we show that after a polynomial number of steps, the iterates $\{x_i\}$ of Algorithm 1 enter the region $\mathcal{B}_+ \cup \mathcal{B}_-$ (Appendix A.4).
  • We show, in Proposition A3, that the negation step causes the iterates of the algorithm to avoid the spurious point $-\rho_d x^\star$ and to actually enter $\mathcal{B}_+$ within a polynomial number of steps (Appendix A.5).
  • We finally show, in Proposition A4, that in $\mathcal{B}_+$ the loss function $f$ enjoys a favorable convexity-like property, which implies that the iterates $\{x_i\}$ remain in $\mathcal{B}_+$ and eventually converge to $x^\star$ up to the noise level (Appendix A.6).
One of the main difficulties in the analysis of the subgradient method in Algorithm 1 is the lack of smoothness of the loss function $f$. We show that the WDC allows us to overcome this issue, by showing that the subgradients of $f$ are uniformly close, up to the noise level, to the vector field $h_x \in \mathbb{R}^k$:
$$h_x := \frac{1}{2^{2d}} \bigl[ x x^T - \tilde{h}_x \tilde{h}_x^T \bigr] x,$$
where $\tilde{h}_x$ is continuous for nonzero $x$ (see Appendix A.2). We show furthermore that $h_x$ is locally Lipschitz, which allows us to conclude that the subgradient method decreases the value of the loss function until eventually reaching $\mathcal{B}_+ \cup \mathcal{B}_-$ (Appendix A.4).
Using the WDC, we show that the loss function $f$ is uniformly close to
$$f_E(x) = \frac{1}{2^{2d+2}} \Bigl( \|x\|_2^4 + \|x^\star\|_2^4 - 2 \langle x, \tilde{h}_x \rangle^2 \Bigr).$$
A direct analysis of $f_E$ reveals that its values inside $\mathcal{B}_-$ are strictly larger than those inside $\mathcal{B}_+$. This property extends to $f$ as well, and guarantees that the subgradient method will not converge to the spurious point $-\rho_d x^\star$ (Appendix A.5).

Author Contributions

Conceptualization, P.H. and V.V.; Formal analysis, J.C.; Writing—original draft, J.C.; Writing—review & editing, J.C. and P.H.; Supervision, P.H. and V.V. All authors have read and agreed to the published version of the manuscript.

Funding

PH was partially supported by NSF CAREER Grant DMS-1848087 and NSF Grant DMS-2022205.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Supporting Lemmas and Proof of Theorem 3

The proof of Theorem 3 is provided in Appendix A.7. We begin this section with a set of preliminary results and supporting lemmas.

Appendix A.1. Notation

We collect the notation that is used throughout the paper. For any real number $a$, let $\mathrm{relu}(a) = \max(a, 0)$, and for any vector $v \in \mathbb{R}^n$ denote by $\mathrm{relu}(v)$ the entrywise application of $\mathrm{relu}$. Let $\mathrm{diag}(W x > 0)$ be the diagonal matrix whose $i$-th diagonal element equals 1 if $(W x)_i > 0$ and 0 otherwise. For any vector $x$ we denote by $\|x\|$ its Euclidean norm, and for any matrix $A$ we denote by $\|A\|$ its spectral norm and by $\|A\|_F$ its Frobenius norm. The Euclidean inner product between two vectors $a$ and $b$ is $\langle a, b \rangle$, while for two matrices $A$ and $B$ their Frobenius inner product is denoted by $\langle A, B \rangle_F$. For any nonzero vector $x \in \mathbb{R}^n$, let $\hat{x} = x / \|x\|$. For a set $S$ we write $|S|$ for its cardinality and $S^c$ for its complement. Let $B(x, r)$ be the Euclidean ball of radius $r$ centered at $x$, and $S^{k-1}$ the unit sphere in $\mathbb{R}^k$. Let $\bar\theta_0 = \angle(x, x^\star)$ and, for $i \ge 0$, let $\bar\theta_{i+1} = g(\bar\theta_i)$, where $g$ is defined in (A1). We write $\gamma = \Omega(\delta)$ to mean that there exists a positive constant $C$ such that $\gamma \ge C \delta$, and similarly $\gamma = O(\delta)$ if $\gamma \le C \delta$. Additionally, we use $a = b + O_1(\delta)$ when $\|a - b\| \le \delta$, where the norm is understood to be the absolute value for scalars, the Euclidean norm for vectors and the spectral norm for matrices.

Appendix A.2. Preliminaries

For later convenience we will define the following vectors:
$$p_x := \Lambda_x^T \Lambda_x\, x, \qquad q_x := \Lambda_x^T \Lambda_{x^\star}\, x^\star, \qquad \bar{v}_x := \bigl[ p_x p_x^T - q_x q_x^T \bigr] x, \qquad \eta_x := \Lambda_x^T H \Lambda_x\, x.$$
Note that when $f$ is differentiable at $x$, then $\tilde{v}_x := \nabla f(x) = \bar{v}_x - \eta_x$; in particular, when $H = 0$ we have $\tilde{v}_x = \bar{v}_x$.
The following function controls how the angles are contracted by a ReLU layer:
$$g(\theta) := \cos^{-1}\Bigl( \frac{(\pi - \theta) \cos\theta + \sin\theta}{\pi} \Bigr). \qquad (A1)$$
As we mentioned in Section 4.1, our analysis is based on showing that the subgradients of f are uniformly close to the vector field h x given by
$$h_x := \frac{1}{2^{2d}} \bigl[ x x^T - \tilde{h}_x \tilde{h}_x^T \bigr] x,$$
where
$$\tilde{h}_x := \Bigl( \prod_{i=0}^{d-1} \frac{\pi - \bar\theta_i}{\pi} \Bigr) x^\star + \sum_{i=0}^{d-1} \frac{\sin \bar\theta_i}{\pi} \Bigl( \prod_{j=i+1}^{d-1} \frac{\pi - \bar\theta_j}{\pi} \Bigr) \|x^\star\|\, \frac{x}{\|x\|},$$
and $\bar\theta_i := g(\bar\theta_{i-1})$ for $g$ given by (A1), with $\bar\theta_0 = \angle(x, x^\star)$.
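The recursion for $\bar\theta_i$, the vectors $\tilde h_x$ and $h_x$, and the quantity $\rho_d$ used later (Lemma A13) can be computed directly. The following is a small NumPy sketch under the definitions above; the naming is ours.

```python
import numpy as np

def g(theta):
    """Angle contraction of one ReLU layer: g(t) = arccos(((pi - t) cos t + sin t) / pi)."""
    return np.arccos(((np.pi - theta) * np.cos(theta) + np.sin(theta)) / np.pi)

def angle_sequence(theta0, d):
    """theta_bar_0, ..., theta_bar_{d-1} with theta_bar_i = g(theta_bar_{i-1})."""
    thetas = [theta0]
    for _ in range(d - 1):
        thetas.append(g(thetas[-1]))
    return thetas

def h_tilde(x, x_star, d):
    """The vector h_tilde_x defined above."""
    cos0 = x @ x_star / (np.linalg.norm(x) * np.linalg.norm(x_star))
    th = angle_sequence(np.arccos(np.clip(cos0, -1.0, 1.0)), d)
    xi = np.prod([(np.pi - t) / np.pi for t in th])
    zeta = sum(np.sin(th[i]) / np.pi
               * np.prod([(np.pi - th[j]) / np.pi for j in range(i + 1, d)])
               for i in range(d))
    return xi * x_star + zeta * np.linalg.norm(x_star) * x / np.linalg.norm(x)

def h_field(x, x_star, d):
    """h_x = 2^{-2d} (x x^T - h_tilde h_tilde^T) x."""
    ht = h_tilde(x, x_star, d)
    return ((x @ x) * x - (ht @ x) * ht) / 4.0 ** d

def rho(d):
    """rho_d: the zeta series evaluated along the angle recursion started at theta = pi."""
    th = angle_sequence(np.pi, d)
    return sum(np.sin(th[i]) / np.pi
               * np.prod([(np.pi - th[j]) / np.pi for j in range(i + 1, d)])
               for i in range(d))

# rho(2) = 1/pi, and rho(d) approaches 1 as the depth grows
print(rho(2), 1 / np.pi, rho(50))
```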
Lemma A1
(Lemma 8 in [15]). Suppose that $d \ge 2$ and the WDC holds with $\epsilon < 1/(16 \pi d^2)^2$. Then for all nonzero $x, x^\star \in \mathbb{R}^k$,
$$\langle \Lambda_x x, \Lambda_{x^\star} x^\star \rangle \ge \frac{1}{4\pi}\, \frac{1}{2^d}\, \|x\|_2\, \|x^\star\|, \qquad (A4)$$
$$\Bigl\| \Lambda_x^T \Lambda_{x^\star} x^\star - \frac{\tilde{h}_x}{2^d} \Bigr\|_2 \le 24\, \frac{d^3 \epsilon}{2^d}\, \|x^\star\|, \qquad (A5)$$
$$\| \Lambda_x \|^2 \le \frac{1}{2^d} (1 + 2\epsilon)^d \le \frac{1 + 4\epsilon d}{2^d} \le \frac{13}{12}\, \frac{1}{2^d}. \qquad (A6)$$
Proof. 
The first two bounds can be found in [15] (Lemma 8). The third bound follows by noticing that the WDC implies:
Λ x 2 Π i = d 1 W i , + , x 2 1 2 d ( 1 + 2 ϵ ) d 1 + 4 ϵ d 2 d 13 12 1 2 d ,
where we used log ( 1 + z ) z and e z ( 1 + 2 z ) for all 0 z 1 . □
The next lemma shows that the noiseless gradient v ¯ x concentrates around h x .
Lemma A2.
Suppose $d \ge 2$ and the WDC holds with $\epsilon < 1/(16 \pi d^2)^2$. Then for all nonzero $x, x^\star \in \mathbb{R}^k$:
$$\| \bar{v}_x - h_x \| \le 86\, \frac{d^4 \epsilon}{2^{2d}}\, \max( \|x^\star\|_2^2, \|x\|_2^2 )\, \|x\|_2.$$
We now use the characterization of the Clarke subdifferential given in (9) to derive a bound on the concentration of v x f ( x ) around h x up to the noise level.
Lemma A3.
Under the assumptions of Lemma A2, and assuming $\| \Lambda_x^T H \Lambda_x \| \le \omega / 2^d$, for any $v_x \in \partial f(x)$:
$$\| v_x - h_x \| \le 86\, \frac{d^4 \epsilon}{2^{2d}}\, \max( \|x^\star\|_2^2, \|x\|_2^2 )\, \|x\|_2 + \frac{\omega}{2^d}\, \|x\|_2.$$
From the above and the bound on the noise level ω we can bound the norm of the step v x in the Algorithm 1.
Lemma A4.
Under the assumptions of Lemma A3, and assuming that $\omega$ satisfies (13), for any $v_x \in \partial f(x)$:
$$\| v_x \| \le \frac{4}{2^{2d}}\, d^2\, \max( \|x\|_2^2, \|x^\star\|_2^2 )\, \|x\|_2.$$

Appendix A.3. Iterates Stay Bounded

In this section we prove that all the iterates $\{x_i\}$ generated by Algorithm 1 remain inside the Euclidean ball $B(0, R^\star)$, where $R^\star \ge \sqrt{2}\, C^\star \|x^\star\|$ and $C^\star = 5/4$.
Lemma A5.
Let the assumptions of Lemma A4 be satisfied and $\mu = K_3\, 2^{2d} / (8 d^4 R^{\star 2})$ with $512 \pi^2 K_3 < 1$. Then for any $x \in \mathbb{R}^k$ with $C^\star \|x^\star\| < \|x\| \le R^\star$ and any $\lambda \in [0, 1]$, it holds that $\| x - \lambda \mu v_x \| \le \|x\|$.
From the previous lemma we can now derive the boundedness of the iterates of the Algorithm 1.
Proposition A1.
Under the assumptions of Theorem 3, if $x \in B(0, R^\star) \setminus \{0\}$ it follows that $x - \lambda \mu v_x \in B(0, R^\star) \setminus \{0\}$. Furthermore, if $x_0 \in B(0, R^\star) \setminus \{0\}$, the iterates $\{x_i\}_{i \ge 1}$ of Algorithm 1 satisfy $x_i - \lambda \mu v_{x_i} \in B(0, R^\star) \setminus \{0\}$ for all $i \ge 1$ and $\lambda \in [0, 1]$.
Proof. 
Assume $C^\star \|x^\star\| < \|x\| \le R^\star$; then the conclusion follows from Lemma A5. Assuming instead that $\|x\| \le C^\star \|x^\star\|$, note that
$$\| x - \lambda \mu v_x \| \le \|x\| + \mu \|v_x\| \le \Bigl( 1 + \frac{1}{2 d^2} \Bigr) \|x\| \le R^\star,$$
using Lemma A4, $d \ge 2$ and the assumptions on $\mu$ and $R^\star$. Finally, observe that if $x_i \in B(0, R^\star)$ then the same holds for $\tilde{x}_i$.
Finally, if for some $i \ge 0$ it were the case that $0 = \tilde{x}_i - \lambda \mu v_{\tilde{x}_i}$, then this would imply $\|\tilde{x}_i\| = \lambda \mu \|v_{\tilde{x}_i}\|$, which cannot happen because, by Lemma A4 and the choice of the step size, it holds that $\mu \|v_{\tilde{x}_i}\| \le \|\tilde{x}_i\| / 8$. □

Appendix A.4. Convergence to S β

We define the set S β outside which we can lower bound the norm of h x as
$$S_\beta := \Bigl\{ x \in \mathbb{R}^k \;\Big|\; \|h_x\| \le \frac{\beta}{2^{2d}}\, \max( \|x\|_2^2, \|x^\star\|_2^2 )\, \|x\|_2 \Bigr\},$$
where we take
$$\beta = 7 \Bigl( 86\, d^4 \epsilon + \frac{2^d\, \omega}{\|x^\star\|_2^2} \Bigr). \qquad (A7)$$
Outside the set S β the sub-gradients of f are bounded below and the landscape has favorable optimization geometry.
Lemma A6.
Let $x \in \bigl( B(0, R^\star) \setminus \{0\} \bigr) \cap S_\beta^c$. Then for all $v_x \in \partial f(x)$,
$$\| v_x \| \ge \frac{6 \max( \|x\|_2^2, \|x^\star\|_2^2 )\, \|x\|_2}{2^{2d}}\, 86\, d^4 \epsilon + \frac{\omega}{2^d}\, \|x\|_2. \qquad (A8)$$
Moreover, let $\lambda \in [0, 1]$ and $x_\lambda = x - \lambda \mu v_x$; then
$$\| v_x - v_{x_\lambda} \| \le \frac{15}{16}\, \| v_x \|, \qquad (A9)$$
for all $v_x \in \partial f(x)$ and $v_{x_\lambda} \in \partial f(x_\lambda)$.
Based on the previous lemma we can prove the main result of this section.
Proposition A2.
Under the assumptions of Theorem 3, if $x_i \in \bigl( B(0, R^\star) \setminus \{0\} \bigr) \cap S_\beta^c$ then
$$f(x_{i+1}) - f(x_i) \le - K\, \frac{R^{\star 4}\, d^4\, \epsilon}{2^{2d}} \qquad (A10)$$
for some numerical constant $K > 0$. Moreover, there exists an integer $T \le \frac{f(x_0)\, 2^{2d}}{K\, R^{\star 4}\, d^4\, \epsilon}$ such that $x_{i+T} \in S_\beta$.
Proof. 
Let $x_i \notin S_\beta$ and assume that $f(x_i) \le f(-x_i)$, so that $\tilde{x}_i = x_i$. By the mean value theorem for Clarke subdifferentials [61] (Theorem 8.13), there exists $\lambda \in (0, 1)$ such that for $x_{i,\lambda} = x_i - \mu \lambda v_{x_i}$ and a $v_{x_{i,\lambda}} \in \partial f(x_{i,\lambda})$ it holds that
f ( x i + 1 ) f ( x ˜ i ) = v x i , λ , μ v x i = v x i , μ v x i + v x i , λ v x i , μ v x i μ v x i ( v x i v x i , λ v x i ) μ 16 v x i 2
where the first inequality follows from the triangle inequality and the second from Equation (A9). Next observe that by (A8)
v x 2 36 max ( x 2 , x 🟉 2 ) 2 x 2 2 4 d 86 2 d 8 ϵ
which together with the definition of μ and (A11) gives (A10).
Next take $x_i \notin S_\beta$ and assume $f(x_i) > f(-x_i)$, so that $\tilde{x}_i = -x_i$. Observe that
$$f(x_{i+1}) - f(x_i) < f(x_{i+1}) - f(-x_i) = f(x_{i+1}) - f(\tilde{x}_i);$$
we then obtain (A10) proceeding as before.
Finally the claim on the maximum number of iterations directly follows by a telescopic sum on (A10) and f ( · ) 0 . □

Appendix A.5. Convergence to a Neighborhood Around x 🟉

In the previous section we have shown that after a finite number of steps the iterates { x i } of the Algorithm 1 will enter in the region S β . In this section we show that, thanks to the negation step in the descent algorithm, they will eventually be confined in a neighborhood of x 🟉 .
The following lemma shows that S β is contained inside two balls around x 🟉 and x 🟉 .
Lemma A7.
Suppose 8 π d 6 β 1 , then we have S β B + B where
B + : = B ( x 🟉 , R 1 β d 10 x 🟉 ) and B : = B ( ρ d x 🟉 , R 2 β d 10 x 🟉 ) ,
where R 1 , R 2 are numerical constants and 0 < ρ d < 1 such that ρ d 1 as d .
We furthermore observe that the loss function takes values on the ball around $-\rho_d x^\star$ that are strictly higher than those on the ball around $x^\star$.
Lemma A8.
Suppose that $d \ge 2$, the WDC holds with $\epsilon < 1/(16 \pi d^2)^2$ and $H$ satisfies (13). Then for any $\phi_d \in [\rho_d, 1]$ it holds that
$$f(x) < f(y)$$
for all $x \in B\bigl( \phi_d x^\star,\ \varrho \|x^\star\| / d^{12} \bigr)$ and $y \in B\bigl( -\phi_d x^\star,\ \varrho \|x^\star\| / d^{12} \bigr)$, where $\varrho < 1$ is a universal constant.
The main result of this section is about the convergence to a neighborhood of x 🟉 of the iterates { x i } .
Proposition A3.
Under the assumptions of Theorem 3, if x i B ( 0 , R 🟉 ) \ { 0 } , then there exists a finite number of steps T K f ( x i ) 2 2 d R 🟉 4 d 4 ϵ such that x i + T B ( x 🟉 , R 1 β d 10 x 🟉 ) . In particular it holds that
x i + T x 🟉 C 1 d 14 ϵ x 🟉 + C 2 2 d d 10 ω x 🟉 1 .
Proof. 
Either $x_i \in S_\beta$ or, by Proposition A2, there exists $T$ such that $x_{i+T} \in S_\beta$. By the choice of $\epsilon > 0$, the definition of $\beta$ in (A7), and the assumption on the noise level (13), it follows that the hypotheses of Lemma A7 are satisfied. We therefore define the two neighborhoods $S_\beta^+ := S_\beta \cap \mathcal{B}_+$ and $S_\beta^- := S_\beta \cap \mathcal{B}_-$ and conclude that either $x_{i+T} \in S_\beta^+$ or $x_{i+T} \in S_\beta^-$.
We next notice that $S_\beta^+ \subset B\bigl( x^\star, \varrho \|x^\star\| / d^{12} \bigr)$ and $S_\beta^- \subset B\bigl( -\rho_d x^\star, \varrho \|x^\star\| / d^{12} \bigr)$. We can then use Lemma A8 to conclude that, by the negation step, if $x_{i+T} \in S_\beta^+$ then $\tilde{x}_{i+T} \in S_\beta^+$, otherwise we will have $\tilde{x}_{i+T} \notin S_\beta$.
We now analyze the case $x_{i+T} \in S_\beta^-$, in which $\tilde{x}_{i+T} \in S_\beta^c$. Applying again Proposition A2, there exists an integer $T' \le K\, f(\tilde{x}_{i+T})\, \frac{2^{2d}}{R^{\star 4} d^4 \epsilon}$ such that $x_{i+T+T'} \in S_\beta$. Furthermore, Proposition A2 implies that $f(x_{i+T+T'}) < f(\tilde{x}_{i+T})$, while from Lemma A8 we know that $f(\tilde{x}_{i+T}) < f(y)$ for all $y \in S_\beta^-$. We conclude therefore that $x_{i+T+T'}$ must be in $S_\beta^+$.
In summary, we have obtained that there exists an integer $T \le K\, f(x_0)\, \frac{2^{2d}}{R^{\star 4} d^4 \epsilon}$ such that $x_{i+T} \in S_\beta^+ \subset \mathcal{B}_+$. Finally, Equation (A13) follows from the definition of $\beta$ in (A7) and of $\mathcal{B}_+$. □

Appendix A.6. Convergence to x 🟉 up to Noise

Lemma A9.
Suppose the WDC holds with ϵ < 1 / ( 200 4 d 6 ) , then for all x B ( x 🟉 , d ϵ x 🟉 ) and v x x f ( x ) , it holds that
v x τ 1 2 2 d x 🟉 2 ( x x 🟉 ) τ 2 x 🟉 2 2 2 d x x 🟉 + τ 3 ω 2 d x 🟉
where τ 1 = 21 / 5 , τ 2 = 17 / 5 and τ 3 = ( 1 + 1 / ( 400 ) 2 ) .
Based on the previous condition on the direction of the sub-gradients we can then prove that the iterates of the algorithm converge to x 🟉 up to noise.
Proposition A4.
Under the assumptions of Theorem (3), if x T B + then for any i T it holds that x i B + and furthermore
x i + 1 x 🟉 ρ 1 i + 1 T x T x 🟉 + ρ 2 2 d x 🟉 ω
where ρ 1 ( 0 , 1 ) and ρ 2 > 0 are numerical constants.
Proof. 
If i = T , by Proposition A2, it follows that x i = x ˜ i B + . Furthermore, the assumptions of Lemma A9 are satisfied and we can write:
x i + 1 x 🟉 = x ˜ i μ v x ˜ i x 🟉 ( 1 μ τ 1 x 🟉 2 2 2 d ) x ˜ i x 🟉 + μ v x ˜ i τ 1 2 2 d x 🟉 2 ( x ˜ i x 🟉 ) ( 1 μ ( τ 1 τ 2 ) x 🟉 2 2 2 d ) x ˜ i x 🟉 + μ τ 3 ω 2 d x 🟉 .
Next recall that ω satisfies (13), μ = K 3 2 2 d / ( 8 d 4 R 🟉 2 ) and R 🟉 2 25 x 🟉 2 / 8 , then
x i + 1 x 🟉 ( 1 μ ( τ 1 τ 2 ) x 🟉 2 2 2 d ) x ˜ i x 🟉 + μ K 2 2 2 d τ 3 d 44 x 🟉 3 [ 1 ( τ 1 τ 2 ) K 3 25 d 4 + K 2 K 3 25 d 48 ] x 🟉
Therefore for K 2 and K 3 small enough we obtain that x i + 1 B ( x 🟉 , R 1 β d 10 x 🟉 ) and by induction this holds for all i T . Finally we obtain (A14) by letting ρ 1 = ( 1 μ ( τ 1 τ 2 ) x 🟉 2 / 2 2 d ) and ρ 2 = K 3 τ 3 / ( 25 d 4 ) in (A15). □

Appendix A.7. Proof of Theorem 3

We begin recalling the following fact on the local Lipschitz property of the generative network G under the WDC.
Lemma A10
(Lemma 21 in [16]). Suppose x B ( x 🟉 , d ϵ x 🟉 ) , and the WDC holds with ϵ < 1 / ( 200 4 d 6 ) . Then it holds that:
G ( x ) G ( x 🟉 ) 1.2 2 d / 2 x x 🟉 .
We then conclude the proof of Theorem 3 by using the above lemma and the results in the previous sections.
(I)
By assumption x 0 B ( 0 , R 🟉 ) \ { 0 } so that according to Proposition A1 for any i 1 it holds that x i B ( 0 , R 🟉 ) \ { 0 } .
(II)
By Proposition A3, there exists an integer $T$ such that $x_T \in \mathcal{B}_+$, and therefore it satisfies the conclusion of Theorem 3.A:
$$\| x_T - x^\star \| \le C_1\, d^{14} \epsilon\, \|x^\star\| + C_2\, 2^d d^{10}\, \omega\, \|x^\star\|^{-1}.$$
(III)
Once in $\mathcal{B}_+$, the iterates of Algorithm 1 converge to $x^\star$ up to the noise level, as shown by Proposition A4 and Equation (A14):
$$\| x_{i+1} - x^\star \| \le \rho_1^{\,i+1-T}\, \| x_T - x^\star \| + \rho_2\, 2^d\, \frac{\omega}{\|x^\star\|},$$
which corresponds to (11) in Theorem 3.B.
(IV)
The reconstruction error (12) in Theorem 3.B, follows then from (11) by applying Lemma A10 and the lower bound (A4).

Appendix B. Supplementary Proofs

Appendix B.1. Supplementary Proofs for Appendix A.2

Below we prove Lemma A2 on the concentration of the gradient of f at a differentiable point.
Proof of Lemma A2. 
We begin by noticing that:
v ¯ x h x = [ p x , x p x x , x x 2 2 d ] + [ h ˜ x , x h ˜ x 2 2 d q x , x x ] .
Below we show that:
p x , x p x x , x x 2 2 d 50 2 2 d d 3 ϵ max { x 2 , x 🟉 2 } x .
and
q x , x p x h ˜ x , x h ˜ x 2 2 d | 36 2 2 d d 4 ϵ max { x 2 , x 🟉 2 } x .
from which the thesis follows.
Regarding Equation (A16) observe that:
p x , x p x x , x x 2 2 d = p x , x [ p x x 2 d ] + p x x 2 d , x 2 d x ( Λ x x 2 + x 2 2 d ) p x x 2 d 50 2 2 d d 3 ϵ x 3
where in the first inequality we used p x , x = Λ x x 2 and in the second we used Equations (A5) and (A6) of Lemma A1.
Next note that:
q x , x q x h ˜ x , x h ˜ x 2 2 d = q x , x ( q x h ˜ x 2 d ) + q x h ˜ x 2 d , x h ˜ x 2 d ( q x + h ˜ x 2 d ) x q x h ˜ x 2 d ( 13 12 + 1 + d π ) x x 🟉 2 d q x h ˜ x 3 2 d x x 🟉 2 d q x h ˜ x
where in the second inequality we have the bound (A6) and the definition of h ˜ x . Equation (A17) is then found by appealing to Equation (A5) in Lemma A1. □
The previous lemma is now used to control the concentration of the subgradients v x of f around h x .
Proof of Lemma A3. 
When $f$ is differentiable at $x$, $\nabla f(x) = \tilde{v}_x = \bar{v}_x - \eta_x$, so that by Lemma A2 and the assumption on the noise:
$$\| \tilde{v}_x - h_x \| \le \| \bar{v}_x - h_x \| + \| \eta_x \| \le 86\, \frac{d^4 \epsilon}{2^{2d}}\, \max( \|x^\star\|_2^2, \|x\|_2^2 )\, \|x\|_2 + \frac{\omega}{2^d}\, \|x\|_2. \qquad (A18)$$
Observe now that, by (9), for any $x \in \mathbb{R}^k$ and $v_x \in \partial f(x) = \mathrm{conv}(v_1, \dots, v_T)$, we can write $v_x = a_1 v_1 + \dots + a_T v_T$ for some $a_1, \dots, a_T \ge 0$ with $\sum_i a_i = 1$. Moreover, for each $v_i$ there exists a $w_i$ such that $v_i = \lim_{\delta \to 0^+} \tilde{v}_{x + \delta w_i}$. Therefore, using Equation (A18), the continuity of $h_x$ with respect to nonzero $x$, and $\sum_i a_i = 1$:
$$\| v_x - h_x \| \le \sum_{i=1}^T a_i\, \| v_i - h_x \| \le \sum_{i=1}^T a_i \lim_{\delta \to 0^+} \| \tilde{v}_{x + \delta w_i} - h_{x + \delta w_i} \| \le 86\, \frac{d^4 \epsilon}{2^{2d}}\, \max( \|x^\star\|_2^2, \|x\|_2^2 )\, \|x\|_2 + \frac{\omega}{2^d}\, \|x\|_2. \ \square$$
The above results are now used to bound the norm of v x f ( x ) .
Proof of Lemma A4. 
Since $\epsilon < 1/(16 \pi d^2)^2$, observe that $86\, d^4 \epsilon \le 2 d^2$. Therefore, by the assumption on the noise level and Lemma A3, it follows that for any $v_x \in \partial f(x)$ and $K_2 \le 1$,
$$\| v_x - h_x \| \le \frac{1}{2^{2d}}\, \frac{5}{2}\, d^2\, \max( \|x\|_2^2, \|x^\star\|_2^2 )\, \|x\|_2.$$
Next observe that, since $\|\tilde{h}_x\| \le d\, \|x^\star\|$, we have
$$\| h_x \| \le \frac{1}{2^{2d}}\, \frac{5}{4}\, d^2\, \max( \|x\|_2^2, \|x^\star\|_2^2 )\, \|x\|_2,$$
and from $\| v_x \| \le \| v_x - h_x \| + \| h_x \|$ we obtain the thesis. □

Appendix B.2. Supplementary Proofs for Appendix A.3

In this section we prove Lemma A5 which implies that the norm of the iterates x i does not increase in the region B ( 0 , R 🟉 ) \ B ( 0 , C 🟉 x 🟉 ) .
Proof of Lemma A5. 
Note that the thesis is equivalent to 2 x , v x λ μ v x 2 . Next recall that by the WDC for any x R k and 2 d ϵ < 1 :
( 1 2 ϵ d ) 2 d x 2 ( 1 2 ϵ ) d 2 d x 2 G ( x ) 2 ( 1 + 2 ϵ ) d 2 d x 2 ( 1 + 4 ϵ d ) 2 d x 2 .
At a nonzero differentiable point x R k \ { 0 } with C 🟉 x 🟉 < x < R 🟉 , then
v ˜ x , x G ( x ) 2 ( G ( x ) 2 G ( x 🟉 ) 2 ) Λ x T H Λ x x 2 x 2 2 2 d ( 1 2 ϵ d ) 2 x 2 [ ( 1 2 ϵ d ) ( 1 + 4 ϵ d ) + K 2 / d 44 ] x 🟉 2 x 2 2 2 d ( 1 4 d ϵ ) x 2 [ ( 1 + 2 d ϵ ) + K 2 / d 44 ] x 🟉 2 .
Next, by Lemma A4, the definition of the step length μ and max ( x 2 , x 🟉 2 ) R 🟉 2 , we have:
λ μ 2 v ˜ x 2 K 3 2 2 d x 4 ,
which, using x > C 🟉 x 🟉 , gives
v ˜ x , x λ μ 2 v ˜ x 2 x 2 2 2 d ( 1 4 d ϵ K 3 ) x 2 ( 1 + 2 d ϵ + K 2 ) x 🟉 2 x 2 x 🟉 2 2 2 d ( 1 4 d ϵ K 3 ) C 🟉 ( 1 + 2 d ϵ + K 2 ) .
We can then conclude by observing that by the assumptions and for small enough constants ( 1 4 d ϵ K 3 ) C 🟉 ( 1 + 2 d ϵ + K 2 / d 44 ) > 0 .
At a non-differentiable point x, by the characterization of the Clarke subdifferential, we can write v x = = 1 m c v where v = lim δ f ( x + δ w ) then
x , v x = x , = 1 m c v = = 1 m c lim δ 0 + x + δ w , v ˜ x + δ w
which implies that the lower bound (A19) also holds for x , v x . Similarly Lemma A4 leads to the upper bound (A20) also for v x , which then leads to the thesis 2 x , v x λ μ v x 2 . □

Appendix B.3. Supplementary Proofs for Appendix A.4

In the next lemmas we show that h x is locally Lipschitz.
Lemma A11.
For all x , y 0
h x h y 1 2 2 d 2 ( x 2 + y 2 ) + d 2 + 5 d 3 max ( x y , y x ) x 🟉 2 x y
Proof. 
Note that h ˜ x = ξ x 🟉 + ζ x 🟉 x ^ where ξ and ζ are defined in (A23). Then observe that by (A28) and (A29) it follows that h ˜ x d x 🟉 . Furthermore by [16] (Lemma 18) for any nonzero x , y R k we have
h ˜ x h ˜ y 9 4 d 2 max 1 x , 1 y x 🟉 x y
Next notice that
h x h y 1 2 2 d x , x x y , y y + 1 2 2 d h ˜ y , y h ˜ y h ˜ x , x h ˜ x ,
where by triangle inequality the first term on the left hand side can be bounded as
x , x x y , y y x 2 x y + x 2 y 2 y x 2 + x y + y 2 x y 2 x 2 + y 2 x y
Finally note that by from the bound (A21) we obtain:
h ˜ y , y h ˜ y h ˜ x , x h ˜ x h ˜ x x h ˜ x h ˜ y + h ˜ y h ˜ x , x h ˜ y , y h ˜ x x h ˜ x h ˜ y + h ˜ y h ˜ x x y + h ˜ y y h ˜ x h ˜ y d 2 x 🟉 2 x y + d ( x + y ) x 🟉 h ˜ x h ˜ y 2 d 2 + 5 d 3 max x y , y x x 🟉 2 x y .
where we used
( x + y ) max 1 x , y 2 max x , y min x , y = 2 max x y , y x .
Based on the previous lemma we can now prove that h x is locally Lipschitz.
Lemma A12.
Let x B ( 0 , R 🟉 ) \ { 0 } , λ ( 0 , 1 ) , μ = K 3 2 2 d / ( 8 d 4 R 🟉 2 ) with 512 π 2 K 3 < 1 and v x f ( x ) , then for x λ = x λ μ v x it holds that:
h x h x λ 7 16 v x
Proof. 
Consider x B ( 0 , R 🟉 ) and observe that by Lemma A4, d 2 and the choice of μ , for any v x f ( x ) and any λ ( 0 , 1 ) we have μ v x x / 8 . It follows that
7 8 x x λ 9 8 x ,
and in particular
max x x λ , x λ x 8 7 .
Therefore by Lemma A11 we deduce that
h x h x λ λ μ 2 2 d 2 ( x 2 + x λ 2 ) + ( d 2 + 6 d 3 ) x 🟉 2 v x μ 2 2 d 4 R 🟉 2 + ( d 2 + 6 d 3 ) x 🟉 2 v x
where in the second inequality we used max ( x 2 , x λ 2 ) R 🟉 2 by Proposition A1. The thesis is obtained by substituting the definition of μ and using K 3 1 . □
Based on the previous result we can now prove Lemma A6.
Proof of Lemma A6. 
Let x B ( 0 , R 🟉 ) S β c ,
v x h x v x h x β 2 d max ( x 🟉 2 , x 2 ) x 86 d 4 ϵ 2 2 d max ( x 🟉 2 , x 2 ) x ω 2 d x 6 max ( x 2 , x 🟉 2 ) x 2 2 d ( 86 d 4 ϵ + 2 d ω x 🟉 2 ) 6 max ( x 2 , x 🟉 2 ) x 2 2 d 86 d 4 ϵ + ω 2 d x
where we used the definition β in Equation (A7) and Lemma A3.
Next take λ [ 0 , 1 ] and x λ = x λ μ v x with v x f ( x ) . Then for any v x λ f ( x λ )
v x v x λ v x h x + h x h x λ + h x λ v x λ 86 d 4 ϵ 2 2 d ( max ( x 🟉 2 , x 2 ) x + max ( x 🟉 2 , x λ 2 ) x λ ) + ω 2 d ( x + x λ ) + 7 16 v x 86 d 4 ϵ 2 2 d 1 + 9 8 3 max ( x 🟉 2 , x 2 ) x + 1 + 9 8 ω 2 d x + 7 16 v x | 3 86 d 4 ϵ 2 2 d max ( x 🟉 2 , x 2 ) x + ω 2 d x + 7 16 v x 1 2 + 7 16 v x
where in the first inequality we used Lemma A3 and Lemma A12, in the second inequality (A22) and in the last one (A8). □

Appendix B.4. Supplementary Proofs for Appendix A.5

Below we prove that the region of R k where we cannot control the norm of the vector field h x is contained in two balls around x 🟉 and ρ d x 🟉 .
We prove Lemma A7 by showing the following.
Lemma A13.
Suppose 8 π d 6 β 1 . Define:
ρ d : = i = 0 d 1 sin θ ˇ i π ( j = i + 1 d 1 π θ ˇ j π )
where θ ˇ 0 = π and θ ˇ i = g ( θ ˇ i 1 ) . If x S β , then we have either:
| θ ¯ 0 | 32 d 4 π β and | x 2 x 🟉 2 | 258 π β d 6 x 🟉
or
| θ ¯ 0 π | 8 π d 4 β and | x 2 ρ d 2 x 🟉 2 | 281 π 2 β d 10 x 🟉 .
In particular, we have:
S β B ( x 🟉 , R 1 β d 10 x 🟉 ) B ( ρ d x 🟉 , R 2 β d 10 x 🟉 )
where R 1 , R 2 are numerical constants and ρ d 1 as d .
Proof of Lemma A13. 
Without loss of generality, let x 🟉 = e 1 and x = r cos θ ¯ 0 · e 1 + r sin θ ¯ 0 · e 2 , for some θ ¯ 0 [ 0 , π ] , and r 0 . Recall that we call x ^ = x / x and x ^ 🟉 = x 🟉 / x 🟉 . We then introduce the following notation:
ξ = i = 0 d 1 π θ ¯ i π , ζ = i = 0 d 1 sin θ ¯ i π j = i + 1 d 1 π θ ¯ j π , r = x , R = max ( r 2 , 1 ) ,
where θ i = g ( θ ¯ i 1 ) with g as in (A1), and observe that h ˜ x = ( ξ x ^ 🟉 + ζ x ^ ) . Let α : = h ˜ x , x ^ , then we can write:
h x = 1 2 2 d x , x x h ˜ x , x h ˜ x = r 2 2 d r 2 x ^ α ( ξ x ^ 🟉 + ζ x ^ ) .
Using the definition of x ^ and x ^ 🟉 we obtain:
2 2 d h x r = ( r 2 α ζ ) cos θ ¯ 0 α ξ · e 1 + [ r 2 α ζ ] sin θ ¯ 0 · e 2 ,
and conclude that since x S β , then:
| ( r 2 α ζ ) cos θ ¯ 0 α ξ | β R
| [ r 2 α ζ ] sin θ ¯ 0 | β R .
We now list some bounds that will be useful in the subsequent analysis. We have:
θ ¯ i θ ¯ i 1 for i 1
θ ¯ i cos 1 ( 1 / π ) for i 2
| ξ | 1
| ζ | d π sin θ ¯ 0
ξ π θ ¯ 0 π d 3
θ ˇ i 3 π i + 3 for i 0
θ ˇ i π i + 1 for i 0
θ ¯ 0 = π + O 1 ( δ ) | ξ | δ π
θ ¯ 0 = π + O 1 ( δ ) ζ = ρ d + O 1 ( 3 d 3 δ ) if d 2 δ π 1
1 / π α 1 .
The identities (A26) through (A34) can be found in Lemma 16 of [16], while the identity (A35) follows by noticing that α = ξ cos θ ¯ 0 + ζ = cos θ d and using (A27) together with d 2 .
Bound on R. 
We now show that if x S β , then r 2 4 d and therefore R 4 d .
If r 2 1 , then the claim is trivial. Take r 2 > 1 , then note that either | sin θ ¯ 0 | 1 / 2 or | cos θ ¯ 0 | 1 / 2 must hold. If | sin θ ¯ 0 | 1 / 2 then from (A25) it follows that r 2 α ζ 2 β R = 2 β r 2 which implies:
r 2 α ζ 1 2 β 1 ( 1 2 β ) d π d 2
using (A29) and (A35) in the second inequality and β < 1 / 4 in the third. Next take | cos θ ¯ 0 | 1 / 2 , then (A24) implies | r 2 α ζ | 2 ( β r 2 + α ξ ) which in turn results in:
r 2 α ( ζ + 2 ξ ) 1 2 β 4 d
using (A28), (A29), (A35) and β < 1 / 4 . In conclusion if x S β then r 2 4 d R 4 d .
Bounds on θ ¯ 0 . 
We proceed by showing that we only have to analyze the small angle case θ ¯ 0 0 and the large angle case θ ¯ 0 π . At least one of the following three cases must hold:
(1)
sin θ ¯ 0 16 β π d 4 : Then we have θ ¯ = O 1 ( 32 π β π d 4 ) or θ ¯ = π + O 1 ( 32 π β π d 4 ) as 32 π β π d 4 < 1 .
(2)
| r 2 α ζ | < β R : Then (A24), (A35) and β < 1 yield | ξ | 2 β π R . Using (A30), we then get θ ¯ = π + O 1 ( 2 β π 2 d 3 R ) .
(3)
sin θ ¯ 0 > 16 β π d 4 and | r 2 α ζ | β R : Then (A27) gives | r 2 α ζ | β M / sin θ ¯ 0 which used with (A24) leads to:
| α ξ | β R + | r 2 α ζ | β R + β R sin θ ¯ 0 2 β R sin θ ¯ 0 .
Then using (A35), the assumption on sin θ ¯ 0 and R 4 d we obtain ξ d 3 / 2 . The latter together with (A30) leads to θ ¯ 0 π / 2 . Finally as | r 2 α ζ | β R then (A25) leads to | sin θ ¯ 0 | β . Therefore as θ ¯ 0 π / 2 and β < 1 , we can conclude that θ ¯ 0 = π + O 1 ( 2 β ) .
Inspecting the three cases, and recalling that R 4 d , we can see that it suffices to analyze the small angle case θ ¯ 0 = O 1 ( 32 d 4 π β ) and the large angle case θ ¯ = π + O 1 ( 8 β π 2 d 4 ) .
Small angle case. 
We assume θ ¯ 0 = O 1 ( δ ) with δ = 32 d 4 π β and show that x 2 x 🟉 2 . We begin collecting some bounds. Since θ ¯ i θ ¯ 0 δ , then 1 ξ ( 1 δ / π ) d 1 + O 1 ( 2 d δ / π ) assuming δ d / π 1 / 2 , which holds true since 64 d 5 β < 1 . Moreover from (A29) we have ζ = O 1 ( d δ / π ) . Finally observe that cos θ ¯ 0 = 1 + O 1 ( θ ¯ 0 2 / 2 ) = 1 + O 1 ( δ / 2 ) for δ < 1 . We then have α = 1 + O 1 ( 2 d δ ) so that α ζ = O 1 ( d 2 δ ) and α ξ = 1 + O 1 ( 4 d 2 δ ) . We can therefore rewrite (A24) as:
( r 2 + O 1 ( d 2 δ ) ) ( 1 + O 1 ( δ / 2 ) ) ( 1 + O 1 ( 4 d 2 δ ) ) = O 1 ( β R ) .
Using the bound r 2 R 4 d and the definition of δ , we obtain:
r 2 1 = O 1 δ r 2 2 + d 2 δ + d 2 δ 2 2 + 4 d 2 δ + 4 d β = O 1 ( 8 d 2 δ + 4 d β ) = O 1 ( 258 π d 6 β )
Large angle case. 
Here we assume θ ¯ = π + O 1 ( δ ) with δ = 8 β π 2 d 4 and show that it must be x 2 ρ d 2 x 🟉 2 .
From (A33) we know that ξ = O 1 ( δ / π ) , while from (A34) we know that ζ = ρ d + O 1 ( 3 d 3 δ ) as long as 8 β π d 6 1 . Moreover for large angles and δ < 1 , it holds cos θ ¯ 0 = 1 + O 1 ( ( θ ¯ 0 π ) 2 / 2 ) = 1 + O 1 ( δ 2 / 2 ) . These bounds lead to:
α = ξ cos θ ¯ 0 + ζ = ρ d + O 1 δ π + δ 3 2 π + 3 d 3 δ = ρ d + O 1 ( 4 d 3 δ ) ,
and using ρ d d :
α ζ = ρ d 2 + O 1 ( 4 d 3 δ ρ d + 3 d 3 δ ρ d + 12 d 6 δ ) = ρ d 2 + O 1 ( 20 d 6 δ ) , α ξ = O 1 ( δ π ρ d + 4 d 3 δ 2 π ) = O 1 ( 2 d 3 δ ) .
Then recall that (A24) is equivalent to ( r 2 α ζ ) cos θ ¯ 0 α ξ = O 1 ( 4 β d ) , that is:
r 2 ρ d 2 + O 1 ( 20 d 6 δ ) 1 + O 1 ( δ 2 / 2 ) + O 1 ( 2 d 3 δ ) = O 1 ( 4 β d )
and in particular:
r 2 ρ d 2 = O 1 20 d 6 δ + 10 d 6 δ 3 + ρ d δ 2 2 + r 2 δ 2 2 + 2 d 3 δ + 4 β d = O 1 35 d 6 δ + 4 β d = O 1 ( 281 π 2 β d 10 )
where we used ρ d d , the definition of δ and δ < 1 .
Controlling the distance. 
We have shown that it is either θ ¯ 0 0 and x 2 x 🟉 2 or θ ¯ 0 π and x 2 ρ d 2 x 🟉 2 . We can therefore conclude that it must be either x x 🟉 or x ρ d x 🟉 .
Observe that if a two dimensional point is known to have magnitude within $\Delta r$ of some $r$ and is known to be within an angle $\Delta\theta$ from $0$, then its Euclidean distance to the point of coordinates $(r, 0)$ is no more than $\Delta r + (r + \Delta r) \Delta\theta$. Similarly we can write:
x x 🟉 | x x 🟉 | + ( x 🟉 + | x x 🟉 | ) θ ¯ 0 .
In the small angle case, by (A36), (A38), and x 🟉 | x x 🟉 | | x 2 x 🟉 2 | , we have:
x x 🟉 258 π d 6 β + ( 1 + 258 π d 6 β ) 32 d 4 π β 550 π d 10 β .
Next we notice that ρ 2 = 1 / π and ρ d ρ d 1 as follows from the definition and (A31), (A32). Then considering the large angle case and using (A37) we have:
| x ρ d | 281 π 2 β d 10 x + ρ d 281 π 3 β d 10 .
The latter, together with (A38), yields:
x + ρ d x 🟉 | x ρ d | + ( ρ d + | x ρ d | ) ( π θ ¯ 0 ) 281 π 3 β d 10 + ( d + 281 π 3 β d 10 ) 8 β π 2 d 4 284 π 3 β d 10
where in the second inequality we have used ρ d d and in the third 8 β π d 6 1 .
We conclude by noticing that ρ d 1 as d 1 as shown in [16] (Lemma 16). □
We next show that the values of the loss function in a neighborhood of $x^\star$ are strictly smaller than those in a neighborhood of $-\rho_d x^\star$.
Recall that f ( x ) : = 1 / 4 G ( x ) G ( x ) T G ( x 🟉 ) G ( x 🟉 ) T H F 2 , we next define the following loss functions:
f 0 ( x ) : = 1 4 G ( x ) G ( x ) T G ( x 🟉 ) G ( x 🟉 ) T F 2 , f H ( x ) : = f 0 ( x ) 1 2 G ( x ) G ( x ) T G ( x 🟉 ) G ( x 🟉 ) T , H F , f E ( x ) : = 1 2 2 d + 2 x 4 + x 🟉 4 2 x , h ˜ x 2 .
In particular notice that f ( x ) = f H ( x ) + 1 / 4 H F 2 . Below we show that assuming the WDC is satisfied, f 0 ( x ) concentrates around f E ( x ) .
Lemma A14.
Suppose that d 2 and the WDC holds with ϵ < 1 / ( 16 π d 2 ) 2 , then for all nonzero x , x 🟉 R k
| f 0 ( x ) f E ( x ) | 16 2 2 d ( x 4 + x 🟉 4 ) d 4 ϵ
Proof. 
Observe that:
| f 0 ( x ) f E ( x ) | 1 4 | G ( x ) 4 1 2 2 d x 4 | + 1 4 | G ( x 🟉 ) 4 1 2 2 d x 🟉 4 | + 1 2 | G ( x ) , G ( x 🟉 ) 2 1 2 2 d x , h ˜ x 2 | .
We analyze each term separately. The first term can be bounded as:
1 4 | G ( x ) 4 1 2 2 d x 4 | = 1 4 | G ( x ) 2 + 1 2 d x 2 | | G ( x ) 2 1 2 d x 2 | 1 4 1 2 d ( 13 12 + 1 ) x 2 | G ( x ) 2 1 2 d x 2 | 1 4 1 2 d ( 13 12 + 1 ) x 2 24 d 3 ϵ 2 d x 2 1 2 2 d 13 d 3 ϵ x 4
where in the first inequality we used (A6) and in the second inequality (A5). Similarly we can bound the second term:
1 4 | G ( x 🟉 ) 4 1 2 2 d x 🟉 4 | 1 2 2 d 13 d 3 ϵ x 🟉 4 .
We next note that h ˜ x ( 1 + d / π ) x 🟉 and therefore from (A6) and d 2 we obtain:
| G ( x ) G ( x 🟉 ) + x h ˜ x 2 d | 1 2 d ( 13 12 + 1 + d π ) x x 🟉 1 2 d 3 2 d x x 🟉
We can then conclude that:
1 2 | G ( x ) , G ( x 🟉 ) 2 x , h ˜ x 2 d 2 | 1 2 | x , Λ x T Λ x 🟉 x 🟉 h ˜ x | | G ( x ) G ( x 🟉 ) + x h ˜ x | 1 2 x 24 d 3 ϵ 2 d x 🟉 1 2 d 3 2 d x x 🟉 9 2 2 d d 4 ϵ ( x 🟉 4 + x 4 ) .
We next consider the loss f E and show that in a neighborhood of − ρ d x 🟉 this loss function has larger values than in a neighborhood of x 🟉 .
Lemma A15.
Fix 0 < a ≤ 1 / ( 2 π 3 d 3 ) and ϕ d ∈ [ ρ d , 1 ] ; then:
f E ( x ) ≤ 1 2 2 d + 2 x 🟉 4 + 1 2 2 d + 2 [ ( a + ϕ d ) 4 − 2 ϕ d 2 + 2 π d a ] x 🟉 4 for all x ∈ B ( ϕ d x 🟉 , a x 🟉 ) , and f E ( x ) ≥ 1 2 2 d + 2 x 🟉 4 + 1 2 2 d + 2 [ ( a − ϕ d ) 4 − 2 ρ d 2 ϕ d 2 − 40 π d 3 a ] x 🟉 4 for all x ∈ B ( − ϕ d x 🟉 , a x 🟉 ) .
Proof. 
Let x ∈ B ( ϕ d x 🟉 , a x 🟉 ) and observe that 0 ≤ θ ¯ i ≤ θ ¯ 0 ≤ π a / 2 ϕ d and ( ϕ d − a ) x 🟉 ≤ x ≤ ( a + ϕ d ) x 🟉 . Then:
x , h ˜ d 2 d = 1 2 d ( i = 0 d 1 π θ ¯ i π ) x 🟉 x cos θ ¯ 0 + 1 2 d i = 0 d 1 sin θ ¯ i π j = i + 1 d 1 π θ ¯ j π x 🟉 x 1 2 d ( i = 0 d 1 π π a 2 ϕ d π ) ( ϕ d a ) x 🟉 2 ( 1 π 2 a 2 8 ϕ d 2 ) 1 2 d ( 1 d a ϕ d ) ( ϕ d a ) ( 1 π 2 a 2 8 ϕ d 2 ) x 🟉 2 .
using cos θ 1 θ 2 / 2 and ( 1 x ) d ( 1 2 d x ) as long as 0 x 1 . We can therefore write:
f E ( x ) x 🟉 4 2 2 d + 2 1 2 2 d + 2 x 4 1 2 2 d + 1 ( 1 d a ϕ d ) 2 ( ϕ d a ) 2 ( 1 π 2 a 2 8 ϕ d 2 ) 2 x 🟉 4 1 2 2 d + 2 [ ( ϕ d + a ) 4 2 ( 1 2 d a ϕ d ) ( ϕ d a ) 2 ( 1 π 2 a 2 4 ϕ d 2 ) ] x 🟉 4
where in the second inequality we used ( 1 x ) 2 1 2 x for all x R . We then observe that:
( 1 2 d a ϕ d ) ( ϕ d a ) 2 ( 1 π 2 a 2 4 ϕ d 2 ) ( 1 π 2 a 2 4 ϕ d 2 2 a d ϕ d ) ϕ d 2 + a ( a 2 ϕ d ) ( 1 2 d a ϕ d ) ( 1 π 2 a 2 4 ϕ d 2 ) ϕ d 2 a ( 1 2 π d 3 + 2 d ϕ d ) + a ( a 2 ϕ d ) ( 1 2 d a ϕ d ) ( 1 π 2 a 2 4 ϕ d 2 ) ϕ d 2 a ( 1 2 π d 3 + 2 d ϕ d + 2 ϕ d ) ϕ d 2 π d a ,
where in the second inequality we have used π 3 d 3 a 2 and in the last one d 2 and ϕ d 1 . We can then conclude that:
f E ( x ) x 🟉 4 2 2 d + 2 1 2 2 d + 2 ( ϕ d + a ) 4 2 ( ϕ d 2 π d a ) x 🟉 4
We next take x ∈ B ( − ϕ d x 🟉 , a x 🟉 ) , which implies 0 ≤ π − θ ¯ 0 ≤ π 2 a / 2 = : δ and x ≤ ( a + ϕ d ) x 🟉 . We then note that for ξ and ζ as defined in (A23) we have:
| x T h ˜ x | 2 ( | ξ | + | ζ | ) 2 ( a + ϕ d ) 2 x 🟉 4 ( δ π + 3 d 3 δ + ρ d ) 2 ( a + ϕ d ) 2 x 🟉 4 ( π 3 d 3 2 a + ρ d ) 2 ( a + ϕ ) 2 x 🟉 4 ( 2 π 3 d 3 a + ρ d 2 ) ( a + ϕ d ) 2 x 🟉 4 20 π d 3 a + ρ d 2 ϕ d 2
where the second inequality is due to (A33) and (A34), the rest from d 2 , ρ d ϕ d 1 and 2 π 3 d 3 a 1 . Finally using ( ϕ d a ) x 🟉 x , we can then conclude that:
f E ( x ) x 🟉 4 2 2 d + 2 1 2 2 d + 2 ( ϕ d a ) 4 2 ( 20 π d 3 a + ρ d 2 ϕ d 2 ) x 🟉 4 .
The above two lemmas are now used to prove Lemma A8.
Proof of Lemma A8. 
Let x ∈ B ( ± ϕ d x 🟉 , φ x 🟉 ) for some 0 < φ < 1 that will be specified below, and observe that by the assumptions on the noise:
| G ( x ) G ( x ) T G ( x 🟉 ) G ( x 🟉 ) T , H F | | G ( x ) T H G ( x ) | + | G ( x 🟉 ) T H G ( x 🟉 ) | ω 2 d ( x 2 + x 🟉 2 ) ω 2 d ( ( ϕ d + φ ) 2 + 1 ) x 🟉 2 ,
and therefore by Lemma A14:
| f 0 ( x ) f E ( x ) | + 1 2 | G ( x ) G ( x ) T G ( x 🟉 ) G ( x 🟉 ) T , H F | 16 2 2 d ( ( ϕ d + φ ) 4 + 1 ) x 🟉 4 d 4 ϵ + ω 2 d ( ( ϕ d + φ ) 2 + 1 ) x 🟉 2 272 2 2 d x 🟉 4 d 4 ϵ + ω 2 d ( ( ϕ d + φ ) 2 + 1 ) x 🟉 2
We next take φ = ϵ and x ∈ B ( ϕ d x 🟉 , φ x 🟉 ) , so that by Lemma A15 and the assumption 2 d d 44 ω ≤ K 2 x 🟉 2 , we have:
f H ( x ) f E ( x ) + | f 0 ( x ) f E ( x ) | + 1 2 | G ( x ) G ( x ) T G ( x 🟉 ) G ( x 🟉 ) T , H F | 1 2 2 d + 2 1 + ( ϵ + ϕ d ) 4 2 ϕ d 2 + 2 π d ϵ x 🟉 4 + 272 d 4 ϵ x 🟉 4 + ω 2 d + 1 ( 2 + 2 ϵ + ϵ 2 ) x 🟉 2 1 2 2 d + 2 1 2 ϕ d 2 + ( ϵ + ϕ d ) 4 x 🟉 4 + 1 2 2 d 3 2 2 d x 🟉 2 ω + π d 2 + 272 d 4 ϵ x 🟉 4 + ω 2 d x 🟉 2 1 2 2 d + 2 1 2 ϕ d 2 + ( ϵ + ϕ d ) 4 x 🟉 4 + 1 2 2 d 3 2 K 2 d 44 + π d 2 + 272 d 4 ϵ x 🟉 4 + K 2 x 🟉 4 2 2 d d 44 .
Similarly, if y ∈ B ( − ϕ d x 🟉 , φ x 🟉 ) with φ = ϵ , we obtain:
f H ( y ) f E ( y ) | f 0 ( y ) f E ( y ) | 1 2 | G ( y ) G ( y ) T G ( x 🟉 ) G ( x 🟉 ) T , H | 1 2 2 d + 2 1 2 ϕ d 2 ρ d 2 + ( ϵ ϕ d ) 4 x 🟉 4 1 2 2 d 3 2 2 d x 🟉 2 ω + 10 π d 3 + 272 d 4 ϵ x 🟉 4 ω 2 d x 🟉 2 1 2 2 d + 2 1 2 ϕ d 2 ρ d 2 + ( ϵ ϕ d ) 4 x 🟉 4 1 2 2 d 3 2 K 2 d 44 + 10 π d 3 + 272 d 4 ϵ x 🟉 4 K 2 x 🟉 4 2 2 d d 44 .
In order to guarantee that f ( y ) > f ( x ) , it suffices to have:
2 ( 1 ρ d 2 ) ϕ d 2 8 K 2 d 44 > 4 C d ϵ
with C d : = ( 544 d 4 + 10 π d 3 π + 3 K 2 d 44 + π d / 2 + 1 / 100 ) , that is to require:
φ = ϵ < ( 1 ρ d 2 ) ϕ d 2 4 K 2 d 44 2 C d 2 .
Finally, notice that by Lemma 17 in [16] it holds that 1 ρ d ( K ( d + 2 ) ) 2 for some numerical constant K ; we therefore choose ϕ = ϱ / d 12 for some ϱ > 0 small enough. □

Appendix B.5. Supplementary Proofs for Appendix A.6

In this section we use strong convexity and smoothness to prove convergence to x 🟉 up to an error determined by the noise level ω . The idea is to show that every vector in the subgradient points in the direction of ( x − x 🟉 ) . Recall that the gradient in the noiseless case was:
v ¯ x = Λ x T [ Λ x x x T Λ x T − Λ x 🟉 x 🟉 x 🟉 T Λ x 🟉 T ] Λ x x .
We show that, by continuity of Λ x , when x is close to x 🟉 , v ¯ x is close to:
v ¯ ¯ x : = Λ x T Λ x [ x x T − x 🟉 x 🟉 T ] Λ x T Λ x x ,
which in turn concentrates around:
v ˇ x : = 1 2 2 d [ x x T − x 🟉 x 🟉 T ] x
by the WDC.
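The following sketch makes the three quantities concrete for a small random network. The construction of Λ x from the ReLU activation pattern and the chosen dimensions are our own illustrative assumptions, and the printed discrepancy is only expected to be small insofar as the WDC approximation is accurate at this size.
```python
import numpy as np

rng = np.random.default_rng(1)
k, dims, d = 2, [20, 60], 2
W, n_prev = [], k
for n in dims:
    W.append(rng.normal(0.0, 1.0 / np.sqrt(n), (n, n_prev)))
    n_prev = n

def Lambda(x):
    # product of the W_i with rows zeroed by the ReLU pattern at x, so that G(x) = Lambda(x) @ x
    M, h = np.eye(k), x
    for Wi in W:
        pre = Wi @ h
        M = np.diag((pre > 0).astype(float)) @ Wi @ M
        h = np.maximum(pre, 0.0)
    return M

x_star = np.array([1.0, 1.0])
x = x_star + 0.05 * rng.normal(size=k)

Lx, Ls = Lambda(x), Lambda(x_star)
v_bar = Lx.T @ (Lx @ np.outer(x, x) @ Lx.T - Ls @ np.outer(x_star, x_star) @ Ls.T) @ Lx @ x
v_check = (np.outer(x, x) - np.outer(x_star, x_star)) @ x / 2 ** (2 * d)
print(np.linalg.norm(v_bar - v_check))   # small when the WDC approximations are accurate
```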
We begin by recalling the following result which can be found in the proof of Lemma 22 of [16].
Lemma A16.
Suppose x B ( x 🟉 , d ϵ x 🟉 ) and the WDC holds with ϵ < 1 / ( 200 4 d 6 ) . Then it holds that:
‖ Λ x T Λ x x 🟉 − Λ x T Λ x 🟉 x 🟉 ‖ ≤ 1 16 1 2 d ‖ x − x 🟉 ‖ .
We now prove that v ¯ x ≈ v ¯ ¯ x for x close to x 🟉 .
Lemma A17.
Suppose x B ( x 🟉 , d ϵ x 🟉 ) and the WDC holds with ϵ < 1 / ( 200 4 d 6 ) . Then it holds that:
v ¯ x v ¯ ¯ x 13 96 1 2 2 d x x 🟉 x x 🟉
Proof. 
Let g x , 🟉 : = Λ x T Λ x x 🟉 and q x , 🟉 : = Λ x T Λ x 🟉 x 🟉 . Then observe that:
v ¯ x v ¯ ¯ x = g x , 🟉 , x g x , 🟉 q x , 🟉 , x q x , 🟉 ( x g x , 🟉 + Λ x x Λ x 🟉 x 🟉 ) g x , 🟉 q x , 🟉 13 6 1 2 d x x 🟉 g x , 🟉 q x , 🟉 13 96 1 2 2 d x x 🟉 x x 🟉
where the second inequality follows from Lemma A1, and the third from Lemma A16. □
We next prove that, by the WDC, v ¯ ¯ x ≈ v ˇ x .
Lemma A18.
Suppose the WDC holds with ϵ < 1 / ( 16 π d 2 ) 2 . Then for all nonzero x , x 🟉 :
v ¯ ¯ x v ˇ x 25 12 4 ϵ d 2 2 d x x x 🟉 x + x 🟉
Proof. 
For notational convenience we define E d : = I k / 2 d , the scaled identity on R k , Q x : = Λ x T Λ x , and M x , 🟉 : = x x T − x 🟉 x 🟉 T . Next observe that:
v ¯ ¯ x v ˇ x ( Q x E d ) M x , 🟉 Q x x + E d M x , 🟉 ( Q x E d ) x , 25 12 1 2 d Q x E d M x , 🟉 x | 25 12 4 ϵ d 2 2 d M x , 🟉 x
where the second inequality follows from Lemma A1 and the third from (17) in [15]. We conclude by noticing that M x , 🟉 x x 🟉 x + x 🟉 . □
Next consider the quartic function f ˇ ( x ) : = 1 / 2 2 d + 2 ‖ x x T − x 🟉 x 🟉 T ‖ F 2 and observe that:
∇ f ˇ ( x ) = 1 2 2 d ( x x T − x 🟉 x 🟉 T ) x = v ˇ x , ∇ 2 f ˇ ( x ) = 1 2 2 d ( ‖ x ‖ 2 I k + 2 x x T − x 🟉 x 🟉 T ) .
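For the reader's convenience, here is the direct expansion behind these two identities, written as a sketch in standard LaTeX notation (x_\star stands for x 🟉):
```latex
\check f(x) = \frac{1}{2^{2d+2}}\Big( \|x\|^4 - 2\langle x, x_\star\rangle^2 + \|x_\star\|^4 \Big),
\qquad
\nabla \check f(x) = \frac{1}{2^{2d+2}}\Big( 4\|x\|^2 x - 4\langle x, x_\star\rangle x_\star \Big)
                   = \frac{1}{2^{2d}}\big( x x^T - x_\star x_\star^T \big) x,
\qquad
\nabla^2 \check f(x) = \frac{1}{2^{2d}}\Big( \|x\|^2 I_k + 2 x x^T - x_\star x_\star^T \Big).
```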
Following [63], we show that f ˇ is smooth and strongly convex in a neighborhood of x 🟉 , and in turn derive the bound on v ˇ x stated in Lemma A19.
Lemma A19.
Assume x x 🟉 γ x 🟉 with γ < 1 / 5 . Take τ 3 ( 1 + γ ) 2 + 1 , then:
v ˇ x τ 2 2 d x 🟉 2 ( x x 🟉 ) τ ( τ α ) x x 🟉 x 🟉 2 2 2 d
where α = 2 − 9 γ .
Proof. 
Let x x 🟉 γ x 🟉 with γ < 1 / 5 , then:
( 1 γ ) x 🟉 x ( 1 + γ ) x 🟉
Δ x γ ( 1 + γ ) x 🟉 2
since Δ = x x 🟉 satisfies Δ x . Using (A40) we then obtain
x x T = x 🟉 x 🟉 T + Δ x T + x Δ T Δ Δ T , x 🟉 x 🟉 T 3 Δ x I k , x 🟉 x 🟉 T 3 γ ( 1 + γ ) x 🟉 2 I k .
We therefore have:
2 2 d 2 f ˇ ( x ) = x 2 I k + 2 x x T x 🟉 x 🟉 T x 2 I k + x 🟉 x 🟉 T 6 γ ( 1 + γ ) x 🟉 2 I k , ( ( 1 γ ) 2 + 1 6 γ ( 1 + γ ) ) x 🟉 2 I k , ( 2 9 γ ) x 🟉 2 I k ,
where in the second line we have used (A39) and in the third 0 < γ < 1 / 5 .
Next notice that using (A39) we have
2 2 d 2 f ˇ ( x ) 3 x 2 + x 🟉 2 ( 3 ( 1 + γ ) 2 + 1 ) x 🟉 2
and therefore, for all x B ( x 🟉 , γ x 🟉 )
( 2 9 γ ) x 🟉 2 I k 2 2 d 2 f ˇ ( x ) ( 3 ( 1 + γ ) 2 + 1 ) x 🟉 2 .
The above bounds imply in particular that for all x ∈ B ( x 🟉 , γ x 🟉 ) the gradient of f ˇ satisfies the regularity condition (see [63]):
2 f ˇ ( x ) , x x 🟉 μ f ˇ ( x ) 2 + λ x x 🟉 2 ,
where λ = ( 2 − 9 γ ) x 🟉 2 / 2 2 d and 1 / μ = ( 3 ( 1 + γ ) 2 + 1 ) x 🟉 2 / 2 2 d . By (A41) we can then conclude that for σ ≥ 1 / μ :
f ˇ ( x ) σ ( x x 🟉 ) 2 f ˇ ( x ) 2 + σ 2 x x 🟉 2 2 σ f ˇ ( x ) , x x 🟉 ( 1 σ μ ) f ˇ ( x ) 2 + σ ( σ λ ) x x 🟉 2 σ ( σ λ ) x x 🟉 2
Finally, letting σ = τ x 🟉 2 / 2 2 d and α = 2 − 9 γ , the claim follows from the assumptions on γ and τ . □
We can finally prove Lemma A9.
Proof of Lemma A9. 
Let x ∈ B ( x 🟉 , d ϵ x 🟉 ) ; then x + x 🟉 ≤ ( 2 + d ϵ ) x 🟉 and, since ϵ < 1 / ( 200 4 d 6 ) , we have from Lemmas A17 and A18:
v ¯ x v ˇ x v ¯ x v ¯ ¯ x + v ¯ ¯ x v ˇ x 1 2 2 d 13 48 x x 🟉 x x 🟉 .
If x B ( x 🟉 , d ϵ x 🟉 ) is a differentiable point of f, then by Lemma A19 and the assumption (13) on the noise H, it holds that:
v ˜ x τ 2 2 d x 🟉 2 ( x x 🟉 ) 13 48 ( 1 + γ ) + τ ( τ α ) x 🟉 2 2 2 d x x 🟉 + ω 2 d ( 1 + γ ) x 🟉 .
where γ = d ϵ , α = 2 9 γ and τ 3 ( 1 + γ ) 2 + 1 .
In general, if x ∈ B ( x 🟉 , γ x 🟉 ) and v x ∈ ∂ f ( x ) , then by the characterization of the Clarke subdifferential (9) and the previous results:
v x τ 2 2 d x 🟉 2 ( x x 🟉 ) t = 1 T c t v t τ 2 2 d x 🟉 2 ( x x 🟉 ) 13 48 ( 1 + γ ) + τ ( τ α ) x 🟉 2 2 2 d x x 🟉 + ω 2 d ( 1 + γ ) x 🟉 .
We finally obtain the claim by noting that, by the assumptions on ϵ , it holds that γ = d ϵ < 1 / ( 400 ) 2 , and by taking τ = τ 1 = 21 / 5 , τ 2 = 17 / 5 and τ 3 = ( 1 + 1 / ( 400 ) 2 ) . □

Appendix C. Proof of Theorem 1

By Theorem 3 it suffices to show that with high probability the weight matrices { W i } i = 1 d satisfy the WDC, and that ω / 2 d upper bounds the spectral norm of the noise term Λ x T H Λ x where
  • in the Spiked Wishart Model, H = Σ N − Σ , where Σ N = Y T Y / N and Σ = y 🟉 y 🟉 T + σ 2 I n ;
  • in the Spiked Wigner Model, H = ν H , where H ∼ GOE ( n ) .
Regarding the Weight Distribution Condition (WDC), we observe that it was initially proposed in [15], where it was shown to hold with high probability for networks with random Gaussian weights under an expansivity condition on their dimensions. It was later shown in [62] that a less restrictive expansivity rate is sufficient.
Lemma A20
(Theorem 3.2 in [62]). There are constants C , c > 0 with the following property. Let 0 < ϵ < 1 and suppose W R n × k has i . i . d . N ( 0 , 1 / n ) entries. Suppose that n c k ϵ 2 log ( 1 / ϵ ) . Then with probability at least 1 exp ( C k ) , W satisfies the WDC with constant ϵ.
By a union bound over all layers, the above result allows us to conclude that the WDC holds simultaneously for all layers of the network with probability at least 1 − ∑ i = 1 d e − C n i − 1 . Note in particular that this argument does not require the independence of the weight matrices { W i } i = 1 d .
By Lemma A20, with high probability the random generative network G satisfies the WDC. Therefore if we can guarantee that the assumptions on the noise term H are satisfied, then the proof of the main Theorem 1 follows from the deterministic Theorem 3 and Lemma A20.
Before turning to the bounds on the noise terms in the spiked models, we recall the following lemma, which bounds the number of possible Λ x for x ≠ 0 . Note that this is related to the number of linear regions defined by a deep ReLU network.
Lemma A21.
Consider a network G as defined in (3) with d ≥ 2 and weight matrices W i ∈ R n i × n i − 1 with i.i.d. entries N ( 0 , 1 / n i ) . Then, with probability one, the number of different matrices Λ x over x ≠ 0 satisfies:
| { Λ x | x 0 } | 10 d 2 ( n 1 d n 2 d 1 n d ) k ( n 1 d n 2 d 1 n d ) 8 k
Proof. 
The first inequality follows from Lemma 16 and the proof of Lemma 17 in [15]. For the second inequality notice that as k 1 , n 1 > k and d 2 it follows that 7 k log ( n 1 d n 2 d 1 n d ) 7 k d ( d + 1 ) log ( n 1 ) / 2 3.5 d ( d + 1 ) log ( 2 ) d 2 log ( 10 ) .
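The counting argument can be illustrated empirically: each distinct ReLU activation pattern determines one matrix Λ x , and for a small random network the number of patterns seen over many random inputs is far below the worst case. The network sizes and the sampling scheme in the sketch below are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(3)
k, n1, n2 = 2, 10, 20
W1 = rng.normal(0.0, 1.0 / np.sqrt(n1), (n1, k))
W2 = rng.normal(0.0, 1.0 / np.sqrt(n2), (n2, n1))

def pattern(x):
    # the sign pattern of the pre-activations determines Lambda_x
    h1 = W1 @ x
    h2 = W2 @ np.maximum(h1, 0.0)
    return tuple(h1 > 0) + tuple(h2 > 0)

patterns = {pattern(rng.normal(size=k)) for _ in range(50000)}
print(len(patterns))   # typically far smaller than 2 ** (n1 + n2)
```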
In the next section we use this lemma to control the noise term Λ x T H Λ x on the event where the WDC holds.

Appendix C.1. Spiked Wigner Model

Recall that in the Wigner model Y = G ( x 🟉 ) G ( x 🟉 ) T + ν H and the symmetric noise matrix H follows a Gaussian Orthogonal Ensemble GOE ( n ) , that is H i i N ( 0 , 2 / n ) for all 1 i n and H j i = H i j N ( 0 , 1 / n ) for 1 j < i n . Our goal is to bound Λ x T H Λ x uniformly over x with high probability.
Fix x R k , and let N 1 / 4 be a 1 / 4 -net on the sphere S k 1 such that (see for example [64]) | N 1 / 4 | 9 k and
Λ x T H Λ x 2 max z N 1 / 4 | Λ x T H Λ x z , z | .
Observe next that for any v R n by the definition of GOE(n) it holds that
v T H v = i n H i i v i 2 + 2 i < j n H i j v i v j N ( 0 , 2 ( i v i 4 + 2 i < j v i 2 v j 2 ) / n ) = N ( 0 , 2 v 4 / n ) .
Therefore for any z N 1 / 4 let x , z : = Λ x z R n , then x , z T H x , z N ( 0 , 2 x , z 4 / n ) . In particular by (A6), the quadratic form x , z T H x , z is sub-Gaussian with parameter γ 2 given by
γ 2 : = 2 n 13 12 2 1 2 2 d .
Then for fixed x R k , standard sub-Gaussian tail bounds (e.g., [64]) and a union bound over N 1 / 4 give for any u 0
P Λ x T H Λ x 2 u P [ max z N 1 / 4 x , z T H x , z u ] z N 1 / 4 P x , z T H x , z u 2 · 9 k e u 2 2 γ 2 .
Lemma A21 then ensures that the number of possible Λ x is at most ( n 1 d n 2 d 1 n d ) 8 k , so a union bound over this set allows us to conclude that
P Λ x T H Λ x 2 u , for all x 1 ( n 1 d n 2 d 1 n d ) 8 k P Λ x T H Λ x 2 u 1 2 exp 8 k log ( 2 n 1 d n 2 d 1 n d ) u 2 / ( 2 γ 2 )
Choosing then u = 2 γ 2 · 9 k log ( 2 n 1 d n 2 d 1 n d ) and substituting the definition of γ 2 , we obtain
P [ Λ x T H Λ x 1 2 d 169 k log ( 2 n 1 d n 2 d 1 n d ) n , for all x ] 1 2 e k log ( 2 n 1 d n 2 d 1 n d )
which implies the claim since n = n d and log ( n ) ≤ log ( 2 n 1 d n 2 d 1 n d ) .
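As an optional numerical sanity check of the GOE quadratic-form variance used in this argument, the following sketch (our own construction of GOE ( n ) with illustrative sizes) compares the empirical variance of v T H v with 2 ‖ v ‖ 4 / n .
```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 200, 5000
v = rng.normal(size=n)
v /= np.linalg.norm(v)                    # unit vector, so 2 ||v||^4 / n = 2 / n

samples = np.empty(trials)
for t in range(trials):
    A = rng.normal(0.0, 1.0 / np.sqrt(n), (n, n))
    H = (A + A.T) / np.sqrt(2.0)          # H_ij ~ N(0, 1/n) off-diagonal, H_ii ~ N(0, 2/n)
    samples[t] = v @ H @ v

print(samples.var(), 2.0 / n)             # the two values should be close
```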

Appendix C.2. Spiked Wishart Model

The rows { y i } i = 1 N of the matrix Y in (1) can be seen as i.i.d. samples from N ( 0 , Σ ) where Σ = y 🟉 y 🟉 T + σ 2 I n . In the minimization problem (4) we take M = Σ N − σ 2 I n where Σ N is the empirical covariance matrix Y T Y / N . The symmetric noise matrix H is then given by H = Σ N − Σ , and by the Law of Large Numbers H → 0 as N → ∞ . We bound Λ x T H Λ x with high probability uniformly over x ∈ R k .
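A small simulation illustrates the setup (the dimensions, noise level and the choice of y 🟉 as an arbitrary unit vector rather than a network output are illustrative assumptions): the spectral norm of H = Σ N − Σ shrinks roughly like 1 / N as the number of samples grows.
```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma = 100, 0.5
y_star = rng.normal(size=n)
y_star /= np.linalg.norm(y_star)
Sigma = np.outer(y_star, y_star) + sigma ** 2 * np.eye(n)

for N in (200, 800, 3200):
    Y = rng.multivariate_normal(np.zeros(n), Sigma, size=N)   # rows y_i ~ N(0, Sigma)
    Sigma_N = Y.T @ Y / N
    # spectral norm of the noise H = Sigma_N - Sigma
    print(N, np.linalg.norm(Sigma_N - Sigma, 2))
```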
Fix x R k , let N 1 / 4 be a 1 / 4 -net on the sphere S k 1 such that | N 1 / 4 | 9 k , and notice that:
Λ x T H Λ x 2 max z N 1 / 4 | z T Λ x T H Λ x z | .
By a union bound on N 1 / 4 we obtain for any fixed z N 1 / 4 :
P Λ x T H Λ x 2 u 9 k P | z T Λ x T H Λ x z | u .
Let x , z : = Λ x z and s i : = x , z T y i , so that we can write
z T Λ x T H Λ x z = 1 N i = 1 N ( x T y i ) 2 E [ ( x T y i ) 2 ] = 1 N i = 1 N s i 2 E [ s i 2 ]
and in particular
P Λ x T H Λ x 2 u 9 k P | z T Λ x T H Λ x z | u = 9 k P 1 N | i = 1 N s i 2 E [ s i 2 ] | u
Observe then that s i N ( 0 , γ 2 ) with γ 2 = x , z T Σ x , z . It follows for u [ 0 , γ 2 ] by the small deviation bound for χ 2 random variables (e.g., [33] (Example 2.11))
P Λ x T H Λ x 2 u 2 exp 2 k log 3 N u 2 8 γ 4
Recall now that | { Λ x | x ≠ 0 } | ≤ ( n 1 d n 2 d 1 n d ) 8 k . Then, proceeding as in the Wigner case, a union bound over all possible Λ x gives:
P Λ x T H Λ x 2 u , for all x 1 ( n 1 d n 2 d 1 n d ) 8 k P Λ x T H Λ x 2 u 1 2 exp ( 8 k log ( 2 n 1 d n 2 d 1 n d ) N u 2 8 γ 4 ) .
Substituting u = 8 γ 4 · 9 k log ( 2 n 1 d n 2 d 1 n d ) / N we find that:
P [ Λ x T H Λ x 2 72 k log ( 2 n 1 d n 2 d 1 n d ) N γ 2 , for all x ] 1 2 e k log ( n )
since log ( n ) log ( 2 n 1 d n 2 d 1 n d ) .
Similarly, if u > γ 2 , then by large deviation bounds for sub-exponential random variables:
P Λ x T H Λ x 2 u , for all x 1 2 exp ( 8 k log ( 2 n 1 d n 2 d 1 n d ) N u 8 γ 2 ) .
Substituting u = 8 γ 2 · 9 k log ( 2 n 1 d n 2 d 1 n d ) / N we find that:
P [ Λ x T H Λ x 2 72 k log ( 3 n 1 d n 2 d 1 n d ) N γ 2 , for all x ] 1 2 e k log ( n )
Finally, observe that using (A6) to bound Λ x 2 (by the WDC) and σ 2 + y 🟉 2 to bound Σ , we have:
γ 2 13 12 1 2 d ( y 🟉 2 + σ 2 ) ,
which combined with (A42) and (A43) implies the thesis.

References

  1. Johnstone, I.M. On the distribution of the largest eigenvalue in principal components analysis. Ann. Stat. 2001, 29, 295–327.
  2. Amini, A.A.; Wainwright, M.J. High-dimensional analysis of semidefinite relaxations for sparse principal components. In Proceedings of the 2008 IEEE International Symposium on Information Theory, Toronto, ON, Canada, 6–11 July 2008; pp. 2454–2458.
  3. Deshpande, Y.; Montanari, A. Sparse PCA via covariance thresholding. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 8–13 December 2014; pp. 334–342.
  4. Vu, V.; Lei, J. Minimax rates of estimation for sparse PCA in high dimensions. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, La Palma, Canary Islands, Spain, 21–23 April 2012; pp. 1278–1286.
  5. Abbe, E.; Bandeira, A.S.; Bracher, A.; Singer, A. Decoding binary node labels from censored edge measurements: Phase transition and efficient recovery. IEEE Trans. Netw. Sci. Eng. 2014, 1, 10–22.
  6. Bandeira, A.S.; Chen, Y.; Lederman, R.R.; Singer, A. Non-unique games over compact groups and orientation estimation in cryo-EM. Inverse Probl. 2020, 36, 064002.
  7. Javanmard, A.; Montanari, A.; Ricci-Tersenghi, F. Phase transitions in semidefinite relaxations. Proc. Natl. Acad. Sci. USA 2016, 113, E2218–E2223.
  8. McSherry, F. Spectral partitioning of random graphs. In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, Newport Beach, CA, USA, 8–11 October 2001; pp. 529–537.
  9. Deshpande, Y.; Abbe, E.; Montanari, A. Asymptotic mutual information for the binary stochastic block model. In Proceedings of the 2016 IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, 10–15 July 2016; pp. 185–189.
  10. Moore, C. The computer science and physics of community detection: Landscapes, phase transitions, and hardness. arXiv 2017, arXiv:1702.00467.
  11. D’Aspremont, A.; Ghaoui, L.; Jordan, M.; Lanckriet, G. A direct formulation for sparse PCA using semidefinite programming. Adv. Neural Inf. Process. Syst. 2004, 17, 41–48.
  12. Berthet, Q.; Rigollet, P. Optimal detection of sparse principal components in high dimension. Ann. Stat. 2013, 41, 1780–1815.
  13. Bandeira, A.S.; Perry, A.; Wein, A.S. Notes on computational-to-statistical gaps: Predictions using statistical physics. arXiv 2018, arXiv:1803.11132.
  14. Kunisky, D.; Wein, A.S.; Bandeira, A.S. Notes on computational hardness of hypothesis testing: Predictions using the low-degree likelihood ratio. arXiv 2019, arXiv:1907.11636.
  15. Hand, P.; Voroninski, V. Global guarantees for enforcing deep generative priors by empirical risk. IEEE Trans. Inf. Theory 2019, 66, 401–418.
  16. Heckel, R.; Huang, W.; Hand, P.; Voroninski, V. Rate-optimal denoising with deep neural networks. arXiv 2018, arXiv:1805.08855.
  17. Hand, P.; Leong, O.; Voroninski, V. Phase retrieval under a generative prior. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; pp. 9136–9146.
  18. Ma, F.; Ayaz, U.; Karaman, S. Invertibility of convolutional generative networks from partial measurements. Adv. Neural Inf. Process. Syst. 2018, 31, 9628–9637.
  19. Hand, P.; Joshi, B. Global Guarantees for Blind Demodulation with Generative Priors. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 11535–11543.
  20. Song, G.; Fan, Z.; Lafferty, J. Surfing: Iterative optimization over incrementally trained deep networks. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 15034–15043.
  21. Bora, A.; Jalal, A.; Price, E.; Dimakis, A.G. Compressed sensing using generative models. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 537–546.
  22. Asim, M.; Shamshad, F.; Ahmed, A. Blind Image Deconvolution using Deep Generative Priors. arXiv 2019, arXiv:cs.CV/1802.04073.
  23. Hand, P.; Leong, O.; Voroninski, V. Compressive Phase Retrieval: Optimal Sample Complexity with Deep Generative Priors. arXiv 2020, arXiv:2008.10579.
  24. Hand, P.; Voroninski, V. Compressed sensing from phaseless gaussian measurements via linear programming in the natural parameter space. arXiv 2016, arXiv:1611.05985.
  25. Li, X.; Voroninski, V. Sparse signal recovery from quadratic measurements via convex programming. SIAM J. Math. Anal. 2013, 45, 3019–3033.
  26. Ohlsson, H.; Yang, A.Y.; Dong, R.; Sastry, S.S. Compressive phase retrieval from squared output measurements via semidefinite programming. arXiv 2011, arXiv:1111.6323.
  27. Cai, T.; Li, X.; Ma, Z. Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow. Ann. Stat. 2016, 44, 2221–2251.
  28. Wang, G.; Zhang, L.; Giannakis, G.B.; Akçakaya, M.; Chen, J. Sparse phase retrieval via truncated amplitude flow. IEEE Trans. Signal Process. 2017, 66, 479–491.
  29. Yuan, Z.; Wang, H.; Wang, Q. Phase retrieval via sparse wirtinger flow. J. Comput. Appl. Math. 2019, 355, 162–173.
  30. Aubin, B.; Loureiro, B.; Maillard, A.; Krzakala, F.; Zdeborová, L. The spiked matrix model with generative priors. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 8366–8377.
  31. Cocola, J.; Hand, P.; Voroninski, V. Nonasymptotic Guarantees for Spiked Matrix Recovery with Generative Priors. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 11 December 2020; Volume 33.
  32. Johnstone, I.M.; Lu, A.Y. On consistency and sparsity for principal components analysis in high dimensions. J. Am. Stat. Assoc. 2009, 104, 682–693.
  33. Wainwright, M.J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint; Cambridge University Press: Cambridge, UK, 2019; Volume 48.
  34. Montanari, A.; Richard, E. Non-negative principal component analysis: Message passing algorithms and sharp asymptotics. IEEE Trans. Inf. Theory 2015, 62, 1458–1484.
  35. Deshpande, Y.; Montanari, A.; Richard, E. Cone-constrained principal component analysis. Adv. Neural Inf. Process. Syst. 2014, 27, 2717–2725.
  36. Zou, H.; Hastie, T.; Tibshirani, R. Sparse principal component analysis. J. Comput. Graph. Stat. 2006, 15, 265–286.
  37. Krauthgamer, R.; Nadler, B.; Vilenchik, D.; et al. Do semidefinite relaxations solve sparse PCA up to the information limit? Ann. Stat. 2015, 43, 1300–1322.
  38. Berthet, Q.; Rigollet, P. Computational lower bounds for Sparse PCA. arXiv 2013, arXiv:1304.0828.
  39. Cai, T.; Ma, Z.; Wu, Y. Sparse PCA: Optimal rates and adaptive estimation. Ann. Stat. 2013, 41, 3074–3110.
  40. Ma, T.; Wigderson, A. Sum-of-squares lower bounds for sparse PCA. Adv. Neural Inf. Process. Syst. 2015, 28, 1612–1620.
  41. Lesieur, T.; Krzakala, F.; Zdeborová, L. Phase transitions in sparse PCA. In Proceedings of the 2015 IEEE International Symposium on Information Theory (ISIT), Hong Kong, China, 14–19 June 2015; pp. 1635–1639.
  42. Brennan, M.; Bresler, G. Optimal average-case reductions to sparse pca: From weak assumptions to strong hardness. arXiv 2019, arXiv:1902.07380.
  43. Arous, G.B.; Wein, A.S.; Zadik, I. Free energy wells and overlap gap property in sparse PCA. In Proceedings of the Conference on Learning Theory, PMLR, Graz, Austria, 9–12 July 2020; pp. 479–482.
  44. Fan, J.; Liu, H.; Wang, Z.; Yang, Z. Curse of heterogeneity: Computational barriers in sparse mixture models and phase retrieval. arXiv 2018, arXiv:1808.06996.
  45. Richard, E.; Montanari, A. A statistical model for tensor PCA. Adv. Neural Inf. Process. Syst. 2014, 27, 2897–2905.
  46. Decelle, A.; Krzakala, F.; Moore, C.; Zdeborová, L. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E 2011, 84, 066106.
  47. Perry, A.; Wein, A.S.; Bandeira, A.S.; Moitra, A. Message-Passing Algorithms for Synchronization Problems over Compact Groups. Commun. Pure Appl. Math. 2018, 71, 2275–2322.
  48. Oymak, S.; Jalali, A.; Fazel, M.; Eldar, Y.C.; Hassibi, B. Simultaneously structured models with application to sparse and low-rank matrices. IEEE Trans. Inf. Theory 2015, 61, 2886–2908.
  49. Dhar, M.; Grover, A.; Ermon, S. Modeling sparse deviations for compressed sensing using generative models. arXiv 2018, arXiv:1807.01442.
  50. Shah, V.; Hegde, C. Solving linear inverse problems using gan priors: An algorithm with provable guarantees. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4609–4613.
  51. Mixon, D.G.; Villar, S. Sunlayer: Stable denoising with generative networks. arXiv 2018, arXiv:1803.09319.
  52. Yeh, R.A.; Chen, C.; Lim, T.Y.; Schwing, A.G.; Hasegawa-Johnson, M.; Do, M.N. Semantic image inpainting with deep generative models. arXiv 2016, arXiv:1607.07539.
  53. Sønderby, C.K.; Caballero, J.; Theis, L.; Shi, W.; Huszár, F. Amortised map inference for image super-resolution. arXiv 2016, arXiv:1610.04490.
  54. Yang, G.; Yu, S.; Dong, H.; Slabaugh, G.; Dragotti, P.L.; Ye, X.; Liu, F.; Arridge, S.; Keegan, J.; Guo, Y.; et al. DAGAN: Deep de-aliasing generative adversarial networks for fast compressed sensing MRI reconstruction. IEEE Trans. Med. Imaging 2017, 37, 1310–1321.
  55. Qiu, S.; Wei, X.; Yang, Z. Robust One-Bit Recovery via ReLU Generative Networks: Improved Statistical Rates and Global Landscape Analysis. arXiv 2019, arXiv:1908.05368.
  56. Xue, Y.; Xu, T.; Zhang, H.; Long, L.R.; Huang, X. Segan: Adversarial network with multi-scale l 1 loss for medical image segmentation. Neuroinformatics 2018, 16, 383–392.
  57. Heckel, R.; Hand, P. Deep Decoder: Concise Image Representations from Untrained Non-convolutional Networks. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
  58. Heckel, R.; Soltanolkotabi, M. Denoising and regularization via exploiting the structural bias of convolutional generators. arXiv 2019, arXiv:1910.14634.
  59. Heckel, R.; Soltanolkotabi, M. Compressive sensing with un-trained neural networks: Gradient descent finds the smoothest approximation. arXiv 2020, arXiv:2005.03991.
  60. Aubin, B.; Loureiro, B.; Baker, A.; Krzakala, F.; Zdeborová, L. Exact asymptotics for phase retrieval and compressed sensing with random generative priors. In Proceedings of the First Mathematical and Scientific Machine Learning Conference, PMLR, Princeton, NJ, USA, 20–24 July 2020; pp. 55–73.
  61. Clason, C. Nonsmooth Analysis and Optimization. arXiv 2017, arXiv:1708.04180.
  62. Daskalakis, C.; Rohatgi, D.; Zampetakis, M. Constant-Expansion Suffices for Compressed Sensing with Generative Priors. arXiv 2020, arXiv:2006.04237.
  63. Chi, Y.; Lu, Y.M.; Chen, Y. Nonconvex optimization meets low-rank matrix factorization: An overview. IEEE Trans. Signal Process. 2019, 67, 5239–5269.
  64. Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science; Cambridge University Press: Cambridge, UK, 2018; Volume 47.
Figure 1. Expected value, with respect to the weights, of the objective function f in (4) in the noiseless case (see (16) for explicit formula), for a network with latent dimension k = 2 and x 🟉 = [ 1 , 1 ] .
Figure 2. Reconstruction error for the recovery of a spike y 🟉 = G ( x 🟉 ) in the Wishart and Wigner models with random generative network priors. Each point corresponds to the average over 50 random draws of the network weights and samples. These plots demonstrate that the reconstruction errors follow the scalings established by Theorem 1.