1 Introduction

In a generalized linear model (GLM) [36, 39], we want to recover a d-dimensional signal \({\varvec{x}}\in \mathbb R^d\) given n i.i.d. measurements \({\varvec{y}}=(y_1, \ldots , y_n)\) of the form:

$$\begin{aligned} y_i\sim p(y\mid \langle {\varvec{x}}, {\varvec{a}}_i \rangle ), \qquad i\in \{1, \ldots , n\}, \end{aligned}$$
(1)

where \(\langle \cdot , \cdot \rangle \) denotes the scalar product, \(\{{\varvec{a}}_i\}_{1\le i\le n}\) are known sensing vectors, and the (stochastic) output function \(p(\cdot \mid \langle {\varvec{x}}, {\varvec{a}}_i \rangle )\) is a known probability distribution. GLMs arise in several problems in statistical inference and signal processing. Examples include photon-limited imaging [53, 58], compressed sensing [19], signal recovery from quantized measurements [7, 46], phase retrieval [21, 49], and neural networks with one hidden layer [30].

The problem of estimating \({\varvec{x}}\) from \({\varvec{y}}\) is, in general, non-convex, and semi-definite programming relaxations have been proposed [9, 11, 52, 56]. However, the computational complexity and memory requirement of these approaches quickly grow with the dimension d. For this reason, several non-convex approaches have been developed, e.g., alternating minimization [40], approximate message passing (AMP) [15, 44, 48], Wirtinger Flow [10], Kaczmarz methods [57], and iterative convex-programming relaxations [1, 7, 14, 25]. The Bayes-optimal estimation and generalization error have also been studied in [3]. When the output function \(p(\cdot \mid \langle {\varvec{x}}, {\varvec{a}}_i \rangle )\) is unknown, (1) is called the single-index model in the statistics literature, see e.g., [8, 28, 33]. The problem of recovering a structured signal (e.g., sparse, low-rank) from high-dimensional single-index measurements has been an active research topic over the past few years [22,23,24, 41,42,43, 51, 52, 59].

Throughout this paper, the performance of an estimator \(\hat{\varvec{x}}\) will be measured by its normalized correlation (or “overlap”) with \({\varvec{x}}\):

$$\begin{aligned} \frac{\left|{\langle {\varvec{x}}, \hat{\varvec{x}}\rangle }\right|}{\Vert {\varvec{x}}\Vert _2 \Vert \hat{\varvec{x}}\Vert _2 }, \end{aligned}$$
(2)

where \(\Vert \cdot \Vert _2\) denotes the Euclidean norm of a vector.

Most of the existing techniques require an initial estimate of the signal, which can then be refined via a local algorithm. Here, we focus on two popular methods: a linear estimator and a spectral estimator. The linear estimator \(\hat{\varvec{x}}^\mathrm{L}\) has the form:

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^n {\mathcal {T}}_L(y_i){\varvec{a}}_i \,, \end{aligned}$$
(3)

where \({\mathcal {T}}_L\) denotes a given preprocessing function. The performance analysis of this estimator is quite simple, see e.g., Proposition 1 in [43] or Sect. 2.3 of this paper. The spectral estimator is the principal eigenvector \(\hat{\varvec{x}}^\mathrm{s}\) of a matrix of the form:

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^n {\mathcal {T}}_s(y_i){\varvec{a}}_i{\varvec{a}}_i^\mathsf{T}\,, \end{aligned}$$
(4)

where \({\mathcal {T}}_s\) is another preprocessing function. The idea of a spectral method first appeared in [32] and, for the special case of phase retrieval, a series of works has provided increasingly refined performance bounds [11, 12, 40]. Recently, an exact high-dimensional analysis of the spectral method for generalized linear models with Gaussian sensing vectors has been carried out in [34, 37]. These works consider a regime where both n and d grow large at a fixed proportional rate \(\delta =n/d>0\). The choice of \({\mathcal {T}}_s\) which minimizes the value of \(\delta \) (and, consequently, the amount of data) necessary to achieve a strictly positive correlation (2) was obtained in [37]. Furthermore, the choice of \({\mathcal {T}}_s\) which maximizes the correlation between \({\varvec{x}}\) and \(\hat{\varvec{x}}^\mathrm{s}\) for any given value of the sampling ratio \(\delta \) was obtained in [35]. The case in which the sensing vectors are obtained by picking columns from a Haar matrix is tackled in [16].

In short, the performance of the linear estimate \(\hat{\varvec{x}}^\mathrm{L}\) and the spectral estimate \(\hat{\varvec{x}}^\mathrm{s}\) is well understood, and there is no clear winner between the two. In fact, the superiority of one method over the other depends on the output function \(p(\cdot \mid \langle {\varvec{x}}, {\varvec{a}}_i \rangle )\) and on the sampling ratio \(\delta \). For example, for phase retrieval (\(y_i=|\langle {\varvec{x}}, {\varvec{a}}_i \rangle |\)), the spectral estimate provides positive correlation with the ground-truth signal as long as \(\delta >1/2\) [37], while linear estimators of the form (3) are not effective for any \(\delta >0\). For 1-bit compressed sensing (\(y_i=\mathrm{sign}(\langle {\varvec{x}}, {\varvec{a}}_i \rangle )\)), the situation is reversed: the spectral estimator is uncorrelated with the signal for any \(\delta >0\), while the linear estimate works well. For many cases of practical interest, e.g., neural networks with ReLU activation function (\(y_i=\max (\langle {\varvec{x}}, {\varvec{a}}_i \rangle , 0)\)), both the linear and the spectral method yield estimators with non-zero correlation. Thus, a natural question is the following:

What is the optimal way to combine the linear estimator \(\hat{\varvec{x}}^\mathrm{L}\) and the spectral estimator \(\hat{\varvec{x}}^\mathrm{s}\)?

This paper closes the gap and answers the question above for Gaussian sensing vectors \(\{{\varvec{a}}_i\}_{1\le i\le n}\). Our main technical contribution is to provide an exact high-dimensional characterization of the joint empirical distribution of \(({\varvec{x}}, \hat{\varvec{x}}^\mathrm{L}, \hat{\varvec{x}}^\mathrm{s})\) in the limit \(n, d \rightarrow \infty \) with a fixed sampling ratio \(\delta =n/d\) (see Theorem 1). In particular, we prove that the conditional distribution of \((\hat{\varvec{x}}^\mathrm{L}, \hat{\varvec{x}}^\mathrm{s})\) given \({\varvec{x}}\) converges to the law of a bivariate Gaussian whose mean vector and covariance matrix are specified in terms of the preprocessing functions \({\mathcal {T}}_L\) and \({\mathcal {T}}_s\). As a consequence, we are able to compute the Bayes-optimal combination of \(\hat{\varvec{x}}^\mathrm{L}\) and \(\hat{\varvec{x}}^\mathrm{s}\) for any given prior distribution on \({\varvec{x}}\) (see Theorem 2). In the special case in which the signal prior is Gaussian, the Bayes-optimal combination has the form \(\theta \hat{\varvec{x}}^\mathrm{L}+\hat{\varvec{x}}^\mathrm{s}\), with \(\theta \in \mathbb R\), and we compute the optimal combination coefficient \(\theta _*\) that maximizes the normalized correlation in (2) (see Corollary 2).

The characterization of the joint empirical distribution of \(({\varvec{x}}, \hat{\varvec{x}}^\mathrm{L}, \hat{\varvec{x}}^\mathrm{s})\) is achieved by designing and analyzing a suitable approximate message passing (AMP) algorithm. AMP is a family of iterative algorithms that has been applied to several high-dimensional statistical estimation problems including estimation in linear models [4, 5, 15, 29], generalized linear models [44, 48, 50], and low-rank matrix estimation [13, 31, 38, 45]. An appealing feature of AMP algorithms is that under suitable conditions on the model, the empirical joint distribution of the iterates can be exactly characterized in the high-dimensional limit, in terms of a simple scalar recursion called state evolution.

In this paper, we design an AMP algorithm that is equivalent to a power method computing the principal eigenvector of the matrix (4). Then, the state evolution analysis leads to the desired joint empirical distribution of \(({\varvec{x}}, \hat{\varvec{x}}^\mathrm{L}, \hat{\varvec{x}}^\mathrm{s})\). Using the limiting distribution, we reduce the vector problem of estimating \({\varvec{x}}\in \mathbb R^d\) given two (correlated) observations \(\hat{\varvec{x}}^\mathrm{L}, \hat{\varvec{x}}^\mathrm{s}\in \mathbb R^d\) to the scalar problem of estimating the random variable \(X\in \mathbb R\) given two (correlated) observations \(X_L, X_s\in \mathbb R\). We emphasize that the focus of this work is not on using the AMP algorithm as an estimator for the generalized linear model. Rather, we use AMP as a proof technique to characterize the joint empirical distribution of \(({\varvec{x}}, \hat{\varvec{x}}^\mathrm{L}, \hat{\varvec{x}}^\mathrm{s})\), and thereby understand how to optimally combine the two simple estimators.

Fig. 1

a Performance comparison among the linear estimator, the spectral estimator and the proposed optimal combination for a specific output function and a range of values of the sampling ratio \(\delta \). The performance is measured in terms of the normalized correlation (2). b Optimal combination coefficient \(\theta _*\) as a function of \(\delta \) for the same output function as in a. c Percentage performance gain of the combined estimator for different output functions and sampling ratios

Our proposed combination of the linear and spectral estimators can significantly boost the correlation with the ground-truth signal (2). As an illustration, in Fig. 1a we compare the correlations \(\rho _L, \rho _s\) and \(\rho _*\) of the linear, spectral and optimal combined estimators, respectively, for a range of values of the sampling ratio \(\delta =n/d\) and measurements of the form \(y_i = 0.3\,{\langle }{\varvec{x}}, {\varvec{a}}_i{\rangle } + {\langle }{\varvec{x}}, {\varvec{a}}_i{\rangle }^2 + z_i\). Here, \({\varvec{x}}\) is uniformly distributed on the d-dimensional sphere of radius \(\sqrt{d}\), \({\varvec{a}}_i \ \sim _{i.i.d.} \ \mathsf{N}({\varvec{0}}_d, {\varvec{I}}_d/d)\), \(z_i\sim _{i.i.d.}\mathcal {N}(0,0.2)\), and the preprocessing functions are chosen as follows: \({\mathcal {T}}_L(y) = y\) and \({\mathcal {T}}_s(y) = \min \{y,3.5\}\). The solid lines correspond to analytically derived asymptotic formulae, and they are compared against numerical simulations (cf. markers of the corresponding color) computed for \(d=2000\). Specifically, the red line corresponds to the optimal combined estimator \(\theta _*\hat{\varvec{x}}^\mathrm{L}+ \hat{\varvec{x}}^\mathrm{s}\) (in this example, the limiting empirical distribution of \({\varvec{x}}\) is Gaussian). The optimal combination coefficient \(\theta _*\) is plotted in Fig. 1b as a function of \(\delta \). Note that for values of \(\delta \) for which the spectral estimator achieves strictly positive correlation with \({\varvec{x}}\), the combined estimator provides a significant performance improvement. The performance gain depends on: (i) the sampling ratio \(\delta \) (it can be as large as \(\sim 30\%\) for \(\delta \approx 8\)), and (ii) the output function that defines the measurement. To better visualize this dependence, we plot in Fig. 1c the percentage gain \((\rho _*-\rho _{\max })/\rho _{\max }\times 100\) for various values of \(\delta \) and for different output-function parameterizations. Specifically, the x-axis in Fig. 1c represents the value of the coefficient \(H_1\) of an output function of the form \(y_i = 0.5+H_1\,{\langle }{\varvec{x}}, {\varvec{a}}_i{\rangle } + 0.5\,{\langle }{\varvec{x}}, {\varvec{a}}_i{\rangle }^2 + z_i\), with \(z_i\sim _{i.i.d.}\mathcal {N}(0,0.2)\). Above, \(\rho _*\) denotes the correlation achieved by our proposed estimator, and \(\rho _{\max }\) is the maximum correlation among the linear and spectral estimators.

The rest of the paper is organized as follows. In Sect. 2, we describe the setting and review existing results on the linear and the spectral estimator. In Sect. 3, we present our contributions. The main technical result, Theorem 1, gives an exact characterization of the joint empirical distribution of \(({\varvec{x}}, \hat{\varvec{x}}^\mathrm{L}, \hat{\varvec{x}}^\mathrm{s})\). Using this, we derive the Bayes-optimal combination of the estimators \(\hat{\varvec{x}}^\mathrm{L}\) and \(\hat{\varvec{x}}^\mathrm{s}\). In the special case in which the signal prior is Gaussian, the Bayes-optimal combination is linear in \(\hat{\varvec{x}}^\mathrm{L}\) and \(\hat{\varvec{x}}^\mathrm{s}\), and we derive the optimal coefficient. In Sect. 4, we demonstrate the effectiveness of our method via numerical simulation. In Sect. 5, we describe the generalized AMP algorithm and use it to prove Theorem 1.

2 Preliminaries

2.1 Notation and Definitions

Given \(n\in \mathbb N\), we use the shorthand \([n]=\{1, \ldots , n\}\). Given a vector \({\varvec{x}}\), we denote by \(\Vert {\varvec{x}}\Vert _2\) its Euclidean norm. Given a matrix \({\varvec{A}}\), we denote by \(\Vert {\varvec{A}}\Vert _\mathrm{op}\) its operator norm.

The empirical distribution of a vector \({\varvec{x}}= (x_1, \ldots , x_d)^{\mathsf{T}}\) is given by \( \frac{1}{d}\sum _{i=1}^d \delta _{x_i}\), where \(\delta _{x_i}\) denotes a Dirac delta mass on \(x_i\). Similarly, the empirical joint distribution of vectors \({\varvec{x}}, {\varvec{x}}'\in \mathbb {R}^d\) is \(\frac{1}{d} \sum _{i=1}^d \delta _{(x_i, x'_i)}\).

Given two probability measures \(\mu \) (on a space \(\mathcal {X}\)) and \(\nu \) (on a space \(\mathcal {Y}\)), a coupling \(\rho \) of \(\mu \) and \(\nu \) is a probability distribution on \(\mathcal {X}\times \mathcal {Y}\) whose marginals coincide with \(\mu \) and \(\nu \), respectively. For \(k\ge 1\), the Wasserstein-k (\(W_k\)) distance between two probability measures \(\mu \), \(\nu \) on \(\mathbb {R}^m\) is defined by

$$\begin{aligned} W_k(\mu ,\nu ) \equiv \inf _{\rho }{\mathbb E}_{({\varvec{X}},{\varvec{Y}})\sim \rho }\big \{\Vert {\varvec{X}}-{\varvec{Y}}\Vert _2^k\big \}^{1/k}\, , \end{aligned}$$
(5)

where the infimum is over all the couplings of \(\mu \) and \(\nu \). A sequence of probability distributions \(\nu _n\) on \(\mathbb {R}^m\) converges in \(W_k\) to \(\nu \), written \(\nu _n{\mathop {\Rightarrow }\limits ^{W_{k}}} \nu \), if \(W_k(\nu _n,\nu ) \rightarrow 0\) as \(n \rightarrow \infty \).

2.2 Generalized Linear Model

Let \({\varvec{x}}\in \mathbb {R}^d\) be the signal of interest. We assume that \(\Vert {\varvec{x}}\Vert ^2_2=d\). The signal is observed via inner products with n sensing vectors \(({\varvec{a}}_i)_{i \in [n]}\), with each \({\varvec{a}}_i \in \mathbb {R}^d\) having independent Gaussian entries with mean zero and variance 1/d. That is,

$$\begin{aligned} {\varvec{a}}_i \ \sim _{i.i.d.} \ \mathsf{N}({\varvec{0}}_d, {\varvec{I}}_d/d), \qquad i\in [n]. \end{aligned}$$
(6)

Given \(g_i = \langle {\varvec{x}}, {\varvec{a}}_i \rangle \), the measurement vector \({\varvec{y}}\in \mathbb {R}^n\) is obtained by drawing each component independently according to a conditional distribution \(p_{Y| {G}}\):

$$\begin{aligned} y_i \ \sim \ p_{Y | G}(y_i \mid {g}_i), \quad i \in [n]. \end{aligned}$$
(7)

We stack the sensing vectors as rows to define the \(n \times d\) sensing matrix \({\varvec{A}}\). That is,

$$\begin{aligned} {\varvec{A}}= [{\varvec{a}}_1, \ldots , {\varvec{a}}_n]^{\mathsf{T}}. \end{aligned}$$
(8)

We write \(\delta _n = \frac{n}{d}\) for the sampling ratio and assume that \(\delta _n \rightarrow \delta \in (0, \infty )\). Since the entries of the sensing matrix are \(\sim _{i.i.d.} \mathsf{N}(0, 1/d)\), each row of \({\varvec{A}}\) has norm close to 1.

2.3 Linear Estimator

Given the measurements \((y_i)_{i \in [n]}\) and a preprocessing function \({\mathcal {T}}_L: \mathbb {R}\rightarrow \mathbb {R}\), define the \(n\times 1\) vector

$$\begin{aligned} {\varvec{z}}^\mathrm{L}= [{\mathcal {T}}_L(y_1), \ldots , {\mathcal {T}}_L(y_n)]^\mathsf{T}. \end{aligned}$$
(9)

Consider the following linear estimator, which averages the sensing vectors weighted by the preprocessed measurements:

$$\begin{aligned} \hat{\varvec{x}}^\mathrm{L}:= \frac{\sqrt{d}}{n} {\varvec{A}}^\mathsf{T}{\varvec{z}}^\mathrm{L}= \frac{\sqrt{d}}{n}\sum _{i=1}^n {\mathcal {T}}_L(y_i){\varvec{a}}_i \,. \end{aligned}$$
(10)

The following lemma characterizes the asymptotic performance of this simple estimator. The proof is rather straightforward, and we include it in Appendix A for completeness.

Lemma 1

Let \({\varvec{x}}\) be such that \(\Vert {\varvec{x}}\Vert ^2_2=d\), \(\{{\varvec{a}}_i\}_{1\le i\le n}\sim _{i.i.d.}\mathsf{N}({{\varvec{0}}}_d,{\varvec{I}}_d/d)\), and \({\varvec{y}}\) be distributed according to (7). Let \(n/d\rightarrow \delta \), \(G\sim \mathsf{N}(0, 1)\) and define \(Z_L=\mathcal T_L(Y)\) for \(Y\sim p_{Y|G}(\,\cdot \,|\,G)\) such that \({\mathbb E}\{|GZ_L|\}<\infty \). Let \(\hat{\varvec{x}}^\mathrm{L}\) be the linear estimator defined as in (10). Then, as \(n \rightarrow \infty \),

$$\begin{aligned}&\Vert \hat{\varvec{x}}^\mathrm{L}\Vert _2^2 {\mathop {\longrightarrow }\limits ^{{\small \mathrm{a.s.}}}}\left( {\mathbb E}\left\{ G Z_L\right\} \right) ^2 + \frac{{\mathbb E}\{Z_L^2\}}{\delta }, \quad \text { and }\nonumber \\&\quad \frac{\langle \hat{\varvec{x}}^\mathrm{L}, {\varvec{x}}\rangle }{\Vert {\hat{\varvec{x}}^\mathrm{L}}\Vert _2 \, \left\Vert {{\varvec{x}}}\right\Vert _2} {\mathop {\longrightarrow }\limits ^{{\small \mathrm{a.s.}}}}\frac{{\mathbb E}\left\{ G Z_L\right\} }{\sqrt{\left( {\mathbb E}\left\{ G Z_L\right\} \right) ^2 + {{\mathbb E}\left\{ Z_L^2\right\} }\big /{\delta }}}\,. \end{aligned}$$
(11)
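For concreteness, the following sketch (a minimal illustration rather than the simulation code used for the figures; the sample sizes, random seed and the Fig. 1 output function \(y_i = 0.3\,{\langle }{\varvec{x}}, {\varvec{a}}_i{\rangle } + {\langle }{\varvec{x}}, {\varvec{a}}_i{\rangle }^2 + z_i\) are illustrative choices) generates data from the model (6)–(7), forms the linear estimator (10) with \({\mathcal {T}}_L(y)=y\), and compares its empirical overlap with the prediction of Lemma 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d, delta = 1000, 5.0
n = int(delta * d)

# Signal with ||x||_2^2 = d and sensing vectors a_i ~ N(0, I_d/d).
x = rng.standard_normal(d)
x *= np.sqrt(d) / np.linalg.norm(x)
A = rng.standard_normal((n, d)) / np.sqrt(d)

# Illustrative output function (Fig. 1): y_i = 0.3 g_i + g_i^2 + z_i, z_i ~ N(0, 0.2).
g = A @ x
y = 0.3 * g + g**2 + np.sqrt(0.2) * rng.standard_normal(n)

# Linear estimator (10) with T_L(y) = y.
x_hat_L = (np.sqrt(d) / n) * (A.T @ y)
overlap_emp = abs(x_hat_L @ x) / (np.linalg.norm(x_hat_L) * np.linalg.norm(x))

# Prediction of Lemma 1, with E{G Z_L} and E{Z_L^2} estimated by Monte Carlo.
G = rng.standard_normal(10**6)
Z_L = 0.3 * G + G**2 + np.sqrt(0.2) * rng.standard_normal(G.size)
a, b = np.mean(G * Z_L), np.mean(Z_L**2)
overlap_th = a / np.sqrt(a**2 + b / delta)
print(overlap_emp, overlap_th)
```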

2.4 Spectral Estimator

Given the measurements \((y_i)_{i \in [n]}\), consider the \(n \times n\) diagonal matrix

$$\begin{aligned} {\varvec{Z}}_s = \mathrm{diag}({\mathcal {T}}_s(y_1), \ldots , {\mathcal {T}}_s(y_n)), \end{aligned}$$
(12)

where \({\mathcal {T}}_s: \mathbb {R}\rightarrow \mathbb {R}\) is a preprocessing function. Consider the \(d \times d\) matrix

$$\begin{aligned} {\varvec{D}}_n = {\varvec{A}}^\mathsf{T}{\varvec{Z}}_s {\varvec{A}}. \end{aligned}$$
(13)
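The matrices (12)–(13) are straightforward to form in practice. The sketch below is one possible implementation (with the illustrative Fig. 1 output function and the truncation \({\mathcal {T}}_s(y)=\min \{y,3.5\}\), and with \(\hat{\varvec{x}}^\mathrm{s}\) taken as the principal eigenvector of \({\varvec{D}}_n\), as in Lemma 2 below).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 5000
x = rng.standard_normal(d)
x *= np.sqrt(d) / np.linalg.norm(x)
A = rng.standard_normal((n, d)) / np.sqrt(d)
g = A @ x
y = 0.3 * g + g**2 + np.sqrt(0.2) * rng.standard_normal(n)   # illustrative output (Fig. 1)

# Spectral matrix D_n = A^T Z_s A, computed without forming the n x n diagonal Z_s.
T_s = lambda t: np.minimum(t, 3.5)
D_n = A.T @ (T_s(y)[:, None] * A)

# Principal eigenvector of the symmetric matrix D_n; eigh sorts eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(D_n)
x_hat_s = eigvecs[:, -1]                      # unit-norm spectral estimator
if x_hat_s @ x < 0:                           # resolve the global sign ambiguity
    x_hat_s = -x_hat_s
print(abs(x_hat_s @ x) / np.linalg.norm(x))   # empirical overlap (2)
```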

Let \(G\sim \mathsf{N}(0, 1)\), \(Y\sim p_{Y|G}(\,\cdot \mid G)\), and \(Z_s=\mathcal T_s(Y)\). We will make the following assumptions on \(Z_s\).

(A1) \(\mathbb P(Z_s=0)<1\).

(A2) \(Z_s\) has bounded support and \(\tau \) is the supremum of this support, i.e.,

$$\begin{aligned} \tau = \inf \{ z : \mathbb P (Z_s\le z) = 1\}. \end{aligned}$$
(14)

(A3) As \(\lambda \) approaches \(\tau \) from the right, we have

$$\begin{aligned} \lim _{\lambda \rightarrow \tau ^+}{\mathbb E}\left\{ \frac{Z_s}{(\lambda -Z_s)^2}\right\} =\lim _{\lambda \rightarrow \tau ^+}{\mathbb E}\left\{ \frac{Z_s\cdot G^2}{\lambda -Z_s}\right\} =\infty . \end{aligned}$$
(15)

Let us comment on these assumptions. First, the condition (A1) simply avoids the degenerate case in which the measurement vector, after passing through the preprocessing function, is 0 with high probability. Second, the condition (A2) requires that the support of \(Z_s\) is bounded both from above and below. This assumption appears in the papers that have recently analyzed the performance of spectral estimators [34, 35, 37], and it is also required for Lemma 2. Requiring that the support of \(Z_s\) is bounded from above is rather natural, since the argument relies on the matrix \({\varvec{D}}_n\) having a spectral gap. It is not clear whether having the support of \(Z_s\) bounded from both sides (rather than only from above) is necessary, and investigating this aspect is an interesting avenue for future research. Let us also point out that the condition (A2) is purely technical and rather mild. In fact, if the desired preprocessing function is not bounded, then one can construct a sequence of bounded approximations that approach its performance, as done, e.g., in [35]. Finally, the condition (A3) essentially requires that \(Z_s\) has sufficient probability mass near the supremum of the support \(\tau \). One sufficient condition is that the law of \(Z_s\) has a point mass at \(\tau \). If this is not the case, the argument in (115)–(118) of [37] shows how to modify the preprocessing function \({\mathcal {T}}_s\) so that (i) condition (A3) holds, and (ii) the spectral estimator suffers no performance loss.

For \(\lambda \in (\tau , \infty )\) and \(\delta \in (0, \infty )\), define

$$\begin{aligned} \phi (\lambda ) = \lambda \cdot {\mathbb E}\left\{ \frac{Z_s\cdot G^2}{\lambda -Z_s}\right\} , \end{aligned}$$
(16)

and

$$\begin{aligned} \psi _{\delta }(\lambda ) = \lambda \left( \frac{1}{\delta }+{\mathbb E}\left\{ \frac{Z_s}{\lambda -Z_s}\right\} \right) . \end{aligned}$$
(17)

Note that \(\phi (\lambda )\) is a monotone non-increasing function and that \(\psi _{\delta }(\lambda )\) is a convex function. Let \(\bar{\lambda }_{\delta }\) be the point at which \(\psi _{\delta }\) attains its minimum, i.e.,

$$\begin{aligned} \bar{\lambda }_\delta = \arg \min _{\lambda \ge \tau } \psi _{\delta }(\lambda ). \end{aligned}$$
(18)

For \(\lambda \in (\tau , \infty )\), define also

$$\begin{aligned} \zeta _{\delta }(\lambda ) = \psi _{\delta }(\max (\lambda , \bar{\lambda }_\delta )). \end{aligned}$$
(19)

The spectrum of \({\varvec{D}}_n\) exhibits a phase transition as \(\delta \) increases. The most basic phenomenon of this kind was unveiled for low-rank perturbations of a Wigner matrix: the well-known BBP phase transition, first discovered in the physics literature [26], and named after the authors of [2]. Here, the model for the random matrix \({\varvec{D}}_n\) is quite different from that considered in [2, 26], and the phase transition is formalized by the following result.

Lemma 2

Let \({\varvec{x}}\) be such that \(\Vert {\varvec{x}}\Vert ^2_2=d\), \(\{{\varvec{a}}_i\}_{1\le i\le n}\sim _{i.i.d.}\mathsf{N}({{\varvec{0}}}_d,{\varvec{I}}_d/d)\), and \({\varvec{y}}\) be distributed according to (7). Let \(n/d\rightarrow \delta \), \(G\sim \mathsf{N}(0, 1)\) and define \(Z_s=\mathcal T_s(Y)\) for \(Y\sim p_{Y|G}(\,\cdot \,|\,G)\). Assume that \(Z_s\) satisfies the assumptions (A1), (A2), and (A3). Let \(\hat{\varvec{x}}^\mathrm{s}\) be the principal eigenvector of the matrix \({\varvec{D}}_n\), defined as in (13). Then, the following results hold:

  1.

    The equation

    $$\begin{aligned} \zeta _{\delta }(\lambda ) = \phi (\lambda ) \end{aligned}$$
    (20)

    admits a unique solution, call it \(\lambda _{\delta }^*\), for \(\lambda > \tau \).

  2.

    As \(n \rightarrow \infty \),

    $$\begin{aligned} \frac{|\langle \hat{\varvec{x}}^\mathrm{s}, {\varvec{x}}\rangle |^2}{\left\Vert {\hat{\varvec{x}}^\mathrm{s}}\right\Vert _2^2 \, \left\Vert {{\varvec{x}}}\right\Vert _2^2} {\mathop {\longrightarrow }\limits ^{{\small \mathrm{a.s.}}}} \left\{ \begin{array}{ll} 0, &{} \hbox { if }\psi _{\delta }'(\lambda _{\delta }^*)\le 0,\\ \displaystyle \frac{\psi _{\delta }'(\lambda _{\delta }^*)}{\psi _{\delta }'(\lambda _{\delta }^*)-\phi '(\lambda _{\delta }^*)}, &{} \hbox { if }\psi _{\delta }'(\lambda _{\delta }^*)> 0,\\ \end{array}\right. \end{aligned}$$
    (21)

    where \(\psi _{\delta }'\) and \(\phi '\) denote the derivatives of these two functions.

  3.

    Let \(\lambda _1^{{\varvec{D}}_n}\ge \lambda _2^{{\varvec{D}}_n}\) denote the two largest eigenvalues of \({\varvec{D}}_n\). Then, as \(n\rightarrow \infty \),

    $$\begin{aligned} \begin{aligned} \lambda _1^{{\varvec{D}}_n}&{\mathop {\longrightarrow }\limits ^{{\small \mathrm{a.s.}}}}\delta \,\zeta _{\delta }(\lambda _{\delta }^*),\\ \lambda _2^{{\varvec{D}}_n}&{\mathop {\longrightarrow }\limits ^{{\small \mathrm{a.s.}}}}\delta \,\zeta _{\delta }(\bar{\lambda }_{\delta }).\\ \end{aligned} \end{aligned}$$
    (22)

The proof immediately follows from Lemma 2 of [37]. In that statement, it is assumed that \({\varvec{x}}\) is uniformly distributed on the d-dimensional sphere, but this assumption is actually never used. In fact, since the sensing vectors \(\{{\varvec{a}}_i\}_{1\le i\le n}\) are i.i.d. Gaussian (hence rotationally invariant in distribution), to prove the result above we can assume without loss of generality that \({\varvec{x}}= \sqrt{d}{\varvec{e}}_1\), where \({\varvec{e}}_1\) is the first element of the canonical basis of \(\mathbb R^d\). We also note that the signal \({\varvec{x}}\) and the measurement matrix \({\varvec{A}}\) differ from those in Lemma 2 of [37] by a scaling factor. This accounts for the extra factor \(\delta \) in the expression of the eigenvalues of \({\varvec{D}}_n\).
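The quantities appearing in Lemma 2 can also be evaluated numerically from one-dimensional samples of \((G, Z_s)\). The sketch below is one way to do so; the Monte Carlo approximation of the expectations, the finite-difference derivatives, the bracketing intervals for the root-finding, and the Fig. 1 output function are all illustrative choices, not the procedure used to produce the figures.

```python
import numpy as np
from scipy.optimize import brentq, minimize_scalar

rng = np.random.default_rng(1)
delta = 5.0
G = rng.standard_normal(10**6)
Y = 0.3 * G + G**2 + np.sqrt(0.2) * rng.standard_normal(G.size)   # illustrative output (Fig. 1)
Z_s = np.minimum(Y, 3.5)                                          # T_s(Y)
tau = Z_s.max()                                                   # empirical sup of the support

phi = lambda lam: lam * np.mean(Z_s * G**2 / (lam - Z_s))          # (16)
psi = lambda lam: lam * (1.0 / delta + np.mean(Z_s / (lam - Z_s))) # (17)
lam_bar = minimize_scalar(psi, bounds=(tau + 1e-6, tau + 50.0),
                          method="bounded").x                      # (18)
zeta = lambda lam: psi(max(lam, lam_bar))                          # (19)

# Unique solution of zeta_delta(lambda) = phi(lambda) for lambda > tau, eq. (20).
lam_star = brentq(lambda lam: zeta(lam) - phi(lam), tau + 1e-6, tau + 100.0)

# Overlap formula (21), with the derivatives approximated by central differences.
h = 1e-4
dpsi = (psi(lam_star + h) - psi(lam_star - h)) / (2 * h)
dphi = (phi(lam_star + h) - phi(lam_star - h)) / (2 * h)
rho_s = np.sqrt(dpsi / (dpsi - dphi)) if dpsi > 0 else 0.0
print(lam_star, rho_s)
```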

3 Main Results

Throughout this section, we will make the following additional assumptions on the signal \({\varvec{x}}\), the output function of the GLM, and the preprocessing functions \({\mathcal {T}}_L\) and \({\mathcal {T}}_s\) used for the linear and spectral estimators, respectively.

(B1) Let \(\hat{P}_{X,d}\) denote the empirical distribution of \({\varvec{x}}\in \mathbb {R}^d\), with \(\Vert {\varvec{x}}\Vert ^2_2 =d\). As \(d \rightarrow \infty \), \(\hat{P}_{X,d}\) converges weakly to a distribution \(P_X\) such that, for some \(k \ge 2\), the following hold: (i) \({\mathbb E}_{P_X}\{ \left|{X}\right|^{2k -2} \} < \infty \),   and (ii) \(\lim _{d \rightarrow \infty } {\mathbb E}_{\hat{P}_{X,d}}\{ \left|{X}\right|^{2k -2} \} = {\mathbb E}_{P_X}\{ \left|{X}\right|^{2k -2} \}\). Furthermore, \({\mathbb E}\{ \left|{Y}\right|^{2k -2} \} < \infty \), with \(Y\sim p_{Y|G}(\,\cdot \,|\,G)\) and \(G \sim \mathsf{N}(0,1)\).

(B2) The function \({\mathcal {T}}_L: \mathbb {R}\rightarrow \mathbb {R}\) is Lipschitz and \(|{\mathbb E}\{ {\mathcal {T}}_L(Y) G \}|>0\); the function \({\mathcal {T}}_s: \mathbb {R}\rightarrow \mathbb {R}\) is bounded and Lipschitz.

The assumption (B2) is mainly technical and rather mild, since one can construct a sequence of approximations of the desired \({\mathcal {T}}_L, {\mathcal {T}}_s\) that satisfy (B2).

Lemmas 1 and 2 in the previous sections derive formulae for the asymptotic correlation of \({\varvec{x}}\) with the linear estimator \(\hat{\varvec{x}}^\mathrm{L}\) and the spectral estimator \(\hat{\varvec{x}}^\mathrm{s}\). For convenience, let us denote these as follows:

$$\begin{aligned} \rho _L:=\frac{{\mathbb E}\left\{ Z_L G\right\} }{\sqrt{\left( {\mathbb E}\left\{ Z_L G\right\} \right) ^2 + {{\mathbb E}\left\{ Z_L^2\right\} }\big /{\delta }}},\quad \text {and}\quad \rho _s:= \sqrt{\frac{\psi _{\delta }'(\lambda _{\delta }^*)}{\psi _{\delta }'(\lambda _{\delta }^*)-\phi '(\lambda _{\delta }^*)}}. \end{aligned}$$
(23)

We also denote by \(n_L\) the high-dimensional limit of \(\Vert \hat{\varvec{x}}^\mathrm{L}\Vert _2\),

$$\begin{aligned} n_L = \sqrt{\left( {\mathbb E}\left\{ G Z_L\right\} \right) ^2 + {{\mathbb E}\left\{ Z_L^2\right\} }\big /{\delta }}, \end{aligned}$$
(24)

and we define

$$\begin{aligned} q := \displaystyle \sqrt{\frac{\psi _{\delta }'(\lambda _{\delta }^*)}{\psi _{\delta }'(\lambda _{\delta }^*)-\phi '(\lambda _{\delta }^*)}}\cdot \frac{{\mathbb E}\left\{ \frac{Z_L \cdot G}{1-\frac{1}{\lambda _{\delta }^*} Z_s} \right\} }{\sqrt{\left( {\mathbb E}\left\{ Z_L\cdot G\right\} \right) ^2 + \frac{{\mathbb E}\left\{ Z_L^2\right\} }{\delta }}}. \end{aligned}$$
(25)
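Given \(\lambda _{\delta }^*\) and \(\rho _s\) (e.g., obtained as in the sketch following Lemma 2), the quantities in (23)–(25) reduce to one-dimensional expectations over \((G, Y)\). A minimal Monte Carlo evaluation, again for the illustrative Fig. 1 model, might look as follows; the numerical values assigned to \(\lambda _{\delta }^*\) and \(\rho _s\) below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
delta = 5.0
lam_star, rho_s = 4.8, 0.9          # placeholders: substitute the values computed from (20)-(21)

G = rng.standard_normal(10**6)
Y = 0.3 * G + G**2 + np.sqrt(0.2) * rng.standard_normal(G.size)   # illustrative output (Fig. 1)
Z_L, Z_s = Y, np.minimum(Y, 3.5)    # T_L(y) = y and T_s(y) = min(y, 3.5)

a, b = np.mean(G * Z_L), np.mean(Z_L**2)
n_L = np.sqrt(a**2 + b / delta)                                    # (24)
rho_L = a / n_L                                                    # (23)
q = rho_s * np.mean(G * Z_L / (1.0 - Z_s / lam_star)) / n_L        # (25)
print(rho_L, n_L, q)
```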

3.1 Joint Distribution of Linear and Spectral Estimators

The key technical challenge is to compute the limiting joint empirical distribution of the signal \({\varvec{x}}\), the linear estimator \(\hat{\varvec{x}}^\mathrm{L}\), and the spectral estimator \(\hat{\varvec{x}}^\mathrm{s}\). This result is stated in terms of pseudo-Lipschitz test functions.

Definition 1

(Pseudo-Lipschitz test function) We say that a function \(\psi : \mathbb {R}^m \rightarrow \mathbb {R}\) is pseudo-Lipschitz of order \(k \ge 1\), denoted \(\psi \in \mathrm{PL}(k)\), if there is a constant \(C > 0\) such that

$$\begin{aligned} \left|\psi ({\varvec{x}})-\psi ({\varvec{y}})\right| \le C (1 + \Vert {\varvec{x}}\Vert _2^{k-1} + \Vert {\varvec{y}}\Vert _2^{k-1} )\left\Vert {{\varvec{x}}- {\varvec{y}}}\right\Vert _2, \end{aligned}$$
(26)

for all \({\varvec{x}},{\varvec{y}}\in \mathbb {R}^m\).

Examples of test functions in \(\mathrm{PL}(2)\) with \(m=2\) include \(\psi (a,b) = (a-b)^2\), \(\psi (a,b) =ab\), and \(\psi (a,b) = \left|{a-b}\right|\). We note that if \(\psi \in \mathrm{PL}(k)\), then \(\psi ({\varvec{x}}) \le C'(1+ \Vert {\varvec{x}}\Vert _2^k)\) for some constant \(C' >0\). Also note that if \(\psi \in \mathrm{PL}(k)\) for \(k \ge 2\), then \(\psi \in \mathrm{PL}(k')\) for \(1 \le k' \le (k-1)\).

Theorem 1

(Joint distribution) Let \({\varvec{x}}\) be such that \(\Vert {\varvec{x}}\Vert ^2_2=d\), \(\{{\varvec{a}}_i\}_{1\le i\le n}\sim _{i.i.d.}\mathsf{N}({{\varvec{0}}}_d,{\varvec{I}}_d/d)\), and \({\varvec{y}}\) be distributed according to (7). Let \(n/d\rightarrow \delta \), \(G\sim \mathsf{N}(0, 1)\), \(Z_L=\mathcal T_L(Y)\), and \(Z_s=\mathcal T_s(Y)\) for \(Y\sim p_{Y|G}(\,\cdot \,|\,G)\). Assume that (A1), (A2), (A3) and (B1), (B2) hold. Assume further that \(\psi _{\delta }'(\lambda _{\delta }^*)> 0\), and let \(\hat{\varvec{x}}^\mathrm{s}\) be the principal eigenvector of \({\varvec{D}}_n\), defined as in (13) with preprocessing function \({\mathcal {T}}_s\), with the sign of \(\hat{\varvec{x}}^\mathrm{s}\) chosen so that \(\langle \hat{\varvec{x}}^\mathrm{s}, {\varvec{x}}\rangle \ge 0\). Let \(\hat{\varvec{x}}^\mathrm{L}\) be the linear estimator defined as in (10) with preprocessing function \({\mathcal {T}}_L\).

Consider the rescalings \({\varvec{x}}^\mathrm{s}=\sqrt{d}\hat{\varvec{x}}^\mathrm{s}\) and \({\varvec{x}}^\mathrm{L} = \sqrt{d} \hat{\varvec{x}}^\mathrm{L}/n_L\). Then, the following holds almost surely for any PL(k) function \(\psi : \mathbb {R}^3 \rightarrow \mathbb {R}\):

$$\begin{aligned}&\lim _{d \rightarrow \infty } \, \frac{1}{d} \sum _{i=1}^d \psi (x_i, x^\mathrm{L}_i, x^\mathrm{s}_i) = {\mathbb E}\{ \psi (X, \, \rho _L X + W_L ,\, \rho _s X + W_s) \} . \end{aligned}$$
(27)

Here, \(X \sim P_X\), and \((W_L, W_s)\) are independent of X and jointly Gaussian with zero mean and covariance given by \({\mathbb E}\{W_L^2\}=1-\rho _L^2\), \({\mathbb E}\{W_s^2\}=1-\rho _s^2\) and \({\mathbb E}\{W_L W_s\} = q-\rho _L\rho _s\).

Note that the order k of the pseudo-Lipschitz test function appearing in (27) is the same as the integer k appearing in assumption (B1). In particular, the order of pseudo-Lipschitz functions for which (27) holds is only constrained by the fact that the random variables X and Y should have finite moments of order \(2k-2\). The proof of the theorem is given in Sect. 5.
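For instance, applying (27) to the PL(2) test functions \(\psi (a,b,c)=ab\), \(\psi (a,b,c)=ac\) and \(\psi (a,b,c)=bc\), and using that \({\mathbb E}\{X^2\}=1\) (which follows from \(\Vert {\varvec{x}}\Vert ^2_2=d\) and (B1)), gives almost surely

$$\begin{aligned} \frac{1}{d}\sum _{i=1}^d x_i x^\mathrm{L}_i \rightarrow {\mathbb E}\{X(\rho _L X + W_L)\}=\rho _L, \qquad \frac{1}{d}\sum _{i=1}^d x_i x^\mathrm{s}_i \rightarrow \rho _s, \qquad \frac{1}{d}\sum _{i=1}^d x^\mathrm{L}_i x^\mathrm{s}_i \rightarrow {\mathbb E}\{(\rho _L X + W_L)(\rho _s X + W_s)\} = q, \end{aligned}$$

so that the three parameters \(\rho _L\), \(\rho _s\) and q fully describe the pairwise correlations among \({\varvec{x}}\), \({\varvec{x}}^\mathrm{L}\) and \({\varvec{x}}^\mathrm{s}\); the last limit is the one exploited in Corollary 1 below.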

Remark 1

(What happens if either linear or spectral are ineffective?) From Lemma 1, the assumption \(|{\mathbb E}\{ {\mathcal {T}}_L(Y) G \}|>0\) contained in (B2) implies that \(|\rho _L|>0\). Similarly, from Lemma 2, the assumption \(\psi _{\delta }'(\lambda _{\delta }^*)> 0\) implies that \(\rho _s>0\). Thus, Theorem 1 assumes that both the linear and the spectral estimators are effective. We note that a similar result also holds when only one of the two estimators achieves strictly positive correlation. In that case, \(\psi : \mathbb {R}^2 \rightarrow \mathbb {R}\) takes as input the components of the signal \({\varvec{x}}\) and of the estimator that is effective (as well as the corresponding random variables), and a formula analogous to (27) holds. The proof of this claim is easily deduced from the argument of Theorem 1. A simpler proof using a rotational invariance argument along the lines of (144)-(151) also leads to the same result.

Remark 2

(Convergence in \(W_k\)) The result in Eq. (27) is equivalent to the statement that the empirical joint distribution of \(({\varvec{x}}, {\varvec{x}}^\mathrm{L}, {\varvec{x}}^\mathrm{s})\) converges almost surely in \(W_k\) distance to the joint law of

$$\begin{aligned} (X, \, \rho _L X + W_L ,\, \rho _s X + W_s). \end{aligned}$$
(28)

This follows from the fact that a sequence of distributions \(P_n\) with finite k-th moment converges in \(W_k\) to P if and only if \(P_n\) converges weakly to P and \(\int \Vert a \Vert ^k \, \mathrm{d}P_n(a) \rightarrow \int \Vert a \Vert ^k \, \mathrm{d}P(a)\), see [55, Definition 6.7, Theorem 6.8].

3.2 Optimal Combination

Equipped with the result of Theorem 1, we now reduce the vector problem of estimating \({\varvec{x}}\) given \((\hat{\varvec{x}}^\mathrm{L}, \hat{\varvec{x}}^\mathrm{s})\) to an estimation problem over scalar random variables, i.e., how to optimally estimate X from observations \(X_L:=\rho _L X + W_L\) and \(X_s:=\rho _s X + W_s\), where \(W_L\) and \(W_s\) are jointly Gaussian. The Bayes-optimal estimator for this scalar problem is given by

$$\begin{aligned} F_*(x_L, x_s) := \mathbb E\{X \mid X_L=x_L, X_s=x_s\}. \end{aligned}$$
(29)

This is formalized in the following result.

Lemma 3

Let \((X,X_L,X_s)\) be jointly distributed random variables such that

$$\begin{aligned} X_L = \rho _L X + W_L\qquad \text {and}\qquad X_s = \rho _s X + W_s\,, \end{aligned}$$
(30)

where \((W_L,W_s)\) are jointly Gaussian independent of X with zero mean and covariance given by \({\mathbb E}\{W_L^2\}=1-\rho _L^2\), \({\mathbb E}\{W_s^2\}=1-\rho _s^2\) and \({\mathbb E}\{W_L W_s\} = q-\rho _L\rho _s\). Assume that \(\rho _L\ne 0\) or \(\rho _s\ne 0\). Let

$$\begin{aligned} \mathcal {V}:=\left\{ f(X_L,X_s)~:~0<{\mathbb E}\{f^2(X_L,X_s)\}<\infty \right\} , \end{aligned}$$
(31)

and consider the following optimal estimation problem of X given \(X_L\) and \(X_s\) over all measurable estimators \(f:\mathbb {R}^2\rightarrow \mathbb {R}\) in \(\mathcal {V}\):

$$\begin{aligned} \max _{f\in \mathcal {V}} \, \frac{|{\mathbb E}\left\{ X\cdot f(X_L,X_s)\right\} |}{\sqrt{{\mathbb E}\{X^2\} \cdot {\mathbb E}\left\{ f^2(X_L,X_s)\right\} }}. \end{aligned}$$
(32)

Then, for any \(c\ne 0\), \(\hat{X} = cF_*(X_L, X_s)\) attains the maximum in (32), where \(F_*\) is defined in (29).

Proof

By the tower property of conditional expectation, for any \(f \in \mathcal {V}\) we have

$$\begin{aligned} \begin{aligned}&\frac{\left|{{\mathbb E}\left\{ X\cdot f(X_L,X_s)\right\} }\right|}{\sqrt{ {\mathbb E}\{ f(X_L,X_s)^2 \}}} = \frac{\left|{{\mathbb E}\left\{ {\mathbb E}\{ X \mid X_L, X_s \} \cdot f(X_L,X_s)\right\} }\right|}{\sqrt{ {\mathbb E}\{ f(X_L,X_s)^2 \}}}\\&\le \sqrt{{\mathbb E}\left\{ {\mathbb E}\{ X \, | \, X_L, X_s \}^2 \right\} } , \end{aligned} \end{aligned}$$
(33)

where we have used the Cauchy–Schwarz inequality. Moreover, the inequality in (33) becomes an equality if and only if \(f(X_L,X_s) = c \, {\mathbb E}\{ X \, | \, X_L, X_s \}\), for some \(c \ne 0\), which proves the result.
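Note also that, by the same tower-property argument,

$$\begin{aligned} {\mathbb E}\left\{ X\cdot F_*(X_L,X_s)\right\} = {\mathbb E}\left\{ {\mathbb E}\{ X \, | \, X_L, X_s \}^2 \right\} = {\mathbb E}\left\{ F_*^2(X_L,X_s)\right\} , \end{aligned}$$

so that, when \({\mathbb E}\{X^2\}=1\) (as in our setting), the maximal value of (32) equals \(\sqrt{{\mathbb E}\{F_*^2(X_L,X_s)\}}\), which is exactly the limit \(\rho _*\) appearing in (34) below.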

At this point, we are ready to show how to optimally combine the linear estimator \(\hat{\varvec{x}}^\mathrm{L}\) and the spectral estimator \(\hat{\varvec{x}}^\mathrm{s}\).

Theorem 2

(Optimal combination) Consider the setting of Theorem 1. Let \(F_*\) be defined in (29) and assume that \(F_*\in PL(\lfloor k/2\rfloor )\). Then, as \(n\rightarrow \infty \),

$$\begin{aligned} \frac{|\langle F_*({\varvec{x}}^\mathrm{L}, {\varvec{x}}^\mathrm{s}), {\varvec{x}}\rangle |}{\Vert F_*({\varvec{x}}^\mathrm{L}, {\varvec{x}}^\mathrm{s})\Vert _2 \Vert {\varvec{x}}\Vert _2}{\mathop {\longrightarrow }\limits ^{{\small \mathrm{a.s.}}}}\rho _*:=\frac{|{\mathbb E}\left\{ X\cdot F_*(X_L,X_s)\right\} |}{\sqrt{{\mathbb E}\left\{ F_*^2(X_L,X_s)\right\} }}, \end{aligned}$$
(34)

where \(F_*\) acts component-wise on \({\varvec{x}}^\mathrm{L}\) and \({\varvec{x}}^\mathrm{s}\), i.e., \(F_*({\varvec{x}}^\mathrm{L}, {\varvec{x}}^\mathrm{s})=(F_*(x^\mathrm{L}_1, x^\mathrm{s}_1), \ldots , F_*(x^\mathrm{L}_d, x^\mathrm{s}_d))\). Furthermore, for any \(f\in PL(\lfloor k/2\rfloor )\) acting component-wise on \({\varvec{x}}^\mathrm{L}\) and \({\varvec{x}}^\mathrm{s}\), almost surely,

$$\begin{aligned} \lim _{n\rightarrow \infty }\frac{|\langle f({\varvec{x}}^\mathrm{L}, {\varvec{x}}^\mathrm{s}), {\varvec{x}}\rangle |}{\Vert f({\varvec{x}}^\mathrm{L}, {\varvec{x}}^\mathrm{s})\Vert _2 \Vert {\varvec{x}}\Vert _2}\le \rho _*. \end{aligned}$$
(35)

Proof

If \(f(b, c)\in \) PL(\(\lfloor k/2\rfloor \)), then one can immediately verify that (i) \(\psi (a, b, c)=af(b, c)\in \) PL(k), and (ii) \(\psi (a, b, c)=(f(b, c))^2\in \) PL(k). Thus, by applying Theorem 1, we have that, for any \(f(b, c)\in \) PL(\(\lfloor k/2\rfloor \)), as \(n\rightarrow \infty \),

$$\begin{aligned} \begin{aligned} \frac{|\langle f({\varvec{x}}^\mathrm{L}, {\varvec{x}}^\mathrm{s}), {\varvec{x}}\rangle |}{\Vert f({\varvec{x}}^\mathrm{L}, {\varvec{x}}^\mathrm{s})\Vert _2 \Vert {\varvec{x}}\Vert _2}&{\mathop {\longrightarrow }\limits ^{{\small \mathrm{a.s.}}}}\frac{|\mathbb E\{Xf(\rho _LX+W_L, \rho _sX+W_s)\}|}{\sqrt{\mathbb E\{f^2(\rho _LX+W_L, \rho _sX+W_s)\}}}. \end{aligned} \end{aligned}$$
(36)

By taking \(f=F_*\), the result (34) immediately follows. By applying Lemma 3, (35) also follows and the proof is complete.

The integer k appearing in assumption (B1) is the same one defining the order \(\lfloor k/2\rfloor \) of the pseudo-Lipschitz functions \(F_*\) and f in (34)–(35).

Remark 3

(What happens if either linear or spectral are ineffective?) Theorem 2 considers the same setting of Theorem 1, and therefore it assumes that \(\rho _L\ne 0\) and \(\rho _s\ne 0\). The results in (34)–(35) still hold if either \(\rho _L= 0\) or \(\rho _s= 0\) (and even in the case \(\rho _L= \rho _s= 0\)). For the sake of simplicity, suppose that \(\rho _L= 0\) (the argument for \(\rho _s= 0\) is analogous). Then, \(X_L=W_L\) is independent of X and therefore the conditional expectation in (29) does not depend on \(x_L\). Recall from Remark 1 that if \(\rho _L=0\), then a formula analogous to (27) holds where \(\psi : \mathbb {R}^2 \rightarrow \mathbb {R}\) takes as input the components of \({\varvec{x}}\) and \({\varvec{x}}^\mathrm{s}\) on the LHS, and the corresponding random variables on the RHS. Hence, (34)–(35) are obtained by following the same argument in the proof of Theorem 2.

Fig. 2

Monte Carlo simulations and theoretical predictions for an i.i.d. Gaussian signal and the measurement model \(y_i = f({\langle }{\varvec{a}}_i, {\varvec{x}}{\rangle }) + \sigma z_i, z_i\sim _{i.i.d.} \mathcal {N}(0,1)\). Here, \(f(x)=\max \{x,-0.4x\}\) (cf. a)

Fig. 3

Monte Carlo simulations and theoretical predictions for an i.i.d. Gaussian signal and the measurement model \(y_i = f({\langle }{\varvec{a}}_i, {\varvec{x}}{\rangle }) + \sigma z_i, z_i\sim _{i.i.d.} \mathcal {N}(0,1)\). Here, \(f(x)=|x|\cdot \mathbf {1}_{\{|x|\ge 1.5\}} + x\cdot \mathbf {1}_{\{|x|<1.5\}}\) (cf. a)

Fig. 4

Monte Carlo simulations and theoretical predictions for an i.i.d. Gaussian signal and the measurement model \(y_i = f({\langle }{\varvec{a}}_i, {\varvec{x}}{\rangle }) + \sigma z_i, z_i\sim _{i.i.d.} \mathcal {N}(0,1)\). Here, \(f(x)=0.3x+x^2\) (cf. a)

We highlight that, even if one of the two estimators is ineffective, the proposed optimal combination can still improve on the performance of the other one. This is due to the fact that the function \(F_*\) takes advantage of the knowledge of the signal prior. We showcase an example of this behavior for a binary prior in Fig. 5 discussed in Sect. 4.2. We also note that if the signal prior is Gaussian, then no improvement is possible when one of the two estimators has vanishing correlation with the signal, see Figs. 2, 3 and 4 in Sect. 4.1. In fact, as detailed in Sect. 3.3, in the Gaussian case \(F_*({\varvec{x}}^\mathrm{L}, {\varvec{x}}^\mathrm{s})\) is a linear combination of \({\varvec{x}}^\mathrm{L}\) and \({\varvec{x}}^\mathrm{s}\). Thus, if \(\rho _L=0\) (resp. \(\rho _s=0\)), then \(F_*({\varvec{x}}^\mathrm{L}, {\varvec{x}}^\mathrm{s})\) is aligned with \({\varvec{x}}^\mathrm{s}\) (resp. \({\varvec{x}}^\mathrm{L}\)).

Remark 4

(Sign of \(\hat{\varvec{x}}^\mathrm{s}\)) The spectral estimator \(\hat{\varvec{x}}^\mathrm{s}\) is defined up to a change of sign, since it is the principal eigenvector of a suitable matrix. In Theorems 1 and 2, we pick the sign of \(\hat{\varvec{x}}^\mathrm{s}\) such that \(\langle \hat{\varvec{x}}^\mathrm{s}, {\varvec{x}}\rangle \ge 0\). In practice, there is a simple way to resolve the sign ambiguity: one can match the sign of q as defined in (25) with the sign of the scalar product \(\langle \hat{\varvec{x}}^\mathrm{L}, \hat{\varvec{x}}^\mathrm{s}\rangle \) (see also (41)).

Remark 5

(Sufficient condition for pseudo-Lipschitz \(F_*\)) The assumption that the Bayes-optimal estimator \(F_*\) in (29) is pseudo-Lipschitz is fairly mild. In fact, \(F_*\) is Lipschitz if either of the following two conditions on X holds [20, Lemma 3.8]: (i) X has a log-concave distribution, or (ii) there exist independent random variables U and V such that U is Gaussian, V is compactly supported and \(X {\mathop {=}\limits ^\mathrm{d}} U+V\).

Remark 6

(Non-separable combinations) The optimality of \(F_*\) in Theorem 2 can be extended to a class of combined estimators of the form \(f_d({\varvec{x}}^\mathrm{L}, {\varvec{x}}^\mathrm{s})\), where \(f_d: \mathbb {R}^d \times \mathbb {R}^d \rightarrow \mathbb {R}^d\) may not act component-wise on \(({\varvec{x}}^\mathrm{L}, {\varvec{x}}^\mathrm{s})\). Given \(f_d\), we define the function \(S_{f_d}({\varvec{x}}^\mathrm{L}, {\varvec{x}}^\mathrm{s})= \frac{1}{d}\Vert f_d({\varvec{x}}^\mathrm{L}, {\varvec{x}}^\mathrm{s}) \Vert ^2\). Let \(f_d: \mathbb {R}^d \times \mathbb {R}^d \rightarrow \mathbb {R}^d\) be any sequence of functions (indexed by d) such that \(S_{f_d}: \mathbb {R}^d \times \mathbb {R}^d \rightarrow \mathbb {R}\) is uniformly pseudo-Lipschitz of order k. That is, for each d, the property (26) holds for \(S_{f_d}\) with a pseudo-Lipschitz constant C that does not depend on d. Then, the state evolution result in [6, Theorem 1] for non-separable test functions implies that

$$\begin{aligned} \lim _{d \rightarrow \infty } {\mathbb P}\left( \frac{|\langle f_d({\varvec{x}}^\mathrm{L}, {\varvec{x}}^\mathrm{s}), {\varvec{x}}\rangle |}{\Vert f_d({\varvec{x}}^\mathrm{L}, {\varvec{x}}^\mathrm{s})\Vert _2 \Vert {\varvec{x}}\Vert _2} \le \rho _*\right) =1. \end{aligned}$$
(37)

The result above is in terms of convergence in probability, while the limiting statement in (35) holds almost surely. This is because the state evolution result for AMP with non-separable functions [6, Theorem 1] is obtained in terms of convergence in probability.

3.3 A Special Case: Optimal Linear Combination

Theorem 2 shows that the optimal way to combine \(\hat{\varvec{x}}^\mathrm{L}\) and \(\hat{\varvec{x}}^\mathrm{s}\) is via the Bayes-optimal estimator \(F_*\) for the corresponding scalar problem. If the signal prior \(P_X\) is standard Gaussian, then \(F_*({\varvec{x}}^\mathrm{L}, {\varvec{x}}^\mathrm{s})\) is a linear combination of \({\varvec{x}}^\mathrm{L}\) and \({\varvec{x}}^\mathrm{s}\). In this section, we provide closed-form expressions for the performance of this optimal linear combination.
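To see why the combination is linear in this case, note that if \(X\sim \mathsf{N}(0,1)\), then \((X, X_L, X_s)\) is jointly Gaussian with unit variances and pairwise correlations \({\mathbb E}\{XX_L\}=\rho _L\), \({\mathbb E}\{XX_s\}=\rho _s\) and \({\mathbb E}\{X_LX_s\}=q\), so the conditional expectation (29) reduces to the linear MMSE estimator

$$\begin{aligned} F_*(x_L, x_s) = (\rho _L, \rho _s)\begin{pmatrix} 1 &{} q \\ q &{} 1 \end{pmatrix}^{-1}\begin{pmatrix} x_L \\ x_s \end{pmatrix} = \frac{(\rho _L - q\rho _s)\,x_L + (\rho _s - q\rho _L)\,x_s}{1-q^2}, \end{aligned}$$

whose ratio of coefficients is precisely the optimal coefficient \(\theta _*\) given in (44) below (recall that \({\varvec{x}}^\mathrm{L}\) and \({\varvec{x}}^\mathrm{s}\) are, asymptotically, rescaled versions of \(\bar{\varvec{x}}^\mathrm{L}\) and \(\hat{\varvec{x}}^\mathrm{s}\)).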

For convenience, let us denote the normalized linear estimator as \(\bar{\varvec{x}}^\mathrm{L}\), i.e., \(\bar{\varvec{x}}^\mathrm{L}=\hat{\varvec{x}}^\mathrm{L}/\Vert \hat{\varvec{x}}^\mathrm{L}\Vert _2\). (Recall that the spectral estimator \(\hat{\varvec{x}}^\mathrm{s}\) is already normalized, i.e., \(\Vert \hat{\varvec{x}}^\mathrm{s}\Vert _2=1\).) We consider an estimator \(\hat{\varvec{x}}^\mathrm{c}(\theta )\) of \({\varvec{x}}\), parameterized by \(\theta \in \mathbb {R}\cup \{\pm \infty \}\), defined as follows:

$$\begin{aligned} \hat{\varvec{x}}^\mathrm{c}(\theta ):= \theta \bar{\varvec{x}}^\mathrm{L}+ \hat{\varvec{x}}^\mathrm{s},\quad \theta \in \mathbb {R}\cup \{\pm \infty \}, \end{aligned}$$
(38)

where we use the convention that \(\hat{\varvec{x}}^\mathrm{c}(\theta )=\pm \bar{\varvec{x}}^\mathrm{L}\) for \(\theta =\pm \infty \).

Let us now compute the asymptotic performance of the proposed estimator \(\hat{\varvec{x}}^\mathrm{c}(\theta )\) in (38). Specifically, using Lemmas 1 and 2, it follows immediately that

$$\begin{aligned} \left\langle \hat{\varvec{x}}^\mathrm{c}(\theta ), \, \frac{{\varvec{x}}}{\left\Vert {{\varvec{x}}}\right\Vert _2} \right\rangle {\mathop {\longrightarrow }\limits ^{{\small \mathrm{a.s.}}}}\theta \cdot \rho _L + \rho _s \,. \end{aligned}$$
(39)

To obtain the limit of the normalized correlation \(\frac{{\langle }\hat{\varvec{x}}^\mathrm{c}(\theta ), {\varvec{x}}{\rangle }}{\left\Vert {\hat{\varvec{x}}^\mathrm{c}(\theta )}\right\Vert _2\left\Vert {{\varvec{x}}}\right\Vert _2}\), it remains to compute the norm of the combined estimator:

$$\begin{aligned} \left\Vert {\hat{\varvec{x}}^\mathrm{c}(\theta )}\right\Vert _2^2 = \theta ^2 \Vert \bar{\varvec{x}}^\mathrm{L}\Vert _2^2 + \left\Vert {\hat{\varvec{x}}^\mathrm{s}}\right\Vert _2^2 + 2 \theta {\langle }\bar{\varvec{x}}^\mathrm{L}, \hat{\varvec{x}}^\mathrm{s}{\rangle } = \theta ^2+1+ 2 \theta {\langle }\bar{\varvec{x}}^\mathrm{L}, \hat{\varvec{x}}^\mathrm{s}{\rangle }. \end{aligned}$$
(40)

This is possible thanks to the following result, which gives the correlation between the linear and the spectral estimator as well as the asymptotic performance of the linear combination \(\hat{\varvec{x}}^\mathrm{c}(\theta )\).

Corollary 1

(Performance of linear combination) Consider the setting of Theorem 1. Then, as \(n\rightarrow \infty \),

$$\begin{aligned} \frac{\langle \hat{\varvec{x}}^\mathrm{L}, \hat{\varvec{x}}^\mathrm{s}\rangle }{\Vert {\hat{\varvec{x}}^\mathrm{L}}\Vert _2 \, \left\Vert {\hat{\varvec{x}}^\mathrm{s}}\right\Vert _2} {\mathop {\longrightarrow }\limits ^{{\small \mathrm{a.s.}}}}q, \end{aligned}$$
(41)

where q is given by (25). Furthermore, let \(\hat{\varvec{x}}^\mathrm{c}(\theta )\) be the estimator defined in (38) parameterized by \(\theta \in \mathbb {R}\). Then, as \(n\rightarrow \infty \),

$$\begin{aligned} \frac{{\langle }\hat{\varvec{x}}^\mathrm{c}(\theta ), {\varvec{x}}{\rangle }}{\left\Vert {\hat{\varvec{x}}^\mathrm{c}(\theta )}\right\Vert _2\left\Vert {{\varvec{x}}}\right\Vert _2} {\mathop {\longrightarrow }\limits ^{{\small \mathrm{a.s.}}}}\frac{\theta \rho _L + {\rho _s}}{\sqrt{1+\theta ^2+2\theta q}}=:F(\theta ). \end{aligned}$$
(42)

Proof

The limit of the correlation (41) follows by applying Theorem 1 with the PL(2) function \(\psi (a, b, c)=bc\) and using that \(\Vert \hat{\varvec{x}}^\mathrm{L}\Vert _2 {\mathop {\longrightarrow }\limits ^{{\small \mathrm{a.s.}}}}n_L\). The result (42) then follows from (40) and (41), recalling that \({\langle }\bar{\varvec{x}}^\mathrm{L}, \hat{\varvec{x}}^\mathrm{s}{\rangle } = \frac{\langle \hat{\varvec{x}}^\mathrm{L}, \hat{\varvec{x}}^\mathrm{s}\rangle }{\Vert {\hat{\varvec{x}}^\mathrm{L}}\Vert _2 \, \left\Vert {\hat{\varvec{x}}^\mathrm{s}}\right\Vert _2}\) and \(\Vert \hat{\varvec{x}}^\mathrm{s}\Vert _2=1\).

Using (23), the parameter q can be alternatively expressed in terms of \(\rho _L\) and \(\rho _s\) in the following compact form:

$$\begin{aligned} q = \rho _L \cdot \rho _s \cdot {{\mathbb E}\left\{ \frac{Z_L G}{1-\frac{1}{\lambda _\delta ^*} Z_s} \right\} }\Big /{{\mathbb E}\left\{ Z_L G\right\} }\,. \end{aligned}$$
(43)

Observe also that \(F(0)=\rho _s\) and \(F(\pm \infty ):=\lim _{\theta \rightarrow \pm \infty }F(\theta )=\pm \rho _L\).

Using the characterization of Corollary 1, we can compute the combination coefficient \(\theta _*\) that leads to the asymptotically optimal linear combination of the form (38).

Corollary 2

(Optimal linear combination) Recall the notation in Corollary 1 and define

$$\begin{aligned} \theta _*=\frac{\rho _L-\rho _s q}{\rho _s-\rho _L q}\in \mathbb {R}\cup \{\pm \infty \}. \end{aligned}$$
(44)

Assume \(|q|<1\). Then, for all \(\theta \in \mathbb {R}\cup \{\pm \infty \}\), it holds that

$$\begin{aligned} |F(\theta )| \le F(\theta _*)= \sqrt{\frac{\rho _s^2+\rho _L^2-2q\rho _L\rho _s}{1-q^2}}\,. \end{aligned}$$
(45)

The proof of Corollary 2 is deferred to Appendix B. Let us now comment on the assumption \(|q|<1\). If \(\hat{\varvec{x}}^\mathrm{L}\) and \(\hat{\varvec{x}}^\mathrm{s}\) are perfectly correlated (i.e., \(|q|=1\)), then it is clear that the combined estimator \(\hat{\varvec{x}}^\mathrm{c}(\theta )\) cannot improve the performance for any value of \(\theta \). On the contrary, when \(|q|<1\), Corollary 2 characterizes when the linear combination \(\hat{\varvec{x}}^\mathrm{c}\) strictly improves upon the performance of the individual estimators \(\hat{\varvec{x}}^\mathrm{L}\) and \(\hat{\varvec{x}}^\mathrm{s}\). Specifically, by denoting

$$\begin{aligned} \rho _{\max } := \max \{|\rho _L|,\rho _s\}\qquad \text {and}\qquad p := {\left\{ \begin{array}{ll} \rho _s/\rho _L, &{}~\text {if}~ |\rho _L|\ge \rho _s, \\ \rho _L/\rho _s, &{}~\text {else}, \end{array}\right. } \end{aligned}$$
(46)

the right-hand side of (45) can be rewritten as

$$\begin{aligned} F(\theta _*) = \rho _{\max }\sqrt{1+\frac{(p-q)^2}{1-q^2}}\,, \end{aligned}$$
(47)

from which it can be readily checked that \(F(\theta _*)>\rho _{\max }\) provided that \(|q|<1\) and \(q\ne p\).
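As a quick numerical illustration, once \(\rho _L\), \(\rho _s\) and q are available (analytically or via Monte Carlo as in Sect. 2), the optimal coefficient (44), the correlation (45) and the percentage gain plotted in Fig. 1c follow directly; the numerical values below are placeholders.

```python
import numpy as np

rho_L, rho_s, q = 0.55, 0.70, 0.35   # placeholder values for illustration

theta_star = (rho_L - rho_s * q) / (rho_s - rho_L * q)                          # (44)
F_star = np.sqrt((rho_s**2 + rho_L**2 - 2 * q * rho_L * rho_s) / (1 - q**2))    # (45)

rho_max = max(abs(rho_L), rho_s)
gain_pct = (F_star - rho_max) / rho_max * 100.0    # percentage gain, as in Fig. 1c
print(theta_star, F_star, gain_pct)
```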

Remark 7

(Optimization of preprocessing functions) The linear estimator \(\hat{\varvec{x}}^\mathrm{L}\) and the spectral estimator \(\hat{\varvec{x}}^\mathrm{s}\) use the preprocessing functions \(\mathcal {T}_L\) (cf. (3)) and \(\mathcal {T}_s\) (cf. (4)), respectively. The choice of these functions naturally affects the performance of the two estimators, as well as that of the combined estimator \(F_*({\varvec{x}}^\mathrm{L}, {\varvec{x}}^\mathrm{s})\). Lemmas 1 and 2, Theorem 2 and Corollary 2 derive sharp asymptotics on the estimation performance that hold for any choice of the preprocessing functions \(\mathcal {T}_L\) and \(\mathcal {T}_s\) satisfying our technical assumptions (A1), (A2), (A3), (B2). In Appendix C, we briefly discuss how these results can be used to obtain such optimal choices. Specifically, the optimal preprocessing function that maximizes the normalized correlation of the spectral estimator is derived in [35], see Appendix C.2. The optimal choice for the linear estimator is much easier to obtain and we outline the process in Appendix C.1. In Appendix C.3, we combine these two results to derive a precise characterization of the sampling regimes in which the linear estimator surpasses the spectral estimator, and vice versa. Finally, in Appendix C.4, we use Corollary 2 to cast the problem of optimally choosing \(\mathcal {T}_L\) and \(\mathcal {T}_s\) to maximize the correlation of the combined estimator \(\hat{\varvec{x}}^\mathrm{c}(\theta _*)\) as a function optimization problem. Solving the latter is beyond the scope of this paper, and it represents an intriguing future research direction.

4 Numerical Simulations

This section validates our theory via numerical experiments and provides further insights on the benefits of the proposed combined estimator.

First, we consider a setting in which the signal \({\varvec{x}}\) is uniformly distributed on the d-dimensional sphere. In this case, the limiting empirical distribution \(P_X\) is Gaussian. Thus, the Bayes-optimal estimator \(F_*({\varvec{x}}^\mathrm{L},{\varvec{x}}^\mathrm{s})\) in (29) is linear and is given by \(\hat{\varvec{x}}^\mathrm{c}(\theta _*)\), where \(\theta _*\) is determined in Corollary 2. For this scenario, we study in Figs. 2, 3 and 4 the performance gain of \(\hat{\varvec{x}}^\mathrm{c}(\theta _*)\) for three different measurement models and for various noise levels.

Second, in Fig. 5 we consider a setting in which the entries of \({\varvec{x}}\) are binary, drawn i.i.d. from the set \(\{+1, -1 \}\), so that the empirical distribution is of the form \(P_X(1) = 1 - P_X(-1) = \mathsf{p}\), for some \(\mathsf{p}\in (0,1)\). For this case, we compute the Bayes-optimal estimator \(F_*({\varvec{x}}^\mathrm{L},{\varvec{x}}^\mathrm{s})\) and compare it with the optimal linear combination \(\hat{\varvec{x}}^\mathrm{c}(\theta _*)\) for various choices of output functions.

4.1 Optimal Linear Combination

In Fig. 2, we fix the input–output relation as \(y_i = f({\langle }{\varvec{a}}_i, {\varvec{x}}{\rangle }) + \sigma z_i\), with \(z_i\sim _{i.i.d.} \mathcal {N}(0,1)\) and \(f(x)=\max \{x,-0.4x\}\) (cf. Fig. 2a), and we investigate the performance gain of the proposed combined estimator for different values of the noise variance \(\sigma ^2\). Here, \({\varvec{x}}\) is generated as a standard Gaussian vector, normalized so that \(\Vert {\varvec{x}}\Vert _2=\sqrt{d}\). Also, \({\varvec{a}}_i \ \sim _{i.i.d.} \ \mathsf{N}({\varvec{0}}_d, {\varvec{I}}_d/d)\) and the preprocessing functions are chosen as \({\mathcal {T}}_L(y) = y\) and \({\mathcal {T}}_s(y) = \min \{y,3.5\}\). Note that the empirical distribution of \({\varvec{x}}\) tends to a standard Gaussian distribution in the high-dimensional limit. Thus, following Sect. 3.3, the optimal combined estimator is (asymptotically) linear and is given by (38) for \(\theta =\theta _*\) chosen as in (44). In Fig. 2b, we plot the percentage improvement \(\frac{\rho _*-\rho _{\max }}{\rho _{\max }}\times 100\) as a function of the sampling ratio \(\delta \), for three values of the noise variance \(\sigma ^2=0, 0.4\), and 0.8. Here, \(\rho _*=F(\theta _*)\) is defined in (45) and \(\rho _{\max }=\max \{|\rho _L|,\rho _s\}\). We observe that, as \(\sigma ^2\) increases, larger values of the sampling ratio \(\delta \) are needed for the combined estimator to improve upon the linear and spectral estimators. However, for sufficiently large \(\delta \), the benefit of the combined estimator is more pronounced for larger values of the noise variance. For instance, for \(\sigma ^2=0.8\) and large values of \(\delta \), the percentage gain is larger than \(10\%\). In Fig. 2c, we fix \(\sigma ^2=0.4\) and plot the correlations \(\rho _L, \rho _s\) and \(\rho _*\). The solid lines correspond to the theoretical predictions obtained by Lemma 1, Lemma 2 and Corollary 2, respectively. The theoretical predictions are compared against the results of Monte Carlo simulations. For the simulations, we used \(d=1000\) and averaged over 15 independent problem realizations. In Fig. 2c (Middle), we also plot the limit q (cf. (25)) of the cross-correlation \(\frac{{\langle }{\varvec{x}}^\mathrm{L}, {\varvec{x}}^\mathrm{s}{\rangle }}{\Vert {\varvec{x}}^\mathrm{L} \Vert \Vert {\varvec{x}}^\mathrm{s} \Vert }\) and the ratio p in (46). The corresponding values of the optimal combination coefficient \(\theta _*\) are plotted in Fig. 2c (Right). For values of \(\delta \) smaller than the threshold for weak recovery of the spectral method (where \(\rho _s=0\)), we observe that \(\rho _*=\rho _L\) and \(\theta _*=\infty \). However, for larger values of \(\delta \), the value of the optimal coefficient \(\theta _*\) is non-trivial. Finally, Fig. 2d shows the same plots as in Fig. 2c, but for \(\sigma ^2=0.8\).

The setting of Fig. 3 is the same as in Fig. 2, only now the input–output function is chosen as \(f(x)=|x|\cdot \mathbf {1}_{\{|x|\ge 1.5\}} + x\cdot \mathbf {1}_{\{|x|<1.5\}}\). Comparing Fig. 3b to Fig. 2b, note that the benefit of the combined estimator is more significant for the link function studied here. Moreover, in contrast to Fig. 2b, here, the percentage gain of the combined estimator takes its maximum value in the noiseless case: \(\sigma ^2=0\).

In Fig. 4, we repeat the experiments of Figs. 2 and 3, but for \(f(x)= 1+ 0.3x + (x^2-1)\). Compared to the two functions studied in Figs. 2 and 3, in Fig. 4 we observe that the performance gain is significantly larger and reaches values up to \(\sim 30\%\). This can be argued by considering the expansion of the input–output function in the basis of Hermite polynomials, i.e., \(f(x) = \sum _{i=0}^\infty H_i h_i(x)\), where \(h_i(x)\) is the ith-order Hermite polynomial with leading coefficient 1 and \(H_i = \frac{1}{i!}{\mathbb E}_{G\sim \mathcal {N}(0,1)}\{f(G)h_i(G)\}\). Specifically, recall that the first three Hermite polynomials are as follows: \(h_0(x)=1\), \(h_1(x)=x\) and \(h_2(x)=x^2-1\). Thus, for \(f(x)= 1+ 0.3x + (x^2-1)\), only the first three coefficients \(\{H_i\}, i=0,1,2\), are non-zero. To see the relevance of these coefficients to the linear and spectral estimators, note that for identity preprocessing functions it holds that \( {\mathbb E}\{GZ_L\} = {\mathbb E}\{GY\} = {\mathbb E}\{Gf(G)\} = H_1 \) and \( {\mathbb E}\{(G^2-1)Z_s\} ={\mathbb E}\{(G^2-1)f(G)\} = 2H_2. \) Thus, it follows directly from Lemma 1 that \(\rho _L=0\) if the first Hermite coefficient \(H_1\) is zero. Similarly, it can be shown using Lemma 2 that \(\rho _s=0\) if the second Hermite coefficient \(H_2\) is zero; in fact, the threshold of weak recovery of the spectral method is infinity in this case (see (167) in Appendix C). Intuitively, the linear and spectral estimators exploit the energy of the output function corresponding to the Hermite polynomials of first- and second-order, respectively; see also [17, 52]. In this example, we have chosen f(x) such that all of its energy is concentrated on the Hermite polynomials of order up to two.

As a final note, from the numerical results in Figs. 2, 3 and 4, we observe that the proposed optimal combination leads to a performance improvement only if both the linear and the spectral estimators are asymptotically correlated with the signal. This is because the signal prior is Gaussian (see Remark 3). In contrast, as we will see in the next section, when the signal prior is binary, the combined estimator provides an improvement even when only the linear estimator is effective.

4.2 Bayes-Optimal Combination

In Fig. 5a, b and Fig. 5c, d, we consider the same settings as in Figs. 3 and 4, respectively. However, here each entry of \({\varvec{x}}\) takes value either \(+1\) or \(-1\). Each entry is chosen independently according to the distribution \(P_X(1) = 1 -P_X(-1) =\mathsf{p}\), for \(\mathsf{p}\in (0,1)\). Thus, the Bayes-optimal combination \(\hat{\varvec{x}}^\mathrm{mmse}=F_*({\varvec{x}}^\mathrm{L},{\varvec{x}}^\mathrm{s})\) is not necessarily linear, in contrast to the Gaussian case. In Appendix D, we compute the Bayes-optimal estimator \(F_*(x_L,x_s)\) (cf. (29)) for the setting considered here. Then, we use the prediction of Theorem 2 to plot, in solid black lines, the normalized correlation of \(\hat{\varvec{x}}^\mathrm{mmse}\) with \({\varvec{x}}\) (i.e., \(\rho _*^\mathrm{mmse} = \rho _*\) in (34)). The theoretical predictions (solid lines) are compared against the results of Monte Carlo simulations (markers). Moreover, we compare the optimal performance against those of the linear estimator (cf. \(\rho _L\)), the spectral estimator (cf. \(\rho _s\)) and the optimal linear combination \(\hat{\varvec{x}}^\mathrm{c}(\theta _*)\) (cf. \(\rho _*^\mathrm{linear}\)). We have chosen \(\mathsf{p}=0.3\) in Fig. 5a, c and \(\mathsf{p}=0.5\) in Fig. 5b, d. Note that the optimal linear combination provides a performance improvement only for the values of \(\delta \) such that \(\rho _L>0\) and \(\rho _s>0\). In contrast, \(\rho _*^\mathrm{mmse}\) is strictly larger than \(\rho _L\) even when \(\rho _s=0\).
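To illustrate the form of the scalar denoiser \(F_*\) in the binary case, the following sketch computes the posterior mean \({\mathbb E}[X \mid X_L, X_s]\) for \(X\in \{+1,-1\}\), \(P_X(1)=\mathsf{p}\), under the working model that \((X_L, X_s)\) given \(X=x\) is jointly Gaussian with mean \((\mu _L x, \mu _s x)\) and covariance \(\Sigma \). The numerical values of \(\mu _L, \mu _s, \Sigma \) below are placeholders, not the state-evolution parameters computed in Appendix D.

```python
import numpy as np

def bayes_binary_denoiser(xL, xs, muL, mus, Sigma, p=0.3):
    """Posterior mean E[X | X_L = xL, X_s = xs] for X in {+1,-1}, P(X=1)=p,
    assuming (X_L, X_s) | X=x ~ N((muL*x, mus*x), Sigma)."""
    Sinv = np.linalg.inv(Sigma)
    m_plus, m_minus = np.array([muL, mus]), -np.array([muL, mus])
    z = np.stack([xL, xs], axis=-1)
    # Unnormalized log-posteriors of X=+1 and X=-1 (Gaussian likelihood times prior)
    lp = np.log(p) - 0.5 * np.einsum('...i,ij,...j->...', z - m_plus, Sinv, z - m_plus)
    lm = np.log(1 - p) - 0.5 * np.einsum('...i,ij,...j->...', z - m_minus, Sinv, z - m_minus)
    w = 1.0 / (1.0 + np.exp(lm - lp))          # posterior probability of X=+1
    return 2.0 * w - 1.0                        # E[X | x_L, x_s]

# Illustrative (placeholder) parameters, not those derived in Appendix D:
Sigma = np.array([[0.8, 0.2], [0.2, 0.6]])
xhat = bayes_binary_denoiser(np.array([0.5, -1.2]), np.array([0.9, -0.3]),
                             muL=0.6, mus=0.7, Sigma=Sigma, p=0.3)
print(xhat)
```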

Fig. 5
Bayes-optimal combination vs. optimal linear combination for a binary prior \(P_X(1)= 1- P_X(-1)=\mathsf{p}\). In a, b (resp. c, d), the setting is otherwise the same as in Fig. 3 (resp. Fig. 4)

Fig. 6
Normalized correlation of gradient descent iterates for different initializations: (i) linear \(\hat{{\varvec{x}}}^\mathrm{L}\), (ii) spectral \(\hat{{\varvec{x}}}^\mathrm{s}\), (iii) optimal linear combination \(\hat{{\varvec{x}}}^\mathrm{linear}\) (Bayes-optimal for a Gaussian signal prior), and (iv) Bayes-optimal combination \(\hat{{\varvec{x}}}^\mathrm{mmse}\). Depicted are means (solid lines) and standard errors (shaded regions) over 10 Monte Carlo realizations. In all cases, \(y_i=f( \langle {\varvec{a}}_i , {\varvec{x}}\rangle )+z_i, i\in [n]\), \(z_i\sim \mathcal {N}(0,0.2)\) and \(d=250\). a \(f(x)=0.3 x + 0.5(x^2-1),\) \(\delta =3\), Gaussian prior. b \(f(x)=0.3 x + 0.5x^2,\) \(\delta =5\), Gaussian prior. c, d \(f(x)=0.3 x + 0.5x^2,\) \(\delta =5\), prior over \(\{+1, -1 \}\) with \(P_X(1)=\mathsf{p}=0.3\) for c and \(P_X(1)=\mathsf{p}=0.1\) for d

4.3 The Combined Estimator as Initialization for Local Algorithms

As mentioned in the introduction, the initial estimates of the signal obtained by either the linear/spectral methods or our proposed combined estimator can then be refined via local algorithms such as gradient descent (GD). The theory and numerical simulations in the previous sections showed the superiority of our proposed combined estimator \(\hat{\varvec{x}}^\mathrm{mmse}=F_*({\varvec{x}}^\mathrm{L},{\varvec{x}}^\mathrm{s})\) over \({\varvec{x}}^\mathrm{L}\) and \({\varvec{x}}^\mathrm{s}\) in terms of correlation with the true signal. In Fig. 6, by plotting the correlation of GD iterates \({\varvec{x}}_t, t\ge 1\) for different initializations, we show numerically how this improvement translates into improved performance of the GD refinement. Specifically, we ran GD on the squared error loss \(\frac{1}{2}\sum _{i\in [n]}\left( y_i-f(\langle {\varvec{a}}_i , {\varvec{x}}\rangle )\right) ^2\) with step size 1/2, that is, \({\varvec{x}}_{t+1}={\varvec{x}}_{t}+\frac{1}{2}\sum _{i\in [n]}f^\prime (\langle {\varvec{a}}_i , {\varvec{x}}_t \rangle )\left( y_i-f(\langle {\varvec{a}}_i , {\varvec{x}}_t \rangle )\right) {\varvec{a}}_i\). Here, f is the output function described in the caption of the figure.
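A minimal sketch of this refinement step is given below (illustrative parameters; a smaller step size than the 1/2 used for Fig. 6 is adopted here for stability of the toy example, and the initialization is a hypothetical stand-in for a combined estimate).

```python
import numpy as np

def gd_refine(A, y, x0, f, fprime, step=0.05, iters=500):
    """Gradient descent on L(x) = 0.5 * sum_i (y_i - f(<a_i, x>))^2,
    started from an initialization x0 (linear, spectral, or combined)."""
    x = x0.copy()
    for _ in range(iters):
        g = A @ x
        x += step * (A.T @ (fprime(g) * (y - f(g))))   # x <- x - step * grad L(x)
    return x

# Hypothetical small-scale example with f(x) = 0.3 x + 0.5 x^2 (cf. Fig. 6b)
rng = np.random.default_rng(2)
d, n = 250, 1250
f = lambda g: 0.3 * g + 0.5 * g ** 2
fprime = lambda g: 0.3 + g
x_true = rng.standard_normal(d)
x_true *= np.sqrt(d) / np.linalg.norm(x_true)
A = rng.standard_normal((n, d)) / np.sqrt(d)
y = f(A @ x_true) + np.sqrt(0.2) * rng.standard_normal(n)
x_init = x_true + rng.standard_normal(d)               # stand-in for a combined init
x_init *= np.sqrt(d) / np.linalg.norm(x_init)
x_hat = gd_refine(A, y, x_init, f, fprime)
corr = lambda u, v: abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(corr(x_init, x_true), corr(x_hat, x_true))       # correlation before / after GD
```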

In Fig. 6a, b, the true signal has i.i.d. Gaussian entries, and thus the linear combined estimator \(\hat{\varvec{x}}^\mathrm{c}\) is the optimal combination in terms of correlation performance. We observe that for two different choices of output function and sampling ratio (see caption), GD with linear or spectral initialization requires hundreds of iterations to reach the performance of the combined estimator. In Fig. 6c, d, the true signal has entries in \(\{+1, -1\}\), chosen independently with \(P_X(1)=\mathsf{p}=0.3\) and \(P_X(1)=\mathsf{p}=0.1\), respectively. Here, the linear combined estimator is sub-optimal (but still improves upon linear/spectral), thus we also compute and study the Bayes-optimal estimator. Interestingly, while GD converges to the same correlation for both priors as t increases, for \(\mathsf{p}=0.1\) it achieves a higher correlation if stopped early. It is a fascinating, albeit challenging, question to better understand the evolution of the GD trajectory as a function of the initialization, signal prior and output function.

5 Proof of Theorem 1

5.1 Proof Sketch

The proof of Theorem 1 is based on the design and analysis of a generalized approximate message passing (GAMP) algorithm. GAMP is a class of iterative algorithms proposed by Rangan [44] for estimation in generalized linear models. A GAMP algorithm is defined in terms of a sequence of Lipschitz functions \(f_t:\mathbb R\rightarrow \mathbb R\) and \(g_t:\mathbb R \times \mathbb {R}\rightarrow \mathbb R\), for \(t \ge 0\). For \(t\ge 0\), the GAMP iteration computes:

$$\begin{aligned} \begin{aligned} {\varvec{u}}^t&= \frac{1}{\sqrt{\delta }}{\varvec{A}}f_t({\varvec{v}}^t)-\mathsf{b}_t g_{t-1}({\varvec{u}}^{t-1}; {\varvec{y}}),\\ {\varvec{v}}^{t+1}&= \frac{1}{\sqrt{\delta }}{\varvec{A}}^\mathsf{T}g_t({\varvec{u}}^t; {\varvec{y}}) - \mathsf{c}_t f_t({\varvec{v}}^t). \end{aligned} \end{aligned}$$
(48)

Here, the functions \(f_t\) and \(g_t\) are understood to be applied component-wise, i.e., \(f_t({\varvec{v}}^t)=(f_t(v^t_1)\), \(\ldots , f_t(v^t_d))\) and \(g_t({\varvec{u}}^t; {\varvec{y}})=(g_t(u^t_1; y_1), \ldots , g_t(u^t_n; y_n))\). The scalars \(\mathsf{b}_t, \mathsf{c}_t\) are defined as

$$\begin{aligned} \mathsf{b}_t =\frac{1}{n}\sum _{i=1}^d f_t'(v_i^t), \qquad \mathsf{c}_t = \frac{1}{n}\sum _{i=1}^n g_t'(u_i^t; y_i), \end{aligned}$$
(49)

where \(g_t'(\cdot , \cdot )\) denotes the derivative with respect to the first argument. The iteration (48) is initialized with

$$\begin{aligned} {\varvec{u}}^{0}= c {\varvec{1}}_n, \qquad {\varvec{v}}^1 = \frac{1}{\sqrt{\delta }} {\varvec{A}}^{\mathsf{T}} g_{0}({\varvec{u}}^{0} ; {\varvec{y}}), \end{aligned}$$
(50)

for some constant \(c>0\). Here, \({\varvec{1}}_n \in \mathbb {R}^n\) denotes the all-ones vector.
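For concreteness, a direct transcription of (48)–(50) into code might look as follows. This is a sketch assuming time-invariant functions f and g; in the construction used for the proof, \(f_t\) and \(g_t\) depend on t, and the example choices of f and g at the end are purely hypothetical.

```python
import numpy as np

def gamp(A, y, f, fprime, g, gprime, T=20, c=1.0):
    """Sketch of the GAMP iteration (48)-(50) with time-invariant f, g."""
    n, d = A.shape
    delta = n / d
    u = c * np.ones(n)                                   # u^0 = c 1_n, cf. (50)
    g_prev = g(u, y)                                     # g_0(u^0; y)
    v = A.T @ g_prev / np.sqrt(delta)                    # v^1, cf. (50)
    for t in range(1, T + 1):
        b = fprime(v).sum() / n                          # b_t, cf. (49)
        u = A @ f(v) / np.sqrt(delta) - b * g_prev       # u^t, cf. (48)
        c_t = gprime(u, y).sum() / n                     # c_t, cf. (49)
        g_prev = g(u, y)                                 # g_t(u^t; y)
        v = A.T @ g_prev / np.sqrt(delta) - c_t * f(v)   # v^{t+1}, cf. (48)
    return u, v

# Example usage with simple (hypothetical) choices of f and g
rng = np.random.default_rng(0)
n, d = 2000, 400
A = rng.standard_normal((n, d)) / np.sqrt(d)
y = (A @ rng.standard_normal(d)) ** 2
u, v = gamp(A, y, f=np.tanh, fprime=lambda v: 1 - np.tanh(v) ** 2,
            g=lambda u, y: y * np.tanh(u), gprime=lambda u, y: y * (1 - np.tanh(u) ** 2))
```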

A key feature of the GAMP algorithm (48) is that the asymptotic empirical distribution of its iterates can be succinctly characterized via a deterministic recursion, called state evolution. Hence, the performance of the high-dimensional problem involving the iterates \({\varvec{u}}^t, {\varvec{v}}^t\) is captured by a scalar recursion. Specifically, this result gives that for \(t \ge 1\), the empirical distributions of \({\varvec{u}}^t\) and \({\varvec{v}}^t\) converge in \(W_k\) distance to the laws of the random variables \(U_t\) and \(V_t\), respectively, with

$$\begin{aligned} U_t&\equiv \mu _{U,t} G + \sigma _{U,t} W_{U,t}, \end{aligned}$$
(51)
$$\begin{aligned} V_t&\equiv \mu _{V,t} X + \sigma _{V,t} W_{V,t}, \end{aligned}$$
(52)

where \((G, W_{U,t}) \sim _{\text {i.i.d.}} \mathsf{N}(0,1)\). Similarly, \(X \sim P_X\) and \(W_{V,t} \sim \mathsf{N}(0,1)\) are independent. The deterministic parameters \((\mu _{U,t}, \sigma _{U,t}\), \(\mu _{V,t}, \sigma _{V,t} )\) are computed via the recursion (56) detailed in Sect. 5.2, and the formal statement of the state evolution result is contained in Proposition 1 (again in Sect. 5.2).

Next, in Sect. 5.3, we show that a suitable choice of the functions \(f_t, g_t\) leads to a GAMP algorithm such that (i) \({\varvec{v}}^1 = \sqrt{d} \, \hat{\varvec{x}}^\mathrm{L}\), and (ii) \({\varvec{v}}^t\) is aligned with \(\sqrt{d} \, \hat{\varvec{x}}^\mathrm{s}\) as t grows large. Specifically, choosing \(g_0(u; y)=\mathcal T_L(y)/\sqrt{\delta }\) in (50) immediately gives \({\varvec{v}}^1 = \sqrt{d} \, \hat{\varvec{x}}^\mathrm{L}\). In order to approach the spectral estimator as \(t\rightarrow \infty \), we pick \(f_t, g_t\) to be linear functions; see (68). The idea is that, with this choice of \(f_t\) and \(g_t\), the GAMP iteration is effectively a power method. Let us now briefly discuss why this is the case. With the choice of \(f_t, g_t\) in (68), the GAMP iteration can be expressed as

$$\begin{aligned} \begin{aligned} {\varvec{u}}^t&= \frac{1}{\sqrt{\delta } \, \beta _t} \big [ {\varvec{A}}{\varvec{v}}^t \, - \, {\varvec{Z}}{\varvec{u}}^{t-1} \big ], \\ {\varvec{v}}^{t+1}&= {\varvec{A}}^\mathsf{T}{\varvec{Z}}{\varvec{u}}^t - \frac{\sqrt{\delta }}{\beta _t} \, {\mathbb E}\{ Z \} \, {\varvec{v}}^t, \end{aligned} \end{aligned}$$
(53)

where \({\varvec{Z}}= \mathrm{diag}({\mathcal {T}}(y_1), \ldots , {\mathcal {T}}(y_n))\), \(Z={\mathcal {T}}(Y)\), and the function \(\mathcal T:\mathbb R\rightarrow \mathbb R\) is defined later in terms of the spectral preprocessing function \({\mathcal {T}}_s\); see (82). Then, Lemma 4 analyzes the fixed points of the state evolution of the GAMP algorithm (53), and Lemma 6 proves that in the high-dimensional limit, the vector \({\varvec{v}}^t\) tends to align with the principal eigenvector of the matrix \({\varvec{M}}_n= {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+\delta \mathbb E\{Z(G^2-1)\}{\varvec{I}}_n)^{-1} {\varvec{A}}\) as \(t \rightarrow \infty \). Here, we provide a heuristic sanity check for this last claim. Assume that the iterates \({\varvec{v}}^t\) and \({\varvec{u}}^t\) converge to the limits \({\varvec{v}}^\infty \) and \({\varvec{u}}^\infty \), respectively, in the sense that \(\lim _{t\rightarrow \infty }\frac{1}{d}\Vert {\varvec{v}}^t-{\varvec{v}}^\infty \Vert ^2=0\) and \(\lim _{t\rightarrow \infty }\frac{1}{n}\Vert {\varvec{u}}^t-{\varvec{u}}^\infty \Vert ^2=0\). Then, from (53), the limits \({\varvec{v}}^\infty \) and \({\varvec{u}}^\infty \) satisfy

$$\begin{aligned} \begin{aligned} {\varvec{u}}^\infty&= \frac{1}{\sqrt{\delta } \, \beta _\infty } \big [ {\varvec{A}}{\varvec{v}}^\infty \, - \, {\varvec{Z}}{\varvec{u}}^{\infty } \big ], \\ {\varvec{v}}^{\infty }&= {\varvec{A}}^\mathsf{T}{\varvec{Z}}{\varvec{u}}^\infty - \frac{\sqrt{\delta }}{\beta _\infty } \, {\mathbb E}\{ Z \} \, {\varvec{v}}^\infty . \end{aligned} \end{aligned}$$
(54)

Furthermore, from the analysis of the fixed points of state evolution of Lemma 4, we obtain that \(\beta _\infty =\sqrt{\delta }\mathbb E\{Z(G^2-1)\}\). Thus, after some manipulations, (54) can be rewritten as

$$\begin{aligned} \begin{aligned} {\varvec{u}}^\infty&= ({\varvec{Z}}+\delta \mathbb E\{Z(G^2-1)\}{\varvec{I}}_n)^{-1} {\varvec{A}}{\varvec{v}}^\infty \\ \left( 1+\frac{ {\mathbb E}\{ Z \}}{\mathbb E\{Z(G^2-1)\}} \,\right) {\varvec{v}}^{\infty }&= {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+\delta \mathbb E\{Z(G^2-1)\}{\varvec{I}}_n)^{-1} {\varvec{A}}{\varvec{v}}^\infty . \end{aligned} \end{aligned}$$
(55)

Hence, \({\varvec{v}}^\infty \) is an eigenvector of the matrix \({\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+\delta \mathbb E\{Z(G^2-1)\}{\varvec{I}}_n)^{-1} {\varvec{A}}\), and the GAMP algorithm is effectively performing a power method. Finally, we choose \(\mathcal T\) (and consequently \({\varvec{Z}}\)) so that \({\varvec{Z}}({\varvec{Z}}+\delta \mathbb E\{Z(G^2-1)\}{\varvec{I}}_n)^{-1}={\varvec{Z}}_s\), with \({\varvec{Z}}_s\) given by (12). In this way, the matrix \({\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+\delta \mathbb E\{Z(G^2-1)\}{\varvec{I}}_n)^{-1} {\varvec{A}}\) coincides with the spectral matrix \({\varvec{D}}_n\), as defined in (13), and therefore \({\varvec{v}}^t\) approaches the spectral estimator \(\hat{\varvec{x}}^\mathrm{s}\).

In conclusion, as \({\varvec{v}}^1 = \sqrt{d} \hat{\varvec{x}}^\mathrm{L}\) and \({\varvec{v}}^t\) tends to be aligned with \(\sqrt{d} \hat{\varvec{x}}^\mathrm{s}\) for large t, we can use the state evolution result of Proposition 1 to analyze the joint empirical distribution of \(({\varvec{x}}, \sqrt{d} \hat{\varvec{x}}^\mathrm{L}, \sqrt{d} \hat{\varvec{x}}^\mathrm{s})\). At this point, the proof of Theorem 1 becomes a straightforward application of Lemma 6 and is presented at the end of Sect. 5.3. The proof of Lemma 6 is quite long, so it is provided separately in Sect. 5.4.

5.2 State Evolution for Generalized Approximate Message Passing (GAMP)

For \(t \ge 1\), let us define the following deterministic recursion for the parameters \((\mu _{U,t}, \sigma _{U,t}\), \(\mu _{V,t}, \sigma _{V,t} )\) appearing in (51)–(52):

$$\begin{aligned} \begin{aligned} \mu _{U,t}&= \frac{1}{\sqrt{\delta }} {\mathbb E}\{ X f_t(V_t) \}, \\ \sigma _{U,t}^2&= \frac{1}{\delta } {\mathbb E}\Big \{ \big ( f_t(V_t) - X {\mathbb E}\{ X f_t(V_t)\} \big )^2 \Big \} = \frac{1}{\delta } {\mathbb E}\Big \{ \big ( f_t(V_t) - \sqrt{\delta }\, \mu _{U,t} X \big )^2 \Big \}, \\ \mu _{V,t+1}&=\sqrt{\delta } {\mathbb E}\{ G g_t(U_t; Y) \} \, - \, {\mathbb E}\{ g_t'(U_t;Y) \} {\mathbb E}\{ X f_t(V_t) \}, \\ \sigma _{V,t+1}^2&= {\mathbb E}\{ g_t(U_t; Y)^2 \}. \end{aligned} \end{aligned}$$
(56)

Recalling from (50) that \({\varvec{u}}^0 = c {\varvec{1}}_n\), the state evolution recursion is initialized with

$$\begin{aligned} \begin{aligned}&\mu _{V,1} = \sqrt{\delta } \, {\mathbb E}\{ g_0(c ; Y) G \}, \qquad \sigma _{V,1}^2 = {\mathbb E}\{ g_0(c; Y)^2 \}. \end{aligned} \end{aligned}$$
(57)

Furthermore, let each of the sequences of random variables \((W_{U,t})_{t \ge 0}\) and \((W_{V,t})_{t \ge 0}\) be jointly Gaussian with zero mean and covariance defined as follows [6, 47]. First, we have:

$$\begin{aligned} {\mathbb E}\{W_{V,1} W_{V,t} \} = \frac{1}{\sigma _{V,1} \, \sigma _{V,t}} {\mathbb E}\left\{ g_{0}(c; Y) \, g_{t-1}(\mu _{U,t-1} G + \sigma _{U,t-1} W_{U,t-1}; Y) \right\} , \qquad t \ge 2. \end{aligned}$$
(58)

Then, for \(r, t \ge 1\), we iteratively compute:

$$\begin{aligned} {\mathbb E}\{W_{U,r} W_{U,t} \} = \frac{1}{\sigma _{U,r} \, \sigma _{U,t}} \cdot \frac{1}{\delta }{\mathbb E}\Big \{ \big ( f_r(\mu _{V,r} X + \sigma _{V,r} W_{V,r}) - X \sqrt{\delta }\, \mu _{U,r}\big ) \big ( f_t(\mu _{V,t} X + \sigma _{V,t} W_{V,t}) - X \sqrt{\delta }\, \mu _{U,t}\big ) \Big \}, \end{aligned}$$
(59)
$$\begin{aligned}&{\mathbb E}\{W_{V,r+1} W_{V,t+1} \} = \frac{1}{\sigma _{V,r+1} \, \sigma _{V,t+1}} {\mathbb E}\left\{ g_{r}(\mu _{U,r} G + \sigma _{U,r} W_{U,r}; Y) \, g_{t}(\mu _{U,t} G + \sigma _{U,t} W_{U,t}; Y) \right\} . \end{aligned}$$
(60)

Note that for \(r=t\), by (56) we have \({\mathbb E}\{W_{U,t}^2 \} = {\mathbb E}\{W_{V,t}^2 \}=1\).

At this point, we are ready to present the state evolution result [27, 44] for the GAMP algorithm (48)-(49).

Proposition 1

(State Evolution) Consider the GAMP iteration in Eqs. (48)–(49), with initialization \({\varvec{u}}^{0} = c {\varvec{1}}_n \in \mathbb {R}^n\), for any constant \(c>0\). Assume that for \(t \ge 0\), the functions \(g_t: \mathbb {R}\times \mathbb {R}\rightarrow \mathbb {R}\) and \(f_t: \mathbb {R}\rightarrow \mathbb {R}\) are Lipschitz, and that Assumption (B1) on p.6 holds for some \(k \ge 2\).

Then, the following holds almost surely for any PL(k) function \(\psi : \mathbb {R}^{t+1} \rightarrow \mathbb {R}\), for \(t \ge 1\):

$$\begin{aligned}&\lim _{n \rightarrow \infty }\frac{1}{n} \sum _{i=1}^n \psi (g_i, u^t_i, u^{t-1}_i, \ldots , u^1_{i} ) = {\mathbb E}\{ \psi (G, \, U_t, \, U_{t-1}, \, \ldots , U_1) \}, \end{aligned}$$
(61)
$$\begin{aligned}&\lim _{d \rightarrow \infty } \frac{1}{d} \sum _{i=1}^d \psi (x_i, v^{t}_i, v^{t-1}_i, \ldots , v^1_i) = {\mathbb E}\{ \psi (X, \, V_{t}, \, V_{t-1}, \, \ldots , V_1) \}, \end{aligned}$$
(62)

where the distributions of the random vectors \((G, U_t, \ldots , U_1)\) and \((X, V_t, \ldots , V_1)\) are given by the state evolution recursion in Eqs. (51)–(60).

Remark 8

Suppose that we have a sequence of PL(k) functions \(\psi _t: \mathbb {R}^{t+1} \rightarrow \mathbb {R}\) (indexed by t) such that

$$\begin{aligned} \begin{aligned}&\lim _{t \rightarrow \infty } {\mathbb E}\{ \psi _t(G, \, U_{t}, \, U_{t-1}, \, \ldots , U_1)\} = c_U, \\&\lim _{t \rightarrow \infty } {\mathbb E}\{ \psi _t(X, \, V_{t}, \, V_{t-1}, \, \ldots , V_1)\} = c_V, \end{aligned} \end{aligned}$$
(63)

for some constants \(c_U, c_V \in \mathbb {R}\). Then, since the statements (61) and (62) hold with probability 1 for each fixed \(t \ge 1\), we have that, almost surely,

$$\begin{aligned}&\lim _{t\rightarrow \infty } \lim _{n \rightarrow \infty } \frac{1}{n} \sum _{i=1}^n \psi _t(g_i, u^{t}_i, \ldots , u^1_i) = c_U, \end{aligned}$$
(64)
$$\begin{aligned}&\lim _{t\rightarrow \infty } \lim _{d \rightarrow \infty } \frac{1}{d} \sum _{i=1}^d \psi _t(x_i, v^{t}_i, \ldots , v^1_i) = c_V. \end{aligned}$$
(65)

5.3 GAMP as a Power Method to Compute the Spectral Estimator

Consider the GAMP iteration in (48) initialized with \({\varvec{u}}^0 = \frac{1}{\delta } {\varvec{1}}_n\), and the function \(g_0:\mathbb {R}\times \mathbb {R}\rightarrow \mathbb {R}\) chosen as

$$\begin{aligned} g_0(u; \, y) = \frac{{\mathcal {T}}_L(y)}{\sqrt{\delta }}, \end{aligned}$$
(66)

so that

$$\begin{aligned} {\varvec{v}}^1 = \frac{1}{\delta }{\varvec{A}}^{\mathsf{T}} {\mathcal {T}}_L({\varvec{y}}). \end{aligned}$$
(67)

(The function \(f_0\) is assumed to be zero.) From (10), we note that \({\varvec{v}}^1 = \sqrt{d} \, \hat{\varvec{x}}^\mathrm{L}\).

For \(t \ge 1\), let the functions \(g_t: \mathbb {R}\times \mathbb {R}\rightarrow \mathbb {R}\) and \(f_t:\mathbb {R}\rightarrow \mathbb {R}\) be chosen as

$$\begin{aligned} g_t(u; \, y) = \sqrt{\delta } \, u {\mathcal {T}}(y), \qquad f_{t}(v) = \frac{v}{\beta _{t}}, \end{aligned}$$
(68)

where the function \({\mathcal {T}}: \mathbb {R}\rightarrow \mathbb {R}\) is bounded and Lipschitz, and \(\beta _t\) is a constant, defined iteratively for \(t \ge 1\) via the state evolution equations below (Eqs. (72)–(74)). To prove Theorem 1, we will choose \({\mathcal {T}}\) as a suitable function of \({\mathcal {T}}_s\) (see (82)). Note that the functions \(g_t\) and \(f_t\) are required to be Lipschitz for \(t\ge 0\), and this will be ensured by choosing \({\mathcal {T}}\) to be bounded and Lipschitz (see Lemma 6).

With this choice of \(f_t, g_t\), for \(t\ge 1\), the scalars in (49) are given by

$$\begin{aligned} \mathsf{b}_t =\frac{1}{\delta \beta _t}, \qquad \mathsf{c}_t = \sqrt{\delta } \cdot \frac{1}{n}\sum _{i=1}^n {\mathcal {T}}(y_i). \end{aligned}$$
(69)

In the GAMP iteration below, we will replace the parameter \(\mathsf{c}_t\) by its almost sure limit \(\bar{\mathsf{c}}_t = \sqrt{\delta } \, {\mathbb E}\{Z\}\), where \(Z \triangleq {\mathcal {T}}(Y)\). The state evolution result in Proposition 1 still holds when \(\mathsf{c}_t\) is replaced with \(\bar{\mathsf{c}}_t\) in the GAMP iteration [27, 44]. This can be shown using the pseudo-Lipschitz property of the test functions \(\psi \) in Eqs. (61)–(62) and the fact that \( \lim _{n \rightarrow \infty } \frac{1}{n}\sum _{i=1}^n {\mathcal {T}}(y_i) = {\mathbb E}\{ Z \}\) almost surely, due to the strong law of large numbers.

With these choices, the GAMP iteration in (48) is as follows. Initializing with

$$\begin{aligned} {\varvec{u}}^0 = \frac{1}{\delta } {\varvec{1}}_n, \qquad {\varvec{v}}^1 = \frac{1}{\delta } {\varvec{A}}^{\mathsf{T}} {\mathcal {T}}_L({\varvec{y}}), \end{aligned}$$
(70)

we have for \(t \ge 1\):

$$\begin{aligned} \begin{aligned} {\varvec{u}}^t&= {\left\{ \begin{array}{ll} \frac{1}{\sqrt{\delta } \, \beta _t} \big [ {\varvec{A}}{\varvec{v}}^t \, - \, {\varvec{Z}}_L {\varvec{u}}^{t-1} \big ], &{} \text { for } t=1, \\ \frac{1}{\sqrt{\delta } \, \beta _t} \big [ {\varvec{A}}{\varvec{v}}^t \, - \, {\varvec{Z}}{\varvec{u}}^{t-1} \big ], &{} \text { for } t >1, \end{array}\right. }\\ {\varvec{v}}^{t+1}&= {\varvec{A}}^\mathsf{T}{\varvec{Z}}{\varvec{u}}^t - \frac{\sqrt{\delta }}{\beta _t} \, {\mathbb E}\{ Z \} \, {\varvec{v}}^t, \end{aligned} \end{aligned}$$
(71)

where \({\varvec{Z}}_L = \mathrm{diag}({\mathcal {T}}_L(y_1), \ldots , {\mathcal {T}}_L(y_n))\) and \({\varvec{Z}}= \mathrm{diag}({\mathcal {T}}(y_1), \ldots , {\mathcal {T}}(y_n))\). The state evolution equations in (56) become:

$$\begin{aligned}&\mu _{U,t} = \frac{\mu _{V,t}}{\sqrt{\delta } \beta _t}, \qquad \sigma _{U, t}^2 = \frac{\sigma _{V,t}^2}{ \delta \, \beta ^2_t}, \end{aligned}$$
(72)
$$\begin{aligned}&\mu _{V,t+1} = \sqrt{\delta } \,\frac{\mu _{V,t}}{\beta _t} \big [ {\mathbb E}\{ZG^2\} - {\mathbb E}\{Z\} \big ], \qquad \sigma _{V,t+1}^2 = \frac{1}{\beta _t^2}\big [ \mu _{V,t}^2 {\mathbb E}\{Z^2G^2\} + \sigma _{V,t}^2 {\mathbb E}\{ Z^2\} \big ], \end{aligned}$$
(73)
$$\begin{aligned}&\beta _{t+1} = \sqrt{\mu _{V,{t+1}}^2 + \sigma _{V,{t+1}}^2}. \end{aligned}$$
(74)

Here, we recall that \(G \sim \mathsf{N}(0,1)\) and \(Z={\mathcal {T}}(Y)\), for \(Y \sim p_{Y|G}(\cdot \mid G)\). From (57), the state evolution iteration is initialized with the following:

$$\begin{aligned} \mu _{V,1} = {\mathbb E}\{ {\mathcal {T}}_L(Y) G \}, \qquad \sigma _{V,1}^2 = \frac{1}{\delta }{\mathbb E}\{ {\mathcal {T}}_L(Y)^2 \}, \qquad \beta _1 = \sqrt{\mu _{V,1}^2 + \sigma _{V,1}^2}. \end{aligned}$$
(75)

We will show in Lemma 6 that in the high-dimensional limit, the vector \({\varvec{v}}^t\) in (71) tends to align with the principal eigenvector of the matrix \({\varvec{M}}_n= {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+\delta \mathbb E\{Z(G^2-1)\}{\varvec{I}}_n)^{-1} {\varvec{A}}\) as \(t \rightarrow \infty \). In other words, the GAMP iteration is equivalent to a power iteration for \({\varvec{M}}_n\). This equivalence, together with Proposition 1, allows us to precisely characterize the limiting empirical distribution of the principal eigenvector of \({\varvec{M}}_n\) in Lemma 6.
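This equivalence can be checked numerically at moderate dimensions. The sketch below runs the iteration (71), with \(\beta _t\) tracked via the state evolution recursion (72)–(75), and measures the alignment of \({\varvec{v}}^t\) with the principal eigenvector of \({\varvec{M}}_n\); the output channel and the bounded function \({\mathcal {T}}\) are illustrative choices, not those used in the proof of Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(4)
d, delta = 400, 5.0
n = int(delta * d)
TL = lambda y: y
T = lambda y: np.clip(y, -2.0, 3.0)            # illustrative bounded T

x = rng.standard_normal(d)
x *= np.sqrt(d) / np.linalg.norm(x)
A = rng.standard_normal((n, d)) / np.sqrt(d)
g_true = A @ x
y = g_true + (g_true ** 2 - 1) + 0.3 * rng.standard_normal(n)   # illustrative channel
Zvec = T(y)

# Monte Carlo estimates of the scalar expectations driving (72)-(75)
G = rng.standard_normal(1_000_000)
Ymc = G + (G ** 2 - 1) + 0.3 * rng.standard_normal(G.size)
Zmc = T(Ymc)
EZ, EZ2 = Zmc.mean(), (Zmc ** 2).mean()
EZG2, EZ2G2 = (Zmc * G ** 2).mean(), (Zmc ** 2 * G ** 2).mean()

# GAMP iteration (71) with beta_t tracked by state evolution (72)-(75)
u = np.ones(n) / delta                          # u^0, cf. (70)
v = A.T @ TL(y) / delta                         # v^1, cf. (70)
mu, sig2 = (Ymc * G).mean(), (Ymc ** 2).mean() / delta   # (75) with T_L(y) = y
prev_diag = TL(y)                               # Z_L at t = 1; Z for t > 1
for t in range(1, 60):
    beta = np.sqrt(mu ** 2 + sig2)
    u = (A @ v - prev_diag * u) / (np.sqrt(delta) * beta)
    v = A.T @ (Zvec * u) - np.sqrt(delta) * EZ * v / beta
    prev_diag = Zvec
    mu, sig2 = np.sqrt(delta) * mu / beta * (EZG2 - EZ), \
               (mu ** 2 * EZ2G2 + sig2 * EZ2) / beta ** 2

# Principal eigenvector of M_n in (79)
M = A.T @ ((Zvec / (Zvec + delta * (EZG2 - EZ)))[:, None] * A)
w, V = np.linalg.eigh(M)
phi1 = V[:, -1]
print(abs(v @ phi1) / np.linalg.norm(v))        # close to 1 if v^t aligns with phi_1
```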

We begin with a result characterizing the fixed points of the state evolution recursion in (72)–(73).

Lemma 4

Assume that \({\mathbb E}\{ Z(G^2-1)\} >0\) and that \(\delta > \frac{{\mathbb E}\{ Z^2\}}{ ({\mathbb E}\{ Z G^2 \} \, - \,{\mathbb E}\{Z\})^2}\). Then, the state evolution recursion (72)–(73) has three fixed points: one is \(\mathsf{FP}_0 \equiv (\bar{\mu }_V=0, \bar{\sigma }_V^2 ={\mathbb E}\{ Z^2\})\), and the other two are \(\mathsf{FP}_1 \equiv (\tilde{\mu }_V, \tilde{\sigma }_V^2)\) and \(\mathsf{FP}_2 \equiv (-\tilde{\mu }_V, \tilde{\sigma }_V^2)\), where

$$\begin{aligned} \tilde{\mu }_V = \sqrt{\frac{\tilde{\beta }^2(\tilde{\beta }^2 - {\mathbb E}\{Z^2\})}{\tilde{\beta }^2 + {\mathbb E}\{Z^2 G^2\} - {\mathbb E}\{Z^2\}}}, \qquad \tilde{\sigma }_V^2 = \frac{\tilde{\beta }^2 {\mathbb E}\{Z^2 G^2\}}{ \tilde{\beta }^2 + {\mathbb E}\{ Z^2 G^2\} - {\mathbb E}\{Z^2\}}, \end{aligned}$$
(76)

with

$$\begin{aligned} \tilde{\beta }^2 = \delta \, ( {\mathbb E}\{ZG^2\} - {\mathbb E}\{Z\})^2. \end{aligned}$$
(77)

Furthermore, if the initialization (75) is such that \(\mu _{V,1} >0\), then the recursion converges to \(\mathsf{FP}_1\). If \(\mu _{V,1} < 0\), the recursion converges to \(\mathsf{FP}_2\).

The lemma is proved in Appendix E. We note that Lemma 4 (and the subsequent Lemmas 5 and 6 as well) assumes that \({\mathbb E}\{ Z(G^2-1)\} >0\) and \(\delta > \frac{{\mathbb E}\{ Z^2\}}{ ({\mathbb E}\{ Z G^2 \} \, - \,{\mathbb E}\{Z\})^2}\). These conditions concern the auxiliary random variable Z (or, equivalently, the auxiliary function \({\mathcal {T}}\)). In the proof of Theorem 1, which appears at the end of this section, we will provide a choice of Z (depending on \(Z_s\)) that fulfills these requirements; see (82). Let us also point out that the condition \(\delta > \frac{{\mathbb E}\{ Z^2\}}{ ({\mathbb E}\{ Z G^2 \} \, - \,{\mathbb E}\{Z\})^2}\) follows from \(\psi _{\delta }'(\lambda _{\delta }^*)> 0\), which in turn ensures that \(\rho _s>0\). For a discussion of the case \(\rho _s=0\), see Remark 1.
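The fixed points of Lemma 4 are easy to check numerically. The following sketch estimates the required expectations by Monte Carlo for the same illustrative channel and bounded \({\mathcal {T}}\) used in the sketch above, iterates the recursion (72)–(75), and compares the resulting limit with the closed-form fixed point (76)–(77).

```python
import numpy as np

rng = np.random.default_rng(3)
N = 2_000_000
G = rng.standard_normal(N)
Y = G + (G ** 2 - 1) + 0.3 * rng.standard_normal(N)    # illustrative channel
Z = np.clip(Y, -2.0, 3.0)                              # illustrative bounded T
delta = 5.0

# Expectations appearing in (72)-(77), estimated by Monte Carlo
EZ, EZ2 = Z.mean(), (Z ** 2).mean()
EZG2, EZ2G2 = (Z * G ** 2).mean(), (Z ** 2 * G ** 2).mean()
assert (Z * (G ** 2 - 1)).mean() > 0 and delta > EZ2 / (EZG2 - EZ) ** 2

# Initialization (75) with T_L(y) = y
mu, sig2 = (Y * G).mean(), (Y ** 2).mean() / delta
for _ in range(200):                                   # iterate (72)-(74)
    beta2 = mu ** 2 + sig2
    mu, sig2 = (np.sqrt(delta) * mu / np.sqrt(beta2)) * (EZG2 - EZ), \
               (mu ** 2 * EZ2G2 + sig2 * EZ2) / beta2

# Closed-form fixed point (76)-(77)
tb2 = delta * (EZG2 - EZ) ** 2
mu_fp = np.sqrt(tb2 * (tb2 - EZ2) / (tb2 + EZ2G2 - EZ2))
sig_fp = np.sqrt(tb2 * EZ2G2 / (tb2 + EZ2G2 - EZ2))
print(mu, mu_fp, np.sqrt(sig2), sig_fp)                # SE limit vs FP_1
```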

The next lemma shows that the mean-squared difference between successive GAMP iterates vanishes as \(t \rightarrow \infty \) in the high-dimensional limit.

Lemma 5

Assume that \({\mathbb E}\{ Z(G^2-1)\} >0\), \(\delta > \frac{{\mathbb E}\{ Z^2\}}{ ({\mathbb E}\{ Z G^2 \} \, - \,{\mathbb E}\{Z\})^2}\), and \(\left|{{\mathbb E}\{ {\mathcal {T}}_L(Y) G \}}\right| >0\). Consider the GAMP iteration in (71) initialized with \({\varvec{u}}^0 = \frac{1}{\delta }{\varvec{1}}_n\). Then, the following limits hold almost surely:

$$\begin{aligned} \lim _{t \rightarrow \infty } \lim _{n \rightarrow \infty } \, \frac{1}{n} \Vert {\varvec{u}}^t - {\varvec{u}}^{t-1} \Vert ^2 =0, \qquad \lim _{t\rightarrow \infty } \lim _{d \rightarrow \infty } \, \frac{1}{d} \Vert {\varvec{v}}^{t+1} - {\varvec{v}}^{t} \Vert ^2 =0. \end{aligned}$$
(78)

The proof of the lemma is given in Appendix F. The next result is the main technical lemma needed to prove Theorem 1. It shows that, as t grows large, \({\varvec{v}}^t\) tends to be aligned with the principal eigenvector of the matrix \({\varvec{M}}_n\) defined in (79). Theorem 1 is then obtained from Lemma 6 by using a suitable choice for \({\mathcal {T}}(\cdot )\) in the GAMP iteration (71), which ensures that \({\varvec{M}}_n\) is a scaled version of \({\varvec{D}}_n\) in (13).

Lemma 6

Let \({\varvec{x}}\) be such that \(\Vert {\varvec{x}}\Vert ^2=d\), \(\{{\varvec{a}}_i\}_{1\le i\le n}\sim _{i.i.d.}\mathsf{N}({{\varvec{0}}}_d,{\varvec{I}}_d/d)\), and \({\varvec{y}}\) be distributed according to (7). Let \(n/d\rightarrow \delta \), \(G\sim \mathsf{N}(0, 1)\) and \(Z=\mathcal T(Y)\) for \(Y\sim p_{Y|G}(\,\cdot \,|\,G)\). Assume that the conditions (B1)–(B2) on p.6 hold (with \({\mathcal {T}}_s(\cdot )\) replaced by \(\mathcal T(\cdot )\) in (B2)). Assume also that Z is bounded, \(\mathbb P(Z>-1)=1\), \({\mathbb E}\{ Z(G^2-1)\} >0\) and \(\delta > \frac{{\mathbb E}\{ Z^2\}}{ ({\mathbb E}\{ Z G^2 \} \, - \,{\mathbb E}\{Z\})^2}\). Define \(Z'=\frac{Z}{Z+\delta \mathbb E\{Z(G^2-1)\}}\) and assume that \(Z'\) satisfies the assumptions (A1), (A2), (A3) on p.5. Define the matrix

$$\begin{aligned} {\varvec{M}}_n= {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+\delta \mathbb E\{Z(G^2-1)\}{\varvec{I}}_n)^{-1} {\varvec{A}}, \end{aligned}$$
(79)

where \({\varvec{Z}}= \mathrm{diag}({\mathcal {T}}(y_1), \ldots , {\mathcal {T}}(y_n))\). Let \(\hat{{\varvec{\varphi }}}_1\) be the principal eigenvector of \({\varvec{M}}_n\), let its sign be chosen so that \(\langle \hat{{\varvec{\varphi }}}_1, {\varvec{x}}\rangle \ge 0\), and consider the rescaling \(\tilde{{\varvec{\varphi }}}^{(1)}=\sqrt{d}\hat{{\varvec{\varphi }}}_1\). Also, let \(\tilde{\varvec{x}}^\mathrm{L} = \sqrt{d} \hat{\varvec{x}}^\mathrm{L}\), where \(\hat{\varvec{x}}^\mathrm{L}\) is the linear estimator defined in (10).

Then, the following holds almost surely for any PL(k) function \(\psi : \mathbb {R}\times \mathbb {R}\times \mathbb {R}\rightarrow \mathbb {R}\):

$$\begin{aligned}&\lim _{d \rightarrow \infty } \, \frac{1}{d} \sum _{i=1}^d \psi (x_i, \tilde{x}^\mathrm{L}_i, \tilde{\varphi }_i^{(1)}) = {\mathbb E}\{ \psi (X, \, \mu _{V,1} X + \sigma _{V,1} W_{V,1} ,\, \tilde{\beta }^{-1}(\tilde{\mu }_{V} X + \tilde{\sigma }_{V} W_{V, \infty })) \} . \end{aligned}$$
(80)

Here \(X \sim P_X\), \(\tilde{\mu }_{V}, \tilde{\sigma }_{V}, \tilde{\beta }\) are given by (76)–(77), and \(\mu _{V,1},\sigma _{V,1}^2\) are given by (75). The random variables \((W_{V,1}, W_{V, \infty })\) are independent of X, and jointly Gaussian with zero mean and covariance:

$$\begin{aligned} {\mathbb E}\{ W_{V,1}^2 \} = {\mathbb E}\{ W_{V,\infty }^2 \} =1, \qquad {\mathbb E}\{W_{V,1} W_{V,\infty } \} = \frac{ \tilde{\mu }_{V} \, {\mathbb E}\{ {\mathcal {T}}_L(Y) Z G \}}{ \tilde{\beta } \tilde{\sigma }_{V} \sqrt{ {\mathbb E}\{ {\mathcal {T}}_L(Y)^2\} } }. \end{aligned}$$
(81)

We first show how Theorem 1 is obtained from Lemma 6, and then prove the lemma in the following subsection.

Proof of Theorem 1

Recall that \(\lambda _\delta ^*\) is the unique solution of (20) for \(\lambda >\tau \), where \(\tau \) is the supremum of the support of \(Z_s\). Define

$$\begin{aligned} Z \triangleq \frac{Z_s}{\lambda _\delta ^*- Z_s} = \frac{{\mathcal {T}}_s(Y)}{\lambda _\delta ^*- {\mathcal {T}}_s(Y)}. \end{aligned}$$
(82)

We now verify that Z satisfies the assumptions of Lemma 6. As \(\lambda _\delta ^*>\tau \) and \(Z_s\) has bounded support with supremum \(\tau \), we have that Z is bounded. Furthermore, Z is a Lipschitz function of Y, since \(Z_s\) is a Lipschitz function of Y and \(Z_s\) is bounded away from \(\lambda _\delta ^*\). Thus, Z satisfies the condition (B2) (with \({\mathcal {T}}_s(\cdot )\) replaced by \({\mathcal {T}}(\cdot )\)).

Next, note that \(\tau >0\) (since \(Z_s\) satisfies assumption (A3)), which implies that \(\lambda _\delta ^*>\tau >0\). Thus, as \(\mathbb P(\lambda _\delta ^*- Z_s>0)=1\), we have that \(Z+1=\lambda _\delta ^*/(\lambda _\delta ^*-Z_s)>0\) almost surely, i.e., \(\mathbb P(Z>-1)=1\).

As \(\psi _{\delta }'(\lambda _{\delta }^*)> 0\), we have that \(\lambda _{\delta }^*>\bar{\lambda }_\delta \), where \(\bar{\lambda }_\delta \) denotes the point at which \(\psi _\delta \) attains its minimum (see Eq. (18)). Consequently, since \(\lambda _{\delta }^*\) solves (20), we have that

$$\begin{aligned} \psi _\delta (\lambda _{\delta }^*)=\phi (\lambda _{\delta }^*). \end{aligned}$$
(83)

As \(Z_s\) satisfies assumption (A3), we have that \(\tau >0\), which implies that \(\lambda _\delta ^*>0\). Thus, by using the definitions (16)–(17), (83) can be rewritten as

$$\begin{aligned} \mathbb E\left\{ \frac{Z_s(G^2-1)}{\lambda _{\delta }^*-Z_s}\right\} =\frac{1}{\delta }. \end{aligned}$$
(84)

By combining (82) and (84), we obtain that

$$\begin{aligned} \mathbb E\{Z(G^2-1)\}=\frac{1}{\delta }>0. \end{aligned}$$
(85)

Finally, we compute the derivative of \(\psi _\delta (\lambda )\) at \(\lambda _{\delta }^*\), noting that the derivative and the expectation in (17) can be interchanged due to bounded convergence. We thus obtain that the condition \(\psi _{\delta }'(\lambda _{\delta }^*)> 0\) is equivalent to

$$\begin{aligned} \frac{1}{\delta } > \mathbb E\left\{ \frac{Z_s^2}{(\lambda _{\delta }^*-Z_s)^2}\right\} = \mathbb E\left\{ Z^2\right\} , \end{aligned}$$
(86)

where the last equality follows from (82). From (85) and (86), we have \(\delta > \frac{{\mathbb E}\{ Z^2\}}{ ({\mathbb E}\{ Z G^2 \} \, - \,{\mathbb E}\{Z\})^2}\).

We also have

$$\begin{aligned} Z'\triangleq \frac{Z}{Z+\delta \mathbb E\{Z(G^2-1)\}}=\frac{Z}{Z+1}=\frac{Z_s}{\lambda _\delta ^*}, \end{aligned}$$
(87)

where in the first equality we use (85) and in the second equality we use (82). Thus, \(Z'\) satisfies the assumptions (A1)–(A3) (with \(\tau /\lambda _\delta ^*\) being the supremum of its support), and we can apply Lemma 6.
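These verifications are straightforward to reproduce numerically for a concrete channel. The sketch below takes the setting of Fig. 2 with \(\sigma ^2=0.4\) and an illustrative value of \(\delta \), solves (20) in the form (84) by bisection, and then checks (85) and whether the condition (86) holds for that \(\delta \).

```python
import numpy as np

rng = np.random.default_rng(5)
N, delta, sigma2 = 2_000_000, 10.0, 0.4
G = rng.standard_normal(N)
Y = np.maximum(G, -0.4 * G) + np.sqrt(sigma2) * rng.standard_normal(N)
Zs = np.minimum(Y, 3.5)                    # T_s of Fig. 2; tau = sup supp Z_s = 3.5
tau = 3.5

# Solve (84): E{ Z_s (G^2-1) / (lambda - Z_s) } = 1/delta for lambda > tau
h = lambda lam: (Zs * (G ** 2 - 1) / (lam - Zs)).mean() - 1.0 / delta
lo, hi = tau + 1e-6, 100.0                 # bracket where h changes sign (+ to -)
for _ in range(80):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if h(mid) > 0 else (lo, mid)
lam_star = 0.5 * (lo + hi)

Z = Zs / (lam_star - Zs)                   # the choice (82)
print((Z * (G ** 2 - 1)).mean(), 1.0 / delta)   # (85): both close to 1/delta
print((Z ** 2).mean(), 1.0 / delta)             # (86) holds iff E{Z^2} < 1/delta
```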

Using (87) in (79), we see that \({\varvec{M}}_n = \frac{1}{\lambda _\delta ^*}{\varvec{A}}^{\mathsf{T}} {\varvec{Z}}_s {\varvec{A}}\). Therefore, the principal eigenvector of \({\varvec{M}}_n\) is equal to \(\hat{\varvec{x}}^\mathrm{s}\). Furthermore, from (76)–(77), we can compute the coefficients \(\tilde{\mu }_{V}, \tilde{\sigma }_{V}, \tilde{\beta }\) as

$$\begin{aligned} \begin{aligned} \tilde{\beta }^2&= \delta \, ( {\mathbb E}\{ZG^2\} - {\mathbb E}\{Z\})^2=\frac{1}{\delta },\\ \tilde{\mu }_V&= \sqrt{\frac{\tilde{\beta }^2(\tilde{\beta }^2 - {\mathbb E}\{Z^2\})}{\tilde{\beta }^2 + {\mathbb E}\{Z^2 G^2\} - {\mathbb E}\{Z^2\}}}=\sqrt{\frac{\frac{1}{\delta }\left( \frac{1}{\delta } - {\mathbb E}\left\{ \frac{Z_s^2}{(\lambda _\delta ^*- Z_s)^2}\right\} \right) }{\frac{1}{\delta } + {\mathbb E}\left\{ \frac{Z_s^2(G^2-1)}{(\lambda _\delta ^*- Z_s)^2}\right\} }},\\ \tilde{\sigma }_V&= \sqrt{\frac{\tilde{\beta }^2 {\mathbb E}\{Z^2 G^2\}}{ \tilde{\beta }^2 + {\mathbb E}\{ Z^2 G^2\} - {\mathbb E}\{Z^2\}}}= \sqrt{\frac{\frac{1}{\delta } {\mathbb E}\left\{ \frac{Z_s^2\cdot G^2}{(\lambda _\delta ^*- Z_s)^2}\right\} }{\frac{1}{\delta } + {\mathbb E}\left\{ \frac{Z_s^2(G^2-1)}{(\lambda _\delta ^*- Z_s)^2}\right\} }}. \end{aligned} \end{aligned}$$

By using also (75), one can easily verify that

$$\begin{aligned}&\mu _{V, 1}=n_L\rho _L,\qquad \sigma _{V, 1}=n_L\sqrt{1-\rho _L^2},\qquad \tilde{\beta }^{-1}\tilde{\mu }_V=\rho _s,\\&\tilde{\beta }^{-1}\tilde{\sigma }_V=\sqrt{1-\rho _s^2}, \qquad \frac{q-\rho _s\rho _L}{\sqrt{(1-\rho _s^2)(1-\rho _L^2)}}=\frac{ \tilde{\mu }_{V} \, {\mathbb E}\{ {\mathcal {T}}_L(Y) Z G \}}{ \tilde{\beta } \tilde{\sigma }_{V} \sqrt{ {\mathbb E}\{ {\mathcal {T}}_L(Y)^2\} } }, \end{aligned}$$

which yields the desired result.

5.4 Proof of Lemma 6

Fix \(c>0\), and let \(\tilde{{\varvec{Z}}}=c {\varvec{Z}}\) and \(\tilde{Z}=c Z\). Then,

$$\begin{aligned} \begin{aligned} {\varvec{Z}}({\varvec{Z}}+\delta \mathbb E\{Z(G^2-1)\}{\varvec{I}}_n)^{-1}&= \tilde{{\varvec{Z}}}(\tilde{{\varvec{Z}}}+\delta \mathbb E\{\tilde{Z}(G^2-1)\}{\varvec{I}}_n)^{-1}. \end{aligned} \end{aligned}$$
(88)

Thus, by inspecting the definition (79), one immediately obtains that \(\tilde{{\varvec{\varphi }}}^{(1)}\) does not change if we rescale \({\varvec{Z}}\) and Z by the multiplicative factor c. Furthermore, by using the definitions (76)–(77), we have that

$$\begin{aligned} \tilde{\beta }^{-1}\tilde{\mu }_{V}&=\sqrt{\frac{\delta \, ( {\mathbb E}\{ZG^2\} - {\mathbb E}\{Z\})^2 - {\mathbb E}\{Z^2\}}{\delta \, ( {\mathbb E}\{ZG^2\} - {\mathbb E}\{Z\})^2 + {\mathbb E}\{Z^2 G^2\} - {\mathbb E}\{Z^2\}}} \nonumber \\&= \sqrt{\frac{\delta \, ( {\mathbb E}\{\tilde{Z}G^2\} - {\mathbb E}\{\tilde{Z}\})^2 - {\mathbb E}\{\tilde{Z}^2\}}{\delta \, ( {\mathbb E}\{\tilde{Z}G^2\} - {\mathbb E}\{\tilde{Z}\})^2 + {\mathbb E}\{\tilde{Z}^2 G^2\} - {\mathbb E}\{\tilde{Z}^2\}}},\end{aligned}$$
(89)
$$\begin{aligned} \tilde{\beta }^{-1}\tilde{\sigma }_{V}&=\sqrt{\frac{{\mathbb E}\{Z^2 G^2\}}{ \delta \, ( {\mathbb E}\{ZG^2\} - {\mathbb E}\{Z\})^2 + {\mathbb E}\{ Z^2 G^2\} - {\mathbb E}\{Z^2\}}}\nonumber \\&=\sqrt{\frac{{\mathbb E}\{\tilde{Z}^2 G^2\}}{ \delta \, ( {\mathbb E}\{\tilde{Z}G^2\} - {\mathbb E}\{\tilde{Z}\})^2 + {\mathbb E}\{ \tilde{Z}^2 G^2\} - {\mathbb E}\{\tilde{Z}^2\}}}, \end{aligned}$$
(90)

and that

$$\begin{aligned} \begin{aligned} \frac{ \tilde{\mu }_{V} \, {\mathbb E}\{ {\mathcal {T}}_L(Y) Z G \}}{ \tilde{\beta } \tilde{\sigma _{V}} \sqrt{ {\mathbb E}\{ {\mathcal {T}}_L(Y)^2\} } }&=\frac{{\mathbb E}\{ {\mathcal {T}}_L(Y) Z G \}}{\sqrt{\delta }( {\mathbb E}\{ZG^2\} - {\mathbb E}\{Z\})}\sqrt{\frac{\delta \, ( {\mathbb E}\{ZG^2\} - {\mathbb E}\{Z\})^2 - {\mathbb E}\{Z^2\}}{{\mathbb E}\{ Z^2 G^2 \}{\mathbb E}\{ {\mathcal {T}}_L(Y)^2\}}}\\&=\frac{{\mathbb E}\{ {\mathcal {T}}_L(Y) \tilde{Z} G \}}{\sqrt{\delta }( {\mathbb E}\{\tilde{Z}G^2\} - {\mathbb E}\{\tilde{Z}\})}\sqrt{\frac{\delta \, ( {\mathbb E}\{\tilde{Z}G^2\} - {\mathbb E}\{\tilde{Z}\})^2 - {\mathbb E}\{\tilde{Z}^2\}}{{\mathbb E}\{ \tilde{Z}^2 G^2 \}{\mathbb E}\{ {\mathcal {T}}_L(Y)^2\}}}. \end{aligned} \end{aligned}$$
(91)

Consequently, both the LHS and the RHS of (80) are unchanged when we rescale \({\varvec{Z}}\) and Z by the multiplicative factor c.

The argument above proves that, without loss of generality, we can rescale \({\varvec{Z}}\) and Z by any multiplicative factor \(c>0\). To simplify the rest of the argument, it is convenient to assume the normalization condition

$$\begin{aligned} \mathbb E\{Z(G^2-1)\}=\frac{1}{\delta }, \end{aligned}$$
(92)

under which \({\varvec{M}}_n = {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1} {\varvec{A}}\).

Consider the iteration (71) with the initialization in (70). Since by hypothesis, \(\mu _{V,1}^2 = ({\mathbb E}\{ {\mathcal {T}}_L(Y) G \})^2 >0\), Lemma 4 guarantees that the state evolution recursion (73) converges to either fixed point \(\mathsf{FP}_1\) or \(\mathsf{FP}_2\) as \(t \rightarrow \infty \). That is,

$$\begin{aligned} \lim _{t \rightarrow \infty } \mu _{V,t}^2 = \tilde{\mu }_V^2, \quad \lim _{t \rightarrow \infty } \sigma ^2_{V,t} = \tilde{\sigma }^2_V, \quad \lim _{t \rightarrow \infty } \beta _t^2 = \tilde{\mu }_V^2 + \tilde{\sigma }^2_V = \tilde{\beta }^2 = \frac{1}{\delta }. \end{aligned}$$
(93)

The last equality above follows by combining (77) and (92).

By substituting the expression for \({\varvec{u}}^t\) in (71) in the \({\varvec{v}}^{t+1}\) update, the iteration can be rewritten as follows, for \(t \ge 2\):

$$\begin{aligned} {\varvec{u}}^t&= \frac{1}{\sqrt{\delta } \, \beta _t} \big [ {\varvec{A}}{\varvec{v}}^t \, - \, {\varvec{Z}}{\varvec{u}}^{t-1} \big ], \end{aligned}$$
(94)
$$\begin{aligned} {\varvec{v}}^{t+1}&= \frac{1}{\sqrt{\delta } \beta _t}\Big [ \big ({\varvec{A}}^\mathsf{T}{\varvec{Z}}{\varvec{A}}- \delta {\mathbb E}\{Z\} \, {\varvec{I}}_d \big ) {\varvec{v}}^t \, - \, {\varvec{A}}^\mathsf{T}{\varvec{Z}}^2 {\varvec{u}}^{t-1} \Big ]. \end{aligned}$$
(95)

In the remainder of the proof, we will assume that \(t \ge 2\). Define

$$\begin{aligned} {\varvec{e}}_1^t&= {\varvec{u}}^{t}-{\varvec{u}}^{t-1}, \end{aligned}$$
(96)
$$\begin{aligned} {\varvec{e}}_2^t&= {\varvec{v}}^{t+1}-{\varvec{v}}^{t}. \end{aligned}$$
(97)

By combining (96) with (94), we have that

$$\begin{aligned} {\varvec{u}}^{t-1}=({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}{\varvec{A}}{\varvec{v}}^t-\sqrt{\delta }\beta _t({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}{\varvec{e}}_1^t. \end{aligned}$$
(98)

By substituting the expression for \({\varvec{u}}^{t-1}\) obtained in (98) into (95), we have

$$\begin{aligned} \begin{aligned} {\varvec{v}}^{t+1}&= \left( {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1} {\varvec{A}}- \frac{\sqrt{\delta } {\mathbb E}\{Z\}}{\beta _t} \, {\varvec{I}}_d \right) {\varvec{v}}^t +{\varvec{A}}^\mathsf{T}{\varvec{Z}}^2({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}{\varvec{e}}_1^t\\&= \left( {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1} {\varvec{A}}-\delta {\mathbb E}\{Z\} \, {\varvec{I}}_d \right) {\varvec{v}}^t \\&\quad +(1-\sqrt{\delta }\beta _t){\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1}({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1} {\varvec{A}}{\varvec{v}}^t \\&\quad + \delta {\mathbb E}\{Z\}\left( 1-\frac{1}{\sqrt{\delta }\beta _t}\right) {\varvec{v}}^t+{\varvec{A}}^\mathsf{T}{\varvec{Z}}^2({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}{\varvec{e}}_1^t. \end{aligned} \end{aligned}$$
(99)

Let

$$\begin{aligned} {\varvec{e}}_3^t = \left( {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1} {\varvec{A}}-(\delta {\mathbb E}\{Z\}+1) \, {\varvec{I}}_d \right) {\varvec{v}}^t. \end{aligned}$$
(100)

From (97) and (99), we obtain

$$\begin{aligned} \begin{aligned} {\varvec{e}}_3^t&={\varvec{e}}_2^t-(1-\sqrt{\delta }\beta _t){\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1}({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1} {\varvec{A}}{\varvec{v}}^t \\&\quad - \delta {\mathbb E}\{Z\}\left( 1-\frac{1}{\sqrt{\delta }\beta _t}\right) {\varvec{v}}^t-{\varvec{A}}^\mathsf{T}{\varvec{Z}}^2({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}{\varvec{e}}_1^t. \end{aligned} \end{aligned}$$
(101)

Let us decompose \({\varvec{v}}^t\) into a component in the direction of \(\hat{{\varvec{\varphi }}}_1\) plus an orthogonal component \({\varvec{r}}^t\):

$$\begin{aligned} {\varvec{v}}^t = \xi _t \hat{{\varvec{\varphi }}}_1+{\varvec{r}}^t, \end{aligned}$$
(102)

where \(\xi _t=\langle {\varvec{v}}^t, \hat{{\varvec{\varphi }}}_1\rangle \).

At this point, the idea is to show that, when t is large, \({\varvec{r}}^t\) is small, thus \({\varvec{v}}^t\) tends to be aligned with \(\hat{{\varvec{\varphi }}}_1\). To do so, we prove that, as \(n\rightarrow \infty \), the largest eigenvalue of the matrix \({\varvec{M}}_n\) defined in (79) converges almost surely to \(\delta {\mathbb E}\{Z\}+1\). Furthermore, we prove that the matrix \({\varvec{M}}_n\) exhibits a spectral gap, in the sense that the second largest eigenvalue of \({\varvec{M}}_n\) converges almost surely to a value strictly smaller than \(\delta {\mathbb E}\{Z\}+1\). Since \({\varvec{r}}^t\) is orthogonal to \(\hat{{\varvec{\varphi }}}_1\) and \({\varvec{M}}_n\) has a spectral gap, the norm of \({\varvec{e}}_3^t\) in (100) can be lower bounded by a strictly positive constant times the norm of \({\varvec{r}}^t\). Next, using the expression in (101), we show that the norm of \({\varvec{e}}_3^t\) can be made arbitrarily small by taking n and t sufficiently large. From this, we conclude that \({\varvec{r}}^t\) must be small.

We begin by proving that \({\varvec{M}}_n\) has a spectral gap.

Lemma 7

(Spectral gap for \({\varvec{M}}_n\)) The following holds almost surely:

$$\begin{aligned} \lim _{n \rightarrow \infty } \lambda _1^{{\varvec{M}}_n}&= 1+\delta \mathbb E\{Z\}, \end{aligned}$$
(103)
$$\begin{aligned} \limsup _{n \rightarrow \infty } \lambda _2^{{\varvec{M}}_n}&< 1+\delta \mathbb E\{Z\} - c_1, \end{aligned}$$
(104)

for a numerical constant \(c_1 >0\) that does not depend on n.

Proof of Lemma 7

By hypothesis, \(Z'=Z/(Z+1)\) satisfies the assumptions (A1)-(A2)-(A3). Thus, we can use Lemma 2 to compute the almost sure limit of the two largest eigenvalues of \({\varvec{M}}_n\), call them \(\lambda _1^{{\varvec{M}}_n}\ge \lambda _2^{{\varvec{M}}_n}\).

With a slight abuse of notation, let \(\tau \) now denote the supremum of the support of \(Z'\). As \(\mathbb P(Z>-1)=1\) and Z has bounded support, we have that \(\tau <1\). For \(\lambda \in (\tau , \infty )\), define

$$\begin{aligned} \phi (\lambda ) = \lambda \cdot {\mathbb E}\left\{ \frac{Z'\cdot G^2}{\lambda -Z'}\right\} =\lambda \cdot {\mathbb E}\left\{ \frac{Z\cdot G^2}{(\lambda -1)Z+ \lambda }\right\} , \end{aligned}$$
(105)

and

$$\begin{aligned} \psi _{\delta }(\lambda ) = \lambda \left( \frac{1}{\delta }+{\mathbb E}\left\{ \frac{Z'}{\lambda -Z'}\right\} \right) = \lambda \left( \frac{1}{\delta }+{\mathbb E}\left\{ \frac{Z}{(\lambda -1)Z+ \lambda }\right\} \right) . \end{aligned}$$
(106)

Note that

$$\begin{aligned} \begin{aligned} \phi (1) = \mathbb E\{Z\cdot G^2\}, \qquad \psi _{\delta }(1) =\frac{1}{\delta }+{\mathbb E}\{Z\}. \end{aligned} \end{aligned}$$
(107)

Thus, by using (92), we have that

$$\begin{aligned} \phi (1)=\psi _{\delta }(1). \end{aligned}$$
(108)

Furthermore,

$$\begin{aligned} \psi _{\delta }'(\lambda ) = \frac{1}{\delta }-{\mathbb E}\left\{ \left( \frac{Z'}{\lambda -Z'}\right) ^2\right\} = \frac{1}{\delta }-{\mathbb E}\left\{ \left( \frac{Z}{(\lambda -1)Z+ \lambda }\right) ^2\right\} . \end{aligned}$$
(109)

Thus,

$$\begin{aligned} \psi _{\delta }'(1)=\frac{1}{\delta }-{\mathbb E}\{Z^2\}>0, \end{aligned}$$
(110)

where in the last step we combine the hypothesis \(\delta > \frac{{\mathbb E}\{ Z^2\}}{ ({\mathbb E}\{ Z G^2 \} \, - \,{\mathbb E}\{Z\})^2}\) with the normalization condition (92).

Let \(\bar{\lambda }_\delta \) be the point at which \(\psi _\delta \) attains its minimum, as defined in (18), define \(\zeta _\delta \) as in (19) and let \(\lambda _\delta ^*\) be the unique solution of (20). Since \(\psi _\delta \) is convex, (109) and (110) imply that \(\bar{\lambda }_\delta <1\). Furthermore, (108) implies that \(\lambda _\delta ^*=1\). By Lemma 2, we obtain that, as \(n\rightarrow \infty \),

$$\begin{aligned} \begin{aligned} \lambda _1^{{\varvec{M}}_n}&{\mathop {\longrightarrow }\limits ^{{\small \mathrm{a.s.}}}}\delta \,\zeta _\delta (1),\\ \lambda _2^{{\varvec{M}}_n}&{\mathop {\longrightarrow }\limits ^{{\small \mathrm{a.s.}}}}\delta \,\zeta _\delta (\bar{\lambda }_\delta ). \end{aligned} \end{aligned}$$
(111)

Note that

$$\begin{aligned} \delta \,\zeta _\delta (1)=\delta \,\phi (1)= \delta \mathbb E\{Z\cdot G^2\} = 1+\delta \mathbb E\{Z\}, \end{aligned}$$
(112)

where the first equality comes from the fact that \(\lambda _\delta ^*=1\) is the unique solution of (20), while the second and third equalities follow from (107) and (108). By combining (112) and (111), we obtain \(\lambda _1^{{\varvec{M}}_n} {\mathop {\longrightarrow }\limits ^{{\small \mathrm{a.s.}}}}1+\delta \mathbb E\{Z\}\). Furthermore, \(\zeta _\delta (\bar{\lambda }_\delta )=\psi _\delta (\bar{\lambda }_\delta )\). As \(\bar{\lambda }_\delta <1\) and \(\psi _{\delta }'(1)>0\) (from (110)), the point \(\lambda =1\) cannot be a minimizer of the convex function \(\psi _\delta \); hence \(\psi _\delta (\bar{\lambda }_\delta )<\psi _\delta (1)\), and since this gap is a deterministic constant, there exists a numerical constant \(c_1 >0\) such that

$$\begin{aligned} \delta \psi _\delta (\bar{\lambda }_\delta )\le 1+\delta \mathbb E\{Z\}-c_1. \end{aligned}$$
(113)

Hence, \(\lambda _2^{{\varvec{M}}_n} {\mathop {\longrightarrow }\limits ^{{\small \mathrm{a.s.}}}}\delta \,\psi _\delta (\bar{\lambda }_\delta )\le 1+\delta \mathbb E\{Z\}-c_1\).
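A quick empirical check of Lemma 7 is given by the following sketch, which uses the illustrative channel of the earlier sketches, with Z rescaled so that the normalization (92) holds, and compares the two largest eigenvalues of \({\varvec{M}}_n\) with \(1+\delta {\mathbb E}\{Z\}\).

```python
import numpy as np

rng = np.random.default_rng(6)
d, delta = 500, 5.0
n = int(delta * d)

x = rng.standard_normal(d)
x *= np.sqrt(d) / np.linalg.norm(x)
A = rng.standard_normal((n, d)) / np.sqrt(d)
g = A @ x
y = g + (g ** 2 - 1) + 0.3 * rng.standard_normal(n)     # illustrative channel
Z0 = np.clip(y, -2.0, 3.0)                              # illustrative bounded T

# Rescale Z so that E{Z (G^2-1)} = 1/delta, i.e. the normalization (92)
G = rng.standard_normal(1_000_000)
Ymc = G + (G ** 2 - 1) + 0.3 * rng.standard_normal(G.size)
Z0mc = np.clip(Ymc, -2.0, 3.0)
c = 1.0 / (delta * (Z0mc * (G ** 2 - 1)).mean())
Zvec, EZ = c * Z0, c * Z0mc.mean()

M = A.T @ ((Zvec / (Zvec + 1.0))[:, None] * A)          # M_n with (92) in force
eig = np.linalg.eigvalsh(M)
print(eig[-1], 1 + delta * EZ)                          # lambda_1 ~ 1 + delta E{Z}
print(eig[-2])                                          # lambda_2: strictly smaller
```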

Let us now go back to (100) and combine it with (102). Then,

$$\begin{aligned} \left( {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1} {\varvec{A}}-(\delta {\mathbb E}\{Z\}+1) \, {\varvec{I}}_d \right) {\varvec{r}}^t ={\varvec{e}}_3^t+\xi _t(\delta {\mathbb E}\{Z\}+1-\lambda _1^{{\varvec{M}}_n})\hat{{\varvec{\varphi }}}_1. \end{aligned}$$
(114)

We now prove that, almost surely, for all sufficiently large n, the following lower bound on the norm of the LHS of (114) holds:

$$\begin{aligned} \left\| \left( {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1} {\varvec{A}}-(\delta {\mathbb E}\{Z\}+1) \, {\varvec{I}}_d \right) {\varvec{r}}^t\right\| _2\ge c_2\Vert {\varvec{r}}^t\Vert _2, \end{aligned}$$
(115)

where \(c_2>0\) is a numerical constant independent of n and t.

As the matrix \({\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1} {\varvec{A}}-(\delta {\mathbb E}\{Z\}+1) \, {\varvec{I}}_d\) is symmetric, it can be written in the form \({\varvec{Q}}{\varvec{\Lambda }}{\varvec{Q}}^\mathsf{T}\), with \({\varvec{Q}}\) orthogonal and \({\varvec{\Lambda }}\) diagonal. Furthermore, the columns of \({\varvec{Q}}\) are the eigenvectors of \({\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1} {\varvec{A}}-(\delta {\mathbb E}\{Z\}+1) \, {\varvec{I}}_d\) and the diagonal entries of \({\varvec{\Lambda }}\) are the corresponding eigenvalues. As \({\varvec{r}}^t\) is orthogonal to \(\hat{{\varvec{\varphi }}}_1\), we can write

$$\begin{aligned} \left( {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1} {\varvec{A}}-(\delta {\mathbb E}\{Z\}+1) \, {\varvec{I}}_d \right) {\varvec{r}}^t={\varvec{Q}}{\varvec{\Lambda }}'{\varvec{Q}}^\mathsf{T}{\varvec{r}}^t, \end{aligned}$$
(116)

where \({\varvec{\Lambda }}'\) is obtained from \({\varvec{\Lambda }}\) by changing the entry corresponding to \(\lambda _1^{{\varvec{M}}_n}-(\delta {\mathbb E}\{Z\}+1)\) to any other value. For our purposes, it suffices to substitute \(\lambda _1^{{\varvec{M}}_n}-(\delta {\mathbb E}\{Z\}+1)\) with \(\lambda _2^{{\varvec{M}}_n}-(\delta {\mathbb E}\{Z\}+1)\). Note that

$$\begin{aligned} \begin{aligned} \Vert {\varvec{Q}}{\varvec{\Lambda }}'{\varvec{Q}}^\mathsf{T}{\varvec{r}}^t\Vert _2^2&\ge \Vert {\varvec{r}}^t\Vert _2^2 \min _{\varvec{s}:\Vert \varvec{s}\Vert =1}\Vert {\varvec{Q}}{\varvec{\Lambda }}'{\varvec{Q}}^\mathsf{T}\varvec{s}\Vert _2^2\\&=\Vert {\varvec{r}}^t\Vert _2^2\min _{\varvec{s}:\Vert \varvec{s}\Vert =1}\langle \varvec{s}, {\varvec{Q}}\left( {\varvec{\Lambda }}'\right) ^2{\varvec{Q}}^\mathsf{T}\varvec{s}\rangle \\&=\Vert {\varvec{r}}^t\Vert _2^2\,\lambda _\mathrm{min}({\varvec{Q}}\left( {\varvec{\Lambda }}'\right) ^2{\varvec{Q}}^\mathsf{T}), \end{aligned} \end{aligned}$$
(117)

where \(\lambda _\mathrm{min}({\varvec{Q}}\left( {\varvec{\Lambda }}'\right) ^2{\varvec{Q}}^\mathsf{T})\) denotes the smallest eigenvalue of \({\varvec{Q}}\left( {\varvec{\Lambda }}'\right) ^2{\varvec{Q}}^\mathsf{T}\) and the last equality follows from the variational characterization of the smallest eigenvalue of a symmetric matrix. Note that

$$\begin{aligned} \lambda _\mathrm{min}({\varvec{Q}}\left( {\varvec{\Lambda }}'\right) ^2{\varvec{Q}}^\mathsf{T})=\lambda _\mathrm{min}\left( ({\varvec{\Lambda }}')^2\right) =\min _{i\in \{2, \ldots , d\}}\left( ((\delta {\mathbb E}\{Z\}+1)-\lambda _i^{{\varvec{M}}_n})^2\right) . \end{aligned}$$
(118)

By using (104), we obtain that, almost surely, for all sufficiently large n,

$$\begin{aligned} \min _{i\in \{2, \ldots , d\}}\left( ((\delta {\mathbb E}\{Z\}+1)-\lambda _i^{{\varvec{M}}_n})^2\right) \ge \left( \frac{c_1}{2}\right) ^2. \end{aligned}$$
(119)

By combining (116), (117), (118) and (119), we conclude that (115) holds.

Recalling that \({\varvec{r}}^t\) satisfies (114), we will next show that, almost surely,

$$\begin{aligned} \lim _{t\rightarrow \infty }\lim _{d\rightarrow \infty }\frac{1}{\sqrt{d}}\left\| {\varvec{e}}_3^t+\xi _t(\delta {\mathbb E}\{Z\}+1-\lambda _1^{{\varvec{M}}_n})\hat{{\varvec{\varphi }}}_1\right\| _2=0. \end{aligned}$$
(120)

Combined with (114) and (115), this implies that \( \lim _{t\rightarrow \infty }\lim _{d\rightarrow \infty } \frac{\Vert {\varvec{r}}^t \Vert _2}{\sqrt{d}} = 0\) almost surely.

By using the triangle inequality, we have

$$\begin{aligned} \begin{aligned} \left\| {\varvec{e}}_3^t+\xi _t(\delta {\mathbb E}\{Z\}+1-\lambda _1^{{\varvec{M}}_n})\hat{{\varvec{\varphi }}}_1\right\| _2&\le \left\| {\varvec{e}}_3^t\right\| _2+|\xi _t|\cdot |\delta {\mathbb E}\{Z\}+1-\lambda _1^{{\varvec{M}}_n}|\cdot \left\| \hat{{\varvec{\varphi }}}_1\right\| _2\\&\le \left\| {\varvec{e}}_3^t\right\| _2+\Vert {\varvec{v}}^t\Vert _2\cdot |\delta {\mathbb E}\{Z\}+1-\lambda _1^{{\varvec{M}}_n}|, \end{aligned} \end{aligned}$$
(121)

where the second inequality uses \(\left\| \hat{{\varvec{\varphi }}}_1\right\| _2=1\) and that \(|\xi _t|= |\langle {\varvec{v}}^t, \hat{{\varvec{\varphi }}}_1 \rangle | \le \Vert {\varvec{v}}^t\Vert _2\).

We can bound the second term on the RHS of (121) using the result in Proposition 1, applied with the PL(2) test function \(\psi (v) = v^2\). Then, almost surely,

$$\begin{aligned} \lim _{d\rightarrow \infty } \, \frac{1}{d}\Vert {\varvec{v}}^t\Vert ^2_2 = {\mathbb E}\{V_t^2\}=\beta _t^2. \end{aligned}$$
(122)

Here we used the definitions of \(V_t\) and \(\beta _t^2\) from (52) and (74). Recalling from (93) that \(\lim _{t\rightarrow \infty } \beta _t^2 =\frac{1}{\delta }\), the limit in (122) combined with Remark 8 and the continuous mapping theorem implies that, almost surely,

$$\begin{aligned} \lim _{t\rightarrow \infty }\lim _{d\rightarrow \infty } \frac{1}{\sqrt{d}}\Vert {\varvec{v}}^t\Vert _2=\frac{1}{\sqrt{\delta }}. \end{aligned}$$
(123)

Thus, by using (103), we conclude that, almost surely,

$$\begin{aligned} \lim _{t\rightarrow \infty }\lim _{n\rightarrow \infty }\frac{1}{\sqrt{d}}\Vert {\varvec{v}}^t\Vert _2\cdot |\delta {\mathbb E}\{Z\}+1-\lambda _1^{{\varvec{M}}_n}|=0. \end{aligned}$$
(124)

We now bound the first term on the RHS of (121). Recalling the expression for \({\varvec{e}}_3^t\) in (101), an application of the triangle inequality gives

$$\begin{aligned} \begin{aligned} \left\| {\varvec{e}}_3^t\right\| _2&\le \left\| {\varvec{e}}_2^t\right\| _2+|1-\sqrt{\delta }\beta _t|\cdot \left\| {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1}({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1} {\varvec{A}}{\varvec{v}}^t\right\| _2 \\&\quad + \delta |{\mathbb E}\{Z\}|\cdot \left| 1-\frac{1}{\sqrt{\delta }\beta _t}\right| \cdot \left\| {\varvec{v}}^t\right\| _2+\left\| {\varvec{A}}^\mathsf{T}{\varvec{Z}}^2({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}{\varvec{e}}_1^t\right\| _2\\&\le \left\| {\varvec{e}}_2^t\right\| _2+|1-\sqrt{\delta }\beta _t|\cdot \left\| {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1}({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1} {\varvec{A}}\right\| _\mathrm{op}\Vert {\varvec{v}}^t\Vert _2 \\&\quad + \delta |{\mathbb E}\{Z\}|\cdot \left| 1-\frac{1}{\sqrt{\delta }\beta _t}\right| \cdot \left\| {\varvec{v}}^t\right\| _2+\left\| {\varvec{A}}^\mathsf{T}{\varvec{Z}}^2({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}\right\| _\mathrm{op}\Vert {\varvec{e}}_1^t\Vert _2, \end{aligned} \end{aligned}$$
(125)

where the second inequality follows from the fact that, given a matrix \({\varvec{M}}\) and a vector \({\varvec{v}}\), \(\Vert {\varvec{M}}{\varvec{v}}\Vert _2\le \Vert {\varvec{M}}\Vert _\mathrm{op} \Vert {\varvec{v}}\Vert _2\).

Let us bound the operator norm of the two matrices appearing in the RHS of (125). As the operator norm is sub-multiplicative, we have

$$\begin{aligned} \begin{aligned}&\left\| {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1}({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1} {\varvec{A}}\right\| _\mathrm{op}\\&\quad \le \left\| {\varvec{Z}}\right\| _\mathrm{op}\left\| ({\varvec{Z}}+{\varvec{I}}_n)^{-1}\right\| _\mathrm{op}\left\| ({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}\right\| _\mathrm{op} \left\| {\varvec{A}}\right\| _\mathrm{op}^2,\\&\left\| {\varvec{A}}^\mathsf{T}{\varvec{Z}}^2({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}\right\| _\mathrm{op}\\&\quad \le \left\| {\varvec{Z}}\right\| _\mathrm{op}^2\left\| ({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}\right\| _\mathrm{op}\left\| {\varvec{A}}\right\| _\mathrm{op}^2. \end{aligned} \end{aligned}$$
(126)

As Z is bounded, the operator norm of \({\varvec{Z}}\) is upper bounded by a numerical constant (independent of n and t). The operator norms of \(({\varvec{Z}}+{\varvec{I}}_n)^{-1}\) and \(({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}\) are also upper bounded by a numerical constant (independent of n and t). Indeed, from (93), \(\beta _t \rightarrow 1/\sqrt{\delta }\) as \(t \rightarrow \infty \), and the support of Z does not contain points arbitrarily close to \(-1\). We also have that, almost surely, for all sufficiently large n, the operator norm of \({\varvec{A}}\) is upper bounded by a constant (independent of n and t). As a result, we deduce that, almost surely, for all sufficiently large n and t,

$$\begin{aligned} \begin{aligned} \left\| {\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1}({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1} {\varvec{A}}\right\| _\mathrm{op}&\le C,\\ \left\| {\varvec{A}}^\mathsf{T}{\varvec{Z}}^2({\varvec{Z}}+\sqrt{\delta }\beta _t{\varvec{I}}_n)^{-1}\right\| _\mathrm{op}&\le C, \end{aligned} \end{aligned}$$
(127)

where C is a numerical constant (independent of n and t). Furthermore, by Lemma 5, the following limits hold almost surely:

$$\begin{aligned} \begin{aligned} \lim _{t\rightarrow \infty }\lim _{n\rightarrow \infty }\frac{1}{\sqrt{n}}\left\| {\varvec{e}}_1^t\right\| _2&=0,\\ \lim _{t\rightarrow \infty }\lim _{n\rightarrow \infty }\frac{1}{\sqrt{d}}\left\| {\varvec{e}}_2^t\right\| _2&=0. \end{aligned} \end{aligned}$$
(128)

By combining (93), (123), (127) and (128), we obtain that, almost surely, each of the terms on the RHS of (125) vanishes when scaled by the factor \(1/\sqrt{d}\) (equivalently, \(1/\sqrt{n}\)), as \(t, n\rightarrow \infty \). As a result, almost surely,

$$\begin{aligned} \lim _{t\rightarrow \infty }\lim _{d\rightarrow \infty }\frac{1}{\sqrt{d}}\left\| {\varvec{e}}_3^t\right\| _2=0. \end{aligned}$$
(129)

By combining (121), (124) and (129), we conclude that, almost surely, (120) holds.

Recall that \({\varvec{r}}^t\) satisfies (114). Thus, by combining the lower bound in (115) with the almost sure limit in (120), we obtain that, almost surely,

$$\begin{aligned} \lim _{t\rightarrow \infty }\lim _{d \rightarrow \infty }\frac{1}{\sqrt{d}}\left\| {\varvec{r}}^t\right\| _2=0. \end{aligned}$$
(130)

Recalling from (102) that \({\varvec{r}}^t\) is the component of \({\varvec{v}}^t\) orthogonal to \(\hat{{\varvec{\varphi }}}_1\), the result in (130) implies that \({\varvec{v}}^t\) tends to be aligned with \(\hat{{\varvec{\varphi }}}_1\) in the high-dimensional limit. In formulas, by combining (102) with (130), we have that, almost surely,

$$\begin{aligned} \lim _{t\rightarrow \infty }\lim _{n\rightarrow \infty }\frac{1}{\sqrt{n}}\left\| {\varvec{v}}^t - \xi _t \hat{{\varvec{\varphi }}}_1\right\| _2=0. \end{aligned}$$
(131)

Note that

$$\begin{aligned} \left\| {\varvec{v}}^t - \xi _t \hat{{\varvec{\varphi }}}_1\right\| _2^2=\left\| {\varvec{v}}^t\right\| _2^2-\xi _t^2. \end{aligned}$$
(132)

Thus, by combining (123), (131) and (132), we obtain that, almost surely,

$$\begin{aligned} \lim _{t\rightarrow \infty }\lim _{n\rightarrow \infty } \frac{1}{\sqrt{d}}|\xi _t|=\frac{1}{\sqrt{\delta }}. \end{aligned}$$
(133)

To obtain the sign of \(\xi _t\), we first observe that, by Proposition 1, almost surely,

$$\begin{aligned} \lim _{d \rightarrow \infty }\frac{1}{d}\langle {\varvec{v}}^t, {\varvec{x}}\rangle =\mu _{V, t}. \end{aligned}$$
(134)

As \(\mu _{V, 1} > 0\) and \({\mathbb E}\{ Z(G^2-1)\} = 1/\delta \), the state evolution iteration (73) implies that \(\mu _{V, t} >0\) for all \(t \ge 1\). Using (102), we can write

$$\begin{aligned} \frac{1}{d} \langle {\varvec{v}}^t, {\varvec{x}}\rangle = \frac{\xi _t}{\sqrt{d}} \frac{\langle \hat{{\varvec{\varphi }}}_1, {\varvec{x}}\rangle }{\sqrt{d}} \, + \, \frac{\langle {\varvec{r}}^t, {\varvec{x}}\rangle }{d}. \end{aligned}$$
(135)

Recall that by hypothesis, \(\langle \hat{{\varvec{\varphi }}}_1, {\varvec{x}}\rangle \ge 0\). Moreover, using (130), the Cauchy–Schwarz inequality, and the fact that \(\Vert {\varvec{x}}\Vert _2/\sqrt{d}\) is bounded, we have that, almost surely,

$$\begin{aligned} \lim _{t \rightarrow \infty } \lim _{d \rightarrow \infty }\frac{\langle {\varvec{r}}^t, {\varvec{x}}\rangle }{d} = 0. \end{aligned}$$

Thus we deduce that, almost surely,

$$\begin{aligned} \lim _{t \rightarrow \infty } \lim _{d \rightarrow \infty } \frac{\xi _t}{\sqrt{d}} = + \frac{1}{\sqrt{\delta }}.\end{aligned}$$

Therefore,

$$\begin{aligned} \lim _{t\rightarrow \infty }\lim _{d\rightarrow \infty }\frac{1}{\sqrt{d}}\left\| \sqrt{\delta }{\varvec{v}}^t - \tilde{{\varvec{\varphi }}}^{(1)}\right\| _2=0 \quad \text {a.s.}, \end{aligned}$$
(136)

with \(\tilde{{\varvec{\varphi }}}^{(1)}=\sqrt{d}\hat{{\varvec{\varphi }}}_1\).
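For completeness, the triangle-inequality step behind (136) is

$$\begin{aligned} \frac{1}{\sqrt{d}}\left\| \sqrt{\delta }{\varvec{v}}^t - \tilde{{\varvec{\varphi }}}^{(1)}\right\| _2 \le \frac{\sqrt{\delta }}{\sqrt{d}}\left\| {\varvec{v}}^t - \xi _t \hat{{\varvec{\varphi }}}_1\right\| _2 + \left| \frac{\sqrt{\delta }\, \xi _t}{\sqrt{d}} - 1\right| \left\| \hat{{\varvec{\varphi }}}_1\right\| _2, \end{aligned}$$

where the first term on the right-hand side equals \(\delta \cdot \frac{1}{\sqrt{n}}\Vert {\varvec{v}}^t - \xi _t \hat{{\varvec{\varphi }}}_1\Vert _2\) and vanishes by (131), while the second term vanishes by the previous display, since \(\Vert \hat{{\varvec{\varphi }}}_1\Vert _2=1\).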

At this point, we are ready to prove (80). For any \(\mathrm{PL}(k)\) function \(\psi : \mathbb {R}^3 \rightarrow \mathbb {R}\), we have that

$$\begin{aligned} \begin{aligned}&\left| \frac{1}{d}\sum _{i=1}^d \psi (x_i, \tilde{x}^\mathrm{L}_i, \tilde{\varphi }_i^{(1)}) - \frac{1}{d}\sum _{i=1}^d \psi (x_i, \tilde{x}^\mathrm{L}_i, \sqrt{\delta }v_i^t) \right| \le \frac{1}{d}\sum _{i=1}^d \left| \psi (x_i, \tilde{x}^\mathrm{L}_i, \tilde{\varphi }_i^{(1)}) - \psi (x_i, \tilde{x}^\mathrm{L}_i, \sqrt{\delta }v_i^t) \right| \\&\quad \le \frac{C}{d}\sum _{i=1}^d |\tilde{\varphi }_i^{(1)}-\sqrt{\delta }v_i^t| \left[ 1+ \left( \big (\tilde{\varphi }_i^{(1)}\big )^2+ (\tilde{x}^\mathrm{L}_i)^2 +x_i^2 \right) ^{(k-1)/2}+ \left( \delta (v_i^t)^2 + (\tilde{x}^\mathrm{L}_i)^2 +x_i^2\right) ^{(k-1)/2} \right] \\&\quad \le \frac{C}{d}\sum _{i=1}^d |\tilde{\varphi }_i^{(1)}-\sqrt{\delta }v_i^t| \left[ 1+ 3^{\frac{k-1}{2}} \Big ( \left|{\tilde{\varphi }_i^{(1)}}\right|^{k-1} + \left|{\tilde{x}^\mathrm{L}_i}\right|^{k-1} + \left|{x_i}\right|^{k-1} + \left|{\sqrt{\delta } v_i^t}\right|^{k-1} + \left|{\tilde{x}^\mathrm{L}_i}\right|^{k-1} + \left|{x_i}\right|^{k-1} \Big ) \right] \\&\quad \le C'\frac{\Vert \tilde{{\varvec{\varphi }}}^{(1)} - \sqrt{\delta }{\varvec{v}}^t\Vert _2}{\sqrt{d}} \left[ 1+ \sum _{i=1}^d \left( \frac{ |\tilde{\varphi }_i^{(1)}|^{2(k-1)} + |\tilde{x}^\mathrm{L}_i|^{2(k-1)} + |{x}_i|^{2(k-1)} + |\sqrt{\delta } v_i^t|^{2(k-1)}}{d} \right) \right] ^{1/2}, \end{aligned} \end{aligned}$$
(137)

where \(C,C'\) are universal constants (which may depend on k but not on d, n). The second inequality above uses \(\psi \in \mathrm{PL}(k)\), and the third and fourth inequalities follow from the Hölder and Cauchy–Schwarz inequalities, respectively (the elementary bound behind the third inequality is spelled out after (139) below). We now claim that, almost surely,

$$\begin{aligned} \limsup _{d\rightarrow \infty }\sum _{i=1}^d \left( \frac{ |\tilde{\varphi }_i^{(1)}|^{2(k-1)} + |\tilde{x}^\mathrm{L}_i|^{2(k-1)} + |{x}_i|^{2(k-1)} + |\sqrt{\delta } v_i^t|^{2(k-1)}}{d} \right) \le C, \end{aligned}$$
(138)

where, from now on, we will use C to denote a generic positive constant that does not depend on d, n. If (138) holds, then by using (136) and (137), we deduce that, almost surely,

$$\begin{aligned} \lim _{t\rightarrow \infty }\lim _{d\rightarrow \infty } \left| \frac{1}{d}\sum _{i=1}^d \psi (x_i, \tilde{x}^\mathrm{L}_i, \tilde{\varphi }_i^{(1)}) - \frac{1}{d}\sum _{i=1}^d \psi (x_i, \tilde{x}^\mathrm{L}_i, \sqrt{\delta }v_i^t) \right| = 0. \end{aligned}$$
(139)
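For completeness, the elementary bound behind the factor \(3^{\frac{k-1}{2}}\) in the third inequality of (137) is, for any \(a, b, c \in \mathbb R\) and \(k\ge 1\),

$$\begin{aligned} \left( a^2+b^2+c^2\right) ^{\frac{k-1}{2}} \le \left( 3\max \{a^2, b^2, c^2\}\right) ^{\frac{k-1}{2}} \le 3^{\frac{k-1}{2}}\left( |a|^{k-1}+|b|^{k-1}+|c|^{k-1}\right) . \end{aligned}$$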

Let us now prove (138). First, by assumption (B1), we have that

$$\begin{aligned} \lim _{d \rightarrow \infty } \frac{1}{d} \sum _{i} |{x}_i|^{2(k-1)} \le C. \end{aligned}$$
(140)

Next, the main technical lemma [27, Lemma 2] leading to the state evolution result in Proposition 1 implies that, almost surely, for \(t\ge 1\),

$$\begin{aligned} \limsup _{d \rightarrow \infty } \frac{1}{d} \sum _i | v_i^t|^{2(k-1)} \le C. \end{aligned}$$
(141)

In particular, this follows from [27, Lemma 2(e)] (see also [4, Lemma 1(e)]). Since \(\tilde{{\varvec{x}}}^\mathrm{L} = {\varvec{v}}^1\), we also have that, almost surely,

$$\begin{aligned} \limsup _{d \rightarrow \infty } \frac{1}{d} \sum _i | \tilde{x}_i^\mathrm{L}|^{2(k-1)} \le C. \end{aligned}$$
(142)

It remains to show that, almost surely,

$$\begin{aligned} \limsup _{d \rightarrow \infty } \frac{1}{d} \sum _i (\tilde{\varphi }_i^{(1)})^{2(k-1)}\le C. \end{aligned}$$
(143)

To do so, we use a rotational invariance argument. Let \({\varvec{R}}\in \mathbb R^{d \times d}\) be an orthogonal matrix such that \({\varvec{R}}{\varvec{x}}= {\varvec{x}}\). Then,

$$\begin{aligned} \langle {\varvec{x}}, {\varvec{a}}_i\rangle =\langle {\varvec{R}}{\varvec{x}}, {\varvec{R}}{\varvec{a}}_i \rangle = \langle {\varvec{x}}, {\varvec{R}}{\varvec{a}}_i\rangle . \end{aligned}$$
(144)

Consequently, since the distribution of the sensing vectors is rotationally invariant and, by (144), the measurements \({\varvec{y}}\) (and hence \({\varvec{Z}}\)) are unchanged when each \({\varvec{a}}_i\) is replaced by \({\varvec{R}}{\varvec{a}}_i\), we have that

$$\begin{aligned} {\varvec{R}}{\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1} {\varvec{A}}{\varvec{R}}^{\mathsf{T}} {\mathop {=}\limits ^{{\text {d}}}}{\varvec{A}}^\mathsf{T}{\varvec{Z}}({\varvec{Z}}+{\varvec{I}}_n)^{-1} {\varvec{A}}, \end{aligned}$$
(145)

which, recalling that the principal eigenvector of \({\varvec{R}}{\varvec{M}}{\varvec{R}}^{\mathsf{T}}\) is \({\varvec{R}}\) times the principal eigenvector of \({\varvec{M}}\) and that the sign convention \(\langle \hat{{\varvec{\varphi }}}_1, {\varvec{x}}\rangle \ge 0\) is preserved by \({\varvec{R}}\) (as \({\varvec{R}}{\varvec{x}}={\varvec{x}}\)), immediately implies that

$$\begin{aligned} {\varvec{R}}\tilde{{\varvec{\varphi }}}^{(1)} {\mathop {=}\limits ^{{\text {d}}}}\tilde{{\varvec{\varphi }}}^{(1)}. \end{aligned}$$
(146)

Then, we can decompose \(\tilde{{\varvec{\varphi }}}^{(1)}\) as

$$\begin{aligned} \tilde{{\varvec{\varphi }}}^{(1)} = a_d \, {\varvec{x}}+ \sqrt{1-a_d^2}\, {\varvec{\varphi }}^{\perp }, \end{aligned}$$
(147)

where \({\varvec{\varphi }}^{\perp }\) is uniformly distributed over the set of vectors orthogonal to \({\varvec{x}}\) with norm \(\sqrt{d}\) and

$$\begin{aligned} \frac{1}{d} \langle {\varvec{x}}, \tilde{{\varvec{\varphi }}}^{(1)} \rangle = a_d. \end{aligned}$$
(148)

Relating the uniform distribution on the sphere to the normal distribution [54, Sec. 3.3.3], we can express \({\varvec{\varphi }}^\perp \) as follows:

$$\begin{aligned} {\varvec{\varphi }}^{\perp } =\sqrt{d} \, \frac{{\varvec{u}}- \displaystyle \frac{1}{d}\langle {\varvec{u}}, {\varvec{x}}\rangle {\varvec{x}}}{\Big \Vert {\varvec{u}}- \displaystyle \frac{1}{d}\langle {\varvec{u}}, {\varvec{x}}\rangle {\varvec{x}}\Big \Vert _2}, \end{aligned}$$
(149)

where \({\varvec{u}}\sim \mathsf{N}({\varvec{0}}_d, {\varvec{I}}_d)\) is independent of \({\varvec{x}}\). By the law of large numbers, we have the almost sure limits

$$\begin{aligned} \begin{aligned} \lim _{d\rightarrow \infty }\frac{1}{d}\langle {\varvec{u}}, {\varvec{x}}\rangle = 0, \qquad \quad \lim _{d\rightarrow \infty }\frac{1}{d}\Big \Vert {\varvec{u}}- \displaystyle \frac{1}{d}\langle {\varvec{u}}, {\varvec{x}}\rangle {\varvec{x}}\Big \Vert _2^2 = 1. \end{aligned} \end{aligned}$$
(150)
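As an aside, the construction in (149) is easy to check numerically. The following minimal sketch (not part of the proof; the signal \({\varvec{x}}\) below is an arbitrary vector rescaled so that \(\Vert {\varvec{x}}\Vert _2=\sqrt{d}\), chosen only for illustration) projects a standard Gaussian vector onto the orthogonal complement of \({\varvec{x}}\), rescales it to norm \(\sqrt{d}\), and verifies the limits in (150):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
d = 200_000

# Hypothetical signal, rescaled so that ||x||_2 = sqrt(d) (illustration only).
x = rng.standard_normal(d)
x *= np.sqrt(d) / np.linalg.norm(x)

# Construction (149): project a standard Gaussian u onto the orthogonal
# complement of x (note <x, x> = d) and rescale to norm sqrt(d).
u = rng.standard_normal(d)
proj = u - (u @ x / d) * x
phi_perp = np.sqrt(d) * proj / np.linalg.norm(proj)

print(phi_perp @ x / d)                 # ~ 0  (phi_perp is orthogonal to x)
print(np.linalg.norm(phi_perp)**2 / d)  # = 1  (phi_perp has norm sqrt(d))
print(u @ x / d)                        # ~ 0  (first limit in (150))
print(np.linalg.norm(proj)**2 / d)      # ~ 1  (second limit in (150))
\end{verbatim}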

Thus, by combining (147) and (149), we conclude that

$$\begin{aligned} \tilde{{\varvec{\varphi }}}^{(1)} = c_1 {\varvec{x}}+c_2{\varvec{u}}, \end{aligned}$$
(151)

where, using (150), the coefficients \(c_1\) and \(c_2\) are, almost surely, bounded by universal constants (independent of n, d) for all sufficiently large d. As a result,

$$\begin{aligned} \frac{1}{d} \sum _i (\tilde{\varphi }_i^{(1)})^{2(k-1)}\le 2^{2(k-1)} |c_1|^{2(k-1)} \frac{1}{d}\sum _i |x_i|^{2(k-1)}+2^{2(k-1)} |c_2|^{2(k-1)} \frac{1}{d} \sum _i |u_i|^{2(k-1)}. \end{aligned}$$
(152)

Note that, almost surely,

$$\begin{aligned} \lim _{d\rightarrow \infty } \frac{1}{d} \sum _i |u_i|^{2(k-1)}={\mathbb E}\{U^{2(k-1)}\}\le C, \end{aligned}$$
(153)

where \(U\sim \mathsf{N}(0, 1)\). By combining (140), (152) and (153), (143) immediately follows. Finally, by combining (140), (141), (142) and (143), we deduce that (138) holds.

We now use Proposition 1, which guarantees that, almost surely,

$$\begin{aligned} \left|{ \lim _{d \rightarrow \infty } \frac{1}{d} \sum _{i=1}^d \psi (x_i, \tilde{x}^\mathrm{L}_i, \sqrt{\delta }v_i^t) - {\mathbb E}\{ \psi (X, \, \mu _{V, 1} X + \sigma _{V,1} W_{V,1} , \, \sqrt{\delta }(\mu _{V, t} X + \sigma _{V,t} W_{V,t})) \}}\right| = 0, \quad t \ge 1. \end{aligned}$$
(154)

To conclude the proof of (80), we take the limit \(t \rightarrow \infty \) and use Remark 8. For this, we will show that

$$\begin{aligned} \begin{aligned}&\lim _{t \rightarrow \infty } {\mathbb E}\{ \psi (X, \, \mu _{V, 1} X + \sigma _{V,1} W_{V,1} , \, \sqrt{\delta }(\mu _{V, t} X + \sigma _{V,t} W_{V,t})) \} \\&\quad = {\mathbb E}\{ \psi (X, \, \mu _{V, 1} X + \sigma _{V,1} W_{V,1} , \, \sqrt{\delta }(\tilde{\mu }_{V} X + \tilde{\sigma }_{V} W_{V,\infty })) \}, \end{aligned} \end{aligned}$$
(155)

where \((W_{V,1}, W_{V,\infty })\) are zero mean jointly Gaussian random variables with covariance given by (81). Using (58) and the formulas for \(g_0\) and \(g_t\) from (66) and (68), we have

$$\begin{aligned} \begin{aligned} {\mathbb E}\{ W_{V,1} W_{V,t} \}&= \frac{1}{\sigma _{V,1} \sigma _{V,t}}{\mathbb E}\{ {\mathcal {T}}_L(Y) Z (\mu _{U,t-1} G + \sigma _{U,t-1}W_{U,t-1}) \} \\&= \frac{\mu _{V,t-1}}{\sqrt{\delta } \beta _{t-1} \sigma _{V,1} \sigma _{V,t}} {\mathbb E}\{ {\mathcal {T}}_L(Y) Z G \}. \end{aligned} \end{aligned}$$
(156)

In the second equality above, we used (72). Using the expression for \(\sigma _{V,1}\) from (75) and letting \(t \rightarrow \infty \), we have

$$\begin{aligned} \lim _{t \rightarrow \infty } {\mathbb E}\{ W_{V,1} W_{V,t} \} = \frac{ \tilde{\mu }_{V} {\mathbb E}\{ {\mathcal {T}}_L(Y) Z G \}}{ \tilde{\beta } \tilde{\sigma }_{V} \sqrt{ {\mathbb E}\{ {\mathcal {T}}_L(Y)^2\} } } = {\mathbb E}\{ W_{V,1} W_{V, \infty } \}. \end{aligned}$$
(157)

Therefore, the sequence of zero mean jointly Gaussian pairs \((W_{V,1}, W_{V,t})_{t \ge 1}\) converges in distribution to the jointly Gaussian pair \((W_{V,1}, W_{V,\infty })\), whose covariance is given by (81).

To show (155), we use Lemma 9 in Appendix G. We apply this result taking \(Q_t\) to be the distribution of

$$(X, \, \mu _{V,1}X + \sigma _{V,1} W_{V,1}, \, {\mu }_{V,t}X + {\sigma }_{V,t} W_{V,t} ).$$
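In the form in which it is applied here, Lemma 9 can be read as a generalized dominated-convergence statement of the following type (a paraphrase of how the lemma is used below, not its verbatim statement in Appendix G): for continuous \(\psi \) and continuous \(h\ge 0\),

$$\begin{aligned} Q_t \Rightarrow Q, \qquad \sup _{(a,b,c)\in \mathbb {R}^3} \frac{\left|{\psi (a,b,c)}\right|}{1+h(a,b,c)}<\infty , \qquad \int h \, \mathrm{d}Q_t \rightarrow \int h \, \mathrm{d}Q<\infty \quad \Longrightarrow \quad \int \psi \, \mathrm{d}Q_t \rightarrow \int \psi \, \mathrm{d}Q. \end{aligned}$$

We now verify these conditions for our choice of \(Q_t\), \(Q\), and \(\psi \).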

Since \(\mu _{V,t} \rightarrow \tilde{\mu }_V\), \(\sigma _{V,t} \rightarrow \tilde{\sigma }_V\), and \((W_{V,1}, W_{V,t})\) converges in distribution to \((W_{V,1}, W_{V,\infty })\), the sequence \((Q_t)_{t \ge 2}\) converges weakly to Q, which is the distribution of

$$(X, \, \mu _{V,1}X + \sigma _{V,1} W_{V,1}, \, \tilde{\mu }_{V}X + \tilde{\sigma }_{V} W_{V,\infty } ).$$

In our case, \(\psi : \mathbb {R}^3 \rightarrow \mathbb {R}\) is \(\mathrm{PL}(k)\), and therefore \(\left|{\psi (a,b,c)}\right| \le C'(1 + \left|{a}\right|^k + \left|{b}\right|^k + \left|{c}\right|^k)\) for all \((a,b,c) \in \mathbb {R}^3\) and some constant \(C'\). Choosing \(h(a,b,c) = \left|{a}\right|^k + \left|{b}\right|^k + \left|{c}\right|^k\), we have \(\frac{\left|{\psi }\right|}{1 + h} \le C'\). Furthermore, \(\int h \, \mathrm{d}Q_t\) depends on t only through \((\mu _{V,t}, \sigma _{V,t})\), and it is a continuous function of these two parameters. The integral \(\int h \, \mathrm{d}Q\) has the same form, except that \(\mu _{V,t}, \sigma _{V,t}\) are replaced by \(\tilde{\mu }_{V}, \tilde{\sigma }_{V}\), respectively. Since \(\mu _{V,t} \rightarrow \tilde{\mu }_{V}\) and \(\sigma _{V,t} \rightarrow \tilde{\sigma }_{V}\), we have that

$$\lim _{t \rightarrow \infty } \int h \, \mathrm{d}Q_t = \int h \, \mathrm{d}Q.$$

Therefore, by applying Lemma 9 in Appendix G, we have that

$$\begin{aligned} \lim _{t \rightarrow \infty } \int \psi \, \mathrm{d}Q_t = \int \psi \, \mathrm{d}Q, \end{aligned}$$

which is equivalent to (155). This completes the proof of the lemma.