1 Introduction

As the scale and dimensionality of data continue to grow in many applications (e.g., bioinformatics, finance, computer vision, medical informatics) (Sánchez et al. 2013; Mitchell et al. 2004; Simianer et al. 2012; Bartz et al. 2011), it becomes critical to develop efficient and effective algorithms to solve big data machine learning problems. Randomized reduction methods for large-scale and high-dimensional (LSHD) data analytics have received a great deal of attention in recent years (Mahoney and Drineas 2009; Shi et al. 2012; Paul et al. 2013; Weinberger et al. 2009; Mahoney 2011). By either reducing the dimensionality (referred to as feature reduction) or reducing the number of training instances (referred to as instance reduction), the resulting problem involves a smaller training set, which is both memory-efficient and computation-efficient. While randomized instance reduction has been studied extensively for least-squares regression (Drineas et al. 2008, 2006, 2011; Ma et al. 2014; Zhou et al. 2007), randomized feature reduction is more popular for linear classification (Blum 2005; Shi et al. 2009a, 2012; Paul et al. 2013; Weinberger et al. 2009) (e.g., random hashing is a notable built-in tool in Vowpal Wabbit,Footnote 1 a fast learning library, for solving high-dimensional problems). In this paper, we focus on the applications of randomized reduction methods to high-dimensional classification and sparse least-squares regression problems.

Previous theoretical results on randomized reduction methods for high-dimensional classification problems and sparse least-squares regression problems are limited. In particular, for high-dimensional classification problems, previous results on the preservation of margin (Blum 2005; Balcan et al. 2006; Shi et al. 2012; Paul et al. 2013) and the recovery error of the model (Zhang et al. 2014) rely on strong assumptions about the data. For example, both Paul et al. (2013) and Zhang et al. (2014) assume the data matrix is of low rank, and Blum (2005), Balcan et al. (2006), Shi et al. (2012) assume that all examples in the original space are separated with a positive margin (with a high probability). Another analysis in Zhang et al. (2014) imposes an additional assumption that the weight vector for classification is sparse. These assumptions are too strong to hold in many real applications. Zhou et al. (2007) studied randomized reduction for the \(\ell _1\) regularized least-squares problem and analyzed the sparsitency (i.e., the recovery of the support set) and the persistency (i.e., the generalization performance). However, their result is asymptotic, holding only when the number of instances approaches infinity, and it requires strong assumptions about the data matrix and other parameters.

To address these limitations, we propose to explore the intrinsic sparsity of the problem, i.e., the sparsity of the optimal dual solution for classification problems and the sparsity of the optimal model for sparse regression problems. To leverage the sparsity, we propose novel formulations for both classes of problems by introducing or strengthening the sparse regularizer. In particular, for LSHD classification problems we propose a sparse regularized dual formulation on the dimensionality-reduced data by leveraging the (near) sparsity of the optimal dual solution (i.e., the number of (effective) support vectors is small compared to the total number of examples). For large-scale sparse least-squares regression problems, we analyze a modified formulation on the scale-reduced data, obtained by increasing the regularization parameter before the sparse regularizer of the model. We develop a novel theoretical analysis of the recovery error of the learned high-dimensional models.

Compared with previous works (Blum 2005; Balcan et al. 2006; Shi et al. 2012; Paul et al. 2013; Zhang et al. 2014; Zhou et al. 2007), the presented theoretical results enjoy several advantages: (i) our analysis demands a milder assumption about the data; (ii) our analysis covers both smooth and non-smooth loss functions for classification and both the \(\ell _1\) regularizer and the elastic net regularizer for sparse regression; (iii) our analysis is applicable to a broad class of randomized reduction methods, whereas most previous works focus on random Gaussian projection; (iv) our results directly guarantee a small recovery error of the obtained model, which is critical for subsequent analysis, e.g., feature selection (Guyon et al. 2002; Brank et al. 2002) and model interpretation (Rätsch et al. 2005a; Sonnenburg and Franc 2010; Ratsch et al. 2005b; Sonnenburg et al. 2007; Ben-Hur et al. 2008). To elaborate on the last point, when exploiting a linear model to classify people into sick or not sick based on genomic markers, the learned weight vector is important for understanding the effect of different genomic markers on the disease and for designing effective medicine (Jostins and Barrett 2011; Kang et al. 2011). In addition, the recovery could also improve the predictive performance, in particular when there is noise in the original features (Goldberger et al. 2005).

The remainder of the paper is organized as follows. We review more related work in Sect. 2. We present preliminaries in Sect. 3 and the main results in Sect. 4. We present randomized reduction methods in Sect. 5, and discuss optimization and time complexity in Sect. 6. We present the proofs in Sect. 7. Experimental results are presented in Sect. 9. Finally, we conclude in Sect. 10. Proofs of some technical lemmas are presented in the “Appendix”.

2 Related work

2.1 Random projection for high-dimensional learning

Random projection has been employed to address the computational challenge of high-dimensional learning problems, including classification problems (Balcan et al. 2006) and regression problems (Maillard and Munos 2009). Let \({\mathbf {x}}_1,\ldots ,{\mathbf {x}}_n\in {\mathbb {R}}^d\) denote a set of instances. By random projection we can reduce the high-dimensional features to a low-dimensional feature space via \(\widehat{{\mathbf {x}}}_i=A{\mathbf {x}}_i\in {\mathbb {R}}^{m}\), where \(A\in {\mathbb {R}}^{m\times d}\) is a random projection matrix. Several works have studied theoretical properties of classification in the low-dimensional space. For example, Paul et al. (2013) considered the following problem and its random sketched (RS) counterpart:

$$\begin{aligned}&{\mathbf {w}}_*=\arg \min _{{\mathbf {w}}\in {\mathbb {R}}^d}\frac{1}{n}\sum _{i=1}^n\ell ({\mathbf {w}}^{\top }{\mathbf {x}}_i, y_i) + \frac{\lambda }{2}\Vert {\mathbf {w}}\Vert _2^2,\\&\text {RS:}\quad \min _{{\mathbf {u}}\in {\mathbb {R}}^m}\frac{1}{n}\sum _{i=1}^n\ell ({\mathbf {u}}^{\top }\widehat{{\mathbf {x}}}_i, y_i) + \frac{\lambda }{2}\Vert {\mathbf {u}}\Vert _2^2, \end{aligned}$$

where \(y_i\in \{1,-1\}\) and \(\ell (z,y)=\max (0, 1-yz)\) is the hinge loss. Paul et al. showed that the margin and the minimum enclosing ball in the reduced feature space are preserved to within a small relative error provided that the data matrix \(X\in {\mathbb {R}}^{n\times d}\) is of low rank. Zhang et al. (2013) studied the problem of recovering the original optimal solution \({\mathbf {w}}_*\) and proposed a dual recovery approach, i.e., using the learned dual variable in the reduced feature space to recover the model in the original feature space. They analyzed the recovery error under the low-rank assumption on the data matrix. In the extended version, Zhang et al. (2014) also considered the case where the optimal solution \({\mathbf {w}}_*\) is sparse and the data matrix is approximately low rank. They established a recovery error of the order \(O(\sqrt{s/m}\Vert {\mathbf {w}}_*\Vert _2)\), where s is either the rank of the data matrix or the sparsity of the optimal primal solution. In Blum (2005), Balcan et al. (2006), Shi et al. (2012), the authors showed that the classification margin in the reduced feature space is approximately preserved provided that the original data is linearly separable by a large margin (with a high probability). However, in practice these conditions are usually too strong to hold: the optimal model \({\mathbf {w}}_*\) is not necessarily sparse, the data matrix is not necessarily low rank, and the data is not necessarily linearly separable.
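As a concrete illustration of the RS setup above, the following minimal sketch projects the features with a Gaussian matrix (one admissible JL choice) and minimizes the reduced objective with the squared hinge loss by plain gradient descent; the function name, the choice of loss, and the solver are illustrative assumptions, not the cited methods.

```python
import numpy as np

def rs_classifier(X, y, m, lam, n_iter=500, rng=None):
    """Minimal sketch: train on randomly projected features xhat_i = A x_i."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, d))   # Gaussian projection matrix
    Xh = X @ A.T                                          # reduced features, n x m
    u = np.zeros(m)
    step = 1.0 / (2.0 * np.linalg.norm(Xh, 2) ** 2 / n + lam)
    for _ in range(n_iter):
        margins = y * (Xh @ u)
        # gradient of (1/n) sum max(0, 1 - y u^T xhat)^2 + (lam/2) ||u||^2
        grad = -2.0 / n * Xh.T @ (y * np.maximum(0.0, 1.0 - margins)) + lam * u
        u -= step * grad
    return u, A
```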

Recently, Pilanci and Wainwright (2015) studied a random sketched quadratic convex program over an arbitrary convex set, which is closely related to the present work. The original problem and the random sketched problem are given by

$$\begin{aligned}&{\mathbf {w}}_*=\arg \min _{{\mathbf {w}}\in {\mathcal {C}}} f({\mathbf {w}})\triangleq \Vert X{\mathbf {w}}- {\mathbf {y}}\Vert _2^2,\quad \text {RS: }\widehat{{\mathbf {w}}}_*=\arg \min _{{\mathbf {w}}\in {\mathcal {C}}}\Vert A(X{\mathbf {w}}- {\mathbf {y}})\Vert _2^2. \end{aligned}$$

The framework includes least-squares regression, sparse constrained least-squares regression, a dual formulation of a particular form of SVM, etc. They established a relative approximation error in terms of the objective function, i.e., \(f(\widehat{{\mathbf {w}}}_*)\le (1+\delta )^2f({\mathbf {w}}_*)\), by leveraging the Gaussian width of the intersection between the transformed tangent cone and the Euclidean ball. For sparse constrained least-squares regression and a particular form of SVM, they also exploit the sparsity of the optimal solution when bounding the Gaussian width. There are some key differences between our work and Pilanci and Wainwright (2015): (i) we focus on regularized formulations instead of constrained formulations; regularized formulations are attractive because they can be solved with less effort than constrained formulations (e.g., the \(\ell _1\) regularizer can be handled by soft-thresholding with O(d) time complexity, with d being the dimensionality of the variable, while projection onto the \(\ell _1\) constraint needs \(O(d\log (d))\) time complexity); (ii) we propose modified formulations on the random sketched data by introducing or strengthening the sparse regularizer, which yields improved recovery results; (iii) we also explore the strong convexity of the objective function to develop stronger theoretical guarantees; (iv) our analysis focuses on the recovery error of high-dimensional models instead of the relative optimization error in terms of the objective function. Notably, a small optimization error in the objective value does not necessarily indicate a small recovery error on the solution.

2.2 Approximate least-squares regression

In numerical linear algebra, one important problem is the over-constrained least-squares problem, i.e., finding a vector \({\mathbf {w}}_{*}\) such that the Euclidean norm of the residual error \(\Vert X{\mathbf {w}}- {\mathbf {y}}\Vert _2\) is minimized, where the data matrix \(X\in {\mathbb {R}}^{n\times d}\) has \(n\gg d\). The exact solver takes \(O(nd^2)\) time. Numerous studies have been devoted to developing and analyzing randomized algorithms that find an approximate solution to the above problem in \(o(nd^2)\) time (Drineas et al. 2006, 2008, 2011; Boutsidis et al. 2009; Mahoney 2011; Halko et al. 2011). These works share the same paradigm: apply an appropriate random matrix \(A\in {\mathbb {R}}^{m\times n}\) to both X and \({\mathbf {y}}\) and solve the induced subproblem, i.e., \(\widehat{{\mathbf {w}}}_{*}=\arg \min _{{\mathbf {w}}\in {\mathbb {R}}^d}\Vert A(X{\mathbf {w}}- {\mathbf {y}})\Vert _2\). Relative-error bounds for \(\Vert {\mathbf {y}}- X\widehat{{\mathbf {w}}}_{*}\Vert _2\) and \(\Vert {\mathbf {w}}_{*}-\widehat{{\mathbf {w}}}_{*}\Vert _2\) have been developed. Nevertheless, sparse regularized least-squares regression with random sketched data has not been studied much except for Zhou et al. (2007) and Pilanci and Wainwright (2016). Zhou et al. (2007) focus on an asymptotic analysis of sparsitency (i.e., the recovery of the support set) and persistency (i.e., the generalization performance). Pilanci and Wainwright (2016) studied the unconstrained and sparse-constrained least-squares problems under sketched data. In particular, they established a lower bound on the statistical error of any estimator based on the sketched data \((AX, A{\mathbf {y}})\), which is sub-optimal compared with the estimator that is directly learned from \((X, {\mathbf {y}})\). Hence, they proposed an iterative Hessian sketch method based on the data \((AX, X^{\top }{\mathbf {y}})\) to learn a solution that is statistically optimal. It is worth mentioning that the difference between this work and Pilanci and Wainwright (2016) is that we focus on learning from the sketched data \((AX, A{\mathbf {y}})\), and the statistical error of the model learned from the sketched data \((AX, A{\mathbf {y}})\) by our method matches the lower bound established in Pilanci and Wainwright (2016). This indicates that our method is statistically optimal among all methods that learn a sparse solution based on the sketched data \((AX, A{\mathbf {y}})\).
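A minimal sketch of this paradigm for the over-constrained case, assuming a Gaussian sketching matrix (one of several admissible choices) and using NumPy's standard least-squares solver on the reduced system:

```python
import numpy as np

def sketched_least_squares(X, y, m, rng=None):
    """Solve min_w ||A(Xw - y)||_2 on an m x d sketched system (illustrative)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))   # Gaussian sketching matrix
    w_hat, *_ = np.linalg.lstsq(A @ X, A @ y, rcond=None)
    return w_hat
```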

2.3 Sparse recovery analysis

The LASSO problem has been one of the core problems in statistics and machine learning; it essentially aims to learn a high-dimensional sparse vector \({\mathbf {u}}_*\in {\mathbb {R}}^d\) from (potentially noisy) linear measurements \({\mathbf {y}}= X{\mathbf {u}}_* + \xi \in {\mathbb {R}}^n\). A rich theoretical literature (Tibshirani 1996; Zhao and Yu 2006; Wainwright 2009) describes the consistency, in particular the sign consistency, of various sparse regression techniques. A stringent “irrepresentable condition” has been established to achieve sign consistency. To circumvent this stringent assumption, several studies (Jia and Rohe 2012; Paul et al. 2008) have proposed to precondition the data matrix X and/or the target vector \({\mathbf {y}}\) by PX and \(P{\mathbf {y}}\) before solving the LASSO problem, with a particular preconditioning matrix \(P\in {\mathbb {R}}^{n\times n}\). Oracle inequalities for the solution of the LASSO (Bickel et al. 2009) and other sparse estimators (e.g., the Dantzig selector of Candes and Tao 2007) have also been established under restricted eigen-value conditions on the data matrix X and the Gaussian noise assumption on \(\xi \). The focus of these studies is the regime where the number of measurements n is much less than the number of features, i.e., \(n\ll d\). Different from these works, we consider that both n and d are significantly largeFootnote 2 and design fast algorithms for solving the sparse least-squares problem approximately by using the random sketched data \(AX\in {\mathbb {R}}^{m\times d}\) and \(A{\mathbf {y}}\in {\mathbb {R}}^{m\times 1}\) with \(m\ll n\). The analysis is centered on the recovery error of the learned model with respect to the optimal solution to the original problem. For problems that are not strongly convex, we also use the restricted eigen-value conditions of the data matrix to establish the recovery error.

2.4 Randomized reduction by Johnson–Lindenstrauss (JL) transforms

The JL transforms refer to a class of transforms that obey the JL lemma (Johnson and Lindenstrauss 1984), which states that any N points in Euclidean space can be embedded into \(O(\epsilon ^{-2}\log N)\) dimensions so that all pairwise Euclidean distances are preserved up to a factor of \(1 \pm \epsilon \). Since the original Johnson–Lindenstrauss result, many transforms have been designed to satisfy the JL lemma, including Gaussian random matrices (Dasgupta and Gupta 2003), sub-Gaussian random matrices (Achlioptas 2003), the randomized Hadamard transform (Ailon and Chazelle 2006), and sparse JL transforms by random hashing (Dasgupta et al. 2010; Kane and Nelson 2014). The analysis presented in this work builds upon the JL lemma, and therefore our method can enjoy the computational benefits of sparse JL transforms, i.e., less memory and fast computation. More details of these transforms will be given later.

3 Preliminaries

Let \(\Vert \cdot \Vert _2\) and \(\Vert \cdot \Vert _1\) denote the Euclidean norm (\(\ell _2\) norm) and \(\ell _1\) norm, respectively. Denote by \(\partial f({\mathbf {w}})\) and \(\nabla f({\mathbf {w}})\) the subgradient and gradient of a non-smooth function and a smooth function, respectively. A function \(f({\mathbf {w}}): {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) is a G-Lipschitz continuous function w.r.t \(\Vert \cdot \Vert _2\), if

$$\begin{aligned} |f({\mathbf {w}}_1)- f({\mathbf {w}}_2)|\le G\Vert {\mathbf {w}}_1-{\mathbf {w}}_2\Vert _2,\forall {\mathbf {w}}_1,{\mathbf {w}}_2\in {\mathbb {R}}^d. \end{aligned}$$

A convex function \(f({\mathbf {w}}):{\mathcal {D}}\rightarrow {\mathbb {R}}\) is \(\beta \)-strongly convex w.r.t \(\Vert \cdot \Vert _2\), if for any \(\alpha \in [0, 1]\) and \(\forall {\mathbf {w}}_1,{\mathbf {w}}_2\in {\mathcal {D}}\)

$$\begin{aligned} f(\alpha {\mathbf {w}}_1+ (1-\alpha ){\mathbf {w}}_2)\le \alpha f({\mathbf {w}}_1) + (1-\alpha ) f({\mathbf {w}}_2) - \frac{1}{2}\alpha (1-\alpha )\beta \Vert {\mathbf {w}}_1-{\mathbf {w}}_2\Vert _2^2, \end{aligned}$$

where \(\beta \) is called the strong convexity modulus of f. The strong convexity is also equivalent to

$$\begin{aligned} f({\mathbf {w}}_1) \ge f({\mathbf {w}}_2) + \langle \partial f({\mathbf {w}}_2), {\mathbf {w}}_1-{\mathbf {w}}_2\rangle + \frac{\beta }{2}\Vert {\mathbf {w}}_1-{\mathbf {w}}_2\Vert _2^2, \forall {\mathbf {w}}_1,{\mathbf {w}}_2\in {\mathcal {D}}. \end{aligned}$$

A function \(f({\mathbf {w}}):{\mathcal {D}}\rightarrow {\mathbb {R}}\) is L-smooth w.r.t \(\Vert \cdot \Vert _2\), if it is differentiable and its gradient is L-Lipschitz continuous, i.e.,

$$\begin{aligned} \Vert \nabla f({\mathbf {w}}_1) - \nabla f({\mathbf {w}}_2)\Vert _2\le L\Vert {\mathbf {w}}_1 - {\mathbf {w}}_2\Vert _2, \forall {\mathbf {w}}_1, {\mathbf {w}}_2\in {\mathcal {D}}, \end{aligned}$$

or equivalently

$$\begin{aligned} f({\mathbf {w}}_1) \le f({\mathbf {w}}_2) + \langle \nabla f({\mathbf {w}}_2), {\mathbf {w}}_1-{\mathbf {w}}_2\rangle + \frac{L}{2}\Vert {\mathbf {w}}_1-{\mathbf {w}}_2\Vert _2^2, \forall {\mathbf {w}}_1,{\mathbf {w}}_2\in {\mathcal {D}}. \end{aligned}$$

Given a convex function \(f({\mathbf {w}})\), its convex conjugate function \(f^*({\mathbf {u}})\) is defined as

$$\begin{aligned} f^*({\mathbf {u}})=\max _{{\mathbf {w}}\in {\mathcal {D}}}{\mathbf {w}}^{\top }{\mathbf {u}}- f({\mathbf {w}}). \end{aligned}$$

It is known that the convex conjugate of an L-smooth function is (1/L)-strongly convex, and vice versa (Kakade et al. 2009).
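As a simple illustration (a standard computation, not specific to this paper), the squared hinge loss is 2-smooth and its conjugate is

$$\begin{aligned} \ell (z)=\max (0, 1-z)^2 \quad \Longrightarrow \quad \ell ^*(\alpha )=\alpha + \frac{\alpha ^2}{4} \ \text { for } \alpha \le 0 \ (\text {and } +\infty \text { otherwise}), \end{aligned}$$

which is indeed (1/2)-strongly convex on its domain.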

We write \(f=O(g)\) if there exists a constant \(C>0\) such that \(f\le Cg\), write \(f=\varOmega (g)\) if there exists a constant \(c>0\) such that \(f\ge cg\), and write \(f=\varTheta (g)\) if there exist constants \(0<c\le C\) such that \(cg\le f\le Cg\). A vector \({\mathbf {w}}\in {\mathbb {R}}^d\) is said to be s-sparse if it has at most s non-zero entries.

Let \(({\mathbf {x}}_i, y_i), i=1,\ldots , n\) denote a set of training examples, where \({\mathbf {x}}_i\in {\mathbb {R}}^d\) denotes the feature vector and \(y_i\) denotes the label. Assume both n and d are very large. For classification, we consider \(y_i\in \{1, -1\}\), and for regression we consider \(y_i\in {\mathbb {R}}\). Let \(X=({\mathbf {x}}_1,\ldots , {\mathbf {x}}_n)^{\top }=({\bar{{\mathbf {x}}}}_1,\ldots , {\bar{{\mathbf {x}}}}_d)\in {\mathbb {R}}^{n\times d}\) denote the data matrix and \({\mathbf {y}}=(y_1,\ldots , y_n)^{\top }\). In this paper, we focus on learning a linear model \({\mathbf {w}}\in {\mathbb {R}}^d\) from the training examples that makes a prediction by \({\mathbf {w}}^{\top }{\mathbf {x}}\). For classification, the problem of interest can be cast into a regularized minimization:

$$\begin{aligned} {\mathbf {w}}^c_* = \arg \min _{{\mathbf {w}}\in {\mathbb {R}}^d} \frac{1}{n}\sum _{i=1}^n\ell (y_i{\mathbf {w}}^{\top }{\mathbf {x}}_i) + \frac{\lambda }{2}\Vert {\mathbf {w}}\Vert ^2_{2}, \end{aligned}$$
(1)

where \(\ell (z)\) is a convex loss function suited for classification (e.g., the hinge loss \(\ell (z)=\max (0, 1-z)\) or the squared hinge loss \(\ell (z)=\max (0, 1- z)^2\)) and \(\lambda \) is a regularization parameter. Using the conjugate function, we can turn the problem into a dual problem:

$$\begin{aligned} \alpha _* = \arg \max _{\alpha \in {\mathbb {R}}^n} -\frac{1}{n}\sum _{i=1}^n\ell _i^*(\alpha _i) - \frac{1}{2\lambda n^2}\alpha ^TXX^{\top }\alpha , \end{aligned}$$
(2)

where \(\ell ^*_i(\alpha )\) is the convex conjugate function of \(\ell (z)\). Given the optimal dual solution \(\alpha _*\), the optimal primal solution can be computed by \({\mathbf {w}}^c_* = -\frac{1}{\lambda n}X^{\top }\alpha _*\).

For regression, we consider the sparse regularized least-squares regression (SLSR) problem:

$$\begin{aligned} {\mathbf {w}}^r_* =\arg \min _{{\mathbf {w}}\in {\mathbb {R}}^d} \frac{1}{2n}\Vert X{\mathbf {w}}- {\mathbf {y}}\Vert _2^2 + R({\mathbf {w}}), \end{aligned}$$
(3)

where \(R({\mathbf {w}})\) is a sparsity-inducing norm. In the sequel, we consider two widely used sparsity-inducing norms: (i) \(R({\mathbf {w}})=\gamma \Vert {\mathbf {w}}\Vert _1\), the \(\ell _1\) norm, which leads to the formulation known as the LASSO (Tibshirani 1996); and (ii) \(R({\mathbf {w}})=\frac{\lambda }{2}\Vert {\mathbf {w}}\Vert _2^2+\gamma \Vert {\mathbf {w}}\Vert _1\), the mixture of the \(\ell _1\) and \(\ell _2\) norms, which leads to the formulation known as the Elastic Net (Zou and Hastie 2003).

Several remarks are in order. Firstly, we will abuse the same notation \({\mathbf {w}}_*\) to denote the optimal solution to the classification problem or the SLSR problem when it is clear from the context. Secondly, the dual formulation in (2) for classification and the SLSR problem (3) share a similar structure that consists of a quadratic term and a (strongly) convex term, which allows us to derive similar theoretical guarantees on the recovery error of \(\alpha _*\) for classification and of \({\mathbf {w}}^r_*\) for SLSR. In particular, when the loss function \(\ell (z)\) is smooth (e.g., the squared hinge loss), its conjugate function is strongly convex. When \(R({\mathbf {w}}) =\frac{\lambda }{2}\Vert {\mathbf {w}}\Vert _2^2+\gamma \Vert {\mathbf {w}}\Vert _1\) with \(\lambda >0\), it is strongly convex. Thirdly, we assume \(\alpha _*\) and \({\mathbf {w}}^r_*\) are sparse. The sparsity of \(\alpha _*\) is implicit in many large-scale high-dimensional classification problems, as indicated by the fact that the number of support vectors is usually much smaller than the total number of examples. The sparsity of \({\mathbf {w}}^r_*\) is explicitly induced by the sparsity-inducing regularizer \(R({\mathbf {w}})\). We will also consider the case when \(\alpha _*\) is nearly sparse.

Let \(A\in {\mathbb {R}}^{m\times N}\) denote a random reduction matrix from a distribution that satisfies the JL lemma stated below.

Lemma 1

(JL Lemma) For any integer \(N > 0\), and any \(0< \delta < 1/2\), there exists a probability distribution on \(m\times N\) real matrices A such that for any fixed \({\mathbf {x}}\in {\mathbb {R}}^N\) with a probability at least \(1-\delta \), we have

$$\begin{aligned} \left| \Vert A{\mathbf {x}}\Vert _2^2 - \Vert {\mathbf {x}}\Vert ^2_2\right| \le c\sqrt{\frac{\log (1/\delta )}{m}}\Vert {\mathbf {x}}\Vert _2^2, \end{aligned}$$
(4)

where c is a universal constant.

We will discuss several random matrices A that satisfy the above assumption in more details in Sect. 5.

4 Main results

Although the analysis for the classification problem shares many similarities with that for the SLSR problem, we state the results and analysis separately for the sake of clarity.

4.1 Classification

For LSHD data, directly solving (1) or (2) is very expensive. We address the computational challenge by employing randomized reduction methods to reduce the dimensionality. Let \(\widehat{{\mathbf {x}}}_1,\ldots , \widehat{{\mathbf {x}}}_n\) denote the random sketched data of the training examples, where \(\widehat{{\mathbf {x}}}_i = A{\mathbf {x}}_i\in {\mathbb {R}}^m\). With the random sketched data, a conventional approach is to solve the following reduced primal problem

$$\begin{aligned} \min _{{\mathbf {u}}\in {\mathbb {R}}^m} \frac{1}{n}\sum _{i=1}^n\ell (y_i{\mathbf {u}}^{\top }\widehat{{\mathbf {x}}}_i) + \frac{\lambda }{2}\Vert {\mathbf {u}}\Vert ^2_{2}, \end{aligned}$$
(5)

or its dual problem

$$\begin{aligned} \widehat{\alpha }_* = \arg \max _{\alpha \in {\mathbb {R}}^n} -\frac{1}{n}\sum _{i=1}^n\ell _i^*(\alpha _i) - \frac{1}{2\lambda n^2}\alpha ^T{\widehat{X}}{\widehat{X}}^{\top }\alpha , \end{aligned}$$
(6)

where \({\widehat{X}}= (\widehat{{\mathbf {x}}}_1,\ldots , \widehat{{\mathbf {x}}}_n)^{\top }\in {\mathbb {R}}^{n\times m}\). Zhang et al. (2014) proposed a dual recovery approach that constructs a recovered solution by \(\widehat{{\mathbf {w}}}_* = -\frac{1}{\lambda n}\sum _{i=1}^n[\widehat{\alpha }_*]_i{\mathbf {x}}_i\) and proved a recovery error bound for random projection under the assumption of a low-rank data matrix or a sparse \({\mathbf {w}}_*\). One deficiency of the simple dual recovery approach is that it fails to capture that data in the reduced feature space could become more difficult to separate, especially when the reduced dimensionality m is small. As a result, the magnitude of the dual solution may become larger, causing many non-support vectors (of the original problem) to become support vectors, which could further inflate the recovery error. That is also the reason why the original analysis of the dual recovery method requires a strong assumption on the data (i.e., the low-rank assumption). In this work, we address this limitation in a different way, which allows us to relax the assumption significantly.

To reduce the number, or the contribution, of training instances that are non-support vectors in the original optimization problem but are transformed into support vectors due to the reduction of dimensionality, we employ a simple trick that adds a sparse regularizer on the dual variable to the reduced dual problem. In particular, we solve the following problem:

$$\begin{aligned} \widetilde{\alpha }_* = \arg \max _{\alpha \in {\mathbb {R}}^n} -\frac{1}{n}\sum _{i=1}^n\ell _i^*(\alpha _i) - \frac{1}{2\lambda n^2}\alpha ^T{\widehat{X}}{\widehat{X}}^{\top }\alpha - \frac{1}{n}\tau \Vert \alpha \Vert _1, \end{aligned}$$
(7)

where \(\tau >0\) is a regularization parameter, whose theoretical value will be revealed later. Given \(\widetilde{\alpha }_*\), we can recover a model in the original high-dimensional space by

$$\begin{aligned} \widetilde{{\mathbf {w}}}_{*} = - \frac{1}{\lambda n}X^{\top }\widetilde{\alpha }_*. \end{aligned}$$
(8)

Geometric Explanation by Reduced Margin: To further understand the added sparse regularizer, we consider SVM, where the loss function can be either the hinge loss (a non-smooth function) \(\ell (z)=\max (0, 1-z)\) or the squared hinge loss (a smooth function) \(\ell (z) = \max (0, 1 -z)^2\). We first consider the hinge loss, where \(\ell _i^*(\alpha _i) = \alpha _i y_i\) for \(\alpha _iy_i\in [-1, 0]\). Then the new dual problem is equivalent to

$$\begin{aligned} \max _{\alpha \circ {\mathbf {y}}\in [-1, 0]^n} \frac{1}{n}\sum _{i=1}^n-\alpha _iy_i - \frac{1}{2\lambda n^2}\alpha ^T{\widehat{X}}{\widehat{X}}^{\top }\alpha - \frac{\tau }{n}\Vert \alpha \Vert _1. \end{aligned}$$

Using variable transformation \(-\alpha _i y_i \rightarrow \beta _i\), the above problem is equivalent to

$$\begin{aligned} \max _{\beta \in [0,1]^n} \frac{1}{n}\sum _{i=1}^n\beta _i (1-\tau )- \frac{1}{2\lambda n^2}(\beta \circ {\mathbf {y}})^T{\widehat{X}}{\widehat{X}}^{\top }(\beta \circ {\mathbf {y}}). \end{aligned}$$

Changing into the primal form, we have

$$\begin{aligned} \min _{{\mathbf {u}}\in {\mathbb {R}}^m} \frac{1}{n}\sum _{i=1}^n\ell _{1-\tau }({\mathbf {u}}^{\top }\widehat{{\mathbf {x}}}_iy_i) + \frac{\lambda }{2}\Vert {\mathbf {u}}\Vert _2^2, \end{aligned}$$
(9)

where \(\ell _\gamma (z) = \max (0, \gamma - z)\) is a max-margin loss with margin \(\gamma \). Hence, adding the \(\ell _1\) regularization in the reduced problem of SVM is equivalent to using a max-margin loss with a smaller margin. This is intuitive because examples become more difficult to separate after dimensionality reduction, and it is consistent with several previous studies showing that the margin shrinks in the reduced feature space (Blum 2005; Shi et al. 2012). Similarly, for the squared hinge loss, the equivalent primal problem is

$$\begin{aligned} \min _{{\mathbf {u}}\in {\mathbb {R}}^m} \frac{1}{n}\sum _{i=1}^n\ell ^2_{1-\tau }({\mathbf {u}}^{\top }\widehat{{\mathbf {x}}}_iy_i) + \frac{\lambda }{2}\Vert {\mathbf {u}}\Vert _2^2, \end{aligned}$$
(10)

where \(\ell ^2_\gamma (z) = \max (0, \gamma - z)^2\). Although adding a sparse regularizer on the dual variable is intuitive and can be motivated from previous results, we emphasize that the proposed sparse regularized dual formulation provides a new perspective, and bounding the dual recovery error \(\Vert \widetilde{\alpha }_*-\alpha _*\Vert \) is a non-trivial task. A minimal computational sketch of this recovery pipeline is given below; we then state our main result in Theorem 1 for smooth loss functions.
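The following is a minimal sketch of the pipeline (7)–(8) for the hinge-loss case, where (7) reduces to the box-constrained form derived above; the projected gradient ascent solver, the function name, and the step-size choice are illustrative assumptions, not the algorithm analyzed in this paper.

```python
import numpy as np

def dual_sparse_recovery(X, y, A, lam, tau, n_iter=2000):
    """Sketch of dual-sparse recovery with the hinge loss: solve the reduced dual (7)
    in its equivalent box-constrained form, then recover a model via (8)."""
    n, d = X.shape
    Xhat = X @ A.T                              # random sketched data, n x m
    K = Xhat @ Xhat.T                           # Gram matrix of the sketched data
    step = lam * n ** 2 / (np.linalg.norm(K, 2) + 1e-12)
    beta = np.zeros(n)                          # beta_i = -alpha_i * y_i in [0, 1]
    for _ in range(n_iter):
        grad = (1.0 - tau) / n - (y * (K @ (beta * y))) / (lam * n ** 2)
        beta = np.clip(beta + step * grad, 0.0, 1.0)   # gradient ascent + box projection
    alpha = -beta * y                           # dual variables of (7)
    w_tilde = -(X.T @ alpha) / (lam * n)        # recovery step (8)
    return w_tilde, alpha
```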

Theorem 1

Let \(A\in {\mathbb {R}}^{m\times d}\) be a random matrix sampled from a distribution that satisfies the JL lemma. Let \(\widetilde{\alpha }_*\) be the optimal dual solution to (7). Assume \(\alpha _*\) is s-sparse, \(\max _i\Vert {\mathbf {x}}_i\Vert _2\le R\) and \(\ell (z)\) is L-smooth. If we set

$$\begin{aligned} \tau =\varTheta \left( R\Vert {\mathbf {w}}_*\Vert _2 \sqrt{\frac{\log (2n/\delta )}{m}}\right) \ge 2cR\Vert {\mathbf {w}}_*\Vert _2 \sqrt{\frac{\log (2n/\delta )}{m}}, \end{aligned}$$

then we have

$$\begin{aligned}&\Vert \widetilde{\alpha }_* -\alpha _*\Vert _2\le O\left( LR\Vert {\mathbf {w}}_*\Vert _2 \sqrt{s}\sqrt{\frac{\log (2n/\delta )}{m}}\right) , \end{aligned}$$
(11)
$$\begin{aligned}&\Vert \widetilde{\alpha }_* -\alpha _*\Vert _1\le O\left( LR\Vert {\mathbf {w}}_*\Vert _2s \sqrt{\frac{\log (2n/\delta )}{m}}\right) , \end{aligned}$$
(12)

where c is the quantity that appears in Lemma 1.

Remark 1

Smooth loss functions that can yield a sparse \(\alpha _*\) include the squared hinge loss \(\ell (z) = \max (0, 1-z)^2\) and the \(\epsilon \)-insensitive loss. From the proof, we will see that the value of \(\tau \) depends on the order of \(\Vert (XX^{\top } - {\widehat{X}}{\widehat{X}}^{\top })\alpha _*\Vert _\infty \), which we can bound without any assumption about the data matrix. In contrast, previous bounds (Zhang et al. 2013, 2014; Paul et al. 2013) depend on \(\Vert XX^{\top } - {\widehat{X}}{\widehat{X}}^{\top }\Vert _2\), which requires the low-rank assumption on X. The result in the above theorem indicates that the recovery error of the obtained dual variable scales as \(\sqrt{1/m}\) in terms of m, the same order of recovery error as in Zhang et al. (2013, 2014), which assume the data matrix is of low rank.

Remark 2

We would like to make a connection with the LASSO for sparse signal recovery. In sparse signal recovery under noisy measurements \({\mathbf {f}} = U{\mathbf {w}}_*+{\mathbf {e}}\), where \({\mathbf {e}}\) denotes the noise in the measurements, if the LASSO problem \(\min _{{\mathbf {w}}}\frac{1}{2}\Vert U{\mathbf {w}} - {\mathbf {f}}\Vert _2^2 + \lambda \Vert {\mathbf {w}}\Vert _1\) is solved, then the regularization parameter \(\lambda \) is required to be larger than the quantity \(\Vert U^{\top }{\mathbf {e}}\Vert _\infty \), which depends on the noise, in order to have an accurate recovery (Eldar and Kutyniok 2012). Similarly, in our formulation the added \(\ell _1\) regularization \(\tau \Vert \alpha \Vert _1\) counteracts the noise in \({\widehat{X}}{\widehat{X}}^{\top }\) as compared with \(XX^{\top }\), and the value of \(\tau \) is dependent on the noise.

Next, we present the results for non-smooth loss functions. Compared with a smooth loss function that renders the objective function (7) strongly convex, the convex conjugate of a non-smooth function is not strongly convex. To address this limitation, we can explore the restricted strong convexity of the quadratic term with respect to sparse solutions. To this end, we introduce the restricted eigen-value conditions similar to those used in the sparse recovery analysis for LASSO (Bickel et al. 2009; Xiao and Zhang 2013). In particular, we introduce the following definition of restricted eigen-values.

Definition 1

Given an integer \(s>0\), we define

$$\begin{aligned} {\mathcal {K}}_{n,s}=\{\alpha \in {\mathbb {R}}^n: \Vert \alpha \Vert _2\le 1, \Vert \alpha \Vert _1\le \sqrt{s}\}. \end{aligned}$$

We say that \(XX^{\top }\in {\mathbb {R}}^{n\times n}\) satisfies the restricted eigenvalue condition at sparsity level s if there exist positive constants \(\rho ^+_{n,s}\) and \(\rho ^{-}_{n,s}\) such that

$$\begin{aligned} \rho ^+_{n,s} = \sup _{\alpha \in {\mathcal {K}}_{n, s}}\frac{\alpha ^{\top }XX^{\top }\alpha }{n}, \quad \rho ^{-}_{n, s} = \inf _{\alpha \in {\mathcal {K}}_{n,s}} \frac{\alpha ^{\top }XX^{\top }\alpha }{n}. \end{aligned}$$
(13)

We also define another quantity that measures the restricted eigen-value of \(XX^{\top } - {\widehat{X}}{\widehat{X}}^{\top }\), namely

$$\begin{aligned} \sigma _{n,s} = \sup _{\alpha \in \mathcal K_{n,s}}\frac{|\alpha ^{\top }(XX^{\top } - {\widehat{X}}{\widehat{X}}^{\top })\alpha |}{n}. \end{aligned}$$
(14)
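As a rough diagnostic, the restricted quantities in (13) and (14) can be estimated numerically by sampling sparse unit vectors; the sketch below is an assumption-laden heuristic (the sampled extrema only bound the true sup/inf from one side), not part of the analysis.

```python
import numpy as np

def restricted_quantities(X, Xhat, s, n_samples=2000, rng=None):
    """Crude Monte Carlo estimates of rho^+_{n,s}, rho^-_{n,s} and sigma_{n,s}.

    Random s-sparse unit vectors lie in K_{n,s} because ||a||_1 <= sqrt(s)*||a||_2.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    rho_plus, rho_minus, sigma = -np.inf, np.inf, 0.0
    for _ in range(n_samples):
        a = np.zeros(n)
        idx = rng.choice(n, size=s, replace=False)
        a[idx] = rng.normal(size=s)
        a /= np.linalg.norm(a)
        q = np.linalg.norm(X.T @ a) ** 2 / n        # alpha^T X X^T alpha / n
        qh = np.linalg.norm(Xhat.T @ a) ** 2 / n    # alpha^T Xhat Xhat^T alpha / n
        rho_plus, rho_minus = max(rho_plus, q), min(rho_minus, q)
        sigma = max(sigma, abs(q - qh))
    return rho_plus, rho_minus, sigma
```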

The lemma below shows that \(\sigma _{n,s}\) can be bounded by \({\widetilde{O}}(\rho _{n,s}^+\sqrt{s/m})\).

Lemma 2

Let \(A\in {\mathbb {R}}^{m\times d}\) be a random matrix sampled from a distribution that satisfies the JL lemma. With a probability \(1-\delta \), we have

$$\begin{aligned} \sigma _{n,s}\le 16c\rho ^+_{n,s}\sqrt{\frac{(\log (2/\delta )+2s\log (27n/s))}{m}}, \end{aligned}$$

where c is the quantity that appears in Lemma 1.

Theorem 2

Let \(A\in {\mathbb {R}}^{m\times d}\) be a random matrix sampled from a distribution that satisfies the JL lemma. Let \(\widetilde{\alpha }_*\) be the optimal dual solution to (7). Assume \(\alpha _*\) is s-sparse, \(\max _i\Vert {\mathbf {x}}_i\Vert _2\le R\), the data matrix \(XX^{\top }\) satisfies the restricted eigen-value condition at sparsity level 16s and \(\sigma _{n, 16s}<\rho ^-_{n, 16s}\) . If we set

$$\begin{aligned} \tau =\varTheta \left( 2cR\Vert {\mathbf {w}}_*\Vert _2 \sqrt{\frac{\log (2n/\delta )}{m}}\right) \ge 2cR\Vert {\mathbf {w}}_*\Vert _2 \sqrt{\frac{\log (2n/\delta )}{m}}, \end{aligned}$$

then we have

$$\begin{aligned} \Vert \widetilde{\alpha }_* -\alpha _*\Vert _2&\le O\left( \frac{\lambda }{(\rho ^-_{n,16s} - \sigma _{n,16s})}R\Vert {\mathbf {w}}_*\Vert _2\sqrt{s} \sqrt{\frac{\log (2n/\delta )}{m}}\right) ,\\ \Vert \widetilde{\alpha }_* -\alpha _*\Vert _1&\le O\left( \frac{\lambda }{(\rho ^-_{n,16s} - \sigma _{n,16s})} R\Vert {\mathbf {w}}_*\Vert _2s\sqrt{\frac{\log (2n/\delta )}{m}}\right) , \end{aligned}$$

where c is the quantity that appears in Lemma 1.

Remark 3

Compared to smooth loss functions, the smoothness constant L is replaced by \({\widehat{L}}=\frac{\lambda }{(\rho ^-_{n,16s} - \sigma _{n,16s})}\) for the non-smooth loss functions, and there is an extra condition \(\sigma _{n,16s}<\rho ^{-}_{n,16s}\). In light of Lemma 2, the condition implies that \(m\ge \varOmega \left( \left( \frac{\rho ^+_{n,16s}}{\rho ^-_{n,16s}}\right) ^2s\log (n/s)\right) \).

Finally, we present the recovery error of the recovered high-dimensional model \(\widetilde{{\mathbf {w}}}_*\).

Theorem 3

Let \(A\in {\mathbb {R}}^{m\times d}\) be a random matrix sampled from a distribution that satisfies the JL lemma. Let \(\widetilde{{\mathbf {w}}}_*\) be the recovered primal solution in (8). Define \({\tilde{L}} = L\) if the loss function is L-smooth, otherwise \(\tilde{L}=\frac{\lambda }{(\rho ^-_{n,16s} - \sigma _{n,16s})}\) . Assume \(\alpha _*\) is s-sparse, \(\max _i\Vert {\mathbf {x}}_i\Vert _2\le R\), and \(\sigma _{n,16s}<\rho ^{-}_{n,16s}\) if the loss function is non-smooth. If we set

$$\begin{aligned} \tau =\varTheta \left( 2cR\Vert {\mathbf {w}}_*\Vert _2 \sqrt{\frac{\log (n/\delta )}{m}}\right) \ge 2cR\Vert {\mathbf {w}}_*\Vert _2 \sqrt{\frac{\log (n/\delta )}{m}}, \end{aligned}$$

then we have

$$\begin{aligned} \Vert \widetilde{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _2\le O\left( \frac{\sqrt{\rho ^+_{n,16s}}}{\lambda \sqrt{n}}\tilde{L}R\Vert {\mathbf {w}}_*\Vert _2\sqrt{s} \sqrt{\frac{\log (n/\delta )}{m}}\right) , \end{aligned}$$

where c is the quantity that appears in Lemma 1.

Remark 4

To understand the order of the recovery error, we consider a smooth loss function with L being a constant. According to learning theory, the optimal \(\lambda \) can be of the order \(1/\sqrt{n}\) (Sridharan et al. 2008). In addition, \(\sqrt{\rho ^+_{n,16s}}\le R\). Thus, the recovery error is of the order \(O(R^2\Vert {\mathbf {w}}_*\Vert _2\sqrt{s/m})\) for a smooth loss function. We can compare this recovery error with that derived in Zhang et al. (2014) for smooth loss functions, which is of the order \(O(\sqrt{r/m}\Vert {\mathbf {w}}_*\Vert _2)\) with r being the rank of the data matrix. The two errors have the same scaling in terms of m. The error bound in Zhang et al. (2014) depends on the rank of the data matrix, while that in Theorem 3 depends on the sparsity of the optimal dual solution \(\alpha _*\) to the original problem. For real LSHD data, the sparsity of the optimal solution \(\alpha _*\) is usually much smaller than n; however, the rank of the data matrix is not necessarily much smaller than n, in particular when the dimensionality is very high. This renders the proposed dual-sparse recovery more attractive. In experiments, we will show that the proposed dual-sparse recovery gives smaller error than the ordinary dual recovery approach on real datasets.

4.2 Sparse regularized least-squares regression (SLSR)

Recall the problem of SLSR

$$\begin{aligned} {\mathbf {w}}_* =\arg \min _{{\mathbf {w}}\in {\mathbb {R}}^d} \frac{1}{2n}\Vert X{\mathbf {w}}- {\mathbf {y}}\Vert _2^2 + \frac{\lambda }{2}\Vert {\mathbf {w}}\Vert _2^2 + \gamma \Vert {\mathbf {w}}\Vert _{1}, \end{aligned}$$
(15)

where \(\lambda \ge 0\). Although many optimization algorithms have been developed for solving (15), they could still suffer from high computational complexities for large-scale high-dimensional data due to (i) an O(nd) memory complexity and (ii) an O(nd) iteration complexity.

To alleviate the two complexities, we consider using a JL transform to reduce the scale of the data. In particular, we let \(A\in {\mathbb {R}}^{m\times n}\) denote a random reduction matrix corresponding to a JL transform, compute the sketched data \({\widehat{X}}= AX\in {\mathbb {R}}^{m\times d}\) and \(\widehat{{\mathbf {y}}}= A{\mathbf {y}}\in {\mathbb {R}}^m\), and then solve the following problem:

$$\begin{aligned} \widehat{{\mathbf {w}}}_* =\arg \min _{{\mathbf {w}}\in {\mathbb {R}}^d} \frac{1}{2n}\Vert {\widehat{X}}{\mathbf {w}}- \widehat{{\mathbf {y}}}\Vert _2^2 + \frac{\lambda }{2}\Vert {\mathbf {w}}\Vert _2^2 + (\gamma +\tau ) \Vert {\mathbf {w}}\Vert _{1}, \end{aligned}$$
(16)

where \(\tau >0\) is added to strengthen the sparse regularizer, and its theoretical value is exhibited later. We emphasize that to obtain a bound on the optimization error of \(\widehat{{\mathbf {w}}}_*\), i.e., \(\Vert \widehat{{\mathbf {w}}}_*-{\mathbf {w}}_*\Vert \), it is important to increase the value of the regularization parameter before the \(\ell _1\) norm. Intuitively, after compressing the data the optimal solution may become less sparse; hence increasing the regularization parameter can pull the solution closer to the original optimal solution.
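The following is a minimal sketch of formulation (16) on JL-sketched data, using a Gaussian reduction matrix (one admissible choice) and plain proximal gradient descent (ISTA) as an illustrative solver; the function names and the solver are assumptions, not the algorithm studied later in Sect. 6.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sketched_slsr(X, y, m, lam, gamma, tau, n_iter=3000, rng=None):
    """Sketch of (16): sketch (X, y) with a JL matrix and solve the strengthened
    sparse regularized least-squares problem by proximal gradient descent."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))   # JL reduction matrix
    Xh, yh = A @ X, A @ y                                 # sketched data (m x d, m)
    L = np.linalg.norm(Xh, 2) ** 2 / n + lam              # Lipschitz constant of the smooth part
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = Xh.T @ (Xh @ w - yh) / n + lam * w
        w = soft_threshold(w - grad / L, (gamma + tau) / L)  # prox of (gamma+tau)*||.||_1
    return w
```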

Geometric Interpretation by Optimality Condition. We can explain the added parameter \(\tau \) from a geometric viewpoint, which sheds light on its theoretical value. Without loss of generality, we consider \(\lambda =0\). Since \({\mathbf {w}}_*\) is the optimal solution to the original problem, there exists a sub-gradient \(g\in \partial \Vert {\mathbf {w}}_*\Vert _1\) such that

$$\begin{aligned} \frac{1}{n}X^{\top }(X{\mathbf {w}}_* - {\mathbf {y}}) + \gamma g = 0. \end{aligned}$$

Since \(\Vert g\Vert _\infty \le 1\), \({\mathbf {w}}_*\) must satisfy \(\frac{1}{n}\Vert X^{\top }(X{\mathbf {w}}_* - {\mathbf {y}})\Vert _\infty \le \gamma \). Similarly, the compressed problem (16) also defines a domain of the optimal solution \(\widehat{{\mathbf {w}}}_*\), i.e.,

$$\begin{aligned} \widehat{{\mathcal {D}}}_{{\mathbf {w}}}=\left\{ {\mathbf {w}}\in {\mathbb {R}}^d: \frac{1}{n}\Vert {\widehat{X}}^{\top }({\widehat{X}}{\mathbf {w}}- \widehat{{\mathbf {y}}})\Vert _\infty \le \tau +\gamma \right\} . \end{aligned}$$
(17)

It turns out that \(\tau \) is added to ensure that the original optimal solution \({\mathbf {w}}_*\) lies in \(\widehat{{\mathcal {D}}}_{{\mathbf {w}}}\) provided that \(\tau \) is set appropriately, which can be verified as follows:

$$\begin{aligned}&\frac{1}{n}\left\| {\widehat{X}}^{\top }({\widehat{X}}{\mathbf {w}}_* - \widehat{{\mathbf {y}}})\right\| _\infty = \frac{1}{n}\left\| X^{\top }(X{\mathbf {w}}_* - {\mathbf {y}}) + {\widehat{X}}^{\top }({\widehat{X}}{\mathbf {w}}_* - \widehat{{\mathbf {y}}}) -X^{\top }(X{\mathbf {w}}_* - {\mathbf {y}})\right\| _\infty \\&\quad \le \frac{1}{n}\Vert X^{\top }(X{\mathbf {w}}_* - {\mathbf {y}})\Vert _\infty + \frac{1}{n}\left\| {\widehat{X}}^{\top }({\widehat{X}}{\mathbf {w}}_* - \widehat{{\mathbf {y}}}) -X^{\top }(X{\mathbf {w}}_* - {\mathbf {y}})\right\| _\infty \\&\quad \le \gamma + \frac{1}{n}\Vert X^{\top }(A^{\top }A-I)(X{\mathbf {w}}_* - {\mathbf {y}}) \Vert _\infty . \end{aligned}$$

Hence, if we set \(\tau \ge \frac{1}{n}\Vert X^{\top }(A^{\top }A-I)(X{\mathbf {w}}_* - {\mathbf {y}}) \Vert _\infty \), it is guaranteed that \({\mathbf {w}}_*\) also lies in \(\widehat{{\mathcal {D}}}_{{\mathbf {w}}}\). Lemma 5 in Sect. 7 provides an upper bound of \(\frac{1}{n}\Vert X^{\top }(A^{\top }A-I)(X{\mathbf {w}}_* - {\mathbf {y}}) \Vert _\infty \), therefore exhibits a theoretical value of \(\tau \).

Next, we present the theoretical guarantee on the recovery error of the obtained solution \(\widehat{{\mathbf {w}}}_*\). We use \({\mathbf {e}}\) to denote the residual \({\mathbf {e}}=X{\mathbf {w}}_* - {\mathbf {y}}\) and assume \(\Vert {\mathbf {e}}\Vert _2\le \eta \). We abuse the notation R to denote an upper bound on the norms of the column vectors of X, i.e., \(\max _{1\le j\le d}\Vert {\bar{{\mathbf {x}}}}_j\Vert _2\le R\), where \({\bar{{\mathbf {x}}}}_j\) denotes the j-th column of X. We first present the result for the elastic net regularizer, which is a strongly convex function.

Theorem 4

Let \(A\in {\mathbb {R}}^{m\times n}\) be a random matrix sampled from a distribution that satisfies the JL lemma. Let \({\mathbf {w}}_*\) and \(\widehat{{\mathbf {w}}}_*\) be the optimal solutions to (15) and (16) for \(\lambda >0\), respectively. Assume \({\mathbf {w}}_*\) is s-sparse and \(\max _j\Vert \bar{\mathbf {x}}_j\Vert _2\le R\). If we set

$$\begin{aligned} \tau = \varTheta \left( \frac{\eta R}{n}\sqrt{\frac{\log (2d/\delta )}{m}}\right) \ge \frac{2c\eta R}{n}\sqrt{\frac{\log (2d/\delta )}{m}}. \end{aligned}$$

Then with a probability at least \(1-\delta \), we have

$$\begin{aligned} \Vert \widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _2\le O\left( \frac{\eta R\sqrt{s}}{n\lambda }\sqrt{\frac{\log (2d/\delta )}{m}}\right) , \quad \Vert \widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _1\le O\left( \frac{\eta R s}{n\lambda }\sqrt{\frac{\log (2d/\delta )}{m}}\right) , \end{aligned}$$

where c is the quantity that appears in Lemma 1.

Remark

The order of \(\tau \) is derived by bounding \(\frac{1}{n}\Vert X^{\top }(A^{\top }A-I){\mathbf {e}}\Vert _\infty \), which will be exhibited later. The upper bound of the recovery error exhibits several interesting properties: (i) the term \(\sqrt{\frac{s^{2/p}\log (d/\delta )}{m}}\) for the p-norm error bound occurs commonly in theoretical results on sparse recovery (Koltchinskii 2011); (ii) the term \(R/\lambda \) is related to the condition number of the optimization problem (15), which reflects the intrinsic difficulty of optimization; and (iii) the term \(\eta /n\) is related to the empirical error of the optimal solution \({\mathbf {w}}_*\). This term makes sense because if \(\eta =0\), indicating that the optimal solution \({\mathbf {w}}_*\) satisfies \(X{\mathbf {w}}_* -{\mathbf {y}}=0\), then it is straightforward to verify that \({\mathbf {w}}_*\) also satisfies the optimality condition of (16) with \(\tau =0\); by the uniqueness of the optimal solution to (15), it follows that \(\widehat{{\mathbf {w}}}_* = {\mathbf {w}}_*\).

Next, we present the result for the random sketched LASSO. Since the objective is not strongly convex, we explore the restricted strong convexity with respect to sparse vectors, i.e., the restricted eigen-value condition. Different from the classification problem, here we assume the restricted eigen-value condition for the matrix \(X^{\top }X\).

Definition 2

Given an integer \(s>0\), we define

$$\begin{aligned} {\mathcal {K}}_{d,s}=\{{\mathbf {w}}\in {\mathbb {R}}^d: \Vert {\mathbf {w}}\Vert _2\le 1, \Vert {\mathbf {w}}\Vert _1\le \sqrt{s}\}. \end{aligned}$$

We say that \(X^{\top }X\in {\mathbb {R}}^{d\times d}\) satisfies the restricted eigenvalue condition at sparsity level s if there exist positive constants \(\rho ^+_{d,s}\) and \(\rho ^{-}_{d,s}\) such that

$$\begin{aligned} \rho ^+_{d,s} = \sup _{{\mathbf {w}}\in {\mathcal {K}}_{d, s}}\frac{{\mathbf {w}}^{\top }X^{\top }X{\mathbf {w}}}{n}, \quad \rho ^{-}_{d,s} = \inf _{{\mathbf {w}}\in {\mathcal {K}}_{d,s}} \frac{{\mathbf {w}}^{\top }X^{\top }X{\mathbf {w}}}{n}. \end{aligned}$$
(18)

Note that the above definitions of \(\rho ^+_{d, s}\) and \(\rho ^-_{d,s}\) are slightly different from those of \(\rho ^+_{n,s}\) and \(\rho ^-_{n,s}\). Similar to before, we define another quantity that measures the restricted eigen-value of \(X^{\top }X - {\widehat{X}}^{\top }{\widehat{X}}\), namely

$$\begin{aligned} \sigma _{d,s} = \sup _{{\mathbf {w}}\in \mathcal {K}_{d,s}}\frac{|{\mathbf {w}}^{\top }(X^{\top }X - {\widehat{X}}^{\top }{\widehat{X}}){\mathbf {w}}|}{n}. \end{aligned}$$
(19)

A lemma similar to Lemma 2 holds for the \(\sigma _{d,s}\) and \(\rho _{d,s}^+\) defined here.

Lemma 3

Let \(A\in {\mathbb {R}}^{m\times n}\) be a random matrix sampled from a distribution that satisfies the JL lemma. With a probability \(1-\delta \), we have

$$\begin{aligned} \sigma _{d,s}\le 16c\rho ^+_{d,s}\sqrt{\frac{(\log (2/\delta )+2s\log (27d/s))}{m}}, \end{aligned}$$

where c is the quantity that appears in Lemma 1.

Theorem 5

Let \(A\in {\mathbb {R}}^{m\times n}\) be a random matrix sampled from a distribution that satisfies the JL lemma. Assume \(X^{\top }X\) satisfies the restricted eigen-value condition at sparsity level 16s. Let \({\mathbf {w}}_*\) and \(\widehat{{\mathbf {w}}}_*\) be the optimal solutions to (15) and (16) with \(\lambda =0\), respectively, and let \(\varLambda =\rho ^-_{d,16s} - 2\sigma _{d,16s}\). If we set

$$\begin{aligned} \tau = \varTheta \left( \frac{\eta R}{n}\sqrt{\frac{\log (2d/\delta )}{m}}\right) \ge \frac{2c\eta R}{n}\sqrt{\frac{\log (2d/\delta )}{m}}, \end{aligned}$$

and if \(\varLambda >0\), then with a probability at least \(1-\delta \), we have

$$\begin{aligned} \Vert \widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _2\le O\left( \frac{\eta R\sqrt{s}}{n\varLambda }\sqrt{\frac{\log (2d/\delta )}{m}}\right) , \quad \Vert \widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _1\le O\left( \frac{\eta R s}{n\varLambda }\sqrt{\frac{\log (2d/\delta )}{m}}\right) , \end{aligned}$$

where c is the quantity that appears in Lemma 1.

Remark

Note that \(\lambda \) in Theorem 4 is replaced by \(\varLambda \) in Theorem 5. For the result to be valid, we must have \(\varLambda >0\), i.e., \(m\ge \varOmega \left( \left( \frac{\rho ^+_{d,16s}}{\rho ^-_{d,16s}}\right) ^2s\log (d/s)\right) \). In addition, if the conditions in Theorem 5 hold, the result in Theorem 4 can be made stronger by replacing \(\lambda \) with \(\lambda +\varLambda \).

Remark

Under a standard statistical model in which the noise in the observations follows a Gaussian distribution, the above result indicates that the proposed method based on the sketched data \((AX, A{\mathbf {y}})\) is statistically optimal. In particular, assume there exists \({\mathbf {u}}_*\) such that \(y = {\mathbf {u}}_*^{\top }{\mathbf {x}}+ \varepsilon \), where \(\varepsilon \sim {\mathcal {N}}(0, \sigma ^2)\). It was proved that, under appropriate conditions (the restricted isometry property of the data matrix X), any sparse estimator \({\widetilde{{\mathbf {w}}}}\) based on the sketched data \((AX, A{\mathbf {y}})\) cannot achieve a statistical error better than \(O(\frac{s\log (d/s)}{m})\) (Pilanci and Wainwright 2016). Under the conditions that the entries of X are bounded and the noise \({\mathbf {e}}\) in the observations is Gaussian, we have \(R \sim O(\sqrt{n})\) and \(\eta \sim O(\sqrt{n})\). Thus, the above result indicates that \(\Vert \widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _2^2\le O(\frac{s\log (d)}{m})\), which further implies that the statistical error of \(\widehat{{\mathbf {w}}}_*\) will be dominated by \(O(\frac{s\log (d)}{m})\), since the statistical error of \({\mathbf {w}}_*\) is \(O(\frac{s\log (d)}{n})\) (Bickel et al. 2009). Therefore, the proposed method is statistically optimal up to a logarithmic factor among all methods that learn a sparse solution from the sketched data \((AX, A{\mathbf {y}})\).

5 Randomized reduction by JL transforms

In this section, we review several classes of random reduction matrices \(A\in {\mathbb {R}}^{m\times N}\) and their JL-type lemmas.

5.1 subGaussian random projection

SubGaussian random projection has been widely employed for dimensionality reduction. The projection operator A is usually sampled from a sub-Gaussian distribution with mean 0 and variance 1/m, e.g., (i) the Gaussian distribution: \(A_{ij}\sim {\mathcal {N}}(0, 1/m)\), (ii) the Rademacher distribution: \(\Pr (A_{ij} =\pm 1/\sqrt{m}) =0.5\), (iii) a discrete distribution: \(\Pr (A_{ij}=\pm \sqrt{3/m})=1/6\) and \(\Pr (A_{ij}=0)=2/3\). The last two distributions for dimensionality reduction were proposed and analyzed in Achlioptas (2003). The following lemma is the general JL-type lemma for A with sub-Gaussian entries.

Lemma 4

(Nelson 2015) Let \(A\in {\mathbb {R}}^{m\times N}\) be a random matrix with subGaussian entries of mean 0 and variance 1/m . For any given \({\mathbf {x}}\in {\mathbb {R}}^N\) with a probability \(1-\delta \), we have

$$\begin{aligned} \left| \Vert A{\mathbf {x}}\Vert ^2_2 - \Vert {\mathbf {x}}\Vert ^2_2\right| \le c\sqrt{\frac{\log (1/\delta )}{m}}\Vert {\mathbf {x}}\Vert ^2_2. \end{aligned}$$

where c is some small universal constant.
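A minimal sketch of samplers for the three distributions above, together with an empirical check of the distortion bounded in Lemma 4 (the helper name and the specific sizes are illustrative assumptions):

```python
import numpy as np

def subgaussian_matrix(m, N, kind="gaussian", rng=None):
    """Samplers for the three sub-Gaussian distributions listed above
    (all with mean 0 and variance 1/m); an illustrative helper."""
    rng = np.random.default_rng(rng)
    if kind == "gaussian":                          # A_ij ~ N(0, 1/m)
        return rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, N))
    if kind == "rademacher":                        # A_ij = +/-1/sqrt(m), each w.p. 1/2
        return rng.choice([-1.0, 1.0], size=(m, N)) / np.sqrt(m)
    if kind == "sparse":                            # +/-sqrt(3/m) w.p. 1/6 each, 0 w.p. 2/3
        return rng.choice([-1.0, 0.0, 1.0], p=[1 / 6, 2 / 3, 1 / 6],
                          size=(m, N)) * np.sqrt(3.0 / m)
    raise ValueError(kind)

# empirical distortion |(||Ax||^2 - ||x||^2)| / ||x||^2 for one fixed x
rng = np.random.default_rng(0)
x = rng.normal(size=4096)
A = subgaussian_matrix(256, 4096, kind="rademacher", rng=1)
print(abs(np.linalg.norm(A @ x) ** 2 - np.linalg.norm(x) ** 2) / np.linalg.norm(x) ** 2)
```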

5.2 Subsampled randomized Hadamard transform (SRHT)

Randomized Hadamard transform was introduced to speed up random projection, reducing the computational timeFootnote 3 of random projection from O(Nm) to \(O(N\log N)\) or even \(O(N\log m)\). The projection matrix A is of the form \(A=PHD\), where

  • \(D\in {\mathbb {R}}^{N\times N}\) is a diagonal matrix with \(D_{ii} = \pm 1\) with equal probabilities.

  • H is the \(N\times N\) Hadamard matrix (assuming N is a power of 2), scaled by \(1/\sqrt{N}\).

  • \(P\in {\mathbb {R}}^{m\times N}\) is typically a sparse matrix that facilitates computing \(P{\mathbf {x}}\). Several choices of P are possible (Ailon and Chazelle 2009; Nelson 2015; Tropp 2011). Below we provide a JL-type lemma for the subsampled randomized Hadamard transform.

Lemma 5

(Boutsidis and Gittens 2013) Let \(A=\sqrt{\frac{N}{m}}PHD\in {\mathbb {R}}^{m\times N}\) be a subsampled randomized Hadamard transform with P being a random sampling matrix with or without replacement. For any given \({\mathbf {x}}\) with a probability \(1-\delta \), we have

$$\begin{aligned} \left| \Vert A{\mathbf {x}}\Vert ^2_2 - \Vert {\mathbf {x}}\Vert ^2_2\right| \le c\sqrt{\frac{\log (1/\delta )\log (N/\delta )}{m}}\Vert {\mathbf {x}}\Vert ^2_2, \end{aligned}$$

where c is some small universal constant.

Remark 5

Compared to subGaussian random projection, there is an additional \(\sqrt{\log (N/\delta )}\) factor in the upper bound. So directly using the SRHT will blow up the recovery error by a logarithmic factor of \(\sqrt{\log (d/\delta )}\) for classification and a logarithmic factor of \(\sqrt{\log (n/\delta )}\) for SLSR. However, this additional factor can be removed by applying an additional random projection. In particular, if we let \(A = \sqrt{\frac{N}{m}}{\hat{P}}PHD\in {\mathbb {R}}^{m\times N}\), where \(P\in {\mathbb {R}}^{t\times N}\) is a random sampling matrix with \(t = m\log (N/\delta )\) and \({\hat{P}}\in {\mathbb {R}}^{m\times t}\) is a random projection matrix that satisfies Lemma 4, then we obtain the same order as that in Lemma 4. Note that the projection by \({\hat{P}}\) only involves marginal computation when \(m\ll N\). Please refer to Nelson (2015) for more details.
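A minimal construction sketch of the basic transform \(A{\mathbf {x}}=\sqrt{N/m}\,PHD{\mathbf {x}}\) for a single vector, assuming N is a power of 2 and using a dense Hadamard matrix from SciPy for clarity (a fast Walsh–Hadamard transform would give the \(O(N\log N)\) cost mentioned above); the sampling scheme for P is one of several admissible choices:

```python
import numpy as np
from scipy.linalg import hadamard

def srht(x, m, rng=None):
    """Apply a subsampled randomized Hadamard transform to a single vector x."""
    rng = np.random.default_rng(rng)
    N = x.size                                   # must be a power of 2
    D = rng.choice([-1.0, 1.0], size=N)          # diagonal of D (random signs)
    H = hadamard(N) / np.sqrt(N)                 # scaled N x N Hadamard matrix
    idx = rng.integers(0, N, size=m)             # P: uniform sampling with replacement
    return np.sqrt(N / m) * (H @ (D * x))[idx]
```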

5.3 Random hashing

Another line of work to speed up random projection is random hashing, which makes the reduction matrix A much sparser and takes advantage of the sparsity of the original high-dimensional vectors. It was introduced in Shi et al. (2009b) for dimensionality reduction and was later improved to an unbiased version by Weinberger et al. (2009) with some theoretical analysis. Dasgupta et al. (2010) provided a rigorous analysis of the unbiased random hashing. Recently, Kane and Nelson (2014) proposed two new random hashing algorithms with a slightly sparser random matrix A. Here we provide a JL-type lemma for the random hashing algorithm in Kane and Nelson (2014). Let \(h_k(i):[N]\rightarrow [m/s], k=1,\ldots , s\) denote s independent random hashing functions (assuming m is a multiple of s) and let

$$\begin{aligned} A=\left( \begin{array}{c}H_1D_1\\ H_2D_2\\ \cdot \\ \cdot \\ H_sD_s\end{array}\right) \in {\mathbb {R}}^{m\times N} \end{aligned}$$
(20)

be a random matrix formed by stacking s random hashing matrices, where \(D_k\in {\mathbb {R}}^{N\times N}\) is a diagonal matrix with each entry sampled from \(\{-1, +1\}\) with equal probabilities, and \(H_k\in {\mathbb {R}}^{(m/s)\times N}\) with \([H_k]_{j,i}=\delta _{j, h_k(i)}\). Note that A is a sparse matrix with the sparsity of each column being s. The lemma below states the JL-type result for the random hashing matrix given in (20).

Lemma 6

Let A be given in (20). If \(s=\varTheta (\sqrt{m\log (1/\delta )})\) with a probability \(1-\delta \), we have

$$\begin{aligned} \left| \Vert A{\mathbf {x}}\Vert _2^2 - \Vert {\mathbf {x}}\Vert _2^2\right| \le c\sqrt{\frac{\log (1/\delta )}{m}}\Vert {\mathbf {x}}\Vert _2^2. \end{aligned}$$

If \(s=1\), then with a probability 2/3 we have

$$\begin{aligned} \left| \Vert A{\mathbf {x}}\Vert _2^2 - \Vert {\mathbf {x}}\Vert _2^2\right| \le c\sqrt{\frac{1}{m}}\Vert {\mathbf {x}}\Vert _2^2, \end{aligned}$$

where c is a universal constant.

Remark 6

The first inequality is a result in Kane and Nelson (2014). The second inequality is proved in Nelson and Nguyen (2012). If one block of the random hashing matrix is used for A, then the high-probability error bound becomes a constant-probability error bound. In practice, we find that using one block of random hashing performs as well as Gaussian random projection.
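A minimal sketch of applying a one-block (\(s=1\)) random hashing matrix of the form \(H_1D_1\) in (20) to a single vector; the helper name is illustrative and, for sparse inputs, the same loop can be restricted to the non-zero coordinates:

```python
import numpy as np

def one_block_hashing(x, m, rng=None):
    """Apply A = H_1 D_1 from (20) with s = 1: coordinate i goes to bucket h(i)
    with a random sign."""
    rng = np.random.default_rng(rng)
    N = x.size
    h = rng.integers(0, m, size=N)               # hash function h: [N] -> [m]
    sign = rng.choice([-1.0, 1.0], size=N)       # diagonal entries of D_1
    out = np.zeros(m)
    np.add.at(out, h, sign * x)                  # accumulate signed coordinates per bucket
    return out
```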

5.4 Random sampling

Last we discuss random sampling and compare with the aforementioned randomized reduction methods.

Lemma 7

Let \(A=\sqrt{\frac{d}{m}}P\in {\mathbb {R}}^{m\times d}\) be a scaled random sampling matrix where \(P\in {\mathbb {R}}^{m\times d}\) samples m coordinates with or without replacement. Then with a probability \(1-\delta \), we have

$$\begin{aligned} \left| \Vert A{\mathbf {x}}\Vert _2^2 - \Vert {\mathbf {x}}\Vert _2^2\right| \le c\frac{\sqrt{d}\Vert {\mathbf {x}}\Vert _\infty }{\Vert {\mathbf {x}}\Vert _2}\sqrt{\frac{\log (1/\delta )}{m}}\Vert {\mathbf {x}}\Vert _2^2, \end{aligned}$$

where c is a universal constant.

Remark 7

Compared with the other three randomized reduction methods, there is an additional \(\frac{\Vert {\mathbf {x}}\Vert _\infty }{\Vert {\mathbf {x}}\Vert _2}\sqrt{d}\) factor in the upper bound, which could result in a much larger recovery error. That is why the randomized Hadamard transform was introduced: it makes this additional factor close to a constant.

6 Optimization and time complexity

In this section, we discuss the optimization algorithms for solving the original problems and the random sketched problems, and the time complexities of the proposed randomized methods. Many optimization algorithms can be used to solve the original formulations in (1), (2) or (3), including stochastic dual coordinate ascent (SDCA) for solving the dual formulation (Shalev-Shwartz and Zhang 2013b), the stochastic variance reduced gradient (SVRG) (Johnson and Zhang 2013), the stochastic average gradient algorithm (SAGA) (Defazio et al. 2014), the accelerated stochastic proximal coordinate gradient method (APCG) (Lin et al. 2014), and many other variants (Shalev-Shwartz and Zhang 2013a, 2014; Xiao and Zhang 2014). Next, we focus on the APCG algorithm since (i) it can be applied to solving the sparse regularized dual formulation in (7) and the SLSR formulation in (16); (ii) it is applicable to both strongly convex and non-strongly convex objective functions; and (iii) it achieves the state-of-the-art iteration complexities for both problems. We restrict the discussion to strongly convex objective functions (i.e., the loss functions are smooth for classification and the regularizer for SLSR includes the \(\ell _2\) norm). In particular, we assume the classification loss \(\ell (z)\) is L-smooth. In the following discussion, we make no assumption about the sparsity of the data.

First, we note that the optimization problems in (2), (7), (15) and (16) can be written as the following general form:

$$\begin{aligned} \min _{{\mathbf {u}}\in {\mathbb {R}}^{N}} F({\mathbf {u}})\triangleq \underbrace{\frac{1}{2n}{\mathbf {u}}^{\top }C^{\top }C{\mathbf {u}}+ \frac{\beta }{2}\Vert {\mathbf {u}}\Vert _2^2}_{f({\mathbf {u}})} + \underbrace{\sum _{i=1}^N\phi _i(u_i)}_{\varPsi ({\mathbf {u}})}, \end{aligned}$$
(21)

where \(C\in {\mathbb {R}}^{M\times N}\) and \(\phi _i(u_i)\) is a convex function. In particular,

  • For (2), we have \(N=n, M=d\), \({\mathbf {u}}=\alpha \in {\mathbb {R}}^n\), \(C=X^{\top }\in {\mathbb {R}}^{d\times n}\), \(\beta = \frac{\lambda }{L}\), and \(\phi _i(\alpha _i) = \lambda (\ell _i^*(\alpha _i) - \frac{1}{L}\alpha _i^2)\)

  • For (7), we have \(N=n, M=m\), \({\mathbf {u}}=\alpha \in {\mathbb {R}}^n\), \(C={\widehat{X}}^{\top }\in {\mathbb {R}}^{m\times n}\), \(\beta = \frac{\lambda }{L}\), and \(\phi _i(\alpha _i) = \lambda (\ell _i^*(\alpha _i) - \frac{1}{L}\alpha _i^2 + \tau |\alpha _i|)\)

  • For (15), we have \(N=d, M=n\), \({\mathbf {u}}={\mathbf {w}}\in {\mathbb {R}}^d\), \(C=X\in {\mathbb {R}}^{n\times d}\), \(\beta =\lambda \), and \(\phi _i(w_i) = \gamma |w_i| - \frac{1}{n}w_i[X^{\top }{\mathbf {y}}]_i\)

  • For (16), we have \(N=d, M=m\), \({\mathbf {u}}={\mathbf {w}}\in {\mathbb {R}}^d\), \(C={\widehat{X}}\in {\mathbb {R}}^{m\times d}\), \(\beta =\lambda \), and \(\phi _i(w_i) = (\gamma + \tau )|w_i| - \frac{1}{n}w_i[{\widehat{X}}^{\top }\widehat{{\mathbf {y}}}]_i\)

Assume that \(\max _j\Vert C_{*j}\Vert _2\le R_c\) where \(C_{*j}\) denotes the j-th column. By the specific form of \(f({\mathbf {u}})\) in (21), we can show that

$$\begin{aligned} |\nabla _i f({\mathbf {u}}+{\mathbf {e}}_i \delta ) - \nabla _i f({\mathbf {u}}) | \le \left( \frac{R_c^2}{n}+\beta \right) |\delta |, \quad \forall \, \delta \in {\mathbb {R}}, \quad i=1,\ldots ,N, \quad {\mathbf {u}}\in {\mathbb {R}}^N, \end{aligned}$$

where \({\mathbf {e}}_i\) is the basis vector whose coordinates are all zero except the i-th, which equals 1. By definition, this means that the coordinate-wise Lipschitz continuity constant of \(f({\mathbf {u}})\) for each coordinate is \(\frac{R_c^2}{n}+\beta \). Knowing this constant and the strong convexity parameter \(\beta \) of \(f({\mathbf {u}})\), the APCG method can be formulated as Algorithm 1.
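For completeness, this constant can be verified directly from the form of \(f\) in (21): since \(\nabla _i f({\mathbf {u}}) = \frac{1}{n}C_{*i}^{\top }C{\mathbf {u}}+ \beta u_i\) and \(C{\mathbf {e}}_i = C_{*i}\), we have

$$\begin{aligned} \nabla _i f({\mathbf {u}}+{\mathbf {e}}_i \delta ) - \nabla _i f({\mathbf {u}}) = \frac{1}{n}C_{*i}^{\top }C{\mathbf {e}}_i\,\delta + \beta \delta = \left( \frac{\Vert C_{*i}\Vert _2^2}{n}+\beta \right) \delta , \end{aligned}$$

and the stated bound follows from \(\Vert C_{*i}\Vert _2\le R_c\).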

Algorithm 1 (APCG for solving (21); figure omitted)

A direct implementation of APCG requires updating full-dimensional vectors such as \({\mathbf {v}}^{(k)}\), \({\mathbf {w}}^{(k+1)}\) and \({\mathbf {u}}^{(k+1)}\) at each iteration. Moreover, the \(i_k\)-th coordinate of the gradient is \(\nabla _{i} f({\mathbf {u}})=\frac{C_{*i}^TC{\mathbf {u}}}{n}+\beta u_i\), so its direct computation requires a full-dimensional inner product per iteration. Hence, the per-iteration complexity of Algorithm 1 is at least O(N), assuming the matrix \(C^TC\) can be pre-computed and stored in memory. To avoid these full-dimensional updates, the authors of Lin et al. (2014) proposed a change-of-variables scheme so that APCG can be represented equivalently in terms of transformed variables, which can be updated in only one coordinate at each iteration and allow a low-cost computation of \(\nabla _{i} f({\mathbf {u}})\). We present this efficient implementation of APCG in Algorithm 2.

Algorithm 2 (Efficient implementation of APCG; figure omitted)

According to Lin et al. (2014), the transformed variables \({\mathbf {x}}^{(k)}\) and \({\mathbf {y}}^{(k)}\) in Algorithm 2 are related to the original variables in Algorithm 1 as follows

$$\begin{aligned} {\mathbf {u}}^{(k)} = \rho ^k {\mathbf {x}}^{(k)} + {\mathbf {y}}^{(k)}, \quad {\mathbf {v}}^{(k)} = \rho ^{k+1} {\mathbf {x}}^{(k)} + {\mathbf {y}}^{(k)}, \quad {\mathbf {w}}^{(k)} = -\rho ^k {\mathbf {x}}^{(k)} + {\mathbf {y}}^{(k)} , \end{aligned}$$

where \(\rho \) is defined as in Algorithm 2. The auxiliary vectors \({\mathbf {p}}^{(k)}\) and \({\mathbf {q}}^{(k)}\) are introduced in order to compute \(\nabla _{i_k} f({\mathbf {v}}^{(k)})\) efficiently. In fact, according to Lin et al. (2014), \({\mathbf {p}}^{(k)}\) and \({\mathbf {q}}^{(k)}\) satisfy

$$\begin{aligned} {\mathbf {p}}^{(k)}=C {\mathbf {u}}^{(k)}, \qquad {\mathbf {q}}^{(k)}=C {\mathbf {v}}^{(k)}, \end{aligned}$$

in each iteration. Since \({\mathbf {v}}^{(k)} = \rho ^{k+1} {\mathbf {x}}^{(k)} + {\mathbf {y}}^{(k)}\), we have

$$\begin{aligned} \nabla _{i_k} f({\mathbf {v}}^{(k)})=\frac{1}{n}C_{*i_k}^TC(\rho ^{k+1} {\mathbf {x}}^{(k)} + {\mathbf {y}}^{(k)})+\beta (\rho ^{k+1}x_{i_k}^{(k)} + y_{i_k}^{(k)}) =\nabla ^{(k)}_{i_k}, \end{aligned}$$

where \(\nabla ^{(k)}_{i_k} \) is defined as in Algorithm 2. These observations verify that Algorithm 2 and Algorithm 1 are essentially equivalent.

Note that computing \(\nabla ^{(k)}_{i_k}\) and updating \({\mathbf {p}}^{(k)}\) and \({\mathbf {q}}^{(k)}\) both cost O(M), while computing \( h^{(k)}_{i_k}\) and updating \({\mathbf {x}}^{(k)}\) and \({\mathbf {y}}^{(k)}\) cost only O(1). Hence, the per-iteration complexity of Algorithm 2 is only O(M), lower than that of Algorithm 1.
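To make the O(M) bookkeeping concrete, the following is a minimal NumPy sketch of the instantiation of (21) corresponding to (15) (i.e., elastic-net least squares), solved by plain randomized proximal coordinate descent rather than the accelerated APCG updates; the function name is ours, and the point is only to illustrate the same trick of maintaining \({\mathbf {p}}=C{\mathbf {u}}\) so that each coordinate step costs O(M).

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_cd_elastic_net(X, y, lam, gamma, n_epochs=50, seed=0):
    """Randomized proximal coordinate descent for
        min_w 1/(2n) w'X'Xw + lam/2 ||w||_2^2 + sum_i(gamma|w_i| - 1/n w_i [X'y]_i),
    i.e., the instantiation of (21) given by (15). Not APCG itself: it only
    illustrates the O(M)-per-iteration bookkeeping of maintaining p = X w."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    p = np.zeros(n)                        # p = X w, maintained incrementally
    b = X.T @ y                            # linear term [X^T y]
    L = (X ** 2).sum(axis=0) / n + lam     # coordinate-wise Lipschitz constants
    for _ in range(n_epochs * d):
        i = rng.integers(d)
        g = X[:, i] @ p / n + lam * w[i] - b[i] / n    # O(n) coordinate gradient
        new_wi = soft_threshold(w[i] - g / L[i], gamma / L[i])
        p += X[:, i] * (new_wi - w[i])                 # O(n) update of p = X w
        w[i] = new_wi
    return w

# usage on a tiny random problem
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 500)) / np.sqrt(200)
y = X[:, :10] @ np.ones(10) + 0.01 * rng.standard_normal(200)
w_hat = prox_cd_elastic_net(X, y, lam=1e-3, gamma=1e-3)
print(np.count_nonzero(np.abs(w_hat) > 1e-6))
```

Because the coordinate-restricted objective is exactly quadratic with curvature \(\Vert X_{*i}\Vert _2^2/n+\lambda \), each soft-thresholding step is an exact coordinate minimization.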

The iteration complexity of APCG for solving (21) is stated in the following proposition.

Proposition 1

Assume that \(\max _j\Vert C_{*j}\Vert _2\le R_c\) where \(C_{*j}\) denotes the j-th column. Then the iteration complexity of APCG for finding \({\mathbf {u}}\) such that \(\mathrm {E}[F({\mathbf {u}})] - F_*\le \epsilon \) is bounded by

$$\begin{aligned} O\left( N+N\sqrt{\frac{R_c^2}{n\beta }}\right) \log (1/\epsilon ), \end{aligned}$$

and the time complexity of APCG is bounded by

$$\begin{aligned} O\left( NM+NM\sqrt{\frac{R_c^2}{n\beta }}\right) \log (1/\epsilon ). \end{aligned}$$

Suppose the loss function \(\ell \) in (5) is not L-smooth; then its conjugate function \(\ell ^*\) is not \(\frac{1}{L}\)-strongly convex. In this case, (2) and (7) can still be formulated as (21), but with \(\beta = 0\) and \(\phi _i(\alpha _i) = \lambda \ell _i^*(\alpha _i)\). Besides, if the parameter \(\lambda =0\) in (15) and (16), we will have \(\beta = 0\) in the corresponding formulation of (21). In either case, (21) is no longer a strongly convex optimization problem. However, both Algorithm 1 and Algorithm 2 can still be adapted for solving (21) with similar updating steps, except that the parameter \(\theta \) will vary with the iteration number k. As shown in Lin et al. (2014), if \(\beta =0\), the iteration complexity of APCG for finding \({\mathbf {u}}\) such that \(\mathrm {E}[F({\mathbf {u}})] - F_*\le \epsilon \) is bounded by

$$\begin{aligned} O\left( N\sqrt{\frac{R_c^2}{n\epsilon }}\right) , \end{aligned}$$

and the per-iteration cost with an implementation similar to Algorithm 2 is still O(M).

Next, we analyze the total time complexities for optimizing the original formulations (2) and (15) and the proposed formulations (7) and (16).

6.1 Optimization time complexity for classification

We first consider the time complexity for optimizing the original formulation (2) and the proposed formulation (7) for classification. For the original formulation, \(R_c=R\triangleq \max _{1\le i\le n}\Vert {\mathbf {x}}_i\Vert _2\), so the time complexity is

$$\begin{aligned} O\left( nd + nd\sqrt{\frac{R^2L}{n\lambda }}\right) \log (1/\epsilon ). \end{aligned}$$
(22)

In contrast, for the proposed formulation (7) with a high probability \(1-\delta \),

$$\begin{aligned} R_c&=\max _{1\le i\le n}\Vert \widehat{{\mathbf {x}}}_i\Vert _2=\max _{1\le i\le n}\Vert A{\mathbf {x}}_i\Vert _2\le \max _{1\le i\le n}\sqrt{(1+\epsilon _m)}\Vert {\mathbf {x}}_i\Vert _2 \\&= \sqrt{(1+\epsilon _m)}R, \end{aligned}$$

where \(\epsilon _m = c\sqrt{\log (n/\delta )/m}\) according to Assumption 1. Assuming that m is sufficiently large so that \(\epsilon _m\le 1\), the time complexity is

$$\begin{aligned} O\left( nm + nm\sqrt{\frac{R^2L}{n\lambda }}\right) \log (1/\epsilon ) = O\left( \frac{m}{d}\left[ nd + nd\sqrt{\frac{R^2L}{n\lambda }}\right] \right) \log (1/\epsilon ). \end{aligned}$$
(23)

Compared to the time complexity in (22) for solving the original formulation, the time complexity for solving the proposed formulation on the randomly sketched data is scaled roughly by \(m/d\ll 1\) with a high probability.

6.2 Optimization time complexity for SLSR

Next, we analyze and compare the time complexity of optimization for (15) and (16). For (15), \(R_c=R\triangleq \max _{1\le j\le d}\Vert \bar{\mathbf {x}}_j\Vert _2\), where \(\bar{\mathbf {x}}_j\) denotes the j-th column of the data matrix X. Therefore the time complexity of APCG becomes

$$\begin{aligned} O\left( nd + nd\sqrt{\frac{R^2}{n\lambda }}\right) \log (1/\epsilon ). \end{aligned}$$
(24)

For (16), with high probability \(1-\delta \), we have

$$\begin{aligned} R_c = \max _{1\le j\le d}\Vert A\bar{\mathbf {x}}_j\Vert _2\le \max _{1\le j\le d}\sqrt{1+\epsilon _m}\Vert \bar{\mathbf {x}}_j\Vert _2=\sqrt{1+\epsilon _m}R, \end{aligned}$$

where \(\epsilon _m = O(\sqrt{\log (d/\delta )/m})\). When m is sufficiently large, we can conclude that \(R_c\) for \({\widehat{X}}\) is O(R). Therefore, the time complexity of APCG for solving (16) is

$$\begin{aligned} O\left( md + md\sqrt{\frac{R^2}{n\lambda }}\right) \log (1/\epsilon ) =O\left( \frac{m}{n}\left[ nd + nd\sqrt{\frac{R^2}{n\lambda }}\right] \right) \log (1/\epsilon ) \end{aligned}$$

Hence, we can see that the optimization time complexity of APCG for solving (16) is roughly \(\frac{m}{n}\) times that in (24) for solving (15), a substantial saving when \(m\ll n\).

6.3 Total amortized time complexity and parallel computation

The total time complexity of the proposed randomized methods consists of the optimization time for the reduced formulations plus the extra time for randomized reduction, and for dual recovery if a high-dimensional model in the original feature space is required for the classification problem, i.e.,

$$\begin{aligned} \text {Total Time}=\text {time}_{proc}+ \text {time}_{opt}, \end{aligned}$$

where \(\text {time}_{opt}\) refers to the optimization time for solving the reduced formulations and \(\text {time}_{proc}\) refers to the extra processing time for reduction and recovery if necessary. The recovery time is O(nd) for the proposed randomized algorithm for classification. Among all randomized reduction methods, the transformation using Gaussian random matrices is the most expensive, taking O(mnd) time when applied to \(X\in {\mathbb {R}}^{n\times d}\), while the subsampled randomized Hadamard transform and random hashing reduce it to \({\widetilde{O}}(nd)\) and O(nnz(X)), respectively, where \({\widetilde{O}}(\cdot )\) suppresses only a logarithmic factor and nnz(X) denotes the number of non-zero entries in X. Although this extra computational time still scales as nd in the worst case, comparable to the time for optimizing the original problems, the computational benefit of the proposed randomized algorithms becomes more prominent when we consider the amortized time complexity and parallel computation to speed up reduction and recovery. In particular, in machine learning we usually need to tune the regularization parameters (i.e., via cross-validation) to achieve better generalization performance. Let B denote the number of settings for the regularization parameters, and let K denote the number of nodes (cores) used for performing the reduction (and recovery if necessary); then the total amortized time complexity per node (core) becomes

$$\begin{aligned} \text {Total Amortized Time}=\frac{ \text {time}_{proc} }{K}+ B\cdot \text {time}_{opt}. \end{aligned}$$

In comparison, the total amortized time complexity for solving the original formulations is \(B\cdot \text {Time}_{opt}\). According to the analysis in the previous subsections, \(\text {time}_{opt}\) is \(m/n\ll 1\) or \(m/d\ll 1\) times \(\text {Time}_{opt}\); hence the total amortized time complexity of the proposed randomized methods using parallel computation can be substantially reduced. Note that here we do not consider parallel optimization, which involves communication time between different nodes (cores) and renders the analysis of the optimization time much more involved.
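As a back-of-the-envelope illustration (with hypothetical numbers, not measurements from our experiments), the amortized-time formula can be evaluated as follows.

```python
def total_amortized_time(time_proc, time_opt, B, K):
    """Total Amortized Time = time_proc / K + B * time_opt."""
    return time_proc / K + B * time_opt

# Hypothetical numbers: reduction/recovery takes 100s, each reduced solve 20s
# versus 200s on the original data, B = 10 regularization settings, K = 8 cores.
reduced = total_amortized_time(time_proc=100.0, time_opt=20.0, B=10, K=8)
original = 10 * 200.0          # B * Time_opt for the original formulation
print(reduced, original)       # 212.5 2000.0
```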

7 Proofs

In this section, we present proofs for the main results.

7.1 Proof of Theorem 1

Denote by \({\mathcal {S}}\) the support set of \(\alpha _*\) and by \({\mathcal {S}}^c\) its complement. Define

$$\begin{aligned} \varDelta = \frac{1}{\lambda n}(XX^{\top } - {\widehat{X}}{\widehat{X}}^{\top })\alpha _*. \end{aligned}$$
(25)

Let \({\hat{F}}(\alpha )\) be defined as

$$\begin{aligned} {\hat{F}} (\alpha ) = \frac{1}{n}\sum _{i=1}^n\ell _i^*(\alpha _i) + \frac{1}{2\lambda n^2}\alpha ^T{\widehat{X}}{\widehat{X}}^{\top }\alpha + \frac{\tau }{n}\Vert \alpha \Vert _1. \end{aligned}$$

Since \(\widetilde{\alpha }_* = \arg \min _{\alpha } {\hat{F}}(\alpha )\), for any \(g_*\in \partial \Vert \alpha _*\Vert _1\) we have

$$\begin{aligned} 0\ge&{\hat{F}}(\widetilde{\alpha }_*) - {\hat{F}}(\alpha _*)\ge (\widetilde{\alpha }_* - \alpha _*)^{\top }\left( \frac{1}{n}\nabla \ell ^*(\alpha _*) + \frac{1}{\lambda n^2}{\widehat{X}}{\widehat{X}}^{\top }\alpha _*\right) + \frac{\tau }{n}(\widetilde{\alpha }_* - \alpha _*)^{\top }g_* \\&+ \frac{1}{2nL}\Vert \widetilde{\alpha }_* - \alpha _*\Vert _2^2, \end{aligned}$$

where we used the strong convexity of \(\ell _i^*\) with modulus 1/L. By the optimality condition of \(\alpha _*\), we have

$$\begin{aligned} 0&\ge (\alpha _* - \widetilde{\alpha }_*)^{\top }\left( \frac{1}{n}\nabla \ell ^*(\alpha _*) + \frac{1}{\lambda n^2}XX^{\top }\alpha _*\right) . \end{aligned}$$
(26)

Combining the above two inequalities and noting that \(\frac{1}{\lambda n^2}({\widehat{X}}{\widehat{X}}^{\top } - XX^{\top })\alpha _* = -\frac{1}{n}\varDelta \), we have

$$\begin{aligned} 0\ge&-(\widetilde{\alpha }_* - \alpha _*)^{\top }\frac{1}{n}\varDelta + \frac{\tau }{n}(\widetilde{\alpha }_* - \alpha _*)^{\top }g_* + \frac{1}{2nL}\Vert \widetilde{\alpha }_* - \alpha _*\Vert _2^2. \end{aligned}$$

Since the above inequality holds for any \(g_*\in \partial \Vert \alpha _*\Vert _1\), if we choose \([g_*]_i= sign([\widetilde{\alpha }_*]_i), i\in {\mathcal {S}}^c\), then we have

$$\begin{aligned} (\widetilde{\alpha }_* - \alpha _*)^{\top }g_* \ge - \Vert [\widetilde{\alpha }_*]_{{\mathcal {S}}}-[\alpha _*]_{{\mathcal {S}}}\Vert _1 + \Vert [\widetilde{\alpha }_*]_{{\mathcal {S}}^c}\Vert _1. \end{aligned}$$

Combining the above inequalities leads to

$$\begin{aligned} (\tau +\Vert \varDelta \Vert _\infty ) \Vert [\widetilde{\alpha }_*]_{{\mathcal {S}}}-[\alpha _*]_{{\mathcal {S}}}\Vert _1\ge&(\tau - \Vert \varDelta \Vert _\infty )\Vert [\widetilde{\alpha }_*]_{{\mathcal {S}}^c}\Vert _1 + \frac{1}{2L}\Vert \widetilde{\alpha }_* - \alpha _*\Vert _2^2. \end{aligned}$$
(27)

Assuming \(\tau \ge 2\Vert \varDelta \Vert _\infty \), we have

$$\begin{aligned} \frac{3\tau }{2}\Vert [\widetilde{\alpha }_*]_{{\mathcal {S}}}-[\alpha _*]_{{\mathcal {S}}}\Vert _1\ge \frac{\tau }{2}\Vert [\widetilde{\alpha }_*]_{{\mathcal {S}}^c}\Vert _1 + \frac{1}{2L}\Vert \widetilde{\alpha }_* - \alpha _*\Vert _2^2. \end{aligned}$$
(28)

Then, dropping each of the two nonnegative terms on the right-hand side of (28) in turn, we obtain

$$\begin{aligned} \begin{aligned}&\Vert \widetilde{\alpha }_* - \alpha _*\Vert _2^2\le 3\tau L \Vert [\widetilde{\alpha }_*]_{{\mathcal {S}}}-[\alpha _*]_{{\mathcal {S}}}\Vert _1\\&\Vert [\widetilde{\alpha }_*]_{{\mathcal {S}}^c}\Vert _1\le 3 \Vert [\widetilde{\alpha }_*]_{{\mathcal {S}}}-[\alpha _*]_{{\mathcal {S}}}\Vert _1. \end{aligned} \end{aligned}$$
(29)

Therefore,

$$\begin{aligned} \Vert [\widetilde{\alpha }_*-\alpha _*]_{\mathcal {S}}\Vert _1^2\le s\Vert \widetilde{\alpha }_* - \alpha _*\Vert _2^2&\le 3\tau Ls \Vert [\widetilde{\alpha }_*]_{{\mathcal {S}}}-[\alpha _*]_{{\mathcal {S}}}\Vert _1, \end{aligned}$$

leading to the result

$$\begin{aligned} \Vert [\widetilde{\alpha }_*]_{{\mathcal {S}}}-[\alpha _*]_{{\mathcal {S}}}\Vert _1\le 3\tau L s. \end{aligned}$$

Combining this inequality with inequalities in (29) we have

$$\begin{aligned}&\Vert [\widetilde{\alpha }_*]_{{\mathcal {S}}^c}\Vert _1\le 9\tau L s\nonumber \\&\Vert \widetilde{\alpha }_*-\alpha _*\Vert _2\le 3\tau L\sqrt{s}. \end{aligned}$$
(30)

and

$$\begin{aligned} \Vert \widetilde{\alpha }_*-\alpha _*\Vert _1\le \Vert [\widetilde{\alpha }_*]_{{\mathcal {S}}^c}\Vert _1 + \Vert [\widetilde{\alpha }_*]_{{\mathcal {S}}}-[\alpha _*]_{{\mathcal {S}}}\Vert _1\le 12\tau Ls. \end{aligned}$$
(31)

To complete the proof of Theorem 1, it remains to bound \(\Vert \varDelta \Vert _\infty \), which determines the required value of \(\tau \).

Lemma 8

Let \(A\in {\mathbb {R}}^{m\times d}\) be a random matrix satisfying Assumption 1. Then with a high probability \(1-\delta \) we have

$$\begin{aligned} \Vert \varDelta \Vert _\infty \le cR\Vert {\mathbf {w}}_*\Vert _2 \sqrt{\frac{\log (2n/\delta )}{m}}, \end{aligned}$$

where \(R=\max _i\Vert {\mathbf {x}}_i\Vert _2\).

Proof

Let \(\epsilon _{m,\delta } = c\sqrt{\log (1/\delta )/m}\).

$$\begin{aligned}&\frac{1}{\lambda n}( {\widehat{X}}{\widehat{X}}^{\top } - XX^{\top })\alpha _* = \frac{1}{\lambda n}( XA^{\top }AX^{\top } - XX^{\top })\alpha _* \\&= \frac{1}{\lambda n}X(A^{\top }A - I)X^{\top }\alpha _* = X(I - A^{\top }A){\mathbf {w}}_*, \end{aligned}$$

where we use the fact \({\mathbf {w}}_* = -\frac{1}{\lambda n}X^{\top }\alpha _*\); the quantity above equals \(-\varDelta \), and the sign does not affect the bound on \(\Vert \varDelta \Vert _\infty \). Then

$$\begin{aligned} \frac{1}{\lambda n}[( {\widehat{X}}{\widehat{X}}^{\top } - XX^{\top })\alpha _*]_i = {\mathbf {x}}_i^{\top }(I - A^{\top }A){\mathbf {w}}_*. \end{aligned}$$

Therefore in order to bound \(\Vert \varDelta \Vert _\infty \), we need to bound \({\mathbf {x}}_i^{\top }(I - A^{\top }A){\mathbf {w}}_*\) for all \(i\in [n]\). We first bound for individual i and then apply the union bound. Let \(\widetilde{{\mathbf {x}}}_i\) and \(\widetilde{{\mathbf {w}}}_*\) be normalized version of \({\mathbf {x}}_i\) and \({\mathbf {w}}_*\), i.e., \(\widetilde{{\mathbf {x}}}_i = {\mathbf {x}}_i/\Vert {\mathbf {x}}_i\Vert _2\) and \(\widetilde{{\mathbf {w}}}_* = {\mathbf {w}}_*/\Vert {\mathbf {w}}_*\Vert _2\). By Assumption 1, with a probability \(1-\delta \),

$$\begin{aligned} (1-\epsilon _{m,\delta })\Vert \widetilde{{\mathbf {x}}}_i+\widetilde{{\mathbf {w}}}_*\Vert _2^2\le \Vert A(\widetilde{{\mathbf {x}}}_i+\widetilde{{\mathbf {w}}}_*)\Vert _2^2\le (1+\epsilon _{m,\delta })\Vert \widetilde{{\mathbf {x}}}_i+\widetilde{{\mathbf {w}}}_*\Vert _2^2, \end{aligned}$$

and with a probability \(1-\delta \),

$$\begin{aligned} (1-\epsilon _{m,\delta })\Vert \widetilde{{\mathbf {x}}}_i - \widetilde{{\mathbf {w}}}_*\Vert _2^2\le \Vert A(\widetilde{{\mathbf {x}}}_i-\widetilde{{\mathbf {w}}}_*)\Vert _2^2\le (1+\epsilon _{m,\delta })\Vert \widetilde{{\mathbf {x}}}_i- \widetilde{{\mathbf {w}}}_*\Vert _2^2. \end{aligned}$$

Then with a probability \(1-2\delta \), we have

$$\begin{aligned} \widetilde{{\mathbf {x}}}_i^{\top }A^{\top }A\widetilde{{\mathbf {w}}}_* - \widetilde{{\mathbf {x}}}_i^{\top }\widetilde{{\mathbf {w}}}_*&= \frac{\Vert A(\widetilde{{\mathbf {x}}}_i+\widetilde{{\mathbf {w}}}_*)\Vert _2^2 - \Vert A(\widetilde{{\mathbf {x}}}_i-\widetilde{{\mathbf {w}}}_*)\Vert _2^2}{4} - \widetilde{{\mathbf {x}}}_i^{\top }\widetilde{{\mathbf {w}}}_*\\&\le \frac{(1+\epsilon _{m,\delta })\Vert \widetilde{{\mathbf {x}}}_i+\widetilde{{\mathbf {w}}}_*\Vert _2^2 - (1-\epsilon _{m,\delta })\Vert \widetilde{{\mathbf {x}}}_i-\widetilde{{\mathbf {w}}}_*\Vert _2^2}{4}-\widetilde{{\mathbf {x}}}_i^{\top }\widetilde{{\mathbf {w}}}_*\\&\le \frac{\epsilon _{m,\delta }}{2}(\Vert \widetilde{{\mathbf {x}}}_i\Vert _2^2 + \Vert \widetilde{{\mathbf {w}}}_*\Vert _2^2)\le \epsilon _{m,\delta }, \end{aligned}$$

and

$$\begin{aligned} \widetilde{{\mathbf {x}}}_i^{\top }A^{\top }A\widetilde{{\mathbf {w}}}_*- \widetilde{{\mathbf {x}}}_i^{\top }\widetilde{{\mathbf {w}}}_*&= \frac{\Vert A(\widetilde{{\mathbf {x}}}_i+\widetilde{{\mathbf {w}}}_*)\Vert _2^2 - \Vert A(\widetilde{{\mathbf {x}}}_i-\widetilde{{\mathbf {w}}}_*)\Vert _2^2}{4}- \widetilde{{\mathbf {x}}}_i^{\top }\widetilde{{\mathbf {w}}}_*\\&\ge \frac{(1-\epsilon _{m,\delta })\Vert \widetilde{{\mathbf {x}}}_i+\widetilde{{\mathbf {w}}}_*\Vert _2^2 - (1+\epsilon _{m,\delta })\Vert \widetilde{{\mathbf {x}}}_i-\widetilde{{\mathbf {w}}}_*\Vert _2^2}{4}-\widetilde{{\mathbf {x}}}_i^{\top }\widetilde{{\mathbf {w}}}_*\\&\ge -\frac{\epsilon _{m,\delta }}{2}(\Vert \widetilde{{\mathbf {x}}}_i\Vert _2^2 + \Vert \widetilde{{\mathbf {w}}}_*\Vert _2^2)\ge -\epsilon _{m,\delta }. \end{aligned}$$

Therefore with a probability \(1 - 2\delta \), we have

$$\begin{aligned}&|{\mathbf {x}}_i^{\top }A^{\top }A{\mathbf {w}}_* - {\mathbf {x}}_i^{\top }{\mathbf {w}}_*|\le \Vert {\mathbf {x}}_i\Vert _2\Vert {\mathbf {w}}_*\Vert _2 |\widetilde{{\mathbf {x}}}_i^{\top }A^{\top }A\widetilde{{\mathbf {w}}}_* - \widetilde{{\mathbf {x}}}_i^{\top }\widetilde{{\mathbf {w}}}_*|\le \Vert {\mathbf {x}}_i\Vert _2\Vert {\mathbf {w}}_*\Vert _2 \epsilon _{m,\delta }. \end{aligned}$$

Then, applying the union bound, with a probability \(1-2n\delta \) we have

$$\begin{aligned} \Vert \varDelta \Vert _\infty \le cR\Vert {\mathbf {w}}_*\Vert _2 \sqrt{\frac{\log (1/\delta )}{m}}, \end{aligned}$$

or with a probability \(1-\delta \)

$$\begin{aligned} \Vert \varDelta \Vert _\infty \le cR\Vert {\mathbf {w}}_*\Vert _2 \sqrt{\frac{\log (2n/\delta )}{m}}. \end{aligned}$$

\(\square \)

To finish the proof of Theorem 1, we combine the above lemma with the inequalities in (30) and (31), noting the choice of \(\tau \).

7.2 Proof of Theorem 2

Following the same argument as in the proof of Theorem 1, we first notice that inequality (27) holds with \(L=\infty \), i.e.,

$$\begin{aligned} (\tau +\Vert \varDelta \Vert _\infty ) \Vert [\widetilde{\alpha }_*]_{{\mathcal {S}}}-[\alpha _*]_{{\mathcal {S}}}\Vert _1\ge&(\tau - \Vert \varDelta \Vert _\infty )\Vert [\widetilde{\alpha }_*]_{{\mathcal {S}}^c}\Vert _1. \end{aligned}$$

Therefore if \(\tau \ge 2\Vert \varDelta \Vert _\infty \), we have

$$\begin{aligned} \Vert [\widetilde{\alpha }_*]_{{\mathcal {S}}^c}\Vert _1\le 3\Vert [\widetilde{\alpha }_*]_{{\mathcal {S}}}-[\alpha _*]_{{\mathcal {S}}}\Vert _1. \end{aligned}$$

As a result,

$$\begin{aligned} \frac{\Vert \widetilde{\alpha }_* - \alpha _*\Vert _1}{\Vert \widetilde{\alpha }_* - \alpha _*\Vert _2}&\le \frac{\Vert [\widetilde{\alpha }_*]_{{\mathcal {S}}}-[\alpha _*]_{{\mathcal {S}}}\Vert _1 + \Vert [\widetilde{\alpha }_*]_{{\mathcal {S}}^c}\Vert _1}{\Vert \widetilde{\alpha }_* - \alpha _*\Vert _2}\le \frac{4\Vert [\widetilde{\alpha }_*]_{{\mathcal {S}}}-[\alpha _*]_{{\mathcal {S}}}\Vert _1}{\Vert \widetilde{\alpha }_* - \alpha _*\Vert _2}\le 4\sqrt{s}. \end{aligned}$$

By the definition of \({\mathcal {K}}_{n,s}\), we have \(\displaystyle \frac{\widetilde{\alpha }_* - \alpha _*}{\Vert \widetilde{\alpha }_* - \alpha _*\Vert _2}\in {\mathcal {K}}_{n, 16s}\). To proceed, by the optimality of \(\widetilde{\alpha }_*\) there exists \(\widetilde{g}_*\in \partial \Vert \widetilde{\alpha }_*\Vert _1\) such that

$$\begin{aligned} 0\ge&(\widetilde{\alpha }_* - \alpha _*)^{\top }\left( \frac{1}{n}\nabla \ell ^*(\widetilde{\alpha }_*) + \frac{1}{\lambda n^2}{\widehat{X}}{\widehat{X}}^{\top }\widetilde{\alpha }_*\right) + \frac{\tau }{n}(\widetilde{\alpha }_* - \alpha _*)^{\top }{\widetilde{g}}_*. \end{aligned}$$

Adding the above inequality to (26), we have

$$\begin{aligned} 0&\ge (\alpha _* - \widetilde{\alpha }_*)^{\top }\left( \frac{1}{n}\nabla \ell ^*(\alpha _*) -\frac{1}{n}\nabla \ell ^*(\widetilde{\alpha }_*)\right) \\&\quad + (\alpha _* - \widetilde{\alpha }_*)^{\top }\left( \frac{1}{\lambda n^2}XX^{\top }\alpha _* - \frac{1}{\lambda n^2}{\widehat{X}}{\widehat{X}}^{\top }\widetilde{\alpha }_*\right) \\&\quad +\frac{\tau }{n}\left\| [\widetilde{\alpha }_*]_{{\mathcal {S}}^c}\right\| _1 - \frac{\tau }{n}\left\| [\widetilde{\alpha }_*]_{{\mathcal {S}}} - [\alpha _*]_{{\mathcal {S}}}\right\| _1. \end{aligned}$$

By convexity of \(\ell ^*\) we have

$$\begin{aligned} (\alpha _* - \widetilde{\alpha }_*)^{\top }\left[ \frac{1}{n}\nabla \ell ^*(\alpha _*) -\frac{1}{n}\nabla \ell ^*(\widetilde{\alpha }_*)\right] \ge 0. \end{aligned}$$

Thus, we have

$$\begin{aligned} \tau \left\| [\widetilde{\alpha }_*]_{{\mathcal {S}}} - [\alpha _*]_{{\mathcal {S}}}\right\| _1&\ge \tau \left\| [\widetilde{\alpha }_*]_{{\mathcal {S}}^c}\right\| _1 + (\alpha _* - \widetilde{\alpha }_*)^{\top }\left( \frac{1}{\lambda n}XX^{\top } - \frac{1}{\lambda n}{\widehat{X}}{\widehat{X}}^{\top }\right) \alpha _* \\&\quad - (\alpha _* - \widetilde{\alpha }_*)^{\top }\left( \frac{1}{\lambda n}XX^{\top } - \frac{1}{\lambda n}{\widehat{X}}{\widehat{X}}^{\top }\right) (\alpha _* - \widetilde{\alpha }_*)\\&\quad + \frac{1}{\lambda n}(\alpha _* - \widetilde{\alpha }_*)^{\top }XX^{\top }(\alpha _* - \widetilde{\alpha }_*). \end{aligned}$$

Since

$$\begin{aligned} (\alpha _* - \widetilde{\alpha }_*)^{\top }\varDelta \ge -\Vert \varDelta \Vert _\infty \Vert \alpha _* - \widetilde{\alpha }_*\Vert _1, \end{aligned}$$

and \(\tau \ge 2\Vert \varDelta \Vert _\infty \), by the definition of \(\rho ^-_{s}\) and \(\sigma _{s}\) we have

$$\begin{aligned} \frac{3\tau }{2}&\left\| [\widetilde{\alpha }_* - \alpha _*]_{{\mathcal {S}}}\right\| _1\ge \frac{\tau }{2}\left\| [\widetilde{\alpha }_*]_{{\mathcal {S}}^c}\right\| _1 + \frac{\rho ^-_{16s} - \sigma _{16s}}{\lambda }\Vert \widetilde{\alpha }_* - \alpha _*\Vert _2^2. \end{aligned}$$

This is similar to inequality (28), and the conclusion then follows from the same analysis as before.

7.3 Proof of Theorem 4

Define

$$\begin{aligned} \displaystyle {\mathbf {q}}= \frac{1}{n}X^{\top }(A^{\top }A -I ){\mathbf {e}}, \quad \quad {\mathbf {e}}=X{\mathbf {w}}_* - {\mathbf {y}}. \end{aligned}$$
(32)

First, we note that

$$\begin{aligned} \widehat{{\mathbf {w}}}_*&= \arg \min _{{\mathbf {w}}\in {\mathbb {R}}^d}\frac{1}{2n}\Vert {\widehat{X}}{\mathbf {w}}- \widehat{{\mathbf {y}}}\Vert _2^2 + \frac{\lambda }{2}\Vert {\mathbf {w}}\Vert _2^2 + (\gamma + \tau ) \Vert {\mathbf {w}}\Vert _1\\&=\arg \min _{{\mathbf {w}}\in {\mathbb {R}}^d}\underbrace{\frac{1}{2n}\left( {\mathbf {w}}^{\top }{\widehat{X}}^{\top }{\widehat{X}}{\mathbf {w}}- 2{\mathbf {w}}^{\top }{\widehat{X}}^{\top }\widehat{{\mathbf {y}}}\right) + \frac{\lambda }{2}\Vert {\mathbf {w}}\Vert _2^2 + (\gamma + \tau ) \Vert {\mathbf {w}}\Vert _1}_{{\widehat{F}}({\mathbf {w}})}, \end{aligned}$$

and

$$\begin{aligned} {\mathbf {w}}_* = \arg \min _{{\mathbf {w}}\in {\mathbb {R}}^d}\frac{1}{2n}\Vert X{\mathbf {w}}- {\mathbf {y}}\Vert _2^2 + \frac{\lambda }{2}\Vert {\mathbf {w}}\Vert _2^2 + \gamma \Vert {\mathbf {w}}\Vert _1. \end{aligned}$$

By optimality of \(\widehat{{\mathbf {w}}}_*\) and the strong convexity of \(\widehat{F}({\mathbf {w}})\), for any \(g\in \partial \Vert {\mathbf {w}}_*\Vert _1\) we have

$$\begin{aligned} 0\ge {\widehat{F}}(\widehat{{\mathbf {w}}}_*) - {\widehat{F}}({\mathbf {w}}_*)&\ge (\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*)^{\top }\left( \frac{1}{n}{\widehat{X}}^{\top }{\widehat{X}}{\mathbf {w}}_* - \frac{1}{n}{\widehat{X}}^{\top }\widehat{{\mathbf {y}}}+ \lambda {\mathbf {w}}_*\right) \nonumber \\&\quad + (\gamma + \tau )(\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*)^{\top }g+ \frac{\lambda }{2}\Vert \widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _2^2. \end{aligned}$$
(33)

By the optimality condition of \({\mathbf {w}}_*\), there exists \(h\in \partial \Vert {\mathbf {w}}_*\Vert _1\) such that

$$\begin{aligned} \frac{1}{n}X^{\top }X{\mathbf {w}}_* -\frac{1}{n}X^{\top } {\mathbf {y}}+ \lambda {\mathbf {w}}_* + \gamma h=0. \end{aligned}$$
(34)

Substituting the above equation into (33), we have

$$\begin{aligned} 0\ge&(\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*)^{\top }{\mathbf {q}}+ (\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*)^{\top }\left[ (\gamma + \tau )g - \gamma h\right] + \frac{\lambda }{2}\Vert \widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _2^2. \end{aligned}$$
(35)

Let \({\mathcal {S}}\) denote the support set of \({\mathbf {w}}_*\) and \({\mathcal {S}}_c\) denote its complement set. Since g could be any sub-gradient of \(\Vert {\mathbf {w}}\Vert _1\) at \({\mathbf {w}}_*\), we define g as \(g_i = \left\{ \begin{array}{lc} h_i,&{} i\in {\mathcal {S}}\\ sign({\widehat{w}}_{*i}),&{}i\in {\mathcal {S}}_c\end{array}\right. \). Then we have

$$\begin{aligned}&(\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*)^{\top } \left[ (\gamma + \tau )g - \gamma h\right] = \sum _{i\in {\mathcal {S}}}({\widehat{w}}_{*i} - w_{*i} )(\tau h_i)\\&\quad \quad + \sum _{i\in {\mathcal {S}}^c}({\widehat{w}}_{*i} - w_{*i} )(\tau sign({\widehat{w}}_{*i}) + \gamma (sign({\widehat{w}}_{*i}) - h_i))\\&\quad \ge -\tau \Vert [\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*]_{{\mathcal {S}}}\Vert _1 + \sum _{i\in {\mathcal {S}}_c}\tau sign({\widehat{w}}_{*i}){\widehat{w}}_{*i} \\&\quad \quad + \sum _{i\in {\mathcal {S}}_c}\gamma (sign({\widehat{w}}_{*i}) - h_i){\widehat{w}}_{*i}\\&\quad \ge -\tau \Vert [\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*]_{{\mathcal {S}}}\Vert _1 + \tau \Vert [\widehat{{\mathbf {w}}}_*]_{{\mathcal {S}}_c}\Vert _1, \end{aligned}$$

where the last inequality uses \(|h_i|\le 1\) and \(\sum _{i\in {\mathcal {S}}_c}(sign({\widehat{w}}_{*i}) - h_i){\widehat{w}}_{*i}\ge 0\). Combining the above inequality with (35), we have

$$\begin{aligned} 0\ge&-\Vert \widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _1\Vert {\mathbf {q}}\Vert _\infty - \tau \Vert [\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*]_{\mathcal {S}}\Vert _1 +\tau \Vert [\widehat{{\mathbf {w}}}_*]_{{\mathcal {S}}_c}\Vert _1 + \frac{\lambda }{2}\Vert \widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _2^2. \end{aligned}$$

By splitting \(\Vert \widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _1 = \Vert [\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*]_{\mathcal {S}}\Vert _1 + \Vert [\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*]_{{\mathcal {S}}_c}\Vert _1\) and reorganizing the above inequality we have

$$\begin{aligned} \frac{\lambda }{2}\Vert \widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _2^2 + (\tau - \Vert {\mathbf {q}}\Vert _\infty )\Vert [\widehat{{\mathbf {w}}}_*]_{{\mathcal {S}}^c}\Vert _1\le (\tau + \Vert {\mathbf {q}}\Vert _\infty )\Vert [\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*]_{\mathcal {S}}\Vert _1. \end{aligned}$$
(36)

If \(\tau \ge 2\Vert {\mathbf {q}}\Vert _\infty \), then we have

$$\begin{aligned} \frac{\lambda }{2}\Vert \widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _2^2&\le \frac{3\tau }{2}\Vert [\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*]_{\mathcal {S}}\Vert _1 \end{aligned}$$
(37)
$$\begin{aligned} \Vert [\widehat{{\mathbf {w}}}_*]_{{\mathcal {S}}^c}\Vert _1&\le 3\Vert [\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*]_{\mathcal {S}}\Vert _1. \end{aligned}$$
(38)

Note that inequality (38) holds regardless of the value of \(\lambda \). Since

$$\begin{aligned}&\Vert [\widehat{{\mathbf {w}}}_*-{\mathbf {w}}_*]_{\mathcal {S}}\Vert _1\le \sqrt{s}\Vert [\widehat{{\mathbf {w}}}_*-{\mathbf {w}}_*]_{\mathcal {S}}\Vert _2, \\&\Vert \widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _2\ge \max (\Vert [\widehat{{\mathbf {w}}}_*-{\mathbf {w}}_*]_{\mathcal {S}}\Vert _2, \Vert [\widehat{{\mathbf {w}}}_*]_{{\mathcal {S}}^c}\Vert _2), \end{aligned}$$

by combining the above inequalities with (37), we can get

$$\begin{aligned} \Vert \widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _2\le \frac{3\tau }{\lambda }\sqrt{s}, \quad \Vert [\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*]_{\mathcal {S}}\Vert _1\le \frac{3\tau }{\lambda }s, \end{aligned}$$

and

$$\begin{aligned} \Vert \widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _1&\le \Vert [\widehat{{\mathbf {w}}}_*]_{{\mathcal {S}}^c}\Vert _1 + \Vert [\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*]_{\mathcal {S}}\Vert _1\\&\le 3\Vert [\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*]_{\mathcal {S}}\Vert _1 +\Vert [\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*]_{\mathcal {S}}\Vert _1\\&\le \frac{12\tau }{\lambda }s. \end{aligned}$$

We can then complete the proof of Theorem 4 by noting the upper bound on \(\Vert {\mathbf {q}}\Vert _\infty \) in the following lemma and by setting \(\tau \) according to the theorem.

Lemma 9

Let \(A\in {\mathbb {R}}^{m\times n}\) be a random matrix satisfying Assumption 1. With a probability at least \(1-\delta \), we have

$$\begin{aligned} \Vert {\mathbf {q}}\Vert _\infty \le \frac{c\eta R}{n}\sqrt{\frac{\log (2d/\delta )}{m}}, \end{aligned}$$

where \(R=\max _{j}\Vert \bar{\mathbf {x}}_j\Vert _2\) and c is the constant that appears in Assumption 1.

The lemma can be proved similarly to Lemma 8.

7.4 Proof of Theorem 5

When \(\lambda =0\), the reduced problem becomes

$$\begin{aligned} \widehat{{\mathbf {w}}}_*&=\arg \min _{{\mathbf {w}}\in {\mathbb {R}}^d}\underbrace{\frac{1}{2n}\Vert {\widehat{X}}{\mathbf {w}}- \widehat{{\mathbf {y}}}\Vert _2^2 +(\gamma +\tau )\Vert {\mathbf {w}}\Vert _1}_{{\widehat{F}}({\mathbf {w}})}. \end{aligned}$$
(39)

Recall the original optimization problem:

$$\begin{aligned} {\mathbf {w}}_* = \arg \min _{{\mathbf {w}}\in {\mathbb {R}}^d}\frac{1}{2n}\Vert X{\mathbf {w}}- {\mathbf {y}}\Vert _2^2 + \gamma \Vert {\mathbf {w}}\Vert _1. \end{aligned}$$

From the proof of Theorem 4, we have

$$\begin{aligned} \Vert [\widehat{{\mathbf {w}}}_*]_{{\mathcal {S}}^c}\Vert _1&\le 3\Vert [\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*]_{\mathcal {S}}\Vert _1, \quad \text {and}\quad \frac{\Vert \widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _1}{\Vert \widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _2} \le \frac{4\Vert [\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*]_{{\mathcal {S}}}\Vert _1}{\Vert \widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _2} \le 4\sqrt{s}. \end{aligned}$$

Thus \(\frac{\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*}{\Vert \widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _2}\in {\mathcal {K}}_{d,16s}\). We then proceed as follows. Since \({\mathbf {w}}_*\) minimizes \(F({\mathbf {w}})\), for any \(g\in \partial \Vert \widehat{{\mathbf {w}}}_*\Vert _1\) we have

$$\begin{aligned} 0\ge F({\mathbf {w}}_*)-F(\widehat{{\mathbf {w}}}_*)&\ge ({\mathbf {w}}_* - \widehat{{\mathbf {w}}}_*)^{\top }\left( \frac{1}{n}X^{\top }X\widehat{{\mathbf {w}}}_* - \frac{1}{n}X^{\top }{\mathbf {y}}+ \gamma g \right) \\&+ \frac{1}{2n}({\mathbf {w}}_* - \widehat{{\mathbf {w}}}_*)^{\top }X^{\top }X({\mathbf {w}}_* - \widehat{{\mathbf {w}}}_*). \end{aligned}$$

Since \(\widehat{{\mathbf {w}}}_*\) optimizes \({\widehat{F}}({\mathbf {w}})\), by the first order optimality condition there exists \(h\in \partial \Vert \widehat{{\mathbf {w}}}_*\Vert _1\) such that

$$\begin{aligned} 0\ge (\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*)^{\top }\left( \frac{1}{n}{\widehat{X}}^{\top }{\widehat{X}}\widehat{{\mathbf {w}}}_* - \frac{1}{n}{\widehat{X}}^{\top }\widehat{{\mathbf {y}}}\right) + (\gamma + \tau )(\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*)^{\top }h. \end{aligned}$$

Combining the two inequalities above we have

$$\begin{aligned} 0\ge&\, ({\mathbf {w}}_* - \widehat{{\mathbf {w}}}_*)^{\top }\left( \frac{1}{n}X^{\top }X\widehat{{\mathbf {w}}}_* - \frac{1}{n}X^{\top }{\mathbf {y}}- \frac{1}{n}{\widehat{X}}^{\top }{\widehat{X}}\widehat{{\mathbf {w}}}_* + \frac{1}{n}{\widehat{X}}^{\top }\widehat{{\mathbf {y}}}\right) \\&+\, (\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*)^{\top }(\gamma h + \tau h - \gamma g)\\&+\, \frac{1}{2n}({\mathbf {w}}_* - \widehat{{\mathbf {w}}}_*)^{\top }X^{\top }X({\mathbf {w}}_* - \widehat{{\mathbf {w}}}_*)\\ =&\,({\mathbf {w}}_* - \widehat{{\mathbf {w}}}_*)^{\top }\left( \frac{1}{n}X^{\top }X{\mathbf {w}}_* - \frac{1}{n}X^{\top }{\mathbf {y}}- \frac{1}{n}{\widehat{X}}^{\top }{\widehat{X}}{\mathbf {w}}_* + \frac{1}{n}{\widehat{X}}^{\top }\widehat{{\mathbf {y}}}\right) \\&+\, (\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*)^{\top }(\gamma h + \tau h - \gamma g)\\&+\, \frac{1}{2n}({\mathbf {w}}_* - \widehat{{\mathbf {w}}}_*)^{\top }X^{\top }X({\mathbf {w}}_* - \widehat{{\mathbf {w}}}_*) \\&+\, ({\mathbf {w}}_* - \widehat{{\mathbf {w}}}_*)^{\top }\left( \frac{1}{n}X^{\top }X(\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*) - \frac{1}{n}{\widehat{X}}^{\top }{\widehat{X}}(\widehat{{\mathbf {w}}}_*-{\mathbf {w}}_*) \right) \\ =&\,({\mathbf {w}}_* - \widehat{{\mathbf {w}}}_*)^{\top }\left( \frac{1}{n}X^{\top }X{\mathbf {w}}_* - \frac{1}{n}X^{\top }{\mathbf {y}}- \frac{1}{n}{\widehat{X}}^{\top }{\widehat{X}}{\mathbf {w}}_* + \frac{1}{n}{\widehat{X}}^{\top }\widehat{{\mathbf {y}}}\right) \\&+\, (\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*)^{\top }(\gamma h + \tau h - \gamma g)\\&+\, \frac{1}{2n}({\mathbf {w}}_* - \widehat{{\mathbf {w}}}_*)^{\top }X^{\top }X({\mathbf {w}}_* - \widehat{{\mathbf {w}}}_*) \\&+\, ({\mathbf {w}}_* - \widehat{{\mathbf {w}}}_*)^{\top }\left( \frac{1}{n}X^{\top }X - \frac{1}{n}{\widehat{X}}^{\top }{\widehat{X}}\right) (\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*). \end{aligned}$$

By setting \(g_i= h_i, i\in {\mathcal {S}}\) and following the same analysis as in the proof of Theorem 4, we have

$$\begin{aligned} (\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*)^{\top }(\gamma h + \tau h - \gamma g) \ge - \tau \Vert [\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*]_{\mathcal {S}}\Vert _1+ \tau \Vert [\widehat{{\mathbf {w}}}_*]_{{\mathcal {S}}^c}\Vert _1. \end{aligned}$$

As a result,

$$\begin{aligned} 0\ge&- \Vert \widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _1\Vert {\mathbf {q}}\Vert _\infty - \tau \Vert [\widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*]_{\mathcal {S}}\Vert _1 + \tau \Vert [\widehat{{\mathbf {w}}}_*]_{{\mathcal {S}}^c}\Vert _1 + \frac{\rho ^-_{d,16s}}{2}\Vert \widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _2^2\\&- \sigma _{d,16s}\Vert \widehat{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _2^2. \end{aligned}$$

Then we arrive at the same inequality as (36) with \(\lambda \) replaced by \(\rho ^-_{d,16s}-2\sigma _{d,16s}\), and the same analysis as before then proves Theorem 5.

8 Extensions

8.1 Recovery error with nearly sparse dual solution for classification

In this section, we provide a theoretical result on the recovery error when the optimal dual variable \(\alpha _*\) is only nearly sparse. We state the result for smooth loss functions. To quantify the near sparsity, we let \(\alpha _*^s\in {\mathbb {R}}^n\) denote the vector that keeps the top-s elements of \(\alpha _*\) in magnitude and zeros out all other entries, and assume that \(\alpha _*^s\) satisfies the following condition:

$$\begin{aligned}&\left\| \nabla \ell ^*(\alpha ^s_*) + \frac{1}{\lambda n}XX^{\top }\alpha ^s_*\right\| _\infty \le \xi , \end{aligned}$$
(40)

where \(\nabla \ell ^*(\alpha ) = (\nabla \ell _1^*(\alpha _1),\ldots , \nabla \ell _n^*(\alpha _n))^{\top }\). The above condition can be viewed as a sub-optimality condition (Boyd and Vandenberghe 2004) on \(\alpha ^s_*\) measured in the infinity norm. For an optimal solution \(\alpha _*\) that lies in the interior of its domain, we have \(\nabla \ell ^*(\alpha _*) + \frac{1}{\lambda n}XX^{\top }\alpha _*=0\).

Theorem 6

Let \(A\in {\mathbb {R}}^{m\times d}\) be a random matrix sampled from a distribution that satisfies the JL lemma. Let \(\widetilde{\alpha }_*\) be the optimal dual solution to (7). Assume \(\alpha _*\) is nearly s-sparse such that (40) holds, \(\max _i\Vert {\mathbf {x}}_i\Vert _2\le R\) and \(\ell (z)\) is L-smooth. If we set \(\tau \ge \frac{2}{\lambda n}\Vert (XX^{\top } - {\widehat{X}}{\widehat{X}}^{\top })\alpha _*\Vert _\infty + 2\xi \), then we have

$$\begin{aligned}&\Vert \widetilde{\alpha }_* -\alpha ^s_*\Vert _2\le 3\tau L \sqrt{s},\quad \Vert \widetilde{\alpha }_* -\alpha ^s_*\Vert _1\le 12\tau L s. \end{aligned}$$

Remark 8

The proof appears in “Appendix A”. Compared to Theorem 1 for exactly sparse optimal dual solution, the dual recovery error bound for nearly sparse optimal dual solution is increased by \(6L\sqrt{s}\xi \) for \(\ell _2\) norm and by \(24Ls\xi \) for \(\ell _1\) norm.

9 Numerical experiments

In this section, we provide experimental results to complement the theoretical analysis. We use four datasets for our experiments, whose statistics are summarized in Table 1. Among them, RCV1, KDD and Splice are for the classification task and E2006-tfidf is for the regression task.

Table 1 Statistics of datasets
Fig. 1 Recovery error for the squared hinge loss. From left to right: \(\frac{\Vert [\widetilde{\alpha }_*]_{{\mathcal {S}}^c}\Vert _1}{\Vert [\widetilde{\alpha }_*]_{\mathcal {S}}- [\alpha _*]_{\mathcal {S}}\Vert _1}\) versus \(\tau \), \(\frac{\Vert \widetilde{\alpha }_* - \alpha _*\Vert _2}{\Vert \alpha _*\Vert _2}\) versus \(\tau \), and \(\frac{\Vert \widetilde{{\mathbf {w}}}_* - {\mathbf {w}}_*\Vert _2}{\Vert {\mathbf {w}}_*\Vert _2}\) versus \(\tau \)

Fig. 2 Same curves as in Fig. 1 but for the non-smooth hinge loss

9.1 Classification

We first study the recovery error for classification. We use the RCV1-binary data (Lewis et al. 2004) to conduct the study. The data contains 697,641 documents and 47,236 features. We use a 677,399/20,242 split for training and testing. The feature vectors were normalized such that their \(\ell _2\) norm is equal to 1. We only report the results using random hashing as in (20) since it is the most efficient, while the other randomized reduction methods (except for random sampling) have similar performance. For the loss function, we use both the squared hinge loss (smooth) and the hinge loss (non-smooth). We aim to examine two questions related to our analysis and motivation: (i) how does the value of \(\tau \) affect the recovery error? (ii) how does the number of samples m affect the recovery error? We vary the value of \(\tau \) among \(0, 0.1, 0.2, \ldots , 0.9\), the value of m among 1024, 2048, 4096, 8192, and the value of \(\lambda \) among 0.001, 0.00001. Note that \(\tau =0\) corresponds to the dual recovery approach proposed in Zhang et al. (2013). The results averaged over 5 random trials are shown in Fig. 1 for the squared hinge loss and in Fig. 2 for the hinge loss. We first analyze the results in Fig. 1. We can observe that as \(\tau \) increases, the ratio \(\frac{\Vert [\widetilde{\alpha }_*]_{{\mathcal {S}}^c}\Vert _1}{\Vert [\widetilde{\alpha }_*]_{\mathcal {S}}- [\alpha _*]_{\mathcal {S}}\Vert _1}\) decreases, indicating that the magnitude of the dual variables for the original non-support vectors decreases. This is intuitive and consistent with our motivation. The recovery error of the dual solution (middle) first decreases and then increases. This can be partially explained by the theoretical result in Theorem 1. When the value of \(\tau \) exceeds a certain threshold so that the condition on \(\tau \) in Theorem 1 (of the order of \(\Vert \varDelta \Vert _\infty \)) is satisfied, Theorem 1 implies that a larger \(\tau \) will lead to a larger error. On the other hand, when \(\tau \) is below the threshold, the dual recovery error decreases as \(\tau \) increases, verifying that adding a sparse regularizer is very important. In addition, the figures show that the thresholds for larger m are smaller, which is consistent with our analysis of \(\Vert \varDelta \Vert _\infty =O(\sqrt{1/m})\). The difference between \(\lambda =0.001\) and \(\lambda =0.00001\) arises because a smaller \(\lambda \) leads to a larger \(\Vert {\mathbf {w}}_*\Vert _2\). For the hinge loss, we observe similar trends; however, recovery is much more difficult than for the squared hinge loss, especially when the value of \(\lambda \) is small.

Next, we compare the classification performance of different randomized methods. We compare the proposed randomized method that solves a sparse regularized dual formulation and recovers a high-dimensional model by dual recovery (referred to as DSRR), the previous random projection with dual recovery (referred to as RPDR), and the standard random projection approach (referred to as RP) that learns a low-dimensional model from the randomly sketched data. For each method, we use three randomized reduction schemes, namely random Gaussian projection (RG), random hashing (RH) and random sampling (RS). Note that comparing with RS, which in general does not satisfy the JL lemma, allows us to verify that the JL property is very important for maintaining good performance. The loss function is fixed to the hinge loss, the regularization parameter is set to \(10^{-5}\), and the sparse regularization parameter in DSRR is set to 0.9. We test on two data sets, namely RCV1 used in the first experiment and the Splice site data (Sonnenburg and Franc 2010). For the Splice site data, we use a subset that contains \(10^6\) training examples, 4,627,840 testing examples and 12,495,340 features, and we evaluate the classification performance by the area under the precision-recall curve (auPRC) as in Sonnenburg and Franc (2010), because the data is highly imbalanced. We show the results in Fig. 3 for two values of the reduced dimensionality, \(m=500\) and \(m=1000\). From the results we can see that (i) RS usually performs worse than RG and RH, which verifies that satisfying the JL property is very important for the randomized reduction methods; (ii) there is no clear winner between RG and RH; however, RH is much more efficient than RG; (iii) DSRR performs similarly to RPDR on the RCV1 data and better than RPDR on the Splice site data, implying that DSRR is preferable to RPDR; (iv) both DSRR and RPDR perform better than RP, verifying that recovering an accurate high-dimensional model benefits the prediction performance.

Fig. 3 Classification performance of different randomized methods on RCV1 and Splice data

Runtime Speedup by Distributed Learning: Although in some cases the recovered solution or the solution learned in the reduced space can provide sufficiently good performance, it usually performs worse than the optimal solution of the original problem, and sometimes the performance gap between them cannot be ignored, as seen in the following experiments. To address this issue, we combine the benefits of distributed learning and the proposed randomized reduction methods for solving big data problems. When the data is too large and sits on multiple machines, distributed learning can be employed to solve the optimization problem. In distributed learning, individual machines iteratively solve sub-problems associated with the subset of data stored on them and communicate some global variables (e.g., the primal solution \({\mathbf {w}}\in {\mathbb {R}}^d\)) among them. When the dimensionality d is very large, the total communication cost could be very high. To reduce the total communication cost, we propose to first solve the reduced data problem and then use the found solution as the initial solution for distributed learning on the original data.

Below, we demonstrate the effectiveness of DSRR for the recently proposed distributed stochastic dual coordinate ascent (DisDCA) algorithm (Yang 2013). The procedure is: (1) reduce the original high-dimensional data to a very low-dimensional space on individual machines; (2) use DisDCA to solve the reduced problem; (3) use the optimal dual solution of the reduced problem as an initial solution for DisDCA on the original problem. We record the running time for randomized reduction in step 1, the optimization of the reduced problem in step 2, and the optimization of the original problem in step 3. We compare the performance of four methods: (i) the method that only uses the low-dimensional model learned by solving the reduced problem (7) with DisDCA, referred to as DSRR-ld; (ii) the method that uses the model recovered in the original space from the dual solution of the reduced problem (7) solved with DisDCA, referred to as DSRR-hd; (iii) the method that uses the dual solution of the reduced problem as an initial solution of DisDCA and runs it on the original problem with \(k=1\) or 2 communications (the number of updates before each communication is set to the number of examples on each machine), referred to as DSRR-DisDCA-k; and (iv) the distributed method that directly solves the original dual problem by DisDCA. For DisDCA on the original problem, we stop running when its performance on the testing data does not improve. Two data sets are used, namely RCV1-binary and the KDD 2010 Cup data. For the KDD 2010 Cup data, we use the version available on the LibSVM data website. The RCV1 data is evenly distributed over 5 machines and the KDD data is evenly distributed over 10 machines. The results averaged over 5 trials are shown in Fig. 4. They show that the performance of DSRR-DisDCA-1/2 is remarkable in the sense that it achieves almost the same performance as directly training on the original data (DisDCA) while using much less training time, about a 5-fold speedup. In addition, DSRR-DisDCA performs much better than DSRR-ld and has a small computational overhead.

Fig. 4 Top: testing error for different methods. Bottom: training time for different methods. The value of \(\lambda =10^{-5}\) and the value of \(\tau =0.9\). The high-dimensional features are reduced to an \(m=1024\)-dimensional space using random hashing. The loss function is the squared hinge loss

9.2 SLSR

In this subsection, we present numerical experiments for SLSR. We conduct experiments on two datasets, a synthetic dataset and a real dataset. The synthetic data is generated similarly to previous studies on sparse signal recovery (Xiao and Zhang 2013). In particular, we generate a random matrix \(X\in {\mathbb {R}}^{n\times d}\) with \(n = 10^4\) and \(d = 10^5\). The entries of the matrix X are generated independently from the uniform distribution over the interval \([-1,+1]\). A sparse vector \({\mathbf {u}}_* \in {\mathbb {R}}^d\) is generated with the same distribution at 100 randomly chosen coordinates. The noise \(\xi \in {\mathbb {R}}^n\) is a dense vector with independent random entries from the uniform distribution over the interval \([-\sigma , \sigma ]\), where \(\sigma \) is the noise magnitude and is set to 0.1. We scale the data matrix X such that all entries have a variance of 1/n and scale the noise vector \(\xi \) accordingly. Finally, the vector \({\mathbf {y}}\) is obtained as \({\mathbf {y}}= X{\mathbf {u}}_* + \xi \). For the elastic net on the synthetic data, we try two different values of \(\lambda \), namely \(10^{-8}\) and \(10^{-5}\). The value of \(\gamma \) is set to \(10^{-5}\) for both the elastic net and the lasso. Note that these values are not intended to optimize the performance of the elastic net and the lasso on the synthetic data. The real data used in the experiment is the E2006-tfidf dataset. We use the version available on the LibSVM website. There are a total of \(n=16,087\) training instances, \(d=150,360\) features and 3308 testing instances. We normalize the training data such that each dimension has mean zero and variance 1/n. The testing data is normalized using the statistics computed on the training data. For the JL transform, we use the random matrix with one block of random hashing. The experimental results on the synthetic data under different settings are shown in Fig. 5. In the left plot, we compare the recovery error for the elastic net with \(\lambda =10^{-8}\) and two different values of m, i.e., \(m=1000\) and \(m=2000\). The horizontal axis is the value of \(\tau \), the added regularization parameter. We can observe that adding a slightly larger additional \(\ell _1\) norm to the compressed data problem indeed reduces the recovery error. When the value of \(\tau \) is larger than some threshold, the error increases, which is consistent with our theoretical results. In particular, we can see that the threshold value for \(m=2000\) is smaller than that for \(m=1000\). In the middle plot, we compare the recovery error for the elastic net with \(m=1000\) and two different values of the regularization parameter \(\lambda \). Similar trends of the recovery error versus \(\tau \) are observed. In addition, it is interesting to see that the recovery error for \(\lambda =10^{-8}\) is less than that for \(\lambda =10^{-5}\), which at first glance seems to contradict the theoretical results due to the explicit inverse dependence on \(\lambda \). However, the recovery error bound also depends on \(\Vert {\mathbf {e}}\Vert _2\), which measures the empirical error of the corresponding optimal model. We find that with \(\lambda =10^{-8}\) we have a smaller \(\Vert {\mathbf {e}}\Vert _2=0.95\) compared to 1.34 with \(\lambda =10^{-5}\), which explains the result in the middle plot. In the right plot, we repeat the same experiments for the lasso as in the left plot for the elastic net, and observe similar results.
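For reference, the following is a minimal sketch of the synthetic data generation described above, using smaller sizes so that it runs quickly and one plausible reading of "scaled accordingly" for the noise; it is illustrative only, not the exact script used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
# smaller sizes than in the paper (n = 10^4, d = 10^5, 100 nonzeros)
n, d, k, sigma = 1000, 5000, 20, 0.1

X = rng.uniform(-1.0, 1.0, size=(n, d)) * np.sqrt(3.0 / n)   # Var(U[-1,1]) = 1/3, so entries get variance 1/n
u_star = np.zeros(d)
support = rng.choice(d, size=k, replace=False)
u_star[support] = rng.uniform(-1.0, 1.0, size=k)             # sparse ground truth
xi = rng.uniform(-sigma, sigma, size=n) * np.sqrt(3.0 / n)   # dense noise, scaled like X (assumed)
y = X @ u_star + xi
```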

Fig. 5 Recovery error of elastic net and lasso under different settings on the synthetic data

Fig. 6 Recovery error and regression error of lasso under different settings on E2006-tfidf

The experimental results on the E2006-tfidf dataset for the lasso are shown in Fig. 6. In the left plot, we show the root mean square error (RMSE) on the testing data of models learned from the original data with different values of \(\gamma \). In the middle and right plots, we fix \(\gamma =10^{-4}\), increase the value of \(\tau \), and plot the relative recovery error and the RMSE on the testing data. Again, the empirical results are consistent with the theoretical results and verify that, with randomly sketched data, a larger \(\ell _1\) regularizer can yield better performance.

10 Conclusions

In this paper, we have developed a new theory of high-dimensional model recovery from randomly sketched data. We have studied two important problems in machine learning, i.e., classification and sparse least-squares regression. We have proposed to explore the intrinsic sparsity of the problems, developed new formulations on the sketched data, and provided a novel analysis of the recovery error. The numerical experiments validate our theoretical analysis and also demonstrate that the proposed randomized methods compare favorably with previous randomized algorithms.