Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces
Introduction
Let the input space $H$ be a separable Hilbert space with inner product denoted by $\langle\cdot,\cdot\rangle_H$, and let the output space be $\mathbb{R}$. Let $\rho$ be an unknown probability measure on $H \times \mathbb{R}$, $\rho_X$ the induced marginal measure on $H$, and $\rho(\cdot|x)$ the conditional probability measure on $\mathbb{R}$ with respect to $x \in H$ and $\rho$. Let the hypothesis space be $H_\rho = \{f : H \to \mathbb{R} \mid \exists\, \omega \in H \text{ with } f(x) = \langle\omega, x\rangle_H, \ \rho_X\text{-almost surely}\}$. The goal of least-squares regression is to approximately solve the following expected risk minimization,
$$\inf_{f \in H_\rho} \mathcal{E}(f), \qquad \mathcal{E}(f) = \int_{H \times \mathbb{R}} (f(x) - y)^2 \, d\rho(x, y), \tag{1}$$
where the measure $\rho$ is known only through a sample $\mathbf{z} = \{z_i = (x_i, y_i)\}_{i=1}^n$ of size $n \in \mathbb{N}$, independently and identically distributed according to $\rho$. Let $L^2_{\rho_X}$ be the Hilbert space of square-integrable functions from $H$ to $\mathbb{R}$ with respect to $\rho_X$, with its norm given by $\|f\|_\rho = \big(\int_H |f(x)|^2 \, d\rho_X(x)\big)^{1/2}$. The function that minimizes the expected risk over all measurable functions is the regression function [6], [27], defined as
$$f_\rho(x) = \int_{\mathbb{R}} y \, d\rho(y|x), \qquad x \in H.$$
Throughout this paper, we assume that the support of $\rho_X$ is compact and there exists a constant $\kappa \in [1, \infty)$, such that
$$\langle x, x' \rangle_H \le \kappa^2, \qquad \forall\, x, x' \in H, \ \rho_X\text{-almost surely}.$$
Under this assumption, $H_\rho$ is a subspace of $L^2_{\rho_X}$, and a solution $f_H$ for (1) is the projection of the regression function $f_\rho$ onto the closure of $H_\rho$ in $L^2_{\rho_X}$. See e.g., [14], [1], or Section 2 for further details.
The above problem arises in non-parametric regression with kernel methods [6], [27] and is closely related to functional regression [20]. A common and classic approach to this problem is based on spectral algorithms. These amount to solving an empirical linear equation in which, to avoid over-fitting and to ensure good performance, a filter function for regularization is involved; see e.g., [1], [10]. Such approaches include ridge regression, principal component regression, gradient methods, and iterated ridge regression.
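To make the filter-function viewpoint concrete, the following minimal Python/NumPy sketch (our own illustration; the function names, default constants, and step-size choice are assumptions, not taken from the paper) shows how these four methods arise from different choices of the filter $g_\lambda$ applied to an eigenvalue $\sigma$ of the empirical operator:

```python
import numpy as np

# Illustrative filter functions g_lambda(sigma) acting on an eigenvalue
# sigma > 0 of the empirical operator; names and constants are ours.

def tikhonov(sigma, lam):
    # Ridge regression: g_lambda(sigma) = 1 / (sigma + lambda).
    return 1.0 / (sigma + lam)

def iterated_tikhonov(sigma, lam, m=3):
    # m-fold iterated ridge: g_lambda(sigma) = (1 - (lambda/(sigma+lambda))^m) / sigma.
    return (1.0 - (lam / (sigma + lam)) ** m) / sigma

def landweber(sigma, lam, eta=1.0):
    # Gradient method with t ~ 1/lambda iterations (assumes eta * sigma <= 1):
    # g_lambda(sigma) = (1 - (1 - eta * sigma)^t) / sigma.
    t = max(int(1.0 / lam), 1)
    return (1.0 - (1.0 - eta * sigma) ** t) / sigma

def spectral_cutoff(sigma, lam):
    # Principal component regression: invert only eigenvalues above the cutoff.
    return np.where(sigma >= lam, 1.0 / np.maximum(sigma, lam), 0.0)
```

Each filter approximates $\sigma \mapsto 1/\sigma$ on the upper part of the spectrum while damping small eigenvalues, which is what controls over-fitting; the filters differ in their qualification, i.e., in how much smoothness of the target they can exploit.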
A large amount of research has been carried out on spectral algorithms within the setting of learning with kernel methods; see e.g., [26], [5] for Tikhonov regularization, [33], [31] for gradient methods, and [4], [1] for general spectral algorithms. Statistical results have been developed in these references, but they are still not satisfactory. For example, most of the previous results are restricted either to the case that the space $H_\rho$ is universal (i.e., $H_\rho$ is dense in $L^2_{\rho_X}$) [26], [31], [4] or to the attainable case (i.e., $f_H \in H_\rho$) [5], [1]. Also, some of these results require the unnatural assumption that the sample size is large enough, and the derived convergence rates tend to be (capacity-dependently) suboptimal in the non-attainable case. Finally, it is still unclear whether one can derive capacity-dependently optimal convergence rates for spectral algorithms under a general source assumption.
In this paper, we study statistical results for spectral algorithms. Considering a capacity assumption on the space $H$ [32], [5] and a general source condition [1] on the target function $f_H$, we show high-probability, optimal convergence results in terms of a variety of norms for spectral algorithms. As a corollary, we obtain almost-sure convergence results with optimal rates. The general source condition is used to characterize the regularity/smoothness of the target function in $L^2_{\rho_X}$, rather than in $H_\rho$ as in [5], [1]. The derived convergence rates are optimal in a minimax sense. Our results not only resolve the issues mentioned in the last paragraph but also generalize previous results to convergence in different norms and to a more general source condition.
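For orientation, typical forms of these two assumptions (our paraphrase of the standard conditions from the kernel-regression literature, e.g. [5], [1]; the paper's precise statements appear in Section 4) read as follows:

```latex
% Capacity assumption: the effective dimension of the covariance operator T
% decays polynomially in the regularization parameter lambda,
\mathcal{N}(\lambda) := \operatorname{tr}\bigl( T (T + \lambda I)^{-1} \bigr)
    \le c_\gamma \, \lambda^{-\gamma}, \qquad 0 < \gamma \le 1 .
% General source condition: the target f_H is smooth as measured by an
% index function phi applied to the integral operator L on L^2_{rho_X},
f_H = \phi(L)\, g_0, \qquad \|g_0\|_\rho \le R .
```

The classical Hölder source condition corresponds to the special choice $\phi(\sigma) = \sigma^\zeta$ for some $\zeta \ge 0$.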
Section snippets
Learning with kernel methods and notations
In this section, we first introduce supervised learning with kernel methods, which is a special instance of the learning setting considered in this paper. We then introduce some useful notations and auxiliary operators.
Learning with kernel methods. Let Ξ be a closed subset of a Euclidean space $\mathbb{R}^d$. Let μ be an unknown but fixed Borel probability measure on $\Xi \times \mathbb{R}$. Assume that $\{(\xi_i, y_i)\}_{i=1}^n$ are i.i.d. from the distribution μ. A reproducing kernel $K$ is a symmetric function $K : \Xi \times \Xi \to \mathbb{R}$ such that the kernel matrix $(K(\xi_i, \xi_j))_{i,j=1}^\ell$ is positive semidefinite for any finite set of points $\{\xi_i\}_{i=1}^\ell \subseteq \Xi$.
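As a toy illustration (our own; the Gaussian kernel and the numerical check below are assumptions, not part of the paper's text), the defining positive-semidefiniteness property can be verified numerically for a finite set of points:

```python
import numpy as np

def gaussian_kernel(u, v, gamma=1.0):
    # Gaussian RBF kernel K(u, v) = exp(-gamma * ||u - v||^2); symmetric by construction.
    return np.exp(-gamma * np.sum((u - v) ** 2))

rng = np.random.default_rng(0)
pts = rng.normal(size=(6, 3))  # six points in R^3

# Kernel (Gram) matrix; K is a reproducing kernel iff this matrix is
# positive semidefinite for every finite set of points.
G = np.array([[gaussian_kernel(a, b) for b in pts] for a in pts])
assert np.allclose(G, G.T)
print(np.linalg.eigvalsh(G).min() >= -1e-10)  # all eigenvalues nonnegative
```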
Spectral/regularized algorithms
In this section, we introduce spectral algorithms.
The search for an approximate solution in $L^2_{\rho_X}$ for Problem (1) is equivalent to the search for an approximate solution in $H$ for
$$\inf_{\omega \in H} \tilde{\mathcal{E}}(\omega), \qquad \tilde{\mathcal{E}}(\omega) = \int_{H \times \mathbb{R}} (\langle \omega, x \rangle_H - y)^2 \, d\rho(x, y). \tag{10}$$
As the expected risk cannot be computed exactly and can only be approximated through the empirical risk $\tilde{\mathcal{E}}_{\mathbf{z}}$, defined as
$$\tilde{\mathcal{E}}_{\mathbf{z}}(\omega) = \frac{1}{n} \sum_{i=1}^n (\langle \omega, x_i \rangle_H - y_i)^2,$$
a first idea to deal with the problem is to replace the objective function in (10) with the empirical risk.
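The following finite-dimensional Python sketch (ours, with $H = \mathbb{R}^d$; the paper's spectral algorithms act on a general separable $H$) implements this recipe: form the empirical covariance $T_{\mathbf{x}} = \frac{1}{n}\sum_i x_i \otimes x_i$ and apply a filter function to its spectrum in place of an exact inverse:

```python
import numpy as np

def tikhonov(sigma, lam):
    # Ridge filter g_lambda(sigma) = 1/(sigma + lambda), as in the earlier sketch.
    return 1.0 / (sigma + lam)

def spectral_estimator(X, y, lam, filter_fn):
    # X: (n, d) inputs in H = R^d, y: (n,) outputs. The empirical linear
    # equation is T_x w = b with T_x = X^T X / n and b = X^T y / n; a
    # spectral algorithm returns w = g_lambda(T_x) b instead of T_x^{-1} b.
    n = X.shape[0]
    T = X.T @ X / n
    b = X.T @ y / n
    sigma, U = np.linalg.eigh(T)                   # spectrum of T_x
    g = filter_fn(np.clip(sigma, 0.0, None), lam)  # filtered inverse eigenvalues
    return U @ (g * (U.T @ b))                     # w = U g(Sigma) U^T b

# Hypothetical usage on synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.ones(5) + 0.1 * rng.normal(size=200)
w_hat = spectral_estimator(X, y, lam=0.05, filter_fn=tikhonov)
```

Swapping `filter_fn` among the filters sketched in the introduction recovers ridge regression, principal component regression, gradient methods, and iterated ridge regression within the same template.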
Convergence results
In this section, we first introduce some basic assumptions and then present convergence results for spectral algorithms.
Proofs
In this section, we prove the results stated in Section 4. We first give some basic lemmas, and then give the proof of the main results.
Acknowledgements
JL and VC's work was supported in part by Office of Naval Research (ONR) under grant agreement number N62909-17-1-2111, in part by Hasler Foundation Switzerland under grant agreement number 16066, and in part by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement number 725594).
References (33)
- On regularization algorithms in learning theory, J. Complexity (2007)
- Double operator integrals in a Hilbert space, Integral Equations Operator Theory (2003)
- Optimal rates for regularization of statistical inverse learning problems
- Optimal learning rates for regularization operators in learning theory (2006)
- Optimal rates for the regularized least-squares algorithm, Found. Comput. Math. (2007)
- Learning Theory: An Approximation Theory Viewpoint (2007)
- Kernel ridge vs. principal component regression: minimax bounds and the qualification of regularization operators, Electron. J. Stat. (2017)
- Regularization of Inverse Problems (1996)
- Norm inequalities equivalent to Heinz inequality, Proc. Amer. Math. Soc. (1993)
- Spectral algorithms for supervised learning, Neural Comput. (2008)
- Learning theory of distributed spectral algorithms, Inverse Probl.
- Random design analysis of ridge regression, Found. Comput. Math.
- Optimal rates for multi-pass stochastic gradient methods, J. Mach. Learn. Res.
- Distributed learning with regularized least squares, J. Mach. Learn. Res.
- Regularization of some linear ill-posed problems with discretized random noisy data, Math. Comp.
Cited by (77)
- Estimates on learning rates for multi-penalty distribution regression, Applied and Computational Harmonic Analysis (2024)
- Spectral algorithms for learning with dependent observations, Journal of Computational and Applied Mathematics (2024)
- Spectrally transformed kernel regression, arXiv (2024)