Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces

https://doi.org/10.1016/j.acha.2018.09.009

Abstract

In this paper, we study regression problems over a separable Hilbert space with the square loss, covering non-parametric regression over a reproducing kernel Hilbert space. We investigate a class of spectral/regularized algorithms, including ridge regression, principal component regression, and gradient methods. We prove optimal, high-probability convergence results with respect to a variety of norms for the studied algorithms, considering a capacity assumption on the hypothesis space and a general source condition on the target function. Consequently, we obtain almost-sure convergence results with optimal rates. Our results improve and generalize previous results, filling a theoretical gap for the non-attainable cases.

Introduction

Let the input space $H$ be a separable Hilbert space with inner product denoted by $\langle\cdot,\cdot\rangle_H$, and let the output space be $\mathbb{R}$. Let $\rho$ be an unknown probability measure on $H\times\mathbb{R}$, $\rho_X(\cdot)$ the induced marginal measure on $H$, and $\rho(\cdot\,|\,x)$ the conditional probability measure on $\mathbb{R}$ with respect to $x\in H$ and $\rho$. Let the hypothesis space be $H_\rho=\{f:H\to\mathbb{R}\mid \exists\,\omega\in H \text{ with } f(x)=\langle\omega,x\rangle_H,\ \rho_X\text{-almost surely}\}$. The goal of least-squares regression is to approximately solve the following expected risk minimization,
$$\inf_{f\in H_\rho}\mathcal{E}(f),\qquad \mathcal{E}(f)=\int_{H\times\mathbb{R}}(f(x)-y)^2\,d\rho(x,y),\tag{1}$$
where the measure $\rho$ is known only through a sample $\mathbf{z}=\{z_i=(x_i,y_i)\}_{i=1}^n$ of size $n\in\mathbb{N}$, independently and identically distributed according to $\rho$. Let $L^2_{\rho_X}$ be the Hilbert space of square-integrable functions from $H$ to $\mathbb{R}$ with respect to $\rho_X$, with norm $\|f\|_\rho=\big(\int_H |f(x)|^2\,d\rho_X\big)^{1/2}$. The function that minimizes the expected risk over all measurable functions is the regression function [6], [27], defined as
$$f_\rho(x)=\int_{\mathbb{R}} y\,d\rho(y\,|\,x),\qquad x\in H,\ \rho_X\text{-almost everywhere}.$$
Throughout this paper, we assume that the support of $\rho_X$ is compact and that there exists a constant $\kappa\in[1,\infty[$ such that
$$\langle x,x'\rangle_H\le\kappa^2,\qquad \forall x,x'\in H,\ \rho_X\text{-almost everywhere}.$$
Under this assumption, $H_\rho$ is a subspace of $L^2_{\rho_X}$, and a solution $f_H$ of (1) is the projection of the regression function $f_\rho$ onto the closure of $H_\rho$ in $L^2_{\rho_X}$. See e.g., [14], [1], or Section 2 for further details.
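For completeness, the inclusion $H_\rho\subseteq L^2_{\rho_X}$ claimed above follows from a short Cauchy–Schwarz argument, sketched here under the stated boundedness assumption (a standard step, not quoted from the paper): for $f(x)=\langle\omega,x\rangle_H$ with $\omega\in H$,
$$|f(x)|=|\langle\omega,x\rangle_H|\le\|\omega\|_H\,\|x\|_H\le\kappa\,\|\omega\|_H\qquad\rho_X\text{-almost surely},$$
and hence
$$\|f\|_\rho^2=\int_H|f(x)|^2\,d\rho_X(x)\le\kappa^2\,\|\omega\|_H^2<\infty.$$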

The above problem arises in non-parametric regression with kernel methods [6], [27] and is closely related to functional regression [20]. A common and classical approach to this problem is based on spectral algorithms: one solves an empirical linear equation in which, to avoid over-fitting and to ensure good performance, a filter function for regularization is involved; see e.g., [1], [10]. Such approaches include ridge regression, principal component regression, gradient methods, and iterated ridge regression; typical filter functions are sketched below.
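As an informal illustration (our own sketch, not code from the paper), the following snippet shows the filter functions $g_\lambda(\sigma)$ commonly associated with the methods just listed, applied to an eigenvalue $\sigma$ of the empirical covariance or kernel operator; the function names, the cut-off rule, and the step-size/iteration choices are assumptions made here for concreteness.

```python
# Illustrative filter functions g_lambda(sigma) for spectral algorithms.
# These are standard textbook forms; names and defaults are ours, not the paper's.
import numpy as np

def ridge_filter(sigma, lam):
    # Tikhonov/ridge regression: g_lambda(sigma) = 1 / (sigma + lambda).
    return 1.0 / (sigma + lam)

def pcr_filter(sigma, lam):
    # Principal component regression (spectral cut-off): invert only the
    # eigenvalues at or above the threshold lambda, discard the rest.
    sigma = np.asarray(sigma, dtype=float)
    return np.where(sigma >= lam, 1.0 / np.where(sigma >= lam, sigma, 1.0), 0.0)

def gradient_filter(sigma, lam, step=1.0):
    # Gradient methods (Landweber iteration) run for t ~ 1/lambda steps:
    # g_lambda(sigma) = sum_{k=0}^{t-1} step * (1 - step * sigma)^k
    #                 = (1 - (1 - step * sigma)^t) / sigma   for sigma > 0.
    sigma = np.asarray(sigma, dtype=float)
    t = max(int(np.ceil(1.0 / lam)), 1)
    num = 1.0 - (1.0 - step * sigma) ** t
    return np.where(sigma > 0, num / np.where(sigma > 0, sigma, 1.0), step * t)
```

In each case $g_\lambda(\sigma)$ approximates $1/\sigma$ as $\lambda\to 0$, while for $\lambda>0$ the contribution of small eigenvalues is damped, which is the regularization effect.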

A large amount of research has been carried out on spectral algorithms within the setting of learning with kernel methods, see e.g., [26], [5] for Tikhonov regularization, [33], [31] for gradient methods, and [4], [1] for general spectral algorithms. Statistical results have been developed in these references, but they are still not satisfactory. For example, most of the previous results are restricted either to the case that the space $H_\rho$ is universal (i.e., $H_\rho$ is dense in $L^2_{\rho_X}$) [26], [31], [4] or to the attainable case (i.e., $f_H\in H_\rho$) [5], [1]. Also, some of these results require the unnatural assumption that the sample size is large enough, and the derived convergence rates tend to be (capacity-dependently) suboptimal in the non-attainable cases. Finally, it is still unclear whether one can derive capacity-dependently optimal convergence rates for spectral algorithms under a general source assumption.

In this paper, we study statistical results for spectral algorithms. Considering a capacity assumption on the space $H$ [32], [5] and a general source condition [1] on the target function $f_H$, we show high-probability, optimal convergence results with respect to a variety of norms for spectral algorithms. As a corollary, we obtain almost-sure convergence results with optimal rates. The general source condition is used to characterize the regularity/smoothness of the target function $f_H$ in $L^2_{\rho_X}$, rather than in $H_\rho$ as in [5], [1]. The derived convergence rates are optimal in a minimax sense. Our results not only resolve the issues mentioned in the last paragraph, but also generalize previous results to convergence results in different norms and under a more general source condition.

Learning with kernel methods and notations

In this section, we first introduce supervised learning with kernel methods, which is a special instance of the learning setting considered in this paper. We then introduce some useful notations and auxiliary operators.

Learning with kernel methods. Let $\Xi$ be a closed subset of a Euclidean space $\mathbb{R}^d$. Let $\mu$ be an unknown but fixed Borel probability measure on $\Xi\times Y$. Assume that $\{(\xi_i,y_i)\}_{i=1}^n$ are i.i.d. from the distribution $\mu$. A reproducing kernel $K$ is a symmetric function $K:\Xi\times\Xi\to\mathbb{R}$ such that the matrix $(K(u_i,u_j))_{i,j=1}^{\ell}$ is positive semidefinite for any finite set of points $\{u_i\}_{i=1}^{\ell}$ in $\Xi$.
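As a concrete, hedged illustration of this definition (the kernel choice and the check below are our own example, not taken from the paper), the following sketch builds the Gram matrix of a Gaussian kernel on a few random points in $\mathbb{R}^3$ and verifies that it is symmetric and positive semidefinite.

```python
# Minimal sketch: Gaussian kernel on a subset of R^d and a numerical check that
# its Gram matrix (K(u_i, u_j))_{i,j} is symmetric positive semidefinite.
import numpy as np

def gaussian_kernel(u, v, bandwidth=1.0):
    # K(u, v) = exp(-||u - v||^2 / (2 * bandwidth^2))
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(0)
points = rng.standard_normal((8, 3))                 # 8 points in R^3
gram = np.array([[gaussian_kernel(u, v) for v in points] for u in points])

assert np.allclose(gram, gram.T)                     # symmetry
assert np.min(np.linalg.eigvalsh(gram)) > -1e-10     # PSD up to rounding error
```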

Spectral/regularized algorithms

In this section, we motivate and introduce spectral algorithms.

The search for an approximate solution in $H_\rho$ to Problem (1) is equivalent to the search for an approximate solution in $H$ to
$$\inf_{\omega\in H}\tilde{\mathcal{E}}(\omega),\qquad \tilde{\mathcal{E}}(\omega)=\int_{H\times\mathbb{R}}(\langle\omega,x\rangle_H-y)^2\,d\rho(x,y).\tag{10}$$
As the expected risk $\tilde{\mathcal{E}}(\omega)$ cannot be computed exactly and can only be approximated through the empirical risk $\tilde{\mathcal{E}}_{\mathbf{z}}(\omega)$, defined as
$$\tilde{\mathcal{E}}_{\mathbf{z}}(\omega)=\frac{1}{n}\sum_{i=1}^n(\langle\omega,x_i\rangle_H-y_i)^2,$$
a first idea to deal with the problem is to replace the objective function in (10) with the empirical risk; a sketch of the resulting regularized estimators is given below.
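The following sketch (our own illustration, not the paper's implementation) shows the generic form of a filter-based spectral estimator, $\hat\omega_\lambda = g_\lambda(T_{\mathbf x})\,\frac{1}{n}\sum_i y_i x_i$ with $T_{\mathbf x}=\frac{1}{n}\sum_i x_i\otimes x_i$ the empirical covariance operator, specialized to the finite-dimensional case $H=\mathbb{R}^d$; the filter is passed in as an argument, e.g. the ridge filter from the earlier sketch.

```python
# Illustrative spectral estimator for H = R^d (our own code, assumptions as above):
# omega_hat = g_lambda(T_x) b_x, with T_x = (1/n) sum_i x_i x_i^T and
# b_x = (1/n) sum_i y_i x_i.
import numpy as np

def spectral_estimator(X, y, filter_fn, lam):
    n, d = X.shape
    T = X.T @ X / n                                  # empirical covariance operator on R^d
    b = X.T @ y / n                                  # (1/n) sum_i y_i x_i
    evals, evecs = np.linalg.eigh(T)                 # spectral decomposition T = V diag(evals) V^T
    g = filter_fn(np.maximum(evals, 0.0), lam)       # apply the filter to the eigenvalues
    return evecs @ (g * (evecs.T @ b))               # omega_hat = g_lambda(T) b

# Example: with the ridge filter g_lambda(sigma) = 1/(sigma + lambda), this
# recovers ridge regression, omega_hat = (T + lambda I)^{-1} b.
rng = np.random.default_rng(0)
omega_star = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((200, 3))
y = X @ omega_star + 0.1 * rng.standard_normal(200)
omega_hat = spectral_estimator(X, y, lambda s, lam: 1.0 / (s + lam), lam=1e-2)
```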

Convergence results

In this section, we first introduce some basic assumptions and then present convergence results for spectral algorithms.

Proofs

In this section, we prove the results stated in Section 4. We first give some basic lemmas, and then give the proof of the main results.

Acknowledgements

JL and VC's work was supported in part by Office of Naval Research (ONR) under grant agreement number N62909-17-1-2111, in part by Hasler Foundation Switzerland under grant agreement number 16066, and in part by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement number 725594).

References (33)

  • F. Bauer et al., On regularization algorithms in learning theory, J. Complexity (2007)
  • M.S. Birman et al., Double operator integrals in a Hilbert space, Integral Equations Operator Theory (2003)
  • G. Blanchard et al., Optimal rates for regularization of statistical inverse learning problems
  • A. Caponnetto, Optimal learning rates for regularization operators in learning theory (2006)
  • A. Caponnetto et al., Optimal rates for the regularized least-squares algorithm, Found. Comput. Math. (2007)
  • F. Cucker et al., Learning Theory: An Approximation Theory Viewpoint (2007)
  • L.H. Dicker et al., Kernel ridge vs. principal component regression: minimax bounds and the qualification of regularization operators, Electron. J. Stat. (2017)
  • H.W. Engl et al., Regularization of Inverse Problems (1996)
  • J. Fujii et al., Norm inequalities equivalent to Heinz inequality, Proc. Amer. Math. Soc. (1993)
  • L.L. Gerfo et al., Spectral algorithms for supervised learning, Neural Comput. (2008)
  • Z.-C. Guo et al., Learning theory of distributed spectral algorithms, Inverse Probl. (2017)
  • D. Hsu et al., Random design analysis of ridge regression, Found. Comput. Math. (2014)
  • J. Lin, V. Cevher, Optimal convergence for distributed learning with stochastic gradient methods and spectral...
  • J. Lin et al., Optimal rates for multi-pass stochastic gradient methods, J. Mach. Learn. Res. (2017)
  • S.-B. Lin et al., Distributed learning with regularized least squares, J. Mach. Learn. Res. (2017)
  • P. Mathé et al., Regularization of some linear ill-posed problems with discretized random noisy data, Math. Comp. (2006)