Generalization error of random features and kernel methods: Hypercontractivity and kernel matrix concentration
Applied and Computational Harmonic Analysis (IF 2.5). Pub Date: 2021-12-17. DOI: 10.1016/j.acha.2021.12.003. Song Mei 1, Theodor Misiakiewicz 2, Andrea Montanari 2,3
Consider the classical supervised learning problem: we are given data $(y_i, x_i)$, $i \le n$, with $y_i$ a response and $x_i$ a covariates vector, and try to learn a model $f$ to predict future responses. Random features methods map the covariates vector $x_i$ to a point $\phi(x_i)$ in a higher-dimensional space $\mathbb{R}^N$, via a random featurization map $\phi$. We study the use of random features methods in conjunction with ridge regression in the feature space $\mathbb{R}^N$. This can be viewed as a finite-dimensional approximation of kernel ridge regression (KRR), or as a stylized model for neural networks in the so-called lazy training regime.
We define a class of problems satisfying certain spectral conditions on the underlying kernels, and a hypercontractivity assumption on the associated eigenfunctions. These conditions are verified by classical high-dimensional examples. Under these conditions, we prove a sharp characterization of the error of random features ridge regression. In particular, we address two fundamental questions: (1) What is the generalization error of KRR? (2) How large must $N$ be for the random features approximation to achieve the same error as KRR?
In this setting, we prove that KRR is well approximated by a projection onto the top $\ell$ eigenfunctions of the kernel, where $\ell$ depends on the sample size $n$. We show that the test error of random features ridge regression is dominated by its approximation error and is larger than the error of KRR as long as $N \le n^{1-\delta}$ for some $\delta > 0$. We characterize this gap. For $N \ge n^{1+\delta}$, random features achieve the same error as the corresponding KRR, and further increasing $N$ does not lead to a significant change in test error.
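The qualitative picture above (random features error decreasing in $N$ and plateauing at the KRR error) can be observed numerically. The sketch below uses random Fourier features, whose limiting kernel is the Gaussian kernel $\exp(-\gamma \|x - x'\|^2)$; the target function, bandwidth, and sample sizes are illustrative assumptions, not the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma, lam = 200, 3, 0.5, 1e-2

# synthetic train/test data; the target f(x) = sin(<w, x>) is an arbitrary choice
Xtr, Xte = rng.standard_normal((n, d)), rng.standard_normal((n, d))
w = rng.standard_normal(d)
ytr, yte = np.sin(Xtr @ w), np.sin(Xte @ w)

def gauss_kernel(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# kernel ridge regression: the N -> infinity limit of the random features model
alpha = np.linalg.solve(gauss_kernel(Xtr, Xtr) + lam * np.eye(n), ytr)
err_krr = np.mean((gauss_kernel(Xte, Xtr) @ alpha - yte) ** 2)

def rf_ridge_error(N):
    # random Fourier features approximating exp(-gamma ||x - x'||^2)
    W = rng.standard_normal((N, d)) * np.sqrt(2 * gamma)
    b = rng.uniform(0, 2 * np.pi, N)
    feats = lambda X: np.sqrt(2.0 / N) * np.cos(X @ W.T + b)
    Phi = feats(Xtr)
    a = np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ ytr)
    return np.mean((feats(Xte) @ a - yte) ** 2)

for N in (10, 100, 2000):
    print(f"N={N}: RF test error {rf_ridge_error(N):.4f} (KRR: {err_krr:.4f})")
```

For small $N$ the random features error is dominated by the approximation error and exceeds the KRR error; once $N$ is well above $n$, the two errors are close and further increasing $N$ changes little, in line with the $N \le n^{1-\delta}$ versus $N \ge n^{1+\delta}$ dichotomy.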