Abstract
Gaussians are a family of Mercer kernels which are widely used in machine learning and statistics. The variance of a Gaussian kernel reflects the specific structure of the reproducing kernel Hilbert space (RKHS) induced by the Gaussian, as well as other important features of learning problems such as the frequency of function components. As the variance of the Gaussian decreases, the learning performance and the approximation ability improve. This paper introduces an unregularized online algorithm with decreasing Gaussian variances, in which no regularization term is imposed and the samples are presented in sequence. With appropriate step sizes, concrete learning rates are derived under smoothness assumptions on the target function, which are used to bound the approximation error. Additionally, a new type of geometric noise condition is proposed to estimate the approximation error without any smoothness assumption. It is more general than the condition in Steinwart et al. (Ann Stat 35(2):575–607, 2007), which is suitable only for the hinge loss. An essential step is to bound the difference between the approximation functions generated by RKHSs with varying Gaussian variances. The Fourier transform plays a crucial role in our analysis.
References
Aronszajn, N.: Theory of reproducing kernels. Trans. Am. Math. Soc. 68(3), 337–404 (1950)
Cesa-Bianchi, N., Conconi, A., Gentile, C.: On the generalization ability of on-line learning algorithms. IEEE Trans. Inf. Theory 50(9), 2050–2057 (2004)
Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S.: Choosing multiple parameters for support vector machines. Mach. Learn. 46(1–3), 131–159 (2002)
Chen, D.R., Wu, Q., Ying, Y., Zhou, D.-X.: Support vector machine soft margin classifiers: error analysis. J. Mach. Learn. Res. 5, 1143–1175 (2004)
Dieuleveut, A., Bach, F.: Nonparametric stochastic approximation with large step-sizes. Ann. Stat. 44(4), 1363–1399 (2016)
Eberts, M., Steinwart, I.: Optimal learning rates for least squares SVMs using Gaussian kernels. In: Advances in Neural Information Processing Systems, pp. 1539–1547 (2011)
Evgeniou, T., Pontil, M.: Regularized multi-task learning. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 109–117. ACM (2004)
Hang, H., Steinwart, I.: Optimal learning with anisotropic Gaussian SVMs. arXiv preprint arXiv:1810.02321 (2018)
Lanckriet, G.R., Cristianini, N., Bartlett, P., Ghaoui, L.E., Jordan, M.I.: Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res. 5(Jan), 27–72 (2004)
Lei, Y., Shi, L., Guo, Z.-C.: Convergence of unregularized online learning algorithms. J. Mach. Learn. Res. 18(1), 6269–6301 (2017)
Lin, J., Zhou, D.-X.: Online learning algorithms can converge comparably fast as batch learning. IEEE Trans. Neural Netw. Learn. Syst. 29(6), 2367–2378 (2018)
Matache, M.T., Matache, V.: Hilbert spaces induced by Toeplitz covariance kernels. In: Pasik-Duncan, B. (ed.) Stochastic Theory and Control, pp. 319–333. Springer, Berlin (2002)
Rakhlin, A., Panchenko, D., Mukherjee, S.: Risk bounds for mixture density estimation. ESAIM Probab. Stat. 9, 220–229 (2005)
Smale, S., Yao, Y.: Online learning algorithms. Found. Comput. Math. 6(2), 145–170 (2006)
Smale, S., Zhou, D.-X.: Estimating the approximation error in learning theory. Anal. Appl. 1(01), 17–41 (2003)
Steinwart, I., Hush, D., Scovel, C.: An explicit description of the reproducing kernel Hilbert spaces of Gaussian RBF kernels. IEEE Trans. Inf. Theory 52(10), 4635–4643 (2006)
Steinwart, I., Scovel, C.: Fast rates for support vector machines using Gaussian kernels. Ann. Stat. 35(2), 575–607 (2007)
Tsybakov, A.B.: Optimal aggregation of classifiers in statistical learning. Ann. Stat. 32(1), 135–166 (2004)
Wu, Q., Ying, Y., Zhou, D.-X.: Multi-kernel regularized classifiers. J. Complex. 23(1), 108–134 (2007)
Xiang, D.-H., Zhou, D.-X.: Classification with Gaussians and convex loss. J. Mach. Learn. Res. 10(Jul), 1447–1468 (2009)
Yang, Y.: Minimax nonparametric classification. I. Rates of convergence. IEEE Trans. Inf. Theory 45(7), 2271–2284 (1999)
Ye, G.-B., Zhou, D.-X.: Fully online classification by regularization. Appl. Comput. Harmon. Anal. 23(2), 198–214 (2007)
Ying, Y., Pontil, M.: Online gradient descent learning algorithms. Found. Comput. Math. 8(5), 561–596 (2008)
Ying, Y., Zhou, D.-X.: Online regularized classification algorithms. IEEE Trans. Inf. Theory 52(11), 4775–4788 (2006)
Ying, Y., Zhou, D.-X.: Learnability of Gaussians with flexible variances. J. Mach. Learn. Res. 8(Feb), 249–276 (2007)
Additional information
Communicated by G. Kerkyacharian.
The work described in this paper is partially supported by the National Natural Science Foundation of China [Nos. 11671307 and 11571078], Natural Science Foundation of Hubei Province in China [No. 2017CFB523] and the Fundamental Research Funds for the Central Universities, South-Central University for Nationalities [No. CZY20012].
Appendices
Some Properties of RKHS with Gaussian Kernels
This section presents known results about the Gaussian kernel \(K_\sigma \) and its associated RKHS \(\mathcal H_\sigma \) that are useful for the proofs of our results.
Lemma A.1
[16] For any \(0<\sigma <\tau ,\)
Moreover, for any \(f\in \mathcal{H}_\tau ,\)
Lemma A.2
[12] Let \(\mathcal{H}_{\sigma }({\mathbb {R}}^n)\) be the RKHS with the Gaussian kernel \(K_\sigma (\cdot ,\cdot )\) on \({\mathbb {R}}^n\times {\mathbb {R}}^n.\) Then its associated norm is given for any \(f\in \mathcal{H}_{\sigma }({\mathbb {R}}^n)\) by
where \({{\hat{f}}}(w)\) denotes the Fourier transform of the function f.
Lemma A.3
[1] Let \(\mathcal{X}\) be a non-empty subset of \({\mathbb {R}}^n.\) Then
Elementary Inequalities
Lemma B.1
For \(T\ge 3,\) the following elementary inequalities hold.
Lemma B.2
For \(T\ge 3,\)
Proof
When \(1\le k\le \frac{T}{2},\) we have \(\sum _{t=T-k}^T (t-1)^{-1}\le (T-k-1)^{-1}(k+1)\le 6T^{-1}(k+1),\) and when \(\frac{T}{2}< k\le T-2,\) Lemma B.1 gives \(\sum _{t=T-k}^T (t-1)^{-1}\le 2\log T.\) Combining the two cases yields that
\(\square \)
Lemma B.3
We have for \(T\ge 3,\)
Proof
When \(1\le k\le \frac{T}{2},\) \((T-k-1)^{-1}\le 6 T^{-1}\) and when \(\frac{T}{2}<k<T,\) \(1\le T-k-1< \frac{T}{2}-1.\) Thus, using Lemma B.1, we have that
\(\square \)
Lemma B.4
For any \(q^*>1,\) we have for \(T\ge 3,\)
where \(c_{q^*}\) is a constant independent of T.
Proof
Using Lemma B.1, the proof is similar to that of Lemma B.2 and is therefore omitted. \(\square \)
Useful Lemmas and Proofs in Sect. 4
The following lemma is key to estimating the approximation ability of \(\mathcal{H}_\sigma .\)
Lemma C.1
Let \(\mathcal{X}\) be the closed unit ball in \({\mathbb {R}}^n\) and \(\mathcal{{{\tilde{X}}}}=3\mathcal{X}.\) Let \(f: \mathcal{X}\rightarrow {\mathbb {R}}\) be a measurable function. Define an extension \({{\tilde{f}}} :\mathcal{{{\tilde{X}}}}\rightarrow {\mathbb {R}}\) of f by
Then for any \(x\in \mathcal{X}\) and \(\varepsilon >0,\) we have that
Proof
For any \(x\in \mathcal{X}\) and \(v\in \mathcal{{{\tilde{X}}}}\setminus \mathcal{X},\) a simple calculation shows that \(\left| x-\frac{v}{|v|}\right| ^2< |x-v|^2:\) writing \(v=|v|u\) with \(|u|=1,\) we have \(|x-v|^2-|x-u|^2=(|v|-1)\left( |v|+1-2\langle x,u\rangle \right) >0\) since \(|v|>1\) and \(|x|\le 1.\)
For any \(v\in \mathcal{{{\tilde{X}}}}\setminus \mathcal{X},\) since \({{\tilde{f}}}(v)=f\left( \frac{v}{|v|}\right) ,\) we have \(|{\tilde{f}}(x)-{\tilde{f}}(v)|=\left| f(x)-f\left( \frac{v}{|v|}\right) \right| \) and
Notice the fact that for any \(x\in \mathcal{X},\)
or
Combining the estimates above completes the proof. \(\square \)
Next, we will bound the error caused by the varying Gaussians.
Lemma C.2
Define \({{\tilde{f}}}_{\sigma _t}\) by (45) and \(f_{\sigma _t}:={{\tilde{f}}}_{\sigma _t}|_\mathcal{X}\). If the variances \(\{\sigma _t,t\in {\mathbb {N}}\}\) decrease polynomially with exponent \(0<\beta <1,\) then
and
where \(c_\beta \) is given in the proof of Lemma 1.
Proof
Notice that the Fourier transform satisfies \(\hat{{\tilde{f}}}_{\sigma }(w)=\hat{{\tilde{K}}}_\sigma (w)\hat{{\tilde{f}}}^\phi _\rho (w)=\exp \left\{ -\frac{\sigma ^2|w|^2}{2}\right\} \hat{{\tilde{f}}}^\phi _\rho (w).\) With Lemmas A.2 and A.3, by a procedure similar to the proof of Lemma 1, the conclusions (50) and (51) follow. \(\square \)
With the help of the above lemmas, we can prove our convergence rate in Sect. 4.
Proof of Theorem 5
We shall prove the conclusions of Theorem 5 by means of Theorem 1. By (51) and (46), we have
with \(\mathcal{A}_1=C_{n,\zeta ,q}\) and \(\mathcal{B}_1=\frac{2c_\beta +c_\beta ^2}{\sqrt{2}(\sqrt{2\pi })^{n/2}}\Vert {\tilde{f}}^\phi _\rho \Vert _{L^2({\mathbb {R}}^n)}.\) Taking \(\tau =\frac{1+\epsilon }{1-\frac{n\beta }{2}},\) by (50) and (31), we have that
where \(C^{**}=4C_{\tau ,q,\eta }\left( 1+\frac{2\eta \mathcal{A}_1+\mathcal{B}_1^\tau +\mathcal{B}_1^{2-\tau }}{\max \{1-\theta -\zeta \beta (1+\zeta )^{-1},n\beta +\epsilon \}} +\frac{q^*}{q^*-1}\frac{\Vert {\tilde{f}}^\phi _\rho \Vert ^2_{L^2({\mathbb {R}}^n)}}{(\sqrt{2\pi })^{n}} \right) +\frac{2\Vert {\tilde{f}}^\phi _\rho \Vert ^2_{L^2({\mathbb {R}}^n)}}{(\sqrt{2\pi })^{n}}.\) Putting the estimates above into (7), following the same proof procedure of Theorem 2, we can get the desired conclusion with
\(\square \)
Cite this article
Wang, B., Hu, T. Unregularized Online Algorithms with Varying Gaussians. Constr Approx 53, 403–440 (2021). https://doi.org/10.1007/s00365-021-09536-3
Keywords
- Online learning
- Varying Gaussian kernels
- Reproducing kernel Hilbert spaces
- Geometric noise condition
- Learning rate