Abstract
We consider a general regularised interpolation problem for learning a parameter vector from data. The well-known representer theorem says that under certain conditions on the regulariser there exists a solution in the linear span of the data points. This is at the core of kernel methods in machine learning as it makes the problem computationally tractable. Most literature deals only with sufficient conditions for representer theorems in Hilbert spaces and shows that the regulariser being norm-based is sufficient for the existence of a representer theorem. We prove necessary and sufficient conditions for the existence of representer theorems in reflexive Banach spaces and show that any regulariser has to be essentially norm-based for a representer theorem to exist. Moreover, we illustrate why in a sense reflexivity is the minimal requirement on the function space. We further show that if the learning relies on the linear representer theorem, then the solution is independent of the regulariser and in fact determined by the function space alone. This in particular shows the value of generalising Hilbert space learning theory to Banach spaces.
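The computational tractability mentioned in the abstract can be made concrete in the classical Hilbert-space (RKHS) setting. The following is a minimal numerical sketch, not part of the paper, and all names (`X`, `y`, `lam`, `gauss_kernel`) are illustrative: by the linear representer theorem, kernel ridge regression collapses a minimisation over the whole RKHS to a finite linear system, because the minimiser lies in the span of the data points.

```python
import numpy as np

# Illustrative sketch of the linear representer theorem in an RKHS.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))    # 20 data points in R^3
y = np.sin(X[:, 0])             # arbitrary real targets

def gauss_kernel(A, B, width=1.0):
    """Gaussian kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))

K = gauss_kernel(X, X)
lam = 1e-2

# The representer theorem says the minimiser of
#   sum_i (f(x_i) - y_i)^2 + lam * ||f||^2
# over the whole RKHS has the form f = sum_i c_i k(x_i, .), so the
# infinite-dimensional problem reduces to the linear system (K + lam*I) c = y.
c = np.linalg.solve(K + lam * np.eye(len(y)), y)
print(np.max(np.abs(K @ c + lam * c - y)))   # residual of the normal equations
```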
References
Argyriou, A., Micchelli, C.A., Pontil, M.: When is there a representer theorem? vector versus matrix regularizers. J. Mach. Learn. Res. 10, 2507–2529 (2009)
Asplund, E.: Positivity of duality mappings. Bull. Am. Math. Soc. 73(2), 200–203 (1967)
Blažek, J.: Some remarks on the duality mapping. Acta Univ. Carolinae Math. Phys. 23(2), 15–19 (1982)
Brezis, H.: Functional Analysis, Sobolev Spaces and Partial Differential Equations. Springer, New York (2011). https://doi.org/10.1007/978-0-387-70914-7
Browder, F.E.: Multi-valued monotone nonlinear mappings and duality mappings in Banach spaces. Trans. Am. Math. Soc. 118, 338–351 (1965)
Cox, D.D., O’Sullivan, F.: Asymptotic analysis of penalized likelihood and related estimators. Ann. Statist. 18 (4), 1676–1695 (1990). https://doi.org/10.1214/aos/1176347872
Cucker, F., Smale, S.: On the mathematical foundations of learning. Bull. Am. Math. Soc. 39(1), 1–49 (2001)
Dragomir, S.S.: Semi-inner Products and Applications. Nova Science Publishers (2004)
Kimeldorf, G., Wahba, G.: Some results on Tchebycheffian spline functions. J. Math. Anal. Appl. 33(1), 82–95 (1971). https://doi.org/10.1016/0022-247X(71)90184-3
Lindenstrauss, J., Preiss, D., Tišer, J.: Fréchet Differentiability of Lipschitz Functions and Porous Sets in Banach Spaces. Princeton University Press (2012)
Micchelli, C.A., Pontil, M.: A function representation for learning in Banach spaces. In: Learning Theory (COLT 2004), pp. 255–269 (2004)
Micchelli, C.A., Pontil, M.: Learning the kernel function via regularization. J. Mach. Learn. Res. 6, 1099–1125 (2005)
Phelps, R.: Convex Functions Monotone Operators and Differentiability. Lecture Notes in Mathematics. Springer, Berlin (1993)
Schlegel, K.: When is there a representer theorem? nondifferentiable regularisers and Banach spaces. J. Glob. Optim. (2019). https://doi.org/10.1007/s10898-019-00767-0
Schölkopf, B., Smola, A.J.: Learning with Kernels. MIT Press (2002)
Schölkopf, B., Herbrich, R., Smola, A.J.: A generalized representer theorem. In: Computational Learning Theory, pp. 416–426 (2001)
Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press. https://doi.org/10.1017/CBO9780511809682 (2004)
Smola, A.J., Schölkopf, B.: On a kernel-based method for pattern recognition, regression, approximation, and operator inversion. Algorithmica 22(1), 211–231 (1998). https://doi.org/10.1007/PL00013831
Tropp, J.A.: Just relax: convex programming methods for identifying sparse signals in noise. IEEE Trans. Inf. Theory 52(3), 1030–1051 (2006). https://doi.org/10.1109/TIT.2005.864420
Xu, Y., Ye, Q.: Generalized Mercer Kernels and Reproducing Kernel Banach Spaces. Memoirs of the American Mathematical Society. American Mathematical Society. https://books.google.co.uk/books?id=rd2RDwAAQBAJ (2019)
Zhang, H., Zhang, J.: Regularized learning in Banach spaces as an optimization problem: representer theorems. J. Glob. Optim. 54(2), 235–250 (2012). https://doi.org/10.1007/s10898-010-9575-z
Zhang, H., Xu, Y., Zhang, J.: Reproducing kernel Banach spaces for machine learning. J. Mach. Learn. Res. 10, 2741–2775 (2009)
Communicated by: Russell Luke
Appendices
A Appendix
A.1 Regularised Interpolation
The proof of theorem 1 is largely identical to the one presented in [1] but requires a few minor adjustments to hold in the generality of reflexive Banach spaces. We present the full proof here.
Proof of theorem 1
To prove that Ω is admissible for the regularised interpolation problem (2) we are going to show that Ω is tangentially nondecreasing in the sense of lemma 1 depending on the properties of the space \({\mathscr{B}}\).
Fix \(0 \neq f\in {\mathscr{B}}\) and L ∈ J(f) and let a0 be the unique nonzero minimiser of \(\mathcal {E}(a\nu ,y)\) over \(a\in \mathbb {R}\). For every λ > 0 consider the regularisation problem
By assumption there exist solutions \(f_{\lambda }\in {\mathscr{B}}\) such that
i.e. there exist \(c_{\lambda }\in \mathbb {R}\) such that cλL ∈ J(fλ).
Now fix any \(g\in {\mathscr{B}}\) such that L ∈ J(g) which exists as \({\mathscr{B}}\) is reflexive so J is surjective. We then obtain
where the first inequality follows from a0 minimising \(\mathcal {E}(a\nu ,y)\) and the second inequality from L(g) = ∥L∥2. This shows that Ω(fλ) ≤Ω(g) for all λ and so by hypothesis the set {fλ : λ > 0} is bounded. Hence there exists a weakly convergent subsequence \((f_{\lambda _{l}})_{l\in \mathbb {N}}\) such that \(\lambda _{l}\underset {l\rightarrow \infty }{\longrightarrow }0\) and \(f_{\lambda _{l}}\rightharpoonup \overline {f}\) as \(l\rightarrow \infty \). Taking the limit inferior as \(l\rightarrow \infty \) on the right-hand side of inequality 10 we obtain
Since a0 is by assumption the unique, nonzero minimiser this means that
But then since \(L(\overline {f}) \leq \|{L}\|\cdot \|{\overline {f}}\|\) we have \(\|{L}\|\leq \|{\overline {f}}\|\).
Moreover since J(fλ) ∩span{L}≠∅ we have \(\|{L}\|\cdot \|{f_{\lambda }}\| = L(f_{\lambda }) \rightarrow \|{L}\|^{2}\) and thus \(\|{f_{\lambda }}\|\rightarrow \|{L}\|\). Since \(\|{\overline {f}}\| \leq \liminf \|{f_{\lambda }}\|=\|{L}\|\) (e.g. [4], Proposition 3.5 (iii)) we have \(\|{\overline {f}}\|= \|{L}\|\) and thus \(L\in J(\overline {f})\).
Since the fλ are minimisers of the regularisation problem we have for all \(g\in {\mathscr{B}}\) such that L(g) = ∥L∥2
Since a0 is the minimiser this implies in particular that
and taking the limit inferior again we obtain that \(\overline {f}\) is in fact a solution of the interpolation problem
Now this means that \({\varOmega }(\overline {f}+f_{T}) \geq {\varOmega }(\overline {f})\) for all \(f_{T}\in \ker (L)\) and if \(\overline {f} = f\) we are clearly done. If \(\overline {f}\neq f\) we know that f and \(\overline {f}\) are in the same face as L ∈ J(f) and \(L\in J(\overline {f})\). They thus have the same error \({\mathcal {E}}\). If \({\varOmega }(f) = {\varOmega }(\overline {f})\) then both are equivalent minimisers and it is clear that both satisfy the tangential bound. If \({\varOmega }(f) > {\varOmega }(\overline {f})\) then f is not admissible and does not need to satisfy the tangential bound.
Finally note that the claim is trivially true for L = 0 as in that case \(\mathcal {E}\) is independent of f and for every λ the minimiser fλ has to be zero to satisfy J(fλ) ∩{0}≠∅. This means Ω is minimised at 0. □
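The mechanism of this proof, regularised solutions fλ converging to a solution of the interpolation problem as λ → 0, can be observed numerically in the finite-dimensional Hilbert-space special case. The following sketch is illustrative only (names `X`, `y`, `w_lam` are not from the paper): ridge solutions converge to the minimal-norm interpolant as the regularisation parameter vanishes.

```python
import numpy as np

# Sketch of the lambda -> 0 limit from the proof, specialised to a
# finite-dimensional Hilbert space with an underdetermined linear system.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 10))    # 5 equations, 10 unknowns: many interpolants
y = rng.normal(size=5)

w_min = np.linalg.pinv(X) @ y   # the minimal-norm solution of X w = y

# Regularised solutions w_lam of  min ||X w - y||^2 + lam * ||w||^2,
# computed via the dual form w_lam = X^T (X X^T + lam I)^{-1} y.
for lam in [1e-1, 1e-4, 1e-8]:
    w_lam = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(5), y)
    print(lam, np.linalg.norm(w_lam - w_min))   # distance shrinks with lam
```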
A.2 Duality mappings
The proof of theorem 2 crucially relies on the following connection between the duality mapping and subgradients (cf. [2, 3]).
Proposition 4
For a normed linear space V with duality mapping J with gauge function μ define \(M:V \rightarrow \mathbb {R}\) by
Then x∗ ∈ ∂M(x) ⊂ V∗, the subdifferential of M, if and only if
For any 0≠x ∈ V we have that ∂M(x) = J(x).
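Proposition 4 can be checked numerically in ℓp, where the duality mapping with gauge μ(t) = t has a well-known explicit form. The code below is an illustrative sketch, not part of the paper; it verifies the two defining properties of the duality mapping for a concrete vector.

```python
import numpy as np

# In l^p (p > 1) the duality mapping with gauge mu(t) = t is single valued:
#   J(x)_i = ||x||_p^(2-p) * sign(x_i) * |x_i|^(p-1),
# an element of the dual l^q with 1/p + 1/q = 1. We verify the defining
# properties x*(x) = ||x||_p^2 and ||x*||_q = ||x||_p.
p = 3.0
q = p / (p - 1)                         # conjugate exponent
x = np.array([1.0, -2.0, 0.5])

nx = np.linalg.norm(x, ord=p)
x_star = nx ** (2 - p) * np.sign(x) * np.abs(x) ** (p - 1)

print(np.dot(x_star, x) - nx ** 2)                  # ~0
print(np.linalg.norm(x_star, ord=q) - nx)           # ~0
```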
We now give a proof of theorem 2 which follows the ideas of the one presented by [3] but corrects the issue in that paper.
Proof of theorem 2
Using the functional M from proposition 4 define a functional \(F \colon V\rightarrow \mathbb {R}\) by
Since M is continuous, convex with strictly increasing derivative and L0 is linear, F is clearly continuous, convex and coercive. This means that F attains its minimum on the reflexive subspace W in at least one point, \(\overline {z}\) say.
Hence for all y ∈ W
By proposition 4 this means that \(L_{0}\big |_{W} \in \partial M\big |_{W}(\overline {z} - x_{0}) = J_{\mu }\big |_{W}(\overline {z} - x_{0})\). For simplicity we write L0|W = LW.
Note that if x0 ∈ W and LW = 0 we have that F(x) = M(x − x0) on W so \(\overline {z} = x_{0}\) and we trivially have Jμ(x0 − x0) = {0} = {−L0 + L0}⊂ W⊥ + L0. So we can without loss of generality assume that not both x0 ∈ W and LW = 0.
In case x0 ∈ W it is clear that M is minimised at x0. If LW≠ 0 then LW attains its norm on W in a point z say. Thus it is clear that there exists a minimiser for F of the form \(\overline {z} = z + x_{0}\). More precisely F is minimised where an element of ∂M and ∇L0 are equal. Since \(\partial M(x-x_{0})=\mu (\|{x-x_{0}}\|)\frac {L_{x}}{\mu (\|{x-x_{0}}\|)}\) for Lx ∈ Jμ(x − x0) we get that the minimiser \(\overline {z}=z+x_{0}\) is such that \(\|{L_{W}}\|_{W^{\ast }}=\mu (\|{\overline {z}-x_{0}}\|)\).
If on the other hand x0∉W then we note that \(\overline {z}\) being the minimum for F on W implies that Lz(y) ≥ 0 for all \(L_{z} \in \partial F(\overline {z})\) and all y ∈ W. But this means that
for every \(L_{z}\in J(\overline {z}-x_{0})\). But since \(\frac {L_{z}}{\mu (\|{\overline {z}-x_{0}}\|)}\) is of norm 1 this means that
for all y ∈ W. Thus \(\|{L_{W}}\|_{W^{\ast }}=\|{L_{0}\big |_{W}}\|_{W^{\ast }}\leq \mu (\|{\overline {z}-x_{0}}\|)\).
Now denote by \(\overline {W}\) the space generated by W and x0 and note that this space is still reflexive. Extend LW to \(L_{\overline {W}}\) on \(\overline {W}\) by setting
Then
so \(\|{L_{\overline {W}}}\|_{\overline {W}^{\ast }} \geq \mu (\|{\overline {z}-x_{0}}\|)\).
Further \(L_{\overline {W}}(y)=L_{W}(y)\leq \mu (\|{\overline {z}-x_{0}}\|)\cdot \|{y}\|\) for all y ∈ W, so \(\|{L_{\overline {W}}}\| > \mu (\|{\overline {z}-x_{0}}\|)\) can only happen if the norm is attained for some point λy + νx0 for y ∈ W, ν≠ 0. Or equivalently, dividing through by ν, at a point y + x0 for some y ∈ W. But for those points we have
and thus \(\|{L_{\overline {W}}}\| = \mu (\|{\overline {z}-x_{0}}\|)\) and \(L_{\overline {W}}(\overline {z}-x_{0}) = \|{L_{\overline {W}}}\|\cdot \|{\overline {z}-x_{0}}\|\).
Since for x0 ∈ W we have \(\overline {W}=W\) in either case we have obtained a function \(L_{\overline {W}}\) such that \(L_{\overline {W}}=L_{0}\big |_{W}\), \(\|{L_{\overline {W}}}\|=\mu (\|{\overline {z}-x_{0}}\|)\) and \(L_{\overline {W}}(\overline {z}-x_{0}) = \|{L_{\overline {W}}}\|\cdot \|{\overline {z}-x_{0}}\|\).
Now extend \(L_{\overline {W}}\) by Hahn-Banach to LV on V such that
and \(L_{V}\big |_{\overline {W}}=L_{\overline {W}}\). Hence (LV − L0)|W = 0 so LV ∈ W⊥ + L0.
It remains to show that \(L_{V}\in J_{\mu }(\overline {z}-x_{0})\) by showing 12 holds for LV and every y ∈ V. Notice first that
But
and further
so the left-hand side of 15 is always at least as big as the left-hand side of 14. We can thus add the left-hand side of 14 to the right-hand side of 12 and the left-hand side of 15 to the left-hand side of 12 while preserving the inequality. Inequality 12 holds in particular for \(\overline {z}\), and in that case also for \(L_{\overline {W}}\) as it agrees with L0 on \(\overline {z}\) and x0, i.e.
Thus by adding the left-hand sides of 14 and 15 as described we obtain
for all y ∈ V. But since LV also agrees with \(L_{\overline {W}}\) on \(\overline {z}\) and x0 this together with 13 implies that
for all y ∈ V which is what we wanted to prove. Thus indeed \(L_{V}\in J_{\mu }(\overline {z}-x_{0})\) as claimed. By homogeneity of Jμ clearly − LV with \(-\overline {z}\in W\) is as in the statement of the theorem. □
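The conclusion of theorem 2 can be sanity-checked in the Hilbert-space special case, where μ(t) = t makes Jμ the identity under the Riesz identification. The sketch below assumes F has the form F(z) = M(z − x0) − L0(z) (the displayed formula is not reproduced in this version), and all names are illustrative.

```python
import numpy as np

# Hilbert-space special case: minimising F(z) = ||z - x0||^2 / 2 - <l0, z>
# over a subspace W gives a stationary point z_bar with
#   (z_bar - x0) - l0  orthogonal to W,
# i.e. L_V := z_bar - x0 satisfies L_V|_W = L0|_W, so L_V lies in
# W-perp + L0, and trivially L_V is in J(z_bar - x0).
rng = np.random.default_rng(2)
B = rng.normal(size=(6, 3))     # columns of B span the subspace W of R^6
x0 = rng.normal(size=6)
l0 = rng.normal(size=6)         # L0 represented by the vector l0

# Stationarity of F on W (write z = B c):  B^T (B c - x0 - l0) = 0
c = np.linalg.solve(B.T @ B, B.T @ (x0 + l0))
z_bar = B @ c
L_V = z_bar - x0                # Riesz representer of the extended functional

print(np.linalg.norm(B.T @ (L_V - l0)))   # ~0: L_V - L0 annihilates W
```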
Schlegel, K. When is there a representer theorem? Adv Comput Math 47, 54 (2021). https://doi.org/10.1007/s10444-021-09877-4
Keywords
- Representer theorem
- Regularised interpolation
- Regularisation
- Kernel methods
- Reproducing kernel Banach spaces