Posterior impropriety of some sparse Bayesian learning models

https://doi.org/10.1016/j.spl.2021.109039

Abstract

Sparse Bayesian learning models are typically used for prediction in datasets with a significantly greater number of covariates than observations. Such models often take a reproducing kernel Hilbert space (RKHS) approach to prediction and can be implemented using either proper or improper priors. In this article we show that a few sparse Bayesian learning models in the literature, when implemented using improper priors, lead to improper posteriors.

Introduction

Modern datasets often have a significantly greater number of covariates, $p$, than observations, $n$. For such datasets, the objective is often to predict the response variable for previously unobserved values of the covariates. If $p < n$, one can fit a suitable linear model using a traditional statistical technique such as ordinary least squares (OLS). But if $p > n$, OLS is no longer applicable, and one can instead rely on penalized methods such as the least absolute shrinkage and selection operator (LASSO) proposed by Tibshirani (1996) or ridge regression proposed by Hoerl and Kennard (1970) to find a suitable model. Both LASSO and ridge regression, however, are penalized regression techniques that perform variable selection among the class of linear models. Hence, when $p > n$, if we wish to explore a nonlinear class of models, we can estimate a function $f$ from a functional space $\mathcal{H}$ using the following Tikhonov regularization:
$$\min_{f \in \mathcal{H}} \left[ \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^{2} \right], \tag{1.1}$$
where $\{y_i, x_i\}_{i=1}^{n}$ is the training data such that $y_i \in \mathbb{R}$ and $x_i \in \mathbb{R}^{p}$ for all $i$, $L(\cdot,\cdot)$ is the loss function, $\lambda$ is the penalty parameter, $\mathcal{H}$ is the functional space, and $\|\cdot\|_{\mathcal{H}}$ is the norm defined on $\mathcal{H}$.
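As a concrete illustration (our own, not from the paper), the following minimal Python sketch fits LASSO and ridge regression in the $p > n$ regime using scikit-learn; the simulated data, sparsity pattern, and penalty values are illustrative assumptions:

```python
# Minimal sketch of penalized regression when p > n, using scikit-learn.
# The data-generating process and penalty values below are illustrative.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 50, 200                               # p > n: OLS is not applicable
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]  # sparse truth: 5 active covariates
y = X @ true_beta + 0.5 * rng.normal(size=n)

lasso = Lasso(alpha=0.1, max_iter=10_000).fit(X, y)  # L1 penalty: variable selection
ridge = Ridge(alpha=1.0).fit(X, y)                   # L2 penalty: shrinks coefficients

print("nonzero LASSO coefficients:", (lasso.coef_ != 0).sum())
```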

Since a functional space is infinite dimensional, the solution of (1.1) can also be infinite dimensional, and hence may be of no practical use. Wahba (1990) proved that if the functional space is a reproducing kernel Hilbert space (RKHS), then the solution is finite dimensional and is given by
$$f(x) = \sum_{j=1}^{n} k(x, x_j)\,\beta_j, \tag{1.2}$$
where $k(\cdot,\cdot)$ is a reproducing kernel and $\{\beta_j\}_{j=1}^{n}$ are unknown coefficients. The formal definitions of an RKHS and a reproducing kernel can be found in Berlinet and Thomas-Agnan (2011).
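For the special case of squared-error loss, the coefficients in (1.2) have the closed form $\beta = (K + \lambda I)^{-1} y$ (kernel ridge regression). Below is a minimal NumPy sketch, assuming a Gaussian kernel $k_\theta(x, x') = \exp(-\theta \|x - x'\|^2)$; the kernel choice and parameter values are our own illustrative assumptions:

```python
# Kernel ridge regression: the finite dimensional solution (1.2) under squared-error loss.
import numpy as np

def gaussian_kernel(X1, X2, theta=1.0):
    """k_theta(x, x') = exp(-theta * ||x - x'||^2) for all row pairs."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-theta * sq_dists)

def fit(X, y, lam=0.1, theta=1.0):
    """Solve (1.1) for squared-error loss: beta = (K + lam * I)^{-1} y."""
    K = gaussian_kernel(X, X, theta)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def predict(X_train, beta, X_new, theta=1.0):
    """Evaluate f(x) = sum_j k_theta(x, x_j) * beta_j at the new points."""
    return gaussian_kernel(X_new, X_train, theta) @ beta

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))            # n = 20 observations, p = 100 covariates
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)
beta = fit(X, y)
print(predict(X, beta, X[:3]))            # in-sample predictions at 3 points
```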

Tipping (2001) used the finite dimensional solution (1.2) in a hierarchical Bayesian model to introduce the relevance vector machine (RVM); it is also discussed in Tipping (2000) and Bishop and Tipping (2000). The prior structure of RVM is chosen so as to produce a sparse solution and hence lead to better predictions. RVM is a very popular sparse Bayesian learning model that is typically used for prediction, as demonstrated by the large number of citations of the original RVM paper of Tipping (2001).

In Bayesian analysis, prior distributions are assumed on parameters. A prior distribution is said to be proper if its density is a valid probability density function; otherwise it is said to be improper. The most common objective prior in the literature is the so-called Jeffreys' prior proposed by Jeffreys (1961), whose density is proportional to the square root of the determinant of the Fisher information matrix and hence can be computed easily in many cases. The Jeffreys' prior can be proper or improper depending on the data model used in the analysis. For Bayesian models involving improper priors, the posterior distribution of the parameters given the data is not guaranteed to be proper. In such cases it is therefore necessary to show that the normalizing constant of the posterior distribution is finite; otherwise the posterior may be improper, and drawing inference from an improper posterior distribution amounts to drawing inference from a function that integrates to infinity.
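To make the propriety check concrete, consider a hedged toy example (our own construction, not one of the models analyzed in the paper): $y \mid \beta \sim N(\beta, 1)$, $\beta \mid \lambda \sim N(0, 1/\lambda)$, with the improper prior $\pi(\lambda) \propto 1/\lambda$. Integrating $\beta$ out gives an unnormalized posterior for $\lambda$ whose tail decays like $1/\lambda$, so its integral diverges; the sketch below exhibits this numerically:

```python
# Numerical check of posterior (im)propriety for a toy scale-mixture model.
# Marginalizing beta gives y | lam ~ N(0, 1 + 1/lam); with pi(lam) ∝ 1/lam the
# unnormalized posterior behaves like C/lam for large lam, so it integrates to infinity.
import numpy as np
from scipy.integrate import quad

y = 1.3  # an arbitrary observed value

def unnorm_post(lam):
    var = 1.0 + 1.0 / lam
    return np.exp(-y**2 / (2.0 * var)) / (np.sqrt(var) * lam)

for upper in [1e2, 1e4, 1e6]:
    mass, _ = quad(unnorm_post, 0.0, upper, limit=500)
    print(f"integral over (0, {upper:.0e}) = {mass:.3f}")  # grows like log(upper)
```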

The RVM proposed by Tipping (2001) places prior densities proportional to $\lambda^{a-1}e^{-b\lambda}$ on the various scale parameters. Depending on the choice of the hyperparameters $a$ and $b$, these densities can be proper gamma priors or improper priors, and Tipping (2001) presents both cases. We prove that the improper prior assumed by Tipping (2001) leads to an improper posterior distribution. Additionally, we derive necessary and sufficient conditions for the posterior propriety of RVM. The necessary conditions allow past users of RVM to check whether the improper prior they used leads to an improper posterior, while the sufficient conditions provide guidelines for future researchers to choose prior distributions that guarantee posterior propriety. Figueiredo (2002) proposed to apply RVM using the popular Jeffreys' prior on the parameters; the necessary conditions that we derive show that this choice also leads to an improper posterior.
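For intuition about how the hyperparameters $a$ and $b$ govern sparsity, here is a hedged sampling sketch (illustrative values, not Tipping's settings): drawing $\lambda \sim \mathrm{Gamma}(a, b)$ and then $\beta \mid \lambda \sim N(0, 1/\lambda)$ yields a proper, heavy-tailed Student-$t$ marginal on $\beta$ when $a, b > 0$, whereas the improper limit $a = b = 0$ corresponds to $\pi(\beta) \propto 1/|\beta|$:

```python
# Scale-mixture sketch of the RVM-style prior: lam ~ Gamma(a, rate b),
# beta | lam ~ N(0, 1/lam). With a = b = 0.5 (illustrative), beta is
# marginally Student-t with 1 degree of freedom: sharply peaked at zero
# with heavy tails, which is the mechanism that encourages sparsity.
import numpy as np

rng = np.random.default_rng(1)
a, b = 0.5, 0.5
lam = rng.gamma(shape=a, scale=1.0 / b, size=100_000)
beta = rng.normal(0.0, 1.0 / np.sqrt(lam))

print("median |beta|:", np.median(np.abs(beta)))  # most mass near zero
print("max |beta|:   ", np.max(np.abs(beta)))     # occasional very large draws
```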

Sparse Bayesian learning also encompasses classification models. Mallick et al. (2005) proposed an RKHS-based Bayesian classification model which makes use of the finite dimensional solution in (1.2) to build models corresponding to both logistic likelihoods and support vector machine related likelihoods. They propose to implement their model using either proper priors or Jeffreys' prior. We show that the use of Jeffreys' prior in their models leads to an improper posterior.

The article is structured as follows. In Section 2, we describe RVM and a related model proposed by Figueiredo (2002), along with their inference methods, in detail. Further in Section 2, we provide necessary and sufficient conditions for the posterior propriety of RVM and show that the sparse Bayesian learning models proposed by Tipping (2001) under an improper prior and by Figueiredo (2002) under Jeffreys' prior lead to improper posteriors. In Section 3, we provide details of the Bayesian classification models proposed by Mallick et al. (2005) and show that these models yield improper posteriors under Jeffreys' prior. Some concluding remarks are given in Section 4.


Relevance vector machine and its impropriety

Let $\{(y_i, x_i), i = 1, 2, \ldots, n\}$ be the training data, where $y_i \in \mathbb{R}$ is the $i$th observation of the response variable and $x_i \in \mathbb{R}^{p}$ is the $p$ dimensional covariate vector associated with $y_i$. Let $y = (y_1, y_2, \ldots, y_n)^{T}$ and $\beta = (\beta_0, \beta_1, \ldots, \beta_n)^{T}$. Let $K$ be the $n \times (n+1)$ matrix whose $i$th row is given by $k_i = (1, k_\theta(x_i, x_1), k_\theta(x_i, x_2), \ldots, k_\theta(x_i, x_n))^{T}$, where $\{k_\theta(x_i, x_j) : i = 1, 2, \ldots, n;\ j = 1, 2, \ldots, n\}$ are the values of the reproducing kernel and $\theta$ is a kernel parameter. The relevance vector machine proposed by Tipping (2001) is a hierarchical Bayesian …
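Although the section is truncated here, the design matrix $K$ is fully specified above; below is a minimal NumPy sketch of its construction, assuming a Gaussian kernel for $k_\theta$ (an illustrative choice — the paper leaves the kernel generic):

```python
# Build the n x (n+1) RVM design matrix K: row i is
# (1, k_theta(x_i, x_1), ..., k_theta(x_i, x_n)).
import numpy as np

def build_K(X, theta=1.0):
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    kernel = np.exp(-theta * sq_dists)            # k_theta(x_i, x_j), assumed Gaussian
    n = X.shape[0]
    return np.hstack([np.ones((n, 1)), kernel])   # prepend the intercept column

X = np.random.default_rng(2).normal(size=(5, 3))  # n = 5, p = 3 toy data
K = build_K(X)
assert K.shape == (5, 6)                          # n x (n+1)
```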

Sparse Bayesian classification model and its impropriety

Let $y$ be an $n$ dimensional vector containing the observed response variables $\{y_i\}_{i=1}^{n}$ with $y_i \in \{0, 1\}$ for all $i$, and let $z$ be an $n$ dimensional vector of latent variables that connect the response variables to the covariates. The Bayesian classification model based on reproducing kernels proposed by Mallick et al. (2005) is as follows:
$$f(y \mid z) \propto \exp\Big\{-\sum_{i=1}^{n} l(y_i, z_i)\Big\}$$
$$z \mid \beta, \sigma^2, \theta \sim N(K\beta, \sigma^2 I)$$
$$\beta \mid \lambda, \sigma^2 \sim N(0, \sigma^2 D^{-1}), \quad \text{with } D = \mathrm{diag}(\lambda_0, \lambda_1, \ldots, \lambda_n)$$
$$\pi(\lambda_i) \propto \lambda_i^{a-1}\exp\{-b\lambda_i\} \quad \text{for all } i = 1, 2, \ldots, n$$
$$\sigma^2 \sim \mathrm{IG}(c, d)$$
$$\theta \sim U(u_1, u_2)$$
where $y = (y_1, y_2, \ldots, y_n)^{T}$, …
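As a hedged sketch of how this hierarchy combines into a joint density, the function below evaluates the unnormalized log posterior for the logistic variant, assuming the logistic loss $l(y, z) = \log(1 + e^{z}) - yz$ and holding $\theta$ fixed; the function itself is our own illustrative construction, not code from Mallick et al. (2005):

```python
# Unnormalized log posterior for the logistic variant of the hierarchy above.
# beta and lam have length n + 1 (indices 0, 1, ..., n), matching D = diag(lam).
import numpy as np

def log_unnorm_posterior(y, z, beta, lam, sigma2, K, a, b, c, d):
    n = len(y)
    # log f(y | z): logistic loss l(y, z) = log(1 + e^z) - y * z
    ll = np.sum(y * z - np.logaddexp(0.0, z))
    # log density of z | beta, sigma2, theta: N(K beta, sigma2 I)
    lz = -0.5 * np.sum((z - K @ beta) ** 2) / sigma2 - 0.5 * n * np.log(sigma2)
    # log density of beta | lam, sigma2: N(0, sigma2 D^{-1})
    lb = (0.5 * np.sum(np.log(lam)) - 0.5 * len(beta) * np.log(sigma2)
          - 0.5 * np.sum(lam * beta**2) / sigma2)
    # gamma-type priors on the lam_i (applied to the whole vector for simplicity)
    llam = np.sum((a - 1.0) * np.log(lam) - b * lam)
    # inverse-gamma IG(c, d) prior on sigma2
    lsig = -(c + 1.0) * np.log(sigma2) - d / sigma2
    return ll + lz + lb + llam + lsig

# Toy evaluation on simulated inputs
rng = np.random.default_rng(3)
n = 4
K = np.hstack([np.ones((n, 1)), rng.normal(size=(n, n))])
val = log_unnorm_posterior(
    y=np.array([0, 1, 1, 0]), z=rng.normal(size=n),
    beta=rng.normal(size=n + 1), lam=np.ones(n + 1),
    sigma2=1.0, K=K, a=1.0, b=1.0, c=1.0, d=1.0)
print(val)
```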

Conclusion and discussion

A probability density function is valid only if the area under the curve equals one. This basic requirement is not assured for the posterior density function of a Bayesian model with an improper prior. Therefore, for a Bayesian model with an improper prior, one should proceed with inference only after showing that the posterior density function is valid. In this paper we have shown that some sparse Bayesian learning models based on improper priors do not have valid posterior …

Acknowledgements

The authors would like to thank the two referees and an Associate Editor for helpful comments that have improved the paper.

References (15)

  • Athreya, K.B., et al. (2014). Monte Carlo methods for improper target distributions. Electron. J. Stat.
  • Berlinet, A., Thomas-Agnan, C. (2011). Reproducing Kernel Hilbert Spaces in Probability and Statistics.
  • Bishop, C.M., Tipping, M.E. (2000). Variational relevance vector machines. In: Proceedings of the Sixteenth Conference...
  • Figueiredo, M. (2002). Adaptive sparseness using Jeffreys prior. Adv. Neural Inf. Process. Syst.
  • Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Anal.
  • Hobert, J.P., et al. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. J. Amer. Statist. Assoc.
  • Hoerl, A., Kennard, R. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics.
