Hamiltonian Monte Carlo acceleration using surrogate functions with random bases

Statistics and Computing

Abstract

For big data analysis, the high computational cost of Bayesian methods often limits their application in practice. In recent years, there have been many attempts to improve the computational efficiency of Bayesian inference. Here we propose an efficient and scalable computational technique for a state-of-the-art Markov chain Monte Carlo method, namely, Hamiltonian Monte Carlo. The key idea is to explore and exploit the structure and regularity in the parameter space of the underlying probabilistic model in order to construct an effective approximation of its geometric properties. To this end, we build a surrogate function that approximates the target distribution using properly chosen random bases and an efficient optimization process. The resulting method provides a flexible, scalable, and efficient sampling algorithm that converges to the correct target distribution. We show that, by choosing the basis functions and the optimization process differently, our method can be related to other approaches for constructing surrogate functions, such as generalized additive models or Gaussian process models. Experiments based on simulated and real data show that our approach leads to substantially more efficient sampling algorithms than existing state-of-the-art methods.

References

  • Ahn, S., Chen, Y., Welling, M.: Distributed and adaptive darting Monte Carlo through regenerations. In: Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS) (2013)

  • Amari, S., Nagaoka, H.: Methods of Information Geometry. Translations of Mathematical Monographs, vol. 191. Oxford University Press, Oxford (2000)

  • Andrieu, C., Thoms, J.: A tutorial on adaptive MCMC. Stat. Comput. 18(4), 343–373 (2008)

  • Bache, K., Lichman, M.: UCI machine learning repository. http://archive.ics.uci.edu/ml (2013). Accessed April 2015

  • Betancourt, M.J.: The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling. In: Proceedings of the 32nd International Conference on Machine Learning (ICML) (2015)

  • Chen, T., Fox, E.B., Guestrin, C.: Stochastic gradient Hamiltonian Monte Carlo. In: Proceedings of the 31st International Conference on Machine Learning (ICML) (2014)

  • Conrad, P.R., Marzouk, Y.M., Pillai, N.S., Smith, A.: Asymptotically exact MCMC algorithms via local approximations of computationally intensive models. arXiv:1402.1694v3 (2014). Accessed Nov 2014

  • Dashti, M., Stuart, A.: Uncertainty quantification and weak approximation of an elliptic inverse problem. SIAM J. Numer. Anal. 49(6), 2524–2542 (2011)

  • Duane, S., Kennedy, A.D., Pendleton, B.J., Roweth, D.: Hybrid Monte Carlo. Phys. Lett. B 195, 216–222 (1987)

  • Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741 (1984)

  • Geyer, C.J.: Practical Markov chain Monte Carlo. Stat. Sci. 7, 473–483 (1992)

  • Girolami, M., Calderhead, B.: Riemann manifold Langevin and Hamiltonian Monte Carlo methods (with discussion). J. R. Stat. Soc. B 73, 123–214 (2011)

  • Greville, T.: Some applications of the pseudoinverse of a matrix. SIAM Rev. 2(1), 15–22 (1960)

  • Hoffman, M.D., Blei, D.M., Bach, F.R.: Online learning for latent Dirichlet allocation. In: Advances in Neural Information Processing Systems (NIPS), pp. 856–864 (2010)

  • Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: theory and applications. Neurocomputing 70(1–3), 489–501 (2006a)

  • Huang, G.B., Chen, L., Siew, C.K.: Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Trans. Neural Netw. 17(4), 879–892 (2006b)

  • Hyvärinen, A.: Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 6, 695–709 (2005)

  • Kohonen, T.: Self-Organization and Associative Memory. Springer, Berlin (1988)

  • Kovanic, P.: On the pseudoinverse of a sum of symmetric matrices with applications to estimation. Kybernetika 15(5), 341–348 (1979)

  • Lan, S., Streets, J., Shahbaba, B.: Wormhole Hamiltonian Monte Carlo. In: Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (2014a)

  • Lan, S., Zhou, B., Shahbaba, B.: Spherical Hamiltonian Monte Carlo for constrained target distributions. In: Proceedings of the 31st International Conference on Machine Learning (ICML) (2014b)

  • Lin, C.J., Weng, R.C., Keerthi, S.S.: Trust region Newton method for large-scale logistic regression. J. Mach. Learn. Res. 9, 627–650 (2008)

  • Liu, J.S.: Monte Carlo Strategies in Scientific Computing. Springer, New York (2001)

  • Meeds, E., Welling, M.: GPS-ABC: Gaussian process surrogate approximate Bayesian computation. In: Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (UAI) (2014)

  • Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1092 (1953)

  • Moro, S., Cortez, P., Rita, P.: A data-driven approach to predict the success of bank telemarketing. Decis. Support Syst. 62, 22–31 (2014)

  • Neal, R.M.: MCMC using Hamiltonian dynamics. In: Brooks, S., Gelman, A., Jones, G., Meng, X.L. (eds.) Handbook of Markov Chain Monte Carlo, pp. 113–162. Chapman and Hall/CRC, Boca Raton (2011)

  • Neal, R.M.: Regression and classification using Gaussian process priors. Bayesian Stat. 6, 475–501 (1999)

  • Neal, R.M.: Bayesian Learning for Neural Networks. Springer, New York (1996)

  • Quinonero-Candela, J., Rasmussen, C.E.: A unifying view of sparse approximate Gaussian process regression. J. Mach. Learn. Res. 6, 1939–1959 (2005)

  • Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems (NIPS) (2007)

  • Rahimi, A., Recht, B.: Uniform approximation of functions with random bases. In: Proceedings of the 46th Annual Allerton Conference on Communication, Control, and Computing (2008)

  • Rasmussen, C.E.: Evaluation of Gaussian processes and other methods for non-linear regression. PhD thesis, University of Toronto (1996)

  • Rasmussen, C.E.: Gaussian processes to speed up hybrid Monte Carlo for expensive Bayesian integrals. Bayesian Stat. 7, 651–659 (2003)

  • Rasmussen, C.E., Nickisch, H.: Gaussian processes for machine learning (GPML) toolbox. J. Mach. Learn. Res. 11, 3011–3015 (2010)

  • Roberts, G.O., Rosenthal, J.S.: Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. J. Appl. Probab. 44(2), 458–475 (2007)

  • Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. I, pp. 318–362. Bradford Books, Cambridge (1986)

  • Shahbaba, B., Lan, S., Johnson, W.O., Neal, R.M.: Split Hamiltonian Monte Carlo. Stat. Comput. 24, 339–349 (2014)

  • Snelson, E., Ghahramani, Z.: Sparse Gaussian processes using pseudo-inputs. In: Advances in Neural Information Processing Systems (NIPS), vol. 18. MIT Press, Cambridge (2006)

  • Strathmann, H., Sejdinovic, D., Livingstone, S., Szabo, Z., Gretton, A.: Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families. In: Advances in Neural Information Processing Systems (NIPS) (2015)

  • van Schaik, A., Tapson, J.: Online and adaptive pseudoinverse solutions for ELM weights. Neurocomputing 149, 233–238 (2015)

  • Verlet, L.: Computer ‘experiments’ on classical fluids. I. Thermodynamical properties of Lennard–Jones molecules. Phys. Rev. 159, 98–103 (1967)

  • Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th International Conference on Machine Learning (ICML), pp. 681–688 (2011)

  • Zhang, C., Shahbaba, B., Zhao, H.: Precomputing strategy for Hamiltonian Monte Carlo method based on regularity in parameter space. arXiv:1504.01418v1 (2015). Accessed April 2015

Acknowledgments

This work is supported by NSF grants DMS-1418422 and DMS-1622490.

Author information

Correspondence to Cheng Zhang.

Appendices

Appendix 1: Convergence to the correct target distribution

In order to prove that the equilibrium distribution remains the same, it suffices to show that the detailed balance condition still holds. Denote \(\theta = (p,\,q),\; \theta ^{\prime }={\tilde{\phi }}_t(\theta ).\) In the Metropolis–Hastings step, we use the original Hamiltonian to compute the acceptance probability

$$\begin{aligned} \alpha (\theta ,\,\theta ^{\prime }) = \min (1,\,\exp [-H(\theta ^{\prime }) + H(\theta )]), \end{aligned}$$

therefore,

$$\begin{aligned} \alpha (\theta ,\,\theta ^{\prime }){\mathbb {P}}(\mathrm{d}\theta )&= \alpha (\theta ,\,\theta ^{\prime }) \exp [-H(\theta )]\mathrm{d}\theta \\&\mathop {=}\limits ^{\theta = {\tilde{\phi }}^{-1}_{t}(\theta ^{\prime })} \min (\exp [-H(\theta )],\,\exp [-H(\theta ^{\prime })])\left| \frac{\mathrm{d}\theta }{\mathrm{d}\theta ^{\prime }}\right| \mathrm{d}\theta ^{\prime } \\&= \alpha (\theta ^{\prime },\,\theta )\exp [-H(\theta ^{\prime })]\mathrm{d}\theta ^{\prime }\\&= \alpha (\theta ^{\prime },\,\theta ) {\mathbb {P}}(\mathrm{d}\theta ^{\prime }), \end{aligned}$$

since \( \left| \frac{\mathrm{d}\theta }{\mathrm{d}\theta ^{\prime }}\right| = 1\) due to the volume conservation property of the surrogate-induced Hamiltonian flow \({\tilde{\phi }}_t.\) Since the detailed balance condition is satisfied and the surrogate-induced Hamiltonian flow is reversible, the modified Markov chain converges to the correct target distribution.
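
To make the argument concrete, the following is a minimal sketch of one such transition in Python/NumPy (our illustration, not code from the paper; `U` and `grad_V` are assumed callables for the original potential and the gradient of the surrogate): the surrogate drives the leapfrog trajectory, while the accept/reject test uses the original Hamiltonian, exactly as the detailed balance computation above requires.

```python
import numpy as np

def surrogate_hmc_step(q, U, grad_V, eps, L, rng):
    """One surrogate-driven HMC transition: leapfrog on the surrogate
    potential V, Metropolis-Hastings correction with the original U."""
    p = rng.standard_normal(q.shape)       # fresh Gaussian momentum
    q_new, p_new = q.copy(), p.copy()
    p_new -= 0.5 * eps * grad_V(q_new)     # initial half step in momentum
    for _ in range(L):
        q_new += eps * p_new               # full step in position
        p_new -= eps * grad_V(q_new)       # full step in momentum
    p_new += 0.5 * eps * grad_V(q_new)     # trim the last kick to a half step
    # Accept/reject with the ORIGINAL Hamiltonian H(q, p) = U(q) + |p|^2 / 2;
    # this Metropolis-Hastings step is what keeps the target exp(-U(q)) exact.
    H_cur = U(q) + 0.5 * p @ p
    H_prop = U(q_new) + 0.5 * p_new @ p_new
    return q_new if np.log(rng.random()) < H_cur - H_prop else q
```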

Appendix 2: Potential matching

In the paper, training data collected from the history of the Markov chain are used without a detailed explanation. Here, we analyze the asymptotic behavior of the surrogate-induced distribution and explain why the history of the Markov chain is a reasonable choice of training data. Recall that we find our neural network surrogate function by minimizing the mean square error (11). Similar to Hyvärinen (2005), it turns out that minimizing (11) is asymptotically equivalent to minimizing a new distance between the surrogate-induced distribution and the underlying target distribution, independent of their corresponding normalizing constants.

Suppose we know the density of the underlying intractable target distribution up to a constant

$$\begin{aligned} P(q|Y) = \frac{1}{Z}\exp (-U(q)), \end{aligned}$$

where Z is the unknown normalizing constant. Our goal is to approximate this distribution using a parametrized density model, also known up to a constant,

$$\begin{aligned} Q(q,\,\theta ) = \frac{1}{Z(\theta )}\exp (-V(q,\,\theta )). \end{aligned}$$

Ignoring the multiplicative constants, the corresponding potential energy functions are \(U(q)\) and \(V(q,\,\theta ),\) respectively. The straightforward squared distance between the two potentials is not a well-defined measure of discrepancy between the two distributions because of the unknown normalizing constants. Therefore, we use the following distance instead:

$$\begin{aligned} K(\theta )&= \min _d\int \Vert V(q,\,\theta )-U(q)-d\Vert ^2P(q|Y)\mathrm{d}q\nonumber \\&=\int \Vert V(q,\,\theta )-U(q)\Vert ^2 P(q|Y)\mathrm{d}q \nonumber \\&\quad - \left[ E_q(V(\theta )-U)\right] ^2 = {\mathbb {V}}\mathrm {ar}_{q}(V(\theta )-U). \end{aligned}$$
(14)

Because of its similarity to score matching (Hyvärinen 2005), we refer to the approximation method based on this new distance as potential matching; the corresponding potential matching estimator of \(\theta \) is given by

$$\begin{aligned} {\hat{\theta }} = \arg \min _{\theta } K(\theta ). \end{aligned}$$

It is easy to verify that \(K(\theta ) = 0 \Rightarrow V(\theta ) = U + \mathrm{constant} \Rightarrow Q(q,\,\theta ) = P(q|Y),\) so \(K(\theta )\) is a well-defined squared distance. Exact evaluation of (14) is analytically intractable. In practice, given N samples from the target distribution \(q_1,\,q_2,\ldots ,q_N,\) we minimize the empirical version of (14)

$$\begin{aligned} {\tilde{K}}(\theta )&= \min _d\frac{1}{N}\sum _{n=1}^N \left\| V\left( q_n,\,\theta \right) - U\left( q_n\right) -d\right\| ^2 \nonumber \\&= \frac{1}{N}\sum _{n=1}^N\left\| V\left( q_n,\,\theta \right) -U\left( q_n\right) \right\| ^2 \nonumber \\&\quad - \left( \frac{1}{N}\sum _{n=1}^NV\left( q_n,\,\theta \right) -U\left( q_n\right) \right) ^2, \end{aligned}$$
(15)

which is asymptotically equivalent to \(K\) by the law of large numbers. Equation (15) becomes more concise if we allow a shift term in the parametrized model (\(V(q,\,\theta ) = V(q,\,\theta _{-d}) + \theta _d\)). In that case, the empirical potential matching estimator is

$$\begin{aligned} {\tilde{\theta }}&= \arg \min _\theta {\tilde{K}}(\theta ) = \arg \min _\theta \min _d\frac{1}{N}\sum _{n=1}^N \left\| V\left( q_n,\,\theta _{-d}\right) \right. \\&\left. \quad +\left( \theta _d-d\right) - U\left( q_n\right) \right\| ^2\\&= \arg \min _\theta \frac{1}{N}\sum _{n=1}^N \left\| V\left( q_n,\,\theta _{-d}\right) + \theta _d - U\left( q_n\right) \right\| ^2\\&= \arg \min _\theta \frac{1}{N}\sum _{n=1}^N \left\| V\left( q_n,\,\theta \right) - U\left( q_n\right) \right\| ^2. \end{aligned}$$
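
Before specializing to the additive model below, here is a quick numerical check of this reduction (a minimal NumPy sketch with names of our choosing): the loss in (15) is the sample variance of \(V - U,\) and is therefore invariant to a constant shift in the surrogate potential.

```python
import numpy as np

def potential_matching_loss(V_vals, U_vals):
    """Empirical potential matching loss (15): the sample variance of
    V(q_n, theta) - U(q_n), insensitive to the normalizing constants."""
    diff = np.asarray(V_vals) - np.asarray(U_vals)
    return np.mean(diff ** 2) - np.mean(diff) ** 2

# Shifting V by any constant leaves the loss unchanged.
rng = np.random.default_rng(0)
V, U = rng.normal(size=100), rng.normal(size=100)
assert np.isclose(potential_matching_loss(V, U),
                  potential_matching_loss(V + 3.7, U))
```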

In particular, if we adopt an additive model for \(V(q,\,\theta )\) with a shift term

$$\begin{aligned} V(q,\,\theta ) = \sum _{i=1}^sv_i\sigma \left( \varvec{w_i}\cdot q+d_i\right) + b,\quad \theta = (\varvec{v},\,b), \end{aligned}$$

where \(\varvec{w_i},\,d_i\), and the activation function \(\sigma \) are chosen according to Algorithm 2, and we collect early evaluations from the history of the Markov chain

$$\begin{aligned} {\mathcal {T}}_N= & {} \left\{ \left( q^{(1)},\,U\left( q^{(1)}\right) \right) ,\,\left( q^{(2)},\,U\left( q^{(2)}\right) \right) ,\ldots ,\right. \\&\times \,\left. \left( q^{(N)},\,U\left( q^{(N)}\right) \right) \right\} , \end{aligned}$$

as training data; this way, the estimated parameters (i.e., weights for the output neuron) are asymptotically the potential matching estimates

$$\begin{aligned} \lim _{N\rightarrow \infty }{\hat{\theta }}_{\mathrm{ELM},{\mathcal {T}}_N} = \arg \min _{\theta }\lim _{N\rightarrow \infty }C\left( \theta |{\mathcal {T}}_N\right) = {\hat{\theta }}, \end{aligned}$$

since the Markov chain will eventually converge to the target distribution. When truncated at a finite \(N,\) the estimated parameters approximate the empirical potential matching estimates; the only caveat is that the samples from the history of the Markov chain are close to, but not exactly drawn from, the target distribution.
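
As an illustration of this construction, the sketch below fits such a random-basis surrogate to the chain history by ordinary least squares; the Gaussian random weights and the tanh activation are assumptions made for this example, since the actual choices follow Algorithm 2.

```python
import numpy as np

def fit_elm_surrogate(Q, U_vals, s, rng):
    """Fit V(q) = sum_i v_i * sigma(w_i . q + d_i) + b by least squares
    on chain history {(q_n, U(q_n))}: the empirical potential matching
    estimator with random hidden weights (a sketch; sigma = tanh here)."""
    n, dim = Q.shape
    W = rng.standard_normal((s, dim))      # random input weights w_i
    d = rng.standard_normal(s)             # random biases d_i
    H = np.tanh(Q @ W.T + d)               # hidden-layer output matrix
    H = np.hstack([H, np.ones((n, 1))])    # extra column for the shift term b
    theta, *_ = np.linalg.lstsq(H, U_vals, rcond=None)
    return W, d, theta[:-1], theta[-1]     # (w_i, d_i, v, b)
```

Once fitted, the surrogate \(V\) and its gradient are cheap to evaluate, which is what makes the surrogate-driven leapfrog steps in Appendix 1 inexpensive.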

Appendix 3: Adaptive learning

Theorem 1

(Greville) If a matrix \(A\) with \(k\) columns is denoted by \(A_k\) and partitioned as \(A_k=[A_{k-1}\;a_k],\) with \(A_{k-1}\) a matrix having \(k-1\) columns, then the Moore–Penrose generalized inverse of \(A_k\) is

$$\begin{aligned} A_k^{\dagger } =\begin{bmatrix}A_{k-1}^{\dagger }(I-a_kb_k^\mathrm{T})\\b_k^\mathrm{T}\end{bmatrix}, \end{aligned}$$

where

$$\begin{aligned} c_k = \left( I-A_{k-1}A_{k-1}^{\dagger }\right) a_k, \quad b_k = \left\{ \begin{array}{ll} \frac{(A_{k-1}^{\dagger })^\mathrm{T}A_{k-1}^{\dagger }a_k}{1+ \Vert A_{k-1}^\dagger a_k\Vert ^2}, &{}c_k=0,\\ \frac{c_k}{\Vert c_k\Vert ^2}, &{} c_k\ne 0. \end{array}\right. \end{aligned}$$
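
The recursion translates directly into NumPy. The following sketch (the helper name is ours) grows a pseudoinverse one column at a time and checks the result against `np.linalg.pinv`:

```python
import numpy as np

def greville_append_column(A_prev, A_prev_pinv, a):
    """Pseudoinverse of A_k = [A_{k-1}, a_k] from that of A_{k-1}."""
    c = a - A_prev @ (A_prev_pinv @ a)     # c_k = (I - A A^dagger) a_k
    if np.allclose(c, 0.0):                # a_k lies in the range of A_{k-1}
        u = A_prev_pinv @ a
        b = (A_prev_pinv.T @ u) / (1.0 + u @ u)
    else:
        b = c / (c @ c)
    top = A_prev_pinv - np.outer(A_prev_pinv @ a, b)   # A^dagger (I - a b^T)
    return np.vstack([top, b])

# Build a 6 x 5 matrix column by column and compare with the direct pinv.
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 1))
A_pinv = np.linalg.pinv(A)
for _ in range(4):
    a = rng.standard_normal(6)
    A_pinv = greville_append_column(A, A_pinv, a)
    A = np.column_stack([A, a])
assert np.allclose(A_pinv, np.linalg.pinv(A))
```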

Proof of Proposition 1

To save computational cost, we only update the estimator (the output weights) rather than refitting it from scratch. Suppose the current output matrix is \(H_{k+1}=\begin{bmatrix}H_k\\h_{k+1}^\mathrm{T}\end{bmatrix}\) and the target vector is \(T_{k+1}=\begin{bmatrix}T_k\\t_{k+1}\end{bmatrix};\) then

$$\begin{aligned} v_{k+1}^\mathrm{T}H_{k+1}^\mathrm{T}&= T_{k+1}^\mathrm{T} \Rightarrow v_{k+1}^\mathrm{T} \\&= T_{k+1}^\mathrm{T}\left( H_{k+1}^\mathrm{T}\right) ^{\dagger } \\&= \left[ T_k^\mathrm{T} t_{k+1}\right] \left( \left[ H_k^\mathrm{T} h_{k+1}\right] \right) ^{\dagger }\\&= \left[ T_k^\mathrm{T} t_{k+1}\right] \begin{bmatrix}(H_k^\mathrm{T})^{\dagger }(I-h_{k+1}b_{k+1}^\mathrm{T})\\b_{k+1}^\mathrm{T}\end{bmatrix}\\&= T_k^\mathrm{T}\left( H_k^\mathrm{T}\right) ^{\dagger }\left( I-h_{k+1}b_{k+1}^\mathrm{T}\right) + t_{k+1}b_{k+1}^\mathrm{T}\\&= v_k^\mathrm{T} + \left( t_{k+1}-v_k^\mathrm{T}h_{k+1}\right) b_{k+1}^\mathrm{T}, \end{aligned}$$

where \(b_{k+1}\) is obtained according to Theorem 1. Note that the computation of \(b_{k+1}\) and \(c_{k+1}\) still involves \(H_k\) and \(H_k^{\dagger },\) whose sizes grow with the data. Following Kohonen (1988), Kovanic (1979), and van Schaik and Tapson (2015), we introduce two auxiliary matrices here

$$\begin{aligned}&\varPhi _{k} = I-H_k^\mathrm{T}\left( H_k^\mathrm{T}\right) ^{\dagger } \in {\mathbb {R}}^{s\times s},\quad \\&\varTheta _{k}= \left( \left( H_k^\mathrm{T}\right) ^{\dagger }\right) ^\mathrm{T}\left( H_k^\mathrm{T}\right) ^{\dagger }= H_k^{\dagger }\left( H_k^\mathrm{T}\right) ^{\dagger } \in {\mathbb {R}}^{s\times s}, \end{aligned}$$

and rewrite \(b_{k+1}\) and \(c_{k+1}\) as

$$\begin{aligned} c_{k+1} = \varPhi _{k}h_{k+1}, \quad b_{k+1} = \left\{ \begin{array}{ll} \frac{\varTheta _kh_{k+1}}{1+ h_{k+1}^\mathrm{T}\varTheta _kh_{k+1}}, &{}c_{k+1}=0,\\ \frac{c_{k+1}}{\Vert c_{k+1}\Vert ^2}, &{} c_{k+1}\ne 0. \end{array}\right. \end{aligned}$$

The two auxiliary matrices can also be updated adaptively

$$\begin{aligned} \varPhi _{k+1}= & {} I - H_{k+1}^\mathrm{T}\left( H_{k+1}^\mathrm{T}\right) ^{\dagger } \\= & {} I-\left[ H_k^\mathrm{T}h_{k+1}\right] \begin{bmatrix}(H_k^\mathrm{T})^{\dagger }(I-h_{k+1}b_{k+1}^\mathrm{T})\\b_{k+1}^\mathrm{T}\end{bmatrix}\\= & {} \varPhi _k -\varPhi _kh_{k+1}b_{k+1}^\mathrm{T},\\ \varTheta _{k+1}= & {} H_{k+1}^{\dagger }\left( H_{k+1}^\mathrm{T}\right) ^{\dagger } \\= & {} \left[ \left( I-b_{k+1}h_{k+1}^\mathrm{T}\right) H_k^{\dagger }b_{k+1}\right] \begin{bmatrix}(H_k^\mathrm{T})^{\dagger }(I-h_{k+1}b_{k+1}^\mathrm{T})\\b_{k+1}^\mathrm{T}\end{bmatrix}\\= & {} \left( I-b_{k+1}h_{k+1}^\mathrm{T}\right) \varTheta _k\left( I-h_{k+1}b_{k+1}^\mathrm{T}\right) + b_{k+1}b_{k+1}^\mathrm{T}, \end{aligned}$$

and if \(c_{k+1} = 0,\) these formulas simplify to

$$\begin{aligned} \varPhi _{k+1} = \varPhi _k,\quad \varTheta _{k+1} = \varTheta _k - \varTheta _kh_{k+1}b_{k+1}^\mathrm{T}. \end{aligned}$$

\(\square \)
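
Collecting the recursions, here is a sketch of the resulting online solver (class and variable names are ours): it stores only \(v_k\) and the fixed-size \(s\times s\) matrices \(\varPhi _k\) and \(\varTheta _k,\) so the cost per update does not grow with the data, and the exact test \(c_{k+1}=0\) becomes a tolerance check in floating point.

```python
import numpy as np

class OnlineELMWeights:
    """Online output weights per Proposition 1: absorb one training pair
    (hidden output h, target t) at a time without storing H_k or its
    pseudoinverse (cf. van Schaik and Tapson 2015)."""
    def __init__(self, s):
        self.v = np.zeros(s)               # v_0
        self.Phi = np.eye(s)               # Phi_0 = I (H_0 is empty)
        self.Theta = np.zeros((s, s))      # Theta_0 = 0

    def update(self, h, t):
        c = self.Phi @ h                   # c_{k+1} = Phi_k h_{k+1}
        if np.allclose(c, 0.0):            # simplified branch: Phi unchanged
            b = (self.Theta @ h) / (1.0 + h @ self.Theta @ h)
            self.Theta = self.Theta - np.outer(self.Theta @ h, b)
        else:
            b = c / (c @ c)
            self.Phi = self.Phi - np.outer(self.Phi @ h, b)
            P = np.eye(len(h)) - np.outer(h, b)          # I - h b^T
            self.Theta = P.T @ self.Theta @ P + np.outer(b, b)
        self.v = self.v + (t - self.v @ h) * b           # v_{k+1}

# Sanity check against the batch solution v = pinv(H) T.
rng = np.random.default_rng(2)
H, T = rng.standard_normal((20, 5)), rng.standard_normal(20)
solver = OnlineELMWeights(5)
for h_row, t in zip(H, T):
    solver.update(h_row, t)
assert np.allclose(solver.v, np.linalg.pinv(H) @ T)
```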

About this article

Cite this article

Zhang, C., Shahbaba, B. & Zhao, H. Hamiltonian Monte Carlo acceleration using surrogate functions with random bases. Stat Comput 27, 1473–1490 (2017). https://doi.org/10.1007/s11222-016-9699-1
