Surrogate-assisted parallel tempering for Bayesian neural learning

https://doi.org/10.1016/j.engappai.2020.103700

Abstract

Due to the need for robust uncertainty quantification, Bayesian neural learning has gained attention in the era of deep learning and big data. Markov Chain Monte Carlo (MCMC) methods typically implement Bayesian inference, which faces several challenges given a large number of parameters, complex and multimodal posterior distributions, and the computational cost of large neural network models. Parallel tempering MCMC addresses some of these limitations, since it can sample multimodal posterior distributions and utilize high-performance computing. However, challenges remain for large neural network models and big data. Surrogate-assisted optimization estimates the objective function for models that are computationally expensive to evaluate. In this paper, we address the inefficiency of parallel tempering MCMC for large-scale problems by combining its parallel computing features with surrogate-assisted estimation of the likelihood, which describes the plausibility of a model parameter value given specific observed data. Hence, we present surrogate-assisted parallel tempering for Bayesian neural learning for simple to computationally expensive models. Our results demonstrate that the methodology significantly lowers the computational cost while maintaining the quality of decision making with Bayesian neural networks. The method has applications in Bayesian inversion and uncertainty quantification for a broad range of numerical models.

Introduction

Although neural networks have gained significant attention due to the deep learning revolution (Schmidhuber, 2015), several limitations exist. The challenge widens for uncertainty quantification in decision making as new neural network architectures and learning algorithms are developed. Bayesian neural learning provides a probabilistic viewpoint by representing neural network weights as probability distributions (MacKay, 1995, Robert, 2014). Rather than the single-point estimates produced by gradient-based learning methods, these probability distributions naturally account for uncertainty in parameter estimates.

Through Bayesian neural learning, uncertainty regarding the data and model can be propagated into the decision making process. Markov Chain Monte Carlo (MCMC) sampling methods implement Bayesian inference (Hastings, 1970, Metropolis et al., 1953, Tarantola et al., 1982, Mosegaard and Tarantola, 1995) by constructing a Markov chain such that the desired distribution becomes the equilibrium distribution after a given number of steps (Raftery and Lewis, 1996, van Ravenzwaaij et al., 2016). MCMC methods provide numerical approximations of multi-dimensional integrals (Banerjee et al., 2014). They have not gained as much attention in neural learning as their gradient-based counterparts, since convergence becomes computationally expensive for models with a large number of parameters and multimodal posteriors (Robert et al., 2018). MCMC methods typically require thousands of samples to be drawn, depending on the model, which becomes a major limitation in applications such as deep learning (Schmidhuber, 2015, Gal and Ghahramani, 2016, Kendall and Gal, 2017). Hence, other methods for implementing Bayesian inference exist, such as variational inference (Blei et al., 2017, Damianou et al., 2016), which has been used for deep learning (Blundell et al., 2015). Variational inference has the advantage of faster convergence than MCMC methods for large models. However, for computationally expensive models, variational inference faces a similar problem to MCMC, since both need to evaluate model samples for the likelihood; hence, both would benefit from surrogate models.
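
To make the computational bottleneck concrete, the following minimal random-walk Metropolis-Hastings sketch (our own illustration, not the paper's implementation; the network layout and names such as log_likelihood are placeholder assumptions) shows that every proposed weight vector requires a likelihood evaluation over the whole dataset, which is why drawing thousands of samples becomes expensive for large models.

import numpy as np

def log_likelihood(weights, X, y, n_hidden=10):
    # Multinomial log-likelihood of a one-hidden-layer classifier.
    # Every call requires a forward pass over the full dataset, which is
    # the expensive step that dominates MCMC sampling for neural networks.
    n_in, n_out = X.shape[1], int(y.max()) + 1
    w1 = weights[:n_in * n_hidden].reshape(n_in, n_hidden)
    w2 = weights[n_in * n_hidden:].reshape(n_hidden, n_out)
    logits = np.tanh(X @ w1) @ w2
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return log_prob[np.arange(len(y)), y].sum()

def random_walk_mh(X, y, n_samples=5000, step=0.02, sigma=10.0, n_hidden=10):
    n_params = X.shape[1] * n_hidden + n_hidden * (int(y.max()) + 1)
    theta = 0.1 * np.random.randn(n_params)
    log_post = log_likelihood(theta, X, y, n_hidden) - theta @ theta / (2 * sigma ** 2)
    chain = []
    for _ in range(n_samples):
        proposal = theta + step * np.random.randn(n_params)        # random-walk proposal
        log_post_new = log_likelihood(proposal, X, y, n_hidden) - proposal @ proposal / (2 * sigma ** 2)
        if np.log(np.random.rand()) < log_post_new - log_post:     # Metropolis acceptance
            theta, log_post = proposal, log_post_new
        chain.append(theta.copy())
    return np.array(chain)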

Parallel tempering (Swendsen and Wang, 1987, Marinari and Parisi, 1992, Geyer and Thompson, 1995) is an MCMC method that features multiple replicas to provide global and local exploration, which makes it suitable for irregular and multi-modal distributions (Patriksson and van der Spoel, 2008, Hukushima and Nemoto, 1996). During sampling, parallel tempering exchanges neighbouring replicas to balance exploration and exploitation of the search space. Replicas with higher temperature values ensure sufficient exploration, while replicas with lower temperature values exploit the promising areas found during exploration. In contrast to canonical MCMC sampling methods, parallel tempering is more easily implemented on multi-core or parallel computing architectures (Lamport, 1986). In the case of neural networks, parallel tempering has been used for inference of restricted Boltzmann machines (RBMs) (Salakhutdinov et al., 2007, Fischer and Igel, 2015), where it was shown (Desjardins et al., 2010a) to be more effective than Gibbs sampling alone. Parallel tempering for RBMs has been further improved through efficient exchange of information among the replicas (Brakel et al., 2012). These studies motivated the use of parallel tempering in Bayesian neural learning for pattern classification and time series prediction (Chandra et al., 2019a).
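
The replica-exchange step can be summarized by the textbook Metropolis swap criterion; the short sketch below is our own illustration of that standard rule, not an implementation detail taken from the cited works.

import numpy as np

def swap_neighbours(states, log_likelihoods, betas):
    # Attempt a Metropolis swap between each pair of neighbouring replicas.
    # betas are inverse temperatures (1/T): higher-temperature replicas
    # explore, lower-temperature replicas exploit.
    for i in range(len(states) - 1):
        # Standard replica-exchange acceptance probability:
        # alpha = min(1, exp((beta_i - beta_j) * (L_j - L_i)))
        log_alpha = (betas[i] - betas[i + 1]) * (log_likelihoods[i + 1] - log_likelihoods[i])
        if np.log(np.random.rand()) < log_alpha:
            states[i], states[i + 1] = states[i + 1], states[i]
            log_likelihoods[i], log_likelihoods[i + 1] = log_likelihoods[i + 1], log_likelihoods[i]
    return states, log_likelihoods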

Surrogate-assisted optimization (Hicks and Henne, 1978, Jin, 2011) uses machine learning methods such as Gaussian process and neural network models to estimate the objective function during optimization. This is useful when evaluating the objective function is too time-consuming. In the past, metaheuristic and evolutionary optimization methods have been used in surrogate-assisted optimization (Ong et al., 2003, Zhou et al., 2007). Surrogate-assisted optimization has been valuable in engine and aerospace design for replicating computationally expensive models (Ong et al., 2005, Jeong et al., 2005, Samad et al., 2008, Hicks and Henne, 1978). The optimization literature thus motivates improving parallel tempering with a low-cost replica of the actual model, via a surrogate, to lower computational costs.
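
To make the idea concrete, the sketch below trains a surrogate on a handful of expensive objective evaluations and then screens many candidates with cheap surrogate predictions. It is an illustrative assumption on our part (using scikit-learn's GaussianProcessRegressor and a toy expensive_objective), not the surrogate used in the paper.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def expensive_objective(x):
    # Stand-in for a costly simulation or model evaluation.
    return float(np.sum(np.sin(3 * x) + x ** 2))

# A small design of experiments evaluated with the true (expensive) objective.
X_train = np.random.uniform(-2, 2, size=(20, 3))
y_train = np.array([expensive_objective(x) for x in X_train])

# Fit the surrogate once on the expensive evaluations.
surrogate = GaussianProcessRegressor().fit(X_train, y_train)

# Screen many candidates cheaply; only the most promising one is evaluated
# with the expensive objective.
candidates = np.random.uniform(-2, 2, size=(1000, 3))
predicted = surrogate.predict(candidates)
best = candidates[np.argmin(predicted)]
true_value = expensive_objective(best)   # a single expensive call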

In the case of conventional Bayesian neural learning, much of the literature concentrated on smaller problems in terms of datasets and network architectures (Richard and Lippmann, 1991, MacKay, 1996, MacKay, 1995, Robert, 2014) due to the computational cost of MCMC methods. Hence, parallel computing has been used in the implementation of parallel tempering for Bayesian neural learning (Chandra et al., 2019a), where computational time was significantly decreased due to parallelization. Moreover, the method achieved better prediction accuracy and convergence due to the exploration features of parallel tempering. We believe that this can be further improved by incorporating notions from surrogate-assisted optimization into parallel tempering, where the likelihood function is at times estimated rather than evaluated, within a high-performance computing environment.

We note that some work has been done using surrogate-assisted Bayesian inference. Wang et al. (2016) presented a method for material identification using surrogate-assisted Bayesian inference to estimate the parameters of advanced high-strength steel used in vehicles. Ray and Myer (2019) used Gaussian process-based surrogate models with MCMC for geophysical inversion problems. The benefits of surrogate-assisted methods for computationally expensive optimization problems motivate their use in parallel tempering for computationally expensive models. To our knowledge, there is no work on parallel tempering with surrogate models implemented via parallel computing for machine learning problems. In the case of parallel tempering that uses parallel computing, the challenge is to develop a paradigm in which different replicas can communicate efficiently. Moreover, training the surrogate model from data gathered across multiple replicas in parallel poses further challenges.

In this paper, we present surrogate-assisted parallel tempering for Bayesian neural learning, where a surrogate is used to estimate the likelihood rather than evaluating the actual model, which may feature a large number of parameters and large datasets. We present a framework that seamlessly incorporates decision making by a master surrogate for the parallel processing cores that execute the respective replicas of parallel tempering MCMC. Although the framework is intended for general computationally expensive models, we demonstrate its effectiveness using a neural network model for classification problems. The major contribution of this paper is to address the limitations of parallel tempering for computationally expensive models.
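
One possible organization of the replica-to-manager communication is sketched below using Python's multiprocessing primitives. This is a minimal sketch of our own under stated assumptions, not the paper's code: replica_worker, the queue layout, and the always-accept chain are hypothetical placeholders, with the shared point being that replicas report their true-model evaluations to a manager that maintains the surrogate.

import multiprocessing as mp
import numpy as np

def replica_worker(replica_id, n_steps, result_queue):
    # Each replica samples its own chain and reports every true-model
    # evaluation (proposal, log-likelihood) to the manager so that a
    # shared surrogate can be trained from data across all replicas.
    theta = np.random.randn(5)
    for _ in range(n_steps):
        proposal = theta + 0.1 * np.random.randn(5)
        log_lik = -float(np.sum(proposal ** 2))   # placeholder for the true likelihood
        result_queue.put((replica_id, proposal, log_lik))
        theta = proposal                           # sketch only: always accept

if __name__ == "__main__":
    n_replicas, n_steps = 4, 100
    queue = mp.Queue()
    workers = [mp.Process(target=replica_worker, args=(i, n_steps, queue))
               for i in range(n_replicas)]
    for w in workers:
        w.start()
    # The manager collects (proposal, likelihood) pairs as surrogate training data.
    training_data = [queue.get() for _ in range(n_replicas * n_steps)]
    for w in workers:
        w.join()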

The rest of the paper is organized as follows. Section 2 provides background and related work, while Section 3 presents the proposed methodology. Section 4 presents experiments and results, and Section 5 concludes the paper with a discussion of future research.


Bayesian neural learning

In Bayesian inference, we update the probability for a hypothesis as more evidence or information becomes available (Freedman, 1963). We estimate the posterior distribution by sampling, using a prior distribution and a ‘likelihood function’ that evaluates the model with observed data. A probabilistic perspective treats learning as equivalent to maximum likelihood estimation (MLE) (White, 1982). Given that the neural network is the model, we base the prior distribution on belief or expert opinion.
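
For reference, the standard formulation underlying this setup (written here in generic notation; the paper's exact prior and likelihood may differ) is Bayes' rule for the network weights given the data, with a typical multinomial likelihood for classification:

% Posterior over network weights \theta given data D
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} \propto p(D \mid \theta)\, p(\theta)

% A common multinomial likelihood for K-class classification, with softmax
% outputs f_k(x_i; \theta) and one-hot labels y_{ik}:
p(D \mid \theta) = \prod_{i=1}^{N} \prod_{k=1}^{K} f_k(x_i; \theta)^{\,y_{ik}}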

Surrogate-assisted multi-core parallel tempering

Surrogate models learn to mimic the actual (true) model from its behaviour, i.e. how the true model responds to a set of input parameters. A surrogate model captures the relationship between the input and the output given by the true model. Here, the input is the set of proposals in parallel tempering MCMC, comprising the weights and biases of the neural network model. Hence, we utilize the surrogate model to approximate the likelihood of the true model. We define the approximation of the
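
A minimal sketch of this pseudo-likelihood idea follows; it is our own illustration under the assumption that a simple regressor (scikit-learn's MLPRegressor here) stands in for the paper's surrogate choice, with synthetic placeholder training data.

import numpy as np
from sklearn.neural_network import MLPRegressor

# Proposals (flattened weights and biases) paired with the log-likelihood
# values obtained from true-model evaluations across the replicas.
thetas = np.random.randn(500, 50)
log_liks = -np.sum(thetas ** 2, axis=1)        # placeholder for true evaluations

# Train the surrogate to map a proposal to its log-likelihood.
surrogate = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000).fit(thetas, log_liks)

# During sampling, a new proposal can be scored with the cheap
# pseudo-likelihood instead of evaluating the expensive true model.
new_proposal = np.random.randn(1, 50)
pseudo_log_lik = surrogate.predict(new_proposal)[0]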

Experiments and results

In this section, we present an experimental analysis of surrogate-assisted parallel tempering (SAPT) for Bayesian neural learning. The experiments consider a wide range of issues that test the accuracy of the pseudo-likelihood estimated by the surrogate, the quality of decision making as measured by classification performance, and the amount of computational time saved.

Discussion

The results, in general, have shown that surrogate-assisted parallel tempering is beneficial for larger datasets and models, as demonstrated with Bayesian neural network architectures for the Pen-Digit and Chess classification problems. This implies that the method would be very useful for large-scale models, where computational time can be lowered while maintaining performance in decision making, such as classification accuracy. We observed that, in general, the Langevin-gradients improve the accuracy

Conclusions and future work

We presented surrogate-assisted parallel tempering for implementing Bayesian inference for computationally expensive problems, harnessing the advantages of parallel processing. We used a Bayesian neural network model to demonstrate the effectiveness of the framework for computationally expensive problems. The results from the experiments reveal that the method gives promising performance, with computational time reduced for larger problems.

The surrogate-based framework is flexible and can

Software and data

We provide an open-source implementation of the proposed algorithm in Python along with data and sample results.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors would like to thank Prof. Dietmar Muller and Danial Azam for discussions and support during the course of this research project. We sincerely thank the editors and anonymous reviewers for their valuable comments.

References (91)

  • Auld, T., et al., 2007. Bayesian neural networks for Internet traffic classification. IEEE Trans. Neural Netw.
  • Banerjee, S., et al., 2014. Hierarchical Modeling and Analysis for Spatial Data.
  • Bishop, C., et al., 1995. Neural Networks for Pattern Recognition.
  • Bittner, E., et al., 2008. Make life simple: Unleash the full power of the parallel tempering algorithm. Phys. Rev. Lett.
  • Blei, D.M., et al., 2017. Variational inference: A review for statisticians. J. Amer. Statist. Assoc.
  • Blundell, C., Cornebise, J., Kavukcuoglu, K., Wierstra, D., 2015. Weight uncertainty in neural network. In: Proceedings...
  • Brakel, P., et al., 2012. Training restricted Boltzmann machines with multi-tempering: harnessing parallelization.
  • Broomhead, D.S., et al., 1988. Radial Basis Functions, Multi-Variable Functional Interpolation and Adaptive Networks. Tech. Rep.
  • Calvo, F., 2005. All-exchanges parallel tempering. J. Chem. Phys.
  • Chandra, R., et al., 2017. Bayesian neural learning via Langevin dynamics for chaotic time series prediction. In: International Conference on Neural Information Processing.
  • Chandra, R., et al., 2019. Multicore parallel tempering Bayeslands for basin and landscape evolution. Geochemistry, Geophysics, Geosystems.
  • Cho, K., et al. Improved learning of Gaussian-Bernoulli restricted Boltzmann machines.
  • Cho, K., et al. Parallel tempering is efficient for learning restricted Boltzmann machines.
  • Damianou, A.C., et al., 2016. Variational inference for latent variables and uncertain inputs in Gaussian processes. J. Mach. Learn. Res.
  • Desjardins, G., et al., 2010. Adaptive parallel tempering for stochastic maximum likelihood learning of RBMs.
  • Desjardins, G., Courville, A., Bengio, Y., Vincent, P., Delalleau, O., 2010. Tempered Markov chain Monte Carlo for...
  • Desjardins, G., et al., 2014. Deep tempering.
  • Díaz-Manríquez, A., et al., 2016. A review of surrogate assisted multiobjective evolutionary algorithms. Comput. Intell. Neurosci.
  • Dua, D., et al., 2017. UCI Machine Learning Repository.
  • Duchi, J., et al., 2011. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res.
  • Earl, D.J., et al., 2005. Parallel tempering: Theory, applications, and new perspectives. Phys. Chem. Chem. Phys.
  • Freedman, D.A., 1963. On the asymptotic behavior of Bayes’ estimates in the discrete case. Ann. Math. Stat.
  • Gal, Y., Ghahramani, Z., 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning....
  • Geyer, C.J., et al., 1995. Annealing Markov chain Monte Carlo with applications to ancestral inference. J. Amer. Statist. Assoc.
  • Giunta, A., Watson, L. A comparison of approximation modeling techniques-Polynomial versus interpolating models....
  • Hastings, W.K., 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika.
  • Hicks, R.M., et al., 1978. Wing design by numerical optimization. J. Aircr.
  • Hinton, G.E., et al., 2006. A fast learning algorithm for deep belief nets. Neural Comput.
  • Hinton, G., et al., 2012. Neural networks for machine learning, lecture 6a: overview of mini-batch gradient descent.
  • Hukushima, K., et al., 1996. Exchange Monte Carlo method and application to spin glass simulations. J. Phys. Soc. Japan.
  • Jeong, S., et al., 2005. Efficient optimization design method using kriging model. J. Aircr.
  • Jin, R., et al., 2001. Comparative studies of metamodelling techniques under multiple modelling criteria. Struct. Multidiscip. Optim.
  • Karimi, K., et al., 2011. High-performance physics simulations using multi-core CPUs and GPGPUs in a volunteer computing context. Int. J. High Perform. Comput. Appl.
  • Katzgraber, H.G., et al., 2006. Feedback-optimized parallel tempering Monte Carlo. J. Stat. Mech. Theory Exp.
  • Kendall, A., et al., 2017. What uncertainties do we need in Bayesian deep learning for computer vision?