Surrogate-assisted parallel tempering for Bayesian neural learning

https://doi.org/10.1016/j.engappai.2020.103700

Abstract

Due to the need for robust uncertainty quantification, Bayesian neural learning has gained attention in the era of deep learning and big data. Markov Chain Monte Carlo (MCMC) methods typically implement Bayesian inference, which faces several challenges given a large number of parameters, complex and multimodal posterior distributions, and the computational cost of large neural network models. Parallel tempering MCMC addresses some of these limitations, since it can sample multimodal posterior distributions and utilize high-performance computing. However, challenges remain for large neural network models and big data. Surrogate-assisted optimization estimates the objective function for models that are computationally expensive to evaluate. In this paper, we address the inefficiency of parallel tempering MCMC for large-scale problems by combining its parallel computing features with surrogate-assisted estimation of the likelihood, which describes the plausibility of a model parameter value given specific observed data. Hence, we present surrogate-assisted parallel tempering for Bayesian neural learning for simple to computationally expensive models. Our results demonstrate that the methodology significantly lowers the computational cost while maintaining the quality of decision making with Bayesian neural networks. The method has applications in Bayesian inversion and uncertainty quantification for a broad range of numerical models.

Introduction

Although neural networks have gained significant attention due to the deep learning revolution (Schmidhuber, 2015), several limitations exist. The challenge widens for uncertainty quantification in decision making as new neural network architectures and learning algorithms are developed. Bayesian neural learning provides a probabilistic viewpoint by representing neural network weights as probability distributions (MacKay, 1995, Robert, 2014). Rather than the single-point estimates produced by gradient-based learning methods, these probability distributions naturally account for uncertainty in parameter estimates.

Through Bayesian neural learning, uncertainty regarding the data and model can be propagated into the decision making process. Markov Chain Monte Carlo (MCMC) sampling methods implement Bayesian inference (Hastings, 1970, Metropolis et al., 1953, Tarantola et al., 1982, Mosegaard and Tarantola, 1995) by constructing a Markov chain such that the desired distribution becomes the equilibrium distribution after a given number of steps (Raftery and Lewis, 1996, van Ravenzwaaij et al., 2016). MCMC methods provide numerical approximations of multi-dimensional integrals (Banerjee et al., 2014). They have not gained as much attention in neural learning as their gradient-based counterparts, since convergence becomes computationally expensive for models with a large number of parameters and multimodal posteriors (Robert et al., 2018). MCMC methods typically require thousands of samples to be drawn, depending on the model, which becomes a major limitation in applications such as deep learning (Schmidhuber, 2015, Gal and Ghahramani, 2016, Kendall and Gal, 2017). Hence, other methods for implementing Bayesian inference exist, such as variational inference (Blei et al., 2017, Damianou et al., 2016), which has been used for deep learning (Blundell et al., 2015). Variational inference has the advantage of faster convergence than MCMC methods for large models. However, for computationally expensive models, variational inference faces a similar problem to MCMC, since both need to evaluate model samples for the likelihood; hence, both would benefit from surrogate models.
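
To make the computational bottleneck concrete, the following minimal random-walk Metropolis-Hastings sketch (our own illustration, not the paper's implementation; the network layout and names such as log_likelihood are placeholder assumptions) shows that every proposed weight vector requires a likelihood evaluation over the whole dataset, which is why drawing thousands of samples becomes expensive for large models.

import numpy as np

def log_likelihood(weights, X, y, n_hidden=10):
    # Multinomial log-likelihood of a one-hidden-layer classifier.
    # Every call requires a forward pass over the full dataset, which is
    # the expensive step that dominates MCMC sampling for neural networks.
    n_in, n_out = X.shape[1], int(y.max()) + 1
    w1 = weights[:n_in * n_hidden].reshape(n_in, n_hidden)
    w2 = weights[n_in * n_hidden:].reshape(n_hidden, n_out)
    logits = np.tanh(X @ w1) @ w2
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return log_prob[np.arange(len(y)), y].sum()

def random_walk_mh(X, y, n_samples=5000, step=0.02, sigma=10.0, n_hidden=10):
    n_params = X.shape[1] * n_hidden + n_hidden * (int(y.max()) + 1)
    theta = 0.1 * np.random.randn(n_params)
    log_post = log_likelihood(theta, X, y, n_hidden) - theta @ theta / (2 * sigma ** 2)
    chain = []
    for _ in range(n_samples):
        proposal = theta + step * np.random.randn(n_params)        # random-walk proposal
        log_post_new = log_likelihood(proposal, X, y, n_hidden) - proposal @ proposal / (2 * sigma ** 2)
        if np.log(np.random.rand()) < log_post_new - log_post:     # Metropolis acceptance
            theta, log_post = proposal, log_post_new
        chain.append(theta.copy())
    return np.array(chain)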

Parallel tempering (Swendsen and Wang, 1987, Marinari and Parisi, 1992, Geyer and Thompson, 1995) is an MCMC method that features multiple replicas to provide global and local exploration, which makes it suitable for irregular and multi-modal distributions (Patriksson and van der Spoel, 2008, Hukushima and Nemoto, 1996). During sampling, parallel tempering exchanges neighbouring replicas to balance exploration and exploitation of the search space. Replicas with higher temperature values ensure sufficient exploration, while replicas with lower temperature values exploit the promising areas found during exploration. In contrast to canonical MCMC sampling methods, parallel tempering is more easily implemented on multi-core or parallel computing architectures (Lamport, 1986). In the case of neural networks, parallel tempering has been used for inference of restricted Boltzmann machines (RBMs) (Salakhutdinov et al., 2007, Fischer and Igel, 2015), where it was shown (Desjardins et al., 2010a) to be more effective than Gibbs sampling alone. Parallel tempering for RBMs has been further improved through efficient exchange of information among the replicas (Brakel et al., 2012). These studies motivated the use of parallel tempering in Bayesian neural learning for pattern classification and time series prediction (Chandra et al., 2019a).
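
The replica-exchange step can be summarized by the textbook Metropolis swap criterion; the short sketch below is our own illustration of that standard rule, not an implementation detail taken from the cited works.

import numpy as np

def swap_neighbours(states, log_likelihoods, betas):
    # Attempt a Metropolis swap between each pair of neighbouring replicas.
    # betas are inverse temperatures (1/T): higher-temperature replicas
    # explore, lower-temperature replicas exploit.
    for i in range(len(states) - 1):
        # Standard replica-exchange acceptance probability:
        # alpha = min(1, exp((beta_i - beta_j) * (L_j - L_i)))
        log_alpha = (betas[i] - betas[i + 1]) * (log_likelihoods[i + 1] - log_likelihoods[i])
        if np.log(np.random.rand()) < log_alpha:
            states[i], states[i + 1] = states[i + 1], states[i]
            log_likelihoods[i], log_likelihoods[i + 1] = log_likelihoods[i + 1], log_likelihoods[i]
    return states, log_likelihoods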

Surrogate-assisted optimization (Hicks and Henne, 1978, Jin, 2011) uses machine learning methods such as Gaussian process and neural network models to estimate the objective function during optimization. This is useful when evaluating the objective function is too time-consuming. In the past, metaheuristic and evolutionary optimization methods have been used in surrogate-assisted optimization (Ong et al., 2003, Zhou et al., 2007). Surrogate-assisted optimization has been valuable in engine and aerospace design for replicating computationally expensive models (Ong et al., 2005, Jeong et al., 2005, Samad et al., 2008, Hicks and Henne, 1978). The optimization literature thus motivates improving parallel tempering with a low-cost replica of the actual model, via a surrogate, to lower computational costs.
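
To make the idea concrete, the sketch below trains a surrogate on a handful of expensive objective evaluations and then screens many candidates with cheap surrogate predictions. It is an illustrative assumption on our part (using scikit-learn's GaussianProcessRegressor and a toy expensive_objective), not the surrogate used in the paper.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def expensive_objective(x):
    # Stand-in for a costly simulation or model evaluation.
    return float(np.sum(np.sin(3 * x) + x ** 2))

# A small design of experiments evaluated with the true (expensive) objective.
X_train = np.random.uniform(-2, 2, size=(20, 3))
y_train = np.array([expensive_objective(x) for x in X_train])

# Fit the surrogate once on the expensive evaluations.
surrogate = GaussianProcessRegressor().fit(X_train, y_train)

# Screen many candidates cheaply; only the most promising one is evaluated
# with the expensive objective.
candidates = np.random.uniform(-2, 2, size=(1000, 3))
predicted = surrogate.predict(candidates)
best = candidates[np.argmin(predicted)]
true_value = expensive_objective(best)   # a single expensive call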

In the case of conventional Bayesian neural learning, much of the literature concentrated on smaller problems in terms of datasets and network architectures (Richard and Lippmann, 1991, MacKay, 1996, MacKay, 1995, Robert, 2014) due to the computational cost of MCMC methods. Hence, parallel computing has been used in the implementation of parallel tempering for Bayesian neural learning (Chandra et al., 2019a), where computational time was significantly decreased due to parallelization. Moreover, the method achieved better prediction accuracy and convergence due to the exploration features of parallel tempering. We believe that this can be further improved by incorporating notions from surrogate-assisted optimization into parallel tempering, where the likelihood function is at times estimated rather than evaluated, within a high-performance computing environment.

We note that some work has been done using surrogate-assisted Bayesian inference. Wang et al. (2016) presented a method for material identification using surrogate-assisted Bayesian inference to estimate the parameters of advanced high-strength steel used in vehicles. Ray and Myer (2019) used Gaussian process-based surrogate models with MCMC for geophysical inversion problems. The benefits of surrogate-assisted methods for computationally expensive optimization problems motivate their use in parallel tempering for computationally expensive models. To our knowledge, there is no work on parallel tempering with surrogate models implemented via parallel computing for machine learning problems. In the case of parallel tempering that uses parallel computing, the challenge is to develop a paradigm in which different replicas can communicate efficiently. Moreover, training the surrogate model from data gathered across multiple replicas in parallel poses further challenges.

In this paper, we present surrogate-assisted parallel tempering for Bayesian neural learning, where a surrogate is used to estimate the likelihood rather than evaluating the actual model, which may feature a large number of parameters and large datasets. We present a framework that seamlessly incorporates decision making by a master surrogate for the parallel processing cores that execute the respective replicas of parallel tempering MCMC. Although the framework is intended for general computationally expensive models, we demonstrate its effectiveness using a neural network model for classification problems. The major contribution of this paper is to address the limitations of parallel tempering for computationally expensive models.
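
One possible organization of the replica-to-manager communication is sketched below using Python's multiprocessing primitives. This is a minimal sketch of our own under stated assumptions, not the paper's code: replica_worker, the queue layout, and the always-accept chain are hypothetical placeholders, with the shared point being that replicas report their true-model evaluations to a manager that maintains the surrogate.

import multiprocessing as mp
import numpy as np

def replica_worker(replica_id, n_steps, result_queue):
    # Each replica samples its own chain and reports every true-model
    # evaluation (proposal, log-likelihood) to the manager so that a
    # shared surrogate can be trained from data across all replicas.
    theta = np.random.randn(5)
    for _ in range(n_steps):
        proposal = theta + 0.1 * np.random.randn(5)
        log_lik = -float(np.sum(proposal ** 2))   # placeholder for the true likelihood
        result_queue.put((replica_id, proposal, log_lik))
        theta = proposal                           # sketch only: always accept

if __name__ == "__main__":
    n_replicas, n_steps = 4, 100
    queue = mp.Queue()
    workers = [mp.Process(target=replica_worker, args=(i, n_steps, queue))
               for i in range(n_replicas)]
    for w in workers:
        w.start()
    # The manager collects (proposal, likelihood) pairs as surrogate training data.
    training_data = [queue.get() for _ in range(n_replicas * n_steps)]
    for w in workers:
        w.join()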

The rest of the paper is organized as follows. Section 2 provides background and related work, while Section 3 presents the proposed methodology. Section 4 presents experiments and results, and Section 5 concludes the paper with a discussion of future research.


Bayesian neural learning

In Bayesian inference, we update the probability for a hypothesis as more evidence or information becomes available (Freedman, 1963). We estimate the posterior distribution by sampling, using a prior distribution and a ‘likelihood function’ that evaluates the model with observed data. A probabilistic perspective treats learning as equivalent to maximum likelihood estimation (MLE) (White, 1982). Given that the neural network is the model, we base the prior distribution on belief or expert opinion.
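
For reference, the standard formulation underlying this setup (written here in generic notation; the paper's exact prior and likelihood may differ) is Bayes' rule for the network weights given the data, with a typical multinomial likelihood for classification:

% Posterior over network weights \theta given data D
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} \propto p(D \mid \theta)\, p(\theta)

% A common multinomial likelihood for K-class classification, with softmax
% outputs f_k(x_i; \theta) and one-hot labels y_{ik}:
p(D \mid \theta) = \prod_{i=1}^{N} \prod_{k=1}^{K} f_k(x_i; \theta)^{\,y_{ik}}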

Surrogate-assisted multi-core parallel tempering

Surrogate models learn to mimic the actual (true) model from its behaviour, i.e. how the true model responds to a set of input parameters. A surrogate model captures the relationship between the input and the output given by the true model. Here, the input is the set of proposals in parallel tempering MCMC, comprising the weights and biases of the neural network model. Hence, we utilize the surrogate model to approximate the likelihood of the true model. We define the approximation of the
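
A minimal sketch of this pseudo-likelihood idea follows; it is our own illustration under the assumption that a simple regressor (scikit-learn's MLPRegressor here) stands in for the paper's surrogate choice, with synthetic placeholder training data.

import numpy as np
from sklearn.neural_network import MLPRegressor

# Proposals (flattened weights and biases) paired with the log-likelihood
# values obtained from true-model evaluations across the replicas.
thetas = np.random.randn(500, 50)
log_liks = -np.sum(thetas ** 2, axis=1)        # placeholder for true evaluations

# Train the surrogate to map a proposal to its log-likelihood.
surrogate = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000).fit(thetas, log_liks)

# During sampling, a new proposal can be scored with the cheap
# pseudo-likelihood instead of evaluating the expensive true model.
new_proposal = np.random.randn(1, 50)
pseudo_log_lik = surrogate.predict(new_proposal)[0]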

Experiments and results

In this section, we present an experimental analysis of surrogate-assisted parallel tempering (SAPT) for Bayesian neural learning. The experiments consider a wide range of issues that test the accuracy of the pseudo-likelihood estimated by the surrogate, the quality of decision making as measured by classification performance, and the amount of computational time saved.

Discussion

The results, in general, have shown that surrogate-assisted parallel tempering is beneficial for larger datasets and models, as demonstrated with Bayesian neural network architectures for the Pen-Digit and Chess classification problems. This implies that the method would be very useful for large-scale models, where computational time can be lowered while maintaining performance in decision making, such as classification accuracy. We observed that, in general, the Langevin-gradients improve the accuracy

Conclusions and future work

We presented surrogate-assisted parallel tempering for implementing Bayesian inference for computationally expensive problems, harnessing the advantages of parallel processing. We used a Bayesian neural network model to demonstrate the effectiveness of the framework for computationally expensive problems. The results from the experiments reveal that the method gives promising performance, with computational time reduced for larger problems.

The surrogate-based framework is flexible and can

Software and data

We provide an open-source implementation of the proposed algorithm in Python along with data and sample results.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors would like to thank Prof. Dietmar Muller and Danial Azam for discussions and support during the course of this research project. We sincerely thank the editors and anonymous reviewers for their valuable comments.

References (91)

  • Auld, T., et al., 2007. Bayesian neural networks for Internet traffic classification. IEEE Trans. Neural Netw.
  • Banerjee, S., et al., 2014. Hierarchical Modeling and Analysis for Spatial Data.
  • Bishop, C., et al., 1995. Neural Networks for Pattern Recognition.
  • Bittner, E., et al., 2008. Make life simple: Unleash the full power of the parallel tempering algorithm. Phys. Rev. Lett.
  • Blei, D.M., et al., 2017. Variational inference: A review for statisticians. J. Amer. Statist. Assoc.
  • Blundell, C., Cornebise, J., Kavukcuoglu, K., Wierstra, D., 2015. Weight uncertainty in neural network. In: Proceedings...
  • Brakel, P., et al., 2012. Training restricted Boltzmann machines with multi-tempering: harnessing parallelization.
  • Broomhead, D.S., et al., 1988. Radial Basis Functions, Multi-Variable Functional Interpolation and Adaptive Networks. Tech. Rep.
  • Calvo, F., 2005. All-exchanges parallel tempering. J. Chem. Phys.
  • Chandra, R., et al., 2017. Bayesian neural learning via Langevin dynamics for chaotic time series prediction. In: International Conference on Neural Information Processing.
  • Chandra, R., et al., 2019. Multicore parallel tempering Bayeslands for basin and landscape evolution. Geochemistry, Geophysics, Geosystems.
  • Cho, K., et al. Improved learning of Gaussian-Bernoulli restricted Boltzmann machines.
  • Cho, K., et al. Parallel tempering is efficient for learning restricted Boltzmann machines.
  • Damianou, A.C., et al., 2016. Variational inference for latent variables and uncertain inputs in Gaussian processes. J. Mach. Learn. Res.
  • Desjardins, G., et al., 2010. Adaptive parallel tempering for stochastic maximum likelihood learning of RBMs.
  • Desjardins, G., Courville, A., Bengio, Y., Vincent, P., Delalleau, O., 2010. Tempered Markov chain Monte Carlo for...
  • Desjardins, G., et al., 2014. Deep tempering.
  • Díaz-Manríquez, A., et al., 2016. A review of surrogate assisted multiobjective evolutionary algorithms. Comput. Intell. Neurosci.
  • Dua, D., et al., 2017. UCI Machine Learning Repository.
  • Duchi, J., et al., 2011. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res.
  • Earl, D.J., et al., 2005. Parallel tempering: Theory, applications, and new perspectives. Phys. Chem. Chem. Phys.
  • Freedman, D.A., 1963. On the asymptotic behavior of Bayes’ estimates in the discrete case. Ann. Math. Stat.
  • Gal, Y., Ghahramani, Z., 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning....
  • Geyer, C.J., et al., 1995. Annealing Markov chain Monte Carlo with applications to ancestral inference. J. Amer. Statist. Assoc.
  • Giunta, A., Watson, L. A comparison of approximation modeling techniques-Polynomial versus interpolating models....
  • Hastings, W.K., 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika.
  • Hicks, R.M., et al., 1978. Wing design by numerical optimization. J. Aircr.
  • Hinton, G.E., et al., 2006. A fast learning algorithm for deep belief nets. Neural Comput.
  • Hinton, G., et al., 2012. Neural networks for machine learning, lecture 6a: overview of mini-batch gradient descent.
  • Hukushima, K., et al., 1996. Exchange Monte Carlo method and application to spin glass simulations. J. Phys. Soc. Japan.
  • Jeong, S., et al., 2005. Efficient optimization design method using kriging model. J. Aircr.
  • Jin, R., et al., 2001. Comparative studies of metamodelling techniques under multiple modelling criteria. Struct. Multidiscip. Optim.
  • Karimi, K., et al., 2011. High-performance physics simulations using multi-core CPUs and GPGPUs in a volunteer computing context. Int. J. High Perform. Comput. Appl.
  • Katzgraber, H.G., et al., 2006. Feedback-optimized parallel tempering Monte Carlo. J. Stat. Mech. Theory Exp.
  • Kendall, A., et al., 2017. What uncertainties do we need in Bayesian deep learning for computer vision?