Vecchia–Laplace approximations of generalized Gaussian processes for big non-Gaussian spatial data

https://doi.org/10.1016/j.csda.2020.107081

Abstract

Generalized Gaussian processes (GGPs) are highly flexible models that combine latent GPs with potentially non-Gaussian likelihoods from the exponential family. GGPs can be used in a variety of settings, including GP classification, nonparametric count regression, modeling non-Gaussian spatial data, and analyzing point patterns. However, inference for GGPs can be analytically intractable, and large datasets pose computational challenges due to the inversion of the GP covariance matrix. A Vecchia–Laplace approximation for GGPs is proposed, which combines a Laplace approximation to the non-Gaussian likelihood with a computationally efficient Vecchia approximation to the GP, resulting in simple, general, scalable, and accurate methodology. Numerical studies and comparisons on simulated and real spatial data are provided. The methods are implemented in a freely available R package.

Introduction

Dependent non-Gaussian data are ubiquitous in time series, geospatial applications, and more generally in nonparametric regression and classification. A popular model for such data is obtained by combining a latent Gaussian process (GP) with conditionally independent, potentially non-Gaussian likelihoods from the exponential family. This is traditionally referred to as a spatial generalized linear mixed model (SGLMM) in the spatial statistics literature (Diggle et al., 1998), but the same model has more recently also been referred to as a generalized GP (GGP) (e.g., Chan and Dong, 2011); we will use the latter, more concise term throughout. GGPs are highly flexible, interpretable, and allow for natural, probabilistic uncertainty quantification. However, inference for GGPs can be analytically intractable, and large datasets pose additional computational challenges due to the inversion of the GP covariance matrix.

Popular techniques to numerically perform the intractable marginalization necessary for inference are, in order of increasing speed: Markov chain Monte Carlo (MCMC), expectation propagation, variational methods, and Laplace approximations. See Shang and Chan (2013) for a recent review of deterministic techniques and Filippone and Girolami (2014) for a comparison of MCMC and expectation propagation. Rue et al. (2009) argue that variational methods and expectation propagation suffer from underestimated and overestimated posterior variances, respectively. Here, we consider the Laplace approximation (e.g., Tierney and Kadane, 1986, Williams and Barber, 1998), a classic technique for integral evaluation based on second-order Taylor expansion. Bonat and Ribeiro (2016) show numerically that the Laplace approximation can be a practical and accurate method for GGP inference.
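As a concrete one-dimensional illustration of the Laplace idea (ours, not from the paper), the following sketch approximates an intractable Poisson-lognormal marginal likelihood by a second-order Taylor expansion of the joint log-density around its mode, and compares the result against numerical quadrature; the model and all parameter values are purely illustrative.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.special import gammaln

# Toy example: Laplace-approximate p(z) = int Pois(z | e^y) N(y | 0, s2) dy
# via p(z) ~ exp(h(y_hat)) * sqrt(2*pi / (-h''(y_hat))), where
# h(y) = log p(z | y) + log p(y) and y_hat is its mode.
z, s2 = 3, 1.0

def h(y):   # joint log-density of (z, y) as a function of the latent y
    return (z * y - np.exp(y) - gammaln(z + 1)
            - 0.5 * (np.log(2 * np.pi * s2) + y ** 2 / s2))

def h1(y):  # h'(y)
    return z - np.exp(y) - y / s2

def h2(y):  # h''(y); strictly negative, so h is concave with a unique mode
    return -np.exp(y) - 1.0 / s2

y_hat = brentq(h1, -10.0, 10.0)                     # posterior mode
laplace = np.exp(h(y_hat)) * np.sqrt(2 * np.pi / -h2(y_hat))
exact, _ = quad(lambda y: np.exp(h(y)), -10.0, 10.0)
```

Because the integrand is unimodal and close to Gaussian on the log scale, the Laplace value typically agrees with quadrature to within a few percent here.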

It has long been recognized that the computational cost for GPs is cubic in the data size. Much work has been done on GP approximations that address this problem in the context of Gaussian noise (as recently reviewed in Heaton et al., 2019). Low-rank approaches (e.g., Higdon, 1998, Quiñonero-Candela and Rasmussen, 2005, Banerjee et al., 2008, Cressie and Johannesson, 2008, Katzfuss and Cressie, 2011) have limitations in the presence of fine-scale structure (Stein, 2014), but they have proved popular due to their simplicity. Approximations relying on sparsity in covariance matrices (Furrer et al., 2006, Kaufman et al., 2008) by definition can only capture local, short-range dependence and cannot guarantee low computation cost. Approaches that take advantage of Toeplitz or Kronecker structure (e.g., Dietrich and Newsam, 1997, Flaxman et al., 2015, Guinness and Fuentes, 2017) can be extremely efficient but are not as generally applicable. Recently, methods relying on sparsity in precision matrices (Rue and Held, 2005, Lindgren et al., 2011, Nychka et al., 2015) have gained popularity due to their accuracy and flexibility. In particular, a class of highly promising GP approximations (Vecchia, 1988, Stein et al., 2004, Datta et al., 2016, Guinness, 2018, Katzfuss and Guinness, 2019, Katzfuss et al., 2020a) rely on ordered conditional independence that can guarantee linear scaling in the data size while resolving dependence at all scales.
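To make the ordered-conditional-independence idea concrete, here is a minimal sketch (ours, not the general framework of the papers cited above). Vecchia-type approximations replace each conditional p(y_i | y_1, …, y_{i-1}) in the exact factorization of the joint Gaussian density by p(y_i | y_{c(i)}), where c(i) indexes a small set of previously ordered neighbors. For an exponential covariance on ordered one-dimensional locations the process is Markov, so conditioning on just the single nearest previous point already recovers the exact log-likelihood:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n, m = 50, 1
s = np.sort(rng.uniform(0.0, 1.0, n))                 # ordered 1-D locations
K = np.exp(-np.abs(s[:, None] - s[None, :]) / 0.2)    # exponential kernel
y = rng.multivariate_normal(np.zeros(n), K)

def vecchia_loglik(y, K, m):
    # log p(y) = sum_i log p(y_i | y_{c(i)}), conditioning each y_i on
    # (at most) its m nearest previously ordered neighbors c(i).
    ll = 0.0
    for i in range(len(y)):
        c = np.arange(max(0, i - m), i)               # neighbor indices
        Kic = K[i, c]
        w = np.linalg.solve(K[np.ix_(c, c)], Kic) if len(c) else None
        mu = w @ y[c] if len(c) else 0.0              # conditional mean
        var = K[i, i] - (Kic @ w if len(c) else 0.0)  # conditional variance
        ll += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mu) ** 2 / var)
    return ll

exact = multivariate_normal(np.zeros(n), K).logpdf(y)
approx = vecchia_loglik(y, K, m)
```

Each conditional involves only an m×m solve, which is the source of the linear scaling in n for fixed m.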

There are also a number of existing methods for large non-Gaussian datasets modeled using GGPs. A popular approach is to combine a low-rank GP with an approximation of the non-Gaussian likelihood, as the dimension reduction inherent in the low-rank approximation carries through even to the non-Gaussian case. Sengupta and Cressie (2013) estimate parameters using an expectation–maximization algorithm with low-rank and Laplace approximations. Sheth et al. (2015) use variational inference to obtain the posterior and select a set of conditioning points for their low-rank approximation. Some methods of dimension reduction, including random sketching (e.g., Yang et al., 2017) and projection, offer theoretical guarantees and can be combined with MCMC methods for the analysis of non-Gaussian data (e.g., Hughes and Haran, 2013, Guan and Haran, 2018), but are still subject to the limitations of low-rank methods. Nickisch et al. (2018) develop state–space models for one-dimensional non-Gaussian time series, which can perform inference in linear time and memory using a set of knots in time, a form of low-rank approximation. Alternate priors (e.g., log gamma priors for count data, Bradley et al., 2018) are an interesting but specialized approach to completely avoid the intractability issues with GGPs.

Similar to what we shall propose, some authors have combined a sparse-precision approach with a non-Gaussian approximation. A prominent example is Lindgren et al. (2011), in which an integrated nested Laplace approximation (INLA) is combined with a sparse-precision approximation of the GP using its representation as the solution to a stochastic partial differential equation. Datta et al. (2016) proposed to apply the GP approximation of Vecchia (1988) to a latent GP, but did not provide an explicit algorithm for large non-Gaussian data. While both Lindgren et al. (2011) and Datta et al. (2016) limit the number of nonzero entries per row or column in the precision matrix to a small constant, the computational complexity for decomposing this sparse n×n matrix is not linear in n, but rather O(n^{3/2}) in two dimensions (Lipton et al., 1979, Thm. 6), and at least O(n^2) in higher dimensions. In the Gaussian setting, this scaling problem can be overcome by applying a Vecchia approximation to the observed data (Vecchia, 1988) or to the joint distribution of the observed data and the latent GP (Katzfuss and Guinness, 2019).

To handle both scaling and intractability, we propose a Vecchia–Laplace (VL) approximation for GGPs. The posterior mode necessary for the Laplace approximation is found using the Newton–Raphson algorithm, which can be viewed as iterative GP inference based on Gaussian pseudo-data. At each iteration of our VL algorithm, the joint Gaussian distribution of the pseudo-data and the latent GP realizations is approximated using the general Vecchia framework (Katzfuss and Guinness, 2019, Katzfuss et al., 2020a). By modeling the joint distribution of pseudo-data and GP realizations at each iteration, our VL approach can ensure sparsity and guarantee linear scaling in n for any dimension, overcoming the scaling issues of the sparse-matrix approaches mentioned above.
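The pseudo-data view can be sketched in a few lines (a dense, small-n illustration without any Vecchia approximation; the kernel and likelihood choices are ours). For a Poisson likelihood with log link, each Newton-Raphson step for the posterior mode of y | z is exactly GP regression on pseudo-data t = y + W^{-1}(z - e^y) with heteroscedastic Gaussian noise of variance W^{-1}, where W = diag(e^y):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
s = np.linspace(0.0, 1.0, n)
K = np.exp(-np.abs(s[:, None] - s[None, :]) / 0.2)   # exponential kernel
y_true = rng.multivariate_normal(np.zeros(n), K)
z = rng.poisson(np.exp(y_true))                      # Poisson counts, log link

# Newton-Raphson for the posterior mode of y | z: each step is exact GP
# regression on Gaussian pseudo-data t with noise variances 1/W, where
# W = diag(-d^2 log p(z|y)/dy^2) = diag(exp(y)) for the Poisson likelihood.
y = np.zeros(n)
for _ in range(100):
    W = np.exp(y)
    t = y + (z - np.exp(y)) / W                      # pseudo-data
    y_new = K @ np.linalg.solve(K + np.diag(1.0 / W), t)
    if np.max(np.abs(y_new - y)) < 1e-8:
        y = y_new
        break
    y = y_new
```

At convergence, y satisfies the mode equation y = K(z - e^y). The VL approach of the paper replaces the dense Gaussian computation inside each such step with a general Vecchia approximation of the joint distribution of (t, y).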

To our knowledge, we provide the first explicit algorithm extending and applying the highly promising class of general-Vecchia GP approximations to large non-Gaussian data. We believe it to be a useful addition to the literature due to its speed, simplicity, guaranteed numerical performance, and wide applicability (e.g., binary, count, right-skewed, and point-pattern data). In addition, as shown in Katzfuss and Guinness (2019), the general Vecchia approximation includes many popular GP approximations (e.g., Vecchia, 1988, Snelson and Ghahramani, 2007, Finley et al., 2009, Sang et al., 2011, Datta et al., 2016, Katzfuss, 2017, Katzfuss and Gong, 2019) as special cases, and so our VL methodology also directly provides extensions of these GP approximations to non-Gaussian data.

The remainder of this document is organized as follows. In Section 2, we review the Laplace approximation and general Vecchia. In Section 3, we introduce and examine our VL method, including parameter inference and predictions at unobserved locations. In Section 4 (Simulations and comparisons) and Section 5 (Application to satellite data), we study and compare the performance of VL on simulated and real data, respectively. Some details are left to the appendix. A separate Supplementary Material document contains Sections S1–S6 with additional derivations, simulations, and discussion. The methods and algorithms proposed here are implemented in the R package GPvecchia (Katzfuss et al., 2020) with sensible default settings, so that a wide audience of practitioners can immediately use the code with little background knowledge. Our results and figures can be reproduced using the code and data at https://github.com/katzfuss-group/GPvecchia-Laplace.


Generalized Gaussian processes

Let y(·) ~ GP(μ, K) be a latent Gaussian process with mean function μ and kernel or covariance function K on a domain D ⊂ R^d, d ∈ N+. Observations z = (z_1, …, z_n) at locations s_i ∈ D are assumed to be conditionally independent, z_i | y ~ind. g_i(z_i | y_i), where y = (y_1, …, y_n) and y_i = y(s_i). We assume that the observation densities or likelihoods g_i are from the exponential family. Parameters θ in μ, K, or the g_i will be assumed fixed and known for now; for example, θ may contain regression coefficients determining
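A minimal simulation from this model (with an exponential kernel, zero mean, and a Bernoulli likelihood with logistic link; all of these specific choices are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
s = rng.uniform(0.0, 1.0, size=(n, 2))              # locations s_i in D = [0,1]^2
d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
K = np.exp(-d / 0.3)                                # covariance function K
y = rng.multivariate_normal(np.zeros(n), K)         # latent GP values y_i = y(s_i)
p = 1.0 / (1.0 + np.exp(-y))                        # logistic link
z = rng.binomial(1, p)                              # z_i | y ~ Bernoulli(p_i), indep.
```

Swapping the last two lines for a Poisson or Gamma sampler with the appropriate link gives the other exponential-family cases considered later.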

Vecchia–Laplace methods

We now introduce our Vecchia–Laplace (VL) approximation, which allows fast inference for large datasets modeled using GGPs, by combining the Laplace and general Vecchia approximations reviewed in Section 2.

Simulations and comparisons

We compared our VL approaches to other methods using simulated data. Throughout Section 4, unless specified otherwise, we simulated realizations y on a grid of size √n × √n on the unit square from a GP with mean zero and a Matérn covariance function with variance 1, smoothness ν, and range parameter λ=0.05. Gridded locations allow us to carry out simulations for large n using Fourier methods. The data were then generated conditional on y using the four likelihoods in Table 1, with a=2 in the Gamma
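The Matérn covariance used here can be evaluated as follows (a sketch using one common parameterization with the √(2ν) scaling of the range λ; whether this matches the paper's exact convention is an assumption):

```python
import numpy as np
from scipy.special import gamma, kv

def matern(d, nu, lam, sigma2=1.0):
    """Matern covariance at distances d with smoothness nu and range lam.

    Uses the common sqrt(2*nu)/lam scaling; other range conventions exist,
    so this may differ from the paper's parameterization by a constant factor.
    """
    d = np.asarray(d, dtype=float)
    out = np.full(d.shape, sigma2)                  # K(0) = sigma2
    pos = d > 0
    u = np.sqrt(2.0 * nu) * d[pos] / lam
    out[pos] = sigma2 * 2.0 ** (1.0 - nu) / gamma(nu) * u ** nu * kv(nu, u)
    return out

# nu = 0.5 recovers the exponential covariance exp(-d / lam)
dists = np.linspace(0.0, 0.3, 7)
expo = matern(dists, nu=0.5, lam=0.05)
```

In this convention, ν = 0.5 gives the exponential covariance and ν → ∞ approaches the squared-exponential, so ν directly controls the smoothness of the simulated fields.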

Application to satellite data

We applied our methodology to a large, spatially correlated, non-Gaussian dataset of column water vapor. These data were collected by NASA’s Moderate Resolution Imaging Spectroradiometer (MODIS), which is mounted on the NASA Aqua satellite (Borbas et al., 2017). We considered a Level-2 near-infrared dataset of total precipitable water at a 1354×2030 = 2,748,620 grid of 1-km pixels. We used up to 500,000 of these data points for our demonstration. Our dataset was observed between 13:45 and 13:50 on

Conclusions and future work

In this work, we presented a novel combination of techniques that allow for efficient analysis of large, spatially correlated, non-Gaussian datasets or point patterns. The key idea is to apply a Vecchia approximation to the Gaussian (and hence tractable) joint distribution of GP realizations and pseudo-data at each iteration of a Newton–Raphson algorithm, leading to a Gaussian Laplace approximation. This allows us to carry out inference for non-Gaussian data by iteratively applying existing

Acknowledgments

Katzfuss’ research was partially supported by National Science Foundation (NSF) grant DMS–1521676 and NSF CAREER grant DMS–1654083. The authors would like to thank Joe Guinness, Jianhua Huang, Leonid Koralov, Boris Vainberg, and several anonymous reviewers for helpful comments and suggestions. Wenlong Gong, Joe Guinness, Marcin Jurek, and Jingjie Zhang contributed to the R package GPvecchia, and Florian Schäfer provided C code for the exact maxmin ordering.

References (59)

  • Cressie, N., et al. (2008). Fixed rank kriging for very large spatial data sets. J. R. Stat. Soc. Ser. B Stat. Methodol.
  • Datta, A., et al. (2016). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. J. Amer. Statist. Assoc.
  • Dietrich, C.R., et al. (1997). Fast and exact simulation of stationary Gaussian processes through circulant embedding of the covariance matrix. SIAM J. Sci. Comput.
  • Diggle, P.J., et al. (2013). Spatial and spatio-temporal log-Gaussian Cox processes: Extending the geostatistical paradigm. Statist. Sci.
  • Diggle, P., et al. (1998). Model-based geostatistics. J. R. Stat. Soc. Ser. C. Appl. Stat.
  • Filippone, M., et al. (2014). Pseudo-marginal Bayesian inference for Gaussian processes. IEEE Trans. Pattern Anal. Mach. Intell.
  • Flaxman, S., Wilson, A., Neill, D., Nickisch, H., Smola, A. (2015). Fast Kronecker inference in Gaussian processes with...
  • Fong, Y., et al. (2010). Bayesian inference for generalized linear mixed models. Biostatistics.
  • Furrer, R., et al. (2006). Covariance tapering for interpolation of large spatial datasets. J. Comput. Graph. Statist.
  • Gneiting, T., et al. (2014). Probabilistic forecasting. Annu. Rev. Stat. Appl.
  • Guan, Y., et al. (2018). A computationally efficient projection-based approach for spatial generalized linear mixed models. J. Comput. Graph. Statist.
  • Guinness, J. (2018). Permutation methods for sharpening Gaussian process approximations. Technometrics.
  • Guinness, J., et al. (2017). Circulant embedding of approximate covariances for inference from Gaussian data on large lattices. J. Comput. Graph. Statist.
  • Heaton, M.J., et al. (2019). A case study competition among methods for analyzing large spatial data. J. Agric. Biol. Environ. Stat.
  • Higdon, D. (1998). A process-convolution approach to modelling temperatures in the North Atlantic Ocean. Environ. Ecol. Stat.
  • Hughes, J., et al. (2013). Dimension reduction and alleviation of confounding for spatial generalized linear mixed models. J. R. Stat. Soc. Ser. B Stat. Methodol.
  • Jurek, M., et al. (2018). Multi-resolution filters for massive spatio-temporal data.
  • Jurek, M., et al. (2020). Hierarchical sparse Cholesky decomposition with applications to high-dimensional spatio-temporal filtering.
  • Katzfuss, M. (2017). A multi-resolution approximation for massive spatial datasets. J. Amer. Statist. Assoc.