Vecchia–Laplace approximations of generalized Gaussian processes for big non-Gaussian spatial data

https://doi.org/10.1016/j.csda.2020.107081

Abstract

Generalized Gaussian processes (GGPs) are highly flexible models that combine latent GPs with potentially non-Gaussian likelihoods from the exponential family. GGPs can be used in a variety of settings, including GP classification, nonparametric count regression, modeling non-Gaussian spatial data, and analyzing point patterns. However, inference for GGPs can be analytically intractable, and large datasets pose computational challenges due to the inversion of the GP covariance matrix. A Vecchia–Laplace approximation for GGPs is proposed, which combines a Laplace approximation to the non-Gaussian likelihood with a computationally efficient Vecchia approximation to the GP, resulting in simple, general, scalable, and accurate methodology. Numerical studies and comparisons on simulated and real spatial data are provided. The methods are implemented in a freely available R package.

Introduction

Dependent non-Gaussian data are ubiquitous in time series, geospatial applications, and more generally in nonparametric regression and classification. A popular model for such data is obtained by combining a latent Gaussian process (GP) with conditionally independent, potentially non-Gaussian likelihoods from the exponential family. This is traditionally referred to as a spatial generalized linear mixed model (SGLMM) in the spatial statistics literature (Diggle et al., 1998), but the same model has more recently also been referred to as a generalized GP (GGP) (e.g., Chan and Dong, 2011); we will use the latter, more concise term throughout. GGPs are highly flexible, interpretable, and allow for natural, probabilistic uncertainty quantification. However, inference for GGPs can be analytically intractable, and large datasets pose additional computational challenges due to the inversion of the GP covariance matrix.

Popular techniques to numerically perform the intractable marginalization necessary for inference are, in order of increasing speed: Markov chain Monte Carlo (MCMC), expectation propagation, variational methods, and Laplace approximations. See Shang and Chan (2013) for a recent review of deterministic techniques and Filippone and Girolami (2014) for a comparison of MCMC and expectation propagation. Rue et al. (2009) argue that variational methods and expectation propagation suffer from underestimated and overestimated posterior variances, respectively. Here, we consider the Laplace approximation (e.g., Tierney and Kadane, 1986, Williams and Barber, 1998), a classic technique for integral evaluation based on second-order Taylor expansion. Bonat and Ribeiro (2016) show numerically that the Laplace approximation can be a practical and accurate method for GGP inference.
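As a concrete one-dimensional illustration of the Laplace idea (ours, not from the paper), the following sketch approximates an intractable Poisson-lognormal marginal likelihood by a second-order Taylor expansion of the joint log-density around its mode, and compares the result against numerical quadrature; the model and all parameter values are purely illustrative.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.special import gammaln

# Toy example: Laplace-approximate p(z) = int Pois(z | e^y) N(y | 0, s2) dy
# via p(z) ~ exp(h(y_hat)) * sqrt(2*pi / (-h''(y_hat))), where
# h(y) = log p(z | y) + log p(y) and y_hat is its mode.
z, s2 = 3, 1.0

def h(y):   # joint log-density of (z, y) as a function of the latent y
    return (z * y - np.exp(y) - gammaln(z + 1)
            - 0.5 * (np.log(2 * np.pi * s2) + y ** 2 / s2))

def h1(y):  # h'(y)
    return z - np.exp(y) - y / s2

def h2(y):  # h''(y); strictly negative, so h is concave with a unique mode
    return -np.exp(y) - 1.0 / s2

y_hat = brentq(h1, -10.0, 10.0)                     # posterior mode
laplace = np.exp(h(y_hat)) * np.sqrt(2 * np.pi / -h2(y_hat))
exact, _ = quad(lambda y: np.exp(h(y)), -10.0, 10.0)
```

Because the integrand is unimodal and close to Gaussian on the log scale, the Laplace value typically agrees with quadrature to within a few percent here.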

It has long been recognized that the computational cost for GPs is cubic in the data size. Much work has been done on GP approximations that address this problem in the context of Gaussian noise (as recently reviewed in Heaton et al., 2019). Low-rank approaches (e.g., Higdon, 1998, Quiñonero-Candela and Rasmussen, 2005, Banerjee et al., 2008, Cressie and Johannesson, 2008, Katzfuss and Cressie, 2011) have limitations in the presence of fine-scale structure (Stein, 2014), but they have proved popular due to their simplicity. Approximations relying on sparsity in covariance matrices (Furrer et al., 2006, Kaufman et al., 2008) by definition can only capture local, short-range dependence and cannot guarantee low computation cost. Approaches that take advantage of Toeplitz or Kronecker structure (e.g., Dietrich and Newsam, 1997, Flaxman et al., 2015, Guinness and Fuentes, 2017) can be extremely efficient but are not as generally applicable. Recently, methods relying on sparsity in precision matrices (Rue and Held, 2005, Lindgren et al., 2011, Nychka et al., 2015) have gained popularity due to their accuracy and flexibility. In particular, a class of highly promising GP approximations (Vecchia, 1988, Stein et al., 2004, Datta et al., 2016, Guinness, 2018, Katzfuss and Guinness, 2019, Katzfuss et al., 2020a) rely on ordered conditional independence that can guarantee linear scaling in the data size while resolving dependence at all scales.
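To make the ordered-conditional-independence idea concrete, here is a minimal sketch (ours, not the general framework of the papers cited above). Vecchia-type approximations replace each conditional p(y_i | y_1, …, y_{i-1}) in the exact factorization of the joint Gaussian density by p(y_i | y_{c(i)}), where c(i) indexes a small set of previously ordered neighbors. For an exponential covariance on ordered one-dimensional locations the process is Markov, so conditioning on just the single nearest previous point already recovers the exact log-likelihood:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n, m = 50, 1
s = np.sort(rng.uniform(0.0, 1.0, n))                 # ordered 1-D locations
K = np.exp(-np.abs(s[:, None] - s[None, :]) / 0.2)    # exponential kernel
y = rng.multivariate_normal(np.zeros(n), K)

def vecchia_loglik(y, K, m):
    # log p(y) = sum_i log p(y_i | y_{c(i)}), conditioning each y_i on
    # (at most) its m nearest previously ordered neighbors c(i).
    ll = 0.0
    for i in range(len(y)):
        c = np.arange(max(0, i - m), i)               # neighbor indices
        Kic = K[i, c]
        w = np.linalg.solve(K[np.ix_(c, c)], Kic) if len(c) else None
        mu = w @ y[c] if len(c) else 0.0              # conditional mean
        var = K[i, i] - (Kic @ w if len(c) else 0.0)  # conditional variance
        ll += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mu) ** 2 / var)
    return ll

exact = multivariate_normal(np.zeros(n), K).logpdf(y)
approx = vecchia_loglik(y, K, m)
```

Each conditional involves only an m×m solve, which is the source of the linear scaling in n for fixed m.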

There are also a number of existing methods for large non-Gaussian datasets modeled using GGPs. A popular approach is to combine a low-rank GP with an approximation of the non-Gaussian likelihood, as the dimension reduction inherent in the low-rank approximation carries through even to the non-Gaussian case. Sengupta and Cressie (2013) estimate parameters using an expectation–maximization algorithm with low-rank and Laplace approximations. Sheth et al. (2015) use variational inference to obtain the posterior and select a set of conditioning points for their low-rank approximation. Some methods of dimension reduction, including random sketching (e.g., Yang et al., 2017) and projection, offer theoretical guarantees and can be combined with MCMC methods for the analysis of non-Gaussian data (e.g., Hughes and Haran, 2013, Guan and Haran, 2018), but are still subject to the limitations of low-rank methods. Nickisch et al. (2018) develop state–space models for one-dimensional non-Gaussian time series, which can perform inference in linear time and memory using a set of knots in time, a form of low-rank approximation. Alternate priors (e.g., log gamma priors for count data, Bradley et al., 2018) are an interesting but specialized approach to completely avoid the intractability issues with GGPs.

Similar to what we shall propose, some authors have combined a sparse-precision approach with a non-Gaussian approximation. A prominent example is Lindgren et al. (2011), in which an integrated nested Laplace approximation (INLA) is combined with a sparse-precision approximation of the GP using its representation as the solution to a stochastic partial differential equation. Datta et al. (2016) proposed to apply the GP approximation of Vecchia (1988) to a latent GP, but did not provide an explicit algorithm for large non-Gaussian data. While both Lindgren et al. (2011) and Datta et al. (2016) limit the number of nonzero entries per row or column in the precision matrix to a small constant, the computational complexity for decomposing this sparse n×n matrix is not linear in n, but rather O(n^{3/2}) in two dimensions (Lipton et al., 1979, Thm. 6), and at least O(n^2) in higher dimensions. In the Gaussian setting, this scaling problem can be overcome by applying a Vecchia approximation to the observed data (Vecchia, 1988) or to the joint distribution of the observed data and the latent GP (Katzfuss and Guinness, 2019).

To handle both scaling and intractability, we propose a Vecchia–Laplace (VL) approximation for GGPs. The posterior mode necessary for the Laplace approximation is found using the Newton–Raphson algorithm, which can be viewed as iterative GP inference based on Gaussian pseudo-data. At each iteration of our VL algorithm, the joint Gaussian distribution of the pseudo-data and the latent GP realizations is approximated using the general Vecchia framework (Katzfuss and Guinness, 2019, Katzfuss et al., 2020a). By modeling the joint distribution of pseudo-data and GP realizations at each iteration, our VL approach can ensure sparsity and guarantee linear scaling in n for any dimension, overcoming the scaling issues of the sparse-matrix approaches mentioned above.
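The pseudo-data view can be sketched in a few lines (a dense, small-n illustration without any Vecchia approximation; the kernel and likelihood choices are ours). For a Poisson likelihood with log link, each Newton-Raphson step for the posterior mode of y | z is exactly GP regression on pseudo-data t = y + W^{-1}(z - e^y) with heteroscedastic Gaussian noise of variance W^{-1}, where W = diag(e^y):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
s = np.linspace(0.0, 1.0, n)
K = np.exp(-np.abs(s[:, None] - s[None, :]) / 0.2)   # exponential kernel
y_true = rng.multivariate_normal(np.zeros(n), K)
z = rng.poisson(np.exp(y_true))                      # Poisson counts, log link

# Newton-Raphson for the posterior mode of y | z: each step is exact GP
# regression on Gaussian pseudo-data t with noise variances 1/W, where
# W = diag(-d^2 log p(z|y)/dy^2) = diag(exp(y)) for the Poisson likelihood.
y = np.zeros(n)
for _ in range(100):
    W = np.exp(y)
    t = y + (z - np.exp(y)) / W                      # pseudo-data
    y_new = K @ np.linalg.solve(K + np.diag(1.0 / W), t)
    if np.max(np.abs(y_new - y)) < 1e-8:
        y = y_new
        break
    y = y_new
```

At convergence, y satisfies the mode equation y = K(z - e^y). The VL approach of the paper replaces the dense Gaussian computation inside each such step with a general Vecchia approximation of the joint distribution of (t, y).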

To our knowledge, we provide the first explicit algorithm extending and applying the highly promising class of general-Vecchia GP approximations to large non-Gaussian data. We believe it to be a useful addition to the literature due to its speed, simplicity, guaranteed numerical performance, and wide applicability (e.g., binary, count, right-skewed, and point-pattern data). In addition, as shown in Katzfuss and Guinness (2019), the general Vecchia approximation includes many popular GP approximations (e.g., Vecchia, 1988, Snelson and Ghahramani, 2007, Finley et al., 2009, Sang et al., 2011, Datta et al., 2016, Katzfuss, 2017, Katzfuss and Gong, 2019) as special cases, and so our VL methodology also directly provides extensions of these GP approximations to non-Gaussian data.

The remainder of this document is organized as follows. In Section 2, we review the Laplace approximation and general Vecchia. In Section 3, we introduce and examine our VL method, including parameter inference and predictions at unobserved locations. In Section 4 (Simulations and comparisons) and Section 5 (Application to satellite data), we study and compare the performance of VL on simulated and real data, respectively. Some details are left to the appendix. A separate Supplementary Material document contains Sections S1–S6 with additional derivations, simulations, and discussion. The methods and algorithms proposed here are implemented in the R package GPvecchia (Katzfuss et al., 2020) with sensible default settings, so that a wide audience of practitioners can immediately use the code with little background knowledge. Our results and figures can be reproduced using the code and data at https://github.com/katzfuss-group/GPvecchia-Laplace.


Generalized Gaussian processes

Let y(·) ~ GP(μ, K) be a latent Gaussian process with mean function μ and kernel or covariance function K on a domain D ⊂ R^d, d ∈ N+. Observations z = (z_1, …, z_n) at locations s_i ∈ D are assumed to be conditionally independent, z_i | y ~ind. g_i(z_i | y_i), where y = (y_1, …, y_n) and y_i = y(s_i). We assume that the observation densities or likelihoods g_i are from the exponential family. Parameters θ in μ, K, or the g_i will be assumed fixed and known for now; for example, θ may contain regression coefficients determining
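A minimal simulation from this model (with an exponential kernel, zero mean, and a Bernoulli likelihood with logistic link; all of these specific choices are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
s = rng.uniform(0.0, 1.0, size=(n, 2))              # locations s_i in D = [0,1]^2
d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
K = np.exp(-d / 0.3)                                # covariance function K
y = rng.multivariate_normal(np.zeros(n), K)         # latent GP values y_i = y(s_i)
p = 1.0 / (1.0 + np.exp(-y))                        # logistic link
z = rng.binomial(1, p)                              # z_i | y ~ Bernoulli(p_i), indep.
```

Swapping the last two lines for a Poisson or Gamma sampler with the appropriate link gives the other exponential-family cases considered later.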

Vecchia–Laplace methods

We now introduce our Vecchia–Laplace (VL) approximation, which allows fast inference for large datasets modeled using GGPs, by combining the Laplace and general Vecchia approximations reviewed in Section 2.

Simulations and comparisons

We compared our VL approaches to other methods using simulated data. Throughout Section 4, unless specified otherwise, we simulated realizations y on a grid of size √n × √n on the unit square from a GP with mean zero and a Matérn covariance function with variance 1, smoothness ν, and range parameter λ=0.05. Gridded locations allow us to carry out simulations for large n using Fourier methods. The data were then generated conditional on y using the four likelihoods in Table 1, with a=2 in the Gamma
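The Matérn covariance used here can be evaluated as follows (a sketch using one common parameterization with the √(2ν) scaling of the range λ; whether this matches the paper's exact convention is an assumption):

```python
import numpy as np
from scipy.special import gamma, kv

def matern(d, nu, lam, sigma2=1.0):
    """Matern covariance at distances d with smoothness nu and range lam.

    Uses the common sqrt(2*nu)/lam scaling; other range conventions exist,
    so this may differ from the paper's parameterization by a constant factor.
    """
    d = np.asarray(d, dtype=float)
    out = np.full(d.shape, sigma2)                  # K(0) = sigma2
    pos = d > 0
    u = np.sqrt(2.0 * nu) * d[pos] / lam
    out[pos] = sigma2 * 2.0 ** (1.0 - nu) / gamma(nu) * u ** nu * kv(nu, u)
    return out

# nu = 0.5 recovers the exponential covariance exp(-d / lam)
dists = np.linspace(0.0, 0.3, 7)
expo = matern(dists, nu=0.5, lam=0.05)
```

In this convention, ν = 0.5 gives the exponential covariance and ν → ∞ approaches the squared-exponential, so ν directly controls the smoothness of the simulated fields.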

Application to satellite data

We applied our methodology to a large, spatially correlated, non-Gaussian dataset of column water vapor. These data were collected by NASA’s Moderate Resolution Imaging Spectroradiometer (MODIS), which is mounted on the NASA Aqua satellite (Borbas et al., 2017). We considered a Level-2 near-infrared dataset of total precipitable water at a 1354×2030 = 2,748,620 grid of 1-km pixels. We used up to 500,000 of these data points for our demonstration. Our dataset was observed between 13:45 and 13:50 on

Conclusions and future work

In this work, we presented a novel combination of techniques that allow for efficient analysis of large, spatially correlated, non-Gaussian datasets or point patterns. The key idea is to apply a Vecchia approximation to the Gaussian (and hence tractable) joint distribution of GP realizations and pseudo-data at each iteration of a Newton–Raphson algorithm, leading to a Gaussian Laplace approximation. This allows us to carry out inference for non-Gaussian data by iteratively applying existing

Acknowledgments

Katzfuss’ research was partially supported by National Science Foundation (NSF) grant DMS–1521676 and NSF CAREER grant DMS–1654083. The authors would like to thank Joe Guinness, Jianhua Huang, Leonid Koralov, Boris Vainberg, and several anonymous reviewers for helpful comments and suggestions. Wenlong Gong, Joe Guinness, Marcin Jurek, and Jingjie Zhang contributed to the R package GPvecchia, and Florian Schäfer provided C code for the exact maxmin ordering.

References (59)

  • Cressie, N., et al. (2008). Fixed rank kriging for very large spatial data sets. J. R. Stat. Soc. Ser. B Stat. Methodol.
  • Datta, A., et al. (2016). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. J. Amer. Statist. Assoc.
  • Dietrich, C.R., et al. (1997). Fast and exact simulation of stationary Gaussian processes through circulant embedding of the covariance matrix. SIAM J. Sci. Comput.
  • Diggle, P.J., et al. (2013). Spatial and spatio-temporal log-Gaussian Cox processes: Extending the geostatistical paradigm. Statist. Sci.
  • Diggle, P., et al. (1998). Model-based geostatistics. J. R. Stat. Soc. Ser. C. Appl. Stat.
  • Filippone, M., et al. (2014). Pseudo-marginal Bayesian inference for Gaussian processes. IEEE Trans. Pattern Anal. Mach. Intell.
  • Flaxman, S., Wilson, A., Neill, D., Nickisch, H., Smola, A. (2015). Fast Kronecker inference in Gaussian processes with...
  • Fong, Y., et al. (2010). Bayesian inference for generalized linear mixed models. Biostatistics.
  • Furrer, R., et al. (2006). Covariance tapering for interpolation of large spatial datasets. J. Comput. Graph. Statist.
  • Gneiting, T., et al. (2014). Probabilistic forecasting. Annu. Rev. Stat. Appl.
  • Guan, Y., et al. (2018). A computationally efficient projection-based approach for spatial generalized linear mixed models. J. Comput. Graph. Statist.
  • Guinness, J. (2018). Permutation methods for sharpening Gaussian process approximations. Technometrics.
  • Guinness, J., et al. (2017). Circulant embedding of approximate covariances for inference from Gaussian data on large lattices. J. Comput. Graph. Statist.
  • Heaton, M.J., et al. (2019). A case study competition among methods for analyzing large spatial data. J. Agric. Biol. Environ. Stat.
  • Higdon, D. (1998). A process-convolution approach to modelling temperatures in the North Atlantic Ocean. Environ. Ecol. Stat.
  • Hughes, J., et al. (2013). Dimension reduction and alleviation of confounding for spatial generalized linear mixed models. J. R. Stat. Soc. Ser. B Stat. Methodol.
  • Jurek, M., et al. (2018). Multi-resolution filters for massive spatio-temporal data.
  • Jurek, M., et al. (2020). Hierarchical sparse Cholesky decomposition with applications to high-dimensional spatio-temporal filtering.
  • Katzfuss, M. (2017). A multi-resolution approximation for massive spatial datasets. J. Amer. Statist. Assoc.