Vecchia Approximations of Gaussian-Process Predictions

Journal of Agricultural, Biological and Environmental Statistics

Abstract

Gaussian processes (GPs) are highly flexible function estimators used for geospatial analysis, nonparametric regression, and machine learning, but they are computationally infeasible for large datasets. Vecchia approximations of GPs have been used to enable fast evaluation of the likelihood for parameter inference. Here, we study Vecchia approximations of spatial predictions at observed and unobserved locations, including obtaining joint predictive distributions at large sets of locations. We consider a general Vecchia framework for GP predictions, which contains some novel and some existing special cases. We study the accuracy and computational properties of these approaches theoretically and numerically, proving that our new methods exhibit linear computational complexity in the total number of spatial locations. We show that certain choices within the framework can have a strong effect on uncertainty quantification and computational cost, which leads to specific recommendations on which methods are most suitable for various settings. We also apply our methods to a satellite dataset of chlorophyll fluorescence, showing that the new methods are faster or more accurate than existing methods and reduce unrealistic artifacts in prediction maps. Supplementary materials accompanying this paper appear on-line.

References

  • Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2004). Hierarchical Modeling and Analysis for Spatial Data. Chapman & Hall, Boca Raton.

  • Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society, Series B, 70(4):825–848.

  • Cressie, N. and Johannesson, G. (2008). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society, Series B, 70(1):209–226.

  • Cressie, N. and Wikle, C. K. (2011). Statistics for Spatio-Temporal Data. Wiley, Hoboken, NJ.

  • Curriero, F. C. and Lele, S. (1999). A composite likelihood approach to semivariogram estimation. Journal of Agricultural, Biological, and Environmental Statistics, 4(1):9–28.

  • Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E. (2016). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111(514):800–812.

  • Du, J., Zhang, H., and Mandrekar, V. S. (2009). Fixed-domain asymptotic properties of tapered maximum likelihood estimators. The Annals of Statistics, 37:3330–3361.

  • Eidsvik, J., Shaby, B. A., Reich, B. J., Wheeler, M., and Niemi, J. (2014). Estimation and prediction in spatial models with block composite likelihoods using parallel computing. Journal of Computational and Graphical Statistics, 23(2):295–315.

  • Erisman, A. M. and Tinney, W. F. (1975). On computing certain elements of the inverse of a sparse matrix. Communications of the ACM, 18(3):177–179.

  • Finley, A. O., Datta, A., Cook, B. C., Morton, D. C., Andersen, H. E., and Banerjee, S. (2019). Efficient algorithms for Bayesian nearest neighbor Gaussian processes. Journal of Computational and Graphical Statistics, 28(2):401–414.

  • Finley, A. O., Sang, H., Banerjee, S., and Gelfand, A. E. (2009). Improving the performance of predictive process modeling for large datasets. Computational Statistics & Data Analysis, 53(8):2873–2884.

  • Foreman-Mackey, D., Agol, E., Ambikasaran, S., and Angus, R. (2017). Fast and scalable Gaussian process modeling with applications to astronomical time series. The Astronomical Journal, 154:220.

  • Furrer, R., Genton, M. G., and Nychka, D. (2006). Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics, 15(3):502–523.

  • Gneiting, T. and Katzfuss, M. (2014). Probabilistic forecasting. Annual Review of Statistics and Its Application, 1(1):125–151.

  • Gramacy, R. B. and Apley, D. W. (2015). Local Gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 24(2):561–578.

  • Guan, K., Berry, J. A., Zhang, Y., Joiner, J., Guanter, L., Badgley, G., and Lobell, D. B. (2016). Improving the monitoring of crop productivity using spaceborne solar-induced fluorescence. Global Change Biology, 22(2):716–726.

  • Guinness, J. (2018). Permutation methods for sharpening Gaussian process approximations. Technometrics, 60(4):415–429.

  • Guinness, J. (2019). Spectral density estimation for random fields via periodic embeddings. Biometrika, 106(2):267–286.

  • Heaton, M. J., Datta, A., Finley, A. O., Furrer, R., Guinness, J., Guhaniyogi, R., Gerber, F., Gramacy, R. B., Hammerling, D., Katzfuss, M., Lindgren, F., Nychka, D. W., Sun, F., and Zammit-Mangion, A. (2019). A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological, and Environmental Statistics, 24(3):398–425.

  • Higdon, D. (1998). A process-convolution approach to modelling temperatures in the North Atlantic Ocean. Environmental and Ecological Statistics, 5(2):173–190.

  • Jones, D. R., Schonlau, M., and Welch, W. J. (1998). Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13:455–492.

  • Jurek, M. and Katzfuss, M. (2018). Multi-resolution filters for massive spatio-temporal data. arXiv:1810.04200.

  • Katzfuss, M. (2017). A multi-resolution approximation for massive spatial datasets. Journal of the American Statistical Association, 112(517):201–214.

  • Katzfuss, M. and Cressie, N. (2011). Spatio-temporal smoothing and EM estimation for massive remote-sensing data sets. Journal of Time Series Analysis, 32(4):430–446.

  • Katzfuss, M. and Gong, W. (2019). A class of multi-resolution approximations for large spatial datasets. Statistica Sinica. https://doi.org/10.5705/ss.202018.0285.

  • Katzfuss, M. and Guinness, J. (2019). A general framework for Vecchia approximations of Gaussian processes. Statistical Science, accepted.

  • Katzfuss, M., Guinness, J., and Lawrence, E. (2020a). Scaled Vecchia approximation for fast computer-model emulation. arXiv:2005.00386.

  • Katzfuss, M., Jurek, M., Zilber, D., Gong, W., Guinness, J., Zhang, J., and Schäfer, F. (2020b). GPvecchia: Fast Gaussian-process inference using Vecchia approximations. R package version 0.1.3.

  • Kaufman, C. G., Schervish, M. J., and Nychka, D. W. (2008). Covariance tapering for likelihood-based estimation in large spatial data sets. Journal of the American Statistical Association, 103(484):1545–1555.

  • Kelly, B. C., Becker, A. C., Sobolewska, M., Siemiginowska, A., and Uttley, P. (2014). Flexible and scalable methods for quantifying stochastic variability in the era of massive time-domain astronomical data sets. Astrophysical Journal, 788(1):33.

  • Kennedy, M. C. and O’Hagan, A. (2001). Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B, 63(3):425–464.

  • Le, N. D. and Zidek, J. V. (2006). Statistical Analysis of Environmental Space-Time Processes. Springer, Berlin.

  • Li, S., Ahmed, S., Klimeck, G., and Darve, E. (2008). Computing entries of the inverse of a sparse matrix using the FIND algorithm. Journal of Computational Physics, 227(22):9408–9427.

  • Lin, L., Yang, C., Meza, J., Lu, J., Ying, L., and Weinan, E. (2011). SelInv—An algorithm for selected inversion of a sparse symmetric matrix. ACM Transactions on Mathematical Software, 37(4):40.

  • Lindgren, F., Rue, H., and Lindström, J. (2011). An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B, 73(4):423–498.

  • Nychka, D. W., Bandyopadhyay, S., Hammerling, D., Lindgren, F., and Sain, S. R. (2015). A multi-resolution Gaussian process model for the analysis of large spatial data sets. Journal of Computational and Graphical Statistics, 24(2):579–599.

  • OCO-2 Science Team, Gunson, M., and Eldering, A. (2015). OCO-2 Level 2 bias-corrected solar-induced fluorescence and other select fields from the IMAP-DOAS algorithm aggregated as daily files, retrospective processing V7r. https://disc.gsfc.nasa.gov/datacollection/OCO2_L2_Lite_SIF_7r.html

  • Quiñonero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959.

  • Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press, Cambridge.

  • Rue, H. and Held, L. (2005). Gaussian Markov Random Fields: Theory and Applications. CRC Press, Boca Raton.

  • Sang, H., Jun, M., and Huang, J. Z. (2011). Covariance approximation for large multivariate spatial datasets with an application to multiple climate model errors. Annals of Applied Statistics, 5(4):2519–2548.

  • Schäfer, F., Katzfuss, M., and Owhadi, H. (2020). Sparse Cholesky factorization by Kullback–Leibler minimization. arXiv:2004.14455.

  • Schäfer, F., Sullivan, T. J., and Owhadi, H. (2017). Compression, inversion, and approximate PCA of dense kernel matrices at near-linear computational complexity. arXiv:1706.02205.

  • Snelson, E. and Ghahramani, Z. (2007). Local and global sparse Gaussian process approximations. In Artificial Intelligence and Statistics 11 (AISTATS).

  • Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Neural Information Processing Systems.

  • Stein, M. L., Chi, Z., and Welty, L. (2004). Approximating likelihoods for large spatial data sets. Journal of the Royal Statistical Society: Series B, 66(2):275–296.

  • Stroud, J. R., Stein, M. L., and Lysen, S. (2017). Bayesian and maximum likelihood estimation for Gaussian processes on an incomplete lattice. Journal of Computational and Graphical Statistics, 26(1):108–120.

  • Sun, Y., Frankenberg, C., Jung, M., Joiner, J., Guanter, L., Köhler, P., and Magney, T. (2018). Overview of solar-induced chlorophyll fluorescence (SIF) from the Orbiting Carbon Observatory-2: retrieval, cross-mission comparison, and global monitoring for GPP. Remote Sensing of Environment, 209:808–823.

  • Sun, Y., Frankenberg, C., Wood, J. D., Schimel, D. S., Jung, M., Guanter, L., Drewry, D., Verma, M., Porcar-Castell, A., Griffis, T. J., et al. (2017). OCO-2 advances photosynthesis observation from space via solar-induced chlorophyll fluorescence. Science, 358(6360):eaam5747.

  • Sun, Y. and Stein, M. L. (2016). Statistically and computationally efficient estimating equations for large spatial datasets. Journal of Computational and Graphical Statistics, 25(1):187–208.

  • Tzeng, S. and Huang, H.-C. (2018). Resolution adaptive fixed rank kriging. Technometrics, 60(2):198–208.

  • Vecchia, A. (1988). Estimation and model identification for continuous spatial processes. Journal of the Royal Statistical Society, Series B, 50(2):297–312.

  • Vecchia, A. (1992). A new method of prediction for spatial regression models with correlated errors. Journal of the Royal Statistical Society, Series B, 54(3):813–830.

  • Wang, Y., Khardon, R., and Protopapas, P. (2012). Nonparametric Bayesian estimation of periodic light curves. Astrophysical Journal, 756(1):67.

  • Wikle, C. K. and Cressie, N. (1999). A dimension-reduced approach to space-time Kalman filtering. Biometrika, 86(4):815–829.

  • Zhang, B., Sang, H., and Huang, J. Z. (2019). Smoothed full-scale approximation of Gaussian process models for computation of large spatial datasets. Statistica Sinica, 29:1711–1737.

  • Zilber, D. and Katzfuss, M. (2019). Vecchia-Laplace approximations of generalized Gaussian processes for big non-Gaussian spatial data. arXiv:1906.07828.

Acknowledgements

Katzfuss’ research was partially supported by National Science Foundation (NSF) Grant DMS–1521676 and NSF CAREER Grant DMS–1654083. Guinness’ research was partially supported by NSF Grant DMS–1613219 and NIH Grant No. R01ES027892. The authors would like to thank Anirban Bhattacharya, David Jones, Jennifer Hoeting, and several anonymous reviewers for helpful comments and suggestions. Jingjie Zhang and Marcin Jurek contributed to the R package GPvecchia, and Florian Schäfer provided C code for the exact maxmin ordering.

Author information

Correspondence to Matthias Katzfuss.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below are the links to the electronic supplementary material.

Supplementary material 1 (pdf 1544 KB)

Supplementary material 2 (zip 11555 KB)

Appendices

Vector and Indexing Notation

As an example, define \(\mathbf{y} = (\mathbf{y}_1,\mathbf{y}_2,\mathbf{y}_3,\mathbf{y}_4,\mathbf{y}_5)\) as a vector of vectors. Vectors of indices are used to define subvectors. For example, if \(o = (4,1,2)\) is a vector of indices, then \(\mathbf{y}_o = (\mathbf{y}_4,\mathbf{y}_1,\mathbf{y}_2)\). The union of two vectors whose elements are of the same type is again a vector, defined once the ordering of its elements is specified. For example, if \(\mathbf{z} = (\mathbf{z}_1,\mathbf{z}_2)\), then \(\mathbf{y}_o \cup \mathbf{z} = (\mathbf{y}_4,\mathbf{y}_1, \mathbf{z}_1, \mathbf{y}_2, \mathbf{z}_2)\) defines the union of \(\mathbf{y}_o\) and \(\mathbf{z}\). Intersections are defined similarly, using the \(\cap\) notation. The ordering of the elements of a union or intersection can alternatively be specified via an index function \(\#\), which takes an element and a vector and returns the position occupied by that element in the vector. Continuing the example above, \(\#(\mathbf{y}_4, \mathbf{y}) = 4\), whereas \(\#(\mathbf{y}_4, \mathbf{y}_o \cup \mathbf{z}) = 1\). A full description of the vector and indexing notation can be found in Katzfuss and Guinness (2019, App. A).
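
For readers who prefer code to notation, the following minimal Python sketch mirrors the example above; the list names and the index_of helper are purely illustrative and not part of the notation of Katzfuss and Guinness (2019).

```python
# Minimal illustration of the indexing notation (1-based, as in the paper).
y = ["y1", "y2", "y3", "y4", "y5"]            # y = (y_1, ..., y_5)
o = [4, 1, 2]                                 # index vector o

y_o = [y[i - 1] for i in o]                   # y_o = (y_4, y_1, y_2)

def index_of(element, vector):
    """The index function #: 1-based position of an element in a vector."""
    return vector.index(element) + 1

z = ["z1", "z2"]
y_o_union_z = ["y4", "y1", "z1", "y2", "z2"]  # one ordering of the union of y_o and z

print(index_of("y4", y))                      # 4
print(index_of("y4", y_o_union_z))            # 1
```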

Computing \(\mathbf{U }\)

We recapitulate here the formulas for computing \(\mathbf{U }\) from Katzfuss and Guinness (2019). Let g(i) denote the vector of indices of the elements in \(\mathbf{x} \) on which \(x_i\) conditions. Also define \(C(x_i,x_j)\) as the covariance between \(x_i\) and \(x_j\) implied by the true model in Sect. 2; that is, \(C(y_i,y_j) = C(z_i,y_j) = K(\mathbf{s} _i,\mathbf{s} _j)\) and \(C(z_i,z_j) = K(\mathbf{s} _i,\mathbf{s} _j) + \mathbb {1}_{i=j} \tau ^2_i\). Then, the (ji)th element of \(\mathbf{U }\) can be calculated as

$$\mathbf{U }_{ji} = \begin{cases} d_i^{-1/2}, & i=j,\\ -b_{i}^{(j)} d_i^{-1/2}, & j \in g(i), \\ 0, & \text {otherwise}, \end{cases}$$
(9)

where \(\mathbf{b} _i'= C(x_i,\mathbf{x} _{g(i)}) C(\mathbf{x} _{g(i)},\mathbf{x} _{g(i)})^{-1}\), \(d_i = C(x_i,x_i) - \mathbf{b} _i' C(\mathbf{x} _{g(i)},x_i)\), and \(b_i^{(j)}\) denotes the kth element of \(\mathbf{b} _i\) if j is the kth element in g(i) (i.e., \(b_i^{(j)}\) is the element of \(\mathbf{b} _i\) corresponding to \(x_j\)).
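
To make (9) concrete, the following Python sketch fills in \(\mathbf{U }\) column by column; it uses dense linear algebra and illustrative names (build_U), whereas an efficient implementation such as the GPvecchia package evaluates only the required covariance entries and stores \(\mathbf{U }\) as a sparse matrix.

```python
import numpy as np

def build_U(C, g):
    """Sketch of (9): upper-triangular factor U from a dense covariance matrix.

    C : (n, n) covariance matrix of x under the true model.
    g : list of 0-based index lists; g[i] holds the indices on which x_i
        conditions (all smaller than i, so that U is upper triangular).
    """
    n = C.shape[0]
    U = np.zeros((n, n))
    for i in range(n):
        gi = np.asarray(g[i], dtype=int)
        if gi.size > 0:
            # b_i' = C(x_i, x_{g(i)}) C(x_{g(i)}, x_{g(i)})^{-1}
            b = np.linalg.solve(C[np.ix_(gi, gi)], C[gi, i])
            d = C[i, i] - C[i, gi] @ b      # d_i = conditional variance
            U[gi, i] = -b / np.sqrt(d)      # entries U_{ji} for j in g(i)
        else:
            d = C[i, i]
        U[i, i] = d ** -0.5                 # diagonal entry d_i^{-1/2}
    return U
```

As a sanity check, with the exact conditioning sets \(g(i) = (1,\ldots ,i-1)\), the product \(\mathbf{U }\mathbf{U }'\) should recover the inverse of the covariance matrix of \(\mathbf{x} \) up to numerical error.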

Prediction with Unknown Parameters

In practice, most GP models depend on an unknown parameter vector \(\varvec{\theta }\), which we will make explicit here. Likelihood approximation for parameter inference is discussed in detail in Katzfuss and Guinness (2019), but we will review it briefly here. Integration of \(\widehat{f}(\mathbf{x} |\varvec{\theta })\) in (3) with respect to \(\mathbf{y} \) results in the following Vecchia likelihood (Katzfuss and Guinness 2019, Prop. 2):

$$\begin{aligned} -2 \log \widehat{f}(\mathbf{z}_o|\varvec{\theta }) &= -2 \sum _{i} \log \mathbf{U}_{ii} + 2\sum _{i} \log \mathbf{V}_{ii} + \tilde{\mathbf{z}}'\tilde{\mathbf{z}} \\ &\quad - (\mathbf{V}^{-1}\mathbf{U}_{\ell ,\bullet }\tilde{\mathbf{z}})'(\mathbf{V}^{-1}\mathbf{U}_{\ell ,\bullet }\tilde{\mathbf{z}}) + n \log (2\pi ), \end{aligned}$$
(10)

where \(\mathbf{U }\) and \(\mathbf{V} \) implicitly depend on \(\varvec{\theta }\), and \({\tilde{\mathbf{z }}} = \mathbf{U }_{r,\bullet }'\mathbf{z} _o\). The computational cost for evaluating this Vecchia likelihood is often low, and Katzfuss and Guinness (2019) provide conditions on the g(i) under which the cost is guaranteed to be linear in n.
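
Formula (10) is expressed in terms of the sparse factors \(\mathbf{U }\) and \(\mathbf{V} \). As a simpler point of reference, the sketch below evaluates a Vecchia log-likelihood directly as a sum of univariate conditional Gaussian log-densities, i.e., the generic form \(\widehat{f}(\mathbf{x} ) = \prod _{i} f(x_i|\mathbf{x} _{g(i)})\) applied to the responses only; the exponential covariance with nugget and the parameterization \(\varvec{\theta }=\) (variance, range, nugget) are illustrative assumptions, not choices made in the article.

```python
import numpy as np
from scipy.spatial.distance import cdist

def vecchia_loglik(z, locs, g, theta):
    """Response-only Vecchia log-likelihood: a sum of univariate conditional
    Gaussian log-densities (a simpler stand-in for formula (10)).
    z    : (n,) numpy array of responses.
    locs : (n, d) array of locations.
    g    : list of 0-based conditioning-index lists.
    theta = (sigma2, range, tau2) is an illustrative parameterization."""
    sigma2, rng, tau2 = theta
    # Dense covariance of z purely for illustration; a real implementation
    # only ever forms the small blocks indexed by g(i).
    K = sigma2 * np.exp(-cdist(locs, locs) / rng) + tau2 * np.eye(len(z))
    ll = 0.0
    for i in range(len(z)):
        gi = list(g[i])
        if gi:
            b = np.linalg.solve(K[np.ix_(gi, gi)], K[gi, i])
            mu = b @ z[gi]
            var = K[i, i] - b @ K[gi, i]
        else:
            mu, var = 0.0, K[i, i]
        ll += -0.5 * (np.log(2 * np.pi * var) + (z[i] - mu) ** 2 / var)
    return ll
```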

This allows for various forms of likelihood-based parameter inference. In a frequentist setting, we can compute \({\hat{\varvec{\theta }}} = \hbox {arg max}_{\varvec{\theta }} \log \widehat{f}(\mathbf{z} _o|\varvec{\theta })\) and then compute summaries of the posterior predictive distribution \(\widehat{f}(\mathbf{y} |\mathbf{z} _o) = \mathcal {N}_n(\varvec{\mu }({\hat{\varvec{\theta }}}),\varvec{\Sigma }({\hat{\varvec{\theta }}}))\) as described in Sect. 3.3. This often has low computational cost but ignores uncertainty in \({\hat{\varvec{\theta }}}\). An example is given in Sect. 6.
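
A minimal sketch of this plug-in approach, reusing the illustrative vecchia_loglik function above and an off-the-shelf SciPy optimizer; the starting values, log-scale parameterization, and choice of Nelder–Mead are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(log_theta, z, locs, g):
    # Optimize on the log scale so that all parameters stay positive.
    return -vecchia_loglik(z, locs, g, np.exp(log_theta))

# Assuming z, locs, g are available:
# res = minimize(neg_loglik, x0=np.log([1.0, 0.1, 0.1]),
#                args=(z, locs, g), method="Nelder-Mead")
# theta_hat = np.exp(res.x)   # plug into the prediction formulas of Sect. 3.3
```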

In a Bayesian setting, given a prior distribution \(f(\varvec{\theta })\), a Metropolis–Hastings sampler can be used for parameters whose posterior or full-conditional distributions are not available in closed form. At the \((l+1)\)th step of the algorithm, one would propose a new value \(\varvec{\theta }^{(P)} \sim q(\varvec{\theta }|\varvec{\theta }^{(l)})\) and accept it with probability \(\min (1,h(\varvec{\theta }^{(P)},\varvec{\theta }^{(l)})/h(\varvec{\theta }^{(l)},\varvec{\theta }^{(P)}))\), where \(h(\varvec{\theta },{\tilde{\varvec{\theta }}}) = f(\varvec{\theta })\widehat{f}(\mathbf{z} _o|\varvec{\theta }) q({\tilde{\varvec{\theta }}}|\varvec{\theta })\). After burn-in and thinning, this results in a sample, say, \(\varvec{\theta }^{(1)},\ldots ,\varvec{\theta }^{(L)}\), leading to a Gaussian-mixture prediction: \(\widehat{f}(\mathbf{y} |\mathbf{z} _o) = (1/L) \sum _{l=1}^L \mathcal {N}_n(\varvec{\mu }(\varvec{\theta }^{(l)}),\varvec{\Sigma }(\varvec{\theta }^{(l)}))\). See Katzfuss and Guinness (2019, App. E) for an example of the use of RF-full in this setting. For more complicated Bayesian hierarchical models, inference can be carried out using a Gibbs sampler in which \(\mathbf{y} \) is sampled from its full-conditional distribution as described in item 4 in Sect. 3.3.
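
The following sketch spells out one such Metropolis–Hastings sampler, again built on the illustrative vecchia_loglik function above; the log-scale random-walk proposal (with its Jacobian correction), step size, and number of iterations are assumptions made only for the sake of the example.

```python
import numpy as np

def metropolis_hastings(z, locs, g, log_prior, theta0, n_iter=5000, step=0.1):
    """Random-walk Metropolis targeting f(theta) * fhat(z_o | theta)."""
    rng = np.random.default_rng(0)
    theta = np.asarray(theta0, dtype=float)
    lp = log_prior(theta) + vecchia_loglik(z, locs, g, theta)
    samples = []
    for _ in range(n_iter):
        prop = theta * np.exp(step * rng.standard_normal(theta.size))
        lp_prop = log_prior(prop) + vecchia_loglik(z, locs, g, prop)
        # Jacobian term because the random walk is symmetric in log(theta):
        log_acc = lp_prop - lp + np.sum(np.log(prop)) - np.sum(np.log(theta))
        if np.log(rng.uniform()) < log_acc:
            theta, lp = prop, lp_prop
        samples.append(theta.copy())
    return np.asarray(samples)  # discard burn-in and thin before mixing predictions
```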

Numerical Nonzeros in \(\mathbf{V} \)

For simplicity, we focus here on RF-full, although numerical nonzeros can similarly occur for RF-stand, LF-full, and LF-ind (see Figure S1). For RF-full, the upper triangle of \(\mathbf{W} \) is at least as dense as the upper triangle of \(\mathbf{V} \). Specifically, for \(j<i\), \(\mathbf{V} _{ji}=\mathbf{U }_{\ell _j\ell _i}=0\) unless \(j \in q_y(i)\). From Katzfuss and Guinness (2019, Prop. 3.2), we have that \(\mathbf{W} _{ji} =0\) unless \(j \in q_y(i)\) or \(\exists k > i\) such that \(i,j \in q_y(k)\). Thus, for any pair \(j<i\) such that \(j \notin q_y(i)\) but \(i,j \in q_y(k)\) for some \(k>i\), we generally have \(\mathbf{V} _{ji} = 0\) and \(\mathbf{W} _{ji} \ne 0\).

From the standard Cholesky algorithm, we can derive that the algorithm for \(\mathbf{V} = {{\,\mathrm{rchol}\,}}(\mathbf{W} )\) computes \(\mathbf{V} _{ji}\) as

$$\begin{aligned} \textstyle \mathbf{V} _{ji} = (\mathbf{W} _{ji} - \sum _{k=i+1}^n\mathbf{V} _{ik}\mathbf{V} _{jk})/\mathbf{V} _{ii}. \end{aligned}$$

Thus, for any pair \(j<i\) such that \(\mathbf{V} _{ji}=0\) but \(\mathbf{W} _{ji} \ne 0\), we know that \(\mathbf{W} _{ji} = \sum _{k=i+1}^n\mathbf{V} _{ik}\mathbf{V} _{jk}\) theoretically, but due to potential numerical error it is not guaranteed that this equation holds exactly. A numerical nonzero is introduced in \(\mathbf{V} \) whenever a rounding error occurs in \(\sum _{k=i+1}^n\mathbf{V} _{ik}\mathbf{V} _{jk}\), which relies on (potentially many) previous calculations in the Cholesky algorithm. Such numerical nonzeros are avoided by extracting \(\mathbf{V} =\mathbf{U }_{\ell \ell }\) as a submatrix of \(\mathbf{U }\) (as proposed in Sect. 4.1), instead of explicitly carrying out the Cholesky factorization \(\mathbf{V} = {{\,\mathrm{rchol}\,}}(\mathbf{W} )\).
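
To make the mechanism concrete, here is a dense Python sketch of the reverse-Cholesky recursion above; applying it to a matrix \(\mathbf{W} \) with the sparsity structure described in this section typically yields entries of the order of machine epsilon exactly where \(\mathbf{V} _{ji}\) should be zero.

```python
import numpy as np

def rchol(W):
    """Dense sketch: upper-triangular V with W = V V', filled right to left."""
    n = W.shape[0]
    V = np.zeros_like(W, dtype=float)
    for i in range(n - 1, -1, -1):                   # columns i = n, ..., 1
        V[i, i] = np.sqrt(W[i, i] - V[i, i + 1:] @ V[i, i + 1:])
        for j in range(i - 1, -1, -1):               # rows j < i
            V[j, i] = (W[j, i] - V[i, i + 1:] @ V[j, i + 1:]) / V[i, i]
    return V

# Entries V_{ji} with j not in q_y(i) are zero in exact arithmetic, but the
# subtraction above is carried out in floating point, so such entries
# typically come back as tiny "numerical nonzeros"; extracting V = U_{ll}
# as a submatrix of U avoids the problem.
```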

Implications for Selected Inverse

When \(\mathbf{V} \) is computed by copying a submatrix from \(\mathbf{U }\) to avoid numerical nonzeros, the selected inverse of this \(\mathbf{V} \) is not guaranteed to return the exact posterior variances of \(\mathbf{y} \), unless \(\mathbf{V} \) is “padded” with zeros, which results in additional costs. This is because the selected inverse operates on the symbolic nonzero elements; that is, it operates on all elements that have to be computed in the Cholesky factorization, even if they cancel to zero numerically (which happens for many entries in our setting). Denoting by \(\mathbf{S} \) the selected inverse of \(\mathbf{W} \) based on \(\mathbf{V} \), a close look at the Takahashi recursions reveals that for all pairs \(j,k \in q_y(i)\), we need \(\mathbf{S} _{ji}\) and \(\mathbf{S} _{kj}\). The latter element is only calculated if \(j \in q_y(k)\). However, if \(j \notin q_y(k)\), \(\mathbf{S} _{kj} ={{\,\mathrm{cov}\,}}(y_j,y_k|\mathbf{z} )\) will typically be very small (if m is reasonably large), because the corresponding locations will likely be far from each other, with observed data in between. In our experiments, the additional approximation error introduced by the selected inverse was negligible relative to the error introduced by the Vecchia approximation itself. When m becomes large enough for the Vecchia approximation to be accurate, the additional approximation error introduced by SelInv goes to zero as well.

If the exact variances implied by the Vecchia approximation \(\widehat{f}(\mathbf{y} |\mathbf{z} _o)\) are desired, they can be computed as \({{\,\mathrm{diag}\,}}({{\,\mathrm{var}\,}}(\mathbf{y} _p|\mathbf{z} _o)) = ((\mathbf{V} ^{-1}\mathbf{I} _{\bullet p})\circ (\mathbf{V} ^{-1}\mathbf{I} _{\bullet p}))'\mathbf 1 _{n}\). For RF-full and RF-stand, this requires \(\mathcal {O}(nmn_P)\) time, as described in Sect. 4.3.3, and so the overall computational complexity would not be increased if \(n_P = \mathcal {O}(m^2)\).
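
A minimal sketch of this computation, assuming a dense upper-triangular \(\mathbf{V} \) and SciPy's triangular solver; with a sparse \(\mathbf{V} \), each solve costs \(\mathcal {O}(nm)\) rather than \(\mathcal {O}(n^2)\), as discussed in Sect. 4.3.3.

```python
import numpy as np
from scipy.linalg import solve_triangular

def exact_vecchia_variances(V, pred_idx):
    """Posterior variances implied by the Vecchia approximation: for each
    prediction index k, solve V a = e_k and return ||a||^2, i.e., the k-th
    diagonal entry of (V V')^{-1}."""
    n = V.shape[0]
    out = np.empty(len(pred_idx))
    for j, k in enumerate(pred_idx):
        e_k = np.zeros(n)
        e_k[k] = 1.0
        a = solve_triangular(V, e_k, lower=False)   # a = V^{-1} e_k
        out[j] = a @ a
    return out
```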

Proofs

In this section, we provide proofs for the propositions stated throughout the article.

Proof of Proposition 1

Part 1 of this proposition is equivalent to Thm. 1 in Guinness (2018). We give a different proof here, which extends easily to Part 2.

First, consider a generic Vecchia approximation of a vector \(\mathbf{x} \) of length n: \(\widehat{f}(\mathbf{x} ) = \prod _{i=1}^n f(x_i|\mathbf{x} _{g(i)}) = \prod _{i=1}^n \mathcal {N}(x_i|\mu _{i|g(i)},\sigma ^2_{i|g(i)})\). Then,

$$\begin{aligned} \textstyle (-2) {{\,\mathrm{\mathbb {E}}\,}}^\mathbf{x} \log \widehat{f}(\mathbf{x} ) = \sum _{i=1}^n \log {{\,\mathrm{var}\,}}(x_i|\mathbf{x} _{g(i)}) + \sum _{i=1}^n {{\,\mathrm{\mathbb {E}}\,}}^\mathbf{x} w_i^2 + n \log (2\pi ), \end{aligned}$$

where \(w_i = (x_i - \mu _{i|g(i)})/\sigma _{i|g(i)}\). We have \({{\,\mathrm{\mathbb {E}}\,}}(w_i) = 0\), because \({{\,\mathrm{\mathbb {E}}\,}}(\mu _{i|g(i)}) = {{\,\mathrm{\mathbb {E}}\,}}{{\,\mathrm{\mathbb {E}}\,}}(x_i|\mathbf{x} _{g(i)}) = {{\,\mathrm{\mathbb {E}}\,}}(x_i)\). We also have \({{\,\mathrm{var}\,}}(w_i)=1\), because

$$\begin{aligned} {{\,\mathrm{var}\,}}(x_i-\mu _{i|g(i)}) &= {{\,\mathrm{var}\,}}\big ({{\,\mathrm{\mathbb {E}}\,}}\big (x_i-{{\,\mathrm{\mathbb {E}}\,}}(x_i|\mathbf{x} _{g(i)})\,\big |\,\mathbf{x} _{g(i)}\big )\big ) + {{\,\mathrm{\mathbb {E}}\,}}\big ({{\,\mathrm{var}\,}}\big (x_i-{{\,\mathrm{\mathbb {E}}\,}}(x_i|\mathbf{x} _{g(i)})\,\big |\,\mathbf{x} _{g(i)}\big )\big ) \\ &= 0 + \sigma ^2_{i|g(i)}. \end{aligned}$$

Hence, \(w_i \sim \mathcal {N}(0,1)\), and so \({{\,\mathrm{\mathbb {E}}\,}}^\mathbf{x} w_i^2 = 1\) and

$$\begin{aligned} \textstyle (-2) {{\,\mathrm{\mathbb {E}}\,}}^\mathbf{x} \log \widehat{f}(\mathbf{x} ) = \sum _{i=1}^n \log {{\,\mathrm{var}\,}}(x_i|\mathbf{x} _{g(i)}) + n + n \log (2\pi ). \end{aligned}$$

Because the exact density \(f(\mathbf{x} )\) is a special case of Vecchia with \(g(i) = (1,\ldots ,i-1)\), we have

$$\begin{aligned} \textstyle \text {KL}\big (f(\mathbf{x} )\Vert \widehat{f}(\mathbf{x} )\big ) = {{\,\mathrm{\mathbb {E}}\,}}^\mathbf{x} \log f(\mathbf{x} ) - {{\,\mathrm{\mathbb {E}}\,}}^\mathbf{x} \log \widehat{f}(\mathbf{x} ) = \frac{1}{2} \sum _{i=1}^n \log \frac{{{\,\mathrm{var}\,}}(x_i|\mathbf{x} _{g(i)})}{{{\,\mathrm{var}\,}}(x_i|\mathbf{x} _{1:i-1})}. \end{aligned}$$
(11)

For \(g_1(i) \subset g_2(i)\), we can write, say, \(g_2(i) = g_1(i) \cup c(i)\). Using the law of total variance,

$$\begin{aligned} {{\,\mathrm{var}\,}}(x_i|\mathbf{x} _{g_1(i)}) = {{\,\mathrm{var}\,}}(x_i|\mathbf{x} _{g_1(i)}, \mathbf{x} _{c(i)}) + {{\,\mathrm{var}\,}}\big ({{\,\mathrm{\mathbb {E}}\,}}(x_i|\mathbf{x} _{g_1(i)}, \mathbf{x} _{c(i)})\,\big |\,\mathbf{x} _{g_1(i)}\big ) \ge {{\,\mathrm{var}\,}}(x_i|\mathbf{x} _{g_2(i)}). \end{aligned}$$
(12)

Now, Part 1 follows by combining (11) and (12) with the assumption that \(g_1(i) \subset g_2(i)\) for all i.

For Part 2, we consider \(\mathbf{x} =(\mathbf{x} ^{(1)},\mathbf{x} ^{(2)})\) with \(\mathbf{x} ^{(1)} = (x_1,\ldots ,x_{n_1})\) and \(\mathbf{x} ^{(2)} = (x_{n_1+1},\ldots ,x_{n_1+n_2})\). Then,

$$\begin{aligned} \textstyle&\text {CKL}\big (f(\mathbf{x} ^{(2)}|\mathbf{x} ^{(1)})\Vert \widehat{f}(\mathbf{x} ^{(2)}|\mathbf{x} ^{(1)})\big )\\&\quad = \int f(\mathbf{x} ^{(1)}) \int f(\mathbf{x} ^{(2)}|\mathbf{x} ^{(1)}) \log \big (f(\mathbf{x} ^{(2)}|\mathbf{x} ^{(1)})/\widehat{f}(\mathbf{x} ^{(2)}|\mathbf{x} ^{(1)})\big ) d\mathbf{x} ^{(2)} d\mathbf{x} ^{(1)} \\&\quad = {{\,\mathrm{\mathbb {E}}\,}}^\mathbf{x} \log f(\mathbf{x} ^{(2)}|\mathbf{x} ^{(1)}) - {{\,\mathrm{\mathbb {E}}\,}}^\mathbf{x} \log \widehat{f}(\mathbf{x} ^{(2)}|\mathbf{x} ^{(1)}) \\&\quad = \frac{1}{2} \sum _{i=n_1+1}^{n_1+n_2} \log \frac{{{\,\mathrm{var}\,}}(x_i|\mathbf{x} _{g(i)})}{{{\,\mathrm{var}\,}}(x_i|\mathbf{x} _{1:i-1})}, \end{aligned}$$

where the last equality can be shown almost identically to (11) above, noting that \(\widehat{f}(\mathbf{x} ^{(2)}|\mathbf{x} ^{(1)}) = \widehat{f}(\mathbf{x} )/\widehat{f}(\mathbf{x} ^{(1)}) = \prod _{i=n_1+1}^{n_1+n_2} f(x_i|\mathbf{x} _{g(i)})\). Part 2 follows by combining this result and (12) with the assumption that \(g_1(i) \subset g_2(i)\) for all \(i=n_1+1,\ldots ,n_1+n_2\). \(\square \)

Proof of Proposition 2

For all parts of this proposition, note that all variables in the model are conditionally independent of \(z_j\) given \(y_j\), and so conditioning on \(y_j\) is equivalent to conditioning on \(y_j\) and \(z_j\). For Part 1, we can thus verify easily that g(i) is equivalent to \((1,\ldots ,i-1)\), for all i and all methods under consideration, and so Part 1 follows from (11). Using Proposition 1, the proof for all other parts simply consists of showing that certain conditioning vectors contain certain other conditioning vectors (similar to Katzfuss and Guinness 2019, Prop. 4). For example, for Part 3, all response-first methods are based on the ordering \(\mathbf{x} = (\mathbf{z} _o,\mathbf{y} _o,\mathbf{y} _p)\) with nearest neighbor conditioning (under some restrictions for RF-stand and RF-ind), and so it is easy to see that \(g_m(i) \subset g_{m+1}(i)\) for all \(i \in \ell _p=(n+1,\ldots ,n+n_P)\). LF-full and LF-ind can equivalently be defined based on the ordering \((\mathbf{y} _o,\mathbf{z} _o,\mathbf{y} _p)\), in which case we also have \(g_m(i) \subset g_{m+1}(i)\) for all \(i \in \ell _p=(n+1,\ldots ,n+n_P)\). For Part 5, note that in RF-stand and RF-ind, \(\mathbf{y} _p\) does not condition on \(\mathbf{y} _o\). Hence, the distribution \(\widehat{f}(\mathbf{y} _p|\mathbf{z} _o)\) is equivalent to the distribution obtained under a simplified Vecchia approximation based on \(\mathbf{x} =(\mathbf{z} _o,\mathbf{y} _p)\) (i.e., with \(\mathbf{y} _o\) removed completely). It is straightforward to show that Part 5 holds for this simplified approximation. For Part 6, note that g(i) is the same for RF-full and RF-stand for all \(i=1,\ldots ,n\). For \(i \in p\), letting \(a(i) = q_y^{\text {RF-full}}(i) \cap o\), we have \(\mathbf{x} _{g^{\text {RF-full}}(i)}=(\mathbf{y} _{q_y^{\text {RF-stand}}},\mathbf{y} _{a(i)})=(\mathbf{y} _{q_y^{\text {RF-stand}}},\mathbf{y} _{a(i)},\mathbf{z} _{a(i)})\) and \(\mathbf{x} _{g^{\text {RF-stand}}(i)}=(\mathbf{y} _{q_y^{\text {RF-stand}}},\mathbf{z} _{a(i)})\), and so \(g^{\text {RF-stand}}(i) \subset g^{\text {RF-full}}(i)\) for all \(i \in p\). For Part 7, note that a GP with exponential covariance in 1-D is a Markov process, and so in (6) with \(q_y(i)=i-1\) and left-to-right ordering, we have \(f(y_i|\mathbf{y} _{q_y(i)})=f(y_i|y_{1},\ldots ,y_{i-1})\) and hence \(\widehat{f}(\mathbf{x} ) = f(\mathbf{x} )\). \(\square \)

Cite this article

Katzfuss, M., Guinness, J., Gong, W. et al. Vecchia Approximations of Gaussian-Process Predictions. JABES 25, 383–414 (2020). https://doi.org/10.1007/s13253-020-00401-7
