Kernel mean embedding based hypothesis tests for comparing spatial point patterns
Introduction
Comparison of spatial point patterns is of practical importance in a number of scientific fields including ecology, epidemiology, and criminology. For example, such comparisons may reveal differential effects of the environment on plant species spread, uncover spatial variation in disease risk, or detect seasonal differences in crime locations (see e.g. Baddeley et al., 2015). While exploratory analyses are vital for obtaining deep insights about pattern differences, such analyses can be subjective unless supplemented with formal hypothesis tests.
In this paper we are interested in comparing the first-order structures of point patterns. Consider two point processes $X$ and $Y$ over the region $W$ with the first-order intensities given by $\lambda_X(\cdot)$ and $\lambda_Y(\cdot)$. Given realizations from these processes, we would like to detect whether there are statistically significant differences in the first-order intensities. However, testing for equality, $\lambda_X = \lambda_Y$, is not flexible enough. For example, when studying the spatial variation in disease risk, the diseased population is only a small fraction of the control population; naturally, the corresponding observed patterns will differ significantly in the overall counts of points, yet this is irrelevant to the substantive question. The more appropriate null hypothesis posits that there exists a constant $c > 0$ such that $\lambda_X = c\,\lambda_Y$. Equality up to a constant factor means that the intensities have the same functional form of spatial variation. To avoid dealing with the nuisance parameter $c$, one can normalize the intensities to integrate to 1, giving rise to probability distributions $P_X$ and $P_Y$ over the region $W$; in Cucala (2006) and Fuentes-Santos et al. (2017) these are called the densities of event locations. Now, our null hypothesis is equivalent to the equality $P_X = P_Y$, which is an instance of the two-sample hypothesis testing problem (see, e.g. Anderson et al., 1994).
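The equivalence between proportional intensities and equal normalized densities can be written out in a few lines. The notation here (processes $X$ and $Y$ on a region $W$, intensities $\lambda_X$ and $\lambda_Y$, normalized densities $f_X$ and $f_Y$, constant $c$) is assumed for illustration and may differ from the symbols used in the full paper.

```latex
% Normalize each intensity so that it integrates to 1 over W:
f_X(u) = \frac{\lambda_X(u)}{\int_W \lambda_X(v)\,dv}, \qquad
f_Y(u) = \frac{\lambda_Y(u)}{\int_W \lambda_Y(v)\,dv}, \qquad u \in W.

% If \lambda_X = c\,\lambda_Y for some c > 0, the constant cancels:
f_X(u) = \frac{c\,\lambda_Y(u)}{c \int_W \lambda_Y(v)\,dv} = f_Y(u).

% Conversely, f_X = f_Y implies \lambda_X = c\,\lambda_Y with
c = \frac{\int_W \lambda_X(v)\,dv}{\int_W \lambda_Y(v)\,dv}.
```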
In practice it is desirable to have nonparametric hypothesis testing approaches to pattern comparison that: (a) capture a particular aspect of difference; (b) can be applied to both single and replicated patterns; and (c) do not depend on resampling methods for (re-)calibration. Early nonparametric tests for pattern comparison (Diggle et al., 1991; Hahn, 2012) probe for differences in the $K$-functions (Ripley, 1976) of point patterns. Because these tests are based on a second-order property, the detected differences conflate the spatial variation in intensities with the interaction properties. Concentrating on the first-order properties, Kelsall and Diggle (1995a, 1995b) and Davies and Hazelton (2010) estimate the logarithm of the ratio between the intensities using kernel density estimation. Other approaches rely on counts of events (Andresen, 2009; Alba-Fernández et al., 2016) or normalized counts of events (Zhang and Zhuang, 2017) within pre-specified areas. The recent work of Fuentes-Santos et al. (2017) detects differences in the first-order structure by looking at the $L^2$-distance between kernel density estimates of the two densities of event locations. All of these truly first-order comparison approaches are limited to single patterns, and with the exception of Zhang and Zhuang (2017) they are calibrated with resampling methods. The latter issue can result in prohibitive computational costs in industrial settings where thousands of pattern comparisons may be needed and high-precision $p$-values are required to account for multiple testing corrections.
In this paper, we introduce an approach that leverages the kernel mean embedding (KME) (Berlinet and Thomas-Agnan, 2004; Smola et al., 2007; Gretton et al., 2012; Muandet et al., 2017) to test for the equality of the densities of event locations, which allows us to detect differences in the first-order structure of point patterns. Our approach is based on introducing an approximate version of the kernel mean embedding, aKME. While the original KME is infinite-dimensional and implicit, our approximate kernel mean embedding is finite-dimensional and comes with explicit closed-form formulas. With the help of aKME, we reduce the pattern comparison problem to the comparison of means in the Euclidean space.
The resulting pattern comparison test is surprisingly simple, and a complete implementation is provided in Appendix B. The computation of aKME is illustrated in Fig. 1.1. First, the points in the pattern are projected onto a line; sine and cosine functions with a specific frequency are then applied to the projections, a step that can be seen as wrapping the line onto a circle of some radius. The resulting sine and cosine values are separately averaged to give two numbers that provide a “fingerprint” of the point pattern behavior with respect to the direction of the line and the scale that corresponds to the frequency (i.e. circle circumference). The process is repeated with a multitude of lines and frequencies: each combination of a line and a frequency yields one such fingerprint, and the fingerprints are concatenated to give the overall aKME, whose dimension is twice the number of line-frequency combinations. Finally, to compare patterns, we compare their aKMEs by applying $t$-tests on each coordinate of the embedding. We combine the resulting separate $p$-values into a single overall $p$-value using one of the recently introduced $p$-value combination techniques, such as the harmonic mean (Good, 1958; Wilson, 2019) or the Cauchy combination test (Liu and Xie, 2020), leading to well-calibrated and powerful tests, as confirmed by the simulations.
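The fingerprint computation described above can be sketched in a few lines of Python. This is a minimal illustration and not the Appendix B implementation: the function name `akme`, the four equally spaced directions, and the four frequencies are hypothetical choices made here, whereas the paper selects its lines and frequencies through a polar Gauss–Hermite scheme that is not reproduced in this sketch.

```python
import math
import random


def akme(points, directions, frequencies):
    """Sketch of an approximate embedding for one point pattern.

    For every (direction, frequency) pair, project the points onto the
    line with that direction, scale by the frequency, and average the
    cosine and sine of the result. Each pair contributes two numbers.
    """
    n = len(points)
    embedding = []
    for theta in directions:
        ux, uy = math.cos(theta), math.sin(theta)  # unit vector along the line
        for freq in frequencies:
            phases = [freq * (x * ux + y * uy) for x, y in points]
            embedding.append(sum(math.cos(p) for p in phases) / n)
            embedding.append(sum(math.sin(p) for p in phases) / n)
    return embedding


# Hypothetical settings (assumptions, not the paper's parameter choices).
directions = [k * math.pi / 4 for k in range(4)]
frequencies = [2.0, 4.0, 8.0, 16.0]

# A synthetic pattern of 200 uniform points on the unit square.
rng = random.Random(0)
pattern = [(rng.random(), rng.random()) for _ in range(200)]

fingerprint = akme(pattern, directions, frequencies)
print(len(fingerprint))  # 2 numbers per pair, 4 * 4 pairs -> 32
```

Because every coordinate is an average over the points, the embedding does not change when the pattern is duplicated, which mirrors the requirement that overall point counts should not drive the comparison.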
The connection to the original KME guides the choice of the parameters for this construction and provides approximation guarantees that are crucial to the consistency of the hypothesis testing. The main advantages of the proposed approach are that it can be applied to both single and replicated pattern comparisons, and that neither bootstrap nor permutation procedures are needed to obtain or calibrate the $p$-values. In addition, because the approach is based on $t$-tests, one can compute Bayes factors for each of the involved tests, which makes it possible to quantify the evidence supporting the hypothesis of difference for each directionality/scale represented in aKME; one can also report the averaged Bayes factor as an overall summary of this evidence.
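To show why no resampling is needed at the combination stage, the sketch below combines per-coordinate p-values in closed form. It assumes equal weights and p-values strictly between 0 and 1; the function names and the example p-values are hypothetical. The harmonic mean function returns only the raw statistic, without any calibration step that a full harmonic mean procedure may apply.

```python
import math


def cauchy_combination(p_values):
    """Equal-weight Cauchy combination of p-values.

    Each p-value is mapped to a standard Cauchy variate, the variates
    are averaged, and the average is mapped back to a p-value.
    """
    t = sum(math.tan((0.5 - p) * math.pi) for p in p_values) / len(p_values)
    return 0.5 - math.atan(t) / math.pi


def harmonic_mean_p(p_values):
    """Raw (uncalibrated) harmonic mean of the p-values."""
    return len(p_values) / sum(1.0 / p for p in p_values)


# Hypothetical per-coordinate p-values, for illustration only.
p_values = [0.20, 0.03, 0.50, 0.01]
print(cauchy_combination(p_values))
print(harmonic_mean_p(p_values))
```

Both combinations are dominated by the smallest p-values, so a strong difference in a single direction or scale can still produce a small overall p-value.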
The ideas developed in this paper are in line with the recent surge of interest in applying reproducing kernel Hilbert space techniques to the comparison of probability distributions. For example, the Maximum Mean Discrepancy (MMD) is a measure of divergence between distributions (Gretton et al., 2012) which has already found numerous applications in statistics and machine learning. Similarly, the kernel mean embedding (Berlinet and Thomas-Agnan, 2004; Smola et al., 2007) has been receiving increased attention; see, for example, the recent review (Muandet et al., 2017) and citations therein. Some of these notions can be traced back to, and seen as closely related to, N-distances (Zinger et al., 1992) and energy distances (Baringhaus and Franz, 2004; Székely and Rizzo, 2005). Our approximate embedding has its roots in the Random Fourier Features (Rahimi and Recht, 2007), its improvements (Avron et al., 2016; Yu et al., 2016; Munkhoeva et al., 2018), and its application to the MMD (Zhao and Meng, 2015); the scheme we propose in this paper is tailored to the two-dimensional setting, and has the ability to provide higher-order approximations. There has already been some interest in applying the reproducing kernel methodology to spatial point processes, with roots going back to the 1980s (Bartoszynski et al., 1981; Silverman, 1982) and more recent work in Flaxman et al. (2017), Jitkrittum et al. (2017) and Yang et al. (2019). We discuss some of the connections between reproducing kernel machinery and kernel density estimation based methods commonly used with spatial point patterns in Section 2.
The main contributions of this paper are the proposed approximate kernel mean embedding (Section 3) and the hypothesis testing framework for comparison of point patterns (Section 5). After investigating the empirical properties of the resulting tests on simulated data (Section 6.1), we present applications of the methodology to two real world datasets (Section 6.2).
Section snippets
Kernel mean embedding
Mathematically rigorous development of the kernel mean embedding requires the machinery of the reproducing kernel Hilbert spaces, and the interested reader is referred to Muandet et al. (2017). For our purposes, it will be sufficient to have an intuitive understanding of the kernel mean embedding as expressed in terms of the feature maps.
Given a data instance $x$ (in our context, $x$ will be a point in some region of $\mathbb{R}^2$), a nonlinear transformation $\phi$ can be used to lift this point into a
Approximate kernel mean embedding
Instead of relying on the kernel trick, in this section we take an orthogonal path to avoiding the infinite-dimensionality of the kernel mean embedding. Namely, specializing to the two-dimensional case, we construct a finite-dimensional approximate feature map whose inner products approximate the kernel. As a result, testing the equality of two distributions can be reduced to testing the equality of their mean embeddings in a finite-dimensional Euclidean space.
Since it allows obtaining closed form formulas, for the rest of the paper we will concentrate on the Gaussian kernel
Spatial point pattern aKME
The goal of this section is to introduce the aKME for a point process and show that it can be estimated in an unbiased manner from the realizations of the point process. As a preliminary, we will go over the notions of the first-order intensity and the density of event locations for spatial point processes. While for an inhomogeneous Poisson process these two are equivalent up to a normalization, in general there are differences that should be taken into consideration when conducting replicated
Comparing spatial point patterns with aKME
Consider two point processes $X$ and $Y$ in a region $W$ with the first-order intensity functions given by $\lambda_X(\cdot)$ and $\lambda_Y(\cdot)$. We would like to test the null hypothesis that there exists a constant $c > 0$ such that $\lambda_X = c\,\lambda_Y$. Equality up to a constant factor means that the intensities of the two processes have the same functional form. This is different from testing $\lambda_X = \lambda_Y$ because our null hypothesis can hold true even if the realizations from $X$ and $Y$ have vastly differing numbers of events. We start
Experiments
Our goal in this section is to investigate the size and the power of the proposed tests. We also demonstrate two applications to real world data. The aKME embedding is constructed using four radial projections and four roots for the polar Gauss–Hermite formula (i.e. , in the notation of Section 3) resulting in . To avoid the selection of the kernel width parameter, we concatenate together aKMEs corresponding to , and when the point pattern domain is the unit square; the
Conclusion
We have introduced an approach to detect differences in the first-order structure of spatial point patterns. The proposed approach leverages the kernel mean embedding in a novel way, by introducing its approximate version. Hypothesis testing is based on conducting $t$-tests on each dimension of the approximate embedding and combining them using either the harmonic mean or the Cauchy combination approach. Our experiments confirm that the resulting tests are powerful and the $p$-values are well-calibrated. Two
Acknowledgments
We are grateful to Alfred Stein for his editorial efforts and to the reviewers for their constructive comments which have led to a much improved version of this article. We thank Tonglin Zhang and Isabel Fuentes-Santos for providing the source code of the methods from their respective papers.
References (61)
On the similarity analysis of spatial patterns. Spat. Stat. (2016)
Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. J. Multivariate Anal. (1994)
Testing for similarity in area-based spatial patterns: A nonparametric Monte Carlo approach. Appl. Geogr. (2009)
On a new multivariate two-sample test. J. Multivariate Anal. (2004)
Symmetric adaptive smoothing regimens for estimation of the spatial relative risk function. Comput. Statist. Data Anal. (2016)
A nonparametric test for the comparison of first-order structures of spatial point processes. Spat. Stat. (2017)
Resampling a coverage pattern. Stochastic Process. Appl. (1985)
A systematic comparison of methods for combining p-values from independent tests. Comput. Statist. Data Anal. (2004)
A new test for multivariate normality. J. Multivariate Anal. (2005)
Effective sample size of spatial process models. Spat. Stat. (2014)
Testing proportionality between the first-order intensity functions of spatial point processes. J. Multivariate Anal.
On the effective geographic sample size. J. Stat. Comput. Simul.
Distribution-free multiple testing. Electron. J. Stat.
Quasi-Monte Carlo feature maps for shift-invariant kernels. J. Mach. Learn. Res.
Spatial Point Patterns: Methodology and Applications with R
Controlling the false discovery rate via knockoffs. Ann. Statist.
Some nonparametric techniques for estimating the intensity function of a cancer related nonstationary Poisson process. Ann. Statist.
Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Stat. Methodol.
Reproducing Kernel Hilbert Spaces in Probability and Statistics
Probability and Measure
A two-sample test for high-dimensional data with applications to gene-set testing. Ann. Statist.
Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput.
Two-Dimensional Spacings and Noisy Observations in the Analysis of Spatial Point Patterns
Adaptive kernel estimation of spatial relative risk. Stat. Med.
A point process modelling approach to raised incidence of a rare phenomenon in the vicinity of a prespecified point. J. R. Stat. Soc. Ser. A Stat. Soc.
Analysis of variance for replicated spatial point patterns in clinical neuroanatomy. J. Amer. Statist. Assoc.
Closed-form density-based framework for automatic detection of cellular morphology changes. Proc. Natl. Acad. Sci.
Poisson intensity estimation with reproducing kernels. Electron. J. Stat.
Significance tests in parallel and in series. J. Amer. Statist. Assoc.