A new class of α-transformations for the spatial analysis of Compositional Data
Introduction
In statistics, compositional data are quantitative descriptions of the parts of some whole, conveying relative information. Mathematically, compositional data are represented by points in a simplex, i.e. vectors with non-negative coordinates whose sum is constant. In this paper, we are interested in geostatistics or spatial statistics for compositional data, which is the area of statistics developing methods to analyze and predict, through kriging, compositional data associated with spatial or spatio-temporal phenomena.
Georeferenced compositional datasets arise in varied fields of research, from geology to economics to chemistry to sociology, although most studies were historically concerned with topics in the geosciences. In this setting, it took a long time to find a solution to the problem of how to perform a proper statistical analysis of closed data – namely data with a constant-sum constraint – by taking into account the consequence of compositional constraints on correlations. Because some standard statistical techniques lose their applicability and classical interpretation when applied to compositional data, new techniques were needed. The first theoretically sound solution was proposed in the 1980s, when Aitchison (1986) built a consistent theory based on log-ratio transformations of compositional data. Later developments have shown that the mathematical foundation of a proper statistical analysis for this type of data is based on the definition of a specific geometry on the simplex, referred to as the Aitchison geometry. Based on it and on the Principles of Compositional Data Analysis (Egozcue et al., 2003), a relatively large body of literature has established a complete mathematical framework for statistical analysis which is nowadays widely accepted by the statistical community. Spatial statistics has also been widely treated in the compositional community (Pawlowsky-Glahn and Olea, 2004, Tolosana-Delgado et al., 2009, Tolosana-Delgado, 2006, Tolosana-Delgado et al., 2008, Tolosana-Delgado et al., 2010, Tolosana-Delgado et al., 2011, Tolosana-Delgado and van den Boogaart, 2013), by following the methods developed in the Aitchison geometry, i.e., by applying the log-ratio transformations that allow one to respect the Principles of Compositional Data Analysis. We refer to Pawlowsky-Glahn and Egozcue (2016) for a complete review of the historical evolution of spatial analysis of compositional data through the Aitchison geometry.
Even though Aitchison’s approach is nowadays mainstream in the analysis of compositional data, a number of authors have put forward alternative viewpoints, arguing that the choice of the appropriate method for statistical data analysis should not be determined a priori from a set of mathematical principles, but should rather depend, at least in part, on the data (Scealy and Welsh, 2014). In this vein, Tsagris et al. (2011) proposed a new family of transformations, called α-transformations, parameterized by a constant α. This parameter allows one to control the degree of transformation applied to the data, ranging from a linear transformation (α = 1) to a log-ratio transformation (α → 0). In this setting, the parameter α is chosen in a data-driven manner, thus making the approach application-specific. Note that the use of α-transformations also enables one to deal with the presence of 0s in the compositions, unlike the log-ratio approach, which is only suitable for strictly positive compositions. Besides these aspects, the approach based on α-transformations proved effective in real studies, both in classification (Tsagris et al., 2016) and in regression (Tsagris, 2015). In the same vein, but in the somewhat different domain of correspondence analysis, Greenacre (2009a, 2009b, 2011) proposed a similar power transformation and proved that it tends to the Aitchison log-ratio transformation as the power parameter approaches 0.
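To make the role of α concrete, the following Python sketch implements a centered power transformation in the spirit of Tsagris et al. (2011). The formula used here, (D·u − 1)/α with u the closure of the parts raised to the power α, is our recollection of that family and is not stated in the present text; the function names are illustrative, and the final projection onto an orthonormal basis is deliberately omitted.

```python
import numpy as np

def closure(x):
    """Rescale the parts so that they sum to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True)

def clr(x):
    """Centered log-ratio transform (strictly positive parts only)."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean(axis=-1, keepdims=True)

def alpha_transform(x, alpha):
    """Assumed form: (D * u - 1) / alpha, with u = closure(x ** alpha)."""
    x = np.asarray(x, dtype=float)
    if alpha == 0:
        return clr(x)  # limiting case, defined by continuity
    D = x.shape[-1]
    u = closure(x ** alpha)
    return (D * u - 1.0) / alpha

x = closure([0.2, 0.3, 0.5])
z_small = alpha_transform(x, 1e-6)   # numerically close to clr(x)
z_one = alpha_transform(x, 1.0)      # linear in x: D * x - 1
# A composition with a zero part is still mapped to finite values for alpha > 0.
z_zero_part = alpha_transform(closure([0.0, 0.4, 0.6]), 0.5)
```

Under these assumptions the transformed vector always sums to zero, the case α = 1 is an affine function of the composition, and small values of α reproduce the centered log-ratio coordinates.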
This paper follows the line pioneered by Tsagris et al. (2011) and Greenacre (2009b) and aims to establish a methodological framework for the statistical analysis of spatial compositional data, while finding a balance between the data-driven approach of Tsagris et al. (2011) and that of the compositional community. In this vein, a novel set of transformations is considered, based on the concept of α-contrasts, which generalizes that of log-contrasts upon which the Aitchison geometry is based. We show that, similarly to Tsagris et al. (2011), our approach coincides with that of Egozcue et al. (2003) for specific choices of the parameter α, while attaining in general a more flexible framework than that of the Aitchison simplex. Besides, we shall also establish an explicit link between the covariance structure induced by the proposed class of transformations and that defined under the Aitchison geometry, opening broad perspectives of potential application to a wide range of covariance-based methods for exploratory and inferential data analyses. In fact, the approach we propose is very general and not limited to spatial datasets. However, for the sake of brevity, this work will mainly focus on the problem of spatial analysis of compositional data through the lens of spatial prediction (kriging), which shall drive the formulation of the transformation, and particularly the choice of the “best” parameterization.
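As an illustration of how a power-based contrast can generalize a log-contrast, the sketch below compares the log-contrast Σ aᵢ log xᵢ (with Σ aᵢ = 0) to a Box-Cox-like quantity Σ aᵢ xᵢ^α / α. The latter is only an assumed form of an α-contrast, chosen because it is the natural power analogue; the precise definition is not given in this excerpt, and the function names are hypothetical.

```python
import numpy as np

def _check_contrast(a):
    """Return the coefficients as an array, requiring that they sum to zero."""
    a = np.asarray(a, dtype=float)
    if abs(a.sum()) > 1e-12:
        raise ValueError("contrast coefficients must sum to zero")
    return a

def log_contrast(x, a):
    """Log-contrast: sum_i a_i * log(x_i), with sum_i a_i = 0."""
    a = _check_contrast(a)
    return float(a @ np.log(np.asarray(x, dtype=float)))

def alpha_contrast(x, a, alpha):
    """Assumed power analogue: sum_i a_i * x_i**alpha / alpha (alpha > 0)."""
    a = _check_contrast(a)
    x = np.asarray(x, dtype=float)
    return float(a @ (x ** alpha) / alpha)

x = np.array([0.2, 0.3, 0.5])
a = np.array([1.0, -1.0, 0.0])        # contrast between parts 1 and 2

target = log_contrast(x, a)           # log(0.2 / 0.3)
for alpha in (1.0, 0.5, 0.1, 0.001):
    print(alpha, alpha_contrast(x, a, alpha), target)
```

Because the coefficients sum to zero, subtracting 1 from each power (as in the Box-Cox transformation) would not change the value, which is why the limit as α approaches 0 is the log-contrast, while α = 1 gives a plain linear contrast of the parts.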
The remainder of this work is organized as follows. Section 2 introduces the theoretical concepts of compositional data, focusing on the log-ratio transformations, the spatial statistics methodologies applied in this field and the α-transformations proposed by Tsagris et al. (2011). Section 3 explores the new class of transformations considered in this paper, i.e., the Centered and Isometric α-transformations (α-CT and α-IT), by discussing their properties, especially in a geostatistical setting. A maximum likelihood estimation method is proposed which maximizes the Gaussianity of the transformed data and, as a consequence, the kriging performance. Section 4 applies the α-IT to a simulated spatial dataset, in order to evaluate the improvement of this transformation over the classical Aitchison transformations. Section 5 conducts a geostatistical analysis of land cover data, following the approach developed in the previous sections. Special attention is given to the analysis of data in the presence of 0-parts in the compositions. Results show that when all parts are positive, the limit cases (ILR or linear transformations) are optimal for none of the considered metrics: an intermediate geometry, corresponding to the α-IT with the maximum likelihood estimate of α, better describes the dataset in a geostatistical setting. When the proportion of compositions with 0s is not negligible, some side-effects of the transformation get amplified as α decreases, entailing poor kriging performances both within the α-IT geometry and for metrics in the simplex. Finally, Section 6 reviews the main points of the paper.
Compositional data analysis in the Aitchison simplex
In this section we provide a brief overview of the key concepts underlying the analysis of compositional data through the Aitchison geometry. We refer the reader to, e.g., Pawlowsky-Glahn et al. (2015) for a deeper account of the subject.
A column vector x is defined as a D-part composition when all its components are positive real numbers carrying only relative information. The sample space of compositional data is the simplex, defined as
The Centered and Isometric α-transformations
We here introduce a new class of transformations called Centered and Isometric α-transformations (α-CT and α-IT). Let us define the simplex that admits one or more 0-values in its components.
Definition 1 Let . The Centered α-transformation (α-CT) of a compositional vector is the mapping
where is the centering matrix defined above. The Isometric α-transformation (α-IT) of a compositional vector is the
A simulation study
The aim of this section is to analyze the Isometric α-transformation introduced in Section 3.1, by applying it to simulated spatial compositional data. Here, we will evaluate the performance of this approach in comparison to the classical log-ratio transformations introduced by Aitchison (1986) and Egozcue et al. (2003).
Application to the Copernicus land cover map
A geostatistical analysis of a spatial compositional dataset is now conducted following the lines presented above: cokriging of the data transformed by α-IT, followed by the back-transformation into the simplex. Using Monte Carlo cross-validation, we shall assess the performance of our approach for several values of α, including the value maximizing the likelihood (21), even though the transformed variables are not necessarily multivariate Gaussian.
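The workflow just described (transform, predict in the transformed space, back-transform, score by Monte Carlo cross-validation) can be sketched generically in Python. Everything specific in this sketch is an assumption made for illustration: the centered log-ratio pair stands in for the α-IT and its inverse, inverse-distance weighting stands in for cokriging, the data are synthetic Dirichlet draws, and the score is a plain Euclidean error in the simplex.

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform of the rows of x (stand-in for the alpha-IT)."""
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)

def clr_inverse(z):
    """Back-transformation of clr coordinates into the simplex."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def idw_predict(train_xy, train_z, test_xy, power=2.0):
    """Inverse-distance weighting: a simple stand-in for cokriging."""
    d = np.linalg.norm(test_xy[:, None, :] - train_xy[None, :, :], axis=2)
    w = 1.0 / np.maximum(d, 1e-12) ** power
    w /= w.sum(axis=1, keepdims=True)
    return w @ train_z

def monte_carlo_cv(coords, comps, forward, inverse, predict,
                   n_rep=20, test_frac=0.2, seed=0):
    """Average prediction error over repeated random train/test splits."""
    rng = np.random.default_rng(seed)
    n = len(coords)
    n_test = max(1, int(round(test_frac * n)))
    errors = []
    for _ in range(n_rep):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        # Predict in the transformed space, then map back to the simplex.
        z_hat = predict(coords[train], forward(comps[train]), coords[test])
        x_hat = inverse(z_hat)
        errors.append(np.linalg.norm(x_hat - comps[test], axis=1).mean())
    return float(np.mean(errors))

# Synthetic example: 60 random locations with 3-part compositions.
rng = np.random.default_rng(1)
coords = rng.uniform(0.0, 1.0, size=(60, 2))
comps = rng.dirichlet([2.0, 3.0, 5.0], size=60)
score = monte_carlo_cv(coords, comps, clr, clr_inverse, idw_predict)
```

Passing the forward and inverse transformations as arguments keeps the cross-validation loop independent of the chosen geometry, so the same loop could in principle be run once per candidate value of α and the resulting scores compared.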
Discussion and conclusion
A new class of α-transformations, named α-IT, has been proposed to allow for the geostatistical analysis of compositional data. The transformation has been proved to converge to the Isometric Log-Ratio transformation as α approaches 0, while it reduces to a linear transformation when α = 1. In this sense, the α-IT represents a compromise between the Aitchison geometry (α → 0) and the Euclidean one (α = 1). Nonetheless, the presence of the parameter α controlling the degree of transformation applied
Acknowledgments
The authors are grateful to four anonymous reviewers for their very careful reading and the many valuable comments that helped to improve the manuscript. R scripts reproducing our analyses can be found at http://github.com/luciclar/alphaIT_spatial_compositional.
References (40)
Power transformations in correspondence analysis. Comput. Statist. Data Anal. (2009)
Spatial analysis of compositional data: A historical review. J. Geochem. Explor. (2016)
Isometric logratio transformations for compositional data analysis. Chemometr. Intell. Lab. Syst. (2016)
The Statistical Analysis of Compositional Data (1986)
Possible solution of some essential zero problems in compositional data analysis
Means and covariance functions for geostatistical compositional data: an axiomatic approach. Math. Geosci. (2018)
Some aspects of transformations of compositional data and the identification of outliers. Math. Geol. (1996)
Statistical interpretation of species composition. J. Amer. Statist. Assoc. (2001)
Compositional Data Analysis in the Geosciences: From Theory to Practice (2006)
Buchhorn, M., Smets, B., Bertels, L., Lesiv, M., Tsendbazar, N.-E., Herold, M., Fritz, S., 2019. Copernicus Global Land...
Statistics for Spatial Data
Isometric logratio transformations for compositional data analysis. Math. Geol.
Matérn cross-covariance functions for multivariate random fields. J. Amer. Statist. Assoc.
Log-ratio analysis is a limiting case of correspondence analysis. Math. Geosci.
Measuring subcompositional incoherence. Math. Geosci.
On information and sufficiency. Ann. Math. Stat.
Climate-induced land use change in France: Impacts of agricultural adaptation and climate change mitigation. Ecol. Econom.
Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Math. Geol.
Compositional Data Analysis: Theory and Applications, chapter Dealing with zeros
The principle of working on coordinates