Abstract
The log-ratio approach to compositional data (CoDa) analysis has now entered a mature phase. The principles and statistical tools introduced by J. Aitchison in the eighties have proven successful in solving a number of applied problems. The algebraic–geometric structure of the sample space, tailored to those principles, was developed at the beginning of the millennium. Two main ideas completed J. Aitchison's seminal work: the conception of compositions as equivalence classes of proportional vectors, and their representation in the simplex endowed with an interpretable Euclidean structure. These achievements allowed the representation of compositions in meaningful coordinates (preferably Cartesian), as well as orthogonal projections compatible with the Aitchison distance introduced two decades before. These ideas and concepts are reviewed up to the normal distribution on the simplex and the associated central limit theorem. Exploratory tools specifically designed for CoDa are also reviewed. To illustrate the adequacy and interpretability of the sample space structure, a new inequality index, based on the Aitchison norm, is proposed. Most concepts are illustrated with an example of mean household gross income per capita in Spain.
References
Äijö T, Müller CL, Bonneau R (2018) Temporal probabilistic modeling of bacterial compositions derived from 16S rRNA sequencing. Bioinformatics 34(3):372–380
Aitchison J (1982) The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B Stat Methodol 44(2):139–177
Aitchison J (1983) Principal component analysis of compositional data. Biometrika 70(1):57–65
Aitchison J (1986) The statistical analysis of compositional data. Monographs on statistics and applied probability. Chapman & Hall Ltd., London (reprinted in 2003 with additional material by The Blackburn Press)
Aitchison J (1992) On criteria for measures of compositional difference. Math Geol 24(4):365–379
Aitchison J (1994) Multivariate analysis and its applications, volume 24 of lecture notes—monograph series, chapter principles of compositional data analysis. Institute of Mathematical Statistics, Hayward, pp 73–81
Aitchison J (1997) The one-hour course in compositional data analysis or compositional data analysis is simple. In: Pawlowsky-Glahn V (ed) Proceedings of IAMG’97—the III annual conference of the international association for mathematical geology, volume I, II and addendum, Barcelona (E). CIMNE, Barcelona, pp 3–35, ISBN 978-84-87867-76-7
Aitchison J, Bacon-Shone J (1984) Log contrast models for experiments with mixtures. Biometrika 71:323–330
Aitchison J, Egozcue JJ (2005) Compositional data analysis: Where are we and where should we be heading? Math Geol 37(7):829–850
Aitchison J, Greenacre M (2002) Biplots for compositional data. J R Stat Soc Ser C Appl Stat 51(4):375–392
Aitchison J, Shen S (1980) Logistic-normal distributions. Some properties and uses. Biometrika 67(2):261–272
Aitchison J, Barceló-Vidal C, Martín-Fernández JA, Pawlowsky-Glahn V (2000) Logratio analysis and compositional distance. Math Geol 32(3):271–275
Aitchison J, Barceló-Vidal C, Martín-Fernández JA, Pawlowsky-Glahn V (2001) Reply to letter to the editor by S. Rehder and U. Zier on “Logratio analysis and compositional distance”. Math Geol 33(7):849–860
Aitchison J, Barceló-Vidal C, Egozcue JJ, Pawlowsky-Glahn V (2002) A concise guide for the algebraic-geometric structure of the simplex, the sample space for compositional data analysis. In: Bayer U, Burger H, Skala W (eds) Proceedings of IAMG’02—the VIII annual conference of the international association for mathematical geology, vol I and II. Selbstverlag der Alfred-Wegener-Stiftung, Berlin, pp 387–392
Atkinson AB (1970) On the measurement of inequality. J Econ Theory 2:244–263
Bacon-Shone J (2003) Modelling structural zeros in compositional data. In: Thió-Henestrosa S, Martín-Fernández JA (eds) Proceedings of CoDaWork’03, the 1st compositional data analysis workshop, Girona (E). Universitat de Girona, ISBN 84-8458-111-X, http://ima.udg.es/Activitats/CoDaWork2003/
Barceló-Vidal C, Martín-Fernández JA (2016) The mathematics of compositional analysis. Austrian J Stat 45:57–71
Barceló-Vidal C, Martín-Fernández JA, Pawlowsky-Glahn V (2001) Mathematical foundations of compositional data analysis. In: Ross G (ed) Proceedings of IAMG’01—the VII annual conference of the international association for mathematical geology, Cancun (Mex), p 20
Billheimer D, Guttorp P, Fagan W (2001) Statistical interpretation of species composition. J Am Stat Assoc 96(456):1205–1214
Buccianti A, Pawlowsky-Glahn V (2005) New perspectives on water chemistry and compositional data analysis. Math Geol 37(7):703–727
Chayes F (1971) Ratio correlation. University of Chicago Press, Chicago, p 99
Chen J, Zhang X, Li S (2017) Multiple linear regression with compositional response and covariates. J Appl Stat 44(12):2270–2285
Chipman HA, Gu H (2005) Interpretable dimension reduction. J Appl Stat 32:969–987
Comas-Cufí M, Thió-Henestrosa S (2011) Codapack 2.0: a stand-alone, multi-platform compositional software. See Egozcue et al. (2011c)
Connor RJ, Mosimann JE (1969) Concepts of independence for proportions with a generalization of the Dirichlet distribution. J Am Stat Assoc 64(325):194–206
Daunis-i-Estadella J, Barceló-Vidal J, Buccianti A (2006) Exploratory compositional data analysis. In: Compositional data analysis in the geosciences: from theory to practice, volume 264 of special publications. Geological Society, London, pp 161–174
de Finetti B (1926) Considerazioni matematiche sull’ereditarietà mendeliana. Metron 6(3):3–41
Egozcue JJ (2009) Reply to “On the Harker variation diagrams;...” by J. A. Cortés. Math Geosci 41(7):829–834
Egozcue JJ, Jarauta-Bragulat E (2014) Differential models for evolutionary compositions. Math Geosci 46(4):381–410
Egozcue JJ, Pawlowsky-Glahn V (2005) Groups of parts and their balances in compositional data analysis. Math Geol 37(7):795–828
Egozcue JJ, Pawlowsky-Glahn V (2011a) Basic concepts and procedures. See Pawlowsky-Glahn and Buccianti (2011), pp 12–28
Egozcue JJ, Pawlowsky-Glahn V (2011b) Evidence information in Bayesian updating. See Egozcue et al. (2011c)
Egozcue JJ, Pawlowsky-Glahn V (2018a) Evidence functions: a compositional approach to information (invited paper). Stat Oper Res Trans 42(2):1–24
Egozcue JJ, Pawlowsky-Glahn V (2018b) Modelling compositional data. The sample space approach, Chapter 4, p XXV, 875. Handbook of mathematical geosciences—fifty years of IAMG. Springer, Berlin
Egozcue JJ, Pawlowsky-Glahn V, Mateu-Figueras G, Barceló-Vidal C (2003) Isometric logratio transformations for compositional data analysis. Math Geol 35(3):279–300
Egozcue JJ, Díaz-Barrero JL, Pawlowsky-Glahn V (2006) Hilbert space of probability density functions based on Aitchison geometry. Acta Math Sin 22(4):1175–1182. https://doi.org/10.1007/s10114-005-0678-2
Egozcue JJ, Barceló-Vidal C, Martín-Fernández JA, Jarauta-Bragulat E, Díaz-Barrero JL, Mateu-Figueras G (2011a) Elements of simplicial linear algebra and geometry. See Pawlowsky-Glahn and Buccianti (2011), pp 141–157
Egozcue JJ, Jarauta-Bragulat E, Díaz-Barrero JL (2011b) Calculus of simplex-valued functions. See Pawlowsky-Glahn and Buccianti (2011), pp 158–175
Egozcue JJ, Tolosana-Delgado R, Ortego MI (eds) (2011c) Proceedings of the 4th international workshop on compositional data analysis, Sant Feliu de Guixols, Girona. CIMNE, Barcelona, ISBN 978-84-87867-76-7
Egozcue JJ, Daunis-i-Estadella J, Pawlowsky-Glahn V, Hron K, Filzmoser P (2012) Simplicial regression. The normal model. J Appl Probab Stat 6(1–2):87–108
Egozcue JJ, Pawlowsky-Glahn V, Tolosana-Delgado R, Ortego MI, van den Boogaart KG (2013) Bayes spaces: use of improper distributions and exponential families. Revista de la Real Academia de Ciencias Exactas, Físicas y Naturales, Serie A Matemáticas 107:475–486. https://doi.org/10.1007/s13398-012-0082-6
Egozcue JJ, Pawlowsky-Glahn V, Templ M, Hron K (2015) Independence in contingency tables using simplicial geometry. Commun Stat Theory Methods 44(18):3978–3996
Egozcue JJ, Pawlowsky-Glahn V, Gloor GB (2018) Linear association in compositional data analysis. Austrian J Stat 47(1):3–31
Erb I, Notredame C (2016) How should we measure proportionality on relative gene expression data? Theory Biosci 135(1–2):21–36. https://doi.org/10.1007/s12064-015-0220-8
Fernandes AD, Reid JN, Macklaim JM, McMurrough TA, Edgell DR, Gloor GB (2014) Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome 2:15.1–15.13
Filzmoser P, Hron K, Templ M (2012) Discriminant analysis for compositional data and robust parameter estimation. Comput Stat 27(4):585–604
Filzmoser P, Hron K, Templ M (2018) Applied compositional analysis. With worked examples in R. Springer, Switzerland AG, p 280
Fisher RA (1947) The analysis of covariance method for the relation between a part and the whole. Biometrics 3(2):65–68
Fréchet M (1948) Les éléments Aléatoires de Nature Quelconque dans une Espace Distancié. Annales de l’Institut Henri Poincaré 10(4):215–308
Fry JM, Fry TRL, McLaren KR (2000) Compositional data analysis and zeros in micro data. Appl Econ 32(8):953–959
Gini C (1921) Measurement of inequality of incomes. Econ J 31(121):124–126
Greenacre M (2011) Measuring subcompositional incoherence. Math Geosci 43(6):681–693
Halmos P (1974) Finite dimensional vector spaces. Springer, Berlin
Hijazi RH, Jernigan RW (2009) Modelling compositional data using Dirichlet regression models. J Appl Probab Stat 4(1):77–91
Hron K, Filzmoser P, Thompson K (2012) Linear regression with compositional explanatory variables. J Appl Stat 39(5):1115–1128
Hrůzová K, Todorov V, Hron K, Filzmoser P (2016) Classical and robust orthogonal regression between parts of compositional data. Statistics 50(6):1261–1275
INE (2016) Renta disponible bruta de los hogares (per cápita). Serie 2010–2014. Contabilidad regional de España. Base 2010
Kurtz ZD, Müller CL, Miraldi ER, Littman DR, Blaser MJ, Bonneau RA (2015) Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput Biol 11(5):e1004226. https://doi.org/10.1371/journal.pcbi.1004226
Kynčlová P, Hron K, Filzmoser P (2017) Correlation between compositional parts based on symmetric balances. Math Geosci 49:777–796. https://doi.org/10.1007/s11004-016-9669-3
Lin W, Shi P, Feng R, Li H (2014) Variable selection in regression with compositional covariates. Biometrika 101(4):785–797
Lovell D, Pawlowsky-Glahn V, Egozcue JJ, Marguerat S, Bähler J (2015) Proportionality: a valid alternative to correlation for relative data. PLoS Comput Biol 11(3):e1004075
Martín-Fernández JA, Barceló-Vidal C, Pawlowsky-Glahn V (2003) Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Math Geol 35(3):253–278
Martín-Fernández JA, Hron K, Templ M, Filzmoser P, Palarea-Albaladejo J (2012) Model-based replacement of rounded zeros in compositional data: classical and robust approaches. Comput Stat Data Anal 56:2688–2704
Martín-Fernández JA, Hron K, Templ M, Filzmoser P, Palarea-Albaladejo J (2015) Bayesian-multiplicative treatment of count zeros in compositional data sets. Stat Model 15(2):134–158
Martín-Fernández JA, Pawlowsky-Glahn V, Egozcue JJ, Tolosana-Delgado R (2018) Advances in principal balances for compositional data. Math Geosci 50(3):273–298
Mateu-Figueras G (2003) Models de distribució sobre el símplex. Ph.D. thesis, Universitat Politècnica de Catalunya, Barcelona
Mateu-Figueras G, Pawlowsky-Glahn V (2007) The skew-normal distribution on the simplex. Commun Stat Theory Methods 36(9):1787–1802
Mateu-Figueras G, Pawlowsky-Glahn V, Egozcue JJ (2011) The principle of working on coordinates. See Pawlowsky-Glahn and Buccianti (2011), pp 31–42
Mateu-Figueras G, Pawlowsky-Glahn V, Egozcue JJ (2013) The normal distribution in some constrained sample spaces. Stat Oper Res Trans 37(1):29–56
McCullagh P, Nelder JA (1989) Generalized linear models, 2nd edn. Chapman and Hall, London
Menafoglio A, Secchi P, Dalla Rosa M (2013) A universal kriging predictor for spatially dependent functional data of a Hilbert space. Electron J Stat 7:2209–2240
Menafoglio A, Guadagnini A, Secchi P (2016) Stochastic simulation of soil particle-size curves in heterogeneous aquifer systems through a bayes space approach. Water Resour Res 52(8):5708–5726
Morais J, Thomas-Agnan C, Simioni M (2018) Using compositional and Dirichlet models for market share regression. J Appl Stat 45(9):1670–1689. https://doi.org/10.1080/02664763.2017.1389864
Mosimann JE (1962) On the compound multinomial distribution, the multivariate \(\beta \)-distribution and correlations among proportions. Biometrika 49(1–2):65–82
Ortego MI, Egozcue JJ (2013) Spurious copulas. In: Hron K, Filzmoser P, Templ M (eds) Proceedings of the 5th workshop on compositional data analysis, CoDaWork 2013, pp 123–130
Palarea-Albaladejo J, Martín-Fernández J (2008) A modified EM alr-algorithm for replacing rounded zeros in compositional data sets. Comput Geosci 34(8):2233–2251
Palarea-Albaladejo J, Martín-Fernández JA (2015) zCompositions—R package for multivariate imputation of left-censored data under a compositional approach. Chemom Intell Lab Syst 143:85–96
Pawlowsky-Glahn V, Buccianti A (eds) (2011) Compositional data analysis: theory and applications. Wiley, New York, p 378
Pawlowsky-Glahn V, Egozcue JJ (2001) Geometric approach to statistical analysis on the simplex. Stoch Environ Res Risk Assess 15(5):384–398
Pawlowsky-Glahn V, Egozcue JJ (2002) BLU estimators and compositional data. Math Geol 34(3):259–274
Pawlowsky-Glahn V, Egozcue J (2011) Exploring compositional data with the coda-dendrogram. Austrian J Stat 40(1 & 2):103–113
Pawlowsky-Glahn V, Egozcue JJ, Lovell D (2015a) Tools for compositional data with a total. Stat Model 15(2):175–190
Pawlowsky-Glahn V, Egozcue JJ, Tolosana-Delgado R (2015b) Modeling and analysis of compositional data. Statistics in practice. Wiley, Chichester, p 272
Pearson K (1897) Mathematical contributions to the theory of evolution. On a form of spurious correlation which may arise when indices are used in the measurement of organs. Proc R Soc Lond LX:489–502
Queysanne M (1973) Álgebra Básica. Editorial Vicens Vives, Barcelona (E), p 669
Rivera-Pinto J, Egozcue JJ, Pawlowsky-Glahn V, Paredes R, Noguera-Julian M, Calle ML (2018) Balances: a new perspective for microbiome analysis. mSystems 3(4):e00053–18. https://doi.org/10.1128/mSystems.00053-18
Robert CP (1994) The Bayesian choice. A decision-theoretic motivation. Springer, New York
Scealy JL, Welsh AH (2011) Regression for compositional data by using distributions defined on the hypersphere. J R Stat Soc Ser B Stat Methodol 73(3):351–375
Shi P, Zhang A, Li H (2016) Regression analysis for microbiome compositional data. Ann Appl Stat 10(2):1019–1040
Shorrocks AF (1980) The class of additively decomposable inequality measures. Econometrica 48(3):613–625
Theil H (1967) On the measurement of inequality. North Holland, Amsterdam
Tolosana-Delgado R, von Eynatten H (2009) Grain-size control on petrographic composition of sediments: compositional regression and rounded zeros. Math Geosci 41:869–886
Tolosana-Delgado R, von Eynatten H (2010) Simplifying compositional multiple regression: application to grain size controls on sediment geochemistry. Comput Geosci 36(5):577–589
van den Boogaart KG, Tolosana-Delgado R (2013) Analysing compositional data with R. Springer, Berlin, p 258
van den Boogaart KG, Egozcue JJ, Pawlowsky-Glahn V (2010) Bayes linear spaces. Stat Oper Res Trans 34(2):201–222
van den Boogaart KG, Egozcue JJ, Pawlowsky-Glahn V (2014) Bayes Hilbert spaces. Aust NZ J Stat 56(2):171–194
Vistelius AB (1960) The skew frequency distributions and the fundamental law of the geochemical processes. J Geol 68(1):1–22
Wang H, Shangguan L, Wu J, Guan R (2013) Multiple linear regression modeling for compositional data. Neurocomputing 122:490–500
Wikipedia (2018) Homogeneous function—Wikipedia, The Free Encyclopedia. Accessed 5 Aug 2018
Acknowledgements
This work was supported by Grants MTM2015-65016-C2-1-R and MTM2015-65016-C2-2-R (MINECO/FEDER) from the Spanish Ministry of Economy and Competitiveness and European Regional Development Fund. We are grateful for the useful comments and criticisms given by three anonymous reviewers.
Additional information
This invited paper is discussed in comments available at: https://doi.org/10.1007/s11749-019-00671-5, https://doi.org/10.1007/s11749-019-00672-4, https://doi.org/10.1007/s11749-019-00673-3.
Appendices
Data
Mean Household Gross Income (MHGI) in Spain. The Instituto Nacional de Estadística (INE 2016) provides on its webpage an estimate of the MHGI per capita for all Autonomous Communities and Autonomous Cities (ACs) in Spain for the years 2000 to 2014. According to the webpage, the data are median aggregated data, but the aggregation algorithm is not described. Nevertheless, the result is a non-closed composition, and the compositional tools presented here lead to exactly the same results whether they are applied to the given data, to a representation in proportions or percentages, or, as on the INE webpage, to values normalized so that the estimated MHGI for the whole of Spain corresponds to 100%. This data set (19 ACs for 15 years, listed in Table 2) is used here to illustrate the procedures and properties presented. Figure 10 shows the values in euros. The MHGI in the Balearic Islands (circles), Canary Islands (triangles), Madrid (\(+\)) and Catalonia (\(\times \)) are highlighted for discussion.
In Sect. 3, the MHGI for all ACs is considered as a 19-part composition.
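The invariance under rescaling mentioned in the data description can be illustrated with a minimal sketch. It assumes the usual centered log-ratio (clr) definition, \(\mathrm {clr}_k(\mathbf {x})= \log x_k - \log \mathrm {g}_\mathrm {m}(\mathbf {x})\); the income values and the helper names `clr` and `aitchison_norm_sq` are hypothetical and are not taken from the INE data.

```python
import math


def clr(x):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison_norm_sq(x):
    """Squared Aitchison norm, computed as the sum of squared clr components."""
    return sum(c * c for c in clr(x))


# Hypothetical per-capita incomes in euros for four regions (illustrative only).
euros = [12000.0, 15500.0, 18200.0, 21000.0]

# Same composition expressed as proportions and as an index with mean = 100.
total = sum(euros)
proportions = [v / total for v in euros]
mean_value = total / len(euros)
index_100 = [100.0 * v / mean_value for v in euros]

# The three representations are proportional vectors, so the norms should agree.
norms = [aitchison_norm_sq(v) for v in (euros, proportions, index_100)]
print(norms)
```

Because the three vectors differ only by a positive factor, their clr coefficients coincide, and so does any quantity built from them.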
Proofs of the properties of the inequality index \(A_I^2\)
Consider a D-part composition \(\mathbf {s}\) divided into two subcompositions \(\mathbf {s}_1\) and \(\mathbf {s}_2\) of \(d_1\) and \(d_2\) shares, respectively, with \(d_1 + d_2 = D\) and represented in the same units. Proofs of the properties of \(A_I^2\) follow.
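Decomposability by subcompositions:
\[ A_I^2((\mathbf {s}_1,\mathbf {s}_2)) = \frac{d_1}{D}\, A_I^2(\mathbf {s}_1) + \frac{d_2}{D}\, A_I^2(\mathbf {s}_2) + \frac{4 d_1 d_2}{D^2}\, A_I^2((\mathrm {g}_\mathrm {m}(\mathbf {s}_1),\mathrm {g}_\mathrm {m}(\mathbf {s}_2))), \]
where \((\mathrm {g}_\mathrm {m}(\mathbf {s}_1),\mathrm {g}_\mathrm {m}(\mathbf {s}_2))\) is a 2-part composition whose components are the geometric means of the parts in \(\mathbf {s}_1\) and \(\mathbf {s}_2\), respectively.
Since for any D-part composition \(\mathbf {s}\) it holds that \(A_I^2(\mathbf {s})=(1/D)\Vert \mathbf {s} \Vert _a^2\), the index of the composed vector of shares \(A_I^2((\mathbf {s}_1,\mathbf {s}_2))\) can be expressed as a sum of squares of orthonormal balances obtained in an SBP. Define a first binary partition by separating the shares in \(\mathbf {s}_1\), marked with a \(+1\), from the shares in \(\mathbf {s}_2\), marked with a \(-1\). Whatever further partitions are chosen within \(\mathbf {s}_1\) and within \(\mathbf {s}_2\), the squared balances within each group add up to the square norm of the corresponding subcomposition, which gives the expression
\[ \Vert (\mathbf {s}_1,\mathbf {s}_2) \Vert _a^2 = \frac{d_1 d_2}{D}\, \log ^2 \frac{\mathrm {g}_\mathrm {m}(\mathbf {s}_1)}{\mathrm {g}_\mathrm {m}(\mathbf {s}_2)} + \Vert \mathbf {s}_1 \Vert _a^2 + \Vert \mathbf {s}_2 \Vert _a^2 . \tag{10} \]
The 2-part composition \((\mathrm {g}_\mathrm {m}(\mathbf {s}_1),\mathrm {g}_\mathrm {m}(\mathbf {s}_2))\) has square norm
\[ \Vert (\mathrm {g}_\mathrm {m}(\mathbf {s}_1),\mathrm {g}_\mathrm {m}(\mathbf {s}_2)) \Vert _a^2 = \frac{1}{2}\, \log ^2 \frac{\mathrm {g}_\mathrm {m}(\mathbf {s}_1)}{\mathrm {g}_\mathrm {m}(\mathbf {s}_2)} . \tag{11} \]
Dividing Eq. (10) by the appropriate constants to obtain inequality indexes, and substituting the value in Eq. (11), yields
\[ A_I^2((\mathbf {s}_1,\mathbf {s}_2)) = \frac{d_1}{D}\, A_I^2(\mathbf {s}_1) + \frac{d_2}{D}\, A_I^2(\mathbf {s}_2) + \frac{4 d_1 d_2}{D^2}\, A_I^2((\mathrm {g}_\mathrm {m}(\mathbf {s}_1),\mathrm {g}_\mathrm {m}(\mathbf {s}_2))) . \tag{12} \]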
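The decomposition can be checked numerically. The sketch below assumes \(A_I^2(\mathbf {s})=(1/D)\Vert \mathbf {s} \Vert _a^2\), which is the population variance of the log-shares, and assumes the weights \(d_1/D\), \(d_2/D\) and \(4d_1d_2/D^2\) that result from working through Eqs. (10) and (11). The share values and the helper names `gmean` and `a_index` are hypothetical.

```python
import math


def gmean(x):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in x) / len(x))


def a_index(s):
    """A_I^2(s) = (1/D) * squared Aitchison norm = population variance of log(s)."""
    logs = [math.log(v) for v in s]
    mean_log = sum(logs) / len(logs)
    return sum((value - mean_log) ** 2 for value in logs) / len(logs)


# Two hypothetical groups of shares, expressed in the same units.
s1 = [3.0, 5.0, 8.0]
s2 = [1.0, 2.0]
d1, d2 = len(s1), len(s2)
D = d1 + d2

# Index of the composed vector of shares.
lhs = a_index(s1 + s2)

# Within-group terms plus the term for the 2-part composition of geometric means.
between = a_index([gmean(s1), gmean(s2)])
rhs = (d1 / D) * a_index(s1) + (d2 / D) * a_index(s2) + (4 * d1 * d2 / D**2) * between

print(lhs, rhs)
```

This is a check on one set of values, not a proof; it only confirms that the two sides agree for the chosen shares.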
Population replication: Let \(\mathbf {p}\) be a composition of shares. The property to prove is \(A_I^2(\mathbf {p})=A_I^2((\mathbf {p},\mathbf {p}, \dots ,\mathbf {p}))\).
The geometric means of any number of replications of \(\mathbf {p}\) are equal, that is \(\mathrm {g}_\mathrm {m}(\mathbf {p}) = \mathrm {g}_\mathrm {m}((\mathbf {p},\mathbf {p},\dots ,\mathbf {p}))\). Taking \(\mathbf {s}_1=\mathbf {p}\) and \(\mathbf {s}_2\) equal to one or more replications of \(\mathbf {p}\) in the decomposability by subcompositions [Eq. (12)], the two geometric means coincide and hence the value of \(A_I^2((\mathrm {g}_\mathrm {m}(\mathbf {s}_1),\mathrm {g}_\mathrm {m}(\mathbf {s}_2)))\) is null. Applying the decomposability by subcompositions repeatedly, the population replication property holds.
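A short numerical sketch of the population replication property follows. It again assumes \(A_I^2(\mathbf {p})=(1/D)\Vert \mathbf {p} \Vert _a^2\), computed as the population variance of the log-shares; the shares and helper names are hypothetical.

```python
import math


def gmean(x):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in x) / len(x))


def a_index(s):
    """A_I^2(s), computed as the population variance of the log-shares."""
    logs = [math.log(v) for v in s]
    mean_log = sum(logs) / len(logs)
    return sum((value - mean_log) ** 2 for value in logs) / len(logs)


# Hypothetical composition of shares.
p = [0.1, 0.2, 0.3, 0.4]

# Replicate the population 2, 3 and 5 times and compare with the original.
results = []
for k in (2, 3, 5):
    replicated = p * k  # list repetition: (p, p, ..., p)
    same_gmean = math.isclose(gmean(replicated), gmean(p), rel_tol=1e-9)
    same_index = math.isclose(a_index(replicated), a_index(p), rel_tol=1e-9)
    results.append((k, same_gmean, same_index))

print(results)
```

Replicating the shares leaves both the mean of the logs and their average squared deviation unchanged, which is what the comparison reports.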
Principle of transfers: A transfer of \(\delta >0\) to the i-th unit increases its share to \(p_i+\delta \), while \(\delta \) is deducted from the share of the j-th unit. If \(p_j-\delta > p_i+\delta \), then \(A_I^2(\mathbf {p})\ge A_I^2(\mathbf {p}')\), where \(\mathbf {p}'\) is the composition after the transfer.
Without loss of generality, take \(i=1\), \(j=2\). There are positive constants a, b such that \(p_1'=p_1+\delta = a p_1\), \(p_2'=p_2-\delta =b p_2\), with \(a > 1\) and \(0< b < 1\), since \(p_2-p_1> 2\delta >0\). The new shares \(\mathbf {p}'\) are the perturbation
\[ \mathbf {p}' = \mathbf {p} \oplus (a, b, 1, \dots , 1) . \]
The statement is proven if the following inner product satisfies
After some computation, the inner product is
where \(\mathrm {clr}_k(\mathbf {p})= \log p_k - \log \mathrm {g}_\mathrm {m}(\mathbf {p})\), \(k=1,2\). This inner product is negative if the two inequalities \(\log (ab)> 0\) and \((\log b)/(\log a)> -1\) hold. In fact, since \(\log a>0\), the negativeness of Eq. (13) is equivalent to
which derives from \((\log b)/(\log a)> -1\) and \( \mathrm {clr}_1(\mathbf {p}) < \mathrm {clr}_2(\mathbf {p})\).
The first inequality derives from
and the assumption that \((p_2-p_1)-2\delta \ge 0\). The second inequality also holds from the assumption \(p_1+\delta < p_2-\delta \), since this is \(a < b\) and then \(\log a < \log b\); finally, \(\log a>0\) implies that \((\log b) /(\log a)< -1\).
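The representation of a transfer as a perturbation can be illustrated with a small sketch. It assumes that perturbation is the component-wise product followed by closure to unit sum, and it uses the identity \(ab = 1 + \delta \,(p_2-p_1-\delta )/(p_1 p_2)\), which follows directly from \(a=1+\delta /p_1\) and \(b=1-\delta /p_2\). The shares, the value of \(\delta \) and the helper names are hypothetical, and the index comparison at the end concerns this single example only.

```python
import math


def closure(x):
    """Rescale a positive vector to unit sum."""
    total = sum(x)
    return [v / total for v in x]


def perturb(x, y):
    """Perturbation: component-wise product followed by closure."""
    return closure([u * v for u, v in zip(x, y)])


def a_index(s):
    """A_I^2(s), computed as the population variance of the log-shares."""
    logs = [math.log(v) for v in s]
    mean_log = sum(logs) / len(logs)
    return sum((value - mean_log) ** 2 for value in logs) / len(logs)


# Hypothetical shares; unit 2 transfers delta to unit 1, with p2 - delta > p1 + delta.
p = [0.1, 0.5, 0.4]
delta = 0.1

a = (p[0] + delta) / p[0]  # a > 1
b = (p[1] - delta) / p[1]  # 0 < b < 1
q = [a, b] + [1.0] * (len(p) - 2)

p_after = [p[0] + delta, p[1] - delta] + p[2:]  # shares after the transfer
p_perturbed = perturb(p, q)                     # same shares via perturbation

ab_identity = 1.0 + delta * (p[1] - p[0] - delta) / (p[0] * p[1])
index_before, index_after = a_index(p), a_index(p_after)

print(p_after, p_perturbed, a * b, ab_identity, index_before, index_after)
```

For these particular values the two routes give the same shares, \(\log (ab)\) is positive, and the index is smaller after the transfer.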
Supplementary material
This section contains some tables and figures that are not included in the body of the article owing to its length (Tables 3, 4, 5; Figs. 11, 12, 13, 14, 15).
Cite this article
Egozcue, J.J., Pawlowsky-Glahn, V. Compositional data: the sample space and its structure. TEST 28, 599–638 (2019). https://doi.org/10.1007/s11749-019-00670-6
DOI: https://doi.org/10.1007/s11749-019-00670-6
Keywords
- Simplex
- Equivalence class
- Isometric log-ratio coordinates
- Euclidean space
- Aitchison geometry
- Principal balances
- Dendrogram
- Principal components
- Biplot
- Household income
- Normal distribution on the simplex
- Logistic-normal