Compositional data: the sample space and its structure

Egozcue, Juan José; Pawlowsky-Glahn, Vera

doi:10.1007/s11749-019-00670-6

Invited Paper
Published: 16 July 2019

TEST, Volume 28, pages 599–638 (2019)

Abstract

The log-ratio approach to compositional data (CoDa) analysis has now entered a mature phase. The principles and statistical tools introduced by J. Aitchison in the eighties have proven successful in solving a number of applied problems. The algebraic–geometric structure of the sample space, tailored to those principles, was developed at the beginning of the millennium. Two main ideas completed J. Aitchison's seminal work: the conception of compositions as equivalence classes of proportional vectors, and their representation in the simplex endowed with an interpretable Euclidean structure. These achievements allowed the representation of compositions in meaningful coordinates (preferably Cartesian), as well as orthogonal projections compatible with the Aitchison distance introduced two decades before. These ideas and concepts are reviewed up to the normal distribution on the simplex and the associated central limit theorem. Exploratory tools, specifically designed for CoDa, are also reviewed. To illustrate the adequacy and interpretability of the sample space structure, a new inequality index, based on the Aitchison norm, is proposed. Most concepts are illustrated with an example of mean household gross income per capita in Spain.

References

Äijö T, Müller CL, Bonneau R (2018) Temporal probabilistic modeling of bacterial compositions derived from 16S rRNA sequencing. Bioinformatics 34(3):372–380
Aitchison J (1982) The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B Stat Methodol 44(2):139–177
Aitchison J (1983) Principal component analysis of compositional data. Biometrika 70(1):57–65
Aitchison J (1986) The statistical analysis of compositional data. Monographs on statistics and applied probability. Chapman & Hall Ltd., London (reprinted in 2003 with additional material by The Blackburn Press)
Aitchison J (1992) On criteria for measures of compositional difference. Math Geol 24(4):365–379
Aitchison J (1994) Multivariate analysis and its applications, volume 24 of lecture notes—monograph series, chapter principles of compositional data analysis. Institute of Mathematical Statistics, Hayward, pp 73–81
Aitchison J (1997) The one-hour course in compositional data analysis or compositional data analysis is simple. In: Pawlowsky-Glahn V (ed) Proceedings of IAMG’97—the III annual conference of the international association for mathematical geology, volume I, II and addendum, Barcelona (E). CIMNE, Barcelona, pp 3–35, ISBN 978-84-87867-76-7
Aitchison J, Bacon-Shone J (1984) Log contrast models for experiments with mixtures. Biometrika 71:323–330
Aitchison J, Egozcue JJ (2005) Compositional data analysis: Where are we and where should we be heading? Math Geol 37(7):829–850
Aitchison J, Greenacre M (2002) Biplots for compositional data. J R Stat Soc Ser C Appl Stat 51(4):375–392
Aitchison J, Shen S (1980) Logistic-normal distributions. Some properties and uses. Biometrika 67(2):261–272
Aitchison J, Barceló-Vidal C, Martín-Fernández JA, Pawlowsky-Glahn V (2000) Logratio analysis and compositional distance. Math Geol 32(3):271–275
Aitchison J, Barceló-Vidal C, Martín-Fernández JA, Pawlowsky-Glahn V (2001) Reply to letter to the editor by S. Rehder and U. Zier on “Logratio analysis and compositional distance”. Math Geol 33(7):849–860
Aitchison J, Barceló-Vidal C, Egozcue JJ, Pawlowsky-Glahn V (2002) A concise guide for the algebraic-geometric structure of the simplex, the sample space for compositional data analysis. In: Bayer U, Burger H, Skala W (eds) Proceedings of IAMG’02—the VIII annual conference of the international association for mathematical geology, vol I and II. Selbstverlag der Alfred-Wegener-Stiftung, Berlin, pp 387–392
Atkinson AB (1970) On the measurement of inequality. J Econ Theory 2:244–263
Bacon-Shone J (2003) Modelling structural zeros in compositional data. In: Thió-Henestrosa S, Martín-Fernández JA (eds) Proceedings of CoDaWork’03, the 1st compositional data analysis workshop, Girona (E). Universitat de Girona, ISBN 84-8458-111-X, http://ima.udg.es/Activitats/CoDaWork2003/
Barceló-Vidal C, Martín-Fernández JA (2016) The mathematics of compositional analysis. Austrian J Stat 45:57–71
Barceló-Vidal C, Martín-Fernández JA, Pawlowsky-Glahn V (2001) Mathematical foundations of compositional data analysis. In: Ross G (ed) Proceedings of IAMG’01—the VII annual conference of the international association for mathematical geology, Cancun (Mex), p 20
Billheimer D, Guttorp P, Fagan W (2001) Statistical interpretation of species composition. J Am Stat Assoc 96(456):1205–1214
Buccianti A, Pawlowsky-Glahn V (2005) New perspectives on water chemistry and compositional data analysis. Math Geol 37(7):703–727
Chayes F (1971) Ratio correlation. University of Chicago Press, Chicago, p 99
Chen J, Zhang X, Li S (2017) Multiple linear regression with compositional response and covariates. J Appl Stat 44(12):2270–2285
Chipman HA, Gu H (2005) Interpretable dimension reduction. J Appl Stat 32:969–987
Comas-Cufí M, Thió-Henestrosa S (2011) Codapack 2.0: a stand-alone, multi-platform compositional software. See Egozcue et al. (2011c)
Connor RJ, Mosimann JE (1969) Concepts of independence for proportions with a generalization of the Dirichlet distribution. J Am Stat Assoc 64(325):194–206
Daunis-i Estadella J, Barceló-Vidal J, Buccianti A (2006) Exploratory compositional data analysis. In: Compositional data analysis in the geosciences: from theory to practice, volume 264 of special publications. Geological Society, London, pp 161–174
de Finetti B (1926) Considerazioni matematiche sull’ereditarietà mendeliana. Metron 6(3):3–41
Egozcue JJ (2009) Reply to “On the Harker variation diagrams;...” by J. A. Cortés. Math Geosci 41(7):829–834
Egozcue JJ, Jarauta-Bragulat E (2014) Differential models for evolutionary compositions. Math Geosci 46(4):381–410
Egozcue JJ, Pawlowsky-Glahn V (2005) Groups of parts and their balances in compositional data analysis. Math Geol 37(7):795–828
Egozcue JJ, Pawlowsky-Glahn V (2011a) Basic concepts and procedures. See Pawlowsky-Glahn and Buccianti (2011), pp 12–28
Egozcue JJ, Pawlowsky-Glahn V (2011b) Evidence information in Bayesian updating. See Egozcue et al. (2011c)
Egozcue JJ, Pawlowsky-Glahn V (2018a) Evidence functions: a compositional approach to information (invited paper). Stat Oper Res Trans 42(2):1–24
Egozcue JJ, Pawlowsky-Glahn V (2018b) Modelling compositional data. The sample space approach, Chapter 4, p XXV, 875. Handbook of mathematical geosciences—fifty years of IAMG. Springer, Berlin
Egozcue JJ, Pawlowsky-Glahn V, Mateu-Figueras G, Barceló-Vidal C (2003) Isometric logratio transformations for compositional data analysis. Math Geol 35(3):279–300
Egozcue JJ, Díaz-Barrero JL, Pawlowsky-Glahn V (2006) Hilbert space of probability density functions based on Aitchison geometry. Acta Math Sin 22(4):1175–1182. https://doi.org/10.1007/s10114-005-0678-2
Egozcue JJ, Barceló-Vidal C, Martín-Fernández JA, Jarauta-Bragulat E, Díaz-Barrero JL, Mateu-Figueras G (2011a) Elements of simplicial linear algebra and geometry. See Pawlowsky-Glahn and Buccianti (2011), pp 141–157
Egozcue JJ, Jarauta-Bragulat E, Díaz-Barrero JL (2011b) Calculus of simplex-valued functions. See Pawlowsky-Glahn and Buccianti (2011), pp 158–175
Egozcue JJ, Tolosana-Delgado R, Ortego MI (eds) (2011c) Proceedings of the 4th international workshop on compositional data analysis, Sant Feliu de Guixols, Girona. CIMNE, Barcelona, ISBN 978-84-87867-76-7
Egozcue JJ, Daunis-i-Estadella J, Pawlowsky-Glahn V, Hron K, Filzmoser P (2012) Simplicial regression. The normal model. J Appl Probab Stat 6(1–2):87–108
Egozcue JJ, Pawlowsky-Glahn V, Tolosana-Delgado R, Ortego MI, van den Boogaart KG (2013) Bayes spaces: use of improper distributions and exponential families. Revista de la Real Academia de Ciencias Exactas, Físicas y Naturales, Serie A Matemáticas 107:475–486. https://doi.org/10.1007/s13398-012-0082-6
Egozcue JJ, Pawlowsky-Glahn V, Templ M, Hron K (2015) Independence in contingency tables using simplicial geometry. Commun Stat Theory Methods 44(18):3978–3996
Egozcue JJ, Pawlowsky-Glahn V, Gloor GB (2018) Linear association in compositional data analysis. Austrian J Stat 47(1):3–31
Erb I, Notredame C (2016) How should we measure proportionality on relative gene expression data? Theory Biosci 135(1–2):21–36. https://doi.org/10.1007/s12064-015-0220-8
Fernandes AD, Reid JN, Macklaim JM, McMurrough TA, Edgell DR, Gloor GB (2014) Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome 2:15.1–15.13
Filzmoser P, Hron K, Templ M (2012) Discriminant analysis for compositional data and robust parameter estimation. Comput Stat 27(4):585–604
Filzmoser P, Hron K, Templ M (2018) Applied compositional analysis. With worked examples in R. Springer, Switzerland AG, p 280
Fisher RA (1947) The analysis of covariance method for the relation between a part and the whole. Biometrics 3(2):65–68
Fréchet M (1948) Les éléments Aléatoires de Nature Quelconque dans une Espace Distancié. Annales de l’Institut Henri Poincaré 10(4):215–308
Fry JM, Fry TRL, McLaren KR (2000) Compositional data analysis and zeros in micro data. Appl Econ 32(8):953–959
Gini C (1921) Measurement of inequality of incomes. Econ J 31(121):124–126
Greenacre M (2011) Measuring subcompositional incoherence. Math Geosci 43(6):681–693
Halmos P (1974) Finite dimensional vector spaces. Springer, Berlin
Hijazi RH, Jernigan RW (2009) Modelling compositional data using Dirichlet regression models. J Appl Probab Stat 4(1):77–91
Hron K, Filzmoser P, Thompson K (2012) Linear regression with compositional explanatory variables. J Appl Stat 39(5):1115–1128
Hrůzová K, Todorov V, Hron K, Filzmoser P (2016) Classical and robust orthogonal regression between parts of compositional data. Statistics 50(6):1261–1275
INE (2016) Renta disponible bruta de los hogares (per cápita). Serie 2010–2014. Contabilidad regional de España. Base 2010
Kurtz ZD, Müller CL, Miraldi ER, Littman DR, Blaser MJ, Bonneau RA (2015) Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput Biol 11(5):e1004226. https://doi.org/10.1371/journal.pcbi.1004226
Kynčlová P, Hron K, Filzmoser P (2017) Correlation between compositional parts based on symmetric balances. Math Geosci 49:777–796. https://doi.org/10.1007/s11004-016-9669-3
Lin W, Shi P, Feng R, Li H (2014) Variable selection in regression with compositional covariates. Biometrika 101(4):785–797
Lovell D, Pawlowsky-Glahn V, Egozcue JJ, Marguerat S, Bähler J (2015) Proportionality: a valid alternative to correlation for relative data. PLoS Comput Biol 11(3):e1004075
Martín-Fernández JA, Barceló-Vidal C, Pawlowsky-Glahn V (2003) Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Math Geol 35(3):253–278
Martín-Fernández JA, Hron K, Templ M, Filzmoser P, Palarea-Albaladejo J (2012) Model-based replacement of rounded zeros in compositional data: classical and robust approaches. Comput Stat Data Anal 56:2688–2704
Martín-Fernández JA, Hron K, Templ M, Filzmoser P, Palarea-Albaladejo J (2015) Bayesian-multiplicative treatment of count zeros in compositional data sets. Stat Model 15(2):134–158
Martín-Fernández JA, Pawlowsky-Glahn V, Egozcue JJ, Tolosana-Delgado R (2018) Advances in principal balances for compositional data. Math Geosci 50(3):273–298
Mateu-Figueras G (2003) Models de distribució sobre el símplex. Ph.D. thesis, Universitat Politècnica de Catalunya, Barcelona
Mateu-Figueras G, Pawlowsky-Glahn V (2007) The skew-normal distribution on the simplex. Commun Stat Theory Methods 36(9):1787–1802
Mateu-Figueras G, Pawlowsky-Glahn V, Egozcue JJ (2011) The principle of working on coordinates. See Pawlowsky-Glahn and Buccianti (2011), pp 31–42
Mateu-Figueras G, Pawlowsky-Glahn V, Egozcue JJ (2013) The normal distribution in some constrained sample spaces. Stat Oper Res Trans 37(1):29–56
McCullagh P, Nelder JA (1989) Generalized linear models, 2nd edn. Chapman and Hall, London
Menafoglio A, Secchi P, Dalla Rosa M (2013) A universal kriging predictor for spatially dependent functional data of a Hilbert space. Electron J Stat 7:2209–2240
Menafoglio A, Guadagnini A, Secchi P (2016) Stochastic simulation of soil particle-size curves in heterogeneous aquifer systems through a bayes space approach. Water Resour Res 52(8):5708–5726
Morais J, Thomas-Agnan C, Simioni M (2018) Using compositional and Dirichlet models for market share regression. J Appl Stat 45(9):1670–1689. https://doi.org/10.1080/02664763.2017.1389864
Mosimann JE (1962) On the compound multinomial distribution, the multivariate $\beta $-distribution and correlations among proportions. Biometrika 49(1–2):65–82
Ortego MI, Egozcue JJ (2013) Spurious copulas. In: Hron K, Filzmoser P, Templ M (eds) Proceedings of the 5th workshop on compositional data analysis, CoDaWork 2013, pp 123–130
Palarea-Albaladejo J, Martín-Fernández J (2008) A modified EM alr-algorithm for replacing rounded zeros in compositional data sets. Comput Geosci 34(8):2233–2251
Palarea-Albaladejo J, Martín-Fernández JA (2015) zCompositions—R package for multivariate imputation of left-censored data under a compositional approach. Chemom Intell Lab Syst 143:85–96
Pawlowsky-Glahn V, Buccianti A (eds) (2011) Compositional data analysis: theory and applications. Wiley, New York, p 378
Pawlowsky-Glahn V, Egozcue JJ (2001) Geometric approach to statistical analysis on the simplex. Stoch Environ Res Risk Assess 15(5):384–398
Pawlowsky-Glahn V, Egozcue JJ (2002) BLU estimators and compositional data. Math Geol 34(3):259–274
Pawlowsky-Glahn V, Egozcue J (2011) Exploring compositional data with the coda-dendrogram. Austrian J Stat 40(1 & 2):103–113
Pawlowsky-Glahn V, Egozcue JJ, Lovell D (2015a) Tools for compositional data with a total. Stat Model 15(2):175–190
Pawlowsky-Glahn V, Egozcue JJ, Tolosana-Delgado R (2015b) Modeling and analysis of compositional data. Statistics in practice. Wiley, Chichester, p 272
Pearson K (1897) Mathematical contributions to the theory of evolution. On a form of spurious correlation which may arise when indices are used in the measurement of organs. Proc R Soc Lond LX:489–502
Queysanne M (1973) Álgebra Básica. Editorial Vicens Vives, Barcelona (E), p 669
Rivera-Pinto J, Egozcue JJ, Pawlowsky-Glahn V, Paredes R, Noguera-Julian M, Calle ML (2018) Balances: a new perspective for microbiome analysis. mSystems 3(4):e00053–18. https://doi.org/10.1128/mSystems.00053-18
Robert CP (1994) The Bayesian choice. A decision-theoretic motivation. Springer, New York
Scealy JL, Welsh AH (2011) Regression for compositional data by using distributions defined on the hypersphere. J R Stat Soc Ser B Stat Methodol 73(3):351–375
Shi P, Zhang A, Li H (2016) Regression analysis for microbiome compositional data. Ann Appl Stat 10(2):1019–1040
Shorrocks AF (1980) The class of additively decomposable inequality measures. Econometrica 48(3):613–625
Theil H (1967) On the measurement of inequality. North Holland, Amsterdam
Tolosana-Delgado R, von Eynatten H (2009) Grain-size control on petrographic composition of sediments: compositional regression and rounded zeros. Math Geosci 41:869–886
Tolosana-Delgado R, von Eynatten H (2010) Simplifying compositional multiple regression: application to grain size controls on sediment geochemistry. Comput Geosci 36(5):577–589
van den Boogaart KG, Tolosana-Delgado R (2013) Analysing compositional data with R. Springer, Berlin, p 258
van den Boogaart KG, Egozcue JJ, Pawlowsky-Glahn V (2010) Bayes linear spaces. Stat Oper Res Trans 34(2):201–222
van den Boogaart KG, Egozcue JJ, Pawlowsky-Glahn V (2014) Bayes Hilbert spaces. Aust NZ J Stat 56(2):171–194
Vistelius AB (1960) The skew frequency distributions and the fundamental law of the geochemical processes. J Geol 68(1):1–22
Wang H, Shangguan L, Wu J, Guan R (2013) Multiple linear regression modeling for compositional data. Neurocomputing 122:490–500
Wikipedia (2018) Homogeneous function—Wikipedia, The Free Encyclopedia. Accessed 5 Aug 2018

Acknowledgements

This work was supported by Grants MTM2015-65016-C2-1-R and MTM2015-65016-C2-2-R (MINECO/FEDER) from the Spanish Ministry of Economy and Competitiveness and European Regional Development Fund. We are grateful for the useful comments and criticisms given by three anonymous reviewers.

Author information

Authors and Affiliations

Universitat Politècnica de Catalunya, Jordi Girona 1–3, Mod. C2, Barcelona, Spain
Juan José Egozcue
Universitat de Girona, Campus Montilivi, P4, Girona, Spain
Vera Pawlowsky-Glahn

Corresponding author

Correspondence to Juan José Egozcue.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This invited paper is discussed in comments available at: https://doi.org/10.1007/s11749-019-00671-5, https://doi.org/10.1007/s11749-019-00672-4, https://doi.org/10.1007/s11749-019-00673-3.

Appendices

Data

Mean Household Gross Income (MHGI) in Spain. The Instituto Nacional de Estadística (INE) (INE 2016) provides on its webpage an estimate of the MHGI per capita for all Autonomous Communities and Autonomous Cities (ACs) in Spain for the years 2000 to 2014. According to the webpage, the data are median aggregated data, but the aggregation algorithm is not given. Nevertheless, the result is a non-closed composition, and the compositional tools presented here lead to exactly the same results whether applied to the given data, to a representation in proportions or percentages or, as on the INE webpage, to data normalized so that the estimated MHGI for the whole of Spain corresponds to 100%. This data set (19 ACs for 15 years, listed in Table 2) is used here to illustrate the procedures and properties presented. Figure 10 shows the values in euros. The MHGI values for the Balearic Islands (circles), Canary Islands (triangles), Madrid ($+$) and Catalonia ($\times $) are highlighted for discussion.

In Sect. 3, the MHGI for all ACs is considered as a 19-part composition.
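
The scale invariance mentioned in the data description can be checked numerically. The following is a minimal sketch, assuming the squared Aitchison norm is computed as the sum of squared clr coefficients. The income values and the reference value `13000.0` are made up for illustration and are not INE figures.

```python
import math

def clr(x):
    """Centered log-ratio coefficients: log parts minus the mean log part."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def aitchison_norm_sq(x):
    """Squared Aitchison norm, computed as the sum of squared clr coefficients."""
    return sum(c * c for c in clr(x))

# Hypothetical per-capita incomes in euros (illustrative, not INE data).
euros = [9500.0, 12000.0, 15800.0, 11200.0]
total = sum(euros)
proportions = [v / total for v in euros]
# Hypothetical national reference value mapped to 100%.
percent_of_reference = [100.0 * v / 13000.0 for v in euros]

# The three representations differ only by a positive scale factor.
norms = [aitchison_norm_sq(v) for v in (euros, proportions, percent_of_reference)]
print(norms)
```

The three printed values agree up to floating-point rounding, because a common positive factor cancels in every clr coefficient.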

Table 2 MHGI per capita in the period 2000–2014 (INE 2016)

Proofs of the properties of the inequality index $A_I^2$

Consider a D-part composition $\mathbf {s}$ divided into two subcompositions $\mathbf {s}_1$ and $\mathbf {s}_2$ of $d_1$ and $d_2$ shares, respectively, with $d_1 + d_2 = D$ and represented in the same units. Proofs of the properties of $A_I^2$ follow.

Decomposability by subcompositions:

$$\begin{aligned} A_I^2( (\mathbf {s}_1,\mathbf {s}_2) ) = \frac{d_1 A_I^2(\mathbf {s}_1)}{d_1+d_2} + \frac{d_2 A_I^2( \mathbf {s}_2)}{d_1+d_2} + \frac{4\ d_1 d_2}{(d_1+d_2)^2}\ A_I^2((\mathrm {g}_\mathrm {m}(\mathbf {s}_1),\mathrm {g}_\mathrm {m}(\mathbf {s}_2))), \end{aligned}$$

where $(\mathrm {g}_\mathrm {m}(\mathbf {s}_1),\mathrm {g}_\mathrm {m}(\mathbf {s}_2))$ is a 2-part composition whose components are the geometric means of the parts in $\mathbf {s}_1$ and $\mathbf {s}_2$, respectively.

Since for any D-part composition $\mathbf {s}$ it holds that $A_I^2(\mathbf {s})=(1/D)\Vert \mathbf {s} \Vert _a^2$, the index of the composed vector of shares $A_I^2((\mathbf {s}_1,\mathbf {s}_2))$ can be expressed as a sum of squares of orthonormal balances obtained in an SBP. Define a first binary partition by separating the shares in $\mathbf {s}_1$, marked with a $+1$, from the shares in $\mathbf {s}_2$, marked with a $-1$. Any further partitions within $\mathbf {s}_1$ and within $\mathbf {s}_2$ complete the SBP and yield the expression

$$\begin{aligned} \Vert (\mathbf {s}_1,\mathbf {s}_2) \Vert _a^2 = \Vert \mathbf {s}_1 \Vert _a^2 + \Vert \mathbf {s}_2 \Vert _a^2 + \left( \sqrt{\frac{d_1d_2}{d_1+d_2}} \log \frac{\mathrm {g}_\mathrm {m}(\mathbf {s}_1)}{\mathrm {g}_\mathrm {m}(\mathbf {s}_2)} \right) ^2. \end{aligned}$$

(10)
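
Eq. (10) can be checked numerically. The sketch below assumes the squared Aitchison norm is computed as the sum of squared clr coefficients. The share values in `s1` and `s2` are hypothetical and chosen only for illustration.

```python
import math

def mean_log(x):
    """Logarithm of the geometric mean of the parts."""
    return sum(math.log(v) for v in x) / len(x)

def aitchison_norm_sq(x):
    """Squared Aitchison norm as the sum of squared clr coefficients."""
    m = mean_log(x)
    return sum((math.log(v) - m) ** 2 for v in x)

# Hypothetical groups of shares, expressed in the same units.
s1 = [0.10, 0.25, 0.05]
s2 = [0.40, 0.20]
d1, d2 = len(s1), len(s2)

# Balance between the two groups: the last term of Eq. (10) before squaring.
balance = math.sqrt(d1 * d2 / (d1 + d2)) * (mean_log(s1) - mean_log(s2))

lhs = aitchison_norm_sq(s1 + s2)
rhs = aitchison_norm_sq(s1) + aitchison_norm_sq(s2) + balance ** 2
print(lhs, rhs)
```

Both sides agree up to rounding: the norm of the whole splits into the two within-group norms plus the squared balance between the groups.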

The 2-part composition $(\mathrm {g}_\mathrm {m}(\mathbf {s}_1),\mathrm {g}_\mathrm {m}(\mathbf {s}_2))$ has square norm

$$\begin{aligned} \Vert (\mathrm {g}_\mathrm {m}(\mathbf {s}_1),\mathrm {g}_\mathrm {m}(\mathbf {s}_2)) \Vert _a^2 = \left( \frac{1}{\sqrt{2}} \log \frac{\mathrm {g}_\mathrm {m}(\mathbf {s}_1)}{\mathrm {g}_\mathrm {m}(\mathbf {s}_2)} \right) ^2 =2\ A_I^2((\mathrm {g}_\mathrm {m}(\mathbf {s}_1),\mathrm {g}_\mathrm {m}(\mathbf {s}_2))). \end{aligned}$$

(11)
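
The first equality in Eq. (11) can be made explicit. Writing $g_1=\mathrm {g}_\mathrm {m}(\mathbf {s}_1)$ and $g_2=\mathrm {g}_\mathrm {m}(\mathbf {s}_2)$ (shorthand introduced here only for this step), the geometric mean of the 2-part composition is $\sqrt{g_1 g_2}$, so that

$$\begin{aligned} \mathrm {clr}((g_1,g_2)) &= \left( \frac{1}{2}\log \frac{g_1}{g_2},\ -\frac{1}{2}\log \frac{g_1}{g_2} \right) , \\ \Vert (g_1,g_2) \Vert _a^2 &= 2\cdot \frac{1}{4} \log ^2 \frac{g_1}{g_2} = \left( \frac{1}{\sqrt{2}} \log \frac{g_1}{g_2} \right) ^2. \end{aligned}$$

The second equality in Eq. (11) is the definition $A_I^2(\mathbf {s})=(1/D)\Vert \mathbf {s} \Vert _a^2$ with $D=2$.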

Dividing Eq. (10) by the appropriate constants to obtain inequality indexes, and substituting the value in Eq. (11) yields

$$\begin{aligned} A_I^2( (\mathbf {s}_1,\mathbf {s}_2) ) = \frac{d_1 A_I^2(\mathbf {s}_1)}{d_1+d_2} + \frac{d_2 A_I^2( \mathbf {s}_2)}{d_1+d_2} + \frac{4\ d_1 d_2}{(d_1+d_2)^2}\ A_I^2((\mathrm {g}_\mathrm {m}(\mathbf {s}_1),\mathrm {g}_\mathrm {m}(\mathbf {s}_2))). \end{aligned}$$

(12)
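
A numerical check of Eq. (12) is sketched below, assuming $A_I^2(\mathbf {s})=(1/D)\Vert \mathbf {s} \Vert _a^2$ with the norm computed from clr coefficients. The function name `inequality_index` and the share values are hypothetical.

```python
import math

def mean_log(x):
    """Logarithm of the geometric mean of the parts."""
    return sum(math.log(v) for v in x) / len(x)

def inequality_index(x):
    """A_I^2: squared Aitchison norm divided by the number of parts."""
    m = mean_log(x)
    return sum((math.log(v) - m) ** 2 for v in x) / len(x)

# Hypothetical groups of shares, expressed in the same units.
s1 = [0.10, 0.25, 0.05]
s2 = [0.40, 0.20]
d1, d2 = len(s1), len(s2)
D = d1 + d2

# 2-part composition formed by the geometric means of the two groups.
gm_pair = [math.exp(mean_log(s1)), math.exp(mean_log(s2))]

within = (d1 * inequality_index(s1) + d2 * inequality_index(s2)) / D
between = 4 * d1 * d2 / D ** 2 * inequality_index(gm_pair)
total = inequality_index(s1 + s2)
print(total, within + between)
```

The variables `within` and `between` correspond to the first two terms and to the last term of Eq. (12), respectively.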

Population replication: Let $\mathbf {p}$ be a composition of shares. The property to prove is $A_I^2(\mathbf {p})=A_I^2((\mathbf {p},\mathbf {p}, \dots ,\mathbf {p}))$.

The geometric mean is unchanged by replication, that is, $\mathrm {g}_\mathrm {m}(\mathbf {p}) = \mathrm {g}_\mathrm {m}((\mathbf {p},\mathbf {p},\dots ,\mathbf {p}))$. Hence, when the decomposability by subcompositions [Eq. (12)] is applied with $\mathbf {s}_1$ and $\mathbf {s}_2$ each consisting of replications of $\mathbf {p}$, the two geometric means coincide and the value of $A_I^2((\mathrm {g}_\mathrm {m}(\mathbf {s}_1),\mathrm {g}_\mathrm {m}(\mathbf {s}_2)))$ is null. The remaining terms form a weighted average of the indices of the two groups, so applying the decomposability by subcompositions repeatedly shows that the population replication property holds.
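
The replication property can be illustrated with a short sketch, again assuming $A_I^2$ is the squared Aitchison norm divided by the number of parts. The function name `inequality_index` and the shares in `p` are hypothetical.

```python
import math

def inequality_index(x):
    """A_I^2: mean of the squared clr coefficients."""
    logs = [math.log(v) for v in x]
    m = sum(logs) / len(logs)
    return sum((value - m) ** 2 for value in logs) / len(logs)

# Hypothetical composition of shares.
p = [0.10, 0.25, 0.05, 0.60]

index_once = inequality_index(p)
index_twice = inequality_index(p + p)
index_five_times = inequality_index(p * 5)   # list repetition: (p, p, p, p, p)
print(index_once, index_twice, index_five_times)
```

The three values coincide up to rounding, since replication changes neither the mean of the log shares nor the average squared deviation from it.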

Principle of transfers: A transfer of $\delta >0$ to the i-th unit increases its share to $p_i+\delta $, with $\delta $ deducted from the share of the j-th unit. If $p_j-\delta > p_i+\delta $, then $A_I^2(\mathbf {p})\ge A_I^2(\mathbf {p}')$, where $\mathbf {p}'$ is the composition after the transfer.

Without loss of generality, take $i=1$, $j=2$. There are positive constants, a, b, such that $p_1'=p_1+\delta = p_1 a$, $p_2'=p_2-\delta =b p_2$, with $a > 1$ and $0< b < 1$, since $p_2-p_1> 2\delta >0$. The new shares $\mathbf {p}'$ are the perturbation

$$\begin{aligned} \mathbf {p}'= \mathbf {a} \oplus \mathbf {p}, \quad \mathbf {a}=(a,b,1,1,\dots ,1). \end{aligned}$$

The statement is proven if the following inner product satisfies

$$\begin{aligned} \langle \mathbf {a}, \mathbf {p}\rangle _a = \langle \mathrm {clr}(\mathbf {a}), \mathrm {clr}(\mathbf {p})\rangle _e \le 0. \end{aligned}$$

After some computation, the inner product is

$$\begin{aligned} \langle \mathbf {a}, \mathbf {p}\rangle _a = \log a \cdot \mathrm {clr}_1(\mathbf {p}) + \log b\cdot \mathrm {clr}_2(\mathbf {p}), \end{aligned}$$

(13)

where $\mathrm {clr}_k(\mathbf {p})= \log p_k - \log \mathrm {g}_\mathrm {m}(\mathbf {p})$, $k=1,2$. This inner product is negative if the two inequalities $\log (ab)> 0$ and $(\log b)/(\log a)> -1$ hold. In fact, since $\log a>0$, the negativeness of Eq. (13) is equivalent to

$$\begin{aligned} \mathrm {clr}_1(\mathbf {p}) + \frac{\log b}{\log a}\cdot \mathrm {clr}_2(\mathbf {p}) < 0, \end{aligned}$$

which derives from $(\log b)/(\log a)> -1$ and $ \mathrm {clr}_1(\mathbf {p}) < \mathrm {clr}_2(\mathbf {p})$.

The first inequality derives from

$$\begin{aligned} ab = \frac{(p_1+\delta )(p_2-\delta )}{p_1 p_2} = 1 + \frac{\delta ((p_2-p_1)-\delta )}{p_1p_2} \end{aligned}$$

and the assumption that $(p_2-p_1)-2\delta > 0$, which gives $ab>1$. The second inequality follows from the first: $\log (ab) = \log a + \log b > 0$ together with $\log a>0$ implies that $(\log b) /(\log a)> -1$.
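
The algebraic identities used in this argument can be checked numerically. The sketch below verifies the form of the inner product in Eq. (13) and the closed form of $ab$ for one hypothetical transfer. The shares in `p` and the amount `delta` are illustrative assumptions, and the check covers this single example only.

```python
import math

def clr(x):
    """Centered log-ratio coefficients."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def norm_sq(x):
    """Squared Aitchison norm as the sum of squared clr coefficients."""
    return sum(c * c for c in clr(x))

p = [0.10, 0.50, 0.20, 0.20]   # hypothetical shares; unit 1 holds less than unit 2
delta = 0.05                   # satisfies p[1] - delta > p[0] + delta

a = (p[0] + delta) / p[0]
b = (p[1] - delta) / p[1]
pert = [a, b] + [1.0] * (len(p) - 2)           # the perturbation (a, b, 1, ..., 1)
p_new = [ai * pi for ai, pi in zip(pert, p)]   # shares after the transfer

# Aitchison inner product computed from clr coefficients, and its form in Eq. (13).
inner = sum(u * v for u, v in zip(clr(pert), clr(p)))
eq13 = math.log(a) * clr(p)[0] + math.log(b) * clr(p)[1]

# Closed form of the product ab.
ab_closed_form = 1 + delta * ((p[1] - p[0]) - delta) / (p[0] * p[1])

# Expansion of the squared norm of a perturbation.
norm_expansion = norm_sq(p) + norm_sq(pert) + 2 * inner
print(inner, eq13, a * b, ab_closed_form, norm_sq(p_new), norm_expansion)
```

The last two printed values agree because clr coefficients add under perturbation, which gives $\Vert \mathbf {a} \oplus \mathbf {p} \Vert _a^2 = \Vert \mathbf {a} \Vert _a^2 + \Vert \mathbf {p} \Vert _a^2 + 2\langle \mathbf {a}, \mathbf {p}\rangle _a$ and links the inner product in Eq. (13) to the change in the index. In this particular example the squared norm after the transfer is smaller than before it.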

Supplementary material

This section contains tables and figures that are not included in the body of the article because of its length (Tables 3, 4, 5; Figs. 11, 12, 13, 14, 15).

Table 3 Center and normalized variation matrix of MHGI (Appendix A)

Table 4 Signs code for the SBP approaching principal balances for MHGI data

Table 5 Estimated balances of coefficients in model (9)

About this article

Cite this article

Egozcue, J.J., Pawlowsky-Glahn, V. Compositional data: the sample space and its structure. TEST 28, 599–638 (2019). https://doi.org/10.1007/s11749-019-00670-6

Published: 16 July 2019
Issue Date: 01 September 2019
DOI: https://doi.org/10.1007/s11749-019-00670-6
