Compositional data: the sample space and its structure

  • Invited Paper
TEST

Abstract

The log-ratio approach to compositional data (CoDa) analysis has now entered a mature phase. The principles and statistical tools introduced by J. Aitchison in the eighties have proven successful in solving a number of applied problems. The algebraic–geometric structure of the sample space, tailored to those principles, was developed at the beginning of the millennium. Two main ideas completed J. Aitchison’s seminal work: the conception of compositions as equivalence classes of proportional vectors, and their representation in the simplex endowed with an interpretable Euclidean structure. These achievements allowed the representation of compositions in meaningful coordinates (preferably Cartesian), as well as orthogonal projections compatible with the Aitchison distance introduced two decades before. These ideas and concepts are reviewed up to the normal distribution on the simplex and the associated central limit theorem. Exploratory tools, specifically designed for CoDa, are also reviewed. To illustrate the adequacy and interpretability of the sample space structure, a new inequality index, based on the Aitchison norm, is proposed. Most concepts are illustrated with an example of mean household gross income per capita in Spain.

References

  • Äijö T, Müller CL, Bonneau R (2018) Temporal probabilistic modeling of bacterial compositions derived from 16S rRNA sequencing. Bioinformatics 34(3):372–380

  • Aitchison J (1982) The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B Stat Methodol 44(2):139–177

  • Aitchison J (1983) Principal component analysis of compositional data. Biometrika 70(1):57–65

  • Aitchison J (1986) The statistical analysis of compositional data. Monographs on statistics and applied probability. Chapman & Hall Ltd., London (reprinted in 2003 with additional material by The Blackburn Press)

  • Aitchison J (1992) On criteria for measures of compositional difference. Math Geol 24(4):365–379

  • Aitchison J (1994) Multivariate analysis and its applications, volume 24 of lecture notes—monograph series, chapter principles of compositional data analysis. Institute of Mathematical Statistics, Hayward, pp 73–81

  • Aitchison J (1997) The one-hour course in compositional data analysis or compositional data analysis is simple. In: Pawlowsky-Glahn V (ed) Proceedings of IAMG’97—the III annual conference of the international association for mathematical geology, volume I, II and addendum, Barcelona (E). CIMNE, Barcelona, pp 3–35, ISBN 978-84-87867-76-7

  • Aitchison J, Bacon-Shone J (1984) Log contrast models for experiments with mixtures. Biometrika 71:323–330

  • Aitchison J, Egozcue JJ (2005) Compositional data analysis: Where are we and where should we be heading? Math Geol 37(7):829–850

  • Aitchison J, Greenacre M (2002) Biplots for compositional data. J R Stat Soc Ser C Appl Stat 51(4):375–392

  • Aitchison J, Shen S (1980) Logistic-normal distributions. Some properties and uses. Biometrika 67(2):261–272

  • Aitchison J, Barceló-Vidal C, Martín-Fernández JA, Pawlowsky-Glahn V (2000) Logratio analysis and compositional distance. Math Geol 32(3):271–275

  • Aitchison J, Barceló-Vidal C, Martín-Fernández JA, Pawlowsky-Glahn V (2001) Reply to letter to the editor by S. Rehder and U. Zier on “Logratio analysis and compositional distance”. Math Geol 33(7):849–860

  • Aitchison J, Barceló-Vidal C, Egozcue JJ, Pawlowsky-Glahn V (2002) A concise guide for the algebraic-geometric structure of the simplex, the sample space for compositional data analysis. In: Bayer U, Burger H, Skala W (eds) Proceedings of IAMG’02—the VIII annual conference of the international association for mathematical geology, vol I and II. Selbstverlag der Alfred-Wegener-Stiftung, Berlin, pp 387–392

  • Atkinson AB (1970) On the measurement of inequality. J Econ Theory 2:244–263

  • Bacon-Shone J (2003) Modelling structural zeros in compositional data. In: Thió-Henestrosa S, Martín-Fernández JA (eds) Proceedings of CoDaWork’03, the 1st compositional data analysis workshop, Girona (E). Universitat de Girona, ISBN 84-8458-111-X, http://ima.udg.es/Activitats/CoDaWork2003/

  • Barceló-Vidal C, Martín-Fernández JA (2016) The mathematics of compositional analysis. Austrian J Stat 45:57–71

  • Barceló-Vidal C, Martín-Fernández JA, Pawlowsky-Glahn V (2001) Mathematical foundations of compositional data analysis. In: Ross G (ed) Proceedings of IAMG’01—the VII annual conference of the international association for mathematical geology, Cancun (Mex), p 20

  • Billheimer D, Guttorp P, Fagan W (2001) Statistical interpretation of species composition. J Am Stat Assoc 96(456):1205–1214

  • Buccianti A, Pawlowsky-Glahn V (2005) New perspectives on water chemistry and compositional data analysis. Math Geol 37(7):703–727

  • Chayes F (1971) Ratio correlation. University of Chicago Press, Chicago, p 99

  • Chen J, Zhang X, Li S (2017) Multiple linear regression with compositional response and covariates. J Appl Stat 44(12):2270–2285

  • Chipman HA, Gu H (2005) Interpretable dimension reduction. J Appl Stat 32:969–987

  • Comas-Cufí M, Thió-Henestrosa S (2011) Codapack 2.0: a stand-alone, multi-platform compositional software. See Egozcue et al. (2011c)

  • Connor RJ, Mosimann JE (1969) Concepts of independence for proportions with a generalization of the Dirichlet distribution. J Am Stat Assoc 64(325):194–206

  • Daunis-i-Estadella J, Barceló-Vidal J, Buccianti A (2006) Exploratory compositional data analysis. In: Compositional data analysis in the geosciences: from theory to practice, volume 264 of special publications. Geological Society, London, pp 161–174

  • de Finetti B (1926) Considerazioni matematiche sull’ereditarietà mendeliana. Metron 6(3):3–41

  • Egozcue JJ (2009) Reply to “On the Harker variation diagrams;...” by J. A. Cortés. Math Geosci 41(7):829–834

  • Egozcue JJ, Jarauta-Bragulat E (2014) Differential models for evolutionary compositions. Math Geosci 46(4):381–410

  • Egozcue JJ, Pawlowsky-Glahn V (2005) Groups of parts and their balances in compositional data analysis. Math Geol 37(7):795–828

  • Egozcue JJ, Pawlowsky-Glahn V (2011a) Basic concepts and procedures. See Pawlowsky-Glahn and Buccianti (2011), pp 12–28

  • Egozcue JJ, Pawlowsky-Glahn V (2011b) Evidence information in Bayesian updating. See Egozcue et al. (2011c)

  • Egozcue JJ, Pawlowsky-Glahn V (2018a) Evidence functions: a compositional approach to information (invited paper). Stat Oper Res Trans 42(2):1–24

  • Egozcue JJ, Pawlowsky-Glahn V (2018b) Modelling compositional data. The sample space approach, Chapter 4, p XXV, 875. Handbook of mathematical geosciences—fifty years of IAMG. Springer, Berlin

  • Egozcue JJ, Pawlowsky-Glahn V, Mateu-Figueras G, Barceló-Vidal C (2003) Isometric logratio transformations for compositional data analysis. Math Geol 35(3):279–300

  • Egozcue JJ, Díaz-Barrero JL, Pawlowsky-Glahn V (2006) Hilbert space of probability density functions based on Aitchison geometry. Acta Math Sin 22(4):1175–1182. https://doi.org/10.1007/s10114-005-0678-2

  • Egozcue JJ, Barceló-Vidal C, Martín-Fernández JA, Jarauta-Bragulat E, Díaz-Barrero JL, Mateu-Figueras G (2011a) Elements of simplicial linear algebra and geometry. See Pawlowsky-Glahn and Buccianti (2011), pp 141–157

  • Egozcue JJ, Jarauta-Bragulat E, Díaz-Barrero JL (2011b) Calculus of simplex-valued functions. See Pawlowsky-Glahn and Buccianti (2011), pp 158–175

  • Egozcue JJ, Tolosana-Delgado R, Ortego MI (eds) (2011c) Proceedings of the 4th international workshop on compositional data analysis, Sant Feliu de Guixols, Girona. CIMNE, Barcelona, ISBN 978-84-87867-76-7

  • Egozcue JJ, Daunis-i-Estadella J, Pawlowsky-Glahn V, Hron K, Filzmoser P (2012) Simplicial regression. The normal model. J Appl Probab Stat 6(1–2):87–108

  • Egozcue JJ, Pawlowsky-Glahn V, Tolosana-Delgado R, Ortego MI, van den Boogaart KG (2013) Bayes spaces: use of improper distributions and exponential families. Revista de la Real Academia de Ciencias Exactas, Físicas y Naturales, Serie A Matemáticas 107:475–486. https://doi.org/10.1007/s13398-012-0082-6

  • Egozcue JJ, Pawlowsky-Glahn V, Templ M, Hron K (2015) Independence in contingency tables using simplicial geometry. Commun Stat Theory Methods 44(18):3978–3996

  • Egozcue JJ, Pawlowsky-Glahn V, Gloor GB (2018) Linear association in compositional data analysis. Austrian J Stat 47(1):3–31

  • Erb I, Notredame C (2016) How should we measure proportionality on relative gene expression data? Theory Biosci 135(1–2):21–36. https://doi.org/10.1007/s12064-015-0220-8

  • Fernandes AD, Reid JN, Macklaim JM, McMurrough TA, Edgell DR, Gloor GB (2014) Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome 2:15.1–15.13

  • Filzmoser P, Hron K, Templ M (2012) Discriminant analysis for compositional data and robust parameter estimation. Comput Stat 27(4):585–604

  • Filzmoser P, Hron K, Templ M (2018) Applied compositional analysis. With worked examples in R. Springer, Switzerland AG, p 280

  • Fisher RA (1947) The analysis of covariance method for the relation between a part and the whole. Biometrics 3(2):65–68

  • Fréchet M (1948) Les éléments Aléatoires de Nature Quelconque dans une Espace Distancié. Annales de l’Institut Henri Poincaré 10(4):215–308

  • Fry JM, Fry TRL, McLaren KR (2000) Compositional data analysis and zeros in micro data. Appl Econ 32(8):953–959

  • Gini C (1921) Measurement of inequality of incomes. Econ J 31(121):124–126

  • Greenacre M (2011) Measuring subcompositional incoherence. Math Geosci 43(6):681–693

  • Halmos P (1974) Finite dimensional vector spaces. Springer, Berlin

  • Hijazi RH, Jernigan RW (2009) Modelling compositional data using Dirichlet regression models. J Appl Probab Stat 4(1):77–91

  • Hron K, Filzmoser P, Thompson K (2012) Linear regression with compositional explanatory variables. J Appl Stat 39(5):1115–1128

  • Hrůzová K, Todorov V, Hron K, Filzmoser P (2016) Classical and robust orthogonal regression between parts of compositional data. Statistics 50(6):1261–1275

  • INE (2016) Renta disponible bruta de los hogares (per cápita). Serie 2010–2014. Contabilidad regional de España. Base 2010

  • Kurtz ZD, Müller CL, Miraldi ER, Littman DR, Blaser MJ, Bonneau RA (2015) Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput Biol 11(5):e1004226. https://doi.org/10.1371/journal.pcbi.1004226

  • Kynčlová P, Hron K, Filzmoser P (2017) Correlation between compositional parts based on symmetric balances. Math Geosci 49:777–796. https://doi.org/10.1007/s11004-016-9669-3

  • Lin W, Shi P, Feng R, Li H (2014) Variable selection in regression with compositional covariates. Biometrika 101(4):785–797

  • Lovell D, Pawlowsky-Glahn V, Egozcue JJ, Marguerat S, Bähler J (2015) Proportionality: a valid alternative to correlation for relative data. PLoS Comput Biol 11(3):e1004075

  • Martín-Fernández JA, Barceló-Vidal C, Pawlowsky-Glahn V (2003) Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Math Geol 35(3):253–278

  • Martín-Fernández JA, Hron K, Templ M, Filzmoser P, Palarea-Albaladejo J (2012) Model-based replacement of rounded zeros in compositional data: classical and robust approaches. Comput Stat Data Anal 56:2688–2704

  • Martín-Fernández JA, Hron K, Templ M, Filzmoser P, Palarea-Albaladejo J (2015) Bayesian-multiplicative treatment of count zeros in compositional data sets. Stat Model 15(2):134–158

  • Martín-Fernández JA, Pawlowsky-Glahn V, Egozcue JJ, Tolosana-Delgado R (2018) Advances in principal balances for compositional data. Math Geosci 50(3):273–298

  • Mateu-Figueras G (2003) Models de distribució sobre el símplex. Ph.D. thesis, Universitat Politècnica de Catalunya, Barcelona

  • Mateu-Figueras G, Pawlowsky-Glahn V (2007) The skew-normal distribution on the simplex. Commun Stat Theory Methods 36(9):1787–1802

  • Mateu-Figueras G, Pawlowsky-Glahn V, Egozcue JJ (2011) The principle of working on coordinates. See Pawlowsky-Glahn and Buccianti (2011), pp 31–42

  • Mateu-Figueras G, Pawlowsky-Glahn V, Egozcue JJ (2013) The normal distribution in some constrained sample spaces. Stat Oper Res Trans 37(1):29–56

  • McCullagh P, Nelder JA (1989) Generalized linear models, 2nd edn. Chapman and Hall, London

  • Menafoglio A, Secchi P, Dalla Rosa M (2013) A universal kriging predictor for spatially dependent functional data of a Hilbert space. Electron J Stat 7:2209–2240

  • Menafoglio A, Guadagnini A, Secchi P (2016) Stochastic simulation of soil particle-size curves in heterogeneous aquifer systems through a bayes space approach. Water Resour Res 52(8):5708–5726

  • Morais J, Thomas-Agnan C, Simioni M (2018) Using compositional and Dirichlet models for market share regression. J Appl Stat 45(9):1670–1689. https://doi.org/10.1080/02664763.2017.1389864

  • Mosimann JE (1962) On the compound multinomial distribution, the multivariate \(\beta \)-distribution and correlations among proportions. Biometrika 49(1–2):65–82

  • Ortego MI, Egozcue JJ (2013) Spurious copulas. In: Hron K, Filzmoser P, Templ M (eds) Proceedings of the 5th workshop on compositional data analysis, CoDaWork 2013, pp 123–130

  • Palarea-Albaladejo J, Martín-Fernández J (2008) A modified EM alr-algorithm for replacing rounded zeros in compositional data sets. Comput Geosci 34(8):2233–2251

  • Palarea-Albaladejo J, Martín-Fernández JA (2015) zCompositions—R package for multivariate imputation of left-censored data under a compositional approach. Chemom Intell Lab Syst 143:85–96

  • Pawlowsky-Glahn V, Buccianti A (eds) (2011) Compositional data analysis: theory and applications. Wiley, New York, p 378

  • Pawlowsky-Glahn V, Egozcue JJ (2001) Geometric approach to statistical analysis on the simplex. Stoch Environ Res Risk Assess 15(5):384–398

  • Pawlowsky-Glahn V, Egozcue JJ (2002) BLU estimators and compositional data. Math Geol 34(3):259–274

  • Pawlowsky-Glahn V, Egozcue J (2011) Exploring compositional data with the coda-dendrogram. Austrian J Stat 40(1 & 2):103–113

  • Pawlowsky-Glahn V, Egozcue JJ, Lovell D (2015a) Tools for compositional data with a total. Stat Model 15(2):175–190

  • Pawlowsky-Glahn V, Egozcue JJ, Tolosana-Delgado R (2015b) Modeling and analysis of compositional data. Statistics in practice. Wiley, Chichester, p 272

  • Pearson K (1897) Mathematical contributions to the theory of evolution. On a form of spurious correlation which may arise when indices are used in the measurement of organs. Proc R Soc Lond LX:489–502

  • Queysanne M (1973) Álgebra Básica. Editorial Vicens Vives, Barcelona (E), p 669

  • Rivera-Pinto J, Egozcue JJ, Pawlowsky-Glahn V, Paredes R, Noguera-Julian M, Calle ML (2018) Balances: a new perspective for microbiome analysis. mSystems 3(4):e00053–18. https://doi.org/10.1128/mSystems.00053-18

  • Robert CP (1994) The Bayesian choice. A decision-theoretic motivation. Springer, New York

  • Scealy JL, Welsh AH (2011) Regression for compositional data by using distributions defined on the hypersphere. J R Stat Soc Ser B Stat Methodol 73(3):351–375

  • Shi P, Zhang A, Li H (2016) Regression analysis for microbiome compositional data. Ann Appl Stat 10(2):1019–1040

  • Shorrocks AF (1980) The class of additively decomposable inequality measures. Econometrica 48(3):613–625

  • Theil H (1967) On the measurement of inequality. North Holland, Amsterdam

  • Tolosana-Delgado R, von Eynatten H (2009) Grain-size control on petrographic composition of sediments: compositional regression and rounded zeros. Math Geosci 41:869–886

  • Tolosana-Delgado R, von Eynatten H (2010) Simplifying compositional multiple regression: application to grain size controls on sediment geochemistry. Comput Geosci 36(5):577–589

  • van den Boogaart KG, Tolosana-Delgado R (2013) Analysing compositional data with R. Springer, Berlin, p 258

  • van den Boogaart KG, Egozcue JJ, Pawlowsky-Glahn V (2010) Bayes linear spaces. Stat Oper Res Trans 34(2):201–222

  • van den Boogaart KG, Egozcue JJ, Pawlowsky-Glahn V (2014) Bayes Hilbert spaces. Aust NZ J Stat 56(2):171–194

  • Vistelius AB (1960) The skew frequency distributions and the fundamental law of the geochemical processes. J Geol 68(1):1–22

  • Wang H, Shangguan L, Wu J, Guan R (2013) Multiple linear regression modeling for compositional data. Neurocomputing 122:490–500

  • Wikipedia (2018) Homogeneous function—Wikipedia, The Free Encyclopedia. Accessed 5 Aug 2018

Acknowledgements

This work was supported by Grants MTM2015-65016-C2-1-R and MTM2015-65016-C2-2-R (MINECO/FEDER) from the Spanish Ministry of Economy and Competitiveness and European Regional Development Fund. We are grateful for the useful comments and criticisms given by three anonymous reviewers.

Author information

Corresponding author

Correspondence to Juan José Egozcue.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This invited paper is discussed in comments available at: https://doi.org/10.1007/s11749-019-00671-5, https://doi.org/10.1007/s11749-019-00672-4, https://doi.org/10.1007/s11749-019-00673-3.

Appendices

Data

Mean Household Gross Income (MHGI) in Spain. The Instituto Nacional de Estadística (INE) (INE 2016) provides on its webpage an estimate of the MHGI per capita for all Autonomous Communities and Autonomous Cities (ACs) in Spain for the years 2000 to 2014. According to the webpage, the data are median aggregated data, but the aggregation algorithm is not described. Nevertheless, the result is a non-closed composition, and the compositional tools presented here lead to exactly the same results whether applied to the given data, to a representation in proportions or percentages, or, as can be seen on the INE webpage, to data normalized so that the estimated MHGI for the whole of Spain corresponds to 100%. This data set (19 ACs for 15 years, listed in Table 2) is used here to illustrate the procedures and properties presented. Figure 10 shows the values in Euros. The MHGI in the Balearic Islands (circles), Canary Islands (triangles), Madrid (\(+\)) and Catalonia (\(\times \)) are highlighted for discussion.
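The invariance to the choice of units stated above can be illustrated with a minimal Python sketch. It assumes the usual centered log-ratio (clr) transform; the income values and the helper names `clr` and `closure` are hypothetical and are not taken from the INE data.

```python
import math


def clr(x):
    """Centered log-ratio transform of a vector of strictly positive parts."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def closure(x, total=1.0):
    """Rescale the parts so that they add up to `total`."""
    s = sum(x)
    return [total * v / s for v in x]


# Hypothetical incomes in Euros for four regions (illustration only).
income_euros = [10500.0, 14200.0, 9800.0, 17300.0]

clr_raw = clr(income_euros)
clr_proportions = clr(closure(income_euros))
clr_percent = clr(closure(income_euros, total=100.0))

# The three representations differ only by a positive factor,
# so their clr coefficients agree up to rounding error.
max_gap = max(
    max(abs(r - q), abs(r - c))
    for r, q, c in zip(clr_raw, clr_proportions, clr_percent)
)
print(max_gap)
```

Because any log-ratio method depends on the data only through such coefficients, the same holds for the distances, norms and balances used in this appendix.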

Fig. 10 MHGI per capita for all ACs in Spain for the years 2000–2014. Four ACs are highlighted: Balearic Islands (4, circles), Canary Islands (5, triangles), Madrid (13, \(+\)) and Catalonia (9, \(\times \)); the numbers are placed near the end point of each MHGI. Names are shown in the right panel

In Sect. 3, the MHGI for all ACs is considered as a 19-part composition.

Table 2 MHGI per capita in the period 2000–2014 (INE 2016)

Proofs of the properties of the inequality index \(A_I^2\)

Consider a D-part composition \(\mathbf {s}\) divided into two subcompositions \(\mathbf {s}_1\) and \(\mathbf {s}_2\) of \(d_1\) and \(d_2\) shares, respectively, with \(d_1 + d_2 = D\) and represented in the same units. Proofs of the properties of \(A_I^2\) follow.

Decomposability by subcompositions:

$$\begin{aligned} A_I^2( (\mathbf {s}_1,\mathbf {s}_2) ) = \frac{d_1 A_I^2(\mathbf {s}_1)}{d_1+d_2} + \frac{d_2 A_I^2( \mathbf {s}_2)}{d_1+d_2} + \frac{4\ d_1 d_2}{(d_1+d_2)^2}\ A_I^2((\mathrm {g}_\mathrm {m}(\mathbf {s}_1),\mathrm {g}_\mathrm {m}(\mathbf {s}_2))), \end{aligned}$$

where \((\mathrm {g}_\mathrm {m}(\mathbf {s}_1),\mathrm {g}_\mathrm {m}(\mathbf {s}_2))\) is a 2-part composition whose components are the geometric means of the parts in \(\mathbf {s}_1\) and \(\mathbf {s}_2\), respectively.

Since for any D-part composition \(\mathbf {s}\) it holds that \(A_I^2(\mathbf {s})=(1/D)\Vert \mathbf {s} \Vert _a^2\), the index of the composed vector of shares \(A_I^2((\mathbf {s}_1,\mathbf {s}_2))\) can be expressed as a sum of squares of orthonormal balances obtained in an SBP. Define a first binary partition by separating the shares in \(\mathbf {s}_1\), marked with a \(+1\), from the shares in \(\mathbf {s}_2\), marked with a \(-1\). Any further partitions within \(\mathbf {s}_1\) and within \(\mathbf {s}_2\) complete the SBP and yield the expression

$$\begin{aligned} \Vert (\mathbf {s}_1,\mathbf {s}_2) \Vert _a^2 = \Vert \mathbf {s}_1 \Vert _a^2 + \Vert \mathbf {s}_2 \Vert _a^2 + \left( \sqrt{\frac{d_1d_2}{d_1+d_2}} \log \frac{\mathrm {g}_\mathrm {m}(\mathbf {s}_1)}{\mathrm {g}_\mathrm {m}(\mathbf {s}_2)} \right) ^2. \end{aligned}$$
(10)

The 2-part composition \((\mathrm {g}_\mathrm {m}(\mathbf {s}_1),\mathrm {g}_\mathrm {m}(\mathbf {s}_2))\) has square norm

$$\begin{aligned} \Vert (\mathrm {g}_\mathrm {m}(\mathbf {s}_1),\mathrm {g}_\mathrm {m}(\mathbf {s}_2)) \Vert _a^2 = \left( \frac{1}{\sqrt{2}} \log \frac{\mathrm {g}_\mathrm {m}(\mathbf {s}_1)}{\mathrm {g}_\mathrm {m}(\mathbf {s}_2)} \right) ^2 =2\ A_I^2((\mathrm {g}_\mathrm {m}(\mathbf {s}_1),\mathrm {g}_\mathrm {m}(\mathbf {s}_2))). \end{aligned}$$
(11)

Dividing Eq. (10) by the appropriate constants to obtain inequality indexes, and substituting the value in Eq. (11) yields

$$\begin{aligned} A_I^2( (\mathbf {s}_1,\mathbf {s}_2) ) = \frac{d_1 A_I^2(\mathbf {s}_1)}{d_1+d_2} + \frac{d_2 A_I^2( \mathbf {s}_2)}{d_1+d_2} + \frac{4\ d_1 d_2}{(d_1+d_2)^2}\ A_I^2((\mathrm {g}_\mathrm {m}(\mathbf {s}_1),\mathrm {g}_\mathrm {m}(\mathbf {s}_2))). \end{aligned}$$
(12)
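Equation (12) can be checked numerically. The following Python sketch assumes only the definition \(A_I^2(\mathbf {s})=(1/D)\Vert \mathbf {s} \Vert _a^2\), with the squared Aitchison norm computed as the sum of squared clr coefficients; the shares and the function names are hypothetical.

```python
import math


def clr(x):
    """Centered log-ratio transform of strictly positive parts."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def inequality_index(x):
    """A_I^2(x): squared Aitchison norm divided by the number of parts."""
    return sum(c * c for c in clr(x)) / len(x)


def geometric_mean(x):
    """Geometric mean of strictly positive parts."""
    return math.exp(sum(math.log(v) for v in x) / len(x))


# Hypothetical shares, expressed in the same units.
s1 = [1200.0, 950.0, 1430.0]
s2 = [800.0, 1010.0]
d1, d2 = len(s1), len(s2)
D = d1 + d2

# Left-hand side of Eq. (12): index of the composed vector of shares.
lhs = inequality_index(s1 + s2)

# Right-hand side: within-group terms plus the between-group term.
between = inequality_index([geometric_mean(s1), geometric_mean(s2)])
rhs = (
    d1 * inequality_index(s1) / D
    + d2 * inequality_index(s2) / D
    + 4 * d1 * d2 / D**2 * between
)
print(lhs, rhs)
```

The two printed values agree up to rounding error, with the last term playing the role of a between-group contribution.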

Population replication. Let \(\mathbf {p}\) be a composition of shares. The property to prove is \(A_I^2(\mathbf {p})=A_I^2((\mathbf {p},\mathbf {p}, \dots ,\mathbf {p}))\).

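Replicating \(\mathbf {p}\) any number of times leaves the geometric mean unchanged, that is \(\mathrm {g}_\mathrm {m}(\mathbf {p}) = \mathrm {g}_\mathrm {m}((\mathbf {p},\mathbf {p},\dots ,\mathbf {p}))\). Take \(\mathbf {s}_1=\mathbf {p}\) and \(\mathbf {s}_2\) equal to the remaining replications in the decomposability by subcompositions [Eq. (12)]. The 2-part composition \((\mathrm {g}_\mathrm {m}(\mathbf {s}_1),\mathrm {g}_\mathrm {m}(\mathbf {s}_2))\) then has equal parts, so that \(A_I^2((\mathrm {g}_\mathrm {m}(\mathbf {s}_1),\mathrm {g}_\mathrm {m}(\mathbf {s}_2)))\) is null and the remaining terms are a weighted average of \(A_I^2(\mathbf {s}_1)\) and \(A_I^2(\mathbf {s}_2)\). Applying the decomposability by subcompositions repeatedly, the population replication property holds.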
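A short numerical illustration of population replication follows. It is a sketch under the same assumption as before, namely \(A_I^2\) computed as the mean of the squared clr coefficients; the shares in `p` are hypothetical.

```python
import math


def clr(x):
    """Centered log-ratio transform of strictly positive parts."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def inequality_index(x):
    """A_I^2(x): squared Aitchison norm divided by the number of parts."""
    return sum(c * c for c in clr(x)) / len(x)


def geometric_mean(x):
    """Geometric mean of strictly positive parts."""
    return math.exp(sum(math.log(v) for v in x) / len(x))


# Hypothetical shares and their threefold replication (p, p, p).
p = [0.2, 0.3, 0.5]
replicated = p * 3

index_single = inequality_index(p)
index_replicated = inequality_index(replicated)
print(index_single, index_replicated)
```

Replication repeats each clr coefficient without changing its value, so the sum of squares and the number of parts grow by the same factor.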

Principle of transfers. A transfer of \(\delta >0\) to the i-th unit increases its share to \(p_i+\delta \), while \(\delta \) is deducted from the share of the j-th unit. If \(p_j-\delta > p_i+\delta \), then \(A_I^2(\mathbf {p})\ge A_I^2(\mathbf {p}')\), where \(\mathbf {p}'\) is the composition after the transfer.

Without loss of generality, take \(i=1\), \(j=2\). There are positive constants, a, b, such that \(p_1'=p_1+\delta = p_1 a\), \(p_2'=p_2-\delta =b p_2\), with \(a > 1\) and \(0< b < 1\), since \(p_2-p_1> 2\delta >0\). The new shares \(\mathbf {p}'\) are the perturbation

$$\begin{aligned} \mathbf {p}'= \mathbf {a} \oplus \mathbf {p}, \quad \mathbf {a}=(a,b,1,1,\dots ,1). \end{aligned}$$

The statement is proven if the following inner product satisfies

$$\begin{aligned} \langle \mathbf {a}, \mathbf {p}\rangle _a = \langle \mathrm {clr}(\mathbf {a}), \mathrm {clr}(\mathbf {p})\rangle _e \le 0. \end{aligned}$$

After some computation, the inner product is

$$\begin{aligned} \langle \mathbf {a}, \mathbf {p}\rangle _a = \log a \cdot \mathrm {clr}_1(\mathbf {p}) + \log b\cdot \mathrm {clr}_2(\mathbf {p}), \end{aligned}$$
(13)

where \(\mathrm {clr}_k(\mathbf {p})= \log p_k - \log \mathrm {g}_\mathrm {m}(\mathbf {p})\), \(k=1,2\). This inner product is negative if the two inequalities \(\log (ab)> 0\) and \((\log b)/(\log a)> -1\) hold. In fact, since \(\log a>0\), the negativity of Eq. (13) is equivalent to

$$\begin{aligned} \mathrm {clr}_1(\mathbf {p}) + \frac{\log b}{\log a}\cdot \mathrm {clr}_2(\mathbf {p}) < 0, \end{aligned}$$

which derives from \((\log b)/(\log a)> -1\) and \( \mathrm {clr}_1(\mathbf {p}) < \mathrm {clr}_2(\mathbf {p})\).

The first inequality derives from

$$\begin{aligned} ab = \frac{(p_1+\delta )(p_2-\delta )}{p_1 p_2} = 1 + \frac{\delta ((p_2-p_1)-\delta )}{p_1p_2} \end{aligned}$$

and the assumption that \((p_2-p_1)-2\delta > 0\), which gives \((p_2-p_1)-\delta> \delta >0\) and hence \(ab>1\). The second inequality follows from the first: \(\log (ab)=\log a+\log b>0\) and \(\log a>0\) imply that \((\log b) /(\log a)> -1\). Finally, \(\mathrm {clr}_1(\mathbf {p}) < \mathrm {clr}_2(\mathbf {p})\) because \(p_1<p_2\).
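The quantities used in this proof can be evaluated for a concrete transfer. The sketch below is an illustration for one hypothetical composition, not a substitute for the proof; it compares the inner product computed from clr coefficients with the closed form of Eq. (13) and evaluates \(A_I^2\) before and after the transfer. All numerical values and names are assumptions.

```python
import math


def clr(x):
    """Centered log-ratio transform of strictly positive parts."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def inequality_index(x):
    """A_I^2(x): squared Aitchison norm divided by the number of parts."""
    return sum(c * c for c in clr(x)) / len(x)


# Hypothetical shares; unit 1 receives delta from unit 2,
# and p_2 - delta > p_1 + delta holds (0.4 > 0.2).
p = [0.1, 0.5, 0.4]
delta = 0.1
p_after = [p[0] + delta, p[1] - delta] + p[2:]

# Perturbation a = (a, b, 1, ..., 1) such that p' = a (+) p.
a = p_after[0] / p[0]
b = p_after[1] / p[1]
perturbation = [a, b] + [1.0] * (len(p) - 2)

# Aitchison inner product <a, p>_a from clr coefficients and from Eq. (13).
inner_clr = sum(u * v for u, v in zip(clr(perturbation), clr(p)))
inner_eq13 = math.log(a) * clr(p)[0] + math.log(b) * clr(p)[1]

index_before = inequality_index(p)
index_after = inequality_index(p_after)
print(inner_clr, inner_eq13, index_before, index_after)
```

For these values the inner product is negative, \(ab>1\), and the index decreases after the transfer.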

Supplementary material

This section contains tables and figures that are not included in the body of the article owing to its length (Tables 3, 4, 5; Figs. 11, 12, 13, 14, 15).

Table 3 Center and normalized variation matrix of MHGI (Appendix A)
Table 4 Signs code for the SBP approaching principal balances for MHGI data
Table 5 Estimated balances of coefficients in model (9)
Fig. 11 Scatterplot of two balances with common parts in the denominator. When the slope is equal to 1, the two parts in the numerator are associated within the considered subcomposition. Left panel: testing association between Gal and Vas (p value 0.21, unit slope not rejected). Right panel: testing association between IBal and ICan (p value 0.0000, unit slope rejected)

Fig. 12 Scree plot for the MHGI data. Full line with circles: proportion of total variance explained by each principal component. Dotted line with triangles: cumulative proportion of total variance explained by the sequence of principal components. Note that only 14 squared singular values are reported, as there are only 15 data points despite there being 19 ACs

Fig. 13 MHGI per capita data. Form biplot of the subcomposition Cataluña (Cat), Madrid (Mad) and Melilla (Mel), representing 100% of total variance. The numeric markers correspond to years after 2000. The behavior of points is almost linear, as suggested by the biplot in Fig. 6

Fig. 14 Evolution of MHGI represented in the two first principal balances. The numbers correspond to years after 2000

Fig. 15 Form biplot of compositional residuals of regression model (9). Left panel: first and second principal components. Right panel: first and third principal components. No trend is detected. Total variance: residuals \(9.097\times 10^{-5}\); MHGI data 0.00234; percentage of total variance explained by the model 96.1%

Cite this article

Egozcue, J.J., Pawlowsky-Glahn, V. Compositional data: the sample space and its structure. TEST 28, 599–638 (2019). https://doi.org/10.1007/s11749-019-00670-6

  • DOI: https://doi.org/10.1007/s11749-019-00670-6
