Introduction

Morphological integration and modularity have been studied to examine how complex traits interact in terms of shared developmental pathways and functional demands (Olson and Miller 1958; Hallgrímsson et al. 2009; Armbruster et al. 2014; Goswami et al. 2014; Klingenberg 2014). According to the theoretical framework of integration, strictly independent evolution of traits in living organisms may not be possible due to the correlated responses to selection among traits. Thus, it serves as a reminder of one important principle in evolutionary processes; that not all traits are caused by adaptation or direct selection (Gould and Lewontin 1979).

Morphological integration and modularity can be regarded as “the patterns and processes” of trait interaction and independence, respectively (Armbruster et al. 2014). For instance, a set of traits defined as a ‘module’ may have fewer connections with other anatomical or morphological traits but within-module integration should be higher than others. Thus, integration and modularity are not antonyms but rather complimentary concepts. At the genetic level, integration and modularity can be associated with pleiotropy, epistasis, and linkage disequilibrium, resulting in shared developmental pathways among traits (Cheverud 1984; Hallgrímsson et al. 2009). For instance, a pleiotropic effect can occur when a mutation at a single gene locus causes changes in many phenotypic traits (Cheverud 1984). Functional constraints can also have an impact on morphological integration. For instance, fore- and hindlimb lengths or proportions can be functionally (and developmentally) integrated for the locomotor behaviors, leading to morphological integration and evolutionary constraint in these anatomical structures (Young et al. 2010; Rolian 2014). Young et al. (2010) showed that fore- and hindlimb lengths or proportion is less integrated in hominoids than other anthropoids, which may have resulted in less evolutionary constraint for the evolution of novel limb proportion and bipedalism in humans. Thus, functional influence later in life (as in, locomotor behaviors in hominoids from Young et al. 2010) can lead to reduction of developmental integration of limb proportions earlier in ontogeny of subsequent generations (as in Zelditch and Carmichael 1989; Kelly et al. 2019).

In studies of morphological integration, the structures of trait correlation or variance/covariance (V/CV) matrices are used as proxies for pattern and magnitude of integration (Cheverud 1984; Ackermann and Cheverud 2000, 2002; Marroig and Cheverud 2004; Porto et al. 2009; Marroig et al. 2009). In other words, multiple correlations or covariations among traits are analyzed to examine overall patterns and magnitudes of integration. For instance, the simple example of A and B in Fig. 1 shows the correlation or covariation between two traits with similar patterns but different magnitudes of integration. Here, β and ΔZ represent the selection vectors and response vectors to selection, respectively. The graphs show that total variation is more concentrated in a single axis in B than A. Thus, in Fig. 1b, the response along selection vector β2 is more constrained relative to the selection vector β1 where the selection vector is more closely aligned with the main axis of trait correlation (Of course, the direction of the response and selection vectors may not be in the perfect match in real life). These differences are expressed by different length of responses (ΔZ) between Fig. 1a and b. In this regard, it has been suggested that the pattern and magnitude of integration can constrain or facilitate the evolution of morphology in morphospace depending on the adaptive landscape (Porto et al. 2009, 2013; Marroig et al. 2009; de Oliveira et al. 2009; Hallgrímsson et al. 2009; Shirai and Marroig 2010; Klingenberg 2014; Armbruster et al. 2014; Penna et al. 2017). For instance, the pattern, but not the magnitude of integration has been shown to be fairly consistent in the cranium of mammals (Porto et al. 2009; Marroig et al. 2009). Thus, it appears that morphological diversification may be associated with magnitudes of integration but not necessarily patterns of integration (Porto et al. 2009; Marroig et al. 2009; de Oliveira et al. 2009).

Fig. 1
figure 1

Two trait correlation or covariation graph with similar patterns of integration but with different magnitudes of integration. A. Traits 1 and 2 are positively but not strongly correlated/covarying. B. Traits 1 and 2 are also positively but much more strongly correlated/covarying. Selection vectors are presented as dashed lines (β) and response vectors to selection vectors are presented as solid lines ΔZ

Several indices have been used to calculate the pattern and magnitude of integration or modularity, such as the RV coefficient, covariance ratio (CR) coefficient, partial least square (PLS), Random Skewers (RS) method, coefficient of determination (r2), and integration coefficient of variation (ICV) (Klingenberg 2009; Adams 2016; Rohlf and Corti 2000; Cheverud and Marroig 2007; Shirai and Marroig 2010). However, only few studies have been conducted regarding the issue of necessary sample sizes for various integration indices (e.g., Adams 2016; Grabowski and Porto 2017). Thus, the purpose of this study is to examine the effects of varying sample sizes on the reliable estimation of distributions of ICV scores since the ICV is more suited for calculating the magnitude of integration of the V/CV matrix (Shirai and Marroig 2010).

The RV coefficient (Escoufier 1973; Klingenberg 2009) and CR coefficient (Adams 2016) are used to quantify patterns of integration or modularity between two or more morphological modules. PLS (Rohlf and Corti 2000) can be used to quantify degree (or magnitude) of integration or modularity between two or more morphological modules. When there are more than two modules, the mean of the calculated indices from all pairs of two modules can be obtained. RS method (Cheverud and Marroig 2007) can be used for comparing the pattern of integration between two modules. The indices of r2 and ICV are used to calculate the magnitude of integration in a single correlation or V/CV matrix, respectively (Porto et al. 2009; Shirai and Marroig 2010; Grabowski and Porto 2017).

The RV coefficient is the ratio between the covariance of two blocks and their within-block variances (Klingenberg 2009), where blocks refer to matrices describing V/CV patterns either within- or between modules. In order to calculate the RV coefficient, the covariance matrix needs to be structured as follows (Klingenberg 2009), where, for example, S1 is the within-module V/CV for module 1, while S12 is the between-module V/CV for modules 1 and 2:

$$S = \left[ {\begin{array}{*{20}c} {S_{1} } & {S_{12} } \\ {S_{21} } & {S_{2} } \\ \end{array} } \right]$$

when module 1 and module 2 has p and q number of traits, respectively, matrix S has p + q dimensions. Then, calculation of the RV coefficient is as follows (Klingenberg 2009):

$$RV = \frac{{trace\left( {S_{12} S^{\prime}_{21} } \right)}}{{\sqrt {trace\left( {S_{1} S_{1} ^{\prime}} \right)trace\left( {S_{2} S^{\prime}_{2} } \right)} }}$$

Trace(S1S1′) or trace(S2S2′) is the sum of the squared variance and squared covariance in each block (within-modules). Trace(S12S21′) is the sum of the squared covariance between two modules. Thus, this formula presents covariation between two blocks which is standardized by the amount of variation within blocks. It is analogous to the (multivariate) correlation coefficient (Klingenberg 2009). Calculated RV coefficients range between zero and one. The RV coefficient is used to quantify how much covariation exists between two modules considering variances within each module (Klingenberg 2009).

However, Adams (2016) has argued that the RV coefficient may be too sensitive to sample size and the number of variables employed. Thus, the covariance ratio (CR coefficient) was suggested instead to calculate patterns of integration or modularity between two or more modules (Adams 2016). The CR coefficient is different from the RV coefficient as calculation of the CR coefficient is conducted with only the off-diagonal matrix as the numerator and denominator in the formula for the RV coefficient above. Thus, S1, S2, and S12 will have zeroes for their diagonal elements and only the sum of squared covariance will be included, while the sum of squared variance is excluded from the calculation of the CR coefficient (Adams 2016). Hence, the CR coefficient is literally a covariance ratio of the between and within modules covariance, and quantifies whether covariation between modules is larger or smaller than covariation within modules (Adams 2016).

Partial least squares (PLS) is a method for finding a new axis that explains most of the covariation between two or more modules (Rohlf and Corti 2000). Thus, PLS is similar to principal component analysis but the aim of PLS is to maximize covariance patterns between two or more blocks instead of maximizing variance of a single block (Rohlf and Corti 2000).

The Random Skewers (RS) method uses the multivariate Breeder’s equation, ΔZ = Gβ, where ΔZ is the evolutionary change in a vector of trait means, G is the additive genetic V/CV matrix, and β is the selection gradient vector (Lande 1979; Cheverud and Marroig 2007). Morphological variation can be analyzed using this Breeder’s equation as the additive genetic G-matrix can be substituted with a phenotypic V/CV matrix (P) given that they have been shown to be largely proportional (Cheverud 1996; Roff 1995; de Oliveira et al. 2009). The RS method is applied to two morphological V/CV matrices to compare their evolutionary response to the same set of selection vectors (Cheverud and Marroig 2007). For instance, the same 1000 randomly generated selection vectors are applied to two target matrices and the correlation between their responses to selection vectors is calculated. Thus, using this approach, one can test the similarity of pattern of integration (or structural similarity) between two morphological V/CV matrices (Cheverud and Marroig 2007).

The coefficient of determination (r2) is simply the mean of squared correlation coefficients between all traits (Porto et al. 2009). Thus, r2 quantifies the intensity of mean correlations between all traits within a module. Similarly, the integration coefficient of variation (ICV) is an index for calculating magnitude of integration within modules. The ICV is calculated from the standard deviation of eigenvalues divided by the mean eigenvalue of a V/CV matrix (Shirai and Marroig 2010). Thus, high ICV values indicate that most of the shape variation is concentrated within fewer dimensions as ICV = \(\frac{\sigma \left( \lambda \right)}{{\overline{\lambda }}}\), where \(\sigma \left( \lambda \right)\) is the standard deviation of the eigenvalues and \(\overline{\lambda }\) is the mean of those eigenvalues (Shirai and Marroig 2010; Conaway et al. 2018). Moreover, ICV is scale-independent as the standard deviation of eigenvalues is standardized by its mean.

Although there are various indices, the ICV is more suited for analyzing magnitudes of integration within V/CV matrices than r2, which can be used to quantify integration in correlation matrices (Shirai and Marroig 2010). Moreover, the ICV can practically summarize the capacity of traits to vary in morphospace depending on their overall covariation patterns (Shirai and Marroig 2010; Conaway et al. 2018). Although PLS can be used to quantify degree (or magnitude) of integration (Rohlf and Corti 2000), PLS cannot take account of within-module patterns or magnitudes of integration (Adams 2016). The ICV, on the other hand, can be employed to quantify magnitudes of integration both within- and between-modules, by combining traits across modules.

Calculating the magnitude of integration is important as the magnitude of integration has been found to vary in, for example, the mammalian cranium, while the pattern of integration was found to be consistent among the crania of mammals (Marroig et al. 2009; Porto et al. 2009; de Oliveira et al. 2009). On this basis, it was argued that evolution or diversification in cranial morphology of mammals may be constrained and/or facilitated by magnitudes of integration but is less affected by differing patterns. Moreover, some studies have been conducted in the post-cranium and showed that patterns of integration may be consistent in post-cranial skeletal elements of mammalian taxa, such as mustelids, felids, or canids (Arnold et al. 2016; Randau and Goswami 2017; Botton-Divet et al. 2018; Jones et al. 2018). However, it has not been tested whether the magnitude of integration in the post-cranium is also consistent among mammalian taxa or varies like in the cranium (but see, Young et al. 2010). Therefore, given that the ICV index can be used to quantify magnitudes of integration both within- and between-modules across an organism, the ICV is a useful means of testing how (the magnitude of) morphological integration is associated with evolutionary constraint or facilitation in future studies.

Although the ICV is more suitable for calculating magnitudes of integration in V/CV matrices, the effects of varying sample sizes on obtaining reliable estimates of ICV has not yet been explored. Only one study has been conducted regarding the issue of necessary sample sizes, focused on the magnitude of integration (r2 in Grabowski and Porto 2017). In the study, Grabowski and Porto (2017) argued that a sample size of 108 individuals was required to accurately estimate integration parameters when average trait coefficient of determination (r2) is 0.05. The minimum required sample size is an important issue in biology as there may not be enough specimens available in museum collections in many occasions. For instance, if the purpose of a study is to analyze patterns and/or magnitudes of integration in the entire skeleton of primates, it may be impossible to achieve the recommended sample size of 108 individuals (Grabowski and Porto 2017), although the required sample size does vary depending on the average trait coefficient of determination (r2), and the overall number of traits (Grabowski and Porto 2017). These circumstances are exacerbated if a researcher tries to study morphological integration using the post-cranial skeleton, which is often more poorly represented in museum collections than cranial remains.

Given that, the purpose of this study is to explore the effects of sample size on obtaining reliable estimates of the ICV by simulating populations with various magnitudes of integration (as quantified using r2). This simulation study on the ICV was conducted using a trait resampling method for generating ICV distributions (Conaway et al. 2018). This methodological approach is useful for comparing magnitudes of integration between skeletal elements with different number of traits (Conaway et al. 2018). For instance, two skeletal elements with 50 traits and 100 traits, respectively, would automatically have different ICV scores if all traits were included, even if mean eigenvalues were the same, as the smallest eigenvalue may be more skewed in the eigenvalue distribution of 100 traits, resulting in unintended inflation of ICV scores. Thus, the trait resampling method is an effective way of generating distributions of ICV values for each morphological module, irrespective of the number of traits, that can be compared statistically. To introduce here briefly within and between module ICV calculations, let us assume that there are modules A, B, C, and D with trait numbers of 50, 60, 70, and 80, respectively. In this example, a “random module” created from A‒D would have 260 traits in total as all traits from modules A through D are combined together (Conaway et al. 2018). Thereafter, within-module ICV values can be statistically compared to the random module using the resampling method described above, to address the basic question of whether individual modules (A, B, C or D) are statistically more strongly integrated than random sets of traits taken from across all modules. In a similar way, within and between module ICV scores can be compared between two modules A and B with the combination of modules A and B as a “random module”, by resampling 10 traits out of 110 traits as there are 50 and 60 traits in module A and B, respectively. Therefore, distributions of ICV values for certain numbers of random sets of vectors would be generated resulting in a mean and standard deviation of ICV values that can be statistically compared across taxa and/or across different morphological modules.

Methods and Materials

Simulations were conducted in three steps (Fig. 2). First, a variance/covariance (V/CV) matrix based on a multivariate normal distribution with known parameter values of coefficient of determination (r2) of V/CV matrix was generated using the ‘genPositiveDefMat’ function in the clusterGeneration R package (Joe 2006; Qiu et al. 2006). The generation method used was “c-vine” and the range of variance was between 0.5 and 0.6 with reference to Grabowski and Porto (2017). The generated V/CV matrix had 300 dimensions, representing 300 traits. Grabowski and Porto (2017) showed that generating V/CV matrices using the “c-vine” method may sometimes underestimate the effect of sample size for integration indices due to extremely small values of the smallest eigenvalue. Thus, it was suggested that V/CV matrices with too much skewness in terms of log-eigenvalue distribution be filtered out and to choose a V/CV matrix with ‘proper’ skewness when generating V/CV matrix with parameter values. However, there was no detectable relationship found between skewness of log-eigenvalue distribution and ICV values (Supplementary Fig. 1). Moreover, there was no detectable relationship in the scatter plot of the smallest eigenvalue and ICV values (Supplementary Fig. 2). Thus, the subsequent simulation was conducted without filtering V/CV matrices. In this regard, a V/CV matrix was generated once and used to apply the same pattern of integration to all subsequent simulation procedures as patterns of integration were found to have no effect on sampling effort in a previous study (Grabowski and Porto 2017). The first eigenvalue of the generated V/CV matrix was scaled to adjust parameter r2 values ranging from 0.02 to 0.7. Thus, the parameter r2 of the V/CV matrix was allowed to vary, while the pattern of the V/CV matrix remained the same in this study. Accordingly, calculations of ICV distributions were based on the same pattern of integration but differential parameter r2 values, and therefore, differing magnitudes of integration.

Fig. 2
figure 2

Summary of the three steps used in this study to conduct simulations

Next, populations of size 10,000 were generated based on the simulated V/CV matrix with a mean of zero and various parameter r2 values ranging from 0.02 to 0.7 with about 0.02 intervals (Fig. 2). Distributions of ICV values were generated using a resampling method whereby 10 traits were sampled at random out of 300 traits 1000 times (Kazi-Aoual et al. 1995; Conaway et al. 2018; Fig. 2). The resampling method was conducted based on the generated population size of 10,000. The resampling method randomly generates vectors with ten elements from the original vector of length 300. Next, ICV scores were calculated from these randomly generated vectors with length of 10. This procedure is reiterated 1000 times each for differential sample sizes of between 11 and 150 specimens. Samples of between 11 to 150 specimens were randomly selected from the population of 10,000 based on the simulated V/CV matrix with various parameter r2 values. Sample size started at 11 in order to generate sample V/CV matrices with full rank (i.e., more samples than the number of traits) as 10 traits were resampled in this simulation (Grabowski and Porto 2017). Thus, distributions of ICV values were generated for each sample size to examine the effect of varying sample sizes between 11 and 150. As a result, there were 140 ICV distributions (with 1000 ICV scores in each distribution) generated for each simulated V/CV matrix with specific parameter r2 values. Boxplots illustrating ICV distributions for varying sample size were examined to determine which minimum sample sizes were required to generate ‘stable’ ICV calculations, under different assumptions of average trait r2 values based on previous empirical estimates, such as r2 = 0.05 for the human cranium, r2 = 0.08 for the hominoid cranium, and r2 = 0.12 for the cranium of New and Old World monkey (Marroig et al. 2009; Porto et al. 2009; de Oliveria et al. 2009). Moreover, for future reference, simulations were also conducted with r2 values of 0.2 and 0.35. Mann–Whitney U tests were employed to statistically compare mean ICV scores between sample size distributions intervals of 10 (e.g., 11 vs. 21 and 21 vs. 31) within each r2 value tested. Bonferroni adjustment was applied due to multiple comparisons and the resultant alpha level was 0.0038 (i.e., 13 comparisons within each r2 value).

It was also possible that ICV values may be affected by the number of traits (in this case 300) used in the starting V/CV matrix, as skeletal elements are likely to have different number of traits based on the number of landmarks or measurements available. The effects of changing the total trait numbers on ICV values was examined by altering starting trait numbers to 75, 150, and 300. Moreover, the number of resampled traits was also altered from 10 to 29 traits, sample sizes were set as either 30 or 100, and parameter r2 values were set as 0.1 or 0.5 to simultaneously examine the effect of the number of total traits to choose from, the number of resampled traits, sample size, and the r2 (magnitude of integration) of the V/CV matrix. Based on the results of this simulation, it is possible to test whether different skeletal elements (e.g., cranium or mandible) would show systematically differential ICV values due to differing total number of traits. Alternatively, it may be the case that only r2 and/or the number of resampled traits matters for calculating ICV using the resampling method as predicted above and that it is not necessarily dependent on the total number of traits in terms of magnitude of integration.

For real biological data, sample r2 values were empirically quantified for 22 different skeletal elements using samples of n = 40 Macaca fascicularis. Skeletal elements were scanned using a HDI-120 and a Macro R5X structured-light scanner (LMI technologies INC., Vancouver, Canada). 18 wild specimens of M. fascicularis were from the Museum of Comparative Zoology at Harvard University and 22 captive specimens were from the Department of Anthropology of University at Buffalo, SUNY. For the limb bones (scapula, humerus, radius, os coxa, femur, tibia), only 35 specimens (18 wild and 17 captive) out of total 40 specimens were available. The following 22 skeletal elements were quantified: cranium, mandible, 13 elements of the vertebral column (C1, C2, C3, C5, C7, T1, T4, T7, T10, T12, L1, L4, L7), sacrum, scapula, humerus, radius, os coxa, femur, and tibia (there were 3 specimens with T13 as the last thoracic vertebra, 4 specimens with L6 as the last lumbar vertebra, and 1 specimen with 4 sacral vertebrae). In the case of bilateral elements, the left side was landmarked. When the left side was damaged, the right side was landmarked instead. All landmarks were digitized using the software Landmark (Wiley et al. 2005) on the 3D scanned skeletal elements. Descriptions of the landmarking protocol for each skeletal element can be found in supplementary data (Supplementary Table 1–10). Traits were generated for each skeletal element by calculating all possible Euclidean distances between pairs of landmarks on each bone.

Within-sex mean standardization was conducted to remove the potentially confounding effect of sexual dimorphism and to control for size differences within and between traits of different skeletal elements (Conaway et al. 2018). For instance, larger bones may have higher ICV scores if larger interlandmark distances explain most of the variation and show larger variance. For each trait, the mean was centered to zero but variance was not scaled within each sex. Thus, the variance and covariance structure was similarly maintained while the effect of sexual dimorphism in size difference was controlled. Next, a MANOVA was conducted on the mean standardized traits for each skeletal element to remove the possibility of artificial inflation of variance due to inclusion of both wild and captive specimens, by extracting trait residuals which were used in further analyses. For each of the 22 elements, the average r2 values were calculated based on the MANOVA residuals for all possible traits available for that element.

The means and standard deviations (SD) of ICV scores for M. fascicularis skeletal elements were calculated with varying sample sizes to examine how r2 values and sample size affect ICV distributions based on real empirical data. Distributions of ICV scores were calculated using the resampling method described above. For the vertebral column, only the first and last vertebra for the cervical, thoracic, and lumbar region was examined (i.e., six vertebrae). Sample sizes were set to 10, 20, 30, and 40 (or 35) for comparison. For instance, for a sample size of 10 for the cranium, 10 individuals from 40 individuals, and 10 traits out of 595 interlandmark distances were randomly drawn with resampling 1000 times. Hence, there were 1000 ICV scores in each ICV distribution. To test the correlation between the empirical r2 value and the mean and SD of ICV scores, a Spearman’s rank correlation test was conducted for each sample size category. For statistical comparison, Mann–Whitney U tests were conducted between different sample sizes (e.g., 10 vs. 20) with Bonferroni adjustment. Test results were considered to be statistically significant when p-values were less than 0.0125. All simulations and statistical tests were conducted with r 3.4.4. and ICV scores were calculated using the ‘CalcICV’ function in the evolqg package (Melo et al. 2015) in r 3.4.4.

Results

Simulation Study

The results showed that means of ICV distributions could be reliably estimated for r2 values of 0.05 or greater with sample sizes of about 100, for r2 values of 0.08 or greater with sample sizes of about 55‒60, and for r2 values of 0.12 with sample sizes of about 40‒45 (Figs. 3, 4, and 5). In other words, when the mean trait r2 value is relatively low, larger sample sizes are required to accurately estimate ICV values, and vice versa. On the other hand, a sample size of 40 individuals is large enough to accurately calculate the magnitude of integration when the average trait r2 value is over 0.08. Moreover, with fairly high r2 values (r2 > 0.2 or > 0.35), about 20 or 30 individuals are sufficient for calculating reliable ICV means (Figs. 6 and 7). For the parameter r2 value of 0.05, Mann–Whitney U tests were not statistically significant once sample sizes were over 51 (Table 1). For r2 values of 0.08 and 0.12, Mann–Whitney U tests were not statistically significant once sample sizes were over 31. For r2 values of 0.2 and 0.35, Mann–Whitney U tests were not statistically significant once sample sizes were over 11 (Table 1). Mann–Whitney U test results showed that even sample sizes with means above the ‘asymptote’ red line in Figs. 3, 4, 5, 6, and 7 are sufficient to reliably estimate mean ICV scores for some r2 values.

Fig. 3
figure 3

Distributions of ICV values based on 10 traits resampled at random from 300 traits with varying sample sizes (n = 11–150) from a multivariate normal population of 10,000 individuals when average among trait correlation is r2 = 0.05 (similar to the empirical r2 for human cranial traits in Porto et al. 2009). Red straight line shows where sample size starts to approach an asymptote ICV value

Fig. 4
figure 4

Distributions of ICV values based on 10 traits resampled at random from 300 traits with varying sample sizes (n = 11–150) from a multivariate normal population of 10,000 individuals when average among trait correlation is r2 = 0.08 (similar to the empirical r2 for hominoid cranial traits in Porto et al. 2009). Red straight line shows where sample size starts to approach an asymptote ICV value

Fig. 5
figure 5

Distributions of ICV values based on 10 traits resampled at random from 300 traits with varying sample sizes (n = 11–150) from a multivariate normal population of 10,000 individuals when average among trait correlation is r2 = 0.12 (similar to the empirical r2 for Old and New World monkey cranial traits in Porto et al. 2009). Red straight line shows where sample size starts to approach an asymptote ICV value

Fig. 6
figure 6

Distributions of ICV values based on 10 traits resampled at random from 300 traits with varying sample sizes (n = 11–150) from a multivariate normal population of 10,000 individuals when average among trait correlation is r2 = 0.2. Red straight line shows where sample size starts to approach an asymptote ICV value

Fig. 7
figure 7

Distributions of ICV values based on 10 traits resampled at random from 300 traits with varying sample sizes (n = 11–150) from a multivariate normal population of 10,000 individuals when average among trait correlation is r2 = 0.35. Red straight line shows where sample size starts to approach an asymptote ICV value

Table 1 Mann–Whitney U test results between different sample sizes under different assumptions of r2 values

Boxplots of ICV distributions based on various parameter r2 values show that the mean ICV based on various sample sizes increased logarithmically with increasing parameter r2 values (Fig. 8). Conversely, the standard deviation (SD) of mean ICV exponentially decreased with increasing parameter r2 values (Fig. 9). Thus, there is a clear relationship between parameter r2 values and the means and standard deviations of resultant ICV estimates, which explains why only relatively small sample sizes are required to reliably calculate ICV distribution using the resampling method if overall trait r2 values are reasonably high.

Fig. 8
figure 8

Distributions of mean ICV of varying sample sizes (n = 11–150) based on various r2 values (0.02–0.7)

Fig. 9
figure 9

Standard deviation of mean ICV of varying sample sizes (n = 11–150) based on various r2 values (0.02–0.7)

The results of the simulations of the effects of total trait number, sample size, and number of resampled traits on mean ICV showed that mean ICV increased with more resampled traits irrespective of the starting total number of traits, sample size, or parameter r2 value (Fig. 10). However, the rate at which mean ICV increased as the number of resampled traits increased was faster when the r2 value was higher. Moreover, the mean ICV was mostly influenced by r2 values and number of resampled traits, rather than sample size and the total number of traits (Fig. 10). Thus, it appears that the total number of traits per skeletal element does not affect the calculation of ICV distributions as long as the number of resampled traits is held constant between elements being compared.

Fig. 10
figure 10

Effect of total trait number, sample size, and number of traits resampled on mean ICV. t total trait number, s sample size, r r2 value

Skeletal elements of Macaca fascicularis

The empirical r2 values for the 22 skeletal elements of M. fascicularis tested are presented in Table 2. Among skeletal elements, mean trait r2 values ranged between 0.22 and 0.51 (Table 2). In general, postcranial elements showed higher r2 values than the cranium and mandible. The lowest r2 value was 0.22 in the first cervical vertebra (C1), while the highest r2 value was 0.51 in the tibia. The average r2 value across skeletal elements was 0.35. The correlation between r2 values and mean ICV scores was significant for all sample sizes (sample size of 10: r = 0.758, p = 0.002; sample size of 20: r = 0.725, p < 0.001; sample size of 30: r = 0.766, p < 0.001; sample size of 40 (or 35): r = 0.771, p < 0.001). The correlation between r2 values and standard of deviations (SD) of ICV scores was also significant for all sample sizes (sample size of 10: r = − 0.547, p = 0.035; sample size of 20: r = − 0.58, p = 0.023; sample size of 30: r = − 0.649, p = 0.009; sample size of 40 (or 35): r = − 0.63, p = 0.012). Thus, mean ICV scores increased but SD of ICV scores decreased with increasing r2 values.

Table 2 Coefficient of determination (r2) values for skeletal elements in Macaca fascicularis

The mean ICV scores for different skeletal elements of M. fascicularis generated using various sample sizes showed that for all skeletal elements there was no significant difference in ICV scores between sample sizes of 30 and 40 (or 35) (Table 3). In all skeletal elements, except for the radius and tibia, there was a significant difference in ICV between sample sizes of 10 and 20. Therefore, even very small samples were sufficient to accurately estimate the ICV for the radius and tibia, which reflect the fact that these skeletal elements had the highest r2 values (0.4 and 0.51, respectively) among skeletal elements. For all other skeletal elements, barring the femur, scapula, and humerus, there was also a significant difference in ICV scores between sample sizes of 20 and 30, but for all bones tested, increasing the sample size above 30 did not have any significant impact on the estimation of distributions of ICV. Thus, in general, the results from the empirical analysis of M. fascicularis skeletal elements corroborate the computer simulation results in suggesting that sample sizes of 30‒40 are sufficient to reliably calculate mean ICV scores in skeletal elements where the average r2 values among traits are reasonably high.

Table 3 Means and standard deviations of ICV distributions generated using resampling method (1000 resamples of 10 traits) with various sample sizes for skeletal elements of Macaca fascicularis

Discussion

Interpretation of the simulation results

In general, the results corresponded to the previous study of Grabowski and Porto (2017) as both sample size and r2 were important for calculating and estimating reliable indices of morphological integration. In this regard, it is advisable that a ‘universal’ sample size for different anatomical regions or skeletal elements be based on the lowest r2 obtained for any anatomical region or skeletal element. For instance, if the cranium shows lower r2 than post-cranial skeletal elements, the sampling effort should be based on the r2 of the cranium, although post-cranial skeletal elements may show relatively higher r2 values than the cranium. The simulation results showed that calculation of reliable mean ICV mostly depends on the value of the parameter r2. When r2 is ‘moderate’ (i.e., r2 > 0.08), a sample size of 40 individuals is large enough to calculate the mean ICV using a trait resampling method. Moreover, standard deviations of mean ICV exponentially decreased as r2 values increased (Fig. 9). Thus, the accuracy of the calculation of mean ICV is fairly reliable with high r2 values. In contrast, when r2 is low as is the case with the human cranium (r2 = 0.05) found in previous studies (Porto et al. 2009; de Oliveira et al. 2009), a larger sample size is required to calculate stable mean ICV.

Nevertheless, the Mann–Whitney U test results showed that a sample size of 51 may be sufficient even with a low r2 value of 0.05 for calculating reliable mean ICV scores using the trait resampling method (Table 1). Thus, for the human cranium (r2 = 0.05), 51 individuals may be the bare minimum required to calculate reliable mean ICV based on the results of the present study, while at least 108 individuals are required to calculate reliable integration indices (e.g., r2) based on the results of Grabowski and Porto (2017). The discrepancy between the results of Grabowski and Porto (2017) and the present study may depend on the methods used to quantify errors. In the present study, we used Mann–Whitney U tests to statistically compare significant differences in mean ICV between sample sizes to examine how many individuals are required to calculate a reliable mean ICV for each r2 value. In contrast, Grabowski and Porto (2017) used three measures of error; bias, imprecision, and inaccuracy. For instance, inaccuracy was calculated as “Inaccuracy = Imprecision + Bias2,” where imprecision is “the distance of repeated measurements to each other”, and bias is “the difference between the expected value of a parameter and the true parameter value.” Or, inaccuracy can be calculated as “the distance of a measured value to its parameter value” (Grabowski and Porto 2017). They considered an inaccuracy value of 0.05 as a “cut-off” for calculating reliable integration indices in their simulation (Grabowski and Porto 2017). Hence, Grabowski and Porto’s (2017) findings that much larger sample sizes are required may be related to their fairly strict criteria for calculating reliable integration indices. If we determine required minimum sample sizes based on the ‘asymptote’ lines in Figs. 3, 4, 5, 6, and 7, the present study demonstrates similar conclusions to the results of Grabowski and Porto (2017) (i.e., more than 100 individuals are required when r2 = 0.05). However, the results of the Mann–Whitney U tests show a less strict criterion for sample size determination that may be applicable to empirical studies of morphological integration. Thus, in terms of required sample size, our results correspond better to those of Cheverud and colleagues (e.g., Ackermann 2009), which showed that sample sizes of 40 are necessary for accurately estimating the structure of V/CV matrices or calculating integration indices. At the very least, our simulation results suggest that a sample size of 30‒40 when trait r2 value is over 0.08 (e.g., cranium of hominoids) or a sample size over 51 when r2 value is over 0.05 (e.g., cranium of humans) are the bare minimums required to accurately calculate magnitudes of integration using mean ICV scores (Table 1).

Moreover, as shown in previous studies (Grabowski and Porto 2017) and the present study, the number of (resampled) traits can significantly affect the index of morphological integration as different numbers of traits will change the average trait r2 of a certain morphology. For instance, mean ICV increased when the number of resampled traits increased in this study (Fig. 10). Nevertheless, it was also shown that the total number of traits did not affect mean ICV values when the number of resampled traits remained the same (Fig. 10). This result shows the merit of the resampling method as different skeletal elements have different numbers of traits, which may artificially alter ICV values. For instance, two skeletal elements with smaller and larger number of traits, respectively, would automatically have different ICV scores as the smallest eigenvalue may be more skewed in an eigenvalue distribution from a larger number of traits, resulting in unintended inflation of ICV scores even if mean eigenvalues are the same between skeletal elements with different number of traits. Thus, calculating distributions of ICV scores using a resampling method is an effective way to compare magnitudes of integration between skeletal elements with different number of traits, such as different developmental and/or functional modules.

Empirical Data of Macaca fascicularis

The analysis of skeletal elements in M. fascicularis showed that average trait r2 values ranged between 0.22 and 0.51 (Table 2). The correlation coefficient between r2 values and mean ICV scores of skeletal elements ranged between 0.725 in the sample size of 20 and 0.771 in sample size of 40 (or 35). Although there is a clear relationship between r2 and ICV measures of morphological integration, the ICV is more appropriate for calculating magnitudes of integration in V/CV matrices, while r2 is better for the same purpose when using correlation matrices as suggested by Shirai and Marroig (2010). The average trait r2 value for the cranium (r2 = 0.24) was slightly higher than that found in a previous study (r2 ≈ 0.17; de Oliveira et al. 2009). This difference most likely reflects the different landmarks used to capture cranial traits and differences in sample composition used in the present and previous studies. In particular, the total number of traits for the cranium in this study was 595 interlandmark distances, which was much larger than trait sets used in previous studies. The r2 value of the post-cranium was generally higher than the cranium except for C1, while the os coxa showed the same r2 value as the cranium (Table 2). In this regard, the r2 value for the cranium of mammals in previous studies (e.g., Marroig et al. 2009; Porto et al. 2009; de Oliveira et al. 2009) can be used as a guideline for deciding the minimum sample size required to reliably calculate the magnitude of integration using mean ICV scores. Based on the simulation results above, only about 10–20 specimens are required to analyze magnitudes of integration for the skeletal elements of M. fascicularis presented here (Table 1 and 2). However, it was found that sample sizes of 10–30 are required to reliably calculate mean ICV scores based on the results of the Mann–Whitney U tests comparing the effect of sample size on mean ICV for the skeletal elements of M. fascicularis (Table 3). The difference between the results of the simulation and real biological data is likely due to the structure of the data as a parameter V/CV matrix was generated in the simulation with multivariate normality but this is not always likely to be the case for real biological data. Thus, if one wants to examine mean ICV scores in all skeletal elements of M. fascicularis, a minimum sample size of 30 is recommended (Table 3).

A Cautionary Tale: The Relationship Between r2 Values and Landmarking Protocol

As shown in this study, a posterior calculation of average trait r2 values will be important and necessary for validating necessary sample sizes for analyzing magnitudes of integration using the ICV. Having said this, the results of the present study also suggest that how traits are determined, such as the landmarking protocol used, may also be an important issue for morphometric integration studies, in terms of estimation of the average r2 value (Conaway et al. 2019). In previous studies and in the present study, one of the main ways in which researchers can decide on an appropriate sampling effort for reliably calculating integration indices is to evaluate the average trait r2 value. Thus, it is also important to consider whether different (geometric) morphometric methods applied to skeletal elements have substantial effects on the calculation of r2 values. In this respect, Conaway et al. (2019) reported that the r2 can vary based on the landmarking protocol used to define morphometric traits, such as the use of different numbers of landmarks and/or inclusion of semi-landmarks in the analysis. In other words, the magnitude of integration could be different even within the same skeletal region due to differing landmarking protocols and, resultant varying r2 values. Therefore, sampling effort should be considered not only prior to the study based on r2 values from previous studies but also after the (preliminary) sampling is completed when r2 can be recalculated posteriorly based on the specific traits (i.e., specific landmarking protocol) being employed in each analysis. In this regard, it is recommended to prepare to sample more than the minimum required sample sizes for specific skeletal elements of certain species (e.g., more than 51 individuals for the human cranium) as it is impossible to know a priori what the precise r2 will be based on the specific morphometric protocol being employed and the sample composition. For instance, it may be that the actual r2 value is lower than 0.05, which would require a much larger sample size than 51 individuals as the standard deviation of mean ICV exponentially increase with lower r2 values (Fig. 9). In a nutshell, while the results of this simulation provide guidelines for minimum sample sizes based on average trait r2 values, sampling effort should be tailored to the specific purpose (i.e., hypothesized developmental and/or functional module), trait definition method (i.e., landmarking protocol), and material (i.e., skeletal element) of each study.