Introduction

Between-class or school differences in the achievement composition of their students may reflect the selective allocation of students to different learning groups according to their academic capabilities (e.g., tracking, ability grouping, streaming; Ireson et al., 2001), or they may simply be the consequence of social segregation, neighborhood effects, or even parental choice. A compositional effect is revealed when students’ outcomes are associated with the aggregated characteristics of their peers in the school or the classroom after controlling for pre-existing differences at the student level (Epple & Romano, 2011; Harker & Tymms, 2004; Marsh, Pekrun et al., 2018, 2023; Wagner, 2022). For instance, a positive compositional effect of average achievement would suggest that students of the same academic achievement benefit more if they attend an institution or a classroom with a higher achievement intake. On the contrary, an absent or a negative compositional effect would suggest that attending a higher-achieving institution might not necessarily benefit student learning (Televantou et al., 2015). Research findings often support the conventional wisdom, suggesting a positive, albeit weak, effect of class- or school-aggregated achievement on students’ academic outcomes (Teddlie et al., 1999; Wagner, 2022; Willms, 1985): the so-called peer spillover effect (Cooley Fruehwirth, 2013; see Fig. 1). That is, they suggest a positive association between peers’ average achievement and a student’s academic achievement. Empirical evidence supporting this view can be traced back to the 1960s, to the “Coleman Report” on educational opportunity (Coleman et al., 1966).
Since then, many other studies drawing on data from different countries and using different analytical strategies have also shown positive compositional effects on students’ academic achievement development (Becker et al., 2021; Burns & Mason, 2002; Hanushek et al., 2003; Nomi & Raudenbush, 2016; Opdenakker et al., 2002). At the same time, another set of studies reports negative or non-existent compositional effects (for an overview, see Sacerdote, 2011). In fact, there is remarkably little agreement on this matter (Ammermueller & Pischke, 2009; Hanushek et al., 2003; Sacerdote, 2001; Stinebrickner & Stinebrickner, 2006). Variation in the reported findings is linked to, among other factors, inadequacies in the methodologies used and the quality of the data collected (Hutchison, 2004; Manski, 1993; Thrupp et al., 2002). Recent empirical (Harker & Tymms, 2004; Hutchison, 2007; Marsh et al., 2010; Televantou et al., 2015; Woodhouse et al., 1996) and methodological work (Marsh et al., 2009; Pokropek, 2015) has specifically turned its focus to the impact of correcting for measurement error in student-level measures (i.e., student achievement) on compositional effect estimates. Based on a longitudinal sample of US children who participated in the Early Childhood Longitudinal Study Kindergarten Class of 1998-99, Dicke et al. (2018) demonstrated that when an appropriate methodology is used, one that adjusts for measurement error in achievement scores and controls for pre-existing differences, originally positive school-level compositional effects become minimal, close to zero. The authors called for further studies to investigate their findings' replicability (Nosek et al., 2022).

Fig. 1

Theoretical model of the peer spillover effect, the Big-Fish-Little-Pond-Effect, and the reciprocal effects model. T1, pre-test (i.e., beginning of the academic year); T2, post-test (i.e., end of the academic year); ACH1, student’s academic achievement at T1; ACH2, student’s academic achievement at T2; ASC2, academic self-concept at T2; ave ACH1, students’ average achievement at T1

Dicke and colleagues addressed the importance of evaluating compositional effects of academic achievement on another educational outcome, namely, academic self-concept (see also Stäbler et al., 2017; Televantou et al., 2021). Considering the positive effect of average achievement on students’ academic achievement and the reciprocal effects model, which suggests a mutually positive relationship between student achievement and student self-concept, one would expect a positive effect of average achievement on students’ academic self-concept. However, consistent evidence in the educational psychology literature suggests a negative relationship between average achievement and students’ self-concept (Marsh & Craven, 2005; Marsh & Martin, 2011): the Big-Fish-Little-Pond-Effect (BFLPE; Marsh, 1987; Marsh, Xu et al., 2021; see Fig. 1). Dicke et al. claimed, and empirically showed, that correcting for measurement error and pre-existing student differences may be the key to achieving convergence between BFLPE research findings of negative compositional effects on self-concept and educational research findings of positive compositional effects on achievement. They based their suggestion on their finding that negative school-level BFLPEs turned even more negative after such adjustments, whereas positive school-level compositional effects became much smaller, slightly below zero. In this respect, the inconsistency in the estimates of the two compositional effects was eliminated, and the aforementioned theoretical paradox was partially resolved. In a subsequent study, however, Becker et al. (2021) suggested that the mixed pattern in studies investigating compositional effects on achievement and self-concept in educational settings might not be an artifact of inadequate methodology alone; it might instead reflect a “substantive effect pattern” (Becker et al., 2021, p. 14).
They propose that researchers should look into the mechanisms driving the occurrence of compositional effects, in addition to using appropriate research methodology, before drawing conclusions from statistical findings. Becker et al. distinguished between different mechanisms leading to the occurrence of achievement composition effects, namely peer processes and instructional processes, as well as the allocation of resources to schools or classrooms. They explain that variables associated with these mechanisms should be controlled for in statistical analyses before inferences about the actual size of achievement composition effects are made.

The present investigation revisits historical data from the 1980s, administered by the International Association for the Evaluation of Educational Achievement (IEA), namely the Second International Mathematics Study (SIMS80). Based on these data, we reproduce a positive and statistically significant class-level compositional effect of mathematics achievement (Zimmer & Toma, 2000). However, this effect largely disappears when controls for measurement error are made, in line with the recommendations of more recent studies (Dicke et al., 2018; Televantou et al., 2015, 2021), and when a range of class-level variables is controlled for, in line with Becker et al. (2021). Thus, the original study of Zimmer and Toma, despite being highly cited (Hanushek et al., 2003; Hanushek & Woessmann, 2011; Sacerdote, 2011; Van de Werfhorst & Mijs, 2010), fails to replicate (Nosek et al., 2022) under a more appropriate analytical approach. Further, the present study evaluates the compositional effect of mathematics achievement on mathematics self-concept (Dicke et al., 2018; Stäbler et al., 2017), demonstrating the robustness of the BFLPE across different analytical strategies. The value of doing and understanding replication, reproducibility, and robustness has been increasingly recognized in the past decade as contributing to the quality of research findings and accelerating scientific progress (Nosek et al., 2022); our study involves aspects of all three elements.

The Second International Mathematics Study (SIMS80) and FIMS, the First International Mathematics Study administered in the 1960s, represent the first two international comparisons of mathematics achievement. While both surveys have substantially influenced education worldwide, SIMS80 provides a more valid basis for relevant empirical studies (Brown, 1996). Significantly, SIMS80 is based on a pre- and post-measurement design, allowing for controls for the effects of prior achievement on subsequent achievement (the peer spillover effect; see above) and the BFLPE. Although some other studies also involve longitudinal data, ours might be the only one to do so cross-nationally; in particular, all subsequent IEA and PISA data collections have been strictly cross-sectional. Whereas the BFLPE can be tested with cross-sectional data, tests of the peer spillover effect cannot (Caro et al., 2017; Wagner, 2022). In this respect, our study is the only cross-national study to have evaluated both the BFLPE and the peer spillover effect with controls for a true measure of prior achievement, and the only study to test the peer spillover effect cross-nationally using doubly latent models.

The following sections describe how current state-of-the-art compositional analysis models build on Zimmer and Toma’s (2000) analytical approach, adjusting for measurement error bias in compositional effect estimates. We then justify why we considered mathematics self-concept as an educational outcome, in addition to mathematics achievement, describing the Big-Fish-Little-Pond-Effect (BFLPE). Finally, we present our study’s scope, research hypotheses, and research questions.

State-of-the-Art Compositional Analysis Models

In estimating the class-level compositional effect of average achievement, Zimmer and Toma (2000) used a fixed-effects linear-in-means regression model (Ammermueller & Pischke, 2009), a typical approach in the econometrics literature. Today’s default approach to compositional analysis in education is multilevel modeling (Snijders & Bosker, 2012). With both multiple regression linear-in-means models and multilevel compositional analysis models, the criterion, typically student-level performance in an outcome of interest (academic self-concept, academic achievement), is regressed on an individual-level variable (prior achievement) and the corresponding class- or school-level aggregate (average achievement). The effect of the aggregated variable on the student-level outcome is commonly referred to as the compositional effect (Harker & Tymms, 2004). Multilevel modeling’s strength lies in the fact that it accounts for the nested structure of educational data (e.g., students nested within classrooms or schools), providing unbiased standard errors; multiple regression models that ignore this nesting typically underestimate standard errors. However, conventional multilevel models involve single-scale scores (manifest with respect to the sampling of items) and manifest aggregation (manifest with respect to the sampling of people). Marsh et al. (Marsh et al., 2009; Marsh & Martin, 2011; Marsh et al., 2012) developed and demonstrated the application of doubly latent models (latent variable, latent aggregation) to investigate compositional effects. This approach builds on the multilevel model, referred to as the doubly manifest model (Marsh et al., 2009), by allowing controls for measurement error in individual-level measures and corresponding aggregates, and for sampling error in the aggregated measures. In this methodological framework, measurement error is conceptualized as the result of using only a finite number of items to measure a student’s academic achievement or trait.
A perfectly reliable measurement would, in principle, require an infinite number of items; measurement error is controlled by using multiple indicators (Marsh et al., 2009). Sampling error arises when only a finite number of individuals from each higher-level unit is used to form the aggregated measures; it is adjusted for by using latent rather than manifest aggregation to form the group-level aggregates (Lüdtke et al., 2008).
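To illustrate the sampling-error problem that latent aggregation addresses, the following sketch (a hypothetical simulation, not part of the SIMS80 analysis) computes the reliability of a manifest class mean using the formula of Lüdtke et al. (2008): with intraclass correlation ICC and class size n, the observed class mean measures the true class mean with reliability λ = ICC·n / (ICC·n + 1 − ICC). All parameter values below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical: 100 classes of n = 20 students; true scores split into a
# between-class component (variance tau2) and a within-class component (sigma2).
n_classes, n, tau2, sigma2 = 100, 20, 0.2, 0.8
u = rng.normal(0, np.sqrt(tau2), n_classes)                  # true class effects
x = u[:, None] + rng.normal(0, np.sqrt(sigma2), (n_classes, n))

# Manifest aggregation: the observed class mean is a fallible estimate of the
# true class mean u_j; its reliability follows Luedtke et al. (2008):
#   lambda = ICC * n / (ICC * n + (1 - ICC))
icc = tau2 / (tau2 + sigma2)
lam = icc * n / (icc * n + 1 - icc)
print(round(icc, 3), round(lam, 3))
```

With these values the class mean is about 83% reliable; for smaller classes or lower ICCs the manifest mean becomes a much noisier proxy, which is what latent aggregation corrects for.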

Phantom Peer Spillover Effects

Harker and Tymms (2004) coined the term phantom effects to describe misleadingly positive effects of aggregated variables in compositional models — peer spillover effects that are simply an artifact of the inadequacy in the statistical procedures used. Two “facets” (Televantou et al., 2015, p. 79) of under-specification at the student level have been shown to lead to the so-called phantom effects: measurement error bias and omitted variable, or selection bias (Caro et al., 2017; Harker & Tymms, 2004; Televantou et al., 2015). Omitted variable bias refers to insufficient controls for student-level background measures — a problem common to all observational studies (Pearl, 2002; West & Thoemmes, 2010). Measurement error bias may lead to positive compositional effects misleadingly appearing as more positive and non-existent ones being estimated as positive and significant. Conversely, negative compositional effects may be estimated as less negative, and in the presence of large amounts of measurement error, they may even turn positive (Televantou et al., 2015). The direction of omitted variable bias in the compositional effect estimate is not straightforward to predict: it depends on the correlation between the omitted and the aggregated variable, as well as on the relative direction of the effects of the two variables on the outcome of interest (Caro et al., 2017). Previous research (Dicke et al., 2018) has empirically shown that correcting for omitted variables at the student level eliminates the peer spillover effect and leads to a more negative BFLPE. However, the researchers call for further studies to validate their findings with different data and background variables. In their study, Zimmer and Toma (2000) dealt with omitted variable bias through controls for a diverse range of student-level characteristics available with SIMS80. 
However, they could not reasonably address measurement error bias, since statistical models that accommodate this source of bias were not readily available at the time their study took place.
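The measurement-error facet of phantom effects can be demonstrated with a small hypothetical simulation (a sketch under assumed parameter values, not the models used in this study): when students' pre-test scores are observed with error, the class mean of the observed scores averages much of that error away, so it picks up part of the individual-level effect and a spurious positive compositional coefficient emerges even though no true peer effect was built into the data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical: 200 classes x 30 students; classes differ in intake (u_j),
# but there is NO true peer effect: y depends only on a student's own true score.
J, n = 200, 30
u = rng.normal(0, np.sqrt(0.3), J)
x_true = u[:, None] + rng.normal(0, np.sqrt(0.7), (J, n))
y = x_true + rng.normal(0, 0.5, (J, n))

def comp_coef(x):
    """OLS of y on [1, x_ij, class mean of x]; return the compositional slope."""
    xbar = np.repeat(x.mean(axis=1), n)
    X = np.column_stack([np.ones(J * n), x.ravel(), xbar])
    b, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
    return b[2]

# With the error-free pre-test, the compositional coefficient is ~0 ...
b_true = comp_coef(x_true)
# ... but adding measurement error to the student-level score produces a
# positive "phantom" compositional effect.
x_obs = x_true + rng.normal(0, np.sqrt(0.5), (J, n))
b_obs = comp_coef(x_obs)
print(round(b_true, 3), round(b_obs, 3))
```

The attenuated individual-level slope leaves residual true-score variance that the (nearly error-free) class mean absorbs, which is exactly the mechanism behind phantom peer spillover effects.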

The Big-Fish-Little-Pond-Effect

Academic Self-Concept (ASC) is defined as the specific component of self-concept that denotes how individuals perceive their academic abilities and competencies in a specific subject (Byrne & Shavelson, 1986). ASC is valued as an educational outcome in its own right (Zirkel, 1971) and as a facilitator of other desirable outcomes (Guay et al., 2004; Ivanova & Michaelides, 2022). Significantly, ASC and academic achievement are reciprocally related (the reciprocal effects model, REM; Guo et al., 2018; Marsh, 2023; Marsh & Craven, 2005, 2006; Marsh & Martin, 2011; Marsh, Pekrun et al., 2022), so that higher ASC facilitates higher academic achievement and vice versa. BFLPE studies center on the consequences of attending a high-achieving classroom or school for academic self-concept (Fig. 1). They show that students with similar academic achievement levels feel less competent in high-achieving classrooms than in average- or low-achieving ones (Marsh, 1987). The theoretical explanation for the BFLPE is based on social comparison theory (Festinger, 1954; Marsh, Xu et al., 2021), which emphasizes the need to consider relative frames of reference to understand how people perceive their competencies in specific domains (Marsh et al., 2014; Marsh, Pekrun et al., 2018). The negative BFLPE is conceptualized as the net effect of two processes (Marsh et al., 2000): a positive assimilation effect due to being affiliated with a prestigious institution or a highly selective educational program, and a negative contrast effect due to social comparisons with higher-achieving peers.

The BFLPE is one of psychology’s most cross-culturally universal phenomena, verified with three successive PISA data collections (Marsh & Hau, 2003; Marsh, Xu et al., 2021; Seaton et al., 2009). It generalizes across student groups, subject domains, ASC instruments, and cultures (Basarkod et al., 2023; Marsh et al., 2008; Marsh et al., 2015; Nagengast & Marsh, 2012), has been shown using standardized test scores and school grades (Marsh, Pekrun et al., 2022; see also Fleischmann et al., 2021), and holds when operationalized in terms of class rank instead of class-average achievement (Loyalka et al., 2018; Marsh et al., 2020). The present study aims to verify the hypothesis that BFLPEs remain negative and statistically significant after measurement error is controlled for (Dicke et al., 2018; Televantou et al., 2021).

Class-level Confounders of the “Pure” Compositional Effect

Investigating compositional effects in educational settings is closely linked to whether “Schools Matter” (Mortimore et al., 1988). If the effects are substantial, then researchers interpret this as an indication that students’ achievements are influenced by the interaction of students with each other in the school’s social context (Thrupp et al., 2002). In assigning such interpretations to compositional effect estimates, however, empirical researchers must be cautious; compositional effects may reflect something more than simply “peer contagion” (Dishion & Tipsord, 2011) — the influence that students may exert on each other. Differences in average achievement across classrooms or educational institutions are typically confounded with inequalities in instructional processes and allocation of resources (Thrupp, 1999). For example, teachers respond to the group of students they teach, and vice versa; high-achieving classes may attract teachers with more teaching experience and a higher level of pedagogical training (Fauth et al., 2021). Concerning these issues, Becker et al. (2021) distinguished between “general” and “pure” compositional effects, the latter being a more accurate approximation of the actual peer effect. “Pure” compositional effects, or peer effects, can be approximated by adjusting for the effects of extra-compositional variables that act as potential confounders of the effect of aggregated characteristics at the class or the school level (Dicke et al., 2018; Marsh, Pekrun et al., 2023; Televantou et al., 2015). In our study, following Zimmer and Toma (2000), we exploit the richness of SIMS80 data, controlling for a range of class-level extra-compositional variables (e.g., the teacher’s experience and pedagogical knowledge) in compositional analysis models. This way, we aim to approximate the “pure” compositional effect better.

The Present Study

The present study initially attempts to reproduce the peer spillover effect (Fig. 1) reported by Zimmer and Toma (2000), using a subset of the data used in their analysis (see the “Data Sample and Measures” section and Supplementary Materials). Our research hypothesis (Research Hypothesis 1/RH1) is that a positive class-level compositional effect of average achievement on subsequent mathematics achievement will be retrieved, in line with the original study. Second, we replicate the study of Zimmer and Toma using doubly latent models that correct for measurement error. We anticipate that peer spillover effects will become less positive or disappear once measurement error bias is adjusted for (Research Hypothesis 2/RH2; Lüdtke et al., 2008; Marsh et al., 2009; Televantou et al., 2015). Additionally, we test the peer spillover effect and the BFLPE simultaneously, integrating the two effects in one path analysis model (Fig. 2; Dicke et al., 2018; Televantou et al., 2021; Stäbler et al., 2017). Following our analysis of the integrated model, we quantify the bias in the compositional effect estimates, the peer spillover effect and the BFLPE, due to different forms of model misspecification (not controlling for measurement error, omission of student-level variables, and omission of class-level variables), each in isolation from the others. We expect the negative and statistically significant BFLPE to be robust to different model specifications (Research Hypothesis 3/RH3), in line with Dicke et al. (Dicke et al., 2018; Televantou et al., 2021). Peer spillover effects are expected to be estimated as more positive when measurement error is not adjusted for (Becker et al., 2021; Dicke et al., 2018; Harker & Tymms, 2004; Pokropek, 2015; Televantou et al., 2015). Moreover, they are expected to be sensitive to the set of background variables controlled for in the models (Research Hypothesis 4/RH4).
Finally, in an exploratory analysis, we model the mediation of the class-level effect of average mathematics achievement on students’ subsequent mathematics achievement through self-concept (Fig. 2; see dashed line). We focus on the significance and direction of the mediating effect of mathematics self-concept in the relationship between class-average achievement and subsequent mathematics achievement. Our rationale is that if a negative and statistically significant mediation is revealed, this would indicate that the BFLPE and the peer spillover effect might not operate independently.

Fig. 2

Conceptual model of the paths tested in the present study. L1-MAchX corresponds to individual mathematics achievement at pre-test, L2-MAchX to mathematics achievement aggregated at level 2, L1-MAchY to mathematics achievement at post-test, and L1-MScY to mathematics self-concept at post-test. Level 2 is the class. Dashed lines represent controls for student background characteristics and additional controls for class-level extra-compositional effects. The range of background variables selected at both levels was based on Zimmer and Toma (2000). *Path tested in an exploratory analysis assuming mediation of the compositional effect of class-average achievement on subsequent achievement through mathematics self-concept

Methodology

Data Sample and Measures

Descriptive measures of our data, a large sample of 13-year-old students from the USA, Canada (Ontario), and New Zealand, are given in Table 1 (see “A.1 Data Samples” in Supplementary Materials).

Table 1 Descriptive analysis for pre- and post-test measures of mathematics achievement and mathematics self-concept for the three countries in our sample, USA, Canada (Ontario), and New Zealand

Mathematics Achievement

Following Zimmer and Toma (2000), we based our mathematics achievement measures on forty items present in all four distinct rotated forms of the SIMS80 mathematics tests. The items were common to the pre- and post-measurement occasions; each item was scored 1 if answered correctly and 0 if answered incorrectly, and was treated as missing if no answer was given. The items were a mixture of arithmetic and word problems covering a range of mathematical topics (algebra, geometry). We first used multiple imputation to treat missingness at the item level (see the “Missing Data” section) and then used item parceling (Little et al., 2002, 2013, 2022) to form the multiple indicators for the pre- and post-test. For both measures, we created four 10-item parcels, averaging every 4th item available, allowing for more parsimonious statistical models (Marsh et al., 2013). Importantly for the purposes of our analysis, in which the original items were dichotomous, item parcels gave indicators with a distribution closer to normal, thereby facilitating normal theory-based estimation (Marsh et al., 2013). Manifest measures of achievement were based simply on each student’s average score. All variables at level 1 were standardized by subtracting the overall mean and dividing by the overall standard deviation, to have a mean of 0 and a standard deviation of 1.
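The parceling and standardization steps described above can be sketched as follows (with hypothetical simulated responses; the real analysis used the imputed SIMS80 item data): every 4th of the 40 items is averaged into one of four 10-item parcels, and each parcel is then standardized, as for all level 1 variables.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical: 0/1 responses of 1000 students to the 40 common items
# (after imputation, so no missing values remain).
items = rng.integers(0, 2, size=(1000, 40)).astype(float)

# Four 10-item parcels: parcel k averages items k, k+4, k+8, ..., k+36.
parcels = np.stack([items[:, k::4].mean(axis=1) for k in range(4)], axis=1)

# Standardize each parcel to mean 0, SD 1.
z = (parcels - parcels.mean(axis=0)) / parcels.std(axis=0)
print(parcels.shape)  # (1000, 4)
```

Averaging ten dichotomous items per parcel yields indicators that are far closer to normal than the raw 0/1 items, which is the motivation given in the text.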

Mathematics Self-concept

Mathematics self-concept was measured only at the post-test. Four items were used as indicators of students’ mathematics self-concept, namely, “I could never be a good mathematician,” “I am not so good at Mathematics,” “I cannot do well at Mathematics, no matter how hard I try,” and “Mathematics is harder for me than for most people.” The items were rated on a 5-point agreement Likert scale (1 = “Strongly Disagree” ... 5 = “Strongly Agree”) and, being negatively worded, were reverse-coded so that scores closest to 5 would reflect higher mathematics self-concept.
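Reverse coding on a 5-point scale maps each response r to 6 − r; a minimal sketch:

```python
# Reverse-code a negatively worded item on a 1..scale_max agreement scale,
# so that high scores reflect high self-concept: 1 <-> 5, 2 <-> 4, 3 stays 3.
def reverse_code(score, scale_max=5):
    return scale_max + 1 - score

print([reverse_code(s) for s in [1, 2, 3, 4, 5]])  # [5, 4, 3, 2, 1]
```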

Reliability Estimates

Reliability was relatively high for both the mathematics achievement and the self-concept measures: McDonald’s omega (ω) was estimated higher than .8 (see Table 1). Thus, measurement error in our data was not substantial.

Student Background Measures

Our models included student-level variables representing family and socio-economic characteristics (Table A.1, Supplementary Materials). We considered the same sets of student background variables as the reference study (Zimmer & Toma, 2000): father’s and mother’s occupation (un- or semiskilled, skilled worker, clerical or sales, professional), father’s and mother’s higher education attained (little or none, primary school, secondary school, beyond secondary), students’ gender (female, male), and the frequency with which the language of instruction is spoken at home (never, sometimes, usually, always).

Class-Level Background Measures

Class-level variables considered in our analysis were mainly relevant to the teacher’s characteristics (Table A.1, Supplementary Materials). More specifically, we considered the teacher’s gender (male/female), the number of years of teaching experience, the number of years teaching mathematics to students of this grade level, the number of courses in mathematics methods and pedagogy included in the teacher’s post-secondary education, and the number of courses in general methods and pedagogy included in the teacher’s post-secondary education. Finally, we controlled for the type of school (private or public) in which the class was situated, the kind of community served by the school (rural, suburban, urban/suburban, inner-city metropolis), and the total number of students enrolled in the school.

Statistical Analysis

Missing Data

The percentage of missing data for the different variables involved in our analysis was not substantial (Table A.1, Supplementary Materials). We used multiple imputation to treat missing data (Rubin, 1987; Schafer & Graham, 2002); imputation procedures were run in IBM SPSS Statistics (IBM SPSS Missing Values 21, 2012). The procedure involved replacing missing values with a list of five simulated values. Missing items in the mathematics self-concept scale were imputed in the same imputation model as the mathematics achievement items. Each plausible version of the complete data was analyzed using a complete-data method, and the results were combined to obtain overall estimates and standard errors in the statistical package Mplus 8 (Muthén & Muthén, 2017). Where relevant, class-level variables were computed based on the imputed data.
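The combination step follows Rubin's (1987) rules: the pooled estimate is the mean of the m per-imputation estimates, and the total variance combines the average within-imputation variance W with the between-imputation variance B as T = W + (1 + 1/m)B. A minimal sketch with hypothetical numbers (not our actual estimates):

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m imputed-data estimates and their variances via Rubin's rules."""
    m = len(estimates)
    q_bar = float(np.mean(estimates))        # pooled point estimate
    w = float(np.mean(variances))            # within-imputation variance W
    b = float(np.var(estimates, ddof=1))     # between-imputation variance B
    t = w + (1 + 1 / m) * b                  # total variance T
    return q_bar, t ** 0.5                   # pooled estimate and SE

# Hypothetical coefficients and sampling variances from m = 5 imputed data sets
est, se = pool_rubin([0.10, 0.12, 0.11, 0.09, 0.13], [0.004] * 5)
print(round(est, 3), round(se, 4))
```

The (1 + 1/m) factor inflates the between-imputation component to account for using a finite number of imputations.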

Statistical Models

We first replicated the original analysis by Zimmer and Toma, using a multiple regression linear-in-means model and specifying interaction terms of class-average achievement with the dummy variables for the three countries considered. In models controlling for measurement error, we used a multilevel latent variable analysis framework, and we specified multi-group doubly latent models with the grouping variable being the country. The models were applied in Mplus 8 (Muthén & Muthén, 2017). Before any analyses, we established the invariance of the factor structure of the latent factors (mathematics achievement, mathematics self-concept; Raykov et al., 2013) across countries to facilitate meaningful interpretation of observed differences (see Tables A.3; A.4 in Supplementary Materials).

Effect Size Measures

To facilitate comparisons of the effects estimated across different modeling approaches, and with previous research findings, effect sizes (ESs) were calculated according to the recommendations of Marsh et al. (2009) using the following formula:

$$ES_{\beta_{com}} = 2 \times \beta_{com} \times SD_{com} / \sigma_e$$
(1)

We used Eq. (1) to calculate the effect size of the effect of the aggregated variable, what we refer to as the compositional effect (\({ES}_{\beta_{com}}\)). The denominator, σe, is the level 1 residual standard deviation of students’ post-test mathematics score (or of students’ academic self-concept score in the case of the BFLPE). The unstandardized regression coefficient, βcom, is multiplied by the standard deviation of the aggregated predictor, SDcom. The resulting effect size is interpreted as the difference in the dependent variable between two classes that differ by two standard deviations on the predictor variable.
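As a worked illustration of Eq. (1), with hypothetical values rather than estimates from our models:

```python
def compositional_es(beta_com, sd_com, sigma_e):
    """Effect size of a compositional effect per Eq. (1) (Marsh et al., 2009)."""
    return 2 * beta_com * sd_com / sigma_e

# Hypothetical values: unstandardized coefficient -0.3, aggregated-predictor
# SD 0.5, level 1 residual SD 0.8; gives approximately -0.375.
print(compositional_es(-0.3, 0.5, 0.8))
```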

Model Fit

For assessing the fit of our models, we used sample-size-independent fit indices (Marsh et al., 2015): the Tucker–Lewis index (TLI) and the Comparative Fit Index (CFI), which vary along a 0–1 continuum, with values greater than .90 and .95 typically reflecting acceptable and excellent fit to the data, respectively. We also used the Root-Mean-Square Error of Approximation (RMSEA), with values of less than .05 and .08 reflecting a close fit and a minimally acceptable fit to the data, respectively.
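These conventional cutoffs can be encoded in a small helper (an illustrative sketch of the rules just stated, not part of the Mplus analysis):

```python
def fit_summary(cfi, tli, rmsea):
    """Classify model fit against the conventional cutoffs described above."""
    incremental = ("excellent" if min(cfi, tli) >= .95
                   else "acceptable" if min(cfi, tli) >= .90
                   else "poor")
    absolute = ("close" if rmsea < .05
                else "acceptable" if rmsea < .08
                else "poor")
    return incremental, absolute

# E.g., for CFI = .962, TLI = .957, RMSEA = .028:
print(fit_summary(.962, .957, .028))  # ('excellent', 'close')
```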

Results

Failing to Replicate Past Findings on the Existence of Peer Spillover Effects

As expected (RH1) and consistent with the original analysis by Zimmer and Toma (2000), applying a multiple regression linear-in-means model led to a positive and statistically significant compositional effect of class-average achievement. In addition, a non-statistically significant interaction term between country and class-average achievement was retrieved (see Table 2), suggesting a positive compositional effect for all three countries in our data, the USA, Canada (Ontario), and New Zealand. However, when we replicated the study of Zimmer and Toma using a multilevel latent variable framework, we failed to retrieve the positive peer spillover effect, in line with RH2. More specifically (Table 2), the positive class-level compositional effect of average mathematics achievement detected in the initial analysis was eliminated, becoming statistically non-significant for two of the three countries (USA, New Zealand).

Table 2 Replication of Zimmer and Toma (2000) study: Using a multiple regression model vs a multilevel latent variable model, and, SIMS80 data from the USA, Canada (Ontario), and New Zealand to estimate the class-level compositional effect of average mathematics achievement on students’ mathematics achievement at the fourth year

Modeling the BFLPE and the Peer Spillover Effect in an Integrated Model

The theoretical path models underlying the formation of the BFLPE involve a pre-test in individual achievement and the corresponding school- or class-average achievement as covariates at level 1 and level 2, respectively (Dicke et al., 2018; Stäbler et al., 2017). Mathematics self-concept at post-test is the outcome. In our analyses, we tested the peer spillover effect and the BFLPE simultaneously, integrating the two models in one path analysis model (Fig. 2). The estimates for the two effects from a fully specified doubly latent model, controlling for measurement error and all background variables, are given in Table 3 (Model: “doubly latent with covariates3”). The estimates for the peer spillover effect are essentially the same as those in Table 2, i.e., in the analysis where mathematics achievement is used as the only outcome in the model. The BFLPE was found to be negative and statistically significant for all three countries: USA (βcomp = −.281, SD = .091, ES = −.606), Canada (Ontario) (βcomp = −.496, SD = .053, ES = −.677), and New Zealand (βcomp = −.420, SD = .037, ES = −.878). Importantly, when equality constraints were imposed on the compositional effect estimates (restricted multi-group doubly latent model; equal compositional effects across countries), we found an overall negative and non-significant compositional effect of class-average achievement on subsequent mathematics achievement (βcomp = −.041, SD = .023, ES = −.152). The BFLPE detected with the restricted multi-group doubly latent model was βcomp = −.336 (SD = .023, ES = −.693). Model fit was not substantially affected by restricting the compositional effects to be equal across countries (unrestricted model: χ2 = 3242.622, d.f. = 846, RMSEA = .025, CFI = .970, TLI = .965; restricted model: χ2 = 3879.612, d.f. = 866, RMSEA = .028, CFI = .962, TLI = .957).

Table 3 Quantifying bias in the peer spillover effect and the BFLPE due to different forms of model misspecification

Impact of Model Misspecification on Peer Spillover Effect and BFLPE Estimates

Another aim of our study was to quantify the amount of bias in the compositional effect estimates that could be attributable to different forms of model misspecification (failure to control for measurement error, not controlling for appropriate student- and class-level background variables). Relevant findings are displayed in Table 3: The BFLPE remained negative and statistically significant despite the potential bias in statistical estimates, in line with RH3. In contrast, the peer spillover effect estimate was highly vulnerable to the different model specifications, consistent with RH4.

More specifically, failing to control for measurement error led to artefactual peer spillover effects. Controlling for measurement error alone only partially corrected the positive bias in the peer spillover effect; a positive effect was still detected with the Canada (Ontario) and USA data. Peer spillover effects disappeared altogether once additional controls for student background measures were added.
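The mechanism behind such phantom effects can be illustrated with a small simulation of our own (a sketch under simplified assumptions, not part of the SIMS80 analyses): when the individual pretest contains measurement error, its class mean is a more reliable proxy for true achievement, so it absorbs residual true-score variance and yields a spurious positive “compositional” coefficient even when the true compositional effect is exactly zero.

```python
import numpy as np

rng = np.random.default_rng(42)
n_groups, n_per = 200, 20
group = np.repeat(np.arange(n_groups), n_per)

# True achievement: class means mu_j plus within-class deviations
mu = np.repeat(rng.normal(0, 1, n_groups), n_per)
T = mu + rng.normal(0, 1, mu.size)

# Outcome depends ONLY on individual true achievement: zero compositional effect
Y = 0.6 * T + rng.normal(0, 1, T.size)

# Observed pretest contaminated with measurement error (reliability = 2/3)
X = T + rng.normal(0, 1, T.size)

def contextual_coef(pretest):
    """OLS of Y on the pretest and its class mean; return the class-mean coefficient."""
    class_mean = np.repeat(np.bincount(group, weights=pretest) / n_per, n_per)
    design = np.column_stack([np.ones_like(pretest), pretest, class_mean])
    beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return beta[2]

phantom = contextual_coef(X)  # spurious positive coefficient despite zero true effect
clean = contextual_coef(T)    # near zero once the error-free pretest is used
print(f"phantom: {phantom:.3f}, with true scores: {clean:.3f}")
```

With balanced classes, the coefficient on the class mean equals the between-class minus the within-class slope; error attenuates the within-class slope but barely touches the between-class one, which is what manufactures the phantom effect.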

Both facets of under-specification at the student level (see the “Phantom Peer Spillover Effects” section) led to BFLPEs being estimated smaller in magnitude (i.e., less negative); the effects remained statistically significant throughout.

Failing to include class-level background measures in our models did not substantially affect our conclusions regarding the magnitude and direction of peer spillover effects and BFLPEs. However, the peer spillover effect was revealed for Canada (Ontario) once controls for class-level background variables were made in the analyses.

Modeling Mathematics Self-concept as a Mediator

In a supplementary, exploratory study, we modeled the mediation of the class-level compositional effect on subsequent mathematics achievement through mathematics self-concept. We found a small, statistically significant negative mediation effect for all three countries (Table 4). When equality restrictions on the compositional and mediation effects were imposed across the three countries, the mediation effect was estimated as negative and statistically significant (βmed = −.052, SD = .005, p < .001). Class-average achievement had a negative effect on mathematics self-concept (BFLPE; Table 4). The results suggest that at least part of the effect of class-average achievement is mediated via mathematics self-concept, and that this mediated effect is negative.
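The sign of the mediated effect follows the standard product-of-coefficients logic: the indirect effect of class-average achievement on subsequent achievement via self-concept is the product of the negative BFLPE path (a) and the positive self-concept-to-achievement path (b). A minimal sketch, in which the value of b is a hypothetical placeholder (not an estimate from our models) and a is taken from the restricted-model BFLPE:

```python
# Product-of-coefficients sketch of the mediation logic.
# 'a' is the BFLPE path from the restricted model; 'b' is a HYPOTHETICAL positive
# self-concept -> achievement path, used only to illustrate the sign logic.
a = -0.336
b = 0.15
indirect = a * b  # mediated compositional effect: negative whenever a < 0 < b
print(f"indirect effect = {indirect:.3f}")
```

Any positive b combined with a negative a yields a negative indirect effect, which is the pattern reported in Table 4.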

Table 4 Estimates of the direct, indirect, and total peer spillover effect, and for the BFLPE

Discussion

The impact of school- or class-average achievement on students’ outcomes has received growing attention among researchers (Becker et al., 2021; Yang Hansen et al., 2022). Despite years of accumulated research, considerable confusion remains about how past analyses should be interpreted. Many reported relationships between students’ achievement and the characteristics of their class or school peers are limited by conceptual and analytical shortcomings (Dicke et al., 2018; Harker & Tymms, 2004; Thrupp et al., 2002).

Our study, based on a large longitudinal sample of eighth-grade students from the USA, Canada, and New Zealand participating in SIMS80, represents yet another scholarly attempt to estimate the class-level compositional effects of achievement on students’ mathematics achievement (the peer spillover effect) and students’ mathematics self-concept (the Big-Fish-Little-Pond Effect; BFLPE) using multilevel doubly latent models (Becker et al., 2021; Dicke et al., 2018; Televantou et al., 2021). Zimmer and Toma (2000) used multiple regression linear-in-means models, the state-of-the-art approach at the time their study was conducted. They reported a positive effect of class-average achievement that was statistically significant and robust across the countries in their sample; their findings are not replicated when we apply doubly latent models.

Our findings should not be the sole basis for informing current issues in educational policy and practice, as the data are dated and generalizability across countries must be considered. However, they call into question the empirical results of past and current research that evaluates compositional effects using sub-optimal models that fail to control for measurement error. Importantly, they demonstrate the robustness of the BFLPE across modeling specifications and datasets.

The present investigation represents the first cross-national study to investigate the BFLPE and the peer spillover effect simultaneously, controlling for true measures of prior achievement. Students’ prior achievement has been shown to explain up to 50% of the differences in their subsequent educational outcomes (Colom & Flores-Mendoza, 2007); failing to make such adjustments may lead to overestimating the peer spillover effect (Harker & Tymms, 2004; Wagner, 2022). Further, our study evaluates compositional effects at the classroom level and uses data on students in their eighth year of schooling. This differs from the focus of existing studies (Dicke et al., 2018; Televantou et al., 2021) that used doubly latent models to evaluate compositional effects on students’ mathematics achievement and mathematics self-concept; their interest was at the level of the school, and they used data from younger students. In general, compositional effects in the educational context are more prominent for the composition of a class than for that of a school, as the class is the immediate learning environment to which students belong (Marsh et al., 2014). Thus, it is vital to show that positive peer spillover effects can be artefactual even at the classroom level, where they would have been expected to be larger, i.e., more positive.

We distinguish pure compositional effects, associated with the achievement levels of peers, from the effects of class-level variables that are likely to be confounded with class-average achievement (e.g., teachers’ qualifications). The basic idea (Becker et al., 2021; Hanushek et al., 2003) is that classroom-level fixed effects remove selection effects and allow the researcher to identify peer effects from idiosyncratic variation in peer ability. Becker et al. (2021) made similar arguments and controlled for the effect of tracking in models estimating class-level compositional effects of achievement. In our study, we adjust for a broader range of class-level characteristics. No substantial differences were observed in the compositional effect estimates after such adjustments. Interpreting this finding requires further consideration of the educational systems of the countries in our sample at the time the data were gathered; however, this is beyond the scope of our study.

Demonstrating Robustness of the BFLPE

We demonstrate BFLPEs with mathematics self-concept data collected over 40 years ago, in the early 1980s. BFLPEs are evident in all three educational systems involved: the USA, Canada (Ontario), and New Zealand. Specifically, for Canada (Ontario), we observe a negative effect of class-average achievement on students’ self-concept, despite the apparent weak positive effect of the same compositional variable on students’ subsequent mathematics achievement. Adjusting for measurement error and omitted variable bias leads to an even more negative BFLPE, consistent with the findings reported by Dicke et al. (2018) and Televantou et al. (2021).

It is essential to understand why controlling for measurement error and covariates strengthens the BFLPE but weakens or eliminates the peer spillover effect. The explanation is that failure to control for measurement error and confounding variables is likely to produce a positive bias in the estimated effect of class-average achievement. For self-concept outcomes, this bias works against the negative BFLPE, so that introducing the controls makes the BFLPE stronger and more negative; in this sense, the BFLPE is robust to the bias. For achievement outcomes, in contrast, the same positive bias runs in the same direction as the predicted positive peer spillover effect, inflating it. These contrasting implications are apparent in our analysis. Hence, claims of positive peer spillover effects made without these controls must be viewed cautiously. Furthermore, because there will always be unmeasured confounding variables likely to be positively related to class-average achievement, it might only be possible to resolve this problem partially. Still, our findings revealed a robustly negative BFLPE across all modeling specifications, supporting the characterization of the effect as a “pan-human universal” phenomenon (Marsh et al., 2020; Marsh, Pekrun, et al., 2018; Marsh, Xu, et al., 2021).
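This asymmetry can be made concrete with a small simulation of our own (again a sketch under simplified assumptions, not the SIMS80 models): with a genuinely negative compositional path built into simulated self-concept scores, pretest measurement error biases the uncorrected contextual estimate upward, so the BFLPE appears weaker than it is; the same upward bias would add to, rather than subtract from, a positive spillover estimate.

```python
import numpy as np

rng = np.random.default_rng(7)
n_groups, n_per = 200, 20
group = np.repeat(np.arange(n_groups), n_per)

mu = np.repeat(rng.normal(0, 1, n_groups), n_per)  # class-average true achievement
T = mu + rng.normal(0, 1, mu.size)                  # individual true achievement
X = T + rng.normal(0, 1, T.size)                    # error-prone observed pretest

# Self-concept: positive individual path, genuinely negative BFLPE path (-0.3)
SC = 0.4 * T - 0.3 * mu + rng.normal(0, 1, T.size)

def contextual_coef(pretest, outcome):
    """OLS of the outcome on the pretest and its class mean; return class-mean coefficient."""
    class_mean = np.repeat(np.bincount(group, weights=pretest) / n_per, n_per)
    design = np.column_stack([np.ones_like(pretest), pretest, class_mean])
    beta, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return beta[2]

biased = contextual_coef(X, SC)     # attenuated toward zero by measurement error
corrected = contextual_coef(T, SC)  # close to the simulated BFLPE path of -0.3
print(f"biased BFLPE: {biased:.3f}, error-free BFLPE: {corrected:.3f}")
```

In this toy setup, correcting the pretest moves the contextual estimate downward, mirroring how our doubly latent controls made the observed BFLPE more negative.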

Replicating Findings of Previous Studies

Replications, defined as intentional attempts to repeat previous research in order to verify or refute previously reported findings (Plucker & Makel, 2021), are important for developing a “cumulative knowledge base” (Peterson & Schreiber, 2012, p. 287). Theoretical conclusions are stronger when they are based on this accumulated knowledge, and they can then make a valuable and meaningful contribution to the development of educational policy and practice. Following the methodological approach of Dicke et al. (2018) and Televantou et al. (2021), we consider the impact of failing to account for measurement error on compositional effect estimates. Dicke et al. found that under-specified models yield more positive peer spillover effects and less negative BFLPEs. We replicate their findings, enhancing the external validity of their claims. With doubly latent models and adjustments for student background, the initially detected peer spillover effect disappears altogether for the USA and New Zealand data. For Canada (Ontario), the effect becomes substantially smaller, but it remains positive and statistically significant. In a supplementary study, we verified this finding using three different sets of student background variables to correct for selection bias at the student level (see Supplementary Materials, Table A.6). Whether and how selection bias can be sufficiently addressed in observational studies has been highly debated (Reardon & Owens, 2014), since no study can effectively control for the infinite number of potential confounders at the student level. By evaluating models based on different sets of student-level covariates and demonstrating the same trend in the findings, we address the relevant concerns (Dicke et al., 2018).

Resolving a Theoretical Paradox

The divergence in conclusions regarding the magnitude of the peer spillover effect, even after controls for measurement and omitted variable bias, echoes previous studies that are likewise contradictory, with some supporting the existence of positive achievement compositional effects and some rejecting it (Becker et al., 2021; Sacerdote, 2011). A supplementary analysis to our main study (“Modeling Mathematics Self-concept as a Mediator”) demonstrates how class-average achievement can have different effects on students’ mathematics self-concept and achievement, even though mathematics self-concept and achievement are reciprocally related (see the claims of Becker et al., 2021). It suggests that part of the effect of class-average achievement is mediated via mathematics self-concept, and that this mediated effect is negative (Marsh, 2023; Marsh, Pekrun et al., 2023). Thus, another explanation of the theoretical paradox initially identified and partially explained by Dicke et al. (2018) might be derived: the BFLPE could be one mechanism driving negative compositional effects on achievement; however, other factors at the level of the classroom or the school (e.g., instructional practices) may also operate such that peer spillover effects are eventually manifested. Future research could address this hypothesis.

Limitations and Directions for Future Research

Our study investigates the impact of failing to account for measurement error in compositional analyses of SIMS80 data by applying the doubly latent approach. Intact classes were used in the sampling of students; hence, we were concerned about whether overcorrecting for sampling error affected our estimates (see Marsh et al., 2012, for a relevant discussion). However, no substantial differences were observed when juxtaposing our estimates with those obtained from models assuming manifest aggregation (the latent-manifest approach; Lüdtke et al., 2008; see Supplementary Materials, Table A.5).

We also note that the imputation model we used for missing data, although it included all the multilevel covariates used in our analysis, does not mimic the analytical model: implementing multilevel imputation would have been ideal for our study. However, we faced serious convergence issues when we attempted to do so.

In testing mediation, mathematics self-concept and achievement were measured at the end of the school year (see Fig. 2; dashed line), since prior measures of mathematics self-concept were not available in SIMS80. Thus, we base our mediational analysis on cross-sectional data, with both the mediator and the predictor measured at the end-of-year exam (for problems with mediation based on cross-sectional data, see Maxwell et al., 2011; O'Laughlin et al., 2018). While the existing literature suggests that self-concept and individual achievement are reciprocally related (REM; Marsh & Craven, 2005; Marsh, 2023; Marsh, Pekrun et al., 2022), we only model the skill development process (i.e., prior achievement leads to subsequent academic self-concept). In this respect, our mediation analysis is more like a “what if” exploratory study of an interesting question, and our findings are only tentative, leading to a hypothesis that can be tested further when appropriate data are available (e.g., a study with three or more waves of data). We note that the apparent lag-0 effect of academic self-concept on academic achievement might merely reflect lag-1 effects (effects of academic self-concept at a previous time point) not included in the model (Marsh, Pekrun et al., 2022). A promising direction for future research would be the application of models with both reciprocal and contemporaneous effects between achievement and self-concept (Muthen & Asparouhov, 2023).

The strength of our analysis from a substantive point of view is that it shows a negative and statistically significant BFLPE, controlling for true measures of prior achievement, that persists and becomes even more prominent after adjustments for measurement error. There is now a vast literature in support of the BFLPE. Although most of this research is based on cross-sectional data, several studies have also evaluated it longitudinally. Interestingly, the results based on cross-sectional (e.g., Nagengast & Marsh, 2012; ES = −.286) and longitudinal (e.g., Dicke et al., 2018; ES = −.36) analyses do not differ substantially in the size of the effect. Our study replicates and extends these findings. What is interesting in putting the peer spillover effect and the BFLPE together in the same model is how the application of progressively stronger models results in systematically weaker (less positive, null, or even negative) peer spillover effects and systematically stronger (more negative) BFLPEs. Importantly, the direction of the changes in both effects is consistent with a priori predictions.

Our estimates of compositional effects do not, however, represent an unambiguous causal effect; our interpretation would be strengthened by juxtaposing our results with potentially stronger designs such as regression discontinuity, propensity score matching (Randolph & Falbe, 2014), instrumental variables (e.g., Aral & Nicolaides, 2017), or a true experiment with random assignment (Paloyo, 2020). Existing studies have also pointed to the potential of the social networks literature (e.g., Froehlich et al., 2020) to inform research on peer effects (Paloyo, 2020). For example, Koivuhovi et al. (2022) found that peer-group-average achievement had no effect on academic self-concept beyond the negative (BFLPE) effect of class-average achievement. All of these are interesting avenues for future studies to explore. Nevertheless, current research, including the present investigation, suggests that peer spillover effects are substantially smaller when appropriate adjustments are made, and may even disappear altogether.