1 Introduction

Most people who have heard of the Italian mathematician Giuseppe Peano credit him with inventing the standard axioms of arithmetic. This is all they associate with the name “Peano”. Yet, the axioms were invented by the German mathematician Richard Dedekind, and Peano published a simplified version only afterwards. If people identify Peano only by the description “the inventor of the standard axioms of arithmetic”, to whom are they referring when they use the name “Peano”? To Peano or to Dedekind? And more generally, what kind of meaning must people associate with a proper name like “Peano” in order to be competent users of that name?

In philosophy of language, there are two main classes of theories about the meaning of proper names: descriptivist theories and causal-historical theories. According to descriptivist theories (Frege 1892; Russell 1905; Searle 1958), proper names have definite descriptions as their meaning. The idea is that a proper name can refer to a person only via the descriptive properties that users of the name associate with it. Thus, people who identify Peano only by the description “the inventor of the standard axioms of arithmetic” would actually refer to Dedekind when they use the name “Peano”. After all, it is Dedekind who uniquely satisfies that description.

According to causal-historical theories (Kripke 1980), proper names do not imply any descriptive property of the individuals to which they refer. Proper names refer directly to their bearers without being essentially associated with any descriptive properties of an individual. People processing a proper name like “Peano” certainly rely on some mental representations of certain descriptive properties, but these representations play no role in determining the meaning of “Peano”. Instead, what is crucial for determining the meaning of a proper name is its causal history. All uses of the name that causally derive from an original act of baptism refer to the individual originally baptized with that name. Even if people falsely associate the description “the inventor of the standard axioms of arithmetic” with Peano, the proper name “Peano” actually refers to Peano. This is because of a relevant causal chain between the original act of baptism for Giuseppe Peano, and people’s usage of the name “Peano”.

One influential argument for why proper names cannot be semantically equivalent to definite descriptions is that the referent of a proper name ‘is stipulated to be a single object whether we are speaking of the actual world or a counterfactual situation’ (Kripke 1980, p. 21). In contrast, definite descriptions can refer to different individuals in counterfactual situations.

Kripke illustrates this argument with an example analogous to the Peano case above (Kripke 1980, pp. 83 ff.). Suppose that the only description you associate with the name “Gödel” is “the mathematician who proved the incompleteness of arithmetic”. Suppose that you discover that a certain Mr Schmidt rather than Gödel actually proved the incompleteness of arithmetic. If the name “Gödel” is semantically equivalent to the definite description “the mathematician who proved the incompleteness of arithmetic”, then you are committed to the conclusion that the name “Gödel” actually refers to Mr Schmidt. But, of course, this conclusion is false, which suggests that descriptivist theories of proper names are false.

In a landmark study, Machery et al. (2004) used this Gödel case, and three other similar vignettes, to explore the question of whether Kripke’s and other Anglophone speakers’ judgments about the referents of proper names are cross-culturally robust. Machery et al. (2004) found that Westerners tend to have intuitions in line with Kripke’s causal-historical theory, while East Asians’ intuitions tend to agree with the descriptivist theory. On the basis of this finding, Machery et al. (2004) reasoned that if people’s judgments about the meaning of a proper name are systematically influenced by demographic variables like their culture, then Kripke’s and other Anglophone speakers’ semantic judgments cannot constitute reliable evidence for a theory of proper names.

Several follow-up studies have extended and probed the original result, and embedded it into a broader methodological discourse about the use of intuitions about particular cases as evidence for philosophical theories (e.g., Machery et al. 2009; Lam 2010; Sytsma and Livengood 2011; Machery 2017; Cova et al. 2018). The effect reported by Machery et al. (2004) has been found to show significant variation both within and across studies, but no convincing explanation has been offered for when and why we should expect cross-cultural variation in semantic intuitions. It might be that scenarios like the Peano or Gödel cases above are not the right tools for eliciting people’s semantic intuitions. Or it might be that, in making semantic judgments, people have access to multiple cognitive strategies, which include both causal-historical and descriptivist factors, and shift between them depending on how easily they can take the perspective of the speaker, and on background knowledge, audience, and purpose of communication (Genone and Lombrozo 2012).

Philosophers are not the only ones interested in proper names. Proper names have also been studied by several other disciplines, including linguistics, psychology, neuroscience, and anthropology.

Linguists agree that proper names are mainly used to identify individuals uniquely; but the referential and connotative use of a proper name depends on the communicative intentions the speaker wants to convey to the hearer, the knowledge frame shared by the interlocutors, and their discourse-relative perspective (Dancygier 2009; Dancygier and Vandelanotte 2017; Marmaridou 2000).

Cognitive psychologists focus on the mental representations of the meanings of proper names, and in particular on differences between the cognitive processing of proper names and common nouns (Sophia and Marmaridou 1989; Valentine et al. 1996). Much of this research provides evidence that proper names and common nouns are associated with different mental representations, and that they are processed differently. It has been found that, generally, people from different cultures judge more quickly if a word is a proper name as opposed to a common noun (Müller 2010), and that they retrieve the meaning of a proper name from memory with more difficulty compared to common nouns (Proverbio et al. 2009; Wang et al. 2016). Additional evidence for these processing differences between proper names and common nouns comes from neuropsychological studies, which have shown a double dissociation between retrieval of proper names and retrieval of common nouns (Yen 2006; Semenza 2006).

Finally, for anthropologists, the meaning of proper names depends on principles governing naming practices across cultures (Vom Bruck and Bodenhorn 2006; Bright 2003). Proper names have been found to fulfil two main cultural functions: to identify their bearers, differentiating them from other individuals, and to classify them in terms of their parental, economic, ethnic or geographical group. These two functions can trade off, since the more a proper name differentiates its bearer from other individuals, the less social information it carries. Focusing on these trade-offs, Alford (1988) examined the use of proper names in sixty cultures around the world to better understand the social and communicative situations where one function is more prevalent than the other.

Given this rich and varied literature, it is noteworthy that the philosophical literature studying cross-cultural variation of proper names has overlooked the connection to studies in other disciplines about how people use, memorize, recall, and mentally represent the meaning of proper names (Footnote 1). At the very least, studies from other disciplines, which employ various methods to study proper names, would provide experimental philosophers with independent evidence to probe the robustness of putative cross-cultural effects.

Locating experimental philosophers’ work in the wider scientific context highlights the general importance of the questions we set out to address with our meta-analysis in this paper. For example, to what extent are the demographic effects observed by experimental philosophers large and interpretable? Do they depend on superficial features of the experimental material and design experimental philosophers have been using? Or do they indicate variation in the mechanisms and representations, which different communities of speakers would recruit to make sense of proper names? Answering these questions is not only important for philosophers; it will also clarify whether researchers from different disciplines are warranted to rely on the demographic effect Machery et al. (2004) originally found, and on simple vignettes like the Gödel case for studying proper names.

A relatively large and interpretable meta-analytic effect will be a convincing reason to think that the cross-cultural effect is not an artefact of particular vignettes. The finding that a descriptivist theory of proper names predicts East Asians’ intuitions, while a causal-historical theory predicts Westerners’ intuitions, will motivate new hypotheses that cognitive psychologists and neuropsychologists could test. For example, as noted above, psychologists and neuropsychologists have proposed that the retrieval of proper names is particularly difficult compared to common nouns. One reason for this hypothesis is that proper names would be detached from the semantic network representing descriptive or biographical information (e.g., Semenza 2006). But, if East Asian speakers have descriptivist intuitions, then their retrieval of proper names should be easier compared to the retrieval of proper names exhibited by Western speakers, whose mental representations of proper names would be detached from descriptive information.

A small, noisy, and hard-to-interpret meta-analytic effect will give us reason to either call into question the idea that demographic factors play a substantial role in semantic intuitions, or doubt that the vignettes philosophers have been using to elicit such intuitions are reliable tools. Either way, this finding should motivate experimental philosophers interested in the semantics of proper names to pay closer attention to relevant anthropological evidence about different functions of proper names, as well as to the experimental designs cognitive psychologists and linguists have been using to study proper names.

So, central to our meta-analysis are the questions of whether Westerners’ and East Asians’ intuitions about the reference of person names are in line with either the causal-historical or the descriptivist theory (Hypotheses 1a and 2a), and how the two populations compare in this respect (Hypothesis 3a) (Footnote 2). We evaluate these hypotheses on the basis of a set of empirical studies, which rely on probes similar to the Gödel and Peano cases we have described above. To answer these questions, we test three hypotheses:

Hypothesis 1a:

The majority of Westerners have causal-historical intuitions.

Hypothesis 2a:

The majority of East Asians have descriptivist intuitions.

Hypothesis 3a:

Westerners are more likely than East Asians to have causal-historical intuitions.

There are two types of vignettes in the literature we aggregate: first, the Gödel probe, which is structurally identical to our introductory Peano/Dedekind example, and second, the Jonah probe, where the properties associated with a proper name gradually shift over time. In the Gödel probe both the causal-historical and the descriptivist theory single out a specific referent of the proper name, while in the Jonah probe the proper name does not refer at all according to the descriptivist account. Since both probes follow the same experimental design (exposure to vignette, binary response, two theories with different predictions, etc.), the results can be aggregated to test Hypotheses 1a–3a. Moreover, since the Gödel probe has particular significance in the philosophical literature (going back to Kripke 1980), and since the original study found a notable effect of cross-cultural variation only for Gödel probes, but not for Jonah probes (Machery et al. 2004), we also test the following three hypotheses:

Hypothesis 1b:

The majority of Westerners have causal-historical intuitions in the Gödel probes.

Hypothesis 2b:

The majority of East Asians have descriptivist intuitions in Gödel probes.

Hypothesis 3b:

Westerners are more likely than East Asians to have causal-historical intuitions in Gödel probes.

One final note before we describe our meta-analysis in detail. Different studies in the literature originating from the work of Machery et al. (2004) differ in their samples of Westerners and East Asians. Machery et al. (2004) tested participants in the United States and Hong Kong. Follow-up studies used, for example, Dutch participants (Cova et al. 2018) or French participants (Machery et al. 2009). We note these differences in detail in Section 3.1. Nonetheless, we formulated the hypotheses in a general form as a comparison between Westerners and East Asians for two reasons. First, we intended to study the effect as reported in Machery et al. (2004) and hence we followed their design. Second, we did not have theoretical reasons to exclude, e.g., Dutch or French participants from the group of Westerners or, e.g., Japanese participants from the group of East Asians.

2 Material and Methods

2.1 Data Sources and Searches

We conducted a comprehensive literature search of studies on cross-cultural semantic intuitions. To find replications of the original experiment, we started with the Google Scholar list of all papers citing Machery et al. (2004) and checked whether they contained experimental data. Our search was aided by a list of known replications that we obtained via e-mail from Édouard Machery (the first author of the original study). We used this search strategy because any replication of Machery et al. (2004) would, in virtue of its aims and scope, refer to the original paper.

2.2 Study Selection

Studies were eligible if they were published or publicly available in English, and used a design sufficiently similar to the original study. Specifically, eligible studies had to contain results from experiments featuring East Asian and/or Western participants with one or more binary-choice probes. The two answer options of such probes had to correspond to the descriptivist and the causal-historical theory of reference, respectively.

2.3 Data Extraction

The eligible studies were classified by two teams, each of which included two authors of this paper. For each study, team members independently extracted the name of the first author, the year of publication, and the data of the probe responses. Per study and probe, they independently extracted data on the type of the probe (e.g., Gödel, Jonah, etc.), the number of Western and East Asian participants, total sample size, the number of causal-historical responses (per subgroup and in total), and deviations from the original design, such as language of the probe, phrasing of the question, and phrasing of the answers. Disagreements were resolved by consensus (Footnote 3).

2.4 Data Synthesis and Analysis

We carried out a confirmatory analysis (Section 3.2) and an exploratory analysis (Section 3.3), as well as a quality appraisal (Section 3.4 and Appendix C). The confirmatory analysis assessed the overall meta-analytic evidence and its generalisability for the main hypotheses tested in Machery et al. (2004) (Hypotheses 1a–3a) and the meta-analytic evidence for the particular Gödel probes (Hypotheses 1b–3b) for which Machery et al. (2004) reported positive results. Specifically, for Hypotheses 1b–3b, only Gödel probes that did not deviate from the design of the original study (i.e., direct replications) were used. The analyses were conducted on the level of the individual probes within a study, because not all studies had the same (number of) probes. We used a multilevel random effects (RE) model to capture the hierarchical structure (i.e., probes within studies) of the data and the inter-study dependencies between probes via the ‘metafor’ package in R (Viechtbauer 2010).

In total, we performed six confirmatory meta-analyses, one per hypothesis. Specifically, we calculated summary proportion estimates of causal-historical responses for single cultural groups (Hypotheses 1a/b, 2a/b), using a restricted maximum-likelihood heterogeneity estimator (Footnote 4) to model the random effects. For the probes that compare Westerners to East Asians (Hypotheses 3a and 3b), we calculated summary relative risk ratios (RR), that is, the quotient of the proportions of causal-historical responses in the two groups: if $A$ and $B$ are the two groups and $p_A$ and $p_B$ are the proportions of event $E$ in these groups (in our case, causal-historical responses), then the relative risk ratio is defined as $\mathrm{RR}_E(A,B) = p_A / p_B$ (Footnote 5). The extent of heterogeneity between probes was assessed by the $I^2$ measure, which indicates the percentage of total variance due to between-probe variance (Higgins et al. 2003; Ioannidis et al. 2007). In addition to the 95% confidence intervals (CIs) for the unknown parameters (i.e., the proportion of causal-historical responses and the RR between populations), we calculated the 95% prediction intervals (PIs). Unlike CIs, which are based on compatibility of the observed data with an unknown parameter value, PIs predict the distribution of future data points by taking into account inter-study variability. Therefore, they are better suited to express a plausible range of values for the next conducted study, and to predict whether effects are likely to replicate (Higgins et al. 2009; Riley et al. 2011).
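The quantities defined above can be illustrated with a minimal sketch. It computes the log relative risk per probe, pools the probes with a random-effects model, and reports the heterogeneity measure. Several things here are assumptions made for illustration only: the counts are hypothetical, the function names are ours, the data structure is flat rather than multilevel, and the simpler DerSimonian-Laird heterogeneity estimator is swapped in for the restricted maximum-likelihood estimator used in the analysis.

```python
import math

def log_rr(events_a, n_a, events_b, n_b):
    """Log relative risk of group A versus group B and its sampling variance."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    estimate = math.log(p_a / p_b)
    # Large-sample variance of the log relative risk
    variance = 1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b
    return estimate, variance

def pool_random_effects(effects):
    """Pool (estimate, variance) pairs; DerSimonian-Laird stands in for REML."""
    k = len(effects)
    weights = [1 / v for _, v in effects]
    fixed = sum(w * y for w, (y, _) in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the fixed-effect mean
    q = sum(w * (y - fixed) ** 2 for w, (y, _) in zip(weights, effects))
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - (k - 1)) / c)  # between-probe variance
    # I^2: percentage of total variance attributed to between-probe variance
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    re_weights = [1 / (v + tau2) for _, v in effects]
    mu = sum(w * y for w, (y, _) in zip(re_weights, effects)) / sum(re_weights)
    se = math.sqrt(1 / sum(re_weights))
    return mu, se, tau2, i2

# Hypothetical counts per probe (not taken from the included studies):
# (causal-historical Westerners, n Westerners,
#  causal-historical East Asians, n East Asians)
probes = [(30, 50, 20, 50), (45, 80, 35, 80), (28, 40, 22, 40)]
effects = [log_rr(*probe) for probe in probes]
mu, se, tau2, i2 = pool_random_effects(effects)

summary_rr = math.exp(mu)
ci = (math.exp(mu - 1.96 * se), math.exp(mu + 1.96 * se))
print(f"summary RR = {summary_rr:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
print(f"tau^2 = {tau2:.4f}, I^2 = {i2:.1f}%")
```

Pooling is done on the log scale and back-transformed at the end, so a summary RR of 1 corresponds to no difference between the groups.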

The exploratory analysis focused on the observed variance in effect size between the various probes; that is, it aimed to identify factors that would explain why studies often deliver heterogeneous results. Specifically, we repeated the tests performed for Hypotheses 1a–3a with a meta-regression analysis where deviations from the original design were included as moderator variables. We ignored the hierarchical structure of the data because we were interested in explaining variance by the general design factors (Footnote 6).

3 Results

3.1 Studies and Data

Via Google Scholar, we identified 482 records published in 2017 or earlier that cite Machery et al. (2004). Using the search criteria described in Section 2.1, we ended up with 15 potential studies. Four studies were added to our initial search results, resulting in a total of 19 studies (Footnote 7).

Eventually, from this set of 19 studies, 13 were included and 6 were excluded. Reasons for exclusion were the following: [1] missing data, [2] culturally mixed samples, and [3] structure of the data (i.e., data from non-binary questions, compare Section 2.2). See Appendix A for the list of excluded papers and their reason for exclusion. In addition, the Indian sample from ‘Machery, 2009’ was excluded, because it is considered a South Asian rather than an East Asian sample. Combined, these studies tested 61 probes on 4691 participants, who produced a total of 8959 binary responses. Of these probes, 35 tested both Western and East Asian samples [median sample size: 181]; 15 probes tested Western samples [median sample size: 60]; and 11 probes tested East Asian samples [median sample size: 211]. Table 1 provides an overview of the number of probes per study and their characteristics. Most Western samples were from the United States of America. Exceptions were ‘Machery, 2009’ (France), ‘Cova, 2019’ (The Netherlands), and ‘Colombo, n.p.’ (The Netherlands, USA, England, Germany, and Italy). In general, the East Asian samples were from Hong Kong. Exceptions were ‘Machery, 2009’ (Mongolia), ‘Sytsma, 2015’ (Japan), ‘Kazaki, 2017’ (Japan), ‘Izumi, 2018’ (Japan), and ‘Colombo, n.p.’ (Hong Kong and China). We identified ten factors on which probes could deviate from the original design (see Appendix B), from the language in which probes were presented to the phrasing of the question that participants were asked (Footnote 8).

Table 1 Overview of the Studies and Their Characteristics

In particular, the studies in our dataset did not always use the same phrasing for eliciting a judgment on causal-historical versus descriptivist intuitions. Most studies followed the original design by Machery et al. (2004) and asked participants the following question: Who is the person that the vignette character John is “talking about” when using the proper name “Peano”? By contrast, Sytsma and Livengood (2011) suggest rephrasing the question in the following way: Who would you take is the person that John is talking about when using the proper name “Peano”? (this modification is known as the “clarified narrator’s perspective”, see also Section 3.3), and Machery et al. (2009) asked for the truth value of sentences such as “Peano discovered the incompleteness of mathematics”. Similarly, answers to the binary question were sometimes phrased as descriptions (e.g., in the original study) and sometimes as bare noun phrases without further explanations (e.g., in Lam 2010). For the purpose of the meta-analysis, we treated all these dependent variables as answering the same question, independently of the exact phrasing. However, for testing Hypotheses 1b–3b, we focused on direct replications of the original Gödel probe and excluded variations of the design such as Sytsma and Livengood’s “clarified narrator’s perspective”. These different ways of analyzing the data allow us to answer two different questions: first, the question of the replicability and internal validity of the original effect observed by Machery et al., and second, the question of the generality and external validity of the observed effect with respect to a wider hypothesis about the reference of proper names.

3.2 Meta-Analytic Estimates

The results per hypothesis are displayed as forest plots in Figs. 1 and 2 (Footnote 9) and summarized below. The analyses provide confirmatory evidence for Hypotheses 1a, 2b, 3a and 3b in the sense that the confidence interval for the unknown parameter consists only of values in the expected direction and excludes the zero-effect value, that is, a relative risk ratio of RR = 1 for Hypotheses 3a and 3b, and a 50% proportion of causal-historical responses for Hypotheses 1a through 2b. A precise statement of our results follows.

Fig. 1 Forest plot for Hypotheses 3a, showing distribution of effect sizes and confidence intervals

Fig. 2 Forest plot for Hypotheses 1b, 2b and 3b, respectively, showing distribution of effect sizes and confidence intervals

Hypothesis 1a:

The summary proportion of causal-historical probe responses in Westerners was 0.58 [95% CI: 0.52–0.62] with large observed heterogeneity [$I^2$ = 82.10%] and a prediction interval that included the zero-effect value [95% PI: 0.35–0.77].

Hypothesis 1b:

The summary proportion of causal-historical responses to Gödel probes in Westerners (in the original experimental design) was 0.55 [95% CI: 0.50–0.59] with moderate observed heterogeneity [$I^2$ = 21.89%; 95% PI: 0.35–0.77].

Hypothesis 2a:

The summary proportion of causal-historical probe responses in East Asians was 0.50 [95% CI: 0.41–0.59] with large observed heterogeneity [$I^2$ = 87.50%] and a prediction interval that included the zero-effect value [95% PI: 0.19–0.80].

Hypothesis 2b:

The summary proportion of causal-historical responses to Gödel probes in East Asians (in the original experimental design) was 0.36 [95% CI: 0.32–0.41] with large observed heterogeneity [$I^2$ = 52.59%], but a prediction interval that excluded the zero-effect value [95% PI: 0.26–0.48].

Hypothesis 3a:

The summary RR of causal-historical probe responses in Westerners versus East Asians was 1.18 [95% CI: 1.09–1.27] with moderate-to-large observed heterogeneity [$I^2$ = 47.42%] and a prediction interval that included the zero-effect value [95% PI: 0.96–1.44].

Hypothesis 3b:

The summary RR of causal-historical responses to Gödel probes in Westerners versus East Asians (in the original experimental design) was 1.36 [95% CI: 1.20–1.54] with moderate observed heterogeneity [$I^2$ = 30.08%] and a prediction interval that excluded the zero-effect value [95% PI: 1.03–1.80].
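The criterion used to read these results (an interval lying entirely on the expected side of the zero-effect value) can be applied mechanically to the six reported intervals. The sketch below copies the rounded bounds from the results above. The helper name and the encoding of the expected direction are ours, and because the bounds are rounded, a bound reported as exactly 0.50 is treated as not excluding the zero-effect value.

```python
# (hypothesis, zero-effect value, expected direction, 95% CI, 95% PI),
# with bounds copied as rounded in the results above
results = [
    ("1a", 0.5, "above", (0.52, 0.62), (0.35, 0.77)),
    ("1b", 0.5, "above", (0.50, 0.59), (0.35, 0.77)),
    ("2a", 0.5, "below", (0.41, 0.59), (0.19, 0.80)),
    ("2b", 0.5, "below", (0.32, 0.41), (0.26, 0.48)),
    ("3a", 1.0, "above", (1.09, 1.27), (0.96, 1.44)),
    ("3b", 1.0, "above", (1.20, 1.54), (1.03, 1.80)),
]

def lies_in_expected_direction(interval, null_value, direction):
    """True if the whole interval is on the expected side of the null value."""
    low, high = interval
    if direction == "above":
        return low > null_value
    return high < null_value

ci_supported = [name for name, null, direction, ci, _ in results
                if lies_in_expected_direction(ci, null, direction)]
pi_supported = [name for name, null, direction, _, pi in results
                if lies_in_expected_direction(pi, null, direction)]

print(ci_supported)  # ['1a', '2b', '3a', '3b']
print(pi_supported)  # ['2b', '3b']
```

The two lists reproduce the pattern in the text: four confidence intervals exclude the zero-effect value, but only two prediction intervals do.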

In addition, we wanted to make sure that these results did not depend on how we modelled the structure of the data. Most studies included in the meta-analysis contain several probes. Some studies used one sample of participants to test several probes (i.e., repeated measure design), while other studies used separate samples of participants to test each probe individually (i.e., between-subject design). Our analyses take into account the hierarchical structure of probes nested in studies, but they neglect the potential dependencies between probes in studies with repeated measures design.

To take account of these differences, we ran two additional sets of analyses. For the first set, we used a flat data structure (i.e., a regular RE meta-analysis). For the second set, we also used a regular RE meta-analysis, but included the average effect size and standard error of probe clusters with a shared repeated-measures design. For instance, if a study contained one sample of (Western) participants (and/or one sample of Asian participants) that responded to four probes (e.g., the two Gödel and two Jonah probes of the original study by Machery et al. (2004)), then only the average of their responses was included in the analyses. The results of these analyses were similar to those of the analyses with the hierarchical data structure, thereby warranting the same conclusion (code and results of these analyses can be found in the additional materials file).
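The cluster-averaging step of the second set of analyses can be sketched as follows. The rows are hypothetical, the function name is ours, and taking a plain arithmetic mean of both the effect sizes and the standard errors is one reading of the description above rather than a confirmed detail of the analysis.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (study, participant sample, effect size, standard error).
# Probes that share a study and a sample come from a repeated-measures design.
probe_rows = [
    ("Study A", "sample 1", 0.30, 0.10),
    ("Study A", "sample 1", 0.10, 0.12),
    ("Study A", "sample 1", 0.20, 0.11),
    ("Study B", "sample 1", 0.25, 0.15),
    ("Study B", "sample 2", 0.05, 0.14),
]

def average_clusters(rows):
    """Collapse probes answered by the same sample into one averaged entry."""
    clusters = defaultdict(list)
    for study, sample, effect, se in rows:
        clusters[(study, sample)].append((effect, se))
    return {
        key: (mean(e for e, _ in values), mean(s for _, s in values))
        for key, values in clusters.items()
    }

averaged = average_clusters(probe_rows)
for (study, sample), (effect, se) in averaged.items():
    print(f"{study}, {sample}: effect = {effect:.3f}, SE = {se:.3f}")
```

Each averaged entry can then be passed to a regular RE meta-analysis as a single observation, so that no sample of participants contributes more than one data point.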

3.3 Results of Exploratory Analyses

The width of the confidence and prediction intervals in the hypothesis tests (see Section 3.2) points to high observed inter-study variance of participants’ responses. By means of a meta-analytic regression, we explored whether this variance could be explained by specific deviations in study design, such as language of the participants (see Appendix B for a list of these deviations and their descriptions). Concretely, we fitted three generalized meta-analytic regressions with the Knapp and Hartung method (Footnote 10) (Knapp and Hartung 2003; Viechtbauer 2010) for the data used to test Hypotheses 1a, 2a, and 3a, with the identified design deviation factors as predictors and log-transformed RR (Hypothesis 3a data) or Odds Ratio (Hypotheses 1a and 2a data) as outcome. We fixed the sampling variance to zero, because we were only considering variance between probe outcomes. This choice is acceptable, since we are not estimating an overall effect size weighted by the precision of each datum in the analysis. However, we did include the sample size in the model as a predictor.

For all three analyses, much of the variance was explained by the model (Footnote 11). About 47% of the variance in the probe responses by Westerners could be accounted for by deviations in study design [$R^2$ = 0.47, F(18,30) = 3.34, p = 0.002]. For probe responses by East Asians, about 64% of the variance could be accounted for [$R^2$ = 0.64, F(22,23) = 4.60, p = 0.0003]. In the comparison in probe responses between Westerners and East Asians, about 28% of the variance could be accounted for by deviations in study design [$R^2$ = 0.28, F(18,15) = 1.70, p = 0.15] (Footnote 12). Although the relatively large number of predictors in relation to the size of the dataset makes the results unsuited for drawing strong conclusions, a noteworthy result was that none of the design deviations, which include participant nationality and different types of probes, made a statistically significant contribution (p < 0.05) in all three models (Footnote 13).

In addition, we separately analyzed studies that used an alternative formulation of the binary-response question in the Gödel probes: the “clarified narrator’s perspective” (e.g., Sytsma and Livengood 2011). Based on existing literature, this formulation was expected to be less ambiguous in eliciting participants’ semantic intuitions. We found six such probes in our data set, of which three compared the causal-historical and descriptivist responses between Westerners and East Asians; one tested East Asians only; and two tested Westerners only. To estimate the summary effect size for these data, we used the same kind of analyses as for testing H1b–H3b (see Section 2.4). The results are summarized in Table 2. The proportion of causal-historical intuitions is higher for both Westerners and East Asians in the clarified narrator’s perspective than in the set of all Gödel probes. In particular, for Westerners, the effect (as measured by the proportion of causal-historical responses) is more pronounced in the “clarified narrator’s perspective”, while it is slightly smaller for East Asians and for the comparison of East Asians and Westerners. These comparisons are relative to the outcomes of the tests of H1b–H3b in Section 3.2.

Table 2 Results of the analysis for probes with the clarified narrator’s perspective proposed by Sytsma and Livengood (2011)

Finally, we analyzed the difference in responses to non-Gödel probes. This analysis complements the test of Hypothesis 3b: it compares the responses of Westerners and East Asians on all the probes except the Gödel probes. Cultural difference is absent (RR = 0.08, 95% CI = [-0.04, 0.20]), but this is not particularly surprising: the analysis consisted overwhelmingly of Jonah probes, which did not show a significant difference between the samples in the original study by Machery et al. (2004).

3.4 Evaluation of the Study’s Results and Methods

Evaluation of the results of the included papers indicates the absence of small-study effects and the presence of evidential value, but high inter-study variance. The p-curves are right-skewed (see Figs. 3 and 4), which could be considered indicative of the set of studies containing evidential value (Simonsohn et al. 2014) (Footnote 14). Figure 5 shows the funnel plots (Sterne et al. 2011) for Hypotheses 1a–3a, and Fig. 6 shows the corresponding funnel plots for Hypotheses 1b–3b. Small to large studies show symmetrical distributions around the summary effect sizes (see Figs. 5 and 6), but for Hypotheses 1a and 2a, inter-study variance is high even among studies with high precision (i.e., large sample size, low standard error). This indicates that inter-study variance does not diminish with increasing precision as it should. No such effect is observed when cross-cultural variations are directly studied, that is, in Hypotheses 3a/b. For a quality appraisal of the methods employed in the experiments, see Appendix C.

Fig. 3 P-Curve plots for Hypotheses 1a, 2a and 3a, respectively

Fig. 4 P-Curve plots for Hypotheses 1b, 2b and 3b, respectively

Fig. 5 Funnel plots for Hypotheses 1a (upper left figure), 2a (upper right figure) and 3a (bottom figure). x-axis: observed effect size; y-axis: precision of study as measured by standard error

Fig. 6 Funnel plots for Hypotheses 1b (upper left figure), 2b (upper right figure) and 3b (bottom figure). x-axis: observed effect size; y-axis: precision of study as measured by standard error

4 Discussion

All in all, our meta-analysis supports the hypothesis that cross-cultural factors affect semantic intuitions about proper names (Machery et al. 2004). For four out of the six tested hypotheses, the meta-analytic confidence interval for the summary effect size does not include the zero-effect value. Nor do specific analysis tools aimed at detecting publication bias or questionable research practices (QRPs), such as funnel plots and p-curves, provide evidence of systematic suppression of negative results. The wide prediction intervals point to high inter-study variability of the data, which cannot be consistently explained by the studies’ differences in experimental design.

Our study cannot test the overall scope of cross-cultural factors in eliciting intuitions about the reference of proper names, since the meta-analysis is restricted to a very specific set of probes. In addition, some aspects of the meta-analysis are limited by the methodological design of the studies and by the small number of studies relative to the number of design differences between them. The use of only a few vignettes per study, with binary responses and unknown dependencies between them, forced us to analyze the data at the level of individual probes (vignettes) and may have contributed to the large heterogeneity. However, three findings of general interest stand out.

First, there is a notable difference between the confidence intervals (CIs) for the unknown parameter and the prediction intervals (PIs) for the next observation: only two of the six PIs exclude a zero effect. The difference between the CIs and the PIs is perhaps clearest in the meta-analysis of Hypotheses 1a, 1b and 2a: the 95% PIs are too wide to make a theoretically meaningful prediction. The reason is that PIs take into account how the variability in the data transfers to the expected value of future data points. Where individual studies scatter over a wide range of points, as in the case of Hypotheses 1a, 1b and 2a, the PIs will be considerably wider than the CIs. The same remarks apply, though with a less pronounced effect, to the other three hypotheses. For these reasons, it is not surprising that a recent replication attempt, included in Cova et al. (2018), has not reproduced the original effect: although there is, on average, a cross-cultural effect in the predicted direction, the inter-study variance is too high to reliably predict a result in the vicinity of the meta-analytic mean. The funnel plots of Figs. 5 and 6 confirm this diagnosis: since variation in the observed proportions of causal-historical responses is extremely high, there seems to be a lot of “noise” in the data.
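The contrast between CIs and PIs can be made concrete with a small numerical sketch. It assumes a DerSimonian-Laird random-effects model, which may differ from the estimator used in our analysis, and it uses a normal quantile for both intervals as a simplification (a t quantile with k − 2 degrees of freedom is commonly recommended for the PI and would make it wider still). The function name `random_effects_intervals` and all effect sizes and variances are hypothetical and invented for illustration.

```python
import math
from statistics import NormalDist


def random_effects_intervals(effects, variances, level=0.95):
    """Return (mu, tau2, ci, pi) for a DerSimonian-Laird random-effects model."""
    k = len(effects)
    # Fixed-effect (inverse-variance) weights and mean, needed for Cochran's Q.
    w = [1.0 / v for v in variances]
    fixed_mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, effects))
    # Method-of-moments estimate of the between-study variance tau^2.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Random-effects weights, summary effect and its standard error.
    w_star = [1.0 / (v + tau2) for v in variances]
    mu = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se_mu = math.sqrt(1.0 / sum(w_star))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    # The CI reflects uncertainty about the mean effect only.
    ci = (mu - z * se_mu, mu + z * se_mu)
    # The PI adds the between-study variance tau^2 to that uncertainty.
    half_width = z * math.sqrt(tau2 + se_mu ** 2)
    pi = (mu - half_width, mu + half_width)
    return mu, tau2, ci, pi


# Hypothetical effect sizes and sampling variances for six precise studies.
example_effects = [0.05, 0.40, -0.10, 0.35, 0.20, 0.45]
example_variances = [0.004] * 6
mu, tau2, ci, pi = random_effects_intervals(example_effects, example_variances)
print("summary effect:", round(mu, 3), "tau^2:", round(tau2, 4))
print("CI:", tuple(round(x, 3) for x in ci))
print("PI:", tuple(round(x, 3) for x in pi))
```

In this invented data set the studies are individually precise but scatter widely, so the CI excludes zero while the PI does not. That is the pattern described above: the mean effect is estimated with reasonable confidence, yet the result of the next study cannot be predicted reliably.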

Second, the clarified narrator’s perspective, suggested by Sytsma and Livengood (2011), seems to push causal-historical intuitions: it increases the tendency of participants to give causal-historical responses across the board. At the same time, it reduces the cross-cultural difference between Westerners and East Asians. However, due to the small sample size for this type of probe, our finding here should be treated with caution.

Third and last, our meta-analytic regression found statistically significant dependencies of the results on variations in experimental design. However, we could not identify predictor variables that uniformly explained inter-study variance for Westerners, East Asians and the comparison of both populations. We could not decide whether this lack of consistency between the analyses is due to overfitting or to relevant methodological deviations from the original study that were not reported in the papers and thus function as hidden moderators. On the other hand, there are interesting dependencies on the particular probes. For example, for the population of East Asians, the tendency to give descriptivist answers is almost exclusively driven by Gödel-type probes (meta-analytic estimate of 64% as opposed to 50% overall). This is supported by the null result of the exploratory analysis of all non-Gödel probes. These results show a high sensitivity to the particular instrument for eliciting semantic intuitions, without a theoretically convincing explanation (the effect is absent for Westerners). More generally, the large amount of heterogeneity and the lack of a uniform explanation may (partly) be due to methodological deficiencies (see Appendix C). We therefore suggest that future research on cross-cultural differences in semantic intuitions extend the range of probes in experiments, so that substantial philosophical and psychological conclusions do not depend (too) much on the specifics of a particular probe (cf. Devitt and Porot 2018).

Given these qualifications, the take-home message of our meta-analysis can be summed up as follows: there is cross-cultural variation in intuitions about the reference of proper names, but there is a high unexplained inter-study variance, compromising predictive validity; and it is not easy to disentangle this cross-cultural effect from the (random) effect of a particular study.

In relation to existing findings from other disciplines, our results are consistent with anthropologists’ and linguists’ work showing that many features of naming practices differ across cultures and contexts of discourse (Vom Bruck and Bodenhorn 2006). However, underlying these differences, the primary pragmatic functions of person names, namely referring directly to a unique individual and marking social connections, may be stable (Alford 1988; Sophia and Marmaridou 1989). Also stable may be the kinds of cognitive processes involved in understanding proper names, which probably recruit a number of structured representations (or frames) associated with the name (Valentine et al. 1996; Dancygier 2009). Given the high degree of variation, both within and across experimental groups, that our meta-analysis uncovered, it is plausible that the workings of these kinds of processes are modulated by both causal and descriptive factors in different linguistic and social contexts (Genone and Lombrozo 2012).

Due to the dependency on particular probes, we recommend that researchers interested in understanding the mechanisms underlying the processing of person names move beyond the contrast between descriptivist and causal-historical theories, and use a broader and more diverse set of vignettes in different languages and contexts. To this end, methodologies similar to those employed in cognitive psychology and neuroscience can be useful, as they are designed to uncover the mechanisms and cognitive representations underlying people’s semantic intuitions. For readers who are primarily interested in questions about philosophical methodology, our results confirm that philosophical scenarios are less reliable instruments for eliciting semantic intuitions than current philosophical practice seems to presume.