The recent increase in the number of high-quality norms and databases allows the memory researcher to better equate two classes of stimuli such that they differ only on the dimension of interest. For example, age of acquisition refers to when a word is typically learned. Unfortunately for the experimenter, age of acquisition correlates with dimensions such as concreteness, number of phonemes, frequency, and orthographic and phonological Levenshtein distance and frequency, to mention only a few, and these confounds have led to quite different reports of the effect of age of acquisition on different memory tasks (for a review, see MacMillan, Neath, & Surprenant, in press). By using the new norms, which were unavailable to earlier researchers, MacMillan et al. were able to create a set of early and late age of acquisition words that were equated on all of these (and other) dimensions and found that whereas age of acquisition affected recognition, it had no effect on either free or serial recall. However, one issue they encountered was that there were no published assessments of which measure of semantic relatednessFootnote 1 was appropriate for episodic memory tasks. That is, although many such measures exist, and although their predictions have been compared to human behavior, the tests chosen to evaluate these measures are typically lexical decision, similarity judgment, semantic categorization, and analogy tasks. It is not clear whether these measures capture the processing involved in, for example, immediate serial recall. The purpose of the current paper is to report an initial assessment.

Semantic relatedness effects are readily observable in immediate memory tasks. For example, Murdock and Vom Saal (1967), Murdock (1976), Poirier and Saint-Aubin (1995), and Chubala, Neath and Surprenant (2019) all found better serial recall of categorized lists, those containing exemplars from the same category, than of uncategorized lists, those containing exemplars from different categories. Such lists are typically constructed using norms such as Battig and Montague (1969) and Van Overschelde et al. (2004). Semantic relatedness can also affect memory when the words are related by association or meaning rather than by category membership. For example, Tse (2009), Tehan (2010), and Tse et al. (2011) all found better serial recall of associated than unassociated lists. Such lists are typically constructed using word association norms such as De Deyne et al. (2019) and Nelson et al. (2004).

However, one difficulty is how to assess the extent to which two lists differ in semantic relatedness. Consider two lists of nouns. The first is alien, companion, polo, soul, transient and the second is bout, clientele, diary, fuss, sensation. Do these lists differ in semantic relatedness?

Although word association norms are useful for creating new lists, they are less useful for assessing existing lists. For example, according to the De Deyne et al. (2019) norms, no word in either list is a forward or backward associate of any of the other words in the list. Another way of measuring semantic relatedness is based on analyzing co-occurrence within a large corpus and representing each word as a vector. The similarity measure reported is the cosine of the angle between the two vectors representing the two words of interest. This measure ranges from –1 to +1, with +1 indicating maximum similarity, 0 indicating a 90° angle between the vectors (i.e., the vectors are orthogonal), and –1 indicating maximum dissimilarity. LSA (latent semantic analysis; Landauer & Dumais, 1997; see http://lsa.colorado.edu) has been used previously as a measure of semantic relatedness in the immediate memory literature (e.g., Tse, 2009). Using the topic space “general reading up to 1st year college,” LSA indicates that the two lists do differ in semantic relatedness, t(8) = 2.446, p = 0.040. LSA indicates the alien list is more related (M = 0.160, SD = 0.080) than the bout list (M = 0.070, SD = 0.016). GloVe (Global Vectors for Word Representation; Pennington et al., 2014) is a distributional model that also represents words as vectors based on co-occurrence. Using a library of pre-trained words based on 840 billion tokens with a vocabulary of 2.2 million words, this measure indicates that there is no difference between the two lists, t(8) = 0.446, p = 0.659, with a mean cosine of 0.160 (SD = 0.071) for the alien list compared to 0.178 (SD = 0.047) for the bout list. A second distributional model, fastText (2021), using the pre-trained word vectors from Grave et al. (2018), also indicates no difference, 0.149 (SD = 0.034) for the alien list compared to 0.116 (SD = 0.462) for the bout list, t(8) = 1.218, p = 0.235.

A third way of measuring semantic relatedness is based on WordNet (Fellbaum, 2017; Miller et al., 1990), a lexical database in which words are organized into synonym sets (synsets) that represent the specific meaning of a word. Although WordNet also includes verbs, adjectives, and adverbs, we focus on nouns because the structure of WordNet does not afford a simple way of evaluating similarity across parts of speech. Nouns with multiple meanings are organized in different synsets. For example, whereas beagle has only one sense (“a small short-legged smooth-coated breed of hound”), racket has four senses: (1) a “loud and disturbing noise,” (2) a “fraudulent scheme,” (3) “the auditory experience of sound that lacks musical quality,” and (4) “a sports implement”.Footnote 2

WordNet can be seen as implementing an organization that developed from spreading activation models such as that of Collins and Loftus (1975). Nouns are organized in a hierarchy of concepts, one for each sense. These synsets are connected to each other via explicit semantic relations. For example, because beagle is a kind of hound (“any of several breeds of dog used for hunting typically having large drooping ears”), it is a hyponym of hound and hound is a hypernym of beagle. By following these isa links, the path length between synsets can be computed. For example, the path length between beagle and greyhound is 3 (both are types of hounds), and the path length between beagle and sense 1 of dog (“domestic dog, Canis familiaris”) is 4 (a hound is a hyponym of dog, and dog is a hypernym of hound). The shortest path length is 1 (between beagle and beagle) and the longest path lengths are in the 20s.

The most simple WordNet-based measure of semantic relatedness is WordNet path length (WNPL), which is the shortest path between any sense of either word. In addition to this measure, two additional measures scale simple path length. Leacock and Chodorow (1998)’s measure, LCH, scales path length by the maximum path length found in the hierarchy containing the two words. Wu and Palmer (1994)’s measure, WUP, scales path length using the depth of the least common subsumer. Three other measures, JCN (Jiang & Conrath, 1997), LIN (Lin, 1998), and RES (Resnik, 1995), take into account not only path length but also information content, which is inversely related to how frequently a concept is encountered (see Meng et al., 2013, for details on each measure and for formulas). For the example lists, all of these WordNet-based measures indicate the alien list is more related than the bout list.

For the example list, then, all of the WordNet-based measures as well as LSA indicate a difference, with the alien list being more related, whereas neither GloVe nor fastText indicate a difference. It should be noted that these measures were not developed to predict the effect of semantic relatedness on episodic memory tasks. Because we could find no published assessment of which, if any, of these measures best indicates whether two sets of stimuli will be differentially recalled because of differences in semantic relatedness, we began this series of studies.

As a starting point, we used simple WordNet path length (WNPL) to construct the stimuli in the first experiment. One reason is its simplicity: The primary assumption is that the longer the path between two nouns, the less the two items are semantically related. A second reason is that this idea accords well with notions of how spreading activation may help during encoding and retrieval in immediate memory tasks (e.g., Tehan, 2010). For example, relative to a list of words that are less related, processing the first item in a list of related words would cause activation to spread to related items. When the second item in the list is processed, it is already partially activated. This activation is reinforced by the remaining items. Similar activation may also aid retrieval. A third reason is that this measure indicated no difference in semantic relatedness in the age of acquisition stimuli of Macmillan et al. (in press), and there was also no difference in recall. The reason this is potentially informative is that if there had been a difference in semantic relatedness between their two conditions, there should also have been a recall difference.

The first experiment tests whether a difference in immediate serial recall will be observed when all but one of the measures indicates that one set of words is more semantically related than a second set. Experiments 2 and 3 then test the measures for our intended primary use case, assessing whether two sets of words vary in semantic relatedness. Experiment 2 shows that when a primary manipulation is confounded with semantic relatedness, the main effect of interest may not occur, whereas Experiment 3 shows that when the two sets of words are equated for semantic relatedness, the main effect of interest is now observable.

Experiment 1

Performance on immediate serial recall tests is generally better when the words are related than unrelated, even when relatedness is not defined by category membership (e.g., Tehan, 2010; Tse, 2009; Tse et al., 2011). Experiment 1 was designed to examine this effect. A set of stimuli was created such that the related and unrelated lists were equated on a number of other dimensions known to affect immediate serial recall (see Table A1 in the appendix). The related words all had a path length of 3 from each other; therefore, WNPL = 3.00 (SD = 0.00). The unrelated words had a WNPL of 9.33 (SD = 1.60). This difference is approximately the same as that found in the literature for categorized lists. For example, exemplars in 58 of the 64 categorized lists used by Chubala et al. (2019) are in WordNet. The mean WNPL of these categorized lists is 4.82 (SD = 1.24) compared to 11.97 (SD = 2.25) for 58 lists constructed by randomly sampling exemplars from different categories. As shown in Table 1, all but two of the measures of semantic relatedness indicate that the two lists differ. The exceptions are that LSA indicates no difference between the lists and GloVe indicates only a borderline difference.

Table 1 Measures of semantic relatedness for each condition in each experiment

Method

Subjects

Forty-four volunteers from ProlificAC were paid £8.00 per hour (prorated) for their participation. The inclusion criteria were: (1) native speaker of English; (2) age between 19 and 39; and (3) at least a 90% approval rating on prior participation. The mean age was 29.48 (SD = 5.73, range 19–38) and 26 people self-identified as female while 18 self-identified as male. A power analysis determined the sample size: A sample of 44 has power of 0.90 to detect an effect size of d = 0.5 (Faul et al., 2009).

Stimuli

The related words are: air, clarity, difference, ease, good, mobility, nature, romance, stuff, sweetness, tone, ultimate, utility. The unrelated words are: art, clear, coverage, fortune, friendship, idealism, lack, like, loan, monitor, notable, notice, selection. See Table A1 for details.

Procedure

After reading an informed consent form and agreeing to participate, subjects were reminded of the instructions. They were informed that after clicking on the “Start next trial button” they would see a fixation point in the center of the screen for 1 s. Then, one second after the fixation point disappeared, six words would be shown in the same location, one at time, for 1 s each in 24-point Helvetica. One second after the end of the list, the subjects were prompted to type in the words they had just seen in strict serial order. They were informed that they needed to type in the first word first, the second word second, and so on. If they were unsure of a response, they were encouraged to guess or click on a button labeled “skip”. The subjects initiated the next trial when ready, but this could be done only after six responses had been made. A self-timed break could occur by simply waiting to initiate the next trial.

There were 24 trials, 12 with related and 12 with unrelated words. The words for each list were randomly selected for each trial and were randomly ordered, and the order of the lists, related or unrelated, was randomly determined for each subject.

Results and discussion

The data were analyzed using both frequentist and Bayesian techniques using JASP (JASP Team, 2019). For the latter, a Bayes factor (BF) is reported. BF10 between 3 and 20 indicates positive evidence for the alternate hypothesis (and therefore evidence against the null hypothesis), BF10 between 20 and 150 indicates strong evidence, and BF10 greater than 150 indicates very strong evidence (Kass & Raftery, 1995). Similarly, BF01 indicates evidence in favor of the null hypothesis using the same scale. Default priors were used.

There was a semantic relatedness effect: The proportion of words correctly recalled in order was 0.607 (SD = 0.149) for the related lists compared to 0.565 (SD = 0.149) for the unrelated lists, t(43) = 3.056, p = 0.004, d = 0.461, BF10 = 9.01.Footnote 3 Twenty-nine subjects recalled more related than unrelated words, 11 recalled more unrelated than related words, and 4 were tied (p = 0.006 by a two-tailed sign test). The effect size is comparable to d = 0.54 observed by Tse (2009). Unlike that study, and those of Tse et al. (2011) and Tehan (2010), the stimuli in the current study were equated on more dimensions, thus ruling out potential confounds. All but two of the measures indicated that the related list was indeed more related than the unrelated list and therefore can be taken as predicting the observed results. The two exceptions are LSA, which indicated no difference in relatedness, and GloVe, whose measure was equivocal.

Experiment 2

The primary use case considered in this paper is assessing whether two sets of words differ in semantic relatedness such that this factor can be ruled out as a reason for differential recall on an episodic memory test. For example, an experimenter interested in the concreteness effect in serial recall would use a set of abstract words and a set of concrete words; the usual result is a recall advantage for concrete items, even when using a small, closed set of words (e.g., Neath & Surprenant, 2019; Walker & Hulme, 1999). However, if the abstract and concrete words also differ on some other dimension, such as semantic relatedness, the concreteness effect might be either magnified (if the concrete words are more related) or eliminated (if the abstract words are more related). In this experiment, we examine whether the concreteness effect can be eliminated if the abstract words are more semantically related than the concrete words. Table 1 shows that all measures except for JCN indicate the abstract and concrete words differ in semantic relatedness, with the abstract words being more related than the concrete.

Method

Subjects

Forty-four different volunteers from ProlificAC were paid £8.00 per hour (prorated) for their participation. The mean age was 26.97 (SD = 6.24, range 19–39) and 29 people self-identified as female while 15 self-identified as male.

Stimuli

The abstract words were: betrayal, bliss, foresight, greed, hardship, loyalty, malice, mercy, psyche, revenge, riddance, risk, urge, wisdom. The concrete words were: band, blazer, cadet, cauldron, creature, gauntlet, hike, ledge, liqueur, manor, ranch, rubble, throttle, ulcer.Footnote 4 Using the Brysbaert et al. (2014) concreteness norms (where 1 = abstract and 5 = concrete), the mean rating for the concrete words is 4.52 (SD = 0.25; range 4.07–4.79) and the mean rating for the abstract words is 1.64 (SD = 0.17; range 1.34–1.93). This difference in mean concreteness rating is comparable to numerous manipulations that resulted in a concreteness effect (see Table 1 of Neath & Surprenant, 2020). The abstract and concrete words were equated on a number of other dimensions known to affect memory (see Table A2 for details), but according to all measures except JCN, the abstract words are more semantically related than the concrete words (see Table 1). With this set of stimuli, then, it is possible that the usual concreteness effect may be diminished or even abolished because the recall of the abstract words is boosted by their higher degree of relatedness.

Procedure

The procedure was identical to that of Experiment 1.

Results and discussion

There was no concreteness effect: The proportion of words correctly recalled in order was 0.521 (SD = 0.201) for the concrete words compared to 0.533 (SD = 0.211) for the abstract words, t(43) = 0.720, p = 0.476, d = 0.108, BF01 = 4.80.Footnote 5 Eighteen subjects recalled more concrete than abstract words, 24 recalled more abstract than concrete words, and two were tied (p = 0.441 by a two-tailed sign test). The results are consistent with the interpretation that the usual recall advantage of concrete words was not observed because the enhanced semantic relatedness of the abstract words offset the advantage that usually accrues to concrete words. All measures except JCN indicated the abstract words were more related and therefore all except JCN predicted this result.

Experiment 3

Experiment 3 used a new set of abstract and concrete stimuli. Unlike in Experiment 2, the words were equated for semantic relatedness as measured by WNPL. The WNPL for the abstract words was 10.68 (SD = 1.36) and that of the concrete words was 10.79 (SD = 1.30), which do not differ, t(26) = 0.219, p = 0.829. As can be seen in Table 1, the words also did not differ using LCH or JCN. However, the two sets of words do differ in semantic relatedness according to LSA, GloVe, and fastText, all of which indicate that the abstract words are more related than the concrete words. In contrast, WUP, RES, and LIN indicate the reverse, that the concrete words are more related than the abstract. The prediction depends on which measure is better capturing the effect of semantic relatedness in immediate serial recall. According to WNPL, LCH, and JCN, there should be a small concreteness effect because the only known difference between the two sets of words is concreteness. According to LSA, GloVe, and fastText, the situation is unchanged from Experiment 2: there should be no concreteness effect because the advantage that usually accrues to concrete words will be offset by the advantage of enhanced relatedness of the abstract words. Finally, according to WUP, RES, and LIN, there should be an enhanced concreteness effect because the advantage that usually accrues to concrete words will be supplemented by their also being more related than the abstract words.

Method

Subjects

Forty-four different volunteers from ProlificAC were paid £8.00 per hour (prorated) for their participation. The mean age was 27.75 (SD = 5.24, range 19–38) and 32 subjects self-identified as female while 12 self-identified as male.

Stimuli

The abstract words were: adage, almighty, entail, gradual, holiness, hopeful, infamy, leeway, merit, noble, outlook, rapport, trifling, whimsy. The concrete words were: athlete, bachelor, calcium, diagram, diary, entree, mogul, noggin, ointment, pagoda, plaza, saffron, silencer, veneer. According to the Brysbaert et al. (2014) concreteness norms, the mean rating for the concrete words was 4.31 (SD = 0.19; range 4.04–4.67) and the mean concreteness rating for the abstract words was 1.81 (SD = 0.13; range 1.56 –1.97), comparable to the values in Experiment 2. See Table A3 for details.

Procedure

The procedure was identical to that of Experiments 1 and 2.

Results and discussion

There was a concreteness effect: The proportion of concrete words correctly recalled in order was 0.497 (SD = 0.177) compared to 0.463 (SD = 0.191) for the abstract words, t(43) = 3.243, p = 0.002, d = 0.489, BF10 = 14.21.Footnote 6 Twenty-eight subjects recalled more concrete than abstract words, 14 recalled more abstract than concrete words, and two were tied (p = 0.044 by a two-tailed sign test). The concreteness effect was small in terms of the difference in proportion correct, but the effect size was comparable to that observed in Experiment 1 when semantic relatedness was manipulated. The results are not consistent with the predictions based on LSA, GloVe, or fastText; according to these measures, the results should have resembled the null effect observed in Experiment 2 because recall of the abstract words should have been boosted by their higher degree of relatedness. The results are consistent with predictions based on WNPL, LCH, and JCN, which indicated no difference in semantic relatedness and therefore the only difference between the lists should be concreteness. The results are problematic for WUP, RES, and LIN, which indicated that the concrete words were more related than the abstract words. In effect, this predicts an additional advantage for concrete words.

General discussion

There has recently been a huge increase in the number of databases that allow memory researchers to better equate two sets of stimuli. Although there exist many measures of semantic relatedness, none have been tested to assess which best reflects the effects of semantic relatedness on immediate memory tests. Experiment 1 replicated previous findings of a semantic relatedness effect in immediate serial recall. All measures except LSA indicated the words differed, with the caveat that the prediction from GloVe was equivocal. Experiment 2 found that when abstract words are more related than concrete words, the concreteness effect disappears, a result consistent with all measures except JCN. Finally, Experiment 3 found a small concreteness effect, consistent with the predictions based on WNPL, LCH, and JCN, which indicated no difference in semantic relatedness. In contrast LSA, GloVe, and fastText predicted no concreteness effect because the abstract words were deemed more related than the concrete; in effect, these measures predicted that the result found in Experiment 2 should also have been observed in Experiment 3. WUP, RES, and LIN predicted an enhanced concreteness effect because the concrete words had two advantages: Not only were they more concrete, but they were also more semantically related. Only two measures were consistent with the results from all three experiments: WNPL and LCH. Both are based on simple WordNet path length and differ only in that the latter measure is scaled by the maximum path length found in the hierarchy containing the two words.

One simultaneous strength and weakness of WNPL is that it is derived from WordNet. The primary strength is that WordNet is based on psycholinguistic research and theory (Fellbaum, 2017; Miller et al., 1990). As a result, it is more closely aligned with the type of processing that likely occurs in the memory tests assessed in this paper. As Tehan (2010) suggested, an item in a more semantically related list would receive a boost from the activation that spreads from the other closely related words in the list whereas an item in a list where the words are more distantly related would not get such a boost. Both WNPL and JCH would capture this because activation is thought to be directly related to path length. The scaling done by JCH involves the overall path length which, in the current case, would not differ. Although WUP is also based on path length, it uses a quite different scaling process, the depth of the least common subsumer. This will vary more than the overall path length, hence divergence in predictions. The other three WordNet-based measures all include information content, but it is not clear how that would be involved in processes such as those used in immediate serial recall.

In contrast to measures based on simple WordNet path length, measures based on co-occurrence of words in a large corpus did not fare well in predicting the pattern of results. One reason may be that co-occurrence could involve more than just semantic relations and therefore these measures may be responding to additional factors. Some admittedly post hoc evidence consistent with this view is the finding that GloVe and fastText both consistently indicated that abstract words were more semantically related than concrete words. As noted earlier, they similarly indicated that the early acquired words used by MacMillan et al. (in press) are more related than the late acquired words, whereas WNPL indicated no difference. A second reason may be that none of these models were created to predict the effect of semantic relatedness on episodic memory tasks. For example, the goal of GloVe was to capture fine-grained syntactic and semantic regularities in word vectors and then to assess the properties of these vectors (Pennington et al., 2014). Although GloVe excels at some tasks, such as analogical reasoning, word similarity judgements, and named entity recognition tasks, it is not necessarily the case that the semantic information it is capturing will be appropriate for episodic memory processing.

Although simple WordNet path length appears, based on the limited available data, the better measure for episodic memory tasks, it is not without its own limitations. First, the measure applies only to nouns because in WordNet, nouns and other parts of speech are organized differently. This means that WNPL cannot be computed for lists that include verbs or adjectives. Second, the implied interval scale means that a pair of nouns with a given path length is considered to have identical semantic relatedness to any other noun pairs with the same path length. This is likely not the case. A third potential weakness inherited from WordNet is that WNPL, like other measures of relatedness, may miss some forms of relatedness. For example, Li and Kohanyi (2017) note that whereas needle is expected to be associated with haystack, the path length in WordNet is 14. However, WordNet does capture other similar examples; for example, the path length between mountain and molehill is 5. A final limitation is that whereas WNPL and JCH may currently be the most useful measure for predicting human performance on episodic memory tasks, they are clearly far more limited in their generality than measures such as GloVe and fastText.

It should also be noted that in this work, we took a significant difference between two scores as indicative that there was a difference in semantic relatedness. That is, we interpreted the measures dichotomously. In reality, it is possible that a significant difference that is numerically small might not lead to a difference in recall. Evaluating each measure's scale properties requires far more data, and also requires that researchers publish their stimuli. A number of studies on memory for semantically related and unrelated lists do not include their stimuli.

Research methods are continually evolving, and practices that were acceptable not so long ago are now seen to be problematic. For current purposes, it is now clearer than ever that numerous lexical and psycholinguistic factors can affect performance on a wide number of tasks and researchers need to take this into account when creating sets of stimuli. The recent increase in the number of norms available makes it possible to equate two sets of words on over two dozen dimensions. One conspicuous omission was an easy way to compute semantic relatedness with evidence to support the choice of the measure. We have shown that two measures based on simple WordNet path length, WNPL and JCH, aligned with the results from all three experiments. No other measure did so. Moreover, both measures are consistent with theoretical accounts of how semantic relatedness might benefit memory. The other measures all failed to account for at least one experimental result, and some failed to account for two results.

Pending more research on the appropriateness of the various measures of semantic relatedness for episodic memory tasks, the tentative recommendation is to use WNPL or JCH. Although the only data to support this recommendation comes from the studies reported here (and also by the two serial recall studies reported by MacMillan et al., in press), this is more data than was available prior to this study. What is needed is further empirical work to assess the degree to which measures of semantic relatedness predict performance on memory tasks.

Computing WNPL There are a number of ways to compute WNPL. One way is to download and install WordNet (wordnet.princeton.edu), and then download and install any of several packages designed to access WordNet. One such package is the collection of Perl tools called WordNet::Similarity (Pedersen et al., 2004; see metacpan.org/pod/WordNet::Similarity). Another package is available in Python as part of the natural language toolkit (see https://www.nltk.org/howto/wordnet.html). A third way that does not involve programming is to use a database and software we created, which reads in a list of words and outputs a matrix of path lengths (see doi.org/10.17605/OSF.IO/3XTQD).

Author Notes

We thank Ted Pedersen for making the WordNet::Similarity database available. This research was supported, in part, by grants from the Natural Sciences and Engineering Research Council of Canada to IN and AMS. The authors are listed alphabetically. Correspondence may be addressed to ineath@mun.ca