The feature-listing task is an invaluable resource for studying how people understand and interpret words. In this paper we describe an automated way to score the features generated in this task. The procedure can serve a wide range of research purposes; here we focus on how it has been used in the study of metaphor, and later in the paper we outline several other applications of the automated procedure.

In a typical feature-listing task, participants are asked to generate properties for a given concept. These can involve visual aspects (e.g., “is red”), other sensory aspects (e.g., “is soft”), functional aspects (e.g., “used for transportation”), taxonomical knowledge (e.g., “a type of bird”), and, for living beings, behaviours (e.g., “eats plants”). The resulting feature lists provide a rich source of data and have been used to explore topics including conceptual similarity (Maki et al., 2006; McRae et al., 2005), semantic priming (Buchanan et al., 2019b; McRae et al., 1997), exemplar typicality (Rosch & Mervis, 1975), and cross-language differences in concept representation (Kremer & Baroni, 2011). However, feature data are time-consuming and costly to process, typically involving multiple coders (Becker, 1997; McRae et al., 2005; Montefinese et al., 2013; Vinson & Vigliocco, 2008).

The bottleneck in processing feature data stems from two major issues. The first involves combining synonymous features (Kremer & Baroni, 2011; McRae et al., 2005; Montefinese et al., 2013; Vinson & Vigliocco, 2008; Vivas et al., 2017). Synonymous responses can occur due to variations in both phrasing (e.g., “used for transportation”, “used for transport”, and “people use it for transportation”; McRae et al., 2005) and the words used (e.g., “four-legged” and “quadruped”; Vinson & Vigliocco, 2008). These cases can be ambiguous, and different authors may disagree on whether certain responses are synonymous. For instance, when examining McRae et al.’s (2005) dataset, Kremer and Baroni (2011) decided to treat the features “eats meat” and “carnivore” as synonymous whereas McRae et al. treated these as separate features, with the former classified as a “behaviour” and the latter classified as a “category”. Furthermore, sometimes the same word can have a different meaning depending on the context. For instance, the feature “has a trunk” means something quite different for an elephant versus a car (McRae et al., 2005). McRae et al. (2005) chose to treat such features as equivalent. However, it is unclear where to draw the line. As these cases demonstrate, deciding whether two features can be considered equivalent is not an easy task, and different coders across labs may have different opinions on what features are synonymous.

The other major bottleneck in scoring features is deciding when to separate participant responses into distinct features when they include multiple chunks of information (Kremer & Baroni, 2011; McRae et al., 2005; Montefinese et al., 2013; Vinson & Vigliocco, 2008; Vivas et al., 2017). This involves determining whether a response such as “red fruit” could be considered as two distinct features (“red” and “fruit”) and whether treating these as separate features would retain the same meaning (Vinson & Vigliocco, 2008). Multiple chunks of information occur in adjective-noun features (e.g., “has four wheels”, which conveys both that the concept has wheels and that it has four of them), disjunctive features (e.g., “is green or red”), and conjunctive features (e.g., “is green and red”). Whether such features should be separated is not straightforward, however; for instance, for the concept tiger, “is yellow and black” may capture additional meaning when retained as a single feature, because a tiger is simultaneously both of these colours (Montefinese et al., 2013). As such, different studies have used different criteria for separating features. For instance, Vivas et al. (2017) separated disjunctive, but not conjunctive, features. Kremer and Baroni (2011) instead used data from other participants as a guideline and separated the feature under consideration if at least one other participant listed the same features on separate lines (e.g., “yellow and black” would be split into two separate features if another participant listed “yellow” and “black” separately for any other concept in the database). In other studies, multiple raters are employed to decide such ambiguous cases (McRae et al., 2005; Montefinese et al., 2013; Vinson & Vigliocco, 2008).

Feature-listing for metaphors

The feature-listing task has also been used to explore metaphor interpretation (Becker, 1997; Gineste et al., 2000; Nueckles & Janetzko, 1997; Roncero & de Almeida, 2015; Tourangeau & Rips, 1991; Utsumi, 2005). Metaphors involve the framing of one concept in terms of another. The concept being framed is labelled the “topic”, and the concept used to frame the topic is labelled the “vehicle”. For example, for the metaphor “time is money”, time is the topic and money is the vehicle. Typically in metaphor studies, lists of features are collected for the complete metaphor and, separately, for the topic and the vehicle (Becker, 1997; Gineste et al., 2000; Nueckles & Janetzko, 1997; Tourangeau & Rips, 1991; Utsumi, 2005). One can then determine which features given to the complete metaphor come from its topic or vehicle, and which features are “emergent”, i.e., features that are not characteristic of either the topic or the vehicle on their own, but that become salient when the two concepts interact in a metaphor. Features themselves are simply pointers to the characteristics of a concept or, in the case of a metaphor, to the characteristics of the topic that are being highlighted by the vehicle.

In the metaphor literature, feature-listing has been used to compare metaphors to similes (Glucksberg, 2008; Utsumi, 2007), assess computational models of metaphor (Kintsch & Bowles, 2002; Reid & Katz, 2018; Utsumi, 2011), and create stimuli for later experiments (Gineste et al., 2000; Terai & Goldstone, 2011). Although many studies have employed feature-listing tasks, these studies vary greatly in terms of the instructions given to participants and how the features are combined and counted after data has been collected.

As in word feature studies, a major step in processing metaphor features involves deciding whether two different participant responses can be considered the same feature. Not surprisingly, there are differences of opinion on the best method for making these decisions in this literature as well. One approach is typified by Becker (1997), who combined features if they were synonyms (e.g., scary and frightening), alternative spellings or abbreviations of the same concept (e.g., O2 and oxygen), or differing intensities of the same concept (e.g., lots of hair and hair). Also, if a response was a phrase that included multiple concepts (e.g., bright lights), it was counted as a distinct feature unless both of the concepts appeared separately in other participant responses, in which case the response would be counted twice, once for each concept. Utsumi (2005) also combined responses into a single feature if they were synonyms. He treated two responses as synonymous if (1) the two words “belonged to the same deepest category of a Japanese thesaurus” (p. 156), or (2) the dictionary definition of one of the words included the other (e.g., lie and not true). In addition, Utsumi combined responses that had the same root form (e.g., red and redness) or that varied only in degree due to a modifier (e.g., frightened and quite frightened).

Although these authors are ostensibly transparent in their guidelines for combining features, in practice there are often ambiguous cases that fall to the experimenter’s judgment. In our own feature data (which we describe in more detail later), we encountered several issues, especially with the synonym approach. First, responses often included words that had multiple meanings, only one of which was a synonym of another response. For instance, for the metaphor history is a mirror, the responses included “reflective” and “contemplative”. The Lexico.com dictionary lists “reflective” as a synonym for “contemplative”; however, given this metaphor, “reflective” could also refer to the literal meaning related to light, so combining both responses into a single feature would miss this other meaning. Furthermore, we found that sometimes responses X and Y were synonyms, as were Y and Z, but X and Z were not. For example, given the metaphor wisdom is a foreigner, responses included “dark”, “mysterious”, and “strange”. Lexico.com lists both “dark” and “strange” as synonyms for “mysterious”, but “dark” is not listed as a synonym for “strange”, nor is “strange” listed as a synonym for “dark”. Therefore, even with clear instructions, there are still ambiguous cases that require judgment from the experimenter. Tourangeau and Rips (1991) point out this ambiguity in their own study: “These rules—especially the one for combining synonyms—are hardly very precise” (p. 471).

In contrast to the synonym approach, Roncero and de Almeida (2015) employed a different technique and combined features only if they shared the same morphological root (e.g., sleep and sleepiness). They argued that semantically similar words can still have subtle differences in meaning better captured by their approach. However, even using this criterion, there are cases that are not straightforward. For example, in our own data, for the metaphor adventure is a roller-coaster, responses included the words “joyful” and “enjoyable”. Both words could reduce to the morphological root “joy”, but these words capture somewhat different meanings, with joyful more commonly used to describe a person’s state of joy, and enjoyable more commonly used to describe events, things, and places, such as the roller-coaster itself.

Thus, even when the criteria for combining features are clearly articulated, many judgments inevitably fall to the authors and may be prone to bias. There are ways to mitigate the problems noted above. One way is to have multiple individuals code the responses and resolve disagreements by consensus, as in Becker (1997), where six judges working in pairs processed the data. However, another lab working with another set of judges may code the responses differently, adding complexity to replicating findings across labs (Buchanan et al., 2019a). Moreover, the hand-scoring approach is both time-consuming and expensive. Finally, because of the cost of manually coding responses, the data can typically be coded in only one way, which does not allow different coding methods to be compared.

Automated analysis of feature data

Quite apart from the subjectivity involved in scoring feature-listing data, manual coding is becoming increasingly infeasible in today’s research environment, given the accessibility of online survey platforms and the ease of obtaining large datasets from thousands of participants quickly and inexpensively. For instance, Buchanan et al. (2019b) recently constructed a database of features for over 4000 different concepts, for 1914 of which they collected new data from participants. In contrast to the manual coding approaches mentioned above, Buchanan et al. (2019b) used a combination of automated and manual coding to process their data. The automated procedures included the removal of stop words (i.e., words that hold little semantic content, such as “has”, “a”, “is”, etc.) and stemming, which removes the affixes from words, reducing them to their root. Furthermore, Buchanan et al. (2019b) applied a “bag of words” approach in which each word in a response was treated as a separate feature (e.g., “give up” would be treated as two distinct features, “give” and “up”). This approach resembles distributional semantic models, such as latent semantic analysis (Landauer & Dumais, 1997). Buchanan et al. found that this approach still yielded data that were convergent with manually coded data; for instance, cosine similarities were highly correlated for concept pairs that overlapped between their dataset and both the McRae et al. (2005) and Vinson and Vigliocco (2008) norms.

Here we introduce a program that automates the coding process. Automated processing eliminates subjective experimenter judgments, can easily be applied to large datasets, reduces inter-lab reliability issues, and, because different methods for categorizing features can be implemented by simply tweaking the code, allows different combinatory methods to be compared.

The main purpose of this paper is to introduce the “RK processor” (short for “Reid and Katz processor”), a program we developed in our lab for processing feature-listing data for the study of metaphor. In this paper, we will initially demonstrate the RK processor using a dataset of features generated by participants for 88 “A is B” metaphors taken from the Katz et al. (1988) metaphor norms, and a dataset of features generated for the 146 unique words contained in those metaphors. The program is written in Python 3 and is included in the supplementary materials of this paper along with our metaphor and word feature datasets. Although we designed the program for analysing metaphor data and include some operations specifically for metaphor, it also offers researchers a tool for automatically parsing feature data for words and, more generally, for other research involving feature-listing tasks. In the Supplementary Material we provide a more technical description for researchers wishing to apply the RK processor to their own databases.

To our knowledge, this is the first attempt to apply fully automated processing to metaphor feature data, though as mentioned above, Buchanan et al. (2019b) used a combination of automatic procedures and hand coding to process a dataset of word features. These authors used some of the same techniques we employ, such as Snowball stemming and stop word removal (see description below; see also Buchanan et al., 2019a, for a useful tutorial on how to collect and automatically process feature-listing data). However, rather than implementing a “bag of words” technique and treating each word in a response as a separate feature, we instead maintained responses with multiple words as a single feature. That is, phrases such as “hard work” or “up and down” would be counted as singular features, distinct from the responses “hard”, “work”, “up”, and “down”. We reasoned that such phrases can capture a greater amount of semantic information than the individual words alone; for instance, “hard work” is not equivalent to either “hard” or “work” alone. We do, however, include the option to process data using the bag of words approach, which can be done by tweaking the parameters of the program (see Supplementary Material for more detail).

The RK processor builds upon the work of Buchanan and colleagues in several ways. First, although our method can be applied to words, we designed functions specifically for use in metaphor research, such as determining whether the features listed for the metaphor are characteristic of the topic word, the vehicle word, both the topic and vehicle words, or neither the topic nor vehicle word (i.e., emergent features). Second, the RK processor includes additional operations, such as commands for finding synonymous features, calculating interpretive diversity based on the distribution of features (Roncero & de Almeida, 2015; Utsumi, 2005, 2007, 2011), automatically calculating word similarity based on feature overlap, and multiple options for suffix removal. Third, the processing techniques have been packaged into an easy-to-use program that should be accessible to researchers with limited computer programming experience.

As a final note, it should be mentioned that a more accurate description of this program would be a “pre-processor”, as typically the extracted features would need to be further processed or analysed depending on the research question. For instance, McRae et al. (2005) classified features into basic knowledge types depending on whether the feature denoted visual information, primary sensory information, functional/motor information, or encyclopaedic knowledge. Furthermore, other statistical analyses and variables are often useful for understanding aspects of word processing, such as feature distinctiveness or intercorrelational density (McRae et al., 2005; Montefinese et al., 2013; Vivas et al., 2017). Because the specific classifications and statistical measures will likely vary depending on the research question, our program does not automate these processing steps (although we do include a function for automatically computing cosine similarity between words, as this is a widely used measure, Buchanan et al., 2019a, b; McRae et al., 2005; Montefinese et al., 2013; Vivas et al., 2017). Nonetheless, pre-processing features is a costly and time-consuming step in data analysis, so we believe that the RK processor will be a useful and time-saving tool for word and concept researchers. We encourage researchers to use this processor for their own datasets, and for those interested in more details of how to use the RK processor, the available commands and examples of usage with our dataset are presented in Supplementary Material.

The RK processor

The RK processor is a program for Python 3 that uses tools from the Natural Language Toolkit (nltk) package (Loper & Bird, 2002) to tokenize words, remove stop words, and remove suffixes from words using stemming and lemmatization. Although this paper will focus on the features extracted from our own dataset on metaphors, the program can process raw participant responses from other feature-listing datasets, provided the data is arranged in the correct format for the program. The processor can be applied to research on compound words, categories and concepts, similes, and potentially any other dataset in which participants list short responses as features. For the purposes of this paper, we present the processing steps of the RK processor in general terms and focus on the extracted feature data.

The RK processor includes five basic steps of processing. First, the raw response from each participant for each metaphor is split into a list of separate responses whenever a comma, period, slash, or paragraph indent occurs. For instance, if the participant’s raw response in the textbox is “calm, deep, peace”, this would be split into three separate features based on the comma placements. The second step strips punctuation (aside from underscores and hyphens) from each response and converts all letters to lowercase. This is so differences in punctuation or capitalization will not result in the same feature being categorized differently. The third step uses the tokenization function from the nltk package to separate each feature into individual words. Much of the time the feature is already a single word; however, some features may be listed as multiple words (e.g., “hard work”) or a phrase (e.g., “ups and downs”). These first three steps of processing occur automatically, whereas for the final two steps the researcher has to choose which optional parameters in the RK processor best meet their research aims.
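To make the first three steps concrete, the following minimal Python sketch shows how they might be implemented with standard tools from the nltk package; the function names are ours, and the snippet is illustrative rather than the RK processor’s actual code.

```python
import re
from nltk.tokenize import word_tokenize  # requires the nltk 'punkt' tokenizer models

def split_raw_response(raw):
    """Step 1: split a raw textbox entry into separate responses
    at commas, periods, slashes, and line breaks."""
    parts = re.split(r"[,./\n]", raw)
    return [p.strip() for p in parts if p.strip()]

def clean_response(response):
    """Step 2: lowercase and strip punctuation other than underscores and hyphens."""
    return re.sub(r"[^\w\s\-]", "", response.lower())

def tokenize_response(response):
    """Step 3: split the cleaned response into individual words."""
    return word_tokenize(response)

# Example: one participant's raw entry for a metaphor
raw = "calm, deep, peace"
features = [tokenize_response(clean_response(r)) for r in split_raw_response(raw)]
# features -> [['calm'], ['deep'], ['peace']]
```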

Tokenization is required for the final two steps of processing, namely, removal of stop words and stemming/lemmatization. First, the researcher must decide whether or not stop words should be removed. Removing stop words can improve the accuracy of counting features because stop words themselves hold little semantic content (Buchanan et al., 2019a, b). Stop words include function words such as “a”, “the”, “on”, “which”, etc. Participants will often use stop words in their responses; for instance, for a word such as butterfly, different participants may list “wings”, “has wings”, and “they have wings”. However, the words “has” and “they have” are clearly not critical aspects of these features; the critical aspect of all three responses is the feature “wings”. Because each response is technically a different string of letters, without removing stop words a computer program would consider them distinct features. Removing stop words, however, reduces all three responses to “wings”, and they would then be counted as the same feature. The nltk package includes a list of stop words, and we employed this list with some minor additions and subtractions (Footnote 1). The commands in the RK processor include a parameter for removing stop words with two options: “yes” (stop words will be removed) and “no” (stop words will not be removed). Because of the noted improvements in accuracy, the default setting in the program, and the one we used in the following analyses, is “yes”.
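A minimal sketch of the stop word step, assuming nltk’s English stop word list (the RK processor uses this list with minor modifications):

```python
from nltk.corpus import stopwords  # requires the nltk 'stopwords' corpus

STOP_WORDS = set(stopwords.words("english"))

def remove_stop_words(tokens):
    """Drop function words so 'wings', 'has wings', and 'they have wings'
    all reduce to the same content word(s)."""
    return [t for t in tokens if t not in STOP_WORDS]

for response in (["wings"], ["has", "wings"], ["they", "have", "wings"]):
    print(remove_stop_words(response))  # each prints ['wings']
```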

In the final step of processing, each word for each feature is either stemmed, lemmatized, or left unaltered depending on the parameters selected. The RK processor includes five options for stemming or lemmatizing: “snowball”, “porter”, “lancaster”, “lemmatize”, and “original”. The first three options are stemming algorithms used for stripping suffixes that are included in the nltk package. The algorithms are based on the “Snowball stemmer” (Porter, 2001), the “Porter stemmer” (Porter, 1980), and the “Lancaster stemmer” (also known as the “Paice/Husk stemmer”; Paice, 1990). The Porter stemmer is the most conservative of the three stemmers, whereas the Lancaster stemmer is the most aggressive and sometimes results in over-stemming. The Snowball stemmer is very close to the Porter stemmer but is slightly more liberal. As stemming algorithms follow a preset pattern based on the letters on the end of a word, stemming can result in non-words (see Table 1 below for an example). In contrast to the stemmers, the lemmatize option reduces words to their lemmas and considers whether the word is a noun, verb, adjective, or adverb. The lemmatize option uses functions within the nltk package that access the WordNet corpus to obtain part of speech tags and lemmas. The advantage of the lemmatizer is that the lemmas are complete words, unlike stems. Lastly, the “original” option will leave the words as is, without stemming or lemmatizing.

Table 1 Examples of how seven different words would be processed using the five different parameters of stemming and lemmatizing

For demonstration, consider the following seven words: “adventure”, “adventured”, “adventurer”, “adventures”, “adventuring”, “adventurous”, and “advent”. Table 1 shows what each word would be converted to for each of the five processes.
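The following sketch shows how comparisons like those in Table 1 can be reproduced with the nltk stemmers and the WordNet lemmatizer; note that, unlike the RK processor, this simplified version does not supply part-of-speech tags to the lemmatizer.

```python
from nltk.stem import SnowballStemmer, PorterStemmer, LancasterStemmer, WordNetLemmatizer

words = ["adventure", "adventured", "adventurer", "adventures",
         "adventuring", "adventurous", "advent"]

snowball = SnowballStemmer("english")
porter = PorterStemmer()
lancaster = LancasterStemmer()
lemmatizer = WordNetLemmatizer()  # requires the nltk 'wordnet' corpus

for w in words:
    # Print the original word followed by each stemmed/lemmatized form
    print(w, snowball.stem(w), porter.stem(w), lancaster.stem(w),
          lemmatizer.lemmatize(w))
```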

Depending on the research project, there may be reasons for selecting one process over another. However, in our own simulations, we found the Snowball stemmer to produce the most intuitive feature groupings. As is evident in Table 1, the Snowball stemmer reduces the first six words to the same stem (meaning these words would all be grouped together as the same feature), capturing the notion that all of these words seem to be tapping into an “adventure” feature. The Lancaster stemmer, however, also groups the word “advent”, which has a substantially different meaning from “adventure”, with the other features. In contrast, the lemmatizer seems overly conservative, as both “adventurer” and “adventurous” are treated as distinct features from “adventure”. For these reasons, the Snowball stemmer, which is neither overly liberal nor overly conservative in grouping features, is set as the default option. We used this option to process the data we present in the following sections.

After the response is tokenized, stop words are removed, and each remaining word is stemmed or lemmatized, the default setting will rejoin the stems or lemmas (if there are more than one) as a single response. For instance, for the response “they are hard working” under the default Snowball stemming and stop word removal parameters, the words “they” and “are” would be removed as they are in the list of stop words, “hard” would remain unaltered as it does not include a suffix, and “working” would be stemmed to “work”. The two remaining stems, “hard” and “work”, would then be rejoined into a single response—“hard work”. If the bag of words parameter is set to “yes”, however, each stem or lemma is treated as a separate feature (as done in Buchanan et al., 2019b). Therefore, the above response would be counted as two distinct features: “hard” and “work”. By default, the bag of words parameter is set to “no”.
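A sketch of this final step follows; the bag-of-words behaviour is controlled here by an illustrative parameter whose name may differ from that in the RK processor.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

def extract_features(tokens, bag_of_words=False):
    """Stem each remaining word; either rejoin the stems into one
    multi-word feature (the default) or treat each stem as its own feature."""
    stems = [stemmer.stem(t) for t in tokens]
    return stems if bag_of_words else [" ".join(stems)]

tokens = ["hard", "working"]  # "they are hard working" after stop-word removal
print(extract_features(tokens))                     # ['hard work']
print(extract_features(tokens, bag_of_words=True))  # ['hard', 'work']
```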

In addition to these five basic steps, there is an option to automatically detect and correct spelling mistakes. This option should be used with caution, as there is potential for false positives or corrections to words the participant did not intend (see also Buchanan et al., 2019a). For this reason, we also include commands that detail exactly which responses were detected as misspellings and to what they are corrected (see Supplementary Material). For the purposes of this paper, and to err on the side of caution, we do not use the automatic spell corrector in our simulations. However, this option is available to researchers, and it is fairly simple to customize the spellchecker, such as by adding or subtracting words from the spellcheck dictionary or even loading a custom dictionary for use with the spellchecker.

After the participant responses are processed, the RK processor includes commands that automatically count features, calculate interpretive diversity (Utsumi, 2005, 2007, 2011), and determine which metaphor features were also listed for the topic and vehicle words when presented on their own (see Supplementary Material for a short description and demonstration of the RK processor commands). The counting is fairly simple after the responses have been processed: any two responses that are identical after the five steps of processing outlined above will be counted as the same feature. In addition, non-identical responses will be counted together if they vary only in spacing (e.g., “hard work” and “hardwork”), hyphenation (e.g., “hard-work” and “hard work” or “hardwork”), or word order (e.g., “hard work” and “work hard”). The researcher may also set the threshold for the number of participants that need to list a given feature for it to be retained in the count. By default, the threshold is set to two participants, as this criterion is commonly used in metaphor studies (Becker, 1997; Roncero & de Almeida, 2015; Utsumi, 2005). That is, if only one participant listed a certain feature for a concept or metaphor, that feature will not be included in the count.
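The counting logic can be sketched as follows, where each response is assumed to come from a different participant and the normalization key approximates the spacing, hyphenation, and word-order equivalences described above:

```python
from collections import Counter

def normalize(feature):
    """Treat features as equivalent if they differ only in word order,
    spacing, or hyphenation (e.g., 'hard work', 'work hard', 'hard-work')."""
    words = sorted(feature.replace("-", " ").split())
    return "".join(words)

def count_features(responses, threshold=2):
    """Count how many participants listed each feature and keep only
    features listed by at least `threshold` participants."""
    counts = Counter(normalize(r) for r in responses)
    return {feat: n for feat, n in counts.items() if n >= threshold}

responses = ["hard work", "work hard", "hardwork", "calm", "calm", "deep"]
print(count_features(responses))  # {'hardwork': 3, 'calm': 2}
```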

For the purposes of this study, we will focus on the metaphor feature counts obtained by the RK processor and the categories the features fall under: topic (feature was listed for the topic alone, but not the vehicle alone), vehicle (feature was listed for the vehicle alone, but not the topic alone), shared (feature was listed for both the topic alone and the vehicle alone), and emergent (feature was not listed for either the topic alone or the vehicle alone). We will also report interpretive diversity values (Utsumi, 2005, 2007, 2011), which relate to the distribution of listed features (see description below).
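The categorization logic is straightforward to sketch; the feature sets below are hypothetical and the function is illustrative rather than the processor’s actual implementation.

```python
def categorize(metaphor_features, topic_features, vehicle_features):
    """Assign each metaphor feature to one of four categories based on
    whether it was also listed for the topic and/or vehicle alone."""
    categories = {}
    for feat in metaphor_features:
        in_topic = feat in topic_features
        in_vehicle = feat in vehicle_features
        if in_topic and in_vehicle:
            categories[feat] = "shared"
        elif in_topic:
            categories[feat] = "topic"
        elif in_vehicle:
            categories[feat] = "vehicle"
        else:
            categories[feat] = "emergent"
    return categories

metaphor = ["valuable", "money", "slippery", "smooth"]  # hypothetical metaphor features
topic = {"money", "valuable"}                           # features listed for the topic alone
vehicle = {"slippery", "valuable"}                      # features listed for the vehicle alone
print(categorize(metaphor, topic, vehicle))
# {'valuable': 'shared', 'money': 'topic', 'slippery': 'vehicle', 'smooth': 'emergent'}
```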

Demonstrating and applying the RK processor on a database

We obtained a dataset of participant-listed features for 88 metaphors from the Katz et al. (1988) norms and the 146 unique words that constitute the topics and vehicles in these metaphors. These norms have been used extensively in metaphor research, for instance, to examine comparison and categorization processes (Bowdle & Gentner, 2005), aptness and conventionality (Jones & Estes, 2006), processing fluency (Thibodeau & Durgin, 2011), effects of topic abstractness on appreciation and interpretation (Xu, 2010), effects of semantic space on comprehension (Al-Azary & Buchanan, 2017), factors that contribute to aesthetic liking (Jacobs & Kinder, 2017), and differences between literary and non-literary metaphors (Jacobs & Kinder, 2018). Although the Katz et al. norms are now over 30 years old, Campbell and Raney (2016) recently replicated the study and found that the metaphor ratings (e.g., comprehensibility, familiarity, etc.) remained consistent over time and across different populations of native and non-native English speakers. Although there were magnitude differences between the two samples across all metaphors, the ratings of the individual metaphors were highly correlated across samples. We opted to use the original Katz et al. data because Campbell and Raney replicated only a subset of the metaphors, but their study demonstrates that, despite the age of the original norms, those ratings are still relevant today.

After collecting the data, the RK processor was used to automatically process the raw participant responses. The only manual processes were removing participants who did the task incorrectly, copying and pasting the relevant data into clean Microsoft Excel files, excluding data such as the participants’ age and gender, and saving the files in .csv format.

In our application we obtained data from 155 (90 female) participants from Western University, with reported ages ranging from 17 to 77 (mean = 18.91, SD = 5.09). For the data presented here, 17 participants were removed for completing the task incorrectly (Footnote 2). As noted above, 88 “A is B” metaphors taken from the Katz et al. (1988) non-literary norms were used as stimuli. Metaphors were selected if they contained a one-word topic, a one-word vehicle, and no words other than filler words (i.e., a, an, are, is, and the). We also presented the topic and vehicle words in isolation, as in other studies exploring emergent features (Becker, 1997; Gineste et al., 2000; Roncero & de Almeida, 2015; Utsumi, 2005). There were 146 unique words contained in the 88 “A is B” metaphors, excluding the filler words mentioned above. Data collection took place online using the Qualtrics survey platform. There were four groups of participants. One group listed features for half of the metaphors (n = 36), one group listed features for the other half of the metaphors (n = 37), one group listed features for half of the words (n = 32), and one group listed features for the other half of the words (n = 33). Metaphor feature-listing studies typically include about 20 participants per condition (Gineste et al., 2000; Nueckles & Janetzko, 1997; Roncero & de Almeida, 2015; Tourangeau & Rips, 1991; Utsumi, 2005). Therefore, we had slightly higher statistical power than previous studies employing similar designs. Finally, unlike some previous studies (e.g., Gineste et al., 2000; Utsumi, 2005), participants listed features either only for words or only for metaphors (as done in Becker, 1997; Roncero & de Almeida, 2015), to minimize the likelihood that participants would be biased towards thinking about the words in a metaphorical way. Despite the slightly larger number of participants in the metaphor groups, the average number of features listed per metaphor (91.65, SD = 4.45) and per word (91.65, SD = 4.59) were almost identical.

For the metaphor groups, the participants were instructed to list three features or characteristics of the topic that were being described by the vehicle (as done by Utsumi, 2005). For the word groups, the participants were asked simply to list three features or characteristics of the word. The metaphors and words were displayed on the screen one at a time, with a single textbox below for entering the responses. There was no time limit for any of the responses. The features generated for the topics, vehicles, and metaphors were then input and analysed with the RK processor using the default settings (stop words removed, Snowball stemming, no spelling correction, and only features listed by at least two participants retained in analyses). We present data obtained on that database first for individual words, then as applied to metaphor studies, and finally we suggest additional applications of the RK processor.

Word data

The average number of features per word and the average production frequency per feature (number of participants to list the feature) are displayed in Table 2, along with similar descriptive statistics from other word studies.

Table 2 Average number of features listed per word and average production frequency per feature across our dataset and other English word datasets

These descriptive statistics are not directly comparable for various methodological reasons; for instance, we asked participants to only list three features, as is done typically in metaphor studies, whereas McRae et al. (2005) gave participants sheets with ten blank spaces, encouraging more features to be listed per participant. McRae et al. (2005) and Buchanan et al. (2019b) also had stricter criteria for retaining features in analyses, only including features listed by over 16% of participants. Nonetheless, we provide these descriptive statistics to roughly show how our dataset compares to other word feature datasets.

Calculating word similarity based on feature overlap

As mentioned in the introduction, feature lists are commonly used to compute semantic similarity between word pairs. Similarity is computed by treating the words’ feature production counts (i.e., number of participants that generated each feature) as vectors and calculating the cosine between the vectors, which is equal to the dot product of the two vectors divided by the product of the two vectors’ lengths (see Buchanan et al., 2019b; McRae et al., 2005). Feature-based cosine similarity aligns with subjective ratings of word similarity (Maki et al., 2006; McRae et al., 1997) and similarity values estimated from word co-occurrences (Buchanan et al., 2019b; Maki et al., 2006), and is also predictive of semantic priming effects (Buchanan et al., 2019b; McRae et al., 1997).
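In code, the computation reduces to a few lines; the production counts below are hypothetical and serve only to illustrate the calculation.

```python
import math

def cosine_similarity(counts_a, counts_b):
    """Cosine between two feature-count dictionaries: the dot product divided
    by the product of the two vectors' Euclidean norms."""
    dot = sum(counts_a[f] * counts_b.get(f, 0) for f in counts_a)
    norm_a = math.sqrt(sum(c ** 2 for c in counts_a.values()))
    norm_b = math.sqrt(sum(c ** 2 for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical production counts for two words
banana = {"yellow": 12, "fruit": 9, "sweet": 4}
lemon = {"yellow": 10, "fruit": 8, "sour": 6}
print(cosine_similarity(banana, lemon))  # similarity driven by the shared 'yellow' and 'fruit' counts
```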

Because cosine similarity is an important measure in concept research, we included a function in the RK processor to automatically compute cosine similarity between word pairs. We also conducted a brief analysis to compare cosine similarities from the RK processor to similarity estimates from the Global Vectors (GloVe) model of word representation. GloVe is a popular vector space model that estimates word similarity based on co-occurrences from large text corpora and has been demonstrated to outperform LSA- and word2vec-type models on word similarity and analogy tasks (Pennington et al., 2014). The model was pretrained on a dump of Wikipedia from February 2017 (available for download from vectors.nlpl.eu/repository, model ID = 8).

Of the 146 words in our dataset, 143 were included in the GloVe model (the words “coffeepot”, “fiords”, and “trash-masher” were not in the model). This yielded 10,153 unique word pairs; however, the majority of these (8219) were unrelated and had zero featural cosine similarity. Including a substantial number of unrelated word pairs would skew the analysis; therefore, we employed a method used by Maki et al. (2006) to sample non-zero cosine values across ranges of similarity. Specifically, we randomly selected five word pairs with non-zero featural similarity values from each of ten cosine ranges: 0–0.1, 0.1–0.2, … , 0.9–1.0. The ranges 0.8–0.9 and 0.9–1.0 had fewer than five pairs (3 and 1, respectively), so we included all word pairs in these ranges. Using these 44 word pairs, we computed the correlation between the featural cosine values extracted by the RK processor and GloVe’s similarity measure. To ensure a stable measure, we repeated this sampling procedure 1000 times and took the average correlation coefficient across the samples, which was r(42) = .52, p < .001 (Footnote 3). Ninety-five percent of the samples had correlation coefficients between .351 (p < .03) and .678 (p < .001). For comparison, Maki et al. (2006) found a correlation coefficient of .588 between cosine similarity using the McRae et al. (2005) feature norms and LSA similarity. Therefore, the strong positive correlation between cosine similarity computed using the RK processor and GloVe’s similarity measure generally aligns with Maki et al. and supports the validity of the extracted features.
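The sampling procedure can be sketched as follows, assuming the featural cosine and GloVe similarity for each word pair have already been computed (scipy is used here for the Pearson correlation):

```python
import random
from statistics import mean
from scipy.stats import pearsonr

def sampled_correlation(pairs, n_samples=1000, per_bin=5, seed=0):
    """pairs: (featural_cosine, glove_similarity) tuples for word pairs with
    non-zero featural cosine. Sample up to `per_bin` pairs from each of ten
    cosine ranges, correlate the two measures, and average over resamples."""
    rng = random.Random(seed)
    bins = [[] for _ in range(10)]
    for cos, glove in pairs:
        bins[min(int(cos * 10), 9)].append((cos, glove))
    rs = []
    for _ in range(n_samples):
        sample = []
        for b in bins:
            sample.extend(rng.sample(b, min(per_bin, len(b))))
        x, y = zip(*sample)
        rs.append(pearsonr(x, y)[0])
    return mean(rs)

# average_r = sampled_correlation(pairs)  # pairs assumed to be precomputed
```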

We also examined the correlation between the feature-based cosine similarity measure and subjective ratings of “semantic relatedness” between the topic–vehicle pairs in the Katz et al. (1988) norms. For comparison, we also examined the relationship between GloVe’s similarity values and the Katz et al. ratings. Of the 88 topic–vehicle pairs, three included vehicle words (underlined below) that were not in the GloVe model (“a storm is a coffeepot”, “wounds are fiords”, and “memory is a trash-masher”), so these metaphors were removed from analysis, leading to 85 remaining pairs. There was a significant positive correlation between the RK feature-based cosine similarity values and the Katz et al. semantic relatedness ratings, r(83) = .21, p = .029, one-tailed. For comparison, the correlation between GloVe’s similarity values and the Katz et al. ratings was r(83) = .15, p = .088, one-tailed. Dunn and Clark’s (1969) z test revealed that these two correlation coefficients did not differ significantly, z = 0.42, p = .677.

A limitation with the above analyses is that metaphors involve two dissimilar terms, and therefore, comparing topic and vehicle pairs results in a restricted range of low-similarity word pairs. For instance, the correlation between RK cosine similarity and GloVe similarity on these 85 topic–vehicle pairs is only r(83) = .18, p = .046 (one-tailed), which is substantially lower than when word pairs are sampled across the ranges of similarity. Nonetheless, these analyses demonstrate that, even with that limitation, similarity as measured by feature data extracted from the RK processor aligns with participants’ judgments of semantic relatedness and is no worse at predicting these judgments (at least for this dataset) than GloVe, one of the best current co-occurrence models.

Analyses of the features extracted by the RK processor relevant for the study of metaphors

We examined the features produced to the metaphors under four categories: those listed for the topic when it had been presented alone (i.e., not embedded in a metaphor), for the vehicle alone, for both the topic and vehicle when each was presented separately, or for neither the topic nor the vehicle alone (i.e., features that “emerge” when the two concepts are combined in a metaphor). The data also allow for the analysis of the number of “types” and the number of “tokens”, as has been done in some previous studies (Becker, 1997; Nueckles & Janetzko, 1997; Tourangeau & Rips, 1991; Utsumi, 2005). Types refer to the number of unique features, regardless of how frequently they were listed by participants, whereas tokens include the participant counts for each of the features. For example, assume that three participants listed the feature “calm” and two participants listed the feature “beautiful”, and that “calm” is a topic feature whereas “beautiful” is a vehicle feature. The number of topic types and vehicle types would both be 1, but the number of topic and vehicle tokens would be 3 and 2, respectively.

Given its importance in metaphor studies (Utsumi, 2005, 2007, 2011), the RK processor can also compute interpretive diversity automatically, i.e., a measure of the distribution of generated features, using the equation for interpretive diversity derived from Shannon’s (1948) measure of entropy (see Utsumi, 2005):

$$ H(X) = -\sum_{x \in X} p(x) \log_2 p(x) $$

In the above equation, p(x) is the probability that a given feature is listed, as calculated by dividing the number of times a given feature was listed by the number of times any feature was listed.
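A sketch of the calculation from a dictionary of feature production counts (the counts shown are hypothetical):

```python
import math

def interpretive_diversity(counts):
    """Shannon entropy over the feature distribution: p(x) is a feature's
    production count divided by the total number of feature tokens."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)

# Hypothetical feature counts for a metaphor
print(interpretive_diversity({"calm": 3, "beautiful": 2, "green": 1}))
```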

For illustrative purposes, the type, token, and interpretive diversity output for the metaphor feature-listing produced to the metaphor “A forest is a harp” is presented in Table 3. Table 4 displays the mean distribution of types and tokens by category across all metaphors in the dataset (for the distributions and interpretive diversity scores of each individual metaphor, see Appendix 1).

Table 3 Feature list for the metaphor “a forest is a harp”. Count refers to the number of participants that listed the feature shown in the column farthest to the left. Int Div = interpretive diversity
Table 4 Mean number and percentage of feature types and tokens by category across the 88 metaphors in the dataset. Standard deviations are included in parentheses. Int Div = interpretive diversity

Because we were interested in examining how the features extracted and processed by the RK processor “perform” relative to those extracted manually, we first looked at the characteristics of our dataset. Statistical analyses indicated that the number of types varied significantly by category, F(2.12, 184.83) = 238.17, p < .001, ηp² = .732 (degrees of freedom adjusted using the Greenhouse-Geisser correction). Post hoc tests with Bonferroni corrections indicated that all contrasts were significant (p < .001) except between topic and vehicle types (p = .203): emergent types > topic types = vehicle types > shared types. These data replicate previous findings, based on manual coding of features, that emergent types are listed for metaphors more frequently than the other category types (Gineste et al., 2000; Nueckles & Janetzko, 1997; Tourangeau & Rips, 1991; Utsumi, 2005). The number of tokens also varied significantly by category, F(3, 261) = 74.08, p < .001, ηp² = .460. Post hoc tests with Bonferroni corrections revealed a pattern similar to the types analysis, as all contrasts were significant (p < .001) except between topic and vehicle tokens (p > .999): emergent tokens > topic tokens = vehicle tokens > shared tokens.

Further analyses of these data replicated two major findings in the metaphor literature. First, although fewer shared types were listed than the other categories on average, when they were listed, they were listed by more participants (Footnote 4; Nueckles & Janetzko, 1997). Second, interpretive diversity was strongly correlated with both the number of emergent types, r(86) = .75, p < .001, and emergent tokens, r(86) = .58, p < .001. In fact, our correlation coefficients were remarkably similar to Utsumi’s (2005).

We then examined the percentages of types and tokens from previous metaphor feature-listing studies and compared them to those derived in our dataset by the RK processor (see Table 5).

Table 5 Percentages of types by category for other metaphor feature-listing studies

Excepting the Roncero and de Almeida (2015) dataset, the other metaphor studies that obtained listed features do not report standard deviations for the percentages of types and tokens by category, so it is difficult to statistically compare our results to these studies. To provide an approximation, we computed the average and standard deviation for these values across studies. For topic, vehicle, and emergent features, the percentages of types and tokens we obtained in our study all fell within one standard deviation of the mean across previous studies. The percentages of shared types and tokens were less consistent with previous studies, but both fell within two standard deviations of the mean. Thus the data from the current study fall generally in line with past studies; the greatest discrepancy, for our dataset and all the other datasets, is with Roncero and de Almeida.

It is unclear why Roncero and de Almeida’s dataset is less comparable than all the other studies. Because they provide their full dataset, we were able to do a statistical comparison between our and their datasets, but because they did not categorize their data by the percentage of topic, vehicle, shared, and emergent types and tokens, we were forced to estimate how their data would be categorized. The following calculations are based on our analysis of their dataset. For the 20 or so cases in which it was unclear whether the same feature was listed for the metaphor and either the topic or vehicle (e.g., “dependable” and “dependent”), the two authors considered whether or not the listed features were equivalent, and unless both raters agreed that they were, the features were treated as distinct. After computing the percentages of types and tokens by category in Roncero and de Almeida’s (2015) dataset (see Table 5), we conducted eight t test comparisons (alpha = .006) between the percentages of topic, vehicle, shared, and emergent types and tokens across the two studies. Here we found that our study differed significantly from Roncero and de Almeida’s in all comparisons except those with shared types and tokens (vehicle types: p = .006; all other significant ps < .006).

We are unsure why our dataset differed from the Roncero and de Almeida dataset and why their dataset diverges from the other metaphor studies. One possibility is that our own analysis of their dataset may have been overly conservative in counting features listed for the metaphor and either the topic or vehicle as equivalent. Even with pre-processed feature lists for the metaphors, topics, and vehicles provided, we still found a handful of cases in which it was difficult to decide whether the feature listed for the metaphor was the same as that listed for the topic/vehicle. This further highlights the utility of an automated processing method, so such analyses can be reproduced easily.

In summary, our results generally align with previous metaphor studies when features are counted and combined manually. Several major findings were replicated. First, emergent types were listed more frequently than the other category types (Gineste et al., 2000; Nueckles & Janetzko, 1997; Tourangeau & Rips, 1991; Utsumi, 2005). Second, emergent types and tokens were strongly correlated with interpretive diversity scores (Utsumi, 2005). Third, although shared types were the least common, when these features were listed, they were listed by more participants compared to the other categories, directly replicating Nueckles and Janetzko (1997) and conceptually aligning with Becker’s (1997) finding that shared features are rated as highly important in metaphor interpretation.

The percentages of shared types and tokens were lower than the other studies on average. This could potentially be a limitation of not considering synonyms as equivalent features. For instance, if “powerful” is listed for the metaphor but “strong” is listed for the topic and vehicle, “powerful” would not be considered a shared feature, despite having obvious similarity to both the topic and vehicle. Therefore, the participants listing features for the metaphor, topic, and vehicle would all need to use the exact same word to describe these concepts for the feature to be considered “shared”, which may be why we obtained a lower percentage of this feature category.

We have shown above that the RK processor can successfully extract metaphor features from participant data that produce metaphor findings analogous to studies based on the more laborious, subjective manual procedure. We next wanted to examine whether the features extracted in our dataset can be used to make meaningful predictions about metaphor judgments. We considered two dimensions, comprehensibility and appreciation, that have received much attention in the literature (e.g., Al-Azary & Buchanan, 2017; Chiappe et al., 2003; Utsumi, 2005; Xu, 2010). The Katz et al. (1988) norms include subjective ratings of “comprehensibility” and “metaphor goodness”, the latter of which measures “how good, apt, and pleasing” a metaphor is (p. 197). We examined whether the features extracted by the RK processor could predict scores on these two dimensions (for completeness, we also include a table displaying the intercorrelations between the type and token categories and the correlations between the type and token categories and the ten metaphor dimensions from Katz et al. in Appendix 2).

The number of types and tokens by category (i.e., topic, vehicle, shared, and emergent) are highly correlated (see Appendix 2, Table 10) and intercorrelated predictors are problematic in regression analyses; therefore, the types and tokens were analysed separately. The pattern of results was similar with both types and tokens, but the regression models with types accounted for more variance than with tokens in each case, so for simplicity of exposition, we will focus only on the number of types. Some of the type categories were also intercorrelated, but they had low variance inflation factors in the regression models (topic types: 1.14, vehicle types: 1.18, shared types: 1.07, emergent types: 1.14), indicating that the amount of multicollinearity was not problematic.

We conducted a stepwise multiple regression analysis with the rating as the predicted variable and the number of types for each of the four categories (topic, vehicle, shared, or emergent) as the predictors (Footnote 5). Stepwise selection begins with no predictors and first adds the most predictive variable. After a new variable is added, variables that no longer improve the prediction are removed. This process results in a regression model that retains only the variables that add significantly to the prediction.
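For readers unfamiliar with the procedure, a sketch of p-value-based forward-backward stepwise selection using statsmodels is shown below; the entry and removal thresholds, the synthetic data, and the function itself are illustrative assumptions rather than the exact procedure used in our analyses.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def stepwise_select(X, y, p_enter=0.05, p_remove=0.10):
    """Start with no predictors; add the candidate with the smallest p-value
    below p_enter, then drop any included predictor whose p-value exceeds
    p_remove. Repeat until no change occurs."""
    included = []
    while True:
        changed = False
        # Forward step: try each excluded predictor
        pvals = {}
        for c in [c for c in X.columns if c not in included]:
            model = sm.OLS(y, sm.add_constant(X[included + [c]])).fit()
            pvals[c] = model.pvalues[c]
        if pvals and min(pvals.values()) < p_enter:
            included.append(min(pvals, key=pvals.get))
            changed = True
        # Backward step: drop predictors that no longer contribute
        if included:
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            pv = model.pvalues.drop("const")
            if pv.max() > p_remove:
                included.remove(pv.idxmax())
                changed = True
        if not changed:
            return included

# Illustrative synthetic data: four type counts predicting a rating
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.poisson(5, size=(88, 4)),
                 columns=["topic", "vehicle", "shared", "emergent"])
y = 3 + 0.4 * X["vehicle"] + 0.3 * X["shared"] + rng.normal(0, 1, 88)
print(stepwise_select(X, y))  # typically retains 'vehicle' and 'shared'
```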

For both ratings, the stepwise selection procedure resulted in a regression model that significantly predicted the rating. The number of vehicle and shared types were retained in the comprehensibility model and accounted for 12% of the variance, F(2, 85) = 5.95, p = .004. Both the vehicle and shared types were positively associated with comprehensibility and were significant independent predictors (the full models with all category types and the final models after stepwise selection are included in Appendix 3). For metaphor goodness, only the number of vehicle feature types was retained in the regression model and accounted for 7% of the variance, F(1, 86) = 6.24, p = .014. Again, the association of vehicle types with the rating was positive. These data confirm the oft-claimed argument for the primacy of the vehicle in metaphor processing (see Katz, 1989) and further indicate that metaphors are more comprehensible and appreciated when the vehicle contributes more features to the interpretation.

Also, while our data are not directly related to process models, the importance of shared features to comprehensibility, though low in number, is suggestive of the first step in structure-mapping theory (Bowdle & Gentner, 2005; Gentner, 1983), which involves a stage of symmetrical alignment. Topic features were not predictive of either comprehensibility or metaphor goodness, a finding that aligns with Ortony’s (1979) salience imbalance model which posits that metaphors are defined by features that are of low salience to the topic but of high salience to the vehicle. Also, shared features were not predictive of metaphor goodness. This could be because people appreciate metaphors more when they highlight a novel aspect of the topic that is not obvious when the topic is considered on its own. Lastly, emergent features were not predictive of either comprehensibility or metaphor goodness. This was more surprising, as emergent features have been proposed to be important, especially for metaphor appreciation (Utsumi, 2005). However, it is possible that this relationship is more complex; for instance, emergent features tend to be less agreed upon amongst participants, as indicated by their lower token per type average. Therefore, in some cases a high number of emergent features may indicate less agreement over what the metaphor meant, which could mean the metaphor was more difficult to comprehend and appreciate.

Regardless of the theoretical underpinnings, the feature types extracted by the RK processor were useful for predicting both comprehensibility and metaphor goodness ratings from the Katz et al. (1988) norms. This suggests that the algorithm extracts data that are meaningful for studying metaphor and that have predictive power on subjective ratings relevant to metaphor processing.

Suggestions for future applications

As the RK processor has broad applications in any research using the feature-listing task, we briefly consider some potential research questions for which it could be applied.

Compound words

One of the most direct potential applications of the RK processor would be for compound word research. Compounds are similar to metaphors in that they contain two constituents, and one of the main research topics in the compound word literature concerns how the two constituents contribute to the semantic representation of the compound (Gagné et al., 2019). Feature-listing could easily be applied to this question. For instance, one could obtain listed features for both constituent words on their own (e.g., “snow” and “man”) as well as the compound word (“snowman”). Similar to the current study, one could then analyse which features of the compound were also listed for the first constituent only, the second constituent only, both constituents, or neither constituent (i.e., emergent features). This would provide a measure of the degree to which each constituent contributes to the semantic representation of the compound. The percentage of emergent features could also provide a measure of semantic transparency (Gagné et al., 2019). Compounds low in semantic transparency (e.g., honeymoon) should have more emergent features, as the constituents (e.g., honey and moon) are less related to the meaning of the compound.

Vector space modelling of metaphor comprehension

Feature-listing data have been used to assess computational models of metaphor by comparing the features predicted by the model to features generated by actual participants (Kintsch & Bowles, 2002; Utsumi, 2011). Some of the popular metaphor comprehension models use vectors to represent words (see Reid & Katz, 2018, for a review), and there are packages available for Python that make vector space modelling relatively simple, such as gensim (Řehůřek & Sojka, 2010; available for download from https://radimrehurek.com/gensim/). Because the RK processor outputs data in Python, the participant-generated features can be used in conjunction with vector space modelling fairly easily. For instance, a centroid vector of the participant-listed features could be computed and compared to the metaphor vector output by Kintsch’s (2000) predication algorithm or Utsumi’s (2011) comparison algorithm to see how closely these models align with participants’ interpretations (Footnote 6). Reid et al. (2020) recently used the RK processor along with gensim to examine whether interpretive diversity scores for metaphors could be modelled using distributional semantics.
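As a sketch of this workflow, the snippet below loads pretrained vectors with gensim, averages the vectors of the participant-listed features, and compares that centroid to a model-derived metaphor vector; the file name and feature list are illustrative, and the simple topic-plus-vehicle vector sum is a stand-in for, not an implementation of, the predication or comparison algorithms.

```python
import numpy as np
from gensim.models import KeyedVectors

# Load pretrained word vectors in word2vec text format (file name is illustrative)
kv = KeyedVectors.load_word2vec_format("glove_wiki.txt", binary=False)

def centroid(words, kv):
    """Average the vectors of the participant-listed features found in the model."""
    vecs = [kv[w] for w in words if w in kv]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

features = ["peaceful", "musical", "tall"]  # hypothetical features for "a forest is a harp"
model_vector = kv["forest"] + kv["harp"]    # stand-in for a model's metaphor vector
print(cosine(centroid(features, kv), model_vector))
```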

Categories and exemplars

Feature-listing has also been used to explore category structures (Rosch & Mervis, 1975). The RK processor could also be applied to category research and could process either listed features or listed exemplars. Furthermore, there is no reason that more complex categories, such as ad hoc categories (e.g., “things that have a smell”; Barsalou, 1983), could not be analysed.

Incongruity

Feature-listing may also be used to measure incongruity, which has been proposed as an important mechanism for evoking both poetic appreciation (Utsumi, 2005, 2006) and humour (Boylan, 2018; Nakamura et al., 2018). Boylan (2018) explored incongruity in relation to humour evoked by puns. In one study, participants were given a critical word used in a pun and were asked to list associated concepts for the two alternative meanings of this word. For a task such as this, the RK processor could easily process the concepts listed for both meanings and determine the degree of overlap between the two lists of concepts. When the two meanings evoke fewer shared features, the pun would be considered more incongruous. Therefore, the RK processor could be used to obtain estimates of incongruity and make predictions on poetic appreciation and humour.

Semantic ambiguity

The RK processor could potentially be used to explore how semantically ambiguous words are represented. Although ambiguity is often treated as a categorical variable (e.g., monosemes, polysemes, and homonyms), Beekhuizen et al. (2018) argue that ambiguity more likely varies on a continuum. One could envision a study in which participants list senses of different ambiguous words. The distribution of senses could then be analysed to determine the level of ambiguity for each word. In fact, the RK processor already includes a function for calculating interpretive diversity, which should theoretically parallel ambiguity: ambiguity should be highest when many senses are listed at about equal frequencies (high interpretive diversity) and lowest when a single sense dominates (low interpretive diversity). Thus, ambiguity could easily be estimated using the RK processor.

Literary versus non-literary metaphors

Lastly, the RK processor could be used to examine differences between the features evoked by literary and non-literary metaphors (Katz et al., 1988). Differences between literary and non-literary metaphors have been found in some subjective ratings (Jacobs & Kinder, 2018) and in neural activation (Chen et al., 2016); however, to our knowledge, the features evoked by literary and non-literary metaphors have not been compared. Literary and non-literary metaphors may differ in terms of the percentage of topic, vehicle, shared, and emergent features evoked, or may differ in interpretive diversity. We are currently collecting feature data on a subset of literary metaphors from the Katz et al. (1988) norms to explore these possibilities.

Limitations

Finally, although the RK processor offers a time-saving method for combining features, one that in principle has widespread applicability, and although the extracted features can be used to make meaningful predictions about metaphor processing, the algorithm has some limitations compared with human judgments that researchers should consider before deciding to employ it.

First, the processor does not handle sentence or phrase responses well. For instance, considering the metaphor “money is a lubricant” in our dataset, the responses “makes life easier” and “makes things easier” were counted as distinct features. A human may well categorize these responses as the same feature, as both share the common component “makes easier”. However, there is no way for the processor to know which words in the phrase constitute the central feature, aside from removing stop words. “Life” and “things” are not considered stop words, as these could be important features of other metaphors or words. Alternatively, the processor could be programmed to count each word as a separate feature. However, this would lead to inaccuracies for phrases such as “not true” or “not very big” in which the preceding word “not” alters the meaning of the critical word. For this reason, we included phrases as a single response rather than counting each word as a separate feature.
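To illustrate why such responses survive as distinct features, here is a minimal sketch of stop-word removal assuming NLTK’s English stop-word list, which behaves as described above; this is only an illustration of the general issue, not the RK processor’s exact pipeline.

```python
from nltk.corpus import stopwords  # may first require nltk.download('stopwords')

stops = set(stopwords.words("english"))

def content_words(response):
    """Keep only the words that are not on the stop-word list."""
    return {w for w in response.lower().split() if w not in stops}

# "life" and "things" are not stop words, so the residual word sets differ
# and the two responses end up counted as distinct features.
print(content_words("makes life easier"))    # e.g., {'makes', 'life', 'easier'}
print(content_words("makes things easier"))  # e.g., {'makes', 'things', 'easier'}
```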

One way to mitigate this issue with phrases is to encourage participants to list only single words as features in the instructions for the feature-listing task. For instance, in our instructions where we give examples of generated features, all examples were single-word responses. As a result, we found that multiple-word responses, especially those containing conjoint features such as “red fruit”, were rare. In contrast, the task instructions from McRae et al. (2005) and Vinson and Vigliocco (2008) included several examples of features listed as phrases (e.g., “requires paper”, “has a tail”, “done by humans”). This may have led to a larger number of conjoint features in these studies. For automated analyses, it is likely that better results will be obtained if participants are encouraged to answer with single-word features when possible.

A second limitation is that the stemming algorithms sometimes miss features that share the same morphological root. For example, for the metaphor “love is a flower” in our dataset, “grow” and “growth” were counted as separate features using the default Snowball stemmer setting. In this case, removing the -th suffix would be appropriate because “grow” and “growth” share a similar meaning; however, for a word such as “breadth”, removing the -th results in the word “bread”, which has a different meaning from “breadth”. The Lancaster stemmer, which is more aggressive than the Snowball stemmer, correctly removes the -th from “growth”; however, it tends to over-stem other responses. For instance, for the metaphor “a forest is a harp”, the Lancaster stemmer reduced both “tall” and “talent” to the stem “tal” and counted them as the same feature. Therefore, although the Lancaster stemmer sometimes combines certain features more accurately, such as “grow” and “growth”, it also tends to incorrectly group dissimilar responses together, such as “talent” and “tall”. For this reason, we prefer the more conservative Snowball stemmer, even though it will occasionally fail to combine features that potentially share meaning.
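The trade-off between the two stemmers can be inspected directly with NLTK’s implementations; the snippet below simply prints both stems for the examples discussed above.

```python
from nltk.stem import SnowballStemmer, LancasterStemmer

snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

# Compare how conservatively each stemmer reduces the examples from the text.
for word in ["grow", "growth", "breadth", "tall", "talent"]:
    print(word, snowball.stem(word), lancaster.stem(word))

# Snowball leaves "growth" intact (so "grow"/"growth" remain separate features),
# whereas the more aggressive Lancaster stemmer can collapse unrelated words
# (e.g., "tall" and "talent") onto the same stem.
```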

A third limitation is that there is no automated way for the RK processor to combine words that are synonyms. The RK processor does include a command that will find any pairs of synonyms among the participant responses; however, synonyms cannot be combined automatically because of the issue mentioned in the introduction: words X and Y may be synonyms, and Y and Z may be synonyms, yet X and Z may not be. In our dataset, this issue arose for the response “good”, which was found to be a synonym of the responses “commodity”, “beneficial”, and “well”, even though these latter three responses were not found to be synonyms of each other. This tends to be an issue especially with homonyms and polysemes, such as “good”, that have multiple meanings. Roncero and de Almeida (2015) argue that synonyms can have subtle differences in meaning and that combining synonyms can reduce the variability in a dataset. We agree with this sentiment and, like Roncero and de Almeida, focus on combining features with the same morphological root. For now, the RK processor includes a function for finding potential synonyms but does not combine and count these synonyms. This function can aid feature analysis, but the researcher needs to make the final decision on how synonyms are combined and counted.
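One way such candidate synonym pairs can be flagged, not necessarily the RK processor’s exact implementation, is to treat two responses as potential synonyms when they share at least one WordNet synset. The sketch below also illustrates why the relation is not transitive for a word like “good”.

```python
from itertools import combinations
from nltk.corpus import wordnet as wn  # may first require nltk.download('wordnet')

def share_synset(word_a, word_b):
    """Candidate synonyms: the two words appear together in at least one synset."""
    return bool(set(wn.synsets(word_a)) & set(wn.synsets(word_b)))

responses = ["good", "commodity", "beneficial", "well"]
for a, b in combinations(responses, 2):
    print(a, b, share_synset(a, b))

# "good" pairs with each of the other three responses, but "commodity",
# "beneficial", and "well" do not pair with one another, so the flagged pairs
# cannot simply be collapsed into a single combined feature automatically.
```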

Lastly, the RK processor works best with simple “A is B” metaphors. It is possible to extract features for more complex metaphors, as long as they are labelled in the datafile in a way that the RK processor can interpret (see Supplementary Material for more detail). However, the RK processor is more limited for some types of metaphors. For instance, for “A is the B of C” metaphors (e.g., “Robert Redford is the peacock of actors”, Tourangeau & Sternberg, 1981; Trick & Katz, 1986), it may be desirable to obtain the features for all three terms rather than just the topic and vehicle. Also, for many metaphors the topic domain is not explicitly stated, as in Robert Frost’s poem “The Road Not Taken”, where the road is a metaphor for life but life is never explicitly mentioned. The RK processor is not set up to deal with these types of metaphors, particularly in the analyses that involve finding the topic, vehicle, shared, and emergent features of the metaphor. Our program could still be helpful for processing features listed for these metaphors and the terms used in them, but the automatic analysis of topic and vehicle features is better suited to simple “A is B” metaphors.

Conclusion

In this paper, we presented the RK processor, a program designed to automatically process participant data from metaphor and word feature-listing studies. The feature data processed using this program generally align with other metaphor feature-listing studies in which the features are combined and counted manually. Furthermore, the feature types by category (i.e., topic only, vehicle only, shared, and emergent) were used to make meaningful predictions about metaphor comprehension and goodness ratings. Aside from metaphor research, the processor also has potential applications in research on word similarity, compound words, categories, computational modelling, incongruity resolution, and semantic ambiguity. The program, along with our datasets, is included in the supplementary materials of this paper. Although the program can be used as is in Python, we encourage other researchers to build upon our code and adjust it for their own purposes. We believe that this program is a step towards greater consistency and replicability across research labs.