A judgment of learning (JOL) is one’s judgment about how well a given item is learned and how likely it is that this item will be successfully retrieved on a future memory test. JOLs are metacognitive judgments that reflect monitoring that takes place during the encoding phase of learning. In a typical JOL experiment (e.g., Nelson & Dunlosky, 1991), participants are presented with a list of word pairs (e.g., OCEAN–TREE) and asked to make a prediction on each pair as to how likely it is that they will be able to recall the target word (e.g., TREE) when the cue word (e.g., OCEAN–???) is presented on a memory test. The accuracy of JOLs is assessed by comparing JOL ratings against actual performance on a criterion test such as recall or recognition (see Eakin & Moss, 2019, for a review of methodology). Numerous studies have shown that JOLs are predictive of actual performance, particularly when judgments are made after a period of delay (i.e., delayed JOL effect) compared with when judgments are made immediately after studying an item (e.g., Dunlosky & Nelson, 1992; Nelson & Dunlosky, 1991). Furthermore, studies have shown that JOLs are associated with study behaviors such as which items one would choose for further study and how long one would persist in studying (e.g., Metcalfe & Finn, 2008; Metcalfe & Kornell, 2005; Thiede & Dunlosky, 1999; see Metcalfe, 2009, for a review). These results indicate that JOLs are a critical component of self-regulated study behaviors, consistent with the notion that metacognition plays an important role in monitoring and control of cognition (Nelson & Narens, 1994).

Since the seminal work by Nelson and Dunlosky (1991), the measure of JOLs has attracted considerable attention from many researchers, and many important discoveries have been made (see Metcalfe, 2009). However, there has been an ongoing debate as to whether asking participants to make JOLs would influence memory performance by creating a reactivity effect (see Soderstrom et al., 2015). The reactivity effect occurs when the measure itself alters behaviors (see Ericsson & Simon, 1993; Fox et al., 2011). In the case of JOLs, asking participants to make JOLs may alter their memory and subsequently modify their study behaviors. The issue of reactivity has been discussed since the early days of JOL research (e.g., Dunlosky & Nelson, 1992; Nelson & Dunlosky, 1991; Spellman & Bjork, 1992). However, thus far, only a handful of studies have directly investigated the issue, and the results have been mixed (see Double et al., 2018; Rhodes & Tauber, 2011, for reviews).

For example, Benjamin et al. (1998) showed that making JOLs does not influence memory. In this study, participants were asked to learn lists of unrelated words and took an immediate free recall test after studying each list. For half of the lists, participants were asked to make a prediction for each recalled word as to whether they would be able to recall this item on a future recall test. When participants completed all the lists, they received the final recall test in which they were asked to recall the words from all the lists. The final recall test showed that there was no difference in recall between the lists for which participants made recall predictions and the lists for which they did not make recall predictions. These results were consistent with the notion that making JOLs does not influence memory.

In contrast, other studies have shown that making JOLs indeed influences memory. For example, Soderstrom et al. (2015) showed that the act of making JOLs enhances memory performance when study materials consist of strongly related cue–target word pairs. Furthermore, these researchers showed that providing JOLs makes the cue–target relationship salient to participants, similar to when participants are asked to generate a target word using a word fragment (e.g., ORCHID–FL_W_R). Note that the reactivity effect such as this is not a temporary phenomenon. In three experiments, Witherby and Tauber (2017) showed that the reactivity effect was still present on a test that was administered two days later. Other researchers also showed that asking participants to make JOLs modifies how participants study the material by changing the study goal of participants. In support of this notion, Mitchum et al. (2016) reported that presenting a probability scale for making a JOL informs participants that some items are more difficult to remember than others, which then leads participants to abandon the mastery study goal and, as a result, put less effort in learning difficult items.

Researchers investigating the delayed JOL effect also showed that making JOLs may influence memory. According to the self-fulfilling prophecy (SFP) hypothesis (Spellman & Bjork, 1992), delayed JOLs tend to be more accurate than immediate JOLs because making a delayed judgment tends to increase the likelihood of recalling the target on a subsequent memory test. Spellman and Bjork proposed two mechanisms that may influence memory when participants are asked to make delayed JOLs: covert recall attempts and the spacing effect. First, when making a JOL, participants would covertly attempt to recall the target, and if successful, they assign a high JOL rating. The successful retrieval of the target, in turn, acts as an additional study opportunity, increasing the likelihood of recalling the target on a subsequent memory test. Second, the difference between immediate and delayed judgments can be explained by the spacing effect, that is, the finding that memory performance is higher when repeated study trials are spaced apart. On this account, the benefit of retrieving the target would be greater for the delayed judgment than for the immediate judgment because for the former, there is a spacing between the initial presentation and the second exposure (when the retrieval attempt is successful). The SFP hypothesis has received some support over the years. For instance, Kimball and Metcalfe (2003) showed that for items that received high JOL ratings, recall was similar regardless of whether a delayed JOL was made with a cue only or a delayed JOL was made with a cue only, followed by an exposure to both the cue and target. These results showed that for these high JOL items, making a delayed JOL was similar to receiving an extra exposure to the study material. For low JOL items, receiving an extra exposure to the cue and target enhanced memory because for these items, the likelihood of spontaneously retrieving the targets was low.

Other researchers (Akdoğan et al., 2016; Jönsson et al., 2012; Kelemen & Weaver, 1997) also showed that making delayed JOLs enhanced memory for these items on a delayed test. However, Tauber et al. (2015) showed that retrieval attempts associated with making delayed JOLs are not as effortful as when participants are explicitly tested because these retrieval attempts are often terminated when the cue word is not familiar. Also, Son and Metcalfe (2005) showed that when participants make delayed JOLs, they try to retrieve only some items and terminate retrieval attempts for other items, which is different from retrieval attempts during a test when participants tend to try to retrieve each item.

Because the results of the previous studies have been mixed as to whether asking participants to make JOLs would create a reactivity effect, two meta-analyses have addressed this topic (Double et al., 2018; Rhodes & Tauber, 2011). Rhodes and Tauber (2011) examined whether delaying judgments would indeed increase JOL accuracy and whether delaying judgments would also increase memory performance (reactivity effect). Double et al. (2018) focused on immediate JOLs, directly comparing memory performance between conditions with a JOL task and conditions without a JOL task. The analysis by Rhodes and Tauber included 45 studies, and the analysis by Double et al. included 17 studies. Rhodes and Tauber showed that there was a robust beneficial effect of delaying judgments on JOL accuracy; however, the effect of delaying judgments on memory was much smaller than the effect on accuracy. The results showed that delaying judgments increased memory under the following conditions: when both the cue and target were presented for making a judgment (as opposed to a cue only), when the materials were paired associates, when the delayed judgments were made after a delay of 1 min or less, when the cues used for judgments did not match the cues used for the test (e.g., cue–target pairs for judgments and cue only for test), when a within-subjects manipulation was used (i.e., participants making both immediate and delayed judgments), and when children as opposed to adults were tested. These results therefore indicated that delaying judgments can create a reactivity effect under some conditions. The results of Double et al. showed that a reactivity effect was present in 6 out of 17 studies. The reactivity effect was present when the study materials were word pairs that consisted of related cues and targets as well as when the study materials were a word list consisting of single words.

Based on these findings, it is evident that making JOLs can modify memory; however, it is also evident that the reactivity effect is not always present. Therefore, we conducted two experiments to examine how making JOLs would influence memory performance. In particular, we investigated the type of processing a JOL task would induce. Our hypothesis was that making JOLs would induce item-specific processing because when one makes a JOL, one would focus attention on a particular item, thereby enhancing the distinctiveness of each item (Hunt, 2006, 2012). In other words, we propose that asking participants to make a JOL would be similar to asking participants to perform an encoding task (Craik & Tulving, 1975; Hyde & Jenkins, 1969), which is designed to induce item-specific processing (Einstein & Hunt, 1980; Hunt & Einstein, 1981; Hunt & McDaniel, 1993; Hunt & Seta, 1984). This notion has been proposed before (Mitchum et al., 2016; Schmidt & Schmidt, 2017); however, as far as we know, there has not been a direct test of this hypothesis.

In the present experiments, we presented participants with a list consisting of single words and asked them to perform a JOL task or a well-established item-specific processing task (a pleasantness rating task in Experiment 1, and a single imagery task in Experiment 2). Subsequently, memory performance in these conditions was compared with memory performance in an intentional learning control condition. We hypothesized that both JOL and item-specific processing tasks would enhance memory performance relative to the control condition, and that the enhancement would be similar between these two conditions. In Experiment 1, we selected a pleasantness rating task because this task was most notably used by Hunt and colleagues (e.g., Einstein & Hunt, 1980; Hunt & Einstein, 1981; Hunt & Seta, 1984) in their investigation of the effect of relational and item-specific processing on memory (see Hunt & McDaniel, 1993, for an extensive review). According to Hunt and colleagues, optimal memory requires both types of processing such that highlighting the uniqueness of each item (item-specific processing) is most beneficial to memory when one is also paying attention to the similarity among the items (relational processing). In their experiments, Hunt and colleagues used a pleasantness rating task to induce item-specific processing and a category sorting task to induce relational processing. In Experiment 2, we selected a single imagery task, in which participants were asked to create a vivid mental image of each word, because this task has been used to induce item-specific processing in several experiments (e.g., Burns & Schoff, 1998; Burns et al., 2007; Huff & Bodner, 2014; Otani & Hodge, 1991). It is also important to note that Hodge and Otani (1996) showed that memory performance was comparable between the pleasantness rating and single imagery tasks in both free recall and recognition, indicating that these tasks are similar in inducing item-specific processing.

To detect the effect of item-specific processing, we additionally manipulated the list type, such that half of the participants received a categorized list whereas the other half of the participants received an uncategorized list. According to Hunt and colleagues (e.g., Einstein & Hunt, 1980; Hunt, 2006, 2012; Hunt & Einstein, 1981; Hunt & McDaniel, 1993; Hunt & Seta, 1984), the structure of the study list is also important because a categorized list has a tendency to induce relational processing, and an uncategorized list has a tendency to induce item-specific processing. Thus, the processing induced by the list structure can interact with the processing induced by the encoding task such that optimal memory would result when both relational and item-specific processing are simultaneously present. Consistent with this notion, Hunt and colleagues (e.g., Hunt & Einstein, 1981) showed that when the list was categorized, recall was higher when participants engaged in an item-specific processing task (i.e., pleasantness rating) whereas when the list was uncategorized, recall was higher when participants engaged in a relational processing task (i.e., category sorting). Based on these previous findings, in the present experiments, we expected that an item-specific processing task such as JOLs would enhance memory when a list was categorized more so than when a list was uncategorized.

Note that in most JOL studies, study lists consist of word pairs (see Nelson et al., 2004). However, in the present experiments, we decided to use a list consisting of single words because manipulating encoding tasks is more established with a list of single words than a list of word pairs (e.g., levels of processing; Craik & Tulving, 1975). Furthermore, there have been studies in which participants were asked to make JOLs on single items (e.g., Dunlosky et al., 2000; Otani et al., 2014; Schmidt & Schmidt, 2017). Notably, a meta-analysis by Double et al. (2018) included three studies that used a word list that consisted of single words, and of these three studies, two (Yang et al., 2015; Zechmeister & Shaughnessy, 1980) showed a reactivity effect, whereas one (Tauber & Rhodes, 2012) did not.

In sum, in the present experiments, we assumed that making JOLs would induce item-specific processing similar to when one performs other item-specific processing tasks, such as rating pleasantness and creating a mental image of each word. Based on this notion, we predicted that making JOLs would enhance recall when the list was categorized more so than when the list was uncategorized. Furthermore, the enhancement would be similar between the JOL and other item-specific processing conditions (pleasantness rating and single imagery). In addition, we predicted that when the list was uncategorized, the JOL condition would show minimal memory enhancement, again similar to the pleasantness rating and single imagery conditions.

Experiment 1

In Experiment 1, making JOLs was compared with a well-known encoding task of rating pleasantness. Half of the participants studied a categorized list of words, whereas the other half of the participants studied an uncategorized list of words. We expected that relative to the control condition, memory enhancement would be similar between the JOL and pleasantness rating conditions, and that memory enhancement would be more likely to occur when the list was categorized as opposed to uncategorized.

Method

Participants

Participants were 32 male and 172 female undergraduate students attending introductory psychology courses at a public university in the Midwest region of the United States. They participated to earn extra course credit. An equal number (n = 34) of participants were randomly assigned to six between-subjects conditions, which were created by a 3 (encoding task: JOL, pleasantness rating, control) × 2 (list type: categorized, uncategorized) factorial design. We determined that 34 participants per condition would be sufficient based on an analysis using G*Power software (Faul et al., 2007). According to this analysis, assuming a medium effect size f = 0.25 with power (1 − β) of .80, the minimum sample size would be 26 participants per condition. However, we acknowledge that detecting an interaction effect may require a larger sample size (see http://shiny.ieis.tue.nl/anova_power/ by Lakens & Caldwell, 2019). In fact, as indicated below, we did not have sufficient power to detect the interaction effect, even though there was enough power for a priori follow-up analyses. Nevertheless, previous researchers used a smaller sample size than ours when they manipulated a pleasantness rating task. In particular, Einstein and Hunt (1980) used 18 per condition, Hunt and Einstein (1981) used 19 per condition, and Hodge and Otani (1996) used 28 per condition. The experiment was conducted with approval given by the Institutional Review Board (IRB) where data were collected.
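The 3 × 2 between-subjects design with equal cell sizes can be illustrated with a short sketch of balanced random assignment. This is only an illustration under stated assumptions: the function name `assign_balanced`, the participant IDs, and the seed are hypothetical, and the actual assignment procedure used in the experiment is not described beyond being random with n = 34 per condition.

```python
import random
from collections import Counter
from itertools import product


def assign_balanced(participant_ids, tasks, list_types, seed=None):
    """Randomly assign participants to task x list-type cells with equal n.

    Hypothetical helper: builds one slot per participant (equal slots per
    cell), shuffles the slots, and pairs them with the participant IDs.
    """
    cells = list(product(tasks, list_types))
    if len(participant_ids) % len(cells) != 0:
        raise ValueError("Sample size must be a multiple of the number of cells")
    per_cell = len(participant_ids) // len(cells)
    slots = [cell for cell in cells for _ in range(per_cell)]
    rng = random.Random(seed)
    rng.shuffle(slots)
    return dict(zip(participant_ids, slots))


# 204 participants, 3 encoding tasks x 2 list types -> 34 per condition
assignment = assign_balanced(
    list(range(1, 205)),
    ["JOL", "pleasantness rating", "control"],
    ["categorized", "uncategorized"],
    seed=0,  # arbitrary seed, used only to make the sketch reproducible
)
cell_sizes = Counter(assignment.values())
```

With 204 participants and six cells, every cell in `cell_sizes` receives 34 participants, matching the sample reported above.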

Materials

Two lists, categorized and uncategorized, were constructed from English words selected from the Van Overschelde et al. (2004) category norms. The categorized list included 32 words from four different categories, and the uncategorized list included 32 words from 16 different categories (see Appendix 1). These words were chosen from the middle ranking for each category with the proportion of participants producing a particular word given a category cue ranging from .07 to .77. Furthermore, the length of words varied from five to eight letters. A PowerPoint presentation was used to present the words, one at a time, in the middle of the computer screen in lowercase letters at the rate of one word per 5 s. The order of the words was randomized once, and the same order was used for all participants. Each slide presenting a word was followed by an instruction slide, presented for 7 s. The instruction slide presented a scale from 0% to 100%, which was used for performing the encoding tasks. In addition, a sheet with randomly generated two-digit numbers was prepared for a filler task. A blank sheet of paper was used for a free recall test.

Procedure

Participants, who were tested in small groups of up to four individuals, were told that they would be presented with a list of words, and their task would be to remember as many of these words as possible. In addition, after each word was presented, participants in the JOL condition were asked to rate their JOL, indicating how likely it was that they would be able to recall this word later using a scale from 0% (definitely will not recall) to 100% (definitely will recall). Participants in the pleasantness condition were asked to rate the pleasantness of the word using a scale from 0% (definitely not pleasant) to 100% (definitely pleasant). Participants in the control condition were asked to choose and write an arbitrary number from 0% to 100%. Following the study phase, the participants were asked to perform a filler task for 2 min, crossing out the numbers divisible by three. Then, participants completed a self-paced free recall test in which they were asked to recall and write as many of the study words as possible.

Results

The dependent measure was the proportion of correctly recalled words. Table 1 shows the means across encoding task and list type. As shown, both the JOL and pleasantness rating conditions resulted in higher recall than the control condition for both the categorized and uncategorized lists. However, the difference was much smaller for the uncategorized list than for the categorized list.

Table 1. Mean proportion of correct recall as a function of encoding task and list type in Experiment 1

To compare the proportion of correctly recalled words across the conditions, we conducted a 3 (encoding task: JOL, pleasantness rating, control) × 2 (list type: categorized, uncategorized) analysis of variance (ANOVA). The results indicated that the main effect of encoding task was significant, F(2, 198) = 7.98, MSE = 0.02, p < .001, ηp² = .08. Least significant difference (LSD) tests showed that recall was higher in the JOL (M = .44, SD = .15, p < .001) and pleasantness rating conditions (M = .43, SD = .14, p = .001) than in the control condition (M = .36, SD = .12). No difference was found between the former two conditions (p = .87). The main effect of list type was also significant, F(1, 198) = 35.72, MSE = 0.02, p < .001, ηp² = .15. Recall was higher for the categorized list (M = .46, SD = .14) than for the uncategorized list (M = .36, SD = .12). The interaction was not significant, F(2, 198) = 1.06, MSE = 0.02, p = .35, ηp² = .01. Note, however, that the observed power for the interaction was only .23.
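For readers who wish to relate the reported effect sizes to the F statistics, partial eta squared can be recovered from an F value and its degrees of freedom using the standard identity (F × df effect) / (F × df effect + df error). The sketch below applies this identity to the list-type main effect; the function name is hypothetical, and the calculation is offered only as a check on the reported value, not as a description of how the original analysis was run.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from an F statistic and its dfs."""
    numerator = f_value * df_effect
    return numerator / (numerator + df_error)


# Main effect of list type in Experiment 1: F(1, 198) = 35.72
eta_list_type = partial_eta_squared(35.72, 1, 198)
print(round(eta_list_type, 2))  # 0.15
```

Because the reported F values are themselves rounded, values recovered this way can differ from the reported effect sizes in the second decimal place.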

Although the interaction was not significant, we conducted further analyses based on the a priori hypothesis that the effect of making JOLs and rating pleasantness would be greater for the categorized list than for the uncategorized list because the effect of performing an item-specific processing task should be greater when the list encourages relational processing by emphasizing the similarity among items than when the list does not. To test this hypothesis, we conducted a separate one-way ANOVA on each list. For the categorized list, the results indicated that the difference among the encoding conditions was significant, F(2, 99) = 6.80, MSE = 0.02, p = .002, ηp² = .12. LSD tests showed that recall was higher in the JOL (M = .50, SD = .14, p = .002) and pleasantness rating conditions (M = .50, SD = .14, p = .002) than in the control condition (M = .40, SD = .12). No difference was found between the former two conditions (p = .93). For the uncategorized list, the results indicated that the difference among the encoding conditions was not significant, F(2, 99) = 1.78, MSE = 0.01, p = .17, ηp² = .04. As expected, these results showed that the effect of the JOL and pleasantness rating tasks was greater for the categorized list than for the uncategorized list.

In addition to these conventional analyses, we also computed Bayes factors in order to model the data under the null and alternative hypotheses. A conventional analysis based on p values only considers the null hypothesis, whereas a Bayesian analysis considers both the null and alternative hypotheses at the same time. The latter approach is superior to the former approach because it allows for an estimate of how likely the observed data are under the null hypothesis and under the alternative hypothesis (see Jarosz & Wiley, 2014). We specifically compared the JOL and pleasantness rating conditions for each list to show that the observed data fit the null hypothesis better than the alternative hypothesis. We computed Bayes factors with Rouder’s method using IBM SPSS Statistics for Windows (Version 26.0; IBM Corp., Armonk, NY, USA). The results showed that for the categorized list, an estimated Bayes factor (null/alternative) indicated that the observed data fit the null hypothesis 5.43 times better than the alternative hypothesis. For the uncategorized list, an estimated Bayes factor (null/alternative) indicated that the observed data fit the null hypothesis 5.39 times better than the alternative hypothesis. These results provided strong evidence that as predicted, recall was similar between the JOL and pleasantness rating conditions for both the categorized and uncategorized lists (see Footnote 1).

Next, we examined whether there was a difference in JOL ratings and accuracy between the categorized and uncategorized lists. Note that for these analyses, only two groups of participants (JOL conditions) were included. We consider JOL ratings first. Given that the categorized list was easier to learn than the uncategorized list, JOL ratings should be higher for the categorized list than for the uncategorized list. However, an independent-samples t test on JOL ratings showed the difference was not significant, t(66) = 0.61, p = .54, indicating that there was no difference in JOL ratings between the categorized (M = 47.75, SD = 17.48) and uncategorized lists (M = 45.10, SD = 18.13). Accordingly, JOL ratings did not reflect the fact that the categorized list was easier to learn than the uncategorized list.

In terms of JOL accuracy, two types of accuracy need to be considered: relative and absolute. Relative accuracy indicates whether higher JOL ratings are associated with a higher likelihood of recalling items regardless of the actual numbers. In other words, recall should be higher for an item that was rated 80% than 40%, and this relationship should still hold even when the ratings are 60% and 20%. Absolute accuracy indicates whether the mean JOL rating matches actual recall. In other words, if the average JOL rating is 80%, the actual recall should be 80%. It is difficult to predict the effect of list type on relative accuracy. However, given that making JOLs increased actual recall when the list was categorized, the relative accuracy may have been lower for the categorized list than for the uncategorized list. There are several measures of relative accuracy, but we chose Goodman–Kruskal gamma because in the JOL literature, gamma is the most common. Gamma ranges from −1 to +1, with +1 indicating perfect accuracy and 0 indicating no association between JOL ratings and recall performance. The result of an independent-samples t test showed that the difference was not significant, t(66) = 1.12, p = .27, indicating that there was no difference in relative accuracy between the categorized (M = .19, SD = .31, range: −.35 to .66) and uncategorized lists (M = .27, SD = .26, range: −.28 to .63). These results indicated that relative accuracy was similar between the two lists even though making JOLs increased recall for the categorized list and not for the uncategorized list.
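As an illustration of the relative accuracy measure described above, the following minimal sketch computes Goodman–Kruskal gamma for a single participant as (C − D) / (C + D), where C and D are the numbers of concordant and discordant item pairs and tied pairs are ignored. The data and the function name are hypothetical; this is not the authors' analysis script.

```python
from itertools import combinations


def goodman_kruskal_gamma(jol_ratings, recalled):
    """Gamma = (C - D) / (C + D) over all item pairs; tied pairs are skipped.

    jol_ratings: JOL ratings (0-100), one per item.
    recalled: 1 if the item was recalled, 0 otherwise, in the same order.
    """
    concordant = 0
    discordant = 0
    for (j1, r1), (j2, r2) in combinations(zip(jol_ratings, recalled), 2):
        direction = (j1 - j2) * (r1 - r2)
        if direction > 0:
            concordant += 1  # higher JOL goes with better recall
        elif direction < 0:
            discordant += 1  # higher JOL goes with worse recall
    if concordant + discordant == 0:
        return None  # gamma is undefined when every pair is tied
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical participant: six items with JOL ratings and recall outcomes
jols = [80, 60, 40, 20, 70, 30]
recall = [1, 1, 0, 0, 0, 1]
gamma = goodman_kruskal_gamma(jols, recall)  # 6 concordant, 3 discordant
```

In this example gamma is (6 − 3) / (6 + 3), or about .33. Note that gamma is undefined for a participant who recalls all or none of the items, or who gives every item the same rating, which is why the sketch returns `None` in that case.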

Absolute accuracy can be examined in several ways, but the most intuitive way is to compute signed difference scores for each participant (see Dunlosky & Metcalfe, 2009) by averaging JOL ratings across items and comparing the average rating with the actual recall (see Van Overschelde & Nelson, 2006). If the average JOL rating matches actual recall, the difference should be zero. If the rating is higher than actual recall, participants are said to be overconfident, whereas if the rating is lower than actual recall, participants are said to be underconfident. The effect of list type on absolute accuracy is also difficult to predict. However, given that making JOLs increased recall when the list was categorized, the absolute accuracy may have been lower for the categorized list than for the uncategorized list. Contrary to this expectation, the results showed that there was no significant difference between the categorized (M = −2.16, SD = 18.37) and uncategorized lists (M = 7.41, SD = 24.18), t(66) = 1.84, p = .07. These results showed that absolute accuracy was similar between the lists even though making JOLs increased recall for the categorized list and did not increase recall for the uncategorized list. However, note that the results showed a trend indicating that accuracy was higher when the list was categorized than uncategorized. As shown below, this trend was consistent with the results of Experiment 2, which showed that accuracy was significantly higher for the categorized list than for the uncategorized list.
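The signed difference score described above can be sketched as follows for a single participant. The data and the function name are hypothetical, and the score is expressed on the same 0 to 100 scale as the JOL ratings.

```python
def signed_difference(jol_ratings, recalled):
    """Mean JOL rating (0-100) minus percentage of items actually recalled.

    Positive values indicate overconfidence, negative values indicate
    underconfidence, and zero indicates perfect absolute accuracy.
    """
    mean_jol = sum(jol_ratings) / len(jol_ratings)
    percent_recalled = 100 * sum(recalled) / len(recalled)
    return mean_jol - percent_recalled


# Hypothetical participant: mean JOL of 50% but only 1 of 4 items recalled
score = signed_difference([80, 60, 40, 20], [1, 0, 0, 0])
print(score)  # 25.0
```

A score of 25 here means the participant's average prediction exceeded actual recall by 25 percentage points, that is, the participant was overconfident.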

Discussion

Experiment 1 investigated whether making JOLs would influence memory performance by inducing item-specific processing. We tested this hypothesis by comparing a JOL task with a well-established task of rating pleasantness because the pleasantness rating task has been shown to induce item-specific processing (e.g., Einstein & Hunt, 1980; Hunt & Einstein, 1981; Hunt & Seta, 1984). We predicted that both the JOL and pleasantness rating tasks would produce memory enhancement relative to a control condition, and that the enhancement would be similar between the JOL and pleasantness rating conditions. Furthermore, we predicted that the enhancement would be greater when the list was categorized than uncategorized because it has been shown that the effect of item-specific processing is stronger when the list encourages relational-processing by emphasizing similarities among study items (Hunt & Einstein, 1981).

The results indicated that recall was higher in the JOL and pleasantness rating conditions than in the control condition, with the former two conditions showing similar performance. These results were in agreement with the assumption that making JOLs influences memory by encouraging item-specific processing similar to rating pleasantness. Although the Encoding Task × List Type interaction was not significant, further analyses comparing the encoding conditions on each list showed that the difference among the conditions was significant for the categorized list but not for the uncategorized list. These results are similar to the results reported by Soderstrom et al. (2015) that making JOLs enhanced memory performance for strongly related word pairs but not for unrelated word pairs, and further support the assumption that making JOLs promotes item-specific processing (e.g., Hunt & Einstein, 1981).

Another interesting finding was that there was no difference in JOL ratings as well as accuracy (both relative and absolute) between the categorized and uncategorized lists. If list type influences recall, it would be expected that JOL ratings and/or JOL accuracy would be affected by list type. Contrary to this expectation, list type did not make any difference in JOL ratings or JOL accuracy. Note that the ratings and accuracy were compared across groups of participants, and there are several problems associated with intergroup comparisons of JOL accuracy (see Dunlosky & Metcalfe, 2009; Schwartz & Metcalfe, 1994). Thus, further research is needed to examine JOL accuracy and the reactivity issue. Nevertheless, if making JOLs modifies memory, JOLs would no longer truly capture the natural ways individuals regulate their own study behaviors. Therefore, researchers need to exercise caution when using JOLs to investigate such behaviors.

In sum, the results of Experiment 1 supported the hypothesis that making JOLs enhances memory by inducing item-specific processing. However, rating pleasantness is only one of several tasks that have been shown to induce item-specific processing (see Hodge & Otani, 1996). In order to test whether the results were task-specific or process-specific, in Experiment 2, another well-known task was used to induce item-specific processing.

Experiment 2

The purpose of Experiment 2 was to test further the hypothesis that making JOLs would be similar to engaging in an item-specific processing task. The design and the procedure of Experiment 2 were the same as those in Experiment 1, but in Experiment 2, making JOLs was compared with a single imagery task or a task of creating a vivid mental image of each word, which is another well-known item-specific processing task (see Hodge & Otani, 1996). Similar to Experiment 1, we hypothesized that both the JOL and single imagery conditions would produce similar memory enhancement relative to the control condition, and that memory enhancement would be greater when the list was categorized as opposed to when the list was uncategorized.

Method

Participants

Participants were 68 male and 172 female undergraduate students in introductory psychology courses at a public university in the Midwest region of the United States. They participated to earn extra course credit. An equal number (n = 40) of participants were randomly assigned to six between-subjects conditions in a 3 (encoding task: JOL, single imagery, control) × 2 (list type: categorized, uncategorized) factorial design. We attempted to increase statistical power by increasing the sample size to 40 per condition. The experiment was conducted in accordance with the approval given by the IRB where data were collected.

Materials and procedure

The materials were the same as those in Experiment 1. The procedures for the JOL and control conditions were the same as those in Experiment 1. The only difference between Experiments 1 and 2 was that in Experiment 2, participants in the single imagery condition were asked to create a mental image of each word as vividly as possible. In this condition, each slide presented a word and was followed by an instruction slide that showed a scale from 0% (not very vivid) to 100% (very vivid), and participants were asked to rate the vividness of the image they created using this scale and write down their ratings on the response sheet (see Appendix 2 for the instructions for the single imagery condition).

Results and discussion

The dependent measure was the proportion of correctly recalled words. Table 2 shows the means across encoding task and list type. As shown, both the JOL and single imagery conditions showed higher recall than the control condition for both the categorized and uncategorized lists, and as expected, the difference was greater for the categorized list than for the uncategorized list.

Table 2 Mean proportion of correct recall as a function of encoding task and list type in Experiment 2

To compare the proportion of correctly recalled words across the conditions, we conducted a 3 (encoding task: JOL, single imagery, control) × 2 (list type: categorized, uncategorized) ANOVA. The results indicated that the main effect of encoding task was significant, F(2, 234) = 11.39, MSE = 0.01, p < .001, ηp2 = .09. LSD tests showed that recall was higher in the JOL (M = .44, SD = .15, p < .001) and single imagery conditions (M = .45, SD = .14, p < .001) than in the control condition (M = .37, SD = .11). No difference was found between the JOL and single imagery conditions (p = .43). The main effect of list type was also significant, F(1, 234) = 71.52, MSE = 0.01, p < .001, ηp2 = .23. Recall was higher for the categorized list (M = .48, SD = .14) than for the uncategorized list (M = .36, SD = .10). Lastly, the Encoding Task × List Type interaction was significant, F(2, 234) = 3.49, MSE = 0.01, p = .03, ηp2 = .03.
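As a reading aid, the reported effect sizes can be checked against the F ratios. The sketch below assumes the standard between-subjects identity ηp2 = (F × df_effect) / (F × df_effect + df_error); the function name is illustrative, and nothing here reproduces the authors' own analysis software.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F ratio and its degrees of freedom."""
    # F * df_effect equals SS_effect / MS_error, so the ratio below is
    # SS_effect / (SS_effect + SS_error) after cancelling MS_error.
    scaled_effect = f_value * df_effect
    return scaled_effect / (scaled_effect + df_error)


# F ratios and degrees of freedom as reported for the 3 x 2 ANOVA.
reported = {
    "encoding task": (11.39, 2, 234),
    "list type": (71.52, 1, 234),
    "encoding task x list type": (3.49, 2, 234),
}

for effect, (f_value, df_effect, df_error) in reported.items():
    eta = partial_eta_squared(f_value, df_effect, df_error)
    print(f"{effect}: partial eta squared = {eta:.2f}")
```

Rounded to two decimals, the three values agree with the effect sizes reported for the two main effects and the interaction.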

Because the interaction was significant, we conducted a separate one-way ANOVA on each list. For the categorized list, the results indicated that the difference among the encoding conditions was significant, F(2, 117) = 10.67, MSE = 0.02, p < .001, ηp2 = .15. LSD tests showed that recall was higher in the JOL (M = .50, SD = .15, p = .001) and single imagery conditions (M = .54, SD = .13, p < .001) than in the control condition (M = .41, SD = .11). No difference was found between the JOL and single imagery conditions (p = .23). For the uncategorized list, the results indicated that the difference among the encoding conditions was not significant, F(2, 117) = 1.88, MSE = 0.01, p = .16, ηp2 = .03.

The results of Bayesian analyses also confirmed that there was no difference in recall between the JOL and single imagery conditions. For the categorized list, an estimated Bayes factor (null/alternative) showed that the observed data fit the null hypothesis 3.25 times better than the alternative hypothesis. For the uncategorized list, an estimated Bayes factor (null/alternative) showed that the observed data fit the null hypothesis 5.64 times better than the alternative hypothesis. These results provided strong evidence that, as predicted, recall was similar between the JOL and single imagery conditions.

Next, we examined JOL ratings and accuracy by comparing the JOL conditions between the two lists. An independent-samples t test on JOL ratings showed the difference was not significant, t(78) = 0.67, p = .50, indicating that there was no difference in JOL ratings between the categorized (M = 49.68, SD = 16.01) and uncategorized lists (M = 47.30, SD = 15.65), despite the fact that recall was easier for the categorized list than for the uncategorized list.
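The t statistic above can be recovered from the reported means, standard deviations, and group sizes alone. The following is a minimal sketch that assumes a pooled-variance (equal variances) independent-samples test with n = 40 per group; the function name is hypothetical, and the p value is omitted because it would require a t distribution routine.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Independent-samples t statistic from summary statistics (pooled variance)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df


# JOL ratings: categorized vs. uncategorized list, 40 participants each.
t_value, df = pooled_t(49.68, 16.01, 40, 47.30, 15.65, 40)
print(f"t({df}) = {t_value:.2f}")
```

With the reported summary statistics this gives a value that rounds to the t(78) = 0.67 stated in the text.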

To examine the relative accuracy of JOLs, we computed Goodman-Kruskal gamma scores. An independent-samples t test showed that there was no difference between the categorized (M = .24, SD = .25, range: −.32 to .68) and uncategorized lists (M = .30, SD = .27, range: −.71 to .69), t(78) = 1.0, p = .32, indicating that relative accuracy was similar between the two lists despite the fact that making JOLs increased recall for the categorized list and did not increase recall for the uncategorized list. To examine absolute accuracy, we computed signed difference scores comparing the average JOL rating with actual recall for each participant. An independent-samples t test showed that there was a significant difference between the categorized (M = −.72, SD = 22.06) and uncategorized lists (M = 9.95, SD = 19.56), t(78) = 2.29, p = .03, indicating that absolute accuracy was higher for the categorized list than for the uncategorized list. These results therefore indicated that absolute accuracy of JOLs can be influenced by the type of processing that the study material would promote.
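The two accuracy measures can be illustrated for a single participant. The sketch below makes several assumptions that are not stated in the text: JOLs are on a 0 to 100 scale, recall is coded 1 (recalled) or 0 (not recalled) per item, and the signed difference is taken as mean JOL minus percentage recalled, so that positive values indicate overconfidence. The function names and the six-item data set are invented for illustration.

```python
from itertools import combinations


def goodman_kruskal_gamma(jols, recalled):
    """Relative accuracy: (concordant - discordant) / (concordant + discordant)."""
    concordant = discordant = 0
    for (jol_a, rec_a), (jol_b, rec_b) in combinations(zip(jols, recalled), 2):
        product = (jol_a - jol_b) * (rec_a - rec_b)
        if product > 0:
            concordant += 1  # the higher JOL went with the better recall outcome
        elif product < 0:
            discordant += 1  # the higher JOL went with the worse recall outcome
        # pairs tied on JOL or on recall are ignored
    if concordant + discordant == 0:
        return None  # gamma is undefined when every pair is tied
    return (concordant - discordant) / (concordant + discordant)


def signed_difference(jols, recalled):
    """Absolute accuracy: mean JOL minus percentage of items recalled."""
    mean_jol = sum(jols) / len(jols)
    percent_recalled = 100 * sum(recalled) / len(recalled)
    return mean_jol - percent_recalled


# Hypothetical participant: six items, JOL in percent and recall outcome.
jols = [80, 60, 40, 20, 70, 50]
recalled = [1, 1, 0, 0, 0, 1]

print(f"gamma = {goodman_kruskal_gamma(jols, recalled):.2f}")
print(f"signed difference = {signed_difference(jols, recalled):.2f}")
```

Gamma depends only on the ordering of items, whereas the signed difference depends on the overall level of the ratings, which is one reason the two measures can dissociate as they did here.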

Experiment 2 investigated whether making JOLs and creating a single image would influence memory performance in a similar way. This prediction was based on the assumption that both JOL and single imagery tasks would promote item-specific processing. If this assumption is correct, both the JOL and single imagery conditions would produce memory enhancement for the categorized list. In contrast, for the uncategorized list, memory enhancement by these conditions would be minimal because the uncategorized list would naturally promote item-specific processing, which is the same processing type that the JOL and single imagery tasks would promote. Based on the notion that optimal memory requires a combined effect of relational and item-specific processing, memory enhancement should occur when the study material promotes relational processing (categorized list), whereas the encoding task promotes item-specific processing (JOL and single imagery). The results were consistent with these expectations. There was a significant interaction between list type and encoding task such that for the categorized list, recall was higher in the JOL and single imagery conditions than in the control condition whereas for the uncategorized list, recall was similar among the three conditions.

With regard to JOL ratings and relative accuracy, there was no difference between the categorized and uncategorized lists, replicating the results of Experiment 1. However, absolute accuracy was different between the two lists, such that absolute accuracy was higher for the categorized list than for the uncategorized list. These results therefore showed that JOL accuracy can be influenced by the type of processing that the study material would promote. Nevertheless, it is difficult to explain this finding. One possibility is that participants were naturally overconfident, and therefore, by increasing recall, their ratings became more accurate by reducing the difference between their ratings and actual recall. Obviously, other possibilities need to be explored in future studies.

General discussion

The present two experiments examined whether the act of making JOLs would enhance memory performance by inducing item-specific processing. As mentioned in the Introduction, there has been an ongoing debate as to whether asking participants to make JOLs would create a reactivity effect because some studies have shown that making JOLs would influence memory (e.g., Soderstrom et al., 2015; Zechmeister & Shaughnessy, 1980), whereas other studies have shown that making JOLs would not influence memory (e.g., Benjamin et al., 1998; Tauber & Rhodes, 2012). In fact, two meta-analyses showed that a JOL task would create a reactivity effect under some conditions, but not under others (Double et al., 2018; Rhodes & Tauber, 2011).

In the present experiments, we hypothesized that making JOLs would enhance memory by inducing item-specific processing because the task of making JOLs would direct one’s attention to a particular item and enhance the distinctiveness of each item in memory (Hunt, 2006, 2012). Our assumption was that a JOL task is similar to other encoding tasks that are designed to induce a particular type of processing by asking participants to make a particular type of judgment on a given item (Craik & Tulving, 1975; Hyde & Jenkins, 1969). These tasks have been extensively used to study the effect of encoding processing on memory at least since the beginning of the levels of processing approach (Craik & Lockhart, 1972). Hunt and colleagues expanded on the levels of processing approach and proposed that there are two types of processing that are particularly important to memory—item-specific processing and relational processing. Their proposal was that when these two types of processing are combined, memory is optimized by taking advantage of two important principles of memory—organization and distinctiveness (see Hunt, 2006, 2012; Hunt & McDaniel, 1993). It is also important to note that encoding tasks are not the only source that would influence the type of processing. Hunt and colleagues showed that the type of study materials is also important because when the study materials promote relational processing, performing a task that would induce item-specific processing becomes beneficial, whereas when the study materials promote item-specific processing, performing a task that would induce relational processing becomes beneficial (see Hunt & Einstein, 1981; Hunt & McDaniel, 1993). Our assumption was that making JOLs would induce item-specific processing and increase the distinctiveness of each item. If this is the mechanism of how JOLs would enhance memory, we should be able to detect it by presenting a categorized list, which would promote relational processing. 
Accordingly, in the present experiments, we manipulated the encoding task and the study material. The tasks were a JOL task and two other tasks that have been shown to induce item-specific processing: the pleasantness rating and single imagery tasks (see Hodge & Otani, 1996). The study material was a categorized list and an uncategorized list. If making JOLs would induce item-specific processing similar to other item-specific processing tasks, recall should be enhanced when the list is categorized more so than when the list is uncategorized. Furthermore, the enhancement should be similar between the JOL task and the other tasks that induce item-specific processing (i.e., pleasantness rating and single imagery; see Hodge & Otani, 1996). In addition, when the list is uncategorized, the enhancement should be minimal because the uncategorized list would promote item-specific processing, which is the same processing type as these tasks would promote.

The results were consistent with these predictions. The results of Experiment 1 showed that for the categorized list, recall was higher in the JOL and pleasantness rating conditions than in the control condition. Furthermore, the former two conditions produced similar recall performance. These results were replicated in Experiment 2 with the single imagery task. The results of Experiment 2 showed that for the categorized list, recall was higher in the JOL and single imagery conditions than in the control condition, with the former two conditions showing similar recall performance. The results were different for the uncategorized list. In both Experiments 1 and 2, recall was similar among all these conditions when the list was uncategorized (see Footnote 2). These results therefore support the hypothesis that making JOLs promotes item-specific processing similar to rating pleasantness or creating a single mental image of each word. These results are also in agreement with the results obtained by Soderstrom et al. (2015), which showed that making JOLs enhanced memory performance for strongly related cue–target pairs, but not for weakly related or unrelated cue–target pairs. Their explanation was that making JOLs would make the cue–target relationship salient. This explanation is similar to the notion that memory is enhanced when both similarities and differences are emphasized.

It must be noted that in the meta-analysis by Double et al. (2018), there were three studies that used a list of single words. Among these, two showed a reactivity effect (Yang et al., 2015; Zechmeister & Shaughnessy, 1980), whereas one did not show a reactivity effect (Tauber & Rhodes, 2012). It is important to note that none of these studies used a categorized list, and thus the reason that memory enhancement did or did not occur in these studies is not clear. However, it is reasonable to assume that a JOL task is an encoding task (Craik & Tulving, 1975; Hyde & Jenkins, 1969) and could induce a deep level of processing, and if so, would enhance memory performance. In fact, in the present experiments, when the list was uncategorized, memory performance was slightly (albeit nonsignificantly) higher for the JOL condition than for the control condition in both experiments. One additional note on encoding tasks is that most of the past studies that manipulated encoding tasks used an incidental learning instruction in an attempt to keep processing as pure as possible (e.g., Craik & Tulving, 1975; see Otani et al., 2019, for the history of memory research methodology). Obviously, in a JOL study, it is impossible to use an incidental learning instruction. Thus, the lack of difference among the conditions with the uncategorized list is understandable. Compared with the past studies that used an incidental learning instruction, the effect of the processing manipulation may not have been strong enough in the present experiments, perhaps due to mixing of item-specific processing and intentional learning strategies. Nevertheless, further studies are needed to investigate the reason that the uncategorized list did not show a difference among the conditions.

We also acknowledge that similar performance does not necessarily mean that similar processes are responsible for the performance across conditions. However, there is no direct measure of the type of processing underlying memory performance. This has been the main difficulty of the processing approach to memory since the beginning of the levels of processing approach (Craik, 2002). Nevertheless, there have been attempts to detect item-specific and relational processing using a repeated-measures paradigm (e.g., Burns, 1993; Burns et al., 2007; Burns & Schoff, 1998; Mulligan, 2000, 2002). In this paradigm, the same recall test is repeated multiple times without providing study trials between tests. The patterns of item gains and losses across tests have been used as an index of relational and item-specific processing (see also Hunt & McDaniel, 1993). Thus, this approach should be used in the future to provide converging evidence that making JOLs induces item-specific processing.
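To make the repeated-testing measure concrete, the sketch below counts, for each pair of successive recall tests, the items newly recalled (gains) and the items no longer recalled (losses). It assumes each test is recorded as the set of words a participant recalled; the function name and the word lists are hypothetical and are not taken from the present materials.

```python
def gains_and_losses(tests):
    """Return (gains, losses) counts for each pair of successive recall tests.

    tests: list of sets, one set of recalled words per test, in test order.
    """
    counts = []
    for earlier, later in zip(tests, tests[1:]):
        gains = later - earlier   # recalled on this test but not the previous one
        losses = earlier - later  # recalled on the previous test but not this one
        counts.append((len(gains), len(losses)))
    return counts


# Hypothetical participant: three recall tests with no restudy in between.
tests = [
    {"apple", "chair", "tiger"},
    {"apple", "tiger", "piano", "river"},
    {"apple", "piano", "river"},
]

print(gains_and_losses(tests))  # [(2, 1), (0, 1)]
```

In a between-conditions design, these per-participant counts would be averaged within each encoding condition and compared across conditions.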

In terms of JOL ratings, both experiments showed that there was no difference between the categorized and uncategorized lists. This finding was surprising because the categorized list was easier to learn than the uncategorized list. A possible explanation is that we used common words from various categories (see Appendix 1), and therefore participants may not have viewed these as particularly easy or difficult to remember. In fact, the JOL ratings were about 50% for both lists in both experiments. Furthermore, the order of the words was randomized. This means that the categorical nature of the list may not have been particularly salient in the categorized list. In the future, the categorized list should be presented with the words from each category in a block. Additionally, each word could be presented with a category cue (e.g., fruit–apple) to emphasize the categorical nature of the list.

Regarding accuracy, in Experiment 1, list type did not influence relative accuracy or absolute accuracy. Because making JOLs increased recall for the categorized list, we expected that JOL ratings would not reflect actual recall, thereby resulting in lower accuracy. This expectation was not confirmed. In Experiment 2, relative accuracy was similar between the two lists, replicating Experiment 1. The result of absolute accuracy was different in Experiment 2 such that absolute accuracy was higher for the categorized list than for the uncategorized list. This finding was also contrary to the expectation because we expected that for the categorized list, JOL ratings would not match actual recall because the act of making JOLs itself would result in increased recall. A possible explanation of this unexpected result is that participants were naturally overconfident, and therefore, when actual recall was increased, accuracy was increased by reducing overconfidence, such that the difference between the ratings and actual recall became smaller. However, Experiment 1 did not show this effect, even though there was a trend showing that participants were slightly more accurate when the list was categorized than uncategorized. These results therefore indicated that absolute accuracy of JOLs can be altered when the study material promotes relational processing and making JOLs modifies memory. Obviously, these speculations need to be examined more directly in future studies, particularly using within-participant comparisons (for the difficulty of comparing accuracy between groups, see Dunlosky & Metcalfe, 2009; Schwartz & Metcalfe, 1994). Furthermore, it has been well established that immediate JOLs are less accurate than delayed JOLs (Rhodes & Tauber, 2011). Accordingly, immediate JOLs may not be sensitive enough to show the effect of processing types on relative and absolute accuracy. Therefore, future studies should use delayed JOLs to investigate this issue.

In sum, the results of the current experiments showed that the relationship between JOLs and recall performance is complex. In fact, it is not always the case that a variable that influences memory also influences JOLs. Such dissociations have been reported in the past (see Schwartz & Efklides, 2012, for an extensive review). For example, studying a list multiple times has been shown to increase cued-recall performance; however, Kornell and Bjork (2009) reported that participants did not increase JOL ratings to match the increased performance.

Although the present experiments did not show a straightforward effect of memory enhancement on JOL ratings and accuracy, the results make it clear that researchers need to carefully consider how making a JOL influences memory when investigating self-regulated study behaviors. Making JOLs can increase memory and may modify self-regulated study behaviors (see, e.g., Dunlosky & Connor, 1997; Metcalfe & Finn, 2008; Metcalfe & Kornell, 2005; Mitchum et al., 2016; Thiede & Dunlosky, 1999, for how JOLs can influence study behaviors). Note that these results are reminiscent of the difficulty associated with introspection, which Wundt and Titchener used to investigate subjective experience (see Fox et al., 2011; Murray, 1983). Throughout the history of psychology, it has been shown that when one introspects on one’s own internal experience, the act of introspection itself may modify the experience (see Fox et al., 2011, for a meta-analysis showing when introspection does and does not become reactive). However, the results of the present experiments did not show that making JOLs always modifies memory. The results showed that when the list was uncategorized, memory enhancement did not occur. These results are another reminder that attention to detail is called for when designing a study. Any method of measuring behavior (particularly subjective experience) needs to be tested before implementing it in an experiment, as it may influence the behavior itself. Furthermore, whenever possible, an appropriate control condition needs to be included in order to assess whether the measurement itself is modifying the behavior (Mitchum et al., 2016). It is also advisable that other nonreactive measures be explored (Double et al., 2018; Mitchum et al., 2016).

The limitations of the present experiments include the use of lists with single words. As mentioned earlier, most studies investigating JOLs have used word pairs (see Double et al., 2018). Therefore, future studies need to examine whether making JOLs would induce item-specific processing using word pairs. Another limitation is that we did not explicitly control word characteristics such as emotionality. However, the same list was used across conditions; therefore, it is unlikely that the word characteristics created a confounding variable.

In conclusion, the present study showed that making JOLs induces item-specific processing and enhances memory when the study material promotes relational processing. Accordingly, it is important to take this into account when using JOLs to investigate self-regulated study behaviors because it is possible that asking participants to make JOLs may change memory and modify study behaviors (e.g., Mitchum et al., 2016). Nevertheless, in terms of practical applications, these results showed that there is a benefit to asking oneself whether a particular study item is learned. By doing so, one would promote item-specific processing of the item, which may in turn lead to enhanced learning and retention of the item.

The data and materials for all experiments are available from the first author. None of the experiments was preregistered.