Our ability to remember information is deeply dependent on our existing knowledge structures, or schemas (Bartlett, 1932; Hintzman, 1986). Even superficially identical information is remembered better when it can be integrated into existing knowledge than when it is encoded as arbitrary. For example, people are better at remembering that someone is a baker than that someone’s name is Baker, because the profession baker activates a rich set of meaningful associations that the name Baker does not (McWeeny et al., 1987); and people remember visual images better if they recognize them as faces than if identical images are not recognized but are instead seen as meaningless texture (e.g., Brady et al., 2019).

Different people have different knowledge and schemas, in part based on their expertise, and this has consequences for memory: Imagine after playing a round of chess, you are asked to recreate the board from some critical moment in the game. For most people, this task would prove very difficult. However, if you were a world-class chess player, this might be quite easy. Becoming an expert in a domain such as chess changes our memory for items in that domain of expertise (Chase & Simon, 1973; de Groot, 1946), allowing us to store more information as long as this information is consistent with the expectations we have formed as a result of our expertise (Gobet & Simon, 1996).

A large literature is devoted to quantifying memory benefits in experts compared with novices (e.g., Ericsson & Kintsch, 1995; Engle & Bukstel, 1978; Gobet & Simon, 1996; Vicente & Wang, 1998). For example, car experts can remember more car images in visual working memory (Curby et al., 2009); baseball experts can remember more baseball-related information in long-term memory (Voss et al., 1980); and expert radiologists have better long-term memory for mammograms—but not natural scenes or real-world objects—compared with controls (Evans et al., 2011).

Why do experts show this increase in memory performance for their domain of expertise? In the literature on expertise, many authors posit that memory improvement occurs because existing knowledge allows experts to know what variation to expect for information in their domain (e.g., Vicente & Wang, 1998). That is, existing schemas make the relevant part of the information predictable and thus easier to encode and remember (Graesser & Nakamura, 1982). Thus, in many ways, memory benefits in experts may be considered a manifestation of a broader phenomenon whereby information that is understood as meaningful—and thus integrated into a schema—is easier to correctly recognize or recall (Bartlett, 1932). For experts, there may simply be a wider variety of meaningful concepts and schemas, resulting in a richer ability to understand and remember stimuli in their domain of expertise (e.g., Ericsson & Kintsch, 1995). This is sometimes known as an organizational processing account of expertise: experts have improved memory because they are better able to chunk information and otherwise create effective knowledge structures (Ericsson & Kintsch, 1995; Rawson & Van Overschelde, 2008).

Is better organization the sole reason for better memory in experts? Beyond schemas and knowledge organization, experts in some domains—particularly those where the expertise is more perceptual, like radiologists looking at mammograms or car experts focusing on the details of cars—may have developed specialized processing mechanisms for their domain of expertise which take advantage of the way stimuli vary in that domain. For example, experts in some domains employ more holistic processing strategies for objects of their expertise (Bilalić et al., 2011; Gauthier et al., 2000; Gauthier et al., 1999; Richler et al., 2011; Watson & Robbins, 2014). Enhanced perceptual expertise may allow experts to process more information about an item even in the same amount of time, and lead to richer memory traces (Ericsson & Kintsch, 1995).

In addition to building richer knowledge structures and better perceptual encoding, there is a third factor that could explain experts’ improved memory performance in domains of expertise, which has often been overlooked in studies of memory: increased distinctiveness of items when they are items of expertise (Rawson & Van Overschelde, 2008). In contrast to views that claim memorability is an intrinsic aspect of a stimulus (e.g., Bainbridge et al., 2013), a significant body of literature argues instead that the critical driver of how memorable an item is in a given context is its distinctiveness from other items currently being stored in memory. Imagine, for example, you are given a list to remember that has 30 animal names and also the word “bread” on it. People tend to remember this distinctive word (“bread”) most accurately—and this is true even if it appears first on the list, so its unique status is not yet known and it is not differentially attended or processed (Calkins, 1894; Hunt, 2006). Memory models naturally predict this effect because most memory models propose that memory is strongly limited by interference at retrieval, and having more unique features allows easier retrieval (e.g., Shiffrin & Steyvers, 1997).

This is broadly consistent with the idea that abnormal or schema-inconsistent items tend to be better remembered than expected, schema-consistent items (Friedman, 1979; Hollingworth & Henderson, 2003; Light et al., 1979; McDaniel & Einstein, 1986; Pezdek et al., 1989). For example, people tend to better remember unexpected aspects of images (Friedman, 1979).

How does such distinctiveness interact with expertise? For experts, many items may be distinct from other items in a set in ways that nonexperts would not notice; those items are thus more distinctive in the set for experts than for nonexperts, enhancing experts’ memory for them (Rawson & Van Overschelde, 2008).

In summary, experts are often better at accurately recognizing or recalling information in their domain of expertise. This can arise from at least three factors, each of which has been independently studied: experts may have changed perceptual processing strategies; may benefit from general usage of schemas to organize memory; or may benefit from increased distinctiveness of items in memory. However, the way these effects interact has rarely been studied, and many have been studied primarily in domains with limited or no perceptual expertise available (e.g., in word lists).

The current experiments: Memory for mammograms in novices and expert radiologists

To understand how expertise affects memory, and how each of these three factors may play a role, the current experiments ask how expertise affects memory for mammograms (comparing novices and expert radiologists), and test whether expert radiologists have better memory for abnormal images (i.e., cancerous mammograms) when compared with normal images (i.e., noncancerous mammograms). While for normal mammograms, perceptual encoding benefits, schemas, and distinctiveness all likely play a role in experts’ memory, abnormal mammograms provide a unique case study. Abnormal mammograms do not violate a radiologist’s schema (as radiologists are trained to look for abnormalities), but abnormal cases do provide distinctive retrieval cues (e.g., this mammogram has calcifications in this location) that would not be available to nonexperts, who have no idea that those little white spots are significant. Nor would these cues be available in normal mammograms. Abnormal mammograms therefore present an interesting case: they are schema-consistent, while also potentially providing a unique window into the role of distinctiveness in experts’ memory.

To measure memory performance, we will use receiver operating characteristic (ROC) analysis to take into account the possibility of differential false alarms and differential response criterion, which is critical to understand whether any effects we observe are truly changes in memory strength. We predict that experts will have improved performance compared with nonexperts for both normal and abnormal mammograms because of their perceptual expertise and because they have developed schemas over time to represent these complex images. We also predict that abnormal items might show even more benefit for radiologists compared with nonexperts because for radiologists and radiologists alone, these images have unique and distinctive retrieval cues available.

We focus on radiologists’ memory for mammograms for two reasons: First, search for signs of breast cancer involves a usefully specific perceptual expertise. For instance, only 2–3 kinds of local abnormalities are typically present in abnormal mammograms, and radiologists have significant perceptual expertise whether looking at normal or abnormal medical images.

Second, there are two senses in which a mammogram might be considered “abnormal”: (1) It could contain a focal abnormality. In our study, these are masses or architectural distortions that are subsequently proven to be malignant. (2) Given a mass (for example) in one breast, the other breast could be considered abnormal in the sense that the image comes from a patient with cancer. We assess the impact of each of these two kinds of abnormality on memory. Note that a mammogram might be considered “abnormal” if it showed a benign mass. We did not use such stimuli in this study.

Radiologists are explicitly trained to recognize an image as abnormal if they detect the presence of a visible, localized abnormality, like a mass or calcification. In addition, recent research has shown that, if asked in an experimental setting, radiologists have an ability to detect a “gist” of abnormality in the breast contralateral to the lesion. They perform at above chance levels when asked to categorize images as coming from normal or abnormal patients (Evans et al., 2016). In other words, this study suggests that radiologists do not always need to see a localized physical lesion to know that an image is abnormal. This global signal of abnormality is relatively subtle. More importantly, for present purposes, work on this gist signal is new enough that most radiologists are unfamiliar with the concept. Thus, any impact on memorability could be considered to be the result of an implicit effect of abnormality.

Published studies of the gist of abnormality have involved giving radiologists only a brief (250–500 ms) glance at an image. While this seems sufficient for expert radiologists to gain some evidence of abnormality, it remains unknown whether this ability impacts radiologists’ memory for normal versus abnormal images.

To summarize, the questions guiding this experiment are the following: Do radiologists show improved memory performance for abnormal images compared with normal images? If so, does global gist produce enhanced expert memory for images of the breast contralateral to the breast that contains focal signs of cancer? Alternatively, does any abnormality advantage in memory depend upon having a focal abnormality that can draw spatial attention?

Experiment 1 is a baseline study with novice observers, whose performance can be compared with radiologist performance in Experiment 2. In addition, Experiment 1 allows us to determine whether our stimulus set contains images that are memorable regardless of expertise. Experiment 2 assesses memory performance in expert radiologists. To anticipate our results, Experiment 1 reveals patterns in our image set that we take into account in Experiment 2. In Experiment 2, we find a large memory benefit for radiologists relative to novices as well as an abnormality advantage in radiologists for focal abnormalities. We find no evidence that experts make use of a nonfocal gist of abnormality either in judgment or memory.

Experiment 1: Novices

Experiment 1 was conducted using novice (nonradiologist) observers. The design, number of observers, exclusion, and analysis plan for this experiment were preregistered (URL for this experiment: http://aspredicted.org/blind.php?x=xr3843).

In this experiment, novice observers viewed a series of mammograms and judged whether each case was normal or abnormal and whether they remembered seeing the image earlier in the experiment. We would expect both judging whether an image is normal or abnormal and remembering the images to be difficult, as these tasks are designed for expert radiologists. However, novice performance provides a useful baseline against which to compare radiologist performance, as well as a baseline measure of memory for these images in the absence of expertise. In particular, the results of this experiment can indicate whether particular images are distinctive even without any mammographic expertise.

Method

Participants

Sixty participants (23 female participants, mean age 38 years) were recruited for this experiment through Amazon’s Mechanical Turk, which offers monetary compensation for participation in online tasks. Mechanical Turk workers are reasonably representative of the American adult population (Berinsky et al., 2012; Buhrmester et al., 2011; Difallah et al., 2018), and provide data comparable to data obtained when participants are tested in experimental psychology laboratories (e.g., see Brady & Alvarez, 2011, for a comparison in a visual memory context). All participants gave informed consent, were compensated at a rate of approximately $10/hour, were located in the United States, and had a HIT approval rate greater than 95%. Informed consent procedures were approved by the Institutional Review Board of the University of California, San Diego.

Stimuli and procedure

Participants viewed single-breast mammograms in this study. The stimulus set consisted of 80 abnormal (cancerous) cases and 40 normal (noncancerous) cases. All images were deidentified and were preclassified by a group of trained radiologists who did not participate in the study. Normal images were noncancerous and did not contain benign lesions. Abnormal images consisted either of histologically verified malignant masses or architectural distortions (see Evans et al., 2016, for a more detailed description of this stimulus set). Half of the abnormal images contained a visible abnormality (i.e., a lesion was present) and half were images of the breast contralateral to the breast with the lesion (i.e., still an abnormal case, but with no focal indication of that abnormality). Thus, the entire set consisted of 40 normal images, 40 focal-abnormality images (henceforth referred to as abnormal), and 40 nonfocal-abnormality images (images contralateral to the breast with the focal abnormality, henceforth referred to as contralateral-abnormal). Each image subtended approximately 16 × 20 degrees of visual angle at an estimated viewing distance of approximately 60 cm from the screen.

On each trial, one image was presented for 3 seconds, followed by a new screen containing response questions. The mammogram was randomly chosen to be either normal, abnormal, or contralateral-abnormal. Critically, each image was also either a new image (presented for the first time in the experiment) or a repeat of an image from 3 trials back or 30 trials back (3-back and 30-back, respectively). Of the images that were later repeated, 50% were repeated at 3-back and 50% at 30-back. The experiment was balanced such that roughly 20% of trials in each half of the study were 3-backs and roughly 20% were 30-backs. In fact, due to sampling different streams of images for each participant, in our exact pool of radiologists, 18% of trials in the first half were 3-backs versus 23% in the second half, and 22% of trials in the first half were 30-backs versus 20% in the second half. In total, with repetitions, there were 210 trials: 120 new images (40 per condition), plus 90 repeat images (30 per condition, split evenly between 3-back and 30-back).
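As an illustration of how such a sequence can be constructed, the sketch below schedules first presentations and lagged repeats by rejection sampling. This is purely illustrative; the authors’ actual sequence-generation procedure is not described beyond the constraints above, and the sketch collapses the three image types (45 repeat pairs per lag rather than 15 per condition per lag).

```python
import random

def build_sequence(n_trials=210, lags=(3, 30), pairs_per_lag=45, seed=0):
    """Illustrative scheduler: 90 first-presentation/repeat pairs (45 per
    lag, collapsing the three image types) plus 30 never-repeated images
    fill 210 trials, matching the constraints described in the text."""
    rng = random.Random(seed)
    while True:  # rejection sampling: rebuild from scratch if we get stuck
        slots = [None] * n_trials
        jobs = [lag for lag in lags for _ in range(pairs_per_lag)]
        rng.shuffle(jobs)
        img_id, ok = 0, True
        for lag in jobs:
            # a first presentation at slot i forces the repeat into slot i+lag
            free = [i for i in range(n_trials - lag)
                    if slots[i] is None and slots[i + lag] is None]
            if not free:
                ok = False
                break
            i = rng.choice(free)
            slots[i] = ("new", img_id)
            slots[i + lag] = ("repeat", img_id, lag)
            img_id += 1
        if not ok:
            continue
        for i in range(n_trials):  # remaining slots: new, never-repeated images
            if slots[i] is None:
                slots[i] = ("new", img_id)
                img_id += 1
        return slots
```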

After being displayed for 3 seconds, each image was immediately followed by two response questions: (1) Was the image abnormal or normal? (2) Have you seen this image before? Using a standard computer mouse, participants were told to indicate their level of confidence on a 6-point rating scale ranging from confident yes/abnormal to confident no/normal (see Fig. 1). We collected confidence ratings instead of simple yes/no answers to allow for ROC analysis. There was no time constraint imposed on responding. The initiation of the next trial was contingent on answering both questions of the current trial.

Fig. 1

Method. N = 60 nonexpert novice participants rated a sequence of 210 images on normal/abnormal and old/new. Images could repeat either after three or 30 subsequent images and be either normal, abnormal, or contralateral-abnormal

Before the experiment began, participants were presented with instructions and several demographic questions (Gender; Age; “Are you a radiologist?”; “Do you have a job where you read medical images; i.e., tech, medical physicist?”). Instructions were written for a novice population with no medical training. For novice participants, abnormal cases were broadly defined as “images that might contain lesions, or cancer, or otherwise might be something worthy of follow-up if you were a radiologist.”

Exclusion criterion and analysis plan

Our exclusion criteria and analyses were decided in advance (see preregistration, above). Individual trials were excluded if participants took less than 1,500 ms or more than 15,000 ms to respond (based on pilot data). Participants were excluded if they took less than 15 minutes (zero excluded) or more than 1 hour to complete the study (3 excluded). Radiologists were excluded (1 excluded) as were those with other prior experience reading medical images (zero excluded). Participants were also excluded if they had more than 80% identical responses (e.g., picked the exact same answer on nearly every trial; one excluded) or had more than 20% of trials excluded on the basis of the reaction time criteria (one excluded). After applying these a priori exclusion criteria, seven participants were excluded from analysis, leaving a final sample of 53 participants.

Following our preregistered analyses (above), we did not begin with an overall analysis of variance (ANOVA) but instead conducted our specific targeted tests. We first analyzed the confidence ratings for classifying an image as abnormal or normal. We then analyzed the confidence ratings representing memory for images. To do this, we conducted ROC analyses for 3-back and 30-back memory as a function of image type (normal/abnormal/contralateral-abnormal). We also generated ROCs for the normal/abnormal judgments. ROCs were summarized by area under the curve (AUC) and compared using t tests. As noted, we are interested in whether, within the group of novice participants, there is a benefit in memory performance for any type of image (e.g., as judged by normal vs. abnormal AUC). Since the novices lack medical experience, any such effect would give us insight into the nature of the image set (i.e., memorability or distinctiveness). Finally, we conducted image similarity analyses to quantify how image differences might be influencing these results.
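For readers unfamiliar with rating-based ROC analysis, the sketch below shows one standard way to build an ROC and its AUC from a 6-point confidence scale: sweep a criterion across the rating levels and plot cumulative hit rates against cumulative false-alarm rates. It is a minimal sketch, not the authors’ analysis code, and it assumes that higher ratings indicate greater confidence that an image is old.

```python
import numpy as np

def roc_from_confidence(old_ratings, new_ratings, n_levels=6):
    """Cumulative hit and false-alarm rates across confidence criteria.
    old_ratings: ratings given to repeated ("old") images
    new_ratings: ratings given to first presentations ("new" images)
    Assumes ratings run 1..n_levels, with n_levels = confident "old"."""
    old, new = np.asarray(old_ratings), np.asarray(new_ratings)
    # Sweep the criterion from strictest (rating >= 6) to most lenient (>= 1)
    hits = np.array([np.mean(old >= c) for c in range(n_levels, 0, -1)])
    fas = np.array([np.mean(new >= c) for c in range(n_levels, 0, -1)])
    return fas, hits

def auc(fas, hits):
    """Area under the ROC via the trapezoidal rule, anchored at (0, 0).
    (The most lenient criterion already contributes the (1, 1) point.)"""
    f = np.concatenate(([0.0], fas))
    h = np.concatenate(([0.0], hits))
    return float(np.trapz(h, f))
```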

Image similarity comparison

Because normal, focally abnormal, and contralateral-abnormal images are necessarily different image sets, it is useful to compare how distinctive each set of images is from all the other images in order to look at the effect this has on memory. One way to accomplish this is to have individuals give similarity ratings between images. However, this would require 120 × 120 = 14,400 similarity ratings. Instead, to streamline the process, we relied on previously established computer vision techniques designed to give similarity measurements for natural scenes. Specifically, we conducted a Gabor wavelet pyramid (GWP) analysis, which computes features of the images and compares them (Greene et al., 2016; Kay et al., 2008). To assess the level of similarity among the different image types, the GWP represents each image as the output of a bank of multiscale Gabor filters. Prior work has shown that these features can successfully model object representation in early visual areas (Kay et al., 2008). Following the exact procedure and parameters provided by Greene et al. (2016), each image was converted to grayscale, downsampled to 128 × 128 pixels, and represented with a bank of Gabor filters at three spatial scales (3, 6, and 11 cycles per image, with a luminance-only wavelet that covers the entire image), four orientations (0, 45, 90, and 135 degrees), and two phases (0 and 90 degrees). This gave a set of features for each image, which we then compared with those of all 120 images to compute a distance/dissimilarity score: the dot product of each image’s feature vector with each other image’s, after subtracting the mean across images and normalizing the feature vectors to unit length.
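A simplified sketch of this kind of pipeline is below. It is an assumption-laden approximation rather than a reimplementation of Greene et al. (2016): it pools each filter’s response over the whole image (one value per scale/orientation/phase) instead of using a full spatial pyramid of wavelets, and it omits the luminance-only wavelet.

```python
import numpy as np
from scipy.signal import fftconvolve
from skimage.filters import gabor_kernel

def gwp_features(img, size=128):
    """Crude Gabor-bank features: 3 scales x 4 orientations x 2 phases.
    Assumes img is already grayscale and resized to size x size."""
    img = np.asarray(img, dtype=float)
    feats = []
    for cycles in (3, 6, 11):                    # cycles per image
        freq = cycles / size                     # -> cycles per pixel
        for theta in np.deg2rad([0, 45, 90, 135]):
            k = gabor_kernel(freq, theta=theta)  # complex (even + i*odd) kernel
            resp = fftconvolve(img, k, mode="same")
            feats.append(np.abs(resp.real).sum())  # 0-degree (even) phase
            feats.append(np.abs(resp.imag).sum())  # 90-degree (odd) phase
    return np.array(feats)

def dissimilarity_matrix(features):
    """1 - cosine similarity of mean-centered, unit-length feature vectors,
    mirroring the normalized dot-product comparison described in the text."""
    F = np.asarray(features, dtype=float)
    F = F - F.mean(axis=0)                       # subtract mean across images
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    return 1.0 - F @ F.T
```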

Results (Experiment 1: Novices)

Performance on the classification task

First, we looked at how confident novices were at classifying an image as either normal or abnormal (see Fig. 2). We found a significant difference between normal and abnormal images, t(52) = 4.78, p < .001, but not between normal and contralateral-abnormal images, t(52) = 1.94, p > .05.

Fig. 2

Classification task: Overall performance of novices on labeling an image as normal or abnormal. The confidence rating scale is plotted on the y-axis; each point represents the average rating for a particular image. We found a significant difference in confidence in classifying normal versus abnormal images, which appears to be driven by a few salient abnormal images. Novices are otherwise not confident in distinguishing between image types (most responses fall near the middle of the confidence scale, no matter the image type). Error bars represent standard error of the mean

While participants did not have training to distinguish normal from abnormal medical images, a small number of images in the set are extremely saliently abnormal (i.e., a single bright white spot would look questionable even to novice viewers). Looking at ratings by image (see Fig. 3) reveals that these images are largely responsible for the significant difference between normal and abnormal ratings. In short, for at least a small subset of images, even novice participants can notice the abnormality, leading to above-chance classification performance overall. But for most images, novices seem to have little information about normality versus abnormality.

Fig. 3

Image ratings in the classification task. Example images and their confidence ratings for each participant in the classification task. As can be seen with the third pictured image, most participants rated it as abnormal with high confidence. The two or three bright vertical stripes in the abnormal image set indicate that those, and only those, images were reliably rated as abnormal by a large majority of participants

Note that the y-axis in Fig. 2 represents the confidence ratings for novices. It is clear that the novices are generally not confident in distinguishing any image type, with average responses tightly clustered near the middle of the rating scale for all conditions. Another way of visualizing these data is with an ROC curve (see Fig. 4), where novices fall almost on top of the dotted diagonal line indicative of chance performance, with an AUC of only 0.54 (where 0.50 is chance and 1.0 is perfect). As stated above, this difference from chance is nonetheless highly reliable across participants, t(52) = 4.21, p < .001, largely because of the few images that participants could all reliably classify.

Fig. 4

ROC for normal/abnormal categorization. Novices are very close to the diagonal line representative of chance performance, indicating that they do not perceive a strong difference between normal and abnormal images. The significant effect is driven by a select few salient images (see Fig. 3)

Memory for abnormal images

Figure 5 shows the ROCs for the 3-back and 30-back memory tasks. Since novices were not, for the most part, able to perceive contralateral-abnormal images as different from normal images in the classification task, we focused exclusively on memory differences between normal and abnormal images. Overall, independent of image type, and as expected, novices have better 3-back memory (averaged AUC of 0.70 for detecting 3-backs) than 30-back memory (averaged AUC of 0.64 for detecting 30-backs), t(52) = 6.59, p < .001. Interestingly, breaking down performance across image conditions reveals that novices show a small normality benefit: they remember normal images better than abnormal images in both the 3-back condition and the 30-back condition, with only the 3-back yielding a significant difference. We found an AUC benefit of 0.069 for normal images at 3-back, t(52) = 5.48, p < .001, compared with abnormal, and an AUC difference of 0.026 for normal images at 30-back, t(52) = 1.70, p = .096, compared with abnormal.

Fig. 5

Novice performance on the memory task. As noted, the gray dashed line indicates chance, and more bowed-out curves represent better memory performance. Novices had stronger memory for images in the 3-back condition than in the 30-back condition. Novices also show a small effect of normality, with memory for normal images being better than for abnormal images in both 3-back and 30-back conditions

Given the weak performance at discriminating normal from abnormal images, it is rather surprising that normality had any effect. Therefore, we examined the data for evidence of more basic effects of visual similarity. We found that the lower memory performance in the abnormal conditions was largely driven by an increased false-alarm rate in the contralateral-abnormal and abnormal image sets (here, responses with a confidence rating >3 were classified as “new”). This is consistent with an image similarity account in which novices would be more likely to false alarm to new images in the contralateral-abnormal and abnormal conditions simply because these images are more similar to one another than images in the normal set (as predicted by summed-similarity accounts of memory; e.g., Nosofsky, 1991). In other words, if the normal images were somewhat more dissimilar to each other than the other images, this could explain why novices have somewhat better memory for the normal condition: it is easier to determine whether an image of a dog is new if that dog is presented in a series of different animals than if it is presented in a set of similar dogs (though, obviously, the similarity effects in our stimuli are much smaller). We test this hypothesis next.
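To make the summed-similarity logic concrete, here is a minimal exemplar-model sketch in the spirit of Nosofsky (1991). The exponential similarity kernel and the squashing of familiarity into a pseudo-probability are illustrative assumptions, not a model fit from any cited paper; the point is only that a new probe drawn from a homogeneous set accrues more spurious familiarity, and hence more false alarms, than one drawn from a diverse set.

```python
import numpy as np

def p_old(probe, stored, c=1.0):
    """Familiarity of a probe as the sum of exponentially decaying
    similarities to every stored exemplar (summed-similarity account).
    probe: feature vector; stored: one row per remembered image."""
    probe = np.asarray(probe, dtype=float)
    stored = np.asarray(stored, dtype=float)
    dists = np.linalg.norm(stored - probe, axis=1)  # distance to each exemplar
    familiarity = np.exp(-c * dists).sum()          # summed similarity
    return familiarity / (1.0 + familiarity)        # squash to (0, 1)

# A new probe near a tight cluster of stored items feels more familiar
# (more false-alarm-prone) than the same probe among widely spread items.
rng = np.random.default_rng(1)
tight = rng.normal(0.0, 0.2, size=(40, 10))   # homogeneous set (like abnormals)
spread = rng.normal(0.0, 2.0, size=(40, 10))  # diverse set (like normals)
new_probe = rng.normal(0.0, 1.0, size=10)
print(p_old(new_probe, tight), p_old(new_probe, spread))
```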

Similarity matrix—Gabor wavelet pyramid analysis

We tested this image similarity hypothesis by measuring similarity between our images as described in the Methods (Greene et al., 2016; Kay et al., 2008). We found increased dissimilarity among normal images relative to contralateral-abnormal and abnormal images (normal = 0.174; abnormal = 0.139; contralateral-abnormal = 0.133). In other words, normal images were more different, on average, from one another (and thus more discriminable in memory) than either abnormal or contralateral-abnormal images. This is consistent with the hypothesis that the small difference in memory favoring normal images is driven by image similarity differences across sets. Thus, the small normality benefit found in the current study is likely a result of image similarity. Critically, this can provide a useful baseline for considering memory for the same images in expert radiologists in Experiment 2.

Experiment 2: Radiologists

Experiment 2 was the same as Experiment 1, except conducted on radiologist observers.

Method

Participants

Thirty-two expert radiologists (14 female participants, average age = 49 years) were recruited during the 2018 Radiological Society of North America (RSNA) conference in Chicago, Illinois. All radiologists gave informed consent and were not compensated beyond being entered into a lottery for a $500 gift card. Informed consent procedures were approved by the Institutional Review Board of the University of California, San Diego.

Data from participants would have been excluded if they took less than 15 minutes or more than 1 hour to complete the study, had more than 80% identical responses, or had more than 20% of trials excluded. Under these guidelines, no radiologists were excluded from analysis, leaving a final sample of 32 participants.

Stimuli and procedure

The stimuli and experimental design were the same as described in Experiment 1. The main procedural difference was that the experiment was conducted at the RSNA conference where the experimenter explained the instructions in person. Unlike in Experiment 1, in Experiment 2, we gave more general instructions, asking for any abnormality rather than specifically asking participants to look for focal lesions or cancer: “For each image, please judge whether the image is abnormal or normal, and whether you have previously seen it during the course of the experiment.”

Results

In this section, we compare the performance of expert radiologists to the performance of novice participants in Experiment 1. In particular, we investigate how nonexperts compare to experts’ judgments of image classification (i.e., normal vs. abnormal), and critically, whether experts show differential memory for abnormal versus normal images. While analyzing expert performance, we take into account idiosyncrasies in our image set that we learned from Experiment 1, such as that our normal images are more dissimilar and therefore inherently slightly more memorable.

Performance on the classification task

Similar to Experiment 1, we first analyzed performance on the classification task by looking at the confidence ratings of classifying each image as either normal or abnormal. How good are radiologists at simply distinguishing abnormal from normal images? Unsurprisingly, radiologists are very good at distinguishing abnormality (see Fig. 6a). Radiologists rated abnormal images as abnormal with significantly greater confidence than normal images, t(31) = 13.17, p < .001. Figure 6b shows the ROC curve for distinguishing focal-abnormal images from normal images in radiologists. ROCs were summarized by area under the curve (radiologist AUC = 0.72; recall that novice AUC = 0.54). As noted in Experiment 1, novices are close to the diagonal line indicative of chance, whereas radiologists produce a typical curvilinear ROC, indicative of a perceived (and significant) difference between normal and abnormal images, with an AUC well above chance, t(31) = 19.8, p < .001.

Fig. 6

a (top left): Classification task: Overall performance of radiologists on labeling an image as normal or abnormal. Once again, each point in the plot represents the average rating for a particular image. Radiologists clearly distinguished abnormal from normal images, but they did not distinguish between contralateral-abnormal and normal images. b (top right): ROC depiction of performance for labeling an abnormal image as abnormal instead of normal (ignoring contralateral-abnormal images). c (bottom): Classification by image. Unlike novices, experts reliably classify most of the abnormal images as abnormal and most of the normal images as normal, with performance not largely driven by any particular subset of images

Next, we looked at whether radiologists could detect abnormality in the contralateral-abnormal images. There was not a significant difference between the normal and contralateral-abnormal image conditions, t(31) = 0.43, p = .67. In the original study, Evans et al. (2016) found an effect of abnormality in the gist information (i.e., at a very short presentation time of ~250 ms). Our instructions and stimulus set may have biased participants against reporting contralateral images as abnormal. In a set of images that includes visible lesions (the abnormal cases), and in the absence of an instruction to look for asymptomatic images from symptomatic patients (the contralateral cases), it is perhaps not surprising that radiologists reserved their abnormal ratings for the abnormal cases with lesions. Furthermore, it is possible that our instructions could have primed radiologists to look for both benign and malignant lesions, although no benign lesions were present in the current study. Future studies could investigate the effects of instruction on this task. Recall, however, that our interest in the present experiment is in radiologists’ memory for these images. Contralateral-abnormal images, for instance, might still be remembered better if their vaguely suspicious appearance caused radiologists to devote more attention to them.

Memory for abnormal images

Figure 7 shows radiologist performance on the memory task. Radiologists have better memory for abnormal images in both memory conditions, but the advantage for abnormal images is only significant in the 30-back condition, t(31) = 2.86, p = .008, AUC difference = .051. We found an AUC advantage of 0.02 for abnormal images at 3-back. Although this was not significant, t(31) = 1.62, p = .12, it follows the same trend as the 30-back condition.

Fig. 7

Radiologist performance on the memory task. Radiologists have better memory for abnormal images in both of the memory conditions; however, only the advantage at long delays (30-back) was significant

Radiologists showed no memory benefit for the contralateral-abnormal images, even at long delays (p = .24). Since radiologists were not able to distinguish between contralateral-abnormal images and normal images in the classification task, this result might be expected; though, recall that we were looking for evidence that an implicitly recognized abnormal gist might enhance memory. That is not what we found. Overall, independent of image type, radiologists have better memory at 3-back (averaged AUC of .852 for detecting 3-backs) than at 30-back (averaged AUC of .752 for detecting 30-backs) for medical images. Why are radiologists better at 3-back than at 30-back? While it seems clear that this difference largely reflects typical effects of forgetting and interference (e.g., Wixted, 2005), it is also possible that observers would be more likely to “catch on” to the presence of the 3-back rather than the 30-back repetitions. If so, they might adopt a strategy that prioritized the 3-back task. However, given that the 3-back and 30-back tests were equally likely and equally distributed throughout the task, and that observers consistently said they remembered seeing mammograms from 30 images back (and therefore were distinctly aware that 3-back wasn’t the only n-back test present), it seems unlikely that observers transitioned to a strategy that prioritized only the 3-back task. Taken together, these results suggest that experts have better memory overall at 3-back than at 30-back, but that a memory benefit for abnormal images compared with normal images is significant only at 30-back.

In recognition memory studies, it is almost always found that ROCs are not consistent with an equal-variance signal detection model (e.g., Egan, 1958; Wixted, 2007). One way to look at this is to convert the hit and false-alarm rates to z scores and to plot zROC functions. On a zROC graph, equal variance produces data with a zROC slope of 1.0. Instead, as is typical in recognition memory tasks, the slopes of our zROCs were reliably below 1.0 in three of the four memory conditions. We fit a linear mixed-effects model with slope and intercept as random per-subject factors (M = 0.68 for 3-back normal images, difference from 1.0: p < .001; M = 1.05 for 30-back normal images, not different from 1.0, p = .60; M = 0.52 for 3-back abnormal images, p < .001; M = 0.82 for 30-back abnormal images, p = .005). Collapsing across all conditions, thus allowing the slope to be more reliably estimated, the mean zROC slope was 0.68, significantly different from 1.0 (p < .00001). Taken together, then, the ROCs we observed in memory were inconsistent with an equal-variance signal detection model and consistent with an unequal-variance model, potentially due to variation in memory strength between different items. This is typical of recognition memory and is the reason that collecting confidence judgments and performing ROC analysis is necessary in order to assess memory strength. Simple d', in this context, does not properly account for response criterion differences (e.g., Dougal & Rotello, 2007).
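As a concrete illustration of the zROC computation, one can transform the cumulative rates from the rating ROC into z space and fit a line. This is a per-subject ordinary-least-squares sketch, not the mixed-effects model reported above:

```python
import numpy as np
from scipy.stats import norm

def zroc_slope(fas, hits, eps=1e-3):
    """Slope of the zROC: regress z(hit rate) on z(false-alarm rate).
    An equal-variance signal detection model predicts a slope of 1.0;
    slopes below 1.0 imply a wider strength distribution for old items."""
    f = np.clip(np.asarray(fas, dtype=float), eps, 1 - eps)  # avoid z(0), z(1)
    h = np.clip(np.asarray(hits, dtype=float), eps, 1 - eps)
    zf, zh = norm.ppf(f), norm.ppf(h)
    slope, intercept = np.polyfit(zf, zh, 1)  # ordinary least squares
    return slope, intercept
```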

Recall from the similarity analysis in Experiment 1 that the normal images in our data set are less similar to each other than the abnormal images, and thus memory for normal images should be better than for abnormal images (as it was in novices). In fact, it is memory for the abnormal images that is better in radiologist observers. This suggests that the effect of expertise more than compensates for differences between the stimulus categories in image similarity. To see what the effect of abnormality is, independent of baseline image similarity differences, we can compare radiologists’ memory performance to novices’ performance with the same images. To do this, we compare the benefit—in terms of AUC of the ROC—for radiologists relative to novices in each condition. Doing so reveals a significant abnormality benefit in expert radiologists at both 3-back, t(31) = 6.67, p < .001, and 30-back, t(31) = 4.33, p < .001: after baselining radiologists’ performance against that of novice participants, radiologists were specifically better at remembering abnormal images (see Fig. 8).

Fig. 8

Using novices as a baseline to account for image similarity, there were robust abnormality memory benefits for radiologists at both 3-back and 30-back

Extracting additional information with a second presentation

Due to the structure of this experiment, designed to probe memory, each item in the memory set has two classification ratings (normal/abnormal). Thus, while we set out to probe memory, the experiment also makes it possible for us to combine both ratings in order to examine whether there is a “crowd-within” effect in this situation (Vul & Pashler, 2008). Vul and Pashler proposed the crowd-within effect as a within-person variant of the “wisdom of the crowd”: they found that averaging a single individual’s responses to repetitions of the same question led to better performance than single responses alone. This is what one would expect if a single judgment did not incorporate all of the information people could possibly have about a question. If this is true for assessments of mammograms by expert radiologists, we would expect that averaging a radiologist’s ratings of abnormality from two exposures to the same mammogram should result in better accuracy than either rating alone. Note that in this situation, however, unlike Vul and Pashler (2008), participants actually have additional information the second time—they get to see the image again before the second judgment; they are not just asked again. Thus, in this case, the crowd-within effect could arise from actual new information being incorporated (e.g., the observer might scrutinize different parts of the image), rather than from internal sampling.

We find a modest but significant advantage to incorporating both judgments: Averaging radiologists’ responses from the first and second time that they saw an image resulted in slightly higher performance in the 30-back condition (AUC = 0.745) compared with single item performance (AUC = 0.716), t(31) = 3.46, p = .002 (see Fig. 9, left). The effect was not significant in the 3-back condition (joint AUC = .712, single AUC = .705), t(31) = 1.15, p = .259. Unsurprisingly, this effect was not present in novices, since their performance was very poor on both responses (see Fig. 9, right; all ps > .10).
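The comparison itself is simple to sketch. The Mann-Whitney formulation of the AUC below is equivalent to the area under the rating ROC; the alignment of first and second ratings by image and the assumption that higher ratings mean “more abnormal” are illustrative choices, not a description of the authors’ code.

```python
import numpy as np

def rank_auc(scores, is_abnormal):
    """AUC via the Mann-Whitney formulation: P(abnormal score > normal
    score), counting ties as 1/2. Equivalent to area under the ROC."""
    s = np.asarray(scores, dtype=float)
    abn = np.asarray(is_abnormal, dtype=bool)
    a, n = s[abn], s[~abn]
    greater = (a[:, None] > n[None, :]).mean()
    ties = (a[:, None] == n[None, :]).mean()
    return greater + 0.5 * ties

def crowd_within(first_ratings, second_ratings, is_abnormal):
    """Compare classification AUC from the first rating alone against the
    AUC from averaging the two ratings given to the same image."""
    first = np.asarray(first_ratings, dtype=float)
    avg = (first + np.asarray(second_ratings, dtype=float)) / 2.0
    return rank_auc(first, is_abnormal), rank_auc(avg, is_abnormal)
```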

Fig. 9

Crowd-within analysis. Left: Radiologists (Exp. 2). The blue line is the ROC for distinguishing focally abnormal from normal mammograms based on ratings at first presentation. The red line is the ROC based on the average of that first rating and the rating given when the image repeated at 30-back. Right: Novices (Exp. 1), plotted in the same way

Thus, expert performance can be improved (albeit rather modestly) by averaging more than one response. It remains to be seen whether this benefit would occur if radiologists were offered unlimited time to process each image, rather than the 3 seconds in the current study. The limited viewing time here may have particularly enhanced radiologists’ ability to extract new information on the second viewing of the mammogram.

General discussion

In the current study, we examined memory performance by nonexpert novices and expert radiologists for normal versus abnormal mammography images as a case study in understanding the role of schemas, distinctiveness, and expertise in memory. To do so, we relied on ROC analysis, designed to properly measure memory independent of differences in response criteria and to take into account both enhanced memory for seen items as well as the possibility of false alarms.

First, we looked at how confident and how competent novice and expert observers were at classifying medical images as either normal or abnormal. Unsurprisingly, radiologists were much better than novices at this task. Novices did show some ability to distinguish abnormality, although this appeared to be largely the result of a few salient images.

Second, we examined our main question of interest: memory for the images. In Experiment 1, we examined memory for mammograms in novices, who have none of the expertise or schemas needed to process these images. We found poor performance overall, as well as a small normality benefit in novice participants’ memory, which could be explained by the greater image dissimilarity of normal images. Thus, Experiment 1 (on novices) gave us not only a baseline for memory performance, but also an understanding of the intricacies of our image set, showing that some abnormal images were quite salient, and that our normal images were more dissimilar from each other.

Even though the normal images in our set were more visually distinctive, in Experiment 2, we found that radiologists had better memory for abnormal images, and had far superior memory performance to novices. This gives insight into how expertise changes memory: not only enhancing the encoding of normal items but also enhancing the distinctiveness of abnormal items. Thus, while experts might have access to perceptual encoding benefits, distinctiveness and/or schemas/chunking to enable them to outperform novices, our finding of an extra benefit of expertise for abnormal images is most consistent with a special role of distinctiveness. For experts, the abnormal images have unique features that make them distinct from other items in memory; whereas for novices, these features are not appreciated and so these images are just like any other image. For example, one possibility is that rather than encoding the entire image, in the case of abnormal images, radiologists specifically encode the abnormality and not the rest of the image into memory. This might reduce the load on memory for that image and might make the memory trace for that image more distinctive.

Broadly speaking, then, we find strong evidence for a role of schemas and distinctiveness in memory, even after taking into account false memory and the possibility of response criterion shifts: experts significantly outperform novices, and memory for abnormal cases with a visible, focal lesion is better than memory for other images. There was no evidence for a memory benefit for “abnormal” contralateral cases.

Measuring memory: False alarms and ROC analysis

In the current studies, we used ROC analysis to examine memory. This is because, in previous work, it has often been unclear whether benefits for schema-consistent information like those reported in experts are, in fact, improvements in memory, as opposed to changes in response criteria. To determine whether memory has actually improved, it is not adequate to simply find a reliable increase in the rate at which observers correctly report having been exposed to some piece of information (the true-positive, or “hit,” rate). The observer could simply be saying “yes, I have seen it” more often, which would also produce an increase in false-positive (or false-alarm) errors. In the context of memory research, these false-positive errors can be seen as a form of false memory. In theory, signal detection models and measures like d' can distinguish between these two possibilities, but in practice, the prerequisites for d' to properly adjust for response bias (equal variance; zROC slopes = 1.0) are almost never present in recognition memory contexts, and were not present here. Thus, ROC analysis is needed to distinguish differences in the ability to remember from criterion shifts, which would reflect an increased tendency of observers to say that they remember (e.g., Wixted & Mickes, 2015).
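A toy simulation makes the logic vivid. In the sketch below (illustrative parameter values, not fit to any data), the underlying old/new strength distributions never change, yet simply relaxing the response criterion inflates the hit rate; this is why hit rates alone, and d' under violated equal-variance assumptions, can masquerade as memory improvements.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unequal-variance signal detection toy: memory strength is fixed, but a
# more liberal criterion raises hits and false alarms together.
old = rng.normal(loc=1.0, scale=1.4, size=100_000)  # strengths of old items
new = rng.normal(loc=0.0, scale=1.0, size=100_000)  # strengths of new items

for criterion in (1.0, 0.5, 0.0):                   # conservative -> liberal
    hit = (old > criterion).mean()
    fa = (new > criterion).mean()
    print(f"criterion={criterion:4.1f}  hit={hit:.3f}  false alarm={fa:.3f}")
# Hits climb from ~.50 to ~.76 without any change in memory; only ROC
# analysis over the whole range of criteria separates the two accounts.
```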

Is false memory a true concern? In fact, previous work has found that organizing information in memory via schemas can have both positive and negative consequences—and in particular, does often increase false alarms, making it difficult to tell whether memory is genuinely improved. Specifically, while greater understanding—as in expertise—may allow encoding of only the relevant details, reducing memory load, it may also cause us to falsely remember information that was not present (e.g., Owens et al., 1979). For example, in recognition tests, people are more likely to false alarm to schema-consistent relative to schema-inconsistent lures: they would be more likely to falsely report seeing books in a graduate student’s office than inconsistent objects like a piece of tree bark or a pair of pliers (Brewer & Treyens, 1981; Lampinen et al., 2001). And while participants are more likely to correctly remember schema-consistent information in a briefly presented scene (Biederman et al., 1982; Brewer & Treyens, 1981), they are also more likely to falsely remember such information (e.g., Hollingworth & Henderson, 2003; Pezdek et al., 1989).

Thus, measuring full ROCs—rather than attempting to infer how response bias would change performance using measures like A', d', or hits minus false alarms—often reveals surprising answers about memory, particularly in situations like expertise and consistent/inconsistent items, where it is known that both hit and false-alarm rates are affected. For example, Dougal and Rotello (2007) used ROC analysis to show that the well-known effect of “improved memory” for emotional words compared with neutral words is a response bias effect, not a true difference in memory between the words. Similarly, Mickes et al. (2012) showed in the domain of eyewitness memory that sequential lineups, which reduce both false alarms and hit rates relative to simultaneous lineups, are actually inferior to simultaneous lineups, contrary to a large body of literature suggesting the opposite (e.g., Wells et al., 2011), as the major “benefit” arises simply from a response criterion shift, not a change in memory strength.

Thus, the current experiments provide unique evidence that expertise and distinctiveness that is apparent only to experts do, in fact, enhance memory—and that this is not just a response criterion shift.

What explains radiologists outperforming novices

Consistent with a wide variety of work on expertise, we find that expert radiologists outperform novices in remembering mammograms. One likely possibility is that this occurs because of experts’ knowledge about these images: they have relevant knowledge that allows them to understand these images in a way novices do not, and likely have perceptual expertise built into their visual system from years of experience (e.g., in the form of greater holistic processing; e.g., Richler et al., 2011). In particular, for an expert, the abnormal images would have an added attribute (that mass, that calcification), learned over years of experience, that would help to distinguish the item in memory.

However, in the current study, we did not attempt to directly match our experts to our novices. Our novice pool was sampled from the internet, which is much more broadly representative of the demographics of the United States than an undergraduate population (e.g., Difallah et al., 2018), but still likely differs in a number of ways from our radiologists (in demographic and socioeconomic factors, as well as motivation to focus on mammogram images). Thus, Experiment 1 should be taken as only an approximate baseline: it revealed important image features in our stimulus set, and points to the possibility of strong expertise effects, but does not directly confirm these are based solely on knowledge rather than other factors.

Memory and abnormality judgments in radiologists

Previous work has found mixed results when investigating memory improvements in radiologists. For example, Hardesty et al. (2005) investigated radiologists’ long-term memory for medical images presented months later and found that none of the radiologists remembered cases that they had read previously. Evans et al. (2016) found mixed results when investigating whether abnormality improves memory in expert observers, including radiologists. Our results provide context for these ambiguities, as they suggest that expert radiologists do have stronger memory for abnormal images even in a long-term memory setting and even when response bias is properly taken into account using ROC analysis. However, our long delays were only on the order of minutes, not months, and so it remains unclear whether such advantages would persist over longer durations.

It is worth noting that in the classification task, radiologists performed on average much more poorly than would be expected of radiologists in the clinic with unlimited viewing time (d' = 2.5–3.0, as in D’Orsi et al., 2013). One reason for this might be that each image in our study was presented for only 3 seconds. For instance, Evans et al. (2013) showed radiologists only a brief glimpse of mammograms and varied timing from 250 ms to 2,000 ms. The AUCs for radiologists in their experiment at 500-ms, 1,000-ms, and 2,000-ms viewing times were 0.65, 0.66, and 0.72, respectively. In our experiment, with a presentation time of 3,000 ms, we found an AUC of 0.72. Thus, our 3,000-ms presentations resulted in a level of performance similar to the 2,000-ms presentations of Evans et al. (2013), which, while well below what is expected with unlimited viewing time, is consistent with other studies and with viewing time being the main constraint that led to lower performance.

The “crowd-within” effect in radiologists

Because our study had radiologists answer the same classification question about an image multiple times, we looked at whether averaging radiologists’ responses to the two presentations of the same image resulted in better performance (a “crowd-within” effect; Vul & Pashler, 2008). We found that radiologist performance improved when the two responses to the same image were averaged, compared with either response alone, but only in the 30-back condition and only modestly even then. This indicates that by the time radiologists were presented with the same image 30 images later, they gave a response that was somewhat independent of their first response. This suggests that, under the current experimental conditions, there might be information the radiologists are not using the first time they see an image—and that the opportunity to see the image again allows the radiologist to glean additional useful information. Future studies might determine whether such benefits persist when experts are given unlimited time to process the images, as well as whether this effect can be made larger with an even longer delay between the first and second presentation of an image (as found by Vul & Pashler, 2008).

The “gist” of abnormality

Given the Evans et al. (2016) finding that there is a “gist of abnormality” present in the contralateral breast when no localizable abnormality is present, we were interested to know whether these contralateral-abnormal images had any advantage over normal images in expert memory. We found no such evidence. In our experiment, we also found no difference in the classification of abnormality between contralateral-abnormal images and normal images. While at first this might seem to contradict earlier work, there are a number of methodological differences that make it difficult to compare our results directly with Evans et al. (2016). It is possible that we did not find this result because we presented images for a longer encoding time (3,000 ms); stimulus exposures in mammogram “gist” studies have generally been under a second, with 500 ms typical. Presenting images for longer encoding times might actually obscure the gist information, overwriting an initial “gist” impression with more semantic or meaningful information. Recall, also, that our radiologists were not informed about gist and likely reserved their “abnormal” ratings for cases where they could localize a lesion. It is possible that we would observe a contralateral-abnormal effect even at long encoding times if we explicitly directed participants to look for a more general abnormal texture or gist. This seems to be a promising avenue for future work.

Conclusion

Using radiologists as a case study, we find an advantage for memory in experts as well as an advantage for abnormal images—even when properly measuring memory via ROC analysis. This is broadly consistent with the literature on schemas. Our findings have important implications for both applied fields that utilize expert intelligence in making inferential decisions as well as theoretical fields interested in how memory changes with expertise. In particular, understanding the structure of memory in experts is critical in situations where decisions need to be made by people who have significant expertise.