Abstract
Memories are encoded in a manner that depends on our knowledge and expectations (“schemas”). Consistent with this, expertise tends to improve memory: Experts have elaborated schemas in their domains of expertise, allowing them to efficiently represent information in this domain (e.g., chess experts have enhanced memory for realistic chess layouts). On the other hand, in most situations, people tend to remember abnormal or surprising items best—those that are also rare or out-of-the-ordinary occurrences (e.g., surprising—but not random—chess board configurations). This occurs, in part, because such images are distinctive relative to other images. In the current work, we ask how these factors interact in a particularly interesting case—the domain of radiology, where experts actively search for abnormalities. Abnormality in mammograms is typically focal but can be perceived in the global “gist” of the image. We ask whether, relative to novices, expert radiologists show improved memory for mammograms. We also test for any additional advantage for abnormal mammograms that can be thought of as unexpected or rare stimuli in screening. We find that experts have enhanced memory for focally abnormal images relative to normal images. However, radiologists showed no memory benefit for images of the breast that were not focally abnormal, but were only abnormal in their gist. Our results speak to the role of schemas and abnormality in expertise; the necessity for spatially localized abnormalities versus abnormalities in the gist in enhancing memory; and the nature of memory and decision-making in radiologists.
Our ability to remember information is deeply dependent on our existing knowledge structures, or schemas (Bartlett, 1932; Hintzman, 1986). Even superficially identical information is better remembered if it is integrated into a set of knowledge rather than simply seen as arbitrary. For example, people are better at remembering that someone is a baker than that someone’s name is Baker, because the profession baker activates a rich set of meaningful associations that the name Baker does not (McWeeny et al., 1987); and people remember visual images better if they recognize them as faces than if identical images are not recognized, but seen as meaningless texture (e.g., Brady et al., 2019).
Different people have different knowledge and schemas, in part based on their expertise, and this has consequences for memory: Imagine after playing a round of chess, you are asked to recreate the board from some critical moment in the game. For most people, this task would prove very difficult. However, if you were a world-class chess player, this might be quite easy. Becoming an expert in a domain such as chess changes our memory for items in that domain of expertise (Chase & Simon, 1973; de Groot, 1946), allowing us to store more information as long as this information is consistent with the expectations we have formed as a result of our expertise (Gobet & Simon, 1996).
A large literature is devoted to quantifying memory benefits in experts compared with novices (e.g., Ericsson & Kintsch, 1995; Engle & Bukstel, 1978; Gobet & Simon, 1996; Vicente & Wang, 1998). For example, car experts can remember more car images in visual working memory (Curby et al., 2009); baseball experts can remember more baseball-related information in long-term memory (Voss et al., 1980); and expert radiologists have better long-term memory for mammograms—but not natural scenes or real-world objects—compared with controls (Evans et al., 2011).
Why do experts show this increase in memory performance for their domain of expertise? In the literature on expertise, many authors posit that memory improvement occurs because existing knowledge allows experts to know what variation to expect for information in an expert’s domain (e.g., Vicente & Wang, 1998). That is, existing schemas make the relevant part of the information predictable and thus easier to encode and remember (Graesser & Nakamura, 1982). Thus, in many ways, memory benefits in experts may be considered a manifestation of a broader phenomenon where information that is understood as meaningful—and thus integrated into a schema—is easier to correctly recognize or recall (Bartlett, 1932). For experts, there may simply be a wider variety of meaningful concepts and schemas, resulting in a richer ability to understand and remember stimuli in their domain of expertise (e.g., Ericsson & Kintsch, 1995). This is sometimes known as an organizational processing account of expertise: experts have improved memory because they are better able to chunk information and otherwise create effective knowledge structures (Ericsson & Kintsch, 1995; Rawson & Van Overschelde, 2008).
Is better organization the sole reason for better memory in experts? Beyond schemas and knowledge organization, experts in some domains—particularly those where the expertise is more perceptual, like radiologists looking at mammograms or car experts focusing on the details of cars—may have developed specialized processing mechanisms for their domain of expertise which take advantage of the way stimuli vary in that domain. For example, experts in some domains employ more holistic processing strategies for objects of their expertise (Bilalić et al., 2011; Gauthier et al., 2000; Gauthier et al., 1999; Richler et al., 2011; Watson & Robbins, 2014). Enhanced perceptual expertise may allow experts to process more information about an item even in the same amount of time, and lead to richer memory traces (Ericsson & Kintsch, 1995).
In addition to building richer knowledge structures and better perceptual encoding, there is a third factor that could explain experts’ improved memory performance in domains of expertise, which has often been overlooked in studies of memory: increased distinctiveness of items when they are items of expertise (Rawson & Van Overschelde, 2008). In contrast to views that claim memorability is an intrinsic aspect of a stimulus (e.g., Bainbridge et al., 2013), a significant body of literature argues instead that the critical driver of how memorable an item is in a given context is its distinctiveness from other items currently being stored in memory. Imagine, for example, you are given a list to remember that has 30 animal names and also the word “bread” on it. People tend to remember this distinctive word (“bread”) most accurately—and this is true even if it appears first on the list, so its unique status is not yet known and it is not differentially attended or processed (Calkins, 1894; Hunt, 2006). Memory models naturally predict this effect because most memory models propose that memory is strongly limited by interference at retrieval, and having more unique features allows easier retrieval (e.g., Shiffrin & Steyvers, 1997).
This is broadly consistent with the idea that abnormal or schema-inconsistent items tend to be better remembered than expected, schema-consistent items (Friedman, 1979; Hollingworth & Henderson, 2003; Light et al., 1979; McDaniel & Einstein, 1986; Pezdek et al., 1989). For example, people tend to better remember unexpected aspects of images (Friedman, 1979).
How does such distinctiveness interact with expertise? For experts, an item may differ from the other items in a set in ways that a nonexpert would not notice; such items are therefore more distinctive in the set for experts than for nonexperts, enhancing experts’ memory for them (Rawson & Van Overschelde, 2008).
In summary, experts are often better at accurately recognizing or recalling information in their domain of expertise. This can arise from at least three factors, each of which has been independently studied: experts may have changed perceptual processing strategies; may benefit from general usage of schemas to organize memory; or may benefit from increased distinctiveness of items in memory. However, the way these effects interact has rarely been studied, and many have been studied primarily in domains with limited or no perceptual expertise available (e.g., in word lists).
The current experiments: Memory for mammograms in novices and expert radiologists
To understand how expertise affects memory, and how each of these three factors may play a role, the current experiments examine memory for mammograms in novices and expert radiologists, and test whether expert radiologists have better memory for abnormal images (i.e., cancerous mammograms) than for normal images (i.e., noncancerous mammograms). While for normal mammograms, perceptual encoding benefits, schemas, and distinctiveness all likely play a role in experts’ memory, abnormal mammograms provide a unique case study. Abnormal mammograms do not violate a radiologist’s schema (as radiologists are trained to look for abnormalities), but abnormal cases do provide distinctive retrieval cues (e.g., this mammogram has calcifications in this location) that would not be available to nonexperts, who have no idea that those little white spots are significant. Nor would these cues be available in normal mammograms. Abnormal mammograms therefore present an interesting case: they are schema-consistent, while also potentially providing a unique window into the role of distinctiveness in experts’ memory.
To measure memory performance, we will use receiver operating characteristic (ROC) analysis to take into account the possibility of differential false alarms and differential response criterion, which is critical to understand whether any effects we observe are truly changes in memory strength. We predict that experts will have improved performance compared with nonexperts for both normal and abnormal mammograms because of their perceptual expertise and because they have developed schemas over time to represent these complex images. We also predict that abnormal items might show even more benefit for radiologists compared with nonexperts because for radiologists and radiologists alone, these images have unique and distinctive retrieval cues available.
We focus on radiologists’ memory for mammograms for two reasons: First, search for signs of breast cancer involves a usefully specific perceptual expertise. For instance, only 2-3 kinds of local abnormalities are typically present in abnormal mammograms, and radiologists have significant perceptual expertise whether looking at normal or abnormal medical images.
Second, there are two senses in which a mammogram might be considered “abnormal”: (1) It could contain a focal abnormality. In our study, these are masses or architectural distortions that are subsequently proven to be malignant. (2) Given a mass (for example) in one breast, the other breast could be considered abnormal in the sense that the image comes from a patient with cancer. We assess the impact of each of these two kinds of abnormality on memory. Note that a mammogram might be considered “abnormal” if it showed a benign mass. We did not use such stimuli in this study.
Radiologists are explicitly trained to recognize an image as abnormal if they detect the presence of a visible, localized abnormality, like a mass or calcification. In addition, recent research has shown that, if asked in an experimental setting, radiologists have an ability to detect a “gist” of abnormality in the breast contralateral to the lesion. They perform at above chance levels when asked to categorize images as coming from normal or abnormal patients (Evans et al., 2016). In other words, this study suggests that radiologists do not always need to see a localized physical lesion to know that an image is abnormal. This global signal of abnormality is relatively subtle. More importantly, for present purposes, work on this gist signal is new enough that most radiologists are unfamiliar with the concept. Thus, any impact on memorability could be considered to be the result of an implicit effect of abnormality.
Published studies of the gist of abnormality have involved giving radiologists only a brief (250–500 ms) glance at an image. While this seems sufficient for expert radiologists to gain some evidence of abnormality, it remains unknown whether this ability impacts radiologists’ memory for normal versus abnormal images.
To summarize, the questions guiding this experiment are the following: Do radiologists show improved memory performance for abnormal images compared with normal images? If so, does global gist produce enhanced expert memory for images of the breast contralateral to the breast that contains focal signs of cancer? Alternatively, does any abnormality advantage in memory depend upon having a focal abnormality that can draw spatial attention?
Experiment 1 is a baseline study with novice observers, whose performance can be compared with radiologist performance in Experiment 2. In addition, Experiment 1 allows us to determine whether our stimulus set contains images that are memorable regardless of expertise. Experiment 2 assesses memory performance in expert radiologists. To anticipate our results, Experiment 1 reveals patterns in our image set that we take into account in Experiment 2. In Experiment 2, we find a large memory benefit for radiologists relative to novices as well as an abnormality advantage in radiologists for focal abnormalities. We find no evidence that experts make use of a nonfocal gist of abnormality either in judgment or memory.
Experiment 1: Novices
Experiment 1 was conducted using novice (nonradiologist) observers. The design, number of observers, exclusion, and analysis plan for this experiment were preregistered (URL for this experiment: http://aspredicted.org/blind.php?x=xr3843).
In this experiment, novice observers viewed a series of mammograms and judged whether each case was normal or abnormal and whether they remembered seeing the image earlier in the experiment. We would expect both tasks, classifying an image as normal or abnormal and remembering the images, to be difficult, as the stimuli are designed for expert radiologists. However, novice performance provides a useful baseline against which to compare radiologist performance. In particular, the results of this experiment can indicate whether particular images are distinctive in the absence of any mammographic expertise.
Method
Participants
Sixty participants (23 female participants, mean age 38 years) were recruited for this experiment through Amazon’s Mechanical Turk, which offers monetary compensation for participation in online tasks. Mechanical Turk workers are reasonably representative of the American adult population (Berinsky et al., 2012; Buhrmester et al., 2011; Difallah et al., 2018), and provide data that are comparable to data obtained when participants are tested in experimental psychological laboratories (e.g., see Brady & Alvarez, 2011, for a comparison in a visual memory context). All participants gave informed consent, were compensated at a rate of approximately $10/hour, were located in the United States, and had a HIT approval rate greater than 95%. Informed consent procedures were approved by the Institutional Review Board of the University of California, San Diego.
Stimuli and procedure
Participants viewed single breast mammograms in this study. The stimulus set consisted of 80 abnormal (cancerous) cases and 40 normal (noncancerous) cases. All images were deidentified. All images were preclassified by a group of trained radiologists who did not participate in the study. Normal images were noncancerous and did not contain benign lesions. Abnormal images consisted either of histologically verified malignant masses or architectural distortions (see Evans et al., 2016, for a more detailed description of this stimulus set). Half of the abnormal images contained a visible abnormality (i.e., a lesion was present) and half were images of the breast contralateral to the breast with the lesion (i.e., still an abnormal case, but with no focal indication of that abnormality). Thus, the entire set consisted of 40 normal images, 40 focal-abnormality images (henceforth referred to as abnormal), and 40 nonfocal-abnormality images (images contralateral to the breast with the focal abnormality), henceforth referred to as contralateral-abnormal. Each image subtended approximately 16 × 20 degrees of visual angle at an estimated viewing distance of approximately 60 cm from the screen.
On each trial, one image was presented for 3 seconds, followed by a new screen containing response questions. The mammogram was randomly chosen to be either normal, abnormal, or contralateral-abnormal. Critically, each image was also either a new image (presented for the first time in the experiment) or a repeated image from 3 trials back or 30 trials back (3-back and 30-back, respectively). Of the images that were later repeated, 50% were repeated at 3-back, and 50% were repeated at 30-back. The experiment was balanced such that approximately 20% of trials in each half of the study were 3-backs and approximately 20% were 30-backs. In fact, because a different stream of images was sampled for each participant, in our exact pool of participants, 18% of trials in the first half were 3-backs versus 23% in the second half, and 22% of trials in the first half were 30-backs versus 20% in the second half. In total, with repetitions, there were 210 trials: 120 new images (40 per condition), plus 90 repeat images (30 per condition, split evenly between 3-back and 30-back).
After being displayed for 3 seconds, each image was immediately followed by two response questions: (1) Was the image abnormal or normal? (2) Have you seen this image before? Participants used a standard computer mouse to indicate their level of confidence on a 6-point rating scale ranging from confident yes/abnormal to confident no/normal (see Fig. 1). We collected confidence ratings instead of simple yes/no answers to allow for ROC analysis. There was no time constraint on responding, and the next trial began only after both questions of the current trial had been answered.
Before the experiment began, participants were presented with instructions and several demographic questions (Gender; Age; “Are you a radiologist?”; “Do you have a job where you read medical images; i.e., tech, medical physicist?”). Instructions were written for a novice population with no medical training. For novice participants, abnormal cases were broadly defined as “images that might contain lesions, or cancer, or otherwise might be something worthy of follow-up if you were a radiologist.”
Exclusion criterion and analysis plan
Our exclusion criteria and analyses were decided in advance (see preregistration, above). Individual trials were excluded if participants took less than 1,500 ms or more than 15,000 ms to respond (based on pilot data). Participants were excluded if they took less than 15 minutes (zero excluded) or more than 1 hour to complete the study (3 excluded). Radiologists were excluded (1 excluded) as were those with other prior experience reading medical images (zero excluded). Participants were also excluded if they had more than 80% identical responses (e.g., picked the exact same answer on nearly every trial; one excluded) or had more than 20% of trials excluded on the basis of the reaction time criteria (one excluded). After applying these a priori exclusion criteria, seven participants were excluded from analysis, leaving a final sample of 53 participants.
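As a sketch, these participant-level rules can be expressed as a single predicate. This is our illustration, not the study's analysis code; the function and variable names are hypothetical, but the thresholds are the preregistered ones stated above.

```python
def exclude_participant(rts_ms, responses, total_minutes):
    """Return True if a participant meets any preregistered exclusion rule.

    rts_ms: per-trial response times in milliseconds
    responses: per-trial ratings on the 6-point scale
    total_minutes: total time taken to complete the study
    (All names are hypothetical; thresholds follow the preregistration.)
    """
    # Trial-level rule: responses under 1,500 ms or over 15,000 ms are excluded;
    # a participant is dropped if more than 20% of their trials are excluded.
    bad_trials = [rt < 1500 or rt > 15000 for rt in rts_ms]
    frac_bad = sum(bad_trials) / len(rts_ms)

    # Share of the single most common response (catches >80% identical answers).
    modal_share = max(responses.count(r) for r in set(responses)) / len(responses)

    return (total_minutes < 15 or total_minutes > 60
            or modal_share > 0.80
            or frac_bad > 0.20)
```

Note that trial-level exclusions (individual fast/slow responses) are applied separately from this participant-level decision.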
Following our preregistered analyses (above), we did not conduct an overall analysis of variance (ANOVA) initially but rather followed our specific targeted tests. We first analyzed the confidence ratings of classifying an image as abnormal or normal. We subsequently analyzed the confidence ratings representing memory for images. In order to do this, we conducted ROC analysis for 3-back and 30-back as a function of image type (normal/abnormal/contralateral-abnormal). We also generated ROCs for the normal/abnormal judgments. ROCs were summarized by area under the curve (AUC) and compared using t tests. As noted, we are interested in whether, within the group of novice participants, there is a benefit in memory performance for any type of image (e.g., as judged by normal vs. abnormal AUC). Since the novices lack medical experience, any such effect would give us insight into the nature of the image set (i.e., memorability or distinctiveness). Finally, we conducted image similarity analyses to quantify how image differences might be influencing these results.
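As a concrete illustration of the ROC procedure, an ROC can be constructed from confidence ratings by sweeping a response criterion across the 6-point scale and computing a hit and false-alarm rate at each criterion. The sketch below is ours (function and variable names are hypothetical), and assumes higher ratings mean greater confidence that an image is old (or abnormal, for the classification ROC):

```python
import numpy as np

def roc_from_confidence(old_ratings, new_ratings, n_levels=6):
    """Build ROC points by sweeping a criterion over the confidence scale.

    old_ratings: ratings given to targets (e.g., repeated images)
    new_ratings: ratings given to lures (e.g., first presentations)
    Returns (false_alarm_rates, hit_rates), strictest criterion first.
    """
    old = np.asarray(old_ratings)
    new = np.asarray(new_ratings)
    fas, hits = [0.0], [0.0]
    for c in range(n_levels, 0, -1):  # lower the criterion one step at a time
        hits.append(np.mean(old >= c))
        fas.append(np.mean(new >= c))
    return np.array(fas), np.array(hits)

def auc(fas, hits):
    """Area under the ROC via the trapezoidal rule (0.5 = chance, 1.0 = perfect)."""
    return float(np.sum(np.diff(fas) * (hits[1:] + hits[:-1]) / 2))
```

Perfectly separated ratings yield an AUC of 1.0, while identical rating distributions for targets and lures trace the diagonal and yield 0.5.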
Image similarity comparison
Because normal, focally abnormal, and contralateral-abnormal images are necessarily different image sets, it is useful to compare how distinctive each set of images is from all the other images in order to look at the effect this has on memory. One way to accomplish this is to have individuals give similarity ratings between images. However, this would require 120 × 120 = 14,400 similarity ratings. Instead, to streamline the process, we relied on previously established computer vision techniques designed to give similarity measurements for natural scenes. Specifically, we conducted a Gabor wavelet pyramid (GWP) analysis, which computes features of the images and compares them (Greene et al., 2016; Kay et al., 2008). To assess the level of similarity in the different image types, the GWP represents each image as the output of a bank of multiscale Gabor filters. Prior work has shown that these features can successfully model object representation in early visual areas (Kay et al., 2008). Following the exact procedure and parameters provided by Greene et al. (2016), each image was converted to grayscale, downsampled to 128 × 128 pixels, and represented with a bank of Gabor filters at three spatial scales (3, 6, and 11 cycles per image, plus a luminance-only wavelet covering the entire image), four orientations (0, 45, 90, and 135 degrees), and two phases (0 and 90 degrees). This gave a set of features for each image. We then computed a distance/dissimilarity score between each pair of the 120 images by taking the dot product of each image’s features with each other image’s features, after subtracting the mean across images and normalizing the feature vectors to unit length.
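Assuming a feature matrix has already been extracted by the Gabor filter bank, the final dissimilarity step described above can be sketched as follows (a minimal illustration with hypothetical names, not the analysis code used in the study):

```python
import numpy as np

def pairwise_dissimilarity(features):
    """Compute an image-by-image dissimilarity matrix.

    features: (n_images, n_features) array of Gabor-filter outputs.
    Following the text: subtract the mean across images, normalize each
    feature vector to unit length, take dot products, and define
    dissimilarity as one minus the resulting similarity.
    """
    f = features - features.mean(axis=0, keepdims=True)
    norms = np.linalg.norm(f, axis=1, keepdims=True)
    f = f / np.where(norms == 0, 1, norms)
    sim = f @ f.T  # cosine similarity of the centered feature vectors
    return 1.0 - sim

def mean_within_set_dissimilarity(d, idx):
    """Average off-diagonal dissimilarity among the images indexed by idx."""
    sub = d[np.ix_(idx, idx)]
    n = len(idx)
    return float((sub.sum() - np.trace(sub)) / (n * (n - 1)))
```

A higher mean within-set dissimilarity (as found for the normal images) means the images in that set are more different from one another, and hence more discriminable in memory.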
Results (Experiment 1: Novices)
Performance on the classification task
First, we looked at how confident novices were at classifying an image as either normal or abnormal (see Fig. 2). We found a significant difference between normal and abnormal images, t(52) = 4.78, p < .001, but not between normal and contralateral-abnormal images, t(52) = 1.94, p > .05.
While participants did not have training to distinguish normal from abnormal medical images, a small number of images in the set are extremely saliently abnormal (i.e., a single bright white spot would look questionable even to novice viewers). Looking at ratings by image (see Fig. 3) reveals that these images are largely responsible for the significant difference between normal and abnormal ratings. In short, for at least a small subset of images, even novice participants can notice the abnormality, leading to above-chance classification performance overall. But for most images, novices seem to have little information about normality versus abnormality.
Note that the y-axis in Fig. 2 represents the confidence ratings for novices. It is clear that the novices are generally not confident in distinguishing any image type, with average responses tightly clustered near the middle of the rating scale for all conditions. Another way of visualizing these data is with an ROC curve (see Fig. 4), where novices fall almost on top of the dotted diagonal line indicative of chance performance, with an AUC of only 0.54 (where 0.50 is chance and 1.0 is perfect). As stated above, this difference from chance is nonetheless highly reliable across participants, t(52) = 4.21, p < .001, largely because of the few images that participants could all reliably classify.
Memory for abnormal images
Figure 5 shows the ROCs for the 3-back and 30-back memory tasks. Since novices were not, for the most part, able to perceive contralateral-abnormal images as different from normal images in the classification task, we focused exclusively on memory differences between normal and abnormal images. Overall, independent of image type, and as expected, novices have better 3-back memory (averaged AUC of 0.70 for detecting 3-backs) than 30-back memory (averaged AUC of 0.64 for detecting 30-backs), t(52) = 6.59, p < .001. Interestingly, breaking down performance across image conditions reveals that novices show a small normality benefit: they remember normal images better than abnormal images in both the 3-back condition and the 30-back condition, with only the 3-back yielding a significant difference. We found an AUC benefit of 0.069 for normal images at 3-back, t(52) = 5.48, p < .001, compared with abnormal, and an AUC difference of 0.026 for normal images at 30-back, t(52) = 1.70, p = .096, compared with abnormal.
Given the weak performance at discriminating normal from abnormal images, it is rather surprising that normality had any effect. Therefore, we examined the data for evidence of more basic effects of visual similarity. We found that the lower memory performance in the abnormal conditions was largely driven by an increased false-alarm rate in the contralateral-abnormal and abnormal image sets (here, for computing false-alarm rates, responses with a confidence rating >3 were classified as “new,” and the remaining responses as “old”). This is consistent with an image similarity account in which novices would be more likely to false alarm to new images in the contralateral-abnormal and abnormal conditions simply because these images are more similar to one another than images in the normal set (as predicted by summed-similarity accounts of memory; e.g., Nosofsky, 1991). In other words, if the normal images were somewhat more dissimilar to each other compared with the other images, this could explain why novices have somewhat better memory for the normal condition: it is easier to determine whether an image of a dog is new if that dog is presented in a series of different animals than if it is presented in a set of similar dogs (obviously, the similarity effects in our stimuli are much smaller). We test this hypothesis next.
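The summed-similarity intuition can be illustrated with a minimal exemplar-model sketch (in the spirit of Nosofsky, 1991; all numbers and names here are hypothetical): a probe's familiarity is its summed similarity to the studied items, so a new probe drawn from a tightly clustered set accrues more familiarity, and hence more false alarms, than one drawn from a dispersed set.

```python
import math

def familiarity(probe, stored, c=1.0):
    """Summed similarity of a probe to all stored exemplars.

    Similarity decays exponentially with distance (a Shepard-style kernel);
    higher familiarity means a higher chance of a false "old" response.
    """
    return sum(math.exp(-c * abs(probe - x)) for x in stored)

# Hypothetical 1-D "image" features: a tight cluster vs. a dispersed set.
tight = [0.0, 0.1, 0.2, 0.3]       # similar to one another (like the abnormal sets)
dispersed = [0.0, 2.0, 4.0, 6.0]   # more distinct (like the normal set)

# A new probe near each set: the tight set yields higher summed familiarity,
# predicting more false alarms to new items drawn from similar sets.
fam_tight = familiarity(0.15, tight)
fam_dispersed = familiarity(3.0, dispersed)
```

This is only a toy model; the mammogram feature space is high-dimensional, but the qualitative prediction (more similar set, more false alarms) is the same.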
Similarity matrix—Gabor wavelet pyramid analysis
We tested this image similarity hypothesis by measuring similarity between our images as described in the Methods (Greene et al., 2016; Kay et al., 2008). We found increased dissimilarity among normal images relative to contralateral-abnormal and abnormal images (normal = 0.174; abnormal = 0.139; contralateral-abnormal = 0.133). In other words, normal images were more different, on average, from one another (and thus more discriminable in memory) than either abnormal or contralateral-abnormal images. This is consistent with the hypothesis that the small difference in memory favoring normal images is driven by image similarity differences across sets. Thus, the small normality benefit found in the current study is likely a result of image similarity. Critically, this can provide a useful baseline for considering memory for the same images in expert radiologists in Experiment 2.
Experiment 2: Radiologists
Experiment 2 was the same as Experiment 1, except that it was conducted with radiologist observers.
Method
Participants
Thirty-two expert radiologists (14 female participants, average age = 49 years) were recruited during the 2018 Radiological Society of North America (RSNA) conference in Chicago, Illinois. All radiologists gave informed consent and were not compensated beyond being entered into a lottery for a $500 gift card. Informed consent procedures were approved by the Institutional Review Board of the University of California, San Diego.
Data from participants would have been excluded if they took less than 15 minutes or more than 1 hour to complete the study, had more than 80% identical responses, or had more than 20% of trials excluded. Under these guidelines, no radiologists were excluded from analysis, leaving a final sample of 32 participants.
Stimuli and procedure
The stimuli and experimental design were the same as described in Experiment 1. The main procedural difference was that the experiment was conducted at the RSNA conference where the experimenter explained the instructions in person. Unlike in Experiment 1, in Experiment 2, we gave more general instructions, asking for any abnormality rather than specifically asking participants to look for focal lesions or cancer: “For each image, please judge whether the image is abnormal or normal, and whether you have previously seen it during the course of the experiment.”
Results
In this section, we compare the performance of expert radiologists to the performance of novice participants in Experiment 1. In particular, we investigate how nonexperts compare to experts’ judgments of image classification (i.e., normal vs. abnormal), and critically, whether experts show differential memory for abnormal versus normal images. While analyzing expert performance, we take into account idiosyncrasies in our image set that we learned from Experiment 1, such as that our normal images are more dissimilar and therefore inherently slightly more memorable.
Performance on the classification task
Similar to Experiment 1, we first analyzed performance on the classification task by looking at the confidence ratings of classifying each image as either normal or abnormal. How good are radiologists at simply distinguishing abnormal from normal images? Unsurprisingly, radiologists are very good at distinguishing abnormality (see Fig. 6a). Radiologists were significantly more confident that an abnormal image was abnormal instead of normal, t(31) = 13.17, p < .001. Figure 6b shows the ROC curve for distinguishing focal-abnormal images from normal images in radiologists. ROCs were summarized by area under the curve (radiologist AUC = 0.72; recall that novice AUC = 0.54). As noted in Experiment 1, controls are close to the diagonal line indicative of chance, whereas radiologists produce a typical curvilinear ROC indicative of a perceived (and significant) difference between normal and abnormal images with an AUC well above chance, t(31) = 19.8, p < .001.
Next, we looked at whether radiologists could detect abnormality in the contralateral-abnormal images. There was not a significant difference between the normal and contralateral-abnormal image conditions, t(31) = 0.43, p = .67. In the original study, Evans et al. (2016) found an effect of abnormality in the gist information (i.e., at a very short presentation time of ~250 ms). Our instructions and stimulus set may have biased participants against reporting contralateral images as abnormal. In a set of images that include visible lesions (the abnormal cases) and in the absence of an instruction to look for asymptomatic images from symptomatic patients (the contralateral cases), it is, perhaps, not surprising that radiologists reserved their abnormal ratings for the abnormal cases with lesions. Furthermore, it is possible that our instructions could have primed radiologists to look for both benign and malignant lesions, although no benign lesions were present in the current study. Future studies could investigate the effects of instruction on this task. Recall, however, that our interest in the present experiment is in radiologists’ memory for these images. Contralateral-abnormal images, for instance, might still be remembered better if their vaguely suspicious appearance caused radiologists to devote more attention to them.
Memory for abnormal images
Figure 7 shows radiologist performance on the memory task. Radiologists have better memory for abnormal images in both memory conditions, but the advantage for abnormal images is only significant in the 30-back condition, t(31) = 2.86, p = .008, AUC difference = .051. We found an AUC advantage of 0.02 for abnormal images at 3-back. Although this was not significant, t(31) = 1.62, p = .12, it follows the same trend as the 30-back condition.
Radiologists showed no memory benefit for the contralateral-abnormal images, even at long delays (p = .24). Because radiologists were not able to distinguish contralateral-abnormal images from normal images in the classification task, this result might be expected; recall, though, that we were looking for evidence that an implicitly recognized abnormal gist might enhance memory. That is not what we found. Overall, independent of image type, radiologists had better memory at 3-back (average AUC of .852 for detecting 3-backs) than at 30-back (average AUC of .752 for detecting 30-backs) for medical images. Why are radiologists better at 3-back than at 30-back? While it seems clear that this difference largely reflects typical effects of forgetting and interference (e.g., Wixted, 2005), it is also possible that observers would be more likely to “catch on” to the presence of the 3-back than the 30-back repetitions. If so, they might adopt a strategy that prioritized the 3-back task. However, given that the 3-back and 30-back tests were equally likely and equally distributed throughout the task, and that observers consistently said they remembered seeing mammograms from 30 images back (and therefore were distinctly aware that 3-back was not the only n-back test present), it seems unlikely that observers transitioned to a strategy that prioritized only the 3-back task. Taken together, these results suggest that experts have better memory overall at 3-back than at 30-back, but that a memory benefit for abnormal images compared with normal images is significant only at 30-back.
In recognition memory studies, it is almost always found that ROCs are not consistent with an equal-variance signal detection model (e.g., Egan, 1958; Wixted, 2007). One way to examine this is to convert the hit and false-alarm rates to z scores and plot zROC functions. On a zROC graph, equal variance produces data with a slope of 1.0. Instead, as is typical in recognition memory tasks, the slopes of our zROCs were reliably below 1.0 in three of the four memory conditions. We fit a linear mixed-effect model with slope and intercept as random per-subject factors (3-back normal images: M = 0.68, difference from 1.0, p < .001; 30-back normal images: M = 1.05, not different from 1.0, p = .60; 3-back abnormal images: M = 0.52, p < .001; 30-back abnormal images: M = 0.82, p = .005). Collapsing across all conditions, thus allowing the slope to be estimated more reliably, the mean zROC slope was 0.68, significantly different from 1.0 (p < .00001). Taken together, then, the ROCs we observed in memory were inconsistent with an equal-variance signal detection model and consistent with an unequal-variance model, potentially due to variation in memory strength between different items. This is typical of recognition memory and is the reason that collecting confidence judgments and performing ROC analysis is necessary to assess memory strength: simple d', in this context, does not properly account for differences in response criteria (e.g., Dougal & Rotello, 2007).
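The zROC logic can be illustrated analytically. Under an unequal-variance signal detection model in which old-item strength is N(μ, σ) and new-item strength is N(0, 1), z-transformed hit and false-alarm rates fall on a line with slope 1/σ, so σ > 1 yields the sub-unit slopes reported above. The parameter values in this sketch are illustrative, not fit to the study's data.

```python
# zROC slope under an unequal-variance signal detection model:
# old-item strength ~ N(mu, sigma), new-item strength ~ N(0, 1).
# z(hit) = (mu - c)/sigma and z(fa) = -c, so the zROC slope is 1/sigma.
# Parameter values are illustrative, not fit to the study's data.
from statistics import NormalDist

std_normal = NormalDist()
mu, sigma = 1.5, 1.25          # hypothetical old-item mean and SD
old = NormalDist(mu, sigma)

criteria = [0.0, 0.5, 1.0, 1.5, 2.0]
zpoints = []
for c in criteria:
    hit = 1 - old.cdf(c)         # P(old-item strength exceeds criterion)
    fa = 1 - std_normal.cdf(c)   # P(new-item strength exceeds criterion)
    zpoints.append((std_normal.inv_cdf(fa), std_normal.inv_cdf(hit)))

# Least-squares slope of z(hit) on z(fa).
n = len(zpoints)
mx = sum(x for x, _ in zpoints) / n
my = sum(y for _, y in zpoints) / n
slope = (sum((x - mx) * (y - my) for x, y in zpoints)
         / sum((x - mx) ** 2 for x, _ in zpoints))
print(round(slope, 3))  # 1/sigma = 0.8, i.e., below 1.0
```

With σ = 1 the same code recovers a slope of exactly 1.0, which is the equal-variance case that d' presupposes.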
Recall from the similarity analysis in Experiment 1 that the normal images in our data set are less similar to each other than the abnormal images, and thus memory for normal images should be better than for abnormal images (as it was in novices). In fact, it is memory for the abnormal images that is better in radiologist observers. This suggests that the effect of expertise more than compensates for differences between the stimulus categories in image similarity. To see what the effect of abnormality is, independent of baseline image similarity differences, we can compare radiologists’ memory performance to novices’ performance with the same images. To do this, we compare the benefit—in terms of AUC of the ROC—for radiologists relative to controls in each condition. Doing so reveals a significant abnormality benefit in expert radiologists at both 3-back, t(31) = 6.67, p < .001, and 30-back, t(31) = 4.33, p < .001: after baselining relative to the performance of novice participants, radiologists were specifically better at remembering abnormal images (see Fig. 8).
Extracting additional information with a second presentation
Due to the structure of this experiment, designed to probe memory, each item in the memory set has two classification ratings (normal/abnormal). Thus, while we set out to probe memory, the experiment also makes it possible for us to combine both ratings in order to examine whether there is a “crowd-within” effect in this situation (Vul & Pashler, 2008). The authors proposed the crowd-within as a variant of the “wisdom of the crowd.” They found that averaging a single individual’s responses to repetitions of the same question led to better performance than single responses alone. This is what one would expect if a single judgment did not incorporate all of the information people could possibly have about a question. If this is true for assessments of mammograms by expert radiologists, we would expect that averaging a radiologist’s ratings of abnormality from two exposures to the same mammogram should result in better accuracy than either rating alone. Note that in this situation, however, unlike Vul and Pashler (2008), participants actually have additional information the second time: they get to see the image again before the second judgment, rather than simply being asked again. Thus, in this case, the crowd-within effect could arise from actual new information being incorporated (e.g., the observer might scrutinize different parts of the image), rather than internal sampling.
We find a modest but significant advantage to incorporating both judgments: Averaging radiologists’ responses from the first and second time that they saw an image resulted in slightly higher performance in the 30-back condition (AUC = 0.745) compared with single-response performance (AUC = 0.716), t(31) = 3.46, p = .002 (see Fig. 9, left). The effect was not significant in the 3-back condition (joint AUC = .712, single AUC = .705), t(31) = 1.15, p = .259. Unsurprisingly, this effect was not present in novices, since their performance was very poor on both responses (see Fig. 9, right; all ps > .10).
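Why averaging two partly independent ratings can raise AUC is easy to see in simulation: if each rating is the true signal plus independent noise, the average has less noise and therefore better discriminability. The sketch below is a toy model under that assumption — the "ratings" are simulated, and the signal and noise values are illustrative, not estimates from the study.

```python
# Toy model of the "crowd-within" averaging: two noisy confidence ratings
# per image; classification AUC for single vs. averaged ratings.
# Signal/noise values are illustrative assumptions, not the study's data.
import random

random.seed(1)

def auc(pos, neg):
    """Rank-based AUC: P(random abnormal score > random normal score),
    counting ties as 1/2. Equivalent to the area under the ROC."""
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def rate(signal):
    """One noisy confidence rating: true signal plus independent noise."""
    return signal + random.gauss(0, 1)

abn_signal, nrm_signal, n = 1.0, 0.0, 500
abn1 = [rate(abn_signal) for _ in range(n)]  # first viewing, abnormal images
abn2 = [rate(abn_signal) for _ in range(n)]  # second viewing
nrm1 = [rate(nrm_signal) for _ in range(n)]  # first viewing, normal images
nrm2 = [rate(nrm_signal) for _ in range(n)]  # second viewing

single = auc(abn1, nrm1)
joint = auc([(a + b) / 2 for a, b in zip(abn1, abn2)],
            [(a + b) / 2 for a, b in zip(nrm1, nrm2)])
print(round(single, 2), round(joint, 2))  # joint exceeds single
```

The benefit in this toy model depends entirely on the two ratings' noise being (partly) independent, which matches the interpretation above: at 30-back the second response is somewhat independent of the first, whereas at 3-back the two responses are presumably more correlated and averaging gains little.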
Thus, expert performance can be improved (albeit, rather modestly) by averaging more than one response. It remains to be seen whether this benefit would occur if radiologists were offered unlimited time to process each image, rather than the 3 seconds in the current study. The limited viewing time here may have particularly enhanced radiologists’ ability to extract new information in the second viewing of the mammogram.
General discussion
In the current study, we examined memory performance by nonexpert novices and expert radiologists for normal versus abnormal mammography images as a case study in understanding the role of schemas, distinctiveness, and expertise in memory. To do so, we relied on ROC analysis, designed to properly measure memory independent of differences in response criteria and to take into account both enhanced memory for seen items as well as the possibility of false alarms.
First, we looked at how confident and how competent novice and expert observers were at classifying medical images as either normal or abnormal. Unsurprisingly, radiologists were much better than novices at this task. Novices did show some ability to distinguish abnormality, although this appeared to be largely the result of a few salient images.
Second, we examined our main question of interest: memory for the images. In Experiment 1, we examined memory for mammograms in novices, who have none of the expertise or schemas needed to process these images. We found poor performance overall, as well as a small normality benefit in novice participants’ memory, which could be explained by the greater image dissimilarity of normal images. Thus, Experiment 1 (on novices) gave us not only a baseline for memory performance, but also an understanding of the intricacies of our image set, showing that some abnormal images were quite salient, and that our normal images were more dissimilar from each other.
Even though the normal images in our set were more visually distinctive, in Experiment 2, we found that radiologists had better memory for abnormal images, and had far superior memory performance to novices. This gives insight into how expertise changes memory: not only enhancing the encoding of normal items but also enhancing the distinctiveness of abnormal items. Thus, while experts might have access to perceptual encoding benefits, distinctiveness and/or schemas/chunking to enable them to outperform novices, our finding of an extra benefit of expertise for abnormal images is most consistent with a special role of distinctiveness. For experts, the abnormal images have unique features that make them distinct from other items in memory; whereas for novices, these features are not appreciated and so these images are just like any other image. For example, one possibility is that rather than encoding the entire image, in the case of abnormal images, radiologists specifically encode the abnormality and not the rest of the image into memory. This might reduce the load on memory for that image and might make the memory trace for that image more distinctive.
Broadly speaking, then, we find strong evidence for a role of schemas and distinctiveness in memory, even after taking into account false memory and the possibility of response criterion shifts: We find that experts significantly outperform novices, and that memory for abnormal cases with a visible, focal lesion is better than memory for other images. There was no evidence for a memory benefit for “abnormal” contralateral cases.
Measuring memory: False alarms and ROC analysis
In the current studies, we used ROC analysis to examine memory. This is because, in previous work, it has often been unclear if benefits for schema-consistent information like those reported in experts are, in fact, improvements in memory, as opposed to changes in response criteria. To determine whether memory has actually improved, it is not adequate to simply find a reliable increase in the rate with which observers correctly report having been exposed to some piece of information (the true positive, or “hit” rate). The observer could simply be saying “yes, I have seen it” more often. This would produce an increase in false-positive (or false-alarm) errors. In the context of memory research, these false-positive errors can be seen as a form of false memory. In theory, signal detection models and measures like d' can distinguish between these two, but in practice, the prerequisites for d' to properly adjust for response bias (equal variance; zROC slopes = 1.0) are almost never present in recognition memory contexts, and were not present here. Thus, ROC analysis is needed to distinguish a difference in the ability to remember from criterion shifts, which would reflect an increased tendency of observers to say that they remember (e.g., Wixted & Mickes, 2015).
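The distinction can be made concrete with a worked signal-detection example: under a fixed memory strength, shifting the response criterion changes the hit rate, the false-alarm rate, and even "hits minus false alarms," while the underlying discriminability is unchanged. The equal-variance model and the parameter values below are used purely for illustration.

```python
# Sketch of why hit rate alone (or hits minus false alarms) can mislead:
# a pure criterion shift changes both rates but not discriminability.
# Equal-variance model and values used only for illustration.
from statistics import NormalDist

z = NormalDist()
d_prime = 1.0  # fixed memory strength

def rates(criterion):
    """Hit and false-alarm rates for an observer with fixed d'."""
    hit = 1 - z.cdf(criterion - d_prime)  # P("old" | old item)
    fa = 1 - z.cdf(criterion)             # P("old" | new item)
    return hit, fa

for c in (0.2, 0.5, 0.8):  # increasingly conservative criteria
    hit, fa = rates(c)
    recovered_d = z.inv_cdf(hit) - z.inv_cdf(fa)  # d' = z(H) - z(FA)
    print(f"c={c}: hit={hit:.2f}, fa={fa:.2f}, "
          f"hit-fa={hit - fa:.2f}, d'={recovered_d:.2f}")
```

All three criteria produce visibly different hit and false-alarm rates, yet the recovered d' stays at 1.0; the observer's memory has not changed, only their willingness to say "old." When the equal-variance assumption fails, as in the data here, even d' is no longer criterion-free, which is what motivates full ROC analysis.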
Is false memory a true concern? In fact, previous work has found that organizing information in memory via schemas can have both positive and negative consequences—and in particular, does often increase false alarms, making it difficult to tell whether memory is genuinely improved. In particular, while greater understanding—as in expertise—may allow encoding of only the relevant details, reducing memory load, it may also cause us to falsely remember information that was not present (e.g., Owens et al., 1979). For example, in recognition tests, people are more likely to false alarm to schema-consistent than to schema-inconsistent lures: they would be more likely to falsely report having seen books in a graduate student’s office than inconsistent objects like a piece of tree bark or a pair of pliers (Brewer & Treyens, 1981; Lampinen et al., 2001). And while participants are more likely to correctly remember schema-consistent information in a briefly presented scene (Biederman et al., 1982; Brewer & Treyens, 1981), they are also more likely to falsely remember such information (e.g., Hollingworth & Henderson, 2003; Pezdek et al., 1989).
Thus, measuring full ROCs—rather than attempting to infer how response bias would change performance using measures like A', d', or hits minus false alarms—often reveals surprising answers about memory, particularly in situations like expertise and consistent/inconsistent items where it is known that both hit and false-alarm rates are affected. For example, Dougal and Rotello (2007) used ROC analysis to show that the well-known effect of “improved memory” for emotional words compared with neutral words is a response bias effect, not a true difference in memory between the words. Similarly, Mickes et al. (2012) showed in the domain of eyewitness memory that sequential lineups, which reduce both false alarms and hit rates relative to simultaneous lineups, are actually inferior to simultaneous lineups, contrary to a large body of literature suggesting the opposite (e.g., Wells et al., 2011), as the major “benefit” arises simply from a response criterion shift, not a change in memory strength.
Thus, the current experiments provide unique evidence that expertise and distinctiveness that is apparent only to experts do, in fact, enhance memory—and that this is not just a response criterion shift.
What explains radiologists outperforming novices
Consistent with a wide variety of work on expertise, we find that expert radiologists outperform novices in remembering mammograms. One likely possibility is that this occurs because of experts’ knowledge about these images: they have relevant knowledge that allows them to understand these images in a way novices do not, and likely have perceptual expertise built into their visual system from years of experience (e.g., in the form of greater holistic processing; e.g., Richler et al., 2011). In particular, for an expert, the abnormal images would have an added attribute (that mass, that calcification), learned over years of experience, that would help to distinguish the item in memory.
However, in the current study, we did not attempt to directly match our experts to our novices. Our novice pool was sampled from the internet, which is much more broadly representative of the demographics of the United States than an undergraduate population (e.g., Difallah et al., 2018), but still likely differs in a number of ways from our radiologists (in demographic and socioeconomic factors, as well as motivation to focus on mammogram images). Thus, Experiment 1 should be taken as only an approximate baseline: it revealed important image features in our stimulus set, and points to the possibility of strong expertise effects, but does not directly confirm these are based solely on knowledge rather than other factors.
Memory and abnormality judgments in radiologists
Previous work has found mixed results when investigating memory improvements in radiologists. For example, Hardesty et al. (2005) investigated radiologists’ long-term memory for medical images presented months later and found that none of the radiologists remembered cases that they had read previously. Evans et al. (2016) found mixed results when investigating whether abnormality improves memory in expert observers, including radiologists. Our results provide context to these ambiguities, as they suggest that expert radiologists do have stronger memory for abnormal images even in a long-term memory setting and even when response bias is properly taken into account using ROC analysis. However, our long delays were only on the order of minutes, not months, and so it remains unclear whether such advantages would persist over longer durations.
It is worth noting that in the classification task, radiologists performed on average much more poorly than would be expected of radiologists in the clinic with unlimited viewing time (d' = 2.5–3.0, as in D’Orsi et al., 2013). One reason for this might be that each image in our study was presented for only 3 seconds. For instance, Evans et al. (2013) showed radiologists only a brief glimpse of mammograms, varying the timing from 250 ms to 2,000 ms. The AUCs for radiologists in their experiment at 500-ms, 1,000-ms, and 2,000-ms viewing times were 0.65, 0.66, and 0.72, respectively. In our experiment, with a presentation time of 3,000 ms, we found an AUC of 0.72. Thus, our 3,000-ms presentations produced a level of performance similar to the 2,000-ms presentations of Evans et al. (2013), which, while well below what is expected with unlimited viewing time, is consistent with other studies and consistent with viewing time being the main constraint that led to lower performance.
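To relate the clinic-level d' figures to the AUCs reported here, one can use the standard conversion AUC = Φ(d'/√2), which holds under an equal-variance Gaussian model. The sketch below applies that conversion for rough intuition only; the equal-variance assumption is, as discussed above, an idealization.

```python
# Rough conversion between d' and AUC under an equal-variance Gaussian
# model: AUC = Phi(d' / sqrt(2)). For intuition about the scales only;
# the equal-variance assumption is an idealization.
from math import sqrt
from statistics import NormalDist

phi = NormalDist().cdf

def auc_from_dprime(d):
    return phi(d / sqrt(2))

# A d' around 0.8 corresponds to an AUC near the 0.72 observed here,
# whereas clinic-level d' of 2.5-3.0 implies an AUC well above 0.95.
print(round(auc_from_dprime(0.8), 2))
print(round(auc_from_dprime(2.5), 2))
```

On this scale, the gap between the 3-second classification performance observed here and unlimited-time clinical reading is substantial, consistent with viewing time being the binding constraint.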
The “crowd-within” effect in radiologists
Because our study had radiologists answer the same classification question about an image multiple times, we looked at whether averaging radiologists’ responses when they judged the same image twice resulted in better performance (a “crowd-within” effect; Vul & Pashler, 2008). We found that radiologist performance improved when averaged across two presentations of the same image compared with either response alone, but only in the 30-back condition and only modestly even then. This indicates that by the time radiologists were presented with the same image 30 images later, they gave a response that was somewhat independent of their first response. This suggests that, under the current experimental conditions, there might be information the radiologists are not using the first time they see an image—and that the opportunity to see the image again allows the radiologist to glean additional useful information. Future studies might determine whether such benefits persist when experts are given unlimited time to process the images, as well as whether this effect can be made larger with an even longer delay between the first and second presentation of an image (as found by Vul & Pashler, 2008).
The “gist” of abnormality
Given the Evans et al. (2016) finding that there is a “gist of abnormality” present in the contralateral breast when no localizable abnormality is present, we were interested to know whether these contralateral-abnormal images had any advantage over normal images in expert memory. We found no such evidence. In our experiment, we also found no difference in the classification of abnormality between contralateral-abnormal images and normal images. While at first this might seem to contradict earlier work, there are a number of methodological differences that make it difficult to compare our results directly with Evans et al. (2016). It is possible that we did not find this result because we presented images for a longer encoding time (3,000 ms); stimulus exposure in mammogram “gist” studies has typically been less than a second, often 500 ms. Presenting images for longer encoding times might actually obscure the gist information—overwriting an initial “gist” impression with more semantic or meaningful information. Recall, also, that our radiologists were not informed about gist and likely reserved their “abnormal” ratings for cases where they could localize a lesion. It is possible that we would observe a contralateral-abnormal effect even at long encoding times if we explicitly directed participants to look for a more general abnormal texture or gist. Given these methodological differences, the current study cannot be readily compared with Evans et al. (2016). However, this seems to be a promising avenue for future work.
Conclusion
Using radiologists as a case study, we find an advantage for memory in experts as well as an advantage for abnormal images—even when properly measuring memory via ROC analysis. This is broadly consistent with the literature on schemas. Our findings have important implications for both applied fields that utilize expert intelligence in making inferential decisions as well as theoretical fields interested in how memory changes with expertise. In particular, understanding the structure of memory in experts is critical in situations where decisions need to be made by people who have significant expertise.
Data Availability
For data and material, please contact the corresponding author.
References
Bainbridge, W. A., Isola, P., & Oliva, A. (2013). The intrinsic memorability of face photographs. Journal of Experimental Psychology: General, 142(4), 1323–1334. https://doi.org/10.1037/a0033872
Bartlett, F. C. (1932). Remembering: An experimental and social study. Cambridge University Press.
Berinsky, A. J., Huber, G. A., & Lenz, G. S. (2012). Evaluating online labor markets for experimental research: Amazon.com’s Mechanical Turk. Political Analysis, 20(3), 351–368. https://doi.org/10.1093/pan/mpr057
Biederman, I., Mezzanotte, R. J., & Rabinowitz, J. C. (1982). Scene perception: Detecting and judging objects undergoing relational violations. Cognitive Psychology, 14(2), 143–177. https://doi.org/10.1016/0010-0285(82)90007-X
Bilalić, M., Langner, R., Ulrich, R., & Grodd, W. (2011). Many faces of expertise: Fusiform face area in chess experts and novices. Journal of Neuroscience, 31(28), 10206–10214. https://doi.org/10.1523/JNEUROSCI.5727-10.2011
Brady, T. F., & Alvarez, G. A. (2011). Hierarchical encoding in visual working memory: Ensemble statistics bias memory for individual items. Psychological Science, 22(3), 384–392. https://doi.org/10.1177/0956797610397956
Brady, T. F., Alvarez, G., & Störmer, V. (2019). The role of meaning in visual memory: Face-selective brain activity predicts memory for ambiguous face stimuli. Journal of Neuroscience, 39(6), 1100–1108. https://doi.org/10.1523/JNEUROSCI.1693-18.2018
Brewer, W. F., & Treyens, J. C. (1981). Role of schemata in memory for places. Cognitive Psychology, 13(2), 207–230. https://doi.org/10.1016/0010-0285(81)90008-6
Buhrmester, M., Kwang, T., & Gosling, S. D. (2011). Amazon’s Mechanical Turk: A new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6(1), 3–5. https://doi.org/10.1177/1745691610393980
Calkins, M. W. (1894). Experimental. Psychological Review, 1(3), 327–329. https://doi.org/10.1037/h0065852
Chase, W. G., & Simon, H. A. (1973). Perception in chess. Cognitive Psychology, 4(1), 55–81. https://doi.org/10.1016/0010-0285(73)90004-2
Curby, K. M., Glazek, K., & Gauthier, I. (2009). A visual short-term memory advantage for objects of expertise. Journal of Experimental Psychology: Human Perception and Performance, 35(1), 94–107. https://doi.org/10.1037/0096-1523.35.1.94
de Groot, A. D. (1946). Het denken van den schaker: een experimenteel-psychologische studie [The thinking of the chess player: An experimental-psychological study]. Noord-Hollandsche Uitgevers Maatschappij.
Difallah, D., Filatova, E., & Ipeirotis, P. (2018). Demographics and dynamics of Mechanical Turk workers. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (pp. 135–143). ACM. https://doi.org/10.1145/3159652.3159661
D’Orsi, C. J., Getty, D. J., Pickett, R. M., Sechopoulos, I., Newell, M. S., Gundry, K. R., Bates, S. R., Nishikawa, R. M., Sickles, E. A., Karellas, A., & D’Orsi, E. M. (2013). Stereoscopic digital mammography: Improved specificity and reduced rate of recall in a prospective clinical trial. Radiology, 266(1), 81–88. https://doi.org/10.1148/radiol.12120382
Dougal, S., & Rotello, C. M. (2007). “Remembering” emotional words is based on response bias, not recollection. Psychonomic Bulletin & Review, 14, 423–429. https://doi.org/10.3758/BF03194083
Egan, J. P. (1958). Recognition memory and the operating characteristic. USAF Operational Applications Laboratory Technical Note, 58–51, ii, 32.
Engle, R. W., & Bukstel, L. (1978). Memory processes among bridge players of differing expertise. The American Journal of Psychology, 91(4), 673–689. https://doi.org/10.2307/1421515
Ericsson, K. A., & Kintsch, W. (1995). Long-term working memory. Psychological Review, 102(2), 211–245. https://doi.org/10.1037/0033-295X.102.2.211
Evans, K. K., Cohen, M. A., Tambouret, R., Horowitz, T., Kreindel, E., & Wolfe, J. M. (2011). Does visual expertise improve visual recognition memory? Attention, Perception, & Psychophysics, 73(1), 30–35. https://doi.org/10.3758/s13414-010-0022-5
Evans, K. K., Georgian-Smith, D., Tambouret, R., Birdwell, R. L., & Wolfe, J. M. (2013). The gist of the abnormal: Above-chance medical decision making in the blink of an eye. Psychonomic Bulletin & Review, 20, 1170–1175. https://doi.org/10.3758/s13423-013-0459-3
Evans, K. K., Haygood, T. M., Cooper, J., Culpan, A.-M., & Wolfe, J. M. (2016). A half-second glimpse often lets radiologists identify breast cancer cases even when viewing the mammogram of the opposite breast. Proceedings of the National Academy of Sciences of the United States of America, 113(37), 10292–10297. https://doi.org/10.1073/pnas.1606187113
Friedman, A. (1979). Framing pictures: The role of knowledge in automatized encoding and memory for gist. Journal of Experimental Psychology: General, 108(3), 316–355. https://doi.org/10.1037/0096-3445.108.3.316
Gauthier, I., Skudlarski, P., Gore, J. C., & Anderson, A. W. (2000). Expertise for cars and birds recruits brain areas involved in face recognition. Nature Neuroscience, 3(2), 191–197. https://doi.org/10.1038/72140
Gauthier, I., Tarr, M. J., Anderson, A. W., Skudlarski, P., & Gore, J. C. (1999). Activation of the middle fusiform ‘face area’ increases with expertise in recognizing novel objects. Nature Neuroscience, 2(6), 568–573. https://doi.org/10.1038/9224
Greene, M. R., Baldassano, C., Esteva, A., Beck, D. M., & Fei-Fei, L. (2016). Visual scenes are categorized by function. Journal of Experimental Psychology: General, 145(1), 82–94. https://doi.org/10.1037/xge0000129
Gobet, F., & Simon, H. A. (1996). Recall of random and distorted positions: Implications for the theory of expertise. Memory & Cognition, 24, 493–503. https://doi.org/10.3758/BF03200937
Graesser, A. C., & Nakamura, G. V. (1982). The impact of a schema on comprehension and memory. In G. H. Bower (Ed.), The psychology of learning and motivation (Vol. 16, pp. 59–109). Academic.
Hardesty, L. A., Ganott, M. A., Hakim, C. M., Cohen, C. S., Clearfield, R. J., & Guret, D. (2005). “Memory effect” in observer performance studies of mammograms. Academic Radiology, 12(3), 286–290. https://doi.org/10.1016/j.acra.2004.11.026
Hintzman, D. L. (1986). “Schema abstraction” in a multiple-trace memory model. Psychological Review, 93(4), 411–428. https://doi.org/10.1037/0033-295X.93.4.411
Hollingworth, A., & Henderson, J. M. (2003). Testing a conceptual locus. Memory & Cognition, 31(6), 930–940. https://doi.org/10.3758/BF03196446
Hunt, R. R. (2006). The concept of distinctiveness in memory research. In R. R. Hunt & J. B. Worthen (Eds.), Distinctiveness and memory (pp. 3–25). Oxford University Press. https://doi.org/10.1093/acprof:oso/9780195169669.003.0001
Kay, K. N., Naselaris, T., Prenger, R. J., & Gallant, J. L. (2008). Identifying natural images from human brain activity. Nature, 452(7185), 352–355. https://doi.org/10.1038/nature06713
Lampinen, J. M., Copeland, S. M., & Neuschatz, J. S. (2001). Recollections of things schematic: Room schemas revisited. Journal of Experimental Psychology: Learning, Memory, and Cognition, 27(5), 1211–1222. https://doi.org/10.1037/0278-7393.27.5.1211
Light, L. L., Kayra-Stuart, F., & Hollander, S. (1979). Recognition memory for typical and unusual faces. Journal of Experimental Psychology: Human Learning and Memory, 5(3), 212–228. https://doi.org/10.1037/0278-7393.5.3.212
McDaniel, M. A., & Einstein, G. O. (1986). Bizarre imagery as an effective memory aid: The importance of distinctiveness. Journal of Experimental Psychology: Learning, Memory, and Cognition, 12(1), 54–65. https://doi.org/10.1037/0278-7393.12.1.54
McWeeny, K. H., Young, A. W., Hay, D. C., & Ellis, A. W. (1987). Putting names to faces. British Journal of Psychology, 78(2), 143–149. https://doi.org/10.1111/j.2044-8295.1987.tb02235.x
Mickes, L., Flowe, H. D., & Wixted, J. T. (2012). Receiver operating characteristic analysis of eyewitness memory: Comparing the diagnostic accuracy of simultaneous and sequential lineups. Journal of Experimental Psychology: Applied, 18, 361–376. https://doi.org/10.1037/a0030609
Nosofsky, R. M. (1991). Typicality in logically defined categories: Exemplar-similarity versus rule instantiation. Memory & Cognition, 19, 131–150. https://doi.org/10.3758/BF03197110
Owens, J., Bower, G. H., & Black, J. B. (1979). The “soap opera” effect in story recall. Memory & Cognition, 7, 185–191. https://doi.org/10.3758/BF03197537
Pezdek, K., Whetstone, T., Reynolds, K., Askari, N., & Dougherty, T. (1989). Memory for real-world scenes: The role of consistency with schema expectations. Journal of Experimental Psychology: Learning, Memory, & Cognition, 15, 587–595. https://doi.org/10.1037/0278-7393.15.4.587
Rawson, K. A., & Van Overschelde, J. P. (2008). How does knowledge promote memory? The distinctiveness theory of skilled memory. Journal of Memory and Language, 58(3), 646–668. https://doi.org/10.1016/j.jml.2007.08.004
Richler, J. J., Wong, Y. K., & Gauthier, I. (2011). Perceptual expertise as a shift from strategic interference to automatic holistic processing. Current Directions in Psychological Science, 20(2), 129–134. https://doi.org/10.1177/0963721411402472
Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM—Retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145–166. https://doi.org/10.3758/BF03209391
Vicente, K. J., & Wang, J. H. (1998). An ecological theory of expertise effects in memory recall. Psychological Review, 105(1), 33–57. https://doi.org/10.1037/0033-295X.105.1.33
Voss, J. F., Vesonder, G. T., & Spilich, G. J. (1980). Text generation and recall by high-knowledge and low-knowledge individuals. Journal of Verbal Learning and Verbal Behavior, 19, 651–667. https://doi.org/10.1016/S0022-5371(80)90343-6
Vul, E., & Pashler, H. (2008). Measuring the crowd within: Probabilistic representations within individuals. Psychological Science, 19(7), 645–647. https://doi.org/10.1111/j.1467-9280.2008.02136.x
Watson, T. L., & Robbins, R. A. (2014). The nature of holistic processing in face and object recognition: Current opinions. Frontiers in Psychology, 5. https://doi.org/10.3389/fpsyg.2014.00003
Wells, G. L., Steblay, N. K., & Dysart, J. E. (2011). A test of the simultaneous vs. sequential lineup methods: An initial report of the AJS National Eyewitness Identification Field Studies. https://mn.gov/law-library-stat/archive/urlarchive/a100499.pdf
Wixted, J. T. (2005). A theory about why we forget what we once knew. Current Directions in Psychological Science, 14(1), 6–9. https://doi.org/10.1111/j.0963-7214.2005.00324.x
Wixted, J. T. (2007). Dual-process theory and signal-detection theory of recognition memory. Psychological Review, 114, 152–176. https://doi.org/10.1037/0033-295X.114.1.152
Wixted, J. T., & Mickes, L. (2015). ROC analysis measures objective discriminability for any eyewitness identification procedure. Journal of Applied Research in Memory and Cognition, 4(4), 329–334. https://doi.org/10.1016/j.jarmac.2015.08.007
Acknowledgements
All persons who contributed to this project are authors on the final paper.
Funding
This research was supported by NSF BCS-1829434 to T.F.B.
Author information
Contributions
All authors contributed to the original hypothesis, and read and approved the final manuscript.
H.M.S. contributed to data collection, data analysis, and writing of the manuscript. T.F.B. contributed to data collection, data analysis, and editing of the manuscript.
J.M.W. provided general guidance and contributed to editing of the manuscript.
Ethics declarations
Ethics approval and consent to participate
All participants gave informed consent. For all experiments in this study, informed consent procedures were approved by the Institutional Review Board of the University of California, San Diego.
Consent for publication
Not applicable.
Competing interests
Not applicable.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Schill, H.M., Wolfe, J.M. & Brady, T.F. Relationships between expertise and distinctiveness: Abnormal medical images lead to enhanced memory performance only in experts. Mem Cogn 49, 1067–1081 (2021). https://doi.org/10.3758/s13421-021-01160-7