Data quality and reliability metrics for event-related potentials (ERPs): The utility of subject-level reliability
Introduction
Across disciplines, scientific research is facing a replication problem and credibility crisis, in part due to poor methodological transparency and lack of clarity in research practices. A promising avenue forward is to adopt research practices aimed at improving measurement (Baldwin, 2017). In psychophysiology, researchers have placed an emphasis on identifying psychometrically reliable measurements of brain activity to determine whether these measures can be used to make valid statistical inferences in within- and between-subjects investigations (e.g., as biomarkers or endophenotypes of psychopathology; Hajcak et al., 2019; Hajcak et al., 2017).
Verification of psychometric reliability should be an early step in data analysis, because statistical inferences drawn from unreliable data can lead to mistaken conclusions. This can be accomplished by quantifying the internal consistency of a measure, a type of psychometric reliability that characterizes how well measurements can distinguish differences between people. Measurements with high internal consistency are essential for between-subjects investigations examining correlations between neural measurements and individual differences variables (e.g., depression or anxiety symptoms). Measurements with poor internal consistency in correlational analyses increase the likelihood of finding non-replicable results and missing true phenomena (Loken and Gelman, 2017). Problems with drawing valid inferences are exacerbated when combined with other study-related issues, including small sample sizes, which are common in clinical neuroscience (e.g., Szucs and Ioannidis, 2020) and even in studies using event-related brain potentials (ERPs; Clayson et al., 2019).
Current practices of establishing psychometric reliability in psychophysiological research have been grounded in determining the reliability of measurements from a group of individuals, which results in a single reliability score for an entire group. However, relying on a single group estimate might mask low reliability of scores from some participants. The fields of clinical, social, and cognitive neuroscience would benefit from adopting reliability estimates at the subject-level, which would allow researchers to determine whether subject-level data are of sufficient reliability to make valid statistical inferences. In the current manuscript, we focus on internal consistency estimates commonly used in ERP research and discuss the implications of using various data quality and internal consistency estimates at the group- and subject-level to improve and promote the clarity of ERP measurement practices across studies.
ERPs are direct measures of brain activity that assess a multitude of neuropsychological processes (e.g., sensory, cognitive, motor, and emotion-related). ERPs reflect small voltage fluctuations in the continuous electroencephalogram (EEG) that are time-locked to specific events of interest (e.g., presentation of a visual stimulus or execution of a motor response). In terms of measurement, it is important to note that ERPs reflect tiny signals that are embedded in noise. During signal processing, researchers typically average EEG data across many trials from a given paradigm to reduce the contribution of random noise to averaged activity and consequently reveal the ERP signal of interest. However, after this averaging process, an ERP researcher is left with few options to identify the overall data quality or psychometric reliability of a subject's ERP score. Some metrics have been used, including the root mean square (RMS) of the voltage in the pre-stimulus period (Luck, 2014) or signal-to-noise ratio of a given ERP (e.g., Thigpen et al., 2017), but have not been widely adopted or reported across studies.
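The noise-reduction logic of trial averaging can be made concrete with a small simulation. The sketch below is illustrative only (the epoch length, component shape, and noise level are our own assumptions, not values from any dataset): random noise in the average shrinks by roughly the square root of the number of trials.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_samples = 100, 250                      # hypothetical epoch layout
# A 5 µV Gaussian-shaped "component" standing in for the true ERP signal
true_erp = 5.0 * np.exp(-0.5 * ((np.arange(n_samples) - 125) / 20.0) ** 2)
noise_sd = 10.0                                     # single-trial noise (µV)

# Single-trial EEG: the same time-locked signal buried in independent noise
epochs = true_erp + rng.normal(0.0, noise_sd, size=(n_trials, n_samples))

# Averaging across trials shrinks random noise by roughly sqrt(n_trials)
avg = epochs.mean(axis=0)
residual_sd = float((avg - true_erp).std())
print(round(noise_sd / np.sqrt(n_trials), 1))       # predicted residual noise: 1.0
```

With 100 trials, the single-trial noise of 10 µV is reduced to about 1 µV in the average, which is why low-trial averages (e.g., error trials for the ERN) are a recurring data-quality concern.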
Recently, Luck et al. (in press) proposed a metric, the standardized measurement error (SME), that captures how noisy a single subject's ERP score is and provides insight into the precision of an ERP measurement. Unlike conventional classical test theory measures, the SME can be applied at both the subject and group level. Researchers can use the SME to determine whether data quality is associated with observed effects and statistical power. For example, when the SME is aggregated across participants in a given experiment, researchers can take the RMS of the SME (i.e., RMS[SME]) and directly compare it to the observed between-subject variability (i.e., the sample standard deviation). Inferences can then be made about whether observed effects can be attributed to data quality, which highlights the potential utility of the SME in ERP research. Participants with an excessively large SME for a given effect size could be removed from further analysis. Despite these uses, in isolation the SME provides little inherent information about whether measurement precision is "high enough" for a particular purpose (e.g., comparison of ERP scores across conditions, persons, or groups). However, a bootstrapping procedure can be used to determine whether the SME is small relative to a difference between two conditions of interest, or group-level internal consistency can be estimated (see Luck et al., in press). Nonetheless, there remains a need to establish subject-level estimates of internal consistency to clarify whether the precision of an averaged ERP score is high enough given between-person differences (i.e., individual differences).
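For mean-amplitude scores, the SME of an averaged ERP reduces to the standard error of the single-trial amplitudes (other scoring methods require bootstrapping; see Luck et al., in press). The sketch below uses simulated data and our own variable names to show the RMS(SME)-versus-between-subject-SD comparison described above; the subject counts and variances are arbitrary assumptions.

```python
import numpy as np

def sme(trial_scores):
    """Analytic SME for a mean-amplitude score: the standard error of the
    mean of the single-trial amplitudes."""
    x = np.asarray(trial_scores, dtype=float)
    return x.std(ddof=1) / np.sqrt(len(x))

rng = np.random.default_rng(1)
n_subjects, n_trials = 40, 60
true_means = rng.normal(5.0, 3.0, size=(n_subjects, 1))     # real individual differences
trials = true_means + rng.normal(0.0, 8.0, size=(n_subjects, n_trials))

subject_scores = trials.mean(axis=1)                 # each subject's averaged ERP score
subject_smes = np.array([sme(t) for t in trials])    # one data-quality value per subject

# Aggregate precision, RMS(SME), set against the observed between-subject SD
rms_sme = float(np.sqrt(np.mean(subject_smes ** 2)))
between_sd = float(subject_scores.std(ddof=1))
print(rms_sme < between_sd)   # True: measurement noise is small vs. true spread
```

When RMS(SME) approaches the between-subject standard deviation, much of the apparent spread between people is measurement noise rather than true individual differences, which is exactly the situation the comparison is designed to flag.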
Currently, there are no established metrics to determine whether an individual's specific ERP score reflects adequate psychometric reliability for examining individual differences. Instead, efforts in establishing ERP score reliability have primarily focused on the reliability of ERP measurements at the group level by employing either classical test theory (e.g., Boudewyn et al., 2017; Ethridge and Weinberg, 2018; Hajcak et al., 2017; Klawohn et al., 2020b; Larson et al., 2010; Levinson et al., 2017; Meyer et al., 2013; Olvet and Hajcak, 2009a, Olvet and Hajcak, 2009b; Sandre et al., 2020) or generalizability theory (Carbine et al., in press; Clayson et al., 2021a; Clayson, Carbine, Baldwin, Olsen, & Larson, in press; Clayson et al., 2020; Clayson and Larson, 2019; Clayson and Miller, 2017a, Clayson and Miller, 2017b; Ethridge and Weinberg, 2018; Levinson et al., 2017; Sandre et al., 2020) to determine the consistency of scores across repeated observations (e.g., within-session and/or across testing sessions). Although these approaches offer important insight into ERP score reliability at the group level, they provide no information about subject-level ERP reliability.
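Under classical test theory, the group-level internal consistency approaches cited above are often operationalized as a split-half correlation with a Spearman-Brown step-up correction. A minimal sketch, using simulated trial-level data and our own function name, not code from any of the cited toolboxes:

```python
import numpy as np

def split_half_reliability(trial_scores, rng):
    """Random split-half internal consistency with the Spearman-Brown
    step-up correction; trial_scores is a subjects x trials array."""
    n_trials = trial_scores.shape[1]
    perm = rng.permutation(n_trials)
    half_a = trial_scores[:, perm[: n_trials // 2]].mean(axis=1)
    half_b = trial_scores[:, perm[n_trials // 2:]].mean(axis=1)
    r = np.corrcoef(half_a, half_b)[0, 1]
    return 2.0 * r / (1.0 + r)   # correct back up to the full trial count

rng = np.random.default_rng(2)
true_scores = rng.normal(0.0, 3.0, size=(50, 1))             # stable person effects
trials = true_scores + rng.normal(0.0, 8.0, size=(50, 64))   # plus trial-level noise
rel = split_half_reliability(trials, rng)
print(rel > 0.7)   # high group-level internal consistency despite noisy trials
```

Note that this yields a single coefficient for the whole sample: every subject's trials contribute, but no subject receives an individual reliability estimate, which is precisely the limitation discussed above.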
There are important implications for determining the psychometric reliability of ERP scores at the subject level. Consistent with traditional signal averaging approaches used in ERP research (e.g., Woodman, 2010), it is often assumed that the meaningful variability in ERP scores primarily occurs between rather than within persons; however, recent research shows that ERP scores may change over the course of experimental paradigms (e.g., Berry et al., 2019; Brush et al., 2018; Volpert-Esmond et al., 2018), suggesting a need for person-specific psychometric reliability estimates. Extending psychometric reliability to an individual person has the advantage of permitting researchers to examine individual differences in reliability, which has largely been ignored.
When researchers are interested in using ERPs in studies of individual differences or dimensional constructs (i.e., examining correlational relationships between ERPs and other measures of individual differences), it is important to know whether the reliability of a person's ERP score is compatible with group-level internal consistency estimates. In this case, subject-level internal consistency estimates could be directly compared to the group-level internal consistency estimate to determine how well the group-level internal consistency estimate characterizes each individual (Williams et al., 2020; Williams et al., 2019; Williams et al., in press). In instances of mischaracterization, researchers could focus on individual cases of unreliable ERP scores to determine the impact of a host of factors on reliability, including recording characteristics (e.g., electrode impedance), the presence of artifacts, or person characteristics. Researchers could also use subject-level internal consistency estimates as predictor or criterion variables in explanatory models. As predictors, subject-level internal consistency estimates could be used to determine their influence on observed effect sizes and statistical power. As a criterion, researchers could examine whether specific between- or within-subjects variables are associated with different levels of reliability.
Evaluating subject-level data quality and psychometric reliability is also consistent with promoting transparency in research practices (e.g., Paul et al., 2021; Saunders and Inzlicht, 2021; Clayson et al., 2021; Garrett-Ruffin et al., 2021; this special issue). In ERP research, the decision to include or exclude a subject's ERP data in statistical analyses is largely left up to the researcher's discretion and is often based on various criteria. For example, this decision could be based on an a priori established threshold (e.g., < 50% of artifact-free trials retained in a subject's ERP score), a minimum number of artifact-free trials retained in their averaged ERP based on group-level internal consistency estimates (e.g., range of 2–15 error trials for ERN; Fischer et al., 2017; Larson et al., 2010; Meyer et al., 2013; Olvet and Hajcak, 2009b; Pontifex et al., 2010; Steele et al., 2016), or visual inspection. This lack of standardization results in increased researcher degrees of freedom that stands in the way of promoting the transparency and rigor of ERP research. Implementation of subject-level internal consistency estimates may help determine whether data quality is high enough to make valid inferences in both within- and between-subjects questions. In particular, subject-level internal consistency estimates provide objective indicators of whether an individual person's data is of sufficient quality to be included in a study. Adopting subject-level internal consistency estimates would also allow the field to move toward standardization and would ultimately increase the transparency and clarity of measurement by shedding light on the factors that impact internal consistency.
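An a priori inclusion rule of the kind described above can be made explicit and reportable in a few lines. The thresholds below (50% retained, a minimum of 8 artifact-free trials) are purely illustrative assumptions, not field standards:

```python
def include_subject(n_artifact_free, n_total, min_prop=0.5, min_trials=8):
    """A priori inclusion rule: require both a minimum proportion of
    artifact-free trials and a minimum absolute trial count.
    Both thresholds here are illustrative, not field standards."""
    return (n_artifact_free / n_total) >= min_prop and n_artifact_free >= min_trials

print(include_subject(40, 60))   # True: 67% of trials retained
print(include_subject(20, 60))   # False: only 33% of trials retained
```

Pre-registering and sharing such a rule, rather than deciding by visual inspection, removes one researcher degree of freedom; subject-level internal consistency estimates would extend the same logic from trial counts to measured reliability.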
In the current manuscript, we provide an overview of the various estimates that have been typically used to examine data quality and score reliability in ERP studies. We then extend this by discussing the importance of quantifying subject-level internal consistency estimates of ERP scores and have structured the manuscript as follows. First, we describe three different types of data measurement metrics (i.e., data quality, group-level internal consistency, and subject-level internal consistency) and outline situations where one estimate may be preferred over another. Then, we illustrate the application of these estimates by applying them to two published datasets on the reward positivity (RewP; Klawohn et al., 2020a) and error-related negativity (ERN; Klawohn et al., 2020c). In this section, we provide commentary on the coupling between data quality and both group- and subject-level internal consistency estimates, and describe the inherent challenges associated with characterizing the quality of ERP measurements with a single score. We then extend our application to between-subjects investigations and provide an overview of the influence of internal consistency on between-subjects effects and how reliability impacts the validity of statistical inferences. Lastly, we summarize the importance of examining and reporting data quality and reliability estimates. We conclude with general comments on the way in which these estimates may be directly integrated into the broader literature to improve the rigor and clarity of ERP measurements.
Measurement metrics
We now describe three different types of estimates: data quality, group-level internal consistency, and subject-level internal consistency. These estimates represent scores from individual trials, i, recorded from within a person, j, within a group, k. Internal consistency is an estimate of psychometric reliability and characterizes the homogeneity of test observations (i.e., ERP trials). Internal consistency estimates are often scaled using the between-person variability (e.g., coefficient alpha).
Application of internal consistency and data quality estimates
We now apply data quality and internal consistency estimates covered above to two published datasets on the reward positivity (RewP) and error-related negativity (ERN). We also generally compare the different metrics to demonstrate how to interpret them. The RewP (Klawohn et al., 2020a) and ERN data (Klawohn et al., 2020c) are from the same 83 participants with major depressive disorder (MDD) and 45 healthy controls, and the reader is directed to the published articles for a discussion of the
Impact of internal consistency on between-group effects
Numerical group differences were observed for many of the estimates of internal consistency, but it is unclear how much impact ERP score internal consistency has on the magnitude of between-group ERP differences. The literature suggests that psychometrically unreliable data can dramatically impact not only between-person effects (i.e., relationships with external correlates) but also between-group effect sizes (Hajcak et al., 2017). For example, unreliable scores can lead to magnitude or sign errors in estimated effects.
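The impact of unreliability on between-person effects can be quantified with the classical attenuation formula, under which the observed correlation is the true correlation shrunk by the square root of the two measures' reliabilities. The reliability values below are hypothetical, chosen only to illustrate the size of the shrinkage:

```python
import math

def attenuated_r(r_true, rel_x, rel_y):
    """Classical attenuation formula: the observed correlation equals the
    true correlation shrunk by the root of the measures' reliabilities."""
    return r_true * math.sqrt(rel_x * rel_y)

# A true brain-symptom correlation of .50, measured with an ERP score of
# reliability .60 and a symptom scale of reliability .80:
print(round(attenuated_r(0.50, 0.60, 0.80), 2))   # 0.35
```

A true correlation of .50 is observed as roughly .35 under these assumptions, which in turn inflates the sample size needed for adequate statistical power; this is one concrete route by which poor internal consistency undermines between-subjects inferences.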
Discussion
The current primer provides a conceptual overview of various estimates of data quality, group-level internal consistency, and subject-level internal consistency. Most estimates of data quality and group-level internal consistency have been covered in other work, but to our knowledge this primer presents the first application of subject-level internal consistency to ERP scores. The findings from the subject-level internal consistency analyses indicated that group-level internal consistency
References (76)
- Improving the rigor of psychophysiology research. Int. J. Psychophysiol. (2017)
- The impact of recent and concurrent affective context on cognitive control: an ERP study of performance monitoring. Int. J. Psychophysiol. (2019)
- ERP Reliability Analysis (ERA) toolbox: an open-source toolbox for analyzing the reliability of event-related potentials. Int. J. Psychophysiol. (2017)
- Psychometric considerations in the measurement of event-related brain potentials: guidelines for measurement and reporting. Int. J. Psychophysiol. (2017)
- The open access advantage for studies of human electrophysiology: impact on citations and Altmetrics. Int. J. Psychophysiol. (2021)
- Psychometric properties of neural responses to monetary and social rewards across development. Int. J. Psychophysiol. (2018)
- Addressing the reliability fallacy: similar group effects may arise from unreliable individual effects. NeuroImage (2019)
- Robust is not necessarily reliable: from within-subjects fMRI contrasts to between-subjects comparisons. NeuroImage (2018)
- Open science in psychophysiology: an overview of challenges and emerging solutions. Int. J. Psychophysiol. (2021)
- Data quality over data quantity in computational cognitive neuroscience. NeuroImage (2018)