Data quality and reliability metrics for event-related potentials (ERPs): The utility of subject-level reliability

https://doi.org/10.1016/j.ijpsycho.2021.04.004

Highlights

  • Psychometric reliability is critical for studies of individual differences.

  • Group-level reliability estimates risk masking low reliability.

  • Subject-level estimates consider the precision of scores for a person.

  • Data quality and subject-level estimates can improve measurement quality.

  • Subject-level reliability is implemented within the ERA Toolbox.

Abstract

Event-related brain potentials (ERPs) represent direct measures of neural activity that are leveraged to understand cognitive, affective, sensory, and motor processes. Every ERP researcher encounters the obstacle of determining whether measurements are precise or psychometrically reliable enough for an intended purpose. In this primer, we review three types of measurement metrics: data quality, group-level internal consistency, and subject-level internal consistency. Data quality estimates characterize the precision of ERP scores but provide no inherent information about whether scores are precise enough for examining individual differences. Group-level internal consistency characterizes the ratio of between-person differences to the precision of those scores, and provides a single internal consistency estimate for an entire group of participants that risks masking low internal consistency for some individuals. Subject-level internal consistency considers the precision of an ERP score for a person relative to between-person differences for a group, yielding an estimate for each individual. We apply each metric to published error-related negativity (ERN) and reward positivity (RewP) data and demonstrate how failing to consider data quality and internal consistency can undermine statistical inferences. We conclude with general comments on how these estimates may be used to improve measurement quality and methodological transparency. Subject-level internal consistency computation is implemented within the ERP Reliability Analysis (ERA) Toolbox.

Introduction

Across disciplines, scientific research is facing a replication problem and credibility crisis, in part due to poor methodological transparency and lack of clarity in research practices. A promising avenue forward is to adopt research practices aimed at improving measurement (Baldwin, 2017). In psychophysiology, researchers have placed an emphasis on identifying psychometrically reliable measurements of brain activity to determine whether these measures can be used to make valid statistical inferences in within- and between-subjects investigations (e.g., as biomarkers or endophenotypes of psychopathology; Hajcak et al., 2019; Hajcak et al., 2017).

Verification of psychometric reliability should be an early step in data analysis, because statistical inferences drawn from unreliable data can lead to mistaken conclusions. This can be accomplished by quantifying the internal consistency of a measure, a type of psychometric reliability that characterizes how well measurements distinguish differences between people. Measurements with high internal consistency are essential for between-subjects investigations examining correlations between neural measurements and individual differences variables (e.g., depression or anxiety symptoms). Measurements with poor internal consistency in correlational analyses increase the likelihood of finding non-replicable results and missing true phenomena (Loken and Gelman, 2017). Problems with drawing valid inferences are exacerbated when combined with other study-related issues, including small sample sizes, which are common in clinical neuroscience (e.g., Szucs and Ioannidis, 2020) and even in studies using event-related brain potentials (ERPs; Clayson et al., 2019).

Current practices of establishing psychometric reliability in psychophysiological research have been grounded in determining the reliability of measurements from a group of individuals, which results in a single reliability score for an entire group. However, relying on a single group estimate might mask low reliability of scores from some participants. The fields of clinical, social, and cognitive neuroscience would benefit from adopting reliability estimates at the subject-level, which would allow researchers to determine whether subject-level data are of sufficient reliability to make valid statistical inferences. In the current manuscript, we focus on internal consistency estimates commonly used in ERP research and discuss the implications of using various data quality and internal consistency estimates at the group- and subject-level to improve and promote the clarity of ERP measurement practices across studies.

ERPs are direct measures of brain activity that assess a multitude of neuropsychological processes (e.g., sensory, cognitive, motor, and emotion-related). ERPs reflect small voltage fluctuations in the continuous electroencephalogram (EEG) that are time-locked to specific events of interest (e.g., presentation of a visual stimulus or execution of a motor response). In terms of measurement, it is important to note that ERPs reflect tiny signals that are embedded in noise. During signal processing, researchers typically average EEG data across many trials from a given paradigm to reduce the contribution of random noise to averaged activity and consequently reveal the ERP signal of interest. However, after this averaging process, an ERP researcher is left with few options for identifying the overall data quality or psychometric reliability of a subject's ERP score. Some metrics have been used, including the root mean square (RMS) of the voltage in the pre-stimulus period (Luck, 2014) or the signal-to-noise ratio of a given ERP (e.g., Thigpen et al., 2017), but these have not been widely adopted or reported across studies.
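To make the logic of signal averaging concrete, the brief simulation below (illustrative only; the waveform and noise level are made up) shows the residual noise in an averaged ERP shrinking at roughly 1/sqrt(N) as the trial count N grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200
true_erp = np.sin(np.linspace(0, np.pi, n_samples))  # stand-in "true" waveform

for n_trials in (1, 16, 64, 256):
    # Each simulated trial is the true waveform plus independent noise (SD = 5).
    trials = true_erp + rng.normal(0.0, 5.0, size=(n_trials, n_samples))
    residual_sd = (trials.mean(axis=0) - true_erp).std()
    print(f"{n_trials:4d} trials: residual noise SD = {residual_sd:.2f} "
          f"(theory: {5.0 / np.sqrt(n_trials):.2f})")
```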

Recently, Luck et al. (in press) proposed a metric referred to as the standardized measurement error (SME) that captures how noisy a single subject's ERP score is and provides insight into the precision of an ERP measurement. Unlike conventional classical test theory measures, the SME can be applied at both the subject- and group-level. Researchers can use the SME to determine whether data quality is associated with observed effects and statistical power. For example, when the SME is aggregated across participants in a given experiment, researchers can take the RMS of the SME (i.e., RMS[SME]) and directly compare it to the observed between-subject variability (i.e., the sample standard deviation). Researchers can then infer whether observed effects are attributable to data quality, which highlights the potential utility of the SME in ERP research. Participants with an excessively large SME for a given effect size could be removed from further analysis. Despite these uses, the SME in isolation provides little inherent information about whether measurement precision is “high enough” for a particular purpose (e.g., comparison of ERP scores across conditions, persons, or groups). However, a bootstrapping procedure can be used to determine whether the SME is small compared to a difference between two conditions of interest, or group-level internal consistency can be estimated (see Luck et al., in press). Nonetheless, there remains a need for subject-level estimates of internal consistency to clarify whether the precision of an averaged ERP score is high enough given between-person differences (i.e., individual differences).
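For mean-amplitude scores, the SME reduces to the standard error of the across-trial mean; other scoring methods require bootstrapping (see Luck et al., in press). As a hedged sketch of the analytic case and the RMS(SME) aggregation described above (simulated data; not code from Luck et al.):

```python
import numpy as np

def sme(single_trial_scores):
    """Analytic SME for a mean-amplitude score: the standard error of the
    across-trial mean (SD of single-trial amplitudes / sqrt(n trials))."""
    x = np.asarray(single_trial_scores, dtype=float)
    return x.std(ddof=1) / np.sqrt(x.size)

def rms_sme(per_subject_scores):
    """Aggregate subject-level SMEs into one group-level value: RMS(SME)."""
    smes = np.array([sme(s) for s in per_subject_scores])
    return np.sqrt(np.mean(smes ** 2))

# Illustrative comparison of measurement error to between-person spread,
# using simulated single-trial amplitudes (all values hypothetical).
rng = np.random.default_rng(1)
subjects = [rng.normal(rng.normal(0, 3), 8, size=int(rng.integers(20, 60)))
            for _ in range(30)]
subject_means = np.array([s.mean() for s in subjects])
print("RMS(SME):         ", rms_sme(subjects))
print("Between-person SD:", subject_means.std(ddof=1))
```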

Currently, there are no established metrics to determine whether an individual's specific ERP score reflects adequate psychometric reliability for examining individual differences. Instead, efforts in establishing ERP score reliability have primarily focused on the reliability of ERP measurements at the group level by employing either classical test theory (e.g., Boudewyn et al., 2017; Ethridge and Weinberg, 2018; Hajcak et al., 2017; Klawohn et al., 2020b; Larson et al., 2010; Levinson et al., 2017; Meyer et al., 2013; Olvet and Hajcak, 2009a, Olvet and Hajcak, 2009b; Sandre et al., 2020) or generalizability theory (Carbine et al., in press; Clayson et al., 2021a; Clayson, Carbine, Baldwin, Olsen, & Larson, in press; Clayson et al., 2020; Clayson and Larson, 2019; Clayson and Miller, 2017a, Clayson and Miller, 2017b; Ethridge and Weinberg, 2018; Levinson et al., 2017; Sandre et al., 2020) to determine the consistency of scores across repeated observations (e.g., within-session and/or across testing sessions). Although these approaches offer important insight into ERP score reliability at the group level, they provide no information about subject-level ERP reliability.
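Many of the classical test theory estimates cited above are split-half coefficients. As a hedged sketch of that common implementation (a generic version, not the procedure of any specific study cited), half-averages are correlated across subjects and stepped up to full-length reliability with the Spearman-Brown correction:

```python
import numpy as np

def split_half_reliability(trial_scores_per_subject, rng=None):
    """Group-level split-half internal consistency: randomly split each
    subject's single-trial scores into halves, average each half, correlate
    the half-averages across subjects, and apply the Spearman-Brown
    correction to estimate full-length reliability."""
    if rng is None:
        rng = np.random.default_rng()
    half_a, half_b = [], []
    for scores in trial_scores_per_subject:
        shuffled = rng.permutation(np.asarray(scores, dtype=float))
        mid = shuffled.size // 2  # odd counts leave the extra trial in half b
        half_a.append(shuffled[:mid].mean())
        half_b.append(shuffled[mid:].mean())
    r = np.corrcoef(half_a, half_b)[0, 1]
    return 2 * r / (1 + r)       # Spearman-Brown prophecy formula

# Illustrative use with simulated subjects (true score + trial noise):
rng = np.random.default_rng(0)
sim = [rng.normal(rng.normal(0, 3), 10, size=40) for _ in range(50)]
print(split_half_reliability(sim, rng))
```

In practice, averaging the coefficient over many random splits yields a more stable estimate than a single split.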

There are important implications for determining the psychometric reliability of ERP scores at the subject level. Consistent with traditional signal averaging approaches used in ERP research (e.g., Woodman, 2010), it is often assumed that the meaningful variability in ERP scores primarily occurs between rather than within persons; however, recent research shows that ERP scores may change over the course of experimental paradigms (e.g., Berry et al., 2019; Brush et al., 2018; Volpert-Esmond et al., 2018), suggesting a need for person-specific psychometric reliability estimates. Extending psychometric reliability to an individual person has the advantage of permitting researchers to examine individual differences in reliability, a question that has largely been ignored.

When researchers are interested in using ERPs in studies of individual differences or dimensional constructs (i.e., examining correlational relationships between ERPs and other measures of individual differences), it is important to know whether the reliability of a person's ERP score is compatible with group-level internal consistency estimates. In this case, subject-level internal consistency estimates could be directly compared to the group-level internal consistency estimate to determine how well the group-level internal consistency estimate characterizes each individual (Williams et al., 2020; Williams et al., 2019; Williams et al., in press). In instances of mischaracterization, researchers could focus on individual cases of unreliable ERP scores to determine the impact of a host of factors on reliability, including recording characteristics (e.g., electrode impedance), the presence of artifacts, or person characteristics. Researchers could also use subject-level internal consistency estimates as predictor or criterion variables in explanatory models. As predictors, subject-level internal consistency estimates could be used to determine their influence on observed effect sizes and statistical power. As a criterion, researchers could examine whether specific between- or within-subjects variables are associated with different levels of reliability.
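As a hedged illustration of such a comparison (deliberately simplified; the ERA Toolbox itself estimates variance components in a Bayesian multilevel framework rather than taking the fixed values assumed here), the sketch below forms a dependability-style coefficient for each person from their retained trial count and error variance and flags people poorly characterized by the group-level value:

```python
import numpy as np

def dependability(var_between, var_error, n_trials):
    """Dependability-style coefficient: between-person variance relative to
    between-person variance plus trial-level error shrunk by trial count."""
    return var_between / (var_between + var_error / n_trials)

# Hypothetical variance components, trial counts, and error variances.
var_between = 4.0
n_trials = np.array([40, 38, 6, 35])            # retained artifact-free trials
var_error = np.array([30.0, 28.0, 55.0, 90.0])  # person-specific error variance

subject_rel = dependability(var_between, var_error, n_trials)
group_rel = dependability(var_between, var_error.mean(), n_trials.mean())

for j, rel in enumerate(subject_rel):
    flag = "  <- poorly characterized by group estimate" if rel < group_rel - 0.15 else ""
    print(f"subject {j}: dependability = {rel:.2f}{flag}")
print(f"group-level estimate = {group_rel:.2f}")
```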

Evaluating subject-level data quality and psychometric reliability is also consistent with promoting transparency in research practices (e.g., Paul et al., 2021; Saunders and Inzlicht, 2021; Clayson et al., 2021; Garrett-Ruffin et al., 2021; this special issue). In ERP research, the decision to include or exclude a subject's ERP data in statistical analyses is largely left up to the researcher's discretion and is often based on various criteria. For example, this decision could be based on an a priori established threshold (e.g., < 50% of artifact-free trials retained in a subject's ERP score), a minimum number of artifact-free trials retained in their averaged ERP based on group-level internal consistency estimates (e.g., range of 2–15 error trials for ERN; Fischer et al., 2017; Larson et al., 2010; Meyer et al., 2013; Olvet and Hajcak, 2009b; Pontifex et al., 2010; Steele et al., 2016), or visual inspection. This lack of standardization results in increased researcher degrees of freedom that stand in the way of promoting the transparency and rigor of ERP research. Implementation of subject-level internal consistency estimates may help determine whether data quality is high enough to make valid inferences for both within- and between-subjects questions. In particular, subject-level internal consistency estimates provide objective indicators of whether an individual person's data are of sufficient quality to be included in a study. Adopting subject-level internal consistency estimates would also allow the field to move toward standardization and would ultimately increase the transparency and clarity of measurement by shedding light on the factors that impact internal consistency.

In the current manuscript, we provide an overview of the estimates that have typically been used to examine data quality and score reliability in ERP studies. We then extend this discussion to the importance of quantifying subject-level internal consistency estimates of ERP scores. The manuscript is structured as follows. First, we describe three types of measurement metrics (i.e., data quality, group-level internal consistency, and subject-level internal consistency) and outline situations where one estimate may be preferred over another. Then, we illustrate the application of these estimates by applying them to two published datasets on the reward positivity (RewP; Klawohn et al., 2020a) and error-related negativity (ERN; Klawohn et al., 2020c). In this section, we provide commentary on the coupling between data quality and both group- and subject-level internal consistency estimates, and describe the inherent challenges associated with characterizing the quality of ERP measurements with a single score. We then extend our application to between-subjects investigations and provide an overview of the influence of internal consistency on between-subjects effects and how reliability impacts the validity of statistical inferences. Lastly, we summarize the importance of examining and reporting data quality and reliability estimates. We conclude with general comments on the way in which these estimates may be directly integrated into the broader literature to improve the rigor and clarity of ERP measurements.

Section snippets

Measurement metrics

We now describe three different types of estimates: data quality, group-level internal consistency, and subject-level internal consistency. These estimates represent scores from individual trials, i, recorded from a person, j, within a group, k. Internal consistency is an estimate of psychometric reliability and characterizes the homogeneity of test observations (i.e., ERP trials). Internal consistency estimates are often scaled using the between-person variability (e.g., coefficient…
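Although the snippet is truncated here, the trial/person/group indexing it introduces follows a standard generalizability-theory decomposition; as a hedged sketch of the notation (the full article specifies the exact model used), trial-level scores and a dependability coefficient scaled by between-person variability can be written as:

```latex
% Trial-level score: grand mean, effect of person j (within group k), trial-level error
Y_{ijk} = \mu + \pi_{j(k)} + \varepsilon_{ijk}

% Dependability: between-person variance against trial-level error variance
% shrunk by the number of trials n entering a person's average
\phi = \frac{\sigma^{2}_{\pi}}{\sigma^{2}_{\pi} + \sigma^{2}_{\varepsilon}/n}
```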

Application of internal consistency and data quality estimates

We now apply the data quality and internal consistency estimates covered above to two published datasets on the reward positivity (RewP) and error-related negativity (ERN). We also generally compare the different metrics to demonstrate how to interpret them. The RewP (Klawohn et al., 2020a) and ERN data (Klawohn et al., 2020c) are from the same 83 participants with major depressive disorder (MDD) and 45 healthy controls, and the reader is directed to the published articles for a discussion of the…

Impact of internal consistency on between-group effects

Numerical group differences were observed for many of the estimates of internal consistency, but it is unclear how much impact ERP score internal consistency has on the magnitude of between-group ERP differences. The literature suggests that psychometrically unreliable data can dramatically impact not only between-person effects (i.e., relationships with external correlates) but also between-group effect sizes (Hajcak et al., 2017). For example, unreliable scores can lead to magnitude or sign…
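The core of this point, for the correlational case, is the classical attenuation formula: an observed correlation shrinks by the square root of each measure's reliability. A brief worked sketch with hypothetical numbers:

```python
import numpy as np

def attenuated_r(true_r, rel_x, rel_y):
    """Classical attenuation: observed r = true r * sqrt(rel_x * rel_y)."""
    return true_r * np.sqrt(rel_x * rel_y)

# A true ERP-symptom correlation of .40, with symptom-measure reliability
# fixed at .90 and ERP internal consistency varied (all values hypothetical).
for rel_erp in (0.9, 0.7, 0.5, 0.3):
    print(f"ERP reliability {rel_erp:.1f}: observed r = "
          f"{attenuated_r(0.40, rel_erp, 0.90):.2f}")
```

An analogous shrinkage, by the square root of the score's reliability, applies to standardized between-group differences such as Cohen's d, because measurement error inflates the within-group standard deviation.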

Discussion

The current primer provides a conceptual overview of various estimates of data quality, group-level internal consistency, and subject-level internal consistency. Most estimates of data quality and group-level internal consistency have been covered in other work, but to our knowledge this primer presents the first application of subject-level internal consistency to ERP scores. The findings from the subject-level internal consistency analyses indicated that group-level internal consistency…

References (76)

  • S.J. Luck et al.

    A roadmap for the development and validation of event-related potential biomarkers in schizophrenia research

    Biol. Psychiatry

    (2011)
  • D.M. Olvet et al.

    Reliability of error-related brain activity

    Brain Res.

    (2009)
  • M. Paul et al.

    Making ERP research more transparent: Guidelines for preregistration

    Int. J. Psychophysiol.

    (2021)
  • A. Sandre et al.

Comparing the effects of different methodological decisions on the error-related negativity and its association with behaviour and gender

    Int. J. Psychophysiol.

    (2020)
  • B. Saunders et al.

    Pooling resources to enhance rigour in psychophysiological research: Insights from open science approaches to meta-analysis

    Int. J. Psychophysiol.

    (2021)
  • F.D. Schönbrodt et al.

    At what sample size do correlations stabilize?

    J. Res. Pers.

    (2013)
  • V.R. Steele et al.

    Neuroimaging measures of error-processing: extracting reliable signals from event-related potentials and functional magnetic resonance imaging

    NeuroImage

    (2016)
  • A. Steinke et al.

    RELEX: an excel-based software tool for sampling split-half reliability coefficients

    Methods Psychol.

    (2020)
  • D. Szucs et al.

    Sample size evolution in neuroimaging research: an evaluation of highly-cited studies (1990–2012) and of latest practices (2017–2018) in high-impact journals

    NeuroImage

    (2020)
  • H.I. Volpert-Esmond et al.

    Using multilevel models for the analysis of event-related potentials

    Int. J. Psychophysiol.

    (2021)
  • S.A. Baldwin et al.

    The dependability of electrophysiological measurements of performance monitoring in a clinical sample: a generalizability and decision analysis of the ERN and Pe

    Psychophysiology

    (2015)
  • M.P. Berry et al.

    Relation of depression symptoms to sustained reward and loss sensitivity

    Psychophysiology

    (2019)
  • D.G. Bonett

    Confidence intervals for standardized linear contrasts of means

    Psychol. Methods

    (2008)
  • M.A. Boudewyn et al.

    How many trials does it take to get a significant ERP effect? It depends

    Psychophysiology

    (2017)
  • A. Brand et al.

    The precision of effect size estimation from published psychological research: surveying confidence intervals

    Psychol. Rep.

    (2016)
  • A.M. Brandmaier et al.

    Assessing reliability in neuroimaging research through intra-class effect decomposition (ICED)

    eLife

    (2018)
  • W. Brown

    Some experimental results in the correlation of mental abilities

    Br. J. Psychol.

    (1910)
  • C.J. Brush et al.

    Using multilevel modeling to examine blunted neural responses to reward in major depression

    Biol. Psychiatry

    (2018)
  • P.-C. Bürkner

    brms: an R package for Bayesian multilevel models using Stan

    J. Stat. Softw.

    (2017)
  • P.-C. Bürkner

    Advanced Bayesian multilevel modeling with the R Package brms

    R J.

    (2018)
  • K.A. Carbine et al.

    Using generalizability theory and the ERP Reliability Analysis (ERA) toolbox for assessing test-retest reliability of ERP scores part 2: application to food-based tasks and stimuli

    Int. J. Psychophysiol.

    (2021)
  • E. Cho

    Making reliability reliable

    Organ. Res. Methods

    (2016)
  • P.E. Clayson

    Moderators of the internal consistency of error-related negativity scores: a meta-analysis of internal consistency estimates

    Psychophysiology

    (2020)
  • P.E. Clayson et al.

    How does noise affect amplitude and latency measurement of event-related potentials (ERPs)? A methodological critique and simulation study

    Psychophysiology

    (2013)
  • P.E. Clayson et al.

    Methodological reporting behavior, sample sizes, and statistical power in studies of event-related potentials: barriers to reproducibility and replicability

    Psychophysiology

    (2019)
  • P.E. Clayson et al.

    The viability of the frequency following response characteristics for use as biomarkers of cognitive therapeutics in schizophrenia

    PsyArXiv

    (2020)
  • P.E. Clayson et al.

    Evaluating the internal consistency of subtraction-based and residualized difference scores: considerations for psychometric reliability analyses of event-related potentials

    Psychophysiology

    (2021)
  • P.E. Clayson et al.

    Using generalizability theory and the ERP Reliability Analysis (ERA) toolbox for assessing test-retest reliability of ERP scores part 1: algorithms, framework, and implementation

    Int. J. Psychophysiol.

    (2021)