Journal of Genetics and Genomics

Volume 46, Issue 9, 20 September 2019, Pages 433-443

Original research
Examining the practical limits of batch effect-correction algorithms: When should you care about batch effects?

https://doi.org/10.1016/j.jgg.2019.08.002

Abstract

Batch effects are technical sources of variation that can confound analysis. While many performance-ranking exercises have been conducted to establish the best batch effect-correction algorithm (BECA), we hold the viewpoint that the notion of "best" is context-dependent. Moreover, alternative questions beyond this simplistic notion are also interesting: are BECAs robust against various degrees of confounding, and if so, what is the limit? Using two different methods for simulating class (phenotype) and batch effects, and taking representative datasets across both genomics (RNA-Seq) and proteomics platforms, we demonstrate that where sample classes and batch factors are moderately confounded, most BECAs are remarkably robust and only weakly affected by upstream normalization procedures. This observation is consistently supported across the test datasets. BECAs do have limits: when sample classes and batch factors are strongly confounded, BECA performance declines, with variable performance in precision, recall, and batch correction. We also report that while conventional normalization methods have minimal impact on batch-effect correction, they do affect downstream statistical feature selection and, in strongly confounded scenarios, may even outperform BECAs. In other words, removing batch effects is no guarantee of optimal functional analysis. Overall, this study suggests that simplistic performance-ranking exercises are of limited value, and that every BECA is a compromise in some context or another.

Introduction

Batch effects (BEs) are technical sources of variation arising from different experiment times, handlers, reagents, and instruments (Leek et al., 2010; Goh et al., 2017a, 2017b). BEs are especially relevant to assays with sample-size limits that therefore require multiple runs (each run constituting a "technical batch"). BEs are not a homogeneous group of technical variation: specific assays and platforms can create idiosyncratic BEs, such that within-batch and between-batch effects can be observed in metabonomics and related platforms (Broadhurst et al., 2018; Sanchez-Illana et al., 2018). The simulations deployed in this study do not cover these types of BEs, as they are highly platform-specific; instead, we simulate BEs using naïve but more generalizable assumptions.

BEs may confound functional analysis. In comparative setups where one phenotype class is compared against another (e.g., normal against disease), BEs may induce false effects or mask real effects (Goh and Wong, 2018), hampering research outcomes. Thus, it is important that BEs are dealt with satisfactorily.

There is a wide variety of batch effect-correction algorithms (BECAs), which has prompted various performance-ranking evaluations (Luo et al., 2010; Chen et al., 2011; Lazar et al., 2012). Although ComBat (Johnson et al., 2007), one of the earliest BECAs, often dominates, other methods such as Harman (Oytam et al., 2016), surrogate variable analysis (SVA) (Leek and Storey, 2007), the ratio-based algorithms Ratio-A and Ratio-G (Luo et al., 2010), and batch mean-centering (BMC) (Sims et al., 2008) have also been reported to work well in their own specific evaluations. Given such disparities, we are curious whether the outcomes of such ranking exercises are highly context-dependent (e.g., due to the data used or upstream transformation).
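Of the BECAs listed above, batch mean-centering is the simplest to state precisely: each variable is re-centered within each batch. For concreteness, below is a minimal Python sketch of BMC; the matrix orientation, function name, and toy data are our own illustrative assumptions, not code from any cited package:

```python
import numpy as np
import pandas as pd

def batch_mean_center(expr: pd.DataFrame, batch: pd.Series) -> pd.DataFrame:
    """Batch mean-centering (BMC): within each batch, subtract each
    variable's per-batch mean. Variables (genes/proteins) are rows,
    samples are columns; `batch` maps sample name -> batch label."""
    corrected = expr.copy()
    for b in batch.unique():
        cols = batch.index[batch == b]         # samples belonging to batch b
        corrected[cols] = expr[cols].sub(expr[cols].mean(axis=1), axis=0)
    return corrected

# Toy check: an additive shift applied to batch U vanishes after BMC.
rng = np.random.default_rng(0)
samples = [f"s{i}" for i in range(1, 9)]
expr = pd.DataFrame(rng.normal(size=(5, 8)), columns=samples)
batch = pd.Series(["U", "U", "V", "V", "U", "U", "V", "V"], index=samples)
expr.loc[:, batch == "U"] += 2.0               # simulated batch effect
out = batch_mean_center(expr, batch)
print(out[batch.index[batch == "U"]].mean(axis=1))  # ~0 for every variable
```

Because BMC only re-centers, it removes additive per-batch shifts exactly but leaves batch-specific differences in variance untouched; location-and-scale methods such as ComBat also adjust the latter.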

BE correction is the process of estimating and removing BEs algorithmically. There are several potential factors influencing BE correction: (1) the degree of confounding between class and batch factors, (2) upstream normalization procedure, and (3) the nature (magnitude and variability) of the BE itself.

Confounding refers to the extent to which class and batch effects are intermixed and indistinguishable. Suppose we have two classes, A and B, with 4 samples per class. Amongst the 8 samples, the order of the class labels S is (A1, A2, A3, A4, B5, B6, B7, B8). This ordered arrangement of class labels, A and B, denotes the assignment of each sample to a particular class. In biology, a class can be a phenotype, e.g., disease or normal. The list S, being an ordered assignment of class labels, is also referred to as the class factor. The corresponding batch factor, X, given two batches, U and V, may be assigned as (U1, U2, V3, V4, U5, U6, V7, V8). Here, although BEs do exist, the batch factors are equally distributed between the samples in classes A and B. This is a 'balanced scenario'. In experimental design, the balanced scenario is preferred because, given sufficient sample size and random distribution, BEs and other sources of technical bias are expected to be "averaged out". Since BEs are negated, there is no confounding between class and batch factors.

Unfortunately, experimental designs are often imperfect. Real-world factors, e.g., sample availability and failed experiments, may also contribute. These lead to batch-class imbalance, such that BEs cannot be negated via "averaging out". For example, given (A1, A2, A3, A4, B5, B6, B7, B8), we may have (U1, U2, U3, V4, U5, V6, V7, V8) instead. Class A is over-represented in batch U (75%), and vice versa. Here, if a strong difference is detected between samples in class A and class B, the differential effect may be contributed by both class and batch effects. It is hard to tell which is the true source of the differential effect; we therefore say that class A is confounded with batch U.

In the most extreme scenario, given (A1, A2, A3, A4, B5, B6, B7, B8), we may have (U1, U2, U3, U4, V5, V6, V7, V8). This situation may occur due to complete oversight in experimental design. Here, class A is completely correlated with batch U (100%), and vice versa. If a strong difference is detected between samples in class A and class B, there is no way to tell whether the difference is due to true class effects or batch effects. We therefore say that class A is perfectly confounded with batch U.
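The degree of confounding in these three scenarios can be quantified in several ways. As one illustrative choice (ours, not a metric prescribed in this study), Cramér's V computed on the batch-class contingency table scores the balanced design 0, the moderately imbalanced design 0.5, and the perfectly confounded design 1:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(class_factor, batch_factor):
    """Batch-class confounding as Cramér's V:
    0 = perfectly balanced, 1 = perfectly confounded."""
    table = pd.crosstab(pd.Series(class_factor), pd.Series(batch_factor))
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))

S = list("AAAABBBB")              # class factor from the examples above
balanced = list("UUVVUUVV")       # batches equally split between classes
moderate = list("UUUVUVVV")       # class A over-represented in batch U (75%)
complete = list("UUUUVVVV")       # class A perfectly confounded with batch U
for X in (balanced, moderate, complete):
    print(cramers_v(S, X))        # 0.0, 0.5, 1.0
```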

BECAs are techniques for estimating and removing BEs from data. Each BECA makes assumptions about the underlying data distribution and is therefore theoretically susceptible to the upstream normalization procedure. Normalization procedures are meant to remove unwanted technical variation; however, they may change the scale and distribution of the data and, consequently, impact BECA performance.

Determining the nature of BEs is a challenging issue. A simple perspective is to assume that batch-effect-correlated variation is uniformly distributed across all variables (variables being measurable quantities that vary between different samples, e.g., gene or protein expression). Leek et al. (2010), however, suggested that BEs are non-homogeneous and non-uniformly distributed amongst different variables. If Leek et al. are correct, then most batch-effect simulation approaches assuming uniformity would be insufficient. Moreover, uniformity assumptions are probably unrealistic, as uniformly distributed BEs can be easily dealt with analytically (Goh and Wong, 2017).
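To make the two simulation assumptions concrete, here is a minimal numpy sketch contrasting them; the dimensions and effect sizes are illustrative and are not the parameters used in this study:

```python
import numpy as np

rng = np.random.default_rng(1)
n_vars, n_samples = 1000, 8
batch = np.array([0, 0, 1, 1, 0, 0, 1, 1])        # two balanced batches
base = rng.normal(size=(n_vars, n_samples))       # variables x samples

# Uniformity assumption: every variable gets the same additive shift
# in batch 1.
uniform = base.copy()
uniform[:, batch == 1] += 1.5

# Non-uniform assumption (after Leek et al., 2010): each variable has
# its own batch-effect magnitude, drawn here from a normal distribution.
shifts = rng.normal(loc=1.5, scale=1.0, size=(n_vars, 1))
nonuniform = base.copy()
nonuniform[:, batch == 1] += shifts

# The induced per-variable shift is constant under uniformity but
# variable-specific under the non-uniform model:
print(np.std((uniform - base)[:, batch == 1].mean(axis=1)))     # ~0
print(np.std((nonuniform - base)[:, batch == 1].mean(axis=1)))  # ~1
```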

This study is not a ranking exercise for elucidating the best BECA given some dataset(s). Instead, we are interested in examining the practical limits of BECAs: (1) how different upstream normalization procedures affect them; (2) whether generic normalization methods can actually deal with BEs at all; (3) which BECAs remain effective under various levels of confounding; and (4) whether there is reason to believe a universal best BECA exists for any scenario, or whether data-dependent contexts are crucial in determining performance. To make a case for generalizability, three representative RNA-Seq datasets and one proteomics dataset are used for evaluation.

Section snippets

BECAs are surprisingly resilient against batch-class imbalances

The 2-dimensional (2-D) scatterplot is a standard approach for detecting class effects and BEs (Fig. 1). We will discuss findings resulting from one transcriptomics dataset based on the non-uniformity assumption in BE simulation (genomics data 1; see Table 1 for the description and meta-data of all datasets used), and then discuss whether similar findings are also obtained in the other test datasets.

In genomics data 1, individual samples are resolved across PCs 1 and 2 (shapes represent class; colors represent…
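For readers wishing to reproduce this kind of diagnostic, the sketch below builds a Fig. 1-style PCA scatterplot from a toy matrix with injected class and batch effects; all data and parameters are illustrative assumptions, not the study's plotting code:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
expr = rng.normal(size=(1000, 8))                  # variables x samples
classes = np.array(list("AAAABBBB"))
batches = np.array(["U", "U", "V", "V", "U", "U", "V", "V"])
expr[:, classes == "B"] += 1.0                     # class effect
expr[:, batches == "U"] += rng.normal(1.5, 1.0, size=(1000, 1))  # batch effect

pcs = PCA(n_components=2).fit_transform(expr.T)    # samples as rows
for cls, marker in (("A", "o"), ("B", "s")):       # shapes represent class
    for b, color in (("U", "tab:red"), ("V", "tab:blue")):  # colors: batch
        sel = (classes == cls) & (batches == b)
        plt.scatter(pcs[sel, 0], pcs[sel, 1], marker=marker, color=color,
                    label=f"class {cls} / batch {b}")
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.legend(); plt.show()
```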

Discussion

It is not entirely surprising that conventional normalizations perform poorly. After all, they do not explicitly correct for BEs. We demonstrate this using quantile normalization (QN).

QN is a very widely used normalization technique. It is popular because it creates good standardization amongst samples. However, samples that look well-standardized do not mean that BEs have been corrected well. What QN does is calculate a mean for each ranked variable to obtain a distribution of means across the entire…
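A minimal sketch of the rank-then-average procedure just described (ties handled naively via first-occurrence ranking; production implementations typically average tied ranks):

```python
import numpy as np
import pandas as pd

def quantile_normalize(expr: pd.DataFrame) -> pd.DataFrame:
    """Quantile normalization: rank values within each sample (column),
    average across samples at each rank, then map each value back to
    the mean of its rank."""
    ranks = expr.rank(method="first").astype(int)    # per-column ranks, 1..n
    rank_means = np.sort(expr.to_numpy(), axis=0).mean(axis=1)
    return ranks.apply(lambda col: rank_means[col.to_numpy() - 1])

# After QN every sample has exactly the same distribution:
df = pd.DataFrame(np.random.default_rng(3).lognormal(size=(100, 6)))
qn = quantile_normalize(df)
print(qn.describe().loc[["mean", "50%"]])            # identical per column
```

The printed summary is identical for every column: after QN, all samples share exactly the same distribution, which is why QN-treated data can look well-standardized even when BEs remain uncorrected.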

Simulated datasets

We simulated data modeled from the parameters of four real datasets: three RNA-Seq datasets (genomics datasets), each with three pairs of case and control samples (Ferreira et al., 2014; Isogai et al., 2018; Hatzi et al., 2019), and an unlabelled (label-free) shotgun proteomics dataset with four cases and controls (Shao et al., 2010). The summary information of these datasets is shown in Table 1. An overview is provided in Fig. 5 linking datasets to the respective analytical output and corresponding…
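The references include Polyester (Frazee et al., 2015) for simulating RNA-Seq data with differential expression; as a schematic stand-in, the numpy sketch below captures the general idea of sampling from per-variable means and dispersions estimated from a real log-scale matrix, then injecting class and batch effects. It is our simplification, not the study's actual simulation code:

```python
import numpy as np

def simulate_from_template(real_expr, n_per_class=3, class_lfc=1.0,
                           batch_shift=1.0, frac_diff=0.1, seed=0):
    """Simulate a case/control dataset modeled on the per-variable mean
    and SD of a real log-scale expression matrix (variables x samples),
    injecting class effects into a random subset of variables and an
    additive shift into one batch."""
    rng = np.random.default_rng(seed)
    mu = real_expr.mean(axis=1)
    sd = real_expr.std(axis=1) + 1e-6
    n_vars, n = real_expr.shape[0], 2 * n_per_class
    sim = rng.normal(mu[:, None], sd[:, None], size=(n_vars, n))
    diff = rng.choice(n_vars, int(frac_diff * n_vars), replace=False)
    sim[np.ix_(diff, np.arange(n_per_class, n))] += class_lfc  # case samples
    batch = np.array([0, 1] * n_per_class)   # edit to control confounding
    sim[:, batch == 1] += batch_shift
    return sim, batch, diff

# Usage with a made-up template matrix (500 variables, 6 samples):
template = np.log2(np.random.default_rng(4).lognormal(4, 1, (500, 6)) + 1)
sim, batch, diff_idx = simulate_from_template(template)
```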

Acknowledgments

WWBG gratefully acknowledges Limsoon Wong, National University of Singapore, for inspiring this work. WWBG gratefully acknowledges support from the National Research Foundation of Singapore, NRF-NSFC (Grant No. NRF2018NRF-NSFC003SB-006).

References (30)

  • A.C. Frazee et al.

    Polyester: simulating RNA-seq datasets with differential transcript expression

    Bioinformatics

    (2015)
  • W.W.B. Goh et al.

    Can peripheral blood-derived gene expressions characterize individuals at ultra-high risk for psychosis?

    Comput. Psychiatr. Psychol.

    (2017)
  • W.W.B. Goh et al.

Protein complex-based analysis is resistant to the obfuscating consequences of batch effects: a case study in clinical proteomics

    BMC Genomics

    (2017)
  • K. Hatzi et al.

    Histone demethylase LSD1 is required for germinal center formation and BCL6-driven lymphomagenesis

    Nat. Immunol.

    (2019)
  • R. Hornung et al.

    Combining location-and-scale batch effect adjustment with data cleaning by latent factor adjustment

    BMC Bioinformatics

    (2016)
Cited by (24)

    • Are batch effects still relevant in the age of big data?

      2022, Trends in Biotechnology
      Citation Excerpt :

      In proteomics, BEs remain pertinent, with key recent publications on best practices (e.g., proBatch, which covers useful wisdoms), benchmarking datasets, and recommended workflows [11]. proBatch, alongside our own works [12], reveal that BE correction is ultimately context-dependent. New technologies drive the development of new methodologies.

    • Doppelgänger spotting in biomedical gene expression data

      2022, iScience
      Citation Excerpt :

As all of our input microarray data sets are imbalanced between batches, this needs to be addressed early in the analysis. Batch imbalances, i.e. different numbers of samples in both batches, have been shown to worsen the performance of two-step batch correction algorithms like ComBat (Li et al., 2021; Zhou et al., 2019). Our doppelgänger identification method relies on ComBat to first reduce the impact of batch effects before PPCC values are calculated (this is a critical step…

    • Perspectives for better batch effect correction in mass-spectrometry-based proteomics

      2022, Computational and Structural Biotechnology Journal
      Citation Excerpt :

      In more severe cases, it can lead towards false positives (proteins that are not differential are selected) and false negatives (proteins that are differential are not selected) [1]. In general, BEs are complex, and effective mitigation is highly context dependent [2,3]. It is also surprising that even as technologies and analytical methods advance, BEs seem to become more pertinent and relevant [4].

    • Mathematical-based microbiome analytics for clinical translation

      2021, Computational and Structural Biotechnology Journal