Original research
Examining the practical limits of batch effect-correction algorithms: When should you care about batch effects?
Introduction
Batch effects (BEs) are technical sources of variation arising from differences in experiment times, handlers, reagents, and instruments (Leek et al., 2010; Goh et al., 2017a; Goh et al., 2017b). BEs are especially relevant to assays with sample-size limits, which require multiple runs (each run constituting a “technical batch”). BEs are not a homogeneous class of technical variation: specific assays and platforms can create idiosyncratic BEs, such that within-batch and between-batch effects can be observed in metabonomics and related platforms (Broadhurst et al., 2018; Sanchez-Illana et al., 2018). The simulations deployed in this study do not cover these types of BEs, as they are highly platform-specific; instead, we simulate BEs using naïve but more generalizable assumptions.
BEs may confound functional analysis. In comparative setups where one phenotype class is compared against another (e.g., normal against disease), BEs may induce false effects or mask real effects (Goh and Wong, 2018), hampering research outcomes. It is therefore important that BEs are dealt with satisfactorily.
There is a wide variety of batch effect-correction algorithms (BECAs), which has prompted various performance-ranking evaluations (Luo et al., 2010; Chen et al., 2011; Lazar et al., 2012). Although ComBat (Johnson et al., 2007), one of the earliest BECAs, often dominates, other methods such as Harman (Oytam et al., 2016), surrogate variable analysis (SVA) (Leek and Storey, 2007), the ratio-based algorithms Ratio-A and Ratio-G (Luo et al., 2010), as well as batch mean-centering (BMC) (Sims et al., 2008), have also been reported to work well in their own specific evaluations. Given such disparities, we are curious whether the outcomes of such ranking exercises are highly context-dependent (e.g., due to the data used or upstream transformation).
BE correction is the process of estimating and removing BEs algorithmically. Several factors can influence BE correction: (1) the degree of confounding between class and batch factors, (2) the upstream normalization procedure, and (3) the nature (magnitude and variability) of the BE itself.
Confounding refers to the extent to which class and batch effects are inter-mixed and indistinguishable. Suppose we have two classes, A and B, with 4 samples per class. Amongst the 8 samples, the order of class labels S is (A1, A2, A3, A4, B5, B6, B7, B8). This ordered arrangement of class labels, A and B, denotes the assignment of each sample to a particular class. In biology, a class can be a phenotype, e.g., disease or normal. The list S, being an ordered assignment of class labels, is also referred to as the class factor. The corresponding batch factor, X, given two batches, U and V, may be assigned as (U1, U2, V3, V4, U5, U6, V7, V8). Here, although BEs do exist, the batches are equally distributed between the samples of classes A and B. This is a ‘balanced scenario’. In experimental design, the balanced scenario is preferred: given sufficient sample size and random distribution, BEs and other sources of technical bias are expected to be “averaged out”. Since BEs are negated, there is no confounding between the class and batch factors.
Unfortunately, experimental designs are often inevitably imperfect. Real-world factors, e.g., sample availability and failed experiments, also contribute. These lead to batch-class imbalance, such that BEs cannot be negated via “averaging out”. For example, given (A1, A2, A3, A4, B5, B6, B7, B8), we may instead have (U1, U2, U3, V4, U5, V6, V7, V8). Class A is over-represented in batch U (75%), and vice versa. Here, if a strong difference is detected between samples in class A and class B, the differential effect may be contributed by both class and batch effects. It is hard to tell which is the true source of the differential effect; we therefore say that class A is confounded with batch U.
In the most extreme scenario, given (A1, A2, A3, A4, B5, B6, B7, B8), we may have (U1, U2, U3, U4, V5, V6, V7, V8). This situation may occur due to complete oversight in experimental design. Here, class A is completely correlated with batch U (100%), and vice versa. If a strong difference is detected between samples in class A and class B, there is no way to tell whether the difference is due to true class effects or batch effects. We therefore say that class A is perfectly confounded with batch U.
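To make the three scenarios concrete, the degree of batch-class confounding can be computed directly from the two label lists. The following sketch (plain Python; the function name is ours, not from the paper) reports, for class A, the fraction of its samples falling into its most common batch: 0.5 for the balanced design, 0.75 for the imbalanced one, and 1.0 under perfect confounding.

```python
from collections import Counter

def confounding_fraction(classes, batches):
    """Fraction of class-A samples in the batch where A is most common.
    With two equal-sized batches: 0.5 = balanced, 1.0 = perfect confounding."""
    a_batches = [b for c, b in zip(classes, batches) if c == "A"]
    most_common_count = Counter(a_batches).most_common(1)[0][1]
    return most_common_count / len(a_batches)

# The three designs described in the text: 4 samples of class A, 4 of class B.
classes    = ["A", "A", "A", "A", "B", "B", "B", "B"]
balanced   = ["U", "U", "V", "V", "U", "U", "V", "V"]  # no confounding
imbalanced = ["U", "U", "U", "V", "U", "V", "V", "V"]  # A over-represented in U
confounded = ["U", "U", "U", "U", "V", "V", "V", "V"]  # A perfectly confounded with U

print(confounding_fraction(classes, balanced))    # 0.5
print(confounding_fraction(classes, imbalanced))  # 0.75
print(confounding_fraction(classes, confounded))  # 1.0
```

This simple fraction mirrors the 75% and 100% figures quoted above; real studies with more batches would use a contingency table or a chi-square statistic instead.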
BECAs are techniques for estimating and removing BEs from data. Each BECA makes assumptions about the underlying data distribution and is therefore susceptible to the upstream normalization procedure. Normalizations are meant to remove unwanted technical variation; however, normalization may change the scale and distribution of the data and, consequently, impact BECA performance.
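As a concrete illustration, batch mean-centering (BMC), the simplest of the BECAs named above, can be sketched in a few lines of numpy (this is our illustrative sketch under a samples × variables layout, not the authors' code):

```python
import numpy as np

def batch_mean_center(X, batches):
    """Batch mean-centering (BMC): subtract each batch's per-variable mean
    so that every batch is centred at zero.
    X: samples x variables matrix; batches: per-sample batch labels."""
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    corrected = X.copy()
    for b in np.unique(batches):
        mask = batches == b
        corrected[mask] -= X[mask].mean(axis=0)
    return corrected

# Toy example: two variables, batch V sits ~10 units above batch U.
X = np.array([[1.0, 2.0], [3.0, 4.0], [11.0, 12.0], [13.0, 14.0]])
corrected = batch_mean_center(X, ["U", "U", "V", "V"])
```

Note the design implication: under strong batch-class confounding, a batch's mean absorbs part of the class signal, so subtracting it also removes real biology. This is one reason why confounding limits BECA effectiveness.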
Determining the nature of BEs is a challenging issue. A simple perspective is to assume that batch-correlated variation is uniformly distributed across all variables (variables being measurable quantities that vary between samples, e.g., gene or protein expression). However, Leek et al. (2010) suggested that BEs are non-homogeneous and non-uniformly distributed amongst variables. If Leek et al. are correct, then most batch-effect simulation approaches that assume uniformity would be insufficient. Moreover, uniformity assumptions are probably unrealistic, and uniform BEs can be easily dealt with analytically (Goh and Wong, 2017).
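Under the non-uniformity view, a simulation should perturb only a subset of variables, each with its own magnitude, rather than applying one global shift. A minimal numpy sketch (our own assumptions for illustration: ~30% of variables affected, Gaussian magnitudes; the paper's actual simulation parameters are described in its Methods):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_vars = 8, 1000
base = rng.normal(10, 1, size=(n_samples, n_vars))  # clean expression matrix
batch = np.array([0, 0, 0, 0, 1, 1, 1, 1])          # second batch carries the BE

# Non-uniform BE: only a random subset of variables is perturbed,
# each with its own magnitude, instead of one uniform shift.
affected = rng.random(n_vars) < 0.3
magnitudes = rng.normal(0, 2, size=n_vars) * affected
observed = base + np.outer(batch, magnitudes)       # batch-1 rows get the shifts
```

A uniform-BE simulation would instead add a single constant to all variables of batch 1, which is trivially removed by per-batch centering; the non-uniform version is what makes correction genuinely hard.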
This study is not a ranking exercise for elucidating the best BECA on some dataset(s). Instead, we examine the practical limits of BECAs: (1) how different upstream normalization procedures affect them; (2) whether generic normalization methods can deal with BEs at all; (3) which BECAs remain effective under various levels of confounding; and (4) whether there is reason to believe a universally best BECA exists for any scenario, or whether data-dependent context is crucial in determining performance. To make a case for generalizability, three representative RNA-Seq datasets and one proteomics dataset are used for evaluation.
Section snippets
BECAs are surprisingly resilient against batch-class imbalances
The 2-dimensional (2-D) scatterplot is a standard approach for detecting class and batch effects (Fig. 1). We first discuss findings resulting from one transcriptomics dataset with BEs simulated under the non-uniformity assumption (genomics data 1; see Table 1 for descriptions and meta-data of all datasets used), and then discuss whether similar findings are obtained in the other test datasets.
In genomics data 1, individual samples are resolved across PCs 1 and 2 (shapes represent class; colors represent batch).
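The scatterplot check described here amounts to projecting samples onto their first two principal components; if samples cluster by batch rather than by class, BEs dominate the variation. A minimal SVD-based sketch (illustrative toy data with an injected batch shift, not the paper's genomics data 1):

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project samples (rows) onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)                       # centre each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T               # sample scores on the top PCs

# Hypothetical toy data: 8 samples x 50 variables; samples 5-8 carry a batch shift.
rng = np.random.default_rng(1)
X = rng.normal(0, 1, size=(8, 50))
X[4:] += 3.0
scores = pca_scores(X)
# If PC1 cleanly separates samples 1-4 from 5-8, the dominant variation is the batch.
```

In practice, one would plot `scores[:, 0]` against `scores[:, 1]` with marker shape encoding class and color encoding batch, exactly as in the figure described above.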
Discussion
It is not entirely surprising that conventional normalizations perform poorly. After all, they do not explicitly correct for BEs. We demonstrate this using quantile normalization (QN).
QN is a very widely used normalization technique. It is popular because it creates good standardization amongst samples. However, samples that look well-standardized do not mean that BEs have been corrected. What QN does is calculate a mean for each rank position to obtain a reference distribution of means across the entire dataset; each sample's values are then replaced by the corresponding rank means, forcing all samples onto an identical distribution.
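A minimal sketch of QN (our code, not the authors'; it assumes a variables × samples matrix and no tied values):

```python
import numpy as np

def quantile_normalize(X):
    """Quantile normalization: force every sample (column) onto a shared
    reference distribution built from the mean of each rank position."""
    X = np.asarray(X, dtype=float)
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # per-sample rank of each value
    mean_by_rank = np.sort(X, axis=0).mean(axis=1)     # reference distribution
    return mean_by_rank[ranks]                         # substitute values by rank

# Two samples (columns), three variables (rows).
X = np.array([[5.0, 4.0],
              [2.0, 1.0],
              [3.0, 6.0]])
qn = quantile_normalize(X)  # both columns now share the values {1.5, 3.5, 5.5}
```

This makes the limitation visible: after QN every sample has an identical distribution, so global summaries look perfectly standardized, yet if a batch systematically reorders variables within samples, that batch-driven ranking (and hence the BE) survives the transformation.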
Simulated datasets
We simulated data modeled on the parameters of four real datasets: three RNA-Seq datasets (genomics datasets), each with three pairs of case and control samples (Ferreira et al., 2014; Isogai et al., 2018; Hatzi et al., 2019), and an unlabelled shotgun proteomics dataset with four cases and controls (Shao et al., 2010). Summary information for these datasets is shown in Table 1. An overview is provided in Fig. 5, linking datasets to the respective analytical output and corresponding
Acknowledgments
WWBG gratefully acknowledges Limsoon Wong, National University of Singapore, for inspiring this work. WWBG gratefully acknowledges support from the National Research Foundation of Singapore, NRF-NSFC (Grant No. NRF2018NRF-NSFC003SB-006).
References (30)
An introduction to ROC analysis. Pattern Recognit. Lett. (2006)
Silencing of odorant receptor genes by G protein βγ signaling ensures the expression of one odorant receptor per olfactory sensory neuron. Neuron (2014)
The application of principal component analysis to drug discovery and biomedical data. Drug Discov. Today (2017)
Why batch effects matter in omics data, and how to avoid them. Trends Biotechnol. (2017)
Dealing with confounders in omics analysis. Trends Biotechnol. (2018)
Comparative analysis of statistical methods used for detecting differential expression in label-free mass spectrometry proteomics. J. Proteomics (2015)
Evaluation of batch effect elimination using quality control replicates in LC-MS metabolite profiling. Anal. Chim. Acta (2018)
Shotgun proteomics analysis of hibernating arctic ground squirrels. Mol. Cell. Proteom. (2010)
Guidelines and considerations for the use of system suitability and quality control samples in mass spectrometry assays applied in untargeted clinical metabolomic studies. Metabolomics (2018)
Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods. PLoS One (2011)
Polyester: simulating RNA-seq datasets with differential transcript expression. Bioinformatics
Can peripheral blood-derived gene expressions characterize individuals at ultra-high risk for psychosis? Comput. Psychiatr. Psychol.
Protein complex-based analysis is resistant to the obfuscating consequences of batch effects: a case study in clinical proteomics. BMC Genomics
Histone demethylase LSD1 is required for germinal center formation and BCL6-driven lymphomagenesis. Nat. Immunol.
Combining location-and-scale batch effect adjustment with data cleaning by latent factor adjustment. BMC Bioinformatics
Cited by (24)
How missing value imputation is confounded with batch effects and what you can do about it. Drug Discovery Today (2023)
Are batch effects still relevant in the age of big data? Trends in Biotechnology (2022)
Doppelgänger spotting in biomedical gene expression data. iScience (2022)
Perspectives for better batch effect correction in mass-spectrometry-based proteomics. Computational and Structural Biotechnology Journal (2022)
Mathematical-based microbiome analytics for clinical translation. Computational and Structural Biotechnology Journal (2021)