Robust principal component analysis for accurate outlier sample detection in RNA-Seq data.
BMC Bioinformatics (IF 3), Pub Date: 2020-06-29, DOI: 10.1186/s12859-020-03608-0
Xiaoying Chen, Bo Zhang, Ting Wang, Azad Bonni, Guoyan Zhao

High-throughput RNA sequencing is a powerful approach to studying gene expression. Because data acquisition involves complex multi-step protocols, a sample may deviate extremely from the other samples of the same treatment group, owing to either technical variation or true biological differences. The high dimensionality of the data, combined with few biological replicates, makes it challenging to detect such samples accurately, and the issue is not well studied in the literature. Robust statistics is a family of theories and techniques that aim to detect outliers by first fitting the majority of the data and then flagging data points that deviate from that fit. Robust statistics has been widely used for outlier detection in multivariate data analysis in chemometrics and engineering. Here we apply robust statistics to RNA-seq data analysis.

We report the use of two robust principal component analysis (rPCA) methods, PcaHubert and PcaGrid, to detect outlier samples in multiple simulated and real biological RNA-seq data sets with positive control outlier samples. PcaGrid achieved 100% sensitivity and 100% specificity in all tests that used positive control outliers with varying degrees of divergence. We applied the rPCA methods and classical principal component analysis (cPCA) to an RNA-seq data set profiling gene expression in the external granule layer of the cerebellum of control and conditional SnoN knockout mice. Both rPCA methods detected the same two outlier samples, whereas cPCA failed to detect any. We performed differentially expressed gene detection before and after outlier removal, as well as with and without batch effect modeling. We validated gene expression changes using quantitative reverse transcription PCR and used the results as a reference to compare the performance of eight different data analysis strategies. Removing outliers without batch effect modeling performed best in terms of detecting biologically relevant differentially expressed genes.
rPCA, as implemented in the PcaGrid function, is an accurate and objective method for detecting outlier samples. It is well suited to high-dimensional data with small sample sizes, such as RNA-seq data. Outlier removal can significantly improve the performance of differential gene detection and downstream functional analysis.
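
The core idea behind robust outlier detection, fitting the majority of the data first and then flagging points that deviate from that fit, can be illustrated with a minimal Python sketch. This is not PcaGrid or PcaHubert and does not reproduce the paper's analysis; it substitutes a much simpler median/MAD score for illustration only. The function name `flag_outlier_samples`, the cutoff of 3.5, and the simulated data (12 samples, 1000 genes, one planted outlier) are all assumptions made for this example.

```python
import numpy as np


def flag_outlier_samples(expr, cutoff=3.5):
    """Score samples by their distance from a robust "majority" profile.

    expr: samples x genes matrix of log-scale expression values (hypothetical input).
    Returns (scores, flagged_indices).
    """
    expr = np.asarray(expr, dtype=float)
    # Per-gene median: a center that a few extreme samples cannot pull far.
    center = np.median(expr, axis=0)
    # Euclidean distance of each sample from that center.
    dist = np.linalg.norm(expr - center, axis=1)
    # Robust z-score of the distances, using the median and the
    # median absolute deviation (MAD) instead of the mean and standard deviation.
    med = np.median(dist)
    mad = np.median(np.abs(dist - med))
    scores = 0.6745 * (dist - med) / mad
    # Only unusually large distances are treated as outliers.
    flagged = np.flatnonzero(scores > cutoff)
    return scores, flagged


# Simulated data (an assumption, not taken from the paper).
rng = np.random.default_rng(0)
n_samples, n_genes = 12, 1000
expr = rng.normal(loc=5.0, scale=0.3, size=(n_samples, n_genes))

# Plant one outlier sample by shifting 200 of its genes.
outlier_index = 7
expr[outlier_index, :200] += 3.0

scores, flagged = flag_outlier_samples(expr)
print("robust scores:", np.round(scores, 1))
print("flagged samples:", flagged)
```

Because the center and the spread are both estimated with medians, the planted sample cannot mask itself by inflating the estimates, which is the weakness of classical mean and variance based approaches. With so few samples, a fixed cutoff can occasionally flag ordinary samples as well, so the scores are best read as a ranking rather than a definitive call.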

Updated: 2020-06-29