Error, noise and bias in de novo transcriptome assemblies
Molecular Ecology Resources ( IF 7.7 ) Pub Date : 2020-03-17 , DOI: 10.1111/1755-0998.13156
Adam H Freedman 1 , Michele Clamp 1 , Timothy B Sackton 1

De novo transcriptome assembly is a powerful tool, and has been widely used over the last decade for making evolutionary inferences. However, it relies on two implicit assumptions: that the assembled transcriptome is an unbiased representation of the underlying expressed transcriptome, and that expression estimates from the assembly are good, if noisy, approximations of the relative abundance of expressed transcripts. Using publicly available data for model organisms, we demonstrate that, across assembly algorithms and data sets, these assumptions are consistently violated. Bias exists at the nucleotide level, with genotyping error rates ranging from 30% to 83%. As a result, diversity is underestimated in transcriptome assemblies, with consistent underestimation of heterozygosity in all but the most inbred samples. Even at the gene level, expression estimates show wide deviations from map‐to‐reference estimates, and positive bias at lower expression levels. Standard filtering of transcriptome assemblies improves the robustness of gene expression estimates but leads to the loss of a meaningful number of protein‐coding genes, including many that are highly expressed. We demonstrate a computational method, length‐rescaled CPM, to partly alleviate noise and bias in expression estimates. Researchers should consider ways to minimize the impact of bias in transcriptome assemblies.

Updated: 2020-03-17