Predicting the Number of Bases to Attain Sufficient Coverage in High-Throughput Sequencing Experiments.
Journal of Computational Biology (IF 1.4), Pub Date: 2020-07-09, DOI: 10.1089/cmb.2019.0264
Chao Deng 1, Timothy Daley 2, Peter Calabrese 1, Jie Ren 1, Andrew D Smith 1

For many types of high-throughput sequencing experiments, success in downstream analysis depends on attaining sufficient coverage for individual positions in the genome. For example, when identifying single-nucleotide variants de novo, the number of reads supporting a particular variant call determines our confidence in that variant call. If sequenced reads are distributed uniformly along the genome, the coverage of a nucleotide position is easily approximated by a Poisson distribution, with rate equal to average sequencing depth. Unfortunately, as has become well known, high-throughput sequencing data are never uniform. The numerous factors contributing to variation in coverage have resisted attempts at direct modeling and change along with minor adjustments in the underlying technology. We propose a new nonparametric method to predict the portion of a genome that will attain some specified minimum coverage, as a function of sequencing effort, using information from a shallow sequencing experiment from the same library. Simulations show our approach performs well under an array of distributional assumptions that deviate from uniformity. We applied this approach to estimate coverage at varying depths in single-cell whole-genome sequencing data from multiple protocols. These resulted in highly accurate predictions, demonstrating the effectiveness of our approach in analyzing complexity of sequencing libraries and optimizing design of sequencing experiments.
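The uniform baseline mentioned in the abstract can be made concrete with a short sketch. Assuming reads really were distributed uniformly, per-position coverage would be approximately Poisson with rate equal to the average sequencing depth, so the expected fraction of the genome reaching a minimum coverage k is the Poisson upper tail P(X >= k). The function name `poisson_fraction_covered` and its parameters are illustrative assumptions, not taken from the paper, and this is only the idealized baseline, not the authors' nonparametric method.

```python
import math


def poisson_fraction_covered(mean_depth: float, min_coverage: int) -> float:
    """Expected fraction of positions with coverage >= min_coverage.

    Assumes (hypothetically) that coverage at each position follows a
    Poisson distribution with rate equal to the average sequencing depth,
    i.e. the uniform-read idealization, not real sequencing data.
    """
    if mean_depth < 0 or min_coverage < 0:
        raise ValueError("mean_depth and min_coverage must be non-negative")
    # P(X >= k) = 1 - sum_{i=0}^{k-1} exp(-lambda) * lambda^i / i!
    term = math.exp(-mean_depth)  # P(X = 0)
    below = 0.0
    for i in range(min_coverage):
        below += term                 # add P(X = i)
        term *= mean_depth / (i + 1)  # advance to P(X = i + 1)
    return max(0.0, 1.0 - below)


if __name__ == "__main__":
    # Fraction of positions expected to reach at least 10x coverage
    # at several average depths, under the uniform assumption.
    for depth in (5.0, 10.0, 30.0):
        frac = poisson_fraction_covered(depth, 10)
        print(f"mean depth {depth:>4}: fraction >= 10x = {frac:.4f}")
```

Real libraries are typically less uniform than this model assumes, so the baseline tends to be optimistic about how much of the genome reaches a given coverage; that gap is what the abstract's prediction method is meant to address.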

Updated: 2020-07-10