Implications of Genetic Distance to Reference and De Novo Genome Assembly for Clinical Genomics in Africans
medRxiv - Genetic and Genomic Medicine Pub Date : 2020-09-27 , DOI: 10.1101/2020.09.25.20201780
Daniel Shriner , Adebowale Adeyemo , Charles N. Rotimi

In clinical genomics, variant calling from short-read sequencing data typically relies on a pan-genomic, universal human reference sequence. A major limitation of this approach is that the number of reads that incorrectly map or fail to map increases as the reads diverge from the reference sequence. In the context of genome sequencing of genetically diverse Africans, we investigate the advantages and disadvantages of using a de novo assembly of the read data as the reference sequence in single-sample calling. Conditional on sufficient read depth, the alignment-based and assembly-based approaches yielded comparable sensitivity and false discovery rates for single nucleotide variants when benchmarked against a gold-standard call set. The alignment-based approach yielded coverage of an additional 270.8 Mb over which sensitivity was lower and the false discovery rate was higher. Although both approaches detected and missed clinically relevant variants, the assembly-based approach identified more such variants than the alignment-based approach. Of particular relevance to individuals of African descent, the assembly-based approach identified four heterozygous genotypes containing the sickle allele whereas the alignment-based approach identified no occurrences of the sickle allele. Variant annotation using dbSNP and gnomAD identified systematic biases in these databases due to underrepresentation of Africans. Using the counts of homozygous alternate genotypes from the alignment-based approach as a measure of genetic distance to the reference sequence GRCh38.p12, we found that the numbers of misassemblies, total variant sites, potentially novel single nucleotide variants (SNVs), and certain variant classes (e.g., splice acceptor variants, stop loss variants, missense variants, synonymous variants, and variants absent from gnomAD) were significantly correlated with genetic distance.
In contrast, genomic coverage and other variant classes (e.g., ClinVar pathogenic or likely pathogenic variants, start loss variants, stop gain variants, splice donor variants, incomplete terminal codons, variants with CADD score ≥20) were not correlated with genetic distance. With improvement in coverage, the assembly-based approach can offer a viable alternative to the alignment-based approach, with the advantage that it can obviate the need to generate diverse human reference sequences or collections of alternate scaffolds.
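
The two quantities the abstract relies on, benchmarking against a gold-standard call set and counting homozygous alternate genotypes as a proxy for genetic distance to the reference, can be illustrated with a minimal sketch. This is not the authors' pipeline. The names `Variant`, `benchmark`, and `count_hom_alt` are hypothetical, the data are toy values, and the sketch assumes the standard definitions sensitivity = TP / (TP + FN) and false discovery rate = FP / (TP + FP), with genotypes given as VCF-style GT strings.

```python
from typing import Iterable, NamedTuple, Tuple


class Variant(NamedTuple):
    """A single nucleotide variant keyed by position and alleles (hypothetical type)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def benchmark(calls: Iterable[Variant], truth: Iterable[Variant]) -> Tuple[float, float]:
    """Return (sensitivity, false discovery rate) of a call set against a truth set.

    Assumes exact matching on (chrom, pos, ref, alt); real benchmarking tools
    also normalize variant representation, which this sketch does not do.
    """
    call_set, truth_set = set(calls), set(truth)
    tp = len(call_set & truth_set)   # called and in the truth set
    fp = len(call_set - truth_set)   # called but absent from the truth set
    fn = len(truth_set - call_set)   # in the truth set but not called
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    fdr = fp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, fdr


def count_hom_alt(genotypes: Iterable[str]) -> int:
    """Count homozygous alternate diploid genotypes from VCF-style GT strings.

    "1/1", "1|1", and "2/2" count; "0/0", "0/1", and missing "./." do not.
    """
    count = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if len(alleles) == 2 and alleles[0] == alleles[1] and alleles[0] not in ("0", "."):
            count += 1
    return count


# Toy data, not taken from the study.
truth = [Variant("chr1", 100, "A", "G"), Variant("chr1", 200, "C", "T"),
         Variant("chr2", 300, "G", "A"), Variant("chr2", 400, "T", "C")]
calls = [Variant("chr1", 100, "A", "G"), Variant("chr1", 200, "C", "T"),
         Variant("chr2", 300, "G", "A"), Variant("chr3", 500, "A", "T")]

sensitivity, fdr = benchmark(calls, truth)
distance = count_hom_alt(["0/0", "0/1", "1/1", "1|1", "./.", "2/2"])
print(sensitivity, fdr, distance)
```

In this toy case three of four truth variants are recovered and one of four calls is false, so sensitivity is 0.75 and the false discovery rate is 0.25; three of the six genotypes are homozygous alternate.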

Updated: 2020-09-28