Genome-wide detection of short tandem repeat expansions by long-read sequencing
BMC Bioinformatics (IF 2.9), Pub Date: 2020-12-28, DOI: 10.1186/s12859-020-03876-w
Qian Liu 1, Yao Tong 1, Kai Wang 1,2

A short tandem repeat (STR), or “microsatellite”, is a tract of DNA in which a specific motif (typically < 10 base pairs) is repeated multiple times. STRs are abundant throughout the human genome, and specific repeat expansions may be associated with human diseases. Long-read sequencing coupled with bioinformatics tools enables the estimation of repeat counts for STRs. However, with the exception of a few well-known disease-relevant STRs, the normal ranges of repeat counts for most STRs in human populations are not well known, which prevents the prioritization of STRs that may be associated with human diseases. In this study, we extend the computational tool RepeatHMM to infer normal ranges for 432,604 STRs using 21 long-read sequencing datasets from human genomes, and build a genomic-scale database called RepeatHMM-DB with normal repeat ranges for these STRs. Evaluation on 13 well-known repeats shows that the inferred repeat ranges closely approximate the ranges reported in the literature from population-scale studies. This database, together with a repeat expansion estimation tool such as RepeatHMM, enables genomic-scale scanning of repeat regions in newly sequenced genomes to identify disease-relevant repeat expansions. As a case study of using RepeatHMM-DB, we evaluate the CAG repeats of ATXN3 in 20 patients with spinocerebellar ataxia type 3 (SCA3) and 5 unaffected individuals, and correctly classify each individual. In summary, RepeatHMM-DB can facilitate the prioritization and identification of disease-relevant STRs from whole-genome long-read sequencing data on patients with undiagnosed diseases. RepeatHMM-DB is incorporated into RepeatHMM and is available at https://github.com/WGLab/RepeatHMM .
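
The idea of comparing a new genome against a reference range can be illustrated with a minimal sketch. It assumes, purely for illustration, that the normal range of an STR is taken as the minimum and maximum repeat count seen in reference samples; the abstract does not state how RepeatHMM-DB actually derives its ranges. The function names `infer_normal_range` and `is_candidate_expansion`, the `margin` parameter, and all repeat counts below are hypothetical and are not part of RepeatHMM.

```python
from typing import Iterable, Sequence, Tuple


def infer_normal_range(counts: Iterable[int]) -> Tuple[int, int]:
    """Return (min, max) of the repeat counts observed in reference samples.

    Assumption: a plain min/max summary; the real database may use a
    different estimator.
    """
    values = list(counts)
    if not values:
        raise ValueError("need at least one repeat count")
    return min(values), max(values)


def is_candidate_expansion(
    allele_counts: Sequence[int],
    normal_range: Tuple[int, int],
    margin: int = 0,
) -> bool:
    """Flag a sample whose longest allele exceeds the upper bound plus a margin."""
    _, upper = normal_range
    return max(allele_counts) > upper + margin


# Made-up repeat counts for one STR across reference genomes (illustrative only).
reference_counts = [14, 20, 23, 23, 27, 31]
normal = infer_normal_range(reference_counts)

print(normal)                                     # (14, 31)
print(is_candidate_expansion((23, 68), normal))   # True: one allele far above range
print(is_candidate_expansion((20, 27), normal))   # False: both alleles within range
```

A nonzero `margin` would make the flag more conservative, which may matter when per-read repeat estimates are noisy.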

Updated: 2020-12-28