Finding and extending ancient simple sequence repeat-derived regions in the human genome.
Mobile DNA (IF 4.7), Pub Date: 2020-02-17, DOI: 10.1186/s13100-020-00206-y
Jonathan A Shortt 1 , Robert P Ruggiero 2 , Corey Cox 1 , Aaron C Wacholder 3 , David D Pollock 4

Background: About 3% of the human genome has previously been annotated as simple sequence repeats (SSRs), similar to the proportion annotated as protein coding. The origin of much of the genome is not well annotated, however, and some of the unidentified regions are likely to be ancient SSR-derived regions that current methods do not identify. Identifying these regions is complicated because SSRs appear to evolve through complex cycles of expansion and contraction, often interrupted by mutations that alter both the repeated motif and the mutation rate. We applied an empirical, kmer-based approach to identify genome regions that are likely derived from SSRs.

Results: The sequences flanking annotated SSRs are enriched for similar sequences and for SSRs with similar motifs, suggesting that the evolutionary remains of SSR activity abound in regions near obvious SSRs. Using our previously described P-clouds approach, we identified 'SSR-clouds', groups of similar kmers (or 'oligos') that are enriched near a training set of unbroken SSR loci, and then used the SSR-clouds to detect likely SSR-derived regions throughout the genome.

Conclusions: Our analysis indicates that likely SSR-derived sequence makes up 6.77% of the human genome, over twice as much as previous estimates, and includes millions of newly identified ancient SSR-derived loci. SSR-clouds identified poly-A sequences adjacent to transposable element termini in over 74% of the oldest class of Alu (roughly, AluJ), validating the sensitivity of the approach. Poly-A sequences annotated by SSR-clouds also had a length distribution that was more consistent with their poly-A origins, with a mean of about 35 bp even in older Alus. This work demonstrates that the high sensitivity provided by SSR-clouds improves the detection of SSR-derived regions and will enable deeper analysis of how decaying repeats contribute to genome structure.
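The general idea of a kmer-enrichment annotation can be illustrated with a toy sketch. This is not the authors' P-clouds or SSR-clouds implementation, whose details the abstract does not give; the function names (`kmer_counts`, `enriched_kmers`, `annotate`), the enrichment ratio, the pseudocount, and the minimum region length are all hypothetical choices made for illustration only.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def enriched_kmers(training, background, k=4, min_ratio=3.0, pseudocount=1.0):
    """Return k-mers whose relative frequency in `training` is at least
    `min_ratio` times their (pseudocount-smoothed) frequency in `background`.
    The threshold is an illustrative assumption, not a published value."""
    t = kmer_counts(training, k)
    b = kmer_counts(background, k)
    t_total = sum(t.values()) or 1
    b_total = sum(b.values()) or 1
    cloud = set()
    for kmer, n in t.items():
        t_freq = n / t_total
        b_freq = (b.get(kmer, 0) + pseudocount) / (b_total + pseudocount)
        if t_freq / b_freq >= min_ratio:
            cloud.add(kmer)
    return cloud


def annotate(seq, cloud, k=4, min_len=8):
    """Return half-open (start, end) intervals of `seq` covered by
    overlapping or adjacent cloud k-mers, at least `min_len` bases long."""
    regions = []
    start = end = None
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in cloud:
            if start is None or i > end:
                # A gap was found: close the previous run, start a new one.
                if start is not None and end - start >= min_len:
                    regions.append((start, end))
                start = i
            end = i + k
    if start is not None and end - start >= min_len:
        regions.append((start, end))
    return regions


# Toy data: sequences near an (AC)n repeat versus unrelated sequence.
training = ["ACACACACACACACAC", "ACACATACACACAC"]
background = ["GATTCGGCTAAGCTTGCAATCGGATCCGTA"]
cloud = enriched_kmers(training, background)
regions = annotate("GGTTACACACACACGGTT", cloud)
```

In this sketch, kmers that occur only once around an interruption in the training repeat fall below the enrichment threshold, so only the dominant repeat kmers enter the set used for annotation. A real analysis would need genome-scale counting and a principled way of grouping similar kmers, which this example does not attempt.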

Updated: 2020-02-17