Gaps and complex structurally variant loci in phased genome assemblies
Genome Research (IF 6.2), Pub Date: 2023-04-01, DOI: 10.1101/gr.277334.122
David Porubsky 1, Mitchell R Vollger 1, William T Harvey 1, Allison N Rozanski 1, Peter Ebert 2, 3, Glenn Hickey 4, Patrick Hasenfeld 5, Ashley D Sanders 6, 7, 8, Catherine Stober 5, Jan O Korbel 5, 9, Benedict Paten 4, Tobias Marschall 2, 3, Evan E Eichler 10, 11

There has been tremendous progress in phased genome assembly production by combining long-read data with parental information or linked-read data. Nevertheless, a typical phased genome assembly generated by trio-hifiasm still generates more than 140 gaps. We perform a detailed analysis of gaps, assembly breaks, and misorientations from 182 haploid assemblies obtained from a diversity panel of 77 unique human samples. Although trio-based approaches using HiFi are the current gold standard, chromosome-wide phasing accuracy is comparable when using Strand-seq instead of parental data. Importantly, the majority of assembly gaps cluster near the largest and most identical repeats (including segmental duplications [35.4%], satellite DNA [22.3%], or regions enriched in GA/AT-rich DNA [27.4%]). Consequently, 1513 protein-coding genes overlap assembly gaps in at least one haplotype, and 231 are recurrently disrupted or missing from five or more haplotypes. Furthermore, we estimate that 6–7 Mbp of DNA are misorientated per haplotype irrespective of whether trio-free or trio-based approaches are used. Of these misorientations, 81% correspond to bona fide large inversion polymorphisms in the human species, most of which are flanked by large segmental duplications. We also identify large-scale alignment discontinuities consistent with 11.9 Mbp of deletions and 161.4 Mbp of insertions per haploid genome. Although 99% of this variation corresponds to satellite DNA, we identify 230 regions of euchromatic DNA with frequent expansions and contractions, nearly half of which overlap with 197 protein-coding genes. Such variable and incompletely assembled regions are important targets for future algorithmic development and pangenome representation.
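The gene-level counts in the abstract (genes overlapping a gap in at least one haplotype, and genes affected in five or more) amount to an interval-overlap tally across haplotypes. The following is a minimal sketch of that kind of tally, not the authors' pipeline: the function names, the data layout, and the use of 0-based half-open coordinates are all assumptions made for illustration.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """Half-open intervals [start, end) overlap if each starts before the other ends."""
    return a_start < b_end and b_start < a_end


def count_gap_hits(genes, gaps_by_haplotype):
    """Count, per gene, the haplotypes in which it overlaps at least one gap.

    genes: {gene_name: (chrom, start, end)}            (hypothetical layout)
    gaps_by_haplotype: {haplotype_id: [(chrom, start, end), ...]}
    Genes that never overlap a gap are absent from the result.
    """
    hits = defaultdict(int)
    for gaps in gaps_by_haplotype.values():
        for name, (chrom, start, end) in genes.items():
            if any(c == chrom and overlaps(start, end, s, e) for c, s, e in gaps):
                hits[name] += 1  # counted once per haplotype, however many gaps hit
    return dict(hits)


def summarize(hits, recurrent_threshold=5):
    """Split genes into 'affected at least once' and 'recurrently affected'."""
    affected = sorted(hits)
    recurrent = sorted(g for g, n in hits.items() if n >= recurrent_threshold)
    return affected, recurrent
```

With real data, `genes` would come from a gene annotation and `gaps_by_haplotype` from per-assembly gap calls. A linear scan like this is adequate only for small inputs; an interval tree or a sorted sweep would be the usual choice at genome scale.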

Updated: 2023-04-01