HapSolo: an optimization approach for removing secondary haplotigs during diploid genome assembly and scaffolding
BMC Bioinformatics (IF 3) Pub Date: 2021-01-06, DOI: 10.1186/s12859-020-03939-y
Edwin A Solares, Yuan Tao, Anthony D Long, Brandon S Gaut

Despite marked recent improvements in long-read sequencing technology, the assembly of diploid genomes remains a difficult task. A major obstacle is distinguishing between alternative contigs that represent highly heterozygous regions. If primary and secondary contigs are not properly identified, the primary assembly will overrepresent both the size and complexity of the genome, which complicates downstream analyses such as scaffolding. Here we illustrate a new method, which we call HapSolo, that identifies secondary contigs and defines a primary assembly based on multiple pairwise contig alignment metrics. HapSolo evaluates candidate primary assemblies using BUSCO scores and then distinguishes among candidate assemblies using a cost function. The cost function can be defined by the user but by default considers the number of missing, duplicated and single BUSCO genes within the assembly. HapSolo performs hill climbing to minimize cost over thousands of candidate assemblies. We illustrate the performance of HapSolo on genome data from three species: the Chardonnay grape (Vitis vinifera), with a genome of 490 Mb, a mosquito (Anopheles funestus; 200 Mb) and the thorny skate (Amblyraja radiata; 2650 Mb). HapSolo rapidly identified candidate assemblies that yield improvements in assembly metrics, including decreased genome size and improved N50 scores. Contig N50 scores improved by 35%, 9% and 9% for Chardonnay, mosquito and the thorny skate, respectively, relative to unreduced primary assemblies. The benefits of HapSolo were amplified by downstream analyses, which we illustrated by scaffolding with Hi-C data. We found, for example, that prior to the application of HapSolo, only 52% of the Chardonnay genome was captured in the largest 19 scaffolds, corresponding to the number of chromosomes. After the application of HapSolo, this value increased to ~84%. The improvements for the mosquito's largest three scaffolds, representing the number of chromosomes, were from 61 to 86%, and the improvement was even more pronounced for the thorny skate. We compared the scaffolding results to assemblies that were based on PurgeDups for identifying secondary contigs, with generally superior results for HapSolo.
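The search strategy described in the abstract (score each candidate assembly by a BUSCO-based cost, then hill climb toward lower cost) can be sketched roughly as follows. This is a minimal illustration, not HapSolo's actual code: the cost formula, the `BuscoCounts` type, the threshold representation and the `toy_evaluate` stand-in are all assumptions made for the example. A real run would purge contigs according to the thresholds and run BUSCO on the result instead of calling `toy_evaluate`.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BuscoCounts:
    """Hypothetical container for the three BUSCO categories named in the abstract."""
    single: int
    duplicated: int
    missing: int


def cost(counts: BuscoCounts) -> float:
    """Assumed cost: penalize missing and duplicated genes relative to single-copy ones."""
    if counts.single == 0:
        return float("inf")
    return (counts.missing + counts.duplicated) / counts.single


def toy_evaluate(thresholds):
    """Stand-in for 'build a candidate assembly, then run BUSCO on it'.

    Invented behavior: thresholds near 1 keep more contigs (more duplicated
    genes); thresholds near 0 purge aggressively (more missing genes).
    """
    mean_t = sum(thresholds) / len(thresholds)
    duplicated = round(200 * mean_t)
    missing = round(200 * (1.0 - mean_t) ** 2)
    single = 1000 - duplicated - missing
    return BuscoCounts(single, duplicated, missing)


def hill_climb(evaluate, start, step=0.05, iterations=1000, seed=0):
    """Randomly perturb the thresholds and keep a candidate only if its cost is lower."""
    rng = random.Random(seed)
    best = tuple(start)
    best_cost = cost(evaluate(best))
    for _ in range(iterations):
        # Propose a nearby threshold set, clamped to the interval [0, 1].
        candidate = tuple(
            min(1.0, max(0.0, t + rng.uniform(-step, step))) for t in best
        )
        candidate_cost = cost(evaluate(candidate))
        if candidate_cost < best_cost:
            best, best_cost = candidate, candidate_cost
    return best, best_cost


start = (1.0, 1.0)
start_cost = cost(toy_evaluate(start))
best, best_cost = hill_climb(toy_evaluate, start)
print(f"start cost {start_cost:.3f}, best cost {best_cost:.3f} at thresholds {best}")
```

Because only strictly better candidates are accepted, this search can stall in a local minimum; restarting from several random starting points is a common remedy.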

Updated: 2021-01-07