The draft nuclear genome assembly of Eucalyptus pauciflora: a pipeline for comparing de novo assemblies.
GigaScience (IF 11.8), Pub Date: 2020-01-01, DOI: 10.1093/gigascience/giz160
Weiwen Wang 1 , Ashutosh Das 1, 2 , David Kainer 1 , Miriam Schalamun 1, 3 , Alejandro Morales-Suarez 4 , Benjamin Schwessinger 1 , Robert Lanfear 1

BACKGROUND: Eucalyptus pauciflora (the snow gum) is a long-lived tree of high economic and ecological importance. Little genomic information is currently available for E. pauciflora. Here, we assemble the E. pauciflora genome with several different methods and combine multiple existing and novel approaches to help select the best assembly.

FINDINGS: We generated high-coverage long-read (Nanopore, 174×) and short-read (Illumina, 228×) data from a single E. pauciflora individual and compared assemblies from 5 assemblers (Canu, SMARTdenovo, Flye, Marvel, and MaSuRCA) using 2 minimum read-length thresholds (1 and 35 kb). A key component of our approach is to withhold a randomly selected ∼10% of both the long and the short reads from assembly and use them as a validation set for assessing the assemblies. Using this validation set along with a range of existing tools, we compared the assemblies in 8 ways: contig N50, BUSCO scores, LAI (long terminal repeat assembly index) scores, assembly ploidy, base-level error rate, CGAL (computing genome assembly likelihoods) scores, structural variation, and genome sequence similarity. Our results showed that MaSuRCA generated the best assembly, which is 594.87 Mb in size, with a contig N50 of 3.23 Mb and an estimated error rate of ∼0.006 errors per base.

CONCLUSIONS: We report a draft genome of E. pauciflora, which will be a valuable resource for further genomic studies of eucalypts. The approaches described here for assessing and comparing genomes should help in choosing among many candidate genome assemblies built from a single dataset.
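Two ideas in the abstract can be illustrated with a minimal Python sketch: the random ∼10% validation holdout and the contig N50 statistic. This is not the authors' pipeline. The function names `split_validation` and `contig_n50`, the fixed seed, and the use of plain read identifiers in place of FASTQ records are all assumptions made for the example.

```python
import random


def split_validation(read_ids, fraction=0.10, seed=0):
    """Randomly hold out roughly `fraction` of the reads as a validation set.

    Hypothetical helper: works on read identifiers only, not on FASTQ files.
    Returns (assembly_ids, validation_ids).
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    ids = list(read_ids)
    rng.shuffle(ids)
    n_val = round(len(ids) * fraction)
    return ids[n_val:], ids[:n_val]


def contig_n50(lengths):
    """Return the contig N50 for a list of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


if __name__ == "__main__":
    assembly, validation = split_validation(f"read{i}" for i in range(1000))
    print(len(assembly), len(validation))
    print(contig_n50([2, 3, 4, 5, 6]))
```

The held-out reads never take part in assembly, so mapping them back to each candidate assembly gives an assessment that is not biased by the reads the assembler already used.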

Updated: 2020-01-02