Accuracy of de novo assembly of DNA sequences from double-digest libraries varies substantially among software.
Molecular Ecology Resources (IF 7.7), Pub Date: 2019-11-25, DOI: 10.1111/1755-0998.13108
Melanie E F LaCava 1, 2 , Ellen O Aikens 1, 3 , Libby C Megna 1, 4 , Gregg Randolph 5 , Charley Hubbard 1, 6 , C Alex Buerkle 1, 6

Advances in DNA sequencing have made it feasible to gather genomic data for non-model organisms and large sets of individuals, often using methods for sequencing subsets of the genome. Several of these methods sequence DNA associated with endonuclease restriction sites (various RAD and GBS methods). For use in taxa without a reference genome, these methods rely on de novo assembly of fragments in the sequencing library. Many of the software options available for this application were originally developed for other assembly types and we do not know their accuracy for reduced representation libraries. To address this important knowledge gap, we simulated data from the Arabidopsis thaliana and Homo sapiens genomes and compared de novo assemblies by six software programs that are commonly used or promising for this purpose (ABySS, CD-HIT, Stacks, Stacks2, Velvet and VSEARCH). We simulated different mutation rates and types of mutations, and then applied the six assemblers to the simulated data sets, varying assembly parameters. We found substantial variation in software performance across simulations and parameter settings. ABySS failed to recover any true genome fragments, and Velvet and VSEARCH performed poorly for most simulations. Stacks and Stacks2 produced accurate assemblies of simulations containing SNPs, but the addition of insertion and deletion mutations decreased their performance. CD-HIT was the only assembler that consistently recovered a high proportion of true genome fragments. Here, we demonstrate the substantial difference in the accuracy of assemblies from different software programs and the importance of comparing assemblies that result from different parameter settings.
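
As a rough illustration of what a double-digest library samples from a genome, the sketch below cuts a sequence in silico with two restriction enzymes and keeps only fragments flanked by one cut site from each enzyme. It is not the authors' simulation pipeline. The enzyme pair (EcoRI and MseI), the function names, and the size-selection parameters are assumptions chosen for the example; the abstract does not name the enzymes or tools used.

```python
import re

# Assumed enzyme pair, for illustration only (not stated in the abstract).
# Each entry: recognition site and the cut offset within that site.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),  # G^AATTC
    "MseI": ("TTAA", 1),     # T^TAA
}


def cut_positions(seq, site, offset):
    """Return every cut coordinate for one recognition site (overlaps allowed)."""
    return [m.start() + offset for m in re.finditer(f"(?={site})", seq)]


def double_digest(seq, enzyme_a="EcoRI", enzyme_b="MseI", min_len=0, max_len=None):
    """Return fragments bounded by one cut from each enzyme, within a size range."""
    cuts = []
    for name in (enzyme_a, enzyme_b):
        site, offset = ENZYMES[name]
        cuts += [(pos, name) for pos in cut_positions(seq, site, offset)]
    cuts.sort()

    fragments = []
    # Walk adjacent cut sites; keep a fragment only if its two ends differ in enzyme.
    for (start, left), (end, right) in zip(cuts, cuts[1:]):
        length = end - start
        if left == right or length < min_len:
            continue
        if max_len is not None and length > max_len:
            continue
        fragments.append(seq[start:end])
    return fragments


if __name__ == "__main__":
    toy = "CCGAATTCAAAATTAACCCCGAATTCGG"  # hypothetical toy sequence
    for fragment in double_digest(toy):
        print(len(fragment), fragment)
```

A benchmark like the one described would take such fragments as the "true" set, add SNPs or indels to copies of them, and then check how many true fragments each assembler recovers.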

Updated: 2019-11-25