Testing the advantages and disadvantages of short- and long-read eukaryotic metagenomics using simulated reads

BMC Bioinformatics (IF 2.9), Pub Date: 2020-05-29, DOI: 10.1186/s12859-020-3528-4
William S Pearman, Nikki E Freed, Olin K Silander

The first step in understanding the diversity and dynamics of an ecological community is quantifying its membership, and metagenomics is an increasingly common way to do so. Because the approach has grown rapidly in popularity, a large number of computational tools and pipelines are now available for analysing metagenomic data. However, most of these tools have been designed and benchmarked using highly accurate short-read data (i.e. Illumina), and few studies have benchmarked classification accuracy for long, error-prone reads (PacBio or Oxford Nanopore). In addition, few tools have been benchmarked for non-microbial communities. Here we compare simulated long reads from Oxford Nanopore and Pacific Biosciences (PacBio) with high-accuracy Illumina read sets to systematically investigate the effects of sequence length and taxon type on classification accuracy for metagenomic data from both microbial and non-microbial communities. We show that, in general, classification accuracy is far lower for non-microbial communities, even at low taxonomic resolution (e.g. family rather than genus). We then show that, for two popular taxonomic classifiers, long reads can significantly increase classification accuracy, and that this increase is most pronounced for non-microbial communities. This work provides insight into the expected accuracy of metagenomic analyses for different taxonomic groups and establishes the point at which read length becomes more important than error rate for assigning the correct taxon.
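To make the idea of classification accuracy at different taxonomic resolutions concrete, here is a minimal sketch of how it could be scored for simulated reads, where the true source taxon of every read is known. This is not the authors' pipeline: the `Lineage` record, the `accuracy_at_rank` helper, the toy read labels, and the choice to count unassigned reads as incorrect are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical lineage record; the paper's own data structures are not described here.
Lineage = namedtuple("Lineage", ["family", "genus"])


def accuracy_at_rank(truth, predicted, rank):
    """Return the fraction of reads whose predicted taxon matches the truth at `rank`.

    Assumption: reads the classifier left unassigned (missing from `predicted`)
    count as incorrect. Other definitions of accuracy are possible.
    """
    if not truth:
        raise ValueError("no reads to score")
    correct = 0
    for read_id, true_lineage in truth.items():
        pred = predicted.get(read_id)
        if pred is not None and getattr(pred, rank) == getattr(true_lineage, rank):
            correct += 1
    return correct / len(truth)


# Toy, made-up labels for four simulated reads (illustrative only).
truth = {
    "read1": Lineage("Hominidae", "Homo"),
    "read2": Lineage("Hominidae", "Pan"),
    "read3": Lineage("Enterobacteriaceae", "Escherichia"),
    "read4": Lineage("Enterobacteriaceae", "Salmonella"),
}
predicted = {
    "read1": Lineage("Hominidae", "Homo"),
    "read2": Lineage("Hominidae", "Homo"),  # right family, wrong genus
    "read3": Lineage("Enterobacteriaceae", "Escherichia"),
    # read4 is left unassigned by the hypothetical classifier
}

family_accuracy = accuracy_at_rank(truth, predicted, "family")
genus_accuracy = accuracy_at_rank(truth, predicted, "genus")
print(f"family: {family_accuracy:.2f}, genus: {genus_accuracy:.2f}")
```

In this toy case the family-level score is higher than the genus-level score because one read is placed in the right family but the wrong genus, which is the kind of difference the abstract refers to when it contrasts family-level and genus-level resolution.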


Updated: 2020-05-29
