Construction of habitat-specific training sets to achieve species-level assignment in 16S rRNA gene datasets.
Microbiome (IF 15.5), Pub Date: 2020-05-15, DOI: 10.1186/s40168-020-00841-w
Isabel F Escapa, Yanmei Huang, Tsute Chen, Maoxuan Lin, Alexis Kokaras, Floyd E Dewhirst, Katherine P Lemon

BACKGROUND The low cost of 16S rRNA gene sequencing facilitates population-scale molecular epidemiological studies. Existing computational algorithms can resolve 16S rRNA gene sequences into high-resolution amplicon sequence variants (ASVs), which represent consistent labels comparable across studies. Assigning these ASVs to species-level taxonomy strengthens the ecological and/or clinical relevance of 16S rRNA gene-based microbiota studies and further facilitates data comparison across studies.

RESULTS To achieve this, we developed a broadly applicable method for constructing high-resolution training sets based on the phylogenetic relationships among microbes found in a habitat of interest. When used with the naïve Bayesian Ribosomal Database Project (RDP) Classifier, this training set achieved species/supraspecies-level taxonomic assignment of 16S rRNA gene-derived ASVs. The key steps for generating such a training set are (1) constructing an accurate and comprehensive phylogeny-based, habitat-specific database; (2) compiling multiple 16S rRNA gene sequences to represent the natural sequence variability of each taxon in the database; (3) trimming the training set to match the sequenced regions, if necessary; and (4) placing species sharing closely related sequences into a training-set-specific supraspecies taxonomic level to preserve subgenus-level resolution. As proof of principle, we developed a V1-V3 region training set for the bacterial microbiota of the human aerodigestive tract using the full-length 16S rRNA gene reference sequences compiled in our expanded Human Oral Microbiome Database (eHOMD). We also overcame technical limitations to successfully use Illumina sequences for the 16S rRNA gene V1-V3 region, the most informative segment for classifying bacteria native to the human aerodigestive tract. Finally, we generated a full-length eHOMD 16S rRNA gene training set, which we used in conjunction with an independent PacBio single-molecule, real-time (SMRT)-sequenced sinonasal dataset to validate the representation of species in our training set. This also established the effectiveness of a full-length training set for assigning taxonomy to long-read 16S rRNA gene datasets.

CONCLUSION Here, we present a systematic approach for constructing a phylogeny-based, high-resolution, habitat-specific training set that permits species/supraspecies-level taxonomic assignment to short- and long-read 16S rRNA gene-derived ASVs. This advancement enhances the ecological and/or clinical relevance of 16S rRNA gene-based microbiota studies.
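
Steps (3) and (4) of the abstract can be illustrated with a small sketch. It is not the authors' pipeline: the abstract does not say how trimming or supraspecies grouping was implemented, so the code below assumes exact primer matching and treats "closely related" as "identical after trimming". The function names, primer strings, and toy sequences are all hypothetical placeholders.

```python
from collections import defaultdict


def trim_to_region(seq, fwd_primer, rev_primer_rc):
    """Return the segment between two primer sites, or None if one is missing.

    Assumption: primers match exactly (no degenerate bases or mismatches).
    """
    start = seq.find(fwd_primer)
    if start == -1:
        return None
    start += len(fwd_primer)
    end = seq.find(rev_primer_rc, start)
    if end == -1:
        return None
    return seq[start:end]


def build_training_set(references, fwd_primer, rev_primer_rc):
    """Trim references, then merge species that share a trimmed sequence.

    references: list of (genus, species, full_length_sequence) tuples.
    Returns rows of (genus, supraspecies_label, species, trimmed_sequence).
    Assumption: sharing means exact identity within the same genus.
    """
    trimmed = []
    for genus, species, seq in references:
        region = trim_to_region(seq, fwd_primer, rev_primer_rc)
        if region is not None:
            trimmed.append((genus, species, region))

    # Union-find over (genus, species) so chains of shared sequences merge.
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    first_carrier = {}
    for genus, species, region in trimmed:
        node = (genus, species)
        find(node)
        key = (genus, region)
        if key in first_carrier:
            parent[find(node)] = find(first_carrier[key])
        else:
            first_carrier[key] = node

    # Collect the species names that ended up in each merged group.
    members = defaultdict(set)
    for genus, species, _ in trimmed:
        members[find((genus, species))].add(species)

    rows = []
    for genus, species, region in trimmed:
        label = "_".join(sorted(members[find((genus, species))]))
        rows.append((genus, label, species, region))
    return rows


# Toy data only: placeholder primers and made-up sequences.
FWD = "AGAGTTTG"
REV_RC = "GCTGCTGG"
references = [
    ("Streptococcus", "mitis", "GG" + FWD + "ACGTACGT" + REV_RC + "TT"),
    ("Streptococcus", "oralis", "CC" + FWD + "ACGTACGT" + REV_RC + "AA"),
    ("Streptococcus", "oralis", FWD + "ACGTTCGT" + REV_RC),
    ("Streptococcus", "salivarius", FWD + "TTTTGGGG" + REV_RC),
    ("Rothia", "mucilaginosa", "ACGTACGT"),  # no primer sites, so dropped
]
rows = build_training_set(references, FWD, REV_RC)
for row in rows:
    print("\t".join(row))
```

In this toy input, the two species that share a trimmed sequence receive a combined label at the supraspecies rank, while a species with a unique sequence keeps its own name. A realistic version would need degenerate-primer handling and a similarity threshold or phylogenetic criterion in place of exact identity.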

Updated: 2020-05-15