Outlier detection in BLAST hits. Algorithms for Molecular Biology


Outlier detection in BLAST hits.
Algorithms for Molecular Biology (IF 1.5), Pub Date: 2018-03-22, DOI: 10.1186/s13015-018-0126-3
Nidhi Shah₁, Stephen F Altschul₂, Mihai Pop₁


BACKGROUND: An important task in metagenomic analysis is assigning taxonomic labels to the sequences in a sample. The most widely used methods for taxonomy assignment compare each sequence in the sample to a database of known sequences, and many of them use the best BLAST hit(s) to assign the label. However, the best BLAST hit does not always correspond to the best taxonomic match. An alternative is to use phylogenetic methods, which take into account alignments and a model of evolution in order to define the taxonomic origin of sequences more accurately. Similarity-search-based methods typically run faster than phylogenetic methods and work well when the organisms in the sample are well represented in the database. In contrast, phylogenetic methods can identify new organisms in a sample but are computationally expensive.

RESULTS: We propose a two-step approach to metagenomic taxon identification: first, use a rapid method that accurately classifies sequences against a reference database (a filtering step); then, use a more complex phylogenetic method for the sequences left unclassified by the first step. In this work, we explore whether and when the top BLAST hit(s) yield a correct taxonomic label. We develop a method to detect outliers among BLAST hits in order to separate the phylogenetically most closely related matches from matches to sequences from more distantly related organisms. We used modified BILD (Bayesian Integral Log-Odds) scores, a multiple-alignment scoring function, to define the outliers within a subset of top BLAST hits and to assign taxonomic labels. We compared the accuracy of our method to that of the RDP classifier and show that our method yields fewer misclassifications while properly classifying organisms that are not present in the database. Finally, we evaluated our method as a pre-processing step before more expensive phylogenetic analyses (in our case TIPP) on real 16S rRNA datasets.

CONCLUSION: Our experiments make a good case for using a two-step approach for accurate taxonomic assignment. We show that our method can be used as a filtering step before phylogenetic methods, and that it provides a way to interpret BLAST results using more information than E-values and bit-scores alone provide.
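The two-step idea described in the abstract can be illustrated with a small Python sketch. This is not the authors' method: the abstract says outliers are defined with modified BILD scores over a multiple alignment, whereas the stand-in below simply cuts the ranked hit list at the largest drop in bit score. The `Hit` record, the function names, and the `min_gap` threshold are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    subject_id: str
    bit_score: float
    taxon: str  # label of the database sequence (hypothetical field)


def split_at_largest_gap(hits, min_gap=5.0):
    """Split hits into (kept, outliers) at the largest bit-score drop.

    Stand-in heuristic only; the paper uses modified BILD scores,
    which are not reproduced here.
    """
    ranked = sorted(hits, key=lambda h: h.bit_score, reverse=True)
    if len(ranked) < 2:
        return ranked, []
    gaps = [ranked[i].bit_score - ranked[i + 1].bit_score
            for i in range(len(ranked) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    if gaps[cut] <= min_gap:
        return ranked, []  # no clear separation, nothing flagged
    return ranked[:cut + 1], ranked[cut + 1:]


def classify_or_defer(hits, min_gap=5.0):
    """Step 1 (filter): return a label if the kept hits agree on one taxon.

    Returns None to signal that the query should be passed to step 2,
    a slower phylogenetic method (not implemented in this sketch).
    """
    kept, _ = split_at_largest_gap(hits, min_gap)
    taxa = {h.taxon for h in kept}
    return taxa.pop() if len(taxa) == 1 else None


if __name__ == "__main__":
    clear = [Hit("a", 500.0, "Escherichia"),
             Hit("b", 498.0, "Escherichia"),
             Hit("c", 420.0, "Salmonella")]
    mixed = [Hit("a", 500.0, "Escherichia"),
             Hit("b", 499.0, "Shigella"),
             Hit("c", 300.0, "Vibrio")]
    print(classify_or_defer(clear))  # label assigned by the filter step
    print(classify_or_defer(mixed))  # None: defer to phylogenetic step
```

Returning `None` instead of a best guess mirrors the filtering role described above: queries whose closest hits disagree are left for the more expensive phylogenetic analysis.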


Updated: 2019-11-01
