Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex Bloom filters.
Proceedings of the National Academy of Sciences of the United States of America (IF 11.1), Pub Date: 2020-07-21, DOI: 10.1073/pnas.1903436117
Justin Chu 1, Hamid Mohamadi 2, Emre Erhan 2, Jeffery Tse 2, Readman Chiu 2, Sarah Yeo 2, Inanc Birol 1, 3

Alignment-free classification tools have enabled high-throughput processing of sequencing data in many bioinformatics analysis pipelines, primarily due to their computational efficiency. Originally k-mer based, such tools often lack sensitivity when faced with sequencing errors and polymorphisms. In response, some tools have been augmented with spaced seeds, which are capable of tolerating mismatches. However, spaced seeds have seen little practical use in classification because they bring increased computational and memory costs compared to methods that use k-mers. These limitations have also constrained the design and length of practical spaced seeds, since storing spaced seeds can be costly. To address these challenges, we have designed a probabilistic data structure called a multiindex Bloom filter (miBF), which can store multiple spaced seed sequences with a low memory cost that remains static regardless of seed length or seed design. We formalize how to minimize the false-positive rate of miBFs when classifying sequences from multiple targets or references. miBF is available within BioBloom Tools, and we illustrate its utility in two use cases: read binning for targeted assembly and taxonomic read assignment. In our benchmarks, an analysis pipeline based on miBF shows higher sensitivity and specificity for read binning than sequence alignment-based methods, while also executing in less time. Similarly, for taxonomic classification, miBF enables higher sensitivity than a conventional spaced seed-based approach, while using half the memory and an order of magnitude less computational time.
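
To make the idea of mismatch-tolerant spaced seeds concrete, the following is a minimal Python sketch. It is not the miBF implementation from BioBloom Tools: it uses an ordinary Bloom filter, and the seed patterns in `SEEDS`, the helper names (`apply_seed`, `seed_keys`, `build_filter`), and the filter parameters are all hypothetical choices made for illustration. It shows two points from the abstract: a spaced seed ignores its don't-care positions, so a read with a substitution there still matches, and a Bloom filter stores only hash-derived bits, so its size is fixed at construction regardless of seed length or design.

```python
import hashlib

# Hypothetical seed patterns of equal span: '1' = base must match, '0' = don't care.
SEEDS = ["11011011", "10111101"]


def apply_seed(window, seed):
    """Keep bases at '1' positions and mask don't-care positions with '*'."""
    return "".join(base if flag == "1" else "*" for base, flag in zip(window, seed))


def seed_keys(sequence, seeds=SEEDS):
    """Yield one key per (window, seed) pair, tagged with the seed's index."""
    span = len(seeds[0])
    for start in range(len(sequence) - span + 1):
        window = sequence[start:start + span]
        for index, seed in enumerate(seeds):
            yield f"{index}:{apply_seed(window, seed)}"


class BloomFilter:
    """Plain Bloom filter (an illustrative stand-in, not a multiindex Bloom filter)."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)  # size is fixed here

    def _positions(self, key):
        # Derive n_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, key):
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(key))


def build_filter(reference, n_bits=4096, n_hashes=3):
    """Insert every spaced-seed key of the reference into a new filter."""
    bloom = BloomFilter(n_bits, n_hashes)
    for key in seed_keys(reference):
        bloom.add(key)
    return bloom


reference = "ACGTACGT"
read = "ACCTACGT"  # one substitution at position 2 (0-based)

bloom = build_filter(reference)
hits = [key for key in seed_keys(read) if key in bloom]

# The exact 8-mers differ, but the first seed masks position 2,
# so both sequences produce the same masked key.
print(read == reference)
print(apply_seed(read, SEEDS[0]) == apply_seed(reference, SEEDS[0]))
print(hits)
```

In this sketch the read fails an exact 8-mer comparison, yet the first seed still produces a hit because the substitution falls on one of its don't-care positions. The second seed does not mask that position, which is one reason to use several seeds with complementary designs. The filter's byte array has the same length whether the seeds span 8 bases or 80.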




Updated: 2020-07-22