MetaCache-GPU: Ultra-Fast Metagenomic Classification

arXiv - CS - Distributed, Parallel, and Cluster Computing. Pub Date: 2021-06-14, DOI: arxiv-2106.08150
Robin Kobus (Johannes Gutenberg University Mainz, Germany), André Müller (Johannes Gutenberg University Mainz, Germany), Daniel Jünger (Johannes Gutenberg University Mainz, Germany), Christian Hundt (NVIDIA AI Technology Center Luxembourg), Bertil Schmidt (Johannes Gutenberg University Mainz, Germany)

The cost of DNA sequencing has dropped exponentially over the past decade, making genomic data accessible to a growing number of scientists. In bioinformatics, localization of short DNA sequences (reads) within large genomic sequences is commonly facilitated by constructing index data structures that allow efficient querying of substrings. Recent metagenomic classification pipelines annotate reads with taxonomic labels by analyzing their $k$-mer histograms with respect to a reference genome database. CPU-based index construction is often performed in a preprocessing phase because irregular data structures such as hash maps are relatively expensive to build. However, the rapidly growing number of available reference genomes creates a need for index construction and querying at interactive speeds. In this paper, we introduce MetaCache-GPU, an ultra-fast metagenomic short read classifier specifically tailored to the characteristics of CUDA-enabled accelerators. Our approach employs a novel hash table variant featuring efficient minhash fingerprinting of reads for locality-sensitive hashing and their rapid insertion using warp-aggregated operations. Our performance evaluation shows that MetaCache-GPU is able to build large reference databases in a matter of seconds, enabling instantaneous operability, while popular CPU-based tools such as Kraken2 require over an hour for index construction on the same data. In the context of an ever-growing number of reference genomes, MetaCache-GPU is the first metagenomic classifier that makes analysis pipelines with on-demand composition of large-scale reference genome sets practical. The source code is publicly available at https://github.com/muellan/metacache.
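
To make the idea of minhash fingerprinting concrete, the following is a minimal CPU-side sketch in Python, not the MetaCache-GPU implementation. It assumes a simple bottom-s minhash: each window of a sequence is reduced to the s smallest hash values of its canonical k-mers, those values are stored in a hash table that maps them to reference names, and a read is assigned to the reference that collects the most matching values. The function names, the toy parameters (`k=4`, `s=3`, `window_len=12`), the use of BLAKE2b as the hash function, and the majority-vote rule are all illustrative assumptions. The GPU-specific parts described in the abstract (the hash table variant and warp-aggregated insertion) are not modeled here.

```python
import hashlib
from collections import Counter, defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one strand-independent representative of a k-mer."""
    return min(kmer, reverse_complement(kmer))


def kmer_hash(kmer):
    """Hash a k-mer to a 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(window, k=4, s=3):
    """Bottom-s minhash: the s smallest distinct canonical k-mer hashes."""
    hashes = {
        kmer_hash(canonical(window[i:i + k]))
        for i in range(len(window) - k + 1)
    }
    return sorted(hashes)[:s]


def build_index(references, window_len=12, k=4, s=3):
    """Map each sketch value to the set of references that contain it."""
    index = defaultdict(set)
    stride = window_len - k + 1  # consecutive windows overlap by k - 1 bases
    for name, seq in references.items():
        for start in range(0, max(len(seq) - k + 1, 1), stride):
            window = seq[start:start + window_len]
            for feature in minhash_sketch(window, k, s):
                index[feature].add(name)
    return index


def classify(read, index, k=4, s=3):
    """Return the reference with the most sketch hits, or None if no hits."""
    votes = Counter()
    for feature in minhash_sketch(read, k, s):
        for name in index.get(feature, ()):
            votes[name] += 1
    return votes.most_common(1)[0][0] if votes else None


if __name__ == "__main__":
    refs = {"polyA": "A" * 16, "polyC": "C" * 16}  # toy reference set
    idx = build_index(refs)
    print(classify("AAAAAAAA", idx))
    print(classify("GGGGGGGG", idx))  # matches polyC via reverse complement
```

Because only a few hash values per window are stored rather than every k-mer, the index stays small, and because k-mers are canonicalized, a read matches its reference regardless of which strand it was sequenced from.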


Updated: 2021-06-16
