Fast de Bruijn Graph Compaction in Distributed Memory Environments, IEEE/ACM Transactions on Computational Biology and Bioinformatics



Fast de Bruijn Graph Compaction in Distributed Memory Environments.
IEEE/ACM Transactions on Computational Biology and Bioinformatics (IF 3.6), Pub Date: 2018-07-31, DOI: 10.1109/tcbb.2018.2858797
Tony Pan, Rahul Nihalani, Srinivas Aluru

De Bruijn graph-based genome assembly has gained popularity as short-read sequencers have become ubiquitous. A core assembly operation is the generation of unitigs, which are sequences corresponding to chains in the graph. Unitigs are used as building blocks for generating longer sequences in many assemblers, and can facilitate graph compression. Chain compaction, by which unitigs are generated, remains a critical computational task. In this paper, we present a distributed memory parallel algorithm for simultaneous compaction of all chains in bi-directed de Bruijn graphs. The key advantages of our algorithm include bounding the chain compaction run-time to a number of iterations logarithmic in the length of the longest chain, and the ability to differentiate cycles from chains within a number of iterations logarithmic in the length of the longest cycle. Our algorithm scales to thousands of computational cores, and can compact a whole-genome de Bruijn graph from a human sequence read set in 7.3 seconds using 7680 distributed memory cores, and in 12.9 minutes using 64 shared memory cores. It is 3.7× and 2.0× faster than equivalent steps in the state-of-the-art tools for distributed and shared memory environments, respectively. An implementation of the algorithm is available at https://github.com/ParBLiSS/bruno.
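
The logarithmic iteration bound can be illustrated with the generic pointer-doubling (list-ranking) technique. The sketch below is not the authors' implementation and does not use the bruno code base. It is a serial illustration under simplifying assumptions: the graph is directed rather than bi-directed, every node has at most one successor and at most one predecessor, and the names `compact_chains`, `succ`, `ptr`, `dist`, and `done` are hypothetical. Cycle detection here is bounded by the total node count, which is a cruder bound than the one stated in the abstract.

```python
def compact_chains(succ):
    """Pointer doubling over a successor map (illustrative sketch only).

    succ maps each node to its successor, or to None at a chain end.
    Assumes every node has at most one predecessor, so the graph is a
    disjoint union of simple chains and simple cycles.
    Returns (ptr, dist, rounds, cyclic).
    """
    # ptr[v]: farthest node currently known to be reachable from v.
    # dist[v]: number of hops from v to ptr[v].
    # done[v]: True once ptr[v] is confirmed to be the end of v's chain.
    ptr = {v: (s if s is not None else v) for v, s in succ.items()}
    dist = {v: (0 if s is None else 1) for v, s in succ.items()}
    done = {v: s is None for v, s in succ.items()}

    rounds = 0
    reach = 1  # every unfinished node has jumped exactly `reach` hops
    while not all(done.values()) and reach < len(succ):
        # Each round is one synchronous step: all updates read the old
        # state, as they would in a data-parallel implementation.
        new_ptr, new_dist, new_done = dict(ptr), dict(dist), dict(done)
        for v in succ:
            if not done[v]:
                w = ptr[v]
                new_ptr[v] = ptr[w]              # jump to w's target
                new_dist[v] = dist[v] + dist[w]  # accumulate hop count
                new_done[v] = done[w]            # finished if w was finished
        ptr, dist, done = new_ptr, new_dist, new_done
        rounds += 1
        reach *= 2

    # A node that jumped at least len(succ) hops without meeting a chain
    # end must be revisiting nodes, so it lies on a cycle.
    cyclic = {v for v in succ if not done[v]}
    return ptr, dist, rounds, cyclic


if __name__ == "__main__":
    # One chain a -> b -> c -> d -> e and one cycle x -> y -> z -> x.
    example = {"a": "b", "b": "c", "c": "d", "d": "e", "e": None,
               "x": "y", "y": "z", "z": "x"}
    end, hops, rounds, cyclic = compact_chains(example)
    print(end["a"], hops["a"], rounds, sorted(cyclic))
```

Because the jump distance doubles every round, a node that is d hops from its chain end is resolved after about log2(d) rounds. The resulting `dist` values give each node's position within its chain, which is the information needed to concatenate k-mers into a unitig.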


Updated: 2020-03-07
