Compact and evenly distributed k-mer binning for genomic sequences,bioRxiv - Bioinformatics

bioRxiv - Bioinformatics. Pub Date: 2020-11-03, DOI: 10.1101/2020.10.12.335364
Johan Nyström-Persson, Gabriel Keeble-Gagnère, Niamat Zawad

The processing of k-mers (subsequences of length k) is at the foundation of many sequence processing algorithms in bioinformatics, including k-mer counting for genome size estimation, genome assembly, and taxonomic classification for metagenomics. Minimizers (ordered m-mers where m < k) are often used to group k-mers into bins as a first step in such processing. However, minimizers are known to generate bins of very different sizes, which can pose challenges for distributed and parallel processing and generally increases memory requirements. Furthermore, although various minimizer orderings have been proposed, their practical value for improving tool efficiency has not yet been fully explored. Here we present Discount, a distributed k-mer counting tool based on Apache Spark, which we use to investigate the behaviour of various minimizer orderings in practice when applied to metagenomics data. Using this tool, we then introduce the universal frequency ordering, a new combination of frequency-counted minimizers and universal k-mer hitting sets, which yields both evenly distributed binning and small bin sizes. We show that this ordering allows Discount to perform distributed k-mer counting on a large dataset in as little as 1/8 of the memory of comparable approaches, making it the most efficient out-of-core distributed k-mer counting method available.
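
To make the binning step concrete, the following is a minimal Python sketch of minimizer-based k-mer binning with a pluggable m-mer ordering. It is not Discount's implementation (Discount runs on Apache Spark), and it does not implement the universal frequency ordering or universal hitting sets. The function names (`kmers`, `minimizer`, `bin_kmers`, `frequency_rank`), the toy sequence, and the two orderings shown (plain lexicographic, and rarest-m-mer-first as a simple stand-in for a frequency-based ordering) are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return all overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizer(kmer, m, rank):
    """Return the m-mer inside `kmer` with the smallest rank.

    `rank` is any function mapping an m-mer to a sortable key;
    it plays the role of the minimizer ordering.
    """
    return min(kmers(kmer, m), key=rank)


def bin_kmers(seq, k, m, rank):
    """Group every k-mer of `seq` into the bin named by its minimizer."""
    bins = defaultdict(list)
    for km in kmers(seq, k):
        bins[minimizer(km, m, rank)].append(km)
    return dict(bins)


def frequency_rank(seq, m):
    """Illustrative frequency ordering: rarer m-mers rank first.

    Ties are broken lexicographically so the result is deterministic.
    """
    counts = Counter(kmers(seq, m))
    return lambda mmer: (counts[mmer], mmer)


# Toy input (hypothetical); real inputs would be sequencing reads.
seq = "ACGTACGTTGCA"

lex_bins = bin_kmers(seq, k=5, m=2, rank=lambda mmer: mmer)
freq_bins = bin_kmers(seq, k=5, m=2, rank=frequency_rank(seq, 2))

# Compare how evenly each ordering spreads the k-mers across bins.
print({name: len(members) for name, members in lex_bins.items()})
print({name: len(members) for name, members in freq_bins.items()})
```

Even on this toy sequence the lexicographic ordering puts five of the eight 5-mers into the single bin `AC`, which illustrates the skew in bin sizes that the abstract describes. Swapping the `rank` function is the only change needed to try a different ordering.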

Updated: 2020-11-04
