Efficient Construction of a Complete Index for Pan-Genomics Read Alignment



Journal of Computational Biology (IF 1.7), Pub Date: 2020-03-16, DOI: 10.1089/cmb.2019.0309
Alan Kuhnle¹,², Taher Mun³, Christina Boucher², Travis Gagie⁴,⁵, Ben Langmead³, Giovanni Manzini⁶


Short-read aligners predominantly use the FM-index, which can easily index one or a few human genomes but does not scale well to collections of thousands of genomes. The difficulty stems from the two chief components of the index: (1) a rank data structure over the Burrows-Wheeler Transform (BWT) of the string, which lets us find the interval in the string's suffix array (SA), and (2) a sample of the SA that, when used with the rank data structure, lets us access the SA. The rank data structure can be kept small even for large genomic databases by run-length compressing the BWT, but until recently no way was known to keep the SA sample small without greatly slowing down access to the SA. Now that an SA sample taking about the same space as the run-length compressed BWT has been defined (SODA 2018), we have the design for efficient FM-indexes of genomic databases but face the problem of building them. In 2018, we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building the SA sample efficiently was left open. Here we address that problem. We compare our approach to state-of-the-art methods for constructing the SA sample and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method to index partial and whole human genomes and show that it improves over the FM-index-based Bowtie method with respect to both memory and time, and over the hybrid index-based CHIC method with respect to query time and the memory required for indexing.
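To make the two index components in the abstract concrete, here is a minimal toy sketch in Python. It is not the construction method of the paper: the class `ToyFMIndex`, the helper names, and the `sample_rate` parameter are all hypothetical, the suffix array is built naively, the rank structure is a set of uncompressed prefix-count tables, and the SA sample is the classical regular-interval sample rather than the run-length-sized sample the abstract refers to. The function `run_count` only reports how many runs the BWT has, which is the quantity a run-length compressed index would scale with.

```python
from collections import Counter


def suffix_array(text):
    # Naive construction by sorting suffixes; fine for a toy input only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    # BWT[i] is the character that precedes the i-th smallest suffix
    # (text[-1] is the sentinel when the suffix starts at position 0).
    return "".join(text[i - 1] for i in sa)


def run_count(bwt):
    # Number of maximal runs of equal characters in the BWT.
    return sum(1 for i, c in enumerate(bwt) if i == 0 or c != bwt[i - 1])


class ToyFMIndex:
    """Hypothetical teaching class, not an API from the paper."""

    def __init__(self, text, sample_rate=4):
        assert text.endswith("$"), "text must end with a unique sentinel"
        self.n = len(text)
        sa = suffix_array(text)
        self.bwt = bwt_from_sa(text, sa)
        counts = Counter(text)
        # C[c] = number of characters in the text strictly smaller than c.
        self.C, total = {}, 0
        for c in sorted(counts):
            self.C[c] = total
            total += counts[c]
        # Component (1): rank over the BWT, here as plain prefix counts.
        self.occ = {c: [0] * (self.n + 1) for c in counts}
        for i, ch in enumerate(self.bwt):
            for c in self.occ:
                self.occ[c][i + 1] = self.occ[c][i] + (ch == c)
        # Component (2): SA sample; keep SA[i] only at sampled text positions.
        self.samples = {i: p for i, p in enumerate(sa) if p % sample_rate == 0}

    def rank(self, c, i):
        # Occurrences of c in bwt[0:i].
        return self.occ[c][i]

    def lf(self, i):
        # Maps the row of suffix p to the row of suffix p - 1.
        c = self.bwt[i]
        return self.C[c] + self.rank(c, i)

    def interval(self, pattern):
        # Backward search: half-open SA interval of suffixes prefixed by pattern.
        lo, hi = 0, self.n
        for c in reversed(pattern):
            if c not in self.C:
                return (0, 0)
            lo = self.C[c] + self.rank(c, lo)
            hi = self.C[c] + self.rank(c, hi)
            if lo >= hi:
                return (0, 0)
        return (lo, hi)

    def locate(self, pattern):
        # Recover text positions by walking LF until a sampled row is reached.
        lo, hi = self.interval(pattern)
        positions = []
        for row in range(lo, hi):
            steps = 0
            while row not in self.samples:
                row = self.lf(row)
                steps += 1
            positions.append(self.samples[row] + steps)
        return sorted(positions)


if __name__ == "__main__":
    text = "GATTACA" * 3 + "$"
    index = ToyFMIndex(text, sample_rate=4)
    print("BWT:", index.bwt, "runs:", run_count(index.bwt))
    print("TTA occurs at:", index.locate("TTA"))
```

In this sketch a larger `sample_rate` shrinks the sample but lengthens the LF walk in `locate`, which is the space/time trade-off the abstract describes for the SA sample.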


Updated: 2020-03-16
