Indexing Highly Repetitive String Collections, Part I, ACM Computing Surveys

Indexing Highly Repetitive String Collections, Part I
ACM Computing Surveys (IF 16.6), Pub Date: 2021-03-06, DOI: 10.1145/3434399
Gonzalo Navarro

Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outperforms Moore’s Law and challenges our ability to handle them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey, formed by two parts, we cover the algorithmic developments that have led to these data structures. In this first part, we describe the distinct compression paradigms that have been used to exploit repetitiveness, and the algorithmic techniques that provide direct access to the compressed strings. In the quest for an ideal measure of repetitiveness, we uncover a fascinating web of relations between those measures, as well as the limits up to which the data can be recovered, and up to which direct access to the compressed data can be provided. This is the basic aspect of indexability, which is covered in the second part of this survey.

Updated: 2021-03-06
