Computing the Rearrangement Distance of Natural Genomes
Journal of Computational Biology (IF 1.4), Pub Date: 2021-04-20, DOI: 10.1089/cmb.2020.0434
Leonard Bohnenkämper, Marília D V Braga, Daniel Doerr, Jens Stoye

The computation of genomic distances has been a very active field of computational comparative genomics over the past 25 years. Notable results include the polynomial-time computability of the inversion distance by Hannenhalli and Pevzner in 1995 and the introduction of the double cut and join (DCJ) distance by Yancopoulos et al. in 2005. Both results, however, rely on the assumption that the genomes under comparison contain the same set of unique markers (syntenic genomic regions, sometimes also referred to as genes). In 2015, Shao et al. relaxed this condition by allowing duplicate markers in the analysis. This generalized version of the genomic distance problem is NP-hard, and they gave an integer linear programming (ILP) solution that is efficient enough to be applied to real-world datasets. A restriction of their approach is that it can be applied only to balanced genomes, in which every marker occurs the same number of times in both genomes. It therefore still requires a delicate preprocessing of the input data in which excess copies of unbalanced markers have to be removed. In this article, we present an algorithm solving the genomic distance problem for natural genomes, in which any marker may occur an arbitrary number of times. Our method is based on a new graph data structure, the multi-relational diagram, that allows an elegant extension of the ILP by Shao et al. to count runs of markers that are under- or over-represented in one genome with respect to the other and need to be inserted or deleted, respectively. With this extension, previous restrictions on the genome configurations are lifted, for the first time enabling an uncompromising rearrangement analysis: any marker sequence can be used directly for the distance calculation. The evaluation of our approach shows that it can be used to analyze genomes with up to a few tens of thousands of markers, which we demonstrate on simulated and real data.
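The distinction between balanced and natural genomes in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: it only counts marker copies and does not build the multi-relational diagram, solve the ILP, or compute any distance. The genome representation (a list of chromosomes, each a list of signed integers whose sign encodes orientation) and the function names `marker_counts`, `is_balanced`, and `copy_number_difference` are assumptions made for this illustration.

```python
from collections import Counter


def marker_counts(genome):
    """Count how often each marker occurs, ignoring orientation (sign).

    Assumed representation: a genome is a list of chromosomes, and each
    chromosome is a list of signed integers (hypothetical encoding).
    """
    return Counter(abs(marker) for chromosome in genome for marker in chromosome)


def is_balanced(genome_a, genome_b):
    """True if every marker has the same copy number in both genomes."""
    return marker_counts(genome_a) == marker_counts(genome_b)


def copy_number_difference(genome_a, genome_b):
    """Map each unbalanced marker to (copies in A) minus (copies in B).

    Positive values: the marker is over-represented in A.
    Negative values: the marker is under-represented in A.
    """
    counts_a = marker_counts(genome_a)
    counts_b = marker_counts(genome_b)
    return {
        marker: counts_a[marker] - counts_b[marker]
        for marker in sorted(set(counts_a) | set(counts_b))
        if counts_a[marker] != counts_b[marker]
    }


# Toy example: marker 2 is duplicated in A, marker 4 occurs only in A,
# and marker 5 occurs only in B, so the pair is "natural", not balanced.
genome_a = [[1, 2, -3, 2, 4]]
genome_b = [[1, -2, 3, 5]]

if __name__ == "__main__":
    print(is_balanced(genome_a, genome_b))
    print(copy_number_difference(genome_a, genome_b))
```

A pair of genomes for which `is_balanced` returns false is the kind of input that, according to the abstract, previously required removing excess copies before analysis and that the presented approach accepts directly.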

Last updated: 2021-04-20