Natural family-free genomic distance, Algorithms for Molecular Biology



Natural family-free genomic distance
Algorithms for Molecular Biology (IF 1.5), Pub Date: 2021-05-10, DOI: 10.1186/s13015-021-00183-8
Diego P Rubert, Fábio V Martinez, Marília D V Braga


A classical problem in comparative genomics is to compute the rearrangement distance, that is, the minimum number of large-scale rearrangements required to transform one given genome into another. The traditional approaches in this area are family-based, i.e., they require the classification of DNA fragments of both genomes into families. Furthermore, the most elementary family-based models, which are able to compute distances in polynomial time, restrict each family to occur at most once in each genome. In contrast, distance computation in models that allow multifamilies (i.e., families with multiple occurrences) is NP-hard. Very recently, Bohnenkämper et al. (J Comput Biol 28:410–431, 2021) proposed an integer linear programming (ILP) formulation for computing the genomic distance of genomes with multifamilies, allowing structural rearrangements, represented by the generic double cut and join (DCJ) operation, as well as content-modifying insertions and deletions of DNA segments. This ILP is very efficient, but it must maximize a matching of the genes in each multifamily in order to prevent the free lunch artifact, which would otherwise let empty or almost empty matchings give smaller distances. In this paper, we adopt the alternative family-free setting, which, instead of a family classification, simply uses the pairwise similarities between DNA fragments of both genomes to compute their rearrangement distance. We adapted the ILP mentioned above and developed a model in which pairwise similarities are used to assign weights to both matched and unmatched genes, so that an optimal solution does not necessarily maximize the matching. Our model then results in a natural family-free genomic distance that takes into consideration all given genes, without prior classification into families, and has a search space composed of matchings of any size. In spite of its bigger search space, our ILP seems to be boosted by a reduction of the number of co-optimal solutions due to the weights. Indeed, it converged faster than the original one by Bohnenkämper et al. for instances with the same number of multiple connections. We can handle not only bacterial genomes, but also fungi and insects, or sets of chromosomes of mammals and plants. In a comparison study of six fruit fly genomes, we obtained accurate results.
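
To make the polynomial-time base case mentioned in the abstract concrete, the following is a minimal sketch of the classical DCJ distance for the most elementary family-based setting. It is not the ILP of the paper. It assumes (these are simplifications chosen for illustration) that both genomes consist of a single circular chromosome, that every family occurs exactly once in each genome, and that genes are given as signed integers. Under these assumptions the distance is d = n - c, where n is the number of genes and c is the number of cycles formed by the adjacencies of the two genomes. The function names are hypothetical.

```python
def extremities(gene):
    """Return the (left, right) extremities of a signed gene as read in the genome."""
    g = abs(gene)
    tail, head = (g, "t"), (g, "h")
    # A gene in forward orientation is read tail-to-head, a reversed one head-to-tail.
    return (tail, head) if gene > 0 else (head, tail)


def adjacencies(chromosome):
    """Map each gene extremity to its neighbouring extremity in a circular chromosome."""
    partner = {}
    n = len(chromosome)
    for i in range(n):
        _, right = extremities(chromosome[i])
        left, _ = extremities(chromosome[(i + 1) % n])  # wrap around: circular
        partner[right] = left
        partner[left] = right
    return partner


def dcj_distance(genome_a, genome_b):
    """DCJ distance n - c for two circular genomes over the same unique genes (assumed)."""
    genes_a = sorted(abs(g) for g in genome_a)
    genes_b = sorted(abs(g) for g in genome_b)
    if genes_a != genes_b or len(set(genes_a)) != len(genes_a):
        raise ValueError("genomes must contain the same genes, each exactly once")

    adj_a = adjacencies(genome_a)
    adj_b = adjacencies(genome_b)

    # Every extremity has one adjacency in A and one in B, so alternating
    # between the two maps walks around exactly one cycle at a time.
    visited = set()
    cycles = 0
    for start in adj_a:
        if start in visited:
            continue
        cycles += 1
        x = start
        while True:
            visited.add(x)
            y = adj_a[x]      # follow the adjacency of genome A
            visited.add(y)
            x = adj_b[y]      # then the adjacency of genome B
            if x == start:
                break
    return len(genome_a) - cycles


if __name__ == "__main__":
    print(dcj_distance([1, 2, 3], [1, -2, 3]))  # a single inversion separates these
```

Once families may occur several times, the adjacencies of the two genomes no longer pair up uniquely, and a matching between gene occurrences has to be chosen as part of the optimization, which is where the ILP formulations discussed in the abstract come in.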


Updated: 2021-05-11
