Natural family-free genomic distance
Algorithms for Molecular Biology (IF 1), Pub Date: 2021-05-10, DOI: 10.1186/s13015-021-00183-8
Diego P. Rubert, Fábio V. Martinez, Marília D. V. Braga

A classical problem in comparative genomics is to compute the rearrangement distance, that is, the minimum number of large-scale rearrangements required to transform a given genome into another given genome. The traditional approaches in this area are family-based, i.e., require the classification of DNA fragments of both genomes into families. Furthermore, the most elementary family-based models, which are able to compute distances in polynomial time, restrict the families to occur at most once in each genome. In contrast, the distance computation in models that allow multifamilies (i.e., families with multiple occurrences) is NP-hard. Very recently, Bohnenkämper et al. (J Comput Biol 28:410–431, 2021) proposed an ILP formulation for computing the genomic distance of genomes with multifamilies, allowing structural rearrangements, represented by the generic double cut and join (DCJ) operation, and content-modifying insertions and deletions of DNA segments. This ILP is very efficient, but must maximize a matching of the genes in each multifamily, in order to prevent the free lunch artifact that would otherwise let empty or almost empty matchings give smaller distances. In this paper, we adopt the alternative family-free setting that, instead of family classification, simply uses the pairwise similarities between DNA fragments of both genomes to compute their rearrangement distance. We adapted the ILP mentioned above and developed a model in which pairwise similarities are used to assign weights to both matched and unmatched genes, so that an optimal solution does not necessarily maximize the matching. Our model then results in a natural family-free genomic distance, which takes into consideration all given genes, without prior classification into families, and has a search space composed of matchings of any size. In spite of its bigger search space, our ILP seems to be boosted by a reduction of the number of co-optimal solutions due to the weights. Indeed, it converged faster than the original one by Bohnenkämper et al. for instances with the same number of multiple connections. We can handle not only bacterial genomes, but also fungi and insects, or sets of chromosomes of mammals and plants. In a comparison study of six fruit fly genomes, we obtained accurate results.
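
To make the polynomial-time baseline mentioned in the abstract concrete, here is a minimal sketch of the DCJ distance in the most elementary family-based setting, where every family occurs exactly once in each genome. It is not the ILP of the paper: it assumes circular chromosomes only and no insertions or deletions, in which case the distance is the number of genes minus the number of cycles in the adjacency graph. The encoding (signed integers for oriented genes, lists for chromosomes) and the function names are choices made for this illustration.

```python
def adjacencies(genome):
    """Map each gene extremity to its neighbour in a genome of circular chromosomes.

    A genome is a list of chromosomes; a chromosome is a list of signed ints.
    Gene g has a tail (g, 't') and a head (g, 'h'); -g is read head-to-tail.
    """
    partner = {}
    gene_count = 0
    for chrom in genome:
        for i, gene in enumerate(chrom):
            gene_count += 1
            nxt = chrom[(i + 1) % len(chrom)]  # wrap around: circular chromosome
            right = (abs(gene), 'h' if gene > 0 else 't')
            left = (abs(nxt), 't' if nxt > 0 else 'h')
            partner[right] = left
            partner[left] = right
    if len(partner) != 2 * gene_count:
        raise ValueError("each gene must occur exactly once per genome")
    return partner


def dcj_distance(genome_a, genome_b):
    """DCJ distance for circular genomes over the same set of unique genes."""
    pa, pb = adjacencies(genome_a), adjacencies(genome_b)
    if set(pa) != set(pb):
        raise ValueError("genomes must contain the same genes (no indels here)")
    cycles, seen = 0, set()
    for start in pa:
        if start in seen:
            continue
        cycles += 1
        x = start
        while x not in seen:
            # Alternate: follow the adjacency in A, then the adjacency in B.
            seen.add(x)
            y = pa[x]
            seen.add(y)
            x = pb[y]
    genes = len(pa) // 2
    return genes - cycles


if __name__ == "__main__":
    print(dcj_distance([[1, -2, 3]], [[1, 2, 3]]))   # one inversion
    print(dcj_distance([[1, 2], [3]], [[1, 2, 3]]))  # one fusion
```

Once families may occur several times, a matching between gene copies has to be chosen before this formula applies, and that choice is what makes the problem NP-hard and motivates the ILP formulations discussed in the abstract.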

Updated: 2021-05-11