Algorithms for all-pairs Hamming distance based similarity



Software: Practice and Experience (IF 2.6), Pub Date: 2021-04-19, DOI: 10.1002/spe.2978
Szymon Grabowski, Tomasz M. Kowalski


All-pairs distance computation for a collection of strings is a computation-intensive task with important applications in bioinformatics, in particular, in distance-based phylogenetic analysis techniques. Even if the computationally efficient Hamming distance is used for this purpose, the quadratic number of sequence pairs may be challenging. We propose a number of practical algorithms for efficient pairwise Hamming distance computation under a given distance threshold. The techniques are based on such concepts as pivot-based similarity search in metric spaces, pigeonhole principle for approximate string matching, cache-friendly data arrangement, bit-parallelism, and others. We experimentally show that our solutions are often about an order of magnitude faster than the average-case linear-time LCP based clusters method proposed recently, both in real and synthetic benchmarks.
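To make one of the concepts named in the abstract concrete, the following is a minimal Python sketch of pigeonhole filtering for thresholded all-pairs Hamming distance. It is an illustration only, not the authors' implementation: the function names `hamming_bounded` and `similar_pairs` are hypothetical, and it assumes all sequences have equal length. The idea is that if two strings differ in at most k positions and each is split into k + 1 pieces at the same boundaries, at least one piece must match exactly, so only pairs sharing a piece need to be verified.

```python
from collections import defaultdict
from itertools import combinations


def hamming_bounded(a: str, b: str, k: int) -> int:
    """Return the Hamming distance of a and b, stopping early at k + 1."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > k:
                break  # threshold exceeded, the exact value is not needed
    return mismatches


def similar_pairs(seqs: list[str], k: int) -> list[tuple[int, int, int]]:
    """List (i, j, distance) for all i < j with Hamming distance <= k.

    Assumes equal-length sequences (an assumption of this sketch).
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    if not seqs:
        return []
    m = len(seqs[0])
    if any(len(s) != m for s in seqs):
        raise ValueError("all sequences must have the same length")

    parts = k + 1
    # Boundaries of k + 1 nearly equal pieces covering positions 0..m.
    bounds = [m * p // parts for p in range(parts + 1)]

    # Pigeonhole filter: a pair is a candidate only if some piece is identical.
    candidates = set()
    for p in range(parts):
        lo, hi = bounds[p], bounds[p + 1]
        buckets = defaultdict(list)
        for idx, s in enumerate(seqs):
            buckets[s[lo:hi]].append(idx)
        for ids in buckets.values():
            candidates.update(combinations(ids, 2))

    # Verification: compute the bounded distance for surviving candidates.
    result = []
    for i, j in sorted(candidates):
        d = hamming_bounded(seqs[i], seqs[j], k)
        if d <= k:
            result.append((i, j, d))
    return result
```

For example, `similar_pairs(["ACGTACGT", "ACGTACGA", "TTTTTTTT", "ACGAACGA"], 1)` verifies only the two pairs that share an identical half instead of all six pairs. The filter helps most when k is small relative to the sequence length; with many short pieces, most pairs become candidates and the approach degrades toward brute force.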


Updated: 2021-06-07
