Algorithms for all-pairs Hamming distance based similarity
Software: Practice and Experience (IF 2.6), Pub Date: 2021-04-19, DOI: 10.1002/spe.2978
Szymon Grabowski, Tomasz M. Kowalski

All-pairs distance computation for a collection of strings is a computation-intensive task with important applications in bioinformatics, in particular in distance-based phylogenetic analysis techniques. Even if the computationally efficient Hamming distance is used for this purpose, the quadratic number of sequence pairs may be challenging. We propose a number of practical algorithms for efficient pairwise Hamming distance computation under a given distance threshold. The techniques are based on such concepts as pivot-based similarity search in metric spaces, the pigeonhole principle for approximate string matching, cache-friendly data arrangement, bit-parallelism, and others. We experimentally show that our solutions are often about an order of magnitude faster than the recently proposed average-case linear-time LCP-based clusters method, on both real and synthetic benchmarks.
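
To make the pigeonhole idea mentioned in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation, and the function names (hamming_capped, pairs_brute_force, pairs_pigeonhole) are hypothetical. It assumes all strings have the same length and relies on a standard fact: if two strings differ in at most k positions and each is cut into k + 1 blocks, at least one block must match exactly. Pairs that share no block can therefore be skipped without computing their distance.

```python
from collections import defaultdict
from itertools import combinations


def hamming_capped(a: str, b: str, k: int) -> int:
    """Hamming distance of equal-length strings, stopping once it exceeds k."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > k:
                break  # early exit: the pair is already over the threshold
    return mismatches


def pairs_brute_force(seqs, k):
    """Reference answer: test all n*(n-1)/2 pairs."""
    return {(i, j) for i, j in combinations(range(len(seqs)), 2)
            if hamming_capped(seqs[i], seqs[j], k) <= k}


def pairs_pigeonhole(seqs, k):
    """Same answer, but verify only pairs that share at least one exact block."""
    if not seqs:
        return set()
    length = len(seqs[0])
    parts = k + 1
    # Block boundaries; block sizes differ by at most one character.
    bounds = [length * p // parts for p in range(parts + 1)]
    result = set()
    checked = set()  # candidate pairs already verified via an earlier block
    for p in range(parts):
        lo, hi = bounds[p], bounds[p + 1]
        # Group sequence indices by the exact content of block p.
        buckets = defaultdict(list)
        for idx, s in enumerate(seqs):
            buckets[s[lo:hi]].append(idx)
        for members in buckets.values():
            for i, j in combinations(members, 2):
                if (i, j) in checked:
                    continue
                checked.add((i, j))
                if hamming_capped(seqs[i], seqs[j], k) <= k:
                    result.add((i, j))
    return result
```

Calling pairs_pigeonhole(["ACGTACGT", "ACGTACGA", "TCGTACGA", "GGGGCCCC", "ACGAACGT"], 1) reports the index pairs (0, 1), (0, 4) and (1, 2), the same set the brute-force version finds. The saving comes from never verifying pairs that share no block. When k + 1 exceeds the string length, some blocks are empty and the filter degenerates to the brute-force scan, which is still correct.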

Updated: 2021-06-07