LsHASHq: A string matching algorithm exploiting longer q-gram shifting, Information Processing & Management


Information Processing & Management (IF 7.4), Pub Date: 2022-08-24, DOI: 10.1016/j.ipm.2022.103057
Abdulrakeeb M. Al-Ssulami, Aqil M. Azmi, Hassan Mathkour, Hatim Aboalsamh

String matching is a classical computer science problem in which we search for all occurrences of a string of size $m$, typically called the pattern, in a text of size $n$, where both strings are drawn from the same alphabet. It is an essential task in many applications, such as data mining, web search engines, bioinformatics, and natural language processing. Fast hash algorithms were developed to speed up the search. Here, we compare the hash values (signatures) of strings instead of their letters. The hash function allows exploiting bitwise operations while taking the alphabet and pattern sizes into account. However, the efficiency of hash algorithms still leaves room for improvement. The problem with $q$-gram hash algorithms is that a shift skips at most $m - q + 1$ positions, where $m$ is the pattern length and $q$ is the length of the hashed $q$-gram. For a fixed $m$, the maximum number of skipped positions decreases as $q$ increases. This paper presents a new variation of the $q$-gram hash algorithm that lengthens the shift, skipping at most $m$ positions over the text. Theoretically, the proposed hash algorithm, named Longer shift HASHq (LsHASHq), has a longer shift than the state-of-the-art hash algorithms. Experimentally, the new algorithm is the fastest among the following algorithms on different natural language texts for $m > 10$: BNDMq, BXSq, EPSM, FHASHq, FSBNDMq, HASHq, LWFRq, QLQS, SBNDMq, TWFRq, and WFRq. On the human genome sequence, the new algorithm was the second fastest for short patterns of length 10.
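To make the $m - q + 1$ bound concrete, here is a minimal Python sketch of a baseline $q$-gram hash matcher in the general style of HASHq. It is not the LsHASHq algorithm from the paper, whose longer-shift rule the abstract does not spell out. The function names, the shift-and-add hash, and the table size of 256 are all illustrative assumptions.

```python
def qgram_hash(s, end, q, table_size):
    """Hash the q-gram of s that ends at index `end` (inclusive).

    The shift-and-add scheme is an assumption chosen for illustration.
    """
    h = 0
    for c in s[end - q + 1:end + 1]:
        h = ((h << 1) + ord(c)) % table_size
    return h


def hashq_search(pattern, text, q=3, table_size=256):
    """Return the start indices of all occurrences of pattern in text.

    Baseline q-gram hash matcher (hypothetical sketch, not LsHASHq).
    """
    m, n = len(pattern), len(text)
    if q < 1 or m < q or n < m:
        return []

    # A text q-gram whose hash matches no pattern q-gram permits the
    # largest possible shift of this scheme: m - q + 1 positions.
    shift = [m - q + 1] * table_size

    # For a q-gram ending at pattern index i, the window may move
    # m - 1 - i positions. Later (rightmost) q-grams overwrite earlier
    # ones, so each slot ends up holding its smallest, safe shift.
    for i in range(q - 1, m):
        shift[qgram_hash(pattern, i, q, table_size)] = m - 1 - i

    matches = []
    pos = m - 1  # text index aligned with the last pattern character
    while pos < n:
        s = shift[qgram_hash(text, pos, q, table_size)]
        if s == 0:
            # Hashes agree on the last q-gram: verify letter by letter.
            start = pos - m + 1
            if text[start:pos + 1] == pattern:
                matches.append(start)
            pos += 1  # conservative advance after a candidate
        else:
            pos += s
    return matches
```

For example, `hashq_search("abab", "abababab", q=2)` reports the overlapping occurrences at indices 0, 2, and 4. With $m = 4$ and $q = 2$ the default shift is $m - q + 1 = 3$, one position short of the $m$ positions that the abstract states as the LsHASHq maximum.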


Updated: 2022-08-25
