Text Searching Allowing for Non-Overlapping Adjacent Unbalanced Translocations
arXiv - CS - Logic in Computer Science. Pub Date: 2021-01-03, DOI: arxiv-2101.00718
Domenico Cantone, Simone Faro, Arianna Pavone

In this paper we investigate the approximate string matching problem when the allowed edit operations are non-overlapping unbalanced translocations of adjacent factors. An edit operation of this kind takes place when two adjacent substrings of the text swap positions, resulting in a modified string. The two substrings involved are allowed to have different lengths. Such large-scale modifications of strings arise in various applications. They are among the most frequent chromosomal alterations, accounting for 30\% of all losses of heterozygosity, a major genetic event causing inactivation of cancer suppressor genes. They are also frequent modifications in musical and natural language information retrieval. However, despite their central role in so many fields of text processing, little attention has been devoted to the problem of matching strings allowing for this kind of edit operation. In this paper we present three algorithms for solving the problem, all of them with $O(nm^3)$ worst-case time complexity and $O(m^2)$ space complexity, where $m$ and $n$ are the lengths of the pattern and of the text, respectively. In particular, our first algorithm is based on the dynamic-programming approach. Our second solution improves on the first by making use of the Directed Acyclic Word Graph of the pattern. Finally, our third algorithm is based on an alignment procedure. We also show that, under the assumptions of equiprobability and independence of characters, our second algorithm has $O(n\log^2_{\sigma} m)$ average time complexity for an alphabet of size $\sigma \geq 4$.
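To make the matching relation concrete, here is a minimal Python sketch that checks the definition directly. It is not one of the three algorithms presented in the paper, and it makes no attempt to reach their stated complexity bounds; the function names `matches_with_translocations` and `search` are hypothetical, chosen for illustration only. The sketch assumes that a text window matches the pattern when it has the same length and can be obtained from the pattern by swapping pairs of adjacent non-empty factors, with no position involved in more than one swap.

```python
def matches_with_translocations(x: str, y: str) -> bool:
    """Return True if y can be obtained from x by non-overlapping
    translocations of adjacent factors (possibly of different lengths).

    Brute-force illustration of the definition, not the paper's algorithm.
    """
    if len(x) != len(y):
        return False
    m = len(x)
    # ok[j] is True when the prefix y[:j] can be obtained from x[:j].
    ok = [False] * (m + 1)
    ok[0] = True
    for j in range(1, m + 1):
        # Case 1: the last character is left untouched.
        if ok[j - 1] and x[j - 1] == y[j - 1]:
            ok[j] = True
            continue
        # Case 2: the prefix ends with a swapped block x[i:j] of length >= 2,
        # split at k into two non-empty factors x[i:k] and x[k:j].
        for i in range(j - 1):
            if not ok[i]:
                continue
            if any(y[i:j] == x[k:j] + x[i:k] for k in range(i + 1, j)):
                ok[j] = True
                break
    return ok[m]


def search(pattern: str, text: str) -> list[int]:
    """Return all start positions where a window of the text matches the pattern."""
    m = len(pattern)
    return [
        s
        for s in range(len(text) - m + 1)
        if matches_with_translocations(pattern, text[s:s + m])
    ]
```

For example, `matches_with_translocations("abcde", "cdeab")` holds because the factors `ab` and `cde` swap, whereas `"cba"` does not match `"abc"`, since reversing three characters would require overlapping swaps.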

Updated: 2021-01-05