Counter based suffix tree for DNA pattern repeats
Theoretical Computer Science (IF 1.1), Pub Date: 2020-01-15, DOI: 10.1016/j.tcs.2019.12.014
Tshepo Kitso Gobonamang, Dimane Mpoeleng

In recent years, string datasets have grown exponentially, and so has the need to process them. Many of these datasets are rooted in the field of bioinformatics, since the characteristics of any living organism are encoded in its genes. Genes consist of nucleic bases, which together make up the entire genome; a genome is a concatenation of different types of nucleic bases. Efficiently extracting the information encoded in these sequences requires algorithms to decode it. Most available methods use the data structure commonly referred to as the suffix tree. Suffix trees have evolved tremendously over the years, and the on-line construction of the suffix tree is deemed the best approach. However, it is not optimal for finding repeated sequences, because of the many traversals an algorithm has to perform when identifying repeats. To improve the speed of finding repeats, we developed a counter based suffix tree algorithm. Our work presents a novel algorithm for constructing a counter based suffix tree without losing its properties. The time complexity of the counter based suffix tree is Θ(n), where n is the length of the string, which matches the fastest suffix tree implementation. We have shown that the counter based suffix tree reduces the search time when identifying repeats. We have proved that a counter based suffix tree can be built during construction.
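
The abstract does not spell out the construction, so the following is only a minimal sketch of the general idea of storing counters in a suffix structure, not the authors' algorithm. It assumes a naive, uncompressed suffix trie that takes quadratic time to build, rather than the Θ(n) on-line suffix tree described above. The names `Node`, `build_counted_suffix_trie`, `occurrences`, and `repeats` are hypothetical. Each node counts how many suffixes pass through it, which equals the number of (possibly overlapping) occurrences of the substring spelled from the root, so a repeat can be recognised from a single counter instead of a subtree traversal.

```python
class Node:
    """Trie node with a counter of how many suffixes pass through it."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0


def build_counted_suffix_trie(text):
    """Insert every suffix of text, incrementing counters along the way.

    Naive O(n^2) construction, used here for illustration only.
    """
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.count += 1  # updated during construction, no second pass
    return root


def occurrences(root, pattern):
    """Return how often pattern occurs in the indexed text (overlaps included)."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count


def repeats(root, min_len=2):
    """Return sorted (substring, count) pairs for substrings seen more than once."""
    found = []
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for ch, child in node.children.items():
            if child.count > 1:  # deeper nodes can never have a larger count
                substring = prefix + ch
                if len(substring) >= min_len:
                    found.append((substring, child.count))
                stack.append((child, substring))
    return sorted(found)


if __name__ == "__main__":
    trie = build_counted_suffix_trie("GATTACAGATTACA")
    print(occurrences(trie, "GATTACA"))
    print(repeats(trie, min_len=6))
```

Because a node's count can only shrink with depth, `repeats` prunes every branch whose counter is 1, which is where a counter saves traversal work in this simplified setting.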




Updated: 2020-01-15