A new distance based on minimal absent words and applications to biological sequences
arXiv - CS - Formal Languages and Automata Theory Pub Date : 2021-05-31 , DOI: arxiv-2105.14990
Giuseppa Castiglione, Jia Gao, Sabrina Mantaci, Antonio Restivo

A minimal absent word of a sequence x is a sequence y that is not a factor of x, but all of whose proper factors are factors of x. The set of minimal absent words uniquely defines the sequence itself. Recently, minimal absent words have been used to compare sequences: two sequences are compared through the sets of their minimal absent words. Chairungsee and Crochemore in [2] define a distance between pairs of sequences x and y that involves the symmetric difference of the sets of minimal absent words of x and y. Here, we consider a different distance, introduced in [1], based on a specific subset of this symmetric difference that, in our opinion, better captures the distinguishing features of the considered sequences. We show the results of some experiments in which the distance is tested on a dataset of genetic sequences from 11 living species, in order to compare the new distance with those existing in the literature.
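
To make the definition concrete, the following is a minimal brute-force sketch in Python, not the authors' implementation. It relies on the standard observation that a word aub (a, b letters) is a minimal absent word of x exactly when au and ub are factors of x and aub is not. The function names `factors`, `minimal_absent_words` and `maw_symmetric_difference` are hypothetical, the alphabet is assumed to be passed explicitly, and the last function only builds the symmetric difference of the two sets; it does not reproduce the distance of [2] or the one introduced in [1], whose exact formulas are not given in this abstract.

```python
def factors(x):
    """Return the set of all factors (substrings) of x, including the empty word."""
    return {x[i:j] for i in range(len(x) + 1) for j in range(i, len(x) + 1)}


def minimal_absent_words(x, alphabet):
    """Brute-force enumeration of the minimal absent words of x over `alphabet`.

    Intended for short sequences only: it materialises every factor of x.
    """
    fx = factors(x)
    maws = set()
    # A letter that never occurs in x is a minimal absent word of length 1.
    for a in alphabet:
        if a not in fx:
            maws.add(a)
    # A word left + b (with left = a + u) is minimal absent when
    # a + u and u + b both occur in x but a + u + b does not.
    for left in fx:
        if not left:
            continue
        u = left[1:]
        for b in alphabet:
            if u + b in fx and left + b not in fx:
                maws.add(left + b)
    return maws


def maw_symmetric_difference(x, y, alphabet):
    """Words that are minimal absent for exactly one of x and y."""
    return minimal_absent_words(x, alphabet) ^ minimal_absent_words(y, alphabet)


if __name__ == "__main__":
    print(sorted(minimal_absent_words("abaab", "ab")))
    print(sorted(maw_symmetric_difference("abaab", "ab", "ab")))
```

For genome-scale inputs this quadratic enumeration is impractical; it is meant only to illustrate the definition on toy sequences.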

Updated: 2021-06-01