Elsevier

Theoretical Computer Science

Volume 862, 16 March 2021, Pages 144-154
Closest substring problems for regular languages

https://doi.org/10.1016/j.tcs.2020.09.005

Abstract

The Closest Substring problem asks whether there exists a consensus string w of given length $\ell$ such that each string in a set of strings L has a substring whose edit distance is at most r (called the radius) from w. The Closest Substring problem has been studied for finite sets of strings and is known to be NP-hard. We show that the Closest Substring problem for regular languages represented by nondeterministic finite automata (NFA) is PSPACE-complete. The problem remains PSPACE-hard even when the input is a deterministic finite automaton and the length $\ell$ and radius r are given in unary. We also show that the Closest Substring problem for acyclic NFAs lies in the second level of the polynomial-time hierarchy and is both NP-hard and coNP-hard.

Introduction

In many practical applications, such as computational biology [6], [18], coding theory [12] and data compression [13], an important task is to find a consensus string for detecting data commonalities from a set of strings. There exist various ways to define a consensus string for a given set of strings. Frances and Litman [12] defined the consensus string based on the concept of radius. The radius of a string w with respect to a set S of equal-length strings is the smallest number r such that the Hamming distance between w and any string in S is at most r. The Hamming distance counts the number of positions where two strings of the same length differ. It is known that the Consensus String problem based on radius is NP-complete even for strings over a binary alphabet [12]. Sim and Park [25] considered a variant of the problem where they tried to minimize the sum of distances between w and all strings in S, which is called the consensus error. They showed that the Consensus String problem based on consensus error is NP-complete when the penalty matrix is a metric. Amir et al. [1] examined the Consensus String problem by considering both distance sum and radius, where the distance sum is the sum of distances from the strings in the given set to the consensus string and the radius is the largest distance from the set to the consensus string. They presented efficient polynomial-time algorithms for a set of three strings. Amir et al. [2] studied the Consensus String problem for other string metrics such as the swap metric and the reversal metric and showed that the problem is NP-hard for the swap metric and APX-hard for the reversal metric. Recently, Schmid [23] investigated a variant where the length differences of the input strings are bounded.
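As a small illustration of the radius defined above, the following Python sketch computes the Hamming distance and the radius of a candidate string with respect to a set of equal-length strings. The function names `hamming` and `radius` are chosen here for illustration only and do not come from the cited works.

```python
def hamming(u, v):
    """Number of positions where two equal-length strings differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(u, v))


def radius(w, strings):
    """Smallest r such that every string in `strings` is within
    Hamming distance r of the candidate string w."""
    return max(hamming(w, s) for s in strings)


# Every string below differs from "0101" in exactly one position.
print(radius("0101", ["0100", "1101", "0111"]))
```

The Consensus String problem based on radius then asks whether some w achieves `radius(w, S) <= r`; the sketch only evaluates a given candidate and does not search for one.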

Another important related problem is the Closest Substring problem [12], [18]: given two positive integers $r, \ell \in \mathbb{N}$ and a set of $k$ strings $s_1, s_2, \ldots, s_k$, all of length at least $\ell$, one should decide whether there exists a string $s$ of length $\ell$ such that, for each $i \in \{1, \ldots, k\}$, there exists a substring $s_i'$ of length $\ell$ in $s_i$ such that the Hamming distance between $s$ and $s_i'$ is at most $r$. The Closest Substring problem is also NP-hard since the Consensus String problem is a special case of the Closest Substring problem. Ma and Sun [19] studied approximation algorithms for closest substring problems and Stojanovic et al. [27] proposed a linear-time algorithm when the radius is one.
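To make the decision problem concrete, here is a minimal brute-force sketch in Python for the finite-set, Hamming-distance version just described. It is not one of the algorithms from the cited works: it simply tries every candidate string over an explicitly supplied alphabet, so its running time is exponential in the candidate length. The function name `closest_substring` and its parameters are hypothetical.

```python
from itertools import product


def hamming(u, v):
    """Number of positions where two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def closest_substring(strings, length, r, alphabet):
    """Return some string s with len(s) == length such that every input
    string has a substring of that length within Hamming distance r of s.
    Return None if no such string exists. Exhaustive search (illustration only)."""
    for candidate in product(alphabet, repeat=length):
        s = "".join(candidate)
        if all(
            any(hamming(s, t[i:i + length]) <= r
                for i in range(len(t) - length + 1))
            for t in strings
        ):
            return s
    return None


# "ab" occurs exactly (radius 0) in each of the three strings.
print(closest_substring(["abcab", "cabba", "bbabc"], 2, 0, "abc"))
```

When every input string has length exactly equal to the candidate length, the only substring of that length is the string itself, which is how the Consensus String problem arises as a special case.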

While originally the Closest Substring problem is specified for a finite set of strings, studying similar problems for regular languages is often relevant. Due to the expressive power, flexibility, and compactness of regular expressions, many researchers working in bioinformatics have used regular expressions to describe various biological patterns. For instance, the PROSITE database [24] contains a large number of biological signatures that are described as (more than a thousand) patterns or profiles, where the patterns are actually regular expressions. In such a context it is important to find a string with closest alignment to substrings from a set of strings for finding regulatory motifs for DNA sequences [10]. As a consequence, solving the closest substring problem for regular languages would enable us to search for regulatory motifs from a large set of biological (DNA or protein) patterns described by regular expressions.

We extend the Closest Substring problem from finite sets of strings to regular languages. That is, we consider an algorithmic problem for a given regular language L where the goal is to find a string w of given length $\ell$ such that any string in L has a substring within a given radius from w. Since we are comparing strings of arbitrary lengths, instead of the Hamming distance as the metric we use the edit distance, a.k.a. the Levenshtein distance [11].
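For readers less familiar with the edit distance, the following Python sketch computes it by the standard dynamic programme. It assumes unit cost for each insertion, deletion and substitution; this is an assumption of the sketch, since the paper works with edit distances that may assign other costs to the individual operations.

```python
def edit_distance(u, v):
    """Levenshtein distance between u and v: the minimum number of
    single-symbol insertions, deletions and substitutions turning u into v.
    Unit costs are an assumption of this sketch."""
    prev = list(range(len(v) + 1))  # distances from the empty prefix of u
    for i, a in enumerate(u, start=1):
        curr = [i]  # turning u[:i] into the empty string takes i deletions
        for j, b in enumerate(v, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a
                curr[j - 1] + 1,         # insert b
                prev[j - 1] + (a != b),  # substitute a by b, or match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```

Unlike the Hamming distance, this distance is defined for strings of different lengths, which is what makes it usable for languages containing strings of arbitrary length.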

The Closest Substring problem for regular languages asks, for a given finite automaton A and integers $r, \ell \in \mathbb{N}$, whether there exists a consensus string w of length $\ell$ such that every string in the language of A has a substring whose edit distance to w is at most r. Using the NFA (nondeterministic finite automaton) construction for edit distance neighbourhoods [21], [22] of regular languages, we give a PSPACE algorithm for this problem. We show that the Closest Substring problem for regular languages is PSPACE-hard even when the input is a DFA (deterministic finite automaton) and the length of the consensus string and the radius are given in unary. We also give an improved complexity upper bound for languages specified by acyclic NFAs. A summary of the complexity results is given in Table 1.
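The condition that a consensus string must satisfy can be sketched for a language given as an explicit finite list of strings. The Python code below is only an illustration of that condition under unit edit costs; it does not implement the automaton-based PSPACE algorithm of the paper, and the names `substring_distance` and `is_consensus` are hypothetical.

```python
def substring_distance(w, t):
    """Smallest unit-cost edit distance between w and any substring of t.
    Same recurrence as the usual edit distance, except that a match may
    start and end at any position of t."""
    prev = [0] * (len(t) + 1)  # the empty prefix of w matches an empty substring anywhere
    for i, a in enumerate(w, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from w
                curr[j - 1] + 1,         # insert b into w
                prev[j - 1] + (a != b),  # substitute, or match
            ))
        prev = curr
    return min(prev)  # the matched substring may end anywhere in t


def is_consensus(w, language, r):
    """True if every string in the finite collection `language` has a
    substring within edit distance r of w."""
    return all(substring_distance(w, t) <= r for t in language)


print(is_consensus("ab", ["cab", "abb", "ba"], 1))
```

For an infinite regular language this per-string check cannot be run directly, which is why the problem is treated at the level of the automaton rather than of individual strings.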

Recently, the same authors investigated a variant of the problem where the goal of the algorithm is to find, for regular languages $L_1, \ldots, L_k$, a consensus string that has edit distance at most r to some string $w_i \in L_i$ for each $i = 1, \ldots, k$ [14]. The Closest Substring problem considered here deals only with one regular language L. On the other hand, while the algorithm in [14] decides only the existence of a string $w_i \in L_i$, $i = 1, \ldots, k$, within the given radius of the consensus string, in our current setup every string of L has to have a substring within the given radius.

Section snippets

Preliminaries

The cardinality of a finite set S is denoted |S|. For natural numbers $n < m$, we denote $[n, m] = \{n, n+1, \ldots, m-1, m\}$. The symbol Σ always stands for a finite alphabet. The set of strings over Σ is $\Sigma^*$, the empty string is ε, the set of nonempty strings over Σ is $\Sigma^+$ and the set of strings of length at most m is $\Sigma^{\le m}$, $m \in \mathbb{N}$. For $w \in \Sigma^*$ and $\Omega \subseteq \Sigma$, $|w|_\Omega$ is the number of occurrences of symbols of Ω in the string w. The complement of a language L over alphabet Σ is $L^c = \Sigma^* \setminus L$. The length of the shortest string in a

The closest substring problem for NFAs is PSPACE-complete

Here we establish some lemmas giving relationships between the length of the consensus string, the radius and the length of the shortest string in a language. First we introduce the following notation.

Definition 3

Consider an edit distance $d_e$ over an alphabet Σ. By $c^{d_e}_{\mathrm{mindel}} > 0$ we denote the smallest deletion cost of a symbol of the alphabet Σ with respect to the distance $d_e$.

Closest substring problem for acyclic finite automata

We show that the Closest Substring problem, where the input language is specified by an acyclic NFA, lies in the second level of the polynomial-time hierarchy. It is well known that acyclic NFAs, i.e., NFAs with no cycles (or acyclic DFAs), specify the class of finite languages. For discussion on related language classes see [16].
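The fact that an acyclic NFA accepts a finite language can be seen from a short Python sketch that lists all accepted strings by following every path from the start state. The encoding of the automaton (a dictionary mapping each state to a list of `(symbol, target)` pairs, a single start state, no ε-transitions) is an assumption made for this illustration.

```python
def acyclic_nfa_language(transitions, start, finals):
    """Set of all strings accepted by an NFA without cycles.
    transitions: dict mapping a state to a list of (symbol, target) pairs.
    Terminates because every path in an acyclic automaton is finite."""
    accepted = set()
    stack = [(start, "")]  # (current state, string read so far)
    while stack:
        state, prefix = stack.pop()
        if state in finals:
            accepted.add(prefix)
        for symbol, target in transitions.get(state, []):
            stack.append((target, prefix + symbol))
    return accepted


# Hypothetical three-state example accepting {a, b, ab, bb}.
nfa = {0: [("a", 1), ("b", 1)], 1: [("b", 2)], 2: []}
print(sorted(acyclic_nfa_language(nfa, 0, {1, 2})))
```

The number of accepted strings can be exponential in the number of states, so listing the language explicitly, as done here, does not by itself give a polynomial-time procedure.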

Recall from Definition 3 that $c_{\mathrm{mindel}}, c_{\mathrm{maxdel}} > 0$ stand for the smallest and the largest deletion cost of an individual alphabet symbol.

Theorem 3

The Closest Substring problem for

Conclusion

The classical Consensus String problem looks for a string at a given distance from a finite set of equal-length strings. It is well known that the Consensus String problem is NP-complete [12]. We have extended this question to the Closest Substring problem where the input is a regular language represented by an NFA or a DFA, and the problem asks to find a string of length $\ell$ that is within distance r of a substring of every string in the regular language.

Using a reduction from the validity of

Declaration of Competing Interest

There is no conflict of interest.

References (29)

  • L. Bulteau et al. Multivariate algorithmics for NP-hard string problems. Bull. Eur. Assoc. Theor. Comput. Sci. (2014)
  • C.S. Calude et al. Additive distances and quasi-distances between words. J. Univers. Comput. Sci. (2002)
  • A.K. Chandra et al. Alternation. J. ACM (1981)
  • M.K. Das et al. A survey of DNA motif finding algorithms. BMC Bioinform. (2007)

An earlier version of this paper was presented at the 22nd International Conference Developments in Language Theory, DLT'18, Tokyo, Japan, Sept. 10-14, 2018 and appeared in the proceedings of the conference.

1 Han was supported by the Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (2018-0-00276).

2 Ko was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1A4A307994711).

3 Salomaa was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).
