Revisiting the parameterized complexity of Maximum-Duo Preservation String Mapping

doi:10.1016/j.tcs.2020.09.034

Theoretical Computer Science

Volume 847, 22 December 2020, Pages 27-38


Abstract

In the Maximum-Duo Preservation String Mapping (Max-Duo PSM) problem, the input consists of two related strings A and B of length n and a nonnegative integer k. The objective is to determine whether there exists a mapping m from the set of positions of A to the set of positions of B that maps only to positions with the same character and preserves at least k duos, which are pairs of adjacent positions. We develop a randomized algorithm that solves Max-Duo PSM in $4^{k} \cdot n^{O (1)}$ time, and a deterministic algorithm that solves this problem in ${6.855}^{k} \cdot n^{O (1)}$ time. The previous best known (deterministic) algorithm for this problem has ${(8 e)}^{2 k + o (k)} \cdot n^{O (1)}$ running time (Beretta et al. (2016) [1], [2]). We also show that Max-Duo PSM admits a problem kernel of size $O (k^{3})$, improving upon the previous best known problem kernel of size $O (k^{6})$.

Introduction

Computing distances between strings is a fundamental task in computer science. For many distance measures, the distance between two strings A and B is defined as the minimum number of local operations that are needed to transform A into B, for example the deletion or insertion of a character. For these measures, the distance between two strings A and B can be usually computed in polynomial time [13], [23]. In some applications, however, it is necessary to consider nonlocal operations that transform one string into the other. In comparative genomics, for example, genomes are modeled as strings with one character corresponding to a complete gene and one is interested in determining the evolutionary distance between two genomes. During biological evolution, genomes may be altered by large-scale mutations such as the reversal or the transposition of larger parts of the genome [19].

One approach to approximate the distance between two strings A and B with respect to many of these operations is to compute a smallest common string partition [11], [27]. Informally, a size-ℓ common string partition of two strings A and B is a partition of A and B, each into ℓ nonoverlapping substrings, such that the resulting two multisets of substrings of A and B are the same. The problem to compute a smallest common string partition, known as Minimum Common String Partition, is NP-hard [11], [22].
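To make the informal definition above concrete, the following is a minimal sketch of a checker for common string partitions. The function name `is_common_string_partition` and the representation of a partition as a list of substrings are assumptions made for illustration; they do not come from the paper.

```python
from collections import Counter

def is_common_string_partition(A, B, parts_a, parts_b):
    """Return True if parts_a and parts_b split A and B into nonempty,
    nonoverlapping substrings that form the same multiset.

    parts_a / parts_b are lists of substrings given in left-to-right
    order (a hypothetical encoding chosen for this sketch).
    """
    return (
        "".join(parts_a) == A                       # parts_a partitions A
        and "".join(parts_b) == B                   # parts_b partitions B
        and all(parts_a) and all(parts_b)           # no empty blocks
        and Counter(parts_a) == Counter(parts_b)    # equal as multisets
    )

# A size-2 common string partition of "abcab" and "ababc".
print(is_common_string_partition("abcab", "ababc", ["abc", "ab"], ["ab", "abc"]))
```

The size ℓ of the partition is simply `len(parts_a)`. Minimizing it over all valid partitions is the NP-hard part, which this checker does not attempt.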

An alternative way of defining such a partition is to ask for a partition of A into ℓ nonoverlapping substrings such that permuting the order of these substrings and concatenating them subsequently gives the string B. This second view implies a mapping m that (bijectively) maps each position i of A to a position $m (i)$ of B such that $A [i] = B [m (i)]$. The size of the common string partition is then exactly one plus the number of pairs of consecutive positions i and $i + 1$ (called duos) such that $m (i) + 1 \neq m (i + 1)$, since for each such pair, i is the end of one part and $i + 1$ is the start of the next part. Therefore, computing a mapping m that maps only positions with the same characters to each other and maximizes the number k of consecutive positions for which $m (i) + 1 = m (i + 1)$ directly yields a minimum common string partition of A and B. The problem of computing such a mapping is known as Maximum-Duo Preservation String Mapping (Max-Duo PSM). Since Max-Duo PSM is simply a dual of the Minimum Common String Partition problem, it is NP-hard as well. Motivated by this hardness, we study Max-Duo PSM from the viewpoint of parameterized algorithmics. More precisely, our aim is to obtain efficient algorithms when the parameter is k, the number of preserved duos. Before describing previous work and our results, we give a formal problem definition.
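The correspondence between mappings and partitions can be illustrated with a short sketch. It assumes 0-based positions, a nonempty string A, and a mapping given as a Python list `m`. The helper name `blocks_from_mapping` is hypothetical.

```python
def blocks_from_mapping(A, m):
    """Cut A after every duo (i, i+1) that m does not preserve.

    Positions are 0-based here, whereas the text uses 1-based positions.
    Assumes A is nonempty and m[i] is the position of B that i maps to.
    """
    blocks, start = [], 0
    for i in range(len(A) - 1):
        if m[i] + 1 != m[i + 1]:          # duo (i, i+1) is broken
            blocks.append(A[start:i + 1])
            start = i + 1
    blocks.append(A[start:])              # last block always ends at n-1
    return blocks

A, B = "abcab", "ababc"
m = [2, 3, 4, 0, 1]                       # maps "abc" to B[2:5] and "ab" to B[0:2]
preserved = sum(m[i] + 1 == m[i + 1] for i in range(len(A) - 1))
blocks = blocks_from_mapping(A, m)
print(preserved, blocks)
```

In this example three of the four duos are preserved, so the induced partition has 5 - 3 = 2 blocks. In general, a mapping preserving k duos of a length-n string induces a partition of size n - k.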

Formal problem definition. Let A and B be two strings over a finite set of symbols Σ. Throughout this work, we assume that $| A | = | B | = n$ and that A and B are related, that is, B is a permutation of A. A mapping of A into B is a (bijective) function $m : [n] \to [n]$, where $[n] = \{1, \dots, n\}$, such that for each $i \in [n]$, $A [i] = B [m (i)]$. A duo in A is a pair of consecutive positions $(i, i + 1)$ of A. We say that a mapping m preserves a duo $(i, i + 1)$ if $m (i) + 1 = m (i + 1)$. Accordingly, the Max-Duo PSM problem is defined as follows.

Maximum-Duo Preservation String Mapping (Max-Duo PSM)
Input: Two related strings, A and B, and a nonnegative integer k.
Question: Does there exist a (bijective) mapping m of A into B such that the number of preserved duos is at least k?
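As a reference point for the definition, the decision problem can be solved on very small instances by exhaustive search. The sketch below tries all n! bijections, so it is exponential and is not one of the algorithms developed in this paper. The function names and 0-based indexing are assumptions of this illustration.

```python
from itertools import permutations

def max_preserved_duos(A, B):
    """Maximum number of duos preserved by any mapping of A into B.

    Brute force over all bijections (0-based positions); only usable
    for tiny strings.
    """
    n = len(A)
    best = 0
    for m in permutations(range(n)):
        # m is a mapping only if it sends each position to an equal character.
        if all(A[i] == B[m[i]] for i in range(n)):
            kept = sum(m[i] + 1 == m[i + 1] for i in range(n - 1))
            best = max(best, kept)
    return best

def max_duo_psm(A, B, k):
    """Decide the Max-Duo PSM instance (A, B, k) for related strings A, B."""
    if sorted(A) != sorted(B):
        raise ValueError("A and B must be related (permutations of each other)")
    return max_preserved_duos(A, B) >= k

print(max_duo_psm("abcab", "ababc", 3), max_duo_psm("abcab", "ababc", 4))
```

Such a brute-force oracle is mainly useful for sanity-checking faster implementations on random small inputs.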

Previous work. Initially, Max-Duo PSM was proposed as an alternative possibility of achieving approximation algorithms for Minimum Common String Partition (MCSP) [10], because the best known polynomial-time approximation algorithm has an approximation factor of $O (\log n \log^{*} n)$ [12]. Consequently, most work on Max-Duo PSM focuses on approximation algorithms, with the first constant-factor approximation algorithm achieving an approximation factor of 4 [6]. This was subsequently improved to a factor of 3.5 [5] and then to a factor of 3.25 [7]. Recently, further progress concerning the approximation factor has been reported [18], [28].

Beretta et al. [2], [1] initiated the study of Max-Duo PSM from the viewpoint of parameterized algorithmics. They studied both the fixed-parameter tractability and the kernelization complexity of Max-Duo PSM, showing that this problem can be solved in ${(8 e)}^{2 k + o (k)} \cdot n^{O (1)}$ time, and that it admits a kernel of size $O (k^{6})$. Thus, Beretta et al. [2], [1] were the first to show that Max-Duo PSM is FPT and that it admits a polynomial kernel. The fixed-parameter algorithm of Beretta et al. [2], [1] is based on a combination of color coding and dynamic programming.

In comparison with Max-Duo PSM, MCSP has been investigated more thoroughly from the viewpoint of parameterized algorithms. Damaschke [15] presented the first fixed-parameter algorithms for MCSP, for combined parameters such as “partition size ℓ plus repetition number of the input strings”.² Subsequently, MCSP was shown to be fixed-parameter tractable with the single parameter partition size ℓ [9]. Jiang et al. [24] considered the combined parameter “partition size ℓ plus maximum occurrence d of any character” and showed that MCSP can be solved in ${(d!)}^{k} \cdot n^{O (1)}$ time. Subsequently, this running time was improved to $O (d^{2 k} \cdot k n)$ [8].

Our contribution. We make two main contributions. First, we develop two algorithms for the Max-Duo PSM problem that are substantially faster than the (deterministic) algorithm by Beretta et al. [2], [1], which runs in ${(8 e)}^{2 k + o (k)} \cdot n^{O (1)}$ time. Specifically, we develop a randomized algorithm that solves Max-Duo PSM in $4^{k} \cdot n^{O (1)}$ time, as well as a deterministic algorithm that solves this problem in ${6.855}^{k} \cdot n^{O (1)}$ time. Here, in the context of our randomized algorithm, we mean that if we determine that the input is a yes-instance, then this answer is necessarily correct, and if we determine that the input is a no-instance, then this answer is correct with probability at least 9/10.³ For the purpose of developing our algorithms, we present a reduction from Max-Duo PSM to a problem of finding paths in an edge-colored graph, which might be of independent interest. This reduction lies at the heart of our algorithms, since by employing advanced tools from the field of parameterized algorithmics, namely, the methods of narrow sieves [4], [3] and representative sets [20], it is possible to quickly solve the resulting graph problem.
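To get a feeling for the size of the improvement, the exponential parts of the three running times can be compared by their base per unit of k. The sketch below ignores the $o (k)$ term and all polynomial factors, so it is only a rough illustration. The variable names are chosen here and do not come from the paper.

```python
import math

# (8e)^{2k} = ((8e)^2)^k, so the previous algorithm has base (8e)^2 per unit of k.
old_base = (8 * math.e) ** 2

bases = {
    "previous deterministic [2], [1]": old_base,
    "new deterministic": 6.855,
    "new randomized": 4.0,
}
for name, base in bases.items():
    print(f"{name}: about {base:.3f}^k")

def exponential_ratio(k, new_base):
    """Ratio of the old to the new exponential factor for a given k
    (polynomial factors and the o(k) term are ignored)."""
    return (old_base / new_base) ** k
```

The previous base is roughly 472.9, so even for small k the exponential factor shrinks by many orders of magnitude under these simplifications.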

Second, we prove that Max-Duo PSM admits a kernel of size $O (k^{3})$, improving upon the kernel of size $O (k^{6})$ by Beretta et al. [2].

Preliminaries. We use $[i, j]$ to denote the set $\{i, i + 1, \dots, j\}$ of natural numbers between i and j. Moreover, given a string A, we denote the substring starting at position i and ending at position j by $A [i, j]$. For a (directed) graph G, let $V (G)$ denote the vertex set of G and $E (G)$ the edge set of G.

The field of parameterized algorithmics studies parameterized problems, where each problem instance is associated with a parameter k, usually a nonnegative integer. Given a parameterized problem, the first question is whether the problem is fixed-parameter tractable (FPT), that is, whether it can be solved in $f (k) \cdot | X |^{O (1)}$ time, where f is an arbitrary function that depends only on k and $| X |$ is the size of the input instance. In other words, the notion of FPT signifies that the combinatorial explosion can be confined to the parameter k. A second question is whether the problem also admits a polynomial kernelization. Here, a problem Π is said to admit a polynomial kernelization if there exists a polynomial-time algorithm that, given an instance $(X, k)$ of Π, outputs an equivalent instance $(\hat{X}, \hat{k})$ of Π, called a kernel, where $| \hat{X} | = {\hat{k}}^{O (1)}$ and $\hat{k} \leq k$; kernelization is a mathematical concept that aims to analyze preprocessing procedures in a formal, rigorous manner. For further details, refer to [17], [14], [21].

Section snippets

Reduction to a path finding problem

In this section, we present a reduction from Max-Duo PSM to the following graph problem.

Substantially Blue Path
Input: A directed acyclic graph (DAG) G, an edge-coloring $c : E (G) \to \{R, B\}$, a vertex-labeling $ℓ : V (G) \to \mathbb{N}$, and nonnegative integers k and r.
Question: Does G contain a directed path P such that
• $| V (P) | \leq r$,
• for all distinct $u, v \in V (P)$, $ℓ (u) \neq ℓ (v)$, and
• $| \{e \in E (P) : c (e) = B\} | \geq k$?
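The three conditions of Substantially Blue Path can be checked for a candidate path as in the following minimal sketch. The encoding (edges as a set of pairs, `color` and `label` as dictionaries, colors as the strings "R" and "B") and the function name are assumptions made for illustration. The sketch only verifies a given path and does not search for one.

```python
def is_substantially_blue_path(path, edges, color, label, k, r):
    """Check whether `path` (a list of vertices) is a solution.

    edges: set of directed edges (u, v)
    color: dict mapping each edge to "R" or "B"
    label: dict mapping each vertex to a natural number
    """
    steps = list(zip(path, path[1:]))
    if not path or any(e not in edges for e in steps):
        return False                       # not a directed path in G
    if len(path) > r:
        return False                       # violates |V(P)| <= r
    labels = [label[v] for v in path]
    if len(set(labels)) != len(labels):
        return False                       # two vertices share a label
    blue = sum(color[e] == "B" for e in steps)
    return blue >= k                       # at least k blue edges

# Small hypothetical instance on vertices 0..3.
edges = {(0, 1), (1, 2), (2, 3), (0, 2)}
color = {(0, 1): "B", (1, 2): "R", (2, 3): "B", (0, 2): "B"}
label = {0: 1, 1: 2, 2: 3, 3: 1}

print(is_substantially_blue_path([1, 2, 3], edges, color, label, k=1, r=3))
print(is_substantially_blue_path([0, 1, 2, 3], edges, color, label, k=2, r=4))
```

In the second call the path has two blue edges, but vertices 0 and 3 carry the same label, so it is rejected.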

Construction. Let $(A, B, k)$ be an instance of Max-Duo PSM. We construct an instance $(G, c, ℓ, k, r)$ of Substantially Blue Path as follows (here,

A randomized algorithm based on narrow sieves

In this section, we adapt the method of narrow sieves that was applied to solve the k-Path problem [4] to solve Substantially Blue Path. More precisely, our objective is to provide a constructive proof for the following result.

Lemma 4

There exists a randomized algorithm that solves Substantially Blue Path in $2^{r} \cdot r^{O (1)} \cdot | E (G) |$ time and polynomial space.

In light of Lemma 3, once we have Lemma 4 at hand, we immediately obtain the following theorem.

Theorem 1

There exists a randomized algorithm that solves Max-Duo PSM

Deterministic algorithm: representative sets

In this section, we adapt the approach in which the method of representative sets is applied to solve the k-Path problem [20]. More precisely, our objective is to provide a constructive proof for the following result.

Lemma 10

There exists a deterministic algorithm that solves Substantially Blue Path in $O ({(\frac{1 + \sqrt{5}}{2})}^{r + o (r)} \cdot | E (G) | \log | E (G) |)$ time.

In light of Lemma 3, once we have Lemma 10 at hand, we directly obtain the following theorem.

Theorem 2

There exists a deterministic algorithm that solves Max-Duo PSM in $O ((\frac{1 + \sqrt{5}}{2})$

A cubic problem kernel

In this section we will show that Max-Duo PSM admits a kernel of size $O (k^{3})$. Let $(A, B, k)$ be an instance of Max-Duo PSM, and let $S \in \{A, B\}$. If $S = A$, then we let $\overline{S} = B$. Analogously, if $S = B$, then we let $\overline{S} = A$.

Let m be a map of S into $\overline{S}$, and let D be a set of duos. We denote by $m (D) = \{(m (i), m (i + 1)) \mid (i, i + 1) \in D\}$ the image of D under m. We say that m preserves D if m preserves each duo in D. Let $C_{A}$ and $C_{B}$ be sets of duos. We say that the pair $(C_{A}, C_{B})$ is complete for $(A, B, k)$ if whenever there is a map m of A

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (29)

S. Beretta et al. Corrigendum to “Parameterized tractability of the maximum-duo preservation string mapping problem” [Theoret. Comput. Sci. 646 (2016) 16–25]. Theor. Comput. Sci. (2016)
S. Beretta et al. Parameterized tractability of the maximum-duo preservation string mapping problem. Theor. Comput. Sci. (2016)
A. Björklund et al. Narrow sieves for parameterized paths and packings. J. Comput. Syst. Sci. (2017)
W. Chen et al. Solving the maximum duo-preservation string mapping problem with linear programming. Theor. Comput. Sci. (2014)
R.A. DeMillo et al. A probabilistic remark on algebraic program testing. Inf. Process. Lett. (1978)
H. Shachnai et al. Representative families: a unified tradeoff-based approach. J. Comput. Syst. Sci. (2016)
A. Björklund. Determinant sums for undirected hamiltonicity. SIAM J. Comput. (2014)
N. Boria et al. A 7/2-approximation algorithm for the maximum duo-preservation string mapping problem.
N. Boria et al. Improved approximation for the maximum duo-preservation string mapping problem.
B. Brubach. Further improvement in approximating the maximum duo-preservation string mapping problem.
L. Bulteau et al. A fixed-parameter algorithm for minimum common string partition with few duplications.
L. Bulteau et al. Minimum common string partition parameterized by partition size is fixed-parameter tractable.
X. Chen et al. Assignment of orthologous genes via genome rearrangement. IEEE/ACM Trans. Comput. Biol. Bioinform. (2005)
G. Cormode et al. The string edit distance matching problem with moves. ACM Trans. Algorithms (2007)


^☆: A preliminary version of this paper appeared in the proceedings of CPM 2017.
