Applied Soft Computing, Volume 95, October 2020, 106499

Finding Longest Common Subsequences: New anytime A* search results

https://doi.org/10.1016/j.asoc.2020.106499

Highlights

  • The prominent Longest Common Subsequence problem is considered and a new exact A* search is presented for it.

  • This A* search was able to yield proven optimal solutions to 106 smaller instances from the literature.

  • For larger instances the A* search is turned into an anytime algorithm with either beam search or anytime column search (ACS).

  • A*+ACS produces new best heuristic solutions on about 70% of all benchmark instance groups from the literature.

  • A*+ACS provides significantly smaller optimality gaps than other anytime algorithms for this problem.

Abstract

The Longest Common Subsequence (LCS) problem aims at finding a longest string that is a subsequence of each string in a given set of input strings. This problem has applications, in particular, in bioinformatics, where strings represent DNA or protein sequences. Existing approaches include numerous heuristics, but only a few exact approaches, which are limited to rather small problem instances. Adopting various aspects from leading heuristics for the LCS problem, we first propose an exact A* search approach, which performs well in comparison to earlier exact approaches on small instances. On the basis of A* search we then develop two hybrid A*-based algorithms in which classical A* iterations are alternated with beam search and anytime column search, respectively. A key feature for guiding the heuristic search in these approaches is the usage of an approximate expected length calculation for the LCS of uniform random strings. Even for large problem instances these anytime A* variants yield reasonable solutions early during the search and improve on them over time. Moreover, they terminate with proven optimality if enough time and memory are given. Furthermore, they yield upper bounds and, thus, quality guarantees when terminated early. We comprehensively evaluate the proposed methods using most of the available benchmark sets from the literature and compare them to the current state-of-the-art methods. In particular, our algorithms are able to obtain new best results for 82 out of 117 instance groups. Moreover, in most cases they also provide significantly smaller optimality gaps than other anytime algorithms.

Introduction

In computer science strings are widely used for representing sequence information. Words and longer texts are naturally represented by means of strings, and in the field of bioinformatics, DNA, RNA and protein sequences, for example, play particularly important roles. We formally define a string s as a finite sequence of |s| characters from a finite alphabet Σ. A frequently occurring necessity is to detect similarities between several strings in order to derive relationships and possibly predict different aspects of a set of strings. A subsequence of a string s is any sequence obtained by removing arbitrary characters from s. A natural and common way to compare two or more strings is to study their common subsequences. More specifically, given a set of m input strings S = {s1, …, sm}, the Longest Common Subsequence (LCS) problem [1] aims at finding a subsequence of maximal length that is common to all strings in S.
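To make the definition concrete, the following minimal Python sketch (ours, not taken from the paper) checks whether a candidate string is a common subsequence of a set of input strings:

    def is_subsequence(t: str, s: str) -> bool:
        """True iff t can be obtained from s by deleting characters."""
        it = iter(s)
        # "c in it" advances the iterator until c is found (or exhausted),
        # so the characters of t must occur in s in the given order.
        return all(c in it for c in t)

    def is_common_subsequence(t: str, strings: list) -> bool:
        """True iff t is a subsequence of every input string."""
        return all(is_subsequence(t, s) for s in strings)

    # "AC" is a common subsequence of both inputs, "CA" is not.
    assert is_common_subsequence("AC", ["ABC", "ACD"])
    assert not is_common_subsequence("CA", ["ABC", "ACD"])

The LCS problem then asks for a feasible string t of maximum length.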

As mentioned above, the length of the LCS of two or more input strings is a popular similarity measure in computational biology. More generally, there is a large range of real-world applications in which it is necessary to compute a measure of similarity between two or more sequences, and the requirements sometimes differ. Depending on the application, these sequences may encode biological information (such as, for example, DNA or RNA strings), sentences, whole texts, or time series (including video signals and speech sequences). Well-known similarity measures in the context of computational biology include, besides the LCS length, the Levenshtein distance, which is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one sequence into the other. Another example is the Damerau–Levenshtein distance [2], which adds transpositions to the three edit operations already considered in the Levenshtein distance. It is also worth mentioning the Canberra distance (used, for example, to analyze the gut microbiome in different disease states) and the Google distance [3]. Well-known similarity measures for sentences and/or texts include metrics such as the Euclidean, the Manhattan and the Minkowski distance [4]. The soft cosine measure [5] considers similarities between pairs of features, and the Jaccard similarity [6] is defined as the size of the intersection divided by the size of the union of two sets. Finally, well-known measures of similarity for time series include Dynamic Time Warping (DTW) [7], the matrix-based Euclidean distance (GMED), and matrix-based dynamic time warping (GMDTW) [8], among others. Recently, many approaches from the fields of deep learning and machine learning have been developed to derive similarity measures that take the semantic meaning of the compared sentences into account. These include the deep architecture Match-SRNN [9], which utilizes a spatial recurrent neural network to generate the global interaction between two sentences, the Word Order Similarity [10], which is defined as the normalized difference of word order between two sentences, and Latent Semantic Analysis (LSA) [11]. In this work, however, we focus on the efficient calculation of the LCS measure. Apart from applications in computational biology [12], the necessity to calculate this measure arises, for example, in data compression [13], [14], text editing [15], the production of circuits in field programmable gate arrays [16], and file comparison (as used in the Unix command diff) [17].
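For illustration, here is a compact Python implementation of the Levenshtein distance mentioned above (the standard Wagner–Fischer dynamic program; the code sketch is ours, not from the cited works):

    def levenshtein(a: str, b: str) -> int:
        """Minimum number of single-character insertions, deletions and
        substitutions needed to transform a into b."""
        prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
        for i, ca in enumerate(a, start=1):
            curr = [i]  # deleting the first i characters of a
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,                # delete ca
                                curr[j - 1] + 1,            # insert cb
                                prev[j - 1] + (ca != cb)))  # substitute/match
            prev = curr
        return prev[-1]

    assert levenshtein("kitten", "sitting") == 3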

For fixed m, polynomial-time algorithms based on dynamic programming (DP) are known [18] to solve the LCS problem. Standard dynamic programming approaches run in O(n^m) time, where n denotes the length of the longest input string. These exact methods quickly become impractical when m grows and n is not small. For a general number of input strings m, the LCS problem is known to be NP-hard [1]. In practice, heuristic techniques are typically used for larger m and n. The Expansion algorithm [19] and the Best-Next heuristic [20] are well-known simple and fast construction heuristics. Substantially better solutions can usually be obtained by more advanced search strategies and metaheuristics. Among these are, in particular, many approaches based on Beam Search (BS), see, e.g., [21], [22], [23], [24], [25]; they differ in various important details such as the heuristic guidance, the branching mechanism, and the filtering.
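For m=2 the standard dynamic program looks as follows (a minimal Python sketch; the recurrence is classical, the code is ours):

    def lcs_length(s1: str, s2: str) -> int:
        """Length of a longest common subsequence of two strings in
        O(|s1|*|s2|) time, i.e., the m=2 case of the O(n^m) DP."""
        prev = [0] * (len(s2) + 1)
        for c1 in s1:
            curr = [0]
            for j, c2 in enumerate(s2, start=1):
                # either extend with a matching character, or inherit the
                # best value obtained by dropping a character of s1 or s2
                curr.append(prev[j - 1] + 1 if c1 == c2
                            else max(prev[j], curr[j - 1]))
            prev = curr
        return prev[-1]

    assert lcs_length("AGCAT", "GAC") == 2

The m-dimensional generalization fills a table with one dimension per input string, which explains the O(n^m) time and space requirements and why the approach degrades quickly with growing m.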

In our recent work [26], we proposed a general BS framework for the LCS problem that unifies all heuristic state-of-the-art approaches from the literature in the sense that each of them can be expressed by respective configuration settings. Moreover, a novel heuristic guidance function was proposed, which approximates the expected length of an LCS of uniform random strings. In a comprehensive experimental comparison, a new state-of-the-art BS variant was determined, which dominates the other approaches on most of the available benchmark instances. The new heuristic guidance function hereby turned out to play a crucial role.
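An estimate of this kind can be built from the probability P(k, n) that one fixed string of length k is a subsequence of a uniform random string of length n over an alphabet of size σ: conditioning on whether the first random character matches the next needed character yields the recursion P(k, n) = (1/σ)·P(k−1, n−1) + (1 − 1/σ)·P(k, n−1). The following Python sketch of this building block is ours; the full EX function of [26] aggregates such probabilities over all input strings and is not reproduced here:

    from functools import lru_cache

    def subsequence_probability(k: int, n: int, sigma: int) -> float:
        """P(k, n): probability that a fixed k-character string is a
        subsequence of a uniform random n-character string over an
        alphabet of size sigma."""
        @lru_cache(maxsize=None)
        def p(k, n):
            if k == 0:
                return 1.0   # the empty string is always a subsequence
            if k > n:
                return 0.0   # too few characters remain
            return p(k - 1, n - 1) / sigma + (1 - 1 / sigma) * p(k, n - 1)
        return p(k, n)

    # A fixed 2-letter string is a subsequence of a uniform random
    # 4-letter binary string with probability 11/16.
    assert abs(subsequence_probability(2, 4, 2) - 11 / 16) < 1e-12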

Concerning exact approaches for the LCS problem, an integer linear programming model has been considered in [27]. It is, however, not competitive, as it cannot be applied to any of the commonly used benchmark instances due to the excessive number of variables and constraints in the model. Dynamic programming approaches are reasonable for small m and small n, but they quickly run out of memory for larger instances and then typically return only weak solutions, if any at all. Chen et al. [28] proposed the parallel FAST_LCS search algorithm, which is based on producing a special successor table to obtain all identical pairs and their levels. Successor nodes are derived in parallel, and pruning operations are utilized to reduce the computational effort. While the algorithm is effective for a small number of input strings, it struggles for larger m. Wang et al. [24] proposed another parallel algorithm called QUICK-DP, which is based on the dominant point approach and employs a fast divide-and-conquer technique to compute the dominant points. More recently, Li et al. [29] suggested the Top_MLCS algorithm, which is based on a directed acyclic layered-graph model (called the irredundant common subsequence graph) and parallel topological sorting strategies used to filter out paths representing suboptimal solutions. Moreover, the authors showed that the earlier dominant-point-based algorithms do not scale well to larger LCS instances and that Top_MLCS significantly outperforms them. In addition to the sequential Top_MLCS, a parallel variant was also proposed. Another parallel, space-efficient algorithm based on a graph model, called the Leveled-DAG, was described by Peng and Wang [30]. It eliminates all nodes in the layered graph that do not contribute to the construction of the LCS and thus keeps only the nodes from the current level and some previously generated ones. In an experimental comparison, Leveled-DAG and Top_MLCS solved the same number of benchmark instances to proven optimality, but Leveled-DAG consumed less memory.

Despite these recent advances, solving practically relevant instances to proven optimality remains a substantial challenge in terms of memory and computation time, even when utilizing many parallel threads. The existing exact methods are therefore frequently not applicable in practice. As a compromise between classical exact techniques and pure heuristic approaches, anytime algorithms have been proposed [31], [32]. An anytime algorithm is supposed to fulfill the following properties: (1) it is, in principle, complete in the sense that it terminates with a proven optimal solution when enough time and memory are provided; (2) it can be terminated at almost any time and then returns a solution of reasonable quality; and (3) the solution quality improves with the given time.

Anytime algorithms thus allow choosing the trade-off between solution quality and computational requirements. Concerning the LCS problem, two anytime approaches have been proposed in the literature so far: Pro-MLCS [33] and SA-MLCS [34]. Both algorithms are based on the dominant point method [35], which features a special distance measure dist for heuristic guidance and a specific multi-dimensional data structure for checking the dominance relation of already explored nodes during the search. Algorithm Pro-MLCS iteratively extends a fixed number of nodes at each level in a level-by-level manner and is similar to anytime column search [36], which we will consider in more detail in Section 4.2. SA-MLCS, in contrast, applies an iterative beam widening strategy over successive iterations to reduce space requirements. It differs from Pro-MLCS in the data structures utilized to maintain open nodes: SA-MLCS uses a specific priority queue that stores those nodes whose children have not all been expanded yet, which is further exploited to carry over search information from previous iterations and thereby improve efficiency. Last but not least, [34] describes another, memory-bounded variant of SA-MLCS, called SLA-MLCS. A weakness of all these approaches is that they are not able to provide an upper bound on the solution quality and therefore no quality guarantee in case of early termination. Moreover, neither [33] nor [34] provides enough details concerning the multi-dimensional data structure for checking dominance. This unfortunately made it impossible to re-implement the algorithms in all their details, and source code is not provided by the authors. However, in the experimental section of this work we consider the distance measure dist as an alternative heuristic guidance, and we also build upon anytime column search.

Our contributions are as follows. We first propose an exact A* search for the LCS problem, which is derived from components and settings that already proved useful in heuristic BS variants, as determined in our earlier studies [25], [26]. This A* search is shown to be effective for small instances but, as one may expect, it has serious scalability issues in terms of time and memory requirements on larger instances, similar to other exact methods. We therefore extend this A* search by applying two alternative hybrid search strategies from [25], turning the original A* search into effective anytime algorithms for finding an LCS. Both follow the idea of interleaving traditional A* search iterations with heuristic search – either BS or anytime column search [36] – and they are labeled A*+BS and A*+ACS, respectively. The A* framework ensures completeness and provides upper bounds at any time, while the embedded heuristic search iterations rely on the heuristic guidance function from [26], [37] and are responsible for quickly producing a first approximate solution and improving it over time. Most importantly, the heuristic search iterations also operate on the list of open nodes of the A* search in order to avoid redundant node expansions.

Although we employ, from a conceptual point of view, the same hybrid search strategies as in [25], we want to emphasize the significant differences between the adaptation to the longest common palindromic subsequence (LCPS) problem in [25] and the adaptation to the LCS problem presented in this paper. Note, for example, that the best exact algorithms for the LCS problem with two input strings (m=2) require O(n^2) time, while the best exact algorithm for the LCPS problem requires O(n^4) time. This already hints that the two problems are structurally quite different. These differences lead to the following differences in the adaptation of the algorithmic concepts to the two problems:

  • The search spaces of the two problems (in terms of the definition of the A* nodes) differ. This is because in the LCS problem solutions are generated from left to right, while in the LCPS problem solution construction starts from the left and from the right at the same time.

  • The upper bounds utilized for the two problems are different.

  • The expected length calculation heuristics (EX) for guiding the tree search techniques differ, even though similar ideas are used for their derivation.

  • Last but not least, the implementations differ in additional details. For example, in this paper we make use of an efficient way of filtering the dominated nodes in BS and ACS iterations (i.e., the restricted filtering). This was not possible in [25].

In the comprehensive experimental study of this paper we evaluate the proposed approaches on various LCS benchmark sets from the literature and compare them to the results of the so far best methods. Earlier computational studies always considered only a subset of the available benchmark sets. Concerning proven optimality, our A* search is able to solve 106 instances from the literature, exceeding the number of instances solved by Top_MLCS by six. Moreover, in most cases our A* search is faster than Top_MLCS. For the remaining instances, which cannot be solved to optimality, A*+ACS turns out to be the now leading method for most benchmark sets with respect to final solution quality. Moreover, optimality gaps are considered for larger LCS instances for the first time, and those obtained by A*+ACS are shown to be significantly better than those of the other considered approaches on many occasions. Most remarkably, A*+ACS was able to achieve new best known solutions for 82 different LCS instance groups, which corresponds to about 70% of all considered instance groups.

The rest of this article is organized as follows. Section 2 gives an overview of essential previous work and definitions required for the A* search for the LCS problem and its anytime variants. The A* search framework is then presented in Section 3, while Section 4 provides the details of the A*+ACS and A*+BS anytime algorithms. Section 5 comprises the whole computational study. Conclusions and ideas for future work are given in Section 6.

Section snippets

Previous work

In this section we summarize those aspects of previous work that are needed for understanding the anytime algorithms proposed here. Most of this material was already covered in more detail in [26], where we introduced a general BS framework for the LCS problem. Beam Search is a well-known incomplete tree search method that works in a limited Breadth-First Search (BFS) manner. At each step, it maintains a set of nodes – called the beam – from the same level of the search tree.
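The following Python sketch illustrates this scheme on the LCS state graph (a toy version of ours: a node is a vector of front positions, one per input string; the actual BS framework of [26] additionally applies dominance filtering and the EX guidance, whereas here the nodes of a level are ranked by a simple remaining-length criterion):

    import heapq

    def lcs_beam_search(strings, beta=10):
        """Heuristic LCS via beam search: keep only the beta most
        promising nodes of each level of the search tree."""
        sigma = sorted(set(strings[0]))

        def extend(p, c):
            """Front positions after appending letter c, or None."""
            q = []
            for s, pos in zip(strings, p):
                j = s.find(c, pos)
                if j < 0:
                    return None
                q.append(j + 1)
            return tuple(q)

        def guidance(p):
            """Toy guidance: prefer nodes leaving more room everywhere."""
            return min(len(s) - pos for s, pos in zip(strings, p))

        beam = [(tuple(0 for _ in strings), "")]
        best = ""
        while beam:
            children = []
            for p, sol in beam:
                for c in sigma:
                    q = extend(p, c)
                    if q is not None:
                        children.append((q, sol + c))
                        if len(sol) + 1 > len(best):
                            best = sol + c
            # keep only the beta best children as the next beam
            beam = heapq.nlargest(beta, children,
                                  key=lambda node: guidance(node[0]))
        return best

    assert len(lcs_beam_search(["AGCAT", "GAC", "GCA"], beta=5)) == 2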

A* search framework

As mentioned before, our focus in this work is on the development of A*-based anytime algorithms for the LCS problem. A* search [38] is a well-known exact technique widely used in path-finding and planning. It belongs to the class of informed search methods, employing a best-first search strategy. Our A* search for the LCS problem operates on the state graph G as defined in Section 2.2. At each iteration the most promising not-yet-processed/expanded (open) node is expanded. To this end, each
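The following bare-bones Python sketch shows such a best-first A* on the LCS state graph (ours, for illustration only: g is the length of the partial common subsequence, and as an admissible upper bound on the remaining length we use only the trivial bound min over i of |s_i| − p_i; the paper's A* combines two considerably tighter bounds):

    import heapq
    from itertools import count

    def lcs_astar(strings):
        """Exact LCS length via best-first A* search (maximization)."""
        sigma = set(strings[0])

        def successors(p):
            for c in sigma:
                q = []
                for s, pos in zip(strings, p):
                    j = s.find(c, pos)
                    if j < 0:
                        break
                    q.append(j + 1)
                else:
                    yield tuple(q)

        def ub(p):  # admissible upper bound on the remaining LCS length
            return min(len(s) - pos for s, pos in zip(strings, p))

        start = tuple(0 for _ in strings)
        tie = count()  # tie-breaker so the heap never compares tuples
        open_list = [(-ub(start), next(tie), 0, start)]  # max-f first
        best_g = {start: 0}
        incumbent = 0
        while open_list:
            f_neg, _, g, p = heapq.heappop(open_list)
            if -f_neg <= incumbent:
                break             # no open node can beat the incumbent
            if g < best_g.get(p, -1):
                continue          # outdated queue entry
            # every node's partial common subsequence is itself feasible
            incumbent = max(incumbent, g)
            for q in successors(p):
                if g + 1 > best_g.get(q, -1):
                    best_g[q] = g + 1
                    heapq.heappush(open_list,
                                   (-(g + 1 + ub(q)), next(tie), g + 1, q))
        return incumbent

    assert lcs_astar(["AGCAT", "GAC", "GCA"]) == 2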

Anytime algorithms for the LCS problem

When faced with large instances of hard optimization problems, pure exact approaches such as DP or A* search frequently reach their limits. Moreover, if not given enough time (or space) to terminate, these algorithms are not able to provide sub-optimal solutions of reasonable quality. Therefore, the optimization community has, at some point, started to improve such algorithms by adding mechanisms that allow them to be terminated early and nevertheless provide feasible solutions of reasonable quality.
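The interleaving principle behind the proposed anytime variants can be sketched as follows (a heavily simplified toy of ours, not the paper's implementation: the heuristic phase here is a greedy dive, i.e., a beam of width one that only updates the incumbent, whereas the paper's variants run full BS or ACS iterations sharing the A* open list; the trivial bound from the previous sketch stands in for the paper's combined bounds and EX guidance):

    import heapq
    from itertools import count

    def anytime_lcs(strings, astar_steps=100):
        """Alternate bounded A* phases with heuristic dives from the
        best open node, so that an incumbent solution and a global
        upper bound (-open_list[0][0]) are available at any time."""
        sigma = set(strings[0])

        def successors(p):
            for c in sigma:
                q = []
                for s, pos in zip(strings, p):
                    j = s.find(c, pos)
                    if j < 0:
                        break
                    q.append(j + 1)
                else:
                    yield tuple(q)

        def ub(p):
            return min(len(s) - pos for s, pos in zip(strings, p))

        start = tuple(0 for _ in strings)
        tie = count()
        open_list = [(-ub(start), next(tie), 0, start)]
        best_g = {start: 0}
        incumbent = 0
        while open_list:
            # Phase 1: a bounded number of classical A* expansions.
            for _ in range(astar_steps):
                if not open_list:
                    break
                f_neg, _, g, p = heapq.heappop(open_list)
                if -f_neg <= incumbent:
                    return incumbent          # proven optimal
                if g < best_g.get(p, -1):
                    continue
                incumbent = max(incumbent, g)
                for q in successors(p):
                    if g + 1 > best_g.get(q, -1):
                        best_g[q] = g + 1
                        heapq.heappush(open_list,
                                       (-(g + 1 + ub(q)), next(tie), g + 1, q))
            # Phase 2: heuristic dive from the best open node to quickly
            # obtain or improve an incumbent solution.
            if open_list:
                _, _, g, p = open_list[0]
                while True:
                    nxt = max(successors(p), key=ub, default=None)
                    if nxt is None:
                        break
                    p, g = nxt, g + 1
                incumbent = max(incumbent, g)
        return incumbent

    assert anytime_lcs(["AGCAT", "GAC", "GCA"], astar_steps=2) == 2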

Experimental evaluation

In the following we first provide a summary of the algorithms that are considered in the experimental evaluation. These are our two anytime algorithms (1) A*+BS and (2) A*+ACS, (3) the APS algorithm from [49], which is one of the state-of-the-art anytime variants from the literature and which we implemented for comparison purposes, and (4) A*+ACS-dist, which is the variant of A*+ACS in which the heuristic guidance function EX is replaced by the dist() estimation from Pro-MLCS [33] and

Conclusions and future work

We presented an exact A* algorithm for the LCS problem based on the general search framework proposed in our earlier study, which combines features of various other heuristic techniques. This A* search makes use of a combination of two previously known upper bound functions for the length of the LCS and is able to solve instances with up to n=100 and |Σ| ≤ 12 to proven optimality (106 instances from the literature are solved to optimality), most of them in a fraction of a second.

CRediT authorship contribution statement

Marko Djukanovic: Methodology, Software, Writing - original draft. Günther R. Raidl: Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We gratefully acknowledge the financial support of this project by the Doctoral Program “Vienna Graduate School on Computational Optimization” funded by the Austrian Science Fund (FWF) under contract no. W1260-N35.

References (58)

  • Lhoussain, A.S., et al.: Adaptating the Levenshtein distance to contextual spelling correction. Int. J. Adv. Comput. Sci. Appl. (2015)
  • Zielezinski, A., et al.: Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol. (2017)
  • Rieck, K., et al.: Efficient algorithms for similarity measures over sequential data: A look beyond kernels.
  • Sidorov, G., et al.: Soft similarity and soft cosine measure: Similarity of features in vector space model. Comput. Sist. (2014)
  • Rabiner, L., et al.: Considerations in dynamic time warping algorithms for discrete word recognition. IEEE Trans. Acoust. Speech Signal Process. (1978)
  • Ye, Y., et al.: Similarity measures for time series data classification using grid representation and matrix distance. Knowl. Inf. Syst. (2019)
  • Wan, S., et al.: Match-SRNN: Modeling the recursive matching structure with spatial RNN.
  • Islam, A., et al.: Semantic text similarity using corpus-based word similarity and string similarity. ACM Trans. Knowl. Discov. Data (2008)
  • Landauer, T.K., et al.: An introduction to latent semantic analysis. Discourse Process. (1998)
  • Jiang, T., et al.: A general edit distance between RNA structures. J. Comput. Biol. (2002)
  • Storer, J.: Data Compression: Methods and Theory (1988)
  • Beal, R., et al.: A new algorithm for “the LCS problem” with application in compressing genome resequencing data. BMC Genomics (2016)
  • Kruskal, J.B.: An overview of sequence comparison: Time warps, string edits, and macromolecules. SIAM Rev. (1983)
  • Brisk, P., et al.: Area-efficient instruction set synthesis for reconfigurable system-on-chip design.
  • Bergroth, L., et al.: A survey of longest common subsequence algorithms.
  • Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology (1997)
  • Fraser, C.B.: Subsequences and Supersequences of Strings (1995)
  • Huang, K., et al.: Fast algorithms for finding the common subsequences of multiple sequences.
  • Wang, Q., et al.: A fast multiple longest common subsequence (MLCS) algorithm. IEEE Trans. Knowl. Data Eng. (2011)