1 Introduction

1.1 Background

Phylogenetic trees are commonly used in biology to represent evolutionary relationships, with the leaves corresponding to species that exist today and internal vertices to ancestor species that existed in the past [1]. When studying the evolution of a fixed set of species, different available data and tree reconstruction methods can lead to trees that look structurally different. Quantifying this difference is essential to make better evolutionary inferences, which has led to the proposal of several phylogenetic tree distance measures in the literature. For example, to evaluate the accuracy of a tree reconstruction method \(\mathcal {M}\), one can perform the following steps a number of times [2]: First generate a random phylogenetic tree T and let a sequence evolve down the edges of T according to some chosen model of sequence evolution, then apply the method \(\mathcal {M}\) to reconstruct a tree \(T'\), and finally measure the distance between T and \(T'\). Some phylogenetic tree distance measures that are based on counting how many times certain features differ in the two trees are the Robinson-Foulds distance [3], the rooted triplet distance [4] for rooted trees, and the unrooted quartet distance [5] for unrooted trees. Other distance measures are the nearest-neighbor interchange distance (introduced independently in [6] and [7]), the path-length-difference distance [8], the subtree prune-and-regraft distance [9], the maximum agreement subtree [10], and the subtree moving tree edit distance [11].

The rooted phylogenetic network model is an extension of the rooted phylogenetic tree model that allows internal vertices to have more than just one parent [12]. Such networks can describe more complex evolutionary relationships involving reticulation events such as horizontal gene transfer and hybridization. As in the case of phylogenetic trees, it is useful to have distance measures for comparing phylogenetic networks. Therefore, in this article, we study a natural generalization [13] of the rooted triplet distance for phylogenetic trees to rooted phylogenetic networks and present two new algorithms for computing it.

1.2 Problem Definitions

For any vertex u in a directed acyclic graph, let \( in (u)\) and \( out (u)\) be the in-degree and out-degree of u. The vertex u is called a leaf if \( out (u) = 0\), and an internal vertex if \( out (u) \ge 1\). Formally, a rooted phylogenetic network \(N'\) is a directed acyclic graph with one vertex of in-degree 0 (from here on called the root of \(N'\) and denoted by \(r(N')\)), distinctly labeled leaves, and no vertices with both in-degree 1 and out-degree 1. A vertex u in \(N'\) is called a reticulation vertex if \( in (u) \ge 2\) holds. If \(N'\) has no reticulation vertices, i.e., if all vertices in \(N'\) have in-degree at most 1, then \(N'\) is a rooted phylogenetic tree. Below, when referring to a “tree”, we imply a “rooted phylogenetic tree”, and when referring to a “network”, we imply a “rooted phylogenetic network”.

For the rest of this subsection, suppose that \(N'\) is a network. A directed edge from a vertex u to a vertex v in \(N'\) is denoted by \(u \rightarrow v\). A path from u to v in \(N'\) is denoted by \(u \leadsto v\). Let the height of u, written as h(u), be the number of edges in a longest path from u to a leaf in \(N'\). By definition, if v is a parent of u in \(N'\) then \(h(v) > h(u)\). We will use \(\mathcal {L}(N')\) to refer to both the set of leaves in \(N'\) as well as to the set of leaf labels in \(N'\) since they are in one-to-one correspondence.

The level of a network was introduced by Choy et al. [14] as a parameter to measure the treelikeness of a network, with the special case of a level-0 network being a tree and a level-1 network a so-called galled tree [15] in which all underlying cycles are disjoint. The level is defined as follows. Let \(U(N')\) be the undirected graph created by replacing every directed edge in \(N'\) with an undirected edge. An undirected, connected graph H is called biconnected if it has no vertex whose removal makes H disconnected. A maximal subgraph of \(U(N')\) that is biconnected is called a biconnected component of \(U(N')\). (Observe that the biconnected components of \(U(N')\) are edge-disjoint but not necessarily vertex-disjoint.) For any biconnected component of \(U(N'\)), its corresponding subgraph in \(N'\) will be referred to as a block of \(N'\). We say that \(N'\) is a level-k network, or equivalently \(N'\) has level k, if every block of \(N'\) contains at most k reticulation vertices. Figure 1 shows a level-2 and a level-3 network.

If B is a block of \(N'\) consisting of more than two vertices and one edge and B contains at most one vertex that has one or more outgoing edges to vertices not belonging to B then B is called uninformative. See Fig. 2 for an illustration.

Fig. 1
figure 1

\(N_{1}\) is a level-2 network and \(N_{2}\) is a level-3 network with \(\mathcal {L}(N_1) = \mathcal {L}(N_2) = \{a_{1}, a_{2}, a_{3}, a_{4}\}\). In this example, \(D(N_1, N_2) = 6\). Some shared triplets are: \(a_{1}|a_{2}|a_{4}\)\(a_{3}a_{4}|a_{2}\)\(a_{1}a_{3}|a_{2}\). Some triplets consistent with only one network are: \(a_{1}|a_{3}|a_{4}\)\(a_{2}a_{3}|a_{1}\)

Fig. 2
figure 2

The block drawn with solid edges is an uninformative block because it only has one vertex u with outgoing edges to vertices not in the block

Next, a rooted triplet \(\tau \) is a tree with three leaves. If it is binary we say that \(\tau \) is a rooted resolved triplet, and if it is non-binary we say that \(\tau \) is a rooted fan triplet. We say that the rooted fan triplet x|y|z is consistent with \(N'\) if and only if there exists a vertex u in \(N'\) such that there are three directed paths of non-zero length from u to x, from u to y, and from u to z that are vertex-disjoint except for in u. Similarly, we say that the rooted resolved triplet xy|z is consistent with \(N'\) if and only if \(N'\) contains two vertices u and v such that there are four directed paths of non-zero length from u to v, from v to x, from v to y, and from u to z that are vertex-disjoint except for in u and v, and furthermore, the path from u to z does not pass through v. See Fig. 1 for an example. From here on, by “disjoint paths” we imply “vertex-disjoint paths of non-zero length”. Moreover, when referring to a “triplet”, we imply a “rooted triplet”.

Given two networks \(N_{1} = (V_{1}, E_{1})\) and \(N_{2} = (V_{2}, E_{2})\) built on the same leaf label set \(\Lambda \), the rooted triplet distance \(D(N_1, N_2)\), or triplet distance for short, is the number of triplets over \(\Lambda \) that are consistent with exactly one of \(N_{1}\) and \(N_{2}\). Let \(S(N_{1}, N_{2})\) be the total number of shared triplets, i.e., triplets that are consistent with both \(N_{1}\) and \(N_{2}\). Then:

$$\begin{aligned} D(N_{1}, N_{2}) = S(N_{1}, N_{1}) + S(N_{2}, N_{2}) - 2S(N_{1}, N_{2}) \end{aligned}$$
(1.1)

Note that a shared triplet contributes a +1 to \(S(N_{1},N_{1})\), \(S(N_{2},N_{2})\), and \(S(N_{1},N_{2})\), e.g., the triplet \(a_{1}|a_{2}|a_{4}\) in Fig. 1. On the other hand, a triplet from either network that is not shared contributes a +1 to either \(S(N_{1},N_{1})\) or \(S(N_{2},N_{2})\), and a 0 to \(S(N_{1},N_{2})\). As an example, \(a_{1}|a_{3}|a_{4}\) in Fig. 1 contributes a +1 to \(S(N_{1},N_{1})\) and a 0 to \(S(N_{2},N_{2})\) and \(S(N_{1},N_{2})\).

Let \(S_{r}(N_{1},N_{2})\) and \(S_{f}(N_{1},N_{2})\) be the number of shared resolved and shared fan triplets, respectively. Then \(S(N_{1},N_{2}) = S_{r}(N_{1},N_{2}) + S_{f}(N_{1},N_{2})\), which implies that \(D(N_{1}, N_{2})\) can be obtained by considering shared resolved triplets and shared fan triplets separately.

The rest of this article is focused on how to compute \(D(N_{1}, N_{2})\) efficiently. We shall use the following notation to express the time complexities of various algorithms. For \(i \in \{1,2\}\), the network \(N_{i}\) has vertex set \(V_{i}\) and edge set \(E_{i}\). The level of \(N_{i}\) is \(k_{i}\) and the maximum in-degree taken over all vertices in \(N_{i}\) is \(d_{i}\). We assume that the two given networks \(N_{1}\) and \(N_{2}\) have the same leaf label set \(\Lambda \), and write \(n = |\Lambda |\), \(N = \max (|V_{1}|,|V_{2}|)\), \(M = \max (|E_{1}|, |E_{2}|)\), \(k = \max (k_{1},k_{2})\), and \(d = \max (d_{1}, d_{2})\).

To simplify the descriptions of the algorithms, we will also assume that: (i) there is no vertex u satisfying both \( in (u) > 1\) and \( out (u) = 0\), i.e., all leaves have in-degree at most 1; and (ii) there are no uninformative blocks in \(N_{1}\) and \(N_{2}\). Assumption (i) is justified because every leaf u with in-degree larger than 1 can be replaced by an internal vertex to which a leaf with the same leaf label as u is attached, and the resulting network will be consistent with exactly the same triplets as before. Assumption (ii) is justified because first each uninformative block can be replaced by an edge, and then each vertex with in-degree 1 and out-degree 1 can be eliminated by contracting its outgoing edge; the resulting network will be consistent with the same triplets as the original network. If necessary, checking the input networks \(N_{1}\) and \(N_{2}\) and modifying them to ensure that they comply with (i) and (ii) before running the algorithms takes O(M) time, e.g., by using Hopcroft-Tarjan’s algorithm [16] to identify the biconnected components of \(U(N_{1})\) and \(U(N_{2})\).

1.3 Previous Work

The rooted triplet distance was introduced by Dobson [4] in 1975 for trees, and generalized to networks by Gambette and Huber [13] in 2012. See also [17, Section 3.2] for a short discussion about the definition.

Table 1 lists the time complexities of some previously known algorithms and our new ones for computing \(D(N_{1},N_{2})\). When \(k=0\), both \(N_{1}\) and \(N_{2}\) are trees. This case has been extensively studied in the literature [4, 18,19,20,21,22,23,24], with the most efficient algorithms in theory and practice [19, 20, 24] running in \(\mathrm {O}(n \log n)\) time. For \(k=1\), an \(\mathrm {O}(n^{2.687})\)-time algorithm based on counting 3-cycles in an auxiliary graph was given in [17], and a faster, \(O(n \log n)\)-time algorithm that transforms the input to a constant number of instances with \(k = 0\) was given in [25]. All of these algorithms allow the vertices in the input networks to have arbitrary degrees. Moreover, software implementations of the fast algorithms for \(k = 0\) and \(k = 1\) are available [20, 23,24,25].

Table 1 Previous and new results for computing \(D(N_{1},N_{2})\), where \(N_{1}\) and \(N_{2}\) are two phylogenetic networks built on the same leaf label set \(\Lambda \)

For \(k > 1\), much less is known. In a special “binary degree” case where the phylogenetic networks’ roots have out-degree 2 and all other internal vertices have either in-degree 2 and out-degree 1, or in-degree 1 and out-degree 2, one can adapt a technique developed by Byrka et al. [27] for a problem related to finding a network consistent with as many resolved triplets as possible from a given set. They showed how to preprocess any fixed network \(N' = (V,E)\) satisfying the binary degree constraints so that checking if a resolved triplet is consistent with \(N'\) can be done efficiently. Below, we shall refer to this preprocessing as constructing a data structure \(\mathcal {D}\) such that \(\mathcal {D}\) can be used to determine whether any specified resolved triplet is consistent with \(N'\) in \(\mathrm {O}(1)\) time. The proof of Lemma 2 in [27] showed how to build \(\mathcal {D}\) in \(\mathrm {O}(|V|^{3})\) time. According to Remark 1 in [27], this can be further improved to \(\mathrm {O}(|V| + |V| k^{2})\), where k is the level of \(N'\). The rooted triplet distance can thus be computed in \(\mathrm {O}(N + N k^{2} + n^{3})\) time in a straightforward way when \(N_1\) and \(N_2\) obey the special binary degree constraints. A limitation of \(\mathcal {D}\) is that it can only support consistency queries for resolved triplets, while a network with no restrictions on the vertices’ degrees may also contain fan triplets.

In the general case, when \(N_{1}\) and \(N_{2}\) have unbounded degrees and unbounded levels, it is possible to compute \(D(N_{1},N_{2})\) by iterating over all \(4\left( {\begin{array}{c}n\\ 3\end{array}}\right) \) triplets, and for each such triplet applying the classic directed acyclic graph pattern matching algorithm in [26] to determine its consistency with \(N_{1}\) and \(N_{2}\). However, this leads to a time complexity of \({\Omega }(N^{6} n^{3})\). To see this, let P in Theorem 3 in [26] be a resolved triplet and G a phylogenetic network \(N_i\) with \(|V_i|\) vertices. P has two internal nodes and four edges, so the algorithm will consider \(\left( {\begin{array}{c}|V_i|\\ 2\end{array}}\right) \) ways of mapping the two internal nodes of P to vertices in \(N_i\), and for each one, construct a configuration graph \(G'\) with \({\Omega }((|V_i|+1)^{4})\) vertices and look for a path in \(G'\). Hence, the algorithm will use \({\Omega }(|V_i|^{6})\) time for each resolved triplet to check if it occurs in \(N_i\), i.e., \({\Omega }(N^{6} n^{3})\) time in total.

1.4 New Results

Here, we develop two algorithms that significantly improve upon the time complexity of computing the rooted triplet distance in the general, unbounded case. The running time of our first algorithm is \(\mathrm {O}(N^{2} M + n^{3})\). One key insight is that a technique of Perl and Shiloach for identifying two disjoint paths between two pairs of vertices in a directed acyclic graph [28] can be extended to check if a fan triplet or a resolved triplet is embedded in a phylogenetic network, leading to the useful concepts of a fan graph and a resolved graph. Our second algorithm then augments these ideas with so-called block trees and contracted block networks to obtain a running time of \(\mathrm {O}(M + N k^{2} d^{2} + n^{3})\). Neither algorithm has a strictly better time complexity than the other one for all possible inputs. In the special case where \(N_{1}\) and \(N_{2}\) follow the binary degree constraints of Byrka et al. [27], the time complexity reduces to \(\mathrm {O}(N + N k^{2} + n^{3})\), matching the bound in [27].

We also provide implementations of our algorithms, evaluate their performance on simulated and real datasets, and make some observations on the limitations of the current definition of the rooted triplet distance in practice. Our prototype implementations have been packaged into the first publicly available software for computing the triplet distance between two unrestricted networks of arbitrary levels.

1.5 Organization of the Article

Section 2 describes our first new algorithm and Sect. 3 the second one. Section 4 presents an implementation of both our algorithms and experiments illustrating their practical performance. Finally, Sect. 5 gives some concluding remarks.

2 A First Approach

This section presents an algorithm that computes \(D(N_{1}, N_{2})\) in \(\mathrm {O}(N^{2} M + n^{3})\) time.

Overview. The algorithm consists of a preprocessing step and a triplet distance computation step. For the preprocessing step, we extend a technique introduced by Perl and Shiloach [28] to construct suitably defined auxiliary graphs that compactly encode disjoint paths within \(N_{1}\) and \(N_{2}\). Two graphs, the fan graph and resolved graph, are created that enable us to check the consistency of any fan triplet and any resolved triplet, respectively, with \(N_{1}\) and \(N_{2}\) in \(\mathrm {O}(1)\) time. In the triplet distance computation step, we compute \(D(N_{1}, N_{2})\) by iterating over all possible \(4 \left( {\begin{array}{c}n\\ 3\end{array}}\right) \) triplets and using the fan and resolved graphs to check the consistency of each triplet with \(N_{1}\) and \(N_{2}\) efficiently.

2.1 Preprocessing

Let \(G = (V, E)\) be a directed acyclic graph and \(s_{1}\), \(t_{1}\), \(s_{2}\), and \(t_{2}\) four vertices in G. Perl and Shiloach [28] gave an algorithm that can find two vertex-disjoint paths, one from \(s_1\) to \(t_1\) and one from \(s_2\) to \(t_2\), in O(|V||E|) time or determine that no such pair of paths exists. They achieve this by creating a directed graph \(G' = (V', E')\) in \(\mathrm {O}(|V||E|)\) time, with the property that the existence of such a pair of vertex-disjoint paths in G is equivalent to the existence of a directed path from \(\langle s_{1}, s_{2} \rangle \) to \(\langle t_{1}, t_{2} \rangle \) in \(G'\), where \(\langle s_{1}, s_{2} \rangle \) and \(\langle t_{1}, t_{2} \rangle \) are vertices in \(G'\). A fan triplet or resolved triplet involves more than two vertex-disjoint paths, and below we show how to extend the technique by Perl and Shiloach [28] to determine if a given network has the necessary vertex-disjoint paths that would imply the consistency of a given triplet with the network.

2.1.1 The Fan Graph

For any network \(N_{i} = (V_{i}, E_{i})\), let its fan graph \(N_{i}^{f} = (V_{i}^{f}, E_{i}^{f})\) be a graph such that \(V_{i}^{f} = \{s\} \cup \{(u,v,w) \,\mid \, u,v,w \in V_{i},\, u \ne v,\, u \ne w,\, v \ne w\}\) and \(E_{i}^{f}\) includes the following directed edges:

  1. 1.

    \(\{(u_{1}, v_{1}, w_{1}) \rightarrow (u_{2}, v_{1}, w_{1}) \,\mid \, u_{1} \rightarrow u_{2} \in E_{i},\, h(u_{1}) \ge \max (h(v_{1}), h(w_{1}))\}\)

  2. 2.

    \(\{(u_{1}, v_{1}, w_{1}) \rightarrow (u_{1}, v_{2}, w_{1}) \,\mid \, v_{1} \rightarrow v_{2} \in E_{i},\, h(v_{1}) \ge \max (h(u_{1}), h(w_{1}))\}\)

  3. 3.

    \(\{(u_{1}, v_{1}, w_{1}) \rightarrow (u_{1}, v_{1}, w_{2}) \,\mid \, w_{1} \rightarrow w_{2} \in E_{i},\, h(w_{1}) \ge \max (h(u_{1}), h(v_{1}))\}\)

  4. 4.

    \(\{s \rightarrow (u,v,w) \,\mid \, u \rightarrow v \in E_{i},\, u \rightarrow w \in E_{i}\} \)

Every 3-tuple of vertices from \(N_{i}\) with distinct entries is represented by a vertex in \(N_{i}^{f}\). Refer to Fig. 3 for an example. Note that \(N_{i}^{f}\) contains \(\mathrm {O}(|V_{i}|^{3})\) vertices and \(\mathrm {O}(|V_{i}|^{2} |E_{i}|)\) edges, and can be constructed in \(\mathrm {O}(|V_{i}|^{2} |E_{i}|)\) time. It also has the property described in the following lemma, which generalizes Theorem 3.1 in [28].

Fig. 3
figure 3

Illustrating the fan graph. a An example network \(N_{i}\). Every internal vertex is labeled by a letter and its height. b Consider the triplet \(a_{3}|a_{6}|a_{4}\). Lemma 2.1 implies that it is consistent with \(N_{i}\) because there is a path \((s,\, (b,f,h),\, (c,f,h),\, (e,f,h),\, (e,f,i),\, (e,a_{6},i),\, (e,a_{6},a_{4}),\, (a_{3},a_{6},a_{4}))\) in the fan graph \(N_{i}^{f}\). A small part of \(N_i^{f}\) is drawn here, with the two directed edges \((c,f,h) \rightarrow (e,f,h)\) and \((e,f,h) \rightarrow (e,f,i)\) in the path from s to \((a_{3},a_{6},a_{4})\) indicated

Lemma 2.1

Consider a network \(N_{i}\) and its fan graph \(N_{i}^{f} = (V_{i}^{f}, E_{i}^{f})\). For any three different leaves x, y, and z in \(N_{i}\), vertex s can reach vertex (xyz) in \(N_{i}^{f}\) if and only if the fan triplet x|y|z is consistent with \(N_{i}\).

Proof

\((\leftarrow )\) Let x|y|z be any fan triplet consistent with \(N_{i}\). By definition, there exists an internal vertex q in \(N_{i}\) and three disjoint paths (except for in q), one from q to x, one from q to y, and one from q to z. Denote these paths by \((q,x_{0},x_{1},\dots ,x_{a})\), \((q,y_{0},y_{1},\dots ,y_{b})\), and \((q,z_{0},z_{1},\dots ,z_{c})\), where \(x_{a} = x\), \(y_{b} = y\), and \(z_{c} = z\). Then \(N_{i}^{f}\) also contains the following three paths:

  • \((s,\, (q,y_{0},z_{0}))\): This can be seen from \(q\rightarrow y_{0} \in E_{i}\) and \(q\rightarrow z_{0} \in E_{i}\).

  • \(((q,y_{0},z_{0}),\, (x_{0},y_{0},z_{0}))\): This follows from the fact that \(q\rightarrow x_{0} \in E_{i}\) and \(h(q) > h(y_{0}), h(z_{0})\).

  • \(((x_{0},y_{0},z_{0}),\, \dots ,\, (x_{a},y_{b},z_{c}))\): This is because \(h(x_{0})> h(x_{1})> \ldots > h(x_{a})\), \(h(y_{0})> h(y_{1})> \ldots > h(y_{b})\), and \(h(z_{0})> h(z_{1})> \ldots > h(z_{c})\) hold, and \((x_{0},\dots ,x_{a})\), \((y_{0},\dots ,y_{b})\), and \((z_{0},\dots ,z_{c})\) are paths in \(N_{i}\).

By concatenating the three paths above, we get a path in \(N_{i}^{f}\) from s to (xyz).

\((\rightarrow )\) Because s can reach (xyz) in \(N_{i}^{f}\), there exists a path P in \(N_{i}^{f}\) of the form \(P = (s,\, (x_{1},y_{1},z_{1}),\, (x_{2},y_{2},z_{2}),\, \dots ,\, (x_{t},y_{t},z_{t}))\), where \(x_{t} = x\), \(y_{t} = y\), and \(z_{t} = z\). Let \(S_{1} = (x_{1},x_{2},\dots ,x_{t})\), \(S_{2} = (y_{1},y_{2},\dots ,y_{t})\), and \(S_{3} = (z_{1},z_{2},\dots ,z_{t})\), where \(x_{t} = x\), \(y_{t} = y\), and \(z_{t} = z\), be three sequences of vertices from \(N_{i}\) obtained from P.

We prove by induction that the three paths obtained by following the sequences \(S_{1}\)\(S_{2}\), and \(S_{3}\) are disjoint paths in \(N_{i}\). Consider any \(j \in \{1,2,\dots ,t\}\). When \(j=t\), all three vertices \(x_{t}\), \(y_{t}\), and \(z_{t}\) are different according to the definition of \(V_{i}^{f}\). For \(j < t\), by the inductive hypothesis we have that \((x_{j+1},\dots ,x_{t})\), \((y_{j+1},\dots ,y_{t})\) and \((z_{j+1},\dots ,z_{t})\) yield disjoint paths. In addition, by the definition of the fan graph \(N_{i}^{f}\), for every \(j \in \{1,2,\dots ,t-1\}\), one of the following three cases holds: (1) \(x_{j} \ne x_{j+1}\) only, (2) \(y_{j} \ne y_{j+1}\) only, and (3) \(z_{j} \ne z_{j+1}\) only. In case (1), note that \(y_{j} = y_{j+1}\) and \(z_{j} = z_{j+1}\), which means that \((x_{j+1},\dots ,x_{t})\), \((y_{j},\dots ,y_{t})\) and \((z_{j},\dots ,z_{t})\) yield disjoint paths. We now show that \(x_{j}\) cannot appear in any of these three paths. It holds that \(h(x_{j}) \ge \max (h(y_{j}), h(z_{j}))\), so for \(\mu \ge j+1\) and \(y_{\mu } \ne y_{j}\), we have \(h(x_{j}) > h(y_{\mu })\). Similarly, for \(\mu \ge j+1\) and \(z_{\mu } \ne z_{j}\), we have \(h(x_{j}) > h(z_{\mu })\). Together with the fact that \(x_{j}\)\(y_{j}\), and \(z_{j}\) are different according to the definition of \(N_{i}^{f}\), we deduce that the three paths obtained from \((x_{j},\dots ,x_{t})\), \((y_{j},\dots ,y_{t})\), and \((z_{j},\dots ,z_{t})\) are disjoint. Cases (2) and (3) can be argued in the same way. Thus, following \(S_{1}\), \(S_{2}\), and \(S_{3}\) yields three disjoint paths.

Finally, since P contains a directed edge from s to \((x_{1},y_{1},z_{1})\), \(N_{i}\) contains an edge from \(x_{1}\) to \(y_{1}\) and an edge from \(x_{1}\) to \(z_{1}\). Therefore, the three paths in \(N_{i}\) that start at the internal vertex \(x_{1}\) and then follow the sequences \(S_{1}\), \(S_{2}\), and \(S_{3}\), respectively, are disjoint paths (except for in \(x_{1}\)) to x, y, and z. By definition, x|y|z is consistent with \(N_{i}\). \(\square \)

Corollary 2.2

Let \(N_{i}\) be a given network and \(r'\) a dummy leaf attached to \(r(N_{i})\). For any two different leaves x and y in \(N_{i}\) that are not \(r'\), there are two paths from \(r(N_{i})\) to x and y that are disjoint, except for in \(r(N_{i})\), if and only if s can reach \((r',x,y)\) in \(N_{i}^{f}\).

2.1.2 The Resolved Graph

For any network \(N_{i}\), let its resolved graph \(N_{i}^{r} = (V_{i}^{r}, E_{i}^{r})\) be a graph such that \(V_{i}^{r} = \{s\} \cup \{(u,v) \,\mid \, u,v \in V_{i},\, u \ne v\} \cup \{(u,v,w) \,\mid \, u,v,w \in V_{i},\, u \ne v,\, u\ne w,\, v \ne w \}\) and \(E_{i}^{r}\) includes the following directed edges:

  1. 1.

    \(\{s \rightarrow (u,v) \,\mid \, u \rightarrow v \in E_{i}\}\)

  2. 2.

    \(\{(u_{1}, v_{1}) \rightarrow (u_{2}, v_{1}) \,\mid \, u_{1} \rightarrow u_{2} \in E_{i},\, h(u_{1}) \ge h(v_{1})\}\)

  3. 3.

    \(\{(u_{1}, v_{1}) \rightarrow (u_{1}, v_{2}) \,\mid \, v_{1} \rightarrow v_{2} \in E_{i},\, h(v_{1}) \ge h(u_{1})\}\)

  4. 4.

    \(\{(u,v) \rightarrow (u,v,w) \,\mid \, v \rightarrow w \in E_{i},\, h(v) \ge h(u)\}\)

  5. 5.

    \(\{(u_{1},v_{1},w_{1}) \rightarrow (u_{2},v_{1},w_{1}) \,\mid \, u_{1} \rightarrow u_{2} \in E_{i},\, h(u_{1}) \ge \max (h(v_{1}),h(w_{1}))\}\)

  6. 6.

    \(\{(u_{1},v_{1},w_{1}) \rightarrow (u_{1},v_{2},w_{1}) \,\mid \, v_{1} \rightarrow v_{2} \in E_{i},\, h(v_{1}) \ge \max (h(u_{1}),h(w_{1}))\}\)

  7. 7.

    \(\{(u_{1},v_{1},w_{1}) \rightarrow (u_{1},v_{1},w_{2}) \,\mid \, w_{1} \rightarrow w_{2} \in E_{i},\, h(w_{1}) \ge \max (h(u_{1}),h(v_{1}))\}\)

Note that \(N_{i}^{r}\) contains \(\mathrm {O}(|V_{i}|^{3})\) vertices and \(\mathrm {O}(|V_{i}|^{2} |E_{i}|)\) edges, can be constructed in \(\mathrm {O}(|V_{i}|^{2} |E_{i}|)\) time, and has the property described in the following lemma:

Lemma 2.3

Consider a network \(N_{i}\) and its resolved graph \(N_{i}^{r}=(V_{i}^{r},E_{i}^{r})\). For any three different leaves x, y, and z in \(N_{i}\), vertex s can reach vertex (xyz) in \(N_{i}^{r}\) if and only if the resolved triplet yz|x is consistent with \(N_{i}\).

Proof

\((\leftarrow )\) If yz|x is consistent with \(N_{i}\) then \(N_{i}\) contains three paths of the following form: (1) \((x_{0}, x_{1}, \dots , x_{a})\); (2) \((x_{0},y_{1},\dots ,y_{j},y_{j+1},\dots ,y_{b})\); and (3) \((y_{j},z_{1},\dots ,z_{c})\); such that the three paths are vertex-disjoint except for in \(x_{0}\) and \(y_{j}\), the first path does not pass through \(y_{j}\), and it holds that \(x_{a}=x\), \(y_{b}=y\), and \(z_{c} = z\).

Let \(x_{\mu }\) be a vertex on the first path satisfying \(h(x_{\mu -1}) > h(y_{j}) \ge h(x_{\mu })\). Then \((s,\, (x_{0}, y_{1}),\, \dots ,\, (x_{\mu },y_{j}),\, (x_{\mu },y_{j},z_{1}),\, \dots ,\, (x_{a}, y_{b}, z_{c}))\) is a path in \(N_{i}^{r}\).

\((\rightarrow )\) If there is a path from s to (xyz) in \(N_{i}^{r}\), it must be of the form \((s,\, (x_{1},y_{1}),\, (x_{2},y_{2}),\, \dots ,\, (x_{q}, y_{q}),\, (x_{q+1},y_{q+1},z_{q+1}),\, \dots ,\, (x_{t},y_{t},z_{t}))\), with \(x_{t} = x\), \(y_{t} = y\), and \(z_{t} = z\). By the definitions, we have \(x_{1} \rightarrow y_{1} \in E_{i}\), \(x_{q} = x_{q+1}\), \(y_{q} = y_{q+1}\), and \(y_{q} \rightarrow z_{q+1} \in E_{i}\). Define three sequences of vertices from \(N_{i}\) as follows: \(S_{1} = (x_{1}, x_{2}, \dots , x_{t})\), \(S_{2} = (y_{1}, y_{2}, \dots , y_{t})\), and \(S_{3} = (z_{q+1}, z_{q+2}, \dots , z_{t})\).

We claim that following the sequences \(S_{1}\), \(S_{2}\), and \(S_{3}\) yields three disjoint paths in \(N_{i}\). (This claim is shown below.) The claim and the fact that \(N_{i}^{r}\) contains an edge from s to \((x_{1},y_{1})\) and an edge from \((x_{q},y_{q})\) to \((x_{q+1},y_{q+1},z_{q+1})\) then imply that \(N_{i}\) contains a path from \(x_{1}\) to x, a path from \(x_{1}\) to \(y_{q}\), a path from \(y_{q}\) to y, and a path from \(y_{q}\) to z that make yz|x consistent with \(N_{i}\).

To prove the claim, we show that the paths obtained by following the sequences of vertices listed below are disjoint:

  1. (a)

    \((x_{1},x_{2},\dots ,x_{q})\) and \((y_{1},y_{2},\dots ,y_{q})\)

  2. (b)

    \((x_{q+1},x_{q+2},\dots ,x_{t})\), \((y_{q+1},y_{q+2},\dots ,y_{t})\), and \((z_{q+1},z_{q+2},\dots ,z_{t})\)

  3. (c)

    \((x_{1},x_{2},\dots ,x_{q})\) and \((y_{q+1},y_{q+2},\dots ,y_{t})\)

  4. (d)

    \((x_{1},x_{2},\dots ,x_{q})\) and \((z_{q+1},z_{q+2},\dots ,z_{t})\)

  5. (e)

    \((y_{1},y_{2},\dots ,y_{q})\) and \((z_{q+1},z_{q+2},\dots ,z_{t})\)

  6. (f)

    \((y_{1},y_{2},\dots ,y_{q})\) and \((x_{q+1},x_{q+2},\dots ,x_{t})\)

To prove that the paths obtained by following the sequences in (a) are disjoint we use induction. By the definition of \(N_{i}^{r}\), we know that \(x_{q} \ne y_{q}\). For the inductive hypothesis, assume that the paths obtained from \((x_{j+1},\dots ,x_{q})\) and \((y_{j+1},\dots ,y_{q})\) are disjoint. Again by definition, there are two cases: (1) \(x_{j} \ne x_{j+1}\) only; and (2) \(y_{j} \ne y_{j+1}\) only. For (1), we have \(y_{j} = y_{j+1}\) and \(h(x_{j}) \ge h(y_{j})\), thus for \(\mu > j+1\) and \(y_{\mu } \ne y_{j}\), we have \(h(x_{j}) > h(y_{\mu })\). Together with \(x_{j} \ne y_{j}\), we can see that \(x_{j}\) does not appear in \((y_{j},\dots ,y_{q})\). Case (2) can be handled in the same way. Thus, the paths from (a) are disjoint.

For (b), the induction proof from the proof of Lemma 2.1 immediately implies that the three paths are disjoint.

To show that the paths obtained from (c) are disjoint, let \(j \in \{1,\dots ,q\}\) be the largest index such that \(x_{j} \ne x_{q}\). We know from the paths in (b) that \(x_{q} = x_{q+1}\) does not appear in \((y_{q+1},\dots ,y_{t})\), so we only need to prove that \((x_{1},\dots ,x_{j})\) is disjoint from \((y_{q+1},\dots ,y_{t})\). Because \(x_{j} \ne x_{q}\), there exists some \(\mu \in \{1,\dots ,q\}\) such that \((x_{j},y_{\mu }) \rightarrow (x_{q}, y_{\mu })\) is in the path from s to (xyz). By definition \(x_{j} \ne y_{\mu }\) and \(h(x_{j}) \ge h(y_{\mu })\). We consider the following two cases: (1) \(h(x_{j}) > h(y_{\mu })\) and (2) \(h(x_{j}) = h(y_{\mu })\). In case (1), because of \(h(x_{1}),\dots ,h(x_{j}) > h(y_{\mu }),\dots ,h(y_{t})\), the paths from (c) are disjoint. In case (2), let \(g \in \{1,\dots ,j\}\) be the maximum index such that \(x_{g} \ne x_{j}\). Since \(h(x_{g}) > h(x_{j}) = h(y_{\mu })\), using the same argument as in (1), we have that \((x_{1},\dots ,x_{g})\) and \((y_{\mu },\dots ,y_{t})\) are disjoint. It only remains to show that \(x_{j}\) does not appear in \((y_{\mu },\dots ,y_{t})\). If we assume that \(x_{j}\) appears in \((y_{\mu },\dots ,y_{t})\) then because \(y_{\mu } \ne x_{j}\), we would have \(h(y_{\mu }) > h(x_{j})\), which leads to a contradiction.

For the paths from (d), similar arguments as in (c) can be applied since \(y_{q} \rightarrow z_{q+1} \in E_{i}\), \(x_{q} = x_{q+1}\), and \(x_{q+1}\ne z_{q+1}\).

To show that the paths from (e) are disjoint, because \(y_{q} \rightarrow z_{q+1} \in E_{i}\), we have \(h(y_{1}),\dots ,h(y_{q}) > h(z_{q+1}),\dots ,h(z_{t})\), meaning that the paths from (e) are disjoint.

Finally, to show that the paths from (f) are disjoint, by definition we have \(x_{q} = x_{q+1}\) and \(h(y_{q}) \ge h(x_{q})\). So for every \(\mu > q+1\) and \(x_{\mu } \ne x_{q}\), it holds that \(h(y_{q}) > h(x_{\mu })\). Since we also have that \(x_{q} \ne y_{q}\), the paths from (f) are disjoint. \(\square \)

Corollary 2.4

Let \(N_{i}\) be a given network and \(r'\) a dummy leaf attached to \(r(N_{i})\). For any two different leaves x and y in \(N_{i}\) that are not \(r'\), there are two paths from some internal vertex \(z \ne r(N_{i})\) in \(N_{i}\) to x and y that are disjoint, except for in z, if and only if s can reach \((r',x,y)\) in \(N_{i}^{r}\).

2.1.3 The Fan Table and the Resolved Table

Given \(N^{f}_{i}\) and \(N^{r}_{i}\), we define the \(n \times n \times n \) fan table \(A_{i}^{f}\) and the \(n \times n \times n \) resolved table \(A_{i}^{r}\) as follows. For any three different leaves xy, and z\(A_{i}^{f}[x][y][z] = 1\) if the fan triplet x|y|z is consistent with \(N_{i}\) and \(A_{i}^{f}[x][y][z] = 0\) otherwise. Similarly, \(A_{i}^{r}[x][y][z] = 1\) if the resolved triplet x|yz is consistent with \(N_{i}\) and \(A_{i}^{r}[x][y][z] = 0\) otherwise.

With the help of Lemmas 2.1 and 2.3, both \(A_{i}^{f}\) and \(A_{i}^{r}\) can be precomputed by depth-first traversals (starting from s) of \(N^{f}_{i}\) and \(N^{r}_{i}\). More precisely, \(A_{i}^{f}[x][y][z] = 1\) if s can reach (xyz) in \(N_{i}^{f}\) and 0 otherwise, and \(A_{i}^{r}[x][y][z] = 1\) if s can reach (xyz) in \(N_{i}^{r}\) and 0 otherwise.

Since \(N^{f}_{i}\) and \(N^{r}_{i}\) have \(\mathrm {O}(|V_{i}|^{3})\) vertices and \(\mathrm {O}(|V_{i}|^{2} |E_{i}|)\) edges, the time needed to build \(A_{i}^{f}\) and \(A_{i}^{r}\) by depth-first traversals is \(\mathrm {O}(|V_{i}|^{3} + |V_{i}|^{2} |E_{i}|) = \mathrm {O}(|V_{i}|^{2} |E_{i}|)\).

2.2 Triplet Distance Computation

Algorithm 1 summarizes the steps for computing the triplet distance between two networks \(N_{1}\) and \(N_{2}\). The main procedure, D(), uses Equation (1.1) to calculate \(D(N_1, N_2)\). It first builds the fan table \(A_{i}^{f}\) and the resolved table \(A_{i}^{r}\) for each \(N_{i}\), \(i \in \{1,2\}\), in a preprocessing step, and then relies on the procedure S() for counting shared triplets. The shared fan triplets and shared resolved triplets are counted by iterating over all possible triplets and using the fan and resolved tables to determine the consistency of any triplet with each of the two networks. The correctness is ensured by Lemmas 2.1 and 2.3.

figure a

To analyze the running time, building the data structures \(N_{i}^{r}\) and \(N_{i}^{f}\) for \(i \in \{1,2\}\) on line 3 takes \(\mathrm {O}(|V_{1}|^{2} |E_{1}| + |V_{2}|^{2} |E_{2}|)\) time. Building the tables \(A_{i}^{r}\) and \(A_{i}^{f}\) on lines 4’7 requires \(\mathrm {O}(|V_{1}|^{2} |E_{1}| + |V_{2}|^{2} |E_{2}|)\) time as well. After the preprocessing is finished, the procedures \(S_{f}()\) and \(S_{r}()\) take \(\mathrm {O}(n^{3})\) time because each of the \(4 \left( {\begin{array}{c}n\\ 3\end{array}}\right) = \mathrm {O}(n^{3})\) triplets can be checked in \(\mathrm {O}(1)\) time by table lookups. Hence, the total running time of the algorithm becomes \(\mathrm {O}(|V_{1}|^{2} |E_{1}| + |V_{2}|^{2} |E_{2}| + n^{3})\). By the definitions of N and M (see Sect. 1), the time complexity is \(\mathrm {O}(N^{2} M + n^{3})\). We have obtained the following theorem:

Theorem 2.5

The triplet distance between two networks \(N_{1}\) and \(N_{2}\) can be computed in \(\mathrm {O}(N^{2} M + n^{3})\) time.

3 A Second Approach

In this section, we show how to compute \(D(N_{1}, N_{2})\) in \(\mathrm {O}(M + N k^{2} d^{2} + n^{3})\) time.

Overview. Algorithm 1 in the previous section computed \(D(N_{1}, N_{2})\) by iterating over all possible triplets and using the fan and resolved tables for \(N_{1}\) and \(N_{2}\) to identify which triplets were consistent with both networks. To refine this idea, for every block of \(N_{i}\), we will define a network of approximately the same size as the block, which we call a contracted block network. For every such contracted block network, we build a fan and resolved graph and the corresponding fan and resolved table. Furthermore, by replacing the blocks of \(N_{i}\) by single vertices, we obtain a tree structure called the block tree. The new algorithm in this section combines the block tree and all the fan and resolved tables of the contracted block networks of \(N_{i}\) to efficiently determine whether or not any specified triplet is consistent with \(N_{i}\).

3.1 Preprocessing

Let \(N_{i}\) be a network. Note that every block B of \(N_{i}\) contains one vertex whose height is greater than the heights of all other vertices in B. This vertex will be called the root of B and denoted by r(B). If B contains only one edge \(u \rightarrow v\) and \(v \in \mathcal {L}(N_{i})\) then B is called a leaf block; otherwise, B is called a \(non-leaf block \). Recall from Sect. 1.2 that we assume without loss of generality that: (i) all leaves have in-degree at most 1 (so that every leaf has a leaf block); and (ii) the input networks have no uninformative blocks. Lemma 3.1 presents an important property of the blocks in \(N_{i}\).

Lemma 3.1

All blocks of a given network \(N_{i}\) are edge-disjoint.

Proof

For the purpose of obtaining a contradiction, suppose that \(N_{i}\) has two different blocks \(B_{1} = (V_{1}, E_{1})\) and \(B_{2} = (V_{2}, E_{2})\) that share an edge. Define \(B = (V_{1} \cup V_{2}, E_{1} \cup E_{2})\). Let \(U(B_{1})\), \(U(B_{2})\), and U(B) be the subgraphs of \(U(N_{i})\) corresponding to \(B_{1}\), \(B_{2}\), and B. Since \(U(B_{1})\) and \(U(B_{2})\) are connected graphs that share an edge, U(B) is also connected. Furthermore, if any vertex is removed from B, U(B) will still be connected. Therefore, \(U(B_{1})\) and \(U(B_{2})\) are not maximal biconnected subgraphs of \(U(N_{i})\), which means \(B_{1}\) and \(B_{2}\) are not blocks of \(N_{i}\). Hence, we have reached a contradiction and the lemma follows. \(\square \)

3.1.1 The Block Tree

From a high-level perspective, we will remove the cycles in \(U(N_{i})\) by replacing the non-leaf blocks by internal nodes to obtain a rooted tree on the leaf label set \(\mathcal {L}(N_{i})\). A similar idea was previously used by Choy et al. in the proof of Lemma 2 in [14] to bound the number of reticulation vertices in a network, and later by Byrka et al. [27] to efficiently check if a resolved triplet is consistent with a network. Below, we will show that it is also useful for checking if a fan triplet is consistent with a network.

Formally, let \(T_{i} = (V', E')\) be a rooted tree, from now on referred to as the block tree, with vertex set \(V'\) and edge set \(E'\) constructed as follows:

  1. 1.

    For every block \(B_{j}\) in \(N_{i}\), create a vertex \(b_{j}\) in \(T_{i}\).

  2. 2.

    Let \(B_{1}\), \(B_{2}\) be two blocks in \(N_{i}\) with \(r(B_{1}) \ne r(B_{2})\). If \(r(B_{2})\) is also a vertex in \(B_{1}\) then create the edge \(b_{1} \rightarrow b_{2}\) in \(T_{i}\).

  3. 3.

    Create a root vertex r in \(T_{i}\). For every block \(B_{j}\) that has \(r(N_{i})\) as a root, create the edge \(r \rightarrow b_{j}\) in \(T_{i}\).

  4. 4.

    If \(B_{j}\) is a leaf block, rename \(b_{j}\) in \(T_{i}\) by the label of the leaf in \(B_{j}\).

Fig. 4
figure 4

a An example network \(N_{i}\). The blocks containing leaves are highlighted in red. All other blocks are colored gray. b The corresponding block tree \(T_{i}\)

Figure 4 gives an example of a network \(N_{i}\) and its block tree \(T_{i}\). The set of blocks in \(N_{i}\) and the vertex set \(V'-r(T_{i})\), i.e., the set of all vertices of \(T_{i}\) except the root, are in one-to-one correspondence. An edge \(b_{1} \rightarrow b_{2}\) in \(T_{i}\) means that the corresponding blocks \(B_{1}\) and \(B_{2}\) in \(N_{i}\) do not have the same root and the root vertex \(r(B_{2})\) is a shared vertex between \(B_{1}\) and \(B_{2}\). Note that by the definition of a block, an edge connecting two vertices can define a block of its own (for example, block \(B_{9}\) in Fig. 4).

The following lemma states some properties of \(T_{i}\).

Lemma 3.2

Let \(T_{i} = (V', E')\) be the block tree of a given network \(N_{i}\). The block tree \(T_{i}\) is a rooted tree that has n leaves, \(|V'| = O(n)\), and \(|E'| = O(n)\).

Proof

We start by showing that \(T_{i}\) is a rooted tree. Since every edge of \(T_{i}\) is directed, \(T_{i}\) is a directed graph. Let \(U(T_{i})\) be the undirected version of that graph. Since \(U(N_{i})\) is connected, \(U(T_{i})\) is connected as well according to the construction. Next, we prove that \(T_{i}\) is a tree by contradiction. Suppose that \(U(T_{i})\) has a cycle. Then there exists a vertex b in \(T_{i}\) with \( in (b) > 1\). If B is the corresponding block of b in \(N_{i}\), this in turn implies the existence of two different blocks \(B_{1}\) and \(B_{2}\) in \(N_{i}\) such that \(r(B) \ne r(B_{1})\) and \(r(B) \ne r(B_{2})\), and with r(B) being a vertex in both \(B_{1}\) and \(B_{2}\). By the definition of \(N_{i}\), the root \(r(N_{i})\) has a path to every vertex in \(N_{i}\), so \(r(B_{1})\) and \(r(B_{2})\) must have a common ancestor. This means that the two blocks \(B_{1}\) and \(B_{2}\) could be merged to create an even larger block that contains both of them, contradicting that \(B_{1}\) and \(B_{2}\) are blocks of \(N_{i}\). Thus, \(T_{i}\) is a rooted tree.

Next, we count the number of vertices and edges in \(T_{i}\). By assumption (i) mentioned above, there are no leaves with in-degree greater than 1 in \(N_{i}\). Thus, \(N_{i}\) contains n leaf blocks and there will be exactly n leaves in \(T_{i}\). To count the internal vertices in \(T_{i}\), we distinguish between vertices having in-degree 1 and out-degree 1, from now on referred to as extra vertices, and non-extra vertices. First, to count the non-extra vertices in \(T_{i}\), observe that if we were to contract its extra vertices, i.e., add an edge from the parent of every such vertex u to the child of u and then remove u, we would obtain a tree \(T''_{i} = (V'', E'')\) with n leaves in which every internal vertex has in-degree 1 and out-degree at least 2. This means that \(|V''| = \mathrm {O}(n)\) and \(|E''| = \mathrm {O}(n)\). Secondly, to count the extra vertices, observe that any extra vertex corresponds to an uninformative block in \(N_{i}\) or a non-leaf block of \(N_{i}\) containing a single edge. By assumption (ii) above, \(N_{i}\) has no uninformative blocks. By the definition of a network, \(N_{i}\) has no vertex u with \( in (u) = out (u) = 1\), so every extra vertex in \(T_{i}\) must be the parent of at least one non-extra vertex. Because \(T_{i}\) is a tree, no two extra vertices are parents of the same non-extra vertex. If follows that there are \(\mathrm {O}(n)\) extra vertices in \(T_{i}\). In total, the number of vertices and edges in \(T_{i}\) is given by \(|V'| = \mathrm {O}(n)\) and \(|E'| = \mathrm {O}(n)\). \(\square \)

Since the set of blocks of \(N_{i}\) and the set \(V'-r(T_{i})\) are in one-to-one correspondence, we also have:

Corollary 3.3

The network \(N_{i}\) contains \(\mathrm {O}(n)\) blocks.

The following lemma shows that the block tree \(T_{i}\) can be built efficiently:

Lemma 3.4

The block tree \(T_{i} = (V', E')\) of a given network \(N_{i}\) can be constructed in \(\mathrm {O}(|E_{i}|)\) time.

Proof

Constructing \(T_{i}\) when the blocks of \(N_{i}\) are given is performed by scanning the vertices of \(N_{i}\) and the list of components that every vertex belongs to, while adding edges to \(T_{i}\) according to the definition of \(V'\) and \(E'\). This requires \(\mathrm {O}(|V_{i}|)\) time. Finding the blocks takes \(\mathrm {O}(|E_{i}|)\) time by applying the algorithm by Hopcroft and Tarjan in [16]. Lastly, \(|V_{i}| \le |E_{i}|\) because \(N_{i}\) is a connected graph, so we can build \(T_{i}\) in \(\mathrm {O}(|E_{i}|)\) time. \(\square \)

3.1.2 Contracted Block Networks

Each block in \(N_i\) can be viewed as a network, to which we may apply the techniques from Sect. 2 for detecting those triplets that are anchored within. To be able to do so, we first take each block B, make some adjustments to it as described next, and call the resulting network \(C_{B}\) the contracted block network of \(N_{i}\) corresponding to block B. See Figs. 5 and 6 for an example of the construction.

Fig. 5
figure 5

In this example, \(N_{i}\) is a level-3 network that contains a block B whose vertices are \(v_{2}, v_{3}, \dots , v_{15}\) and whose edges are drawn with solid lines. Here, \(r(B) = v_{2}\)

For a given network \(N_{i}\), a block B in \(N_{i}\), and a vertex u in B, initialize \(L^{u}_{B}\) as the set of leaves that can be reached from u without using edges in B. For example, for the block B shown in Fig. 5, \(L^{v_{3}}_{B} = \{a_{5}, a_{6}, a_{7}, a_{19}\}\) and \(L^{v_{10}}_{B} = \{a_{15}\}\). Next, construct the network \(C_{B} = (V', E')\) with vertex set \(V'\) and edge set \(E'\) and update the \(L^{u}_{B}\)-sets by applying the following operations:

  1. 1.

    Let \(C_{B}\) be a copy of \(N_{i}\).

  2. 2.

    Delete every edge and vertex from \(C_{B}\) that is not in B.

  3. 3.

    For every edge \(u_{1} \rightarrow u_{2}\) in \(C_{B}\), if \( in (u_{1}) = out (u_{1}) = in (u_{2}) = out (u_{2}) = 1\) then contract the edge as follows: Let \(u_{2} \rightarrow u_{3}\) be the edge outgoing from \(u_{2}\), create an edge \(u_{1} \rightarrow u_{3}\), delete \(u_{2}\) and its two incident edges, and let \(L^{u_{1}}_{B} = L^{u_{1}}_{B} \cup L^{u_{2}}_{B}\).

  4. 4.

    For every vertex \(u_{j}\) in \(C_{B}\) with \(L^{u_{j}}_{B} \ne \emptyset \), attach a child leaf \(s_{j}\) representing the set of leaves \(L^{u_{j}}_{B}\). Also attach another child leaf \(s'_{j}\) called a copy leaf, to be used later on to count triplets.

  5. 5.

    Insert an artificial leaf \(r'\) as a child of the root \(r(C_{B})\).

Fig. 6
figure 6

The contracted block network \(C_{B}\) for the block B from Fig. 5. The internal vertices \(v_{3}\) and \(v_{4}\) in B have been merged in \(C_{B}\), and similarly for \(v_{8}\)\(v_{9}\), and \(v_{15}\). The set of leaves in \(C_{B}\) is \(\{s_{i}, s'_{i}: i \in \{3,5,6,7,8,10,12, 13,14\}\}\)

Observe that every edge between two internal vertices in \(C_{B}\) corresponds to a path in B. For example, the edge \(v_{8} \rightarrow v_{14}\) in Fig. 6 corresponds to the path \((v_{8},\, v_{9},\, v_{15},\, v_{14})\) in Fig. 5, while the edge \(v_{13} \rightarrow v_{14}\) corresponds to the length-1 path \((v_{13}, v_{14})\).

The following lemma bounds the size of \(C_{B}\):

Lemma 3.5

Let \(N_{i}\) be a network, B a block in \(N_{i}\), and \(C_{B} = (V', E')\) the contracted block network of \(N_{i}\) that corresponds to block B. It holds that \(|V'| = \mathrm {O}(k_{i} d_{i} + 1)\), \(|E'| = \mathrm {O}(k_{i} d_{i} + 1)\), and \(|\mathcal {L}(C_{B})| = \mathrm {O}(k_{i} d_{i} + 1)\).

Proof

If \(k_{i} = 0\) then B consists of a single edge of \(N_{i}\), meaning that \(C_{B}\) is a binary tree on three leaves (a leaf of the form \(s_{j}\), its copy leaf \(s'_{j}\), and the artificial leaf \(r'\)). In this case, \(|V'| = 5\), \(|E'| = 4\), and \(|\mathcal {L}(C_{B})| = 3\).

If \(k_{i} \ge 1\), there are two possibilities. If B contains only one edge then \(C_{B}\) is a binary tree on three leaves as in the case \(k_{i} = 0\) above. Otherwise, proceed as follows to derive the bounds. Call a non-reticulation vertex of \(C_{B}\) that is the parent of at least two internal vertices of \(C_{B}\) a branching vertex (e.g., \(v_{2}\) and \(v_{7}\) in Fig. 6), and a non-reticulation vertex of \(C_{B}\) that is the parent of exactly one internal vertex a path vertex (e.g., \(v_{3}\), \(v_{5}\), \(v_{6}\), \(v_{8}\), \(v_{10}\), and \(v_{13}\) in Fig. 6). We apply a technique from [27] to count the branching vertices and note that every branching vertex is the beginning of at least one new directed path that has to end at a reticulation vertex. Since each reticulation vertex can end at most \(d_{i}\) such paths and there are at most \(k_{i}\) reticulation vertices in \(C_{B}\), the number of branching vertices is at most \(k_{i} d_{i}\). Every path vertex is the parent of either a branching vertex or a reticulation vertex, and every reticulation vertex has at most \(d_{i}\) parents, so the number of path vertices is at most \(2 k_{i} d_{i}\). Therefore, the total number of internal vertices is at most \(k_{i} (3 d_{i} + 1)\). Next, at most two leaves are attached to each internal vertex, so \(|\mathcal {L}(C_{B})| \le 2 k_{i} (3 d_{i} + 1)\) and \(|V'| \le 3 k_{i} (3 d_{i} + 1)\). As for the edges, there are at most \(k_{i} d_{i}\) edges ending at reticulation vertices, at most \(k_{i} d_{i}\) edges ending at branching vertices, at most \(2 k_{i} d_{i}\) edges ending at path vertices, and \(|\mathcal {L}(C_{B})|\) edges ending at leaves. Adding them together gives \(|E'| \le 10 k_{i} d_{i} + 2 k_{i}\).

Hence, the lemma statement holds for every \(k_{i}\ge 0\). \(\square \)

3.1.3 Constructing All Contracted Block Networks Efficiently

We first introduce some additional notation. For a given network \(N_{i}\) and a block B in \(N_{i}\), a leaf x in \(N_{i}\) is said to associate with B if there exists a vertex u in B such that \(u \ne r(B)\) and \(x \in L^{u}_{B}\). As an example, in Fig. 5, the leaf \(a_{16}\) associates with B, but the leaves \(a_{2}\) and \(a_{3}\) do not associate with B. For any leaf x associated with a block B of \(N_{i}\), define:

  • \(q_{B}(x)\): The vertex in B from which there is a path to x that does not use any edges in B. That is, \(x \in L^{q_{B}(x)}_{B}\).

  • \(p_{B}(x)\): The leaf in \(C_{B}\) representing x.

  • \(p'_{B}(x)\): The copy leaf of \(p_{B}(x)\).

For example, in Figs. 5 and 6, \(q_{B}(a_{5}) = v_{3}\), \(p_{B}(a_{5}) = s_{3}\), \(p'_{B}(a_{5}) = s'_{3}\), \(q_{B}(a_{8}) = v_{4}\), and \(p_{B}(a_{8}) = s_{3}\).

Lemma 3.1 yields an algorithm for constructing all block networks of \(N_{i}\) in \(\mathrm {O}(|E_{i}|)\) time. As shown in the next lemma, by properly relabeling the leaves of \(N_{i}\) and using an additional \(\mathrm {O}(n^{2})\) time, it is possible to build the block networks so that we can subsequently compute, for any block B and any leaf \(l\in \mathcal {L}(N_{i})\), the values of \(q_{B}(l)\) and \(p_{B}(l)\) in \(\mathrm {O}(1)\) time.

Lemma 3.6

For any network \(N_{i}\), all the contracted block networks of \(N_{i}\) can be computed in \(\mathrm {O}(|E_{i}| + n^{2})\) time, after which \(q_{B}(l)\) and \(p_{B}(l)\) for any block B and any leaf \(l \in \mathcal {L}(N_{i})\) can be retrieved in \(\mathrm {O}(1)\) time.

Proof

Perform the following steps:

  1. 1.

    Identify all the blocks of \(N_{i}\). Let \(B_{1} \dots , B_{s}\) be the blocks of \(N_{i}\) and let the corresponding vertex sets be \(V(B_{1}),\dots ,V(B_{s})\). Note that for every \(j\in \{1,\dots ,s\}\), it holds that \(V(B_{j}) \subseteq V_{i}\).

  2. 2.

    The leaves of \(N_{i}\) are relabeled as follows. A leaf receives the label i, where \(i\in \{1,2,\dots ,n\}\), if it is the \(i-th\) leaf in order that is discovered by a depth-first traversal of \(N_{i}\). This traversal starts from \(r(N_{i})\). Let u be a vertex in \(N_{i}\) and part of the blocks \(B_{1},\dots , B_{j}\). Let \(B'\) be the block from \(B_{1}, \dots , B_{j}\), such that \(r(B')\) has the largest height among all roots of \(B_{1}, \dots , B_{j}\). During the traversal, every child \(u'\) of u that is not part of \(B'\) is visited first. This is to ensure that the labels in \(L^{u}_{B'}\) are consecutive and defined by a range of numbers \([u_{ left },u_{ right }]\).

  3. 3.

    For every \(j\in \{1,\dots ,s\}\) the process of building \(C_{B_{j}} = (V_{j},E_{j})\) is initialized as follows. Set \(V_{j} = V(B_{j})\). For every edge \(u \rightarrow v\) in \(E_{i}\), if both u and v are in \(V_{j}\) then add that edge to \(E_{j}\). Finally, for any vertex \(u_{1}\) in \(V_{j}\), if \(L^{u_{1}}_{B_{j}} \ne \emptyset \) create the leaf \(s_{1}\) representing \(L^{u_{1}}_{B_{j}}\), the copy leaf \(s'_{1}\), add the edges \(u_{1} \rightarrow s_{1}\) and \(u_{1} \rightarrow s'_{1}\) to \(E_{j}\), and set \(Q_{B_{j}}[l] = u_{1}\) for every \(l \in \{u_{ left },\dots ,u_{ right }\}\).

  4. 4.

    For every \(j\in \{1,\dots ,s\}\) the edges of \(C_{B_{j}}\) are contracted, following the definition of a contracted block network. While performing the contraction, for every \(j\in \{1,\dots ,s\}\), we build the table \(P_{B_{j}}[1,\dots ,n]\), defined so that for every \(l \in \{1,\dots ,n\}\) we have \(P_{B_{j}}[l] = p_{B_{j}}(l)\). The value of \(P_{B_{j}}[l]\) is updated once the final set in which the leaf l will reside has been determined. After contracting all the edges, we also add the artificial leaf \(r'_{j}\).

Step 1 is performed by using the algorithm from [16], which takes \(\mathrm {O}(|E_{i}|)\) time. Step 2 is performed by a depth-first traversal of \(N_{i}\), thus requiring \(\mathrm {O}(|E_{i}|)\) time as well. Since the blocks of \(N_{i}\) are edge-disjoint (see Lemma 3.1), we have \(\sum _{j=1}^{s}|E_{j}| \le |E_{i}|\), thus the time spent on adding and contracting vertices and edges in steps 3 and 4 is \(\mathrm {O}(|E_{i}|)\). For every contracted block network \(C_{B}\), we spend \(\mathrm {O}(n)\) time to update the Q- and P-tables. By Corollary 3.3, there are \(\mathrm {O}(n)\) blocks, so the time needed to update every Q- and P-table is \(\mathrm {O}(n^{2})\). Hence, the total time taken is \(\mathrm {O}(|E_{i}| + n^{2})\). \(\square \)

Finally, for any block B in \(N_{i}\), we denote the fan graph of its contracted block network \(C_{B}\) by \(C^{f}_{B}\) and the resolved graph of \(C_{B}\) by \(C^{r}_{B}\). Moreover, we let \(A^{f}_{B}\) be the fan table of \(C_{B}\) and \(A^{r}_{B}\) the resolved table of \(C_{B}\). The following lemma bounds the time required to build \(C^{f}_{B}\), \(C^{r}_{B}\), \(A^{f}_{B}\), and \(A^{r}_{B}\) for all the blocks of a network \(N_{i}\).

Lemma 3.7

Given a network \(N_{i}\) and all of its contracted block networks, building \(C^{f}_{B}\), \(C^{r}_{B}\), \(A^{f}_{B}\), and \(A^{r}_{B}\) for every block B of \(N_{i}\) takes \(\mathrm {O}(|V_{i}| (k_{i}^{2} d_{i}^{2} + 1))\) time in total.

Proof

We simply apply the method from Sect. 2 to each contracted block network. To analyze the time that this will take, let \(\{B_{1}, B_{2}, \dots , B_{t}\}\) be the blocks in \(N_{i}\). For each block \(B_{x}\) in \(N_{i}\), let b(x) be the number of vertices in \(B_{x}\), c(x) the number of vertices in the contracted block network \(C_{B_{x}}\), and e(x) the number of edges in the contracted block network \(C_{B_{x}}\).

We first express the total size of the contracted block networks in terms of N. When \(C_{B_{x}}\) is constructed from \(B_{x}\), each vertex in \(B_{x}\) will either be deleted or remain and introduce at most two leaves, so \(c(x) \le 3 \cdot b(x)\). Next, since the blocks decompose \(N_{i}\) into edge-disjoint subgraphs by Lemma 3.1, and the total number of times that blocks overlap each other is equal to the number of edges \(E'\) in the block tree \(T_{i}\), we have \(\sum \limits _{x=1}^{t} b(x) \,\le \, |V_{i}| + |E'|\). By Lemma 3.2, \(|E'| = \mathrm {O}(n)\). Then, using \(n \le |V_{i}|\) gives \(\sum \limits _{x=1}^{t} c(x) \,\le \, 3 \cdot \sum \limits _{x=1}^{t} b(x) = \mathrm {O}(|V_{i}|)\).

Now, we analyze the total time for all the blocks. According to Sect. 2, building each \(C^{f}_{B_{x}}\), \(C^{r}_{B_{x}}\), \(A^{f}_{B_{x}}\), and \(A^{r}_{B_{x}}\) takes \(\mathrm {O}(c(x)^{2} e(x))\) time. The total time is thus \(\sum \limits _{x=1}^{t} \mathrm {O}(c(x)^{2} e(x))\). Lemma 3.5 says that \(c(x) = \mathrm {O}(k_{i} d_{i} + 1)\) and \(e(x) = \mathrm {O}(k_{i} d_{i} + 1)\), so we can rewrite the total time needed as \(\mathrm {O}(\sum \limits _{x=1}^{t} (k_{i} d_{i} + 1)^{2} c(x)) = \mathrm {O}((k_{i} d_{i} + 1)^{2} \sum \limits _{x=1}^{t} c(x)) = \mathrm {O}(|V_{i}| (k_{i}^{2} d_{i}^{2} + 1))\). \(\square \)

3.2 Checking If a Triplet is Consistent with a Network

Sections 3.2.1 and 3.2.2 below describe how to determine if any given fan or resolved triplet, respectively, is consistent with \(N_{i}\) in \(\mathrm {O}(1)\) time, assuming that the data structures from Sect. 3.1 have already been built.

A more precise definition of triplet consistency that can associate specific locations in the network to triplets that are consistent with it will be needed in this section. Let B be a block of a network \(N_{i}\). We say that x|y|z is a fan triplet consistent with B if and only if there exists a vertex u in B such that there are three directed paths in \(N_{i}\) from u to x, from u to y, and from u to z that are disjoint except for in u. We also say that x|y|z is rooted at u in B. Since u belongs to \(N_{i}\), this means that x|y|z is rooted at u in \(N_{i}\) as well. Next, we say that xy|z is a resolved triplet consistent with B if and only if there exist two vertices u and v (\(u \ne v\)) in B such that there are four directed paths in \(N_{i}\) from u to v, from v to x, from v to y, and from u to z that are disjoint except for in u and v, and the path from u to z does not pass through v. Moreover, we say that xy|z is rooted at u and v in B and in \(N_{i}\).

Observe that if x|y|z is a fan triplet consistent with a block B, then it is also consistent with \(N_{i}\). In the same way, if xy|z is a resolved triplet consistent with B, it is also consistent with \(N_{i}\).

3.2.1 Checking a Fan Triplet

First, we show how to determine if a given fan triplet x|y|z is consistent with a given block B (Lemma 3.8). The procedure, named IsFanInBlock, requires that the lowest common ancestor (in the block tree \(T_{i}\)) of x and y, the lowest common ancestor of x and z, and the lowest common ancestor of y and z are the same, and that this node corresponds to the block B being examined.

After that, the procedure IsFanInBlock is used as a subroutine in another procedure, named IsFan, to determine if a given fan triplet x|y|z is consistent with a network (Lemma 3.9). Whenever IsFanInBlock’s requirement on the lowest common ancestors cannot be met, IsFan instead considers the different cases for the locations of the lowest common ancestor of every pair (xy), (xz), and (yz) in \(T_{i}\). Since every vertex in \(T_{i}\) except \(r(T_{i})\) corresponds to a block in \(N_{i}\), it can then apply the available data structures to determine if \(N_{i}\) has the necessary disjoint paths.

figure b

Lemma 3.8

Let \(N_i\) be a given network and \(T_i\) its block tree, and suppose that the preprocessing from Lemma 3.7 has been performed on \(N_i\). Consider any \(x, y, z \in \Lambda \) such that the lowest common ancestor of every pair (xy), (xz), and (yz) is a node w in \(T_{i}\). If \(w \ne r(T_i)\), Algorithm 2 determines whether or not the fan triplet x|y|z is consistent with the block B in \(N_i\) corresponding to w in \(\mathrm {O}(1)\) time.

Proof

For every \(l\in \{x,y,z\}\), we let \(p_{l} = p_{B}(l)\), \(p'_{l} = p'_{B}(l)\), \(q_{l} = q_{B}(l)\), and \(h_{l}\) be the height of \(q_{l}\) in \(N_{i}\). By construction (see Lemmas 3.4 and 3.6), we know that \(p_{x}\), \(p_{y}\), and \(p_{z}\) are not the root of \(C_{B}\). The algorithm uses the tables Q and P to check all the possible cases for the values of \(p_{x}\), \(p_{y}\), \(p_{z}\), \(q_{x}\), \(q_{y}\), and \(q_{z}\), and return a true or false value, indicating a positive and a negative answer respectively. We have the following cases:

  1. 1.

    \(p_{x} = p_{y} = p_{z}\):

    1. (a)

      \(h_{x} = h_{y} = h_{z}\): We have \(q_{x} = q_{y} = q_{z}\) and x|y|z is rooted at \(q_{x}\). Hence, x|y|z is consistent with B (e.g., \(a_{5}|a_{6}|a_{7}\) in Fig. 5).

    2. (b)

      \(((h_{x} = h_{y}) \wedge (h_{x}> h_{z})) \vee ((h_{x} = h_{z}) \wedge (h_{x}> h_{y})) \vee ((h_{y} = h_{z}) \wedge (h_{y} > h_{x}))\). W.l.o.g., assume true for \(((h_{x} = h_{y}) \wedge (h_{x} > h_{z}))\): Then, we have \(q_{x} = q_{y} \wedge q_{x} \ne q_{z}\) and x|y|z is rooted at \(q_{x}\). Hence, x|y|z is consistent with B (e.g., \(a_{5}|a_{6}|a_{8}\) in Fig. 5).

    3. (c)

      \(h_{x} \ne h_{y} \ne h_{z}\): Then \(q_{x} \ne q_{y} \ne q_{z}\), thus x|y|z is not consistent with B (e.g., \(a_{13}|a_{14}|a_{20}\) in Fig. 5).

  2. 2.

    \(((p_{x} = p_{y}) \wedge (p_{x} \ne p_{z})) \vee ((p_{x} = p_{z}) \wedge (p_{x} \ne p_{y})) \vee ((p_{y} = p_{z}) \wedge (p_{y} \ne p_{x}))\). W.l.o.g., assume true for \((p_{x} = p_{y} \wedge p_{x} \ne p_{z})\):

    1. (a)

      \(h_{x} = h_{y}\): We have \(q_{x} = q_{y}\). If \(p'_{x}|p_{x}|p_{z}\) is a fan triplet in \(C_{B}\), then x|y|z is rooted at \(q_{x}\), thus x|y|z is consistent with B (e.g., \(a_{8}|a_{9}|a_{15}\) in Fig. 5). If \(p'_{x}|p_{x} | p_{z}\) is not a fan triplet in \(C_{B}\), x|y|z is not rooted at any vertex in B, thus x|y|z is not consistent with B (e.g., \(a_{8}|a_{9}|a_{11}\) in Fig. 5).

    2. (b)

      \(h_{x} \ne h_{y}\): Then \(q_{x} \ne q_{y}\) and either \(q_{x}\) or \(q_{y}\) was contracted when creating \(C_{B}\). Moreover, both x and y are now in the set of leaves defined by \(p_{x}\). Since we also have \(p_{z} \ne p_{x}\), the triplet x|y|z is not consistent with B (e.g., \(a_{7}|a_{8}|a_{15}\) in Fig. 5).

  3. 3.

    \(p_{x} \ne p_{y} \ne p_{z}\): If \(p_{x} | p_{y} | p_{z}\) is consistent with \(C_{B}\), then there exists a vertex u in B such that x|y|z is rooted at u. Hence, x|y|z is consistent with B (e.g., \(a_{8}|a_{11}|a_{16}\) in Fig. 5). If \(p_{x} | p_{y} | p_{z}\) is not consistent with \(C_{B}\)x|y|z is not rooted at any vertex in B, thus x|y|z is not consistent with B (e.g., \(a_{14}|a_{16}|a_{17}\) in Fig. 5).

In every case above, testing if a fan triplet is consistent with \(C_{B}\) translates to finding a path that starts from s in \(C^{f}_{B}\) and ends in a vertex of \(C^{f}_{B}\) defined by the leaves of the fan triplet. Hence, every case can be handled in \(\mathrm {O}(1)\) time. In Algorithm 2, the above cases are summarized in a procedure. \(\square \)

figure c

Lemma 3.9

Let \(N_i\) be a given network and \(T_i\) its block tree, and suppose that the preprocessing from Lemma 3.7 has been performed on \(N_i\). For any \(x, y, z \in \Lambda \), Algorithm 3 determines whether or not the fan triplet x|y|z is consistent with \(N_{i}\) in \(\mathrm {O}(1)\) time.

Proof

For a block B of \(N_{i}\) and a vertex u in B that can reach a leaf x of \(N_{i}\), define \(h_{B}(x)\) to be the height of \(q_{B}(x)\) in \(N_{i}\). In Algorithm 3 we have the procedure for testing the consistency of the fan triplet x|y|z. It considers the following cases:

  1. 1.

    x|y|z is consistent with \(T_{i}\): Let w be the lowest common ancestor of x, y, and z in \(T_{i}\).

    1. (a)

      \(w = r(T_{i})\): x|y|z is rooted at \(r(N_{i})\), thus x|y|z is consistent with \(N_{i}\) (e.g., \(a_{23}|a_{9}|a_{20}\) in Fig. 4).

    2. (b)

      \(w \ne r(T_{i})\): w corresponds to a block B in \(N_{i}\), thus we use Lemma 3.8 to determine if x|y|z is consistent with B. If x|y|z is consistent with B, then it is also consistent with \(N_{i}\). If x|y|z is not consistent with B, then it is not consistent with \(N_{i}\) (e.g., \(a_{3}|a_{9}|a_{12}\) in Fig. 4).

  2. 2.

    \(xy|z \vee xz|y \vee yz|x\) is consistent with \(T_{i}\). Assume w.l.o.g. that xy|z is consistent with \(T_{i}\). Let \(w = lca (x,y)\) in \(T_{i}\) and \(\mu = lca (x,z)\) in \(T_{i}\), and let B be the block in \(N_{i}\) corresponding to w and F the block in \(N_{i}\) corresponding to \(\mu \):

    1. (a)

      \(\mu \) is not the parent of w in \(T_{i}\): then x|y|z is not rooted at any vertex in \(N_{i}\), thus x|y|z is not consistent with \(N_{i}\) (e.g., \(a_{2}|a_{4}|a_{13}\) in Fig. 4).

    2. (b)

      \(\mu \) is the parent of w in \(T_{i}\). By the definition of \(T_{i}\), B is rooted at a vertex u of F that is not r(F):

      1. i.

        \((p_{B}(x) = p_{B}(y))\): then x|y|z is not rooted at any vertex in \(N_{i}\), thus x|y|z is not consistent with \(N_{i}\) (e.g., \(a_{2}|a_{3}|a_{4}\) in Fig. 4).

      2. ii.

        \((p_{B}(x) \ne p_{B}(y)) \wedge (\mu = r(T_{i}))\): If \(r'|p_{B}(x)|p_{B}(y)\) is consistent with \(C_{B}\), where \(r'\) is the dummy leaf in \(C_{B}\) (see Corollary 2.2), then x|y|z is rooted at \(r(N_{i})\), thus x|y|z is consistent with \(N_{i}\) (e.g., \(a_{1}|a_{11}|a_{15}\) in Fig. 4). Otherwise, x|y|z is not rooted at any vertex in \(N_{i}\), thus x|y|z is not consistent with \(N_{i}\) (e.g., \(a_{12}|a_{13}|a_{15}\) in Fig. 4).

      3. iii.

        \((p_{B}(x) \ne p_{B}(y)) \wedge (\mu \ne r(T_{i}))\):

        1. A.

          \((p_{F}(x) = p_{F}(z)) \wedge (h_{F}(z) \le h_{F}(x))\): Since B is rooted at a vertex of F, we have \(q_{F}(x) = q_{F}(y)\), thus \(h_{F}(x) = h_{F}(y)\). Using Corollary 2.2, if \(r'|p_{B}(x)|p_{B}(y)\) is a fan triplet in \(C_{B}\), where \(r'\) is the dummy leaf in \(C_{B}\), then x|y|z is rooted at \(q_{F}(x)\), thus x|y|z is a fan triplet in \(N_{i}\) (e.g., \(a_{1}|a_{4}|a_{8}\) in Fig. 4). Otherwise, x|y|z is not rooted at any vertex in \(N_{i}\), thus x|y|z is not consistent with \(N_{i}\) (e.g., \(a_{1}|a_{24}|a_{8}\) in Fig. 4).

        2. B.

          \((p_{F}(x) = p_{F}(z)) \wedge (h_{F}(z) > h_{F}(x))\): Since B is rooted at a vertex of F, we have \(q_{F}(x) = q_{F}(y)\) and \(h_{F}(x) = h_{F}(y)\). Hence, x|y|z is not consistent with \(N_{i}\) (e.g., \(a_{1}|a_{4}|a_{21}\) in Fig. 4).

        3. C.

          \(p_{F}(x) \ne p_{F}(z)\): Using Corollary 2.2, if \(r'|p_{B}(x)|p_{B}(y)\) is a fan triplet in \(C_{B}\), where \(r'\) is the dummy leaf in \(C_{B}\), and \(p_{F}(x)|p'_{F}(x)|p_{F}(z)\) is a fan triplet in \(C_{F}\), then x|y|z is rooted at \(q_{F}(x)\). Hence, x|y|z is consistent with \(N_{i}\) (e.g., \(a_{1}|a_{4}|a_{9}\) in Fig. 4). Otherwise, x|y|z is not rooted at any vertex of \(N_{i}\), thus x|y|z is not consistent with \(N_{i}\) (e.g., \(a_{1}|a_{4}|a_{12}\) in Fig. 4).

\(\square \)

3.2.2 Checking a Resolved Triplet

The strategy for determining if a given resolved triplet xy|z is consistent with a network is analogous to the case of fan triplets just described. The procedure IsResolvedInBlock (see Lemma 3.10) first considers consistency with a block B in the case where it holds in the block tree \(T_{i}\) that the lowest common ancestor of x and y, the lowest common ancestor of x and z, and the lowest common ancestor of y and z are the same. Next, the procedure IsResolved (see Lemma 3.11) uses IsResolvedInBlock and the available data structures to take care of the general case.

figure d

Lemma 3.10

Let \(N_i\) be a given network and \(T_i\) its block tree, and suppose that the preprocessing from Lemma 3.7 has been performed on \(N_i\). Consider any \(x, y, z \in \Lambda \) such that the lowest common ancestor of every pair (xy), (xz), and (yz) is a node w in \(T_{i}\). If \(w \ne r(T_i)\), Algorithm 4 determines whether or not the resolved triplet xy|z is consistent with the block B in \(N_i\) corresponding to w in \(\mathrm {O}(1)\) time.

Proof

Like in the case of fan triplets in Lemma 3.8, for every \(l\in \{x,y,z\}\), we let \(p_{l} = p_{B}(l)\), \(p'_{l} = p'_{B}(l)\), \(q_{l} = q_{B}(l)\), and \(h_{l}\) be the height of \(q_{l}\) in \(N_{i}\). By construction (see Lemmas 3.4 and 3.6), we know that \(p_{x}\), \(p_{y}\), and \(p_{z}\) are not the root of \(C_{B}\). The algorithm uses the tables Q and P to check all the possible cases for the values of \(p_{x}\), \(p_{y}\), \(p_{z}\), \(q_{x}\), \(q_{y}\), and \(q_{z}\), and return a true or false value, indicating a positive and a negative answer respectively. We have the following cases:

  1. 1.

    \(p_{x} = p_{y} = p_{z}\):

    1. 1.

      \((h_{z}> h_{x}) \wedge (h_{z} > h_{y})\). W.l.o.g., let \(h_{x} \ge h_{y}\): Then, xy|z is rooted at \(q_{z}\) and \(q_{x}\), thus xy|z is a resolved triplet in B (e.g., \(a_{8}a_{9}|a_{6}\) in Fig. 5).

    2. 2.

      \((h_{z} \le h_{x}) \vee (h_{z} \le h_{y})\): Because \(p_{x} = p_{y} = p_{z}\), xy|z is not rooted at any pair of vertices in B, thus xy|z is not consistent with B (e.g., \(a_{8}a_{6}|a_{9}\) in Fig. 5).

  2. 3.

    \((p_{x} = p_{y}) \wedge (p_{x} \ne p_{z})\). W.l.o.g., assume \(h_{x} \ge h_{y}\): If \(p'_{x}p_{x}|p_{z}\) is consistent with \(C_{B}\), there exists \(u \ne q_{x}\) in B such that xy|z is rooted at u and \(q_{x}\) in B. Hence, xy|z is consistent with B (e.g., \(a_{5}a_{8}|a_{17}\) in Fig. 5). If \(p'_{x}p_{x}|p_{z}\) is not consistent with \(C_{B}\), xy|z is not rooted at any pair of vertices in B, thus xy|z is not consistent with B (e.g., \(a_{5}a_{8}|a_{15}\) in Fig. 5).

  3. 4.

    \(((p_{x} = p_{z}) \wedge (p_{x} \ne p_{y}))\vee ((p_{y} = p_{z}) \wedge (p_{y} \ne p_{x}))\). W.l.o.g., assume \((p_{x} = p_{z}) \wedge (p_{x} \ne p_{y})\):

    1. 1.

      \(h_{z} > h_{x}\): If \(p'_{x}|p_{x}|p_{y}\) is a fan triplet in \(C_{B}\), then xy|z is rooted at \(q_{z}\) and \(q_{x}\), thus xy|z is consistent with B (e.g., \(a_{14}a_{17}|a_{13}\) in Fig. 5). If \(p'_{x}|p_{x}|p_{y}\) is not consistent with \(C_{B}\)xy|z is not rooted at any pair of vertices in B, thus xy|z is not consistent with B (e.g., \(a_{14}a_{16}|a_{13}\) in Fig. 5.).

    2. 2.

      \(h_{z} \le h_{x}\): Since \(p_{x} = p_{z}\), the resolved triplet xy|z cannot be consistent with B (e.g., \(a_{14}a_{17}|a_{20}\) in Fig. 5).

  4. 3.

    \(p_{x} \ne p_{y} \ne p_{z}\): If \(p_{x}p_{y}|p_{z}\) is consistent with \(C_{B}\), then there exist two different vertices u, v in B such that xy|z is rooted at u and v, thus xy|z is consistent with B (e.g., \(a_{12}a_{13}|a_{18}\) in Fig. 5). If \(p_{x}p_{y}|p_{z}\) is not consistent with \(C_{B}\), xy|z is not rooted at any pair of vertices in B, thus xy|z is not consistent with B (e.g., \(a_{12}a_{18}|a_{13}\) in Fig. 5).

Similarly to fan triplets, testing if a resolved triplet is consistent with \(C_{B}\) translates to finding a path that starts from s in \(C^{r}_{B}\) and ends in a vertex of \(C^{r}_{B}\) defined by the leaves of the resolved triplet. Hence, every case can be handled in \(\mathrm {O}(1)\) time. Algorithm 4 summarizes the above cases in a procedure. \(\square \)

figure e

Lemma 3.11

Let \(N_i\) be a given network and \(T_i\) its block tree, and suppose that the preprocessing from Lemma 3.7 has been performed on \(N_i\). For any \(x, y, z \in \Lambda \), Algorithm 5 determines whether or not the resolved triplet xy|z is consistent with \(N_{i}\) in \(\mathrm {O}(1)\) time.

Proof

For a block B of \(N_{i}\) and a vertex u in B that can reach a leaf x of \(N_{i}\), define \(h_{B}(x)\) to be the height of \(q_{B}(x)\) in \(N_{i}\). In Algorithm 5 we have the procedure for testing the consistency of the resolved triplet xy|z. We consider the following cases, which are similar to the cases for fan triplets in Lemma 3.9:

  1. 1.

    x|y|z is consistent with \(T_{i}\): Let w be the lowest common ancestor of x, y, and z in \(T_{i}\).

    1. (a)

      \(w = r(T_{i})\): xy|z is not rooted at any pair of vertices in \(N_{i}\), thus xy|z is not consistent with \(N_{i}\) (e.g., \(a_{23}a_{9}|a_{20}\) in Fig. 4).

    2. (b)

      \(w \ne r(T_{i})\): w corresponds to a block B in \(N_{i}\), thus we use Lemma 3.10 to determine if xy|z is consistent with B. If xy|z is consistent with B, then it is also consistent with \(N_{i}\). If xy|z is not consistent with B, then it is not consistent with \(N_{i}\) (e.g., \(a_{1}a_{9}|a_{12}\) in Fig. 4).

  2. 2.

    \(xy|z \vee xz|y \vee yz|x\) is consistent with \(T_{i}\). Assume w.l.o.g. that xy|z is consistent with \(T_{i}\). Let \(w = lca (x,y)\) in \(T_{i}\) and \(\mu = lca (x,z)\) in \(T_{i}\), and let B be the block in \(N_{i}\) corresponding to w and F the block in \(N_{i}\) corresponding to \(\mu \):

    1. (a)

      \(\mu \) is not the parent of w in \(T_{i}\): then there exists a vertex u in B and a vertex v in F such that xy|z is rooted at v and u, thus xy|z is consistent with \(N_{i}\) (e.g., \(a_{2}a_{4}|a_{13}\) in Fig. 4).

    2. (b)

      \(\mu \) is the parent of w in \(T_{i}\). By the definition of \(T_{i}\), B is rooted at a vertex u of F that is not r(F). We consider the following cases:

      1. i.

        \(p_{B}(x) = p_{B}(y)\): W.l.o.g., assume \(h_{B}(x) > h_{B}(y)\). Then, xy|z is rooted at either r(B) and \(q_{B}(x)\), or \(q_{F}(z)\) and \(q_{B}(x)\), or r(F) and \(q_{B}(x)\). Hence, xy|z is consistent with \(N_{i}\) (e.g., \(a_{2}a_{3}|a_{4}\) in Fig. 4).

      2. ii.

        \((p_{B}(x) \ne p_{B}(y)) \wedge (\mu = r(T_{i}))\): Using Corollary 2.4, if we have that \(p_{B}(x)p_{B}(y)|r'\) is consistent with \(C_{B}\), where \(r'\) is the dummy leaf in \(C_{B}\), then there exists a vertex u in B such that xy|z is rooted at \(r(N_{i})\) and u. Hence, xy|z is consistent with \(N_{i}\) (e.g., \(a_{11}a_{13}|a_{15}\) in Fig. 4). Otherwise, xy|z is not rooted at any pair of vertices in \(N_{i}\), thus xy|z is not consistent with \(N_{i}\) (e.g., \(a_{1}a_{13}|a_{15}\) in Fig. 4).

      3. iii.

        \((p_{B}(x) \ne p_{B}(y)) \wedge (\mu \ne r(T_{i}))\):

        1. A.

          \((p_{F}(x) = p_{F}(z)) \wedge (h_{F}(z) \le h_{F}(x))\): Since B is rooted at a vertex of F, we have \(q_{F}(x) = q_{F}(y)\), thus \(h_{F}(x) = h_{F}(y)\). Using Corollary 2.4, if \(p_{B}(x)p_{B}(y)|r'\) is consistent with \(C_{B}\), where \(r'\) is the dummy leaf in \(C_{B}\), then there exists a vertex u in B such that xy|z is rooted at \(q_{F}(x)\) and u. Hence, xy|z is consistent with \(N_{i}\) (e.g., \(a_{1}a_{4}|a_{8}\) in Fig. 4). Otherwise, xy|z is not rooted at any pair of vertices in \(N_{i}\), thus xy|z is not consistent with \(N_{i}\) (e.g., \(a_{1}a_{25}|a_{22}\) in Fig. 4).

        2. B.

          \((p_{F}(x) = p_{F}(z)) \wedge (h_{F}(z) > h_{F}(x))\): Since B is rooted at a vertex of F, we have \(q_{F}(x) = q_{F}(y)\) and \(h_{F}(x) = h_{F}(y)\). Then, there exists a vertex u in B such that xy|z is rooted at \(q_{F}(z)\) and u, thus xy|z is consistent with \(N_{i}\) (e.g., \(a_{1}a_{4}|a_{21}\) in Fig. 4).

        3. C.

          \(p_{F}(x) \ne p_{F}(z)\): Using Corollary 2.4, if \(p_{B}(x)p_{B}(y)|r'\) is consistent with \(C_{B}\), where \(r'\) is the dummy leaf in \(C_{B}\), then there exists a vertex u in B such that xy|z is rooted at either r(B) and u, or \(q_{F}(z)\) and u, or r(F) and u. If \(p_{F}(x)p'_{F}(x)|p_{F}(z)\) is consistent with \(C_{F}\), then w.l.o.g. if \(h_{F}(x) > h_{F}(y)\) we have that xy|z is rooted at some vertex u of F and \(q_{F}(x)\). In both cases, xy|z is consistent with \(N_{i}\) (e.g., \(a_{1}a_{4}|a_{12}\) in Fig. 4). If both cases are false, xy|z is not rooted at any pair of vertices in \(N_{i}\), thus xy|z is not consistent with \(N_{i}\) (e.g., \(a_{1}a_{25}|a_{26}\) in Fig. 4).

\(\square \)

3.3 Triplet Distance Computation

Our second algorithm for computing the triplet distance between two given networks \(N_{1}\) and \(N_{2}\) is listed in Algorithm 6. It has the same basic structure as the algorithm in Sect. 2.2, but it applies the procedures presented in Sect. 3.2.1 and 3.2.2 to check triplet consistency. The main procedure is named D(). In the preprocessing step, for \(i \in \{1,2\}\), the algorithm builds the block tree \(T_{i}\), an \(n \times n\) table for \(T_{i}\) in order to later answer lowest common ancestor queries between pairs of leaves in \(T_{i}\) in \(\mathrm {O}(1)\) time, all the contracted block networks of \(N_{i}\), and finally, for every block B, the fan graph \(C^{f}_{B}\) and the resolved graph \(C^{r}_{B}\) as well as the corresponding \(A^{f}_{B}\)- and \(A^{r}_{B}\)-tables for the contracted block network \(C_{B}\). The algorithm then calls the procedure S() to count shared fan and resolved triplets, which is done by enumerating all possible triplets and calling IsFan and IsResolved to see which of them are consistent with both \(N_{1}\) and \(N_{2}\). The final answer is calculated according to Equation (1.1).

figure f

From Lemma 3.4, computing \(T_{1}\) and \(T_{2}\) requires \(\mathrm {O}(|E_{1}| + |E_{2}|)\) time. Building the two tables for answering lowest common ancestor queries in \(T_{1}\) and \(T_{2}\) takes \(\mathrm {O}(n^{2})\) time by bottom-up traversals. From Lemma 3.6, constructing all the contracted block networks requires \(\mathrm {O}(|E_{1}| + |E_{2}| + n^{2})\) time. From Lemma 3.7, the total time required to build \(C^{f}_{B}\), \(C^{r}_{B}\), \(A^{f}_{B}\), and \(A^{r}_{B}\) for every block B of \(N_{1}\) and \(N_{2}\) is \(\mathrm {O}(|V_{1}| (k_{1}^{2} d_{1}^{2} + 1) + |V_{2}| (k_{2}^{2} d_{2}^{2} + 1))\). Since \(|V_{i}| = \mathrm {O}(|E_{i}|)\), the preprocessing time sums up to \(\mathrm {O}(|E_{1}| + |E_{2}| + |V_{1}| k_{1}^{2} d_{1}^{2} + |V_{2}| k_{2}^{2} d_{2}^{2} + n^{2})\).

Using Lemmas 3.9 and 3.11, after the preprocessing step we can determine the consistency of a triplet with \(N_{1}\) or \(N_{2}\) in \(\mathrm {O}(1)\) time. Since the number of triplets that need to be checked is exactly \(4 \left( {\begin{array}{c}n\\ 3\end{array}}\right) \), the total running time of the algorithm is \(\mathrm {O}(|E_{1}| + |E_{2}| + |V_{1}| k_{1}^{2} d_{1}^{2} + |V_{2}| k_{2}^{2} d_{2}^{2} + n^{3})\). Using the definitions of N, M, k, and d from Sect. 1, the running time can be expressed as \(\mathrm {O}(M + N k^{2} d^{2} + n^{3})\). Hence, we obtain the following theorem:

Theorem 3.12

The triplet distance between two networks \(N_{1}\) and \(N_{2}\) can be computed in \(\mathrm {O}(M + N k^{2} d^{2} + n^{3})\) time.

4 Implementation and Experiments

This section presents the implementations of the two algorithms from Sects. 2 and 3, and experimental results demonstrating their practical performance. Both simulated and real datasets were used in the experiments.

4.1 Algorithm Implementation

From here on, the algorithm from Sect. 2 will be referred to as NTDfirst and the algorithm from Sect. 3 as NTDsecond. Both algorithms were implemented in the C++ programming language and the source code is publicly available at:

https://github.com/kmampent/ntd

Since no other implementations for computing the rooted triplet distance between two networks of arbitrary levels are available, the correctness of our program code was verified by trying a large number of pairs of input networks under varying parameters and making sure that the output of NTDfirst (which is simple to implement) was identical to the output of NTDsecond in all cases.

4.2 The Setup

The experiments were performed on a machine with 16GB RAM and Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz. The operating system was Ubuntu 16.04.2 LTS, and the compiler used was g++ 5.4 with cmake 3.11.0.

4.3 Experiment 1: Performance

The first set of experiments were designed to measure the running times and memory usage of our implementations of NTDfirst and NTDsecond. To do so systematically, we used simulated datasets. The Input. Given three parameters n, p, and e, where \(n \ge 1\) is an integer, \(0 \le p \le 1\), and \(e \ge 0\) is an integer, an input network \(N'\) was built according to the following method:

  • Generate a random rooted binary tree T with n leaves in the uniform model [29].

  • For each internal vertex w in T except r(T), contract the edge between the parent of w and w with probability p.

  • For each vertex w in T, let d(w) be the number of edges on the path from r(T) to w. Let \(N' = T\).

  • Until e edges have been added or it is impossible to add any more edges: Add an edge between two vertices in \(N'\) chosen uniformly at random, under the constraint that an edge \(u \rightarrow v\) is created in \(N'\) only if \(d(u) < d(v)\). (In other words, if the total number of edges that can be added is y and \(y < e\), then only add those y edges.)

Experimental Results. We applied NTDfirst and NTDsecond to pairs of networks generated with the method above for varying values of n, p, and e, and measured their running times and memory usage. In the graphs shown below, every data point corresponds to the average taken over 30 runs with a set of fixed parameters. Reticulation events are typically rare in nature [30], so we used relatively small values for e, i.e., \(e \le 50\) when \(n \le 500\), to make the experiments more realistic.

Fig. 7
figure 7

The running times of NTDfirst and NTDsecond for increasing values of n and with \(p=0\) and \(e \in \{10,20,30,40,50\}\)

Fig. 8
figure 8

The memory usage of the two algorithms for increasing values of n and with \(p=0\) and \(e \in \{10,20,30,40,50\}\), as reported by the Maximum Resident Size parameter when calling the executable of each algorithm with /usr/bin/time -v

Fig. 9
figure 9

The effect of e and n on k (the generated network’s level) and the amount of non-leaf blocks

Fig. 10
figure 10

The running time of NTDfirst minus the running time of NTDsecond for \(e \in \{10,20,30,40,50\}\) and \(p \in \{0,0.8\}\). a Observe that when \(n=90\), \(p=0\), and \(e=50\), the difference is negative, which means NTDfirst is faster than NTDsecond. b When p is large (like the case \(p=0.8\) shown here), the number of edges that can be added to the generated networks is small and the differences in running times for varying values of e less significant

Fig. 11
figure 11

The effect of different values of p on the running time of NTDfirst minus the running time of NTDsecond, for \(e \in \{10,30,50\}\). When \(n=90\), \(p=0\), and \(e=50\), NTDfirst is faster than NTDsecond

The results of Experiment 1 are reported below.

  1. 1.

    The two algorithms’ running times and memory usage increase as n increases according to the plots in Figs. 7 and 8. The first figure shows the CPU time in seconds taken when \(p = 0\) and \(e \in \{10,20,30,40,50\}\). For NTDfirst we used \(10 \le n \le 230\), and for NTDsecond we used \(10 \le n \le 500\). Space is the reason behind the restrictions on n. As can be seen in Fig. 8a, at \(n = 230\) the memory usage of NTDfirst is getting close to the limit of the available 16GB RAM. When \(n \ge 240\), the memory requirements exceed the limit, and the operating system initiates highly time-consuming communication with the disk.

  2. 2.

    Both algorithms take more time as the parameter e increases due to the additional edges in the generated networks, with NTDsecond suffering more than NTDfirst. Again, see Fig. 7. The explanation for this behavior is as follows. The main purpose of extending the algorithm from Sect. 2 in Sect. 3 was to avoid having to build the highly time- and memory-consuming fan and resolved graph on the entire input network, and instead build several such graphs on smaller blocks. Figure 9 shows that a larger value of e implies a higher level k as well as fewer non-leaf blocks in \(N'\), which in turn implies more time spent by NTDsecond building the fan and resolved graphs. An extreme situation is when e is so large that \(N'\) has a really small number of non-leaf blocks, one of which is roughly as large as \(N'\) itself. Then, given that the preprocessing of NTDsecond is more complex than that of NTDfirst, NTDsecond will be slower than NTDfirst. An example of where this happens can be found in Fig. 10a when the parameters are \(n=90\), \(p=0\), and \(e=50\).

    In contrast, when p is large, e.g., \(p=0.8\) in Fig. 10b, the effect of e on the running times is small. This holds especially for NTDsecond. There will be fewer internal vertices in the generated networks, which means that the number of edges that can be added decreases as well.

  3. 3.

    The effect of the parameter p on the relative running times of the two algorithms is shown in Fig. 11. In general, the difference in the two algorithms’ running times becomes smaller as the value of p increases. For certain combinations of the parameters such as \(n = 90\), \(p = 0\), and \(e = 50\) in Fig. 11c, NTDfirst is faster than NTDsecond, as observed earlier.

4.4 Experiment 2: Limitations of the Rooted Triplet Distance

The second set of experiments applied the algorithms to real datasets. The goal was to see how informative the current definition of the rooted triplet distance is in practice when comparing phylogenetic networks, and to investigate any potential shortcomings. The Input. For the real datasets, we borrowed six networks from Table S4 in [31] that describe biologically motivated alternative ‘scenarios’ for the evolutionary history of the Viola genus. They are named \(N_{A}\), \(N_{B}\), \(N_{C}\), \(N_{D}\), \(N_{E}\), and \(N_{F}\) below. The first five networks correspond to the five scenarios A, B, C, D, and E in [31], and \(N_{F}\) is “Scenario E, CHAM and MELVIO resolved”, which is actually the same as scenario E but with two of the subclades (overlapping subtrees) expanded.

Only two of the six networks are shown here; the network \(N_B\) is displayed in Fig. 12a, and \(N_D\) in Fig. 12b. For the other four networks’ branching structures, the reader is referred to Table S4 in [31].

Fig. 12
figure 12

The networks \(N_{B}\) and \(N_{D}\) from [31]

The networks in Table S4 in [31] were inferred from a set of multilabeled trees. (A multilabeled tree is a generalization of a phylogenetic tree in which identical leaf labels are allowed to occur more than once.) The method that was used to construct the networks is explained in detail in Step 3 (“Inference of the Most Parsimonious Network from Multilabeled Gene Trees”) in the Materials and Methods-section of [31]. Table S4 in [31] also provides these multilabeled trees. In order to represent the multilabeled trees as distinctly leaf-labeled trees as well, [31] replaced any repeated leaf label x by unique leaf labels of the form \(x.1, x.2, \dots , x.i\); e.g., one occurrence of the leaf label Tridens was changed to Tridens.1, another one to Tridens.2, another one to Tridens.3, and so on. These (distinctly leaf-labeled) trees were also considered in our experiments and are referred to as \(T_{A}\), \(T_{B}\), \(T_{C}\), \(T_{D}\), \(T_{E}\), and \(T_{F}\).

The size of the leaf label set of \(T_{A}\), \(T_{B}\), \(T_{C}\), \(T_{D}\), \(T_{E}\), and \(T_{F}\) is 16, 20, 21, 21, 22, and 50 leaves, respectively. For every \(s \in \{A,B,C,D,E\}\), \(N_{s}\) contains 8 leaves, and \(N_{F}\) contains 16 leaves. Note that for all \(s \in \{A,B,C,D,E,F\}\), the number of leaf labels in \(T_{s}\) is larger than than the number of leaf labels in \(N_{s}\) due to the leaf relabeling process just described to obtain distinctly leaf-labeled trees.

In our implementations, the input trees are represented in standard Newick format and the input networks in extended Newick format [32]. We employ the graph-theoretic standard adjacency list to store the input networks, making it easy to support different input formats at the same time.

Experimental Results. We used the trees \(T_{s}\) and networks \(N_{s}\), where \(s \in \{A, B, C, D, E, F\}\), from Table S4 in [31], as explained above. In the experiments, we computed the rooted triplet distance between each \(T_{s}\) and \(N_{s}\) and also between pairs of these networks. According to Equation (1.1), \(D(T_{s},N_{s}) = S(T_{s},T_{s}) + S(N_{s},N_{s})- 2S(T_{s},N_{s})\). To make \(\mathcal {L}(T_{s}) = \mathcal {L}(N_{s})\) when computing \(D(T_{s},N_{s})\), if a leaf x in \(N_{s}\) appeared as several leaves \(x.1, \dots , x.i\) in \(T_{s}\) then we replaced x in \(N_{s}\) by leaves labeled \(x.1, \dots , x.i\), attaching each of them as a child of the parent of x. The maximum time spent by any of our algorithms was when computing \(D(T_{F}, N_{F})\), with NTDfirst requiring only 0.18 seconds to run and NTDsecond 0.05 seconds.

Our findings are summarized in Tables 2 and 3. By inspecting the tables, Experiment 2 reveals two ways that the current definition of the rooted triplet distance for networks could be improved:

  1. 1.

    Table 2 shows \(S(T_{s},T_{s})\), \(S(N_{s},N_{s})\), \(S(T_{s},N_{s})\), and \(D(T_{s},N_{s})\) for every \(s \in \{A,B,C,D,E,F\}\). The values of \(D(T_{s},N_{s})\) seem quite large compared to the number of triplets in each \(T_{s}\) (given by \(S(T_{s},T_{s})\)). This is because of the resolved triplets that arise when \(N_{s}\) is created from a multilabeled tree using the method in [31], and the fan triplets that are created whenever a leaf x is replaced by \(x.1, \dots , x.i\) in \(N_{s}\). Consequently, it would be desirable to give less weights to such triplets. A more flexible definition of the rooted triplet distance that can assign different weights to different triplets could therefore be useful.

  2. 2.

    Next, Table 3 lists the triplet distance \(D(N_{s}, N_{s'})\) for all pairs \(s, s' \in \{A,B,C,D,E\}\). The networks \(N_A, \dots , N_E\) have identical leaf label sets, but the leaf label set of \(N_F\) is different, which is why \(N_F\) is excluded from Table 3. Interestingly, although the two networks \(N_{B}\) and \(N_{D}\) are structurally different (see Fig. 12), their triplet distance is 0. This suggests that alternative definitions of the rooted triplet distance for networks may be better in practice, as discussed in Sect. 5 below.

Table 2 Experiments on the real datasets
Table 3 Experiments on the real datasets, continued

5 Final Remarks

Fig. 13
figure 13

Next to every vertex marked with a circle is the number of different pairs of disjoint paths from that vertex to the leaves with labels Tridens and Chilenium, and the number of different disjoint paths from the root to the vertex. With definition A of multiplicity for resolved triplets, the resolved triplet \(\texttt {Tridens}~\texttt {Chilenium}~|~\texttt {Andinium}\) appears \((4 + 6 + 2 + 1 + 2 + 1) \cdot 1 + 1 \cdot 2 = 18\) times in \(N_{B}\) and \((5 + 5 + 8 + 2 + 2 + 2 + 1 + 1) \cdot 1 + 1 \cdot 2 = 28\) times in \(N_{D}\). With definition B, this triplet appears 7 times in \(N_{B}\) and 9 times in \(N_{D}\)

Fig. 14
figure 14

Next to every vertex marked with a circle is the number of different pairs of disjoint paths from that vertex to the leaves with labels Chilenium and CHAM_clade, and the number of different disjoint paths from the root to the vertex. With definition A of multiplicity for resolved triplets, the resolved triplet \(\texttt {Chilenium}~\texttt {CHAM\_clade}~|~\texttt {Andinium}\) appears \((1 + 2 + 1) \cdot 1 = 4\) times in \(N_{B}\) and \((2 + 2 + 1) \cdot 1 = 5\) times in \(N_{D}\). With definition B, this triplet appears three times in both networks

We have developed two new algorithms for computing the rooted triplet distance between two phylogenetic networks over the same leaf label set. We have also presented an implementation of the algorithms and evaluated their performance on simulated and real datasets.

Future work involves creating new algorithms that are even more efficient than the algorithms given here, as well as to research variants of the studied problem that may provide more biologically meaningful ways for comparing networks. An example of such a variant is motivated by the experiments on the real dataset in Sect. 4.4. Recall that the two networks \(N_{B}\) and \(N_{D}\) were structurally different, yet their triplet distance was 0. The reason is that, unlike in the case of trees, the same triplet can appear several times in a network, and for two networks \(N_{1}\) and \(N_{2}\) to be compared, if a triplet appears 1000 times in \(N_{1}\) and only once in \(N_{2}\), it would contribute 0 under the current definition of \(D(N_{1},N_{2})\). However, extending the definition of the triplet distance for networks to capture information about the frequencies of triplets in the networks can be done in different ways, leading to different outcomes. For example, consider the following two alternative definitions of multiplicity for a resolved triplet xy|z, where u and v are the vertices used in the definition of the consistency of a resolved triplet with a network in Sect. 1:

  1. A.

    The total number of quadruples of paths of the form \(((u \leadsto v), (v \leadsto x), (v \leadsto y), (u \leadsto z))\) that are disjoint except for in u and v, and furthermore, the path from u to z does not pass through v.

  2. B.

    The total number of pairs of vertices (uv) such that there exist four paths of the form \((u \leadsto v), (v \leadsto x), (v \leadsto y), (u \leadsto z)\) that are disjoint except for in u and v, and furthermore, the path from u to z does not pass through v.

The definitions for the case of fan triplets are analogous. Now consider the two networks \(N_{B}\) and \(N_{D}\). As shown in Fig. 13, if we follow definition A of multiplicity, the resolved triplet \(\texttt {Tridens}~\texttt {Chilenium}~|~\texttt {Andinium}\) appears 18 times in \(N_{B}\) and 28 times in \(N_{D}\) (and we could thus let it contribute 10 to the extended rooted triplet distance). If we choose definition B instead, this resolved triplet appears 7 times in \(N_{B}\) and 9 times in \(N_{D}\). On the other hand, according to Fig. 14, the resolved triplet \(\texttt {Chilenium}~\texttt {CHAM\_clade}~|~\texttt {Andinium}\) appears 4 times in \(N_{B}\) and 5 times in \(N_{D}\) according to definition A, but 3 times in both networks according to definition B.

In summary, definition B seems somewhat simpler to compute than definition A, but it fails to distinguish between certain cases that definition A can handle. To determine under what circumstances definition B is good enough in practice is an open problem and a future research topic.

Finally, Cardona et al. [33] gave an alternative generalization of the rooted triplet distance from trees to networks. While the extension proposed by Gambette and Huber [13] is closer to the definition of the widely studied rooted triplet distance for trees, efficient algorithms for Cardona et al.’s extension might also be useful. However, as pointed out in [13] and [33], neither one of them yields a metric for all classes of phylogenetic networks (see Corollary 1 in [13] and Figs. 19 and 20 in [33]), so another open problem is to find even more informative generalizations.