1 Introduction

Orthology is one of the key concepts in evolutionary biology: Two genes are orthologs if their last common ancestor was a speciation event Fitch (1970). Distinguishing orthologs from paralogs (originating from gene duplications) or xenologs (i.e., genes that have undergone horizontal gene transfer) is of considerable practical importance for functional genome annotation and thus for a wide array of methods in bioinformatics and computational biology that rely on gene annotation data. In particular, according to the “ortholog conjecture”, orthologous genes in different species are expected to have essentially the same biological and molecular functions, whereas paralogs and xenologs tend to have similar, but distinct functions. Albeit controversial Nehrt et al. (2011), Stamboulian et al. (2020), this assumption is widely made in the computational prediction of gene functions Nehrt et al. (2011), Gabaldón and Koonin (2013), Soria et al. (2014), Zallot et al. (2016). Moreover, the distinction of orthologs and paralogs is crucial in phylogenomics Delsuc et al. (2005). Most of the commonly used tools for large-scale orthology identification compute reciprocal best hits as a first step followed by some filtering and refinement steps to improve the results Tatusov et al. (2000), Roth et al. (2008), Lechner et al. (2011), Linard et al. (2011), Sonnhammer and Östlund (2015), Train et al. (2017), Huerta-Cepas et al. (2018), see also Nichio et al. (2017), Setubal and Stadler (2018), Galperin et al. (2019) for reviews and Altenhoff et al. (2016) for benchmarking results.

Orthology identification has also received increasing attention from a mathematical perspective starting from the concept of an evolutionary scenario comprising a gene tree T and a species tree S together with a reconciliation map \(\mu \) from T to S. The map \(\mu \) identifies the locations in the species tree at which evolutionary events, represented by the vertices of the gene tree, took place. In this contribution, we consider exclusively duplication/loss scenarios, i.e., we explicitly exclude horizontal gene transfer. Characterizations of reconciliation maps are given e.g. in Górecki and Tiuryn (2006), Vernot et al. (2008), Doyon et al. (2011), Rusin et al. (2014). While every gene tree can be reconciled with any species tree Guigó et al. (1996), Page and Charleston (1997), this is no longer true if event-labels are prescribed in the gene tree T Hernandez-Rosales et al. (2012), Lafond and El-Mabrouk (2014), Hellmuth (2017).

The orthology relation itself has been characterized as a cograph (i.e., graphs that do not contain induced paths \(P_4\) on four vertices) by Hellmuth et al. (2013) based on earlier work by Böcker and Dress (1998). This line of research has led to the idea of editing reciprocal best hit data to conform to the required cograph structure Hellmuth et al. (2015). There are, however, two distinct sources of errors in an orthology assignment pipeline based on best matches:

  1. (i)

    inaccuracies in the assignment of best matches from sequence similarity data Stadler et al. (2020), and

  2. (ii)

    limits in the reconstruction of the “true” orthology relation from best match graphs Geiß et al. (2020b).

We consider best matches as an evolutionary concept: A gene y in species s is a best match of a gene x from species \(r\ne s\) if s contains no gene \(y'\) that is more closely related to x. That is, best matches capture the idea of phylogenetically most closely related genes. Maybe surprisingly, the combinatorial structure of best matches has become a focus only very recently Geiß et al. (2019). Best match graphs (BMGs) have several appealing properties: They have several alternative characterizations providing polynomial-time recognition algorithms Geiß et al. (2020a), Schaller et al. (2020) and they are “explained” by a unique least resolved tree Geiß et al. (2019). These properties will be introduced formally in the next section and play an important role in our discussion. The reciprocal best match graphs (RBMGs) are the symmetric parts of BMGs and conceptually correspond to the reciprocal best hits used in orthology detection. In contrast to BMGs, RBMGs are much more difficult to handle and are not associated with unique trees Geiß et al. (2020c). An example for an evolutionary scenario with corresponding BMG and RBMG is given Fig. 1.

Fig. 1
figure 1

An evolutionary scenario (left) consists of a gene tree \((T,\sigma )\) (whose observable part is shown in the second panel) together with an embedding into a species tree S. The coloring \(\sigma \) of the leaves of T represents the species in which the genes reside. Speciation vertices (\(\newmoon \)) of the gene tree coincide with the vertices of the species tree, whereas gene duplications (\(\square \)) are mapped to the edges of S. The reciprocal best match graph (RBMG) \((G,\sigma )\) on the right corresponds to the undirected graph underlying the symmetric part of the best match graph (BMG) \((\vec {G},\sigma )\) (third panel)

In this contribution, we are only concerned with the second source of errors, i.e., with the limits in the reconstruction of the true orthology relation from best matches. We therefore assume throughout that a “correct” BMG (cf. Def. 2) is given. We do not assume, however, that we have any a priori knowledge about the underlying gene or species tree. The problem we aim to solve is to determine the orthology relation that is best supported by the given BMG.

Of course, the true orthology relation is not known. Nevertheless, we start our mathematical analysis with the following definition: A pair of genes x and y that are not true orthologs but reciprocal best matches are false-positive orthologs. If they are orthologs but not reciprocal best matches, they are false-negative orthologs. Geiß et al. (2020b) showed that, for evolutionary scenarios that involve only speciations, gene duplications, and gene losses, there are no false-negative orthology assignments (see also Thm. 2 below). Our task therefore reduces to understanding the false-positive orthology assignments. Being a false positive is a property of the edge xy in an RBMG, and equivalently of the symmetric pair (xy) and (yx) in the BMG. Here, we aim to identify false-positive edges from the structure of the BMG itself.

Fig. 2
figure 2

Two scenarios (1st and 2nd panel to the left) for the evolution of a gene family embedded into a species tree (shown in gray), where \(\newmoon \) represents speciation and \(\square \) duplication events. The second scenario is the simplest example for a complementary gene loss that is not witnessed by any other species. In particular, the two different true histories result in the same topology \({\widetilde{T}}\) of the true (loss-free) gene tree, and thus explain the same BMG \((\vec {G},\sigma )\). However, only for the leftmost scenario the edge xy in \((\vec {G},\sigma )\) describes correct orthologs

We first note that false positives cannot be avoided altogether, i.e., not all false positives can be identified from a BMG alone. The simplest example, Fig. 2 (second scenario), comprises a gene duplication and a subsequent speciation and complementary gene losses in the descendant lineages such that each paralog survives only in one of them. In this situation, xy is a reciprocal best match. If there are no other descendants that harbor genes witnessing the duplication event, then the framework of best matches provides no information to recognize xy as a false-positive assignment.

On the other hand, RBMGs and thus BMGs contain at least some information on false positives. Since the orthology relation forms a cograph but RBMGs are not cographs in general Geiß et al. (2020c), incorrect orthology assignments are associated with induced \(P_4\)s, the forbidden subgraphs that characterize cographs. \(P_4\)s arise for instance as a consequence of the complete loss of different paralogous groups in disjoint lineages. Dessimoz et al. (2006) noted that such false-positive orthology assignments can be identified under certain circumstances, in particular, if there is some species in which both paralogs have survived. The corresponding motif in BMGs, the “good quartets”, was investigated in some detail by Geiß et al. (2020c). The removal of such false-positive orthologs already leads to a substantial improvement of the orthology assignments in simulated data Geiß et al. (2020b). Here, we extend the results of Geiß et al. (2020b) to a complete characterization of false-positive orthology assignments for a given BMG.

Good quartets cannot be defined on RBMGs because information on non-reciprocal best matches is also needed explicitly. This suggests to consider BMGs rather than RBMGs as the first step in graph-based orthology detection methods. In practice, best matches are approximated by sequence similarity and thus are subject to noise and biases Stadler et al. (2020). The empirically determined best match relation thus will usually need to be corrected to conform to the formal definition (cf. Def. 2 below) of BMGs. This naturally leads to a graph editing problem that was recently shown to be NP-complete Schaller et al. (2020), Hellmuth et al. (2020b).

Sec. 2 establishes the notation and summarizes properties of BMGs that are needed throughout this contribution. Sec. 3 formalizes the notion of unambiguous false-positive (u-fp) edges, i.e., reciprocal best matches that cannot be orthologs w.r.t. to any gene tree explaining the BMG. Sec. 4 contains the main mathematical contributions of this work:

  1. 1.

    We provide a full characterization of unambiguous false-positive orthology assignments in BMGs.

  2. 2.

    We provide a polynomial-time algorithm to determine all unambiguous false-positive orthology assignments in BMGs.

In Sec. 5, we complement the mathematical results with a computational analysis of simulated scenarios and observe that at least three quarters of all false positives fall into this class. The remaining cases are not recognizable from best matches alone and correspond to complementary losses without surviving witnesses, i.e., cases that cannot be corrected without additional knowledge on the gene tree and/or the species tree.

Since the material is extensive and very technical, we subdivide our presentation into a main narrative part (Secs. 16) and a technical part (Secs. AD) that contains all proofs and additional material in full detail. Together with the definitions and preliminaries in Sec. 2, the technical part is self-contained. Definitions and results appearing in the narrative part are therefore restated. The order of the material in the two parts is slightly different.

2 Preliminaries

2.1 Graphs and trees

We consider finite, directed graphs \(\vec {G}=(V,E)\), for brevity just called graphs throughout, with arc set \(E\subseteq V\times V{\setminus }\{(v,v)\mid v\in V\}\). We say that xy is an edge in \(\vec {G}\) if and only if both \((x,y)\in E(\vec {G})\) and \((y,x)\in E(\vec {G})\). If all arcs of \(\vec {G}\) in a graph form edges, we call \(\vec {G}\) undirected. A graph \(H=(W,F)\) is a subgraph of \(G=(V,E)\), in symbols \(H\subseteq G\), if \(W\subseteq V\) and \(F\subseteq E\). The underlying symmetric part of a directed graph \(\vec {G}=(V,E)\) is the subgraph \(G=(V,F)\) that contains all edges of \(\vec {G}\). A subgraph \(H=(W,F)\) (of \(\vec {G}\)) is called induced, denoted by \(\vec {G}[W]\), if for all \(u,v\in W\) it holds that \((u,v) \in E\) implies \((u,v) \in F\). In addition, we consider vertex-colored graphs \((\vec {G},\sigma )\) with vertex-coloring \(\sigma :V\rightarrow M\) into some set M of colors. A vertex-coloring is called proper if \(\sigma (x)\ne \sigma (y)\) for every arc (xy) in \(\vec {G}\). We write \(\sigma (W) = \{\sigma (w) \mid w\in W\}\) for subsets \(W\subseteq V\) and \(\sigma _{|W}\) to denote the restriction of the map \(\sigma \) to \(W\subseteq V\). In particular, \((\vec {G}[W],\sigma _{|W})\) is an induced vertex-colored subgraph of \((\vec {G},\sigma )\).

A path (of length \(\ell \)) in a directed graph \(\vec {G}\) or an undirected graph G is a subgraph induced by a nonempty sequence of pairwise distinct vertices \(P(x_0,x_{\ell }) :=(x_0, x_1, \dots , x_{\ell })\) such that \((x_i, x_{i+1}) \in E(\vec {G})\) or \(x_ix_{i+1} \in E(G)\), resp., for \(0 \le i \le \ell -1\). We use the notation \(P(x_0,x_{\ell })\) both for the sequence of vertices and the subgraph they induce.

All trees \(T=(V,E)\) considered here are undirected, planted and phylogenetic, that is, they satisfy (i) the root \(0_T\) has degree 1 and (ii) all inner vertices have degree \(\deg _T(u)\ge 3\). We write L(T) for the leaves (not including \(0_T\)) and \(V^0=V(T){\setminus }(L(T)\cup \{0_T\})\) for the inner vertices (also not including \(0_T\)). To avoid trivial cases, we will always assume \(|L(T)|\ge 2\). An edge uv in T is an inner edge if \(u,v\in V^0(T)\) are inner vertices. The conventional root \(\rho _T\) of T is the unique neighbor of \(0_T\). The main reason for using planted phylogenetic trees instead of modeling phylogenetic trees simply as rooted trees, which is the much more common practice in the field, is that we will often need to refer to the time before the first branching event, i.e., the edge \(0_T\rho _T\).

We define the ancestor order on a given tree T as follows: if y is a vertex of the unique path connecting x with the root \(0_T\), we write \(x\preceq _T y\), in which case y is called an ancestor of x and x is called a descendant of y. We use \(x \prec _T y\) for \(x \preceq _{T} y\) and \(x \ne y\). If \(x \preceq _{T} y\) or \(y \preceq _{T} x\) the vertices x and y are comparable and, otherwise, incomparable. If xy is an edge in T, such that \(y \prec _{T} x\), then x is the parent of y and y the child of x. We denote by \(\mathsf {child}_T(x)\) the set of all children of x. It will be convenient for the discussion below to extend the ancestor relation \(\preceq _T\) to the union of the edge and vertex sets of T. More precisely, for a vertex \(x\in V(T)\) and an edge \(e=uv\in E(T)\) with \(v\prec _T u\) we write \(x \prec _T e\) if and only if \(x\preceq _T v\) and \(e \prec _T x\) if and only if \(u\preceq _T x\). For edges \(e=uv\) with \(v\prec _T u\) and \(f=ab\) with \(b\prec _T a\) in T we put \(e\preceq _T f\) if and only if \(v \preceq _T b\).

For a non-empty subset \(A\subseteq V\cup E\), we define \({{\,\mathrm{lca}\,}}_T(A)\), the last common ancestor of A, to be the unique \(\preceq _T\)-minimal vertex of T that is an ancestor of every vertex or edge in A. For simplicity we drop the brackets and write \({{\,\mathrm{lca}\,}}_T(x_1,\dots ,x_k):={{\,\mathrm{lca}\,}}_T(\{x_1,\dots ,x_k\})\) whenever we specify a set of vertices or edges explicitly.

A vertex \(v\in V(T)\) is binary if \(\deg _T(v)=3\), i.e., if v has exactly two children. A tree is binary, if all vertices \(v\in V^0\) are binary. For \(v\in V(T)\) we denote by T(v) the subtree of T rooted in v. The set of clusters of a tree T is \({\mathscr {C}}(T) = \{L(T(v))\mid v\in V(T)\}\). It is well-known that \({\mathscr {C}}(T)\) uniquely determines TSemple and Steel (2003). We say that a tree T is a refinement of some tree \(T'\) if \({\mathscr {C}}(T')\subseteq {\mathscr {C}}(T)\). A tree \(T'\) is displayed by a tree T, in symbols \(T'\le T\), if \(T'\) can be obtained from a subtree of T by contraction of edges Semple (2003), where the contraction of an edge \(e = uv\) in a tree \(T = (V ,E)\) refers to the removal of e and identification of u and v. It is easy to verify that every refinement T of \(T'\) also displays \(T'\). However, the converse is not always true since \(L(T')\subsetneq L(T)\) and thus, \({\mathscr {C}}(T')\not \subseteq {\mathscr {C}}(T)\) may be possible.

2.2 (Reciprocal) best matches

We consider a pair \(T=(V,E)\) and \(S=(W,F)\) of planted phylogenetic trees together with a map \(\sigma :L(T)\rightarrow L(S)\). We interpret T as a gene tree and S as a species tree; the map \(\sigma \) describes, for each gene \(x\in L(T)\), in the genome of which species \(\sigma (x)\in L(S)\) it resides. W.l.o.g. we assume that the “gene-species-association” \(\sigma \) is a surjective map to avoid trivial cases. Since \(\sigma \) can be viewed as a coloring of the leaves of T, we call \((T,\sigma )\) a leaf-colored tree. For \(s\in L(S)\) we write \(L[s]:=\{x\in L(T)|\sigma (x)=s\}\).

Definition 1

Let \((T,\sigma )\) be a leaf-colored tree. A leaf \(y\in L(T)\) is a best match of the leaf \(x\in L(T)\) if \(\sigma (x)\ne \sigma (y)\) and \({{\,\mathrm{lca}\,}}(x,y)\preceq _T {{\,\mathrm{lca}\,}}(x,y')\) holds for all leaves \(y'\) from species \(\sigma (y')=\sigma (y)\). The leaves \(x,y\in L(T)\) are reciprocal best matches if y is a best match for x and x is a best match for y.

Neither best matches nor reciprocal best matches are unique. That is, a gene x may have two or more (reciprocal) best matches of the same color \(r\ne \sigma (x)\). Some orthology detection tools, such as ProteinOrtho Lechner et al. (2011), explicitly attempt to extract all reciprocal best matches from the sequence data. Moreover, neither of the two relations is transitive. These two properties are at odds e.g. with the clusters of orthologous groups (COGs) concept (cf. Tatusov et al. 1997, 2000; Roth et al. 2008), which at least conceptually presupposes unique reciprocal best matches.

The graph \(\vec {G}(T,\sigma ) = (V,E)\) with vertex set \(V=L(T)\), vertex coloring \(\sigma \), and with arcs \((x,y)\in E\) if and only if y is a best match of x w.r.t. \((T,\sigma )\) is known as the (colored) best match graph of \((T,\sigma )\) Geiß et al. (2019). The symmetric part \(G(T,\sigma )\) of \(\vec {G}(T,\sigma )\) obtained by retaining the edges of \(\vec {G}(T,\sigma )\) is the (colored) reciprocal best match graph Geiß et al. (2020c).

Definition 2

An arbitrary vertex-colored graph \((\vec {G},\sigma )\) is a best match graph (BMG) if there exists a leaf-colored tree \((T,\sigma )\) such that \((\vec {G},\sigma ) = \vec {G}(T,\sigma )\). In this case, we say that \((T,\sigma )\) explains \((\vec {G},\sigma )\). An arbitrary undirected vertex-colored graph \((G,\sigma )\) is a reciprocal best match graph (RBMG) if it is the symmetric part of a BMG \((\vec {G},\sigma )\).

For the symmetric part of the BMG \((\vec {G},\sigma )\), i.e., the RBMG \((G,\sigma )\), we have \(xy\in E(G)\) if and only if x and y are reciprocal best matches in \((T,\sigma )\). In this sense, \((T,\sigma )\) also explains \((G,\sigma )\). We note, furthermore, that RBMGs are not associated with a unique least resolved tree Geiß et al. (2020c).

2.3 Reconciliation maps, event-labeling, and orthology relations

An evolutionary scenario extends the map \(\sigma :L(T)\rightarrow L(S)\) to an embedding of the gene tree into the species tree. It (implicitly) describes different types of evolutionary events: speciations, gene duplications, and gene losses. In this contribution we do not consider other types of events such as horizontal gene transfer. Gene losses do not appear explicitly since L(T) only contains extant genes. Inner vertices in the gene tree T that designate speciations have their correspondence in inner vertices of the species tree. In contrast, gene duplications occur independently of speciations and thus belong to edges of the species tree. The embedding of T into S is formalized by

Definition 3

(Reconciliation Map) Let \(S=(W,F)\) and \(T=(V,E)\) be two planted phylogenetic trees and let \(\sigma :L(T) \rightarrow L(S)\) be a surjective map. A reconciliation from \((T,\sigma )\) to S is a map \(\mu :V \rightarrow W \cup F\) satisfying

(R0):

Root Constraint. \(\mu (x) = 0_S\) if and only if \(x=0_T\).

(R1):

Leaf Constraint. If \(x \in L(T)\), then \(\mu (x)=\sigma (x)\).

(R2):

Ancestor Preservation. If \(x \prec _T y\), then \(\mu (x) \preceq _S \mu (y)\).

(R3):

Speciation Constraints. Suppose \(\mu (x) \in W^0\) for some \(x\in V\). Then

(i):

\(\mu (x)={{\,\mathrm{lca}\,}}_S(\mu (v'),\mu (v''))\) for at least two distinct children \(v',v''\) of x in T.

(ii):

\(\mu (v')\) and \(\mu (v'')\) are incomparable in S for any two distinct children \(v'\) and \(v''\) of x in T.

Several alternative definitions of reconciliation maps for duplication/loss scenarios have been proposed in the literature, many of which have been shown to be equivalent. This type of reconciliation map has been established in Geiß et al. (2020b). Moreover, it has been shown in Geiß et al. (2020b) that the axiom set used here is equivalent to axioms that are commonly used in the literature, see e.g. Górecki and Tiuryn (2006), Vernot et al. (2008), Doyon et al. (2011), Rusin et al. (2014), Hellmuth (2017), Nøjgaard et al. (2018), and the references therein. Without any further constraints, Def. 3 gives rise to a well-known result:

Lemma 1

(Geiß et al. 2020b, Lemma 3) For every tree \((T, \sigma )\) there is a reconciliation map \(\mu \) to any species tree S with leaf set \(L(S) = \sigma (L(T ))\).

The reconciliation map \(\mu \) from \((T,\sigma )\) to S determines the types of evolutionary events in T. This can be formalized by associating an event labeling with the vertices of T. We use the notation introduced in Geiß et al. (2020b):

Definition 4

Given a reconciliation map \(\mu \) from \((T,\sigma )\) to S, the event labeling on T (determined by \(\mu \)) is the map \(t_\mu :V(T)\rightarrow \{\circledcirc ,\odot ,\newmoon ,\square \}\) given by:

$$\begin{aligned} t_\mu (u) = {\left\{ \begin{array}{ll} \circledcirc &{} \, \text {if } u=0_T \text {, i.e., } \mu (u)=0_S \text { (root)}\\ \odot &{} \, \text {if } u\in L(T) \text {, i.e., } \mu (u)\in L(S) \text { (leaf)}\\ \newmoon &{} \, \text {if } \mu (u)\in V^0(S) \text { (speciation)}\\ \square &{} \, \text {else, i.e., } \mu (u)\in E(S) \text { (duplication)}\\ \end{array}\right. } \end{aligned}$$

The following result is a simple but useful consequence of combining the axioms of the reconciliation map with the event labeling of Def. 4.

Lemma 2

(Geiß et al. 2020b, Lemma 3) Let \(\mu \) be a reconciliation map from \((T,\sigma )\) to a tree S and suppose that \(u\in V(T)\) is a vertex with \(\mu (u)\in V^0(S)\) and thus, \(t(\mu (u))=\newmoon \). Then, \(\sigma (L(T(v_1)))\cap \sigma (L(T(v_2))) = \emptyset \) for any two distinct \(v_1,v_2\in \mathsf {child}(u)\).

We will regularly make use of the observation that, by contraposition of Lemma 2, \(\sigma (L(T(v)))\cap \sigma (L(T(v'))) \ne \emptyset \) for two distinct \(v_1,v_2\in \mathsf {child}(u)\) implies that \(\mu (u)\in E(S)\), and thus \(t_{\mu }(u)=\square \).

Lemma 2 suggests to define event-labeled trees as trees (Tt) endowed with a map \(t: V(T)\rightarrow \{\circledcirc ,\odot ,\newmoon ,\square \}\) such that \(t(0_T)=\circledcirc \) and \(t(u)=\odot \) for all \(u\in L(T)\). In Geiß et al. (2020b), Lemma 2 also served as a motivation for

Definition 5

Let \((T,\sigma )\) be a leaf-colored tree. The extremal event labeling of T is the map \({{\widehat{t}}_T}:V(T)\rightarrow \{\circledcirc ,\odot ,\newmoon ,\square \}\) defined for \(u\in V(T)\) by

$$\begin{aligned} {{\widehat{t}}_T}(u) = {\left\{ \begin{array}{ll} \circledcirc &{} \, \text {if } u=0_{T} \\ \odot &{} \, \text {if } u\in L(T) \\ \square &{} \, \text {if there are two children } v_1,v_2\in \mathsf {child}(u) \text { such that}\\ &{} \qquad \sigma (L(T(v_1)))\cap \sigma (L(T(v_2)))\ne \emptyset \\ \newmoon &{} \, \text {otherwise} \\ \end{array}\right. } \end{aligned}$$

An example of an extremal event labeling is shown in Fig. 9 (rightmost tree). The extremal event labeling is closely related to the concept of apparent duplication (AD) vertices often found in the literature (e.g. Swenson et al. 2012; Lafond et al. 2014). For a (binary) gene tree T and a reconciliation of T with a species tree S, a duplication vertex of T is an AD vertex if its two subtrees have at least one color in common. In contrast, it is a non-apparent duplication (NAD) vertex if the color sets of its subtrees are disjoint. This notion is useful for a variety of parsimony problems that usually aim to avoid or minimize the number of NAD vertices Swenson et al. (2012), Lafond et al. (2014). However, the extremal event labeling \({{\widehat{t}}_T}\) is completely defined by \((T,\sigma )\). That is, in contrast to both the event labeling in Def. 4 and the concept of AD and NAD vertices, \({{\widehat{t}}_T}\) does not depend on a specific reconciliation map. On the other hand, there is no guarantee that there always exists a reconciliation map \(\mu \) from \((T,\sigma )\) to some species tree S such that \(t_{\mu } = {{\widehat{t}}_T}\), cf. (Geiß et al. 2020b, Fig. 2) and Fig. 9 in Sec. 4.2 for counterexamples. Nevertheless, we shall see below that the extremal labeling is a key step towards identifying false-positive orthology assignments.

The event labeling on T defines the orthology graph.

Definition 6

The orthology graph \(\Theta (T,t)\) of an event-labeled tree (Tt) has vertex set L(T) and edges \(uv\in E(\Theta )\) if and only if \(t({{\,\mathrm{lca}\,}}(u,v))=\newmoon \).

The orthology graph is often referred to as the orthology relation. Orthology graphs coincide with a well-known graph class:

Theorem 1

(Hellmuth et al. 2013, Cor. 4) A graph G is an orthology graph for some event-labeled tree (Tt), i.e. \(G=\Theta (T,t)\), if and only if G is a cograph.

One of many equivalent characterizations of cographs identifies them with the graphs that do not contain an induced path \(P_4\) on four vertices Corneil et al. (1981).

The orthology graph is a subgraph of the RBMG (and thus also of the BMG) for any given reconciliation map connecting a gene with a species tree.

Theorem 2

(Geiß et al. 2020b, Lemma 4 & 5) Let \((T,\sigma )\) be a leaf-colored tree and \(\mu \) a reconciliation map from \((T,\sigma )\) to some species tree S. Then \(\Theta (T,t_{\mu }) \subseteq \Theta (T,{{\widehat{t}}_T})\subseteq G(T,\sigma ) \subseteq \vec {G}(T,\sigma )\).

In particular, \(t_{\mu }(v) =\newmoon \) implies \({{\widehat{t}}_T}(v) =\newmoon \) for any reconciliation map. By contraposition, therefore, if \({{\widehat{t}}_T}(v) =\square \) then \(t_{\mu }(v) =\square \) for all possible reconciliation maps \(\mu \) from \((T,\sigma )\) to any species tree S. A crucial implication of Thm. 2 is that edges in a BMG \(\vec {G}(T,\sigma )\) always correspond to either correct orthologous pairs of genes or false-positive orthology assignments. Hence, \(\vec {G}(T,\sigma )\) never contains false-negative orthology assignments.

3 False-positive orthology assignments

As discussed in the introduction, we are not concerned here with the errors that arise in the reconstruction of best matches from sequence similarity data. We therefore assume that we are given a BMG \((\vec {G},\sigma )\) as specified in Def. 2. More precisely, we assume that \((\vec {G},\sigma )\) derives from a duplication/loss scenario that is unknown to us. Denote by \(({\widetilde{T}},{\widetilde{t}},\sigma )\) the corresponding true leaf-colored and event-labeled gene tree. An edge xy of \((\vec {G},\sigma )\), or equivalently of the corresponding RBMG \((G,\sigma )\), is a false-positive orthology assignment if \(xy\in E(G)\) but \(xy\notin E(\Theta ({\widetilde{T}},{\widetilde{t}}))\). By Thm. 2, \((G,\sigma )\) cannot contain false-negative orthology assignments, i.e., there is no \(xy\in E(\Theta ({\widetilde{T}},{\widetilde{t}}))\) with \(xy\notin E(G)\). We assume no additional information about the gene tree or the species tree, i.e., the only data about the evolutionary scenario that is available to us is the BMG \((\vec {G},\sigma )\).

In order to study false-positive orthology assignments, we first consider a tree \((T,\sigma )\) that explains the BMG \((\vec {G},\sigma )\). We neither make the assumption that \((T,\sigma )\) is least resolved nor that \((T,\sigma )\) reflects the true history, i.e., that \((T,\sigma )\) is related to the true gene tree \(({\widetilde{T}},\sigma )\).

Definition 10

(\({(T,\sigma )}\)-false-positive) Let \((T,\sigma )\) be a tree explaining the BMG \((\vec {G},\sigma )\). An edge xy in \(\vec {G}\) is called \((T,\sigma )\)-false-positive, or \((T,\sigma )\)-fp for short, if for every reconciliation map \(\mu \) from \((T,\sigma )\) to any species tree S we have \(t_\mu ({{\,\mathrm{lca}\,}}_T(x,y))=\square \), i.e., \(\mu ({{\,\mathrm{lca}\,}}_T(x,y))\in E(S)\),

In other words, xy is called \((T,\sigma )\)-fp whenever x and y cannot be orthologous w.r.t. any possible reconciliation \(\mu \) from \((T,\sigma )\) to any species tree. Interestingly, \((T,\sigma )\)-fp s can be identified without considering reconciliation maps explicitly.

Lemma 10

Let \((\vec {G},\sigma )\) be a BMG, xy be an edge in \(\vec {G}\) and \((T,\sigma )\) be a tree that explains \((\vec {G},\sigma )\). Then, the following statements are equivalent:

  1. 1.

    The edge xy is \((T,\sigma )\)-fp.

  2. 2.

    There are two children \(v_1\) and \(v_2\) of \({{\,\mathrm{lca}\,}}_T(x,y)\) such that \(\sigma (L(T(v_1)))\cap \sigma (L(T(v_2)))\ne \emptyset \).

  3. 3.

    For the extremal labeling \({{\widehat{t}}_T}\) of \((T,\sigma )\) it holds that \({{\widehat{t}}_T}({{\,\mathrm{lca}\,}}_T(x,y)) = \square \).

Lemma 10 implies that \((T,\sigma )\)-fp can be verified in polynomial time for any given gene tree \((T,\sigma )\). By contraposition of Lemma 2, inner vertices with two distinct children \(v_1\) and \(v_2\) satisfying \(\sigma (L(T(v_1)))\cap \sigma (L(T(v_2)))\ne \emptyset \) are duplication vertices for every possible reconciliation map to every possible species tree. Therefore, the property of being an AD vertex only depends on \((T,\sigma )\). In particular, \((T,\sigma )\)-fp edges coincide with the edges xy in \((\vec {G},\sigma )\) for which \({{\,\mathrm{lca}\,}}_{T}(x,y)\) is an AD vertex.

Fig. 3
figure 3

The BMG \((\vec {G},\sigma )\) shown on the right is explained by both \((T_1,\sigma )\), which is the unique least resolved tree for \((\vec {G},\sigma )\), and \((T_2,\sigma )\). The vertices labeled \(\square \) must be duplications due to Lemma 2, whereas the vertices labeled “?” could be both duplications or speciations. The edges xz, \(x'z\) and yz are \((T_1,\sigma )\)-fp but not \((T_2,\sigma )\)-fp (cf. Lemma 10). Thus, neither of the edges xz, \(x'z\) and yz is u-fp

As shown in Fig. 3, there are trees \((T_1,\sigma )\) and \((T_2,\sigma )\) that explain the same BMG for which, however, the edges xz, \(x'z\), and yz are \((T_1,\sigma )\)-fp but not \((T_2,\sigma )\)-fp. Since we assume that no information on \((T,\sigma )\) is available a priori, it is natural to consider the set of edges that are false positives for all trees explaining a given BMG.

Definition 11

(Unambiguous false-positive) Let \((\vec {G},\sigma )\) be a BMG. An edge xy in \(\vec {G}\) is called unambiguous false-positive (u-fp) if for all trees \((T,\sigma )\) that explain \((\vec {G},\sigma )\) the edge xy is \((T,\sigma )\)-fp.

Hence, if an edge xy in \(\vec {G}\) is u-fp, then it is in particular \((T,\sigma )\)-fp in the true history that explains \((\vec {G},\sigma )\). Thus, u-fp edges are always correctly identified as false positives. Not all “correct” false-positive edges are u-fp, however. It is possible that, for an edge xy in \(\vec {G}\), we have \(t_\mu ({{\,\mathrm{lca}\,}}_T(x,y))=\square \) for the true gene tree and the true species tree, but xy is not \((T',\sigma )\)-fp for some gene tree \((T',\sigma )\) possibly different from \((T,\sigma )\). One of the simplest examples is shown in Fig. 2, assuming that \((\vec {G},\sigma )\) is the “true” BMG. Since \(t_\mu ({{\,\mathrm{lca}\,}}_{{\widetilde{T}}}(x,y))=\newmoon \) may be possible (Fig. 2, leftmost scenario, the edge xy is not \(({\widetilde{T}},\sigma )\)-fp and therefore not u-fp.

4 Main results

4.1 Characterization of u-fp edges

In order to adapt the concept of AD vertices for our purposes, we introduce the color-intersection \({\mathcal {S}}^{\cap }\) associated with a gene tree \((T,\sigma )\). For a pair of distinct leaves \(x,y\in L(T)\) we denote by \(v_x, v_y \in \mathsf {child}_T({{\,\mathrm{lca}\,}}_T(x,y))\) the unique children of the last common ancestor of x and y for which \(x\preceq _T v_x\) and \(y\preceq _T v_y\). That is, \(T(v_x)\) and \(T(v_y)\) are the subtrees of T rooted in the children of \({{\,\mathrm{lca}\,}}_T(x,y)\) with \(x\in L(T(v_x))\) and \(y\in L(T(v_y))\). The set

$$\begin{aligned} {\mathcal {S}}_T^{\cap }(x,y):=\sigma (L(T(v_x)))\cap \sigma (L(T(v_y))) \end{aligned}$$

contains the colors, i.e. species, that are common to both subtrees. The existence of common colors, \({\mathcal {S}}_T^{\cap }(x,y)\ne \emptyset \), determines whether or not the inner vertex \({{\,\mathrm{lca}\,}}_T(x,y)\) is AD. Lemma 11 (Sec. B.2) shows that the color-intersection \({\mathcal {S}}_T^{\cap }(x,y)\) of an edge in a BMG \((\vec {G},\sigma )\) is independent of the corresponding tree. Hence, it suffices to consider the color-intersection for the unique least resolved tree \((T^*,\sigma )\) explaining \((\vec {G},\sigma )\). From here on, we drop the explicit reference to the tree and simply write \({\mathcal {S}}^{\cap }(x,y)\); see also Remark 1 in Sec. B.2. The color-intersection provides a sufficient condition for u-fp edges in a BMG.

Prop. 1 and Cor. 3

Every edge xy in a BMG \((\vec {G},\sigma )\) with \({\mathcal {S}}^{\cap }(x,y)\ne \emptyset \) is \((T,\sigma )\)-fp for every tree \((T,\sigma )\) that explains \((\vec {G},\sigma )\), and thus u-fp.

As we shall see below, the converse of Prop. 1 and Cor. 3 is not true in general. It does hold for the special case of binary trees, however:

Theorem 4

Let \((\vec {G},\sigma )\) be a BMG that is explained by a binary tree \((T,\sigma )\). Then, for every edge xy in \((\vec {G},\sigma )\), the following three statements are equivalent:

  1. 1.

    The edge xy is \((T,\sigma )\)-fp.

  2. 2.

    \({\mathcal {S}}^{\cap }(x,y)\ne \emptyset \).

  3. 3.

    The edge xy is u-fp.

Prop. 8 in Sec. 4.3 provides a characterization of BMGs that can be explained by binary trees; a property that can be tested in polynomial time (cf. Cor. 6). However, not every BMG can be explained by a binary tree as shown by the simple example in Fig. 6(A). This BMG can only be explained by the unique non-binary tree as shown in Fig. 6(B).

Since every orthology graph is a cograph (Thm. 1) and thus free of induced \(P_4\)s, every induced \(P_4\) in the RBMG necessarily contains a false-positive orthology assignments. The subgraphs of the BMG spanned by a \(P_4\) in its symmetric part (i.e., the RBMG) are known as quartets. The quartets on three colors of a BMG \((\vec {G},\sigma )\) fall into three distinct classes depending on the coloring and the additional, non-symmetric edges (cf. (Geiß et al. 2020c, Lemma 32)). We write \(\langle abcd \rangle \) or, equivalently, \(\langle dcba \rangle \) for an induced \(P_4\) with edges ab, bc, and cd.

Definition 12

(Good, bad, and ugly quartets) Let \((\vec {G},\sigma )\) be a BMG with symmetric part \((G,\sigma )\) and vertex set L, and let \(Q:=\{x,y,z,z'\} \subseteq L\) with \(x\in L[r]\), \(y\in L[s]\), and \(z,z'\in L[t]\). The set Q, resp., the induced subgraph \((\vec {G}[Q],\sigma _{|Q})\) is

  • a good quartet if (i) \(\langle zxyz'\rangle \) is an induced \(P_4\) in \((G,\sigma )\) and (ii) \((z,y),(z',x)\in E(\vec {G})\) and \((y,z),(x,z')\notin E(\vec {G})\),

  • a bad quartet if (i) \(\langle zxyz'\rangle \) is an induced \(P_4\) in \((G,\sigma )\) and (ii) \((y,z),(x,z')\in E(\vec {G})\) and \((z,y),(z',x)\notin E(\vec {G})\),

  • an ugly quartet if \(\langle zxz'y\rangle \) is an induced \(P_4\) in \((G,\sigma )\).

The edge xy in a good quartet \(\langle zxyz'\rangle \) is its middle edge. The edge zx of an ugly quartet \(\langle zxz'y\rangle \) or a bad quartet \(\langle zxyz'\rangle \) is called its first edge. First edges in ugly quartets are uniquely determined due to the colors. In bad quartets, this is not the case and therefore, the edge \(yz'\) in \(\langle zxyz'\rangle \) is a first edge as well.

Fig. 4
figure 4

The three types of quartets in BMGs. Ugly quartets may or may not contain either of the two (dashed) arcs between x and y, and y and z, respectively. Bold edges highlight the middle and first edges of the respective quartets as specified in Def. 12

The three different types of quartets are shown in Fig. 4. RBMGs never contain induced \(P_4\)s on two colors (Geiß et al. 2020c, Obs. 5). This, in particular, implies that for the induced \(P_4\)s in Def. 12 the colors r, s, and t must be pairwise distinct. Note that (R)BMGs may also contain induced \(P_4\)s on four colors. These are investigated in some more detail in Secs. 4.3 and D.3.

Good quartets are characteristic of a complementary gene loss (as shown in Fig. 2) that is “witnessed” by a third species in which both child branches of the problematic duplication event survive. That is, good quartets appear if there is a pair of genes z and \(z'\) with \(\sigma (z)=\sigma (z')\) and \({{\,\mathrm{lca}\,}}(z,z')={{\,\mathrm{lca}\,}}(x,y)\) in the true gene tree. We remark that previous work also noted that complementary gene loss can be resolved successfully under certain circumstances Dessimoz et al. (2006) such as this one. An in-depth analysis of quartets shows that they can be used to identify many of the u-fp edges. We collect here the main results of Sec. B.3:

Prop. 2, 3 and 4

Let \({\mathcal {Q}} = \langle xyzw \rangle \) be a quartet in a BMG \((\vec {G},\sigma )\).

  1. (i)

    If \({\mathcal {Q}}\) is good, then its middle edge yz is u-fp.

  2. (ii)

    If \({\mathcal {Q}}\) is ugly, then its first edge xy and its middle edge yz are u-fp.

  3. (iii)

    If \({\mathcal {Q}}\) is bad, then its first edges xy and zw are u-fp.

Not surprisingly, quartets are intimately linked to color-intersections:

Corollary 4

Let \((\vec {G},\sigma )\) be a BMG that contains the edge xy. Then, \({\mathcal {S}}^{\cap }(x,y)\ne \emptyset \) implies that xy is either the middle edge of some good quartet or the first edge of some ugly quartet, which in turn implies that xy is u-fp.

All u-fp edges xy with \({\mathcal {S}}^{\cap }(x,y)\ne \emptyset \) in \((\vec {G},\sigma )\) are therefore completely determined by the middle edges of good quartets and the first edges of ugly quartets. In particular, not all such edges are the middle edge of a good quartet as the example in Fig. 5 shows. Therein, the edge xy must be u-fp since \({\mathcal {S}}^{\cap }(x,y)=\{\sigma (z)\}\ne \emptyset \) (cf. Prop. 1). The only good quartet is \(\langle zx'yz'\rangle \) identifying \(x'y\) as u-fp. Moreover, \((\vec {G},\sigma )\) does not contain any bad quartet. The edge xy, on the other hand, is the first edge of the ugly quartet \(\langle xyx'z\rangle \).

Fig. 5
figure 5

Example for a \((T,\sigma )\)-fp edge xy in \((\vec {G},\sigma )\) which is not the middle edge of a good quartet, but the first edge in an ugly quartet (right). Note, \((\vec {G},\sigma )\) does not contain bad quartets

Furthermore, if an edge xy is the middle edge of a good quartet, then \({\mathcal {S}}^{\cap }(x,y)\ne \emptyset \). Therefore, only ugly quartets may provide additional information about u-fp edges that are not identified with the help of the color-intersection \({\mathcal {S}}^{\cap }\) (see Fig. 14 in Sec. B.3 for an example). Ugly quartets, however, do not convey all the missing information on u-fp edges. The edge xy in the BMG shown in Fig. 6(A) is u-fp, but it is not contained in a good, bad, or ugly quartet.

In order to characterize the u-fp edges that are not identified by quartets, we first introduce an additional motif that may occur in vertex-colored graphs.

Definition 13

(Hourglass) An hourglass in a proper vertex-colored graph \((\vec {G},\sigma )\), denoted by , is a subgraph \((\vec {G}[Q],\sigma _{|Q})\) induced by a set of four pairwise distinct vertices \(Q=\{x, x', y, y'\}\subseteq V(\vec {G})\) such that (i) \(\sigma (x)=\sigma (x')\ne \sigma (y)=\sigma (y')\), (ii) xy and \(x'y'\) are edges in \(\vec {G}\), (iii) \((x,y'),(y,x')\in E(\vec {G})\), and (iv) \((y',x),(x',y)\notin E(\vec {G})\).

Note that Condition (i) rules out arcs between \(x,x'\) and \(y,y'\), respectively, i.e., the only arcs in an hourglass are the ones specified by Conditions (ii) and (iii). An example is shown in Fig. 6(A).

Observation 5

Every hourglass is a BMG since it can be explained by a tree as shown in Fig. 6(B).

Hourglasses are not necessarily part of an induced \(P_4\). In particular, an hourglass does not contain an induced \(P_4\) (see Fig. 6(A)).

Fig. 6
figure 6

A: Hourglass. B: Visualization of Lemma 14. C: Hourglass chain with left tail z and right tail \(z'\) for an odd number of hourglasses in the chain. Edges of the form \(x_i y'_j\in E(G)\) are only shown for \(x_1\), the others are omitted. An hourglass chain \({\mathfrak {H}}\) is a subgraph but not necessarily induced and thus additional arcs may exist. In particular, the elements \(e\in \{x_1y_k, zy_k, x_1z', zz'\}\) are not necessarily edges in an hourglass chain. However, whenever they exist, they are u-fp (cf. Lemma 17). Moreover, each single hourglass in \({\mathfrak {H}}\) is an induced subgraph of the BMG; by definition, therefore, there are no arcs \((z,x'_1)\) or \((z',y'_k)\). Note, \(\sigma (z)\ne \sigma (z')\) is possible. D: Visualization of Lemmas 15 and 16

Hourglasses can be used to identify false-positive edges xy with \({\mathcal {S}}^{\cap }(x,y)=\emptyset \). More precisely, we have

Proposition 6

If a BMG \((\vec {G},\sigma )\) contains an hourglass , then the edge xy is u-fp.

Prop. 6 implies that there are u-fp edges that are not contained in a quartet, see Fig. 6(A). In this example, we have \({\mathcal {S}}^{\cap }(x,y)=\emptyset \) and no induced \(P_4\). However, as shown in Fig. 6(B), the subtree \(T(v_2)\) contains both colors \(\sigma (x)\) and \(\sigma (y)\) and thus, “bridges” the color sets of the subtrees \(T(v_1)\) and \(T(v_3)\). Similarly, in the tree \((T,\sigma )\) in Fig. 6(D), each subtree \(T(v_i)\), \(1\le i \le k\) “bridges” the color sets of the subtrees \(T(v_{i-1})\) and \(T(v_{i+1})\). This observation suggests the concept of hourglass chains, a generalization of hourglasses.

Definition 14

(Hourglass chain) An hourglass chain \({\mathfrak {H}}\) in a graph \((\vec {G},\sigma )\) is a sequence of \(k\ge 1\) hourglasses such that the following two conditions are satisfied for all \(i\in \{1,\dots ,k-1\}\):

  1. (H1)

    \(y_i=x'_{i+1}\) and \(y'_i=x_{i+1}\), and

  2. (H2)

    \(x_i y'_j\) is an edge in \(\vec {G}\) for all \(j\in \{i+1,\dots ,k\}\)

A vertex z is called a left (resp., right) tail of the hourglass chain \({\mathfrak {H}}\) if it holds that \((z,x_1)\in E(\vec {G})\) and \((z,x'_1)\notin E(\vec {G})\) (resp., \((z,y_k)\in E(\vec {G})\) and \((z,y'_k)\notin E(\vec {G})\)). We call \({\mathfrak {H}}\) tailed if it has a left or right tail.

In contrast to the quartets and the hourglass, an hourglass chain in \((\vec {G},\sigma )\) is not necessarily an induced subgraph. Hourglass chains are “overlapping” hourglasses. The additional condition that \(x_i y'_j\in E(G)\) for all \(1\le i<j\le k\) ensures that the two pairs \(x'_k,y'_k\) and \(x'_l,y'_l\) with \(k\ne l\) cannot lie in the same subtree below the last common ancestor u which is common to all hourglasses in the chain (cf. Lemma 15 and 16 in Sec. B.4).

Definition 16

An edge xy in a vertex-colored graph \((\vec {G},\sigma )\) is a hug-edge if it satisfies at least one of the following conditions:

  1. (C1)

    xy is the middle edge of a good quartet in \((\vec {G},\sigma )\);

  2. (C2)

    xy is the first edge of an ugly quartet in \((\vec {G},\sigma )\); or

  3. (C3)

    there is an hourglass chain in \((\vec {G},\sigma )\), and one of the following cases holds:

    1. 1.

      \(x_1=x\) and \(y_k=y\);

    2. 2.

      \(y_k=y\) and \(z:=x\) is a left tail of \({\mathfrak {H}}\);

    3. 3.

      \(x_1=x\) and \(z':=y\) is a right tail of \({\mathfrak {H}}\); or

    4. 4.

      \(z:=x\) is a left tail and \(z':=y\) is a right tail of \({\mathfrak {H}}\).

The term hug-edge refers to the fact that xy is a particular edge of an hourglass-chain, an ugly quartet, or a good quartet. In Sec. C.4, we show that hug-edges coincide with the u-fp edges.

Theorem 11

An edge xy in a BMG \((\vec {G},\sigma )\) is u-fp if and only if xy is a hug-edge of \((\vec {G},\sigma )\).

Interestingly, bad quartets turn out to be redundant for the identification of u-fp edges in the sense that every u-fp edge in a bad quartet appears as a u-fp edge in a good quartet, an ugly quartet, or an hourglass chain. At present, we do not know whether hourglass chains in a colored graph \((\vec {G},\sigma )\) can be found efficiently. We shall see in the following section, however, that the identification of u-fp edges does not require the explicit enumeration of hourglass chains.

The fact that all hug-edges are u-fp by Thm. 11 suggests to consider the subgraph of a BMG that is left after removing all these unambiguously recognizable false-positive orthology assignments.

Definition 17

Let \((\vec {G},\sigma )\) be a BMG with symmetric part G and let F be the set of its hug-edges. The no-hugFootnote 1 graph \({\mathbb {N}}{\mathbb {H}}(\vec {G},\sigma )\) is the subgraph of G with vertex set \(V(\vec {G})\), coloring \(\sigma \) and edge set \(E(G){\setminus } F\).

By Thm. 11, \({\mathbb {N}}{\mathbb {H}}(\vec {G},\sigma )\) is therefore the subgraph of the underlying RBMG of \((\vec {G},\sigma )\) that does not contain any u-fp edge. Importantly, it contains the orthology graph for every reconciliation map \(\mu \) as well as the orthology graph induced by the extremal event labeling as subgraphs:

Corollary 5

Let \((T,\sigma )\) be a leaf-colored tree and \(\mu \) a reconciliation map from \((T,\sigma )\) to some species tree S. Then,

$$\begin{aligned} \Theta (T,t_{\mu }) \subseteq \Theta (T, {{\widehat{t}}_T}) \subseteq {\mathbb {N}}{\mathbb {H}}(\vec {G}(T,\sigma )) \subseteq \vec {G}(T,\sigma ). \end{aligned}$$

The no-hug graph still may contain false-positive orthology assignments, i.e., \({\mathbb {N}}{\mathbb {H}}(\vec {G}(T,\sigma ))=\Theta (T,{{\widehat{t}}_T})\) does not hold in general. As an example, consider the BMG \(\vec {G}(T_1,\sigma )\) in Fig. 3. Here, none of the edges xz, \(x'z\) and yz are u-fp and thus, by Thm. 11 also not hug-edges. Hence, they still remain in \({\mathbb {N}}{\mathbb {H}}(\vec {G}(T_1,\sigma ))\). However, these edges are not contained in \(\Theta (T_1,{{\widehat{t}}_T})\), since \({{\widehat{t}}_T}({{\,\mathrm{lca}\,}}_{T_1}(x,x',y,z)) = \square \) and thus, \(\Theta (T_1,{{\widehat{t}}_T}) \subsetneq {\mathbb {N}}{\mathbb {H}}(\vec {G}(T_1,\sigma ))\).

4.2 Algorithms

In this section, we provide a polynomial-time algorithm to identify all u-fp edges in a given BMG. To this end, we take a closer look at hourglass chains and the trees that explain them. In Fig. 6(D), each subtree \(T(v_i)\), \(1\le i \le k\), “bridges” the color sets of the subtrees \(T(v_{i-1})\) and \(T(v_{i+1})\). That is, \(\sigma (L(T(v_{i-1})))\cap \sigma (L(T(v_i)))\) and \(\sigma (L(T(v_i)))\cap \sigma (L(T(v_{i+1})))\) are non-empty. This suggests to consider the children of a vertex u as the vertices of a “color-set intersection graph” with edges connecting children with non-empty color-set intersection:

Definition 7

The color-set intersection graph \({\mathfrak {C}}_T(u)\) of an inner vertex u of a leaf-colored gene tree \((T,\sigma )\) is the undirected graph with vertex set \(V:=\mathsf {child}_T(u)\) and edge set

$$\begin{aligned} E:=\{ v_1v_2 \mid v_1,v_2\in V \text {, }v_1\ne v_2 \text { and } \sigma (L(T(v_1)))\cap \sigma (L(T(v_2)))\ne \emptyset \}. \end{aligned}$$

This construction is similar to the definition of intersection graphs e.g. used in McKee and McMorris (1999). \({\mathfrak {C}}_T(u)\) can be viewed as a natural generalization of \({\mathcal {S}}^{\cap }(x,y)\) in the following sense: if \(u={{\,\mathrm{lca}\,}}_T(x,y)\) is a binary vertex, then \({\mathfrak {C}}_T(u)=K_2\) iff \({\mathcal {S}}^{\cap }(x,y)\ne \emptyset \) and therefore, \({\mathfrak {C}}_T(u)=K_1\cup K_1\) iff \({\mathcal {S}}^{\cap }(x,y)=\emptyset \). In the non-binary case, there is an edge \(v_1v_2\) iff \({\mathcal {S}}^{\cap }(x,y)\ne \emptyset \) for some \(x\in L(T(v_1))\) and \(y\in L(T(v_2))\).

Every BMG \((\vec {G},\sigma )\) contains all information necessary to determine the trees \((T,\sigma )\) by which it is explained. Since u-fp edges are defined in terms of the explaining trees, every BMG \((\vec {G},\sigma )\) also contains – at least implicitly – all information needed to identify its u-fp edges. Since \((\vec {G},\sigma )\) is determined by its unique least resolved tree \((T^*,\sigma )\), the u-fp edges must also be determined by \((T^*,\sigma )\). It is not sufficient for this purpose, however, to find an event labeling t of the vertices of \(T^*\).

To see this, consider for example the “true” history \(({\widetilde{T}},{\widetilde{t}},\sigma )\) of the BMG \(\vec {G}({\widetilde{T}},\sigma )\) as shown in Fig. 7. The unique least resolved tree \((T^*,\sigma )\) for \(\vec {G}({\widetilde{T}},\sigma )\) is obtained by merging the two vertices \(v_1\) and \(v_2\) of \({\widetilde{T}}\) resulting in the vertex v of \(T^*\). We have \({\widetilde{t}}(v_1)= \newmoon \ne \square ={\widetilde{t}}(v_2)\). For vertex v and every reconciliation map \(\mu \) from \((T^*,\sigma )\) to any species tree S, it must hold that \(\mu (v)\in E(S)\) and thus \(t^*_{\mu }(v)=\square \), since v has two children with overlapping color sets and by Lemma 2. Thus, the edges cx with \(x\in \{a_1,a_2,b_1,b_2\}\) are \((T^*,\sigma )\)-fp although they are not false positives at all. Since speciation and duplication vertices may be merged into the same vertex v of \(T^*\), the least resolved tree \(T^*\) in general cannot simply inherit the event labeling from the true gene history, and thus there may not be a “correct” labeling \(t^*\) of \(T^*\) that provides evidence for all u-fp edges.

Fig. 7
figure 7

The evolutionary scenario (left) shows the event-labeled gene tree \(({\widetilde{T}},{\widetilde{t}},\sigma )\) embedded into a species tree S. In the least resolved tree \((T^*,\sigma )\) of \(\vec {G}({\widetilde{T}},\sigma )\), the edge \(v_1v_2\) of \({\widetilde{T}}\) has been contracted into vertex v. The BMG \(\vec {G}({\widetilde{T}},\sigma )\) does not contain any u-fp edge. See text for further explanations

The example in Fig. 7 shows that the least resolved tree \(T^*\) simply may not be “resolved enough”. In the following, we therefore describe how the unique least resolved tree can be resolved further to provide more evidence about u-fp edges. Eventually, this will lead us to a characterization of the u-fp edges. To this end, we need to gain more insights into the structure of redundant edges, i.e., those edges e in T for which \((T_e,\sigma )\) still explains \(\vec {G}(T,\sigma )\).

Since the color sets of distinct subtrees below a speciation vertex cannot overlap by Lemma 2, Cor. 1 (Sec. A) implies that all edges below a speciation vertex are redundant and thus can be contracted. More precisely, we have

Observation 8

Let \(\mu \) be a reconciliation map from \((T,\sigma )\) to S and assume that there is a vertex \(u\in V^0(T)\) such that \(\mu (u)\in V^0(S)\) and thus, \(t_{\mu }(u)=\newmoon \). Then every inner edge uv of T with \(v\in \mathsf {child}_{T}(u)\) is redundant w.r.t. \(\vec {G}(T,\sigma )\). Moreover, if an inner edge uv with \(v\in \mathsf {child}_{T}(u)\) is non-redundant, then u must have two children with overlapping color sets, and hence, \(t_{\mu }(u)=\square \).

Our goal is to identify those vertices in \((T^*,\sigma )\) that can be expanded to yield a tree that still explains \(\vec {G}(T^*,\sigma )\). To this end, we need to introduce a particular way of “augmenting” a leaf-colored tree.

Definition 18

Let \((T,\sigma )\) be a leaf-colored tree, u be an inner vertex of T, \({\mathfrak {C}}_T(u)\) the corresponding color-set intersection graph, and \({\mathcal {C}}\) the set of connected components of \({\mathfrak {C}}_T(u)\). Then the tree \(T_u\) augmented at vertex u is obtained by applying the following editing steps to T:

  • If \({\mathfrak {C}}_T(u)\) is connected, do nothing.

  • Otherwise, for each \(C\in {\mathcal {C}}\) with \(|C|>1\)

    • introduce a vertex w and attach it as a child of u, i.e., add the edge uw,

    • for every element \(v_i\in C\), substitute the edge \(uv_i\) by the edge \(wv_i\).

The augmentation step is trivial if \(T_u=T\), in which case we say that no edit step was performed.

An example of an augmentation is shown in Fig. 8. The tree \(T_u\) obtained by an augmentation of a phylogenetic tree T is again a phylogenetic tree.

Fig. 8
figure 8

Left, a (part of a) leaf-colored tree \((T,\sigma )\). The tree \((T_u,\sigma )\) on the right is obtained from \((T,\sigma )\) by augmenting T at vertex u. The color-set intersection graph \({\mathfrak {C}}_T(u)\) (shown in the middle) has more than one connected component and there are connected components consisting of more than two vertices \(v_i\in \mathsf {child}_T(u)\). According to Lemma 21, \(\sigma (L(T_u(v)))\cap \sigma (L(T_u(v')))=\emptyset \) for any two distinct vertices \(v,v'\in \mathsf {child}_{T_u}(u) = \{v_1,w_1,w_2\}\). By Cor. 1 (Sec. A), the edges \(uw_1\) and \(uw_2\) are redundant w.r.t. \(\vec {G}(T_u,\sigma )\) and thus, both trees explain the same BMG

A key property of the procedure in Def. 18 is that repeated augmentation of the same inner vertex leads to at most one expansion and that the order of augmenting multiple vertices does not matter. More precisely, Lemma 23 in Sec. C.3 ensures the existence of a unique augmented tree:

Definition 19

(Augmented tree) Let \((T,\sigma )\) be a leaf-colored tree. The augmented tree of \((T,\sigma )\), denoted by \(({{\,\mathrm{{\mathcal {A}}}\,}}(T),\sigma )\), is obtained by augmenting all inner vertices of \((T,\sigma )\) (in an arbitrary order).

In particular, the augmented tree preserves the best match relation:

Proposition 7

For every leaf-colored tree \((T,\sigma )\), it holds \(\vec {G}(T,\sigma )=\vec {G}({{\,\mathrm{{\mathcal {A}}}\,}}(T),\sigma )\).

We now have everything in place to present the main results of this section.

Theorem 10

Let \((\vec {G},\sigma )\) be a BMG, \((T^*,\sigma )\) its unique least resolved tree, and \({{\widehat{t}}}:={{\widehat{t}}_{{{\,\mathrm{{\mathcal {A}}}\,}}(T^*)}}\) the extremal event labeling of the augmented tree \(({{\,\mathrm{{\mathcal {A}}}\,}}(T^*),\sigma )\). Then \((\Theta ({{\,\mathrm{{\mathcal {A}}}\,}}(T^*),{{\widehat{t}}}),\sigma ) = {\mathbb {N}}{\mathbb {H}}(\vec {G},\sigma )\).

Since \((\Theta ({{\,\mathrm{{\mathcal {A}}}\,}}(T^*),{{\widehat{t}}}),\sigma ) = {\mathbb {N}}{\mathbb {H}}(\vec {G},\sigma )\) is the subgraph of the underlying RBMG of \((\vec {G},\sigma )\) that does not contain any u-fp edges (cf. Def. 17 and Thm. 11), the set of all u-fp edges can readily be obtained by comparing the edges of \((\vec {G},\sigma )\) with the edges in the orthology graph obtained from \(({{\,\mathrm{{\mathcal {A}}}\,}}(T^*),{{\widehat{t}}})\). Since only u-fp edges have been removed to obtain \((\Theta ({{\,\mathrm{{\mathcal {A}}}\,}}(T^*),{{\widehat{t}}}),\sigma )\) and since \(({{\,\mathrm{{\mathcal {A}}}\,}}(T^*),\sigma )\) still explains \((\vec {G},\sigma )\), the graph \((\Theta ({{\,\mathrm{{\mathcal {A}}}\,}}(T^*),{{\widehat{t}}}),\sigma )\) is, in the sense of an unambiguous editing, the best estimate of the orthology relation that we can make by solely utilizing the structural information of a given BMG \((\vec {G},\sigma )\). Note, Thm. 1 implies that \({\mathbb {N}}{\mathbb {H}}(\vec {G},\sigma )\) must, in particular, be a cograph.

Since \((\Theta ({{\,\mathrm{{\mathcal {A}}}\,}}(T^*),{{\widehat{t}}}),\sigma ) = {\mathbb {N}}{\mathbb {H}}(\vec {G},\sigma )\), the computation of \({\mathbb {N}}{\mathbb {H}}(\vec {G},\sigma )\) can be achieved in polynomial time and avoids the need to find the hourglass chains of \((\vec {G},\sigma )\). In fact, the effort is dominated by computing the least resolved tree \((T^*,\sigma )\) for a given BMG.

Theorem 12

For a given BMG \((\vec {G},\sigma )\), the set of all u-fp edges can be computed in \(O(|L|^3 |{\mathscr {S}}|)\) time, where \(L=V(\vec {G})\) and \({\mathscr {S}} = \sigma (L(T))\) is the set of species under consideration.

As argued in (Geiß et al. 2019, Sec. 5), the number of genes between different species will be comparable in practical applications, i.e., \(O(\ell ) = O(|L|/|{\mathscr {S}}|)\) with \(\ell = \max _{s\in {\mathscr {S}}} |L[s]|\). In this case, the running time to compute \((T^*,\sigma )\) reduces to \(O(|L|^3/|{\mathscr {S}}|)\) and we obtain an overall running time to compute the set of all u-fp edges of \(O(|L|^3/|{\mathscr {S}}| + |L|^2 |{\mathscr {S}}|)\). Thms. 10 and 12 imply that we do not need to find induced quartets and hourglasses explicitly, nor do we need to identify the hourglass chains. Instead, it is more efficient to compute the least resolved tree \((T^*,\sigma )\), its augmented tree \(({{\,\mathrm{{\mathcal {A}}}\,}}(T^*),\sigma )\), and the corresponding extremal event labeling \({{\widehat{t}}}\).

Deletion of all u-fp edges is necessary to obtain an orthology relation without false positives. It is not sufficient, however, since \({\mathbb {N}}{\mathbb {H}}(\vec {G},\sigma )\) may contain additional false-positive orthology assignments. In order to construct an example, we consider for a BMG \((\vec {G},\sigma )\) the set \({\mathfrak {T}}\) of all trees \((T,t,\sigma )\) for which \({\mathbb {N}}{\mathbb {H}}(\vec {G},\sigma ) = (\Theta (T,t),\sigma )\). The example in Fig. 9 shows that it may be the case that none of the trees \((T,t,\sigma )\in {\mathfrak {T}}\) admits a reconciliation map \(\mu \) to any species tree such that \(t_{\mu } = t\). Lemma 29 in Sec. C.5 shows that the augmented tree \(({{\,\mathrm{{\mathcal {A}}}\,}}(T^*),{{\widehat{t}}},\sigma )\) is sufficient to test in polynomial time whether or not \({\mathfrak {T}}\) contains a reconcilable tree. In the negative case, we have clear evidence that \({\mathbb {N}}{\mathbb {H}}(\vec {G},\sigma )\) still contains a false-positive edge and thus must be edited further. This type of false-positive orthology assignments is the topic of ongoing work.

Fig. 9
figure 9

An evolutionary scenario (left) with a no-hug graph \({\mathbb {N}}{\mathbb {H}}(\vec {G},\sigma )\) that still contains false-positive edges. Deletion of the highlighted u-fp edge \(a_1b_1\) for \(\vec {G}({\widetilde{T}},\sigma )\) yields \({\mathbb {N}}{\mathbb {H}}(\vec {G},\sigma ) = (\Theta ({{\,\mathrm{{\mathcal {A}}}\,}}(T^*),{{\widehat{t}}}),\sigma )\) and thus, an orthology graph. However, none of its cotrees can be reconciled with any species tree since each of them contains the contradictory species triples \(\sigma (a_1)\sigma (b_1)|\sigma (c_1)\) and \(\sigma (a_1)\sigma (c_1)|\sigma (b_1)\) (see e.g. Hernandez-Rosales et al. (2012), Hellmuth (2017)). Note, the trees \(({\widetilde{T}},{\widetilde{t}})\) and \(({{\,\mathrm{{\mathcal {A}}}\,}}(T^*),{{\widehat{t}}})\) differ in the event label marked by the arrows, resulting in the three additional fp edges \(a_3b_3\), \(c_2b_3\) and \(c_3b_3\) in \({\mathbb {N}}{\mathbb {H}}(\vec {G},\sigma )\)

In contrast to the LRT of a BMG, its augmented tree is not necessarily displayed by the true gene tree of the underlying evolutionary scenario. Hence, we advocate the augmented tree endowed with the corresponding extremal event labeling \(({{\,\mathrm{{\mathcal {A}}}\,}}(T^*),{{\widehat{t}}},\sigma )\) primarily as convenient tool to identify false-positive orthology assignments. Whether or not \(({{\,\mathrm{{\mathcal {A}}}\,}}(T^*),{{\widehat{t}}},\sigma )\) is a plausible representation of the gene phylogeny depends on whether it admits a reconciliation of the (phylogenetically correct) species tree. As discussed above, this is not always the case. The following result, however, shows that \(({{\,\mathrm{{\mathcal {A}}}\,}}(T^*),{{\widehat{t}}},\sigma )\) is informative in an important special case.

Lemma 30

Let \((T,t,\sigma )\) be an event-labeled tree explaining the BMG \((\vec {G},\sigma )\), and let \((T^*,\sigma )\) be the least resolved tree of \((\vec {G},\sigma )\). If \((\Theta (T,t),\sigma ) = {\mathbb {N}}{\mathbb {H}}(\vec {G},\sigma )\), then \({{\,\mathrm{{\mathcal {A}}}\,}}(T^*)\) is displayed by T.

Lemma 30 guarantees that \({{\,\mathrm{{\mathcal {A}}}\,}}(T^*)\) is displayed by the true gene tree \({\widetilde{T}}\) whenever \({\mathbb {N}}{\mathbb {H}}(\vec {G},\sigma )\) equals the true orthology relation. In a practical workflow, it can be checked efficiently whether there is evidence for additional false-positive edges because \({\mathfrak {T}}\) contains no reconcilable tree. If this is not the case, then it is likely that \({\mathbb {N}}{\mathbb {H}}(\vec {G},\sigma )\) equals the true orthology relation. In this case, \({\widetilde{T}}\) also displays the unique discriminating cotree of \({\mathbb {N}}{\mathbb {H}}(\vec {G},\sigma )\).

One has to keep in mind, however, that it is not possible to find a mathematical guarantee for \({\mathbb {N}}{\mathbb {H}}(\vec {G},\sigma )\) to be the true orthology relation, because it cannot be ruled out that the true scenario contains unwitnessed duplications that are compensated by additional gene losses. In the extreme case, it is logically possible for every BMG that, in the true scenario, all inner vertices of the gene tree predate the root of the species tree, resulting in a true orthology graph without any edges Guigó et al. (1996), Page and Charleston (1997), Geiß et al. (2020b). Of course, this is extremely unlikely for real data.

4.3 Quartets, hourglasses, and the structure of reciprocal best match graphs

The characterization of u-fp edges is in a way surprising when compared to previous results on the structure of RBMGs Geiß et al. (2020b, 2020c), which were focused on \(P_4\)s and quartets. The expected connection between good and ugly quartets and u-fp edges is captured by Cor. 4. However, Prop. 6 implies that there are also u-fp edges entirely unrelated to quartets and thus induced \(P_4\)s. In this section, we aim to close this gap in our understanding.

Hourglass-free BMGs. We start with an important special case for which quartets are sufficient.

Definition 20

A BMG \((\vec {G},\sigma )\) is hourglass-free if it does not contain an hourglass as an induced subgraph.

In particular, an hourglass-free BMG does not contain an hourglass chain. It turns out that hourglasses are the forbidden induced subgraph characterizing BMGs that can be explained by binary trees.

Prop. 8 and Cor. 6

A BMG \((\vec {G},\sigma )\) can be explained by a binary tree if and only if it is hourglass-free. In particular, it can be decided in polynomial time whether \((\vec {G},\sigma )\) can be explained by a binary tree.

The RBMGs that are already cographs are called co-RBMGs. As shown in Sec. D.1, we obtain

Corollary 7

Let \((\vec {G},\sigma )\) be an hourglass-free BMG. Then its symmetric part \((G,\sigma )\) is either a co-RBMG or it contains an induced \(P_4\) on three colors whose endpoints have the same color, but no induced cycle \(C_n\) on \(n\ge 5\) vertices.

As outlined in Sec. D.1, all u-fp edges in an hourglass-free BMG are identified by the good and ugly quartets, which are 3-colored by construction. In hourglass-free BMGs, it is indeed sufficient to consider only the 3-colored \(P_4\)s to identify all u-fp edges and thus, to obtain an orthology graph, even though the BMG may also contain 4-colored \(P_4\)s. Since hourglasses can only appear in BMGs that require multifurcations for their explanation (cf. Lemma 14), the case of hourglass-free BMGs is the most relevant for practical applications.

Since all u-fp edges in an hourglass-free BMG are contained in quartets, it is also easy to identify the hourglass-free BMGs that are already orthology graphs.

Corollary 8

Let \((\vec {G},\sigma )\) be an hourglass-free BMG. Then, its symmetric part \((G,\sigma )\) is a co-RBMG if and only if there are no u-fp edges in \((\vec {G},\sigma )\).

u-fp Edges in Hourglass Chains. The situation is much more complicated in the presence of hourglasses. We start by providing sufficient conditions for u-fp edges that are identified by hourglass chains.

Proposition 9

Let be an hourglass chain in \((\vec {G},\sigma )\), possibly with a left tail z or a right tail \(z'\). Then, an edge in \(\vec {G}\) is u-fp if it is contained in the set

$$\begin{aligned} F =&\{x_iy_j\mid 1\le i \le j \le k\} \cup \{zz'\} \cup \{zy_{i}, x_iz', zy'_{i}, x'_{i}z' \mid 1 \le i \le k \}\\&\cup \{ x_{i}x_{j+1} \mid 1\le i< j< k \} \cup \{ y_{i}y_{j+1} \mid 1\le i< j < k \} \\&\cup \{x'_1 y'_i, x'_1 y_i \mid 2 \le i \le k \} \cup \{x_i y'_k, x'_i y'_k \mid 1 \le i \le k-1 \} \\&\cup \{x'_1 z, x'_1 z', y'_k z, y'_k z'\} \end{aligned}$$

As outlined in Sec. D.2, hourglass chains identify false-positive edges that are not associated with quartets in the BMG and, in particular, false-positive edges that are not even part of an induced \(P_4\). This observation limits the use of cograph editing in the context of orthology detection, at least in the case of gene trees with polytomies: On one hand, an RBMG can be a cograph and still contain u-fp edges and, on the other hand, there are examples where deletion of the u-fp edge identified by quartets (and thus, by induced \(P_4\)s) is not sufficient to arrive at a cograph (cf. Sec. D.2).

Four-colored \(P_4\)s Geiß et al (2020c, Thm. 8) established that the RBMG \((G,\sigma )\) is a co-RBMG, i.e., a cograph, if and only if every subgraph induced on three colors is a cograph. Therefore, if \((G,\sigma )\) contains an induced 4-colored \(P_4\), it also contains an induced 3-colored \(P_4\). For hourglass-free BMGs \((\vec {G},\sigma )\) it is clear that a 4-colored \(P_4\) always overlaps with a 3-colored \(P_4\): In this case \({\mathbb {N}}{\mathbb {H}}(\vec {G},\sigma )\) is obtained by deleting middle edges of good quartets and first edges of ugly quartets. Since \({\mathbb {N}}{\mathbb {H}}(\vec {G},\sigma )\) is a cograph, there is no \(P_4\) left, and thus at least one edge of any 4-colored \(P_4\) was among the deleted edges. It is natural to ask whether this is true for BMGs in general. However, as shown in Sec. D.3, good and ugly quartets are not sufficient on their own and there are examples with 4-colored \(P_4\)s that do not overlap with the middle edge of a good quartet or the first edge of an ugly quartet.

Still, in the context of cograph-editing approaches it is of interest whether the 3-colored \(P_4\)s are sufficient. In the following we provide an affirmative answer.

Lemma 34

Let \((\vec {G},\sigma )\) be a BMG and \({\mathscr {P}}\) a 4-colored induced \(P_4\) in the symmetric part of \((\vec {G},\sigma )\). Then at least one of the edges of \({\mathscr {P}}\) is either the middle edge of some good quartet or the first edge of a bad or ugly quartet in \((\vec {G},\sigma )\).

It is important to recall in this context, however, that the deletion of all u-fp-edges identified by quartets does not necessarily lead to a cograph (see Fig. 17(C) in Sec. D.3 for an example). Hence, the quartets alone therefore cannot provide a complete algorithm for correcting an RBMG to an orthology graph.

5 Simulation results

We illustrate the potential impact of our mathematical results discussed in the previous sections with the help of simulated data. To this end, we focus on the accuracy of the inferred orthology graph assuming that the best matches are accurate. Of course, this is only one of several components in complete orthology detection pipeline, which would also need to consider the genome annotation, pairwise alignments of genes or predicted protein sequences, and the conversion of sequence similarities into best match data. The latter step has been investigated in considerable detail by Stadler et al. (2020). Here, we start from simulated evolutionary scenarios and extract the BMG directly from the ground truth using the simulation library AsymmeTreeStadler et al. (2020).

In brief, AsymmeTree generates realistic evolutionary scenarios in four steps. (1) A planted species tree S is generated using the Innovation Model Keller-Schmidt and Klemm (2012), which models observed phylogenies well. (2) A dating map \(\tau \) assigns time points to all vertices of S and thus branch lengths to the edges of S. (3) On S, we use a variant of the well-known constant-rate birth-death process with a given age (see e.g. Kendall 1948; Hagen and Stadler 2018) to simulate an event-labeled gene tree \((T,t,\sigma )\) containing duplication and loss events. Speciations are included as additional branching events that generate copies of all genes present at a speciation vertex in all descendant lineages. The simulated gene trees are constrained to have at least one surviving gene in each species to avoid trivial cases. (4) The observable part of the gene tree is extracted by recursively removing leaves that correspond to loss events and suppressing inner vertices with a single child. AsymmeTree can also assign rates to edges of \((T,t,\sigma )\) to convert evolutionary time differences into general additive distances; however, this is not relevant here since the rates do not affect evolutionary relatedness and thus the BMG.

Fig. 10
figure 10

Average relative abundance of the different types of hug-edges and undetectable false positives in the BMGs of simulated evolutionary scenarios. We distinguish hug-edges in good and ugly quartets as well as hug-edges appearing only in hourglass chains (orange). In the simulations, the fraction of u-fp edges that are first edges of bad quartets is too small too be visible and therefore not shown here. The undetectable false positives correspond to complementary gene losses without surviving witnesses of the duplication event. Species trees are binary, while gene trees contain multifurcations. The number of offsprings is modeled as \(2+k\), where k is drawn from a Poisson distribution with parameter \(\lambda \). For \(\lambda =0\), the gene trees are binary. In the experiments, we observed that on average 62.4% of the 25000 simulated BMGs do not contain any false-positive edge (cf. Fig. 11). Those instances are included in the computation of the fraction \(|{\mathfrak {F}}|/|E(G)|\) (percentage above the bars). However, for the computation of all other values only scenarios that contain false-positives are considered

Extending the simulations used in Geiß et al. (2020b), Stadler et al. (2020), we also consider non-binary gene trees. This is important here since, by Lemma 14, hourglasses cannot appear in BMGs that are explained by a binary tree. There is an ongoing discussion to what extent polytomies in phylogenetic trees are biological reality as opposed to an artifact of insufficient resolution. At the level of species trees, the assumption that cladogenesis occurs by a series of bifurcations (e.g. Maddison 1989; DeSalle et al. 1994) seems to be prevailing, several authors have argued quite convincingly that there is evidence for a least some bona fide multifurcations of species Kliman et al. (2000), Takahashi et al. (2001), Sayyari and Mirarab (2018). In the simulation, polytomies in species trees are introduced after the first step by edge contraction with a user-defined probability p.

The reality of polytomies is less clear for gene trees. One reason is the abundance of tandem duplications. Although the majority of tandem arrays comprises only a pair of genes, larger clusters are not at all rare Pan and Zhang (2008). Although one may argue that mechanistically they likely arise by stepwise duplications, such arrangements are often subject to gene conversion and non-homologous recombination that keeps the sequences nearly identical for some time before they eventually escape from concerted evolution and diverge functionally Liao (1999), Hanada et al. (2018). As a consequence, duplications in tandem arrays may not be resolvable unless witnesses of different stages of an ongoing duplication process have survived. To model polytomies in the gene tree, we modify step (3) of the simulation procedure by replacing a simple duplication by the generation of \(2+k\) offspring genes. The number k of additional copies is drawn from a Poisson distribution with parameter \(\lambda >0\).

The simulated data set of evolutionary scenarios comprises species trees with 10 to 30 species (drawn uniformly). The time difference between the planted root and the leaves of S is set to unity. The duplication and loss rates in the gene trees are drawn i.i.d. from the uniform distribution on the interval [0.5, 1.5). Multifurcating gene trees were produced for \(\lambda =\{0.0, 0.5, 1.0, 1.5, 2.0\}\). In total, we generated 5000 scenarios for each choice of p and \(\lambda \). Since the true scenarios, and thus the true gene tree T, the true BMG \(\vec {G}\), and the corresponding RBMG G are known, we can also determine the set

$$\begin{aligned} {\mathfrak {F}}:=\left\{ xy \;\mid \; xy\in E(G) \;\;\text {and}\;\; t({{\,\mathrm{lca}\,}}_T(x,y))=\square \right\} . \ \end{aligned}$$
(1)

of false-positive edges. From the BMG, we compute the set \({\mathfrak {U}}\) of u-fp edges as well as the subsets \({\mathfrak {U}}_M\) and \({\mathfrak {U}}_U\) of u-fp edges that are middle edges of a good or first edges of an ugly quartet, respectively. Note that in general we have \({\mathfrak {U}}_M\cap {\mathfrak {U}}_U\ne \emptyset \). We only discuss the results for binary species trees in some detail, since species trees with polytomies yield qualitatively similar results. We observe that the relative abundance of u-fp edges in good and ugly quartets increases moderately for larger p.

Fig. 11
figure 11

False discovery rates computed as proportion of fp among all edges averaged over all scenarios with given number of duplications and losses. Left: RBMGs \((G,\sigma )\), i.e., \(|{\mathfrak {F}}|/|E(G)|\). Middle: edited RBMG \((G_\text {good},\sigma )\) with all middle edges of good quartets removed, i.e., \(|{\mathfrak {F}}{\setminus }{\mathfrak {U}}_M|/|E(G_\text {good})|\). Right: no-hug graphs \({\mathbb {N}}{\mathbb {H}}(\vec {G},\sigma )\), i.e., \(|{\mathfrak {F}}{\setminus }{\mathfrak {U}}|/|E({\mathbb {N}}{\mathbb {H}})|\). Scenarios with more than 80 duplication/loss events are not shown

First, we note that, consistent with Geiß et al. (2020b), Stadler et al. (2020), the fraction \(|{\mathfrak {F}}|/|E(G)|\) of false positive orthology assignments is small in our data set, on the order of \(3\%\). This indicates that, in real-life data, the main source of errors is likely the accurate determination of best matches from sequence data rather than false-positive edges contained in the BMG. Considering the fraction \(|{\mathfrak {U}}|/|{\mathfrak {F}}|\) of u-fp edges in Fig. 10, we find that even in the most adverse case of all gene trees being binary, the BMG identifies more than three quarters of \({\mathfrak {F}}\). It may be surprising at first glance that the problem becomes easier with increasing \(\lambda \) and barely \(6\%\) of the false positives escape discovery. A likely explanation is that multifurcations increase the likelihood that an inner vertex has two surviving lineages that serve as witnesses of the event; in addition, multifurcations increase the vertex degree in the BMG, so that in principle more information is available to resolve the tree structure. It is also interesting to note that \({\mathfrak {U}}_U{\setminus } {\mathfrak {U}}_M\) is small, i.e., there are few cases of first edges in an ugly quartet that are not also middle edges in a good quartet. The fraction of u-fp edges that appear only as first edges of bad quartets is even smaller; only 2-3% of the u-fp edges associated with hourglass chains, i.e., less than 0.15% of all u-fp edges are of this type. The overwhelming majority of u-fp edges associated with quartets thus appear (also) as middle edges of good quartets. This observation provides an explanation for the excellent performance of removing the \({\mathfrak {U}}_M\)-edges proposed in Geiß et al. (2020b). In particular in the case of binary trees, which was considered by Geiß et al. (2020b), there is only a small number of other u-fp edges, which are completely covered by \({\mathfrak {U}}_U\). Fig. 11 visualizes the appearance of false-positive edges depending on the number of duplication and loss events. Not surprisingly, \({\mathfrak {F}}\) is enriched in scenarios with a large number of losses compared to the duplications, and depleted when losses are rare. In fact, in the absence of losses, the RBMG equals the orthology graph, i.e., \({\mathfrak {F}}=\emptyset \) (Geiß et al. 2020b, Thm. 4). Removal of \({\mathfrak {U}}_M\), already reduced the false positives considerably.

6 Summary and outlook

We have shown here how all unambiguously false-positive orthology assignments can be identified in polynomial time provided that all best matches are known. In particular, we have provided several characterizations for u-fp edges in terms of underlying subgraphs and refinements of trees. Since the best match graph contains only false positives, we have obtained a characterization of all unambiguously incorrect orthology assignments. Simulations showed that the majority of false positives comprises middle edges of good quartets, while u-fp edges that appear only as first edges of an ugly quartet are rare. Not surprisingly, the hourglass-related u-fp edges become important in gene trees with many multifurcations. They do not appear in scenarios derived from binary gene trees. For the theory developed here, it makes no difference whether polytomies in the gene tree appear as genuine features, or whether limited accuracy of the approximation from underlying sequence data produced the equivalent of a soft polytomy in the BMG.

The augmented tree \(({{\,\mathrm{{\mathcal {A}}}\,}}(T^*),\sigma )\) is the least resolved tree that admits an event labeling such that all inner vertices with child trees that have overlapping colors are designated as duplications while all inner vertices with color-disjoint child trees are designated as speciations. The tree \(({{\,\mathrm{{\mathcal {A}}}\,}}(T^*),\sigma )\) therefore does not contain “non-apparent duplications” in the sense of Lafond et al. (2014), i.e., duplication vertices with species-disjoint subtrees. This is an interesting connection linking the literature concerned with polytomy refinement in given gene trees Chang and Eulenstein (2006), Lafond et al. (2014) with best match graphs.

The extremal event labeling \({{\widehat{t}}}\) of \(({{\,\mathrm{{\mathcal {A}}}\,}}(T^*),\sigma )\) is the one that minimizes the necessary number of duplications on \(({{\,\mathrm{{\mathcal {A}}}\,}}(T^*),\sigma )\). In a conceptual sense, therefore, \(({{\,\mathrm{{\mathcal {A}}}\,}}(T^*),{{\widehat{t}}})\) is a “most parsimonious” solution, matching the idea of most parsimonious reconciliations Guigó et al. (1996), Page and Charleston (1997). From a technical point of view, however, the problem we solve here is very different. Instead of considering a given pair of gene tree T and species tree S, we ask here about the information contained in the BMG \((\vec {G},\sigma )\), i.e., we only consider the information on the species tree that is already implicitly contained in \((\vec {G},\sigma )\). The construction of the event-labeled gene tree \(({{\,\mathrm{{\mathcal {A}}}\,}}(T^*),{{\widehat{t}}})\) in fact implies a set \({\mathfrak {S}}\) of informative triples, namely those \(\sigma (x)\sigma (y)|\sigma (z)\) with \(\sigma (x)\), \(\sigma (y)\), \(\sigma (z)\) pairwise distinct and \({{\widehat{t}}}({{\,\mathrm{lca}\,}}_{{{\,\mathrm{{\mathcal {A}}}\,}}(T^*)}(x,y,z))=\newmoon \), that are displayed by the species tree S Hernandez-Rosales et al. (2012), Hellmuth (2017). Nothing in our theory, however, ensures that \({\mathfrak {S}}\) is a consistent set of triples, much less that \({\mathfrak {S}}\) is consistent with a given species tree S. A lack of consistency, however, implies that the no-hug graph \({\mathbb {N}}{\mathbb {H}}(\vec {G},\sigma )\) cannot be the correct orthology relation, and thus, necessarily contains additional false-positive edges. Consistency, on the other hand, cannot provide a mathematical proof for biological correctness. It makes \({\mathbb {N}}{\mathbb {H}}(\vec {G},\sigma )\) a very likely candidate for the true orthology relation, however, because alternative scenarios require additional gene duplications and multiple, strategically placed gene losses to compensate for them.

Since constraints on reconciliation maps deriving from the species phylogeny are fully expressed by informative triples, no such constraint exists in particular for any vertex u of \({{\,\mathrm{{\mathcal {A}}}\,}}(T^*)\) that has only leaves as children. That is, false-positive orthology assignments among the children of u cannot be identified from the BMG alone because there are no further descendants to witness u as duplication event. Additional evidence, such as the assumption of a molecular clock or synteny must be used to resolve situations such as the complementary loss shown in Fig. 2.

Every gene tree T can be reconciled with every species tree SGuigó et al. (1996), Page and Charleston (1997), Geiß et al. (2020b) at the expense of reassigning events as duplications. If \({{\,\mathrm{{\mathcal {A}}}\,}}(T^*)\) is already binary, consistency will require the relabeling of some speciation nodes as duplications. Can one characterize and efficiently compute the minimal relabelings? In the general case, a further refinement of \({{\,\mathrm{{\mathcal {A}}}\,}}(T^*)\) may be sufficient. Is a refinement of speciation nodes sufficient, or are there in general speciation nodes in \(({{\,\mathrm{{\mathcal {A}}}\,}}(T^*),{{\widehat{t}}})\) that need to be refined into separate speciation and duplication events?

Since orthology graphs are cographs contained in the RBMG \((G,\sigma )\), it is of interest to compare the deletion of all u-fp edges in \((G,\sigma )\) with finding a (minimal) edge-deletion set to obtain a cograph. These two problems are clearly distinct: The simplest example is the BMG \((\vec {G},\sigma )\) in Fig. 6(A): its symmetric part G is already a cograph but \((\vec {G},\sigma )\) contains the hug-edge xy, which must be deleted. Despite its practical use Hellmuth et al. (2015), Lafond et al. (2016), this observation relegates cograph editing Liu et al. (2012), Hellmuth et al. (2020a), Tsur (2020) to the status of a heuristic approximation for the purpose of orthology detection.

For practical applications, one has to keep in mind that best matches are inferred from sequence similarity data. Despite efforts to convert best (blast) hits into evolutionary best matches in a systematic manner Stadler et al. (2020), estimated BMGs will contain errors, which in most cases will violate the definition of best match graphs. This begs the question how an empirical estimate of a BMG can be corrected to a closest “correct” BMG that (approximately) fits the data. Not surprisingly, BMG editing Schaller et al. (2020) and the analogous RBMG editing problem Hellmuth et al. (2020b) are NP-hard. Efficient, accurate heuristics are a topic of ongoing research.

Orthology prediction tools intended for large data sets often do not attempt to infer the orthology graph, but instead are content with summarizing the information as clusters of orthologous groups (COGs) in an empirically estimated RBMG Tatusov et al. (1997), Roth et al. (2008). Formally, this amounts to editing the BMG to a set of disjoint cliques. The example in Fig. 7 shows that this approach can destroy correct orthology information: the BMG \((\vec {G},\sigma )\) does not contain u-fp edges and thus, it is the closest orthology graph. However, \((\vec {G},\sigma )\) is not the disjoint union of cliques.