Distance to the stochastic part of phylogenetic varieties
Introduction
Within the new century, algebraic tools have started to be successfully applied to some problems of phylogenetic reconstruction, see for example Allman et al. (2013), Chifman and Kubatko (2015) and Allman et al. (2017). The main goal of phylogenetic reconstruction is to estimate the phylogenetic tree that best explains the evolution of living species using solely information of their genome. To this end, one usually considers evolutionary models of molecular substitution and assume that DNA sequences evolve according to these models by a Markov process on a tree. Some of the most used models are nucleotide substitution models (e.g. Kimura (1981) or Jukes and Cantor (1969) models), which are specified by a transition matrix associated to each edge of the tree and a distribution of nucleotides at the root. Then, the distribution of possible nucleotide sequences at the leaves of the tree (representing the living species) can be computed as an algebraic expression in terms of the parameters of the model (the entries of the substitution matrices and the distribution at the root). This allows the use of algebraic tools for phylogenetic reconstruction purposes.
When reconstructing the tree topology (i.e., the shape of the tree taking into account the names of the species at the leaves), the main tools that have been used come either from rank conditions on matrices arising from a certain rearrangement of the distribution of nucleotides at the leaves (Chifman and Kubatko, 2014, Chifman and Kubatko, 2015; Casanellas and Fernández-Sánchez, 2016), or from phylogenetic invariants (Lake, 1987; Casanellas and Fernández-Sánchez, 2007). These tools use the fact that the set of possible distributions satisfies certain algebraic constraints, but do not specifically use the condition that one is dealing with discrete distributions that arise from stochastic matrices at the edges of the tree (i.e. with positive entries and rows summing to one). These extra conditions lead to semi-algebraic constraints which have been specified for certain models by Allman et al. (2012) (for the general Markov model), Matsen (2009) (for the Kimura 3-parameter model) and by Zwiernik and Smith (2011) and Klaere and Liebscher (2012) for the 2-state case ( transition matrices). Combining algebraic and semi-algebraic conditions to develop a tool for reconstructing the tree topology is not an easy task and, as far as we are aware, both tools have only been used together in Kosta and Kubjas (2019) for the simple case of 2 states.
As a starting point of topology reconstruction problems, it is natural to use trees on four species (called 1, 2, 3, 4 for example). In this case, there are three possible (unrooted and fully resolved) phylogenetic trees, , , and (see Fig. 1). Then a distribution of nucleotides for this set of species is a vector whose entries are non-negative and sum to one. The set of distributions arising from a Markov process on any of these trees T (for a given substitution model) defines an algebraic variety (see Section 2.1). The three phylogenetic varieties , , are different and the topology reconstruction problem for a given distribution is, briefly, deciding to which of these three varieties P is closest (for a certain distance or for another specified optimization problem such as likelihood estimation). The algebraic tools related to rank conditions mentioned above attempt to estimate these Euclidean distances, for example.
If we assume that P should be close to a distribution that has arisen from stochastic parameters on one of these trees, then one should consider only the stochastic part of these varieties, , , (which we call the stochastic phylogenetic regions). The main questions that motivated the study presented here are:
- -
Could semi-algebraic tools add some insight to the already existent algebraic tools?
- -
Do semi-algebraic conditions support the same tree T whose algebraic variety is closest to the data point?
In terms of the Euclidean distance and trees of four species, we make the explicit following question:
- (*)
If is a distribution satisfying , would it be possible that ?
We address this problem for special cases of interest in phylogenetics: short branches at the external edges (see section 4) and long branch attraction (in section 6). The length of a branch in a phylogenetic tree is understood as the expected number of substitutions of nucleotides per site along the corresponding edge; both cases, short and long branches, usually lead to confusing results in phylogenetic reconstruction (particularly in relation to the long branch attraction problem, see section 6). In the first case we are able to deal with the Kimura 3-parameter model and in the second case we have to restrict to the more simple Jukes-Cantor (JC69) model. The reason for this restriction is that the computations get more involved in the second case and we have to use computational algebra techniques (for which is crucial to decrease the number of variables of the problem). To this end, in section 5 we introduce an algorithm that computes the distance of a point to the stochastic phylogenetic regions in the JC69 case; this algorithm makes explicit use of the Euclidean distance degree (Draisma et al., 2015) of the phylogenetic varieties.
We find that in the first framework (short external branches), restricting to the stochastic part does not make any difference, that is, Question 1 has a negative answer in this case (see Theorem 4.3). However, in the long branch attraction framework, considering the stochastic part of phylogenetic varieties might be of interest, specially if the data points are close to the intersection of the three varieties, see Theorem 6.6. In particular, the answer to Question 1 is positive for data close to the long branch attraction problem under the JC69 model. In section 7 we provide results on simulated data that support these findings and also show a positive answer to Question 1 for balanced trees.
Summing up, incorporating the semi-algebraic conditions to the problem of phylogenetic reconstruction seems important when the data are close to the intersection of the three phylogenetic varieties. This is the case where phylogenetic reconstruction methods tend to confuse the trees. On the contrary, on data points which are far from the intersection (in the short branches case of section 4 for example), it does not seem necessary to incorporate these semi-algebraic tools. This is the reason why incorporating these tools into phylogenetic reconstruction methods might be extremely difficult.
In this paper we consider only the Euclidean distance. One reason to do so is that the initial algebraic tools based on rank conditions were dealing with it, but another motivation is that the algebraic expression of the Euclidean distance permits the use of algebraic tools to derive analytical results and the use of numerical algebraic geometry to get global minima. On the other hand, the use of other measures such as Hellinger distance or maximum likelihood, would not allow the use of the Fourier transform for the evolutionary models we use here, which significantly simplifies the computations in our case.
The organization of the paper is as follows. In section 2, we introduce the concepts on nucleotide substitution models and phylogenetic varieties that we will use later on. Then in section 3 we prove some technical results regarding the closest stochastic matrix to a given matrix. In section 4 we consider the case of short external branches for the Kimura 3-parameter model and obtain the results analytically. In section 5 we introduce the computational approach that we use in order to compute the distance to the stochastic phylogenetic regions. The results for the long branch attraction case are expanded in section 6 and in section 7 we provide results on simulated data that illustrate our findings. The Appendix collects all technical proofs needed in section 6.
Section snippets
Phylogenetic varieties
We refer the reader to the work by Allman and Rhodes (2007) for a good general overview of phylogenetic algebraic geometry. Here we briefly introduce the basic concepts that will be needed later. Let T be a quartet tree topology, that is, an (unrooted) trivalent phylogenetic tree with its leaves labelled by (i.e. T is a connected acyclic graph whose interior nodes have degree 3 and whose leaves, of degree 1, are in correspondence with ), see Fig. 1. Using the notation
The closest stochastic matrix
Throughout this section, we will use the following notation. We write for the hyperplane and for the standard simplex in . Given a point , we denote by its orthogonal projection onto .
Definition 3.1 For any matrix we denote by its closest stochastic matrix in the Frobenious norm: Similarly, for any point we write for its closest point in Δ.
The problem of finding the nearest stochastic matrix is
The case of short external branches
In this section we study evolutionary processes where substitutions at the external edges are unusual, so that probabilities of substitution of nucleotides in the corresponding transition matrices are small. This translates to matrices close to the identity at the external edges and short branch lengths, as explained in the Introduction.
We use the results of Section 3 with and we stick to the K81 model. Given , let be a point in that minimizes the distance to P, i.e.
Computing the closest point to a stochastic phylogenetic region
Although in the last section we were able to answer our questions analytically, this approach seems unfeasible when we want to tackle more general problems. In this section, in order to find the distance from a point to a stochastic phylogenetic variety we use numerical algebraic geometry. Our goal is to find all critical points of the distance function to a phylogenetic variety in the interior and at the boundary of the stochastic region. Among the set of critical points we pick the one that
The long branch attraction case
Long branch attraction, LBA for short, is one the most difficult problems to cope with phylogenetic inference (see Kück et al. (2012)). It is a phenomenon that happens when fast evolving lineages are wrongly inferred to be closely related. Quartet trees representing these events are characterized for having two non-sister species that have accumulated many substitutions and two non-sister species that have very similar DNA sequences.
The length of a branch in a phylogenetic tree represents the
Study on simulated data
In this section we simulate points close to a given phylogenetic variety and we compute its distance to the stochastic region of this variety as well as to the other phylogenetic varieties (distinguishing also the stochastic region of the varieties). We do this in the setting of long branch attraction of the previous section and for balanced trees. We cannot do this theoretically because, even if we have found a local minimum for the long branch attraction setting (Theorem 6.3), we cannot
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
The authors would like to thank Piotr Zwiernik for sharing initial discussion in this topic. The authors were partially supported by Spanish government Secretaría de Estado de Investigación, Desarrollo e Innovación [MTM2015-69135-P (MINECO)] and [PID2019-103849GB-I00 (MICINN)]; Generalitat de Catalunya [2014 SGR-634]. M. Garrote-López was also funded by Spanish government, research project Maria de Maeztu [MDM-2014-0445 (MINECO)].
References (36)
- et al.
Phylogenetic invariants of the general Markov model of sequence mutation
Math. Biosci.
(2003) - et al.
The Magma algebra system. I. The user language
Computational Algebra and Number Theory
J. Symb. Comput.
(1997) - et al.
Geometry of the Kimura 3-parameter model
Adv. Appl. Math.
(2008) - et al.
Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes, site-specific rate variation, and invariable sites
J. Theor. Biol.
(2015) - et al.
Evolution of protein molecules
- et al.
An algebraic analysis of the two state Markov model on tripod trees
Math. Biosci.
(2012) - et al.
Species tree inference by the star method and its generalizations
J. Comput. Biol.
(2013) - et al.
Split scores: a tool to quantify phylogenetic signal in genome-scale data
Syst. Biol.
(2017) - et al.
Mathematical Models in Biology, an Introduction
(January 2004) - et al.
Quartets and parameter recovery for the general Markov model of sequence mutation
Appl. Math. Res. Express
(2004)
Phylogenetic invariants
The identifiability of covarion models in phylogenetics
IEEE/ACM Trans. Comput. Biol. Bioinform.
A semialgebraic description of the general Markov model on phylogenetic trees
SIAM J. Discrete Math.
Performance of a new invariants method on homogeneous and nonhomogeneous quartet trees
Mol. Biol. Evol.
Invariant versus classical quartet inference when evolution is heterogeneous across sites and lineages
Syst. Biol.
The space of phylogenetic mixtures for equivariant models
Algorithms Mol. Biol.
Low degree equations for phylogenetic group-based models
Collect. Math.
Quartet inference from SNP data under the coalescent model
Bioinformatics
Cited by (4)
Designing Weights for Quartet-Based Methods When Data are Heterogeneous Across Lineages
2023, Bulletin of Mathematical BiologyThe Kishony Mega-Plate Experiment, a Markov Process
2021, bioRxiv