Distance to the stochastic part of phylogenetic varieties

doi:10.1016/j.jsc.2020.09.003

Journal of Symbolic Computation

Volume 104, May–June 2021, Pages 653-682

https://doi.org/10.1016/j.jsc.2020.09.003 Get rights and content

Abstract

Modelling the substitution of nucleotides along a phylogenetic tree is usually done by a hidden Markov process. This allows to define a distribution of characters at the leaves of the trees and one might be able to obtain polynomial relationships among the probabilities of different characters. The study of these polynomials and the geometry of the algebraic varieties defined by them can be used to reconstruct phylogenetic trees. However, not all points in these algebraic varieties have biological sense. In this paper, we explore the extent to which adding semi-algebraic conditions arising from the restriction to parameters with statistical meaning can improve existing methods of phylogenetic reconstruction. To this end, our aim is to compute the distance of data points to algebraic varieties and to the stochastic part of these varieties. Computing these distances involves optimization by nonlinear programming algorithms. We use analytical methods to find some of these distances for quartet trees evolving under the Kimura 3-parameter or the Jukes-Cantor models. Numerical algebraic geometry and computational algebra play also a fundamental role in this paper.

Introduction

Within the new century, algebraic tools have started to be successfully applied to some problems of phylogenetic reconstruction, see for example Allman et al. (2013), Chifman and Kubatko (2015) and Allman et al. (2017). The main goal of phylogenetic reconstruction is to estimate the phylogenetic tree that best explains the evolution of living species using solely information of their genome. To this end, one usually considers evolutionary models of molecular substitution and assume that DNA sequences evolve according to these models by a Markov process on a tree. Some of the most used models are nucleotide substitution models (e.g. Kimura (1981) or Jukes and Cantor (1969) models), which are specified by a $4 \times 4$ transition matrix associated to each edge of the tree and a distribution of nucleotides at the root. Then, the distribution of possible nucleotide sequences at the leaves of the tree (representing the living species) can be computed as an algebraic expression in terms of the parameters of the model (the entries of the substitution matrices and the distribution at the root). This allows the use of algebraic tools for phylogenetic reconstruction purposes.

When reconstructing the tree topology (i.e., the shape of the tree taking into account the names of the species at the leaves), the main tools that have been used come either from rank conditions on matrices arising from a certain rearrangement of the distribution of nucleotides at the leaves (Chifman and Kubatko, 2014, Chifman and Kubatko, 2015; Casanellas and Fernández-Sánchez, 2016), or from phylogenetic invariants (Lake, 1987; Casanellas and Fernández-Sánchez, 2007). These tools use the fact that the set of possible distributions satisfies certain algebraic constraints, but do not specifically use the condition that one is dealing with discrete distributions that arise from stochastic matrices at the edges of the tree (i.e. with positive entries and rows summing to one). These extra conditions lead to semi-algebraic constraints which have been specified for certain models by Allman et al. (2012) (for the general Markov model), Matsen (2009) (for the Kimura 3-parameter model) and by Zwiernik and Smith (2011) and Klaere and Liebscher (2012) for the 2-state case ( $2 \times 2$ transition matrices). Combining algebraic and semi-algebraic conditions to develop a tool for reconstructing the tree topology is not an easy task and, as far as we are aware, both tools have only been used together in Kosta and Kubjas (2019) for the simple case of 2 states.

As a starting point of topology reconstruction problems, it is natural to use trees on four species (called 1, 2, 3, 4 for example). In this case, there are three possible (unrooted and fully resolved) phylogenetic trees, $13 | 24$ , $13 | 24$ , and $14 | 23$ (see Fig. 1). Then a distribution of nucleotides for this set of species is a vector $P \in R^{4^{4}}$ whose entries are non-negative and sum to one. The set of distributions arising from a Markov process on any of these trees T (for a given substitution model) defines an algebraic variety $V_{T}$ (see Section 2.1). The three phylogenetic varieties $V_{12 | 34}$ , $V_{13 | 24}$ , $V_{14 | 23}$ are different and the topology reconstruction problem for a given distribution $P \in R^{4^{4}}$ is, briefly, deciding to which of these three varieties P is closest (for a certain distance or for another specified optimization problem such as likelihood estimation). The algebraic tools related to rank conditions mentioned above attempt to estimate these Euclidean distances, for example.

If we assume that P should be close to a distribution that has arisen from stochastic parameters on one of these trees, then one should consider only the stochastic part of these varieties, $V_{12 | 34}^{+}$ , $V_{13 | 24}^{+}$ , $V_{14 | 23}^{+}$ (which we call the stochastic phylogenetic regions). The main questions that motivated the study presented here are:

-
Could semi-algebraic tools add some insight to the already existent algebraic tools?
-
Do semi-algebraic conditions support the same tree T whose algebraic variety $V_{T}$ is closest to the data point?

In terms of the Euclidean distance and trees of four species, we make the explicit following question:

(*)
If $P \in R^{4^{4}}$ is a distribution satisfying $d (P, V_{12 | 34}) < m i n {d (P, V_{13 | 24}), d (P, V_{14 | 23})}$ , would it be possible that $d (P, V_{12 | 34}^{+}) > \min {d (P, V_{13 | 24}^{+}), d (P, V_{14 | 23}^{+})}$ ?

We address this problem for special cases of interest in phylogenetics: short branches at the external edges (see section 4) and long branch attraction (in section 6). The length of a branch in a phylogenetic tree is understood as the expected number of substitutions of nucleotides per site along the corresponding edge; both cases, short and long branches, usually lead to confusing results in phylogenetic reconstruction (particularly in relation to the long branch attraction problem, see section 6). In the first case we are able to deal with the Kimura 3-parameter model and in the second case we have to restrict to the more simple Jukes-Cantor (JC69) model. The reason for this restriction is that the computations get more involved in the second case and we have to use computational algebra techniques (for which is crucial to decrease the number of variables of the problem). To this end, in section 5 we introduce an algorithm that computes the distance of a point to the stochastic phylogenetic regions in the JC69 case; this algorithm makes explicit use of the Euclidean distance degree (Draisma et al., 2015) of the phylogenetic varieties.

We find that in the first framework (short external branches), restricting to the stochastic part does not make any difference, that is, Question 1 has a negative answer in this case (see Theorem 4.3). However, in the long branch attraction framework, considering the stochastic part of phylogenetic varieties might be of interest, specially if the data points are close to the intersection of the three varieties, see Theorem 6.6. In particular, the answer to Question 1 is positive for data close to the long branch attraction problem under the JC69 model. In section 7 we provide results on simulated data that support these findings and also show a positive answer to Question 1 for balanced trees.

Summing up, incorporating the semi-algebraic conditions to the problem of phylogenetic reconstruction seems important when the data are close to the intersection of the three phylogenetic varieties. This is the case where phylogenetic reconstruction methods tend to confuse the trees. On the contrary, on data points which are far from the intersection (in the short branches case of section 4 for example), it does not seem necessary to incorporate these semi-algebraic tools. This is the reason why incorporating these tools into phylogenetic reconstruction methods might be extremely difficult.

In this paper we consider only the Euclidean distance. One reason to do so is that the initial algebraic tools based on rank conditions were dealing with it, but another motivation is that the algebraic expression of the Euclidean distance permits the use of algebraic tools to derive analytical results and the use of numerical algebraic geometry to get global minima. On the other hand, the use of other measures such as Hellinger distance or maximum likelihood, would not allow the use of the Fourier transform for the evolutionary models we use here, which significantly simplifies the computations in our case.

The organization of the paper is as follows. In section 2, we introduce the concepts on nucleotide substitution models and phylogenetic varieties that we will use later on. Then in section 3 we prove some technical results regarding the closest stochastic matrix to a given matrix. In section 4 we consider the case of short external branches for the Kimura 3-parameter model and obtain the results analytically. In section 5 we introduce the computational approach that we use in order to compute the distance to the stochastic phylogenetic regions. The results for the long branch attraction case are expanded in section 6 and in section 7 we provide results on simulated data that illustrate our findings. The Appendix collects all technical proofs needed in section 6.

Section snippets

Phylogenetic varieties

We refer the reader to the work by Allman and Rhodes (2007) for a good general overview of phylogenetic algebraic geometry. Here we briefly introduce the basic concepts that will be needed later. Let T be a quartet tree topology, that is, an (unrooted) trivalent phylogenetic tree with its leaves labelled by ${1, 2, 3, 4}$ (i.e. T is a connected acyclic graph whose interior nodes have degree 3 and whose leaves, of degree 1, are in correspondence with ${1, 2, 3, 4}$ ), see Fig. 1. Using the notation

The closest stochastic matrix

Throughout this section, we will use the following notation. We write $H$ for the hyperplane ${x_{1} + \dots + x_{N} = 1} \subset R^{N}$ and $Δ : = {(x_{1}, \dots, x_{N}) | \sum_{i} x_{i} = 1, x_{i} \geq 0}$ for the standard simplex in $R^{N}$ . Given a point $x \in R^{N}$ , we denote by ${proj}_{H} (x)$ its orthogonal projection onto $H$ .

Definition 3.1

For any matrix $M \in M_{N} (R)$ we denote by $\hat{M}$ its closest stochastic matrix in the Frobenious norm: $\hat{M} = \underset{\begin{matrix} \sum_{j} X_{i j} = 1 \forall i, \\ X_{i j} \geq 0 \forall (i, j) \end{matrix}}{arg min} {∥ M - X ∥}_{F} .$ Similarly, for any point $x \in R^{N}$ we write $\hat{x}$ for its closest point in Δ.

The problem of finding the nearest stochastic matrix is

The case of short external branches

In this section we study evolutionary processes where substitutions at the external edges are unusual, so that probabilities of substitution of nucleotides in the corresponding transition matrices are small. This translates to matrices close to the identity at the external edges and short branch lengths, as explained in the Introduction.

We use the results of Section 3 with $N = 4^{4}$ and we stick to the K81 model. Given $P \in R^{4^{4}}$ , let $P_{T}^{+}$ be a point in $V_{T}^{+}$ that minimizes the distance to P, i.e. $d (P, P_{T}^{+}) = d ($

Computing the closest point to a stochastic phylogenetic region

Although in the last section we were able to answer our questions analytically, this approach seems unfeasible when we want to tackle more general problems. In this section, in order to find the distance from a point to a stochastic phylogenetic variety we use numerical algebraic geometry. Our goal is to find all critical points of the distance function to a phylogenetic variety in the interior and at the boundary of the stochastic region. Among the set of critical points we pick the one that

The long branch attraction case

Long branch attraction, LBA for short, is one the most difficult problems to cope with phylogenetic inference (see Kück et al. (2012)). It is a phenomenon that happens when fast evolving lineages are wrongly inferred to be closely related. Quartet trees representing these events are characterized for having two non-sister species that have accumulated many substitutions and two non-sister species that have very similar DNA sequences.

The length of a branch in a phylogenetic tree represents the

Study on simulated data

In this section we simulate points close to a given phylogenetic variety and we compute its distance to the stochastic region of this variety as well as to the other phylogenetic varieties (distinguishing also the stochastic region of the varieties). We do this in the setting of long branch attraction of the previous section and for balanced trees. We cannot do this theoretically because, even if we have found a local minimum for the long branch attraction setting (Theorem 6.3), we cannot

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors would like to thank Piotr Zwiernik for sharing initial discussion in this topic. The authors were partially supported by Spanish government Secretaría de Estado de Investigación, Desarrollo e Innovación [MTM2015-69135-P (MINECO)] and [PID2019-103849GB-I00 (MICINN)]; Generalitat de Catalunya [2014 SGR-634]. M. Garrote-López was also funded by Spanish government, research project Maria de Maeztu [MDM-2014-0445 (MINECO)].

References (36)

E.S. Allman et al.
Phylogenetic invariants of the general Markov model of sequence mutation
Math. Biosci.
(2003)
W. Bosma et al.
The Magma algebra system. I. The user language
Computational Algebra and Number Theory
J. Symb. Comput.
(1997)
M. Casanellas et al.
Geometry of the Kimura 3-parameter model
Adv. Appl. Math.
(2008)
J. Chifman et al.
Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes, site-specific rate variation, and invariable sites
J. Theor. Biol.
(2015)
T. Jukes et al.
Evolution of protein molecules
S. Klaere et al.
An algebraic analysis of the two state Markov model on tripod trees
Math. Biosci.
(2012)
E.S. Allman et al.
Species tree inference by the star method and its generalizations
J. Comput. Biol.
(2013)
E.S. Allman et al.
Split scores: a tool to quantify phylogenetic signal in genome-scale data
Syst. Biol.
(2017)
E.S. Allman et al.
Mathematical Models in Biology, an Introduction
(January 2004)
E.S. Allman et al.
Quartets and parameter recovery for the general Markov model of sequence mutation
Appl. Math. Res. Express
(2004)

E.S. Allman et al.

Phylogenetic invariants

E.S. Allman et al.

The identifiability of covarion models in phylogenetics

IEEE/ACM Trans. Comput. Biol. Bioinform.

(2009)

E.S. Allman et al.

A semialgebraic description of the general Markov model on phylogenetic trees

SIAM J. Discrete Math.

(2012)

M. Casanellas et al.

Performance of a new invariants method on homogeneous and nonhomogeneous quartet trees

Mol. Biol. Evol.

(2007)

M. Casanellas et al.

Invariant versus classical quartet inference when evolution is heterogeneous across sites and lineages

Syst. Biol.

(2016)

M. Casanellas et al.

The space of phylogenetic mixtures for equivariant models

Algorithms Mol. Biol.

(2012)

M. Casanellas et al.

Low degree equations for phylogenetic group-based models

Collect. Math.

(2015)

J. Chifman et al.

Quartet inference from SNP data under the coalescent model

Bioinformatics

(2014)

Cited by (4)

DIMENSIONS OF LEVEL-1 GROUP-BASED PHYLOGENETIC NETWORKS
2023, arXiv
Designing Weights for Quartet-Based Methods When Data are Heterogeneous Across Lineages
2023, Bulletin of Mathematical Biology
Computing algebraic degrees of phylogenetic varieties
2022, arXiv
The Kishony Mega-Plate Experiment, a Markov Process
2021, bioRxiv

View full text

Distance to the stochastic part of phylogenetic varieties

Abstract

Introduction

Section snippets

Phylogenetic varieties

The closest stochastic matrix

The case of short external branches

Computing the closest point to a stochastic phylogenetic region

The long branch attraction case

Study on simulated data

Declaration of Competing Interest

Acknowledgements

Math. Biosci.

J. Symb. Comput.

Adv. Appl. Math.

J. Theor. Biol.

Math. Biosci.

Species tree inference by the star method and its generalizations

J. Comput. Biol.

Split scores: a tool to quantify phylogenetic signal in genome-scale data

Syst. Biol.

Mathematical Models in Biology, an Introduction

Quartets and parameter recovery for the general Markov model of sequence mutation

Appl. Math. Res. Express

Phylogenetic invariants

The identifiability of covarion models in phylogenetics

IEEE/ACM Trans. Comput. Biol. Bioinform.

A semialgebraic description of the general Markov model on phylogenetic trees

SIAM J. Discrete Math.

Performance of a new invariants method on homogeneous and nonhomogeneous quartet trees

Mol. Biol. Evol.

Invariant versus classical quartet inference when evolution is heterogeneous across sites and lineages

Syst. Biol.

The space of phylogenetic mixtures for equivariant models

Algorithms Mol. Biol.

Low degree equations for phylogenetic group-based models

Collect. Math.

Quartet inference from SNP data under the coalescent model

Bioinformatics