Abstract
In evolutionary biology, the speciation history of living organisms is represented graphically by a phylogeny, that is, a rooted tree whose leaves correspond to current species and whose branchings indicate past speciation events. Phylogenetic analyses often rely on molecular sequences, such as DNA sequences, collected from the species of interest, and it is common in this context to employ statistical approaches based on stochastic models of sequence evolution on a tree. For tractability, such models necessarily make simplifying assumptions about the evolutionary mechanisms involved. In particular, commonly omitted are insertions and deletions of nucleotides—also known as indels. Properly accounting for indels in statistical phylogenetic analyses remains a major challenge in computational evolutionary biology. Here, we consider the problem of reconstructing ancestral sequences on a known phylogeny in a model of sequence evolution incorporating nucleotide substitutions, insertions and deletions, specifically the classical TKF91 process. We focus on the case of dense phylogenies of bounded height, which we refer to as the taxon-rich setting, where statistical consistency is achievable. We give the first explicit reconstruction algorithm with provable guarantees under constant rates of mutation. Our algorithm succeeds when the phylogeny satisfies the “big bang” condition, a necessary and sufficient condition for statistical consistency in this setting.
Similar content being viewed by others
Notes
The choice of \(t_j\)’s affects the constants in our quantitative bounds, but we have not tried to optimize them in the current work.
References
Andoni A, Daskalakis C, Hassidim A, Roch S (2012) Global alignment of molecular sequences via ancestral state reconstruction. Stoch Process Appl 122(12):3852–3874
Anderson WJ (1991) Continuous-time Markov chains: an applications-oriented approach. Springer series in statistics: probability and its applications. Springer, New York
Daskalakis C, Roch S (2013) Alignment-free phylogenetic reconstruction: sample complexity via a branching process analysis. Ann Appl Probab 23(2):693–721
Durrett R (1996) Probability: theory and examples, 2nd edn. Duxbury Press, Belmont
Evans WS, Kenyon C, Peres Y, Schulman LJ (2000) Broadcasting on trees and the Ising model. Ann Appl Probab 10(2):410–433
Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368–376
Fan W-T, Roch S (2018) Necessary and sufficient conditions for consistent root reconstruction in Markov models on trees. Electron J Probab 23(47):24
Gautschi W (1962) On inverses of Vandermonde and confluent Vandermonde matrices. Numer Math 4(1):117–123
Gascuel O, Steel M (2010) Inferring ancestral sequences in taxon-rich phylogenies. Math Biosci 227(2):125–135
Ganesh A, Zhang Q (2019) Optimal sequence length requirements for phylogenetic tree reconstruction with indels. In: Proceedings of the 51st annual ACM SIGACT symposium on theory of computing, STOC 2019. ACM, New York, pp 721–732
Hoeffding W (1963) Probability inequalities for sums of bounded random variables. J Am Stat Assoc 58:13–30
Karlin S, Taylor HE (1981) A second course in stochastic processes. Elsevier, Amsterdam
Liberles DA (2007) Ancestral sequence reconstruction. Oxford University Press, Oxford
Mitzenmacher M et al (2009) A survey of results for deletion channels and related synchronization channels. Probab Surv 6:1–33
Mossel E (2001) Reconstruction on trees: beating the second eigenvalue. Ann Appl Probab 11(1):285–300
Sly A (2009) Reconstruction for the Potts model. In: STOC, pp 581–590
Thatte BD (2006) Invertibility of the TKF model of sequence evolution. Math Biosci 200(1):58–75
Thorne JL, Kishino H, Felsenstein J (1991) An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol 33(2):114–124
Thorne JL, Kishino H, Felsenstein J (1992) Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol 34(1):3–16
Warnow T (2013) Large-scale multiple sequence alignment and phylogeny estimation. Springer, London, pp 85–146
Acknowledgements
W.-T. Fan’s work was supported by NSF Grant DMS-1149312 to SR and NSF Grant DMS-1804492. S. Roch’s work was supported by NSF Grants DMS-1149312 (CAREER), DMS-1614242 and CCF-1740707 (TRIPODS).
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Some properties of the TKF91 length process
Recall the TKF91 edge process \({\mathcal {I}}=({\mathcal {I}}_t)_{t\ge 0}\) in Definition 1, which has parameters \((\nu ,\,\lambda ,\,\mu )\in (0,\infty )^3\) with \(\lambda < \mu \) and \((\pi _A,\,\pi _T,\,\pi _C,\,\pi _G)\in [0,\infty )^4\) with \(\pi _A +\pi _T + \pi _C + \pi _G = 1\). The sequence length of the TKF91 edge process is a continuous-time linear birth–death–immigration process \((|{\mathcal {I}}_{t}|)_{t\ge 0}\) with infinitesimal generator \(Q_{i,i+1}=\lambda +i\lambda \) (for \(i\in {\mathbb {Z}}_+\)), \(Q_{i,i-1}=i\mu \) (for \(i\ge 1\)) and \(Q_{i,j}=0\) otherwise. This is a well-studied process for which explicit forms for the transition density \(p_{ij}(t)\) and probability generating functions \(G_i(z,t)=\sum _{j=0}^{\infty }p_{ij}(t)z^j\) are known. See, for instance, Anderson (1991, Section 3.2) or Karlin and Taylor (1981, Chapter 4) for more details. This process was also analyzed in Thatte (2006) in the related context of phylogeny estimation.
We collect here a few properties that will be useful in our analysis. The probability generating function is given by
for \(i\in {\mathbb {Z}}_+\) and \(t>0\), where
Fix \(t\ge 0\) and let \(\varphi _i(\theta )={\mathbb {E}}_i[\mathrm{e}^{\theta \,|{\mathcal {I}}_{t}|}]\) be the characteristic function of \(|{\mathcal {I}}_{t}|\) starting at i. Then, for \(\lambda \ne \mu \) (i.e., \(\gamma \ne 1\)),
where
Differentiating with respect to \(\theta \) gives
where
Differentiating with respect to \(\theta \) once more gives
The expected value and the second moment are given by
From these, we also have the variance
Consider the function
where we wrote \(C=C_1i+C_2\), with \(C_1=\beta (1-\gamma )^3\) and \(C_2=(1-\gamma )\gamma (1-\beta )^2\), and the last line is a definition. Functions F and G can be simplified to
Since \(\varphi _i(0)=1\), we have \(\psi (0)=0\). We consider the case \(\mu \in (\lambda ,\infty )\), that is, \(\gamma \in (0,1)\). Then, both A and B are strictly positive for all \(t\in [0,\infty )\), provided that \(\mathrm{e}^\theta <\mu /\lambda \). Moreover, F and G are continuous on \([0,\,\mu /\lambda )\), smooth on \((0,\,\mu /\lambda )\) and \(F(0)=G(0)=0\).
Notation
For the reader’s convenience, we list some frequently used notation here (Tables 1, 2 and 3).
Rights and permissions
About this article
Cite this article
Fan, WT., Roch, S. Statistically consistent and computationally efficient inference of ancestral DNA sequences in the TKF91 model under dense taxon sampling. Bull Math Biol 82, 21 (2020). https://doi.org/10.1007/s11538-020-00693-3
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s11538-020-00693-3