Skip to main content

Advertisement

Log in

Statistically consistent and computationally efficient inference of ancestral DNA sequences in the TKF91 model under dense taxon sampling

  • Original Article
  • Published:
Bulletin of Mathematical Biology Aims and scope Submit manuscript

Abstract

In evolutionary biology, the speciation history of living organisms is represented graphically by a phylogeny, that is, a rooted tree whose leaves correspond to current species and whose branchings indicate past speciation events. Phylogenetic analyses often rely on molecular sequences, such as DNA sequences, collected from the species of interest, and it is common in this context to employ statistical approaches based on stochastic models of sequence evolution on a tree. For tractability, such models necessarily make simplifying assumptions about the evolutionary mechanisms involved. In particular, commonly omitted are insertions and deletions of nucleotides—also known as indels. Properly accounting for indels in statistical phylogenetic analyses remains a major challenge in computational evolutionary biology. Here, we consider the problem of reconstructing ancestral sequences on a known phylogeny in a model of sequence evolution incorporating nucleotide substitutions, insertions and deletions, specifically the classical TKF91 process. We focus on the case of dense phylogenies of bounded height, which we refer to as the taxon-rich setting, where statistical consistency is achievable. We give the first explicit reconstruction algorithm with provable guarantees under constant rates of mutation. Our algorithm succeeds when the phylogeny satisfies the “big bang” condition, a necessary and sufficient condition for statistical consistency in this setting.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2

Similar content being viewed by others

Notes

  1. The choice of \(t_j\)’s affects the constants in our quantitative bounds, but we have not tried to optimize them in the current work.

References

  • Andoni A, Daskalakis C, Hassidim A, Roch S (2012) Global alignment of molecular sequences via ancestral state reconstruction. Stoch Process Appl 122(12):3852–3874

    Article  MathSciNet  Google Scholar 

  • Anderson WJ (1991) Continuous-time Markov chains: an applications-oriented approach. Springer series in statistics: probability and its applications. Springer, New York

    Book  Google Scholar 

  • Daskalakis C, Roch S (2013) Alignment-free phylogenetic reconstruction: sample complexity via a branching process analysis. Ann Appl Probab 23(2):693–721

    Article  MathSciNet  Google Scholar 

  • Durrett R (1996) Probability: theory and examples, 2nd edn. Duxbury Press, Belmont

    MATH  Google Scholar 

  • Evans WS, Kenyon C, Peres Y, Schulman LJ (2000) Broadcasting on trees and the Ising model. Ann Appl Probab 10(2):410–433

    Article  MathSciNet  Google Scholar 

  • Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368–376

    Article  Google Scholar 

  • Fan W-T, Roch S (2018) Necessary and sufficient conditions for consistent root reconstruction in Markov models on trees. Electron J Probab 23(47):24

    MathSciNet  MATH  Google Scholar 

  • Gautschi W (1962) On inverses of Vandermonde and confluent Vandermonde matrices. Numer Math 4(1):117–123

    Article  MathSciNet  Google Scholar 

  • Gascuel O, Steel M (2010) Inferring ancestral sequences in taxon-rich phylogenies. Math Biosci 227(2):125–135

    Article  MathSciNet  Google Scholar 

  • Ganesh A, Zhang Q (2019) Optimal sequence length requirements for phylogenetic tree reconstruction with indels. In: Proceedings of the 51st annual ACM SIGACT symposium on theory of computing, STOC 2019. ACM, New York, pp 721–732

  • Hoeffding W (1963) Probability inequalities for sums of bounded random variables. J Am Stat Assoc 58:13–30

    Article  MathSciNet  Google Scholar 

  • Karlin S, Taylor HE (1981) A second course in stochastic processes. Elsevier, Amsterdam

    MATH  Google Scholar 

  • Liberles DA (2007) Ancestral sequence reconstruction. Oxford University Press, Oxford

    Book  Google Scholar 

  • Mitzenmacher M et al (2009) A survey of results for deletion channels and related synchronization channels. Probab Surv 6:1–33

    Article  MathSciNet  Google Scholar 

  • Mossel E (2001) Reconstruction on trees: beating the second eigenvalue. Ann Appl Probab 11(1):285–300

    Article  MathSciNet  Google Scholar 

  • Sly A (2009) Reconstruction for the Potts model. In: STOC, pp 581–590

  • Thatte BD (2006) Invertibility of the TKF model of sequence evolution. Math Biosci 200(1):58–75

    Article  MathSciNet  Google Scholar 

  • Thorne JL, Kishino H, Felsenstein J (1991) An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol 33(2):114–124

    Article  Google Scholar 

  • Thorne JL, Kishino H, Felsenstein J (1992) Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol 34(1):3–16

    Article  Google Scholar 

  • Warnow T (2013) Large-scale multiple sequence alignment and phylogeny estimation. Springer, London, pp 85–146

    Google Scholar 

Download references

Acknowledgements

W.-T. Fan’s work was supported by NSF Grant DMS-1149312 to SR and NSF Grant DMS-1804492. S. Roch’s work was supported by NSF Grants DMS-1149312 (CAREER), DMS-1614242 and CCF-1740707 (TRIPODS).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Sebastien Roch.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Some properties of the TKF91 length process

Recall the TKF91 edge process \({\mathcal {I}}=({\mathcal {I}}_t)_{t\ge 0}\) in Definition 1, which has parameters \((\nu ,\,\lambda ,\,\mu )\in (0,\infty )^3\) with \(\lambda < \mu \) and \((\pi _A,\,\pi _T,\,\pi _C,\,\pi _G)\in [0,\infty )^4\) with \(\pi _A +\pi _T + \pi _C + \pi _G = 1\). The sequence length of the TKF91 edge process is a continuous-time linear birth–death–immigration process \((|{\mathcal {I}}_{t}|)_{t\ge 0}\) with infinitesimal generator \(Q_{i,i+1}=\lambda +i\lambda \) (for \(i\in {\mathbb {Z}}_+\)), \(Q_{i,i-1}=i\mu \) (for \(i\ge 1\)) and \(Q_{i,j}=0\) otherwise. This is a well-studied process for which explicit forms for the transition density \(p_{ij}(t)\) and probability generating functions \(G_i(z,t)=\sum _{j=0}^{\infty }p_{ij}(t)z^j\) are known. See, for instance, Anderson (1991, Section 3.2) or Karlin and Taylor (1981, Chapter 4) for more details. This process was also analyzed in Thatte (2006) in the related context of phylogeny estimation.

We collect here a few properties that will be useful in our analysis. The probability generating function is given by

$$\begin{aligned} G_i(z,t)=\left[ \frac{1-\beta -z(\gamma -\beta )}{1-\beta \gamma -\gamma z(1-\beta )}\right] ^i\,\left[ \frac{1-\gamma }{1-\beta \gamma -\gamma z(1-\beta )}\right] \end{aligned}$$

for \(i\in {\mathbb {Z}}_+\) and \(t>0\), where

$$\begin{aligned} \beta =\beta _t = \mathrm{e}^{-(\mu -\lambda )t}\quad \text {and}\quad \gamma =\frac{\lambda }{\mu }. \end{aligned}$$

Fix \(t\ge 0\) and let \(\varphi _i(\theta )={\mathbb {E}}_i[\mathrm{e}^{\theta \,|{\mathcal {I}}_{t}|}]\) be the characteristic function of \(|{\mathcal {I}}_{t}|\) starting at i. Then, for \(\lambda \ne \mu \) (i.e., \(\gamma \ne 1\)),

$$\begin{aligned} \varphi _i(\theta )&=G_i(\mathrm{e}^{\theta },t)=\left[ \frac{1-\beta -\mathrm{e}^{\theta }(\gamma -\beta )}{1-\beta \gamma -\gamma \mathrm{e}^{\theta }(1-\beta )}\right] ^i\,\left[ \frac{1-\gamma }{1-\beta \gamma -\gamma \mathrm{e}^{\theta }(1-\beta )}\right] \nonumber \\&=(1-\gamma )\frac{A^i}{B^{i+1}}, \end{aligned}$$
(51)

where

$$\begin{aligned} A=1-\beta -\mathrm{e}^{\theta }(\gamma -\beta ) \quad \text {and}\quad B=1-\beta \gamma -\gamma \mathrm{e}^{\theta }(1-\beta ). \end{aligned}$$

Differentiating with respect to \(\theta \) gives

$$\begin{aligned} \varphi '_i(\theta )&=(1-\gamma )\frac{A^{i-1}\mathrm{e}^{\theta }\big \{i(\beta -\gamma )B -(i+1)\gamma (\beta -1)A\big \}}{B^{i+2}} \nonumber \\&=(1-\gamma )\frac{A^{i-1}\mathrm{e}^{\theta }\big \{\beta (1-\gamma )^2i +\gamma (1-\beta )^2 + \mathrm{e}^{\theta }(\gamma -\beta )\gamma (\beta -1)\big \}}{B^{i+2}} \nonumber \\&= \frac{A^{i-1}}{B^{i+2}}(C\mathrm{e}^{\theta }+D\mathrm{e}^{2\theta }), \end{aligned}$$
(52)

where

$$\begin{aligned} C=(1-\gamma )[\beta (1-\gamma )^2i +\gamma (1-\beta )^2] \quad \text {and}\quad D=(1-\gamma )(\gamma -\beta )\gamma (\beta -1). \end{aligned}$$

Differentiating with respect to \(\theta \) once more gives

$$\begin{aligned} \varphi ''_i(\theta )&=\frac{A^{i-2}\mathrm{e}^{\theta }}{B^{i+3}}\Big \{BA(C+2D\mathrm{e}^{\theta }) \,+\,(C\mathrm{e}^{\theta }+D\mathrm{e}^{2\theta })\\&\quad \times \big [(i-1)B(\beta -\gamma )+(i+2)A\gamma (1-\beta )\big ]\Big \}. \end{aligned}$$

The expected value and the second moment are given by

$$\begin{aligned} {\mathbb {E}}_{i}[|{\mathcal {I}}_{t}|]=\varphi '_i(0)&= i\beta +\frac{\gamma (1-\beta )}{1 -\gamma }=i\beta +\frac{\lambda }{\lambda -\mu }(\beta -1) \quad \text { and}\\ {\mathbb {E}}_{i}[|{\mathcal {I}}_{t}|^2]=\varphi ''_i(0)&= i^2\beta ^2+i\,\frac{\beta (3\gamma +1)(1-\beta )}{1-\gamma } +\frac{(1-\beta )\gamma [1+(1-2\beta )\gamma ]}{(1-\gamma )^2}. \end{aligned}$$

From these, we also have the variance

$$\begin{aligned} \mathrm {Var}_i(|{\mathcal {I}}_{t}|)&={\mathbb {E}}_i[|{\mathcal {I}}_{t}|^2]-({\mathbb {E}}_i[|{\mathcal {I}}_{t}|])^2 \nonumber \\&=i^2\beta ^2+i\,\frac{\beta (3\gamma +1)(1-\beta )}{1-\gamma } +\frac{(1-\beta )\gamma [1+(1-2\beta )\gamma ]}{(1-\gamma )^2}\nonumber \\&\quad - \left( i\beta +\frac{\gamma (1-\beta )}{1 -\gamma }\right) ^2\nonumber \\&= i\,\frac{\beta (1-\beta )(1+3\gamma -2\beta )}{1-\gamma }\,+\,\frac{\gamma (1-\beta )(1-\beta \gamma )}{(1-\gamma )^2}. \end{aligned}$$
(53)

Consider the function

$$\begin{aligned} \psi _i(\theta )&=\varphi '_i(0)-\frac{\varphi '_i(\theta )}{\varphi _i(\theta )}\\&=i\beta +\frac{\gamma (1-\beta )}{1 -\gamma } \,- \frac{C\mathrm{e}^{\theta }+D\mathrm{e}^{2\theta }}{(1-\gamma )AB} \\&=\left( \beta -\frac{C_1\mathrm{e}^{\theta }}{(1-\gamma )AB}\right) i\,+\,\frac{\gamma (1-\beta )}{1 -\gamma } - \frac{C_2\mathrm{e}^{\theta }+D\mathrm{e}^{2\theta }}{(1-\gamma )AB} \\&=\left( \beta -\frac{\beta (1-\gamma )^2\mathrm{e}^{\theta }}{AB}\right) i\,+\,\frac{\gamma (1-\beta )}{1 -\gamma } \,- \frac{\gamma (1-\beta )^2\mathrm{e}^{\theta }+(\gamma -\beta )\gamma (\beta -1)\,\mathrm{e}^{2\theta }}{AB} \\&= F(\theta )i+G(\theta ) \end{aligned}$$

where we wrote \(C=C_1i+C_2\), with \(C_1=\beta (1-\gamma )^3\) and \(C_2=(1-\gamma )\gamma (1-\beta )^2\), and the last line is a definition. Functions F and G can be simplified to

$$\begin{aligned} F(\theta )&= \frac{-\,(\mathrm{e}^{\theta } - 1) \beta (1-\beta )[(1-\beta \gamma )-\mathrm{e}^{\theta }\gamma (\gamma -\beta )]}{AB} \end{aligned}$$
(54)
$$\begin{aligned} G(\theta )&= \frac{-\,(\mathrm{e}^{\theta }-1) \gamma (1-\beta )(1-\beta \gamma )[(1-\beta )-\mathrm{e}^{\theta }(\gamma -\beta )]}{(1-\gamma )AB}. \end{aligned}$$
(55)

Since \(\varphi _i(0)=1\), we have \(\psi (0)=0\). We consider the case \(\mu \in (\lambda ,\infty )\), that is, \(\gamma \in (0,1)\). Then, both A and B are strictly positive for all \(t\in [0,\infty )\), provided that \(\mathrm{e}^\theta <\mu /\lambda \). Moreover, F and G are continuous on \([0,\,\mu /\lambda )\), smooth on \((0,\,\mu /\lambda )\) and \(F(0)=G(0)=0\).

Notation

For the reader’s convenience, we list some frequently used notation here (Tables 12 and 3).

Table 1 TKF91 edge process \({\mathcal {I}}=({\mathcal {I}}_t)_{t\ge 0}\) in Definition 1
Table 2 TKF91 process on trees
Table 3 Probabilities, expectations and other notations

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Fan, WT., Roch, S. Statistically consistent and computationally efficient inference of ancestral DNA sequences in the TKF91 model under dense taxon sampling. Bull Math Biol 82, 21 (2020). https://doi.org/10.1007/s11538-020-00693-3

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s11538-020-00693-3

Keywords

Navigation