Statistically consistent and computationally efficient inference of ancestral DNA sequences in the TKF91 model under dense taxon sampling

Fan, Wai-Tong; Roch, Sebastien

doi:10.1007/s11538-020-00693-3

Statistically consistent and computationally efficient inference of ancestral DNA sequences in the TKF91 model under dense taxon sampling

Original Article
Published: 22 January 2020

Volume 82, article number 21, (2020)
Cite this article

Bulletin of Mathematical Biology Aims and scope Submit manuscript

Wai-Tong Fan¹ &
Sebastien Roch²

223 Accesses
2 Citations
Explore all metrics

Abstract

In evolutionary biology, the speciation history of living organisms is represented graphically by a phylogeny, that is, a rooted tree whose leaves correspond to current species and whose branchings indicate past speciation events. Phylogenetic analyses often rely on molecular sequences, such as DNA sequences, collected from the species of interest, and it is common in this context to employ statistical approaches based on stochastic models of sequence evolution on a tree. For tractability, such models necessarily make simplifying assumptions about the evolutionary mechanisms involved. In particular, commonly omitted are insertions and deletions of nucleotides—also known as indels. Properly accounting for indels in statistical phylogenetic analyses remains a major challenge in computational evolutionary biology. Here, we consider the problem of reconstructing ancestral sequences on a known phylogeny in a model of sequence evolution incorporating nucleotide substitutions, insertions and deletions, specifically the classical TKF91 process. We focus on the case of dense phylogenies of bounded height, which we refer to as the taxon-rich setting, where statistical consistency is achievable. We give the first explicit reconstruction algorithm with provable guarantees under constant rates of mutation. Our algorithm succeeds when the phylogeny satisfies the “big bang” condition, a necessary and sufficient condition for statistical consistency in this setting.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

On the Accuracy of Ancestral Sequence Reconstruction for Ultrametric Trees with Parsimony

Article 23 February 2018

Ancestral Sequence Reconstruction with Maximum Parsimony

Article 05 October 2017

Phase transition in the sample complexity of likelihood-based phylogeny inference

Article 03 August 2017

Notes

The choice of $t_j$’s affects the constants in our quantitative bounds, but we have not tried to optimize them in the current work.

References

Andoni A, Daskalakis C, Hassidim A, Roch S (2012) Global alignment of molecular sequences via ancestral state reconstruction. Stoch Process Appl 122(12):3852–3874
Article MathSciNet Google Scholar
Anderson WJ (1991) Continuous-time Markov chains: an applications-oriented approach. Springer series in statistics: probability and its applications. Springer, New York
Book Google Scholar
Daskalakis C, Roch S (2013) Alignment-free phylogenetic reconstruction: sample complexity via a branching process analysis. Ann Appl Probab 23(2):693–721
Article MathSciNet Google Scholar
Durrett R (1996) Probability: theory and examples, 2nd edn. Duxbury Press, Belmont
MATH Google Scholar
Evans WS, Kenyon C, Peres Y, Schulman LJ (2000) Broadcasting on trees and the Ising model. Ann Appl Probab 10(2):410–433
Article MathSciNet Google Scholar
Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368–376
Article Google Scholar
Fan W-T, Roch S (2018) Necessary and sufficient conditions for consistent root reconstruction in Markov models on trees. Electron J Probab 23(47):24
MathSciNet MATH Google Scholar
Gautschi W (1962) On inverses of Vandermonde and confluent Vandermonde matrices. Numer Math 4(1):117–123
Article MathSciNet Google Scholar
Gascuel O, Steel M (2010) Inferring ancestral sequences in taxon-rich phylogenies. Math Biosci 227(2):125–135
Article MathSciNet Google Scholar
Ganesh A, Zhang Q (2019) Optimal sequence length requirements for phylogenetic tree reconstruction with indels. In: Proceedings of the 51st annual ACM SIGACT symposium on theory of computing, STOC 2019. ACM, New York, pp 721–732
Hoeffding W (1963) Probability inequalities for sums of bounded random variables. J Am Stat Assoc 58:13–30
Article MathSciNet Google Scholar
Karlin S, Taylor HE (1981) A second course in stochastic processes. Elsevier, Amsterdam
MATH Google Scholar
Liberles DA (2007) Ancestral sequence reconstruction. Oxford University Press, Oxford
Book Google Scholar
Mitzenmacher M et al (2009) A survey of results for deletion channels and related synchronization channels. Probab Surv 6:1–33
Article MathSciNet Google Scholar
Mossel E (2001) Reconstruction on trees: beating the second eigenvalue. Ann Appl Probab 11(1):285–300
Article MathSciNet Google Scholar
Sly A (2009) Reconstruction for the Potts model. In: STOC, pp 581–590
Thatte BD (2006) Invertibility of the TKF model of sequence evolution. Math Biosci 200(1):58–75
Article MathSciNet Google Scholar
Thorne JL, Kishino H, Felsenstein J (1991) An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol 33(2):114–124
Article Google Scholar
Thorne JL, Kishino H, Felsenstein J (1992) Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol 34(1):3–16
Article Google Scholar
Warnow T (2013) Large-scale multiple sequence alignment and phylogeny estimation. Springer, London, pp 85–146
Google Scholar

Download references

Acknowledgements

W.-T. Fan’s work was supported by NSF Grant DMS-1149312 to SR and NSF Grant DMS-1804492. S. Roch’s work was supported by NSF Grants DMS-1149312 (CAREER), DMS-1614242 and CCF-1740707 (TRIPODS).

Author information

Authors and Affiliations

Department of Mathematics, Indiana University, Bloomington, USA
Wai-Tong Fan
Department of Mathematics, UW–Madison, Madison, USA
Sebastien Roch

Authors

Wai-Tong Fan
View author publications
You can also search for this author in PubMed Google Scholar
Sebastien Roch
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Sebastien Roch.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Some properties of the TKF91 length process

Recall the TKF91 edge process ${\mathcal {I}}=({\mathcal {I}}_t)_{t\ge 0}$ in Definition 1, which has parameters $(\nu ,\,\lambda ,\,\mu )\in (0,\infty )^3$ with $\lambda < \mu $ and $(\pi _A,\,\pi _T,\,\pi _C,\,\pi _G)\in [0,\infty )^4$ with $\pi _A +\pi _T + \pi _C + \pi _G = 1$. The sequence length of the TKF91 edge process is a continuous-time linear birth–death–immigration process $(|{\mathcal {I}}_{t}|)_{t\ge 0}$ with infinitesimal generator $Q_{i,i+1}=\lambda +i\lambda $ (for $i\in {\mathbb {Z}}_+$), $Q_{i,i-1}=i\mu $ (for $i\ge 1$) and $Q_{i,j}=0$ otherwise. This is a well-studied process for which explicit forms for the transition density $p_{ij}(t)$ and probability generating functions $G_i(z,t)=\sum _{j=0}^{\infty }p_{ij}(t)z^j$ are known. See, for instance, Anderson (1991, Section 3.2) or Karlin and Taylor (1981, Chapter 4) for more details. This process was also analyzed in Thatte (2006) in the related context of phylogeny estimation.

We collect here a few properties that will be useful in our analysis. The probability generating function is given by

$$\begin{aligned} G_i(z,t)=\left[ \frac{1-\beta -z(\gamma -\beta )}{1-\beta \gamma -\gamma z(1-\beta )}\right] ^i\,\left[ \frac{1-\gamma }{1-\beta \gamma -\gamma z(1-\beta )}\right] \end{aligned}$$

for $i\in {\mathbb {Z}}_+$ and $t>0$, where

$$\begin{aligned} \beta =\beta _t = \mathrm{e}^{-(\mu -\lambda )t}\quad \text {and}\quad \gamma =\frac{\lambda }{\mu }. \end{aligned}$$

Fix $t\ge 0$ and let $\varphi _i(\theta )={\mathbb {E}}_i[\mathrm{e}^{\theta \,|{\mathcal {I}}_{t}|}]$ be the characteristic function of $|{\mathcal {I}}_{t}|$ starting at i. Then, for $\lambda \ne \mu $ (i.e., $\gamma \ne 1$),

$$\begin{aligned} \varphi _i(\theta )&=G_i(\mathrm{e}^{\theta },t)=\left[ \frac{1-\beta -\mathrm{e}^{\theta }(\gamma -\beta )}{1-\beta \gamma -\gamma \mathrm{e}^{\theta }(1-\beta )}\right] ^i\,\left[ \frac{1-\gamma }{1-\beta \gamma -\gamma \mathrm{e}^{\theta }(1-\beta )}\right] \nonumber \\&=(1-\gamma )\frac{A^i}{B^{i+1}}, \end{aligned}$$

(51)

where

$$\begin{aligned} A=1-\beta -\mathrm{e}^{\theta }(\gamma -\beta ) \quad \text {and}\quad B=1-\beta \gamma -\gamma \mathrm{e}^{\theta }(1-\beta ). \end{aligned}$$

Differentiating with respect to $\theta $ gives

$$\begin{aligned} \varphi '_i(\theta )&=(1-\gamma )\frac{A^{i-1}\mathrm{e}^{\theta }\big \{i(\beta -\gamma )B -(i+1)\gamma (\beta -1)A\big \}}{B^{i+2}} \nonumber \\&=(1-\gamma )\frac{A^{i-1}\mathrm{e}^{\theta }\big \{\beta (1-\gamma )^2i +\gamma (1-\beta )^2 + \mathrm{e}^{\theta }(\gamma -\beta )\gamma (\beta -1)\big \}}{B^{i+2}} \nonumber \\&= \frac{A^{i-1}}{B^{i+2}}(C\mathrm{e}^{\theta }+D\mathrm{e}^{2\theta }), \end{aligned}$$

(52)

where

$$\begin{aligned} C=(1-\gamma )[\beta (1-\gamma )^2i +\gamma (1-\beta )^2] \quad \text {and}\quad D=(1-\gamma )(\gamma -\beta )\gamma (\beta -1). \end{aligned}$$

Differentiating with respect to $\theta $ once more gives

$$\begin{aligned} \varphi ''_i(\theta )&=\frac{A^{i-2}\mathrm{e}^{\theta }}{B^{i+3}}\Big \{BA(C+2D\mathrm{e}^{\theta }) \,+\,(C\mathrm{e}^{\theta }+D\mathrm{e}^{2\theta })\\&\quad \times \big [(i-1)B(\beta -\gamma )+(i+2)A\gamma (1-\beta )\big ]\Big \}. \end{aligned}$$

The expected value and the second moment are given by

$$\begin{aligned} {\mathbb {E}}_{i}[|{\mathcal {I}}_{t}|]=\varphi '_i(0)&= i\beta +\frac{\gamma (1-\beta )}{1 -\gamma }=i\beta +\frac{\lambda }{\lambda -\mu }(\beta -1) \quad \text { and}\\ {\mathbb {E}}_{i}[|{\mathcal {I}}_{t}|^2]=\varphi ''_i(0)&= i^2\beta ^2+i\,\frac{\beta (3\gamma +1)(1-\beta )}{1-\gamma } +\frac{(1-\beta )\gamma [1+(1-2\beta )\gamma ]}{(1-\gamma )^2}. \end{aligned}$$

From these, we also have the variance

$$\begin{aligned} \mathrm {Var}_i(|{\mathcal {I}}_{t}|)&={\mathbb {E}}_i[|{\mathcal {I}}_{t}|^2]-({\mathbb {E}}_i[|{\mathcal {I}}_{t}|])^2 \nonumber \\&=i^2\beta ^2+i\,\frac{\beta (3\gamma +1)(1-\beta )}{1-\gamma } +\frac{(1-\beta )\gamma [1+(1-2\beta )\gamma ]}{(1-\gamma )^2}\nonumber \\&\quad - \left( i\beta +\frac{\gamma (1-\beta )}{1 -\gamma }\right) ^2\nonumber \\&= i\,\frac{\beta (1-\beta )(1+3\gamma -2\beta )}{1-\gamma }\,+\,\frac{\gamma (1-\beta )(1-\beta \gamma )}{(1-\gamma )^2}. \end{aligned}$$

(53)

Consider the function

$$\begin{aligned} \psi _i(\theta )&=\varphi '_i(0)-\frac{\varphi '_i(\theta )}{\varphi _i(\theta )}\\&=i\beta +\frac{\gamma (1-\beta )}{1 -\gamma } \,- \frac{C\mathrm{e}^{\theta }+D\mathrm{e}^{2\theta }}{(1-\gamma )AB} \\&=\left( \beta -\frac{C_1\mathrm{e}^{\theta }}{(1-\gamma )AB}\right) i\,+\,\frac{\gamma (1-\beta )}{1 -\gamma } - \frac{C_2\mathrm{e}^{\theta }+D\mathrm{e}^{2\theta }}{(1-\gamma )AB} \\&=\left( \beta -\frac{\beta (1-\gamma )^2\mathrm{e}^{\theta }}{AB}\right) i\,+\,\frac{\gamma (1-\beta )}{1 -\gamma } \,- \frac{\gamma (1-\beta )^2\mathrm{e}^{\theta }+(\gamma -\beta )\gamma (\beta -1)\,\mathrm{e}^{2\theta }}{AB} \\&= F(\theta )i+G(\theta ) \end{aligned}$$

where we wrote $C=C_1i+C_2$, with $C_1=\beta (1-\gamma )^3$ and $C_2=(1-\gamma )\gamma (1-\beta )^2$, and the last line is a definition. Functions F and G can be simplified to

$$\begin{aligned} F(\theta )&= \frac{-\,(\mathrm{e}^{\theta } - 1) \beta (1-\beta )[(1-\beta \gamma )-\mathrm{e}^{\theta }\gamma (\gamma -\beta )]}{AB} \end{aligned}$$

(54)

$$\begin{aligned} G(\theta )&= \frac{-\,(\mathrm{e}^{\theta }-1) \gamma (1-\beta )(1-\beta \gamma )[(1-\beta )-\mathrm{e}^{\theta }(\gamma -\beta )]}{(1-\gamma )AB}. \end{aligned}$$

(55)

Since $\varphi _i(0)=1$, we have $\psi (0)=0$. We consider the case $\mu \in (\lambda ,\infty )$, that is, $\gamma \in (0,1)$. Then, both A and B are strictly positive for all $t\in [0,\infty )$, provided that $\mathrm{e}^\theta <\mu /\lambda $. Moreover, F and G are continuous on $[0,\,\mu /\lambda )$, smooth on $(0,\,\mu /\lambda )$ and $F(0)=G(0)=0$.

Notation

For the reader’s convenience, we list some frequently used notation here (Tables 1, 2 and 3).

Table 1 TKF91 edge process ${\mathcal {I}}=({\mathcal {I}}_t)_{t\ge 0}$ in Definition 1

Full size table

Table 2 TKF91 process on trees

Full size table

Table 3 Probabilities, expectations and other notations

Full size table

Rights and permissions

Reprints and permissions

About this article

Cite this article

Fan, WT., Roch, S. Statistically consistent and computationally efficient inference of ancestral DNA sequences in the TKF91 model under dense taxon sampling. Bull Math Biol 82, 21 (2020). https://doi.org/10.1007/s11538-020-00693-3

Download citation

Received: 02 November 2017
Accepted: 02 January 2020
Published: 22 January 2020
DOI: https://doi.org/10.1007/s11538-020-00693-3

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Statistically consistent and computationally efficient inference of ancestral DNA sequences in the TKF91 model under dense taxon sampling

Abstract

Access this article

Similar content being viewed by others

On the Accuracy of Ancestral Sequence Reconstruction for Ultrametric Trees with Parsimony

Ancestral Sequence Reconstruction with Maximum Parsimony

Phase transition in the sample complexity of likelihood-based phylogeny inference

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendices

Some properties of the TKF91 length process

Notation

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Statistically consistent and computationally efficient inference of ancestral DNA sequences in the TKF91 model under dense taxon sampling

Abstract

Access this article

Similar content being viewed by others

On the Accuracy of Ancestral Sequence Reconstruction for Ultrametric Trees with Parsimony

Ancestral Sequence Reconstruction with Maximum Parsimony

Phase transition in the sample complexity of likelihood-based phylogeny inference

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendices

Some properties of the TKF91 length process

Notation

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation