Ambiguity Coding Allows Accurate Inference of Evolutionary Parameters from Alignments in an Aggregated State-Space
Systematic Biology ( IF 6.5 ) Pub Date : 2020-04-30 , DOI: 10.1093/sysbio/syaa036
Claudia C Weber 1 , Umberto Perron 1 , Dearbhaile Casey 1 , Ziheng Yang 2 , Nick Goldman 1

Abstract How can we best learn the history of a protein’s evolution? Ideally, a model of sequence evolution should capture both the process that generates genetic variation and the functional constraints determining which changes are fixed. However, in practical terms the most suitable approach may simply be the one that combines the convenience of easily available input data with the ability to return useful parameter estimates. For example, we might be interested in a measure of the strength of selection (typically obtained using a codon model) or an ancestral structure (obtained using structural modeling based on inferred amino acid sequence and side chain configuration). But what if data in the relevant state-space are not readily available? We show that it is possible to obtain accurate estimates of the outputs of interest using an established method for handling missing data. Encoding observed characters in an alignment as ambiguous representations of characters in a larger state-space allows the application of models with the desired features to data that lack the resolution that is normally required. This strategy is viable because the evolutionary path taken through the observed space contains information about states that were likely visited in the “unseen” state-space. To illustrate this, we consider two examples with amino acid sequences as input. We show that $\omega$, a parameter describing the relative strength of selection on nonsynonymous and synonymous changes, can be estimated in an unbiased manner using an adapted version of a standard 61-state codon model.
Using simulated and empirical data, we find that ancestral amino acid side chain configuration can be inferred by applying a 55-state empirical model to 20-state amino acid data. Where feasible, combining inputs from both ambiguity-coded and fully resolved data improves accuracy. Adding structural information to as few as 12.5% of the sequences in an amino acid alignment results in remarkable ancestral reconstruction performance compared to a benchmark that considers the full rotamer state information. These examples show that our methods permit the recovery of evolutionary information from sequences where it has previously been inaccessible. [Ancestral reconstruction; natural selection; protein structure; state-spaces; substitution models.]
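
The ambiguity-coding idea described in the abstract can be illustrated with a minimal sketch. It assumes the standard genetic code and assumes that ambiguity is expressed as a 0/1 tip likelihood vector over the 61 sense codons, which is the usual way missing or ambiguous data enter a phylogenetic likelihood calculation. The names `SENSE_CODONS` and `tip_vector` are hypothetical and are not taken from the authors' software.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
# '*' marks the three stop codons.
AA_STRING = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODON_TO_AA = dict(zip(CODONS, AA_STRING))
# The 61-state space: every codon except the stops.
SENSE_CODONS = [c for c in CODONS if CODON_TO_AA[c] != "*"]

# Characters treated as fully missing (assumption: gap, '?', and 'X').
MISSING = {"-", "?", "X"}


def tip_vector(aa):
    """Return a 61-entry 0/1 vector for one observed amino acid.

    Entry i is 1.0 if sense codon i could have produced the observed
    amino acid, and 0.0 otherwise. A missing character is compatible
    with every codon.
    """
    if aa in MISSING:
        return [1.0] * len(SENSE_CODONS)
    vec = [1.0 if CODON_TO_AA[c] == aa else 0.0 for c in SENSE_CODONS]
    if not any(vec):
        raise ValueError(f"not a standard amino acid: {aa!r}")
    return vec


if __name__ == "__main__":
    for aa in "MLW-":
        compatible = [c for c, v in zip(SENSE_CODONS, tip_vector(aa)) if v]
        print(aa, len(compatible), compatible[:6])
```

In this sketch an observed leucine is compatible with six codons and methionine with only one, so each amino acid column constrains the codon-level model to a different degree. A likelihood routine for a codon model would consume these vectors at the tips of the tree in place of fully resolved codon states.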

Updated: 2020-04-30