Recoding amino acids to a reduced alphabet may increase or decrease phylogenetic accuracy
Systematic Biology (IF 6.5), Pub Date: 2022-06-17, DOI: 10.1093/sysbio/syac042
Peter G Foster 1 , Dominik Schrempf 2 , Gergely J Szöllősi 2, 3, 4 , Tom A Williams 5 , Cymon J Cox 6 , T Martin Embley 7

Common molecular phylogenetic characteristics such as long branches and compositional heterogeneity can be problematic for phylogenetic reconstruction when using amino acid data. Recoding alignments to reduced alphabets before phylogenetic analysis has often been used both to explore and potentially decrease the effect of such problems. We tested the effectiveness of this strategy on topological accuracy using simulated data on four-taxon trees. We simulated alignments in phylogenetically challenging ways to test the phylogenetic accuracy of analyses using various recoding strategies together with commonly used homogeneous models. We tested three recoding methods based on amino acid exchangeability, and another recoding method based on lowering the compositional heterogeneity among alignment sequences as measured by the Chi-squared statistic. Our simulation results show that on trees with long branches where sequences approach saturation, accuracy was not greatly affected by exchangeability-based recodings, but Chi-squared-based recoding decreased accuracy. We then simulated sequences with different kinds of compositional heterogeneity over the tree. Recoding often increased accuracy on such alignments. Exchangeability-based recoding was rarely worse than not recoding, and often considerably better. Recoding based on lowering the Chi-squared value improved accuracy in some cases but not in others, suggesting that low compositional heterogeneity by itself is not sufficient to increase accuracy in the analysis of these alignments. We also simulated alignments using site-specific amino acid profiles, making sequences that had compositional heterogeneity over alignment sites. Exchangeability-based recoding coupled with site-homogeneous models had poor accuracy for these datasets but Chi-squared-based recoding on these alignments increased accuracy. We then simulated datasets that were compositionally both site- and tree-heterogeneous, like many real datasets. The effect on accuracy of recoding such doubly problematic datasets varied widely, depending on the type of compositional tree-heterogeneity and on the recoding scheme. Interestingly, analysis of unrecoded compositionally heterogeneous alignments with the NDCH or CAT models was generally more accurate than homogeneous analysis, whether recoded or not. Overall, our results suggest that making trees for recoded amino acid datasets can be useful, but they need to be interpreted cautiously as part of a more comprehensive analysis. The use of better fitting models like NDCH and CAT, which directly account for the patterns in the data, may offer a more promising long-term solution for analysing empirical data.
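
For readers unfamiliar with recoding, the two ideas in the abstract (collapsing amino acids into a reduced alphabet, and measuring compositional heterogeneity among sequences with a Chi-squared statistic) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions: the abstract does not name the recoding schemes that were tested, so the six-group Dayhoff-style grouping below is an assumed example of an exchangeability-based scheme, and the function names (`build_recoding_map`, `recode`, `composition_chi_squared`) are hypothetical helpers, not the authors' software.

```python
from collections import Counter

# ASSUMPTION: a Dayhoff-style six-group alphabet, used here purely as an
# example of an exchangeability-based grouping. The abstract does not say
# which groupings the authors actually tested.
DAYHOFF6_GROUPS = ["AGPST", "DENQ", "HKR", "ILMV", "FWY", "C"]


def build_recoding_map(groups):
    """Map each amino acid letter to the digit of the group it belongs to."""
    return {aa: str(i) for i, group in enumerate(groups) for aa in group}


def recode(sequence, mapping):
    """Recode one aligned sequence; gaps and unrecognised symbols become '-'."""
    return "".join(mapping.get(aa, "-") for aa in sequence.upper())


def composition_chi_squared(sequences, alphabet):
    """Pearson Chi-squared statistic on the taxa-by-state count table.

    Larger values mean the sequences differ more in composition. Symbols
    outside `alphabet` (for example gaps) are ignored.
    """
    counts = [Counter(ch for ch in seq if ch in alphabet) for seq in sequences]
    row_totals = [sum(c.values()) for c in counts]
    grand_total = sum(row_totals)
    if grand_total == 0:
        return 0.0
    col_totals = {s: sum(c[s] for c in counts) for s in alphabet}

    chi2 = 0.0
    for c, row_total in zip(counts, row_totals):
        for s in alphabet:
            expected = row_total * col_totals[s] / grand_total
            if expected > 0:
                chi2 += (c[s] - expected) ** 2 / expected
    return chi2


if __name__ == "__main__":
    mapping = build_recoding_map(DAYHOFF6_GROUPS)
    # Toy alignment: the two sequences use different residues that fall in
    # the same assumed groups, so recoding hides the compositional difference.
    alignment = ["AAAADDDD", "GGGGEEEE"]
    recoded = [recode(seq, mapping) for seq in alignment]
    print(recoded)
    print(composition_chi_squared(alignment, "ADEG"))
    print(composition_chi_squared(recoded, "012345"))
```

In this toy alignment the recoded sequences become identical, so the statistic drops to zero. As the abstract cautions, a lower Chi-squared value by itself did not guarantee a more accurate tree in the simulations.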

Updated: 2022-06-17