From pairs of most similar sequences to phylogenetic best matches.


Algorithms for Molecular Biology (IF 1.5), Pub Date: 2020-04-09, DOI: 10.1186/s13015-020-00165-2
Peter F Stadler (1, 2, 3, 4, 5, 6); Manuela Geiß (1, 7); David Schaller (1); Alitzel López Sánchez (8); Marcos González Laffitte (8); Dulce I Valdivia (9); Marc Hellmuth (10); Maribel Hernández Rosales (8)

Affiliations

1 Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, 04107 Leipzig, Germany.
2 Competence Center for Scalable Data Services and Solutions Dresden/Leipzig, Interdisciplinary Center for Bioinformatics, German Centre for Integrative Biodiversity Research (iDiv), and Leipzig Research Center for Civilization Diseases, Universität Leipzig, Augustusplatz 12, 04107 Leipzig, Germany.
3 Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, 04103 Leipzig, Germany.
4 Department of Theoretical Chemistry, University of Vienna, Währinger Straße 17, 1090 Vienna, Austria.
5 Facultad de Ciencias, Universidad Nacional de Colombia, Sede Bogotá, Ciudad Universitaria, 111321 Bogotá, D.C., Colombia.
6 Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501, USA.
7 Software Competence Center Hagenberg GmbH, Softwarepark 21, 4232 Hagenberg, Austria.
8 CONACYT-Instituto de Matemáticas, UNAM Juriquilla, Blvd. Juriquilla 3001, 76230 Juriquilla, Querétaro, QRO, México.
9 Departamento de Ingeniería Genética, Centro de Investigación y de Estudios Avanzados del IPN (CINVESTAV), Km. 9.6 Libramiento Norte Carretera Irapuato-León, 36821 Irapuato, GTO, México.
10 School of Computing, University of Leeds, E C Stoner Building, Leeds, LS2 9JT, UK.

BACKGROUND: Many of the commonly used methods for orthology detection start from mutually most similar pairs of genes (reciprocal best hits) as an approximation for evolutionarily most closely related pairs of genes (reciprocal best matches). This approximation of best matches by best hits becomes exact for ultrametric dissimilarities, i.e., under the Molecular Clock Hypothesis. It fails, however, whenever there are large lineage-specific rate variations among paralogous genes. In practice, this introduces a high level of noise into the input data for best-hit-based orthology detection methods.

RESULTS: If additive distances between genes are known, then evolutionarily most closely related pairs can be identified by considering certain quartets of genes, provided that in each quartet the outgroup relative to the remaining three genes is known. A priori knowledge of the underlying species phylogeny greatly facilitates the identification of the required outgroup. Although the workflow remains a heuristic, since the correct outgroup cannot be determined reliably in all cases, simulations with lineage-specific biases and rate asymmetries show that nearly perfect results can be achieved. In a realistic setting, where distance data have to be estimated from sequence data and hence are noisy, it is still possible to obtain highly accurate sets of best matches.

CONCLUSION: Improvements of tree-free orthology assessment methods can be expected from a combination of the accurate inference of best matches reported here and recent mathematical advances in the understanding of (reciprocal) best match graphs and orthology relations.

AVAILABILITY: Accompanying software is available at https://github.com/david-schaller/AsymmeTree.
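
The difference between best hits and best matches, and the role of a quartet with a known outgroup, can be illustrated with a minimal Python sketch. It is not the AsymmeTree implementation or its API. The function names (`make_distance`, `best_hits`, `closer_relative`, `best_matches`), the gene labels, and the toy distances are all assumptions made for illustration. The sketch assumes exactly additive distances and a correctly chosen outgroup `z`, and it uses the standard four-point condition to compare two candidate genes `y1` and `y2` from the same species against a query gene `x`.

```python
def make_distance(pairs):
    """Build a symmetric dict-of-dicts distance lookup from {(a, b): value}."""
    d = {}
    for (a, b), value in pairs.items():
        d.setdefault(a, {})[b] = value
        d.setdefault(b, {})[a] = value
    return d


def best_hits(x, candidates, d):
    """Candidates with minimal distance to x (the usual similarity-based choice)."""
    smallest = min(d[x][y] for y in candidates)
    return {y for y in candidates if d[x][y] == smallest}


def closer_relative(x, y1, y2, z, d, tol=1e-9):
    """Return the candidate that shares the more recent ancestor with x.

    Assumes additive distances and that z is an outgroup to x, y1, y2.
    By the four-point condition, the smaller of the two sums below
    identifies the quartet topology; equal sums mean that y1 and y2 are
    equally closely related to x, in which case None is returned.
    """
    sum_y1 = d[x][y1] + d[y2][z]  # supports topology x,y1 | y2,z
    sum_y2 = d[x][y2] + d[y1][z]  # supports topology x,y2 | y1,z
    if sum_y1 < sum_y2 - tol:
        return y1
    if sum_y2 < sum_y1 - tol:
        return y2
    return None


def best_matches(x, candidates, z, d):
    """Candidates that no other candidate beats in a quartet comparison."""
    return {
        y for y in candidates
        if all(closer_relative(x, y, other, z, d) != other
               for other in candidates if other != y)
    }


# Hypothetical additive distances: y1 is the true closest relative of x
# but sits on a long (fast-evolving) branch; y2 is a more distant paralog.
toy = make_distance({
    ("x", "y1"): 6, ("x", "y2"): 3, ("y1", "y2"): 7,
    ("x", "z"): 5, ("y1", "z"): 9, ("y2", "z"): 4,
})

hits = best_hits("x", ["y1", "y2"], toy)
matches = best_matches("x", ["y1", "y2"], "z", toy)
print("best hits:", hits)
print("best matches:", matches)
```

In this toy case the rate asymmetry makes the more distant paralog `y2` the most similar sequence, while the quartet comparison recovers `y1`. With noisy distances estimated from sequences, or with an incorrectly chosen outgroup, the comparison can fail, which is why the abstract describes the workflow as a heuristic.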


Updated: 2020-04-09
