A Linear Time Solution to the Labeled Robinson-Foulds Distance Problem
Systematic Biology (IF 6.5), Pub Date: 2022-04-14, DOI: 10.1093/sysbio/syac028
Samuel Briand 1, Christophe Dessimoz 2, 3, 4, 5, 6, Nadia El-Mabrouk 1, Yannis Nevers 2, 3, 6

Motivation: A large variety of pairwise measures of similarity or dissimilarity have been developed for comparing phylogenetic trees, e.g. species trees or gene trees. Due to its intuitive definition in terms of tree clades and bipartitions and its computational efficiency, the Robinson-Foulds (RF) distance is the most widely used measure for trees with unweighted edges and labels restricted to leaves (representing the genetic elements being compared). However, in the case of gene trees, an important piece of information revealing the nature of the homologous relation between gene pairs (orthologs, paralogs, xenologs) is the type of event associated with each internal node of the tree. These events are typically speciations or duplications, but other types, such as horizontal gene transfers, may also be considered. This labeling of internal nodes is usually inferred with a gene tree/species tree reconciliation method. Here, we address the problem of comparing such event-labeled trees. The problem differs from the classical problem of comparing uniformly labeled trees (all labels belonging to the same alphabet), which can be done using the Tree Edit Distance (TED), mainly because in our case two different alphabets are considered for the leaves and the internal nodes of the tree, and leaves are not affected by edit operations.

Results: We propose an extension of the RF distance to event-labeled trees, based on edit operations comparable to those considered for TED: node insertion, node deletion and label substitution. We show that this new Labeled Robinson-Foulds (LRF) distance can be computed in linear time, in addition to maintaining other desirable properties: being a metric, reducing to RF for trees with no labels on internal nodes, and maintaining an intuitive interpretation. The algorithm for computing the LRF distance enables novel analyses on event-labeled trees such as reconciled gene trees. Here, we use it to study the impact of taxon sampling on labeled gene tree inference, and conclude that denser taxon sampling yields trees with better topology but worse labeling.
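For readers unfamiliar with the classical RF distance that the abstract takes as its starting point, the following is a minimal Python sketch of the rooted, clade-based variant: the distance is the number of non-trivial clades found in exactly one of the two trees. It is not the LRF algorithm of the paper and makes no attempt at linear running time. The nested-tuple tree encoding and the names `clades` and `rf_distance` are assumptions made for this illustration only.

```python
def clades(tree):
    """Collect the non-trivial clades of a rooted tree.

    Hypothetical encoding: a leaf is a string, an internal node is a
    tuple of subtrees. Each clade is a frozenset of leaf names.
    """
    found = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])  # single leaves are trivial, not recorded
        leaf_set = frozenset().union(*(leaves_below(child) for child in node))
        found.add(leaf_set)
        return leaf_set

    all_leaves = leaves_below(tree)
    found.discard(all_leaves)  # the root clade is shared by all trees on these leaves
    return found


def rf_distance(t1, t2):
    """Classical rooted RF: size of the symmetric difference of the clade sets."""
    return len(clades(t1) ^ clades(t2))


balanced = (("a", "b"), ("c", "d"))
caterpillar = ((("a", "b"), "c"), "d")
print(rf_distance(balanced, caterpillar))  # clades {c,d} and {a,b,c} differ -> 2
```

Internal-node labels play no role in this sketch, which is consistent with the abstract's statement that LRF reduces to RF when internal nodes carry no labels.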

Updated: 2022-04-14