Motto: Representing Motifs in Consensus Sequences with Minimum Information Loss.
GENETICS ( IF 3.3 ) Pub Date : 2020-08-19 , DOI: 10.1534/genetics.120.303597
Mengchi Wang 1 , David Wang 2 , Kai Zhang 1 , Vu Ngo 1 , Shicai Fan 2, 3 , Wei Wang 2, 4, 5

Sequence analysis frequently requires an intuitive understanding and a convenient representation of motifs. Typically, motifs are represented as position weight matrices (PWMs) and visualized using sequence logos. However, in many scenarios, such as interpreting motif information or searching for motif matches, it is compact and sufficient to represent motifs by wildcard-style consensus sequences (such as [GC][AT]GATAAG[GAC]). Based on mutual information theory and Jensen-Shannon divergence, we propose a mathematical framework to minimize the information loss in converting PWMs to consensus sequences. We call this representation sequence Motto and have implemented an efficient algorithm, with flexible options, for converting motif PWMs of nucleotides, amino acids, or customized characters into Mottos. We show that this representation provides a simple and efficient way to identify the binding sites of 1156 common transcription factors (TFs) in the human genome. The effectiveness of the method was benchmarked by comparing sequence matches found by Motto with PWM scanning results found by FIMO. On average, our method achieves an area under the precision-recall curve of 0.81, significantly (p-value < 0.01) outperforming all existing methods, including maximal positional weight, Cavener's method, and minimal mean square error. We believe this representation provides a distilled summary of a motif, together with its statistical justification.
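
To make the idea in the abstract concrete, here is a minimal Python sketch of converting a PWM into a wildcard-style consensus sequence and then using it to search a sequence. It is not the authors' implementation. It assumes one plausible reading of the approach: at each position, rank the characters by probability and keep the top k whose uniform distribution has the smallest Jensen-Shannon divergence from the PWM column. The function names (`jsd`, `column_to_class`, `pwm_to_consensus`, `find_matches`) and the toy PWM are hypothetical, and the published method may differ in its details and options.

```python
import math
import re


def jsd(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def column_to_class(column, alphabet="ACGT"):
    """Pick the character set for one PWM column.

    Assumption for illustration: candidates are the top-k characters by
    probability, each compared against a uniform distribution over those k.
    """
    order = sorted(range(len(column)), key=lambda i: -column[i])
    best_k, best_d = 1, float("inf")
    for k in range(1, len(column) + 1):
        q = [0.0] * len(column)
        for i in order[:k]:
            q[i] = 1.0 / k
        d = jsd(column, q)
        if d < best_d:  # strict '<' keeps the smaller set on ties
            best_k, best_d = k, d
    chars = "".join(alphabet[i] for i in order[:best_k])
    return chars if best_k == 1 else "[" + chars + "]"


def pwm_to_consensus(pwm, alphabet="ACGT"):
    """Convert a PWM (list of probability columns) to a consensus string."""
    return "".join(column_to_class(col, alphabet) for col in pwm)


def find_matches(consensus, sequence):
    """Return (start, matched_text) for non-overlapping matches.

    The bracket notation is already valid regular-expression syntax.
    """
    return [(m.start(), m.group()) for m in re.finditer(consensus, sequence)]


# Toy PWM, columns ordered A, C, G, T (made-up values, not from the paper).
toy_pwm = [
    [0.05, 0.45, 0.45, 0.05],
    [1.00, 0.00, 0.00, 0.00],
    [0.25, 0.25, 0.25, 0.25],
]
toy_consensus = pwm_to_consensus(toy_pwm)
```

Because a consensus such as [GC][AT]GATAAG[GAC] is an ordinary regular expression, `find_matches("[GC][AT]GATAAG[GAC]", some_sequence)` is enough to scan a sequence. Note that `re.finditer` reports only non-overlapping matches, and that this sketch scans a single strand.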

Updated: 2020-08-24