FGGA-lnc: automatic gene ontology annotation of lncRNA sequences based on secondary structures
Interface Focus (IF 3.6), Pub Date: 2021-06-11, DOI: 10.1098/rsfs.2020.0064
Flavio E Spetale 1,2, Javier Murillo 1,2, Gabriela V Villanova 3, Pilar Bulacio 1,2, Elizabeth Tapia 1,2

The study of long non-coding RNAs (lncRNAs), transcripts longer than 200 nucleotides, is central to understanding the development and progression of many complex diseases. Unlike proteins, the functionality of lncRNAs is only subtly encoded in their primary sequence. Current in-silico lncRNA annotation methods mostly rely on annotations inferred from interaction networks, but building these networks requires extensive experimental studies. In this work, we present FGGA-lnc, a graph-based machine learning method for the automatic gene ontology (GO) annotation of lncRNAs across the three GO subdomains. We build upon FGGA (factor graph GO annotation), a computational method originally developed to annotate protein sequences from non-model organisms. In FGGA-lnc, a coding-based approach fuses the primary sequence and secondary structure information of lncRNA molecules. As a result, lncRNA sequences become sequences over a higher-order alphabet, which allows supervised learning methods to assess individual GO-term annotations. Raw GO annotations obtained in this way are unaware of the GO structure and are therefore likely to be inconsistent with it. The message-passing algorithm embodied by factor graph models overcomes this problem. Evaluations of FGGA-lnc on lncRNA data from model and non-model organisms showed promising results, suggesting it as a candidate to meet the huge demand for functional annotations arising from high-throughput sequencing technologies.
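
To make the idea of a "higher-order alphabet" concrete, the following is a minimal sketch of how a nucleotide sequence and its secondary structure could be fused into a single symbol string. The abstract does not specify the encoding FGGA-lnc actually uses, so everything here is an assumption made for illustration: the function names `fuse` and `kmer_counts` are hypothetical, the structure is assumed to be given in dot-bracket notation, and the mapping (upper case for paired bases, lower case for unpaired bases) is only one of many possible choices.

```python
from collections import Counter

PAIRED = "()"    # dot-bracket symbols for a base-paired position
UNPAIRED = "."   # dot-bracket symbol for an unpaired position


def fuse(sequence: str, structure: str) -> str:
    """Merge an RNA sequence and its dot-bracket structure into one string.

    Illustrative mapping only (not the published FGGA-lnc encoding):
    paired bases stay upper case, unpaired bases become lower case,
    giving an 8-symbol alphabet: A C G U a c g u.
    """
    sequence = sequence.upper().replace("T", "U")
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    fused = []
    for base, state in zip(sequence, structure):
        if base not in "ACGU":
            raise ValueError(f"unexpected nucleotide: {base!r}")
        if state in PAIRED:
            fused.append(base)
        elif state == UNPAIRED:
            fused.append(base.lower())
        else:
            raise ValueError(f"unexpected structure symbol: {state!r}")
    return "".join(fused)


def kmer_counts(fused: str, k: int = 2) -> Counter:
    """Count overlapping k-mers over the fused alphabet."""
    return Counter(fused[i:i + k] for i in range(len(fused) - k + 1))


if __name__ == "__main__":
    fused = fuse("GGGAAACCC", "(((...)))")
    print(fused)                    # GGGaaaCCC
    print(kmer_counts(fused, k=2))  # k-mer features for a classifier
```

Counts like these could serve as input features for per-GO-term supervised classifiers. The raw per-term scores would still need a separate consistency step over the GO hierarchy, which is the role the abstract assigns to message passing on the factor graph.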




Updated: 2021-06-11