Identifying LncRNA-Encoded Short Peptides Using Optimized Hybrid Features and Ensemble Learning
IEEE/ACM Transactions on Computational Biology and Bioinformatics (IF 4.5), Pub Date: 2021-08-12, DOI: 10.1109/tcbb.2021.3104288
Siyuan Zhao 1, Jun Meng 1, Qiang Kang 1, Yushi Luan 2

Long non-coding RNA (lncRNA) contains short open reading frames (sORFs), and sORF-encoded short peptides (SEPs) have become a focus of scientific study because of their crucial role in life activities. Identifying SEPs is vital to further understanding their regulatory function. Bioinformatics methods can quickly identify SEPs and so provide credible candidate sequences for verification by biological experiments. However, methods for identifying SEPs directly are lacking. In this study, a machine learning method to identify SEPs of plant lncRNA (ISPL) is proposed. Hybrid features, including sequence features and physicochemical features, are extracted manually or adaptively to construct different modal features. To keep feature selection stable, a feature selection method called non-linear correction applied in Max-Relevance-Max-Distance (nocRD) is proposed, which integrates multiple feature ranking results and uses iterative random forest to reduce the dimensionality of the different modal features. Classification models are constructed with the different modal features, and their outputs are combined for ensemble classification. The experimental results show that ISPL achieves an accuracy of 89.86% on the independent test set, which has important implications for further studies in functional genomics.
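
The pipeline outlined in the abstract (per-modality features, aggregation of several feature rankings, and a combination of classifier outputs) can be illustrated with a minimal Python sketch. This is not the authors' ISPL or nocRD implementation: the function names, the hydrophobic residue set, mean-rank aggregation, and probability averaging are all assumptions chosen only to show the general shape of such a workflow.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Illustrative hydrophobic residue set (an assumption, not taken from the paper).
HYDROPHOBIC = set("AILMFVW")


def sequence_features(peptide):
    """Sequence view: fraction of each of the 20 standard amino acids."""
    if not peptide:
        raise ValueError("peptide must be non-empty")
    counts = Counter(peptide.upper())
    return [counts[aa] / len(peptide) for aa in AMINO_ACIDS]


def physicochemical_features(peptide):
    """Toy physicochemical view: peptide length and hydrophobic fraction."""
    if not peptide:
        raise ValueError("peptide must be non-empty")
    hydrophobic = sum(1 for aa in peptide.upper() if aa in HYDROPHOBIC)
    return [len(peptide), hydrophobic / len(peptide)]


def mean_rank(rankings):
    """Merge several feature rankings by average position (lower is better).

    A simple stand-in for integrating multiple ranking results; every
    ranking is assumed to contain the same feature names.
    """
    features = rankings[0]
    avg = {f: sum(r.index(f) for r in rankings) / len(rankings) for f in features}
    return sorted(features, key=lambda f: (avg[f], f))


def soft_vote(probabilities, threshold=0.5):
    """Average the positive-class probabilities of the modal classifiers."""
    mean_p = sum(probabilities) / len(probabilities)
    return mean_p, mean_p >= threshold
```

In this sketch each modality would feed its own classifier, and `soft_vote` combines one probability per classifier into a single call. The paper may combine outputs differently.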

Updated: 2021-08-12