PCirc: random forest-based plant circRNA identification software
BMC Bioinformatics ( IF 2.9 ) Pub Date : 2021-01-06 , DOI: 10.1186/s12859-020-03944-1
Shuwei Yin , Xiao Tian , Jingjing Zhang , Peisen Sun , Guanglin Li

Circular RNA (circRNA) is a novel type of RNA with a closed-loop structure. Increasing numbers of circRNAs are being identified in plants and animals, and recent studies have shown that circRNAs play an important role in gene regulation. Identifying circRNAs from the growing volume of RNA-seq data is therefore very important. However, traditional circRNA recognition methods have limitations. In recent years, emerging machine learning techniques have provided a good approach for identifying circRNAs in animals. The features used for animal circRNAs, however, cannot be applied directly to plants, because the characteristics of plant circRNA sequences differ from those of animal circRNAs. For example, plants are extremely rich in splicing signals and transposable elements, and sequence conservation, in rice for example, is far lower than in mammals. To solve these problems and better identify circRNAs in plants, circRNA recognition software that uses machine learning based on the characteristics of plant circRNAs is urgently needed.

In this study, we built a software program named PCirc that uses a machine learning method to predict plant circRNAs from RNA-seq data. First, we extracted different features, including open reading frames, numbers of k-mers, and splicing junction sequence coding, from rice circRNA and lncRNA data. Second, we trained a machine learning model with the random forest algorithm, using tenfold cross-validation on the training set. Third, we evaluated the classifier by accuracy, precision, and F1 score, and all scores on the model test data were above 0.99. Fourth, we tested the model on data from other plants and obtained good results, with accuracy scores above 0.8. Finally, we packaged the trained machine learning model and the scripts used into a locally run circRNA prediction software, PCirc ( https://github.com/Lilab-SNNU/Pcirc ).
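To make the feature-extraction step more concrete, the following is a minimal Python sketch of two of the feature types named above: k-mer counts and an open reading frame (ORF) length. It is not taken from the PCirc code base. The function names (`kmer_counts`, `longest_orf_length`, `feature_vector`), the choice of k = 2, and the treatment of the sequence as linear and forward-strand only are all assumptions made for illustration.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def kmer_counts(seq, k):
    """Count overlapping k-mers over the alphabet ACGT (hypothetical helper)."""
    seq = seq.upper().replace("U", "T")
    # Pre-fill every possible k-mer so the feature layout is fixed.
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows containing N or other symbols are skipped
            counts[kmer] += 1
    return counts


def longest_orf_length(seq):
    """Length in nucleotides of the longest ATG..stop ORF (stop included).

    Assumption: only the three forward-strand frames of a linear sequence
    are scanned; ORFs spanning a back-splice junction are not considered.
    """
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def feature_vector(seq, k=2):
    """Concatenate the ORF length and the k-mer counts in sorted k-mer order."""
    counts = kmer_counts(seq, k)
    return [longest_orf_length(seq)] + [counts[m] for m in sorted(counts)]


if __name__ == "__main__":
    example = "CCATGAAATAGGG"  # toy sequence, not real circRNA data
    print(longest_orf_length(example))
    print(feature_vector(example, k=2))
```

Vectors of this kind, one per sequence, could then be passed to any random forest implementation; the actual feature set and encoding used by PCirc are described in the paper and repository.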
Based on rice circRNA and lncRNA data, this study constructed a machine learning model for plant circRNA recognition using the random forest algorithm; the model can also be applied to circRNA recognition in other plants, such as Arabidopsis thaliana and maize. The trained model and the scripts used in this study are packaged into a locally run circRNA prediction software, PCirc, which is convenient for plant circRNA researchers to use.
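The accuracy, precision, and F1 scores quoted in the abstract are standard binary-classification metrics. As a rough illustration of how they are computed from predicted and true labels, here is a small self-contained Python sketch. The function name `classification_scores`, the label convention (1 for circRNA, 0 for lncRNA), and the toy labels are assumptions, not data or code from the study.

```python
def classification_scores(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for binary labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when nothing is predicted or labeled positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels for illustration only: 1 = circRNA, 0 = lncRNA (assumed convention).
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
    print(classification_scores(y_true, y_pred))
```

F1 is the harmonic mean of precision and recall, so it is high only when both false positives and false negatives are rare.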

Updated: 2021-01-07