OCCAM: prediction of small ORFs in bacterial genomes by means of a target-decoy database approach and machine learning techniques
Database: The Journal of Biological Databases and Curation (IF 3.4), Pub Date: 2020-11-18, DOI: 10.1093/database/baaa067
Fabio R Cerqueira 1, 2 , Ana Tereza Ribeiro Vasconcelos 3

Small open reading frames (ORFs) have been systematically disregarded by automatic genome annotation. The main reason computational procedures overlook small ORFs is the difficulty of finding patterns in such short sequences. However, advances in experimental methods show that small proteins can play vital roles in cellular activities. Hence, there is an urgent need for computational approaches that speed up the identification of potential small ORFs. In this work, our focus is on bacterial genomes. We improve a previous approach to identifying small ORFs in bacteria. Our method uses machine learning techniques and decoy subject sequences to filter out spurious ORF alignments. We show that an advanced multivariate analysis can be more sensitive than the simplistic and widely used e-value cutoff. This is particularly important for small ORFs, whose alignments present higher e-values than usual. Experiments with control datasets show that the machine learning algorithms used in our method to curate significant alignments can achieve average sensitivity and specificity of 97.06% and 99.61%, respectively. This work is therefore an important step toward more accurate computational tools for the identification of small ORFs in bacteria.
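
To make the terms in the abstract concrete, the following is a minimal sketch of how decoy hits can be used to estimate the error rate of a plain e-value cutoff, and of how sensitivity and specificity are computed from a confusion matrix. It is not the OCCAM implementation, which the abstract describes as a multivariate machine learning analysis. The `Hit` record, the function names, and the decoys-over-targets estimate are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One ORF alignment (hypothetical record, not OCCAM's data model)."""
    orf_id: str
    evalue: float
    is_decoy: bool  # True if the subject sequence came from the decoy database


def decoy_fdr(hits, cutoff):
    """Estimate the false discovery rate among target hits with
    evalue <= cutoff as (#decoy hits) / (#target hits).
    This simple ratio is an assumed estimator, chosen for illustration."""
    targets = sum(1 for h in hits if h.evalue <= cutoff and not h.is_decoy)
    decoys = sum(1 for h in hits if h.evalue <= cutoff and h.is_decoy)
    return decoys / targets if targets else 0.0


def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


# Toy data, invented for this example only.
hits = [
    Hit("orf_a", 1e-10, False),
    Hit("orf_b", 1e-5, False),
    Hit("orf_c", 1e-3, True),
    Hit("orf_d", 1e-2, False),
    Hit("orf_e", 1.0, True),
]

strict = decoy_fdr(hits, 1e-4)   # keeps 2 target hits, 0 decoy hits
relaxed = decoy_fdr(hits, 1e-2)  # keeps 3 target hits, 1 decoy hit
sens, spec = sensitivity_specificity(tp=9, fn=1, tn=95, fp=5)
```

In this toy data the strict cutoff admits no decoys but drops `orf_d`, while the relaxed cutoff recovers it at the cost of a decoy hit. This is the trade-off the abstract points to when it notes that small-ORF alignments tend to have higher e-values than usual.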

Updated: 2020-11-19