Comparison of the Strengths and Weaknesses of Machine Learning Algorithms and Feature Selection on KEGG Database Microbial Gene Pathway Annotation and Its Effects on Reconstructed Network Topology.
Journal of Computational Biology (IF 1.7), Pub Date: 2023-07-01, DOI: 10.1089/cmb.2022.0370
Michael Robben, Mohammad Sadegh Nasr, Avishek Das, Jai Prakash Veerla, Manfred Huber, Justyn Jaworski, Jon Weidanz, Jacob Luber

Tools for annotating genes from newly sequenced species have not evolved much beyond homologous alignment to previously annotated species. While the quality of gene annotations continues to decline as we sequence and assemble more evolutionarily distant gut microbiome species, machine learning presents a high-quality alternative to traditional techniques. In this study, we investigate the relative performance of common classical and nonclassical machine learning algorithms on the problem of gene annotation, using genes from human microbiome-associated species in the KEGG database. The majority of the ensemble, clustering, and deep learning algorithms that we investigated showed higher prediction accuracy than CD-Hit in predicting partial KEGG function. Motif-based machine learning methods of annotation in new species were faster and had higher precision-recall than methods based on homologous alignment or orthologous gene clustering. Gradient-boosted ensemble methods and neural networks also predicted higher connectivity in reconstructed KEGG pathways, finding twice as many new pathway interactions as BLAST alignment. The use of motif-based machine learning algorithms in annotation software will allow researchers to develop powerful tools to interact with bacterial microbiomes in ways previously unachievable through homologous sequence alignment alone.

Updated: 2023-07-01