Essential gene prediction using limited gene essentiality information–An integrative semi-supervised machine learning strategy
PLOS ONE (IF 3.7), Pub Date: 2020-11-30, DOI: 10.1371/journal.pone.0242943
Sutanu Nandi, Piyali Ganguli, Ram Rup Sarkar

Essential gene prediction helps identify the minimal set of genes indispensable for an organism's survival. Machine learning (ML) algorithms have been useful for predicting gene essentiality. However, currently available ML pipelines perform poorly for organisms with limited experimental data. The objective is to develop a new ML pipeline that helps annotate the essential genes of less explored disease-causing organisms for which minimal experimental data are available. The proposed strategy combines an unsupervised feature selection technique, dimension reduction using the Kamada-Kawai algorithm, and a semi-supervised ML algorithm employing the Laplacian Support Vector Machine (LapSVM) to predict essential and non-essential genes from genome-scale metabolic networks using a very limited labeled dataset. A novel scoring technique, the Semi-Supervised Model Selection Score, equivalent to the area under the ROC curve (auROC), is proposed for selecting the best model when supervised performance metrics are difficult to calculate because of a lack of data. Unsupervised feature selection followed by dimension reduction revealed a distinct circular pattern in the clustering of essential and non-essential genes. LapSVM then created a curve that dissected this circle, classifying and predicting essential genes with high accuracy (auROC > 0.85) even with 1% labeled data for model training. After successful validation on both Eukaryotes and Prokaryotes, where the pipeline shows high accuracy even when the labeled dataset is very limited, the strategy is used to predict essential genes of organisms with inadequate experimentally known data, such as Leishmania sp. Using a graph-based semi-supervised machine learning scheme, a novel integrative approach is proposed for essential gene prediction that applies to both Prokaryotes and Eukaryotes with limited labeled data.
The essential genes predicted using the pipeline provide an important lead for predicting gene essentiality and identifying novel therapeutic targets for antibiotic and vaccine development against disease-causing parasites.
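
The general idea of graph-based semi-supervised classification with very few labels, scored by auROC, can be illustrated with a minimal sketch. This is not the authors' pipeline: it swaps in plain iterative label propagation on a k-nearest-neighbour graph in place of LapSVM, skips feature selection and the Kamada-Kawai step, and uses a deliberately separable toy dataset (an inner and an outer ring, loosely echoing the circular pattern described in the abstract). All function names, the ring sizes, and the choice of k are assumptions made for illustration, and the toy score says nothing about the paper's reported results.

```python
import math

def ring(n, radius):
    """Return n evenly spaced 2-D points on a circle (toy data, not real genes)."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]

def knn_graph(points, k):
    """Build symmetric k-nearest-neighbour adjacency sets."""
    nbrs = [set() for _ in points]
    for i, p in enumerate(points):
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))
        for j in order[:k]:
            nbrs[i].add(j)
            nbrs[j].add(i)
    return nbrs

def propagate(nbrs, seeds, n_iter=200):
    """Label propagation: labeled seeds stay clamped, every other node
    repeatedly takes the mean score of its graph neighbours."""
    scores = [seeds.get(i, 0.5) for i in range(len(nbrs))]
    for _ in range(n_iter):
        scores = [seeds[i] if i in seeds
                  else sum(scores[j] for j in ns) / len(ns)
                  for i, ns in enumerate(nbrs)]
    return scores

def auroc(scores, labels):
    """auROC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy setup (assumed): 20 "essential" points on an inner ring,
# 40 "non-essential" points on an outer ring.
points = ring(20, 1.0) + ring(40, 3.0)
labels = [1] * 20 + [0] * 40

# Only two points carry labels: one per class.
seeds = {0: 1.0, 20: 0.0}

scores = propagate(knn_graph(points, k=4), seeds)

# Evaluate on the unlabeled points only.
held_out = [i for i in range(len(points)) if i not in seeds]
toy_auroc = auroc([scores[i] for i in held_out], [labels[i] for i in held_out])
print(f"toy auROC on unlabeled points: {toy_auroc:.3f}")
```

The design point is that the graph, built from all points whether labeled or not, carries most of the information: the handful of labels only decides which side of the structure is called essential.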




Updated: 2020-12-01