Optimisation of cancer classification by machine learning generates an enriched list of candidate drug targets and biomarkers.
Molecular Omics (IF 2.9), Pub Date: 2020-02-25, DOI: 10.1039/c9mo00198k
Sterling Ramroach, Ajay Joshi, Melford John

The Cancer Genome Atlas has provided expression values of 18 015 genes for different cancer types. Studies that classify cancers with machine learning algorithms have used different data and methods, which makes their performance difficult to compare. It is unclear which algorithm performs best and whether maximum levels of accuracy have been reached. In this study, we aimed to optimise the diagnosis of cancer by comparing the performance of five algorithms on the same data and by identifying the smallest possible number of differentiator genes. The accuracy of the five algorithms in classifying cancer type and primary site was determined using a gene expression dataset of 5629 samples and a dataset of 9144 samples, respectively. When trained on sets ranging from 16 718 down to 40 genes, Random Forest (RF), Gradient Boosting Machine (GBM), and Neural Network (NN) consistently achieved 100% or near-100% accuracy in classifying both cancer type and primary site. Reducing the training sets to the 40 highest-ranked genes made processing 78-fold and 45-fold faster for RF and GBM, respectively. Members of the olfactory receptor family, the keratin-associated proteins, and the defensin beta family were among the highest-ranked genes. The ensemble (RF and GBM) and NN algorithms were the most accurate at distinguishing between cancer types and primary sites, whereas k-nearest neighbours (KNN) was the fastest. Training sets can be reduced to the 40 highest-ranked differentiator genes, which include potential drug targets and biomarkers, without any significant loss of accuracy.
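
The gene-reduction idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the genes were ranked, so the between-class to within-class variance ratio used here is an assumption, as are the synthetic data, the function names, and every parameter value. KNN is used only because it can be written with the Python standard library; this is not the authors' pipeline.

```python
import random
from collections import Counter


def make_synthetic(n_samples, n_genes, n_informative, n_classes, seed=0):
    """Toy expression matrix; only the first n_informative genes depend on the class."""
    rng = random.Random(seed)
    centers = [[rng.gauss(0, 3) for _ in range(n_informative)]
               for _ in range(n_classes)]
    X, y = [], []
    for i in range(n_samples):
        label = i % n_classes
        row = [rng.gauss(centers[label][g], 1.0) for g in range(n_informative)]
        row += [rng.gauss(0, 1.0) for _ in range(n_genes - n_informative)]
        X.append(row)
        y.append(label)
    return X, y


def rank_genes(X, y):
    """Gene indices sorted by between/within-class variance ratio, highest first."""
    classes = sorted(set(y))
    scores = []
    for g in range(len(X[0])):
        col = [row[g] for row in X]
        grand_mean = sum(col) / len(col)
        between = within = 0.0
        for c in classes:
            vals = [v for v, lab in zip(col, y) if lab == c]
            class_mean = sum(vals) / len(vals)
            between += len(vals) * (class_mean - grand_mean) ** 2
            within += sum((v - class_mean) ** 2 for v in vals)
        scores.append(between / (within + 1e-12))
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)


def knn_predict(X_train, y_train, x, k=5):
    """Majority vote among the k nearest training samples (squared Euclidean distance)."""
    dists = sorted((sum((a - b) ** 2 for a, b in zip(row, x)), lab)
                   for row, lab in zip(X_train, y_train))
    return Counter(lab for _, lab in dists[:k]).most_common(1)[0][0]


def accuracy(X_train, y_train, X_test, y_test, genes, k=5):
    """Test accuracy of KNN restricted to the given gene indices."""
    def subset(rows):
        return [[r[g] for g in genes] for r in rows]
    Xtr, Xte = subset(X_train), subset(X_test)
    hits = sum(knn_predict(Xtr, y_train, x, k) == t for x, t in zip(Xte, y_test))
    return hits / len(y_test)


N_GENES, TOP_K = 300, 40  # illustrative sizes, far smaller than the real data
X, y = make_synthetic(n_samples=200, n_genes=N_GENES, n_informative=40, n_classes=5)

# Shuffle and split 70/30 into training and test samples.
order = list(range(len(X)))
random.Random(1).shuffle(order)
cut = int(0.7 * len(order))
X_train, y_train = [X[i] for i in order[:cut]], [y[i] for i in order[:cut]]
X_test, y_test = [X[i] for i in order[cut:]], [y[i] for i in order[cut:]]

# Rank genes on the training split only, then keep the TOP_K highest-ranked.
top_genes = rank_genes(X_train, y_train)[:TOP_K]

acc_all = accuracy(X_train, y_train, X_test, y_test, list(range(N_GENES)))
acc_top = accuracy(X_train, y_train, X_test, y_test, top_genes)
print(f"accuracy with all {N_GENES} genes: {acc_all:.3f}")
print(f"accuracy with top {TOP_K} genes: {acc_top:.3f}")
```

The ranking is computed on the training split alone so that the held-out samples play no part in choosing the genes. On real expression data the same comparison would be run for each of the five algorithms.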

Updated: 2020-02-13