Breast cancer prediction using an optimal machine learning technique for next generation sequences
Concurrent Engineering Pub Date : 2021-04-20 , DOI: 10.1177/1063293x21991808
Babymol Kurian 1 , VL Jyothi 2

The application of artificial intelligence to cancer prediction and detection using Next Generation Sequencing (NGS) is attracting wide interest in the medical field. Next generation sequences were extracted from the NCBI (National Center for Biotechnology Information) gene repository. Sequences of normal Homo sapiens (Class 1), BRCA1 (Class 2) and BRCA2 (Class 3) were extracted for Machine Learning (ML). The datasets extracted for the process totalled 1580, in four categories of 50, 100, 150 and 200 sequences. Breast cancer prediction was carried out in three major steps: feature extraction, machine learning classification and performance evaluation. Features were extracted with the sequences as input. Ten features of the DNA sequences, namely ORF (Open Reading Frame) count, individual nucleobase average counts of A, T, C and G, AT and GC content, AT/GC composition, G-quadruplex occurrence and MR (Mutation Rate), were extracted from the three types of sequences for classification. The sequence type was included in the feature set as the target variable, with values 0, 1 and 2 for Classes 1, 2 and 3 respectively. Nine supervised machine learning techniques, LR (Logistic Regression statistical model), LDA (Linear Discriminant Analysis model), k-NN (k-nearest neighbours algorithm), DT (Decision Tree technique), NB (Naive Bayes classifier), SVM (Support-Vector Machine algorithm), RF (Random Forest learning algorithm), AB (AdaBoost) and GB (Gradient Boosting), were applied to the four categories of datasets. Of all the supervised models, the decision tree performed best, with a maximum classification accuracy of 94.03%. Classification model performance was evaluated using precision, recall, F1-score and support values; of these, the F1-score was closest to the classification accuracy.
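
As a rough illustration of the feature-extraction step, the sketch below computes several of the sequence features named in the abstract for a single DNA string. It is not the authors' code: the function names, the forward-strand-only ORF rule (ATG to the first in-frame stop codon), the G-quadruplex regular expression (four runs of three or more G separated by loops of 1 to 7 bases) and the use of per-base fractions are all assumptions made for this example. Mutation rate is left out because the abstract does not say how it was computed.

```python
import re
from collections import Counter

# Assumed definitions; the abstract does not specify these details.
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}
# Commonly used G-quadruplex motif: G3+ N1-7 G3+ N1-7 G3+ N1-7 G3+
G4_PATTERN = re.compile(r"(?:G{3,}[ACGT]{1,7}){3}G{3,}")


def count_orfs(seq):
    """Count ATG...stop ORFs on the forward strand across the three frames."""
    count = 0
    for frame in range(3):
        in_orf = False
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if not in_orf and codon == START_CODON:
                in_orf = True
            elif in_orf and codon in STOP_CODONS:
                count += 1
                in_orf = False
    return count


def extract_features(seq):
    """Return a dict of simple composition features for one DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    a, t, c, g = (counts[base] for base in "ATCG")
    at, gc = a + t, g + c
    return {
        "orf_count": count_orfs(seq),
        "A": a / n,
        "T": t / n,
        "C": c / n,
        "G": g / n,
        "AT_content": at / n,
        "GC_content": gc / n,
        # Ratio is undefined when the sequence has no G or C.
        "AT_GC_ratio": at / gc if gc else float("inf"),
        # Non-overlapping matches of the assumed motif.
        "g4_count": len(G4_PATTERN.findall(seq)),
    }


if __name__ == "__main__":
    features = extract_features("ATGAAATAGGGGAGGGAGGGAGGG")
    for name, value in features.items():
        print(name, value)
```

Each sequence would yield one such feature row, with the class label (0, 1 or 2) appended as the target variable before the rows are passed to a classifier.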




Updated: 2021-04-20