Multi-function Prediction of Unknown Protein Sequences Using Multilabel Classifiers and Augmented Sequence Features
Iranian Journal of Science and Technology, Transactions A: Science (IF 1.7), Pub Date: 2021-05-04, DOI: 10.1007/s40995-021-01134-z
Saurabh Agrawal, Dilip Singh Sisodia, Naresh Kumar Nagwani

Many proteins reside at multiple subcellular locations simultaneously and exhibit multiple functions. Multi-function characterization of Unknown Protein Sequences (UPS) is useful for analyzing multi-symptom diseases and multi-target drugs. This work proposes a multisite subcellular localization model for the multi-function characterization of UPS using augmented features and algorithm-adaptation multilabel classifiers. Protein sequence feature vectors are augmented with physicochemical and evolutionary properties of amino acid residues while preserving sequence-order information and residue properties. Less discriminative and redundant features are discarded from the feature vector using Multilabel Linear Discriminant Analysis (MLDA). Two multisite datasets, Gram-Positive (ML_G+) and Gram-Negative (ML_G−), are used in the experiments; in each, a protein sequence that appears several times with different single location labels is transformed into one unique multilabel protein sequence. The preprocessed feature vectors of ML_G+ and ML_G− are used separately to train multilabel classifiers such as Decision Tree (ML_C4.5), K-Nearest Neighbor (ML_kNN), Multi-Layer Perceptron (MLP), Extra Tree (ExTr) and Random Forest (RF), with fivefold cross-validation. The validated multisite model is then used to predict single as well as multiple functions of the UPS. On known protein sequences, the model achieved an accuracy of 94.23% for ML_G+ and 82.77% for ML_G− using MLP; on UPS, accuracy is 77.50% for ML_G+ using MLP and ExTr, and 54.28% for ML_G− using ML_kNN.
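
The label transformation described in the abstract, where a sequence listed several times with different single location labels becomes one multilabel entry, might be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the toy sequences and the location names are all hypothetical, and the paper's actual preprocessing may differ.

```python
from collections import defaultdict


def merge_single_label_records(records):
    """Collapse (sequence, location) pairs into one label list per unique sequence."""
    merged = defaultdict(set)
    for sequence, location in records:
        merged[sequence].add(location)
    # Sort labels so the output is deterministic.
    return {seq: sorted(locs) for seq, locs in merged.items()}


def to_indicator_rows(multilabel, label_order):
    """Encode each label list as a 0/1 vector that follows label_order."""
    return {
        seq: [1 if label in labels else 0 for label in label_order]
        for seq, labels in multilabel.items()
    }


# Hypothetical toy records: the first sequence occurs at two locations.
records = [
    ("MKTAYIAK", "cytoplasm"),
    ("MKTAYIAK", "cell membrane"),
    ("MSLNQRLV", "extracellular"),
]

multilabel = merge_single_label_records(records)
label_order = sorted({location for _, location in records})
indicators = to_indicator_rows(multilabel, label_order)
```

The 0/1 indicator rows are the target format that multilabel classifiers typically expect, with one column per subcellular location.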




Updated: 2021-05-04