Abstract
Background
Predicting protein solubility is an important prerequisite for pharmaceutical research and production. This work aims to design a new model for predicting protein solubility that combines protein sequence feature fusion with deep dual-channel convolutional neural networks (DDcCNN), improving on the performance of existing prediction models.
Methods
Redundancy in the raw protein sequences is reduced with CD-HIT. Four subsequences are built from each protein sequence: one global and three local. The global subsequence is the entire protein sequence, and the local subsequences are obtained by moving a sliding window according to a set of rules. G-gap features are extracted from these four subsequences and assembled into a mixed matrix, which is the input of the first channel, composed of three convolutional layers. Additional features are extracted with the SCRATCH tool as the input of the second channel, which consists of a single convolutional layer, in order to find hidden relationships and improve the accuracy of the predictor. The outputs of the two parallel channels are concatenated as the input of the hidden layer, and the solubility prediction is obtained in the output layer. The best protein solubility prediction model is selected through comparative experiments on different frameworks.
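The feature construction for the first channel can be illustrated with a minimal sketch. It uses the standard definition of g-gap dipeptide frequency (the frequency of residue pairs separated by g positions). The windowing rule is not given in this abstract, so the sketch assumes three contiguous, nearly equal, non-overlapping windows; the function names, the default g = 1, and the 4 x 400 layout are likewise illustrative assumptions and may differ from the authors' implementation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order, define the feature layout.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
PAIR_INDEX = {pair: i for i, pair in enumerate(PAIRS)}


def g_gap_frequencies(seq, g):
    """Return the 400-dim frequency vector of residue pairs separated by g residues."""
    counts = [0] * len(PAIRS)
    total = 0
    for i in range(len(seq) - g - 1):
        idx = PAIR_INDEX.get(seq[i] + seq[i + g + 1])
        if idx is not None:  # skip pairs containing non-standard residues (e.g. X)
            counts[idx] += 1
            total += 1
    if total == 0:  # sequence too short for this gap
        return [0.0] * len(PAIRS)
    return [c / total for c in counts]


def split_subsequences(seq, n_local=3):
    """ASSUMED rule: the global sequence plus n_local contiguous, nearly equal windows."""
    size = -(-len(seq) // n_local)  # ceiling division gives the window length
    local = [seq[k * size:(k + 1) * size] for k in range(n_local)]
    return [seq] + local


def mixed_matrix(seq, g=1):
    """Stack the g-gap vectors of the four subsequences into a 4 x 400 matrix."""
    return [g_gap_frequencies(sub, g) for sub in split_subsequences(seq)]
```

Calling `mixed_matrix(sequence, g=1)` on a protein sequence string returns four rows, one for the global subsequence and one per assumed local window. A different windowing rule would only require replacing `split_subsequences`.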
Results
The DDcCNN model designed in this work achieves an accuracy of 77.82%, a Matthews correlation coefficient of 0.57, a sensitivity of 76.13% and a specificity of 79.32%. Comparative experiments show that the overall performance of the DDcCNN model is better than that of existing models (GCNN, LCNN and PCNN). The related models and data are publicly deposited at http://www.ddccnn.wang.
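For reference, the four indicators reported here follow from the binary confusion matrix. The sketch below uses the standard textbook definitions; the function name, the label encoding (1 = soluble, 0 = insoluble), and the toy labels in the usage note are assumptions for illustration and are not the authors' evaluation code or data.

```python
import math


def solubility_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity and MCC for binary labels (1 = soluble)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate

    # Matthews correlation coefficient; defined as 0 when any marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }
```

With the toy labels `y_true = [1, 1, 1, 0, 0, 0, 1, 0]` and `y_pred = [1, 1, 0, 0, 0, 1, 1, 0]`, the confusion matrix has three true positives, three true negatives, one false positive and one false negative, so accuracy, sensitivity and specificity are each 0.75 and the MCC is 0.5.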
Conclusion
The satisfactory performance of the DDcCNN model shows that these features and flexible computational methodologies can strengthen existing models for predicting protein solubility. Better solubility prediction could be applied in several ways, such as preselecting initial targets that are soluble or altering the solubility of target proteins, and can thus help to reduce production costs.
Availability of data and material
The dataset used in this work can be obtained from http://www.ddccnn.wang.
Code availability
The code for this work can be obtained from http://www.ddccnn.wang.
Abbreviations
- E. coli: Escherichia coli
- SVM: Support vector machine
- CNNs: Convolutional neural networks
- G-gap: G value gap dipeptide frequency
- AIs: Aliphatic indices
- GRAVY: Grand average of hydropathy
- IHH: Isoleucine histidine histidine
- SS: Secondary structure
- FER: Fraction of exposed residues
- RSA: Relative solvent accessibility
- DDcCNN: Deep dual-channel convolutional neural networks
- NN: Feed-forward neural network
- ACC: Accuracy
- P: Precision ratio
- R: Recall ratio
- MCC: Matthews correlation coefficient
- TPR: True positive rate
- FPR: False positive rate
- TNR: True negative rate (specificity)
- GCNN: Global convolutional neural networks
- LCNN: Local convolutional neural networks
- PCNN: Parallel convolutional neural networks
- DT: Decision trees
- RF: Random forest
- DNN: Deep neural network
Funding
This work is supported by the Key Research Area Grant of the Ministry of Science and Technology of China (Grant no. 2016YFA0501703), the National Natural Science Foundation of China (Grant nos. 62072157, 32070662, 61832019, 32030063, 61802116), the Natural Science Foundation of Henan Province (Grant no. 202300410102), and the PhD Start-up Fund of Henan Institute of Technology (Grant no. KQ2002). The computations were partially performed at the Pengcheng Lab and the Center for High-Performance Computing, Shanghai Jiao Tong University.
Author information
Contributions
XW conceived this article. YL designed and implemented the related experiments. ZD and MZ modified and revised the manuscript. XJ collected the related papers. AMK and DW corrected some errors in the experiments. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
About this article
Cite this article
Wang, X., Liu, Y., Du, Z. et al. Prediction of Protein Solubility Based on Sequence Feature Fusion and DDcCNN. Interdiscip Sci Comput Life Sci 13, 703–716 (2021). https://doi.org/10.1007/s12539-021-00456-1
DOI: https://doi.org/10.1007/s12539-021-00456-1