Prediction of Protein Solubility Based on Sequence Feature Fusion and DDcCNN

  • Original research article
Interdisciplinary Sciences: Computational Life Sciences

Abstract

Background

Predicting protein solubility is an indispensable prerequisite for pharmaceutical research and production. The objective of this work is to design a new model for predicting protein solubility that combines protein sequence feature fusion with deep dual-channel convolutional neural networks (DDcCNN), in order to improve on the performance of existing prediction models.

Methods

Redundancy in the raw protein sequences is reduced with CD-HIT. Four subsequences are built from each protein sequence: one global and three local. The global subsequence is the entire protein sequence, and the local subsequences are obtained by moving a sliding window according to a set of rules. G-gap features are extracted from these four subsequences and assembled into a mixed matrix, which is the input of one channel composed of three convolutional layers. Additional features are extracted with the SCRATCH tool as the input of the other channel, which consists of a single convolution, in order to find hidden relationships and improve the accuracy of the predictor. The outputs of the two parallel channels are concatenated as the input of the hidden layer, and the protein solubility prediction is obtained in the output layer. The best protein solubility prediction model is selected through comparative experiments on different frameworks.
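As an illustration of the G-gap step, the following is a minimal Python sketch, not the authors' implementation. It assumes the usual definition of G-gap dipeptide frequency: the relative frequency of each of the 400 ordered residue pairs whose two residues are separated by g intervening positions. The names `g_gap_frequencies` and `fuse_features` are hypothetical, and the example subsequences are placeholders, since the sliding-window rules are not given here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order shared by every feature vector.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}


def g_gap_frequencies(seq, g):
    """Return 400 frequencies of residue pairs separated by g positions (g >= 0)."""
    counts = [0] * len(DIPEPTIDES)
    total = 0
    for i in range(len(seq) - g - 1):
        pair = seq[i] + seq[i + g + 1]
        idx = INDEX.get(pair)
        if idx is not None:  # pairs with non-standard residues are skipped
            counts[idx] += 1
            total += 1
    if total == 0:  # sequence too short for this gap
        return [0.0] * len(DIPEPTIDES)
    return [c / total for c in counts]


def fuse_features(subsequences, g):
    """Stack one G-gap vector per subsequence into a matrix (list of rows)."""
    return [g_gap_frequencies(s, g) for s in subsequences]


# Placeholder input: one global sequence and three illustrative local pieces.
# The real local subsequences come from the sliding-window rules of the paper.
protein = "MKTAYIAKQRQISFVKSHFSRQ"
pieces = [protein, protein[:8], protein[7:15], protein[14:]]
matrix = fuse_features(pieces, g=1)
print(len(matrix), len(matrix[0]))
```

Each row of `matrix` has the same length regardless of subsequence length, which is what makes it possible to stack the global and local features into one fixed-size input.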

Results

The proposed DDcCNN model achieves an accuracy of 77.82%, a Matthews correlation coefficient of 0.57, a sensitivity of 76.13% and a specificity of 79.32%. Comparative experiments show that the overall performance of the DDcCNN model is better than that of the existing models (GCNN, LCNN and PCNN). The related models and data are publicly deposited at http://www.ddccnn.wang.
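For readers unfamiliar with these indicators, the sketch below shows how accuracy, sensitivity, specificity and the Matthews correlation coefficient are conventionally computed from a binary confusion matrix. The function name `binary_metrics` is hypothetical and the counts in the example are invented for illustration; they are not the confusion matrix behind the figures reported here.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification indicators from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is set to 0 when any marginal count is zero (a common convention).
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Invented counts, for illustration only.
example = binary_metrics(tp=40, fp=10, tn=40, fn=10)
print(example)
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells of the confusion matrix, so it stays informative when the soluble and insoluble classes are unbalanced.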

Conclusion

The satisfactory performance of the DDcCNN model shows that these features and flexible computational methodologies can strengthen existing models for protein solubility prediction. The model could be applied in several settings, such as preselecting initial targets that are soluble or altering the solubility of target proteins, and can thus help to reduce production costs.

Availability of data and material

The dataset used in our work can be obtained from http://www.ddccnn.wang.

Code availability

The code of our work can be obtained from http://www.ddccnn.wang.

Abbreviations

E. coli:

Escherichia coli

SVM:

Support vector machine

CNNs:

Convolutional neural networks

G-gap:

G value gap dipeptide frequency

AIs:

Aliphatic Indices

GRAVY:

Grand average of hydropathicity

IHH:

Isoleucine histidine histidine

SS:

Secondary structure

FER:

The fraction of exposed residues

RSA:

Relative solvent accessibility

DDcCNN:

Deep dual-channel convolutional neural networks

NN:

The feed-forward neural network

ACC:

Accuracy

P:

Precision ratio

R:

Recall ratio

MCC:

Matthews correlation coefficient

TPR:

True positive rate

FPR:

False positive rate

TNR:

True negative rate (specificity)

GCNN:

Global convolutional neural networks

LCNN:

Local convolutional neural networks

PCNN:

Parallel convolutional neural networks

DT:

Decision trees

RF:

Random forest

DNN:

Deep neural network


Funding

This work is supported by grants from the Key Research Area Grant of the Ministry of Science and Technology of China (Grant no. 2016YFA0501703), the National Natural Science Foundation of China (Grant nos. 62072157, 32070662, 61832019, 32030063, 61802116), the Natural Science Foundation of Henan Province (Grant no. 202300410102), and the PhD Start-up Fund of Henan Institute of Technology (Grant no. KQ2002). The computations were partially performed at the Pengcheng Lab and the Center for High-Performance Computing, Shanghai Jiao Tong University.

Author information

Contributions

XW conceived this article. YL designed and implemented the related experiments. ZD and MZ revised the manuscript. XJ collected the related papers. AMK and DW corrected errors in the experiments. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Xianfang Wang or Dongqing Wei.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

About this article

Cite this article

Wang, X., Liu, Y., Du, Z. et al. Prediction of Protein Solubility Based on Sequence Feature Fusion and DDcCNN. Interdiscip Sci Comput Life Sci 13, 703–716 (2021). https://doi.org/10.1007/s12539-021-00456-1

  • DOI: https://doi.org/10.1007/s12539-021-00456-1
