Sequence-Based Prediction of Transmembrane Protein Crystallization Propensity

  • Original research article
Interdisciplinary Sciences: Computational Life Sciences

Abstract

Transmembrane proteins play a vital role in cellular processes. Several techniques can determine transmembrane protein structures, and X-ray crystallography is the primary one. However, owing to the special properties of transmembrane proteins, determining their structures by X-ray crystallography remains difficult. To reduce experimental cost and improve experimental efficiency, it is important to develop computational methods for predicting the crystallization propensity of transmembrane proteins. In this work, we propose a sequence-based machine learning method, Prediction of TransMembrane protein Crystallization propensity (PTMC), for this task. First, we extracted several general sequence features together with specifically encoded features of relative solvent accessibility and hydrophobicity. Second, we applied feature selection to filter out redundant and irrelevant features; the resulting optimal feature subset comprises hydrophobicity, amino acid composition and relative solvent accessibility. Finally, we chose extreme gradient boosting as the classifier after comparing it with several other machine learning methods. Comparative results on the independent test set indicate that PTMC outperforms state-of-the-art sequence-based methods in terms of sensitivity, specificity, accuracy, Matthews correlation coefficient (MCC) and area under the receiver operating characteristic curve (AUC). Compared with two competitors, BCrystal and TMCrys, PTMC improves sensitivity by 0.132 and 0.179, specificity by 0.014 and 0.127, accuracy by 0.037 and 0.192, MCC by 0.128 and 0.362, and AUC by 0.027 and 0.125, respectively.
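The threshold-based metrics named in the abstract follow standard definitions over the confusion matrix. The sketch below is illustrative only and is not taken from the PTMC code base: the function name `binary_metrics` and the convention that label 1 means "crystallizable" are assumptions, and AUC is omitted because it requires predicted scores rather than hard labels.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, accuracy and MCC for 0/1 labels.

    Hypothetical helper for illustration; label 1 is assumed to be the
    positive (crystallizable) class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Each ratio falls back to 0.0 when its denominator is zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0

    # Matthews correlation coefficient from the four confusion counts.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Toy labels, not data from the paper.
    truth = [1, 1, 1, 0, 0, 0, 0, 1]
    preds = [1, 1, 0, 0, 0, 1, 0, 1]
    print(binary_metrics(truth, preds))
```

Because MCC uses all four confusion-matrix counts, it is less sensitive to class imbalance than accuracy, which may be why the reported gains differ so much between the two measures.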

Availability of data, material and code

The code and data of PTMC are available at https://github.com/xialab-ahu/PTMC.

Acknowledgements

The authors thank the members of our laboratory for their valuable discussions.

Funding

This work was supported by the Anhui Provincial Outstanding Young Talent Support Plan (gxyq2018083) and the National Natural Science Foundation of China (62072003, 11835014, and U19A2064).

Author information

Corresponding authors

Correspondence to Zeliang Wang or Junfeng Xia.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

About this article

Cite this article

Zhu, Q., Wang, L., Dai, R. et al. Sequence-Based Prediction of Transmembrane Protein Crystallization Propensity. Interdiscip Sci Comput Life Sci 13, 693–702 (2021). https://doi.org/10.1007/s12539-021-00448-1

  • DOI: https://doi.org/10.1007/s12539-021-00448-1
