
  • Review Article

A guide to machine learning for biologists

Abstract

The expanding scale and inherent complexity of biological data have encouraged a growing use of machine learning in biology to build informative and predictive models of the underlying biological processes. All machine learning techniques fit models to data; however, the specific methods are quite varied and can at first glance seem bewildering. In this Review, we aim to provide readers with a gentle introduction to a few key machine learning techniques, including the most recently developed and widely used techniques involving deep neural networks. We describe how different techniques may be suited to specific types of biological data, and also discuss some best practices and points to consider when one is embarking on experiments involving machine learning. Some emerging directions in machine learning methodology are also discussed.


Fig. 1: Choosing and training a machine learning method.
Fig. 2: Training machine learning methods.
Fig. 3: Traditional machine learning methods.
Fig. 4: Neural network methods.


Acknowledgements

The authors thank members of the UCL Bioinformatics Group for valuable discussions and comments. This work was supported by the European Research Council Advanced Grant ProCovar (project ID 695558).

Author information


Contributions

All authors researched data for the article, contributed substantially to discussion of the content, wrote the article and reviewed the manuscript before submission.

Corresponding author

Correspondence to David T. Jones.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information

Nature Reviews Molecular Cell Biology thanks S. Draghici who co-reviewed with T. Nguyen; B. Chain; S. Haider; F. Mahmood; and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Related links

Caret: https://topepo.github.io/caret

Colaboratory: https://research.google.com/colaboratory

Graph Nets: https://github.com/deepmind/graph_nets

MLJ: https://alan-turing-institute.github.io/MLJ.jl/stable

PyTorch: https://pytorch.org

PyTorch Geometric: https://pytorch-geometric.readthedocs.io/en/latest

scikit-learn: https://scikit-learn.org/stable

Tensorflow: https://www.tensorflow.org

Glossary

Deep learning

Machine learning methods based on neural networks. The adjective ‘deep’ refers to the use of many hidden layers in the network: at least two, but usually many more. Deep learning is a subset of machine learning, and hence of artificial intelligence more broadly.

Artificial neural networks

A collection of connected nodes loosely representing neuron connectivity in a biological brain. Each node is part of a layer and represents a number calculated from the previous layer. The connections, or edges, allow a signal to flow from the input layer to the output layer via hidden layers.
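
As an illustrative sketch (not from the article), such a network can be written in a few lines of PyTorch; the layer sizes here are arbitrary:

    import torch
    import torch.nn as nn

    # A minimal feed-forward network: 10 input features -> 32 hidden nodes -> 2 outputs.
    # Each nn.Linear holds the edges (weights) connecting one layer to the next.
    model = nn.Sequential(
        nn.Linear(10, 32),  # input layer to hidden layer
        nn.ReLU(),          # non-linear activation at each hidden node
        nn.Linear(32, 2),   # hidden layer to output layer
    )

    x = torch.randn(4, 10)  # a batch of four data points
    y = model(x)            # signal flows input -> hidden -> output; y has shape (4, 2)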

Ground truth

The true value that the output of a machine learning model is compared with to train the model and test performance. These values usually come from experimental measurements (for example, the accessibility of a region of DNA to transcription factors) or expert human annotation (for example, a medical image labelled as healthy or pathological).

Encoding

Any scheme for numerically representing (often categorical) data in a form suitable for use in a machine learning model. An encoding can be a fixed numerical representation (for example, one-hot or continuous encoding) or can be defined using parameters that are trained along with the rest of a model.

One-hot encoding

An encoding scheme that represents a fixed set of n categorical inputs using n unique n-dimensional vectors, each with one element set to 1 and the rest set to 0. For example, the set of three letters (A,B,C) could be represented by the three vectors [1,0,0], [0,1,0] and [0,0,1], respectively.
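
A minimal sketch of this scheme in Python with NumPy, using the three-letter example above:

    import numpy as np

    # One-hot encode the three categories (A, B, C) as rows of the 3x3 identity matrix.
    categories = ["A", "B", "C"]
    one_hot = {c: np.eye(len(categories), dtype=int)[i] for i, c in enumerate(categories)}

    print(one_hot["A"])  # [1 0 0]
    print(one_hot["C"])  # [0 0 1]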

Mean squared error

A loss function that calculates the average squared difference between the predicted values and the ground truth. This function heavily penalizes outliers because it increases rapidly as the difference between a predicted value and the ground truth grows.
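
In symbols, for n predictions \hat{y}_i compared with ground truth values y_i:

    \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2

The squared term is what makes large individual errors dominate the loss.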

Binary cross entropy

The most common loss function for training a binary classifier; that is, for tasks aimed at answering a question with only two choices (such as cancer versus non-cancer); sometimes called ‘log loss’.
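
In symbols, for n predicted probabilities p_i of the positive class and binary ground truth labels y_i \in \{0, 1\}:

    \mathrm{BCE} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]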

Linear regression

A model that assumes that the output can be calculated from a linear combination of inputs; that is, each input feature is multiplied by a single parameter and these values are added. It is easy to interpret how these models make their predictions.
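
An illustrative sketch using scikit-learn; the data here are synthetic, generated to follow y = 2x_1 + 3x_2:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.random((100, 2))       # 100 data points, 2 input features
    y = 2 * X[:, 0] + 3 * X[:, 1]  # output is a linear combination of the inputs

    model = LinearRegression().fit(X, y)
    print(model.coef_)  # approximately [2. 3.]: one interpretable parameter per feature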

Kernel functions

Transformations applied to each data point to map the original points into a space in which they become separable with respect to their class.
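
A common example (not specific to this article) is the radial basis function kernel, which measures similarity between two points x and x' with a tunable width \gamma:

    K(x, x') = \exp(-\gamma \| x - x' \|^2)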

Non-linear regression

A model where the output is calculated from a non-linear combination of inputs; that is, the input features can be combined during prediction using operations such as multiplication. These models can describe more complex phenomena than linear regression.

k nearest neighbours

A classification approach where a data point is classified on the basis of the known (ground truth) classes of the k most similar points in the training set using a majority voting rule; k is a parameter that can be tuned. It can also be used for regression by averaging the property value over the k nearest neighbours.
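
A minimal sketch with scikit-learn, using its bundled iris data set purely for illustration:

    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # Each query point is assigned the majority class among its k = 5 nearest training points.
    knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
    print(knn.predict(X[:3]))  # predicted classes for the first three samples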

Regularization

Restricting the values of parameters to prevent the model from overfitting to the training data. For example, penalizing high parameter values in regression models reduces the flexibility of the model and can stop it fitting to noise in the training data.
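
For example, ridge (L2) regularization adds a penalty on the squared parameter values w_j to the mean squared error loss, with the strength controlled by a tunable value \lambda:

    \mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 + \lambda \sum_{j=1}^{p} w_j^2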

Cloud computing

On-demand computing services, including processing power and data storage, typically available via the Internet. A pay-as-you-go model is usually used. Use of cloud computing minimizes up-front IT infrastructure costs.

Hidden Markov model

A statistical model that can be used to describe the evolution of observable events that depend on factors that are not directly observable. It has various uses in biology, including representing protein sequence families.

Saliency map

In the context of machine learning, an image generated to show which pixels in an input image contribute to the prediction made by a model. It is useful in interpreting models.

Automatic differentiation

A set of techniques to automatically calculate the gradient of a function in a computer program. It is used to train neural networks, where it is known as ‘backpropagation’.

Gradients

The rate of change of one property as another property changes. In neural networks, the set of gradients of the loss function with respect to the neural network parameters, computed via a process known as backpropagation, is used to adjust the parameters and thus train the model.
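
A minimal sketch of automatic differentiation with PyTorch, using an arbitrary toy function:

    import torch

    # Compute the gradient of loss = (3w - 6)^2 with respect to w at w = 1.
    w = torch.tensor(1.0, requires_grad=True)
    loss = (3.0 * w - 6.0) ** 2
    loss.backward()  # backpropagation fills in w.grad automatically
    print(w.grad)    # tensor(-18.): d(loss)/dw = 6 * (3w - 6) = -18 at w = 1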


About this article


Cite this article

Greener, J.G., Kandathil, S.M., Moffat, L. et al. A guide to machine learning for biologists. Nat Rev Mol Cell Biol 23, 40–55 (2022). https://doi.org/10.1038/s41580-021-00407-0
