  • Protocol

Recovering prokaryotic genomes from host-associated, short-read shotgun metagenomic sequencing data

Abstract

Recovering genomes from shotgun metagenomic sequence data allows detailed taxonomic and functional characterization of individual species or strains in a microbial community. Retrieving these metagenome-assembled genomes (MAGs) involves seven stages. First, low-quality bases, along with adapter and host sequences, are removed. Second, overlapping sequences are assembled to create longer contiguous fragments. Third, these fragments are clustered based on sequence composition and abundance. Fourth, these sequence clusters, or bins, undergo rounds of quality assessment and refinement to yield MAGs. The optional fifth stage is dereplication of MAGs to select representatives. Sixth, each MAG is taxonomically classified. The optional seventh stage is assessing the fraction of diversity that has been recovered. The output of this protocol is a set of draft genomes, which can provide invaluable clues about uncultured organisms. This protocol takes ~1 week to run, depending on the computational resources available, and requires prior experience with high-performance computing, shell script programming and Python.
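
The seven stages described above can be sketched as an ordered plan in which the two optional stages are switched on or off. This is a minimal illustration only and is not taken from the protocol's own code: the `Stage` class, the `plan` function and its parameter names are hypothetical, and the stage labels simply paraphrase the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    """One stage of the MAG recovery workflow (hypothetical helper type)."""
    number: int
    name: str
    optional: bool = False


# Stage labels paraphrase the abstract; they are for illustration only.
STAGES: List[Stage] = [
    Stage(1, "quality control: remove low-quality bases, adapter and host sequences"),
    Stage(2, "assembly: merge overlapping sequences into longer contiguous fragments"),
    Stage(3, "binning: cluster fragments by sequence composition and abundance"),
    Stage(4, "bin quality assessment and refinement to yield MAGs"),
    Stage(5, "dereplication: select representative MAGs", optional=True),
    Stage(6, "taxonomic classification of each MAG"),
    Stage(7, "assessment of the fraction of diversity recovered", optional=True),
]


def plan(dereplicate: bool = True, assess_recovery: bool = True) -> List[Stage]:
    """Return the stages to run, in order, skipping optional stages that are switched off."""
    skip = set()
    if not dereplicate:
        skip.add(5)
    if not assess_recovery:
        skip.add(7)
    return [stage for stage in STAGES if stage.number not in skip]


if __name__ == "__main__":
    # Example: run everything except the optional dereplication stage.
    for stage in plan(dereplicate=False):
        print(f"Stage {stage.number}: {stage.name}")
```

Keeping the stage list as data, separate from the selection logic, makes it easy to see that only stages 5 and 7 can be skipped and that the remaining stages always run in the same order.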

Fig. 1: Overview of workflow for recovery of prokaryotic MAGs from shotgun metagenomic sequence data.
Fig. 2: MULTIQC plots of metagenomic sequence datasets after preprocessing.
Fig. 3: Quality assessment and taxonomic classification of prokaryotic MAGs.
Fig. 4: Using read mapping to visualize bottlenecks in assembly and binning.

Data availability

The data presented in Figs. 2 and 3 are available in the supporting primary research papers. The raw data files for the gut runs in Figs. 2–4 are available as part of the project accessions PRJNA268964 and PRJNA278393. The raw data files for the skin runs in Fig. 4 are available via the project accession PRJNA46333. The datasets generated for this protocol in Figs. 2–4 are available at https://github.com/Finn-Lab/MAG_Snakemake_wf under the subfolder Anticipated_Results.

Code availability

The code used in this protocol is publicly available at https://github.com/Finn-Lab/MAG_Snakemake_wf and has been peer reviewed.

Acknowledgements

S.S.K. is a graduate student supported by the NIH-Oxford-Cambridge Scholars Program. A.A. and R.D.F. are funded by EMBL core funds.

Author information

Contributions

S.S.K., A.A. and R.D.F. conceived the study. S.S.K. and A.A. wrote the pipeline and performed the analyses. A.A., J.A.S. and R.D.F. supervised the work. J.A.S. and R.D.F. provided funding. S.S.K., A.A. and R.D.F. wrote the manuscript. All authors read and approved the manuscript.

Corresponding author

Correspondence to Robert D. Finn.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Protocols thanks Matthew Olm, Maria Pachiadaki and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Related links

Key references using this protocol

Almeida, A. et al. Nature 568, 499–504 (2019): https://doi.org/10.1038/s41586-019-0965-1

Almeida, A. et al. Nat. Biotechnol. 39, 105–114 (2021): https://doi.org/10.1038/s41587-020-0603-3

About this article

Cite this article

Saheb Kashaf, S., Almeida, A., Segre, J.A. et al. Recovering prokaryotic genomes from host-associated, short-read shotgun metagenomic sequencing data. Nat Protoc 16, 2520–2541 (2021). https://doi.org/10.1038/s41596-021-00508-2
