  • Technical Report

Assembly of whole-chromosome pseudomolecules for polyploid plant genomes using outbred mapping populations

Abstract

Despite advances in sequencing technologies, assembly of complex plant genomes remains elusive due to polyploidy and high repeat content. Here we report PolyGembler for grouping and ordering contigs into pseudomolecules by genetic linkage analysis. Our approach also provides an accurate method with which to detect and fix assembly errors. Using simulated data, we demonstrate that our approach is of high accuracy and outperforms three existing state-of-the-art genetic mapping tools. Particularly, our approach is more robust to the presence of missing genotype data and genotyping errors. We used our method to construct pseudomolecules for allotetraploid lawn grass utilizing PacBio long reads in combination with restriction site-associated DNA sequencing, and for diploid Ipomoea trifida and autotetraploid potato utilizing contigs assembled from Illumina reads in combination with genotype data generated by single-nucleotide polymorphism arrays and genotyping by sequencing, respectively. We resolved 13 assembly errors for a published I. trifida genome assembly and anchored eight unplaced scaffolds in the published potato genome.
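As a rough illustration of what grouping contigs by genetic linkage involves, the sketch below joins contigs whose pairwise recombination frequency (RF) falls under a threshold and reports the connected components as putative linkage groups. This is a simplified stand-in, not the PolyGembler algorithm: the function name `group_contigs`, the `max_rf` threshold, and the RF values are all hypothetical.

```python
from collections import defaultdict


def group_contigs(contigs, pairwise_rf, max_rf=0.3):
    """Group contigs into putative linkage groups (illustrative only).

    Two contigs are joined whenever their estimated RF is at most
    max_rf; the groups are the connected components of that graph,
    found with a small union-find structure.
    """
    parent = {c: c for c in contigs}

    def find(c):
        # Walk to the root, halving the path along the way.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for (a, b), rf in pairwise_rf.items():
        if rf <= max_rf:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for c in contigs:
        groups[find(c)].append(c)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical RF estimates for four contigs (made-up values).
rf = {
    ("ctg1", "ctg2"): 0.05,
    ("ctg2", "ctg3"): 0.12,
    ("ctg1", "ctg4"): 0.49,
    ("ctg3", "ctg4"): 0.50,
}
linkage_groups = group_contigs(["ctg1", "ctg2", "ctg3", "ctg4"], rf)
print(linkage_groups)
```

With these made-up values, ctg1, ctg2 and ctg3 fall into one group and ctg4 stays on its own, because its RF to the other contigs is close to 0.5 (no detectable linkage).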

Fig. 1: PolyGembler framework.
Fig. 2: Pseudomolecule construction for simulated datasets.
Fig. 3: The ITR_r1.0 scaffold Itr_sc000015 is a misassembly.
Fig. 4: Pseudomolecule construction for M9 × M19 and B2721 mapping populations.
Fig. 5: Collinear plots between the Z. japonica pseudomolecules constructed from PolyGembler and the O. sativa chromosomes.

Data availability

Data for the simulation studies, including comparisons with other methods and studies of M9 × M19 I. trifida and the B2721 potato, are available from http://data.genomicsresearch.org/Projects/polyGembler. Data for the 12601ab1 × Stirling potato mapping population were provided by C. Hackett. Data for the Z. japonica mapping population Carrizo × El Toro are available from the NCBI repository under the accession number SRP055007. The whole-genome PacBio sequence data for the Z. japonica cultivar Yaji are available from the NCBI repository under the accession number SRP110561. Data related to the PGSC version 4.03 pseudomolecules are available from http://solanaceae.plantbiology.msu.edu. The I. trifida de novo genome assembly ITR_r1.0 is available from http://sweetpotato-garden.kazusa.or.jp. The I. trifida de novo genome assembly NCNSP0306 is available from http://sweetpotato.plantbiology.msu.edu. Release 7 of the O. sativa reference genome is available from http://phytozome.jgi.doe.gov. The genome assembly of the Z. japonica accession Nagirizaki is available from http://zoysia.kazusa.or.jp. Source data are provided with this paper.

Code availability

The software PolyGembler, presented in this article, and its documentation are publicly available at GitHub (https://github.com/c-zhou/polyGembler).

Acknowledgements

We thank F. Diaz for developing the M9 × M19 I. trifida mapping population and M. David for extracting and quantifying DNA from the M9 × M19 cross. The 12601ab1 × Stirling Infinium 8303 potato array data were provided by C. A. Hackett. This research was supported by grants from the Bill & Melinda Gates Foundation (OPP1052983) and Australian Research Council (DP170102626 awarded to L.J.M.C.). The work at the International Potato Center (CIP) was carried out as part of the Consultative Group for International Agricultural Research (CGIAR) Research Program on Roots, Tubers and Bananas, which is supported by CGIAR Fund Donors (http://www.cgiar.org/about-us/our-funders/). This research was also supported by use of the NeCTAR Research Cloud, by QCIF and by the University of Queensland’s Research Computing Centre. The NeCTAR Research Cloud is a collaborative Australian research platform supported by the National Collaborative Research Infrastructure Strategy.

Author information

Contributions

C.Z. and L.J.M.C. designed the study and wrote the software. A.K., D.C.G. and W.G. developed and provided the I. trifida mapping population materials. B.O., D.C.G., S.W. and W.G. generated data for the M9 × M19 I. trifida mapping population. C.Z. performed the analysis. C.Z. and L.J.M.C. wrote the manuscript. L.J.M.C., G.C.Y., A.K., M.D.C., A.W.G., Z.-B.Z. and Z.F. supervised the project. All authors contributed to editing the final manuscript.

Corresponding author

Correspondence to Lachlan J. M. Coin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Pseudomolecule construction for 20× tetraploid simulated GBS data.

A total of 42,715 SNPs located on 678 scaffolds were used for linkage analysis. These scaffolds, totaling ~482 Mb, covered approximately 99.6% of the genome. a, Dot plot of the recombination frequency (RF) estimates for scaffold pairs mapped to the same reference chromosome. The x- and y-axes represent the physical distances and the estimated RFs, respectively. b, Histogram of the RF estimates for scaffold pairs mapped to different reference chromosomes. c, Collinear plots of pseudomolecules mapped to reference chromosomes. The x- and y-axes represent physical positions (Mb) on the reference chromosomes and pseudomolecules, respectively. Each line represents a collinear block between the reference chromosome and the pseudomolecule. The diagonal line in each plot indicates a high correlation between the reference chromosome and the pseudomolecule constructed from scaffolds.

Extended Data Fig. 2 Collinear plots between the Ipomoea nil reference chromosomes and pseudomolecules constructed from the Ipomoea trifida genotype data.

The x- and y-axes represent the physical positions (Mb) on the reference chromosomes and pseudomolecules, respectively. Each line represents a collinear block between the Ipomoea nil reference chromosome and the pseudomolecules.

Extended Data Fig. 3 Genetic linkage map construction from the Infinium 8303 SNP array data of the Stirling × 12601ab1 mapping population.

a, Dot plot of the RF estimates for scaffold pairs mapped to the same PGSC v4.03 pseudomolecule. The x- and y-axes represent the physical distances and the estimated RFs, respectively. b, Histogram of the RF estimates for scaffold pairs mapped to different PGSC v4.03 pseudomolecules. c, Comparison between the genetic linkage map constructed by the proposed method and the PGSC v4.03 pseudomolecules. Twelve genetic linkage groups corresponding to the 12 pseudomolecules were constructed. In each plot, the x-axis represents the positions (Mb) on the PGSC v4.03 pseudomolecules, and the y-axis represents the positions (cM) on the genetic linkage map.
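Panel c places markers on a genetic map measured in cM, whereas panels a and b report RF estimates. As a hedged illustration of how the two quantities relate, the sketch below converts an RF into a map distance with Haldane's mapping function, d = -50 ln(1 - 2r) cM. The caption does not say which mapping function was used, so this choice is an assumption, and `haldane_cm` is a hypothetical helper name.

```python
import math


def haldane_cm(rf):
    """Convert a recombination frequency to centimorgans (Haldane).

    Assumes no crossover interference. RF must lie in [0, 0.5);
    an RF of 0.5 corresponds to unlinked loci and has no finite
    map distance.
    """
    if not 0.0 <= rf < 0.5:
        raise ValueError("rf must satisfy 0 <= rf < 0.5")
    return -50.0 * math.log(1.0 - 2.0 * rf)


# Small RFs map almost one-to-one to cM; larger RFs are stretched.
for r in (0.01, 0.10, 0.30):
    print(f"RF={r:.2f} -> {haldane_cm(r):.2f} cM")
```

The conversion is nearly linear for small RFs and grows without bound as the RF approaches 0.5, which is why map distances between distant markers are usually summed over short intervals rather than computed from a single pairwise RF.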

Extended Data Fig. 4 Genetic linkage map constructed from the Infinium 8303 SNP array data of the B2721 mapping population with TetraploidSNPMap.

Each dot represents a SNP. The x-axis represents the positions (Mb) on the PGSC v4.03 pseudomolecules, and the y-axis represents the positions (cM) on the genetic linkage map. The genetic linkage map comprises a total of 4,745 SNPs including 56 SNPs located on the unplaced PGSC v4.03 scaffolds (red) and 76 SNPs placed in incorrect PGSC v4.03 pseudomolecules (blue). Since the physical positions of the red and blue dots cannot be determined, they were set to zero in the plots.

Extended Data Fig. 5 Genetic linkage map constructed from the Infinium 8303 SNP array data of the Stirling × 12601ab1 mapping population with TetraploidSNPMap.

Each dot represents a SNP. The x-axis represents the positions (Mb) on the PGSC v4.03 pseudomolecules, and the y-axis represents the positions (cM) on the genetic linkage map. The genetic linkage map comprises a total of 3,593 SNPs including 54 SNPs located on the unplaced PGSC v4.03 scaffolds (red) and 35 SNPs placed in incorrect PGSC v4.03 pseudomolecules (blue). Since the physical positions of the red and blue dots cannot be determined, they were set to zero in the plots.

Extended Data Fig. 6 Collinear plots between the pseudomolecules of Zoysia japonica accession Yaji and Nagirizaki.

The x- and y-axes represent the positions (Mb) on the pseudomolecules. Each line represents a collinear block between the pseudomolecules.

Extended Data Fig. 7 Relationship between the number of genetic markers and computational resources required for the haplotype phasing algorithm.

The x- and y-axes (on a logarithmic scale) represent the number of genetic markers and the consumption of resources, respectively. a, CPU time. b, Memory. Each point in the plot is the average of 30 independent experiments (Intel® Xeon® Processor E5-2667 v3 CPU, 3.20 GHz). Error bars show one standard deviation at each point.
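A summary of this kind can be reproduced from raw measurements along the following lines. The sketch averages repeated runs per marker count, computes one standard deviation for the error bars, and fits a slope on log-log axes as a rough scaling exponent. The timings are made-up placeholders rather than values from the source data, and the use of the sample (rather than population) standard deviation is an assumption.

```python
import math
import statistics


def summarize(runs):
    """Map each marker count to (mean, sample standard deviation)."""
    return {
        n: (statistics.mean(v), statistics.stdev(v))
        for n, v in sorted(runs.items())
    }


def loglog_slope(points):
    """Least-squares slope of log10(y) against log10(x)."""
    xs = [math.log10(x) for x, _ in points]
    ys = [math.log10(y) for _, y in points]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


# Hypothetical CPU times in seconds, three runs per marker count
# (the figure itself averages 30 runs per point).
cpu_seconds = {
    100: [1.0, 1.2, 0.8],
    1000: [10.0, 11.0, 9.0],
    10000: [100.0, 105.0, 95.0],
}
summary = summarize(cpu_seconds)
slope = loglog_slope([(n, mean) for n, (mean, _) in summary.items()])
print(summary)
print(f"log-log slope: {slope:.2f}")
```

A slope near 1 on log-log axes would indicate roughly linear scaling with the number of markers; the placeholder data here are constructed to give exactly that and say nothing about the actual behaviour of the phasing algorithm.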

Supplementary information

Supplementary Information

Supplementary Notes 1–3, Figs. 1 and 2 and Tables 1–6.

Reporting Summary

Source data

Source Data Fig. 3

Statistical source data.

Source Data Extended Data Fig. 7

Statistical source data.

About this article

Cite this article

Zhou, C., Olukolu, B., Gemenet, D.C. et al. Assembly of whole-chromosome pseudomolecules for polyploid plant genomes using outbred mapping populations. Nat Genet 52, 1256–1264 (2020). https://doi.org/10.1038/s41588-020-00717-7

  • DOI: https://doi.org/10.1038/s41588-020-00717-7
