Abstract
Haplotype-resolved de novo assembly is the ultimate solution to the study of sequence variations in a genome. However, existing algorithms either collapse heterozygous alleles into one consensus copy or fail to cleanly separate the haplotypes to produce high-quality phased assemblies. Here we describe hifiasm, a de novo assembler that takes advantage of long high-fidelity sequence reads to faithfully represent the haplotype information in a phased assembly graph. Unlike other graph-based assemblers that only aim to maintain the contiguity of one haplotype, hifiasm strives to preserve the contiguity of all haplotypes. This feature enables the development of a graph trio binning algorithm that greatly advances over standard trio binning. On three human and five nonhuman datasets, including California redwood with a ~30-Gb hexaploid genome, we show that hifiasm frequently delivers better assemblies than existing tools and consistently outperforms others on haplotype-resolved assembly.
This is a preview of subscription content, access via your institution
Access options
Access Nature and 54 other Nature Portfolio journals
Get Nature+, our best-value online-access subscription
$29.99 / 30 days
cancel any time
Subscribe to this journal
Receive 12 print issues and online access
$259.00 per year
only $21.58 per issue
Rent or buy this article
Prices vary by article type
from$1.95
to$39.95
Prices may be subject to local taxes which are calculated during checkout
Similar content being viewed by others
Data availability
All HiFi data were obtained from the NCBI Sequence Read Archive: SRR11606869 for Z. mays, SRR11606870 for M. musculus, SRR11606867 for F. × ananassa, SRR11606868 and SRR12048570 for R. muscosa, SRP251156 for S. sempervirens, SRR11292120–SRR11292123 for CHM13, ERX3831682 for HG00733, and four runs (SRR10382244, SRR10382245, SRR10382248 and SRR10382249) for HG002. For trio binning and computing QV, short reads were also downloaded: SRR7782677 for HG00733, ERR3241754 for HG00731 (father), ERR3241755 for HG00732 (mother) and SRX1082031 for CHM13. GIAB’s ‘homogeneity Run01’ short-read runs were used for the HG002 trio. These HG002 reads were downsampled to 30-fold coverage. The BAC libraries of CHM13 and HG00733 can be found at https://www.ncbi.nlm.nih.gov/nuccore/?term=VMRC59+and+complete/and https://www.ncbi.nlm.nih.gov/nuccore/?term=VMRC62+and+complete/, respectively. The HG002 major histocompatibility complex reference sequences can be found at https://github.com/NCBI-Hackathons/TheHumanPangenome/tree/master/MHC/assembly/MHCv1.1/ (ref. 26). For BUSCO, the Embryophyta, Tetrapoda and Mammalia datasets are available at https://busco-data.ezlab.org/v4/data/lineages/embryophyta_odb10.2020-09-10.tar.gz, https://busco.ezlab.org/v2/datasets/tetrapoda_odb9.tar.gz and https://busco.ezlab.org/v2/datasets/mammalia_odb9.tar.gz, respectively. The CHM13 reference (v0.9) generated by the T2T consortium can be found at https://s3.amazonaws.com/nanopore-human-wgs/chm13/assemblies/chm13.draft_v0.9.fasta.gz. The hifiasm assemblies produced in this work are available at https://zenodo.org/record/4393631 and https://zenodo.org/record/4393750.
Code availability
Hifiasm code is available at https://github.com/chhylp123/hifiasm/.
References
Chin, C.-S. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods 10, 563–569 (2013).
Berlin, K. et al. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat. Biotechnol. 33, 623–630 (2015).
Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).
Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).
Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
Chin, C.-S. & Khalak, A. Human genome assembly in 100 minutes. Preprint at bioRxiv https://doi.org/10.1101/705616 (2019).
Ruan, J. & Li, H. Fast and accurate long-read assembly with wtdbg2. Nat. Methods 17, 155–158 (2020).
Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat. Biotechnol. 38, 1044–1053 (2020).
Chen, Y. et al. Efficient assembly of nanopore reads via highly accurate and intact error correction. Nat. Commun. 12, 60 (2021).
Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).
Koren, S. et al. De novo assembly of haplotype-resolved genomes with trio binning. Nat. Biotechnol. 36, 1174–1182 (2018).
Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).
Garg, S. et al. Chromosome-scale, haplotype-resolved assembly of human genomes. Nat. Biotechnol. https://doi.org/10.1038/s41587-020-0711-0 (2020).
Porubsky, D. et al. Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads. Nat. Biotechnol. https://doi.org/10.1038/s41587-020-0719-5 (2020).
Martin, M. et al. WhatsHap: fast and accurate read-based phasing. Preprint at bioRxiv https://doi.org/10.1101/085050 (2016).
Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. 27, 801–812 (2017).
Li, H. Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics 28, 1838–1844 (2012).
Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).
Myers, E. W. The fragment assembly string graph. Bioinformatics 21, ii79–ii85 (2005).
Guan, D. et al. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics 36, 2896–2898 (2020).
Hon, T. et al. Highly accurate long-read HiFi sequencing data for five complex genomes. Sci. Data 7, 399 (2020).
Edger, P. P. et al. Origin and evolution of the octoploid strawberry genome. Nat. Genet. 51, 541–547 (2019).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
Hizume, M., Kondo, T., Shibata, F. & Ishizuka, R. Flow cytometric determination of genome size in the Taxodiaceae, Cupressaceae sensu stricto and Sciadopityaceae. Cytologia 66, 307–311 (2001).
Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).
Chin, C.-S. et al. A diploid assembly-based benchmark for variants in the major histocompatibility complex. Nat. Commun. 11, 4794 (2020).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Li, H., Feng, X. & Chu, C. The design and construction of reference pangenome graphs with minigraph. Genome Biol. 21, 265 (2020).
Li, H. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods 15, 595–597 (2018).
Zook, J. M. et al. An open resource for accurately benchmarking small variant and reference calls. Nat. Biotechnol. 37, 561–566 (2019).
Chaisson, M. J. P. et al. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat. Commun. 10, 1784 (2019).
Cleary, J. G. et al. Comparing variant call files for performance benchmarking of next-generation sequencing variant calling pipelines. Preprint at bioRxiv https://doi.org/10.1101/023754 (2015).
Myers, G. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46, 395–415 (1999).
Cheng, H., Jiang, H., Yang, J., Xu, Y. & Shang, Y. BitMapper: an efficient all-mapper based on bit-vector computing. BMC Bioinformatics 16, 192 (2015).
Miller, J. R. et al. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24, 2818–2824 (2008).
Acknowledgements
This study was supported by grants from the US National Institutes of Health (R01HG010040, U01HG010961 and U41HG010972 to H.L.).
Author information
Authors and Affiliations
Contributions
H.C. and H.L. designed the algorithm, implemented hifiasm and drafted the manuscript. H.C. benchmarked hifiasm and other assemblers. G.T.C. ran hifiasm for S. sempervirens, HiCanu for R. muscosa, Peregrine for S. sempervirens and R. muscosa, and Falcon-Unzip for all datasets. X.F. helped with evaluation of the manuscript. H.Z. provided valuable suggestions for error correction and ran BUSCO.
Corresponding author
Ethics declarations
Competing interests
G.T.C. is an employee of PacBio. H.L. is a consultant of Integrated DNA Technologies and on the Scientific Advisory Boards of Sentieon, BGI and OrigiMed.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Peer review information Nature Methods thanks Benedict Paten and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Lin Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Supplementary information
Supplementary Information
Supplementary software commands, Supplementary Tables 1–10 and Supplementary Fig. 1.
Rights and permissions
About this article
Cite this article
Cheng, H., Concepcion, G.T., Feng, X. et al. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods 18, 170–175 (2021). https://doi.org/10.1038/s41592-020-01056-5
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s41592-020-01056-5
This article is cited by
-
A de novo long-read genome assembly of the sacred datura plant (Datura wrightii) reveals a role of tandem gene duplications in the evolution of herbivore-defense response
BMC Genomics (2024)
-
Haplotype-resolved genome assembly provides insights into evolutionary history of the Actinidia arguta tetraploid
Molecular Horticulture (2024)
-
Host-specific co-evolution likely driven by diet in Buchnera aphidicola
BMC Genomics (2024)
-
Chromosome-level genome assembly provides insights into the genome evolution and functional importance of the phenylpropanoid–flavonoid pathway in Thymus mongolicus
BMC Genomics (2024)
-
Annotated genome and transcriptome of the endangered Caribbean mountainous star coral (Orbicella faveolata) using PacBio long-read sequencing
BMC Genomics (2024)