Targeted nanopore sequencing by real-time mapping of raw electrical signal with UNCALLED

Kovaka, Sam; Fan, Yunfan; Ni, Bohan; Timp, Winston; Schatz, Michael C.

doi:10.1038/s41587-020-0731-9

Published: 30 November 2020

Nature Biotechnology volume 39, pages 431–441 (2021)


Abstract

Conventional targeted sequencing methods eliminate many of the benefits of nanopore sequencing, such as the ability to accurately detect structural variants or epigenetic modifications. The ReadUntil method allows nanopore devices to selectively eject reads from pores in real time, which could enable purely computational targeted sequencing. However, this requires rapid identification of on-target reads, whereas most mapping methods require computationally intensive basecalling. We present UNCALLED (https://github.com/skovaka/UNCALLED), an open source mapper that rapidly matches streaming nanopore current signals to a reference sequence. UNCALLED probabilistically considers k-mers that could be represented by the signal and then prunes the candidates based on the reference encoded within a Ferragina–Manzini index. We used UNCALLED to deplete sequencing of known bacterial genomes within a metagenomics community, enriching the remaining species 4.46-fold. UNCALLED also enriched 148 human genes associated with hereditary cancers to 29.6× coverage using one MinION flowcell, enabling accurate detection of single-nucleotide polymorphisms, insertions and deletions, structural variants and methylation in these genes.

**Fig. 1: UNCALLED algorithm and performance on *E. coli* data.**

**Fig. 2: UNCALLED results for the Zymo mock microbial community.**

**Fig. 3: Human cancer gene enrichment using UNCALLED.**

**Fig. 4: Integrated genome browser (IGV) visualization of a heterozygous Alu insertion in an exon of the *MUTYH* gene detected by UNCALLED, ONT WGS and PacBio HiFi reads.**

**Fig. 5: GM12878 promoter methylation.**

Data availability

All sequencing runs are available as an NCBI BioProject under accession no. PRJNA604456.

Code availability

The source code for UNCALLED is available on GitHub at https://github.com/skovaka/UNCALLED.

Acknowledgements

We thank T. Mun for his contributions on an early prototype of UNCALLED and T. Gilpatrick for providing extracted GM12878 DNA used in the cancer gene enrichment experiments. This work was funded, in part, by the US National Science Foundation (grant no. DBI-1350041 to M.C.S.) and US National Institutes of Health (grant no. R01HG009190 to W.T.).

Author information

Authors and Affiliations

Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
Sam Kovaka, Bohan Ni & Michael C. Schatz
Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
Yunfan Fan & Winston Timp
Department of Biology, Johns Hopkins University, Baltimore, MD, USA
Michael C. Schatz
Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
Michael C. Schatz

Contributions

S.K. and M.C.S. designed UNCALLED. S.K. implemented UNCALLED. B.N. and S.K. benchmarked UNCALLED. Y.F. performed all sequencing library preparation. S.K. computed enrichment levels for all experiments and performed small variant and structural variant detection and analysis. Y.F. performed methylation detection and analysis. W.T. supervised sequencing runs and advised on the experimental design. M.C.S. supervised the entire project. All authors contributed to writing the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Sam Kovaka.

Ethics declarations

Competing interests

W.T. holds two patents currently licensed by Oxford Nanopore Technologies Limited. M.C.S. and W.T. have received travel funding from Oxford Nanopore Technologies Limited.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 FM Index Mapping.

(top) FM index alignment of a standard DNA sequence, where the size of each box represents the number of possible locations. (middle) FM alignment of a sequence where every position could be one of two bases. Base ambiguity is analogous to the k-mers we consider for every event. (bottom) Same as middle but alignments starting from all positions are found by filling in the gaps between ranges from previous alignments.
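
The ambiguous-base alignment in the middle panel can be illustrated with a toy FM index. The following is a minimal sketch, not UNCALLED's implementation: the function names (`build_fm_index`, `extend`, `ambiguous_search`) are hypothetical, the suffix array is built by naive sorting, and each pattern position is given as a set of allowed bases standing in for the candidate k-mers of an event.

```python
def build_fm_index(text):
    """Build a toy FM index (naive suffix sort; only suitable for short references)."""
    text += "$"  # sentinel, sorts before A, C, G and T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] is the sentinel when i == 0
    alphabet = sorted(set(text))
    # first[c]: number of characters in the text strictly smaller than c
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, first, occ


def extend(rng, base, first, occ):
    """One backward-search step: prepend `base` to the suffix matched so far."""
    lo, hi = rng
    if base not in first:
        return None
    new_lo = first[base] + occ[base][lo]
    new_hi = first[base] + occ[base][hi]
    return (new_lo, new_hi) if new_lo < new_hi else None


def ambiguous_search(candidates, sa, first, occ):
    """candidates[i] is the set of bases allowed at pattern position i.

    Returns {matched sequence: sorted reference start positions}.
    """
    ranges = {"": (0, len(sa))}  # the empty pattern matches every suffix
    for allowed in reversed(candidates):  # FM index search runs right to left
        extended = {}
        for suffix, rng in ranges.items():
            for base in sorted(allowed):
                new_rng = extend(rng, base, first, occ)
                if new_rng is not None:  # drop candidates absent from the reference
                    extended[base + suffix] = new_rng
        ranges = extended
    return {seq: sorted(sa[i] for i in range(lo, hi))
            for seq, (lo, hi) in ranges.items()}


sa, first, occ = build_fm_index("GATTACA")
hits = ambiguous_search([{"A", "T"}, {"A", "T"}], sa, first, occ)
```

With the toy reference `GATTACA`, the two-position pattern where each base may be A or T matches AT, TT and TA but not AA, so the branch for AA is discarded as soon as its FM index range becomes empty.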

Extended Data Fig. 2 Event/K-mer Match Probability Thresholds.

a, Relationship between natural log probability thresholds (x-axis), the mean number of k-mers that match above each threshold per event (blue) and the fraction of events that match their correct k-mer above each threshold (red). The values for r9.4 chemistry are shown here. b, The FM index range lengths assigned to different probability thresholds for the E. coli reference. This function varies depending on the reference used.
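
The trade-off in panel a (a looser threshold admits more k-mers per event) can be sketched as follows. This sketch assumes that each k-mer's expected current is modeled as a Gaussian, and it uses a made-up model of 2-mers with evenly spaced means; `MODEL`, `gaussian_log_pdf` and `matching_kmers` are hypothetical names, and none of the values come from a real pore model.

```python
import math
from itertools import product


def gaussian_log_pdf(x, mean, stdv):
    """Natural log of the normal density at x."""
    return -0.5 * math.log(2 * math.pi * stdv ** 2) - (x - mean) ** 2 / (2 * stdv ** 2)


# HYPOTHETICAL toy pore model: 16 two-base k-mers with evenly spaced mean
# currents and a shared standard deviation. Real models use longer k-mers
# and empirically measured values.
KMERS = ["".join(p) for p in product("ACGT", repeat=2)]
MODEL = {kmer: (80.0 + 2.5 * i, 1.5) for i, kmer in enumerate(KMERS)}


def matching_kmers(event_mean, threshold):
    """Return the k-mers whose log probability for this event meets the threshold."""
    return [kmer for kmer, (mean, stdv) in MODEL.items()
            if gaussian_log_pdf(event_mean, mean, stdv) >= threshold]


strict = matching_kmers(90.0, -2.0)  # only the best-matching k-mer survives
loose = matching_kmers(90.0, -7.0)   # neighbouring k-mers are admitted too
```

A stricter threshold keeps fewer candidates per event, which speeds up the search but raises the chance that the correct k-mer is excluded.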

Extended Data Fig. 3 Zymo Full Flowcell 1 Pore Activity.

Pore activity during Zymo ‘full flowcell 1’ sequencing runs. a, Percent of channels that are labeled active throughout the Zymo bacterial depletion UNCALLED and control runs, based on the percent of signal labeled ‘pore’ or ‘strand’ in the MinKNOW duty times. Curves are smoothed by taking the mean of 92-minute windows, which smooths over mux scans. b, Number of channels that are ‘alive’ throughout the run, meaning they have the capacity to sequence reads, based on when the last read was produced. This is distinct from the duty time plots in that a channel may not produce a read for several hours but still be considered ‘alive’.

Extended Data Fig. 4 GM12878 Duty Times.

GM12878 gene enrichment run duty times in the a, unsheared run and b, sheared run. Nuclease flushes were carried out at 24 and 48 hours in both runs. Curves plotted as in Extended Data Fig. 3. Note: we observed that a large patch of channels were marked as inactive after the second flush in the unsheared UNCALLED run, which can occur because of bubbles introduced when loading.

Extended Data Fig. 5 SVs confirmed by applying sensitive parameters in Sniffles and SURVIVOR or which required manual inspection to correct.

a, Insertion detected by UNCALLED but not by ONT WGS because most reads represented it as < 50 bp. b, Insertion detected by ONT WGS but not by UNCALLED because of low-complexity sequence. The overlapping deletion on the other haplotype also likely made the insertion difficult to resolve. c, Insertions detected by UNCALLED but not by PacBio because of low-complexity sequence. d, Deletion detected by PacBio but not by UNCALLED. e, Deletion detected by UNCALLED (and all other long-read datasets) but not by Illumina reads, likely because of surrounding repetitive elements. Note that white read alignments indicate low mapping quality. f, Sniffles called two SVs in this locus in both UNCALLED and ONT WGS, while it appears to represent a single duplication. SURVIVOR merged the ONT WGS SVs but not the UNCALLED SVs, causing a falsely unmatched SV. This is a known issue with SURVIVOR and this case was manually corrected.

Extended Data Fig. 6 Zymo Full Flowcell 1 Gap Durations.

Durations of gaps between reads on channel 109 of the Zymo Full Flowcell 1 UNCALLED run. X-axis indicates when a read ended, Y-axis indicates how long until the next read begins (log scale). Dashed vertical lines indicate mux scans, which often correspond to when gap characteristics change due to pore transitions. The horizontal red line is at one standard deviation over the median gap length for the entire run (including other channels), which is the threshold the simulator uses to define active and inactive periods as represented by the top blue and red bars respectively.
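
The threshold described by the horizontal red line can be written down directly. The sketch below is an illustration only: `gap_threshold` and `split_gaps` are hypothetical names, and the use of the population standard deviation is an assumption, since the caption does not say which estimator was used.

```python
import statistics


def gap_threshold(all_gaps):
    """Median gap length plus one standard deviation, over gaps from every channel.

    ASSUMPTION: population standard deviation (pstdev); the caption does not
    specify population versus sample.
    """
    return statistics.median(all_gaps) + statistics.pstdev(all_gaps)


def split_gaps(channel_gaps, threshold):
    """Separate one channel's gaps into short gaps and long gaps."""
    short = [g for g in channel_gaps if g <= threshold]
    long_ = [g for g in channel_gaps if g > threshold]
    return short, long_


gaps = [1.0, 1.0, 1.0, 1.0, 100.0]  # seconds; made-up values
threshold = gap_threshold(gaps)
short, long_ = split_gaps(gaps, threshold)
```

Because gap durations are heavily skewed, a single very long gap inflates the standard deviation, so only gaps far above the typical value are classified as long.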

Extended Data Fig. 7 Outline of ReadUntil Simulator.

a, Outline of the ReadUntil simulator. Inputs are sequencing summaries of an UNCALLED run and a control run, in addition to the corresponding UNCALLED PAF file and the raw reads from the control run. The overall ‘pattern’ of the simulation is generated from the UNCALLED run: for each channel, gaps between the end of a read and the start of the next are separated into ‘short’ and ‘long’, where the long gaps are used to define broadly active and inactive periods of the channel (see Extended Data Fig. 6) and the short gaps are stored in a series of queues. The read chunks and durations are loaded from the control run. Each channel’s reads are stored in a queue and are output in the same order in which they were sequenced, but the exact time that they are output may vary between channels because of ejections. When all reads are output from a channel, the queue ‘repeats’ and outputs the same reads again. Short gaps are stored in similarly operating queues, each associated with a channel and scan interval. Scan intervals are periods between two mux scans which are synchronized across all channels. b, Illustration of how simulations can be shortened by scaling down the active/inactive periods and scan intervals, but leaving the read and short gap duration unchanged.
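
The repeating queues can be sketched for a single channel as follows. This is a simplified illustration under stated assumptions, not the simulator's code: `RepeatingQueue`, `simulate_channel`, `ejected` and `eject_after` are hypothetical names, and the sketch omits scan intervals and the active/inactive periods.

```python
from collections import deque


class RepeatingQueue:
    """FIFO queue that starts over from the first item once it is exhausted."""

    def __init__(self, items):
        if not items:
            raise ValueError("a repeating queue needs at least one item")
        self._items = deque(items)

    def pop(self):
        item = self._items.popleft()
        self._items.append(item)  # re-queue so the sequence repeats in order
        return item


def simulate_channel(reads, short_gaps, n_reads, ejected=frozenset(), eject_after=1.0):
    """Return (start_time, read_id) pairs for one channel.

    reads: list of (read_id, duration) in the order they were sequenced.
    ejected: HYPOTHETICAL set of read ids to eject; an ejected read occupies
    the pore for at most `eject_after` seconds instead of its full duration.
    """
    read_queue = RepeatingQueue(reads)
    gap_queue = RepeatingQueue(short_gaps)
    clock, schedule = 0.0, []
    for _ in range(n_reads):
        read_id, duration = read_queue.pop()
        schedule.append((clock, read_id))
        if read_id in ejected:
            duration = min(duration, eject_after)
        clock += duration + gap_queue.pop()
    return schedule


control = simulate_channel([("r1", 10.0), ("r2", 5.0)], [0.5], n_reads=3)
targeted = simulate_channel([("r1", 10.0), ("r2", 5.0)], [0.5], n_reads=3, ejected={"r1"})
```

The read order is identical in both schedules, but ejecting `r1` makes every later read start earlier, which is why output times can differ between channels even though the order within a channel is fixed.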

Extended Data Fig. 8 Simulation Results.

Simulated results of targeting sets of human genes: a, absolute enrichment with respect to gene count, b, absolute enrichment with respect to reference size, c, true positive rate with respect to gene count, d, true positive rate with respect to reference size. True positive rates were computed based on reads where the first 1,350 bp of each read fully aligns to the target reference according to minimap2. Note that reference size includes the 5 kbp surrounding each gene/exon, while the level of enrichment is calculated based on coverage of the target sequence only (see Supplementary Table 8).

Extended Data Fig. 9 Path Buffer Illustration.

Representation of alignments in path buffers. The ‘Virtual Alignment Forest’ is a more detailed version of the one in Fig. 1a. Pink edges mark paths that were pruned out due to lower probability in order to maintain the tree structure. Shaded backgrounds mark paths that have not been pruned out and are therefore represented in path buffers, and darker shading indicates that part of the path is represented in multiple buffers. ‘Path Buffers’ store cumulative log probabilities that can be used to compute a rolling mean log probability as mapping progresses, as well as ‘stay’ versus ‘move’ events represented by dotted versus solid lines. Seed mappings are inferred from the FM index coordinates, which are also stored in the buffers.
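
Storing cumulative log probabilities makes the rolling mean a constant-time subtraction. The sketch below shows that idea only: `PathBuffer` and its methods are hypothetical names, and for simplicity the history is kept in an unbounded list, whereas a real buffer would presumably be fixed-length.

```python
class PathBuffer:
    """Toy path buffer: cumulative log probabilities plus stay/move flags."""

    def __init__(self, window):
        self.window = window
        self.cumulative = [0.0]  # cumulative[i] = sum of the first i log probabilities
        self.moves = []          # True for a 'move' event, False for a 'stay'

    def extend(self, log_prob, is_move):
        """Append one event to the path."""
        self.cumulative.append(self.cumulative[-1] + log_prob)
        self.moves.append(is_move)

    def rolling_mean(self):
        """Mean log probability over the most recent `window` events."""
        n = min(self.window, len(self.moves))
        if n == 0:
            raise ValueError("no events in buffer")
        # Difference of two cumulative sums gives the window sum in O(1).
        return (self.cumulative[-1] - self.cumulative[-1 - n]) / n


path = PathBuffer(window=2)
for log_prob, is_move in [(-1.0, True), (-3.0, False), (-5.0, True)]:
    path.extend(log_prob, is_move)
mean_of_last_two = path.rolling_mean()  # (-3.0 + -5.0) / 2
```

A path whose rolling mean falls below some cutoff could then be pruned without rescanning its earlier events.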

Supplementary information

Reporting Summary

Supplementary Tables

Supplementary Tables 1–8.

About this article

Cite this article

Kovaka, S., Fan, Y., Ni, B. et al. Targeted nanopore sequencing by real-time mapping of raw electrical signal with UNCALLED. Nat Biotechnol 39, 431–441 (2021). https://doi.org/10.1038/s41587-020-0731-9

Received: 07 February 2020
Accepted: 07 October 2020
Published: 30 November 2020
Issue Date: April 2021
DOI: https://doi.org/10.1038/s41587-020-0731-9
