Long-read sequencing to understand genome biology and cell function

doi:10.1016/j.biocel.2020.105799

The International Journal of Biochemistry & Cell Biology

Volume 126, September 2020, 105799

https://doi.org/10.1016/j.biocel.2020.105799

Abstract

Determining the sequence of DNA and RNA molecules has had a huge impact on our understanding of cell biology and function. Recent advances in next-generation short-read sequencing (NGS) technologies, falling costs, and resolution down to the single-cell level have shaped our current view of genome structure and function. Third-generation sequencing (TGS) methods extend this knowledge through long reads and the ability to analyze DNA or RNA at the single-molecule level. Long-read sequencing provides additional possibilities to study genome architecture and the composition of highly complex regions, and to determine epigenetic modifications of nucleotide bases at a genome-wide level. We discuss the principles and advancements of long-read sequencing and its applications in genome biology.

Introduction

Massively parallel sequencing, also known as next-generation sequencing (NGS), arrived as a disruptive innovation in the life sciences. Within a couple of years, NGS led to a dramatic increase in knowledge of the genomes of different organisms, their architecture, function, and genetic variation down to the single-cell level (Shendure et al., 2017). Various methods based on semiconductors (Ion Torrent), pyrosequencing (454 Life Sciences, Roche), sequencing by ligation (Applied Biosystems), and sequencing by synthesis with reversible terminators (Solexa, Illumina) allowed fast and precise DNA and RNA sequencing (Metzker, 2010). However, short-read sequencing methods are limited in their ability to investigate complex genomes, repetitive elements, full-length transcripts, or native base modifications. Several of these limitations can be overcome by long-read technologies (third-generation sequencing technologies, TGS). In the following, we discuss the applications of long-read sequencing to understanding genome function. The review focuses on the technical applications of long-read methods, which can be applied to a wide range of questions in cell biology.

Section snippets

Nanopore sequencing

The original idea of analyzing nucleotide sequences with nanopores was born in the 1980s, but it took more than 30 years for the technology to reach market maturity (Company: Oxford Nanopore Technologies, ONT) (Deamer et al., 2016; Kasianowicz and Bezrukov, 2016). In Nanopore sequencing, a voltage is applied across a tiny pore to drive an ion flow. Each molecule entering the pore interferes with the ion flow and therefore induces a characteristic and measurable change in the current. ONT

Single molecule real-time (SMRT) sequencing

SMRT (single molecule real-time) sequencing from Pacific Biosciences (PacBio) also provides long reads of native DNA. The method relies on fluorescence-labeled nucleotides incorporated by a polymerase that is immobilized at the bottom of so-called ZMWs (zero-mode waveguides). These picoliter-sized wells are assembled on a flow cell and allow the detection of fluorescence signals from millions of molecules in parallel. In contrast to NGS methods, the incorporation of nucleotides is detected in

Other long-read/cytogenetic technologies

Synthetic long-read technologies provide alternative methods to obtain information on long DNA fragments. Methods such as linked-read sequencing (10x Genomics) and stLFR (MGI) allow the in silico assembly of long sequences from short-read NGS data. Moreover, next-generation cytogenetics enables the analysis of single DNA strands at megabase scale. Optical mapping approaches (Bionano) and molecular combing techniques (Genomic Vision) are among these novel cytogenetic approaches. Bionano utilizes

Structural variations, complex haplotypes and chromosomal rearrangements

Structural variations (SVs) are a rich source of genome evolution and inter-individual variation, but acquired SVs can also drive pathological processes such as cancer development. SVs, including copy number variants (deletions, amplifications), can be detected by comparative genomic hybridization approaches (SNP-arrays, CGH-arrays) and to a certain extent by short-read sequencing methods. However, complex structural rearrangements, inversions, balanced chromosomal translocations and other copy

Repeat architecture

The size and structure of many repetitive regions of genomes are hardly accessible with short-read sequencing technologies (Tørresen et al., 2019). However, an increasing number of repetitive elements have been linked to human diseases, which has led to a growing interest in the study of these regions (Hagerman et al., 2017; McColgan and Tabrizi, 2018; Paulson, 2018). Long-read sequencing enables their analysis in a single read and thus the exact determination of length, composition, and repeat
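
To illustrate what reading repeat length and composition from a single read can mean in practice, the following Python sketch finds the longest tract of a repeat unit in a read and reports the number of units and the positions of interrupting units (for example, AGG interruptions within a CGG tract). The function name `longest_repeat_tract`, its parameters, and the example sequence are hypothetical and chosen only for illustration; this is not a method taken from the studies cited above, and real analyses would additionally need to handle sequencing errors.

```python
import re


def longest_repeat_tract(read, unit, interruptions=()):
    """Return (unit_count, interruption_positions) for the longest tract.

    A tract starts with one copy of `unit` and continues with consecutive
    copies of `unit` or of any motif in `interruptions` (all of the same
    length as `unit`). Positions are 1-based unit indices within the tract.
    Illustrative sketch only: exact matching, no error tolerance.
    """
    allowed = (unit, *interruptions)
    if any(len(motif) != len(unit) for motif in allowed):
        raise ValueError("interruptions must have the same length as the unit")

    alternatives = "|".join(re.escape(motif) for motif in allowed)
    pattern = re.compile(f"{re.escape(unit)}(?:{alternatives})*")

    # Pick the longest matching tract; an empty string if the unit is absent.
    best = max((m.group(0) for m in pattern.finditer(read)), key=len, default="")

    k = len(unit)
    units = [best[i:i + k] for i in range(0, len(best), k)]
    positions = [i + 1 for i, motif in enumerate(units) if motif != unit]
    return len(units), positions


# Hypothetical read: 10 CGG units, one AGG interruption, 9 more CGG units.
example_read = "TTT" + "CGG" * 10 + "AGG" + "CGG" * 9 + "AAA"
result = longest_repeat_tract(example_read, "CGG", interruptions=("AGG",))
# result == (20, [11])
```

Because the whole tract lies within one read, the unit count and the interruption pattern come from the same molecule, which is the property that short reads cannot provide for long repeats.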

Epigenetic regulation

Over 150 types of base modifications have been described so far (Xu and Seki, 2020). These modifications are crucial in many aspects of biology, including development, cellular maintenance, ageing, or cancer. However, available sequencing technologies allowed only limited insight into nucleic acid modifications. Because base modifications lead to characteristic changes in the current profiles when the respective bases are pulled through nanopores, the method detects various chemical

RNA sequencing, alternative splicing, and single cell sequencing

Alternative splicing of mRNAs is a mechanism to increase protein diversity and function. Nanopore and SMRT sequencing allow entire transcripts to be determined within single reads, which provides a comprehensive view of isoforms and splicing events (Soneson et al., 2019). The power of long-read sequencing in RNA analysis is underlined by the fact that over 50% of the isoforms identified in Nanopore sequencing transcriptome analyses are not covered by short-read sequencing datasets (Workman et

De novo genome assembly

An important application of long-read sequencing is the de novo assembly of prokaryotic and eukaryotic genomes (van Dijk et al., 2018). Especially in polyploid organisms such as wheat or Xenopus species, and in regions of low complexity, long reads facilitate correct genome assembly into large contiguous contigs (Genova et al., 2019; Kapustová et al., 2019; Schmid et al., 2018; Schmidt et al., 2017; Shin et al., 2019; Wang et al., 2019). De novo assemblies are possible without laborious BAC or
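
The contiguity of an assembly is commonly summarized with the N50 statistic: the length L such that contigs of length L or longer together cover at least half of the total assembly length. The Python sketch below computes it; the function name `n50` and the example contig lengths are illustrative assumptions and do not come from the assemblies cited above.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if empty).

    Contigs are sorted from longest to shortest and summed until the
    running total reaches half of the total assembly length; the length
    of the contig that crosses this threshold is the N50.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


# Hypothetical assembly with six contigs (lengths in kb).
example_n50 = n50([80, 70, 50, 40, 30, 20])
# example_n50 == 70, because 80 + 70 = 150 >= 290 / 2
```

A higher N50 at a similar total length indicates that the same genome is represented by fewer, longer contigs, which is the effect long reads are expected to have on an assembly.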

Challenges of long-read sequencing

Preparing DNA for long-read sequencing has several pitfalls in terms of obtaining optimal sequencing libraries. Size selection can be an issue, since very large DNA molecules tend to block nanopores and very short molecules reduce the overall sequencing output. Moreover, libraries from freshly isolated DNA/RNA produce a higher output than long-term stored samples because they are less degraded and oxidized. Furthermore, sample purity is an issue due to the high input of DNA for long-read

Acknowledgements

The authors have no competing interests.

References (77)

A. Ameur et al.
Single-molecule sequencing: towards clinical applications
Trends Biotechnol.
(2019)
W.R. Jeck et al.
A nanopore sequencing-based assay for rapid detection of gene fusions
J. Mol. Diagn.
(2019)
H. Paulson
Repeat expansion diseases
Handb. Clin. Neurol.
(2018)
E.L. van Dijk et al.
The third revolution in sequencing technology
Trends Genet.
(2018)
S.L. Amarasinghe et al.
Opportunities and challenges in long-read sequencing data analysis
Genome Biol.
(2020)
S. Ardui et al.
Detecting AGG interruptions in male and female FMR1 premutation carriers by single-molecule sequencing
Hum. Mutat.
(2017)
D. Beyter et al.
Long read sequencing of 1,817 Icelanders provides insight into the role of structural variants in human disease
bioRxiv
(2019)
M.J.P. Chaisson et al.
Multi-platform discovery of haplotype-resolved structural variation in human genomes
Nat. Commun.
(2019)
M.A. Corbett et al.
Intronic ATTTC repeat expansions in STARD7 in familial adult myoclonic epilepsy linked to chromosome 2
Nat. Commun.
(2019)
M. Cretu Stancu et al.
Mapping and phasing of structural variation in patient genomes using nanopore sequencing
Nat. Commun.
(2017)

I. Legnini et al.
FLAM-seq: full-length mRNA sequencing reveals principles of poly(A) tail length control
Nat. Methods
(2019)



Structural variants identified by Oxford Nanopore PromethION sequencing of the human genome
bioRxiv
Three decades of nanopore sequencing
Nat. Biotechnol.
A chromosome-scale assembly of the sorghum genome using nanopore sequencing and optical mapping
Nat. Commun.
Sequencing smart: de novo sequencing and assembly approaches for a non-model mammal
Gigascience
Same-day genomic and epigenomic diagnosis of brain tumors using real-time nanopore sequencing
Acta Neuropathol.
Unstable TTTTA/TTTCA expansions in MARCH6 are associated with familial adult myoclonic epilepsy type 3
Nat. Commun.
Direct detection of DNA methylation during single-molecule, real-time sequencing
Nat. Methods
Highly parallel direct RNA sequencing on an array of nanopores
Nat. Methods
WENGAN: efficient and high quality hybrid assembly of human genomes
bioRxiv
Analysis of short tandem repeat expansions and their methylation state with nanopore sequencing
Nat. Biotechnol.
Using long-read sequencing to detect imprinted DNA methylation
Nucleic Acids Res.
Picky comprehensively detects high-resolution structural variants in nanopore long reads
Nat. Methods
Single-cell isoform RNA sequencing characterizes isoforms in thousands of cerebellar cells
Nat. Biotechnol.
Fragile X syndrome
Nat. Rev. Dis. Primers
Mapping DNA replication with nanopore sequencing
bioRxiv
Expansions of intronic TTTCA and TTTTA repeats in benign adult familial myoclonic epilepsy
Nat. Genet.
Linear assembly of a human centromere on the Y chromosome
Nat. Biotechnol.
The dark matter of large cereal genomes: long tandem repeats
Int. J. Mol. Sci.
Enabling high-accuracy long-read amplicon sequences using unique molecular identifiers and nanopore sequencing
bioRxiv
Enabling high-accuracy long-read amplicon sequences using unique molecular identifiers with nanopore or PacBio sequencing
bioRxiv
On 'three decades of nanopore sequencing'
Nat. Biotechnol.
Identification of DNA base modifications by means of Pacific Biosciences RS sequencing technology
Methods Mol. Biol.
Novel familial distal imprinting centre 1 (11p15.5) deletion provides further insights in imprinting regulation
Clin. Epigenetics
Alignment-free poly(A) length measurement for Oxford Nanopore RNA and DNA sequencing
RNA
De novo Nanopore read quality improvement using deep learning
BMC Bioinform.
