Background & Summary

Pyricularia oryzae (syn., Magnaporthe oryzae), an ascomycete fungal pathogen, causes rice blast, one of the most destructive diseases of rice throughout the world1,2. The pathogen is an important and long-established model species for understanding fungal-plant interactions3,4. Previously, we sequenced and assembled the first genomes of field strains (P131 and Y34) and performed a comparative analysis between the laboratory and field strains, which demonstrated that translocation of transposable elements (TEs), gain or loss of isolate-specific genes and gene family expansion are essential factors, delimiting genomic plasticity and adaptability of P. oryzae5. Although these assemblies had facilitated the understanding of the genome characteristics of P. oryzae, the genome of the two strains were highly fragmented to more than one thousand scaffolds, for Sanger (2-fold) and 454 (18-fold) sequencing technologies were used in the previous study. Recently, over 50 genomes of different strains of P. oryzae have been available in public genome databases. These genomes were sequenced on the next-generation sequencing platforms, such as second-generation sequencing platforms (e.g., Illumina sequencers) and/or third-generation sequencing platforms [e.g., Pacific Biosciences (PacBio)], which facilitated the genetic studies of genomic changes and pathogenicity variation within P. oryzae6,7,8. However, currently most of these assemblies are fragmented and contain a large number of unplaced contigs and/or gaps owing to the presence of repetitive DNA elements in the P. oryzae genomes, which prevented the dissection of molecular mechanisms of adaptive evolution. Since the importance of genome assembly completeness in genomic analysis, we re-assemble the genome of P. oryzae stain P131 by combining Illumina, PacBio sequencing and high throughput chromatin conformation capture (Hi-C) mapping, which was the first telomere-to-telomere gapless assembly of the P. oryzae genome.

A total of 10.03 Gb PacBio long-read sequencing data (~250x genome coverage) and 4.44 Gb Illumina short-read sequencing data were generated (Table 1). Hi-C library was prepared, sequenced and generated 5.57 Gb sequencing data (~140x genome coverage). The long reads were de novo assembled and corrected. The short reads were used to polish the assembly. Redundant genomic contigs or mitochondrial contigs were then removed. The Hi-C sequencing data were used to anchor and refined remained contigs. The mitochondrial genome was assembled independently by Mitochondrial Long-read Iterative Assembly (MLIA) pipeline9. The final polishing of the complete genome was performed. Finally, seven gapless chromosomes (43,237,743 bp with a contig N50 of 7.05 Mb; Fig. 1a) and a circular mitochondrial genome (34,866 bp; Fig. 1b) were constructed in the final assembly (Fig. 2). The new assembly represented a significant improvement over the previous version GCA_000292605.15,10 (1,823 assembled contigs and contig N50 = 12.3 kb; see Table 2 and Fig. 3).

Table 1 Summary of sequencing raw data of P. oryzae strain P131.
Fig. 1
figure 1

The P. oryzae strain P131 genome assembly. (a) Nuclear genome: Track 1 illustrates the seven assembled nuclear chromosomes with the indicated sizes. The red arrows at the end of chromosomes indicate telomeric repeat sequences (TTAGGG)n. Tracks 2 through 4 show transposon distribution, gene density, and GC density on the seven chromosomes, respectively. (b) Mitochondrial genome: Track 5 depicts a circular illustration of the mitochondrial genome carrying genes (represented by different color blocks), including genes encoding 14 standard fungal core PCGs (nad1-nad6, nad4L, cob, cox1-cox3, atp6, atp8, and rps3), ribosomal subunits (rns and rnl) and 27 tRNA genes. Tracks 6 and 7 display the introns present in the above genes and GC density of the mitochondrial genome, respectively.

Fig. 2
figure 2

An illustration of Hi-C genome-wide interaction map result.

Table 2 Summary of the genome assembly.
Fig. 3
figure 3

The alignment of scaffolds from the previous assembly to the new assembled chromosomes. The X and Y axis represented the new chromosomes and previous scaffolds, respectively.

The nuclear genome was annotated by Braker2 pipeline11. The mitochondrial genome was annotated by MFannot12 using genetic code 4. In conclusion, the nuclear genome is predicted to contain 14,968 genes (including 20,797 transcripts), and the mitochondrial genome is likely to carry 14 conserved protein-coding genes (Table 3). A total of 99.9% of the BUSCOs were mapped onto the P131 genome assembly. Approximately 14.31% of the genome carried repeat sequences, most of which were TEs, which was significantly greater than the previous version (Table 4).

Table 3 Detailed summary of assembled chromosomes.
Table 4 Classification of repeat sequences.

The telomere repeat sequence (TRS) (TTAGGG)n was presented on both ends of chromosomes 2, 4, 5, 6, and 7 and one end of chromosomes 1 and 3 in our assembly. We then compared the TRS in the published near-complete assembled genome of P. oryzae strains with the genome assembly generated in this study. Interestingly, minority deficiency and telomere variability of TRSs in P. oryzae were extensively observed, which may play subtle roles in pathogenic adaptation13,14,15. In summary, we assembled the first telomere-to-telomere gapless genome of P. oryzae, which can be instrumental in understanding the genome evolution and host adaptation in the rice blast fungus.

Methods

Sampling and DNA extraction

The P. oryzae strain P131 was grown and maintained on oatmeal tomato agar (OTA) plates16. Conidia were produced on OTA plates and harvested from 7-day culture plates grown at 25 °C under constant fluorescent light. Hyphae were collected from 2-day-old cultures in complete medium shaken at 150 rpm at 25 °C. Genomic DNA extracted from vegetative mycelia using cetyltrimethylammonium bromide (CTAB) protocol was used for genome sequencing17.

Illumina, PacBio and Hi-C sequencing

Genome sequencing was conducted on Pacific Biosciences Sequel (PacBio, Menlo Park, CA) at CapitalBio Technology Co., Ltd (Beijing, China). Qualified genomic DNA was fragmented with G-tubes (Covaris, Woburn, MA, USA) and end-repaired to prepare SMRTbell DNA template libraries (with fragment size of >10 kb selected). Library quality was detected by Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific, Waltham, MA, USA, Q33230). The average fragment size was estimated on Bioanalyzer 2100 (Agilent, Santa Clara, CA). SMRT sequencing was performed on the Pacific Biosciences RSII sequencer (PacBio, Menlo Park, CA) according to standard protocols using the P4-C2 chemistry. A total of 10.03 Gb PacBio sequencing data with a subread N50 of 14.5 kb. In addition, Illumina HiSeq X Ten sequencer using paired-end technology was also used to perform genome sequencing and 4.44 Gb sequencing data (150 bp paired-end reads) were yielded at CapitalBio Technology Co., Ltd (Beijing, China).

Hi-C library was prepared from cross-linked chromatins of fungal mycelia by Novogene Co., Ltd (Beijing, China). In brief, the tissue was ground and then cross-linked with 4% formaldehyde solution. After the sample of crosslinking reaction and cell lysis, nuclei were digested with 4-cutter restriction enzyme DpnII. Subsequently, ligated DNA was purified and fragmented into 300 bp size on average. The constructed Hi-C library was sequenced by Illumina NovaSeq 6000. 5.57 Gb paired-end sequencing data (150-bp length) were generated. The Hi-C maps from raw data were performed by Juicer (v1.6)18, followed by using a manually correction with Juicebox (v2.13.07)19.

RNA sequencing and analysis

Total RNA was extracted from conidia and hyphae with the Trizol reagent (Invitrogen, Carlsbad, CA, USA, 15596026) and then enriched by RNeasy Pure mRNA Bead Kit (Qiagen, Germany), respectively. High-throughput cDNA libraries were prepared according to the Illumina whole transcriptome library preparation protocol and sequenced on the Illumina GA platform by the BGI Genomics (Shenzhen, China)20. Quality control was performed by FastQC (v0.11; https://github.com/s-andrews/FastQC). RNA-Seq data were mapped to P. oryzae by HISAT2 (v2.2.1)21, and SAMTools (v1.12)22 were used to evaluate read alignments.

Genome assembly

The de novo long-read assembler Canu v2.1.123 (parameters: genomeSize = 44 m corOutCoverage = 200 corMinCoverage = 2 minReadLength = 4000 minOverlapLength = 800 correctedErrorRate = 0.050) was used to assemble PacBio reads to generated draft contigs, which were then corrected by GCpp v1.9 (https://github.com/PacificBiosciences/gcpp; parameters:–algorithm = arrow -x 5 -X 200 -q 40) using PacBio long-reads. The polishing step was performed by Pilon v1.2324 (parameters:–changes–vcf) using the Illumina short reads. Contigs were considered redundant if they aligned concordantly (identity >99%) with another contig, and the redundant contigs, along with mitochondrial contigs, were removed, resulting in a total of 13 contigs. The Hi-C sequencing data were used to anchor all 13 contigs using Juicer v1.618, resulting in 7 scaffolds, which were further refined using Juicebox v2.13.0719. Gaps within the scaffolds were filled using LR_Gapcloser25. We then manually checked whether long reads aligned the bridging cross the gaps, or whether overlapping contig ends (>20 kb length and 99.9% sequence identity) existed.The mitochondrial genome was assembled independently by Mitochondrial Long-read Iterative Assembly (MLIA) pipeline9. The final polishing of the complete genome was performed again using Pilon v1.2324.

Gene model and function annotations

Repetitive sequences of P. oryzae strain P131 was firstly de novo identified via RepeatModeler (v2.0.1)26 and masked by RepeatMasker (v4.1.1)27 (parameters: -e rmblast -pa 30 -xsmall -nolow -norna -gff -a). The nuclear genome was annotated by Braker2 pipeline11 (parameters: –softmasking –gff3 –fungus –gth2traingenes –prg = gth), combining three aspects evidences: ab initio prediction, homologous proteins, and RNA-Seq evidences. The AUGUSTUS v3.4.028 and Genemark-EP+29 was used as ab initio prediction tools in the pipeline. All proteins of the genus Pyricularia in the Uniref100 database30 were collected and the 100% non-redundant protein dataset was built by cd-hit31 (parameters: -c 1.00 -aS 1.00 -aL 1.00 -n 5 -M 20000), which was used as the protein-based training evidence. The GenomeThreader v1.7.332 was used as the alignment tool. RNA-Seq data previous used20 (i.e. SRR1517063833, SRR1517063734 and SRR1517063635) were aligned by HISAT2 (v2.2.1)21 (parameters: -t -dta). The mitochondrial genome was annotated by MFannot12 with genetic code 4.

Data Recodes

The raw genomic sequencing data used and/or analyzed during the current study are available at NCBI Sequence Read Archive database (Accession number SRR2489091036, SRR2489091137 and SRR2489091238). The assembled genome was deposited under the same BioProject with P. oryzae strain P131 at NCBI (Accession number: GCA_000292605.239; BioProject ID: PRJNA82693; BioSample ID: SAMN31867770). The accession numbers from Chr1 to Chr7 chromosome sequences were CP114135 to CP114141, respectively. And the accession number corresponding to the mitochondrial genome sequence was CP114142.

Technical Validation

DNA sample quality

The DNA quality was detected using Qubit (Thermo Fisher Scientific, Waltham, MA) and Nanodrop (Thermo Fisher Scientific, Waltham, MA).

Sequencing data assessment

The short read data were assessed by fastp v0.2340. The genomic short sequencing reads had 49.75% GC content. The Q20 and Q30 percentages were 97.1% and 92.06%, respectively. The Hi-C sequencing data had 50.5% GC content, and had quality scores of 97.67% (Q20) and 93.64% (Q30), respectively.

Evaluation of the genome assembly

The genome assembly quality was evaluated through the Benchmarking Universal Single-Copy Orthologs (BUSCO) tool with the “fungi_odb10” lineage as a reference dataset. The results showed that 99.9% of all 758 BUSCO markers were assembled, implying a high level of completeness of the assembly. In addition, the results generated from “ascomycota_odb10” lineage showed 99.4% of all 1706 BUSCO markers were include (Table 2).