BnPIR: Brassica napus pan‐genome information resource for 1689 accessions
Plant Biotechnology Journal ( IF 10.1 ) Pub Date : 2020-10-17 , DOI: 10.1111/pbi.13491
Jia‐Ming Song 1, 2 , Dong‐Xu Liu 1, 3 , Wen‐Zhao Xie 1, 3 , Zhiquan Yang 1, 3 , Liang Guo 1 , Kede Liu 1 , Qing‐Yong Yang 1, 3 , Ling‐Ling Chen 1, 2, 3

Brassica napus (B. napus), which supplies approximately 13%–16% of the world's vegetable oil, was formed ~7500 years ago by interspecific hybridization between B. rapa and B. oleracea (Chalhoub et al., 2014). B. napus serves as an excellent model for polyploid genomics and evolutionary research in plants. The Brassica database (BRAD) has long been used for rapeseed genomic research and provides a genome browser and syntenic relationships for multiple Brassicaceae genomes (Wang et al., 2015). In addition, some widely used plant genomic databases such as Genoscope (http://www.genoscope.cns.fr/brassicanapus/) and EnsemblPlants (http://plants.ensembl.org/) also include B. napus genomes. However, these databases are based primarily on the genome of the cultivar Darmor‐bzh and lack multi‐omics data and rapeseed population information. In recent years, more and more B. napus genomes have been sequenced, and a single reference genome is not sufficient for analysing genetic differences within high‐profile species (Gan et al., 2011); the pan‐genome has therefore been proposed to solve this problem. A pan‐genome is a collection of the individual genomes of a species. It offers a new view of genome complexity and a map of the presence/absence variations (PAVs) of genes among these genomes. Recently, eight representative rapeseed cultivars were sequenced with PacBio technology and assembled into pseudo‐chromosomes, providing a new resource for rapeseed genomic research (Song et al., 2020). Based on these eight B. napus reference genomes and re‐sequencing data from 1688 rapeseed accessions, we constructed a comprehensive database, the B. napus pan‐genome information resource (BnPIR, http://cbi.hzau.edu.cn/bnapus). BnPIR is built on a gene information module, has the Pan‐genome Browser and Gbrowse Synteny at its core, and contains multi‐omics data and common bioinformatics tools.

Following the method proposed for the rice pan‐genome (Wang et al., 2018a, 2018b), we constructed the pan‐genome of B. napus with a ‘PVs + map‐to‐pan’ strategy based on the well‐assembled ZS11 reference genome (Figure 1a, Song et al., 2020). We first collected re‐sequencing data for 1688 rapeseed accessions with an average depth of 8× (Lu et al., 2019; Wang et al., 2018a, 2018b; Wu et al., 2019). Among them, seven representative accessions had deep re‐sequencing (104×–132×) and PacBio sequencing data (Song et al., 2020). The phylogenetic relationship of the 1689 accessions, including ZS11, is shown in Figure 1b; the accessions fall into spring‐type oilseed rape (SOR), semi‐winter oilseed rape (SWOR)‐I and SWOR‐II, and winter‐type oilseed rape (WOR) sub‐populations.

BnPIR was built on the Apache Tomcat HTTP web server (http://tomcat.apache.org/). All the genomic data, collinear data, homologs, gene expression, gene PAVs, metabolic pathways, accession information and related literature are organized and stored in a MySQL database (http://www.mysql.com/). Most information can be accessed dynamically through interactive user queries built with Highcharts (https://www.highcharts.com/) and JavaScript (https://www.javascript.com/). The web pages are constructed and displayed with a popular front‐end component library, Bootstrap (https://getbootstrap.com/). Figure 1c shows the representative resources and tools used to construct BnPIR. The rapeseed pan‐genome is displayed in JBrowse (Buels et al., 2016), which provides 1770 tracks for user‐selected display, including genes, transposable elements (TEs), expression profile data, presence frequency and coverage of different accessions (Figure 1d). For the best access speed and performance, we recommend selecting fewer than 30 tracks at a time. Furthermore, users can filter the tracks in batches according to region, country, subgroup, sequencing quality and so on.
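The batch filtering and the fewer‐than‐30‐tracks recommendation can be illustrated with a small sketch. This is not BnPIR's or JBrowse's own code: the metadata field names (`subgroup`, `country`), the track names and the `select_tracks` helper are all assumptions made for illustration.

```python
MAX_TRACKS = 29  # the text recommends selecting fewer than 30 tracks at a time


def select_tracks(tracks, max_tracks=MAX_TRACKS, **criteria):
    """Return up to max_tracks tracks whose metadata match every criterion."""
    matches = [
        track for track in tracks
        if all(track.get(key) == value for key, value in criteria.items())
    ]
    return matches[:max_tracks]


# Toy metadata: track names and field values are hypothetical.
tracks = [
    {
        "name": f"accession_{i}_coverage",
        "subgroup": "SOR" if i % 2 == 0 else "WOR",
        "country": "China" if i % 3 == 0 else "France",
    }
    for i in range(100)
]

chosen = select_tracks(tracks, subgroup="SOR", country="China")
print(len(chosen), [t["name"] for t in chosen[:3]])
```

Capping the selection after filtering keeps the request within the recommended limit even when a filter matches many accessions.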

Figure 1
The architecture and representative resources of BnPIR. (a) The pipeline of the ‘PVs + map‐to‐pan’ strategy used to construct the pan‐genome of B. napus. (b) The phylogenetic tree of 1689 rapeseed accessions. (c) The three‐layer architecture of BnPIR. (d) Pan‐genome browser. (e) Population variations. (f) Gene information page. (g) Gbrowse. (h) Gbrowse synteny of multiple genomes. (i) 1689 rapeseed accessions. (j) Gene expression. T0: 24 days post‐sowing; T1: 54 days post‐sowing; T2: 82 days post‐sowing; T3: 115 days post‐sowing; T4: 147 days post‐sowing. (k) KEGG pathway. (l) Literature of rapeseed.

Compared with the ZS11 reference genome, the B. napus pan‐genome adds 781.9 Mb of sequence and 21 020 protein‐coding genes. Genes are classified by their presence in each variety into ‘core genes’ (present in ≥95% of all rapeseed accessions) and ‘distributed genes’ (present in <95% of all accessions). Distributed genes are further divided, according to their frequency in different subspecies, into ‘subspecies imbalance genes’ (frequency in one subspecies significantly higher than in the others, P value < 0.05), ‘subspecies specific genes’ (>95% in one subspecies) and ‘random genes’ (all other distributed genes). Users can quickly query the classification of a gene and display its PAVs across rapeseed populations on the phylogenetic tree. Breeders can then focus on accessions in which a gene of interest is present when selecting donors for their breeding purposes. In addition to large PAVs in the pan‐genome, we identified 43 633 669 SNPs and 7 809 506 InDels. We also provide an interactive interface, built with TASUKE (https://tasuke.dna.affrc.go.jp/), that displays sequence variation in 159 high‐coverage accessions. The frequency of variations, depth of coverage and annotation of multiple genomes are shown in a web‐based genome browser (Figure 1e).
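The core/distributed split described above reduces to a frequency threshold on a presence/absence matrix. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the accession names, PAV calls and helper functions are hypothetical, and the statistical test behind ‘subspecies imbalance genes’ is not implemented because the text does not name it.

```python
from collections import defaultdict

CORE_THRESHOLD = 0.95  # from the text: core genes are present in >= 95% of accessions


def presence_frequency(pav_row):
    """Fraction of accessions in which a gene is present (1) rather than absent (0)."""
    return sum(pav_row.values()) / len(pav_row)


def classify_gene(pav_row, threshold=CORE_THRESHOLD):
    """Return 'core' or 'distributed' for one gene's presence/absence calls."""
    return "core" if presence_frequency(pav_row) >= threshold else "distributed"


def subgroup_frequencies(pav_row, subgroup_of):
    """Presence frequency of one gene within each sub-population."""
    present = defaultdict(int)
    total = defaultdict(int)
    for accession, call in pav_row.items():
        group = subgroup_of[accession]
        total[group] += 1
        present[group] += call
    return {group: present[group] / total[group] for group in total}


# Toy data: hypothetical accessions and PAV calls, for illustration only.
subgroup_of = {"acc1": "SOR", "acc2": "SOR", "acc3": "WOR", "acc4": "WOR"}
pav = {
    "geneA": {"acc1": 1, "acc2": 1, "acc3": 1, "acc4": 1},
    "geneB": {"acc1": 1, "acc2": 1, "acc3": 0, "acc4": 0},
}

for gene, row in pav.items():
    print(gene, classify_gene(row), subgroup_frequencies(row, subgroup_of))
```

The per‐subgroup frequencies are the quantities on which the finer ‘subspecies specific’ and ‘subspecies imbalance’ categories would be decided.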

We developed flexible query pages to efficiently retrieve and visualize various types of resources. For example, a keyword‐based search for a gene locus (e.g. BnaA10G0244800ZS) or gene name (e.g. FLOWERING LOCUS C) links to a detailed gene information page. In total, the gene information pages cover 773 065 protein‐coding genes. The basic information for each gene includes chromosomal location, coding sequence length, exon number, gene structure, alternative splicing, nucleic acid sequence, encoded protein sequence, expression data, gene ontology, functional domains, gene classification (core/distributed) and frequency in subspecies (Figure 1f). Moreover, users can open Gbrowse (https://www.gbrowse.org/) to visualize the detailed gene context and upstream/downstream features (Figure 1g). Gbrowse synteny shows collinearity and structural variations relative to other genomes (Figure 1h). The basic local alignment search tool (BLAST) (Altschul et al., 1990) is provided as a sequence‐based search engine; homologs can be retrieved from multiple B. napus, B. rapa and B. oleracea genomes, with alignment results presented in graphical and textual formats. Users can view and download information on the 1689 accessions, including subgroup, region and sequencing depth, on the rapeseed accession table page (Figure 1i). The gene expression module visualizes gene expression in different accessions throughout the flowering period (Figure 1j). The metabolic pathways based on KEGG orthologs (Kanehisa et al., 2009) of the eight rapeseed accessions with reference genomes are provided in BnPIR (Figure 1k). To facilitate gene comparison and retrieval of target genes in different reference genomes, BnPIR provides a unique gene index based on collinear orthologs in nine rapeseed genomes, comprising two SORs (Westar and No2127), four SWORs (ZS11, Gangan, Zheyou7 and Shengli) and three WORs (Darmor‐bzh, Tapidor and Quinta) and covering a total of 88 345 protein‐coding genes. Users can compare gene structural differences among the nine accessions in combination with Gbrowse synteny.

BnPIR also contains practical calculation tools for comparative, evolutionary and functional analysis of rapeseed and closely related species. We used OrthoMCL (https://orthomcl.org/; e‐value: 1e‐5) to identify putative orthologs and paralogs among plant genomes, including the eight newly sequenced rapeseed accessions (Song et al., 2020), B. napus Darmor‐bzh, Arabidopsis thaliana, B. rapa and B. oleracea, and obtained clusters of closely related genes in these species. A total of 109 001 putative homologous groups were identified and stored in BnPIR for query and download. Sub‐genomes A and C are derived from the B. rapa and B. oleracea genomes, respectively. We used MUMmer (http://mummer.sourceforge.net/) to determine the collinear regions between sub‐genomes and to visualize the statistical results. A text mining tool is available in BnPIR, which allows users to search for references by gene name or keyword in 9971 rapeseed‐related articles obtained from PubMed (Figure 1l). In addition, because the NOD‐like receptor (NLR) gene families play important roles in plant growth and crop breeding, we have comprehensively identified and annotated the related genes in different accessions and stored them in BnPIR.
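The e‐value cutoff quoted for OrthoMCL can be illustrated on BLAST tabular output, where the e‐value is the eleventh column of the standard 12‐column format. The sketch below only shows the cutoff step; it is not OrthoMCL itself and performs no clustering. The gene names and hit values are made up for illustration.

```python
E_VALUE_CUTOFF = 1e-5  # cutoff quoted in the text


def filter_hits(lines, cutoff=E_VALUE_CUTOFF):
    """Keep (query, subject, e-value) for tabular BLAST hits at or below the cutoff."""
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= cutoff:
            kept.append((query, subject, evalue))
    return kept


# Toy hits in 12-column tabular format; identifiers and scores are hypothetical.
blast_lines = [
    "geneA\tgeneB\t98.5\t300\t4\t0\t1\t300\t1\t300\t1e-150\t550",
    "geneA\tgeneC\t35.0\t80\t50\t2\t10\t90\t5\t85\t0.002\t40.1",
    "geneB\tgeneD\t72.0\t250\t70\t1\t1\t250\t3\t252\t3e-40\t160",
]

for hit in filter_hits(blast_lines):
    print(hit)
```

Only the hits that survive this cutoff would be passed on to a clustering step.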

In summary, we have established a comprehensive functional genomic platform, BnPIR, as a new tool for querying and visualizing rapeseed genomes and the pan‐genome based on 1689 accessions. BnPIR contains genomic sequences, gene annotations, phylogenetic relationships, expression data, PAV information, gene classification and common multi‐omics tools for the 1689 rapeseed accessions, and integrates quick search with visualization. BnPIR will be a rich resource for rapeseed molecular biology and breeding: it will help rapeseed researchers to search and visualize their results in a pan‐genome context, and it provides a valuable template for pan‐genome analyses in other species.




Updated: 2020-10-17