Towards comprehensive integration and curation of chloroplast genomes
Plant Biotechnology Journal (IF 10.1), Pub Date: 2022-09-07, DOI: 10.1111/pbi.13923
Zhongyi Hua, Dongmei Tian, Chao Jiang, Shuhui Song, Ziyuan Chen, Yuyang Zhao, Yan Jin, Luqi Huang, Zhang Zhang, Yuan Yuan

Dear Editor,

Chloroplasts are semi-autonomous genetic organelles that contain their own DNA. Since the first chloroplast genome was sequenced in 1986, chloroplast genomes have been used extensively as fundamental tools in plant phylogenetics and have been genetically modified to produce protein drugs, notably in the fight against COVID-19 (Daniell et al., 2021, 2022). Additional chloroplast genome sequences would not only deepen our understanding of plant diversity and evolution (Daniell et al., 2016) but also support chloroplast biotechnology through codon optimization and the identification of the non-conserved intergenic spacer regions and regulatory sequences needed for genetic engineering (Daniell et al., 2021). Consequently, the chloroplast genomes of numerous plant species, particularly economically significant crops, have been sequenced at a steady pace. Powered by high-throughput sequencing, over 7000 plant chloroplast genomes have been deposited in the National Center for Biotechnology Information (NCBI) organelle genome database, of which over 50% were sequenced in the last 3 years (1082, 1175, and 1539 in 2019, 2020, and 2021, respectively). As data have accumulated, problems such as inaccurate taxonomic information (Locatelli et al., 2020) and inconsistent genomic terminology (Abeysooriya et al., 2021) have emerged, making chloroplast genomes considerably harder to use. Despite many efforts, current databases still lack comprehensive data collection and curation. Most existing curated databases are taxon-specific (e.g., CpGDB for spermatophytes [Singh et al., 2020] and OGDA for algae [Liu et al., 2020]) or limited to certain data types (e.g., ChloroMitoSSRDB for simple sequence repeats [SSRs; Sablok et al., 2015]) and could be improved by incorporating more comprehensive data. Additionally, each published organelle genome database covers only a fraction of chloroplast genomes, and thousands of chloroplast genomes remain dispersed across different nucleotide databases. Therefore, there is an urgent need for an integrated portal that comprehensively collects and curates chloroplast genomes.

Here, we developed the Chloroplast Genome Information Resource (CGIR), an integrated platform (https://ngdc.cncb.ac.cn/cgir) comprising 19 388 chloroplast genome assemblies and their corresponding meta-information (Figure 1a). The CGIR comprises five modules: (1) genomes, (2) genes, (3) SSRs, (4) barcodes, and (5) DNA signature sequences (DSSs; Figure 1b). The ‘Genomes’ module displays 19 388 chloroplast assemblies from 11 946 different species (Figure 1c). Notably, we sequenced 1170 of these assemblies, from 718 species; the chloroplast genomes of 307 of these species are reported for the first time, including the first reports for one family (Juncaginaceae) and 53 genera. In addition to increasing the number of sequenced species, the newly added assemblies provide chloroplast genomes from multiple individuals for a number of species. Compared with the NCBI Organelle Genome Database and CpGDB, which provide only one assembly per species, multiple assemblies with explicit taxonomic information offer more information for plant phylogeny. The taxonomic information of assemblies was curated in accordance with The Catalogue of Life Checklist 2021 to eliminate inconsistencies across different databases and contributors (Figure 1d). Functional information on plant species was integrated into the CGIR according to the World Checklist of Useful Plant Species. The ‘Genes’ module contains information on genes as well as their associated coding DNA sequences (CDSs) and protein sequences (Figure 1e). To ensure a high-quality dataset, we first unified gene names by correcting inconsistent capitalization, spelling mistakes, extra characters, and synonymous gene names (Figure 1f). More importantly, we corrected annotation errors in addition to unifying names. For example, the gene NADH–ubiquinone oxidoreductase chain 6 (nad6) should be encoded in the mitochondrial genome, yet it appears in some chloroplast genome annotations, such as that of Bulbophyllum reptans (GenBank accession: NC_058531.1). By manual curation, we confirmed that the gene annotated as nad6 in NC_058531.1 is in fact ndhG (Figure 1f).
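The name-unification step described above can be illustrated with a minimal Python sketch. It is not the CGIR pipeline: the canonical list, the synonym table, and the function name `normalize_gene_name` are all assumptions made for illustration, and the text does not specify the actual curation rules. Names that cannot be resolved automatically (such as a mitochondrial `nad6` appearing in a chloroplast annotation) are returned as `None`, standing in for the manual curation the authors describe.

```python
import re

# Hypothetical canonical chloroplast gene names (illustrative subset only).
CANONICAL = ["rbcL", "matK", "ndhG", "psbA", "ycf3"]

# Hypothetical synonym table, keyed by lower-case name (illustrative only).
SYNONYMS = {"pafi": "ycf3"}

# Case-insensitive lookup from lower-case form to canonical spelling.
_BY_LOWER = {name.lower(): name for name in CANONICAL}


def normalize_gene_name(raw):
    """Return the canonical gene name, or None if manual review is needed."""
    # Drop whitespace and any extra non-alphanumeric characters.
    key = re.sub(r"[^A-Za-z0-9]", "", raw).lower()
    # Map known synonyms onto their canonical (lower-case) form.
    key = SYNONYMS.get(key, key).lower()
    # Restore canonical capitalization; unknown names yield None.
    return _BY_LOWER.get(key)


if __name__ == "__main__":
    for name in [" RBCL ", "matk", "psbA;", "PafI", "nad6"]:
        print(repr(name), "->", normalize_gene_name(name))
```

Returning `None` rather than guessing keeps automatic fixes limited to formatting and known synonyms, and leaves biologically questionable annotations for a curator to resolve.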

Figure 1
Architecture of the CGIR. (a) Design and construction of the CGIR, (b) the CGIR homepage, (c) the Genome module, (d) the curation model of taxonomic information, (e) the Gene module, (f) the curation model of gene annotation, (g) the Barcode module, (h) the DSS module, (i) the SSR module, (j) the Taxonomy tree view, (k) the Download module, (l) BarcodeBlast, (m) BarcodeFinder, and (n) the statistics of CGIR.

To make better use of these chloroplast genomes, the remaining three modules contain three commonly used types of DNA markers derived from chloroplast genomes. The ‘Barcodes’ module contains DNA barcodes extracted from 29 different loci using the electronic PCR approach (Figure 1g), making the CGIR an excellent complement to traditional DNA barcode databases (e.g., Barcode of Life Data System [BOLD; Ratnasingham and Hebert, 2007]), whose barcodes come mainly from the rbcL and matK loci. The ‘DSS’ module contains the candidate DSSs from all species with more than one chloroplast assembly deposited in the CGIR (Figure 1h). A DSS is a species-level marker that can be used as a complement to conventional DNA markers (Hua et al., 2022). The ‘SSR’ module comprises 7 284 705 SSRs and their associated primers (Figure 1i), far more than any other plastid SSR database.
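The electronic PCR idea mentioned above can be sketched in a few lines of Python. This is a simplified illustration, not the tool or parameters used to build the CGIR: the function names, the `max_len` cut-off, and the toy sequences are assumptions, and the sketch searches only one strand with exact primer matches, whereas practical e-PCR tools typically tolerate mismatches and consider both orientations.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def in_silico_pcr(template, forward, reverse, max_len=2000):
    """Return amplicons (primers included) found on the given strand.

    Exact matching only: the forward primer must occur in the template,
    followed downstream by the reverse complement of the reverse primer.
    """
    template = template.upper()
    fwd = forward.upper()
    rev_rc = reverse_complement(reverse.upper())
    amplicons = []
    start = template.find(fwd)
    while start != -1:
        # Look for the nearest reverse-primer site downstream of the forward primer.
        end = template.find(rev_rc, start + len(fwd))
        if end != -1:
            stop = end + len(rev_rc)
            if stop - start <= max_len:
                amplicons.append(template[start:stop])
        # Continue with the next forward-primer site, if any.
        start = template.find(fwd, start + 1)
    return amplicons


if __name__ == "__main__":
    # Toy template and made-up primers, not real barcode primers.
    genome = "TTTACGTACGGGGCCCCTTGACAAAA"
    print(in_silico_pcr(genome, "ACGTAC", "TGTCAA"))
```

Applied to a full chloroplast assembly with one primer pair per locus, the same loop would yield one candidate barcode sequence per locus and assembly.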

In addition, the CGIR provides various methods for viewing, searching, and downloading data. To help users find the genome of a given taxon, the ‘Genomes’ module supports searching by species name as well as by class, order, family, and genus names. Synonyms are also listed in the search results, helping researchers decide whether to use these assemblies (Figure 1c). Because chloroplast data are frequently used in interspecies comparisons, the CGIR also provides a taxonomy tree view for users interested in specific aspects of chloroplast data (e.g., the rbcL gene or CDS sequences) in higher taxa (Figure 1j). Using this view, users can browse, search, and retrieve gene, barcode, and DSS data at any taxonomic level. A separate download module is also provided for easy data downloads (Figure 1k). Additionally, the ‘BarcodeBLAST’ tool allows users to search their barcode sequences against those deposited in the CGIR using BLAST (Figure 1l), and the ‘BarcodeFinder’ tool helps users identify barcode regions in their uploaded chloroplast sequences (Figure 1m).

Overall, the integration of high-throughput sequencing, public genomic resources, and careful manual curation ensures both the quantity and quality of chloroplast data in the CGIR, making it the largest comprehensive chloroplast repository available (Figure 1n). The CGIR will be a valuable resource for researchers working on phylogenetics and chloroplast genetic engineering. The curated taxonomic information and molecular markers are of considerable value to plant phylogenetics; the labelled featured plants and corrected gene information will help researchers identify suitable research objects and locate intergenic spacer regions, both of which are necessary for designing chloroplast engineering vectors (Daniell et al., 2021). In the future, the CGIR will be continuously updated to incorporate more types of data.



Updated: 2022-09-07