PsORF: a database of small ORFs in plants.
Plant Biotechnology Journal (IF 13.8), Pub Date: 2020-04-25, DOI: 10.1111/pbi.13389
Yanjun Chen 1, Danyang Li 1, Weiliang Fan 1,2, Xiaoming Zheng 1, Yifan Zhou 1, Hanzhe Ye 1, Xiaodong Liang 1, Wei Du 1, Yu Zhou 1,2, Kun Wang 1

Small open reading frames (sORFs), which are translated into small peptides (100 amino acids or fewer in length), have historically been excluded from genome annotations. In recent years, a growing number of biologically significant sORFs have been found to encode functional peptides or to play regulatory roles in mRNA translation. In plants, an evolutionarily ancient micro‐peptide, AtLURE1, promotes and maintains reproductive isolation by accelerating conspecific pollen tube penetration (Zhong et al., 2019). sORFs in the 5’ UTR of an mRNA, usually termed upstream ORFs (uORFs), have been reported to mediate translational regulation of their downstream main ORFs (mORFs) (Xu et al., 2017).

Recent advances in translatomics (especially ribosome profiling, Ribo‐seq) and MS‐based proteomics have indicated that sORFs are pervasive in non‐coding RNAs, the UTRs of mRNAs, circular RNAs (circRNAs), etc. (Wang et al., 2019). For animals, there are two public sORF databases: SORFS.ORG (Olexiouk et al., 2016) and SmProt (http://bioinfo.ibp.ac.cn/SmProt/) (Hao et al., 2017). Both integrate Ribo‐seq and MS‐based proteomics data from animals to annotate sORFs. For plants, the database ARA‐PEPs (http://www.biw.kuleuven.be/CSB/ARA‐PEPs) has been constructed (Hazarika et al., 2017). ARA‐PEPs identifies sORFs using the criterion that a peptide sequence of at least 10 amino acids begins with a canonical start codon and is not truncated by a stop codon. It is a repository only for sORFs in Arabidopsis thaliana, and its 13 748 candidate sORFs lack translational evidence, being supported only by RNA expression evidence (microarray and RNA‐seq). A database of systematic sORF annotations in plants is therefore still missing, which not only hinders cross‐species studies in plants but also limits cross‐kingdom comparative analysis between animals and plants.

In this study, we collected multi‐omic data, including genome, transcriptome, Ribo‐seq and mass spectrometry (MS) data, from public databases and built a pipeline to identify sORFs in 35 plant species. Based on the results, we designed a web‐accessible database, PsORF (http://psorf.whu.edu.cn/).

PsORF integrates released data from multiple databases to obtain a set of sORFs derived from the non‐coding regions annotated in reference genomes. We collected 35 reference genomes with well‐annotated UTRs and lncRNAs from the PLAZA database (https://bioinformatics.psb.ugent.be/plaza/). Five plant species with Ribo‐seq and MS data available in public databases were selected for analysis to obtain translational evidence for sORFs: two eudicots (Arabidopsis thaliana and Gossypium arboreum), two monocots (Oryza sativa and Zea mays) and an alga (Chlamydomonas reinhardtii). In total, we collected 103 Ribo‐seq datasets for these five species from NCBI (https://www.ncbi.nlm.nih.gov/) and EBI (https://www.ebi.ac.uk/), together with 93 MS projects from the PRIDE database (https://www.ebi.ac.uk/pride/archive/) that were generated on high‐sensitivity mass spectrometers (Q Exactive or LTQ Orbitrap Elite).

To integrate the above data, we built the pipeline shown in Figure 1a. When defining candidate sORFs, all three possible reading frames of each RNA transcript were examined; ATG and the near‐cognate codons (TTG, GTG, CTG, AAG, AGG, ACG, ATA, ATT, ATC) were considered start codons, and TAG, TAA and TGA were considered stop codons. To determine whether a candidate sORF is translated, the Ribo‐seq and MS data were analysed separately using different software. PRICE (v 1.0.2) (Erhard et al., 2018) was used to analyse the 3‐nt periodicity of ribosome footprints in the Ribo‐seq data. SearchGUI (v 3.3.13) (Barsnes and Vaudel, 2018) was used to find peptides in the MS data that match the translated reading frames. The two sets of sORFs from Ribo‐seq and MS were then filtered to retain sORFs of 18 to 300 nt in length and combined by taking their union to obtain the core sORF registry for the five plant species.
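
The candidate definition above can be illustrated with a minimal Python sketch. It is not the PsORF pipeline itself: the function name `find_candidate_sorfs`, the choice to pair every in‐frame start codon with the next in‐frame stop codon, and the choice to count the stop codon towards the 18 to 300 nt length are all assumptions made for illustration.

```python
# Start and stop codons as listed in the text (DNA alphabet).
START_CODONS = {"ATG", "TTG", "GTG", "CTG", "AAG", "AGG", "ACG", "ATA", "ATT", "ATC"}
STOP_CODONS = {"TAG", "TAA", "TGA"}


def find_candidate_sorfs(transcript, min_nt=18, max_nt=300):
    """Return (start, end, frame) tuples for candidate sORFs on one transcript.

    Coordinates are 0-based and half-open. Assumptions (not from the paper):
    every start codon is paired with the next in-frame stop codon, and the
    stop codon is included in the reported length.
    """
    seq = transcript.upper().replace("U", "T")
    candidates = []
    for frame in range(3):  # the three forward reading frames
        open_starts = []  # start codons still waiting for a stop in this frame
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon in STOP_CODONS:
                end = pos + 3
                for start in open_starts:
                    if min_nt <= end - start <= max_nt:
                        candidates.append((start, end, frame))
                open_starts = []  # a stop closes every open start in this frame
            elif codon in START_CODONS:
                open_starts.append(pos)
    return sorted(candidates)


# Toy transcript: a 21-nt ORF (ATG + 5 codons + TAA) in reading frame 2.
toy = "CC" + "ATG" + "GCC" * 5 + "TAA" + "C"
toy_candidates = find_candidate_sorfs(toy)
```

Under the same assumptions, taking the union of the Ribo‐seq‐supported and MS‐supported sets corresponds to a plain set union of such tuples, for example `set(ribo_supported) | set(ms_supported)`.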

Figure 1
Schematic of the PsORF database. (a) Data sources and data processing pipeline of PsORF. (b) The five kinds of sORFs classified by genome location. uORF, small ORF upstream of the mORF; uoORF, small ORF spanning the 5’ UTR and the mORF; dORF, small ORF downstream of the mORF; doORF, small ORF spanning the mORF and the 3’ UTR; sORF, other small ORF in the genome. (c) JBrowse view of a uORF with its associated tracks (Ribo‐seq and RNA‐seq). (d) The MS spectrum of a dORF. The b and y ions are shown in blue and red, respectively. (e) The phylogenetic tree of a conserved sORF and its homologs across five plant species.

For the other 30 plant species, we used BLAST to find sORFs homologous to those in the core sORF registry. Finally, the sORFs from these 30 species and the core sORF registry were combined to obtain the comprehensive sORF registry of 35 plant species, which consists of 112 350 sORFs from 51 341 transcripts. Based on their genome location, the sORFs are divided into five categories: uORF (44 467), uoORF (4788), dORF (53 229), doORF (4403) and sORF (5463) (Figure 1b). Based on sequence conservation, the current version of PsORF contains 11 665 homologous sORF families.
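
The five location categories can be expressed as a small decision rule. The sketch below is only an illustration under stated assumptions: the function name `classify_sorf`, the use of 0‐based half‐open transcript coordinates, and the treatment of ORFs on transcripts without an mORF (or lying fully inside the mORF) as the generic "sORF" class are not taken from the paper. The category counts are the ones reported above.

```python
# Category counts as reported in the text.
CATEGORY_COUNTS = {"uORF": 44467, "uoORF": 4788, "dORF": 53229, "doORF": 4403, "sORF": 5463}


def classify_sorf(start, end, morf=None):
    """Classify an sORF by its position relative to the main ORF (mORF).

    start/end and morf=(m_start, m_end) are 0-based, half-open transcript
    coordinates. Assumption: anything that fits none of the four
    mORF-relative classes falls into the generic "sORF" class.
    """
    if morf is None:  # e.g. a transcript without an annotated mORF
        return "sORF"
    m_start, m_end = morf
    if end <= m_start:
        return "uORF"  # entirely within the 5' UTR
    if start >= m_end:
        return "dORF"  # entirely within the 3' UTR
    if start < m_start < end <= m_end:
        return "uoORF"  # spans the 5' UTR and the mORF
    if m_start <= start < m_end < end:
        return "doORF"  # spans the mORF and the 3' UTR
    return "sORF"


# Sanity check: the five categories add up to the reported registry size.
total_sorfs = sum(CATEGORY_COUNTS.values())
```

For example, with a hypothetical mORF at coordinates (100, 400), an ORF at (90, 130) starts in the 5’ UTR and ends inside the mORF, so it is labelled uoORF.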

In addition, to link the identified sORFs with existing knowledge, we collected sORFs from the published literature using a Python‐scripted web crawler that searched abstracts and main texts for keywords such as small (coding) ORF/sORF, small protein/peptide, micro‐protein/peptide, unannotated translation events, downstream ORF/dORF and upstream ORF/uORF. The known sORFs were compiled into a database, which was compared against the sORFs in the comprehensive sORF registry using BLASTp (v 2.6.0+) with the following cut‐offs: e‐value ≤ 0.01, coverage ≥ 30% and identity = 100%. The BLAST hits are shown on the gene wiki page of each sORF.
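
The BLASTp cut‐offs can be applied to tabular output with a few lines of Python. This is a hedged sketch rather than the authors' script: it assumes tab‐separated hits with the columns `qseqid sseqid pident length qlen evalue`, and it assumes that coverage means alignment length as a percentage of query length, which the text does not specify. The sequence identifiers in the example are hypothetical.

```python
def filter_blast_hits(lines, max_evalue=0.01, min_coverage=30.0, required_identity=100.0):
    """Keep (query, subject) pairs that pass the cut-offs stated in the text.

    Each line is assumed to hold six tab-separated fields:
    qseqid, sseqid, pident, length, qlen, evalue.
    Coverage is assumed to be 100 * alignment length / query length.
    """
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, sseqid, pident, length, qlen, evalue = line.rstrip("\n").split("\t")
        coverage = 100.0 * int(length) / int(qlen)
        if (
            float(evalue) <= max_evalue
            and coverage >= min_coverage
            and float(pident) >= required_identity
        ):
            kept.append((qseqid, sseqid))
    return kept


# Hypothetical hits: only the first passes all three cut-offs.
example_hits = [
    "known1\tAT1_u1\t100.000\t30\t40\t1e-10",  # passes
    "known1\tAT1_u2\t98.000\t40\t40\t1e-12",   # identity below 100%
    "known2\tOS1_d1\t100.000\t10\t40\t1e-3",   # coverage 25%
    "known3\tZM1_s1\t100.000\t20\t40\t0.5",    # e-value too high
]
passing_hits = filter_blast_hits(example_hits)
```

Requiring 100% identity while allowing coverage as low as 30% means that a literature peptide can be linked to a registry sORF through an exactly matching partial alignment.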

PsORF is deployed on a Linux operating system with the nginx web server, and all data are stored in a MySQL database for querying. PsORF offers convenient browsing and query services (Figure 1a) for users to obtain basic sORF information. In PsORF, users can: (i) browse or search sORFs by ID and sequence; (ii) use BLAST to assess the sequence similarity of sORFs across plant species; (iii) browse the Ribo‐seq and RNA‐seq data and the genome location of sORFs in the genome browser JBrowse (Figure 1c) (Buels et al., 2016); (iv) view the MS/MS fragmentation spectra of sORF‐encoded small peptides on the visualization platform (Figure 1d); (v) view the phylogenetic tree of conserved sORFs across different plant species (Figure 1e); and (vi) check whether the sORFs or their homologs have associated studies in the published literature.

To the best of our knowledge, PsORF (http://psorf.whu.edu.cn/) is the only comprehensive database of plant sORFs. As translatomic data from Ribo‐seq and proteomic data from MS accumulate, more and more important sORFs and their regulatory roles will be identified. We will therefore keep updating PsORF as new data become available. We believe that the database will help plant scientists quickly obtain sORF information for further biological discovery.




Updated: 2020-04-25