
Accurate annotation of human protein-coding small open reading frames

Abstract

Functional protein-coding small open reading frames (smORFs) are emerging as an important class of genes. However, the number of translated smORFs in the human genome is unclear because proteogenomic methods are not sensitive enough, and, as we show, Ribo-seq strategies require additional measures to ensure comprehensive and accurate smORF annotation. Here, we integrate de novo transcriptome assembly and Ribo-seq into an improved workflow that overcomes obstacles with previous methods, to more confidently annotate thousands of smORFs. Evolutionary conservation analyses suggest that hundreds of smORF-encoded microproteins are likely functional. Additionally, many smORFs are regulated during fundamental biological processes, such as cell stress. Peptides derived from smORFs are also detectable on human leukocyte antigen complexes, revealing smORFs as a source of antigens. Thus, by including additional validation into our smORF annotation workflow, we accurately identify thousands of unannotated translated smORFs that will provide a rich pool of unexplored, functional human genes.

Fig. 1: Outline of top-down smORF annotation workflow.
Fig. 2: Comparison of translation prediction for smORFs versus annotated ORFs.
Fig. 3: smORF regulation during ER stress.
Fig. 4: Characteristics of protein-coding smORFs.
Fig. 5: Protein-coding smORFs identified on unannotated transcripts.
Fig. 6: Microproteins detected in HLA-I complexes.

Data availability

All sequencing datasets generated in this study are available through GEO (GSE125218).

Code availability

A custom Java script used for three-frame in silico translation of assembled transcripts is included as Supplementary Data 4.
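
As an illustration of what three-frame in silico translation involves, the following Python sketch translates a transcript sequence in each of its three forward reading frames and collects ORFs that run from an ATG to the next in-frame stop codon. It is not the authors' script from Supplementary Data 4. The function names, the ATG-only start rule, the choice to keep only the first ATG before each stop, and the `min_codons` cutoff are all assumptions made for the example. It also takes a plain nucleotide string rather than a transcriptome assembly in GTF format.

```python
from itertools import product

# Standard genetic code, built in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_frame(seq, frame):
    """Translate seq in one forward frame (0, 1 or 2); unknown codons become 'X'."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def find_orfs(seq, min_codons=2):
    """Return (frame, nucleotide_start, protein) for each ATG-to-stop ORF.

    Hypothetical helper: only the first ATG upstream of each stop is kept,
    and min_codons is an illustrative length cutoff (stop codon excluded).
    """
    orfs = []
    for frame in range(3):
        protein = translate_frame(seq, frame)
        start = None
        for idx, aa in enumerate(protein):
            if aa == "M" and start is None:
                start = idx  # first in-frame ATG since the last stop
            elif aa == "*" and start is not None:
                if idx - start >= min_codons:
                    orfs.append((frame, frame + 3 * start, protein[start:idx]))
                start = None
    return orfs


if __name__ == "__main__":
    transcript = "CCATGGCTTAAGG"  # toy sequence, not real data
    for frame in range(3):
        print(frame, translate_frame(transcript, frame))
    print(find_orfs(transcript))
```

Scanning only the three forward frames fits transcript-derived sequences, where strand is already fixed by the assembly. A genomic search would also need the three reverse-complement frames.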


Acknowledgements

We thank the Saghatelian laboratory for helpful comments and suggestions throughout the study, and N. Ingolia for advice on RNase I digestion conditions. We also thank M. Ku, N. Hah and the Salk Institute Next Generation Sequencing Core for preparation of RNA-seq libraries and high-throughput sequencing of Ribo-seq and RNA-seq libraries. This research was supported by NIH/NIGMS (R01 GM102491, A.S.), Leona M. and Harry B. Helmsley Charitable Trust grant (A.S.), Dr Frederick Paulsen Chair/Ferring Pharmaceuticals (A.S.), NIH/NIGMS postdoctoral fellowship (F32 GM123685, T.F.M.), George E. Hewitt Foundation for Medical Research (Q.C.) and a Pioneer Fellowship (D.T.). This work was also supported by the Razavi Newman Integrative Genomics and Bioinformatics Core and the Next Generation Sequencing Core Facilities of the Salk Institute with funding from the NIH-NCI CCSG (P30 014195) and the Chapman Foundation.

Author information

Contributions

T.F.M. and A.S. conceived the project, designed the experiments and wrote the manuscript. T.F.M. performed cell culture and prepared RPFs and total RNA. T.F.M. and C.D. prepared Ribo-seq libraries. T.F.M. analyzed Ribo-seq and RNA-seq data, developed the smORF annotation workflow and wrote custom scripts to generate Ribo-seq plots. M.N.S. performed de novo transcriptome assembly and generated ORF databases. Q.C. performed HLA-I experiments. T.F.M. and D.T. analyzed HLA-I proteomics data. All authors discussed the results and edited the manuscript. A.S. supervised the study.

Corresponding authors

Correspondence to Thomas F. Martinez or Alan Saghatelian.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–14.

Reporting Summary

Supplementary Data 1

List of protein-coding smORFs identified in this study and their properties.

Supplementary Data 2

List of significantly regulated ER stress smORFs and annotated genes.

Supplementary Data 3

List of smORFs containing conserved protein domains and predicted transmembrane helices, as well as smORFs encoding peptides identified in HLA-I proteomics datasets.

Supplementary Data 4

Custom Java script used to generate a three-frame ORF database from transcriptome assembly in GTF format.

About this article

Cite this article

Martinez, T.F., Chu, Q., Donaldson, C. et al. Accurate annotation of human protein-coding small open reading frames. Nat Chem Biol 16, 458–468 (2020). https://doi.org/10.1038/s41589-019-0425-0

  • DOI: https://doi.org/10.1038/s41589-019-0425-0
