Deep learning decodes the principles of differential gene expression

A preprint version of the article is available at bioRxiv.

Abstract

Identifying the molecular mechanisms that control differential gene expression (DE) is a major goal of basic and disease biology. Here, we develop a systems biology model that predicts DE, and we mine the biological basis of the factors that influence the predicted expression to understand how DE may be generated. This model, called DEcode, uses deep learning to predict DE from genome-wide binding sites on RNAs and promoters. Ranking the predictive factors from DEcode indicates that clinically relevant expression changes between thousands of individuals can be predicted mainly through the joint action of post-transcriptional RNA-binding factors. We also demonstrate the broad applicability of DEcode for generating biological insights: it predicts DE between tissues and differential transcript usage, and it identifies drivers of ageing throughout the human lifespan, of gene co-expression relationships on a genome-wide scale, and of frequently differentially expressed genes across diverse conditions. DEcode is freely available to researchers to identify influential molecular mechanisms for any human expression data.

Fig. 1: Overview of building and evaluating the DEcode transcriptome prediction model.
Fig. 2: Performance of DEcode for predicting DE across 53 tissues.
Fig. 3: Identification and characterization of key predictors.
Fig. 4: Performance of the person-specific models.
Fig. 5: Application of the person-specific models to analyse phenotype-related gene signatures.
Fig. 6: Regulatory basis of gene co-expression relationships.

Data availability

The data sources for GTEx data, TF and RBP binding peaks, miRNA binding locations, disease-related genes, protein–protein interaction data, pathways, gene ontology and the DE prior rank that were used for model training and interpretation are available in Supplementary Table 8. Processed data are available through our Code Ocean capsule (https://doi.org/10.24433/CO.0084803.v1).

Code availability

DEcode software and pre-trained models for tissue- and person-specific transcriptomes are available at www.differentialexpression.org, https://github.com/stasaki/DEcode and from our Code Ocean capsule (https://doi.org/10.24433/CO.0084803.v1). DEcode is licensed under a BSD 3-Clause license.

Acknowledgements

We thank L. Yu for managing access to Genotype-Tissue Expression (GTEx) data. The study was supported by NIH grants P30AG010161, R01AG061798 and R01AG057911. The GTEx Project is supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH and NINDS. The data used for the analyses described in this Article were obtained from the GTEx Portal on 1 October 2018.

Author information

Contributions

S.T. contributed to the conception and design of the study. S.T. performed the computational analysis. S.T., C.G., S.M. and Y.W. interpreted the results. S.T. wrote the first draft of the manuscript. All authors contributed to manuscript revision, then read and approved the submitted version.

Corresponding author

Correspondence to Shinya Tasaki.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Characterizations of the person-specific models.

a, The consistency between the predictive contributions of regulators and their expression. Spearman’s correlation was used to evaluate the relationship between DeepLIFT scores and log2-TPMs in each tissue. A relationship with an absolute correlation > 0.3 and FDR < 5% was defined as significant. b, The consistency of predictive contributions across tissues. We selected the regulators that showed significant relationships between their DeepLIFT scores and log2-TPMs in multiple tissues. The consistency of the directions of the correlations was then assessed and visualized as a histogram. c, The actual correlation profile between DeepLIFT scores and log2-TPMs. We selected 99 regulators that showed consistent relationships in more than four tissues. d, The pairwise similarity of per-gene prediction accuracy between tissues. Spearman’s correlation was used for this comparison. e, The association of per-gene prediction accuracy with gene status. Gene status indicates whether a gene is registered in multiple databases (Known) or only in the GENCODE database (Novel or Putative). f, Decomposition of variance in the per-gene prediction accuracy. To compare the variance in per-gene prediction accuracy explained by gene status, the number of promoter features and RNA features, we used a variance decomposition method (lmg) implemented in the relaimpo R package.
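The significance rule in panel a (absolute correlation > 0.3 and FDR < 5%) can be illustrated with a minimal Python sketch. The caption does not state which FDR procedure was used, so the Benjamini-Hochberg adjustment below is an assumption; the function names, regulator names and numbers are hypothetical toy values, not results from the paper.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: the caption only says "FDR < 5%"; BH is used here as a
    common choice, not as the authors' confirmed procedure.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant_relations(results, rho_cutoff=0.3, fdr_cutoff=0.05):
    """Filter (regulator, rho, pvalue) tuples from one tissue.

    A relationship is kept when |rho| > rho_cutoff and its adjusted
    p-value is below fdr_cutoff, mirroring the rule in the caption.
    """
    qvalues = benjamini_hochberg([p for _, _, p in results])
    return [
        name
        for (name, rho, _), q in zip(results, qvalues)
        if abs(rho) > rho_cutoff and q < fdr_cutoff
    ]


# Hypothetical toy input: Spearman rho and raw p-value per regulator.
toy_results = [
    ("REG_A", 0.62, 0.0001),
    ("REG_B", -0.45, 0.004),
    ("REG_C", 0.10, 0.03),   # fails the |rho| cutoff
    ("REG_D", 0.35, 0.60),   # fails the FDR cutoff
]
hits = significant_relations(toy_results)
print(hits)
```

Both cutoffs are applied jointly, so a regulator with a strong correlation but a weak adjusted p-value (or the reverse) is excluded.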

Extended Data Fig. 2 The major feature types that contributed to gene co-expression.

We defined successfully predicted gene pairs as those with an absolute Spearman’s correlation greater than 0.3 whose sign matched that of the ground truth. The successfully predicted gene pairs of the model trained with the full set of features were split into three groups based on the performance of the models trained with only RNA features or only promoter features.
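As a rough illustration of the definition above, the following Python sketch computes Spearman’s correlation for one gene pair in predicted and in observed expression and then applies the success rule. It assumes the 0.3 threshold is applied to the correlation of the predicted profiles, which the caption does not state explicitly; the helper names and the toy expression values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's correlation, computed as the Pearson correlation of ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def successfully_predicted(pred_rho, true_rho, cutoff=0.3):
    """True when |predicted rho| > cutoff and its sign matches the ground truth."""
    return abs(pred_rho) > cutoff and pred_rho * true_rho > 0


# Hypothetical expression profiles of two genes across five samples.
gene_a_pred = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b_pred = [2.1, 2.0, 3.5, 3.9, 6.0]
gene_a_obs = [1.2, 1.9, 3.1, 4.2, 4.8]
gene_b_obs = [0.5, 1.5, 1.0, 2.5, 3.0]

pred_rho = spearman(gene_a_pred, gene_b_pred)
true_rho = spearman(gene_a_obs, gene_b_obs)
pair_ok = successfully_predicted(pred_rho, true_rho)
print(pred_rho, true_rho, pair_ok)
```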

Extended Data Fig. 3 DEcode predicts the regulatory principles behind frequently DE genes across diverse conditions.

a, Scatter plots showing the relationship between predicted and actual DE prior rank. The predicted logit of the DE prior rank was converted to a probability and compared with the actual DE prior rank using Spearman’s correlation. b, The performance of models trained with distinct feature sets. ROC curves show the performance of the models in predicting genes whose DE prior rank is greater than 0.9.
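The two evaluation steps in this caption, converting a predicted logit to a probability and scoring how well predictions separate genes with a DE prior rank above 0.9, can be sketched in plain Python. This is an illustrative stand-in rather than the authors' pipeline: the area under the ROC curve is computed here by direct pairwise comparison, and all names and values are hypothetical.

```python
import math


def logit_to_probability(logit):
    """Numerically stable logistic (sigmoid) transform of a logit."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def roc_auc(scores, labels):
    """Area under the ROC curve by pairwise comparison.

    Equals the probability that a randomly chosen positive scores higher
    than a randomly chosen negative, with ties counted as one half.
    """
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: model logits and actual DE prior ranks for five genes.
predicted_logits = [2.0, 0.0, -1.0, 1.5, -2.0]
actual_prior_rank = [0.95, 0.40, 0.92, 0.97, 0.10]

probabilities = [logit_to_probability(z) for z in predicted_logits]
# Positive class: genes whose DE prior rank is greater than 0.9.
is_frequent_de = [rank > 0.9 for rank in actual_prior_rank]
auc = roc_auc(probabilities, is_frequent_de)
print(auc)
```

Because the logistic transform is monotone, the AUC is the same whether it is computed on logits or on probabilities; the conversion matters for interpreting predictions on the same 0 to 1 scale as the DE prior rank.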

Supplementary information

Supplementary Information

Supplementary Figs. 1–12.

Supplementary Table

Supplementary Tables 1–8.

About this article

Cite this article

Tasaki, S., Gaiteri, C., Mostafavi, S. et al. Deep learning decodes the principles of differential gene expression. Nat Mach Intell 2, 376–386 (2020). https://doi.org/10.1038/s42256-020-0201-6

  • DOI: https://doi.org/10.1038/s42256-020-0201-6
