Abstract
Identifying the molecular mechanisms that control differential gene expression (DE) is a major goal of basic and disease biology. Here, we develop a systems biology model to predict DE and mine the biological basis of the factors that influence predicted gene expression to understand how it may be generated. This model, called DEcode, utilizes deep learning to predict DE based on genome-wide binding sites on RNAs and promoters. Ranking predictive factors from DEcode indicates that clinically relevant expression changes between thousands of individuals can be predicted mainly through the joint action of post-transcriptional RNA-binding factors. We also show the broad potential applications of DEcode to generate biological insights, by predicting DE between tissues and differential transcript usage, as well as the drivers of ageing throughout the human lifespan, of gene co-expression relationships on a genome-wide scale, and of frequently differentially expressed genes across diverse conditions. DEcode is freely available to researchers to identify influential molecular mechanisms for any human expression data.
Data availability
The data sources for GTEx data, TF and RBP binding peaks, miRNA binding locations, disease-related genes, protein–protein interaction data, pathways, gene ontology and the DE prior rank that were used for model training and interpretation are available in Supplementary Table 8. Processed data are available through our Code Ocean capsule (https://doi.org/10.24433/CO.0084803.v1).
Code availability
DEcode software and pre-trained models for tissue- and person-specific transcriptomes are available at www.differentialexpression.org, https://github.com/stasaki/DEcode and from our Code Ocean capsule (https://doi.org/10.24433/CO.0084803.v1). DEcode is licensed under a BSD 3-Clause license.
Acknowledgements
We thank L. Yu for managing access to Genotype-Tissue Expression (GTEx) data. The study was supported by NIH grants P30AG010161, R01AG061798 and R01AG057911. The GTEx Project is supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH and NINDS. The data used for the analyses described in this Article were obtained from the GTEx Portal on 1 October 2018.
Contributions
S.T. contributed to the conception and design of the study. S.T. performed the computational analysis. S.T., C.G., S.M. and Y.W. interpreted the results. S.T. wrote the first draft of the manuscript. All authors contributed to manuscript revision, then read and approved the submitted version.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Characterizations of the person-specific models.
a, The consistency between the predictive contributions of regulators and their expression. Spearman’s correlation was used to evaluate the relationship between DeepLIFT scores and log2-TPMs in each tissue. A relationship with an absolute correlation > 0.3 and FDR < 5% was defined as significant. b, The consistency of predictive contributions across tissues. We selected the regulators that showed significant relationships between their DeepLIFT scores and log2-TPMs in multiple tissues. The consistency of the directions of the correlations was then assessed and visualized as a histogram. c, The actual correlation profile between DeepLIFT scores and log2-TPMs. We selected 99 regulators that showed consistent relationships in more than four tissues. d, The pairwise similarity of per-gene prediction accuracy between tissues. Spearman’s correlation was used for this comparison. e, The association of per-gene prediction accuracy with gene status. The gene status indicates whether genes are registered in multiple databases (Known) or only in the GENCODE database (Novel or Putative). f, Decomposition of variance in the per-gene prediction accuracy. To compare the variances in the per-gene prediction accuracy explained by the gene status, the number of promoter features, and RNA features, we used a variance decomposition method (lmg) implemented in the relaimpo R package.
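The significance criterion in panel a (absolute correlation > 0.3 and FDR < 5%) can be illustrated with a minimal Python sketch. The caption does not state which FDR procedure was used, so the Benjamini-Hochberg adjustment below is an assumption. The function names and the example correlations and p-values are hypothetical, and the per-regulator Spearman correlations and raw p-values are assumed to have been computed beforehand.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: the caption only says "FDR"; BH is used here for illustration.
    """
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def significant_relations(rhos, pvals, rho_min=0.3, fdr_max=0.05):
    """Flag regulators with |rho| > rho_min and adjusted p-value < fdr_max."""
    fdr = benjamini_hochberg(pvals)
    return [abs(r) > rho_min and q < fdr_max for r, q in zip(rhos, fdr)]


# Hypothetical correlations (DeepLIFT score vs log2-TPM) and raw p-values
# for four regulators in one tissue.
rhos = [0.62, -0.45, 0.12, 0.35]
pvals = [0.001, 0.004, 0.03, 0.2]
flags = significant_relations(rhos, pvals)
```

In this toy input the fourth regulator passes the correlation threshold but not the FDR threshold, so only the first two are flagged; both conditions must hold together.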
Extended Data Fig. 2 The major feature types that contributed to gene co-expression.
We defined successfully predicted gene pairs as those with an absolute Spearman’s correlation greater than 0.3 whose sign matched that of the ground truth. The successfully predicted gene pairs of the model trained with the full set of features were split into three groups based on the performance of the models trained with only RNA features or only promoter features.
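One possible reading of this criterion is sketched below in Python. It assumes that the 0.3 threshold applies to the correlation between the predicted expression profiles of the two genes, and that the ground truth is the correlation between their measured profiles; the caption does not spell this out. All function names and inputs are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's correlation as the Pearson correlation of the ranks.

    Assumes both inputs are non-constant (otherwise the variance is zero).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def is_successfully_predicted(pred_a, pred_b, true_a, true_b, threshold=0.3):
    """Assumed reading: |predicted rho| > threshold and same sign as the truth."""
    rho_pred = spearman(pred_a, pred_b)
    rho_true = spearman(true_a, true_b)
    return abs(rho_pred) > threshold and rho_pred * rho_true > 0


# Hypothetical expression of two genes across five samples.
true_a, true_b = [1, 2, 3, 4, 5], [2, 1, 4, 3, 5]
pred_a, pred_b = [1.0, 1.5, 2.5, 2.0, 3.0], [0.5, 1.0, 1.2, 2.0, 2.2]
ok = is_successfully_predicted(pred_a, pred_b, true_a, true_b)
```

Applying the same test to models trained on only RNA features or only promoter features would give the per-pair outcomes needed to assign each pair to a group.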
Extended Data Fig. 3 DEcode predicts the regulatory principles behind frequently DE genes across diverse conditions.
a, Scatter plots showing the relationship between predicted and actual DE prior rank. The predicted logit of DE prior rank was converted to a probability and compared with the actual DE prior rank using Spearman’s correlation. b, The performance of models trained with distinct feature sets. ROC curves represent the performance of the models in predicting genes whose DE prior rank is greater than 0.9.
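The two evaluation steps in this caption, converting a predicted logit to a probability and scoring a classifier for genes with DE prior rank above 0.9, can be sketched in plain Python. This is an illustration rather than the authors' pipeline: the function names and numbers are hypothetical, and the area under the ROC curve is computed with the pairwise (Mann-Whitney) formulation instead of a dedicated ROC package.

```python
import math


def logit_to_probability(logit):
    """Logistic sigmoid, written to avoid overflow for large |logit|."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def roc_auc(labels, scores):
    """AUC as the chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical actual DE prior ranks and predicted logits for five genes.
actual_rank = [0.95, 0.20, 0.92, 0.50, 0.10]
predicted_logit = [2.0, -1.0, 0.5, 1.0, -2.0]

predicted_prob = [logit_to_probability(z) for z in predicted_logit]
labels = [r > 0.9 for r in actual_rank]  # positives: DE prior rank > 0.9
auc = roc_auc(labels, predicted_prob)
```

Because the sigmoid is monotone, the AUC and any rank-based correlation are the same whether computed on logits or on probabilities; the conversion mainly puts predictions on the same 0 to 1 scale as the DE prior rank.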
Supplementary information
Supplementary Information
Supplementary Figs. 1–12.
Supplementary Table
Supplementary Tables 1–8.
About this article
Cite this article
Tasaki, S., Gaiteri, C., Mostafavi, S. et al. Deep learning decodes the principles of differential gene expression. Nat Mach Intell 2, 376–386 (2020). https://doi.org/10.1038/s42256-020-0201-6
DOI: https://doi.org/10.1038/s42256-020-0201-6