Gaussian embedding for large-scale gene set analysis

Abstract

Gene sets, including protein complexes and signalling pathways, have proliferated greatly, in large part as a result of high-throughput biological data. Leveraging gene sets to gain insight into biological discovery requires computational methods for converting them into a useful form for available machine learning models. Here, we study the problem of embedding gene sets as compact features that are compatible with available machine learning codes. We present Set2Gaussian, a novel network-based gene set embedding approach, which represents each gene set as a multivariate Gaussian distribution rather than a single point in the low-dimensional space, according to the proximity of these genes in a protein–protein interaction network. We demonstrate that Set2Gaussian improves gene set member identification, accurately stratifies tumours, and finds concise gene sets for gene set enrichment analysis. We further show how Set2Gaussian allows us to identify a clinical prognostic and predictive subnetwork around neurofilament medium in sarcoma, which we validate in independent cohorts.
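
To make the contrast between a point embedding and a Gaussian embedding concrete, the following is a minimal sketch, not the Set2Gaussian training procedure. It assumes gene embedding vectors are already available (here they are toy values), summarises a gene set by a diagonal Gaussian fitted to its members, scores a candidate gene by log density, and compares two gene sets with the closed-form KL divergence. The function names `fit_diag_gaussian`, `log_density` and `kl_diag_gaussians`, the diagonal covariance and the moment-matching fit are all illustrative assumptions; the abstract does not specify how the distributions are parameterised or learned.

```python
import numpy as np


def fit_diag_gaussian(member_vecs, eps=1e-6):
    """Summarise a gene set by the mean and per-dimension variance
    of its members' embedding vectors (hypothetical moment-matching fit)."""
    member_vecs = np.asarray(member_vecs, dtype=float)
    mu = member_vecs.mean(axis=0)
    var = member_vecs.var(axis=0) + eps  # eps keeps variances strictly positive
    return mu, var


def log_density(x, mu, var):
    """Log density of point x under a diagonal Gaussian N(mu, diag(var))."""
    x = np.asarray(x, dtype=float)
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)


def kl_diag_gaussians(mu0, var0, mu1, var1):
    """KL(N0 || N1) for two diagonal Gaussians, in closed form."""
    return 0.5 * np.sum(
        var0 / var1 + (mu1 - mu0) ** 2 / var1 - 1.0 + np.log(var1 / var0)
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy 4-dimensional gene embeddings: a tight set and a diffuse set.
    tight_set = rng.normal(loc=0.0, scale=0.1, size=(20, 4))
    diffuse_set = rng.normal(loc=0.0, scale=1.0, size=(20, 4))

    mu_t, var_t = fit_diag_gaussian(tight_set)
    mu_d, var_d = fit_diag_gaussian(diffuse_set)

    candidate = np.full(4, 0.8)  # a gene lying away from both set centres
    print("log density under tight set:  ", log_density(candidate, mu_t, var_t))
    print("log density under diffuse set:", log_density(candidate, mu_d, var_d))
    print("KL(tight || diffuse):", kl_diag_gaussians(mu_t, var_t, mu_d, var_d))
```

The two toy sets have nearly the same mean, so a single-point representation would treat them as almost identical. The variance term is what lets the candidate gene be scored differently against a compact set and a spread-out one.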

Fig. 1: Overview of Set2Gaussian.
Fig. 2: Application of Set2Gaussian to gene set member identification.
Fig. 3: Application of Set2Gaussian to cancer subtyping.
Fig. 4: Application of Set2Gaussian to finding concise gene sets.

Data availability

We provide pretrained gene set representations of all gene sets in NCI, Reactome and MSigDB at https://doi.org/10.6084/m9.figshare.11341181.v1. All results in this paper are based on these representations.

Code availability

A software implementation of Set2Gaussian is available at https://doi.org/10.5281/zenodo.3827929.

Acknowledgements

This work is supported by NIH TR002515, GM102365, LM005652 and the Chan-Zuckerberg Biohub.

Author information

Contributions

All authors conceived the problem. S.W. conceived the algorithm and performed the computational experiments. R.B.A. led the research. All authors wrote the manuscript.

Corresponding author

Correspondence to Russ B. Altman.

Ethics declarations

Competing interests

R.B.A. declares the following competing interests: stock or other ownership (Personalis, 23andme, Youscript) and consulting or advisory roles (United Health, Second Genome, Karius, UK Biobank, Swiss Personalized Health Network).

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wang, S., Flynn, E.R. & Altman, R.B. Gaussian embedding for large-scale gene set analysis. Nat Mach Intell 2, 387–395 (2020). https://doi.org/10.1038/s42256-020-0193-2

  • DOI: https://doi.org/10.1038/s42256-020-0193-2
