Gaussian embedding for large-scale gene set analysis

Wang, Sheng; Flynn, Emily R.; Altman, Russ B.

doi:10.1038/s42256-020-0193-2

Article
Published: 15 June 2020

Nature Machine Intelligence volume 2, pages 387–395 (2020)

Abstract

Gene sets, including protein complexes and signalling pathways, have proliferated greatly, in large part as a result of high-throughput biological data. Leveraging gene sets to gain insight into biological discovery requires computational methods for converting them into a useful form for available machine learning models. Here, we study the problem of embedding gene sets as compact features that are compatible with available machine learning codes. We present Set2Gaussian, a novel network-based gene set embedding approach, which represents each gene set as a multivariate Gaussian distribution rather than a single point in the low-dimensional space, according to the proximity of these genes in a protein–protein interaction network. We demonstrate that Set2Gaussian improves gene set member identification, accurately stratifies tumours, and finds concise gene sets for gene set enrichment analysis. We further show how Set2Gaussian allows us to identify a clinical prognostic and predictive subnetwork around neurofilament medium in sarcoma, which we validate in independent cohorts.
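The difference between embedding a gene set as a Gaussian and embedding it as a single point can be illustrated with a small numerical sketch. The snippet below is not the Set2Gaussian implementation, and the training objective of the paper is not reproduced in this preview. It assumes diagonal covariances, two-dimensional embeddings, and made-up means and variances, and the function names `diag_gaussian_logpdf` and `kl_diag_gaussians` are hypothetical helpers. It shows only that two sets with the same centre but different spread score a distant gene differently, which a point embedding could not express.

```python
import numpy as np


def diag_gaussian_logpdf(x, mean, var):
    """Log-density of point x under N(mean, diag(var))."""
    x, mean, var = (np.asarray(a, dtype=float) for a in (x, mean, var))
    d = x.shape[-1]
    return -0.5 * (
        d * np.log(2.0 * np.pi)
        + np.sum(np.log(var))
        + np.sum((x - mean) ** 2 / var)
    )


def kl_diag_gaussians(mean_p, var_p, mean_q, var_q):
    """KL(P || Q) between two Gaussians with diagonal covariances."""
    mean_p, var_p, mean_q, var_q = (
        np.asarray(a, dtype=float) for a in (mean_p, var_p, mean_q, var_q)
    )
    return 0.5 * np.sum(
        var_p / var_q
        + (mean_q - mean_p) ** 2 / var_q
        - 1.0
        + np.log(var_q / var_p)
    )


# Illustrative (made-up) embeddings: two gene sets share a centre but
# differ in spread, e.g. a tight complex versus a broad pathway.
tight_mean, tight_var = [0.0, 0.0], [0.1, 0.1]
broad_mean, broad_var = [0.0, 0.0], [2.0, 2.0]

# A hypothetical gene embedded away from the shared centre.
gene = [1.0, 1.0]

score_tight = diag_gaussian_logpdf(gene, tight_mean, tight_var)
score_broad = diag_gaussian_logpdf(gene, broad_mean, broad_var)

# The divergence is asymmetric, so the order of the arguments matters.
kl_tight_to_broad = kl_diag_gaussians(tight_mean, tight_var, broad_mean, broad_var)
kl_broad_to_tight = kl_diag_gaussians(broad_mean, broad_var, tight_mean, tight_var)

print(f"log-density under tight set: {score_tight:.3f}")
print(f"log-density under broad set: {score_broad:.3f}")
print(f"KL(tight || broad) = {kl_tight_to_broad:.3f}")
print(f"KL(broad || tight) = {kl_broad_to_tight:.3f}")
```

In this toy setting the gene is far more plausible under the broad set than under the tight one, although both sets have identical means. The variance is what lets a distribution-valued embedding distinguish a compact complex from a diffuse pathway.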

**Fig. 2: Application of Set2Gaussian to gene set member identification.**

**Fig. 3: Application of Set2Gaussian to cancer subtyping.**

**Fig. 4: Application of Set2Gaussian to finding concise gene sets.**

Data availability

We provide pretrained gene set representations of all gene sets in NCI, Reactome and MSigDB at https://doi.org/10.6084/m9.figshare.11341181.v1. All results in this paper are based on these representations.

Code availability

A software implementation of Set2Gaussian is available at https://doi.org/10.5281/zenodo.3827929.

Acknowledgements

This work is supported by NIH TR002515, GM102365, LM005652 and the Chan-Zuckerberg Biohub.

Author information

Authors and Affiliations

Department of Bioengineering, Stanford University, Stanford, CA, USA
Sheng Wang & Russ B. Altman
Biomedical Informatics Training Program, Stanford University, Stanford, CA, USA
Emily R. Flynn & Russ B. Altman
Department of Genetics, Stanford University, Stanford, CA, USA
Russ B. Altman

Authors

Sheng Wang
Emily R. Flynn
Russ B. Altman

Contributions

All authors conceived the problem. S.W. conceived the algorithm and performed the computational experiments. R.B.A. led the research. All authors wrote the manuscript.

Corresponding author

Correspondence to Russ B. Altman.

Ethics declarations

Competing interests

R.B.A. declares the following competing interests: stock or other ownership (Personalis, 23andme, Youscript) and consulting or advisory roles (United Health, Second Genome, Karius, UK Biobank, Swiss Personalized Health Network).

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wang, S., Flynn, E.R. & Altman, R.B. Gaussian embedding for large-scale gene set analysis. Nat Mach Intell 2, 387–395 (2020). https://doi.org/10.1038/s42256-020-0193-2

Received: 21 November 2019
Accepted: 15 May 2020
Published: 15 June 2020
Issue Date: July 2020
DOI: https://doi.org/10.1038/s42256-020-0193-2

This article is cited by

Hoinka, J. & Przytycka, T. M. Embedding gene sets in low-dimensional space. Nature Machine Intelligence (2020).