A data-driven dimensionality-reduction algorithm for the exploration of patterns in biomedical data

Abstract

Dimensionality reduction is widely used in the visualization, compression, exploration and classification of data. Yet a generally applicable solution remains unavailable. Here, we report an accurate and broadly applicable data-driven algorithm for dimensionality reduction. The algorithm, which we named ‘feature-augmented embedding machine’ (FEM), first learns the structure of the data and the inherent characteristics of the data components (such as central tendency and dispersion), denoises the data, increases the separation of the components, and then projects the data onto a lower number of dimensions. We show that the technique is effective at revealing the underlying dominant trends in datasets of protein expression and single-cell RNA sequencing, computed tomography, electroencephalography and wearable physiological sensors.
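The abstract gives only the high-level recipe, so the following is a minimal illustrative sketch of an FEM-style pipeline rather than the authors' implementation: it estimates coarse group structure, uses each group's central tendency and dispersion to denoise points and widen the gaps between groups, and then projects to a low-dimensional space. The use of k-means, the per-group shrinkage step and the final PCA projection are assumptions made purely for illustration; the actual FEM algorithm is described in the paper and its Supplementary Information.

```python
# Illustrative only: mimics the abstract's high-level recipe with generic tools
# (k-means, per-group shrinkage, PCA); it is NOT the authors' FEM code.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def fem_like_embedding(X, n_components=2, n_groups=5, shrink=0.5):
    """Toy FEM-style pipeline: learn coarse structure, sharpen it, then embed."""
    X = StandardScaler().fit_transform(X)                            # common feature scale
    labels = KMeans(n_clusters=n_groups, n_init=10).fit_predict(X)   # coarse group structure
    X_aug = X.copy()
    for g in np.unique(labels):
        members = labels == g
        centre = X[members].mean(axis=0)                             # central tendency of the group
        spread = X[members].std(axis=0) + 1e-8                       # dispersion of the group
        # Damp each point's deviation from its group centre, more strongly along
        # high-dispersion (noisier) features; this suppresses within-group noise
        # and widens the gaps between groups before the final projection.
        X_aug[members] = centre + (X[members] - centre) * (1.0 - shrink) / (1.0 + spread)
    return PCA(n_components=n_components).fit_transform(X_aug)

# Example: embed 1,000 synthetic 50-dimensional points into two dimensions.
rng = np.random.default_rng(0)
Y = fem_like_embedding(rng.normal(size=(1000, 50)), n_components=2)
print(Y.shape)  # (1000, 2)
```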

Fig. 1: Workflow of FEM, and discovery of subgroups in protein-expression data.
Fig. 2: Discovery of subgroups in protein-expression data from mice.
Fig. 3: Visualization of a high-dimensional scRNA-seq dataset in three dimensions using PCA and FEM.
Fig. 4: Visualization of a high-dimensional scRNA-seq dataset using t-SNE, Dhaka and ivis.
Fig. 5: Visualization of high-dimensional CT-scan localization data in two dimensions.
Fig. 6: Classification of biomedical data.

Data availability

The main data supporting the results in this study are available within the paper and its Supplementary Information. The protein-expression dataset from mice is available at https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression. The scRNA-seq data of patients with leukaemia are available at https://support.10xgenomics.com/single-cell-gene-expression/datasets. The CT dataset is available at https://archive.ics.uci.edu/ml/datasets/Relative+location+of+CT+slices+on+axial+axis. The datasets for emotion classification and human-activity classification, and the data from wearable sensors and from smartphones, were downloaded from the UCI machine learning repository (https://archive.ics.uci.edu/ml/datasets).
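As an example of how one of the listed datasets might be pulled into Python for analysis, the snippet below loads the mice protein-expression data directly from the UCI repository with pandas. The exact file path (Data_Cortex_Nuclear.xls), the metadata column names and the quoted dataset size are assumptions based on the repository's usual layout and should be checked against the linked dataset page.

```python
# Hypothetical loading example for the UCI mice protein-expression dataset;
# the file name, path and column names below are assumptions to be verified
# against the repository page linked above.
import pandas as pd

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "00342/Data_Cortex_Nuclear.xls")

df = pd.read_excel(URL)  # reading the legacy .xls format requires the xlrd engine
proteins = df.drop(columns=["MouseID", "Genotype", "Treatment", "Behavior", "class"],
                   errors="ignore")  # keep only the protein-expression columns
print(df.shape, proteins.shape)      # roughly 1,080 samples, 77 protein features
```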

Code availability

The implementation code of the proposed FEM algorithm is available for research use at https://github.com/tauhidstanford/Feature-augmented-embedding-machine.

Acknowledgements

We thank M. B. Khuzani and H. Ren for their advice in improving the manuscript. This work was partially supported by the National Institutes of Health (nos. 1R01 CA223667 and R01CA227713) and by a Faculty Research Award from Google.

Author information

Contributions

L.X. conceived the experiments; M.T.I. conducted the experiments and analysed the results. Both authors reviewed the manuscript.

Corresponding author

Correspondence to Lei Xing.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary methods, figures and references.

Reporting Summary

About this article

Cite this article

Islam, M.T., Xing, L. A data-driven dimensionality-reduction algorithm for the exploration of patterns in biomedical data. Nat Biomed Eng 5, 624–635 (2021). https://doi.org/10.1038/s41551-020-00635-3
