Abstract
Dimensionality reduction is widely used in the visualization, compression, exploration and classification of data. Yet a generally applicable solution remains unavailable. Here, we report an accurate and broadly applicable data-driven algorithm for dimensionality reduction. The algorithm, which we named ‘feature-augmented embedding machine’ (FEM), first learns the structure of the data and the inherent characteristics of the data components (such as central tendency and dispersion), denoises the data, increases the separation of the components, and then projects the data onto a lower number of dimensions. We show that the technique is effective at revealing the underlying dominant trends in datasets of protein expression and single-cell RNA sequencing, computed tomography, electroencephalography and wearable physiological sensors.
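The sequence of stages named above (learn structure, estimate central tendency and dispersion, denoise, increase separation, project) can be illustrated with a minimal sketch. This is not the authors' FEM implementation, which is available from the repository listed under Code availability. It is a simplified stand-in that assumes k-means for structure learning, shrinkage towards cluster centroids for denoising, appended centroid coordinates for separation, and PCA for projection. The function names and the `shrink` and `boost` parameters are hypothetical.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (assumed stand-in for structure learning)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    def assign(c):
        # Index of the nearest centroid for every sample.
        d = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2)
        return d.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        # Recompute centroids; keep the old one if a cluster is empty.
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return assign(centroids), centroids


def augment_and_project(X, k, n_components=2, shrink=0.5, boost=2.0, seed=0):
    """Illustrative pipeline; `shrink` and `boost` are hypothetical knobs."""
    # Standardize features so no single feature dominates the distances.
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

    # 1. Learn structure: cluster labels and centroids (central tendency).
    labels, centroids = kmeans(X, k, seed=seed)

    # 2. Dispersion: mean distance of each cluster's members to its centroid.
    dispersion = np.array([
        np.linalg.norm(X[labels == j] - centroids[j], axis=1).mean()
        if np.any(labels == j) else 0.0
        for j in range(k)
    ])

    # 3. Denoise: pull each sample part of the way towards its centroid.
    own = centroids[labels]
    X_dn = own + (1.0 - shrink) * (X - own)

    # 4. Increase separation: append scaled centroid coordinates as features.
    X_aug = np.hstack([X_dn, boost * own])

    # 5. Project: PCA via SVD of the centred, augmented matrix.
    Xc = X_aug - X_aug.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T, labels, dispersion
```

Calling `augment_and_project(X, k=2)` on an array of shape `(n_samples, n_features)` returns a two-dimensional embedding, the cluster labels and the per-cluster dispersion. The number of clusters `k` has to be supplied by the user in this sketch.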
Data availability
The main data supporting the results in this study are available within the paper and its Supplementary Information. The protein-expression dataset from mice is available at https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression. The scRNA-seq data of patients with leukaemia are available at https://support.10xgenomics.com/single-cell-gene-expression/datasets. The CT dataset is available at https://archive.ics.uci.edu/ml/datasets/Relative+location+of+CT+slices+on+axial+axis. The datasets for emotion classification and human-activity classification, and the data from wearable sensors and from smartphones, were downloaded from the UCI machine learning repository (https://archive.ics.uci.edu/ml/datasets).
Code availability
The implementation of the proposed FEM algorithm is available for research use at https://github.com/tauhidstanford/Feature-augmented-embedding-machine.
Acknowledgements
We thank M. B. Khuzani and H. Ren for their advice in improving the manuscript. This work was partially supported by the National Institutes of Health (nos. 1R01 CA223667 and R01CA227713) and by a Faculty Research Award from Google.
Author information
Contributions
L.X. conceived the experiments; M.T.I. conducted the experiments and analysed the results. Both authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary methods, figures and references.
About this article
Cite this article
Islam, M.T., Xing, L. A data-driven dimensionality-reduction algorithm for the exploration of patterns in biomedical data. Nat Biomed Eng 5, 624–635 (2021). https://doi.org/10.1038/s41551-020-00635-3
DOI: https://doi.org/10.1038/s41551-020-00635-3