Dimensionality reduction to maximize prediction generalization capability

A Publisher Correction to this article was published on 05 May 2021

This article has been updated

Abstract

Generalization of time series prediction remains an important open issue in machine learning; earlier methods have either large generalization errors or local minima. Here, we develop an analytically solvable, unsupervised learning scheme that extracts the most informative components for predicting future inputs, which we call predictive principal component analysis (PredPCA). Our scheme can effectively remove unpredictable noise and minimize test prediction error through convex optimization. Mathematical analyses demonstrate that, provided with sufficient training samples and sufficiently high-dimensional observations, PredPCA can asymptotically identify hidden states, system parameters and dimensionalities of canonical nonlinear generative processes, with a global convergence guarantee. We demonstrate the performance of PredPCA using sequential visual inputs comprising handwritten digits, rotating three-dimensional objects and natural scenes. It reliably estimates distinct hidden states and predicts future outcomes of previously unseen test input data, based exclusively on noisy observations. The simple architecture and low computational cost of PredPCA are highly desirable for neuromorphic hardware.
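To make the scheme concrete, below is a minimal, illustrative Python sketch of the idea described in the abstract: the next observation is predicted from a window of past observations by least squares, and PCA is then applied to those predictions. This is not the authors' reference implementation (the MATLAB scripts listed under 'Code availability' are authoritative); the window length Kp, the encoder dimensionality Nu, the ridge term and all variable names are assumptions made for illustration.

```python
import numpy as np

def predpca_sketch(s, Kp=10, Nu=10, reg=1e-6):
    """Illustrative sketch of predictive PCA (PredPCA).

    s   : (Ns, T) array of observations; columns are time steps
    Kp  : number of past steps used as the predictor basis (assumed value)
    Nu  : number of encoding dimensions to keep (assumed value)
    reg : small ridge term added for numerical stability (assumption)
    """
    Ns, T = s.shape
    # Basis of past observations: phi_t = [s_t; s_{t-1}; ...; s_{t-Kp+1}]
    phi = np.concatenate([s[:, Kp - 1 - k : T - 1 - k] for k in range(Kp)], axis=0)
    target = s[:, Kp:]                      # next observations s_{t+1}, aligned with phi_t
    # Least-squares predictor of the next input from the past basis
    Q = target @ phi.T @ np.linalg.inv(phi @ phi.T + reg * np.eye(Ns * Kp))
    s_pred = Q @ phi                        # predicted inputs s_{t+1|t}
    # PCA of the predicted inputs: eigenvectors of their covariance
    cov = s_pred @ s_pred.T / s_pred.shape[1]
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1][:Nu]
    W = eigvec[:, order]                    # encoding matrix (top Nu components)
    u = W.T @ s_pred                        # low-dimensional encoders of the predictions
    return W, Q, u
```

In this reading, the convexity and analytical solvability claimed above correspond to the fact that both steps (least-squares regression and eigendecomposition) have closed-form solutions.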


Fig. 1: Five different prediction model structures.
Fig. 2: PredPCA of handwritten digit sequences.
Fig. 3: PredPCA-based de-noising, hidden state extraction and subsequent input prediction of videos of rotating 3D objects.
Fig. 4: PredPCA of natural scene videos.

Data availability

Image data used in this work are available in the MNIST dataset (ref. 33; http://yann.lecun.com/exdb/mnist/index.html, for Fig. 2), the ALOI dataset (ref. 36; http://aloi.science.uva.nl, for Fig. 3) and the BDD100K dataset (ref. 37; https://bdd-data.berkeley.edu, for Fig. 4). Figures 2–4 are generated by applying our scripts (see 'Code availability' below) to these image data.

Code availability

MATLAB scripts used in this work are available at https://github.com/takuyaisomura/predpca or https://doi.org/10.5281/zenodo.4362249. The scripts are covered under the GNU General Public License v3.0.

References

  1. Rao, R. P. & Ballard, D. H. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nat. Neurosci. 2, 79–87 (1999).

  2. Rao, R. P. & Sejnowski, T. J. Predictive sequence learning in recurrent neocortical circuits. Adv. Neural Info. Proc. Syst. 12, 164–170 (2000).

  3. Friston, K. A theory of cortical responses. Phil. Trans. R. Soc. Lond. B 360, 815–836 (2005).

  4. Srivastava, N., Mansimov, E. & Salakhudinov, R. Unsupervised learning of video representations using LSTMs. In Int. Conf. Machine Learning 843−852 (ML Research Press, 2015).

  5. Mathieu, M., Couprie, C. & LeCun, Y. Deep multi-scale video prediction beyond mean square error. Preprint at https://arxiv.org/abs/1511.05440 (2015).

  6. Lotter, W., Kreiman, G. & Cox, D. Deep predictive coding networks for video prediction and unsupervised learning. Preprint at https://arxiv.org/abs/1605.08104 (2016).

  7. Hurvich, C. M. & Tsai, C. L. Regression and time series model selection in small samples. Biometrika 76, 297–307 (1989).

  8. Hurvich, C. M. & Tsai, C. L. A corrected Akaike information criterion for vector autoregressive model selection. J. Time Series Anal. 14, 271–279 (1993).

  9. Cunningham, J. P. & Ghahramani, Z. Linear dimensionality reduction: survey, insights, and generalizations. J. Mach. Learn. Res. 16, 2859–2900 (2015).

  10. Hinton, G. E. & Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006).

  11. Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at https://arxiv.org/abs/1312.6114 (2013).

  12. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).

  13. Wehmeyer, C. & Noé, F. Time-lagged autoencoders: deep learning of slow collective variables for molecular kinetics. J. Chem. Phys. 148, 241703 (2018).

  14. Pérez-Hernández, G., Paul, F., Giorgino, T., De Fabritiis, G. & Noé, F. Identification of slow molecular order parameters for Markov model construction. J. Chem. Phys. 139, 015102 (2013).

  15. Klus, S. et al. Data-driven model reduction and transfer operator approximation. J. Nonlinear Sci. 28, 985–1010 (2018).

  16. Kalman, R. E. A new approach to linear filtering and prediction problems. J. Basic Eng. 82, 35–45 (1960).

  17. Julier, S. J. & Uhlmann, J. K. New extension of the Kalman filter to nonlinear systems. In Signal Processing, Sensor Fusion, And Target Recognition VI Vol. 3068, 182−193 (International Society for Optics and Photonics, 1997).

  18. Friston, K. J., Trujillo-Barreto, N. & Daunizeau, J. DEM: A variational treatment of dynamic systems. NeuroImage 41, 849–885 (2008).

  19. Akaike, H. A new look at the statistical model identification. IEEE Trans. Automat. Contr. 19, 716–723 (1974).

  20. Murata, N., Yoshizawa, S. & Amari, S. I. Network information criterion—determining the number of hidden units for an artificial neural network model. IEEE Trans. Neural Netw. 5, 865–872 (1994).

  21. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978).

  22. Vapnik, V. Principles of risk minimization for learning theory. Adv. Neural Info. Proc. Syst. 4, 831–838 (1992).

  23. Arlot, S. & Celisse, A. A survey of cross-validation procedures for model selection. Stat. Surv. 4, 40–79 (2010).

  24. Comon, P. & Jutten, C. (eds) Handbook of Blind Source Separation: Independent Component Analysis And Applications (Academic Press, 2010).

  25. Ljung, L. System Identification: Theory for the User 2nd edn (Prentice-Hall, 1999).

  26. Schoukens, J. & Ljung, L. Nonlinear system identification: a user-oriented roadmap. Preprint at https://arxiv.org/abs/1902.00683 (2019).

  27. Akaike, H. Prediction and entropy. In Selected Papers of Hirotugu Akaike 387−410 (Springer, 1985).

  28. Oja, E. Neural networks, principal components, and subspaces. Int. J. Neural Syst. 1, 61–68 (1989).

  29. Xu, L. Least mean square error reconstruction principle for self-organizing neural-nets. Neural Netw. 6, 627–648 (1993).

  30. Chen, T., Hua, Y. & Yan, W. Y. Global convergence of Oja’s subspace algorithm for principal component extraction. IEEE Trans. Neural Netw. 9, 58–67 (1998).

  31. Bell, A. J. & Sejnowski, T. J. An information-maximization approach to blind separation and blind deconvolution. Neural Comput. 7, 1129–1159 (1995).

  32. Amari, S. I., Cichocki, A. & Yang, H. H. A new learning algorithm for blind signal separation. Adv. Neural Info. Proc. Syst. 8, 757–763 (1996).

  33. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).

  34. Isomura, T. & Toyoizumi, T. On the achievability of blind source separation for high-dimensional nonlinear source mixtures. Preprint at https://arxiv.org/abs/1808.00668 (2018).

  35. Dimigen, O. Optimizing the ICA-based removal of ocular EEG artifacts from free viewing experiments. Neuroimage 207, 116117 (2020).

  36. Geusebroek, J. M., Burghouts, G. J. & Smeulders, A. W. The Amsterdam library of object images. Int. J. Comput. Vis. 61, 103–112 (2005).

  37. Yu, F. et al. BDD100K: a diverse driving video database with scalable annotation tooling. Preprint at https://arxiv.org/abs/1805.04687 (2018).

  38. Schrödinger, E. What Is Life? The Physical Aspect of the Living Cell and Mind (Cambridge Univ. Press, 1944).

  39. Palmer, S. E., Marre, O., Berry, M. J. & Bialek, W. Predictive information in a sensory population. Proc. Natl Acad. Sci. USA 112, 6908–6913 (2015).

  40. Friston, K., Kilner, J. & Harrison, L. A free energy principle for the brain. J. Physiol. Paris 100, 70–87 (2006).

  41. Oymak, S., Fabian, Z., Li, M. & Soltanolkotabi, M. Generalization guarantees for neural networks via harnessing the low-rank structure of the Jacobian. Preprint at https://arxiv.org/abs/1906.05392 (2019).

  42. Suzuki, T. et al. Spectral-pruning: compressing deep neural network via spectral analysis. Preprint at https://arxiv.org/abs/1808.08558 (2018).

  43. Neftci, E. Data and power efficient intelligence with neuromorphic learning machines. iScience 5, 52–68 (2018).

  44. Fouda, M., Neftci, E., Eltawil, A. M. & Kurdahi, F. Independent component analysis using RRAMs. IEEE Trans. Nanotech. 18, 611–615 (2018).

  45. Lee, T. W., Girolami, M., Bell, A. J. & Sejnowski, T. J. A unifying information-theoretic framework for independent component analysis. Comput. Math. Appl. 39, 1–21 (2000).

  46. Isomura, T. & Toyoizumi, T. A local learning rule for independent component analysis. Sci. Rep. 6, 28073 (2016).

  47. Isomura, T. & Toyoizumi, T. Error-gated Hebbian rule: a local learning rule for principal and independent component analysis. Sci. Rep. 8, 1835 (2018).

  48. Dayan, P., Hinton, G. E., Neal, R. M. & Zemel, R. S. The Helmholtz machine. Neural Comput. 7, 889–904 (1995).

  49. Frémaux, N. & Gerstner, W. Neuromodulated spike-timing-dependent plasticity, and theory of three-factor learning rules. Front. Neural Circuits 9, 85 (2016).

  50. Kuśmierz, Ł., Isomura, T. & Toyoizumi, T. Learning with three factors: modulating Hebbian plasticity with errors. Curr. Opin. Neurobiol. 46, 170–177 (2017).

  51. Zhu, B., Jiao, J. & Tse, D. Deconstructing generative adversarial networks. IEEE Trans. Inf. Theory 66, 7155–7179 (2020).

  52. Lusch, B., Kutz, J. N. & Brunton, S. L. Deep learning for universal linear embeddings of nonlinear dynamics. Nat. Commun. 9, 4950 (2018).

  53. Isomura, T. & Toyoizumi, T. Multi-context blind source separation by error-gated Hebbian rule. Sci. Rep. 9, 7127 (2019).

  54. Hornik, K., Stinchcombe, M. & White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366 (1989).

  55. Barron, A. R. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Info. Theory 39, 930–945 (1993).

  56. Rahimi, A. & Recht, B. Uniform approximation of functions with random bases. In Proc. 46th Ann. Allerton Conf. on Communication, Control, and Computing 555−561 (2008).

  57. Rahimi, A. & Recht, B. Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. Adv. Neural Info. Process. Syst. 21, 1313–1320 (2008).

  58. Hyvärinen, A. & Pajunen, P. Nonlinear independent component analysis: existence and uniqueness results. Neural Netw. 12, 429–439 (1999).

  59. Jutten, C. & Karhunen, J. Advances in blind source separation (BSS) and independent component analysis (ICA) for nonlinear mixtures. Int. J. Neural Syst. 14, 267–292 (2004).

  60. Koopman, B. O. Hamiltonian systems and transformation in Hilbert space. Proc. Natl Acad. Sci. USA 17, 315–318 (1931).

  61. Ljung, L. Asymptotic behavior of the extended Kalman filter as a parameter estimator for linear systems. IEEE Trans. Automat. Contr. 24, 36–50 (1979).


Acknowledgements

We are grateful to S.-I. Amari for discussions. This work was supported by RIKEN Center for Brain Science (T.I. and T.T.), Brain/MINDS from AMED under grant number JP20dm020700 (T.T.), and JSPS KAKENHI under grant number JP18H05432 (T.T.). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author information

Contributions

T.I. conceived and designed PredPCA, performed the mathematical analyses and simulations, and wrote the manuscript. T.T. supervised T.I. from the early stage of this work, confirmed the rigour of the mathematical analyses and wrote the manuscript.

Corresponding authors

Correspondence to Takuya Isomura or Taro Toyoizumi.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Supplementary results of PredPCA with handwritten digit images.

a, The transition mapping estimated using PredPCA, \({\mathbf{B}} \in {\mathbb R}^{10 \times 10}\), accurately matches the true transition mapping \(B \in {\mathbb R}^{10 \times 10}\) that generates the ascending order sequence. Elements of \(x_{t+1|t}\) are permuted and sign-flipped for visualization purposes. b, This is also the case for the nonlinear dynamics. The estimated mapping from \(x_{t|t-1} \otimes x_{t-1|t-2}\) to \(x_{t+1|t}\), \({\tilde{\mathbf B}} \in {\mathbb R}^{10 \times 100}\), was obtained using the outcomes of PredPCA and accurately matches the true mapping of the Fibonacci sequence, \(\tilde B \in {\mathbb R}^{10 \times 100}\). Here, \(\otimes\) indicates the Kronecker product. These results indicate that PredPCA can identify the transition rules underlying linear and nonlinear dynamics without observing the true hidden states \(x_t\). c, Prediction error in the absence of random replacement and/or monochrome inversion of digit images, as a counterpart of Fig. 2d. PredPCA's outcomes are retained with or without these distortions, and the relevant encoders comprise up to 10 dimensions owing to the construction of the input data, highlighting the robustness of PredPCA to various types of large noise. In particular, in the presence of monochrome inversion, irrespective of random replacement of digits, \(N_u = 10\) provides the global minimum of both equations (6) and (7). Conversely, in the absence of monochrome inversion, \(N_u = 9\) provides their global minimum because, in this case, the 10-dimensional hidden-state representation becomes redundant: without monochrome inversion, the true hidden states take only 10 distinct positions in the 10-dimensional coordinate system, which can be fully expressed in a 9-dimensional coordinate system. Remarkably, PredPCA could detect this difference. Note that monochrome inversion corresponds to the first principal component (PC1) of PredPCA, because whether the next image is a 'black digit on white background' or a 'white digit on black background' is the most predictable feature, as monochrome inversion rarely occurs. Thus, the relatively large prediction error in the absence of monochrome inversion is due to the lack of this PC1. d, The performance of PredPCA increases as the number of past observations used for prediction (\(K_p\)) increases, until reaching a finite optimum. Left panel: error in categorizing digits, which converges to near zero as \(K_p\) increases (refer to Fig. 2b). Middle panel: parameter estimation error (refer to Fig. 2c). Right panel: test prediction error (refer to Fig. 2d). The blue line is the optimal test prediction error computed via supervised learning. The red line indicates the theoretical value computed using equation (7), wherein \(K_p = 10\) (green line) gives its minimum, which matches the empirical observations (black circles). These observations imply that predicting single-time-step future outcomes (\(s_{t+1}\)) using multi-time-step past observations (\(\phi_t\)) is key to reducing those errors. Note that an extension of PredPCA to multi-time-step prediction, while retaining its accuracy, is provided in the Methods section 'Derivation of PredPCA'. c and d are obtained with 20 different realizations of digit sequences.
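As a rough illustration of how mappings such as those in a and b can be obtained from PredPCA's state estimates, the sketch below fits the linear mapping B and the Kronecker-basis mapping B-tilde by least squares. The ridge term and variable names are assumptions for illustration; this is not the authors' exact procedure.

```python
import numpy as np

def estimate_transitions(x_pred, reg=1e-6):
    """Least-squares estimates of the transition mappings from PredPCA state estimates.

    x_pred : (Nu, T) array whose columns are the state estimates x_{t|t-1} over time
             (for example, the encoders returned by a routine such as predpca_sketch above)
    reg    : small ridge term added for numerical stability (assumption)
    """
    Nu = x_pred.shape[0]
    # Linear mapping B: x_{t+1|t} ~ B x_{t|t-1}
    x_prev, x_next = x_pred[:, :-1], x_pred[:, 1:]
    B = x_next @ x_prev.T @ np.linalg.inv(x_prev @ x_prev.T + reg * np.eye(Nu))
    # Kronecker-product basis x_{t|t-1} (x) x_{t-1|t-2} for second-order (Fibonacci-like) dynamics
    kron = np.einsum('it,jt->ijt', x_pred[:, 1:-1], x_pred[:, :-2]).reshape(Nu ** 2, -1)
    x_next2 = x_pred[:, 2:]                 # targets x_{t+1|t}
    B_tilde = x_next2 @ kron.T @ np.linalg.inv(kron @ kron.T + reg * np.eye(Nu ** 2))
    return B, B_tilde
```

With Nu = 10, B has the 10 x 10 shape and B_tilde the 10 x 100 shape quoted in the caption.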

Extended Data Fig. 2 Comparison with related methods.

The errors in estimating system parameters (left and middle panels, as a counterpart of Fig. 2c) and in predicting one-step-future inputs in the test ascending sequence (right panels; refer to Fig. 2d) are shown. a, Performance of the linear TAE. Although it estimates matrix A with high accuracy, it fails to estimate the other parameters, because the linear TAE (the same as PredPCA with \(\phi_t = s_t\)) does not effectively filter out observation noise. Moreover, the linear TAE yields a larger test prediction error even relative to PredPCA with \(\phi_t = s_t\), owing to the difference in their cost functions: PredPCA (even with \(\phi_t = s_t\)) preferentially extracts the components most important for predicting high-variance signals, and thereby attains the global minimum of the squared error in predicting the non-normalized target signal (under the constraint of \(\phi_t = s_t\)), whereas the linear TAE minimizes the error for a normalized target signal (see Methods section 'Filtering out observation noise' for more details). For reference, the blue and red lines in the right panel represent the optimal test prediction error computed via supervised learning and that of PredPCA with \(\phi_t = s_t\), respectively. The results are obtained with 20 different realizations of digit sequences. b, Performance of the SSM based on the Kalman filter. The SSM also tends to fail at system identification, depending on the initial conditions and training history, which leads to a relatively large prediction error. In the left panel, lines and shaded areas indicate the median and the 25th to 75th percentile range, respectively. The results are obtained with 100 different realizations of digit sequences.
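For readers unfamiliar with the SSM baseline in b, the following is a minimal sketch of one predict-update step of a standard (textbook) Kalman filter, not the authors' full procedure, which additionally estimates the system parameters; it is that coupling between state and parameter estimation that produces the local minima discussed here.

```python
import numpy as np

def kalman_step(x, P, s, A, C, Q, R):
    """One predict-update step of a standard Kalman filter.

    x, P : current state mean and covariance
    s    : new observation
    A, C : state-transition and observation matrices (placeholders here;
           in the SSM baseline these parameters are also learned)
    Q, R : process and observation noise covariances
    """
    # Predict
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Update
    S = C @ P_pred @ C.T + R
    K = P_pred @ C.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x_pred + K @ (s - C @ x_pred)
    P_new = (np.eye(len(x)) - K @ C) @ P_pred
    return x_new, P_new
```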

Extended Data Fig. 3 Accuracy of long-term predictions.

PredPCA and SSM can both yield generative models that predict an arbitrary future. However, SSM can fail to identify the system parameters depending on the initial conditions and training history, leading to the failure of long-term predictions even when provided with a winner-takes-all operation. a, Outcomes of PredPCA enable long-term prediction via greedy prediction based on iterative winner-takes-all operations, regardless of the training dataset. Each row indicates a prediction based on a different realization of the training sequence. A transition mapping from \(x_{t|t-1}\) to \(x_{t+1|t}\) is assumed. b, The long-term prediction is successful even if a transition mapping from \(x_{t|t-1} \otimes x_{t-1|t-2}\) to \(x_{t+1|t}\) is assumed, indicating the minimal influence of the assumed model structure (that is, prior knowledge). c, PredPCA can also predict Fibonacci sequences in the long term, regardless of the training dataset. d, Model selection to determine the optimal number of steps back. Here, the standard AIC was used for model selection. We considered the following four models based on four types of polynomial basis functions: \(x_{t|t-1}\); \(x_{t|t-1} \otimes x_{t-1|t-2}\); \(x_{t|t-1} \otimes x_{t-1|t-2} \otimes x_{t-2|t-3}\); and \(x_{t|t-1} \otimes x_{t-1|t-2} \otimes x_{t-2|t-3} \otimes x_{t-3|t-4}\). The state in the next time period, \(x_{t+1|t}\), was predicted based on these four types of bases, followed by a winner-takes-all operation to conduct the greedy prediction, and their AICs were compared. Left panel: to explain the ascending order sequences, a mapping from \(x_{t|t-1}\) to \(x_{t+1|t}\) was the best among these four models. Right panel: to explain the Fibonacci sequences, a mapping from \(x_{t|t-1} \otimes x_{t-1|t-2}\) to \(x_{t+1|t}\) was significantly better than the other three models. Here, the pairwise t-test was applied based on 10 different realizations. Error bars indicate the standard deviation. e, In contrast, the SSM based on the Kalman filter tends to fail at iterative prediction, even when it uses the winner-takes-all operation, depending on the initial conditions of the state and parameter values and on the training history, owing to its relatively large state and parameter estimation errors. System identification using SSM is severely harmed by the nonlinear interaction between state and parameter estimation, which yields local minima or spurious solutions (Extended Data Fig. 2b); consequently, SSM exhibits an approximately 6% categorization error (Fig. 2b). These inaccuracies undermine iterative predictions using SSM, even when the states are de-noised in each step using a winner-takes-all operation.
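A minimal sketch of the AIC comparison described in d, assuming a least-squares fit and isotropic Gaussian residuals for each candidate basis (both assumptions made for illustration; the exact likelihood model is given in the paper), could look like the following.

```python
import numpy as np

def aic_for_basis(phi, target, reg=1e-6):
    """AIC of a least-squares transition model target ~ W @ phi,
    assuming isotropic Gaussian residuals (an illustrative assumption)."""
    W = target @ phi.T @ np.linalg.inv(phi @ phi.T + reg * np.eye(phi.shape[0]))
    resid = target - W @ phi
    n = resid.size                          # number of scalar residuals
    rss = np.sum(resid ** 2)
    k = W.size + 1                          # free parameters: mapping plus noise variance
    return n * np.log(rss / n) + 2 * k

# Illustrative usage, reusing x_pred and kron from the earlier sketch:
# aic_first_order  = aic_for_basis(x_pred[:, 1:-1], x_pred[:, 2:])
# aic_second_order = aic_for_basis(kron, x_pred[:, 2:])
```

The model with the lowest AIC would be selected, which in the caption's experiments is the first-order basis for the ascending order sequences and the second-order (Kronecker) basis for the Fibonacci sequences.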

Extended Data Fig. 4 Instability of features extracted by TAE and SSM.

This figure is a counterpart of Fig. 3b. TAE and SSM do not guarantee the global convergence of their outcomes, and as a result their extracted features are sensitive to the initial conditions, the order of supplying mini-batches, and the level of observation noise. The extracted features in six trials are shown; the last three are outcomes trained with large noise. The same training dataset was used for all trials. However, as the initial parameter values for TAE and SSM and the order of supplying mini-batches were varied, different features were extracted. The difference in the observation noise level also altered their outcomes. These results imply the unreliability of the features extracted by TAE and SSM, and further highlight the benefit of the global convergence guarantee of PredPCA.

Extended Data Fig. 5 Feature extraction of driving car movies.

a, PC1–PC3 of the categorical features (that is, \({\bar{\mathbf x}}_t\)), representing the brightness and the vertical and lateral symmetries of scenes. b, PC1 of the dynamical features (that is, \(\Delta x_{t+3|t}\)), representing the lateral motion. Although a and b were obtained using PredPCA with grouping of the data, these extracted features accurately match those obtained using PredPCA without the six sub-groups (Fig. 4b,c). This implies that PredPCA offers reliable identification of relevant features even when using the data grouping. c, 100 major categorical features (\({\bar{\mathbf x}}_t\)) representing different categories of scenes. d, 100 major dynamical features (\(\Delta x_{t+3|t}\)) responding to motions at different positions of the screen. The white areas indicate the receptive field of each encoder. c and d were obtained using PredPCA and ICA without the six sub-groups. Similar to Fig. 3b, these images visualize the linear mapping from each independent component to the observation.
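c and d above were obtained 'using PredPCA and ICA'. As a hedged illustration only (the authors' MATLAB scripts define the actual pipeline), one way to obtain independent components from PredPCA encoders in Python is scikit-learn's FastICA:

```python
import numpy as np
from sklearn.decomposition import FastICA

def ica_features(u, n_components=100, seed=0):
    """Independent components of PredPCA encoders.

    u : (Nu, T) array of PredPCA encoders (for example, from predpca_sketch above);
        n_components must not exceed Nu. The value 100 mirrors the caption.
    """
    ica = FastICA(n_components=n_components, random_state=seed, max_iter=1000)
    sources = ica.fit_transform(u.T)        # (T, n_components) independent components
    mixing = ica.mixing_                    # linear mapping from components back to encoders
    return sources.T, mixing
```

Visualizations like those in c and d would then map each independent component back through the encoding to the observation space, as the caption describes.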

Supplementary information

Supplementary Information

Supplementary Video Legends 1–3, Figs. 1 and 2, discussion, Methods 1–6, and refs.

Reporting Summary

Supplementary Video 1

Supplementary Video 2

Supplementary Video 3

About this article

Cite this article

Isomura, T., Toyoizumi, T. Dimensionality reduction to maximize prediction generalization capability. Nat Mach Intell 3, 434–446 (2021). https://doi.org/10.1038/s42256-021-00306-1
