SAOSA: Stable Adaptive Optimization for Stacked Auto-encoders

Moradi Vartouni, Ali; Teshnehlab, Mohammad; Sedighian Kashi, Saeed

doi:10.1007/s11063-020-10277-w

Published: 22 June 2020

Volume 52, pages 823–848, (2020)
Neural Processing Letters

Ali Moradi Vartouni¹,
Mohammad Teshnehlab² &
Saeed Sedighian Kashi¹

Abstract

Stacked auto-encoders are deep learning models that automatically extract meaningful features from unlabeled input data through a hierarchical learning process. The parameters of each auto-encoder (AE) are learnt layer by layer. Because optimization is a central component of neural networks and AEs, the learning rate is one of their crucial hyper-parameters, and its choice matters even more on large-scale and, especially, sparse data sets. In this paper, we adapt the learning rate of the AE separately for the various components of the network at each stochastic gradient calculation, and we analyze the theoretical convergence of back-propagation learning under the proposed method. We also extend our methodology to online adaptive optimization methods suitable for deep learning. Compared with constant learning rates, we obtain promising results on four classification tasks run on a single machine: (1) MNIST digits, (2) blogs-Gender-100 text, (3) smartphone-based recognition of human activities and postural transitions (time series), and (4) EEG brainwave feeling emotions (time series).

Abbreviations

Abbreviation:: Description
MLP:: Multi-layer perceptron
SGD:: Stochastic gradient descent
StGD:: Stable gradient descent
NGD:: Natural gradient descent
kSGD:: Kalman-based stochastic gradient descent
RMS:: Root mean square
RBP:: Resilient back-propagation rule
KL-divergence:: Kullback-Leibler divergence
NAG:: Nesterov accelerated gradient
Adagrad:: Adaptive sub-gradient method for online learning and stochastic optimization
Adadelta:: Adaptive learning rate method
RMSprop:: Root mean square propagation method
Adam:: Adaptive moment estimation method
Nadam:: Nesterov-accelerated adaptive moment estimation method
AMSGrad:: Exponential moving average method
ANFIS:: Stable adaptive network based fuzzy inference system
PSO:: Particle swarm optimization
AE:: Auto-encoder
SAE:: Stacked auto-encoder
LSTM:: Long short-term memory
CNN:: Convolutional neural network

Author information

Authors and Affiliations

Faculty of Computer Engineering, K.N Toosi University of Technology, Tehran, Iran
Ali Moradi Vartouni & Saeed Sedighian Kashi
Faculty of Electrical and Computer Engineering, K.N Toosi University of Technology, Tehran, Iran
Mohammad Teshnehlab

Corresponding author

Correspondence to Mohammad Teshnehlab.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Appendix: a solution for $c^o$

According to (21) and (25):

$$\begin{aligned} \mathbf{w} _j^o(k+1)= \mathbf{w} _j^o(k) +\frac{e_j(k)g_j'(k)\mathbf{h} (k)}{(g_j'(k))^2\Vert \mathbf{h} (k)\Vert ^2+c^o(k)} \quad \text{for every } j \end{aligned}$$

(48)

This update requires $c^o (k)$ at each iteration $k$. Therefore, the adaptive algorithm based on gradient descent is:

$$\begin{aligned} \mathbf{c} ^o (k)= \mathbf{c} ^o (k-1)+\rho ^o \frac{\partial J(k)}{\partial \mathbf{c} ^o (k-1)} \end{aligned}$$

(49)

This requires the gradient of the cost function $J(k)$ with respect to $c^o (k-1)$. According to the chain rule, we have:

$$\begin{aligned} \begin{aligned}&\frac{\partial J(k)}{\partial \mathbf{c} ^o(k-1)} \\&\quad =\frac{\partial J(k)}{\partial \mathbf{e} (k)} \frac{\partial \mathbf{e} (k)}{\partial \mathbf{r} (k)} \frac{\partial \mathbf{r} (k)}{\partial \mathbf{net}2 (k)} \frac{\partial \mathbf{net}2 (k)}{\partial W^o(k)} \frac{\partial W^o(k)}{\partial \varvec{\eta }^o(k-1)} \frac{\partial \varvec{\eta }^o(k-1)}{\partial \mathbf{c} ^o(k-1)} \\&\quad =-\mathbf{e} (k)\mathbf{g} '(k)\mathbf{h} (k)\frac{\partial W^o(k)}{\partial \varvec{\eta }^o(k-1)} \frac{\partial \varvec{\eta }^o(k-1)}{\partial \mathbf{c} ^o(k-1)} \end{aligned} \end{aligned}$$

(50)

Replacing $k$ with $k-1$ in (25) gives

$$\begin{aligned} \frac{\partial \eta _{j}^o(k-1)}{\partial c_{j}^o(k-1)} =\frac{-1}{((g_j'(k-1))^2\Vert \mathbf{h} (k-1)\Vert ^2+c_j^o(k-1))^2} \end{aligned}$$

(51)

and, likewise, in (20) gives

$$\begin{aligned} \frac{\partial W^o(k)}{\partial \eta _{j}^o(k-1)} =e_j(k-1)g_j'(k-1)\mathbf{h} (k-1) \end{aligned}$$

(52)
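The two partial derivatives (51) and (52) are one-line differentiations. The sketch below spells them out under an assumption, since (20) and (25) are not reproduced in this excerpt: the learning rate is taken to be the reciprocal of the denominator in (48), and the weight update is taken to be linear in that learning rate. The shorthand $a_j$ is introduced here only for readability.

$$\begin{aligned} \eta _j^o(k-1)&=\frac{1}{a_j(k-1)+c_j^o(k-1)}, \qquad a_j(k-1)=(g_j'(k-1))^2\Vert \mathbf{h} (k-1)\Vert ^2 \\ \Rightarrow \quad \frac{\partial \eta _j^o(k-1)}{\partial c_j^o(k-1)}&=\frac{-1}{(a_j(k-1)+c_j^o(k-1))^2}, \\ \mathbf{w} _j^o(k)&=\mathbf{w} _j^o(k-1)+\eta _j^o(k-1)\,e_j(k-1)g_j'(k-1)\mathbf{h} (k-1) \\ \Rightarrow \quad \frac{\partial \mathbf{w} _j^o(k)}{\partial \eta _j^o(k-1)}&=e_j(k-1)g_j'(k-1)\mathbf{h} (k-1) \end{aligned}$$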

Finally, substituting (51) and (52) into (50), we have:

$$\begin{aligned} \begin{aligned} \frac{\partial J(k)}{\partial c_j^o(k-1)}= \frac{e_j(k)g_j'(k)\mathbf{h} (k) \times e_j(k-1)g_j'(k-1)\mathbf{h} (k-1)}{((g_j'(k-1))^2\Vert \mathbf{h} (k-1)\Vert ^2+c_j^o(k-1))^2} \end{aligned} \end{aligned}$$

(53)
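To make the recursion in (48), (49) and (53) concrete, the following NumPy sketch performs one output-layer step. It is an illustration under stated assumptions, not the authors' implementation: the function name `output_step`, the per-unit vector `c`, the reading of the product "×" in (53) as an inner product between $\mathbf{h}(k)$ and $\mathbf{h}(k-1)$, and the floor `c_min` (a numerical safeguard that does not appear in the appendix) are all choices made here. The sign of the update for `c` follows (49) as printed.

```python
import numpy as np

def output_step(W, c, h, e, gp, prev=None, rho=1e-3, c_min=1e-8):
    """One update of output weights W (n_out x n_hid) and damping terms c (n_out,).

    h    : hidden activations h(k), shape (n_hid,)
    e    : output errors e(k), shape (n_out,)
    gp   : output activation derivatives g'(k), shape (n_out,)
    prev : (e, gp, h) saved from step k-1, or None on the first step
    """
    if prev is not None:
        e_p, gp_p, h_p = prev
        # Eq. (53): product of the gradient terms at k and k-1, divided by the
        # squared denominator of step k-1 ("x" read as an inner product).
        denom_p = gp_p**2 * (h_p @ h_p) + c
        grad_c = (e * gp) * (e_p * gp_p) * (h @ h_p) / denom_p**2
        # Eq. (49) as printed; the floor c_min is an added assumption.
        c = np.maximum(c + rho * grad_c, c_min)
    # Eq. (48): each row w_j gets a step normalized by (g'_j)^2 ||h||^2 + c_j.
    eta = 1.0 / (gp**2 * (h @ h) + c)
    W_new = W + (eta * e * gp)[:, None] * h[None, :]
    return W_new, c, (e, gp, h)

# Usage with a linear output unit (g' = 1) and a hypothetical target x.
W = np.zeros((2, 3))
c = np.full(2, 9.0)
h = np.array([1.0, 2.0, 2.0])
x = np.array([1.0, -1.0])
e = x - W @ h
W, c, state = output_step(W, c, h, e, np.ones(2))
```

With $c^o = 0$ and a linear output, the step reduces to a normalized least-mean-squares update that cancels the error on the current sample; a positive $c^o$ shrinks the step, which is the damping role the appendix adapts online.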

B Appendix: a solution for $c^h$

According to (30) and (32):

$$\begin{aligned} \mathbf{w} _i^h(k+1)= \mathbf{w} _i^h(k) +\frac{e_{j}(k)g_{j}'(k)\mathbf{w} _j^o(k)\mathbf{f} '(k)x_i(k)}{(g_j'(k))^2\Vert \mathbf{f} '(k)\Vert ^2 \Vert \mathbf{w} _j^o(k)\Vert ^2\Vert \mathbf{x} (k)\Vert ^2+c^h(k)} \end{aligned}$$

(54)

This update requires $c^h (k)$ at each iteration $k$. Therefore, the adaptive algorithm based on gradient descent is:

$$\begin{aligned} \mathbf{c} ^h (k)= \mathbf{c} ^h (k-1)+\rho ^h \frac{\partial J(k)}{\partial \mathbf{c} ^h (k-1)} \end{aligned}$$

(55)

This requires the gradient of the cost function $J(k)$ with respect to $c^h (k-1)$. According to the chain rule, we have:

$$\begin{aligned} \begin{aligned}&\frac{\partial J(k)}{\partial \mathbf{c} ^h(k-1)} \\&\quad =\frac{\partial J(k)}{\partial \mathbf{e} (k)} \frac{\partial \mathbf{e} (k)}{\partial \mathbf{r} (k)} \frac{\partial \mathbf{r} (k)}{\partial \mathbf{net}2 (k)} \frac{\partial \mathbf{net}2 (k)}{\partial \mathbf{h} (k)} \frac{\partial \mathbf{h} (k)}{\partial \mathbf{net}1 (k)} \\&\qquad \frac{\partial \mathbf{net}1 (k)}{\partial W^h(k)} \frac{\partial W^h(k)}{\partial \varvec{\eta }^h(k-1)} \frac{\partial \varvec{\eta }^h(k-1)}{\partial \mathbf{c} ^h(k-1)} \\&\quad =-\mathbf{e} (k)\mathbf{g} '(k)W^o(k)\mathbf{f} '(k)\mathbf{x} (k)\frac{\partial W^h(k)}{\partial \varvec{\eta }^h(k-1)} \frac{\partial \varvec{\eta }^h(k-1)}{\partial \mathbf{c} ^h(k-1)} \end{aligned} \end{aligned}$$

(56)

Replacing $k$ with $k-1$ in (32) gives

$$\begin{aligned} \begin{aligned}&\frac{\partial \eta _{j}^h(k-1)}{\partial c_{j}^h(k-1)} \\&\quad =\frac{-1}{((g_j'(k-1))^2\Vert \mathbf{f} '(k-1)\Vert ^2 \Vert \mathbf{w} _j^o(k-1)\Vert ^2\Vert \mathbf{x} (k-1)\Vert ^2+c_j^h(k-1))^2} \end{aligned} \end{aligned}$$

(57)

and, likewise, in (27) gives

$$\begin{aligned} \frac{\partial W^h(k)}{\partial \eta _{j}^h(k-1)} = e_j(k-1)g_j'(k-1)\mathbf{w} _j^o(k-1)\mathbf{f} '(k-1)x_i(k-1) \end{aligned}$$

(58)

Finally, substituting (57) and (58) into (56), we have:

$$\begin{aligned} \begin{aligned}&\frac{\partial J(k)}{\partial c_j^h(k-1)}= e_j(k)g_j'(k)\mathbf{w} _j^o(k)\mathbf{f} '(k)x_i(k) \\&\quad \times \frac{ e_j(k-1)g_j'(k-1)\mathbf{w} _j^o(k-1)\mathbf{f} '(k-1)x_i(k-1)}{((g_j'(k-1))^2\Vert \mathbf{f} '(k-1)\Vert ^2 \Vert \mathbf{w} _j^o(k-1)\Vert ^2\Vert \mathbf{x} (k-1)\Vert ^2+c_j^h(k-1))^2} \end{aligned} \end{aligned}$$

(59)
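The hidden-layer quantities in (54), (57) and (59) can be sketched in the same way for a single output unit $j$. This is an illustration under assumptions rather than the authors' code: the helper names are hypothetical, the numerator of (54) is read as an (n_hid × n_in) matrix with entries $e_j g_j' w^o_{jm} f'_m x_i$, and the product "×" in (59) is read as an element-wise product summed over all entries. How contributions from different output units $j$ are combined is not specified in the appendix and is left out here.

```python
import numpy as np

def hidden_direction(e_j, gp_j, w_o_j, fp, x):
    """Numerator of (54) for output unit j, as an (n_hid, n_in) matrix.

    w_o_j : output weights of unit j, shape (n_hid,)
    fp    : hidden activation derivatives f'(k), shape (n_hid,)
    x     : input vector x(k), shape (n_in,)
    """
    return e_j * gp_j * np.outer(w_o_j * fp, x)

def hidden_denominator(gp_j, w_o_j, fp, x, c_h):
    """Denominator shared by (54) and (57)."""
    return gp_j**2 * (fp @ fp) * (w_o_j @ w_o_j) * (x @ x) + c_h

def hidden_step(W_h, curr, c_h):
    """Eq. (54): curr is the tuple (e_j, gp_j, w_o_j, fp, x) at step k."""
    _, gp_j, w_o_j, fp, x = curr
    return W_h + hidden_direction(*curr) / hidden_denominator(gp_j, w_o_j, fp, x, c_h)

def grad_c_h(curr, prev, c_h_prev):
    """Eq. (59): gradient of J(k) with respect to c_j^h(k-1)."""
    _, gp_p, w_p, fp_p, x_p = prev
    denom = hidden_denominator(gp_p, w_p, fp_p, x_p, c_h_prev)
    # Assumed reading of "x": sum of element-wise products of the two matrices.
    return np.sum(hidden_direction(*curr) * hidden_direction(*prev)) / denom**2

# Usage with small hypothetical values.
step = (1.0, 1.0, np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([1.0, 1.0]))
W_h = hidden_step(np.zeros((2, 2)), step, c_h=0.0)
g = grad_c_h(step, step, c_h_prev=0.0)
```

The resulting `g` would then feed the update (55) for $c^h$, in the same way the output-layer gradient feeds (49).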

About this article

Cite this article

Moradi Vartouni, A., Teshnehlab, M. & Sedighian Kashi, S. SAOSA: Stable Adaptive Optimization for Stacked Auto-encoders. Neural Process Lett 52, 823–848 (2020). https://doi.org/10.1007/s11063-020-10277-w

Published: 22 June 2020
Issue Date: August 2020
DOI: https://doi.org/10.1007/s11063-020-10277-w
