Abstract
Stacked auto-encoders are deep learning models that automatically extract meaningful unsupervised features from input data through a hierarchical learning process. The parameters of each auto-encoder (AE) are learned layer by layer. Because optimization is central to training neural networks and auto-encoders, the learning rate is one of their most important hyper-parameters. Its choice matters even more on large-scale and, especially, sparse data sets. In this paper, we adapt the learning rate separately for the various components of an AE network at each stochastic gradient calculation, and we analyze the theoretical convergence of back-propagation learning under the proposed method. We also extend the methodology to online adaptive optimization methods suited to deep learning. Compared with constant learning rates, we obtain promising results on four classification tasks, all run on a single machine: (1) MNIST digits, (2) blogs-Gender-100 text, (3) smartphone-based recognition of human activities and postural transitions time series, and (4) EEG brainwave feeling emotions time series.
Abbreviations
- MLP: Multi-layer perceptron
- SGD: Stochastic gradient descent
- StGD: Stable gradient descent
- NGD: Natural gradient descent
- kSGD: Kalman-based stochastic gradient descent
- RMS: Root mean square
- RBP: Resilient back-propagation rule
- KL-divergence: Kullback-Leibler divergence
- NAG: Nesterov accelerated gradient
- Adagrad: Adaptive sub-gradient method for online learning and stochastic optimization
- Adadelta: Adaptive learning rate method
- RMSprop: Root mean square propagation method
- Adam: Adaptive moment estimation method
- Nadam: Nesterov-accelerated adaptive moment estimation method
- AMSGrad: Exponential moving average method
- ANFIS: Stable adaptive network-based fuzzy inference system
- PSO: Particle swarm optimization
- AE: Auto-encoder
- SAE: Stacked auto-encoder
- LSTM: Long short-term memory
- CNN: Convolutional neural network
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest.
Appendices
Appendix A: a solution for \(c^o\)
We need \(c^o(k)\) at each iteration \(k\). The adaptive algorithm based on gradient descent is therefore:
This shows that we must compute the gradient of the cost function \(J(k)\) with respect to \(c^o(k-1)\). By the chain rule, we have:
Replacing \(k\) with \(k-1\) in (25) gives
and likewise in (20),
Finally, combining (50), (51) and (52), we have:
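As a generic illustration of the chain-rule pattern used in this appendix, and not a reconstruction of its numbered equations, suppose that \(c^o\) acts as the output-layer learning rate, that the output-layer weights \(W^o\) follow a plain gradient step, and that \(c^o\) is itself adapted with a meta step size \(\gamma\). The symbols \(W^o\) and \(\gamma\) and the plain update rule are assumptions made for this sketch only. Under those assumptions:

\[
\begin{aligned}
W^o(k) &= W^o(k-1) - c^o(k-1)\,\nabla_{W^o} J(k-1),\\
\frac{\partial J(k)}{\partial c^o(k-1)} &= \left\langle \nabla_{W^o} J(k),\ \frac{\partial W^o(k)}{\partial c^o(k-1)} \right\rangle = -\left\langle \nabla_{W^o} J(k),\ \nabla_{W^o} J(k-1) \right\rangle,\\
c^o(k) &= c^o(k-1) - \gamma\,\frac{\partial J(k)}{\partial c^o(k-1)} = c^o(k-1) + \gamma\left\langle \nabla_{W^o} J(k),\ \nabla_{W^o} J(k-1) \right\rangle.
\end{aligned}
\]

In this simplified setting the rate grows while successive gradients point in similar directions and shrinks when they oppose each other.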
Appendix B: a solution for \(c^h\)
We need \(c^h(k)\) at each iteration \(k\). The adaptive algorithm based on gradient descent is therefore:
This shows that we must compute the gradient of the cost function \(J(k)\) with respect to \(c^h(k-1)\). By the chain rule, we have:
Replacing \(k\) with \(k-1\) in (32) gives
and likewise in (27),
Finally, combining (56), (57) and (58), we have:
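The idea of adapting a separate rate for each layer by gradient descent on the cost can be sketched numerically. The following is a minimal illustration, not the authors' algorithm: it assumes \(c^o\) and \(c^h\) play the role of the output- and hidden-layer learning rates, uses plain full-batch gradient steps on a one-hidden-layer auto-encoder, and introduces hypothetical names (`train_autoencoder`, `gamma`, `c_min`, `c_max`) that do not come from the paper. The clipping of the rates is an added safeguard, also an assumption.

```python
import numpy as np


def train_autoencoder(X, n_hidden=4, c_o=0.01, c_h=0.01, gamma=1e-3,
                      c_min=1e-6, c_max=0.05, epochs=200, seed=0):
    """One-hidden-layer auto-encoder with a separately adapted rate per layer.

    c_o / c_h: output- and hidden-layer learning rates (assumed roles).
    gamma:     hypothetical meta step size used to adapt the rates.
    c_min/max: safeguard bounds on the rates (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    n, n_in = X.shape
    W_h = rng.normal(0.0, 0.1, size=(n_in, n_hidden))  # encoder weights
    W_o = rng.normal(0.0, 0.1, size=(n_hidden, n_in))  # decoder weights
    prev_g_o = np.zeros_like(W_o)
    prev_g_h = np.zeros_like(W_h)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W_h)        # hidden code
        err = H @ W_o - X           # reconstruction error
        # J = mean over samples of 0.5 * ||err||^2, recorded before the update
        losses.append(0.5 * float(np.mean(np.sum(err ** 2, axis=1))))
        # Back-propagated gradients of J for each layer
        g_o = H.T @ err / n
        g_h = X.T @ ((err @ W_o.T) * (1.0 - H ** 2)) / n
        # Rate adaptation: with W(k) = W(k-1) - c(k-1) g(k-1),
        # dJ(k)/dc(k-1) = -<g(k), g(k-1)>, so descend along that direction.
        c_o = float(np.clip(c_o + gamma * np.sum(g_o * prev_g_o), c_min, c_max))
        c_h = float(np.clip(c_h + gamma * np.sum(g_h * prev_g_h), c_min, c_max))
        # Per-layer gradient steps using the freshly adapted rates
        W_o -= c_o * g_o
        W_h -= c_h * g_h
        prev_g_o, prev_g_h = g_o, g_h
    return losses, c_o, c_h
```

Calling `train_autoencoder(X)` on a small real-valued data matrix `X` (rows as samples) returns the per-iteration reconstruction loss together with the final values of the two rates, which can be compared against a run with `gamma=0.0` (constant rates).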
Cite this article
Moradi Vartouni, A., Teshnehlab, M. & Sedighian Kashi, S. SAOSA: Stable Adaptive Optimization for Stacked Auto-encoders. Neural Process Lett 52, 823–848 (2020). https://doi.org/10.1007/s11063-020-10277-w
DOI: https://doi.org/10.1007/s11063-020-10277-w