A weight initialization based on the linear product structure for neural networks
Introduction
With the rapid growth of applications of neural networks to large datasets, the initialization of network weights significantly affects both the training process and the final accuracy. It is well known that zero initialization or arbitrary random initialization can slow down or even completely stall convergence. This is the so-called problem of exploding or vanishing gradients, which slows down backpropagation and retards the overall training process [1]. Exploding gradients occur when the gradients become larger and larger, causing oscillation around minima or even blow-up during training; vanishing gradients are the exact opposite, occurring when the gradients become smaller and smaller during backpropagation, causing slower convergence and possibly stopping the training process entirely. Therefore, proper initialization of the weights is necessary for training neural networks [2], [3]. The most popular initialization method is to draw samples from a normal distribution $\mathcal{N}(0, \sigma^2)$, where $\sigma$ is chosen to ensure that the variance of the outputs from the different layers is approximately the same. The first systematic analysis of this initialization was conducted by Glorot and Bengio [4], who showed that, for a linear activation function, the optimal value is $\sigma^2 = 1/n$, where $n$ is the number of nodes feeding into that layer. Although this study makes several assumptions about the inputs to the model, it works extremely well in many cases (especially for activations such as tanh) and is widely used in the initialization of neural networks, commonly referred to as Xavier initialization. Another important follow-up work is He initialization [5], which argues that Xavier initialization does not work well with the ReLU activation function and changes the variance to $\sigma^2 = 2/n$, achieving tremendous success in ReLU neural networks such as ResNet. Recently, weight initialization has become an active research area, and numerous methods [2], [6], [7], [8], [9], [10], [11], [12] have been developed to initialize the weights of different neural networks. All the aforementioned initialization works are based on controlling the variance of signals between the deeper layers to avoid vanishing/exploding gradients at the beginning of training, but they do not consider the nonlinearity of neural networks, which could let the initialization further improve the final training performance.
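To make the two classical variance choices concrete, the following minimal sketch (our illustration, not from the paper; the width and depth are arbitrary) pushes one random input through a deep ReLU stack under each scheme: with $\sigma^2 = 1/n$ the activation scale decays geometrically, while $\sigma^2 = 2/n$ keeps it roughly constant.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_weights(fan_in, fan_out):
    # Glorot & Bengio [4]: sigma^2 = 1/fan_in keeps the output variance of a
    # (near-)linear layer roughly equal to its input variance.
    return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_out, fan_in))

def he_weights(fan_in, fan_out):
    # He et al. [5]: sigma^2 = 2/fan_in, since ReLU zeroes out (on average)
    # half of the pre-activations and thereby halves the variance.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

width, depth = 512, 50
x = rng.normal(size=width)
for name, init in [("Xavier", xavier_weights), ("He", he_weights)]:
    h = x
    for _ in range(depth):
        h = np.maximum(init(width, width) @ h, 0.0)  # ReLU layer, zero bias
    print(f"{name}: std of layer-{depth} activations = {np.std(h):.3e}")
```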
The main contribution of this paper is to study neural networks from the nonlinear computation point of view [13], [14]. We approximate the activation functions by polynomials to provide a new weight initialization approach. The proposed weight initialization algorithm is based on the linear product structure of neural networks and comes with a theoretical guarantee, grounded in numerical algebraic geometry [15], [16], [17], of finding all local minima. Further theoretical analysis reveals that our new initialization method has a low probability of dying ReLU for deep neural networks. Numerical experiments on both fully connected neural networks and convolutional neural networks show the feasibility and efficiency of the proposed initialization algorithm.
Section snippets
Problem setup and polynomial approximation of activation functions
By considering a $k$-layer neural network, we represent the output, $y_k$, in terms of the input, $x$, as
$$y_\ell = \sigma(W_\ell\, y_{\ell-1} + b_\ell), \quad \ell = 1, \ldots, k, \qquad y_0 = x, \tag{1}$$
where $W_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ is the weight matrix, $b_\ell \in \mathbb{R}^{n_\ell}$ is the bias vector, $n_\ell$ is the width of the $\ell$-th layer, $n_0$ is the input dimension, $n_k$ is the output dimension, and $\sigma$ is the activation function. For simplicity, we denote the set of all parameters as $\theta = \{W_\ell, b_\ell\}_{\ell=1}^{k}$ and the number of all parameters as $N$. The activation function, $\sigma$, is a nonlinear function but not a polynomial…
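As a hedged illustration of this approximation step, the sketch below fits a quadratic polynomial to ReLU in the least-squares sense on $[-1, 1]$; the degree, interval, and norm are our assumptions and need not match the approximation scheme used in the paper.

```python
import numpy as np

# Least-squares fit of a degree-2 polynomial to ReLU on [-1, 1]; the interval,
# degree, and choice of norm are illustrative.
t = np.linspace(-1.0, 1.0, 2001)
relu = np.maximum(t, 0.0)
c = np.polynomial.polynomial.polyfit(t, relu, deg=2)  # coefficients [c0, c1, c2]
p = np.polynomial.polynomial.Polynomial(c)

print("p(t) = %.4f + %.4f t + %.4f t^2" % tuple(c))
print("max |ReLU(t) - p(t)| on [-1, 1] = %.4f" % np.max(np.abs(relu - p(t))))
```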
Linear product structure and weight initialization
After the activation function is approximated by a polynomial, namely $\sigma(x) \approx p(x)$, the neural network representation in (1) becomes
$$y_\ell = p(W_\ell\, y_{\ell-1} + b_\ell), \quad \ell = 1, \ldots, k. \tag{5}$$
For each component $p(w_j^\ell \cdot y_{\ell-1} + b_j^\ell)$, $j = 1, \ldots, n_\ell$, we decompose the polynomial expression (5) into a linear product structure [16], [22], namely,
$$p(w_j^\ell \cdot y_{\ell-1} + b_j^\ell) \in \underbrace{\langle y_{\ell-1}^1, \ldots, y_{\ell-1}^{n_{\ell-1}}, 1\rangle \otimes \cdots \otimes \langle y_{\ell-1}^1, \ldots, y_{\ell-1}^{n_{\ell-1}}, 1\rangle}_{\deg p \ \text{factors}},$$
where $w_j^\ell$ is the $j$-th row of $W_\ell$ and $\langle y_{\ell-1}^1, \ldots, y_{\ell-1}^{n_{\ell-1}}, 1\rangle$ represents the linear space generated by the variables $y_{\ell-1}^1, \ldots, y_{\ell-1}^{n_{\ell-1}}$ and 1 (similarly for the other layers)…
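The following sketch illustrates, for a single neuron and a quadratic surrogate, why such a decomposition exists: a degree-2 polynomial in $t = w \cdot x + b$ factors (over the complex numbers in general) into two factors that are each linear in $(x_1, \ldots, x_d, 1)$. The coefficients and weights below are illustrative placeholders, not the paper's values.

```python
import numpy as np

# A quadratic surrogate p(t) = c0 + c1*t + c2*t^2 factors as c2*(t - r1)*(t - r2).
# Composed with the affine map t = w.x + b, each factor is *linear* in
# (x_1, ..., x_d, 1), which is exactly the linear product structure above.
c0, c1, c2 = 0.25, 0.5, 0.25       # p(t) = (t + 1)^2 / 4, an illustrative surrogate
r1, r2 = np.roots([c2, c1, c0])    # roots may be complex in general

w = np.array([0.3, -0.7])          # one row of a weight matrix (illustrative)
b = 0.1
x = np.array([1.0, 2.0])

t = w @ x + b
assert np.isclose(c0 + c1 * t + c2 * t**2,
                  (c2 * (t - r1) * (t - r2)).real)
```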
Theoretical analysis of the LPS initialization on the dying ReLU
The dying ReLU problem occurs when the weights become negative in such a way that a ReLU neuron is inactive and outputs zero for any input [23]. The gradient of such a neuron is then zero, so large parts of the neural network do nothing. If the network is deep, dying ReLU may even occur at the initialization step, in which case the whole training process based on existing initialization algorithms fails at the very beginning. The LPS initialization strategy resolves this issue with a theoretical guarantee…
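As an illustration of the phenomenon itself (not of the paper's analysis), the sketch below empirically estimates how often a narrow He-initialized ReLU network is already dead at initialization; the function name, widths, and sample sizes are our own choices, and the estimated fraction should grow with depth.

```python
import numpy as np

rng = np.random.default_rng(0)

def born_dead_fraction(depth, width=4, n_nets=1000, n_probes=128):
    """Fraction of He-initialized ReLU nets whose output is zero for every
    probed input already at initialization (an empirical proxy for the
    'born dead' event; width/depth values here are illustrative)."""
    dead = 0
    for _ in range(n_nets):
        H = rng.normal(size=(n_probes, width))   # probe inputs
        for _ in range(depth):
            W = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))
            H = np.maximum(H @ W, 0.0)           # ReLU layer, zero bias
        dead += np.all(H == 0.0)
    return dead / n_nets

for depth in (1, 5, 10, 20, 40):
    print(f"depth {depth:2d}: born-dead fraction ~ {born_dead_fraction(depth):.3f}")
```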
Numerical experiments
In this section, we apply the LPS initialization algorithm to both fully connected neural networks and convolutional neural networks with the ReLU activation function and compare it with the He initialization developed by He et al. [5]. All the experimental details and hyperparameters are reported in Appendix A.5.
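For readers who want to reproduce the baseline side of such a comparison, here is a minimal, hedged harness in PyTorch; `build_mlp` and `he_initialize` are our hypothetical helpers, and the LPS initialization itself (the paper's algorithm) is not reproduced here.

```python
import torch.nn as nn

def build_mlp(widths):
    """Fully connected ReLU network; widths like [784, 256, 256, 10]."""
    layers = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        layers += [nn.Linear(n_in, n_out), nn.ReLU()]
    return nn.Sequential(*layers[:-1])  # no ReLU after the output layer

def he_initialize(model):
    # The He-initialization baseline [5]; PyTorch calls it Kaiming init.
    for m in model.modules():
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
            nn.init.zeros_(m.bias)

model = build_mlp([784, 256, 256, 10])
he_initialize(model)  # swap in the LPS initialization here for comparison
```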
Conclusion
Weight initialization is crucial for efficiently training neural networks and has therefore become an active research area in machine learning. Existing initialization algorithms are based on controlling the variance of signals between layers and do not account for the nonlinearity of neural networks, which is the most essential ingredient of training. In this paper, we analyze the nonlinearity of neural networks from a nonlinear computation point of view and develop a novel weight initialization method…
References (28)
W. Hao, A gradient descent method for solving a system of nonlinear equations, Appl. Math. Lett. (2021)
R. Pascanu et al., On the difficulty of training recurrent neural networks, International Conference on Machine Learning (2013)
D. Mishkin, J. Matas, All you need is a good init, arXiv preprint...
D. Nguyen et al., Improving the learning speed of 2-layer neural networks by choosing initial values of the adaptive weights, 1990 IJCNN International Joint Conference on Neural Networks (1990)
X. Glorot et al., Understanding the difficulty of training deep feedforward neural networks, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (2010)
K. He et al., Delving deep into rectifiers: surpassing human-level performance on ImageNet classification, Proceedings of the IEEE International Conference on Computer Vision (2015)
D. Arpit et al., How to initialize your network? Robust initialization for WeightNorm & ResNets, Advances in Neural Information Processing Systems (2019)
S. Kumar, On weight initialization in deep neural networks, arXiv preprint...
J. Pennington et al., Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice, Advances in Neural Information Processing Systems (2017)
J. Pennington, S. Schoenholz, S. Ganguli, The emergence of spectral universality in deep networks, arXiv preprint...
B. Poole et al., Exponential expressivity in deep neural networks through transient chaos, Advances in Neural Information Processing Systems (2016)
Q. Chen et al., A homotopy training algorithm for fully connected neural networks, Proc. R. Soc. A (2019)