A weight initialization based on the linear product structure for neural networks

https://doi.org/10.1016/j.amc.2021.126722

Highlights

  • We consider neural network training from a nonlinear computation point of view.

  • A new linear product structure (LPS) initialization strategy is developed for training neural networks.

  • Theoretical analysis shows that the LPS initialization yields a low probability of dying ReLU.

Abstract

Weight initialization plays an important role in training neural networks and affects a wide range of deep learning applications. Various weight initialization strategies have been developed for different activation functions and network architectures. These algorithms are based on minimizing the variance of the parameters between layers and might still fail when neural networks are deep, e.g., due to dying ReLU. To address this challenge, we study neural networks from a nonlinear computation point of view and propose a novel weight initialization strategy that is based on the linear product structure (LPS) of neural networks. The proposed strategy is derived from the polynomial approximation of activation functions, using theories of numerical algebraic geometry that guarantee finding all local minima. We also provide a theoretical analysis showing that the LPS initialization has a lower probability of dying ReLU compared to other existing initialization strategies. Finally, we test the LPS initialization algorithm on both fully connected neural networks and convolutional neural networks to show its feasibility, efficiency, and robustness on public datasets.

Introduction

With the rapid growth of applications of neural networks to large datasets, the initialization of the weights of neural networks affects the training process and accuracy significantly. It is well known that zero initialization or arbitrary random initialization can slow down or even completely stall the convergence process. This is the so-called problem of exploding or vanishing gradients, which slows down backpropagation and retards the overall training process [1]. Exploding gradients occur when the gradients get larger and larger, resulting in oscillation around the minima or even blowup during training; vanishing gradients are the exact opposite, where the gradients get smaller and smaller through backpropagation, causing slower convergence and possibly stopping the training process entirely. Therefore, proper initialization of the weights in training neural networks is necessary [2], [3]. The most popular initialization method is to use samples from a normal distribution, N(0, σ²), where σ is chosen to ensure that the variance of the outputs from the different layers is approximately the same. The first systematic analysis of this initialization was conducted in Glorot and Bengio [4], which showed that, for a linear activation function, the optimal value is σ² = 1/dᵢ, where dᵢ is the number of nodes feeding into that layer. Although this study makes several assumptions about the inputs to the model, it works extremely well in many cases (especially for tanh(z)) and is widely used in the initialization of neural networks, commonly referred to as Xavier initialization. Another important follow-up work is He initialization [5], which argues that Xavier initialization does not work well with the ReLU activation function and changes the variance to σ² = 2/dᵢ, achieving tremendous success in ReLU neural networks such as ResNet. Recently, weight initialization has become an active research area, and numerous methods [2], [6], [7], [8], [9], [10], [11], [12] have been developed to initialize the weights of different neural networks. All the aforementioned initialization schemes are based on minimizing the variance of parameters between the deeper layers to avoid vanishing/exploding gradients at the beginning of training, but they do not consider the nonlinearity of neural networks, which could let the initialization further improve the final training performance.
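For concreteness, the following NumPy sketch draws one layer's weights under the Xavier (σ² = 1/dᵢ) and He (σ² = 2/dᵢ) scalings discussed above; the helper name and layer sizes are ours for illustration and are not taken from the paper.

import numpy as np

def init_weights(fan_in, fan_out, mode="he", rng=None):
    # Sample a (fan_out, fan_in) weight matrix from N(0, sigma^2), where
    # sigma^2 = 1/fan_in for Xavier (linear/tanh) and 2/fan_in for He (ReLU).
    rng = np.random.default_rng() if rng is None else rng
    sigma2 = {"xavier": 1.0 / fan_in, "he": 2.0 / fan_in}[mode]
    return rng.normal(0.0, np.sqrt(sigma2), size=(fan_out, fan_in))

# Under He scaling, the second moment of a ReLU layer's output stays close to
# the input variance (the quantity tracked in the He et al. analysis).
rng = np.random.default_rng(0)
x = rng.normal(size=(512, 1))
W = init_weights(fan_in=512, fan_out=512, mode="he", rng=rng)
print(np.mean(np.maximum(W @ x, 0.0) ** 2))  # approximately 1.0 = var(x)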

The main contribution of this paper is to study neural networks from the nonlinear computation point of view [13], [14]. We approximate the activation functions by polynomials to provide a new weight initialization approach. The proposed weight initialization algorithm is based on the linear product structure of neural networks and has a theoretical guarantee of finding all local minima based on theories of numerical algebraic geometry [15], [16], [17]. Further theoretical analysis reveals that our new initialization method has a lower probability of dying ReLU for deep neural networks. Numerical experiments on both fully connected neural networks and convolutional neural networks show the feasibility and efficiency of the proposed initialization algorithm.

Section snippets

Problem setup and polynomial approximation of activation functions

Considering an (n+1)-layer neural network $y(x;\theta)$, we represent the output $y$ in terms of the input $x$ as
$$y(x;\theta) = W_n f_{n-1} + b_n, \quad f_\ell = \sigma(W_\ell f_{\ell-1} + b_\ell), \ \ell \in \{1,\dots,n-1\}, \quad \text{and} \quad f_0 = x, \tag{1}$$
where $W_\ell \in \mathbb{R}^{m_\ell \times m_{\ell-1}}$ is the weight matrix, $b_\ell \in \mathbb{R}^{m_\ell}$ is the bias vector, $m_\ell$ is the width of the $\ell$th layer, $m_0 = \dim(x)$, $m_n = \dim(y)$, and $\sigma$ is the activation function. For simplicity, we denote the set of all parameters as $\theta = \{W_\ell, b_\ell\}_{\ell=1}^{n}$ and the number of all parameters as $|\theta|$. The activation function, $\sigma$, is a nonlinear function but not a
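For reference, a minimal NumPy sketch of the forward pass (1) for a fully connected network with ReLU activation is given below; the function and variable names are ours for illustration.

import numpy as np

def forward(x, weights, biases, sigma=lambda z: np.maximum(z, 0.0)):
    # Evaluate y(x; theta) = W_n f_{n-1} + b_n with f_l = sigma(W_l f_{l-1} + b_l), f_0 = x.
    f = x
    for W, b in zip(weights[:-1], biases[:-1]):  # hidden layers l = 1, ..., n-1
        f = sigma(W @ f + b)
    return weights[-1] @ f + biases[-1]          # linear output layer

# Example: a network with widths m_0 = 3, m_1 = m_2 = 8, m_3 = 1.
rng = np.random.default_rng(0)
widths = [3, 8, 8, 1]
weights = [rng.normal(size=(m, m_prev)) for m_prev, m in zip(widths[:-1], widths[1:])]
biases = [np.zeros((m, 1)) for m in widths[1:]]
print(forward(rng.normal(size=(3, 1)), weights, biases))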

Linear product structure and weight initialization

After the activation function is approximated by a polynomial, namely $\sigma(x) \approx P_2(x)$, the neural network representation in (1) becomes
$$y(x;\theta) \approx W_n P_2(W_{n-1}\tilde{f}_{n-1} + b_{n-1}) + b_n. \tag{5}$$
For each component $y_j(x;\theta)$, $j = 1,\dots,\dim(y)$, we decompose the polynomial expression (5) into a linear product structure [16], [22], namely,
$$y_j(x;\theta) \approx W_j^n P_2(W_{n-1}\tilde{f}_{n-1} + b_{n-1}) + b_j^n \in \langle W_j^n, b_j^n, 1\rangle \times \langle P_2(W_{n-1}\tilde{f}_{n-1} + b_{n-1}), 1\rangle,$$
where $W_j^n$ is the $j$th row of $W_n$ and $\langle W_j^n, b_j^n, 1\rangle$ denotes the linear space generated by the variables $W_j^n$, $b_j^n$, and $1$ (similarly for $\langle P_2(W_{n-1}\tilde{f}_{n-1} + b_{n-1}), 1\rangle$). More
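As an illustration of the polynomial-approximation step that the linear product structure builds on, the sketch below fits a degree-2 polynomial to ReLU by least squares on [−1, 1]; the interval and the fitting criterion are our assumptions for this sketch, and the paper's own construction of P₂ and the subsequent LPS sampling are given in the full text.

import numpy as np

# Illustrative least-squares fit of a degree-2 polynomial P_2 to ReLU on [-1, 1];
# the interval and fitting criterion are assumptions for this sketch only.
z = np.linspace(-1.0, 1.0, 2001)
relu = np.maximum(z, 0.0)
coeffs = np.polyfit(z, relu, deg=2)  # coefficients [a2, a1, a0] of P_2(z) = a2 z^2 + a1 z + a0
P2 = np.poly1d(coeffs)

print(coeffs)                        # roughly P_2(z) = 0.47 z^2 + 0.5 z + 0.09 on this interval
print(np.max(np.abs(P2(z) - relu)))  # maximum approximation error on [-1, 1]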

Theoretical analysis of the LPS initialization on the dying ReLU

Dying ReLU occurs when the weights are negative such that ReLU neurons become inactive and output zero for any input [23]. The gradient through these neurons is then zero, so large parts of the neural network do nothing. If the neural network is deep, dying ReLU may even occur at the initialization step, and the whole training process based on existing initialization algorithms fails at the very beginning. The LPS initialization strategy resolves this issue with a theoretical guarantee
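To make the failure mode concrete, the short sketch below estimates the fraction of hidden units that are dead at initialization in a deep, narrow ReLU network under a standard N(0, 2/fan_in) initialization with zero biases; the depth, width, and sample count are our illustrative choices, and this sketch does not implement the LPS scheme itself.

import numpy as np

def fraction_dead_at_init(depth, width, n_samples=256, seed=0):
    # Fraction of hidden units whose output is zero for every sampled input at
    # initialization, under He-style N(0, 2/fan_in) weights and zero biases.
    rng = np.random.default_rng(seed)
    f = rng.normal(size=(width, n_samples))  # random inputs, one column per sample
    dead = []
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))
        f = np.maximum(W @ f, 0.0)
        dead.append(np.mean(np.all(f == 0.0, axis=1)))  # units dead on all samples
    return float(np.mean(dead))

print(fraction_dead_at_init(depth=50, width=4))  # expect a sizable fraction for narrow, deep nets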

Numerical experiments

In this section, we apply the LPS initialization algorithm to both fully connected neural networks and convolutional neural networks with the ReLU activation function and compare it with the He initialization developed in He et al. [5]. All the experimental details and hyperparameters are reported in Appendix A.5.
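For reproducibility of the He baseline, recall that the N(0, 2/fan_in) scaling uses fan_in equal to the number of input features for a fully connected layer and k·k·c_in for a convolutional layer with k×k kernels and c_in input channels, following He et al. [5]; the layer sizes below are illustrative and are not the ones used in the experiments.

import numpy as np

def he_std(fan_in):
    # Standard deviation for He initialization: sigma = sqrt(2 / fan_in).
    return np.sqrt(2.0 / fan_in)

rng = np.random.default_rng(0)

# Fully connected layer: 784 input features, 256 output features.
W_fc = rng.normal(0.0, he_std(784), size=(256, 784))

# Convolutional layer: 3x3 kernels, 64 input channels, 128 output channels.
fan_in_conv = 3 * 3 * 64
W_conv = rng.normal(0.0, he_std(fan_in_conv), size=(128, 64, 3, 3))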

Conclusion

Weight initialization is crucial for efficiently training neural networks and has therefore become an active research area in machine learning. Existing initialization algorithms are based on minimizing the variance of parameters between layers and lack consideration of the nonlinearity of neural networks, which is the most essential part of training. In this paper, we analyze the nonlinearity of neural networks from a nonlinear computation point of view and develop a novel

References (28)

  • W. Hao, A gradient descent method for solving a system of nonlinear equations, Appl. Math. Lett. (2021)
  • R. Pascanu et al., On the difficulty of training recurrent neural networks, International Conference on Machine Learning (2013)
  • D. Mishkin, J. Matas, All you need is a good init, arXiv preprint...
  • D. Nguyen et al., Improving the learning speed of 2-layer neural networks by choosing initial values of the adaptive weights, 1990 IJCNN International Joint Conference on Neural Networks (1990)
  • X. Glorot et al., Understanding the difficulty of training deep feedforward neural networks, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (2010)
  • K. He et al., Delving deep into rectifiers: surpassing human-level performance on ImageNet classification, Proceedings of the IEEE International Conference on Computer Vision (2015)
  • D. Arpit et al., How to initialize your network? Robust initialization for WeightNorm & ResNets, Advances in Neural Information Processing Systems (2019)
  • S. Kumar, On weight initialization in deep neural networks, arXiv preprint...
  • J. Pennington et al., Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice, Advances in Neural Information Processing Systems (2017)
  • J. Pennington, S. Schoenholz, S. Ganguli, The emergence of spectral universality in deep networks, arXiv preprint...
  • B. Poole et al., Exponential expressivity in deep neural networks through transient chaos, Advances in Neural Information Processing Systems (2016)
  • A. Saxe, J. McClelland, S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep linear neural...
  • D. Sussillo, L. Abbott, Random walk initialization for training very deep feedforward networks, arXiv preprint...
  • Q. Chen et al., A homotopy training algorithm for fully connected neural networks, Proc. R. Soc. A (2019)