A weight initialization based on the linear product structure for neural networks

https://doi.org/10.1016/j.amc.2021.126722

Highlights

  • We consider neural network training from a nonlinear computation point of view.

  • A new linear product structure (LPS) initialization strategy is developed for training neural networks.

  • Theoretical analysis shows that the LPS initialization yields a low probability of dying ReLU.

Abstract

Weight initialization plays an important role in training neural networks and affects a wide range of deep learning applications. Various weight initialization strategies have been developed for different activation functions and network architectures. These algorithms are based on minimizing the variance of the parameters between layers and might still fail when neural networks are deep, e.g., due to dying ReLU. To address this challenge, we study neural networks from a nonlinear computation point of view and propose a novel weight initialization strategy that is based on the linear product structure (LPS) of neural networks. The proposed strategy is derived from the polynomial approximation of activation functions, using theories of numerical algebraic geometry that guarantee finding all local minima. We also provide a theoretical analysis showing that the LPS initialization has a lower probability of dying ReLU compared to other existing initialization strategies. Finally, we test the LPS initialization algorithm on both fully connected neural networks and convolutional neural networks to show its feasibility, efficiency, and robustness on public datasets.

Introduction

With the rapid growth of applications of neural networks to large datasets, the initialization of the weights of neural networks affects the training process and accuracy significantly. It is well known that zero initialization or arbitrary random initialization can slow down or even completely stall the convergence process. This is the so-called problem of exploding or vanishing gradients, which slows down backpropagation and retards the overall training process [1]. Exploding gradients occur when the gradients get larger and larger, resulting in oscillation around the minima or even blowup during training; vanishing gradients are the exact opposite, where the gradients get smaller and smaller through backpropagation, causing slower convergence and possibly stopping the training process entirely. Therefore, proper initialization of the weights in training neural networks is necessary [2], [3]. The most popular initialization method is to use samples from a normal distribution, N(0, σ²), where σ is chosen to ensure that the variance of the outputs from the different layers is approximately the same. The first systematic analysis of this initialization was conducted in Glorot and Bengio [4], which showed that, for a linear activation function, the optimal value is σ² = 1/dᵢ, where dᵢ is the number of nodes feeding into that layer. Although this study makes several assumptions about the inputs to the model, it works extremely well in many cases (especially for tanh(z)) and is widely used in the initialization of neural networks, commonly referred to as Xavier initialization. Another important follow-up work is He initialization [5], which argues that Xavier initialization does not work well with the ReLU activation function and changes the variance to σ² = 2/dᵢ, achieving tremendous success in ReLU neural networks such as ResNet. Recently, weight initialization has become an active research area, and numerous methods [2], [6], [7], [8], [9], [10], [11], [12] have been developed to initialize the weights of different neural networks. All the aforementioned initialization schemes are based on minimizing the variance of parameters between the deeper layers to avoid vanishing/exploding gradients at the beginning of training, but they do not consider the nonlinearity of neural networks, which could let the initialization further improve the final training performance.
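For concreteness, the following NumPy sketch draws one layer's weights under the Xavier (σ² = 1/dᵢ) and He (σ² = 2/dᵢ) scalings discussed above; the helper name and layer sizes are ours for illustration and are not taken from the paper.

import numpy as np

def init_weights(fan_in, fan_out, mode="he", rng=None):
    # Sample a (fan_out, fan_in) weight matrix from N(0, sigma^2), where
    # sigma^2 = 1/fan_in for Xavier (linear/tanh) and 2/fan_in for He (ReLU).
    rng = np.random.default_rng() if rng is None else rng
    sigma2 = {"xavier": 1.0 / fan_in, "he": 2.0 / fan_in}[mode]
    return rng.normal(0.0, np.sqrt(sigma2), size=(fan_out, fan_in))

# Under He scaling, the second moment of a ReLU layer's output stays close to
# the input variance (the quantity tracked in the He et al. analysis).
rng = np.random.default_rng(0)
x = rng.normal(size=(512, 1))
W = init_weights(fan_in=512, fan_out=512, mode="he", rng=rng)
print(np.mean(np.maximum(W @ x, 0.0) ** 2))  # approximately 1.0 = var(x)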

The main contribution of this paper is to study neural networks from the nonlinear computation point of view [13], [14]. We approximate the activation functions by polynomials to provide a new weight initialization approach. The proposed weight initialization algorithm is based on the linear product structure of neural networks and has a theoretical guarantee of finding all local minima based on theories of numerical algebraic geometry [15], [16], [17]. Further theoretical analysis reveals that our new initialization method has a lower probability of dying ReLU for deep neural networks. Numerical experiments on both fully connected neural networks and convolutional neural networks show the feasibility and efficiency of the proposed initialization algorithm.

Section snippets

Problem setup and polynomial approximation of activation functions

Considering an (n+1)-layer neural network $y(x;\theta)$, we represent the output $y$ in terms of the input $x$ as
$$y(x;\theta) = W_n f_{n-1} + b_n, \quad f_\ell = \sigma(W_\ell f_{\ell-1} + b_\ell), \ \ell \in \{1,\dots,n-1\}, \quad \text{and} \quad f_0 = x, \tag{1}$$
where $W_\ell \in \mathbb{R}^{m_\ell \times m_{\ell-1}}$ is the weight matrix, $b_\ell \in \mathbb{R}^{m_\ell}$ is the bias vector, $m_\ell$ is the width of the $\ell$th layer, $m_0 = \dim(x)$, $m_n = \dim(y)$, and $\sigma$ is the activation function. For simplicity, we denote the set of all parameters as $\theta = \{W_\ell, b_\ell\}_{\ell=1}^{n}$ and the number of all parameters as $|\theta|$. The activation function, $\sigma$, is a nonlinear function but not a
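For reference, a minimal NumPy sketch of the forward pass (1) for a fully connected network with ReLU activation is given below; the function and variable names are ours for illustration.

import numpy as np

def forward(x, weights, biases, sigma=lambda z: np.maximum(z, 0.0)):
    # Evaluate y(x; theta) = W_n f_{n-1} + b_n with f_l = sigma(W_l f_{l-1} + b_l), f_0 = x.
    f = x
    for W, b in zip(weights[:-1], biases[:-1]):  # hidden layers l = 1, ..., n-1
        f = sigma(W @ f + b)
    return weights[-1] @ f + biases[-1]          # linear output layer

# Example: a network with widths m_0 = 3, m_1 = m_2 = 8, m_3 = 1.
rng = np.random.default_rng(0)
widths = [3, 8, 8, 1]
weights = [rng.normal(size=(m, m_prev)) for m_prev, m in zip(widths[:-1], widths[1:])]
biases = [np.zeros((m, 1)) for m in widths[1:]]
print(forward(rng.normal(size=(3, 1)), weights, biases))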

Linear product structure and weight initialization

After the activation function is approximated by a polynomial, namely $\sigma(x) \approx P_2(x)$, the neural network representation in (1) becomes
$$y(x;\theta) \approx W_n P_2(W_{n-1}\tilde{f}_{n-1} + b_{n-1}) + b_n. \tag{5}$$
For each component $y_j(x;\theta)$, $j = 1,\dots,\dim(y)$, we decompose the polynomial expression (5) into a linear product structure [16], [22], namely,
$$y_j(x;\theta) \approx W_j^n P_2(W_{n-1}\tilde{f}_{n-1} + b_{n-1}) + b_j^n \in \langle W_j^n, b_j^n, 1\rangle \times \langle P_2(W_{n-1}\tilde{f}_{n-1} + b_{n-1}), 1\rangle,$$
where $W_j^n$ is the $j$th row of $W_n$ and $\langle W_j^n, b_j^n, 1\rangle$ denotes the linear space generated by the variables $W_j^n$, $b_j^n$, and $1$ (similarly for $\langle P_2(W_{n-1}\tilde{f}_{n-1} + b_{n-1}), 1\rangle$). More
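As an illustration of the polynomial-approximation step that the linear product structure builds on, the sketch below fits a degree-2 polynomial to ReLU by least squares on [−1, 1]; the interval and the fitting criterion are our assumptions for this sketch, and the paper's own construction of P₂ and the subsequent LPS sampling are given in the full text.

import numpy as np

# Illustrative least-squares fit of a degree-2 polynomial P_2 to ReLU on [-1, 1];
# the interval and fitting criterion are assumptions for this sketch only.
z = np.linspace(-1.0, 1.0, 2001)
relu = np.maximum(z, 0.0)
coeffs = np.polyfit(z, relu, deg=2)  # coefficients [a2, a1, a0] of P_2(z) = a2 z^2 + a1 z + a0
P2 = np.poly1d(coeffs)

print(coeffs)                        # roughly P_2(z) = 0.47 z^2 + 0.5 z + 0.09 on this interval
print(np.max(np.abs(P2(z) - relu)))  # maximum approximation error on [-1, 1]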

Theoretical analysis of the LPS initialization on the dying ReLU

Dying ReLU occurs when the weights are negative such that ReLU neurons become inactive and output zero for any input [23]. The gradient through these neurons is then zero, so large parts of the neural network do nothing. If the neural network is deep, dying ReLU may even occur at the initialization step, and the whole training process based on existing initialization algorithms fails at the very beginning. The LPS initialization strategy resolves this issue with a theoretical guarantee
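To make the failure mode concrete, the short sketch below estimates the fraction of hidden units that are dead at initialization in a deep, narrow ReLU network under a standard N(0, 2/fan_in) initialization with zero biases; the depth, width, and sample count are our illustrative choices, and this sketch does not implement the LPS scheme itself.

import numpy as np

def fraction_dead_at_init(depth, width, n_samples=256, seed=0):
    # Fraction of hidden units whose output is zero for every sampled input at
    # initialization, under He-style N(0, 2/fan_in) weights and zero biases.
    rng = np.random.default_rng(seed)
    f = rng.normal(size=(width, n_samples))  # random inputs, one column per sample
    dead = []
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))
        f = np.maximum(W @ f, 0.0)
        dead.append(np.mean(np.all(f == 0.0, axis=1)))  # units dead on all samples
    return float(np.mean(dead))

print(fraction_dead_at_init(depth=50, width=4))  # expect a sizable fraction for narrow, deep nets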

Numerical experiments

In this section, we apply the LPS initialization algorithm to both fully connected neural networks and convolutional neural networks with the ReLU activation function and compare it with the He initialization developed in He et al. [5]. All the experimental details and hyperparameters are reported in Appendix A.5.
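For reproducibility of the He baseline, recall that the N(0, 2/fan_in) scaling uses fan_in equal to the number of input features for a fully connected layer and k·k·c_in for a convolutional layer with k×k kernels and c_in input channels, following He et al. [5]; the layer sizes below are illustrative and are not the ones used in the experiments.

import numpy as np

def he_std(fan_in):
    # Standard deviation for He initialization: sigma = sqrt(2 / fan_in).
    return np.sqrt(2.0 / fan_in)

rng = np.random.default_rng(0)

# Fully connected layer: 784 input features, 256 output features.
W_fc = rng.normal(0.0, he_std(784), size=(256, 784))

# Convolutional layer: 3x3 kernels, 64 input channels, 128 output channels.
fan_in_conv = 3 * 3 * 64
W_conv = rng.normal(0.0, he_std(fan_in_conv), size=(128, 64, 3, 3))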

Conclusion

Weight initialization is crucial for efficiently training neural networks and has therefore become an active research area in machine learning. Existing initialization algorithms are based on minimizing the variance of parameters between layers and lack consideration of the nonlinearity of neural networks, which is the most essential part of training. In this paper, we analyze the nonlinearity of neural networks from a nonlinear computation point of view and develop a novel

References (28)

  • W. Hao, A gradient descent method for solving a system of nonlinear equations, Appl. Math. Lett. (2021)
  • R. Pascanu et al., On the difficulty of training recurrent neural networks, International Conference on Machine Learning (2013)
  • D. Mishkin, J. Matas, All you need is a good init, arXiv preprint...
  • D. Nguyen et al., Improving the learning speed of 2-layer neural networks by choosing initial values of the adaptive weights, 1990 IJCNN International Joint Conference on Neural Networks (1990)
  • X. Glorot et al., Understanding the difficulty of training deep feedforward neural networks, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (2010)
  • K. He et al., Delving deep into rectifiers: surpassing human-level performance on ImageNet classification, Proceedings of the IEEE International Conference on Computer Vision (2015)
  • D. Arpit et al., How to initialize your network? Robust initialization for WeightNorm & ResNets, Advances in Neural Information Processing Systems (2019)
  • S. Kumar, On weight initialization in deep neural networks, arXiv preprint...
  • J. Pennington et al., Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice, Advances in Neural Information Processing Systems (2017)
  • J. Pennington, S. Schoenholz, S. Ganguli, The emergence of spectral universality in deep networks, arXiv preprint...
  • B. Poole et al., Exponential expressivity in deep neural networks through transient chaos, Advances in Neural Information Processing Systems (2016)
  • A. Saxe, J. McClelland, S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep linear neural...
  • D. Sussillo, L. Abbott, Random walk initialization for training very deep feedforward networks, arXiv preprint...
  • Q. Chen et al., A homotopy training algorithm for fully connected neural networks, Proc. R. Soc. A (2019)