
Universal statistics of Fisher information in deep neural networks: mean field approach*

Ryo Karakida, Shotaro Akaho and Shun-ichi Amari

Published 21 December 2020 © 2020 IOP Publishing Ltd and SISSA Medialab srl
Citation: Ryo Karakida et al J. Stat. Mech. (2020) 124005. DOI: 10.1088/1742-5468/abc62e


Abstract

The Fisher information matrix (FIM) is a fundamental quantity to represent the characteristics of a stochastic model, including deep neural networks (DNNs). The present study reveals novel statistics of FIM that are universal among a wide class of DNNs. To this end, we use random weights and large width limits, which enables us to utilize mean field theories. We investigate the asymptotic statistics of the FIM's eigenvalues and reveal that most of them are close to zero while the maximum eigenvalue takes a huge value. Because the landscape of the parameter space is defined by the FIM, it is locally flat in most dimensions, but strongly distorted in others. Moreover, we demonstrate the potential usage of the derived statistics in learning strategies. First, small eigenvalues that induce flatness can be connected to a norm-based capacity measure of generalization ability. Second, the maximum eigenvalue that induces the distortion enables us to quantitatively estimate an appropriately sized learning rate for gradient methods to converge.


1. Introduction

Deep learning has succeeded in making hierarchical neural networks perform excellently in various practical applications [1]. To proceed further, it would be beneficial to give more theoretical elucidation of why and how deep neural networks (DNNs) work well in practice. In particular, it would be useful not only to clarify individual models and phenomena but also to explore unified theoretical frameworks that could be applied to a wide class of deep networks. One widely used approach for this purpose is to consider deep networks with random connectivity and a large width limit [2–14]. For instance, Poole et al [3] proposed a useful indicator to explain the expressivity of DNNs. Regarding the trainability of DNNs, Schoenholz et al [4] extended this theory to backpropagation and found that vanishing and explosive gradients obey a universal law. These studies are powerful in the sense that they do not depend on particular model architectures, such as the number of layers or activation functions.

Unfortunately, such universal frameworks have not yet been established in many other topics. One is the geometric structure of the parameter space. For instance, a loss landscape without spurious local minima is important for easier optimization and has been theoretically guaranteed in single-layer models [15], shallow piecewise linear ones [16], and extremely wide deep networks with the number of training samples smaller than the width [17]. Flat global minima have been reported to be related to generalization ability through empirical experiments showing that networks with such minima give better generalization performance [18, 19]. However, theoretical analysis of the flat landscape has been limited to shallow rectified linear unit (ReLU) networks [20, 21]. Thus, it remains an open problem to theoretically reveal the geometric structure of the parameter space that is truly common among various deep networks.

To establish the foundation of the universal perspective of the parameter space, this study analytically investigates the FIM. As is overviewed in section 2.1, the FIM plays an essential role in the geometry of the parameter space and is a fundamental quantity in both statistics and machine learning.

1.1. Main results

This study analyzes the FIM of deep networks with random weights and biases, which are widely used settings to analyze the phenomena of DNNs [2–14]. First, we analytically obtain novel statistics of the FIM, namely, the mean (theorem 1), variance (theorem 3), and maximum of eigenvalues (theorem 4). These are universal among a wide class of shallow and deep networks with various activation functions. These quantities can be obtained from simple iterative computations of macroscopic variables. To our surprise, the mean of the eigenvalues asymptotically decreases with an order of O(1/M) in the limit of a large network width M, while the variance takes a value of O(1), and the maximum eigenvalue takes a huge value of O(M), using the O(⋅) order notation. Since the eigenvalues are non-negative, these results mean that most of the eigenvalues are close to zero, but the edge of the eigenvalue distribution takes a huge value. Because the FIM defines the Riemannian metric of the parameter space, the derived statistics imply that the space is locally flat in most dimensions, but strongly distorted in others. In addition, because the FIM also determines the local shape of a loss landscape, the landscape is also expected to be locally flat while strongly distorted.

Furthermore, to confirm the potential usage of the derived statistics, we show some exercises. One is on the Fisher–Rao norm [22] (theorem 5). This norm was originally proposed to connect the flatness of a parameter space to the capacity measure of generalization ability. We evaluate the Fisher–Rao norm by using an indicator of the small eigenvalues, κ1 in theorem 1. Another exercise is related to the more practical issue of determining the size of the learning rate necessary for the steepest descent gradient to converge. We demonstrate that an indicator of the huge eigenvalue, κ2 in theorem 4, enables us to roughly estimate learning rates that make the gradient method converge to global minima (theorem 7). We expect that it will help to alleviate the dependence of learning rates on heuristic settings.

1.2. Related works

Despite its importance in statistics and machine learning, studies of the FIM for neural networks have been limited so far. This is because layer-by-layer nonlinear maps and huge parameter dimensions make further analysis difficult. Degeneracy of the eigenvalues of the FIM has been found in certain parameter regions [23]. To understand the loss landscape, Pennington and Bahri [5] have utilized random matrix theory and obtained the spectrum of the FIM and the Hessian under several assumptions, although the analysis is limited to special types of shallow networks. In contrast, this paper is the first attempt to apply the mean field approach, which overcomes the difficulties above and enables us to identify universal properties of the FIM in various types of DNNs.

LeCun et al [24] investigated the Hessian of the loss, which coincides with the FIM at zero training error, and empirically reported that very large eigenvalues, i.e. 'big killers', exist and affect the optimization (discussed in section 4.2). The eigenvalue distribution peaks around zero while its tail is very long; this behavior has been known empirically for decades [25], but its theoretical evidence and evaluation have remained open as far as we know. Therefore, our theory provides novel theoretical evidence that this skewed eigenvalue distribution and its huge maximum appear universally in DNNs.

The theoretical tool we use here is known as the mean field theory of deep networks [3, 4, 1014] as briefly overviewed in section 2.4. This method has been successful in analyzing neural networks with random weights under a large width limit and in explaining the performance of the models. In particular, it quantitatively coincides with experimental results very well and can predict appropriate initial values of parameters for avoiding the vanishing or explosive gradient problems [4]. This analysis has been extended from fully connected deep networks to residual [11] and convolutional networks [14]. The evaluation of the FIM in this study is also expected to be extended to such cases.

Recently, some advances have been made in the understanding of the FIM's eigenvalue statistics. While the FIM's maximum eigenvalue takes a huge value and acts as an outlier in naive settings, it can be alleviated under batch normalization [26]. Karakida et al [27] have analyzed the FIM corresponding to the cross-entropy loss with softmax output, and also made some remarks on the connection between the FIM and the neural tangent kernel.

2. Preliminaries

2.1. Fisher information matrix (FIM)

We focus on the FIM of neural network models, which has been developed in previous works and is commonly used [28–33]. It is defined by

Equation (1): $F := \mathrm{E}\left[\nabla_{\theta}\,\mathrm{log}\,p(x,y;\theta)\;\nabla_{\theta}\,\mathrm{log}\,p(x,y;\theta)^{\mathrm{T}}\right],$

where the statistical model is given by p(x, y; θ) = p(y|x; θ)q(x). The output model is given by $p\left(y\vert x;\theta \right)=\mathrm{exp}\left(-{\Vert}y-{f}_{\theta }\left(x\right){{\Vert}}^{2}/2\right)/\sqrt{2\pi }$, where fθ (x) is the network output parameterized by θ and ||⋅|| is the Euclidean norm. The q(x) is an input distribution. The expectation E[⋅] is taken over the input-output pairs (x, y) of the joint distribution p(x, y; θ). This FIM is transformed into $F={\sum }_{k=1}^{C}\mathrm{E}\left[{\nabla }_{\theta }{f}_{\theta ,k}\left(x\right){\nabla }_{\theta }{f}_{\theta ,k}{\left(x\right)}^{\mathrm{T}}\right]$, where fθ,k is the kth entry of the output (k = 1, ..., C). When T training samples x(t) (t = 1, ..., T) are available, the expectation can be replaced by the empirical mean. This is known as the empirical FIM and often appears in practice [29–33]:

Equation (2): $F = \frac{1}{T}\sum_{t=1}^{T}\sum_{k=1}^{C} \nabla_{\theta} f_{\theta,k}(x(t))\;\nabla_{\theta} f_{\theta,k}(x(t))^{\mathrm{T}}.$

This study investigates the above empirical FIM for arbitrary T. It converges to the expected FIM as T → ∞. Although the form of the FIM changes slightly in other statistical models (e.g. softmax outputs), these differences are basically limited to the multiplication of activations in the output layer [32]. Our framework can be straightforwardly applied to such cases.
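As a concrete illustration (not taken from the article), the empirical FIM (2) can be formed directly from per-sample gradients. The following NumPy sketch does this for a toy two-layer tanh network with a single linear output; the sizes M0, M1, T and the variance parameters are arbitrary choices for the example.

```python
import numpy as np

# Toy sketch of the empirical FIM (2): two-layer tanh network, linear output, C = 1.
rng = np.random.default_rng(0)
M0, M1, C, T = 5, 50, 1, 20                      # illustrative sizes (assumptions)
sw2, sb2 = 2.0, 0.1                               # sigma_w^2, sigma_b^2

W1 = rng.normal(0, np.sqrt(sw2 / M0), (M1, M0)); b1 = rng.normal(0, np.sqrt(sb2), M1)
W2 = rng.normal(0, np.sqrt(sw2 / M1), (C, M1));  b2 = rng.normal(0, np.sqrt(sb2), C)
X = rng.normal(0, 1, (T, M0))                     # i.i.d. Gaussian inputs x(t)

def grad_f(x):
    """Gradient of the scalar output f_theta(x) with respect to all parameters."""
    h1 = np.tanh(W1 @ x + b1)
    d1 = (1.0 - h1 ** 2) * W2[0]                  # delta^1_i = phi'(u^1_i) W^2_{1i}
    # blocks: dW1 = outer(delta^1, x), db1 = delta^1, dW2 = h^1, db2 = 1
    return np.concatenate([np.outer(d1, x).ravel(), d1, h1, np.ones(C)])

G = np.stack([grad_f(x) for x in X])              # T x P matrix of gradients
F = G.T @ G / T                                   # empirical FIM, equation (2)
lam = np.linalg.eigvalsh(F)
print("non-zero eigenvalues (<= CT):", np.sum(lam > 1e-12))
print("mean eigenvalue:", lam.mean(), " max eigenvalue:", lam.max())
```

As expected from the rank of the sum in (2), at most CT eigenvalues are non-zero even though F is P × P.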

The FIM determines the asymptotic accuracy of the estimated parameters, as is known from a fundamental theorem of statistics, namely, the Cramér–Rao bound [35]. Below, we summarize a more intuitive understanding of the FIM from geometric views.

Information geometric view. Let us define an infinitesimal squared distance dr², which represents the Kullback–Leibler divergence between the statistical models p(x, y; θ) and p(x, y; θ + dθ) under a perturbation dθ. It is given by

Equation (3): $\mathrm{d}r^{2} = \sum_{i,j} F_{ij}\,\mathrm{d}\theta_{i}\,\mathrm{d}\theta_{j} = \mathrm{d}\theta^{\mathrm{T}} F\,\mathrm{d}\theta.$

It means that the parameter space of a statistical model forms a Riemannian manifold and the FIM works as its Riemannian metric, as is known in information geometry [35]. This quadratic form is equivalent to the robustness of a deep network: E[‖f_{θ+dθ}(t) − f_θ(t)‖²] = dθᵀF dθ. Insights from information geometry have led to the development of natural gradient algorithms [31–33] and, recently, a capacity measure based on the Fisher–Rao norm [22].

Loss landscape view. The empirical FIM (2) determines the local landscape of the loss function around the global minimum. Suppose we have a squared loss function $E\left(\theta \right)=\left(1/2T\right){\sum }_{t}^{T}{\Vert}y\left(t\right)-{f}_{\theta }\left(t\right){{\Vert}}^{2}$. The FIM is related to the Hessian of the loss function, H := ∇_θ∇_θ E(θ), in the following way:

Equation (4): $H = F - \frac{1}{T}\sum_{t=1}^{T}\sum_{k=1}^{C}\left(y_{k}(t) - f_{\theta,k}(t)\right)\nabla_{\theta}\nabla_{\theta} f_{\theta,k}(t).$

The Hessian coincides with the FIM when the parameter converges to the global minimum through learning, that is, the true parameter θ* from which the teacher signal y(t) is generated by $y\left(t\right)={f}_{{\theta }^{{\ast}}}\left(t\right)$ or, more generally, with noise (i.e. $y\left(t\right)={f}_{{\theta }^{{\ast}}}\left(t\right)+{\varepsilon }_{t}$, where ɛt denotes zero-mean Gaussian noise) [29]. In the literature on deep learning, eigenvectors whose eigenvalues are close to zero locally compose flat minima, which lead to better generalization empirically [19, 22]. Modifying the loss function with the FIM has also succeeded in overcoming catastrophic forgetting [36].

Note that the information geometric view tells us more than the loss landscape. While the Hessian (4) assumes the special teacher signal, the FIM works as the Riemannian metric to arbitrary teacher signals.

2.2. Network architecture

This study investigates a fully connected feedforward neural network. The network consists of one input layer with M0 units, L − 1 hidden layers (L ⩾ 2) with Ml units per hidden layer (l = 1, 2, ..., L − 1), and one output layer with ML units:

Equation (5): $u_{i}^{l} = \sum_{j} W_{ij}^{l}\, h_{j}^{l-1} + b_{i}^{l} \quad (l = 1, \dots, L), \qquad h_{i}^{l} = \phi(u_{i}^{l}) \quad (l \leqslant L-1), \qquad h_{i}^{0} = x_{i}.$

This study focuses on the case of linear outputs, that is, ${f}_{\theta ,k}\left(x\right)={h}_{k}^{L}={u}_{k}^{L}$. We assume that the activation function ϕ(x) and its derivative ϕ'(x) := dϕ(x)/dx are square-integrable with respect to a Gaussian measure. A wide class of activation functions, including sigmoid-like and (leaky-)ReLU functions, satisfies these conditions. Different layers may have different activation functions. Regarding the network width, we set Ml = αl M (l ⩽ L − 1) and consider the limiting case of large M with constant coefficients αl . This study mainly focuses on the case where the number of output units is given by a constant ML = C. The higher-dimensional case of C = O(M) is discussed in section 4.3.

The FIM (2) of a deep network is computed by the chain rule in a manner similar to that of the backpropagation algorithm:

Equation (6): $\nabla_{W_{ij}^{l}} f_{\theta,k} = \delta_{k,i}^{l}\, h_{j}^{l-1}, \qquad \nabla_{b_{i}^{l}} f_{\theta,k} = \delta_{k,i}^{l},$

Equation (7): $\delta_{k,i}^{l} = \phi'(u_{i}^{l}) \sum_{j} W_{ji}^{l+1}\, \delta_{k,j}^{l+1},$

where ${\delta }_{k,i}^{l}{:=}\partial {f}_{\theta ,k}/\partial {u}_{i}^{l}$ for k = 1, ..., C. To avoid the complicated notation, we omit the index of the output unit, i.e. ${\delta }_{i}^{l}={\delta }_{k,i}^{l}$, in the following.

2.3. Random connectivity

The parameter set $\theta =\left\{{W}_{ij}^{l},{b}_{i}^{l}\right\}$ is an ensemble generated by

Equation (8): $W_{ij}^{l} \sim \mathcal{N}\!\left(0, \frac{\sigma_{w^{l}}^{2}}{M_{l-1}}\right), \qquad b_{i}^{l} \sim \mathcal{N}\!\left(0, \sigma_{b^{l}}^{2}\right),$

and then fixed, where $\mathcal{N}\left(0,{\sigma }^{2}\right)$ denotes a Gaussian distribution with zero mean and variance σ2, and we set ${\sigma }_{{w}^{l}}{ >}0$ and ${\sigma }_{{b}^{l}}{ >}0$. To avoid complicated notation, we set them uniformly as ${\sigma }_{{w}^{l}}^{2}={\sigma }_{w}^{2}$ and ${\sigma }_{{b}^{l}}^{2}={\sigma }_{b}^{2}$, but they can easily be generalized. It is essential to normalize the variance of the weights by M in order to normalize the output ${u}_{i}^{l}$ to O(1). This setting is similar to how parameters are initialized in practice [37]. We also assume that the input samples ${h}_{i}^{0}\left(t\right)={x}_{i}\left(t\right)\enspace \enspace \left(t=1,\dots ,T\right)$ are generated in an i.i.d. manner from a standard Gaussian distribution: ${x}_{i}\left(t\right)\sim \mathcal{N}\left(0,1\right)$. We focus here on the Gaussian case for simplicity, although we can easily generalize it to other distributions with finite variances.
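As a brief illustration (with arbitrary sizes and variances, not from the article), the following NumPy lines draw one layer of parameters from the ensemble (8) and check that normalizing the weight variance by the fan-in keeps the pre-activations at O(1):

```python
import numpy as np

rng = np.random.default_rng(0)
M_prev, M_l, sw2, sb2 = 1000, 1000, 2.0, 0.1
W = rng.normal(0.0, np.sqrt(sw2 / M_prev), size=(M_l, M_prev))   # W^l_{ij}
b = rng.normal(0.0, np.sqrt(sb2), size=M_l)                      # b^l_i
h = rng.normal(0.0, 1.0, size=M_prev)          # previous-layer activations of O(1)
u = W @ h + b                                  # pre-activations u^l_i
print(np.mean(u ** 2))                         # ~ sigma_w^2 * 1 + sigma_b^2 = 2.1
```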

Let us remark that the above random connectivity is a common setting widely assumed in theoretical studies. Analyzing such a network can be regarded as a typical-case evaluation [2, 3, 5]. It is also equivalent to analyzing a randomly initialized network [4, 20]. Random connectivity is also often assumed, in analyses of optimization, as the true parameter of the network, that is, the global minimum [21, 38].

2.4. Mean-field approach

For neural networks with random connectivity, we can analyze the asymptotic behavior of the networks by taking a large width limit. This asymptotic analysis has recently been referred to as the mean field theory of deep networks, and we follow the previously reported notation and terminology [3, 4, 11, 12].

First, let us introduce the following variables for feedforward signal propagation: ${\hat{q}}^{l}{:=}{\sum }_{\;i}{h}_{i}^{l}{\left(t\right)}^{2}/{M}_{l}$ and ${\hat{q}}_{st}^{l}{:=}{\sum }_{\;i}{h}_{i}^{l}\left(s\right){h}_{i}^{l}\left(t\right)/{M}_{l}$. In the context of deep learning, these variables have been utilized to explain the depth to which signals can sufficiently propagate. The variable ${\hat{q}}_{st}^{l}$ is the correlation between the activations for different input samples x(s) and x(t) in the lth layer. Under the large M limit, these variables are given by integration over Gaussian distributions because the pre-activation ${u}_{i}^{l}$ is a weighted sum of independent random parameters and the central limit theorem is applicable [2–4]:

Equation (9)

Equation (10)

with ${\hat{q}}^{0}=1$ and ${\hat{q}}_{st}^{0}=0$ (l = 0, ..., L − 1). We can generalize the theory to unnormalized data with ${\hat{q}}^{0}\ne 0$ and ${\hat{q}}_{st}^{0}\ne 0$, just by substituting them into the recurrence relations. The notation $Du=\mathrm{d}u\enspace \mathrm{exp}\left(-{u}^{2}/2\right)/\sqrt{2\pi }$ means integration over the standard Gaussian density. Here, the notation ${I}_{\phi }\left[\cdot ,\cdot \right]$ represents the following integral: ${I}_{\phi }\left[a,b\right]=\int D{z}_{1}D{z}_{2}\phi \left(\sqrt{a}{z}_{1}\right)\phi \left(\sqrt{a}\left(c{z}_{1}+\sqrt{1-{c}^{2}}{z}_{2}\right)\right)$ with c = b/a. The variable ${\hat{q}}_{st}^{l}$ is linked to the compositional kernel and utilized as the kernel of a Gaussian process [39].
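For concreteness, the two-dimensional Gaussian integral I_φ[a, b] above can be estimated by straightforward Monte Carlo sampling; the sketch below (with tanh and arbitrary values of a and b) is only meant to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(2, 1_000_000))       # standard Gaussian samples

def I_phi(phi, a, b):
    """Monte Carlo estimate of I_phi[a, b] with c = b / a."""
    c = b / a
    u = np.sqrt(a) * z1
    v = np.sqrt(a) * (c * z1 + np.sqrt(1.0 - c ** 2) * z2)
    return np.mean(phi(u) * phi(v))

print(I_phi(np.tanh, a=1.5, b=0.6))
```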

Next, let us introduce variables for backpropagated signals: ${\tilde {q}}^{l}{:=}{\sum }_{i}{\;\delta }_{i}^{l}{\left(t\right)}^{2}$ and ${\tilde {q}}_{st}^{l}{:=}{\sum }_{i}{\;\delta }_{i}^{l}\left(s\right){\delta }_{i}^{l}\left(t\right)$. Note that they are defined not by averages but by the sums. They remain O(1) because of C = O(1). ${\tilde {q}}_{st}^{l}$ is the correlation of backpropagated signals. To compute these quantities, the previous studies assumed the following:

Assumption 1 (Schoenholz et al [4]). On the evaluation of the variables ${\tilde {q}}^{l}$ and ${\tilde {q}}_{st}^{l}$, one can use a different set of parameters, θ for the forward chain (5) and θ' for the backpropagated chain (7), instead of using the same parameter set θ in both chains.

This assumption makes the dependence between $\phi \left({u}_{i}^{l}\right)$ (or ${\phi }^{\prime }\left({u}_{i}^{l}\right)$) and ${\delta }_{j}^{l+1}$, which share the same parameter set, so weak that one can regard them as independent. It enables us to apply the central limit theorem to the backpropagated chain (7). Thus, the previous studies [4, 7, 11, 12] derived the following recurrence relations (l = 0, ..., L − 1):

Equation (11)

Equation (12)

with ${\tilde {q}}^{L}={\tilde {q}}_{st}^{L}=1$ because of the linear outputs. The previous works confirmed excellent agreements between the above equations and experiments. In this study, we also adopt the above assumption and use the recurrence relations.

The variables (${\hat{q}}^{l},{\tilde {q}}^{l},{\hat{q}}_{st}^{l},{\tilde {q}}_{st}^{l}$) depend only on the variance parameters ${\sigma }_{w}^{2}$ and ${\sigma }_{b}^{2}$, not on the unit indices. In that sense, they are referred to as macroscopic variables (a.k.a. order parameters in statistical physics). The recurrence relations for the macroscopic variables simply require L iterations of one- and two-dimensional numerical integrals. Moreover, we can obtain their explicit forms for some activation functions (such as the error function, linear, and ReLU; see appendix B).
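The following Monte Carlo sketch (not the authors' code) iterates layer-wise recursions that follow from the definitions above: by the central limit theorem, the pre-activations at layer l have variance σ_w²q̂^{l−1} + σ_b² and covariance σ_w²q̂_st^{l−1} + σ_b² across two inputs, and the Gaussian integrals are estimated by sampling. A tanh activation, normalized inputs (q̂⁰ = 1, q̂_st⁰ = 0) and a linear output (q̃^L = q̃_st^L = 1) are assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(2, 500_000))
phi = np.tanh
dphi = lambda u: 1.0 - np.tanh(u) ** 2

def preacts(var, cov):
    """Samples of two pre-activations with given variance and covariance."""
    c = cov / var
    u = np.sqrt(var) * z1
    v = np.sqrt(var) * (c * z1 + np.sqrt(max(1.0 - c ** 2, 0.0)) * z2)
    return u, v

def macroscopic(sw2, sb2, L, q0=1.0, q0_st=0.0):
    q, qst = [q0], [q0_st]
    for l in range(1, L):                        # forward: hidden layers 1, ..., L-1
        u, v = preacts(sw2 * q[-1] + sb2, sw2 * qst[-1] + sb2)
        q.append(np.mean(phi(u) ** 2))
        qst.append(np.mean(phi(u) * phi(v)))
    qt, qt_st = [1.0], [1.0]                     # linear output: q~^L = q~_st^L = 1
    for l in range(L - 1, 0, -1):                # backward: layers L-1, ..., 1
        u, v = preacts(sw2 * q[l - 1] + sb2, sw2 * qst[l - 1] + sb2)
        qt.insert(0, sw2 * qt[0] * np.mean(dphi(u) ** 2))
        qt_st.insert(0, sw2 * qt_st[0] * np.mean(dphi(u) * dphi(v)))
    return q, qst, qt, qt_st                     # q[l] = q^l, qt[l-1] = q~^l

q, qst, qt, qt_st = macroscopic(sw2=3.0, sb2=0.64, L=3)
print("q^l  :", [float(x) for x in q])
print("q~^l :", [float(x) for x in qt])
```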

3. Fundamental FIM statistics

Here, we report mathematical findings that the mean, variance, and maximum of eigenvalues of the FIM (2) are explicitly expressed by using macroscopic variables. Our theorems are universal for networks ranging in size from shallow (L = 2) to arbitrarily deep (L ⩾ 3) with various activation functions.

3.1. Mean of eigenvalues

The FIM is a P × P matrix, where P represents the total number of parameters. First, we compute the arithmetic mean of the FIM's eigenvalues as ${m}_{\lambda }{:=}{\sum }_{i=1}^{P}{\lambda }_{i}/P$. We find a hidden relation between the macroscopic variables and the statistics of FIM:

Theorem 1. Suppose that assumption 1 holds. In the limit of M ≫ 1, the mean of the FIM's eigenvalues is given by

Equation (13): $m_{\lambda} = \kappa_{1}\,\frac{C}{M}, \qquad \kappa_{1} := \frac{1}{\alpha}\sum_{l=1}^{L} \alpha_{l-1}\,\tilde{q}^{l}\,\hat{q}^{l-1},$

where $\alpha {:=}{\sum }_{l=1}^{L-1}{\alpha }_{l}{\alpha }_{l-1}$. The macroscopic variables ${\hat{q}}^{l}$ and ${\tilde {q}}^{l}$ can be computed recursively, and notably mλ is O(1/M).

This is obtained from a relation mλ = Trace(F)/P (detailed in appendix A.1). The coefficient κ1 is a constant not depending on M, so it is O(1). It is easily computed by L iterations of the layer-wise recurrence relations (9) and (11).
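As a small worked example, for deep linear networks the recurrences have closed forms (appendix B.3), so the coefficient κ₁ = Σ_l (α_{l−1}/α) q̃^l q̂^{l−1} (the expression used in appendix A.6) can be computed in a few lines; uniform widths (α_l = 1) are assumed here.

```python
def kappa1_linear(sw2, sb2, L):
    """kappa_1 for a deep linear network with uniform widths (alpha_l = 1)."""
    q = [1.0]                                   # q^0 = 1 (normalized inputs)
    for _ in range(L - 1):                      # q^{l} = sw2 * q^{l-1} + sb2
        q.append(sw2 * q[-1] + sb2)
    qt = [1.0]                                  # q~^L = 1, then q~^{l} = sw2 * q~^{l+1}
    for _ in range(L - 1):
        qt.append(sw2 * qt[-1])
    qt = qt[::-1]                               # qt[l - 1] = q~^l
    alpha = L - 1                               # alpha = sum_l alpha_l * alpha_{l-1}
    return sum(qt[l - 1] * q[l - 1] for l in range(1, L + 1)) / alpha

print(kappa1_linear(sw2=1.0, sb2=0.1, L=3))     # -> 1.65
```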

Because the FIM is a positive semi-definite matrix and its eigenvalues are non-negative, this theorem means that most of the eigenvalues asymptotically approach zero when M is large. Recall that the FIM determines the local geometry of the parameter space. The theorem suggests that the network output remains almost unchanged against a perturbation of the parameters in many dimensions. It also suggests that the shape of the loss landscape is locally flat in most dimensions.

Furthermore, by using Markov's inequality, we can prove that the number of larger eigenvalues is limited, as follows:

Corollary 2. Let us denote the number of eigenvalues satisfying λ ⩾ k by N(λ ⩾ k) and suppose that assumption 1 holds. For a constant k > 0, N(λ ⩾ k) ⩽  min {ακ1 CM/k, CT} holds in the limit of M ≫ 1.

The proof is shown in appendix A.2. When T is sufficiently small, we have a trivial upper bound N(λ ⩾ k) ⩽ CT and the number of non-zero eigenvalues is limited. The corollary clarifies that even when T becomes large, the number of eigenvalues whose values are O(1) is O(M) at most, which is still much smaller than the total number of parameters P.

3.2. Variance of eigenvalues

Next, let us consider the second moment ${s}_{\lambda }{:=}{\sum }_{i=1}^{P}{\lambda }_{i}^{2}/P$. We now demonstrate that sλ can be computed from the macroscopic variables:

Theorem 3. Suppose that assumption 1 holds. In the limit of M ≫ 1, the second moment of the FIM's eigenvalues is

Equation (14): $s_{\lambda} = C\,\alpha\left(\frac{T-1}{T}\,\kappa_{2}^{2} + \frac{1}{T}\,\kappa_{1}^{2}\right),$

Equation (15): $\kappa_{2} := \frac{1}{\alpha}\sum_{l=1}^{L} \alpha_{l-1}\,\tilde{q}_{st}^{l}\,\hat{q}_{st}^{l-1}.$

The macroscopic variables ${\hat{q}}_{st}^{l}$ and ${\tilde {q}}_{st}^{l}$ can be computed recursively, and sλ is O(1) (see the footnote at the end of this article).

The proof is shown in appendix A.3.

From theorems 1 and 3, we can conclude that the variance of the eigenvalue distribution, ${s}_{\lambda }-{m}_{\lambda }^{2}$, is O(1). Because the mean mλ is O(1/M) and most eigenvalues are close to zero, this result means that the edge of the eigenvalue distribution takes a huge value.

3.3. Maximum eigenvalue

As we have seen so far, the mean of the eigenvalues is O(1/M), and the variance is O(1). Therefore, we can expect that at least one of the eigenvalues must be huge. Actually, we can show that the maximum eigenvalue (that is, the spectral norm of the FIM) increases in the order of O(M) as follows.

Theorem 4. Suppose that assumption 1 holds. In the limit of M ≫ 1, the maximum eigenvalue of the FIM is

Equation (16): $\lambda_{\mathrm{max}} = \alpha\left(\frac{T-1}{T}\,\kappa_{2} + \frac{1}{T}\,\kappa_{1}\right)M.$

The λmax is derived from the dual matrix F* (detailed in appendix A.4). If we take the limit T → ∞, we can characterize the quantity κ2 by the maximum eigenvalue as λmax = ακ2 M. Note that λmax is independent of C. When C = O(M), it may depend on C, as shown in section 4.3.

This theorem suggests that the network output changes dramatically with a perturbation of the parameters in certain dimensions and that the local shape of the loss landscape is strongly distorted in that direction. Here, note that λmax is proportional to α, which is the summation over L terms. This means that, when the network becomes deeper, the parameter space is more strongly distorted.

We confirmed the agreement between our theory and numerical experiments, as shown in figure 1. Three types of deep networks with parameters generated by the random connectivity (8) were investigated: tanh, ReLU, and linear activations (L = 3, αl = C = 1). The input samples were generated using i.i.d. Gaussian samples, and T = 10². When P > T, we calculated the eigenvalues by using the dual matrix F* (defined in appendix A.3) because F* is much smaller and its eigenvalues are easy to compute. The theoretical values of mλ , sλ and λmax agreed very well with the experimental values in the large M limit. We could predict mλ even for small M. In addition, in appendix C.1, we also show the results of experiments with fixed M and changing T. The theoretical values coincided with the experimental values very well for any T, as the theorems predict.
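The following sketch reproduces the flavor of this experiment at a smaller scale (it is not the authors' code): a random tanh network is drawn as in (8), the per-sample gradients are stacked, and the non-zero eigenvalues of the FIM are obtained from the dual matrix G Gᵀ/T. The printed combinations should remain O(1) as M grows, in line with theorems 1, 3 and 4.

```python
import numpy as np

rng = np.random.default_rng(1)
L, M, C, T = 3, 200, 1, 100                   # modest sizes for illustration
sw2, sb2 = 3.0, 0.64
widths = [M] * L + [C]                        # M_0 = ... = M_{L-1} = M, M_L = C

Ws = [rng.normal(0, np.sqrt(sw2 / widths[l]), (widths[l + 1], widths[l])) for l in range(L)]
bs = [rng.normal(0, np.sqrt(sb2), widths[l + 1]) for l in range(L)]
X = rng.normal(0, 1, (T, M))

def grad_f(x):
    """Gradient of the single linear output with respect to all weights and biases."""
    hs = [x]                                   # h^0 = x, then hidden activations
    for l in range(L - 1):
        hs.append(np.tanh(Ws[l] @ hs[-1] + bs[l]))
    grads = []
    delta = np.ones(C)                         # linear output: delta^L = 1
    grads.append(np.concatenate([np.outer(delta, hs[-1]).ravel(), delta]))
    for l in range(L - 2, -1, -1):             # backpropagate delta^{l+1}
        delta = (1.0 - hs[l + 1] ** 2) * (Ws[l + 1].T @ delta)
        grads.append(np.concatenate([np.outer(delta, hs[l]).ravel(), delta]))
    return np.concatenate(grads[::-1])

G = np.stack([grad_f(x) for x in X])           # T x P
P = G.shape[1]
lam = np.linalg.eigvalsh(G @ G.T / T)          # non-zero eigenvalues of the FIM
print("m_lambda * M :", lam.sum() / P * M)     # O(1), since m_lambda is O(1/M)
print("s_lambda     :", (lam ** 2).sum() / P)  # O(1)
print("lambda_max/M :", lam.max() / M)         # O(1), since lambda_max is O(M)
```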

Figure 1.

Figure 1. Statistics of FIM eigenvalues: means (left), second moments (center), and maximum (right). Our theory predicts the results of numerical experiments, indicated by the black points and error bars. The experiments used 100 random ensembles with different seeds. The variances of the parameters were given by (${\sigma }_{w}^{2},{\sigma }_{b}^{2}$) = (3, 0.64) in the tanh case, (2, 0.1) in the ReLU case, and (1, 0.1) in the linear case. Each colored line represents theoretical results obtained in the limit of M ≫ 1.


4. Connections to learning strategies

Here, we show some applications that demonstrate how our universal theory on the FIM can potentially enrich deep learning theories. It enables us to quantitatively measure the behaviors of learning strategies as follows.

4.1. The Fisher–Rao norm

Recently, Liang et al [22] proposed the Fisher–Rao norm for a capacity measure of generalization ability:

Equation (17): $\Vert\theta\Vert_{\mathrm{FR}} := \sum_{i,j} \theta_{i}\, F_{ij}\,\theta_{j},$

where θ represents weight parameters. They reported that this norm has several desirable properties to explain the high generalization capability of DNNs. In deep linear networks, its generalization capacity (Rademacher complexity) is upper bounded by the norm. In deep ReLU networks, the Fisher–Rao norm serves as a lower bound of the capacities induced by other norms, such as the path norm [40] and the spectral norm [41]. The Fisher–Rao norm is also motivated by information geometry, and invariant under node-wise linear rescaling in ReLU networks. This is a desirable property to connect capacity measures with flatness induced by the rescaling [42].

Here, to obtain a typical evaluation of the norm, we define the average over possible parameters with fixed variances (${\sigma }_{w}^{2},{\sigma }_{b}^{2}$) by ⟨⋅⟩θ = ∫∏i dθi p(θi)(⋅), where p(θi) is the Gaussian density given in equation (8). This leads to the following theorem:

Theorem 5. Suppose that assumption 1 holds. In the limit of M ≫ 1, the Fisher–Rao norm of DNNs satisfies

Equation (18): $\langle\Vert\theta\Vert_{\mathrm{FR}}\rangle_{\theta} \leqslant \sigma_{w}^{2}\,\frac{\alpha}{\alpha_{\mathrm{min}}}\,C\,\kappa_{1},$

where αmin = mini αi . Equality holds in a network with a uniform width Ml = M, and then we have ${\langle {\Vert}\theta {{\Vert}}_{FR}\rangle }_{\theta }={\sigma }_{w}^{2}\left(L-1\right)C{\kappa }_{1}$.

The proof is shown in appendix A.6. Although what we can evaluate is only the average of the norm, it can be quantified by κ1. This guarantees that the norm is independent of the network width in the limit of M ≫ 1, which was empirically conjectured by [22].

Recently, Smith and Le [43] argued that the Bayesian factor composed of the Hessian of the loss function, whose special case is the FIM, is related to the generalization. Similar analysis to the above theorem may enable us to quantitatively understand the relation between the statistics of the FIM and the indicators to measure the generalization ability.

4.2. Learning rate for convergence

Consider the steepest gradient descent method in a batch regime. Its update rule is given by

Equation (19): $\theta_{t+1} = \theta_{t} - \eta\,\nabla_{\theta} E(\theta_{t}) + \mu\,(\theta_{t} - \theta_{t-1}),$

where η is a constant learning rate. We have added a momentum term with a coefficient μ because it is widely used in training deep networks. Assume that the squared loss function E(θ) of equation (4) has a global minimum θ* achieving the zero training error E(θ*) = 0. Then, the FIM's maximum eigenvalue is dominant over the convergence of learning as follows:

Lemma 6. A learning rate satisfying η < 2(1 + μ)/λmax is necessary for the steepest gradient method to converge to the global minimum θ*.

The proof is given by the expansion around the minimum, i.e. E(θ* + dθ) = dθᵀF dθ/2 (detailed in appendix A.7). This lemma is a generalization of LeCun et al [24], who proved the case of μ = 0. Let us refer to ηc := 2(1 + μ)/λmax as the critical learning rate. When η > ηc, the gradient method never converges to the global minimum. The previous work [24] also claimed that η = ηc/2 is the best choice for the fastest convergence around the minimum. Although we focus on the batch regime, the eigenvalues also determine the bound of the gradient norms and the convergence of learning in the online regime [44].

Then, combining lemma 6 with theorem 4 leads to the following:

Theorem 7. Suppose that assumption 1 holds. Let a global minimum θ* be generated by equation (8) and satisfy E(θ*) = 0. In the limit of M ≫ 1, the gradient method never converges to θ* when

Equation (20): $\eta \geqslant \frac{2(1+\mu)}{\alpha\left(\frac{T-1}{T}\,\kappa_{2} + \frac{1}{T}\,\kappa_{1}\right)M}.$

Theorem 7 quantitatively reveals that the wider the network becomes, the smaller the learning rate we need to set. In addition, α is the sum over L constant positive terms, so a deeper network requires a finer setting of the learning rate, which will make the optimization more difficult. In contrast, the expressive power of the network grows exponentially as the number of layers increases [3, 45]. We thus expect there to be a trade-off between trainability and expressive power.
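As a rough sketch of how theorem 7 can be used in practice, the helper below estimates the critical learning rate from the T → ∞ form λ_max ≈ ακ₂M (section 3.3) together with lemma 6. The value of κ₂ is a placeholder; in an actual application it would come from the recurrence relations for the chosen activation and (σ_w², σ_b²).

```python
def eta_critical(kappa2, M, L, mu=0.9, alpha_l=1.0):
    """Rough estimate eta_c = 2(1 + mu) / (alpha * kappa2 * M) for uniform widths."""
    alpha = alpha_l ** 2 * (L - 1)             # alpha = sum_l alpha_l * alpha_{l-1}
    return 2.0 * (1.0 + mu) / (alpha * kappa2 * M)

for M in (100, 1000, 10000):                   # the estimate shrinks like 1 / M
    print(M, eta_critical(kappa2=0.5, M=M, L=4))
```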

To confirm the effectiveness of theorem 7, we performed several experiments. As shown in figure 2, we exhaustively searched training losses while changing M and η, and found that the theoretical estimation coincides well with the experimental results. We trained deep networks (L = 4, αl = 1, C = 10) and the loss function was given by the squared error.

Figure 2.

Figure 2. Color map of training losses: batch training on artificial data (left column) and SGD training on MNIST (right column). The losses are averages over five trials. The color bar shows the value of the training loss after the training. The region where the loss diverges (i.e. is larger than 1000) is in gray. The red line shows the theoretical value of ηc. The initial conditions of the parameters were taken from a Gaussian distribution (8) with $\left({\sigma }_{w}^{2},{\sigma }_{b}^{2}\right)=\left(3,0.64\right)$ in tanh networks, (2, 0.1) in ReLU networks, and (1, 0.1) in linear networks.


The left column of figure 2 shows the results of training on artificial data. We generated training samples x(t) in the Gaussian manner (T = 100) and teacher signals y(t) by the teacher network with a true parameter set θ* satisfying equation (8). We used the gradient method (19) with μ = 0.9 and trained the DNNs for 100 steps. The variances $\left({\sigma }_{w}^{2},{\sigma }_{b}^{2}\right)$ of the initialization of the parameters were set to the same as the global minimum. We found that the losses of the experiments were clearly divided into two areas: one where the gradient exploded (gray area) and the other where it was converging (colored area). The red line is ηc theoretically calculated using κ1 and κ2 on $\left({\sigma }_{w}^{2},{\sigma }_{b}^{2}\right)$ of the initial parameters. Training on the regions above ηc exploded, just as theorem 7 predicts. The explosive region with η < ηc got smaller in the limit of large M.

We performed similar experiments on benchmark datasets and found that the theory can estimate the appropriate learning rates. The results on MNIST are shown in the right column of figure 2. As shown in appendix C.2, the results of training on CIFAR-10 were almost the same as those on MNIST. We used stochastic gradient descent (SGD) with a mini-batch size of 500 and μ = 0.9, and trained the DNNs for 1 epoch. Each training sample x(t) was normalized to zero mean and variance 1 (T = 50 000). The initial values of $\left({\sigma }_{w}^{2},{\sigma }_{b}^{2}\right)$ were set to the vicinity of the special parameter region, i.e. the critical line of the order-to-chaos transition, which the previous works [3, 4] recommend for achieving high expressive power and trainability. Note that the variances $\left({\sigma }_{w}^{2},{\sigma }_{b}^{2}\right)$ may change from the initialization to the global minimum, and the conditions on the global minimum in theorem 7 do not hold in general. Nevertheless, the learning rates estimated by theorem 7 explained the experiments well. Therefore, the ideal conditions supposed in theorem 7 seem to hold effectively. This may be explained by the conjecture that the change from the initialization to the global minimum is small in the large-width limit [46].

Theoretical estimations of learning rates in deep networks have so far been limited; gradient methods such as AdaGrad and Adam also require heuristically determined hyper-parameters for learning rates. Extending our framework would be beneficial for estimating learning rates that prevent the gradient update from exploding.

4.3. Multi-label classification with high dimensionality

This study mainly focuses on the multi-dimensional output of C = O(1). This is because the number of labels is much smaller than the number of hidden units in most practical cases. However, since classification problems with far more labels are sometimes examined in the context of machine learning [47], it would be helpful to remark on the case of C = O(M) here. Denote the mean of the FIM's eigenvalues in the case of C = O(M) as mλ', and so on. Straightforwardly, we can derive

Equation (21): $m_{\lambda}' = \kappa_{1}\,\frac{C}{M},$

Equation (22): $s_{\lambda} \leqslant s_{\lambda}' \leqslant C\,s_{\lambda}, \qquad \alpha\left(\frac{T-1}{T}\,\kappa_{2} + \frac{1}{T}\,\kappa_{1}\right)M \leqslant \lambda_{\mathrm{max}}' \leqslant \sqrt{\alpha\, C\, s_{\lambda}}\,M,$ with $s_{\lambda}$ given by equation (14).

The derivation is shown in appendix A.5. The mean of eigenvalues has the same form as equation (13) obtained in the case of C = O(1). The second moment and maximum eigenvalues can be evaluated by the form of inequalities. We found that the mean is of O(1) while the maximum eigenvalue is of O(M) at least and of O(M2) at most. Therefore, the eigenvalue distribution is more widely distributed than the case of C = O(1).

5. Conclusion and discussion

The present work elucidated the asymptotic statistics of the FIM common among deep networks with any number of layers and various activation functions. The statistics of the FIM are characterized by the small mean of the eigenvalues and the huge maximum eigenvalue, which are computed by the recurrence relations. This suggests that the parameter space determined by the FIM is locally flat in many directions while highly distorted in certain others. As examples of how one can connect the derived statistics to learning strategies, we presented the Fisher–Rao norm and the learning rates of steepest gradient descent.

We demonstrated that the experiments with the Gaussian prior on the parameters coincided well with the theory. Basically, the mean field theory is based on the central limit theorem with the parameters generated in an i.i.d. manner with finite variances. Therefore, one can expect that the good agreement with the theory is not limited to the experiments with the Gaussian prior. Further experiments will be helpful to clarify the applicable scope of the mean field approach.

The derived statistics are also of potential importance to other learning strategies, for instance, natural gradient methods. When the loss landscape is non-uniformly distorted, naive gradient methods are likely to diverge or become trapped in plateau regions, but the natural gradient, F⁻¹∇θ E(θ), converges more efficiently [29–32]. Because it normalizes the distortion of the loss landscape, the naive extension of section 4.2 to the natural gradient leads to ηc = 2(1 + μ), and it seems much easier to choose an appropriately sized learning rate. However, we found that the FIM has many eigenvalues close to zero, and its inversion would make the gradient very unstable. In practice, several experiments showed that the choice of the damping term ɛ, introduced in (F + ɛI)⁻¹∇θ E(θ), is crucial to its performance in DNNs [33]. The development of practical natural gradient methods will require modifications such as damping.

It would also be interesting to use our framework to quantitatively reveal the effects of normalization methods on the FIM. In particular, batch normalization may alleviate the larger eigenvalues because it empirically allows larger learning rates for convergence [48]. It would also be fruitful to investigate the eigenvalues of the Hessian with a large error (4) and to theoretically quantify the negative eigenvalues that lead to the existence of saddle points and loss landscapes without spurious local minima [49]. The global structure of the parameter space should also be explored. We can hypothesize that the parameters are globally connected through the locally flat dimensions and compose manifolds of flat minima.

Our framework on FIMs is readily applicable to other architectures such as convolutional networks and residual networks by using the corresponding mean field theories [11, 12]. To this end, it may be helpful to remark that the macroscopic variables in residual networks essentially diverge at extreme depths [11]. If one considers extremely deep residual networks, the statistics will require a careful examination of the order of the network width and the explosion of the macroscopic variables. We expect that further studies will establish a mathematical foundation of deep learning from the perspective of the large-width limit.

Acknowledgments

This work was partially supported by a Grant-in-Aid for Research Activity Start-up (17H07390) from the Japan Society for the Promotion of Science (JSPS).

Appendix A.: Proofs

A.1. Theorem 1

(i) Case of C = 1

To avoid complicating the notation, we first consider the case of the single output (C = 1). The general case is shown after. The network output is denoted by f(t) here. We denote the Fisher information matrix with full components as

Equation (A.1)

where we notice that

Equation (A.2)

In general, the sum over the eigenvalues is given by the matrix trace, mλ = Trace(F)/P. We denote the average of the eigenvalues of the diagonal block as ${m}_{\lambda }^{\left(W\right)}$ for ∇W fW fT, and ${m}_{\lambda }^{\left(b\right)}$ for ∇b fb fT. Accordingly, we find

Equation (A.3)

The contribution of ${m}_{\lambda }^{\left(b\right)}$ is negligible in the large M limit as follows. The first term is

Equation (A.4)

Equation (A.5)

We can apply the central limit theorem to summations over the units ${\sum }_{i}{\delta }_{i}^{l}{\left(t\right)}^{2}$ and ${\sum }_{j}{h}_{j}^{l-1}{\left(t\right)}^{2}$ independently because they do not share the index of the summation. By taking the limit of M ≫ 1, we obtain ${\sum }_{i}{\delta }_{i}^{l}{\left(t\right)}^{2}{\sum }_{j}{h}_{j}^{l-1}{\left(t\right)}^{2}/{M}_{l-1}={\tilde {q}}^{l}{\hat{q}}^{l-1}$. The variable ${\hat{q}}^{l-1}$ is computed by the recursive relation (9). Under the assumption 1, ${\tilde {q}}^{l}$ is given by the recursive relation (11). Note that this transformation to the macroscopic variables holds regardless of the sample index t. Therefore, we obtain

Equation (A.6)

where αl comes from Ml = αl M, and α comes from P = αM2.

In contrast, the contributions of the bias entries are smaller than those of the weight entries in the limit of M ≫ 1, as is easily confirmed:

Equation (A.7)

Equation (A.8)

Equation (A.9)

${m}_{\lambda }^{\left(W\right)}$ is O(1/M) while ${m}_{\lambda }^{\left(b\right)}$ is O(1/M2). Hence, the mean ${m}_{\lambda }^{\left(b\right)}$ is negligible and we obtain mλ = κ1/M.

(ii) C > 1 of O(1)

We can apply the above computation of C = 1 to each network output ∇fk (k = 1, ..., C):

Equation (A.10)

Therefore, the mean of the eigenvalues becomes

Equations (A.11) and (A.12): $m_{\lambda} = \frac{1}{P}\sum_{k=1}^{C}\mathrm{Trace}\left(\frac{1}{T}\sum_{t=1}^{T}\nabla_{\theta}f_{k}(t)\,\nabla_{\theta}f_{k}(t)^{\mathrm{T}}\right) = \kappa_{1}\,\frac{C}{M}.$

A.2. Corollary 2

Because the FIM is a positive semi-definite matrix, its eigenvalues are non-negative. For a constant k > 0, we obtain

Equations (A.13)–(A.15): $k\,N(\lambda \geqslant k) \;\leqslant \sum_{i:\,\lambda_{i} \geqslant k} \lambda_{i} \;\leqslant\; \sum_{i=1}^{P} \lambda_{i} \;=\; P\, m_{\lambda}.$

This is known as Markov's inequality. When M ≫ 1, combining this with theorem 1 immediately yields

Equation (A.16): $N(\lambda \geqslant k) \;\leqslant\; \frac{P\, m_{\lambda}}{k} \;=\; \frac{\alpha\,\kappa_{1}\, C\, M}{k}.$

Because CT is also a trivial upper bound of N(λ ⩾ k), we obtain corollary 2. □

A.3. Theorem 3

Let us describe the outline of the proof. One can express the FIM as F = (BBT)/T by definition. Here, let us consider a dual matrix of F, that is, F* := (BT B)/T. F and F* have the same nonzero eigenvalues. Because the sum of squared eigenvalues is equal to $\mathrm{T}\mathrm{r}\mathrm{a}\mathrm{c}\mathrm{e}\left({F}^{{\ast}}{\left({F}^{{\ast}}\right)}^{\mathrm{T}}\right)$, we have ${s}_{\lambda }={\sum }_{s,t}^{T}{\left({F}_{st}^{{\ast}}\right)}^{2}/P$. The non-diagonal entry ${F}_{st}^{{\ast}}\enspace \left(s\ne t\right)$ corresponds to an inner product of the network activities for different inputs x(s) and x(t), that is, κ2. The diagonal entry ${F}_{ss}^{{\ast}}$ is given by κ1. Taking the summation of ${\left({F}_{st}^{{\ast}}\right)}^{2}$ over all of s and t, we obtain the theorem. In particular, when T = 1 and C = 1, F* is equal to the squared norm of the derivative ∇θ f, that is, F* = ||∇θ f||2, and one can easily check ${s}_{\lambda }=\alpha {\kappa }_{1}^{2}$.

The detailed proof is given as follows.

(i) Case of C = 1

Here, let us express the FIM as F = ∇θ fθ fT/T, where ∇θ f is a P × T matrix whose columns are the gradients on each input sample, i.e. ∇θ f(t) (t = 1, ..., T). We also introduce a dual matrix of F, that is, F*:

Equation (A.17): $F^{*} := \frac{1}{T}\,\nabla_{\theta}f^{\mathrm{T}}\,\nabla_{\theta}f.$

Note that F is a P × P matrix while F* is a T × T matrix. We can easily confirm that these F and F* have the same non-zero eigenvalues.
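A quick numerical illustration of this fact, with an arbitrary random matrix standing in for the gradient matrix ∇_θ f:

```python
import numpy as np

rng = np.random.default_rng(0)
P, T = 50, 8
B = rng.normal(size=(P, T))                    # stand-in for the P x T gradient matrix
F = B @ B.T / T                                # P x P
F_dual = B.T @ B / T                           # T x T
print(np.sort(np.linalg.eigvalsh(F))[-T:])     # largest T eigenvalues of F
print(np.sort(np.linalg.eigvalsh(F_dual)))     # all eigenvalues of F*, identical
```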

The squared sum of the eigenvalues is given by ${\sum }_{i}{\lambda }_{i}^{2}=\mathrm{T}\mathrm{r}\mathrm{a}\mathrm{c}\mathrm{e}\left({F}^{{\ast}}{\left({F}^{{\ast}}\right)}^{\mathrm{T}}\right)={\sum }_{st}{\left({F}_{st}^{{\ast}}\right)}^{2}$. By using the Frobenius norm ${\Vert}A{{\Vert}}_{F}{:=}\sqrt{{\sum }_{ij}{A}_{ij}^{2}}$, this is ${\sum }_{i}{\lambda }_{i}^{2}={\Vert}{F}^{{\ast}}{{\Vert}}_{F}^{2}$. Similar to mλ , the bias entries in F* are negligible because the number of the entries is much less than that of weight entries. Therefore, we only need to consider the weight entries. The stth entry of F* is given by

Equation (A.18)

Equation (A.19)

where we defined

Equation (A.20): $\hat{Z}^{l}(s,t) := \frac{1}{M_{l}}\sum_{i} h_{i}^{l}(s)\, h_{i}^{l}(t), \qquad \tilde{Z}^{l}(s,t) := \sum_{i} \delta_{i}^{l}(s)\,\delta_{i}^{l}(t).$

We can apply the central limit theorem to ${\hat{Z}}^{l-1}\left(s,t\right)$ and ${\tilde {Z}}^{l}\left(s,t\right)$ independently because they do not share the index of the summation. For st, we have ${\hat{Z}}^{l}={\hat{q}}_{st}^{l}+\mathcal{N}\left(0,\hat{\gamma }/M\right)$ and ${\tilde {Z}}^{l}={\tilde {q}}_{st}^{l}+\mathcal{N}\left(0,\tilde {\gamma }/M\right)$ in the limit of M ≫ 1, where the macroscopic variables ${\hat{q}}_{st}^{l}$ and ${\tilde {q}}_{st}^{l}$ satisfy the recurrence relations (10) and (12). Note that the recurrence relation (12) requires the assumption 1. $\hat{\gamma }$ and $\tilde {\gamma }$ are constants of O(1). Then, for all s and t(≠s),

Equation (A.21)

Equation (A.22)

Similarly, for s = t, we have ${\hat{Z}}^{l}={\hat{q}}^{l}+O\left(1/\sqrt{M}\right)$, ${\tilde {Z}}^{l}={\tilde {q}}^{l}+O\left(1/\sqrt{M}\right)$ and then ${F}_{ss}^{{\ast}}=\alpha {\kappa }_{1}M/T+O\left(\sqrt{M}\right)/T$.

Thus, under the limit of M ≫ 1, the dual matrix is asymptotically given by

Equation (A.23): $F^{*} = \frac{\alpha M}{T}\,K + \frac{O(\sqrt{M})}{T}, \qquad K_{ss} = \kappa_{1}, \quad K_{st} = \kappa_{2} \;\; (s \neq t).$

Neglecting the lower order term, we obtain

Equations (A.24) and (A.25): $s_{\lambda} = \frac{\Vert F^{*}\Vert_{F}^{2}}{P} = \alpha\left(\frac{T-1}{T}\,\kappa_{2}^{2} + \frac{1}{T}\,\kappa_{1}^{2}\right).$

Note that, when ${\hat{q}}_{st}^{l}=0$, κ2 becomes zero and the lower order term may be non-negligible. In this exceptional case, we have ${s}_{\lambda }=\alpha {\kappa }_{1}^{2}/T+O\left(1/M\right)$, where the second term comes from the $O\left(\sqrt{M}\right)/T$ term of equation (A.23). Therefore, the lower order evaluation depends on the T/M ratio, although it is outside the scope of this study. Intuitively, the origin of ${\hat{q}}_{st}^{l}\ne 0$ is related to the offset of firing activities ${h}_{i}^{l}$. The condition of ${\hat{q}}_{st}^{l}\ne 0$ is satisfied when the bias terms exist or when the activation ϕ(⋅) is not an odd function. In such cases, the firing activities have the offset $\mathrm{E}\left[{h}_{i}^{l}\left(t\right)\right]\ne 0$. Therefore, for any input samples s and t (st), we have ${\sum }_{i}{h}_{i}^{l}\left(s\right){h}_{i}^{l}\left(t\right)/{M}_{l}={\hat{q}}_{st}^{l}\ne 0$ and then κ2 ≠ 0 makes sλ of O(1).

(ii) C > 1 of O(1)

Here, we introduce the following dual matrix F*:

Equation (A.26): $B := \left(\nabla_{\theta}f_{1}, \dots, \nabla_{\theta}f_{C}\right),$

Equation (A.27): $F^{*} := \frac{1}{T}\,B^{\mathrm{T}} B,$

where ∇θ fk is a P × T matrix whose columns are the gradients on each input sample, i.e. ∇θ fk (t) (t = 1, ..., T), and B is a P × CT matrix. The FIM is represented by F = BBT/T. F* is a CT × CT matrix and consists of T × T block matrices,

Equation (A.28): $F^{*}(k,k') := \frac{1}{T}\,\nabla_{\theta}f_{k}^{\mathrm{T}}\,\nabla_{\theta}f_{k'},$

for k, k' = 1, ..., C.

The diagonal block F*(k, k) is evaluated in the same way as the case of C = 1. It becomes αMK/T as shown in equation (A.23). The non-diagonal block F*(k, k') has the following stth entries:

Equation (A.29)

Equation (A.30)

Under the limit of M ≫ 1, while ${\tilde {Z}}^{l}\left(s,t\right)$ becomes ${\tilde {q}}_{st}^{l}$ of O(1), $\left({\sum }_{i}{\delta }_{k,i}^{l}\left(s\right){\delta }_{{k}^{\prime },i}^{l}\left(t\right)\right)$ becomes zero and its lower order term of $O\left(1/\sqrt{M}\right)$ appears. This is because the different outputs (kk') do not share the weights ${W}_{ij}^{L}$. We have ${\sum }_{i}{\delta }_{k,i}^{L}\left(s\right){\delta }_{{k}^{\prime },i}^{L}\left(t\right)=0$ and then obtain ${\sum }_{i}{\delta }_{k,i}^{l}\left(s\right){\delta }_{{k}^{\prime },i}^{l}\left(t\right)=0$ (l = 1, ..., L − 1) through the backpropagated chain (7). Thus, the entries of the non-diagonal blocks (A.28) become of $O\left(\sqrt{M}\right)/T$, and we have

Equation (A.31): $F^{*}(k,k') = \delta_{k,k'}\,\frac{\alpha M}{T}\,K + \frac{O(\sqrt{M})}{T},$

where δk,k' is the Kronecker delta.

After all, we have

Equation (A.32)

Equation (A.33)

where the first term comes from the diagonal blocks of O(M) and the second one is their lower order term. The third term comes from the non-diagonal blocks of $O\left(\sqrt{M}\right)$. As one can see from here, when C = O(M), the third term becomes non-negligible. This case is examined in section 4.3. □

A.4. Theorem 4

(i) Case of C = 1

Because F and F* have the same non-zero eigenvalues, what we should derive here is the maximum eigenvalue of F*. As shown in equation (A.23), the leading term of F* asymptotically becomes αMK/T in the limit of M ≫ 1. The eigenvalues of αMK/T are explicitly obtained as follows: ${\lambda }_{\mathrm{max}}=\alpha \left(\frac{T-1}{T}{\kappa }_{2}+\frac{1}{T}{\kappa }_{1}\right)M$ for an eigenvector e = (1, ..., 1), and λi = α(κ1 − κ2)M/T for eigenvectors e1 − ei (i = 2, ..., T), where ei denotes a unit vector whose entries are 1 for the ith entry and 0 otherwise. Thus, we obtain ${\lambda }_{\mathrm{max}}=\alpha \left(\frac{T-1}{T}{\kappa }_{2}+\frac{1}{T}{\kappa }_{1}\right)M$.
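A short numerical check of the eigenvalues used above: a T × T matrix with κ₁ on the diagonal and κ₂ off the diagonal has largest eigenvalue κ₁ + (T − 1)κ₂ and the remaining eigenvalues equal to κ₁ − κ₂ (the values of T, κ₁ and κ₂ below are arbitrary).

```python
import numpy as np

T, k1, k2 = 6, 1.3, 0.4
K = k2 * np.ones((T, T)) + (k1 - k2) * np.eye(T)   # diagonal k1, off-diagonal k2
print(np.sort(np.linalg.eigvalsh(K)))               # [k1 - k2 (x5), k1 + (T - 1) k2]
print(k1 - k2, k1 + (T - 1) * k2)                   # 0.9, 3.3
```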

(ii) C > 1 of O(1)

Let us denote F* shown in equation (A.31) by ${F}^{{\ast}}{:=}{\bar{F}}^{{\ast}}+R$. ${\bar{F}}^{{\ast}}$ is the leading term of F* and given by a CT × CT block diagonal matrix whose diagonal blocks are given by αMK/T. R denotes the residual term of $O\left(\sqrt{M}\right)/T$. In general, the maximum eigenvalue is denoted by the spectral norm ||⋅||2, that is, λmax = ||F*||2. Using the triangle inequality, we have

Equation (A.34): $\lambda_{\mathrm{max}} = \Vert F^{*}\Vert_{2} \leqslant \Vert\bar{F}^{*}\Vert_{2} + \Vert R\Vert_{2}.$

We can obtain ${\Vert}{\bar{F}}^{{\ast}}{{\Vert}}_{2}=\alpha \left(\frac{T-1}{T}{\kappa }_{2}+\frac{1}{T}{\kappa }_{1}\right)M$ because the maximum eigenvalues of the diagonal blocks are the same as the case of C = 1. Regarding ||R||2, this is bounded by ${\Vert}R{{\Vert}}_{2}{\leqslant}{\Vert}R{{\Vert}}_{F}=\sqrt{{C}^{2}{\sum }_{st}{\left(O\left(\sqrt{M}\right)/T\right)}^{2}}=O\left(C\sqrt{M}\right)$. Therefore, when C = O(1), we can neglect ||R||2 of $O\left(\sqrt{M}\right)$ compared to ${\Vert}{\bar{F}}^{{\ast}}{{\Vert}}_{2}$ of O(M).

On the other hand, we can also derive the lower bound of λmax as follows. In general, we have

Equation (A.35): $\lambda_{\mathrm{max}} = \Vert F^{*}\Vert_{2} = \underset{\Vert v\Vert = 1}{\mathrm{max}}\; v^{\mathrm{T}} F^{*} v.$

Then, we find

Equation (A.36): $\lambda_{\mathrm{max}} \geqslant v_{1}^{\mathrm{T}} F^{*} v_{1},$

where v1 is a CT-dimensional vector whose first T entries are $1/\sqrt{T}$ and the others are 0, that is, ${v}_{1}=\left(1,\dots ,1,0,\dots ,0\right)/\sqrt{T}$. We can compute this lower bound by taking the sum over the entries of F*(1, 1), which is equal to equation (A.23):

Equation (A.37): $v_{1}^{\mathrm{T}} F^{*} v_{1} = \frac{1}{T}\sum_{s,t=1}^{T} F^{*}(1,1)_{st} = \alpha\left(\frac{T-1}{T}\,\kappa_{2} + \frac{1}{T}\,\kappa_{1}\right)M + O(\sqrt{M}).$

Finally, we find that the upper bound (A.34) and lower bound (A.37) asymptotically take the same value of O(M), that is, ${\lambda }_{\mathrm{max}}=\alpha \left(\frac{T-1}{T}{\kappa }_{2}+\frac{1}{T}{\kappa }_{1}\right)M$.

A.5. Case of C = O(M)

The mean of eigenvalues mλ' is derived in the same way as shown in section A.1(ii), that is, m'λ = κ1 C/M.

Regarding the second moment sλ ', the lower order term becomes non-negligible as remarked in equation (A.33). We evaluate this sλ ' by using inequalities as follows:

Equation (A.38)

Equation (A.39)

Equation (A.40)

As shown in section A.3, for any k, we obtain ${\Vert}{\nabla }_{\theta }{f}_{k}^{\mathrm{T}}{\nabla }_{\theta }{f}_{k}{{\Vert}}_{F}^{2}/P=\alpha \left(\frac{T-1}{T}{\kappa }_{2}^{2}+\frac{1}{T}{\kappa }_{1}^{2}\right)$ in the limit of M ≫ 1. Thus, the lower bound becomes the same form as sλ , that is, ${s}_{\lambda }=C\alpha \left(\frac{T-1}{T}{\kappa }_{2}^{2}+\frac{1}{T}{\kappa }_{1}^{2}\right)$. In contrast, the upper bound is given by

Equation (A.41)

Equation (A.42)

Equation (A.43)

where Fk denotes the FIM of the kth output, i.e. Fk := ∑t θ fk (t)∇θ fk (t)T/T. Therefore, the upper bound is reduced to the summation over sλ of C = 1. In the limit of M ≫ 1, we obtain ${s}_{\lambda }^{\prime }{\leqslant}{C}^{2}{\Vert}{F}_{k}{{\Vert}}_{F}^{2}/P={C}^{2}\alpha \left(\frac{T-1}{T}{\kappa }_{2}^{2}+\frac{1}{T}{\kappa }_{1}^{2}\right)=C{s}_{\lambda }$.

Next, we show inequalities for λmax. We have already derived the lower bound (A.37) and this bound holds in the case of C = O(M) as well. In contrast, the upper bound (A.34) may become loose when C is larger than O(1) because of the residual term ||R||2. Although it is hard to explicitly obtain the value of ||R||2, the following upper bound holds and is easy to compute by using sλ of equation (14). Because the FIM is a positive semi-definite matrix, λi ⩾ 0 holds by definition. Then, we have ${\lambda }_{\mathrm{max}}\enspace {\leqslant}\sqrt{{\sum }_{i}{\lambda }_{i}^{2}}$. Combining this with ${s}_{\lambda }^{\prime }={\sum }_{i}{\lambda }_{i}^{2}/P$, we have ${\lambda }_{\mathrm{max}}\enspace {\leqslant}\sqrt{\alpha {s}_{\lambda }^{\prime }}M{\leqslant}\sqrt{\alpha C{s}_{\lambda }}M$.

A.6. Theorem 5

The Fisher–Rao norm is written as

Equation (A.44): $\Vert\theta\Vert_{\mathrm{FR}} = \sum_{l,ij}\;\sum_{l',ab} W_{ij}^{l}\; F_{(l,ij),(l',ab)}\; W_{ab}^{l'},$

where F(l,ij),(l',ab) represents an entry of the FIM, that is, ${\sum }_{k}^{C}{\sum }_{t}{\nabla }_{{W}_{ij}^{l}}{f}_{k}\left(t\right){\nabla }_{{W}_{ab}^{{l}^{\prime }}}{f}_{k}\left(t\right)/T$. Because F(l,ij),(l',ab) includes the random variables ${W}_{ij}^{l}$ and ${W}_{ab}^{{l}^{\prime }}$, we consider the following expansion. Note that ${W}_{ij}^{l}$ and ${W}_{ab}^{{l}^{\prime }}$ are infinitesimals generated by equation (8). Performing a Taylor expansion around ${W}_{ij}^{l}={W}_{ab}^{{l}^{\prime }}=0$, we obtain

Equation (A.45)

where θ* is the parameter set $\left\{{W}_{ij}^{l},{b}_{i}^{l}\right\}$ with ${W}_{ij}^{l}={W}_{ab}^{{l}^{\prime }}=0$. By substituting the above expansion into the Fisher-Rao norm and taking the average ⟨⋅⟩θ , we obtain the following leading term:

Equation (A.46)

Equation (A.47)

For (l, ij) ≠ (l', ab), the last line becomes zero because of ${\langle {W}_{ij}^{l}{W}_{ab}^{{l}^{\prime }}\rangle }_{\left\{{W}_{ij}^{l},{W}_{ab}^{{l}^{\prime }}\right\}}={\langle {W}_{ij}^{l}\rangle }_{{W}_{ij}^{l}}{\langle {W}_{ab}^{{l}^{\prime }}\rangle }_{{W}_{ab}^{{l}^{\prime }}}=0$. For (l, ij) = (l', ab), we have ${\langle {\left({W}_{ij}^{l}\right)}^{2}\rangle }_{\left\{{W}_{ij}^{l}\right\}}={\sigma }_{w}^{2}/{M}_{l-1}$. After all, in the limit of M ≫ 1, we obtain

Equations (A.48)–(A.50): $\langle\Vert\theta\Vert_{\mathrm{FR}}\rangle_{\theta} = \sum_{l=1}^{L}\frac{\sigma_{w}^{2}}{M_{l-1}}\sum_{i,j} F_{(l,ij),(l,ij)}(\theta^{*}) = \sigma_{w}^{2}\, C \sum_{l=1}^{L} \tilde{q}^{l}\,\hat{q}^{l-1},$

where the derivation of the macroscopic variables is similar to that of mλ , as shown in section A.1. Since we have ${\kappa }_{1}={\sum }_{l}\frac{{\alpha }_{l-1}}{\alpha }{\tilde {q}}^{l}{\hat{q}}^{l-1}$, it is easy to confirm ${\langle {\Vert}\theta {{\Vert}}_{FR}\rangle }_{\theta }{\leqslant}{\sigma }_{w}^{2}\alpha /{\alpha }_{\mathrm{min}}\enspace C{\kappa }_{1}$. When all αl take the same value, we have α/αmin = L − 1 and the equality holds. □

A.7. Lemma 6

Suppose a perturbation around the global minimum: θt = θ* + Δt . Then, the gradient update becomes

Equation (A.51): $\Delta_{t+1} = \Delta_{t} - \eta\, F\,\Delta_{t} + \mu\,(\Delta_{t} - \Delta_{t-1}),$

where we have used E(θ*) = 0 and ∂E(θ*)/∂θ = 0.

Consider a coordinate transformation from Δt to ${\bar{{\Delta}}}_{t}$ that diagonalizes F. It does not change the stability of the gradients. Accordingly, we can update the ith component as follows:

Equation (A.52): $\bar{\Delta}_{i,t+1} = (1 + \mu - \eta\lambda_{i})\,\bar{\Delta}_{i,t} - \mu\,\bar{\Delta}_{i,t-1}.$

Solving its characteristic equation, we obtain the general solution,

Equation (A.53): $\bar{\Delta}_{i,t} = A\,\rho_{+}^{\,t} + B\,\rho_{-}^{\,t}, \qquad \rho_{\pm} = \frac{1 + \mu - \eta\lambda_{i} \pm \sqrt{(1 + \mu - \eta\lambda_{i})^{2} - 4\mu}}{2},$

where A and B are constants. This recurrence relation converges if and only if ηλi < 2(1 + μ) for all i. Therefore, η < 2(1 + μ)/λmax is necessary for the steepest gradient to converge to θ*. □
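A toy numerical illustration of this convergence boundary, assuming a one-dimensional quadratic loss E = λΔ²/2 and 0 ⩽ μ < 1:

```python
def converges(eta, lam, mu, steps=2000):
    """Iterate d_{t+1} = d_t - eta*lam*d_t + mu*(d_t - d_{t-1}) from d_0 = d_1 = 1."""
    d_prev, d = 1.0, 1.0
    for _ in range(steps):
        d, d_prev = d - eta * lam * d + mu * (d - d_prev), d
        if abs(d) > 1e12:                      # clearly diverging
            return False
    return abs(d) < 1e-6

lam, mu = 50.0, 0.9
eta_c = 2 * (1 + mu) / lam                     # critical learning rate of lemma 6
for eta in (0.5 * eta_c, 0.9 * eta_c, 1.1 * eta_c):
    print(f"eta / eta_c = {eta / eta_c:.1f} -> converges: {converges(eta, lam, mu)}")
```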

Appendix B.: Analytical recurrence relations

B.1. Erf networks

Consider the following error function as an activation function ϕ(x):

Equation (B.1)

The error function well approximates the tanh function and has a sigmoid-like shape. For a network with ϕ(x) = erf(x), the recurrence relations for macroscopic variables do not require numerical integrations.

(i) ${\hat{q}}^{l}$ and ${\tilde {q}}^{l}$: note that we can analytically integrate the error functions over a Gaussian distribution:

Equation (B.2)

Hence, the recurrence relations for the feedforward signals (9) have the following analytical forms:

Equation (B.3)

Because the derivative of the error function is Gaussian, we can also easily integrate ϕ'(x) over the Gaussian distribution and obtain the following analytical representations of the recurrence relations (11):

Equation (B.4)

(ii) ${\hat{q}}_{st}^{l}$ and ${\tilde {q}}_{st}^{l}$:

To compute the recurrence relations for the feedforward correlations (10), note that we can generally transform Iϕ [a, b] into

Equation (B.5)

For the error function,

Equation (B.6)

and we obtain

Equation (B.7)

This is the analytical form of the recurrence relation for ${\hat{q}}_{st}^{l}$.

Finally, because the derivative of the error function is Gaussian, we can also easily obtain

Equation (B.8)

This is the analytical form of the recurrence relation for ${\tilde {q}}_{st}^{l}$.

B.2. ReLU networks

We define the ReLU activation as ϕ(x) = max(0, x), i.e. ϕ(x) = 0 for x < 0 and ϕ(x) = x for x ⩾ 0. For a network with this ReLU activation function, the recurrence relations for the macroscopic variables require no numerical integrations.

(i) ${\hat{q}}^{l}$ and ${\tilde {q}}^{l}$: we can explicitly perform the integrations in the recurrence relations (9) and (11):

Equation (B.9)

Equation (B.10)

(ii) ${\hat{q}}_{st}^{l}$ and ${\tilde {q}}_{st}^{l}$: we can explicitly perform the integrations in the recurrence relations (10) and (12):

Equation (B.11)

Equation (B.12)

where c = b/a.

B.3. Linear networks

We define a linear activation as ϕ(x) = x. For a network with this linear activation function, the recurrence relations for the macroscopic variables do not require numerical integrations.

(i) ${\hat{q}}^{l}$ and ${\tilde {q}}^{l}$: we can explicitly perform the integrations in the recurrence relations (9) and (11):

Equation (B.13): $\hat{q}^{\,l+1} = \sigma_{w}^{2}\,\hat{q}^{\,l} + \sigma_{b}^{2},$

Equation (B.14): $\tilde{q}^{\,l} = \sigma_{w}^{2}\,\tilde{q}^{\,l+1}.$

(ii) ${\hat{q}}_{st}^{l}$ and ${\tilde {q}}_{st}^{l}$: we can explicitly perform the integrations in the recurrence relations (10) and (12):

Equation (B.15): $\hat{q}_{st}^{\,l+1} = \sigma_{w}^{2}\,\hat{q}_{st}^{\,l} + \sigma_{b}^{2},$

Equation (B.16): $\tilde{q}_{st}^{\,l} = \sigma_{w}^{2}\,\tilde{q}_{st}^{\,l+1}.$

Appendix C.: Additional experiments

C.1. Dependence on T

It is shown in figure C.1.

Figure C.1.

Figure C.1. Statistics of FIM eigenvalues with fixed M and changing T (L = 3, αl = C = 1). The red line represents theoretical results obtained in the limit of M ≫ 1. The first row shows results of tanh networks with M = 1000. The second row shows those with a relatively small width (M = 300) and higher T. We set M = 1000 in ReLU and linear networks. The other settings are the same as in figure 1.


C.2. Training on CIFAR-10

It is shown in figure C.2.

Figure C.2.

Figure C.2. Color map of training losses after one epoch of SGD training: tanh, ReLU, and linear networks trained on CIFAR-10.


Footnotes

  • This article is an updated version of: Karakida R, Akaho S and Amari S 2019 Universal statistics of Fisher information in deep neural networks: mean field approach Proc. Machine Learning Res. 89 3181–92

  • Let us remark that we have assumed σb > 0 in the setting (8). If one considers a case of no bias term (σb = 0), odd activations ϕ(x) lead to ${\hat{q}}_{st}^{l}=0$ and κ2 = 0. In such exceptional cases, we need to evaluate the lower order terms of sλ and λmax (outside the scope of this study).
