Deep regularization and direct training of the inner layers of Neural Networks with Kernel Flows
Physica D: Nonlinear Phenomena (IF 4) Pub Date: 2021-07-18, DOI: 10.1016/j.physd.2021.132952
Gene Ryan Yoo, Houman Owhadi

We introduce a new regularization method for Artificial Neural Networks (ANNs) based on the Kernel Flow (KF) algorithm. The algorithm was introduced in Owhadi and Yoo (2019) as a method for kernel selection in regression/kriging based on the minimization of the loss of accuracy incurred by halving the number of interpolation points in random batches of the dataset. Writing $f_\theta(x) = \big(f^{(n)}_{\theta_n} \circ f^{(n-1)}_{\theta_{n-1}} \circ \cdots \circ f^{(1)}_{\theta_1}\big)(x)$ for the functional representation of the compositional structure of the ANN (where $\theta_i$ are the weights and biases of layer $i$), the inner-layer outputs $h^{(i)}(x) = \big(f^{(i)}_{\theta_i} \circ f^{(i-1)}_{\theta_{i-1}} \circ \cdots \circ f^{(1)}_{\theta_1}\big)(x)$ define a hierarchy of feature maps and a hierarchy of kernels $k^{(i)}(x,x') = \exp\!\big(-\gamma_i \,\|h^{(i)}(x) - h^{(i)}(x')\|_2^2\big)$. When combined with a batch of the dataset, these kernels produce KF losses $e_2^{(i)}$ (defined as the $L^2$ regression error incurred by using a random half of the batch to predict the other half) depending on the parameters of the inner layers $\theta_1, \ldots, \theta_i$ (and $\gamma_i$). The proposed method simply consists of aggregating (as a weighted sum) a subset of these KF losses with a classical output loss (e.g., cross-entropy). We test the proposed method on Convolutional Neural Networks (CNNs) and Wide Residual Networks (WRNs) without altering their structure or their output classifier and report reduced test errors, decreased generalization gaps, and increased robustness to distribution shift, without a significant increase in computational complexity relative to standard CNN and WRN training (with Dropout and Batch Normalization). We suspect that these results might be explained by the fact that, while conventional training only employs a linear functional (a generalized moment) of the empirical distribution defined by the dataset and can be prone to trapping in the Neural Tangent Kernel regime (under over-parameterization), the proposed loss function (defined as a nonlinear functional of the empirical distribution) effectively trains the underlying kernel defined by the CNN beyond regressing the data with that kernel.
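To make the construction concrete, the following is a minimal sketch (not the authors' implementation) of a KF loss computed on inner-layer features and aggregated with a cross-entropy output loss, assuming a PyTorch setting; the function names, the ridge regularizer reg, the mean-squared-error form of the half-batch prediction error, and the per-layer weights kf_weights are illustrative assumptions.

import torch
import torch.nn.functional as F


def rbf_kernel(h, gamma):
    # k(x, x') = exp(-gamma * ||h(x) - h(x')||_2^2) on a batch of flattened features h
    sq_dists = torch.cdist(h, h, p=2) ** 2
    return torch.exp(-gamma * sq_dists)


def kernel_flow_loss(h, y_onehot, gamma, reg=1e-6):
    # KF loss in the spirit of e_2: regression error incurred by using a random
    # half of the batch to predict the other half with kernel ridge regression
    # on the inner-layer features h (reg is a small jitter for numerical stability).
    n = h.shape[0]
    perm = torch.randperm(n, device=h.device)
    half, rest = perm[: n // 2], perm[n // 2:]
    K = rbf_kernel(h, gamma)
    K_hh = K[half][:, half] + reg * torch.eye(len(half), device=h.device)
    K_rh = K[rest][:, half]
    alpha = torch.linalg.solve(K_hh, y_onehot[half])   # interpolate labels of one half
    y_pred = K_rh @ alpha                              # predict the other half
    return ((y_onehot[rest] - y_pred) ** 2).mean()


def total_loss(logits, inner_features, y, gammas, kf_weights):
    # Weighted sum of the usual cross-entropy output loss and KF losses
    # on a chosen subset of inner layers.
    y_onehot = F.one_hot(y, num_classes=logits.shape[1]).float()
    loss = F.cross_entropy(logits, y)
    for h, gamma, w in zip(inner_features, gammas, kf_weights):
        loss = loss + w * kernel_flow_loss(h.flatten(1), y_onehot, gamma)
    return loss

In such a setup each gamma would typically be a trainable parameter (playing the role of the $\gamma_i$ above), and only a subset of the inner layers would contribute a KF term, as described in the abstract.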




Updated: 2021-08-01