Toward a theory of machine learning
Machine Learning: Science and Technology (IF 6.3) Pub Date: 2021-05-18, DOI: 10.1088/2632-2153/abe6d7
Vitaly Vanchurin

We define a neural network as a septuple consisting of (1) a state vector, (2) an input projection, (3) an output projection, (4) a weight matrix, (5) a bias vector, (6) an activation map and (7) a loss function. We argue that the loss function can be imposed either on the boundary (i.e. input and/or output neurons) or in the bulk (i.e. hidden neurons) for both supervised and unsupervised systems. We apply the principle of maximum entropy to derive a canonical ensemble of the state vectors subject to a constraint imposed on the bulk loss function by a Lagrange multiplier (or an inverse temperature parameter). We show that in equilibrium the canonical partition function must be a product of two factors: a function of the temperature, and a function of the bias vector and weight matrix. Consequently, the total Shannon entropy consists of two terms which represent, respectively, a thermodynamic entropy and a complexity of the neural network. We derive the first and second laws of learning: during learning the total entropy must decrease until the system reaches an equilibrium (i.e. the second law), and the increment in the loss function must be proportional to the increment in the thermodynamic entropy plus the increment in the complexity (i.e. the first law). We calculate the entropy destruction to show that the efficiency of learning is given by the Laplacian of the total free energy, which is to be maximized in an optimal neural architecture, and explain why the optimization condition is better satisfied in a deep network with a large number of hidden layers. The key properties of the model are verified numerically by training a supervised feedforward neural network with stochastic gradient descent. We also discuss the possibility that the entire Universe at its most fundamental level is a neural network.
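
For readers who want the construction in symbols, the following is a minimal sketch of the relations described in the abstract. The notation (H for the bulk loss, beta for the Lagrange multiplier, C for the complexity term) is assumed here for illustration; the precise definitions, signs and proportionality constants are those given in the paper.

    p(x) = e^{-\beta H(x, b, w)} / Z(\beta, b, w),  with  Z(\beta, b, w) = \int dx \, e^{-\beta H(x, b, w)}   (maximum-entropy canonical ensemble)
    Z(\beta, b, w) = Z_1(\beta) \, Z_2(b, w)   (equilibrium factorization of the partition function)
    S_{total} = S_{thermo}(\beta) + C(b, w)   (thermodynamic entropy plus complexity)
    dU \propto dS_{thermo} + dC   (first law of learning)
    dS_{total}/dt \le 0 until equilibrium is reached   (second law of learning)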
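
The numerical check mentioned in the abstract can be illustrated with a toy version of the setup: a small feedforward network written out in terms of the septuple components and trained by stochastic gradient descent. The code below is a self-contained sketch under those assumptions (the toy task, shapes and variable names are illustrative, not the author's); it shows only the training procedure, not the entropy and free-energy bookkeeping of the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    # septuple components (notation is illustrative)
    n_in, n_hid, n_out = 3, 16, 1              # input/output (boundary) and hidden (bulk) neurons
    w1 = rng.normal(0.0, 0.5, (n_hid, n_in))   # weight matrix, input -> hidden
    w2 = rng.normal(0.0, 0.5, (n_out, n_hid))  # weight matrix, hidden -> output
    b1 = np.zeros(n_hid)                       # bias vector (hidden)
    b2 = np.zeros(n_out)                       # bias vector (output)
    f = np.tanh                                # activation map

    def loss(y_pred, y):                       # boundary loss function
        return 0.5 * float(np.sum((y_pred - y) ** 2))

    # toy supervised data: y = sin(x0) + x1 * x2
    X = rng.uniform(-1.0, 1.0, (512, n_in))
    Y = (np.sin(X[:, 0]) + X[:, 1] * X[:, 2]).reshape(-1, 1)

    lr = 0.05
    for step in range(5001):
        i = rng.integers(len(X))               # stochastic gradient descent: one sample per update
        x, y = X[i], Y[i]

        h = f(w1 @ x + b1)                     # state of the hidden (bulk) neurons
        y_pred = w2 @ h + b2                   # state of the output (boundary) neurons

        d_out = y_pred - y                     # gradient of the loss at the output
        d_hid = (w2.T @ d_out) * (1.0 - h ** 2)  # backpropagated through tanh

        w2 -= lr * np.outer(d_out, h); b2 -= lr * d_out
        w1 -= lr * np.outer(d_hid, x); b1 -= lr * d_hid

        if step % 1000 == 0:
            print(step, loss(y_pred, y))

The printed loss decreasing over training corresponds to the "supervised feedforward neural network trained with stochastic gradient descent" used for the numerical verification; the thermodynamic quantities discussed above would have to be estimated on top of such a loop.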



Updated: 2021-05-18