Second-order Neural Network Training Using Complex-step Directional Derivative
arXiv - CS - Artificial Intelligence Pub Date : 2020-09-15 , DOI: arxiv-2009.07098 Siyuan Shen, Tianjia Shao, Kun Zhou, Chenfanfu Jiang, Feng Luo, Yin Yang
While the superior performance of second-order optimization methods such as
Newton's method is well known, they are rarely used in practice for deep
learning because neither assembling the Hessian matrix nor computing its
inverse is feasible for large-scale problems. Existing second-order methods
resort to various diagonal or low-rank approximations of the Hessian, which
often fail to capture the curvature information needed to yield a substantial
improvement. On the other hand, when training becomes batch-based (i.e.,
stochastic), noisy second-order information easily contaminates the training
procedure unless expensive safeguards are employed. In this paper, we adopt a
numerical algorithm for second-order neural network training. We tackle the
practical obstacle of Hessian calculation by using the complex-step finite
difference (CSFD) -- a numerical procedure that adds an imaginary perturbation
to the function for derivative computation. CSFD is highly robust, efficient,
and accurate (as accurate as the analytic result). This method allows us to
apply literally any known second-order optimization method to deep learning
training. Based on it, we design an effective Newton-Krylov procedure. The key
mechanism is to terminate the stochastic Krylov iteration as soon as a
disturbing direction is found, so that unnecessary computation is avoided.
During the optimization, we monitor the approximation error of the Taylor
expansion to adjust the step size. This strategy combines the advantages of
line search and trust region methods, so our method preserves good local and
global convergence at the same time. We have tested our method on various deep
learning tasks. The experiments show that our method outperforms existing
methods, and it often converges an order of magnitude faster. We believe our
method will inspire a wide range of new algorithms for deep learning and
numerical optimization.
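The complex-step idea behind CSFD can be sketched in a few lines. Because the perturbation is imaginary, there is no subtractive cancellation, so the step size h can be made tiny and the result matches the analytic derivative to machine precision. The same trick applied to the gradient yields Hessian-vector products without ever forming the Hessian, which is exactly what a Krylov (e.g. conjugate-gradient) inner solver needs. The toy objective `f`, its gradient, and the test point below are illustrative choices, not taken from the paper:

```python
import numpy as np

def f(x):
    # Toy smooth objective; uses only analytic operations,
    # so it evaluates correctly on complex inputs.
    return np.sum(x**4) + np.sum(x**2)

def grad(x):
    # Analytic gradient of f (also complex-analytic).
    return 4.0 * x**3 + 2.0 * x

def csfd_directional_derivative(f, x, v, h=1e-30):
    # Complex-step directional derivative: Im(f(x + i*h*v)) / h.
    # No subtraction of nearby values, so h can be far below
    # what real-valued finite differences tolerate.
    return np.imag(f(x + 1j * h * v)) / h

def csfd_hvp(grad, x, v, h=1e-30):
    # Hessian-vector product via a complex step on the gradient:
    # H(x) v ~= Im(grad(x + i*h*v)) / h.  This is the matrix-free
    # primitive a Newton-Krylov inner iteration consumes.
    return np.imag(grad(x + 1j * h * v)) / h

x = np.array([1.0, 2.0])
v = np.array([1.0, 0.0])
print(csfd_directional_derivative(f, x, v))  # grad(x) . v = 6.0
print(csfd_hvp(grad, x, v))                  # H(x) v = [14., 0.]
```

In a matrix-free Newton-Krylov setup, `csfd_hvp` would be wrapped as the linear operator handed to a conjugate-gradient solver for the Newton system H p = -g, so only Hessian-vector products are ever computed.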
Updated: 2020-09-16