Primal Averaging: A New Gradient Evaluation Step to Attain the Optimal Individual Convergence
IEEE Transactions on Cybernetics (IF 9.4), Pub Date: 2018-10-19, DOI: 10.1109/tcyb.2018.2874332
Wei Tao, Zhisong Pan, Gaowei Wu, Qing Tao

Many well-known first-order gradient methods have been extended to cope with large-scale composite problems, which often arise as regularized empirical risk minimization in machine learning. However, their optimal convergence is attained only in terms of the weighted average of past iterative solutions. How to make the individual convergence of stochastic gradient descent (SGD) optimal, especially for strongly convex problems, has become a challenging problem in the machine learning community. On the other hand, Nesterov's recent weighted-averaging strategy succeeds in achieving the optimal individual convergence of dual averaging (DA), but it fails for the basic mirror descent (MD). In this paper, a new primal averaging (PA) gradient operation step is presented, in which the gradient evaluation is imposed on the weighted average of all past iterative solutions. We prove that simply modifying the gradient operation step in MD by the PA strategy suffices to recover the optimal individual rate for general convex problems. Along this line, the optimal individual rate of convergence for strongly convex problems can also be achieved by imposing strong convexity on the gradient operation step. Furthermore, we extend PA-MD to solve regularized nonsmooth learning problems in the stochastic setting, which reveals that the PA strategy is a simple yet effective extra step toward the optimal individual convergence of SGD. Several experiments on real sparse learning and SVM problems verify the correctness of our theoretical analysis.
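The abstract only describes the PA step at a high level: the (stochastic) gradient is evaluated at a weighted average of all past iterates rather than at the current iterate, and the last individual iterate is the returned solution. The following is a minimal Python sketch of that idea applied to plain SGD, assuming a uniform running average and a 1/sqrt(t) step size for illustration; the actual weights, step sizes, and mirror-descent/proximal machinery in the paper may differ.

```python
import numpy as np

def pa_sgd(grad, x0, n_steps=1000, step_size=lambda t: 1.0 / np.sqrt(t + 1)):
    """Sketch of SGD with a primal-averaging (PA) gradient evaluation step.

    Instead of evaluating the (stochastic) gradient at the current iterate x_t,
    the gradient is evaluated at a running average z_t of all past iterates,
    and the usual descent step is then applied to x_t. The uniform averaging
    used here is an illustrative assumption, not the paper's exact scheme.
    """
    x = np.asarray(x0, dtype=float).copy()
    z = x.copy()  # weighted (here: uniform) average of all past iterates
    for t in range(n_steps):
        g = grad(z)                      # PA step: gradient taken at the average, not at x
        x = x - step_size(t) * g         # standard descent update on the individual iterate
        z = (t + 1) / (t + 2) * z + 1.0 / (t + 2) * x  # update running average
    return x                             # return the individual (last) iterate

# Toy usage: minimize f(x) = 0.5 * ||x||^2 with noisy gradients (hypothetical example).
rng = np.random.default_rng(0)
noisy_grad = lambda x: x + 0.1 * rng.standard_normal(x.shape)
print(pa_sgd(noisy_grad, x0=np.ones(5)))
```

The design point the abstract emphasizes is that the output is the last individual iterate rather than an averaged solution, with the PA gradient evaluation being the extra step intended to make that individual iterate converge at the optimal rate.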
