Making the Last Iterate of SGD Information Theoretically Optimal
SIAM Journal on Optimization (IF 3.1), Pub Date: 2021-04-13, DOI: 10.1137/19m128908x
Prateek Jain, Dheeraj M. Nagaraj, Praneeth Netrapalli

SIAM Journal on Optimization, Volume 31, Issue 2, Pages 1108-1130, January 2021.
Stochastic gradient descent (SGD) is one of the most widely used algorithms for large-scale optimization problems. While the classical theoretical analysis of SGD for convex problems studies (suffix) averages of the iterates and obtains information-theoretically optimal bounds on suboptimality, the last point of SGD is, by far, the preferred choice in practice. The best known results for the last point of SGD [O. Shamir and T. Zhang, Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 71--79], however, are suboptimal by a $\log T$ factor compared to information-theoretic lower bounds, where $T$ is the number of iterations. Harvey, Liaw, Plan, and Randhawa [Conference on Learning Theory, PMLR, 2019, pp. 1579--1613] show that this additional $\log T$ factor is, in fact, tight for the standard step size sequences $\Theta({\frac{1}{\sqrt{t}}})$ and $\Theta({\frac{1}{t}})$ in the non-strongly convex and strongly convex settings, respectively. Similarly, even for subgradient descent (GD) applied to nonsmooth convex functions, the best known step size sequences still lead to $O(\log T)$-suboptimal convergence rates for the final iterate. The main contribution of this work is to design new step size sequences that enjoy information-theoretically optimal bounds on the suboptimality of the last point of SGD as well as GD. We achieve this by designing a modification scheme that converts one sequence of step sizes into another, so that the last point of SGD/GD with the modified sequence has the same suboptimality guarantees as the average of the SGD/GD iterates with the original sequence. We also show that our result holds with high probability. We validate our results through simulations, which demonstrate that the new step size sequence indeed improves the final iterate significantly compared to the standard step size sequences.
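As a minimal illustration of the last-iterate versus averaged-iterate gap the abstract refers to, the following Python sketch runs SGD on the simple nonsmooth convex objective $f(x) = |x|$ with noisy subgradients under the standard $\Theta(1/\sqrt{t})$ schedule, then compares the suboptimality of the final iterate with that of a suffix average. The objective, noise model, constants, and helper names (`sgd_abs`, `step_fn`) are illustrative assumptions, not the paper's construction or experimental setup.

```python
import numpy as np

def sgd_abs(T, step_fn, noise_std=0.5, x0=1.0, seed=0):
    """Run SGD on f(x) = |x| with noisy subgradients; return all iterates."""
    rng = np.random.default_rng(seed)
    x = x0
    iterates = [x]
    for t in range(1, T + 1):
        g = np.sign(x) + noise_std * rng.standard_normal()  # stochastic subgradient of |x|
        x = x - step_fn(t) * g
        iterates.append(x)
    return np.array(iterates)

T = 100_000
standard_step = lambda t: 1.0 / np.sqrt(t)   # classical Theta(1/sqrt(t)) schedule
xs = sgd_abs(T, standard_step)

# Suboptimality of f(x) = |x| is just |x|, since the minimum value 0 is attained at x = 0.
last_iterate_gap = abs(xs[-1])
suffix_avg_gap = abs(xs[T // 2:].mean())     # average over the last half of the iterates

print(f"last-iterate suboptimality  : {last_iterate_gap:.4e}")
print(f"suffix-average suboptimality: {suffix_avg_gap:.4e}")
```

A single run is noisy, so the gap is best seen by averaging over many seeds; the sketch only uses the standard schedule and does not implement the authors' modified step size sequences.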


Updated: 2021-05-20