Minimizing finite sums with the stochastic average gradient

Cited by: 9
Authors
Mark Schmidt
Nicolas Le Roux
Francis Bach
Affiliations
[1] University of British Columbia, Department of Computer Science
[2] Criteo
[3] INRIA - SIERRA Project-Team, Laboratoire d’Informatique de l’Ecole Normale Superieure
Source
Mathematical Programming | 2017, Volume 162
Keywords
Convex optimization; Stochastic gradient methods; First-order methods; Convergence rates; 90C06; 90C15; 90C25; 90C30; 65K05; 68Q25; 62L20
DOI
Not available
Abstract
We analyze the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method’s iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values, the SAG method achieves a faster convergence rate than black-box SG methods. The convergence rate is improved from $O(1/\sqrt{k})$ to $O(1/k)$ in general, and when the sum is strongly convex the convergence rate is improved from the sub-linear $O(1/k)$ to a linear rate of the form $O(\rho^k)$ for $\rho < 1$. Further, in many cases the convergence rate of the new method is also faster than that of black-box deterministic gradient methods, in terms of the number of gradient evaluations. This extends our earlier work, Le Roux et al. (Adv Neural Inf Process Syst, 2012), which only led to a faster rate for well-conditioned strongly-convex problems. Numerical experiments indicate that the new algorithm often dramatically outperforms existing SG and deterministic gradient methods, and that the performance may be further improved through the use of non-uniform sampling strategies.
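As a concrete illustration of the memory-of-gradients idea described in the abstract, the following is a minimal sketch of a SAG-style update loop for minimizing (1/n) Σᵢ fᵢ(x). The helper grad_f(i, x), the fixed step size, and the uniform sampling rule are illustrative assumptions, not the authors’ exact implementation.

```python
# Minimal SAG-style sketch (assumed interface, not the paper's reference code).
import numpy as np

def sag(grad_f, x0, n, step_size, num_iters, rng=None):
    """grad_f(i, x) is assumed to return the gradient of the i-th smooth term."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    y = np.zeros((n, x0.size))   # memory: last gradient seen for each term
    d = np.zeros(x0.size)        # running sum of the stored gradients
    for _ in range(num_iters):
        i = rng.integers(n)      # sample one term (uniformly here)
        g = grad_f(i, x)         # fresh gradient of the sampled term only
        d += g - y[i]            # swap the stale stored gradient for the fresh one
        y[i] = g
        x -= step_size * d / n   # step along the average of the stored gradients
    return x
```

Each iteration evaluates only one of the n gradients, which is why the iteration cost is independent of n; the uniform choice of i could be replaced by a non-uniform sampling rule of the kind mentioned at the end of the abstract.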
Pages: 83-112
Page count: 29