Large-Scale Machine Learning with Stochastic Gradient Descent

Cited by: 3752
Authors
Bottou, Leon [1]
Affiliations
[1] NEC Labs Amer, Princeton, NJ 08542 USA
Source
COMPSTAT'2010: 19TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL STATISTICS | 2010
Keywords
stochastic gradient descent; online learning; efficiency
DOI
10.1007/978-3-7908-2604-3_16
CLC Number
TP301 [Theory and Methods]
Subject Classification Code
081202
Abstract
During the last decade, data sizes have grown faster than the speed of processors. In this context, the capabilities of statistical machine learning methods are limited by the computing time rather than the sample size. A more precise analysis uncovers qualitatively different tradeoffs for the case of small-scale and large-scale learning problems. The large-scale case involves the computational complexity of the underlying optimization algorithm in non-trivial ways. Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems. In particular, second order stochastic gradient and averaged stochastic gradient are asymptotically efficient after a single pass on the training set.
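The "averaged stochastic gradient" mentioned in the abstract refers to Polyak-Ruppert averaging of the SGD iterates. The following is a minimal sketch, not taken from the paper, of plain SGD versus averaged SGD on a synthetic least-squares problem; the synthetic data, the decreasing step-size schedule eta = 1/(1 + 0.01 t), and the variable names (w, w_avg, w_true) are illustrative assumptions rather than the paper's setup:

    # Minimal sketch (assumptions noted above): plain SGD vs. Polyak-Ruppert
    # averaged SGD on a synthetic least-squares problem, one pass over the data.
    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic regression data: y = X @ w_true + noise (illustrative only)
    n, d = 10_000, 5
    w_true = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.1 * rng.normal(size=n)

    w = np.zeros(d)        # plain SGD iterate
    w_avg = np.zeros(d)    # running average of the iterates (Polyak-Ruppert)

    for t, (x_t, y_t) in enumerate(zip(X, y), start=1):
        eta = 1.0 / (1.0 + 0.01 * t)      # assumed decreasing step size
        grad = (x_t @ w - y_t) * x_t      # gradient of 0.5 * (x.w - y)^2 on one example
        w -= eta * grad
        w_avg += (w - w_avg) / t          # incremental average of all iterates so far

    print("plain SGD error:   ", np.linalg.norm(w - w_true))
    print("averaged SGD error:", np.linalg.norm(w_avg - w_true))

On such problems the averaged iterate w_avg is typically a less noisy estimate than the final plain SGD iterate, which is the behavior the abstract's single-pass efficiency claim concerns.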
Pages: 177-186
Page count: 10