Gradient-based optimization of hyperparameters

Cited by: 326
Author: Bengio, Y [1]
Affiliation: [1] Univ Montreal, Dept Informat & Rech Operat, Montreal, PQ H3C 3J7, Canada
DOI: 10.1162/089976600300015187
CLC number: TP18 (theory of artificial intelligence)
Discipline codes: 081104; 0812; 0835; 1405
Abstract
Many machine learning algorithms can be formulated as the minimization of a training criterion that involves a hyperparameter. This hyperparameter is usually chosen by trial and error with a model selection criterion. In this article we present a methodology to optimize several hyperparameters, based on the computation of the gradient of a model selection criterion with respect to the hyperparameters. In the case of a quadratic training criterion, the gradient of the selection criterion with respect to the hyperparameters is efficiently computed by backpropagating through a Cholesky decomposition. In the more general case, we show that the implicit function theorem can be used to derive a formula for the hyperparameter gradient involving second derivatives of the training criterion.
Pages: 1889-1900 (12 pp.)
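To make the quadratic case described in the abstract concrete, the following is a minimal illustrative sketch, not taken from the paper: ridge regression is assumed as the training criterion, validation squared error as the model selection criterion, and the hyperparameter gradient dE/dlambda is obtained from the implicit function theorem, dE/dlambda = -(dE/dw)^T H^{-1} d2C/(dw dlambda), with the Hessian solve done through a Cholesky factorization in the spirit of the computation the abstract mentions. The data, variable names, and the use of NumPy/SciPy are assumptions made for illustration only.

import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(50, 5)), rng.normal(size=50)   # synthetic training set
X_va, y_va = rng.normal(size=(30, 5)), rng.normal(size=30)   # synthetic validation set

def fit_and_hypergradient(lam):
    # Training criterion C(w, lam) = ||X_tr w - y_tr||^2 + lam * ||w||^2.
    # Its Hessian in w is 2 * (X_tr^T X_tr + lam I); factor A = X_tr^T X_tr + lam I once.
    A = X_tr.T @ X_tr + lam * np.eye(X_tr.shape[1])
    chol = cho_factor(A)
    w = cho_solve(chol, X_tr.T @ y_tr)            # minimizer w*(lam) of C
    # Selection criterion E(w) = ||X_va w - y_va||^2 and its gradient in w.
    dE_dw = 2.0 * X_va.T @ (X_va @ w - y_va)
    # Implicit function theorem: dw*/dlam = -H^{-1} d2C/(dw dlam) = -A^{-1} w
    # (the factors of 2 cancel); the same Cholesky factor is reused for the solve.
    dw_dlam = -cho_solve(chol, w)
    return w, float(dE_dw @ dw_dlam)

lam = 0.3
w_star, dE_dlam = fit_and_hypergradient(lam)

# Finite-difference check of the hyperparameter gradient.
eps = 1e-5
E = lambda w: float(np.sum((X_va @ w - y_va) ** 2))
fd = (E(fit_and_hypergradient(lam + eps)[0]) - E(fit_and_hypergradient(lam - eps)[0])) / (2 * eps)
print(dE_dlam, fd)  # the two values should agree to several decimal places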