Gradient-based optimization of hyperparameters

Cited by: 326
Author: Bengio, Y [1]
Affiliation: [1] Univ Montreal, Dept Informat & Rech Operat, Montreal, PQ H3C 3J7, Canada
DOI: 10.1162/089976600300015187
CLC number: TP18 (theory of artificial intelligence)
Discipline codes: 081104; 0812; 0835; 1405
Abstract
Many machine learning algorithms can be formulated as the minimization of a training criterion that involves a hyperparameter. This hyperparameter is usually chosen by trial and error with a model selection criterion. In this article we present a methodology to optimize several hyperparameters, based on the computation of the gradient of a model selection criterion with respect to the hyperparameters. In the case of a quadratic training criterion, the gradient of the selection criterion with respect to the hyperparameters is efficiently computed by backpropagating through a Cholesky decomposition. In the more general case, we show that the implicit function theorem can be used to derive a formula for the hyperparameter gradient involving second derivatives of the training criterion.
Pages: 1889-1900 (12 pp.)
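To make the quadratic case described in the abstract concrete, the following is a minimal illustrative sketch, not taken from the paper: ridge regression is assumed as the training criterion, validation squared error as the model selection criterion, and the hyperparameter gradient dE/dlambda is obtained from the implicit function theorem, dE/dlambda = -(dE/dw)^T H^{-1} d2C/(dw dlambda), with the Hessian solve done through a Cholesky factorization in the spirit of the computation the abstract mentions. The data, variable names, and the use of NumPy/SciPy are assumptions made for illustration only.

import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(50, 5)), rng.normal(size=50)   # synthetic training set
X_va, y_va = rng.normal(size=(30, 5)), rng.normal(size=30)   # synthetic validation set

def fit_and_hypergradient(lam):
    # Training criterion C(w, lam) = ||X_tr w - y_tr||^2 + lam * ||w||^2.
    # Its Hessian in w is 2 * (X_tr^T X_tr + lam I); factor A = X_tr^T X_tr + lam I once.
    A = X_tr.T @ X_tr + lam * np.eye(X_tr.shape[1])
    chol = cho_factor(A)
    w = cho_solve(chol, X_tr.T @ y_tr)            # minimizer w*(lam) of C
    # Selection criterion E(w) = ||X_va w - y_va||^2 and its gradient in w.
    dE_dw = 2.0 * X_va.T @ (X_va @ w - y_va)
    # Implicit function theorem: dw*/dlam = -H^{-1} d2C/(dw dlam) = -A^{-1} w
    # (the factors of 2 cancel); the same Cholesky factor is reused for the solve.
    dw_dlam = -cho_solve(chol, w)
    return w, float(dE_dw @ dw_dlam)

lam = 0.3
w_star, dE_dlam = fit_and_hypergradient(lam)

# Finite-difference check of the hyperparameter gradient.
eps = 1e-5
E = lambda w: float(np.sum((X_va @ w - y_va) ** 2))
fd = (E(fit_and_hypergradient(lam + eps)[0]) - E(fit_and_hypergradient(lam - eps)[0])) / (2 * eps)
print(dE_dlam, fd)  # the two values should agree to several decimal places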