Justifying and Generalizing Contrastive Divergence

Cited by: 172
Authors
Bengio, Yoshua [1]
Delalleau, Olivier [1]
Affiliations
[1] Univ Montreal, Dept Comp Sci & Operat Res, Montreal, PQ, Canada
Keywords
DOI
10.1162/neco.2008.11-07-647
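Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
140502 [Artificial intelligence];
Abstract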
We study an expansion of the log likelihood in undirected graphical models such as the restricted Boltzmann machine (RBM), where each term in the expansion is associated with a sample in a Gibbs chain alternating between two random variables (the visible vector and the hidden vector in RBMs). We are particularly interested in estimators of the gradient of the log likelihood obtained through this expansion. We show that its residual term converges to zero, justifying the use of a truncation (running only a short Gibbs chain), which is the main idea behind the contrastive divergence (CD) estimator of the log-likelihood gradient. By truncating even more, we obtain a stochastic reconstruction error, related through a mean-field approximation to the reconstruction error often used to train autoassociators and stacked autoassociators. The derivation is not specific to the particular parametric forms used in RBMs and requires only convergence of the Gibbs chain. We present theoretical and empirical evidence linking the number of Gibbs steps k and the magnitude of the RBM parameters to the bias in the CD estimator. These experiments also suggest that the sign of the CD estimator is correct most of the time, even when the bias is large, so that CD-k is a good descent direction even for small k.
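As an illustration of the truncated Gibbs chain the abstract refers to, the following is a minimal sketch of a CD-k estimate of the weight gradient for a binary RBM. It is not taken from the paper: the energy convention E(v, h) = -b'v - c'h - h'Wv, the function names, and the use of hidden-unit probabilities in place of samples in the two gradient terms are assumptions made here for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_hidden(v, W, c, rng):
    """Return (sample, probabilities) for h given v; W[i][j] links hidden i to visible j."""
    probs = [sigmoid(c[i] + sum(W[i][j] * v[j] for j in range(len(v))))
             for i in range(len(c))]
    return [1 if rng.random() < p else 0 for p in probs], probs


def sample_visible(h, W, b, rng):
    """Return (sample, probabilities) for v given h."""
    probs = [sigmoid(b[j] + sum(W[i][j] * h[i] for i in range(len(h))))
             for j in range(len(b))]
    return [1 if rng.random() < p else 0 for p in probs], probs


def cd_k_weight_gradient(v0, W, b, c, k, rng):
    """CD-k estimate of the log-likelihood gradient with respect to W.

    Positive term: P(h | v0) v0'. Negative term: P(h | vk) vk', where vk is
    the visible sample after k Gibbs steps started at the data vector v0.
    """
    h, ph0 = sample_hidden(v0, W, c, rng)
    vk, phk = v0, ph0
    for _ in range(k):  # k alternating Gibbs steps: h -> v -> h
        vk, _ = sample_visible(h, W, b, rng)
        h, phk = sample_hidden(vk, W, c, rng)
    return [[ph0[i] * v0[j] - phk[i] * vk[j] for j in range(len(v0))]
            for i in range(len(c))]


if __name__ == "__main__":
    rng = random.Random(0)
    W = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]  # 2 hidden units, 3 visible units
    grad = cd_k_weight_gradient([1, 0, 1], W, [0.0] * 3, [0.0] * 2, k=1, rng=rng)
    for row in grad:
        print(row)
```

With k = 0 the chain never leaves the data vector, so the two terms cancel and the estimate is exactly zero; larger k moves the negative term closer to a sample from the model distribution, which is the sense in which the abstract ties the bias of CD-k to the number of Gibbs steps.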
Pages: 1601-1621
Page count: 21