A neural probabilistic language model

Cited by: 2178
Authors
Bengio, Y [1 ]
Ducharme, R [1 ]
Vincent, P [1 ]
Jauvin, C [1 ]
Affiliations
[1] Univ Montreal, Ctr Rech Math, Dept Informat & Rech Operat, Montreal, PQ H3C 3J7, Canada
Keywords
statistical language modeling; artificial neural networks; distributed representation; curse of dimensionality;
DOI
10.1162/153244303322533223
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that it allows the model to take advantage of longer contexts.
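The architecture the abstract describes — a shared embedding matrix mapping each word to a feature vector, followed by a neural network that maps the concatenated context embeddings to a probability distribution over the next word — can be sketched in a few lines. The following is a minimal forward-pass illustration, not the paper's full model: all dimensions are toy values chosen for illustration, training is omitted, and the paper's optional direct input-to-output connections are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical values for illustration only)
V = 50      # vocabulary size
m = 8       # embedding dimension (word feature vector size)
n = 4       # n-gram order: predict a word from the n-1 = 3 previous words
h = 16      # hidden units

# Parameters: a shared embedding matrix C, plus a one-hidden-layer
# network from the concatenated context embeddings to the output.
C = rng.normal(scale=0.1, size=(V, m))          # word feature vectors
H = rng.normal(scale=0.1, size=(h, (n - 1) * m))  # input-to-hidden weights
d = np.zeros(h)                                  # hidden bias
U = rng.normal(scale=0.1, size=(V, h))           # hidden-to-output weights
b = np.zeros(V)                                  # output bias

def next_word_probs(context):
    """P(w_t | context) for a list of n-1 word indices."""
    x = C[context].reshape(-1)      # look up and concatenate embeddings
    a = np.tanh(H @ x + d)          # hidden layer
    y = U @ a + b                   # unnormalized log-probabilities
    e = np.exp(y - y.max())         # numerically stable softmax
    return e / e.sum()

p = next_word_probs([3, 17, 42])
assert p.shape == (V,) and abs(p.sum() - 1.0) < 1e-9
```

Because the embedding matrix C is shared across all context positions, words with similar feature vectors yield similar predicted distributions, which is the source of the generalization to unseen word sequences that the abstract emphasizes. In the paper, all parameters (including C) are trained jointly by maximizing the log-likelihood of the training corpus.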
Pages: 1137-1155
Page count: 19