A neural probabilistic language model

Cited by: 2178
Authors
Bengio, Y [1 ]
Ducharme, R [1 ]
Vincent, P [1 ]
Jauvin, C [1 ]
Affiliations
[1] Univ Montreal, Ctr Rech Math, Dept Informat & Rech Operat, Montreal, PQ H3C 3J7, Canada
Keywords
statistical language modeling; artificial neural networks; distributed representation; curse of dimensionality;
DOI
10.1162/153244303322533223
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that it allows the model to take advantage of longer contexts.
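The architecture the abstract describes — a shared embedding matrix mapping each word to a feature vector, followed by a neural network that maps the concatenated context embeddings to a probability distribution over the next word — can be sketched in a few lines. The following is a minimal forward-pass illustration, not the paper's full model: all dimensions are toy values chosen for illustration, training is omitted, and the paper's optional direct input-to-output connections are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical values for illustration only)
V = 50      # vocabulary size
m = 8       # embedding dimension (word feature vector size)
n = 4       # n-gram order: predict a word from the n-1 = 3 previous words
h = 16      # hidden units

# Parameters: a shared embedding matrix C, plus a one-hidden-layer
# network from the concatenated context embeddings to the output.
C = rng.normal(scale=0.1, size=(V, m))          # word feature vectors
H = rng.normal(scale=0.1, size=(h, (n - 1) * m))  # input-to-hidden weights
d = np.zeros(h)                                  # hidden bias
U = rng.normal(scale=0.1, size=(V, h))           # hidden-to-output weights
b = np.zeros(V)                                  # output bias

def next_word_probs(context):
    """P(w_t | context) for a list of n-1 word indices."""
    x = C[context].reshape(-1)      # look up and concatenate embeddings
    a = np.tanh(H @ x + d)          # hidden layer
    y = U @ a + b                   # unnormalized log-probabilities
    e = np.exp(y - y.max())         # numerically stable softmax
    return e / e.sum()

p = next_word_probs([3, 17, 42])
assert p.shape == (V,) and abs(p.sum() - 1.0) < 1e-9
```

Because the embedding matrix C is shared across all context positions, words with similar feature vectors yield similar predicted distributions, which is the source of the generalization to unseen word sequences that the abstract emphasizes. In the paper, all parameters (including C) are trained jointly by maximizing the log-likelihood of the training corpus.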
Pages: 1137-1155
Page count: 19