Recurrent neural networks with small weights implement definite memory machines

Cited by: 41
Authors
Hammer, B [1 ]
Tino, P
Affiliations
[1] Univ Osnabruck, Dept Math Comp Sci, D-49069 Osnabruck, Germany
[2] Univ Birmingham, Sch Comp Sci, Birmingham B15 2TT, W Midlands, England
DOI
10.1162/08997660360675080
CLC classification
TP18 [Theory of artificial intelligence];
Discipline codes
081104; 0812; 0835; 1405
Abstract
Recent experimental studies indicate that recurrent neural networks initialized with "small" weights are inherently biased toward definite memory machines (Tino, Cernansky, & Benuskova, 2002a, 2002b). This article establishes a theoretical counterpart: the transition function of a recurrent network with small weights and a squashing activation function is a contraction. We prove that recurrent networks with a contractive transition function can be approximated arbitrarily well on input sequences of unbounded length by a definite memory machine. Conversely, every definite memory machine can be simulated by a recurrent network with a contractive transition function. Hence, initialization with small weights induces an architectural bias into learning with recurrent neural networks. This bias might have benefits from the point of view of statistical learning theory: it emphasizes one possible region of the weight space where generalization ability can be formally proved. It is well known that standard recurrent neural networks are not distribution-independent learnable in the probably approximately correct (PAC) sense if arbitrary precision and inputs are considered. We prove that recurrent networks with a contractive transition function with a fixed contraction parameter fulfill the so-called distribution-independent uniform convergence of empirical distances property and hence, unlike general recurrent networks, are distribution-independent PAC learnable.
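
To make the contraction argument concrete, the following is a minimal numerical sketch, not code from the paper; the names W_rec, W_in, step, and run are illustrative assumptions. It shows that when the spectral norm of the recurrent weight matrix is below 1 and the activation is tanh (whose derivative is bounded by 1), the state-transition map is a contraction in the state, so two runs that differ only in the distant past end in nearly identical states, which is the definite-memory behaviour described above.

    # Minimal sketch (assumed setup, not the authors' notation): small recurrent
    # weights make the state update a contraction, so the state depends
    # essentially only on a bounded suffix of the input sequence.
    import numpy as np

    rng = np.random.default_rng(0)
    n_hidden, n_in = 10, 3

    # "Small" weights: if the spectral norm of W_rec is below 1 and |tanh'| <= 1,
    # then x -> tanh(W_rec @ x + W_in @ u) is a contraction in the state x.
    W_rec = 0.1 * rng.standard_normal((n_hidden, n_hidden))
    W_in = rng.standard_normal((n_hidden, n_in))
    print("contraction parameter (spectral norm):", np.linalg.norm(W_rec, 2))

    def step(x, u):
        # one recurrent state update
        return np.tanh(W_rec @ x + W_in @ u)

    def run(prefix, suffix):
        # process a prefix followed by a suffix, starting from the zero state
        x = np.zeros(n_hidden)
        for u in np.concatenate([prefix, suffix]):
            x = step(x, u)
        return x

    # Two long, completely different prefixes followed by the same short suffix
    # end in nearly identical states: only the recent history matters.
    suffix = rng.standard_normal((10, n_in))
    x1 = run(rng.standard_normal((50, n_in)), suffix)
    x2 = run(rng.standard_normal((50, n_in)), suffix)
    print("state difference after shared suffix:", np.linalg.norm(x1 - x2))

Because the map is a contraction, the printed state difference is bounded by the contraction parameter raised to the length of the shared suffix, i.e., the influence of the distant past decays geometrically; this is what allows the approximation by a definite memory machine.
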
Pages: 1897-1929
Number of pages: 33
References (54 total)
  • [1] Anthony, M. (1999). Neural Network Learning: Theoretical Foundations (1st ed.).
  • [2] Baldi, P. (2001). In Sequence Learning: Paradigms, Algorithms, and Applications (Lecture Notes in Artificial Intelligence, Vol. 1828), p. 80.
  • [3] Bartlett, P. L. (1994). In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory (COLT '94), p. 299. DOI: 10.1145/180139.181158.
  • [4] Bartlett, P. L. (1997). Advances in Neural Information Processing Systems, 9, 134.
  • [5] Baum, E. B., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1(1), 151-160.
  • [6] Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157-166.
  • [7] Bengio, Y., & Frasconi, P. (1996). Input-output HMM's for sequence processing. IEEE Transactions on Neural Networks, 7(5), 1231-1249.
  • [8] Bühlmann, P. (1999). Annals of Statistics, 27, 480.
  • [9] Carrasco, R. C., & Forcada, M. L. (2001). Simple strategies to encode tree automata in sigmoid recursive neural networks. IEEE Transactions on Knowledge and Data Engineering, 13(2), 148-156.
  • [10] Christiansen, M. H., & Chater, N. (1999). Toward a connectionist model of recursion in human linguistic performance. Cognitive Science, 23(2), 157-205.