LSTM: A Search Space Odyssey

Cited by: 5708
Authors
Greff, Klaus [1, 2]
Srivastava, Rupesh K. [1, 2]
Koutnik, Jan [1, 2]
Steunebrink, Bas R. [1, 2]
Schmidhuber, Juergen [1, 2]
Affiliations
[1] Scuola Univ Profess Svizzera Italiana, Ist Dalle Molle Studi Intelligenza Artificiale, CH-6928 Manno, Switzerland
[2] Univ Svizzera Italiana, CH-6904 Lugano, Switzerland
Funding
Swiss National Science Foundation;
Keywords
Functional ANalysis Of VAriance (fANOVA); long short-term memory (LSTM); random search; recurrent neural networks; sequence learning;
DOI
10.1109/TNNLS.2016.2582924
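Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
140502 [Artificial intelligence];
Abstract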
Several variants of the long short-term memory (LSTM) architecture for recurrent neural networks have been proposed since its inception in 1995. In recent years, these networks have become the state-of-the-art models for a variety of machine learning problems. This has led to a renewed interest in understanding the role and utility of various computational components of typical LSTM variants. In this paper, we present the first large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling. The hyperparameters of all LSTM variants for each task were optimized separately using random search, and their importance was assessed using the powerful functional ANalysis Of VAriance (fANOVA) framework. In total, we summarize the results of 5400 experimental runs (approximately 15 years of CPU time), which makes our study the largest of its kind on LSTM networks. Our results show that none of the variants can improve upon the standard LSTM architecture significantly, and demonstrate the forget gate and the output activation function to be its most critical components. We further observe that the studied hyperparameters are virtually independent and derive guidelines for their efficient adjustment.
Pages: 2222-2232
Page count: 11