Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition

Cited by: 281
Authors
Qian, Yanmin [1 ]
Bi, Mengxiao [1 ]
Tan, Tian [1 ]
Yu, Kai [1 ]
Affiliations
[1] Shanghai Jiao Tong University, Department of Computer Science & Engineering, Shanghai 200240, People's Republic of China
Keywords
Convolutional neural networks; very deep CNNs; robust speech recognition; acoustic modeling
DOI
10.1109/TASLP.2016.2602884
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Although great progress has been made in automatic speech recognition, significant performance degradation still exists in noisy environments. Recently, very deep convolutional neural networks (CNNs) have been successfully applied to computer vision and speech recognition tasks. Based on our previous work on very deep CNNs, in this paper this architecture is further developed to improve recognition accuracy for noise robust speech recognition. In the proposed very deep CNN architecture, we study the best configuration for the sizes of filters, pooling, and input feature maps: the sizes of filters and poolings are reduced and the dimensions of the input features are extended to allow for adding more convolutional layers. Then the appropriate pooling, padding, and input feature map selection strategies are investigated and applied to the very deep CNN to make it more robust for speech recognition. In addition, an in-depth analysis of the architecture reveals key characteristics, such as compact model scale, fast convergence speed, and noise robustness. The proposed new model is evaluated on two tasks: the Aurora4 task with multiple additive noise types and channel mismatch, and the AMI meeting transcription task with significant reverberation. Experiments on both tasks show that the proposed very deep CNNs can significantly reduce the word error rate (WER) for noise robust speech recognition. The best architecture obtains a 10.0% relative WER reduction over the traditional CNN on AMI, competitive with the long short-term memory recurrent neural network (LSTM-RNN) acoustic model. On Aurora4, even without feature enhancement, model adaptation, and sequence training, it achieves a WER of 8.81%, a 17.0% relative improvement over the LSTM-RNN. To our knowledge, this is the best published result on Aurora4.
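To make the architectural idea concrete, below is a minimal PyTorch sketch of a very-deep-CNN acoustic model in the spirit the abstract describes: small 3x3 filters, small 2x2 pooling, and extra stacked convolutional layers over an extended time-frequency input. All layer counts, channel widths, and input dimensions (64 mel bands x 16 frames) are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (NOT the authors' exact model) of a very deep CNN
# acoustic model: stacks of small 3x3 convolutions with padding,
# small 2x2 pooling, applied to a log-Mel filterbank feature map.
import torch
import torch.nn as nn

class VeryDeepCNN(nn.Module):
    def __init__(self, n_senones: int = 4000):  # output size is an assumption
        super().__init__()

        def block(c_in: int, c_out: int, n_convs: int) -> nn.Sequential:
            # n_convs stacked 3x3 convolutions, then one 2x2 max-pooling.
            layers = []
            for i in range(n_convs):
                layers += [
                    nn.Conv2d(c_in if i == 0 else c_out, c_out,
                              kernel_size=3, padding=1),  # small filters
                    nn.ReLU(inplace=True),
                ]
            layers.append(nn.MaxPool2d(kernel_size=2))     # small pooling
            return nn.Sequential(*layers)

        # Four blocks, 10 convolutional layers in total (illustrative).
        self.features = nn.Sequential(
            block(1, 64, 2),
            block(64, 128, 2),
            block(128, 256, 3),
            block(256, 512, 3),
        )
        # A 64x16 input patch shrinks to 4x1 after four 2x2 poolings.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 4 * 1, 2048),
            nn.ReLU(inplace=True),
            nn.Linear(2048, n_senones),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, mel_bands=64, frames=16)
        return self.classifier(self.features(x))

if __name__ == "__main__":
    model = VeryDeepCNN()
    dummy = torch.randn(8, 1, 64, 16)   # a batch of feature-map patches
    print(model(dummy).shape)           # torch.Size([8, 4000])
```

The sketch reflects the design trade-off the abstract names: shrinking filter and pooling sizes and widening the input context is what leaves enough spatial resolution to stack many more convolutional layers than a traditional one- or two-layer CNN acoustic model.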
Pages: 2263-2276
Page count: 14