Improving Robustness of Deep Neural Network Acoustic Models via Speech Separation and Joint Adaptive Training

Cited by: 53
Authors
Narayanan, Arun [1 ]
Wang, DeLiang [1 ,2 ]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[2] Ohio State Univ, Ctr Cognit & Brain Sci, Columbus, OH 43210 USA
Keywords
CHiME-2; joint training; ratio masking; robust ASR; time-frequency masking; BINARY; RECOGNITION; ADAPTATION
DOI
10.1109/TASLP.2014.2372314
Chinese Library Classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Although deep neural network (DNN) acoustic models are known to be inherently noise robust, especially with matched training and testing data, the use of speech separation as a frontend and for deriving alternative feature representations has been shown to improve performance in challenging environments. We first present a supervised speech separation system that significantly improves automatic speech recognition (ASR) performance in realistic noise conditions. The system performs separation via ratio time-frequency masking; the ideal ratio mask (IRM) is estimated using DNNs. We then propose a framework that unifies separation and acoustic modeling via joint adaptive training. Since the modules for acoustic modeling and speech separation are implemented using DNNs, unification is done by introducing additional hidden layers with fixed weights and appropriate network architecture. On the CHiME-2 medium-large vocabulary ASR task, and with log mel spectral features as input to the acoustic model, an independently trained ratio masking frontend improves word error rates by 10.9% (relative) compared to the noisy baseline. In comparison, the jointly trained system improves performance by 14.4%. We also experiment with alternative feature representations to augment the standard log mel features, like the noise and speech estimates obtained from the separation module, and the standard feature set used for IRM estimation. Our best system obtains a word error rate of 15.4% (absolute), an improvement of 4.6 percentage points over the next best result on this corpus.
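The ratio time-frequency masking described in the abstract can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the exponent `beta = 0.5` and the epsilon smoothing term are common choices assumed here, and the oracle mask computed below (from known speech and noise power) is what serves as the DNN's training target — at test time the paper estimates this mask with a DNN rather than computing it directly.

```python
import numpy as np

def ideal_ratio_mask(speech_pow, noise_pow, beta=0.5):
    """Ideal ratio mask: (speech energy / mixture energy) ** beta per T-F unit.

    speech_pow, noise_pow: non-negative power spectrograms, shape (frames, bins).
    Returns a mask in [0, 1]; multiplying it with the noisy magnitude
    spectrogram attenuates noise-dominated time-frequency units.
    """
    return (speech_pow / (speech_pow + noise_pow + 1e-12)) ** beta

# Toy data: random "power spectrograms" standing in for real STFT outputs.
rng = np.random.default_rng(0)
speech = rng.random((100, 64))   # 100 frames x 64 frequency bins
noise = rng.random((100, 64))

irm = ideal_ratio_mask(speech, noise)

# Assuming speech and noise are uncorrelated, mixture power is their sum;
# applying the mask yields the "separated" magnitude spectrogram.
mixture_mag = np.sqrt(speech + noise)
enhanced_mag = irm * mixture_mag
```

Because the mask is a bounded ratio rather than a hard binary decision, it is a smoother regression target for a DNN than the ideal binary mask, which is one motivation the ratio-masking literature gives for this formulation.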
Pages: 92-101 (10 pages)