Enhancement of log Mel power spectra of speech using a phase-sensitive model of the acoustic environment and sequential estimation of the corrupting noise

Cited by: 79
Authors
Deng, L [1]
Droppo, J [1]
Acero, A [1]
Affiliations
[1] Microsoft Res, Redmond, WA 98052 USA
Source
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING | 2004 / Vol. 12 / No. 02
Keywords
noise estimate; noise-robust ASR; phase-sensitive acoustic environment model; sequential algorithm; speech feature enhancement;
DOI
10.1109/TSA.2003.820201
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
This paper presents a novel speech feature enhancement technique based on a probabilistic, nonlinear acoustic environment model that effectively incorporates the phase relationship (hence phase sensitive) between the clean speech and the corrupting noise in the acoustic distortion process. The core of the enhancement algorithm is the MMSE (minimum mean square error) estimator for the log Mel power spectra of clean speech based on the phase-sensitive environment model, using a highly efficient single-point, second-order Taylor series expansion to approximate the joint probability of clean and noisy speech modeled as a multivariate Gaussian. Since a noise estimate is required by the MMSE estimator, a high-quality, sequential noise estimation algorithm is also developed and presented. Both the noise estimation and speech feature enhancement algorithms are evaluated on the Aurora2 task of connected digit recognition. Noise-robust speech recognition results demonstrate that the new acoustic environment model, which takes into account the relative phase in speech and noise mixing, is superior to the earlier environment model, which discards the phase, under otherwise identical experimental conditions. The results also show that the sequential MAP (maximum a posteriori) learning for noise estimation is better than the sequential ML (maximum likelihood) learning, both evaluated under the identical phase-sensitive MMSE enhancement condition.
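As a rough illustration of what "phase sensitive" means in the abstract, the sketch below assumes the commonly used power-domain mixing relation, in which the noisy power equals the clean power plus the noise power plus a cross term scaled by a phase factor. The function name `mix_log_power` and the symbol `alpha` are hypothetical choices for this example and are not taken from this record; setting `alpha` to zero corresponds to a model that discards the phase.

```python
import math


def mix_log_power(x, n, alpha=0.0):
    """Return the noisy log power for clean log power x and noise log power n.

    Assumed relation (illustrative, not quoted from the record):
        exp(y) = exp(x) + exp(n) + 2 * alpha * exp((x + n) / 2)
    where alpha in [-1, 1] stands in for the cosine of the relative phase
    between speech and noise. alpha = 0 gives a phase-insensitive model.
    """
    if not -1.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [-1, 1]")
    power = math.exp(x) + math.exp(n) + 2.0 * alpha * math.exp(0.5 * (x + n))
    # Floor the power so that full cancellation (alpha = -1, x == n)
    # does not produce log(0).
    return math.log(max(power, 1e-12))


if __name__ == "__main__":
    # Equal speech and noise power, three different phase factors.
    for a in (-0.5, 0.0, 0.5):
        print(a, mix_log_power(0.0, 0.0, a))
```

With equal speech and noise power, the noisy log power rises as `alpha` moves from negative to positive values, which is the effect a phase-insensitive model cannot represent.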
Pages: 133-143
Page count: 11