Estimating cepstrum of speech under the presence of noise using a joint prior of static and dynamic features

Cited by: 38
Authors
Deng, L [1]
Droppo, J [1]
Acero, A [1]
Affiliation
[1] Microsoft Corp, Redmond, WA 98052 USA
Source
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING | 2004, Vol. 12, No. 3
Keywords
acoustic distortion model; Bayesian estimation; conditional MMSE; dynamic prior; noise reduction; weighted summation;
DOI
10.1109/TSA.2003.822627
Chinese Library Classification (CLC) number
O42 [Acoustics];
Subject classification codes
070206; 082403;
Abstract
In this paper, we present a new algorithm for statistical speech feature enhancement in the cepstral domain. The algorithm exploits joint prior distributions (in the form of a Gaussian mixture) in the clean speech model, which incorporate both the static and frame-differential dynamic cepstral parameters. Full posterior probabilities for clean speech given the noisy observation are computed using a linearized version of a nonlinear acoustic distortion model, and, based on this linear approximation, the conditional minimum mean square error (MMSE) estimator for the clean speech feature is derived rigorously using the full posterior. The final form of the derived conditional MMSE estimator is shown to be a weighted sum of three separate terms, and the sum is weighted again by the posterior for each of the mixture components in the speech model. The first of the three terms is shown to arise naturally from the predictive mechanism embedded in the acoustic distortion model in the absence of any prior information. The remaining two terms result from the speech model using only the static prior and only the dynamic prior, respectively. Comprehensive experiments are carried out using the Aurora2 database to evaluate the new algorithm. The results demonstrate significant improvement in noise-robust recognition accuracy by incorporating the joint prior for both static and dynamic parameter distributions in the speech model, compared with using only the static or dynamic prior and with using no prior.
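As a reading aid, the estimator structure described in the abstract can be sketched in illustrative notation; the symbols below are not the paper's own, and the exact weight expressions (derived from the Gaussian-mixture prior and the linearized distortion model) are omitted:

\hat{x}_t = \sum_{m=1}^{M} p(m \mid y_t)\,\Big[ W^{(m)}_{1}\,\hat{x}^{\text{pred}}_t \;+\; W^{(m)}_{2}\,\hat{x}^{\text{static}}_t \;+\; W^{(m)}_{3}\,\hat{x}^{\text{dyn}}_t \Big],

where \hat{x}^{\text{pred}}_t is the prediction obtained from the linearized acoustic distortion model alone (no prior), \hat{x}^{\text{static}}_t and \hat{x}^{\text{dyn}}_t are the contributions associated with the static and dynamic priors of mixture component m, p(m \mid y_t) is the component posterior given the noisy observation y_t, and the W^{(m)}_{i} are the resulting weighting terms.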
Pages: 218-233
Page count: 16