Robust automatic speech recognition with missing and unreliable acoustic data

Cited by: 397
Authors
Cooke, M [1]
Green, P [1]
Josifovski, L [1]
Vizinho, A [1]
Affiliation
[1] Univ Sheffield, Dept Comp Sci, Speech & Hearing Res Grp, Sheffield S1 4DP, S Yorkshire, England
Funding
Engineering and Physical Sciences Research Council (UK)
Keywords
robust ASR; missing data; data imputation; HMM; spectral subtraction
DOI
10.1016/S0167-6393(00)00034-0
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Human speech perception is robust in the face of a wide variety of distortions, both experimentally applied and naturally occurring. In these conditions, state-of-the-art automatic speech recognition (ASR) technology fails. This paper describes an approach to robust ASR which acknowledges the fact that some spectro-temporal regions will be dominated by noise. For the purposes of recognition, these regions are treated as missing or unreliable. The primary advantage of this viewpoint is that it makes minimal assumptions about any noise background. Instead, reliable regions are identified, and subsequent decoding is based on this evidence. We introduce two approaches for dealing with unreliable evidence. The first, marginalisation, computes output probabilities on the basis of the reliable evidence only. The second, state-based data imputation, estimates values for the unreliable regions by conditioning on the reliable parts and the recognition hypothesis. A further source of information is the bounds on the energy of any constituent acoustic source in an additive mixture. This additional knowledge can be incorporated into the missing data framework. These approaches are applied to continuous-density hidden Markov model (HMM)-based speech recognisers and evaluated on the TIDigits corpus for several noise conditions. Two criteria which use simple noise estimates are employed as a means of identifying reliable regions. The first treats regions which are negative after spectral subtraction as unreliable. The second uses the estimated noise spectrum to derive local signal-to-noise ratios, which are then thresholded to identify reliable data points. Both marginalisation and state-based data imputation produce a substantial performance advantage over spectral subtraction alone. The use of energy bounds leads to a further increase in performance for both approaches. While marginalisation outperforms data imputation, the latter allows the missing data approach to act as a preprocessor for conventional recognisers, or in speech-enhancement applications. © 2001 Elsevier Science B.V. All rights reserved.
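The two reliability criteria and the marginalisation idea summarised in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 0 dB default threshold, the use of a single diagonal-covariance Gaussian in place of a full HMM state mixture, and the choice of [0, observed value] as the integration interval for the energy bounds are all assumptions made for illustration.

```python
import math
import numpy as np


def reliability_masks(noisy, noise_est, snr_threshold_db=0.0):
    """Return two boolean masks (True = reliable) from a noise estimate.

    noisy, noise_est: non-negative spectral energies of equal shape.
    The threshold value is an illustrative assumption.
    """
    noisy = np.asarray(noisy, dtype=float)
    noise_est = np.asarray(noise_est, dtype=float)
    subtracted = noisy - noise_est  # spectral subtraction

    # Criterion 1: points that go negative after subtraction are unreliable.
    negative_energy_mask = subtracted >= 0.0

    # Criterion 2: threshold a local SNR derived from the noise estimate.
    eps = 1e-12
    clean_est = np.maximum(subtracted, eps)
    local_snr_db = 10.0 * np.log10(clean_est / np.maximum(noise_est, eps))
    snr_mask = local_snr_db > snr_threshold_db
    return negative_energy_mask, snr_mask


def _norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def marginal_log_likelihood(x, mask, mean, var, use_bounds=False):
    """Log-likelihood of one frame under one diagonal Gaussian.

    Reliable dimensions contribute the usual Gaussian log-density.
    Unreliable dimensions are ignored (plain marginalisation) or, if
    use_bounds is True, contribute the probability mass on [0, x_i]
    (assumed form of the energy bound for an additive mixture).
    """
    x, mean, var = (np.asarray(a, dtype=float) for a in (x, mean, var))
    r = np.asarray(mask, dtype=bool)

    ll = float(np.sum(-0.5 * (np.log(2.0 * np.pi * var[r])
                              + (x[r] - mean[r]) ** 2 / var[r])))
    if use_bounds:
        for xi, mu, v in zip(x[~r], mean[~r], var[~r]):
            sd = math.sqrt(v)
            mass = _norm_cdf((xi - mu) / sd) - _norm_cdf((0.0 - mu) / sd)
            ll += math.log(max(mass, 1e-300))  # guard against log(0)
    return ll
```

In a recogniser, a score of this kind would stand in for the usual state output probability during decoding, evaluated per mixture component and per frame with the mask for that frame.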
Pages: 267-285
Page count: 19