Audio-visual speech recognition using MPEG-4 compliant visual features

Cited by: 33
Authors
Aleksic, PS [1]
Williams, JJ [1]
Wu, ZL [1]
Katsaggelos, AK [1]
Affiliations
[1] Northwestern Univ, Dept Elect & Comp Engn, Evanston, IL 60208 USA
Keywords
audio-visual speech recognition; facial animation parameters; snake
DOI
10.1155/S1110865702206162
CLC classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline codes
0808; 0809
Abstract
We describe an audio-visual automatic continuous speech recognition system that significantly improves speech recognition performance over a wide range of acoustic noise levels, as well as under clean audio conditions. The system uses facial animation parameters (FAPs), supported by the MPEG-4 standard, for the visual representation of speech. We also describe a robust and automatic algorithm we developed to extract FAPs from visual data, which requires neither hand labeling nor extensive training procedures. Principal component analysis (PCA) was performed on the FAPs to reduce the dimensionality of the visual feature vectors, and the derived projection weights were used as visual features in the audio-visual automatic speech recognition (ASR) experiments. Both single-stream and multistream hidden Markov models (HMMs) were used to model the ASR system, integrate audio and visual information, and perform relatively large vocabulary (approximately 1000-word) speech recognition experiments. The experiments use clean audio data and audio data corrupted by stationary white Gaussian noise at various SNRs. The proposed system reduces the word error rate (WER) by 20% to 23% relative to audio-only speech recognition WERs at various SNRs (0-30 dB) with additive white Gaussian noise, and by 19% relative to the audio-only WER under clean audio conditions.
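The visual front end described in the abstract reduces the FAP vectors with PCA and uses the resulting projection weights as per-frame visual features. Below is a minimal sketch of that reduction step under assumed data shapes (synthetic FAP frames and a hypothetical choice of four retained components); it is not the authors' implementation.

```python
import numpy as np

# Hypothetical FAP observations: one row per video frame, one column per
# MPEG-4 facial animation parameter (values here are synthetic, for illustration).
rng = np.random.default_rng(0)
fap_frames = rng.normal(size=(500, 10))  # 500 frames, 10 FAPs

# Center the data and compute principal components via SVD.
mean_fap = fap_frames.mean(axis=0)
centered = fap_frames - mean_fap
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Keep the leading components and project each frame onto them; the
# projection weights serve as the per-frame visual feature vector.
n_components = 4  # assumed; the paper selects its own dimensionality
visual_features = centered @ components[:n_components].T
print(visual_features.shape)  # (500, 4): reduced visual feature vectors
```

In the paper's setup, such reduced visual feature vectors are then combined with acoustic features and modeled by single-stream or multistream HMMs for recognition.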
Pages: 1213-1227
Page count: 15