Recent advances in the automatic recognition of audiovisual speech

Cited by: 440
Authors
Potamianos, G [1]
Neti, C
Gravier, G
Garg, A
Senior, AW
Affiliations
[1] IBM Corp, Thomas J Watson Res Ctr, Human Language Technol Dept, Yorktown Hts, NY 10598 USA
[2] INRIA, IRISA, Ctr Natl Rech Sci, F-35042 Rennes, France
[3] IBM Corp, Almaden Res Ctr, San Jose, CA 95120 USA
Keywords
adaptation; audiovisual fusion; audiovisual speech recognition (ASR); face tracking; hidden Markov models (HMM); multimedia databases; multistream hidden Markov models; product hidden Markov models; speechreading; stream reliability; visual feature extraction
DOI
10.1109/JPROC.2003.817150
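Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract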
Visual speech information from the speaker's mouth region has been shown to improve the noise robustness of automatic speech recognizers, thus promising to extend their usability in the human-computer interface. In this paper, we review the main components of audiovisual automatic speech recognition (ASR) and present novel contributions in two main areas: first, the visual front-end design, based on a cascade of linear image transforms of an appropriate video region of interest, and subsequently, audiovisual speech integration. On the latter topic, we discuss new work on feature and decision fusion combination, the modeling of audiovisual speech asynchrony, and incorporating modality reliability estimates into the bimodal recognition process. We also briefly touch upon the issue of audiovisual adaptation. We apply our algorithms to three multisubject bimodal databases, ranging from small- to large-vocabulary recognition tasks, recorded in both visually controlled and challenging environments. Our experiments demonstrate that the visual modality improves ASR over all conditions and data considered, though less so for visually challenging environments and large-vocabulary tasks.
Pages: 1306-1326
Page count: 21