Real-time lip tracking and bimodal continuous speech recognition

Cited by: 26

Authors:

Chan, MT [1]

Zhang, Y [1]

Huang, TS [1]

Affiliation:

[1] Rockwell Int Sci Ctr, Thousand Oaks, CA 91360 USA

Source:

1998 IEEE SECOND WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING | 1998

DOI:

10.1109/MMSP.1998.738914

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

We investigate a bimodal approach to speech recognition that incorporates additional visual features derived from the speaker's lip movement. A reference contour model is used to track the speaker's lip outline. By using color, constraining the deformation to an affine subspace, and incorporating an outlier rejection mechanism, our system is robust and runs in real time. To address the model initialization issue, a fast lip localization algorithm is also incorporated. A sample of continuous bimodal speech data based on a confined vocabulary (useful for our application area) was synchronously captured for training and testing. Using the hidden Markov modeling framework, we trained our bimodal context-dependent sub-word-based recognizer in a few different ways. Our experiments show that the bimodal recognizer compares favorably to its acoustic-only counterpart. Our results also indicate that it is advantageous to include first derivatives of the visual features. Furthermore, the 2-stream modeling scheme appears to be preferable to the 1-stream case for bimodal speech.
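The abstract does not give formulas for the two findings it highlights, so the following is only a minimal sketch under stated assumptions. It assumes the first derivatives are computed with the regression-style delta formula common in HMM speech recognizers, and that the 2-stream scheme combines per-stream log-likelihoods with a fixed weight. The function names, the `window` parameter, and the `audio_weight` value are hypothetical and are not taken from the paper.

```python
def delta_features(frames, window=2):
    """Approximate first derivatives of per-frame feature vectors.

    Assumption: regression-style deltas over +/- `window` frames, with the
    first and last frames repeated at the sequence edges.
    """
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    deltas = []
    for t in range(n):
        vec = []
        for d in range(len(frames[t])):
            num = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, n - 1)][d]   # clamp at last frame
                behind = frames[max(t - k, 0)][d]      # clamp at first frame
                num += k * (ahead - behind)
            vec.append(num / denom)
        deltas.append(vec)
    return deltas


def two_stream_log_likelihood(log_p_audio, log_p_visual, audio_weight=0.7):
    """Combine audio and visual log-likelihoods for one HMM state.

    Assumption: a fixed linear weighting in the log domain; the weight
    value here is illustrative only.
    """
    return audio_weight * log_p_audio + (1.0 - audio_weight) * log_p_visual


# Hypothetical one-dimensional visual feature (e.g. a lip-opening measure).
visual = [[0.0], [1.0], [2.0], [3.0], [4.0]]
visual_deltas = delta_features(visual, window=2)
score = two_stream_log_likelihood(-10.0, -20.0, audio_weight=0.7)
```

In a 1-stream setup, by contrast, the audio and visual vectors would simply be concatenated and modeled by a single output distribution, which leaves no separate weight for each modality.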

Pages: 65-70

Page count: 6

