Speech-to-video synthesis using MPEG-4 compliant visual features

Cited by: 13

Authors:

Aleksic, PS [1]

Katsaggelos, AK [1]

Affiliation:

[1] Northwestern Univ, Dept Elect & Comp Engn, Evanston, IL 60208 USA

Source:

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY | 2004 / Vol. 14 / No. 05

Keywords:

audio-visual speech recognition; correlation hidden Markov models (CHMMs); facial animation parameters (FAPs); speech-to-video synthesis

DOI:

10.1109/TCSVT.2004.826760

Chinese Library Classification (CLC) codes:

TM [Electrical engineering]; TN [Electronics and communication technology]

Discipline classification codes:

0808; 0809

Abstract:

There is a strong correlation between the building blocks of speech (phonemes) and the building blocks of visual speech (visemes). In this paper, this correlation is exploited and an approach is proposed for synthesizing the visual representation of speech from a narrow-band acoustic speech signal. The visual speech is represented in terms of the facial animation parameters (FAPs) supported by the MPEG-4 standard. The main contribution of this paper is the development of a correlation hidden Markov model (CHMM) system, which integrates independently trained acoustic HMM (AHMM) and visual HMM (VHMM) systems in order to realize speech-to-video synthesis. The proposed CHMM system allows for different model topologies for acoustic and visual HMMs. It performs late integration and reduces the amount of required training data compared to early integration modeling techniques. Temporal accuracy experiments, comparison of the synthesized FAPs to the original FAPs, and audio-visual automatic speech recognition (AV-ASR) experiments utilizing the synthesized visual speech were performed in order to objectively measure the performance of the system. The objective experiments demonstrated that the proposed approach reduces time alignment errors by 40.5% compared to the conventional temporal scaling method, that the synthesized FAP sequences are very similar to the original FAP sequences, and that synthesized FAP sequences contain visual speechreading information that can improve AV-ASR performance.


Pages: 682-692

Page count: 11
