Audio/visual mapping with cross-modal hidden Markov models

Cited by: 42

Authors:

Fu, SL [1]

Gutierrez-Osuna, R

Esposito, A

Kakumanu, PK

Garcia, ON

Affiliations:

[1] Univ Delaware, Dept Elect & Comp Engn, Newark, DE 19716 USA

[2] Texas A&M Univ, Dept Comp Sci, College Stn, TX 77843 USA

[3] Univ Naples 2, Dept Psychol, Naples, Italy

[4] Wright State Univ, Dept Comp Sci & Engn, Dayton, OH 45435 USA

[5] Univ N Texas, Coll Engn, Denton, TX 76203 USA

Source:

IEEE TRANSACTIONS ON MULTIMEDIA | 2005 / Vol. 7 / No. 2

Funding:

U.S. National Science Foundation;

Keywords:

3-D audio/video processing; joint media and multimodal processing; speech reading and lip synchronization;

DOI:

10.1109/TMM.2005.843341

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

The audio/visual mapping problem of speech-driven facial animation has intrigued researchers for years. Recent research efforts have demonstrated that hidden Markov model (HMM) techniques, which have been applied successfully to the problem of speech recognition, could achieve a similar level of success in audio/visual mapping problems. A number of HMM-based methods have been proposed and shown to be effective by the respective designers, but it is yet unclear how these techniques compare to each other on a common test bed. In this paper, we quantitatively compare three recently proposed cross-modal HMM methods, namely the remapping HMM (R-HMM), the least-mean-squared HMM (LMS-HMM), and HMM inversion (HMMI). The objective of our comparison is not only to highlight the merits and demerits of different mapping designs, but also to study the optimality of the acoustic representation and HMM structure for the purpose of speech-driven facial animation. This paper presents a brief overview of these models, followed by an analysis of their mapping capabilities on a synthetic dataset. An empirical comparison on an experimental audio-visual dataset consisting of 75 TIMIT sentences is finally presented. Our results show that HMMI provides the best performance, both on synthetic and experimental audio-visual data.


Pages: 243-252

Page count: 10
