Subjective analysis of an HMM-based visual speech synthesizer

Cited by: 2
Authors
Williams, JJ [1 ]
Katsaggelos, AK [1 ]
Garstecki, DC [1 ]
Affiliation
[1] Northwestern Univ, Evanston, IL 60208 USA
Source
HUMAN VISION AND ELECTRONIC IMAGING VI | 2001 / Vol. 4299
DOI
10.1117/12.429527
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Emerging broadband communication systems promise a future of multimedia telephony. Adding visual information to telephone conversations, for example, would be most beneficial to people who have impaired hearing but are able to speechread. For the present, it is useful to consider how to generate the information critical for speechreading from the existing narrowband communication systems used for speech. This paper focuses on the problem of synthesizing visual articulatory movements from the acoustic speech signal. A Hidden Markov Model (HMM)-based visual speech synthesizer is designed to improve speech understanding. The key elements in applying HMMs to this problem are (a) the decomposition of the overall modeling task into key stages, and (b) the judicious determination of the components of the observation vector for each stage. The main contribution of this paper is a novel correlation HMM that integrates independently trained acoustic and visual HMMs for speech-to-visual synthesis. This model allows greater flexibility in choosing the topologies of the acoustic and visual HMMs, and it requires less training data than early-integration modeling techniques. Results from objective and subjective analyses show that the correlation HMM can significantly decrease audio-visual synchronization errors and increase speech understanding.
Pages: 544-555
Page count: 12
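The abstract does not detail how the correlation HMM links the two independently trained models. The sketch below is one minimal, hypothetical reading in Python: the acoustic HMM is Viterbi-decoded on per-frame audio features, a correlation matrix P(visual state | acoustic state) is estimated from time-aligned training state sequences, and each frame's visual parameters are taken as the mean of the most likely visual state. All function names, parameter choices, and the toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical sketch of a correlation HMM for audio-to-visual synthesis:
# decode the acoustic HMM, map acoustic states to visual states through a
# correlation matrix, and emit the visual states' mean mouth parameters.

def viterbi(log_trans, log_emit):
    """Most likely state path; log_trans is (S, S), log_emit is (T, S)."""
    T, S = log_emit.shape
    delta = np.empty((T, S))
    psi = np.zeros((T, S), dtype=int)
    delta[0] = log_emit[0]                       # flat initial-state prior
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # predecessor scores
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):               # backtrack
        path[t] = psi[t + 1, path[t + 1]]
    return path

def estimate_correlation(acoustic_states, visual_states, n_a, n_v):
    """P(visual state | acoustic state) from time-aligned state sequences,
    with add-one smoothing so unseen pairs keep nonzero probability."""
    counts = np.ones((n_a, n_v))
    for a, v in zip(acoustic_states, visual_states):
        counts[a, v] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def synthesize(audio_log_emit, log_trans, corr, visual_means):
    """Audio frame log-likelihoods -> visual parameter trajectory."""
    a_path = viterbi(log_trans, audio_log_emit)  # acoustic state per frame
    v_path = corr[a_path].argmax(axis=1)         # most likely visual state
    return visual_means[v_path]                  # (T, n_params) trajectory

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_a, n_v, T = 4, 3, 50                       # toy model sizes
    log_trans = np.log(rng.dirichlet(np.ones(n_a), size=n_a))
    corr = estimate_correlation(rng.integers(0, n_a, 200),
                                rng.integers(0, n_v, 200), n_a, n_v)
    visual_means = rng.normal(size=(n_v, 6))     # e.g., 6 mouth-shape params
    audio_log_emit = rng.normal(size=(T, n_a))   # stand-in log-likelihoods
    print(synthesize(audio_log_emit, log_trans, corr, visual_means).shape)
```

This two-stage argmax (decode acoustics, then pick visual states frame by frame) is only a simplification to keep the sketch short; a correlation model of the kind the paper describes would presumably score acoustic and visual state sequences jointly, which is what allows the two HMM topologies to differ while staying synchronized.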