Multi-modal dialog scene detection using hidden Markov models for content-based multimedia indexing

Cited by: 34
Authors
Alatan, AA [1]
Akansu, AN
Wolf, W
Affiliations
[1] Middle E Tech Univ, Elect Elect Engn Dept, TR-06531 Ankara, Turkey
[2] New Jersey Inst Technol, New Jersey Ctr Multimedia Res, Newark, NJ 07102 USA
[3] Princeton Univ, Dept Elect Engn, Princeton, NJ 08544 USA
Keywords
content-based indexing; multi-modal analysis; hidden Markov models; dialog scene analysis
DOI
10.1023/A:1011395131992
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
A class of audio-visual data (fiction entertainment: movies, TV series) is segmented into scenes that contain dialogs, using a novel hidden Markov model (HMM)-based method. Each shot is classified using both the audio track (via classification of speech, silence, and music) and the visual content (face and location information). The result of this shot-based classification is an audio-visual token that the HMM state diagram uses to perform scene analysis. Simulations with circular and left-to-right HMM topologies show that both perform very well with multi-modal inputs. Moreover, for the circular topology, comparisons between different training and observation sets show that audio and face information together give the most consistent results across observation sets.
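The abstract gives no model parameters, so the following is only a minimal sketch of the general idea: one discrete audio-visual token per shot, decoded with the Viterbi algorithm under a circular or a left-to-right topology. The three state names, the three-token alphabet, and every probability below are hypothetical values chosen for illustration and are not taken from the paper.

```python
import numpy as np

def topology(n_states, circular, stay=0.7):
    """Transition matrix in which each state either stays or advances by one."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        if i == n_states - 1 and not circular:
            A[i, i] = 1.0                      # left-to-right: last state absorbs
            continue
        A[i, i] = stay
        A[i, (i + 1) % n_states] = 1.0 - stay  # circular: last state wraps to first
    return A

def viterbi(obs, A, B, pi):
    """Most likely state path for a sequence of token indices (log domain)."""
    with np.errstate(divide="ignore"):         # log(0) = -inf marks forbidden moves
        logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    T, N = len(obs), A.shape[0]
    delta = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    delta[0] = logpi + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA  # scores[i, j]: move from i to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Hypothetical states: 0 = establishing, 1 = dialog, 2 = transition.
# Hypothetical tokens: 0 = speech + face, 1 = speech, no face, 2 = music/silence, no face.
B = np.array([[0.10, 0.30, 0.60],
              [0.80, 0.15, 0.05],
              [0.05, 0.15, 0.80]])
pi = np.array([1.0, 0.0, 0.0])
shots = [2, 0, 0, 0, 2, 2, 0, 0]               # one token index per shot

A_circ = topology(3, circular=True)
A_ltr = topology(3, circular=False)
path_circ = viterbi(shots, A_circ, B, pi)
path_ltr = viterbi(shots, A_ltr, B, pi)
print("circular:     ", path_circ)
print("left-to-right:", path_ltr)
```

Under these assumptions, the circular topology can return to the first state and so can label several consecutive scenes in one token stream, whereas the left-to-right topology passes through its states at most once.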
Pages: 137-151
Page count: 15