A comprehensive study on mid-level representation and ensemble learning for emotional analysis of video material

Cited by: 28
Authors
Acar, Esra [1 ]
Hopfgartner, Frank [2 ]
Albayrak, Sahin [1 ]
Affiliations
[1] Tech Univ Berlin, DAI Lab, Ernst Reuter Pl 7,TEL 14, D-10587 Berlin, Germany
[2] Univ Glasgow, Humanities Adv Technol & Informat Inst, Glasgow, Lanark, Scotland
Keywords
Video affective content analysis; Ensemble learning; Deep learning; MFCC; Color; Dense trajectories; SentiBank
DOI
10.1007/s11042-016-3618-5
CLC number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In today's society, where audio-visual content such as professionally edited and user-generated videos is ubiquitous, automatic analysis of this content is a decisive functionality. Within this context, there is extensive ongoing research on understanding the semantics (i.e., facts) of videos, such as objects or events. However, little research has been devoted to understanding their emotional content. In this paper, we address this issue and introduce a system that performs emotional content analysis of professionally edited and user-generated videos. We concentrate on both the representation and modeling aspects. Videos are represented using mid-level audio-visual features. More specifically, audio and static visual representations are automatically learned from raw data using convolutional neural networks (CNNs). In addition, dense-trajectory-based motion and SentiBank domain-specific features are incorporated. By means of ensemble learning and fusion mechanisms, videos are classified into one of several predefined emotion categories. Results obtained on the VideoEmotion dataset and a subset of the DEAP dataset show that (1) higher-level representations perform better than low-level features, (2) among audio features, mid-level learned representations perform better than mid-level handcrafted ones, (3) incorporating motion and domain-specific information leads to a notable performance gain, and (4) ensemble learning is superior to multi-class support vector machines (SVMs) for video affective content analysis.
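The fusion mechanism described in the abstract can be illustrated with a minimal late-fusion sketch. This is an assumption-laden toy example, not the authors' exact pipeline: the modality names, score vectors, fusion weights, and the four-emotion label set are all hypothetical, and the fusion rule shown is a simple weighted average of per-modality class scores followed by an argmax.

```python
# Hypothetical late-fusion sketch: each modality-specific classifier
# (audio CNN, static visual CNN, dense-trajectory motion, SentiBank)
# emits a probability distribution over emotion classes; the fused
# prediction is the argmax of their weighted average.

EMOTIONS = ["anger", "fear", "joy", "sadness"]  # illustrative subset

def fuse_predictions(modality_scores, weights):
    """Weighted-average (late) fusion of per-modality class scores.

    modality_scores: dict mapping modality name -> list of class scores
    weights: dict mapping modality name -> fusion weight
    Returns the index of the winning emotion class.
    """
    n_classes = len(next(iter(modality_scores.values())))
    total_w = sum(weights[m] for m in modality_scores)
    fused = [0.0] * n_classes
    for m, scores in modality_scores.items():
        for i, s in enumerate(scores):
            fused[i] += weights[m] * s / total_w
    return max(range(n_classes), key=fused.__getitem__)

# Toy scores for one video clip (made up for illustration).
scores = {
    "audio":     [0.10, 0.20, 0.60, 0.10],
    "visual":    [0.05, 0.15, 0.70, 0.10],
    "motion":    [0.30, 0.30, 0.20, 0.20],
    "sentibank": [0.10, 0.10, 0.50, 0.30],
}
weights = {"audio": 1.0, "visual": 1.0, "motion": 0.5, "sentibank": 0.5}
print(EMOTIONS[fuse_predictions(scores, weights)])  # -> joy
```

In an actual ensemble-learning setup, the per-class weights would be learned on validation data rather than fixed by hand, and the base learners would be trained classifiers rather than fixed score vectors.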
Pages: 11809-11837
Page count: 29
Related papers
46 in total (entries [31]-[40] shown)
[31]   MPEG-7 visual motion descriptors [J].
Jeannin, S ;
Divakaran, A .
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2001, 11 (06) :720-724
[32]   Caffe: Convolutional Architecture for Fast Feature Embedding [J].
Jia, Yangqing ;
Shelhamer, Evan ;
Donahue, Jeff ;
Karayev, Sergey ;
Long, Jonathan ;
Girshick, Ross ;
Guadarrama, Sergio ;
Darrell, Trevor .
PROCEEDINGS OF THE 2014 ACM CONFERENCE ON MULTIMEDIA (MM'14), 2014, :675-678
[33]   DEAP: A Database for Emotion Analysis Using Physiological Signals [J].
Koelstra, Sander ;
Muhl, Christian ;
Soleymani, Mohammad ;
Lee, Jong-Seok ;
Yazdani, Ashkan ;
Ebrahimi, Touradj ;
Pun, Thierry ;
Nijholt, Anton ;
Patras, Ioannis .
IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2012, 3 (01) :18-31
[34]  
Krizhevsky A., 2017, COMMUN ACM, V60, P84, DOI 10.1145/3065386
[35]  
Mairal J, 2010, J MACH LEARN RES, V11, P19
[36]   Multimodal Learning with Deep Boltzmann Machine for Emotion Prediction in User Generated Videos [J].
Pang, Lei ;
Ngo, Chong-Wah .
ICMR'15: PROCEEDINGS OF THE 2015 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2015, :619-622
[37]  
Plutchik R., 1986, Emotion: Theory, Research, and Experience, V3
[38]   EFFECTS OF COLOR ON EMOTIONS [J].
VALDEZ, P ;
MEHRABIAN, A .
JOURNAL OF EXPERIMENTAL PSYCHOLOGY-GENERAL, 1994, 123 (04) :394-409
[39]   Affective understanding in film [J].
Wang, Hee Lin ;
Cheong, Loong-Fah .
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2006, 16 (06) :689-704
[40]   Action Recognition with Improved Trajectories [J].
Wang, Heng ;
Schmid, Cordelia .
2013 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2013, :3551-3558