Fear-type emotion recognition for future audio-based surveillance systems

Cited by: 117
Authors
Clavel, C. [1 ]
Vasilescu, I. [2 ]
Devillers, L. [2 ]
Richard, G. [3 ]
Ehrette, T. [1 ]
Affiliations
[1] Thales Res & Technol France, F-91767 Palaiseau, France
[2] LIMSI CNRS, F-91403 Orsay, France
[3] Telecom ParisTech, F-75014 Paris, France
Keywords
fear-type emotions recognition; fiction corpus; annotation scheme; acoustic features of emotions; machine learning; threatening situations; civil safety
DOI
10.1016/j.specom.2008.03.012
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
This paper addresses the issue of automatic emotion recognition in speech. We focus on a type of emotional manifestation that has rarely been studied in speech processing: fear-type emotions occurring during abnormal situations (here, unplanned events in which human life is threatened). This study is dedicated to a new application of emotion recognition: public safety. The starting point of this work is the definition and collection of data illustrating extreme emotional manifestations in threatening situations. For this purpose we develop the SAFE corpus (Situation Analysis in a Fictional and Emotional corpus), based on fiction movies. It consists of 7 h of recordings organized into 400 audiovisual sequences. The corpus contains recordings of both normal and abnormal situations and covers a large range of contexts, and therefore a large range of emotional manifestations. In this way it not only addresses the lack of corpora illustrating strong emotions, but also provides valuable material for studying a wide variety of emotional manifestations. We define a task-dependent annotation strategy whose particularity is to describe the emotion and the evolution of the situation simultaneously, in context. The emotion recognition system is built on these data and must handle a large range of unknown speakers and situations in noisy sound environments. It performs a fear vs. neutral classification. The novelty of our approach lies in dissociated acoustic models of the voiced and unvoiced content of speech, which are merged at the decision step of the classification system. The results are quite promising given the complexity and diversity of the data: the error rate is about 30%. (C) 2008 Elsevier B.V. All rights reserved.
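For illustration, the following minimal Python sketch shows the decision-level fusion idea described in the abstract: one acoustic model per class is trained on the voiced content of speech and another on the unvoiced content, and their scores are merged at the decision step of a fear vs. neutral classifier. This is not the authors' implementation; the Gaussian mixture models, the 12-dimensional features, and the weighted-sum fusion rule are all assumptions made for the sketch.

    # Minimal sketch (assumptions throughout; not the paper's actual code) of
    # decision-level fusion of dissociated voiced/unvoiced acoustic models for
    # fear vs. neutral classification.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_class_models(voiced_feats, unvoiced_feats, n_components=4):
        # One GMM per speech content type (voiced / unvoiced) for a single
        # emotion class; feature extraction is assumed to happen upstream.
        gmm_voiced = GaussianMixture(n_components=n_components, random_state=0).fit(voiced_feats)
        gmm_unvoiced = GaussianMixture(n_components=n_components, random_state=0).fit(unvoiced_feats)
        return gmm_voiced, gmm_unvoiced

    def classify(voiced, unvoiced, fear_models, neutral_models, w=0.5):
        # Fuse the two streams at the decision step: compare weighted average
        # log-likelihoods under the fear and neutral models. The weight w is
        # an assumption; the paper's exact fusion rule may differ.
        fear_v, fear_u = fear_models
        neu_v, neu_u = neutral_models
        ll_fear = w * fear_v.score(voiced) + (1 - w) * fear_u.score(unvoiced)
        ll_neutral = w * neu_v.score(voiced) + (1 - w) * neu_u.score(unvoiced)
        return "fear" if ll_fear > ll_neutral else "neutral"

    if __name__ == "__main__":
        # Purely synthetic stand-in features (e.g. MFCC-like 12-dim vectors).
        rng = np.random.default_rng(0)
        fear_models = train_class_models(rng.normal(1.0, 1.0, (200, 12)),
                                         rng.normal(1.0, 1.0, (200, 12)))
        neutral_models = train_class_models(rng.normal(0.0, 1.0, (200, 12)),
                                            rng.normal(0.0, 1.0, (200, 12)))
        print(classify(rng.normal(1.0, 1.0, (50, 12)),
                       rng.normal(1.0, 1.0, (50, 12)),
                       fear_models, neutral_models))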
Pages: 487-503
Page count: 17