Emotional speech recognition: Resources, features, and methods

Cited by: 542
Authors
Ververidis, Dimitrios [1 ]
Kotropoulos, Constantine [1 ]
Affiliations
[1] Aristotle University of Thessaloniki, Artificial Intelligence & Information Analysis Lab, Dept. of Informatics, Thessaloniki 54124, Greece
Keywords
emotions; emotional speech data collections; emotional speech classification; stress; interfaces; acoustic features
DOI
10.1016/j.specom.2006.04.003
CLC classification number
O42 [Acoustics]
Discipline code
070206; 082403
Abstract
In this paper we overview emotional speech recognition with three goals in mind. The first goal is to provide an up-to-date record of the available emotional speech data collections; the number of emotional states, the language, the number of speakers, and the kind of speech are briefly addressed for each. The second goal is to present the acoustic features most frequently used for emotional speech recognition and to assess how emotion affects them. Typical features are the pitch, the formants, the vocal tract cross-section areas, the mel-frequency cepstral coefficients, the Teager energy operator-based features, the intensity of the speech signal, and the speech rate. The third goal is to review techniques suitable for classifying speech into emotional states. We examine separately classification techniques that exploit timing information from those that ignore it. Classification techniques based on hidden Markov models, artificial neural networks, linear discriminant analysis, k-nearest neighbors, and support vector machines are reviewed. (c) 2006 Elsevier B.V. All rights reserved.
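The timing-agnostic classifiers mentioned in the abstract, k-nearest neighbors among them, assign an emotion label to a single utterance-level feature vector rather than to a frame sequence. A minimal NumPy sketch of this idea, using hypothetical feature values (mean pitch, mean intensity, speech rate) and made-up training data purely for illustration:

```python
import numpy as np

# Hypothetical utterance-level feature vectors:
# [mean pitch (Hz), mean intensity (dB), speech rate (syllables/s)].
# The values below are illustrative, not taken from any real corpus.
train_X = np.array([
    [280.0, 72.0, 5.5],   # anger: raised pitch, high intensity, fast rate
    [290.0, 74.0, 5.8],   # anger
    [160.0, 58.0, 3.2],   # sadness: lowered pitch, low intensity, slow rate
    [150.0, 56.0, 3.0],   # sadness
])
train_y = np.array(["anger", "anger", "sadness", "sadness"])

def knn_classify(x, X, y, k=3):
    """Label x by majority vote among its k nearest training vectors."""
    d = np.linalg.norm(X - x, axis=1)       # Euclidean distance to each sample
    nearest = y[np.argsort(d)[:k]]          # labels of the k closest samples
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]        # most frequent label wins

print(knn_classify(np.array([275.0, 70.0, 5.4]), train_X, train_y))  # anger
```

Because the whole utterance is summarized by one vector, the temporal evolution of the features is discarded; that is precisely the information the HMM-based approaches discussed in the survey are able to exploit.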
Pages: 1162-1181
Page count: 20
Related papers
120 in total
[21]  
Chuang, Z.-J., 2002, Proc. Int. Conf. Spoken Language Processing, Vol. 3, p. 2033
[22]  
Clavel, C., 2004, Proc. ICSLP, Jeju, p. 2277
[23]  
Cole, R., 2005, CU Kids' Speech Corpus
[24]   Describing the emotional states that are expressed in speech [J].
Cowie, R ;
Cornelius, RR .
SPEECH COMMUNICATION, 2003, 40 (1-2) :5-32
[25]   Emotion recognition in human-computer interaction [J].
Cowie, R ;
Douglas-Cowie, E ;
Tsapatsoulis, N ;
Votsis, G ;
Kollias, S ;
Fellenz, W ;
Taylor, JG .
IEEE SIGNAL PROCESSING MAGAZINE, 2001, 18 (01) :32-80
[26]  
Cowie, R., 1996, ICSLP 96 - Fourth International Conference on Spoken Language Processing, Proceedings, Vols. 1-4, p. 1989, DOI 10.1109/ICSLP.1996.608027
[27]   Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences [J].
Davis, SB ;
Mermelstein, P .
IEEE TRANSACTIONS ON ACOUSTICS SPEECH AND SIGNAL PROCESSING, 1980, 28 (04) :357-366
[28]  
Dellaert, F., 1996, ICSLP 96 - Fourth International Conference on Spoken Language Processing, Proceedings, Vols. 1-4, p. 1970, DOI 10.1109/ICSLP.1996.608022
[29]  
Deller, J., 2000, Discrete-Time Processing of Speech Signals, DOI 10.1109/9780470544402.CH11
[30]   Maximum likelihood from incomplete data via the EM algorithm [J].
Dempster, AP ;
Laird, NM ;
Rubin, DB .
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL, 1977, 39 (01) :1-38