Multiview Supervised Dictionary Learning in Speech Emotion Recognition

Cited by: 65
Authors
Gangeh, Mehrdad J. [1 ,2 ,3 ]
Fewzee, Pouria [1 ]
Ghodsi, Ali [4 ]
Kamel, Mohamed S. [1 ]
Karray, Fakhri [1 ]
Affiliations
[1] Univ Waterloo, Dept Elect & Comp Engn, Ctr Pattern Anal & Machine Intelligence, Waterloo, ON N2L 3G1, Canada
[2] Univ Toronto, Dept Med Biophys, Toronto, ON M5G 2M9, Canada
[3] Sunnybrook Hlth Sci Ctr, Dept Radiat Oncol, Toronto, ON M4N 3M5, Canada
[4] Univ Waterloo, Dept Stat & Actuarial Sci, Waterloo, ON N2L 3G1, Canada
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
Dictionary learning; emotion recognition; multiview representation; sparse representation; supervised learning; SPARSE; CLASSIFICATION;
DOI
10.1109/TASLP.2014.2319157
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 [Acoustics];
Abstract
Recently, a supervised dictionary learning (SDL) approach based on the Hilbert-Schmidt independence criterion (HSIC) has been proposed that learns the dictionary and the corresponding sparse coefficients in a space where the dependency between the data and the corresponding labels is maximized. In this paper, two multiview dictionary learning techniques are proposed based on this HSIC-based SDL. The first learns one dictionary and the corresponding coefficients in the space of the fused features of all views; the second learns one dictionary in each view and subsequently fuses the sparse coefficients in the spaces of the learned dictionaries. The effectiveness of the proposed multiview learning techniques in using the complementary information of single views is demonstrated in the application of speech emotion recognition (SER). The fully-continuous sub-challenge (FCSC) of the AVEC 2012 dataset is used in two different views: the baseline and the spectral energy distribution (SED) feature sets. Four dimensional affects, i.e., arousal, expectation, power, and valence, are predicted as continuous response variables using the proposed multiview methods. The results are compared with the single views, the AVEC 2012 baseline system, and other supervised and unsupervised multiview learning approaches in the literature. Using the correlation coefficient as the performance measure in predicting the continuous dimensional affects, it is shown that the proposed approach achieves the highest performance among the rivals. The relative performance of the two proposed multiview techniques and their relationship are also discussed. In particular, it is shown that imposing an additional constraint on the dictionary of one of these approaches makes it the same as the other.
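The two fusion strategies described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions that the abstract does not state: the HSIC-based dictionary is taken as the top-k eigenvectors of X H K H X^T (H is the centering matrix, K a kernel on the labels), the label kernel is an RBF kernel with a hypothetical bandwidth `sigma`, and the sparse coefficients are obtained by soft-thresholding, which is the closed-form lasso solution only because the atoms here are orthonormal. All function and parameter names are hypothetical.

```python
import numpy as np

def hsic_dictionary(X, y, k, sigma=1.0):
    """Return k orthonormal atoms for data X (d x n, samples as columns).

    Assumption: atoms are the top-k eigenvectors of X H K H X^T, with an
    RBF kernel K on the continuous labels y (kernel choice is illustrative).
    """
    n = X.shape[1]
    H = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    diff = y[:, None] - y[None, :]
    K = np.exp(-diff ** 2 / (2.0 * sigma ** 2))      # RBF kernel on labels
    M = X @ H @ K @ H @ X.T
    M = (M + M.T) / 2.0                              # remove round-off asymmetry
    _, V = np.linalg.eigh(M)                         # eigenvalues in ascending order
    return V[:, ::-1][:, :k]                         # keep the k leading eigenvectors

def sparse_codes(U, X, lam):
    """Lasso codes via soft-thresholding; exact only for orthonormal U."""
    P = U.T @ X
    return np.sign(P) * np.maximum(np.abs(P) - lam, 0.0)

def fuse_features(views, y, k, lam):
    """Strategy 1: stack the views' features, learn a single dictionary."""
    X = np.vstack(views)
    return sparse_codes(hsic_dictionary(X, y, k), X, lam)

def fuse_codes(views, y, ks, lam):
    """Strategy 2: learn one dictionary per view, then stack the codes."""
    return np.vstack([
        sparse_codes(hsic_dictionary(X, y, k), X, lam)
        for X, k in zip(views, ks)
    ])
```

In both cases the resulting coefficient matrix (one column per sample) would then be passed to a regressor to predict a dimensional affect; that step is omitted here.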
Pages: 1056-1068
Number of pages: 13