Speech Intention Classification with Multimodal Deep Learning

Cited by: 28

Authors:

Gu, Yue [1]

Li, Xinyu [1]

Chen, Shuhong [1]

Zhang, Jianyu [1]

Marsic, Ivan [1]

Affiliation:

[1] Rutgers State Univ, Dept Elect & Comp Engn, New Brunswick, NJ 08901 USA

Source:

ADVANCES IN ARTIFICIAL INTELLIGENCE, CANADIAN AI 2017 | 2017 / Vol. 10233

Keywords:

Multimodal intention classification; Textual-acoustic feature representation; Convolutional neural network; Trauma resuscitation; Emotion recognition

DOI:

10.1007/978-3-319-57351-9_30

Chinese Library Classification (CLC) number:

TP18 [Theory of artificial intelligence]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

We present a novel multimodal deep learning structure that automatically extracts features from textual-acoustic data for sentence-level speech classification. Textual and acoustic features were first extracted using two independent convolutional neural network structures, then combined into a joint representation, and finally fed into a decision softmax layer. We tested the proposed model in an actual medical setting, using speech recording and its transcribed log. Our model achieved 83.10% average accuracy in detecting 6 different intentions. We also found that our model using automatically extracted features for intention classification outperformed existing models that use manufactured features.
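The pipeline described in the abstract (two independent convolutional feature extractors, a joint representation, and a softmax decision layer) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the filter counts, filter widths, embedding sizes, random weights, and the use of simple concatenation for the joint representation are all assumptions chosen for illustration, and the function names are hypothetical. Only the overall three-stage structure and the six output classes come from the abstract.

```python
import math
import random


def conv1d_max_pool(seq, filters):
    """Slide each filter over the time axis (valid 1D convolution),
    apply ReLU, and keep the maximum activation per filter."""
    pooled = []
    for filt in filters:
        width = len(filt)
        best = 0.0  # ReLU floor: negative activations are clipped to zero
        for start in range(len(seq) - width + 1):
            window = seq[start:start + width]
            act = sum(
                w * x
                for w_row, x_row in zip(filt, window)
                for w, x in zip(w_row, x_row)
            )
            best = max(best, act)
        pooled.append(best)
    return pooled


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(text_seq, audio_seq, text_filters, audio_filters, weights, bias):
    """Extract features per modality, concatenate them (assumed fusion),
    and map the joint vector to class probabilities."""
    joint = (conv1d_max_pool(text_seq, text_filters)
             + conv1d_max_pool(audio_seq, audio_filters))
    logits = [sum(w * f for w, f in zip(row, joint)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


NUM_CLASSES = 6                      # six intentions, per the abstract
text_seq = rand_matrix(10, 8)        # hypothetical: 10 tokens, 8-dim embeddings
audio_seq = rand_matrix(50, 4)       # hypothetical: 50 frames, 4 acoustic features
text_filters = [rand_matrix(3, 8) for _ in range(5)]    # 5 filters of width 3
audio_filters = [rand_matrix(5, 4) for _ in range(5)]   # 5 filters of width 5
weights = rand_matrix(NUM_CLASSES, 10)                  # 10 = 5 text + 5 audio features
bias = [0.0] * NUM_CLASSES

probs = classify(text_seq, audio_seq, text_filters, audio_filters, weights, bias)
predicted = probs.index(max(probs))
print(len(probs), predicted)
```

With untrained random weights the predicted class is meaningless; the sketch only shows how the two modality-specific feature vectors meet in a single vector before the softmax layer.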


Pages: 260-271

Number of pages: 12
