Speech Intention Classification with Multimodal Deep Learning

Times cited: 28
Authors
Gu, Yue [1]
Li, Xinyu [1]
Chen, Shuhong [1]
Zhang, Jianyu [1]
Marsic, Ivan [1]
Affiliations
[1] Rutgers State Univ, Dept Elect & Comp Engn, New Brunswick, NJ 08901 USA
Source
ADVANCES IN ARTIFICIAL INTELLIGENCE, CANADIAN AI 2017 | 2017 / Vol. 10233
Keywords
Multimodal intention classification; Textual-acoustic feature representation; Convolutional neural network; Trauma resuscitation; EMOTION RECOGNITION;
DOI
10.1007/978-3-319-57351-9_30
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present a novel multimodal deep learning structure that automatically extracts features from textual-acoustic data for sentence-level speech classification. Textual and acoustic features were first extracted using two independent convolutional neural network structures, then combined into a joint representation, and finally fed into a decision softmax layer. We tested the proposed model in an actual medical setting, using speech recording and its transcribed log. Our model achieved 83.10% average accuracy in detecting 6 different intentions. We also found that our model using automatically extracted features for intention classification outperformed existing models that use manufactured features.
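The pipeline the abstract describes (one convolutional branch per modality, a joint representation, then a softmax decision layer) can be illustrated with a minimal sketch. This is not the authors' implementation: the layer sizes, the kernel width, the random untrained weights, and every function name below are hypothetical, and plain concatenation is assumed as the way the two feature vectors are combined.

```python
import math
import random

def conv1d_maxpool(seq, kernels):
    """Slide each kernel over the sequence, apply ReLU, then max-pool over time."""
    feats = []
    for kernel in kernels:          # kernel: list of `width` weight vectors
        width = len(kernel)
        best = 0.0                  # starting at 0.0 acts as the ReLU floor
        for t in range(len(seq) - width + 1):
            window = seq[t:t + width]
            act = sum(w * x
                      for kvec, xvec in zip(kernel, window)
                      for w, x in zip(kvec, xvec))
            best = max(best, act)
        feats.append(best)
    return feats

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def classify(text_seq, audio_seq, text_kernels, audio_kernels, W, b):
    """Two independent branches -> concatenated joint vector -> softmax."""
    joint = (conv1d_maxpool(text_seq, text_kernels)
             + conv1d_maxpool(audio_seq, audio_kernels))
    logits = [sum(w * f for w, f in zip(row, joint)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)

# Hypothetical sizes: 8 tokens x 4-dim embeddings, 20 audio frames x 3 features,
# 5 kernels of width 3 per branch, 6 intention classes (as in the abstract).
rng = random.Random(0)

def rand_matrix(rows, cols, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

text_seq = rand_matrix(8, 4, scale=1.0)
audio_seq = rand_matrix(20, 3, scale=1.0)
text_kernels = [rand_matrix(3, 4) for _ in range(5)]
audio_kernels = [rand_matrix(3, 3) for _ in range(5)]
W = rand_matrix(6, 10)              # 10 = 5 text features + 5 acoustic features
b = [0.0] * 6

probs = classify(text_seq, audio_seq, text_kernels, audio_kernels, W, b)
print(probs)
```

With untrained weights the six probabilities carry no meaning; the sketch only shows how per-modality features flow into a single decision layer.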
Pages: 260-271
Number of pages: 12