Towards reconstructing intelligible speech from the human auditory cortex

Cited: 122
Authors
Akbari, Hassan [1 ,2 ]
Khalighinejad, Bahar [1 ,2 ]
Herrero, Jose L. [3 ,4 ]
Mehta, Ashesh D. [3 ,4 ]
Mesgarani, Nima [1 ,2 ]
Institutions
[1] Columbia Univ, Mortimer B Zuckerman Mind Brain Behav Inst, New York, NY USA
[2] Columbia Univ, Dept Elect Engn, New York, NY 10027 USA
[3] Hofstra Northwell Sch Med, Manhasset, NY USA
[4] Feinstein Inst Med Res, Manhasset, NY USA
Funding
U.S. National Institutes of Health;
Keywords
NEURAL-NETWORKS; RESPONSES; ALGORITHM; REPRESENTATIONS; CLASSIFICATION; STIMULI; IMAGERY; ECOG; EEG;
DOI
10.1038/s41598-018-37359-z
CLC Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Subject Classification Codes
07 ; 0710 ; 09 ;
Abstract
Auditory stimulus reconstruction is a technique that finds the best approximation of the acoustic stimulus from the population of evoked neural activity. Reconstructing speech from the human auditory cortex creates the possibility of a speech neuroprosthetic that establishes direct communication with the brain, and has been shown to be possible in both overt and covert conditions. However, the low quality of the reconstructed speech has severely limited the utility of this method for brain-computer interface (BCI) applications. To advance the state of the art in speech neuroprosthesis, we combined recent advances in deep learning with the latest innovations in speech synthesis technologies to reconstruct closed-set intelligible speech from the human auditory cortex. We investigated how reconstruction accuracy depends on the regression method, linear versus nonlinear (deep neural network), and on the acoustic representation used as the target of reconstruction, including the auditory spectrogram and speech synthesis parameters. In addition, we compared reconstruction accuracy from low and high neural frequency ranges. Our results show that a deep neural network model that directly estimates the parameters of a speech synthesizer from all neural frequencies achieves the highest subjective and objective scores on a digit recognition task, improving intelligibility by 65% over the baseline method, which used linear regression to reconstruct the auditory spectrogram. These results demonstrate the efficacy of deep learning and speech synthesis algorithms for designing the next generation of speech BCI systems, which not only can restore communication for paralyzed patients but also have the potential to transform human-computer interaction technologies.
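The baseline method the abstract refers to, linear reconstruction of the auditory spectrogram from neural responses, can be illustrated with a small sketch. The code below is not the authors' implementation: it uses synthetic stand-ins for the neural features and spectrogram, a simple ridge-regression decoder, and a mean Pearson correlation across frequency bins as a rough analogue of the paper's objective spectrogram score. All array shapes, the regularization weight `lam`, and the helper `mean_corr` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions, not the paper's data):
# X: lagged neural features, shape (n_timepoints, n_features)
# S: target auditory spectrogram, shape (n_timepoints, n_freq_bins)
n_t, n_feat, n_freq = 2000, 64, 32
W_true = 0.1 * rng.standard_normal((n_feat, n_freq))
X = rng.standard_normal((n_t, n_feat))
S = X @ W_true + 0.05 * rng.standard_normal((n_t, n_freq))

# Ridge regression: closed-form solution of
# argmin_W ||X W - S||^2 + lam * ||W||^2
lam = 1.0
W = np.linalg.solve(X.T @ X + lam * np.eye(n_feat), X.T @ S)
S_hat = X @ W  # reconstructed spectrogram

def mean_corr(a, b):
    """Mean Pearson correlation across frequency bins (columns)."""
    a = (a - a.mean(axis=0)) / a.std(axis=0)
    b = (b - b.mean(axis=0)) / b.std(axis=0)
    return float((a * b).mean())

print(round(mean_corr(S, S_hat), 3))
```

The paper's contribution is to replace both pieces of this pipeline: the linear map with a deep neural network and the spectrogram target with speech-synthesizer (vocoder) parameters, which is what yields the reported 65% intelligibility gain.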
Pages: 12