Noise-Robust Speech Recognition Through Auditory Feature Detection and Spike Sequence Decoding

Cited by: 8
Authors
Schafer, Phillip B. [1 ,2 ]
Jin, Dezhe Z. [1 ,2 ]
Affiliations
[1] Penn State Univ, Dept Phys, University Pk, PA 16802 USA
[2] Penn State Univ, Ctr Neural Engn, University Pk, PA 16802 USA
Funding
U.S. National Science Foundation;
Keywords
WORD RECOGNITION; FEATURE-EXTRACTION; NEURAL-NETWORKS; NEURONS; REPRESENTATION; MODEL; VOCALIZATIONS; RESPONSES; MACHINE; SOUNDS;
DOI
10.1162/NECO_a_00557
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Speech recognition in noisy conditions is a major challenge for computer systems, but the human brain performs it routinely and accurately. Automatic speech recognition (ASR) systems that are inspired by neuroscience can potentially bridge the performance gap between humans and machines. We present a system for noise-robust isolated word recognition that works by decoding sequences of spikes from a population of simulated auditory feature-detecting neurons. Each neuron is trained to respond selectively to a brief spectrotemporal pattern, or feature, drawn from the simulated auditory nerve response to speech. The neural population conveys the time-dependent structure of a sound by its sequence of spikes. We compare two methods for decoding the spike sequences: one using a hidden Markov model-based recognizer, the other using a novel template-based recognition scheme. In the latter case, words are recognized by comparing their spike sequences to template sequences obtained from clean training data, using a similarity measure based on the length of the longest common subsequence. Using isolated spoken digits from the AURORA-2 database, we show that our combined system outperforms a state-of-the-art robust speech recognizer at low signal-to-noise ratios. Both the spike-based encoding scheme and the template-based decoding offer gains in noise robustness over traditional speech recognition methods. Our system highlights potential advantages of spike-based acoustic coding and provides a biologically motivated framework for robust ASR development.
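A minimal sketch of the template-based decoding idea described in the abstract: a test spike sequence is compared against clean-speech template sequences using a longest-common-subsequence (LCS) similarity, and the label of the best-matching template is returned. The function names (lcs_length, classify_by_template), the representation of spike sequences as lists of feature-detector neuron indices, and the length normalization are illustrative assumptions, not the paper's exact formulation.

# Hedged illustration of LCS-based template matching over spike sequences.
# Names and normalization are assumptions; the paper's exact measure may differ.

def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences
    (e.g., sequences of feature-detector neuron IDs)."""
    # Standard O(len(a) * len(b)) dynamic program.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def classify_by_template(test_seq, templates):
    """Return the word label whose template spike sequence is most similar
    to the test sequence. `templates` maps word -> list of template sequences
    obtained from clean training data."""
    best_word, best_score = None, -1.0
    for word, seqs in templates.items():
        for tmpl in seqs:
            # Normalize LCS length by the longer sequence (an assumption).
            score = lcs_length(test_seq, tmpl) / max(len(test_seq), len(tmpl), 1)
            if score > best_score:
                best_word, best_score = word, score
    return best_word

# Toy usage with spike sequences given as neuron indices in firing order:
templates = {"one": [[3, 7, 2, 9]], "two": [[5, 1, 8, 4]]}
print(classify_by_template([3, 2, 9, 6], templates))  # -> "one"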
Pages: 523-556
Page count: 34