Noise-Robust Speech Recognition Through Auditory Feature Detection and Spike Sequence Decoding

Cited by: 7

Authors:

Schafer, Phillip B. [1,2]

Jin, Dezhe Z. [1,2]

Affiliations:

[1] Penn State Univ, Dept Phys, University Pk, PA 16802 USA

[2] Penn State Univ, Ctr Neural Engn, University Pk, PA 16802 USA

Source:

NEURAL COMPUTATION | 2014 / Vol. 26 / Issue 03

Funding:

National Science Foundation (USA);

Keywords:

WORD RECOGNITION; FEATURE-EXTRACTION; NEURAL-NETWORKS; NEURONS; REPRESENTATION; MODEL; VOCALIZATIONS; RESPONSES; MACHINE; SOUNDS;

DOI:

10.1162/NECO_a_00557

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Speech recognition in noisy conditions is a major challenge for computer systems, but the human brain performs it routinely and accurately. Automatic speech recognition (ASR) systems that are inspired by neuroscience can potentially bridge the performance gap between humans and machines. We present a system for noise-robust isolated word recognition that works by decoding sequences of spikes from a population of simulated auditory feature-detecting neurons. Each neuron is trained to respond selectively to a brief spectrotemporal pattern, or feature, drawn from the simulated auditory nerve response to speech. The neural population conveys the time-dependent structure of a sound by its sequence of spikes. We compare two methods for decoding the spike sequences: one using a hidden Markov model-based recognizer, the other using a novel template-based recognition scheme. In the latter case, words are recognized by comparing their spike sequences to template sequences obtained from clean training data, using a similarity measure based on the length of the longest common subsequence. Using isolated spoken digits from the AURORA-2 database, we show that our combined system outperforms a state-of-the-art robust speech recognizer at low signal-to-noise ratios. Both the spike-based encoding scheme and the template-based decoding offer gains in noise robustness over traditional speech recognition methods. Our system highlights potential advantages of spike-based acoustic coding and provides a biologically motivated framework for robust ASR development.
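As a rough illustration of the template-based decoding idea in the abstract, the sketch below scores a spike sequence against word templates by the length of their longest common subsequence (LCS). It is not the authors' implementation: representing a spike sequence as a list of integer neuron labels, normalizing by the longer sequence length, and the names `lcs_length`, `similarity`, and `recognize` are all assumptions made for this example.

```python
from typing import Dict, List, Sequence


def lcs_length(a: Sequence[int], b: Sequence[int]) -> int:
    """Length of the longest common subsequence of a and b (standard DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(seq: Sequence[int], template: Sequence[int]) -> float:
    """LCS length divided by the longer sequence length (assumed normalization)."""
    longest = max(len(seq), len(template))
    return lcs_length(seq, template) / longest if longest else 0.0


def recognize(seq: Sequence[int], templates: Dict[str, List[Sequence[int]]]) -> str:
    """Return the word whose best-matching template is most similar to seq."""
    return max(
        templates,
        key=lambda word: max(similarity(seq, t) for t in templates[word]),
    )


# Hypothetical data: each integer labels the neuron that fired, in time order.
templates = {"one": [[3, 7, 1, 4]], "two": [[5, 2, 9, 6]]}
noisy = [3, 8, 7, 4, 2]  # a "one"-like sequence with inserted and missing spikes
best_word = recognize(noisy, templates)
```

Because the LCS ignores unmatched elements, spikes inserted or deleted by noise lower the score only gradually, which is one plausible reading of why such a measure could be noise tolerant.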


Pages: 523-556

Page count: 34

