Auditory Models and Human Performance in Tasks Related to Speech Coding and Speech Recognition

Cited by: 121
Authors
Ghitza, Oded [1 ]
Affiliation
[1] AT&T Bell Labs, Acoust Res Dept, Murray Hill, NJ 07974 USA
Source
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING | 1994, Vol. 2, No. 1
Keywords
DOI
10.1109/89.260357
CLC classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Auditory models that are capable of achieving human performance in tasks related to speech perception would provide a basis for realizing effective speech processing systems. Saving bits in speech coders, for example, relies on a perceptual tolerance to acoustic deviations from the original speech. Perceptual invariance to adverse signal conditions (noise, microphone and channel distortions, reverberations) and to phonemic variability (due to nonuniqueness of articulatory gestures) may provide a basis for robust speech recognition. A state-of-the-art auditory model that simulates, in considerable detail, the outer parts of the auditory periphery up through the auditory nerve level is described. Speech information is extracted from the simulated auditory nerve firings and used in place of the conventional input to several speech coding and recognition systems. The performance of these systems improves as a result of this replacement, but it still falls short of human performance. The shortcomings occur, in particular, in tasks related to low bit-rate coding and to speech recognition. Since schemes for low bit-rate coding rely on signal manipulations that spread over durations of several tens of ms, and since schemes for speech recognition rely on phonemic/articulatory information that extends over similar time intervals, it is concluded that the shortcomings are due mainly to inadequate modeling of perceptually related integration rules operating over durations of 50-100 ms. These observations suggest a need for a study aimed at understanding how auditory nerve activity is integrated over time intervals of that duration. We discuss preliminary experimental results that confirm human usage of such integration, with different integration rules for different time-frequency regions depending on the phoneme-discrimination task.
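The 50-100 ms temporal integration discussed above can be pictured as frame-averaging the per-channel firing rates produced by an auditory-periphery simulation. The sketch below is purely illustrative and is not the paper's method: the function name, the single fixed window length, and the array layout are assumptions, and a faithful model would apply different integration rules in different time-frequency regions, as the abstract notes.

```python
import numpy as np

def integrate_firing_rates(rates, fs, window_ms=75.0):
    """Average per-channel firing rates over non-overlapping windows.

    rates     : (n_channels, n_samples) array of simulated auditory-nerve
                firing rates (hypothetical layout, one row per cochlear channel)
    fs        : frame rate of the rate estimates, in Hz
    window_ms : integration window; 75 ms sits in the 50-100 ms range
                suggested by the abstract
    """
    win = max(1, int(round(window_ms * 1e-3 * fs)))
    n_channels, n_samples = rates.shape
    n_frames = n_samples // win
    # Drop the trailing partial window, then average within each window.
    trimmed = rates[:, : n_frames * win]
    return trimmed.reshape(n_channels, n_frames, win).mean(axis=2)

# Toy input: 32 channels of random "firing rates" at a 1 kHz frame rate.
rates = np.random.default_rng(0).random((32, 1000))
frames = integrate_firing_rates(rates, fs=1000, window_ms=75)
# 1000 samples // 75-sample windows -> 13 integrated frames per channel.
```

A real study would replace the fixed `mean` with task- and region-dependent rules; this sketch only shows where such rules would plug in.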
Pages: 115-132
Page count: 18