On the perceptual distance between speech segments

Cited by: 2
Authors
Ghitza, O
Sondhi, MM
Affiliation
[1] Acoust. and Aud. Commun. Research, Bell Laboratories, Murray Hill
DOI
10.1121/1.418115
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
For many tasks in speech signal processing it is of interest to develop an objective measure that correlates well with the perceptual distance between speech segments. (Speech segments are defined as pieces of a speech signal of duration 50-150 ms. For concreteness, a segment is taken to mean a diphone, i.e., a segment from the midpoint of one phoneme to the midpoint of the adjacent phoneme.) Such a distance metric would be useful for speech coding at low bit rates. Saving bits in those systems relies on a perceptual tolerance to acoustic perturbations from the original speech, perturbations whose effects typically last for several tens of milliseconds. Such a distance metric would also be useful for automatic speech recognition, on the assumption that perceptual invariance to adverse signal conditions (e.g., noise, microphone and channel distortions, room reverberation) and to phonemic variability (due to nonuniqueness of articulatory gestures) may provide a basis for robust performance. In this paper, attempts at defining such a metric are described. The approach to this question is twofold. First, psychoacoustical experiments relevant to the perception of speech are conducted to measure the relative importance of various time-frequency "tiles" (one at a time) when all other time-frequency information is present. Second, the psychophysical data are used to derive rules for integrating the output of a model of auditory-nerve activity over time and frequency. © 1997 Acoustical Society of America.
Pages: 522-529
Page count: 8