Speech segregation based on sound localization

Citations: 261
Authors
Roman, N [1]
Wang, DL
Brown, GJ
Affiliations
[1] Ohio State Univ, Dept Comp & Informat Sci, Columbus, OH 43210 USA
[2] Ohio State Univ, Ctr Cognit Sci, Columbus, OH 43210 USA
[3] Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England
DOI
10.1121/1.1610463
Chinese Library Classification
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
At a cocktail party, one can selectively attend to a single voice and filter out all the other acoustical interferences. How to simulate this perceptual ability remains a great challenge. This paper describes a novel, supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial localization cues: interaural time differences (ITD) and interaural intensity differences (IID). Motivated by the auditory masking effect, the notion of an "ideal" time-frequency binary mask is suggested, which selects the target if it is stronger than the interference in a local time-frequency (T-F) unit. It is observed that within a narrow frequency band, modifications to the relative strength of the target source with respect to the interference trigger systematic changes for estimated ITD and IID. For a given spatial configuration, this interaction produces characteristic clustering in the binaural feature space. Consequently, pattern classification is performed in order to estimate ideal binary masks. A systematic evaluation in terms of signal-to-noise ratio as well as automatic speech recognition performance shows that the resulting system produces masks very close to ideal binary ones. A quantitative comparison shows that the model yields significant improvement in performance over an existing approach. Furthermore, under certain conditions the model produces large speech intelligibility improvements with normal listeners. (C) 2003 Acoustical Society of America.
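The "ideal" binary mask described in the abstract can be illustrated with a minimal sketch. It assumes the per-unit energies of the target and the interference are known separately before mixing. The function name `ideal_binary_mask`, the toy energy values, and the strict "greater than" rule for ties are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def ideal_binary_mask(target_energy, interference_energy):
    """Return 1 for each T-F unit where the target is stronger, else 0.

    Both inputs are arrays of shape (frequency channels, time frames)
    holding the energy of each source in every time-frequency unit.
    """
    target_energy = np.asarray(target_energy, dtype=float)
    interference_energy = np.asarray(interference_energy, dtype=float)
    if target_energy.shape != interference_energy.shape:
        raise ValueError("energy arrays must have the same shape")
    # Assumption: ties go to the interference (strict comparison).
    return (target_energy > interference_energy).astype(int)


# Hypothetical toy energies: 2 frequency channels x 3 time frames.
target = np.array([[4.0, 0.5, 2.0],
                   [0.1, 3.0, 1.0]])
interference = np.array([[1.0, 2.0, 2.0],
                         [0.2, 0.5, 4.0]])

mask = ideal_binary_mask(target, interference)
print(mask)
```

In the system the abstract describes, the premixed energies are not available at run time, so the mask is estimated by pattern classification on the ITD and IID features; the sketch only shows the definition that the estimate aims to match.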
Pages: 2236-2252
Page count: 17