Monaural speech separation and recognition challenge

Times Cited: 139
Authors
Cooke, Martin [1 ,2 ]
Hershey, John R. [3 ]
Rennie, Steven J. [3 ]
Affiliations
[1] Univ Basque Country, Dept Elect & Elect, Fac Ciencias & Tecnol, Leioa 48940, Spain
[2] Ikerbasque Basque Sci Fdn, Bilbao 48011, Bizkaia, Spain
[3] IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA
Keywords
Speech recognition; Speech separation; Speaker identification; Simultaneous speech; Auditory scene analysis; Noise robustness; Robust; Masking
DOI
10.1016/j.csl.2009.02.006
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Robust speech recognition in everyday conditions requires the solution to a number of challenging problems, not least the ability to handle multiple sound sources. The specific case of speech recognition in the presence of a competing talker has been studied for several decades, resulting in a number of quite distinct algorithmic solutions whose focus ranges from modeling both target and competing speech to speech separation using auditory grouping principles. The purpose of the monaural speech separation and recognition challenge was to permit a large-scale comparison of techniques for the competing talker problem. The task was to identify keywords in sentences spoken by a target talker when mixed into a single channel with a background talker speaking similar sentences. Ten independent sets of results were contributed, alongside a baseline recognition system. Performance was evaluated using common training and test data and common metrics. Listeners' performance in the same task was also measured. This paper describes the challenge problem, compares the performance of the contributed algorithms, and discusses the factors which distinguish the systems. One highlight of the comparison was the finding that several systems achieved near-human performance in some conditions, and one out-performed listeners overall. (C) 2009 Elsevier Ltd. All rights reserved.
Pages: 1-15 (15 pages)