Whisper Intelligibility Enhancement Using a Supervised Learning Approach

Cited by: 7
Authors
Zhou, Jian [1 ,2 ]
Liang, Ruiyu [1 ,3 ]
Zhao, Li [1 ]
Zou, Cairong [1 ]
Affiliations
[1] Southeast Univ, Key Lab Underwater Acoust Signal Proc, Minist Educ, Nanjing 210096, Jiangsu, Peoples R China
[2] Anhui Univ, Sch Comp Sci & Technol, Hefei 230601, Peoples R China
[3] Hohai Univ, Coll Comp & Informat, Nanjing 210098, Jiangsu, Peoples R China
Funding
Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education of China;
Keywords
Whisper; Intelligibility enhancement; Machine learning; SPEECH-INTELLIGIBILITY; FREQUENCY-MODULATION; NOISE; RECOGNITION; DERIVATION;
DOI
10.1007/s00034-012-9415-0
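Chinese Library Classification
TM [Electrical engineering]; TN [Electronics and communication technology];
Discipline classification codes
0808; 0809;
Abstract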
Whispered speech can be effectively used for quiet and private communication over mobile phones. It is also the means of communication for laryngectomized patients under a regime of voice rest. However, little progress has been made on the enhancement of whispered speech because of its special acoustic characteristics. Recent studies with normal-hearing listeners have reported large gains in speech intelligibility with the binary mask approach. This method retains the time-frequency (T-F) units of the mixture signal in which the target is stronger than the interfering noise (masker) and removes the T-F units where the interfering noise dominates. In this paper, a supervised learning method to enhance whispered speech is introduced. A binary mask estimated by a two-class SVM classifier is used to synthesize the enhanced whisper. The amplitude modulation spectrum (AMS) and frequency modulation spectrum (FMS) are extracted as input to the SVM. Speech corrupted at low signal-to-noise ratio (SNR) levels with different types of maskers is enhanced by this method and presented to normal-hearing listeners for word identification. Experimental evidence from the listening tests indicated substantial improvements in intelligibility over that attained by human listeners with unprocessed stimuli.
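The binary mask idea described in the abstract can be illustrated with a minimal sketch. It assumes the target and masker powers are known for every T-F unit (an "ideal" or oracle mask); in the paper the mask is instead estimated by an SVM from AMS and FMS features, which is not reproduced here. The function names, the 0 dB threshold, and the toy values are hypothetical and chosen only for illustration.

```python
import math

def ideal_binary_mask(target_power, masker_power, threshold_db=0.0):
    """Return 1.0 for T-F units whose local SNR exceeds threshold_db, else 0.0.

    Assumption: target and masker powers are available separately
    (oracle case), unlike the SVM-estimated mask in the paper.
    """
    eps = 1e-12  # avoids division by zero and log of zero
    mask = []
    for t_row, m_row in zip(target_power, masker_power):
        row = []
        for t, m in zip(t_row, m_row):
            local_snr_db = 10.0 * math.log10((t + eps) / (m + eps))
            row.append(1.0 if local_snr_db > threshold_db else 0.0)
        mask.append(row)
    return mask

def apply_mask(mixture_tf, mask):
    """Keep mixture T-F units where the mask is 1.0 and zero out the rest."""
    return [[x * g for x, g in zip(x_row, g_row)]
            for x_row, g_row in zip(mixture_tf, mask)]

# Toy 2x2 grid (rows: frequency channels, columns: time frames).
target_power = [[4.0, 0.1], [0.5, 9.0]]
masker_power = [[1.0, 1.0], [1.0, 1.0]]
mixture_tf = [[2.2, 1.0], [1.2, 3.1]]

mask = ideal_binary_mask(target_power, masker_power)
enhanced_tf = apply_mask(mixture_tf, mask)
```

Here the units where the target power exceeds the masker power are retained and the others are set to zero; resynthesis of a waveform from the masked T-F representation is omitted.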
Pages: 2061-2074
Page count: 14
Related papers
24 in total
[1]  
[Anonymous], 2000, NATURE STAT LEARNING, DOI 10.1007/978-1-4757-3264-1
[2]  
Bregman AS., 1994, AUDITORY SCENE ANAL
[3]   Isolating the energetic component of speech-on-speech masking with ideal time-frequency segregation [J].
Brungart, Douglas S. ;
Chang, Peter S. ;
Simpson, Brian D. ;
Wang, DeLiang .
JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2006, 120 (06) :4007-4018
[4]   Frequency modulation detection in cochlear implant subjects [J].
Chen, HB ;
Zeng, FG .
JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2004, 116 (04) :2269-2277
[5]   Speech enhancement for non-stationary noise environments [J].
Cohen, I ;
Berdugo, B .
SIGNAL PROCESSING, 2001, 81 (11) :2403-2418
[6]   The auditory organization of speech and other sources in listeners and computational models [J].
Cooke, M ;
Ellis, DPW .
SPEECH COMMUNICATION, 2001, 35 (3-4) :141-177
[7]   DERIVATION OF AUDITORY FILTER SHAPES FROM NOTCHED-NOISE DATA [J].
GLASBERG, BR ;
MOORE, BCJ .
HEARING RESEARCH, 1990, 47 (1-2) :103-138
[8]   A comparative intelligibility study of single-microphone noise reduction algorithms [J].
Hu, Yi ;
Loizou, Philipos C. .
JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2007, 122 (03) :1777-1786
[9]   Analysis and recognition of whispered speech [J].
Ito, T ;
Takeda, K ;
Itakura, F .
SPEECH COMMUNICATION, 2005, 45 (02) :139-152
[10]   Improving Speech Intelligibility in Noise Using Environment-Optimized Algorithms [J].
Kim, Gibak ;
Loizou, Philipos C. .
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2010, 18 (08) :2080-2090