Complex Ratio Masking for Monaural Speech Separation

Cited by: 642

Authors:

Williamson, Donald S. [1]

Wang, Yuxuan [1,2]

Wang, DeLiang [1,3]

Affiliations:

[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA

[2] Google Inc, Mountain View, CA 94043 USA

[3] Ohio State Univ, Ctr Cognit & Brain Sci, Columbus, OH 43210 USA

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2016 / Vol. 24 / No. 3

Keywords:

Complex ideal ratio mask; deep neural networks; speech quality; speech separation; NORMAL-HEARING; QUALITY; PHASE; NOISE; INTELLIGIBILITY; ALGORITHM;

DOI:

10.1109/TASLP.2015.2512042

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206; 082403;

Abstract:

Speech separation systems usually operate on the short-time Fourier transform (STFT) of noisy speech, and enhance only the magnitude spectrum while leaving the phase spectrum unchanged. This is done because there was a belief that the phase spectrum is unimportant for speech enhancement. Recent studies, however, suggest that phase is important for perceptual quality, leading some researchers to consider magnitude and phase spectrum enhancements. We present a supervised monaural speech separation approach that simultaneously enhances the magnitude and phase spectra by operating in the complex domain. Our approach uses a deep neural network to estimate the real and imaginary components of the ideal ratio mask defined in the complex domain. We report separation results for the proposed method and compare them to related systems. The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and a listening test where subjects prefer the proposed approach with at least a 69% rate.
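The idea of a ratio mask defined in the complex domain can be illustrated with a minimal sketch. It assumes the mask is the element-wise complex quotient of the clean and noisy STFTs, which is one reading of the abstract and not a formula stated in this record. The function names, the `eps` stabilizer, and the random test spectra are all hypothetical, and the deep neural network that would estimate the two mask components is not shown.

```python
import numpy as np


def complex_ratio_mask(clean_stft, noisy_stft, eps=1e-12):
    """Real and imaginary parts of M = S / Y for each time-frequency unit.

    Assumption: the complex-domain mask is the element-wise quotient of the
    clean STFT S and the noisy STFT Y. `eps` is a hypothetical stabilizer
    that guards against division by zero.
    """
    y_r, y_i = noisy_stft.real, noisy_stft.imag
    s_r, s_i = clean_stft.real, clean_stft.imag
    denom = y_r ** 2 + y_i ** 2 + eps
    m_r = (y_r * s_r + y_i * s_i) / denom  # real component of S / Y
    m_i = (y_r * s_i - y_i * s_r) / denom  # imaginary component of S / Y
    return m_r, m_i


def apply_mask(m_r, m_i, noisy_stft):
    """Complex multiplication of the mask with the noisy STFT."""
    return (m_r + 1j * m_i) * noisy_stft


# Synthetic spectra stand in for real STFTs (illustrative data only).
rng = np.random.default_rng(0)
shape = (5, 4)  # (frequency bins, frames)
clean = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noisy = clean + 0.5 * noise

m_r, m_i = complex_ratio_mask(clean, noisy)
recovered = apply_mask(m_r, m_i, noisy)
max_error = float(np.max(np.abs(recovered - clean)))
print(max_error)
```

Because the mask is complex-valued, multiplying it with the noisy STFT changes both magnitude and phase, which is what distinguishes it from a magnitude-only mask. In this sketch the ideal mask recovers the clean spectrum up to the small error introduced by `eps`.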

Pages: 483-492

Page count: 10

