On Training Targets for Supervised Speech Separation

Cited by: 913

Authors:

Wang, Yuxuan [1]

Narayanan, Arun [1]

Wang, DeLiang [1,2]

Affiliations:

[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA

[2] Ohio State Univ, Ctr Cognit & Brain Sci, Columbus, OH 43210 USA

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2014 / Vol. 22 / No. 12

Keywords:

Deep neural networks; speech separation; supervised learning; training targets; BINARY; NOISE; INTELLIGIBILITY; RECOGNITION; ALGORITHM;

DOI:

10.1109/TASLP.2014.2352935

Chinese Library Classification number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Formulation of speech separation as a supervised learning problem has shown considerable promise. In its simplest form, a supervised learning algorithm, typically a deep neural network, is trained to learn a mapping from noisy features to a time-frequency representation of the target of interest. Traditionally, the ideal binary mask (IBM) is used as the target because of its simplicity and large speech intelligibility gains. The supervised learning framework, however, is not restricted to the use of binary targets. In this study, we evaluate and compare separation results by using different training targets, including the IBM, the target binary mask, the ideal ratio mask (IRM), the short-time Fourier transform spectral magnitude and its corresponding mask (FFT-MASK), and the Gammatone frequency power spectrum. Our results in various test conditions reveal that the two ratio mask targets, the IRM and the FFT-MASK, outperform the other targets in terms of objective intelligibility and quality metrics. In addition, we find that masking based targets, in general, are significantly better than spectral envelope based targets. We also present comparisons with recent methods in non-negative matrix factorization and speech enhancement, which show clear performance advantages of supervised speech separation.
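The abstract names the IBM and the IRM as training targets but does not define them. As a rough illustration, the sketch below computes both masks from premixed speech and noise power values on a time-frequency grid. The definitions used here (a local-SNR threshold for the IBM, a compressed energy ratio for the IRM), the threshold `lc_db`, the exponent `beta`, and the function names are assumptions made for this example and are not taken from this record.

```python
import math


def ideal_binary_mask(speech_pow, noise_pow, lc_db=0.0):
    """Return 1 where the local SNR in dB exceeds lc_db, else 0.

    speech_pow and noise_pow are lists of rows (time frames), each row a
    list of per-channel power values. lc_db is an assumed threshold.
    """
    mask = []
    for s_row, n_row in zip(speech_pow, noise_pow):
        row = []
        for s, n in zip(s_row, n_row):
            if s <= 0.0:
                snr_db = -math.inf      # no speech energy in this unit
            elif n <= 0.0:
                snr_db = math.inf       # no noise energy in this unit
            else:
                snr_db = 10.0 * math.log10(s / n)
            row.append(1 if snr_db > lc_db else 0)
        mask.append(row)
    return mask


def ideal_ratio_mask(speech_pow, noise_pow, beta=0.5):
    """Return (S / (S + N)) ** beta per unit; beta is an assumed exponent."""
    mask = []
    for s_row, n_row in zip(speech_pow, noise_pow):
        row = []
        for s, n in zip(s_row, n_row):
            total = s + n
            row.append((s / total) ** beta if total > 0.0 else 0.0)
        mask.append(row)
    return mask


# Toy 2-frame, 2-channel example with made-up power values.
speech = [[4.0, 1.0], [0.0, 9.0]]
noise = [[1.0, 1.0], [2.0, 1.0]]
ibm = ideal_binary_mask(speech, noise)
irm = ideal_ratio_mask(speech, noise)
```

In this toy example the binary mask keeps only the units where speech power strictly exceeds noise power, while the ratio mask assigns every unit a soft value between 0 and 1.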


Pages: 1849-1858

Page count: 10

