Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation

Cited by: 399

Authors:

Ephrat, Ariel [1,2]

Mosseri, Inbar [1]

Lang, Oran [1]

Dekel, Tali [1]

Wilson, Kevin [1]

Hassidim, Avinatan [1]

Freeman, William T. [1]

Rubinstein, Michael [1]

Affiliations:

[1] Google Research, Mountain View, CA 94043, USA

[2] Hebrew University of Jerusalem, Jerusalem, Israel

Source:

ACM TRANSACTIONS ON GRAPHICS | 2018 / Vol. 37 / No. 04

Keywords:

Audio-Visual; Source Separation; Speech Enhancement; Deep Learning; CNN; BLSTM

DOI:

10.1145/3197517.3201357

Chinese Library Classification (CLC) number:

TP31 [Computer Software]

Discipline classification codes:

081202; 0835

Abstract:

We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to "focus" the audio on desired speakers in a scene and to improve the speech separation quality. To train our joint audio-visual model, we introduce AVSpeech, a new dataset comprised of thousands of hours of video segments from the Web. We demonstrate the applicability of our method to classic speech separation tasks, as well as real-world scenarios involving heated interviews, noisy bars, and screaming children, only requiring the user to specify the face of the person in the video whose speech they want to isolate. Our method shows clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech. In addition, our model, which is speaker-independent (trained once, applicable to any speaker), produces better results than recent audio-visual speech separation methods that are speaker-dependent (require training a separate model for each speaker of interest).

Pages: 11
