Visual Speech In Real Noisy Environments (VISION): A Novel Benchmark Dataset and Deep Learning-based Baseline System

Cited by: 19
Authors
Gogate, Mandar [1 ]
Dashtipour, Kia [1 ]
Hussain, Amir [1 ]
Affiliations
[1] Edinburgh Napier University, School of Computing, Edinburgh, Midlothian, Scotland
Source
INTERSPEECH 2020, 2020
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
Speech Enhancement; Audio-Visual Fusion; VISION Corpus; Deep Learning; Multi-modal Speech Processing; Listening Tests;
DOI
10.21437/Interspeech.2020-2935
CLC classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline codes
100104; 100213
Abstract
In this paper, we present VIsual Speech In real nOisy eNvironments (VISION), a first-of-its-kind audio-visual (AV) corpus comprising 2500 utterances from 209 speakers, recorded in real noisy environments including social gatherings, streets, cafeterias and restaurants. While a number of speech enhancement frameworks that exploit AV cues have been proposed in the literature, there are no visual speech corpora recorded in real environments with a sufficient variety of speakers to enable evaluation of AV frameworks' generalisation capability across a wide range of visual and acoustic background noise. The main purpose of our AV corpus is to foster research in the area of AV signal processing and to provide a benchmark corpus that can be used for reliable evaluation of AV speech enhancement systems in everyday noisy settings. In addition, we present a baseline deep neural network (DNN) based spectral mask estimation model for speech enhancement. Comparative simulation results with subjective listening tests demonstrate significant performance improvement of the baseline DNN compared to state-of-the-art speech enhancement approaches.
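To make the spectral mask estimation idea concrete, the sketch below shows the general audio-only technique: a small network predicts a time-frequency ratio mask from the noisy magnitude spectrogram, which is then applied to the noisy STFT before resynthesis. This is a minimal illustration, not the paper's baseline system; the network shape, STFT parameters (N_FFT, HOP) and the names MaskEstimator and enhance are assumptions for illustration, and the visual stream used by the actual AV model is omitted.

```python
# Illustrative sketch of spectral-mask-based speech enhancement
# (assumed, simplified audio-only setup; not the VISION baseline).
import torch
import torch.nn as nn

N_FFT, HOP = 512, 128            # assumed STFT parameters, not from the paper
N_BINS = N_FFT // 2 + 1

class MaskEstimator(nn.Module):
    """Toy mask estimator: per-frame MLP producing a ratio mask in [0, 1]."""
    def __init__(self, n_bins=N_BINS, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bins, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bins), nn.Sigmoid(),
        )

    def forward(self, mag):                  # mag: (frames, n_bins)
        return self.net(mag)                 # mask: (frames, n_bins)

def enhance(noisy_wav, model, n_fft=N_FFT, hop=HOP):
    """Estimate a spectral mask and apply it to a 1-D noisy waveform."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(noisy_wav, n_fft, hop_length=hop, window=window,
                      return_complex=True)   # (n_bins, frames), complex
    mag = spec.abs().T                       # (frames, n_bins) magnitudes
    mask = model(mag).T                      # back to (n_bins, frames)
    enhanced_spec = spec * mask              # mask the noisy STFT
    return torch.istft(enhanced_spec, n_fft, hop_length=hop, window=window,
                       length=noisy_wav.shape[-1])

if __name__ == "__main__":
    # Usage: run one second of random "noisy" 16 kHz audio through an
    # untrained model, just to show the shapes and data flow.
    model = MaskEstimator()
    noisy = torch.randn(16000)
    clean_estimate = enhance(noisy, model)
    print(clean_estimate.shape)              # torch.Size([16000])
```

In a trained system the mask estimator would be optimised against clean/noisy pairs (e.g. with an MSE loss on the masked spectrogram), and an AV variant would additionally condition the mask on lip-region video features.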
Pages: 4521-4525
Number of pages: 5