A deep architecture for audio-visual voice activity detection in the presence of transients

Citations: 23
Authors
Ariav, Ido [1]
Dov, David [1]
Cohen, Israel [1]
Affiliations
[1] Technion Israel Inst Technol, Andrew & Erna Viterbi Fac Elect Engn, IL-32000 Haifa, Israel
Funding
Israel Science Foundation
Keywords
Audio-visual speech processing; Voice activity detection; Auto-encoder; Recurrent neural networks; STATISTICAL-MODELS; SPEECH INFORMATION; NEURAL-NETWORKS; OPTICAL-FLOW; REPRESENTATIONS; NOISE;
DOI
10.1016/j.sigpro.2017.07.006
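Chinese Library Classification
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
080906 [Electromagnetic information functional materials and structures]; 082806 [Agricultural information and electrical engineering]
Abstract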
We address the problem of voice activity detection in difficult acoustic environments including high levels of noise and transients, which are common in real-life scenarios. We consider a multimodal setting, in which the speech signal is captured by a microphone, and a video camera is pointed at the face of the desired speaker. Accordingly, speech detection translates to the question of how to properly fuse the audio and video signals, which we address within the framework of deep learning. Specifically, we present a neural network architecture based on a variant of auto-encoders, which combines the two modalities and provides a new representation of the signal in which the effect of interferences is reduced. To further encode differences between the dynamics of speech and interfering transients, the signal in this new representation is fed into a recurrent neural network, which is trained in a supervised manner for speech detection. Experimental results demonstrate improved performance of the proposed deep architecture compared to competing multimodal detectors. (C) 2017 Elsevier B.V. All rights reserved.
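The two-stage data flow described in the abstract (a fused audio-visual representation, followed by a recurrent network that outputs a per-frame speech decision) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the single-layer tanh encoder, the Elman-style recurrence, all dimensions, the random untrained weights, and every function name (`encode_frame`, `rnn_speech_probs`, `demo`) are assumptions made for illustration only. The abstract does not specify the auto-encoder variant, the layer sizes, or the training procedure, so none of those are reproduced here.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a real system would learn these."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def encode_frame(audio, video, W_a, W_v):
    """Hypothetical shared bottleneck: one code computed from both modalities."""
    za = matvec(W_a, audio)
    zv = matvec(W_v, video)
    return [math.tanh(a + v) for a, v in zip(za, zv)]


def rnn_speech_probs(codes, W_in, W_rec, w_out):
    """Elman-style recurrence over fused codes; one speech probability per frame."""
    h = [0.0] * len(W_rec)
    probs = []
    for c in codes:
        pre = [i + r for i, r in zip(matvec(W_in, c), matvec(W_rec, h))]
        h = [math.tanh(p) for p in pre]
        logit = sum(w * hi for w, hi in zip(w_out, h))
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs


def demo(n_frames=5, audio_dim=8, video_dim=6, code_dim=4, state_dim=3, seed=0):
    """Run the untrained pipeline on synthetic frames (all sizes are assumed)."""
    rng = random.Random(seed)
    W_a = rand_matrix(code_dim, audio_dim, rng)
    W_v = rand_matrix(code_dim, video_dim, rng)
    W_in = rand_matrix(state_dim, code_dim, rng)
    W_rec = rand_matrix(state_dim, state_dim, rng)
    w_out = [rng.uniform(-0.1, 0.1) for _ in range(state_dim)]
    frames = [
        ([rng.gauss(0, 1) for _ in range(audio_dim)],
         [rng.gauss(0, 1) for _ in range(video_dim)])
        for _ in range(n_frames)
    ]
    codes = [encode_frame(a, v, W_a, W_v) for a, v in frames]
    return rnn_speech_probs(codes, W_in, W_rec, w_out)
```

Calling `demo()` returns one value in (0, 1) per frame. Because the weights are untrained, the values carry no meaning as detections; the sketch only shows how the recurrent state lets each frame's decision depend on the frames before it, which is the property the abstract relies on to separate speech from transients.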
Pages: 69-74
Page count: 6