Realistic Speech-Driven Facial Animation with GANs

Cited by: 7
Authors
Konstantinos Vougioukas
Stavros Petridis
Maja Pantic
Affiliations
[1] Imperial College London, Department of Computing
[2] Samsung AI Research Centre Cambridge
Source
International Journal of Computer Vision | 2020 / Vol. 128
Keywords
Generative modelling; Face generation; Speech-driven animation
DOI
Not available
Abstract
Speech-driven facial animation is the process of automatically synthesizing talking characters from speech signals. Most work in this domain learns a mapping from audio features to visual features, an approach that often requires post-processing with computer graphics techniques to produce realistic, albeit subject-dependent, results. We present an end-to-end system that generates videos of a talking head from only a still image of a person and an audio clip containing speech, without relying on handcrafted intermediate features. Our method generates videos with (a) lip movements that are in sync with the audio and (b) natural facial expressions such as blinks and eyebrow movements. Our temporal GAN uses three discriminators focused on achieving detailed frames, audio-visual synchronization, and realistic expressions. We quantify the contribution of each component in our model with an ablation study, and we provide insights into the model's latent representation. The generated videos are evaluated for sharpness, reconstruction quality, lip-reading accuracy, and synchronization, as well as their ability to produce natural blinks.
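The abstract describes a temporal GAN trained against three discriminators, one per target property: per-frame detail, audio-visual synchronization, and realistic temporal dynamics. The following minimal PyTorch sketch shows what the generator side of such a three-discriminator adversarial objective could look like. All module architectures, tensor shapes, and the equal weighting of the three loss terms are illustrative assumptions, not the authors' published implementation.

```python
import torch
import torch.nn as nn

# Assumed dimensions (illustrative): frames, height, width, audio-feature size.
T, H, W, A = 16, 64, 64, 128

class TinyDisc(nn.Module):
    """Placeholder discriminator: flattens its input and emits a realism logit."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x.flatten(1))

# Three critics, one per property the abstract targets:
frame_disc = TinyDisc(3 * H * W)        # (a) detailed, sharp individual frames
sync_disc = TinyDisc(3 * H * W + A)     # (b) audio-visual synchronization
seq_disc = TinyDisc(T * 3 * H * W)      # (c) realistic expression dynamics (blinks etc.)

bce = nn.BCEWithLogitsLoss()

def generator_adversarial_loss(fake_video, audio_feats):
    """fake_video: (B, T, 3, H, W); audio_feats: (B, T, A). Generator-side loss only."""
    B = fake_video.size(0)
    ones = torch.ones(B, 1)  # generator wants every critic to output "real"
    t = torch.randint(T, (1,)).item()  # sample one frame for the per-frame critics
    # Frame discriminator scores a single frame.
    l_frame = bce(frame_disc(fake_video[:, t]), ones)
    # Sync discriminator scores a frame paired with its aligned audio features.
    pair = torch.cat([fake_video[:, t].flatten(1), audio_feats[:, t]], dim=1)
    l_sync = bce(sync_disc(pair), ones)
    # Sequence discriminator scores the whole clip for plausible dynamics.
    l_seq = bce(seq_disc(fake_video), ones)
    return l_frame + l_sync + l_seq  # equal weighting is an assumption

# Smoke test with random tensors:
video = torch.randn(2, T, 3, H, W)
audio = torch.randn(2, T, A)
print(generator_adversarial_loss(video, audio))
```

In a full training loop each discriminator would also be updated on real clips with the opposite label; the sketch shows only the generator's adversarial objective, to make the role of the three critics explicit.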
Pages: 1398–1413
Number of pages: 15