Deep Video Portraits

Cited by: 353
Authors
Kim, Hyeongwoo [1]
Garrido, Pablo [2]
Tewari, Ayush [1]
Xu, Weipeng [1]
Thies, Justus [3]
Niessner, Matthias [3]
Perez, Patrick [2]
Richardt, Christian [4]
Zollhofer, Michael [5]
Theobalt, Christian [1]
Affiliations
[1] Max Planck Inst Informat, Campus E1-4, D-66123 Saarbrucken, Germany
[2] Technicolor, 975 Ave Champs Blancs, F-35576 Cesson Sevigne, France
[3] Tech Univ Munich, Boltzmannstr 3, D-85748 Garching, Germany
[4] Univ Bath, Bath BA2 7AY, Avon, England
[5] Stanford Univ, 353 Serra Mall, Stanford, CA 94305 USA
Source
ACM TRANSACTIONS ON GRAPHICS | 2018 / Vol. 37 / No. 04
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK
Keywords
Facial Reenactment; Video Portraits; Dubbing; Deep Learning; Conditional GAN; Rendering-to-Video Translation; 3D Face Reconstruction
DOI
10.1145/3197517.3201283
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Subject classification codes
081202; 0835
Abstract
We present a novel approach that enables photo-realistic re-animation of portrait videos using only an input video. In contrast to existing approaches that are restricted to manipulations of facial expressions only, we are the first to transfer the full 3D head position, head rotation, face expression, eye gaze, and eye blinking from a source actor to a portrait video of a target actor. The core of our approach is a generative neural network with a novel space-time architecture. The network takes as input synthetic renderings of a parametric face model, based on which it predicts photo-realistic video frames for a given target actor. The realism in this rendering-to-video transfer is achieved by careful adversarial training, and as a result, we can create modified target videos that mimic the behavior of the synthetically-created input. In order to enable source-to-target video re-animation, we render a synthetic target video with the reconstructed head animation parameters from a source video, and feed it into the trained network - thus taking full control of the target. With the ability to freely recombine source and target parameters, we are able to demonstrate a large variety of video rewrite applications without explicitly modeling hair, body or background. For instance, we can reenact the full head using interactive user-controlled editing, and realize high-fidelity visual dubbing. To demonstrate the high quality of our output, we conduct an extensive series of experiments and evaluations, where for instance a user study shows that our video edits are hard to detect.
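The parameter recombination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `FaceParams` container, its field names, and the `recombine` helper are hypothetical, and the `target_specific` field is an assumed placeholder for whatever parameters stay with the target actor. The sketch only shows the idea of taking head position, head rotation, expression, eye gaze, and eye blinking from the source frame while leaving everything else from the target untouched, before any rendering or network inference takes place.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class FaceParams:
    """Hypothetical per-frame parameters of a parametric face model."""
    head_position: Tuple[float, float, float]
    head_rotation: Tuple[float, float, float]
    expression: Tuple[float, ...]
    eye_gaze: Tuple[float, float]
    eye_blink: float
    # Assumption: placeholder for everything that stays with the target actor.
    target_specific: str


def recombine(source: FaceParams, target: FaceParams) -> FaceParams:
    """Copy the transferable parameters from source; keep the rest from target."""
    return replace(
        target,
        head_position=source.head_position,
        head_rotation=source.head_rotation,
        expression=source.expression,
        eye_gaze=source.eye_gaze,
        eye_blink=source.eye_blink,
    )


source_frame = FaceParams((0.1, 0.0, 1.0), (5.0, -10.0, 0.0), (0.3, 0.7), (0.2, -0.1), 1.0, "source actor")
target_frame = FaceParams((0.0, 0.0, 1.2), (0.0, 0.0, 0.0), (0.0, 0.0), (0.0, 0.0), 0.0, "target actor")
driving_frame = recombine(source_frame, target_frame)
```

In the pipeline the abstract outlines, a frame such as `driving_frame` would then be rendered synthetically and passed to the trained network; both of those steps are outside the scope of this sketch.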
Pages: 14