Deep Appearance Models for Face Rendering

STEPHEN LOMBARDI, Facebook Reality Labs
JASON SARAGIH, Facebook Reality Labs
TOMAS SIMON, Facebook Reality Labs
YASER SHEIKH, Facebook Reality Labs

Fig. 1. Our model jointly encodes and decodes geometry and view-dependent appearance into a latent code z, from data captured from a multi-camera rig, enabling highly realistic data-driven facial rendering. We use this rich data to drive our avatars from cameras mounted on a head-mounted display (HMD). We do this by creating synthetic HMD images through image-based rendering, and using another variational autoencoder to learn a common representation y of real and synthetic HMD images. We then regress from y to the latent rendering code z and decode into mesh and texture to render. Our method enables high-fidelity social interaction in virtual reality.

We introduce a deep appearance model for rendering the human face. Inspired by Active Appearance Models, we develop a data-driven rendering pipeline that learns a joint representation of facial geometry and appearance from a multiview capture setup. Vertex positions and view-specific textures are modeled using a deep variational autoencoder that captures complex nonlinear effects while producing a smooth and compact latent representation. View-specific texture enables the modeling of view-dependent effects such as specularity. In addition, it can also correct for imperfect geometry stemming from biased or low resolution estimates. This is a significant departure from the traditional graphics pipeline, which requires highly accurate geometry as well as all elements of the shading model to achieve realism through physically-inspired light transport. Acquiring such a high level of accuracy is difficult in practice, especially for complex and intricate parts of the face, such as eyelashes and the oral cavity. These are handled naturally by our approach, which does not rely on precise estimates of geometry. Instead, the shading model accommodates deficiencies in geometry through the flexibility afforded by the neural network employed. At inference time, we condition the decoding network on the viewpoint of the camera in order to generate the appropriate texture for rendering. The resulting system can be implemented simply using existing rendering engines through dynamic textures with flat lighting. This representation, together with a novel unsupervised technique for mapping images to facial states, results in a system that is naturally suited to real-time interactive settings such as Virtual Reality (VR).

Authors' addresses: Stephen Lombardi, Facebook Reality Labs, Pittsburgh, PA, stephen.lombardi@fb.com; Jason Saragih, Facebook Reality Labs, Pittsburgh, PA, jason.saragih@fb.com; Tomas Simon, Facebook Reality Labs, Pittsburgh, PA, tomas.simon@fb.com; Yaser Sheikh, Facebook Reality Labs, Pittsburgh, PA, yaser.sheikh@fb.com.

CCS Concepts: Computing methodologies → Image-based rendering; Neural networks; Virtual reality; Mesh models.

Additional Key Words and Phrases: Appearance Models, Image-based Rendering, Deep Appearance Models, Face Rendering

ACM Reference Format: Stephen Lombardi, Jason Saragih, Tomas Simon, and Yaser Sheikh. 2018. Deep Appearance Models for Face Rendering. ACM Trans. Graph. 37, 4, Article 68 (August 2018), 13 pages.

This work is licensed under a Creative Commons Attribution 4.0 International License.
1 INTRODUCTION

With the advent of modern Virtual Reality (VR) headsets, there is a need for improved real-time computer graphics models for enhanced immersion. Traditional computer graphics techniques are

capable of creating highly realistic renders of static scenes, but at a high computational cost. These models also depend on physically accurate estimates of geometry and shading model components. When high precision estimates are difficult to acquire, degradation in perceptual quality can be pronounced and difficult to control and anticipate. Finally, photo-realistic rendering of dynamic scenes in real time remains a challenge.

Rendering the human face is particularly challenging. Humans are social animals that have evolved to express and read emotions in each other's faces [Ekman 1980]. As a result, humans are extremely sensitive to rendering artifacts, which gives rise to the well-known uncanny valley in photo-realistic facial rendering and animation. The source of these artifacts can be attributed to deficiencies both in the rendering model and in its parameters. This problem is exacerbated in real-time applications, such as in VR, where a limited compute budget necessitates approximations in the light-transport models used to render the face. Common examples include: material properties of the face that are complex and time-consuming to simulate; fine geometric structures, such as eyelashes, pores, and vellus hair, that are difficult to model; and subtle dynamics from motion, particularly during speech. Although compelling synthetic renderings of faces have been demonstrated in the movie industry, these models require significant manual cleanup [Lewis et al. 2014], and often frame-by-frame adjustment is employed to alleviate deficiencies of the underlying model. This manual process makes them unsuitable for real-time applications.

The primary challenge posed by the traditional graphics pipeline stems from its physics-inspired model of light transport. For a specific known object class, like a human face, it is possible to circumvent this generic representation and instead directly address the problem of generating images of the object using image statistics from a set of examples. This is the approach taken in Active Appearance Models (AAMs) [Cootes et al. 2001], which synthesize images of the face by employing a joint linear model of both texture and geometry (typically a small set of 2D points), learned from a collection of face images. Through a process coined analysis-by-synthesis, AAMs were originally designed to infer the underlying geometry of the face in unseen images by optimizing the parameters of the linear model so that the synthesized result best matches the target image. Thus, the most important property of AAMs is their ability to accurately synthesize face images while being sufficiently constrained to only generate plausible faces.

Although AAMs have been largely supplanted by direct regression methods for geometry registration problems [Kazemi and Sullivan 2014; Xiong and De la Torre Frade 2013], their synthesis capabilities are of interest due to two factors. First, they can generate highly realistic face images from sparse 2D geometry. This is in contrast to physics-inspired face rendering that requires accurate, high-resolution geometry and material properties. Second, the perceptual quality of AAM synthesis degrades gracefully. This factor is instrumental in a number of perceptual experiments that rely on the uncanny valley being overcome [Boker et al. 2011; Cassidy et al. 2016].

Inspired by the AAM, in this work we develop a data-driven representation of the face that jointly models variations in sparse 3D geometry and view-dependent texture. Departing from the linear models employed in AAMs, we use a deep conditional variational autoencoder [Kingma and Welling 2013] (CVAE) to learn the latent embedding of facial states and decode them into rendering elements (geometry and texture).
A key component of our approach is the view-dependent texture synthesis, which can account for limitations posed by sparse geometry as well as complex nonlinear shading effects such as specularity. We explicitly factor out viewpoint from the latent representation, as it is extrinsic to the facial performance, and instead condition the decoder on the direction from which the model is viewed.

To learn a deep appearance model of the face, we constructed a multiview capture apparatus with 40 cameras pointed at the front hemisphere of the face. A coarse geometric template is registered to all frames in a performance and it, along with the unwarped textures of each camera, constitutes the data used to train the CVAE. We investigated the role of geometric precision and viewpoint granularity and found that, although direct deep image generation techniques such as those of Radford et al. [2015], Hou et al. [2017], or Kulkarni et al. [2015] can faithfully reconstruct data, extrapolation to novel viewpoints is greatly enhanced by separating the coarse graphics warp (via a triangle mesh) from appearance, confirming the basic premise of the parameterization.

Our proposed approach is well suited for visualizing faces in VR since: (a) sparse geometry and dynamic texture are basic building blocks of modern real-time rendering engines, and (b) modern VR hardware necessarily estimates viewing directions in real time. However, in order to enable interaction using deep appearance models in VR, the user's facial state needs to be estimated from a collection of sensors that typically have extreme and incomplete views of the face. An example of this is shown in the bottom left of Figure 1. This problem is further complicated in cases where the sensors measure a modality that differs from that used to build the model. A common example is the use of IR cameras, which defeats naive matching with images captured in the visible spectrum due to pronounced sub-surface scattering in IR. To address this problem, we leverage a property of CVAEs, where weight sharing across similar modalities tends to preserve semantics. Specifically, we found that learning a common CVAE over headset images and re-rendered versions of the multiview capture images allows us to correspond the two modalities through their common latent representation. This, in turn, implies correspondence from headset images to latent codes of the deep appearance model, over which we can learn a regression. Coupled with the deep appearance model, this approach allows the creation of a completely personalized end-to-end system where a person's facial state is encoded with the tracking VAE encoder and decoded with the rendering VAE decoder without the need for manually defined correspondences between the two domains.

This paper makes two major contributions to the problem of real-time facial rendering and animation:

- A formulation of deep appearance models that can generate highly realistic renderings of the human face. The approach does not rely on high-precision rendering elements, can be built completely automatically without requiring any manual cleanup, and is well suited to real-time interactive settings such as VR.
- A semi-supervised approach for relating the deep appearance model latent codes to images captured using sensors with

drastically differing viewpoint and modality. The approach does not rely on any manually defined correspondence between the two modalities but can learn a direct mapping from images to latent codes that can be evaluated in real time.

In Section 2 we cover related work, and describe our capture apparatus and data preparation in Section 3. The method for building the deep appearance models is described in Section 4, and the technique for driving it from headset-mounted sensors in Section 5. Qualitative and quantitative results from investigating various design choices in our approach are presented in Section 6. We conclude in Section 7 with a discussion and directions of future work.

2 RELATED WORK

Active Appearance Models (AAMs) [Cootes et al. 2001; Edwards et al. 1998; Matthews and Baker 2004; Tzimiropoulos et al. 2013] and 3D morphable models (3DMMs) [Blanz and Vetter 1999; Knothe et al. 2011] have been used as an effective way to register faces in images through analysis-by-synthesis using a low dimensional linear subspace of both shape and appearance. Our deep appearance model can be seen as a re-purposing of the generator component of these statistical models for rendering highly realistic faces. To perform this task effectively, two major modifications were necessary over the classical AAM and 3DMM approaches: the use of deep neural networks in place of linear generative functions, and a system for view-conditioning to specialize the generated image to the viewpoint of the camera.

Recently, there has been a great deal of work using deep networks to model human faces in images. Radford et al. [2015] use Generative Adversarial Networks (GANs) to learn representations of the face and learn how movements in this latent space can change attributes of the output image (e.g., wearing eyeglasses or not). Hou et al. [2017] use a variational autoencoder to perform a similar task. Kulkarni et al. [2015] attempt to separate semantically meaningful characteristics (e.g., pose, lighting, shape) from a database of rendered faces semi-automatically. All these methods operate only in the image domain and are restricted to small image sizes. By contrast, we learn to generate view-specific texture maps of how the face changes due to facial expression and viewing direction. This greatly simplifies the learning problem, achieving high-resolution realistic output as a result.

There has also been work on automatically creating image-based avatars using a geometric facial model [Hu et al. 2017; Ichim et al. 2015; Thies et al. 2018]. Cao et al. [2016] propose an image-based avatar built from 32 images captured from a webcam. To render the face, they construct a specially crafted function that blends the set of captured images based on the current facial expression and surface normals, which captures the dependence of facial appearance on both expression and viewpoint. Casas et al. [2016] propose a method for rapidly creating blendshapes from RGB-D data, where the face is later rendered by taking a linear combination of the input texture maps. Although both methods take an image-based approach, our method learns how appearance changes due to expression and view rather than prescribing it. Most importantly, these approaches are geared towards data-limited scenarios and, as a result, do not scale well in cases where data is more abundant. Our approach is designed specifically to leverage large quantities of high quality data to achieve a compelling depiction of human faces.

Because our method is concerned with view-dependent facial appearance, our work has a connection to others that have studied the view- and light-dependent appearance of materials and objects.
One way to view this work is as a compression of the 8D reflectance function that encapsulates the complex light interactions inside an imaginary manifold via the boundary conditions on the surface of the manifold [Klehm et al. 2015]. That is, we are capturing the light field at a manifold on which the cameras lie and compressing the appearance via a deconstruction of texture and geometry. Image-based rendering methods [Gortler et al. 1996; Kang et al. 2000] represent the light field non-parametrically and attempt to use a large number of samples to reconstruct novel views. The Bidirectional Texture Function [Dana et al. 1999] is a method to model both spatially-varying and view-varying appearance. These approaches have served as inspiration for this work.

To drive our model we develop a technique for learning a direct end-to-end mapping from images to deep appearance model codes. Although state of the art techniques for face registration also employ regression methods [Kazemi and Sullivan 2014; Xiong and De la Torre Frade 2013], most approaches assume a complete view of the face where correspondences are easily defined through facial landmarks. The approach closest to our work is that of Olszewski et al. [2016], which employs direct regression from images to blendshape coefficients. The main difference from our work is in how corresponding expressions are found between the two domains. Whereas in Olszewski et al.'s work correspondences are defined semantically through peak expressions and auditorily aligned features of speech sequences, our approach does not require any human-defined correspondence, or even that the data contain the same sequence of facial motions. Since we leverage unsupervised domain adaptation techniques, the only requirement of our method is that the distributions of expressions in the two domains are comparable.

There have recently been a number of works that perform unsupervised domain adaptation using deep networks [Bousmalis et al. 2017; Kim et al. 2017; Liu et al. 2017; Zhu et al. 2017]. Many of these methods rely on Generative Adversarial Networks (GANs) to translate between two or more domains, sometimes with additional losses to ensure semantic consistency. Some works also use a VAE in conjunction with other components to learn a common latent space between domains [Hsu et al. 2017; Liu et al. 2017]. Our approach differs from these works in that we leverage the pre-trained rendering VAE to help constrain the structure of the common latent space, resulting in semantics that are well matched between the modalities.

3 CAPTURING FACIAL DATA

To learn to render faces, we need to collect a large amount of facial data. For this purpose, we have constructed a large multi-camera capture apparatus with synchronized high-resolution video focused on the face. The device contains 40 machine vision cameras capable of synchronously capturing images at 30 frames per second. All cameras lie on the frontal hemisphere of the face and are placed at a distance of about one meter from it.

Fig. 2. Pipeline of our method. Our method begins with input images from the multi-camera capture setup. Given a tracked mesh, we can unwrap these images into view-specific texture maps. We average these texture maps over all cameras for each frame and input it and the mesh to a variational autoencoder. The autoencoder learns to reconstruct a mesh and view-specific texture because the network is conditioned on the output viewpoint. At inference time, we learn to encode a separate signal into the latent parameters z, which can be decoded into mesh and texture and rendered to the screen.

We use zoomed-in 50mm lenses, capturing pore-level detail, where each pixel observes about 50µm on the face. We preprocess the raw video data by performing multiview stereo reconstruction. In order to achieve the best results, we evenly place 200 directional LED point lights directed at the face to promote uniform illumination.

To keep the distribution of facial expressions consistent across identities, we have each subject make a predefined set of 122 facial expressions. Each subject also recites a set of 50 phonetically-balanced sentences. The meta-data regarding the semantic content of the recordings is not utilized in this work, but its inclusion as additional constraints in our system is straightforward and is a potential direction for future work.

As input to our deep appearance model learning framework, we take the original captured images as well as a tracked facial mesh. To generate this data, we build personalized blendshape models of the face from the captured expression performances, similar to Laine et al. [2017], and use them to track the face through the captured speech performances by matching the images and the dense 3D reconstructions.

4 BUILDING A DATA-DRIVEN AVATAR

Our primary goal is to create extremely high-fidelity facial models that can be built automatically from a multi-camera capture setup and rendered and driven in real time in VR (90Hz). In achieving this goal, we avoid using hand-crafted models or priors, and instead rely on the rich data we acquired from our multiview capture apparatus. We unify the concepts of 3D morphable models, image-based rendering, and variational autoencoders to create a real-time facial rendering model that can be learned from a multi-camera capture rig.

The idea is to construct a variational autoencoder that jointly encodes geometry and appearance. In our model, the decoder outputs view-dependent textures, that is, texture maps like those unwrapped from a single camera image. These are view-specific and therefore contain view-dependent effects such as specularities, distortions due to imperfect geometry estimates, and missing data in areas of the face that are occluded. The critical piece of the proposed method is that we use a conditional variational autoencoder to condition on the viewpoint of the viewer (at training time, the viewer is the camera from which the texture was unwrapped; at test time, it is the viewpoint we want to render from; in both cases the direction is composed with the inverse of the global head pose in the scene), allowing us to output the correct view-specific texture. At test time, we can execute the decoder network in real time to regress from latent encoding to geometry and texture and finally render using rasterization. Figure 2 visualizes the training and inference pipeline of our algorithm.

For this work, we assume that coarse facial geometry has been tracked, and we use it as input to our algorithm together with the original camera images. After geometry is tracked, we unwrap texture maps for each camera and for every frame of capture.
Unwrapping is performed by tracing a ray from the camera to each texel of the texture map and copying the image pixel value into the texel if the ray is not occluded. These view-specific texture maps are what we want to generate at test time: reproducing them will cause the rendered image to match the ground truth image after rendering. To learn how to generate them efficiently, we use a conditional variational autoencoder (CVAE) [Kingma and Welling 2013]. Because we jointly model geometry and appearance, our autoencoder has two branches: an RGB texture map and a vector of mesh vertex positions.

Let I_{t,v} be an image from the multi-camera capture rig at time instant t from camera v (for a total of V = 40 cameras). In this work, we assume that the viewpoint vector is relative to the rigid head orientation that is estimated from the tracking algorithm. At each time instant we also have a 3D mesh M_t (7306 vertices × 3 = 21918-dimensional vector) with a consistent topology across time.
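Ahead of the formal definition in Eq. (1) of the next section, the following sketch shows how per-camera unwrapped textures and their visibility masks can be combined into the occlusion-weighted average texture that is fed to the encoder. This is a minimal PyTorch illustration under stated assumptions; the tensor names, shapes, and the example resolution are ours and not taken from the paper.

```python
import torch

def average_texture(textures, weights, eps=1e-8):
    """Occlusion-weighted average of per-camera unwrapped textures.

    textures: float tensor of shape (V, 3, H, W), one unwrapped
              view-specific texture map per camera.
    weights:  float tensor of shape (V, 1, H, W), 1 where the texel was
              visible from that camera and 0 where it was occluded.
    Returns the average texture of shape (3, H, W).
    """
    weighted_sum = (weights * textures).sum(dim=0)   # sum_v w_v * T_v
    visibility = weights.sum(dim=0).clamp_min(eps)   # sum_v w_v (avoid divide-by-zero)
    return weighted_sum / visibility

# Example: 40 cameras and an illustrative texture resolution.
V, H, W = 40, 1024, 1024
textures = torch.rand(V, 3, H, W)
weights = (torch.rand(V, 1, H, W) > 0.3).float()
mean_tex = average_texture(textures, weights)  # input to the encoder
```

Texels never seen by any camera simply keep a zero average here; how such texels are handled in practice is not specified by this sketch.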

Fig. 3. Architecture of the encoder and decoder. The textures \bar{T}_t and T_{t,v} are 3-channel images. Each convolutional layer has stride 2 and the number of channels doubles after the first convolution every two layers. We combine the texture and mesh subencodings via concatenation. The decoder runs these steps in reverse: we split the network into two branches and use transposed convolutions of stride 2 to double the image resolution at every step. This decoder network executes in less than 5 milliseconds on an NVIDIA GeForce GTX 1080 GPU.

Using the image and mesh we can unwrap a view-specific texture map T_{t,v} by casting rays through each pixel of the geometry and assigning the intersected texture coordinate the color of the image pixel. During training, the conditional VAE takes the tuple (\bar{T}_t, M_t) as input and (T_{t,v}, M_t) as the target. Here,

    \bar{T}_t = \frac{\sum_{v=1}^{V} w_{t,v} \odot T_{t,v}}{\sum_{v=1}^{V} w_{t,v}},        (1)

is the average texture, where w_{t,v} is a factor indicating whether each texel is occluded (0) or unoccluded (1) from camera v and \odot represents an element-wise product. The primary reason for this asymmetry is to prevent the latent space from containing view information and to enforce a canonical latent state over all viewpoints for each time instant. The effects of this are discussed below.

The cornerstone of our method is the conditional variational autoencoder that learns to jointly compress and reconstruct the texture T_{t,v} and mesh vertices M_t. This autoencoder can be viewed as a generalization of Principal Component Analysis (PCA) in conventional AAMs. The autoencoder consists of two halves: the encoder E_\phi and the decoder D_\phi. The encoder takes as input the mesh vertices and texture intensities and outputs parameters of a Gaussian distribution over the latent space,

    \mu_z, \log \sigma_z \leftarrow E_\phi(\bar{T}_t, M_t),        (2)

where the function E_\phi is parameterized using a deep neural network with parameters \phi. At training time, we sample from the distribution,

    z \sim \mathcal{N}(\mu_z, \sigma_z),        (3)

and pass it into the decoder D_\phi and compute the reconstruction loss. This process approximates the expectation over the distribution defined by the encoder. The vector z is a data-driven low-dimensional representation of the subject's facial state. It encodes all aspects of the face, from eye gaze direction, to mouth pose, to tongue expression.

The decoder takes two inputs: the latent facial code z and the view vector, represented as the vector pointing from the center of the head to camera v (relative to the head orientation that is estimated from the tracking algorithm). The decoder transforms the latent code and view vector into reconstructed mesh vertices and texture,

    \hat{T}_{t,v}, \hat{M}_t \leftarrow D_\phi(z, v_{t,v}),        (4)

where \hat{T}_{t,v} is the reconstructed texture and \hat{M}_t is the reconstructed mesh. After decoding, the texture, mesh, and camera pose information can be used to render the final reconstructed image \hat{I}_{t,v}.

Figure 3 shows the architecture of the encoder and decoder. Conditioning is performed by concatenating the conditioning variable to the latent code z after each is transformed by a single fully connected layer. Since the mesh should be independent of viewpoint, it is only a function of the latent code. The texture decoder subnetwork consists of a series of strided transposed convolutions to increase the output resolution. In each layer, we reparameterize the weight tensor using Weight Normalization [Salimans and Kingma 2016]. We use leaky ReLU activations between layers with 0.2 leakiness.
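To make the encoder/decoder structure and the view conditioning concrete, here is a minimal PyTorch sketch of a CVAE in the spirit of Figure 3. The layer counts, channel widths, texture resolution, and latent size (128) are illustrative assumptions rather than the paper's exact architecture, and weight normalization and the per-channel-and-pixel biases discussed below are omitted for brevity; the view vector is concatenated to the latent code after each passes through a fully connected layer, as described above.

```python
import torch
import torch.nn as nn

class DeepAppearanceCVAE(nn.Module):
    """Sketch of a conditional VAE over (texture, mesh), conditioned on viewpoint."""

    def __init__(self, tex_res=256, n_verts=7306, z_dim=128):
        super().__init__()
        ch = 64
        # Texture branch of the encoder: strided convolutions halve resolution.
        self.tex_enc = nn.Sequential(
            nn.Conv2d(3, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch * 2, ch * 4, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Flatten(),
            nn.Linear(ch * 4 * (tex_res // 8) ** 2, 256), nn.LeakyReLU(0.2))
        # Mesh branch of the encoder.
        self.mesh_enc = nn.Sequential(nn.Linear(n_verts * 3, 256), nn.LeakyReLU(0.2))
        # Joint encoding -> parameters of the latent Gaussian.
        self.to_mu, self.to_logsig = nn.Linear(512, z_dim), nn.Linear(512, z_dim)

        # Decoder: expand latent code and view vector, then concatenate (view conditioning).
        self.z_fc = nn.Sequential(nn.Linear(z_dim, 256), nn.LeakyReLU(0.2))
        self.view_fc = nn.Sequential(nn.Linear(3, 8), nn.LeakyReLU(0.2))
        self.mesh_dec = nn.Linear(256, n_verts * 3)          # mesh depends on z only
        self.tex_fc = nn.Linear(256 + 8, ch * 4 * (tex_res // 8) ** 2)
        self.tex_dec = nn.Sequential(
            nn.ConvTranspose2d(ch * 4, ch * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(ch, 3, 4, stride=2, padding=1))
        self.ch, self.tex_res = ch, tex_res

    def encode(self, avg_tex, mesh):
        h = torch.cat([self.tex_enc(avg_tex), self.mesh_enc(mesh.flatten(1))], dim=1)
        return self.to_mu(h), self.to_logsig(h)

    def decode(self, z, view):
        hz = self.z_fc(z)
        mesh_hat = self.mesh_dec(hz)                          # viewpoint-independent geometry
        h = self.tex_fc(torch.cat([hz, self.view_fc(view)], dim=1))
        h = h.view(-1, self.ch * 4, self.tex_res // 8, self.tex_res // 8)
        return self.tex_dec(h), mesh_hat

    def forward(self, avg_tex, mesh, view):
        mu, logsig = self.encode(avg_tex, mesh)
        z = mu + torch.exp(logsig) * torch.randn_like(mu)     # reparameterization trick
        tex_hat, mesh_hat = self.decode(z, view)
        return tex_hat, mesh_hat, mu, logsig
```

At test time only decode() is needed: it maps a latent code and a view direction to a mesh and a view-specific texture, mirroring Eq. (4).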
The decoder network must execute in less than 11.1 milliseconds to achieve 90Hz rendering for real-time VR systems. We are able to achieve this using transposed strided convolutions even with a final texture size of 1024 × 1024. This is a major departure from most previous work for generating facial imagery, which has been limited to significantly smaller output sizes.

Texture maps have non-stationary statistics that we can exploit in our network design. For example, DeepFace [Taigman et al. 2014] utilizes locally-connected layers, a generalization of convolution where each pixel location has a unique convolution kernel. We utilize a similar but simpler approach: each convolutional layer has a bias that varies with both channel and the spatial dimensions. We find that this greatly improves reconstruction error and visual fidelity.

To train our system, we minimize the L2-distance between the input texture and geometry and the reconstructed texture and geometry, plus the KL-divergence between the prior distribution (an

isotropic Gaussian) and the distribution of the latent space:

    \ell(\phi) = \lambda_T \| w_{t,v} \odot (T_{t,v} - \hat{T}_{t,v}) \|^2 + \lambda_M \| M_t - \hat{M}_t \|^2 + \lambda_Z \, \mathrm{KL}\big( \mathcal{N}(\mu_z, \sigma_z) \,\|\, \mathcal{N}(0, I) \big),        (5)

where w_{t,v} is a weighting term to ensure the loss does not penalize missing data (i.e., areas of the face that are not seen by the camera) and the \lambda are weights for each term (in all experiments we set these values to \lambda_T = 1.0, \lambda_M = 1.0, \lambda_Z = 0.01). We use the Adam algorithm [Kingma and Ba 2014] to optimize this loss. Before training, we standardize the texture and geometry so that they have zero mean and unit variance.

For each subject, our dataset consists of around 5,000 frames of capture under 40 cameras per frame. Training typically takes around one day for 200,000 training iterations with a mini-batch size of 16 on an NVIDIA Tesla M40. At test time, we execute the decoder half to transform facial encodings z and viewpoint v into geometry and texture. Using the architecture shown in Figure 3, this takes approximately 5 milliseconds on an NVIDIA GeForce GTX 1080 graphics card, well within our target of 11.1 milliseconds. In practice, we find that in VR it is desirable to decode twice, creating one texture for each eye to induce parallax also in the generated texture. Our network is able to generalize in viewpoint enough that the small difference in viewpoint between the two eyes improves the experience by giving an impression of depth in regions that are poorly approximated by the sparse geometry (e.g., the oral cavity).

For our viewpoint conditioning to work, the decoder network D_\phi must rely on the viewpoint conditioning vector to supply all the information about the viewpoint. In other words, the latent code z should contain no information about the viewpoint of the input texture. This should be true for all types of variables that we may want to condition on at test time. The use of variational autoencoders [Kingma and Welling 2013] does promote these properties in the learned representation (as we will discuss in Section 4.1). However, for the special case of viewpoint, we can explicitly enforce this factorization by supplying the average texture \bar{T}_t, averaged across all cameras for a particular time instant, as the input texture. This allows us to easily enforce a viewpoint-independent encoding of facial expression and gives us a canonical per-frame latent encoding.
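As a concrete illustration of the objective in Eq. (5), the following PyTorch sketch computes the masked texture loss, the geometry loss, and the KL term for one mini-batch. The loss weights match the values quoted above, but the tensor shapes, the use of a mean rather than a sum reduction, and the closed-form diagonal-Gaussian KL are assumptions of this sketch rather than details taken from the paper.

```python
import torch

def cvae_loss(tex_hat, mesh_hat, tex_target, mesh_target, weights, mu, logsig,
              lambda_t=1.0, lambda_m=1.0, lambda_z=0.01):
    """Masked texture + geometry reconstruction loss plus KL divergence (cf. Eq. 5)."""
    # Texture term: only texels visible from camera v contribute.
    tex_loss = ((weights * (tex_target - tex_hat)) ** 2).mean()
    # Geometry term: L2 on standardized vertex positions.
    mesh_loss = ((mesh_target - mesh_hat) ** 2).mean()
    # KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian, in closed form.
    kl = 0.5 * (mu ** 2 + torch.exp(2 * logsig) - 2 * logsig - 1).mean()
    return lambda_t * tex_loss + lambda_m * mesh_loss + lambda_z * kl

# Example with the model sketched earlier (hypothetical shapes).
B, V3, H, W = 16, 7306 * 3, 256, 256
tex_hat, tex_target = torch.rand(B, 3, H, W), torch.rand(B, 3, H, W)
mesh_hat, mesh_target = torch.rand(B, V3), torch.rand(B, V3)
weights = torch.ones(B, 1, H, W)
mu, logsig = torch.zeros(B, 128), torch.zeros(B, 128)
loss = cvae_loss(tex_hat, mesh_hat, tex_target, mesh_target, weights, mu, logsig)
```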
4.1 Conditioned Autoencoding

A critical piece of our model is the ability to condition the output of the network on some property we want to control at test time. For our base model to work properly, we must condition on the viewpoint of the virtual camera. The idea of conditioning on some information we want to control at test time can be extended to include illumination, gaze direction, and even identity. We can broadly divide conditioning variables into two categories: extrinsic variables (e.g., viewpoint, illumination, etc.) and intrinsic variables (e.g., speech, identity, gaze, etc.).

Viewpoint Conditioning. The main variable that must be conditioned on is the virtual camera viewpoint. We condition on viewpoint so that at test time we will generate the appropriate texture from that viewer's point of view (relative to the avatar's position and orientation). For this to work properly, the latent encoding z must not encode any view-specific information. A viewpoint-independent expression representation can be achieved by using a variational autoencoder [Kingma and Welling 2013], which encourages the latent encoding z to come from a zero-mean unit-variance Gaussian distribution. It has been observed that variational autoencoders tend to learn a minimal encoding of the data because of the regularization of the latent space [Chen et al. 2016]. Because of this, the network is encouraged to use the conditioning variable (in this case, viewpoint) as much as possible in order to minimize the number of dimensions used in the encoding. Rather than rely only on this mechanism, we input a texture map averaged over all viewpoints to the encoder network to ensure a viewpoint-independent latent encoding z. We will, however, exploit this mechanism for other conditioning variables.

Identity Conditioning. Avatars built with this system are person-specific. We would like to have a single encoder/decoder that models multiple people through an identity conditioning variable. With enough data, we may be able to learn to map single images to identity conditioning vectors, allowing us to dynamically create new avatars without needing a multi-camera capture rig. Interestingly, with this identity conditioning scheme, we notice that the network learns an identity-independent representation of facial expression in z despite not imposing any correspondence of facial expression between different people. This is because the variational autoencoder (VAE) is encouraged to learn a common representation of all identities through its prior. Semantics tend to be preserved probably because the network can reuse convolutional features throughout the network.

5 DRIVING A DATA-DRIVEN AVATAR

Now that we are able to train the decoder to map latent encodings to texture and geometry, we can render animated faces in real time. We would like the ability to drive our avatars with live or recorded performance rather than only playing back data from the original capture. In this section, we detail how to animate our avatar by controlling the latent facial code z with an unsupervised image-based tracking method. Because our focus is on creating the highest possible quality avatars and facial animation, we limit our scope to person-specific models (i.e., a person can drive only his or her own avatar).

The primary challenge for driving performance with these avatars is obtaining correspondence between frames in our multi-camera capture system and frames in our headset. Note that this is a unique problem for virtual reality avatars because one cannot wear the headset during the multi-camera capture to allow us to capture both sets of data simultaneously, as the headset would occlude the face. We address this problem by utilizing unsupervised domain adaptation.

5.1 Video-driven Animation

Our approach centers around image-based rendering on the rich multi-camera data to simulate headset images. The first step is to compute approximate intrinsic and extrinsic headset camera parameters in the multi-camera coordinate system (we do this by

hand for one frame and propagate the tracked head pose). Then, for each pixel of the synthetic headset image, we raycast into the tracked geometry and project that point into one of the multi-camera images to get a color value. This allows us to produce synthetic images from the perspective of a headset with our multi-camera system.

Fig. 4. Facial tracking pipeline. First, we generate synthetic headset images using image-based rendering on our multi-camera capture data. These images look geometrically like real headset images but not photometrically because of the difference in lighting. To account for this difference, we encode both synthetic headset images and real headset images using a VAE, which encourages learning a common representation of both sets. We can translate between the two modalities by flipping a conditioning variable.

Unfortunately, the lighting in the headset images and multi-camera images is quite different (and, in fact, of a different wavelength), so naively regressing from these synthetic headset images to the facial encoding z will likely not generalize to real headset images. There has been much work recently on performing unsupervised domain adaptation using adversarial networks that learn to translate images from one domain to another without any explicit correspondence (e.g., [Zhu et al. 2017]). One possibility for us is to use an image-to-image translation approach to transform synthetic headset images into real headset images and then learn a regression from translated synthetic headset images to our latent code z. This scheme has two main drawbacks: first, the network learning the regression to latent code z never trains on real headset images; second, the adversarial component of these methods tends to be difficult to train.

We take the following alternative approach to solve this problem. First, we train a single variational autoencoder to encode/decode both real headset images and synthetic headset images. The Gaussian prior of the latent space will encourage the code y to form a common representation of both sets of images. We condition the decoder on a binary value indicating whether the image was from the set of real headset images or the set of synthetic images so that this information is not contained in the latent code. Next, we learn a linear transformation A_{y \to z} that maps the latent code y to the rendering code z for the synthetic headset images, because we have correspondence between images of our multi-camera system I_{t,v} and latent codes z_t = E_\phi(\bar{T}_t, M_t) through our rendering encoder. If the VAE is successful in learning a common, semantically-consistent representation of real and synthetic headset images, then this linear regression will generalize to real headset images. Note that while there is no guarantee that the semantics of the expression are the same when decoding in each of the two modalities, we observe that semantics tend to be preserved. We believe the primary reason for this is that the two image distributions are closely aligned and therefore the encoder network can make use of shared features.

Figure 4 shows the pipeline of our technique. In this pipeline, the encoder E takes one headset frame H_t consisting of three images: mouth H_t^m, left eye H_t^l, and right eye H_t^r. Each headset frame is either real, H_t^R, or synthetic, H_t^S. The encoder E produces a latent Gaussian distribution,

    \mu_y, \log \sigma_y \leftarrow E(H_t).        (6)

At training time, we sample from this distribution to get a latent code,

    y \sim \mathcal{N}(\mu_y, \sigma_y),        (7)

as before.
Our decoder D produces a headset frame given the latent code and an indicator variable:

    \hat{H}_t \leftarrow D(y, R),        (8)

where R \in \{0, 1\} indicates whether the decoder should decode a real headset frame or a synthetic headset frame. This indicator variable allows the latent code to contain no modality-specific information because the decoder network can get this information from the indicator variable instead.

The architecture of our headset encoder E is as follows: for each of the three types of headset images (lower mouth, left eye, right eye), we create a three-branch network with each branch containing eight stride-2 convolutions, each convolution followed by a leaky ReLU with 0.2 leakiness. The number of output channels for the first layer is 64 and doubles every other convolutional layer. The three branches are then concatenated together and two fully-connected layers are used to output \mu_y and \log \sigma_y. For our experiments, the latent vector y has 256 dimensions. The decoder network D is similar: three branches are created, with each branch consisting of a fully-connected layer followed by eight stride-2 transposed convolutions, each layer followed by a leaky ReLU with 0.2 leakiness. The first transposed convolution has 512 channels as input and this value halves every other transposed convolution. We condition the decoder by concatenating the conditioning variable R to every layer, replicating it across all spatial dimensions for convolutional layers.
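The following PyTorch sketch illustrates the idea of a shared tracking VAE over real and synthetic headset frames with the modality indicator of Eq. (8), together with the linear retargeting map from y to the rendering code z that appears in the loss below. The three image branches are collapsed into a single small branch for brevity, and all layer sizes and image resolutions are assumptions of the sketch, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TrackingVAE(nn.Module):
    """Shared VAE over real and synthetic headset frames with a modality flag."""

    def __init__(self, y_dim=256, z_dim=128, img_res=64):
        super().__init__()
        feat = 128 * (img_res // 8) ** 2
        self.enc = nn.Sequential(                      # single branch for brevity
            nn.Conv2d(1, 32, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2), nn.Flatten())
        self.to_mu, self.to_logsig = nn.Linear(feat, y_dim), nn.Linear(feat, y_dim)
        # Decoder input is [y, R]: the real/synthetic indicator is appended to the code.
        self.dec_fc = nn.Linear(y_dim + 1, feat)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(32, 1, 4, 2, 1))
        # Linear retargeting from tracking code y to rendering code z.
        self.A = nn.Linear(y_dim, z_dim, bias=False)
        self.img_res = img_res

    def forward(self, frame, is_real):
        h = self.enc(frame)
        mu, logsig = self.to_mu(h), self.to_logsig(h)
        y = mu + torch.exp(logsig) * torch.randn_like(mu)
        h = self.dec_fc(torch.cat([y, is_real], dim=1))
        recon = self.dec(h.view(-1, 128, self.img_res // 8, self.img_res // 8))
        return recon, self.A(y), mu, logsig

# Synthetic frames come with known rendering codes z; real frames do not.
model = TrackingVAE()
frames = torch.rand(8, 1, 64, 64)                 # IR-like single-channel crops
is_real = torch.randint(0, 2, (8, 1)).float()     # modality indicator R
recon, z_pred, mu, logsig = model(frames, is_real)
```

The predicted z_pred is only supervised for synthetic frames, where the rendering code is known, which corresponds to the retargeting term in the loss that follows.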

We do not use any normalization in these networks. Our encoder network runs in 10ms on an NVIDIA GeForce GTX 1080, which enables real-time tracking.

To train the network, we optimize the reconstruction loss, retargeting loss, and KL-divergence loss,

    \ell(\theta) = \lambda_H \| H_t - \hat{H}_t \|^2 + \lambda_A \| z_t - A_{y \to z}\, y_t \|^2 + \lambda_Y \, \mathrm{KL}\big( \mathcal{N}(\mu_y, \sigma_y) \,\|\, \mathcal{N}(0, I) \big),        (9)

where z_t is known only for synthetic headset frames H_t^S, A_{y \to z} linearly maps from the tracking latent code y to the rendering code z, and the \lambda are weights for each term of the loss (in all experiments we set \lambda_H = 1.0, \lambda_A = 0.1, and \lambda_Y = 0.1). We optimize this loss using the Adam optimizer [Kingma and Ba 2014], as before. To make our method robust to the position of the HMD on the face, we perform a random translation, scaling, and rotation crop on the images. We also randomly scale the intensity values of the images to provide some robustness to lighting variation.

This completes the entire pipeline: a headset image is input to the headset encoding network E to produce the headset encoding y, then it is translated to the multi-camera encoding z, and finally it is decoded into avatar geometry \hat{M} and texture \hat{T} and rendered for the user. This architecture also works well for social interactions over a local- or wide-area network, as only the latent code z needs to be sent across the network, reducing bandwidth requirements.

5.2 Artist Controls

In addition to driving the deep appearance model with video, we want to provide a system for rigging these models by character artists. To do this, we use a geometry-based formulation where we optimize to find a latent code that satisfies some geometric constraints, similar to Lewis and Anjyo [2010]. The artist can click vertices and drag them to a new location to create a constraint. To find the latent code given the geometric constraints, we solve

    \arg\min_z \; \sum_{i=1}^{N} \| \hat{M}_i - M_i \|^2 + \lambda_P \| z \|^2,        (10)

where N is the number of constraints, \hat{M} = D_\phi(z) is the mesh resulting from decoding z, \hat{M}_i is the vertex of constraint i, and M_i is the desired location for the vertex of constraint i. The regularization on z can be seen as a Gaussian prior, which is appropriate because the variational autoencoder encourages z to take on a unit Gaussian distribution (we set \lambda_P = 0.01). Even though the constraints are only placed on geometry, the model will produce a plausible appearance for the constraints because geometry and appearance are jointly modeled (a code sketch of this optimization is given just before Section 6.2).

6 RESULTS

In this section, we give quantitative and qualitative results of our method for rendering human faces and driving them from HMD cameras. First, we explore the effects of geometric coarseness, viewpoint sparsity, and architecture choice on avatar quality. Next, we demonstrate how our tracking variational autoencoder can preserve facial semantics between two modalities (synthetic and real) by switching the conditioning label. Finally, we show results of our complete pipeline: encoding HMD images and decoding and rendering realistic avatars.

6.1 Qualitative Results

Figure 5 shows results of our method's renderings compared to ground truth images for three different subjects. Our model is able to capture subtle details of motion and complex reflectance behavior by directly modeling the appearance of the face. Although some artifacts are present (e.g., blurring inside the mouth), they are less distracting than other types of facial rendering artifacts. Figure 6 shows results of our multi-identity model. For this experiment, we used data from 8 different people and trained one network conditioned on a latent identity vector. Currently this model cannot generate plausible interpolations of people, but we believe this will be possible with significantly more data.
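Returning to the artist controls of Section 5.2, the sketch below shows how the constrained latent code of Eq. (10) could be found by gradient descent through the decoder. The decoder interface follows the earlier CVAE sketch, and the optimizer settings are assumptions of this sketch, not the paper's.

```python
import torch

def pose_avatar(decoder, vertex_ids, targets, z_dim=128, lambda_p=0.01,
                steps=200, lr=0.05, view=None):
    """Find a latent code whose decoded mesh matches artist-placed vertex constraints.

    decoder:    callable mapping (z, view) -> (texture, mesh), as in the CVAE sketch.
    vertex_ids: long tensor of constrained vertex indices.
    targets:    float tensor (len(vertex_ids), 3) of desired vertex positions.
    """
    z = torch.zeros(1, z_dim, requires_grad=True)       # start from the prior mean
    view = view if view is not None else torch.tensor([[0.0, 0.0, 1.0]])
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        _, mesh = decoder(z, view)                       # mesh: (1, n_verts * 3)
        verts = mesh.view(1, -1, 3)[0, vertex_ids]       # constrained vertices only
        loss = ((verts - targets) ** 2).sum() + lambda_p * (z ** 2).sum()
        loss.backward()
        opt.step()
    return z.detach()
```

The resulting code can then be decoded once more to obtain both the posed geometry and a matching texture, since geometry and appearance are jointly modeled.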
6.2 Effect of Geometric and Viewpoint Coarseness

A fundamental part of our model is that we jointly model geometry and texture. As a consequence, our model is able to correct for any bias in mesh tracking by altering the output of the texture maps to best match the true view-specific texture map. This powerful ability is part of what allows us to produce highly realistic avatars. There is a limit, however, to the ability of the texture network to resolve those errors. To explore this, we have constructed a series of experiments to analyze quantitatively and qualitatively how the model behaves when the geometry becomes coarser and viewpoints become sparser.

We investigate four different types of geometric coarseness: our original tracked geometry, significant smoothing of the original tracked geometry (we use 1000 iterations of Taubin smoothing for this [Taubin 1995]), a static ellipsoid subsuming the original mesh, and a billboard quad. We test a billboard quad because image-based face generation in computer vision is typically performed on flat images. For each of these cases, we build our deep appearance model with different sets of training viewpoints (32 viewpoints, 16 viewpoints, 8 viewpoints, and 4 viewpoints) and compare performance in terms of mean-squared error on a set of 8 validation viewpoints held out from the total set of training viewpoints.

Figure 7 shows quantitative results for each of the four levels of geometric coarseness with four different sets of training viewpoints. The image-space mean-squared error (MSE), computed on the 8 validation viewpoints, shows a clear decrease in quality as the geometry is made more coarse. When the geometry input to our algorithm does not match the true geometry of the scene, the texture network must do more work to fill in the gaps, but there are limits to how well it generalizes. When the geometry fits well, the model performs well even with a small number of training viewpoints. When the geometry is too coarse, however, many viewpoints are needed for good generalization to new viewpoints.

Figure 8 shows qualitative results for the same experiment. This figure shows that accurate geometry (column "tracked") generalizes even when there are few views for training.

Fig. 5. Qualitative results. This figure shows comparisons between the ground truth recordings and the rendered mesh and texture predicted by our network (note that these frames were not in the training set) for four different subjects. We can see that our model achieves a high level of realism although it has some artifacts, in particular blurring within the mouth (we do not have tracked geometry inside the mouth, so our network must try to emulate that geometry with the texture). We find that these artifacts are less bothersome than traditional real-time human face renderings.

Fig. 6. Identity conditioning. Here we demonstrate the automatic expression correspondence obtained through identity conditioning: we decode the same latent code z into different people by conditioning on separate identity one-hot vectors.

Fig. 7. Quantitative effects of geometric and viewpoint coarseness. We show the image-based mean-squared error (MSE) for each geometry model (tracked, smoothed, ellipsoid, billboard) and each set of training cameras (32, 16, 8, and 4 viewpoints). Here we see that accuracy of the geometric model allows the model to generalize well to the 8 validation viewpoints even when it has only seen 4 viewpoints at training time.

Coarse geometry (columns "ellipsoid" and "billboard"), however, generalizes poorly by introducing artifacts or by simply replicating the closest training viewpoint. Note that with coarse geometry the network has no problem learning the appearance of the training views but the generalization to new viewpoints is very poor.

6.3 Architecture Search

We perform an evaluation of different architectures for our rendering encoder and decoder networks. We primarily tested across two axes: conditioning techniques and normalization methods. For conditioning techniques we tested two categories of conditioning: conditioning on an early decoder layer by concatenation and conditioning on all decoder layers by addition. For normalization techniques, we tested no normalization, batch normalization [Ioffe and Szegedy 2015], and weight normalization [Salimans and Kingma 2016]. Finally, we investigate the advantage of a deep architecture over a linear or bilinear model. In each case, we report image-space mean-squared error for all valid head pixels. Here, head pixels are determined by raycasting into a dense stereo reconstruction of each frame.

Conditioning Methods. Figure 9 shows the results of our conditioning evaluation. The figure shows the reconstruction loss and the MSE for a series of experiments changing the conditioning method. We evaluate across two main axes: expanding the view vector to a larger dimensionality with a fully connected layer before conditioning, and conditioning on one early layer versus conditioning on all layers. Here, we found that conditioning on a single early layer in the decoder overall seems to outperform conditioning on all layers in terms of the image-based mean-squared error on the set of 8 validation views. We also found that transforming the view vector before conditioning seems to be beneficial.

Normalization Methods. Figure 10 shows the results of our normalization evaluation. We tested three normalization methods: no normalization, batch normalization, and weight normalization.

Fig. 8. Qualitative results for evaluating the effects of geometric and viewpoint coarseness. We show renderings of the avatar trained with four different geometry models (tracked, smoothed, ellipsoid, billboard) and with two different sets of viewpoints (32 and 8 training views). Each render has an inset showing the squared difference image from the ground truth. Note that the viewpoint shown was not seen during training. We see that when the geometry matches the true geometry well, the rendering model generalizes to new viewpoints (see row "8 Training Views", column "Tracked"). When the geometry is too far from the true geometry, we start to see strange artifacts in the resulting image (see row "32 Training Views", column "Billboard").

In addition, we tested weight normalization without using per-channel-and-pixel biases (described in Section 4). The figure shows that weight normalization far outperforms the other normalization techniques and that per-channel-and-pixel biases are very important for accuracy.

Table 1. Linear vs. nonlinear representation. We evaluate our proposed method compared to a linear model (i.e., texture and geometry are a linear function of the latent code z). Our model is not only better in terms of the image-based mean-squared error but also smaller in number of parameters (110MB vs. 806MB vs. 1.6GB). Rows: Subject 1, Subject 2, Subject 3; columns: Our method, Linear, Bilinear (image-space MSE).

Linear AAM vs. Deep AAM. Aside from view conditioning, the main way we have augmented traditional AAMs is by changing the latent subspace from linear to nonlinear. The primary effect of this is to increase the expressiveness of the model although it also reduces the number of parameters. To evaluate the trade-off, we train our deep model compared to a linear model and a bilinear model (bilinear in view and facial state) with the same latent space dimensionality. Unfortunately, a linear model at full resolution requires 3.24GB of weights, so we use a reduced output resolution for the linear and bilinear models. Table 1 shows quantitative results for linear and nonlinear representations. For each subject and method, we give the image-space mean-squared error. Our model has clear advantages over linear and bilinear models, including compactness in terms of number of parameters and reconstruction performance.

6.4 Image-based vs. Texture/Geometry-based Loss

Traditional AAMs have been formulated as a PCA decomposition in terms of the texture and geometry jointly. By reformulating the AAM as an autoencoder we can actually make the loss a function of the reconstructed image \hat{I}_{t,v} itself. This is useful as it allows us to train on the metric on which we want to evaluate. It also opens the possibility of allowing the model to refine the output geometry over the input tracked geometry, which may be erroneous.

Table 2. Image MSE for different training loss combinations. We train with two different training objectives (TG: texture + geometry loss; TGI: texture + geometry + image loss) to determine the optimal training objective. We found that using the texture + geometry + image loss produces the best results. Rows: two subjects; columns: TG, TGI (image-space MSE).

Fig. 9. Evaluation of conditioning method. We evaluate several ways to condition the network on viewpoint (red conditions the raw view vector on a single layer, blue conditions the view vector after a 3 × 8 fully-connected layer, purple conditions after a 3 × 32 fully-connected layer, grey conditions the raw view vector on all decoder layers, and yellow conditions the view vector after a 3 × 8 fully-connected layer on all decoder layers). Solid lines show training error (reconstruction loss), dashed lines show validation error (reconstruction loss), and flat dashed lines show image-space mean-squared error on a held-out set of cameras. Overall, conditioning on one early layer in the network tends to perform best.

Fig. 10. Evaluation of architectures (weight normalization, batch normalization, no normalization, and no untied biases). Solid lines show training error (reconstruction loss), dashed lines show validation error (reconstruction loss), and flat dashed lines show image-space mean-squared error on a held-out set of cameras. We found that using both weight normalization [Salimans and Kingma 2016] and per-channel-and-pixel biases yields the highest quality avatar.

To optimize with respect to the image-based reconstruction error, we rewrite the loss function as

    \ell(\phi) = \lambda_I \| m_{t,v} \odot (I_{t,v} - \hat{I}_{t,v}) \|^2 + \lambda_T \| w_{t,v} \odot (T_{t,v} - \hat{T}_{t,v}) \|^2 + \lambda_M \| M_t - \hat{M}_t \|^2 + \lambda_Z \, \mathrm{KL}\big( \mathcal{N}(\mu_z, \sigma_z) \,\|\, \mathcal{N}(0, I) \big),        (11)

where \hat{I}_{t,v} is the reconstructed image from view v at time instant t, and m_{t,v} is a mask that only penalizes pixels that contain the head (derived from the dense stereo reconstructions). We create a semi-differentiable rendering layer using two steps: the first step (non-differentiably) rasterizes triangle indices to an image; the second step (differentiably) computes texel coordinates for each pixel given the triangle indices, camera parameters, and mesh vertices, and samples the texture map at the coordinate computed for each pixel. This layer allows us to backpropagate through the parameters of our model and train the system end-to-end (see the code sketch below).

To evaluate how an image-based loss during training may improve the results, we compare two different models: a model optimized with the texture + geometry loss and a model optimized with the texture + geometry + image loss. We found that without a geometry term, the loss becomes unstable and the geometry tends to lose its structure. Table 2 shows the quantitative results from these experiments. For each combination of training loss (TG: texture + geometry; TGI: texture + geometry + image) we give the image-based loss on the set of 8 validation viewpoints. Incorporating the image-based loss certainly helps lower the MSE although qualitatively the results are similar. We also found that training was unstable without the geometry loss. This is likely because there is not enough long-range signal in the image gradients to effectively estimate geometry.

6.5 Video-driven Animation

In this section, we show how our tracking VAE (E, D) builds a common representation of facial state across two different modalities, and then we give qualitative results showing our full live animation pipeline. Note that in this work our pipeline is entirely person-specific.
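As referenced in Section 6.4 above, the following PyTorch sketch illustrates the two-step rendering layer: a conventional, non-differentiable rasterization pass is assumed to have recorded, for every pixel, which triangle covers it and its barycentric coordinates, and only the differentiable texture-sampling pass is spelled out. Note one simplification: this version backpropagates only into the texture, whereas the paper's layer computes texel coordinates from the mesh vertices and camera parameters so that gradients also reach geometry. The rasterizer interface and tensor layouts are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def render_image(tri_uv, texture, tri_index, tri_bary):
    """Differentiable half of the two-step rendering layer.

    tri_uv:    (T, 3, 2) texture coordinates of each triangle's three vertices.
    texture:   (1, 3, Ht, Wt) decoded texture map.
    tri_index: (H, W) long tensor of covering triangle ids (-1 = background),
               produced by the non-differentiable rasterization step.
    tri_bary:  (H, W, 3) barycentric coordinates from the same rasterizer.
    Returns a (1, 3, H, W) rendered image with zeros in the background.
    """
    valid = (tri_index >= 0).float()
    safe_index = tri_index.clamp_min(0)
    # Interpolate a per-pixel texture coordinate from the covering triangle's vertex UVs.
    uv = (tri_uv[safe_index] * tri_bary.unsqueeze(-1)).sum(dim=-2)   # (H, W, 2)
    grid = uv * 2.0 - 1.0                                            # to [-1, 1] for grid_sample
    rendered = F.grid_sample(texture, grid.unsqueeze(0), align_corners=False)
    return rendered * valid.unsqueeze(0).unsqueeze(0)
```

The masked image term of Eq. (11) can then be computed directly on the output of this function, with gradients flowing back into the decoded texture map.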
Figure 11 shows examples of encoding with one modality and decoding with the other. We can see in these examples that semantics tend to be preserved when the modality is changed at inference time, despite only encoding and decoding the same modality at training time. This is primarily due to the variational autoencoder prior, which encourages the latent space to take on a unit Gaussian distribution.

Figure 12 shows qualitative results of a user driving an avatar in real time. Our system works well not only for expressions but also for subtle motion during speech. The benefit of our model is that it encodes facial state with high accuracy, encoding it into a joint geometry and appearance model.

Fig. 11. Translating between real headset images and synthetic headset images. Sub-figure (a) shows synthetic headset images (top row) translated to real headset images (bottom row) by switching the conditioning variable in the VAE. Sub-figure (b) shows real headset images (top row) translated to synthetic headset images (bottom row) with the same method. This shows that y contains a common representation of facial state regardless of being synthetic or real.

This allows us to model complex changes in appearance due to changes in blood flow, complex materials, and incorrect geometry estimates. Please see our supplemental video for additional examples.

7 DISCUSSION

In this paper, we presented a method for capturing and encoding human facial appearance and rendering it in real time. The method unifies the concepts of Active Appearance Models, view-dependent rendering, and deep networks. We showed that the model enables photo-realistic rendering by leveraging multiview image data and that we can drive the model with performance. We believe there are exciting opportunities for extending our approach in different ways.

Our approach is unique because we directly predict a shaded appearance texture with a learned function. This is in contrast to traditional approaches that predict physically-inspired lighting model parameters (e.g., albedo maps, specular maps), which enable relighting. A limitation of our approach in its current form is a limited ability to relight. With improvements to our capture apparatus to allow for dynamic lighting during capture, we believe we can alleviate these limitations by introducing lighting as a conditioning variable.

As our experiments showed, we are able to accurately predict the appearance of the face when our tracked geometry closely matches the true surface of the face. This approach is even effective for regions of the face that are typically very difficult to accurately model because of their complex reflectance properties (e.g., eyes and teeth). Some artifacts still remain, however (e.g., blurring on the teeth), especially where there is no true smooth surface (e.g., hair). We may be able to turn to alternative representations of geometry (e.g., point clouds) to alleviate some of these problems.

Our tracking model enables high-fidelity real-time tracking from cameras mounted on a virtual reality headset by automatic unsupervised correspondence between headset images and multi-camera capture images. In this work, we limited the scope to building person-specific trackers and rendering models. To increase the scalability of our work, we need to modify our approach to support tracking and rendering for arbitrary people. This is a difficult problem because it requires learning semantic facial correspondence between different people.

Finally, we believe that building realistic avatars of the entire body with our approach will help enable self- and social presence in virtual reality. This task comes with a new set of problems: handling the dynamics of clothing, the difficulties of articulating limbs, and modeling the appearance of interactions between individuals. We are confident that these are tractable problems along this line of research.

REFERENCES

Volker Blanz and Thomas Vetter. 1999. A Morphable Model for the Synthesis of 3D Faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '99). ACM Press/Addison-Wesley Publishing Co., New York, NY, USA.

S. M. Boker, J. F. Cohn, B. J. Theobald, I. Matthews, M. Mangini, J. R. Spies, Z. Ambadar, and T. R. Brick. 2011. Motion Dynamics, Not Perceived Sex, Influence Head Movements in Conversation. J. Exp. Psychol. Hum. Percept. Perform. 37 (2011).
Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. 2017. Unsupervised Pixel-Level Domain Adaptation with Generative Adversarial Networks.

Chen Cao, Hongzhi Wu, Yanlin Weng, Tianjia Shao, and Kun Zhou. 2016. Real-time Facial Animation with Image-based Dynamic Avatars. ACM Trans. Graph. 35, 4, Article 126 (July 2016), 12 pages.

Dan Casas, Andrew Feng, Oleg Alexander, Graham Fyffe, Paul Debevec, Ryosuke Ichikari, Hao Li, Kyle Olszewski, Evan Suma, and Ari Shapiro. 2016. Rapid Photorealistic Blendshape Modeling from RGB-D Sensors. In Proceedings of the 29th International Conference on Computer Animation and Social Agents (CASA '16). ACM, New York, NY, USA.

S. A. Cassidy, B. Stenger, K. Yanagisawa, R. Cipolla, R. Anderson, V. Wan, S. Baron-Cohen, and L. Van Dongen. 2016. Expressive Visual Text-to-Speech as an Assistive Technology for Individuals with Autism Spectrum Conditions. Computer Vision and Image Understanding 148 (2016).

Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. Variational Lossy Autoencoder. CoRR (2016).

Timothy F. Cootes, Gareth J. Edwards, and Christopher J. Taylor. 2001. Active Appearance Models. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 6 (June 2001).

Kristin J. Dana, Bram van Ginneken, Shree K. Nayar, and Jan J. Koenderink. 1999. Reflectance and Texture of Real-world Surfaces. ACM Transactions on Graphics 18, 1 (Jan. 1999).

G. J. Edwards, C. J. Taylor, and T. F. Cootes. 1998. Interpreting Face Images Using Active Appearance Models. In Proceedings of the 3rd International Conference on Face & Gesture Recognition (FG '98). IEEE Computer Society, Washington, DC, USA, 300.

P. Ekman. 1980. The Face of Man: Expressions of Universal Emotions in a New Guinea Village. Garland Publishing, Incorporated.

Steven J. Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F. Cohen. 1996. The Lumigraph. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 1996). ACM, New York, NY, USA.

X. Hou, L. Shen, K. Sun, and G. Qiu. 2017. Deep Feature Consistent Variational Autoencoder. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV).

Wei-Ning Hsu, Yu Zhang, and James R. Glass. 2017. Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data. In NIPS 2017.

Liwen Hu, Shunsuke Saito, Lingyu Wei, Koki Nagano, Jaewoo Seo, Jens Fursund, Iman Sadeghi, Carrie Sun, Yen-Chun Chen, and Hao Li. 2017. Avatar Digitization from a Single Image for Real-time Rendering. ACM Trans. Graph. 36, 6, Article 195 (Nov. 2017), 14 pages.

Fig. 12. Video-driven Animation. This figure shows three examples per subject of HMD images (left) encoded into the latent code z, decoded into geometry and appearance, and rendered (right). Despite seeing only parts of the face, our animation method captures and reproduces facial state well.
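As a rough illustration of the animation path summarized in Fig. 12, the sketch below regresses from the headset latent y to the rendering code z and decodes z, together with a view direction, into vertex positions and a texture that a standard rasterizer could consume. Module names, layer sizes, vertex count, and texture resolution are placeholder assumptions; the actual networks are substantially larger.

```python
# A rough sketch of the driving pipeline in Fig. 12, assuming simple MLPs and
# placeholder sizes; the names, vertex count, and texture resolution below are
# illustrative assumptions rather than the paper's actual architecture.
import torch
import torch.nn as nn

LATENT_Y, LATENT_Z = 128, 128
NUM_VERTICES = 5000          # placeholder mesh size
TEX_DIM = 3 * 64 * 64        # placeholder flattened RGB texture

# Regressor from the headset latent y to the rendering code z.
regressor = nn.Sequential(
    nn.Linear(LATENT_Y, 256), nn.ReLU(), nn.Linear(256, LATENT_Z))


class AppearanceDecoder(nn.Module):
    """Decodes z plus a view direction into vertex positions and a texture."""

    def __init__(self):
        super().__init__()
        self.geom = nn.Linear(LATENT_Z, NUM_VERTICES * 3)
        self.tex = nn.Sequential(
            nn.Linear(LATENT_Z + 3, 512), nn.ReLU(), nn.Linear(512, TEX_DIM))

    def forward(self, z, view_dir):
        verts = self.geom(z).view(-1, NUM_VERTICES, 3)
        texture = self.tex(torch.cat([z, view_dir], dim=-1))
        return verts, texture


decoder = AppearanceDecoder()
y = torch.randn(1, LATENT_Y)                # from the headset-image encoder
view_dir = torch.tensor([[0.0, 0.0, 1.0]])  # unit camera viewing direction
z = regressor(y)                            # latent rendering code
verts, texture = decoder(z, view_dir)       # hand off to a standard rasterizer
```

Conditioning the texture branch on the view direction is what lets the rendered avatar reproduce view-dependent effects such as specularity while using dynamic textures with flat lighting in an ordinary rendering engine.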