Lip Movement Synthesis from Text

1 Lip Movement Synthesis from Text
Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur
July 20, 2017

2 Outline
1. Objective and Motivation
2. Prerequisite Knowledge
   - Generative Adversarial Networks
   - Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
   - Generating Videos with Scene Dynamics
   - Generative Adversarial Text to Image Synthesis
3. Approach
   - Video Preprocessing
   - Basic Video Generation Network
   - Basic Video Generation with Text Embedding Network
   - Modified Video Generation with Embedding
4. Dataset Experiments
5. Result Visualization

3 Objective and Motivation Lip Reading Figure: Lip Reading Procedure

4 Objective and Motivation Lip Writing Figure: Lip Writing Procedure

5 Objective and Motivation Lip Writing
Figure: Lip Writing Procedure
- Hallucinating lip movement for new words
- Feature Vector for Lip Reading Tasks

6 Prerequisite Knowledge Generative Adversarial Network
An unsupervised machine learning algorithm implemented by two neural networks, a Generator and a Discriminator, that compete against each other in a zero-sum game framework:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
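As an illustration of the value function, the two expectations can be estimated by averaging over a batch of discriminator outputs. This is a minimal sketch, not the presenters' code; the function name `gan_value` and the sample values are hypothetical.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from fake outputs 0.5 everywhere,
# which gives V = log(0.5) + log(0.5) = -log(4).
v_confused = gan_value([0.5, 0.5], [0.5, 0.5])

# A confident, correct discriminator pushes V toward 0 from below.
v_confident = gan_value([0.99, 0.98], [0.01, 0.02])
```

The discriminator tries to raise this value while the generator tries to lower it, which is the zero-sum structure named on the slide.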

7 Prerequisite Knowledge Deep Convolutional Generative Adversarial Network
DCGAN was one of the first successful attempts at implementing a GAN in a deep convolutional framework.

8
1. Discriminator Training
   - Get image data from the dataset.
   - Compute the cross-entropy loss of the discriminator on this data with a true label.
   - Generate a sample from the generator.
   - Compute the cross-entropy loss of the discriminator on the generated data with a false label.
   - Backpropagate the loss through the discriminator and update the discriminator parameters.
2. Generator Training
   - Compute the cross-entropy loss of the discriminator on the generated data with a true label.
   - Backpropagate the loss through the discriminator to obtain the loss at the image-level representation.
   - Backpropagate this image-level loss through the generator network and update its parameters.
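The two loss computations in this procedure can be sketched on plain lists of discriminator outputs. This is only an illustration under assumptions: the function names are hypothetical, and the backpropagation and parameter updates are omitted.

```python
import math

def bce(prob, label):
    """Binary cross-entropy for one prediction in (0, 1) and a 0/1 label."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

def discriminator_loss(d_real, d_fake):
    """Real samples are scored against label 1, generated samples against label 0."""
    real = sum(bce(p, 1) for p in d_real) / len(d_real)
    fake = sum(bce(p, 0) for p in d_fake) / len(d_fake)
    return real + fake

def generator_loss(d_fake):
    """The generator scores its own samples against the true label."""
    return sum(bce(p, 1) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs for a batch of two real and two fake samples.
d_real = [0.9, 0.8]
d_fake = [0.1, 0.2]
d_loss = discriminator_loss(d_real, d_fake)
g_loss = generator_loss(d_fake)
```

With these outputs the discriminator is doing well, so its loss is small while the generator's loss is large. The generator step would then push `d_fake` upward.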

9 Prerequisite Knowledge Generating Videos with Scene Dynamics
$$G(z) = m(z) \odot f(z) + (1 - m(z)) \odot b(z)$$
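In the two-stream formulation, m(z) is a mask, f(z) a foreground stream and b(z) a background stream, combined element by element. A minimal sketch on flat lists follows; the function name `compose` and the values are hypothetical, and real implementations operate on video tensors.

```python
def compose(mask, foreground, background):
    """Blend per element: m * f + (1 - m) * b for flat lists of equal length."""
    if not (len(mask) == len(foreground) == len(background)):
        raise ValueError("mask, foreground and background must have equal length")
    return [m * f + (1.0 - m) * b
            for m, f, b in zip(mask, foreground, background)]

mask = [1.0, 0.0, 0.25]          # 1 selects foreground, 0 selects background
foreground = [0.8, 0.8, 0.8]
background = [0.2, 0.2, 0.2]
blended = compose(mask, foreground, background)
```

A mask value of 1 copies the foreground, 0 copies the background, and values in between mix the two.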

10 Prerequisite Knowledge Generative Adversarial Text to Image Synthesis

11 Approach Video Preprocessing Figure: Dataset Preprocessing steps

12 Approach Basic Video Generation Network
The lip movement videos have no background, and their only dynamic aspect is the lip movement itself. We therefore simplified the network by keeping only the foreground generation stream of the VideoGAN framework. Training followed the standard GAN training procedure.

13 Figure: Basic Video Generation Network Generator and Discriminator

14 Approach Basic Video Generation with Text Embedding Network
For video generation from a text embedding, we first set up a model that combines our basic video generator with Scott Reed's method of appending the embeddings. The embedding is upsampled to a 128-dimensional vector by a fully-connected layer and then passed through a LeakyReLU layer. This vector is appended to the initial noise vector. The discriminator is also updated from the base model for the new task. At the layer where the spatio-temporal dimension of the discriminator is , the text embedding is again upsampled to 128 dimensions, passed through a LeakyReLU layer, and then replicated and appended to the discriminator features so as to make the new dimension ( ) 4 4.
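The conditioning path can be sketched with plain lists to make the shapes concrete. The 128-dimensional projection and the 4 x 4 spatial size come from the slide. The embedding size (16), the noise size (100), the temporal depth (2), the LeakyReLU slope (0.2) and all function names are hypothetical choices for illustration. For brevity the same projected vector is reused on the discriminator side, whereas the slide describes a separate projection there.

```python
import random

def leaky_relu(values, slope=0.2):
    """LeakyReLU; the 0.2 slope is an assumed default, not taken from the slides."""
    return [v if v > 0 else slope * v for v in values]

def project(embedding, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * e for w, e in zip(row, embedding)) + b
            for row, b in zip(weights, bias)]

def text_code(embedding, weights, bias):
    """Project the text embedding and apply LeakyReLU."""
    return leaky_relu(project(embedding, weights, bias))

def replicate(code, depth, height, width):
    """Tile the code over every spatio-temporal position: shape (C, D, H, W)."""
    return [[[[c] * width for _ in range(height)] for _ in range(depth)]
            for c in code]

rng = random.Random(0)
EMBED_DIM, NOISE_DIM, CODE_DIM = 16, 100, 128  # 16 and 100 are hypothetical
embedding = [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
weights = [[rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
           for _ in range(CODE_DIM)]
bias = [0.0] * CODE_DIM
noise = [rng.gauss(0, 1) for _ in range(NOISE_DIM)]

code = text_code(embedding, weights, bias)
generator_input = noise + code                       # appended to the noise vector
tiled = replicate(code, depth=2, height=4, width=4)  # depth 2 is hypothetical
```

On the discriminator side, `tiled` would be concatenated with the feature map along the channel axis, so every spatio-temporal position sees the same text code.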

15 Figure: Basic Video Generation with Text Embedding Generator Discriminator

16 Basic Video Generation with Text Embedding Network Training Procedure
For the Discriminator
1. From the database, get video frames, their corresponding text embeddings, and a set of mismatched database videos, i.e. real videos paired with different text embeddings.
2. Calculate the error for the batch in the following way.
   - Get the error from the database video with its corresponding text embedding and label true.
   - Get the error from the generated video with the text embedding and label false.
   - Get the error from the mismatched database video and text embedding with label false.
3. Backpropagate this error through the discriminator network and update the discriminator parameters.
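The three error terms of the discriminator step can be sketched as follows. The slide does not say how the terms are weighted, so this sketch simply sums them with equal weight, which is an assumption. The function names are hypothetical.

```python
import math

def bce(prob, label):
    """Binary cross-entropy for one prediction in (0, 1) and a 0/1 label."""
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))

def matching_aware_d_loss(d_real_match, d_fake_match, d_real_mismatch):
    """Sum of the three discriminator error terms.

    d_real_match:    outputs for (database video, correct embedding)  -> label true
    d_fake_match:    outputs for (generated video, its embedding)     -> label false
    d_real_mismatch: outputs for (database video, wrong embedding)    -> label false
    """
    terms = [
        (d_real_match, 1.0),
        (d_fake_match, 0.0),
        (d_real_mismatch, 0.0),
    ]
    return sum(sum(bce(p, y) for p in probs) / len(probs)
               for probs, y in terms)

# A discriminator that ignores the text accepts mismatched pairs and is penalised.
loss_text_aware = matching_aware_d_loss([0.9], [0.1], [0.1])
loss_text_blind = matching_aware_d_loss([0.9], [0.1], [0.9])
```

The mismatched term is what forces the discriminator to check that the video agrees with the text, rather than only judging realism.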

17 Basic Video Generation with Text Embedding Network Training Procedure
For the Generator
1. Pass the generated video, together with its text embedding, through the discriminator and compute the error against the true label.
2. Backpropagate this error through the discriminator network to obtain the error at the video-level representation.
3. Backpropagate this video-level error through the generator network and update its parameters.

18 Approach Modified Video Generation with Embedding
The results from the basic model were decipherable as lip movement, but they were blurry. We expanded upon the basic model by changing the generator and discriminator architectures as well as the training procedure.

19 Modified Video Generation with Embedding Generator Figure: Modified Generator

20 Modified Video Generation with Embedding Discriminator Figure: Modified Discriminator

21 Modified Video Generation with Embedding Network Changes in Training
1. We sampled the generator input from a spherical Gaussian rather than a uniform distribution.
2. We replaced ReLU layers with LeakyReLU in both the generator and the discriminator.
3. Rather than using two hard target labels (0, 1) for false and true, we use soft labels: (0-0.3) for false and ( ) for true. This leads to better training of the generator and discriminator.
4. The discriminator error was approaching 0 early in training, which destabilized the generator. To avoid this, we added Dropout layers in both the generator and the discriminator.
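The first and third changes can be sketched with the standard library. The false-label range (0 to 0.3) comes from the slide; the true-label range is not preserved in the source, so the sketch leaves it as a caller-supplied parameter. The noise dimension of 100, the batch size of 8, the uniform draw of labels within the range, and the function names are hypothetical.

```python
import random

def sample_noise(dim, rng):
    """Draw z from a spherical (isotropic, unit-variance) Gaussian."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def soft_labels(batch_size, low, high, rng):
    """Draw one soft target per sample uniformly from [low, high]."""
    return [rng.uniform(low, high) for _ in range(batch_size)]

rng = random.Random(0)
z = sample_noise(100, rng)                     # 100 is a hypothetical dimension
false_labels = soft_labels(8, 0.0, 0.3, rng)   # range taken from the slide
```

Soft targets keep the discriminator from becoming fully confident, which is the same failure the slide addresses with Dropout.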

22 Dataset Experiments Grid Dataset
The dataset has 34 speakers saying sentences in the format <command> <color> <preposition> <letter> <digit> <adverb>, like "place blue at F 9 now".

Type         Number of Words   Words
command      4                 bin, lay, place, set
color        4                 blue, green, red, white
preposition  4                 at, by, in, with
letter       25                A-Z excluding W
digit        10                zero-nine
adverb       4                 again, now, please, soon
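The sentence grammar is small enough to enumerate. The sketch below builds the six word slots and counts the possible sentences. The digit words zero to nine are taken from the word list in the results, which is an assumption about this slot, and the names `GRAMMAR` and `random_sentence` are hypothetical.

```python
import random

GRAMMAR = [
    ("command", ["bin", "lay", "place", "set"]),
    ("color", ["blue", "green", "red", "white"]),
    ("preposition", ["at", "by", "in", "with"]),
    ("letter", [c for c in "abcdefghijklmnopqrstuvwxyz" if c != "w"]),
    ("digit", ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]),
    ("adverb", ["again", "now", "please", "soon"]),
]

def count_sentences(grammar):
    """Number of distinct sentences: the product of the slot sizes."""
    total = 1
    for _, words in grammar:
        total *= len(words)
    return total

def random_sentence(grammar, rng):
    """Pick one word per slot, in grammar order."""
    return " ".join(rng.choice(words) for _, words in grammar)

total = count_sentences(GRAMMAR)   # 4 * 4 * 4 * 25 * 10 * 4 = 64000
sentence = random_sentence(GRAMMAR, random.Random(0))
```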

23 Various Datasets for Generation
- Sub Sampling Dataset: Took the 75 frames of the video, sub-sampled 32 frames from it at regular intervals, and used the full text embedding associated with them.
- Multi Word Dataset: Broke down the 2 second videos into 2 parts of almost equal size according to the frames in which the words are spoken. The 2 videos were sub-sampled to 32 frames each, with their corresponding word embedding.
- One Word Dataset: Comprised of the frames of people saying a single word, which were super-sampled from the corpus videos, with one word embedding.
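Picking 32 of 75 frames at regular intervals can be done by spacing the indices evenly and rounding. This is one possible reading of "regular intervals", and keeping both the first and the last frame is an assumption. The function name is hypothetical.

```python
def subsample_indices(total_frames, num_samples):
    """Evenly spaced frame indices that include the first and last frame."""
    if num_samples < 2 or num_samples > total_frames:
        raise ValueError("need 2 <= num_samples <= total_frames")
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]

indices = subsample_indices(75, 32)
```

With 75 frames and 32 samples the step is 74 / 31, so consecutive selected frames are 2 or 3 frames apart.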

24 Results Basic Video Generation with Sub Sampling Dataset Figure: Basic Video Generation with Sub Sampling Dataset

25 Results Basic Embedding model with Sub Sampling Dataset Figure: Basic Embedding model with Sub Sampling Dataset

26 Results Modified Embedding Model with Sub Sampling Dataset Figure: Modified Embedding Model with Sub Sampling Dataset

27 Results Modified Embedding Model with Multi Word Dataset Figure: Modified Embedding Model with Multi Word Dataset

28 Results Modified Embedding Model with One Word Dataset Figure: Modified Embedding Model with One Word Dataset

29 Quantitative Results Structural Similarity Index
The Structural Similarity Index (SSIM) was introduced in 2004 by Z. Wang et al. It measures the structural similarity of two images. The SSIM index is defined as
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$
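The formula can be evaluated directly on two equally sized images given as flat lists of pixel values. This sketch computes the statistics over the whole image rather than over local windows, which is a simplification. The constants c1 = (0.01 L)^2 and c2 = (0.03 L)^2, with L the pixel range, follow the usual convention and are an assumption here, since the slide does not state them.

```python
def ssim(x, y, data_range=255.0, k1=0.01, k2=0.03):
    """Global SSIM between two flat pixel lists of equal length."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (k1 * data_range) ** 2   # stabilises the mean term
    c2 = (k2 * data_range) ** 2   # stabilises the variance term
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

image_a = [10.0, 50.0, 90.0, 200.0]
image_b = [200.0, 90.0, 50.0, 10.0]   # same pixels in reverse order
same = ssim(image_a, image_a)
different = ssim(image_a, image_b)
```

An image compared with itself scores 1, while the reversed image shares the same mean and variance but has negative covariance, so its score drops.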

30
Word    SSIM Score    Word    SSIM Score    Word    SSIM Score
a                     in                    set
again                 j                     seven
at                    k                     sil
b                     l                     six
bin                   lay                   soon
blue                  m                     sp
by                    n                     t
c                     nine                  three
d                     now                   two
e                     o                     u
eight                 one                   v
f                     p                     white
five                  place                 with
four                  please                x
g                     q                     y
green                 r                     z
h                     red                   zero

31
Similar Lip Movement Words
Word1    Word2    Real Videos    Generated Videos
u        blue
a        e
b        bin
blue     two
blue     bin
in       nine

Different Lip Movement Words
Word1    Word2    Real Videos    Generated Videos
four     d
seven    t
one      e
four     k
set      place
seven    place
at       five

Table: SSIM score between Similar and Different Lip Movement Words

32 Qualitative Results Figure: Four Eight M

33 Qualitative Results Figure: Five Blue B

34 Thank You Any Questions?

Improving Generative Adversarial Networks with Denoising Feature Matching
David Warde-Farley, Yoshua Bengio (University of Montreal), ICLR 2017
Presenter: Bargav Jayaraman
Outline 1 Introduction 2 Background