Deep Learning for Visual Manipulation and Synthesis Jun-Yan Zhu 朱俊彦 UC Berkeley 2017/01/11 @ VALSE
What is visual manipulation? Image Editing Program: input photo + user input → result. Desired output: stay close to the input; satisfy the user's constraint. [Schaefer et al. 2006]
What is Visual Synthesis? Image Generation Program: user input → result. Desired output: satisfy the user's constraint. Sketch2Photo [Tao et al. 2009]
So far so good
Things can get really bad: the lack of safety wheels
Adding the safety wheels. Image Editing Program: input photo + user input → output result. A desired output: stay close to the input; satisfy the user's constraint; lie on the natural image manifold.
Prior work: Heuristic-based Gradient [Perez et al. 2003] Bleeding artifacts [Tao et al. 2010] Color [Reinhard et al. 2004] Color and Texture [Johnson et al. 2011]
Prior work: Discriminative Learning Natural Human Motion (34 subjects) [Ren et al. 2005] Image Compositing (20 images) [Xue et al. 2012] Image Deblurring (40 images) [Liu et al. 2013]
Our Goal: - Learn the manifold of natural images without direct human annotations. - Improve visual manipulation and synthesis by constraining the result to lie on that learned manifold.
Why Deep Learning Methods? Impressive results on visual recognition: classification, detection, segmentation, 3D vision, videos, etc. No feature engineering. Recent development of generative models (e.g. Generative Adversarial Networks).
Deep Learning trends: performance
Deep Learning trends: research AlexNet [Krizhevsky et al.] ImageNet [Jia et al.]
Discriminative Model M = {x | P(real | x) = 1} [ICCV '15]: RealismCNN → Predict Realism → Improve Editing (Image Editing Model). Generative Model M = {x | x = G(z)} [SIGGRAPH '14] [ECCV '16]: Project → Editing UI → Edit Transfer.
Discriminative Model M = {x | P(real | x) = 1} [ICCV '15]: RealismCNN → Predict Realism → Improve Editing (Image Editing Model). Foreground object F + background B → image composite I.
Learning Visual Realism: train a CNN to classify natural photos vs. composite images (25K natural photos vs. 25K composite images).
How do we get composite images? Target Object Composite Images Object Mask Object Masks with Similar Shapes Object Mask: (1) Human Annotation (2) Object Proposal [Lalonde and Efros 2007]
Ranking of Training Composites Most realistic composites Least realistic composites
Evaluation Dataset [Lalonde and Efros 2007]. Task: binary classification; 360 realistic photos (natural images + realistic composites) vs. 360 unrealistic photos. Metric: area under the ROC curve.
Methods without object mask: Lalonde and Efros (no mask) 0.61; AlexNet + SVM 0.73; RealismCNN 0.84; RealismCNN + SVM 0.88; Human 0.91.
Methods using object mask: Reinhard et al. 0.66; Lalonde and Efros (with mask) 0.81.
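The area-under-ROC metric above can be sketched as a small stand-alone function: AUC equals the probability that a randomly chosen realistic photo scores higher than a randomly chosen unrealistic one. The scores below are made-up toy values, not the paper's data.

```python
# Sketch: ROC AUC for a binary realism classifier, via the
# Mann-Whitney U formulation (ties count as half a win).

def auc(pos_scores, neg_scores):
    """Probability that a positive example outscores a negative one."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Toy example: realism scores for realistic vs. unrealistic photos.
realistic = [0.9, 0.8, 0.75, 0.6]
unrealistic = [0.7, 0.4, 0.3, 0.2]
print(auc(realistic, unrealistic))  # 0.9375
```

A perfect classifier scores 1.0, chance is 0.5, which is how the table's 0.61–0.91 values are read.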
Visual Realism Ranking Least Realistic Most Realistic Snowy Mountain Highway Ocean Red: unrealistic composite, Green: realistic composite, Blue: natural image
Our Pipeline Realism CNN Predict Realism Improve Composites Image Editing Model
Improving Visual Realism. Editing model: color adjustment g applied to the foreground object F, scored by the RealismCNN. Objective: E(g, F) = E_CNN + E_reg, minimized with a quasi-Newton method (L-BFGS). Original composite (realism score: 0.0) → improved composite (realism score: 0.8).
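The optimization above can be sketched with a toy stand-in: a quadratic mimics the CNN realism term, E_reg penalizes deviation from the identity color adjustment, and plain gradient descent stands in for L-BFGS. The per-channel gain g and all numbers here are illustrative, not the paper's model.

```python
# Sketch of the color-adjustment optimization E(g, F) = E_CNN + E_reg.
# g is a per-channel color gain; a quadratic pulling g toward a
# "realistic" gain stands in for the CNN term (assumption).

def energy_and_grad(g, target=(1.2, 0.9, 1.1), lam=0.1):
    """Toy energy: quadratic ~E_CNN plus identity-regularizer E_reg."""
    e = sum((gi - ti) ** 2 for gi, ti in zip(g, target))   # ~E_CNN
    e += lam * sum((gi - 1.0) ** 2 for gi in g)            # E_reg
    grad = [2 * (gi - ti) + 2 * lam * (gi - 1.0)
            for gi, ti in zip(g, target)]
    return e, grad

g = [1.0, 1.0, 1.0]                  # start from the identity adjustment
for _ in range(200):                 # gradient descent in place of L-BFGS
    e, grad = energy_and_grad(g)
    g = [gi - 0.1 * d for gi, d in zip(g, grad)]
print([round(gi, 3) for gi in g])
```

The regularizer keeps the edit conservative: the minimizer lands between the identity gain and the "realistic" gain, weighted by lambda.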
Selecting Suitable Objects Best-fitting object selected by RealismCNN Object with most similar shape
Optimizing Color Compatibility Object mask Cut-n-paste Lalonde et al. Xue et al. Ours
Sanity Check: Real Photos Object mask Cut-n-paste Lalonde et al. Xue et al. Ours
Visualizing and Localizing Errors (∂E/∂I_p): the energy E drops over L-BFGS iterations (50.73 → 9.38 → 5.05 → 3.44 → 3.00); result and gradient map shown.
Discriminative Model {x | P(real | x) = 1}
Pros: CNN is easy to train. Graphics programs often produce better images than generative models. General framework for many tasks (e.g. deblurring, retargeting, etc.).
Cons: Task-specific: a pre-trained model cannot be applied to other tasks. Graphics programs are often non-parametric and non-differentiable. Graphics programs often require a user in the loop, so automatically generating results for CNN training is challenging.
Code: github.com/junyanz/realismcnn Data: people.eecs.berkeley.edu/~junyanz/projects/realism/
Discriminative Model M = {x | P(real | x) = 1} [ICCV '15]: RealismCNN → Predict Realism → Improve Editing (Image Editing Model). Generative Model M = {x | x = G(z)} [SIGGRAPH '14] [ECCV '16]: Project → Editing UI → Edit Transfer.
Learning the Natural Image Manifold. Deep generative models: Generative Adversarial Network (GAN) [Goodfellow et al. 2014] [Radford et al. 2015] [Denton et al. 2015]; Variational Auto-Encoder (VAE) [Kingma and Welling 2013]; DRAW (recurrent neural network) [Gregor et al. 2015]; PixelRNN and PixelCNN [Oord et al. 2016].
Image Classification via Neural Network: input image I → "cat". (Slides credit: Andrew Owens)
Can We Generate Images with Neural Networks? Gaussian noise (or another random distribution) → image.
Generative Adversarial Networks (GAN) Generative Model Synthesized image [Goodfellow et al. 2014]
Generative Adversarial Networks (GAN) Generative Model Discriminative Model real [Goodfellow et al. 2014]
Generative Adversarial Networks (GAN) Generative Model Discriminative Model fake [Goodfellow et al. 2014]
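The three slides above describe the two-player game of Goodfellow et al. 2014: D is trained to label real images "real" and samples G(z) "fake", while G is trained to fool D. Written as the standard minimax objective:

```latex
\min_G \max_D \;
  \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\!\left[\log\bigl(1 - D(G(z))\bigr)\right]
```

At the equilibrium of this game, the generator's distribution matches the data distribution, which is why a trained G can be read as an approximation of the natural image manifold.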
Cat Generation (w.r.t. training iterations)
GAN as Manifold Approximation Sample training images from Amazon Shirts Random image samples from Generator G(z) [Radford et al. 2015]
Traverse on the GAN Manifold. Linear interpolation in z space: G(z_0 + t(z_1 − z_0)), from G(z_0) to G(z_1). Limitations: not photo-realistic enough, low resolution; images are produced randomly, with no user control. [Radford et al. 2015]
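The traversal above is just linear interpolation of latent codes, G(z_0 + t(z_1 − z_0)); the generator G itself is the trained network, so only the latent-side arithmetic is sketched here, on toy 3-dimensional codes.

```python
# Sketch: walking the GAN manifold by interpolating latent vectors.
# Each intermediate code would be decoded with the trained G.

def lerp(z0, z1, t):
    """Pointwise linear interpolation between two latent vectors."""
    return [a + t * (b - a) for a, b in zip(z0, z1)]

z0 = [0.0, 1.0, -1.0]
z1 = [1.0, 0.0, 1.0]
path = [lerp(z0, z1, t / 4) for t in range(5)]  # 5 codes, t in [0, 1]
print(path[2])  # midpoint: [0.5, 0.5, 0.0]
```

Decoding each code on the path gives a smooth image sequence, e.g. one shirt morphing into another in the Amazon Shirts model.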
Overview original photo Project Editing UI different degree of image manipulation Edit Transfer projection on manifold transition between the original and edited projection
Projecting an Image onto the Manifold. Input: real image x^R. Output: latent vector z. Optimization: minimize the reconstruction loss L over the generative model G(z) (losses: 0.196, 0.238, 0.332).
Projecting an Image onto the Manifold. Input: real image x^R. Output: latent vector z. Optimization (losses: 0.196, 0.238, 0.332) vs. Inverting Network z = P(x), an auto-encoder with a fixed decoder G (losses: 0.218, 0.242, 0.336).
Projecting an Image onto the Manifold. Input: real image x^R. Output: latent vector z. Optimization (0.196, 0.238, 0.332); Inverting Network z = P(x) (0.218, 0.242, 0.336); Hybrid Method: use the network prediction as initialization for the optimization problem (0.268, 0.153, 0.167).
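The hybrid projection above can be sketched end to end with toy stand-ins: a linear "generator" G(z) = 2z and a deliberately imperfect "inverting network" P replace the real networks, and gradient descent on the reconstruction loss refines P's initial guess. Everything here is illustrative, not the paper's architecture.

```python
# Sketch of hybrid projection onto the manifold: network prediction
# z = P(x) as initialization, then minimize ||G(z) - x||^2.

def G(z):                            # toy linear "generator" (assumption)
    return [2.0 * zi for zi in z]

def P(x):                            # toy inverting network: rough inverse
    return [0.45 * xi for xi in x]   # deliberately imperfect

def project(x, steps=100, lr=0.05):
    z = P(x)                         # initialize from the network
    for _ in range(steps):           # refine: grad of ||G(z) - x||^2
        grad = [2.0 * (gz - xi) * 2.0 for gz, xi in zip(G(z), x)]
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z

x = [1.0, -2.0, 4.0]
z = project(x)
print([round(zi, 3) for zi in z])    # close to [0.5, -1.0, 2.0]
```

The network alone is fast but biased; the optimization alone can get stuck; warm-starting the optimization from P(x) combines the two, which is what the loss numbers on the slide compare.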
Overview original photo Project Editing UI different degree of image manipulation Edit Transfer projection on manifold transition between the original and edited projection
Manipulating the Latent Vector. Objective: minimize the constraint-violation loss L_g between G(z) and each user-guidance image v_g, starting from z_0.
Overview original photo Project Editing UI different degree of image manipulation Edit Transfer projection on manifold transition between the original and edited projection
Edit Transfer. Motion (u, v) + color (A, a 3×4 matrix): estimate per-pixel geometric and color variation between G(z_0) and G(z_1) along the linear interpolation in z space; input.
Edit Transfer. Motion (u, v) + color (A, a 3×4 matrix): estimate per-pixel geometric and color variation between G(z_0) and G(z_1) along the linear interpolation in z space; input → result.
Image Manipulation Demo
Designing Products
Interactive Image Generation
The Simplest Generative Model: Averaging. AverageExplorer: {x | x = Σ_n w_n I_n^warp}. Generative model: weighted average of warped images. Limitation: cannot synthesize novel content. [Zhu et al. 2014]
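The averaging model above, x = Σ_n w_n I_n^warp, is simple enough to sketch directly: the output is a per-pixel weighted average of already-warped images. Images are flat pixel lists here, and the warping step is assumed to have happened upstream.

```python
# Sketch of AverageExplorer's averaging model: a per-pixel weighted
# average of warped images, with weights normalized to sum to one.

def weighted_average(images, weights):
    total = sum(weights)
    norm = [w / total for w in weights]          # normalize weights
    return [sum(w * img[p] for w, img in zip(norm, images))
            for p in range(len(images[0]))]

# Toy: three 4-pixel "images", the middle one weighted twice as much.
imgs = [[0.0, 0.2, 0.4, 0.6],
        [0.2, 0.4, 0.6, 0.8],
        [0.4, 0.6, 0.8, 1.0]]
avg = weighted_average(imgs, [1.0, 2.0, 1.0])
print([round(v, 2) for v in avg])  # [0.2, 0.4, 0.6, 0.8]
```

Because every output pixel is a convex combination of input pixels, the model can only blend what it has seen, which is exactly the "cannot synthesize novel content" limitation on the slide.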
Generative Image Transformation
iGAN (interactive GAN). Get the code: github.com/junyanz/igan. Intelligent drawing tools via GAN; debugging tools for understanding and visualizing deep generative networks. Work in progress: supporting more models (GAN, VAE, Theano/TensorFlow).
Generative Model {x | x = G(z), z ∈ Z}
Pros: Task-independent: offline generative-model training is independent of the graphics application. Optimizing z is easier than optimizing x. Generative models keep getting better.
Cons: Low quality and low resolution require post-processing (still engineering work). Limitations of current generative models: cannot produce good texture.
Related work on GAN. Goodfellow's NIPS 2016 tutorial: [arXiv], [slides]. Early work: [Tu '07], [Gutmann and Hyvarinen '10], etc. New models: InfoGAN, SSGAN, VAE-GAN, LAPGAN, BiGAN, CoGAN, PPGAN, etc. Training techniques: DCGAN, Improved-GAN, EBGAN, unrolling. Image: inpainting, inverting features, style transfer, text-to-image, super-resolution, etc. Video: frame prediction, tiny videos, etc.
Image-to-Image Translation with Conditional Adversarial Nets. Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. arXiv 2016. Code: github.com/phillipi/pix2pix
Image-to-Image Problems
Conditional Adversarial Networks (cGAN). Loss: L1 + GAN. G: U-Net. D: PatchGAN (70×70). U-Net [Ronneberger et al. '15]
Conditional GAN
Network and Loss Function. Loss function: L1 + GAN. Generator G: U-Net [Ronneberger et al. '15]. Discriminator D: 70×70 PatchGAN (FCN).
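The combined L1 + GAN generator objective above can be sketched numerically. A non-saturating GAN term and an illustrative lambda are assumptions here; the discriminator's per-patch outputs (from the 70×70 PatchGAN) are represented by a flat list of probabilities, and all numbers are toy values.

```python
# Sketch of the pix2pix generator loss: L = L_GAN + lambda * L1.
import math

def generator_loss(fake_patch_probs, fake_img, real_img, lam=100.0):
    # GAN term (non-saturating form, an assumption): the generator
    # wants D's patch probabilities on its output to be near 1.
    l_gan = -sum(math.log(p) for p in fake_patch_probs) / len(fake_patch_probs)
    # L1 term: mean absolute error against the ground-truth image.
    l1 = sum(abs(f - r) for f, r in zip(fake_img, real_img)) / len(fake_img)
    return l_gan + lam * l1

# Toy call: two patch probabilities, a 3-pixel output vs. ground truth.
loss = generator_loss([0.5, 0.8], [0.1, 0.2, 0.3], [0.1, 0.25, 0.3])
print(round(loss, 3))
```

The L1 term keeps the output globally faithful to the target; the PatchGAN term only has to supply high-frequency realism, which is the division of labor the later ablation slides compare.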
Different Losses
Architectures for Generator G
Patch Size of PatchGAN
Applications
Label → Facade
Label → Street View
Map Generation
Day → Night
Edge → Handbag (edges from HED [Xie and Tu '15])
Edge → Shoe (edges from HED [Xie and Tu '15])
User Sketch → Photo
Automatic Colorization
Failure Cases: sparse input image; unusual input image.
Summary: Image-to-Image Problems
Cat Paper Collection. GitHub: github.com/junyanz/catpapers. 90% of data is visual; most visual data is about cats. 60+ vision, learning, and graphics papers.
Thank You! Eli Yong Jae Philipp Alyosha Philipp Tinghui