Deep Learning for Visual Manipulation and Synthesis Jun-Yan Zhu 朱俊彦 UC Berkeley 2017/01/11 @ VALSE
What is visual manipulation? Image Editing Program: input photo + user input → result. Desired output: stay close to the input; satisfy the user's constraint. [Schaefer et al. 2006]
What is Visual Synthesis? Image Generation Program: user input → result. Desired output: satisfy the user's constraint. Sketch2Photo [Tao et al. 2009]
So far so good
Things can get really bad: the lack of safety wheels
Adding the safety wheels. Image Editing Program: input photo + user input → output result. A desired output: stay close to the input; satisfy the user's constraint; lie on the natural image manifold.
Prior work: Heuristic-based Gradient [Perez et al. 2003] Bleeding artifacts [Tao et al. 2010] Color [Reinhard et al. 2004] Color and Texture [Johnson et al. 2011]
Prior work: Discriminative Learning Natural Human Motion (34 subjects) [Ren et al. 2005] Image Compositing (20 images) [Xue et al. 2012] Image Deblurring (40 images) [Liu et al. 2013]
Our Goal: - Learn the manifold of natural images without direct human annotations. - Improve visual manipulation and synthesis by constraining the result to lie on that learned manifold.
Why Deep Learning Methods? Impressive results on visual recognition: classification, detection, segmentation, 3D vision, videos, etc. No feature engineering. Recent development of generative models (e.g. Generative Adversarial Networks).
Deep Learning trends: performance
Deep Learning trends: research AlexNet [Krizhevsky et al.] ImageNet [Jia et al.]
Discriminative Model M = {x | P(real | x) = 1} [ICCV '15]: RealismCNN → Predict Realism → Improve Editing (Image Editing Model). Generative Model M = {x | x = G(z)} [SIGGRAPH '14] [ECCV '16]: Project → Editing UI → Edit Transfer.
Discriminative Model M = {x | P(real | x) = 1} [ICCV '15]: RealismCNN → Predict Realism → Improve Editing (Image Editing Model). Foreground object F + background B → image composite I.
Learning Visual Realism: train a CNN to classify natural photos vs. composite images (25K natural photos vs. 25K composite images).
How do we get composite images? Target Object Composite Images Object Mask Object Masks with Similar Shapes Object Mask: (1) Human Annotation (2) Object Proposal [Lalonde and Efros 2007]
Ranking of Training Composites Most realistic composites Least realistic composites
Evaluation Dataset [Lalonde and Efros 2007]. Task: binary classification; 360 realistic photos (natural images + realistic composites) vs. 360 unrealistic photos. Metric: area under the ROC curve.
Methods without object mask: Lalonde and Efros (no mask) 0.61; AlexNet + SVM 0.73; RealismCNN 0.84; RealismCNN + SVM 0.88; Human 0.91.
Methods using object mask: Reinhard et al. 0.66; Lalonde and Efros (with mask) 0.81.
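The area-under-ROC metric above can be sketched as a small stand-alone function: AUC equals the probability that a randomly chosen realistic photo scores higher than a randomly chosen unrealistic one. The scores below are made-up toy values, not the paper's data.

```python
# Sketch: ROC AUC for a binary realism classifier, via the
# Mann-Whitney U formulation (ties count as half a win).

def auc(pos_scores, neg_scores):
    """Probability that a positive example outscores a negative one."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Toy example: realism scores for realistic vs. unrealistic photos.
realistic = [0.9, 0.8, 0.75, 0.6]
unrealistic = [0.7, 0.4, 0.3, 0.2]
print(auc(realistic, unrealistic))  # 0.9375
```

A perfect classifier scores 1.0, chance is 0.5, which is how the table's 0.61–0.91 values are read.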
Visual Realism Ranking Least Realistic Most Realistic Snowy Mountain Highway Ocean Red: unrealistic composite, Green: realistic composite, Blue: natural image
Our Pipeline Realism CNN Predict Realism Improve Composites Image Editing Model
Improving Visual Realism. Editing model: color adjustment g applied to the foreground object F, scored by the RealismCNN. Objective: E(g, F) = E_CNN + E_reg, minimized with a quasi-Newton method (L-BFGS). Original composite (realism score: 0.0) → improved composite (realism score: 0.8).
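The optimization above can be sketched with a toy stand-in: a quadratic mimics the CNN realism term, E_reg penalizes deviation from the identity color adjustment, and plain gradient descent stands in for L-BFGS. The per-channel gain g and all numbers here are illustrative, not the paper's model.

```python
# Sketch of the color-adjustment optimization E(g, F) = E_CNN + E_reg.
# g is a per-channel color gain; a quadratic pulling g toward a
# "realistic" gain stands in for the CNN term (assumption).

def energy_and_grad(g, target=(1.2, 0.9, 1.1), lam=0.1):
    """Toy energy: quadratic ~E_CNN plus identity-regularizer E_reg."""
    e = sum((gi - ti) ** 2 for gi, ti in zip(g, target))   # ~E_CNN
    e += lam * sum((gi - 1.0) ** 2 for gi in g)            # E_reg
    grad = [2 * (gi - ti) + 2 * lam * (gi - 1.0)
            for gi, ti in zip(g, target)]
    return e, grad

g = [1.0, 1.0, 1.0]                  # start from the identity adjustment
for _ in range(200):                 # gradient descent in place of L-BFGS
    e, grad = energy_and_grad(g)
    g = [gi - 0.1 * d for gi, d in zip(g, grad)]
print([round(gi, 3) for gi in g])
```

The regularizer keeps the edit conservative: the minimizer lands between the identity gain and the "realistic" gain, weighted by lambda.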
Selecting Suitable Objects Best-fitting object selected by RealismCNN Object with most similar shape
Optimizing Color Compatibility Object mask Cut-n-paste Lalonde et al. Xue et al. Ours
Sanity Check: Real Photos Object mask Cut-n-paste Lalonde et al. Xue et al. Ours
Visualizing and Localizing Errors (∂E/∂I_p): the energy E drops over L-BFGS iterations (50.73 → 9.38 → 5.05 → 3.44 → 3.00); result and gradient map shown.
Discriminative Model {x | P(real | x) = 1}
Pros: CNN is easy to train. Graphics programs often produce better images than generative models. General framework for many tasks (e.g. deblurring, retargeting, etc.).
Cons: Task-specific: a pre-trained model cannot be applied to other tasks. Graphics programs are often non-parametric and non-differentiable. Graphics programs often require a user in the loop, so automatically generating results for CNN training is challenging.
Code: github.com/junyanz/realismcnn Data: people.eecs.berkeley.edu/~junyanz/projects/realism/
Discriminative Model M = {x | P(real | x) = 1} [ICCV '15]: RealismCNN → Predict Realism → Improve Editing (Image Editing Model). Generative Model M = {x | x = G(z)} [SIGGRAPH '14] [ECCV '16]: Project → Editing UI → Edit Transfer.
Learning the Natural Image Manifold. Deep generative models: Generative Adversarial Network (GAN) [Goodfellow et al. 2014] [Radford et al. 2015] [Denton et al. 2015]; Variational Auto-Encoder (VAE) [Kingma and Welling 2013]; DRAW (recurrent neural network) [Gregor et al. 2015]; PixelRNN and PixelCNN [Oord et al. 2016].
Image Classification via Neural Network: input image I → "cat". (Slides credit: Andrew Owens)
Can We Generate Images with Neural Networks? Gaussian noise (or another random distribution) → image.
Generative Adversarial Networks (GAN) Generative Model Synthesized image [Goodfellow et al. 2014]
Generative Adversarial Networks (GAN) Generative Model Discriminative Model real [Goodfellow et al. 2014]
Generative Adversarial Networks (GAN) Generative Model Discriminative Model fake [Goodfellow et al. 2014]
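The three slides above describe the two-player game of Goodfellow et al. 2014: D is trained to label real images "real" and samples G(z) "fake", while G is trained to fool D. Written as the standard minimax objective:

```latex
\min_G \max_D \;
  \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\!\left[\log\bigl(1 - D(G(z))\bigr)\right]
```

At the equilibrium of this game, the generator's distribution matches the data distribution, which is why a trained G can be read as an approximation of the natural image manifold.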
Cat Generation (w.r.t. training iterations)
GAN as Manifold Approximation Sample training images from Amazon Shirts Random image samples from Generator G(z) [Radford et al. 2015]
Traverse on the GAN Manifold. Linear interpolation in z space: G(z_0 + t(z_1 − z_0)), from G(z_0) to G(z_1). Limitations: not photo-realistic enough, low resolution; images are produced randomly, with no user control. [Radford et al. 2015]
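The traversal above is just linear interpolation of latent codes, G(z_0 + t(z_1 − z_0)); the generator G itself is the trained network, so only the latent-side arithmetic is sketched here, on toy 3-dimensional codes.

```python
# Sketch: walking the GAN manifold by interpolating latent vectors.
# Each intermediate code would be decoded with the trained G.

def lerp(z0, z1, t):
    """Pointwise linear interpolation between two latent vectors."""
    return [a + t * (b - a) for a, b in zip(z0, z1)]

z0 = [0.0, 1.0, -1.0]
z1 = [1.0, 0.0, 1.0]
path = [lerp(z0, z1, t / 4) for t in range(5)]  # 5 codes, t in [0, 1]
print(path[2])  # midpoint: [0.5, 0.5, 0.0]
```

Decoding each code on the path gives a smooth image sequence, e.g. one shirt morphing into another in the Amazon Shirts model.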
Overview original photo Project Editing UI different degree of image manipulation Edit Transfer projection on manifold transition between the original and edited projection
Projecting an Image onto the Manifold. Input: real image x^R. Output: latent vector z. Optimization: minimize the reconstruction loss L over the generative model G(z) (losses: 0.196, 0.238, 0.332).
Projecting an Image onto the Manifold. Input: real image x^R. Output: latent vector z. Optimization (losses: 0.196, 0.238, 0.332) vs. Inverting Network z = P(x), an auto-encoder with a fixed decoder G (losses: 0.218, 0.242, 0.336).
Projecting an Image onto the Manifold. Input: real image x^R. Output: latent vector z. Optimization (0.196, 0.238, 0.332); Inverting Network z = P(x) (0.218, 0.242, 0.336); Hybrid Method: use the network prediction as initialization for the optimization problem (0.268, 0.153, 0.167).
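The hybrid projection above can be sketched end to end with toy stand-ins: a linear "generator" G(z) = 2z and a deliberately imperfect "inverting network" P replace the real networks, and gradient descent on the reconstruction loss refines P's initial guess. Everything here is illustrative, not the paper's architecture.

```python
# Sketch of hybrid projection onto the manifold: network prediction
# z = P(x) as initialization, then minimize ||G(z) - x||^2.

def G(z):                            # toy linear "generator" (assumption)
    return [2.0 * zi for zi in z]

def P(x):                            # toy inverting network: rough inverse
    return [0.45 * xi for xi in x]   # deliberately imperfect

def project(x, steps=100, lr=0.05):
    z = P(x)                         # initialize from the network
    for _ in range(steps):           # refine: grad of ||G(z) - x||^2
        grad = [2.0 * (gz - xi) * 2.0 for gz, xi in zip(G(z), x)]
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z

x = [1.0, -2.0, 4.0]
z = project(x)
print([round(zi, 3) for zi in z])    # close to [0.5, -1.0, 2.0]
```

The network alone is fast but biased; the optimization alone can get stuck; warm-starting the optimization from P(x) combines the two, which is what the loss numbers on the slide compare.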
Overview original photo Project Editing UI different degree of image manipulation Edit Transfer projection on manifold transition between the original and edited projection
Manipulating the Latent Vector. Objective: minimize the constraint-violation loss L_g between G(z) and each user-guidance image v_g, starting from z_0.
Overview original photo Project Editing UI different degree of image manipulation Edit Transfer projection on manifold transition between the original and edited projection
Edit Transfer. Motion (u, v) + color (A, a 3×4 matrix): estimate per-pixel geometric and color variation between G(z_0) and G(z_1) along the linear interpolation in z space; input.
Edit Transfer. Motion (u, v) + color (A, a 3×4 matrix): estimate per-pixel geometric and color variation between G(z_0) and G(z_1) along the linear interpolation in z space; input → result.
Image Manipulation Demo
Designing Products
Interactive Image Generation
The Simplest Generative Model: Averaging. AverageExplorer: {x | x = Σ_n w_n I_n^warp}. Generative model: weighted average of warped images. Limitation: cannot synthesize novel content. [Zhu et al. 2014]
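The averaging model above, x = Σ_n w_n I_n^warp, is simple enough to sketch directly: the output is a per-pixel weighted average of already-warped images. Images are flat pixel lists here, and the warping step is assumed to have happened upstream.

```python
# Sketch of AverageExplorer's averaging model: a per-pixel weighted
# average of warped images, with weights normalized to sum to one.

def weighted_average(images, weights):
    total = sum(weights)
    norm = [w / total for w in weights]          # normalize weights
    return [sum(w * img[p] for w, img in zip(norm, images))
            for p in range(len(images[0]))]

# Toy: three 4-pixel "images", the middle one weighted twice as much.
imgs = [[0.0, 0.2, 0.4, 0.6],
        [0.2, 0.4, 0.6, 0.8],
        [0.4, 0.6, 0.8, 1.0]]
avg = weighted_average(imgs, [1.0, 2.0, 1.0])
print([round(v, 2) for v in avg])  # [0.2, 0.4, 0.6, 0.8]
```

Because every output pixel is a convex combination of input pixels, the model can only blend what it has seen, which is exactly the "cannot synthesize novel content" limitation on the slide.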
Generative Image Transformation
iGAN (interactive GAN). Get the code: github.com/junyanz/igan. Intelligent drawing tools via GAN; debugging tools for understanding and visualizing deep generative networks. Work in progress: supporting more models (GAN, VAE, Theano/TensorFlow).
Generative Model {x | x = G(z), z ∈ Z}
Pros: Task-independent: offline generative-model training is independent of the graphics application. Optimizing z is easier than optimizing x. Generative models keep getting better.
Cons: Low quality and low resolution require post-processing (still engineering work). Limitations of current generative models: cannot produce good texture.
Related work on GAN. Goodfellow's NIPS 2016 tutorial: [arXiv], [slides]. Early work: [Tu '07], [Gutmann and Hyvarinen '10], etc. New models: InfoGAN, SSGAN, VAE-GAN, LAPGAN, BiGAN, CoGAN, PPGAN, etc. Training techniques: DCGAN, Improved-GAN, EBGAN, unrolling. Image: inpainting, inverting features, style transfer, text-to-image, super-resolution, etc. Video: frame prediction, tiny videos, etc.
Image-to-Image Translation with Conditional Adversarial Nets. Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. arXiv 2016. Code: github.com/phillipi/pix2pix
Image-to-Image Problems
Conditional Adversarial Networks (cGAN). Loss: L1 + GAN. G: U-Net. D: PatchGAN (70×70). U-Net [Ronneberger et al. '15]
Conditional GAN
Network and Loss Function. Loss function: L1 + GAN. Generator G: U-Net [Ronneberger et al. '15]. Discriminator D: 70×70 PatchGAN (FCN).
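The combined L1 + GAN generator objective above can be sketched numerically. A non-saturating GAN term and an illustrative lambda are assumptions here; the discriminator's per-patch outputs (from the 70×70 PatchGAN) are represented by a flat list of probabilities, and all numbers are toy values.

```python
# Sketch of the pix2pix generator loss: L = L_GAN + lambda * L1.
import math

def generator_loss(fake_patch_probs, fake_img, real_img, lam=100.0):
    # GAN term (non-saturating form, an assumption): the generator
    # wants D's patch probabilities on its output to be near 1.
    l_gan = -sum(math.log(p) for p in fake_patch_probs) / len(fake_patch_probs)
    # L1 term: mean absolute error against the ground-truth image.
    l1 = sum(abs(f - r) for f, r in zip(fake_img, real_img)) / len(fake_img)
    return l_gan + lam * l1

# Toy call: two patch probabilities, a 3-pixel output vs. ground truth.
loss = generator_loss([0.5, 0.8], [0.1, 0.2, 0.3], [0.1, 0.25, 0.3])
print(round(loss, 3))
```

The L1 term keeps the output globally faithful to the target; the PatchGAN term only has to supply high-frequency realism, which is the division of labor the later ablation slides compare.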
Different Losses
Architectures for Generator G
Patch Size of PatchGAN
Applications
Label → Facade
Label → Street View
Map Generation
Day → Night
Edge → Handbag (edges from HED [Xie and Tu '15])
Edge → Shoe (edges from HED [Xie and Tu '15])
User Sketch → Photo
Automatic Colorization
Failure Cases: sparse input image; unusual input image.
Summary: Image-to-Image Problems
Cat Paper Collection. GitHub: github.com/junyanz/catpapers. 90% of data is visual; most visual data is about cats. 60+ vision, learning, and graphics papers.
Thank You! Eli Yong Jae Philipp Alyosha Philipp Tinghui