Lecture 10: Other Applications of CNNs



CSED703: Deep Learning for Visual Recognition (2017F)
Lecture 10: Other Applications of CNNs
Bohyung Han, Computer Vision Lab.

Applications of Convolutional Neural Networks
  Face recognition and verification
  Person reidentification
  Text region detection
  Style transfer
  Object generation
  Visual attention and saliency
  Visual analogy
  Autonomous driving
  Many others

Neural Artistic Style
  Main goal
    Synthesizing two images into one that represents both content and style
  Method
    Exploiting a CNN pretrained for image classification
    CNN: VGG 19-layer net without fully connected layers
    No fine-tuning
    Average pooling: improves gradient flow and gives more appealing results
  [Figure: the content input and the style input each pass through the CNN; the output is driven by a loss on feature maps (content) and a loss on feature map correlations (style).]
  Cited source: CVPR 2016


Optimization
  Loss
    $L_{total}(p, a, x) = \alpha L_{content}(p, x, l) + \beta L_{style}(a, x)$
    $L_{content}(p, x, l) = \frac{1}{2} \sum_{i,j} (F^l_{ij} - P^l_{ij})^2$
    $L_{style}(a, x) = \sum_l w_l E_l$, where $E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} (G^l_{ij} - A^l_{ij})^2$
  Notation
    $p$ and $P^l$: original content image and its feature map in the $l$-th layer
    $x$ and $F^l$: generated image and its feature map in the $l$-th layer
    $a$ and $S^l$: original style image and its feature map in the $l$-th layer
    $F^l, P^l, S^l \in \mathbb{R}^{N_l \times M_l}$, where $N_l$ is the number of feature maps and $M_l$ is the size of a feature map ($M_l = width_l \times height_l$)
    $G^l, A^l \in \mathbb{R}^{N_l \times N_l}$: correlation of feature maps in the $l$-th layer
    $G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$, $A^l_{ij} = \sum_k S^l_{ik} S^l_{jk}$
  Error back-propagation
    Content: select a particular layer such as conv4_2
      $\frac{\partial L_{content}}{\partial F^l_{ij}} = (F^l - P^l)_{ij}$ if $F^l_{ij} > 0$, and $0$ if $F^l_{ij} < 0$
    Style: use conv*_1 with equal weights ($w_l = 0.2$)
      $\frac{\partial L_{style}}{\partial F^l_{ij}} = w_l \frac{\partial E_l}{\partial F^l_{ij}}$
      $\frac{\partial E_l}{\partial F^l_{ij}} = \frac{1}{N_l^2 M_l^2} \left( (F^l)^T (G^l - A^l) \right)_{ji}$ if $F^l_{ij} > 0$, and $0$ if $F^l_{ij} < 0$

Generated Images
  [Figure: generated images]

More Examples
  [Figure: a source image rendered with Style 1 (The Starry Night) and Style 2 (The Scream)]
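The content and style losses above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: feature maps are taken as already extracted arrays of shape (N_l, M_l), the function names are hypothetical, and the style layers get equal weights that sum to one (which gives 0.2 for five layers).

```python
import numpy as np

def gram_matrix(F):
    """Correlation of feature maps: G[i, j] = sum_k F[i, k] * F[j, k]."""
    return F @ F.T

def content_loss(F, P):
    """Half the squared error between generated (F) and content (P) feature maps."""
    return 0.5 * np.sum((F - P) ** 2)

def style_layer_loss(F, S):
    """E_l: squared Gram-matrix difference, scaled by 1 / (4 N^2 M^2)."""
    N, M = F.shape
    diff = gram_matrix(F) - gram_matrix(S)
    return np.sum(diff ** 2) / (4.0 * N ** 2 * M ** 2)

def total_loss(F_content, P_content, style_pairs, alpha=1.0, beta=1.0):
    """alpha * L_content + beta * L_style with equal per-layer weights (assumption)."""
    w = 1.0 / len(style_pairs)
    style = sum(w * style_layer_loss(F, S) for F, S in style_pairs)
    return alpha * content_loss(F_content, P_content) + beta * style
```

In a real setting the arrays would come from the chosen VGG layers, and the generated image would be updated by back-propagating this loss to the pixels.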


Balance between Content and Style
  [Figure: results for different content/style weightings]

Multiple Styles
  [Figure: a source image rendered with Style 1 (The Starry Night) and Style 2 (The Scream)]

Face Verification
  Definition
    Given two faces, determine whether they are the same person or not.
    Binary decision by one-to-one matching
  Related problems
    Face detection: finding faces
    Face recognition: multi-class classification problem
  Standard pipeline
    Face Detection -> Face Alignment -> Feature Extraction -> Binary Classification

Siamese Network
  Discriminative Deep Metric Learning (DDML)
    Learning a distance metric: $d_f(x_i, x_j) = \| f(x_i) - f(x_j) \|_2$
    Representation learning
    Two branches share weights.
  Objective
    $\ell_{ij} \left( \tau - d_f(x_i, x_j) \right) > 1$
    $\ell_{ij} = 1$: same ID
    $\ell_{ij} = -1$: different ID
  [Hu2014] J. Hu, J. Lu, Y.-P. Tan: Discriminative Deep Metric Learning for Face Verification in the Wild. CVPR 2014
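To make the DDML objective concrete, the sketch below evaluates the pair constraint l_ij (tau - d_f(x_i, x_j)) > 1 on two embeddings that are assumed to come from the shared-weight branches. The function names are hypothetical, and the plain hinge penalty is an illustrative assumption, not necessarily the exact loss used in the cited paper.

```python
import numpy as np

def ddml_distance(f_i, f_j):
    """Euclidean distance between the two branch outputs."""
    return float(np.linalg.norm(f_i - f_j))

def pair_label(same_id):
    """l_ij = 1 for the same identity, -1 for different identities."""
    return 1 if same_id else -1

def satisfies_margin(f_i, f_j, same_id, tau):
    """Check the constraint l_ij * (tau - d_f) > 1."""
    return pair_label(same_id) * (tau - ddml_distance(f_i, f_j)) > 1

def hinge_penalty(f_i, f_j, same_id, tau):
    """Illustrative hinge on the violated constraint (assumption)."""
    margin = pair_label(same_id) * (tau - ddml_distance(f_i, f_j))
    return max(0.0, 1.0 - margin)
```

Same-identity pairs are pushed to a distance below tau - 1 and different-identity pairs to a distance above tau + 1, so tau acts as the decision threshold at test time.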

DeepID I
  Deep hidden IDentity features (DeepID)
    97.45% verification accuracy: as good as human performance (97.53%)
  CNN architecture
    Multiple scales feed the DeepID layer:
    $y_j = \max\left(0, \sum_i x^1_i w^1_{i,j} + \sum_i x^2_i w^2_{i,j} + b_j\right)$
  [Sun2014a] Y. Sun, X. Wang, X. Tang: Deep Learning Face Representation from Predicting 10,000 Classes. CVPR 2014

Joint Bayesian Verification Algorithm
  $x = \mu + \epsilon$, where $\mu$ is face identity and $\epsilon$ is intra-class variation
  $\mu \sim N(0, S_\mu)$ and $\epsilon \sim N(0, S_\epsilon)$
  Compute $r(x_1, x_2) = \log \frac{P(x_1, x_2 \mid H_I)}{P(x_1, x_2 \mid H_E)}$, which has a closed-form solution.

Comparison between Two Verifiers
  Joint Bayesian is better than neural network.
  Neural network verifier: 60 groups of highly correlated sub-features (640D each)
  [Figure: test accuracy (%) versus number of classes for training, for the Joint Bayesian and neural network verifiers]
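The Joint Bayesian log-likelihood ratio can be illustrated directly from the model x = mu + eps. Under the same-identity hypothesis H_I the two faces share mu, so their cross-covariance is S_mu; under H_E they are independent. The sketch below evaluates both zero-mean Gaussian densities numerically instead of using the closed-form solution; the function names and the toy covariances are assumptions for illustration.

```python
import numpy as np

def gaussian_logpdf(x, cov):
    """Log-density of a zero-mean multivariate Gaussian."""
    d = x.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    quad = x @ np.linalg.solve(cov, x)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)

def joint_bayesian_ratio(x1, x2, S_mu, S_eps):
    """r(x1, x2) = log P(x1, x2 | H_I) - log P(x1, x2 | H_E)."""
    d = x1.shape[0]
    S = S_mu + S_eps
    zeros = np.zeros((d, d))
    cov_intra = np.block([[S, S_mu], [S_mu, S]])    # same identity: shared mu
    cov_extra = np.block([[S, zeros], [zeros, S]])  # different identities
    x = np.concatenate([x1, x2])
    return gaussian_logpdf(x, cov_intra) - gaussian_logpdf(x, cov_extra)
```

A positive ratio favours "same person"; in practice S_mu and S_eps are estimated from training features, and the ratio is thresholded.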


Results
  Comparison of state-of-the-art face verification methods on LFW (cells lost in extraction are shown as "...")

  Method                          Accuracy (%)    No. of points  No. of images          Feature dimension
  Joint Bayesian [8]              ... (o)         5              99,...                 ...
  ConvNet-RBM [3]                 ... (o)         3              87,628                 N/A
  CMD+SLBP [7]                    ... (u)         3              N/A                    2302
  Fisher vector faces [29]        ... (u)         9              N/A                    128x2
  Tom-vs-Pete classifiers [2]     ... (o+r)       95             20,639                 5000
  High-dim LBP [9]                95.17 (o)       27             99,773                 2000
  TL Joint Bayesian [6]           96.33 (o+u)     27             99,773                 2000
  DeepFace [32]                   97.25 (o+u)     6+67           4,400,000 + 3,000,000  4096x4
  DeepID on CelebFaces            ... (o)         5              87,...                 ...
  DeepID on CelebFaces            ... (o)         5              202,...                ...
  DeepID on CelebFaces+ & TL      97.45 (o+u)     5              202,599                150

  r: restricted training protocol, where 6000 face pairs given by LFW are used for 10-fold cross-validation
  u: unrestricted training protocol, where more training pairs can be generated from LFW using identity information
  o: using outside training data, however, without using training data from LFW
  o+r: using both outside data and LFW data in the restricted protocol for training
  o+u: using both outside data and LFW data in the unrestricted protocol for training
  TL: Joint Bayesian transfer learning from CelebFaces+ to LFW
  Human-level performance: 97.53

DeepID II
  Joint identification-verification
    Face identification: increases the inter-personal variations by drawing DeepID2 features extracted from different identities apart
    Face verification: reduces the intra-personal variations by pulling DeepID2 features extracted from the same identity together
  Feature extraction
    $f = \mathrm{Conv}(x, \theta_c)$
  [Sun2014b] Y. Sun, Y. Chen, X. Wang, X. Tang: Deep Learning Face Representation by Joint Identification-Verification. NIPS 2014

Two loss functions
  Identification loss: cross-entropy
    $\mathrm{Ident}(f, t, \theta_{id}) = -\sum_{i=1}^{n} p_i \log \hat{p}_i = -\log \hat{p}_t$
    $\theta_{id}$: parameters of softmax layer
  Verification loss
    $\mathrm{Verif}(f_i, f_j, y_{ij}, \theta_{ve}) = \frac{1}{2} \| f_i - f_j \|_2^2$ if $y_{ij} = 1$
    $\mathrm{Verif}(f_i, f_j, y_{ij}, \theta_{ve}) = \frac{1}{2} \max\left(0, m - \| f_i - f_j \|_2\right)^2$ if $y_{ij} = -1$
    $\theta_{ve} = \{m\}$
  Training CNN
    $f = \mathrm{Conv}(x, \theta_c)$

Verification Algorithm
  Feature extraction
    Detect 21 facial landmarks by SDM algorithm and align faces globally
    Crop 400 face patches with variations in positions, scales, color channels, and horizontal flipping
    200 CNNs: generate 400 DeepID2 feature vectors with horizontal flipping
    Feature vector: 160D
  Feature dimensionality reduction
    Select 25 patches in a greedy manner
    PCA from 25x160D to 180D
    [Figure: selected 25 face patches]
  Joint Bayesian for verification
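The two DeepID2 losses can be written down compactly. The sketch below is a minimal NumPy illustration, assuming the softmax logits and the feature vectors f_i, f_j are already computed; the function names are hypothetical, and m is the margin parameter of the verification loss.

```python
import numpy as np

def identification_loss(logits, target):
    """Cross-entropy: -log of the softmax probability of the target class."""
    z = logits - np.max(logits)              # shift for numerical stability
    log_probs = z - np.log(np.sum(np.exp(z)))
    return float(-log_probs[target])

def verification_loss(f_i, f_j, y_ij, m):
    """Pull same-identity features together, push others beyond margin m."""
    dist = float(np.linalg.norm(f_i - f_j))
    if y_ij == 1:
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, m - dist) ** 2
```

During training the two signals would be combined with a weighting hyperparameter; the weighting used in the lecture is not stated in the text, so it is left out here.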


Results on LFW

  Method                   Accuracy (%)
  High-dim LBP [4]         95.17 ± 1.13
  TL Joint Bayesian [2]    96.33 ± 1.08
  DeepFace [2]             97.35 ± 0.25
  DeepID [20]              97.45 ± 0.26
  GaussianFace [3]         98.52 ± 0.66
  DeepID2                  99.15 ± 0.13

  Human-level performance: 97.53

FaceNet
  Architecture
    Direct mapping between face images and embedded points
  Triplet loss
    Using large margin nearest neighbor (LMNN)
  [Schroff15] F. Schroff, D. Kalenichenko, J. Philbin: FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015

Discriminative vs. Generative CNN
  [Figure: a discriminative CNN maps an image to object class, viewpoint, and style; a generative CNN maps object class, viewpoint, and style to an image]
  Goal
    Generate an object based on high-level inputs such as
      Class
      Orientation with respect to camera
      Additional parameters
        Rotation, translation, zoom
        Stretching horizontally or vertically
        Hue, saturation, brightness
  Knowledge transfer
    Generative CNN learns the manifold of chairs.
    Interpolation between viewpoints and different objects
  [Dosovitskiy15] A. Dosovitskiy, J. T. Springenberg, T. Brox: Learning to Generate Chairs with Convolutional Neural Networks. CVPR 2015
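Returning to the FaceNet triplet loss named above: the slide does not write the loss out, so the sketch below uses the commonly stated form, max(0, ||a - p||^2 - ||a - n||^2 + margin), for an anchor a, a positive p (same identity), and a negative n (different identity). The function name and the margin value are assumptions for illustration.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared distances: the positive should be closer than the negative by a margin."""
    d_pos = float(np.sum((anchor - positive) ** 2))
    d_neg = float(np.sum((anchor - negative) ** 2))
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once every negative is farther from the anchor than the positive by at least the margin, which is the large-margin nearest-neighbor idea the slide refers to.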


Data
  Using 3D chair model dataset [Aubry14]
    Original dataset: 1393 chair models, 62 viewpoints (31 azimuth angles, 2 elevation angles)
    Sanitized version: 809 models, tight cropping, resizing to 128x128
  Notations
    $D = \{(c^1, v^1, \theta^1), (c^2, v^2, \theta^2), \ldots, (c^N, v^N, \theta^N)\}$
      $c$: class label
      $v$: viewpoint
      $\theta$: additional parameters
    $O = \{(x^1, s^1), (x^2, s^2), \ldots, (x^N, s^N)\}$
      $x$: target RGB output image
      $s$: segmentation mask
  [Aubry14] M. Aubry, D. Maturana, A. Efros, J. Sivic: Seeing 3D Chairs: Exemplar Part-based 2D-3D Alignment using a Large Dataset of CAD Models. CVPR 2014

Network Architecture
  $g = u \circ h$
  32M parameters altogether
  [Figure: network with an RGB stream and a segmentation stream]

Operations
  Unpooling: 2x2, fixed location unpooling
  Deconvolution: 5x5

Training
  Objective function
    Minimizing the Euclidean error of the reconstruction of
      the segmented-out chair image
      the segmentation mask, $s$
    $\min_W \sum_{i=1}^{N} \lambda \left\| u_{RGB}(h(c^i, v^i, \theta^i)) - T_{\theta^i}(x^i \cdot s^i) \right\|_2^2 + \left\| u_{segm}(h(c^i, v^i, \theta^i)) - T_{\theta^i} s^i \right\|_2^2$
  [Figure: visualization of uconv-3 layer filters in the 128x128 network, RGB stream and segmentation stream]
  [Saxe14] A. M. Saxe, J. L. McClelland, and S. Ganguli, Learning a Nonlinear Embedding by Preserving Class Neighbourhood. ICLR


Network Capacity
  [Figure: generated chairs under translation, rotation, zoom, stretch, saturation, brightness, and color changes]

Morphing Different Chairs
  [Figure: interpolated chairs; label: viewpoints in training set]

Autonomous Driving
  Two previous approaches
    Mediated perception: parsing the entire scene to make a driving decision (e.g., Mobileye, Google)
    Behavior reflex: directly mapping an input image to a driving decision by a regressor (ALVINN, LeCun et al.)
  [Figure: Input Image -> Mediated Perception / Behavior Reflex / Direct Perception (ours) -> Driving Control]

Deep Driving
  Direct perception
    Estimating the affordance for driving
    Simple input to model using a few key perception indicators
    Compact yet complete descriptions of the scene for vehicle control
  Approach
    Built upon deep convolutional neural network
    Trained and tested on TORCS (The Open Racing Car Simulator)
    Learning for estimating affordance related to autonomous driving
    Simpler than the mediated perception approach
    More interpretable than the typical behavior reflex approach
  [Chen15] C. Chen, A. Seff, A. Kornhauser, J. Xiao: DeepDriving: Learning Affordance for Direct Perception in Autonomous Driving. ICCV 2015


Platform
  System architecture
    [Figure: TORCS writes the image and speed to shared memory; the convolutional neural network reads the image and outputs the affordance indicators (angle, tomarking..., dist...); the driving controller reads the indicators and the speed and writes driving controls back through shared memory]
  Environment
    Focusing on highway driving with multiple lanes
    Three configurations: a road of one lane, two lanes, or three lanes

Prediction of affordance indicators
  Always: angle
  "In lane" system: tomarking_ll, tomarking_ml, tomarking_mr, tomarking_rr, dist_ll, dist_mm, dist_rr
  "On marking" system: tomarking_l, tomarking_m, tomarking_r, dist_l, dist_r
  [Figure: "in lane" system activate range, "on marking" system activate range, and overlapping area]
  [Figure: (a) one-lane (b) two-lane, left (c) two-lane, right (d) three-lane (e) inner lane mark. (f) boundary lane mark.]
  [Figure: (a) angle (b) in lane: tomarking (c) in lane: dist (d) on marking: tomarking (e) on marking: dist (f) overlapping area]

Visualization of Learned Models
  [Figure: response map of KITTI-based ConvNet model]
  [Figure: response map of TORCS-based ConvNet model]

Deep Driving Demo


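Analogy
  PARIS : FRANCE :: BEIJING : CHINA
  [Figure: Paris, France, Beijing, and China as points in an embedding space]

Visual Analogy Making
  [Figure: image analogies of the form a : b :: c : ? for changing color, changing shape, and changing size]
  Slide credit: Scott Reed

Visual Analogy Making
  Concept
    Learns an encoder function $f$: mapping images into a space where analogies can be performed
    Learns a decoder $g$: mapping back to the image space
  Architecture
    Infer relationship, then transform query
  Objectives
    $L_{add} = \sum_{(a,b,c,d) \in A} \left\| d - g\left(f(b) - f(a) + f(c)\right) \right\|_2^2$
    $L_{mul} = \sum_{(a,b,c,d) \in A} \left\| d - g\left(f(c) + W \times_1 [f(b) - f(a)] \times_2 f(c)\right) \right\|_2^2$
    $L_{deep} = \sum_{(a,b,c,d) \in A} \left\| d - g\left(f(c) + h([f(b) - f(a); f(c)])\right) \right\|_2^2$
  Analogy by retrieval
    $d = \arg\max_{w} \cos\left(f(w), f(b) - f(a) + f(c)\right)$
  [Reed15] S. Reed, Y. Zhang, Y. Zhang, H. Lee: Deep Visual Analogy Making. NIPS 2015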
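The additive analogy rule, f(b) - f(a) + f(c), and the cosine-similarity retrieval step can be tried on toy embeddings. In the sketch below the embeddings are hand-made 3-D vectors standing in for encoder outputs (an assumption for illustration), and the function names are hypothetical.

```python
import numpy as np

def additive_query(f_a, f_b, f_c):
    """Apply the a -> b relationship to c in embedding space."""
    return f_b - f_a + f_c

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def retrieve(candidates, target):
    """Return the name of the candidate embedding most similar to the target."""
    return max(candidates, key=lambda name: cosine(candidates[name], target))

# Toy embeddings: two country axes plus one city-vs-country axis (hypothetical).
embeddings = {
    "paris":   np.array([1.0, 0.0, 1.0]),
    "france":  np.array([1.0, 0.0, -1.0]),
    "beijing": np.array([0.0, 1.0, 1.0]),
    "china":   np.array([0.0, 1.0, -1.0]),
}
query = additive_query(embeddings["paris"], embeddings["france"], embeddings["beijing"])
answer = retrieve(embeddings, query)
```

For images, the retrieval step is replaced by the decoder g, which maps the transformed embedding back to pixel space.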

Optimization
  Regularization
    For accurate analogy completion by image manifold traversing
    Making transformation match the difference of encoder embeddings
      $R = \sum_{(a,b,c,d) \in A} \left\| f(d) - f(c) - T(f(a), f(b), f(c)) \right\|_2^2$
      $T(x, y, z) = y - x$ for $L_{add}$
      $T(x, y, z) = W \times_1 [y - x] \times_2 z$ for $L_{mul}$
      $T(x, y, z) = \mathrm{MLP}([y - x; z])$ for $L_{deep}$
  Training
    With backpropagation using SGD
    Combined loss: $L + \alpha R$, $\alpha = 0.01$

  Algorithm 1: Manifold traversal by analogy, with transformation function T (Eq. 5)
    Given images a, b, c, and N (# steps)
    z <- f(c)
    for i = 1 to N do
      z <- z + T(f(a), f(b), z)
      x_i <- g(z)
    return generated images x_i (i = 1, ..., N)

Shape Predictions: Additive Model
  [Figure: rotate, scale, and shift analogies; columns: ref, out, query, predictions at t=1, t=2, t=3, t=4]

Shape Predictions: Multiplicative Model
  [Figure: rotate, scale, and shift analogies; columns: ref, out, query, predictions at t=1, t=2, t=3, t=4]
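Algorithm 1 translates almost line by line into code. The sketch below is a minimal version in which the encoder f, the decoder g, and the transformation T are passed in as callables; the additive T and the identity encoder/decoder in the usage lines are stand-ins chosen only so the example runs without a trained network.

```python
import numpy as np

def additive_T(x, y, z):
    """Additive transformation: the increment does not depend on z."""
    return y - x

def manifold_traversal(a, b, c, n_steps, f, g, T):
    """Repeatedly apply the a -> b transformation to the embedding of c."""
    z = f(c)
    images = []
    for _ in range(n_steps):
        z = z + T(f(a), f(b), z)   # move one step along the manifold
        images.append(g(z))        # decode the current embedding
    return images

# Usage with identity encoder/decoder (hypothetical stand-ins).
identity = lambda v: v
steps = manifold_traversal(np.array([0.0]), np.array([1.0]), np.array([5.0]),
                           n_steps=3, f=identity, g=identity, T=additive_T)
```

With the multiplicative or deep variants, T depends on the current z, which is what lets repeated application follow a curved trajectory such as a rotation.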


Shape Predictions: Deep Model
  [Figure: rotate, scale, and shift analogies; columns: ref, out, query, predictions at t=1, t=2, t=3, t=4]

Results for Analogy Models
  Transforming shapes
  [Table: L_add, L_mul, and L_deep compared over rotation steps, scaling steps, and translation steps; the numeric entries were lost in extraction]
  [Figure: ref, +rot (gt), query, and four successive +rot predictions; likewise for +scl and +trans]

Learning Disentangled Features
  Disentangling and Analogy Making
  Objective function
    $L_{dis} = \sum_{(a,b,c) \in D} \left\| c - g\left(s \cdot f(a) + (1 - s) \cdot f(b)\right) \right\|_2^2$
  [Figure: images a, b, c encoded into identity and pose units, with increment function T]

  Algorithm 2: Disentangling training update. The switches s determine which units from f(a) and f(b) are used to reconstruct image c.
    Given input images a, b and target c
    Given switches s in {0, 1}^K
    z <- s · f(a) + (1 - s) · f(b)
    Δθ <- ∂/∂θ ( ||g(z) - c||_2^2 )

  Slide credit: Scott Reed
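The forward part of Algorithm 2 is a unit-wise mix of two embeddings followed by a reconstruction error. The sketch below shows only that forward computation, assuming f(a), f(b), and the decoded image are already available as arrays; the function names are hypothetical, and the gradient step on the network parameters is left to an autodiff framework.

```python
import numpy as np

def disentangle_mix(f_a, f_b, s):
    """Take units where s == 1 from f(a) and units where s == 0 from f(b)."""
    return s * f_a + (1 - s) * f_b

def reconstruction_loss(decoded, c):
    """Squared Euclidean error between the decoded image and the target c."""
    return float(np.sum((decoded - c) ** 2))

# Example: first two units play the role of identity, last two of pose (assumption).
f_a = np.array([1.0, 2.0, 3.0, 4.0])
f_b = np.array([5.0, 6.0, 7.0, 8.0])
s = np.array([1, 1, 0, 0])
z = disentangle_mix(f_a, f_b, s)
```

Choosing s so that identity units come from one image and pose units from the other is what forces the two groups of units to carry separate factors.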


Classification and Analogy Making
  Pose transformations are modeled by deep additive interactions
  Attribute classifier: separate classification for identity
  [Figure: images a, b, c, d encoded into identity and pose units, with increment function T and an attribute classifier]

Results for Disentangled Features
  Transferring animation
  Disentangling pose from identity
  [Table: L_add, L_dis, and L_dis+cls compared on spellcast, thrust, walk, slash, shoot, and average; the numeric entries were lost in extraction]
  Slide credit: Scott Reed

Results for Extrapolation
  [Figure: ref, output, query, and predictions for walk, thrust, and rotate]

Summary
  Proposing novel deep architectures that can perform visual analogy making by simple operations in an embedding space
    Convolutional encoder-decoder networks
    Modeling transformations by vector addition in embedding space works for simple problems, but multi-layer neural networks are better.
  Combining analogy and disentangling training methods
    Analogy representations can overcome limitations of disentangled representations by learning transformation manifold.
