Regression in Deep Learning: Siamese and Triplet Networks


1 Regression in Deep Learning: Siamese and Triplet Networks
Tu Bui, John Collomosse: Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, United Kingdom
Leonardo Ribeiro, Tiago Nazare, Moacir Ponti: Institute of Mathematics and Computer Sciences (ICMC), University of Sao Paulo, Brazil

2 Content
- The regression problem
- Siamese network and contrastive loss
- Triplet network and triplet loss
- Training tricks
- Regression application: sketch-based image retrieval
- Limitations and future work

3 Revolution of deep learning in classification
[Chart: top-5 error (%) of ImageNet ILSVRC winners by year, lower is better. Entries: shallow, AlexNet (15.3), ZFNet, GoogleNet, ResNet, Ensemble (2.99), SENet (2.25); human performance marked at 6%.]

4 Classification vs. Regression
Classification
- Discrete set of outputs
- Output: label/class/category
Regression
- Continuous-valued output
- Output: embedding feature
[Figure: feature vector with components $x_1, x_2, x_3, x_4, \dots, x_n$]

5 Regression example: intra-domain learning
- Face identification (Schroff et al. CVPR 2015)
- Tracking (Wang & Gupta ICCV)

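6 Regression example: cross-domain learning
Multi-modality visual search: a query such as "duck" can be given as text, a 3D model, a photo or a sketch, and each modality has its own model mapping into one embedding space.
- Language model: Skip-gram
- 3D model: VoxNet
- Photo model: AlexNet
- Sketch model: SketchANet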

7 Conventional methods for cross-domain regression
- Step 1: extract local features (SIFT, HoG, SURF) from the source and target data, then aggregate them into global features (BoW, GMM).
- Step 2: multiply the source global features by a learnable transform matrix M so that the transformed features and the target global features lie in one embedding space.
Problem: this assumes a linear transformation between the two domains.

8 End-to-end regression with deep learning
End-to-end learning with a multi-stream network:
- Source data -> Layer 1 -> Layer 2 -> ... -> Layer n -> embedding space
- Target data -> Layer 1 -> Layer 2 -> ... -> Layer m -> embedding space

9 End-to-end regression with multi-stream networks
Open questions:
- Network designs?
- Loss function to be used?

10 Using output of classification model as feature?
- Not intuitive: different objective function.
- Cross-domain learning: training a classification network for each domain separately does not guarantee a common embedding.
[Figure: two separate classification networks, each with fc6, fc7 and a softmax loss]


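12 Siamese network and contrastive loss
- Siamese (2-branch) network with branches $f_{W_1}$ and $g_{W_2}$
- Given an input training pair $(x_1, x_2)$:
  o Label: $y = \begin{cases} 0 & \text{if } (x_1, x_2) \text{ is a similar pair} \\ 1 & \text{if } (x_1, x_2) \text{ is a dissimilar pair} \end{cases}$
  o Network output: $a = f(W_1, x_1)$, $p = g(W_2, x_2)$
  o Euclidean distance between outputs: $D(W_1, W_2, x_1, x_2) = \|a - p\|_2 = \|f(W_1, x_1) - g(W_2, x_2)\|_2$
[Figure: inputs $x_1$, $x_2$ pass through $f_{W_1}$, $g_{W_2}$ to give $a$, $p$, which feed the loss $L(a, p)$]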

13 Siamese network and contrastive loss
- Contrastive loss equation:
  $L(W_1, W_2, x_1, x_2) = \frac{1}{2}(1 - y) D^2 + \frac{1}{2} y \{\max(0, m - D)\}^2$
  where $D = \|a - p\|_2 = \|f(W_1, x_1) - g(W_2, x_2)\|_2$, and $y = 0$ for a similar pair, $y = 1$ for a dissimilar pair.
- Margin $m$: desirable distance for a dissimilar pair $(x_1, x_2)$
- Training: $\arg\min_{W_1, W_2} L$
[Figure: inputs $x_1$, $x_2$ pass through $f_{W_1}$, $g_{W_2}$ to give $a$, $p$, which feed the loss $L(a, p)$]

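14 Siamese network and contrastive loss
Contrastive loss functions:
- Standard form*: $L(a, p) = \frac{1}{2}(1 - y) D^2 + \frac{1}{2} y \{\max(0, m - D)\}^2$
- Alternative form**: $L(a, p) = \frac{1}{2}(1 - y) D^2 + \frac{1}{2} y \{\max(0, m - D^2)\}$
[Plots: loss curves $L_{y=0}$ and $L_{y=1}$ for each form]
*Hadsell et al. CVPR 2006 **Chopra et al. CVPR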
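
The two contrastive loss forms on this slide can be sketched for a single pair as follows. This is a minimal illustration rather than the authors' code: the function names, the `alternative` flag and the default margin `m=1.0` are assumptions made for the example.

```python
import math


def euclidean(u, v):
    """Euclidean (L2) distance D between two equal-length vectors."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


def contrastive_loss(a, p, y, m=1.0, alternative=False):
    """Contrastive loss for one pair of embeddings.

    y = 0 marks a similar pair, y = 1 a dissimilar pair (the slide's convention).
    """
    d = euclidean(a, p)
    similar_term = 0.5 * (1 - y) * d ** 2
    if alternative:
        # Alternative form: hinge on the squared distance.
        dissimilar_term = 0.5 * y * max(0.0, m - d ** 2)
    else:
        # Standard form: squared hinge on the distance.
        dissimilar_term = 0.5 * y * max(0.0, m - d) ** 2
    return similar_term + dissimilar_term
```

A similar pair is penalised by its squared distance, while a dissimilar pair is penalised only while it sits inside the margin, for example `contrastive_loss([0.0, 0.0], [0.0, 0.5], y=1)`.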


16 Triplet network and triplet loss
- Triplet (3-branch) network with branches $f_{W_1}$, $g_{W_2}$, $g_{W_2}$
  o Given a training triplet $(x_a, x_p, x_n)$: $x_a$ anchor; $x_p$ positive (similar to $x_a$); $x_n$ negative (dissimilar to $x_a$)
  o Pos/neg branches always share weights.
  o Anchor branch can share weights (intra-domain learning) or not (cross-domain learning).
  o Network outputs: $a = f(W_1, x_a)$, $p = g(W_2, x_p)$, $n = g(W_2, x_n)$
[Figure: inputs $x_a$, $x_p$, $x_n$ pass through $f_{W_1}$, $g_{W_2}$, $g_{W_2}$ to give $a$, $p$, $n$, which feed the loss $L(a, p, n)$]

17 Triplet network and triplet loss
Triplet loss equation:
  $L(a, p, n) = \frac{1}{2} \max(0, m + D^2(a, p) - D^2(a, n))$
  o Standard form*: $D(u, v) = \|u - v\|_2$
  o Alternative form**: $D(u, v) = 1 - \frac{u \cdot v}{\|u\|_2 \|v\|_2}$
[Figure: inputs $x_a$, $x_p$, $x_n$ pass through $f_{W_1}$, $g_{W_2}$, $g_{W_2}$ to give $a$, $p$, $n$, which feed the loss $L(a, p, n)$]
*Schroff et al. CVPR 2015 **Wang et al. ICCV 2015
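
As a rough sketch of the triplet loss above with both distance choices, the snippet below evaluates the loss for one triplet. The function names and the way the distance is passed in (`dist` argument) are assumptions for illustration, not part of the original slides.

```python
import math


def l2_distance(u, v):
    """Standard form: D(u, v) = ||u - v||_2."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


def cosine_distance(u, v):
    """Alternative form: D(u, v) = 1 - (u . v) / (||u||_2 ||v||_2)."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(a, p, n, m=1.0, dist=l2_distance):
    """L(a, p, n) = 1/2 * max(0, m + D^2(a, p) - D^2(a, n))."""
    return 0.5 * max(0.0, m + dist(a, p) ** 2 - dist(a, n) ** 2)
```

The loss is zero once the negative is further from the anchor than the positive by at least the margin (in squared distance), so well-separated triplets stop contributing to training.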

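18 Siamese vs. Triplet
[Figure: positions of anchor a, positive p and negative n before training, after training with contrastive loss (margin m) and after training with triplet loss (margin m)]
- Contrastive loss: $L(a, p) = \frac{1}{2}(1 - y)\|a - p\|_2^2 + \frac{1}{2} y \{\max(0, m - \|a - p\|_2^2)\}$
- Triplet loss: $L(a, p, n) = \frac{1}{2} \{\max(0, m + \|a - p\|_2^2 - \|a - n\|_2^2)\}$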

19 Siamese or triplet?
Depends on data, training strategies, network design and more:
- Siamese superior
  o Radenovic et al. ECCV
- Triplet superior
  o Hoffer & Ailon. SBPR
  o Bui et al. arXiv


21 Training trick #1: solving gradient collapsing problem
- The gradient collapsing problem
  $L = \frac{1}{2N} \sum_{i=1}^{N} \max(0, m + \|a_i - p_i\|_2^2 - \|a_i - n_i\|_2^2)$
  Margin $m = 1.0$
[Figure: expected versus actual layout of a, p and n after training with margin m]
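
One reading of the collapsing problem is that the network maps anchor, positive and negative to the same point, where the loss sits at $m/2$ and every gradient vanishes. The sketch below checks this for a single triplet using hand-derived gradients of the loss above; the function name and the example point are assumptions for illustration.

```python
def triplet_loss_and_grads(a, p, n, m=1.0):
    """Single-triplet loss 1/2 * max(0, m + ||a-p||^2 - ||a-n||^2) and its gradients.

    Returns (loss, dL/da, dL/dp, dL/dn).
    """
    d_ap = sum((ai - pi) ** 2 for ai, pi in zip(a, p))
    d_an = sum((ai - ni) ** 2 for ai, ni in zip(a, n))
    hinge = m + d_ap - d_an
    if hinge <= 0.0:
        # Inactive triplet: zero loss and zero gradients.
        dim = len(a)
        return 0.0, [0.0] * dim, [0.0] * dim, [0.0] * dim
    grad_a = [ni - pi for pi, ni in zip(p, n)]   # (a - p) - (a - n)
    grad_p = [pi - ai for ai, pi in zip(a, p)]   # -(a - p)
    grad_n = [ai - ni for ai, ni in zip(a, n)]   # (a - n)
    return 0.5 * hinge, grad_a, grad_p, grad_n


# All three embeddings collapsed onto one (arbitrary) point.
collapsed = [0.3, -0.2]
loss, grad_a, grad_p, grad_n = triplet_loss_and_grads(collapsed, collapsed, collapsed)
```

At the collapsed point the loss equals `m / 2` and all three gradients are zero, so plain gradient descent receives no signal to separate the embeddings.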

22 Training tricks #1
- Solution for gradient collapsing:
  o Combine regression and classification loss for better regularisation.
  o Change loss function:
    $L = \frac{1}{2N} \sum_{i=1}^{N} \max(0, m + \|k a_i - p_i\|_2^2 - \|k a_i - n_i\|_2^2)$
[Figure: loss surfaces $L(a, p, n)$ for the original and modified loss, with the saddle point marked]
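
The modified loss, with the anchor scaled by a factor $k$, can be sketched over a mini-batch as below. The function name and argument layout are assumptions; the default `k=2.0` follows the value quoted on the later "Triplet network for SBIR" slide.

```python
def scaled_triplet_batch_loss(anchors, positives, negatives, m=1.0, k=2.0):
    """Mean triplet loss over N triplets with the anchor scaled by k.

    L = 1/(2N) * sum_i max(0, m + ||k a_i - p_i||^2 - ||k a_i - n_i||^2)
    """
    total = 0.0
    for a, p, n in zip(anchors, positives, negatives):
        d_ap = sum((k * ai - pi) ** 2 for ai, pi in zip(a, p))
        d_an = sum((k * ai - ni) ** 2 for ai, ni in zip(a, n))
        total += max(0.0, m + d_ap - d_an)
    return total / (2 * len(anchors))
```

Setting `k=1.0` recovers the unscaled batch loss from the previous slide, which makes it easy to compare the two on the same triplets.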

23 Training tricks #2: dimensionality reduction
- Conventional methods:
  o Redundancy analysis on a fixed set of features.
  o E.g. Principal Component Analysis (PCA), product quantisation, etc.
- Dimensionality reduction in CNN: part of the training process
  o FC7 output (4096x1x1) * conv filter (fc) (128x4096x1x1) + bias = FC out (128x1x1)
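
The in-network reduction amounts to one extra fully connected layer that maps the 4096-D FC7 output to a 128-D embedding. A minimal sketch of that forward pass is shown below; the random weights stand in for values that would normally be learned during training, and all names are illustrative assumptions.

```python
import random


def fc_layer(x, weights, bias):
    """Fully connected layer: out[j] = sum_i weights[j][i] * x[i] + bias[j]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


rng = random.Random(0)
in_dim, out_dim = 4096, 128

# Placeholder FC7 activations and untrained weights (assumptions for the demo).
fc7 = [rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
weights = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(out_dim)]
bias = [0.0] * out_dim

embedding = fc_layer(fc7, weights, bias)  # 128-D embedding
```

Because the layer is part of the network, its weights are updated by the same loss as the rest of the model, unlike PCA, which is fitted afterwards on fixed features.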

24 Training tricks #3: hard negative mining
- Random pairing: positive and negative samples are selected randomly.
- Hard negative mining: the negative example is the nearest irrelevant neighbour to the anchor.
- Hard positive mining: the positive example is the farthest relevant neighbour to the anchor.
[Figure: example triplets for a "duck" anchor, built from duck photo, duck 3D, swan photo and cat photo samples]
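
Hard positive and hard negative mining can be sketched as a nearest/farthest neighbour search over candidate embeddings. The example below assumes candidates are given as `(embedding, label)` pairs and uses squared Euclidean distance; the function names and the toy data are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))


def mine_hard_triplet(anchor, anchor_label, candidates):
    """Pick the farthest relevant and nearest irrelevant candidate for an anchor.

    candidates: list of (embedding, label) pairs.
    Returns (hard_positive, hard_negative) embeddings.
    """
    positives = [emb for emb, label in candidates if label == anchor_label]
    negatives = [emb for emb, label in candidates if label != anchor_label]
    hard_positive = max(positives, key=lambda emb: squared_distance(anchor, emb))
    hard_negative = min(negatives, key=lambda emb: squared_distance(anchor, emb))
    return hard_positive, hard_negative


anchor = [0.0, 0.0]  # e.g. the embedding of a duck query
candidates = [
    ([0.1, 0.0], "duck"),
    ([0.9, 0.0], "duck"),
    ([0.5, 0.0], "swan"),
    ([3.0, 0.0], "cat"),
]
hard_positive, hard_negative = mine_hard_triplet(anchor, "duck", candidates)
```

Here the far-away duck and the nearby swan are selected, which is the triplet that violates the margin most strongly in this toy set.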

25 Training tricks #4: layer sharing
- Consider sharing the anchor branch with the pos/neg branches
[Figure: three configurations of the a and p branches: full-share, no-share, partial-share]
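
The three sharing schemes can be illustrated by treating each layer as an object and letting shared layers be the same object in both branches. This is only a structural sketch: the layer names and the `make_layer` helper are assumptions, and the partial-share set follows the "Conv 4-5, FC 6-8" choice quoted on the later SBIR slide.

```python
def make_layer(name):
    """Hypothetical stand-in for a trainable layer."""
    return {"name": name, "weights": [0.0]}


def build_branches(layer_names, shared_names):
    """Build anchor and pos/neg branches; layers in shared_names are one shared object."""
    posneg_branch = [make_layer(name) for name in layer_names]
    anchor_branch = [
        layer if layer["name"] in shared_names else make_layer(layer["name"])
        for layer in posneg_branch
    ]
    return anchor_branch, posneg_branch


def count_shared(branches):
    """Number of layers that are literally the same object in both branches."""
    anchor_branch, posneg_branch = branches
    return sum(a is b for a, b in zip(anchor_branch, posneg_branch))


LAYERS = ["conv1", "conv2", "conv3", "conv4", "conv5", "fc6", "fc7", "fc8"]
full_share = build_branches(LAYERS, set(LAYERS))
no_share = build_branches(LAYERS, set())
partial_share = build_branches(LAYERS, {"conv4", "conv5", "fc6", "fc7", "fc8"})
```

An update to a shared layer is seen by both branches, while the unshared early layers stay free to adapt to their own domain.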

26 Other training tricks
- Data augmentation:
  o Random crop, rotation, scaling, flip, whitening
- Dropout:
  o Randomly disable neurons
- Regularisation:
  o Add parameter magnitude to loss
  o $L_{total}(W, X) = L_{contrastive,triplet}(W, X) + \|W\|_2$
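
The regularised objective on this slide adds the magnitude of the weights to the task loss. A small sketch follows; the `weight_scale` factor is an assumption added for illustration (the slide shows no coefficient), and weights are represented as plain nested lists.

```python
import math


def l2_norm(weights):
    """L2 norm of all parameters, given as a list of per-layer weight lists."""
    return math.sqrt(sum(w * w for layer in weights for w in layer))


def total_loss(task_loss, weights, weight_scale=1.0):
    """Task loss (contrastive or triplet) plus the parameter-magnitude penalty."""
    return task_loss + weight_scale * l2_norm(weights)
```

For example, `total_loss(0.5, [[3.0], [4.0]])` adds a penalty of 5.0 to a task loss of 0.5.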


28 Regression application: sketch-based image retrieval (SBIR)
Search for a particular image in your mind?

29 Text search?

30 Sketch-based Image Retrieval (SBIR)
[Figure: sketch -> retrieval]

31 Existing applications
- Google Emoji Search
- Detexify: LaTeX symbol search

32 Challenges
Free-hand sketches are usually messy.
[Figure: horse category, Flickr-330 dataset, Hu et al]

33 Challenges
Various levels of abstraction.
[Figure: house and crocodile sketches, TU-Berlin dataset, Eitz et al]

34 Challenges
Domain gap: a sketch does not always describe the real-life object accurately.
- Caricature: cat's whiskers, hedgehog's spines
- Anthropomorphism: smiling spider?
- Simplification
- Viewpoint
[Figure: category "person walking", TU-Berlin]

35 Challenges
Limited number of sketch datasets:
- Flickr15K [Hu et al. 2013]: 330 sketches + 15k ... classes
- TU-Berlin [Eitz et al. 2012]: 20k sketches @ 250 classes
- New: Google Quickdraw: 50M ... classes
- Sketchy [Sangkloy et al. 2016]: ~75k sketches ... k ... classes

36 SBIR evaluation metric
- Evaluation metrics:
  o Mean Average Precision (mAP)
  o Precision-recall (PR) curve
  o Kendall rank correlation coefficient
- Definitions:
  $P(k) = \frac{\#\text{ relevant in top } k \text{ results}}{k}$
  $AP = \frac{\sum_{k=1}^{N} P(k)\, rel(k)}{\#\text{ relevant images}}$
  $mAP = \frac{1}{|Q|} \sum_{q \in Q} AP(q)$
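
The AP and mAP definitions above can be sketched directly from a ranked list of relevance flags. The function names are illustrative, and the optional `n_relevant` argument is an assumption for the case where the ranked list does not cover every relevant image in the database.

```python
def average_precision(relevance, n_relevant=None):
    """AP for one query.

    relevance: list of 0/1 flags, relevance[k-1] == 1 if the k-th result is relevant.
    n_relevant: total number of relevant images; defaults to the number found in the list.
    """
    if n_relevant is None:
        n_relevant = sum(relevance)
    if n_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # P(k) counted only at relevant ranks
    return total / n_relevant


def mean_average_precision(rankings):
    """mAP: mean of AP over all queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

For a ranking `[1, 0, 1, 0]` the precision at the two relevant ranks is 1 and 2/3, giving an AP of 5/6.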

37 Background: conventional shallow SBIR framework
[Figure: photo database (#1, #2, #3, ..., #N) -> edge extraction -> edge map -> feature extraction -> index file; the query is matched against the index file]

38 Background: hand-crafted features
Structure tensor [Eitz, 2010]
[Equation and figure: structure tensor $S_W(I)$ computed over a window $W$ from the derivatives of image $I$ in x and y; dictionary]
Flickr15K benchmark:
  Method                           mAP (%)
  Structure Tensor [Eitz, 2010]    7.98

39 Background: hand-crafted features
Shape context [Mori, 2005]
Flickr15K benchmark:
  Method                           mAP (%)
  Structure Tensor [Eitz, 2010]    7.98
  Shape Context [Mori, 2005]       8.14

40 Background: hand-crafted features
Self similarity [Shechtman, 2007]
[Figure: self-similarity descriptor computed by correlation]
Flickr15K benchmark:
  Method                           mAP (%)
  Structure Tensor [Eitz, 2010]    7.98
  Shape Context [Mori, 2005]       8.14
  SSIM [Shechtman, 2007]           9.57


41 Background: hand-crafted features
SIFT [Lowe, 2004], HoG [Dalal, 2005]
[Figure: SIFT and HoG descriptors]
Flickr15K benchmark:
  Method                           mAP (%)
  Structure Tensor [Eitz, 2010]    7.98
  Shape Context [Mori, 2005]       8.14
  SSIM [Shechtman, 2007]           9.57
  SIFT [Lowe, 2004]                9.11
  HoG [Dalal, 2005]


42 Background: hand-crafted features
GF-HoG [Hu et al. CVIU 2013], Color GF-HoG [Bui et al. ICCV 2015]
Flickr15K benchmark:
  Method                           mAP (%)
  Structure Tensor [Eitz, 2010]    7.98
  Shape Context [Mori, 2005]       8.14
  SSIM [Shechtman, 2007]           9.57
  SIFT [Lowe, 2004]                9.11
  HoG [Dalal, 2005]
  GF-HoG [Hu, 2013]
  Color GF-HoG [Bui, 2015]


43 Background: hand-crafted features
PerceptualEdge [Qi, 2015]
[Figure: gPb edge map versus perceptual edge map]
Flickr15K benchmark:
  Method                           mAP (%)
  Structure Tensor [Eitz, 2010]    7.98
  Shape Context [Mori, 2005]       8.14
  SSIM [Shechtman, 2007]           9.57
  SIFT [Lowe, 2004]                9.11
  HoG [Dalal, 2005]
  GF-HoG [Hu, 2013]
  Color GF-HoG [Bui, 2015]
  PerceptualEdge [Qi, 2015]

44 Background: deep features
- Siamese network with contrastive loss
- Qi et al. ICIP 2016
  o Sketch-edgemap
  o Fully shared
Flickr15K benchmark:
  Method                           mAP (%)
  Structure Tensor [Eitz, 2010]    7.98
  Shape Context [Mori, 2005]       8.14
  SSIM [Shechtman, 2007]           9.57
  SIFT [Lowe, 2004]                9.11
  HoG [Dalal, 2005]
  GF-HoG [Hu, 2013]
  Color GF-HoG [Bui, 2015]
  PerceptualEdge [Qi, 2015]
  Siamese network [Qi, 2016]


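45 Triplet network for SBIR
- Sketch-edgemap
- CNN architecture: Sketch-A-Net [Yu, 2015]
- Output dimension: 100
- Share layers: Conv 4-5, FC 6-8
- Loss: $L = \frac{1}{2N} \sum_{i=1}^{N} \max(0, m + \|k a_i - p_i\|_2^2 - \|k a_i - n_i\|_2^2)$, with $k = 2.0$
[Figure: triplet network with conv layers C1 to C5 followed by fc6, fc7 and fc8, and branches a, p, n]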

46 Training procedure
Images:
- 25k photos: 100 photos/class.
- Edge extraction: gPb [Arbelaez, 2011].
- Mean subtraction, random crop/rotation/scaling/flip.
Sketches:
- 20k sketches: 20s training, 60s validation per class.
- Skeletonisation.
- Mean subtraction, random crop/rotation/scaling/flip.
- Random stroke removal.
Triplet formation:
- Random selection of pos/neg samples.
Training:
- 10k epochs.
- Multistep decreasing learning rate.
- k = ...
[Figure: augmentation examples: crop, rotation, scaling, flip, stroke removal]
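
Random stroke removal, listed among the sketch augmentations above, could look roughly like the sketch below. The slides do not specify how strokes are chosen, so everything here is an assumption: a sketch is taken to be a list of strokes (each a list of points), each stroke is dropped independently with probability `drop_prob`, and at least one stroke is always kept.

```python
import random


def remove_random_strokes(strokes, drop_prob=0.2, rng=None):
    """Return a copy of the sketch with some strokes randomly removed.

    strokes: list of strokes, each a list of (x, y) points.
    drop_prob: hypothetical per-stroke removal probability.
    """
    if rng is None:
        rng = random.Random()
    kept = [stroke for stroke in strokes if rng.random() >= drop_prob]
    if not kept and strokes:
        # Never return an empty sketch: fall back to one random stroke.
        kept = [rng.choice(strokes)]
    return kept


sketch = [
    [(0, 0), (1, 1)],
    [(1, 1), (2, 0)],
    [(2, 0), (3, 1)],
    [(3, 1), (4, 0)],
]
augmented = remove_random_strokes(sketch, drop_prob=0.5, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the augmentation reproducible, which helps when comparing training runs.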

47 Results
Flickr15K benchmark:
  Method                             mAP (%)
  Structure Tensor [Eitz, 2010]      7.98
  Shape Context [Mori, 2005]         8.14
  SSIM [Shechtman, 2007]             9.57
  SIFT [Lowe, 2004]                  9.11
  HoG [Dalal, 2005]
  GF-HoG [Hu, 2013]
  Colour GF-HoG [Bui, 2015]
  PerceptualEdge [Qi, 2015]
  Single CNN
  Siamese network [Qi, 2016]
  Triplet full-share [Bui, 2016]
  Triplet no-share [Bui, 2016]
  Triplet half-share [Bui, 2016]

48 Sketch-photo direct matching
[Plot: loss versus epochs for the triplet network (branches a, p, n), annotated "Training failure"]

49 Sketch-photo direct matching
[Figure: hybrid triplet network with one SketchANet branch and two AlexNet branches; each branch has its own softmax loss alongside the common triplet loss, with loss weights x1.0 and x2.0; special layers: dimensionality reduction, normalisation]

50 Multi-stage training procedure
Stage 1: train unshared layers
- Train the sketch branch from scratch.
- Fine-tune the image branch from AlexNet.
- Loss: softmax loss on each branch.
Stage 2: train shared layers
- Form a 2-branch network with pretrained weights.
- Freeze unshared layers.
- Train the shared layers with contrastive loss + softmax loss.
Stage 3: regression with triplet loss
- Form a triplet network.
- Unfreeze all layers.
- Train the whole network with triplet loss + softmax loss.

51 Training results
[Plots: Phase 1 (sketch branch, image branch), Phase 2 (Siamese network), Phase 3 (triplet network)]

52 Results
Flickr15K benchmark:
  Method                                mAP (%)
  Structure Tensor [Eitz, 2010]         7.98
  Shape Context [Mori, 2005]            8.14
  SSIM [Shechtman, 2007]                9.57
  SIFT [Lowe, 2004]                     9.11
  HoG [Dalal, 2005]
  GF-HoG [Hu, 2013]
  Colour GF-HoG [Bui, 2015]
  PerceptualEdge [Qi, 2015]
  Single CNN
  Siamese network [Qi, 2016]
  Sketch-edgemap triplet [Bui, 2016]
  Sketch-photo triplet

53 Layer visualisation
- SketchANet: 64 15x15 filters in conv1 layer
- AlexNet: 96 11x11 filters in conv1 layer

54 SBIR example

55 Demo: SketchSearch
[Figure: Sketch-based Image Retrieval demo, from sketch to retrieval]


57 Limitations
  o Hard to train a regression model.
  o Need labelled datasets.
  o Real-life sketches can be very complicated.
[Figure: Guernica by Pablo Picasso]

58 Future work
  o Multi-domain regression, e.g. 3D, text, photo, sketch, depth-map, cartoon (Castrejon, 2016; Siddiquie, 2014)
[Figure: a "duck" query handled by a language model, 3D model, photo model and sketch model that all map into one embedding space]
  o Toward unsupervised deep learning:
    - Labelled image set, unlabelled or no sketch set (Radenovic, 2017)
    - Completely unsupervised: auto-encoder, Generative Adversarial Network (GAN)

59 Thank you for listening
