Weakly Supervised Object Recognition with Convolutional Neural Networks


1 GDR-ISIS, Paris, March 20, 2015. Weakly Supervised Object Recognition with Convolutional Neural Networks. Ivan Laptev, WILLOW, INRIA/ENS/CNRS, Paris. Joint work with: Maxime Oquab, Leon Bottou, Josef Sivic

2 Summary Training input + image-level labels: Person Chair Airplane Reading Riding bike Running Test output More details in

3 Recent Progress: Convolutional Neural Networks. Success in character recognition [LeCun 88]. Limited performance on natural images until 2012. [Krizhevsky et al. 2012]: breakthrough in ImageNet object classification. ILSVRC 12: 1.2M images, 1K classes. Top-5 error by method: GoogLeNet 6.6%, VGG 6.8%, BAIDU 5.3%, Human 5.1%

4 Let's look at the data. A typical image with chairs and tables on Flickr.com. Images of chairs and tables in ImageNet.

5 How to use CNNs for cluttered scenes? Pre-train the CNN on ImageNet: convolutional layers C1-C2-C3-C4-C5, fully connected layers FC6, FC7. Adaptation layers FCa, FCb output the labels backgr., person, chair, table. Use the ImageNet pre-trained CNN. Post-train on the new task. [Girshick et al. 14], [Oquab et al. 14], [Sermanet et al. 13], [Donahue et al. 13], [Zeiler & Fergus 13]...

6 Approach sliding window training / testing

7 Approach: 1. Design training/test procedure using sliding windows. 2. Train adaptation layers to map labels.

8 Results Pascal VOC Oquab, Bottou, Laptev and Sivic CVPR 2014

9 Results [Oquab, Bottou, Laptev and Sivic, CVPR 2014]

10 Results [Oquab, Bottou, Laptev and Sivic, CVPR 2014]

11 How to use CNNs for cluttered scenes? Convolutional layers C1-C2-C3-C4-C5, fully connected layers FC6, FC7, adaptation layers FCa, FCb; output labels: backgr., person, chair, table. Problem: annotation of bounding boxes is (a) subjective and (b) expensive.

12 Motivation: labeling bounding boxes is tedious

13 Are bounding boxes needed for training CNNs? Image-level labels: Bicycle, Person

14 Motivation: image-level labels are plentiful Beautiful red leaves in a back street of Freiburg [Kuznetsova et al., ACL 2013]

15 Motivation: image-level labels are plentiful Public bikes in Warsaw during night

16 Approach: search over the object's location at training time [Oquab, Bottou, Laptev and Sivic, CVPR 2015]. Network: C1-C2-C3-C4-C5, FC6, FC7, FCa, FCb, followed by a max-pool over the image that gives a per-image score for each class (motorbike, person, diningtable, pottedplant, chair, car, bus, train). 1. Efficient window sliding to find object location hypotheses. 2. Image-level aggregation (max-pool). 3. Multi-label loss function (allow multiple objects in image). See also [Kokkinos et al. 15, Sermanet et al. 14, Chatfield et al. 14]
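The image-level aggregation in step 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each class already has a 2-D map of per-window scores, and the names `global_max_pool`, `image_level_scores` and the example values are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: one 2-D grid of per-window scores per class.
ScoreMap = List[List[float]]


def global_max_pool(score_map: ScoreMap) -> Tuple[float, Tuple[int, int]]:
    """Return the highest score in the map and the (row, col) where it occurs."""
    best_score = float("-inf")
    best_pos = (0, 0)
    for r, row in enumerate(score_map):
        for c, s in enumerate(row):
            if s > best_score:
                best_score, best_pos = s, (r, c)
    return best_score, best_pos


def image_level_scores(
    maps: Dict[str, ScoreMap]
) -> Dict[str, Tuple[float, Tuple[int, int]]]:
    """Aggregate every class score map into one per-image score plus its location."""
    return {cls: global_max_pool(m) for cls, m in maps.items()}


# Made-up score maps for three classes on a 2x3 grid of window positions.
maps = {
    "motorbike": [[-2.0, -1.5, -1.0], [-1.0, 3.2, 0.5]],
    "person": [[0.1, 2.4, -0.3], [-0.8, 1.9, -1.1]],
    "chair": [[-3.0, -2.5, -2.8], [-2.2, -2.9, -3.1]],
}
scores = image_level_scores(maps)
```

The position of the maximum serves as the object location hypothesis for that class; a class whose maximum stays low everywhere (here "chair") would be treated as absent from the image.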


17 1. Efficient window sliding to find object location. Convolutional feature extraction layers trained on 1512 ImageNet classes (Oquab et al., 2014). Adaptation layers trained on Pascal VOC. Figure 2: Network architecture (C1, C2, C3, C4, C5, FC6, FC7, FCa, FCb, final-pool). The layer legend indicates the number of maps, whether the layer performs cross-map normalization (norm), pooling (pool), dropouts (dropout), and reports its subsampling ratio with respect to the input image. Legend entries that survive: C1 (norm, pool, 1:8), C2 (256 maps, norm, pool), FCb (1:32, 20 maps), then final-pool. See [21, 26] and Section 3 for full details.
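To make the subsampling ratio concrete, the sketch below maps a cell of the final score map back to image coordinates. It assumes the 1:32 ratio that Figure 2 reports for the last layer, and it assumes that cell (0, 0) aligns with the image origin; the true offset depends on padding and receptive-field details that the slide does not give. The function name `cell_to_window_origin` is hypothetical.

```python
from typing import Tuple

# Assumption: one step in the final score map corresponds to 32 input pixels.
STRIDE = 32


def cell_to_window_origin(
    row: int, col: int, stride: int = STRIDE, offset: int = 0
) -> Tuple[int, int]:
    """Approximate (y, x) pixel position of the window scored at (row, col).

    `offset` is a placeholder for the padding-dependent shift, which is
    not specified in the source and defaults to zero here.
    """
    if row < 0 or col < 0:
        raise ValueError("score-map indices must be non-negative")
    return (row * stride + offset, col * stride + offset)


origin = cell_to_window_origin(3, 5)
```

Under these assumptions, neighbouring score-map cells correspond to windows shifted by 32 pixels, which is why a single forward pass over the full image behaves like a dense sliding-window search.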

18 2. Image-level aggregation using global max-pool. Convolutional feature extraction layers trained on 1512 ImageNet classes (Oquab et al., 2014). Adaptation layers trained on Pascal VOC. Figure 2: Network architecture (same figure as on slide 17).

19 Paper excerpt shown on the slide: Initial work [1, 6, 7, 15, 37] on weakly supervised object localization has focused on learning from images containing prominent and centered objects in images with limited background clutter. More recent efforts attempt to learn from images containing multiple objects embedded in complex scenes [2, 9, 28] or from video [30]. These methods typically localize objects with visually consistent appearance in the training data that often contains multiple objects in different spatial 3. Multi-label loss function (to allow for multiple objects in image). Convolutional feature extraction layers trained on 1512 ImageNet classes (Oquab et al., 2014). Adaptation layers trained on Pascal VOC. Figure 2: Network architecture (same figure as on slide 17). Cost function: sum of log-loss functions over K classes:
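The formula itself did not survive on the slide. A common form of such a cost, assumed here rather than taken from the source, is the logistic log-loss log(1 + exp(-y_k * f_k)) summed over the K classes, where f_k is the per-image score for class k and y_k is +1 if the class is present and -1 otherwise. The function names below are hypothetical.

```python
import math
from typing import Sequence


def log_loss(score: float, label: int) -> float:
    """Logistic log-loss log(1 + exp(-label * score)), with label in {-1, +1}.

    Computed as max(z, 0) + log1p(exp(-|z|)) with z = -label * score,
    which avoids overflow for large |z|.
    """
    if label not in (-1, 1):
        raise ValueError("label must be -1 or +1")
    z = -label * score
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def multilabel_loss(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Sum of independent per-class log-losses over all K classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    return sum(log_loss(s, y) for s, y in zip(scores, labels))


# Example: K = 3 classes, the first two present in the image, the third absent.
example_loss = multilabel_loss([2.0, 0.5, -1.5], [1, 1, -1])
```

Because each class contributes its own term, several classes can be labelled present in the same image, unlike a softmax loss that forces the classes to compete.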


20 Training with global max-pooling. Training input: + image-level labels: Airplane, Car, Chair. Airplane score map: correct label, increase score (learn discriminative object parts). Car score map: incorrect label, decrease score (suppress hard negatives).
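The behaviour on this slide can be illustrated with a small gradient sketch. It assumes a logistic log-loss on the max-pooled score (an assumption, since the slide does not show the formula), and the function names are hypothetical. With global max-pooling, only the highest-scoring location receives a gradient.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def max_pool_gradient(score_map: List[List[float]], label: int) -> List[List[float]]:
    """Gradient of log(1 + exp(-label * max(score_map))) w.r.t. each map entry.

    Only the arg-max location gets a nonzero value: negative for label +1
    (gradient descent raises that score), positive for label -1 (lowers it).
    """
    best_r, best_c = 0, 0
    for r, row in enumerate(score_map):
        for c, s in enumerate(row):
            if s > score_map[best_r][best_c]:
                best_r, best_c = r, c
    best_score = score_map[best_r][best_c]
    grad = [[0.0 for _ in row] for row in score_map]
    grad[best_r][best_c] = -label * sigmoid(-label * best_score)
    return grad


score_map = [[0.0, 1.0], [2.0, -1.0]]
grad_present = max_pool_gradient(score_map, +1)  # class is in the image
grad_absent = max_pool_gradient(score_map, -1)   # class is not in the image
```

For a correct label the update pushes up the score at the most confident location, which encourages discriminative object parts; for an incorrect label it pushes down the highest false response, which acts as hard-negative suppression.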

21 Training Motorbikes Evolution of localization score maps over training epochs

22 Results for weakly-supervised object recognition in Microsoft COCO dataset

23-28 Test results in Microsoft COCO: 80 object classes

29 Results for weakly-supervised action recognition in Pascal VOC 12 dataset

30-34 Test results in Pascal VOC 12: 10 action classes

35 Test results in Pascal VOC 12: 10 action classes Failure cases

36 [Oquab, Bottou, Laptev and Sivic, CVPR 2015] Results PASCAL VOC 2012 Object classification VGG 89.3

37 Summary Training input + image-level labels: Person Chair Airplane Reading Riding bike Running Test output More details in

38 What's next? a dog sitting beside a red fire hydrant in a dog park. a dog holding a skateboard trotting down a street.
