Object detection using Region Proposals (RCNN) Ernest Cheung COMP Presentation


1 Object detection using Region Proposals (RCNN) Ernest Cheung COMP Presentation

2 Problem to solve: object detection. Input: an image. Output: the bounding box of the object.

3 Object detection using CNN. Faster R-CNN: 78.8 on VOC 2012

4 Transforming the problem to Classification. Krizhevsky et al. [A25] showed substantially higher image classification accuracy on the ImageNet Large Scale Visual Recognition Challenge [A9, A10]. Their ConvNet was trained using 1.2 million labeled images, together with a few twists on LeCun's CNN.

5 Transforming the problem to Classification. Classification: Image => Class Label. Detection: Image => Bounding box. Image source: [B]

6 Transforming the problem to Classification. Krizhevsky et al.'s work, ConvNet: Image => 1000 class labels. Image source: [A25]

7 Transforming the problem to Classification. Localizing objects with a deep network using region proposals. Image source: [A]

8 Outline: Region Proposals; RCNN training; RCNN fine-tuning and Results; RCNN Variants; LCrowdV: generating a large amount of data to train Faster R-CNN

9 Region Proposals

10 RCNN

11 Region Proposal. A variety of work uses different approaches to generate region proposals: objectness [A1], selective search [C], category-independent object proposals [A14], constrained parametric min-cuts (CPMC) [A5], multi-scale combinatorial grouping [A3], and CNN [A6]

12 Selective Search for producing Region Proposals. Algorithm design challenges: capture all scales; diversification; fast to compute. Image source: [C]

14 Selective Search for producing Region Proposals. Algorithm design challenges: capture all scales; diversification: a) objects are at different scales, b) the textures are the same, c) the colors are the same, d) the wheels differ in color and texture; fast to compute. Image source: [C]

15 Selective Search for producing Region Proposals. Outline of algorithm: 1. Initialize the regions using [C13]. 2. Greedily group regions together by merging the pair with the highest similarity. 3. Repeat until the whole image becomes a single region. 4. The sequence of merges generates a hierarchy of bounding boxes.
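
The greedy grouping loop outlined above can be sketched as follows. This is a minimal illustration, not the implementation from [C]: regions are reduced to bounding boxes, the neighbour graph is supplied by hand instead of coming from the initial segmentation [C13], and `size_similarity` is a stand-in for the full similarity heuristic. All function names are hypothetical.

```python
def merge_boxes(a, b):
    """Smallest box (x0, y0, x1, y1) that encloses boxes a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def hierarchical_grouping(boxes, neighbours, similarity):
    """Greedily merge the most similar neighbouring pair until no pair is left.

    boxes: {region_id: box}; neighbours: set of frozenset({i, j}) pairs.
    Returns every box seen along the way (initial regions plus all merges).
    """
    regions = dict(boxes)
    pairs = set(neighbours)
    proposals = list(regions.values())
    next_id = max(regions) + 1
    while pairs:
        # Step 2: pick the neighbouring pair with the highest similarity.
        best = max(pairs, key=lambda p: similarity(*(regions[i] for i in sorted(p))))
        i, j = sorted(best)
        merged = merge_boxes(regions[i], regions[j])
        # The merged region inherits the neighbours of both of its parts.
        touching = {k for p in pairs if p & best for k in p} - set(best)
        pairs = {p for p in pairs if not (p & best)}
        del regions[i], regions[j]
        regions[next_id] = merged
        pairs |= {frozenset({next_id, k}) for k in touching}
        proposals.append(merged)  # Step 4: each merge adds one level to the hierarchy.
        next_id += 1
    return proposals


def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


IMAGE_AREA = 10 * 2  # toy image: 10 wide, 2 high


def size_similarity(a, b):
    """Stand-in similarity: small regions score higher, so they merge first."""
    return 1.0 - (area(a) + area(b)) / IMAGE_AREA


# Three regions side by side; 0 touches 1, and 1 touches 2.
initial = {0: (0, 0, 2, 2), 1: (2, 0, 4, 2), 2: (4, 0, 10, 2)}
adjacency = {frozenset({0, 1}), frozenset({1, 2})}
proposals = hierarchical_grouping(initial, adjacency, size_similarity)
```

With a connected neighbour graph the loop ends when a single region covers the image, so the last proposal is the whole-image box.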

18 Similarity Heuristic. Defined by a combination of: Color similarity (histogram intersection); Texture (HOG-like feature, Gaussian derivatives in 8 directions); Size (encourages small regions to merge early); Shape (how well two regions fit together)

19 Similarity Heuristic. Defined by a combination of: Color similarity (histogram intersection, 25 bins for each color); Texture (HOG-like feature, Gaussian derivatives in 8 directions); Size (encourages small regions to merge early); Shape (how well two regions fit together). Image source: [C]

20 Similarity Heuristic. Defined by a combination of: Color similarity (histogram intersection); Texture (HOG-like feature, Gaussian derivatives in 8 directions, 10-bin histogram for each direction, histogram intersection); Size (encourages small regions to merge early); Shape (how well two regions fit together). Image source: [C]
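
A minimal sketch of three of these measures, assuming the definitions from the selective search paper [C]: histogram intersection on L1-normalised histograms, a size term that favours small regions, and a fill (shape) term based on the bounding box around both regions. The function names and the toy values are illustrative only. The texture term would reuse `histogram_intersection` on gradient histograms and is omitted here.

```python
def histogram(values, bins=25, lo=0.0, hi=256.0):
    """L1-normalised histogram of values in [lo, hi); 25 bins as on the slide."""
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]


def histogram_intersection(h1, h2):
    """Sum of bin-wise minima: 1.0 for identical normalised histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def size_similarity(size_a, size_b, image_size):
    """Close to 1 when both regions are small, so small regions merge early."""
    return 1.0 - (size_a + size_b) / image_size


def fill_similarity(box_a, size_a, box_b, size_b, image_size):
    """Close to 1 when the two regions fill their joint bounding box."""
    x0, y0 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    x1, y1 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    joint = (x1 - x0) * (y1 - y0)
    return 1.0 - (joint - size_a - size_b) / image_size


# Toy example: two 2x2 regions side by side in a 10x2 image.
red_a = histogram([0, 10, 255])
red_b = histogram([0, 10, 255])
colour = histogram_intersection(red_a, red_b)
size = size_similarity(4, 4, 20)
fill = fill_similarity((0, 0, 2, 2), 4, (2, 0, 4, 2), 4, 20)
```

In [C] the final similarity is a sum of the selected terms; which terms are switched on is one of the knobs used for diversification.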

25 Selective Search for producing Region Proposals

26 RCNN training

27 RCNN

28 Feature Extraction. [A25] takes a 227 x 227 pixel image as input. [A] uses the simplest approach to convert the region proposals to CNN input: warping each region to that size, regardless of its size or aspect ratio.
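
The warping step can be sketched in a few lines. This is an assumption-laden illustration: the slides do not say which interpolation is used, so nearest-neighbour sampling is chosen here for brevity, and the image is a plain list of pixel rows. `warp_region` is a hypothetical helper name.

```python
def warp_region(image, box, out_size=227):
    """Resize the crop box = (x0, y0, x1, y1) to out_size x out_size.

    The aspect ratio is deliberately ignored: width and height are
    stretched independently, which is what "warping" means here.
    Nearest-neighbour sampling is an assumption made for this sketch.
    """
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    warped = []
    for r in range(out_size):
        src_row = image[y0 + r * h // out_size]
        warped.append([src_row[x0 + c * w // out_size] for c in range(out_size)])
    return warped


# Toy image whose pixel value encodes its position: value = 10 * y + x.
image = [[10 * y + x for x in range(6)] for y in range(4)]
small = warp_region(image, (1, 1, 5, 3), out_size=8)   # 4x2 crop -> 8x8
full = warp_region(image, (1, 1, 5, 3))                # 4x2 crop -> 227x227
```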

29 Training. The R-CNN is based on the network of Krizhevsky et al. [A25], which produces a 4096-dimensional feature vector.

30 RCNN

31 Classify regions. With around 2000 region proposals obtained in step 2, 2000 CNN feature vectors are computed in step 3. In step 4, one linear SVM per class scores the features. Non-maximum suppression: a scored region is rejected if its IoU overlap with a higher-scoring selected region is greater than a learned threshold.
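
A minimal sketch of greedy non-maximum suppression for one class, assuming boxes are given as (x0, y0, x1, y1). The threshold of 0.3 in the example is a placeholder: the slides only say that the threshold is learned.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, threshold):
    """Return indices of kept boxes, highest score first.

    A box is rejected when it overlaps an already kept (higher-scoring)
    box by more than `threshold`.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores, threshold=0.3)  # placeholder threshold
```

Here the second box overlaps the first with IoU of about 0.68, so it is suppressed, while the third box does not overlap anything and survives.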

32 RCNN Finetuning & Results

33 Supervised pre-training. The CNN is pre-trained on ILSVRC2012 classification using image-level annotations only. The authors report that its performance nearly matches that of the original model in [A25].

34 Fine-tuning. Continue stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals. Replace the last 1000-way classification layer with a randomly initialized (N+1)-way classification layer, where N is the number of object classes and the extra class is background.

35 Training data. Positive samples; negative samples. How about this?

36 Training data. If the intersection-over-union (IoU) overlap is below a threshold, the proposal is a negative sample. The authors performed a grid search over {0, 0.1, ..., 0.5} and found that an IoU threshold of 0.3 gives the best mAP.
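
A short sketch of this labelling rule, assuming boxes in (x0, y0, x1, y1) form and a proposal counted as negative when its best overlap with any ground-truth box is below the threshold. `is_negative` is a hypothetical helper name.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_negative(proposal, ground_truths, threshold=0.3):
    """True when the proposal overlaps every ground-truth box by less than threshold."""
    best = max((iou(proposal, g) for g in ground_truths), default=0.0)
    return best < threshold


ground_truths = [(0, 0, 10, 10)]
barely_touching = is_negative((8, 8, 18, 18), ground_truths)   # IoU = 4 / 196
mostly_covering = is_negative((1, 1, 11, 11), ground_truths)   # IoU = 81 / 119
```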

37 Bounding box regression. To improve localization performance, the authors propose a bounding-box regression that learns to predict a refined detection window from the pool5 features of a region proposal. N class-specific bounding-box regressors are set up.

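38 Bounding box regression. Given a set of training pairs of a detected region proposal bounding box $P = (P_x, P_y, P_w, P_h)$ and a ground-truth bounding box $G = (G_x, G_y, G_w, G_h)$ (center coordinates, width, height), the authors of [A] establish $\hat{G}_x = P_w d_x(P) + P_x$, $\hat{G}_y = P_h d_y(P) + P_y$, $\hat{G}_w = P_w \exp(d_w(P))$, $\hat{G}_h = P_h \exp(d_h(P))$, with $d_\star(P) = \mathbf{w}_\star^T \phi_5(P)$, where $\mathbf{w}_\star$ are the learnable parameters and $\phi_5(P)$ denotes the pool5 features of $P$.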
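
As an illustration, the transformation from [A] maps a proposal (center x, center y, width, height) and four predicted offsets to a refined box: the center is shifted in units of the proposal's width and height, and the size is scaled in log space. In the sketch below the offsets are passed in directly. In R-CNN they would come from the learned linear functions of the pool5 features. `apply_deltas` is a hypothetical name.

```python
import math


def apply_deltas(proposal, deltas):
    """Refine proposal = (x, y, w, h) with deltas = (dx, dy, dw, dh).

    x, y are the box center; dx, dy shift it in units of the box size,
    dw, dh scale width and height in log space.
    """
    px, py, pw, ph = proposal
    dx, dy, dw, dh = deltas
    return (pw * dx + px, ph * dy + py, pw * math.exp(dw), ph * math.exp(dh))


unchanged = apply_deltas((50, 50, 20, 10), (0, 0, 0, 0))
shifted = apply_deltas((50, 50, 20, 10), (0.5, -1, 0, 0))
doubled = apply_deltas((50, 50, 20, 10), (0, 0, math.log(2), 0))
```

Zero offsets leave the proposal untouched, which is why strong regularization pulls the regressor toward "no change".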

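39 Bounding box regression. The problem is then formulated as a regularized least-squares problem, where the objective is $\mathbf{w}_\star = \arg\min_{\hat{\mathbf{w}}_\star} \sum_{i=1}^{N} \left( t_\star^i - \hat{\mathbf{w}}_\star^T \phi_5(P^i) \right)^2 + \lambda \lVert \hat{\mathbf{w}}_\star \rVert^2$, where $\phi_5(P^i)$ denotes the pool5 features of proposal $P^i$ and the regression targets for a pair $(P, G)$ are $t_x = (G_x - P_x)/P_w$, $t_y = (G_y - P_y)/P_h$, $t_w = \log(G_w/P_w)$, $t_h = \log(G_h/P_h)$.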
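
A small sketch of the two ingredients of this objective, assuming the formulation from [A]: the regression targets computed from a (proposal, ground truth) pair in (center x, center y, width, height) form, and the ridge loss for one coordinate given a candidate weight vector. Feature vectors are plain Python lists here, and both function names are hypothetical.

```python
import math


def regression_targets(proposal, ground_truth):
    """Targets (tx, ty, tw, th) that would map the proposal onto the ground truth."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))


def ridge_loss(weights, features, targets, lam=1000.0):
    """Squared error over all training pairs plus lam * ||weights||^2."""
    error = sum(
        (t - sum(w * f for w, f in zip(weights, feat))) ** 2
        for feat, t in zip(features, targets)
    )
    return error + lam * sum(w * w for w in weights)


targets = regression_targets((50, 50, 20, 10), (60, 40, 20, 10))
loss_zero_weights = ridge_loss([0.0, 0.0], [[1.0, 2.0]], [3.0])
loss_perfect_fit = ridge_loss([1.0, 1.0], [[1.0, 2.0]], [3.0])
```

With lam = 1000 the weight vector that fits the single sample exactly is penalized far more than the zero vector, which shows how strongly the regularizer dominates.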

40 Bounding box regression. Two subtle issues observed: Regularization is important (λ = 1000). Selection of the training pairs (P, G) is important: use a proposal only if its IoU overlap with the ground truth is > 0.6, and discard proposals with IoU overlap <= 0.6 for regression.

41 Results of RCNN

42 RCNN Variants

43 Why is RCNN slow? Every region proposal is passed through the CNN separately to compute its features. No computation is shared among region proposals of the same image.

44 SPPNet [D]. Observation: feature maps also carry information about spatial position.

45 Spatial Pyramid representation

46 SPPNet

47 SPPNet. Much faster than RCNN because each image is passed through the CNN only once. A multi-scale variant can be used to improve (maintain) accuracy.
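
The idea behind spatial pyramid pooling can be sketched as follows: a feature-map window of any size is max-pooled over a fixed set of grids, so the output length no longer depends on the window size. This is a simplified, single-channel illustration. The pyramid levels (1, 2, 4) are an assumption for the example, not the configuration used in [D].

```python
def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2-D feature map over n x n grids for each n in levels.

    Returns a flat list of sum(n * n for n in levels) values regardless
    of the height and width of feature_map.
    """
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            y0 = i * h // n
            y1 = max((i + 1) * h // n, y0 + 1)  # every bin covers at least one row
            for j in range(n):
                x0 = j * w // n
                x1 = max((j + 1) * w // n, x0 + 1)
                pooled.append(
                    max(feature_map[y][x] for y in range(y0, y1) for x in range(x0, x1))
                )
    return pooled


square = [[4 * r + c for c in range(4)] for r in range(4)]    # 4 x 4 window
oblong = [[8 * r + c for c in range(8)] for r in range(6)]    # 6 x 8 window
pooled_square = spatial_pyramid_pool(square)
pooled_oblong = spatial_pyramid_pool(oblong)
```

Both windows yield 21 values (1 + 4 + 16), which is what lets a fixed-size classifier sit on top of regions of different sizes.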

48 Problem of SPPNet. The weights of the layers below the spatial pyramid pooling layer CANNOT be updated during fine-tuning, which limits accuracy.

49 Fast RCNN [E]. Fast RCNN solves this problem by proposing a single network trained in one stage.

50 Faster R-CNN [F]. Adds a Region Proposal Network (RPN): fully connected layer; takes an image/feature map; outputs object proposals. Fast R-CNN is used once the proposals are obtained. Features are shared between Fast R-CNN and the RPN.

51 LCrowdV: Generating Labeled Videos for Simulation-based Crowd Behavior Learning

52 Traditional training with human annotator

53 Traditional training with human annotator. Obtaining crowd video annotations: 1 hour of video at 30 FPS, with an average of 100 people per frame => 2M annotations; at 500 annotations per man-hour => 4000 man-hours.

54 Training with LCrowdV. LCrowdV annotations: 108 five-minute videos released = about 1M image frames, ~10M annotations.
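
The figures on these two slides can be checked with a few lines of arithmetic. The frame rate of the LCrowdV videos is not stated on this slide, so 30 FPS is assumed here to match the manual-annotation example. The variable names are illustrative.

```python
FPS = 30  # assumption: same frame rate as in the manual-annotation example

# LCrowdV: 108 released videos of 5 minutes each.
lcrowdv_frames = 108 * 5 * 60 * FPS          # 972000, i.e. about 1M frames

# Manual annotation: 2M annotations at 500 annotations per man-hour.
manual_hours = 2_000_000 / 500               # 4000.0 man-hours

# ~10M annotations over ~1M frames is roughly 10 annotations per frame.
annotations_per_frame = 10_000_000 / lcrowdv_frames
```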

55 Strength of LCrowdV

56 Traditional vs. LCrowdV. Ref: the image on the left is from UCF-CC50

57 LCrowdV Framework Density Pedestrian Count Personality characteristic Background Noise Agent model Lighting Camera Angle Procedural Simulation Goal Selection Plan Computation Preferred Goal Velocity Plan Adaption Velocity Motion Synthesis Trajectory Procedural Rendering Results Videos Head location Bounding Boxes Attributes

58 Parameters of LCrowdV

67 Impact of fixing one parameter on the results. Precision-recall graph

68 Results on Pedestrian Detection. Precision-recall graph. Curves: trained with data from the same scene + LCrowdV; trained with data from the same scene; original model

69 Results on Pedestrian Detection. Precision-recall graph. Varying the number of samples taken from the same scene as the test data, we observe a consistent improvement in AP when the training data is complemented with LCrowdV data.
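
For readers unfamiliar with the metric, the sketch below shows one common way to turn scored detections into a precision-recall curve and an average precision (AP) value. The slides do not say which AP variant was used, so the all-point interpolation here is an assumption, and the detections are made-up toy values.

```python
def precision_recall(scores, is_true_positive, num_ground_truth):
    """(precision, recall) after each detection, taken in descending score order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        points.append((tp / (tp + fp), tp / num_ground_truth))
    return points


def average_precision(points):
    """Area under the curve, using the best precision at any equal or higher recall."""
    ap, prev_recall = 0.0, 0.0
    for k, (_, recall) in enumerate(points):
        best_precision = max(p for p, _ in points[k:])
        ap += (recall - prev_recall) * best_precision
        prev_recall = recall
    return ap


# Toy detections: four boxes, two ground-truth pedestrians.
scores = [0.9, 0.8, 0.7, 0.6]
hits = [True, False, True, False]
curve = precision_recall(scores, hits, num_ground_truth=2)
ap = average_precision(curve)
```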

70 Results on Pedestrian Detection

71 Results on Pedestrian Detection

72 Further improvement on LCrowdV. More 3D models of characters, walking cycle animations, and background scenes. Perform a comprehensive analysis of how to improve accuracy using LCrowdV. Develop novel ways to combine real-world data with synthetic data.

73 Benefits of LCrowdV. Precise annotations generated automatically: avoids annotator error and intensive labor. Large variety in data: can be used for different applications of crowd understanding. Huge amount of data: provides samples that real-life data cannot cover. Complementary to machine learning on crowds: shown in the pedestrian detection experiment.

74 Reference A. Rich feature hierarchies for accurate object detection and semantic segmentation, Ross Girshick et al. B. You Only Look Once: Unified, Real-Time Object Detection, Joseph Redmon et al., CVPR entations/yolo.pdf

75 Reference C. Selective Search for Object Recognition, J.R.R. Uijlings et al., IJCV ft.pdf D. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, Kaiming He et al., ILSVRC2014

76 Reference E. Fast R-CNN, Ross Girshick F. Faster R-CNN, Shaoqing Ren et al. G. LCrowdV: Generating Labeled Videos for Simulation-based Crowd Behavior Learning, Ernest Cheung et al.,
