Object detection using Region Proposals (RCNN) Ernest Cheung COMP790-125 Presentation 1

2 Problem to solve Object detection Input: Image Output: Bounding box of the object

3 Object detection using CNN Faster R-CNN 78.8 VOC 2012

4 Transforming the problem to Classification Krizhevsky et al. [A25] showed substantially higher image classification accuracy on the ImageNet Large Scale Visual Recognition Challenge [A9, A10] with a ConvNet trained on 1.2 million labeled images, together with a few twists on LeCun's CNN.

5 Transforming the problem to Classification Classification: Image => Class Label Detection: Image => Bounding box Image source: [B]

6 Transforming the problem to Classification Krizhevsky et al.'s work, ConvNet: Image => 1000 class labels Image source: [A25]

7 Transforming the problem to Classification Localizing object with a deep network using Region proposals Image source: [A]

8 Outline: Region Proposals; RCNN training; RCNN fine-tuning and Results; RCNN Variants; LCrowdV: Generating a large amount of data to train Faster R-CNN

9 Region Proposals

10 RCNN

11 Region Proposal A variety of work uses different approaches to generate region proposals: objectness [A1], selective search [C], category-independent object proposals [A14], constrained parametric min-cuts (CPMC) [A5], multi-scale combinatorial grouping [A3], and CNN [A6]

12 Selective Search for producing Region Proposals Algorithm design challenges: Capture All Scales, Diversification, Fast to Compute. Image source: [C]

14 Selective Search for producing Region Proposals Algorithm design challenges: Capture All Scales, Diversification, Fast to Compute. Figure: a) objects appear at different scales, b) textures are the same, c) colors are the same, d) wheels differ from the car in color and texture. Image source: [C]

15 Selective Search for producing Region Proposals Outline of Algorithm: 1. Initialize the regions with the segmentation of [C13]. 2. Greedily group regions together by selecting the pair with the highest similarity. 3. Repeat until the whole image becomes a single region. 4. The merges generate a hierarchy of bounding boxes.
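The grouping loop in the outline can be sketched as follows. This is a minimal illustration, not the implementation from [C]: regions are reduced to bounding boxes `(x0, y0, x1, y1)`, `toy_similarity` is a hypothetical stand-in for the real similarity heuristic, and every pair of regions is compared, whereas [C] only compares neighbouring regions.

```python
def merge_boxes(a, b):
    """Tight bounding box (x0, y0, x1, y1) around boxes a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def toy_similarity(a, b, image_area):
    """Hypothetical stand-in for the real heuristic: small regions merge first."""
    return 1.0 - (area(a) + area(b)) / image_area


def hierarchical_grouping(initial_boxes, image_area, similarity=toy_similarity):
    regions = list(initial_boxes)    # step 1: initial regions (assumed given)
    proposals = list(initial_boxes)  # every region ever created is a proposal
    while len(regions) > 1:          # step 3: stop when one region is left
        # Step 2: pick the most similar pair (all pairs, a simplification).
        i, j = max(
            ((i, j) for i in range(len(regions)) for j in range(i + 1, len(regions))),
            key=lambda p: similarity(regions[p[0]], regions[p[1]], image_area),
        )
        merged = merge_boxes(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged)     # step 4: the hierarchy of boxes
    return proposals
```

Starting from n initial regions, the loop performs n - 1 merges, so it returns 2n - 1 boxes at all scales of the hierarchy.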

18 Similarity Heuristic Defined by a combination of four measures. Color: histogram intersection, with 25 bins for each color channel. Texture: HOG-like feature from Gaussian derivatives in 8 directions, with a 10-bin histogram for each direction, compared by histogram intersection. Size: encourages small regions to merge early. Shape: measures how well two regions fit together. Image source: [C]
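Two of these measures are simple enough to sketch directly. The snippet below is an illustration under assumptions, not the code of [C]: it handles a single color channel with pixel values in [0, 256), and the function names are made up for this example. The size measure follows the form given in [C], one minus the fraction of the image the two regions occupy together.

```python
import numpy as np


def normalized_histogram(values, bins=25, value_range=(0, 256)):
    """L1-normalized histogram of one channel (25 bins, as on the slide)."""
    hist, _ = np.histogram(values, bins=bins, range=value_range)
    return hist / max(hist.sum(), 1)


def histogram_intersection(h_a, h_b):
    """Sum of bin-wise minima: 1.0 for identical normalized histograms."""
    return float(np.minimum(h_a, h_b).sum())


def size_similarity(size_a, size_b, image_size):
    """Close to 1 when both regions are small, so small regions merge early."""
    return 1.0 - (size_a + size_b) / image_size
```

For a full color measure, the per-channel histograms would be concatenated before taking the intersection.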

25 Selective Search for producing Region Proposals

26 RCNN training

27 RCNN

28 Feature Extraction The network of [A25] takes a 227 x 227 pixel image as input. [A] uses the simplest approach to convert region proposals to CNN input: warping each region to that size, regardless of its size or aspect ratio.
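Warping a proposal to the fixed input size can be sketched with plain NumPy. This is a minimal sketch assuming nearest-neighbour resampling and a box given as `(x0, y0, x1, y1)` in pixel coordinates; the function name is hypothetical and the interpolation method is an assumption, not a detail taken from [A].

```python
import numpy as np


def warp_region(image, box, out_size=227):
    """Crop `box` from `image` and stretch it to out_size x out_size.

    The aspect ratio is deliberately ignored, as on the slide.
    """
    x0, y0, x1, y1 = box
    crop = image[y0:y1, x0:x1]
    # Nearest-neighbour lookup: map each output row/column to a source index.
    rows = np.arange(out_size) * crop.shape[0] // out_size
    cols = np.arange(out_size) * crop.shape[1] // out_size
    return crop[rows][:, cols]
```

A 100 x 30 pixel proposal and a 30 x 100 pixel proposal both come out as 227 x 227 inputs, which is what lets one CNN process every region.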

29 Training R-CNN is based on the network of Krizhevsky et al. [A25], which produces a 4096-dimensional feature vector.

30 RCNN

31 Classify regions With around 2000 region proposals obtained in step 2, 2000 CNN feature vectors are computed in step 3. In step 4, one linear SVM per class scores the features. Non-maximum suppression: a scored region is rejected if its IoU overlap with a higher-scoring region exceeds a learned threshold. Image source: http://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/
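A greedy form of this suppression step can be sketched as below. Boxes are assumed to be `(x0, y0, x1, y1)` tuples, and the default threshold of 0.3 is only a placeholder, since the slide states that the threshold is learned.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, threshold=0.3):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Reject a box if it overlaps an already kept (higher-scoring) box.
        if all(iou(boxes[i], boxes[k]) <= threshold for k in keep):
            keep.append(i)
    return keep
```

In R-CNN this would run once per class, on the regions scored by that class's SVM.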

32 RCNN Fine-tuning & Results

33 Supervised pre-training The CNN is trained on ILSVRC2012 classification using image-level annotations only. The authors report that its performance nearly matches that of the original model in [A25].

34 Fine-tuning Continue stochastic gradient descent (SGD) on the CNN parameters using only warped region proposals. Replace the last layer, a 1000-way classification layer, with a randomly initialized (N+1)-way classification layer, where N is the number of object classes.

35 Training data Positive samples Negative samples How about this?

36 Training data If the Intersection-over-Union (IoU) of a proposal with the ground truth is below a threshold, the proposal is a negative sample. The authors performed a grid search over {0, 0.1, ..., 0.5} and found that a threshold of IoU = 0.3 gives the best mAP.
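The negative-sample rule can be written as a small predicate. This sketch assumes boxes as `(x0, y0, x1, y1)` tuples and compares a proposal against all ground-truth boxes of one class; the function names are illustrative only.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_negative(proposal, ground_truth_boxes, threshold=0.3):
    """True if the proposal overlaps every ground-truth box by less than threshold."""
    best = max((iou(proposal, g) for g in ground_truth_boxes), default=0.0)
    return best < threshold
```

With the default of 0.3, a proposal that only clips the corner of an object counts as background, while one that covers most of the object does not.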

37 Bounding box regression To improve localization performance, the authors propose a bounding-box regression that learns the relationship between the pool5 features of a proposal and the offset to its ground-truth box. They set up N class-specific bounding-box regressors.

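38 Bounding box regression Given a detected region proposal bounding box $P = (P_x, P_y, P_w, P_h)$ and a ground-truth bounding box $G = (G_x, G_y, G_w, G_h)$ (center coordinates, width, height), the authors establish [A]: $\hat{G}_x = P_w d_x(P) + P_x$, $\hat{G}_y = P_h d_y(P) + P_y$, $\hat{G}_w = P_w \exp(d_w(P))$, $\hat{G}_h = P_h \exp(d_h(P))$, with $d_\star(P) = w_\star^T \phi_5(P)$, where $\phi_5(P)$ are the pool5 features of $P$ and $w_\star$ are the learnable parameters ($\star \in \{x, y, w, h\}$).

39 Bounding box regression The problem is then formulated as a regularized least squares problem, where the objective is [A]: $w_\star = \arg\min_{\hat{w}_\star} \sum_i^N (t_\star^i - \hat{w}_\star^T \phi_5(P^i))^2 + \lambda \|\hat{w}_\star\|^2$, where the regression targets are $t_x = (G_x - P_x) / P_w$, $t_y = (G_y - P_y) / P_h$, $t_w = \log(G_w / P_w)$, $t_h = \log(G_h / P_h)$.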
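The regression can be sketched in NumPy as follows. It uses the parameterization from [A]: boxes are (center x, center y, width, height), the targets are the center offsets scaled by the proposal size and the log ratios of width and height, and the regularized least squares problem has the usual closed-form ridge solution. The function names are hypothetical, and `features` stands in for the pool5 features of the proposals.

```python
import numpy as np


def regression_targets(P, G):
    """Targets t = (t_x, t_y, t_w, t_h) that map proposal P onto ground truth G."""
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    return np.array([(gx - px) / pw, (gy - py) / ph, np.log(gw / pw), np.log(gh / ph)])


def apply_deltas(P, d):
    """Predicted box from proposal P and regressor outputs d (inverse of the above)."""
    px, py, pw, ph = P
    return np.array([pw * d[0] + px, ph * d[1] + py, pw * np.exp(d[2]), ph * np.exp(d[3])])


def fit_ridge(features, targets, lam=1000.0):
    """Closed-form ridge regression: W = (X^T X + lam * I)^-1 X^T T."""
    X = np.asarray(features, dtype=float)
    T = np.asarray(targets, dtype=float)
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ T)
```

At test time the regressor output for a proposal would be `features @ W`, which is then passed to `apply_deltas` to obtain the refined box.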

40 Bounding box regression Two subtle issues observed: 1. Regularization is important: λ = 1000. 2. Selection of the (P, G) pairs is important: only proposals with IoU overlap > 0.6 are used; proposals with IoU overlap <= 0.6 are discarded for regression.

41 Results of RCNN

42 RCNN Variants

43 Why is RCNN slow? RCNN is slow because every region proposal is passed through the CNN separately to compute its features. No computation is shared among region proposals of the same image.

44 SPPNet [D] Observation: feature maps also carry information about spatial position

45 Spatial Pyramid representation Image source: http://slazebni.cs.illinois.edu/slides/ima_poster.pdf

46 SPPNet

47 SPPNet Much faster than RCNN because each image is passed through the CNN only once. Can use a multi-scale variant to improve (maintain) accuracy.
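The pooling idea can be sketched as follows: a feature map of any spatial size is max-pooled over grids of n x n bins and the results are concatenated, so every region yields a vector of the same length. This is a minimal sketch, with the pyramid levels `(1, 2, 4)` chosen only for illustration and not taken from [D].

```python
import numpy as np


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (channels, height, width) map into a fixed-length vector."""
    channels, height, width = feature_map.shape
    pooled = []
    for n in levels:
        # Bin edges covering the whole map, even if it does not divide evenly.
        ys = np.linspace(0, height, n + 1).astype(int)
        xs = np.linspace(0, width, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                # Guard against empty bins on very small maps.
                y1 = max(ys[i + 1], ys[i] + 1)
                x1 = max(xs[j + 1], xs[j] + 1)
                cell = feature_map[:, ys[i]:y1, xs[j]:x1]
                pooled.append(cell.max(axis=(1, 2)))
    return np.concatenate(pooled)
```

With these levels the output has channels * (1 + 4 + 16) entries regardless of the input size, which is why the convolutional layers only need to run once per image.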

48 Problem of SPPNet The layers below the spatial pyramid pooling layer cannot be updated during fine-tuning, which affects accuracy: their weights CANNOT be updated.

49 Fast RCNN [E] Fast RCNN solves this problem by proposing a single network trained in one stage

50 Faster R-CNN [F] Adds a Region Proposal Network (RPN): fully connected layers that take the image/feature map and output object proposals. Fast R-CNN is used once the proposals are obtained. Features are shared between Fast R-CNN and the RPN.

51 LCrowdV: Generating Labeled Videos for Simulation-based Crowd Behavior Learning

52 Traditional training with human annotator

53 Traditional training with human annotator Obtaining crowd video annotations: 1 hour of video * 30 FPS = 108,000 frames; average 100 persons per frame => 2M annotations; 500 annotations / man-hour => 4000 man-hours

54 Training with LCrowdV LCrowdV annotations: 108 5-minute videos released = ~1M image frames, ~10M annotations

55 Strength of LCrowdV

56 Traditional vs. LCrowdV Ref: the image on the left is from UCF-CC50

57 LCrowdV Framework Parameters: Density, Pedestrian Count, Personality characteristic, Background, Noise, Agent model, Lighting, Camera Angle. Procedural Simulation: Goal Selection (Goal) => Plan Computation (Preferred Velocity) => Plan Adaptation (Velocity) => Motion Synthesis (Trajectory). Procedural Rendering. Results: Videos, Head location, Bounding Boxes, Attributes.

58 Parameters of LCrowdV

67 Impact of fixing one parameter on the results Precision-recall graph

68 Results on Pedestrian Detection Precision-recall graph comparing three models: trained with data from the same scene + LCrowdV; trained with data from the same scene; original model

69 Results on Pedestrian Detection Precision-recall graph Varying the number of samples from the same scene as the test data, we observe consistent improvement of AP by complementing the training data with LCrowdV data.

70 Results on Pedestrian Detection

71 Results on Pedestrian Detection

72 Further improvement on LCrowdV Add more 3D models of characters, walking cycle animations, and background scenes. Perform a comprehensive analysis of how to improve accuracy using LCrowdV. Develop novel ways to combine real-world data with synthetic data.

73 Benefits of LCrowdV Precise annotations generated automatically: avoids annotator error and intensive labor. Large variety in data: can be used for different applications of crowd understanding. Huge amount of data: provides samples that real-life data cannot cover. Complementary to machine learning on crowds: shown in the pedestrian detection experiment.

74 References
A. Rich feature hierarchies for accurate object detection and semantic segmentation, Ross Girshick et al. https://arxiv.org/abs/1311.2524v5
B. You Only Look Once: Unified, Real-Time Object Detection, Joseph Redmon et al., CVPR 2016. http://www.cs.virginia.edu/~vicente/recognition/presentations/yolo.pdf

75 References
C. Selective Search for Object Recognition, J.R.R. Uijlings et al., IJCV 2013. http://www.huppelen.nl/publications/selectivesearchdraft.pdf
D. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, Kaiming He et al., ILSVRC2014. https://arxiv.org/pdf/1406.4729v4.pdf

76 References
E. Fast R-CNN, Ross Girshick. https://arxiv.org/pdf/1504.08083v2.pdf
F. Faster R-CNN, Shaoqing Ren et al. https://arxiv.org/abs/1506.01497
G. LCrowdV: Generating Labeled Videos for Simulation-based Crowd Behavior Learning, Ernest Cheung et al. http://gamma.cs.unc.edu/lcrowdv/