Object detection using Region Proposals (RCNN) Ernest Cheung COMP790-125 Presentation 1
Problem to solve
- Object detection
- Input: image
- Output: bounding box of the object
Object detection using CNN
- Faster R-CNN: 78.8 (VOC 2012)
Transforming the problem to Classification
- Krizhevsky et al. [A25] showed substantially higher image classification accuracy on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [A9, A10] with a ConvNet
- Trained using 1.2 million labeled images, together with a few twists on LeCun's CNN
- Classification: Image => Class label
- Detection: Image => Bounding box (image source: [B])
- Krizhevsky et al.'s ConvNet: Image => 1000 class labels (image source: [A25])
- Localizing objects with a deep network using region proposals (image source: [A])
Outline
- Region Proposals
- RCNN training
- RCNN fine-tuning and Results
- RCNN Variants
- LCrowdV: generating a large amount of data to train Faster R-CNN
Region Proposals
RCNN
Region Proposal
A variety of work uses different approaches to generate region proposals:
- objectness [A1]
- selective search [C]
- category-independent object proposals [A14]
- constrained parametric min-cuts (CPMC) [A5]
- multi-scale combinatorial grouping [A3]
- CNN [A6]
Selective Search for producing Region Proposals
Algorithm design challenges:
- Capture all scales
- Diversification
  a) Objects appear at different scales
  b) Texture is the same
  c) Color is the same
  d) Wheels differ from the car body in color and texture
- Fast to compute
Image source: [C]
Selective Search for producing Region Proposals
Outline of the algorithm:
1. Initialize the regions with the method of [C13]
2. Greedily group regions together by merging the neighbouring pair with the highest similarity
3. Repeat until the whole image becomes a single region
4. The result is a hierarchy of bounding boxes
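The grouping loop outlined above can be sketched in a few lines. This is a minimal illustration, not the implementation from [C]: regions are reduced to their bounding boxes, adjacency is passed in explicitly, and `toy_similarity` is a hypothetical stand-in for the real similarity measure.

```python
def merge_boxes(a, b):
    """Smallest box (x0, y0, x1, y1) enclosing boxes a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def toy_similarity(a, b):
    """Hypothetical similarity: prefer pairs whose merged box is small."""
    m = merge_boxes(a, b)
    return -float((m[2] - m[0]) * (m[3] - m[1]))


def hierarchical_grouping(boxes, neighbours, similarity):
    """Greedy bottom-up grouping.

    boxes:      {region_id: (x0, y0, x1, y1)} for the initial regions
    neighbours: set of frozenset({i, j}) for adjacent regions
    similarity: callable (box_a, box_b) -> float
    Returns every box seen: the initial ones plus one per merge.
    """
    boxes = dict(boxes)
    neighbours = set(neighbours)
    proposals = list(boxes.values())
    next_id = max(boxes) + 1
    while neighbours:
        # Step 2: pick the most similar pair of neighbouring regions.
        best = max(neighbours,
                   key=lambda p: similarity(*(boxes[i] for i in sorted(p))))
        i, j = sorted(best)
        merged = merge_boxes(boxes[i], boxes[j])
        # Neighbours of i or j become neighbours of the merged region.
        rewired = set()
        for pair in neighbours:
            if pair == best:
                continue
            rest = pair - best
            if len(rest) == 2:
                rewired.add(pair)
            else:
                (k,) = rest
                rewired.add(frozenset({k, next_id}))
        del boxes[i], boxes[j]
        boxes[next_id] = merged
        proposals.append(merged)  # Step 4: every merge adds a box to the hierarchy.
        neighbours = rewired
        next_id += 1
    return proposals
```

With three mutually adjacent regions the loop performs two merges, so five boxes are returned and the last one covers the whole image.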
Similarity Heuristic
Defined by a combination of:
- Color similarity: histogram intersection, 25 bins for each color channel
- Texture: SIFT-like feature, Gaussian derivatives in 8 directions, 10-bin histogram for each direction, compared by histogram intersection
- Size: encourages small regions to merge early
- Shape: measures how well two regions fit together
Image source: [C]
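As a rough sketch of how these measures can be computed, the functions below follow the definitions given in [C]: histogram intersection for color and texture, and size and fill terms normalised by the image size. The function names and the equal default weights are assumptions made for illustration, and the histograms are assumed to be L1-normalised.

```python
def histogram_intersection(h1, h2):
    """Color / texture similarity: sum of bin-wise minima of two normalised histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def size_similarity(size_i, size_j, image_size):
    """Close to 1 when both regions are small, so small regions merge early."""
    return 1.0 - (size_i + size_j) / image_size


def fill_similarity(box_i, size_i, box_j, size_j, image_size):
    """Shape term: close to 1 when the two regions fill their joint bounding box."""
    x0, y0 = min(box_i[0], box_j[0]), min(box_i[1], box_j[1])
    x1, y1 = max(box_i[2], box_j[2]), max(box_i[3], box_j[3])
    joint_box_size = (x1 - x0) * (y1 - y0)
    return 1.0 - (joint_box_size - size_i - size_j) / image_size


def combined_similarity(color, texture, size, fill, weights=(1, 1, 1, 1)):
    """Weighted sum of the four terms; weights of 0 or 1 switch terms off or on."""
    return sum(w * s for w, s in zip(weights, (color, texture, size, fill)))
```

Two side-by-side regions that exactly tile their joint bounding box get a fill similarity of 1, while regions far apart leave a large gap and score lower.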
RCNN training
Feature Extraction
- The network of [A25] takes a 227 x 227 pixel image as input
- [A] uses the simplest approach to convert a region proposal to CNN input: warp the region to 227 x 227, regardless of its size or aspect ratio
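A minimal sketch of the warping step, assuming the image is a plain list of pixel rows and using nearest-neighbour sampling. The function name and the sampling choice are illustrative assumptions; [A] also adds some context padding around the box before warping, which is left out here.

```python
def warp_region(image, box, out_size=227):
    """Anisotropically rescale a region to out_size x out_size.

    image: list of rows (each row a list of pixel values)
    box:   (x0, y0, x1, y1) with x1 and y1 exclusive
    The aspect ratio of the box is deliberately ignored.
    """
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    warped = []
    for r in range(out_size):
        # Map the output row back to a source row inside the box.
        src_y = y0 + (r * h) // out_size
        row = [image[src_y][x0 + (c * w) // out_size] for c in range(out_size)]
        warped.append(row)
    return warped
```

A 4 x 2 pixel box and a 300 x 500 pixel box both come out as 227 x 227, which is what lets a single fixed-input CNN process every proposal.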
Training
- The R-CNN is based on the network of Krizhevsky et al. [A25]
- [A25] produces a 4096-dimensional feature vector
Classify regions
- With around 2000 region proposals obtained in step 2, 2000 CNN feature vectors are computed in step 3
- In step 4, one linear SVM per class scores each feature vector
- Non-maximum suppression: a scored region is rejected if its IoU overlap with a higher-scoring selected region is larger than a learned threshold
Image source: http://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/
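The IoU overlap and the greedy suppression rule can be sketched as follows. Boxes are assumed to be (x0, y0, x1, y1) corner tuples, and `iou_threshold` is passed in as a plain parameter, whereas the slides describe it as learned.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(boxes, scores, iou_threshold):
    """Return indices of kept boxes, highest score first.

    A box is rejected when it overlaps an already kept,
    higher-scoring box by more than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

In a full detector this would be run once per class, on the regions scored by that class's SVM.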
RCNN Fine-tuning & Results
Supervised pre-training
- The CNN is pre-trained on ILSVRC2012 classification using image-level annotations only (no bounding box labels)
- The authors report that its performance nearly matches that of the original model in [A25]
Fine-tuning
- Continue stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals
- Replace the last layer, a 1000-way classification layer, with a randomly initialized (N+1)-way classification layer
- N is the number of object classes; the extra class is background
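The layer swap can be illustrated without any deep learning framework. In this sketch the network is a hypothetical list of layer weights, the last entry being the 1000-way classifier; the Gaussian initialisation with `std=0.01` and the fixed seed are assumptions, not values taken from the slides.

```python
import random


def replace_classifier(layers, num_object_classes, feature_dim=4096, std=0.01, seed=0):
    """Return a new layer list whose last layer is a fresh (N+1)-way classifier.

    layers:             list of per-layer weights; the last one is the old classifier
    num_object_classes: N, the number of object classes (one more row is added for background)
    feature_dim:        size of the feature vector feeding the classifier
    All earlier layers are reused unchanged, so their pre-trained weights are kept.
    """
    rng = random.Random(seed)
    new_head = [[rng.gauss(0.0, std) for _ in range(feature_dim)]
                for _ in range(num_object_classes + 1)]
    return layers[:-1] + [new_head]
```

For a dataset with 20 object classes, for example, the new head has 21 rows, while every layer below it starts from the pre-trained weights.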
Training data
- Positive samples
- Negative samples
- What about a region that only partially overlaps an object?
Training data
- If the intersection-over-union (IoU) of a region with the ground truth is below a threshold, the region is a negative sample
- The authors performed a grid search over {0, 0.1, ..., 0.5} and found that a threshold of 0.3 gives the best mAP
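The labelling rule can be sketched as below, assuming (x0, y0, x1, y1) boxes. The function names are illustrative; the default threshold of 0.3 and the candidate grid are the values quoted above.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_negative(proposal, ground_truth_boxes, threshold=0.3):
    """A proposal is a negative sample if its best IoU with any ground truth box is below the threshold."""
    best = max((iou(proposal, g) for g in ground_truth_boxes), default=0.0)
    return best < threshold


# Candidate thresholds for the grid search: 0, 0.1, ..., 0.5.
THRESHOLD_GRID = [i / 10 for i in range(6)]
```

A grid search would relabel the training proposals for each value in `THRESHOLD_GRID`, retrain, and keep the threshold with the best validation mAP.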
Bounding box regression
- To improve localization performance, the authors propose a bounding-box regression that predicts a corrected detection window from the pool5 features of a region proposal
- N class-specific bounding-box regressors are set up, one per object class
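Bounding box regression
Given a set of training pairs (P, G), where
- P = (P_x, P_y, P_w, P_h) is a detected region proposal bounding box (center coordinates, width, height)
- G = (G_x, G_y, G_w, G_h) is the ground truth bounding box
the authors establish
$$\hat{G}_x = P_w d_x(P) + P_x, \qquad \hat{G}_y = P_h d_y(P) + P_y, \qquad \hat{G}_w = P_w \exp(d_w(P)), \qquad \hat{G}_h = P_h \exp(d_h(P))$$
where $d_\star(P) = \mathbf{w}_\star^{T} \phi_5(P)$ for $\star \in \{x, y, w, h\}$, $\phi_5(P)$ denotes the pool5 features of proposal P, and $\mathbf{w}_\star$ are the learnable parameters.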
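Bounding box regression
The problem is then formulated as a regularized least squares problem, where the objective is:
$$\mathbf{w}_\star = \arg\min_{\hat{\mathbf{w}}_\star} \sum_i \left( t_\star^i - \hat{\mathbf{w}}_\star^{T} \phi_5(P^i) \right)^2 + \lambda \lVert \hat{\mathbf{w}}_\star \rVert^2$$
where $\phi_5(P^i)$ are the pool5 features of proposal $P^i$ and the regression targets for a pair (P, G) are
$$t_x = (G_x - P_x)/P_w, \qquad t_y = (G_y - P_y)/P_h, \qquad t_w = \log(G_w/P_w), \qquad t_h = \log(G_h/P_h)$$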
Bounding box regression
Two subtle issues observed:
- Regularization is important: λ = 1000
- Selection of the training pairs (P, G) is important: only proposals with IoU overlap > 0.6 are used; proposals with IoU overlap <= 0.6 are discarded for regression
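The box parameterisation used by the regressors in [A] can be sketched as a pair of inverse functions. Boxes are assumed to be (center x, center y, width, height) tuples; the function names are illustrative, and the learned linear model that would predict the offsets from pool5 features is not shown.

```python
import math


def regression_targets(P, G):
    """Targets (t_x, t_y, t_w, t_h) that map proposal P onto ground truth G.

    Translations are scale-invariant (divided by the proposal size);
    width and height changes are expressed in log space.
    """
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))


def apply_regression(P, d):
    """Apply predicted offsets d = (d_x, d_y, d_w, d_h) to proposal P."""
    px, py, pw, ph = P
    dx, dy, dw, dh = d
    return (pw * dx + px, ph * dy + py, pw * math.exp(dw), ph * math.exp(dh))
```

Applying the exact targets to the proposal recovers the ground truth box, which is the property the least squares fit tries to approximate from features.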
Results of RCNN
RCNN Variants
Why is RCNN slow?
- Every region proposal is passed through the CNN separately to compute its features
- No computation is shared among region proposals of the same image
SPPNet [D]
- Observation: feature maps also carry information about spatial position
Spatial Pyramid representation
Image source: http://slazebni.cs.illinois.edu/slides/ima_poster.pdf
SPPNet
SPPNet
- Much faster than RCNN because each image is passed through the CNN only once
- A multi-scale variant can be used to improve (maintain) accuracy
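The key property of spatial pyramid pooling is that a feature map region of any size yields a fixed-length vector. A minimal single-channel sketch is shown below; the pyramid levels `(1, 2, 4)` and the exact bin boundaries are assumptions for illustration, not the configuration used in [D].

```python
def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2D feature map into n x n bins for each pyramid level n.

    feature_map: list of rows of numbers, any height and width >= 1
    Returns a flat list of length sum(n * n for n in levels).
    """
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for by in range(n):
            # Bin spans floor(by*h/n) .. ceil((by+1)*h/n), so it is never empty.
            y0, y1 = (by * h) // n, -((-(by + 1) * h) // n)
            for bx in range(n):
                x0, x1 = (bx * w) // n, -((-(bx + 1) * w) // n)
                pooled.append(max(feature_map[y][x]
                                  for y in range(y0, y1)
                                  for x in range(x0, x1)))
    return pooled
```

Because the output length depends only on `levels`, the features for each region proposal can be pooled from one shared feature map and fed to fixed-size fully connected layers.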
Problem of SPPNet
- The layers below the spatial pyramid pooling layer cannot be updated during fine-tuning, which limits accuracy
- Their weights CANNOT be updated
Fast RCNN [E]
- Fast RCNN solves this problem by proposing a single network trained in one stage
Faster R-CNN [F]
- Adds a Region Proposal Network (RPN)
  - Fully convolutional network
  - Takes an image / feature map
  - Outputs object proposals
- Uses Fast R-CNN on the obtained proposals
- Features are shared between Fast R-CNN and the RPN
LCrowdV: Generating Labeled Videos for Simulation-based Crowd Behavior Learning
Traditional training with human annotators
- Obtaining crowd videos
- Annotations
  - 1 hour of video * 30 FPS = 108000 frames
  - Average of 100 persons per frame => 2M annotations
  - 500 annotations / man-hour => 4000 man-hours
Training with LCrowdV
- LCrowdV
- Annotations
  - 108 five-minute videos released = about 1M image frames
  - About 10M annotations
Strength of LCrowdV
Traditional vs. LCrowdV
Ref: the image on the left is from UCF-CC50
LCrowdV Framework
- Parameters: density, pedestrian count, personality characteristic, background, noise, agent model, lighting, camera angle
- Procedural Simulation: goal selection, plan computation, plan adaptation, motion synthesis (intermediate quantities: goal, preferred velocity, velocity, trajectory)
- Procedural Rendering
- Results: videos, head locations, bounding boxes, attributes
Parameters of LCrowdV
Impact of fixing one parameter on the results
[Figure: precision-recall graph]
Results on Pedestrian Detection
[Figure: precision-recall graph comparing three models]
- Trained with data from the same scene + LCrowdV
- Trained with data from the same scene
- Original model
Results on Pedestrian Detection
[Figure: precision-recall graph]
Varying the number of samples from the same scene as the test data, we observe a consistent improvement in AP when the training data is complemented with LCrowdV data.
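For readers unfamiliar with the metric, a precision-recall curve and its average precision (AP) can be computed from scored detections roughly as follows. This sketch uses all-point interpolation of the curve, which is an assumption: the exact evaluation protocol behind the graphs is not stated on the slides.

```python
def precision_recall(scores, is_true_positive, num_ground_truth):
    """(precision, recall) after each detection, taken in order of decreasing score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        points.append((tp / (tp + fp), tp / num_ground_truth))
    return points


def average_precision(points):
    """Area under the precision-recall curve.

    At each recall level the precision is replaced by the best precision
    achieved at that recall or any higher one (interpolated precision).
    """
    ap = 0.0
    prev_recall = 0.0
    for k, (_, recall) in enumerate(points):
        interpolated = max(p for p, _ in points[k:])
        ap += (recall - prev_recall) * interpolated
        prev_recall = recall
    return ap
```

Whether a detection counts as a true positive is decided beforehand, typically by matching it to an unmatched ground truth box with sufficient IoU overlap.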
Further improvement on LCrowdV
- More 3D models of characters, walking cycle animations, and background scenes
- Perform a comprehensive analysis of how to improve accuracy using LCrowdV
- Develop novel ways to combine real-world data with synthetic data
Benefits of LCrowdV
- Precise annotations generated automatically: avoids annotator error and intensive labor effort
- Large variety in data: can be used for different applications of crowd understanding
- Huge amount of data: provides samples that real-life data cannot cover
- Complementary to machine learning on crowds: shown in the pedestrian detection experiment
References
A. Rich feature hierarchies for accurate object detection and semantic segmentation, Ross Girshick et al. https://arxiv.org/abs/1311.2524v5
B. You Only Look Once: Unified, Real-Time Object Detection, Joseph Redmon et al., CVPR 2016. http://www.cs.virginia.edu/~vicente/recognition/presentations/yolo.pdf
C. Selective Search for Object Recognition, J.R.R. Uijlings et al., IJCV 2013. http://www.huppelen.nl/publications/selectivesearchdraft.pdf
D. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, Kaiming He et al., ILSVRC2014. https://arxiv.org/pdf/1406.4729v4.pdf
E. Fast R-CNN, Ross Girshick. https://arxiv.org/pdf/1504.08083v2.pdf
F. Faster R-CNN, Shaoqing Ren et al. https://arxiv.org/abs/1506.01497
G. LCrowdV: Generating Labeled Videos for Simulation-based Crowd Behavior Learning, Ernest Cheung et al. http://gamma.cs.unc.edu/lcrowdv/