
1 Rich feature hierarchies for accurate object detection and semantic segmentation
Speaker: Yucong Shen
4/5/2018

2 Development of Object Detection
  1. DPM (deformable part models)
  2. R-CNN
  3. Fast R-CNN
  4. Faster R-CNN
  5. Mask R-CNN

3 Framework of R-CNN
  1. Region proposal
  2. Feature extraction
  3. Classification


4 Region Proposal
  1. Take an input image and extract about 2000 bottom-up region proposals with selective search.
  2. Selective search:
     1. Segment the input image into 1K to 2K initial regions.
     2. Compute the similarity of neighboring regions.
     3. Merge the most similar neighboring regions, and repeat.
  3. Resize all proposal regions to a fixed input size.
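The merge loop of selective search can be sketched as a greedy hierarchical grouping. This is a minimal illustration, not the reference implementation: the function name `hierarchical_grouping`, the integer region ids, and the caller-supplied `similarity` and `merge` callbacks are all assumptions made for the example, and the similarity is assumed to be symmetric.

```python
def hierarchical_grouping(regions, neighbor_pairs, similarity, merge):
    """Greedily merge the most similar neighboring regions.

    regions:        dict mapping an integer id to arbitrary region data
    neighbor_pairs: iterable of (id_a, id_b) pairs of adjacent regions
    similarity:     hypothetical callback, similarity(r_a, r_b) -> float
    merge:          hypothetical callback, merge(r_a, r_b) -> merged region
    Returns every region seen (initial and merged); each one is a proposal.
    """
    regions = dict(regions)
    proposals = list(regions.values())
    sims = {
        tuple(sorted(pair)): similarity(regions[pair[0]], regions[pair[1]])
        for pair in neighbor_pairs
    }
    next_id = max(regions) + 1

    while sims:
        # Pick the most similar pair of neighbors.
        a, b = max(sims, key=sims.get)
        merged = merge(regions[a], regions[b])

        # Drop every similarity that involves a or b, remembering their neighbors.
        touching = set()
        for pair in list(sims):
            if a in pair or b in pair:
                del sims[pair]
                touching.update(pair)
        touching -= {a, b}

        del regions[a], regions[b]
        regions[next_id] = merged

        # The merged region inherits the neighbors of both parents.
        for n in touching:
            sims[tuple(sorted((n, next_id)))] = similarity(regions[n], merged)

        proposals.append(merged)
        next_id += 1

    return proposals
```

In a toy run where each region is just a mean intensity, `similarity` could be the negative absolute difference and `merge` the average of the two values; a real system would carry pixel masks, histograms, and bounding boxes instead.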

5 Selective Search: Color Similarity and Texture Similarity
  1. Color similarity
     For every region, compute a color histogram with 25 bins per channel (75 bins for an RGB image):
     $C_i = \{c_i^1, c_i^2, \ldots, c_i^n\}$
     $S_{\text{color}}(r_i, r_j) = \sum_{k=1}^{n} \min(c_i^k, c_j^k)$
  2. Texture similarity
     For every region, compute Gaussian derivatives in 8 directions for every channel, with variance 1, and collect them in a histogram:
     $T_i = \{t_i^1, t_i^2, \ldots, t_i^n\}$
     $S_{\text{texture}}(r_i, r_j) = \sum_{k=1}^{n} \min(t_i^k, t_j^k)$
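Both similarities are histogram intersections, which a few lines of NumPy can illustrate. The sketch below covers the color case only and makes several assumptions: pixel values are scaled to [0, 1], the concatenated histogram is L1-normalized, and the helper names `color_histogram` and `histogram_intersection` are invented for this example.

```python
import numpy as np


def color_histogram(region_pixels, bins=25):
    """Concatenated per-channel histogram, L1-normalized.

    region_pixels: array of shape (num_pixels, num_channels), values in [0, 1].
    For an RGB region this yields 3 * 25 = 75 entries.
    """
    region_pixels = np.asarray(region_pixels, dtype=float)
    counts = [
        np.histogram(region_pixels[:, c], bins=bins, range=(0.0, 1.0))[0]
        for c in range(region_pixels.shape[1])
    ]
    hist = np.concatenate(counts).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist


def histogram_intersection(h_i, h_j):
    """S(r_i, r_j) = sum_k min(h_i^k, h_j^k)."""
    return float(np.minimum(h_i, h_j).sum())
```

With this normalization, two regions with identical histograms score 1 and regions whose histograms share no bins score 0. The texture similarity would reuse `histogram_intersection` on histograms of Gaussian-derivative responses.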

6 Feature Extractor
  1. Network structure
     Train a CNN, either AlexNet or VGG-16, to extract a 4096-dimensional feature vector from each region proposal.
  2. Supervised pre-training
     Pre-train the CNN on a large auxiliary dataset (ImageNet).
  3. Domain-specific fine-tuning
     Continue SGD training of the CNN using only warped region proposals. The CNN architecture is unchanged except that the 1000-way classification layer is replaced with an (N+1)-way classification layer (N object classes plus background). In each iteration the batch size is 128: 32 positive samples and 96 negative samples.
  4. Object category classification
     Treat all region proposals with at least 0.5 IoU overlap with a ground-truth box as positives for that box's class and the rest as negatives. Once features are extracted and training labels are applied, optimize one linear SVM per class.
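The labeling rule and the 32/96 batch composition can be made concrete with a short sketch. It assumes boxes are given as corner coordinates `(x1, y1, x2, y2)` and that class label 0 means background; the function names are hypothetical and the code only illustrates the rule, not the original training pipeline.

```python
import random


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_proposals(proposals, gt_boxes, gt_classes, pos_thresh=0.5):
    """Give each proposal the class of its best-overlapping ground-truth box
    if the IoU reaches pos_thresh, otherwise 0 (background)."""
    labels = []
    for p in proposals:
        best_iou, best_cls = 0.0, 0
        for g, c in zip(gt_boxes, gt_classes):
            overlap = iou(p, g)
            if overlap > best_iou:
                best_iou, best_cls = overlap, c
        labels.append(best_cls if best_iou >= pos_thresh else 0)
    return labels


def sample_minibatch(labels, num_pos=32, num_neg=96, rng=None):
    """Pick up to num_pos positive and num_neg background proposal indices."""
    rng = rng or random.Random()
    pos = [i for i, label in enumerate(labels) if label != 0]
    neg = [i for i, label in enumerate(labels) if label == 0]
    return (rng.sample(pos, min(num_pos, len(pos))),
            rng.sample(neg, min(num_neg, len(neg))))
```

Sampling is capped at the number of available proposals here; how to handle images with too few positives is a design choice the slides do not specify.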

7 Classification
  1. Training
     - An SVM handles only two-way classification problems, so once features are extracted, train N linear SVM classifiers, one per class.
     - Label a proposal as negative when its IoU with the ground truth is less than 0.3.
     - After that, train a linear regressor that predicts a new detection window from the 4096-dimensional feature vector of a selective search region proposal.
  2. Testing
     1. Extract 2000 region proposals with selective search.
     2. Normalize all regions to a fixed size.
     3. Forward propagate each region through the CNN and extract the 4096-dimensional feature vector.
     4. Score the features with the SVM classifiers.
     5. Remove redundant bounding boxes with non-maximum suppression (NMS).
     6. Obtain the final bounding boxes by bounding-box regression.
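The NMS step at test time can be sketched as a greedy loop over score-sorted boxes. This is a minimal NumPy illustration for one class, assuming `(x1, y1, x2, y2)` boxes with positive area; the suppression threshold `iou_thresh` is a free parameter whose default here is only illustrative and is not taken from the slides.

```python
import numpy as np


def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = scores.argsort()[::-1]  # highest score first

    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]

        # Overlap of the current best box with every remaining box.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        overlap = inter / (areas[i] + areas[rest] - inter)

        # Keep only boxes that do not overlap the current one too much.
        order = rest[overlap <= iou_thresh]
    return keep
```

In a per-class detector, this would be called once per class on the SVM scores of that class.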

8 Results on PASCAL VOC and ILSVRC

9 Results on PASCAL VOC and ILSVRC

10 Visualization of pool5

11 Results on PASCAL ILSVRC

12 ILSVRC2013 detection dataset
  1. The ILSVRC2013 detection dataset is split into three sets: train (395,918), val (20,121), and test (40,152), covering 200 classes.
  2. The training images are not exhaustively annotated, so they cannot be used for hard negative mining. The val set is therefore used for both training and validation.
  3. Training data is required for three procedures in R-CNN: (1) CNN fine-tuning, (2) detector SVM training, and (3) bounding-box regressor training.

13 Object Proposal
  Ways of transforming a proposal into a CNN input:
  (A) the original object proposal at its actual scale relative to the transformed CNN inputs;
  (B) tightest square with context;
  (C) tightest square without context;
  (D) warp.
  Warping with context padding (p = 16 pixels) outperformed the alternatives by a large margin (3-5 mAP points).

14 Bounding-box Regression
  1. Purpose
     Improve the localization performance of the bounding box.
  2. Approach
     After scoring each selective search proposal with a class-specific detection SVM, predict a new bounding box for the detection using a class-specific bounding-box regressor.
  3. Algorithm
     The input is a set of N training pairs $\{(P^i, G^i)\}_{i=1,\ldots,N}$, where $P^i = (P^i_x, P^i_y, P^i_w, P^i_h)$ specifies the pixel coordinates of the center of proposal $P^i$'s bounding box together with $P^i$'s width and height in pixels. Each ground-truth bounding box G is specified in the same way: $G = (G_x, G_y, G_w, G_h)$. The goal is to learn a transformation that maps a proposed box P to a ground-truth box G.

15 Bounding-box Regression
  Parameterize the transformation by four functions $d_x(P)$, $d_y(P)$, $d_w(P)$, $d_h(P)$.
  The first two specify a scale-invariant translation of the center of P:
     $\hat{G}_x = P_w \, d_x(P) + P_x$
     $\hat{G}_y = P_h \, d_y(P) + P_y$
  The second two specify log-space translations of the width and height of P's bounding box:
     $\hat{G}_w = P_w \exp(d_w(P))$
     $\hat{G}_h = P_h \exp(d_h(P))$
  Each function $d_\star(P)$ (where $\star$ is one of x, y, w, h) is modeled as a linear function of the pool5 features of proposal P, denoted $\phi_5(P)$:
     $d_\star(P) = \mathbf{w}_\star^{T} \phi_5(P)$,
  where $\mathbf{w}_\star$ is a vector of learnable model parameters.
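Applying the predicted transformation to a proposal is a direct transcription of these equations, with the y-center update using the proposal's own y-coordinate. The sketch below assumes boxes in center format `(x, y, w, h)`; the function name `apply_deltas` is chosen for this example only.

```python
import math


def apply_deltas(proposal, deltas):
    """Map a proposal P = (Px, Py, Pw, Ph) to a predicted box G_hat.

    deltas = (dx, dy, dw, dh) are the regressor outputs d_*(P).
    """
    px, py, pw, ph = proposal
    dx, dy, dw, dh = deltas
    gx = pw * dx + px          # shift the center in units of the box width
    gy = ph * dy + py          # shift the center in units of the box height
    gw = pw * math.exp(dw)     # log-space scaling of the width
    gh = ph * math.exp(dh)     # log-space scaling of the height
    return (gx, gy, gw, gh)
```

Zero deltas leave the proposal unchanged, which is why the parameterization is convenient: the regressor only has to learn small corrections.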

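16 Bounding-box Regression
  We learn $\mathbf{w}_\star$ by optimizing the regularized least squares objective (ridge regression):
     $\mathbf{w}_\star = \underset{\hat{\mathbf{w}}_\star}{\operatorname{argmin}} \sum_{i}^{N} \left( t^i_\star - \hat{\mathbf{w}}_\star^{T} \phi_5(P^i) \right)^2 + \lambda \left\| \hat{\mathbf{w}}_\star \right\|^2$
  The regression targets $t_\star$ for a training pair (P, G) are defined as:
     $t_x = (G_x - P_x) / P_w$
     $t_y = (G_y - P_y) / P_h$
     $t_w = \log(G_w / P_w)$
     $t_h = \log(G_h / P_h)$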
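The targets and the ridge regression fit can be sketched with NumPy. This example assumes boxes in center format `(x, y, w, h)` stacked as rows, uses the standard closed-form ridge solution rather than any particular solver from the original work, and treats the regularization strength `lam` as a placeholder value; the function names are hypothetical.

```python
import numpy as np


def regression_targets(P, G):
    """Targets t = (tx, ty, tw, th) for proposals P and ground truths G.

    P, G: arrays of shape (N, 4) in (x_center, y_center, width, height) format.
    """
    P = np.asarray(P, dtype=float)
    G = np.asarray(G, dtype=float)
    tx = (G[:, 0] - P[:, 0]) / P[:, 2]
    ty = (G[:, 1] - P[:, 1]) / P[:, 3]
    tw = np.log(G[:, 2] / P[:, 2])
    th = np.log(G[:, 3] / P[:, 3])
    return np.stack([tx, ty, tw, th], axis=1)


def fit_ridge(features, targets, lam=1.0):
    """Closed-form ridge regression: W = (X^T X + lam * I)^{-1} X^T T.

    features: (N, D) matrix whose rows stand in for phi_5(P^i).
    targets:  (N, 4) matrix of regression targets.
    Returns W of shape (D, 4); column k holds the weights for one of x, y, w, h.
    """
    X = np.asarray(features, dtype=float)
    T = np.asarray(targets, dtype=float)
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ T)
```

Predicted deltas for new proposals are then `features @ W`, one row of `(dx, dy, dw, dh)` per proposal.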


Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun Presented by Tushar Bansal Objective 1. Get bounding box for all objects