Rich feature hierarchies for accurate object detection and semantic segmentation Speaker: Yucong Shen 4/5/2018
Development of Object Detection
1. DPM (Deformable Parts Models)
2. R-CNN
3. Fast R-CNN
4. Faster R-CNN
5. Mask R-CNN
Framework of R-CNN
1. Region Proposal
2. Feature Extraction
3. Classification
Region Proposal
1. Take an input image and extract about 2000 bottom-up region proposals with selective search.
2. Selective search:
   (1) Segment the input image into 1K to 2K initial regions.
   (2) Compute the similarity of neighboring regions.
   (3) Merge the most similar neighboring regions, and repeat.
3. Resize all proposal regions to 227 × 227.
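The merge loop in step 2 can be sketched as a greedy pairwise merge, where every intermediate region is kept as a proposal. This is a minimal sketch, not the full selective search pipeline: the initial segmentation, the `similarity` function, and the region representation (here, sets of pixel indices) are all stand-ins supplied by the caller.

```python
def selective_search_merge(regions, similarity, n_target=2000):
    """Greedy region-merging sketch: repeatedly merge the most similar
    pair of regions until enough proposals are collected.
    `regions` is a list of frozensets of pixel indices; `similarity`
    is any pairwise score (higher = more similar)."""
    proposals = list(regions)  # every intermediate region is a proposal
    active = list(regions)
    while len(active) > 1 and len(proposals) < n_target:
        # find the most similar pair among all current regions
        best, best_s = None, float("-inf")
        for i in range(len(active)):
            for j in range(i + 1, len(active)):
                s = similarity(active[i], active[j])
                if s > best_s:
                    best, best_s = (i, j), s
        i, j = best
        merged = active[i] | active[j]  # union of pixel sets
        active = [r for k, r in enumerate(active) if k not in (i, j)]
        active.append(merged)
        proposals.append(merged)
    return proposals
```

The real algorithm only compares *neighboring* regions and updates similarities incrementally; the all-pairs scan above trades that efficiency for brevity.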
Selective Search: Color Similarity and Texture Similarity
1. Color Similarity
   For every region, compute a color histogram with 25 bins per channel (75 bins for RGB):
   C_i = {c_i^1, c_i^2, ..., c_i^n}
   S_color(r_i, r_j) = Σ_{k=1}^n min(c_i^k, c_j^k)
2. Texture Similarity
   For every region, compute Gaussian derivatives of each channel in 8 different directions with variance 1, then take their histogram:
   T_i = {t_i^1, t_i^2, ..., t_i^n}
   S_texture(r_i, r_j) = Σ_{k=1}^n min(t_i^k, t_j^k)
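Both similarities are histogram intersections over L1-normalized histograms, so they lie in [0, 1]. A minimal numpy version (the 25-bin count and 0–255 value range follow the slide; the function names are ours):

```python
import numpy as np

def color_histogram(channel, bins=25):
    """25-bin histogram of one channel (75 bins total for RGB),
    L1-normalized as in selective search."""
    h, _ = np.histogram(channel, bins=bins, range=(0, 256))
    return h / h.sum()

def histogram_similarity(hist_i, hist_j):
    """Histogram intersection: S(r_i, r_j) = sum_k min(h_i^k, h_j^k).
    Equals 1 for identical normalized histograms, 0 for disjoint ones."""
    return np.minimum(hist_i, hist_j).sum()
```

The same `histogram_similarity` serves for both color and texture; only the histograms fed into it differ.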
Feature Extractor
1. Network Structure
   Train a CNN — either (1) AlexNet or (2) VGG-16 — to extract a 4096-dimensional feature vector.
2. Supervised Pre-training
   Pre-train the CNN on a large auxiliary dataset (ImageNet).
3. Domain-specific Fine-tuning
   Continue SGD training of the CNN using only warped region proposals. The architecture is unchanged except that the 1000-way classification layer is replaced with an (N+1)-way classification layer (N classes plus background). Each mini-batch has 128 samples: 32 positives and 96 negatives.
4. Object Category Classification
   Treat all region proposals with ≥ 0.5 IoU overlap with a ground-truth box as positives for that box's class and the rest as negatives. Once features are extracted and training labels are applied, optimize one linear SVM per class.
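The 32-positive / 96-negative mini-batch construction in step 3 can be sketched as follows. This is a hypothetical helper (the paper does not specify sampling code); `pos_idx` and `neg_idx` are index arrays into the pooled list of warped proposals, and we sample with replacement when a class has fewer proposals than the quota.

```python
import numpy as np

def sample_finetune_batch(pos_idx, neg_idx, rng, batch_size=128, n_pos=32):
    """One fine-tuning mini-batch as described in the slide: 32 positive
    (IoU >= 0.5) and 96 background region proposals, sampled uniformly."""
    n_neg = batch_size - n_pos
    pos = rng.choice(pos_idx, size=n_pos, replace=len(pos_idx) < n_pos)
    neg = rng.choice(neg_idx, size=n_neg, replace=len(neg_idx) < n_neg)
    batch = np.concatenate([pos, neg])
    rng.shuffle(batch)  # mix positives and negatives within the batch
    return batch
```

Biasing the batch toward positives like this matters because true positives are rare relative to background proposals.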
Classification
1. Training
   An SVM handles only a 2-way classification problem, so train N linear SVM classifiers, one per class. Label a proposal as negative for a class when its IoU with every ground-truth box of that class is less than 0.3. Once features are extracted, train the N linear SVMs. After that, train a linear regressor that predicts a refined detection window from the 4096-dimensional feature vector of a selective search region proposal.
2. Testing
   Extract about 2000 proposal regions with selective search, warp all regions to 227 × 227, forward-propagate them through the CNN, extract the 4096-dimensional feature vectors, score them with the SVM classifiers, remove redundant bounding boxes with NMS, and obtain the final bounding boxes after bounding-box regression.
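The NMS step at test time is standard greedy non-maximum suppression: keep the highest-scoring box, discard every remaining box that overlaps it too much, and repeat. A minimal numpy implementation (boxes in (x1, y1, x2, y2) corner format; the 0.3 default threshold is illustrative):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression. Returns indices of kept boxes,
    highest score first."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # descending by score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of box i with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # drop boxes that overlap box i above the threshold
        order = order[1:][iou <= iou_thresh]
    return keep
```

In R-CNN this is applied per class, over that class's SVM scores.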
Result on PASCAL VOC and ILSVRC
Visualization of pool5
Result on PASCAL and ILSVRC
ILSVRC2013 Detection Dataset
1. The ILSVRC2013 detection dataset is split into three sets: train (395,918 images), val (20,121), and test (40,152), covering 200 classes.
2. The training images are not exhaustively annotated, so they cannot be used for hard negative mining; val is therefore used for both training and validation.
3. Training data is required for three procedures in R-CNN: (1) CNN fine-tuning, (2) detector SVM training, (3) bounding-box regressor training.
Object Proposal Transformations
(A) the original object proposal at its actual scale relative to the transformed CNN inputs; (B) tightest square with context; (C) tightest square without context; (D) warp.
Warping with context padding (p = 16 pixels) outperformed the alternatives by a large margin (3-5 mAP points).
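The winning transform (D with padding) can be sketched as: anisotropically warp the proposal to 227 × 227 while reserving p = 16 pixels of surrounding image context on each side at the warped size. This dependency-free sketch uses nearest-neighbor sampling and clamps at the image border; a real pipeline would use proper image resizing (the paper additionally replicates border content when context runs out).

```python
import numpy as np

def warp_with_context(image, box, out_size=227, pad=16):
    """Warp proposal `box` = (x1, y1, x2, y2) to out_size x out_size,
    with `pad` pixels of context on each side at the output scale."""
    x1, y1, x2, y2 = box
    # how many source pixels map to one output pixel, given that the
    # proposal itself must occupy the central (out_size - 2*pad) pixels
    scale_x = (x2 - x1) / (out_size - 2 * pad)
    scale_y = (y2 - y1) / (out_size - 2 * pad)
    h, w = image.shape[:2]
    xs = np.clip((x1 + (np.arange(out_size) - pad + 0.5) * scale_x).astype(int), 0, w - 1)
    ys = np.clip((y1 + (np.arange(out_size) - pad + 0.5) * scale_y).astype(int), 0, h - 1)
    return image[np.ix_(ys, xs)]  # nearest-neighbor gather
```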
Bounding-box Regression
1. Purpose
   Improve the localization performance of the bounding box.
2. Approach
   After scoring each selective search proposal with a class-specific detection SVM, predict a new bounding box for the detection using a class-specific bounding-box regressor.
3. Algorithm
   The input is a set of N training pairs {(P^i, G^i)}, i = 1, ..., N, where P^i = (P_x^i, P_y^i, P_w^i, P_h^i) specifies the pixel coordinates of the center of proposal P^i's bounding box together with its width and height in pixels. Each ground-truth bounding box G is specified in the same way: G = (G_x, G_y, G_w, G_h). The goal is to learn a transformation that maps a proposed box P to a ground-truth box G.
Bounding-box Regression
Parameterize the transformation as four functions: d_x(P), d_y(P), d_w(P), d_h(P). The first two specify a scale-invariant translation of the center of P:
   Ĝ_x = P_w d_x(P) + P_x
   Ĝ_y = P_h d_y(P) + P_y
The last two specify log-space translations of the width and height of P's bounding box:
   Ĝ_w = P_w exp(d_w(P))
   Ĝ_h = P_h exp(d_h(P))
Each function d_⋆(P) (where ⋆ is one of x, y, w, h) is a linear function of the pool5 features of proposal P, φ_5(P):
   d_⋆(P) = w_⋆^T φ_5(P)
where w_⋆ is a vector of learnable model parameters.
Bounding-box Regression
We learn w_⋆ by optimizing the regularized least-squares objective (ridge regression):
   w_⋆ = argmin_{ŵ_⋆} Σ_{i=1}^N (t_⋆^i − ŵ_⋆^T φ_5(P^i))^2 + λ ‖ŵ_⋆‖^2
The regression targets t_⋆ are defined as:
   t_x = (G_x − P_x) / P_w
   t_y = (G_y − P_y) / P_h
   t_w = log(G_w / P_w)
   t_h = log(G_h / P_h)
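The target definitions and the Ĝ equations are exact inverses of each other, which the following sketch makes concrete (boxes as (center_x, center_y, w, h); the function names are ours):

```python
import numpy as np

def bbox_targets(P, G):
    """Regression targets t = (t_x, t_y, t_w, t_h) for proposal P and
    ground-truth box G, each given as (center_x, center_y, w, h)."""
    tx = (G[0] - P[0]) / P[2]   # scale-invariant center shift in x
    ty = (G[1] - P[1]) / P[3]   # scale-invariant center shift in y
    tw = np.log(G[2] / P[2])    # log-space width change
    th = np.log(G[3] / P[3])    # log-space height change
    return np.array([tx, ty, tw, th])

def apply_regression(P, d):
    """Invert the transform: predicted box G_hat from proposal P and
    regressor outputs d = (d_x, d_y, d_w, d_h)."""
    gx = P[2] * d[0] + P[0]
    gy = P[3] * d[1] + P[1]
    gw = P[2] * np.exp(d[2])
    gh = P[3] * np.exp(d[3])
    return np.array([gx, gy, gw, gh])
```

Since `apply_regression(P, bbox_targets(P, G))` recovers G exactly, a perfect regressor would localize every training proposal onto its ground-truth box; ridge regression on (φ_5(P^i), t_⋆^i) pairs approximates this linearly.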