Object Recognition and Detection


1 CS 2770: Computer Vision Object Recognition and Detection Prof. Adriana Kovashka University of Pittsburgh March 16, 21, 23, 2017

2 Plan for the next few lectures Recognizing the category in the image as a whole Detecting the region in the image that corresponds to a category Using window templates Face detection Pedestrian detection Using parts Implicit Shape Models Deformable Part Models Using Convolutional Neural Networks R-CNN, Fast R-CNN YOLO (You Only Look Once)


3 Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories CVPR 2006 Svetlana Lazebnik Beckman Institute, University of Illinois at Urbana-Champaign Cordelia Schmid INRIA Rhône-Alpes, France Jean Ponce Ecole Normale Supérieure, France

4 Scene category dataset Fei-Fei & Perona (2005), Oliva & Torralba (2001) Slide credit: L. Lazebnik

5 Bag-of-words representation 1. Extract local features 2. Learn visual vocabulary using clustering 3. Quantize local features using visual vocabulary 4. Represent images by frequencies of visual words Slide credit: L. Lazebnik

6 Image categorization with bag of words Training 1. Compute bag-of-words representation for training images 2. Train classifier on labeled examples using histogram values as features 3. Labels are the scene types (e.g. mountain vs field) Testing 1. Extract keypoints/descriptors for test images 2. Quantize into visual words using the clusters computed at training time 3. Compute visual word histogram for test images 4. Compute labels on test images using classifier obtained at training time 5. Measure accuracy of test predictions by comparing them to groundtruth test labels (obtained from humans) Adapted from D. Hoiem
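The quantization and histogram steps listed above can be sketched in a few lines. This is a minimal illustration with NumPy only, assuming descriptors are already extracted as rows of an array; the function names (`kmeans`, `quantize`, `bow_histogram`) and the plain k-means loop are hypothetical choices for this sketch, not the lecture's code.

```python
import numpy as np

def quantize(descriptors, centers):
    """Assign each descriptor to its nearest visual word (cluster center)."""
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(descriptors, k, iters=10, seed=0):
    """Learn a visual vocabulary with a plain k-means loop (illustrative only)."""
    rng = np.random.default_rng(seed)
    picks = rng.choice(len(descriptors), size=k, replace=False)
    centers = descriptors[picks].astype(float)
    for _ in range(iters):
        labels = quantize(descriptors, centers)
        for c in range(k):
            members = descriptors[labels == c]
            if len(members):  # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
    return centers

def bow_histogram(descriptors, centers):
    """Represent one image by normalized frequencies of its visual words."""
    words = quantize(descriptors, centers)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()
```

The resulting histograms would then be the feature vectors passed to whatever classifier is trained on the labeled scene types.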

7 Feature extraction (on which BOW is based) Weak features: edge points at 2 scales and 8 orientations (vocabulary size 16). Strong features: SIFT descriptors of 16x16 patches sampled on a regular grid, quantized to form visual vocabulary (size 200, 400). Slide credit: L. Lazebnik

8 What about spatial layout? All of these images have the same color histogram Slide credit: D. Hoiem

9 Spatial pyramid Compute histogram in each spatial bin Slide credit: D. Hoiem

10 Spatial pyramid [Lazebnik et al. CVPR 2006] Slide credit: D. Hoiem

11 Pyramid matching, Indyk & Thaper (2003), Grauman & Darrell (2005). Matching using pyramid and histogram intersection for some particular visual word. [Figure: original images x_i and x_j; feature histograms at Level 3, Level 2, Level 1, Level 0; total weight (value of pyramid match kernel): K(x_i, x_j).] Adapted from L. Lazebnik
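As a rough sketch of pyramid matching for one visual word, the code below bins point positions on grids of increasing resolution and combines the per-level histogram intersections. The level weights (1/2^L for level 0 and 1/2^(L-l+1) for level l) follow my reading of Lazebnik et al., CVPR 2006, and are an assumption here because the slide's formula did not survive extraction. Positions are assumed to be normalized to the unit square, and the function names are hypothetical.

```python
import numpy as np

def grid_histogram(points, level):
    """Count points (x, y in [0, 1)) in each cell of a 2^level x 2^level grid."""
    cells = 2 ** level
    idx = np.minimum((points * cells).astype(int), cells - 1)
    hist = np.zeros((cells, cells))
    np.add.at(hist, (idx[:, 1], idx[:, 0]), 1)  # rows index y, columns index x
    return hist

def pyramid_match(points_i, points_j, L=2):
    """Weighted sum of histogram intersections over pyramid levels 0..L."""
    inter = [
        np.minimum(grid_histogram(points_i, l), grid_histogram(points_j, l)).sum()
        for l in range(L + 1)
    ]
    kernel = inter[0] / 2 ** L
    for l in range(1, L + 1):
        kernel += inter[l] / 2 ** (L - l + 1)  # finer levels get larger weights
    return kernel
```

With these weights, two identical point sets of size n give a kernel value of n, since the weights sum to 1. Matches found only at the coarsest level contribute the least.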


12 Scene category dataset Fei-Fei & Perona (2005), Oliva & Torralba (2001) Multi-class classification results (100 training images per class) Fei-Fei & Perona: 65.2% Slide credit: L. Lazebnik

13 Scene category confusions Difficult indoor images kitchen living room bedroom Slide credit: L. Lazebnik

14 Caltech101 dataset Fei-Fei et al. (2004) Multi-class classification results (30 training images per class) Slide credit: L. Lazebnik

15 Plan for the next few lectures Recognizing the category in the image as a whole Detecting the region in the image that corresponds to a category Using window templates Face detection Pedestrian detection Using parts Implicit Shape Models Deformable Part Models Using Convolutional Neural Networks R-CNN, Fast R-CNN YOLO (You Only Look Once)

16 Category detection: basic framework Build/train object model Choose a representation Learn or fit parameters of model / classifier Generate candidates in new image Score the candidates Kristen Grauman

17 Category detection: representation choice Window-based Part-based Kristen Grauman

18 Window-template-based models Building an object model Consider edges, contours, oriented intensity gradients Summarize local distribution of gradients with histogram Locally orderless: offers invariance to small shifts and rotations Adapted from Kristen Grauman

19 Window-template-based models Building an object model Given the representation, train a binary classifier (car/non-car classifier) whose output is either "Yes, a car." or "No, not a car." Kristen Grauman

20 Window-template-based models Generating and scoring candidates Car/non-car Classifier Kristen Grauman

21 Window-template-based object detection: recap Training: 1. Obtain training data 2. Define features 3. Define classifier Given new image: 1. Slide window 2. Score by classifier Training examples Car/non-car Classifier Feature extraction Kristen Grauman
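The test-time half of the recap ("slide window, score by classifier") might look like the sketch below. It handles a single scale only; scanning several scales would, under the usual approach, repeat the loop on resized copies of the image. The `score_fn` argument stands in for any trained classifier and is a hypothetical placeholder.

```python
import numpy as np

def sliding_windows(image, win_h, win_w, stride):
    """Yield (top, left, window) for every window position at one scale."""
    height, width = image.shape[:2]
    for top in range(0, height - win_h + 1, stride):
        for left in range(0, width - win_w + 1, stride):
            yield top, left, image[top:top + win_h, left:left + win_w]

def detect(image, score_fn, win_h, win_w, stride, threshold):
    """Score every window and keep those above the threshold."""
    detections = []
    for top, left, window in sliding_windows(image, win_h, win_w, stride):
        score = score_fn(window)
        if score > threshold:
            detections.append((top, left, score))
    return detections
```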

22 Special case: Faces Detection Recognition Sally Lana Lazebnik

23 Challenges of face detection Sliding window detector must evaluate tens of thousands of location/scale combinations. Faces are rare: 0 to 10 per image. A megapixel image has ~10^6 pixels and a comparable number of candidate face locations. For computational efficiency, we should try to spend as little time as possible on the non-face windows. To avoid having a false positive in every image, our false positive rate has to be less than 10^-6. Lana Lazebnik

24 Viola-Jones face detector

25 Boosting intuition Weak Classifier 1 Paul Viola

26 Boosting illustration Weights Increased Paul Viola

27 Boosting illustration Weak Classifier 2 Paul Viola

28 Boosting illustration Weights Increased Paul Viola

29 Boosting illustration Weak Classifier 3 Paul Viola

30 Boosting illustration Final classifier is a combination of weak classifiers Paul Viola

31 Boosting: training Initially, weight each training example equally In each boosting round: Find the weak learner that achieves the lowest weighted training error Raise weights of training examples misclassified by current weak learner Compute final classifier as linear combination of all weak learners (weight of each learner is directly proportional to its accuracy) Exact formulas for re-weighting and combining weak learners depend on the particular boosting scheme (e.g., AdaBoost) Lana Lazebnik, Kristen Grauman

32 Main idea: Viola-Jones face detector Represent local texture with efficiently computable rectangular features within window of interest Select discriminative features to be weak classifiers Use boosted combination of them as final classifier Form a cascade of such classifiers, rejecting clear negatives quickly Kristen Grauman

33 Viola-Jones detector: features Rectangular filters: feature output is difference between adjacent regions. Value = sum(pixels in white area) - sum(pixels in black area). Efficiently computable with integral image: any sum can be computed in constant time. Integral image: value at (x,y) is sum of pixels above and to the left of (x,y). Adapted from Kristen Grauman and Lana Lazebnik

34 Fast computation with integral images The integral image computes a value at each pixel (x,y) that is the sum of the pixel values above and to the left of (x,y), inclusive This can quickly be computed in one pass through the image (x,y) Lana Lazebnik

35 Computing sum within a rectangle Let A, B, C, D be the values of the integral image at the corners of a rectangle, with A and D diagonally opposite. Then the sum of original image values within the rectangle can be computed as: sum = A - B - C + D. Only 3 additions are required for any size of rectangle! Lana Lazebnik
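A small NumPy sketch of the integral image and the four-corner rectangle sum follows. One deliberate difference from the slides: the table is padded with a leading row and column of zeros, so `ii[r, c]` holds the sum of pixels strictly above and to the left, which avoids special cases at the image border. The function names and the left-minus-right layout of the two-rectangle feature are assumptions made for illustration.

```python
import numpy as np

def integral_image(img):
    """Cumulative sums over rows then columns, with a zero row/column prepended."""
    ii = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))  # ii[r, c] == img[:r, :c].sum()

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom, left:right] from four lookups (A - B - C + D)."""
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def two_rect_feature(ii, top, left, height, width):
    """Two-rectangle feature: left half (white) minus right half (black)."""
    mid = left + width // 2
    white = rect_sum(ii, top, left, top + height, mid)
    black = rect_sum(ii, top, mid, top + height, left + width)
    return white - black
```

The cost of `rect_sum` is the same for a 2x2 rectangle and a 200x200 one, which is what makes evaluating many rectangle features per window affordable.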

36 Lana Lazebnik Example Source Result

37 Viola-Jones detector: features Which subset of these features should we use to determine if a window has a face? Considering all possible filter parameters: position, scale, and type: 180,000+ possible features associated with each 24 x 24 window Use AdaBoost both to select the informative features and to form the classifier Kristen Grauman

38 Viola-Jones detector: AdaBoost Want to select the single rectangle feature and threshold that best separates positive (faces) and negative (nonfaces) training examples, in terms of weighted error. Resulting weak classifier: Outputs of a possible rectangle feature on faces and non-faces. For next round, reweight the examples according to errors, choose another filter/threshold combo. Kristen Grauman

39 AdaBoost: Start with uniform weights on training examples. For M rounds: evaluate weighted error for each weak learner and pick the best learner, where y_m(x_n) is the prediction and t_n is the ground truth for x_n; re-weight the examples, so that incorrectly classified get more weight and correctly classified get less weight; normalize the weights so they sum to 1. Final classifier is combination of weak ones, weighted according to error they had. Figure from C. Bishop, notes from K. Grauman
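The loop described above can be sketched with threshold "stumps" as weak learners. This follows the standard discrete AdaBoost updates (learner weight ln((1 - err) / err), misclassified examples scaled up by the exponential of that weight) as I understand them from Bishop's presentation. The stump search, the clipping of the error, and all function names are assumptions for this sketch rather than the detector's actual training code. Labels are assumed to be +1 / -1.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Weak learner: +1 on one side of a threshold on a single feature."""
    return np.where(polarity * (X[:, feature] - threshold) >= 0, 1, -1)

def fit_stump(X, t, w):
    """Pick the feature/threshold/polarity with the lowest weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for polarity in (1, -1):
                pred = stump_predict(X, feature, threshold, polarity)
                err = w[pred != t].sum()
                if err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best

def adaboost(X, t, rounds=10):
    """Return a list of (alpha, feature, threshold, polarity) weak learners."""
    w = np.full(len(t), 1.0 / len(t))          # uniform initial weights
    model = []
    for _ in range(rounds):
        err, feature, threshold, polarity = fit_stump(X, t, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid division by zero
        alpha = np.log((1 - err) / err)        # low error gives a large weight
        pred = stump_predict(X, feature, threshold, polarity)
        w = w * np.exp(alpha * (pred != t))    # raise weights of mistakes
        w /= w.sum()                           # normalize to sum to 1
        model.append((alpha, feature, threshold, polarity))
    return model

def predict(model, X):
    """Sign of the alpha-weighted vote of all weak learners."""
    score = sum(a * stump_predict(X, f, th, p) for a, f, th, p in model)
    return np.where(score >= 0, 1, -1)
```

In the face detector, each stump would threshold one rectangle feature, so choosing the best stump doubles as feature selection.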

40 Boosting for face detection First two features selected by boosting: This feature combination can yield 100% detection rate and 50% false positive rate Lana Lazebnik

41 Boosting: pros and cons Advantages of boosting Integrates classification with feature selection Complexity of training is linear in the number of training examples Flexibility in the choice of weak learners, boosting scheme Testing is fast Easy to implement Disadvantages Needs many training examples Often found not to work as well as an alternative discriminative classifier, support vector machine (SVM) Lana Lazebnik

42 Are we done? Even if the filters are fast to compute, each new image has a lot of possible windows to search. How to make the detection more efficient? Kristen Grauman

43 Cascading classifiers for detection Form a cascade with low false negative rates early on Apply less accurate but faster classifiers first to immediately discard windows that clearly appear to be negative Kristen Grauman
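A cascade can be sketched as a list of stages that a window must pass in order, with rejection as soon as any stage scores below its threshold. The representation of a stage as a `(score_fn, threshold)` pair is a hypothetical simplification; in the detector described here each stage would be a boosted classifier.

```python
def cascade_classify(window, stages):
    """Run stages in order; reject at the first stage whose score is too low.

    Returns (accepted, number_of_stages_evaluated).
    """
    for index, (score_fn, threshold) in enumerate(stages):
        if score_fn(window) < threshold:
            return False, index + 1   # early rejection: later stages are skipped
    return True, len(stages)
```

Because most windows are non-faces, most calls return after one or two cheap stages, so the average cost per window stays far below the cost of the full classifier.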

44 Viola-Jones detector: summary Train cascade of classifiers with AdaBoost on faces and non-faces, yielding selected features, thresholds, and weights that are then applied to a new image. Train with 5K positives, 350M negatives. Real-time detector using 38 layer cascade (0.067s). 6061 features in all layers. Adapted from Kristen Grauman

45 Viola-Jones detector: summary A seminal approach to real-time object detection Training is slow, but detection is very fast Key ideas Integral images for fast feature evaluation Boosting for feature selection Attentional cascade of classifiers for fast rejection of non-face windows P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR P. Viola and M. Jones. Robust real-time face detection. IJCV 57(2), Matlab demo: Adapted from Kristen Grauman

46 Viola-Jones Face Detector: Results (Visual Object Recognition Tutorial, Perceptual and Sensory Augmented Computing) Kristen Grauman

47 Viola-Jones Face Detector: Results Kristen Grauman

48 Viola-Jones Face Detector: Results Kristen Grauman

49 Detecting profile faces? Can we use the same detector? Kristen Grauman

50 Viola-Jones Face Detector: Results Paul Viola, ICCV tutorial Kristen Grauman


51 Dalal-Triggs pedestrian detector 1. Extract fixed-sized (64x128 pixel) window at each position and scale 2. Compute HOG (histogram of gradient) features within each window 3. Score the window with a linear SVM classifier 4. Perform non-maxima suppression to remove overlapping detections with lower scores Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

52 Histograms of oriented gradients (HOG) Divide image into 8x8 regions Orientation: 9 bins (for unsigned angles) Histograms in 8x8 pixel cells Votes weighted by magnitude Adapted from Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05
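The per-cell histogram described above (8x8 pixel cells, 9 unsigned orientation bins, votes weighted by gradient magnitude) can be sketched as follows. This is a simplified illustration: it uses hard bin assignment and omits the block normalization and vote interpolation of the full HOG descriptor, and the function name `hog_cells` is hypothetical.

```python
import numpy as np

def hog_cells(img, cell=8, bins=9):
    """Return an array of shape (rows, cols, bins) of per-cell histograms."""
    img = img.astype(float)
    gy, gx = np.gradient(img)                        # derivatives along y, x
    mag = np.hypot(gx, gy)                           # gradient magnitude
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0     # unsigned angle in [0, 180)
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    rows, cols = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            sl = (slice(r * cell, (r + 1) * cell), slice(c * cell, (c + 1) * cell))
            # each pixel votes for its orientation bin, weighted by magnitude
            hist[r, c] = np.bincount(
                bin_idx[sl].ravel(), weights=mag[sl].ravel(), minlength=bins
            )
    return hist
```

For a 64x128 pixel detection window (128 rows by 64 columns) this yields 16 x 8 cells of 9 bins each.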

53 Histograms of oriented gradients (HOG) 10x10 cells 20x20 cells N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, CVPR 2005 Image credit: N. Snavely

54 Histograms of oriented gradients (HOG) N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, CVPR 2005 Image credit: N. Snavely

55 Histograms of oriented gradients (HOG) N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, CVPR 2005

56 Histograms of oriented gradients (HOG) N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, CVPR 2005

57 Train SVM for pedestrian detection using HOG [Figure: positive SVM weights (pos w), negative SVM weights (neg w), and an example window classified as pedestrian] Adapted from Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

58 Remove overlapping detections Non-max suppression Score = 0.8 Score = 0.8 Score = 0.1 Adapted from Derek Hoiem
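One common way to remove overlapping detections is greedy suppression by intersection-over-union (IoU): keep the highest-scoring box, drop every remaining box that overlaps it too much, and repeat. The sketch below assumes boxes in (x1, y1, x2, y2) form and an overlap threshold of 0.5; both are illustrative choices, and this greedy variant is not necessarily the exact procedure of the cited paper.

```python
import numpy as np

def iou(box, boxes):
    """Intersection-over-union of one box with an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, thresh=0.5):
    """Return indices of kept boxes, highest score first."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(np.asarray(scores))[::-1]   # best score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # drop boxes that overlap the kept box by more than the threshold
        order = rest[iou(boxes[best], boxes[rest]) <= thresh]
    return keep
```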

59 Plan for the next few lectures Recognizing the category in the image as a whole Detecting the region in the image that corresponds to a category Using window templates Face detection Pedestrian detection Using parts Implicit Shape Models Deformable Part Models Using Convolutional Neural Networks R-CNN, Fast R-CNN YOLO (You Only Look Once)

60 Sliding window detector

61 Are window templates enough? Single rigid window template usually not enough to represent a category Many objects (e.g. humans) are articulated, or have parts that can vary in configuration Many object categories look very different from different viewpoints, or from instance to instance Slide by N. Snavely

62 Deformable objects Images from Caltech-256 Slide Credit: Duan Tran

63 Deformable objects Images from D. Ramanan's dataset Slide Credit: Duan Tran

64 Parts-based Models Define object by collection of parts modeled by 1. Appearance 2. Spatial configuration Slide credit: Rob Fergus

65 How to model spatial relations? One extreme: fixed template Derek Hoiem

66 Fixed part-based template Object model = sum of scores of features at fixed positions. Example: a sum of -0.5 fails the test "> 7.5?" (non-object); a sum of 10.5 passes it (object). Derek Hoiem

67 How to model spatial relations? Another extreme: bag of words [figure] Derek Hoiem

68 How to model spatial relations? Star-shaped model [figure] Derek Hoiem

69 How to model spatial relations? Star-shaped model [Figure: a root node connected to five part nodes] Derek Hoiem

70 Parts-based Models Articulated parts model Object is configuration of parts Each part is detectable and can move around Adapted from Derek Hoiem, images from Felzenszwalb


71 Implicit shape models Visual vocabulary is used to index votes for object position [a visual word = part ] training image annotated with object localization info visual codeword with displacement vectors B. Leibe, A. Leonardis, and B. Schiele, Combined Object Categorization and Segmentation with an Implicit Shape Model, ECCV Workshop on Statistical Learning in Computer Vision 2004 Lana Lazebnik

72 Implicit shape models: Training 1. Build vocabulary of patches around extracted interest points using clustering Lana Lazebnik

73 Implicit shape models: Training 1. Build vocabulary of patches around extracted interest points using clustering 2. Map the patch around each interest point to closest word Lana Lazebnik

74 Implicit shape models: Training 1. Build vocabulary of patches around extracted interest points using clustering 2. Map the patch around each interest point to closest word 3. For each word, store all positions it was found, relative to object center Lana Lazebnik

75 Recall: Generalized Hough transform Template representation: for each type of landmark point, store all possible displacement vectors towards the center Template Model Svetlana Lazebnik

76 Implicit shape models: Testing 1. Given new test image, extract patches, match to vocabulary words 2. Cast votes for possible positions of object center 3. Search for maxima in voting space Lana Lazebnik
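The training and testing steps above amount to a lookup table of displacement vectors plus a voting pass. The sketch below uses exact integer positions and hard word assignments so that votes can be counted in a dictionary; the published method uses soft matching and a continuous voting space, so treat this as a simplified illustration with hypothetical function names.

```python
from collections import Counter, defaultdict

def train_ism(examples):
    """examples: iterable of (word, (x, y) patch position, (cx, cy) object center).

    For each visual word, store every displacement from patch to object center.
    """
    table = defaultdict(list)
    for word, (x, y), (cx, cy) in examples:
        table[word].append((cx - x, cy - y))
    return table

def cast_votes(table, detections):
    """detections: iterable of (word, (x, y)) matched in a test image.

    Each matched word votes for all object centers it was seen with in training.
    """
    votes = Counter()
    for word, (x, y) in detections:
        for dx, dy in table.get(word, []):
            votes[(x + dx, y + dy)] += 1
    return votes
```

The maxima of the vote counter are the hypothesized object centers; `votes.most_common(1)` returns the strongest one.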

77 Detection Results Qualitative Performance: recognizes different kinds of objects; robust to clutter, occlusion, noise, low contrast. K. Grauman, B. Leibe

78 Discriminative part-based models Root filter Part filters Deformation weights P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan, Object Detection with Discriminatively Trained Part Based Models, PAMI 32(9), 2010 Lana Lazebnik

79 Discriminative part-based models Multiple components P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan, Object Detection with Discriminatively Trained Part Based Models, PAMI 32(9), 2010 Lana Lazebnik

80 Scoring an object hypothesis The score of a hypothesis is the sum of appearance scores minus the sum of deformation costs. Appearance term: appearance weights applied to part features. Deformation term: deformation weights (how much we'll penalize the part p_i for moving from its expected location) applied to displacements (how much the part p_i moved from its expected anchor location in the x, y directions, i.e. part location minus anchor location). Felzenszwalb et al.
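Written out, the score described in words above takes roughly the following form. The slide's own equation did not survive extraction, so this is a sketch in the notation I recall from Felzenszwalb et al. (PAMI 2010): the symbols F_i (appearance weights), \phi(H, p_i) (part features at placement p_i), d_i (deformation weights), and a_i (anchor location) are assumptions, and the bias term is omitted.

```latex
\mathrm{score}(p_0, \dots, p_n)
  = \sum_{i=0}^{n} F_i \cdot \phi(H, p_i)
  \;-\; \sum_{i=1}^{n} d_i \cdot \left( dx_i,\; dy_i,\; dx_i^2,\; dy_i^2 \right),
\qquad (dx_i, dy_i) = p_i - a_i
```

The first sum runs over the root (i = 0) and all parts; the second runs over parts only, since the root has no anchor to deviate from.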

81 Felzenszwalb et al. Detection

82 Training Training data: images with labeled bounding boxes Parts are not annotated Need to learn the weights and deformation parameters Adapted from Lana Lazebnik

83 Training Our classifier has the form f(x) = max_z w · Φ(x, z), where w are model parameters and z are latent hypotheses. Latent SVM training: initialize w and iterate: (1) fix w and find the best z for each training example; (2) fix z and solve for w (standard SVM training). Lana Lazebnik

84 Car model Component 1 Component 2 Lana Lazebnik

85 Car detections Lana Lazebnik

86 Person model Lana Lazebnik

87 Person detections Lana Lazebnik

88 Cat model Lana Lazebnik

89 Cat detections Lana Lazebnik

90 Speeding up detection: Restrict set of windows we pass through SVM to those w/ high objectness Alexe et al., CVPR 2010

91 Alexe et al., CVPR 2010 Objectness cue #1: Where people look

92 Objectness cue #2: color contrast at boundary Alexe et al., CVPR 2010

93 Objectness cue #3: no segments straddling the object box Alexe et al., CVPR 2010


94 Boxes found to have high objectness Cyan = ground truth bounding boxes, yellow = correct and red = incorrect predictions for objectness Only run the sheep / horse / chair etc. classifier on the yellow/red boxes. Alexe et al., CVPR 2010

95 How do detectors fail? Most errors that detectors make are reasonable Localization error and confusion with similar objects Misdetection of occluded or small objects Detectors have different sensitivity to different factors E.g. less sensitive to truncation than to size differences Failure analysis code and annotations available online Adapted from Hoiem et al., ECCV 2012

96 Analysis of object characteristics Additional annotations for seven categories: occlusion level, parts visible, sides visible Hoiem et al., ECCV 2012

97 Top false positives: Airplane (DPM) AP = [value lost in extraction]. Breakdown: Background 27%, Localization 29%, Other Objects 11%, Similar Objects (Bird, Boat, Car) 33%. Hoiem et al., ECCV 2012

98 Object characteristics: Aeroplane Occlusion: poor robustness to occlusion, but little impact on overall performance Easier (None) Hoiem et al., ECCV 2012 Harder (Heavy)

99 Object characteristics: Aeroplane Size: strong preference for average to above average sized airplanes Large Medium X-Large Small X-Small Easier Hoiem et al., ECCV 2012 Harder

100 Object characteristics: Aeroplane Aspect Ratio: 2-3x better at detecting wide (side) views than tall views X-Wide Wide Medium X-Tall Tall Easier (Wide) Hoiem et al., ECCV 2012 Harder (Tall)

101 Object characteristics: Aeroplane Sides/Parts: best performance = direct side view with all parts visible Easier (Side) Hoiem et al., ECCV 2012 Harder (Non-Side)

102 Summary Window-template-based approaches Assume object appears in roughly the same configuration in different images Look for alignment with a global template Part-based methods Allow parts to move somewhat from their usual locations Look for good fits in appearance, for both the global template and the individual part templates Speed up by only scoring boxes that look like any object Models prefer that objects appear in certain views

103 Plan for the next few lectures Recognizing the category in the image as a whole Detecting the region in the image that corresponds to a category Using window templates Face detection Pedestrian detection Using parts Implicit Shape Models Deformable Part Models Using Convolutional Neural Networks R-CNN, Fast R-CNN YOLO (You Only Look Once)

104 Complexity and the plateau [Source: esults/index.html] [Chart: mAP (%) of top competition results on the PASCAL VOC challenge dataset, VOC 07 to VOC 12. Labeled points: DPM (value lost in extraction); DPM, HOG+BOW 23%; DPM, MKL 28%; DPM++ 37%; DPM++, MKL, Selective Search 41%; Selective Search, DPM++, MKL 41%. Annotation: plateau & increasing complexity.] Girshick et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014

105 R-CNN: Regions with CNN features [Chart: mAP (%) on the PASCAL VOC challenge dataset, VOC 07 to VOC 12, comparing top competition results with post-competition results. Labeled points: R-CNN 58.5%, R-CNN 53.7%, R-CNN 53.3%.] Girshick et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014

106 R-CNN: Regions with CNN features Pipeline: input image; extract region proposals (~2k / image); compute CNN features; classify regions (linear SVM), e.g. aeroplane? no ... person? yes ... tvmonitor? no. Girshick et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014

107 R-CNN at test time: Step 1 Extract region proposals (~2k / image). Proposal-method agnostic, many choices: Selective Search [van de Sande, Uijlings et al.] (used in this work); Objectness [Alexe et al.]; Category independent object proposals [Endres & Hoiem]; CPMC [Carreira & Sminchisescu]. Active area, at this CVPR: BING [Ming et al.], fast; MCG [Arbelaez et al.], high-quality segmentation. Girshick et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014

108 R-CNN at test time: Step 2 Compute CNN features. Girshick et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014

109 R-CNN at test time: Step 2 Compute CNN features: dilate proposal. Girshick et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014

110 R-CNN at test time: Step 2 Compute CNN features: a. Crop. Girshick et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014

111 R-CNN at test time: Step 2 Compute CNN features: a. Crop; b. Scale (anisotropic) to 227 x 227. Girshick et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014

112 R-CNN at test time: Step 2 Compute CNN features: a. Crop; b. Scale (anisotropic); c. Forward propagate. Output: fc7 features. Girshick et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014

113 R-CNN at test time: Step 3 Classify regions: each proposal's 4096-dimensional fc7 feature vector is scored by linear classifiers (SVM or softmax), e.g. person? 1.6 ... horse? -0.3 ... Girshick et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014

114 Step 4: Object proposal refinement Bounding-box regression: linear regression on CNN features maps the original proposal to the predicted object bounding box. Girshick et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014

115 R-CNN results on PASCAL (metric: mean average precision, higher is better)
DPM v5 (Girshick et al. 2011), reference system: VOC 2007 33.7%, VOC 2010 29.6%
UVA sel. search (Uijlings et al. 2013), reference system: VOC 2010 35.1%
Regionlets (Wang et al. 2013), reference system: VOC 2007 41.7%, VOC 2010 39.7%
SegDPM (Fidler et al. 2013), reference system: VOC 2010 40.4%
R-CNN: VOC 2007 54.2%, VOC 2010 50.2%
R-CNN + bbox regression: VOC 2007 58.5%, VOC 2010 53.7%
Girshick et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014

116 R-CNN results on PASCAL (same results as slide 115) Girshick et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014

117 R-CNN on ImageNet detection ILSVRC2013 detection test set, mean average precision (mAP) in %. Legend: post competition result, competition result. *R-CNN BB 31.4%; *OverFeat (2) 24.3%; UvA Euvision 22.6%; *NEC MU 20.9%; *OverFeat (1) 19.4%; Toronto A 11.5%; SYSU_Vision 10.5%; GPU_UCLA 9.8%; Delta 6.1%; UIUC IFP 1.0%. Girshick et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014

118 Training R-CNN Bounding-box labeled detection data is scarce. Key insight: use supervised pre-training on a data-rich auxiliary task and transfer to detection. Girshick et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014

119 R-CNN training: Step 1 Supervised pre-training: train a SuperVision CNN* for the 1000-way ILSVRC image classification task. Auxiliary task: ILSVRC 2012 classification (1.2 million images). *Network from Krizhevsky, Sutskever & Hinton, NIPS 2012; also called AlexNet. Girshick et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014

120 R-CNN training: Step 2 Fine-tune the CNN for detection: transfer the representation learned for ILSVRC classification to PASCAL (or ImageNet detection). Target task: PASCAL VOC detection (~25k object labels). Girshick et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014

121 R-CNN training: Step 3 Train detection SVMs: per-class SVMs are trained on CNN features of PASCAL VOC object proposals (~2k windows / image) with training labels. (With the softmax classifier from fine-tuning, mAP decreases from 54% to 51%.) Girshick et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014

122 Slow R-CNN Pipeline: input image; Regions of Interest (RoI) from a proposal method (~2k); warped image regions; forward each region through ConvNet; classify regions with SVMs; apply bounding-box regressors. Label in figure: post hoc component. Girshick et al. CVPR14

123 What's wrong with slow R-CNN? Ad hoc training objectives: fine-tune network with softmax classifier (log loss); train post-hoc linear SVMs (hinge loss); train post-hoc bounding-box regressions (least squares). Training is slow (84h), takes a lot of disk space. Inference (detection) is slow: 47s / image with VGG16 [Simonyan & Zisserman, ICLR15], ~2000 ConvNet forward passes per image. Girshick, Fast R-CNN, ICCV 2015

124 Fast R-CNN Fast test time One network, trained in one stage Higher mean average precision Girshick, Fast R-CNN, ICCV 2015

125 Fast R-CNN (test time) Regions of Interest (RoIs) from a proposal method conv5 feature map of image Forward whole image through ConvNet ConvNet Input image Girshick, Fast R-CNN, ICCV 2015

126 Fast R-CNN (test time) RoI Pooling layer Regions of Interest (RoIs) from a proposal method conv5 feature map of image Forward whole image through ConvNet ConvNet Input image Girshick, Fast R-CNN, ICCV 2015

127 Fast R-CNN (test time) Softmax classifier Linear + softmax FCs Fully-connected layers RoI Pooling layer Regions of Interest (RoIs) from a proposal method conv5 feature map of image Forward whole image through ConvNet ConvNet Input image Girshick, Fast R-CNN, ICCV 2015

128 Fast R-CNN (test time) Softmax classifier Linear + softmax Linear Bounding-box regressors FCs Fully-connected layers RoI Pooling layer Regions of Interest (RoIs) from a proposal method conv5 feature map of image Forward whole image through ConvNet ConvNet Input image Girshick, Fast R-CNN, ICCV 2015
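The RoI pooling layer turns a region of any size on the shared feature map into a fixed-size grid, so that the fully-connected layers can follow. A minimal single-channel sketch with max pooling is below. It assumes the RoI is already given in feature-map coordinates as (top, left, bottom, right) with exclusive bottom/right; a real layer would also map image coordinates through the network's stride and handle all channels. The function name is hypothetical.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=(2, 2)):
    """Max-pool the RoI of a 2-D feature map into an out_size grid."""
    top, left, bottom, right = roi
    region = feature_map[top:bottom, left:right]
    h, w = region.shape
    out_h, out_w = out_size
    ys = np.linspace(0, h, out_h + 1).astype(int)   # row boundaries of the bins
    xs = np.linspace(0, w, out_w + 1).astype(int)   # column boundaries
    out = np.empty(out_size)
    for i in range(out_h):
        for j in range(out_w):
            # guarantee each bin covers at least one element
            cell = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[i, j] = cell.max()
    return out
```

Because the convolutional features are computed once per image and only this cheap pooling is repeated per region, the roughly 2000 per-region forward passes of the earlier pipeline are avoided.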

129 Fast R-CNN (training) Linear + softmax Linear FCs ConvNet Girshick, Fast R-CNN, ICCV 2015

130 Fast R-CNN (training) Log loss + smooth L1 loss Multi-task loss Linear + softmax Linear FCs ConvNet Girshick, Fast R-CNN, ICCV 2015

131 Fast R-CNN (training) Linear + softmax Log loss + smooth L1 loss Linear Multi-task loss FCs Trainable ConvNet Girshick, Fast R-CNN, ICCV 2015
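The multi-task loss named above adds a log loss on the class prediction to a smooth L1 loss on the box coordinates. The sketch below uses the smooth L1 definition I recall from the Fast R-CNN paper (quadratic for |x| < 1, linear beyond). The balancing weight `lam`, the convention that class 0 is background and carries no box loss, and the function names are assumptions made for this illustration.

```python
import numpy as np

def smooth_l1(x):
    """0.5 * x^2 where |x| < 1, otherwise |x| - 0.5 (elementwise)."""
    x = np.abs(np.asarray(x, dtype=float))
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def multitask_loss(class_probs, true_class, box_pred, box_target, lam=1.0):
    """Log loss on the true class plus smooth L1 box loss for non-background."""
    log_loss = -np.log(class_probs[true_class])
    if true_class == 0:          # assumed convention: class 0 is background
        return float(log_loss)
    diff = np.asarray(box_pred, dtype=float) - np.asarray(box_target, dtype=float)
    return float(log_loss + lam * smooth_l1(diff).sum())
```

Compared with a squared-error loss, the linear tail makes the regression term less sensitive to occasional large coordinate errors.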

132 Main results
Fast R-CNN: train speedup 8.8x, test time / image 0.32s, test speedup 146x, mAP 66.9%
R-CNN [1]: train speedup 1x, test time / image 47.0s, test speedup 1x, mAP 66.0%
SPP-net [2]: train speedup 3.4x, test time / image 2.3s, test speedup 20x, mAP 63.1%
(The "Train time (h)" row lost its values in extraction.)
Timings exclude object proposal time, which is equal for all methods. All methods use VGG16 from Simonyan and Zisserman. [1] Girshick et al. CVPR14 [2] He et al. ECCV14 Girshick, Fast R-CNN, ICCV 2015

133 Accurate object detection is slow! Pascal 2007 mAP and speed:
DPM v5: 33.7 mAP, 0.07 FPS, 14 s/img
R-CNN: mAP 66.… (truncated in extraction), FPS lost in extraction, 20 s/img
⅓ Mile, 1760 feet
Redmon et al., You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016

134 Accurate object detection is slow! Pascal 2007 mAP and speed:
DPM v5: 33.7 mAP, 0.07 FPS, 14 s/img
R-CNN: mAP 66.… (truncated in extraction), FPS lost in extraction, 20 s/img
Fast R-CNN: mAP and FPS lost in extraction, 2 s/img
Faster R-CNN: mAP lost in extraction, 7 FPS, 140 ms/img
YOLO: 69.0 mAP, 45 FPS, 22 ms/img
2 feet
Redmon et al., You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016

135 Split the image into a grid Redmon et al., You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016

136 Each cell predicts boxes and confidences: P(Object) Redmon et al., You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016

137 Each cell also predicts a probability P(Class | Object), e.g. Bicycle, Car, Dog, Dining Table. Redmon et al., You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016

138 Combine the box and class predictions Redmon et al., You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016
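Combining the two kinds of prediction amounts to multiplying each box's confidence by the class probabilities of the cell that predicted it, then thresholding. The array layout below, (S, S, B) box confidences and (S, S, C) class probabilities for an S x S grid with B boxes per cell and C classes, is an assumption chosen for this sketch, as are the function names.

```python
import numpy as np

def class_scores(box_conf, class_prob):
    """Per-box, per-class scores: box confidence times P(Class | Object).

    box_conf: (S, S, B), class_prob: (S, S, C); result: (S, S, B, C).
    """
    return box_conf[:, :, :, None] * class_prob[:, :, None, :]

def threshold_detections(scores, thresh):
    """List (row, col, box, class, score) for every score above the threshold."""
    hits = np.argwhere(scores > thresh)
    return [
        (int(r), int(c), int(b), int(k), float(scores[r, c, b, k]))
        for r, c, b, k in hits
    ]
```

The surviving detections would then go through non-maximum suppression, as in the final step of the pipeline.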

139 Finally do NMS and threshold detections Redmon et al., You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016

140 YOLO works across many natural images Redmon et al., You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016

141 It also generalizes well to new domains Redmon et al., You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016
