Object Recognition II

Similar documents
Modern Object Detection. Most slides from Ali Farhadi

Deformable Part Models

Object Category Detection. Slides mostly from Derek Hoiem


Category vs. instance recognition

Object Detection Design challenges

Object Category Detection: Sliding Windows

Category-level localization

Find that! Visual Object Detection Primer

Human detection using histogram of oriented gradients. Srikumar Ramalingam School of Computing University of Utah

Object Detection with Discriminatively Trained Part Based Models

Detection III: Analyzing and Debugging Detection Methods

Object Detection. Computer Vision Yuliang Zou, Virginia Tech. Many slides from D. Hoiem, J. Hays, J. Johnson, R. Girshick

CSE373: Data Structures & Algorithms Lecture 23: Applications. Linda Shapiro Spring 2016

Object Detection Based on Deep Learning

CS 1674: Intro to Computer Vision. Object Recognition. Prof. Adriana Kovashka University of Pittsburgh April 3, 5, 2018

Development in Object Detection. Junyuan Lin May 4th

Spatial Localization and Detection. Lecture 8-1

Computer Vision Lecture 16

Feature descriptors. Alain Pagani Prof. Didier Stricker. Computer Vision: Object and People Tracking

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech

An Object Detection Algorithm based on Deformable Part Models with Bing Features Chunwei Li1, a and Youjun Bu1, b

6. Convolutional Neural Networks

Rich feature hierarchies for accurate object detection and semantic segmentation

Fine-tuning Pre-trained Large Scaled ImageNet model on smaller dataset for Detection task

A Discriminatively Trained, Multiscale, Deformable Part Model

2D Image Processing Feature Descriptors

Using the Deformable Part Model with Autoencoded Feature Descriptors for Object Detection

Classification of objects from Video Data (Group 30)

Colorado School of Mines. Computer Vision. Professor William Hoff Dept of Electrical Engineering &Computer Science.

Deep Neural Networks:

Object detection using Region Proposals (RCNN) Ernest Cheung COMP Presentation

Constrained Convolutional Neural Networks for Weakly Supervised Segmentation. Deepak Pathak, Philipp Krähenbühl and Trevor Darrell

Recap Image Classification with Bags of Local Features

Linear combinations of simple classifiers for the PASCAL challenge

Unified, real-time object detection

Histograms of Oriented Gradients for Human Detection p. 1/1

Semantic Pooling for Image Categorization using Multiple Kernel Learning

Histogram of Oriented Gradients for Human Detection

An Implementation on Histogram of Oriented Gradients for Human Detection

Pedestrian Detection and Tracking in Images and Videos

CS 2750: Machine Learning. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh April 13, 2016

Person Detection in Images using HoG + Gentleboost. Rahul Rajan June 1st July 15th CMU Q Robotics Lab

Modeling 3D viewpoint for part-based object recognition of rigid objects

Multiple-Person Tracking by Detection

Fast, Accurate Detection of 100,000 Object Classes on a Single Machine

Deformable Part Models

Detection and Localization with Multi-scale Models

Templates and Background Subtraction. Prof. D. Stricker Doz. G. Bleser

Weakly Supervised Object Recognition with Convolutional Neural Networks

Object Recognition and Detection

c 2011 by Pedro Moises Crisostomo Romero. All rights reserved.

Learning and Recognizing Visual Object Categories Without First Detecting Features

Object recognition. Methods for classification and image representation

Beyond Bags of features Spatial information & Shape models

CS 1674: Intro to Computer Vision. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh November 16, 2016

CPSC340. State-of-the-art Neural Networks. Nando de Freitas November, 2012 University of British Columbia

Object Recognition with Deformable Models

Part-Based Models for Object Class Recognition Part 3

Rich feature hierarchies for accurate object detection and semantic segmentation

Beyond Sliding Windows: Object Localization by Efficient Subwindow Search

Human detection based on Sliding Window Approach

Object Category Detection: Sliding Windows

Machine Learning. Deep Learning. Eric Xing (and Pengtao Xie) , Fall Lecture 8, October 6, Eric CMU,

LEARNING OBJECT SEGMENTATION USING A MULTI NETWORK SEGMENT CLASSIFICATION APPROACH

Project 3 Q&A. Jonathan Krause

Part based models for recognition. Kristen Grauman

Lecture 15: Detecting Objects by Parts

Learning Representations for Visual Object Class Recognition

A Discriminatively Trained, Multiscale, Deformable Part Model

Selective Search for Object Recognition

Structured Models in. Dan Huttenlocher. June 2010

Object Detection. CS698N Final Project Presentation AKSHAT AGARWAL SIDDHARTH TANWAR

Deep Learning for Object detection & localization

Object Detection by 3D Aspectlets and Occlusion Reasoning

Tri-modal Human Body Segmentation

Ranking Figure-Ground Hypotheses for Object Segmentation

arxiv: v2 [cs.cv] 1 Oct 2014

An Exploration of Computer Vision Techniques for Bird Species Classification

Bag-of-features. Cordelia Schmid

Computer Science Faculty, Bandar Lampung University, Bandar Lampung, Indonesia

The Pennsylvania State University. The Graduate School. College of Engineering ONLINE LIVESTREAM CAMERA CALIBRATION FROM CROWD SCENE VIDEOS

SURF. Lecture6: SURF and HOG. Integral Image. Feature Evaluation with Integral Image

Object Detection on Self-Driving Cars in China. Lingyun Li

Sketchable Histograms of Oriented Gradients for Object Detection

CS5670: Computer Vision

Mobile Human Detection Systems based on Sliding Windows Approach-A Review

Regionlet Object Detector with Hand-crafted and CNN Feature

Real-time Object Detection CS 229 Course Project

CAP 6412 Advanced Computer Vision

Yiqi Yan. May 10, 2017

Machine Learning 13. week

Object Detection with Partial Occlusion Based on a Deformable Parts-Based Model

Lecture 5: Object Detection

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Supplementary Material: Pixelwise Instance Segmentation with a Dynamically Instantiated Network

Attributes. Computer Vision. James Hays. Many slides from Derek Hoiem

Deep Learning for Computer Vision II

Deep Learning For Video Classification. Presented by Natalie Carlebach & Gil Sharon

Fast Learning and Prediction for Object Detection using Whitened CNN Features

Transcription:

Object Recognition II Linda Shapiro EE/CSE 576 with CNN slides from Ross Girshick 1

Outline Object detection the task, evaluation, datasets Convolutional Neural Networks (CNNs) overview and history Region-based Convolutional Networks (R-CNNs) 2

Image classification K classes Task: assign correct class label to the whole image Digit classification (MNIST) Object recognition (Caltech-101) 3

Classification vs. Detection Dog Dog Dog 4

Problem formulation { airplane, bird, motorbike, person, sofa } person motorbike Input Desired output 5

Evaluating a detector Test image (previously unseen)

First detection... 0.9 person detector predictions

Second detection... 0.9 0.6 person detector predictions

Third detection... 0.2 0.9 0.6 person detector predictions

Compare to ground truth 0.2 0.9 0.6 person detector predictions ground truth person boxes

Sort by confidence 0.9 0.8 0.6 0.5 0.2 0.1............... true positive (high overlap) X X X false positive (no overlap, low overlap, or duplicate)

Evaluation metric 0.9 0.8 0.6 0.5 0.2 0.1............... X X X t ppppppppppp = #tttt ppppppppppp #tttt ppppppppppp + #fffff ppppppppppp + X rrrrrrrr = #tttt ppppppppppp #gggggg ttttt ooooooo

Evaluation metric 0.9 0.8 0.6 0.5 0.2 0.1............... X X X Average Precision (AP) 0% is worst 100% is best mean AP over classes (map)

Pedestrians Histograms of Oriented Gradients for Human Detection, Dalal and Triggs, CVPR 2005 AP ~77% More sophisticated methods: AP ~90% (a) average gradient image over training examples (b) each pixel shows max positive SVM weight in the block centered on that pixel (c) same as (b) for negative SVM weights (d) test image (e) its R-HOG descriptor (f) R-HOG descriptor weighted by positive SVM weights (g) R-HOG descriptor weighted by negative SVM weights 14

Overview of HOG Method 1. Compute gradients in the region to be described 2. Put them in bins according to orientation 3. Group the cells into large blocks 4. Normalize each block 5. Train classifiers to decide if these are parts of a human 15

Details Gradients [-1 0 1] and [-1 0 1] T were good enough filters. Cell Histograms Each pixel within the cell casts a weighted vote for an orientation-based histogram channel based on the values found in the gradient computation. (9 channels worked) Blocks Group the cells together into larger blocks, either R-HOG blocks (rectangular) or C-HOG blocks (circular). 16

More Details Block Normalization They tried 4 different kinds of normalization. L1-norm sqrt of L1-norm L2 norm L2-norm followed by clipping If you think of the block as a vector v, then the normalized block is v/norm(v) 17

Example: Dalal-Triggs pedestrian detector 1. Extract fixed-sized (64x128 pixel) window at each position and scale 2. Compute HOG (histogram of gradient) features within each window 3. Score the window with a linear SVM classifier 4. Perform non-maxima suppression to remove overlapping detections with lower scores 18 Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

Slides by Pete Barnum 19 Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

Outperforms centered diagonal uncentered cubic-corrected Sobel Slides by Pete Barnum 20 Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

Histogram of gradient orientations Orientation: 9 bins (for unsigned angles) Histograms in 8x8 pixel cells Votes weighted by magnitude Bilinear interpolation between cells Slides by Pete Barnum 21 Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

Normalize with respect to surrounding cells Slides by Pete Barnum 22 Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

# orientations X= # features = 15 x 7 x 9 x 4 = 3780 # cells # normalizations by neighboring cells Slides by Pete Barnum 23 Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

Training set 24

pos w neg w Slides by Pete Barnum 25 Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

pedestrian Slides by Pete Barnum 26 Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

Detection examples 27

30

Deformable Parts Model Takes the idea a little further Instead of one rigid HOG model, we have multiple HOG models in a spatial arrangement One root part to find first and multiple other parts in a tree structure. 31

The Idea Articulated parts model Object is configuration of parts Each part is detectable 32 Images from Felzenszwalb

Deformable objects Images from Caltech-256 33 Slide Credit: Duan Tran

Deformable objects Images from D. Ramanan s dataset 34 Slide Credit: Duan Tran

How to model spatial relations? Tree-shaped model 35

36

Hybrid template/parts model Detections Template Visualization 37 Felzenszwalb et al. 2008

Pictorial Structures Model Appearance likelihood Geometry likelihood 38

Results for person matching 39

Results for person matching 40

41 BMVC 2009

2012 State-of-the-art Detector: Deformable Parts Model (DPM) Lifetime Achievement 1. Strong low-level features based on HOG 2. Efficient matching algorithms for deformable part-based models (pictorial structures) 3. Discriminative learning with latent variables (latent SVM) Felzenszwalb 42 et al., 2008, 2010, 2011, 2012

Why did gradient-based models work? Average gradient image 43

Generic categories Can we detect people, chairs, horses, cars, dogs, buses, bottles, sheep? PASCAL Visual Object Categories (VOC) dataset 44

Generic categories Why doesn t this work (as well)? Can we detect people, chairs, horses, cars, dogs, buses, bottles, sheep? PASCAL Visual Object Categories (VOC) dataset 45

Quiz time (Back to Girshick) 46

Warm up This is an average image of which object class? 47

Warm up pedestrian 48

A little harder? 49

A little harder? Hint: airplane, bicycle, bus, car, cat, chair, cow, dog, dining table 50

A little harder bicycle (PASCAL) 51

A little harder, yet? 52

A little harder, yet? Hint: white blob on a green background 53

A little harder, yet sheep (PASCAL) 54

Impossible?? 55

Impossible? dog (PASCAL) 56

Impossible? dog (PASCAL) Why does the mean look like this? There s no alignment between the examples! How do we combat this? 57

PASCAL VOC detection history 70% mean Average Precision (map) 60% 50% 40% 30% 20% 10% 17% DPM 23% DPM, HOG+ BOW 28% DPM, MKL 37% 41% DPM++ DPM++, MKL, Selective Search 41% Selective Search, DPM++, MKL 0% 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 year

Part-based models & multiple features (MKL) 70% mean Average Precision (map) 60% 50% 40% 30% 20% 10% 17% DPM 23% DPM, HOG+ BOW 28% DPM, MKL 37% 41% DPM++ DPM++, MKL, Selective Search 41% Selective Search, DPM++, MKL 0% 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 year

Kitchen-sink approaches 70% mean Average Precision (map) 60% 50% 40% 30% 20% 10% 17% DPM 23% DPM, HOG+ BOW 28% DPM, MKL increasing complexity & plateau 37% 41% DPM++ DPM++, MKL, Selective Search 41% Selective Search, DPM++, MKL 0% 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 year

Region-based Convolutional Networks (R-CNNs) mean Average Precision (map) 70% 60% 50% 40% 30% 20% 10% 17% DPM 23% DPM, HOG+ BOW 28% DPM, MKL 37% 41% DPM++ DPM++, MKL, Selective Search 41% Selective Search, DPM++, MKL 53% R-CNN v1 62% R-CNN v2 0% 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 year [R-CNN. Girshick et al. CVPR 2014]

Region-based Convolutional Networks (R-CNNs) 70% mean Average Precision (map) 60% 50% 40% 30% 20% 10% ~1 year ~5 years 0% 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 year [R-CNN. Girshick et al. CVPR 2014]

Convolutional Neural Networks Overview 63

Standard Neural Networks Fully connected T x = x 1,, x 784 z j = g(w T j x) g t = 1 1 + e t 64

From NNs to Convolutional NNs Local connectivity Shared ( tied ) weights Multiple feature maps Pooling 65

Convolutional NNs Local connectivity compare Each green unit is only connected to (3) neighboring blue units 66

Convolutional NNs Shared ( tied ) weights w 1 w 2 w 3 w 1 w 2 All green units share the same parameters w Each green unit computes the same function, but with a different input window w 3 67

Convolutional NNs Convolution with 1-D filter: [w 3, w 2, w 1 ] w 1 w 2 w 3 All green units share the same parameters w Each green unit computes the same function, but with a different input window 68

Convolutional NNs Convolution with 1-D filter: [w 3, w 2, w 1 ] w 1 w 2 w 3 All green units share the same parameters w Each green unit computes the same function, but with a different input window 69

Convolutional NNs Convolution with 1-D filter: [w 3, w 2, w 1 ] w 1 w 2 w 3 All green units share the same parameters w Each green unit computes the same function, but with a different input window 70

Convolutional NNs Convolution with 1-D filter: [w 3, w 2, w 1 ] All green units share the same parameters w w 1 w 2 w 3 Each green unit computes the same function, but with a different input window 71

Convolutional NNs Convolution with 1-D filter: [w 3, w 2, w 1 ] All green units share the same parameters w w 1 Each green unit computes the same function, but with a different input window w 2 w 3 72

Convolutional NNs Multiple feature maps ww 1 ww 2 ww 3 All orange units compute the same function but with a different input windows w 1 w 2 w 3 Feature map 1 (array of green units) Orange and green units compute different functions Feature map 2 (array of orange units) 73

Convolutional NNs Pooling (max, average) 1 4 0 3 4 3 Pooling area: 2 units Pooling stride: 2 units Subsamples feature maps 74

2D input Pooling Convolution Image 75

1989 Backpropagation applied to handwritten zip code recognition, Lecun et al., 1989 76

Historical perspective 1980 77

Historical perspective 1980 Hubel and Wiesel 1962 Included basic ingredients of ConvNets, but no supervised learning algorithm 78

Supervised learning 1986 Gradient descent training with error backpropagation Early demonstration that error backpropagation can be used for supervised training of neural nets (including ConvNets) 79

Supervised learning 1986 T vs. C problem Simple ConvNet 80

Practical ConvNets Gradient-Based Learning Applied to Document Recognition, Lecun et al., 1998 81

Demo http://cs.stanford.edu/people/karpathy/convnetjs/ demo/mnist.html ConvNetJS by Andrej Karpathy (Ph.D. student at Stanford) Software libraries Caffe (C++, python, matlab) Torch7 (C++, lua) Theano (python) 82

The fall of ConvNets The rise of Support Vector Machines (SVMs) Mathematical advantages (theory, convex optimization) Competitive performance on tasks such as digit classification Neural nets became unpopular in the mid 1990s 83

The key to SVMs It s all about the features HOG features SVM weights (+) (-) Histograms of Oriented Gradients for Human Detection, Dalal and Triggs, CVPR 2005 84

Core idea of deep learning Input: the raw signal (image, waveform, ) Features: hierarchy of features is learned from the raw input 85

If SVMs killed neural nets, how did they come back (in computer vision)? 86

What s new since the 1980s? More layers LeNet-3 and LeNet-5 had 3 and 5 learnable layers Current models have 8 20+ ReLU non-linearities (Rectified Linear Unit) g x = max 0, x Gradient doesn t vanish Dropout regularization Fast GPU implementations More data g(x) x 87

What else? Object Proposals Sliding window based object detection Image Feature Extraction Classificaiton Iterate over window size, aspect ratio, and location Object proposals Fast execution High recall with low # of candidate boxes Image Object Proposal Feature Extraction Classificaiton 88

The number of contours wholly enclosed by a bounding box is indicative of the likelihood of the box containing an object. 89

Ross s Own System: Region CNNs

Competitive Results

Top Regions for Six Object Classes

Finale Object recognition has moved rapidly in the last 12 years to becoming very appearance based. The HOG descriptor lead to fast recognition of specific views of generic objects, starting with pedestrians and using SVMs. Deformable parts models extended that to allow more objects with articulated limbs, but still specific views. CNNs have become the method of choice; they learn from huge amounts of data and can learn multiple views of each object class. 93