Object Classification for Video Surveillance Rogerio Feris IBM TJ Watson Research Center rsferis@us.ibm.com http://rogerioferis.com 1
Outline Part I: Object Classification in Far-field Video Part II: Large Scale Object Classification (Near-field / Still Images) 2
Part I Object Classification in Far-field Video 3
Motivation Object Classification in far-field video 4
Search for Pedestrians 5
Motivation Search for Trucks with Yellow Color Results: the query locates DHL delivery trucks making deliveries at the IBM Hawthorne facility 8
Far-field Object Classification: Why Is This a Hard Problem? Illumination changes and shadows Arbitrary camera views Low-resolution imagery (objects are often less than 100 pixels in height; difficult to use, e.g., SIFT features or parts-based approaches) Projective image distortion (cameras with a large field of view) Groups of people may look like cars 9
Two Main Streams of Work for Far-Field Object Classification: 1) Methods that rely on moving object segmentation Use background subtraction to detect and track moving objects 2) Methods that do NOT use background modeling for classification Scan the entire video frame applying specialized detectors (e.g., car and pedestrian detectors) 10
Classification after Object Segmentation Assume static surveillance cameras Three main steps: Background Subtraction to detect moving object Track moving object Classify Object Track into car, person, etc. 11
Classification after Object Segmentation Constrained two-class object classification problem: discriminating vehicles from pedestrians. Key papers: Bose & Grimson, Improving Object Classification in Far-Field Video, CVPR 04 Lisa Brown, View-Independent Vehicle/Person Classification, WVSSN 04 12
Classification after Object Segmentation Simple Shape and Motion Descriptors Scene-dependent Features Foreground blob area (sensitive to perspective distortions) Aspect ratio (cars may have completely different aspect ratios depending on the pose: frontal, side-view, etc.) Speed (cars may move slowly, just like people) Position and direction of motion (people and cars tend to occupy specific regions and follow different patterns of motion in different scenes) 13
Classification after Object Segmentation Simple Shape and Motion Descriptors Scene-independent Features Percentage occupancy (number of silhouette pixels divided by the bounding box area) Direction of motion with respect to the major axis direction (cars tend to move along the major axis direction) Shape deformation (people tend to have larger shape deformations than cars when moving) see the recurrent motion image (next slide) 14
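The percentage-occupancy feature above is simple to compute from a background-subtraction mask. A minimal sketch (the function name and bounding-box convention are illustrative, not from the slides):

```python
import numpy as np

def percentage_occupancy(silhouette, bbox):
    """Fraction of the bounding box covered by silhouette pixels.

    silhouette: 2-D binary array (full-frame foreground mask)
    bbox: (x0, y0, x1, y1) in pixel coordinates
    """
    x0, y0, x1, y1 = bbox
    patch = silhouette[y0:y1, x0:x1]
    return patch.sum() / float(patch.size)

# Toy example: a blob filling half of a 10x10 bounding box
mask = np.zeros((20, 20), dtype=np.uint8)
mask[5:15, 5:10] = 1
print(percentage_occupancy(mask, (5, 5, 15, 15)))  # 0.5
```

Because it is a ratio, the feature is insensitive to object size and therefore to camera placement, which is what makes it scene-independent.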
Classification after Object Segmentation Recurrent Motion Image [Omar Javed, ECCV 2002] Binary silhouette image sequence for an object; consecutive silhouettes are combined with an exclusive-OR operator 15
Classification after Object Segmentation Recurrent Motion Image [Omar Javed, ECCV 2002] 16
Classification after Object Segmentation Recurrent Motion Image [Omar Javed, ECCV 2002] Pros: Scene-independent feature (works for multiple camera views) Cons: Objects need to be aligned over the frames (translation and scale compensated) Morphology operations on foreground blobs obtained by background subtraction may complicate analysis of shape deformations 17
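The recurrent motion image can be sketched as an accumulation of XORs between consecutive aligned silhouettes; regions that change repeatedly (swinging limbs) accumulate high values, while a rigid car accumulates almost none. A minimal sketch, assuming the silhouettes are already translation- and scale-compensated as the slide requires:

```python
import numpy as np

def recurrent_motion_image(silhouettes):
    """Accumulate XORs between consecutive aligned binary silhouettes.

    silhouettes: list of 2-D boolean arrays of equal shape.
    High values mark pixels whose foreground/background label keeps
    flipping over time, i.e., recurrent shape deformation.
    """
    rmi = np.zeros(silhouettes[0].shape, dtype=np.int32)
    for prev, curr in zip(silhouettes[:-1], silhouettes[1:]):
        rmi += np.logical_xor(prev, curr)
    return rmi
```

Thresholding or averaging the RMI inside the bounding box then yields a scalar deformation score usable by the classifier.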
Classification after Object Segmentation Common Practice: Adaptation / Scene Transfer 1) Train a classifier based on scene-independent features (works for multiple camera views) 2) Deploy the classifier on a specific camera and let it run for hours 3) Select examples classified with high confidence 4) Use these examples to normalize scene-dependent features and retrain the classifier 18
Classification after Object Segmentation Incorporating appearance features So far we have only considered shape and motion descriptors from the segmented object (foreground blob), which are limited in their ability to handle multiple object classes [Li et al, Real-time object classification in video surveillance based on appearance learning, CVPR 07] Local Binary Patterns + Adaboost learning for appearance classification Moving objects are classified into 6 classes: car, van, truck, person, bike, and group of people A large training set is required 19
IBM System Interactive Interface User specifies Regions of Interest (ROI) for each class User specifies the size of objects in different locations of the image to compensate for projective distortions. 20
IBM System Bayesian Classifier over features: size, velocity, position, shape deformation (conditioned on the class node) Size P(s | x, c) and position P(x | c) distributions are initially obtained from the interactive interface All distributions are adapted as new samples are classified 21
IBM System Shape Deformation feature Histograms of oriented gradients (HOG): eight bins corresponding to eight directions Differences of HOGs (or histogram intersection) tell how much the shape was deformed, without requiring precise alignment of bounding boxes (as in the Recurrent Motion Image feature) Histogram intersection: I(i) = Σ_j min(h_i(j), h_{i+1}(j)), where i = frame number and j = bin number 22
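The HOG-intersection deformation score can be sketched in a few lines. This is a minimal illustration of the idea (function names are mine, and it assumes the per-frame HOGs are already L1-normalized): identical consecutive histograms intersect fully, so 1 minus the intersection measures deformation.

```python
import numpy as np

def hog_intersection(h1, h2):
    """Histogram intersection of two L1-normalized orientation histograms.
    Returns 1.0 for identical shapes, smaller values under deformation."""
    return np.minimum(h1, h2).sum()

def shape_deformation(hogs):
    """Mean (1 - intersection) over consecutive frames i and i+1.

    hogs: sequence of L1-normalized 8-bin HOGs, one per frame.
    """
    scores = [1.0 - hog_intersection(a, b)
              for a, b in zip(hogs[:-1], hogs[1:])]
    return float(np.mean(scores))
```

Because histograms discard spatial layout, small misalignments of the bounding box barely change the score, which is the advantage over the recurrent motion image.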
IBM System Shape Deformation feature 23
Two Main Streams of Work for Far-Field Object Classification: 1) Methods that rely on moving object segmentation Use background subtraction to detect and track moving objects 2) Methods that do NOT use background modeling for classification Scan the entire video frame applying specialized detectors (e.g., car and pedestrian detectors) 24
Pedestrian Detection Learning Patterns of Motion and Appearance [Viola et al, ICCV 03] Training pairs (frames t and t+1) Appearance filters: rectangle features applied to one of the images (exactly as in the Viola-Jones face detector) Motion filters: rectangle features applied to difference images between frame t and shifted versions of frame t+1 Adaboost learning is used to select discriminative appearance + motion filters 25
Pedestrian Detection Learning Patterns of Motion and Appearance [Viola et al, ICCV 03] Robust to shadows and low-resolution imagery 26
Pedestrian Detection Learning Patterns of Motion and Appearance [Viola et al, ICCV 03] Static Detector (appearance only) Dynamic Detector (appearance + motion) 27
Pedestrian Detection Histograms of Oriented Gradients (HOG) for Human Detection [Dalal and Triggs, CVPR 05] State-of-the-art approach for pedestrian detection Dense grid of cells in the detection window. For each cell, compute HOG and train an SVM with the concatenated vector of HOGs SOURCE CODE: http://pascal.inrialpes.fr/soft/olt/ 28
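The Dalal-Triggs detector scans a fixed 128x64 window over the frame, extracts a HOG descriptor per window, and scores it with a linear SVM. A minimal sketch of the scanning loop; `feat_fn` stands in for the real HOG extractor, and `clf_w`, `clf_b` for the trained SVM weights (all names are illustrative):

```python
import numpy as np

def sliding_window_scores(frame, feat_fn, clf_w, clf_b,
                          win=(128, 64), stride=8):
    """Slide a detection window over the frame, compute a descriptor
    with feat_fn (stand-in for the paper's HOG), and score it with a
    linear SVM (w, b). Returns (score, (y, x)) per window position;
    thresholding and non-maximum suppression would follow."""
    H, W = frame.shape
    out = []
    for y in range(0, H - win[0] + 1, stride):
        for x in range(0, W - win[1] + 1, stride):
            f = feat_fn(frame[y:y + win[0], x:x + win[1]])
            out.append((float(np.dot(clf_w, f) + clf_b), (y, x)))
    return out
```

In the full method the same scan is repeated over an image pyramid so that pedestrians larger than the window are also found.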
Pedestrian Detection Histograms of Oriented Gradients (HOG) for Human Detection [Dalal and Triggs, CVPR 05] 29
Pedestrian Detection Configuration Estimates Improve Pedestrian Finding [Tran & Forsyth, NIPS 07] Structured learning for detecting parts 30
Car Detection Local Statistics of Parts [Schneiderman, 2000] See Face Detection class for details about this method! 31
Summary 1) Methods that rely on moving object segmentation: fast and reliable for static cameras with few objects. These methods do not work for moving cameras or crowded scenes, where background subtraction results are not meaningful 2) Methods that do NOT use background modeling: useful for crowded scenes and moving cameras; work better under shadows and lighting changes. Training is very expensive (collecting samples and training time). More false positives, and sometimes problems with generalization depending on the training/test sets. Difficult to handle multiple object poses. 32
Part II Object Classification in Near-field Video 33
Motivation Object Classification in near-field video License Plate Recognition (LPR) 34
Motivation Object Classification in near-field video Recognizing Products in Retail Stores for Loss Prevention Veggie Vision - http://www.research.ibm.com/ecvg/jhc_proj/veggie.html Loss Prevention in Self-Checkout 35
Motivation Object Classification in near-field video LaneHawk (check Evolution Robotics Retail Company) http://www.evolution.com/products/lanehawk/ 36
Near-Field Object Classification We will briefly cover: 1) Shape-Based Methods 2) Methods Based on Bag of Words 37
Shape-based Approaches Shape Context Matching [Belongie et al, 2000] Source Code: http://www.eecs.berkeley.edu/research/projects/cs/vision/shape/sc_digits.html 38
Shape-based Approaches Shape Context Matching [Belongie et al, 2000] Linear Assignment (Bipartite Graph Matching) for establishing correspondences 39
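The linear-assignment step above can be sketched with the Hungarian algorithm: given a matrix of pairwise shape-context distances (chi-squared between log-polar histograms in the paper; any pairwise cost works for illustration), find the one-to-one correspondence of minimum total cost. A minimal sketch using SciPy:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_points(cost):
    """Optimal one-to-one correspondences between two point sets, given
    a cost matrix cost[i, j] = distance between descriptor i on shape A
    and descriptor j on shape B. Returns the pairs and the total cost."""
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist())), float(cost[rows, cols].sum())

# Toy cost matrix: the cheapest matching is the anti-diagonal
cost = np.array([[4.0, 1.0],
                 [1.0, 4.0]])
pairs, total = match_points(cost)
print(pairs, total)  # [(0, 1), (1, 0)] 2.0
```

Linear assignment enforces only node-to-node consistency; the quadratic assignment of Berg et al. (next slide) additionally penalizes inconsistent pairs of correspondences, at much higher computational cost.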
Shape-based Approaches Deformable Shape Matching [Berg et al, CVPR 2005] Quadratic Assignment (approximation, since NP-Hard) for establishing correspondences 40
Shape-based Approaches Deformable Shape Matching [Berg et al, CVPR 2005] 41
Shape-based Approaches Learning Graph Matching [Caetano et al, ICCV 07] Pairs of Labeled Matches as training 42
Shape-based Approaches Learning Graph Matching [Caetano et al, ICCV 07] Structured learning problem: given a pair of graphs (shapes), predict a matching matrix that provides the best alignment They show that linear assignment (node-node consistency) with learning can match (or exceed) quadratic assignment (edge-edge consistency) without learning Source Code: Structured SVMs http://svmlight.joachims.org/svm_struct.html 43
Near-Field Object Classification We will briefly cover: 1) Shape-Based Methods 2) Methods Based on Bag of Words 44
Bag of Words Excellent Course: Recognizing and Learning Object Categories http://people.csail.mit.edu/torralba/shortcourserloc/ They have a much more detailed presentation about this topic, including Source Code! 45
Object Bag of words Slide from Fei-Fei Li 46
Analogy to documents Two text passages illustrate the bag-of-words idea: a passage on visual perception (the retinal image undergoes stepwise analysis in the visual cortex, per Hubel and Wiesel), with highlighted words sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical, nerve, image, Hubel, Wiesel; and a news passage on China's trade surplus (a forecast of $90bn-$100bn this year, up threefold from 2004's $32bn), with highlighted words China, trade, surplus, commerce, exports, imports, US, yuan, bank, domestic, foreign, increase, value. Each document is characterized by the histogram of its salient words, just as an image is characterized by its visual words. Slide from Fei-Fei Li 47
Slide from Fei-Fei Li 48
Bag-of-words pipeline (learning and recognition): feature detection & representation → codewords dictionary → image representation → category models (and/or) classifiers → category decision Slide from Fei-Fei Li 49
Bag of Words Feature Extraction (Interest Points / SIFT) Learning / Classification Generative Models Discriminative Classifiers 50
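The "image representation" stage of the pipeline quantizes each local descriptor (e.g., SIFT at interest points) to its nearest codeword and histograms the result. A minimal sketch, assuming the codebook has already been learned (typically by k-means over training descriptors); the function name is illustrative:

```python
import numpy as np

def assign_codewords(descriptors, codebook):
    """Quantize local descriptors to their nearest codeword and return
    the normalized bag-of-words histogram for the image.

    descriptors: (n, d) array of local features
    codebook:    (k, d) array of codeword centers
    """
    # Squared Euclidean distance from every descriptor to every codeword
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()   # normalized image representation
```

The resulting fixed-length histogram is what feeds the generative models (pLSA, LDA) or discriminative classifiers (SVMs) on the following slides.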
Generative Models based on Bag of Words See [Sivic et al, Discovering Object Categories in Image Collections, 2005] - Probabilistic Latent Semantic Analysis (pLSA) - Latent Dirichlet Allocation (LDA) (plate notation: documents d, topics z, words w; N words per document, D documents) 52
Slide from Fei-Fei Li 53
Discriminative methods based on bag of words Grauman & Darrell, 2005, 2006: SVM w/ Pyramid Match kernels Others Csurka, Bray, Dance & Fan, 2004 Serre & Poggio, 2005 54
Pyramid match kernel optimal partial matching between sets of features Grauman & Darrell, 2005, Slide credit: Kristen Grauman 55
Pyramid Match (Grauman & Darrell 2005) Histogram intersection Slide credit: Kristen Grauman 56
Pyramid Match (Grauman & Darrell 2005) Histogram intersection matches at this level matches at previous level Difference in histogram intersections across levels counts number of new pairs matched Slide credit: Kristen Grauman 57
Pyramid match kernel over histogram pyramids: K(Ψ(X), Ψ(Y)) = Σ_i w_i N_i, where N_i is the number of newly matched pairs at level i and the weight w_i = 1/2^i (inversely proportional to bin size) reflects the difficulty of a match at level i Normalize kernel values to avoid favoring large sets Slide credit: Kristen Grauman 58
Example pyramid match Slide credit: Kristen Grauman Level 0 59
Example pyramid match Slide credit: Kristen Grauman Level 1 60
Example pyramid match Slide credit: Kristen Grauman Level 2 61
Example pyramid match pyramid match Slide credit: Kristen Grauman 62
Summary: Pyramid match kernel computes an optimal partial matching between sets of features: K = Σ_i w_i N_i, with w_i = 1/2^i encoding the difficulty of a match at level i and N_i the number of new matches at level i 63 Slide credit: Kristen Grauman
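The pyramid match can be sketched for 1-D feature sets: level i bins features at width 2^i, histogram intersection I_i counts matches at that level, N_i = I_i − I_{i−1} counts new matches, and each level is weighted by 1/2^i. A minimal unnormalized sketch (real usage is multi-dimensional and applies the set-size normalization mentioned above):

```python
import numpy as np

def pyramid_match(X, Y, levels, d_max):
    """Unnormalized pyramid match kernel for 1-D feature sets in [0, d_max).

    Level i uses bins of width 2**i; new matches N_i = I_i - I_{i-1}
    (with I_{-1} = 0) are weighted by w_i = 1 / 2**i.
    """
    k, prev_I = 0.0, 0.0
    for i in range(levels):
        width = 2 ** i
        bins = np.arange(0, d_max + width, width)
        hx, _ = np.histogram(X, bins=bins)
        hy, _ = np.histogram(Y, bins=bins)
        I = np.minimum(hx, hy).sum()        # histogram intersection I_i
        k += (I - prev_I) / (2 ** i)        # weight new matches only
        prev_I = I
    return k
```

Features that agree at the finest level contribute with full weight; features that only co-occur in coarse bins contribute geometrically less, mirroring the Level 0/1/2 walk-through above.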