Visuelle Perzeption für Mensch- Maschine Schnittstellen Vorlesung, WS 2009 Prof. Dr. Rainer Stiefelhagen Dr. Edgar Seemann Institut für Anthropomatik Universität Karlsruhe (TH) http://cvhci.ira.uka.de rainer.stiefelhagen@kit.edu seemann@pedestrian-detection.com Edgar Seemann, 04.12.09 1 Interactive Systems Laboratories, Universität Karlsruhe (TH)
Computer Vision: People Detection II WS 2009/10 Dr. Edgar Seemann seemann@pedestrian-detection.com Edgar Seemann, 04.12.09 2 Interactive Systems Laboratories, Universität Karlsruhe (TH)
Termine 18.12.2009 21.12.2009 Thema 19.10.2009 Introduction, Applications Stereo and Optical Flow TBA Termine (1) 23.10.2009 Basics: Cameras, Transformations, Color 26.10.2009 Basics: Image Processing 30.10.2009 Basics: Pattern recognition 02.11.2009 Computer Vision: Tasks, Challenges, Learning, Performance measures 06.11.2009 Face Detection I: Color, Edges (Birchfield) 09.11.2009 Project 1: Intro + Programming tips 13.11.2009 Face Detection II: ANNs, SVM, Viola & Jones 16.11.2009 Project 1: Questions 20.11.2009 Face Recognition I: Traditional Approaches, Eigenfaces, Fisherfaces, EBGM 23.12.2009 Face Recognition II 27.11.2009 Head Pose Estimation: Model-based, NN, Texture Mapping, Focus of Attention 30.11.2009 People Detection I 04.12.2009 People Detection II 07.12.2009 Project 1: Student Presentations, Project 2: Intro 11.12.2009 People Detection III (Part-Based Models) 14.12.2009 Scene Context and Geometry Edgar Seemann, 04.12.09 3 Interactive Systems Laboratories, Universität Karlsruhe (TH)
Today Last time: Global approaches Compute a singe feature vector / representation for the complete body Today: Silhouette matching (global approach) Part-based approaches Edgar Seemann, 04.12.09 4
Silhouette Matching Edgar Seemann, 04.12.09 5
Chamfer Matching [Gavrila & Philomin ICCV 99] Goal Align known object shapes with image Object shapes Requirements for an alignment algorithm High detection rate Few false positives Robustness Computationally inexpensive Real-world image of object Edgar Seemann, 04.12.09 6
Distance Transform Used to compare/align two (typically binary) shapes 1. Compute for each pixel the distance to the next edge pixel Shape 1 Shape 2 Here the eculidean distances are approximated by the 2-3 distance Distance =? Distance transform 8 6 5 4 4 5 6 8 6 5 3 2 2 3 5 6 5 3 2 0 0 2 3 5 3 2 0 2 2 0 2 3 2 0 2 2 2 2 0 2 2 0 0 0 0 0 0 2 3 2 2 2 2 2 3 5 2 4 4 4 4 4 4 5 Edgar Seemann, 04.12.09 7
Distance Transform 2. Overlay second shape over distance transform Distance transform 8 6 5 4 4 5 6 8 6 5 3 2 2 3 5 6 5 3 2 0 0 2 3 5 3 2 0 2 2 0 2 3 2 0 2 2 2 2 0 2 2 0 0 0 0 0 0 2 3 2 2 2 2 2 3 5 2 4 4 4 4 4 4 5 Distance = 14 3. Accumulate distances along shape 2 4. Find best matching position by an exhaustive search Distance is not symmetric Distance has to be normalized w.r.t. the length of the shapes Edgar Seemann, 04.12.09 8
Chamfer Matching Binary image Distance transform 8 6 5 4 4 5 6 8 6 5 3 2 2 3 5 6 5 3 2 0 0 2 3 5 3 2 0 2 2 0 2 3 2 0 2 2 2 2 0 2 2 0 0 0 0 0 0 2 3 2 2 2 2 2 3 5 2 4 4 4 4 4 4 5 Distance transform of a real-world image Chamfer Matching Compute distance transform (DT) For each possible object location Position known object shape over DT Accumulate distances along the contour Distance measure Edgar Seemann, 04.12.09 9
Efficient implementation The distance transform can be efficiently computed by two scans over the complete image Forward-Scan Starts in the upper-left corner and moves from left to right, top to bottom Uses the following mask Backward-Scan Starts in the lower-right corner and moves from right to left, bottom to top Uses the following mask 3 2 3 2 0 0 2 3 2 3 Edgar Seemann, 04.12.09 10
Forward scan We can choose different values for the filter mask The local distances, d, s and c, in the mask are added to the pixel values of the distance map and the new value of the zero pixel is the minimum of the five sums Example: 3 2 3 5 3 2+2 2+3 3+5 d s d s c 2 0?? 2 2+0 2????????? Edgar Seemann, 04.12.09 11
Advantages and Disadvantages Fast Distance transform has to be computed only once Comparison for each shape location is cheap Good performance on uncluttered images (with few background structures) Bad performance for cluttered images Needs a huge number of people silhouettes But computation effort increases with the number of silhouettes Edgar Seemann, 04.12.09 12
Template Hierachy To reduce the number of silhouettes to consider, silhouettes can be organized in a template hierarchy For this, the shapes are clustered by similarity Edgar Seemann, 04.12.09 13
Search in the hierarchy Matching the shapes, then corresponds to a traversal of the template hierarchy How can we prune search branches to speed up matching? Thresholds depend on: Edge detector (likelihood of gaps) Silhouette sizes Hierarchy level Allowed shape variation Thresholds are set statistically during training Edgar Seemann, 04.12.09 14
Example Detections Edgar Seemann, 04.12.09 15
Video Edgar Seemann, 04.12.09 16
Coarse-To-Fine Search Goal: Reduce search effort by discarding unlikely regions with minimal computation Idea: Subsample image and search first at a coarse scale Only consider regions with a low distance when searching for a match on finer scales Again, we have to find reasonable thresholds Level 1 Level 2 Level 3 Edgar Seemann, 04.12.09 17
Protector System (Daimler) Edgar Seemann, 04.12.09 18
Adding edge orientation So far edge orientation has been completely ignored Distance = small Idea: Consider edge orientation for each pixel Edgar Seemann, 04.12.09 19
Edge orientation - The math Given two shapes S, C, we can express the chamfer distance in the following manner The orientation correspondence between two points is then measured by The combined distance measure: Edgar Seemann, 04.12.09 20
Statistical Relevance Adding statistical relevance of silhouette regions further improves the results [Dimitrijevic06] Edgar Seemann, 04.12.09 21
Spatio-Temporal templates Use multiple successive frames to build a spatiotemporal template (T={T 1,,T N }) Allow spatial variations of dx, dy (due to motion or camera movements) Edgar Seemann, 04.12.09 22
Example: single-frame vs. 3 frames Edgar Seemann, 04.12.09 23
Quantitative Results Red: spatio-temporal templates + statistical relevance Edgar Seemann, 04.12.09 24
Video Restrict detection to a single articulation (when legs are in a v-shaped position) Spatio-Temporal templates: Allows more reliable detection of motion direction Avoids confusions and some false positive detections Edgar Seemann, 04.12.09 25
Alternatives: Earth Mover s Distance Originally developed to compare histograms Idea: Find the minimal flow to transform one histogram to another Example: Edgar Seemann, 04.12.09 26
Earth mover s distance (EMD) Chamfer: EMD: Example detection EMD Matching Detect edges in image For each possible object location Optimize correspondences between known shape and edge image Distance measure basis Edgar Seemann, 04.12.09 27
EMD The math Variant of the transportation problem (possible solutions: Stepping Stone Algorithm, Transportation-simplex method) Constraints EMD-Distance Edgar Seemann, 04.12.09 28
Advantages and Disadvantages Optimizes matching between silhouette and edge structure in image Enforces one-to-one matchings (unlike chamfer) Allows partial matches Can deal with arbitrary features High computational complexity Approximation is possible [Graumann, Darrel CVPR 94] Edgar Seemann, 04.12.09 29
Part-Based Models Edgar Seemann, 04.12.09 30
Part-Based Models Fischler & Elschlager 1973 Model has two components parts (2D image fragments) structure (configuration of parts) Edgar Seemann, 04.12.09 31
Configuration of parts Fixed Spatial Layout Local parts have a mostly fixed position and orientation with respect to the center object center or detection window Flexible Spatial Layout Local parts are allowed to shift in location and scale Able to compensate for deformations or articulation changes Well suited for non-rigid objects Spatial relations often modeled probabilistically Edgar Seemann, 04.12.09 32
Different Connectivity Structures O(N 6 ) O(N 2 ) O(N 3 ) O(N 2 ) Fergus et al. 03 Fei-Fei et al. 03 Leibe et al. 04, 08 Crandall et al. 05 Fergus et al. 05 Crandall et al. 05 Felzenszwalb & Huttenlocher 05 Csurka 04 Vasconcelos 00 Bouchard & Triggs 05 Carneiro & Lowe 06 Edgar Seemann, 04.12.09 K. Grauman, B. Leibe 33 from [Carneiro & Lowe, ECCV 06]
Fixed Spatial Layout Edgar Seemann, 04.12.09 34
Motivation Breakdown overall variability into more manageable pieces Pieces can be classified by a less complex classifier Apply prior information by (manually) grouping the global feature vector into meaningful parts Edgar Seemann, 04.12.09 35
Example: Sashua et al. IVS 04 Divide person into 9 overlapping parts 4 pair combinations Edgar Seemann, 04.12.09 36
Feature Representation Edge orientation histograms (similar to HOG) Each part is divided into 2x2 regions i.e. 4 histograms per part 8 gradient orientations Features are insensitive to small local shifts Edgar Seemann, 04.12.09 37
Part Classifier and Combination Parts are classified by a linear regression i.e. distance to a separating hyperplane One score with respect to each of the 9 parts Separate head vs. leg feature, head vs. upper body etc. The values of the individual part detectors are combined by AdaBoost Edgar Seemann, 04.12.09 38
Attention mechanism Attention mechanism Considers scene geometry, camera perspective Filters out regions with lack of texture properties Only 75 image windows are considered for pedestrian classification per frame Edgar Seemann, 04.12.09 39
Results Lower curve: Global SVM Middle curve: Mohan et al. Upper curve: Sashua et al. Edgar Seemann, 04.12.09 40
Results Edgar Seemann, 04.12.09 41
Example: Mohan et al. Body divided into 4 parts Face/Shoulder Legs Right/Left arm Detection Sliding window 64x128 pixels Edgar Seemann, 04.12.09 42
Geometric constraints Body parts are not always at the exact same position Allow local shifts Position Scale Best location has to be found for each detection window Edgar Seemann, 04.12.09 43
Training Data MIT pedestrian database Edgar Seemann, 04.12.09 44
Features Based on wavelets Multi-Resolution representation Haar wavelets Quadruple density Edgar Seemann, 04.12.09 45
Discrete Wavelet Transform (DWT) Motivation Representation with basis functions Given: Signal [9 7 3 5] Disadvantages: Coefficients provide only local information No information about global signal change http://nt.eit.uni-kl.de/wavelet/transformation.html Edgar Seemann, 04.12.09 46
Haar Wavelet Basis Scaling function provides information about the mean value or signal level Representation in wavelet basis: [12 4 sqrt(2) -sqrt(2)] Edgar Seemann, 04.12.09 47
1D Wavelet Transformation Step wise transformation with orthogonal spaces Basis function for V j Scaling function Basis function for W j Wavelet function Edgar Seemann, 04.12.09 48
Our Example Project signal onto subspaces V 1 W 1 V 2 W 2 Edgar Seemann, 04.12.09 49
2D Haar-Wavelets Same princple as in 1D Quadruple density i.e. overcomplete representation Edgar Seemann, 04.12.09 50
Feature Discussion Haar wavelets represent pixel/region differences Strong responses of wavelet coefficients on image boundaries, i.e. coefficients encode shape Similar to gradients at different resolutions Overcomplete basis allows good spatial resolution Fine scales are discarded as they represent noise Coarse scales are discarded as they represent global intensity levels and not the shape Edgar Seemann, 04.12.09 51
Performance: Part Detectors Edgar Seemann, 04.12.09 52
Classifier Combination Voting E.g. majority of part detectors, classify detection window as person SVM combination Feed SVM scores of part detectors in a second stage SVM Referred to as Adaptive Classifier Combination Edgar Seemann, 04.12.09 53
Combination Performance Edgar Seemann, 04.12.09 54
Results Edgar Seemann, 04.12.09 55
Occlusion Edgar Seemann, 04.12.09 56