Detecting Pedestrians Using Patterns of Motion and Appearance (Viola & Jones) - Aditya Pabbaraju

Background: We are adept at classifying actions, and we easily categorize even noisy, small images. We want computers to do just as well. How do we do it? Slide Credit: Jasper Snoek

Motivation: Possible applications for action recognition. Obvious: tracking people's activities in public places (surveillance). Less obvious: use classification to solve a harder problem, e.g. fit a skeletal model over a novel sequence, or synthesize actions. Slide Credit: Sunny Chow

Closely Related Work: P. Viola & M. Jones, Robust Real-time Object Detection, Workshop on Statistical and Computational Theories of Vision, July 2001. P. Viola & M. Jones, Rapid Object Detection Using a Boosted Cascade of Simple Features, CVPR, 2001. P. Viola & M. Jones, Robust Real-Time Face Detection, IJCV, 2004. P. Viola, M. Jones & D. Snow, Detecting Pedestrians Using Patterns of Motion and Appearance, ICCV 2003. Slide Credit: Jasper Snoek

Continued: Action classification has been attempted in the past, with different assumptions. Cutler and Davis: (a) poor-quality moving footage; (b) tracked images are used to find the periodicity robustly; (c) does long-term motion analysis. Papageorgiou et al.: (a) an SVM over an overcomplete wavelet basis is used for detecting pedestrians; (b) the false positive rate of this system was significantly higher than that of related face detection systems.

The Goals: (i) develop a representation of image motion which is extremely efficient; (ii) implement a state-of-the-art pedestrian detection system which operates on low-res images under difficult conditions; (iii) integrate image intensity information with motion information for detecting pedestrians. The Approach: (i) find extremely basic features of the images that can be computed very quickly (real-time); (ii) generate a huge set of features, then use machine learning techniques (AdaBoost) to find the best distinguishing features. Slide Credit: Jasper Snoek

Framework scheme: The framework consists of (1) a trainer and (2) a detector. The trainer is supplied with positive and negative samples: positive samples are images containing the object; negative samples are images not containing the object. The trainer then creates a final classifier, a lengthy process calculated offline. The detector applies the final classifier across a given input image. Slide Credit: Chen Goldberg

Features: We describe an object using simple functions, also called Haar-like features. Given a sub-window, the feature function calculates a brightness differential. For example, the value of a two-rectangle feature is the difference between the sums of the pixels within the two rectangular regions. Slide Credit: Chen Goldberg

Features example: Faces share many similar properties which can be represented with Haar-like features. For example, it is easy to notice that the eye region is darker than the upper cheeks, and the nose bridge region is brighter than the eyes. Slide Credit: Chen Goldberg

Large library of filters: Considering all possible filter parameters (position, scale, and type), there are 180,000+ possible features associated with each 24 x 24 window. AdaBoost is used both to select the informative features and to form the classifier. Viola & Jones, CVPR 2001. Slide Credit: K. Grauman, B. Leibe

Three challenges ahead: (1) How can we evaluate features quickly? Feature calculation happens extremely frequently, and an image scale pyramid is too expensive to compute. (2) How do we obtain the most representative features possible? (3) How can we avoid wasting time on image background (i.e. non-object regions)? Slide Credit: Chen Goldberg

Introducing the Integral Image. Definition: the integral image at location (x, y) is the sum of the pixel values above and to the left of (x, y), inclusive. Formal definition: ii(x, y) = Σ_{x' ≤ x, y' ≤ y} i(x', y'). The integral image can be computed efficiently in a single pass over the image using the recurrences s(x, y) = s(x, y - 1) + i(x, y) and ii(x, y) = ii(x - 1, y) + s(x, y), where s(x, y) is the cumulative row sum, with s(x, -1) = 0 and ii(-1, y) = 0. Slide Credit: Chen Goldberg
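To make the recurrences concrete, here is a minimal sketch (mine, not from the slides) that builds the integral image in a single pass; numpy and the loop structure are my own choices, and the result is equal to img.cumsum(0).cumsum(1).

import numpy as np

def integral_image(img):
    # Single pass using the slide's recurrences:
    # s(x, y) = s(x, y-1) + i(x, y) and ii(x, y) = ii(x-1, y) + s(x, y)
    h, w = img.shape
    s = np.zeros((h, w), dtype=np.int64)    # cumulative sums along y
    ii = np.zeros((h, w), dtype=np.int64)
    for x in range(w):
        for y in range(h):
            s[y, x] = img[y, x] + (s[y - 1, x] if y > 0 else 0)
            ii[y, x] = s[y, x] + (ii[y, x - 1] if x > 0 else 0)
    return ii

img = np.arange(16).reshape(4, 4)
assert (integral_image(img) == img.cumsum(0).cumsum(1)).all()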

Rapid evaluation of rectangular features: Using the integral image representation, one can compute the value of any rectangular sum in constant time. For example, the sum inside rectangle D can be computed as ii(4) + ii(1) - ii(2) - ii(3). As a result, two-, three-, and four-rectangle features can be computed with 6, 8 and 9 array references respectively. Now that's fast! Slide Credit: Chen Goldberg
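A hedged sketch of the constant-time rectangle sum (my own helper; the integral image is built with numpy's cumsum for brevity):

import numpy as np

def box_sum(ii, top, left, bottom, right):
    # Sum of img[top..bottom, left..right] from at most 4 references:
    # ii(4) + ii(1) - ii(2) - ii(3), with ii(1) just above and left of the box.
    total = ii[bottom, right]                        # ii(4)
    if top > 0:
        total -= ii[top - 1, right]                  # ii(2)
    if left > 0:
        total -= ii[bottom, left - 1]                # ii(3)
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]               # ii(1)
    return total

img = np.arange(36).reshape(6, 6)
ii = img.cumsum(0).cumsum(1)
assert box_sum(ii, 2, 1, 4, 3) == img[2:5, 1:4].sum()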

Scaling: The integral image enables us to evaluate rectangle sums of any size in constant time. Therefore, no image scaling is necessary: scale the rectangular features instead! Slide Credit: Chen Goldberg

The Features: Five images are created from the original two frames (I_t and I_t+1) to represent motion: Δ, U, D, L, R. Each is formed by shifting I_t one pixel in the corresponding direction (e.g. U means up; Δ means no shift, i.e. the temporal gradient) and taking the absolute difference with I_t+1. These images represent crude gradients of motion: the sum of the pixels of the image shifted in the direction of motion will be smaller than the sums of the others, since that shift best compensates the motion. Slide Credit: Jasper Snoek
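As an illustration (my own sketch, not the authors' code), the five motion images can be computed like this; np.roll wraps around at the borders, which is a simplification of whatever edge handling the paper uses:

import numpy as np

def motion_images(I_t, I_t1):
    a = I_t.astype(np.int32)        # cast so differences don't underflow uint8
    b = I_t1.astype(np.int32)
    delta = np.abs(a - b)                        # no shift: temporal gradient
    U = np.abs(np.roll(a, -1, axis=0) - b)       # I_t shifted up one pixel
    D = np.abs(np.roll(a,  1, axis=0) - b)       # shifted down
    L = np.abs(np.roll(a, -1, axis=1) - b)       # shifted left
    R = np.abs(np.roll(a,  1, axis=1) - b)       # shifted right
    return delta, U, D, L, R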

The Features: A feature is a thresholded filter f_i: F_i = α if f_i(I_t, Δ, U, D, L, R) > t_i, and β otherwise, for some constants α, β, t_i. There are essentially 3 types of filters: (1) f_i = r_i(S); (2) f_i = |r_i(Δ) - r_i(S)|; (3) f_i = φ_i(S). Here φ_i represents a rectangular filter, S is one of I_t, Δ, U, D, L or R, and r_i(S) is a sum of pixel values over a box region of image S. Slide Credit: Jasper Snoek

Examples: Take two consecutive images, I_t and I_t+1. Slide Credit: Jasper Snoek

Representing Motion (Examples): Compute U, D, L, R by shifting image I_t over one pixel and taking the absolute difference with I_t+1; Δ is computed as just abs(I_t - I_t+1). In this example, D has a sum of 121,020 while U has a sum of 62,126, so the motion is in the upward direction. Slide Credit: Jasper Snoek

Filter Type 1: f_i = r_i(S), where S is any of I_t, Δ, U, D, L, R, and r_i(S) is the sum of pixel values over a box region (shown here on the L image). Slide Credit: Jasper Snoek

Filter Type 2: f_i = |r_i(Δ) - r_i(S)|, where S is any of U, D, L, R, and r_i(S) is the sum of pixel values over a box region (shown here on the U image). Slide Credit: Jasper Snoek

Rectangular Features (Filter Type 3): f_i = φ_i(S), where φ represents a rectangular filter. The filter response is the total difference in pixel values between the dark and light parts of the rectangles. Example responses: Difference = 6,683; Difference = 224; Difference = 5,476. If we set the threshold to 300, this filter can recognize the symmetry between the eyes. Slide Credit: Jasper Snoek

Classifier: A classifier is a thresholded sum of features: C(I_t, I_t+1) = 1 iff Σ_i F_i(I_t, Δ, U, D, L, R) > Θ, where each feature F_i is a thresholded filter: α if f_i(I_t, Δ, U, D, L, R) > t_i, and β otherwise. This gives us 4 parameters to select (α, β, t_i, Θ) in addition to choosing which subset of filters to use. Slide Credit: Jasper Snoek
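A minimal sketch of these definitions in code (mine; the box coordinates, α/β values and thresholds are placeholders, and imgs is assumed to be a dict holding I_t, Δ, U, D, L, R):

import numpy as np

def r(S, top, left, bottom, right):
    # r_i(S): sum of pixel values over a box region of image S
    return S[top:bottom + 1, left:right + 1].sum()

def make_type2_filter(box):
    # f_i = |r_i(Delta) - r_i(S)|, with S fixed to the U image here
    top, left, bottom, right = box
    return lambda imgs: abs(r(imgs['delta'], top, left, bottom, right)
                            - r(imgs['U'], top, left, bottom, right))

def feature(filter_fn, imgs, alpha, beta, t_i):
    # F_i = alpha if f_i(I_t, Delta, U, D, L, R) > t_i else beta
    return alpha if filter_fn(imgs) > t_i else beta

def classify(feats, imgs, theta):
    # C(I_t, I_t+1) = 1 iff sum_i F_i > Theta
    return 1 if sum(feature(f, imgs, a, b, t) for f, a, b, t in feats) > theta else 0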

Feature selection: Given a feature set and a labeled training set of images, we create a strong object classifier. Hypothesis: a combination of only a small number of discriminant features can yield an effective classifier. Variety is the key here: if we want a small number of features, we must make sure they compensate for each other's flaws. Slide Credit: Chen Goldberg

Boosting: Boosting is a machine learning meta-algorithm for performing supervised learning. It creates a strong classifier from a set of weak classifiers. Definitions: a weak classifier has an error rate < 0.5 (i.e. better than random guessing); a strong classifier has an error rate of ε, arbitrarily small (our final classifier). Slide Credit: Chen Goldberg

AdaBoost: 1990, The Strength of Weak Learnability (Schapire); 1997, generalized version of AdaBoost (Schapire & Singer). AdaBoost is an algorithm for constructing a strong classifier as a linear combination of simple weak classifiers h_t(x). Slide Credit: Jasper Snoek

AdaBoost stands for adaptive boosting. AdaBoost is a boosting algorithm for searching out a small number of good classifiers which have significant variety. AdaBoost accomplishes this by giving misclassified training examples more weight (thus enhancing their chances of being classified correctly next time). The weights tell the learning algorithm the importance of each example. Slide Credit: Chen Goldberg

AdaBoost example: AdaBoost starts with a uniform distribution of weights over training examples. Select the classifier with the lowest weighted error (i.e. a weak classifier); increase the weights on the training examples that were misclassified; repeat. At the end, carefully make a linear combination of the weak classifiers obtained at all iterations: h_strong(x) = 1 if α_1 h_1(x) + ... + α_n h_n(x) ≥ (1/2)(α_1 + ... + α_n), and 0 otherwise. Slide taken from a presentation by Qing Chen, Discover Lab, University of Ottawa

AdaBoost Algorithm: Start with uniform weights on training examples {x_1, ..., x_n}. For T rounds: evaluate the weighted error for each feature and pick the best; then re-weight the examples, giving incorrectly classified ones more weight and correctly classified ones less weight. The final classifier is a combination of the weak ones, weighted according to the error each had. [Freund & Schapire 1995]
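To ground the algorithm, a small self-contained AdaBoost over decision stumps (my own simplification: one threshold-on-a-column stump per round instead of one Haar feature; labels are assumed to be in {-1, +1}):

import numpy as np

def adaboost(X, y, T):
    # X: (n, d) feature values, y: labels in {-1, +1}, T: boosting rounds
    n, d = X.shape
    w = np.full(n, 1.0 / n)                    # start with uniform weights
    stumps = []
    for _ in range(T):
        best = None
        for j in range(d):                     # exhaustively pick the stump
            for thr in np.unique(X[:, j]):     # with lowest weighted error
                for sign in (1, -1):
                    pred = np.where(sign * (X[:, j] - thr) > 0, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign, pred)
        err, j, thr, sign, pred = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        w *= np.exp(-alpha * y * pred)         # misclassified examples gain weight
        w /= w.sum()
        stumps.append((alpha, j, thr, sign))
    return stumps

def predict(stumps, X):
    score = sum(a * np.where(s * (X[:, j] - t) > 0, 1, -1) for a, j, t, s in stumps)
    return np.where(score > 0, 1, -1)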

Back to feature selection: We use a variation of AdaBoost for aggressive feature selection, basically similar to the previous example. Our training set consists of positive and negative images, and each simple classifier consists of a single feature. Slide Credit: Chen Goldberg

The attentional cascade: The overwhelming majority of windows are in fact negative. Simpler boosted classifiers can reject many of the negative sub-windows while detecting all positive instances, and a cascade of gradually more complex classifiers achieves good detection rates. Consequently, on average, far fewer features are calculated per window. Slide Credit: Chen Goldberg

Cascading classifiers for detection: Given a nested set of classifier hypothesis classes, each stage trades detection rate against false positives. [Figure: ROC sketch (% detection vs. % false positives) and a cascade diagram in which each image sub-window passes through Classifier 1, 2, 3 in turn; a rejection (F) at any stage labels it Non-Pedestrian, and only windows accepted (T) by every stage are labeled Pedestrian.] Slide credit: Paul Viola; Slide Credit: Chen Goldberg

Training a cascaded classifier: Subsequent classifiers are trained only on examples which pass through all the previous classifiers, so the task faced by classifiers further down the cascade is more difficult. Slide Credit: Chen Goldberg

Training a cascaded classifier (cont.): Given a false positive rate F and detection rate D, we would like to minimize the expected number of features evaluated per window. Since this optimization is extremely difficult, the usual framework is to choose a minimal acceptable false positive rate and detection rate per layer. N = n_0 + Σ_{i=1}^{K} ( n_i · Π_{j<i} p_j ), where N is the expected number of features evaluated per window, n_i is the number of features in the i-th classifier, K is the number of classifiers/layers, and p_i is the positive rate of the i-th classifier. Slide Credit: Chen Goldberg

Cascaded Classifier: Using all the features in a single classifier would take too long. Instead, a cascade of classifiers is used, where each subsequent level of the cascade contains more features. This way, image patches that are very different from actual pedestrians can be thrown out using only a few features. Slide Credit: Jasper Snoek
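A sketch of the cascade's control flow (mine, reusing the classify helper from the earlier feature sketch; the stage contents are placeholders):

def cascade_classify(window_imgs, stages):
    # stages: list of (features, theta) pairs, cheapest stage first
    for feats, theta in stages:
        if classify(feats, window_imgs, theta) == 0:
            return 0          # rejected early: later, costlier stages never run
    return 1                  # accepted by every stage: report a pedestrian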

Experiments: Each classifier in the cascade is trained using 2250 positive examples and 2250 false positives from the previous stages of the cascade (this lowers the false positive rate at each stage). Each stage is trained so that 99.5% of true positives from the previous stage are kept while 10% of false positives are eliminated (if this can't be done, more features are added). Slide Credit: Jasper Snoek

Experiments: Two detectors (dynamic and static). The dynamic detector was trained using 54,624 filters on the original image I_t and the motion images Δ, U, D, L, R. The static detector was trained using 24,328 filters on only the original image I_t. Slide Credit: Jasper Snoek

Small Sample of Positive Training Samples

Classifier Intuition

Detector evaluation How to evaluate a detector? Summarize results with an ROC curve: show how the number of correctly classified positive examples varies relative to the number of incorrectly classified negative examples.

Results: ROC curves for the classification, obtained by varying the number of features. Slide Credit: Jasper Snoek

Results: Correct detections: 80%. False positives (total number of false positives / total number of patches tested): 1/400,000 for the dynamic detector, which corresponds to 1 false positive every 2 frames; 1/15,000 for the static detector, which corresponds to 13 false positives per frame. Slide Credit: Jasper Snoek

Results Dynamic detector Static detector Slide Credit: Jasper Snoek

Dynamic Detector Static Detector Slide Credit: Jasper Snoek

Comments: Using more complex features such as optical flow would likely be more successful (but might make things slower). Why not use basic background subtraction? It would greatly reduce the number of pixels the detector has to search over. Slide Credit: Jasper Snoek

Comments: Using information about where pedestrians were in previous frames (i.e. tracking) would improve the detector and help against occlusions, etc. Is overfitting a problem? AdaBoost can succumb to overfitting the training data (and thus generalize badly) by picking too many features. Here we have 2250 training examples and 54,624 features; is 24.3 features per training example not too much? Slide Credit: Jasper Snoek

Questions?

References: M. Rahman, N. Kehtarnavaz, and J. Ren, A Hybrid Face Detection Approach for Real-time Deployment on Mobile Devices, ICIP, Nov. 2009. Hadid et al., Face and Eye Detection for Person Authentication in Mobile Phones, 2007. Presented by Srujankumar Puchakayala

A Hybrid Face Detection Approach for Real-time Deployment on Mobile Devices: Most face detection algorithms in the literature use either facial features or skin color for the classification process. Facial features are very effective for detecting full frontal faces but fail to show robustness for rotated or partially covered faces, and they also require high processing time. Skin color is a very fast and robust feature, but for different lighting conditions and different camera sensors the skin color model needs to be retrained. So a hybrid face detection approach that effectively combines facial features and skin color to achieve fast and robust detection is highly desirable.

Facial feature based approach: Most existing facial-feature-based face detection methods adopt the Viola and Jones algorithm, which uses a boosted cascade of simple features. The detection rate of this algorithm is high, but it drops noticeably for profile, rotated, tilted and partially covered faces. Although modifications have been introduced to address various poses, these increase its processing time considerably on a mobile device.

Facial feature based approach

Facial feature based approach: The first stage is to train a cascaded classifier, which is used for detecting faces during the detection stage. For detection, the integral image for the entire image frame is computed, and each subimage with different positions and sizes is tested against all trees/stages in the classifier. Viola and Jones proposed four different rectangular features within a subimage; after training, each tree does a comparison for one rectangular feature, so during each stage, each tree is applied to the subimage under testing.

Facial feature based approach: The authors introduced some modifications to the Viola-Jones face detector to reduce its computation time: data reduction (spatial subsampling, step size, scale size, minimum face size); search reduction (utilization of key frames and a narrowed detection area); and numerical reduction (fixed-point processing). By appropriately reducing the data and the amount of search, and by performing the computation in fixed point, a real-time throughput can be achieved.

Skin color based approach: The skin color cluster in the chrominance domain (Cb-Cr) is represented by a Gaussian mixture model (GMM). The face detection process starts by converting the captured image into a binary image via skin color extraction using the GMM, with 1s representing pixels of the image corresponding to skin color and 0s representing non-skin-color pixels. The binary image is then passed through a fast sub-block shape processing scheme which uses face size, aspect ratio and probability scoring to determine any face location in the image. To reduce the computational time, lookup tables are used to define the skin cluster area.
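One way the lookup table could work is sketched below (my own reconstruction, not the paper's code; the GMM parameters and the likelihood threshold are assumed to come from offline training):

import numpy as np

def make_skin_lut(means, covs, weights, thresh):
    # Precompute a 256x256 Cb-Cr table: lut[cb, cr] = 1 where the GMM
    # likelihood exceeds thresh (parameters assumed trained elsewhere).
    cb, cr = np.meshgrid(np.arange(256), np.arange(256), indexing='ij')
    pts = np.stack([cb.ravel(), cr.ravel()], axis=1).astype(float)
    p = np.zeros(len(pts))
    for m, C, w in zip(means, covs, weights):
        d = pts - m
        inv, det = np.linalg.inv(C), np.linalg.det(C)
        p += w * np.exp(-0.5 * np.sum(d @ inv * d, axis=1)) / (2 * np.pi * np.sqrt(det))
    return (p.reshape(256, 256) > thresh).astype(np.uint8)

def skin_mask(cb_img, cr_img, lut):
    # Binary image: 1 for skin-colored pixels, 0 otherwise, via table lookup
    return lut[cb_img, cr_img]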

Skin color based approach

Hybrid Approach: Skin-color-based face detection shows robust detection performance with a small processing time, but this approach depends highly on proper training of the skin color model, so changing lighting conditions are always a major challenge. The authors overcame this by using different models for different lighting conditions in [6] and [7].

Hybrid Approach: But when using a different camera sensor, this model needs to be retrained. So, rather than using pre-calculated skin color models, an online training method that calculates the skin color model in real time can be used: first, the facial-feature-based approach is used to detect the face area; then skin data is collected from that area for model training.

Hybrid Approach: To train a skin color model, first remove 10% of the area from the top, left and right sides of the boundary and consider the remaining part. Though this reduces the chance of collecting background and hair portions, some non-skin pixels from the eyes, mouth or facial hair may still be collected.

Hybrid Approach: To separate the skin pixels from the non-skin parts, a simple k-means clustering is used, minimizing an objective function D defined as D = Σ_{j=1}^{k} Σ_{i=1}^{n} ||x_i^(j) - c_j||^2, where ||x_i^(j) - c_j||^2 is a chosen distance measure between a data point x_i^(j) and the cluster center c_j. It is an indicator of the distance of the n data points from their respective cluster centers.
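A plain k-means sketch for this step (mine; clustering (Cb, Cr) pixel pairs into k=2 groups, with the larger cluster taken as skin — that last heuristic is an assumption, not stated in the paper):

import numpy as np

def kmeans(X, k=2, iters=20, seed=0):
    # Minimize D = sum_j sum_i ||x_i(j) - c_j||^2 by alternating
    # nearest-center assignment and center recomputation.
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - c[None]) ** 2).sum(-1), axis=1)
        c = np.array([X[labels == j].mean(0) if np.any(labels == j) else c[j]
                      for j in range(k)])   # keep old center if a cluster empties
    return labels, c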

Hybrid Approach

Hybrid Approach: (1) Leave the initial few frames as warm-up frames. (2) Run the feature-based face detection algorithm for 5 frames and take the median of the detections. (3) Use the detected face area for online calibration of the skin color model. (4) Start the skin-color-based algorithm. (5) The full image area is used again at the next key frame, or if the algorithm failed to find any face in the previous frame. (6) If the algorithm fails to find any face for 10 consecutive frames, the facial-feature-based algorithm is executed again. (A sketch of this control flow is given below.)
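The control flow as a small state machine (my own sketch; all four helper functions are hypothetical stand-ins for the components described above, and the warm-up and key-frame counts are assumptions, since the slide leaves them unspecified):

def detect_faces_features(frame): ...                   # hypothetical: Viola-Jones pass
def detect_faces_skin(frame, model, full_image): ...    # hypothetical: skin-color pass
def fit_skin_model(frame, box): ...                     # hypothetical: online GMM calibration
def median_box(boxes): ...                              # hypothetical: median of boxes

def hybrid_detector(frames, warmup=3, calib=5, max_misses=10, key_interval=30):
    boxes, skin_model, misses = [], None, 0
    for t, frame in enumerate(frames):
        if t < warmup:
            continue                          # warm-up frames: do nothing
        if skin_model is None:                # feature-based phase
            boxes.append(detect_faces_features(frame))
            if len(boxes) >= calib:           # median over 5 frames, then calibrate
                skin_model = fit_skin_model(frame, median_box(boxes[-calib:]))
        else:                                 # skin-color phase
            face = detect_faces_skin(frame, skin_model,
                                     full_image=(misses > 0 or t % key_interval == 0))
            misses = 0 if face else misses + 1
            if misses >= max_misses:          # fall back to feature-based detection
                skin_model, boxes = None, []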

Results: The skin-color-based approach takes 50 ms on average to process one frame. The worst transient time was recorded to be around 1 sec, but this happens just once, and after that the color-based algorithm is in operation.

Face and eye detection for person authentication in mobile devices: Detecting and recognizing rotated faces is not as crucial in mobile phones as in computers, since the user will be cooperating with the identification system on a mobile device. The challenge is instead that the environmental conditions the device experiences are highly variable; the images captured by such devices will likely have variable lighting and background conditions.

Skin color based face detection: To determine the skin-like regions, the skin locus method is used, which has performed well on images under widely varying conditions. The skin locus is the range of skin chromaticities under varying illumination/camera calibration conditions in the chrominance domain. The main properties of the skin locus model are its robustness against changing intensity and its reasonable tolerance of varying illumination chromaticity.

Skin color based face detection: Face detection on the mobile phone starts by extracting skin-like regions using the skin locus. By using morphological operations (erosions and dilations), the number of these regions can be reduced. For every candidate, we verify whether it corresponds to a facial region by performing a simple set of heuristics; for instance, we only keep candidates in which the ratio of the height to the width is within a certain range [α1, α2] and the size of the area is within another range [β1, β2].

Skin color based face detection: Among the advantages of using color are computational efficiency and robustness against some geometric changes when the scene is observed under a uniform illumination field. However, the main limitation of color lies in its sensitivity to illumination changes, and the images captured by a mobile phone generally have variable lighting and background conditions. Hence a static skin-color-model-based approach to face detection on mobile phones is interesting in terms of speed but may not be very satisfactory in terms of detection rates.

Haar-like/AdaBoost based Face Detection: discussed previously. Comparison: [comparison table not reproduced in the transcription]

Face Authentication: Eye detection: in all face authentication systems, eye detection is needed in order to align the target face to the model stored in the database before matching. LBP features are more discriminative and thus more suitable for face authentication and recognition; they also have the advantage of low computational cost.

Local Binary Patterns

Face Authentication: For face description, 80x80 pixels is considered the basic facial resolution, and the image is divided into 9 blocks of 27x27 pixels each. From each block, an LBP histogram (59 bins) is extracted; the histograms are then concatenated to obtain the facial representation (59 x 9 = 531 bins). The face identity is then verified by computing the histogram intersection distance D(S, M) between the target LBP histogram S and the model LBP histogram M stored in the mobile phone. If D(S, M) is below a certain empirically determined threshold θ_k, the face is rejected; otherwise, a good match is reported.
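A sketch of the descriptor and the match score (my own code; it uses a plain 256-bin LBP rather than the 59-bin uniform-pattern mapping the slides use, so the dimensions differ):

import numpy as np

def lbp_image(gray):
    # 3x3 LBP: threshold the 8 neighbors against the center, read as 8 bits
    g = gray.astype(int)
    c = g[1:-1, 1:-1]
    shifts = [(-1,-1), (-1,0), (-1,1), (0,1), (1,1), (1,0), (1,-1), (0,-1)]
    code = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(shifts):
        nb = g[1+dy:g.shape[0]-1+dy, 1+dx:g.shape[1]-1+dx]
        code |= (nb >= c).astype(int) << bit
    return code

def face_descriptor(face, blocks=3):
    # Concatenated per-block LBP histograms (slides: 9 blocks, 59 bins -> 531)
    code = lbp_image(face)
    h, w = code.shape
    hists = [np.bincount(code[i*h//blocks:(i+1)*h//blocks,
                              j*w//blocks:(j+1)*w//blocks].ravel(), minlength=256)
             for i in range(blocks) for j in range(blocks)]
    return np.concatenate(hists)

def hist_intersection(S, M):
    # D(S, M): sum of element-wise minima; larger means a better match,
    # which is why a face is rejected when D falls below the threshold
    return np.minimum(S, M).sum()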

Face Authentication

Face Authentication

Results: The authentication system runs at about 2 frames per second on a Nokia N90 mobile phone with an ARM9 processor at 220 MHz. Average authentication rates of 82% for small faces (40x40 pixels) and 96% for faces of 80x80 pixels are obtained.

Questions?

Pseudo-code for cascade trainer:
User selects values for f, the maximum acceptable false positive rate per layer, and d, the minimum acceptable detection rate per layer.
User selects the target overall false positive rate F_target.
P = set of positive examples; N = set of negative examples
F_0 = 1.0; D_0 = 1.0; i = 0
while F_i > F_target:
    i = i + 1
    n_i = 0; F_i = F_{i-1}
    while F_i > f x F_{i-1}:
        n_i = n_i + 1
        Use P and N to train a classifier with n_i features using AdaBoost
        Evaluate the current cascaded classifier on a validation set to determine F_i and D_i
        Decrease the threshold for the i-th classifier until the current cascaded classifier has a detection rate of at least d x D_{i-1} (this also affects F_i)
    N = {}
    If F_i > F_target, then evaluate the current cascaded detector on the set of non-face images and put any false detections into the set N.
Slide taken from a presentation by Gyozo Gidofalvi, University of California, San Diego