Image Analysis 2 Face detection and recognition Window-based face detection: The Viola-Jones algorithm Christophoros Nikou cnikou@cs.uoi.gr Images taken from: D. Forsyth and J. Ponce. Computer Vision: A Modern Approach, Prentice Hall, 2003. Computer Vision course by Kristen Grauman, University of Texas at Austin. Computer Vision course by Svetlana Lazebnik, University of North Carolina at Chapel Hill. University of Ioannina - Department of Computer Science 3 Face detection and recognition 4 Consumer application: iphoto 2009 Detection Recognition Sally http://www.apple.com/ilife/iphoto/ 5 Consumer application: iphoto 2009 It can be trained to recognize pets! 6 Consumer application: iphoto 2009 iphoto decides that this is a face http://www.maclife.com/article/news/iphotos_faces_recognizes_cats 1
7 Outline Face recognition Eigenfaces Face detection The Viola and Jones algorithm. M. Turk and A. Pentland. "Eigenfaces for recognition". Journal of Cognitive Neuroscience, 3(1):71 86, 1991. Also in CVPR 1991. P. Viola and M. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2), 2004. Also in CVPR 2001. 8 Recognition by finding patterns We have seen very simple template matching methods (using filters). Some objects behave like quite simple templates Frontal face images. Strategy: Build/train model (supervised classifier). Find image windows. Correct lighting. Pass them to a statistical test (a classifier) that accepts faces and rejects non-faces with respect to the model. 9 Basic ideas in classifiers Loss: some errors may be more expensive than others e.g. a fatal disease that is easily cured by a cheap medicine with no side-effects false positives in diagnosis are better than false negatives We discuss two class classification: L(1 2) is the loss caused by calling 1 a 2. Total risk of strategy s is the total expected loss: Rs () = Pr{1 2 }(1 sl 2) + Pr{2 1 }(2 sl 1) The optimal classifier minimizes the total risk. 10 Basic ideas in classifiers (cont.) Generally, we should classify as 1 if the expected loss of classifying as 2 is larger than calling a 2 a 1: p(1/ x) L(1 2) > p(2 / x) L(2 1) Equivalently, we should classify as 2 if p(1/ x) L(1 2) < p(2 / x) L(2 1) Using Bayes rule: px ( /1) p(1) L(1 2) > px ( /2) p(2) L(2 1) px ( /1) p(1) L(1 2) < px ( / 2) p(2) L(2 1) 11 Basic ideas in classifiers (cont.) Crucial notion: Decision boundary points where the average loss is the same for either case. 12 Basic ideas in classifiers (cont.) Some loss may be inevitable: the total risk (shaded area) is called the Bayes risk. L(1 2) = L(2 1) = 1, L(1 1) = L(2 2) = 0 L(1 2) = L(2 1) = 1, L(1 1) = L(2 2) = 0 2
13 Basic ideas in classifiers (cont.) Example: Gaussian class densities, p- dimensional measurements with common covariance Σ and different means μ 1, μ 2 and class priors π k : p 2 1 1 1 2 T 1 pxk ( / ) = Σ exp ( x μk) Σ ( x μk) 2π 2 14 Basic ideas in classifiers (cont.) Ignoring a common factor in the posteriors: p 2 1 1 1 k k k 2 T 1 pk ( / x) π Σ exp ( x μ ) Σ ( x μ ) 2π 2 The classifier boils down to choose the class minimizing T 1 ( x ) ( x ) μ Σ μ 2log π k k k The common covariance simplifies the resulting expression. 15 Basic ideas in classifiers (cont.) Cross validation to estimate the total risk. Split the data set and average over all possible splits. Leave-one-out, 10-fold, Bootstrapping to estimate better the decision boundary. Retrain the classifier with the false positives (FP) and false negatives (FN). They are the most significant in the configuration of the decision boundary. 16 Histogram based classifiers Use a histogram to represent the classconditional densities p(x 1), p(x 2), Advantage: estimates become quite good with enough data. Disadvantage: Histogram becomes big with high dimension. we may assume feature independence. 17 Example: finding skin pixels Skin has a very small range of (intensity independent) colours and little texture. Get class conditional densities p(x skin) and p(x not skin) from examples. Get priors p(skin) (ki) and p(not ( skin) from examples count the number of face/non face pixels. The classifier becomes: 18 Example: finding skin pixels (cont.) We can represent a class-conditional density using a histogram (a non-parametric distribution) P(x skin) P(x not skin) Feature x = Hue Feature x = Hue Kristen Grauman 3
19 Example: finding skin pixels (cont.) 20 Example: finding skin pixels (cont.) We can represent a class-conditional density using a histogram (a non-parametric distribution) P(x skin) posterior P ( skin x) = likelihood prior P( x skin) P( skin) P( x) Now we get a new image, and want to label each pixel as skin or non-skin. Feature x = Hue P(x not skin) P( skin x) α P( x skin) P( skin) Where does the prior come from? Feature x = Hue Kristen Grauman Why use a prior? 21 Example: finding skin pixels (cont.) For every pixel in a new image, we can estimate the probability that it is generated by skin. 22 Example: finding skin pixels (cont.) Brighter pixels higher probability of being skin Classify pixels based on these probabilities Using skin color-based face detection and pose estimation as a video-based interface. Open CV, Gary Bradski, 1998. Kristen Grauman Kristen Grauman 23 Example: finding skin pixels (cont.) Parameter θ encapsulates the relative loss. There is a family of classifiers for each value of θ. Each one has different FP and FN rates described by the Receiver Operating Characteristic (ROC) curve. 24 Window-based generic object detection framework Build/train object model Choose a representation. Learn or fit parameters of model / classifier. Generate candidates in new image. Score the candidates. 4
25 building a model 26 building a model (cont.) Pixel-based representations sensitive to small shifts. Simple holistic descriptions of image content grayscale / color histogram vector of pixel intensities Color or grayscale-based appearance description can be sensitive to illumination and intra-class appearance variation. 27 building a model (cont.) Consider edges, contours, and (oriented) intensity gradients. 28 building a model (cont.) Consider edges, contours, and (oriented) intensity gradients. Summarize local distribution of gradients with histogram Locally orderless: offers invariance to small shifts and rotations Contrast-normalization: try to correct for variable illumination 29 building a model (cont.) Given the representation, train a binary classifier. 30 Nearest neighbor Discriminative classifiers Neural networks Car/non-car Classifier 10 6 examples Shakhnarovich, Viola, Darrell 2003 Berg, Berg, Malik 2005... Support Vector Machines Boosting LeCun, Bottou, Bengio, Haffner 1998 Rowley, Baluja, Kanade 1998 Conditional Random Fields No, Yes, not car. a car. Guyon, Vapnik Heisele, Serre, Poggio, 2001, Viola, Jones 2001, Torralba et al. 2004, Opelt et al. 2006, McCallum, Freitag, Pereira 2000; Kumar, Hebert 2003 5
31 Window-based generic object detection framework 32 generating and scoring candidates Build/train object model Choose a representation Learn or fit parameters of model / classifier Generate candidates in new image Car/non-car Classifier Score the candidates 33 Training: 1. Obtain training data 2. Define features 3. Define classifier Given new image: 1. Slide window 2. Score by classifier Summary Training examples 34 Nearest neighbor 10 6 examples Shakhnarovich, Viola, Darrell 2003 Berg, Berg, Malik 2005... Support Vector Machines Discriminative classifiers Boosting Neural networks LeCun, Bottou, Bengio, Haffner 1998 Rowley, Baluja, Kanade 1998 Conditional Random Fields Feature extraction Car/non-car Classifier Guyon, Vapnik Heisele, Serre, Poggio, 2001, Viola, Jones 2001, Torralba et al. 2004, Opelt et al. 2006, McCallum, Freitag, Pereira 2000; Kumar, Hebert 2003 35 Boosting Boosting is a classification scheme that works by combining weak learners into a more accurate ensemble classifier A weak learner need only do better than chance. Training consists of multiple boosting rounds During each boosting round, we select a weak learner that does well on examples that were hard for the previous weak learners. Hardness is captured by weights attached to training examples. 36 Boosting intuition Weak Classifier 1 Y. Freund and R. Schapire, A short introduction to boosting, Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September, 1999. Slide credit: Paul Viola 6
37 Boosting illustration 38 Boosting illustration Weights Increased Weak Classifier 2 39 Boosting illustration 40 Boosting illustration Weights Increased Weak Classifier 3 41 Boosting illustration 42 Boosting: training Final classifier is a combination of weak classifiers Initially, weight each training example equally. In each boosting round: Find the weak learner that achieves the lowest weighted training error. Raise weights of training examples misclassified by current weak learner. Compute final classifier as linear combination of all weak learners (weight of each learner is directly proportional to its accuracy). Exact formulas for re-weighting and combining weak learners depend on the particular boosting scheme (e.g. AdaBoost). Slide credit: Lana Lazebnik 7
43 Boosting pros and cons Advantages of boosting Integrates classification with feature selection. Complexity of training is linear instead of quadratic in the number of training examples. Flexibility in the choice of weak learners, boosting scheme. Testing is fast. Easy to implement. Disadvantages Needs many training examples. Often doesn t work as well as SVM (especially for many-class problems). 44 Challenges of face detection Sliding window detector must evaluate tens of thousands of location/scale combinations. PCA is slow for real time face detection. Faces in images are rare: 0 10 per image For computational ti efficiency, i we should try to spend as little time as possible on the non-face windows. A megapixel image has ~10 6 pixels and a comparable number of candidate face locations. To avoid having a false positive in every image, the false positive rate has to be less than 10-6. 45 The Viola/Jones Face Detector A seminal approach to real-time object detection. Training is slow but detection is very fast. Key ideas Represent local texture with efficiently computed rectangular features (integral images) within the window of interest. Select discriminative features as weak classifiers. Boosting for classification. Attentional cascade for fast rejection of non-face windows. P. Viola and M. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2), 2004. Also in CVPR 2001. 46 Image features Rectangular filters Output = (pixels in white area) (pixels in black area) 47 Image features (cont.) 48 Example For real problems, the result depends strongly on the features used. Do not employ pixel values. Use a very large set of simple functions sensitive to edges and other critical features multiple scales. 8
49 Fast computation with integral images The integral image computes a value at each pixel (x,y) that is the sum of the pixel values above and to the left of (x,y) inclusive. This can be quickly computed in one pass. (x,y) 50 Computing the integral image 51 Computing the integral image 52 Computing sum within a rectangle ii(x, y-1) s(x-1, 1 y) i(x, y) Cumulative row sum: s(x, y) = s(x 1, y) + i(x, y) Integral image: ii(x, y) = ii(x, y 1) + s(x, y) MATLAB: ii = cumsum(cumsum(double(i)), 2); Let A,B,C,D be the values of the integral image at the corners of a rectangle. The sum of original image values within the rectangle can be computed as: sum = A B C + D Only 3 additions are required for any size of rectangle! D C B A 53 Example 54 Feature selection Integral Image Feature output is different between adjacent regions. The integral image significantly reduces the number of computations. However, for a base resolution of 24x24 detection region, the number of possible rectangle features is ~160,000! 000! (scale, size, type, position, ) It is impractical to evaluate the entire feature set for training. Can we create a good classifier using just a small subset of all possible features? How to select such a subset? Use AdaBoost both to select the informative features and to form the classifier. 9
55 Boosting for face detection We seek to select the single rectangle feature and threshold that best separates positive (faces) and negative (non-faces) training examples, in terms of weighted error. Resulting weak classifier: 56 Start with uniform weights on training examples For T rounds AdaBoost Algorithm Evaluate weighted error for each feature, pick best. {x 1, x n } Outputs of a possible rectangle feature on faces and non-faces. For next round, reweight the examples according to errors, choose another filter/threshold combo. Kristen Grauman Re-weight the examples: Incorrectly classified -> more weight Correctly classified -> less weight Final classifier is combination of the weak ones, weighted according to error they had. Freund & Schapire 1995 57 Boosting for face detection (cont.) The first two features selected by boosting. 58 Boosting for face detection A 200-feature classifier can yield 95% detection rate and a false positive rate of 1 over 14084. Not good enough! This feature combination can yield 100% detection rate and 50% false positive rate. 59 Attentional cascade Even if the filters are fast to compute, each new image has a lot of possible windows to search. How to make the detection more efficient? There is no need to compute the linear combination of all of the features in a window that is clear to be negative from the first features. Define a computational risk hierarchy. The goal of the training process is different instead of minimizing classification errors minimize the false positive rate. 60 Attentional cascade (cont.) Form a cascade with low false negative rates early on. Apply less accurate but faster classifiers first to immediately discard windows that clearly appear to be negative 10
61 Viola-Jones detector: summary 62 Output of face detector on test images Train cascade of classifiers with AdaBoost Faces New image Non-faces Selected features, thresholds, and weights Train with 5K positives, 350M negatives (10k non face images). Real-time detector using 38 layer cascade. Implemented in OpenCV. 63 Other detection tasks 64 Profile Detection Facial Feature Localization Profile Detection Male vs. female 65 Profile Features 66 Application: Naming characters in TV video M. Everingham, J. Sivic, and A. Zisserman. "Hello! My name is... Buffy" - Automatic naming of characters in TV video. Proceedings of the 17th British Machine Vision Conference (BMVC 2006). 11
67 Application of sliding windows: pedestrian detection 68 Application of sliding windows: pedestrian detection (cont.) SVM on Haar wavelets. Space-time rectangle features. C. Papageorgiou and T. Poggio. A Trainable System for Object Detection. International Journal of Cmputer Vision, Vol. 38, No 1, pp. 15-22, 2000. P. Viola, M. Jones and D. Snow. Detecting pedestrians using patterns of motion and appearance. International Journal of Computer Vision, Vol. 63, No 2, pp. 153-161, 2005. 69 Application of sliding windows: pedestrian detection (cont.) 70 Summary: Viola-Jones detector Rectangle features. Integral images for fast computation. Boosting for feature selection. Attentional cascade for fast rejection of negative windows. SVM on histogram of oriented gradients. N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2005). 12