Viola Jones Simplified. By Eric Gregori


Introduction

Viola-Jones refers to a paper written by Paul Viola and Michael Jones describing a method of fast, machine-vision-based object detection. This method revolutionized the field of face detection: using it, face detection could be implemented on embedded devices and could detect faces within a practical amount of time. In the paper they describe an algorithm that uses a modified version of the AdaBoost machine learning algorithm to train a cascade of weak classifiers. Haar features (along with a unique concept, the integral image) are used as the weak classifiers. The weak classifiers are combined using the AdaBoost algorithm to create a strong classifier, and the strong classifiers are combined to create a cascade. The cascade provides the mechanism to achieve high classification accuracy at a low CPU cycle cost.

"This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the Integral Image which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers [6]. The third contribution is a method for combining increasingly more complex classifiers in a cascade which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions." - Excerpt from Rapid Object Detection using a Boosted Cascade of Simple Features

Haar Features

Haar features are one of the mechanisms used by the Viola-Jones algorithm. Images are made up of many pixels; a 250x250 image contains 62,500 pixels. Processing images on a pixel-by-pixel basis is a very CPU-cycle-intensive process. In addition, an individual pixel contains no information about the pixels around it: pixel data is absolute, as opposed to relative. A side effect of the absolute nature of pixel data is sensitivity to lighting. Since pixel values are directly affected by the lighting of the image, large variances can occur in the pixel data due only to changes in lighting. Haar features solve both problems with pixel data (the CPU cycles required and the lack of relativity). Haar features do not encode individual pixel information; they encode relative information over multiple pixels. Haar features are based on Haar wavelets as proposed by: S. Mallat, "A theory for multiresolution signal decomposition: The wavelet representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674-693, July 1989.
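The integral image mentioned in the excerpt above can be sketched in a few lines. The following is an illustrative Python sketch (not code from the paper): each entry holds the sum of all pixels above and to the left of it, inclusive, so the sum of any rectangle afterwards needs only four table lookups.

```python
def integral_image(img):
    """Build the integral image ii, where ii[y][x] is the sum of all
    pixels img[0..y][0..x] (computed in a single pass using recurrences)."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the w-by-h rectangle with top-left corner (x, y),
    using four integral-image lookups."""
    a = ii[y - 1][x - 1] if x > 0 and y > 0 else 0
    b = ii[y - 1][x + w - 1] if y > 0 else 0
    c = ii[y + h - 1][x - 1] if x > 0 else 0
    d = ii[y + h - 1][x + w - 1]
    return d - b - c + a
```

With this table precomputed once per image, every rectangular area sum used by a Haar feature costs a constant number of operations regardless of the rectangle's size.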

Haar features were originally used in the paper: A General Framework for Object Detection, Constantine P. Papageorgiou, Michael Oren, and Tomaso Poggio, Center for Biological and Computational Learning, Artificial Intelligence Laboratory, MIT, Cambridge, MA 02139, {cpapa, oren, tp}@ai.mit.edu.

A Haar feature is used to encode both the relative data between pixels and the position of that data. A Haar feature consists of multiple adjacent areas that are subtracted from each other. Viola-Jones suggested Haar features containing 2, 3, and 4 areas.

Haar Features (+/- Polarities)

    value = (sum of pixels under the green area) - (sum of pixels under the red area)

The value of a Haar feature is calculated by taking the sum of the pixels under the green area and subtracting the sum of the pixels under the red area. By encoding the difference between two adjoining areas in an image, the Haar feature is effectively detecting edges. The further the value is from zero, the harder or more distinct the edge. A value of zero indicates the two areas are equal, meaning the pixels under the areas have equal average intensities (the lack of an edge). It should be noted that although this process can be done on color images, for the Viola-Jones algorithm it is done on grayscale images. In most

cases individual pixel values are from 0 to 255, with 0 being black and 255 being white.

    Avg under green    Avg under red    Haar value
    125                250              -125    (harder edge)
    125                225              -100
    125                200              -75
    125                175              -50
    125                150              -25
    125                125              0       (no edge)
    125                100              25
    125                75               50
    125                50               75
    125                25               100
    125                0                125     (harder edge)

    Avg under green    Avg under red    Haar value
    250                250              0       (no edge)
    250                225              25
    250                200              50
    250                175              75
    250                150              100
    250                125              125
    250                100              150
    250                75               175
    250                50               200
    250                25               225
    250                0                250     (harder edge)

The values calculated using Haar features require one additional step before being used for object detection: the values must be converted to true or false results. This is done using thresholding.
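The green-minus-red computation above can be sketched directly. This is an illustrative Python sketch; the choice of splitting a patch into left (green) and right (red) halves is an assumption for the example, not a layout from the paper.

```python
def haar_2area_value(patch):
    """Two-area Haar feature value over a grayscale patch (list of rows):
    sum of pixels under the green (left half) area minus the sum under
    the red (right half) area."""
    h, w = len(patch), len(patch[0])
    half = w // 2
    green = sum(patch[y][x] for y in range(h) for x in range(half))
    red = sum(patch[y][x] for y in range(h) for x in range(half, w))
    return green - red
```

A uniform patch yields 0 (no edge), while a patch that is bright on one side and dark on the other yields a value far from zero (a hard edge), matching the tables above.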

Thresholding

Thresholding is the process of converting an analog value into a true or false result. In this case, the analog value is the output of the Haar feature. To convert the analog value into a true/false statement, the analog value is compared to a threshold. If the value is >= the threshold, the statement is true; if not, it is false.

    hj(x) = 1 if pj * fj(x) >= pj * θj, otherwise 0

Where:
    hj(x) - weak classifier (basically one Haar feature)
    pj - parity
    fj(x) - Haar feature value
    θj - threshold

As the equation above illustrates, the output of the weak classifier is either true or false. The threshold determines when the function transitions the output state. The parity determines the direction of the inequality sign; this will be demonstrated later. The threshold and parity must be set correctly to get the full benefit of the feature. Setting the threshold and parity is not clearly defined in the Viola-Jones paper: "For each feature, the weak learner determines the optimal threshold classification function, such that the minimum number of examples are misclassified" (Viola-Jones). The paper does not elaborate on how this is done. Many methods can be proposed to calculate the threshold: minimum, average, standard deviation, and average deviation. The following diagram illustrates the results of those methods. The data is based on one feature type, in a single known position. The feature position and type were based on data from the Viola-Jones paper.
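The weak classifier equation above can be written as a one-line function. This is an illustrative sketch: the parity simply flips the direction of the inequality, so an inverted feature can use the same test.

```python
def weak_classifier(feature_value, threshold, parity=1):
    """hj(x): returns 1 (true) if parity * value >= parity * threshold,
    otherwise 0 (false). parity is +1 or -1."""
    return 1 if parity * feature_value >= parity * threshold else 0
```

With parity = +1 the classifier fires for values at or above the threshold; with parity = -1 it fires for values at or below it.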

Example

As the text and illustrations above show, Viola-Jones found that a 3-area Haar feature across the bridge of the nose provided a better-than-average probability of detecting a face. The eyes are darker than the bridge of the nose.

    value = 2 * (sum under center area) - (sum under left area) - (sum under right area)

The value increases if the eye areas get darker, or the bridge of the nose gets brighter. The feature shown above was placed over the eyes of 313 face images. The values were calculated for each image and graphed.

Figure 1 - Haar feature values over 313 face and 313 non-face images

Each point in the blue curve above represents a Haar feature value (like the one above) when the feature is placed over the eyes and nose of 313 different faces. The green line represents the average of those values. The average value is positive and well above zero. This indicates that the Haar feature is over a portion of the image that matches the feature's characteristics (in this case, light in the middle and dark on the sides). The red line indicates the exact same feature placed in the upper-left corner of the image, as shown above. This represents a non-face image, or random noise. A feature is weighted on how well it distinguishes between random noise (non-faces) and a portion of the face (eyes/nose in this case). The center of the 3-area Haar feature is multiplied by 2, to balance the two negative sides, so the formula for the 3-area Haar is slightly different from the other Haar types:

    value = 2 * (sum under center area) - (sum under left area) - (sum under right area)

A 3-area Haar feature over a random-noise image results in a value of about 0. As you can see from the graph above, the average over the 313 non-face samples is about 0.
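The doubled center term can be sketched as follows. This is an illustrative Python sketch taking the three area sums as inputs; the weighting is exactly the 2x-center balancing described above.

```python
def haar_3area_value(left_sum, center_sum, right_sum):
    """3-area Haar feature value: the center area is weighted by 2 to
    balance the two negative side areas, so uniform (random) input
    averages to roughly zero."""
    return 2 * center_sum - left_sum - right_sum
```

Three equal areas (no structure) give 0, while a bright center flanked by dark sides gives a large positive value.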

Parity

The parity variable is used to adjust the value so that it is above 0. An inverted feature would produce a value of the same magnitude with the opposite sign. The parity is used to convert the value of an inverted feature into a positive number, bringing it above the zero line. The implementation described in this paper used a slightly different approach. Instead of using a parity variable, two sets of Haar features were used: inverted and non-inverted. During the threshold learning process, a Haar feature was discarded if its average value over the 313 images was negative. In summary, both an inverted and a non-inverted Haar feature were placed at the same location on the image, and the Haar feature with a negative average value over all images was discarded. This was simpler from an implementation point of view.
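The dual-polarity selection just described can be sketched in a few lines. This is an illustrative Python sketch of the idea, not the author's code: since the inverted feature's value is the negation of the non-inverted one, exactly one of the two polarities has a non-negative average, and that is the one kept.

```python
def keep_polarity(values):
    """values: the non-inverted feature's outputs over all training face
    images. Returns which polarity survives the discard rule: the one
    whose average value over the images is non-negative."""
    avg = sum(values) / len(values)
    # the inverted feature would produce -v for each v, so its average
    # is just -avg; discard whichever polarity averages below zero
    return "non-inverted" if avg >= 0 else "inverted"
```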

Haar feature thresholding

Figure 2 - Various statistical values from the data in Figure 1

The graph above shows various methods of calculating a threshold. All methods calculate a threshold based on measuring the Haar feature values at the same image position in all 313 images. The goal is to calculate a single threshold, for all 313 images, that maximizes the number of faces detected and minimizes the number of non-face detections (false positives).

Mean Threshold

The mean threshold is the average of the feature value over all images containing a face. It is computed by summing the Haar feature values obtained at the same position over the face in all 313 images, and dividing by the number of images. This is represented by the green line (Favg) in the graph above.

    Threshold method #1: θ = (1/N) * Σ fi(x)

Where:
    N = number of images
    i = a single image
    fi(x) = the same Haar feature at the same position in image i

As stated above, the parity determines the direction of the inequality sign. In this example the parity is set such that the result is 1 if the Haar feature value is >= the threshold. With the parity set accordingly, everything on or above the green line (Favg) in the graph above will register a true output from the weak classifier equation: an image whose Haar value is >= the green line (Favg) is classified as a face. If we apply the same threshold to the non-face images (noise), we can get an idea of how well the weak classifier works.
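The mean-threshold rule and the resulting detection rate can be sketched directly. This is an illustrative Python sketch over a list of per-image feature values.

```python
def mean_threshold(face_values):
    """Threshold method #1: the average of the feature's value at the
    same position across all N face images."""
    return sum(face_values) / len(face_values)

def detection_rate(values, threshold):
    """Fraction of images whose feature value is >= the threshold
    (i.e., classified as a face by the weak classifier)."""
    return sum(1 for v in values if v >= threshold) / len(values)
```

Since the threshold is the average of the face values, roughly half the face images land above it, which is exactly the ~51% detection rate reported below.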

The Haar feature is located at the exact same position within each image. Pixels under the red boxes are subtracted from pixels in the green box. The graph below shows the results from 313 images with the same feature in the same position. When the Haar feature is thresholded, it produces a value of 1 or 0. At this point it becomes a weak classifier.

Figure 3 - Haar feature over 313 face images

The green shaded box represents values over Favg. This Haar feature in this position will categorize values greater than 17516 as faces; 17516 is the average value for this Haar feature in this position across all 313 images. Using Favg for the threshold only detects 160 out of the 313 faces (about 51%). This makes sense, since the threshold is the average value over all 313 images. If the same Haar feature is moved to a position in the image where it is known there is no face, the following data is generated.

The Haar feature is located at the exact same position within each image. Pixels under the red boxes are subtracted from pixels in the green box. The graph below shows the results from 313 images with the same feature in the same position. Notice there are many peaks over the threshold (green line). These peaks represent false positives: background that the weak classifier mistakenly classifies as a face.

Figure 4 - Haar feature over 313 non-face images

The above graph shows the results of placing the Haar feature over the same portion of background in each image. Since the lfw images use random backgrounds, this results in generally random data being measured by the feature in this position. This is backed up by the average being close to 0. Notice some background image data is incorrectly classified as a face. This is determined by the number of points greater than or equal to the green line (Favg for all images, Haar feature over face). There were false positives in 11 of the 313 images.

Mean results:
              Face       Non-face
    Detected  160/313    11/313
    Rate      51%        3.5%

Faces were correctly identified in 160 images, and incorrectly in 11 images. This was only testing one background position per image.

Problem with using the mean threshold

As expected, using the mean value for the threshold resulted in about half of the face images being classified correctly as faces. This should not be a surprise. The number of false positives was low (11/313), but so was the number of faces classified correctly (160/313). The data derived from using the mean value for the threshold suggests that the mean value may be more appropriate as a ceiling for the threshold. On the other side of the spectrum, the floor for the threshold would be the minimum of the values across the images. This would guarantee that all the training images were correctly classified as faces; the tradeoff is a high number of false positives. As mentioned earlier in this paper, this implementation uses two separate Haar features of opposite polarities. This allows the test to always be the same (value >= threshold). To achieve this goal, a Haar feature that primarily creates negative values is disposed of (its opposite-polarity counterpart, which creates primarily positive results, is kept). As a result, minimum values that are negative are ignored; 0 is the lowest value the threshold can be.

Figure 5 - Threshold = Min (0), Haar feature on faces

Notice that many more faces are detected using the minimum (or 0) as the threshold. The specific Haar feature, at the specific position over the faces described by the above graph, yields 302/313 faces detected correctly. Classifiers are ranked not only by the number of faces correctly classified, but also by how well they filter noise by NOT classifying noise as faces.

Figure 6 - Threshold = Min (0), Haar feature on background (noise)

Using the minimum (from the blue faces graph) as a threshold, the above graph illustrates values from the same Haar feature over a portion of the background (representing noise). There are MANY false positives in the above graph when the threshold is 0; the exact number is 171.

Min results:
              Face       Non-face
    Detected  301/313    171/313
    Rate      96%        55%

Although the minimum method detects more faces than the mean threshold, it also creates significantly more false positives. If the mean threshold represents the ceiling, the minimum threshold represents the floor.

The ideal threshold is between the min and the average

Figure 7 - Positives minus false positives (average = 17102)

The ideal threshold is windowed by the minimum at the bottom and the average at the top. In the graph above, the blue line represents the number of faces correctly detected. The red line indicates the number of faces incorrectly identified in random non-face images (noise); that is, the red line indicates false positives. The ideal threshold would detect 100% of the faces with no false positives. The green line represents the difference between the blue and red lines (positives - false positives). Where the green line peaks provides the best face detection with the least number of false positives. This is the closest to ideal we can get. The blue arrow points to the peak of the green line; this is our desired threshold. To find this peak, a simple max operation was applied to the difference between the blue and red curves.
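The peak-difference search described above can be sketched as a sweep over candidate thresholds between the floor and the ceiling. This is an illustrative Python sketch; the number of sweep steps is an assumption for the example.

```python
def best_threshold(face_values, nonface_values, steps=100):
    """Sweep candidate thresholds from the minimum face value (floored
    at 0, per the dual-polarity rule) up to the mean face value, and
    return the threshold maximizing (positives - false positives)."""
    lo = max(0, min(face_values))              # floor: min (negatives ignored)
    hi = sum(face_values) / len(face_values)   # ceiling: mean (Favg)
    best_t, best_diff = lo, None
    for i in range(steps + 1):
        t = lo + (hi - lo) * i / steps
        positives = sum(1 for v in face_values if v >= t)
        false_pos = sum(1 for v in nonface_values if v >= t)
        if best_diff is None or positives - false_pos > best_diff:
            best_diff, best_t = positives - false_pos, t
    return best_t
```

This is the "simple max operation applied to the difference between the blue and red curves" in code form.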

    Threshold = the candidate threshold that maximizes (positives - false positives)

The peak-difference threshold yields:

              Face       Non-face
    Detected  460/500    64/500
    Rate      92%        13%

As the results indicate, the threshold does not provide a 100% detection rate, but it does provide a low false-positive rate. This represents the best this one feature can do at this size, polarity, and location.

Using Haar Features

After the threshold has been determined, the Haar feature becomes a weak classifier, and provides a Boolean result when placed over any portion of the image. The pixels within the area of the Haar feature either sum to be greater than or equal to the threshold, or less than the threshold.

    hj(x) = 1 if fj(x) >= θj, otherwise 0

Where:
    hj(x) is a single weak classifier of a set size, location, polarity, and type; it is equal to 1 when the feature fires.
    fj(x) is a Haar feature of a set size, location, polarity, and type.
    θj is the threshold calculated using the method described above.

With 5 weak classifier types and 2 possible polarities, there are a total of 10 possible weak classifier types. Each feature can have a size that varies from 12 to 28 pixels, in steps of 4 pixels (for the implementation used in this paper). 5 sizes * 10 types yields 50 variations per possible position within the detector. The detector size for this implementation was chosen based on the size of the faces in the lfw database: the detector is 96 x 96 pixels. The number of possible positions for a weak classifier within the detector varies depending on the size and type of classifier. Using a step size of 4 pixels, this implementation creates over fourteen thousand weak classifier combinations of type, size, and location.

//---------------------------------------------------------------------------
//
// 1) Create all possible weak classifiers within the detector
//
//---------------------------------------------------------------------------
haar_features^ haar;
feature_array->createdhaars = 0;
for( int haar_feature = 0; haar_feature < 10; haar_feature++ )        // 10 Haar types
{
    for( int haar_size = 12; haar_size < DETECTOR_SIZE; haar_size += 4 )
    {
        for( int x = 0; x < DETECTOR_SIZE*2; x += 4 )                 // DETECTOR_SIZE = 48
        {
            for( int y = 0; y < DETECTOR_SIZE*2; y += 4 )
            {
                haar = CreateHaar( haar_feature, haar_size );
                if( (x + haar->full_width)  >= (DETECTOR_SIZE*2) ) continue;
                if( (y + haar->full_height) >= (DETECTOR_SIZE*2) ) continue;
                feature_array->haar_array->add( haar );
                feature_array->createdhaars++;
            } // for y
        } // for x
    } // for haar_size
} // for haar_feature

The Detector

The location parameter for a weak classifier is relative to the location of the detector. The detector is a box that is scanned across the entire image. The detector consists of many weak classifiers. The combination of weak classifiers creates a strong classifier (the detector). For this implementation, the detector starts with all possible weak classifiers in all possible locations (14,454 weak classifiers in a 96 x 96 pixel

box). The weak classifiers are pruned as the implementation learns which weak classifiers combine to create the best strong classifier. The following graphs show a single weak classifier of size 16x16. The graphs show the number of face images the classifier correctly classified, and the number of background images the classifier incorrectly classified as a face (false positives). There are 500 face images and 500 non-face images. The peak value is the number of positive matches minus the number of false positives, so a classifier that returns true for all face and non-face images will yield a peak of 0 (500 - 500) and a positive count of 500. The positions are all relative to the upper-left corner of the detector box (blue box in the picture). The classifier is a 3-area type in a horizontal orientation. Each area is 16 pixels by 16 pixels; the whole classifier is 48 pixels wide by 16 pixels high.

    if( (2*center - left - right) >= threshold )    // -1, 2, -1 polarity
    if( (left - 2*center + right) >= threshold )    // 1, -2, 1 polarity

Each weak classifier is put in every location within the detector. This implementation uses a +4 pixel increment in both the X and Y directions when creating the Haars. The result is over 14 thousand Haars of various types, sizes, and locations within the 96x96-pixel detector. Obviously this is considerably more than needed, so the next step in the algorithm is to minimize the number of required weak classifiers to create a detector with a specific positive to false-positive ratio.

[Bar graph - weak classifiers of size 16 at all possible positions within the detector; series: peak -1,2,-1, pos -1,2,-1, peak 1,-2,1, pos 1,-2,1]

Each bar represents a unique weak classifier, each with its own threshold. The data illustrated here is an indicator of consistency across object images.

[Bar graph - size-16 weak classifiers at detector locations x=24, y=0 to 76; series: peak -1,2,-1, pos -1,2,-1, peak 1,-2,1, pos 1,-2,1]

Each bar represents a unique weak classifier, each with its own threshold. The data illustrated here is an indicator of consistency across object images. The data is based on 500 face and 500 non-face images. A high peak value indicates a high-quality classifier for the particular position. The picture shows 1 of the 500 face images used to create this graph; an additional 500 non-face images were also used. The peak values on the graph represent the number of detections on face images (positives) minus the number of detections on non-face images (false positives). A peak of zero indicates the classifier incorrectly detected faces in all 500 non-face images.

The above bar graphs are best viewed in color. Each bar represents a size-16 horizontal 3-area weak classifier in a different position within the detector. The blue and red bars are -1,2,-1 type classifiers:

    if( (2*center - left - right) >= threshold )    // -1, 2, -1

The green and purple bars are 1,-2,1 type classifiers:

    if( (left - 2*center + right) >= threshold )    // 1, -2, 1

Each bar represents the number of positive classifications the classifier detected on faces (faces detected) out of a total of 500 face images (red and purple). The bottom part of the bar represents the number of positives (faces correctly detected) minus the number of false positives (faces detected in images containing no faces). The peak lines (so called because this data was used to select a threshold, as stated above) provide an indicator of the quality of the specific weak classifier. Note, each bar represents a unique weak classifier with its own threshold. After a threshold has been chosen, a Haar feature becomes a weak classifier; the threshold does not change. The purpose of these graphs is to indicate that certain aspects of the image lend themselves to specific areas of the object (in this case, a face). For example, the bar graph above represents a zoomed-in portion of the full bar graph. The graph has been zoomed in to detector locations x=24, y=0 to 76. This represents a stripe down roughly the center of the face (as shown by the picture; note it is upside down). The graph indicates that weak classifiers of the type 1,-2,1 provide quality data in the regions of the eyes. This is shown by the high signal-to-noise ratio (true positives / false positives) of this type and size

of classifier in this region (24,24 and 24,20). The weak classifiers at location x=24 and y=24 or 20 have high peak values, indicating that, at least in the training images, the pixels in these areas are similar, and this particular classifier does a good job of detecting them out of the noise. Stated another way, the particular pattern that the -1,2,-1 classifier is designed to detect is consistently present in all 500 images at the locations 24,24 and 24,20. Also, the threshold selected for the particular classifier does a good job of differentiating the pattern at these locations versus general noise in the image.

False positives, or noise

The purpose of any filter is to differentiate a signal from noise. Classifiers are simply binary filters: they provide a true/false indication of a specific pattern within data. The pattern is equivalent to the signal, while all the data not part of the pattern (spurious data) represents noise. In an image, this can be demonstrated as the background being noise or spurious data, and the object of interest being the signal or desired pattern within the data. This can be illustrated in the following images. The green indicates the portion of the face we are trying to detect using the -1,2,-1 size-16 weak classifier described above. The red indicates the false positives that occur when that same classifier is tested in other regions of the image. The light box in the image represents the total portion of the image the weak classifier was scanned over. Although the weak classifier does a good job of detecting the area between the eyes, it does produce a lot of false positives (approximately 30% on average across all 500 images). This is why single-Haar-based classifiers are referred to as weak classifiers. A single Haar feature cannot define a complex enough pattern to provide the signal-to-noise ratio needed for practical object detection.

The problem with weak classifiers (green is a positive detection, red is a false positive detection)

Strong Classifiers

Constantine P. Papageorgiou, Michael Oren, and Tomaso Poggio suggested in their paper "A General Framework for Object Detection" that weak classifiers based on Haar features can be combined to create strong classifiers. The Viola-Jones paper furthered this idea by grouping strong classifiers into a cascade, and by using the AdaBoost machine learning algorithm to choose the best weak classifiers. As shown above, a single Haar feature is a weak classifier. To decrease the number of false positives produced by a single Haar feature, groups of Haar features are combined to create a strong classifier. The Haar features are grouped in a detector: a group of Haar features, each of a certain type, size, and location. A detector can be any size, but 20 pixels by 20 pixels appears to be standard.

A 20 pixel by 20 pixel detector loaded with weak classifiers

The pictures below show the results of adding more weak classifiers. As the number of weak classifiers increases, the number of false positives (red pixels) decreases. Unfortunately, the more Haar features used, the slower the process becomes: all 1231 Haar features within the detector must be calculated at every possible detector position as the detector is scanned across the image.

1231 Haar features (weak classifiers); 5357 Haar features (weak classifiers)

There are 45,396 possible weak classifiers in a 24x24 detector. How do we determine the best weak classifiers to use in a detector, so as to maximize the number of positive face detections while minimizing the number of false positives, and to minimize the number of Haar features so that performance stays practical? The answer is boosting. Boosting is a machine learning meta-algorithm for performing supervised learning. "Boosting algorithms iteratively learn weak classifiers with respect to a distribution and add them to a final strong classifier. When they are added, they are typically weighted in some way that is usually related to the weak learners' accuracy. After a weak learner is added, the data is reweighted: examples that are misclassified gain weight and examples that are classified correctly lose weight (some boosting algorithms

actually decrease the weight of repeatedly misclassified examples, e.g., boost by majority and BrownBoost). Thus, future weak learners focus more on the examples that previous weak learners misclassified." (http://en.wikipedia.org/wiki/boosting)

AdaBoost

AdaBoost, short for Adaptive Boosting, is a machine learning algorithm formulated by Yoav Freund and Robert Schapire [1]. It is a meta-algorithm, and can be used in conjunction with many other learning algorithms to improve their performance. AdaBoost is adaptive in the sense that subsequent classifiers are tweaked in favor of those instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers; however, in some problems it can be less susceptible to overfitting than most learning algorithms. AdaBoost calls a weak classifier repeatedly in a series of rounds. For each call, a distribution of weights D_t is updated that indicates the importance of examples in the data set for the classification. On each round, the weights of each incorrectly classified example are increased (or alternatively, the weights of each correctly classified example are decreased), so that the new classifier focuses more on those examples. (http://en.wikipedia.org/wiki/adaboost)
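The reweighting loop just described can be sketched as a generic discrete AdaBoost over precomputed weak-classifier outputs. This is a simplified illustration, not the exact variant used by Viola-Jones; the function names and data layout are assumptions.

```python
import numpy as np

def adaboost(responses, labels, n_rounds):
    """Select weak classifiers and their weights with discrete AdaBoost.

    responses: (n_classifiers, n_samples) array of {+1, -1} predictions,
               one row per candidate weak classifier.
    labels:    (n_samples,) array of {+1, -1} ground-truth labels.
    """
    n = labels.size
    D = np.full(n, 1.0 / n)              # the "distribution" over examples
    chosen, alphas = [], []
    for _ in range(n_rounds):
        errs = (responses != labels) @ D # weighted error of every candidate
        best = int(np.argmin(errs))      # pick the best weak classifier
        eps = max(errs[best], 1e-10)     # guard against division by zero
        alpha = 0.5 * np.log((1.0 - eps) / eps)  # weight in the final mix
        # Reweight: misclassified examples gain weight, correct ones lose it,
        # so the next round focuses on the hard examples.
        D *= np.exp(-alpha * labels * responses[best])
        D /= D.sum()
        chosen.append(best)
        alphas.append(alpha)
    return chosen, alphas

def strong_classify(responses, chosen, alphas):
    """Weighted vote of the selected weak classifiers (+1 = face)."""
    score = sum(a * responses[i] for i, a in zip(chosen, alphas))
    return np.sign(score)
```

Note that no single row of `responses` needs to be correct on every example; the weighted combination can still classify the whole set correctly.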

AdaBoost derivative used in the Viola-Jones method

AdaBoost creates a strong classifier by combining weak classifiers. It's a machine learning algorithm that creates a recipe for a strong classifier: the ingredients are the weak classifiers, and AdaBoost calculates how much of the result of each weak classifier should be added to the final mix.

Equation for the strong classifier created by AdaBoost (the final strong classifier is the sum, over all selected weak classifiers, of the weight calculated by AdaBoost times the result of the weak classifier):

    H(x) = sign( a_1*h_1(x) + a_2*h_2(x) + ... + a_T*h_T(x) )

where each a_t is a weight calculated by AdaBoost and each h_t(x) is the result of a weak classifier.

The result of the AdaBoost algorithm is the set of weights shown above. How the AdaBoost learning algorithm calculates those weights is described by the learning algorithm itself. The algorithm description can be confusing because the term "weight" is used to describe multiple aspects of the algorithm; in this summary, the term weight is reserved for the final values calculated by the algorithm. AdaBoost learns by testing data (faces) and noise (non-faces) against the weak classifiers. The data and noise together are referred to as a distribution. The result of the test is used to score the distribution. If the weak classifier incorrectly classifies an object in the distribution, the score for that

object is increased. If the weak classifier correctly classifies an object, the score for that object is decreased. This process is repeated for all weak classifiers. The process is illustrated very well in the video here: http://www.authorstream.com/presentation/asguest79199-727683-animated-adaboost-example/

Integral Image

A key technique used by the Viola-Jones method is the integral image. This technique minimizes the number of summations required to calculate the value of a single Haar feature. "A summed area table (also known as an integral image) is an algorithm for quickly and efficiently generating the sum of values in a rectangular subset of a grid" (http://en.wikipedia.org/wiki/summed_area_table). The summed area table is an accumulation of pixel values, starting from the upper left and moving toward the lower right of an image. The value at any point (x, y) in the table is the sum of all the pixels above and to the left of that point. The summed area table can be computed efficiently in a single pass over the image, using the fact that the value in the summed area table at (x, y) is just (http://en.wikipedia.org/wiki/summed_area_table):

    I(x, y) = i(x, y) + I(x-1, y) + I(x, y-1) - I(x-1, y-1)

where i(x, y) is the pixel value and I(x, y) is the summed area table. This equation is used to create the integral image from the original monochrome picture. The resulting integral image is then used to calculate Haar feature values using only 4 sums per rectangle.
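The single-pass recurrence above can be sketched directly. This is an illustrative implementation in Python with NumPy (names are assumptions); the explicit loops mirror the recurrence one cell at a time.

```python
import numpy as np

def integral_image(img):
    """Build the summed-area table in a single pass using the recurrence
    ii(x, y) = i(x, y) + ii(x-1, y) + ii(x, y-1) - ii(x-1, y-1),
    where out-of-bounds entries are treated as zero."""
    img = np.asarray(img, dtype=np.int64)
    ii = np.zeros_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            ii[y, x] = (img[y, x]
                        + (ii[y, x - 1] if x > 0 else 0)
                        + (ii[y - 1, x] if y > 0 else 0)
                        - (ii[y - 1, x - 1] if x > 0 and y > 0 else 0))
    return ii
```

In practice the same table can be produced with two cumulative sums, e.g. `img.cumsum(0).cumsum(1)` in NumPy; the loop form above just makes the recurrence explicit.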

D = 4 - 2 - 3 + 1: calculating the sum of region D using an integral image, where 1, 2, 3, and 4 are the integral image values at the four corners of the region.

Cascade

A classifier is just another word for filter: it sorts through random information, letting the desired information pass while throwing away the undesired information. As with both electrical and mechanical filters, it can be more efficient to perform the filtering in stages. Viola-Jones uses a classifier cascade to increase the computational efficiency of the method. Each stage in the cascade has progressively more Haar features, requiring progressively more computation. The first stage has the fewest weak classifiers, requiring the fewest computations. If the first stage identifies a face, it passes the data (the integral image) to the next stage; if it does not see a face, it rejects that portion of the integral image and moves on to the next portion. The cascade decreases the number of computations required when scanning across an image, because the majority of the image does not contain faces. It minimizes the number of calculations based on the premise that the majority of the integral image
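The four-corner lookup can be sketched as follows; this is an illustrative Python/NumPy helper (the name and argument convention are assumptions), implementing the D = 4 - 2 - 3 + 1 pattern.

```python
import numpy as np

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle [x, x+w) x [y, y+h), using only
    four lookups in the integral image ii:
    bottom-right (4) - top-right (2) - bottom-left (3) + top-left (1).
    Out-of-bounds corners (along the top or left edge) count as zero."""
    tl = ii[y - 1, x - 1] if x > 0 and y > 0 else 0  # corner 1
    tr = ii[y - 1, x + w - 1] if y > 0 else 0        # corner 2
    bl = ii[y + h - 1, x - 1] if x > 0 else 0        # corner 3
    br = ii[y + h - 1, x + w - 1]                    # corner 4
    return br - tr - bl + tl
```

With this helper, any rectangular area of a Haar feature costs four lookups regardless of the rectangle's size, which is the whole point of the integral image.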

does not contain a face. When a face is detected, the number of calculations that occur is the same as if a cascade were not used.

As the integral image passes from one stage to the next, progressively more weak classifiers are applied against it, and each stage requires more computation than the previous stage. The first stage has the fewest weak classifiers; if it finds a possible face, it hands the window to the second stage, which has more weak classifiers than the first, and so on up to the nth stage, which has the maximum number of weak classifiers. Any stage that finds no face rejects the window as noise; only a window that passes every stage is reported as a detected face. (A classifier cascade.)
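The staged early-rejection logic can be sketched as follows. This is a structural illustration only; the data types and threshold convention are assumptions, not the trained values of a real cascade.

```python
def run_cascade(stages, window):
    """Pass `window` through a cascade of stages.

    stages: list of (weak_classifiers, stage_threshold) pairs; each weak
    classifier is an (alpha, predict) pair where predict(window) -> 0/1.
    The window is rejected at the first stage whose weighted vote falls
    below its threshold, so most non-face windows are discarded after
    evaluating only the first stage's few features.
    """
    for weak_classifiers, stage_threshold in stages:
        score = sum(alpha * predict(window)
                    for alpha, predict in weak_classifiers)
        if score < stage_threshold:
            return False   # no face found (noise): reject early
    return True            # survived every stage: face detected
```

The cost asymmetry is the key design choice: rejected windows (the vast majority) pay only for the early, cheap stages, while the full cost is paid only where a face is actually present.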

Conclusion

Viola-Jones described a method for fast object detection in computer vision that combines four techniques: Haar features, the integral image, AdaBoost, and the cascade. The methods described in the Viola-Jones paper made it possible to perform practical face detection with minimal computing power.

Haar features are a powerful tool for classifying images such as faces. A Haar feature classifies a region of an image based on the differences between the pixel sums of the feature's symmetrical areas, which amounts to a form of edge detection. Combining Haar features creates a strong classifier, or detector. Boosting is a machine learning method used to determine the combination of weak classifiers that optimizes the resulting strong classifier. AdaBoost is a derivative of boosting that scores the result of each weak classifier when run against a distribution. The score is used to calculate a weight that multiplies the result of the weak classifier in the final strong classifier; the final strong classifier produced by AdaBoost is the sum of all the weights multiplied by the results of the weak classifiers. Boosting is only done during the learning phase of the Viola-Jones method. After the weak classifiers have been selected and the weights calculated, the resulting strong classifier can be used to start filtering the desired data out of the noise.

After the strong classifiers (or filters) have been learned, they are combined into a cascade. A cascade is just another term for multiple stages. The goal of organizing multiple strong classifiers into stages is to decrease the total number of computations required by the filter. The first stage is a prefilter, designed to catch the majority of the noise (non-faces); it has the fewest weak classifiers (Haar features) and so requires the fewest computations.
The following stages progressively increase the number of weak classifiers, but also filter out more noise than the previous stage. After the strong classifiers have been learned using AdaBoost and organized into a cascade, the cascade can be used to filter out the noise and find the desired data. For face detection, the data is faces, and the noise is anything that is not a face. After the cascade has been learned, using it is straightforward: convert the image into an integral image, then slide the detector full of Haar features over the integral image. For each position,

calculate all the Haar features required by the first stage of the cascade, and multiply each of them by the corresponding weight produced by the AdaBoost algorithm. Threshold the result into true or false (face under the detector, or no face under the detector). If the first stage returns true, use the integral image to calculate the Haar features in the second stage of the cascade. This process repeats until some stage determines that no face is under the detector, or the final stage determines that a face is under the detector.
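The detection procedure just described can be sketched as a sliding-window scan. This is an illustrative outline in Python with NumPy; the window size, step, and the `cascade` callback are assumptions standing in for a trained Viola-Jones cascade.

```python
import numpy as np

def detect_faces(img, cascade, win=24, step=2):
    """Slide a win x win detector over the image and return the top-left
    corners of windows the cascade accepts.

    The image is first converted to an integral image (two cumulative
    sums), so the cascade can evaluate every Haar feature with a handful
    of table lookups.  `cascade` is a function (ii, x, y) -> bool that
    runs the staged test at detector position (x, y)."""
    ii = np.asarray(img, dtype=np.int64).cumsum(0).cumsum(1)
    h, w = ii.shape
    hits = []
    for y in range(0, h - win + 1, step):
        for x in range(0, w - win + 1, step):
            if cascade(ii, x, y):      # most positions are rejected early
                hits.append((x, y))
    return hits
```

A full detector would also rescan at multiple scales (by scaling the detector or the image) and merge overlapping hits, which is omitted here for brevity.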