Face Detection in images : Neural networks & Support Vector Machines

Asim Shankar (asim@cse.iitk.ac.in)
Priyendra Singh Deshwal (priyesd@iitk.ac.in)

April 2002

Under the supervision of Dr. Amitabha Mukherjee (amit@cse.iitk.ac.in)

Report submitted in partial fulfillment of the requirements of the course CS397 Special Topics in Computer Science to the Computer Science and Engineering Department, Indian Institute of Technology, Kanpur

ABSTRACT

Face detection and recognition in images is one of the long-standing problems addressed by the computer-vision community. Such a system has numerous applications, including automated security systems, census-taking and intelligence gathering. In this report, we present our experience with two of the most successful techniques available today ([rowley98], [cvpr97face]) and extensions of this work to other interesting applications.

1 TABLE OF CONTENTS

Introduction
Generic Approach
    The sliding window
    Image pre-processing
    Bootstrapping
    Training Set description
The Neural Network Technique
    Network Structure
    Results
    Other species
    Different Network Architectures
        Fully connected network
        Two outputs
Support Vector Machines
    Introduction to SVMs
    SVM learning parameters
    Results of training
Implementation Details
Neural Nets and SVMs: A comparison
Further Directions
References
Resources

Table of Figures

Figure 1 - Image pre-processing
Figure 2 - Constructing 20x20 training image from original
Figure 3 - Basic structure of neural network (Taken from [rowley98])
Figure 4 - Results of Neural Network on pictures taken by us
Figure 5 - Results of neural network on "standard" pictures
Figure 6 - Results on a "fully-connected" network

2 Introduction

Classification algorithms have traditionally worked by reducing the object in question to a small set of meaningful features. In many cases, however, this is not feasible. Face detection, for example, involves a concept (a face) that cannot be reduced to a manageable, quantifiable set of features whose basis or eigen-features can be found. Since it is not known a priori what the relevant features for the given concept are, the feature vectors are typically large (such as the grey values of every pixel in the image). Under such circumstances, the approach taken is to learn the solution from a large set of examples. We look into a neural-network-based technique (Henry Rowley et al.) and a support-vector-machine-based technique (Osuna et al.), both of which take in the large feature vector and attempt to classify it.

3 Generic Approach

The problem in question: given an arbitrary image, mark the faces detected in the image.

3.1 The sliding window

The two classification techniques studied (neural networks and SVMs) classify a 20x20 window of pixels as a face or a non-face. The system therefore slides this 20x20 window across the image. For the classifier to detect a face correctly, the face must fit into the window and occupy all of it, i.e., it must be neither larger nor smaller than the window. It would of course be absurd to expect this to always happen. To compensate, we repeatedly scale the image down by a constant factor and slide the 20x20 window over each smaller image. This lets us detect faces that are larger than the window in the original image.

3.2 Image pre-processing

Face images vary a great deal: diversity in race, color, gender etc. all change the appearance of a face. Add to that the differences between images taken under different lighting conditions or with different equipment, and the classifier can get completely confused, its decision being influenced by such factors. To avoid this, each image is pre-processed before being given to the classifier. The pre-processing consists of the following steps:

- Illumination correction: A best-fit brightness plane is subtracted from the window pixel values, which reduces the effect of light and heavy shadows.
- Histogram equalization: This compensates for differences in illumination brightness, camera responses, skin color etc.

These steps are applied to each 20x20 window and not to the image as a whole.
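The window scan and the two pre-processing steps can be sketched as follows. This is a minimal illustration in Python with NumPy, not the report's implementation (which used Intel's IPL). The scale factor of 1.2, the stride of 2 pixels, the nearest-neighbour downscaling and all function names are assumptions made for the example.

```python
import numpy as np

WINDOW = 20  # window side length used in the report

def subtract_brightness_plane(window):
    """Fit I(x, y) ~ a*x + b*y + c by least squares and subtract the fit."""
    h, w = window.shape
    ys, xs = np.mgrid[0:h, 0:w]
    A = np.column_stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    values = window.ravel().astype(float)
    coeffs, *_ = np.linalg.lstsq(A, values, rcond=None)
    return window.astype(float) - (A @ coeffs).reshape(h, w)

def equalize_histogram(window, levels=256):
    """Remap grey values through their cumulative histogram (levels <= 256)."""
    lo, hi = window.min(), window.max()
    if hi == lo:
        return np.zeros(window.shape, dtype=np.uint8)  # flat window
    q = np.round((window - lo) / (hi - lo) * (levels - 1)).astype(int)
    hist = np.bincount(q.ravel(), minlength=levels)
    cdf = np.cumsum(hist) / q.size
    return np.round(cdf[q] * (levels - 1)).astype(np.uint8)

def downscale(image, factor):
    """Nearest-neighbour downscale by `factor` (> 1); assumed, for brevity."""
    h, w = image.shape
    new_h, new_w = int(h / factor), int(w / factor)
    rows = np.minimum((np.arange(new_h) * factor).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) * factor).astype(int), w - 1)
    return image[np.ix_(rows, cols)]

def sliding_windows(image, scale_step=1.2, stride=2):
    """Yield (x, y, scale, pre-processed 20x20 window) over all scales."""
    scale = 1.0
    current = image
    while min(current.shape) >= WINDOW:
        h, w = current.shape
        for y in range(0, h - WINDOW + 1, stride):
            for x in range(0, w - WINDOW + 1, stride):
                win = current[y:y + WINDOW, x:x + WINDOW]
                win = equalize_histogram(subtract_brightness_plane(win))
                yield x, y, scale, win
        scale *= scale_step
        current = downscale(image, scale)
```

A detection at position (x, y) and scale s corresponds to a square of side 20*s at roughly (x*s, y*s) in the original image. Each yielded window can be flattened into the 400-element input vector expected by the classifier.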

Figure 1 - Image pre-processing

3.3 Bootstrapping

Generating a training set for the SVM/neural network is a challenging task because of the difficulty of placing characteristic non-face images in the training set. Getting a representative sample of face images is not much of a problem. Choosing the right combination of non-face images from the immensely large set of such images, however, is a complicated task. For this purpose, after each training session, non-faces incorrectly detected as faces are placed in the training set for the next session. This bootstrap method avoids the need for a huge set of non-face images in the training set, many of which may not influence the training.

3.4 Training Set description

Researchers in the field of face detection have used two common training sets (CMU, MIT (Poggio)); however, those are not easily available. For our purposes, we used some images from the CMU test set (see Resources), the Biometric Security's BioID face database (see Resources) and a database of Indian faces generated here at IIT Kanpur. In each image to be placed in the training set, the eyes, the nose, and the left, right and center of the mouth were marked. With these markings, the face was transformed into a 20x20 window with the marked features at predetermined positions [ELABORATE]
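The bootstrapping loop described in section 3.3 can be sketched as below. The nearest-centroid trainer is a toy stand-in so that the example runs on its own; it is not the network or SVM used in the report. The function names, the number of rounds and the 0.5 threshold applied here are assumptions, and `scenery` is assumed to contain windows known to hold no faces.

```python
import numpy as np

def centroid_trainer(X, y):
    """Toy stand-in classifier (hypothetical): nearest class centroid."""
    face_c = X[y == 1].mean(axis=0)
    non_c = X[y == 0].mean(axis=0)

    def score(Z):
        d_face = np.linalg.norm(Z - face_c, axis=1)
        d_non = np.linalg.norm(Z - non_c, axis=1)
        return (d_face < d_non).astype(float)  # 1.0 = face, 0.0 = non-face

    return score

def bootstrap(train_fn, faces, negatives, scenery, rounds=3, threshold=0.5):
    """Retrain repeatedly, adding false detections on face-free windows."""
    faces = np.asarray(faces, dtype=float)
    negatives = np.asarray(negatives, dtype=float)
    scenery = np.asarray(scenery, dtype=float)

    def fit(neg):
        X = np.vstack([faces, neg])
        y = np.concatenate([np.ones(len(faces)), np.zeros(len(neg))])
        return train_fn(X, y)

    score = fit(negatives)
    for _ in range(rounds):
        # Every detection on face-free scenery is a false positive.
        false_pos = scenery[score(scenery) > threshold]
        if len(false_pos) == 0:
            break
        negatives = np.vstack([negatives, false_pos])
        score = fit(negatives)
    return score, negatives
```

Only the windows the current classifier gets wrong are added, so the negative set grows with examples that actually shift the decision boundary rather than with arbitrary non-face images.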

Initially, for negative samples, random images were created and added to the training set. The training set was subsequently enhanced by bootstrapping with scenery images and falsely detected windows. To make the system somewhat invariant to changes such as rotation of the face, random transformations (rotation by ±15 degrees, mirroring) were applied to images in the training set. The final training set (including bootstrapping) had 8982 input vectors.

Figure 2 - Constructing 20x20 training image from original

4 The Neural Network Technique

We implemented a retinally-connected neural network. The network takes as input a 400-element vector (each element being the gray value of one pixel in the 20x20 window) and returns a result between 0.0 and 1.0. The network is trained using the standard back-propagation algorithm.

4.1 Network Structure

Our implementation was a crude version of the system described in [rowley98]. We did not implement arbitration amongst multiple networks, and the training set used was significantly smaller.

Figure 3 - Basic structure of neural network (Taken from [rowley98])

The neural network is a two-layer (one hidden, one output) feed-forward network. There are 400 input neurons, 26 hidden neurons and 1 output neuron. A hidden neuron is not connected to all the input neurons. The hidden neuron connections are as follows:

- The input image is divided into a 2x2 grid. 4 of the hidden neurons take input from only one cell of this grid each.
- The input image is divided into a 4x4 grid. 16 of the hidden neurons take input from only one cell of this grid each. This division into grids should help in detecting local features (eyes, nose) important for face detection.
- The input image is divided into 6 horizontal stripes (each of height 5 pixels, thus there is some overlap between stripes). The remaining 6 hidden neurons take input from one stripe each. This should aid in the detection of features such as a pair of eyes or the mouth.

The idea is that the hidden neurons taking square (grid) inputs would detect individual features, while those taking horizontal stripes would detect pairs of eyes and the mouth.
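The restricted connectivity can be expressed as a 26x400 boolean mask over an otherwise ordinary weight matrix. The sketch below builds that mask and runs one forward pass. It is an illustration only: the stripe start rows (0, 3, ..., 15), the sigmoid activation and all names are assumptions, since the report does not state them.

```python
import numpy as np

SIDE = 20  # the input window is SIDE x SIDE pixels

def receptive_fields():
    """Return a (26, 400) boolean mask: row k marks the inputs of hidden unit k."""
    fields = []
    # 2x2 grid of 10x10 cells (4 units), then 4x4 grid of 5x5 cells (16 units).
    for cell in (10, 5):
        for r in range(0, SIDE, cell):
            for c in range(0, SIDE, cell):
                m = np.zeros((SIDE, SIDE), dtype=bool)
                m[r:r + cell, c:c + cell] = True
                fields.append(m.ravel())
    # 6 overlapping horizontal stripes of height 5 (start rows are assumed).
    for r in range(0, 16, 3):
        m = np.zeros((SIDE, SIDE), dtype=bool)
        m[r:r + 5, :] = True
        fields.append(m.ravel())
    return np.array(fields)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, w2, b2, mask):
    """Forward pass; masked-out weights never contribute to a hidden unit."""
    hidden = sigmoid((W1 * mask) @ x + b1)
    return sigmoid(w2 @ hidden + b2)

mask = receptive_fields()
input_edges = int(mask.sum())      # 4*100 + 16*25 + 6*100
total_edges = input_edges + 26     # plus one edge per hidden unit to the output
full_edges = 400 * 26 + 26         # same layer sizes, fully connected
```

Under these assumptions `total_edges` comes out at 1426, the figure quoted for the restricted network in section 4.4. During training, the same mask would also be applied to the weight gradients so that masked weights stay at zero effect.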

4.2 Results

Here we present some of the results obtained (green rectangles around detected faces). You will notice some false detections, which should be reduced by adding these windows to the training set (more bootstrapping). The same face is also often detected multiple times. The remedy for this is to draw a single bounding rectangle around the multiply detected regions. We implemented a primitive collapsing technique and have yet to refine it.

Figure 4 - Results of Neural Network on pictures taken by us
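One simple way to collapse multiple detections of the same face is to group rectangles that overlap and replace each group by its bounding rectangle. The sketch below does this with a small union-find. It is not the authors' collapsing technique, whose details the report does not give; the rectangle format (x0, y0, x1, y1) and the function names are assumptions.

```python
def overlaps(a, b):
    """True if rectangles a and b (x0, y0, x1, y1) share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def collapse(rects):
    """Merge chains of overlapping rectangles into bounding rectangles."""
    parent = list(range(len(rects)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):
            if overlaps(rects[i], rects[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i, r in enumerate(rects):
        groups.setdefault(find(i), []).append(r)

    return sorted(
        (min(r[0] for r in g), min(r[1] for r in g),
         max(r[2] for r in g), max(r[3] for r in g))
        for g in groups.values()
    )
```

For example, `collapse([(0, 0, 20, 20), (5, 5, 25, 25), (100, 100, 120, 120)])` merges the first two rectangles and leaves the third alone. A refinement would be to require a minimum number of detections per group, which also suppresses isolated false detections.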

Figure 5 - Results of neural network on "standard" pictures

4.3 Other species

Here we tried some images of animal faces to see whether the network had learnt to recognize faces in general (two eyes, a nose and a mouth) or had picked up something unique to human faces. Note that none of these animal faces were in the training set. We obtained some interesting results:

- The application (screenshots above) didn't draw a rectangle around the chimp, so it didn't consider it a face. However, on closer inspection, we saw that this chimp and some others had a network output quite close to 0.5 (the threshold we used to separate faces from non-faces).
- This dog's face was detected by the network. The region does, after all, have two eyes, and the dog's fur is dark in the middle, which makes it look somewhat like a nose. However, many other dog faces were categorically rejected by the system.

4.4 Different Network Architectures

Besides the network structure proposed in [rowley98], we also experimented with alternative structures and compared their performance with the one described above.

4.4.1 Fully connected network

After reading about the aforementioned network, an obvious question was what effect such restricted connections between the input and hidden neurons have on the network. Rowley proposed 1426 different edges, while fully connecting all 400 inputs to all 26 hidden neurons and all 26 hidden neurons to the output neuron gives 400 × 26 + 26 = 10,426 edges. To test this, we trained a fully connected network on the same training set. We observed that the results were quite similar; however, the time taken to process an image with the fully connected network was much larger (420% extra edges). Since this slower performance didn't translate into more accurate detection, we concluded that Rowley's construction was quite appropriate.

Figure 6 - Results on a "fully-connected" network

4.4.2 Two outputs

The networks above, with only one output, gave a few false detections and on rare occasions missed a face. A common strategy in many neural-network-based classifiers is a two-output system. Some believe that neural networks work better with sparse input/output schemes. We thus tried a two-output system, in which the first output measures how likely the given image is to be a face and the second measures how likely it is not to be a face. Again, this structure seemed to be no better than the original, more compact network with one output.

5 Support Vector Machines

Preliminary experiments with the SVM technique described in [cvpr97face] suggest that it is as promising as the neural network technique. One small difference between our implementation and the one proposed is that Osuna et al. use a 19x19 window as the feature vector, while we use 20x20 so that the training set can be shared. The training set was exactly the same as that used for the neural network, i.e., 8982 input vectors.

5.1 Introduction to SVMs

The support vector machine is a pattern classification algorithm developed by V. Vapnik and his team at AT&T Bell Labs [vapnik95svnets]. While most machine-learning-based classification techniques are based on minimizing the error on the training data (empirical risk), SVMs operate on another induction principle, called structural risk minimization, which minimizes an upper bound on the generalization error.

Consider data points of the form $\{(x_i, y_i)\}_{i=1..n}$, where each $x_i$ is a point in an N-dimensional space and $y_i$ is its class label. We wish to determine, for any given point in this space, which of the two classes it belongs to. If the two classes are linearly separable, we need to determine a hyper-plane that separates them. If the classes are not cleanly separable, our objective is to minimize the generalization error. Intuitively, a good choice is the hyper-plane that leaves the maximum margin between the two classes (the margin being defined as the sum of the distances of the hyper-plane from the closest points of the two classes) and minimizes the misclassification errors.

It can be shown that the solution to this problem is a linear classifier:

$f(x) = \mathrm{sign}\left(\sum_{i=1}^{n} \lambda_i y_i \, x^T x_i + b\right)$

whose coefficients $\{\lambda_i\}$ are the solution of the following QP problem:

Figure 7 - QP eqn. whose solutions are the support vectors (from [cvpr97face])
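The equation in Figure 7 is not reproduced in this text. For reference, the standard soft-margin dual problem for the classifier above, written in the same symbols, is given below. This is the textbook form and is assumed, not confirmed, to match the figure in [cvpr97face]; $C$ is the error/margin tradeoff parameter discussed in section 5.2.

```latex
\begin{aligned}
\max_{\lambda}\quad & \sum_{i=1}^{n} \lambda_i
  \;-\; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
  \lambda_i \lambda_j \, y_i y_j \, x_i^T x_j \\
\text{subject to}\quad & \sum_{i=1}^{n} \lambda_i y_i = 0, \\
 & 0 \le \lambda_i \le C, \qquad i = 1, \dots, n.
\end{aligned}
```

The data enter only through the inner products $x_i^T x_j$, which is what later allows them to be replaced by a kernel function.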

It turns out that only a small number of coefficients are different from zero. Since every coefficient corresponds to a particular data point, the solution is determined by the data points associated with the non-zero coefficients. These are the support vectors, the only points relevant to the solution of the problem; all other data points can be deleted from the data set without affecting the solution. Intuitively, support vectors are the data points lying on the border between the two classes.

Figure 8 - Separating hyperplanes (a) small margin (b) larger margin, better classifier (taken from [cvpr97face])

In the real world, we're unlikely to find problems that can actually be solved by a linear classifier. To extend the technique to non-linear decision surfaces, we project the original vector into a higher-dimensional feature space. The problem now is the choice of features that will project the original vector into this space. For this we use kernel functions K(x, y). See [vapnik95svnets] for more details.

5.2 SVM learning parameters

The parameters used by the learning engine were:

- C (tradeoff between training error (minimized) and margin (maximized)) = 1.0
- Kernel function: 2nd degree polynomial

The SVM training algorithm reported zero misclassifications over the training set with these parameters. Increasing C to 2.0, however, resulted in 89 misclassification errors (1% error) over the training set. As there were no misclassifications with a 2nd degree polynomial, increasing the degree did not seem necessary, and indeed performance was similar with a 3rd degree polynomial kernel.
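To illustrate how a 2nd degree polynomial kernel turns the linear classifier into a non-linear one, the sketch below evaluates f(x) with K(x, x_i) in place of the inner product. The kernel form (x·y + 1)^d is a common choice and is an assumption here. For brevity the coefficients are found with a kernel perceptron, a swapped-in and much simpler procedure than the QP solved by an SVM, so there is no C parameter and no margin maximization. The toy XOR data is not linearly separable, which is why it serves as the example.

```python
import numpy as np

def poly_kernel(a, b, degree=2):
    """Polynomial kernel K(a, b) = (a . b + 1)^degree for all row pairs."""
    return (a @ b.T + 1.0) ** degree

def decision(x, points, labels, lam, b, degree=2):
    """Real-valued f(x) = sum_i lam_i * y_i * K(x, x_i) + b (before sign)."""
    return poly_kernel(np.atleast_2d(x), points, degree) @ (lam * labels) + b

def kernel_perceptron(X, y, degree=2, max_epochs=1000):
    """Stand-in for SVM training: bump lam_i whenever x_i is misclassified."""
    lam = np.zeros(len(X))
    b = 0.0
    K = poly_kernel(X, X, degree)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(len(X)):
            if y[i] * (K[i] @ (lam * y) + b) <= 0:
                lam[i] += 1.0
                b += y[i]
                mistakes += 1
        if mistakes == 0:
            break
    return lam, b

# XOR: no straight line separates the +1 points from the -1 points.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
lam, b = kernel_perceptron(X, y)
predictions = np.sign(decision(X, X, y, lam, b))
```

With the degree-2 kernel all four points end up on the correct side. In a trained SVM only the support vectors would have non-zero `lam`, so the sum in `decision` would run over those points alone.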

5.3 Results of training

644 support vectors were obtained with C=1.0 and a 2nd degree polynomial kernel, with no misclassifications on the training set. With a 3rd degree polynomial kernel, the number of support vectors increased to 711.

Interestingly, the learning time for the SVM algorithm was significantly shorter than that for the neural network. Over the ~9000-image training set, the SVM algorithm produced the model in approximately 15 minutes, while back-propagation training of the neural network took 1 hour.

6 Implementation Details

In the course of studying the face detection techniques described above, we did a substantial amount of implementation work. We tried to write reusable and pluggable code so that future work can easily build upon our engines.

Intel's Image Processing Library (IPL) was used for image processing and manipulation (histogram equalization, window extraction, scaling etc.). Input vectors were then created from the scaled, processed windows. The application also assists in creating the training set by allowing features (eyes, nose, mouth) to be labeled, transforming the face to a 20x20 window based on the selected features, rotating the image randomly, pre-processing the image and then writing it to a training set file.

A neural network library (see Resources) was created for the corresponding technique. The network was trained on a compute server (as the training set was large), and the trained network was then plugged into the GUI for testing. The SVM engine used was SVM-light (see Resources).

All training engines are compatible with both Linux and Windows. The GUI is currently written for Windows systems. The code is free for use, in the hope that it will save a significant amount of time for anyone trying to build on this work. Please feel free to contact the authors for these applications.

7 Neural Nets and SVMs: A comparison

Here we present a few images (20x20 windows blown up) and the output of the neural network and the SVM classifier on each input.

Image      Neural Network     SVM classifier
           (0=NO, 1=YES)      (-1=NO, +1=YES)
[image]    Yes (0.97)         Yes (1.02)
[image]    Yes (0.82)         Yes (0.89)
[image]    Yes (0.59)         No!! (-0.12)
[image]    Yes (0.86)         Yes (0.39)
[image]    Yes (0.99)         Yes (2.01)
[image]    Yes (0.87)         Yes (1.11)
[image]    No (0.020)         No (-2.9)
[image]    No (0.001)         No (-3.9)
[image]    No ( )             No (-5.6)
[image]    No ( )             No (-4.5)
[image]    No (0.040)         No (-2.3)

8 Further Directions

The face detection problem has many applications in security systems, automated census, intelligence systems etc. Of particular interest to us, however, is the field of video summarization. The idea is that, given a video sequence, we first identify the faces in the frames and then use the identified faces for motion tracking and face recognition. With this, we may be able to comment textually on the movement of persons across a scene.

While the face detection technique described above can be applied to video, a major hindrance is its speed, or rather lack of it, on large images. Running it over a large set of frames in a video would make the system prohibitively slow. However, we can use properties of video to ease this problem. For example, using background subtraction techniques we can reduce the number of regions in a frame where a face is likely to be found, so that instead of looking at all windows in each frame we look only at the regions of interest. Testing the feasibility and performance of such a system would be the next logical step to take.

Furthermore, the detection scheme described in this report deals with full-frontal facial images, meaning that profile views and occluded faces are not handled. Profile views could be detected using the same technique, possibly using the eye, nose and ear positions to standardize the training set and then applying the training schemes described above.

We surveyed such techniques as the first step in video summarization. The next step would be to be able to:

- Label every scene with the characters present in it, and then
- Label every scene with the actions of each actor (bend, walk, move hand etc.)
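The background subtraction idea mentioned above can be sketched with plain frame differencing: pixels that differ from a reference background by more than a threshold mark the region in which the window scan needs to run. This is a minimal illustration under assumptions (a static camera, a known background frame, a threshold of 25 grey levels, a single bounding region); the function name is hypothetical.

```python
import numpy as np

def motion_roi(background, frame, threshold=25):
    """Bounding box (x0, y0, x1, y1) of pixels that changed, or None."""
    diff = np.abs(frame.astype(int) - background.astype(int))
    ys, xs = np.nonzero(diff > threshold)
    if ys.size == 0:
        return None  # nothing moved: skip face detection for this frame
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1
```

The detector would then scan only `frame[y0:y1, x0:x1]` (padded to at least 20x20) instead of the full frame.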

9 References

[rowley98] Neural Network-Based Face Detection. Henry Rowley, Shumeet Baluja, Takeo Kanade. CMU. IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 20, number 1, pages 23-38, January. (2.cs.cmu.edu/afs/cs.cmu.edu/user/har/Web/faces.html)

[cvpr97face] Training Support Vector Machines: An Application to Face Detection. Edgar Osuna, Robert Freund, Federico Girosi. MIT.

[sung94examplebased] Example-Based Learning for Human Face Detection. Kah-Kay Sung, Tomaso Poggio. MIT. IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 20, number 1, pages 39-51, January.

[rowley97] Rotation Invariant Neural Network-Based Face Detection. H. Rowley, S. Baluja, and T. Kanade. Technical report CMU-CS, Computer Science Department, Carnegie Mellon University, December.

[vapnik95svnets] Support Vector Networks. C. Cortes and V. Vapnik. Machine Learning, 20:1-25, 1995.

T. Joachims, Making Large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges and A. Smola (ed.), MIT Press.

Resources

- CMU image test set for face detection
- BioID Face database
- Annie: Artificial Neural Network library for C++
- SVM-light: Support Vector Machine training and classification software
- Intel Performance Libraries: Image Processing Library
