Classifying Facial Expression with Radial Basis Function Networks, using Gradient Descent and K-means

Neil Alldrin
Department of Computer Science
University of California, San Diego
La Jolla, CA 92037
nalldrin@cs.ucsd.edu

Andrew Smith
Department of Computer Science
University of California, San Diego
La Jolla, CA 92037
atsmith@cs.ucsd.edu

Doug Turnbull
Department of Computer Science
University of California, San Diego
La Jolla, CA 92037
dturnbul@cs.ucsd.edu

Abstract

This paper compares methods of training radial basis function networks. We found that RBF networks initialized by supervised clustering perform better than networks initialized by unsupervised clustering and improved with gradient descent. Gradient descent did not significantly improve the networks initialized by supervised clustering.

1 Introduction

Radial basis function (RBF) networks are commonly used for pattern classification. In this paper, we explore three techniques for training RBF networks. First, we present the structure of standard RBF networks and describe three techniques for learning their parameters. We then evaluate these training techniques on a set of human facial images.

2 RBF Networks

Radial basis function networks typically have two distinct layers, as shown in figure 1. The bottom layer consists of a set of basis functions, each of which is a function of the distance between an input pattern and a prototype pattern. A standard choice of basis function is the Gaussian:

\phi_j(\mathbf{x}) = \exp\left( -\frac{\| \mathbf{x} - \boldsymbol{\mu}_j \|^2}{2 \sigma_j^2} \right)    (1)
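As a concrete illustration, the Gaussian basis layer of equation 1 can be sketched in a few lines of NumPy. This is a hypothetical rendition for illustration only (the original project code was written in Matlab, and the function and variable names below are ours):

```python
import numpy as np

def gaussian_basis(X, mu, sigma):
    """Spherical Gaussian basis functions, as in equation 1.

    X     : (n, d) input patterns
    mu    : (m, d) prototype centers, one per basis function
    sigma : (m,)   widths
    Returns an (n, m) matrix with entries phi_j(x) = exp(-||x - mu_j||^2 / (2 sigma_j^2)).
    """
    # Squared distance from every pattern to every prototype, via broadcasting.
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# A pattern that coincides with a prototype gives the maximal activation of 1;
# activations decay toward 0 as the distance to the prototype grows.
X = np.array([[0.0, 0.0], [1.0, 1.0]])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
sigma = np.array([1.0, 2.0])
Phi = gaussian_basis(X, mu, sigma)
```

The matrix `Phi` is exactly the design matrix used by the least-squares top layer.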
Figure 1: A generic RBF network, with inputs x_1, …, x_d, basis functions \phi_1, …, \phi_m plus a bias, and outputs y_1, …, y_c. (Figure adapted from [2].)

The top layer is a simple linear discriminant that outputs a weighted sum of the basis functions. The equation for a single output is:

y_k(\mathbf{x}) = \sum_{j=1}^{m} w_{kj} \, \phi_j(\mathbf{x}) + w_{k0}    (2)

For our networks, we set the weights of the top layer by minimizing the least squares error. Rewriting equation 2 in matrix notation gives \mathbf{y} = W^{\mathsf{T}} \boldsymbol{\phi}. The minimum of the sum-of-squares error E = \frac{1}{2} \sum_n \| W^{\mathsf{T}} \boldsymbol{\phi}(\mathbf{x}_n) - \mathbf{t}_n \|^2 is W = \Phi^{\dagger} T, where \Phi^{\dagger} = (\Phi^{\mathsf{T}} \Phi)^{-1} \Phi^{\mathsf{T}} is the pseudo-inverse of \Phi.

2.1 Initializing Parameters for Radial Basis Functions

Spherical Gaussian basis functions, as in equation 1, each have two parameters, \mu_j and \sigma_j, for the j-th basis function. The three techniques we explore for finding these parameters are maximum likelihood (ML), K-means, and gradient descent.

Maximum likelihood: Let \{\mathbf{x}_1, \ldots, \mathbf{x}_{N_j}\} be the set of input patterns in class j, and let the number of basis functions equal the number of classes, assuming one basis function per class. Using maximum likelihood sets the parameters of basis function j to:

\boldsymbol{\mu}_j = \frac{1}{N_j} \sum_{n=1}^{N_j} \mathbf{x}_n    (3)

\sigma_j^2 = \frac{1}{N_j} \sum_{n=1}^{N_j} \| \mathbf{x}_n - \boldsymbol{\mu}_j \|^2    (4)

K-means: K-means is an unsupervised clustering algorithm that finds a set of K centers that
minimize the mean squared distance from each data point to its closest center [4]. Each center represents a cluster, or subset of nearby data points. The average value of all the points in a cluster is used to find \mu for the basis function corresponding to that cluster; \sigma is set to the standard deviation of the points in the cluster, as in equation 4. Using K-means to find basis functions is less restrictive than using ML because the number of basis functions is not dependent on the number of classes.

2.2 Improving Parameters using Gradient Descent

A common technique for reducing the error of machine learning models is gradient descent. The basic idea of gradient descent is to reduce the error by iteratively adjusting the parameters of the learning model. The gradient of the error function points in the direction of maximum increase, so the error of the learning model can be reduced by moving the parameters in the opposite direction of the gradient. Typically this is accomplished through the iterative process of repeatedly calculating the gradient for a particular input, or set of inputs, and adjusting the parameters by a small amount.

The parameters of our model are the weights of the linear discriminant, w, and the means and standard deviations of the radial basis functions, \mu and \sigma. Since the optimal set of weights for a given set of radial basis functions can be calculated in one step, there is no need to use an iterative process to update the weights. However, no such closed form exists for the optimal set of Gaussians, so we must resort to iterative processes. Our training procedure uses stochastic gradient descent, which trains the RBF network for a number of epochs. Each epoch trains the RBF network on every data point in our training set, one at a time, in random order.

Our error function is the common sum-of-squares error, E = \frac{1}{2} \sum_k (y_k - t_k)^2. The derivative of this error with respect to the standard deviation of basis function j, \sigma_j, is
\frac{\partial E}{\partial \sigma_j} = \sum_k (y_k - t_k) \, w_{kj} \, \phi_j(\mathbf{x}) \, \frac{\| \mathbf{x} - \boldsymbol{\mu}_j \|^2}{\sigma_j^3}    (5)

The derivative of the error with respect to the i-th component of the j-th mean, \mu_{ji}, is

\frac{\partial E}{\partial \mu_{ji}} = \sum_k (y_k - t_k) \, w_{kj} \, \phi_j(\mathbf{x}) \, \frac{x_i - \mu_{ji}}{\sigma_j^2}    (6)

For each basis function, the term \sum_k (y_k - t_k) w_{kj} can be computed with the backpropagation algorithm.

Once we have calculated the gradient of the error, we update the means and standard deviations by moving them a small amount against the gradient:

\mu_{ji} \leftarrow \mu_{ji} - \eta \frac{\partial E}{\partial \mu_{ji}}, \qquad \sigma_j \leftarrow \sigma_j - \eta \frac{\partial E}{\partial \sigma_j}

where \eta is a small, decreasing value called the learning rate. After the basis functions are updated, a new set of optimal weights is calculated.

3 Human Expression Classification

3.1 Experimental Methodology

The goal of our experiments is to train RBF networks that can classify images of human faces as one of the following six expressions: happy, sad, surprised, afraid, disgusted, angry. We use three methods for learning the parameters of the network. The first method uses supervised learning (ML) of the Gaussian parameters, \mu and \sigma, combined with the least squares error solution for the top layer weights. The second method uses unsupervised
clustering (K-means) to learn \mu and \sigma, again using least squares to learn the weights. The third method uses gradient descent on \mu and \sigma, updating the weights with least squares after every training example. For the techniques that use K-means, the number of Gaussians is varied from 1 to 53. For the supervised learning techniques, six Gaussians are used, one for each expression.

For all experiments, we use difference images instead of the original images of facial expressions. For each image of a face, we define the corresponding difference image as the difference between that image and the average of all images of that individual. Also, we use principal component analysis to reduce the dimensionality of the data to 20. The original images are downsampled to fit our computational resources. To evaluate the success of our networks, we use cross-validation, which reserves a small amount of the training set to test the network's ability to generalize to novel data.

3.2 Results

Supervised and Unsupervised Clustering:

Figure 2: Performance of supervised (ML) and unsupervised (K-means) clustering on the training data (top) and novel data (bottom). The circle represents the average supervised result and the squares represent the average unsupervised results with varying numbers of clusters.

Figure 2 shows that supervised clustering generalizes better than unsupervised clustering with any number of clusters. Unsupervised clustering did best in the range of 20 to 30 clusters, achieving roughly 50% correct classification, but with a high standard deviation. The
sharp falloff in generalization after 45 clusters, combined with the near perfect classification of the training data, indicates memorization.

Gradient Descent:

Figure 3: Gradient descent initialized with K-means. In order from top to bottom are the training, holdout, and test sets.

Figure 3 shows that a network with RBFs initialized with K-means and trained with gradient descent generalizes best with around six clusters, as opposed to 25 clusters without gradient descent. Using six radial basis functions instead of twenty-five is intuitively better because a learning model with fewer parameters implies generalization rather than memorization. The fact that we are classifying six facial expressions also seems to indicate good generalization. Using gradient descent to improve a network initialized with supervised clustering had little effect.

Figure 4a illustrates the use of a holdout set to reduce memorization. The network finds an optimal set of weights for the holdout set quickly, but continues to memorize the training set. Figure 4b shows that gradient descent significantly moves the centers of RBFs initialized with K-means.

Comparison: Figure 5 shows the success rate for both K-means initialized RBF networks, and K-means initialized and gradient descent improved RBF networks, printed side-by-side for comparison. Clearly, gradient descent improves the RBFs placed with K-means.
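The stochastic update of section 2.2, used throughout these experiments, can be made concrete with a short sketch. The NumPy code below is a hypothetical rendition (the original code was Matlab, and all names here are ours) of one gradient step on the Gaussian means and widths, using equations 5 and 6 and computing the backpropagated term \sum_k (y_k - t_k) w_{kj} once per basis function:

```python
import numpy as np

def sgd_step(x, t, centers, sigmas, W, eta):
    """One stochastic gradient-descent update of the Gaussian means and
    widths for a single pattern x with target t (equations 5 and 6).

    x       : (d,)     input pattern
    t       : (c,)     target outputs
    centers : (m, d)   Gaussian means mu_j
    sigmas  : (m,)     Gaussian widths sigma_j
    W       : (m+1, c) top-layer weights; the last row is the bias
    """
    diff = x[None, :] - centers                        # x - mu_j, shape (m, d)
    d2 = (diff ** 2).sum(axis=1)                       # ||x - mu_j||^2
    phi = np.exp(-d2 / (2.0 * sigmas ** 2))            # basis activations (eq. 1)
    y = W.T @ np.append(phi, 1.0)                      # linear outputs (eq. 2)
    delta = y - t                                      # output errors y_k - t_k
    g = W[:-1, :] @ delta                              # sum_k (y_k - t_k) w_kj
    grad_sigma = g * phi * d2 / sigmas ** 3            # equation 5
    grad_mu = (g * phi / sigmas ** 2)[:, None] * diff  # equation 6
    # Move a small amount against the gradient.
    return centers - eta * grad_mu, sigmas - eta * grad_sigma

# One update on random data (fixed seed for reproducibility).
rng = np.random.default_rng(0)
x = rng.normal(size=4)
t = np.array([1.0, 0.0])
centers = rng.normal(size=(3, 4))
sigmas = np.ones(3)
W = rng.normal(size=(4, 2))
eta = 0.01
new_centers, new_sigmas = sgd_step(x, t, centers, sigmas, W, eta)
```

As the contributions section notes, comparing such analytic gradients against a finite-difference approximation is a cheap way to check the code.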
Figure 4: (a) Mean square error of training (solid) and holdout (dotted) sets during gradient descent training. (b) The centers of six RBFs being moved by gradient descent, projected onto three random dimensions.

Figure 5: K-means initialized RBFs versus K-means initialized and gradient descent improved RBFs.

4 Conclusion

The network that generalized best was found using supervised clustering with all principal components, as summarized in table 1. For all of our tests in the results section we used the top 20 principal components, representing 80 percent of the total variance in the data set. Using all components increased the success of RBF networks using supervised clustering,
Table 1: Classification accuracy using supervised clustering.

                  20 Principal Components    All (512) Principal Components
                  mean      std              mean      std
  training data   0.926     0.02             1.00      0.00
  novel data      0.659     0.067            0.742     0.042

but did not greatly improve the success of RBF networks initialized with K-means. Gradient descent greatly improves generalization of RBF networks initialized with unsupervised clustering. However, as with all iterative error reduction techniques, it is more computationally intensive. Using gradient descent to improve the RBFs found with supervised clustering did not significantly reduce error. However, it is not always possible to use supervised clustering, because it might be the case that only a small portion of the entire data set has known classes. In this case, one would use unsupervised clustering on the entire data set to set the parameters of the radial basis functions, and use least squares to find the weights that minimize the error on the classified data.

Individual Contributions

Neil Alldrin: I helped write the ML classifier, helped a little with the back-prop code, helped with testing, helped write the writeup, and stuff. I had fun and stuff.

Andrew Smith: I wrote the gradient descent procedure. I wrote one version of a finite-difference error gradient approximator, used for code checking purposes. I ran the tests to generate the figures in this paper. I wrote k-means. I wrote the least-squares solver (it was easy in Matlab). This project was a bit too rushed for me to have more than a marginal amount of fun.

Doug Turnbull: I wrote much of the file loading, preprocessing, and testing code in addition to implementing the maximum likelihood module. I was also involved in running tests and writing this report. I transcended fun.

References

[1] Alldrin, N., Smith, A., Turnbull, D. (2003) A Three-Unit Network is All You Need to Discover Females, CSE253, UCSD.

[2] Bishop, C.M. (1995) Neural Networks for Pattern Recognition, Oxford University Press, New York.

[3] Hastie, T., Tibshirani, R., Friedman, J. (2001) The Elements of Statistical Learning, Springer-Verlag, New York.

[4] Kanungo, T., et al.
(2000) The Analysis of a Simple k-means Clustering Algorithm, Symposium on Computational Geometry 2000, Hong Kong.