Margin-Constrained Multiple Kernel Learning Based Multi-Modal Fusion for Affect Recognition


Shizhi Chen and YingLi Tian
Electrical Engineering Department, The City College of New York, New York, NY, USA
{schen,

Abstract

Recent advances in multiple-kernel learning (MKL) have shown its effectiveness in fusing multiple base features for object detection and recognition. However, MKL tends to select only the most discriminative base features and to ignore less discriminative base features that may provide complementary information. Moreover, MKL usually employs Gaussian RBF kernels to transform each base feature to its high dimensional space. Generally, base features from different modalities require different kernel parameters to obtain optimal performance. Therefore, MKL may fail to utilize the maximum discriminative power of all base features from multiple modalities at the same time. To address these issues, we propose a margin-constrained multiple-kernel learning (MCMKL) method that extends MKL with margin constraints and applies dimensionally normalized RBF (DNRBF) kernels for multi-modal feature fusion. The proposed MCMKL method learns the weights of different base features according to their discriminative power. Unlike conventional MKL, MCMKL incorporates less discriminative base features by assigning them smaller weights when constructing the optimal combined kernel, so that we can take full advantage of the complementary features from different modalities. We validate the proposed MCMKL method for affect recognition from face and body gesture modalities on the FABO dataset. Our extensive experiments demonstrate favorable results compared to the existing work and to the MKL-based approach.

Keywords: multimodal fusion; affect recognition; multiple kernel learning

I. INTRODUCTION

Recent research demonstrates that affect recognition from multiple modalities achieves better performance [6, 0,, 5, ]. However, most of the literature on affect recognition is either based only on features extracted from a single modality [, 6, ] or simply concatenates features from different modalities [6, 0, 5, ]. How to effectively fuse features from different modalities is still an open question.

The most popular methodology for feature-level fusion is the simple concatenation of feature vectors from different modalities to form one large feature vector [6, 0, 5, ]. However, this fusion method requires careful design in the selection of features and parameters, such as the feature dimension. This is essentially manual feature selection.

Shan et al. [15] apply Canonical Correlation Analysis (CCA) to project facial features and body gesture features into a low dimensional space that maximizes their correlation. The authors then simply concatenate the projected feature vectors to train a Support Vector Machine (SVM) classifier for affect recognition. However, it is difficult to extend this method to more than two types of features, or base features hereafter, since it needs to find the correlated space between a pair of base features. In this paper, we define base features as a set of base descriptors, which can be combined to form an optimal kernel. In practice, affect recognition is very likely to involve more than two base features due to the complexity of the problem.

Gunes and Piccardi [10] select the frames that are common apex frames in both the face and body gesture modalities, and then perform a direct concatenation to combine base features from both modalities. However, the apex frame selection is based on knowledge of the temporal dynamics, which is usually very difficult to predict in advance.

The direct concatenation fusion method is vulnerable to contamination from less discriminative base features, especially those with large feature dimensions. Multiple-kernel learning (MKL) [1, 13, 18] is able to partially eliminate some drawbacks of the direct concatenation fusion method. MKL provides shielding from the contamination of less discriminative base features by assigning very large weights to the most discriminative base features. It has recently been shown to be effective in fusing multiple base features for object detection and recognition [18, 19]. However, MKL tends to select only the most discriminative base features and to ignore the other, less discriminative base features. Therefore, the MKL method cannot take full advantage of all types of base features from multiple modalities, which provide complementary information. Moreover, MKL usually employs Gaussian RBF kernels to map each base feature to its high dimensional space H. Generally, base features from different modalities require different kernel parameters to achieve their optimal performance. One reason is the significantly different feature dimensions across modalities. Therefore, MKL may not utilize the maximum discriminative power of all types of features from multiple modalities at the same time.

In order to address these issues, we propose a margin-constrained multiple-kernel learning (MCMKL) method that applies (1) additional margin constraints and (2) dimensionally normalized RBF kernels (DNRBF). The margin of a separating hyperplane in the SVM literature [3] is defined as the perpendicular distance between the support vectors of the two classes. A large margin obtained on the training set usually indicates that the underlying feature is discriminative. If the underlying feature is not discriminative, the margin is usually small. We therefore use the margin to measure the discriminative power of each base feature. These margins then serve as a rough guide to MKL when learning each base feature's weight.

By dimensionally normalizing the RBF kernels, MCMKL eliminates the influence of the differences in feature dimension across modalities. The optimal high dimensional space H of each base feature from different modalities then corresponds to a similar kernel parameter. Therefore, MCMKL is able to utilize the maximum discriminative power of all feature types from multiple modalities.

Unlike the conventional MKL method, our proposed MCMKL is able to learn the most discriminative base features while still considering other base features, which are less discriminative but can potentially provide complementary information. We apply the MCMKL method to affect recognition from multiple modalities (e.g., both face and body gesture). Extensive experimental results on the FABO (Face and Body Gesture) facial expression database [9] demonstrate the effectiveness of the proposed method for multi-modal feature fusion.

II. MARGIN-CONSTRAINED MULTIPLE KERNEL LEARNING

A. Multiple Kernel Learning

Given a set of base features and their associated base kernels $K_k$, we want to find the optimal kernel combination $K_{opt} = \sum_k d_k K_k$, where $d_k$ is the weight for the $k$-th base feature. The kernel combination $K_{opt}$ approximates the best trade-off between discriminative power and invariance for a specific application. Equations (1) to (3) show the objective cost function $f$ and its constraints for the multiple-kernel learning (MKL) proposed in [18].

$$\min_{w,\,\xi,\,d} \; f = \frac{1}{2} w^T w + C \sum_i \xi_i + \sum_k \sigma_k d_k$$

subject to

$$y_i \left( w^T \Phi(x_i) + b \right) - 1 + \xi_i \ge 0 \tag{1}$$

$$\xi_i \ge 0; \quad d_k \ge 0 \tag{2}$$

$$A d \ge p \tag{3}$$

where $\Phi(x_i)$ is the combined feature corresponding to $K_{opt}$ in a high dimensional space for sample $x_i$, as shown in Eq. (4). Equivalently, $K_{opt}$ can be expressed as in Eq. (5), where $\Phi_k(x_i)^T \Phi_k(x_j)$ forms the $k$-th base kernel $K_k$.

$$K_{opt}(x_i, x_j) = \Phi(x_i)^T \Phi(x_j) \tag{4}$$

$$K_{opt}(x_i, x_j) = \sum_k d_k \, \Phi_k(x_i)^T \Phi_k(x_j) \tag{5}$$

The optimization can be carried out in an SVM framework subject to an additional regularization term on the weights $d$ in the objective function. In order to handle large scale problems involving many base kernels, the minimax optimization strategy [5, 13, 18] is used in two iteration steps. In the first step, the feature weights $d_k$ are fixed, i.e., $K_{opt} = \sum_k d_k K_k$ is fixed. The optimization problem of Eq. (1) can then be solved by any standard SVM solver using its dual form as in Eq. (6), since the term $\sum_k \sigma_k d_k$ is simply a constant.

$$\max_{\alpha} \; f = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K_{opt}(x_i, x_j) + \sum_k \sigma_k d_k \tag{6}$$

subject to

$$0 \le \alpha_i \le C; \quad \sum_i \alpha_i y_i = 0 \tag{7}$$

In the second iteration step, with $\alpha$ fixed, projected gradient descent is employed to find the updated feature weights $d_k$, as shown in Eqs. (8) and (9).

$$\frac{\partial f}{\partial d_k} = \sigma_k - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K_k(x_i, x_j) \tag{8}$$

$$d_k^{new} = d_k^{old} - \frac{\partial f}{\partial d_k} \tag{9}$$

These two iteration steps are repeated until convergence or until the maximum number of iterations is reached. The final weights of the base features are then determined, and we train SVM classifiers using the optimal combined kernel given by these final weights. The class label of a new sample $x$ is determined by the sign of Eq. (10).

$$g(x) = \sum_i \alpha_i y_i \sum_k d_k K_k(s_i, x) + b \tag{10}$$

where $s_i$ is the $i$-th support vector. The multi-class problem can be solved by a one vs. one or one vs. all strategy, as with a standard SVM.

The MKL method provides an elegant framework for fusing many base features by assigning a larger feature weight to the most discriminative base feature. Compared to the direct concatenation method, MKL can avoid contamination from less discriminative base features, especially when those features have large dimensions. However, the MKL method tends to select very few base features from the feature pool.
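The second iteration step described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: it assumes the dual variables have already been obtained from some SVM solver in the first step, the function names `weight_gradient` and `update_weights` are hypothetical, and the step size and the toy kernels are assumptions made for the example. It computes the gradient of Eq. (8) and applies the update of Eq. (9), clipping the weights at zero.

```python
def weight_gradient(alpha, y, base_kernels, sigma):
    """Gradient of the objective w.r.t. each kernel weight, with alpha fixed.

    For base kernel k: sigma_k - 0.5 * sum_ij alpha_i alpha_j y_i y_j K_k[i][j].
    """
    ay = [a * label for a, label in zip(alpha, y)]
    n = len(ay)
    grads = []
    for K_k, sigma_k in zip(base_kernels, sigma):
        quad = sum(ay[i] * ay[j] * K_k[i][j] for i in range(n) for j in range(n))
        grads.append(sigma_k - 0.5 * quad)
    return grads


def update_weights(d, grads, step=0.1):
    """One projected gradient step; weights are clipped at zero (d_k >= 0).

    The step size is an assumption of this sketch.
    """
    return [max(0.0, d_k - step * g) for d_k, g in zip(d, grads)]


# Toy example with two training samples of opposite class.
alpha = [1.0, 1.0]               # hypothetical dual variables from an SVM solver
y = [1.0, -1.0]
K_a = [[1.0, 0.0], [0.0, 1.0]]   # base kernel that separates the two samples
K_b = [[1.0, 1.0], [1.0, 1.0]]   # base kernel that cannot tell them apart
sigma = [0.5, 0.5]

grads = weight_gradient(alpha, y, [K_a, K_b], sigma)
d_new = update_weights([1.0, 1.0], grads)
print(grads, d_new)
```

In this toy case the gradient is negative for the kernel that separates the two samples and positive for the uninformative one, so a single step raises the first weight and lowers the second, which is the selection behavior the text attributes to MKL.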
MKL often selects only one or two base features, which are discriminative in a particular high dimensional space H. Moreover, the kernel parameter associated with the optimal high dimensional space H may be significantly different for base features from different modalities. Therefore, traditional MKL cannot utilize the maximum discriminative power of each base feature at the same time.

B. Margin Constraints

To address these issues, we propose a Margin-Constrained Multiple Kernel Learning (MCMKL) method. It is motivated by the observation that a more discriminative base feature usually finds a hyperplane with a larger margin separating the support vectors of opposite classes during SVM training. The hyperplane of base feature a in Figure 1(a) has a larger margin than that of base feature b in Figure 1(b) when separating the solid dot class from the triangle class. This suggests that base feature a is more discriminative than base feature b for classifying the solid dot and triangle classes.

Figure 1: (a) The hyperplane of base feature a has a large separation margin between the solid dot class and the triangle class; (b) the hyperplane of base feature b has a small margin between the solid dot class and the triangle class.

Therefore, the separation margin of each base feature in its high dimensional space provides a rough measurement of the base feature's discriminative power. Although rough, these measurements can effectively guide MKL when searching for the optimal feature combination. The separation margin of each base feature can be calculated using Eq. (11) as the inverse square root of its own objective cost function.

$$m_k = \frac{1}{\|w_k\|} \approx \frac{1}{\sqrt{f_k}} = \frac{1}{\sqrt{\frac{1}{2} w_k^T w_k + C \sum_i \xi_i + \sigma_k d_k}} \tag{11}$$

After obtaining the separation margin $m_k$ for each base feature, we select one of the base features as the reference base feature, which has feature weight $d_s$ and margin $m_s$. The weight $d_k$ of the $k$-th base feature is constrained to a range with lower bound $LB_k$ and upper bound $UB_k$, determined by the ratio between the margins $m_k$ and $m_s$ during training. $LB_k$ and $UB_k$ can be calculated as in Eq. (13). The additional weight constraints in Eq. (12) are enforced during the multiple-kernel learning.
$$LB_k \le d_k \le UB_k \tag{12}$$

$$LB_k = \left( \frac{m_k}{m_s} \right)^n d_s; \quad UB_k = \left( \frac{m_k}{m_s} \right)^n d_s \, (1 + \delta) \tag{13}$$

where $n$ is a parameter that controls how sensitive the feature weight ratio between $d_k$ and $d_s$ is to the margins. As $n$ increases, the values of $LB_k$ and $UB_k$ become more sensitive to the ratio of $m_k$ and $m_s$. $\delta$ is a constant that controls the width of the range of the feature weight $d_k$. In our experiments, we set $n$ to.5 and $\delta$ to.

C. Dimensionally Normalized Kernel

The Gaussian RBF kernel is one of the most popular non-linear kernels due to its excellent performance in numerous applications. It is defined in Eq. (14).

$$K(x_i, x_j) = \exp\left( -\gamma \sum_{q=1}^{D} (x_{i,q} - x_{j,q})^2 \right) \tag{14}$$

where $x_i$ and $x_j$ are the $i$-th and $j$-th samples, with $x_{i,q}$ and $x_{j,q}$ as the $q$-th elements of their feature vectors. $D$ is the sample's feature dimension. $\gamma$ is the RBF kernel parameter, which determines the mapping from a low dimensional feature space L to a high dimensional space H.

Assuming that the feature vectors $x_i$ and $x_j$ have been properly normalized between 0 and 1 along each feature dimension [4], the kernel value decreases as the feature dimension increases at a fixed $\gamma$, as shown in Eq. (14). Hence, Eq. (14) suggests an inverse relationship between the optimal $\gamma$ and the feature dimension. This intuition is confirmed in our experiments, which are analyzed in Section 4.

In MKL fusion, base features from different modalities may have significantly different feature dimensions, which results in very different optimal $\gamma$ values for each base feature. Therefore, MKL cannot utilize the maximum discriminative power of all base features from different modalities at the same time. We can treat $\gamma$ as a feature selection parameter in MKL, which selects only a few base features at a time. This intuition also explains the observation reported in [18] that MKL tends to select only a very few of the most discriminative base features. Therefore, MKL cannot take full advantage of all types of features from multiple modalities. Based on these observations, we propose a dimensionally normalized RBF kernel (DNRBF), which is defined in Eq. (15).

$$K(x_i, x_j) = \exp\left( -\frac{\gamma}{D} \sum_{q=1}^{D} (x_{i,q} - x_{j,q})^2 \right) \tag{15}$$

This normalization step is essential to eliminate the effect of the feature dimension on the selection of $\gamma$, so that all base features have a similar optimal $\gamma$. Therefore, MCMKL can utilize the maximum discriminative power of all base features from multiple modalities.

III. MULTI-MODAL FUSION FOR AFFECT RECOGNITION

Affect recognition from multiple modalities is a challenging problem. Our study focuses on the fusion of features from visual modalities, i.e., the face and body gesture modalities.
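Since the fusion described in this section relies on the two ingredients introduced in Section II, it may help to see them in code first: the dimensionally normalized RBF kernel and the margin-derived bounds on the feature weights. The following is a minimal pure-Python sketch under stated assumptions, not the authors' implementation. The function names are hypothetical, and the values used for the margin sensitivity `n` and the range width `delta` are illustrative only, since the values used in the experiments are not legible in this copy.

```python
import math


def rbf(x, z, gamma):
    """Standard Gaussian RBF kernel: exp(-gamma * sum_q (x_q - z_q)^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def dnrbf(x, z, gamma):
    """Dimensionally normalized RBF: gamma is divided by the dimension D."""
    return math.exp(-(gamma / len(x)) * sum((a - b) ** 2 for a, b in zip(x, z)))


def weight_bounds(m_k, m_s, d_s, n, delta):
    """Lower and upper bound on d_k from the margin ratio (m_k / m_s)^n."""
    lower = (m_k / m_s) ** n * d_s
    return lower, lower * (1.0 + delta)


def clip_weight(d_k, lower, upper):
    """Enforce lower <= d_k <= upper."""
    return min(max(d_k, lower), upper)


# Same per-dimension difference (0.5), very different feature dimensions.
low_dim = ([0.0] * 4, [0.5] * 4)
high_dim = ([0.0] * 400, [0.5] * 400)
gamma = 1.0

print(rbf(*low_dim, gamma), rbf(*high_dim, gamma))      # collapses as D grows
print(dnrbf(*low_dim, gamma), dnrbf(*high_dim, gamma))  # independent of D

# Hypothetical margins and parameters, for illustration only.
lower, upper = weight_bounds(m_k=0.5, m_s=1.0, d_s=1.0, n=2, delta=0.5)
print(lower, upper, clip_weight(0.9, lower, upper))
```

With features scaled to [0, 1], the plain RBF value shrinks toward zero as the dimension grows at a fixed gamma, while the normalized kernel gives the same value for both pairs. The bounds show how a base feature whose margin is half that of the reference feature is confined to a proportionally smaller weight range.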


Different from conventional approaches to fusing features from multiple modalities, which simply concatenate all feature vectors from different sources and feed the concatenated feature vector into a classifier such as an SVM, we apply a margin-constrained multiple-kernel learning (MCMKL) method to fuse features from both the face and body gesture modalities. MCMKL can effectively combine all types of features for affect recognition by assigning an appropriate feature weight to each type of feature and calculating the optimal kernel for affect recognition.

A. Overview of MCMKL-based Affect Recognition

Figure 2 shows an overview of our affect recognition system, which consists of five major parts, i.e., facial feature extraction, body gesture feature extraction, expression temporal segmentation, temporal normalization, and MCMKL-based classification.

Figure 2: The overview of our proposed MCMKL-based multimodal fusion for affect recognition through both face and body gestures.

Two types of facial features, i.e., Image-HOG and MHI-HOG [6], are extracted in our experiments. Here, HOG stands for Histogram of Oriented Gradients [8], and MHI stands for Motion History Image [2, 17]. Image-HOG features capture facial appearance changes, while MHI-HOG features represent facial motion information. Four types of gesture features are extracted: location features, motion area features, and Image-HOG and MHI-HOG features around both hands. Each expression in a video sequence is first temporally segmented into onset, apex, offset and neutral phases [6]. We then perform a temporal normalization procedure to handle the different temporal resolutions of expressions. Finally, the MCMKL method is employed to find the optimal feature combination and recognize affects.

B. Facial Features

An Active Shape Model [7, 20] is first applied to track 53 facial landmark points, including the brows, eyes, nose, mouth, and face contour, as shown in Figure 3(a). We then locate the corresponding positions of the facial points in the Motion History Image (MHI), as shown in Figure 3(b). The next step is to extract Image-HOG and MHI-HOG features on the original video frames and the corresponding MHI images respectively. We use 48 by 48 pixel patches, with the number of orientation bins equal to 6 and 8 for the Image-HOG and the MHI-HOG features respectively. The MHI image captures the motion information of each selected facial point, while the original video frame conveys the appearance information. Finally, we concatenate the Image-HOG descriptors of all 53 facial points and apply Principal Component Analysis (PCA) to reduce the feature dimension of the concatenated Image-HOG feature from 86 to 40. Similarly, we obtain the MHI-HOG descriptor for the corresponding frame and reduce the feature dimension of the concatenated MHI-HOG from 386 down to 40 for each frame.

Figure 3: (a) Facial landmark point tracking; (b) Motion History Image.

C. Body Gesture Features

To extract body gesture features, we first track both hands and the head in an expression video. The head position is simply the center point of the facial points from the ASM model (see Figure 4(b)). To track the hands, we apply skin color detection [11] followed by the removal of the face region, which has already been tracked by the ASM model, as shown in Figure 4(a) and 4(b).

Figure 4: (a) Skin color detection; (b) head and hand positions in the original video frame; (c) head and hand positions in the MHI image.

In addition to the positions of the head and hands, we also calculate the motion areas (e.g., the number of motion pixels in the MHI image) within the detected regions of the head and hands. Figure 4(c) shows the head and hand regions in an MHI image. We further extract Image-HOG and MHI-HOG features in the hand regions by uniformly sampling interest points. A bag of words representation with a codebook size of 80 is then used to describe the distribution of the Image-HOG and MHI-HOG features of the hand regions. Finally, we perform PCA to reduce their feature dimensions.

D. Temporal Segmentation

An expression is a sequence of facial movements that can be roughly described by neutral, onset, apex and offset temporal segments.

Figure 5: Temporal segmentation of an expression video.


Figure 5 shows a sample of the ground truth temporal segmentation of an expression video. The temporal segmentation procedure is necessary to accurately model the expression dynamics, which have been proven crucial for facial behavior interpretation [14]. In our experiments, we simply use the ground truth temporal segmentation, and affect recognition is performed on the complete expression cycle, i.e., onset, apex and offset.

E. Temporal Normalization

The temporal resolution of an expression generally differs when it is performed by different people. Even when the same expression is performed by the same person at a different time, the temporal resolution may not be the same. To resolve this time resolution issue in expression videos, we adopt a temporal normalization approach that normalizes all types of features over a complete expression cycle. The temporal normalization over an expression cycle can be easily implemented by linear interpolation of each frame's feature vector along the temporal direction.

F. MCMKL Based Multi-Modal Feature Fusion

Features from multiple modalities may have different forms. Therefore, the dimensions of different types of features may vary significantly. Our proposed margin-constrained multiple kernel learning (MCMKL) method can effectively fuse all base features from different modalities, i.e., the face and body gesture modalities, by assigning a feature weight to each base feature. We concatenate the Image-HOG and the MHI-HOG of the facial points as one base feature, i.e., the face feature. The other four base features are from the gesture channel, i.e., location, motion area, and the Image-HOG and MHI-HOG features of both hands. Using the margin of each individual base feature as a guide, along with the DNRBF to synchronize the optimal kernel parameter, MCMKL learns the optimal combined kernel by selecting a proper weight for each base feature during the fusion. For our multi-class application, we choose one vs. one classification and then use a maximum voting scheme to label testing samples.
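Two steps of this pipeline are simple enough to sketch directly: temporal normalization by linear interpolation, and one vs. one classification with maximum voting. The code below is a minimal pure-Python sketch and not the authors' implementation. The function names, the target number of frames, and the stand-in `toy_pairwise_winner` (which replaces a trained binary classifier) are all assumptions made for illustration; the eight class labels are the expression categories used in the experiments.

```python
from collections import Counter
from itertools import combinations


def temporal_normalize(frames, target_len):
    """Resample per-frame feature vectors to target_len frames.

    frames is a list of T equal-length feature vectors (T >= 1); values are
    linearly interpolated along the temporal direction.
    """
    if target_len < 2:
        raise ValueError("target_len must be at least 2")
    t = len(frames)
    out = []
    for i in range(target_len):
        pos = i * (t - 1) / (target_len - 1)  # fractional source frame index
        lo = int(pos)
        hi = min(lo + 1, t - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


def one_vs_one_vote(pairwise_winners):
    """Return the label with the most pairwise wins (maximum voting).

    Ties are not handled specially in this sketch.
    """
    return Counter(pairwise_winners.values()).most_common(1)[0][0]


CLASSES = ["Anger", "Anxiety", "Boredom", "Disgust",
           "Fear", "Happiness", "Puzzlement", "Uncertainty"]
PAIRS = list(combinations(CLASSES, 2))  # one binary model per class pair


def toy_pairwise_winner(pair):
    """Hypothetical stand-in for a trained binary classifier's decision."""
    return "Fear" if "Fear" in pair else pair[0]


cycle = [[0.0, 1.0], [2.0, 1.0], [4.0, 3.0]]  # 3 frames, 2-D features
print(temporal_normalize(cycle, 5))
print(len(PAIRS), one_vs_one_vote({p: toy_pairwise_winner(p) for p in PAIRS}))
```

Resampling every expression cycle to the same number of frames gives each base feature a fixed length regardless of how fast the expression was performed. For eight classes, the one vs. one strategy needs one binary model for each of the 28 class pairs, and the label with the most pairwise wins is assigned to the test sample.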
IV. EXPERIMENTS

A. Experimental Setups

We use a bi-modal face and body gesture database, i.e., the FABO database [9], in our experiments. The database was collected using two cameras, i.e., one for the face and one for body gesture, in a laboratory environment. A sample video is shown in Figure 6. However, we only employ the videos captured by the body camera to extract features for both modalities, since the videos from the body camera already contain both face and body gesture information. After removing the categories in the database with a very small number of samples, there are 8 expression categories, i.e., Anger, Anxiety, Boredom, Disgust, Fear, Happiness, Puzzlement, and Uncertainty. The total number of videos used in our experiment is 55, and each video has up to 4 complete expression cycles. We randomly divide the videos into three subsets. Two subsets are used in training and the remaining subset is used in testing. No video appears in both training and testing, but the same subject may appear in both due to the random selection process.

Figure 6: Sample video in the FABO database recorded by the body (top) and face (bottom) cameras.

B. Comparison to Existing Work and MKL

To evaluate the effectiveness of the proposed MCMKL fusion method, we compare it to the most recent work on the FABO database [6], using the same features and the same training and testing datasets. We further compare MCMKL with the MKL method. The results of the comparison are displayed in Figure 7.

Five base features are used in our experiments: the face feature, the location feature, the motion area feature, and the Image-HOG and MHI-HOG features of both hands. The face feature is the concatenation of the Image-HOG and the MHI-HOG from the face modality. Table 1 shows the corresponding feature dimension of each base feature. These base features are fused through the concatenation, MKL, and MCMKL methods.

Table 1: The feature dimension for each base feature, i.e., Face, Loc (location), MA (motion area), Img-HOG (Image-HOG from gesture), and MHI-HOG (from gesture).

Base Feature         Face    Loc    MA    Img-HOG    MHI-HOG
Feature Dimension

Figure 7: The average performance of the top ranks obtained by sweeping the kernel parameter log(γ) from -5 to 8 for each of the three methods, i.e., concatenation (cvpr4hb), MKL, and MCMKL.

To make a fair comparison, we sweep the kernel parameter log(γ) from -5 to 8 and select the top performances for each fusion method. We then rank these performances in descending order of accuracy. We repeat the same experiment for three different subsets, and the average performances are reported in Figure 7. Figure 8 shows more details of the rank 1 comparison, i.e., the comparison of the best performances for the three fusion methods: MCMKL, MKL, and direct feature concatenation (cvpr4hb).

MCMKL outperforms the other two methods over all the top ranks, as shown in Figure 7. In the rank 1 comparison in Figure 8, the proposed MCMKL achieves better performance than the concatenation method. Note that the five base features have been carefully selected, and the parameters, e.g., the PCA projection dimensions, have also been carefully chosen for the concatenation fusion method in [6]. MCMKL, on the other hand, effectively selects those feature vectors, and it still outperforms the concatenation fusion method by an average of.3%.

Figure 8: (a) The best average performance obtained by sweeping the kernel parameter log(γ) from -5 to 8 for each fusion method, i.e., cvpr4hb, MKL, and MCMKL; (b) the best performance of the three fusion methods on three different testing subsets.

Our proposed MCMKL outperforms the traditional MKL method by an average recognition rate of 5.7% on these base features, as shown in Figure 8(a). Figure 8(b) shows the performance comparison over three different testing subsets.

C. Evaluate Feature Weight Distribution

In this section, we verify that our proposed MCMKL is more effective than MKL at incorporating less discriminative features, which provide complementary information to the base features with the maximum discriminative power. We select the kernel parameter γ of -5 for MKL and - for MCMKL, which yield the best performance for MKL and MCMKL respectively. Since we choose the one vs. one strategy for our multi-class expression classification, the total number of models we need to train is $C_n^2$, where $n$ is the total number of expression classes in the dataset. Therefore, we have trained 28 models for 8 categories of expressions, in which each model contains one set of feature weights for the base features, i.e., face, location (loc), motion area (MA), and both hands' Image-HOG (imghog) and MHI-HOG (mhihog) features. We then take the mean feature weight of the 28 models for each base feature, followed by proper normalization.

Figure 9 shows the distribution of the average feature weights over the 5 base feature types for the MKL and MCMKL methods. As expected, MKL selects only the most discriminative base feature, i.e., the face feature. More specifically, it assigns more than 98% of the total feature weight to the face feature. The MKL method ignores all the gesture features, i.e., location, motion area, etc., even though these gesture features have been proven to provide complementary information to the face feature [6]. As shown in Figure 9, the proposed MCMKL obtains a more reasonable feature weight distribution. Similar to the MKL method, it recognizes the face feature as the most discriminative base feature by assigning it the largest feature weight of 48%. At the same time, it also incorporates the other, less discriminative gesture features according to their discriminative power.

Figure 9: Comparing the average feature weight distribution of the 5 base features, i.e., face, location (loc), motion area (MA), and both hands' Image-HOG (imghog) and MHI-HOG (mhihog) features, for the MKL and MCMKL methods.

Figure 9 verifies the effectiveness of the proposed margin constraints and the dimensionally normalized RBF kernel (DNRBF). It is clear that the additional constraints on the feature weights, set according to the separation margin of each base feature, force the model to assign small weights to the less discriminative base features. However, it may not be intuitive how the DNRBF contributes to a more reasonable feature weight assignment.

Figure 10: The effect of the feature dimension of three base features on the selection of the optimal RBF kernel parameter γ.
Before we provide such intuition, we examine the relationship between the optimal γ value and the feature dimension experimentally. We select three base features, i.e., the facial points' MHI-HOG, the facial points' Image-HOG, and the location feature. We then manually vary the PCA dimension of the Image-HOG and the MHI-HOG, or the number of normalization frames of the location feature, so that their feature dimensions gradually increase. We then use the SVM's 5-fold cross validation to find the optimal kernel parameter γ for each of the three base features at the selected feature dimension. Figure 10 suggests an inverse relationship between the optimal γ value and the feature dimension, which verifies our earlier analysis of the RBF kernel's dependence on the feature dimension.

In the experiments of the last section, the most discriminative feature, i.e., the face feature, has the optimal γ of -5. Since the other, less discriminative gesture features have a much smaller feature dimension, as we can see from Table 1, their optimal γ value is much larger. Therefore, at the γ of -5, the other gesture features have almost no discriminative power, since their optimal γ values are very far away from -5. As a result, the MKL method assigns almost zero feature weight to the other gesture features. After we perform the dimensional normalization as in Eq. (15), the optimal γ values become very close for different base features regardless of the differences in their feature dimensions. Therefore, MCMKL can utilize the maximum discriminative power of all base features at the same time. That is another reason why the MCMKL method can incorporate the other, less discriminative base features, which provide complementary information.

D. Contamination from Less Discriminative Features

In this section, we examine the contamination from the less discriminative base features, particularly those with large feature dimensions. From the feature weight distribution in Figure 9, we know that the Image-HOG and MHI-HOG of the hands are the least discriminative features. So we intentionally increase their feature dimension to 00 by including more PCA dimensions. At the same time, we also decrease the dimension of the most discriminative feature, i.e., the face feature, down to 90. We again sweep the kernel parameter log(γ) and select the top 0 performances for each fusion method, i.e., concatenation, MKL, and MCMKL. We then rank these 0 performances in descending order of accuracy. The experimental results are shown in Figure 11. We observe that the rank result of the MCMKL method outperforms the concatenation fusion method by almost 0%, which indicates that the MCMKL method is more effective at shielding against contamination from the less discriminative base features than the concatenation fusion method.

V. CONCLUSION

In this paper, we have proposed a margin-constrained multiple-kernel learning (MCMKL) method, which extends the multiple-kernel learning (MKL) method by constraining the range of each feature weight according to the separation margin of each base feature.
The dimensionally normalized RBF kernel (DNRBF) is also proposed and employed in MCMKL in order to fuse features from multiple modalities, which may have very different feature dimensions. Our experimental results compare favorably with the state-of-the-art results on the FABO database. We also demonstrate a significant improvement over the conventional MKL method.

Figure 11: The average performance of the top 0 ranks obtained by sweeping the kernel parameter γ for each of the three methods, i.e., concatenation (cvpr4hb), MKL, and MCMKL.

ACKNOWLEDGEMENT

This work was supported in part by NSF IIS

REFERENCES

[1] F. Bach, G. Lanckriet, M. Jordan, Multiple kernel learning, conic duality and the SMO algorithm, NIPS, 2004.
[2] A. Bobick and J. Davis, The recognition of human movement using temporal templates, PAMI, 2001.
[3] C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, 2(2), 121-167, 1998.
[4] C. Chang and C. Lin, LIBSVM: a library for support vector machines, 2001. Software available at
[5] O. Chapelle, V. Vapnik, O. Bousquet, S. Mukherjee, Choosing Multiple Parameters for Support Vector Machines, Machine Learning, 2002.
[6] S. Chen, Y. Tian, Q. Liu, D. Metaxas, Recognizing Expressions from Face and Body Gesture by Temporal Normalized Motion and Appearance Features, IEEE Int'l Conf. Computer Vision and Pattern Recognition Workshop for Human Communicative Behavior Analysis (CVPR4HB), 2011.
[7] T. Cootes, C. Taylor, D. Cooper and J. Graham, Active Shape Models: Their Training and Application, Computer Vision and Image Understanding, 1995.
[8] N. Dalal, B. Triggs, Histogram of Oriented Gradients for Human Detection, CVPR, 2005.
[9] H. Gunes and M. Piccardi, A bimodal face and body gesture database for automatic analysis of human nonverbal affective behavior, International Conference on Pattern Recognition, 2006.
[10] H. Gunes and M. Piccardi, Automatic Temporal Segment Detection and Affect Recognition From Face and Body Display, IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, Vol. 39, No.
[11] J. Kovac, P. Peer, F. Solina, Human Skin Colour Clustering for Face Detection, EUROCON, Computer as a Tool, 2003.
[12] M. Pantic, L. Rothkrantz, Automatic Analysis of Facial Expressions: The State of the Art, PAMI, 2000.
[13] A. Rakotomamonjy, F. Bach, S. Canu, Y. Grandvalet, More Efficiency in Multiple Kernel Learning, ICML, 2007.
[14] K. Schmidt and J. Cohn, Human Facial Expressions as Adaptations: Evolutionary Questions in Facial Expression Research, Yearbook of Physical Anthropology, 2001.
[15] C. Shan, S. Gong and P. McOwan, Beyond facial expressions: learning human emotion from body gestures, British Machine Vision Conference, 2007.
[16] Y. Tian, T. Kanade, J. Cohn, Facial Expression Analysis, Handbook of Face Recognition, pp. , Springer, 2005.
[17] Y. Tian, L. Cao, Z. Liu, and Z. Zhang, Hierarchical Filtered Motion for Action Recognition in Crowded Videos, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 0.
[18] M. Varma, D. Ray, Learning The Discriminative Power-Invariance Trade-Off, ICCV, 2007.
[19] A. Vedaldi, V. Gulshan, M. Varma, A. Zisserman, Multiple Kernels for Object Detection, ICCV, 2009.
[20] Y. Wei, Research on Facial Expression Recognition and Synthesis, Master Thesis, 2009. Software available at:
[21] Z. Zeng, M. Pantic, G. Roisman, T. Huang, A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions, PAMI, 2009.
[22] X. Zhou, B. Bhanu, Feature fusion of side face and gait for video-based human identification, Pattern Recognition 41, 2008.
