Large-Scale Multimodal Semantic Concept Detection for Consumer Video


Shih-Fu Chang, Dan Ellis, Wei Jiang, Keansub Lee, Akira Yanagawa
Columbia University, New York, NY
{sfchang, dpwe, wjiang, kslee}

Alexander C. Loui, Jiebo Luo
Eastman Kodak Company, Rochester, NY
{alexander.loui, jiebo.luo}@kodak.com

ABSTRACT

In this paper we present a systematic study of automatic classification of consumer videos into a large set of diverse semantic concept classes, which have been carefully selected based on user studies and extensively annotated over 1300+ videos from real users. Our goals are to assess the state of the art of multimedia analytics (including both audio and visual analysis) in consumer video classification and to discover new research opportunities. We investigated several statistical approaches built upon global/local visual features, audio features, and audio-visual combinations. Three multimodal fusion frameworks (ensemble, context fusion, and joint boosting) are also evaluated. Experiment results show that visual and audio models perform best for different sets of concepts. Both provide significant contributions to multimodal fusion, via expansion of the classifier pool for context fusion and of the feature bases for feature sharing. The fused multimodal models are shown to significantly reduce the detection errors (compared to single-modality models), resulting in a promising accuracy of 83% over diverse concepts. To the best of our knowledge, this is the first work on systematic investigation of multimodal classification using a large-scale ontology and a realistic video corpus.

Categories and Subject Descriptors: Information Search and Retrieval; Multimedia Databases; Video Analysis

General Terms: Algorithms, Management, Performance

Keywords: Video classification, semantic classification, consumer video indexing, multimedia ontology.

1. INTRODUCTION

With the explosive growth of user-generated content, there has been tremendous interest in developing next-generation technologies for organizing and indexing multimedia content, including photos, videos, and music. One of the major efforts in recent years involves automatic semantic classification of media content into a large number of predefined concepts that are both relevant to practical needs and amenable to automatic detection. The outcomes of such classification processes are high-level semantic descriptors, analogous to textual terms describing document content, and can be very useful for developing powerful retrieval or filtering systems for consumer media.

Large-scale semantic classification systems require several critical components. First, a large ontology is needed to define the list of important concepts and the relations among them. Such ontologies may be constructed from the results of formal user studies or from data mining of user interaction with online systems. Second, a large corpus of realistic data is needed for training and testing automatic classifiers. An annotation process is also needed to obtain labels for the defined concepts over the corpus. Third, signal processing and machine learning tools are needed to develop robust classifiers (also called models or concept detectors) that can detect the presence of each concept in any test data. Recently, developments of such large-scale semantic classification systems have been reported for generic classes (e.g., car, airplane, flower) [7] and for multimedia concepts in news videos [15]. In the consumer media domain, only limited efforts have been made to categorize consumer photos or videos into a small number of classes.
In a companion paper [10], we have described a systematic effort to establish the first large-scale ontology and benchmark data set for consumer video classification. It consists of over 100 relevant and potentially detectable concepts, and annotation of 25 selected concepts over a set of 1338 consumer videos. The availability of such a large ontology and rigorously annotated benchmark data set brings a unique opportunity for evaluating state-of-the-art machine learning tools and multimedia analytics in automatic semantic classification.

In this paper, we present several novel statistical models and multimodal fusion frameworks for automatic audio-visual content classification. On the visual side, we investigate approaches using both global and local features, as well as ensemble fusion with multiple parameter sets. On the audio side, we develop techniques based on simple Gaussian models as well as more advanced statistical methods such as probabilistic latent semantic analysis. One of our main goals is to understand the individual contributions of audio and visual models and to find the optimal fusion strategies. To this end, we have developed and evaluated several fusion frameworks, ranging from simple weighted averaging and multimodal context fusion by boosted conditional random fields to multi-class joint boosting. Through extensive experiments, we demonstrate promising detection accuracy of the proposed classification methods and, more valuably, important insights about the contributions of individual algorithms and modalities in detecting a diverse set of semantic concepts. The multimodal multi-concept classification system is shown to reduce the detection errors by as much as 15% (in terms of equal error rate) compared to alternatives using single modalities only. Audio models, though not as effective as their visual counterparts in terms of average performance, play an indispensable role: several concepts rely exclusively on the audio models, and the audio models provide significant contributions to the performance gains in model fusion.

We briefly review the ontology and semantic concepts for consumer videos in Sec. 2. Visual and audio models are described in Sec. 3 and 4, respectively. We present three multimodal fusion frameworks in Sec. 5. Extensive experiments for performance evaluation and a discussion of results are included in Sec. 6.

2. SELECTION OF THE SEMANTIC CONCEPTS

Our research focuses on semantic concept detection over a collection of consumer videos and an ontology of concepts derived from user studies, both originated at the Eastman Kodak company [10]. The videos were shot by about 100+ participants in a year-long user study, using the video mode of current-generation consumer digital cameras, which can capture videos of arbitrary duration at TV-quality resolution and frame rate. The full ontology of over 100 concepts was developed to cover real consumer needs as revealed by the studies. For our experiments, we further pared these down to 25 concepts that were simultaneously useful to users, practical both in terms of the anticipated viability of automatic detection and of annotator labeling, and sufficiently represented in the video collection. The concepts fall into several broad categories, including activities (e.g., skiing, dancing), occasions (e.g., birthday, graduation), locations (e.g., beach, park), and particular objects in the scene (e.g., baby, boat, groups of three or more people). Most concepts were intrinsically visual, although some, such as music and cheering, were primarily acoustic.

The Kodak video collection comprised over 1300 videos with an average length of 30 s. We had annotators label each video with each of the concepts; for most concepts, this was done on the basis of keyframes taken every 10 s, although some concepts (particularly the acoustic ones) relied on watching and hearing the full video. This resulted in labels for 5166 keyframes. We also experimented with gathering additional data from the video-sharing site YouTube. Using each of our concept terms as a query, we downloaded several hundred videos for each concept. We then manually filtered these results to discard videos that were not consistent with the consumer video genre (e.g., edited or broadcast content), resulting in 1874 videos with an average duration of 45 s. The YouTube videos were then manually relabeled with the 25 concepts, but only at the level of entire videos instead of keyframes. More details on the video collections and labels are provided in the companion paper [10].

3. VISUAL-BASED DETECTORS

We first define some terminology. Let C_1, ..., C_M denote the M semantic concepts we want to detect, and let D denote the set of training data {(I, y_I)}. Each I is an image and the corresponding y_I = [y_I^1, ..., y_I^M] is the vector of concept labels, where y_I^i = +1 or -1 denotes, respectively, the presence or absence of concept C_i in image I.

3.1 Global Visual Features & Baseline Models

The visual baseline model uses three attributes of color images: texture, color, and edge. Specifically, three types of global visual features are extracted: Gabor texture (GBR), Grid Color Moment (GCM), and Edge Direction Histogram (EDH). These features have been shown to be effective and efficient for detecting generic concepts in several previous works [2], [3], [15]. The GBR feature estimates image properties related to structure and smoothness; GCM approximates the color distribution over different spatial areas; and EDH captures salient geometric cues such as lines. A detailed description of these features can be found in [16].
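To make the global features concrete, the sketch below computes one of the three descriptors, grid color moments, with NumPy. The 5x5 grid and the choice of mean/standard deviation/skewness per channel are illustrative assumptions, not the exact configuration documented in [16].

import numpy as np

def grid_color_moments(image, grid=(5, 5)):
    """Grid Color Moment (GCM) sketch: split an RGB image into a grid and
    describe each cell by the mean, standard deviation, and skewness of each
    color channel. Returns a vector of length grid_h * grid_w * 3 * 3."""
    h, w, _ = image.shape
    gh, gw = grid
    feats = []
    for r in range(gh):
        for c in range(gw):
            cell = image[r * h // gh:(r + 1) * h // gh,
                         c * w // gw:(c + 1) * w // gw].reshape(-1, 3).astype(float)
            mu = cell.mean(axis=0)
            sigma = cell.std(axis=0)
            # Third-order moment; the cube root keeps the original units.
            skew = np.cbrt(((cell - mu) ** 3).mean(axis=0))
            feats.extend(np.concatenate([mu, sigma, skew]))
    return np.asarray(feats)

The Gabor texture and edge direction histogram descriptors would be computed analogously from filter responses and edge maps and concatenated in the same way.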
Figure 1: The workflow of the visual baseline detector.

Based on these global visual features, two types of support vector machine (SVM) classifiers are learned for detecting each concept: (1) one SVM classifier is trained over each of the three features individually; and (2) the three features are concatenated into one feature vector over which a single SVM classifier is trained. The detection scores from all of these SVM classifiers are then averaged to generate the baseline visual-based concept detector. The SVMs are implemented using LIBSVM (Version 2.8) [1] with the RBF kernel. For learning each SVM classifier, we need to determine the parameter settings of both the RBF kernel (gamma) and the SVM model (C) [1]. Here we employ a multi-parameter-set model instead of cross-validation, so that we can reduce the degradation of performance when the distribution of the validation set differs from that of the test set. Instead of choosing the best parameter set by cross-validation, we average the scores from the SVM models trained with 25 different combinations of C and gamma: five values of gamma spaced in powers of two around gamma = 2^k, where k = ROUND(log2(1/D_f)) and D_f is the dimensionality of the feature vector on which the SVM classifier is built (gamma = 2^k, i.e., approximately 1/D_f, is the setting recommended in [1]), together with five values of C. The multi-parameter-set approach is applied to each of the three features mentioned above, as well as to the aggregate feature, as shown in Fig. 1. Note that the scores (i.e., distances to the SVM decision boundary) generated by each SVM are normalized before averaging. Various normalization strategies are described in Sec. 5.1.
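The following sketch illustrates the multi-parameter-set idea with scikit-learn rather than the LIBSVM tools used in the paper; the specific grids around C = 1 and gamma = 1/D_f are assumptions for illustration only.

import numpy as np
from sklearn.svm import SVC

def multi_parameter_svm_scores(X_train, y_train, X_test):
    """Train RBF-kernel SVMs over a small (C, gamma) grid and average their
    normalized decision scores, instead of picking one setting by
    cross-validation."""
    d = X_train.shape[1]
    gamma0 = 2.0 ** round(np.log2(1.0 / d))                    # ~1/D_f default
    gammas = [gamma0 * 2.0 ** e for e in (-2, -1, 0, 1, 2)]    # assumed grid
    Cs = [2.0 ** e for e in (-2, -1, 0, 1, 2)]                 # assumed grid
    scores = []
    for C in Cs:
        for g in gammas:
            clf = SVC(C=C, gamma=g, kernel="rbf").fit(X_train, y_train)
            s = clf.decision_function(X_test)
            # z-score normalization before averaging (one option from Sec. 5.1)
            scores.append((s - s.mean()) / (s.std() + 1e-8))
    return np.mean(scores, axis=0)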

3.2 Visual Models Using Local Features

Complementary to the global visual features, local descriptors such as SIFT features [11] have been shown to be very useful for detecting specific objects. Recently, an effective bag-of-features (BOF) representation [4] has been proposed for image classification. In BOF, images are represented over a visual vocabulary constructed by clustering the original SIFT descriptors into a set of visual tokens. BOF provides a uniform middle-level representation through which the original orderless SIFT descriptors of an image can be mapped to a feature vector; based on this feature vector, learning-based algorithms such as the SVM classifier can be applied for concept detection. Building on the BOF representation, the Spatial Pyramid Matching (SPM) approach [9] and the Vocabulary-Spatial Pyramid Matching (VSPM) approach [7] have been developed to fuse information from multiple resolutions in the spatial domain and from multiple visual vocabularies of different granularities. Promising performance has been obtained for detecting generic concepts like bike and person. In this work, we experimented with the VSPM approach [7] to investigate the power of local SIFT features in detecting diverse concepts in the consumer domain.

3.2.1 Local SIFT Descriptor

The 128-dimensional SIFT feature proposed in [11] has proven effective for detecting objects, because it is designed to be invariant to the relatively small spatial shifts of region positions that often occur in real images. Computing the SIFT descriptor over affine covariant regions yields local description vectors that are invariant to affine transformations of the image. In this work, instead of computing SIFT features over detected interest points as in the traditional feature extraction algorithms [11], we extract SIFT features for every 16x16-pixel image patch over a grid with a spacing of 8 pixels, as in [9]. This dense sampling method has been shown to be more effective for detecting generic concepts [9] than the traditional method using selected interest points only.

3.2.2 Vocabulary-Spatial Pyramid Match Kernel

For each concept C_i, the SIFT features from all the positive training images for this concept are first aggregated, and through hierarchical clustering these SIFT features are grouped into L+1 sets of clusters V^0, ..., V^L, with level 0 being the coarsest and level L the finest. Each V^l represents a visual vocabulary comprised of n_l visual tokens, V^l = {v^l_1, ..., v^l_{n_l}}. The visual vocabularies are expected to include the most informative visual descriptors that are characteristic of images sharing the same concept.

Given the visual vocabulary V^l at each level, the local features of an image are mapped to tokens in the vocabulary, and counts of tokens are computed to form a token histogram H^l(I) = [h^l_1(I), ..., h^l_{n_l}(I)]. In the Spatial Pyramid Match Kernel (SPMK) method, each image is further decomposed into 4^s blocks in a hierarchical way (s = 0, ..., S), with a separate token histogram H^{l,s}_k(I) = [h^{l,s}_{k,1}(I), ..., h^{l,s}_{k,n_l}(I)] associated with the k-th spatial block at level s. To compute matches between two images I_p and I_q, histogram intersection is used:

M^{l,s}(I_p, I_q) = sum_{k=1}^{4^s} sum_{j=1}^{n_l} min( h^{l,s}_{k,j}(I_p), h^{l,s}_{k,j}(I_q) ).

The final vocabulary-spatial pyramid match kernel defined by vocabulary V^l is a weighted sum of the matches at the different spatial levels:

K^l(I_p, I_q) = M^{l,0}(I_p, I_q) / 2^S + sum_{s=1}^{S} M^{l,s}(I_p, I_q) / 2^{S-s+1}.

The above measure is used to construct a kernel matrix whose elements represent the similarities (or distances) between all pairs of training images (including both positive and negative samples) for concept C_i. Images coming from C_i are likely to share common visual tokens in V^l and thus have high matching scores in the kernel matrix. The process of constructing VSPM kernels for multi-level vocabularies is illustrated in Fig. 2.

The VSPM kernels provide important complementary visual cues to the global visual features and are utilized in two ways for concept detection: (1) For each individual concept C_i, the VSPM kernels K^0, ..., K^L are combined with weights into an ensemble kernel, K_ensemble = sum_{l=0}^{L} w_l K^l, where the weights w_l can be heuristically determined in a way similar to [6] or optimized through experimental validation; the ensemble kernel is then used directly to learn a one-vs.-all SVM classifier for concept C_i. (2) VSPM kernels from different concepts are shared among different concept detectors through a joint boosting framework, which is described in detail in Section 5.3.

Figure 2: Illustration of the kernel construction process used in the Vocabulary-Spatial Pyramid Match (VSPM) model, from local feature extraction over the training images to the multi-level vocabularies V^0, ..., V^L and the corresponding kernels K^0, ..., K^L.
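As a concrete illustration of the pyramid matching above, the sketch below computes the histogram-intersection matches M^{l,s} and the level-weighted kernel value K^l for two images, given their quantized token positions and IDs. The vocabulary size and pyramid depth are example values, and both images are assumed to share the same (normalized) image size.

import numpy as np

def block_histograms(tokens_xy, token_ids, n_tokens, level, img_size):
    """Token histograms for the 4**level spatial blocks of one image.
    tokens_xy: (N, 2) integer patch coordinates; token_ids: (N,) vocabulary indices."""
    cells = 2 ** level
    h, w = img_size
    hists = np.zeros((cells * cells, n_tokens))
    bx = np.minimum(tokens_xy[:, 0] * cells // w, cells - 1).astype(int)
    by = np.minimum(tokens_xy[:, 1] * cells // h, cells - 1).astype(int)
    for b, t in zip(by * cells + bx, token_ids):
        hists[b, t] += 1
    return hists

def vspm_kernel(img_p, img_q, n_tokens, S, img_size):
    """K^l(I_p, I_q) = M^{l,0}/2^S + sum_{s=1..S} M^{l,s}/2^(S-s+1),
    where M^{l,s} is the histogram-intersection match at spatial level s.
    img_p and img_q are (tokens_xy, token_ids) pairs for one vocabulary level l."""
    def match(s):
        hp = block_histograms(*img_p, n_tokens, s, img_size)
        hq = block_histograms(*img_q, n_tokens, s, img_size)
        return np.minimum(hp, hq).sum()
    K = match(0) / 2.0 ** S
    for s in range(1, S + 1):
        K += match(s) / 2.0 ** (S - s + 1)
    return K

Looping this over all pairs of training images yields the gram matrix that is fed to a one-vs.-all SVM with a precomputed kernel.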

4. AUDIO-BASED DETECTOR

The soundtrack of each video is described and classified by two techniques, single Gaussian modeling and probabilistic latent semantic analysis (pLSA) [18] of Gaussian mixture model (GMM) component occupancy histograms, both described below.

Figure 3: Illustration of the calculation of audio features as the pLSA weights describing the histogram of GMM component utilizations. Top left shows the formation of the global GMM; bottom left shows the formation of the topic profiles p(g|z); top right shows the analysis of each clip into topic weights by matching each histogram to a combination of topic profiles; and bottom right shows the final classification by SVM.

All systems start with the same basic representation of the audio: 25 Mel-frequency cepstral coefficients (MFCCs) extracted from frequencies up to 7 kHz over 25 ms frames every 10 ms. Since each video has a different duration, it yields a different number of feature vectors; these are collapsed into a single clip-level feature vector by the two techniques described below. Finally, these fixed-size summary features are compared to one another, and this matrix of distances (comparing positive examples with a similar number of randomly chosen negative examples) is used to train an SVM classifier for each concept. The distance-to-boundary values from the SVM are taken to indicate the strength of relevance of the video to the concept, either for direct ranking or to feed into the fusion model.

4.1 Single Gaussian Modeling

After the initial MFCC analysis, each soundtrack is represented as a set of d = 25 dimensional feature vectors, whose total number depends on the length of the original video. (In some experiments we augmented this with 25 dimensions of delta-MFCCs giving the local time-derivative of each component, which slightly improved results.) To describe the entire clip in a single feature vector, we ignore the time dimension and treat the set as samples from a distribution in the MFCC feature space, which we fit with a single 25-dimensional Gaussian by measuring the mean and (full) covariance matrix of the data. This approach follows common practice in speaker recognition and music genre identification, where the distribution of cepstral features, ignoring time, is found to be a good basis for classification.

To calculate the distance between two distributions, as required for the gram-matrix input (the kernel matrix as defined in Sec. 3.2.2) to the SVM, we have tried two approaches. One is to use the symmetrized Kullback-Leibler (KL) divergence between the two Gaussians. Namely, if video clip i has a set of MFCC features X_i described by mean vector mu_i and covariance matrix Sigma_i, then the KL distance between videos i and j is:

D_KL(i, j) = 1/2 [ tr(Sigma_i Sigma_j^{-1}) + tr(Sigma_j Sigma_i^{-1}) + (mu_i - mu_j)^T (Sigma_i^{-1} + Sigma_j^{-1}) (mu_i - mu_j) ] - d.

The second approach simply treats the d-dimensional mean vector mu_i, concatenated with the d(d+1)/2 unique values of the covariance matrix Sigma_i, as a point in a new (25+325)-dimensional feature space, normalizes each dimension by its standard deviation across the entire training set, and then builds a gram matrix from the Euclidean distances between these normalized feature-statistic vectors.
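A minimal sketch of the single-Gaussian soundtrack summary and the symmetrized KL distance above; it assumes the MFCC matrix has already been computed (e.g., with an external audio library) and omits the SVM stage.

import numpy as np

def gaussian_summary(mfcc):
    """Fit one full-covariance Gaussian to an (n_frames, d) MFCC matrix."""
    mu = mfcc.mean(axis=0)
    cov = np.cov(mfcc, rowvar=False)
    return mu, cov

def kl_distance(summary_i, summary_j):
    """Symmetrized KL divergence between two Gaussian clip summaries."""
    (mu_i, cov_i), (mu_j, cov_j) = summary_i, summary_j
    d = mu_i.shape[0]
    inv_i, inv_j = np.linalg.inv(cov_i), np.linalg.inv(cov_j)
    dmu = mu_i - mu_j
    return 0.5 * (np.trace(cov_i @ inv_j) + np.trace(cov_j @ inv_i)
                  + dmu @ (inv_i + inv_j) @ dmu) - d

The resulting pairwise distance matrix can then be turned into an SVM kernel, for example via exp(-D/sigma) or LIBSVM's precomputed-kernel mode; the particular transform is not specified above, so treat that step as a design choice to validate.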
4.2 Probabilistic Latent Semantic Analysis

The Gaussian modeling assumes that different activities are associated with different sounds whose average spectral shape, as captured by the cepstral feature statistics, is sufficient to discriminate the categories. A more realistic assumption, however, is that each soundtrack consists of many different sounds that may occur in different proportions even within the same category, leading to variation in the global statistics. If we could decompose the soundtrack into separate descriptions of those specific sounds, we might find that the particular palette of sounds, though not necessarily their exact proportions, is a more useful indicator of the content. Some kinds of sounds (e.g., background noise) may be common to all classes, whereas some sound classes (e.g., a baby's cry) might be very specific to particular classes of video.

To build a model better able to capture this idea, we first trained a large Gaussian mixture model, comprising M = 256 Gaussian components, on a subset of MFCC frames chosen randomly from the entire training set. (The number of mixture components was optimized in pilot experiments.) These 256 components are treated as anonymous sound classes from which each individual soundtrack is assembled, the analogues of words in document modeling. We then classify every MFCC frame in a given soundtrack to one of the mixture components, and describe the overall soundtrack with a histogram of how often each of the 256 Gaussians was chosen when quantizing the original representation. Note that this representation also ignores temporal structure, but it is able to distinguish between nearby points in cepstral space, depending on how densely that part of the feature space is represented in the entire database, and thus how many Gaussian components it received in the original model. The idea of using histograms of acoustic tokens to represent the entire soundtrack is similar to the use of visual token histograms for image representation (Sec. 3.2).

We could use this histogram directly, but to remove redundant structure and to give a more compact description, we go on to explain the histogram with probabilistic latent semantic analysis (pLSA) [18]. This approach, originally developed to generalize the distributions of individual words in documents on different topics, models the histogram as a mixture of a smaller number of topic histograms, giving each document a compact representation in terms of a small number of topic weights. The individual topics are defined automatically to maximize the ability of the reduced-dimension model to match the original set of histograms. During training, the topic definitions are driven to a local optimum by the EM algorithm. Specifically, the histogram representation gives the probability p(g|c) that a particular component g will be used in clip c as the sum of the distribution of components for each topic z, p(g|z), weighted by the specific contribution of that topic to clip c, p(z|c), i.e.

p(g|c) = sum_z p(g|z) p(z|c).

The topic profiles p(g|z) (which are shared between all clips) and the per-clip topic weights p(z|c) are optimized by EM. The number of distinct topics determines how accurately the individual distributions can be matched, but also provides a way to smooth over irrelevant minor variations in the use of certain Gaussians. We tuned it empirically on the development data and found that around 60 topics was the best number for our task. Representing a test item similarly involves finding the best set of weights to match the observed histogram as a combination of the topic profiles; we match in the sense of minimizing the KL distance, which requires an iterative solution. Finally, each clip is represented by its vector of topic weights, and the SVM's gram matrix (referred to as the kernel K_audio in Section 5.3) is calculated as the Mahalanobis (i.e., covariance-normalized Euclidean) distance in that 60-dimensional space. The process of pLSA feature extraction is illustrated in Fig. 3.
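The sketch below shows the two steps just described: quantizing MFCC frames against a trained GMM to obtain a component histogram, and a bare-bones pLSA EM loop that factors a clip-by-component count matrix into topic profiles p(g|z) and clip weights p(z|c). It is a simplified stand-in for the paper's implementation; the topic count mirrors the value quoted above, and the iteration count is arbitrary.

import numpy as np

def component_histogram(mfcc, gmm):
    """Count how often each GMM component (e.g., a fitted
    sklearn.mixture.GaussianMixture with 256 components) best matches a frame."""
    comp = gmm.predict(mfcc)                       # (n_frames,)
    return np.bincount(comp, minlength=gmm.n_components)

def plsa(counts, n_topics=60, n_iter=100, seed=0):
    """pLSA by EM: counts is (n_clips, n_components); returns p(g|z), p(z|c)."""
    rng = np.random.default_rng(seed)
    n_c, n_g = counts.shape
    p_g_z = rng.random((n_topics, n_g))
    p_g_z /= p_g_z.sum(axis=1, keepdims=True)
    p_z_c = rng.random((n_c, n_topics))
    p_z_c /= p_z_c.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities p(z|c,g), shape (n_clips, n_topics, n_components)
        joint = p_z_c[:, :, None] * p_g_z[None, :, :]
        joint /= joint.sum(axis=1, keepdims=True) + 1e-12
        weighted = joint * counts[:, None, :]
        # M-step: re-estimate topic profiles and per-clip topic weights
        p_g_z = weighted.sum(axis=0)
        p_g_z /= p_g_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_c = weighted.sum(axis=2)
        p_z_c /= p_z_c.sum(axis=1, keepdims=True) + 1e-12
    return p_g_z, p_z_c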

5. FUSION OF AUDIO-VISUAL FEATURES AND MODELS

Semantic concepts are usually defined by both visual and audio characteristics. For example, dancing is usually accompanied by background music. It can therefore be expected that combining the audio and visual features and their corresponding models will yield better performance than using any single modality. In this section, we develop three fusion strategies for combining audio and visual features and models.

5.1 Ensemble Fusion

One intuitive strategy for fusing the audio-based and visual-based detection results is ensemble fusion, which combines independent detection scores by a weighted sum, together with a normalization procedure that adjusts the raw scores before fusion. For normalization, we utilize z-score Eqn. (1), sigmoid Eqn. (2), and sigmoid after z-score normalization (sigmoid2) Eqn. (3):

f(x) = (x - mu) / sigma,                           (1)
f(x) = 1 / (1 + exp(-x)),                          (2)
f(x) = 1 / (1 + exp(-v)),  v = (x - mu) / sigma,   (3)

where x is the raw score, and mu and sigma are its mean and standard deviation, respectively. This ensemble fusion method has already been applied to combining the SVM models with different parameters and features (as illustrated in Fig. 1). Here, we extend the fusion process to include the audio models, using optimal weights determined by maximizing the performance of the fused model over a separate validation data set. The cross-modal fusion architecture is shown in Fig. 4.

Figure 4: Ensemble fusion of audio and visual models: the normalized visual model (Fig. 1) and the normalized audio model are combined with weights W_V and W_A to form the fused AV model.
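A small sketch of the weighted late fusion with the three normalizations above; the weight search over a validation set is shown as a simple grid, which is an assumption about granularity rather than the paper's exact procedure.

import numpy as np

def normalize(scores, method="zscore"):
    mu, sigma = scores.mean(), scores.std() + 1e-8
    z = (scores - mu) / sigma
    if method == "zscore":            # Eqn. (1)
        return z
    if method == "sigmoid":           # Eqn. (2)
        return 1.0 / (1.0 + np.exp(-scores))
    if method == "sigmoid2":          # Eqn. (3): sigmoid after z-score
        return 1.0 / (1.0 + np.exp(-z))
    raise ValueError(method)

def fuse(visual_scores, audio_scores, w_visual):
    """Weighted average of normalized per-concept scores (Fig. 4)."""
    return (w_visual * normalize(visual_scores)
            + (1.0 - w_visual) * normalize(audio_scores))

def pick_weight(val_visual, val_audio, val_labels, metric):
    """Choose W_V on a validation set by maximizing a metric such as AP."""
    grid = np.linspace(0.0, 1.0, 21)   # assumed 0.05 steps
    return max(grid, key=lambda w: metric(val_labels, fuse(val_visual, val_audio, w)))

Here metric can be any scorer with the signature (labels, scores), e.g., sklearn.metrics.average_precision_score.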

5.2 Audio-Visual BCRF (AVBCRF)

In all of the approaches mentioned above, each concept is detected independently in a one-vs.-all manner. However, semantic concepts do not occur in isolation: knowing information about certain concepts in an image (e.g., "person") is expected to help the detection of other concepts (e.g., "wedding"). Based on this idea, in the following two subsections we propose context-based concept detection methods for multimodal fusion that take the inter-conceptual relationships into account. Specifically, two algorithms are developed under two different fusion frameworks: (1) an Audio-Visual Boosted Conditional Random Field (AVBCRF) method that uses a two-stage Context-Based Concept Fusion (CBCF) framework; and (2) an Audio-Visual Joint Boosting (AVJB) algorithm in which both audio-based and visual-based kernels are combined to train multi-class concept detectors jointly. The former can be categorized as late fusion, since it combines prediction results from models that have been trained separately; the latter is an early fusion approach, as it utilizes kernels derived from individual concepts to learn joint models for detecting multiple concepts simultaneously. In addition, on the visual side, CBCF fuses baseline models built on global features, while AVJB further explores the potential benefits of local visual features. We introduce AVBCRF in this subsection; the AVJB algorithm is described in the next subsection.

The Boosted Conditional Random Field (BCRF) algorithm was proposed in [8] as an efficient context-based fusion method for improving concept detection performance. The relationships between different concepts are modeled by a Conditional Random Field (CRF), where each node represents a concept and the edges between nodes represent the pairwise relationships between concepts. The BCRF algorithm has a two-layer framework (as shown in Fig. 5). In the first layer, independent visual-based concept detectors are applied to obtain a set of initial posterior probabilities of the concept labels for a given image. In the second layer, the detection result of each individual concept is updated through a context-based model by considering the detection confidences of the other concepts. Here we extend BCRF to include models from both the visual and audio modalities.

Figure 5: The context-based concept fusion framework based on the Boosted Conditional Random Field.

For each image I, the input observations are the initial posterior probabilities h_I = [h_vis,I, h_ao,I], including the visual-based independent detection results h_vis,I = [h^1_vis,I, ..., h^M_vis,I] as well as the audio-based independent detection results h_ao,I = [h^1_ao,I, ..., h^M_ao,I]. These inputs are fed into the CRF to obtain improved posterior probabilities P(y_I^i | I) through inference over the inter-conceptual relationships. After inference, the belief b_I^i on each node C_i is used to approximate the posterior probability: P(y_I^i = +/-1 | I) ~ b_I^i(+/-1). The aim of the CRF modeling is to minimize the total loss J for all concepts over all the training data D:

J = sum_{I in D} sum_{i=1}^{M} [ b_I^i(+1)^{(1 - y_I^i)/2} + b_I^i(-1)^{(1 + y_I^i)/2} ].   (4)

Eqn. (4) is an intuitive function: the minimizer of J favors posteriors closest to the training labels. To avoid the difficulty of designing potential functions in the CRF, the Boosted CRF framework developed in [14] is incorporated and generalized to optimize the logarithm of Eqn. (4):

arg min_{b_I} log J  =  arg min_{F_I, G_I} sum_{I in D} sum_{i=1}^{M} log[ 1 + exp( -y_I^i (F_I^i + G_I^i) ) ],   (5)

in an iterative boosting process that finds the optimal F_I and G_I, where F_I and G_I are additive models: F_I^i(T) = sum_{t=1}^{T} f_I^i(t) and G_I^i(T) = sum_{t=1}^{T} g_I^i(t). Here f_I^i(t) is a discriminant function (e.g., an SVM or logistic classifier) that takes the input h_I as its feature, and g_I^i(t) is a discriminant function (an SVM in our algorithm) that takes the current belief b_I(t) as its feature in iteration t. Both f_I^i(t) and g_I^i(t) can be considered weak classifiers learned by the standard boosting procedure, but over different features. The contributions of the other concepts' scores to the detection of a specific concept are exploited in every iteration, since the whole set of concept detection scores is used as input to the classifiers at each iteration. More details about the derivation can be found in [8], [14].
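The sketch below is a heavily simplified stand-in for the CBCF idea rather than the boosted-CRF updates of [8], [14]: a second-stage logistic classifier per concept that takes the full vector of first-stage audio and visual scores as input, so that each concept's refined score can draw on the context of all the others.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_context_fusion(h_vis, h_audio, labels):
    """h_vis, h_audio: (n_images, M) first-stage scores; labels: (n_images, M) in {+1, -1}.
    Returns one second-stage model per concept, each seeing all 2*M scores."""
    context = np.hstack([h_vis, h_audio])
    return [LogisticRegression(max_iter=1000).fit(context, labels[:, i])
            for i in range(labels.shape[1])]

def apply_context_fusion(models, h_vis, h_audio):
    """Refined posterior P(concept i present | all first-stage scores)."""
    context = np.hstack([h_vis, h_audio])
    return np.column_stack([m.predict_proba(context)[:, 1] for m in models])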
5.3 Audio-Visual Joint Boosting (AVJB)

In this section, we introduce a systematic early fusion framework that combines the audio-based and visual-based features/kernels for training multi-class concept detectors. Instead of training independent detectors on visual features and audio features separately, the visual and audio kernels are used together to learn the concept detectors in the first place. To this end, we adopt the joint boosting and kernel sharing framework developed in [7], which has two stages: (1) kernel construction, and (2) kernel selection and sharing. In the first stage, concept-specific kernels, such as the VSPM kernels described in Sec. 3.2.2, are constructed to capture the most representative characteristics of the visual content of each concept individually; note that local visual features (i.e., SIFT-based visual tokens) are used here. In the second stage, these kernels are shared among different concepts through a joint boosting algorithm, which automatically selects the optimal kernels from the kernel pool to learn a multi-class concept detector jointly. This two-stage framework can be directly generalized to incorporate audio-based kernels. That is, in the first stage, various features/kernels are constructed from acoustic analysis (such as the audio vocabulary and kernel described in Sec. 4.2) and added to the kernel pool together with all the visual-based kernels; in the second stage, the optimal subset of kernels is selected and shared through the joint boosting learning algorithm. The process of joint boosting is illustrated in Fig. 6. By sharing good kernels among different concept detectors, individual concepts can be enhanced by incorporating the descriptive power of other concepts. Also, by sharing common detectors among concepts, the kernels and training samples required for detecting individual concepts are reduced [7], [13].

Figure 6: Illustration of kernel and classifier sharing using joint boosting. A kernel pool containing the per-concept kernels K_i^0, ..., K_i^L (i = 1, ..., M) and K_audio is shared by different detectors. First, using kernel K*(1), a binary classifier separates concepts C_1 and C_2 from the background; then, using K*(2), another binary classifier further picks out C_1.

In Section 3.2.2 we obtained L+1 concept-specific VSPM kernels K_i^0, ..., K_i^L for each concept C_i, corresponding to the multi-resolution visual vocabularies V^0, ..., V^L. In addition, from Section 4.2 we have the audio-based kernel K_audio.

The joint boosting framework from [7] can then be directly adopted for sharing the visual and audio based kernels for concept detection. Specifically, during each iteration t, we select the optimal kernel K*(t) and the optimal subset of concepts S*(t) that share this kernel. A binary classifier is then trained using kernel K*(t) to separate the concepts in subset S*(t) from the background (for the concepts not in S*(t), a prediction k_c(t) is given based on the prior). After that, we calculate the training error of this binary classifier and re-weight the training samples, similar to the Real AdaBoost algorithm. Finally, all weak classifiers from all iterations are fused together to generate the multi-class concept detector.

6. EXPERIMENTS

In this section, we evaluate the performance of the features, models, and fusion methods described earlier. We conduct extensive experiments using the Kodak benchmark video set described in Section 2. Among the 25 concepts annotated over the video set, we use 21 visual-dominated concepts to evaluate the performance of the visual methods and the impact of incorporating additional methods based on audio features. The audio-based methods are also evaluated on three additional audio-dominated concepts (singing, music, and cheer). In the discussion following each experiment, we highlight the main findings and important insights.

6.1 Experimental Setup & Performance Metrics

Each concept detection algorithm is evaluated in five runs, and the average performance over all runs is reported. The data sets for the runs are generated as follows: the entire data set D is randomly split into 5 subsets D_1, ..., D_5. By rotating these 5 subsets, we generate the training set, validation set, and test set for each run. That is, for run 1, training set = {D_1, D_2}, validation set = D_3, and test set = {D_4, D_5}. We then shift by one subset for run 2, where training set = {D_2, D_3}, validation set = D_4, and test set = {D_5, D_1}. Similarly, we keep shifting to generate the data sets for runs 3, 4, and 5. For each run, all algorithms are trained on the training set and evaluated on the test set, except for the AVBCRF algorithm, in which the validation set is used to learn the boosting model that fuses the individual detectors learned separately on the training set.

The average precision (AP) and mean average precision (MAP) are used as performance metrics. AP is related to the multi-point average precision value of a precision-recall curve, and is an official performance metric used by TRECVID [12]. To calculate AP for concept C_i, we first rank the test data according to the classification posteriors for C_i. Then, from top to bottom, the precision after each positive sample is calculated, and these precision values are averaged over the total number of positive samples for C_i. AP favors highly ranked positive samples and combines precision and recall in a balanced way. MAP is the average of the per-concept APs across all concepts. To help readers compare performance, in some cases we also report the detection accuracy based on the Equal Error Rate (EER).
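For reference, the two metrics can be computed as below; this follows the textbook definitions stated above rather than any TRECVID evaluation script, and it assumes both positive and negative samples are present.

import numpy as np

def average_precision(labels, scores):
    """AP: mean of the precisions measured at each positive item,
    after ranking by detection score (labels are 1/0 or +1/-1)."""
    order = np.argsort(-np.asarray(scores))
    hits = (np.asarray(labels)[order] > 0).astype(float)
    precisions = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return (precisions * hits).sum() / max(hits.sum(), 1)

def equal_error_rate(labels, scores):
    """EER: operating point where the false-accept and false-reject rates meet."""
    labels = np.asarray(labels) > 0
    scores = np.asarray(scores)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        pred = scores >= t
        far = np.mean(pred[~labels])      # false accepts among negatives
        frr = np.mean(~pred[labels])      # false rejects among positives
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer

The accuracy figures quoted in Sec. 6.2.2 appear to correspond to 1 - EER.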
6.2 Performance Comparison and Discussion

6.2.1 Baseline Approaches

Visual baseline. First, we evaluate the visual baseline detector with multiple parameter sets described in Sec. 3.1. For score normalization, we used the sigmoid-based strategy (Sec. 5.1), which was shown to outperform the other options. Fig. 7 shows the performance when different numbers of SVMs with distinct parameter settings are fused; Top(n) denotes the fused model that averages the detection scores of the n detectors that achieve top performance over the validation set. The objective here is to study the effect of varying the number of models during ensemble fusion. Intuitively, the more models used in fusion, the more stable the fused performance will be when testing on unseen data. This conjecture has been confirmed in our experiments: Top25 gives the best MAP as well as good APs over the different concepts. On the other hand, the APs of Top1 are not stable across concepts, and its MAP is the worst among all compared methods. This indicates that in our data sets the distribution of the validation set is quite different from that of the test set, and the conventional method of optimizing a single parameter set by cross-validation suffers from overfitting. In comparison, the multi-parameter-set model achieves relatively stable performance in such cases. Based on this observation, the Top25 results are used in the following experiments and are referred to as the visual-based baseline detection results.

Fig. 7 also shows the AP of random guessing, which is proportional to the number of positive samples of each concept. From the above results, we find that, in general, frequent concepts enjoy higher detection accuracy. However, other factors, such as the specificity of the concept definition and the consistency of the content, are also important. For example, concepts like "sunset", "parade", "sports", "beach", and "boat", though infrequent, can be detected with high accuracy, whereas some frequent concepts like "group of 3+" and "one person" have much lower accuracy. This confirms that careful choices and definitions of concepts play a critical role in developing robust semantic classification systems.

Figure 7: Performance (AP per concept) of visual baseline detectors fusing varying numbers (Top n) of models with different parameter sets, compared with random guessing.

Audio baseline. Fig. 8 shows the results of the three audio-based approaches (single Gaussians with either the KL or the Mahalanobis distance measure, and the pLSA modeling of GMM component histograms).

We see that all three approaches perform roughly the same, with different models doing best for individual concepts. There is also a wide variation in performance depending on the concept, which is to be expected, since different labels are more or less evident in the soundtrack. However, the main determinant of performance for the audio-based classifiers appears to be the prior likelihood of the label, suggesting that a large amount of training data is the most important ingredient of a successful classifier. For example, although the infrequent classes "wedding", "museum", and "parade" have APs similar to the more common classes "cheer" and "one person", their variation across the 5-fold cross-validation is much larger. A similar relationship between frequency and performance variance was also found for the visual detectors: though not shown in Fig. 7 (due to space limits in the graph), the infrequent concepts ("boat", "parade", and "ski") have accuracy similar to common concepts ("one person", "shows", and "sports"), but much larger performance variance across the cross-validation runs. Since the different approaches have similar performance, the single Gaussian with the KL distance measure is used as the audio-based baseline detector in the following experiments.

Since most of the selected concepts are dominated by visual cues, the results show that, as expected, the visual-based models achieve higher accuracy than the audio models for most concepts. However, the audio models also provide significant benefits. For example, concepts like "music", "singing", and "cheer" can be detected only by the audio models, due to the nature of these concepts. Even for some visually dominated concepts (like "museum" and "animal"), the audio methods were found to be more reliable than their visual counterparts: the soundtracks of video clips from these concepts provide rather consistent audio features for classification. This also suggests that these two concepts may need to be refined into more specific ones so that the corresponding visual content becomes more consistent (e.g., "animal" refined to "dog", "cat", etc.).

Figure 8: Performance of audio-based classifiers on the Kodak data using MFCC + delta-MFCC base features. Labels are sorted by prior probability (random guessing). Error bars indicate the standard deviation over the 5-fold cross-validation testing.

6.2.2 Audio-Visual Fusion Approaches

Ensemble fusion. We evaluate the different normalization strategies used in ensemble fusion as described in Section 5.1. Specifically, we compare normalization based on z-score, sigmoid, and sigmoid2 (i.e., z-score followed by sigmoid). Additionally, we test two score fusion methods: uniform averaging and weighted averaging. We found that uniform averaging of the audio and visual baseline models does not perform as well as the visual models alone. This is reasonable, as most of the selected concepts have stronger cues from visual appearance than from audio attributes; thus equal weighting is not expected to be the best option. This is indeed confirmed by the results in Fig. 9, which compares weighted audio-visual combination under different normalization strategies. Among the score normalization strategies, the z-score method performs best, outperforming the visual-only model by 4% in MAP. The improvement is especially significant for several concepts ("dancing", "parade", and "shows"), with 16%-24% gains in terms of AP. Note that the optimal weights for combining the audio and visual models are determined through validation, and thus vary across concepts.
For most concepts the visual models dominate, with the visual weight ranging from 0.6 to 1.0.

Figure 9: Comparison of weighted fusion of audio and visual models with different score normalization processes (per-concept AP and MAP for random guess, visual, audio, uniform-average AV with z-score, and weighted AV with z-score).

The above results show that, with simple weighted averaging, audio and visual models can be combined to improve concept detection accuracy. However, additional care is needed to determine the appropriate weights and score normalization strategies.

Audio-Visual Boosted CRF & Audio-Visual Joint Boosting. Fig. 10 shows the per-concept AP of the different audio-visual fusion algorithms, where "AVBCRF + baseline" denotes the method that averages the posteriors from AVBCRF and the visual baseline, "AVJB + baseline" averages the posteriors from AVJB and the visual baseline, and "ALL" averages the posteriors from AVBCRF, AVJB, and the visual baseline. From our previous experience [3], combining the advanced algorithms (e.g., AVBCRF and AVJB) with the visual baseline usually gives better performance than using the advanced algorithms alone. For comparison, the best-performing ensemble fusion method (weighted combination of audio- and visual-based detection scores with z-score normalization) is also shown in the figure.

By combining the visual baseline detectors and audio baseline detectors through context fusion, the AVBCRF algorithm clearly improves the performance when it is fused with the visual baseline. The improvements over many concepts are significant, e.g., 4% over "animal", 5% over "baby", 8% over "museum", and 35% over "dancing", with a large gain over "parade" as well. These results confirm the power of incorporating inter-concept relations into the context fusion model. Our experiments also show that context fusion among the visual models only does not provide a performance gain on average; only when the audio models are incorporated into the context fusion is a clear performance gain achieved.

This is interesting and important: the audio models provide non-trivial complementary benefits in addition to the visual models.

Compared to straightforward weighted averaging of audio and visual models for each concept, the AVBCRF context fusion method shows more consistent improvement over the diverse set of concepts. Most importantly, it avoids the large performance degradation that the weighted-average model suffers over a few concepts ("sunset" and "museum") when the models from one modality are significantly worse than those of the other. In other words, by fusing multimodal models over a large pool of concepts, the stability of the detectors can be greatly improved.

Fig. 11 gives an example of the top detected video clips for the "parade" concept (ranked in descending order of detection score) using both AVBCRF and the visual baseline. Many irrelevant videos (marked by red rectangles) are included in the top results when using only the visual baseline. This is because most of these irrelevant videos contain crowds in outdoor scenes whose visual appearance is similar to that of parade images. With AVBCRF, such irrelevant videos are largely removed because of the help from the audio models: parade scenes are usually accompanied by noisy sound from the crowd and loud music associated with the parade. The visual appearance plus audio together can distinguish parade videos more effectively than a single type of feature alone.

Figure 10: Comparison of different audio-visual fusion algorithms (per-concept AP for random guess, visual baseline, audio baseline, weighted AV with z-score, AVBCRF + visual baseline, AVJB + visual baseline, and AV ALL + visual baseline).

AVJB does not result in improved performance when it is applied alone or combined with the visual baseline. This indicates that the use of local features and feature sharing in AVJB is not as effective as the exploitation of inter-concept context in AVBCRF. However, AVJB does provide complementary benefits: by combining AVJB with AVBCRF and the visual baseline, we achieved further improvements over many concepts, e.g., 7% over "beach", 7% over "crowd", and 7% over "one person", as well as gains over "animal" and "baby". It is interesting to see that most of the concepts benefiting from feature sharing (AVJB) overlap with the concepts benefiting from context fusion (AVBCRF). More research is needed to gain a deeper understanding of the mechanism underlying this phenomenon and to develop techniques that may automatically discover such concepts.

Analysis of the results from the AVJB models also allows us to investigate the relative contributions of the features extracted from images of individual concepts, and how they are shared across the classifiers of multiple concepts. Fig. 12 shows how frequently each individual kernel is used by the AVJB algorithm in simultaneously detecting the 21 concepts across iterations. Only 5 out of the total 64 kernels (3 visual-based kernels for each concept plus 1 audio kernel shared by all concepts) are selected by the feature selection/sharing procedure. It is surprising to see that the single audio kernel turns out to be the most frequently used kernel, more than any of the kernels constructed from visual features (described in Sec. 3.2.2). This again confirms the importance of multimodal fusion: despite the lower accuracy achieved by the audio models (compared with their visual counterparts), the underlying audio features play an important role in developing multimodal fusion models.

Figure 11: Top video clips for the "parade" concept, detected by the visual baseline model (top) and by AVBCRF + visual baseline (bottom). Irrelevant videos are marked by red rectangles; clips are ranked in descending order of detection score.
The feature selection and sharing processes used in AVJB are useful for pruning the feature pool in order to make the models more compact. The kernels learned from "birthday", "museum", and "picnic" are discarded because of their relatively poor quality: images from these concepts have highly diverse visual content, and thus the learned visual vocabularies and associated kernels cannot capture meaningful characteristics of these concepts.

To allow comparison with other classification systems, we also measure the detection accuracy using a common metric, the Equal Error Rate (EER). The EER values of the visual model, the audio model, and the final fused model ("AV ALL" in Fig. 10) are shown in Fig. 13. It can be seen that the proposed fusion framework is effective, reducing the overall error rate from 0.20 (using the visual models alone) to 0.17, a 15% improvement. It is also encouraging to see that, with sound approaches to audio-visual content analytics and machine learning, a satisfactory accuracy of 83% can be achieved in detecting this diverse set of semantic concepts over consumer videos.

Figure 12: Frequency of kernels used by the AVJB algorithm throughout the iterations.

Figure 13: EER comparison of the audio baseline, visual baseline, and fused "AV All" models over all concepts and on average.

7. CONCLUSIONS

We develop new methods and assess the state of the art in the automatic classification of consumer videos into a large set of semantic concepts. Experiments with 24 diverse concepts over 1300+ videos from real users reveal several important findings: the specificity of concept definitions and the number of training samples play important roles in determining detector performance; both audio and visual features contribute significantly to robust detection; inter-concept context fusion is more effective than the use of complex local features; and, most importantly, a satisfactory detection accuracy as high as 83% over diverse semantic concepts is demonstrated. The results confirm the feasibility of semantic classification of consumer videos and suggest novel ideas for further improvements. One important area is to incorporate other contextual information, such as user profiles and social relations. Another direction is to explore advanced frameworks that model the synchronization and temporal evolution of audio and visual features in temporal events.

8. ACKNOWLEDGEMENT

This project has been supported in part by a grant from Eastman Kodak. Wei Jiang is also a Kodak Graduate Research Fellow.

9. REFERENCES

[1] C.C. Chang and C.J. Lin. LIBSVM: a library for support vector machines.
[2] S.-F. Chang, et al. Columbia University TRECVID-2005 video search and high-level feature extraction. In NIST TRECVID Workshop, Gaithersburg, MD, 2005.
[3] A. Amir, et al. IBM Research TRECVID-2004 video retrieval system. In NIST TRECVID Workshop, Gaithersburg, MD, 2004.
[4] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In Proc. CVPR, 2003.
[5] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Dept. of Statistics, Stanford University, Technical Report, 1998.
[6] K. Grauman and T. Darrell. Approximate correspondences in high dimensions. In Advances in NIPS, 2006.
[7] W. Jiang, S.-F. Chang, and A.C. Loui. Kernel sharing with joint boosting for multi-class concept detection. In CVPR Workshop on Semantic Learning Applications in Multimedia, Minneapolis, MN, 2007.
[8] W. Jiang, S.-F. Chang, and A.C. Loui. Context-based concept fusion with boosted conditional random fields. In Proc. IEEE ICASSP, vol. 1, 2007.
[9] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In Proc. CVPR, vol. 2, 2006.
[10] A.C. Loui, et al. Kodak consumer video benchmark data set: concept definition and annotation. In ACM Multimedia Information Retrieval Workshop, Sept. 2007.
[11] D.G. Lowe. Object recognition from local scale-invariant features. In Proc. ICCV, 1999.
[12] NIST. TREC Video Retrieval Evaluation (TRECVID), 2001-2006.
[13] A. Torralba, K. Murphy, and W. Freeman. Sharing features: efficient boosting procedures for multi-class object detection. In Proc. CVPR, vol. 2, 2004.
[14] A. Torralba, K. Murphy, and W. Freeman. Contextual models for object detection using boosted random fields. In Advances in NIPS, 2004.
[15] A. Yanagawa, et al. Columbia University's baseline detectors for 374 LSCOM semantic visual concepts. Columbia University ADVENT Technical Report #222-2006-8, March 2007.
[16] A. Yanagawa, W. Hsu, and S.-F. Chang. Brief descriptions of visual features for baseline TRECVID concept detectors. Columbia University ADVENT Technical Report #219-2006-5, July 2006.
[17] Caltech data sets.
[18] T. Hofmann. Probabilistic latent semantic indexing. In Proc. SIGIR, 1999.


More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

Discriminative classifiers for object classification. Last time

Discriminative classifiers for object classification. Last time Dscrmnatve classfers for object classfcaton Thursday, Nov 12 Krsten Grauman UT Austn Last tme Supervsed classfcaton Loss and rsk, kbayes rule Skn color detecton example Sldng ndo detecton Classfers, boostng

More information

A Robust Method for Estimating the Fundamental Matrix

A Robust Method for Estimating the Fundamental Matrix Proc. VIIth Dgtal Image Computng: Technques and Applcatons, Sun C., Talbot H., Ourseln S. and Adraansen T. (Eds.), 0- Dec. 003, Sydney A Robust Method for Estmatng the Fundamental Matrx C.L. Feng and Y.S.

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

EYE CENTER LOCALIZATION ON A FACIAL IMAGE BASED ON MULTI-BLOCK LOCAL BINARY PATTERNS

EYE CENTER LOCALIZATION ON A FACIAL IMAGE BASED ON MULTI-BLOCK LOCAL BINARY PATTERNS P.G. Demdov Yaroslavl State Unversty Anatoly Ntn, Vladmr Khryashchev, Olga Stepanova, Igor Kostern EYE CENTER LOCALIZATION ON A FACIAL IMAGE BASED ON MULTI-BLOCK LOCAL BINARY PATTERNS Yaroslavl, 2015 Eye

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

Towards Semantic Knowledge Propagation from Text to Web Images

Towards Semantic Knowledge Propagation from Text to Web Images Guoun Q (Unversty of Illnos at Urbana-Champagn) Charu C. Aggarwal (IBM T. J. Watson Research Center) Thomas Huang (Unversty of Illnos at Urbana-Champagn) Towards Semantc Knowledge Propagaton from Text

More information

PRÉSENTATIONS DE PROJETS

PRÉSENTATIONS DE PROJETS PRÉSENTATIONS DE PROJETS Rex Onlne (V. Atanasu) What s Rex? Rex s an onlne browser for collectons of wrtten documents [1]. Asde ths core functon t has however many other applcatons that make t nterestng

More information

The Research of Support Vector Machine in Agricultural Data Classification

The Research of Support Vector Machine in Agricultural Data Classification The Research of Support Vector Machne n Agrcultural Data Classfcaton Le Sh, Qguo Duan, Xnmng Ma, Me Weng College of Informaton and Management Scence, HeNan Agrcultural Unversty, Zhengzhou 45000 Chna Zhengzhou

More information

Unsupervised Learning

Unsupervised Learning Pattern Recognton Lecture 8 Outlne Introducton Unsupervsed Learnng Parametrc VS Non-Parametrc Approach Mxture of Denstes Maxmum-Lkelhood Estmates Clusterng Prof. Danel Yeung School of Computer Scence and

More information

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

Determining the Optimal Bandwidth Based on Multi-criterion Fusion Proceedngs of 01 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 5 (01) (01) IACSIT Press, Sngapore Determnng the Optmal Bandwdth Based on Mult-crteron Fuson Ha-L Lang 1+, Xan-Mn

More information

Semantic Image Retrieval Using Region Based Inverted File

Semantic Image Retrieval Using Region Based Inverted File Semantc Image Retreval Usng Regon Based Inverted Fle Dengsheng Zhang, Md Monrul Islam, Guoun Lu and Jn Hou 2 Gppsland School of Informaton Technology, Monash Unversty Churchll, VIC 3842, Australa E-mal:

More information

A Novel Adaptive Descriptor Algorithm for Ternary Pattern Textures

A Novel Adaptive Descriptor Algorithm for Ternary Pattern Textures A Novel Adaptve Descrptor Algorthm for Ternary Pattern Textures Fahuan Hu 1,2, Guopng Lu 1 *, Zengwen Dong 1 1.School of Mechancal & Electrcal Engneerng, Nanchang Unversty, Nanchang, 330031, Chna; 2. School

More information

Classifier Selection Based on Data Complexity Measures *

Classifier Selection Based on Data Complexity Measures * Classfer Selecton Based on Data Complexty Measures * Edth Hernández-Reyes, J.A. Carrasco-Ochoa, and J.Fco. Martínez-Trndad Natonal Insttute for Astrophyscs, Optcs and Electroncs, Lus Enrque Erro No.1 Sta.

More information

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification Introducton to Artfcal Intellgence V22.0472-001 Fall 2009 Lecture 24: Nearest-Neghbors & Support Vector Machnes Rob Fergus Dept of Computer Scence, Courant Insttute, NYU Sldes from Danel Yeung, John DeNero

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS46: Mnng Massve Datasets Jure Leskovec, Stanford Unversty http://cs46.stanford.edu /19/013 Jure Leskovec, Stanford CS46: Mnng Massve Datasets, http://cs46.stanford.edu Perceptron: y = sgn( x Ho to fnd

More information

Fuzzy Filtering Algorithms for Image Processing: Performance Evaluation of Various Approaches

Fuzzy Filtering Algorithms for Image Processing: Performance Evaluation of Various Approaches Proceedngs of the Internatonal Conference on Cognton and Recognton Fuzzy Flterng Algorthms for Image Processng: Performance Evaluaton of Varous Approaches Rajoo Pandey and Umesh Ghanekar Department of

More information

Online Detection and Classification of Moving Objects Using Progressively Improving Detectors

Online Detection and Classification of Moving Objects Using Progressively Improving Detectors Onlne Detecton and Classfcaton of Movng Objects Usng Progressvely Improvng Detectors Omar Javed Saad Al Mubarak Shah Computer Vson Lab School of Computer Scence Unversty of Central Florda Orlando, FL 32816

More information

Local Quaternary Patterns and Feature Local Quaternary Patterns

Local Quaternary Patterns and Feature Local Quaternary Patterns Local Quaternary Patterns and Feature Local Quaternary Patterns Jayu Gu and Chengjun Lu The Department of Computer Scence, New Jersey Insttute of Technology, Newark, NJ 0102, USA Abstract - Ths paper presents

More information

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

More information

Learning-Based Top-N Selection Query Evaluation over Relational Databases

Learning-Based Top-N Selection Query Evaluation over Relational Databases Learnng-Based Top-N Selecton Query Evaluaton over Relatonal Databases Lang Zhu *, Wey Meng ** * School of Mathematcs and Computer Scence, Hebe Unversty, Baodng, Hebe 071002, Chna, zhu@mal.hbu.edu.cn **

More information

Audio Content Classification Method Research Based on Two-step Strategy

Audio Content Classification Method Research Based on Two-step Strategy (IJACSA) Internatonal Journal of Advanced Computer Scence and Applcatons, Audo Content Classfcaton Method Research Based on Two-step Strategy Sume Lang Department of Computer Scence and Technology Chongqng

More information

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION SHI-LIANG SUN, HONG-LEI SHI Department of Computer Scence and Technology, East Chna Normal Unversty 500 Dongchuan Road, Shangha 200241, P. R. Chna E-MAIL: slsun@cs.ecnu.edu.cn,

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

Backpropagation: In Search of Performance Parameters

Backpropagation: In Search of Performance Parameters Bacpropagaton: In Search of Performance Parameters ANIL KUMAR ENUMULAPALLY, LINGGUO BU, and KHOSROW KAIKHAH, Ph.D. Computer Scence Department Texas State Unversty-San Marcos San Marcos, TX-78666 USA ae049@txstate.edu,

More information

Efficient Segmentation and Classification of Remote Sensing Image Using Local Self Similarity

Efficient Segmentation and Classification of Remote Sensing Image Using Local Self Similarity ISSN(Onlne): 2320-9801 ISSN (Prnt): 2320-9798 Internatonal Journal of Innovatve Research n Computer and Communcaton Engneerng (An ISO 3297: 2007 Certfed Organzaton) Vol.2, Specal Issue 1, March 2014 Proceedngs

More information

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law)

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law) Machne Learnng Support Vector Machnes (contans materal adapted from talks by Constantn F. Alfers & Ioanns Tsamardnos, and Martn Law) Bryan Pardo, Machne Learnng: EECS 349 Fall 2014 Support Vector Machnes

More information

Synthesizer 1.0. User s Guide. A Varying Coefficient Meta. nalytic Tool. Z. Krizan Employing Microsoft Excel 2007

Synthesizer 1.0. User s Guide. A Varying Coefficient Meta. nalytic Tool. Z. Krizan Employing Microsoft Excel 2007 Syntheszer 1.0 A Varyng Coeffcent Meta Meta-Analytc nalytc Tool Employng Mcrosoft Excel 007.38.17.5 User s Gude Z. Krzan 009 Table of Contents 1. Introducton and Acknowledgments 3. Operatonal Functons

More information

y and the total sum of

y and the total sum of Lnear regresson Testng for non-lnearty In analytcal chemstry, lnear regresson s commonly used n the constructon of calbraton functons requred for analytcal technques such as gas chromatography, atomc absorpton

More information

Face Detection with Deep Learning

Face Detection with Deep Learning Face Detecton wth Deep Learnng Yu Shen Yus122@ucsd.edu A13227146 Kuan-We Chen kuc010@ucsd.edu A99045121 Yzhou Hao y3hao@ucsd.edu A98017773 Mn Hsuan Wu mhwu@ucsd.edu A92424998 Abstract The project here

More information

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach Angle Estmaton and Correcton of Hand Wrtten, Textual and Large areas of Non-Textual Document Images: A Novel Approach D.R.Ramesh Babu Pyush M Kumat Mahesh D Dhannawat PES Insttute of Technology Research

More information

Real-time Joint Tracking of a Hand Manipulating an Object from RGB-D Input

Real-time Joint Tracking of a Hand Manipulating an Object from RGB-D Input Real-tme Jont Tracng of a Hand Manpulatng an Object from RGB-D Input Srnath Srdhar 1 Franzsa Mueller 1 Mchael Zollhöfer 1 Dan Casas 1 Antt Oulasvrta 2 Chrstan Theobalt 1 1 Max Planc Insttute for Informatcs

More information

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION Paulo Quntlano 1 & Antono Santa-Rosa 1 Federal Polce Department, Brasla, Brazl. E-mals: quntlano.pqs@dpf.gov.br and

More information

Simulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010

Simulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010 Smulaton: Solvng Dynamc Models ABE 5646 Week Chapter 2, Sprng 200 Week Descrpton Readng Materal Mar 5- Mar 9 Evaluatng [Crop] Models Comparng a model wth data - Graphcal, errors - Measures of agreement

More information

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z.

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z. TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS Muradalyev AZ Azerbajan Scentfc-Research and Desgn-Prospectng Insttute of Energetc AZ1012, Ave HZardab-94 E-mal:aydn_murad@yahoocom Importance of

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics Introducton G10 NAG Fortran Lbrary Chapter Introducton G10 Smoothng n Statstcs Contents 1 Scope of the Chapter... 2 2 Background to the Problems... 2 2.1 Smoothng Methods... 2 2.2 Smoothng Splnes and Regresson

More information

An Image Fusion Approach Based on Segmentation Region

An Image Fusion Approach Based on Segmentation Region Rong Wang, L-Qun Gao, Shu Yang, Yu-Hua Cha, and Yan-Chun Lu An Image Fuson Approach Based On Segmentaton Regon An Image Fuson Approach Based on Segmentaton Regon Rong Wang, L-Qun Gao, Shu Yang 3, Yu-Hua

More information

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching A Fast Vsual Trackng Algorthm Based on Crcle Pxels Matchng Zhqang Hou hou_zhq@sohu.com Chongzhao Han czhan@mal.xjtu.edu.cn Ln Zheng Abstract: A fast vsual trackng algorthm based on crcle pxels matchng

More information

Active Contours/Snakes

Active Contours/Snakes Actve Contours/Snakes Erkut Erdem Acknowledgement: The sldes are adapted from the sldes prepared by K. Grauman of Unversty of Texas at Austn Fttng: Edges vs. boundares Edges useful sgnal to ndcate occludng

More information

The Codesign Challenge

The Codesign Challenge ECE 4530 Codesgn Challenge Fall 2007 Hardware/Software Codesgn The Codesgn Challenge Objectves In the codesgn challenge, your task s to accelerate a gven software reference mplementaton as fast as possble.

More information

TN348: Openlab Module - Colocalization

TN348: Openlab Module - Colocalization TN348: Openlab Module - Colocalzaton Topc The Colocalzaton module provdes the faclty to vsualze and quantfy colocalzaton between pars of mages. The Colocalzaton wndow contans a prevew of the two mages

More information

Hermite Splines in Lie Groups as Products of Geodesics

Hermite Splines in Lie Groups as Products of Geodesics Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the

More information

Discriminative Dictionary Learning with Pairwise Constraints

Discriminative Dictionary Learning with Pairwise Constraints Dscrmnatve Dctonary Learnng wth Parwse Constrants Humn Guo Zhuoln Jang LARRY S. DAVIS UNIVERSITY OF MARYLAND Nov. 6 th, Outlne Introducton/motvaton Dctonary Learnng Dscrmnatve Dctonary Learnng wth Parwse

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information

Simulation Based Analysis of FAST TCP using OMNET++

Simulation Based Analysis of FAST TCP using OMNET++ Smulaton Based Analyss of FAST TCP usng OMNET++ Umar ul Hassan 04030038@lums.edu.pk Md Term Report CS678 Topcs n Internet Research Sprng, 2006 Introducton Internet traffc s doublng roughly every 3 months

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

Keywords - Wep page classification; bag of words model; topic model; hierarchical classification; Support Vector Machines

Keywords - Wep page classification; bag of words model; topic model; hierarchical classification; Support Vector Machines (IJCSIS) Internatonal Journal of Computer Scence and Informaton Securty, Herarchcal Web Page Classfcaton Based on a Topc Model and Neghborng Pages Integraton Wongkot Srura Phayung Meesad Choochart Haruechayasak

More information

Fitting & Matching. Lecture 4 Prof. Bregler. Slides from: S. Lazebnik, S. Seitz, M. Pollefeys, A. Effros.

Fitting & Matching. Lecture 4 Prof. Bregler. Slides from: S. Lazebnik, S. Seitz, M. Pollefeys, A. Effros. Fttng & Matchng Lecture 4 Prof. Bregler Sldes from: S. Lazebnk, S. Setz, M. Pollefeys, A. Effros. How do we buld panorama? We need to match (algn) mages Matchng wth Features Detect feature ponts n both

More information

A Background Subtraction for a Vision-based User Interface *

A Background Subtraction for a Vision-based User Interface * A Background Subtracton for a Vson-based User Interface * Dongpyo Hong and Woontack Woo KJIST U-VR Lab. {dhon wwoo}@kjst.ac.kr Abstract In ths paper, we propose a robust and effcent background subtracton

More information

Fast Feature Value Searching for Face Detection

Fast Feature Value Searching for Face Detection Vol., No. 2 Computer and Informaton Scence Fast Feature Value Searchng for Face Detecton Yunyang Yan Department of Computer Engneerng Huayn Insttute of Technology Hua an 22300, Chna E-mal: areyyyke@63.com

More information

Face Recognition University at Buffalo CSE666 Lecture Slides Resources:

Face Recognition University at Buffalo CSE666 Lecture Slides Resources: Face Recognton Unversty at Buffalo CSE666 Lecture Sldes Resources: http://www.face-rec.org/algorthms/ Overvew of face recognton algorthms Correlaton - Pxel based correspondence between two face mages Structural

More information

Factor Graphs for Region-based Whole-scene Classification

Factor Graphs for Region-based Whole-scene Classification Factor Graphs for Regon-based Whole-scene Classfcaton Matthew R. Boutell Jebo Luo Chrstopher M. Brown CSSE Dept. Res. and Dev. Labs Dept. of Computer Scence Rose-Hulman Inst. of Techn. Eastman Kodak Company

More information

Modular PCA Face Recognition Based on Weighted Average

Modular PCA Face Recognition Based on Weighted Average odern Appled Scence odular PCA Face Recognton Based on Weghted Average Chengmao Han (Correspondng author) Department of athematcs, Lny Normal Unversty Lny 76005, Chna E-mal: hanchengmao@163.com Abstract

More information

What is Object Detection? Face Detection using AdaBoost. Detection as Classification. Principle of Boosting (Schapire 90)

What is Object Detection? Face Detection using AdaBoost. Detection as Classification. Principle of Boosting (Schapire 90) CIS 5543 Coputer Vson Object Detecton What s Object Detecton? Locate an object n an nput age Habn Lng Extensons Vola & Jones, 2004 Dalal & Trggs, 2005 one or ultple objects Object segentaton Object detecton

More information

Deep Classification in Large-scale Text Hierarchies

Deep Classification in Large-scale Text Hierarchies Deep Classfcaton n Large-scale Text Herarches Gu-Rong Xue Dkan Xng Qang Yang 2 Yong Yu Dept. of Computer Scence and Engneerng Shangha Jao-Tong Unversty {grxue, dkxng, yyu}@apex.sjtu.edu.cn 2 Hong Kong

More information

Three supervised learning methods on pen digits character recognition dataset

Three supervised learning methods on pen digits character recognition dataset Three supervsed learnng methods on pen dgts character recognton dataset Chrs Flezach Department of Computer Scence and Engneerng Unversty of Calforna, San Dego San Dego, CA 92093 cflezac@cs.ucsd.edu Satoru

More information

Classification of Face Images Based on Gender using Dimensionality Reduction Techniques and SVM

Classification of Face Images Based on Gender using Dimensionality Reduction Techniques and SVM Classfcaton of Face Images Based on Gender usng Dmensonalty Reducton Technques and SVM Fahm Mannan 260 266 294 School of Computer Scence McGll Unversty Abstract Ths report presents gender classfcaton based

More information

Signature and Lexicon Pruning Techniques

Signature and Lexicon Pruning Techniques Sgnature and Lexcon Prunng Technques Srnvas Palla, Hansheng Le, Venu Govndaraju Centre for Unfed Bometrcs and Sensors Unversty at Buffalo {spalla2, hle, govnd}@cedar.buffalo.edu Abstract Handwrtten word

More information

SVM-based Learning for Multiple Model Estimation

SVM-based Learning for Multiple Model Estimation SVM-based Learnng for Multple Model Estmaton Vladmr Cherkassky and Yunqan Ma Department of Electrcal and Computer Engneerng Unversty of Mnnesota Mnneapols, MN 55455 {cherkass,myq}@ece.umn.edu Abstract:

More information

A Modified Median Filter for the Removal of Impulse Noise Based on the Support Vector Machines

A Modified Median Filter for the Removal of Impulse Noise Based on the Support Vector Machines A Modfed Medan Flter for the Removal of Impulse Nose Based on the Support Vector Machnes H. GOMEZ-MORENO, S. MALDONADO-BASCON, F. LOPEZ-FERRERAS, M. UTRILLA- MANSO AND P. GIL-JIMENEZ Departamento de Teoría

More information

Machine Learning 9. week

Machine Learning 9. week Machne Learnng 9. week Mappng Concept Radal Bass Functons (RBF) RBF Networks 1 Mappng It s probably the best scenaro for the classfcaton of two dataset s to separate them lnearly. As you see n the below

More information

High-Boost Mesh Filtering for 3-D Shape Enhancement

High-Boost Mesh Filtering for 3-D Shape Enhancement Hgh-Boost Mesh Flterng for 3-D Shape Enhancement Hrokazu Yagou Λ Alexander Belyaev y Damng We z Λ y z ; ; Shape Modelng Laboratory, Unversty of Azu, Azu-Wakamatsu 965-8580 Japan y Computer Graphcs Group,

More information

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET 1 BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET TZU-CHENG CHUANG School of Electrcal and Computer Engneerng, Purdue Unversty, West Lafayette, Indana 47907 SAUL B. GELFAND School

More information