Large-scale Web Video Event Classification by use of Fisher Vectors


Chen Sun and Ram Nevatia
University of Southern California, Institute for Robotics and Intelligent Systems
Los Angeles, CA 90089, USA

Abstract

Event recognition has been an important topic in computer vision research due to its many applications. However, most of the work has focused on videos taken from a fixed camera, known environments and basic events. Here, we focus on classification of unconstrained web videos into much higher level activities. We follow the approach of constructing fixed length feature vectors from local feature descriptors for classification using an SVM. Our key contribution is the study of the utility of the Fisher Vector representation in improving results compared to the conventional Bag-of-Words (BoW) approach. Such coding has been shown to be useful for static image classification in the past but has not been applied to video categorization. We perform tests on the challenging NIST TRECVID Multimedia Event Detection (MED) dataset, which has over a thousand hours of unconstrained user generated videos; our approach achieves as much as 35% improvement over the BoW baseline. We also offer an analysis of possible causes of such improvements.

1. Introduction

Recognition of events in videos is important for many applications and has been receiving increasing attention in computer vision research in recent years. Most of this work is focused on analysis of videos in a known environment with a fixed camera, and the events of interest are basic ones such as running, jumping and bending, e.g. see [14]. There is also another important class of videos which consists largely of user captured content uploaded to the Internet: in such videos, quality is variable, the camera is likely in motion and there are large variations in the background environments. Our goal in this research is to provide automatic classification of such videos as belonging to classes defined by the large scale events taking place in them, e.g. a wedding video or one where someone performs a board trick.
There are many applications of such classification, including sharing and browsing of large video databases. To promote research in large scale video categorization, the National Institute of Standards and Technology (NIST) has been sponsoring annual TRECVID evaluations [1]. These evaluations provide large scale data collections (courtesy of the Linguistic Data Consortium, or LDC, of the University of Pennsylvania), a list of event classes to be annotated and procedures to evaluate the results. We focus our experiments on data provided by these evaluations, in particular the datasets known as MED11 and MED12 (these datasets also include speech content, which we ignore for the study described here).

A variety of approaches have been developed for constrained environment activity analysis. These include use of statistics of local features as well as analysis of semantic entities by detecting and tracking actors, their pose and the objects they interact with. At the high level, web video activities are also characterized by actors and objects; however, detection of such entities in unconstrained environments is extremely difficult, and inference of higher level activities is also challenging. Hence, the focus has been on statistical techniques using local features; some recent results on MED11 data have been reported in [15][11].

The statistical techniques typically have four stages: low-level feature extraction, feature encoding, classification and fusion of results (from multiple channels if available). At the feature extraction stage, local image or video patches are selected, either densely or via some saliency selection process; descriptions are then computed for each patch. Feature detection and description techniques build on ideas developed for object recognition in still images but also incorporate the temporal dimension. Interestingly, although it may be expected that sparse salient feature points will be more robust, experiments show that dense features perform better for more complex videos [18].
The feature encoding stage turns sets of local features into fixed length vectors; this is usually accomplished by vector quantization of feature vectors and building histograms of visual codewords, commonly known as Bag-of-Words (BoW) encoding [3]. Variations include soft quantization, where the distance from a number of codewords is considered [16][19]. Feature vectors are used to train classifiers (typically $\chi^2$ kernel SVMs). Several late fusion strategies can be used to combine the classification results of different low-level features; such fusion typically shows consistent improvement [11].

Impressive results have been obtained on a large dataset (MED11) that evaluates classification accuracy on thousands of videos with very diverse characteristics for 10 high level events. Tamrakar et al. [15] offer a very systematic analysis of the performance of different features and their combinations. Our aim in this paper is to increase the discriminative power of these features by use of difference coding techniques (of which Fisher Vector is one).

The Fisher kernel [4] was proposed to utilize the advantages of a generative model in a discriminative framework. The basic idea is to represent a set of data by the gradient of its log-likelihood with respect to the model parameters, and to measure the distance between instances with the Fisher kernel. The fixed-length representation vector is also called a Fisher Vector. For local features extracted from videos, it is natural to model their distribution as a Gaussian mixture model (GMM), forming a soft codebook. With a GMM, the dimension of the Fisher Vector is linear in the number of mixtures and the local feature dimension.

Fisher Vector has been applied to static image classification [12] and indexing [5], showing significant improvements over BoW methods. In this paper, we show that the Fisher Vector also improves performance greatly over BoW for video activity classification. We also provide an analysis of what may be the cause of such improvements.

Our contribution is thus three-fold: first, we propose a video event classification framework using Fisher Vector encoding; second, we give an analysis of several desired properties of Fisher Vector for this task; and finally, we provide a series of evaluations on the choice of Fisher Vector parameters on a complicated large scale web video dataset.

The remainder of this paper is organized as follows: Related work on video event classification is discussed in Section 2.
In Section 3, we describe our video classification framework with Fisher Vector encoding, and give an analysis of the benefits of Fisher Vector. Finally, we show experimental results on the Fisher kernel, its variation and the BoW baseline in Section 4, and conclude the paper.

2. Related Work

Since a video is a sequence of image frames, low level features designed for images, such as SIFT [10], can all be applied to video classification. To take motion information into account, there are several innovations in motion related features. STIP [7], for example, extends a 2D feature detector to 3D space. An evaluation by Wang et al. [18] shows that dense features may have better performance compared to those filtered by detectors. In this category, the dense trajectory features [17] track local points to obtain short tracklets, and describe the volumes around each tracklet. A recent evaluation [15] on a complex web video dataset shows that a late fusion of different features almost always helps.

Besides low level features, some argue that mid-level or high-level features can provide better performance. A common approach is to build pre-trained, fast classifiers for a set of mid-level or high-level concepts, and encode the classifier responses as the video representation. Object Bank [9], for example, uses a list of more than 170 object concepts. Its counterpart in event classification, Action Bank [13], has 205 template action detectors. A number of papers using various combinations of features and classifiers on the MED11 dataset can be found at [2].

Difference coding techniques were first developed in the machine learning literature and then applied to image classification tasks, as described earlier in Section 1. To the best of our knowledge, these techniques have not been applied to motion features in prior work.

3. Classification Framework

In this section, we describe the video event classification framework, including feature extraction, Fisher Vector encoding, postprocessing and classification.

3.1. Local Feature Extraction

We perform experiments with both a sparse local feature and a dense local feature.
We use Laptev's Space-Time Interest Points (STIP) [7] as the sparse feature; each interest point is described by histograms of gradients (HoG) and optical flow (HoF) of its surrounding volume. For the dense feature, we choose Wang et al.'s Dense Trajectory (DT) [17] features; the descriptor includes the shape of the trajectory, HoG, HoF and motion boundary histograms (MBH). These choices are made based on the good performance of each in its category.

When the environment is constrained and the camera is fixed, sparse features are likely to select robust features that are highly correlated to events of interest. However, web videos are usually taken in the wild, and in most cases camera motion is unknown. It is likely that camera motion causes many feature points to originate from the static background (see Figure 1). Dense features do not suffer from feature point selection issues, but they treat foreground and background information equally, which can also be a source of distraction.

3.2. Fisher Vector Encoding

One key idea usually associated with local features is that of a visual words codebook, obtained by clustering feature points and quantizing the feature space. A set of feature points can then be represented by a fixed length histogram.
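The codebook histogram described above can be sketched as follows. This is a minimal NumPy illustration of hard vector quantization, not the authors' implementation; the function name `bow_histogram`, the toy codebook and the toy features are hypothetical, and the codebook is assumed to have been learned beforehand (for example by K-Means).

```python
import numpy as np

def bow_histogram(features, codebook):
    """Hard-assign each local feature to its nearest codeword and count.

    features: (T, D) array of local descriptors from one video.
    codebook: (K, D) array of visual words (e.g. K-Means centroids).
    Returns an l1-normalized histogram of length K.
    """
    # Squared Euclidean distance between every feature and every codeword.
    diffs = features[:, None, :] - codebook[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)            # shape (T, K)
    nearest = np.argmin(sq_dists, axis=1)            # closest codeword per feature
    counts = np.bincount(nearest, minlength=codebook.shape[0]).astype(float)
    total = counts.sum()
    return counts / total if total > 0 else counts

# Toy usage: four 2-D features and three hypothetical codewords.
codebook = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
features = np.array([[0.5, 0.2], [9.0, 1.0], [0.1, 9.5], [1.0, -1.0]])
hist = bow_histogram(features, codebook)
```

Whatever the number of feature points T, the output always has length K, which is what makes the representation usable as a fixed length input to a classifier.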


However, some feature points may be far from any visual word. To compensate for this, van Gemert et al. propose a soft assignment of visual words [16], but each codeword is still modeled only by its mean.

Figure 1. Camera motion can make many sparse feature points fall into the background. Left: a frame from a bike trick video. Right: the feature points detected by STIP.

3.2.1. Fisher Vector under Gaussian Mixture Model

Fisher Vector concepts have been introduced in previous work [4] and applied to image classification and retrieval by several authors [12][5]. We repeat their formulation below for ease of reading and to make this paper self-contained.

According to [4], suppose we have a generative probability model $P(X|\theta)$, where $X = \{x_i \mid i = 1, 2, \ldots, N\}$ is a sample set and $\theta$ is the set of model parameters. We can map $X$ into a vector by computing the gradient vector of its log-likelihood function at the current $\theta$:

$$F_X = \nabla_\theta \log P(X|\theta) \tag{1}$$

$F_X$ is a Fisher Vector; it can be seen as a measurement of the direction in which to change $\theta$ to fit $X$ better. Since $\theta$ is fixed, the dimensions of the Fisher Vector for different $X$ are the same. This makes $F_X$ a suitable alternative for representing a video by its local features.

A GMM has the form

$$P(X|\theta) = \sum_{i=1}^{K} w_i \, \mathcal{N}(X; \mu_i, \Sigma_i) \tag{2}$$

where $K$ is the number of clusters, $w_i$ is the weight of the $i$th cluster, and $\mu_i$, $\Sigma_i$ are the mean and covariance matrix of the $i$th cluster. As the dimension of the Fisher Vector is the same as the number of parameters, diagonal covariance matrices are usually assumed to simplify the model and thus reduce the size of the Fisher Vector.

Denote $L(X|\theta)$ as the log-likelihood function, the $d$th dimension of $\mu_i$ as $\mu_i^d$, the $d$th diagonal element of $\Sigma_i$ as $(\sigma_i^d)^2$, the local feature dimension as $D$ and the total number of feature points as $T$. By assuming that each local feature is independent, the Fisher Vector $F_X$ of feature point set $X$ is

$$F_X = \left[ \frac{\partial L(X|\theta)}{\partial \mu}, \; \frac{\partial L(X|\theta)}{\partial \Sigma} \right] \tag{3}$$

$$\frac{\partial L(X|\theta)}{\partial \mu_i^d} = \sum_{t=1}^{T} \gamma_t(i) \left\{ \frac{x_t^d - \mu_i^d}{(\sigma_i^d)^2} \right\} \tag{4}$$

$$\frac{\partial L(X|\theta)}{\partial \sigma_i^d} = \sum_{t=1}^{T} \gamma_t(i) \left\{ \frac{(x_t^d - \mu_i^d)^2}{(\sigma_i^d)^3} - \frac{1}{\sigma_i^d} \right\} \tag{5}$$

Here, $\gamma_t(i)$ is the probability that feature point $x_t$ belongs to the $i$th cluster, given by

$$\gamma_t(i) = \frac{w_i \, \mathcal{N}(x_t; \mu_i, \Sigma_i)}{\sum_{j=1}^{K} w_j \, \mathcal{N}(x_t; \mu_j, \Sigma_j)} \tag{6}$$

The dimension of $F_X$ is $2KD$. The first term, $\partial L(X|\theta) / \partial \mu$, is composed of first order differences of feature points to cluster centers. The second term, $\partial L(X|\theta) / \partial \Sigma$, contains second order terms. Both are weighted by the covariances and soft assignment terms.

Fisher Vector with GMM can be seen as an extension of BoW [6]. It accumulates the relative position to each cluster center and models codeword assignment uncertainty, which has been shown to be beneficial for BoW encoding [16].

3.2.2. Non-probabilistic Fisher Vector

In [5], the authors give a non-probabilistic approximation of Fisher Vector, called Vector of Locally Aggregated Descriptors (VLAD). It uses K-Means clustering to get a codebook; each value in VLAD is computed as

$$v_i^d = \sum_{x_t : NN(x_t) = i} \left( x_t^d - \mu_i^d \right) \tag{7}$$

Compared with Fisher Vector, VLAD drops the second order terms and assumes uniform covariance among all dimensions. It also assigns each feature point to its nearest neighbor in the codebook. The feature dimension is $KD$.

3.2.3. Comparison with BoW

The basic idea of both BoW and Fisher Vector is to map a feature point set $X$ into a fixed dimension vector, from which the distribution in the original feature space can be reconstructed approximately. However, there are also several key differences, discussed below.

First, BoW uses a hard quantization of the feature space by K-Means, where each cluster has the same importance and is described by its centroid only. Meanwhile, Fisher Vector assumes a GMM is the underlying generative model for local features. Although modifications to BoW can help it capture more information, such as assigning different weights to codewords and soft assignment of codewords [16], a GMM incorporates them naturally.

Secondly, in Fisher Vector, a local feature's contribution to a Gaussian mixture depends on its relative position to the mixture center. In Figure 2, suppose we have a trained GMM for Fisher Vector as well as a trained visual codebook for BoW, and $O$ happens to be both the mean of a mixture in the GMM and the centroid of a codeword. Given two points $A$ and $B$ to be coded, their distances to $O$ are the same but the vectors $AO$ and $BO$ are different, so they contribute the same to the codeword in BoW but differently to the mixture in Fisher Vector.

Figure 2. Suppose O is both the mean of a Gaussian mixture and a centroid of K-Means. A and B contribute the same under BoW but differently under Fisher Vector.

Finally, let $X$ be separated into two sets $X_r$ and $X_b$, where $X_b$ contains the points that fit the GMM well. We have

$$\frac{\partial L(X|\theta)}{\partial \theta} = \frac{\partial L(X_r|\theta)}{\partial \theta} + \frac{\partial L(X_b|\theta)}{\partial \theta} \tag{8}$$

By definition, $\partial L(X_b|\theta) / \partial \theta \approx 0$, so

$$\frac{\partial L(X|\theta)}{\partial \theta} \approx \frac{\partial L(X_r|\theta)}{\partial \theta} \tag{9}$$

Since $\theta$ is inferred by training on general data, which is likely to be dominated by background features, the above implies that Fisher Vector can suppress the part of the data that fits the general model well. Figure 2 also gives an illustration: the triangles and circles are feature points taken from two different videos; though their positions are different, they fit the model perfectly and thus have no overall influence on the Fisher Vector.

3.3. Postprocessing and Classification

Both BoW and Fisher Vector drop the spatial and temporal information of the feature points. However, spatial and temporal structures can sometimes be useful for classification. Lazebnik et al. proposed to build spatial pyramids to preserve approximate location information for the image classification task [8]. Here, we use a similar approach, but take temporal information into account.

Figure 3. Spatial-Temporal Pyramids. For pyramid level $i$, the video is divided into $2^i$ slices along the timeline, and $2^i$ by $2^i$ blocks for each frame.
At pyramid level $i$, the video is divided into $2^i$ slices along the timeline, and $2^i$ by $2^i$ blocks for each frame. Suppose there are $P_s$ spatial pyramid levels and $P_t$ temporal pyramid levels; the total number of sub-volumes is then $(4^{P_s} - 1)(2^{P_t} - 1)/3$. Encoding is performed for the local features in each sub-volume, and the final representation is a concatenation of all vectors.

We then normalize each dimension of $F_X$ by a power normalization:

$$f(x_i) = \mathrm{sign}(x_i) \, |x_i|^{\alpha}, \quad 0 \le \alpha \le 1 \tag{10}$$

The power normalization step is important when a few dimensions have large values and dominate the vector; the normalized vector becomes flatter as $\alpha$ decreases. It is suggested by [6] for the image retrieval task.

We train Support Vector Machine (SVM) classifiers. Though the similarity of Fisher Vectors is usually measured by an inner product weighted by the inverse of the Fisher information matrix, the Fisher Vector itself can be used with nonlinear kernels. Moreover, to combine the first and second order terms in Fisher Vector, we can either directly concatenate them or build classifiers separately and do a late fusion. For the former approach, we use

$$K(F_{X_i}, F_{X_j}) = \exp\left( - \sum_{f \in F} \frac{1}{A_f} D(F^f_{X_i}, F^f_{X_j}) \right) \tag{11}$$

where

$$D(F^f_{X_i}, F^f_{X_j}) = \left\| F^f_{X_i} - F^f_{X_j} \right\|_2^2 \tag{12}$$
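The kernel of equations (11) and (12) can be sketched in NumPy as below. This is an illustration under stated assumptions rather than the authors' code: the function names and the dictionary layout (one array per feature type, for example the first and second order terms) are hypothetical, and each normalizer is taken to be the mean of all pairwise squared distances within the training data for that feature type.

```python
import numpy as np

def squared_l2_distances(A, B):
    """Pairwise squared Euclidean distances between rows of A (m, d) and B (n, d)."""
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.maximum(sq, 0.0)  # clip tiny negatives caused by round-off

def multi_feature_kernel(train_feats, query_feats=None):
    """Kernel of equations (11)-(12) over several feature vector types.

    train_feats: dict mapping a feature-type name to an (n, d_f) array.
    query_feats: dict with the same keys and (m, d_f) arrays;
                 defaults to the training data itself.
    Returns an (m, n) kernel matrix.
    """
    if query_feats is None:
        query_feats = train_feats
    exponent = 0.0
    for name, X_train in train_feats.items():
        # Assumption: the normalizer is the mean pairwise training distance
        # for this feature type (assumed non-zero here).
        avg_dist = squared_l2_distances(X_train, X_train).mean()
        exponent = exponent + squared_l2_distances(query_feats[name], X_train) / avg_dist
    return np.exp(-exponent)

# Toy usage with two hypothetical feature types of different dimension.
rng = np.random.default_rng(0)
feats = {"fv1": rng.normal(size=(5, 8)), "fv2": rng.normal(size=(5, 6))}
K_train = multi_feature_kernel(feats)
```

Dividing each distance by its own average keeps feature types with very different scales or dimensions from dominating the exponent. A matrix computed this way could be passed to an SVM implementation that accepts precomputed kernels.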

Here $F$ is the set of different feature vector types, $D(\cdot)$ is the distance function, and $A_f$ is the average of distances for feature type $f$ in the training data. This kernel function is a special case of the RBF kernel, where features are concatenated with different weights and sigma is set based on average distances.

Late fusion is a way to combine decision confidences from different classifiers; it has shown superior performance to early fusion in some tasks. In this paper, we use a geometric average of individual scores.

4. Experiments

This section describes the dataset used for evaluation and provides experimental results.

4.1. Dataset

We use videos selected from the entire TRECVID MED11 video corpus and from the MED12 Event Kit data [1] for evaluation. The datasets contain diverse, user-generated videos that vary in length, quality and resolution. The total length of the videos is more than 1400 hours. There are 25 positive event classes defined in this data set, along with a big collection of samples that belong to none of these. The event concepts can be basically categorized as:

Distinct by objects: board trick, bike trick, making a sandwich, etc.
Distinct by motion patterns: changing a vehicle tire, getting a vehicle unstuck, etc.
Distinct by high-level motives: birthday party, wedding ceremony, marriage proposal, etc.

A complete list of event definitions can be found in [1] (see footnote 1).

For our evaluation, we utilize two different data partitions. A smaller set, called Event Kit (EK), has 2062 positive samples of events 1 to 15. It is used for fast selection of classifier independent parameters. A larger set of videos is separated into 2 partitions, a training set (Train) and a test set (Test); its goal is to evaluate the framework's performance on a large scale dataset. The number of videos in each partition is shown in Table 1. We sampled randomly from the large set to create these partitions.

4.2. Experiment Setup

In this section, we describe the proposed classification framework as well as the baseline system.
Footnote 1: There are five events not belonging to the test set: attempting a board trick, feeding an animal, landing a fish, wedding ceremony and working on a woodworking project.

Table 1. Number of positive and negative samples in each partition. Columns: Partition, #Events, #Pos, #Neg, #Total. Rows: Event Kit, Train, Test.

4.2.1. Fisher Vector and VLAD Generation

We use Laptev's STIP implementation (footnote 2), with default parameters and sparse feature detection mode. For Dense Trajectory (footnote 3), we first resize the video's width to 320 and set the sampling stride to 10 pixels. Both descriptors have several components; we concatenate them directly to form a 162 dimension feature vector for STIP features and a 426 dimension feature vector for DT features.

Since the length of the Fisher Vector is linear in the dimension of the local features, PCA is used to project the features onto a lower dimension space; we project STIP features to 64 dimensions and DT features to 128 dimensions.

We randomly select a subset of feature descriptors from the Train set. These descriptors are used to train the PCA projection matrix and to get codebooks with GMM and K-Means clustering.

With the spatio-temporal pyramid, each sub-volume has its own Fisher Vector or VLAD. According to our experiments, increasing the number of spatial pyramid layers boosts the performance, while increasing the number of temporal pyramid layers has little influence or even hampers the performance. We set the number of spatial pyramid layers (#SP) to 2 and temporal pyramid layers (#TP) to 1, to balance classification performance and speed.

Input: local feature point set X from a single video
Output: Fisher Vector / VLAD F_X
Build spatial-temporal pyramid V = {V_1, ..., V_k}
Project X to X' by PCA
Set F_X as an empty vector
for V_i in V do
    Select point set X'_i that lies in V_i
    Encode X'_i to Fisher Vector or VLAD F_{X_i}
    Power normalize F_{X_i}
    l2-normalize F_{X_i}
    Concatenate F_{X_i} at the end of F_X
end
l2-normalize F_X
Algorithm 1: Fisher Vector/VLAD generation

Footnote 2: laptev/download.html#stip
Footnote 3: trajectories

Before concatenation, we normalize each vector in two steps. First, a power normalization is conducted on each dimension. Then, all vectors are l2-normalized, concatenated together and l2-normalized again. This is different from the traditional spatial pyramid, where histograms from larger cells are penalized and normalization happens after concatenation [8]. We use l2-normalization since it is natural with the linear kernel, which is evaluated later. The full procedure is shown in Algorithm 1.

In the following experiments, we refer to the event classification framework with VLAD as VLAD, with the first order components of Fisher Vector as FV1 and with the second order components of Fisher Vector as FV2.

4.2.2. BoW Baseline

We use the same local features with no dimension reduction, and a standard BoW approach with the following modifications. First, instead of hard assigning each local feature to its nearest neighbor, soft assignment to the nearest #K neighbors is used [16]. Secondly, we use the spatio-temporal pyramid to encode spatial and temporal structures. Based on experimental results, we set the codebook size to 1000, #K = 4, #SP = 3 and #TP = 1; the final representation is l1-normalized to form a histogram.

4.2.3. Classification Scheme

Classifiers are built with SVMs in a one-versus-rest approach. We use the probability output produced by the classifier as the confidence value. For parameter selection, we use 5-fold cross validation: training data are separated into 5 parts randomly, with the ratio of positive to negative samples approximately kept, and the parameter set with the highest average performance is selected. Because the dataset is highly unbalanced, a traditional accuracy based parameter search is quite likely to produce a trivial classifier that predicts all queries as negative. We choose to optimize mean average precision (mAP) instead.

The kernel function also plays a key role in SVM classification. For Fisher Vector and VLAD, we compare the kernel mentioned in Section 3.3 and the linear kernel.
For the BoW histogram, the $\chi^2$ kernel is used.

4.3. Results on Event Kit

We use Event Kit data and STIP features to study how to choose non-classifier related parameters for Fisher Vector and VLAD, including the power normalization factor $\alpha$, the PCA projected dimension $D$ and the number of clusters $K$; their default values are 0.5, 64 and 64, respectively. A Gaussian kernel is used for the SVM, and mAPs are calculated based on average cross-validation results.

Figure 4. mAPs for different power normalization factors $\alpha$ with VLAD, FV1 and FV2.

Table 2. mAP of VLAD, FV1 and FV2, with and without PCA. Columns: Encoding, no PCA (162 dim), PCA (64 dim). Rows: VLAD, FV1, FV2.

4.3.1. Effect of Power Normalization

As discussed above, when $0 \le \alpha < 1$, power normalization can smooth the spikes in the feature vector. We set $\alpha$ to 0.1, 0.3, 0.5 and 1.0; the resulting performance is shown in Figure 4. From the figure, it is easy to see that the power normalization step improves mAP. Meanwhile, VLAD is more susceptible to changes of $\alpha$ than the Fisher Vector. One possible reason is that VLAD treats all cluster centers as equally important and ignores covariance information.

4.3.2. Effect of PCA

Next we show how dimension reduction influences classification performance. The mAPs are displayed in Table 2.

Interestingly, PCA has different effects on VLAD and Fisher Vector. For VLAD, the performance drops slightly, which is understandable since some information is lost during PCA. For Fisher Vector, however, PCA helps improve the performance. One possible explanation, as described in [6], is PCA's impact on the GMM: it decorrelates different dimensions during the projection. Since we assume a diagonal covariance matrix for each Gaussian distribution, decorrelated inputs may be advantageous. Besides, clustering methods can become unstable in higher dimension spaces, which is often a trade-off against the information preserved.
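To make the parameters studied in this section concrete, the following NumPy sketch computes the first and second order Fisher Vector terms of Section 3.2.1 for one set of descriptors under a diagonal GMM, then applies power normalization with factor alpha followed by l2 normalization. It is a minimal illustration, not the authors' implementation: the function name and argument layout are hypothetical, the GMM parameters are assumed to be given (trained elsewhere), and the descriptors are assumed to be already PCA-projected.

```python
import numpy as np

def fisher_vector(X, weights, means, sigmas, alpha=0.5):
    """First and second order Fisher Vector terms for a diagonal GMM.

    X: (T, D) local features; weights: (K,); means: (K, D);
    sigmas: (K, D) per-dimension standard deviations.
    Returns (fv1, fv2), each of length K*D, power- and l2-normalized.
    """
    # Standardized offsets (x - mu) / sigma for every point and cluster.
    z = (X[:, None, :] - means[None, :, :]) / sigmas[None, :, :]    # (T, K, D)
    # log( w_i * N(x_t; mu_i, diag(sigma_i^2)) )
    log_prob = (np.log(weights)[None, :]
                - 0.5 * np.sum(z ** 2, axis=2)
                - np.sum(np.log(sigmas), axis=1)[None, :]
                - 0.5 * X.shape[1] * np.log(2.0 * np.pi))
    # Soft assignments gamma_t(i), computed stably in the log domain.
    log_prob = log_prob - log_prob.max(axis=1, keepdims=True)
    gamma = np.exp(log_prob)
    gamma = gamma / gamma.sum(axis=1, keepdims=True)                 # (T, K)

    g = gamma[:, :, None]
    # Gradient w.r.t. means: sum_t gamma * (x - mu) / sigma^2
    d_mu = np.sum(g * z / sigmas[None, :, :], axis=0)
    # Gradient w.r.t. sigmas: sum_t gamma * ((x - mu)^2 / sigma^3 - 1 / sigma)
    d_sigma = np.sum(g * (z ** 2 - 1.0) / sigmas[None, :, :], axis=0)

    def normalize(v):
        v = np.sign(v) * np.abs(v) ** alpha          # power normalization
        norm = np.linalg.norm(v)
        return v / norm if norm > 0 else v           # l2 normalization

    return normalize(d_mu.ravel()), normalize(d_sigma.ravel())

# Toy usage: one unit Gaussian and two points placed symmetrically about its mean.
fv1, fv2 = fisher_vector(np.array([[1.0, 0.0], [-1.0, 0.0]]),
                         weights=np.array([1.0]),
                         means=np.array([[0.0, 0.0]]),
                         sigmas=np.array([[1.0, 1.0]]))
```

In the toy call the two points cancel in the first order term, which mirrors the observation in Section 3.2.3 that points fitting the model well contribute little to the gradient; the second order term still registers that the second dimension has less spread than the model expects.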

4.3.3. Effect of Number of Clusters

Finally, Figure 5 shows how the performance changes with the number of clusters $K$. Based on the figure, mAP grows monotonically as $K$ becomes larger. However, the relative improvement is small after $K = 32$. Since the computational cost is high when $K$ gets large, we try $K$ only as large as 128.

Figure 5. mAPs for different numbers of clusters $K$ (number of mixtures) with VLAD, FV1 and FV2.

It is worth noting that the second order components of Fisher Vector perform the best when used alone. VLAD and the first order components of Fisher Vector have very similar performance. However, since VLAD is more sensitive to changes of $\alpha$, Fisher Vector might still be preferred when only first order information is used.

4.4. Results on Test Set

To compare our framework with the BoW baseline, and to have a better idea of how our method works on a larger dataset, an evaluation on our Test set is given below. All the classifiers are trained on videos in the Train partition. For both VLAD and Fisher Vector, we set $K = 64$. For BoW, we use $K = 1000$.

4.4.1. Gaussian Kernel vs Linear Kernel

In this section, we study the influence of different kernel functions. STIP features are used and $\alpha$ is set to 0.5. According to Table 3, the Gaussian kernel gives a higher mAP in all three encodings; the difference can be as big as 13%. This may be explained by the non-linear nature of the representations. Note that this is at variance with observations in image classification work [12] that focuses on the use of linear SVMs only for Fisher Vector features. In the following experiments, we use the Gaussian kernel to build SVM classifiers.

Table 3. mAP performance comparison on the Test set with linear kernel and Gaussian kernel. Columns: Encoding, Linear Kernel, Gaussian Kernel. Rows: VLAD, FV1, FV2.

Table 4. mAP comparison on the Test set among different fusion methods. Columns: Combination, Direct, MK, Late Fusion. Row: FV1 + FV2.

4.4.2. Different Fusion Methods

There are several ways to combine the first and second order terms of Fisher Vector.
Besides the multi-kernel (MK) approach discussed above, we also try to concatenate the two directly (Direct). All methods give similar results, with late fusion being slightly better. The mAPs are shown in Table 4.

4.4.3. Comparison with Baseline

We compare the performance of BoW, VLAD and FV with STIP and DT features. Late fusion is used for Fisher Vector (FV); $\alpha$ is set to 0.3 for VLAD and 0.5 for FV. The results are shown in Table 5.

We can see that Fisher Vector gives the best mAP for both STIP and DT: it has about 35% improvement for STIP and 26% improvement for DT over the baseline. VLAD improves mAP by 19% for STIP, less than Fisher Vector does. Considering that AP is the percentage of positive samples when assigning labels randomly, our best mAP of 35.5 is 47 times better than random performance.

It is difficult to account precisely for the reasons behind the superior results of Fisher Vector coding. It captures more of the feature point distributions and hence is likely to be more discriminative. We believe that our hypothesis, given in Section 3.2.3, that Fisher coding can suppress the contribution of background features when they fit the model well, is also part of the answer.

From the table, we can also see that DT outperforms STIP in most events, but the relative improvement from BoW to Fisher Vector is smaller. Since DT features are dense, the impact of background may be less than for sparse features.

Table 5. mAP performance comparison on the Test set with different features and encodings. Columns: BoW+STIP, VLAD+STIP, FV+STIP, BoW+DT, FV+DT. Rows: one per event (starting at E001), followed by mAP.

5. Conclusion

We have presented a technique for classification of unconstrained videos by using Fisher Vector coding of sparse and dense local features. Significant improvements (35% and 26% improvement for sparse STIP features and dense DT features respectively) over standard Bag-of-Words have been demonstrated on a rather large test set which contains highly diverse videos of varying quality. The improvement is consistent across the event classes, indicating robustness of the process. While use of similar techniques for image classification has been demonstrated before, we are not aware of use of such methods for video classification in previous work. We also find that use of the full Fisher Vector gives significant improvements over the simpler VLAD representation for video classification.

6. Acknowledgement

This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC0067. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

We thank Dr. Cees Snoek of the University of Amsterdam for introducing us to the concepts of difference coding, and the authors of the feature extraction software used in this paper.

References

[1] cfm.
[2] tvpubs/tv.pubs.11.org.html.
[3] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In ECCV Workshop.
[4] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In NIPS.
[5] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR.
[6] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid. Aggregating local image descriptors into compact codes. PAMI.
[7] I. Laptev. On space-time interest points. IJCV, 64(2-3).
[8] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR.
[9] L.-J. Li, H. Su, E. P. Xing, and F.-F. Li. Object bank: A high-level image representation for scene classification & semantic feature sparsification. In NIPS.
[10] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110.
[11] P. Natarajan, S. Vitaladevuni, U. Park, S. Wu, V. Manohar, X. Zhuang, S. Tsakalidis, R. Prasad, and P. Natarajan. Multimodal feature fusion for robust event detection in web videos. In CVPR.
[12] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR.
[13] S. Sadanand and J. Corso. Action bank: A high-level representation of activity in video. In CVPR.
[14] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In ICPR.
[15] A. Tamrakar, S. Ali, Q. Yu, J. Liu, O. Javed, A. Divakaran, H. Cheng, and H. S. Sawhney. Evaluation of low-level features and their combinations for complex event detection in open source videos. In CVPR.
[16] J. C. van Gemert, J.-M. Geusebroek, C. J. Veenman, and A. W. M. Smeulders. Kernel codebooks for scene categorization. In ECCV.
[17] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR.
[18] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC.
[19] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR.
