Motion Boundary Trajectory for Human Action Recognition

Sio-Long Lo and Ah-Chung Tsoi
Faculty of Information Technology, Macau University of Science and Technology

Abstract. In this paper, we propose a novel approach to extracting local descriptors of a video, based on two ideas: first, using the motion boundary between objects; and second, using the resulting motion boundary trajectories extracted from videos, together with other local descriptors in the neighbourhood of the extracted motion boundary trajectories (histogram of oriented gradients, histogram of optical flow and motion boundary histogram), as local descriptors for video representations. The motion boundary approach captures more information between moving objects, which might be caused by camera movements. We compare the performance of the proposed motion boundary trajectory approach with other state-of-the-art approaches, e.g., the trajectory based approach, on a number of human action benchmark datasets (YouTube, UCF Sports, Olympic Sports, HMDB51, Hollywood2 and UCF50), and find that the proposed approach gives improved recognition results.

1 Introduction

Recognizing human action in a video is a commonly studied topic in computer vision and machine learning [1-4]. Broadly speaking, a popular approach is to first extract a set of local descriptors, and then use a bag-of-features model to match the local descriptors obtained from the labelled training video clips to those from the as yet unlabelled clips in the testing dataset [5-7]. Laptev [8] introduced space-time interest points (STIPs) using an extension of the Harris corner detection method [9] from image to video. Other detectors are also used to detect interest points in videos, e.g., Willems et al. [10] proposed using the determinant of the spatiotemporal Hessian matrix for interest point detection, and Dollar et al. [11] proposed a 1D Gabor filter in the time dimension with a 2D Gaussian in the spatial dimensions to detect the underlying periodic frequency components for interest point detection.
Based on the detected interest points in a video, a descriptor is then used to describe the information in sub-regions of the video as local features. Several descriptors have been proposed for describing these spatiotemporal local features: appearance descriptors, e.g., higher order derivatives (local jets) [8] and the histogram of oriented gradients (HOG) [12] for capturing object shape; the histogram of optical flow (HOF) [12] for capturing object motion information and representing movements across time; a spatiotemporal version of HOG, called HOG3D [13], which extends the idea of HOG to the 3D case; and motion boundary histograms (MBH) [14] to cope with camera motion. This detector/descriptor approach can be considered as a kind of bag-of-features video representation.

In contrast to detecting interest points in 3D volume data, another approach to obtaining local features from a video is the trajectory approach, the so-called dense trajectory approach, in which a patch is represented by a large number of interest points [4, 15]. In this approach, a set of local interest points is first detected in the video frames using the 2D Harris condition [9], and an optical flow field is then used to track these interest points temporally to form the patch trajectories in the video [4]. The trajectory descriptor, together with the local descriptors, can be used to represent the video under a bag-of-features framework. However, it is difficult to detect the actual moving objects in a complex background scene with severe camera motion using the 2D Harris corner condition [9] as the local patch detector. In this paper, we wish to show that the motion patterns of objects are important and will help detect informative patch trajectories for action recognition. In [16], the authors also introduced a motion boundary based sampling for action recognition, though it is different from the one which we propose in this paper.

The fact that motion provides an important cue for grouping objects is well known [17]. On the other hand, to cope with camera motion, Dalal et al. introduced the motion boundary histogram (MBH) [14] as an effective local descriptor. MBH encodes the gradients of optical flow, which are helpful for cancelling constant camera motion. Despite the importance of MBH as clearly shown in Dalal et al. [14], it appears that no one has yet explored the idea of a motion boundary in the dense trajectory approach [4].
It is expected that if we can embed the motion boundary concept in the dense trajectory approach [4], then it can handle issues related to camera motion, and would thus result in an improved recognition rate for datasets which may have been recorded while the camera was moving. In this paper, we propose to use the motion boundary between objects for detecting local patches within the dense trajectory approach [4]. The motion boundary can capture more information between moving objects, which might be caused by a moving camera. With the motion boundary defined, the motion boundary trajectory can be extracted and used for the video representation. We compare the performance of various approaches on a number of standard benchmark datasets [18-22] and achieve better results using the proposed approach.

The rest of this paper is organized as follows. Section 2 discusses related work; Section 3 briefly introduces the concept of local descriptor extraction from videos, which includes motion boundary trajectories (in Section 3.2), appearance based descriptors and motion descriptors; Section 4 provides approaches to classification; experimental results are shown in Section 5. Finally, some conclusions are drawn in Section 6.

Contribution: This paper establishes the deployment of motion boundary determination in the dense trajectory approach for action recognition. The motion boundary between objects is determined, and the points on this motion boundary are then tracked to form the motion boundary trajectories for video representation. Experimental results show that this idea can improve the recognition performance significantly.

2 Related Work

The most popular approach for action recognition is the well known bag-of-features model [23, 19, 24]. In this model, the selection of local features of a video is important for the video representation. There are two broad approaches within this tradition: the detector/descriptor approach [18] and the trajectory approach [4]. In the detector/descriptor approach [18], the detector is used to detect interesting sub-regions of a video; such sub-regions typically contain intensity values that have significant local variations in both space and time. For these sub-regions, the descriptors are applied to describe the spatial-temporal local features of the video [18]. The dense trajectory approach [4] tracks the detected local patches in the video frames through time, so that patch trajectories can be extracted from these sub-regions of the video. In the dense trajectory approach, the extracted spatial-temporal local features are significant [4]. This can be explained by the fact that the detected/extracted features are based specifically on object appearances and, to some extent, on motions (as the motion boundary histogram is used to represent the motion).

Some related work can be found in motion segmentation and video co-segmentation [25]. Motion segmentation is the problem of decomposing a video into moving objects and background, based on the idea of regions that are coherent with respect to motion and appearance properties [25]. Motion information provides an important cue for identifying the surfaces in a scene and for differentiating image texture from physical structures.
In [17], long term point trajectories based on dense optical flow are used to cluster the feature points spatio-temporally into temporally consistent segmentations of moving objects. The quality of motion segmentation depends significantly on finding a pair of frames with a clear motion difference between the objects [26]. The advantage of motion segmentation derives from the fact that it combines motion estimation with segmentation. For segmenting multiple objects in the scene, the layered model for motion segmentation has been proposed [27]. Typically, the scene consists of a number of moving objects, and each moving object is represented by a layer, which allows the motion of each layer to be described [27]. Such a representation can model the occlusion relationships among layers, making the detection of occlusion boundaries possible [28, 29]. Background/foreground segmentation is a special case of binary object segmentation in this layered model [30]. In [25], a multi-object, multi-class video co-segmentation task is proposed to segment objects in videos. Object co-segmentation [25] segments a prominent object that appears in both images of an image pair.

With this idea, video co-segmentation segments the objects that are shared between videos, so that co-segmentation across videos is encouraged. With this approach, object boundaries can be detected [28, 29]. Based on the idea of motion segmentation, objects may be segmented from the background in action recognition. Inspired by the idea of the motion boundary histogram descriptor in the bag-of-features framework, in this paper we propose to use the boundary between objects as a descriptor in the dense trajectory approach. The motion boundary can be tracked frame by frame and then deployed as a descriptor, very much in the same manner as the patch trajectories in the dense trajectory approach [4], and then used for action recognition. Where there is no significant occlusion of the objects involved, this has the advantage of not requiring segmentation or co-segmentation, which are very time consuming tasks.

3 Motion Boundary Trajectories

In this section, we describe the proposed motion boundary dense trajectory approach. We first describe the dense trajectory approach [4] briefly, and then show how motion boundary trajectories can be extracted from the video.

3.1 Dense Trajectories

The idea of a trajectory is based on interest point tracking [4]: the interest points are tracked frame by frame, and the corresponding trajectory is then extracted from the tracked points [4]. For the motion boundary trajectories, we first detect the motion boundary on video frames and then track the detected motion boundary through time to form the motion boundary trajectories of a video. Consider a video which consists of frames $I^{(t)}$, $t = 1, 2, \ldots, T$, where $I^{(t)}$ is a 2D pixel intensity array with dimensions $W \times H$. The optical flow field is computed over a two-frame sequence $I^{(t)}$ and $I^{(t+1)}$ as $\omega^{(t)} = (u^{(t)}, v^{(t)})$, where $u^{(t)}$ and $v^{(t)}$ are respectively the optical flow in the horizontal and vertical directions. We apply median filtering to the optical flow field $\omega^{(t)} = (u^{(t)}, v^{(t)})$ within a $3 \times 3$ patch.
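As an illustration, the $3 \times 3$ median filtering of the flow field can be sketched in plain Python as follows. This is only a minimal sketch: the function names `median_filter_3x3` and `smooth_flow` are hypothetical, the flow components are assumed to be given as lists of rows, and the handling of border pixels (using only the neighbours that exist) is an assumption, since the text does not specify it.

```python
from statistics import median


def median_filter_3x3(channel):
    """Apply a 3x3 median filter to a 2D list of numbers.

    Border pixels use only the neighbours that exist (an assumption;
    the border policy is not specified in the text).
    """
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [channel[yy][xx]
                      for yy in range(max(0, y - 1), min(h, y + 2))
                      for xx in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = median(window)
    return out


def smooth_flow(u, v):
    """Filter the horizontal (u) and vertical (v) flow components separately."""
    return median_filter_3x3(u), median_filter_3x3(v)
```

An isolated outlier in an otherwise constant flow field is removed by this filter, which is the reason for applying it before tracking. In practice an array library would normally be used; the pure-Python form is chosen here only to keep the sketch self-contained.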
The resulting optical flow field is denoted by

$$\bar{\omega}^{(t)} = (\bar{u}^{(t)}, \bar{v}^{(t)}) = \omega^{(t)} * M_{3 \times 3},$$

where $M_{3 \times 3}$ is the median filter kernel, $\bar{\omega}^{(t)}$ is the filtered optical flow field and $*$ is the convolution operator. In the dense trajectory approach [4], interest points are selected using the Harris corner condition [9]. With this selection, a set of interest points, determined using a 2D Harris corner condition [9] on the object appearance, is then tracked frame by frame to form the dense trajectories. In order to cope with camera motion, feature points are matched using SURF descriptors and dense optical flow, and the matches are used to estimate a homography between two subsequent frames with the RANSAC algorithm, as in [31]. Because human action is in general different from camera motion, a human detector is employed to remove matches from human regions to improve the camera motion estimation. Finally, the trajectories consistent with the camera motion are removed and no longer used in the tracking process [31].

3.2 Motion Boundary Trajectories

Different from using object appearances, the motion boundary trajectory approach is based on the motion boundary between objects. To detect the motion boundary, we extract its location using the optical flow. Assuming that each object has a different flow direction and velocity, we detect the boundaries using the derivative of the optical flow field, which captures the discontinuities, e.g., edges, of the optical flow field. For the point $P^{(t)} \in I^{(t)}$, the measurement of its boundary is given by

$$H_{P^{(t)}} = \sqrt{\|\nabla \bar{u}_{P^{(t)}}\|^2 + \|\nabla \bar{v}_{P^{(t)}}\|^2},$$

where $(\bar{u}_{P^{(t)}}, \bar{v}_{P^{(t)}})$ is the flow vector of point $P^{(t)}$.

The determination of the motion boundary trajectories is very similar to that proposed in [4] for the dense trajectory approach. Given a frame $I^{(t)}$, we densely sample points on a grid spaced by $w$ pixels. In our case, the dense grid is set to $5 \times 5$. Sampling is carried out on each spatial scale separately. Different scales can be obtained by simply re-sizing the video to different resolutions, with a scaling factor of $1/\sqrt{2}$. In our setting, there are at most 8 spatial scales in total [4]. To obtain the motion boundary trajectories, we first select the points based on the Harris corner condition, evaluated through $\min(\lambda^1_{P^{(t)}}, \lambda^2_{P^{(t)}})$, where $(\lambda^1_{P^{(t)}}, \lambda^2_{P^{(t)}})$ are the eigenvalues of the auto-correlation matrix of point $P^{(t)}$ in frame $I^{(t)}$. We then threshold the motion boundary based on the threshold

$$T^{(t)}_{corner} = C_1 \max_{P^{(t)} \in I^{(t)}} \min(\lambda^1_{P^{(t)}}, \lambda^2_{P^{(t)}})$$

as

$$H_{P^{(t)}} = \begin{cases} H_{P^{(t)}} & \text{if } \min(\lambda^1_{P^{(t)}}, \lambda^2_{P^{(t)}}) \geq T^{(t)}_{corner}, \\ 0 & \text{otherwise.} \end{cases}$$

We then use another threshold condition to decide whether a point is of interest (i.e., significant enough for further consideration):

$$T^{(t)}_{motion} = C_2 \max_{P^{(t)} \in I^{(t)}} H_{P^{(t)}} + C_3.$$

The point $P^{(t)}$ is selected if its magnitude is greater than the threshold, i.e., $H_{P^{(t)}} > T^{(t)}_{motion}$, while points which do not satisfy this condition are not considered further. In our setting, we set $C_1$ = [...], $C_2 = 0.01$ and $C_3$ = [...]. From the above process, we know which sub-sampled points $P^{(t)}$ need to be considered for trajectory tracking.

We then track the selected points using the optical flow field $\bar{\omega}^{(t)} = (\bar{u}^{(t)}, \bar{v}^{(t)})$. Consider a point $P^{(t)} = (x^{(t)}, y^{(t)})$ in frame $I^{(t)}$; the tracked point $P^{(t+1)} = (x^{(t+1)}, y^{(t+1)})$ of $P^{(t)}$ in the next frame $I^{(t+1)}$ is computed by

$$P^{(t+1)} = P^{(t)} + \bar{\omega}^{(t)}\big|_{P^{(t)}} = (x^{(t)}, y^{(t)}) + (\bar{u}^{(t)}, \bar{v}^{(t)})\big|_{(x^{(t)}, y^{(t)})}.$$

The tracked points of subsequent frames are then concatenated temporally to form a trajectory, $Traj = (P^{(t)}, P^{(t+1)}, P^{(t+2)}, \ldots)$. For each frame, if no tracked point is found in the neighbourhood, a new point $P^{(t)}$ is sampled and added to the tracking process. Once a trajectory has reached the maximum length $L = 15$, a post-processing stage is performed to remove static trajectories [4].

In order to obtain a better motion boundary, we follow [31] and estimate the homography between two subsequent frames, and then warp the second frame with the estimated homography. The Harris cornerness is computed on the warped second frame, and the optical flow is computed between the first frame and the warped second frame. To obtain more interest points around the moving objects, we apply a Gaussian filter and then a median filter to the motion boundary map, i.e., $H$. We then select and track the points for extracting the motion boundary trajectories. For the optical flow, we use the Farnebäck optical flow algorithm [32], which employs a polynomial expansion to approximate the pixel intensities in the neighbourhood, giving a good quality flow field while also capturing some fine details [4]. Figure 1 shows the motion boundary as well as the motion boundary trajectories obtained from some selected videos. It is observed that the motion boundary trajectories can capture the motion quite well.

3.3 Motion Boundary Descriptors

Local descriptors are features which describe the spatial temporal behaviours of humans in the video.
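As a concrete illustration of the point selection and tracking steps of Section 3.2, the following plain-Python sketch computes a motion boundary map from a filtered flow field, applies the two thresholds, and advances a point by one frame. It is a sketch under stated assumptions, not the authors' implementation: all function names are hypothetical; the boundary measure is assumed to be the square root of the summed squared flow gradients; gradients are approximated by central differences; the Harris minimum-eigenvalue map `min_eig` is assumed to be computed elsewhere and passed in; the flow is sampled at the nearest pixel; and the constants `c1` and `c3` must be supplied by the caller.

```python
import math


def gradient(f, y, x):
    """Central-difference gradient of a 2D list at (y, x); one-sided at borders."""
    h, w = len(f), len(f[0])
    y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
    x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
    gy = (f[y1][x] - f[y0][x]) / max(y1 - y0, 1)
    gx = (f[y][x1] - f[y][x0]) / max(x1 - x0, 1)
    return gx, gy


def motion_boundary_map(u, v):
    """H = sqrt(|grad u|^2 + |grad v|^2) at every pixel (assumed form)."""
    h, w = len(u), len(u[0])
    H = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ux, uy = gradient(u, y, x)
            vx, vy = gradient(v, y, x)
            H[y][x] = math.sqrt(ux * ux + uy * uy + vx * vx + vy * vy)
    return H


def select_points(H, min_eig, c1, c2, c3, step=5):
    """Return grid points (x, y) that pass the corner and motion thresholds."""
    t_corner = c1 * max(max(row) for row in min_eig)
    # Zero the boundary response where the minimum eigenvalue is too small.
    masked = [[hv if ev >= t_corner else 0.0 for hv, ev in zip(hr, er)]
              for hr, er in zip(H, min_eig)]
    t_motion = c2 * max(max(row) for row in masked) + c3
    return [(x, y)
            for y in range(0, len(H), step)
            for x in range(0, len(H[0]), step)
            if masked[y][x] > t_motion]


def track(point, u, v):
    """Move a point by the filtered flow sampled at its nearest pixel."""
    x, y = point
    yi = min(max(int(round(y)), 0), len(u) - 1)
    xi = min(max(int(round(x)), 0), len(u[0]) - 1)
    return (x + u[yi][xi], y + v[yi][xi])
```

Calling `track` repeatedly on the successive flow fields and collecting the returned positions yields the sequence of points that forms one trajectory. In a real pipeline the flow itself would come from an optical flow routine such as the Farnebäck algorithm cited in the text.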
A number of such descriptors have been proposed by various researchers [4]. The essential idea is to find good descriptors which describe the spatial temporal behaviours of pixel values in a small neighbourhood of a volume consisting of two dimensional space and time [4]. Some of these methods were extended from image processing techniques, while others were constructed explicitly for spatial temporal behaviours [4]. Several descriptors can be obtained to encode either the shape of a trajectory or the local motion [4] and appearance within a space-time volume [14] around the trajectory. The trajectory shape descriptor encodes local motion patterns by using the displacement vectors of a trajectory [4]. HOG (histogram of oriented gradients) along a trajectory focuses on the static part of the appearance of a local patch of the video. For encoding the motion information, HOF (histogram of optical flow) captures the local motion information based on the optical flow field; MBH (motion boundary histogram) uses the gradient of the optical flow to cancel out most of the effects of camera motion [14]. These descriptors give state-of-the-art performance for representing local information. In this paper, we add the motion boundary trajectories as the descriptors for the motion along the time axis.

Fig. 1. The first row shows the original images; the second row shows the detected motion boundaries; and the third row shows the corresponding motion boundary trajectories.

The motion trajectory descriptor can be formed by considering the shape of the trajectories, in a manner very similar to that proposed in [4]. Given a trajectory of length $L$, a sequence $(\Delta P^{(t)}, \ldots, \Delta P^{(t+L-1)})$ of the displacement vectors

$$\Delta P^{(t)} = P^{(t+1)} - P^{(t)} = (x^{(t+1)} - x^{(t)}, y^{(t+1)} - y^{(t)})$$

is used for describing the trajectory shape. The normalized concatenation of the displacement vectors becomes the feature vector of the trajectory shape:

$$Shape = \frac{(\Delta P^{(t)}, \ldots, \Delta P^{(t+L-1)})}{\sum_{k=t}^{t+L-1} \|\Delta P^{(k)}\|}.$$

With the motion boundary trajectory, $Traj = (P^{(t)}, P^{(t+1)}, P^{(t+2)}, \ldots)$, the corresponding HOG, HOF and MBH descriptors can also be extracted along the motion boundary trajectory as the trajectory based HOG, HOF and MBH descriptors (please see Figure 2 for an illustration of these concepts). Following [31], the motion descriptors (HOF and MBH) are computed on the warped optical flow. The trajectory shape descriptor and the HOG descriptor remain unchanged.
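The trajectory shape descriptor described above can be sketched in a few lines of plain Python. This is a minimal sketch: the function name `trajectory_shape` is hypothetical, a trajectory is assumed to be given as a list of $L + 1$ points `(x, y)`, and returning an all-zero vector for a trajectory with no movement is an assumption (in the text, static trajectories are removed in a post-processing stage).

```python
import math


def trajectory_shape(points):
    """Normalised displacement sequence of a trajectory [(x, y), ...].

    The displacements between consecutive points are concatenated and
    divided by the sum of their Euclidean lengths.
    """
    deltas = [(x1 - x0, y1 - y0)
              for (x0, y0), (x1, y1) in zip(points, points[1:])]
    total = sum(math.hypot(dx, dy) for dx, dy in deltas)
    if total == 0:
        # Static trajectory: no meaningful shape (assumed convention).
        return [0.0] * (2 * len(deltas))
    return [c / total for d in deltas for c in d]
```

For example, the three points `(0, 0)`, `(3, 4)`, `(3, 9)` give two displacements of length 5 each, so the descriptor is `[0.3, 0.4, 0.0, 0.5]`. Because of the normalization, the descriptor does not change when all displacements are scaled by the same factor.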

Fig. 2. Illustration of the Motion Boundary Trajectory Descriptor. The motion boundary trajectory is represented by relative point coordinates, $Traj = (P^{(t)}, P^{(t+1)}, P^{(t+2)}, \ldots)$; based on the motion boundary trajectories, the HOG, HOF and MBH descriptors are computed along the trajectories.

4 Classification

We apply the standard bag-of-features approach to convert the local descriptors from a video into a fixed-dimensional vector. We first construct a codebook for the trajectory descriptor (Section 3.3) using the k-means clustering algorithm, and the clusters then serve as visual words. We fix the number of visual words to $V = 4{,}000$. To limit the complexity of the problem, we run the k-means clustering on a subset of 100,000 features randomly selected from the training features. Descriptors are then assigned to their closest vocabulary word using the Euclidean distance. The resulting histograms of visual word occurrences are used as video representations.

We apply the linear and non-linear SVM for action recognition. For the linear SVM [33], we first scale the value of each visual word feature to [0, 1], and then the feature vector of a video is normalized by L2 normalization. For the non-linear SVM [12], we normalize the histogram using the RootSIFT approach [34], i.e., square root each dimension after L1 normalization, and then apply the standard RBF (radial basis function) $\chi^2$ kernel [4] as the baseline algorithm in our experiments:

$$K_{\chi^2}(H_i, H_j) = \exp\left(-\frac{1}{2A} \sum_{k=1}^{V} \frac{(h_{ik} - h_{jk})^2}{h_{ik} + h_{jk}}\right),$$

where $H_i = \{h_{ik}\}_{k=1}^{V}$ and $H_j = \{h_{jk}\}_{k=1}^{V}$ are the frequency histograms of word occurrences, $V$ is the vocabulary size, and $A$ is the mean value of distances between all training samples [18]. In the case of multi-class classification, the one-against-all approach is applied, and we select the class with the highest score. Typically, the approach for integrating the contribution of different descriptors is the multiple channel SVM [12, 7], which is a special case of multiple kernel learning [35].
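The histogram normalization and the RBF-$\chi^2$ kernel used for the non-linear SVM can be sketched as follows. This is a minimal plain-Python sketch with hypothetical function names; skipping bins where both histograms are zero (to avoid a division by zero) is an assumption, and the normalizer `mean_dist` (the quantity $A$) is assumed to be computed beforehand from the training set.

```python
import math


def root_sift(hist):
    """L1-normalise a histogram, then take the square root of each bin."""
    total = sum(hist)
    if total == 0:
        return [0.0] * len(hist)
    return [math.sqrt(h / total) for h in hist]


def chi2_distance(hi, hj):
    """Sum over bins of (h_ik - h_jk)^2 / (h_ik + h_jk); empty bins are skipped."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(hi, hj) if a + b > 0)


def chi2_kernel(hi, hj, mean_dist):
    """RBF chi-square kernel exp(-D / (2 A)) with A = mean_dist."""
    return math.exp(-chi2_distance(hi, hj) / (2.0 * mean_dist))
```

Identical histograms give a kernel value of 1, and the value decays towards 0 as the $\chi^2$ distance grows relative to $A$. A precomputed kernel matrix built this way can be passed to an SVM solver that accepts custom kernels.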
We simply average the kernels computed from the different representations to combine the channels, using the idea of the multiple channel SVM.

We also apply Fisher vector [36] encoding for video representation. The Fisher vector encodes both first and second order statistics between the video descriptors and a Gaussian Mixture Model (GMM). Following [31], we first reduce the descriptor dimensionality by Principal Component Analysis (PCA). We set the number of Gaussians to $K = 256$ and randomly sample a subset of 256,000 features from the training set to estimate the GMM [31]. As a result, for each type of descriptor, each video is represented by a $2DK$ dimensional Fisher vector, where $D$ is the dimension of the descriptor after performing PCA. Finally, we apply power normalization and the RootSIFT approach normalization to the Fisher vector. For integrating different descriptor types, we concatenate their normalized Fisher vectors, and a linear SVM is used for classification.

5 Experiments

This section evaluates the proposed motion boundary trajectories as a descriptor. We run the experiments at least 3 times for each descriptor-classifier pair and report the average accuracy of those experiments.

5.1 Datasets

We evaluate our proposed motion boundary descriptor on six standard benchmark datasets, viz., the UCF-Sports dataset [20], the YouTube dataset [19], the Olympic Sports dataset [21], the HMDB51 dataset [22], the Hollywood2 dataset, and the UCF50 dataset.

The UCF-Sports dataset contains 150 videos from ten action classes: diving, golf swinging, kicking, lifting, horse riding, walking, running, skating, swinging (on the pommel horse and on the floor), and swinging (at the high bar). These videos are taken from real sports broadcasts, and bounding boxes around the subjects are provided for each frame. We follow the protocol proposed in [37, 38], using the same training/testing samples for our experiments: one third of the videos from each action category form the test set, and the rest of the videos are used for training. Average accuracy over all classes is reported as the performance measure.

The YouTube dataset contains 11 action categories: basketball shooting, biking/cycling, diving, golf swinging, horse back riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking, and walking with a dog.
For each category, the videos are grouped into 25 groups, each with more than 4 action clips. The dataset contains a total of 1,168 sequences. We follow the original setup [19], using leave-one-out cross-validation for a pre-defined set of 25 groups. Average accuracy over all classes is reported as the performance measure.

The Olympic Sports dataset [21] consists of athletes practising different sports; the videos are collected from YouTube and annotated using Amazon Mechanical Turk. There are 16 sports actions: high jump, long jump, triple jump, pole vault, discus throw, hammer throw, javelin throw, shot put, basketball layup, bowling, tennis serve, platform (diving), springboard (diving), snatch (weight lifting), clean and jerk (weight lifting) and vault (gymnastics), represented by a total of 783 video sequences. We adopt the train/test split from [21]. The mean average precision (mAP) over all classes [12, 39] is reported as the performance measure.

The HMDB51 dataset contains 51 distinct action categories, each containing at least 101 clips, for a total of 6,766 video clips extracted from a wide range of sources. We follow the original evaluation protocol using three train-test splits [22]. For every class and split, there are 70 videos for training and 30 videos for testing. We report the average accuracy over the three splits as the performance measure.

The Hollywood2 dataset [40] has been collected from 69 different Hollywood movies and includes 12 action classes. It contains 1,707 videos split into a training set (823 videos) and a test set (884 videos). Training and test videos come from different movies. The performance is measured by mean average precision (mAP) over all classes, as in [40].

The UCF50 dataset [41] has 50 action categories, consisting of real-world videos taken from YouTube. The videos of each category are split into 25 groups, and each group contains at least 4 action clips. In total, there are 6,618 video clips. We apply the leave-one-group-out cross-validation as recommended by the authors and report average accuracy over all classes.

5.2 Experimental Results

The experimental results using the bag-of-features histogram are shown in Table 1. We also list the results of the improved dense trajectory approach [4] in our experiments, under the name Dense Trajectory in Table 1. For the dense trajectory approach, the 2D interest points are detected based on the corner condition [4], and the detected points are then tracked frame by frame to form the dense trajectories. From the results listed in Table 1, we note that the best performance is achieved using our motion boundary trajectory descriptor.

Table 1. Experimental results of Motion Boundary Trajectory on different datasets.
Datasets: UCF Sports, YouTube, Olympic Sports, HMDB51.
Columns for each dataset: Dense Trajectory (Linear SVM, χ² SVM), Motion Boundary (Linear SVM, χ² SVM).
Rows: Traj. Shape, HOG, HOF, MBH, Combined.

We found that on the UCF Sports dataset, the motion boundary trajectory descriptor together with HOF as well as MBH obtains very good results. The UCF Sports dataset contains videos which are typically featured on broadcast television channels, e.g., BBC and ESPN; these videos are recorded by professional cameramen and the camera movement is relatively smooth. As a result, the detected motion boundary is much more meaningful, as shown in Figure 3. This observation is also true of the Olympic Sports dataset, on which the motion boundary trajectory with the MBH descriptor obtains good results. The videos of the YouTube dataset are personal videos collected from YouTube. This dataset is very challenging due to large variations in camera motion. In this case, the motion boundary trajectories are not very accurate. As a result, the performance of the motion boundary trajectory improves only slightly compared with the dense trajectory.

Fig. 3. Comparison between the dense trajectories and motion boundary trajectories (the first row shows dense trajectory; the second shows motion boundary trajectory).

We also evaluated the performance of combining representations, named Combined in Table 1. We evaluated two different classifiers, viz., the linear SVM and the χ² SVM. We simply average the kernel matrices computed from the different representations to obtain the aggregated results. The motion boundary trajectory improves the performance by at least 1% on the UCF Sports and HMDB51 datasets, and improves it slightly on the YouTube and Olympic Sports datasets. Figure 3 shows the motion boundary trajectories and the dense trajectories. In Figure 3, we note that the motion boundary detected in some videos is significant: the motion boundary trajectories capture the regions around the moving objects better than the trajectories obtained from the dense trajectory approach.

Comparison to the state of the art. In [31], Wang and Schmid introduced the improved dense trajectory features for action recognition.
Together with the Fisher vector encoding for video representation, they obtained state-of-the-art results. We use the same setting as in [31], but instead of extracting dense trajectories, we extract the motion boundary trajectories. We also use the human bounding boxes provided by the authors of [31] for better estimation of the homography between two subsequent frames. The experimental results are shown in Table 2, where we also list the results from [31], named IDT (improved dense trajectory). In Table 2, we note that on the Olympic Sports dataset, the motion boundary trajectory (MBT) approach obtains at least 2% improvement, reaching 93.5% mAP. For the HMDB51 dataset, we obtain at least 5% improvement and reach 63.8% accuracy. For the Hollywood2 dataset, the improvement is small, only 0.1%. For the UCF50 dataset, we get 1% improvement and obtain 92.2% accuracy. These results show that the motion boundary is useful for describing motion information and can significantly improve the recognition accuracy in action recognition.

Table 2. Experimental results of Motion Boundary Trajectory on different datasets using the Fisher vector video representation; IDT means Improved Dense Trajectory, and MBT means Motion Boundary Trajectory. The results listed under IDT are from [31].
Datasets: Olympic Sports, HMDB51, Hollywood2, UCF50.
Columns for each dataset: IDT [31], MBT.
Rows: Traj. Shape, HOG, HOF, MBH, Combined.

6 Conclusion

In this paper, we propose a novel approach based on two ideas: first, using the motion boundary between objects; and second, using the resulting motion boundary trajectories extracted from videos as local descriptors. These result in a new descriptor, the motion boundary descriptor. We compare the performance of the proposed approach with other state-of-the-art approaches, e.g., the trajectory based approach, on six human action recognition benchmark datasets, and find that the proposed approach gives better recognition results.

Acknowledgment

This work was financially supported by Fundo para o Desenvolvimento das Ciências e da Tecnologia, Macau SAR, Grant Number 034/2011/A2. The authors would like to thank Associate Prof. Markus Hagenbuchner, University of Wollongong, and Prof. Franco Scarselli, University of Siena, for many helpful comments on the proposed approach.

References

1. Brendel, W., Todorovic, S.: Learning spatiotemporal graphs of human activities. In: ICCV. (2011)
2. Niebles, J.C., Wang, H., Fei-Fei, L.: Unsupervised learning of human action categories using spatial-temporal words. IJCV 79 (2008)
3. Guo, K., Ishwar, P., Konrad, J.: Action recognition in video by covariance matching of silhouette tunnels. In: Brazilian Symposium on Computer Graphics and Image Processing. (2009)
4. Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Dense trajectories and motion boundary descriptors for action recognition. IJCV (2013)
5. Wallraven, C., Caputo, B., Graf, A.: Recognition with local features: the kernel recipe. In: ICCV. (2003)
6. Willamowski, J., Arregui, D., Csurka, G., Dance, C.R., Fan, L.: Categorizing nine visual classes using local appearance descriptors. In: ICPR Workshop on Learning for Adaptable Visual Systems. (2004)
7. Zhang, J., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: a comprehensive study. IJCV 73 (2007)
8. Laptev, I.: On space-time interest points. IJCV 64 (2005)
9. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proceedings of the Alvey Vision Conference. (1988)
10. Willems, G., Tuytelaars, T., Gool, L.: An efficient dense and scale-invariant spatiotemporal interest point detector. In: ECCV. (2008)
11. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR. (2005)
12. Laptev, I., Marszałek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: CVPR. (2008)
13. Kläser, A., Marszałek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients. In: BMVC. (2008)
14. Dalal, N., Triggs, B., Schmid, C.: Human detection using oriented histograms of flow and appearance. In: ECCV. Volume (2006)
15. Matikainen, P., Hebert, M., Sukthankar, R.: Trajectons: Action recognition through the motion analysis of tracked features. In: ICCV Workshop on Video-oriented Object and Event Classification. (2009)
16. Peng, X., Qiao, Y., Peng, Q., Qi, X.: Exploring motion boundary based sampling and spatial-temporal context descriptors for action recognition. In: BMVC. (2013)
17. Brox, T., Malik, J.: Object segmentation by long term analysis of point trajectories. In: ECCV. (2010)
18. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: A local SVM approach. In: ICPR. Volume 3. (2004)
19. Liu, J., Luo, J., Shah, M.: Recognizing realistic actions from videos in the wild. In: CVPR. (2009)
20. Rodriguez, M., Ahmed, J., Shah, M.: Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In: CVPR. (2008)
21. Niebles, J.C., Chen, C.W., Fei-Fei, L.: Modeling temporal structure of decomposable motion segments for activity classification. In: ECCV. (2010)
22. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video database for human motion recognition. In: ICCV. (2011)
23. Sivic, J., Zisserman, A.: Video Google: A text retrieval approach to object matching in videos. In: ICCV. Volume 2. (2003)
24. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: CVPR. (2006)
25. Chiu, W.C., Fritz, M.: Multi-class video co-segmentation with a generative multi-video model. In: CVPR. (2013)
26. Wang, J.Y., Adelson, E.H.: Representing moving images with layers (1994)
27. Sun, D., Sudderth, E.B., Black, M.J.: Layered segmentation and optical flow estimation over time. In: CVPR. (2012)
28. Black, M.J., Fleet, D.J.: Probabilistic detection and tracking of motion boundaries. IJCV 38 (2000)
29. Feghali, R., Mitiche, A.: Spatiotemporal motion boundary detection and motion boundary velocity estimation for tracking moving objects with a moving camera: a level sets PDEs approach with concurrent camera motion compensation. IEEE Transactions on Image Processing 13 (2004)
30. Sun, D., Wulff, J., Sudderth, E., Pfister, H., Black, M.: A fully-connected layered model of foreground and background flow. In: CVPR. (2013)
31. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: ICCV. (2013)
32. Farnebäck, G.: Two-frame motion estimation based on polynomial expansion. In: Scandinavian Conference on Image Analysis. Volume (2003)
33. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2 (2011)
34. Arandjelović, R., Zisserman, A.: Three things everyone should know to improve object retrieval. In: CVPR. (2012)
35. Gönen, M., Alpaydın, E.: Multiple kernel learning algorithms. JMLR 12 (2011)
36. Perronnin, F., Sánchez, J., Mensink, T.: Improving the Fisher kernel for large-scale image classification. In: ECCV. (2010)
37. Shapovalova, N., Vahdat, A., Cannons, K., Lan, T., Mori, G.: Similarity constrained latent support vector machine: an application to weakly supervised action classification. In: ECCV. (2012)
38. Lan, T., Wang, Y., Yang, W., Robinovitch, S., Mori, G.: Discriminative latent models for recognizing contextual group activities. PAMI (2011)
39. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results
40. Marszałek, M., Laptev, I., Schmid, C.: Actions in context. In: CVPR. (2009)
41. Reddy, K.K., Shah, M.: Recognizing 50 human action categories of web videos. Machine Vision and Applications 24 (2013)
