Human Violence Recognition and Detection in Surveillance Videos


Piotr Bilinski and Francois Bremond
INRIA Sophia Antipolis, STARS team
2004 Route des Lucioles, BP93, Sophia Antipolis, France
{Piotr.Bilinski,Francois.Bremond}@inria.fr

Abstract

In this paper, we focus on the important topic of violence recognition and detection in surveillance videos. Our goal is to determine whether violence occurs in a video (recognition) and when it happens (detection). Firstly, we propose an extension of the Improved Fisher Vectors (IFV) for videos, which allows a video to be represented using both local features and their spatio-temporal positions. Then, we study the popular sliding window approach for violence detection, and we re-formulate the Improved Fisher Vectors and use the summed area table data structure to speed up the approach. We present an extensive evaluation, comparison and analysis of the proposed improvements on 4 state-of-the-art datasets. We show that the proposed improvements make violence recognition more accurate (as compared to the standard IFV, the IFV with a spatio-temporal grid, and other state-of-the-art methods) and make violence detection significantly faster.

1. Introduction

Video surveillance cameras are part of our lives. They are used almost everywhere, e.g. in streets, subways, train and bus stations, airports, and sports stadiums. Today's increase in threats to security in cities and towns around the world makes the use of video cameras to monitor people necessary. Attacks on humans, fights, and vandalism are just some of the cases where detection systems, particularly violence detection systems, are needed.

In this paper, we focus on the important topic of violence recognition and detection in surveillance videos. Our goal is to determine whether violence occurs in a video (recognition) and when it happens (detection).

Over the last years, several violence recognition and detection techniques have been proposed. [6] have used motion trajectory information and orientation information of a person's limbs for person-on-person fight detection. One of the main drawbacks of this approach is that it requires precise segmentation, which is very difficult to obtain in real-world videos. Instead, [7, 9, 21, 23] have focused on local features and the bag-of-features approach; the main difference between these techniques lies in the type of features used. [23] have applied the STIP and SIFT, and [7] have used the STIP and MoSIFT features. [9] have proposed the VIolence Flows descriptor, encoding how flow-vector magnitudes change over time. [21] have proposed a video descriptor based on the substantial derivative. Despite recent improvements in violence recognition and detection, effective solutions for real-world situations are still unavailable.

The Improved Fisher Vectors (IFV) [24] is a bag-of-features-like video encoding strategy which has been shown to outperform the standard bag-of-features. It is a video (and image) descriptor obtained by pooling local features into a global representation. It describes local features by their deviation from the universal generative Gaussian Mixture Model. The IFV has been widely applied to recognition tasks in videos [1, 2, 11, 27, 28]. One of the main drawbacks of the IFV is that it simplifies the structure of a video by assuming conditional independence across the spatial and temporal domains; it computes global statistics of local features only, ignoring the spatio-temporal positions of features. Clearly, the position of a feature may carry useful information. A common way to use spatio-temporal information with the IFV is to use either spatio-temporal grids [16] or multi-scale pyramids [17]; however, these methods are still limited in terms of detailed description, providing only a coarse representation. There are several other state-of-the-art methods [13, 14, 19, 25], but as they were proposed for images (for image categorization and object recognition), they cannot be directly applied to videos; moreover, [13, 14] achieve results similar to the spatial grids/pyramids, and [19] is parameter sensitive and requires additional parameter learning.
As opposed to the existing violence recognition and detection methods (which focus mainly on new descriptors), we focus on the video representation model for two reasons: to make it more accurate for violence recognition, and to make it faster for violence detection. Firstly, we propose an extension of the IFV for videos (Sec. 2.2), which allows the spatio-temporal positions of features to be used with the IFV. The proposed extension boosts the IFV and achieves better or similar accuracy (keeping the representation more compact) as compared to the spatio-temporal grids. Then, we study and evaluate the popular sliding window approach [12] for violence detection. We re-formulate the IFV and use the summed area table data structure to speed up the sliding window method (Sec. 2.3). Then, we present an extensive evaluation, comparison and analysis of the proposed improvements on 4 state-of-the-art datasets (Sec. 3 and Sec. 4). Finally, we conclude in Sec. 5.

Abnormal behavior detection: There are several methods for abnormal behavior and anomaly detection [4, 18, 20, 22]. However, abnormalities do not represent a compact and well-defined concept. Abnormality detection is a different research topic, with different constraints and assumptions, and therefore we do not focus on these techniques.

IEEE AVSS 2016, August 2016, Colorado Springs, CO, USA

2. Boosting the Improved Fisher Vectors (IFV)

2.1. State-of-the-Art: Improved Fisher Vectors

This section provides a brief description of the Improved Fisher Vectors, introduced in [24]. The mathematical notation and formulas provided here are in accordance with [24], and we refer to it for more details.

Let $X = \{x_t, t = 1 \ldots T\}$ be a set of $T$ local features extracted from a video, where each local feature is of dimension $D$, $x_t \in \mathbb{R}^D$. Let $\lambda = \{w_i, \mu_i, \Sigma_i, i = 1 \ldots K\}$ be the parameters of a Gaussian Mixture Model (GMM) $u_\lambda(x) = \sum_{i=1}^{K} w_i u_i(x)$ fitting the distribution of local features, where $w_i \in \mathbb{R}$, $\mu_i \in \mathbb{R}^D$ and $\Sigma_i \in \mathbb{R}^{D \times D}$ are respectively the mixture weight, mean vector and covariance matrix of the $i$-th Gaussian $u_i$. We assume that the covariance matrices are diagonal and we denote by $\sigma_i^2$ the variance vector, i.e. $\Sigma_i = \mathrm{diag}(\sigma_i^2)$, $\sigma_i^2 \in \mathbb{R}^D$. Moreover, let $\gamma_t(i)$ be the soft assignment of a descriptor $x_t$ to a Gaussian $i$:

$$\gamma_t(i) = \frac{w_i u_i(x_t)}{\sum_{j=1}^{K} w_j u_j(x_t)}, \tag{1}$$

and let $G^{X}_{\mu,i}$ (resp. $G^{X}_{\sigma,i}$) be the gradient w.r.t. the mean $\mu_i$ (resp. standard deviation $\sigma_i$) of a Gaussian $i$:

$$G^{X}_{\mu,i} = \frac{1}{T\sqrt{w_i}} \sum_{t=1}^{T} \gamma_t(i) \left( \frac{x_t - \mu_i}{\sigma_i} \right), \tag{2}$$

$$G^{X}_{\sigma,i} = \frac{1}{T\sqrt{2 w_i}} \sum_{t=1}^{T} \gamma_t(i) \left[ \frac{(x_t - \mu_i)^2}{\sigma_i^2} - 1 \right], \tag{3}$$

where the division between vectors is a term-by-term operation. Then, the gradient vector $G^{X}_{\lambda}$ is the concatenation of all the $K$ gradient vectors $G^{X}_{\mu,i} \in \mathbb{R}^D$ and all the $K$ gradient vectors $G^{X}_{\sigma,i} \in \mathbb{R}^D$, $i = 1 \ldots K$:

$$G^{X}_{\lambda} = \left[ G^{X}_{\mu,1}, G^{X}_{\sigma,1}, \ldots, G^{X}_{\mu,K}, G^{X}_{\sigma,K} \right]. \tag{4}$$

The IFV (Improved Fisher Vectors) representation $\Phi^{X}_{\lambda}$, $\Phi^{X}_{\lambda} \in \mathbb{R}^{2DK}$, is the gradient vector $G^{X}_{\lambda}$ normalized by the power normalization and then the L2 norm:

$$\Phi^{X}_{\lambda} = \left[ G^{X}_{\mu,1}, G^{X}_{\sigma,1}, \ldots, G^{X}_{\mu,K}, G^{X}_{\sigma,K} \right]_{\mathrm{power}+\ell_2}. \tag{5}$$

2.2. Boosting the IFV with spatio-temporal information

The Improved Fisher Vectors encoding simplifies the structure of a video by assuming conditional independence across the spatial and temporal domains (see Sec. 2.1). It computes global statistics of local features only, ignoring the spatio-temporal positions of features. Thus, we propose an extension of the Improved Fisher Vectors which incorporates the spatio-temporal positions of features into the video model.

Firstly, we represent the positions of local features in a video-normalized manner. In this paper, we focus on local trajectories only; however, the following representation can also be applied to spatio-temporal interest points [8, 15, 29] (with the assumptions that $p_t = (a_{t,1}, b_{t,1}, c_{t,1})$ is the spatio-temporal position of a point and $n_t = 1$).

Let $P = \{p_t, t = 1 \ldots T\}$ be a set of $T$ trajectories extracted from a video sequence, and let $p_t = ((a_{t,1}, b_{t,1}, c_{t,1}), \ldots, (a_{t,n_t}, b_{t,n_t}, c_{t,n_t}))$ be a sample trajectory, where a feature point detected at a spatial position $(a_{t,1}, b_{t,1})$ in a frame $c_{t,1}$ is tracked in $n_t - 1$ subsequent frames until a spatial position $(a_{t,n_t}, b_{t,n_t})$ in a frame $c_{t,n_t}$. We define the video-normalized position $\hat{p}_t$ of the center of a trajectory $p_t$ as:

$$\hat{p}_t = \left[ \frac{1}{v_w n_t} \sum_{i=1}^{n_t} a_{t,i},\; \frac{1}{v_h n_t} \sum_{i=1}^{n_t} b_{t,i},\; \frac{1}{v_l n_t} \sum_{i=1}^{n_t} c_{t,i} \right], \tag{6}$$

where $v_w$ is the video width (in pixels), $v_h$ is the video height (in pixels), and $v_l$ is the video length (number of frames).
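The encoding of Sec. 2.1 can be made concrete with a small example. The following is a minimal plain-Python sketch of Eqs. (1)-(5), not the authors' implementation. The function names, the exponent 0.5 used for the power normalization, and the toy GMM parameters in the usage note are assumptions made for illustration; a practical version would be vectorized and would evaluate the likelihoods in the log domain.

```python
import math


def gaussian_diag(x, mu, var):
    """Density of a diagonal-covariance Gaussian evaluated at x."""
    log_p = 0.0
    for xd, md, vd in zip(x, mu, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
    return math.exp(log_p)


def improved_fisher_vector(X, weights, means, variances):
    """Encode local features X (a list of D-dim lists) following Eqs. (1)-(5)."""
    T, K, D = len(X), len(weights), len(X[0])
    g_mu = [[0.0] * D for _ in range(K)]
    g_sigma = [[0.0] * D for _ in range(K)]
    for x in X:
        # Eq. (1): soft assignment of x to every Gaussian.
        lik = [weights[i] * gaussian_diag(x, means[i], variances[i])
               for i in range(K)]
        total = sum(lik)
        for i in range(K):
            gamma = lik[i] / total
            for d in range(D):
                z = (x[d] - means[i][d]) / math.sqrt(variances[i][d])
                g_mu[i][d] += gamma * z                 # summand of Eq. (2)
                g_sigma[i][d] += gamma * (z * z - 1.0)  # summand of Eq. (3)
    # Eq. (4): concatenate the scaled gradients of all K Gaussians.
    g = []
    for i in range(K):
        g += [v / (T * math.sqrt(weights[i])) for v in g_mu[i]]
        g += [v / (T * math.sqrt(2.0 * weights[i])) for v in g_sigma[i]]
    # Eq. (5): power normalization (exponent 0.5 is an assumption), then L2.
    g = [math.copysign(math.sqrt(abs(v)), v) for v in g]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g] if norm > 0 else g
```

For example, calling `improved_fisher_vector` with three 2-dimensional features and a 2-component GMM returns a vector of length 2DK = 8 with unit L2 norm.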
We normalize the position of the center of a trajectory, so that the video size does not significantly change the magnitude of the feature position vector.

Once the positions of local features are represented in a video-normalized manner, we also consider using the unity-based normalization to reduce the influence of motionless regions at the boundaries of a video, so that large motionless regions do not significantly change the magnitude of the feature position vector. Let $\hat{p}_{t,i}$ be the $i$-th dimension of the vector $\hat{p}_t$ and $\min(\hat{p}_{:,i})$ (resp. $\max(\hat{p}_{:,i})$) be the minimum (resp. maximum) value of the $i$-th dimension among all the video-normalized position vectors extracted from the training videos. When the condition $\forall i: \min(\hat{p}_{:,i}) \neq \max(\hat{p}_{:,i})$ is true, we can apply the unity-based normalization to calculate the vector $\tilde{p}_t$. The $i$-th dimension of the vector $\tilde{p}_t$ is:

$$\tilde{p}_{t,i} = \frac{\hat{p}_{t,i} - \min(\hat{p}_{:,i})}{\max(\hat{p}_{:,i}) - \min(\hat{p}_{:,i})}. \tag{7}$$

Then, we incorporate the normalized positions of local features into the Improved Fisher Vectors model, so that videos are represented using both local descriptors and their spatio-temporal positions.

Let $\tilde{Y} = \{\tilde{y}_t = [\tilde{p}_t, x_t], t = 1 \ldots T\}$ be a set of local features, where $x_t \in \mathbb{R}^D$ is a local feature descriptor and $\tilde{p}_t \in \mathbb{R}^E$ is its corresponding normalized position, typically $E = 3$, calculated as above. Let $\tilde{\lambda} = \{\tilde{w}_i, \tilde{\mu}_i, \tilde{\Sigma}_i, i = 1 \ldots K\}$ be the parameters of a GMM $u_{\tilde{\lambda}}(y) = \sum_{i=1}^{K} \tilde{w}_i u_i(y)$ fitting the distribution of local features, where $\tilde{w}_i \in \mathbb{R}$, $\tilde{\mu}_i \in \mathbb{R}^{D+E}$ and $\tilde{\Sigma}_i \in \mathbb{R}^{(D+E) \times (D+E)}$ are respectively the mixture weight, mean vector and covariance matrix of the $i$-th Gaussian. As before, we assume that the covariance matrices are diagonal and we denote by $\tilde{\sigma}_i^2$ the variance vector, i.e. $\tilde{\Sigma}_i = \mathrm{diag}(\tilde{\sigma}_i^2)$, $\tilde{\sigma}_i^2 \in \mathbb{R}^{D+E}$. We calculate $G^{\tilde{Y}}_{\mu,i}$ (Eq. 2) and $G^{\tilde{Y}}_{\sigma,i}$ (Eq. 3) for all $K$ Gaussian components, and concatenate all the gradient vectors into a vector $G^{\tilde{Y}}_{\tilde{\lambda}}$. Finally, the new Improved Fisher Vectors representation is the gradient vector $G^{\tilde{Y}}_{\tilde{\lambda}}$ normalized by the power normalization and then the L2 norm:

$$\Phi^{\tilde{Y}}_{\tilde{\lambda}} = \left[ G^{\tilde{Y}}_{\mu,1}, G^{\tilde{Y}}_{\sigma,1}, \ldots, G^{\tilde{Y}}_{\mu,K}, G^{\tilde{Y}}_{\sigma,K} \right]_{\mathrm{power}+\ell_2}. \tag{8}$$

2.3. Fast IFV-based Sliding Window

Our goal is to determine whether violence occurs in a video and when it happens; therefore, we search for a range of frames which contains violence. We base our approach on the temporal sliding window [12], which evaluates video sub-sequences at varying locations and scales.

Let $v_l$ be the video length (in frames), $s > 0$ be the window step size (in frames), and $w = \{is\}_{i=1 \ldots m}$ be the temporal window sizes (scales) for the sliding window algorithm. Moreover, let $v = ns$ be an approximated video length (where $ns \geq v_l > (n-1)s$ and $n \geq m \geq 1$). A visualization of a sample video and sample sliding windows is presented in Fig. 1. Note that the IFV are calculated for features from the same temporal segments multiple times, i.e. $m(n - m + 1)$ times for $m$ segments (e.g. Fig. 1: 20 times for 8 segments).
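The position handling of Sec. 2.2 (Eqs. 6 and 7, followed by the concatenation $[\tilde{p}_t, x_t]$) can be sketched as below. This is an illustrative plain-Python sketch under stated assumptions: the function names are hypothetical, a trajectory is assumed to be a list of `(a, b, c)` tuples, and the minimum and maximum are assumed to be fitted on training positions only, as the text describes.

```python
def video_normalized_center(trajectory, width, height, length):
    """Eq. (6): mean (a, b, c) of a trajectory, scaled by the video size."""
    n = len(trajectory)
    return [
        sum(a for a, _, _ in trajectory) / (width * n),
        sum(b for _, b, _ in trajectory) / (height * n),
        sum(c for _, _, c in trajectory) / (length * n),
    ]


def fit_min_max(train_positions):
    """Per-dimension minimum and maximum over the training positions."""
    dims = range(len(train_positions[0]))
    lo = [min(p[i] for p in train_positions) for i in dims]
    hi = [max(p[i] for p in train_positions) for i in dims]
    # Eq. (7) is only defined when min != max in every dimension.
    if any(l == h for l, h in zip(lo, hi)):
        raise ValueError("a position dimension is constant in the training set")
    return lo, hi


def unity_normalize(p_hat, lo, hi):
    """Eq. (7): unity-based (min-max) normalization of one position vector."""
    return [(v - l) / (h - l) for v, l, h in zip(p_hat, lo, hi)]


def augment(descriptor, position):
    """Build y_t = [p_t, x_t]: normalized position followed by the descriptor."""
    return list(position) + list(descriptor)
```

The augmented vectors are then encoded exactly like ordinary descriptors, with a GMM of dimension D + E. Note that positions from test videos can fall slightly outside [0, 1], since the bounds come from the training set.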
Therefore, to speed up the detection framework, we re-formulate the IFV and use the summed area table data structure, so that the IFV are calculated for features from each temporal segment only once.

Figure 1. Temporal sliding window: a sample video is divided into n = 8 segments. We use m = 4 window scales. Note that the IFV are calculated for features from the same segments multiple times (20 times for 8 segments).

Let $X = \{x_t, t = 1 \ldots T\}$ be a set of $T$ local features extracted from a video. Let $\{X_j, j = 1 \ldots N\}$ be a partition of the set $X$ into $N$ subsets $X_j = \{x_{j,k}\}_{k=1}^{|X_j|}$ such that: $|X_j|$ is the cardinality of the set $X_j$, $X = \bigcup_{j=1}^{N} X_j$, $\forall j \neq k: X_j \cap X_k = \emptyset$, and $\phi(j,k) \rightarrow t$ is the mapping function such that $x_{j,k} = x_t$. We re-write Eq. (2):

$$\begin{aligned}
G^{X}_{\mu,i} &= \frac{1}{T\sqrt{w_i}} \sum_{t=1}^{T} \gamma_t(i) \left( \frac{x_t - \mu_i}{\sigma_i} \right) \\
&= \frac{1}{T\sqrt{w_i}} \sum_{j=1}^{N} \sum_{k=1}^{|X_j|} \gamma_{\phi(j,k)}(i) \left( \frac{x_{\phi(j,k)} - \mu_i}{\sigma_i} \right) \\
&= \frac{1}{T} \sum_{j=1}^{N} |X_j|\, G^{X_j}_{\mu,i} = \frac{1}{T} \sum_{j=1}^{N} H^{X_j}_{\mu,i}, \\
H^{X_j}_{\mu,i} &= |X_j|\, G^{X_j}_{\mu,i} = \frac{1}{\sqrt{w_i}} \sum_{k=1}^{|X_j|} \gamma_{\phi(j,k)}(i) \left( \frac{x_{\phi(j,k)} - \mu_i}{\sigma_i} \right).
\end{aligned} \tag{9}$$

Similarly, we re-write Eq. (3):

$$\begin{aligned}
G^{X}_{\sigma,i} &= \frac{1}{T\sqrt{2 w_i}} \sum_{t=1}^{T} \gamma_t(i) \left[ \frac{(x_t - \mu_i)^2}{\sigma_i^2} - 1 \right] \\
&= \frac{1}{T\sqrt{2 w_i}} \sum_{j=1}^{N} \sum_{k=1}^{|X_j|} \gamma_{\phi(j,k)}(i) \left[ \frac{(x_{\phi(j,k)} - \mu_i)^2}{\sigma_i^2} - 1 \right] \\
&= \frac{1}{T} \sum_{j=1}^{N} |X_j|\, G^{X_j}_{\sigma,i} = \frac{1}{T} \sum_{j=1}^{N} H^{X_j}_{\sigma,i}, \\
H^{X_j}_{\sigma,i} &= |X_j|\, G^{X_j}_{\sigma,i} = \frac{1}{\sqrt{2 w_i}} \sum_{k=1}^{|X_j|} \gamma_{\phi(j,k)}(i) \left[ \frac{(x_{\phi(j,k)} - \mu_i)^2}{\sigma_i^2} - 1 \right].
\end{aligned} \tag{10}$$

Then, let us define the gradient vector $H^{X_j}_{\lambda}$ as a concatenation of all the $K$ gradient vectors $H^{X_j}_{\mu,i}$ and all the $K$ gradient vectors $H^{X_j}_{\sigma,i}$, $i = 1 \ldots K$:

$$H^{X_j}_{\lambda} = \left[ H^{X_j}_{\mu,1}, H^{X_j}_{\sigma,1}, \ldots, H^{X_j}_{\mu,K}, H^{X_j}_{\sigma,K} \right]. \tag{11}$$

The Improved Fisher Vectors representation $\Phi_{\lambda}$ of the local features $\bigcup_{j=M}^{N} X_j$, where $1 < M \leq N$, can be calculated using:

$$G^{\bigcup_{j=M}^{N} X_j}_{\lambda} = \frac{\sum_{j=1}^{N} H^{X_j}_{\lambda} - \sum_{j=1}^{M-1} H^{X_j}_{\lambda}}{\sum_{j=1}^{N} |X_j| - \sum_{j=1}^{M-1} |X_j|} \tag{12}$$

and applying the power normalization and then the L2 norm to the obtained gradient vector. The obtained representation is exactly the same as if we use Eq. (2)-(5). However, in contrast to the original IFV, the above equations can be directly used with data structures such as the summed area table (Integral Images) and KD-trees.

For the task of violence localization, we use the above formulation of the IFV (Eq. (9)-(12)), and directly apply the summed area table (Integral Images [26]). The 2 main advantages of this solution are:

(1) It speeds up the calculations, as every feature is assigned to each Gaussian exactly once. For example, we detected 25k features in an 84-frame-long video. With m = 4 and s = 5, the standard sliding window assigned every feature to each Gaussian 4 to 10 times, which is equivalent to assigning 224k features to each Gaussian. In our algorithm, each feature is assigned to each Gaussian exactly once. This means nearly 9 times fewer calculations.

(2) It reduces the memory usage, especially when a video contains a lot of motion and dense features are extracted [27]. For example, we extracted 130k features in a 35-second-long video (about 3.7k features per second on average). With Improved Dense Trajectories [27] (each trajectory is represented using 426 floats), this means about 1.6M floats to store per second (segment), which is 29 times more than the IFV representation with 128 Gaussians calculated for this segment.

3. Experimental Setup: Approach Overview

Firstly, we extract local spatio-temporal features in videos, and we use the Improved Dense Trajectories (IDT) [27] for that; we apply dense sampling and track the extracted interest points using a dense optical flow field.
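The re-formulation of Sec. 2.3 (Eqs. 9-12) amounts to keeping running sums of the per-segment vectors $H^{X_j}_{\lambda}$ and of the segment cardinalities, so that any window is obtained by one subtraction and one division. The sketch below illustrates this with a one-dimensional prefix-sum table in plain Python. It is a sketch under assumptions, not the authors' code: the function names are hypothetical, segments are indexed from 0, and `segment_sums[j]` is assumed to already hold the unnormalized vector of Eq. (11) for segment j.

```python
from itertools import accumulate


def build_tables(segment_sums, segment_counts):
    """Prefix sums over segments: the 1-D analogue of a summed area table."""
    dim = len(segment_sums[0])
    prefix = [[0.0] * dim]
    for h in segment_sums:
        prefix.append([p + v for p, v in zip(prefix[-1], h)])
    counts = [0] + list(accumulate(segment_counts))
    return prefix, counts


def window_gradient(prefix, counts, first, last):
    """Eq. (12) for the segments first..last (0-based, inclusive)."""
    n = counts[last + 1] - counts[first]
    if n == 0:
        return None  # the window contains no features
    return [(hi - lo) / n for hi, lo in zip(prefix[last + 1], prefix[first])]


def sliding_windows(num_segments, num_scales):
    """Yield (first, last) segment indices for every scale and position."""
    for size in range(1, num_scales + 1):
        for first in range(num_segments - size + 1):
            yield first, first + size - 1
```

Each vector returned by `window_gradient` would still need the power and L2 normalization before classification. Because the table stores an extra all-zero entry at index 0, the same lookup also covers windows that start at the first segment.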
Then, we extract local spatio-temporal video volumes around the detected trajectories, and we represent each trajectory using: Histogram of Oriented Gradients (HOG), capturing appearance information, and Trajectory Shape (TS), Histogram of Optical Flow (HOF) and Motion Boundary Histogram (with MBH-x and MBH-y components) descriptors, capturing motion information. The extracted IDT features provide good coverage of a video and ensure the extraction of meaningful information. As a result, they have been shown to achieve excellent results for various recognition tasks in videos and they have been widely used in the literature [1, 27].

Figure 2. Sample video frames from the Violent-Flows (first row), Hockey Fight (second row), Movies (third row), and Violent-Flows 21 (fourth row) datasets.

To represent a video, we calculate a separate video representation for each descriptor independently (i.e. HOG, etc.) using the IFV / proposed Spatio-Temporal IFV (Sec. 2.2), and we concatenate the obtained representations using late fusion (i.e. per video: we concatenate the IFV-based video representation from HOG with the video representation from HOF, etc.).

For violence recognition, we use a linear Support Vector Machines (SVMs) [3] classifier, which has been shown to achieve excellent results with high-dimensional data such as Fisher Vectors; typically, if the number of features is large, there is no need to map the data to a higher dimensional space [10]. Moreover, linear SVMs have been shown to be efficient in both the training and prediction steps.

For violence detection, we use the Fast Sliding Window-based framework, explained in Sec. 2.3.

4. Experiments

4.1. Datasets

We use 4 benchmark datasets for evaluation and we follow the recommended evaluation protocols provided by the authors of the datasets. We use the Violent-Flows dataset [9], the Hockey Fight dataset [23] and the Movies dataset [23] for the violence recognition task. We use the Violent-Flows 21 dataset [9] for the violence detection task. Sample video frames from the datasets are presented in Fig. 2.
The Violent-Flows (Crowd Violence \ Non-violence) dataset [9] contains 246 videos with real-world footage of crowd violence. Videos are collected from YouTube and contain a variety of scenes, e.g. streets, football stadiums, volleyball and ice hockey arenas, and schools. The dataset is divided into 5 folds and we follow the recommended 5-fold cross-validation to report the performance.

The Hockey Fight dataset [23] contains 1000 real-world videos: 500 violent scenes (between two or more participants) and 500 non-violent scenes. Videos are divided into 5 folds, where each fold contains 50% violent and 50% non-violent videos, and we follow the recommended 5-fold cross-validation to report the performance.

The Movies dataset [23] contains 200 video clips: 100 videos with a person-on-person fight (collected from action movies) and 100 videos with non-fight scenarios (collected from various action recognition datasets). This dataset contains a wider variety of scenes than the Hockey Fight dataset, and scenes are captured at different resolutions [23]. Videos are divided into 5 folds and we follow the recommended 5-fold cross-validation to report the performance. Although this dataset does not contain surveillance videos, it has been widely used in the past for the violence recognition task.

The main differences between the above datasets are: various scenarios and scenes, variations of the violence/fight and non-violence/non-fight classes, number of training and testing samples, pose and camera viewpoint variations, motion blur, background clutter, occlusions, and illumination conditions.

The Violent-Flows 21 dataset (Crowd Violence \ Non-violence 21 Database) [9] contains 21 videos with real-world video footage of crowd violence. Videos are collected from YouTube, they are of spatial size [...] pixels, and they begin with non-violent behavior, which turns violent mid-way through the video. The training is performed using 227 out of 246 videos from the Violent-Flows dataset; 19 videos are removed as they are included in the detection set. The original annotations are not available.
Therefore, as proposed in the original paper [9], we manually mark the frame in each video where the transition happens from non-violent to violent behavior. (Differences can exist between our annotations and those of [9].)

4.2. Implementation Details

We use the GMM with K = 128 and K = 256 to compute the IFV / Spatio-Temporal IFV, and we set the number of Gaussians using 5-fold cross-validation. To increase clustering precision, we initialize the GMM 10 times and we keep the codebook with the lowest error. To limit the complexity, we cluster a subset of 100,000 randomly selected training features. To report recognition results, we use the Mean Class Accuracy (MCA) metric. For violence detection, we use six temporal windows of length $\{5i\}_{i=1}^{6}$ frames and a window stride equal to 1 frame. To report detection results, we use the Receiver Operating Characteristic (ROC) curve and the Area Under Curve (AUC) metrics.

Table 1. Evaluation results: the baseline (IFV with 1x1x1) approach, our IFV with spatio-temporal information (STIFV), and the IFV with various spatio-temporal grids on the Violent-Flows, Hockey Fight, and Movies datasets. The second column presents the size of the video representation relative to the size of the video representation of the baseline approach. [Columns: Approach, Size, Violent-F., Hockey F., Movies. Rows: Baseline, Ours: STIFV, and the IFV with each of the 11 grids. The numeric entries are missing.]

4.3. Results: Violence Recognition

For violence recognition, we evaluate the standard IFV approach (baseline approach) and our IFV with spatio-temporal information (STIFV, Sec. 2.2). Moreover, we evaluate the IFV with 11 different spatio-temporal grids (1x1x2, 1x2x1, 2x1x1, 1x1x3, 1x3x1, 3x1x1, 2x2x2, 2x2x3, 2x2x1, 2x1x2, and 1x2x2). The evaluations are performed on 3 datasets: the Violent-Flows, Hockey Fight and Movies datasets. The results are presented in Table 1. In all cases, our STIFV approach outperforms the IFV method, and achieves better or similar performance as compared to the IFV with a spatio-temporal grid.
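The two evaluation metrics named in Sec. 4.2 can be written down in a few lines. The sketch below is a plain-Python illustration under assumptions, not the evaluation code used for the reported results: the function names are hypothetical, Mean Class Accuracy is taken to be the unweighted mean of the per-class accuracies, and the AUC is computed through its rank interpretation (the fraction of positive/negative pairs that are ordered correctly, with ties counted as one half) rather than by integrating an explicit ROC curve.

```python
def mean_class_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class accuracies (recalls)."""
    per_class = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        per_class.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(per_class) / len(per_class)


def roc_auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = violent, 0 = not)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count correctly ordered (positive, negative) pairs; ties count as 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a small detection set; a sort-based variant would be preferable for large ones.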
Note that finding an appropriate size of the spatio-temporal grid is time consuming (there are 3 additional parameters to learn). Moreover, a spatio-temporal grid-based representation requires significantly more memory (up to 12 times in our experiments, see Table 1).

Then, we compare our approach with the state-of-the-art. The comparison on the Violent-Flows, Hockey Fight, and Movies datasets is presented in Table 2. Note that our approach significantly outperforms the remaining techniques, achieving up to 11% better results (on the Violent-Flows dataset).

In summary, for violence recognition, the proposed improvement (IFV with spatio-temporal information) boosts the state-of-the-art IFV, and achieves better or similar accuracy (keeping the representation more compact) as compared to the IFV with spatio-temporal grids. Moreover, our approach significantly outperforms the existing techniques on all three violence recognition datasets.

Violent-Flows Dataset
Approach                      Acc. (%)
HNF [16]                      56.5
HOG [16]                      57.4
HOF [16]                      58.3
LTP [30]                      71.5
Jerk [6]                      74.2
Interaction Force [20]        74.5
VF [9]                        81.3
HOT [22]                      82.3
F^L F^Cv [21]                 85.4
Our Approach                  96.4

Hockey Fight Dataset
Approach                      Acc. (%)
LTP [30]                      71.9
VF [9]                        82.9
STIP-HOF + HIK [23]           88.6
Extreme Accelerations [5]     90.1
MoSIFT + HIK [23]             90.9
BoW-MoSIFT [5]                91.2
STIP-HOG + HIK [23]           91.7
Our Approach                  93.7

Movies Dataset
Approach                      Acc. (%)
STIP-HOG + HIK [23]           49
STIP-HOF + HIK [23]           59
BoW-MoSIFT [5]                86.5
MoSIFT + HIK [23]             89.5
VF [9]                        91.3
Jerk [6]                      95.0
Interaction Force [20]        95.5
F^L F^Cv [21]                 96.9
Extreme Accelerations [5]     98.9
Our Approach                  99.5

Table 2. Comparison with the state-of-the-art on the Violent-Flows, Hockey Fight, and Movies datasets.

4.4. Results: Violence Detection

We evaluate our Fast Sliding Window-based approach on the Violent-Flows 21 dataset. Firstly, we evaluate the accuracy of the sliding window / Fast Sliding Window approach (both techniques achieve the same results). The results and the comparison with the state-of-the-art are presented in Figure 3 (using the ROC curves) and in Table 3 (using the AUC metric).

Then, we evaluate the speed of the Improved Dense Trajectories (IDT), and we compare the speed of the standard sliding window approach with the speed of our Fast Sliding Window technique (Sec. 2.3). The results are presented in Table 4. We observe that the proposed Fast Sliding Window technique is more than 10 times faster than the standard sliding window approach.

Figure 3. ROC curves: our approach (on the left) vs. the state-of-the-art (on the right) on the Violent-Flows 21 dataset.

Table 3. AUC metric on the Violent-Flows 21 dataset [9]. [Columns: LTP, HOG, HOF, HNF, VIF, Ours. The numeric entries are missing.]

Process                       Processing Time (fps)
Feature Extraction (IDT)      5.7
Sliding Window                9.28
Ours: Fast Sliding Window     [...]

Table 4. Average processing time on the Violent-Flows 21 dataset using a single Intel(R) Xeon(R) CPU E[...] GHz.

5. Conclusions

We have proposed an extension of the Improved Fisher Vectors (IFV) for violence recognition in videos, which allows a video to be represented using both local features and their spatio-temporal positions. The proposed extension has been shown to boost the IFV, achieving better or similar accuracy (and keeping the representation more compact) as compared to the IFV with a spatio-temporal grid. Moreover, our approach has been shown to significantly outperform the existing techniques on three violence recognition datasets. Then, we have studied the popular sliding window approach for violence detection. We have re-formulated the IFV and have used the summed area table data structure to significantly speed up the violence detection framework. The evaluations have been performed on 4 state-of-the-art datasets.

Acknowledgements. The research leading to these results has received funding from the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7) under REA grant agreement no. [324359]. However, the views and opinions expressed herein do not necessarily reflect those of the financing institutions.

References

[1] P. Bilinski and F. Bremond. Video Covariance Matrix Logarithm for Human Action Recognition in Videos. In International Joint Conference on Artificial Intelligence (IJCAI).
[2] P. Bilinski, M. Koperski, S. Bak, and F. Bremond. Representing Visual Appearance by Video Brownian Covariance Descriptor for Human Action Recognition. In IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS).
[3] C.-C. Chang and C.-J. Lin. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27.
[4] X. Cui, Q. Liu, M. Gao, and D. N. Metaxas. Abnormal Detection Using Interaction Energy Potentials. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[5] O. Deniz, I. Serrano, G. Bueno, and T.-K. Kim. Fast Violence Detection in Video. In International Conference on Computer Vision Theory and Applications (VISAPP).
[6] A. Datta, M. Shah, and N. D. V. Lobo. Person-on-Person Violence Detection in Video Data. In International Conference on Pattern Recognition (ICPR).
[7] F. D. de Souza, G. C. Chavez, E. A. do Valle, and A. de A. Araujo. Violence Detection in Video Using Spatio-Temporal Features. In SIBGRAPI Conference on Graphics, Patterns and Images.
[8] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior Recognition via Sparse Spatio-Temporal Features. In IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.
[9] T. Hassner, Y. Itcher, and O. Kliper-Gross. Violent Flows: Real-Time Detection of Violent Crowd Behavior. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
[10] C.-W. Hsu, C.-C. Chang, and C.-J. Lin. A Practical Guide to Support Vector Classification. Technical report, Department of Computer Science, National Taiwan University.
[11] V. Kantorov and I. Laptev. Efficient feature extraction, encoding and classification for action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[12] A. Klaser, M. Marszalek, C. Schmid, and A. Zisserman. Human Focused Action Localization in Video. In Trends and Topics in Computer Vision. Springer.
[13] J. Krapac, J. Verbeek, and F. Jurie. Modeling Spatial Layout with Fisher Vectors for Image Categorization. In IEEE International Conference on Computer Vision (ICCV).
[14] J. Krapac, J. Verbeek, and F. Jurie. Spatial Fisher Vectors for Image Categorization. Research Report RR-7680, INRIA.
[15] I. Laptev. On Space-Time Interest Points. International Journal of Computer Vision (IJCV), 64(2-3).
[16] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[17] S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[18] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos. Anomaly Detection in Crowded Scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[19] S. McCann and D. G. Lowe. Spatially Local Coding for Object Recognition. In Asian Conference on Computer Vision (ACCV).
[20] R. Mehran, A. Oyama, and M. Shah. Abnormal Crowd Behavior Detection using Social Force Model. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[21] S. Mohammadi, H. Kiani, A. Perina, and V. Murino. Violence detection in crowded scenes using substantial derivative. In IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS).
[22] H. Mousavi, S. Mohammadi, A. Perina, R. Chellali, and V. Murino. Analyzing Tracklets for the Detection of Abnormal Crowd Behavior. In IEEE Winter Conference on Applications of Computer Vision (WACV).
[23] E. B. Nievas, O. D. Suarez, G. B. Garcia, and R. Sukthankar. Violence Detection in Video Using Computer Vision Techniques. In International Conference on Computer Analysis of Images and Patterns (CAIP).
[24] F. Perronnin, J. Sanchez, and T. Mensink. Improving the Fisher Kernel for Large-Scale Image Classification. In European Conference on Computer Vision (ECCV).
[25] J. Sanchez, F. Perronnin, and T. De Campos. Modeling the Spatial Layout of Images Beyond Spatial Pyramids. Pattern Recognition Letters, 33(16).
[26] O. Tuzel, F. Porikli, and P. Meer. Region Covariance: A Fast Descriptor for Detection And Classification. In European Conference on Computer Vision (ECCV).
[27] H. Wang and C. Schmid. Action Recognition with Improved Trajectories. In IEEE International Conference on Computer Vision (ICCV).
[28] L. Wang, Y. Qiao, and X. Tang. Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[29] G. Willems, T. Tuytelaars, and L. Van Gool. An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector. In European Conference on Computer Vision (ECCV).
[30] L. Yeffet and L. Wolf. Local Trinary Patterns for human action recognition. In IEEE International Conference on Computer Vision (ICCV), 2009.
