Free Viewpoint Action Recognition using Motion History Volumes


Daniel Weinland 1, Remi Ronfard, Edmond Boyer
Perception-GRAVIR, INRIA Rhone-Alpes, 38334 Montbonnot Saint Martin, France.

Abstract

Action recognition is an important and challenging topic in computer vision, with many important applications including video surveillance, automated cinematography and understanding of social interaction. Yet, most current work in gesture or action interpretation remains rooted in view-dependent representations. This paper introduces Motion History Volumes (MHV) as a free-viewpoint representation for human actions in the case of multiple calibrated, and background-subtracted, video cameras. We present algorithms for computing, aligning and comparing MHVs of different actions performed by different people in a variety of viewpoints. Alignment and comparisons are performed efficiently using Fourier transforms in cylindrical coordinates around the vertical axis. Results indicate that this representation can be used to learn and recognize basic human action classes, independently of gender, body size and viewpoint.

Key words: action recognition, view invariance, volumetric reconstruction

Email addresses: weinland@inrialpes.fr (Daniel Weinland), ronfard@inrialpes.fr (Remi Ronfard), edmond.boyer@inrialpes.fr (Edmond Boyer).
1 D. Weinland is supported by a grant from the European Community under the EST Marie-Curie Project Visitor.
Preprint submitted to Elsevier Science, 16 October 2006.

1 Introduction

Recognizing actions of human actors from video is an important topic in computer vision with many fundamental applications in video surveillance, video indexing and social sciences. According to Neumann et al. [1], and from a computational perspective, actions are best defined as four-dimensional patterns

in space and in time. Video recordings of actions can similarly be defined as three-dimensional patterns in image-space and in time, resulting from the perspective projection of the world action onto the image plane at each time instant. Recognizing actions from a single video is however plagued by the unavoidable fact that parts of the action are hidden from the camera because of self-occlusions. That the human brain is able to recognize actions from a single viewpoint should not hide the fact that actions are firmly four-dimensional, and, furthermore, that the mental models of actions supporting recognition may also be four-dimensional.

In this paper, we investigate how to build spatio-temporal models of human actions that can support categorization and recognition of simple action classes, independently of viewpoint, actor gender and body size. We use multiple cameras and shape-from-silhouette techniques. We separate action recognition into two tasks. The first task is the extraction of motion descriptors from visual input, and the second task is the classification of the descriptors into various levels of action classes, from simple gestures and postures to primitive actions to higher levels of human activities, as pointed out by Kojima et al. [2]. That second task can be performed by learning statistical models of the temporal sequencing of motion descriptors. Popular methods for doing this are hidden Markov models and other stochastic grammars, e.g. stochastic parsing as proposed by Ivanov and Bobick [3]. In this paper, we focus on the extraction of motion descriptors from multiple cameras, and their classification into primitive actions such as raising and dropping hands and feet, sitting up and down, jumping, etc. To this aim, we introduce new motion descriptors based on motion history volumes, which fuse action cues, as seen from different viewpoints and over short time periods, into a single three-dimensional representation.
In previous work on motion descriptors, Green and Guan [4] use positions and velocities of human body parts, but such information is difficult to extract automatically during unrestricted human activities. Motion descriptors which can be extracted automatically, and which have been used for action recognition, are optical flows, as proposed by Efros et al. [5], motion templates in the seminal work of Bobick and Davis [6], and space-time volumes, introduced by Syeda-Mahmood et al. [7] or Yilma and Shah [8]. Such descriptors are not invariant to viewpoint, which can be partially resolved by multiplying the number of action classes by the number of possible viewpoints [6], by relative motion directions [5], or by point correspondences [7,8]. This results in a poorer categorization and an increased complexity. In this research, we investigate the alternative possibility of building free-viewpoint class models from view-invariant motion descriptors. The key to our approach is the assumption that we need only consider variations in viewpoint around the central vertical axis of the human body. Within this assumption, we

propose a representation based on Fourier analysis of motion history volumes in cylindrical coordinates. Figure 1 explains our method for comparing two action sequences. We separately compute their visual hulls and accumulate them into motion history volumes. We transform the MHVs into cylindrical coordinates around their vertical axes, and extract view-invariant features in Fourier space. Such a representation fits nicely within the framework of Marr's 3D model [9], which has been advocated by linguist Jackendoff [10] as a useful tool for representing action categories in natural language.

Fig. 1. The two actions are recorded by multiple cameras, spatially integrated into their visual hulls (a), and temporally integrated into motion history volumes (b)(c). Invariant motion descriptors in Fourier space (d) are used for comparing the two actions.

The paper is organized as follows. First, we recall Davis and Bobick's definition of motion templates and extend it to three dimensions in Section 2. We present efficient descriptors for matching and aligning MHVs in Section 3. We present classification results in Section 4 and conclude in Section 5.

2 Definitions

In this section, we first recall 2D motion templates as introduced by Bobick and Davis in [6] to describe temporal actions. We then propose their generalization

to 3D in order to remove the viewpoint dependence in an optimal fashion using calibrated cameras. Finally, we show how to perform temporal segmentation using the 3D MHVs.

Fig. 2. Motion versus occupancy. Using motion only in image (a), we can roughly gather that someone is lifting one arm. Using the whole silhouette instead, in (b), makes it clear that the right arm is lifted. However, the same movement executed by a woman, in (c), compares favorably with the man's action in (a), whereas the whole-body comparison between (b) and (d) is less evident.

2.1 Motion History Images

Motion Energy Images (MEI) and Motion History Images (MHI) [6] were introduced to capture motion information in images. They encode, respectively, where motion occurred, and the history of motion occurrences, in the image. Pixel values are therefore binary values (MEI) encoding motion occurrence at a pixel, or multiple values (MHI) encoding how recently motion occurred at a pixel. More formally, consider the binary-valued function D(x, y, t), with D = 1 indicating motion at time t and location (x, y); then the MHI function is defined by:

    h_τ(x, y, t) = τ                              if D(x, y, t) = 1,
                   max(0, h_τ(x, y, t−1) − 1)     otherwise,            (1)

where τ is the maximum duration a motion is stored. The associated MEI can easily be computed by thresholding h > 0. The above motion templates are based on motion, i.e. D(x, y, t) is a motion indicating function; however, Bobick and Davis also suggest computing templates based on occupancy, replacing D(x, y, t) by the silhouette occupancy function. They argue that including the complete body makes templates more robust to incidental motions that occur during an action. Our experiments confirm this and show that occupancy provides robust cues for recognition, even if occupancy encodes not only motion but also shape, which may add difficulties when comparing movements, as illustrated in Figure 2.
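Equation (1) is a simple per-pixel recurrence. As a minimal sketch (NumPy arrays; the function names are our own, not from the paper), one update step can be written as:

```python
import numpy as np

def update_mhi(h, D, tau):
    """One step of the MHI recurrence (equation (1)): pixels where D == 1 are
    set to tau, all other pixels decay by 1, floored at 0."""
    return np.where(D == 1, float(tau), np.maximum(h - 1.0, 0.0))

def mei_from_mhi(h):
    """The associated motion energy image: simply threshold h > 0."""
    return h > 0
```

Running the update over a frame sequence leaves each pixel holding how recently motion was observed there, which is exactly the multi-valued encoding described above.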

2.2 Motion History Volumes

In this paper, we propose to extend 2D motion templates to 3D. The choice of a 3D representation has several advantages over a single, or multiple, 2D view representation:

- A 3D representation is a natural way to fuse information from multiple images. Such a representation is more informative than simple sets of 2D images since additional calibration information is taken into account.
- A 3D representation is more robust to the object's position relative to the cameras, as it replaces a possibly complex matching between learned views and the actual observations by a 3D alignment (see next section).
- A 3D representation allows different camera configurations.

Motion templates extend easily to 3D by considering the occupancy function D(x, y, z, t) in 3D, where D = 1 if (x, y, z) is occupied at time t and D = 0 otherwise, and by considering voxels instead of pixels:

    v_τ(x, y, z, t) = τ                                if D(x, y, z, t) = 1,
                      max(0, v_τ(x, y, z, t−1) − 1)    otherwise.            (2)

In the rest of the paper, we will assume templates to be normalized and segmented with respect to the duration of an action:

    v(x, y, z) = v_{τ = t_max − t_min}(x, y, z, t_max) / (t_max − t_min),    (3)

where t_min and t_max are the start and end times of an action. Hence, motions lose their dependency on absolute speed and all result in the same length. Section 2.3 shows how we detect these boundaries using a motion energy based segmentation. The input occupancy function D(x, y, z, t) is estimated using silhouettes and thus corresponds to the visual hull [11]. Visual hulls present several advantages: they are easy to compute and they yield robust 3D representations. Note however that, as for 2D motion templates, different body proportions may still result in very different templates. Figure 3 shows examples of motion history volumes.

2.3 Temporal Segmentation

Temporal segmentation consists in splitting a continuous sequence of motions into elementary segments. In this work, we use an automatic procedure that we

recently introduced in [12]. It relies on the definition of motion boundaries as minima in motion energy, as originally proposed by Marr and Vaina [9]. Such minima correspond either to small rests between motions or to reversals in motion. As it turns out, an approximation of the global motion energy can be effectively computed using MHVs: intuitively, instant motion can be encoded using MHVs over small time windows (typically 2-10 frames). The sum over all voxel values at time t then gives a measure of the global motion energy at that time. Next, we search this energy for local minima, and recompute the MHVs based on the detected boundaries. For more details we refer to our work in [12].

Fig. 3. Motion history volume examples, from left to right: "sit down"; "walk"; "kick"; "punch". Color values encode the time of last occupancy.

3 Motion Descriptors

Our objective is to compare body motions that are free in location, orientation and size. This is not the case for motion templates as defined in the previous section, since they encode space occupancy. The location and scale dependencies can be removed by centering the motion templates with respect to the center of mass and scale normalizing them to unit variance, as usual in shape matching. For the rotation, and following Bobick and Davis [6], who used the Hu moments [13] as rotation invariant descriptors, we could consider their simple 3D extensions by Sadjadi and Hall [14]. However, our experiments with these descriptors, based on first and second order moments, were unsuccessful in discriminating detailed actions. In addition, using higher order moments as in [15] is not easy in practice. Moreover, several works tend to show that moments are inappropriate feature descriptors, especially in the presence of noise, e.g. Shen [16]. In contrast, several works, such as those by Grace and Spann [17] and Heesch and Rueger [18], demonstrated better results using Fourier based features.
Fourier based features are robust to noise and irregularities, and have the nice property of separating coarse global features from fine local ones in the low and high frequency components. Moreover, they can be efficiently computed using fast Fourier transforms (FFT). Our approach is therefore based on these features.

Invariance of the Fourier transform follows from the Fourier shift theorem: a function f_0(x) and its translated counterpart f_t(x) = f_0(x − x_0) only differ by a phase modulation after Fourier transformation:

    F_t(k) = F_0(k) e^{−j2πkx_0}.    (4)

Hence, the Fourier magnitudes |F_t(k)| are shift-invariant signal representations. The invariance property translates easily to rotation by choosing coordinate systems that map rotation onto translation. A popular example is the Fourier-Mellin transform, e.g. Chen et al. [19], which uses log-polar coordinates for translation, scale, and rotation invariant image registration. Recent work in shape matching by Kazhdan et al. [20] proposes magnitudes of Fourier spherical harmonics as rotation invariant shape descriptors. In a similar way, we use Fourier magnitudes and cylindrical coordinates, centered on bodies, to express motion templates in a way invariant to location and to rotation around the z-axis. The overall choice is motivated by the assumption that similar actions only differ by rigid transformations composed of scale, translation, and rotation around the z-axis. Of course, this does not account for all similar actions of any body, but it appears to be reasonable in most situations. Furthermore, by restricting the Fourier-space representation to the lower frequencies, we also implicitly allow for additional degrees of freedom in object appearances and action executions. The following section details our implementation.

3.1 Invariant Representation

We express the motion templates in a cylindrical coordinate system:

    v(√(x² + y²), tan⁻¹(y/x), z) → v(r, θ, z).

Rotations around the z-axis thus result in cyclic translation shifts:

    v(x cos θ_0 + y sin θ_0, −x sin θ_0 + y cos θ_0, z) → v(r, θ + θ_0, z).

We center and scale-normalize the templates. In detail, if v is the volumetric cylindrical representation of a motion template, we consider all voxels that represent a time step, i.e. for which v(r, θ, z) > 0, to be part of a point cloud. We compute the mean µ_z and the variances σ_z and σ_r in the z- and r-directions.
The template is then shifted so that µ_z = 0, and scale normalized so that σ_z = σ_r = 1.
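The mapping to cylindrical coordinates, and the rotation-invariant θ-axis Fourier magnitudes built on it, can be sketched in a few lines of NumPy. This is a non-authoritative sketch: the nearest-neighbour resampling and the grid resolutions are our own assumptions, as the paper does not specify the interpolation scheme.

```python
import numpy as np

def cartesian_to_cylindrical(vol, n_r=64, n_theta=64, n_z=64):
    """Resample a Cartesian template v(x, y, z) onto a cylindrical grid
    v(r, theta, z) by nearest-neighbour lookup, so that a rotation about
    the z-axis becomes a cyclic shift along the theta axis."""
    nx, ny, nz = vol.shape
    cx, cy = (nx - 1) / 2.0, (ny - 1) / 2.0
    r = np.linspace(0.0, min(cx, cy), n_r)
    theta = np.linspace(-np.pi, np.pi, n_theta, endpoint=False)
    x = np.rint(cx + r[:, None] * np.cos(theta)[None, :]).astype(int)
    y = np.rint(cy + r[:, None] * np.sin(theta)[None, :]).astype(int)
    z = np.rint(np.linspace(0, nz - 1, n_z)).astype(int)
    return vol[x[:, :, None], y[:, :, None], z[None, None, :]]

def fourier_descriptor(v_cyl):
    """Magnitudes of the 1D Fourier transform over theta, computed for
    every (r, z) pair and concatenated into one feature vector."""
    return np.abs(np.fft.fft(v_cyl, axis=1)).ravel()
```

Because a rotation of the body shows up as a cyclic shift along θ, the shift theorem guarantees that `fourier_descriptor(v)` equals `fourier_descriptor(np.roll(v, s, axis=1))` for any integer shift s, up to floating point error.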

Fig. 4. 1D Fourier transform in cylindrical coordinates. Fourier transforms over θ are computed for each couple of values (r, z). Concatenating the Fourier magnitudes for all r and z forms the final feature vector.

We choose to normalize in the r and z directions, instead of a principal component based normalization, focusing on the main directions in which humans differ, and assuming that position-dependent scale effects are rather small. This method may fail to align, e.g., a person spreading a hand with a person dropping a hand, but gives good results for people performing similar actions, which is more important. The absolute values |V(r, k_θ, z)| of the 1D Fourier transform

    V(r, k_θ, z) = ∫_{−π}^{π} v(r, θ, z) e^{−j2πk_θθ} dθ,    (5)

for each value of r and z, are invariant to rotation along θ. See Figure 4 for an illustration of the 1D Fourier transform. Note that various combinations of the Fourier transform could be used here. With the 1D Fourier transform, the spatial order along r and z remains unaffected; one could say that a maximum of information in these directions is preserved. An important property of the 1D Fourier magnitudes is their trivial ambiguity with respect to a reversal of the signal. Consequently, motions that are symmetric with respect to the z-axis (e.g. move left arm - move right arm) result in the same motion descriptors. This can be considered either as a loss of information or as a useful feature halving the space of symmetric motions. However, our practical experience shows that most high-level descriptions of human actions do not depend on this separation. In cases where it is important to resolve left/right ambiguities, a slightly different descriptor can be used. One such descriptor is the magnitude |V(k_r, k_θ, k_z)|

of the 3D Fourier transform

    V(k_r, k_θ, k_z) = ∫∫∫ v(r, θ, z) e^{−j2π(k_r r + k_θθ + k_z z)} dr dθ dz,    (6)

applied to the motion template v. This descriptor is only symmetric with respect to an inversion of all variables, i.e. humans standing upside-down, which does not happen very often in practice. While our previous work [21] used descriptor (6) with success, the results were inferior to those obtained with (5), and the invariance to left-right symmetry proved to be beneficial in many classification cases. A visualization of both descriptors is shown in Figure 5.

Fig. 5. Volume and spectra of sample motions ("lift left arm" and "lift right arm"): (a) cylindrical representation in (θ, z), (r, z) and (θ, r), averaged over the third dimension for visualization purposes; (b) corresponding 3D Fourier spectra; (c) 1D Fourier spectra. Note that the 3D descriptor treats the two motions differently (top and bottom rows of (b)), while the 1D descriptor treats them the same.

3.2 On Invariance vs. Exhaustive Search

Although we cannot report experiments for lack of space, another significant result of our research is that viewpoint-invariant motion descriptors (Fourier magnitudes) are at least as efficient as methods based on exhaustive search (correlation), at least for comparing simple actions. Numerous experiments have shown that, although it is possible to precisely recover the relative orientations between history volumes using phase or normalized correlation in Fourier space [22], and to compare the aligned volumes directly, this almost

never improves the classification results. Using invariant motion descriptors is of course advantageous because we do not need to align training examples for learning a class model, or to align test examples with all class prototypes for recognition.

4 Classification Using Motion Descriptors

We have tested the presented descriptors and evaluated how discriminant they are across different actions, different bodies and different orientations. Our previous results [21], using a small dataset of only two persons, already indicated the high potential of the descriptor. This paper presents results on an extended dataset, the so-called IXMAS dataset. The dataset is introduced in the next section, followed by classification results using dimensional reduction combined with Mahalanobis distance and linear discriminant analysis (LDA).

4.1 The IXMAS Dataset

The INRIA Xmas Motion Acquisition Sequences (IXMAS) 2 aim to form a dataset comparable to the current state of the art in action recognition. It contains 11 actions, see Figure 6, each performed 3 times by 10 actors (5 males / 5 females). To demonstrate the view-invariance, the actors freely changed their orientation for each acquisition, and no further indications on how to perform the actions besides the labels were given, as illustrated in Figure 7. The acquisition was achieved using 5 standard Firewire cameras. Figure 8 shows example views from the camera setup used during the acquisition. From the videos we extract silhouettes using a standard background subtraction technique modeling each pixel as a Gaussian in RGB space. Visual hulls are then carved from a discrete space of voxels, where we carve away each voxel that does not project into all of the silhouettes. However, there are no special requirements for the visual hull computation, and even the simplest method proved to work well with our approach. After mapping into cylindrical coordinates the representation has a resolution of 64 × 64 × 64. Temporal segmentation was performed as described in Section 2.3.
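The carving rule described above — keep a voxel only if it projects inside every silhouette — can be sketched as follows. The `project` callback standing in for the camera calibration, and all function names, are our own assumptions for illustration.

```python
import numpy as np

def carve_visual_hull(silhouettes, project, grid_points):
    """Keep a voxel only if it projects into the silhouette of every camera.
    `silhouettes` is a list of boolean images; `project(c, pts)` maps Nx3
    world points to (u, v) integer pixel coordinates of camera c."""
    occupied = np.ones(len(grid_points), dtype=bool)
    for c, sil in enumerate(silhouettes):
        u, v = project(c, grid_points)
        inside = (u >= 0) & (u < sil.shape[1]) & (v >= 0) & (v < sil.shape[0])
        occupied &= inside  # voxels projecting outside the image are carved away
        occupied[inside] &= sil[v[inside], u[inside]]
    return occupied
```

As the text notes, nothing here is specific to a particular hull algorithm; any method that tests voxels against all silhouettes yields a usable occupancy function D.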
Note that the temporal segmentation splits some of the actions into several elementary parts. To evaluate the descriptor on a selected dataset of primitive motions, we choose from among the segments of each action the one that best represents the motion. For example, the action

2 The data is available on the Perception website http://perception.inrialpes.fr in the Data section.

"check watch" is split into three parts: an upward motion of the arm, several seconds of resting in this position, and releasing the arm. Of these, we only use the first for the class "check watch". Another example is the action "walk", which has been broken down into separate steps. Interestingly, in those examples we were able to classify even moderately complex actions based on one segment only. However, the classification of composite actions is a topic of future research.

Fig. 6. The 11 actions (check watch, cross arms, scratch head, sit down, get up, turn around, walk, wave, punch, kick, pick up), performed by 10 actors.

4.2 Classification Using Mahalanobis Distance and PCA

In initial experiments on a small dataset and with different distance measures (i.e. Euclidean distance, simplified Mahalanobis distance, and Mahalanobis

distance + PCA; see also [21]), the combination of a principal component analysis (PCA) dimensional reduction and a Mahalanobis distance based normalization showed the best results. Due to the small number of training samples, we used only one pooled covariance matrix for all classes. Interestingly, we found that the method extends well to larger datasets and even competes with linear discriminant analysis (LDA), as will be shown in the next section.

Fig. 7. Sample action "kick" performed by 10 actors.

Fig. 8. Example views from the 5 cameras used during acquisition.

PCA is a commonly used method for dimensional reduction. Data points are projected onto a subspace chosen to yield the reconstruction with minimum squared error. It has been shown that this subspace is spanned by the largest eigenvectors of the data covariance Σ, and corresponds to the directions of maximum variance within the data. Further, by normalizing with respect to the variance, an equal weighting of all components is achieved, similar to the classical use of Mahalanobis distances in classification, but here computed for one pooled covariance matrix. Every action class in the dataset is represented by the mean value of the descriptors over the available population in the action training set. Any new action is then classified according to a Mahalanobis distance associated with a PCA based dimensional reduction of the data vectors. One pooled covariance matrix Σ, based on the training samples of all classes x_i ∈ R^d, i = 1, ..., n, was

computed:

    Σ = (1/n) Σ_{i=1}^{n} (x_i − m)(x_i − m)^T,    (7)

where m represents the mean value over all training samples. The Mahalanobis distance between a feature vector x and a class mean m_i representing one action is:

    d(m_i, x) = (x − m_i)^T V Λ^{−1} V^T (x − m_i),

with Λ containing the k largest eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_k, k ≤ n − 1, and V the corresponding eigenvectors of Σ. Feature vectors are thus reduced to k principal components. Following this principle, and reducing the initial descriptor (equation (5)) to k = 329 components, an average classification rate of 93.33% was obtained with leave-one-out cross validation, where we successively used 9 of the actors to learn the motions and the 10th for testing. Note that in the original input space, as well as for a simple PCA reduction without covariance normalization, the average rate is only 73.03%. Detailed results are given in Table 1.

Table 1. IXMAS data classification results: PCA alone, PCA + Mahalanobis distance based normalization using one pooled covariance, and LDA.

     #  Action        PCA       Mahal.    LDA
     1  Check watch    46.66%    86.66%    83.33%
     2  Cross arms     83.33%   100.00%   100.00%
     3  Scratch head   46.66%    93.33%    93.33%
     4  Sit down       93.33%    93.33%    93.33%
     5  Get up         83.33%    93.33%    90.00%
     6  Turn around    93.33%    96.66%    96.66%
     7  Walk          100.00%   100.00%   100.00%
     8  Wave hand      53.33%    80.00%    90.00%
     9  Punch          53.33%    96.66%    93.33%
    10  Kick           83.33%    96.66%    93.33%
    11  Pick up        66.66%    90.00%    83.33%
        average rate   73.03%    93.33%    92.42%
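The pooled-covariance normalization and nearest-mean rule above amount to a few lines of linear algebra. A sketch (function names are ours; up to implementation details, this is the rule behind the "Mahal." column of Table 1):

```python
import numpy as np

def fit_pooled_pca(X, k):
    """Pooled covariance (equation (7)) and its k largest eigenpairs.
    X holds one training descriptor per row."""
    m = X.mean(axis=0)
    Sigma = (X - m).T @ (X - m) / len(X)
    w, V = np.linalg.eigh(Sigma)            # eigh returns ascending order
    return w[::-1][:k], V[:, ::-1][:, :k]   # keep the k largest

def classify(x, class_means, w, V):
    """Nearest class mean under the Mahalanobis distance
    d(m_i, x) = (x - m_i)^T V Lambda^{-1} V^T (x - m_i)."""
    d = [((V.T @ (x - mi)) ** 2 / w).sum() for mi in class_means]
    return int(np.argmin(d))
```

Projecting onto the eigenvectors and dividing by the eigenvalues is what gives each retained component an equal weight, as the text describes.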

4.3 Classification Using Linear Discriminant Analysis

For further data reduction, class-specific knowledge becomes important in learning low dimensional representations. Instead of relying on the eigendecomposition of one pooled covariance matrix, we here use a combination of PCA and Fisher linear discriminant analysis (LDA), see e.g. Swets and Weng [23], for automatic feature selection from high dimensional data. First PCA is applied, Y = V^T X, V = [v_1, ..., v_m], to derive an m ≤ n − c dimensional representation of the data points x_i, i = 1, ..., n. The limit depending on the number of classes c is necessary to guarantee the non-singularity of the matrices in the discriminant analysis. Fisher discriminant analysis defines the within-scatter matrix:

    S_w = Σ_{i=1}^{c} Σ_{j=1}^{n_i} (y_j − m_i)(y_j − m_i)^T,    (8)

and the between-scatter matrix:

    S_b = Σ_{i=1}^{c} (m_i − m)(m_i − m)^T,    (9)

and aims at maximizing the between-class scatter while minimizing the within-class scatter, i.e. we search for a projection W that maximizes det(S_b)/det(S_w). It has been proven that the W equal to the largest eigenvectors of S_w^{−1} S_b maximizes this ratio. Consequently a second projection Z = W^T Y, W = [w_1, ..., w_k], k ≤ c − 1, is applied to derive our final feature representation Z. During classification each class is represented by its mean vector m_i. Any new action z is then classified by summing Euclidean distances over the discriminant features, with respect to the closest action class:

    d(m_i, z) = ||m_i − z||².    (10)

In the experiments, the magnitudes of the Fourier representation (equation (5)) are projected onto k = 10 discriminant features. Successively we use 9 of the actors to learn the motions, and the 10th for testing. The average rate of correct classifications is then 92.42%. Class-specific results are shown in Table 1 and Figure 9. We note that we obtain much better results with the Mahalanobis distance, using the 329 largest components of the PCA decomposition, than with the PCA components alone. LDA allows us to further reduce the

Fig. 9. Average class distance between the 11 action classes (check watch, cross arms, scratch head, sit down, get up, turn around, walk, wave hand, punch, kick, pick up): (left) before discriminant analysis; (right) after discriminant analysis.

number of features to 10, but otherwise does not further improve the overall classification results.

4.4 Motion History vs. Motion Energy and Key Frames

With the same dataset as before, we compare our MHV based descriptors with a combination of key poses and energy volumes. While Davis and Bobick suggested in the original paper the combined use of history and binary images, our experiments with motion volumes showed no improvement when combining MHVs with the binary MEVs. We repeated the experiment described in Section 4.3 for MEVs. Using the binary information only, the recognition rate drops to 80.00%. See Table 2 for detailed results. As can be expected, reverse actions, e.g. sit down - get up, score lower with MEVs than with MHVs. The MHVs also discriminate better between actions that differ at finer scales, e.g. scratch head - wave.

Also, to show that integration over time contributes fundamental information, we compare our descriptor with descriptors based on a single selected key frame. The idea of key frames is to represent a motion by one specific frame, see e.g. Carlsson and Sullivan [24]. As invariant representation, we use the magnitudes of equation (5). For the purpose of this comparison we simply choose the last frame of each MHV computation as the corresponding key frame. The average recognition rate becomes 80.30%. While motion intensive actions, e.g. walk - turn around, score much lower, a few pose expressive actions, e.g. pick up, achieve a better score. This may indicate that not all actions should be described with the same features.
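The binary MEV in this comparison can be obtained from an MHV by thresholding. The sketch below assumes the common convention that a motion history volume stores, per voxel, the timestamp of the most recent motion (0 for no motion); the key-frame variant, selecting voxels active at the final timestamp, is our illustrative reading of "last frame of each MHV computation", not necessarily the paper's exact procedure:

```python
import numpy as np

def motion_energy_volume(mhv):
    """Binary MEV: mark every voxel that moved at any time (MHV > 0),
    discarding the temporal ordering stored in the history values."""
    return (mhv > 0).astype(np.uint8)

def key_frame_volume(mhv, duration):
    """Illustrative key frame from an MHV: keep only voxels whose history
    value equals the final timestamp, i.e. motion in the last frame."""
    return (mhv >= duration).astype(np.uint8)
```

Since the MEV collapses all timestamps to a single bit, a motion and its reverse (sit down vs. get up) produce identical volumes, which is consistent with the lower MEV scores reported for such action pairs.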

 #  Action            MEV  Key frame      MHV
 1  Check watch.   86.66%     73.33%   86.66%
 2  Cross arms.    80.00%     93.33%  100.00%
 3  Scratch head.  73.33%     86.66%   93.33%
 4  Sit down.      70.00%     93.33%   93.33%
 5  Get up.        46.66%     53.33%   93.33%
 6  Turn around.   90.00%     60.00%   96.66%
 7  Walk.         100.00%     80.00%  100.00%
 8  Wave hand.     80.00%     76.66%   80.00%
 9  Punch.         93.33%     80.00%   96.66%
10  Kick.          90.00%     90.00%   96.66%
11  Pick up.       70.00%     96.66%   90.00%
    average rate   80.00%     80.30%   93.33%

Table 2: IXMAS data classification results. Results using the proposed MHVs are presented. For comparison we also include results using binary MEVs and key frame descriptors.

We conclude that invariant Fourier descriptors of binary motion volumes and key frames are suitable for motion recognition as well. However, the additional motion information present in the motion history volumes distinctly improves the recognition in both cases.

4.5 Classification on Video Sequences

The previous experiments show that the descriptor performs well in discriminating selected sets of learned motions. In this experiment we test the descriptor on unseen motion categories as they appear in realistic situations. For this purpose we work on the raw video sequences of the IXMAS dataset. In a first step the dataset is segmented into small motion primitives using the automatic segmentation. Then each segment is either recognized as one of the 11 learned classes or rejected. As in the previous experiments, we work in the PCA space spanned by the 11 sample motions and perform nearest-mean assignment. To decide for the "reject" class we use a global threshold on the distance to the closest class.

The automatic segmentation of the videos results in 1188 MHVs, corresponding to approximately 23 minutes of video. In manual ground truth labeling

Fig. 10. Recognition on raw video sequences: recognition rate into the 11 classes plotted against the false positive rate.

Fig. 11. Average distance between "reject" samples and training classes.

we discover 495 known motions and 693 "reject" motions. Note that such a ground truth labeling is not always obvious. A good example is the turn motion that was included in the experiments; additional turn-like motions also appear, as the actors were free to change position during the experiments. Moreover, it might be that an actor was accidentally checking his watch or scratching his head.

Testing in a leave-one-out manner, using all possible combinations of 9 actors for training and the remaining 10th for testing, we show a multi-class ROC curve, Figure 10, plotting the average number of correctly classified samples against the number of false positives. We found a maximal overall recognition rate (including correctly rejected motions) of 82.79%, for 14.08% false positives and 78.79% correctly classified motions. Figure 11 shows the average distance between the "reject" motions and the learned classes.

The experiments demonstrate the ability of MHVs to work even with large amounts of data and under realistic conditions (23 minutes of video, 1188

motion descriptors). The segmentation proved to almost always detect the important parts of motions; MHVs showed good quality in discriminating learned and unseen motions. An obvious problem for the false detections is the nearly infinite class of possible motions. Modeling unknown motions may require more than a single threshold and class; multiple classes and explicit learning on samples of unknown motions become important. Another problem we found is that many motions cannot be modeled by a single template. Small motions may seem very similar, but over time belong to very different actions. For example, the turn around motion is split into several small steps that may easily be confused with a single side step. In such cases temporal networks over templates, as e.g. in an HMM approach, must be used to resolve these ambiguities. However, we leave this for future work.

5 Conclusion

Using a data set of 11 actions, we have been able to extract 3D motion descriptors that appear to support meaningful categorization of simple action classes performed by different actors, irrespective of viewpoint, gender and body size. Best results are obtained by discarding the phase in Fourier space and performing dimensionality reduction with a combination of PCA and LDA. Further, LDA allows a drastic dimension reduction (10 components). This suggests that our motion descriptor may be a useful representation for view invariant recognition of an even larger class of primitive actions. Our current work is suited to segmentation of composite actions into primitives, and classification of sequences of the corresponding LDA coefficients.

References

[1] J. Neumann, C. Fermüller, Y. Aloimonos, Animated heads: From 3d motion fields to action descriptions, in: DEFORM/AVATARS, 2000.

[2] A. Kojima, T. Tamura, K. Fukunaga, Natural language description of human activities from video images based on concept hierarchy of actions, International Journal of Computer Vision 50 (2) (2002) 171–184.

[3] Y. A. Ivanov, A. F. Bobick, Recognition of visual activities and interactions by stochastic parsing, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 852–872.

[4] R. D. Green, L. Guan, Quantifying and recognizing human movement patterns from monocular video images - part i: a new framework for modeling human motion, IEEE Trans. Circuits Syst. Video Techn. 14 (2) (2004) 179–190.

[5] A. A. Efros, A. Berg, G. Mori, J. Malik, Recognizing action at a distance, in: IEEE International Conference on Computer Vision, 2003.

[6] A. F. Bobick, J. W. Davis, The recognition of human movement using temporal templates, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (3) (2001) 257–267.

[7] T. Syeda-Mahmood, M. Vasilescu, S. Sethi, Recognizing action events from multiple viewpoints, in: EventVideo01, 2001, pp. 64–72.

[8] A. Yilmaz, M. Shah, Actions sketch: A novel action representation, in: IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. I: 984–989.

[9] D. Marr, L. Vaina, Representation and recognition of the movements of shapes, Proceedings of the Royal Society of London B 214 (1982) 501–524.

[10] R. Jackendoff, On beyond zebra: the relation of linguistic and visual information, Cognition 20 (1987) 89–114.

[11] A. Laurentini, The visual hull concept for silhouette-based image understanding, IEEE Transactions on Pattern Analysis and Machine Intelligence 16 (2) (1994) 150–162.

[12] D. Weinland, R. Ronfard, E. Boyer, Automatic discovery of action taxonomies from multiple views, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2006. URL http://perception.inrialpes.fr/publications/2006/wrb06

[13] M.-K. Hu, Visual pattern recognition by moment invariants, IRE Transactions on Information Theory IT-8 (1962) 179–187.

[14] F. Sadjadi, E. Hall, Three-dimensional moment invariants, IEEE Transactions on Pattern Analysis and Machine Intelligence 2 (2) (1980) 127–136.

[15] C. Lo, H. Don, 3-d moment forms: their construction and application to object identification and positioning, PAMI 11 (10) (1989) 1053–1064.

[16] D. Shen, H. H.-S. Ip, Discriminative wavelet shape descriptors for recognition of 2-d patterns, Pattern Recognition 32 (2) (1999) 151–165.

[17] A. E. Grace, M. Spann, A comparison between Fourier-Mellin descriptors and moment based features for invariant object recognition using neural networks, Pattern Recognition Letters 12 (10) (1991) 635–643.

[18] D. Heesch, S. M. Rueger, Combining features for content-based sketch retrieval - a comparative evaluation of retrieval performance, in: Proceedings of the 24th BCS-IRSG European Colloquium on IR Research, Springer-Verlag, London, UK, 2002, pp. 41–52.

[19] Q. Chen, M. Defrise, F. Deconinck, Symmetric phase-only matched filtering of Fourier-Mellin transforms for image registration and recognition, IEEE Trans. Pattern Anal. Mach. Intell. 16 (12) (1994) 1156–1168.

[20] M. Kazhdan, T. Funkhouser, S. Rusinkiewicz, Rotation invariant spherical harmonic representation of 3d shape descriptors, in: Symposium on Geometry Processing, 2003.

[21] D. Weinland, R. Ronfard, E. Boyer, Motion history volumes for free viewpoint action recognition, in: IEEE International Workshop on Modeling People and Human Interaction, 2005. URL http://perception.inrialpes.fr/publications/2005/wrb05

[22] C. D. Kuglin, D. C. Hines, The phase correlation image alignment method, in: IEEE International Conference on Cybernetics and Society, 1975, pp. 163–165.

[23] D. L. Swets, J. Weng, Using discriminant eigenfeatures for image retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 18 (8) (1996) 831–836.

[24] S. Carlsson, J. Sullivan, Action recognition by shape matching to key frames, in: Workshop on Models versus Exemplars in Computer Vision, 2001.