Context-Aware Activity Modeling using Hierarchical Conditional Random Fields


Yingying Zhu, Nandita M. Nayak, and Amit K. Roy-Chowdhury

Abstract: In this paper, rather than modeling activities in videos individually, we jointly model and recognize related activities in a scene using both motion and context features. This is motivated by the observation that activities related in space and time rarely occur independently and can serve as the context for each other. We propose a two-layer conditional random field model that represents the action segments and activities in a hierarchical manner. The model allows the integration of both motion and various context features at different levels and automatically learns the statistics that capture the patterns of the features. With weakly labeled training data, the learning problem is formulated as a max-margin problem and is solved by an iterative algorithm. Rather than generating activity labels for individual activities, our model simultaneously predicts an optimum structural label for the related activities in the scene. We show promising results on the UCLA Office Dataset and the VIRAT Ground Dataset that demonstrate the benefit of hierarchical modeling of related activities using both motion and context features.

Index Terms: Activity localization and recognition, context-aware activity model, hierarchical conditional random field.

1 INTRODUCTION

It has been demonstrated in [28] that context is significant in human visual systems. As there is no formal definition of context in computer vision, we consider all the detected objects and motion regions as providing contextual information about each other. Activities in natural scenes rarely happen independently. The spatial layout of activities and their sequential patterns provide useful cues for their understanding. Consider the activities that happen in the same spatio-temporal region in Fig. 1: the existence of the nearby car gives information about what the person (bounded by the red circle) is doing, and the relative position of the person of interest and the car says that activities (b) and (c) are very different from activity (a). Moreover, just focusing on the person, it may be hard to tell what the person is doing in (b) and (c) - opening the vehicle trunk or closing the vehicle trunk. If we knew that these activities occurred around the same vehicle along time, it would be immediately clear that in (b) the person is opening the vehicle trunk and in (c) the person is closing the vehicle trunk. This example shows the importance of spatial and temporal relationships for activity recognition.

Fig. 1. An example that demonstrates the importance of context in activity recognition: panels (a), (b) and (c) show activities at different times and illustrate their spatial and temporal relationships. The motion region surrounding the person of interest is located by a red circle; the interacting vehicle is located by a blue bounding box.

1.1 Overview of the Framework

Many existing works on activity recognition assume that the temporal locations of the activities are known [], [27].

This work was partially supported under ONR grant N and NSF grant IIS. Y. Zhu is with the Department of Electrical and Computer Engineering, University of California, Riverside (yzhu00@ucr.edu). N. M. Nayak is with the Department of Computer Science and Engineering, University of California, Riverside (nandita.nayak@ .ucr.edu). A. K. Roy-Chowdhury is with the Department of Electrical and Computer Engineering, University of California, Riverside (amitr@ee.ucr.edu).
In practice, activity-based analysis of videos should involve reasoning about motion regions, objects involved in these motion regions, and spatio-temporal relationships between the motion regions. We focus on the problem of detecting activities of interest in continuous videos without prior information about the locations of the activities. The main challenge is to develop a representation of the continuous video that respects the spatio-temporal relationships of the activities. To achieve this goal, we build upon existing well-known feature descriptors and spatio-temporal context representations that, when combined together, provide a powerful framework to model activities in continuous videos. An activity can be considered as a union of action segments or actions that are neighbors to each other closely in space and time. We provide an integrated framework that conducts multiple stages of video analysis, starting with motion localization. The detected motion regions are divided into action segments, which are considered as the elements of activities, using a motion segmentation algorithm based on the nonlinear dynamic model (NDM) in [5]. The goal then is to generate smoothed activity labels, which

are optimum in a global sense, for the action segments, and thus obtain semantically meaningful activity regions and corresponding activity labels. Towards this goal, we perform an initial labeling to group adjacent action segments into semantically meaningful activities using a baseline activity detector. Any existing activity detection method, such as a sliding-window bag-of-words (BOW) with a support vector machine (SVM) [25], can be used in this step. We call the labeled groups of action segments the candidate activities. Candidate activities that are related to each other in space and time are grouped together into activity sets. For each set, the underlying activities are jointly modeled and recognized with the proposed two-layer conditional random field model, which models the hierarchical relationship between the action segments and activities. We refer to this proposed two-layer Hierarchical-CRF simply as Hierarchical-CRF for simplicity of expression. First, the action layer is modeled as a linear-chain CRF with the activity labels of the action segments as the random variables. Latent activity variables, which represent the detected activities, are then introduced in the hidden activity layer. Doing so, action-activity consistency and intra-activity potentials, as the higher-order smoothness potentials, can be introduced into the model to smooth the preliminary activity labels in the action layer. Finally, the activity-layer variables, whose underlying activities are within the neighborhoods of each other in space and time, are connected to utilize the spatial and temporal relationships between activities. The resulting model is the action-based two-layer Hierarchical-CRF model.

Potentials in and between the action and activity layers are developed to represent the motion and context patterns of individual variables and groups of them at both the action and activity levels, as well as action-activity consistency patterns between variables in the two layers. The action-activity potentials upon sets of action nodes and their corresponding activity nodes are introduced between the action and activity layers. Such potentials, as smoothness potentials, are used to enforce label consistency of action segments within activity regions while allowing for label inconsistency in certain circumstances. This allows the rectification of the preliminary activity labels of action segments during the inference of the Hierarchical-CRF model according to the motion and context patterns in and between actions and activities.

Fig. 2 shows the framework of our approach. Given a video, we detect the motion regions using background subtraction. Then, the segmentation algorithm aims to divide a continuous motion region into action segments, whose motion pattern is consistent and is different from that of adjacent segments. These action segments, as the nodes in the action layer, are modeled as a linear-chain CRF and the proposed Hierarchical-CRF model is built accordingly as described above. The model parameters are learned automatically from weakly-labeled training data with the location and labels of activities of interest.

Fig. 2. The left graph shows the video representation of an activity set with n motion segments and m candidate activities. The right graph shows the graphical representation of our Hierarchical-CRF model. The white nodes are the action variables and the gray nodes are the hidden activity variables. Note that the observations associated with the model variables are not shown for clarity of representation.
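To make the graph construction above concrete, the following is a minimal sketch, under assumed data containers (ActionSegment, CandidateActivity) and an assumed `related` test between activities (e.g. an alpha-neighborhood rule), of how the two-layer structure of Fig. 2 could be assembled. It is an illustration of the description above, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of assembling the two-layer graph of Fig. 2
# from detected action segments and candidate activities. The data containers and the
# `related` test are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class ActionSegment:
    idx: int
    track_id: int             # trajectory the segment belongs to
    t_start: float
    t_end: float

@dataclass
class CandidateActivity:
    idx: int
    segment_idxs: List[int]   # action clique a_c produced by the baseline detector
    t_start: float
    t_end: float

@dataclass
class HierarchicalCRFGraph:
    action_edges: List[Tuple[int, int]] = field(default_factory=list)                   # E^a
    action_activity_cliques: List[Tuple[int, List[int]]] = field(default_factory=list)  # C^ah
    activity_edges: List[Tuple[int, int]] = field(default_factory=list)                 # E^h

def build_graph(segments: List[ActionSegment],
                activities: List[CandidateActivity],
                related: Callable[[CandidateActivity, CandidateActivity], bool]) -> HierarchicalCRFGraph:
    g = HierarchicalCRFGraph()
    # 1) linear chain over consecutive segments of the same trajectory (action layer)
    last_on_track = {}
    for s in sorted(segments, key=lambda s: (s.track_id, s.t_start)):
        prev = last_on_track.get(s.track_id)
        if prev is not None:
            g.action_edges.append((prev.idx, s.idx))
        last_on_track[s.track_id] = s
    # 2) one latent activity node per candidate activity, tied to its action clique
    for a in activities:
        g.action_activity_cliques.append((a.idx, list(a.segment_idxs)))
    # 3) pairwise edges between related activities (activity layer)
    for i, s in enumerate(activities):
        for d in activities[i + 1:]:
            if related(s, d):
                g.activity_edges.append((s.idx, d.idx))
    return g
```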
Image-level features are detected and organized to form the context for activities. Common-sense domain knowledge about the activities of interest is used to guide the formulation of these context features within activities from the weakly-labeled training data. We utilize a structural model in a max-margin framework, iteratively inferring the hidden activity variables and learning the parameters of the different layers. For testing, the action segments, which are merged together and assigned activity labels by the preliminary activity detection method, are relabeled through inference on the learned Hierarchical-CRF model.

1.2 Main Contributions

The main contribution of this work is three-fold. (i) We combine low-level motion segmentation with a high-level activity model under one framework. With the detected individual action segments as the elements of activities, we design a Hierarchical-CRF model that jointly models the related activities in the scene. (ii) We propose a weakly supervised approach that utilizes context within and between actions and activities, which provides helpful cues for activity recognition. The proposed model integrates motion and various context features within and between actions and activities into a unified model. The proposed model can localize and label activities in continuous videos simultaneously, in the presence of multiple actors in the scene interacting with each other or acting independently. (iii) With a task-oriented discriminative approach, the model learning problem is formulated as a max-margin problem and is solved by an Expectation Maximization approach.

2 RELATED WORK

Many existing works exploring context focus on interactions among features, objects and actions [], [3], [4], [34], [39], environmental conditions such as spatial locations of certain activities in the scene [23], and temporal relationships between activities [24], [35]. Spatio-temporal constraints across activities in a wide-area scene are rarely considered.

Motion segmentation and action recognition are done simultaneously in [26]. The proposed algorithm models the temporal order of actions while ignoring the spatial relationships between actions. The work in [35] models a complex activity by a variable-duration hidden Markov model on equal-length temporal segments. It decomposes a complex activity into sequential actions, which are the context of each other. However, it considers only the temporal relationships, while ignoring the spatial relationships between actions. The AND-OR graph [2], [3], [32] is a powerful tool for activity representation. It has been used for multi-scale analysis of human activities in [2], where α, β, γ procedures were defined for a bottom-up cost-sensitive inference of low-level action detection. However, the learning and inference processes of AND-OR graphs become more complex as the graph grows large and more and more activities are learned. In [20], [2], a structural model is proposed to learn both feature-level and action-level interactions of group members. This method labels each image with a group activity label. How to smooth the labeling results along time is a problem that is not addressed in the paper. Also, these methods aim to recognize group activities and are not suitable in our scenario, where activities cannot be considered as parts of larger activities. In [4], complex activities are represented as spatio-temporal graphs representing multi-scale video segments and their hierarchical relationships. Existing higher-order models [6], [7], [9], [42] propose the use of higher-order potentials that encourage the smoothness of variables within cliques of the graph. Higher-order graphical models have been frequently used in image segmentation, object recognition, etc. However, few works exist in the field of activity recognition. We propose a novel method that explicitly models the action- and activity-level motion and context patterns with a Hierarchical-CRF model and uses them in the inference stage for recognition.

The problem of simultaneous tracking and activity recognition was addressed in [6], [5]. In these works, tracking and action/activity recognition are expected to benefit each other through an iterative process that maximizes a decomposable potential function which consists of tracking potentials and action/activity potentials. However, only collective activities are considered in [6], [5], in which the individual persons of interest have a common goal in terms of activity. This work addresses the general problem of activity recognition, when individual persons in the scene may conduct heterogeneous activities. The inference method on a structural model proposed in [20], [2] searches through the graphical structure in order to find the one that maximizes the potential function. Though this inference method is computationally less intensive than exhaustive search, it is still time consuming. As an alternative, greedy search has been used for inference in object recognition [8].

This paper has major differences with our previous work in [45]. In [45], we proposed a structural SVM to explicitly model the durations, motion, intra-activity context and the spatio-temporal relationships between the activities. In this work, we develop a hierarchical model which represents the related activities in a hidden activity layer, which interacts with a lower-level action layer. Representing activities as hidden activity variables simplifies the inference problem, by associating each hidden activity with a small set of neighboring action segments, and enables efficient iterative learning and inference algorithms.
Furthermore, the modeling of more aspects of the activities of interest adds additional feature functions that measure both action and activity variables. Since more information about the activities to be recognized is modeled, the recognition accuracy is improved, as demonstrated by the experiments.

3 MODEL FORMULATION FOR CONTEXT-AWARE ACTIVITY REPRESENTATION

In this section, we describe how the higher-order conditional random field (CRF) modeling of activities, which integrates activity durations, motion features and various context features within and across activities, is built upon automatically detected action segments to jointly model related activities in space and time.

3.1 Video Preprocessing

Assume there are M + 1 classes of activities in the scene, including a background class with label 0 and M classes of interest with labels 1, ..., M (the background class can be omitted if all the activity classes in the scene are known). Our goal is to locate and label each activity of interest in videos. Given a continuous video, background subtraction [46] is used to locate the moving objects. Moving persons are identified, and local trajectories of moving persons are generated (any existing tracking method like [33] can be used). Spatio-temporal Interest Point (STIP) features [22] are generated only for these motion regions. Thus, STIPs generated by noise, such as slight tree shaking, camera jitter and motion of shadows, are avoided. Each motion region is segmented into action segments using the motion segmentation based on the method in [5] with STIP histograms as the model observation. The detailed motion segmentation algorithm is described in Section 5.1.

3.2 Hierarchical-CRF Models for Activities

The problem of activity recognition in continuous videos requires two main tasks: to detect motion regions and to label these detected motion regions. The detection and labeling problems can be solved simultaneously as proposed in [26], or separately as proposed in [44], [45]. For the latter, candidate action or activity regions are usually detected before the labeling task. The problem of activity recognition is then converted to a problem of labeling, that is, to assign each candidate region an optimum activity label. The CRF is a discriminative model often used for labeling problems of images and image objects. Essentially, a CRF can be considered as a special version of a Markov Random Field (MRF) where the variable potentials are conditioned on the observed data. Let x be the model

observations and y be the label variables. The posterior distribution p(y | x, ω) of the label variables over the CRF is a Gibbs distribution and is usually represented as

$$p(y \mid x, \omega) = \frac{1}{Z(x,\omega)} \exp\Big(\sum_{c \in C} \omega^T \phi_c(x, y_c)\Big), \qquad (1)$$

where ω is a model weight vector, which needs to be learned from training data, Z(x, ω) is a normalizing constant called the partition function, and $\phi_c(x, y_c)$ is a feature vector derived from the observation x and the label vector $y_c$ in the clique c. The potential function of the CRF model given the observations x and model weight vector ω is defined as

$$\psi(y \mid x, \omega) = \sum_{c \in C} \omega^T \phi_c(x, y_c). \qquad (2)$$

For the development of the Hierarchical-CRF model, the action layer is first modeled as a linear-chain CRF. Activity-layer variables, which are associated with detected activities, are then introduced for the smoothing of the action-layer variables. Finally, the activity-layer variables are connected to represent the spatial and temporal relationships between activities. The evolution of the proposed two-layer Hierarchical-CRF model from the one-layer CRF model is shown in Fig. 3. Details on the development of these models are described in the following sub-sections. The various feature vectors used for the calculation of the potentials are described in Section 3.3.

3.2.1 Action-based Linear-chain CRF

We first describe the linear-chain CRF model in Fig. 3(a). We first define the following items: the intra-action potential $\psi_\nu(y_i^a \mid x, \omega)$, which measures the compatibility of the observed feature of action segment i and its label $y_i^a$; and the inter-action potential $\psi_\varepsilon(y_i^a, y_j^a \mid x, \omega)$, which measures the consistency between two connected action segments i and j. Let $V^a$ be the set of vertices, each representing an action segment as an element of the action layer, and let $E^a$ denote the set of connected action pairs. The potential function of the action-layer linear-chain CRF is

$$\psi(y^a \mid x, \omega) = \sum_{i \in V^a} \psi_\nu(y_i^a \mid x, \omega) + \sum_{ij \in E^a} \psi_\varepsilon(y_i^a, y_j^a \mid x, \omega) = \sum_{i \in V^a} \omega_{\nu, y_i^a}^{a\,T} \phi_\nu(x_i^a, y_i^a) + \sum_{ij \in E^a} \omega_{\varepsilon, y_i^a, y_j^a}^{a\,T} \phi_\varepsilon(x_i^a, x_j^a, y_i^a, y_j^a), \qquad (3)$$

where $\phi_\nu(x_i^a, y_i^a)$ is the intra-action feature vector that describes action segment i, and $\omega_{\nu, y_i^a}^{a}$ is the weight vector of the intra-action features for class $y_i^a$. $\phi_\varepsilon(x_i^a, x_j^a, y_i^a, y_j^a)$ is the inter-action feature, which is derived from the labels $y_i^a$, $y_j^a$ and the intra-action feature vectors $x_i^a$ and $x_j^a$, and $\omega_{\varepsilon, y_i^a, y_j^a}^{a}$ is the weight vector of the inter-action features for the class pair $(y_i^a, y_j^a)$.
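As an illustration of Eq. (3), the following sketch scores a labeling of the action layer given precomputed features; the dictionary-based weight layout (one weight vector per class and per class pair) is an assumption made for clarity, not the authors' code.

```python
import numpy as np

# Sketch of evaluating the action-layer potential of Eq. (3). Assumed layout:
# w_nu[c] is the intra-action weight vector for class c, and w_eps[(ci, cj)]
# is the inter-action weight vector for the class pair (ci, cj).
def chain_potential(y_a, phi_nu, phi_eps, edges, w_nu, w_eps):
    """y_a: list of action labels; phi_nu[i]: intra-action feature (numpy array) of
    segment i; phi_eps[(i, j)]: inter-action feature for edge (i, j); edges: list of (i, j)."""
    score = sum(float(np.dot(w_nu[y_a[i]], phi_nu[i])) for i in range(len(y_a)))
    score += sum(float(np.dot(w_eps[(y_a[i], y_a[j])], phi_eps[(i, j)])) for i, j in edges)
    return score
```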
3.2.2 Incorporating Higher Order Potentials

According to experimental observations, action segments in a candidate activity region, which are generated by activity detection methods [44], tend to have the same activity labels. However, consistent labeling is not guaranteed due to inaccurate detections. Let an action clique $a_c$ denote the union of action segments in a candidate activity c. The linear-chain CRF can be converted to a higher-order CRF by adding a latent activity variable $y_c^h$, representing the label of c, for each action clique $a_c$. All action variables associated with the same activity variable are connected. Then, the associated higher-order potential $\psi_c(y_c^a \mid x, \omega)$ is introduced to encourage action segments in the clique $a_c$ to take the same label, while still allowing some of them to have different labels without additional penalty. The resulting CRF model is shown in Fig. 3(b). The potential function of the higher-order CRF model is represented as

$$\psi(y^a, y^h \mid x, \omega) = \sum_{i \in V^a} \omega_{\nu, y_i^a}^{a\,T} \phi_\nu(x_i^a, y_i^a) + \sum_{ij \in E^a} \omega_{\varepsilon, y_i^a, y_j^a}^{a\,T} \phi_\varepsilon(x_i^a, x_j^a, y_i^a, y_j^a) + \sum_{c \in C^{ah}} \psi_c(y_c^a \mid x, \omega), \qquad (4)$$

where $E^a$ denotes the set of connected action pairs in the new model, and $C^{ah}$ is the set of action-activity cliques; each action-activity clique in $C^{ah}$ corresponds to an action clique $a_c$ in the action layer and its associated activity c in the activity layer. Let $L = \{0, 1, \ldots, M\}$ be the activity label set in the action layer, from which the action variables take values. The activity variable $y_c^h$ takes values from an extended label set $L^h = L \cup \{l_f\}$, where L is the set of variable values in the action layer. When an activity variable takes the value $l_f$, it allows its child variables to take different labels in L without additional penalty upon label inconsistency. We define $\phi_{c,l}(y_c^a, y_c^h)$ as the action-activity consistency feature of activity c, and $\omega_{c,l,y_c^h}^{ah}$ as the weight vector of the action-activity consistency feature for class $y_c^h$. Define $\phi_{c,f}(x_c^a, y_c^h)$ as the intra-activity feature for activity c, and $\omega_{c,f,y_c^h}^{ah}$ as the weight vector of the intra-activity feature for class $y_c^h$. The corresponding action-activity higher-order potential can be defined as

$$\psi_c(y_c^a \mid x, \omega) = \max_{y_c^h} \omega_{c, y_c^h}^{ah\,T} \phi_c(x_c^a, x_c^h, y_c^a, y_c^h) = \max_{y_c^h} \big[\omega_{c,l,y_c^h}^{ah\,T} \phi_{c,l}(y_c^a, y_c^h) + \omega_{c,f,y_c^h}^{ah\,T} \phi_{c,f}(x_c^a, y_c^h)\big], \qquad (5)$$

where $\omega_{c,l,y_c^h}^{ah\,T} \phi_{c,l}(y_c^a, y_c^h)$ measures the labeling consistency within the activity c. Intuitively, the higher-order potentials are constructed such that a latent variable tends to take a label from L if the majority of its child nodes take the same value, and to take the label $l_f$ if its child nodes take diversified values. $\omega_{c,f,y_c^h}^{ah\,T} \phi_{c,f}(x_c^a, y_c^h)$ is the intra-activity potential that measures the compatibility between the activity label of clique c and its activity features.
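The higher-order potential of Eq. (5) can be evaluated by maximizing over the extended label set L ∪ {l_f}. The sketch below illustrates this; the dictionary weight layout and the callable feature functions are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

# Sketch of the action-activity higher-order potential of Eq. (5): the latent activity
# label is maximised over L plus the free label l_f. phi_cl and phi_cf mirror the
# consistency and intra-activity features of Section 3.3; w_cl and w_cf are assumed
# dicts keyed by label (including FREE_LABEL).
FREE_LABEL = "l_f"   # stands for the free label l_f

def higher_order_potential(y_clique, x_act, labels, w_cl, w_cf, phi_cl, phi_cf):
    """y_clique: labels of the action segments in clique c; x_act: activity-level observation.
    phi_cl(y_clique, y_h) -> consistency feature; phi_cf(x_act, y_h) -> intra-activity feature."""
    best_score, best_label = -np.inf, None
    for y_h in list(labels) + [FREE_LABEL]:
        s = float(np.dot(w_cl[y_h], phi_cl(y_clique, y_h)) +
                  np.dot(w_cf[y_h], phi_cf(x_act, y_h)))
        if s > best_score:
            best_score, best_label = s, y_h
    return best_score, best_label
```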

3.2.3 Incorporating Inter-Activity Potentials

As stated before, it would be helpful to model the spatial and temporal relationships between activities. For this reason, we connect activity nodes in the higher-order CRF model. The resulting CRF is shown in Fig. 3(c). We define $\phi_s(x_s^h, x_d^h, y_s^h, y_d^h)$ as the inter-activity spatial feature that encodes the spatial relationship between activities s and d, and $\omega_{s, y_s^h, y_d^h}^{h}$ as the weight vector of the inter-activity spatial feature for the class pair $(y_s^h, y_d^h)$. Define $\phi_t(x_s^h, x_d^h, y_s^h, y_d^h)$ as the inter-activity temporal feature that encodes the temporal relationship between activities s and d, and $\omega_{t, y_s^h, y_d^h}^{h}$ as the weight vector of the inter-activity temporal feature for the class pair $(y_s^h, y_d^h)$. The pairwise activity potential between cliques s and d is defined as

$$\psi(y^h \mid x, \omega) = \sum_{sd \in E^h} \big[\omega_{s, y_s^h, y_d^h}^{h\,T} \phi_s(x_s^h, x_d^h, y_s^h, y_d^h) + \omega_{t, y_s^h, y_d^h}^{h\,T} \phi_t(x_s^h, x_d^h, y_s^h, y_d^h)\big], \qquad (6)$$

where $\omega_{s, y_s^h, y_d^h}^{h\,T} \phi_s(x_s^h, x_d^h, y_s^h, y_d^h)$ is the pairwise spatial potential between activities s and d that measures the compatibility between the candidate labels of s and d and their spatial relationship, and $\omega_{t, y_s^h, y_d^h}^{h\,T} \phi_t(x_s^h, x_d^h, y_s^h, y_d^h)$ is the pairwise temporal potential between activities s and d that measures the compatibility between the candidate labels of s and d and their temporal relationship.

Fig. 3. Illustration of CRF models for activity recognition. (a): Action-based linear-chain CRF; (b): Action-based higher-order CRF model (with latent activity variables); (c): Action-based two-layer Hierarchical-CRF. Note that all the observations for the random variables are omitted for compactness. (d): Symbols used in sub-figures (a, b, c): $y_i^a$ is the label variable for action segment i, $y_i^a \in L$ with $L = \{0, 1, \ldots, M\}$, where a denotes the action layer and i the index of the action segment; $y_c^h$ is the label variable for activity c, $y_c^h \in L^h$ with $L^h = \{0, 1, \ldots, M\} \cup \{l_f\}$, where h denotes the hidden activity layer and c the index of the hidden activity. (e): Graph representation of the model in [45] for comparison. An action segment denotes a random variable in the action layer, whose value is the activity label for the action segment. A colored circle denotes a random variable in the activity layer, whose value is the label for its connected clique. As shown in (a), in the action layer, action segments that belong to the same trajectory are modeled as a linear-chain CRF. Then, hidden activity-level variables with action-activity edges (in light blue) are added for each action clique to form the higher-order CRF shown in (b). An activity and its associated action nodes have the same color. Finally, pairwise activity edges (in red) are added to form the proposed two-layer Hierarchical-CRF model.

3.3 Feature Descriptors

We now define the concepts we use for the feature development. An activity is a 3D region consisting of one or multiple consecutive action segments. An agent is the underlying moving person(s) or a trajectory. The motion region at frame n is the region surrounding the moving objects of interest in the n-th frame of the activity. The activity region is the smallest rectangular region that encapsulates the motion regions over all frames of the activity. In general, the same type of feature for different classes or class pairs can be different. There are mainly three kinds of features in our model: action-layer features, action-activity features and activity-layer features, which can be further divided into five types of features. We now describe how to encode motion and context information into feature descriptors.
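Before detailing the individual descriptors, the sketch below illustrates how the pairwise inter-activity potential of Eq. (6) consumes the spatial and temporal features defined in this section, assuming those features and the per-class-pair weights are precomputed; the data layout is an illustrative assumption.

```python
import numpy as np

# Sketch of the pairwise inter-activity potential of Eq. (6), summed over the
# connected activity pairs E^h. w_s / w_t are assumed dicts mapping a class pair
# (y_s, y_d) to a weight vector; phi_s / phi_t are precomputed per pair.
def inter_activity_potential(y_h, activity_edges, phi_s, phi_t, w_s, w_t):
    """y_h: dict activity index -> label; phi_s[(s, d)], phi_t[(s, d)]: context features."""
    score = 0.0
    for s, d in activity_edges:
        pair = (y_h[s], y_h[d])
        score += float(np.dot(w_s[pair], phi_s[(s, d)]) + np.dot(w_t[pair], phi_t[(s, d)]))
    return score
```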
Intra-action Feature: $\phi_\nu(x_i^a, y_i^a)$ encodes the motion information of the action segment i that is extracted from low-level motion features such as STIP features. Since, in the action layer, we obtain action segments by utilizing their discriminative motion patterns, we use only motion features for the development of action-layer features. STIP histograms are generated for each action segment using the bag-of-words method [25]. We train a kernel multi-class SVM upon action segments to generate the normalized confidence scores, $s_{i,j}$, of classifying the action segment i as activity class j, where $j \in \{0, 1, \ldots, M\}$, such that $\sum_{j=0}^{M} s_{i,j} = 1$. In general, any kind of classifier and any low-level motion features can be used here. Given an action segment i, $\phi_\nu(x_i^a, y_i^a) = [s_{i,0}, \ldots, s_{i,M}]^T$ is developed as the intra-action feature descriptor of action segment i.
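A possible realization of this intra-action descriptor, assuming scikit-learn's SVC as the kernel multi-class classifier (any classifier could be substituted, as noted above); the function names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Sketch of producing the intra-action feature from STIP bag-of-words histograms: a
# kernel multi-class SVM is trained on labelled action segments and its normalised class
# scores form phi_nu = [s_{i,0}, ..., s_{i,M}]. The RBF kernel is an assumption.
def train_intra_action_classifier(train_histograms, train_labels):
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(np.asarray(train_histograms), np.asarray(train_labels))
    return clf

def intra_action_feature(clf, stip_histogram):
    scores = clf.predict_proba(np.asarray(stip_histogram).reshape(1, -1))[0]
    return scores / scores.sum()   # confidence scores summing to 1
```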

Inter-action Feature: $\phi_\varepsilon(x_i^a, x_j^a, y_i^a, y_j^a)$ encodes the probabilities of co-existence of action segments i and j according to their features and activity labels. $\phi_\varepsilon(x_i^a, x_j^a, y_i^a, y_j^a) = I(y_i^a) I(y_j^a)$, where $I(y_k^a)$ is the Dirac measure that equals 1 if the true label of segment k is $y_k^a$ and equals 0 otherwise, for k = i, j.

Action-Activity Consistency Feature: $\phi_{c,l}(y_c^a, y_c^h)$ encodes the labeling information within clique c as

$$\phi_{c,l}(y_c^a, y_c^h) = \begin{cases} 1, & y_c^h = l_f, \\ \frac{1}{N_c}\sum_{i} I(y_i^a = y_c^h), & y_c^h \in L, \end{cases}$$

where I(·) is the Dirac measure and $N_c$ is the number of action segments in clique c.

Intra-activity Feature: $\phi_{c,f}(x_c^a, y_c^h)$ encodes the intra-activity motion and context information of activity c. To capture the motion pattern of an activity, we use the intra-action features of the action segments which belong to the activity. Given an activity c, $[\max_{i \in \aleph_c} s_{i,0}, \ldots, \max_{i \in \aleph_c} s_{i,M}]$ is developed as the intra-activity motion feature descriptor, where $\aleph_c$ is the list of action segments in activity c. The intra-activity context feature captures the context information about the agents and the relationships between the agents, as well as the interacting objects (e.g. the object classes, interactions between agents and their surroundings). We define a set, G, of attributes that describes such context for activities of interest, using common-sense knowledge about the activities of interest (how to identify such attributes automatically is another research topic that we do not address in this paper). For a given activity, whether the defined attributes are true or not is determined from image-level detection results. The resulting feature descriptor is a normalized feature histogram. The attributes used and the development of intra-activity context features are different for different tasks (please refer to Section 5.3.1 for the details). Finally, the weighted motion and context features are used as the input to a multi-class SVM, and the output confidence scores are used to develop the intra-activity feature as $\phi_{c,f}(x_c^a, y_c^h) = [s_{c,0}, \ldots, s_{c,M}]^T$.

Inter-activity Spatial and Temporal Features: $\phi_s(x_s^h, x_d^h, y_s^h, y_d^h)$ and $\phi_t(x_s^h, x_d^h, y_s^h, y_d^h)$ capture the spatial and temporal relationships between activities s and d. Define the scaled distance between activities s and d at the n-th frame of s as

$$r_s(n) = \frac{D(O_s(n), O_d)}{R_s(n) + R_d}, \qquad (7)$$

where $O_s(n)$ and $R_s(n)$ denote the center and radius of the motion region of activity s at its n-th frame, $O_d$ and $R_d$ denote the center and radius of the activity region of activity d, and D(·) denotes the Euclidean distance. Then, the spatial relationship of s and d at the n-th frame is modeled by $s_{sd}(n) = \mathrm{bin}(r_s(n))$ as in Fig. 4(a). The normalized histogram $\mathbf{s}_{sd} = \frac{1}{N_f}\sum_{n=1}^{N_f} s_{sd}(n)$ is the inter-activity spatial feature of activities s and d. Let TC be defined by the following temporal relationships: the n-th frame of s is before d, the n-th frame of s is during d, and the n-th frame of s is after d. $t_{sd}(n)$ is the temporal relationship of s and d at the n-th frame of s, as shown in Fig. 4(b). The normalized histogram $\mathbf{t}_{sd} = \frac{1}{N_f}\sum_{n=1}^{N_f} t_{sd}(n)$ is the inter-activity temporal context feature of activity s with respect to activity d.
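A small sketch of computing these spatial and temporal histograms around Eq. (7), using the same/near/far thresholds given in Fig. 4; the input layout (per-frame centers and radii, frame time stamps) is an assumption for illustration.

```python
import numpy as np

# Sketch of the inter-activity spatial and temporal features: the scaled centre
# distance r_s(n) is binned into {same, near, far} per frame (thresholds as in Fig. 4)
# and the per-frame temporal relation into {before, during, after}; both are averaged
# into normalised histograms.
def scaled_distance(center_s, radius_s, center_d, radius_d):
    return float(np.linalg.norm(np.asarray(center_s, float) - np.asarray(center_d, float))
                 / (radius_s + radius_d))

def spatial_context_feature(frames_s, center_d, radius_d):
    """frames_s: list of (center, radius) of the motion region of s, one per frame."""
    hist = np.zeros(3)
    for center_s, radius_s in frames_s:
        r = scaled_distance(center_s, radius_s, center_d, radius_d)
        hist[0 if r <= 0.5 else (1 if r < 1.5 else 2)] += 1.0   # same / near / far
    return hist / len(frames_s)

def temporal_context_feature(frame_times_s, t_start_d, t_end_d):
    """Per-frame relation of s w.r.t. d (before / during / after), averaged over frames."""
    hist = np.zeros(3)
    for t in frame_times_s:
        hist[0 if t < t_start_d else (1 if t <= t_end_d else 2)] += 1.0
    return hist / len(frame_times_s)
```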
Fig. 4. (a) An example of an inter-activity spatial relationship. The red circle indicates the motion region of s at this frame, while the purple rectangle indicates the activity region of d. SC is defined by quantizing and grouping $r_s(n)$ into three bins: $r_s(n) \le 0.5$ (s and d are at the same spatial position at the n-th frame of s), $0.5 < r_s(n) < 1.5$ (s is near d at the n-th frame of s) and $r_s(n) \ge 1.5$ (s is far away from d at the n-th frame of s). In the image, $r_s(n) > 1.5$, so $s_{sd}(n) = [0\ 0\ 1]$. (b) An example of an inter-activity temporal relationship. The n-th frame of s occurs before d, so $t_{sd}(n) = [1\ 0\ 0]$.

4 MODEL LEARNING AND INFERENCE

The parameters of the overall potential function ψ(y | x, ω) for the two-layer Hierarchical-CRF include $\omega_\nu^a$, $\omega_\varepsilon^a$, $\omega_{c,l}^{ah}$, $\omega_{c,f}^{ah}$, $\omega_s^h$ and $\omega_t^h$. We define the weight vector as the concatenation of these parameters:

$$\omega = [\omega_\nu^a, \omega_\varepsilon^a, \omega_{c,l}^{ah}, \omega_{c,f}^{ah}, \omega_s^h, \omega_t^h]. \qquad (8)$$

Thus, the potential function ψ(y | x, ω) can be converted into a linear function with a single parameter ω as

$$\psi(y^a \mid x, \omega) = \max_{y^h} \omega^T \Gamma(x, y^a, y^h), \qquad (9)$$

where $\Gamma(x, y^a, y^h)$, called the joint feature of the activity set x, can be easily obtained by concatenating the various feature vectors in (4), (5) and (6).

4.1 Learning Model Parameters

Suppose we have P activity sets for learning. Let the training set be $(X, Y^a, Y^h) = \{(x^1, y^{1,a}, y^{1,h}), \ldots, (x^P, y^{P,a}, y^{P,h})\}$, where $x^i$ denotes the i-th activity set as well as the observed features of the set, $y^{i,a}$ is the label vector in the action layer and $y^{i,h}$ is the label vector in the hidden activity layer. While there are various ways of learning the model parameters, we choose a task-oriented discriminative approach. We would like to train the model in such a way that it increases the average precision scores on the training data and thus tends to produce the correct activity labels for each action segment. A natural way to learn the model parameter ω is to adopt the latent structural SVM.
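The following sketch illustrates the concatenation in Eqs. (8)-(9): weights and features are flattened with a shared block ordering so that every potential reduces to a dot product. The dictionary-based layout and the fixed key ordering are assumptions.

```python
import numpy as np

# Sketch of Eqs. (8)-(9): the per-class weight blocks are concatenated into one vector
# omega, and the joint feature Gamma(x, y^a, y^h) is the matching concatenation of the
# feature terms from Eqs. (3)-(6), so the total potential is a single dot product.
def concat_blocks(*groups):
    """Each group is a dict mapping a class (or class pair) to a weight/feature block."""
    blocks = []
    for group in groups:
        for key in sorted(group):          # a fixed ordering shared by weights and features
            blocks.append(np.asarray(group[key], dtype=float))
    return np.concatenate(blocks)

def potential(omega, joint_feature):
    """omega and joint_feature must be built with the same block ordering."""
    return float(np.dot(omega, joint_feature))
```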

The loss $\Delta(x^i, \hat{y}^{i,a})$ of labeling $x^i$ with $\hat{y}^{i,a}$ in the action layer equals the number of action segments that are associated with incorrect activity labels (an action segment is mislabeled if over half of the segment is mislabeled). From the construction of the higher-order potentials in Section 3.2.2, it is observed that, in order to achieve the best labeling of the action segments, the optimum latent activity label of an action clique must be the dominant ground-truth label $l_c$ of its child nodes in the action layer, or the free label $l_f$ if no dominant label exists for the action clique. Thus the loss $\Delta(x^i, \hat{y}^{i,h})$ of labeling the activity layer of $x^i$ with $\hat{y}^{i,h}$ is

$$\Delta(x^i, \hat{y}^{i,h}) = \sum_{c \in V^h} I\big(\hat{y}_c^{i,h} \notin \{l_c, l_f\}\big), \qquad (10)$$

where I(·) is the indicator function, which equals 1 if the condition inside is satisfied and 0 otherwise. (10) counts the number of activity labels in $\hat{y}^{i,h}$ that are neither a free label nor the dominant label of their child nodes. Finally, the loss of assigning $x^i$ with $(\hat{y}^{i,a}, \hat{y}^{i,h})$ is defined as the summation of the two, that is,

$$\Delta(x^i, \hat{y}^{i,a}, \hat{y}^{i,h}) = \Delta(x^i, \hat{y}^{i,a}) + \Delta(x^i, \hat{y}^{i,h}). \qquad (11)$$

Next, we define a convex function F(ω) and a concave function J(ω) as

$$F(\omega) = \frac{1}{2}\omega^T\omega + C\sum_{i=1}^{P} \max_{(\hat{y}^{i,a}, \hat{y}^{i,h})}\big[\omega^T\Gamma(x^i, \hat{y}^{i,a}, \hat{y}^{i,h}) + \Delta(x^i, \hat{y}^{i,a}, \hat{y}^{i,h})\big] \qquad (12)$$

and

$$J(\omega) = -C\sum_{i=1}^{P} \max_{y^{i,h}} \omega^T\Gamma(x^i, y^{i,a}, y^{i,h}).$$

The model learning problem is given as

$$\omega^* = \arg\min_\omega\,[F(\omega) + J(\omega)]. \qquad (13)$$

Although the objective function to be minimized in (13) is not convex, it is a combination of a convex function and a concave function [29]. Such problems can be solved using the Concave-Convex Procedure (CCCP) [40], [41]. We describe an algorithm similar to the CCCP in [40] that iteratively infers the latent variables $y^{i,h}$ for i = 1, ..., P and optimizes the weight vector ω. The inference and optimization procedures continue until convergence or until a predefined maximum number of iterations is reached.

The limitation of all learning algorithms that involve gradient optimization is that they are susceptible to local extrema and saddle points [8]. Thus, the performance of the proposed latent structural model is sensitive to initialization. There have been many works dealing with the problem of learning the parameters of hierarchical models [10], [36]. We use a coarse-to-fine scheme that separately initializes the model parameters using piecewise training, and then refines the model parameters jointly in a globally optimum manner. Specifically, the separately learned model parameters are used as the initialization values for the proposed learning algorithm. Given the weakly labeled training data with activity labels for each action segment, the dominant label $l_c$ for each action clique can be determined. We initialize the latent activity variable of c with the dominant label $l_c$ of its action clique $a_c$, and with $l_f$ if there is no dominant label for $a_c$.

In the E step, we infer the latent variables using the previously learned weight vector $\omega_t$ (or the initially assigned weight vector for the first iteration), leading to

$$y_{t+1}^{i,h} = \arg\max_{y^{i,h}} \omega_t^T \Gamma(x^i, y^{i,a}, y^{i,h}). \qquad (14)$$

Then, in the M step, with the inferred latent variable $y_{t+1}^{i,h}$, we solve a fully visible structural SVM (SSVM). Let us define the risk function at iteration t + 1, $\Lambda_{t+1}(\omega)$, as

$$\Lambda_{t+1}(\omega) = C\sum_{i=1}^{P} \max_{(\hat{y}^{i,a}, \hat{y}^{i,h})}\Big\{\Delta(x^i, \hat{y}^{i,a}, \hat{y}^{i,h}) + \omega^T\big[\Gamma(x^i, \hat{y}^{i,a}, \hat{y}^{i,h}) - \Gamma(x^i, y^{i,a}, y_{t+1}^{i,h})\big]\Big\}. \qquad (15)$$

Thus, the optimization problem in (13) is converted to a fully visible SSVM as

$$\omega_{t+1}^* = \arg\min_\omega \Big\{\frac{1}{2}\omega^T\omega + \Lambda_{t+1}(\omega)\Big\}. \qquad (16)$$

The problem in (16) can be converted to an unconstrained convex optimization problem [44] and solved by the modified bundle method in [38]. The algorithm iteratively searches for increasingly tight quadratic upper and lower cutting planes of the objective function until the gap between the two bounds reaches a predefined threshold. The algorithm is effective because of its very high convergence rate [37].
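A compact sketch of this alternation is given below; `infer_latent` and `solve_ssvm` are stand-ins for the MAP inference of Eq. (14) and the bundle-method SSVM solver of Eq. (16), and the convergence test is illustrative rather than the authors' stopping rule.

```python
import numpy as np

# Sketch of the CCCP-style alternation of Section 4.1: the E-step infers the latent
# activity labels with the current weights (Eq. (14)); the M-step solves the resulting
# fully visible SSVM (Eq. (16)).
def learn_cccp(train_sets, w0, infer_latent, solve_ssvm, max_iter=20, tol=1e-4):
    """train_sets: list of (x, y_a) pairs with weakly labeled action-layer ground truth."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        # E-step: best latent activity labels under the current weights
        latent = [infer_latent(w, x, y_a) for (x, y_a) in train_sets]
        # M-step: fully visible structural SVM with the inferred latent labels
        w_new = np.asarray(solve_ssvm(train_sets, latent, w), dtype=float)
        if np.linalg.norm(w_new - w) < tol:    # stop when the weights stabilise
            return w_new
        w = w_new
    return w
```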
The visible SSVM learning algorithm specified for our problem is summarized in Algorithm 1.

Algorithm 1: Learning the model parameter in (16) through the bundle method
Input: training set $S = \{(x^i, y^{i,a})\}_{i=1}^P$, $\omega_t$, $\{y_{t+1}^{i,h}\}$, C, ε
Output: optimum model parameter $\omega_{t+1}^*$
1) Initialize $\omega_{t+1}^0$ with $\omega_t$ and the cutting-plane set $G_{t+1} \leftarrow \emptyset$.
2) for k = 0, 1, 2, ... do
3) for i = 1, ..., P: find the most violated label vector for each training instance, if any, using $\omega_{t+1}^k$ (the value of $\omega_{t+1}$ at the k-th iteration);
4) end for
5) find the cutting plane $g_{\omega_{t+1}^k}(\omega) = \omega^T \partial_\omega \Lambda_{t+1}(\omega_{t+1}^k) + b_{\omega_{t+1}^k}$ of $\Lambda_{t+1}(\omega)$ at $\omega_{t+1}^k$, where $b_{\omega_{t+1}^k} = \Lambda_{t+1}(\omega_{t+1}^k) - \omega_{t+1}^{k\,T} \partial_\omega \Lambda_{t+1}(\omega_{t+1}^k)$;
6) $G_{t+1} \leftarrow G_{t+1} \cup \{g_{\omega_{t+1}^k}(\omega)\}$;
7) update $\omega_{t+1}^{k+1} \leftarrow \arg\min_\omega F_{\omega_{t+1}^k}(\omega)$, where $F_{\omega_{t+1}^k}(\omega) = \frac{1}{2}\omega^T\omega + \max\big(0, \max_{j=1,\ldots,k} g_{\omega_{t+1}^j}(\omega)\big)$;
8) $\mathrm{gap}_{k+1} = \min_{k' \le k} F_{\omega_{t+1}^{k'}}(\omega_{t+1}^{k'+1}) - F_{\omega_{t+1}^{k}}(\omega_{t+1}^{k+1})$;
9) if $\mathrm{gap}_{k+1} \le \varepsilon$, return $\omega_{t+1}^* = \omega_{t+1}^{k+1}$;
10) end for

4.2 Inference

Suppose the model parameter vector ω is given. We now describe how to identify the optimum label vector for a test instance x that maximizes (9). The inference problem is generally NP-hard for multi-class problems, thus MAP

inference algorithms, such as loopy belief propagation [29], are slow to converge. We propose an approximation method that alternately optimizes the hidden variable y^h and the label vector y^a. Such an algorithm is guaranteed to increase the objective at every iteration [29]. Let us define the activity-layer potential function as

$$\psi^h(y^h) = \sum_{c \in C^{ah}} \psi_c(y_c^a \mid x, \omega) + \psi(y^h \mid x, \omega). \qquad (17)$$

For each iteration, with the current predicted label vector y^a fixed, the inference sub-problem is to find the y^h that maximizes ψ^h(y^h). An efficient greedy search method is used to find the optimum y^h, as described in Algorithm 2. In order to simplify the inference, we force the edge weights between non-adjacent actions to be zero. With the inferred hidden variable y^h, the model is reduced to a one-layer discriminative CRF. The inference sub-problem of finding the optimum y^a can now be solved by computing the exact mixed integer solution. We initialize the process by holding the hidden variable fixed using the values obtained from automatic activity detection. The process continues until convergence or until a predefined maximum number of iterations is reached.

Algorithm 2: Greedy search algorithm for the sub-problem of finding the optimum hidden variable y^h
Input: testing instance with action-layer labels y^a
Output: hidden variable labels y^h
1) Initialize $(V_h, y^h) \leftarrow \{\emptyset, \emptyset\}$ and $\psi^h = 0$.
2) repeat: compute $\Delta\psi^h(y_c^h) = \psi^h(y^h \cup y_c^h) - \psi^h(y^h)$ for each unlabeled activity $c \notin V_h$; $(c, y_c^{h,opt}) = \arg\max_{c \notin V_h} \Delta\psi^h(y_c^h)$; $(V_h, y^h) \leftarrow (V_h, y^h) \cup (c, y_c^{h,opt})$;
3) until all activities are labeled.

4.3 Analysis of Computational Complexity

We now discuss the computational complexity of inference for a particular activity set consisting of n action segments and m activities. Assume there are M activity classes in the problem. For the graphical model in [45], the time complexity of the inference as discussed in that paper is $O(d_{max} n^2 M)$, where $d_{max}$ is the maximum number of action segments one activity may have. The inference on both the higher-order CRF and the Hierarchical-CRF is carried out layer-by-layer, and so the overall time complexity is linear in the number of layers used. Specifically, we use two-layer CRFs with an action layer and an activity layer. For the higher-order CRF model, inference on the activity layer takes O(mM) computation to obtain the activity labels for the candidate activities. With the inferred activity labels, inference on the action layer takes $O(nM^2)$, since the model is reduced to a chain-CRF. For the Hierarchical-CRF, the increase of computational complexity over the higher-order CRF lies in the inference on the activity layer, because the activities are connected with each other in this model. Using the proposed greedy search algorithm, the time complexity for inference on the activity layer is $O(m^2 M)$. Thus, the overall complexity of inference is $O(T(mM + nM^2))$ for the higher-order CRF and $O(T(m^2 M + nM^2))$ for the Hierarchical-CRF, where T is the number of iterations. Furthermore, the number of action segments n is usually several times the number of activities, that is n = αm, where α is a small positive value larger than one; $d_{max}$ and T are also small positive values larger than one. Assuming n, m and M are of the same order, which is a reasonable assumption for our case, the asymptotic computational complexity of the model in [45] and of the compared higher-order CRF and Hierarchical-CRF models is of the same order.
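For concreteness, the sketch below mirrors the greedy search of Algorithm 2, assuming a `psi_h` function that evaluates Eq. (17) on a partial labeling (e.g. by scoring only the terms whose activities are already labeled); it is an illustration rather than the exact implementation.

```python
# Sketch of the greedy search of Algorithm 2: activities are labelled one at a time,
# always committing the (activity, label) pair that raises the activity-layer potential
# psi_h of Eq. (17) the most.
def greedy_activity_inference(activity_ids, labels, psi_h):
    assigned = {}                                 # activity index -> chosen label
    while len(assigned) < len(activity_ids):
        base = psi_h(assigned)
        best_gain, best_pick = float("-inf"), None
        for c in activity_ids:
            if c in assigned:
                continue
            for y in labels:
                gain = psi_h({**assigned, c: y}) - base
                if gain > best_gain:
                    best_gain, best_pick = gain, (c, y)
        assigned[best_pick[0]] = best_pick[1]     # commit the best single extension
    return assigned
```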
5 EXPERIMENTAL RESULTS

The goal of our framework is to locate and recognize activities of interest in continuous videos using both motion and context information about the activities; therefore, datasets with segmented video clips or independent activities like Weizmann [], KTH [3], the UT-Interaction Dataset [30] and the Collective Activity Dataset [7] do not fit our evaluation goal. To assess the effectiveness of our framework in activity modeling and recognition, we perform experiments on two challenging datasets containing long duration videos: the UCLA Office Dataset [32] and the VIRAT Ground Dataset [9].

5.1 Motion Segmentation and Activity Localization

We first develop an automatic motion segmentation algorithm by detecting boundaries where the statistics of motion features change dramatically, and thus obtain the action segments. Let two NDMs be denoted as $M_1$ and $M_2$, and let $d_s$ be the dimension of the hidden states. The distance between the models can be measured by the normalized geodesic distance $\mathrm{dist}(M_1, M_2) = \frac{4}{d_s \pi^2} \sum_{i=1}^{d_s} \theta_i^2$, where $\theta_i$ is the i-th principal subspace angle (please refer to [5] for details on the distance computation). A sliding window of size $T_s$, where $T_s$ is the number of temporal bins in the window, is applied to each detected motion region along time. An NDM M(t) is built for the time window centered at the t-th temporal bin. Since an action can be modeled as one dynamic model, the model distances between subsequences from the same action should be small compared to those of subsequences from a different action. Suppose an activity starts from temporal bin k; the average model distance between temporal bin j > k and k is defined as the weighted average distance between model j and neighboring models of k as

$$DE_k(j) = \sum_{i=0}^{T_d} \gamma_i \, \mathrm{dist}(M(k+i), M(j)), \qquad (18)$$

where $T_d$ is the number of neighboring bins used, and $\gamma_i$ is the smoothing weight for model k + i that decreases along time. When the average model distance grows above a predefined threshold $d_{th}$, an action boundary is detected. Action segments along tracks are thus obtained.
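A sketch of this boundary test is given below; `model_distance` stands for the normalized geodesic NDM distance, and the window, weight and threshold values (T_d, gamma, d_th) are illustrative assumptions rather than the settings used in the experiments.

```python
import numpy as np

# Sketch of the boundary test around Eq. (18): one dynamical model is fit per sliding
# window, and a boundary is declared when the weighted average distance between the
# current window's model and the models near the start of the current action exceeds d_th.
def detect_action_boundaries(models, model_distance, T_d=3, gamma=0.8, d_th=0.5):
    """models[t]: dynamical model fitted to the window centred at temporal bin t."""
    weights = np.array([gamma ** i for i in range(T_d)])
    weights /= weights.sum()                      # smoothing weights decreasing along time
    boundaries, k = [0], 0                        # k: start bin of the current action
    for j in range(1, len(models)):
        ref = [models[min(k + i, j)] for i in range(T_d)]
        avg_dist = float(np.dot(weights, [model_distance(m, models[j]) for m in ref]))
        if avg_dist > d_th:
            boundaries.append(j)
            k = j
    return boundaries
```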

A multi-class SVM is trained upon the intra-activity features (as described in Section 3.3) of activities of different classes. After obtaining the action segments, we use the sliding window method with the trained multi-class SVM to group adjacent action segments into candidate activities. To speed up, we only work on candidate activities with confidence scores larger than a predefined threshold, indicating that they are likely to belong to the activity classes of interest.

5.2 UCLA Dataset

The UCLA Office Dataset [32] consists of indoor and outdoor videos of single activities and person-person interactions. Here, we perform experiments on the videos of the office scene, containing about 35 minutes of activities in an office room captured with a single fixed camera. We identify 10 frequent activities as the activities of interest: 1 - enter room, 2 - exit room, 3 - sit down, 4 - stand up, 5 - work on laptop, 6 - work on paper, 7 - throw trash, 8 - pour drink, 9 - pick phone, 10 - place phone down. Each activity occurs 9 to 26 times in the dataset. Since the dataset contains only single-person activities, it is natural to model the activities in one sequence together. The dataset is divided into 8 sets; each set contains 2 sequences of activities and each sequence contains 2 to 9 activities of interest, as well as a varying number of background activities. We use leave-one-set-out cross validation for the evaluation: 7 sets are used for training and 1 set for testing.

5.2.1 Preprocessing

The intra-activity context feature is based on interactions between the agent and the surroundings. In the office dataset, there are 7 classes of objects that are frequently involved in the activities of interest: laptop, garbage can, papers, phone, coffee maker and cup. Fig. 5 shows the detected objects of interest in the office room.

Fig. 5. Detected objects of interest in the UCLA office scene.

Since the UCLA Dataset consists of single-person activities, the intra-activity attributes considered include agent-object interactions and their relative locations. We identify $N_G = 10$ subsets of attributes for the development of intra-activity context features in the experiment, as shown in Fig. 6.

Fig. 6. Subsets of context attributes used for the development of intra-activity context features for the UCLA Dataset (the superscripts indicate the correspondence between the subsets and the objects): the agent is touching / not touching the laptop (G1), paper (G2), phone (G3); the agent is occluding / not occluding the garbage can (G4), coffee maker (G5); the agent is near / far away from the garbage can (G6), coffee maker (G7), door (G8); the agent disappears / does not disappear at the door (G9); the agent appears / does not appear at the door (G10).

For a given activity, the above attributes are determined from image-level detection results. The locations of objects are automatically tracked. Similar to [32], if enough skin color is detected within the areas of the laptop, paper or phone, the corresponding attributes are considered as true. Fig. 7 shows examples of detected agent-object interactions.

Fig. 7. Examples of agent-object interactions detected from images: touch laptop, touch paper, occlude garbage can, touch phone.

Whether the agent is near or far away from an object is determined by the distance between the two, based on normal distributions of the distances in the two scenarios. Probabilities indicating how likely the agent is near or far away from an object are thus obtained. For frame n of an activity, we obtain $g_i(n) = I(G_i(n))$, where I(·) is the indicator function. $g_i(n)$ is then normalized so that its elements sum to 1. Related candidate activities are connected. Whether two activities are related can be naturally determined by their temporal distances.
One way to decide whether the relationship between two candidate activities should be modeled is to check whether they are in the α-neighborhood of each other in time. Two activities are said to be in the α-neighborhood of each other if there are fewer than α other activities occurring between the two.

5.2.2 Experimental Results

Although the UCLA Dataset has been used in [32], the recognition accuracy for the office dataset has not been provided in that paper. We compare the performance of the popular BOW+SVM classifier and our model. The experimental results in precision and recall are shown in Fig. 8. In order to show the effects of incorporating different kinds of motion and context features, we also show results of using the action-based linear-chain CRF approach and the action-based higher-order CRF approach (Fig. 3(a) and 3(b)). It can be seen that the use of intra-activity context increases the recognition accuracy of activities with obvious context patterns.

For example, enter room is characterized by the context that the agent appears at the door. The increased recognition accuracy of enter room from using intra-activity context features indicates that our model successfully captures this characteristic. From the performance of the higher-order CRF approach and the Hierarchical-CRF approach, we can see that for activities with strong spatio-temporal patterns, such as pick phone and place phone down, modeling the inter-activity spatio-temporal relationships increases the recognition accuracy significantly.

Fig. 8. Precision (a) and recall (b) for the ten activities in the UCLA Office Dataset, for BOW+SVM, linear-chain CRF, higher-order CRF and HCRF (α = 2). The activities are defined in Section 5.2. HCRF is short for Hierarchical-CRF.

Next, we change the value of α to see how it influences the recognition accuracy of the Hierarchical-CRF approach. Fig. 9 compares the overall accuracy of different methods and of the Hierarchical-CRF approach with different α values.

Fig. 9. Overall and average per-class accuracy for different methods on the UCLA Office Dataset: BOW+SVM, linear-chain CRF, higher-order CRF, HCRF (α = 1), HCRF (α = 2) and HCRF (fully connected). The BOW+SVM method is tested on video clips, while the other results are in the framework of our proposed action-based CRF models upon automatically detected action segments. HCRF is short for Hierarchical-CRF.

From the results, we can see that the Hierarchical-CRF approach with α = 2 outperforms the other models. This is expected. When α is too small, the spatio-temporal relationships of related activities are not fully utilized, while the Hierarchical-CRF with a fully connected activity layer models the spatio-temporal relationships of unrelated activities. For instance, in the UCLA Office Dataset, one typical temporal pattern of activities is that a person sits down to work on the laptop, then the same person stands up to do other things, and then sits down to work on the laptop again. All these activities are conducted sequentially. Thus, the Hierarchical-CRF model with a fully connected activity layer captures the false temporal pattern of stand up followed by work on laptop. The optimum value of α can be obtained using cross validation on the training data.

5.3 VIRAT Ground Dataset

The VIRAT Ground Dataset is a state-of-the-art activity dataset with many challenging characteristics, such as wide variation in the activities and clutter in the scene. The dataset consists of surveillance videos of realistic scenes with different scales and resolutions, each lasting 2 to 5 minutes and containing up to 30 events. The activities defined in Release 1 include: 1 - person loading an object to a vehicle; 2 - person unloading an object from a vehicle; 3 - person opening a vehicle trunk; 4 - person closing a vehicle trunk; 5 - person getting into a vehicle; 6 - person getting out of a vehicle. We work on all the scenes in Release 1 except scene 0002 and use half of the data for training and the rest for testing. Five more activities are defined in VIRAT Release 2: 7 - person gesturing; 8 - person carrying an object; 9 - person running; 10 - person entering a facility; 11 - person exiting a facility. We work on all the scenes in Release 2 except scenes 0002 and 002, and use two-thirds of the data for training and the rest for testing.

5.3.1 Preprocessing

Motion regions that do not involve people are excluded from the experiments since we are only interested in person activities and person-vehicle interactions. For the development of STIP histograms, the nearest neighbor soft-weighting scheme [25] is used.
Since we work on the VIRAT Dataset with individual person activities and person-object interactions, we use the following $N_G = 7$ subsets of attributes for the development of intra-activity context features in the experiments, as shown in Fig. 10. Persons and vehicles are detected based on the part-based object detection method in [9]. Opening/closing entrance/exit doors of facilities, boxes and bags are detected using the method in [6] with a binary linear SVM as the classifier. Using these high-level image features, we follow the description in Section 5.2.1 to develop the feature descriptors for each activity set. The first three sets of attributes in Fig. 10 are used for the experiments on Release 1, and all are used for the experiments on Release 2. Fig. 11 shows examples of $g_i(n)$, defined as in Section 5.2.1, for different activities in VIRAT. Since, in VIRAT, activities are naturally related to each other, the activity-layer nodes are fully connected to utilize the spatio-temporal relationships of activities occurring in the same local space-time volume.

5.4 Recognition Results on VIRAT Release 1

Fig. 12 compares the precision and recall for the six activities defined in VIRAT Release 1 using the BOW+SVM method and our approach with different kinds of features. The results show, as expected, that the recognition accuracy

increases by encoding the various context features. For instance, the higher-order CRF approach encodes intra-activity context patterns of activities of interest.

Fig. 10. Subsets of context attributes used for the development of intra-activity context features on VIRAT: G1 - the moving object is a person; the moving object is a vehicle trunk; the moving object is of another kind. G2 - the agent is at the body of the interacting vehicle; the agent is at the rear/head of the interacting vehicle; the agent is far away from the vehicles. G3 - the agent disappears at the body of the interacting vehicle; the agent appears at the body of the interacting vehicle; none of the two. G4 - the agent disappears at the entrance of a facility; the agent appears at the exit of a facility; none of the two. G5 - the velocity of the agent (in pixels) is larger than a predefined threshold; the velocity of the object of interest is smaller than a predefined threshold. G6 - the activity occurs at parking areas; the activity occurs at other areas. G7 - an object (e.g. bag/box) is detected on the agent; no object is detected on the agent.

Fig. 11. Examples of detected intra-activity context features (the attribute vectors $g_1(n)$, $g_2(n)$, $g_5(n)$, $g_6(n)$ and $g_7(n)$ for activities such as person loading, person unloading, opening trunk, closing trunk, getting into a vehicle, getting out of a vehicle, gesturing and carrying an object). The example images are shown with detected high-level image features: an object in a red bounding box is a moving person; an object in a blue bounding box is a static vehicle; an object in an orange bounding box is a moving object of another kind; an object in a black bounding box is a bag/box on the agent.

Fig. 12. Precision (a) and recall (b) for the six activities defined in VIRAT Release 1, for BOW+SVM, linear-chain CRF, higher-order CRF and HCRF.

Fig. 13. Example activities (defined in VIRAT Release 1) showing the effect of context features: activities correctly recognized by the action-based linear-chain CRF (top); activities incorrectly recognized by the linear-chain CRF but corrected using the higher-order CRF with intra-activity context (middle); and activities incorrectly recognized by the higher-order CRF but rectified using the action-based hierarchical CRF with inter-activity context (bottom).
Thus, activities with a strong intra-activity context pattern, such as person getting into vehicle, are better recognized by the higher-order CRF approach than by the linear-chain CRF approach, which does not model the intra-activity context of activities. The Hierarchical-CRF approach further encodes inter-activity context patterns of activities. Thus, activities with strong spatio-temporal relationships with each other are better recognized by the Hierarchical-CRF approach. For instance, the higher-order CRF approach often confuses open a vehicle trunk and close a vehicle trunk with each other. However, if the two activities happen closely in time at the same place, the first activity in time is probably open a vehicle trunk. This kind of contextual information within and across activity classes is captured by the Hierarchical-CRF approach and used to improve the recognition performance. Fig. 13 shows examples that demonstrate the significance of context in activity recognition.

We also show the results on VIRAT Release 1 for different methods using overall and average accuracy in Fig. 14. We have compared our results with the popular BOW+SVM approach, the more recently proposed String-

of-Feature-Graphs approach [2], [43] and the structural model in [45].

Fig. 14. Average accuracy for the six activities defined in VIRAT Release 1. Note that SVM+BOW works on video clips, while the other methods work on continuous videos.
  Method                 Average accuracy
  BOW+SVM [25]           45.8
  SFG [44]               57.6
  Structural Model [45]  62.9
  Linear-chain CRF       42.6
  Higher-order CRF       60.4
  Hierarchical-CRF       66.2

Note that BOW+SVM works on video clips while the others work on continuous video. The Hierarchical-CRF approach outperforms the other methods. The results are expected, since the intra-activity and inter-activity context within and between actions and activities gives the model additional information about the activities of interest beyond the motion information encoded in low-level features. The SFG approach models the spatial and temporal relationships between the low-level features and thus takes into account the local structure of the scene; however, it does not consider the relationships between various activities, and thus our method outperforms the SFGs. The structural model in [45] models the intra and inter context within and between activities; however, it does not model the action layer and the interactions between actions and activities.

5.5 Recognition Results on VIRAT Release 2

VIRAT Release 2 defines additional activities of interest. We work on VIRAT Release 2 to further evaluate the effectiveness of the proposed approach. We follow the method defined above to get the recognition results on this dataset. Fig. 15 compares the precision and recall for the eleven activities defined in VIRAT Release 2 for the BOW+SVM method, the structural model in [45], and our method. We see that by modeling the relationships between activities, those with strong context patterns, such as person closing a vehicle trunk (4) and person running (9), achieve a larger performance gain compared to activities with weak context patterns such as person gesturing (7). Fig. 16 shows example results on activities in Release 2. Fig. 17 compares the recognition accuracy using recall for different methods. We can see that the performance of our Hierarchical-CRF approach is comparable to the recently proposed method in []. In [], an SPN on BOW is learned to explore the context among motion features. However, [] works on video clips, each containing an activity of interest with an additional 10 seconds occurring randomly before or after the target activity instance, while we work on continuous video.

Fig. 15. Precision (a) and recall (b) for the eleven activities defined in VIRAT Release 2, for BOW+SVM, linear-chain CRF, higher-order CRF and HCRF.

Fig. 16. Examples of recognition results (from VIRAT Release 2): opening trunk, getting out of vehicle, entering a facility, exiting a facility, person running, carrying an object. For each pair of rows, examples in the bottom row show the effect of context features in correctly recognizing activities that were incorrectly recognized by the linear-chain CRF approach, while other examples of the same activities correctly recognized by the linear-chain CRF are shown in the top row.

6 CONCLUSION

In this paper, we design a framework for modeling and detection of activities in continuous videos. The proposed framework jointly models a variable number of activities in continuous videos, with action segments as the basic motion elements. The model explicitly learns the activity durations and motion patterns for each activity class, as well as the context patterns within and across actions and activities of different classes, from training activity sets.
6 CONCLUSION

In this paper, we design a framework for the modeling and detection of activities in continuous videos. The proposed framework jointly models a variable number of activities in continuous videos, with action segments as the basic motion elements. The model explicitly learns the activity durations and the motion patterns for each activity class, as well as the context patterns within and across actions and activities of different classes, from the training activity sets. It has been demonstrated that joint modeling of activities, by encapsulating object interactions and spatial and temporal relationships among them, improves the recognition of the individual activities.
