A Space-Time Graph Optimization Approach Based on Maximum Cliques for Action Detection

Sunyoung Cho, Member, IEEE, and Hyeran Byun, Member, IEEE

Abstract—We present an efficient action detection method that takes a space-time graph optimization approach for real-world videos. Given a space-time graph representing the entire action video, our method identifies a maximum-weight connected subgraph indicating an action region by applying an optimization approach based on clique information. We define an energy function based on maximum-weight cliques for subregions of the graph and formulate it as an optimization problem that can be represented as a linear system. Our energy function includes the maximum and connectivity properties needed for finding the maximum-weight connected subgraph, and its optimization solution indicates, for each node, the probability of belonging to the maximum subgraph. Our graph optimization method efficiently solves the detection problem by applying the clique-based approach and a simple linear system solver. Our experimental results on real-world datasets, such as the Hollywood and MSR action datasets, demonstrate that our detection method yields more accurate localization than conventional methods. We also show that our method outperforms state-of-the-art action detection methods.

Index Terms—Action detection, space-time graph, sparse representation, maximum weight connected subgraph, maximum weight clique, optimization.

S. Cho was with the Department of Computer Science, Yonsei University, Seoul, Korea. She is now with the Human-Computer Interaction Institute, Carnegie Mellon University, Pittsburgh, PA, USA (e-mail: sycho22@gmail.com). H. Byun is with the Department of Computer Science, Yonsei University, Seoul, Korea (e-mail: hrbyun@yonsei.ac.kr).

I. INTRODUCTION

UNDERSTANDING human actions in videos has been considered one of the most important areas in the computer vision community due to a variety of applications, such as video indexing and searching, video surveillance, and human-computer interaction. The action learning problem has been studied extensively over the past years, and recent works have explored more realistic actions rather than the simplified and constrained ones used in earlier studies. Action learning in the real world is a challenging problem because uncontrolled real-world videos include a large amount of variation in terms of the number of people, the scale of each person, background clutter, occlusion, and camera parameter changes. Furthermore, two problems need to be solved for action understanding: localizing the region in which an action occurs in a video sequence (action detection) and classifying the category of the action (action recognition). Action detection aims to find the spatial and temporal regions of the actions occurring in a video sequence. The most common approach is to employ a sliding window [1]-[6], which applies a classifier function to subregions and takes the maximum of the classification score as the location of the action. Since the number of subregions in a video sequence of size $h \times w \times f$ is of the order $O(h^2 w^2 f^2)$, evaluating the classifier function for all subregions is computationally expensive. Another challenge arises because uncontrolled real-world videos contain large variations in actions and complex backgrounds; it is difficult to construct a classifier that recognizes realistic actions, which makes the action detection problem even more challenging.
The main approach to detection problems is to find the subvolume that maximizes the output of a scoring function. To define the scoring function, a classifier measuring the score of an input must be learned. Several methods use a max-margin framework [7], [43], [44] to train the parameters of the classifier. Chen and Grauman [7] use a linear SVM to learn the parameters of the scoring function. Since they use a graph structure to represent a video, the learned parameters are used to compute the weight of each node in the graph. Kong et al. [43] measure the interactive phrases of human interaction using a latent SVM. They define attribute and interaction models for scoring the interactive phrases and find the maximum interactive phrase by solving an optimization problem based on coordinate descent. Hoai and De la Torre [44] consider a detection scoring function that extends the Structured Output SVM [45] to enable the recognition of sequential data. By exploiting partial events as training examples for learning the scoring function, their method supports the early detection of events in sequential data. Our approach exploits the max-margin framework for defining the scoring function. We are especially inspired by the work of Chen and Grauman [7], which solves the action detection problem as a maximum-weight connected subgraph problem (MWCSP) on a weighted space-time (ST) graph representing an input video. They transform the MWCSP into a prize-collecting Steiner tree problem and solve it using a branch-and-cut optimization strategy. Since their detection method is formulated as an integer linear programming problem with binary variables, the resulting detections are binary: their result can only indicate whether each node is an action node or not. We also solve the MWCSP, by applying a clique-based optimization strategy to a weighted ST graph representing an action video. Our method first determines the maximum weight cliques (MWC) for each graph unit. Based on the node information of the MWCs, we define an energy function that represents the maximum property of each node

and connectivity among the nodes in the resulting subgraph. The connectivity property makes the energy function robust to noise occurring in the middle of an action. In particular, because our method determines the probability of action occurrence at each node, it enables more robust and accurate detection than that of [7]. Our energy function is formulated as an optimization problem that can be represented as a linear system. The final maximum-weight connected subgraph (MWCS) is obtained by applying a likelihood threshold to the solution of the linear system.

Fig. 1 shows an overview of our approach. Given an action video, the first step extracts local ST features (Fig. 1(a)). The action video is represented as an ST graph in which each node has a weight computed from the extracted local features (Fig. 1(b)). Since our method is trained to assign larger weights to local features within the action region than to local features outside it, nodes within the action region receive larger weights (darker colored nodes). The process of action detection is therefore equivalent to finding an MWCS. Each node's probability of belonging to the MWCS is found by solving a linear system (Fig. 1(c)), and the detection result is determined by thresholding this probability (Fig. 1(d)). The final detection result is the set of regions corresponding to the set of nodes in the MWCS (Fig. 1(e)).

Fig. 1: Overview of our approach. (a) Local ST feature extraction. (b) Video representation using a weighted ST graph (node numbers are shown on the nodes, and some edges are omitted for ease of viewing). (c) Optimization problem for computing the probability of becoming an MWCS node. (d) Action detection by thresholding. (e) Action detection result.

Since most previous action detection methods [1], [4], [5], [13] produce cubic-shaped detection results, they cannot search for a subvolume whose spatial location shifts over time. A more recent method [7] produces non-cubic detection results by representing a video as an ST graph and searching for the maximum subgraph within the entire graph. Our method produces even more flexible, non-cubic results by performing the subgraph search on a flexible ST graph structure. In addition, we propose a maximum subgraph search method that produces more detailed detection results with less computation time by introducing an optimization based on MWC information.

The main contributions of this paper are summarized in three points. First, we propose a novel approach that solves action detection with a graph optimization strategy. We define an energy function based on MWC information in order to find an MWCS indicating an action region in an ST graph representing the entire action video. The energy function is formulated as an optimization problem that can be represented as a linear system; by solving the linear system, our method efficiently detects the spatial and temporal region of the occurring action. Second, our method produces more robust and accurate detection results. Our energy function includes the maximum property of each node and connectivity among the nodes in the resulting subgraph; these properties make our detection robust to the noise that may occur in real-world action videos. Furthermore, our optimization result indicates, for each node, the likelihood of belonging to the MWCS rather than a binary action/non-action label, which improves performance by allowing more robust and flexible detection. Third, our graph optimization method takes relatively little time even for long videos. This is because the energy function is defined with MWC information, which is computed from graph units much smaller than the entire graph representing the video, and because the energy function is an optimization problem that can be represented as a simple linear system. By exploiting the MWC search in small graph units and a linear system solver, our method efficiently solves the detection problem in terms of both computational cost and problem complexity.

The rest of the paper is organized as follows. Section II briefly reviews previous works on action detection and graph-based approaches for representing an action video. Section III describes the details of our video representation and the proposed optimization approach for action detection. Section IV presents our experimental results, and Section V concludes the paper.

II. RELATED WORK

In this section, we review state-of-the-art methods for action detection. We also review previous works that apply a graph-based approach for video representation.

A. Previous Works on Action Detection

The most common approach to action detection is the sliding window method, which applies a classifier function to subregions within the entire sequence and takes the maximum classification score as the location of the action. Although this method has been successfully applied in many works [1], [4], [5], evaluating the classifier function for all subregions is computationally expensive. To avoid exhaustive searches, several recent works have employed a voting-based approach, which performs localization based on votes cast by local ST features. Mikolajczyk and Uemura [8] propose a method based on a vocabulary forest of local features and a voting scheme. They use a large number of low-dimensional local features to build a vocabulary forest that captures the joint appearance-motion of actions. Voting is performed for action categories and occurrence locations over each vocabulary tree. However, their work is restricted to the spatial localization of the subjects in each frame. Ryoo and Aggarwal [9] apply the voting technique to the intersection of relationship histograms between different action videos. They develop a new matching function, the ST relationship match, to measure the similarity between two sets of features; each pair of features in the intersection votes for the expected starting and ending locations of the action. Oikonomopoulos et al. [10] accumulate localization evidence in a probabilistic ST voting scheme and use class-specific codebooks of codeword ensembles to encode the ST positions. Yao et al. [11] present a method to classify and localize human actions using a Hough transform voting framework. They perform the voting with a collection of random trees and learn a mapping between densely sampled ST feature patches and an action center. The resulting set of leaf nodes in the trees forms a discriminative codebook with features shared across actions and votes for action centers in a probabilistic manner. To summarize, although the voting approach reduces computational costs compared to the sliding window method, it is often sensitive to noisy backgrounds and is ambiguous for actions with periodicity, which leads to incorrect votes. Hence, this approach cannot guarantee finding the maximum-scoring region.

The branch-and-bound approach has also been explored to avoid the enormous computational cost of an exhaustive search. This approach identifies the most probable action occurrences using an optimization scheme. Willems et al. [12] propose an extended exemplar-based approach based on local features in the ST domain. The most discriminative visual words are selected and used to formulate bounding box hypotheses; actions are finally detected by merging the hypotheses with high confidence values. Yuan et al. [13] also solve the action detection problem using a branch-and-bound strategy. They formulate detection as a search for the 3D subvolume with the maximum amount of mutual information. To this end, a video sequence is represented by a set of features, and each feature casts a positive- or negative-valued vote for the action class. Cao et al. [14] present a framework that combines a Gaussian mixture model (GMM)-based representation of ST features with detection through maximum a posteriori (MAP) estimation. They handle data mismatches by performing model adaptation and action detection simultaneously.

Some methods perform dynamic programming for action segmentation. Zhou et al. [15] formulate motion segmentation as aligned cluster analysis (ACA), an extension of the k-means clustering algorithm. ACA combines a dynamic time alignment kernel with dynamic programming for temporal segmentation of actions, and an efficient coordinate descent algorithm solves it. Hoai et al. [16] also use dynamic programming for temporal segmentation, maximizing the classification score of the winning class while suppressing those of the non-maximum classes.

Recent methods have used a structured graph to represent the human region or the entire video. Chen and Grauman [7] use a space-time graph for video representation, where each node indicates a subvolume and its weight represents the likelihood of an action's occurrence. Under this weighted graph representation, they solve the action detection problem by maximum subgraph search. Lan et al. [41] improve action recognition results by treating the human location as a latent variable. Since this method requires a human detection procedure for each frame, it incurs a large computation time and is difficult to apply to complex real-world data containing varied human appearances and complex backgrounds. Shapovalova et al. [42] relax the human detection assumption of Lan et al. [41] by introducing a clustering of objectness regions. In their method, a video is represented by a global feature of the whole video and local features of objectness regions. Under this video representation, they develop a Similarity Constrained Latent SVM (SCLSVM) model to perform weakly supervised action recognition and localization.

B. Graph-based Approach for Representing an Action Video

Existing methods in human action recognition mainly use feature descriptors extracted from human parts or interest points in order to capture the appearance, shape, and motion patterns of an actor. These features have been represented using various methods, including bag of features (BoF), dynamic time warping (DTW), hidden Markov models (HMM), and conditional random fields (CRF). The most popular method represents a video sequence as a BoF and performs classification using the BoF vector. Although BoF-based methods have shown good results for action recognition, their representations tend to ignore the spatio-temporal relationships among features, which can be an important property for action classification. A graph, in contrast, provides an efficient way to describe the spatio-temporal relationships between structural parts or low-level features. Several recent works have used a graph structure to represent action videos, where each node corresponds to a local

feature, and each edge corresponds to the relationship between its nodes. Most of these works perform action recognition as a graph matching problem, measuring the similarity between model and test graphs. Ta et al. [17] construct a graph with a significantly reduced number of edges by filtering the set of triangles between triplets of ST interest points. Because of the reduced complexity of the resulting graph, their approach provides efficient graph matching that computes the matching score by projecting the set of nodes of the first graph onto that of the second graph. Gaur et al. [18] represent an action in a video as a string of feature graphs (SFG) that models the spatial arrangement of the features; the recognition result is obtained by matching the SFGs of videos using DTW. Brendel et al. [19] learn the structure and pdfs associated with graphs as the permutation of adjacency matrices of training graphs in the weighted least-squares sense. Celiktutan et al. [20] present a hyper-graph structure that performs exact matching with low complexity by casting it as a point-set matching problem. Some works have applied graph embedding, converting a graph into a point in a vector space to make it suitable for general recognition approaches based on feature vectors. Liu et al. [21] apply Fiedler embedding to the graph and feed the resulting vector to a k-nearest-neighbor classifier for action recognition. Borzeshi et al. [22] also apply graph embedding, with a class-based prototype selection method that maximizes a function of the inter- and intra-class distances; the resulting embedding of the graph is fed to an HMM classifier for action recognition.

III. MODELING ACTION AS A WEIGHTED ST-SUBGRAPH

We first introduce a method for extracting ST features via sparse representation (SR) in Section III-A and then present the construction of an ST-graph representing an action video in Section III-B. Section III-C explains our action detection method, which finds an MWCS on the ST-graph using an optimization approach.

A. Extracting Localized ST Video Features via SR

Suppose I is the set of detected space-time interest points (STIPs) extracted from a set of training video sequences S = {S_i, i = 1, 2, ..., N}. To represent the detected STIPs I, we first compute histograms of oriented gradients (HoG) and histograms of optical flow (HoF) in the ST neighborhoods of the detected STIPs. These descriptors capture local appearance and motion information and have been used successfully in many works on action recognition employing the BoF scheme for video representation. However, the BoF representation incurs two drawbacks [23]: (1) it leads to a considerable amount of approximation error because each feature is assigned to only one codeword, and (2) the codebook size may need to be increased for data with large variation. Recently, the SR-based approach has been shown to overcome these drawbacks by reducing the approximation error and constructing compact dictionaries in many vision tasks [24], [21], including action recognition [25], [26], [23], [27]. SR has been shown to efficiently represent and compress high-dimensional signals. To obtain the SR, the first step constructs a dictionary with orthogonal or overcomplete bases that can represent the essential information in a signal. The next step determines a sparse solution giving the degree of contribution of each element of the dictionary.
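As a concrete illustration of this two-step pipeline, the sketch below learns an overcomplete dictionary from stacked HoG/HoF descriptors and computes sparse codes for them. It is a minimal sketch using scikit-learn's MiniBatchDictionaryLearning, an online algorithm in the spirit of [28], rather than the authors' exact implementation; the input file name and array shapes are assumptions.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# Assumed input: n HoG/HoF descriptors of dimension m, stacked row-wise
# (hypothetical file name; in the paper these come from the detected STIPs).
X = np.load("hoghof_descriptors.npy")        # shape (n, m)
n, m = X.shape
K = 4 * m                                     # overcomplete dictionary, m < K

# Online dictionary learning: alternates L1-regularized sparse coding and
# dictionary updates over mini-batches, matching the objective of Eq. (1).
dl = MiniBatchDictionaryLearning(n_components=K, alpha=1.0,   # alpha = lambda
                                 batch_size=256,
                                 transform_algorithm="lars")
dl.fit(X)

alphas = dl.transform(X)   # sparse codes alpha_j, shape (n, K), few nonzeros
D = dl.components_         # learned dictionary, shape (K, m); the paper's D
                           # is m x K with basis vectors as columns
```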
Next, we need to learn the overcomplete dictionary and the corresponding SR using the descriptors of the detected STIPs I. We use an online optimization algorithm for dictionary learning [28], which we briefly describe here. Let the set of HoG/HoF feature descriptors of the training set S be X = {x_j, j = 1, 2, ..., n}, where $x_j \in \mathbb{R}^m$, $n = \sum_{i=1}^{N} n_i$, and $n_i$ is the number of descriptors in training video sequence $S_i$. The online dictionary learning algorithm optimizes the cost function

$$\min_{D,\alpha} \frac{1}{n} \sum_{j=1}^{n} \left( \frac{1}{2} \lVert x_j - D\alpha_j \rVert_2^2 + \lambda \lVert \alpha_j \rVert_1 \right), \qquad (1)$$

where $D \in \mathbb{R}^{m \times K}$ ($m < K$) is the overcomplete dictionary, each column representing a basis vector, and $\alpha = \{\alpha_j, j = 1, 2, \dots, n\}$, $\alpha_j \in \mathbb{R}^K$, is the SR over X such that each $\alpha_j$ contains only a few nonzero elements. The algorithm iteratively solves Eq. 1 by performing two steps at every iteration: sparse coding and dictionary updating. In other words, an initial SR decomposition of X is computed from an initial dictionary, and this decomposition is then used to update the dictionary. This procedure is repeated until the iteration limit is reached. The sparse decomposition problem is solved with the LARS (least-angle regression) algorithm [38], and the dictionary update is performed using block-coordinate descent. Finally, each STIP in I is represented by its corresponding SR computed over the final dictionary.

B. Modeling an Action Video as a Weighted ST-Graph

We represent a video sequence Q using a weighted ST-graph G = (V, E), where V is a set of weighted nodes and E is a set of edges. The weights of the nodes V are determined from the feature descriptors extracted from the video sequence. The most popular shape for representing the action region is the ST cuboid [13], [29], [30], a cube-shaped subvolume maximizing the action-occurring probability. However, this shape is restricted to a fixed location over the temporal domain; that is, it cannot shift its spatial location over time. Recently, Chen and Grauman [7] used an ST-graph that allows spatial changes over the temporal domain and provides more accurate detection. We also use a space-time graph to represent the video sequence, with the node and edge structures described as follows.

1) Node structure: The node structure is determined by dividing the video sequence into a grid of $H \times W \times F$ ST cubes. The grid size controls the trade-off between computational efficiency and detection sensitivity: a smaller grid leads to higher computational cost but gives a more detailed detection shape, whereas a larger grid provides sparser detection at lower computational cost. In our implementation, we set H and W to 1/3 of the frame dimensions and F to 10 frames. Using this node structure, we can detect action regions of irregular, non-cubic shape.
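To make the node structure concrete, here is a small sketch, under assumed coordinate conventions, that maps each local feature to the grid cell (graph node) containing it, following the 1/3-of-frame spatial grid and 10-frame temporal grid described above.

```python
import numpy as np

def feature_to_node(pts, frame_h, frame_w, n_frames):
    """Map STIP coordinates to ST-grid node indices.

    pts: (n, 3) array of (y, x, t) feature locations -- an assumed layout.
    The grid follows the paper's setting: H, W are 1/3 of the frame
    dimensions and F is 10 frames, giving a 3 x 3 x ceil(n_frames/10) grid.
    """
    cell_h, cell_w, cell_f = frame_h / 3.0, frame_w / 3.0, 10.0
    iy = np.minimum((pts[:, 0] / cell_h).astype(int), 2)
    ix = np.minimum((pts[:, 1] / cell_w).astype(int), 2)
    it = np.minimum((pts[:, 2] / cell_f).astype(int),
                    int(np.ceil(n_frames / cell_f)) - 1)
    # Flatten (it, iy, ix) into a single node id per feature.
    return (it * 3 + iy) * 3 + ix
```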

Fig. 2: Conceptual illustration of our node weighting scheme. The yellow points are local features, and each bar on the right side indicates the SR of each local feature. In each bar, each color represents the weight of one element of the dictionary. The weight of the red-bordered node in the center is computed from the SRs of the local features falling within the node.

2) Node weight: To detect the action region in the graph, we need to represent the amount of action information contained in each node. We formulate the node weight by analogy with a common SVM scoring function. Given a set of training video sequences S, each video sequence $S_i$ with $n_i$ STIPs can be represented by a coefficient histogram $h(S_i)$ obtained by max-pooling the corresponding SR vectors of its $n_i$ feature descriptors. We train a linear SVM using all the coefficient histograms $h(S_i)$, $i \in [1, N]$, extracted from the training examples S. The training examples S include positive and negative samples; a sample is positive if it contains the action to detect and negative otherwise. Let c and $\beta$ denote the weight vector and bias of the SVM, respectively. We then compute a weight for each node $v \in V$ in the graph G for a video sequence Q as

$$w_v = \beta + \sum_{x_l \in v} \sum_{k=1}^{K} c_k\, \alpha_l^k, \qquad (2)$$

where $x_l$ is the l-th local feature descriptor falling within node v of the graph G constructed from the video sequence Q, and $\alpha_l^k$ is the k-th value of the SR $\alpha_l$ of $x_l$, obtained by the method of Section III-A. Fig. 2 shows a conceptual illustration of our node weighting scheme. Note that nodes with higher positive weights indicate a higher probability that the action occurs in that region, while smaller weights indicate a lower probability. With a weight defined for each node, we can detect the regions of interest in the graph by searching for the set of nodes with the highest sum of node weights. This enables us to score an arbitrarily-shaped set of nodes where the action occurs. In this context, we apply a detection approach based on an MWC search, as presented in Section III-C.

3) Edge structure: The linking strategy between nodes affects both the shapes of candidate subgraphs and the search cost. In general, each node is linked with its 4-connected neighboring nodes. However, since our detection approach is based on a clique search, this edge structure is not sufficient for finding maximum cliques (Section III-C). Hence, we additionally include three types of edges: (1) 8-connected edges in the spatial dimension, (2) 8-connected edges in the temporal dimension, and (3) jump edges to the second adjacent neighbor in the temporal dimension. Chen and Grauman [7] show that the jump edge ignores misleading features that may interrupt an otherwise good instance of an action. Since realistic videos tend to contain noisy elements such as background clutter and camera motion, a graph with jump edges supports more robust detection.

Fig. 3: Conceptual illustration of our edge structure. The black lines indicate general 4-connected edges. The green line indicates the jump edge. The red lines are 8-connected edges in the spatial dimension, and the blue lines are 8-connected edges in the temporal dimension.

Fig. 3 shows an example of our linking structure. Our additional edges connect each node to more neighboring nodes, including its 8-connected neighbors, which expands the space of candidate subvolumes by allowing more cliques to be searched. Fig. 4 shows an example of the strength of our additional edges: an MWC with greater weight can be found in the graph with 8-connected edges. Consequently, this provides stronger localization in the detection problem.

Fig. 4: An example illustrating the strength of our linking structure. (a) 4-connected neighbors only; (b) 4- and 8-connected neighbors. The yellow nodes belong to an MWC. The MWC weight of graph (a) is (4+3), and the MWC weight of graph (b) is greater.

One could consider the complete graph, in which every pair of nodes is connected, in order to search for the MWC with the greatest weight. However, this graph yields a very large MWC that includes noisy nodes with positive weights, and distant nodes can be included in an MWC. Even if we limit the complete graph to a local region, the resulting MWC still includes noisy nodes. For example, if the graph in Fig. 4(b) were complete, the resulting MWC would include a node with weight 5; that node has neighboring nodes with negative weights and, consequently, is likely to correspond to noise. We demonstrate this case through experiments.
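Given the SVM parameters and the sparse codes, the node weight of Eq. 2 reduces to a per-feature dot product accumulated into nodes. A minimal sketch follows, where `alphas` and the node mapping are assumed to come from the earlier steps.

```python
import numpy as np

def node_weights(alphas, node_ids, svm_c, svm_beta, n_nodes):
    """Compute w_v per Eq. (2): w_v = beta + sum over features x_l in v
    of <c, alpha_l>.

    alphas:   (n, K) sparse codes of the local features (Section III-A)
    node_ids: (n,) node index of each feature (grid mapping above)
    svm_c:    (K,) linear-SVM weight vector; svm_beta: its bias
    """
    per_feature = alphas @ svm_c                  # <c, alpha_l> per feature
    w = np.full(n_nodes, svm_beta, dtype=float)   # each node starts at beta
    np.add.at(w, node_ids, per_feature)           # accumulate into nodes
    return w
```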

C. Searching for the MWCS for Action Detection

Given an ST-graph G = (V, E) with weighted nodes, we need to find the subgraph $G^{*}$ satisfying

$$G^{*} = \arg\max_{G' \subseteq G} \sum_{v \in V'} w_v, \qquad (3)$$

where $G' = (V' \subseteq V, E' \subseteq E)$ can be any connected subgraph of G, and $G^{*}$ is the connected subgraph with the highest sum of node weights. Since each node has a learned weight indicating the probability of the action occurring, the action region is a set of connected nodes whose total weight is maximal. Therefore, action detection can be cast as finding the maximum subgraph of the entire graph representing an action video; that is, the action detection problem can be solved as an MWCSP. If all node weights are positive, an optimal solution is easily computed by taking any spanning tree. However, the node weights in our graph can be either positive or negative, and the MWCSP is then NP-complete [33], i.e., no efficient algorithm is known for it. Dittrich et al. [34] transform the MWCSP into the prize-collecting Steiner tree problem (PCST) to identify functional modules in protein-protein interaction networks. In [34], the graph is transformed into a directed graph, and integer linear programming (ILP) is performed on the transformed directed graph with binary variables for each node and edge; the problem is then solved with a linear programming-based branch-and-bound algorithm. Chen and Grauman [7] apply the same method [34] to ST-graphs; their max-subgraph approach seeks the subvolume that maximizes the action classifier's output.

We propose a novel method to solve the MWCSP by defining an optimization problem based on MWCs in small graph units. Our approach is inspired by the works of Shervashidze et al. [31] and Levi [32]. First, Shervashidze et al. [31] present a graph kernel based on counting and sampling subgraphs of limited size in the entire graph, which they call graphlets. Their sampling scheme allows them to compute graph kernels on graphs of sizes beyond the scope of state-of-the-art methods. Similarly, since providing information about the MWCS of the entire graph is very challenging, our approach computes information about the MWCS for each graph unit by dividing the graph into smaller units. Second, Levi [32] shows that each MWC in a product graph is associated with a maximum common subgraph. The maximum common subgraph problem then becomes that of determining the maximum clique of an arbitrary undirected graph: all common subgraph isomorphisms are enumerated by enumerating the cliques of the product graph, i.e., the problem is solved by splitting the subgraph into several cliques. In this paper, we exploit the MWC information found in the smaller subgraphs in order to solve the MWCSP of the whole graph. Instead of enumerating the raw MWCs, the nodes of the MWCs are used together with their weights to determine the optimized solution of the MWCSP. We compare experimental results for these two approaches: enumerating MWCs versus solving a linear system based on the MWCs.

Next, we describe the solution of the MWCSP in the ST-graph for a test video sequence. Our approach consists of three steps: (1) finding the MWC of each subgraph (graph unit), (2) identifying candidate MWCSs by solving an optimization problem based on the MWC information, and (3) selecting the resulting MWCS as the top-scoring detection.
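To illustrate step (1), the sketch below finds the maximum-weight clique of one small graph unit by exhaustive clique enumeration with networkx. This substitutes brute force for the branch-and-bound algorithm of Östergård [35] that the paper actually uses, and is only practical because each graph unit is small.

```python
import networkx as nx

def max_weight_clique_bruteforce(unit):
    """Exact MWC of a small graph unit by enumerating all cliques.

    `unit` is a networkx Graph whose nodes carry a 'weight' attribute
    (the w_v of Eq. 2). enumerate_all_cliques is exponential in general,
    which is acceptable only because a unit spans a few graph elements;
    the paper instead uses the branch-and-bound method of [35].
    """
    best_nodes, best_w = [], float("-inf")
    for clique in nx.enumerate_all_cliques(unit):
        w = sum(unit.nodes[u]["weight"] for u in clique)
        if w > best_w:
            best_nodes, best_w = clique, w
    return best_nodes, best_w
```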
Let the graph G = {g_i} consist of multiple graph elements $g_i$, where each $g_i$ is the set of nodes within the same temporal slice. We define a graph unit as a set of n consecutive graph elements and divide the entire graph G into m overlapping graph units, determining the MWC of every unit. Since the MWC problem is NP-hard [35], many works solve it using heuristic approaches; we solve it using the method of Östergård [35], a branch-and-bound algorithm that exploits a node order based on a coloring of the nodes together with a pruning strategy. Once the MWCs C = {c_i} have been obtained from all graph units, we use an optimization approach to search for the MWCS, i.e., a localized ST region containing the action. Namely, we minimize the following energy function:

$$E(G) = \sum_{u \in V} (x_u - b_u)^2 + \lambda \sum_{(u,v) \in E} (x_u - x_v)^2, \qquad (4)$$

$$b_u = \begin{cases} w_u + \sigma\, w(c_i) & \text{if } u \in c_i, \\ w_u & \text{otherwise,} \end{cases} \qquad (5)$$

where $x_u$ indicates the possibility of node u belonging to the MWCS, and $b_u$ is determined by the type of node u: if node u is part of an MWC $c_i$, $b_u$ is the sum of the node weight $w_u$ and the MWC weight $w(c_i)$; otherwise, $b_u$ is the node weight $w_u$ alone. The first term of Eq. 4 is the data term, which encourages the MWCS weight $x_u$ to be similar to $b_u$; hence, a node with greater weight belonging to an MWC of greater weight receives a greater MWCS weight. The second term of Eq. 4 is the smoothness term, which generates the connected maximum-weight subgraph by minimizing the difference in MWCS weight between node u and each of its connected neighbors v. The parameter $\lambda$ is set to 1, and the parameter $\sigma$ is experimentally set to $1/|c_i|$.

Optimization. The energy function in Eq. 4 is an optimization problem that can be represented as a linear system Ax = b. In other words, by setting the first-order derivative of E(G) in Eq. 4 to zero and arranging the $x_u$ over all nodes u into a vector x, Eq. 4 can be written in matrix form Ax = b, where A is a $|V| \times |V|$ matrix and b is a vector whose elements correspond to the $b_u$ of Eq. 5. We obtain the solution x by solving the linear system. In the solution, each $x_u$ is the probability of node u belonging to the MWCS, i.e., a larger value indicates a higher likelihood of action occurrence. We then determine the MWCS candidates MS = {ms_i} from the following condition:

$$ms_i = \{\, u_j \mid x_{u_j} \geq \max(\mathbf{x}) - \max(\mathbf{x})/2,\ (u_j, u_{j+1}) \in E \,\}. \qquad (6)$$
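Setting $\partial E / \partial x_u = 0$ gives $(1 + \lambda\,\deg(u))\,x_u - \lambda \sum_{v \sim u} x_v = b_u$ for each node, i.e., $A = I + \lambda L$ with L the combinatorial graph Laplacian. The sketch below solves this system with SciPy and applies the thresholding of Eq. 6; the grouping of surviving nodes into connected candidates and all function names are our assumptions.

```python
import numpy as np
import networkx as nx
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def mwcs_probabilities(G, b, lam=1.0):
    """Solve Eq. (4) in closed form: (I + lam * L) x = b.

    G: networkx Graph over the ST nodes; b: (|V|,) vector from Eq. (5),
    ordered consistently with sorted(G.nodes()).
    """
    order = sorted(G.nodes())
    L = sp.csr_matrix(nx.laplacian_matrix(G, nodelist=order))
    A = sp.identity(L.shape[0], format="csr") + lam * L
    return spla.spsolve(A, b), order

def mwcs_candidates(G, x, order):
    """Eq. (6): keep nodes whose solution exceeds max(x) - max(x)/2, then
    group them into connected subgraphs; each component is one candidate."""
    thr = x.max() - x.max() / 2.0
    keep = [order[i] for i in np.flatnonzero(x >= thr)]
    return [set(c) for c in nx.connected_components(G.subgraph(keep))]
```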

Fig. 5: A conceptual illustration of our MWCS search method. (a) Example of an MWC search result: yellow nodes and colored lines comprise MWCs. (b) Example of the result after applying Eq. 6: two MWCS candidates are found by an optimization based on MWC information.

Algorithm 1: Maximum-weight connected subgraph (MWCS) search for multiple detections.
Data: a graph G = (V, E) = {g_i, i = 1, ..., m}
Result: a set of MWCSs D = {G*}, where G* = arg max_{G'} Σ_{v∈V'} w_v
  set C ← ∅
  for i = 1 to m do
    search the MWC c_i of graph unit i using the algorithm of [35]
    append c_i to C
  end
  while MS ≠ ∅ do
    compute A and b according to Eq. 4 and Eq. 5
    solve Ax = b
    identify the MWCS candidates MS according to Eq. 6
    choose the candidate ms with the largest weight in MS
    set the weights of all nodes u ∈ ms to −w_u
    append ms to D
  end
  return D

Fig. 5 shows a conceptual illustration of our MWCS search method (some edges are omitted for ease of viewing). Suppose we use a graph unit of size 2. We first search for the MWC of each graph unit, as shown in Fig. 5(a). The yellow nodes are those included in MWCs, and each clique outlined by a colored line (red, green, blue, orange) is an MWC. In other words, the red MWC is found in the first graph unit, which consists of graph elements g_1 and g_2, and the blue MWC is found in the second graph unit, which consists of g_2 and g_3. In this way, we search the MWCs of all graph units within the whole graph; four MWCs are found in this example. Based on the MWCs C, we construct the energy function of Eq. 4 and solve the optimization problem as described above. After applying Eq. 6 to the optimization result, we find the MWCS candidates MS (the red and blue subgraphs), as shown in Fig. 5(b), and apply top-scoring detection for the final result. This condition provides one or more candidate MWCSs according to the number of actions included in a test video sequence. We first choose the candidate with the largest weight. To return multiple top-scoring detections, we apply a method similar to those in [13], [7], iteratively performing the MWCS algorithm after setting the weights of the nodes of the MWCS found in the previous iteration to −w_u. Algorithm 1 provides the pseudocode of our MWCS algorithm.

IV. EXPERIMENTS

In this section, we evaluate the proposed method using two datasets, uncropped Hollywood action videos [36] and MSR actions [13]. Both datasets contain actions with dynamic occlusions in the complex and moving backgrounds of real-world environments. We extract the STIPs and their descriptors using the method of [36]. To compute the node weight for each action, we train a linear SVM.

Fig. 6: Examples of the shapes of detection results for the different methods (panels: T-Sliding, ST-Sliding, Subvolume, Subgraph, our result).

We employ the mean overlap accuracy as the evaluation metric, which computes the intersection of the predicted detection region with the ground truth divided by their union. We use detection time to evaluate computational cost; the detection times of all baselines are measured on a system equipped with an Intel Core CPU at 3.40 GHz. We compare our method with three state-of-the-art baselines. (1) Sliding: the sliding window is a standard and popular action detection method used in several works [1], [4], [5]. A temporal sliding window is used for temporal detection, and a spatio-temporal sliding window is used for spatio-temporal detection. The ST sliding window is a variant of the temporal sliding window that searches ST subvolumes of cubic shape. (2) Subvolume: the cube-subvolume method [13] searches for the cube-shaped subvolume that maximizes the action-occurring probability. Its spatial detection region is therefore more flexible than that of the ST sliding window, but the spatial detection is restricted to a single location. (3) Subgraph: the subgraph method [7] allows spatial shifts over time, as does our method; however, it cannot detect subvolumes built from 8-connected neighboring nodes. Note that our method produces more flexible detection regions than Subgraph [7] by allowing the 8-connected edge structure; it can detect most irregular and non-cubic shapes. Fig. 6 illustrates the shapes of the detection results of the different methods.
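For reference, the mean overlap accuracy above is the standard intersection-over-union criterion; a minimal sketch for the temporal case follows, with the interval convention (start, end) as an assumption.

```python
def temporal_overlap_accuracy(pred, gt):
    """Intersection-over-union of two temporal intervals (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# e.g. temporal_overlap_accuracy((10, 50), (20, 60)) -> 30/50 = 0.6
```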

TABLE I: The mean temporal overlap accuracy results on the Hollywood dataset under the BoF feature scheme (per-class and average accuracies of Sliding, Subvolume [13], Subgraph [7], and Ours over the eight actions).

TABLE II: Comparison of detection accuracies by feature type (BoF vs. SR) on the Hollywood dataset, per class and on average.

A. Experiments on the Hollywood Actions Dataset

The Hollywood Actions Dataset [36] consists of videos collected from 32 different Hollywood movies, with a total of 663 video sequences from 8 action classes: answering the phone, getting out of the car, hand shaking, hugging, kissing, sitting down, sitting up, and standing up. The dataset provides uncropped and cropped versions of the sequences, i.e., videos containing extraneous frames and videos containing only the action of interest; we use the cropped version for training and the uncropped version for detection evaluation. We perform temporal detection only, because most actions occur across the entire frame. Table I shows the temporal detection results for the same feature settings as in [7]. Our method achieves the best accuracy for 5 of the 8 action classes, as well as the best average accuracy. This validates the superiority of our MWCS search method in detecting action regions. In addition, our method can estimate the probability of action occurrence for the detected region, in contrast to the MWCS search method of [7]; this property enables more robust detection with respect to the performance of the basic detector. A comparative evaluation of BoF- and SR-based video representations is shown in Table II. Our SR-based method achieves better results than the BoF-based method overall. However, the SR-based method performs worse than the BoF-based method on the GetOutCar, HandShake, and SitUp classes. Although the SR-based representation overcomes the drawbacks of the BoF representation, it has drawbacks of its own: the sparse solution becomes denser when the image suffers a large amount of random corruption or contiguous occlusion [39], [40]. Since some action videos in the GetOutCar, HandShake, and SitUp classes exhibit these problems, the SR-based method performs worse on them.

Given the MWC information, combining all neighboring MWCs is the most straightforward way to solve the MWCSP: continuously connected MWCs are gathered into larger graphs, these gathered graphs become the candidate MWCSs, and the top-scoring candidate becomes the final MWCS. Fig. 7 compares this combination-based method and the proposed method. Our method outperforms the combination-based method for all action classes. The MWCS obtained by combination is likely to cover only one piece of the ground-truth MWCS, because no MWC can be extracted from a noisy middle part. Our method instead defines a smoothness term in Eq. 4 that encourages connected subgraph generation; consequently, it overcomes this shortcoming of the combination method and leads to better detection results.

Fig. 7: The mean temporal overlap accuracies for each action in the Hollywood action dataset using the combination-based method and our proposed method.

Table III shows the detection time on the Hollywood dataset. Although the Subgraph [7] method has the shortest detection time, our method is significantly faster than the Sliding and Subvolume [13] methods. Our method takes more time than Subgraph [7] because our graph is more complex than theirs. However, our maximum subgraph search is more efficient than that of Subgraph [7] for large graphs, as shown in Table VI.

TABLE III: Detection time (sec) on the Hollywood dataset for Sliding, Subvolume [13], Subgraph [7], and Ours.

B. Experiments on the MSR Actions Dataset

The MSR Actions Dataset [13] contains multiple instances of different actions, such as boxing, handclapping, and handwaving. All of the sequences are captured with clutter and moving backgrounds. Since the actors change their position over time, this dataset is well suited to evaluating ST detection. We perform both temporal and ST detection on this dataset. Following the detection protocol of [13], [7], we train the detectors using the KTH dataset [37].

TABLE IV: The mean temporal overlap accuracy results for the MSR dataset (Sliding, Subvolume [13], Subgraph [7], and Ours over boxing, clapping, waving, and their average).

TABLE V: The mean space-time overlap accuracy results for the MSR dataset (same layout as Table IV).

Fig. 8: The mean temporal overlap accuracies for each action in the MSR action dataset when using the combination-based method and our method.

Fig. 9: The mean space-time overlap accuracies for each action in the MSR action dataset when using the combination-based method and our method.

Table IV shows the temporal detection results. Our method achieves the best accuracy for all action classes; in particular, it significantly improves the accuracies for the boxing and clapping actions. Temporal detection results of the combination-based method and our method for solving the MWCSP are compared in Fig. 8: our method achieves higher detection accuracy for all action classes, showing the strength of our optimization strategy, which considers both the maximum-weight and connectivity properties. Table V shows the ST detection results. Although our results are slightly lower than the best performance for the boxing and waving actions, our method achieves the best average accuracy owing to its increased accuracy for the clapping action. Compared with the graph-based method of [7], our method achieves higher accuracy for all action classes. Note that, although the method of [7] achieves better results than the sliding window and [13] for temporal detection, it achieves lower accuracy for ST detection. In contrast, our method achieves the best average accuracy for both temporal and ST detection. This validates the superiority of our graph structure and MWCS search method. Fig. 9 compares the combination-based method and our method for the MWCSP: our method outperforms the combination-based method for all action classes, especially for the waving action, demonstrating that our MWCS search produces more detailed localization.

Fig. 10 shows the mean space-time overlap accuracy according to graph type and graph unit size. We use three types of graphs: a basic graph, a general graph with 4-connected edges; a complete graph, in which every pair of nodes within a graph unit is connected; and our graph, which contains 8-connected edges in the spatial and temporal dimensions in addition to 4-connected edges. The graph complexity therefore decreases in the order complete graph, our graph, basic graph. Our graph outperforms both the basic graph and the complete graph for every graph unit size. We also observe that the complete graph performs poorly, indicating that unnecessary edges cause errors and that our linking strategy is well suited to the detection problem.

We also report the mean space-time overlap accuracy for various grid sizes on the MSR action dataset. Fig. 11 shows the results for a graph unit of size 2 and a temporal grid of size F = 10. A spatial grid of 1/3 of the frame dimensions gives the best average accuracy. One might expect that a smaller grid always gives a more detailed detection shape and higher detection accuracy; however, a grid that is too small (1/6 of the frame dimensions) reduces accuracy, as shown in Fig. 11. Note that our method finds the maximum subgraph based on node weights computed from the weights of the local features falling within each node. As the node size decreases, the number of local features within each node shrinks, which makes it difficult to aggregate statistics over the neighborhoods of local features. Fig. 12 shows the results for various temporal grid sizes, with a graph unit of size 2 and a spatial grid of 1/3 of the frame dimensions. A temporal grid of size 20 gives the best average accuracy; both too small and too large temporal grids reduce accuracy. Note that the proper grid size can differ with the data, in terms of the frame dimensions of the input video, the average size of the humans, and the average length of the actions.

Fig. 10: The mean space-time overlap accuracies for the MSR action dataset according to graph type and graph unit size.

Fig. 11: The mean space-time overlap accuracy results according to various spatial grid sizes for the MSR action dataset.

Fig. 12: The mean space-time overlap accuracy results according to various temporal grid sizes for the MSR action dataset.

Table VI shows the detection time on the MSR dataset. Our method achieves the shortest detection time for spatio-temporal detection, compared to ST-Sliding, Subvolume, and Subgraph. Note that the detection times on the MSR dataset are larger than those on the Hollywood dataset because the average video length in the MSR dataset is longer. In particular, the Subgraph method [7] takes a long time on the MSR dataset because it requires optimization over a large graph. Our method takes much less time than Subgraph [7] even though both methods solve the MWCSP, because our method first searches for MWCs in graph units smaller than the entire graph and then performs the optimization of a simple linear system based on the MWC information. Therefore, our optimization takes less detection time than that of Subgraph [7] even as the video length grows.

TABLE VI: Detection time (sec) on the MSR dataset for Sliding, Subvolume [13], Subgraph [7], and Ours.

We also compare our method with the method of Lan et al. [41] on the MSR dataset. For the appearance feature x_i, we use the BoF feature scheme based on HoG/HoF descriptors, as in our method. Table VII compares the spatio-temporal detection results. Our method outperforms the method of Lan et al. [41] for all actions without requiring a human detection procedure. Lan et al. [41] produce poor detection results compared to the other baselines as well as to our method, because their method is not appropriate for the MSR dataset, which contains multiple actions in the same scene and has no rule governing location variation. In other words, their method has difficulty benefiting from the pairwise potential and global action potential in their potential function.

TABLE VII: Comparison of mean space-time overlap accuracy results between Lan et al. [41] and our method for the boxing, clapping, and waving actions and their average.

Fig. 13 shows examples of the detection results on the MSR action dataset. We compare the detection results between the graph of [7] (green rectangles) and our graph (yellow rectangles). In the third row, the previous graph fails to detect any region in the 2nd, 3rd, and 4th images, while our graph detects the action regions in all the images. These examples show that our graph detects not only more temporal action frames but also more detailed spatial action regions within each frame.

V. CONCLUSION

We presented an optimization approach that identifies the maximum subgraph of an ST-graph for action detection. Our energy function is defined based on maximum cliques and includes the maximum and connectivity properties needed to find the maximum subgraph. The energy function is formulated as a linear system, and its solution gives the probability of each node being an action node. Our graph structure and optimization method efficiently solve the detection problem by applying the clique-based approach and a simple linear system solver. We showed through experiments with various graph structures that our graph and optimization method produce more accurate localization. We also compared our method with state-of-the-art methods on real-world action datasets.

Fig. 13: Examples of detection results using the graph of [7] and our graph. The rows show detection results for the waving, boxing, and clapping actions, respectively, on the MSR action dataset. In each row, the first 4 images are detection results of the graph of [7], and the remaining 4 images are detection results of our graph. Red rectangles denote the ground truth, green rectangles denote the results of the graph of [7], and yellow rectangles denote the results of our graph.

ACKNOWLEDGMENT

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MEST) (No. NRF-2013R1A2A1A).

REFERENCES

[1] Y. Ke, R. Sukthankar, and M. Hebert, "Efficient visual event detection using volumetric features," in Proc. ICCV, 2005.
[2] L. Zelnik-Manor and M. Irani, "Statistical analysis of dynamic actions," IEEE Trans. PAMI, vol. 28, no. 9, 2006.
[3] Y. Ke, R. Sukthankar, and M. Hebert, "Event detection in crowded videos," in Proc. ICCV, 2007.
[4] O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce, "Automatic annotation of human actions in video," in Proc. ICCV, 2009.
[5] S. Satkin and M. Hebert, "Modeling the temporal extent of actions," in Proc. ECCV, 2010.
[6] A. Klaser, M. Marszalek, C. Schmid, and A. Zisserman, "Human focused action localization in video," in Proc. ECCV, 2010.
[7] C.-Y. Chen and K. Grauman, "Efficient activity detection with max-subgraph search," in Proc. CVPR, 2012.
[8] K. Mikolajczyk and H. Uemura, "Action recognition with motion-appearance vocabulary forest," in Proc. CVPR, 2008.
[9] M. S. Ryoo and J. K. Aggarwal, "Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities," in Proc. ICCV, 2009.
[10] A. Oikonomopoulos, I. Patras, and M. Pantic, "Spatiotemporal localization and categorization of human actions in unsegmented image sequences," IEEE Trans. Image Processing, vol. 20, no. 4, 2011.
[11] A. Yao, J. Gall, and L. Van Gool, "A Hough transform-based voting framework for action recognition," in Proc. CVPR, 2010.
[12] G. Willems, J. H. Becker, T. Tuytelaars, and L. Van Gool, "Exemplar-based action recognition in video," in Proc. BMVC, 2009.
[13] J. Yuan, Z. Liu, and Y. Wu, "Discriminative subvolume search for efficient action detection," in Proc. CVPR, 2009.
[14] L. Cao, Z. Liu, and T. S. Huang, "Cross-dataset action detection," in Proc. CVPR, 2010.
[15] F. Zhou, F. De la Torre, and J. K. Hodgins, "Aligned cluster analysis for temporal segmentation of human motion," in Proc. FGR, 2008.
[16] M. Hoai, Z.-Z. Lan, and F. De la Torre, "Joint segmentation and classification of human actions in video," in Proc. CVPR, 2011.
[17] A.-P. Ta, C. Wolf, G. Lavoue, and A. Baskurt, "Recognizing and localizing individual activities through graph matching," in Proc. AVSS, 2010.
[18] U. Gaur, Y. Zhu, B. Song, and A. Roy-Chowdhury, "A 'string of feature graphs' model for recognition of complex activities in natural videos," in Proc. ICCV, 2011.
[19] W. Brendel and S. Todorovic, "Learning spatiotemporal graphs of human activities," in Proc. ICCV, 2011.
[20] O. Celiktutan, C. Wolf, B. Sankur, and E. Lombardi, "Real-time exact graph matching with application in human action recognition," in Proc. HBU, 2012.
[21] J. Liu, S. Ali, and M. Shah, "Recognizing human actions using multiple features," in Proc. CVPR, 2008.
[22] E. Z. Borzeshi, M. Piccardi, and R. Y. D. Xu, "A discriminative prototype selection approach for graph embedding in human action recognition," in Proc. ICCV Workshops, 2011.
[23] T. Guha and R. K. Ward, "Learning sparse representations for human action recognition," IEEE Trans. PAMI, vol. 34, no. 8, 2012.
[24] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. PAMI, vol. 31, no. 2, 2009.
[25] K. Guo, P. Ishwar, and J. Konrad, "Action recognition using sparse representation on covariance manifolds of optical flow," in Proc. AVSS, 2010.
[26] C. Liu, Y. Yang, and Y. Chen, "Human action recognition using sparse representation," in Proc. ICIS.
[27] S. R. Fanello, I. Gori, G. Metta, and F. Odone, "Keep it simple and sparse: Real-time action recognition," JMLR, vol. 14, 2013.
[28] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online dictionary learning for sparse coding," in Proc. ICML, 2009.
[29] M. Rodriguez, J. Ahmed, and M. Shah, "Action MACH: A spatio-temporal maximum average correlation height filter for action recognition," in Proc. CVPR, 2008.
[30] K. Derpanis, M. Sizintsev, K. Cannons, and R. Wildes, "Efficient action spotting based on a spacetime oriented structure representation," in Proc. CVPR, 2010.
[31] N. Shervashidze, S. V. N. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt, "Efficient graphlet kernels for large graph comparison," in Proc. AISTATS, 2009.
[32] G. Levi, "A note on the derivation of maximal common subgraphs of two directed or undirected graphs," Calcolo, vol. 9, no. 4, 1973.
[33] T. Ideker, O. Ozier, B. Schwikowski, and A. F. Siegel, "Discovering regulatory and signalling circuits in molecular interaction networks," Bioinformatics, vol. 18, 2002.
[34] M. T. Dittrich, G. W. Klau, A. Rosenwald, T. Dandekar, and T. Müller, "Identifying functional modules in protein-protein interaction networks: An integrated exact approach," Bioinformatics, vol. 24, no. 13, 2008.
[35] P. R. J. Östergård, "A fast algorithm for the maximum clique problem," Discrete Applied Mathematics, vol. 120, no. 1-3, 2002.
[36] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," in Proc. CVPR, 2008.
[37] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: A local SVM approach," in Proc. ICPR, 2004.
[38] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," Annals of Statistics, vol. 32, 2004.
[39] E. J. Candès, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, 2006.
[40] A. Martinez, "Recognizing imprecisely localized, partially occluded, and expression variant faces from a single sample per class," IEEE Trans. PAMI, vol. 24, no. 6, 2002.
[41] T. Lan, Y. Wang, and G. Mori, "Discriminative figure-centric models for joint action localization and recognition," in Proc. ICCV, 2011.
[42] N. Shapovalova, A. Vahdat, K. Cannons, T. Lan, and G. Mori, "Similarity constrained latent support vector machine: An application to weakly supervised action classification," in Proc. ECCV, 2012.
[43] Y. Kong, Y. Jia, and Y. Fu, "Learning human interaction by interactive phrases," in Proc. ECCV, 2012.
[44] M. Hoai and F. De la Torre, "Max-margin early event detectors," in Proc. CVPR, 2012.
[45] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, "Large margin methods for structured and interdependent output variables," JMLR, vol. 6, 2005.

Sunyoung Cho received the B.S. degree in computer science from Sookmyung Women's University, Seoul, Korea, in 2007, and the M.S. and Ph.D. degrees in computer science from Yonsei University, Seoul, Korea, in 2009 and 2014, respectively. She is currently a Post-Doctoral Researcher with the Human-Computer Interaction Institute at Carnegie Mellon University, Pittsburgh, PA, USA. Her research interests include computer vision, image and video processing, machine learning, and human-computer interaction.

Hyeran Byun received the B.S. and M.S. degrees in mathematics from Yonsei University, Seoul, Korea, and the Ph.D. degree in computer science from Purdue University, West Lafayette, IN, USA. She was an Assistant Professor at Hallym University, Chuncheon, Korea, beginning in 1994. She is currently a Professor of Computer Science at Yonsei University. She served as president of the Artificial Intelligence Society in KIISE (Korean Institute of Information Scientists and Engineers) for the term beginning in March 2013. Her research interests include computer vision, image and video processing, artificial intelligence, event recognition, gesture recognition, and pattern recognition.
