A Space-Time Graph Optimization Approach Based on Maximum Cliques for Action Detection

Sunyoung Cho, Member, IEEE, and Hyeran Byun, Member, IEEE

Abstract—We present an efficient action detection method that takes a space-time graph optimization approach for real-world videos. Given a space-time graph representing the entire action video, our method identifies a maximum-weight connected subgraph indicating an action region by applying an optimization approach based on clique information. We define an energy function based on maximum-weight cliques for subregions of the graph and formulate it as an optimization problem that can be represented as a linear system. Our energy function includes the maximum and connectivity properties needed for finding the maximum-weight connected subgraph, and its optimization solution indicates, for each node, the probability of belonging to the maximum subgraph. Our graph optimization method efficiently solves the detection problem by applying the clique-based approach and a simple linear system solver. Our experimental results on real-world datasets, such as the Hollywood and MSR action datasets, demonstrate that our detection method yields more accurate localization than conventional methods. We also show that our method outperforms state-of-the-art action detection methods.

Index Terms—Action detection, space-time graph, sparse representation, maximum weight connected subgraph, maximum weight clique, optimization.

S. Cho was with the Department of Computer Science, Yonsei University, Seoul, Korea. She is now with the Human-Computer Interaction Institute, Carnegie Mellon University, Pittsburgh, PA, USA (e-mail: sycho22@gmail.com). H. Byun is with the Department of Computer Science, Yonsei University, Seoul, Korea (e-mail: hrbyun@yonsei.ac.kr).

I. INTRODUCTION

UNDERSTANDING human actions in videos has been considered one of the most important areas in the computer vision community due to a variety of applications, such as video indexing and searching, video surveillance, and human-computer interaction. The action learning problem has been studied extensively over the past years, and recent works have explored more realistic actions rather than the simplified and constrained ones used in earlier studies. Action learning in the real world is a challenging problem because uncontrolled real-world videos include a large amount of variation in terms of the number of people, the scale of each person, background clutter, occlusion, and camera parameter changes. Furthermore, two problems need to be solved for action understanding: localizing the region in which an action occurs in a video sequence (action detection) and classifying the category of the action (action recognition). Action detection aims to find the spatial and temporal regions of the actions occurring in a video sequence. The most common approach is to employ a sliding window [1]-[6], which applies a classifier function to subregions and takes the maximum of the classification score as the location of the action. Since the number of subregions in a video sequence of size $h \times w \times f$ is of the order $O(h^2 w^2 f^2)$, evaluating the classifier function for all subregions is computationally expensive. Another challenge arises because uncontrolled real-world videos contain large variations in actions and complex backgrounds; it is difficult to construct a classifier that recognizes realistic actions, which makes the action detection problem even more challenging.
The main approach to detection problems is to find the subvolume that maximizes the output of a scoring function. To define the scoring function, a classifier measuring the score of an input must be learned. Several methods use a max-margin framework [7], [43], [44] to train the parameters of the classifier. Chen and Grauman [7] use a linear SVM to learn the parameters of the scoring function. Since they use a graph structure to represent a video, the learned parameters are used to compute the weight of each node in the graph. Kong et al. [43] measure the interactive phrases of human interaction using a latent SVM. They define attribute and interaction models for scoring the interactive phrases and find the maximum interactive phrase by solving an optimization problem based on coordinate descent. Hoai and De la Torre [44] consider a detection scoring function that extends the Structured Output SVM [45] to enable the recognition of sequential data. By exploiting partial events as training examples for learning the scoring function, their method supports the early detection of events in sequential data. Our approach exploits the max-margin framework for defining the scoring function. We are especially inspired by the work of Chen and Grauman [7], which solves the action detection problem as a maximum-weight connected subgraph problem (MWCSP) on a weighted space-time (ST) graph representing an input video. They transform the MWCSP into a prize-collecting Steiner tree problem and solve it using a branch-and-cut optimization strategy. Since their detection method is formulated as an integer linear programming problem with binary variables, the resulting detections are binary: their result can only indicate whether each node is an action node or not. We also solve the MWCSP, by applying a clique-based optimization strategy to a weighted ST graph representing an action video. Our method first determines the maximum weight cliques (MWC) for each graph unit. Based on the node information of the MWCs, we define an energy function that represents the maximum property of each node

and connectivity among the nodes in the resulting subgraph. The connectivity property makes the energy function robust to noise occurring in the middle of an action. In particular, because our method determines the probability of action occurrence at each node, it enables more robust and accurate detection than that of [7]. Our energy function is formulated as an optimization problem that can be represented as a linear system. The final maximum-weight connected subgraph (MWCS) is obtained by applying a likelihood threshold to the solution of the linear system.

Fig. 1 shows an overview of our approach. Given an action video, the first step extracts local ST features (Fig. 1(a)). The action video is represented as an ST graph in which each node has a weight computed from the extracted local features (Fig. 1(b)). Since our method is trained to assign larger weights to local features within the action region than to local features outside it, nodes within the action region receive larger weights (darker colored nodes). The process of action detection is therefore equivalent to finding an MWCS. Each node's probability of belonging to the MWCS is found by solving a linear system (Fig. 1(c)), and the detection result is determined by thresholding this probability (Fig. 1(d)). The final detection result is the set of regions corresponding to the set of nodes in the MWCS (Fig. 1(e)).

Fig. 1: Overview of our approach. (a) Local ST feature extraction. (b) Video representation using a weighted ST graph (node numbers are shown on the nodes, and some edges are omitted for ease of viewing). (c) Optimization problem for computing the probability of becoming an MWCS node. (d) Action detection by thresholding. (e) Action detection result.

Since most previous action detection methods [1], [4], [5], [13] produce cubic-shaped detection results, they cannot search for a subvolume whose spatial location shifts over time. A more recent method [7] produces non-cubic detection results by representing a video as an ST graph and searching for the maximum subgraph within the entire graph. Our method produces even more flexible, non-cubic results by performing the subgraph search on a flexible ST graph structure. In addition, we propose a maximum subgraph search method that produces more detailed detection results with less computation time by introducing an optimization based on MWC information.

The main contributions of this paper are summarized in three points. First, we propose a novel approach that solves action detection with a graph optimization strategy. We define an energy function based on MWC information in order to find an MWCS indicating an action region in an ST graph representing the entire action video. The energy function is formulated as an optimization problem that can be represented as a linear system; by solving the linear system, our method efficiently detects the spatial and temporal region of the occurring action. Second, our method produces more robust and accurate detection results. Our energy function includes the maximum property of each node and connectivity among the nodes in the resulting subgraph; these properties make our detection robust to the noise that may occur in real-world action videos. Furthermore, our optimization result indicates, for each node, the likelihood of belonging to the MWCS rather than a binary action/non-action label, which improves performance by allowing more robust and flexible detection. Third, our graph optimization method takes relatively little time even for long videos. This is because the energy function is defined with MWC information, which is computed from graph units much smaller than the entire graph representing the video, and because the energy function is an optimization problem that can be represented as a simple linear system. By exploiting the MWC search in small graph units and a linear system solver, our method efficiently solves the detection problem in terms of both computational cost and problem complexity.

The rest of the paper is organized as follows. Section II briefly reviews previous works on action detection and graph-based approaches for representing an action video. Section III describes the details of our video representation and the proposed optimization approach for action detection. Section IV presents our experimental results, and Section V concludes the paper.

II. RELATED WORK

In this section, we review state-of-the-art methods for action detection. We also review previous works that apply a graph-based approach for video representation.

A. Previous Works on Action Detection

The most common approach to action detection is the sliding window method, which applies a classifier function to subregions within the entire sequence and takes the maximum classification score as the location of the action. Although this method has been successfully applied in many works [1], [4], [5], evaluating the classifier function for all subregions is computationally expensive. To avoid exhaustive searches, several recent works have employed a voting-based approach, which performs localization based on votes cast by local ST features. Mikolajczyk and Uemura [8] propose a method based on a vocabulary forest of local features and a voting scheme. They use a large number of low-dimensional local features to build a vocabulary forest that captures the joint appearance-motion of actions. Voting is performed for action categories and occurrence locations over each vocabulary tree. However, their work is restricted to the spatial localization of the subjects in each frame. Ryoo and Aggarwal [9] apply the voting technique to the intersection of relationship histograms between different action videos. They develop a new matching function, the ST relationship match, to measure the similarity between two sets of features; each pair of features in the intersection votes for the expected starting and ending locations of the action. Oikonomopoulos et al. [10] accumulate localization evidence in a probabilistic ST voting scheme and use class-specific codebooks of codeword ensembles to encode the ST positions. Yao et al. [11] present a method to classify and localize human actions using a Hough transform voting framework. They perform the voting with a collection of random trees and learn a mapping between densely sampled ST feature patches and an action center. The resulting set of leaf nodes in the trees forms a discriminative codebook with features shared across actions and votes for action centers in a probabilistic manner. To summarize, although the voting approach reduces computational costs compared to the sliding window method, it is often sensitive to noisy backgrounds and is ambiguous for actions with periodicity, which leads to incorrect votes. Hence, this approach cannot guarantee finding the maximum-scoring region.

The branch-and-bound approach has also been explored to avoid the enormous computational cost of an exhaustive search. This approach identifies the most probable action occurrences using an optimization scheme. Willems et al. [12] propose an extended exemplar-based approach based on local features in the ST domain. The most discriminative visual words are selected and used to formulate bounding box hypotheses; actions are finally detected by merging the hypotheses with high confidence values. Yuan et al. [13] also solve the action detection problem using a branch-and-bound strategy. They formulate detection as a search for the 3D subvolume with the maximum amount of mutual information. To this end, a video sequence is represented by a set of features, and each feature casts a positive- or negative-valued vote for the action class. Cao et al. [14] present a framework that combines a Gaussian mixture model (GMM)-based representation of ST features with detection through maximum a posteriori (MAP) estimation. They handle data mismatches by performing model adaptation and action detection simultaneously.

Some methods perform dynamic programming for action segmentation. Zhou et al. [15] formulate motion segmentation as aligned cluster analysis (ACA), an extension of the k-means clustering algorithm. ACA combines a dynamic time alignment kernel with dynamic programming for temporal segmentation of actions, and an efficient coordinate descent algorithm solves it. Hoai et al. [16] also use dynamic programming for temporal segmentation, maximizing the classification score of the winning class while suppressing those of the non-maximum classes.

Recent methods have used a structured graph to represent the human region or the entire video. Chen and Grauman [7] use a space-time graph for video representation, where each node indicates a subvolume and its weight represents the likelihood of an action's occurrence. Under this weighted graph representation, they solve the action detection problem by maximum subgraph search. Lan et al. [41] improve action recognition results by treating the human location as a latent variable. Since this method requires a human detection procedure for each frame, it incurs a large computation time and is difficult to apply to complex real-world data containing varied human appearances and complex backgrounds. Shapovalova et al. [42] relax the human detection assumption of Lan et al. [41] by introducing a clustering of objectness regions. In their method, a video is represented by a global feature of the whole video and local features of objectness regions. Under this video representation, they develop a Similarity Constrained Latent SVM (SCLSVM) model to perform weakly supervised action recognition and localization.

B. Graph-based Approach for Representing an Action Video

Existing methods in human action recognition mainly use feature descriptors extracted from human parts or interest points in order to capture the appearance, shape, and motion patterns of an actor. These features have been represented using various methods, including bag of features (BoF), dynamic time warping (DTW), hidden Markov models (HMM), and conditional random fields (CRF). The most popular method represents a video sequence as a BoF and performs classification using the BoF vector. Although BoF-based methods have shown good results for action recognition, their representations tend to ignore the spatio-temporal relationships among features, which can be an important property for action classification. A graph, in contrast, provides an efficient way to describe the spatio-temporal relationships between structural parts or low-level features. Several recent works have used a graph structure to represent action videos, where each node corresponds to a local

feature, and each edge corresponds to the relationship between its nodes. Most of these works perform action recognition as a graph matching problem, measuring the similarity between model and test graphs. Ta et al. [17] construct a graph with a significantly reduced number of edges by filtering the set of triangles between triplets of ST interest points. Because of the reduced complexity of the resulting graph, their approach provides efficient graph matching that computes the matching score by projecting the set of nodes of the first graph onto that of the second graph. Gaur et al. [18] represent an action in a video as a string of feature graphs (SFG) that models the spatial arrangement of the features; the recognition result is obtained by matching the SFGs of videos using DTW. Brendel et al. [19] learn the structure and pdfs associated with graphs as the permutation of adjacency matrices of training graphs in the weighted least-squares sense. Celiktutan et al. [20] present a hyper-graph structure that performs exact matching with low complexity by casting it as a point-set matching problem. Some works have applied graph embedding, converting a graph into a point in a vector space to make it suitable for general recognition approaches based on feature vectors. Liu et al. [21] apply Fiedler embedding to the graph and feed the resulting vector to a k-nearest-neighbor classifier for action recognition. Borzeshi et al. [22] also apply graph embedding, with a class-based prototype selection method that maximizes a function of the inter- and intra-class distances; the resulting embedding of the graph is fed to an HMM classifier for action recognition.

III. MODELING ACTION AS A WEIGHTED ST-SUBGRAPH

We first introduce a method for extracting ST features via sparse representation (SR) in Section III-A and then present the construction of an ST-graph representing an action video in Section III-B. Section III-C explains our action detection method, which finds an MWCS on the ST-graph using an optimization approach.

A. Extracting Localized ST Video Features via SR

Suppose I is the set of detected space-time interest points (STIPs) extracted from a set of training video sequences S = {S_i, i = 1, 2, ..., N}. To represent the detected STIPs I, we first compute histograms of oriented gradients (HoG) and histograms of optical flow (HoF) in the ST neighborhoods of the detected STIPs. These descriptors capture local appearance and motion information and have been used successfully in many works on action recognition employing the BoF scheme for video representation. However, the BoF representation incurs two drawbacks [23]: (1) it leads to a considerable amount of approximation error because each feature is assigned to only one codeword, and (2) the codebook size may need to be increased for data with large variation. Recently, the SR-based approach has been shown to overcome these drawbacks by reducing the approximation error and constructing compact dictionaries in many vision tasks [24], [21], including action recognition [25], [26], [23], [27]. SR has been shown to efficiently represent and compress high-dimensional signals. To obtain the SR, the first step constructs a dictionary with orthogonal or overcomplete bases that can represent the essential information in a signal. The next step determines a sparse solution giving the degree of contribution of each element of the dictionary.
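As a concrete illustration of this two-step pipeline, the sketch below learns an overcomplete dictionary from stacked HoG/HoF descriptors and computes sparse codes for them. It is a minimal sketch using scikit-learn's MiniBatchDictionaryLearning, an online algorithm in the spirit of [28], rather than the authors' exact implementation; the input file name and array shapes are assumptions.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# Assumed input: n HoG/HoF descriptors of dimension m, stacked row-wise
# (hypothetical file name; in the paper these come from the detected STIPs).
X = np.load("hoghof_descriptors.npy")        # shape (n, m)
n, m = X.shape
K = 4 * m                                     # overcomplete dictionary, m < K

# Online dictionary learning: alternates L1-regularized sparse coding and
# dictionary updates over mini-batches, matching the objective of Eq. (1).
dl = MiniBatchDictionaryLearning(n_components=K, alpha=1.0,   # alpha = lambda
                                 batch_size=256,
                                 transform_algorithm="lars")
dl.fit(X)

alphas = dl.transform(X)   # sparse codes alpha_j, shape (n, K), few nonzeros
D = dl.components_         # learned dictionary, shape (K, m); the paper's D
                           # is m x K with basis vectors as columns
```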
Next, we need to learn the overcomplete dictionary and the corresponding SR using the descriptors of the detected STIPs I. We use an online optimization algorithm for dictionary learning [28], which we briefly describe here. Let the set of HoG/HoF feature descriptors of the training set S be X = {x_j, j = 1, 2, ..., n}, where $x_j \in \mathbb{R}^m$, $n = \sum_{i=1}^{N} n_i$, and $n_i$ is the number of descriptors in training video sequence $S_i$. The online dictionary learning algorithm optimizes the cost function

$$\min_{D,\alpha} \frac{1}{n} \sum_{j=1}^{n} \left( \frac{1}{2} \lVert x_j - D\alpha_j \rVert_2^2 + \lambda \lVert \alpha_j \rVert_1 \right), \qquad (1)$$

where $D \in \mathbb{R}^{m \times K}$ ($m < K$) is the overcomplete dictionary, each column representing a basis vector, and $\alpha = \{\alpha_j, j = 1, 2, \dots, n\}$, $\alpha_j \in \mathbb{R}^K$, is the SR over X such that each $\alpha_j$ contains only a few nonzero elements. The algorithm iteratively solves Eq. 1 by performing two steps at every iteration: sparse coding and dictionary updating. In other words, an initial SR decomposition of X is computed from an initial dictionary, and this decomposition is then used to update the dictionary. This procedure is repeated until the iteration limit is reached. The sparse decomposition problem is solved with the LARS (least-angle regression) algorithm [38], and the dictionary update is performed using block-coordinate descent. Finally, each STIP in I is represented by its corresponding SR computed over the final dictionary.

B. Modeling an Action Video as a Weighted ST-Graph

We represent a video sequence Q using a weighted ST-graph G = (V, E), where V is a set of weighted nodes and E is a set of edges. The weights of the nodes V are determined from the feature descriptors extracted from the video sequence. The most popular shape for representing the action region is the ST cuboid [13], [29], [30], a cube-shaped subvolume maximizing the action-occurring probability. However, this shape is restricted to a fixed location over the temporal domain; that is, it cannot shift its spatial location over time. Recently, Chen and Grauman [7] used an ST-graph that allows spatial changes over the temporal domain and provides more accurate detection. We also use a space-time graph to represent the video sequence, with the node and edge structures described as follows.

1) Node structure: The node structure is determined by dividing the video sequence into a grid of $H \times W \times F$ ST cubes. The grid size controls the trade-off between computational efficiency and detection sensitivity: a smaller grid leads to higher computational cost but gives a more detailed detection shape, whereas a larger grid provides sparser detection at lower computational cost. In our implementation, we set H and W to 1/3 of the frame dimensions and F to 10 frames. Using this node structure, we can detect action regions of irregular, non-cubic shape.
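To make the node structure concrete, here is a small sketch, under assumed coordinate conventions, that maps each local feature to the grid cell (graph node) containing it, following the 1/3-of-frame spatial grid and 10-frame temporal grid described above.

```python
import numpy as np

def feature_to_node(pts, frame_h, frame_w, n_frames):
    """Map STIP coordinates to ST-grid node indices.

    pts: (n, 3) array of (y, x, t) feature locations -- an assumed layout.
    The grid follows the paper's setting: H, W are 1/3 of the frame
    dimensions and F is 10 frames, giving a 3 x 3 x ceil(n_frames/10) grid.
    """
    cell_h, cell_w, cell_f = frame_h / 3.0, frame_w / 3.0, 10.0
    iy = np.minimum((pts[:, 0] / cell_h).astype(int), 2)
    ix = np.minimum((pts[:, 1] / cell_w).astype(int), 2)
    it = np.minimum((pts[:, 2] / cell_f).astype(int),
                    int(np.ceil(n_frames / cell_f)) - 1)
    # Flatten (it, iy, ix) into a single node id per feature.
    return (it * 3 + iy) * 3 + ix
```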

Fig. 2: Conceptual illustration of our node weighting scheme. The yellow points are local features, and each bar on the right side indicates the SR of each local feature. In each bar, each color represents the weight of one element of the dictionary. The weight of the red-bordered node in the center is computed from the SRs of the local features falling within the node.

2) Node weight: To detect the action region in the graph, we need to represent the amount of action information contained in each node. We formulate the node weight by analogy with a common SVM scoring function. Given a set of training video sequences S, each video sequence $S_i$ with $n_i$ STIPs can be represented by a coefficient histogram $h(S_i)$ obtained by max-pooling the corresponding SR vectors of its $n_i$ feature descriptors. We train a linear SVM using all the coefficient histograms $h(S_i)$, $i \in [1, N]$, extracted from the training examples S. The training examples S include positive and negative samples; a sample is positive if it contains the action to detect and negative otherwise. Let c and $\beta$ denote the weight vector and bias of the SVM, respectively. We then compute a weight for each node $v \in V$ in the graph G for a video sequence Q as

$$w_v = \beta + \sum_{x_l \in v} \sum_{k=1}^{K} c_k\, \alpha_l^k, \qquad (2)$$

where $x_l$ is the l-th local feature descriptor falling within node v of the graph G constructed from the video sequence Q, and $\alpha_l^k$ is the k-th value of the SR $\alpha_l$ of $x_l$, obtained by the method of Section III-A. Fig. 2 shows a conceptual illustration of our node weighting scheme. Note that nodes with higher positive weights indicate a higher probability that the action occurs in that region, while smaller weights indicate a lower probability. With a weight defined for each node, we can detect the regions of interest in the graph by searching for the set of nodes with the highest sum of node weights. This enables us to score an arbitrarily-shaped set of nodes where the action occurs. In this context, we apply a detection approach based on an MWC search, as presented in Section III-C.

3) Edge structure: The linking strategy between nodes affects both the shapes of candidate subgraphs and the search cost. In general, each node is linked with its 4-connected neighboring nodes. However, since our detection approach is based on a clique search, this edge structure is not sufficient for finding maximum cliques (Section III-C). Hence, we additionally include three types of edges: (1) 8-connected edges in the spatial dimension, (2) 8-connected edges in the temporal dimension, and (3) jump edges to the second adjacent neighbor in the temporal dimension. Chen and Grauman [7] show that the jump edge ignores misleading features that may interrupt an otherwise good instance of an action. Since realistic videos tend to contain noisy elements such as background clutter and camera motion, a graph with jump edges supports more robust detection.

Fig. 3: Conceptual illustration of our edge structure. The black lines indicate general 4-connected edges. The green line indicates the jump edge. The red lines are 8-connected edges in the spatial dimension, and the blue lines are 8-connected edges in the temporal dimension.

Fig. 3 shows an example of our linking structure. Our additional edges connect each node to more neighboring nodes, including its 8-connected neighbors, which expands the space of candidate subvolumes by allowing more cliques to be searched. Fig. 4 shows an example of the strength of our additional edges: an MWC with greater weight can be found in the graph with 8-connected edges. Consequently, this provides stronger localization in the detection problem.

Fig. 4: An example illustrating the strength of our linking structure. (a) 4-connected neighbors only; (b) 4- and 8-connected neighbors. The yellow nodes belong to an MWC. The MWC weight of graph (a) is (4+3), and the MWC weight of graph (b) is greater.

One could consider the complete graph, in which every pair of nodes is connected, in order to search for the MWC with the greatest weight. However, this graph yields a very large MWC that includes noisy nodes with positive weights, and distant nodes can be included in an MWC. Even if we limit the complete graph to a local region, the resulting MWC still includes noisy nodes. For example, if the graph in Fig. 4(b) were complete, the resulting MWC would include a node with weight 5; that node has neighboring nodes with negative weights and, consequently, is likely to correspond to noise. We demonstrate this case through experiments.
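Given the SVM parameters and the sparse codes, the node weight of Eq. 2 reduces to a per-feature dot product accumulated into nodes. A minimal sketch follows, where `alphas` and the node mapping are assumed to come from the earlier steps.

```python
import numpy as np

def node_weights(alphas, node_ids, svm_c, svm_beta, n_nodes):
    """Compute w_v per Eq. (2): w_v = beta + sum over features x_l in v
    of <c, alpha_l>.

    alphas:   (n, K) sparse codes of the local features (Section III-A)
    node_ids: (n,) node index of each feature (grid mapping above)
    svm_c:    (K,) linear-SVM weight vector; svm_beta: its bias
    """
    per_feature = alphas @ svm_c                  # <c, alpha_l> per feature
    w = np.full(n_nodes, svm_beta, dtype=float)   # each node starts at beta
    np.add.at(w, node_ids, per_feature)           # accumulate into nodes
    return w
```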

C. Searching for the MWCS for Action Detection

Given an ST-graph G = (V, E) with weighted nodes, we need to find the subgraph $G^{*}$ satisfying

$$G^{*} = \arg\max_{G' \subseteq G} \sum_{v \in V'} w_v, \qquad (3)$$

where $G' = (V' \subseteq V, E' \subseteq E)$ can be any connected subgraph of G, and $G^{*}$ is the connected subgraph with the highest sum of node weights. Since each node has a learned weight indicating the probability of the action occurring, the action region is a set of connected nodes whose total weight is maximal. Therefore, action detection can be cast as finding the maximum subgraph of the entire graph representing an action video; that is, the action detection problem can be solved as an MWCSP. If all node weights are positive, an optimal solution is easily computed by taking any spanning tree. However, the node weights in our graph can be either positive or negative, and the MWCSP is then NP-complete [33], i.e., no efficient algorithm is known for it. Dittrich et al. [34] transform the MWCSP into the prize-collecting Steiner tree problem (PCST) to identify functional modules in protein-protein interaction networks. In [34], the graph is transformed into a directed graph, and integer linear programming (ILP) is performed on the transformed directed graph with binary variables for each node and edge; the problem is then solved with a linear programming-based branch-and-bound algorithm. Chen and Grauman [7] apply the same method [34] to ST-graphs; their max-subgraph approach seeks the subvolume that maximizes the action classifier's output.

We propose a novel method to solve the MWCSP by defining an optimization problem based on MWCs in small graph units. Our approach is inspired by the works of Shervashidze et al. [31] and Levi [32]. First, Shervashidze et al. [31] present a graph kernel based on counting and sampling subgraphs of limited size in the entire graph, which they call graphlets. Their sampling scheme allows them to compute graph kernels on graphs of sizes beyond the scope of state-of-the-art methods. Similarly, since providing information about the MWCS of the entire graph is very challenging, our approach computes information about the MWCS for each graph unit by dividing the graph into smaller units. Second, Levi [32] shows that each MWC in a product graph is associated with a maximum common subgraph. The maximum common subgraph problem then becomes that of determining the maximum clique of an arbitrary undirected graph: all common subgraph isomorphisms are enumerated by enumerating the cliques of the product graph, i.e., the problem is solved by splitting the subgraph into several cliques. In this paper, we exploit the MWC information found in the smaller subgraphs in order to solve the MWCSP of the whole graph. Instead of enumerating the raw MWCs, the nodes of the MWCs are used together with their weights to determine the optimized solution of the MWCSP. We compare experimental results for these two approaches: enumerating MWCs versus solving a linear system based on the MWCs.

Next, we describe the solution of the MWCSP in the ST-graph for a test video sequence. Our approach consists of three steps: (1) finding the MWC of each subgraph (graph unit), (2) identifying candidate MWCSs by solving an optimization problem based on the MWC information, and (3) selecting the resulting MWCS as the top-scoring detection.
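To illustrate step (1), the sketch below finds the maximum-weight clique of one small graph unit by exhaustive clique enumeration with networkx. This substitutes brute force for the branch-and-bound algorithm of Östergård [35] that the paper actually uses, and is only practical because each graph unit is small.

```python
import networkx as nx

def max_weight_clique_bruteforce(unit):
    """Exact MWC of a small graph unit by enumerating all cliques.

    `unit` is a networkx Graph whose nodes carry a 'weight' attribute
    (the w_v of Eq. 2). enumerate_all_cliques is exponential in general,
    which is acceptable only because a unit spans a few graph elements;
    the paper instead uses the branch-and-bound method of [35].
    """
    best_nodes, best_w = [], float("-inf")
    for clique in nx.enumerate_all_cliques(unit):
        w = sum(unit.nodes[u]["weight"] for u in clique)
        if w > best_w:
            best_nodes, best_w = clique, w
    return best_nodes, best_w
```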
Let the graph G = {g_i} consist of multiple graph elements $g_i$, where each $g_i$ is the set of nodes within the same temporal slice. We define a graph unit as a set of n consecutive graph elements and divide the entire graph G into m overlapping graph units, determining the MWC of every unit. Since the MWC problem is NP-hard [35], many works solve it using heuristic approaches; we solve it using the method of Östergård [35], a branch-and-bound algorithm that exploits a node order based on a coloring of the nodes together with a pruning strategy. Once the MWCs C = {c_i} have been obtained from all graph units, we use an optimization approach to search for the MWCS, i.e., a localized ST region containing the action. Namely, we minimize the following energy function:

$$E(G) = \sum_{u \in V} (x_u - b_u)^2 + \lambda \sum_{(u,v) \in E} (x_u - x_v)^2, \qquad (4)$$

$$b_u = \begin{cases} w_u + \sigma\, w(c_i) & \text{if } u \in c_i, \\ w_u & \text{otherwise,} \end{cases} \qquad (5)$$

where $x_u$ indicates the possibility of node u belonging to the MWCS, and $b_u$ is determined by the type of node u: if node u is part of an MWC $c_i$, $b_u$ is the sum of the node weight $w_u$ and the MWC weight $w(c_i)$; otherwise, $b_u$ is the node weight $w_u$ alone. The first term of Eq. 4 is the data term, which encourages the MWCS weight $x_u$ to be similar to $b_u$; hence, a node with greater weight belonging to an MWC of greater weight receives a greater MWCS weight. The second term of Eq. 4 is the smoothness term, which generates the connected maximum-weight subgraph by minimizing the difference in MWCS weight between node u and each of its connected neighbors v. The parameter $\lambda$ is set to 1, and the parameter $\sigma$ is experimentally set to $1/|c_i|$.

Optimization. The energy function in Eq. 4 is an optimization problem that can be represented as a linear system Ax = b. In other words, by setting the first-order derivative of E(G) in Eq. 4 to zero and arranging the $x_u$ over all nodes u into a vector x, Eq. 4 can be written in matrix form Ax = b, where A is a $|V| \times |V|$ matrix and b is a vector whose elements correspond to the $b_u$ of Eq. 5. We obtain the solution x by solving the linear system. In the solution, each $x_u$ is the probability of node u belonging to the MWCS, i.e., a larger value indicates a higher likelihood of action occurrence. We then determine the MWCS candidates MS = {ms_i} from the following condition:

$$ms_i = \{\, u_j \mid x_{u_j} \geq \max(\mathbf{x}) - \max(\mathbf{x})/2,\ (u_j, u_{j+1}) \in E \,\}. \qquad (6)$$
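Setting $\partial E / \partial x_u = 0$ gives $(1 + \lambda\,\deg(u))\,x_u - \lambda \sum_{v \sim u} x_v = b_u$ for each node, i.e., $A = I + \lambda L$ with L the combinatorial graph Laplacian. The sketch below solves this system with SciPy and applies the thresholding of Eq. 6; the grouping of surviving nodes into connected candidates and all function names are our assumptions.

```python
import numpy as np
import networkx as nx
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def mwcs_probabilities(G, b, lam=1.0):
    """Solve Eq. (4) in closed form: (I + lam * L) x = b.

    G: networkx Graph over the ST nodes; b: (|V|,) vector from Eq. (5),
    ordered consistently with sorted(G.nodes()).
    """
    order = sorted(G.nodes())
    L = sp.csr_matrix(nx.laplacian_matrix(G, nodelist=order))
    A = sp.identity(L.shape[0], format="csr") + lam * L
    return spla.spsolve(A, b), order

def mwcs_candidates(G, x, order):
    """Eq. (6): keep nodes whose solution exceeds max(x) - max(x)/2, then
    group them into connected subgraphs; each component is one candidate."""
    thr = x.max() - x.max() / 2.0
    keep = [order[i] for i in np.flatnonzero(x >= thr)]
    return [set(c) for c in nx.connected_components(G.subgraph(keep))]
```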

Fig. 5: A conceptual illustration of our MWCS search method. (a) Example of an MWC search result: yellow nodes and colored lines comprise MWCs. (b) Example of the result after applying Eq. 6: two MWCS candidates are found by an optimization based on MWC information.

Algorithm 1: Maximum-weight connected subgraph (MWCS) search for multiple detections.
Data: a graph G = (V, E) = {g_i, i = 1, ..., m}
Result: a set of MWCSs D = {G*}, where G* = arg max_{G'} Σ_{v∈V'} w_v
  set C ← ∅
  for i = 1 to m do
    search the MWC c_i of graph unit i using the algorithm of [35]
    append c_i to C
  end
  while MS ≠ ∅ do
    compute A and b according to Eq. 4 and Eq. 5
    solve Ax = b
    identify the MWCS candidates MS according to Eq. 6
    choose the candidate ms with the largest weight in MS
    set the weights of all nodes u ∈ ms to −w_u
    append ms to D
  end
  return D

Fig. 5 shows a conceptual illustration of our MWCS search method (some edges are omitted for ease of viewing). Suppose we use a graph unit of size 2. We first search for the MWC of each graph unit, as shown in Fig. 5(a). The yellow nodes are those included in MWCs, and each clique outlined by a colored line (red, green, blue, orange) is an MWC. In other words, the red MWC is found in the first graph unit, which consists of graph elements g_1 and g_2, and the blue MWC is found in the second graph unit, which consists of g_2 and g_3. In this way, we search the MWCs of all graph units within the whole graph; four MWCs are found in this example. Based on the MWCs C, we construct the energy function of Eq. 4 and solve the optimization problem as described above. After applying Eq. 6 to the optimization result, we find the MWCS candidates MS (the red and blue subgraphs), as shown in Fig. 5(b), and apply top-scoring detection for the final result. This condition provides one or more candidate MWCSs according to the number of actions included in a test video sequence. We first choose the candidate with the largest weight. To return multiple top-scoring detections, we apply a method similar to those in [13], [7], iteratively performing the MWCS algorithm after setting the weights of the nodes of the MWCS found in the previous iteration to −w_u. Algorithm 1 provides the pseudocode of our MWCS algorithm.

IV. EXPERIMENTS

In this section, we evaluate the proposed method using two datasets, uncropped Hollywood action videos [36] and MSR actions [13]. Both datasets contain actions with dynamic occlusions in the complex and moving backgrounds of real-world environments. We extract the STIPs and their descriptors using the method of [36]. To compute the node weight for each action, we train a linear SVM.

Fig. 6: Examples of the shapes of detection results for the different methods (panels: T-Sliding, ST-Sliding, Subvolume, Subgraph, our result).

We employ the mean overlap accuracy as the evaluation metric, which computes the intersection of the predicted detection region with the ground truth divided by their union. We use detection time to evaluate computational cost; the detection times of all baselines are measured on a system equipped with an Intel Core CPU at 3.40 GHz. We compare our method with three state-of-the-art baselines. (1) Sliding: the sliding window is a standard and popular action detection method used in several works [1], [4], [5]. A temporal sliding window is used for temporal detection, and a spatio-temporal sliding window is used for spatio-temporal detection. The ST sliding window is a variant of the temporal sliding window that searches ST subvolumes of cubic shape. (2) Subvolume: the cube-subvolume method [13] searches for the cube-shaped subvolume that maximizes the action-occurring probability. Its spatial detection region is therefore more flexible than that of the ST sliding window, but the spatial detection is restricted to a single location. (3) Subgraph: the subgraph method [7] allows spatial shifts over time, as does our method; however, it cannot detect subvolumes built from 8-connected neighboring nodes. Note that our method produces more flexible detection regions than Subgraph [7] by allowing the 8-connected edge structure; it can detect most irregular and non-cubic shapes. Fig. 6 illustrates the shapes of the detection results of the different methods.
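For reference, the mean overlap accuracy above is the standard intersection-over-union criterion; a minimal sketch for the temporal case follows, with the interval convention (start, end) as an assumption.

```python
def temporal_overlap_accuracy(pred, gt):
    """Intersection-over-union of two temporal intervals (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# e.g. temporal_overlap_accuracy((10, 50), (20, 60)) -> 30/50 = 0.6
```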

TABLE I: The mean temporal overlap accuracy results on the Hollywood dataset under the BoF feature scheme (per-class and average accuracies of Sliding, Subvolume [13], Subgraph [7], and Ours over the eight actions).

TABLE II: Comparison of detection accuracies by feature type (BoF vs. SR) on the Hollywood dataset, per class and on average.

A. Experiments on the Hollywood Actions Dataset

The Hollywood Actions Dataset [36] consists of videos collected from 32 different Hollywood movies, with a total of 663 video sequences from 8 action classes: answering the phone, getting out of the car, hand shaking, hugging, kissing, sitting down, sitting up, and standing up. The dataset provides uncropped and cropped versions of the sequences, i.e., videos containing extraneous frames and videos containing only the action of interest; we use the cropped version for training and the uncropped version for detection evaluation. We perform temporal detection only, because most actions occur across the entire frame. Table I shows the temporal detection results for the same feature settings as in [7]. Our method achieves the best accuracy for 5 of the 8 action classes, as well as the best average accuracy. This validates the superiority of our MWCS search method in detecting action regions. In addition, our method can estimate the probability of action occurrence for the detected region, in contrast to the MWCS search method of [7]; this property enables more robust detection with respect to the performance of the basic detector. A comparative evaluation of BoF- and SR-based video representations is shown in Table II. Our SR-based method achieves better results than the BoF-based method overall. However, the SR-based method performs worse than the BoF-based method on the GetOutCar, HandShake, and SitUp classes. Although the SR-based representation overcomes the drawbacks of the BoF representation, it has drawbacks of its own: the sparse solution becomes denser when the image suffers a large amount of random corruption or contiguous occlusion [39], [40]. Since some action videos in the GetOutCar, HandShake, and SitUp classes exhibit these problems, the SR-based method performs worse on them.

Given the MWC information, combining all neighboring MWCs is the most straightforward way to solve the MWCSP: continuously connected MWCs are gathered into larger graphs, these gathered graphs become the candidate MWCSs, and the top-scoring candidate becomes the final MWCS. Fig. 7 compares this combination-based method and the proposed method. Our method outperforms the combination-based method for all action classes. The MWCS obtained by combination is likely to cover only one piece of the ground-truth MWCS, because no MWC can be extracted from a noisy middle part. Our method instead defines a smoothness term in Eq. 4 that encourages connected subgraph generation; consequently, it overcomes this shortcoming of the combination method and leads to better detection results.

Fig. 7: The mean temporal overlap accuracies for each action in the Hollywood action dataset using the combination-based method and our proposed method.

Table III shows the detection time on the Hollywood dataset. Although the Subgraph [7] method has the shortest detection time, our method is significantly faster than the Sliding and Subvolume [13] methods. Our method takes more time than Subgraph [7] because our graph is more complex than theirs. However, our maximum subgraph search is more efficient than that of Subgraph [7] for large graphs, as shown in Table VI.

TABLE III: Detection time (sec) on the Hollywood dataset for Sliding, Subvolume [13], Subgraph [7], and Ours.

B. Experiments on the MSR Actions Dataset

The MSR Actions Dataset [13] contains multiple instances of different actions, such as boxing, handclapping, and handwaving. All of the sequences are captured with clutter and moving backgrounds. Since the actors change their position over time, this dataset is well suited to evaluating ST detection. We perform both temporal and ST detection on this dataset. Following the detection protocol of [13], [7], we train the detectors using the KTH dataset [37].

TABLE IV: The mean temporal overlap accuracy results for the MSR dataset (Sliding, Subvolume [13], Subgraph [7], and Ours over boxing, clapping, waving, and their average).

TABLE V: The mean space-time overlap accuracy results for the MSR dataset (same layout as Table IV).

Fig. 8: The mean temporal overlap accuracies for each action in the MSR action dataset when using the combination-based method and our method.

Fig. 9: The mean space-time overlap accuracies for each action in the MSR action dataset when using the combination-based method and our method.

Table IV shows the temporal detection results. Our method achieves the best accuracy for all action classes; in particular, it significantly improves the accuracies for the boxing and clapping actions. Temporal detection results of the combination-based method and our method for solving the MWCSP are compared in Fig. 8: our method achieves higher detection accuracy for all action classes, showing the strength of our optimization strategy, which considers both the maximum-weight and connectivity properties. Table V shows the ST detection results. Although our results are slightly lower than the best performance for the boxing and waving actions, our method achieves the best average accuracy owing to its increased accuracy for the clapping action. Compared with the graph-based method of [7], our method achieves higher accuracy for all action classes. Note that, although the method of [7] achieves better results than the sliding window and [13] for temporal detection, it achieves lower accuracy for ST detection. In contrast, our method achieves the best average accuracy for both temporal and ST detection. This validates the superiority of our graph structure and MWCS search method. Fig. 9 compares the combination-based method and our method for the MWCSP: our method outperforms the combination-based method for all action classes, especially for the waving action, demonstrating that our MWCS search produces more detailed localization.

Fig. 10 shows the mean space-time overlap accuracy according to graph type and graph unit size. We use three types of graphs: a basic graph, a general graph with 4-connected edges; a complete graph, in which every pair of nodes within a graph unit is connected; and our graph, which contains 8-connected edges in the spatial and temporal dimensions in addition to 4-connected edges. The graph complexity therefore decreases in the order complete graph, our graph, basic graph. Our graph outperforms both the basic graph and the complete graph for every graph unit size. We also observe that the complete graph performs poorly, indicating that unnecessary edges cause errors and that our linking strategy is well suited to the detection problem.

We also report the mean space-time overlap accuracy for various grid sizes on the MSR action dataset. Fig. 11 shows the results for a graph unit of size 2 and a temporal grid of size F = 10. A spatial grid of 1/3 of the frame dimensions gives the best average accuracy. One might expect that a smaller grid always gives a more detailed detection shape and higher detection accuracy; however, a grid that is too small (1/6 of the frame dimensions) reduces accuracy, as shown in Fig. 11. Note that our method finds the maximum subgraph based on node weights computed from the weights of the local features falling within each node. As the node size decreases, the number of local features within each node shrinks, which makes it difficult to aggregate statistics over the neighborhoods of local features. Fig. 12 shows the results for various temporal grid sizes, with a graph unit of size 2 and a spatial grid of 1/3 of the frame dimensions. A temporal grid of size 20 gives the best average accuracy; both too small and too large temporal grids reduce accuracy. Note that the proper grid size can differ with the data, in terms of the frame dimensions of the input video, the average size of the humans, and the average length of the actions.

Fig. 10: The mean space-time overlap accuracies for the MSR action dataset according to graph type and graph unit size.

Fig. 11: The mean space-time overlap accuracy results according to various spatial grid sizes for the MSR action dataset.

Fig. 12: The mean space-time overlap accuracy results according to various temporal grid sizes for the MSR action dataset.

Table VI shows the detection time on the MSR dataset. Our method achieves the shortest detection time for spatio-temporal detection, compared to ST-Sliding, Subvolume, and Subgraph. Note that the detection times on the MSR dataset are larger than those on the Hollywood dataset because the average video length in the MSR dataset is longer. In particular, the Subgraph method [7] takes a long time on the MSR dataset because it requires optimization over a large graph. Our method takes much less time than Subgraph [7] even though both methods solve the MWCSP, because our method first searches for MWCs in graph units smaller than the entire graph and then performs the optimization of a simple linear system based on the MWC information. Therefore, our optimization takes less detection time than that of Subgraph [7] even as the video length grows.

TABLE VI: Detection time (sec) on the MSR dataset for Sliding, Subvolume [13], Subgraph [7], and Ours.

We also compare our method with the method of Lan et al. [41] on the MSR dataset. For the appearance feature x_i, we use the BoF feature scheme based on HoG/HoF descriptors, as in our method. Table VII compares the spatio-temporal detection results. Our method outperforms the method of Lan et al. [41] for all actions without requiring a human detection procedure. Lan et al. [41] produce poor detection results compared to the other baselines as well as to our method, because their method is not appropriate for the MSR dataset, which contains multiple actions in the same scene and has no rule governing location variation. In other words, their method has difficulty benefiting from the pairwise potential and global action potential in their potential function.

TABLE VII: Comparison of mean space-time overlap accuracy results between Lan et al. [41] and our method for the boxing, clapping, and waving actions and their average.

Fig. 13 shows examples of the detection results on the MSR action dataset. We compare the detection results between the graph of [7] (green rectangles) and our graph (yellow rectangles). In the third row, the previous graph fails to detect any region in the 2nd, 3rd, and 4th images, while our graph detects the action regions in all the images. These examples show that our graph detects not only more temporal action frames but also more detailed spatial action regions within each frame.

V. CONCLUSION

We presented an optimization approach that identifies the maximum subgraph of an ST-graph for action detection. Our energy function is defined based on maximum cliques and includes the maximum and connectivity properties needed to find the maximum subgraph. The energy function is formulated as a linear system, and its solution gives the probability of each node being an action node. Our graph structure and optimization method efficiently solve the detection problem by applying the clique-based approach and a simple linear system solver. We showed through experiments with various graph structures that our graph and optimization method produce more accurate localization. We also compared our method with state-of-the-art methods on real-world action datasets.

Fig. 13: Examples of detection results using the graph of [7] and our graph. The rows show detection results for the waving, boxing, and clapping actions, respectively, on the MSR action dataset. In each row, the first 4 images are detection results of the graph of [7], and the remaining 4 images are detection results of our graph. Red rectangles denote the ground truth, green rectangles denote the results of the graph of [7], and yellow rectangles denote the results of our graph.

ACKNOWLEDGMENT

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MEST) (No. NRF-2013R1A2A1A).

REFERENCES

[1] Y. Ke, R. Sukthankar, and M. Hebert, "Efficient visual event detection using volumetric features," in Proc. ICCV, 2005.
[2] L. Zelnik-Manor and M. Irani, "Statistical analysis of dynamic actions," IEEE Trans. PAMI, vol. 28, no. 9, 2006.
[3] Y. Ke, R. Sukthankar, and M. Hebert, "Event detection in crowded videos," in Proc. ICCV, 2007.
[4] O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce, "Automatic annotation of human actions in video," in Proc. ICCV, 2009.
[5] S. Satkin and M. Hebert, "Modeling the temporal extent of actions," in Proc. ECCV, 2010.
[6] A. Klaser, M. Marszalek, C. Schmid, and A. Zisserman, "Human focused action localization in video," in Proc. ECCV, 2010.
[7] C.-Y. Chen and K. Grauman, "Efficient activity detection with max-subgraph search," in Proc. CVPR, 2012.
[8] K. Mikolajczyk and H. Uemura, "Action recognition with motion-appearance vocabulary forest," in Proc. CVPR, 2008.
[9] M. S. Ryoo and J. K. Aggarwal, "Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities," in Proc. ICCV, 2009.
[10] A. Oikonomopoulos, I. Patras, and M. Pantic, "Spatiotemporal localization and categorization of human actions in unsegmented image sequences," IEEE Trans. Image Processing, vol. 20, no. 4, 2011.
[11] A. Yao, J. Gall, and L. Van Gool, "A Hough transform-based voting framework for action recognition," in Proc. CVPR, 2010.
[12] G. Willems, J. H. Becker, T. Tuytelaars, and L. Van Gool, "Exemplar-based action recognition in video," in Proc. BMVC, 2009.
[13] J. Yuan, Z. Liu, and Y. Wu, "Discriminative subvolume search for efficient action detection," in Proc. CVPR, 2009.
[14] L. Cao, Z. Liu, and T. S. Huang, "Cross-dataset action detection," in Proc. CVPR, 2010.
[15] F. Zhou, F. De la Torre, and J. K. Hodgins, "Aligned cluster analysis for temporal segmentation of human motion," in Proc. FGR, 2008.
[16] M. Hoai, Z.-Z. Lan, and F. De la Torre, "Joint segmentation and classification of human actions in video," in Proc. CVPR, 2011.
[17] A.-P. Ta, C. Wolf, G. Lavoue, and A. Baskurt, "Recognizing and localizing individual activities through graph matching," in Proc. AVSS, 2010.
[18] U. Gaur, Y. Zhu, B. Song, and A. Roy-Chowdhury, "A 'string of feature graphs' model for recognition of complex activities in natural videos," in Proc. ICCV, 2011.
[19] W. Brendel and S. Todorovic, "Learning spatiotemporal graphs of human activities," in Proc. ICCV, 2011.
[20] O. Celiktutan, C. Wolf, B. Sankur, and E. Lombardi, "Real-time exact graph matching with application in human action recognition," in Proc. HBU, 2012.
[21] J. Liu, S. Ali, and M. Shah, "Recognizing human actions using multiple features," in Proc. CVPR, 2008.
[22] E. Z. Borzeshi, M. Piccardi, and R. Y. D. Xu, "A discriminative prototype selection approach for graph embedding in human action recognition," in Proc. ICCV Workshops, 2011.
[23] T. Guha and R. K. Ward, "Learning sparse representations for human action recognition," IEEE Trans. PAMI, vol. 34, no. 8, 2012.
[24] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. PAMI, vol. 31, no. 2, 2009.
[25] K. Guo, P. Ishwar, and J. Konrad, "Action recognition using sparse representation on covariance manifolds of optical flow," in Proc. AVSS, 2010.
[26] C. Liu, Y. Yang, and Y. Chen, "Human action recognition using sparse representation," in Proc. ICIS.
[27] S. R. Fanello, I. Gori, G. Metta, and F. Odone, "Keep it simple and sparse: Real-time action recognition," JMLR, vol. 14, 2013.
[28] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online dictionary learning for sparse coding," in Proc. ICML, 2009.
[29] M. Rodriguez, J. Ahmed, and M. Shah, "Action MACH: A spatio-temporal maximum average correlation height filter for action recognition," in Proc. CVPR, 2008.
[30] K. Derpanis, M. Sizintsev, K. Cannons, and R. Wildes, "Efficient action spotting based on a spacetime oriented structure representation," in Proc. CVPR, 2010.
[31] N. Shervashidze, S. V. N. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt, "Efficient graphlet kernels for large graph comparison," in Proc. AISTATS, 2009.
[32] G. Levi, "A note on the derivation of maximal common subgraphs of two directed or undirected graphs," Calcolo, vol. 9, no. 4, 1973.
[33] T. Ideker, O. Ozier, B. Schwikowski, and A. F. Siegel, "Discovering regulatory and signalling circuits in molecular interaction networks," Bioinformatics, vol. 18, 2002.
[34] M. T. Dittrich, G. W. Klau, A. Rosenwald, T. Dandekar, and T. Müller, "Identifying functional modules in protein-protein interaction networks: An integrated exact approach," Bioinformatics, vol. 24, no. 13, 2008.
[35] P. R. J. Östergård, "A fast algorithm for the maximum clique problem," Discrete Applied Mathematics, vol. 120, no. 1-3, 2002.
[36] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," in Proc. CVPR, 2008.
[37] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: A local SVM approach," in Proc. ICPR, 2004.
[38] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," Annals of Statistics, vol. 32, 2004.
[39] E. J. Candès, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, 2006.
[40] A. Martinez, "Recognizing imprecisely localized, partially occluded, and expression variant faces from a single sample per class," IEEE Trans. PAMI, vol. 24, no. 6, 2002.
[41] T. Lan, Y. Wang, and G. Mori, "Discriminative figure-centric models for joint action localization and recognition," in Proc. ICCV, 2011.
[42] N. Shapovalova, A. Vahdat, K. Cannons, T. Lan, and G. Mori, "Similarity constrained latent support vector machine: An application to weakly supervised action classification," in Proc. ECCV, 2012.
[43] Y. Kong, Y. Jia, and Y. Fu, "Learning human interaction by interactive phrases," in Proc. ECCV, 2012.
[44] M. Hoai and F. De la Torre, "Max-margin early event detectors," in Proc. CVPR, 2012.
[45] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, "Large margin methods for structured and interdependent output variables," JMLR, vol. 6, 2005.

Sunyoung Cho received the B.S. degree in computer science from Sookmyung Women's University, Seoul, Korea, in 2007, and the M.S. and Ph.D. degrees in computer science from Yonsei University, Seoul, Korea, in 2009 and 2014, respectively. She is currently a Post-Doctoral Researcher with the Human-Computer Interaction Institute at Carnegie Mellon University, Pittsburgh, PA, USA. Her research interests include computer vision, image and video processing, machine learning, and human-computer interaction.

Hyeran Byun received the B.S. and M.S. degrees in mathematics from Yonsei University, Seoul, Korea, and the Ph.D. degree in computer science from Purdue University, West Lafayette, IN, USA. She was an Assistant Professor at Hallym University, Chuncheon, Korea, beginning in 1994. She is currently a Professor of Computer Science at Yonsei University. She served as president of the Artificial Intelligence Society in KIISE (Korean Institute of Information Scientists and Engineers) for the term beginning in March 2013. Her research interests include computer vision, image and video processing, artificial intelligence, event recognition, gesture recognition, and pattern recognition.
