PAPER IDENTIFICATION NUMBER

A Space-Time Graph Optimization Approach Based on Maximum Cliques for Action Detection

Sunyoung Cho, Member, IEEE, and Hyeran Byun, Member, IEEE

Abstract: We present an efficient action detection method that takes a space-time graph optimization approach for real-world videos. Given a space-time graph representing the entire action video, our method identifies a maximum-weight connected subgraph indicating an action region by applying an optimization approach based on clique information. We define an energy function based on maximum weight cliques for subregions of the graph, and formulate it as an optimization problem that can be represented as a linear system. Our energy function includes the maximum and connectivity properties for finding the maximum-weight connected subgraph, and its solution indicates, for each node, the probability of belonging to the maximum subgraph. Our graph optimization method solves the detection problem efficiently by combining the clique-based approach with a simple linear system solver. Experiments on real-world datasets, such as the Hollywood and MSR action datasets, demonstrate that our detection method localizes actions more accurately than conventional methods. We also show that our method outperforms the state-of-the-art methods of action detection.

Index Terms: Action detection, space-time graph, sparse representation, maximum weight connected subgraph, maximum weight clique, optimization.

I. INTRODUCTION

UNDERSTANDING human actions in videos has been considered one of the important areas in the computer vision community due to a variety of applications, such as video indexing and searching, video surveillance, and human-computer interaction. The action learning problem has been extensively studied over the past years, and recent works have explored more realistic actions rather than the simplified and constrained ones used in earlier studies.
Action learning in the real world is a challenging problem because uncontrolled real-world videos include a large amount of variation in terms of the number of people, the scale of each person, background clutter, occlusion, and camera parameter changes. Furthermore, two problems need to be solved for action understanding: localizing the region of a video sequence in which an action occurs (action detection) and classifying the category of the action (action recognition).

Action detection finds the spatial and temporal regions in which actions occur in the video sequence. The most common approach is a sliding window [1]-[6], which applies a classifier function to the subregions and takes the maximum of the classification score as the location of the action. Since the number of subregions in a video sequence of size $h \times w \times f$ is of the order $O(h^2 w^2 f^2)$, evaluating the classifier function for all of them is too computationally expensive. Another challenge arises from uncontrolled real-world videos containing large variations in actions and complex backgrounds. It is difficult to construct a classifier that recognizes realistic actions, which makes the action detection problem more challenging.

S. Cho was with the Department of Computer Science, Yonsei University, Seoul, Korea. She is now with the Human-Computer Interaction Institute, Carnegie Mellon University, Pittsburgh, PA, USA (sycho22@gmail.com). H. Byun is with the Department of Computer Science, Yonsei University, Seoul, Korea (hrbyun@yonsei.ac.kr).

The main approach to detection problems is to find the subvolume that maximizes the output of a scoring function. To define the scoring function, a classifier that measures the score of an input must be learned. Several methods use a max-margin framework [7], [43], [44] to train the model parameters of the classifier. Chen and Grauman [7] use a linear SVM to learn the parameters of the scoring function.
Since they use a graph structure to represent a video, the learned parameters are used to compute the weight of each node in the graph. Kong et al. [43] measure the interactive phrases of human interaction using a latent SVM. They define attribute and interaction models for scoring the interactive phrase, and find the maximum interactive phrase by solving an optimization problem based on coordinate descent. Hoai and Torre [44] consider a detection scoring function that extends the Structured Output SVM [45] to enable the recognition of sequential data. By exploiting partial events as training examples for learning the scoring function, their method supports the early detection of events in sequential data.

Our approach exploits the max-margin framework for defining the scoring function. We are especially inspired by the work of Chen and Grauman [7], which solves the action detection problem as a maximum-weight connected subgraph problem (MWCSP) on a weighted space-time (ST) graph representing an input video. They transform the MWCSP into a prize-collecting Steiner tree problem and solve it using a branch-and-cut optimization strategy. Since their detection method is formulated as an integer linear programming problem with binary variables, the resulting detections are made under certainty, i.e., since the coefficients are constant, their result can only indicate whether each node is an action node or not.

We also solve the MWCSP on a weighted ST graph representing an action video, but by applying a clique-based optimization strategy. Our method first determines the maximum weight cliques (MWC) for each graph unit. Based on the node information regarding the MWC, we define the energy function that represents the maximum property of each node
and connectivity among the nodes in the resulting subgraph. The connectivity property makes the energy function robust to noise occurring in the middle of an action. In particular, because our method determines the probability of action occurrence at each node, it enables more robust and accurate detection than that of [7]. Our energy function is formulated as an optimization problem that can be represented as a linear system. The final maximum-weight connected subgraph (MWCS) is obtained by applying a likelihood threshold to the solution of the linear system.

Fig. 1 shows an overview of our approach. Given an action video, the first step extracts the local ST features (Fig. 1(a)). The action video is represented as an ST graph in which each node has a weight computed from the extracted local features (Fig. 1(b)). Since our method is trained to assign larger weights to local features within the action region than to those within the non-action region, nodes within the action region get larger weights (darker colored nodes). Therefore, the process of action detection is equivalent to finding an MWCS. Each node's probability of belonging to the MWCS is found by solving a linear system (Fig. 1(c)), and the detection result is determined by thresholding this probability (Fig. 1(d)). The final detection result is the set of regions corresponding to the set of nodes in the MWCS (Fig. 1(e)).

Since most previous methods in action detection [1], [4], [5], [13] give a cubic-shaped detection result, they cannot search for a subvolume that shifts its spatial location over time. However, a recent method [7] produces a non-cubic detection result by representing a video as an ST graph and searching for the maximum subgraph in the entire graph. Our method also produces a non-cubic result, and a more flexible one, by performing the subgraph search on a flexible ST graph structure.
In addition, we propose a maximum subgraph search method that produces more detailed detection results with less computation time by introducing an optimization based on MWC information.

The main contributions of this paper are summarized in three points.

- We propose a novel approach that solves action detection with a graph optimization strategy. We define an energy function based on MWC information in order to find an MWCS indicating an action region in an ST graph representing the entire action video. The energy function is formulated as an optimization problem that can be represented as a linear system. By solving the linear system, our method efficiently detects the spatial and temporal region in which the action occurs.
- Our method produces more robust and accurate detection results. We define the energy function to include the maximum property of each node and the connectivity among the nodes in the resulting subgraph. These properties make our detection robust to noise that may occur in real-world action videos. Furthermore, our optimization result indicates, for each node, the likelihood of belonging to the MWCS rather than only whether it is an action node or not. This improves performance by allowing more robust and flexible detection.
- Our graph optimization method takes relatively little time for long videos. This is because the energy function is defined with MWC information, which is computed from graph units much smaller than the entire graph representing a video. In addition, the energy function is an optimization problem that can be represented as a simple linear system. By exploiting MWC search in small graph units and a linear system solver, our method solves the detection problem efficiently in terms of computational cost and problem complexity.

Fig. 1: Overview of our approach. (a) Local ST feature extraction. (b) Video representation using a weighted ST graph. (Note that the node numbers are presented on the nodes, and some edges are omitted for ease of viewing.) (c) Optimization problem for computing the probability of becoming an MWCS node. (d) Action detection by thresholding. (e) Action detection result.

The rest of the paper is organized as follows. Section II briefly reviews the previous works on action detection and graph-based approaches for representing an action video. Section III describes the details of our video representation and the proposed optimization approach for action detection. Section IV presents our experimental results and Section V provides conclusions for the paper.

II. RELATED WORK

In this section, we review the state-of-the-art methods of action detection. We also review previous works that apply a graph-based approach to video representation.

A. Previous Works on Action Detection

The most common approach to action detection is a sliding window method, which applies a classifier function to subregions within an entire sequence and considers the maximum classification score as the location of the action. Although this method has been successfully applied in many works [1], [4], [5], evaluating the classifier function for all subregions is computationally expensive.

To avoid exhaustive searches, several recent works have employed a voting-based approach, which performs localization by voting over local ST features. Mikolajczyk and Uemura [8] propose a method based on a vocabulary forest of local features and a voting scheme. They use a large number of low-dimensional local features to build a vocabulary forest that captures the joint appearance-motion of actions. Voting is performed for action categories and occurrence locations over each vocabulary tree. However, their work is restricted to the spatial localization of the subjects in each frame. Ryoo and Aggarwal [9] apply the voting technique to the intersection of relationship histograms between different action videos. They develop a new matching function, the ST relationship match, to measure the similarity between two sets of features. Each pair of features in the intersection votes for the expected starting and ending locations of the action. Oikonomopoulos et al. [10] accumulate localization evidence in a probabilistic ST voting scheme and use class-specific codebooks of codeword ensembles to encode the ST positions. Yao et al. [11] present a method to classify and localize human actions using a Hough transform voting framework.
They perform the voting with a collection of random trees and learn a mapping between densely sampled ST feature patches and an action center. The resulting set of leaf nodes in the trees forms a discriminative codebook with features shared across actions, and votes for action centers in a probabilistic manner. To summarize, although the voting approach reduces computational costs compared to the sliding window method, it is often sensitive to a noisy background and ambiguous for periodic actions, which leads to incorrect votes. Hence, this approach cannot guarantee that the maximum scoring region is found.

The branch-and-bound approach has also been explored to avoid the enormous computational cost of an exhaustive search. This approach identifies the most probable occurring actions using an optimization scheme. Willems et al. [12] propose an extended exemplar-based approach based on local features in the ST domain. The most discriminative visual words are selected and used to formulate bounding box hypotheses. Actions are finally detected by merging the hypotheses with a high confidence value. Yuan et al. [13] also solve the action detection problem using a branch-and-bound strategy. They formulate detection as a search for the 3D subvolume with the maximum amount of mutual information. To this end, a video sequence is represented by a set of features, and each feature casts a positive- or negative-valued vote for the action class. Cao et al. [14] present a framework that combines a Gaussian mixture model (GMM)-based representation of ST features with detection through maximum a posteriori (MAP) estimation. They handle data mismatches by performing model adaptation and action detection simultaneously.

Some methods perform dynamic programming for action segmentation. Zhou et al. [15] formulate motion segmentation as aligned cluster analysis (ACA), an extension of the k-means clustering algorithm.
ACA combines a dynamic time alignment kernel with dynamic programming for temporal segmentation of actions. An efficient coordinate descent algorithm solves ACA. Hoai et al. [16] also use dynamic programming for temporal segmentation, maximizing the classification score of the winning class while suppressing those of the non-maximum classes.

Recent methods have used a structured graph to represent the human region or the entire video. Chen and Grauman [7] use a space-time graph for video representation, where each node indicates a subvolume and its weight represents the likelihood of an action's occurrence. Under the weighted graph representation, they solve the action detection problem by maximum subgraph search. Lan et al. [41] improve action recognition results by treating the human location as a latent variable. Since this method requires a human detection procedure for each frame, it takes considerable computation time and is difficult to apply to complex real-world data, which contain various human appearances and complex backgrounds. Shapovalova et al. [42] relax the assumption of Lan et al. [41] regarding human detection by introducing the clustering of objectness regions. In their method, a video is represented by a global feature of the whole video and local features of objectness regions. Under this video representation, they develop a Similarity Constrained Latent SVM (SCLSVM) model to perform weakly supervised action recognition and localization.

B. Graph-based Approach for Representing an Action Video

Existing methods in human action recognition mainly use feature descriptors extracted from human parts or interest points in order to capture the appearance, shape, and motion patterns of an actor. Those features have been represented using various methods, including bag of features (BoF), dynamic time warping (DTW), hidden Markov models (HMM), and conditional random fields (CRF).
The most popular method represents a video sequence as a BoF and performs the classification using the BoF vector. Although BoF-based methods have shown good results for action recognition, their representations tend to ignore the spatio-temporal relationships among features, which can be an important property for action classification. However, a graph provides an efficient way to describe the spatio-temporal relationships between structural parts or low-level features. Several recent works have used a graph structure to represent action videos, where each node corresponds to the local
feature, and each edge corresponds to the relationship between its nodes. Most of these works perform action recognition as a graph matching problem, measuring the similarity between a model graph and test graphs. Ta et al. [17] construct a graph with a significantly reduced number of edges by filtering the set of triangles between triplets of ST interest points. Because of the reduced complexity of the resulting graph, their approach provides efficient graph matching that computes the matching score by projecting the set of nodes of the first graph onto that of the second graph. Gaur et al. [18] represent an action in a video as a string of feature graphs (SFG) that models the spatial arrangement of the features. The recognition result is obtained by matching the SFG of a video using DTW. Brendel et al. [19] learn the structure and pdfs associated with graphs as the permutation of adjacency matrices of training graphs in the weighted least-squares sense. Celiktutan et al. [20] present a hyper-graph structure that performs an exact matching with low complexity, using a point set matching problem.

Some works have applied graph embedding, which converts a graph into a point in a vector space to make it suitable for general recognition approaches based on feature vectors. Liu et al. [21] apply Fiedler embedding to the graph and then feed the resulting vector to a k-nearest neighbor classifier for action recognition. Borzeshi et al. [22] also apply graph embedding, with a class-based prototype selection method that maximizes a function of the inter- and intra-class distances. The resulting embedding of the graph is fed to an HMM classifier for action recognition.

III. MODELING ACTION AS A WEIGHTED ST-SUBGRAPH

We first introduce a method for extracting the ST features via sparse representation (SR) in Section III-A and then present the construction of an ST-graph representing an action video in Section III-B.
Section III-C explains our action detection method, which finds an MWCS on the ST-graph using an optimization approach.

A. Extracting Localized ST Video Features via SR

Suppose $I$ is a set of detected space-time interest points (STIP) extracted from a set of training video sequences $S = \{S_i,\ i = 1, 2, \ldots, N\}$. To represent the detected STIP $I$, we first compute histograms of oriented gradients (HoG) and histograms of optical flow (HoF) in the ST neighborhoods of the detected STIP $I$. These descriptors capture the local appearance and motion information and have been used successfully in many works on action recognition that employ the BoF scheme for video representation. However, the BoF representation incurs two drawbacks [23]: (1) it leads to a considerable amount of approximation error because each feature is assigned to only one codeword, and (2) the codebook size may increase for data with large variation. Recently, the SR-based approach has been shown to overcome these drawbacks by reducing the approximation error and constructing compact dictionaries in many vision tasks [24], [21], including action recognition [25], [26], [23], [27].

SR has been shown to efficiently represent and compress high-dimensional signals. To obtain the SR, the first step constructs a dictionary with orthogonal or overcomplete bases that can represent the essential information in a signal. The next step determines a sparse solution, that is, the degree to which each element of the dictionary contributes to the signal.

Next, we need to learn the overcomplete dictionaries and the corresponding SR using the descriptors of the detected STIP $I$. We use an online optimization algorithm for dictionary learning [28] and start by briefly describing it. Let the set of HoG/HoF feature descriptors of the training set $S$ be $X = \{x_j,\ j = 1, 2, \ldots, n\}$, where $x_j \in \mathbb{R}^m$, $n = \sum_{i=1}^{N} n_i$, and $n_i$ is the number of descriptors in training video sequence $S_i$.
The online dictionary learning algorithm optimizes the following cost function:

$$\min_{D,\alpha} \frac{1}{n} \sum_{j=1}^{n} \left( \frac{1}{2} \lVert x_j - D\alpha_j \rVert_2^2 + \lambda \lVert \alpha_j \rVert_1 \right), \tag{1}$$

where $D \in \mathbb{R}^{m \times K}$ ($m < K$) is the overcomplete dictionary, each column representing a basis vector, and $\alpha = \{\alpha_j,\ j = 1, 2, \ldots, n\}$, $\alpha_j \in \mathbb{R}^K$ ($K \ll n$), is the SR over $X$ such that each $\alpha_j$ has only a few nonzero elements, each weighting one basis vector of $D$. The online dictionary learning algorithm solves Eq. 1 by performing two steps at every iteration: sparse coding and dictionary updating. In other words, an initial sparse decomposition of $X$ is first computed from an initial dictionary, and this decomposition is then used to update the dictionary. This procedure is repeated until the preset number of iterations is reached. The sparse decomposition problem is solved with the LARS (least-angle regression) algorithm [38], and the dictionary update is performed using block-coordinate descent. Finally, each STIP in $I$ is represented by its corresponding SR computed over the final dictionary.

B. Modeling an Action Video as a Weighted ST-Graph

We represent a video sequence $Q$ using a weighted ST-graph $G = (V, E)$, where $V$ is a set of weighted nodes and $E$ is a set of edges. The weights of the nodes $V$ are determined from the feature descriptors extracted from the video sequence. The most popular shape for representing the action region is the ST-cuboid [13], [29], [30], a cube-shaped subvolume that maximizes the probability of action occurrence. However, this shape is restricted to a fixed location over the temporal domain; that is, it cannot shift its spatial location over time. Recently, Chen and Grauman [7] used an ST-graph that allows spatial changes over the temporal domain and provides more accurate detection. We also use a space-time graph to represent the video sequence, and present the node and edge structures of our ST-graph as follows.

1) Node structure: The node structure is determined by dividing the video sequence into a grid of ST cubes of size $H \times W \times F$.
The cube size governs the trade-off between computational efficiency and detection sensitivity. That is, smaller cubes lead to a higher computational cost but give a more detailed detection shape, whereas larger cubes give a coarser detection at a lower computational cost. For our implementation, we set H and W to 1/3 of the frame dimensions and F to 10 frames. Using this node structure, we detect an action region of irregular and non-cubic shape.
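To make the node structure concrete, the following minimal sketch maps a space-time interest point to the ST cube (node) that contains it, using the setting stated above (H and W equal to one third of the frame dimensions, F equal to 10 frames). The function name `node_index` and the (row, column, time) index layout are illustrative assumptions and are not taken from the paper.

```python
def node_index(x, y, t, frame_width, frame_height, frames_per_cube=10):
    """Return the (row, col, time) index of the ST cube containing (x, y, t).

    Hypothetical helper: assumes each cube spans 1/3 of the frame width
    and height, and `frames_per_cube` consecutive frames.
    """
    cube_w = frame_width / 3.0
    cube_h = frame_height / 3.0
    # Clamp so that points on the right or bottom border fall in the last cube.
    col = min(int(x // cube_w), 2)
    row = min(int(y // cube_h), 2)
    time = int(t // frames_per_cube)
    return row, col, time


# A point in the middle of a 320x240 frame, at frame 9.
print(node_index(160, 120, 9, frame_width=320, frame_height=240))
```

With this layout every frame contributes a 3 x 3 arrangement of nodes, so a video of f frames yields roughly 9 * f / 10 nodes, which is why the graph stays small even for long videos.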

Fig. 2: Conceptual illustration of our node weighting scheme. The yellow points are local features, and each bar on the right side indicates the SR for each local feature. In the bar, each color represents the weight of each element in the dictionary. The weight of the red border node in the center is computed from the SRs of the local features falling within the node.

2) Node weight: We need to represent the amount of action information contained in each node in order to detect the action region in the graph. We formulate the node weight following a common SVM scoring function. Given a set of training video sequences $S$, each video sequence $S_i$ with $n_i$ STIP can be represented by a coefficient histogram $h(S_i)$ obtained by max-pooling the SR vectors of its $n_i$ feature descriptors. We train a linear SVM using the coefficient histograms $h(S_i)$, $i \in [1, N]$, extracted from all the training examples $S$. The training examples $S$ include positive and negative samples, where a sample is considered positive if it contains the action to detect and negative otherwise. Let $c$ and $\beta$ denote the weight and bias of the SVM, respectively. We compute a weight for each node $v \in V$ in the graph $G$ for a video sequence $Q$ as:

$$w_v = \beta + \sum_{x_l \in v} \sum_{k=1}^{K} c_k \alpha_l^k, \tag{2}$$

where $x_l$ is the $l$-th local feature descriptor falling within node $v$ of the graph $G$ constructed from the video sequence $Q$, and $\alpha_l^k$ is the $k$-th element of the SR $\alpha_l$ of $x_l$ obtained with the method of Section III-A. Fig. 2 shows a conceptual illustration of our node weighting scheme. Note that nodes with higher positive weights indicate a higher probability that the action occurs in that region, while smaller weights indicate a lower probability. Having defined a weight for each node in the graph, we can detect the regions of interest by searching for the regions with the highest sum of node weights.
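As a small illustration of the node weighting in Eq. 2, the sketch below computes the max-pooled coefficient histogram used for SVM training and the weight of a single node. The function names, the toy sparse codes, and the SVM parameters are made up for the example; only the two formulas come from the text.

```python
def max_pool(sparse_codes):
    """Coefficient histogram: element-wise maximum over equal-length sparse codes."""
    return [max(values) for values in zip(*sparse_codes)]


def node_weight(sparse_codes, svm_weights, svm_bias):
    """Eq. 2: SVM bias plus the SVM-weighted sum of every sparse code in the node."""
    total = svm_bias
    for alpha in sparse_codes:
        total += sum(c_k * a_k for c_k, a_k in zip(svm_weights, alpha))
    return total


# Toy data (illustrative only): two local features with K = 3 dictionary atoms.
codes_in_node = [[0.0, 0.5, 0.0], [0.2, 0.0, 0.0]]
c = [1.0, -2.0, 0.5]   # hypothetical SVM weight vector
beta = 0.1             # hypothetical SVM bias

print(max_pool(codes_in_node))
print(node_weight(codes_in_node, c, beta))
```

In this toy case the second dictionary atom carries a negative SVM weight, so the node ends up with a negative weight even though it contains features; a node with no features simply receives the bias.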
Scoring by the sum of node weights lets us evaluate an arbitrarily shaped set of nodes where action occurs. In this context, we apply a detection approach based on an MWC search, as presented in Section III-C.

3) Edge structure: The linking strategy between nodes affects both the shapes of candidate subgraphs and the search cost. In general, each node is linked with its 4-connected neighboring nodes. However, since our detection approach is based on a clique search, this edge structure is not sufficient for finding maximum cliques (Section III-C). Hence, we additionally include three types of edges: (1) 8-connected edges in the spatial dimension, (2) 8-connected edges in the temporal dimension, and (3) jump edges to the second adjacent neighbor in the temporal dimension. Chen and Grauman [7] show that the jump edge makes it possible to ignore misleading features that may interrupt an otherwise good instance of an action. Since realistic videos tend to contain noisy elements such as background clutter or camera motion, a graph with jump edges can be useful for more robust detection.

Fig. 3 shows an example of our linking structure. Our additional edges connect each node to more neighboring nodes, including its 8-connected neighbors, which expands the space of candidate subvolumes because more cliques are searched. Fig. 4 shows an example of the strength of our additional edges. An MWC with greater weight can be found in the graph with 8-connected edges.

Fig. 3: Conceptual illustration of our edge structure. The black lines indicate general 4-connected edges. The green line indicates the jump edge. The red lines are 8-connected edges in the spatial dimension, and the blue lines are 8-connected edges in the temporal dimension.

Fig. 4: This example illustrates the strength of our linking structure. (a) 4-connected neighbors only, (b) 4- and 8-connected neighbors. The yellow nodes belong to an MWC. The MWC weight of graph (a) is (4+3), and the MWC weight of graph (b) is ( ).
Consequently, the additional edges provide stronger localization in the detection problem. One could consider the complete graph, in which every pair of nodes is connected, to search for the MWC with the greatest weight. However, this graph yields a very large MWC that includes noisy nodes with positive weights, and distant nodes can be included in an MWC. Even if we limit the complete graph to a local region, its resulting MWC still includes noisy nodes. For example, if the graph in Fig. 4(b) is complete, the resulting MWC includes a node with weight 5. However, this node has neighboring nodes with negative weights and, consequently, is likely to be noise. We demonstrate this case through experiments.
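The edge structure can be sketched as follows for a grid of (row, column, time) nodes. The text does not spell out exactly which node pairs each edge type connects, so the interpretation coded here is an assumption: spatial edges link the 8 neighbours inside one time slice (which covers the 4-connected ones), temporal edges link a cell to the same cell and its 8 spatial neighbours in the next time slice, and a jump edge links a cell to the same cell two slices ahead. The function name `build_edges` is likewise hypothetical.

```python
from itertools import product


def build_edges(rows, cols, steps):
    """Return a set of undirected edges between (row, col, time) nodes.

    Assumed interpretation of the edge types (see the lead-in):
    8-connected spatial edges, temporal edges to the 3x3 neighbourhood
    in the next slice, and jump edges two slices ahead.
    """
    def inside(r, c, t):
        return 0 <= r < rows and 0 <= c < cols and 0 <= t < steps

    edges = set()
    for r, c, t in product(range(rows), range(cols), range(steps)):
        offsets = []
        for dr, dc in product((-1, 0, 1), repeat=2):
            if (dr, dc) != (0, 0):
                offsets.append((dr, dc, 0))  # spatial edge, same slice
            offsets.append((dr, dc, 1))      # temporal edge, next slice
        offsets.append((0, 0, 2))            # jump edge over one slice
        for dr, dc, dt in offsets:
            neighbour = (r + dr, c + dc, t + dt)
            if inside(*neighbour):
                # frozenset makes (u, v) and (v, u) the same undirected edge.
                edges.add(frozenset({(r, c, t), neighbour}))
    return edges


print(len(build_edges(rows=3, cols=3, steps=4)))
```

Storing each edge as a `frozenset` keeps the graph undirected without duplicate pairs, which matters later because the smoothness term of the energy counts each edge once.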

C. Searching for the MWCS for Action Detection

Given an ST-graph $G = (V, E)$ with weighted nodes, we need to find the subgraph $G^*$ satisfying Eq. 3:

$$G^* = \arg\max_{G' \subseteq G} \sum_{v \in V'} w_v, \tag{3}$$

where $G' = (V' \subseteq V, E' \subseteq E)$ can be any connected subgraph of $G$, and $G^*$ is the connected subgraph with the highest sum of node weights. Since each node has a learned weight indicating the probability of the action occurring, the action region is the set of connected nodes whose total weight is maximal. Therefore, the action detection problem can be considered as finding the maximum subgraph in the entire graph representing an action video. This means the action detection problem can be solved as an MWCSP.

If all node weights are positive, an optimal solution is easily computed by determining any spanning tree. However, a node weight in our graph can be either positive or negative. Therefore, the MWCSP is NP-complete [33], i.e., no efficient algorithm is known to solve it. Dittrich et al. [34] transform the MWCSP into the prize-collecting Steiner tree problem (PCST) to identify the functional modules in protein-protein interaction networks. In [34], a graph is transformed into a directed graph, and integer linear programming (ILP) is performed on the transformed directed graph with binary variables for each node and edge. Finally, the problem can be solved with a linear programming-based branch-and-bound algorithm. Chen and Grauman [7] apply the same method [34] to ST-graphs. Their max-subgraph approach seeks the subvolume that maximizes the action classifier's output.

We propose a novel method to solve the MWCSP by defining an optimization problem based on an MWC in small graph units. Our approach is inspired by the works of Shervashidze et al. [31] and Levi [32]. First, Shervashidze et al. [31] present a graph kernel based on counting and sampling the subgraphs of a limited size in the entire graph, which they called graphlets.
Their sampling scheme allows them to compute the graph kernels on graphs of sizes that are beyond the scope of the state-of-the-art methods. Similarly, since it is very challenging to obtain information about the MWCS for the entire graph, our approach divides the graph into smaller graph units and computes information about the MWCS for each unit. Next, Levi [32] shows that each MWC in the product graph is associated with a maximum common subgraph. The MWC problem is then to determine the maximum clique of an arbitrary undirected graph. Levi's method enumerates all common subgraph isomorphisms by enumerating the cliques of the product graph, i.e., it solves the problem by splitting the subgraph into several cliques. In this paper, we exploit the MWC information found in the smaller subgraphs in order to solve the MWCSP of the graph. Instead of enumerating the raw MWCs, we use the nodes of the MWCs together with their weights to determine the optimized solution of the MWCSP. We compare the experimental results of these two approaches: enumerating MWCs and solving a linear system based on the MWC.

Next, we describe the solution to the MWCSP in the ST-graph for a test video sequence. Our approach consists of three steps: (1) finding the MWC for each subgraph (graph unit), (2) identifying candidates of the MWCS by solving an optimization problem based on MWC information, and (3) selecting the resulting MWCS based on the top-scoring detection.

Let the graph $G = \{g_i\}$ consist of multiple graph elements $g_i$, where $g_i$ is the set of nodes at the same temporal position. We also define a graph unit as a set of $n$ graph elements. We divide the entire graph $G$ into $m$ overlapping graph units and determine the MWC of every graph unit. Since the MWC problem is NP-hard [35], many works have solved it using heuristic approaches. We solve the MWC problem using a method by Östergård [35].
This method is a branch-and-bound algorithm that exploits a node order based on a coloring of the nodes together with a pruning strategy. Once the MWCs $C = \{c_i\}$ are obtained from all graph units, we use an optimization approach to search for an MWCS, that is, a localized ST region containing action. Namely, we minimize the following energy function:

$$E(G) = \sum_{u \in V} (x_u - b_u)^2 + \lambda \sum_{(u,v) \in E} (x_u - x_v)^2, \tag{4}$$

$$b_u = \begin{cases} w_u + \sigma\, w(c_i) & \text{if } u \in c_i \\ w_u & \text{otherwise,} \end{cases} \tag{5}$$

where $x_u$ indicates the likelihood that node $u$ belongs to the MWCS, and $b_u$ is determined by the type of node $u$. If node $u$ is part of an MWC $c_i$, $b_u$ is computed by summing the node weight $w_u$ and the scaled MWC weight $\sigma\, w(c_i)$. Otherwise, $b_u$ is determined by the node weight $w_u$ alone. The first term of Eq. 4 is the data term, which encourages the MWCS value $x_u$ to be similar to $b_u$. Hence, a node with a greater weight and a greater MWC weight receives a greater MWCS value. The second term of Eq. 4 is the smoothness term, which yields a connected maximum weight subgraph: it minimizes the difference in MWCS value between node $u$ and each connected neighboring node $v$. The parameter $\lambda$ is set to 1, and the parameter $\sigma$ is experimentally determined as $1/|c_i|$.

Optimization: The energy function in Eq. 4 is an optimization problem that can be represented as a linear system $Ax = b$. In other words, by setting the first-order derivative of $E(G)$ in Eq. 4 to zero and arranging $x_u$ for all nodes $u$ in a vector $x$, Eq. 4 can be written in the matrix form $Ax = b$, where $A$ is a $|V| \times |V|$ matrix and $b$ is a vector in which each element corresponds to $b_u$ of Eq. 5. We obtain the solution $x$ by solving the linear system. In the solution $x$, each element $x_u$ is the probability that node $u$ belongs to the MWCS, i.e., a larger value indicates a higher likelihood of action occurrence.
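One way to see the linear system is to differentiate the energy with respect to a single $x_u$, counting each undirected edge once: $2(x_u - b_u) + 2\lambda \sum_{v \in N(u)} (x_u - x_v) = 0$, which rearranges to $(1 + \lambda d_u)\,x_u - \lambda \sum_{v \in N(u)} x_v = b_u$, where $N(u)$ and $d_u$ denote the neighbours and the degree of $u$. Under that reading, $A$ is the identity plus $\lambda$ times the graph Laplacian. The sketch below assembles and solves this system for a toy chain of four nodes. The helper names, the dense Gaussian elimination, and the rule that contributions add up when a node lies in several MWCs are assumptions made for illustration; the clique weight $w(c_i)$ is taken to be the sum of the node weights in the clique.

```python
def build_system(weights, edges, cliques, lam=1.0):
    """Assemble A = I + lam * L and b (Eq. 5) for nodes 0..n-1.

    weights: node weights w_u; edges: undirected (u, v) pairs, each listed once;
    cliques: one list of node indices per MWC.
    Assumption: a node lying in several MWCs accumulates all their terms.
    """
    n = len(weights)
    A = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]  # data term
    for u, v in edges:  # smoothness term adds lam times the graph Laplacian
        A[u][u] += lam
        A[v][v] += lam
        A[u][v] -= lam
        A[v][u] -= lam
    b = [float(w) for w in weights]
    for clique in cliques:
        clique_weight = sum(weights[u] for u in clique)
        sigma = 1.0 / len(clique)  # sigma = 1 / |c_i|
        for u in clique:
            b[u] += sigma * clique_weight
    return A, b


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


# Toy chain 0-1-2-3 whose first two nodes form an MWC (illustrative values).
A, b = build_system([3.0, 2.0, -1.0, -4.0], [(0, 1), (1, 2), (2, 3)], [[0, 1]])
print(solve(A, b))
```

Because $A$ is strictly diagonally dominant, the system always has a unique solution. For graphs of realistic size a sparse solver would be the natural choice; the dense elimination here only keeps the example dependency-free.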
Next, we determine the candidates of the MWCS MS = {ms i } from the following condition: ms i = { u j x uj max (x) max (x) /2, (u j, u j+1 ) E }. (6)
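The optimization step can be made concrete with a small sketch. Setting the derivative of Eq. 4 with respect to $x_u$ to zero gives $(1 + \lambda d_u)\, x_u - \lambda \sum_{v \in N(u)} x_v = b_u$, where $d_u$ is the degree of $u$ and $N(u)$ its neighbors, so that $A = I + \lambda L$ with $L$ the graph Laplacian. The code below is a sketch under stated assumptions, not the paper's implementation: each undirected edge is assumed to appear once in the smoothness sum, the clique weight $w(c_i)$ is assumed to be the sum of its node weights, a node covered by several cliques keeps its largest data term, Eq. 6 is read as "connected groups of nodes whose score is at least half the maximum", and the system is solved with a simple Gauss-Seidel iteration (any linear solver would do). The names `data_term`, `solve_energy`, and `candidates` are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def data_term(w: Dict[int, float], cliques: List[List[int]]) -> Dict[int, float]:
    """b_u of Eq. 5 with sigma = 1/|c_i| and w(c_i) = sum of node weights (assumed)."""
    b = dict(w)
    for clique in cliques:
        bonus = sum(w[u] for u in clique) / len(clique)
        for u in clique:
            b[u] = max(b[u], w[u] + bonus)
    return b


def solve_energy(b: Dict[int, float], edges: List[Edge], lam: float = 1.0,
                 iters: int = 500, tol: float = 1e-10) -> Dict[int, float]:
    """Solve (I + lam * L) x = b by Gauss-Seidel; A is strictly diagonally dominant."""
    nbrs: Dict[int, set] = {u: set() for u in b}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    x = dict(b)
    for _ in range(iters):
        delta = 0.0
        for u in b:
            new = (b[u] + lam * sum(x[v] for v in nbrs[u])) / (1.0 + lam * len(nbrs[u]))
            delta = max(delta, abs(new - x[u]))
            x[u] = new
        if delta < tol:
            break
    return x


def candidates(x: Dict[int, float], edges: List[Edge]) -> List[List[int]]:
    """Eq. 6: keep nodes with x_u >= max(x) - max(x)/2 and group connected ones."""
    top = max(x.values())
    keep = {u for u, val in x.items() if val >= top - top / 2.0}
    nbrs: Dict[int, set] = {u: set() for u in keep}
    for u, v in edges:
        if u in keep and v in keep:
            nbrs[u].add(v)
            nbrs[v].add(u)
    groups, seen = [], set()
    for start in sorted(keep):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            cur = stack.pop()
            comp.append(cur)
            for v in nbrs[cur]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        groups.append(sorted(comp))
    return groups


# Toy chain 0-1-2-3-4 with one clique {1, 2} (hypothetical data).
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
w = {0: 0.1, 1: 2.0, 2: 2.0, 3: 0.1, 4: 0.1}
b = data_term(w, [[1, 2]])
x = solve_energy(b, edges, lam=1.0)
groups = candidates(x, edges)
print(x, groups)
```

In this toy example the smoothness term spreads the high scores of the clique nodes to their neighbors, so a weak node adjacent to the clique can still pass the threshold while distant nodes do not.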
Fig. 5: A conceptual illustration of our MWCS search method. (a) Example of an MWC search result: yellow nodes and colored lines comprise the MWCs. (b) Example of the result after applying Eq. 6: two MWCS candidates are found from an optimization based on MWC information.

Algorithm 1: Maximum weight connected subgraph (MWCS) search for multiple detection.
Data: a graph $G = (V, E) = \{g_i,\ i = 1, \ldots, m\}$
Result: a set of MWCSs $D = \{G^*\}$, $G^* = \arg\max_{G' \subseteq G} \sum_{v \in V'} w_v$
  set $C = \emptyset$
  for $i = 1$ to $m$ do
    search the MWC $c_i$ for each $g_i$ using the algorithm of [35]
    append $c_i$ to $C$
  end
  while $MS \neq \emptyset$ do
    compute $A$ and $b$ according to Eq. 4 and Eq. 5
    solve $Ax = b$
    identify the candidates of the MWCS, $MS$, according to Eq. 6
    choose the candidate $ms$ with the largest weight in $MS$
    set the weights of all nodes $u \in ms$ to $-w_u$
    append $ms$ to $D$
  end
  return $D$

Fig. 5 shows a conceptual illustration of our MWCS search method. Note that some edges are omitted for clarity. Suppose we use a graph unit of size 2. We first search for the MWC of each graph unit, as shown in Fig. 5(a). Yellow nodes are the nodes included in an MWC, and each clique drawn with colored lines (red, green, blue, orange) is an MWC. In other words, the red MWC is found in the first graph unit, which consists of graph elements $g_1$ and $g_2$, and the blue MWC is found in the second graph unit, which consists of $g_2$ and $g_3$. In this way, we search for the MWCs of all graph units within the whole graph; four MWCs are found in this example. Based on the MWCs $C$, we construct the energy function of Eq. 4 and solve the optimization problem as described above. After applying Eq. 6 to the optimization result, we obtain the MWCS candidates $MS$ (red and blue subgraphs), as shown in Fig. 5(b), and apply the top-scoring detection to obtain the final result.
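The outer loop of Algorithm 1, which returns several detections by repeatedly selecting the top-scoring candidate and suppressing its nodes, can be sketched as below. This is a hedged illustration only: `detect_multiple` and `positive_runs` are hypothetical names, the candidate finder is passed in as a callable so that the loop stays independent of the optimization step, suppression by negating the node weights is an assumption of this sketch, and the toy candidate finder (runs of positive-weight nodes on a temporal chain) merely stands in for the clique-based optimization.

```python
from typing import Callable, Dict, List

Weights = Dict[int, float]
Finder = Callable[[Weights], List[List[int]]]


def detect_multiple(weights: Weights, find_candidates: Finder,
                    max_detections: int = 5) -> List[List[int]]:
    """Repeatedly pick the heaviest candidate and suppress its nodes."""
    w = dict(weights)  # work on a copy so the caller's weights are untouched
    detections: List[List[int]] = []
    while len(detections) < max_detections:
        cands = find_candidates(w)
        if not cands:
            break
        best = max(cands, key=lambda ms: sum(w[u] for u in ms))
        if sum(w[u] for u in best) <= 0:
            break  # nothing with positive evidence is left
        detections.append(best)
        for u in best:
            # Assumed suppression rule: make the node unattractive next round.
            w[u] = -abs(w[u])
    return detections


def positive_runs(w: Weights) -> List[List[int]]:
    """Toy stand-in for the MWCS candidate search on a 1-D temporal chain."""
    runs, cur = [], []
    for u in sorted(w):
        if w[u] > 0:
            cur.append(u)
        elif cur:
            runs.append(cur)
            cur = []
    if cur:
        runs.append(cur)
    return runs


weights = {0: -1.0, 1: 2.0, 2: 3.0, 3: -1.0, 4: 1.0, 5: 1.0, 6: -2.0}
detections = detect_multiple(weights, positive_runs)
print(detections)
```

Passing the candidate finder as a parameter keeps the sketch short; in a full pipeline it would wrap the clique search, the linear solve, and the thresholding of Eq. 6.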
The condition in Eq. 6 yields one or more candidate MWCSs, depending on the number of actions in the test video sequence. We first choose the candidate with the largest weight. To return multiple top-scoring detections, we apply a method similar to that in [13], [7]: we run the MWCS algorithm iteratively, setting the weights of the nodes of the MWCS found in the previous iteration to $-w_u$. Algorithm 1 provides the pseudocode for our MWCS algorithm.

IV. EXPERIMENTS

In this section, we evaluate the proposed method on two datasets: the uncropped Hollywood action videos [36] and the MSR actions [13]. Both datasets contain actions with dynamic occlusions against complex, moving backgrounds in real-world environments. We extract STIPs and their descriptors using the method of [36]. To compute the node weight for each action, we train a linear SVM.

Fig. 6: Examples of the shape of the detection result for the different methods (panels: T-Sliding, ST-Sliding, Subvolume, Subgraph, Our result).

We employ the mean overlap accuracy as the evaluation metric: the intersection of the predicted detection region with the ground truth, divided by their union. We use detection time to evaluate computational cost, and compare it with the detection times of the baselines measured on a system equipped with a 3.40GHz Intel Core i CPU.

We compare our method with three state-of-the-art baselines:
(1) Sliding: The sliding window is a standard and popular action detection method used in several works [1], [4], [5]. The temporal sliding window is used for temporal detection, and the spatio-temporal (ST) sliding window is used for spatio-temporal detection. The ST sliding window is a variant of the temporal sliding window that searches ST subvolumes of cubic shape.
(2) Subvolume: The Cube-subvolume method [13] searches for the cube-shaped subvolume that maximizes the action-occurring probability. Hence, its spatial detection region is more flexible than that of the ST sliding window. However, spatial detection is restricted to one location.
(3) Subgraph: The subgraph method [7] allows spatial shifts over time, as does our method. However, it cannot detect a subvolume that consists of 8-connected neighboring nodes.

Note that our method produces more flexible detection regions than Subgraph [7] by allowing an 8-connected edge structure. Our method can detect most irregular and non-cubic shapes. Fig. 6 illustrates the shape of the detection results for the different methods.
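The mean overlap accuracy used as the evaluation metric in this section can be illustrated with a short sketch. It assumes that a detection and its ground truth are both represented as sets of cells: frame indices for temporal detection, or (frame, row, column) tuples for ST detection. The function names and the toy intervals are hypothetical.

```python
from typing import Hashable, List, Set, Tuple


def overlap_accuracy(pred: Set[Hashable], truth: Set[Hashable]) -> float:
    """Intersection of prediction and ground truth divided by their union."""
    union = pred | truth
    return len(pred & truth) / len(union) if union else 0.0


def mean_overlap_accuracy(pairs: List[Tuple[set, set]]) -> float:
    """Average the overlap accuracy over (prediction, ground truth) pairs."""
    return sum(overlap_accuracy(p, t) for p, t in pairs) / len(pairs)


# Temporal example: frames are half-open ranges [start, end).
video_1 = (set(range(10, 30)), set(range(20, 40)))   # partial overlap
video_2 = (set(range(0, 10)), set(range(0, 10)))     # perfect overlap
score = mean_overlap_accuracy([video_1, video_2])
print(score)
```

The same function applies unchanged to ST detection if each cell is a (frame, row, column) tuple, because only set intersection and union are needed.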
TABLE I: The mean temporal overlap accuracy results on the Hollywood dataset under the BoF feature scheme. Columns: Actions, Sliding, Subvolume [13], Subgraph [7], Ours. Rows: AnswerPhone, GetOutCar, HandShake, HugPerson, Kiss, SitDown, SitUp, StandUp, Average.

TABLE II: Comparison of detection accuracies according to the feature type in the Hollywood dataset. Columns: Actions, BoF, SR. Rows: AnswerPhone, GetOutCar, HandShake, HugPerson, Kiss, SitDown, SitUp, StandUp, Average.

A. Experiments on the Hollywood Actions Dataset

The Hollywood Actions Dataset [36] consists of videos collected from 32 different Hollywood movies, with a total of 663 video sequences from 8 action classes: answering the phone, getting out of the car, hand shaking, hugging, kissing, sitting down, sitting up, and standing up. The dataset provides uncropped and cropped versions of the sequences, i.e., videos containing extraneous frames and videos containing only the action of interest, respectively. Hence, we use the cropped version for training and the uncropped version for detection evaluation. We perform only temporal detection, because most actions occupy the entire frame.

Table I shows the temporal detection results with the same feature settings as in [7]. Our method achieves the best accuracy for 5 of the 8 action classes as well as for the average. This validates the superiority of our MWCS search method in detecting action regions. In addition, our method can estimate the probability of occurrence of actions for the detected region, in contrast to the MWCS search method of [7]. This property enables more robust detection with respect to the performance of the basic detector.

A comparative evaluation of the BoF- and SR-based methods for representing a video is shown in Table II. Our SR-based method achieves a better result than the BoF-based method overall. However, the SR-based method performs worse than the BoF-based method in the GetOutCar, HandShake, and SitUp classes.
Although the SR-based representation has been shown to overcome the drawbacks of the BoF representation, it also has drawbacks of its own. The sparse solution becomes denser when the image suffers from a large amount of random corruption or contiguous occlusion [39], [40]. Since some action videos in the GetOutCar, HandShake, and SitUp classes have these problems, the SR-based method performs worse than the BoF-based method on them.

Given MWC information, combining all neighboring MWCs is the most straightforward way to solve the MWCSP. In other words, continuously connected MWCs are gathered into larger graphs, and these gathered graphs become the candidate MWCSs. Among the candidates, the top-scoring one becomes the final MWCS. Fig. 7 compares the combination-based method and the proposed method for solving the MWCSP. Our method outperforms the combination-based method for all action classes. The MWCS obtained from the combination is likely to cover only one piece of the ground truth, because no MWC can be extracted from the noisy middle part. Our method defines a smoothness term in Eq. 4 to encourage the generation of a connected subgraph; consequently, it overcomes the shortcoming of the combination method and leads to better detection results.

Fig. 7: The mean temporal overlap accuracies for each action in the Hollywood action dataset using the combination-based method and our proposed method.

Table III shows the detection time on the Hollywood dataset. Although the Subgraph method [7] takes the least detection time, our method takes significantly less time than the Sliding and Subvolume [13] methods. Our method takes more time than Subgraph [7] because our graph is more complex than that of Subgraph [7]. However, our maximum subgraph search is more efficient than that of Subgraph [7] for large graphs, as shown in Table VI.

TABLE III: Detection time on the Hollywood dataset. Columns: Method, Detection time (sec). Rows: Sliding, Subvolume [13], Subgraph [7], Ours.

B. Experiments on the MSR Actions Dataset

The MSR Actions Dataset [13] contains multiple instances of different actions, such as boxing, handclapping, and handwaving. All of the sequences are captured with cluttered and moving backgrounds. Since the actors change their position over time, this dataset is well suited to evaluating ST detection. We perform both temporal and ST detection on this dataset. Following the detection method in [13], [7], we train detectors using the KTH dataset [37].

TABLE IV: The mean temporal overlap accuracy results for the MSR dataset. Columns: Actions, Sliding, Subvolume [13], Subgraph [7], Ours. Rows: Boxing, Clapping, Waving, Average.

TABLE V: The mean space-time overlap accuracy results for the MSR dataset. Columns: Actions, Sliding, Subvolume [13], Subgraph [7], Ours. Rows: Boxing, Clapping, Waving, Average.

Fig. 8: The mean temporal overlap accuracies for each action in the MSR action dataset when using the combination-based method and our method.

Fig. 9: The mean space-time overlap accuracies for each action in the MSR action dataset when using the combination-based method and our method.

Table IV shows the temporal detection results. Our method achieves the best accuracy for all action classes. In particular, our method significantly improves the accuracies for the boxing and clapping actions. The temporal detection results of the combination-based method and our method for solving the MWCSP are compared in Fig. 8. Our method achieves higher detection accuracy than the combination-based method for all action classes. The results show the strength of our optimization strategy, which considers both the maximum weight and the connectivity properties.

Table V shows the ST detection results. Although our results are slightly lower than those of the best-performing method for the boxing and waving actions, our method achieves the best average accuracy owing to its increased accuracy for the clapping action. Compared with the graph-based method of [7], our method achieves higher accuracies for all action classes. Note that, although the method of [7] achieves better results than the sliding window and [13] for temporal detection, it achieves lower accuracy for ST detection.
In contrast, our method achieves the best average accuracy for both temporal and ST detection. This validates the superiority of our graph structure and MWCS search method. Fig. 9 compares the combination-based method and our method for the MWCSP. Our method outperforms the combination-based method for all action classes, especially for the waving action. This demonstrates that our MWCS search method produces more detailed localization.

Fig. 10 shows the mean space-time overlap accuracy according to graph type and graph unit size. We use three types of graphs. A basic graph is a general graph with 4-connected edges, and a complete graph is a graph in which every pair of nodes within the graph unit is connected. Our graph contains 8-connected edges in the spatial and temporal dimensions, as well as 4-connected edges. Therefore, the graph complexity decreases in the order of the complete graph, our graph, and the basic graph. Our graph outperforms the basic graph and the complete graph for every graph unit size. We also observe that the complete graph performs poorly. This indicates that unnecessary edges cause errors, and that our linking strategy is suitable for solving the detection problem.

We also report the mean space-time overlap accuracy for various grid sizes on the MSR action dataset. Fig. 11 shows the results for a graph unit of size 2 and a temporal grid of size F = 10. A spatial grid of 1/3 of the frame dimensions gives the best average accuracy. One might expect a smaller grid size to give a more detailed detection shape and higher detection accuracy. However, a grid that is too small (1/6 of the frame dimensions) yields lower accuracy, as shown in Fig. 11. Note that our method finds the maximum subgraph based on node weights, which are computed from the weights of the local features falling in each node. As the node size becomes smaller, the number of local features within the node decreases. Consequently, it becomes difficult to aggregate statistics over the neighborhood of local features.

Fig. 12 shows the results for various temporal grid sizes with a graph unit of size 2 and a spatial grid of 1/3 of the frame dimensions. A temporal grid of size 20 gives the best average accuracy. Temporal grids that are too small or too large yield lower accuracy. Note that the proper grid size may differ depending on the data, in terms of the frame dimensions of the input video, the average size of the humans, and the average length of the actions.
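To make the three graph types compared in Fig. 10 concrete, the sketch below counts the edges of each type on a single 3 x 3 spatial grid, which corresponds to a spatial grid of 1/3 of the frame dimensions. It is a simplified illustration under assumptions: only the spatial edges within one frame are built (the graph in the paper also links nodes across time), and `grid_edges` is a hypothetical helper.

```python
from itertools import combinations
from typing import Set, Tuple

Cell = Tuple[int, int]


def grid_edges(rows: int, cols: int, diagonal: bool) -> Set[Tuple[Cell, Cell]]:
    """Undirected edges of a grid: 4-connected, or 8-connected if diagonal=True."""
    offsets = [(0, 1), (1, 0)]           # right and down neighbors
    if diagonal:
        offsets += [(1, 1), (1, -1)]     # the two downward diagonals
    edges = set()
    for r in range(rows):
        for c in range(cols):
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    edges.add(((r, c), (rr, cc)))
    return edges


rows, cols = 3, 3
nodes = [(r, c) for r in range(rows) for c in range(cols)]
basic = grid_edges(rows, cols, diagonal=False)       # basic graph
eight = grid_edges(rows, cols, diagonal=True)        # 8-connected graph
complete = set(combinations(nodes, 2))               # complete graph
print(len(basic), len(eight), len(complete))
```

The edge counts reproduce the complexity ordering stated above: the complete graph has the most edges, the 8-connected graph lies in between, and the basic graph has the fewest.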
Fig. 10: The mean space-time overlap accuracies for the MSR action dataset according to graph type and graph unit size (axes: graph unit size vs. mean overlap accuracy).

Fig. 11: The mean space-time overlap accuracy results according to various spatial grid sizes for the MSR action dataset.

Fig. 12: The mean space-time overlap accuracy results according to various temporal grid sizes for the MSR action dataset.

Table VI shows the detection time on the MSR dataset. Our method achieves the least detection time for spatio-temporal detection compared with ST-Sliding, Subvolume, and Subgraph. Note that the detection time on the MSR dataset is longer than that on the Hollywood dataset, because the videos in the MSR dataset are longer on average. In particular, the Subgraph method [7] takes a long detection time on the MSR dataset because it requires optimization over a large graph. In contrast, our method takes much less time than Subgraph [7], although both methods solve the MWCSP. This is because our method first searches for the MWC within graph units that are smaller than the entire graph and then solves a simple linear system based on the MWC information. Therefore, our optimization method takes less detection time than the method of Subgraph [7] even as the video length increases.

TABLE VI: Detection time on the MSR dataset. Columns: Method, Detection time (sec). Rows: Sliding, Subvolume [13], Subgraph [7], Ours.

We compare our method with the method of Lan et al. [41] on the MSR dataset. For the appearance feature $x_i$, we use the BoF feature scheme based on HoGHoF descriptors, as in our method. Table VII compares the spatio-temporal detection results. Our method outperforms the method of Lan et al. [41] for all actions without a human detection procedure. Lan et al. [41] produces poor detection results compared with the other baselines as well as our method. This is because their method is not appropriate for the MSR dataset, which contains multiple actions in the same scene and has no rule about location variation. In other words, their method can hardly benefit from the pairwise potential and the global action potential in their potential function.

TABLE VII: Comparison of mean space-time overlap accuracy results between Lan et al. [41] and our method. Columns: Actions, Lan et al. [41], Ours. Rows: Boxing, Clapping, Waving, Average.

Fig. 13 shows examples of the detection results on the MSR action dataset. We compare the detection results of the graph of [7] (green rectangles) and our graph (yellow rectangles). In the third row, the previous graph fails to detect any region in the 2nd, 3rd, and 4th images, while our graph detects the action regions in all the images. Our graph detects not only more temporal action frames but also a more detailed spatial action region within each frame.

Fig. 13: Example of detection results using the graph of [7] and our graph. The rows show the detection results for the waving, boxing, and clapping actions, respectively, on the MSR action dataset. In each row, the first 4 images are detection results of the graph of [7] and the remaining 4 images are detection results of our graph. Red rectangles denote the ground truth, green rectangles denote the result of the graph of [7], and yellow rectangles denote the result of our graph.

V. CONCLUSION

We presented an optimization approach to identify the maximum subgraph of the ST-graph for action detection. Our energy function is defined based on maximum cliques and includes the maximum and connectivity properties for finding the maximum subgraph. The energy function is formulated as a linear system, and its solution gives the probability of each node being an action node. Our graph structure and optimization method efficiently solve the detection problem by applying the clique-based approach and a simple linear system solver. We showed through experiments with various graph structures that our graph and optimization method produce more accurate localization. We also compared our method with state-of-the-art methods on real-world action datasets.

ACKNOWLEDGMENT

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MEST) (No. NRF-2013R1A2A1A ).

REFERENCES

[1] Y. Ke, R. Sukthankar, and M. Hebert, Efficient visual event detection using volumetric features, in Proc. ICCV, pp , [2] L. Zelnik-Manor and M. Irani, Statistical analysis of dynamic actions, in IEEE Trans. on PAMI, Vol. 28, No. 9, pp , [3] Y. Ke, R. Sukthankar, and M. Hebert, Event detection in crowded videos, in Proc. ICCV, pp. 1-8, [4] O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce, Automatic annotation of human actions in video, in Proc. ICCV, pp , [5] S. Satkin and M. Hebert, Modeling the temporal extent of actions, in Proc. ECCV, pp , [6] A. Klaser, M. Marszalek, C. Schmid, and A. Zisserman, Human focused action localization in video, in Proc. ECCV, pp , [7] C. Y. Chen and K. Grauman, Efficient activity detection with max-subgraph search, in Proc. CVPR, pp , [8] K. Mikolajczyk and H. Uemura, Action recognition with motion-appearance vocabulary forest, in Proc. CVPR, pp. 1-8, [9] M. S. Ryoo and J. K. Aggarwal, Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities, in Proc. ICCV, pp , [10] A. Oikonomopoulos, I. Patras, and M.
Pantic, Spatiotemporal localization and categorization of human actions in unsegmented image sequences, in IEEE Trans. on Image Processing, Vol. 20, No. 4, pp , [11] A. Yao, J. Gall, and L. V. Gool, A hough transform-based voting framework for action recognition, in Proc. CVPR, pp , [12] G. Willems, J. H. Becker, T. Tuytelaars, and L. V. Gool, Exemplar-based action recognition in video, in Proc. BMVC, pp. 1-11, [13] J. Yuan, Z. Liu, and Y. Wu, Discriminative subvolume search for efficient action detection, in Proc. CVPR, pp , [14] L. Cao, Z. Liu, and T. S. Huang, Cross-dataset action detection, in Proc. CVPR, pp , [15] F. Zhou, F. De La Torre, and J. K. Hodgins, Aligned cluster analysis for temporal segmentation of human motion, in Proc. FGR, pp. 1-7, [16] M. Hoai, Z. Z. Lan, and F. De La Torre, Joint segmentation and classification of human actions in video, in Proc. CVPR, pp , [17] A. P. Ta, C. Wolf, G. Lavoue, and A. Baskurt, Recognizing and localizing individual activities through graph matching, in Proc. AVSS, pp , [18] U. Gaur, Y. Zhu, B. Song, and A. Roy-Chowdhury, A String of feature graphs model for recognition of complex activities in natural videos, in Proc. ICCV, pp , [19] W. Brendel and S. Todorovic, Learning spatiotemporal graphs of human activities, in Proc. ICCV, pp , [20] O. Celiktutan, C. Wolf, B. Sankur, and E. Lombardi, Real-time exact graph matching with application in human action recognition, in Proc. ICHBU, pp , [21] J. Liu, S. Ali, and M. Shah, Recognizing human actions using multiple features, in Proc. CVPR, pp. 1-8, [22] E. Z. Borzeshi, M. Piccardi, and R. Y. D. Xu, A discriminative prototype selection approach for graph embedding in human action recognition, in Proc. ICCVW, pp , [23] E. Z. Borzeshi, M. Piccardi, and R. Y. D. Xu, Learning sparse representations for human action recognition, in IEEE Trans on PAMI, pp , [24] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. 
Ma, Robust face recognition via sparse representation, in IEEE Trans on PAMI, pp , [25] K. Guo, P. Ishwar, and J. Konrad, Action recognition using sparse representation on covariance manifolds of optical flow, in Proc. AVSS, pp , [26] C. Liu, Y. Yang, and Y. Chen, Human action recognition using sparse representation, in Proc. ICIS, pp , [27] S. R. Fanello, I. Gori, G. Metta, and F. Odone, Keep it simple and sparse: real-time action recognition, in JMLR, Vol. 14, pp , [28] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, Online dictionary learning for sparse coding, in Proc. ICML, pp , [29] M. Rodriguez, J. Ahmed, and M. Shah, Action MACH: A spatiotemporal maximum average correlation height filter for action recognition, in Proc. CVPR, pp. 1-8, [30] K. Derpanis, M. Sizintsev, K. Cannons, and R. Wildes, Efficient action spotting based on a spacetime oriented structure representation, in Proc. CVPR, pp , [31] N. Shervashidze, S. V. N. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt, Efficient graphlet kernels for large graph comparison, in Proc. AISTATS, pp , [32] G. Levi, A note on the derivation of maximal common subgraphs of two directed or undirected graphs, in Calcolo, Vol. 9, No. 4, pp , [33] T. Ideker, O. Ozier, B. Schwikowski, and A. F. Siegel, Discovering regulatory and signalling circuits in molecular interaction networks, in Bioinformatics, Vol. 18, pp , [34] M. T. Dittrich, G. W. Klau, A. Rosenwald, T. Dandekar, and T. Müller, Identifying functional modules in protein-protein interaction networks: an integrated exact approach, in Bioinformatics, Vol. 24, No. 13, pp.
[35] P. R. J. Östergård, A fast algorithm for the maximum clique problem, in Discrete Applied Mathematics, Vol. 120, No. 1-3, pp , [36] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, Learning realistic human actions from movies, in Proc. CVPR, pp. 1-8, [37] C. Schuldt, I. Laptev, and B. Caputo, Recognizing human actions: A local svm approach, in Proc. ICPR, pp , [38] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, Least angle regression, in Annals of statistics, Vol. 32, pp , [39] E. J. Candès, J. K. Romberg, and T. Tao, Stable signal recovery from incomplete and inaccurate measurements, in Communications on Pure and Applied Mathematics, Vol. 59, No. 8, pp , [40] A. Martinez, Recognizing imprecisely localized, partially occluded, and expression variant faces from a single sample per class, in IEEE Trans on PAMI, Vol. 24, No. 6, pp , [41] T. Lan, Y. Wang, and G. Mori, Discriminative figure-centric models for joint action localization and recognition, in Proc. ICCV, pp , [42] N. Shapovalova, A. Vahdat, K. Cannons, T. Lan, and G. Mori, Similarity constrained latent support vector machine: an application to weakly supervised action classification, in Proc. ECCV, pp , [43] Y. Kong, Y. Jia, and Y. Fu, Learning human interaction by interactive phrases, in Proc. ECCV, pp , [44] M. Hoai and F. De la Torre, Max-margin early event detectors, in Proc. CVPR, pp , [45] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large margin methods for structured and interdependent output variables, in Journal of Machine Learning Research, Vol. 6, pp ,

Sunyoung Cho received the B.S. degree in computer science from Sookmyung Women's University, Seoul, Korea, in 2007, and the M.S. and Ph.D. degrees in computer science from Yonsei University, Seoul, Korea, in 2009 and 2014, respectively. She is currently a Post-Doctoral Researcher with the Human-Computer Interaction Institute at Carnegie Mellon University, Pittsburgh, PA, USA.
Her research interests include computer vision, image and video processing, machine learning, and human-computer interaction. Hyeran Byun received the B.S. and M.S. degrees in mathematics from Yonsei University, Seoul, Korea, and the Ph.D. degree in computer science from Purdue University, West Lafayette, IN, USA. She was an Assistant Professor at Hallym University, Chooncheon, Korea, from 1994 to She is currently a Professor of Computer Science at Yonsei University. She served as president of Artificial Intelligence Society in KIISE (Korean Institute of Information Scientists and Engineers) from Mar 2013 to Feb Her research interests include computer vision, image and video processing, artificial intelligence, event recognition, gesture recognition, and pattern recognition.
More informationStructured Models in. Dan Huttenlocher. June 2010
Structured Models in Computer Vision i Dan Huttenlocher June 2010 Structured Models Problems where output variables are mutually dependent or constrained E.g., spatial or temporal relations Such dependencies
More informationMultiple Kernel Learning for Emotion Recognition in the Wild
Multiple Kernel Learning for Emotion Recognition in the Wild Karan Sikka, Karmen Dykstra, Suchitra Sathyanarayana, Gwen Littlewort and Marian S. Bartlett Machine Perception Laboratory UCSD EmotiW Challenge,
More informationClass 9 Action Recognition
Class 9 Action Recognition Liangliang Cao, April 4, 2013 EECS 6890 Topics in Information Processing Spring 2013, Columbia University http://rogerioferis.com/visualrecognitionandsearch Visual Recognition
More informationLinear combinations of simple classifiers for the PASCAL challenge
Linear combinations of simple classifiers for the PASCAL challenge Nik A. Melchior and David Lee 16 721 Advanced Perception The Robotics Institute Carnegie Mellon University Email: melchior@cmu.edu, dlee1@andrew.cmu.edu
More informationVisuelle Perzeption für Mensch- Maschine Schnittstellen
Visuelle Perzeption für Mensch- Maschine Schnittstellen Vorlesung, WS 2009 Prof. Dr. Rainer Stiefelhagen Dr. Edgar Seemann Institut für Anthropomatik Universität Karlsruhe (TH) http://cvhci.ira.uka.de
More informationEstimating Human Pose in Images. Navraj Singh December 11, 2009
Estimating Human Pose in Images Navraj Singh December 11, 2009 Introduction This project attempts to improve the performance of an existing method of estimating the pose of humans in still images. Tasks
More informationMobile Human Detection Systems based on Sliding Windows Approach-A Review
Mobile Human Detection Systems based on Sliding Windows Approach-A Review Seminar: Mobile Human detection systems Njieutcheu Tassi cedrique Rovile Department of Computer Engineering University of Heidelberg
More informationCLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS
CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of
More informationLast week. Multi-Frame Structure from Motion: Multi-View Stereo. Unknown camera viewpoints
Last week Multi-Frame Structure from Motion: Multi-View Stereo Unknown camera viewpoints Last week PCA Today Recognition Today Recognition Recognition problems What is it? Object detection Who is it? Recognizing
More informationP-CNN: Pose-based CNN Features for Action Recognition. Iman Rezazadeh
P-CNN: Pose-based CNN Features for Action Recognition Iman Rezazadeh Introduction automatic understanding of dynamic scenes strong variations of people and scenes in motion and appearance Fine-grained
More informationPattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition
Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant
More informationObject Category Detection. Slides mostly from Derek Hoiem
Object Category Detection Slides mostly from Derek Hoiem Today s class: Object Category Detection Overview of object category detection Statistical template matching with sliding window Part-based Models
More informationCS 664 Segmentation. Daniel Huttenlocher
CS 664 Segmentation Daniel Huttenlocher Grouping Perceptual Organization Structural relationships between tokens Parallelism, symmetry, alignment Similarity of token properties Often strong psychophysical
More informationRobotics Programming Laboratory
Chair of Software Engineering Robotics Programming Laboratory Bertrand Meyer Jiwon Shin Lecture 8: Robot Perception Perception http://pascallin.ecs.soton.ac.uk/challenges/voc/databases.html#caltech car
More informationCS229: Action Recognition in Tennis
CS229: Action Recognition in Tennis Aman Sikka Stanford University Stanford, CA 94305 Rajbir Kataria Stanford University Stanford, CA 94305 asikka@stanford.edu rkataria@stanford.edu 1. Motivation As active
More informationDeformable Part Models
CS 1674: Intro to Computer Vision Deformable Part Models Prof. Adriana Kovashka University of Pittsburgh November 9, 2016 Today: Object category detection Window-based approaches: Last time: Viola-Jones
More informationClassification of objects from Video Data (Group 30)
Classification of objects from Video Data (Group 30) Sheallika Singh 12665 Vibhuti Mahajan 12792 Aahitagni Mukherjee 12001 M Arvind 12385 1 Motivation Video surveillance has been employed for a long time
More informationDet De e t cting abnormal event n s Jaechul Kim
Detecting abnormal events Jaechul Kim Purpose Introduce general methodologies used in abnormality detection Deal with technical details of selected papers Abnormal events Easy to verify, but hard to describe
More informationSeparating Objects and Clutter in Indoor Scenes
Separating Objects and Clutter in Indoor Scenes Salman H. Khan School of Computer Science & Software Engineering, The University of Western Australia Co-authors: Xuming He, Mohammed Bennamoun, Ferdous
More informationSpatiotemporal Localization and Categorization of Human Actions in Unsegmented Image Sequences
Spatiotemporal Localization and Categorization of Human Actions in Unsegmented Image Sequences Antonios Oikonomopoulos, Student Member, IEEE, Ioannis Patras, Member, IEEE, and Maja Pantic, Senior Member,
More informationComputer Vision. Exercise Session 10 Image Categorization
Computer Vision Exercise Session 10 Image Categorization Object Categorization Task Description Given a small number of training images of a category, recognize a-priori unknown instances of that category
More informationCS 229 Midterm Review
CS 229 Midterm Review Course Staff Fall 2018 11/2/2018 Outline Today: SVMs Kernels Tree Ensembles EM Algorithm / Mixture Models [ Focus on building intuition, less so on solving specific problems. Ask
More informationSelection of Scale-Invariant Parts for Object Class Recognition
Selection of Scale-Invariant Parts for Object Class Recognition Gy. Dorkó and C. Schmid INRIA Rhône-Alpes, GRAVIR-CNRS 655, av. de l Europe, 3833 Montbonnot, France fdorko,schmidg@inrialpes.fr Abstract
More informationSparse coding for image classification
Sparse coding for image classification Columbia University Electrical Engineering: Kun Rong(kr2496@columbia.edu) Yongzhou Xiang(yx2211@columbia.edu) Yin Cui(yc2776@columbia.edu) Outline Background Introduction
More informationSCENE TEXT RECOGNITION IN MULTIPLE FRAMES BASED ON TEXT TRACKING
SCENE TEXT RECOGNITION IN MULTIPLE FRAMES BASED ON TEXT TRACKING Xuejian Rong 1, Chucai Yi 2, Xiaodong Yang 1 and Yingli Tian 1,2 1 The City College, 2 The Graduate Center, City University of New York
More informationApplications. Foreground / background segmentation Finding skin-colored regions. Finding the moving objects. Intelligent scissors
Segmentation I Goal Separate image into coherent regions Berkeley segmentation database: http://www.eecs.berkeley.edu/research/projects/cs/vision/grouping/segbench/ Slide by L. Lazebnik Applications Intelligent
More informationObject detection using non-redundant local Binary Patterns
University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2010 Object detection using non-redundant local Binary Patterns Duc Thanh
More informationLearning Realistic Human Actions from Movies
Learning Realistic Human Actions from Movies Ivan Laptev*, Marcin Marszałek**, Cordelia Schmid**, Benjamin Rozenfeld*** INRIA Rennes, France ** INRIA Grenoble, France *** Bar-Ilan University, Israel Presented
More informationObject detection using Region Proposals (RCNN) Ernest Cheung COMP Presentation
Object detection using Region Proposals (RCNN) Ernest Cheung COMP790-125 Presentation 1 2 Problem to solve Object detection Input: Image Output: Bounding box of the object 3 Object detection using CNN
More informationBeyond Bags of Features
: for Recognizing Natural Scene Categories Matching and Modeling Seminar Instructed by Prof. Haim J. Wolfson School of Computer Science Tel Aviv University December 9 th, 2015
More informationContent-based image and video analysis. Event Recognition
Content-based image and video analysis Event Recognition 21.06.2010 What is an event? a thing that happens or takes place, Oxford Dictionary Examples: Human gestures Human actions (running, drinking, etc.)
More informationPreliminary Local Feature Selection by Support Vector Machine for Bag of Features
Preliminary Local Feature Selection by Support Vector Machine for Bag of Features Tetsu Matsukawa Koji Suzuki Takio Kurita :University of Tsukuba :National Institute of Advanced Industrial Science and
More informationUnsupervised Learning : Clustering
Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex
More informationShort Survey on Static Hand Gesture Recognition
Short Survey on Static Hand Gesture Recognition Huu-Hung Huynh University of Science and Technology The University of Danang, Vietnam Duc-Hoang Vo University of Science and Technology The University of
More informationClassification and Detection in Images. D.A. Forsyth
Classification and Detection in Images D.A. Forsyth Classifying Images Motivating problems detecting explicit images classifying materials classifying scenes Strategy build appropriate image features train
More informationExploring Bag of Words Architectures in the Facial Expression Domain
Exploring Bag of Words Architectures in the Facial Expression Domain Karan Sikka, Tingfan Wu, Josh Susskind, and Marian Bartlett Machine Perception Laboratory, University of California San Diego {ksikka,ting,josh,marni}@mplab.ucsd.edu
More informationCategory vs. instance recognition
Category vs. instance recognition Category: Find all the people Find all the buildings Often within a single image Often sliding window Instance: Is this face James? Find this specific famous building
More informationFacial Expression Classification with Random Filters Feature Extraction
Facial Expression Classification with Random Filters Feature Extraction Mengye Ren Facial Monkey mren@cs.toronto.edu Zhi Hao Luo It s Me lzh@cs.toronto.edu I. ABSTRACT In our work, we attempted to tackle
More informationGraph-based High Level Motion Segmentation using Normalized Cuts
Graph-based High Level Motion Segmentation using Normalized Cuts Sungju Yun, Anjin Park and Keechul Jung Abstract Motion capture devices have been utilized in producing several contents, such as movies
More informationMachine Learning 13. week
Machine Learning 13. week Deep Learning Convolutional Neural Network Recurrent Neural Network 1 Why Deep Learning is so Popular? 1. Increase in the amount of data Thanks to the Internet, huge amount of
More informationDisguised Face Identification (DFI) with Facial KeyPoints using Spatial Fusion Convolutional Network. Nathan Sun CIS601
Disguised Face Identification (DFI) with Facial KeyPoints using Spatial Fusion Convolutional Network Nathan Sun CIS601 Introduction Face ID is complicated by alterations to an individual s appearance Beard,
More informationECG782: Multidimensional Digital Signal Processing
Professor Brendan Morris, SEB 3216, brendan.morris@unlv.edu ECG782: Multidimensional Digital Signal Processing Spring 2014 TTh 14:30-15:45 CBC C313 Lecture 10 Segmentation 14/02/27 http://www.ee.unlv.edu/~b1morris/ecg782/
More informationMotion Estimation for Video Coding Standards
Motion Estimation for Video Coding Standards Prof. Ja-Ling Wu Department of Computer Science and Information Engineering National Taiwan University Introduction of Motion Estimation The goal of video compression
More informationUnsupervised learning in Vision
Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual
More informationString distance for automatic image classification
String distance for automatic image classification Nguyen Hong Thinh*, Le Vu Ha*, Barat Cecile** and Ducottet Christophe** *University of Engineering and Technology, Vietnam National University of HaNoi,
More informationObject Category Detection: Sliding Windows
04/10/12 Object Category Detection: Sliding Windows Computer Vision CS 543 / ECE 549 University of Illinois Derek Hoiem Today s class: Object Category Detection Overview of object category detection Statistical
More informationHuman Motion Detection and Tracking for Video Surveillance
Human Motion Detection and Tracking for Video Surveillance Prithviraj Banerjee and Somnath Sengupta Department of Electronics and Electrical Communication Engineering Indian Institute of Technology, Kharagpur,
More informationAccelerometer Gesture Recognition
Accelerometer Gesture Recognition Michael Xie xie@cs.stanford.edu David Pan napdivad@stanford.edu December 12, 2014 Abstract Our goal is to make gesture-based input for smartphones and smartwatches accurate
More informationHOG-based Pedestriant Detector Training
HOG-based Pedestriant Detector Training evs embedded Vision Systems Srl c/o Computer Science Park, Strada Le Grazie, 15 Verona- Italy http: // www. embeddedvisionsystems. it Abstract This paper describes
More informationSIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014
SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014 SIFT SIFT: Scale Invariant Feature Transform; transform image
More informationTA Section: Problem Set 4
TA Section: Problem Set 4 Outline Discriminative vs. Generative Classifiers Image representation and recognition models Bag of Words Model Part-based Model Constellation Model Pictorial Structures Model
More informationBeyond Bags of features Spatial information & Shape models
Beyond Bags of features Spatial information & Shape models Jana Kosecka Many slides adapted from S. Lazebnik, FeiFei Li, Rob Fergus, and Antonio Torralba Detection, recognition (so far )! Bags of features
More informationA New Feature Local Binary Patterns (FLBP) Method
A New Feature Local Binary Patterns (FLBP) Method Jiayu Gu and Chengjun Liu The Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, USA Abstract - This paper presents
More informationAccurate 3D Face and Body Modeling from a Single Fixed Kinect
Accurate 3D Face and Body Modeling from a Single Fixed Kinect Ruizhe Wang*, Matthias Hernandez*, Jongmoo Choi, Gérard Medioni Computer Vision Lab, IRIS University of Southern California Abstract In this
More informationBus Detection and recognition for visually impaired people
Bus Detection and recognition for visually impaired people Hangrong Pan, Chucai Yi, and Yingli Tian The City College of New York The Graduate Center The City University of New York MAP4VIP Outline Motivation
More informationDevelopment in Object Detection. Junyuan Lin May 4th
Development in Object Detection Junyuan Lin May 4th Line of Research [1] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection, CVPR 2005. HOG Feature template [2] P. Felzenszwalb,
More informationObject Recognition Using Pictorial Structures. Daniel Huttenlocher Computer Science Department. In This Talk. Object recognition in computer vision
Object Recognition Using Pictorial Structures Daniel Huttenlocher Computer Science Department Joint work with Pedro Felzenszwalb, MIT AI Lab In This Talk Object recognition in computer vision Brief definition
More informationMarkov Networks in Computer Vision. Sargur Srihari
Markov Networks in Computer Vision Sargur srihari@cedar.buffalo.edu 1 Markov Networks for Computer Vision Important application area for MNs 1. Image segmentation 2. Removal of blur/noise 3. Stereo reconstruction
More informationRegion-based Segmentation and Object Detection
Region-based Segmentation and Object Detection Stephen Gould Tianshi Gao Daphne Koller Presented at NIPS 2009 Discussion and Slides by Eric Wang April 23, 2010 Outline Introduction Model Overview Model
More informationBSB663 Image Processing Pinar Duygulu. Slides are adapted from Selim Aksoy
BSB663 Image Processing Pinar Duygulu Slides are adapted from Selim Aksoy Image matching Image matching is a fundamental aspect of many problems in computer vision. Object or scene recognition Solving
More informationFeature descriptors. Alain Pagani Prof. Didier Stricker. Computer Vision: Object and People Tracking
Feature descriptors Alain Pagani Prof. Didier Stricker Computer Vision: Object and People Tracking 1 Overview Previous lectures: Feature extraction Today: Gradiant/edge Points (Kanade-Tomasi + Harris)
More informationCombining PGMs and Discriminative Models for Upper Body Pose Detection
Combining PGMs and Discriminative Models for Upper Body Pose Detection Gedas Bertasius May 30, 2014 1 Introduction In this project, I utilized probabilistic graphical models together with discriminative
More informationApplying Supervised Learning
Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains
More informationNorbert Schuff VA Medical Center and UCSF
Norbert Schuff Medical Center and UCSF Norbert.schuff@ucsf.edu Medical Imaging Informatics N.Schuff Course # 170.03 Slide 1/67 Objective Learn the principle segmentation techniques Understand the role
More informationLearning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009
Learning and Inferring Depth from Monocular Images Jiyan Pan April 1, 2009 Traditional ways of inferring depth Binocular disparity Structure from motion Defocus Given a single monocular image, how to infer
More informationWITH the increasing use of digital image capturing
800 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 20, NO. 3, MARCH 2011 A Hybrid Approach to Detect and Localize Texts in Natural Scene Images Yi-Feng Pan, Xinwen Hou, and Cheng-Lin Liu, Senior Member, IEEE
More informationMobile Visual Search with Word-HOG Descriptors
Mobile Visual Search with Word-HOG Descriptors Sam S. Tsai, Huizhong Chen, David M. Chen, and Bernd Girod Department of Electrical Engineering, Stanford University, Stanford, CA, 9435 sstsai@alumni.stanford.edu,
More information