Representing Pairwise Spatial and Temporal Relations for Action Recognition
Pyry Matikainen (1), Martial Hebert (1), and Rahul Sukthankar (2, 1)
(1) The Robotics Institute, Carnegie Mellon University
(2) Intel Labs Pittsburgh
{pmatikai,hebert,rahuls}@cs.cmu.edu

Abstract. The popular bag-of-words paradigm for action recognition tasks is based on building histograms of quantized features, typically at the cost of discarding all information about relationships between them. However, although the beneficial nature of including these relationships seems obvious, in practice finding good representations for feature relationships in video is difficult. We propose a simple and computationally efficient method for expressing pairwise relationships between quantized features that combines the power of discriminative representations with key aspects of Naïve Bayes. We demonstrate how our technique can augment both appearance- and motion-based features, and that it significantly improves performance on both types of features.

1 Introduction

It is well known that classification and recognition problems in general cannot be solved by using a single type of feature and that, instead, progress lies in combinations of feature representations. But as there is as vast an array of ways to combine and augment features as there are features themselves, the development of such higher order methods is as difficult (and potentially rewarding) an endeavor as direct feature construction. These meta-methods span a range of approaches from those that make absolutely no assumptions on their base features, and thus are consigned to black boxes operating on vectors of numbers, to those that are so intimately tied to their base features as to be virtually inseparable from them.
The former types of methods, such as multiple kernel learning (MKL) techniques, are attractive for their broad applicability, but the latter tend to be more powerful in specific applications due to their strong coupling with their underlying features. However, between those extremes there are still augmentations that compromise between generality and power by being applicable to broad classes of loosely related base features. The popularity of statistical bag-of-words style techniques [1, 2] for action recognition and related video tasks creates an opportunity to take advantage of the broad similarities in these techniques. In particular, these techniques rely on accumulating histograms of quantized features extracted from video, features that are almost always localized in space and time. Yet these methods typically do not take advantage of the spatial
and temporal relationships between features. While it is obvious that in general using such relationships should help, the subtleties of designing appropriate representations have limited their use. Figure 1 shows examples of the types of informative relationships between features that we are interested in. Pairwise spatial relationships, in the form of star topologies, fans, constellations, and parts models, have seen frequent use in static image analysis [4-6]. Practical limitations have made transitioning these methods to video difficult. Sparse pairwise topologies (stars, fans, parts) often suffer from a lack of appropriately annotated training data, as they often require annotations that specify the topology for training [7, 8]. Alternatively, there are structured methods which can operate without such annotations, but at the cost of significantly more complicated and computationally expensive training or testing [9, 10]. In the special limited case of a fixed camera, the entire topology can be fixed relative to the frame by simply using the absolute positions of features [11, 12].

Fig. 1. Pairs of features probabilistically vote for action classes; pairs voting for the correct action class are shown in yellow, with brighter color denoting stronger (more informative) votes. For answerphone, the relative motion of the hands is particularly discriminative. These results employ trajectory fragments on the Rochester dataset [3], but our method works with any localized video feature.

The key contributions of this paper can be summarized as follows: (1) we propose an efficient method for augmenting quantized local features with relative spatial-temporal relationships between pairs of features, and (2) we show that our representation can be applied to a variety of base features and results in improved recognition accuracy on several standard video datasets.
For the pairwise model, the most direct representation that is compatible with bag-of-words techniques is to simply generate higher-order features by quantizing the possible spatio-temporal relationships between features. However, this results in a significantly larger number of possible codewords; for example, 100 codeword labels and 10 possible relationships would produce $100^2 \times 10 = 100{,}000$ possible labels for the pairwise codewords. Attempts to mitigate this have centered on dimensionality reduction and limiting the number of relationships. In the former case, Gilbert et al. [13, 14] employ data mining techniques to find frequently occurring combinations of features. Similarly, Ryoo and Aggarwal [15] use a sparse representation of the resulting high-dimensional histograms, in addition to using a relatively small relationship set. Taken to the extreme, Savarese et
al. [16] consider effectively only one relationship: whether two features occur within a fixed distance of each other, and likewise Sun et al. [17] also use a simple proximity relationship. The subtle difficulty is exposing enough information for discriminative machinery to gain traction, but not so much as to overwhelm it in the noise. We strike this balance by selectively organizing estimated pairwise relationships, thereby exploiting fine spatio-temporal relationships without having to resort to an unmanageable representation for the classifier. Inspired by recent work on the max-margin Hough transform [18], we propose accumulating probabilities in a Naïve Bayes-like fashion into a reduced number of bins, and then presenting these binned probabilities to discriminative machinery. By choosing our bins to coincide with codeword labels, we produce vectors with size proportional to the number of codewords while still taking advantage of discriminative techniques.

Since we propose a method for augmenting features with spatio-temporal relationships, we wish to show that this augmentation performs well on a range of features. To this end, we consider two radically different types of base features. First, we consider features built from space-time interest points (STIPs) with associated Histogram of Oriented Gradient (HOG) descriptors, which are sophisticated appearance-based features popularized by Laptev et al. [1]. Second, we consider a simple form of trajectory-based features similar to those proposed by Matikainen et al. [19] and Messing et al. [3], quantized through a fixed (training data independent) quantization method. This selection of base features demonstrates the effectiveness of our method on both appearance- and motion-based features, as well as on sophisticated and simple feature extraction methods.
In particular, our simplified trajectory method produces a fixed number of features per frame, and the feature labels are not derived from a clustering of training data. The method is virtually certain to produce a large number of extraneous features, and the feature labels are likely to be more sensitive to noise compared to those produced through clustering. In contrast, STIP-HOG produces relatively few features, which tend to be more stable due to the clustering.

As discussed above, our proposed approach formulates the problem in a Naïve Bayes manner, but rather than independently summing per-feature probabilities in log space, we pass them through a discriminative classifier. We train this classifier by estimating all of the cross probabilities for feature labels, that is, for each pair of labels and each action we build a relative location probability table (RLPT) of the observed spatial and temporal relationships between features of those labels under the given action. Then, any feature label can compute its estimate of the distribution over action probabilities using the trained cross-probability maps. These estimates are combined for each feature label, and the final feature vector is presented to a classifier.

2 Base features

The proposed method can augment a variety of common features employed in video action recognition. To demonstrate our method's generality, we describe
how it can be applied to two types of features that represent video in very different ways, as discussed below.

2.1 Base feature: STIP-HOG

Laptev et al.'s space-time interest points (STIPs) [1], in conjunction with Histogram of Oriented Gradient (HOG) descriptors, have achieved state-of-the-art performance on a variety of video classification and retrieval tasks. A variable number of STIPs are discovered in a single video and the local space-time volume near each interest point is represented using a 72-dimensional descriptor. These HOG descriptors are quantized using a codebook (typically pre-generated using k-means clustering on a large collection) to produce a discrete label and a space-time location $(x, y, t)$ for each STIP.

2.2 Base feature: Quantized trajectories

In previous work [19], we considered trajectory-based features that we coined trajectons; these features describe video data in a very different manner from STIP-HOG. First, Harris corner features are tracked in a given video using KLT to produce a set of trajectories. Each trajectory is first converted from a list of $(x_t, y_t)$ position pairs into a list of discrete derivative pairs $(dx_t, dy_t) = (x_t - x_{t-1}, y_t - y_{t-1})$. These trajectories are then broken up into overlapping windows of fixed duration $T$, each of which is considered a new feature or trajectory fragment. Unlike STIP-HOG, trajectory fragments seek to express longer-term motion in the video. We generally follow our previous work, but substitute a more straightforward quantization method.

Sequencing code map (SCM) quantization. In our earlier work, we quantized trajectory fragments by clustering them with k-means to produce a codebook. Messing et al. [3] also use trajectory features but employ a different quantization strategy in which trajectories are soft-assigned to Markov mixture components; their strategy is similar to that of Sun et al.
[17], who also consider quantized transitions within a trajectory. Both Messing et al.'s and our earlier approach can be computationally expensive depending on the number of mixture components or k-means centers, respectively. Taking inspiration from both, we propose quantizing fixed-length trajectories using a derivative table similar to Messing et al.'s, which we call a sequencing code map (SCM), examples of which can be seen in Figure 3. However, rather than using quantized derivatives to look up probabilities, we simply combine the quantized indices over the fixed-length trajectory fragment into a single label encoding quantized derivatives at specific times within each fragment (see Figure 2). In our method, a trajectory fragment is divided into $k$ consecutive stages of length $t$ frames, such that $kt \le T$, where $T$ is the total length of the fragment. The total motion, or summed derivative, of each stage is computed as a $(dx, dy)$
pair for each stage. This $(dx, dy)$ vector is quantized according to the SCM into one of $n$ stage labels, or sequence codes; the $k$ sequence codes are combined to produce a single combined label that can take on $n^k$ values. For example, with 3 stages and an SCM containing 8 bins, there would be $8^3 = 512$ labels total. Since the time to quantize a stage is a single table lookup regardless of how the table was produced, this method is extremely computationally efficient (i.e., the computation time does not grow with an increasing number of quantization bins).

Fig. 2. Sequencing code map (SCM) quantization breaks a trajectory fragment into a number of stages (in this case three) that are separately quantized according to a map (in this case a 6-way angular map). These per-stage labels, called sequence codes, are combined into a final label for the fragment.

Fig. 3. Examples of possible sequencing code maps (SCMs) or relative location maps (RLMs). Our experiments focus on angular maps (second from left) but our method is completely general.

Formally, we denote by $M(dx, dy)$ the SCM function, which is implemented as a two-dimensional lookup table that maps from a $(dx, dy)$ pair to an integer label in the range of $0$ to $n - 1$ inclusive. We denote by $(dx_j, dy_j)$ the derivative pair for stage $j$, with stages indexed from $0$ to $k - 1$. Then the assigned label is given by

$$l = \sum_{j=0}^{k-1} n^j \, M(dx_j, dy_j).$$

Besides the convenience of not having to build a codeword dictionary and the reduced computational cost, our introduction of this quantization method is meant to demonstrate that our pairwise features do not depend on data-driven clustering techniques. The quantized labels produced by SCM quantization are unlikely to correspond nicely to clusters of features in the dataset (e.g., parts), yet the improvement produced by our pairwise relationships persists.
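As an illustration of the label formula above, the following is a minimal sketch of SCM quantization with an angular map. It rests on several assumptions that are not taken from the paper: the names `angular_scm` and `scm_label` are hypothetical, the map is computed directly from the angle instead of being read from a precomputed two-dimensional lookup table, zero motion is assigned to bin 0, and any frames left over after forming $k$ equal stages are ignored.

```python
import math


def angular_scm(dx, dy, n_bins):
    """Hypothetical n-way angular map M(dx, dy): index of the direction sector.

    Computed directly here; the paper implements M as a 2-D lookup table.
    Zero motion falls into bin 0 (an assumption of this sketch).
    """
    angle = math.atan2(dy, dx) % (2.0 * math.pi)   # direction in [0, 2*pi)
    return int(angle / (2.0 * math.pi / n_bins)) % n_bins


def scm_label(derivatives, k, n_bins):
    """Combine k per-stage sequence codes into one label in [0, n_bins**k)."""
    t = len(derivatives) // k                      # frames per stage, so k*t <= T
    label = 0
    for j in range(k):
        stage = derivatives[j * t:(j + 1) * t]
        sx = sum(dx for dx, _ in stage)            # summed derivative of stage j
        sy = sum(dy for _, dy in stage)
        label += (n_bins ** j) * angular_scm(sx, sy, n_bins)
    return label


# Three stages of three frames each: motion along +x, then +y, then (-x, -y).
fragment = [(1, 0)] * 3 + [(0, 1)] * 3 + [(-1, -1)] * 3
print(scm_label(fragment, k=3, n_bins=6))
```

With three stages and a 6-way map, every fragment receives one of $6^3 = 216$ labels, and the cost per fragment is independent of the number of labels.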
3 Augmenting Features with Pairwise Relationships

In the following subsections we detail our approach for augmenting generic base features with spatial and temporal relationships. While the previous discussions focused on STIP and trajectory fragment features, our proposed method for pairwise spatio-temporal augmentation applies equally well to any type of feature that can be quantized and localized. In the remainder of this section, a feature is simply the tuple $(l, x, y, t)$: a codeword label in conjunction with its spatio-temporal position. The set of all observed features is denoted $F$.

3.1 Pairwise discrimination with relative location probabilities (RLPs)

Starting with the observation that Naïve Bayes is a linear classifier in log space, in this section we formulate the pairwise representation first in the familiar terms of a Naïve Bayes classifier, and then demonstrate how to expose more of the underlying structure to discriminative methods. We start with the assumption that all pairs are conditionally independent given the action class. Then, if features are quantized to $L$ labels and the spatial relationships between features are quantized to $S$ labels, we could represent the full distribution over pairs of features with a vector of $L^2 S$ bins. Unfortunately, for even as few as $L = 100$ trajectory labels and $S = 10$ spatial relationships, there would be $(100^2)(10) = 100{,}000$ elements in the computed feature vector, far too many to support with the merely hundreds of training samples typically available in video datasets. This feature vector must be reduced, but the direct approach of combining bins is just equivalent to using a coarser quantization level.
Instead, taking inspiration from the Max-Margin Hough Transform [18] and Naïve Bayes, we build probability maps of the spatial relationships between features, and instead of summing counts, we accumulate probabilities, allowing pairs to contribute more information during their aggregation. Specifically, we produce a feature vector $B$ of length $AL$, where $A$ is the number of action classes (e.g., walk, run, jump, etc.), and where each entry $B_{a,l}$ corresponds to the combined conditional probability of all pairs containing a feature with label $l$ given the action class $a$. In other words, a bin contains a feature label's probabilistic vote for a given action class, and we could compute a Naïve Bayes estimate of the probability of all the observed features given an action by summing all the votes for that action: $\log P(F \mid a) = \sum_{l \in L} B_{a,l}$. However, instead of summing these in a Naïve Bayes fashion, we present the vector as a whole to discriminative machinery, in our case a linear SVM. We now describe how we accomplish this feature vector reduction.

Notation. Formally, a video segment has a number of quantized features computed from it. A feature $f_i \in F$ is associated with a discrete quantized label $l_i \in L$ as well as a spatio-temporal position $(x_i, y_i, t_i)$ indicating the frame and location in the frame where it occurs. For a pair of features within the same frame and a
given action $a \in A$, there is a probability of the two features occurring together in a frame, $P(l_i, l_j \mid a)$, as well as the relative location probability (RLP) for their particular spatial relationship, $P(x_i, x_j, y_i, y_j \mid l_i, l_j, a)$. We make the simplifying assumption that RLPs depend only on the relative spatial relationship between the two features, so that $P(x_i, x_j, y_i, y_j \mid l_i, l_j, a) = P(dx, dy \mid a, l_i, l_j)$, where $dx$ and $dy$ are slight abuses of notation that should be understood to mean $x_i - x_j$ and $y_i - y_j$ where appropriate. This assumption enforces the property that the computed relationships are invariant to simple translations of the feature pairs.

Probabilistic formulation. The reduction to a feature vector of length $AL$ is done by selectively computing parts of the whole Naïve Bayes formulation of the problem. In particular, a full probabilistic formulation would compute $P(F \mid a)$ and select the $a$ that maximizes this expression. Since Naïve Bayes takes the form of multiplying a number of feature probabilities, or in this case pair probabilities, we can exploit the associative and commutative properties of multiplication to pre-multiply groups of pair probabilities together, and then return those intermediate group probabilities rather than the entire product. This can be seen as binning the pair probabilities. Assuming feature pairs are conditionally independent, we can compute the probability of a feature set $F$ given an action $a$ according to the equation

$$P(F \mid a) = \prod_{f_i \in F} P(l_i \mid a) \prod_{f_j \in F} P(l_j \mid l_i, a) \, P(dx, dy \mid a, l_i, l_j), \tag{1}$$

which strictly speaking double-counts pairs since each pair is included twice in the computation; however, since we are only interested in the most likely action, this is not an issue. In practice, we employ log probabilities both to avoid issues with numerical precision from extremely small values and to formulate the problem as a linear classifier.
In this case the log probability expression becomes

$$\log P(F \mid a) = \sum_{f_i \in F} \Big[ \log P(l_i \mid a) + \sum_{f_j \in F} \big( \log P(l_j \mid l_i, a) + \log P(dx, dy \mid a, l_i, l_j) \big) \Big]. \tag{2}$$

To simplify the expression we assume uniform probabilities for $P(l_i \mid a)$ and $P(l_j \mid l_i, a)$. Later we can include nonuniform label probabilities by simply concatenating the individual label histogram to the pairwise feature vector when both are presented to the classifier. Thus, our probability expression becomes

$$\log P(F \mid a) = \sum_{f_i \in F} \sum_{f_j \in F} \log P(dx, dy \mid a, l_i, l_j) + C, \tag{3}$$

which is simply a formal way of stating that the whole log probability is the sum of all the pairwise log probabilities. Since we are only interested in the
relative probabilities over action classes, we collect the uniform probabilities for labels into a constant $C$ which does not depend on the action $a$, and which is omitted from the following equations for clarity. We now wish to divide this expression into a number of sub-sums that can be presented to a classifier, and this expression leaves us a great deal of flexibility, since we are free to compute and return sub-sums in an arbitrary manner.

Fig. 4. Relative location probabilities (RLPs) for feature labels 25 and 30 over all actions in the Rochester dataset, using an 8-way angular map. Lighter (yellower) indicates higher probability. We see that for the answerphone action, features with label 30 tend to occur up and to the right of features with label 25, whereas for usesilverware, features with label 30 tend to occur down and to the left of those with label 25. Panels: answerphone, chopbanana, dialphone, drinkwater, eatbanana, eatsnack, lookupinphonebook, peelbanana, usesilverware, writeonwhiteboard.

Discriminative Form. We rewrite Equation 3 in such a way as to bin probabilities according to individual feature labels. In particular, we can rewrite it in log form as

$$\log P(F \mid a) = \sum_{l \in L} \log P(b_l \mid a), \tag{4}$$

where

$$\log P(b_l \mid a) = \sum_{f_i \in l} \sum_{f_j \in F} \log P(dx, dy \mid a, l, l_j), \tag{5}$$

and $f_i \in l$ ranges over the features whose label is $l$. The expression $\log P(b_l \mid a)$ is the bin probability, which directly corresponds to an element of the feature vector according to $B_{a,l} = \log P(b_l \mid a)$. Since there are $A$ actions and $L$ labels, this $B$ vector contains $AL$ elements.

3.2 Estimating relative location probabilities from training data

The previous section assumed that the relative location probabilities (RLPs) were simply available. However, these probabilities must be estimated from real data, requiring some care in the representation choice for the relative location probability tables (RLPTs).
An RLPT represents an expression of the form $\log P(dx, dy \mid a, l_i, l_j)$, where $a$, $l_i$, and $l_j$ are considered fixed. In practice this means that it must represent a function from a $(dx, dy)$ pair to a log probability,
and that we must represent one such function for every $(a, l_i, l_j)$ triplet of discrete values. While many representations are possible, we use an approach similar to that used for staged quantization. We denote by $M(dx, dy)$ a function that maps a $(dx, dy)$ pair to an integer bin label, allowing $n$ such labels. We refer to this map as the relative location map (RLM), possible forms of which can be seen in Figure 3. Then the RLPT for a given $(a, l_i, l_j)$ triplet is a list of $n$ numbers, denoted $T_{a,l_i,l_j}$. An RLP can then be retrieved according to:

$$\log P(dx, dy \mid a, l_i, l_j) = T_{a,l_i,l_j}[M(dx, dy)]. \tag{6}$$

For example, with 216 labels, 10 actions, and 8 bins in the RLM, storing all the RLPs would require $(216^2)(10)(8) = 3{,}732{,}480$ entries in $(216^2)(10) = 466{,}560$ tables.

Fig. 5. Features with labels 17 and 23 are observed in a frame of a training video of the class eatbanana. The feature with label 17 has a relative displacement of $(dx, dy)$ from that with label 23, which maps to bin #2 in an 8-way angular RLM. Thus, we increment bin #2 in the corresponding table entry; these counts are all converted to estimated log probabilities after the entire video collection is processed.

Estimating the RLPTs is simply a matter of counting the displacements falling within each bin (see Figure 5), and finally normalizing by the total counts in each map. Since some bins may receive zero counts, leading to infinities when the log probability is computed, we use a prior to seed each bin with a fixed number of pseudo-counts. Examples of RLPTs found in this way can be seen in Figure 4. This method could seemingly generate very sparse probability maps where most bins receive few or no real counts. However, in practice almost all of the bins receive counts.
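The counting procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the input layout (each video as an action name plus a list of frames, each frame a list of `(label, x, y)` features), the directly computed angular map, and the pseudo-count value of 1 are all hypothetical choices, not details given in the paper.

```python
import math
from collections import defaultdict


def angular_rlm(dx, dy, n_bins=8):
    """Hypothetical n-way angular relative location map M(dx, dy)."""
    angle = math.atan2(dy, dx) % (2.0 * math.pi)
    return int(angle / (2.0 * math.pi / n_bins)) % n_bins


def estimate_rlpts(videos, n_bins=8, pseudo=1.0):
    """Estimate log-probability tables T[(a, l_i, l_j)] from in-frame pairs.

    videos: iterable of (action, frames); frames is a list of frames,
            each frame a list of (label, x, y) features (assumed layout).
    pseudo: pseudo-count seeded into every bin so no bin has zero mass.
    """
    counts = defaultdict(lambda: [pseudo] * n_bins)
    for action, frames in videos:
        for feats in frames:
            for i, (li, xi, yi) in enumerate(feats):
                for j, (lj, xj, yj) in enumerate(feats):
                    if i == j:
                        continue                      # a feature is not paired with itself
                    b = angular_rlm(xi - xj, yi - yj, n_bins)
                    counts[(action, li, lj)][b] += 1.0
    # Normalize each table by its total count and move to log space.
    tables = {}
    for key, c in counts.items():
        total = sum(c)
        tables[key] = [math.log(v / total) for v in c]
    return tables


# One training frame with features labelled 17 and 23.
videos = [("eatBanana", [[(17, 10, 7), (23, 5, 5)]])]
tables = estimate_rlpts(videos)
print(tables[("eatBanana", 17, 23)])
```

Triplets that never occur in training have no table in this sketch; with the pseudo-count prior they would correspond to a uniform table with every entry equal to $\log(1/n)$.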
A typical situation (in this case our experiments on Rochester's Daily Living dataset) might have 10 classes, 216 feature labels, and a probability map with 8 bins, for a total of $(216^2)(10)(8) = 3{,}732{,}480$ bins. For Rochester we have approximately 120 training videos, each of which is approximately 600 frames long. If we track 300 features per frame, and only consider in-frame pairs,
then across all videos we will have $(300^2)(600)(120) = 6{,}480{,}000{,}000$ pairwise features. Thus, on average each bin will receive over a thousand counts.

3.3 Extension to temporal relationships

The method is naturally extended to include temporal relationships. Rather than representing the relationship between two features as a $(dx, dy)$ pair, it is represented as a $(dx, dy, dt)$ triple. The RLPT then contains entries of the form $\log P(dx, dy, dt \mid a, l_i, l_j)$, which are indexed according to a mapping function $M(dx, dy, dt)$, so that

$$\log P(dx, dy, dt \mid a, l_i, l_j) = T_{a,l_i,l_j}[M(dx, dy, dt)]. \tag{7}$$

Previously, the map $M$ could be stored as a simple image, whereas with spatial-temporal relationships this map is a volume or series of images. When counts are accumulated or probabilities evaluated, only pairs of features within a sliding temporal window are considered, since considering all pairs of features over the entire video would both result in a prohibitively large number of pairs and prevent the method from being run online on a video stream. Nevertheless, the change from considering only pairs within a frame to pairs within a temporal window vastly increases the number of pairs to consider, and depending on the number of features generated by a particular feature detector, it may be necessary to randomly sample pairs rather than considering them all. We find that for STIP-HOG we can consider all pairs, while for SCM-Traj we must sample.

3.4 Classification

We train a linear SVM [20] to classify video clips, which is a straightforward matter of presenting computed $B$ vectors and corresponding ground truth classes from training clips. Each bin in $B$ can be interpreted as a particular label's vote for an action, in which case the classifier learns the importance of each label's vote.
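To make the construction of the $B$ vector concrete, here is a minimal sketch of Equation 5 for the spatial-only, in-frame case, followed by the Naïve Bayes baseline that simply sums each action's votes. The function names, the dictionary layout of the tables, the directly computed angular map, and the uniform fallback for triplets without a table are assumptions made for illustration; the temporal variant would additionally pair features within a sliding window and index the tables by $M(dx, dy, dt)$.

```python
import math


def angular_rlm(dx, dy, n_bins=8):
    """Hypothetical n-way angular relative location map M(dx, dy)."""
    angle = math.atan2(dy, dx) % (2.0 * math.pi)
    return int(angle / (2.0 * math.pi / n_bins)) % n_bins


def pairwise_vector(features, tables, actions, n_labels, n_bins=8):
    """Accumulate B[a][l] = sum of pair log-probabilities binned by label l.

    features: list of (label, x, y, t); only pairs sharing the same t are used.
    tables:   dict mapping (action, l_i, l_j) to a list of n_bins log-probabilities.
    """
    uniform = math.log(1.0 / n_bins)          # fallback for unseen triplets (assumption)
    B = [[0.0] * n_labels for _ in actions]
    for i, (li, xi, yi, ti) in enumerate(features):
        for j, (lj, xj, yj, tj) in enumerate(features):
            if i == j or ti != tj:
                continue
            b = angular_rlm(xi - xj, yi - yj, n_bins)
            for a_idx, a in enumerate(actions):
                table = tables.get((a, li, lj))
                B[a_idx][li] += table[b] if table is not None else uniform
    return B


def naive_bayes_action(B, actions):
    """Baseline: pick the action whose votes sum to the largest log-probability."""
    return max(range(len(actions)), key=lambda a_idx: sum(B[a_idx]))


# Toy example: action "a" expects label 1 to lie in sector 0 relative to label 0.
peaked = [math.log(0.5)] + [math.log(0.5 / 7)] * 7
tables = {("a", 0, 1): peaked}
feats = [(0, 5, 2, 0), (1, 0, 0, 0), (1, 9, 9, 3)]
B = pairwise_vector(feats, tables, ["a", "b"], n_labels=2)
print(["a", "b"][naive_bayes_action(B, ["a", "b"])])
```

For the discriminative form, the rows of `B` would be concatenated into a single vector of length $AL$ and passed to the linear SVM instead of being summed.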
Since, when considered in isolation, pairwise relationships are unlikely to be as informative as the base features from which they are derived, we present a simple method for combining the raw base feature histograms with the computed pairwise log probability vectors. We do not present this combination method as the canonical way of combining the two sources of information, but rather as a convincing demonstration that the proposed pairwise relationships provide a significant additional source of information rather than merely a rearrangement of the existing data. Supposing that $H$ represents the histogram for the base features, and $B$ represents the computed pairwise relationship vector, then one way of combining the two would be to simply concatenate the two vectors into $[H, B]$, and present the result to a linear SVM. However, this is unlikely to result in the best performance, since the two vectors represent different quantities. Instead, we separately scale each part, and then simply cross-validate to find the scaling ratio $p$ that maximizes performance on the validation set, where the combined vector is $[pH, (1 - p)B]$. This scaled vector is simply presented to the SVM.
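The scaled concatenation and the search over $p$ can be sketched as below. The helper names and the idea of passing the validation procedure in as a callback are assumptions for illustration; in practice `score` would train a linear SVM on the combined training vectors and return validation accuracy, which is replaced here by a toy function so that the example stays self-contained.

```python
def combine(H, B, p):
    """Build the scaled concatenation [p*H, (1-p)*B] from two flat vectors."""
    return [p * h for h in H] + [(1.0 - p) * b for b in B]


def select_ratio(candidates, score):
    """Return the candidate p with the highest validation score.

    score: callable p -> validation accuracy (hypothetical; supplied by the caller).
    """
    return max(candidates, key=score)


# Toy stand-in for the validation accuracy as a function of p.
def toy_score(p):
    return 1.0 - (p - 0.5) ** 2


best_p = select_ratio([0.1, 0.3, 0.5, 0.7, 0.9], toy_score)
print(best_p, combine([1.0, 2.0], [-4.0], best_p))
```

Only the ratio between the two scales matters for a linear classifier of this kind, which is why a single parameter $p$ suffices.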
Table 1. Action recognition accuracy on standard datasets. Adding pairwise features significantly boosts the accuracy of various base features.

Method                                        UCF-YT   Rochester
STIP-HOG (single) (Laptev et al. [1])         55.0%    56.7%
STIP-HOG (NB-pairwise alone)                  16.4%    20.7%
STIP-HOG (D-pairwise alone)                   46.6%    46.0%
STIP-HOG (single + D-pairwise)                59.0%    64.0%
STIP-HOG-Norm (single) (Laptev et al. [1])    42.6%    40.6%
SCM-Traj (single)                             42.3%    37.3%
SCM-Traj (NB-pairwise alone)                  14.3%    70.0%
SCM-Traj (D-pairwise alone)                   40.0%    48.0%
SCM-Traj (single + D-pairwise)                47.1%    50.0%

4 Evaluation

We evaluate our pairwise method on a forced-choice action classification task on two standard datasets, the UCF YouTube dataset (UCF-YT) [21] and the recently released University of Rochester Activities of Daily Living dataset [3]. To evaluate the contribution of our method for generating pairwise relationships, we consider two different types of base features: the trajectory-based features we introduced earlier, and Laptev et al.'s space-time interest points. We consider both our discriminative formulation (denoted D-pairwise) and a Naïve-Bayes formulation (NB-pairwise) for our pairwise features, where the NB-pairwise results are primarily intended as a baseline against which to compare. Table 1 summarizes our results.

Our experiments are designed to evaluate the effect of adding spatial and temporal relations to the features and to understand in detail the effect of various parameters on the performance of the augmented features. Clearly, significantly more tuning and additional steps would go into building a complete, optimized video classification system. In particular, we do not claim that our performance numbers are the best that can be obtained by using complete systems optimized for these data sets. We use the evaluation metric of total accuracy across all classes in an n-way classification task.
On both datasets we use 216 base feature codewords for both trajectories and STIP-HOG. The number 216 results from the choice of three stages with a 6-way mask for the staged quantization ($6^3 = 216$), and we use the same number for STIP-HOG to make the comparison as even as possible. Likewise, for both datasets we use an 8-way spatial relationship binning for the RLPTs. Combined results are produced by cross-validating on the scaling ratio.

UCF-YT consists of 1600 videos in 11 categories acquired from YouTube clips. For evaluation, we randomly split the dataset into a training set of approximately 1200 videos and a testing set of approximately 400 videos. This dataset was chosen for its difficulty, in order to evaluate the performance of pairwise relationships outside of highly controlled environments. In particular,
this dataset is challenging because the videos contain occlusions, highly variable viewpoints, significant camera motion, and high amounts of visual clutter. On UCF-YT we find that discriminative pairwise features are not as informative as the base features, which is not unexpected since the diversity of the dataset means there are unlikely to be strong, consistent relationships between features. Nevertheless, we still find modest gains for combinations of pairwise and individual features, on the order of 5%. This means that the pairwise features are providing an additional source of information, rather than just obfuscating the information already present in the individual feature histograms. The Naïve Bayes pairwise evaluation performs poorly, but better than chance. Furthermore, we can see that our simple fixed quantization on trajectories performs similarly to normalized STIP-HOG features, but significantly worse than non-normalized STIP histograms. This suggests that much of the discriminative power of STIP features might originate from the variation in the quantity of features found in different videos.

The Rochester dataset consists of 150 videos of five individuals performing a series of scripted tasks in a kitchen environment, acquired using a stationary camera. Due to the limited pool of available data, we evaluate using 5-fold cross-validation, using videos from four individuals for training and the fifth for testing in each fold. On Rochester we observe that the pairwise features for STIP-HOG do not perform as well as the individual STIP-HOG features, but that the combination outperforms both, which is consistent with the results for UCF-YT. For trajectory features, the pairwise features alone significantly outperform the base features, a reversal from UCF-YT.
The combination of the two outperforms both the individual and pairwise features, but adds only a modest gain on top of the pairwise performance. For both types of features, the gains with pairwise relationships in combination are much larger than for UCF-YT, which is explained by the greater consistency of spatial relationships between codeword labels due to the fixed viewpoint and highly consistent actions. Qualitatively examining which pairs contribute to a correct action identification supports this hypothesis: as can be seen in Figure 1, the pairwise features supporting an action appear to be strongly tied to that action. For STIP-HOG, the Naïve-Bayes pairwise formulation once again performs poorly; however, for trajectories the Naïve-Bayes pairwise formulation is the strongest performer. This suggests that for some applications, even simple relationships can give very good performance.

4.1 Effect of Temporal Relationships

The results of using spatial and temporal relationships on UCF-YT are shown in Table 2, in which X-T-Pairwise denotes the classifier (discriminative or Naïve-Bayes) augmented with temporal relations. For these results, we have used the same 8-way spatial relationship binning combined with a 5-way temporal binning, for a total of 40 bins. The pairwise relationships are evaluated over a 30
13 Representing Pairwise Spatial and Temporal Relations 13 Table 2. Action recognition accuracy with temporal relationships on UCF-YT Method STIP-HOG Traj-SCM NB-Pairwise (baseline) 16.4% 14.3% NB-T-Pairwise 22.2% 31.2% D-Pairwise (baseline) 46.6% 40.0% D-T-Pairwise 49.2% 39.7% frame sliding window. For STIP-HOG, all pairs within the window are considered, but for trajectories we sample 1/20 of the pairs in the interest of tractability. Note that even with this sampling, a four second UCF-YT clip can produce over 100,000,000 pairs when using trajectory features. The performance of the discriminative pairwise relationships remains virtually unchanged for Traj-SCM, but there is a modest performance boost for STIP-HOG. The Naïve-Bayes versions continue to perform worse than the discriminative ones, however the temporal relationships have a much larger impact on their performance. The difference is especially dramatic for NB-Pairwise vs. NB-T-Pairwise with STIP-HOG, where the temporal relationships have more than doubled the accuracy from 14.3% to 31.2%. 4.2 RLPT sparsity Earlier we argued that the relative location probability tables should not be sparse based on a simple counting argument. Empirically, we find that for the Rochester dataset 71.6% of the entries receive counts, and that 91.7% of the tables have at least one count in one of the 8 bins. The number of tables containing at least 100 counts is 41.1%, and 15.2% of tables have over 1000 counts. These numbers validate our original claim that the tables are not sparse. 5 Conclusion We present a simple yet powerful method for representing pairwise spatio-temporal relationships between features in the popular bag-of-words framework. 
Unlike naïvely expanding codewords to include all possible pairs and relationships between features, our method produces an output whose size is proportional to the number of base codewords rather than to its square, which reduces the likelihood of overfitting and is more computationally efficient. We demonstrate that our method can be used to improve action classification performance with dissimilar STIP-HOG (appearance) and trajectory (motion) based features on two different datasets, and that a discriminative formulation of our pairwise features generally outperforms a Naïve-Bayes classification approach. Although our method takes advantage of spatial relationships, it does not require any additional annotation in the training data, making it appropriate for a wide range of datasets and applications.
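For concreteness, the combined 8-way spatial and 5-way temporal binning from Section 4.1 (40 bins in total) could be indexed along the lines of the following sketch. The equal 45-degree angular sectors, the uniform temporal bin edges over offsets of ±30 frames, and all function names are assumptions made for illustration; the exact bin boundaries are not specified here.

```python
import math

N_SPATIAL = 8    # angular sectors (assumed equal, 45 degrees each)
N_TEMPORAL = 5   # temporal offset bins (assumed uniform)
WINDOW = 30      # assumed maximum frame offset between paired features


def spatial_bin(dx, dy):
    """Map the displacement (dx, dy) between two features to an angular sector."""
    angle = math.atan2(dy, dx) % (2.0 * math.pi)
    sector_width = 2.0 * math.pi / N_SPATIAL
    # The final modulo guards against rounding that lands exactly on 2*pi.
    return int(angle / sector_width) % N_SPATIAL


def temporal_bin(dt, window=WINDOW):
    """Map a frame offset dt, clamped to [-window, window], to a temporal bin."""
    dt = max(-window, min(window, dt))
    fraction = (dt + window) / (2.0 * window)
    return min(int(fraction * N_TEMPORAL), N_TEMPORAL - 1)


def relation_bin(dx, dy, dt):
    """Combine both binnings into a single index in [0, 40)."""
    return temporal_bin(dt) * N_SPATIAL + spatial_bin(dx, dy)


print(relation_bin(1.0, 0.1, 0))
print(relation_bin(-1.0, -0.1, 30))
```

Because the combined index is a plain product of the two binnings, swapping in a different relative location map only requires replacing `spatial_bin`.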
As we have only considered simple angular maps, there is potentially still considerable power to be extracted from this method through the careful selection of relative location maps. Additionally, we have presented a binning of probabilities based on codeword label, but an interesting question is whether more intelligent data-driven binnings can be found. We plan to explore these questions in future work.

References

1. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: CVPR. (2008)
2. Niebles, J., Wang, H., Fei-Fei, L.: Unsupervised learning of human action categories using spatial-temporal words. IJCV 79 (2008)
3. Messing, R., Pal, C., Kautz, H.: Activity recognition using the velocity histories of tracked keypoints. In: ICCV. (2009)
4. Carneiro, G., Lowe, D.: Sparse flexible models of local features. In: ECCV. (2006)
5. Crandall, D.J., Huttenlocher, D.P.: Weakly supervised learning of part-based spatial models for visual object recognition. In: ECCV. (2006)
6. Leordeanu, M., Hebert, M., Sukthankar, R.: Beyond local appearance: Category recognition from pairwise interactions of simple features. In: CVPR. (2007)
7. Zhang, Z.M., Hu, Y.Q., Chan, S., Chia, L.T.: Motion context: A new representation for human action recognition. In: ECCV. (2008)
8. Ke, Y., Sukthankar, R., Hebert, M.: Event detection in crowded videos. In: ICCV. (2007)
9. Jiang, H., Martin, D.R.: Finding actions using shape flows. In: ECCV. (2008)
10. Junejo, I.N., Dexter, E., Laptev, I., Pérez, P.: Cross-view action recognition from temporal self-similarities. In: ECCV. (2008)
11. Johnson, N., Hogg, D.: Learning the distribution of object trajectories for event recognition. Image and Vision Computing 14 (1996)
12. Makris, D., Ellis, T.: Spatial and probabilistic modelling of pedestrian behaviour. In: BMVC. (2002)
13. Gilbert, A., Illingworth, J., Bowden, R.: Scale invariant action recognition using compound features mined from dense spatio-temporal corners. In: ECCV. (2008)
14. Gilbert, A., Illingworth, J., Bowden, R.: Fast realistic multi-action recognition using mined dense spatio-temporal features. In: ICCV. (2009)
15. Ryoo, M.S., Aggarwal, J.K.: Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. In: ICCV. (2009)
16. Savarese, S., DelPozo, A., Niebles, J., Fei-Fei, L.: Spatial-Temporal correlatons for unsupervised action classification. In: WMVC. (2008)
17. Sun, J., Wu, X., Yan, S., Cheong, L.F., Chua, T.S., Li, J.: Hierarchical spatiotemporal context modeling for action recognition. In: CVPR. (2009)
18. Maji, S., Malik, J.: Object detection using a max-margin Hough transform. In: CVPR. (2009)
19. Matikainen, P., Hebert, M., Sukthankar, R.: Trajectons: Action recognition through the motion analysis of tracked features. In: ICCV workshop on Video-oriented Object and Event Classification. (2009)
20. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001)
21. Liu, J., Luo, J., Shah, M.: Recognizing realistic actions from videos in the wild. In: CVPR. (2009)
More information