Evaluating Example-based Pose Estimation: Experiments on the HumanEva Sets
Ronald Poppe
Human Media Interaction Group, Department of Computer Science
University of Twente, Enschede, The Netherlands

Abstract

We present an example-based approach to pose recovery, using histograms of oriented gradients as image descriptors. Tests on the HumanEva-I and HumanEva-II data sets provide us with insight into the strengths and limitations of an example-based approach. We report mean relative 3D errors of approximately 65 mm per joint on HumanEva-I, and 175 mm on HumanEva-II. We discuss our results using single and multiple views. Also, we perform experiments to assess the algorithm's generalization to unseen subjects, actions and viewpoints. We plan to incorporate the temporal aspect of human motion analysis to reduce orientation ambiguities, and increase the pose recovery accuracy.

1. Introduction

Approaches to vision-based human motion analysis can broadly be divided into generative and discriminative. The first category explicitly uses a human body model that describes both the visual and kinematic properties of the human body. Discriminative (recognition) approaches learn the mapping from image space to pose space directly from carefully selected training data. Human motion analysis from images is challenging due to variations in human body dimensions, camera viewpoint, type of motion and numerous environmental settings such as lighting. In a generative approach, many of these parameters can be included in the estimation. While this may improve the pose recovery accuracy, it comes at the cost of computational complexity. In discriminative approaches, these variations can either be encoded implicitly (for example, variations in lighting and body dimensions) or explicitly (for example, viewpoint). The fact that not all parameters can efficiently be encoded explicitly causes discriminative approaches to perform less accurately than generative work.
However, discriminative approaches are computationally much less expensive, and can potentially be applied in real time. This makes them useful for applications in Human-Computer Interaction and surveillance, where speed is more of an issue than accuracy. A discriminative approach also has the ability to automatically (re)initialize. In this respect, a combined generative and discriminative approach (e.g. [13]) is a promising direction to obtain accurate results within reasonable time.

We describe an example-based approach to pose recovery. Such an approach uses a (fixed) set of example poses, with their corresponding visual appearance. Pose recovery is simply selecting the pose that corresponds to the most visually similar example. Since the examples usually cover the pose space very sparsely, the pose estimate is often obtained through interpolation of the n closest examples. While an example-based approach is conceptually simple (in fact, the only parameter that can be tuned is n), existing literature explores many different ways to encode the image and the poses. Moreover, the data used for training and testing is often proprietary, which makes interpretation of the various results and proper consolidation of the findings difficult. The HumanEva data set [12] provides us with the opportunity to compare pose recovery results objectively on data that, although recorded in a controlled environment, poses challenges with respect to the activities performed, and the subjects performing them. This allows us to measure the influence of various factors on the pose recovery performance. In the research described here, we do exactly that, and report our findings on an example-based approach.

The next section discusses related work on example-based pose estimation. Section 3 explains our method. Experiments and our results are presented in Section 4.
In Section 5, we summarize the advantages and limitations of our approach, and present directions for future research.

2. Related work

The area of vision-based human motion analysis is too broad to be thoroughly reviewed here. The interested reader is referred to [9] for a recent overview. In this section, we focus on discriminative work, which can be divided into functional and example-based (nearest-neighbor). In functional work, the mapping from image space to pose space is approached functionally, whereas in example-based work all training data is retained. This notion is important since it puts a practical limit on the number of examples m in the training set of an example-based approach. Regarding the computational complexity, the order of a straightforward example-based algorithm is linear in m. Below, we present a short overview of the different factors that play a role in an example-based pose recovery approach. Subsequently, we explain how these factors have determined the design of our contribution.

The basis of an example-based approach is to encode the image observations. In existing literature, a number of different image descriptors have been used, usually based on silhouettes [3, 4, 6] or edges [1, 8, 11, 13]. These descriptors vary in terms of computational complexity, descriptor size, robustness against clutter, and invariance to scale, rotation and translation. Another important factor is the number of cameras employed. Grauman et al. [4] show that pose recovery accuracy increases with the number of views. Also, the type of input affects the algorithm's performance. Some authors use a training set of synthesized images [8, 11], which can be encoded robustly without the presence of noise. Others place their synthesized characters against a photographed background to introduce realistic noise into their training set [1, 13]. Closely related are the issues of localization and segmentation. Usually, the location of the person is known, or obtained from a background segmentation process. In the latter case, the success of an algorithm depends heavily on the results of the segmentation.

Human motion analysis is a spatio-temporal problem, hence the temporal aspect can be used to improve accuracy. Howe [6] recovers poses of an entire sequence (batch). This ensures consistency over time, which is particularly useful when only a single view is used. Ong et al.
use a tracking (incremental) approach, where a dynamical model puts a prior on the poses in the next time frame. This guarantees temporal consistency, and reduces the number of evaluations needed, since only poses with a non-negligible prior have to be evaluated. Other measures to reduce the order of example-based algorithms to sub-linear include hashing [11]. Also, a hierarchical organization of the examples can reduce the complexity of the algorithm.

In our contribution, we do not regard the temporal aspect, nor do we apply any measures to reduce the computational complexity. A variant of histograms of oriented gradients (HoG, [2]) is used to encode the image observations within extracted foreground masks. We vary the number of views and report the performance on the majority of the available actions and subjects in both the HumanEva-I and HumanEva-II test sets. Our results provide insight into the capabilities and limitations of example-based approaches.

3. Method

In this section, we describe the components of our approach. Section 3.1 discusses the image descriptors that are used in this research. The pose estimation approach is explained in Section 3.2.

3.1. Histogram of oriented gradients

Dalal and Triggs proposed histograms of oriented gradients (HoG) as an image descriptor to localize pedestrians in cluttered images [2]. We believe that their use can be extended to pose recovery. Gradients are to some extent invariant to lighting changes. Moreover, spatial ordering is preserved, which has been found to be of key importance for effective recovery [10]. HoGs are calculated within an image's region of interest (ROI), in our case the bounding box around the subject. While HoGs can be used to determine this region, as in [2], we rely on background subtraction. This significantly speeds up the process, and we obtain the foreground mask at the same time. We describe the process here in detail to allow for replication.
First, we apply background subtraction with the suggested risk values, as included in the HumanEva source code [12]. The minimum enclosing box of all foreground areas larger than 600 pixels is obtained. After conversion to HSV color space, we apply shadow removal in the lower 20% of the ROI. Pixels with a saturation that is between 0 and 25 higher than the saturation of any of the means in the background mixture model are removed from the foreground mask. We again obtain the minimum enclosing box, which is our ROI. Figure 1 shows an example of background subtraction and shadow removal.

Figure 1. Example of foreground mask calculation: (a) original image, (b) with background subtracted, and (c) shadow removed.

It may seem that our approach is highly sensitive to good background subtraction, but the shadow removal is only needed to ensure that the ROI fits the subject reasonably.
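As an illustration, the shadow test described above can be sketched as follows. This is a simplified, hypothetical Python/NumPy implementation, not the HumanEva code: a single background saturation image stands in for the per-pixel mixture-model means, and the connected-component area filtering is omitted.

```python
import numpy as np

def bounding_box(mask):
    """Minimum enclosing box (r0, r1, c0, c1) of the True pixels.
    Row/column end indices are exclusive; assumes a non-empty mask."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1

def remove_shadow(mask, sat, bg_sat, sat_thresh=25, band=0.20):
    """Drop pixels in the lower `band` fraction of the ROI whose saturation
    exceeds the background saturation by 0..sat_thresh, then recompute the
    ROI. `sat` and `bg_sat` are HxW saturation images; a simplification of
    the per-component mixture-model test described in the text."""
    r0, r1, c0, c1 = bounding_box(mask)
    band_start = r1 - int(round(band * (r1 - r0)))   # lower 20% of the box
    diff = sat.astype(int) - bg_sat.astype(int)
    shadow = (diff > 0) & (diff <= sat_thresh)       # the paper's shadow test
    out = mask.copy()
    out[band_start:r1] &= ~shadow[band_start:r1]     # only inside the band
    return out, bounding_box(out)
```

After shadow removal, the recomputed box is the ROI used for descriptor extraction.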
For certain cases, we slightly adjusted the parameters. For camera 1 in HumanEva-I, only for subject 2, we multiplied the risk with factor to remove artefacts from the foreground. For cameras 2 and 3 in HumanEva-I, we lowered the shadow threshold to 10. We did not use the additional four grayscale cameras in HumanEva-I. For HumanEva-II, we only reduced the background risk with factor for camera 3.

To obtain the HoG, we divide the ROI into a grid with 6 rows and 5 columns. This is an arbitrary choice, but the height of each cell roughly corresponds with the height of the head in a standing position. Similarly, in a relaxed standing pose, the body covers approximately 3 columns horizontally. We did not perform experiments with other settings, but admit this would have been interesting. Within each cell in the grid, we calculate the orientation and magnitude of each pixel that appears in the foreground mask. We divide the absolute orientations over 9 equally sized bins in the range [0, 180) degrees. Each pixel contributes its gradient magnitude to the corresponding histogram bin, which results in a 9-bin histogram per cell. The total length of the descriptor is therefore 270. The entire descriptor is normalized to unit length. Compared to Dalal and Triggs [2], we do not apply color normalization and do not use the notion of blocks (groups of cells). This reduces the computation cost, and results in a significantly reduced descriptor size.

3.2. Pose estimation

In an example-based approach, each image observation is encoded and matched against a training set of encoded observations. We use the previously described HoGs as encodings. To match a HoG with those in the training set, we need to define a distance measure between two descriptors. We performed a small-scale experiment with Manhattan, Euclidean, cosine and χ2 distance. Consistent with earlier findings, Manhattan and Euclidean distance proved to be most suitable.
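The descriptor construction described above can be sketched as follows; this is our own illustrative Python/NumPy code, not the author's. The 6x5 grid, 9 magnitude-weighted orientation bins per cell, foreground-only pixels and unit-length normalization follow the text, while the gradient operator (np.gradient) is an assumption.

```python
import numpy as np

def hog_descriptor(gray, mask, rows=6, cols=5, bins=9):
    """6x5 grid over the ROI; per cell, a 9-bin histogram of absolute
    gradient orientations in [0, 180) degrees, weighted by gradient
    magnitude, over foreground pixels only. The concatenated 270-dim
    descriptor is normalized to unit length (no blocks, no color
    normalization, as in the text)."""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    ori = np.abs(np.arctan2(gy, gx)) % np.pi          # absolute orientation in [0, pi)
    h, w = gray.shape
    desc = np.zeros((rows, cols, bins))
    r_edges = np.linspace(0, h, rows + 1).astype(int)
    c_edges = np.linspace(0, w, cols + 1).astype(int)
    for i in range(rows):
        for j in range(cols):
            cell = np.s_[r_edges[i]:r_edges[i + 1], c_edges[j]:c_edges[j + 1]]
            fg = mask[cell]
            b = np.minimum((ori[cell][fg] / np.pi * bins).astype(int), bins - 1)
            np.add.at(desc[i, j], b, mag[cell][fg])   # magnitude-weighted voting
    desc = desc.ravel()
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc
```

For the multi-view set T3, three such 270-dimensional descriptors would simply be concatenated into one 810-dimensional vector.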
In this work, we use the Manhattan distance since it has a lower computational complexity. Matching a HoG with the entire training set results in a distance value for each of the m examples. We could choose the example with the lowest distance, as this is the example that best matches the image observation. However, in practice, taking the n best matches results in more accurate pose recovery. Of course, n will depend on the number of examples in the training set that are close to the presented frame. Here we use n = 25, in accordance with [10]. To determine the final pose estimate, we use the poses that correspond to the n best examples. The final estimate is the normalized weighted interpolation of these poses. This implies that close HoG matches contribute more to the final estimate than HoG matches at a larger distance.

One word of caution is in order here. Since we interpolate poses, the final joint estimates are likely to lie closer to the mean distance for this joint, so closer to the body. This effect is especially visible for examples that have similar image observations but are distant in pose space. Since we do not determine the correspondences between our localized subject in the image and the estimated pose, we are not able to reliably estimate the global position of each joint. Instead, we report the distances of each joint relative to the pelvis (torsoDistal) joint.

4. Experimental results

We report our findings of experiments on both the HumanEva-I and HumanEva-II data sets. First, we describe our two training sets: one for monocular pose recovery (T1), and one for recovery using three cameras simultaneously (T3). In both cases, we used the HumanEva-I sequences that are used for training and validation.

4.1. Training sets

For the monocular training set T1, we associate the HoGs for each view with their corresponding poses. Only the examples that contain valid mocap data are included in the training set.
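Given such a training set of (descriptor, pose) pairs, the matching and interpolation step of Section 3.2 and the relative error metric can be sketched as below. The inverse-distance weighting is an assumption on our part; the text only states that the interpolation is weighted and normalized, favoring close matches.

```python
import numpy as np

def estimate_pose(query, train_hogs, train_poses, n=25):
    """Manhattan-distance matching against all m examples (O(m)), followed
    by a normalized weighted interpolation of the poses of the n best
    matches. Inverse-distance weights are an illustrative choice."""
    dists = np.abs(train_hogs - query).sum(axis=1)   # Manhattan distance
    idx = np.argsort(dists)[:n]                      # n best examples
    w = 1.0 / (dists[idx] + 1e-8)                    # closer matches weigh more
    w /= w.sum()                                     # normalize weights
    return (w[:, None] * train_poses[idx]).sum(axis=0)

def mean_relative_error(pred, gt, pelvis=0):
    """Mean 3D joint error after expressing both Jx3 poses relative to the
    pelvis joint, since the global position is not recovered."""
    return np.linalg.norm((pred - pred[pelvis]) - (gt - gt[pelvis]),
                          axis=1).mean()
```

With poses stored as flat joint-coordinate vectors, the interpolation is a single weighted sum; the bias toward the mean pose noted above is inherent in this averaging.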
Table 1. Number of training examples per action (Walking, Jog, Throw/Catch, Gestures, Box, Combo) and subject (S1-S3).

When given a new image observation, together with the knowledge of which camera the observation is obtained from, we can now estimate the pose. We observe that the elevation (rotation in the vertical direction) and roll (rotation around the line of sight) of all cameras are approximately the same. In other words, the orientation of all cameras is almost equal except for the orientation around a vertical axis. If we were to rotate the subject in the scene around a vertical axis, we would theoretically be able to generate very similar observations for all cameras. In practice, view-specific parameters such as backgrounds and lighting conditions are likely to result in observations that are somewhat different. However, we want our approach to be robust against these image deformations and, therefore, we perform this rotation virtually. This has the additional advantage that the amount of training data is effectively tripled, resulting in a total of 28,890 samples.
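This virtual rotation amounts to re-expressing the mocap poses under another camera's extrinsics. A minimal sketch, assuming 4x4 homogeneous extrinsic matrices for the cameras (the function name and matrix layout are ours):

```python
import numpy as np

def retarget_pose(pose_xyz, M_i, M_j):
    """Re-express a J x 3 pose observed under camera i as if observed under
    camera j: p_j = M_j * inv(M_i) * p_i, in homogeneous coordinates.
    M_i and M_j are 4x4 extrinsic matrices."""
    hom = np.hstack([pose_xyz, np.ones((len(pose_xyz), 1))])   # J x 4
    out = (M_j @ np.linalg.inv(M_i) @ hom.T).T
    return out[:, :3]
```

Applying this to every valid frame for each pair of cameras yields the tripled training set.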
We transform the mocap data in such a way that we obtain the global positions as if we were looking through another camera. With an observation from camera i, and the projection onto camera j, our pose vector p_i = (x_i, y_i, z_i, 1)^T is transformed into p_j as follows: p_j = M_j M_i^(-1) p_i, where M_i and M_j are the rotation matrices of cameras i and j, respectively.

In T3, we combine the HoGs of the three views into a single HoG descriptor of length 810. This combined descriptor is larger, and contains the same pose from multiple views. We therefore expect increased pose estimation accuracy over the monocular descriptors. However, combining our HoGs comes at the cost of a reduced number of examples, compared to T1. For each frame with valid mocap, we have exactly one example. The total number of examples m that we can obtain is 9,630. Table 1 summarizes the origins of the examples.

Combining all views into one descriptor has some drawbacks. When the setup of the cameras changes, the descriptor cannot be used anymore. The relative orientations of the cameras are encoded implicitly in the combined descriptors. Practically, this means that we cannot evaluate the HumanEva-II sequences with our training sets, since these are obtained from the HumanEva-I setting. Another drawback arises when one or more views contain inaccurate segmentations. This can, in some cases, render the example useless.

4.2. Results for HumanEva-I

The HumanEva-I test sequences are performed with the same camera setup as the training sets. Also, the same test subjects appear in these videos. Except for the Combo action, all action-subject combinations also exist in the training set. For the number of examples per subject and action, we again refer to Table 1. We did not perform evaluations for subject 4, due to background segmentation errors. We plan to report on these sequences in the near future, to provide more insight into how our approach performs on unseen subjects.
For each sequence, we evaluate the performance on both T1 and T3. For T1, we use the image observations from camera 1. In informal experiments, we looked at the differences between the three cameras, but these were small. For T3, we use the combined image observations of all three cameras. The results, broken down by action, subject and training set, are summarized in Table 2. We omitted all frames for which the mocap data was invalid. Also, the first 5 frames of each sequence were removed, since these frames were duplicated in the decoding of the video.

The first thing to notice is the relatively small difference between the performance of our monocular tests, and those using three cameras. We obtained mean errors over all subjects and sequences of and mm, respectively. Apparently, our HoG descriptors are to a large extent invariant to depth ambiguities. We expect that this ability can mainly be ascribed to the fact that our descriptor is based on gradients, which have high magnitudes for the arms. These are helpful when differentiating between front and back poses.

To be able to interpret our results, we need a baseline. We use the mean distance between all poses. Since we interpolate the poses of the best matches, our pose estimate will always be within the range of poses in the training set. This baseline is therefore the mean distance for randomly selecting a pose with n = 1. The mean distance per joint for T3 is then mm. For T1, the distance is slightly lower, mm.

4.3. Comparison with related research

Before we take a closer look at the errors for different sequences, we compare our general findings with those previously reported on the HumanEva-I data set. We restrict our comparisons to discriminative approaches. The work by Howe [5] uses silhouettes in an example-based approach. Attention is paid to correct segmentation of these silhouettes. As in previous work [6], chamfer distance and turning angle are used to compare the silhouettes.
Additionally, optical flow is used to include some form of temporal consistency. Second-order Markov chaining is used to eliminate poses that have matching observations but differ significantly in pose (e.g. swapping of the arms). The fact that silhouettes are ambiguous regarding the depth ordering of limbs makes such a measure necessary. Errors are reported in 2D for the Walking sequence of subject 3. Tests are performed with either one color camera or one grayscale camera, with comparable results. The mean error in both cases is around 17 pixels which, for a figure height between 250 and 410 pixels and an assumed subject height of 1.80 meters, comes down to approximately mm. If we compare these results with our own, there are some important differences. First, Howe performs image registration, and the errors reported are in fact absolute. Second, our approach recovers poses in 3D and does not make use of temporal information. We would be interested to see how our approach would perform with the silhouettes and matching functions as used by Howe. This would give more insight into the strengths and weaknesses of both image descriptors.

Lee and Elgammal [7] present another study that reports on the HumanEva-I set. Silhouettes are used, and a functional mapping is learned. In an offline step, they construct a manifold topology based on synthesized silhouettes of a walking person, as viewed from a large number of angles. The novelty of their work is to learn a geometric mapping between the embedded manifold and the data. The authors report errors on the Walking validation sequences of subjects 1-3. Their mean, normalized, relative 3D error is approximately 35 mm when no dynamical model is assumed, and 31 mm using a particle filter with constant velocity for both gait phase and view change. Normalization of the pose is obtained by transforming the pose coordinates into a body-centered coordinate space. Effectively, this removes errors due to incorrect estimation of the pose around the vertical axis. Insensitivity to the subject is demonstrated, since the manifold topology is learned from data not contained within HumanEva-I. It remains to be investigated how the approach works for open (i.e. non-repetitive) actions.

Table 2. Mean 3D error (and SD) in mm for HumanEva-I test sequences, evaluated with a single camera (C1) and all three cameras (C1-C3). Columns: subject 1 (C1, C1-C3), subject 2 (C1, C1-C3), subject 3 (C1, C1-C3).
Walking (16.84) (11.95) (26.75) (23.86) (21.70) (25.07)
Jog (18.60) (13.74) (9.97) (12.20) (23.47) (18.32)
Throw/Catch (31.94) (23.75) (33.38) (31.75)
Gestures (13.51) (7.18) (28.10) (26.32) (11.09) (6.44)
Box (27.71) (36.20) (45.08) (41.19) (52.31) (47.86)
Combo (49.34) (52.17) (79.74) (53.10)
Average (19.17) (17.27) (31.86) (29.92) (39.95) (30.42)

4.4. Results for individual sequences

We will now discuss our results in more detail. If we look at different actions, we see large variations between sequences. In general, poses from the Walking and Jog sequences are recovered with the highest accuracy. This can, at least partly, be explained by the fact that these motions have been performed several times in the training sets. Each cycle resembles the others, in contrast to, for example, catching a ball, where the ball appears at more or less random places. In Figure 2, we present the affinity matrices of two walking cycles and two instances of catching and throwing a ball. We immediately see the similarity between the two walking cycles by the dark line on the diagonal.
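An affinity matrix of this kind can be computed directly from the per-frame descriptors; the sketch below uses the Manhattan distance employed elsewhere in this work, and the function name is ours.

```python
import numpy as np

def affinity_matrix(descriptors):
    """Pairwise Manhattan distances between all frames of a sequence (or
    the concatenation of two sequences). Plotted as an image, a repeated
    motion such as a walking cycle shows up as dark diagonal bands, i.e.
    small distances between corresponding phases of the cycle."""
    d = np.asarray(descriptors)                       # T x D frame descriptors
    return np.abs(d[:, None, :] - d[None, :, :]).sum(axis=-1)
```

The matrix is symmetric with a zero diagonal; off-diagonal dark lines indicate repeated poses across the sequence.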
This line is less apparent for two sequences of catching and throwing a ball. We should be able to see the same when looking at our test results. Figure 3(a-b) shows the plots of the mean errors over the sequences Walking and Throw/Catch, performed by subject 2 and obtained using all three cameras. The error plot is much more peaked for the Throw/Catch sequence. The peaks (e.g. around frames 350, 550 and 700) correspond to catching or throwing the ball. The lower errors around frames 450 and 650 correspond to waiting for the ball, which is in fact a standing pose. The slightly higher errors in the Walking sequence after frame 200, and around frame 650, correspond to the subject walking towards the camera. Here, depth ambiguities occur.

Figure 2. Affinity matrices for camera 1 of (a) a Walking cycle, and (b) a Throw/Catch sequence, performed by subject 1. Dark values correspond to small distances. Notice the similarity between the Walking cycles. The white plus in the Throw/Catch sequence is caused by incorrect segmentation due to the presence of the ball.

When we take another look at Table 2, we notice rather large differences between subjects for the Gestures action. In this action, the subject waves and makes beckoning gestures. Subjects 1 and 3 perform these gestures with their right hand, in both the training and test sequences. This explains the low standard deviations. The Box actions show some of the highest errors, which is somewhat surprising. More careful analysis of the video data shows that, for subjects 2 and 3, there is quite some variation in the footwork. Subject 1 uses the same standing pose as a basis, but is facing camera 1. From this view, it is difficult to estimate the depth of the arms. The views from cameras 2 and 3 are almost exactly from the side. This results in many estimations where the wrong arm is estimated to be stretched out.
The Combo action is a combination of walking, jogging, and some freestyle moves: jumping on one leg and balancing on one foot. These latter moves are not present in the training set. For subject 2, the mean error over the whole sequence is shown in Figure 3(c). The peak around frame 250 is caused by incorrect background segmentation. Frames contain the jumping on one leg action; in frames the subject balances on one foot. The difference in error is apparent, and gives us a clue about the performance of an example-based approach for unseen actions. We discuss this further in Section 4.5.

Figure 3. Mean 3D error (in mm) plots for HumanEva-I (a) Walking, (b) Throw/Catch, and (c) Combo actions, all performed by subject 2 and obtained using all three cameras. Instances that have a zero error contain invalid mocap data.

Table 3. Mean 3D error (and SD) in mm for the HumanEva-I training/validation sequence Walking, performed by subject 1. Errors are presented for different training sets. Columns: Train, Validation, Total.
T1-T (25.39) (23.95) (24.81)
T3-T (28.50) (23.86) (26.20)
T1-A (33.46) (28.90) (31.17)
T3-A (46.23) (44.96) (45.68)
T1-S (29.68) (25.56) (27.83)
T3-S (29.02) (24.12) (26.57)

4.5. Additional tests on the validation set

To allow comparison with previous work on the HumanEva-I set, we report here our results for the Walking sequence, performed by subject 1. This sequence is contained within our training sets, so we had to remove it. The resulting training sets with this trial removed are T1-T and T3-T, for the monocular and multi-camera cases, respectively. In addition, we created training sets where we removed all examples of the Walking action, T1-A and T3-A. This reduces the training set by 20%. Also, we removed all instances of subject 1, resulting in T1-S and T3-S, each containing roughly two thirds of the total number of samples. These training sets allow us to gain insight into the generalization to unseen subjects and unseen motions. The results are presented in Table 3, with mean errors for the training part (frames ), the validation part (frames 1-590) and over the entire sequence. Compared to the results in Table 2, the errors obtained here are approximately mm per joint higher. It should be noted that the number of examples is reduced, and this is likely to have a negative effect on the pose recovery accuracy.
Our first observation is that even when only one trial is removed, the error is larger. Removal of this sequence results in a lower number of examples of the Walking action. Moreover, all examples of the same subject for the Walking action are left out. Apparently, our approach is much more accurate when person-specific observations are used. When all Walking examples are removed, the error increases mm. Closer analysis of the examples used in the reconstruction shows that these are mainly from the Jog action. Walking and jogging show many resemblances, but in the Jog action the elbows are usually more bent, and the distance between the feet is kept small. This explains the increase. A final observation can be made with respect to the removal of examples containing subject 1. The results are slightly less accurate than in the case where only the Walking action of subject 1 is removed. This leads us to conclude that only few examples from other actions are used in the estimation. Also, we now have an estimate of the accuracy of pose recovery when the subject is not in the training set. Again, we must make a remark about the significantly reduced number of relevant samples, which is likely to increase the error in this case as well.

4.6. Results for HumanEva-II

The HumanEva-II set consists of two Combo sequences, performed by subjects 2 and 4. HumanEva-II differs from HumanEva-I in the recording setup. Four color cameras are used, instead of three color cameras and four grayscale cameras. The cameras are placed at different positions, and the elevation angles differ slightly. As mentioned before, this setup does not allow us to do any evaluation on the T3 training set. Our evaluation was performed similarly to that of the HumanEva-I test sequences. In Table 4, the results are summarized for both subjects and for each camera separately. The three sets contain various movements within the sequence.
Set 1 (frames 1-350) contains walking, set 2 (frames 1-700) contains walking and jogging, and set 3 (frames 1-1202 for subject 2, and for subject 4) contains the whole sequence. This includes walking, jogging, jumping on one leg and balancing on one foot.

Our first remark concerns the analysis of the sequence performed by subject 4. We removed frames from the results since these appeared to be erroneous. We obtained mean errors per joint above 1,200 mm. Visual inspection of our pose estimates and the video revealed no peculiarities. Moreover, an average distance per joint of 1,200 mm is very unlikely since we use relative distances.

Compared to the results that we obtained for the Combo sequences of HumanEva-I for T1, these errors are higher. Ideally, we would expect errors for set 1 that are comparable to those of the Walking sequences in HumanEva-I. Similarly, for set 2, we expected results that are close to those of Walking and Jog. In explaining these results, we take another look at the results that we obtained when analyzing the Walking validation sequence in HumanEva-I (Section 4.5). We observed that the error increases when there are no examples in the training set of the same person performing the same action. We expect that this is also the case for our HumanEva-II results. Although subject 2 appears in HumanEva-I, the clothing is different. This probably has an effect on the HoGs, and subsequently on the closest matches. Yet, even if we consider the subjects as unseen, the error is higher. We expect this to be the result of the modified camera setup. The elevation of the cameras is different, and there is a roll angle introduced for some cameras. We expect this to have quite a large impact, as HoGs are not rotation invariant. Also, the background is modelled with a single Gaussian. This results in less accurate background segmentation, which reflects on the HoG descriptors.

Figure 4 shows the mean error plots of the HumanEva-II sequences. The increased error for the unseen actions (frame 750 to the end) is apparent. Other peaks (e.g. for subject 2 around frames 100, 220, 300 and 370) are due to forward-backward ambiguities. Figure 5 shows the original frames and the frames corresponding to the single best example. We plan to conduct experiments with tracking to increase accuracy in these cases.

Figure 4. Mean 3D error (in mm) plots for the HumanEva-II Combo sequences performed by (a) subject 2, and (b) subject 4. Instances that have a zero error have been ignored (see text).

5. Conclusion and future work

We took an example-based approach to pose recovery. Histograms of oriented gradients have been used as image descriptors, and matched against our training sets. The final pose estimate was obtained by interpolating the poses of the 25 best matches based on their normalized weights. Since we do not calculate the correspondence between the image and our examples, we report mean 3D errors over all joints, relative to the pelvis.

We evaluated our approach on both the HumanEva-I and HumanEva-II test sequences. For HumanEva-I, we found only small differences between our monocular case and the case where three views were used. Part of this small difference can be explained by the reduced number of relevant samples in the multi-view case. Our results varied depending on the action performed. For walking and jogging, errors around 45 mm were obtained. The mean error for all actions and all subjects is approximately 65 mm. Our approach is somewhat person-specific, and does not generalize well to unseen actions. For HumanEva-II, errors close to 125 mm were obtained for the walking and jogging movements. The higher error can be explained by the fact that the subjects look different from the ones in the training set (one of the subjects does not appear in the training set, the other wears different clothing). Also, the viewpoints and recording settings are slightly different, which resulted in less accurate background segmentation and, consequently, less accurate HoG descriptors.
Due to this different setup, we were not able to perform any evaluations with more views. From a practical point of view, we plan to process the sequences containing subject 4 in HumanEva-I. This would give us more insight into whether the reduced accuracy on the HumanEva-II sequences is due to the changed viewpoints or to the fact that the subjects appear different. To allow our approach to perform in real time, we would like to evaluate measures that reduce the computational complexity, including hashing and hierarchical matching.

We have already observed that vision-based human motion analysis is spatio-temporal in nature, and we noticed several times that reduced accuracy is often caused by forward-backward ambiguities. We therefore plan to conduct evaluations in which we track the movements over time (incrementally), and in which we recover poses in a batch approach over an entire sequence. We expect both approaches to substantially reduce our errors and to bring accurate real-time pose recovery within reach.

Table 4. Mean 3D error (and SD) in mm for HumanEva-II test sequences, evaluated with a single camera. Results are broken down per set.

Action Set   Camera 1   Camera 2   Camera 3   Camera 4   Average
Combo S      (72.51)    (34.28)    (44.72)    (48.19)    (49.93)
Combo S      (59.83)    (34.54)    (60.93)    (54.85)    (52.54)
Combo S      (111.22)   (77.93)    (144.72)   (89.45)    (105.83)
Combo S      (54.38)    (49.24)    (86.69)    (57.14)    (61.86)
Combo S      (58.27)    (39.16)    (71.99)    (55.84)    (56.32)
Combo S      (77.09)    (66.10)    (120.20)   (101.67)   (91.27)

Acknowledgements

This work was supported by the European IST Programme Project FP (Augmented Multi-party Interaction with Distant Access, publication AMIDA-25), and is part of the ICIS program. ICIS is sponsored by the Dutch government under contract BSIK.

Figure 5. Single best estimate (top row) and original frame (bottom row) for the HumanEva-II Combo sequence performed by subject 2, at (a) frame 300, (b) frame 600, (c) frame 900, and (d) frame 1200. Frame 300 shows forward-backward ambiguity in the walking action. Frames 900 and 1200 contain unseen movements. Colors have been adapted for the viewer's convenience.