G3D: A Gaming Action Dataset and Real Time Action Recognition Evaluation Framework
Victoria Bloom, Kingston University, London, UK, Victoria.Bloom@kingston.ac.uk
Dimitrios Makris, Kingston University, London, UK, D.Makris@kingston.ac.uk
Vasileios Argyriou, Kingston University, London, UK, Vasileios.Argyriou@kingston.ac.uk

Abstract

In this paper a novel evaluation framework for measuring the performance of real-time action recognition methods is presented. The evaluation framework extends the time-based event detection metric to model multiple distinct action classes. The proposed metric provides a more accurate indication of the performance of action recognition algorithms for games and other similar applications, since it takes into consideration restrictions related to time and consecutive repetitions. Furthermore, a new dataset, G3D, for real-time action recognition in gaming, containing synchronised video, depth and skeleton data, is provided. Our results indicate the need for an advanced metric especially designed for games and other similar real-time applications.

1. Introduction

The gaming industry in recent years has attracted an increasingly large and diverse group of people. A new generation of games based on full body play, such as dance and sports games, has increased the appeal of gaming to family members of all ages. Full body play relies on detecting players' movements using sensors to provide a controller-free gaming experience. The Kinect, originally developed for the Xbox 360 games console, has a wide range of released titles, but they are limited to a small set of actions. Currently, action recognition in gaming ranges from heuristic-based techniques to machine learning algorithms. The approach taken depends on the number and complexity of gestures to be performed in the game. For example, a bowling game only requires a few simple gestures and an algorithm can be hardcoded for each gesture.
However, this approach may not work well for a greater number of complex gestures, where machine learning algorithms are better suited. Various machine learning techniques can be applied to more complex games, for example AdaBoost for a boxing game and exemplar matching for a tennis game [1]. The benefit of machine learning algorithms is that they can be trained to recognise a wide range of actions, including sporting, driving and action-adventure actions such as walking, running, jumping, dropping, firing, changing weapon, throwing and defending. This approach can increase the complexity and appeal of the games that can be developed, to include action-adventure games (e.g. Lara Croft). State-of-the-art action recognition approaches are currently appearance based. The algorithms use input features such as colour, dense optical flow and spatio-temporal gradients extracted from an RGB image. The context of the environment can be used to further improve accuracy, as intuitively certain actions will only happen in specific scenes [2]. For example, performing a golf swing in a real golf game requires a golf club and will occur outdoors, probably on a green field. However, when performing a golf swing in a Kinect game, the user has no golf club and is performing the action indoors. The restricted environment associated with gaming, typically the user's lounge, poses the challenge of missing context. The normal scene and objects usually associated with a given action are missing. This lack of contextual information may mean that appearance-based action recognition approaches underperform. An alternative is a pose-based approach, where joint positions, velocities and angles are extracted from an articulated human pose. This approach was previously disregarded due to the complexity of estimating the human pose.
Now it is possible to obtain real-time skeleton data for multiple subjects in an indoor environment using the Kinect [3], so pose-based approaches are being revisited by researchers. The experiments of Yao et al. [4] showed that pose-based features outperform low-level appearance features in a home monitoring scenario. To compare the performance of appearance-based, pose-based and combined approaches for gaming, a common dataset and an evaluation framework for measuring performance are required. G3D is a new public gaming action dataset containing synchronised colour, depth and skeleton data that has been captured for this paper. A novel evaluation framework for measuring the performance of real-time action recognition algorithms is also proposed to provide a common base for comparison.
2. Related work

2.1. Action recognition

There is a vast wealth of research on human action recognition in computer vision (for a comprehensive review see Aggarwal and Ryoo [5]). The majority of the state-of-the-art algorithms in activity recognition are appearance based, as low-level features can easily be extracted from video sequences. Due to recent developments in depth camera technology it is now possible and economical to capture real-time sequences of depth images. This has resulted in depth-based and pose-based approaches being developed. Li et al. [6] directly sample 3D representative points from a depth map as features for their action graph method. Their results show that recognition errors were halved when using 3D depth data in comparison with 2D silhouettes. Yao et al. [4] posed the question "Does Human Action Recognition Benefit from Pose Estimation?" Their experiments compared appearance-based, pose-based and combined approaches in a home monitoring scenario using the same classifier and the same dataset. The appearance-based features used were colour, dense optical flow and spatio-temporal gradients. The pose-based features were qualitative geometric features [7]. The results of Yao et al. [4] showed that the optimum approach was pose based: it significantly outperformed the appearance-based approach and was even slightly better than the combined approach. Both studies show promising results and could benefit from a direct comparison with state-of-the-art appearance-based methods using a common dataset.

2.2. Action recognition datasets

There are many existing public datasets available containing video sequences. For a comprehensive review of these, also see Aggarwal and Ryoo [5]. The datasets can be categorized into three groups, as follows. The first are simple action recognition datasets such as the KTH [8] and Weizmann [9] datasets, where each video contains one simple action performed by one actor in a controlled environment.
The second type are surveillance datasets such as PETS [10] and i-LIDS [11], obtained in realistic environments such as airports. The third type are movie datasets with challenging videos and frequently moving camera viewpoints obtained from real movie scenes, such as Hollywood2 [2]. For gaming, the existing action recognition datasets are insufficient, as during a game multiple different actions will be performed by the player. Public datasets containing both video and skeleton data for sports and locomotion actions exist [12] [13] [14], but as mentioned previously there is a difference between performing a real action and a gaming action. Even simple actions such as walking are different in the gaming environment, as the player will walk on the spot. The MPI HDM05 Motion Capture Database [13] does include locomotion on the spot but not a full range of gaming actions. Microsoft Research specifically developed a gaming action database, the MSR Action3D Database [6], which initially consisted of sequences of depth maps and was later extended by a third party to include skeleton data. Nevertheless, no corresponding video data is available. As far as we are aware, there is no publicly available action recognition database for gaming that contains all three types of subject data (video, depth and skeleton). This paper introduces a publicly available dataset, G3D, for real-time action recognition in gaming, containing synchronised video, depth and skeleton data.

2.3. Action recognition performance metrics

A common performance measure used in general action recognition is classification accuracy, and confusion matrices are frequently used to break down the number of correct classifications by class [5]. Applying the performance metric depends on the nature of the input sequence. In the simple case where the video is segmented to contain only one action, e.g. KTH [8], the classification process is straightforward.
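As a minimal sketch of this frame-level bookkeeping (the function and class names are our own, not part of the paper), per-frame predictions can be accumulated in a confusion matrix whose diagonal holds the correct classifications:

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Accumulate frame counts: rows are ground truth classes, columns predictions."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

# Example with two hypothetical classes over four frames:
m = confusion_matrix(["walk", "walk", "run", "run"],
                     ["walk", "run", "run", "run"],
                     ["walk", "run"])
print(m)  # [[1, 1], [0, 2]] -> one 'walk' frame misclassified as 'run'
```

Overall classification accuracy is then the sum of the diagonal divided by the total frame count.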
An action label is predicted for each frame in the sequence and a majority decision over all frames is taken to decide the action label for the complete sequence [15]. In the more complex case, with multiple actions within a sequence, e.g. the new G3D dataset, the classification process outlined above is not viable. The simple approach of applying the performance measure to the entire sequence does not incorporate the timing or continuity constraints that are required to make action recognition methods applicable to a range of real-world applications. Real-time performance is critical for a game to appear responsive to the player's actions. A single action performed by the player must be detected as soon as possible, as one continuous action consisting of multiple sequential frames and not as multiple actions, to prevent poor gameplay. In surveillance systems, continuous detection in real-time is required. The i-LIDS event detection metric [11] incorporates the continuous detection of events in real-time by summing occurrences of true positives, false negatives and false positives along an event timeline to produce an F1 score. The limitation of this metric is that it only recognises a single event type, e.g. abandoned package. A new real-time action recognition evaluation framework is proposed that extends the event timeline detection metric to model multiple action classes.
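The majority decision over per-frame labels described above can be sketched in a few lines (names are our own):

```python
from collections import Counter

def sequence_label(frame_labels):
    """Majority vote over per-frame predictions to label a whole sequence."""
    # most_common(1) returns the (label, count) pair with the highest count
    return Counter(frame_labels).most_common(1)[0][0]

# Example: five frames predicted 'walk', two 'run' -> sequence labelled 'walk'
print(sequence_label(["walk", "walk", "run", "walk", "run", "walk", "walk"]))
```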
3. G3D dataset

Figure 1. Colour image

A new dataset, G3D, for real-time action recognition in gaming, containing synchronised video, depth and skeleton data, has been captured. This dataset is publicly available to allow researchers to develop new action recognition algorithms for video games and benchmark their performance. Due to the formats selected, it is possible to view all the recorded data and tags without any special software tools. The Microsoft Kinect device and Windows SDK [3] enable easy capture of synchronised video, depth and skeleton data. The three streams were recorded at 30fps in a mirrored view, so Figures 1-4 actually show a right punch. The PNG image format was selected for storing both the depth and colour images as it is a lossless format, suitable for online access, and is open source. The resolution used to store both the depth and colour images was 640x480. The raw depth information contains the depth of each pixel in millimetres and was stored in 16-bit greyscale (see Figure 2), and the raw colour in 24-bit RGB (see Figure 1). The 16 bits of depth data contain 13 bits for depth and 3 bits to identify the player. The player index can be used to segment the depth maps by user (see Figure 4). The depth information was also mapped to the colour co-ordinate space and stored in 16-bit greyscale. Combining the colour image with the mapped depth data allows the user to also be segmented in the colour image. The XML text format was selected for storing the skeleton information as it is human readable and again suited for online access. The root node of the XML file is an array of skeletons. Each skeleton contains the player's position and pose. The pose comprises 20 joints as defined by Microsoft [3]. The player and joint positions are given in X, Y and Z co-ordinates in metres. These positions are also mapped into the depth (see Figure 4) and colour co-ordinate spaces.
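The 13-bit depth plus 3-bit player index described above can be unpacked with simple bit operations. The sketch below assumes the Kinect SDK convention of the player index occupying the low 3 bits; verify the bit order against the actual dataset files before relying on it:

```python
def unpack_depth(pixel):
    """Split a 16-bit Kinect depth pixel into depth (mm) and player index.

    Assumed layout (Kinect SDK convention): upper 13 bits are depth in
    millimetres, lower 3 bits are the player index (0 = no player).
    """
    depth_mm = pixel >> 3      # upper 13 bits: depth in millimetres
    player = pixel & 0b111     # lower 3 bits: player index
    return depth_mm, player

# A pixel 2000 mm away belonging to player 1:
pixel = (2000 << 3) | 1
print(unpack_depth(pixel))  # (2000, 1)
```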
The skeleton data includes a joint tracking state, displayed in Figure 3 as tracked (green), inferred (yellow) and not tracked (red). In many cases the inferred joints will be accurate, as in Figure 4, but in certain situations where limbs are occluded the inferred joints may be inaccurate, as in Figure 5. Consequently, pose data may need to be combined with colour or depth data to improve accuracy.

Figure 2. Depth map
Figure 3. Skeleton data
Figure 4. Correctly inferred joints
Figure 5. Incorrectly inferred joints

This dataset contains 10 subjects performing 20 gaming actions: punch right, punch left, kick right, kick left, defend, golf swing, tennis swing forehand, tennis swing backhand, tennis serve, throw bowling ball, aim and fire gun, walk, run, jump, climb, crouch, steer a car, wave, flap and clap. Most sequences contain multiple actions in a controlled indoor environment with a fixed camera, a typical setup for gesture-based gaming. The subjects were given basic instructions on how to perform each action, similar to those issued in a Kinect game. Nevertheless, the subjects were free to perform a gesture with either hand or, in the case of a side-facing action, to stand with either foot forward, to create a diverse dataset. Each sequence is repeated three times by each subject. This resulted in over 80,000 frames of video, depth and skeleton data. All the frames in the dataset that contain actions were manually labelled in a separate file with an appropriate tag. Each tag represents a single action and contains the action class, e.g. PunchRight, the first frame number and the last frame number. The XML tags are also publicly available.
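A tag file of this kind can be read with a few lines of standard-library code. The element and attribute names below are hypothetical, chosen only to illustrate the (class, first frame, last frame) structure; the actual schema of the G3D tag files may differ:

```python
import xml.etree.ElementTree as ET

# Hypothetical tag layout for illustration only; consult the dataset's
# actual XML schema for the real element and attribute names.
SAMPLE = """
<tags>
  <tag class="PunchRight" start="120" end="145"/>
  <tag class="Defend" start="200" end="230"/>
</tags>
"""

def load_tags(xml_text):
    """Parse action tags into (action_class, first_frame, last_frame) triples."""
    root = ET.fromstring(xml_text)
    return [(t.get("class"), int(t.get("start")), int(t.get("end")))
            for t in root.findall("tag")]

print(load_tags(SAMPLE))  # [('PunchRight', 120, 145), ('Defend', 200, 230)]
```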
4. Evaluation Framework

The existing performance metric for action recognition, classification accuracy determined for each sequence, is inadequate for gaming as it does not incorporate the time and continuity constraints critical for real-time action detection. The existing metric is also restricted to sequences containing one action. To address these limitations a new evaluation framework is proposed. To incorporate multiple actions, a frame-based evaluation framework could be used. Matches between each predicted frame and ground truth frame could be accumulated in a confusion matrix and average classification accuracy determined. Although this would work for multiple actions in a sequence, it still does not incorporate timing and continuity constraints. To overcome all of the existing limitations, an action-based evaluation framework is needed. However, matching each predicted action with a ground truth action poses a challenge, as both the predicted and ground truth actions represent a sequence of frames rather than just a single frame. It is no longer possible to accumulate matches in a confusion matrix, as there is a new case: the predicted action class matches a ground truth action type but is a false positive, e.g. a duplicated action (see the second predicted defend action in Figure 6). As the predicted and ground truth action classes match, the only place to represent this information in a confusion matrix is along the diagonal axis, which represents the true positives. However, this is clearly wrong, as the second defend action is a false positive. The i-LIDS event detection metric [11] resolves this problem by using an event timeline to determine the correct number of true positives, false negatives and false positives. For example, a true positive is detected if a predicted event occurs within 10 seconds of an actual event. However, this metric only recognises a single event type, e.g. an alarm.
Extending the time-based comparison of alarm events in the i-LIDS event detection metric [11] to multiple actions results in a new action-based evaluation framework. The new framework can evaluate the performance of any real-time action recognition algorithm by selecting an appropriate time constraint. A latency of 4 frames is acceptable in gaming, as it should not be perceived by the user: in Microsoft's experiments [1] an actual latency of 4 frames resulted in a perceived latency of 0 to 1 frames. The new action matching process is illustrated in Figure 6. For each sequence, the start time of any predicted action of a particular class is compared to the start time of the ground truth action to evaluate the number of true positive, false positive and false negative actions. In the proposed framework a true positive arises when a predicted event starts within T (T=4) frames of an actual event. A false positive is a predicted event that does not start within T frames of an actual event; this can be caused by two different situations. Firstly, when the correct action is detected but not within the T time period, for example the predicted punch left action in Figure 6. Secondly, when an incorrect action is predicted, for example the first predicted kick left action in Figure 6. Finally, a false negative is an actual event that remains unmatched with any predicted event. The main difference between this evaluation framework and the i-LIDS event detection metric is the addition of multiple distinct action classes to detect incorrect actions. The output of the matching process is the total number of true positive, false positive and false negative actions for each action class. This division by class is extremely helpful for improving action recognition algorithms but not so effective for ranking different algorithms.
The performance metrics of precision, recall and F1 are therefore computed for each action class and a final single average F1 score is calculated.

Figure 6. Matching of predicted actions with ground truth actions
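The matching and scoring procedure described above can be sketched as follows (a simplified implementation under our own naming: actions are (class, start_frame) pairs and each ground truth action is matched by at most one prediction):

```python
def match_actions(predicted, ground_truth, classes, T=4):
    """Count per-class TP/FP/FN by matching action start frames within T frames.

    predicted, ground_truth: lists of (action_class, start_frame) pairs.
    """
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in classes}
    matched = set()  # indices of already-matched ground truth actions
    for cls, start in predicted:
        hit = None
        for i, (gt_cls, gt_start) in enumerate(ground_truth):
            if i not in matched and gt_cls == cls and abs(start - gt_start) <= T:
                hit = i
                break
        if hit is not None:
            matched.add(hit)
            counts[cls]["tp"] += 1
        else:
            counts[cls]["fp"] += 1  # wrong class, outside T, or a duplicate
    for gt_cls, _ in (gt for i, gt in enumerate(ground_truth) if i not in matched):
        counts[gt_cls]["fn"] += 1  # actual action never matched
    return counts

def average_f1(counts):
    """Per-class F1 from TP/FP/FN, then the unweighted average over classes."""
    f1s = []
    for c in counts.values():
        tp, fp, fn = c["tp"], c["fp"], c["fn"]
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

# A duplicated defend (FP despite matching class) and a late punch left (FP):
counts = match_actions(
    predicted=[("Defend", 10), ("Defend", 40), ("PunchLeft", 100)],
    ground_truth=[("Defend", 12), ("PunchLeft", 60)],
    classes=["Defend", "PunchLeft"])
print(counts["Defend"])              # {'tp': 1, 'fp': 1, 'fn': 0}
print(round(average_f1(counts), 3))  # 0.333
```

Note how the second defend prediction becomes a false positive even though its class matches, which is exactly the case a plain confusion matrix cannot represent.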
5. Action recognition algorithm

To test our new evaluation framework we used Adaptive Boosting (AdaBoost) [16], a real-time action recognition machine learning algorithm proved successful for gaming gestures [1]. Microsoft [1] recommended pose-based input features such as position difference, position velocity, position velocity magnitude, angle velocity and joint angles. A description of the features is not available, so the following is our own definition.

Let $p_{j,t} = (x_{j,t}, y_{j,t}, z_{j,t})$ be the 3D location of joint $j$ at time $t$. The position difference features are defined as the difference between two distinct joints $j_1$ and $j_2$ in a single pose:

$\Delta x = x_{j_1,t} - x_{j_2,t}$    (1)
$\Delta y = y_{j_1,t} - y_{j_2,t}$    (2)
$\Delta z = z_{j_1,t} - z_{j_2,t}$    (3)

The position velocity features are similar, except they encode the differences of a single joint $j$ separated by time, where $t_1 < t_2$:

$\delta x = x_{j,t_2} - x_{j,t_1}$    (4)
$\delta y = y_{j,t_2} - y_{j,t_1}$    (5)
$\delta z = z_{j,t_2} - z_{j,t_1}$    (6)

The position velocity magnitude feature is defined as the Euclidean distance between a single joint separated by time, where $t_1 < t_2$:

$\|p_{j,t_2} - p_{j,t_1}\| = \sqrt{\delta x^2 + \delta y^2 + \delta z^2}$    (7)

The angle velocity features are defined as the angle between a single joint separated by time, projected in the x-y plane and the x-z plane:

$\theta_{xy} = \arctan(\delta y / \delta x)$    (8)
$\theta_{xz} = \arctan(\delta z / \delta x)$    (9)

The joint angle feature is defined as the 3D angle between three distinct joints $j_1$, $j_2$, $j_3$ in a single pose. Let vector $v_1$ be $(p_{j_2} - p_{j_1})$ and vector $v_2$ be $(p_{j_2} - p_{j_3})$; then

$\theta = \arccos\left(\frac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|}\right)$    (10)

We have implemented 170 features based on the Microsoft features described by equations (1)-(10). Instead of obtaining thresholds through experimentation [7] to produce Boolean features, we quantised the continuous output of the features into 16 discrete outputs. We used the skeleton data from the G3D dataset, split by subject: the first 5 subjects were used for training and the remaining 5 subjects for testing.

6. Results

The output of the AdaBoost algorithm is per-frame confidence results (see Figure 8). Microsoft [1] recommend filtering the results by using a sliding window to sum up the per-frame results. A sliding window of size 3 is used to filter our results (see Figure 9). The filtering is important to link broken actions into a single continuous block. However, the sliding window has the negative side effect of delaying the detection of the action. The length of the delay is directly related to the size of the sliding window, so there is an inherent trade-off between detecting continuous actions and the latency in detecting the action.

Figure 7. Ground truth (frame based) for trial 24
Figure 8. Confidence results (frame based) for trial 24
Figure 9. Filtered confidence results (frame based) for trial 24
Figure 10. Ground truth (action based) for trial 24
Figure 11. Highest confidence results (action based) for trial 24

The filtered frame-based results were then converted to action-based results by detecting the class with the highest confidence at each frame. If "other" was the class with the highest confidence then no action was predicted; otherwise, the action predicted was simply the action with the highest confidence. Once an action was predicted, sequential frames of the same class were linked to produce a single start and end frame number for each action (see Figure 11). The predicted actions were then matched against the ground truth actions (see Figure 10) using the new matching process illustrated in Figure 6. Since a confusion matrix is not feasible for the action-based approach, a matrix with the estimated true positives, false
negatives and false positives is provided in Figure 12 (left). The action-based F1 value for each class was computed from this matrix and then averaged to produce an F1 value for the fighting sequence of 46.66%. For comparison, a frame-based confusion matrix was also computed (see Figure 12, right); the frame-based average F1 value for the same fighting sequence is 61.17%.

Figure 12. Fighting matrices: action based (left), frame based (right)

This process was repeated for all sequences in the G3D dataset and Table 1 shows a comparison of the frame-based and action-based results. The results highlight a dramatic difference between the frame-based F1 values and the action-based F1 values. Whilst the algorithm appears to be performing well at the frame level, it is severely underperforming at the action level. The difference between the frame-based and action-based evaluation frameworks is significant. Therefore, in real-time game applications, a performance measure that does not take into consideration timing and continuity may lead to a wrong conclusion and should be avoided.

Table 1. Testing results

Action Category   Frame based F1   Action based F1 (T=4)
Fighting          70.46%           58.54%
Golf              83.37%           11.88%
Tennis            56.44%           14.85%
Bowling           80.78%           31.58%
FPS               53.57%           13.65%
Driving a car     84.24%            2.50%
Misc              78.21%           18.13%

7. Conclusion

A novel and reliable evaluation framework for real-time action recognition algorithms was suggested in this work. Compared with the existing metric utilised in this area, it imposes additional restrictions related to time and repetitions. Correct recognitions that do not occur in time have a negative effect in games, as in other real-time applications. Additionally, recognising the same action multiple times while only one occurrence is present will have disastrous effects in a game, and thus it has to be taken into consideration during the evaluation process and the corresponding metrics. Experiments were performed using AdaBoost, a real-time action recognition method, and further indicate the need for the proposed metric that takes into consideration all these new parameters related to games and real-time applications. Finally, a new dataset for gaming action recognition is provided containing synchronised video, depth and skeleton data. This combined dataset will allow further research into the comparison of pose-based and appearance-based methodologies.

8. References

[1] C. Marais, "Kinect Gesture Detection using Machine Learning" [Online], 2011.
[2] M. Marszalek, I. Laptev and C. Schmid, "Actions in context," in CVPR.
[3] Microsoft, "Kinect Natural User Interface (NUI) Overview" [Online], 2012.
[4] A. Yao, J. Gall, G. Fanelli and L. Van Gool, "Does Human Action Recognition Benefit from Pose Estimation?" in BMVC.
[5] J.K. Aggarwal and M.S. Ryoo, "Human activity analysis: A review," CSUR, vol. 43.
[6] W. Li, Z. Zhang and Z. Liu, "Action Recognition Based on A Bag of 3D Points," in CVPR4HB, pp. 9-14, San Francisco, CA, USA, June 18.
[7] M. Muller, T. Roder and M. Clausen, "Efficient content-based retrieval of motion capture data," ACM Trans. Graph., vol. 24, July.
[8] C. Schuldt, I. Laptev and B. Caputo, "Recognizing human actions: a local SVM approach," in ICPR, vol. 3.
[9] M. Blank, L. Gorelick, E. Shechtman, M. Irani and R. Basri, "Actions as space-time shapes," in ICCV.
[10] D. Thirde, "PETS: Performance Evaluation of Tracking and Surveillance" [Online], 2005.
[11] Home Office, "Imagery Library for Intelligent Detection Systems: The i-LIDS User Guide."
[12] Carnegie Mellon University, "Motion Capture Database" [Online], 2011.
[13] M. Muller, T. Roder, M. Clausen, B. Eberhardt, B. Kruger and A. Weber, "Documentation Mocap Database HDM05," Universitat Bonn, Tech. Rep.
[14] L. Sigal, A.O. Balan and M.J. Black, "HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion," Int. J. Comput. Vision, vol. 87, pp. 4-27.
[15] A. Gilbert, J. Illingworth and R. Bowden, "Action Recognition Using Mined Hierarchical Compound Features," TPAMI, vol. 33.
[16] Y. Freund and R.E. Schapire, "Experiments with a new boosting algorithm," in Machine Learning: Proceedings of the Thirteenth International Conference, Morgan Kaufmann, San Francisco, 1996.
More informationTemporal Feature Weighting for Prototype-Based Action Recognition
Temporal Feature Weighting for Prototype-Based Action Recognition Thomas Mauthner, Peter M. Roth, and Horst Bischof Institute for Computer Graphics and Vision Graz University of Technology {mauthner,pmroth,bischof}@icg.tugraz.at
More informationArticulated Pose Estimation with Flexible Mixtures-of-Parts
Articulated Pose Estimation with Flexible Mixtures-of-Parts PRESENTATION: JESSE DAVIS CS 3710 VISUAL RECOGNITION Outline Modeling Special Cases Inferences Learning Experiments Problem and Relevance Problem:
More informationHuman Activity Recognition Using a Dynamic Texture Based Method
Human Activity Recognition Using a Dynamic Texture Based Method Vili Kellokumpu, Guoying Zhao and Matti Pietikäinen Machine Vision Group University of Oulu, P.O. Box 4500, Finland {kello,gyzhao,mkp}@ee.oulu.fi
More informationSTOP: Space-Time Occupancy Patterns for 3D Action Recognition from Depth Map Sequences
STOP: Space-Time Occupancy Patterns for 3D Action Recognition from Depth Map Sequences Antonio W. Vieira 1,2,EricksonR.Nascimento 1, Gabriel L. Oliveira 1, Zicheng Liu 3,andMarioF.M.Campos 1, 1 DCC - Universidade
More informationGraph-based High Level Motion Segmentation using Normalized Cuts
Graph-based High Level Motion Segmentation using Normalized Cuts Sungju Yun, Anjin Park and Keechul Jung Abstract Motion capture devices have been utilized in producing several contents, such as movies
More informationAction Recognition & Categories via Spatial-Temporal Features
Action Recognition & Categories via Spatial-Temporal Features 华俊豪, 11331007 huajh7@gmail.com 2014/4/9 Talk at Image & Video Analysis taught by Huimin Yu. Outline Introduction Frameworks Feature extraction
More informationExtracting Spatio-temporal Local Features Considering Consecutiveness of Motions
Extracting Spatio-temporal Local Features Considering Consecutiveness of Motions Akitsugu Noguchi and Keiji Yanai Department of Computer Science, The University of Electro-Communications, 1-5-1 Chofugaoka,
More informationCategorizing Turn-Taking Interactions
Categorizing Turn-Taking Interactions Karthir Prabhakar and James M. Rehg Center for Behavior Imaging and RIM@GT School of Interactive Computing, Georgia Institute of Technology {karthir.prabhakar,rehg}@cc.gatech.edu
More informationP-CNN: Pose-based CNN Features for Action Recognition. Iman Rezazadeh
P-CNN: Pose-based CNN Features for Action Recognition Iman Rezazadeh Introduction automatic understanding of dynamic scenes strong variations of people and scenes in motion and appearance Fine-grained
More informationAUTOMATIC 3D HUMAN ACTION RECOGNITION Ajmal Mian Associate Professor Computer Science & Software Engineering
AUTOMATIC 3D HUMAN ACTION RECOGNITION Ajmal Mian Associate Professor Computer Science & Software Engineering www.csse.uwa.edu.au/~ajmal/ Overview Aim of automatic human action recognition Applications
More informationFully Automatic Methodology for Human Action Recognition Incorporating Dynamic Information
Fully Automatic Methodology for Human Action Recognition Incorporating Dynamic Information Ana González, Marcos Ortega Hortas, and Manuel G. Penedo University of A Coruña, VARPA group, A Coruña 15071,
More informationLinear combinations of simple classifiers for the PASCAL challenge
Linear combinations of simple classifiers for the PASCAL challenge Nik A. Melchior and David Lee 16 721 Advanced Perception The Robotics Institute Carnegie Mellon University Email: melchior@cmu.edu, dlee1@andrew.cmu.edu
More informationACTION BASED FEATURES OF HUMAN ACTIVITY RECOGNITION SYSTEM
ACTION BASED FEATURES OF HUMAN ACTIVITY RECOGNITION SYSTEM 1,3 AHMED KAWTHER HUSSEIN, 1 PUTEH SAAD, 1 RUZELITA NGADIRAN, 2 YAZAN ALJEROUDI, 4 MUHAMMAD SAMER SALLAM 1 School of Computer and Communication
More informationDefinition, Detection, and Evaluation of Meeting Events in Airport Surveillance Videos
Definition, Detection, and Evaluation of Meeting Events in Airport Surveillance Videos Sung Chun Lee, Chang Huang, and Ram Nevatia University of Southern California, Los Angeles, CA 90089, USA sungchun@usc.edu,
More informationAn Efficient Part-Based Approach to Action Recognition from RGB-D Video with BoW-Pyramid Representation
2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) November 3-7, 2013. Tokyo, Japan An Efficient Part-Based Approach to Action Recognition from RGB-D Video with BoW-Pyramid
More informationKinect Cursor Control EEE178 Dr. Fethi Belkhouche Christopher Harris Danny Nguyen I. INTRODUCTION
Kinect Cursor Control EEE178 Dr. Fethi Belkhouche Christopher Harris Danny Nguyen Abstract: An XBOX 360 Kinect is used to develop two applications to control the desktop cursor of a Windows computer. Application
More informationMultimodal Motion Capture Dataset TNT15
Multimodal Motion Capture Dataset TNT15 Timo v. Marcard, Gerard Pons-Moll, Bodo Rosenhahn January 2016 v1.2 1 Contents 1 Introduction 3 2 Technical Recording Setup 3 2.1 Video Data............................
More informationAction Recognition in Video by Sparse Representation on Covariance Manifolds of Silhouette Tunnels
Action Recognition in Video by Sparse Representation on Covariance Manifolds of Silhouette Tunnels Kai Guo, Prakash Ishwar, and Janusz Konrad Department of Electrical & Computer Engineering Motivation
More informationCategory vs. instance recognition
Category vs. instance recognition Category: Find all the people Find all the buildings Often within a single image Often sliding window Instance: Is this face James? Find this specific famous building
More informationLecture 19: Depth Cameras. Visual Computing Systems CMU , Fall 2013
Lecture 19: Depth Cameras Visual Computing Systems Continuing theme: computational photography Cameras capture light, then extensive processing produces the desired image Today: - Capturing scene depth
More informationFace Recognition At-a-Distance Based on Sparse-Stereo Reconstruction
Face Recognition At-a-Distance Based on Sparse-Stereo Reconstruction Ham Rara, Shireen Elhabian, Asem Ali University of Louisville Louisville, KY {hmrara01,syelha01,amali003}@louisville.edu Mike Miller,
More informationDistance matrices as invariant features for classifying MoCap data
Distance matrices as invariant features for classifying MoCap data ANTONIO W. VIEIRA 1, THOMAS LEWINER 2, WILLIAM SCHWARTZ 3 AND MARIO CAMPOS 3 1 Department of Mathematics Unimontes Montes Claros Brazil
More informationBag of Optical Flow Volumes for Image Sequence Recognition 1
RIEMENSCHNEIDER, DONOSER, BISCHOF: BAG OF OPTICAL FLOW VOLUMES 1 Bag of Optical Flow Volumes for Image Sequence Recognition 1 Hayko Riemenschneider http://www.icg.tugraz.at/members/hayko Michael Donoser
More informationAction Recognition Using Motion History Image and Static History Image-based Local Binary Patterns
, pp.20-214 http://dx.doi.org/10.14257/ijmue.2017.12.1.17 Action Recognition Using Motion History Image and Static History Image-based Local Binary Patterns Enqing Chen 1, Shichao Zhang* 1 and Chengwu
More informationAdaptive Action Detection
Adaptive Action Detection Illinois Vision Workshop Dec. 1, 2009 Liangliang Cao Dept. ECE, UIUC Zicheng Liu Microsoft Research Thomas Huang Dept. ECE, UIUC Motivation Action recognition is important in
More informationEvaluation of local descriptors for action recognition in videos
Evaluation of local descriptors for action recognition in videos Piotr Bilinski and Francois Bremond INRIA Sophia Antipolis - PULSAR group 2004 route des Lucioles - BP 93 06902 Sophia Antipolis Cedex,
More informationLearning Human Motion Models from Unsegmented Videos
In IEEE International Conference on Pattern Recognition (CVPR), pages 1-7, Alaska, June 2008. Learning Human Motion Models from Unsegmented Videos Roman Filipovych Eraldo Ribeiro Computer Vision and Bio-Inspired
More informationReal-Time Human Detection using Relational Depth Similarity Features
Real-Time Human Detection using Relational Depth Similarity Features Sho Ikemura, Hironobu Fujiyoshi Dept. of Computer Science, Chubu University. Matsumoto 1200, Kasugai, Aichi, 487-8501 Japan. si@vision.cs.chubu.ac.jp,
More informationReal-time Detection of Illegally Parked Vehicles Using 1-D Transformation
Real-time Detection of Illegally Parked Vehicles Using 1-D Transformation Jong Taek Lee, M. S. Ryoo, Matthew Riley, and J. K. Aggarwal Computer & Vision Research Center Dept. of Electrical & Computer Engineering,
More informationDeep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks
Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Si Chen The George Washington University sichen@gwmail.gwu.edu Meera Hahn Emory University mhahn7@emory.edu Mentor: Afshin
More informationHuman Action Recognition Using Dynamic Time Warping and Voting Algorithm (1)
VNU Journal of Science: Comp. Science & Com. Eng., Vol. 30, No. 3 (2014) 22-30 Human Action Recognition Using Dynamic Time Warping and Voting Algorithm (1) Pham Chinh Huu *, Le Quoc Khanh, Le Thanh Ha
More informationThe Analysis of Animate Object Motion using Neural Networks and Snakes
The Analysis of Animate Object Motion using Neural Networks and Snakes Ken Tabb, Neil Davey, Rod Adams & Stella George e-mail {K.J.Tabb, N.Davey, R.G.Adams, S.J.George}@herts.ac.uk http://www.health.herts.ac.uk/ken/vision/
More informationEvaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München
Evaluation Measures Sebastian Pölsterl Computer Aided Medical Procedures Technische Universität München April 28, 2015 Outline 1 Classification 1. Confusion Matrix 2. Receiver operating characteristics
More informationDyadic Interaction Detection from Pose and Flow
Dyadic Interaction Detection from Pose and Flow Coert van Gemeren 1,RobbyT.Tan 2, Ronald Poppe 1,andRemcoC.Veltkamp 1, 1 Interaction Technology Group, Department of Information and Computing Sciences,
More informationHuman Detection. A state-of-the-art survey. Mohammad Dorgham. University of Hamburg
Human Detection A state-of-the-art survey Mohammad Dorgham University of Hamburg Presentation outline Motivation Applications Overview of approaches (categorized) Approaches details References Motivation
More informationHuman Focused Action Localization in Video
Human Focused Action Localization in Video Alexander Kläser 1, Marcin Marsza lek 2, Cordelia Schmid 1, and Andrew Zisserman 2 1 INRIA Grenoble, LEAR, LJK {klaser,schmid}@inrialpes.fr 2 Engineering Science,
More informationReal-Time Action Recognition System from a First-Person View Video Stream
Real-Time Action Recognition System from a First-Person View Video Stream Maxim Maximov, Sang-Rok Oh, and Myong Soo Park there are some requirements to be considered. Many algorithms focus on recognition
More informationModeling Motion of Body Parts for Action Recognition
K.N.TRAN, I.A.KAKADIARIS, S.K.SHAH: MODELING MOTION OF BODY PARTS 1 Modeling Motion of Body Parts for Action Recognition Khai N. Tran khaitran@cs.uh.edu Ioannis A. Kakadiaris ioannisk@uh.edu Shishir K.
More informationSpatio-temporal Shape and Flow Correlation for Action Recognition
Spatio-temporal Shape and Flow Correlation for Action Recognition Yan Ke 1, Rahul Sukthankar 2,1, Martial Hebert 1 1 School of Computer Science, Carnegie Mellon; 2 Intel Research Pittsburgh {yke,rahuls,hebert}@cs.cmu.edu
More informationMobile Human Detection Systems based on Sliding Windows Approach-A Review
Mobile Human Detection Systems based on Sliding Windows Approach-A Review Seminar: Mobile Human detection systems Njieutcheu Tassi cedrique Rovile Department of Computer Engineering University of Heidelberg
More informationThe Analysis of Animate Object Motion using Neural Networks and Snakes
The Analysis of Animate Object Motion using Neural Networks and Snakes Ken Tabb, Neil Davey, Rod Adams & Stella George e-mail {K.J.Tabb, N.Davey, R.G.Adams, S.J.George}@herts.ac.uk http://www.health.herts.ac.uk/ken/vision/
More informationEvaluating Example-based Pose Estimation: Experiments on the HumanEva Sets
Evaluating Example-based Pose Estimation: Experiments on the HumanEva Sets Ronald Poppe Human Media Interaction Group, Department of Computer Science University of Twente, Enschede, The Netherlands poppe@ewi.utwente.nl
More informationSkeleton based Human Action Recognition using Kinect
Skeleton based Human Action Recognition using Kinect Ayushi Gahlot Purvi Agarwal Akshya Agarwal Vijai Singh IMS Engineering college, Amit Kumar Gautam ABSTRACT This paper covers the aspects of action recognition
More informationCategory-level localization
Category-level localization Cordelia Schmid Recognition Classification Object present/absent in an image Often presence of a significant amount of background clutter Localization / Detection Localize object
More informationHuman Motion Reconstruction from Action Video Data Using a 3-Layer-LSTM
Human Motion Reconstruction from Action Video Data Using a 3-Layer-LSTM Jihee Hwang Stanford Computer Science jiheeh@stanford.edu Danish Shabbir Stanford Electrical Engineering danishs@stanford.edu Abstract
More informationAction Recognition in Low Quality Videos by Jointly Using Shape, Motion and Texture Features
Action Recognition in Low Quality Videos by Jointly Using Shape, Motion and Texture Features Saimunur Rahman, John See, Chiung Ching Ho Centre of Visual Computing, Faculty of Computing and Informatics
More informationObject Detection by 3D Aspectlets and Occlusion Reasoning
Object Detection by 3D Aspectlets and Occlusion Reasoning Yu Xiang University of Michigan Silvio Savarese Stanford University In the 4th International IEEE Workshop on 3D Representation and Recognition
More informationA Method of Hyper-sphere Cover in Multidimensional Space for Human Mocap Data Retrieval
Journal of Human Kinetics volume 28/2011, 133-139 DOI: 10.2478/v10078-011-0030-0 133 Section III Sport, Physical Education & Recreation A Method of Hyper-sphere Cover in Multidimensional Space for Human
More informationReal-time gesture recognition from depth data through key poses learning and decision forests
Real-time gesture recognition from depth data through key poses learning and decision forests Leandro Miranda Thales Vieira Dimas Martinez Mathematics, UFAL Thomas Lewiner Mathematics, PUC-Rio Antonio
More informationAutomatic Colorization of Grayscale Images
Automatic Colorization of Grayscale Images Austin Sousa Rasoul Kabirzadeh Patrick Blaes Department of Electrical Engineering, Stanford University 1 Introduction ere exists a wealth of photographic images,
More informationLearning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009
Learning and Inferring Depth from Monocular Images Jiyan Pan April 1, 2009 Traditional ways of inferring depth Binocular disparity Structure from motion Defocus Given a single monocular image, how to infer
More informationSummarization of Egocentric Moving Videos for Generating Walking Route Guidance
Summarization of Egocentric Moving Videos for Generating Walking Route Guidance Masaya Okamoto and Keiji Yanai Department of Informatics, The University of Electro-Communications 1-5-1 Chofugaoka, Chofu-shi,
More informationShort Survey on Static Hand Gesture Recognition
Short Survey on Static Hand Gesture Recognition Huu-Hung Huynh University of Science and Technology The University of Danang, Vietnam Duc-Hoang Vo University of Science and Technology The University of
More informationHand Posture Recognition Using Adaboost with SIFT for Human Robot Interaction
Hand Posture Recognition Using Adaboost with SIFT for Human Robot Interaction Chieh-Chih Wang and Ko-Chih Wang Department of Computer Science and Information Engineering Graduate Institute of Networking
More informationSign Language Recognition using Dynamic Time Warping and Hand Shape Distance Based on Histogram of Oriented Gradient Features
Sign Language Recognition using Dynamic Time Warping and Hand Shape Distance Based on Histogram of Oriented Gradient Features Pat Jangyodsuk Department of Computer Science and Engineering The University
More informationSeminar Heidelberg University
Seminar Heidelberg University Mobile Human Detection Systems Pedestrian Detection by Stereo Vision on Mobile Robots Philip Mayer Matrikelnummer: 3300646 Motivation Fig.1: Pedestrians Within Bounding Box
More informationRevisiting LBP-based Texture Models for Human Action Recognition
Revisiting LBP-based Texture Models for Human Action Recognition Thanh Phuong Nguyen 1, Antoine Manzanera 1, Ngoc-Son Vu 2, and Matthieu Garrigues 1 1 ENSTA-ParisTech, 828, Boulevard des Maréchaux, 91762
More informationReal-Time Continuous Action Detection and Recognition Using Depth Images and Inertial Signals
Real-Time Continuous Action Detection and Recognition Using Depth Images and Inertial Signals Neha Dawar 1, Chen Chen 2, Roozbeh Jafari 3, Nasser Kehtarnavaz 1 1 Department of Electrical and Computer Engineering,
More informationStructured Models in. Dan Huttenlocher. June 2010
Structured Models in Computer Vision i Dan Huttenlocher June 2010 Structured Models Problems where output variables are mutually dependent or constrained E.g., spatial or temporal relations Such dependencies
More informationAutomatic visual recognition for metro surveillance
Automatic visual recognition for metro surveillance F. Cupillard, M. Thonnat, F. Brémond Orion Research Group, INRIA, Sophia Antipolis, France Abstract We propose in this paper an approach for recognizing
More informationHuman Body Recognition and Tracking: How the Kinect Works. Kinect RGB-D Camera. What the Kinect Does. How Kinect Works: Overview
Human Body Recognition and Tracking: How the Kinect Works Kinect RGB-D Camera Microsoft Kinect (Nov. 2010) Color video camera + laser-projected IR dot pattern + IR camera $120 (April 2012) Kinect 1.5 due
More informationPeople Tracking and Segmentation Using Efficient Shape Sequences Matching
People Tracking and Segmentation Using Efficient Shape Sequences Matching Junqiu Wang, Yasushi Yagi, and Yasushi Makihara The Institute of Scientific and Industrial Research, Osaka University 8-1 Mihogaoka,
More informationHuman Action Recognition Using a Temporal Hierarchy of Covariance Descriptors on 3D Joint Locations
Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence Human Action Recognition Using a Temporal Hierarchy of Covariance Descriptors on 3D Joint Locations Mohamed E.
More informationExploring the Trade-off Between Accuracy and Observational Latency in Action Recognition
Noname manuscript No. (will be inserted by the editor) Exploring the Trade-off Between Accuracy and Observational Latency in Action Recognition Chris Ellis Syed Zain Masood Marshall F. Tappen Joseph J.
More informationarxiv: v1 [cs.cv] 2 May 2016
16-811 Math Fundamentals for Robotics Comparison of Optimization Methods in Optical Flow Estimation Final Report, Fall 2015 arxiv:1605.00572v1 [cs.cv] 2 May 2016 Contents Noranart Vesdapunt Master of Computer
More informationA Low Power, High Throughput, Fully Event-Based Stereo System: Supplementary Documentation
A Low Power, High Throughput, Fully Event-Based Stereo System: Supplementary Documentation Alexander Andreopoulos, Hirak J. Kashyap, Tapan K. Nayak, Arnon Amir, Myron D. Flickner IBM Research March 25,
More informationLearning video saliency from human gaze using candidate selection
Learning video saliency from human gaze using candidate selection Rudoy, Goldman, Shechtman, Zelnik-Manor CVPR 2013 Paper presentation by Ashish Bora Outline What is saliency? Image vs video Candidates
More informationForensic Image Recognition using a Novel Image Fingerprinting and Hashing Technique
Forensic Image Recognition using a Novel Image Fingerprinting and Hashing Technique R D Neal, R J Shaw and A S Atkins Faculty of Computing, Engineering and Technology, Staffordshire University, Stafford
More informationSeparating Objects and Clutter in Indoor Scenes
Separating Objects and Clutter in Indoor Scenes Salman H. Khan School of Computer Science & Software Engineering, The University of Western Australia Co-authors: Xuming He, Mohammed Bennamoun, Ferdous
More information