Action Recognition in Video by Sparse Representation on Covariance Manifolds of Silhouette Tunnels

Similar documents
Object and Action Detection from a Single Example

Human Activity Recognition Using a Dynamic Texture Based Method

Lecture 18: Human Motion Recognition

2010 ICPR Contest on Semantic Description of Human Activities (SDHA) M. S. Ryoo* J. K. Aggarwal Amit Roy-Chowdhury

Evaluation of Local Space-time Descriptors based on Cuboid Detector in Human Action Recognition

EVENT DETECTION AND HUMAN BEHAVIOR RECOGNITION. Ing. Lorenzo Seidenari

Learning Human Actions with an Adaptive Codebook

SUPERVISED NEIGHBOURHOOD TOPOLOGY LEARNING (SNTL) FOR HUMAN ACTION RECOGNITION

Space-Time Shapelets for Action Recognition

Action Recognition & Categories via Spatial-Temporal Features

Bag of Optical Flow Volumes for Image Sequence Recognition 1

QMUL-ACTIVA: Person Runs detection for the TRECVID Surveillance Event Detection task

Robust Face Recognition via Sparse Representation Authors: John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma

Discriminative human action recognition using pairwise CSP classifiers

Human Action Recognition Using Silhouette Histogram

REJECTION-BASED CLASSIFICATION FOR ACTION RECOGNITION USING A SPATIO-TEMPORAL DICTIONARY. Stefen Chan Wai Tim, Michele Rombaut, Denis Pellerin

Action Recognition by Learning Mid-level Motion Features

Class 9 Action Recognition

EigenJoints-based Action Recognition Using Naïve-Bayes-Nearest-Neighbor

Spatio-temporal Shape and Flow Correlation for Action Recognition

CS229: Action Recognition in Tennis

Locally Adaptive Regression Kernels with (many) Applications

Incremental Action Recognition Using Feature-Tree

IMPROVING SPATIO-TEMPORAL FEATURE EXTRACTION TECHNIQUES AND THEIR APPLICATIONS IN ACTION CLASSIFICATION. Maral Mesmakhosroshahi, Joohee Kim

Human Action Recognition in Videos Using Hybrid Motion Features

Evaluation and comparison of interest points/regions

Human detection using local shape and nonredundant

Available online at ScienceDirect. Procedia Computer Science 60 (2015 )

Action Recognition from a Small Number of Frames

Learning Human Motion Models from Unsegmented Videos

Human action recognition using an ensemble of body-part detectors

Previously. Part-based and local feature models for generic object recognition. Bag-of-words model 4/20/2011

Fast Motion Consistency through Matrix Quantization

Automatic Gait Recognition. - Karthik Sridharan

Statistics of Pairwise Co-occurring Local Spatio-Temporal Features for Human Action Recognition

An Efficient Part-Based Approach to Action Recognition from RGB-D Video with BoW-Pyramid Representation

Computation Strategies for Volume Local Binary Patterns applied to Action Recognition

Action Recognition By Learnt Class-Specific Overcomplete Dictionaries

A Motion Descriptor Based on Statistics of Optical Flow Orientations for Action Classification in Video-Surveillance

Temporal Feature Weighting for Prototype-Based Action Recognition

Human Detection and Tracking for Video Surveillance: A Cognitive Science Approach

Action Recognition using Randomised Ferns

Human Action Recognition Using Independent Component Analysis

Action recognition in videos

Spatio-temporal Feature Classifier

Middle-Level Representation for Human Activities Recognition: the Role of Spatio-temporal Relationships

FAST HUMAN DETECTION USING TEMPLATE MATCHING FOR GRADIENT IMAGES AND ASC DESCRIPTORS BASED ON SUBTRACTION STEREO

A Spatio-Temporal Descriptor Based on 3D-Gradients

Expanding gait identification methods from straight to curved trajectories

Spatial-Temporal correlatons for unsupervised action classification

Action Recognition in Low Quality Videos by Jointly Using Shape, Motion and Texture Features

Detection and Summarization of Salient Events in Coastal Environments

Action Recognition Using Hybrid Feature Descriptor and VLAD Video Encoding

Action Snippets: How many frames does human action recognition require?

Fast Realistic Multi-Action Recognition using Mined Dense Spatio-temporal Features

Recognition of Animal Skin Texture Attributes in the Wild. Amey Dharwadker (aap2174) Kai Zhang (kz2213)

3D Human Motion Analysis and Manifolds

Recognizing and localizing individual activities through graph matching

Generic Human Action Recognition from a Single Example

Fast Motion Consistency through Matrix Quantization

Human Activity Recognition using the 4D Spatiotemporal Shape Context Descriptor

MoSIFT: Recognizing Human Actions in Surveillance Videos

Extracting Spatio-temporal Local Features Considering Consecutiveness of Motions

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

Evaluation of local descriptors for action recognition in videos

Learning View-invariant Sparse Representations for Cross-view Action Recognition

Tri-modal Human Body Segmentation

Recognizing Actions by Shape-Motion Prototype Trees

Dictionary of gray-level 3D patches for action recognition

Part-based and local feature models for generic object recognition

WHO MISSED THE CLASS? - UNIFYING MULTI-FACE DETECTION, TRACKING AND RECOGNITION IN VIDEOS. Yunxiang Mao, Haohan Li, Zhaozheng Yin

An Efficient Face Detection and Recognition System

Human Upper Body Pose Estimation in Static Images

Action Recognition with HOG-OF Features

Partial Least Squares Regression on Grassmannian Manifold for Emotion Recognition

Fast Human Detection Algorithm Based on Subtraction Stereo for Generic Environment

Fast and Stable Human Detection Using Multiple Classifiers Based on Subtraction Stereo with HOG Features

Image Set-based Face Recognition: A Local Multi-Keypoint Descriptor-based Approach

Edge Enhanced Depth Motion Map for Dynamic Hand Gesture Recognition

IMA Preprint Series # 2378

Human Action Classification

Multiple-View Object Recognition in Band-Limited Distributed Camera Networks

A Hierarchical Model of Shape and Appearance for Human Action Classification

Person identification from spatio-temporal 3D gait

Silhouette Based Human Motion Detection and Recognising their Actions from the Captured Video Streams Deepak. N. A

Combined Shape Analysis of Human Poses and Motion Units for Action Segmentation and Recognition

Spatio-Temporal Optical Flow Statistics (STOFS) for Activity Classification

Cross-View Action Recognition from Temporal Self-Similarities

AUTOMATIC 3D HUMAN ACTION RECOGNITION Ajmal Mian Associate Professor Computer Science & Software Engineering

Dynamic Shape Tracking via Region Matching

Selection of Scale-Invariant Parts for Object Class Recognition

Categorizing Turn-Taking Interactions

ActivityRepresentationUsing3DShapeModels

View Invariant Movement Recognition by using Adaptive Neural Fuzzy Inference System

Vision and Image Processing Lab., CRV Tutorial day- May 30, 2010 Ottawa, Canada

Adaptive Action Detection

Human Detection Using SURF and SIFT Feature Extraction Methods in Different Color Spaces

MULTI-POSE FACE HALLUCINATION VIA NEIGHBOR EMBEDDING FOR FACIAL COMPONENTS. Yanghao Li, Jiaying Liu, Wenhan Yang, Zongming Guo

Distance-driven Fusion of Gait and Face for Human Identification in Video

Selecting Models from Videos for Appearance-Based Face Recognition

Transcription:

Action Recognition in Video by Sparse Representation on Covariance Manifolds of Silhouette Tunnels Kai Guo, Prakash Ishwar, and Janusz Konrad Department of Electrical & Computer Engineering

Motivation Recognize actions in video walk skate Applications Surveillance: Sports & entertainment: Tools for hearing impaired: Wildlife habitat monitoring: 2

Challenges and assumptions Scene Multiple objects Clutter and occlusion Illumination variability 3

Challenges and assumptions Scene Multiple objects Clutter and occlusion Illumination variability Scene Single object No significant clutter and occlusion No significant illumination change 4

Challenges and assumptions Scene Multiple objects Clutter and occlusion Illumination variability Scene Single object No significant clutter and occlusion No significant illumination change Acquisition Camera motion Camera viewpoint and zoom Camera imperfections 5

Challenges and assumptions Scene Multiple objects Clutter and occlusion Illumination variability Acquisition Camera motion Camera viewpoint and zoom Camera imperfections Scene Single object No significant clutter and occlusion No significant illumination change Acquisition Single camera, fixed viewpoint No significant camera motion and distortion 6

Challenges and assumptions Scene Multiple objects Clutter and occlusion Illumination variability Acquisition Camera motion Camera viewpoint and zoom Camera imperfections Scene Single object No significant clutter and occlusion No significant illumination change Acquisition Single camera, fixed viewpoint No significant camera motion and distortion Action Non-rigid objects Complex motion (ballet video) Intra and inter object motion variability 7

Challenges and assumptions Scene Multiple objects Clutter and occlusion Illumination variability Acquisition Camera motion Camera viewpoint and zoom Camera imperfections Scene Single object No significant clutter and occlusion No significant illumination change Acquisition Single camera, fixed viewpoint No significant camera motion and distortion Action Non-rigid objects Complex motion (ballet video) Intra and inter object motion variability Action Non-rigid objects Complex motion Intra and inter object motion variability 8

Problem statement Dictionary of labeled training data Given: walk run jump wave Task: Action Recognition walk unknown action Recall: challenges non-rigid object complex motion intra and inter object motion variability 9

Related work Action features Shape-based features Interest -point based features Geometric human body features Motion-based features Action classifiers Nearest Neighbor SVM [Gorelick-Irani-PAMI 07] [Bobick-Davis-PAMI 01] [Collins-Gross- ICAFGR 02] [Ikizler-Duygulu- LNCS 07] [Ahmad-Lee-Journal of Multimedia 10] [Dollar-Rabaud-VS PETS 05] [Shuldt-Laptev- ICPR 04] [Laptev-CVPR 08] [Cunado-Nixon- CVIU 03] [Wang-Ning-ICSVT 04] [Goncalves-Bernardo- CVPR 95] [Seo-Milanfar-PAMI (submitted)], [Liu-Ali-CVPR 08] [Lowe-IJCV 04] [Danafar-Gheissari- ACCV07], [Scovanner- Ali-ACM Multimedia 07] Boosting [Zhang-Liu-ICCV 09] [Smith-Shah-ICCV 05] - [Alireza-Mori- CVPR 08], [Ke-Sukthankar- ICCV 05] Graphical (Probabilistic) model [Chen-Wu-ICDMW 08] [Niebles-Lei-IJCV 08] [Wong-Cipolla- ICCV 07] [Rohr-CVGIP 94] [Ali-Shah-PAMI 10] 10

Action recognition framework Action recognition = Supervised learning problem, where data samples are video clips Two main ingredients: Representation of samples Classification 11

Action representation video clip local features bag of dense local features 12

Action representation video clip local features bag of dense local features How to reduce the dimension of bag of local features? 13

Action representation video clip local features bag of dense local features pdf Action representation How to reduce the dimension of bag of local features? Ideally, one should learn and compare pdfs of features Problem: it may not reduce the dimensionality 14

Action representation video clip local features bag of dense local features Mean Action representation How to reduce the dimension of bag of local features? Idea-1: Learn and compare 1 st order statistics (mean) Problem: not sufficiently discriminative 15

Action representation video clip local features bag of dense local features covariance matrix Action representation How to reduce the dimension of bag of local features? Idea-2: Learn and compare 2 nd order statistics (covariance) [Tuzel-Porikli-Meer PAMI 08] 16

Action representation video clip local features bag of dense local features covariance matrix Action representation Ψ How to reduce the dimension of bag of local features? Idea-2: Learn and compare 2 nd order statistics (covariance) [Tuzel-Porikli-Meer PAMI 08] Output: feature covariance matrix (e.g., 13-dim vector -> 91-dim covariance matrix) 17

Action representation video clip local features bag of dense local features covariance matrix Action representation Ψ How to reduce the dimension of bag of local features? Idea-2: Learn and compare 2 nd order statistics (covariance) [Tuzel-Porikli-Meer PAMI 08] Output: feature covariance matrix (e.g., 13-dim vector -> 91-dim covariance matrix) Main thesis: covariance matrix is sufficient for action recognition 18

Action representation video clip silhouette tunnel bag of dense local features covariance matrix Action representation Ψ How to reduce the dimension of bag of local features? Idea-2: Learn and compare 2 nd order statistics (covariance) [Tuzel-Porikli-Meer PAMI 08] Output: feature covariance matrix (e.g., 13-dim vector -> 91-dim covariance matrix) Main thesis: covariance matrix is sufficient for action recognition 19

Covariance manifold Covariance matrices form: a Riemannian manifold not a vector space Manifold of covariance matrices C C Matrix-log maps a Riemannian manifold to a vector space C = UDU T log( C ) : = U log( D) U [Arsigny-Pennec-Ayache 06] T matrix-log log(c ) log(c) vector space 20

Classification based on sparse linear representation seg1 seg2 label(1) label(2) Ψ Ψ C1 C2 Matrix-log & vectorization Matrix-log & vectorization p 1 p 2 P label(n) CN... p N segk Ψ Matrix-log & vectorization Cquery p query Ψ Matrix-log & vectorization Decide query label Estimated query label 21 [Wright et al., PAMI 09]

Classification algorithm Each coefficient of weights the contribution of training segments to query segment P * α P1 P2 P3 action 1 action 2 action 3 * α 1 Step 1: Compute residual error: * α 2 * α 3 R * i ( pquery ) = pquery Pi αi 2 Step 2: Determine query label label( p query ) = arg min Ri ( pquery ) i 22

Silhouette-based local features dn dt- dnw dw dsw (x,y,t) ds dne de dse dt+ x yt s = x y t f(s) = dn de dt+ Dimensionality reduction s S dt- 13x1 1 C S : = ( f ( s) μ)( f ( s) μ) S T 23

Implementation issues How to process a long video? Break it into segments? What should be the length of segments? Period for human action 0.4 0.8 s Segment length 10 20 frames (@25 fps video) y time x 24

Action segments No knowledge about the beginning and end of periods Use overlapping segments Additional benefits of overlapping segments Reduced sensitivity to temporal action misalignment Richer dictionary 25

Segment-level classification seg1 seg2 label(1) label(2) Ψ Ψ C1 C2 Matrix-log & vectorization Matrix-log & vectorization p 1 p 2 P label(n) CN... p N segk Ψ Matrix-log & vectorization Cquery p query Ψ Matrix-log & vectorization Decide query label Estimated query label 26 [Wright et al., PAMI 09]

From segment decisions to a sequence decision How to get the labels of query video? Majority rule walk 27

Summary of the approach Partitioning into segments 28

Summary of the approach Partitioning into segments Action representation for each segment video clip silhouette tunnel bag of dense local features covariance matrix Ψ Action representation 29

Summary of the approach Partitioning into segments Action representation for each segment Segment-wise action recognition Bag of labeled feature covariance matrices label(1) label(2) label(k) C1 C2 CK Action recognition Estimated label of query segment Cquery 30

Summary of the approach Partitioning into segments Action representation for each segment Segment-wise action recognition Decision fusion walk 31

Experimental results Datasets Weizmann: 9 persons x 10 actions (180x144) bend jumping-jack jump pjump run side skip walk wave1 wave2 UT-Tower: 6 persons x 9 actions x 2 times (about 90x70) point stand dig walk carry jump wave1 wave2 run 32

Performances Correct classification rate (CCR) SEG-CCR: % of correctly classified query segments SEQ-CCR: % of correctly classified query sequences Weizmann dataset: (LOOCV, N=8) Method Proposed NN-based Gorelick Niebles Ali Seo SEG-CCR 96.74% 97.05% 97.83% 95.75% SEQ-CCR 100% 100% 90% 96% UT-Tower dataset: (LOOCV, N=8) Method Proposed NN-based SEG-CCR 96.15% 93.53% SEQ-CCR 97.22% 96.30% 33

Computational complexity Platform: Dual Core 2.2 GHz + 2GB Memory + Matlab 7.6 Action representation video: 180 x 144 x 84 0.12 sec/frame (8.3fps) Action classification 0.07 sec/segment (14.3fps) 34

Conclusions We proposed a novel approach to action recognition: action representation = covariance matrix of local features action classification = sparse-representation-based classifier The proposed approach has state-of-the-art performance on Weizmann dataset 100% performance on non-static actions in low-resolution UT-Tower dataset low memory requirements with close to real-time performance 35