Person Action Recognition/Detection

Person Action Recognition/Detection - Fabrício Ceschin - Computer Vision - Prof. David Menotti - Departamento de Informática, Universidade Federal do Paraná

In object recognition: is there a chair in the image? In object detection: is there a chair and where is it in the image?

In action recognition: is there an action present in the video? In action detection: is there an action and where is it in the video?


Datasets

KTH: Six types of human actions: walking, jogging, running, boxing, hand waving and hand clapping. Four different scenarios: outdoors (s1), outdoors with scale variation (s2), outdoors with different clothes (s3) and indoors (s4). 2391 sequences taken with a static camera at 25 fps.

Hollywood2: 12 classes of human actions and 10 classes of scenes distributed over 3669 video clips from 69 movies. Approximately 20.1 hours of video in total.

UCF Sports Action Data Set: actions collected from various sports that are typically featured on broadcast television channels such as the BBC and ESPN. 10 classes of human actions. 150 sequences at a resolution of 720 x 480.

UCF YouTube Action Data Set: 11 action categories collected from YouTube and personal videos. Challenging due to large variations in camera motion, object pose and appearance, object scale, viewpoint, cluttered background, illumination conditions, etc.

JHMDB: 21 categories, 928 clips, 33183 frames. Puppet flow per frame (approximated optical flow on the person). Puppet mask per frame. Joint positions per frame. Action label per clip. Meta label per clip (camera motion, visible body parts, camera viewpoint, number of people, video quality).

Articles

Timeline
- Learning Realistic Human Actions from Movies - Ivan Laptev, Marcin Marszałek, Cordelia Schmid and Benjamin Rozenfeld - CVPR 2008
- Dense Trajectories and Motion Boundary Descriptors for Action Recognition - Heng Wang, Alexander Kläser, Cordelia Schmid and Cheng-Lin Liu - IJCV 2013
- Two-Stream Convolutional Networks for Action Recognition in Videos - Karen Simonyan and Andrew Zisserman - NIPS 2014
- Finding Action Tubes - Georgia Gkioxari and Jitendra Malik - CVPR 2015

Learning Realistic Human Actions from Movies - Ivan Laptev, Marcin Marszałek, Cordelia Schmid and Benjamin Rozenfeld - CVPR 2008

Introduction & Dataset Generation: Inspired by new robust methods for image description and classification. First version of the Hollywood dataset. Movies contain a rich variety and a large number of realistic human actions. To avoid the difficulty of manual annotation, the dataset was built using script-based action annotation: time information is transferred from subtitles to scripts and time intervals for scene descriptions are then inferred - about 60% precision is achieved.

Script-based Action Annotation (figure): example of matching speech sections (green) in subtitles and scripts. Time information (blue) from adjacent speech sections is used to estimate time intervals of scene descriptions (yellow).

Space-time Features: Detect interest points using a space-time extension of the Harris operator. Histogram descriptors of space-time volumes are computed in the neighborhood of the detected points (the size of each volume is related to the detection scales). Each volume is subdivided into an (Nx, Ny, Nt) grid of cuboids; for each cuboid, histograms of oriented gradients (HOG) and histograms of optical flow (HOF) are computed. Both are concatenated into a single descriptor vector.
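
The HOG part of this descriptor can be sketched as follows (a minimal, illustrative Python version, not the authors' implementation; the grid size and bin count are assumptions). The cuboid around an interest point is split into an (nx, ny, nt) grid of cells, a gradient-orientation histogram is computed per cell, and the cell histograms are concatenated. The HOF component is analogous, with optical flow orientations in place of image gradients.

```python
# Illustrative sketch (not the authors' code): build a HOG-like descriptor for one
# space-time cuboid by splitting it into an (nx, ny, nt) grid of cells and
# concatenating per-cell orientation histograms of the image gradient.
import numpy as np

def cuboid_hog(volume, nx=3, ny=3, nt=2, bins=8):
    """volume: (T, H, W) grayscale patch centred on a space-time interest point."""
    t, h, w = volume.shape
    gy, gx = np.gradient(volume.astype(np.float32), axis=(1, 2))
    mag = np.sqrt(gx ** 2 + gy ** 2)
    ori = np.arctan2(gy, gx) % (2 * np.pi)          # gradient orientation in [0, 2*pi)

    hist = []
    for ti in range(nt):                            # split the cuboid into grid cells
        for yi in range(ny):
            for xi in range(nx):
                sl = (slice(ti * t // nt, (ti + 1) * t // nt),
                      slice(yi * h // ny, (yi + 1) * h // ny),
                      slice(xi * w // nx, (xi + 1) * w // nx))
                cell_hist, _ = np.histogram(ori[sl], bins=bins,
                                            range=(0, 2 * np.pi),
                                            weights=mag[sl])
                hist.append(cell_hist)
    desc = np.concatenate(hist)
    return desc / (np.linalg.norm(desc) + 1e-8)     # L2-normalise the descriptor
```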

Space-time Features (figures): (1) Space-time interest points detected for two video frames with the human actions "hand shake" (left) and "get out car" (right). (2) Result of detecting the strongest spatio-temporal interest points in a football sequence with a player heading the ball (a) and in a hand clapping sequence (b).

Spatio-temporal Bag-of-features: A visual vocabulary is built by clustering a subset of 100k features sampled from the training videos with the k-means algorithm (k = 4000). BoF assigns each feature to the closest (Euclidean distance) vocabulary word and computes the histogram of visual word occurrences over a space-time volume corresponding either to the entire video sequence or to subsequences defined by a spatio-temporal grid.
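
A minimal sketch of this pipeline, assuming scikit-learn is available (not the paper's code; MiniBatchKMeans is used here only to keep clustering 100k descriptors into 4000 words tractable):

```python
# Bag-of-features sketch: cluster sampled descriptors into a 4000-word vocabulary,
# then represent each video as a histogram of nearest-word assignments.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(sampled_descriptors, k=4000, seed=0):
    """sampled_descriptors: (100_000, D) array sampled from the training videos."""
    km = MiniBatchKMeans(n_clusters=k, random_state=seed, batch_size=10_000)
    km.fit(sampled_descriptors)
    return km

def bof_histogram(video_descriptors, vocabulary):
    """Assign each descriptor to its closest word (Euclidean) and count occurrences."""
    words = vocabulary.predict(video_descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(np.float64)
    return hist / (hist.sum() + 1e-8)               # normalised word-frequency histogram
```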

Spatio-temporal Bag-of-features (figure): bag-of-features illustration.

Classification: A support vector machine (SVM) with a multi-channel χ² kernel that combines channels, defined by

K(H_i, H_j) = exp( - Σ_c (1/A_c) D_c(H_i, H_j) ),

where H_i = {h_in} and H_j = {h_jn} are the histograms for channel c, A_c is a per-channel normalisation factor, and D_c(H_i, H_j) is the χ² distance

D_c(H_i, H_j) = (1/2) Σ_n (h_in - h_jn)² / (h_in + h_jn).
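
A small sketch of how such a kernel could be computed and plugged into an SVM with a precomputed kernel (illustrative; here the per-channel normaliser A_c is assumed to be the mean channel distance over the training set, and the same A must be reused when building the test kernel):

```python
# Multi-channel chi-square kernel sketch for a precomputed-kernel SVM.
import numpy as np
from sklearn.svm import SVC

def chi2_distance(H1, H2):
    """Chi-square distances between two sets of histograms: (N, V) x (M, V) -> (N, M)."""
    num = (H1[:, None, :] - H2[None, :, :]) ** 2
    den = H1[:, None, :] + H2[None, :, :] + 1e-10
    return 0.5 * (num / den).sum(axis=-1)

def multichannel_kernel(channels_a, channels_b, A=None):
    """channels_*: list of (N, V_c) histogram arrays, one entry per channel."""
    dists = [chi2_distance(Ha, Hb) for Ha, Hb in zip(channels_a, channels_b)]
    if A is None:                                   # assumed: normalise by the mean distance per channel
        A = [d.mean() for d in dists]
    return np.exp(-sum(d / a for d, a in zip(dists, A))), A

# usage: K_train, A = multichannel_kernel(train_channels, train_channels)
#        clf = SVC(kernel="precomputed").fit(K_train, labels)
#        K_test, _ = multichannel_kernel(test_channels, train_channels, A=A)
```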

Results: Average class accuracy on the KTH actions dataset:

Method           Accuracy
Schuldt et al.   71.7%
Niebles et al.   81.5%
Wong et al.      86.7%
This work        91.8%

Results: Average precision (AP) for each action class of the test set - results for clean (annotated) training data, automatic training data, and a random classifier (chance):

Action        Clean   Automatic   Chance
AnswerPhone   32.1%   16.4%       10.6%
GetOutCar     41.5%   16.4%       6.0%
HandShake     32.3%   9.9%        8.8%
HugPerson     40.6%   26.8%       10.1%
Kiss          53.3%   45.1%       23.5%
SitDown       38.6%   24.8%       13.8%
SitUp         18.2%   10.4%       4.6%
StandUp       50.5%   33.6%       22.6%

Dense Trajectories and Motion Boundary Descriptors for Action Recognition - Heng Wang, Alexander Kläser, Cordelia Schmid and Cheng-Lin Liu - IJCV 2013

Introduction: Bag-of-features achieves state-of-the-art performance. Feature trajectories have been shown to be efficient for representing videos. They are generally extracted using a KLT tracker or by matching SIFT descriptors between frames; however, their quantity and quality are often not sufficient. This work proposes video description by dense trajectories.

Dense Trajectories: Feature points are densely sampled on a grid spaced by W pixels (W = 5). Sampling is carried out on each spatial scale separately, and the goal is to track all the sampled points through the video. Areas without any structure are removed (points whose auto-correlation matrix has very small eigenvalues). Feature points are tracked on each spatial scale separately. Features are extracted using grids of cuboids, similar to the previous article.
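
A simplified sketch of the sampling and of a single tracking step, using OpenCV (assumptions: Farnebäck flow as a stand-in for the paper's optical flow algorithm, and illustrative threshold values):

```python
# Dense sampling + one tracking step (assumed, simplified): sample points every W
# pixels, drop points with weak structure (small minimum eigenvalue of the
# auto-correlation matrix), then move each point by the median-filtered dense flow.
import cv2
import numpy as np

def dense_sample(gray, W=5, quality=0.001):
    min_eig = cv2.cornerMinEigenVal(gray, blockSize=3)
    thresh = quality * min_eig.max()                 # relative threshold on structure
    ys, xs = np.mgrid[W // 2:gray.shape[0]:W, W // 2:gray.shape[1]:W]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1)
    keep = min_eig[pts[:, 1], pts[:, 0]] > thresh    # remove areas without any structure
    return pts[keep].astype(np.float32)

def track_step(prev_gray, next_gray, pts):
    """Move each point by the median-filtered dense optical flow at its location."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx = cv2.medianBlur(np.ascontiguousarray(flow[..., 0]), 5)   # median filtering of the flow field
    fy = cv2.medianBlur(np.ascontiguousarray(flow[..., 1]), 5)
    xi, yi = pts[:, 0].astype(int), pts[:, 1].astype(int)
    return pts + np.stack([fx[yi, xi], fy[yi, xi]], axis=1)
```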

Dense Trajectories (figure): Left: feature points are densely sampled on a grid for each spatial scale. Middle: tracking is carried out in the corresponding spatial scale for L frames by median filtering in a dense optical flow field. Right: trajectory shape is represented by relative point coordinates; the descriptors (HOG, HOF, MBH) are computed along the trajectory in an N x N pixel neighborhood, which is divided into grids of cuboids. Motion boundary histograms (MBH) are extracted by computing derivatives separately for the horizontal and vertical components of the optical flow.


Results: Comparison of different descriptors and methods for extracting trajectories on nine datasets. Mean average precision over all classes (mAP) is reported for Hollywood2 and Olympic Sports, and average accuracy over all classes for the other seven datasets. The three best results for each dataset are shown in bold.

Two-Stream Convolutional Networks for Action Recognition in Videos - Karen Simonyan and Andrew Zisserman - NIPS 2014

Introduction: CNNs work very well for image recognition; this work extends them to action recognition in video. Two separate recognition streams, related to the two-stream hypothesis: Spatial Stream - appearance recognition ConvNet. Temporal Stream - motion recognition ConvNet.

Two-stream Hypothesis: The ventral pathway (purple, the "what" pathway) responds to shape, color and texture. The dorsal pathway (green, the "where" pathway) responds to spatial transformations and movement.

Two-stream Architecture for Video Recognition: Spatial part: in the form of individual frame appearance, carries information about scenes and objects in the video. Temporal part: in the form of motion across the frames, carries information about the movement of the camera and the objects.

Two-stream Architecture for Video Recognition: Each stream is implemented as a deep ConvNet; their softmax scores are combined by late fusion. Two fusion methods are proposed: averaging, and training a multiclass linear SVM on stacked L2-normalised softmax scores as features.
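
Both fusion strategies are straightforward to sketch (illustrative code under assumed shapes and names, not the paper's implementation):

```python
# Late fusion of the two streams: average the softmax outputs, or train a linear
# SVM on the stacked, L2-normalised softmax scores.
import numpy as np
from sklearn.svm import LinearSVC

def l2_normalise(x):
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

def fuse_by_averaging(spatial_scores, temporal_scores):
    """Both inputs: (N, num_classes) softmax scores; returns fused class predictions."""
    return np.argmax((spatial_scores + temporal_scores) / 2.0, axis=1)

def fuse_by_svm(spatial_scores, temporal_scores, labels):
    """Train a multiclass linear SVM on stacked, L2-normalised softmax scores."""
    feats = np.hstack([l2_normalise(spatial_scores), l2_normalise(temporal_scores)])
    return LinearSVC(C=1.0).fit(feats, labels)
```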

The Spatial Stream ConvNet: Similar to models used for image classification. Operates on individual video frames. Static appearance is a useful cue on its own, since some actions are strongly associated with particular objects. The network is pre-trained on a large image classification dataset, such as the ImageNet challenge dataset.

The Temporal Stream ConvNet: The input to the ConvNet is a stack of optical flow displacement fields between several consecutive frames. This input describes the motion between video frames. Motion representations: Optical flow stacking: displacement vector fields d_t^x and d_t^y of L consecutive frames are stacked, creating a total of 2L input channels. Trajectory stacking: trajectory-based descriptors. Bi-directional optical flow, mean flow subtraction.

Optical Flow Stacking (figure): Displacement vector fields d_t^x and d_t^y of L consecutive frames are stacked, creating a total of 2L input channels. In the examples, higher intensity corresponds to positive values and lower intensity to negative values. (a) Horizontal component d^x of the displacement vector field. (b) Vertical component d^y of the displacement vector field.
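
A sketch of how such an input volume could be assembled (illustrative; the Farnebäck flow used here is only a stand-in for the flow algorithm used in the paper):

```python
# Optical-flow stacking: for L consecutive frame pairs, compute dense flow and
# stack the horizontal and vertical displacement fields into one (2L, H, W) input.
import cv2
import numpy as np

def stack_optical_flow(frames, L=10):
    """frames: list of at least L+1 grayscale frames (H, W) starting at time t."""
    channels = []
    for t in range(L):
        flow = cv2.calcOpticalFlowFarneback(frames[t], frames[t + 1], None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        channels.append(flow[..., 0])                # d_t^x: horizontal displacement
        channels.append(flow[..., 1])                # d_t^y: vertical displacement
    return np.stack(channels, axis=0)                # shape (2L, H, W)
```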

Multi-task Learning: Unlike the spatial stream ConvNet, which can be pre-trained on a large still image classification dataset (such as ImageNet), the temporal ConvNet needs to be trained on video data, and the available datasets for video action classification are still rather small: UCF-101 and HMDB-51 have only 9.5K and 3.7K videos, respectively. The ConvNet architecture is therefore modified so that it has two softmax classification layers on top of the last fully-connected layer: one softmax layer computes the HMDB-51 classification scores, the other one the UCF-101 scores. Each layer is equipped with its own loss function, which operates only on the videos coming from the respective dataset. The overall training loss is computed as the sum of the individual tasks' losses.
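
A rough PyTorch sketch of the two-head setup (layer sizes, names and the per-dataset masking are assumptions, not the paper's code):

```python
# Shared backbone with two classification heads (UCF-101 and HMDB-51); the total
# loss is the sum of the per-dataset losses, each applied only to its own samples.
import torch
import torch.nn as nn

class TwoHeadTemporalNet(nn.Module):
    def __init__(self, backbone, feat_dim=4096):
        super().__init__()
        self.backbone = backbone                       # shared ConvNet up to the last FC layer
        self.head_ucf = nn.Linear(feat_dim, 101)       # UCF-101 scores
        self.head_hmdb = nn.Linear(feat_dim, 51)       # HMDB-51 scores

    def forward(self, x):
        feat = self.backbone(x)
        return self.head_ucf(feat), self.head_hmdb(feat)

def multitask_loss(ucf_logits, hmdb_logits, labels, is_ucf):
    """is_ucf: boolean mask marking which samples in the batch come from UCF-101."""
    ce = nn.CrossEntropyLoss()
    loss = 0.0
    if is_ucf.any():
        loss = loss + ce(ucf_logits[is_ucf], labels[is_ucf])
    if (~is_ucf).any():
        loss = loss + ce(hmdb_logits[~is_ucf], labels[~is_ucf])
    return loss
```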


Finding Action Tubes - Georgia Gkioxari and Jitendra Malik - CVPR 2015

Introduction: Image region proposals: regions that are motion salient are more likely to contain the action, so only those are selected. This gives a significant reduction in the number of regions being processed and faster computation. The detection pipeline is also inspired by the human vision system. The approach outperforms other techniques in the task of action detection.

Regions of Interest: Selective search is used on the RGB frames to generate approximately 2K regions per frame. Regions that are void of motion are discarded using the optical flow signal. Motion saliency algorithm: the normalized magnitude of the optical flow signal, f_m, is seen as a heat map at the pixel level. If R is a region, then f_m(R) = (1/|R|) Σ_{i ∈ R} f_m(i) is a measure of how motion salient R is, and R is discarded if f_m(R) < α. For α = 0.3, approximately 85% of the boxes are discarded.
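
A sketch of this filter (assumed implementation details; the box format and the normalisation by the maximum magnitude are illustrative choices):

```python
# Motion-saliency filter: average the normalised optical-flow magnitude inside
# each candidate box and discard boxes whose mean saliency falls below alpha.
import numpy as np

def filter_regions_by_motion(flow, boxes, alpha=0.3):
    """flow: (H, W, 2) optical flow; boxes: (N, 4) as [x1, y1, x2, y2]."""
    mag = np.linalg.norm(flow, axis=-1)
    fm = mag / (mag.max() + 1e-8)                     # normalised magnitude, a pixel-level heat map
    keep = []
    for x1, y1, x2, y2 in boxes.astype(int):
        region = fm[y1:y2, x1:x2]
        if region.size and region.mean() >= alpha:    # f_m(R): mean saliency over the box
            keep.append([x1, y1, x2, y2])
    return np.array(keep)
```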

Feature Extraction (figure): (a) Candidate regions are fed into action-specific classifiers, which make predictions using static and motion cues. (b) The regions are linked across frames based on the action predictions and their spatial overlap. Action tubes are produced for each action and each video.
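
The linking step in (b) can be approximated greedily as below (a simplification for illustration: the paper formulates linking as an optimisation over whole paths, whereas this sketch extends the tube frame by frame with the box maximising score plus overlap):

```python
# Greedy tube linking sketch: start from the highest-scoring box in the first
# frame, then pick in each following frame the box that maximises action score
# plus spatial overlap (IoU) with the previous box.
import numpy as np

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

def link_action_tube(boxes_per_frame, scores_per_frame, overlap_weight=1.0):
    """boxes_per_frame: list of (N_t, 4) arrays; scores_per_frame: list of (N_t,) arrays."""
    first = int(np.argmax(scores_per_frame[0]))
    tube = [boxes_per_frame[0][first]]
    for boxes, scores in zip(boxes_per_frame[1:], scores_per_frame[1:]):
        link = [s + overlap_weight * iou(tube[-1], b) for b, s in zip(boxes, scores)]
        tube.append(boxes[int(np.argmax(link))])
    return np.stack(tube)                             # one box per frame: the action tube
```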

Action Detection Model: Action-specific SVM classifiers are used on spatio-temporal features. The features are extracted from the fc7 layer of two CNNs, spatial-cnn and motion-cnn, which were trained to detect actions using static and motion cues, respectively. The architecture of spatial-cnn and motion-cnn is similar to those used for image classification.

This approach yields an accuracy of 62.5%, averaged over the three splits of JHMDB.

General Results

Dataset       Laptev et al. 2008   Wang et al. 2013   Simonyan et al. 2014   Gkioxari et al. 2015
KTH           91.8%                95.0%              -                      -
Hollywood2    38.38%*              58.2%              -                      -
UCF YouTube   -                    84.1%              -                      -
UCF Sports    -                    88.0%              88.0%                  75.8%
JHMDB         -                    46.6%              59.4%                  62.5%

*First version of Hollywood2.

References - Articles
- Learning Realistic Human Actions from Movies - Ivan Laptev, Marcin Marszałek, Cordelia Schmid and Benjamin Rozenfeld - CVPR 2008
- Action Recognition with Improved Trajectories - Heng Wang and Cordelia Schmid - ICCV 2013
- Dense Trajectories and Motion Boundary Descriptors for Action Recognition - Heng Wang, Alexander Kläser, Cordelia Schmid and Cheng-Lin Liu - IJCV 2013
- Two-Stream Convolutional Networks for Action Recognition in Videos - Karen Simonyan and Andrew Zisserman - NIPS 2014
- Finding Action Tubes - Georgia Gkioxari and Jitendra Malik - CVPR 2015

References - Datasets
- KTH Dataset
- UCF YouTube Action Data Set
- Hollywood2 Dataset
- UCF Sports Action Data Set
- Joint-annotated Human Motion Data Base (JHMDB)