Person Action Recognition/Detection

Person Action Recognition/Detection - Fabrício Ceschin - Computer Vision - Prof. David Menotti - Departamento de Informática, Universidade Federal do Paraná

In object recognition: is there a chair in the image? In object detection: is there a chair and where is it in the image?

In action recognition: is there an action present in the video? In action detection: is there an action and where is it in the video?


Datasets

KTH: Six types of human actions: walking, jogging, running, boxing, hand waving and hand clapping. Four different scenarios: outdoors (s1), outdoors with scale variation (s2), outdoors with different clothes (s3) and indoors (s4). 2391 sequences taken with a static camera at 25 fps.

Hollywood2: 12 classes of human actions and 10 classes of scenes distributed over 3669 video clips from 69 movies. Approximately 20.1 hours of video in total.

UCF Sports Action Data Set: actions collected from various sports that are typically featured on broadcast television channels such as the BBC and ESPN. 10 classes of human actions. 150 sequences at a resolution of 720 x 480.

UCF YouTube Action Data Set: 11 action categories collected from YouTube and personal videos. Challenging due to large variations in camera motion, object pose and appearance, object scale, viewpoint, cluttered background, illumination conditions, etc.

JHMDB: 21 categories, 928 clips, 33183 frames. Puppet flow per frame (approximated optical flow on the person). Puppet mask per frame. Joint positions per frame. Action label per clip. Meta label per clip (camera motion, visible body parts, camera viewpoint, number of people, video quality).

Articles

Timeline
- Learning Realistic Human Actions from Movies - Ivan Laptev, Marcin Marszałek, Cordelia Schmid and Benjamin Rozenfeld - CVPR 2008
- Dense Trajectories and Motion Boundary Descriptors for Action Recognition - Heng Wang, Alexander Kläser, Cordelia Schmid and Cheng-Lin Liu - IJCV 2013
- Two-Stream Convolutional Networks for Action Recognition in Videos - Karen Simonyan and Andrew Zisserman - NIPS 2014
- Finding Action Tubes - Georgia Gkioxari and Jitendra Malik - CVPR 2015

Learning Realistic Human Actions from Movies - Ivan Laptev, Marcin Marszałek, Cordelia Schmid and Benjamin Rozenfeld - CVPR 2008

Introduction & Dataset Generation: Inspired by new robust methods for image description and classification. First version of the Hollywood dataset. Movies contain a rich variety and a large number of realistic human actions. To avoid the difficulty of manual annotation, the dataset was built using script-based action annotation: time information is transferred from subtitles to scripts and time intervals for scene descriptions are then inferred - about 60% precision is achieved.

Script-based Action Annotation (figure): example of matching speech sections (green) in subtitles and scripts. Time information (blue) from adjacent speech sections is used to estimate time intervals of scene descriptions (yellow).

Space-time Features: Detect interest points using a space-time extension of the Harris operator. Histogram descriptors of space-time volumes are computed in the neighborhood of the detected points (the size of each volume is related to the detection scales). Each volume is subdivided into an (Nx, Ny, Nt) grid of cuboids; for each cuboid, histograms of oriented gradients (HOG) and histograms of optical flow (HOF) are computed. Both are concatenated into a single descriptor vector.
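
The HOG part of this descriptor can be sketched as follows (a minimal, illustrative Python version, not the authors' implementation; the grid size and bin count are assumptions). The cuboid around an interest point is split into an (nx, ny, nt) grid of cells, a gradient-orientation histogram is computed per cell, and the cell histograms are concatenated. The HOF component is analogous, with optical flow orientations in place of image gradients.

```python
# Illustrative sketch (not the authors' code): build a HOG-like descriptor for one
# space-time cuboid by splitting it into an (nx, ny, nt) grid of cells and
# concatenating per-cell orientation histograms of the image gradient.
import numpy as np

def cuboid_hog(volume, nx=3, ny=3, nt=2, bins=8):
    """volume: (T, H, W) grayscale patch centred on a space-time interest point."""
    t, h, w = volume.shape
    gy, gx = np.gradient(volume.astype(np.float32), axis=(1, 2))
    mag = np.sqrt(gx ** 2 + gy ** 2)
    ori = np.arctan2(gy, gx) % (2 * np.pi)          # gradient orientation in [0, 2*pi)

    hist = []
    for ti in range(nt):                            # split the cuboid into grid cells
        for yi in range(ny):
            for xi in range(nx):
                sl = (slice(ti * t // nt, (ti + 1) * t // nt),
                      slice(yi * h // ny, (yi + 1) * h // ny),
                      slice(xi * w // nx, (xi + 1) * w // nx))
                cell_hist, _ = np.histogram(ori[sl], bins=bins,
                                            range=(0, 2 * np.pi),
                                            weights=mag[sl])
                hist.append(cell_hist)
    desc = np.concatenate(hist)
    return desc / (np.linalg.norm(desc) + 1e-8)     # L2-normalise the descriptor
```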

Space-time Features (figures): (1) Space-time interest points detected for two video frames with the human actions "hand shake" (left) and "get out car" (right). (2) Result of detecting the strongest spatio-temporal interest points in a football sequence with a player heading the ball (a) and in a hand clapping sequence (b).

Spatio-temporal Bag-of-features: A visual vocabulary is built by clustering a subset of 100k features sampled from the training videos with the k-means algorithm (k = 4000). BoF assigns each feature to the closest (Euclidean distance) vocabulary word and computes the histogram of visual word occurrences over a space-time volume corresponding either to the entire video sequence or to subsequences defined by a spatio-temporal grid.
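
A minimal sketch of this pipeline, assuming scikit-learn is available (not the paper's code; MiniBatchKMeans is used here only to keep clustering 100k descriptors into 4000 words tractable):

```python
# Bag-of-features sketch: cluster sampled descriptors into a 4000-word vocabulary,
# then represent each video as a histogram of nearest-word assignments.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(sampled_descriptors, k=4000, seed=0):
    """sampled_descriptors: (100_000, D) array sampled from the training videos."""
    km = MiniBatchKMeans(n_clusters=k, random_state=seed, batch_size=10_000)
    km.fit(sampled_descriptors)
    return km

def bof_histogram(video_descriptors, vocabulary):
    """Assign each descriptor to its closest word (Euclidean) and count occurrences."""
    words = vocabulary.predict(video_descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(np.float64)
    return hist / (hist.sum() + 1e-8)               # normalised word-frequency histogram
```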

Spatio-temporal Bag-of-features (figure): bag-of-features illustration.

Classification: A support vector machine (SVM) with a multi-channel χ² kernel that combines channels, defined by

K(H_i, H_j) = exp( - Σ_c (1/A_c) D_c(H_i, H_j) ),

where H_i = {h_in} and H_j = {h_jn} are the histograms for channel c, A_c is a per-channel normalisation factor, and D_c(H_i, H_j) is the χ² distance

D_c(H_i, H_j) = (1/2) Σ_n (h_in - h_jn)² / (h_in + h_jn).
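
A small sketch of how such a kernel could be computed and plugged into an SVM with a precomputed kernel (illustrative; here the per-channel normaliser A_c is assumed to be the mean channel distance over the training set, and the same A must be reused when building the test kernel):

```python
# Multi-channel chi-square kernel sketch for a precomputed-kernel SVM.
import numpy as np
from sklearn.svm import SVC

def chi2_distance(H1, H2):
    """Chi-square distances between two sets of histograms: (N, V) x (M, V) -> (N, M)."""
    num = (H1[:, None, :] - H2[None, :, :]) ** 2
    den = H1[:, None, :] + H2[None, :, :] + 1e-10
    return 0.5 * (num / den).sum(axis=-1)

def multichannel_kernel(channels_a, channels_b, A=None):
    """channels_*: list of (N, V_c) histogram arrays, one entry per channel."""
    dists = [chi2_distance(Ha, Hb) for Ha, Hb in zip(channels_a, channels_b)]
    if A is None:                                   # assumed: normalise by the mean distance per channel
        A = [d.mean() for d in dists]
    return np.exp(-sum(d / a for d, a in zip(dists, A))), A

# usage: K_train, A = multichannel_kernel(train_channels, train_channels)
#        clf = SVC(kernel="precomputed").fit(K_train, labels)
#        K_test, _ = multichannel_kernel(test_channels, train_channels, A=A)
```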

Results: Average class accuracy on the KTH actions dataset:

Method           Accuracy
Schuldt et al.   71.7%
Niebles et al.   81.5%
Wong et al.      86.7%
This work        91.8%

Results: Average precision (AP) for each action class of the test set - results for clean (annotated) training data, automatic training data, and a random classifier (chance):

Action        Clean   Automatic   Chance
AnswerPhone   32.1%   16.4%       10.6%
GetOutCar     41.5%   16.4%       6.0%
HandShake     32.3%   9.9%        8.8%
HugPerson     40.6%   26.8%       10.1%
Kiss          53.3%   45.1%       23.5%
SitDown       38.6%   24.8%       13.8%
SitUp         18.2%   10.4%       4.6%
StandUp       50.5%   33.6%       22.6%

Dense Trajectories and Motion Boundary Descriptors for Action Recognition - Heng Wang, Alexander Kläser, Cordelia Schmid and Cheng-Lin Liu - IJCV 2013

Introduction: Bag-of-features achieves state-of-the-art performance. Feature trajectories have been shown to be efficient for representing videos. They are generally extracted using a KLT tracker or by matching SIFT descriptors between frames; however, their quantity and quality are often not sufficient. This work proposes video description by dense trajectories.

Dense Trajectories: Feature points are densely sampled on a grid spaced by W pixels (W = 5). Sampling is carried out on each spatial scale separately, and the goal is to track all the sampled points through the video. Areas without any structure are removed (points whose auto-correlation matrix has very small eigenvalues). Feature points are tracked on each spatial scale separately. Features are extracted using grids of cuboids, similar to the previous article.
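
A simplified sketch of the sampling and of a single tracking step, using OpenCV (assumptions: Farnebäck flow as a stand-in for the paper's optical flow algorithm, and illustrative threshold values):

```python
# Dense sampling + one tracking step (assumed, simplified): sample points every W
# pixels, drop points with weak structure (small minimum eigenvalue of the
# auto-correlation matrix), then move each point by the median-filtered dense flow.
import cv2
import numpy as np

def dense_sample(gray, W=5, quality=0.001):
    min_eig = cv2.cornerMinEigenVal(gray, blockSize=3)
    thresh = quality * min_eig.max()                 # relative threshold on structure
    ys, xs = np.mgrid[W // 2:gray.shape[0]:W, W // 2:gray.shape[1]:W]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1)
    keep = min_eig[pts[:, 1], pts[:, 0]] > thresh    # remove areas without any structure
    return pts[keep].astype(np.float32)

def track_step(prev_gray, next_gray, pts):
    """Move each point by the median-filtered dense optical flow at its location."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx = cv2.medianBlur(np.ascontiguousarray(flow[..., 0]), 5)   # median filtering of the flow field
    fy = cv2.medianBlur(np.ascontiguousarray(flow[..., 1]), 5)
    xi, yi = pts[:, 0].astype(int), pts[:, 1].astype(int)
    return pts + np.stack([fx[yi, xi], fy[yi, xi]], axis=1)
```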

Dense Trajectories (figure): Left: feature points are densely sampled on a grid for each spatial scale. Middle: tracking is carried out in the corresponding spatial scale for L frames by median filtering in a dense optical flow field. Right: trajectory shape is represented by relative point coordinates; the descriptors (HOG, HOF, MBH) are computed along the trajectory in an N x N pixel neighborhood, which is divided into grids of cuboids. Motion boundary histograms (MBH) are extracted by computing derivatives separately for the horizontal and vertical components of the optical flow.


Results: Comparison of different descriptors and methods for extracting trajectories on nine datasets. Mean average precision over all classes (mAP) is reported for Hollywood2 and Olympic Sports, and average accuracy over all classes for the other seven datasets. The three best results for each dataset are shown in bold.

Two-Stream Convolutional Networks for Action Recognition in Videos - Karen Simonyan and Andrew Zisserman - NIPS 2014

Introduction: CNNs work very well for image recognition; this work extends them to action recognition in video. Two separate recognition streams, related to the two-stream hypothesis: Spatial Stream - appearance recognition ConvNet. Temporal Stream - motion recognition ConvNet.

Two-stream Hypothesis: The ventral pathway (purple, the "what" pathway) responds to shape, color and texture. The dorsal pathway (green, the "where" pathway) responds to spatial transformations and movement.

Two-stream Architecture for Video Recognition: Spatial part: in the form of individual frame appearance, carries information about scenes and objects in the video. Temporal part: in the form of motion across the frames, carries information about the movement of the camera and the objects.

Two-stream Architecture for Video Recognition: Each stream is implemented as a deep ConvNet; their softmax scores are combined by late fusion. Two fusion methods are proposed: averaging, and training a multiclass linear SVM on stacked L2-normalised softmax scores as features.
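
Both fusion strategies are straightforward to sketch (illustrative code under assumed shapes and names, not the paper's implementation):

```python
# Late fusion of the two streams: average the softmax outputs, or train a linear
# SVM on the stacked, L2-normalised softmax scores.
import numpy as np
from sklearn.svm import LinearSVC

def l2_normalise(x):
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

def fuse_by_averaging(spatial_scores, temporal_scores):
    """Both inputs: (N, num_classes) softmax scores; returns fused class predictions."""
    return np.argmax((spatial_scores + temporal_scores) / 2.0, axis=1)

def fuse_by_svm(spatial_scores, temporal_scores, labels):
    """Train a multiclass linear SVM on stacked, L2-normalised softmax scores."""
    feats = np.hstack([l2_normalise(spatial_scores), l2_normalise(temporal_scores)])
    return LinearSVC(C=1.0).fit(feats, labels)
```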

The Spatial Stream ConvNet: Similar to models used for image classification. Operates on individual video frames. Static appearance is a useful cue on its own, since some actions are strongly associated with particular objects. The network is pre-trained on a large image classification dataset, such as the ImageNet challenge dataset.

The Temporal Stream ConvNet: The input to the ConvNet is a stack of optical flow displacement fields between several consecutive frames. This input describes the motion between video frames. Motion representations: Optical flow stacking: displacement vector fields d_t^x and d_t^y of L consecutive frames are stacked, creating a total of 2L input channels. Trajectory stacking: trajectory-based descriptors. Bi-directional optical flow, mean flow subtraction.

Optical Flow Stacking (figure): Displacement vector fields d_t^x and d_t^y of L consecutive frames are stacked, creating a total of 2L input channels. In the examples, higher intensity corresponds to positive values and lower intensity to negative values. (a) Horizontal component d^x of the displacement vector field. (b) Vertical component d^y of the displacement vector field.
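
A sketch of how such an input volume could be assembled (illustrative; the Farnebäck flow used here is only a stand-in for the flow algorithm used in the paper):

```python
# Optical-flow stacking: for L consecutive frame pairs, compute dense flow and
# stack the horizontal and vertical displacement fields into one (2L, H, W) input.
import cv2
import numpy as np

def stack_optical_flow(frames, L=10):
    """frames: list of at least L+1 grayscale frames (H, W) starting at time t."""
    channels = []
    for t in range(L):
        flow = cv2.calcOpticalFlowFarneback(frames[t], frames[t + 1], None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        channels.append(flow[..., 0])                # d_t^x: horizontal displacement
        channels.append(flow[..., 1])                # d_t^y: vertical displacement
    return np.stack(channels, axis=0)                # shape (2L, H, W)
```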

Multi-task Learning: Unlike the spatial stream ConvNet, which can be pre-trained on a large still image classification dataset (such as ImageNet), the temporal ConvNet needs to be trained on video data, and the available datasets for video action classification are still rather small: UCF-101 and HMDB-51 have only 9.5K and 3.7K videos, respectively. The ConvNet architecture is therefore modified so that it has two softmax classification layers on top of the last fully-connected layer: one softmax layer computes the HMDB-51 classification scores, the other one the UCF-101 scores. Each layer is equipped with its own loss function, which operates only on the videos coming from the respective dataset. The overall training loss is computed as the sum of the individual tasks' losses.
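
A rough PyTorch sketch of the two-head setup (layer sizes, names and the per-dataset masking are assumptions, not the paper's code):

```python
# Shared backbone with two classification heads (UCF-101 and HMDB-51); the total
# loss is the sum of the per-dataset losses, each applied only to its own samples.
import torch
import torch.nn as nn

class TwoHeadTemporalNet(nn.Module):
    def __init__(self, backbone, feat_dim=4096):
        super().__init__()
        self.backbone = backbone                       # shared ConvNet up to the last FC layer
        self.head_ucf = nn.Linear(feat_dim, 101)       # UCF-101 scores
        self.head_hmdb = nn.Linear(feat_dim, 51)       # HMDB-51 scores

    def forward(self, x):
        feat = self.backbone(x)
        return self.head_ucf(feat), self.head_hmdb(feat)

def multitask_loss(ucf_logits, hmdb_logits, labels, is_ucf):
    """is_ucf: boolean mask marking which samples in the batch come from UCF-101."""
    ce = nn.CrossEntropyLoss()
    loss = 0.0
    if is_ucf.any():
        loss = loss + ce(ucf_logits[is_ucf], labels[is_ucf])
    if (~is_ucf).any():
        loss = loss + ce(hmdb_logits[~is_ucf], labels[~is_ucf])
    return loss
```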


Finding Action Tubes - Georgia Gkioxari and Jitendra Malik - CVPR 2015

Introduction: Image region proposals: regions that are motion salient are more likely to contain the action, so only those are selected. This gives a significant reduction in the number of regions being processed and faster computation. The detection pipeline is also inspired by the human vision system. The approach outperforms other techniques in the task of action detection.

Regions of Interest: Selective search is used on the RGB frames to generate approximately 2K regions per frame. Regions that are void of motion are discarded using the optical flow signal. Motion saliency algorithm: the normalized magnitude of the optical flow signal, f_m, is seen as a heat map at the pixel level. If R is a region, then f_m(R) = (1/|R|) Σ_{i ∈ R} f_m(i) is a measure of how motion salient R is, and R is discarded if f_m(R) < α. For α = 0.3, approximately 85% of the boxes are discarded.
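
A sketch of this filter (assumed implementation details; the box format and the normalisation by the maximum magnitude are illustrative choices):

```python
# Motion-saliency filter: average the normalised optical-flow magnitude inside
# each candidate box and discard boxes whose mean saliency falls below alpha.
import numpy as np

def filter_regions_by_motion(flow, boxes, alpha=0.3):
    """flow: (H, W, 2) optical flow; boxes: (N, 4) as [x1, y1, x2, y2]."""
    mag = np.linalg.norm(flow, axis=-1)
    fm = mag / (mag.max() + 1e-8)                     # normalised magnitude, a pixel-level heat map
    keep = []
    for x1, y1, x2, y2 in boxes.astype(int):
        region = fm[y1:y2, x1:x2]
        if region.size and region.mean() >= alpha:    # f_m(R): mean saliency over the box
            keep.append([x1, y1, x2, y2])
    return np.array(keep)
```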

Feature Extraction (figure): (a) Candidate regions are fed into action-specific classifiers, which make predictions using static and motion cues. (b) The regions are linked across frames based on the action predictions and their spatial overlap. Action tubes are produced for each action and each video.
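
The linking step in (b) can be approximated greedily as below (a simplification for illustration: the paper formulates linking as an optimisation over whole paths, whereas this sketch extends the tube frame by frame with the box maximising score plus overlap):

```python
# Greedy tube linking sketch: start from the highest-scoring box in the first
# frame, then pick in each following frame the box that maximises action score
# plus spatial overlap (IoU) with the previous box.
import numpy as np

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

def link_action_tube(boxes_per_frame, scores_per_frame, overlap_weight=1.0):
    """boxes_per_frame: list of (N_t, 4) arrays; scores_per_frame: list of (N_t,) arrays."""
    first = int(np.argmax(scores_per_frame[0]))
    tube = [boxes_per_frame[0][first]]
    for boxes, scores in zip(boxes_per_frame[1:], scores_per_frame[1:]):
        link = [s + overlap_weight * iou(tube[-1], b) for b, s in zip(boxes, scores)]
        tube.append(boxes[int(np.argmax(link))])
    return np.stack(tube)                             # one box per frame: the action tube
```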

Action Detection Model: Action-specific SVM classifiers are used on spatio-temporal features. The features are extracted from the fc7 layer of two CNNs, spatial-cnn and motion-cnn, which were trained to detect actions using static and motion cues, respectively. The architecture of spatial-cnn and motion-cnn is similar to those used for image classification.

This approach yields an accuracy of 62.5%, averaged over the three splits of JHMDB.

General Results

Dataset       Laptev et al. 2008   Wang et al. 2013   Simonyan et al. 2014   Gkioxari et al. 2015
KTH           91.8%                95.0%              -                      -
Hollywood2    38.38%*              58.2%              -                      -
UCF YouTube   -                    84.1%              -                      -
UCF Sports    -                    88.0%              88.0%                  75.8%
JHMDB         -                    46.6%              59.4%                  62.5%

*First version of Hollywood2.

References - Articles
- Learning Realistic Human Actions from Movies - Ivan Laptev, Marcin Marszałek, Cordelia Schmid and Benjamin Rozenfeld - CVPR 2008
- Action Recognition with Improved Trajectories - Heng Wang and Cordelia Schmid - ICCV 2013
- Dense Trajectories and Motion Boundary Descriptors for Action Recognition - Heng Wang, Alexander Kläser, Cordelia Schmid and Cheng-Lin Liu - IJCV 2013
- Two-Stream Convolutional Networks for Action Recognition in Videos - Karen Simonyan and Andrew Zisserman - NIPS 2014
- Finding Action Tubes - Georgia Gkioxari and Jitendra Malik - CVPR 2015

References - Datasets
- KTH Dataset
- UCF YouTube Action Data Set
- Hollywood2 Dataset
- UCF Sports Action Data Set
- Joint-annotated Human Motion Data Base (JHMDB)