Understanding Sport Activities from Correspondences of Clustered Trajectories

Francesco Turchini, Lorenzo Seidenari, Alberto Del Bimbo
http://www.micc.unifi.it/vim

Introduction
The availability of multimedia content is continuously growing.
Sport events are among the most watched TV content, driving the pay-per-view model for major broadcasters.
Many commercial and professional applications can be enabled by automatic action recognition in sports:
- Improved broadcast commentaries through fast retrieval of similar events
- Gameplay analysis for head coaches
- Automatic collection of game statistics

State-of-the-Art
Few systems actually succeed in performing classification on sport videos without employing additional information.
Usually trackers need camera calibration and an ad-hoc setup.
Specific sport knowledge is often used to achieve player identification.
These requirements make these systems less general and hard to employ.

Method | Sport-specific | Player bbox annotations* | Player team/identity | Player tracking | Camera calibration | Features
Atmosukarto [CVPRW CVSports 2013] | Yes | No | Yes | Yes | Yes | Raw frames
Ballan [CBMI 2009] | No | No | No | No | No | SIFT
Waltner [ÖAGM 2014] | No | Yes | Yes | Yes | Yes | HoG, HoF, SC, RWPC
Bialkowski [CVPRW CVSports 2013] | Yes | No | Yes | Yes | Yes | Raw frames
Ours | No | No | No | No | No | Dense Trajectories
* for training

Main Idea
To avoid player detection, tracking and identification, a system should be able to establish partial correspondences between spatio-temporal patterns.
Starting from a set of trajectories, we decompose the action into groups of trajectories so that such partial correspondences can be made.

The Method
[Pipeline figure; block labels: Video Representation, PCA, Trajectory Clusters, LSC [Cai 15], Trajectory Description [Wang 13], Fisher Encoding [Perronnin 12]]
1. Video trajectories are clustered using Landmark-based Spectral Clustering (LSC).
2. Trajectories are described by appearance and motion descriptors: HoG, HoF and MBH.
3. Each cluster is encoded using Fisher Vectors over a Gaussian Mixture Model.
4. We end up with Fisher Vectors Ψ(X_i) computed from each cluster X_i for each descriptor (HoG, HoF, MBH).
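The per-cluster encoding in steps 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the GMM parameters are taken as given, only gradients with respect to the component means are kept, power and L2 normalization are applied, and the cluster labels are assumed to come from any clustering step (LSC in the slides). The function names are hypothetical.

```python
import numpy as np

def fisher_vector(X, weights, means, variances):
    """Simplified Fisher vector of descriptors X (n, d) under a diagonal GMM.

    weights: (k,), means and variances: (k, d). Only gradients with respect
    to the means are kept (an assumption made to keep the sketch short).
    """
    diff = X[:, None, :] - means[None, :, :]                  # (n, k, d)
    # Log-likelihood of every descriptor under every Gaussian component.
    log_p = -0.5 * np.sum(diff ** 2 / variances + np.log(2 * np.pi * variances), axis=2)
    log_p = log_p + np.log(weights)
    log_p -= log_p.max(axis=1, keepdims=True)                 # numerical stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)                 # soft assignments (n, k)
    # Normalized gradient with respect to the component means.
    grad = (gamma[:, :, None] * diff / np.sqrt(variances)).sum(axis=0)
    grad /= X.shape[0] * np.sqrt(weights)[:, None]
    fv = grad.ravel()
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                    # power normalization
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv                      # L2 normalization

def encode_clusters(X, labels, weights, means, variances):
    """Return one Fisher vector per trajectory cluster, keyed by cluster label."""
    return {int(c): fisher_vector(X[labels == c], weights, means, variances)
            for c in np.unique(labels)}
```

In this sketch one call to `encode_clusters` handles a single descriptor type; repeating it for HoG, HoF and MBH would give one set of cluster encodings per descriptor.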

The Method
In sport footage our method naturally groups trajectories stemming from the motion of players or generated by the motion of relevant objects (e.g. the ball).
This approach makes it possible to establish partial correspondences of relevant features without detecting, tracking or recognizing players on the field.

Cluster Set Kernel
Given the set of extracted motion features X, the clustering step yields a partition X_1, ..., X_N such that X = X_1 ∪ ... ∪ X_N and X_i ∩ X_j = ∅ for i ≠ j.
Given two feature sets X and Y, we define a kernel K exploiting trajectory grouping as follows:
[kernel equation not preserved in the transcription]
The max operator puts the most similar patterns from the two compared videos into correspondence.
We use a kernel SVM as the classifier.
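The kernel formula itself is not reproduced in this transcript, so the following is only one plausible max-based set kernel consistent with the description, not the authors' definition. It assumes a linear base kernel between cluster encodings and averages the best match of every cluster in both directions to keep the result symmetric. The function names are hypothetical.

```python
import numpy as np

def cluster_set_kernel(fx, fy):
    """Max-correspondence kernel between two videos (assumed form).

    fx: (n, d) array, one encoding per cluster of video X.
    fy: (m, d) array, one encoding per cluster of video Y.
    """
    sim = fx @ fy.T                       # base kernel between all cluster pairs
    best_for_x = sim.max(axis=1).mean()   # best match in Y for each cluster of X
    best_for_y = sim.max(axis=0).mean()   # best match in X for each cluster of Y
    return 0.5 * (best_for_x + best_for_y)

def gram_matrix(videos_a, videos_b):
    """Kernel values between every video in videos_a and every video in videos_b."""
    return np.array([[cluster_set_kernel(a, b) for b in videos_b]
                     for a in videos_a])
```

A Gram matrix built this way could be passed to any SVM implementation that accepts precomputed kernels.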

Feature Fusion
To improve the representation we fuse multiple kernels.
First, we consider multiple features (HoG, HoF, MBHx, MBHy).
Second, we can fuse kernels computed from different groupings (varying the number of clusters):
[fusion equation not preserved in the transcription]
Baselines: Global Representation, Clustering only
N = 2
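The slides do not say how the kernels are combined, so the sketch below assumes the simplest option: a weighted average of precomputed Gram matrices, one per descriptor or per number of clusters. Both the averaging rule and the name `fuse_kernels` are assumptions made for illustration.

```python
import numpy as np

def fuse_kernels(grams, weights=None):
    """Combine Gram matrices of identical shape by weighted averaging.

    grams: list of (n, n) arrays, e.g. one per descriptor (HoG, HoF, MBHx, MBHy)
           or one per clustering granularity.
    weights: optional per-kernel weights; uniform if omitted (an assumption).
    """
    stacked = np.stack(grams)                               # (c, n, n)
    if weights is None:
        weights = np.full(len(grams), 1.0 / len(grams))
    weights = np.asarray(weights, dtype=float)
    # Contract the kernel axis: sum_c weights[c] * stacked[c].
    return np.tensordot(weights, stacked, axes=1)
```

Under these assumptions, the fused matrix can be used anywhere a single precomputed kernel would be.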

Results
We tested our approach on three public datasets:
- UCF Sports Actions dataset: 150 clips (6 s), nine classes: Diving, Golf Swinging, Kicking, Lifting, Horseback Riding, Running, Skating, Swinging, Walking. Performance measured with mean average precision.
- MICC-SOCACT4 dataset: 100 clips (7 s), four classes: Goal Kick, Throw In, Placed Kick, Shot on Goal. Performance measured with mean per-class accuracy.
- Volleyball Activity dataset: 903 clips (2 s), seven classes, of which five are volleyball-specific (Serve, Reception, Setting, Attack, Block) and two are more general (Stand, Defense/Move). Performance measured with mean per-class accuracy.
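For reference, mean per-class accuracy averages the accuracy obtained on each class separately, so that frequent classes do not dominate the score. A minimal sketch, with a hypothetical function name:

```python
from collections import defaultdict

def mean_per_class_accuracy(y_true, y_pred):
    """Average of the per-class accuracies over all classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, prediction in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == prediction)
    return sum(correct[c] / total[c] for c in total) / len(total)
```

With three Serve clips of which two are recognized and one Block clip that is recognized, the per-class accuracies are 2/3 and 1, giving a mean of 5/6.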

Results
We obtain state-of-the-art results on the smaller MICC-SOCACT4 dataset.

Method | Mean per-class accuracy (%)
Our Fusion | 92.5
Our Clustering | 89.8
FV Baseline | 88.8
String Kernel [Ballan09] | 73.0
NN + NWD [Ballan09] | 54.0

Results
Clustering and the global representation (FV) have complementary behavior; our fusion obtains a 20% improvement over Waltner et al.

Setting | Our Fusion | Clustering | FV Baseline | Waltner et al.
5 Classes | 94.1 | 68.2 | 60.3 | 77.5
7 Classes | 91.2 | 78.5 | 53.7 | 90.2

Results
We also tested our method on the more generic UCF Sports dataset.
Our method obtains state-of-the-art performance, hinting that our approach is also suitable for generic action recognition.

Method | Result
Clustering (10 clusters) | 91.0
FV Baseline | 87.6
Karaman et al. [5] | 90.4
Kovashka et al. [10] | 87.3
Klaser [14] | 86.7

Action Saliency
Trajectory clustering yields a motion segmentation which identifies various salient patterns.
We mine relevant clusters by exploiting the variation of the SVM scores.
Given a true positive video feature set Z, we search for the cluster Z_i that, if removed, causes the largest drop in classification score.
We iterate this process in a greedy manner to score all clusters of a video.
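The greedy procedure can be sketched as follows. This is an illustration under assumptions, not the authors' code: `score_fn` stands in for the SVM decision function applied to a list of cluster representations, and the last remaining cluster is ranked without a drop value because removing it would leave nothing to score.

```python
def rank_clusters_by_saliency(clusters, score_fn):
    """Rank clusters by the score drop caused by removing each of them.

    clusters: list of per-cluster representations of one video.
    score_fn: hypothetical callable mapping a list of clusters to a
              classification score.
    Returns a list of (cluster_index, score_drop), most salient first.
    """
    remaining = list(range(len(clusters)))
    ranking = []
    while len(remaining) > 1:
        base = score_fn([clusters[i] for i in remaining])
        # Score drop obtained by leaving out each remaining cluster in turn.
        drops = {
            i: base - score_fn([clusters[j] for j in remaining if j != i])
            for i in remaining
        }
        most_salient = max(drops, key=drops.get)
        ranking.append((most_salient, drops[most_salient]))
        remaining.remove(most_salient)
    if remaining:
        ranking.append((remaining[0], None))   # last cluster: no drop computed
    return ranking
```

Each iteration re-scores the video without the clusters already ranked, so later drops are measured relative to the reduced set.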

Action Saliency
Correctly classified examples of the Service and Setting classes.
[Figure panels: Service, Setting]
Note that in the Setting action, the spikers, the middle blocker and the opposite player running up are localized instead of the setter.

Conclusions
We have proposed a novel method for activity recognition based on local trajectory grouping and matching.
This feature grouping helps identify mid-level spatio-temporal patterns that are semantically sensible.
Preliminary results on cluster saliency show localization potential.
State-of-the-art results on various public benchmarks without:
- Player tracking and identification
- Exploiting sport-specific player positions on the court
- Camera calibration