Understanding Sport Activities from Correspondences of Clustered Trajectories
Francesco Turchini, Lorenzo Seidenari, Alberto Del Bimbo
http://www.micc.unifi.it/vim
Introduction
The availability of multimedia content is continuously growing.
Sport events are among the most watched TV content, driving the pay-per-view model for major broadcasters.
Many commercial and professional applications can be enabled by automatic action recognition in sports:
- Improved broadcast commentaries through fast retrieval of similar events
- Gameplay analysis for head coaches
- Automatic collection of game statistics
State-of-the-Art
Few systems actually succeed in performing classification on sport videos without employing additional information.
Usually trackers need camera calibration and an ad-hoc setup.
Specific sport knowledge is often used to achieve player identification.
These requirements make these systems less general and hard to employ.

Method                             Sport-specific  Player bbox annotations*  Player team/identity  Player tracking  Camera calibration  Features
Atmosukarto [CVPRW CVSports 2013]  Yes             No                        Yes                   Yes              Yes                 Raw frames
Ballan [CBMI 2009]                 No              No                        No                    No               No                  SIFT
Waltner [ÖAGM 2014]                No              Yes                       Yes                   Yes              Yes                 HoG, HoF, SC, RWPC
Bialkowski [CVPRW CVSports 2013]   Yes             No                        Yes                   Yes              Yes                 Raw frames
Ours                               No              No                        No                    No               No                  Dense Trajectories
* for training
Main Idea
To avoid player detection, tracking and identification, a system should be able to establish partial correspondences between spatio-temporal patterns.
Starting from a set of trajectories, we would like to decompose the action so that these partial correspondences can be computed.
The Method
Video Representation
Pipeline components (from the diagram): PCA; trajectory clusters, LSC [Cai 15]; trajectory description [Wang 13]; Fisher encoding [Perronnin 12]
1. Video trajectories are clustered using Landmark-based Spectral Clustering (LSC).
2. Trajectories are described with appearance and motion descriptors: HoG, HoF and MBH.
3. Each cluster is encoded using Fisher Vectors over a Gaussian Mixture Model.
4. We end up with Fisher Vectors Ψ(X_i), computed from each cluster X_i for each descriptor (HoG, HoF, MBH).
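Step 3 can be illustrated with a minimal pure-Python sketch of a Fisher Vector for one trajectory cluster. It assumes an already fitted diagonal-covariance GMM, keeps only the gradient with respect to the component means, and applies the power and L2 normalisation of the "improved" Fisher Vector; the slides do not state which variant is used, so these choices are assumptions. The function names and the toy GMM parameters are hypothetical, and the PCA step is omitted.

```python
import math

def gmm_posteriors(x, weights, means, sigmas):
    """Soft-assign one descriptor x to the components of a diagonal GMM."""
    log_p = []
    for w, mu, sd in zip(weights, means, sigmas):
        ll = math.log(w)
        for xd, md, sdd in zip(x, mu, sd):
            ll += -0.5 * math.log(2.0 * math.pi * sdd * sdd)
            ll += -0.5 * ((xd - md) / sdd) ** 2
        log_p.append(ll)
    top = max(log_p)  # subtract the max for numerical stability
    unnorm = [math.exp(v - top) for v in log_p]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def fisher_vector(descriptors, weights, means, sigmas):
    """Mean-gradient Fisher Vector of one cluster (assumed variant)."""
    if not descriptors:
        raise ValueError("a cluster needs at least one descriptor")
    n_comp, dim = len(weights), len(means[0])
    grad = [[0.0] * dim for _ in range(n_comp)]
    for x in descriptors:
        gamma = gmm_posteriors(x, weights, means, sigmas)
        for k in range(n_comp):
            for d in range(dim):
                grad[k][d] += gamma[k] * (x[d] - means[k][d]) / sigmas[k][d]
    t = float(len(descriptors))
    flat = [grad[k][d] / (t * math.sqrt(weights[k]))
            for k in range(n_comp) for d in range(dim)]
    # Signed square root (power normalisation), then L2 normalisation.
    flat = [math.copysign(math.sqrt(abs(v)), v) for v in flat]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat

# Toy example: two 2-D descriptors, a hypothetical 2-component GMM.
toy_fv = fisher_vector([[0.1, 0.0], [0.9, 1.1]],
                       weights=[0.5, 0.5],
                       means=[[0.0, 0.0], [1.0, 1.0]],
                       sigmas=[[1.0, 1.0], [1.0, 1.0]])
```

The resulting vector has one entry per (component, dimension) pair, so each cluster X_i is mapped to a fixed-length Ψ(X_i) regardless of how many trajectories it contains.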
The Method
In sport footage, our method naturally groups trajectories stemming from the motion of players or of relevant objects (e.g. the ball).
This approach makes it possible to establish partial correspondences between relevant features without detecting, tracking or recognizing players on the field.
Cluster Set Kernel
Given the set of extracted motion features X, the clustering step yields a partition of X into clusters X_i: the clusters are pairwise disjoint and their union is X.
Given two feature sets X and Y, we define a kernel K that exploits the trajectory grouping by comparing the clusters of X with the clusters of Y.
The max operator puts the most similar patterns from the compared videos into correspondence.
We use a kernel SVM as the classifier.
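A minimal sketch of a max-correspondence set kernel follows. It assumes a symmetrised average of best matches and a linear base kernel on the per-cluster Fisher Vectors; the exact form and normalisation used by the authors may differ. The names linear_kernel and cluster_set_kernel are illustrative only.

```python
def linear_kernel(u, v):
    """Dot product between two per-cluster encodings (assumed base kernel)."""
    return sum(a * b for a, b in zip(u, v))

def cluster_set_kernel(fvs_x, fvs_y, base_kernel=linear_kernel):
    """Compare two videos, each given as a list of per-cluster vectors.

    Every cluster is matched to its most similar cluster in the other
    video (the max), and the best-match scores are averaged in both
    directions so that the result is symmetric in X and Y.
    """
    if not fvs_x or not fvs_y:
        raise ValueError("both videos need at least one cluster")
    best_for_x = [max(base_kernel(u, v) for v in fvs_y) for u in fvs_x]
    best_for_y = [max(base_kernel(u, v) for u in fvs_x) for v in fvs_y]
    return 0.5 * (sum(best_for_x) / len(fvs_x)
                  + sum(best_for_y) / len(fvs_y))

# Toy example: video X has two clusters, video Y has one.
video_x = [[1.0, 0.0], [0.0, 1.0]]
video_y = [[0.0, 1.0]]
k_xy = cluster_set_kernel(video_x, video_y)
```

Because each cluster only needs one good counterpart in the other video, clusters with no match (for example background motion) contribute little, which is what makes the correspondence partial.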
Feature Fusion
To improve the representation, we fuse multiple kernels.
First, we consider multiple features (HoG, HoF, MBHx, MBHy).
Second, we can fuse kernels computed from different groupings, obtained by varying the number of clusters.
Baselines: Global Representation; Clustering only N = 2
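One common way to fuse kernels is a weighted sum of their Gram matrices. The sketch below assumes uniform weights by default; the slides do not specify the fusion rule, so this is an illustration rather than the authors' formula. The function name fuse_kernels is hypothetical.

```python
def fuse_kernels(gram_matrices, weights=None):
    """Weighted sum of Gram matrices of identical shape.

    gram_matrices: list of matrices (lists of rows), e.g. one per
    descriptor (HoG, HoF, MBHx, MBHy) or one per number of clusters.
    weights: optional list of floats; uniform averaging is assumed
    when omitted.
    """
    if not gram_matrices:
        raise ValueError("need at least one kernel matrix")
    if weights is None:
        weights = [1.0 / len(gram_matrices)] * len(gram_matrices)
    if len(weights) != len(gram_matrices):
        raise ValueError("one weight per kernel matrix is required")
    rows, cols = len(gram_matrices[0]), len(gram_matrices[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for w, gram in zip(weights, gram_matrices):
        for i in range(rows):
            for j in range(cols):
                fused[i][j] += w * gram[i][j]
    return fused

# Toy example with two 2x2 kernel matrices.
k_a = [[1.0, 0.0], [0.0, 1.0]]
k_b = [[0.0, 1.0], [1.0, 0.0]]
k_fused = fuse_kernels([k_a, k_b])
```

The fused matrix can then be passed to an SVM that accepts precomputed kernels.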
Results
We tested our approach on three public datasets.
- UCF Sports Actions dataset: 150 clips (6 s), nine classes: Diving, Golf Swinging, Kicking, Lifting, Horseback Riding, Running, Skating, Swinging, Walking. Performance measured with mean average precision.
- MICC-SOCACT4 dataset: 100 clips (7 s), four classes: Goal Kick, Throw In, Placed Kick, Shot on Goal. Performance measured with mean per-class accuracy.
- Volleyball Activity dataset: 903 clips (2 s), seven classes, of which five are volleyball-specific (Serve, Reception, Setting, Attack, Block) and two are more general (Stand, Defense/Move). Performance measured with mean per-class accuracy.
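The two evaluation measures named above can be sketched as follows. The average precision shown is the non-interpolated variant (mean of the precision at each positive in the ranked list); other definitions exist and the slides do not say which one is used. Function names and the toy labels are illustrative.

```python
def mean_per_class_accuracy(y_true, y_pred):
    """Average of the per-class recall, so small classes count equally."""
    per_class = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        per_class.append(correct / len(idx))
    return sum(per_class) / len(per_class)

def average_precision(scores, labels):
    """Non-interpolated AP for one class; labels are 1 (positive) or 0."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(scores_by_class, labels_by_class):
    """Mean of the per-class APs; both arguments are dicts keyed by class."""
    aps = [average_precision(scores_by_class[c], labels_by_class[c])
           for c in scores_by_class]
    return sum(aps) / len(aps)

# Toy example with two volleyball classes.
acc = mean_per_class_accuracy(["Serve", "Serve", "Serve", "Block"],
                              ["Serve", "Serve", "Block", "Block"])
ap = average_precision([0.9, 0.8, 0.7], [1, 0, 1])
```

Mean per-class accuracy differs from plain accuracy when classes are unbalanced: in the toy example, plain accuracy is 3/4 while the per-class mean is (2/3 + 1)/2.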
Results
We obtain state-of-the-art results on the smaller MICC-SOCACT4 dataset.

Method                      Mean per-class accuracy (%)
Our Fusion                  92.5
Our Clustering              89.8
FV Baseline                 88.8
String Kernel [Ballan09]    73.0
NN + NWD [Ballan09]         54.0

[Figure: panels Our Fusion, Our Clustering, FV Baseline]
Results
Clustering and the global representation (FV) have complementary behavior; our fusion obtains an improvement of about 20% over Waltner et al. in the five-class setting (94.1 vs. 77.5).

             Our Fusion  Clustering  FV Baseline  Waltner et al.
5 Classes    94.1        68.2        60.3         77.5
7 Classes    91.2        78.5        53.7         90.2
Results
We also tested our method on the more generic UCF Sports dataset.
Our method obtains state-of-the-art performance, hinting that our approach is also suitable for generic action recognition.

Method                      Result
Clustering (10 clusters)    91.0
FV Baseline                 87.6
Karaman et al. [5]          90.4
Kovashka et al. [10]        87.3
Klaser [14]                 86.7
Action Saliency
Trajectory clustering yields a motion segmentation which identifies various salient patterns.
We mine the relevant clusters by exploiting the variation of the SVM scores.
Given a true positive video feature set Z, we search for the cluster Z_i whose removal causes the largest drop in classification score.
We iterate this process in a greedy manner to score all clusters of a video.
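The greedy procedure can be sketched as below, assuming a callable score_fn that returns the classifier decision score for a list of clusters. Both score_fn and the function name are hypothetical; in practice the score would come from the kernel SVM. The last remaining cluster is appended without a drop value, since scoring an empty video is left undefined here.

```python
def rank_clusters_by_saliency(clusters, score_fn):
    """Greedy leave-one-out ranking of clusters by classification score drop.

    Returns a list of (cluster_index, score_drop) pairs, most salient first.
    """
    remaining = list(range(len(clusters)))
    ranking = []
    while len(remaining) > 1:
        current = score_fn([clusters[i] for i in remaining])
        best_idx, best_drop = None, None
        for i in remaining:
            reduced = score_fn([clusters[j] for j in remaining if j != i])
            drop = current - reduced
            if best_drop is None or drop > best_drop:
                best_idx, best_drop = i, drop
        ranking.append((best_idx, best_drop))
        remaining.remove(best_idx)
    # The final cluster cannot be removed without emptying the video.
    ranking.extend((i, None) for i in remaining)
    return ranking

# Toy example: each "cluster" is a number and the score is their sum,
# so the most salient cluster is simply the largest contribution.
toy_ranking = rank_clusters_by_saliency([0.2, 1.5, 0.7], sum)
toy_order = [idx for idx, _ in toy_ranking]
```

Each iteration re-scores the video without the clusters already removed, which costs a number of classifier evaluations quadratic in the number of clusters.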
Action Saliency
Correctly classified examples of the Service and Setting classes.
[Figure: two panels, Service and Setting]
Note that in the Setting action, the spikers, the middle blocker and the opposite player running up are localized instead of the setter.
Conclusions
We have proposed a novel method for activity recognition based on local trajectory grouping and matching.
This feature grouping helps identify mid-level spatio-temporal patterns that are semantically sensible.
Preliminary results on cluster saliency show localization potential.
State-of-the-art results on various public benchmarks without:
- Player tracking and identification
- Exploiting sport-specific player positions on the court
- Camera calibration