Motion analysis for broadcast tennis video considering mutual interaction of players


14-10 MVA2011 IAPR Conference on Machine Vision Applications, June 13-15, 2011, Nara, JAPAN

Motion analysis for broadcast tennis video considering mutual interaction of players

Naoto Maruyama, Kazuhiro Fukui
Graduate School of Systems and Information Engineering, University of Tsukuba, JAPAN
maruyama@cvlab.cs.tsukuba.ac.jp, kfukui@cs.tsukuba.ac.jp

Abstract

In this paper, we propose a new scheme of player action recognition based on Cubic High-order Local Auto-Correlation (CHLAC) features. To achieve a high classification rate based on CHLAC features, various types of features should be used, which are generated by controlling their parameters. However, some features are unreliable for classification. To find the best features, we apply the Adaboost algorithm. Further, we add information on the opposing player to enhance classification. There are strong interactions between the actions of two competing tennis players, and thus it is effective to utilize their correlation for classification. Our approach of considering the two players' interactions achieved a classification success rate of 96.09%, which is much more accurate than methods using only the target player's information, whose classification success rate is 85.02%.

1 Introduction

Broadcast sports programs are often watched for more serious reasons than entertainment. For example, some people learn body movements from players in videos, and some people watch the video to analyze players' strategy. Given the variety of purposes for watching broadcast sports programs, there is great demand for technology to obtain information corresponding to viewers' desires. Accordingly, many methods and technologies for automatic sports video content analysis have been proposed [1, 2, 3]. In our research, we deal with broadcast tennis video, which is widely viewed. Of the various possible factors, we focus upon the actions performed by players during play.
The result can be used for various purposes such as event detection and strategy analysis. The actions of players in broadcast tennis video have been analyzed via various approaches. Miyamori et al. [4] recognized player actions by using the silhouette of the player and the relationship between the positions of the ball and the player. In the research of Zhu et al. [5], a motion descriptor based on optical flow was used, and high performance was achieved in action recognition. However, these methods depend on the performance of object-tracking algorithms, which are difficult to apply to broadcast tennis video because of the low resolution of the image frames and the complex movement of the target object. Further, more accurate action classification is required for detailed scene understanding. In our method, CHLAC features [7] are used to describe the movement of a tennis player. Unlike existing works [4, 5], segmentation of the player is unnecessary before feature extraction, owing to the position invariance of CHLAC features.

Figure 1. Basic idea of the proposed method.

We obtain a variety of features by changing their parameters, which produces an effect similar to using multiple image resolutions and multiple speeds of player action. By using many types of features together, we can perform classification considering both changes of video resolution and motion phase [8]. However, some features do not contribute to the classification performance. To improve the performance, we need to select the features useful for classification. Accordingly, the Adaboost algorithm [9] is applied to select useful features efficiently. For further improvement of the classification, we utilize information on the opposing player. There is a very strong correlation between the behaviors of a target player and an opposing player in a singles tennis match, and thus this correlation can be exploited for classification enhancement.
The whole process flow of the proposed method is shown in Fig. 1. CHLAC features are extracted from the opposing player in the same manner as from the target player. Then, new motion descriptors are obtained from all the possible combinations of connecting pairs of the two players' features. Finally, the useful new motion descriptors are selected by the Adaboost algorithm. In this case, Adaboost can find the best combination of several temporal-spatial parameters for integrating multiple features, which is equivalent to finding the combination of time and scale suitable for representing the interaction between the two players. The main contributions of this paper are as follows. 1) We propose a new framework of tennis player action recognition based on the selection of CHLAC features for classification by the Adaboost algorithm. 2) Information on the opposing player is also utilized to consider the mutual interaction between the two players for the improvement of classification.
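To make the pairing step of Fig. 1 concrete, here is a minimal illustrative sketch, not the authors' code: random arrays stand in for real CHLAC features, and the dimensions (251-dimensional vectors, 63 = 9 x 7 parameter settings per player) are taken from the figures reported later in Sections 3 and 4.

```python
import numpy as np

# Hypothetical CHLAC feature sets: one 251-dimensional vector per
# temporal-spatial parameter setting (9 values of r_xy x 7 values of r_t
# = 63 settings per player). Random data stands in for real features.
rng = np.random.default_rng(0)
target_feats = rng.random((63, 251))
opponent_feats = rng.random((63, 251))

# Connect every pair (target setting i, opponent setting j) into one
# 502-dimensional combined motion descriptor, as in Fig. 1.
combined = np.array([
    np.concatenate([ft, fo])
    for ft in target_feats
    for fo in opponent_feats
])
print(combined.shape)  # (3969, 502)
```

Adaboost then selects the useful descriptors from these 3969 candidates.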

Figure 2. Parameters of temporal-spatial scales.

2 Concept of proposed framework

In this section, we propose a new framework of action recognition considering the mutual interaction between two players. First, we explain CHLAC features as descriptors of player motion. Then, we describe the way to select useful features with Adaboost. Finally, we present an approach to integrating the information on both the target player and the opponent for classification.

2.1 Motion descriptor using CHLAC features

There are many types of features that represent spatial-temporal information, such as the action of a tennis player. In this paper, we focus on Cubic High-order Local Auto-Correlation (CHLAC) features to represent the action of a tennis player, since they are known to be among the most suitable features for representing human motion without high computational complexity. Further, CHLAC features are position invariant: the value of the feature vector is unaffected by the position of the moving object. Due to this characteristic, segmentation of the player is not required. By changing the temporal parameter r_t and the spatial parameter r_xy, CHLAC features can represent motions at various spatial-temporal scales, as shown in Fig. 2.

2.2 Selection of valid features with Adaboost

Extremely large numbers of features with various temporal-spatial scales can easily be generated by changing the parameters r_t and r_xy mentioned above. Among these features are many that are not effective for classification, and thus we must select features with suitable parameters. For this purpose, we apply the Adaboost algorithm [9] to select the effective features. In this selection process, we regard each simple classifier, which classifies input data based on the Euclidean distance from the center of the class in a feature vector space, as a weak classifier. Adaboost is an algorithm that makes a strong classifier by selecting the best weak classifiers and combining them with weights according to their performance [10].

\mathrm{Class}(x) = \arg\max_{k} H^{k}(x), \quad (1)

H^{k}(x) = \frac{\sum_{t=1}^{T} a^{k}_{t} h^{k}_{t}(x)}{\sum_{t=1}^{T} a^{k}_{t}}, \quad (2)

where k = (1, ..., C) is the category of player action and C is the number of category classes to be classified. The input vector x is a feature vector derived from a set of sequential images of a player's motion. h^k_t(x) is the t-th selected weak classifier, which outputs 1 or -1. T is the total number of selected classifiers, and a^k_t is the weight of the t-th classifier selected by the Adaboost algorithm. For details on how to obtain a^k_t, refer to [10]. H^k(x) is the obtained strong classifier, which outputs the similarity of the input vector x to category k. Finally, the input player motion x is classified into the category for which the similarity H^k(x) is the highest. This process can also be considered a process of selecting, from all of the possible features, the feature vectors useful for classifying a player's motion.

2.3 Consideration of mutual interaction between two players' actions

We can consider the information on the opposing player by adding CHLAC features extracted from that player. How to integrate such additional features is a problem to be solved. In this case, we can exploit the proposed framework of selection and integration using Adaboost without any modification: the solution is simply to add the additional features into the set of features from the target player. Such adaptable integration of additional information is another advantage of our framework.

3 Framework of proposed method

In this section, we explain the process of action classification in the proposed method. Our framework of action recognition consists of two phases, the training phase and the test phase, which are shown in the flow diagram in Fig. 3. The same preprocessing is used in both phases.

3.1 Training phase

(1) Preprocessing
Our method starts with preprocessing of the input images, as shown in Fig. 5. First, we calculate subtraction images from the input images to extract the movement of a player.
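As an illustrative sketch of this preprocessing step (frame differencing, followed by the Otsu binarization described next), the following is a reconstruction under our own assumptions, not the authors' code; the Otsu threshold is implemented inline for self-containment, and the tiny synthetic frames are only for demonstration:

```python
import numpy as np

def otsu_threshold(img):
    """Otsu's method [11]: choose the threshold maximizing between-class variance."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    total = hist.sum()
    cum = np.cumsum(hist)
    cum_mean = np.cumsum(hist * np.arange(256))
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0 = cum[t - 1] / total            # weight of the "background" class
        w1 = 1.0 - w0                      # weight of the "foreground" class
        if w0 == 0 or w1 == 0:
            continue
        mu0 = cum_mean[t - 1] / cum[t - 1]
        mu1 = (cum_mean[-1] - cum_mean[t - 1]) / (total - cum[t - 1])
        var = w0 * w1 * (mu0 - mu1) ** 2   # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def preprocess(frames):
    """Subtraction images between consecutive frames, each binarized with Otsu."""
    out = []
    for prev, cur in zip(frames[:-1], frames[1:]):
        diff = np.abs(cur.astype(np.int16) - prev.astype(np.int16)).astype(np.uint8)
        out.append((diff >= otsu_threshold(diff)).astype(np.uint8))
    return out

# Synthetic example: the right half of a 4x4 frame "moves".
frames = [np.zeros((4, 4), np.uint8), np.zeros((4, 4), np.uint8)]
frames[1][:, 2:] = 200
masks = preprocess(frames)
print(masks[0].sum())  # 8 moving pixels detected
```

The binarized masks would then be split into a target-player half and an opposing-player half before CHLAC extraction.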
Next, we binarize the subtraction images for noise elimination. The threshold values for binarization are calculated using Otsu's method [11]. Then, we divide the binarized image into two images so as to extract features from each player. Because CHLAC features are position invariant, segmentation of a player is unnecessary before extracting the features.

(2) Feature extraction
After preprocessing, CHLAC features are extracted from a set of 20 frames of binarized subtraction images, which are selected based on a focus frame, as shown in Fig. 4. This extraction operation is conducted by the following equation:

\hat{x}(a_1, a_2) = \sum_{r} f(r)\, f(r + S a_1)\, f(r + S a_2), \quad (3)

S = \begin{pmatrix} r_{xy} & 0 & 0 \\ 0 & r_{xy} & 0 \\ 0 & 0 & r_t \end{pmatrix}, \quad (4)
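For illustration only, Eq. (3) for a single displacement pair can be computed as in the sketch below (the symbols are defined just after the equations; the actual CHLAC implementation follows [7] and sums over a fixed set of local displacement masks, which we do not reproduce here):

```python
import numpy as np

def chlac_value(f, a1, a2, r_xy=1, r_t=1):
    """One cubic autocorrelation value, Eq. (3):
    x_hat(a1, a2) = sum_r f(r) * f(r + S*a1) * f(r + S*a2),
    summing over reference points r whose displaced points stay inside the
    volume. f is a binarized volume indexed as f[t, y, x]; a1 and a2 are
    displacement vectors (dt, dy, dx); S scales them as in Eq. (4),
    written here in (t, y, x) order."""
    S = np.diag([r_t, r_xy, r_xy])
    d1 = S @ np.asarray(a1)
    d2 = S @ np.asarray(a2)
    total = 0
    for t in range(f.shape[0]):
        for y in range(f.shape[1]):
            for x in range(f.shape[2]):
                r = np.array([t, y, x])
                p1, p2 = r + d1, r + d2
                if all(0 <= p[i] < f.shape[i] for p in (p1, p2) for i in range(3)):
                    total += f[t, y, x] * f[tuple(p1)] * f[tuple(p2)]
    return total

# Toy volume of all ones, 3 frames of 3x3 pixels.
volume = np.ones((3, 3, 3), dtype=int)
print(chlac_value(volume, (0, 0, 0), (0, 0, 0)))  # 27 valid reference points
print(chlac_value(volume, (1, 0, 0), (0, 0, 0)))  # 18 valid reference points
```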

Figure 3. Flow diagram of player action classification.

Figure 4. The manner of extracting CHLAC features from an image sequence.

where r is a reference point vector in three-dimensional spatiotemporal space, f(r) is the image intensity at the reference point r, and a_1 and a_2 are displacement vectors from the reference point r. The parameters r_xy and r_t represent the scale of the displacement vectors: r_xy corresponds to the scale factor in the directions of the x- and y-axes, and r_t corresponds to the scale factor in the time-axis direction.

(3) Combining features
To consider the interaction between the players' actions, we use the CHLAC features of both players together to obtain new motion descriptors. This is accomplished by connecting all the possible pairs of the two players' features. The dimension of a CHLAC feature vector is 251, and thus the combined features are obtained as 502-dimensional vectors.

(4) Multiple Discriminant Analysis
To boost the classification ability of the combined features, we apply Multiple Discriminant Analysis (MDA) to them. In the weak classifier, the combined features are classified based on the Euclidean distance from the center of each class. The class center vector is calculated from the combined features projected onto the discriminant space. We execute this classification process for all the combined features.

Figure 5. Preprocessing before feature extraction.

(5) Selecting useful features
The useful combined features and their weights a^k_t in Eq. (2) are selected by using the Adaboost algorithm.

3.2 Test phase

In the test phase, the useful CHLAC features selected in the training phase are extracted from the preprocessed images of the target player and the opposing player separately. After that, the features of the two players are connected according to the list of

Figure 6. Four methods used in the experiment.

the proper combination obtained in the training phase. Next, all the combined features are classified in the same manner as in the training phase, and then a final result is obtained from the classification results of all the combined features by the majority rule, considering the weight of each classifier.

4 Experiments

4.1 Experimental conditions

We conducted an experiment to evaluate the performance of the proposed method. Service, forehand, backhand, and wait were selected as the classification categories. The player on the front side is regarded as the target of action classification. Data for the experiment were collected from broadcast video of 10 tennis matches. The video contains various players and different resolutions. We collected data on right-handed male players. The number of examples of each motion is shown in Table 1.

Table 1. Number of examples of each motion.
Service: 224, Forehand: 496, Backhand: 268, Wait: 368, Total: 1356

The data were captured at 30 frames per second (fps) from the Internet. The resolution of the videos was not uniform; therefore, we resized each video such that the size of the players was almost the same. We used 10-fold cross-validation to evaluate the performance of the proposed method. In each test, 1 of the 10 matches was used as test data and the other 9 matches were combined to form the training data. This process was repeated 10 times, with each of the matches used exactly once as the test data. From each player, CHLAC features were extracted by varying r_xy from 1 to 9 and r_t from 1 to 7. These values were decided based on a preliminary experiment. All the processing was executed on an Intel Xeon 3 GHz quad-core processor.

4.2 Results and discussion

We compared the accuracy of the four methods shown in Fig. 6. Method 1 uses only one feature, namely, the best-performing one of the 63 features extracted from the target player. In Method 2, useful features are selected from among the 63 of them by using Adaboost. Method 3 uses features extracted from both players for classification without connecting the feature vectors of each player; in this case, features selected from among 126 of them by Adaboost are used for the target player's classification. Method 4, the proposed method, obtains 3969 combined features by changing the pair of features of the target and opposing players to be connected; the final useful features are selected by Adaboost from all the combined features.

Table 2. Performance of each method.
Method 1: 64.74%
Method 2: 85.02%
Method 3: 86.13%
Method 4 (proposed method): 96.09%

The accuracy of each method is listed in Table 2. The proposed method (Method 4) achieved the highest classification success rate, 96.09%. From this result, we can see that using various features in combination is effective for player action classification. Furthermore, using information on the opposing player leads to a further improvement in classification accuracy. The performance of Method 3 is slightly improved compared with Method 2, which means that the information on the opposing player was useful in predicting the target player's action. However, the information on the correlation between the two players might be insufficient, since each player's features were used individually. Additionally, in Method 3, only a few features of the opposing player were selected by Adaboost, and thus information on the opposing player's action was scarcely considered in the classification. In the proposed method, on the other hand, the information on the correlation between the two players was considered much more extensively than in Method 3, owing to the use of the forcibly combined features. Fig. 7 shows examples of the sets of sequential images of each action that were successfully classified by the proposed method.
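The test-phase decision rule of Eqs. (1) and (2), a weight-normalized vote over the selected weak classifiers, can be sketched as follows. This is a minimal illustration with hypothetical weak classifiers and weights, not the ones learned in the paper:

```python
import numpy as np

def strong_scores(x, weak_clfs, alphas):
    """H^k(x) of Eq. (2): Adaboost-weighted vote of the selected weak
    classifiers for each class k, normalized by the sum of the weights.
    weak_clfs[k] is a list of functions h(x) -> +1/-1; alphas[k] their weights."""
    scores = []
    for hs, a in zip(weak_clfs, alphas):
        a = np.asarray(a, dtype=float)
        votes = np.array([h(x) for h in hs], dtype=float)
        scores.append((a * votes).sum() / a.sum())
    return scores

def classify(x, weak_clfs, alphas):
    """Eq. (1): pick the class with the highest similarity H^k(x)."""
    return int(np.argmax(strong_scores(x, weak_clfs, alphas)))

# Toy example: class 0 "accepts" small inputs, class 1 "accepts" large ones.
weak_clfs = [
    [lambda x: 1 if x < 5 else -1, lambda x: 1 if x < 8 else -1],  # class 0
    [lambda x: 1 if x >= 5 else -1],                               # class 1
]
alphas = [[0.9, 0.3], [1.0]]
print(classify(2, weak_clfs, alphas))  # -> 0
print(classify(9, weak_clfs, alphas))  # -> 1
```

In the proposed method, each h would be a distance-based classifier over one selected 502-dimensional combined feature.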
The lower images in each figure show enlarged images of the neighborhood region of the target player. From the enlarged images, we can see that each action is complicated and the differences among forehand, backhand, and wait are small.

5 Conclusions and future work

In this paper, we proposed a new scheme for recognizing the action of a player in broadcast tennis video. The proposed method classifies the motion of a player using effective Cubic High-order Local Auto-Correlation (CHLAC) features selected by the Adaboost algorithm. In addition, we added information on the opposing player to enhance the classification.

(a) service (b) forehand (c) backhand (d) wait

Figure 7. Examples of the sets of sequential images of each action that were successfully classified. The lower images in each figure show enlarged images of the neighborhood region of the target player.

Evaluation experiments demonstrated that our proposed method can achieve much higher performance than a simple method using only information on the target player for action classification. In future work, we plan to apply the results of this research to more detailed analysis of tennis scenes.

Acknowledgments

This work was supported by KAKENHI (22300195).

References

[1] L. Ballan, M. Bertini, A. Del Bimbo, and G. Serra, "Action Categorization in Soccer Videos Using String Kernels," Proc. 7th International Workshop on Content-Based Multimedia Indexing, pp. 13-18, 2009.
[2] Y. Rui, A. Gupta, and A. Acero, "Automatically extracting highlights for TV baseball programs," Proc. ACM Multimedia, pp. 105-115, 2000.
[3] D. D. Saur, Y.-P. Tan, S. R. Kulkarni, and P. J. Ramadge, "Automated analysis and annotation of basketball video," Proc. Symp. Electronic Imaging: Science and Technology, Storage and Retrieval for Image and Video Databases, vol. 3022, pp. 176-187, 1997.
[4] H. Miyamori, "Improving accuracy in behavior identification for content-based retrieval by using audio and video information," Proc. International Conference on Pattern Recognition (ICPR), vol. 2, pp. 826-830, 2002.
[5] G. Zhu, C. Xu, Q. Huang, W. Gao, and L. Xing, "Player Action Recognition in Broadcast Tennis Video with Applications to Semantic Analysis of Sports Game," Proc. 14th Annual ACM International Conference on Multimedia, pp. 431-440, 2006.
[6] N. Otsu and T. Kurita, "A New Scheme for Practical Flexible and Intelligent Vision Systems," Proc. IAPR Workshop on Computer Vision, pp. 431-435, 1988.
[7] T. Kobayashi and N. Otsu, "Action and Simultaneous Multiple-Person Identification Using Cubic Higher-Order Local Auto-Correlation," Proc. International Conference on Pattern Recognition (ICPR), pp. 741-744, 2004.
[8] Y. Horita, S. Ito, K. Kaneda, T. Nanri, Y. Shimohata, K. Taura, M. Otake, T. Sato, and N. Otsu, "High Precision Gait Recognition Using a Large-Scale PC Cluster," Proc. IFIP International Conference on Network and Parallel Computing (NPC 2006), pp. 50-56, 2006.
[9] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Computational Learning Theory: EuroCOLT '95, pp. 23-37, Springer-Verlag, 1995.
[10] R. E. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions," Machine Learning, vol. 37, no. 3, pp. 297-336, 1999.
[11] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Trans. Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62-66, 1979.