Temporal structure analysis of broadcast tennis video using hidden Markov models


Ewa Kijak (a,b), Lionel Oisel (a), Patrick Gros (b)
(a) THOMSON multimedia S.A., Cesson-Sevigne, France
(b) IRISA-CNRS, Campus de Beaulieu, Rennes, France

ABSTRACT

This work aims at recovering the temporal structure of a broadcast tennis video from an analysis of the raw footage. Our method relies on a statistical model of the interleaving of shots, in order to group shots into predefined classes representing structural elements of a tennis video. This stochastic modeling is performed in the global framework of Hidden Markov Models (HMMs). The fundamental units are shots and transitions. In a first step, color and motion attributes of segmented shots are used to map shots into two classes: game (view of the full tennis court) and not game (medium views, close-up views, and commercials). In a second step, a trained HMM is used to analyze the temporal interleaving of shots. This analysis results in the identification of more complex structures, such as first missed services, short rallies that could be aces or services, long rallies, breaks that mark the end of a game, and replays that highlight interesting points. These higher-level unit structures can be used either to create summaries or to allow nonlinear browsing of the video.

Keywords: sport video analysis, structure analysis, Hidden Markov Model, macro-segmentation, highlights detection, video content analysis

1. INTRODUCTION

Video classification and segmentation are fundamental steps for efficiently searching and browsing video content. Low-level visual features are widely used for indexing generic video content, but are not sufficient to provide the semantic information that is meaningful to an end-user. When the indexing of videos is restricted to a given category, domain-specific knowledge about the processed content facilitates the recovery of higher-level indexing information.
One domain-specific application is the detection and recognition of highlights in sport videos. Sport video analysis is motivated by the growing amount of archived sport video material. Broadcasters need detailed annotation of video contents to select relevant excerpts to be edited for summaries or magazines. At present, this logging task is performed manually by librarians.

Domain-specific video indexing can be divided into three main research areas: genre classification, content analysis, and structure analysis. The goal of genre classification is to automatically classify TV broadcasts into predetermined genres such as commercials, news, sport, etc. For this general video classification, Hidden Markov Models (HMMs) are widely used 1, 2. Content analysis usually aims at detecting specific events in a video. Domain knowledge and properties of low-level features are exploited for mapping low-level information extracted from video data to high-level concepts. Finally, structure analysis aims at recovering the table of contents of videos within a given genre. The table of contents is obtained by finding the semantic discontinuities in the video. It involves detecting the temporal boundaries of the coherent segments and labeling all segments of the video according to predefined semantic categories. As not all of the content of a video is of interest, separating structure parsing from event detection may enhance the indexing process, by first extracting the interesting segments and then applying content analysis to them.

The temporal structure of a video varies from one video type to another, and prior knowledge of some general structure for the class of video under study is obviously useful in video structure parsing. In particular, not all video documents are highly structured. For example, in the movie category, it is generally admitted that the structure follows a hierarchical model similar to that of theatre plays. Structure analysis comes down to a segmentation into scenes, obtained by grouping shots with similar content. Most scene segmentation approaches attempt to merge similar and consecutive shots into scenes. These time-constrained methods rely on visual similarities between the shots of a scene. News programs are much more structured, as they can be defined by an interleaving of anchorperson shots and news shots. A model-based approach based on color histograms and the spatial layouts of frames makes it possible to parse a newscast video into anchorperson and news story scenes 3. In the domain of sport, a finite number of identified scenes, related to game phases, occur throughout the video. In baseball, a scene can be defined as a pitching-batting cycle 4. Such a scene is made up of very different shots, which makes the time-constrained scene segmentation approaches used for movies unsuitable.

Within the category of sport videos, a distinction should be made between time-constrained sports such as soccer, and score-constrained sports such as tennis or volleyball. Time-constrained sports have a relatively loose structure. The game can be decomposed into equal periods. During a period, the content flow is quite unpredictable. In score-constrained sports, however, the content exhibits a strong hierarchical structure. For example, a tennis match can be broken down into sets, games and points.

In this paper, we take advantage of the well-defined structure of tennis broadcasts to parse the structure of tennis videos. Our goal is to use the available domain-specific knowledge of tennis videos to analyze these videos up to the level in the semantic scale where the structure can reliably be recovered. The structure identification follows a model-based approach.

Correspondence: kijake@thmulti.com
Storage and Retrieval for Media Databases 2003, Minerva M. Yeung, Rainer W. Lienhart, Chung-Sheng Li, Editors, Proceedings of SPIE IS&T Electronic Imaging, SPIE Vol. 5021 (2003) © 2003 SPIE IS&T
The outline of our paper is as follows. Section 2 provides an overview of previous work in the domain of sport video indexing. Section 3 describes the elements of the syntax of tennis videos. Section 4 presents our feature extraction method, followed in section 5 by a presentation of our tennis video structure analysis system. Experimental results on real broadcast tennis videos are given in section 6 to demonstrate the effectiveness of the proposed method.

2. RELATED WORK

Sports video analysis is an area of research where domain-specific knowledge can significantly enhance the performance of indexing. Most of the existing work in this domain is related to content analysis. It focuses on the detection of interesting play events in sports video. A common approach to event detection consists in combining the extraction of low-level features with heuristic rules. Various low-level features such as color, edge, motion and audio features are used to extract mid-level information such as court lines, goal posts, and player and ball positions. This information is combined with heuristic rules to infer predetermined highlights. For example, after detecting tennis shots, Sudhir et al. 5 classify tennis shots into semantic categories such as baseline-rallies, passing-shots, net-games, and serve-and-volley games using a reasoning module that interprets the players' positions relative to the court lines. Using audio features, Rui et al. 6 employ speech endpoint detection and baseball hit detection, and detect the excitement of the reporter in his speech, to infer important events in baseball video. However, no semantic analysis of the relevant events is carried out. Nepal 7 develops simple temporal models from heuristic rules using crowd cheer, scoreboard and change in camera direction detections, to find goal segments in a basketball video.
All these works are more related to event detection than to structure analysis, because they attempt to detect and identify segments of interest in a video without analyzing the temporal relations between them. A structure analysis is done when all the shots of a video are identified according to predetermined classes. This classification can be relative to play location. Gong et al. 8 classify each shot of a soccer video into one of nine positions of play, such as in the midfield, around the left penalty area, in the top-right corner area, etc. First, the line mark patterns are identified, and camera motion, ball, and players are detected. Then, the classification is performed according to the physical location in the field or the presence or absence of the ball. Zhou et al. 9 encapsulate basketball knowledge in an inductive decision tree. A rule-based classifier takes color, edge, and motion features as input and categorizes basketball video into left or right fast-break, left or right dunk, and close-up shots. In these works, no particular events are detected, but each shot is identified by the location of play within the court or the field. Classification into different playing locations is suited to sports in which play action cannot be recorded from a single point of view.


We do not attempt to analyze the content of a shot of interest to identify particular actions, as Sudhir et al. 5 do on tennis. Our aim is to highlight the temporal relations between all the shots in order to identify a global structure of the video. The input video is segmented and classified into predetermined game phases, such as first missed serve, rally, replay and break. The novelty of our approach lies in the use of statistical models to describe domain-specific rules, rather than heuristic inference engines. In a recent work, Xie et al. 10 investigate a similar approach to segment a soccer video into play and out-of-play categories. Taking advantage of the well-defined structure of tennis games, we aim at segmenting the video up to a much higher level in the semantic scale.

3. TENNIS VIDEO SYNTAX

Tennis video is characterized by a typical production style, which we call the tennis video syntax. Tennis games are recorded from a fixed number of cameras. The point of view that gives the most relevant information is selected for broadcast. For example, during a rally, the content provided by the camera filming the whole court is selected (shots of this kind, called global views, are thus of great interest), and the player who has just carried out an action of interest is captured with a close-up. As close-up views never appear during a rally but right before or after it, global views generally indicate a rally. Because of the presence of typical scenes and the finite number of views, the tennis video has a predictable temporal syntax. A game is usually followed by a long succession of close-up views or commercials. A first serve is a short global view closely followed by another global view, whereas an ace is characterized by a short global view followed by a series of close-ups. Replays are signaled to viewers by the insertion of special transitions.
A closer observation of tennis video reveals that there are only a few main types of video shots that occur repeatedly throughout the video footage. In addition, each shot has a different meaning according to its context. For example, in tennis videos, a global view that appears after a dissolve transition is probably a replay, whereas a global view that appears shortly after a previous global view is probably a winning rally. These two observations have motivated the use of an HMM for modeling a tennis video sequence. We integrated a priori information by deriving basic syntactical elements from the tennis video syntax and modeling each of these elements by an HMM. These models rely on the type of view of the shot, on the shot duration and on the temporal relationships between shots.

The next section describes the method used to identify the type of view of the shots. It is based on color and motion attributes extracted from the raw video. The shot classification and the temporal analysis are performed independently, so that the temporal analysis does not depend on color attributes that may change from one sequence to another.

4. SHOT CHARACTERIZATION

In this section, we present the different types of views present in a tennis video and the process used to classify each of them. These views can be divided into four principal classes: global, medium, close-up, and audience (see Figure 1).

Figure 1. Four types of view in tennis video (left-to-right: global, medium, close-up, audience)

Such a fine-grained classification is not necessary in our case. In a tennis video production, global views contain much of the pertinent information. The remaining information relies on the presence or absence of secondary views but is independent of the type of these views. Our classification algorithm therefore consists of a binary classifier. The classification process labels the shots according to two classes: global views (GV) and non-global views (N).
In the following subsections, we present the whole process starting from the initial video sequence up to the list of classified shots. After the presentation of a preprocessing step, our classification method is detailed and compared to existing methods. The resulting classification is exploited in section 5.

4.1. Preliminary video data processing

To match broadcasters' usage, only MPEG videos are considered. In preliminary video processing, the video is segmented into shots. Our segmentation process can be divided into several steps, briefly described in the following:
- Straight cut detection is performed by detecting rapid changes of the standard bin-wise difference between luminance histograms of DC pictures.
- Gradual shot transition detection is performed by the twin-comparison method 15, which uses a dual threshold and accumulates significant differences to detect gradual transitions.
- The content of a detected shot is represented by a single keyframe (one keyframe is sufficient to illustrate the whole content of a view). The keyframe is chosen as the I frame closest to the frame with the lowest activity (activity is defined as the average magnitude of all the MPEG motion vectors associated with a given frame).

4.2. Features description

The pre-processing is now complete. A list of shots has been identified and the associated keyframes have been extracted. Cuts and transitions are known. The following step consists of labeling the extracted shots as global views (GV) or non-global views (N). Identifying the different types of view is a necessary step in any sports content analysis. Many works deal with shot classification in the context of sport videos. Classification processes are divided into two parts: feature extraction and classification of these features. We now present the features usually used, the features we actually use, and our classification algorithm.

We can identify two main kinds of features. The first class relies on color-based features. In sport videos, a global view is characterized by a large region of homogeneous color (the color of the play field). Most works thus use this basic feature to identify a shot label.
In a baseball video, a pitching scene is detected by computing the difference between a candidate keyframe and a representative pitching image 4. The representative pitching image is manually extracted from the considered video. A color model can also be learned to further improve shot recognition. This can take the form of a single color model or of several models that try to capture all kinds of tennis courts 5, 11. Another article 9 proposes the extraction of edge features around the dominant color region in association with a rule-based classifier to classify keyframes into left court, right court, middle court and close-up.

The second class consists of motion-based descriptors. For example, the variation and persistence of the estimated camera motion, as well as the number of intra-coded macroblocks in an MPEG video of basketball, are used to classify wide-angle and close-up shots 12. To efficiently capture the frame contents, some approaches mix several descriptors. In the context of soccer video, the grass pixel ratio and the motion intensity in a frame are relevant features to categorize each view 10.

In most of the methods that use color information, the game field color has to be evaluated first, because it can vary widely from one video to another. Our approach tries to avoid the use of a predefined field color in order to handle a wide variety of videos automatically. As presented in the introduction of this section, our goal is to separate global views from other types of view. Close-up and global views are characterized by homogeneous color content. Indeed, the dominant colors of a global view consist of the colors of the court and its surroundings, and the dominant colors of a close-up shot consist of the colors of the clothes and face of a player. In contrast, medium and audience views are characterized by scattered color content. Since color alone cannot separate close-up views from global views, the color-based classification should be reinforced using motion-based features.
On the one hand, a global view must capture the main part of the court at all times. On the other hand, in close-up views, the camera is generally tracking the player. The first class can thus be characterized by a small camera motion, while the second implies large camera translations. Based on these observations, we choose two features to identify game shots: activity, which reflects camera motion during a shot, and color. Rather than a color histogram, we use a global descriptor of dominant colors, which is more compact. In addition, dominant color vectors capture the most significant color information of a frame and are less sensitive to noise.


Let F be a vector of N dominant colors and p the percentage of each color with respect to the whole associated frame. The colors of the original images are quantized into N values using a k-means clustering algorithm. Neighboring dominant colors are merged when their distance is less than a predefined threshold $T_d$. The goal is to ensure that the N dominant colors are perceptually different. According to the MPEG-7 standard, the similarity between two dominant color features $F_1$ and $F_2$ can then be measured by the following simplified quadratic distance function $D(F_1, F_2)$:

$$D^2(F_1, F_2) = \sum_{i=1}^{N_1} p_{1i}^2 + \sum_{j=1}^{N_2} p_{2j}^2 - \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} 2\, a_{1i,2j}\, p_{1i}\, p_{2j} \tag{1}$$

where $N_1$ and $N_2$ are the numbers of dominant colors in $F_1$ and $F_2$, and $a_{k,l}$ is the similarity coefficient between two colors $c_k$ and $c_l$:

$$a_{k,l} = \begin{cases} 1 - d_{k,l} / d_{max} & \text{if } d_{k,l} \le T_d \\ 0 & \text{if } d_{k,l} > T_d \end{cases} \tag{2}$$

$T_d$ is the maximum distance for two similar colors, $d_{max} = \alpha T_d$, and $d_{k,l}$ is the Euclidean distance between two colors $c_k$ and $c_l$, defined as follows:

$$d_{k,l} = \lVert c_k - c_l \rVert \tag{3}$$

To take into account the spatial configuration of similar color pixels, a confidence measure CM is associated with each dominant color feature. A pixel of color $C_i$ is considered to be coherent if all the pixels in its neighborhood have the same color. As a result, the confidence measure CM for the dominant color feature F is defined as:

$$CM = \sum_{i=1}^{N} \frac{\text{number of coherent pixels of color } C_i}{\text{total number of pixels of color } C_i}\; p_i \tag{4}$$

The spatial coherency of dominant colors, represented by the confidence measure, as well as the respective activities $A_1$ and $A_2$, are taken into account in the final distance function between two dominant color descriptors $D_1$ and $D_2$:

$$\mathrm{Diff}(D_1, D_2) = W_1\, |CM_1 - CM_2| + W_2\, D(F_1, F_2) + W_3\, |A_1 - A_2| \tag{5}$$

where $W_1$, $W_2$, and $W_3$ are three weighting coefficients. In our implementation, we use 4 dominant colors (N=4) to characterize the most significant color information in the game field. The weighting coefficients are set as follows: $W_1 = 0.2$, $W_2 = 0.5$, and $W_3 = 0.3$. Colors are represented in the YCbCr color space.
In the simplified quadratic distance function, the luminance component is not taken into account, which effectively eliminates illumination variations.

Figure 2. An example of four dominant colors extraction
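The distance between two descriptors described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dictionary layout (`colors`, `cm`, `activity`), the function names, and the use of absolute differences for the confidence and activity terms are assumptions, and the values of the threshold `t_d` and the factor `alpha` must be supplied by the caller since the text does not give them. Colors are taken as chrominance pairs because the luminance component is ignored.

```python
import math

def color_similarity(c_k, c_l, t_d, alpha):
    """Similarity coefficient between two colors (hypothetical helper).

    Returns 1 - d / d_max when the Euclidean distance d is at most t_d,
    and 0 otherwise, with d_max = alpha * t_d.
    """
    d = math.dist(c_k, c_l)
    if d > t_d:
        return 0.0
    return 1.0 - d / (alpha * t_d)

def quadratic_distance(f1, f2, t_d, alpha):
    """Simplified quadratic distance between two dominant color features.

    f1 and f2 are lists of (color, percentage) pairs, where the
    percentage is a fraction of the frame between 0 and 1.
    """
    total = sum(p * p for _, p in f1) + sum(p * p for _, p in f2)
    for c1, p1 in f1:
        for c2, p2 in f2:
            total -= 2.0 * color_similarity(c1, c2, t_d, alpha) * p1 * p2
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(total, 0.0))

def descriptor_diff(d1, d2, weights, t_d, alpha):
    """Weighted distance combining coherency, color and activity terms.

    Assumption: the coherency and activity terms are absolute differences.
    """
    w1, w2, w3 = weights
    return (w1 * abs(d1["cm"] - d2["cm"])
            + w2 * quadratic_distance(d1["colors"], d2["colors"], t_d, alpha)
            + w3 * abs(d1["activity"] - d2["activity"]))
```

Two keyframes with the same dominant colors and coherency then differ only through their activity term, which is the behavior expected when separating a static global view from a tracking close-up.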

4.3. Game shot identification

Our goal is to identify global views among all extracted keyframes without making any assumption about the color of the playing area. Our method is thus independent of the type of tennis court (carpet, clay, hard or grass). Analyzing several hours of tennis video reveals that, in a video, GV keyframes represent only 20% to 30% of all extracted keyframes, including commercials. However, it is quite easy to distinguish global views from medium and audience views. To do so, we consider dominant color ratios. As previously noted, color content in medium and audience views is more scattered than in close-up and global views. Considering that a global view is mainly composed of the playing area, we assume that the percentage of its main dominant color is greater than 50%. We reduce the set of candidate keyframes by discarding keyframes whose highest dominant color percentage is less than 50%. In the resulting subset of K images, GV keyframes represent more than 50% of the data (most medium and audience views have been discarded). The main remaining problem is the distinction between close-up and global views. In other words, the problem is now reduced to the identification of inlier data points, i.e. GV keyframes, in the presence of many outliers (N keyframes).

First, we select a keyframe that is representative of a global view. From a random selection of p keyframes, we choose a keyframe by the least median of squares (LMS) method. The number p of samples is chosen so that the probability P of finding a representative GV keyframe is greater than 99%. The expression for p is given by 13:

$$p = \frac{\log(1 - P)}{\log\left(1 - (1 - \varepsilon)^q\right)} \tag{6}$$

where $\varepsilon$ is the fraction of outlier data, and q the number of features in each sample. Once a GV keyframe is found, outliers are removed. The set of candidate keyframes is reduced again by keeping the keyframes whose distance is lower than the median distance previously found.
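As a worked example of the sample-count expression above, the following sketch computes p for a given confidence and outlier fraction. The function name and the choice q = 1 in the usage note (one keyframe per sample) are assumptions made for illustration; the result is rounded up because the number of samples must be an integer.

```python
import math

def num_samples(confidence, outlier_fraction, q):
    """Smallest integer p such that at least one of p random samples
    of q elements is outlier-free with probability >= confidence."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    if not 0.0 < outlier_fraction < 1.0:
        raise ValueError("outlier_fraction must lie strictly between 0 and 1")
    # Probability that a single sample contains only inliers.
    clean = (1.0 - outlier_fraction) ** q
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - clean))
```

With `confidence=0.99`, `outlier_fraction=0.5` and `q=1`, the ratio is about 6.64, so 7 random keyframes are enough; this is the worst case allowed by the text, since GV keyframes make up more than half of the reduced set.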
The LMS is iterated on this new subset to select a reference keyframe $K_{ref}$. $K_{ref}$ should be one of the best representatives of GV keyframes. Assuming that the distribution of distances from all the keyframes to $K_{ref}$ can be modeled by a Gaussian function, a keyframe $K_i$ is labeled as a GV keyframe if:

$$\mathrm{Diff}(D_{ref}, D_i) \le 1.96\, \tau \tag{7}$$

5. STRUCTURE ANALYSIS

5.1. Hidden Markov models

An HMM is a Markov chain whose state sequence cannot be observed directly, but only through a sequence of observation vectors. Each observation vector corresponds to an underlying state with an associated probability distribution. A discrete hidden Markov model is defined by a set of states, a set of state transition probabilities, a set of output symbols, and a probability distribution of output symbols for each state. Formally, for an N-state discrete HMM with an alphabet of M symbols and an observation sequence of length T, we use the following notation:
- $S = \{S_1, S_2, \ldots, S_N\}$ denotes the individual states
- $V = \{V_1, V_2, \ldots, V_M\}$ denotes the distinct observation symbols in observation space
- $Q = \{Q_1, Q_2, \ldots, Q_T\}$ is the state sequence
- $O = \{O_1, O_2, \ldots, O_T\}$ is the observation sequence

The state transition probability distribution is represented by a matrix $A = \{a_{ij}\}$, where $a_{ij} = \Pr(Q_{t+1} = S_j \mid Q_t = S_i)$, and the observation symbol probability distribution is represented by a matrix $B = \{b_j(k)\}$, where $b_j(k) = \Pr(O_t = V_k \mid Q_t = S_j)$ is the probability of observing $V_k$ when the current state is $S_j$. The initial state distribution, denoted by $\pi = \{\pi_j\}$ with $\pi_j = \Pr(Q_1 = S_j)$, contains the probabilities of the model being in state $S_j$ at time t=1. For convenience we use $\lambda = \{A, B, \pi\}$ to indicate the model parameters.

5.2. Model description

We define four structural elements in a tennis video game: first missed serve and rally, rally, replay and break. Structural elements are modeled by 2- to 5-state left-to-right HMMs. The construction of the HMMs takes domain knowledge derived from the tennis syntax into account as follows:
- In a broadcast video, the producers notify the viewers that a replay is being displayed by inserting special transitions.
- A first missed serve is a global view of short duration, followed by close-up views that are also of short duration (as the players do not have to change their positions), and then by another global view.
- A break is characterized by a long succession of close-up views, audience views and advertisements. This set of consecutive shots has a particularly long duration. It appears when players change ends, generally every two games.

The type of view, the shot duration and their temporal relations are of prime importance in the discrimination of the structural elements. The activity of a shot is not a discriminatory feature: as it represents an average quantity of motion over a shot, it has nearly the same value for all shots of one type of view. Consequently, it cannot help to distinguish one global view from another.

Each state $S_i$ models either a segment of the video within a single shot, or a dissolve transition between shots. A cut transition is not considered as a state; it is implicitly taken into account in the shot state. Each state of the HMM has one observation symbol $V_k$, which is one label of an alphabet of 3 symbols {GV, N, D}. Each symbol represents the class of the shot: GV for global view, N for non-global view, and D for dissolve transition. In addition, a shot duration model is associated with each state. The shot duration is modeled by either a single Gaussian, a mixture of Gaussians, or a histogram. Figure 3 shows the HMMs corresponding to the four structural elements.
In the HMMs for a first missed serve and rally, and for a rally, the last states are two distinct GV states. One represents a rally containing only a serve (which may or may not be returned). The other characterizes a rally containing significant strokes (more than two exchanges). These two GV states are differentiated by their shot duration distributions. Concerning the N states, the shot duration is accumulated over a group of consecutive N states. Indeed, the duration of a single non-global view is not a relevant feature, whereas the cumulative duration of consecutive non-global views indicates the time interval between two global views.

The observation sequence O consists of a sequence of shots, labeled according to the previous classification step, and their respective durations. Then, given an observation $O_t$ with associated label $L_t$ and duration $D_t$, and a state $S_j$ with observation symbol $V_k$, the probability of observing $O_t$ in state $S_j$ at time t is:

$$b_j(O_t) = \Pr(L_t \mid Q_t = S_j)\, \Pr(D_t \mid Q_t = S_j) \tag{8}$$

where:
- $\Pr(D_t \mid Q_t = S_j)$ is given by the probability distribution of the shot duration in state $S_j$
- $\Pr(L_t \mid Q_t = S_j) = 1$ if $L_t = V_k$, and $\varepsilon$ otherwise
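The observation probability above can be sketched as the product of a near-binary label term and a duration term. This is an illustrative sketch only: the single-Gaussian duration model is one of the three options mentioned in the text, and the function names, the default value of `eps`, and the numeric parameters used in the usage note are assumptions.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a single Gaussian, used here as a shot duration model."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def observation_prob(label, duration, state_symbol, duration_pdf, eps=1e-3):
    """Probability of observing a shot (label, duration) in a given state.

    The label term is 1 when the shot label matches the symbol attached
    to the state and a small constant eps otherwise; the duration term
    is the state's duration density evaluated at the shot duration.
    """
    label_prob = 1.0 if label == state_symbol else eps
    return label_prob * duration_pdf(duration)
```

For instance, a hypothetical short-rally GV state could use `lambda d: gaussian_pdf(d, 4.0, 1.5)` as its `duration_pdf`; a non-global shot of the same duration would then score `eps` times lower than a global view.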

Figure 3. (a) HMM for a first missed serve and rally (b) HMM for a rally (c) HMM for a replay (d) HMM for a break. GV stands for Global View, N for Non-global view, and D for Dissolve transition.

The HMM process is divided into two steps: training and classification. In the training step, the parameters of the HMM, namely the transition probabilities A and the probability distribution of the shot duration, are estimated. The observation parameters B are not estimated, since the observation symbol of each state is fixed manually according to a priori knowledge. As a result, the probability of observing $V_k$ in the state $S_j$ is nearly binary. This is a very hard constraint in the classification process. It comes from the fact that the classification of shots into global view, non-global view, and dissolve transition results from a previous step. An alternative approach consists in integrating this classification step into the HMM by adding color descriptors and activity models to each state. However, this would require a re-estimation of the observation probabilities B for each different play area.

In the classification step, the most likely sequence of states for a given sequence of observations is computed. In other words, we have to find the state sequence Q that maximizes Pr(O | Q, λ). Segmentation and classification of the whole observed sequence into the different structural elements are performed simultaneously. Segmentation of a video sequence into time objects more macroscopic than shots is also called macro-segmentation. Classification results are given by the likelihood of each model for every segment.

Macro-segmentation relies on the long-time correlation of structural elements. To take into account the long-term structure of a tennis game, the four HMMs are connected in a higher level HMM. This higher level HMM is obtained by the concatenation of the previous HMMs. It is represented in Figure 4.
This level reflects the structure of a tennis game in terms of points. A point corresponds to a winning rally, that is to say almost all rallies except first missed serves. Replays happen at the end of a point, and a break happens at the end of at least ten consecutive points. This last rule is represented by a low transition probability between point and break. The long-time correlation avoids an interleaving of points and breaks. It also prevents two breaks from being consecutive.
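The classification step, which searches for the most likely state sequence, can be sketched with a log-domain Viterbi decoder. This is a generic textbook version, not the authors' system: the function signature, the toy three-state model in the usage note, and its uniform transition probabilities are assumptions chosen only to keep the example small.

```python
import math

def viterbi(observations, n_states, log_pi, log_a, log_b):
    """Return the most likely state index sequence for the observations.

    log_pi[j]     : log initial probability of state j
    log_a[i][j]   : log transition probability from state i to state j
    log_b(j, obs) : log observation probability of obs in state j
    """
    if not observations:
        return []
    # Best log score of any path ending in each state at the first step.
    delta = [log_pi[j] + log_b(j, observations[0]) for j in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        prev = delta
        delta, step = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: prev[i] + log_a[i][j])
            step.append(best_i)
            delta.append(prev[best_i] + log_a[best_i][j] + log_b(j, obs))
        backpointers.append(step)
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda j: delta[j])
    path = [state]
    for step in reversed(backpointers):
        state = step[state]
        path.append(state)
    return path[::-1]
```

As a toy usage, take three states with fixed symbols `["GV", "N", "D"]`, uniform initial and transition probabilities, and `log_b` returning `0.0` on a label match and `math.log(1e-3)` otherwise; decoding the label sequence `["GV", "N", "N", "D", "GV"]` then returns the matching state indices. In the real system the transition matrix would carry the left-to-right and higher level constraints described above.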

Figure 4: Higher level HMM for long-time correlation of sub-HMMs

5.3. Training and classification

The sequence of observation vectors O is extracted from the video shots. Data for training consist of labeled shots computed for a collection of videos. Models are trained by manually determining the state alignment of the training data before re-estimating the parameters. Each model is trained separately, using only the observation vectors corresponding to the specific structural element the model should represent. Transition probabilities between the HMMs are estimated by counting. Once all the HMMs $\{\lambda_i, i = 1, \ldots, 5\}$ and the higher level HMM are trained, we use the Viterbi algorithm 14 to get the optimal class sequence for a given observation sequence $O = \{O_1, O_2, \ldots, O_T\}$.

6. EXPERIMENTAL RESULTS

Our experiments were performed on real broadcast tennis videos produced by different broadcasters: 3 videos of the Roland Garros tournament (RG1, RG2, and RG3) and videos of the US Open (US1, US2, and US3). These videos contain different editing styles. A test database characterized by a significant variety of editing styles is of prime importance to test the robustness of our system. Each MPEG-2 video is about 1 hour long. We use videos for the training set and the 3 others for evaluating our system.

Experimental results on shot classification, macro-segmentation and highlights classification are shown in Table 1. The macro-segmentation accuracy is defined as the number of shots correctly classified according to the basic structural elements over the total number of shots. Precision and recall are given for five identified tennis highlights: first missed serve, short rally, rally, replay and break. Precision is the ratio of the number of correctly identified shots to the total number of identified shots. Recall is the ratio of the number of correctly identified shots to the total number of relevant shots.
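The evaluation measures defined above can be computed from per-shot labels as in the following sketch. The label strings and function names are illustrative assumptions, and the measures are returned as 0.0 when their denominator is empty, which is a convention of this sketch rather than something stated in the text.

```python
def precision_recall(predicted, truth, target):
    """Precision and recall of one highlight type over a list of shots."""
    correct = sum(1 for p, t in zip(predicted, truth)
                  if p == target and t == target)
    identified = sum(1 for p in predicted if p == target)
    relevant = sum(1 for t in truth if t == target)
    precision = correct / identified if identified else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall

def segmentation_accuracy(predicted, truth):
    """Fraction of shots assigned to the correct structural element."""
    if not truth:
        return 0.0
    correct = sum(1 for p, t in zip(predicted, truth) if p == t)
    return correct / len(truth)
```

For example, if one of three true rally shots is mislabeled as a replay and no other shot is labeled rally, the rally precision is 1.0 while its recall drops to 2/3.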
Table 1. Classification and macro-segmentation results. Columns: videos RG1, RG2, US1, US2. Rows: shot classification; macro-segmentation accuracy; recall and precision for each highlight type (missed serve, short rally, rally, replay, break).
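The accuracy, precision, and recall definitions used for Table 1 can be written down directly. The sketch below assumes that ground truth and system output are available as two equal-length lists of per-shot labels; the function names and the example labels are hypothetical and do not reproduce any result from the paper.

```python
def accuracy(truth, pred):
    """Fraction of shots whose predicted label matches the ground truth."""
    if len(truth) != len(pred) or not truth:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def precision_recall(truth, pred, label):
    """Precision and recall for one highlight type.

    precision = correctly identified / all shots identified as `label`
    recall    = correctly identified / all shots truly of type `label`
    A ratio with a zero denominator is returned as None.
    """
    if len(truth) != len(pred):
        raise ValueError("label lists must be of equal length")
    correct = sum(t == label and p == label for t, p in zip(truth, pred))
    identified = sum(p == label for p in pred)
    relevant = sum(t == label for t in truth)
    precision = correct / identified if identified else None
    recall = correct / relevant if relevant else None
    return precision, recall


# Made-up labels, for illustration only.
TRUTH = ["rally", "replay", "break", "rally", "short rally"]
PRED = ["rally", "rally", "break", "rally", "rally"]
```

On these made-up labels, three of five shots are correct, and the rally class has full recall but only 50% precision, the same pattern of over-detection that the discussion attributes to ambiguous shot durations.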

Our initial results are satisfactory: using the trained HMM to segment a new set of tennis videos into the four basic structural elements gives global accuracies from 77% to 90% (see Table 1 for details). This result shows the robustness of our classification scheme to heterogeneous video content (i.e. different editing rules, different types of court, etc.).

Typical errors in highlights classification are due to model ambiguities. These ambiguities mainly stem from the fact that shot durations do not always reflect the semantic state of the shot. For example, the distinction between a short rally and a rally is made only according to their respective shot duration distributions. In the learning set, a global view is considered a short rally when it represents an ace or a serve plus a return of serve, regardless of the shot duration. An overlap thus appears in the shot duration distributions, because the duration of a short rally can be equivalent to the duration of a rally containing only three or four strokes. Such confusion could be eliminated by further content analysis of the shots.

Break states suffer from the same type of confusion. A set of consecutive close-up shots with a particularly long duration can appear, for example, when there is a discussion between one player and the umpire. This introduces confusion in the detection of break states and explains their low precision rate.

In some cases, the ambiguities due to the probability distributions can be removed. For example, first missed serve and short rally have almost the same shot duration distribution. They are discriminated by the context, i.e. the previous and the following states. The good precision rate for missed serves supports the validity of our approach. Replay states benefit from a hard constraint on the presence of dissolve transitions, and they are correctly identified. False alarms occur, however, when a dissolve transition is used, for example, to end a break.
Another source of errors is the violation of the assumption that a global view represents a rally. During a break, a global view may be displayed to show the status of the court. Because we consider the cumulated shot duration over successive non-global views, the durations of the non-global views are then cumulated separately before and after such an occurrence of a global view. This leads to two groups of non-global views, separated by a global view, each with a cumulated shot duration that is not necessarily indicative of a break.

Finally, the last source of errors comes from less frequent events that are not explicitly taken into account in the model. We have tried more complex basic structural elements, which include more configurations (for example, possible dissolve transitions in a break or repetition of let services). These models did not give a significant improvement, because they introduce new ambiguous states.

7. CONCLUSION AND FUTURE WORK

In this paper, we have proposed a statistical approach for tennis video macro-segmentation and classification. Based on domain knowledge, we have defined four basic structural elements of a tennis game on which the structure analysis is based. Each element is modeled using an HMM, and the four resulting HMMs are connected in a higher-level HMM. These four elements are also of interest because they imply the following highlights in a further event detection process: a replay happens just after a rally of interest has been played; a first missed serve is not of interest and should not be taken into account in a further analysis; a short rally may include an ace; a break indicates that a game or a set has ended, when the players change ends. The results reported in this preliminary work are promising for modeling the hierarchical structure of a score-constrained sport with HMMs. Future work will include a much more complex model that takes into account the entire hierarchical structure of a tennis game.
This should raise the semantic level of the temporal structure analysis. We are also currently investigating performance improvements from complementary information provided by audio.

Since this paper was written, two papers have been published that use statistical approaches to classify sport highlights, for a soccer game [16] and a baseball game [17] respectively. HMMs are used for highlight classification; however, macro-segmentation was not performed. Nevertheless, these works confirm the value of using HMMs in domain-specific applications.

ACKNOWLEDGMENTS

The authors would like to thank Guillaume Gravier, from IRISA-CNRS, for his help and advice about Hidden Markov Models.

REFERENCES

1. N. Dimitrova, L. Agnihotri, and G. Wei, "Video classification based on HMM using text and faces", European Conference on Signal Processing.
2. J. Huang, Z. Liu, and Y. Wang, "Joint video scene segmentation and classification based on hidden Markov model", Proc. of the IEEE Int'l Conference on Multimedia and Expo.
3. H.J. Zhang, S.Y. Tan, S.W. Smoliar, and G. Yihong, "Automatic parsing and indexing of news video", Multimedia Systems.
4. T. Kawashima, K. Tateyama, T. Iijima, and Y. Aoki, "Indexing of baseball telecast for content-based video retrieval", International Conference on Image Processing.
5. G. Sudhir, J.C.M. Lee, and A.K. Jain, "Automatic classification of tennis video for high-level content-based retrieval", Proc. of the Int'l Workshop on Content-Based Access of Image and Video Databases (CAIVD '98).
6. Y. Rui, A. Gupta, and A. Acero, "Automatically extracting highlights for TV baseball programs", Proc. of ACM Multimedia Conference.
7. S. Nepal, U. Srinivasan, and G. Reynolds, "Automatic detection of goal segments in basketball videos", Proc. of ACM Multimedia Conference.
8. Y. Gong, L.T. Sin, C.H. Chuan, H. Zhang, and M. Sakauchi, "Automatic parsing of TV soccer programs", Proc. of Int'l Conference on Multimedia Computing and Systems (ICMCS '95).
9. W. Zhou, A. Vellaikal, and C.-C. J. Kuo, "Rule-based video classification system for basketball video indexing", Proc. of ACM International Multimedia Conference.
10. L. Xie, S.-F. Chang, A. Divakaran, and H. Sun, "Structure analysis of soccer video with hidden Markov models", IEEE Int'l Conference on Acoustics, Speech, and Signal Processing (ICASSP '02).
11. D. Zhong and S.F. Chang, "Structure analysis of sports video using domain models", IEEE Conference on Multimedia and Expo.
12. Y.P. Tan, D.D. Saur, S.R. Kulkarni, and P.J. Ramadge, "Rapid estimation of camera motion from compressed video with application to video annotation", IEEE Trans. on Circuits and Systems for Video Technology, v10(1).
13. P.J. Rousseeuw, Robust regression and outlier detection, Wiley, New York.
14. L.R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition", Proc. of the IEEE, v77(2).
15. H.J. Zhang, A. Kankanhalli, and S.W. Smoliar, "Automatic partitioning of full-motion video", Multimedia Systems, v1(1), pp 10-28.
16. J. Assfalg, M. Bertini, A. Del Bimbo, W. Nunziati, and P. Pala, "Soccer highlights detection and recognition using HMMs", IEEE Int'l Conference on Multimedia and Expo (ICME '02).
17. P. Chang, M. Han, and Y. Gong, "Extract highlights from baseball game video with hidden Markov models", Proc. of IEEE Int'l Conference on Image Processing (ICIP '02), 2002.
