Title: Pyramidwise Structuring for Soccer Highlight Extraction. Authors: Ming Luo, Yu-Fei Ma, Hong-Jiang Zhang


Title: Pyramidwise Structuring for Soccer Highlight Extraction
Authors: Ming Luo, Yu-Fei Ma, Hong-Jiang Zhang
Mailing address: Microsoft Research Asia, 5F, Beijing Sigma Center, 49 Zhichun Road, Beijing, China
Electronic address:
Phone: (Ming Luo) (Yu-Fei Ma & Hong-Jiang Zhang)
Fax number:
Contact author: Ming Luo
Topic area: Multimedia processing and coding
Subject area: Content analysis and adaptation

Pyramidwise Structuring for Soccer Highlight Extraction

Ming Luo 1, Yu-Fei Ma 2, Hong-Jiang Zhang 2
1 Department of Computer Science, University of Maryland, College Park, MD 20770, USA, ming@cs.umd.edu
2 Microsoft Research Asia, 5F, Beijing Sigma Center, 49 Zhichun Road, Beijing, China, {yfma, hjzhang}@microsoft.com

Abstract

Fast browsing of video content is not only an important research issue but also has a variety of potential applications, especially for sports video. In this paper, we propose a practical solution for extracting highlights from soccer video, based on structure analysis of broadcast soccer video. First, the broadcast soccer video is structured into a soccer pyramid composed of a series of layers from fine to coarse. With this pyramid, soccer highlights can be extracted flexibly according to the needs of different high-level applications. To obtain the soccer pyramid, we also propose a condensation approach for soccer video and the corresponding structure extraction methods. Experiments demonstrate the effectiveness, efficiency and robustness of the proposed approach to structuring and highlight extraction.

1. Introduction

Highlight extraction plays an important role in fast browsing of sports video. Viewers usually want to look through the critical segments of a game rather than the whole game, and flexible browsing of highlights lets them grasp a game quickly. In this paper, we seek an effective and flexible solution to soccer highlight extraction.

A number of related works exist in the literature. Soccer video [1] and tennis video [2] have been automatically parsed into meaningful structures based on a prior model comprising a sketch of the field, the ball and the players.
In [3], a similar approach tries to detect the complete set of semantic events in a soccer game using the position information of the players and the ball as input. In practice, however, such approaches rely on too many high-level semantics to cope with the great variety of broadcast styles and qualities. On the other hand, A. Ekin et al. extracted goal highlights by relying on slow-motion detection [4]. This method has two limitations: 1) in some broadcast programs the critical events are not always replayed in slow motion, which results in missed highlights; 2) detecting slow-motion replays in sports video remains an unsolved problem, especially for slow motion produced by high-speed cameras [5]. Other methods are based on the temporal evolution of low-level features, for example goal detection algorithms based on a finite-state machine [6] and on a controlled Markov chain model [7]. Although the loudness of the audio signal is used in [7] to improve the accuracy of visual analysis, the number of false detections is still high. The HMM (Hidden Markov Model) based method in [8] is reasonable for sports video analysis, but fusing multiple low-level features remains challenging; consequently, only alternating play/break segments can be identified in that work.

In summary, approaches based on high-level features such as the field sketch or the tracking of ball and players are not practical, because such features are very difficult to extract robustly. Approaches that directly fuse low-level features (e.g. color, texture and motion) with stochastic methods are limited in semantic understanding, because of the wide gap between low-level features and high-level semantics. In this paper, we aim at highlight extraction based on a structuring scheme that does not require full semantic understanding.
We define a pyramidal structure for the soccer game, called the soccer pyramid, which is composed of a series of layers: SEG, FAR-SIDE, GOAL, ATTACK, and GOA (group of attacks). This structure contains both physical layers and intermediate semantic layers. With high-level heuristic rules or domain knowledge, highlights can be flexibly extracted from the soccer pyramid. In our implementation, a condensed binary image sequence is generated using a field mask; this form is readable by computers and carries no information that is redundant for semantic understanding. In addition, a set of global statistical features is extracted from the condensed binary sequence for soccer video structuring; these are more robust than local features such as field, player and ball positions. The effectiveness of the proposed approach has been demonstrated by extensive experiments.

The rest of the paper is organized as follows. Section 2 introduces the soccer pyramid. Section 3 discusses the soccer video condensation method. Section 4 explains the structure analysis methods and the flexible highlight schemes. Section 5 gives the evaluation results, and Section 6 concludes the paper.

2. Soccer Pyramid

Figure 1: Soccer Pyramid (a) Soccer pyramid, with the physical levels (SEG, FAR-SIDE) above the semantic levels (GOAL, ATTACK, GOA); (b) SEG boundary definition (shots and SEGs).

The soccer pyramid structure is illustrated in Figure 1(a). SEG is the top layer of the soccer pyramid. A SEG is a continuous segment with uniform content, and it has the shortest duration among the layers of the pyramid. A SEG is finer than a shot, the basic unit of traditional video segmentation. As shown in Figure 1(b), every shot boundary is also a SEG boundary; that is, a shot may be a single SEG or comprise several SEGs. We define 4 types of SEGs: CLOSE-UP, FAR-CENTER, FAR-SIDE and MIDDLE, as shown in Figure 2. CLOSE-UP SEGs are close-up views of players or the referee, or views of a coach or the audience. FAR-CENTER SEGs are taken from a far view and show the center area of the field, i.e. the non-penalty area. FAR-SIDE SEGs are taken from a far view and show the side area of the field, i.e. the penalty area. MIDDLE SEGs are taken from a middle-distance view, normally focusing on one or several players.

Figure 2: Four Types of SEGs (a) CLOSE-UP; (b) FAR-CENTER; (c) FAR-SIDE; (d) MIDDLE

We list FAR-SIDE SEG as the second layer of the soccer pyramid because it is highly meaningful: FAR-SIDE SEGs are very likely to contain critical events such as shots or goals. FAR-CENTER SEGs, the most boring part, cover a major portion of soccer video (generally over 50%) and are not considered in the soccer pyramid. MIDDLE and CLOSE-UP SEGs, which often contain various semantics, are considered selectively, because those around a FAR-SIDE SEG are most likely to be the set-up or the end scene of a critical event. As the most interesting event in a soccer game, GOAL is listed as the third layer of the soccer pyramid. The fourth layer is ATTACK. ATTACKs are the segments involving critical moments, such as shots and goals. ATTACKs may be further classified as team A ATTACKs and team B ATTACKs. At the bottom of the pyramid, ATTACKs that are temporally or semantically related are grouped into a GOA, the group of attacks. Such a soccer pyramid structure builds a bridge from the low-level physical information of video to high-level meaningful semantics.

As shown in Figure 1(a), the top part of the pyramid consists of the physical layers, SEG and FAR-SIDE, which have no semantics. Under the physical layers there are 3 semantic layers, GOAL, ATTACK and GOA, which are used in highlight extraction. Based on the semantic layers, multi-scale soccer highlights can be flexibly extracted. To build the soccer pyramid, a soccer video condensation method is employed first, which compresses the raw video data into a concise form that can be understood by computers.

3. Soccer Video Condensation

As the richest medium, a video sequence carries a lot of redundant information, which is attractive to humans but too complex for contemporary computers. To make soccer video tractable for computers, we propose a soccer video condensation method that generates a condensed binary image sequence from the original video sequence. As shown in Figure 3, we mask the images with the field color to obtain an abstract description of the soccer video that looks like the binary image in Figure 3(b). In our implementation, we use the base line, which is parallel to the goal line of the soccer field, in place of the goal line, both because the base line is easier to extract from our mask than the goal line and because it carries the same semantics as the goal line.

Figure 3: Soccer Video Condensation (a) Original image; (b) Condensed image; (c) Base line detection

We use the field color as the mask to condense soccer video. However, the field color model must be tuned for each soccer video because of differing grass conditions, building shadows and stadium illumination.
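As a rough illustration of the condensation step, the sketch below thresholds each pixel in HSI space and emits a binary field mask. It uses the paper's initial green range (hue [0.18, 0.4], saturation [0.1, 1], intensity [0.2, 1]) without any tuning. The function names, the list-of-rows image layout, and the scaling of hue to [0, 1] as a fraction of the full color circle are assumptions made for this example, not details given in the paper.

```python
import math

# Initial (untuned) green model from the paper: (hue, saturation, intensity) ranges.
GREEN = ((0.18, 0.4), (0.1, 1.0), (0.2, 1.0))

def rgb_to_hsi(r, g, b):
    """Convert RGB in [0, 1] to (hue, saturation, intensity), each in [0, 1].

    Hue is expressed as a fraction of the full circle (assumed scaling).
    """
    i = (r + g + b) / 3.0
    if i == 0.0:
        return 0.0, 0.0, 0.0
    s = 1.0 - min(r, g, b) / i
    den = math.sqrt(max(0.0, (r - g) ** 2 + (r - b) * (g - b)))
    if den == 0.0:
        return 0.0, s, i  # grey pixel: hue is undefined, reported as 0
    cos_t = max(-1.0, min(1.0, 0.5 * ((r - g) + (r - b)) / den))
    theta = math.acos(cos_t)
    if b > g:
        theta = 2.0 * math.pi - theta
    return theta / (2.0 * math.pi), s, i

def is_field(pixel, model=GREEN):
    """True if the pixel's HSI values all fall inside the color model ranges."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(rgb_to_hsi(*pixel), model))

def condense(image, model=GREEN):
    """Map an image (rows of RGB tuples) to a binary mask: 1 = field, 0 = non-field."""
    return [[1 if is_field(p, model) else 0 for p in row] for row in image]
```

A fixed range like this is only a starting point; a tuned model would replace `GREEN` with whatever set of colors the tuning step accepts.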
The six fields shown in Figure 4, for example, differ greatly in field color.

Figure 4: Different Field Colors in Soccer Videos

Generally, the field color is not only perceptually green but also the dominant color in most scenes of a soccer video. In this work, we initialize the green model as a convex set in HSI space, i.e., hue: [0.18, 0.4], saturation: [0.1, 1], intensity: [0.2, 1]. Then a number of frames are scanned to obtain the statistical HSI values of their pixels, from which the tuned field color model is built. Specifically, we build an evenly distributed H-S-I histogram over the HSI subspace ([0.18, 0.4], [0.1, 1], [0.2, 1]). The weight value w(i, j, k) in bin (i, j, k) is the ratio of the

number of pixels in that bin to the number of all pixels. If criteria (1) and (2) are satisfied, the color falling in bin (i, j, k) is viewed as the field color in the considered video.

$$w(i, j, k) \ge 0.01 \quad (1)$$

$$\sum_{j=1}^{10} \sum_{k=1}^{10} w(i, j, k) \ge \ldots \quad (2)$$

This tuning process is run every 5 minutes to update the field color model automatically. Experiments show acceptable results for field color tuning: the influences of grass condition, illumination and shadows are largely eliminated, and the green colors on some uniforms are successfully kept from being taken as field color.

4. Highlight Extraction

4.1 SEG classification

To segment video into SEGs, we first classify each frame into one of the 4 SEG types with a Bayesian network. Then the SEGs are generated from the class label sequence by a merging routine.

4.1.1 Feature extraction

For SEG classification, we extract 3 global features: the field color ratio in the image, the probability that an inclined base line exists, and the summed object size ratio in the field area, denoted $F_g$, $F_l$ and $F_o$ respectively. $F_g$ is defined as

$$F_g = \frac{N_g}{W \times H} \quad (3)$$

where $N_g$ is the number of field color pixels in the image, and $W$ and $H$ are the width and height of the image respectively.

$F_l$ is an important feature for distinguishing a FAR-SIDE SEG. As shown in Figure 3(c), a FAR-SIDE frame contains an inclined base line, which can be detected by a 2-dimensional Hough transform [9]. $F_l$ is therefore obtained by

$$F_l = \frac{\max\{\, v_{s,d} \mid 10 \le s \le 170,\ -d_{max} \le d \le d_{max} \,\}}{d_{max}} \quad (4)$$

$$d_{max} = \sqrt{W^2 + H^2} \quad (5)$$

where $v_{s,d}$ is the vote value for the straight line with slope $s$ and distance $d$ to the top-left corner of the image. We define $d > 0$ if the line is over the top-left corner and $d < 0$ otherwise.

$F_o$ is sensitive to MIDDLE SEGs. In the condensed image, as in Figure 3(b), the field region, the non-field region and the objects in the field are extracted. $F_o$ is then computed by

$$F_o = \frac{\sum_i S_i}{S_F} \quad (6)$$

where $S_i$ is the size in pixels of the $i$-th object in the field region, and $S_F$ is the field size in pixels. $F_g$, $F_l$ and $F_o$ are all continuous features in [0, 1].

4.1.2 Classification by Bayesian network

To classify each frame into the four types, we use a continuous Bayesian network, because a Bayesian network is a non-linear method, which is more suitable for multimedia analysis. Using the 3 features introduced in 4.1.1, the Bayesian network is as shown in Figure 5. It comprises 3 continuous observation nodes and 1 discrete hidden node with 4 possible values (the 4 class labels).

Figure 5: Bayesian Network for SEG Classification (hidden node $S$; observation nodes $F_g$, $F_l$, $F_o$)

Given observations of $F_g$, $F_l$ and $F_o$, the a posteriori probability is calculated as

$$P(S \mid F_g, F_l, F_o) = \frac{P(F_g, F_l, F_o \mid S)\, P(S)}{P(F_g, F_l, F_o)} \quad (7)$$

Because the denominator does not depend on $S$, comparing the a posteriori probabilities of different values of $S$ is equivalent to comparing $P(F_g, F_l, F_o \mid S)\, P(S)$. We compute $P(S)$ by the maximum likelihood method and assume that $P(F_g, F_l, F_o \mid S)$ is a 3-dimensional Gaussian distribution that can be trained from samples.

As the classification results contain some errors, we employ a merging algorithm to generate SEGs from the class label sequence produced by the Bayesian classification. First, adjacent frames with the same class label are grouped together as one SEG. Then, the overly short SEGs are filtered out. In this manner, the video sequence is parsed into SEGs with 4 types of labels.

4.2 GOAL detection

According to the definitions of the layers of the soccer pyramid, goals must occur in FAR-SIDE SEGs with some special image characteristics, and have special temporal patterns constituted by the lower layers. By exploiting these characteristics and patterns, we propose an algorithm to detect GOALs as well as the replays following them.
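The frame classification and merging of Section 4.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class-conditional Gaussian is given a diagonal covariance to keep the example dependency-free (the paper only says it is 3-dimensional), short runs are absorbed into the preceding SEG (the paper does not say how short SEGs are filtered), and `model`, `min_len` and all parameter values are hypothetical.

```python
import math
from itertools import groupby

def log_gaussian(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (simplifying assumption)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def classify_frame(features, model):
    """Return the label S that maximises P(Fg, Fl, Fo | S) * P(S), as in Eq. (7).

    features: (Fg, Fl, Fo); model: label -> (prior, mean, var), all hypothetical.
    """
    def score(label):
        prior, mean, var = model[label]
        return math.log(prior) + log_gaussian(features, mean, var)
    return max(model, key=score)

def merge_labels(labels, min_len):
    """Group equal adjacent frame labels into SEGs as (label, start, end) tuples.

    Runs shorter than min_len frames are absorbed into the previous SEG
    (an assumed filtering rule); end is exclusive.
    """
    segs = []
    pos = 0
    for label, run in groupby(labels):
        n = len(list(run))
        if segs and (n < min_len or segs[-1][0] == label):
            segs[-1][2] += n  # extend the previous SEG over this run
        else:
            segs.append([label, pos, pos + n])
        pos += n
    return [tuple(s) for s in segs]
```

Working in log space avoids numerical underflow when the densities are very small, and it leaves the argmax of Eq. (7) unchanged.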
The distinctive characteristics include:
1) There is at least one FAR-SIDE SEG within a GOAL;
2) During this FAR-SIDE SEG, the inclined base line in the image moves down until it reaches a very low position;
3) This FAR-SIDE SEG is usually followed by a series of MIDDLE or CLOSE-UP SEGs. Moreover, these MIDDLE/CLOSE-UP SEGs often last a considerable length, presenting cheering scenes or replays of the GOAL.

According to these characteristics, an important feature, the base line intercept, denoted R, is defined on the condensed image sequence, as shown in Figure 6.

Figure 6: Definition of R (a) rightward (b) leftward

Suppose a FAR-SIDE SEG has an R sequence {R_1, ..., R_n}. A GOAL is detected if the following 3 rules are all satisfied:
1) There exist a number of consecutive R_i with large values;
2) There is a rapid increase in the R sequence;
3) The duration of the MIDDLE/CLOSE-UP SEGs following the GOAL is long enough.

Usually, several replays from different viewpoints follow a goal. To achieve a more complete structural understanding, we extract these replays by locating the short FAR-CENTER SEGs and MIDDLE SEGs within a considerable extension after the GOAL.

4.3 ATTACK and GOA generation

FAR-SIDE SEGs usually display situations that are dangerous for the defending team, so they can be viewed as the anchors of critical moments. However, FAR-SIDE SEGs are very short, with an average length of about 3 seconds. In order to deliver more reasonable video clips to users, we define higher-level structures, ATTACK and GOA, which last from the set-up of a critical event until its end. The ATTACKs are therefore adaptively extended forward and backward from the corresponding FAR-SIDE SEGs. For example, Figure 7 illustrates a corner kick ATTACK. Figure 7(b) is the FAR-SIDE SEG, the SEG before it is a MIDDLE SEG (Figure 7(a)), and the one after it is a CLOSE-UP SEG (Figure 7(c)). The 3 SEGs present a complete corner kick.

Figure 7: A Corner Kick ATTACK (a) (b) (c)

The attack direction can also be determined from the condensed image according to the inclination direction of the base line, as shown in Figure 6. A GOA is generated as a series of related ATTACKs: if some ATTACKs are along the same direction and close enough in time, they are grouped into a GOA. GOAs give viewers a more complete understanding of the progress of the game.

4.4 Highlight extraction

With the soccer pyramid defined in Section 2, we can extract soccer highlights in a flexible manner.
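Before turning to the extraction schemes, the three GOAL rules of Section 4.2 and the GOA grouping of Section 4.3 can be sketched in a few lines. The paper gives no thresholds, so every default below (`r_high`, `min_run`, `min_jump`, `min_follow`, `max_gap`) is a hypothetical placeholder. Reading "rapid increase" as a jump between consecutive R values is also an assumption, as is the `(start, end, direction)` representation of an ATTACK.

```python
def longest_run_above(values, threshold):
    """Length of the longest run of consecutive values >= threshold."""
    best = run = 0
    for v in values:
        run = run + 1 if v >= threshold else 0
        best = max(best, run)
    return best

def is_goal(r_seq, follow_duration,
            r_high=0.7, min_run=5, min_jump=0.2, min_follow=10.0):
    """Apply the three GOAL rules to one FAR-SIDE SEG (all thresholds hypothetical).

    r_seq: base line intercepts R_1..R_n of the SEG.
    follow_duration: seconds of MIDDLE/CLOSE-UP SEGs that follow it.
    """
    rule1 = longest_run_above(r_seq, r_high) >= min_run                 # sustained large R
    rule2 = any(b - a >= min_jump for a, b in zip(r_seq, r_seq[1:]))    # rapid increase
    rule3 = follow_duration >= min_follow                               # long aftermath
    return rule1 and rule2 and rule3

def group_attacks(attacks, max_gap):
    """Group time-sorted ATTACKs (start, end, direction) into GOAs.

    An ATTACK joins the current GOA if it has the same direction as the
    previous ATTACK and starts within max_gap seconds of its end.
    """
    groups = []
    for att in attacks:
        last = groups[-1][-1] if groups else None
        if last is not None and last[2] == att[2] and att[0] - last[1] <= max_gap:
            groups[-1].append(att)
        else:
            groups.append([att])
    return groups
```

Keeping each rule as a separate boolean makes it easy to inspect which rule rejected a candidate when tuning the thresholds on labelled matches.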
As illustrated in Figure 8, multi-scale highlights can be extracted from the different semantic layers of the soccer pyramid: 1) to view all GOALs without replays; 2) to view all GOALs with replays; 3) to view all ATTACKs; 4) to view all GOAs; or 5) to view all attacks of team A or team B. Viewers may also access any GOAL, ATTACK, or GOA in a non-linear manner.

Figure 8: Multi-scale Highlight Extraction (GOAL without replay, GOAL with replay, ATTACK, GOA)

In addition, if a criticality measure is well defined for each layer, criticality-based highlights can be generated. For example, we define the maximum R value within an ATTACK (as shown in Figure 6) as a criticality indicator. With this indicator, the ATTACKs can be displayed in a ranked list, with the most interesting ATTACKs at the top. This kind of highlight helps users quickly review the most important segments of the game.

5. Evaluations

Six soccer matches totaling 7.4 hours are used to evaluate the system performance, including 4 matches of World Cup 2002, 1 match of FIFA Cup 2001, and 1 match from the MPEG7 test video. Only a 6-minute video clip from the third World Cup match is used as the training set for the Bayesian network. The other videos are testing data. The ground truth is labeled manually. The testing videos have different image qualities: the videos from the World Cup have good quality, while the video from the FIFA Cup is too dark and the MPEG7 test video is too light. Nevertheless, the experimental results are encouraging. We evaluate SEG accuracy, GOAL detection and ATTACK accuracy separately, and calculate the highlight time coverage. In the evaluation tables, M1 to M6 stand for Match1 to Match6. Table 1 gives the SEG accuracy. The classification precision and recall of FAR-SIDE are also shown in Table 1, because FAR-SIDE is the most important unit for highlight extraction. The average SEG accuracy (acc.) reaches 89.6%, while the average precision (pre.) and recall (rec.)
of FAR-SIDE are 94.0% and 87.1% respectively.

Table 1: SEG Accuracy
        SEG acc.    FAR-SIDE pre.    FAR-SIDE rec.
M1      91.9%       93.9%            93.7%
M2      87.5%       93.7%            89.9%
M3      94.7%       95.9%            90.9%
M4      90.2%       94.0%            78.3%
M5      86.0%       91.5%            87.5%
M6      84.1%       96.6%            73.0%
Avg.    89.6%       94.0%            87.1%

GOAL detection results are shown in Table 2. On average, with a 100% recall, the precision reaches 68.2%. Compared with the literature, such as [4] and [7], these results are

more reasonable. Moreover, for all the 15 goals in our experiments, the replay precision and recall are both 100%.

Table 2: GOAL Detection Accuracy
        correct    false    miss    precision    recall
M1                                               100%
M2                                               100%
M3                                               100%
M4                                               100%
M5                                               100%
M6                                               100%
Total                                            100%

We compute the ATTACK precision, recall and direction accuracy in Table 3. The average precision, recall and direction accuracy are 93.5%, 86.4% and 96.8% respectively.

Table 3: ATTACK Accuracy
        Precision(%)    Recall(%)    Direction(%)
M1
M2
M3
M4
M5
M6
Total

The time coverage of the highlights extracted from GOAL (G), GOAL with replays (GR) and ATTACK is shown in Tables 4 and 5.

Table 4: GOAL Time Coverage
        G (sec)    G%    GR (sec)    GR%    Whole (min)
M1      41
M2      71
M3      25
M4      20
M5      42
M6      28
Total   227

Table 5: ATTACK Time Coverage
        Number    Time (min)    Time ratio    Whole (min)
M1                              16.9%         78
M2                              17.2%         87
M3                              12.8%         94
M4                              9.0%          89
M5                              17.0%         47
M6                              9.1%          55
Total                           13.6%         450

As the most interesting portions of soccer video, GOALs cover only 0.84% of the whole games, while the coverage of GOALs with replays is 2.30%, as shown in Table 4. From Table 5, we can see that ATTACKs cover 13.6% of the whole game on average. This ratio suggests that the ATTACK-based highlight is a reasonable soccer synopsis, as a whole game usually lasts 2 hours. In our experiments, the false alarms of ATTACK are mainly caused by goal kicks or passes between defenders. In fact, such ATTACKs can be easily filtered by criticality ranking. The misses are usually caused by low-quality attacks. However, these errors have little negative effect on the highlight presentation, because viewers usually pay little attention to them. On the other hand, we may further improve the performance of our system by selecting more effective features and employing statistical models for structure extraction on every layer.

6. Conclusions

In this paper, we proposed a pyramidal structure for soccer video analysis, which comprises a series of layers, SEG, FAR-SIDE, GOAL, ATTACK and GOA, from top to bottom. As the soccer pyramid contains rich intermediate semantics, soccer highlights can be extracted flexibly according to different viewers' requirements. By generating a condensed binary image sequence, effective global features are extracted, which make it possible to obtain accurate structures in the soccer pyramid. The encouraging experimental results demonstrate the practicality of the proposed approach.

References

[1] Y. Gong, L. T. Sin, C. H. Chuan, H. Zhang, M. Sakauchi, Automatic parsing of TV soccer programs, Proc. ICMCS95, Washington DC, USA.
[2] D. Zhong, S. F. Chang, Structure Analysis of Sports Video Using Domain Models, Proc. ICME2001, Tokyo, Japan, Aug.
[3] V. Tovinkere, R. J. Qian, Detecting Semantic Events in Soccer Games: Toward a Complete Solution, Proc. ICME2001, Tokyo, Japan, Aug.
[4] A. Ekin, M. Tekalp, Automatic Soccer Video Analysis and Summarization, Proc. SST SPIE03, CA, USA.
[5] H. Pan, P. van Beek, M. I. Sezan, Detection of slow-motion replay segments in sports video for highlights generation, ICASSP 2001, Salt Lake City, UT, May.
[6] A. Bonzanini, R. Leonardi, P. Migliorati, Event Recognition in Sport Programs Using Low-Level Motion Indices, Proc. ICME2001, Japan, Aug.
[7] R. Leonardi, P. Migliorati, M. Prandini, Semantic indexing of sport program sequences by audio-visual analysis, Proc. ICIP 2003, Barcelona, Spain, Sep.
[8] L. Xie, S-F Chang, A. Divakaran, H. Sun, Structure analysis of soccer video with Hidden Markov Models, Proc. ICASSP.
[9] J. Illingworth, J. Kittler, A Survey of the Hough Transform, CVGIP, vol. 44.
