MULTIMODAL BASED HIGHLIGHT DETECTION IN BROADCAST SOCCER VIDEO
YIFAN ZHANG, QINGSHAN LIU, JIAN CHENG, HANQING LU
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
{yfzhang, qsliu, jcheng,

Abstract: In this paper, we propose an effective fusion scheme of audio and visual modalities for highlight detection in broadcast soccer videos. Adaboost learning is adopted to select discriminating audio features for special-purpose audio classification. Logo-based replay shot detection is used for mid-level visual semantic analysis. A finite state machine integrates the audio and visual analysis results for highlight detection. Experiments conducted on several real-world soccer game videos show that the proposed method performs encouragingly.

Keywords: Highlight detection; broadcast video; multimodal analysis

1. Introduction

The quantity and availability of sports video content is soaring due to the popularization of television and the internet. Video services via new media channels such as network TV and mobile devices have shown tremendous commercial potential and brought huge demand for personalized sports video services tailored to consumers' preferences. The traditional one-to-many broadcast mode cannot meet different audiences' demands. People are often interested only in the highlights of lengthy and voluminous sports video programs and want to skip the less interesting parts. In this paper, an effective multimodal approach using audio and visual features for highlight detection in broadcast sports videos is presented. We initially focus on soccer, because it is the most popular sport and appeals to large audiences. Of a whole soccer game, usually only a small portion is exciting and highlight-worthy. Manually generating highlights of soccer videos is labor-intensive, as editors need to browse the whole game.
The soccer game's dynamic and flexible structure also makes video parsing and analysis challenging. The audio track is an important information source: it correlates well with semantics and is inexpensive to compute. Hence, it is important to utilize audio information in highlight detection. In soccer games, the audio track consists of whistles, audience applause, commentator speech, and various kinds of environmental sounds. Based on our observation, the highlight-worthy events always co-occur with excited commentator speech, excited audience applause, and sometimes the referee's whistle. However, whistles also occur in common events such as fouls, kickoffs, and the start and end of the game, and some whistles may come from the audience rather than the referee, which hurts detection accuracy. Audience applause is always mixed with various environmental noises. Thus, we choose excited commentator speech as the audio cue to facilitate highlight inference and detection. Audio-based analysis alone is not always reliable in soccer videos, however, because the environmental sounds are very noisy. We therefore turn to visual analysis for help. To limit computational complexity and enhance robustness, we use only replay shot detection in the visual analysis. A replay shot is a special effect inserted by the TV director to explain the game's progress and show players' details. It is a significant mid-level feature with a strong relationship to highlights. However, replay shots are sometimes inserted to show technical details such as fouls and offsides, which are less interesting to most audiences. The combination of audio cue extraction and replay shot detection can therefore effectively reduce the false positives of each single-modality analysis.
Our proposed solution contains three parts: audio cue extraction, replay shot detection, and audio-visual integration. For audio cue extraction, Adaboost is utilized to automatically select discriminating low-level features for audio classification.
Replay shot detection is based on detecting the flying logo before and after replay shots, which is a production rule in broadcast sports videos. Finally, a finite state machine is designed to integrate the audio and visual analysis results for highlight detection. Since the features we use are generic in broadcast sports videos rather than game-specific, our approach is easy to extend to other application domains.

2. Related Work

Highlight detection and game analysis for sports video has attracted much research attention in recent years [1]. Most existing methods are based on visual analysis [2, 3], attempting to extract mid-level semantic concepts from low-level visual information. For soccer games, [4] tried to use the position information of the players and the ball during the game, and therefore needed a rather complex and accurate tracking system. Ekin et al. [5] proposed a framework using object-based features for analysis and summarization of soccer videos. The framework included novel video processing approaches such as dominant color region detection, referee detection, and penalty-box detection. TV broadcasting rules were also used together with visual information to detect goal events. However, visual features are not only expensive to compute but also not very robust. Hence, some researchers began to focus on audio analysis [6, 7]. Rui et al. [6] detected speech and ball-hit sounds for extracting highlights of baseball videos. Several learning algorithms were compared for speech classification, and a directional template matching approach was used for ball-hit sound detection. Since game-specific sounds and domain knowledge are used, the method is difficult to generalize to other sports. In [7], SVMs were employed to train sound recognizers (applause, speech, and whistles), under the assumption that those sounds are closely related to certain events under specific sports game rules.
The low-level audio features used for the recognizers were selected manually, which is labor-intensive in training and testing, and hard to adapt to different classification tasks. Some researchers have attempted to combine audio and visual features to improve detection precision. Han et al. [8] used a maximum entropy model to integrate audio, image, and speech to detect highlights in baseball videos. Nepal et al. [9] employed a heuristic approach combining crowd cheers, score display, and camera motion transitions for detecting goal events in basketball games. These methods are largely based on specific domain rules and game-specific features.

3. Audio Cue Extraction

An audio cue is a significant piece of audio information that has a strong relationship with the semantics of the game and can facilitate highlight detection. In soccer, we choose excited commentator speech as the audio cue because it correlates better with exciting events and is relatively easier to classify than other audio information.

3.1. Feature Extraction

Since the audio track consists of sounds mainly from the commentator, the audience, whistles, and other environmental noise, we extract features which characterize those sounds well in both the time domain and the frequency domain of the audio signal.

Mel-Frequency Cepstral Coefficients (MFCCs). The mel-frequency cepstrum has proved effective in speech recognition and in modeling the subjective pitch and frequency content of audio signals. The frequency bands are positioned logarithmically (on the mel scale), which approximates the human auditory system's response more closely than the linearly spaced frequency bands of the FFT or DCT. The MFCCs are computed from the FFT power coefficients, filtered by a triangular band-pass filter bank, as follows:

    C_n = \sum_{k=1}^{K} (\log S_k) \cos[ n (k - 0.5) \pi / K ],  n = 1, 2, ..., N    (1)

where S_k is the output of the k-th filter bank, K is the number of filters, and N is the number of MFCC dimensions.
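As a concrete illustration of Eq. (1), the sketch below computes MFCCs from a single frame's FFT power spectrum. This is not the paper's implementation: the filter-bank size, cepstral order, and mel-scale formula are common defaults assumed here.

```python
import numpy as np

def mfcc_from_power_spectrum(power_spec, n_filters=26, n_ceps=13, sample_rate=44100):
    """Compute MFCCs from one frame's FFT power spectrum (sketch of Eq. 1)."""
    n_bins = len(power_spec)
    # Common mel-scale conversion (an assumption; the paper does not give one).
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Mel-spaced edge frequencies for the triangular band-pass filter bank.
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bin_freqs = np.linspace(0.0, sample_rate / 2, n_bins)
    fbank = np.zeros((n_filters, n_bins))
    for k in range(n_filters):
        left, center, right = hz_edges[k], hz_edges[k + 1], hz_edges[k + 2]
        rise = (bin_freqs - left) / (center - left)
        fall = (right - bin_freqs) / (right - center)
        fbank[k] = np.maximum(0.0, np.minimum(rise, fall))
    # Filter-bank outputs S_k, then the cosine transform of their logs (Eq. 1).
    S = fbank @ power_spec + 1e-10          # small offset avoids log(0)
    K = n_filters
    n = np.arange(1, n_ceps + 1)[:, None]   # n = 1..N
    k = np.arange(1, K + 1)[None, :]        # k = 1..K
    return (np.log(S)[None, :] * np.cos(n * (k - 0.5) * np.pi / K)).sum(axis=1)
```

The loop builds one triangular filter per mel band; the final line is a direct transcription of Eq. (1).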
The deltas and accelerations of the MFCCs are also used in our experiments.

Linear Prediction Coefficients (LPCs). The LPCs are the coefficients of linear predictive coding, frequently used to transmit spectral envelope information. By minimizing the sum of squared differences between the actual audio samples and the predicted ones, a set of predictor coefficients can be determined. The Levinson recursion is used for the iterative calculation.

LPC Cepstral Coefficients (LPCCs). The LPCCs are the cepstral coefficients converted from the linear prediction coefficients. The LPCs are denoted [a_0, a_1, a_2, ..., a_p] and the LPCCs [b_0, b_1, b_2, ..., b_p, ..., b_{n-1}]. The recursion is defined by the following equations:

    b_0 = \ln E                                                          (2)
    b_m = a_m + (1/m) \sum_{k=1}^{m-1} (m - k) a_k b_{m-k},  1 <= m <= p (3)
    b_m = (1/m) \sum_{k=1}^{m-1} (m - k) a_k b_{m-k},  p < m < n         (4)

where E is the prediction error and n is the number of cepstral coefficient dimensions.

Zero Crossing Rate (ZCR). The zero crossing rate is the rate of sign changes along a signal, a simple measure of its frequency content. It is calculated as

    R = (1 / 2T) \sum_{t=1}^{T-1} | sign(s_t) - sign(s_{t-1}) |          (5)

where s is a signal of length T and sign(a) is the algebraic sign of its argument.

Short Time Energy (STE). The short time energy is the mean square of the samples in each frame, weighted with a Hamming window h(n). It is calculated as

    STE = (1/T) \sum_{n=0}^{T-1} [ s(n) h(T - n) ]^2                     (6)

where s is a signal of length T.

Adaboost is a popular learning algorithm which can select and weight discriminating features to build efficient classifiers [10]. In this paper, we use Adaboost to select the most discriminating features for excited commentator speech classification. We simply use a Gaussian weak classifier for each feature dimension. The whole process is shown in Table 1.

Table 1. Feature selection by Adaboost
1. Initialize weights w_{1,i} = 1/m, 1/n for negative and positive samples, where m and n are the numbers of negatives and positives respectively.
2. For t = 1, ..., T:
   (a) Normalize the weights w_{t,i}.
   (b) For each feature f_j, train a Gaussian classifier G_j. The error is evaluated with respect to w_{t,i}: e_j = \sum_i w_{t,i} | G_j(x_i) - y_i |, where y_i is the label of sample x_i.
   (c) Choose the feature f_t with the minimum error e_t.
   (d) Update the weights: w_{t+1,i} = w_{t,i} \beta_t^{1 - \delta_i}, where \delta_i = 0 if sample x_i is classified correctly, \delta_i = 1 otherwise, and \beta_t = e_t / (1 - e_t).
3. The final selected feature vector is { \alpha_1 f_1, \alpha_2 f_2, \alpha_3 f_3, ..., \alpha_T f_T }, where \alpha_t = \log(1 / \beta_t).

3.2. Feature Selection

We segment the original audio signal into 50 ms frames as the basic unit.
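The 50 ms framing and the simpler time-domain features, ZCR (Eq. 5) and STE (Eq. 6), can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the non-overlapping framing is an assumption.

```python
import numpy as np

FRAME_MS = 50  # frame length used in Section 3.2

def frame_signal(signal, sample_rate, frame_ms=FRAME_MS):
    """Split the audio track into non-overlapping 50 ms frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    return signal[:n_frames * frame_len].reshape(n_frames, frame_len)

def zero_crossing_rate(frame):
    """Eq. (5): fraction of adjacent-sample sign changes in the frame."""
    signs = np.sign(frame)
    return np.mean(signs[1:] != signs[:-1])

def short_time_energy(frame):
    """Eq. (6): Hamming-weighted mean square of the frame samples."""
    w = np.hamming(len(frame))
    return np.mean((frame * w) ** 2)
```

At 44.1 kHz, each 50 ms frame holds 2205 samples; the per-frame features are then normalized and stacked into the vector described below.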
Each frame is described by the low-level features extracted in Section 3.1; the features of one frame are normalized and combined into a vector. Audio cue extraction can then be formulated as a two-class classification problem (excited commentator speech vs. everything else). Frames belonging to excited commentator speech segments are positive samples, and all other frames are negative. In the work of Rui et al. [6] and Xu et al. [7], the SVM proved to be an effective classifier. However, they did not consider the properties of the different low-level features, which influence audio classification differently. For example, energy and MFCC features perform well in speech detection, while whistles are easily distinguished by the ZCR feature [6, 7]. Moreover, simply combining all features together can degrade classification performance. It is therefore necessary to perform feature selection automatically according to the task at hand.

4. Replay Detection

In most broadcast soccer games, there is a special transition at both the start and the end of a replay: a logo flies in and gradually disappears. Based on our observation, more than 90% of broadcast soccer videos use a flying logo to launch replays, which can be treated as a production rule. Figure 1 shows examples of the flying logo in several soccer game videos.

FIG. 1. Flying logo in (a) World Cup 2006, (b) European Championship 2004, (c) European Champions League and (d) England Premier League

Based on our previous work [11], we use an effective solution for replay shot detection using the flying logo. The solution consists of logo-transition detection, logo detection, and replay recognition. We first detect the logo transitions
and then extract logo samples from them. Next, we employ template matching to detect the remaining logos. Once all logos are found, the video can be partitioned into replay and non-replay segments. In logo-transition detection, the difference between neighboring frames is measured by the intensity mean square difference (MSD). We count the number of consecutive inter-frame differences exceeding a threshold; if the count is large enough, a wipe transition is declared. The logo template is obtained as the average image of the samples in the transition process, and color and shape features are used in template matching. Ideally, a pair of detected logos determines a replay shot. However, because of false and missed detections, we add other features (such as shot length, shot type, and motion vectors) to help with replay shot recognition. For further technical details, refer to [11].

5. Audio-Visual Fusion

An important part of our scheme is integrating the audio and visual analysis results for highlight detection. On the audio side, since observation of real-world sports games reveals that excited commentator speech usually lasts much longer than one second, we divide the audio stream into one-second segments; each segment is labeled by majority voting over the classification results of its frames. On the visual side, the video stream is also divided into one-second segments, each labeled 1 for replay and 0 for non-replay. Highlights are always followed by a replay shot, and the excited commentator speech occurs before or during the replay shot. Therefore, a forward-search rule is utilized to find the excited commentator speech based on the replay shot detection results. The search rule between the audio and video streams is shown in Figure 2.

FIG. 2.
Search rule between audio and video streams

Based on the forward-search rule, a finite state machine (FSM) is designed to detect highlights. Based on observation, we set two rules in the FSM for soccer; they can be modified to suit other kinds of sports video.

Rule 1: a replay shot should be no longer than 60 seconds.
Rule 2: the interval between an excited commentator speech and the replay shot should be no longer than 30 seconds.

Transition conditions: A: Rule 1 not satisfied; B: audio cue found; C: audio cue not found and Rule 2 satisfied; D: audio cue not found and Rule 2 not satisfied.

FIG. 3. Finite state machine for highlight detection

The FSM's states and transition conditions are shown in Figure 3. The FSM first locates replay shots as in Section 4. If a detected segment is no longer than 60 seconds, it is regarded as a replay shot; otherwise it is regarded as a false replay detection. Then a forward search is carried out in the audio stream from the replay moment. If an audio cue (an excited commentator speech segment) is found and satisfies Rule 2, a highlight is declared. The forward search continues for further audio cues, which are included in the highlight segment, until Rule 2 is no longer satisfied.

6. Experimental Results

We conducted experiments on 5 real-world soccer games (3 FIFA World Cup 2006 games and 2 UEFA European Championship 2004 games). The audio was sampled at 44.1 kHz with a 705 kbps bit rate and 16 bits per sample. For audio cue extraction, 10 minutes of audio data from 3 games were used for feature selection and classifier training; the rest was used for testing. The excited commentator speech frames in the original audio track were labeled manually as the ground truth. To further evaluate the proposed Adaboost classifier, we also investigated SVM classifiers using every single feature and several of their combinations.
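The fusion logic of Section 5 (Rules 1 and 2 plus the forward search) can be sketched as below, assuming per-second audio labels and replay segments given in seconds. The paper also admits cues inside the replay shot; for brevity this sketch searches only backwards from the replay start, so it is a simplification, not the authors' implementation.

```python
MAX_REPLAY_SEC = 60  # Rule 1: longer segments are false replay detections
MAX_GAP_SEC = 30     # Rule 2: max gap between audio cue and replay shot

def detect_highlights(replay_segments, audio_labels):
    """replay_segments: list of (start, end) times in seconds.
    audio_labels: per-second list, 1 = excited commentator speech, else 0.
    Returns detected highlight segments as (start, end) pairs."""
    highlights = []
    for start, end in replay_segments:
        if end - start > MAX_REPLAY_SEC:   # Rule 1 violated -> false replay
            continue
        # Search backwards from the replay start for audio cues, extending
        # over earlier cues until Rule 2's 30 s gap is exceeded.
        t, gap, cue_start = start, 0, None
        while t > 0 and gap <= MAX_GAP_SEC:
            t -= 1
            if audio_labels[t]:
                cue_start, gap = t, 0
            else:
                gap += 1
        if cue_start is not None:
            highlights.append((cue_start, end))
    return highlights
```

For example, with excited speech at seconds 10-14 and a replay at seconds 20-40, the detected highlight spans from the earliest cue to the end of the replay.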
Figure 4 shows the error rates for excited commentator speech detection. In this figure, the last bin corresponds to the Adaboost classifier, while the others are SVM classifiers using the corresponding features. Clearly, not all features are effective for classification: the SVM using all features yields a high error rate, while Adaboost and the SVM using MFCCs and STE both perform well. It is encouraging that our approach is comparable to the SVM using the best features, which were evaluated and selected manually. To strengthen this conclusion, we also changed the classification task to whistle detection, i.e., detecting whistles among all other sounds (results in Figure 5). The Adaboost classifier still achieves the second-lowest error rate, slightly higher than the SVM using the STE and ZCR features.

FIG. 4. Excited commentator speech detection
FIG. 5. Whistle detection

For replay shot detection, three games were used to test the performance of the flying-logo-based approach; the results are listed in Table 2. Some missed detections are due to the logos themselves being absent in the original videos.

Table 2. Replay shot detection
Game              Precision %    Recall %
Portugal_Mexico
France_Spain
Czech_Greece

Experiments were also conducted on the audio and visual modalities separately: segments containing excited commentator speech (audio) or replay shots (visual) were regarded as highlights. Compared against the ground truth, the results are listed in Table 4 and Table 5 respectively. Although recall is good, detection precision is unfortunately low for each single modality. This is because some technical but unexciting events (e.g., foul, offside) also have replay shots, and audio cue detection is not always reliable due to the environmental noise in soccer games; to guarantee recall we have to sacrifice precision. The integration of audio and visual analysis can therefore effectively reduce false positives and achieve satisfactory results.

For audio-visual fusion, the audio cue extraction and replay shot detection results were integrated for the final highlight detection. A human subject (not involved in our project) was asked to watch the 5 real-world soccer games and select the highlights as the ground truth. Table 3 lists the results of the multimodal highlight detection approach: 99 of 114 highlights were successfully detected, and 15 were missed. Of these 15 segments, 6 were caused by missed replay shot detection; in the other 9, the commentator speech was not very excited. The relatively low result on the 5th game is due to the low quality of the audio track in the original video.

Table 3. Highlight detection by multimodal analysis
No.  Game                True  False  Miss  Precision  Recall
1    France_Spain                                      95.7%
2    Germany_Costa Rica                                84.0%
3    Portugal_Mexico                                   90.0%
4    Portugal_England                                  88.5%
5    Czech_Greece                                      75.0%

Table 4. Highlight detection by audio modality
No.  Game                True  False  Miss  Precision  Recall
1    France_Spain                                      95.7%
2    Germany_Costa Rica                                84.0%
3    Portugal_Mexico                                   90.0%
4    Portugal_England                                  92.3%
5    Czech_Greece                                      75.0%

Table 5. Highlight detection by visual modality
No.  Game                True  False  Miss  Precision  Recall
1    France_Spain                                      95.7%
2    Germany_Costa Rica                                84.0%
3    Portugal_Mexico                                   95.0%
4    Portugal_England                                  88.5%
5    Czech_Greece                                      80.0%
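The precision and recall columns in Tables 2-5 presumably follow the standard definitions computed from the true-positive, false-positive, and miss counts; the paper does not state them explicitly, so the definitions below are an assumption.

```python
def precision_recall(true_pos, false_pos, missed):
    """Assumed standard definitions behind Tables 2-5:
    precision = TP / (TP + FP), recall = TP / (TP + Miss)."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + missed)
    return precision, recall
```

For instance, 9 correct detections with 1 false alarm and 1 missed highlight give precision 0.9 and recall 0.9.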
7. Conclusion

In this paper, a multimodal highlight detection scheme is proposed for broadcast soccer games. Adaboost learning is employed to select discriminating audio features for excited commentator speech classification. To limit computational complexity, only replay shot detection is used in the visual analysis. A finite state machine fuses the audio and visual analyses for highlight detection. The experimental results show that integrating audio and visual analysis is effective for highlight detection. Our next step is to add other effective visual cues, such as object-based features, to further enhance detection.

8. Acknowledgement

The research is supported by the 863 Program of China (Grant Nos. 2006AA01Z315 and 2006AA01Z117), the NNSF of China, and the NSF of Beijing.

References

[1] Y. H. Gong, L. T. Sin, C. H. Chuan, H. J. Zhang, and M. Sakauchi, "Automatic parsing of TV soccer programs," in Proc. of International Conference on Multimedia Computing and Systems, 1995.
[2] Y. P. Tan, D. D. Saur, S. R. Kulkarni, and P. J. Ramadge, "Rapid estimation of camera motion from compressed video with application to video annotation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 10, 2000.
[3] P. Xu, L. Xie, S. F. Chang, A. Divakaran, A. Vetro, and H. Sun, "Algorithms and systems for segmentation and structure analysis in soccer video," in Proc. of International Conference on Multimedia and Expo, Tokyo, Japan, 2001.
[4] V. Tovinkere and R. J. Qian, "Detecting semantic events in soccer games: Toward a complete solution," in Proc. of International Conference on Multimedia and Expo, Tokyo, Japan, 2001.
[5] A. Ekin and M. Tekalp, "Automatic soccer video analysis and summarization," in Proc. of IS&T/SPIE 2003, Santa Clara, CA, 2003.
[6] Y. Rui, A. Gupta, and A. Acero, "Automatically extracting highlights for TV baseball programs," in Proc. of ACM Multimedia, Los Angeles, CA, 2000.
[7] M. Xu, N. C. Maddage, C. S. Xu, M. Kankanhalli, and Q. Tian, "Creating audio keywords for event detection in soccer video," in Proc. of International Conference on Multimedia and Expo, 2003.
[8] M. Han, W. Hua, W. Xu, and Y. H. Gong, "An integrated baseball digest system using maximum entropy method," in Proc. of ACM Multimedia, 2002.
[9] S. Nepal, U. Srinivasan, and G. Reynolds, "Automatic detection of goal segments in basketball videos," in Proc. of ACM Multimedia, Ottawa, Canada, 2001.
[10] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory (EuroCOLT '95), Springer-Verlag, 1995, pp. 23-37.
[11] X. F. Tong, H. Q. Lu, Q. S. Liu, and H. L. Jin, "Replay detection in broadcasting sports video," in Proc. of ICIG, 2004.
More informationImage retrieval based on bag of images
University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2009 Image retrieval based on bag of images Jun Zhang University of Wollongong
More informationA Bagging Method using Decision Trees in the Role of Base Classifiers
A Bagging Method using Decision Trees in the Role of Base Classifiers Kristína Machová 1, František Barčák 2, Peter Bednár 3 1 Department of Cybernetics and Artificial Intelligence, Technical University,
More informationGraph Matching Iris Image Blocks with Local Binary Pattern
Graph Matching Iris Image Blocs with Local Binary Pattern Zhenan Sun, Tieniu Tan, and Xianchao Qiu Center for Biometrics and Security Research, National Laboratory of Pattern Recognition, Institute of
More informationMulti-level analysis of sports video sequences
Multi-level analysis of sports video sequences Jungong Han a, Dirk Farin a and Peter H. N. de With a,b a University of Technology Eindhoven, 5600MB Eindhoven, The Netherlands b LogicaCMG, RTSE, PO Box
More informationApproach to Metadata Production and Application Technology Research
Approach to Metadata Production and Application Technology Research In the areas of broadcasting based on home servers and content retrieval, the importance of segment metadata, which is attached in segment
More informationRobust color segmentation algorithms in illumination variation conditions
286 CHINESE OPTICS LETTERS / Vol. 8, No. / March 10, 2010 Robust color segmentation algorithms in illumination variation conditions Jinhui Lan ( ) and Kai Shen ( Department of Measurement and Control Technologies,
More informationActive learning for visual object recognition
Active learning for visual object recognition Written by Yotam Abramson and Yoav Freund Presented by Ben Laxton Outline Motivation and procedure How this works: adaboost and feature details Why this works:
More informationSemantic Event Detection and Classification in Cricket Video Sequence
Sixth Indian Conference on Computer Vision, Graphics & Image Processing Semantic Event Detection and Classification in Cricket Video Sequence M. H. Kolekar, K. Palaniappan Department of Computer Science,
More informationAn Adaptive Threshold LBP Algorithm for Face Recognition
An Adaptive Threshold LBP Algorithm for Face Recognition Xiaoping Jiang 1, Chuyu Guo 1,*, Hua Zhang 1, and Chenghua Li 1 1 College of Electronics and Information Engineering, Hubei Key Laboratory of Intelligent
More informationTraffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers
Traffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers A. Salhi, B. Minaoui, M. Fakir, H. Chakib, H. Grimech Faculty of science and Technology Sultan Moulay Slimane
More informationA Statistical-driven Approach for Automatic Classification of Events in AFL Video Highlights
A Statistical-driven Approach for Automatic Classification of Events in AFL Video Highlights Dian Tjondronegoro 1 2 3 Yi-Ping Phoebe Chen 1 Binh Pham 3 School of Information Technology, Deakin University
More informationFace Recognition Using Ordinal Features
Face Recognition Using Ordinal Features ShengCai Liao, Zhen Lei, XiangXin Zhu, ZheNan Sun, Stan Z. Li, and Tieniu Tan Center for Biometrics and Security Research & National Laboratory of Pattern Recognition,
More informationFeature-level Fusion for Effective Palmprint Authentication
Feature-level Fusion for Effective Palmprint Authentication Adams Wai-Kin Kong 1, 2 and David Zhang 1 1 Biometric Research Center, Department of Computing The Hong Kong Polytechnic University, Kowloon,
More informationNaïve Bayes for text classification
Road Map Basic concepts Decision tree induction Evaluation of classifiers Rule induction Classification using association rules Naïve Bayesian classification Naïve Bayes for text classification Support
More informationQUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING. Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose
QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose Department of Electrical and Computer Engineering University of California,
More informationVideo annotation based on adaptive annular spatial partition scheme
Video annotation based on adaptive annular spatial partition scheme Guiguang Ding a), Lu Zhang, and Xiaoxu Li Key Laboratory for Information System Security, Ministry of Education, Tsinghua National Laboratory
More informationBased on Multi-Modal Violent Movies Detection in Video Sharing Sites
Based on Multi-Modal Violent Movies Detection in Video Sharing Sites Xingyu Zou 1, Ou Wu 2, Qishen Wang 2, Weiming Hu 2, Jinfeng Yang 1 1 College of aviation automation, Civil Aviation University of China,
More informationAIIA shot boundary detection at TRECVID 2006
AIIA shot boundary detection at TRECVID 6 Z. Černeková, N. Nikolaidis and I. Pitas Artificial Intelligence and Information Analysis Laboratory Department of Informatics Aristotle University of Thessaloniki
More informationCOSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor
COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality
More informationAudio-Based Action Scene Classification Using HMM-SVM Algorithm
Audio-Based Action Scene Classification Using HMM-SVM Algorithm Khin Myo Chit, K Zin Lin Abstract Nowadays, there are many kind of video such as educational movies, multimedia movies, action movies and
More informationPerformance Degradation Assessment and Fault Diagnosis of Bearing Based on EMD and PCA-SOM
Performance Degradation Assessment and Fault Diagnosis of Bearing Based on EMD and PCA-SOM Lu Chen and Yuan Hang PERFORMANCE DEGRADATION ASSESSMENT AND FAULT DIAGNOSIS OF BEARING BASED ON EMD AND PCA-SOM.
More informationReal-Time Detection of Sport in MPEG-2 Sequences using High-Level AV-Descriptors and SVM
Real-Time Detection of Sport in MPEG-2 Sequences using High-Level AV-Descriptors and SVM Ronald Glasberg 1, Sebastian Schmiedee 2, Hüseyin Oguz 3, Pascal Kelm 4 and Thomas Siora 5 Communication Systems
More informationImage Classification Using Wavelet Coefficients in Low-pass Bands
Proceedings of International Joint Conference on Neural Networks, Orlando, Florida, USA, August -7, 007 Image Classification Using Wavelet Coefficients in Low-pass Bands Weibao Zou, Member, IEEE, and Yan
More informationLearning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009
Learning and Inferring Depth from Monocular Images Jiyan Pan April 1, 2009 Traditional ways of inferring depth Binocular disparity Structure from motion Defocus Given a single monocular image, how to infer
More informationPEOPLE IN SEATS COUNTING VIA SEAT DETECTION FOR MEETING SURVEILLANCE
PEOPLE IN SEATS COUNTING VIA SEAT DETECTION FOR MEETING SURVEILLANCE Hongyu Liang, Jinchen Wu, and Kaiqi Huang National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Science
More informationCV of Qixiang Ye. University of Chinese Academy of Sciences
2012-12-12 University of Chinese Academy of Sciences Qixiang Ye received B.S. and M.S. degrees in mechanical & electronic engineering from Harbin Institute of Technology (HIT) in 1999 and 2001 respectively,
More informationIris Recognition for Eyelash Detection Using Gabor Filter
Iris Recognition for Eyelash Detection Using Gabor Filter Rupesh Mude 1, Meenakshi R Patel 2 Computer Science and Engineering Rungta College of Engineering and Technology, Bhilai Abstract :- Iris recognition
More informationTEVI: Text Extraction for Video Indexing
TEVI: Text Extraction for Video Indexing Hichem KARRAY, Mohamed SALAH, Adel M. ALIMI REGIM: Research Group on Intelligent Machines, EIS, University of Sfax, Tunisia hichem.karray@ieee.org mohamed_salah@laposte.net
More informationEquation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation.
Equation to LaTeX Abhinav Rastogi, Sevy Harris {arastogi,sharris5}@stanford.edu I. Introduction Copying equations from a pdf file to a LaTeX document can be time consuming because there is no easy way
More information2-2-2, Hikaridai, Seika-cho, Soraku-gun, Kyoto , Japan 2 Graduate School of Information Science, Nara Institute of Science and Technology
ISCA Archive STREAM WEIGHT OPTIMIZATION OF SPEECH AND LIP IMAGE SEQUENCE FOR AUDIO-VISUAL SPEECH RECOGNITION Satoshi Nakamura 1 Hidetoshi Ito 2 Kiyohiro Shikano 2 1 ATR Spoken Language Translation Research
More informationText-Independent Speaker Identification
December 8, 1999 Text-Independent Speaker Identification Til T. Phan and Thomas Soong 1.0 Introduction 1.1 Motivation The problem of speaker identification is an area with many different applications.
More informationA Semi-Automatic 2D-to-3D Video Conversion with Adaptive Key-Frame Selection
A Semi-Automatic 2D-to-3D Video Conversion with Adaptive Key-Frame Selection Kuanyu Ju and Hongkai Xiong Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China ABSTRACT To
More informationAudio-visual interaction in sparse representation features for noise robust audio-visual speech recognition
ISCA Archive http://www.isca-speech.org/archive Auditory-Visual Speech Processing (AVSP) 2013 Annecy, France August 29 - September 1, 2013 Audio-visual interaction in sparse representation features for
More informationDeterministic Approach to Content Structure Analysis of Tennis Video
Deterministic Approach to Content Structure Analysis of Tennis Video Viachaslau Parshyn, Liming Chen A Research Report, Lab. LIRIS, Ecole Centrale de Lyon LYON 2006 Abstract. An approach to automatic tennis
More informationData Hiding in Video
Data Hiding in Video J. J. Chae and B. S. Manjunath Department of Electrical and Computer Engineering University of California, Santa Barbara, CA 9316-956 Email: chaejj, manj@iplab.ece.ucsb.edu Abstract
More informationA robust method for automatic player detection in sport videos
A robust method for automatic player detection in sport videos A. Lehuger 1 S. Duffner 1 C. Garcia 1 1 Orange Labs 4, rue du clos courtel, 35512 Cesson-Sévigné {antoine.lehuger, stefan.duffner, christophe.garcia}@orange-ftgroup.com
More informationCONTENT ADAPTIVE SCREEN IMAGE SCALING
CONTENT ADAPTIVE SCREEN IMAGE SCALING Yao Zhai (*), Qifei Wang, Yan Lu, Shipeng Li University of Science and Technology of China, Hefei, Anhui, 37, China Microsoft Research, Beijing, 8, China ABSTRACT
More informationVideo shot segmentation using late fusion technique
Video shot segmentation using late fusion technique by C. Krishna Mohan, N. Dhananjaya, B.Yegnanarayana in Proc. Seventh International Conference on Machine Learning and Applications, 2008, San Diego,
More informationFurther Studies of a FFT-Based Auditory Spectrum with Application in Audio Classification
ICSP Proceedings Further Studies of a FFT-Based Auditory with Application in Audio Classification Wei Chu and Benoît Champagne Department of Electrical and Computer Engineering McGill University, Montréal,
More informationVideo De-interlacing with Scene Change Detection Based on 3D Wavelet Transform
Video De-interlacing with Scene Change Detection Based on 3D Wavelet Transform M. Nancy Regina 1, S. Caroline 2 PG Scholar, ECE, St. Xavier s Catholic College of Engineering, Nagercoil, India 1 Assistant
More informationAffective Music Video Content Retrieval Features Based on Songs
Affective Music Video Content Retrieval Features Based on Songs R.Hemalatha Department of Computer Science and Engineering, Mahendra Institute of Technology, Mahendhirapuri, Mallasamudram West, Tiruchengode,
More informationSubject-Oriented Image Classification based on Face Detection and Recognition
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
More informationSearching Video Collections:Part I
Searching Video Collections:Part I Introduction to Multimedia Information Retrieval Multimedia Representation Visual Features (Still Images and Image Sequences) Color Texture Shape Edges Objects, Motion
More informationReal-time Monitoring System for TV Commercials Using Video Features
Real-time Monitoring System for TV Commercials Using Video Features Sung Hwan Lee, Won Young Yoo, and Young-Suk Yoon Electronics and Telecommunications Research Institute (ETRI), 11 Gajeong-dong, Yuseong-gu,
More informationAn Automated Refereeing and Analysis Tool for the Four-Legged League
An Automated Refereeing and Analysis Tool for the Four-Legged League Javier Ruiz-del-Solar, Patricio Loncomilla, and Paul Vallejos Department of Electrical Engineering, Universidad de Chile Abstract. The
More informationHybrid Biometric Person Authentication Using Face and Voice Features
Paper presented in the Third International Conference, Audio- and Video-Based Biometric Person Authentication AVBPA 2001, Halmstad, Sweden, proceedings pages 348-353, June 2001. Hybrid Biometric Person
More informationSaliency Detection for Videos Using 3D FFT Local Spectra
Saliency Detection for Videos Using 3D FFT Local Spectra Zhiling Long and Ghassan AlRegib School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA ABSTRACT
More informationSegmentation of Images
Segmentation of Images SEGMENTATION If an image has been preprocessed appropriately to remove noise and artifacts, segmentation is often the key step in interpreting the image. Image segmentation is a
More informationLatent Variable Models for Structured Prediction and Content-Based Retrieval
Latent Variable Models for Structured Prediction and Content-Based Retrieval Ariadna Quattoni Universitat Politècnica de Catalunya Joint work with Borja Balle, Xavier Carreras, Adrià Recasens, Antonio
More informationMulti-Camera Calibration, Object Tracking and Query Generation
MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Multi-Camera Calibration, Object Tracking and Query Generation Porikli, F.; Divakaran, A. TR2003-100 August 2003 Abstract An automatic object
More information8.5 Application Examples
8.5 Application Examples 8.5.1 Genre Recognition Goal Assign a genre to a given video, e.g., movie, newscast, commercial, music clip, etc.) Technology Combine many parameters of the physical level to compute
More informationCost-sensitive Boosting for Concept Drift
Cost-sensitive Boosting for Concept Drift Ashok Venkatesan, Narayanan C. Krishnan, Sethuraman Panchanathan Center for Cognitive Ubiquitous Computing, School of Computing, Informatics and Decision Systems
More informationAdaptive Doppler centroid estimation algorithm of airborne SAR
Adaptive Doppler centroid estimation algorithm of airborne SAR Jian Yang 1,2a), Chang Liu 1, and Yanfei Wang 1 1 Institute of Electronics, Chinese Academy of Sciences 19 North Sihuan Road, Haidian, Beijing
More informationPrinciples of Audio Coding
Principles of Audio Coding Topics today Introduction VOCODERS Psychoacoustics Equal-Loudness Curve Frequency Masking Temporal Masking (CSIT 410) 2 Introduction Speech compression algorithm focuses on exploiting
More informationAudio-Visual Content Indexing, Filtering, and Adaptation
Audio-Visual Content Indexing, Filtering, and Adaptation Shih-Fu Chang Digital Video and Multimedia Group ADVENT University-Industry Consortium Columbia University 10/12/2001 http://www.ee.columbia.edu/dvmm
More informationReal-Time Position Estimation and Tracking of a Basketball
Real-Time Position Estimation and Tracking of a Basketball Bodhisattwa Chakraborty Digital Image and Speech Processing Lab National Institute of Technology Rourkela Odisha, India 769008 Email: bodhisattwa.chakraborty@gmail.com
More informationAn Approach to Detect Text and Caption in Video
An Approach to Detect Text and Caption in Video Miss Megha Khokhra 1 M.E Student Electronics and Communication Department, Kalol Institute of Technology, Gujarat, India ABSTRACT The video image spitted
More informationAudio-Visual Content Indexing, Filtering, and Adaptation
Audio-Visual Content Indexing, Filtering, and Adaptation Shih-Fu Chang Digital Video and Multimedia Group ADVENT University-Industry Consortium Columbia University 10/12/2001 http://www.ee.columbia.edu/dvmm
More informationCORRELATION BASED CAR NUMBER PLATE EXTRACTION SYSTEM
CORRELATION BASED CAR NUMBER PLATE EXTRACTION SYSTEM 1 PHYO THET KHIN, 2 LAI LAI WIN KYI 1,2 Department of Information Technology, Mandalay Technological University The Republic of the Union of Myanmar
More informationSemantic Video Indexing
Semantic Video Indexing T-61.6030 Multimedia Retrieval Stevan Keraudy stevan.keraudy@tkk.fi Helsinki University of Technology March 14, 2008 What is it? Query by keyword or tag is common Semantic Video
More informationVideo Editing Based on Situation Awareness from Voice Information and Face Emotion
18 Video Editing Based on Situation Awareness from Voice Information and Face Emotion Tetsuya Takiguchi, Jun Adachi and Yasuo Ariki Kobe University Japan 1. Introduction Video camera systems are becoming
More information