Generation of Sports Highlights Using a Combination of Supervised & Unsupervised Learning in Audio Domain


MITSUBISHI ELECTRIC RESEARCH LABORATORIES
http://www.merl.com

Generation of Sports Highlights Using a Combination of Supervised & Unsupervised Learning in Audio Domain

Radhakrishan, R.; Xiong, Z.; Divakaran, A.; Ishikawa, Y.

TR2003-144    February 2004

Abstract

In our past work we have used supervised audio classification to develop a common audio-based platform for highlight extraction that works across three different sports. We then use a heuristic to post-process the classification results to identify interesting events and also to adjust the summary length. In this paper, we propose a combination of unsupervised and supervised learning approaches to replace the heuristic. The proposed unsupervised framework mines the semantic audio-visual labels so as to detect interesting events. We then use a Hidden Markov Model based approach to control the length of the summary. Our experimental results show that the proposed techniques are promising.

This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories, Inc.; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All rights reserved.

Copyright © Mitsubishi Electric Research Laboratories, Inc., 2004
201 Broadway, Cambridge, Massachusetts 02139


Generation of Sports Highlights Using a Combination of Supervised & Unsupervised Learning in Audio Domain

Regunathan Radhakrishan 1, Ziyou Xiong 1, Ajay Divakaran 1, Yasushi Ishikawa 2
1 Mitsubishi Electric Research Labs, Cambridge, MA, USA. {regu, zxiong, ajayd}@merl.com
2 Mitsubishi Electric Information Technology Research Center, Ofuna, Kamakura, Japan. yasushi@isl.melco.jp

Abstract

In our past work we have used supervised audio classification to develop a common audio-based platform for highlight extraction that works across three different sports. We then use a heuristic to post-process the classification results to identify interesting events and also to adjust the summary length. In this paper, we propose a combination of unsupervised and supervised learning approaches to replace the heuristic. The proposed unsupervised framework mines the semantic audio-visual labels so as to detect interesting events. We then use a Hidden Markov Model based approach to control the length of the summary. Our experimental results show that the proposed techniques are promising.

1. Introduction

Past work on automatic extraction of highlights of sports events in general, and soccer video in particular, has relied on color feature extraction, audio feature extraction and camera motion extraction (see [1] [2] for example). Color and texture features have been employed to find the video segments in which the background is mostly green, i.e., mostly grass. In [9], Ekin et al. propose an approach that detects dominant color regions and some other low-level video features to detect goals, the referee and penalty boxes. In [8], Pan et al. propose the detection of slow-motion segments as a way to detect interesting events in sports video. Audio features have been used to detect interesting events by looking for an increase in the volume or pitch of the commentator's voice, or even by looking for the word "goal" in the commentary.
In [10], Xu et al. propose an audio classification framework based on SVMs (Support Vector Machines) to detect whistles, the commentator's excited speech, etc. The proposed method relies on rules based on these low-level audio events to extract highlights. While most of the above approaches use supervised learning for event detection, Xie et al. propose a completely unsupervised approach to discover play and break segments in soccer video in [7]. Finally, combinations of the various features have been used to obtain refined results.

While most of the aforementioned techniques rely on video domain analysis, they are more computationally intensive than those that rely on audio domain analysis. Since our motivation is computational simplicity and easy incorporation into consumer system hardware, we focus on feature extraction in the compressed domain. In the compressed domain, however, the motion vectors are noisy, so high accuracy is difficult to achieve. In our previous work [3] we have shown that it is possible to rapidly generate highlights of various sports using gross motion descriptors such as the MPEG-7 motion activity descriptor. Such descriptors work well with compressed-domain motion vectors, since they are coarse by definition. However, we found that with soccer video the number of false positives was unacceptably high. In [3] we also eliminated false positives by first detecting sudden surges in audio volume, or peaks, and then retaining only the motion activity based highlights that correspond to an audio peak. While this procedure works reasonably well, it does not use a reliable audio feature. Thus, we developed an audio-classification based approach in which we explicitly identify applause/cheering segments, and use those to identify highlights. In [4] we post-process the classification results to identify interesting events.
Specifically, we use the length of contiguous applause/cheering audio segments to rank the detected highlight events, thereby enabling modulation of the summary length. In this paper, we propose a combination of unsupervised and supervised learning in place of the post-processing step in [4]. The unsupervised approach relies on the departure from stationarity of the audio label histogram to detect candidate interesting events. We then further test the validity of these events using a pre-trained highlight Hidden Markov Model. In Section 2, we describe our motivation for the chosen audio classification based framework. In Section 3, we give details on the proposed techniques. In Section 4, we present the experimental results, and we conclude in Section 5.

2. Motivation for Audio Processing

The system constraints of our target platform rule out having a completely distinct algorithm for each sport and motivate us to investigate a common unified highlights framework for our three sports of interest: golf, soccer and baseball. Since audio lends itself better to extraction of content semantics, we start with audio classification. In [3]

we employ a general sound recognition framework based on Hidden Markov Models (HMM) using Mel Frequency Cepstral Coefficients (MFCC) to classify and recognize the following audio signals: applause, cheering, music, speech and speech with music. The former two are used for highlights extraction due to their strong correlation with highlights, and the latter three are used to filter out the uninteresting segments. This kind of processing, based on the low-bandwidth audio signal, lends itself to a common platform for the analysis of different sports video, followed by more sophisticated game-specific post-processing.

3. Proposed Techniques

3.1 Audio Classification Framework

In this paper, we follow the approach in [4], using Gaussian Mixture Models (GMM) to model classes of sound. Silence segments are declared if the energy of the segment is no more than 10% of the average energy of all the segments in the whole game. We then classify every second of non-silence audio into the following 7 classes: applause, ball hits, female speech, male speech, music, speech-music and noise (audience noise including cheering).

3.2 Unsupervised Label Mining Framework

Once we have obtained semantic labels for the audio from the GMM, we pose the problem of interesting event detection as a multimedia mining problem across these labels. Figure 1 illustrates the proposed framework for mining audio labels. The basic assumption in this approach is that interesting events are rare and have different audio characteristics in time when compared to the usual characteristics in a given context. A context is defined by a time window of length W_L. In order to quantify what is considered usual in a context, we compute the distribution of labels within it. Then, for every smaller time window W_S, we compute the same distribution.
We compare the local statistic (computed within W_S) with the global statistic (computed within W_L) using either an information-theoretic measure, the relative entropy, or the histogram distance metric proposed in [6]. One would expect a large distance value for a W_S whose distribution differs from what is usual within W_L. By moving W_L one W_S at a time, we compute (W_L / W_S) distance values and find the maximum of this set of values. We associate this maximum value, M_WS, with the small window W_S. Rare events then correspond to peaks in the curve of all M_WS values. For instance, in a golf game the onset of audience applause would cause the local distribution of semantic audio labels to peak around applause, whereas the global distribution in the current context would peak around speech. Therefore, a large difference between these two distributions indicates a deviation from stationarity, thereby signaling the occurrence of something unusual in that context.

The peak detection in the curve of all M_WS values is performed adaptively using a sliding window, similar to the shot detection in [5]. When the first local maximum value is at the center of the window, we check whether its value is greater than the second local maximum value by a factor P_th. This helps eliminate peaks that are a result of random fluctuations.

3.3 Highlights Validation by a Trained HMM

Unusual events detected by the above algorithm only signal a change in the characteristics of the semantic audio labels. For instance, in a golf game, the onset of commercials would also be signaled as an unusual event. Therefore, in order to capture only semantic events of interest, we use a trained HMM of highlights to validate the candidate unusual events signaled by the previous step. This step is analogous to human participation in data mining to validate the unusual patterns output by the data mining algorithm.
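The validation step can be sketched as scoring each candidate's label sequence with the forward algorithm of a discrete-observation HMM. The sketch below is illustrative only: the two-state model, its transition and emission probabilities, and the label numbering are assumptions for the example, not the trained model used in the experiments.

```python
import numpy as np

def sequence_log_likelihood(obs, start_p, trans_p, emit_p):
    """Forward algorithm (log domain) for a discrete-observation HMM.

    obs     : sequence of integer audio-class labels
    start_p : (n_states,) initial state probabilities
    trans_p : (n_states, n_states) transition matrix
    emit_p  : (n_states, n_labels) emission matrix
    """
    log_alpha = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for o in obs[1:]:
        m = log_alpha.max()
        # alpha_t(j) = sum_i alpha_{t-1}(i) * trans[i, j] * emit[j, o]
        log_alpha = np.log(np.exp(log_alpha - m) @ trans_p) + m + np.log(emit_p[:, o])
    m = log_alpha.max()
    return float(m + np.log(np.exp(log_alpha - m).sum()))

# Hypothetical 2-state model: state 0 loosely plays the role of a
# "cheering" state with a high self-transition probability.
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
emit = np.array([[0.8, 0.1, 0.1],    # state 0 mostly emits label 0 (cheering)
                 [0.1, 0.45, 0.45]])

contiguous_cheering = [0, 0, 0, 0, 0]
mixed_labels = [1, 2, 1, 2, 0]
ll_cheer = sequence_log_likelihood(contiguous_cheering, start, trans, emit)
ll_mixed = sequence_log_likelihood(mixed_labels, start, trans, emit)
print(ll_cheer > ll_mixed)  # contiguous cheering scores higher
```

A candidate segment is then kept as a highlight only if its log-likelihood clears a threshold; the same score can be used to rank the surviving candidates.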
Another advantage of this scheme is that we can use the likelihood values output by the HMM to rank the highlights. This avoids the simple heuristic in [4] that uses the duration of applause/cheering segments for ranking and summary length modulation. Figure 2 illustrates the combination of unsupervised and supervised learning approaches for highlight extraction.

4. Experimental Results

Training data was collected for each of the low-level sound classes from three and a half hours of MPEG video for three different sports, namely baseball, soccer and golf. Twelve-dimensional MFCC features, extracted from every 30 ms frame, were used to train a 10-component GMM for each sound class. Once the models are trained, the input audio is divided into 1 s segments and MFCC features are extracted from each frame in the segment. Each frame is then classified as belonging to one of the sound classes using the maximum likelihood criterion, and a majority vote over all the frames decides the sound class of the 1 s segment. After assigning audio labels to the 1 s segments, unsupervised label mining was performed to detect unusual events as shown in Figure 1. The large window (W_L) was chosen to be 4 minutes and the small window (W_S) to be 0.5 minutes. The Kullback-Leibler distance was used to compare the distributions of audio labels within these windows. The sliding-window peak detection algorithm was applied to the recorded distance values (M_WS) to detect peaks. Figure 3 shows the peak detection results for two soccer games, each of duration two hours.
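Under the assumption that the classifier emits one integer label per second, the label mining step can be sketched as follows; window lengths are in seconds, and the function names are our own.

```python
import numpy as np

def label_histogram(labels, n_classes):
    """Normalized histogram of integer audio-class labels."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return counts / counts.sum()

def kl_divergence(p, q, eps=1e-10):
    """Kullback-Leibler distance D(p || q), smoothed to avoid log(0)."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def mining_curve(labels, n_classes, w_l, w_s):
    """M_WS for each small window: the maximum distance between its local
    label distribution and the global distribution of any context window
    W_L that contains it (W_L slides one W_S step at a time)."""
    n = len(labels) // w_s
    m = np.zeros(n)
    for i in range(n):
        local = label_histogram(labels[i * w_s:(i + 1) * w_s], n_classes)
        best = 0.0
        for start in range(max(0, i - w_l // w_s + 1), i + 1):
            ctx = labels[start * w_s:start * w_s + w_l]
            if len(ctx) < w_l:          # context runs past the stream end
                continue
            best = max(best, kl_divergence(local, label_histogram(ctx, n_classes)))
        m[i] = best
    return m

# Toy example: 200 seconds of labels, mostly "speech" (0) with a 10 s burst
# of "applause" (1); the burst should produce the largest M_WS value.
labels = [0] * 100 + [1] * 10 + [0] * 90
m = mining_curve(labels, n_classes=2, w_l=40, w_s=10)
print(int(np.argmax(m)))  # index of the window covering the burst
```

In the experiments above, w_s would be 30 (0.5 minutes) and w_l would be 240 (4 minutes); the adaptive peak detection of [5] would then be run on the resulting curve.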

After identifying candidate interesting events from the unsupervised label mining step, we use a Hidden Markov Model (HMM) to validate and rank the interesting events. The highlight HMM was trained on the labels extracted from DVD data containing half an hour of professionally edited highlights of the 2002 soccer World Cup. There were 81 interesting clips in this training data, mainly attempts at goal and goals. The number of states for the HMM was empirically chosen to be 4. After training, it was observed that one of the states corresponded to the cheering label and had a high self-transition probability. This implies that the HMM was indeed modeling the occurrence of contiguous cheering segments. This kind of modeling is better than simply using the duration of cheering segments as in [4]. To illustrate this, we include the corresponding precision-recall figures for the same soccer games in Table 2. The threshold-based scheme in [4] for detecting highlights is based on a fixed notion of highlights and is less flexible. Here, we let the HMM learn from the label sequences of the training data what a highlight is. Such a data-driven approach does not have a fixed notion of highlights and may capture other audio cues apart from cheering/applause. It also opens the possibility of information fusion with other modalities. Table 1 shows the performance in the two soccer games.

Figure 3: Unsupervised label mining results on soccer games.
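As a sanity check on how the Precision and Recall columns in Tables 1 and 2 are derived from the [A], [B], [C] counts: precision = [B] / ([B] + [C]) and recall = [B] / [A]. For example, for the first Game 1 row of Table 1:

```python
def precision_recall(true_total, true_detected, false_alarms):
    """[A] = true highlight segments, [B] = true detections, [C] = false alarms."""
    precision = true_detected / (true_detected + false_alarms)
    recall = true_detected / true_total
    return precision, recall

# First Game 1 row of Table 1: [A] = 34, [B] = 22, [C] = 32
p, r = precision_recall(34, 22, 32)
print(f"precision = {p:.2%}, recall = {r:.2%}")  # precision = 40.74%, recall = 64.71%
```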
          [A]   [B]   [C]   Precision   Recall
Game 1     34    22    32    40.74%      64.71%
Game 1     34     6     8    42.86%      17.65%
Game 1     34    16    24    40%         47.06%
Game 1     34    11    20    35.48%      32.35%
Game 2     45    25    35    41.67%      55.55%
Game 2     45     5     0    100%        11.11%
Game 2     45    20    14    58.82%      44.44%
Game 2     45     9     4    69.23%      20%

Table 1: Soccer Game 1: World Cup Brazil vs. Germany; Soccer Game 2: World Cup Brazil vs. Turkey. [A]: number of true highlight segments; [B]: number of true highlight segments output by the algorithm; [C]: number of false alarms.

          [A]   [B]   [C]   Precision   Recall
Game 1     34    22    59    27%         64.71%
Game 1     34     6    11    35%         17.65%
Game 1     34    16    39    29%         47.06%
Game 1     34    11    30    27%         32.35%
Game 2     45    25    44    36%         55.55%
Game 2     45     5    10    32.5%       11.11%
Game 2     45    20    30    40%         44.44%
Game 2     45    10    14    42%         20%

Table 2: Results for the same two games using the approach in [4].

In order to compare the current approach with the threshold-based scheme in [4], we compare the precision values at the same recall values in the two soccer games. For the current approach, the different points on the precision-recall curve were generated by changing the likelihood threshold of the highlight HMM. Note that the proposed approach outperforms the simple threshold-based scheme in [4] at these recall values. The number of false alarms has been reduced by the inclusion

of the peak detection and a validation step in place of a threshold-based highlight detection.

5. Conclusion

We have presented our latest improvement on the sports highlights extraction framework described in [4]. These improvements can be summarized as the combination of an unsupervised label mining method and a supervised highlight model. In [4], we relied on the correlation between the length of applause/cheering segments and highlights, without explicit modeling of these highlights. In this paper, we use highlight models together with the unsupervised mining of unusual segments to further improve the performance.

In the future, we will further investigate the fusion of audio and video domain features to model the highlights.

References

[1] S. A. Dagtas and M. Abdel-Mottaleb, "Extraction of Soccer Highlights using Multimedia Features," MMSP, 2001.
[2] C. Toklu, S.-P. Liou and M. Das, "Video Abstract: A Hybrid Approach to Generate Semantically Meaningful Video Summaries," IEEE ICME, New York, 2000.
[3] A. Divakaran, K. A. Peker, R. Radhakrishnan, Z. Xiong and R. Cabasson, "Video Summarization using MPEG-7 Motion Activity and Audio Descriptors," Video Mining, eds. A. Rosenfeld, D. Doermann, and D. DeMenthon, Kluwer Academic Publishers, 2003.
[4] Z. Xiong, R. Radhakrishnan, A. Divakaran, "Generation of Sports Highlights Using Motion Activity in Combination with a Common Audio Feature Extraction Framework," IEEE International Conference on Image Processing (ICIP), to appear September 2003.
[5] B. L. Yeo, B. Liu, "Rapid scene analysis on compressed video," IEEE Trans. on Circuits and Systems for Video Technology, 5(6):533-544, Dec. 1995.
[6] M. J. Swain, D. H. Ballard, "Color indexing," Int. J. Comput. Vision, 7:11-32, 1991.
[7] L. Xie, S. F. Chang, A. Divakaran, H. Sun, "Unsupervised Mining of Statistical Temporal Structures in Video," Video Mining, eds. A. Rosenfeld, D. Doermann, and D. DeMenthon, Kluwer Academic Publishers, 2003.
[8] H. Pan, P. van Beek, M. I. Sezan, "Detection of slow-motion replay segments in sports video for highlights generation," in Proc. IEEE International Conf. on Acoustics, Speech and Signal Processing, 2001.
[9] A. Ekin, A. M. Tekalp, and R. Mehrotra, "Automatic soccer video analysis and summarization," accepted for publication in IEEE Trans. Image Processing.
[10] M. Xu, N. Maddage, C. Xu, M. Kankanhalli, Q. Tian, "Creating Audio Keywords for Event Detection in Soccer Video," Proc. of ICME 2003.

Figure 1: Unsupervised Label Mining Framework.

Figure 2: Combination of unsupervised and supervised learning for highlight extraction (audio labels -> unusual event detection using the unsupervised approach -> highlight HMM -> likelihood score used to validate and rank unusual events as highlights).