Comparing MFCC and MPEG-7 Audio Features for Feature Extraction, Maximum Likelihood HMM and Entropic Prior HMM for Sports Audio Classification

MITSUBISHI ELECTRIC RESEARCH LABORATORIES
http://www.merl.com

Comparing MFCC and MPEG-7 Audio Features for Feature Extraction, Maximum Likelihood HMM and Entropic Prior HMM for Sports Audio Classification

Ziyou Xiong, Regunathan Radhakrishnan, Ajay Divakaran, Thomas S. Huang

TR2004-082    June 2003

Abstract

We present a comparison of 6 methods for classification of sports audio. For feature extraction we have two choices: MPEG-7 audio features and Mel-scale Frequency Cepstrum Coefficients (MFCC). For classification we also have two choices: Maximum Likelihood Hidden Markov Models (ML-HMM) and Entropic Prior HMM (EP-HMM). EP-HMM, in turn, has two variations: with and without trimming of the model parameters. We thus have 6 possible methods, each corresponding to one combination. Our results show that all the combinations achieve classification accuracy of around 90%, with the best and second best being MPEG-7 features with EP-HMM and MFCC with ML-HMM.

ICME 2003

This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories, Inc.; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All rights reserved.

Copyright (c) Mitsubishi Electric Research Laboratories, Inc., 2003
201 Broadway, Cambridge, Massachusetts 02139

COMPARING MFCC AND MPEG-7 AUDIO FEATURES FOR FEATURE EXTRACTION, MAXIMUM LIKELIHOOD HMM AND ENTROPIC PRIOR HMM FOR SPORTS AUDIO CLASSIFICATION

Ziyou Xiong, Regunathan Radhakrishnan, Ajay Divakaran and Thomas S. Huang

Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801
Mitsubishi Electric Research Laboratory at Murray Hill, Murray Hill, NJ 07974
E-mail: {zxiong, huang}@ifp.uiuc.edu, {regu, ajayd}@merl.com

ABSTRACT

We present a comparison of 6 methods for classification of sports audio. For feature extraction we have two choices: MPEG-7 audio features and Mel-scale Frequency Cepstrum Coefficients (MFCC). For classification we also have two choices: Maximum Likelihood Hidden Markov Models (ML-HMM) and Entropic Prior HMM (EP-HMM). EP-HMM, in turn, has two variations: with and without trimming of the model parameters. We thus have 6 possible methods, each corresponding to one combination. Our results show that all the combinations achieve classification accuracy of around 90%, with the best and second best being MPEG-7 features with EP-HMM and MFCC with ML-HMM.

Keywords: Sports Audio Classification, MFCC, MPEG-7 Audio Features, HMM

1. INTRODUCTION AND RELATED WORK

Most of the audio features proposed so far fall into three categories: energy-based, spectrum-based and perceptual-based. An example of the first category is the 4 Hz modulation energy used by Scheirer et al. [1] for speech/music classification. Examples of the second category are spectral roll-off, spectral flux and MFCC in Scheirer et al. [1], and the line spectrum pairs and band periodicity in [2]. Examples of the third category include the pitch estimated by Zhang et al. [3] to discriminate additional classes such as songs and speech over music. Although there are comparative studies of audio features for speech/music discrimination [4], there are few such studies for general sound classification.
Recently the MPEG-7 international standard has adopted new, dimension-reduced, de-correlated spectral features [5] for general sound classification [6]. This motivates us to compare them with other widely used features. In addition to the numerous audio features, a broad spectrum of classifiers has been studied for audio classification, such as Nearest Neighbor, Neural Networks, Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), Nearest Feature Line, AdaBoost and Support Vector Machines (SVM). Among these classifiers, HMM have the advantage of better modelling the temporal evolution of dynamic sounds. Other forms of HMM have been studied as well, such as continuously-variable-duration HMM and EP-HMM [7]. Their classification results have been reported to be better than those of ML-HMM on various clean databases. This motivates us to compare some of them with ML-HMM on our noisy sports audio database. For an explanation of clean and noisy databases, see Section 4.

The rest of the paper is organized as follows. Sections 2 and 3 give a brief overview of the two audio features and the two HMM variants. The experiments are described in Section 4. The experimental results and discussion are in Sections 5 and 6.

2. MPEG-7 AUDIO FEATURES AND MFCC

We compare the MPEG-7 standardized features for sound recognition with MFCC. Although both are spectrum-based, MPEG-7 audio features are new to the audio-feature family, while MFCC have been widely used in speech recognition and audio classification. The extraction processes are summarized in Figure 1 and Figure 2, respectively.

The MPEG-7 features consist of dimension-reduced spectral vectors obtained using a linear transformation of a spectrogram. They are basis-projection features based on Principal Component Analysis (PCA) and an optional Independent Component Analysis (ICA). For each audio class, PCA is performed on the normalized log subband energy of

all the audio frames from all the training examples in the class. The frequency bands are decided using a logarithmic scale (e.g., an octave scale).

[Fig. 1. Extraction method for MPEG-7 audio features: window → audio power spectrum → rectangular-filter (log-scale) subband energy → normalization (L2 norm) → log → dimension reduction by PCA/ICA.]
[Fig. 2. Extraction method for MFCC: window → audio power spectrum → triangular-filter (Mel-scale) subband energy → log → dimension reduction by DCT.]

MFCC are based on the discrete cosine transform (DCT). They are defined as:

c_n = sqrt(2/K) Σ_{k=1}^{K} log(S_k) cos[n (k − 1/2) π / K],  n = 1, …, L,

where K is the number of subbands and L is the desired length of the cepstrum; usually L ≪ K for dimension reduction. S_k, 1 ≤ k ≤ K, is the filter-bank energy after the k-th triangular band-pass filter. The frequency bands are decided using the Mel-frequency scale (linear below 1 kHz and logarithmic above 1 kHz).

The differences between the two features are summarized as follows:

1. The Mel-frequency scale used for MFCC has been shown to be better than the logarithmic scale for speech recognition. MPEG-7 audio features use the logarithmic scale because of its simplicity.

2. The MFCC of a test audio example are the same for all audio classes, because the DCT bases are the same. The MPEG-7 audio features of the same example, however, differ per class: since each PCA space is derived from the training examples of one class, each class has its own distinct PCA space.

3. During training, extraction of the MPEG-7 audio features requires more memory, to buffer the features of all the training examples of all the audio classes; during testing, the PCA projection must be performed for each class. Extraction of MFCC only requires buffering the features of one training example, and the features are the same for all classes. Thus extraction of the MPEG-7 audio features takes more time and memory.
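As a concrete check of the cepstrum formula above, here is a minimal pure-Python sketch mapping K filter-bank energies to L cepstral coefficients via the DCT. The function name and inputs are illustrative, not from the paper; production code would use an optimized DCT routine.

```python
import math

def mfcc_from_subband_energies(S, L=10):
    """Cepstral coefficients c_n = sqrt(2/K) * sum_{k=1..K} log(S_k) *
    cos[n (k - 1/2) pi / K], n = 1..L, following the paper's definition.
    S holds the K triangular filter-bank energies; the list is 0-based,
    so S[k] plays the role of S_{k+1}, giving the (k + 0.5) term below."""
    K = len(S)
    scale = math.sqrt(2.0 / K)
    return [scale * sum(math.log(S[k]) * math.cos(n * (k + 0.5) * math.pi / K)
                        for k in range(K))
            for n in range(1, L + 1)]
```

With the paper's settings (K = 40, L = 10) this yields a 10-dimensional feature per frame. Note that a flat filter bank (all S_k = 1) gives an all-zero cepstrum, since log 1 = 0.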
3. ML-HMM [8] AND EP-HMM [7]

We compare EP-HMM with ML-HMM for classification of sports audio. Let λ denote the model parameters and O the observation. The Maximum A Posteriori (MAP) test is the following: O is classified as class j if P(λ_j | O) ≥ P(λ_i | O) for all i. When we have no bias towards any prior model λ_i, i.e., we assume P(λ_i) = P(λ_j) for all i, j, the MAP test is equivalent to the ML test — O is classified as class j if P(O | λ_j) ≥ P(O | λ_i) for all i — by Bayes' rule:

P(λ | O) = P(O | λ) P(λ) / P(O).

However, if we assume the biased probabilistic model

P(λ | O) = P(O | λ) P_e(λ) / P(O),  where P_e(λ) = e^{−H(P(λ))}

and H denotes entropy (i.e., the smaller the entropy, the more likely the parameters), then we must use the MAP test and compare

[P(O | λ_i) e^{−H(P(λ_i))}] / [P(O | λ_j) e^{−H(P(λ_j))}]

with 1 to decide whether O should be classified as class i or class j. EP-HMM have been shown to improve classification accuracy over ML-HMM on melody, text [7] and general sound classification [6]. Moreover, in EP-HMM it is possible to trim parameters of the graphical structure, yielding more compact graphical models. For further details on EP-HMM, see [7].
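Since P(O) is common to all classes, the MAP test under the entropic prior amounts to picking the class with the largest log-likelihood minus parameter entropy. A sketch in the log domain, with made-up per-class values for illustration:

```python
def entropic_map_classify(log_likelihoods, entropies):
    """MAP test under the entropic prior P_e(lambda) = exp(-H(P(lambda))):
    score each class by log P(O|lambda_i) - H(P(lambda_i)) and return the
    index of the winner. With equal entropies this reduces to the ML test."""
    scores = [ll - h for ll, h in zip(log_likelihoods, entropies)]
    return max(range(len(scores)), key=scores.__getitem__)

# Made-up values: class 1 has the higher likelihood, but its higher-entropy
# (less concentrated) parameters flip the decision to class 0.
print(entropic_map_classify([-10.0, -9.5], [0.2, 1.5]))  # -> 0
```

Passing equal entropies for every class recovers the plain ML decision, which is why ML-HMM is the special case of this rule with a flat prior.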

4. SPORTS AUDIO CLASSIFICATION

4.1. Data Set

Unlike the data set used in [9], our database of sports audio is not composed of relatively clean audio such as CD recordings or the TIMIT database. It comes from broadcast TV, an uncontrolled audio environment with high background interference, which makes audio classification more difficult. We have collected 814 audio clips from TV broadcasts of golf, baseball and soccer games. Each clip is hand-labelled with one of six classes as ground truth: applause, ball-hit, cheering, music, speech, and speech with music, with 105, 135, 82, 185, 168 and 139 clips respectively. Their durations range from around 0.5 seconds (for ball hits) to more than 10 seconds (for music segments). The total duration is approximately 1 hour and 12 minutes. The database is partitioned into a 90%/10% training/testing split.

4.2. Feature Extraction

In our feature extraction, an audio signal is divided into overlapping frames of duration 30 ms, with 10 ms of overlap between consecutive frames. Each frame is multiplied by a Hamming window. MFCC are calculated from 40 subbands (17 linear bands between 62.5 Hz and 1 kHz, 23 logarithmic bands between 1 kHz and 8 kHz). 10th-order MFCC are used as audio features; that is, L = 10 and K = 40. The lower and upper boundaries of the frequency bands for the MPEG-7 features are also 62.5 Hz and 8 kHz, spanning 7 octaves. Each subband spans a quarter of an octave, giving 28 subbands in between; frequencies below 62.5 Hz are grouped into 1 extra subband. After normalization of the 29 log subband energies, a 30-element vector represents the frame. This vector is then projected onto the first 10 principal components of the PCA space of each class. Note that 10 principal components are used so that the number of MPEG-7 features equals the number of MFCC features.

5. EXPERIMENTAL RESULTS

The number of states for both HMM is initially set to 10.
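The framing and windowing front end of Sec. 4.2 (30 ms frames, 10 ms overlap, Hamming window, 16 kHz audio as noted in the Discussion) can be sketched as follows; frame_signal is an illustrative helper, not code from the paper:

```python
import math

def frame_signal(x, fs=16000, frame_ms=30, overlap_ms=10):
    """Split signal x into overlapping frames and apply a Hamming window.
    At fs = 16 kHz: 480-sample frames advancing by 320 samples, so each
    pair of consecutive frames shares 10 ms (160 samples), as in Sec. 4.2."""
    frame_len = int(fs * frame_ms / 1000)            # 480 samples at 16 kHz
    hop = frame_len - int(fs * overlap_ms / 1000)    # 320-sample advance
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]             # Hamming window
    return [[x[s + n] * window[n] for n in range(frame_len)]
            for s in range(0, len(x) - frame_len + 1, hop)]
```

Each returned frame would then feed the power-spectrum and filter-bank stages of Figures 1 and 2.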
The distribution of the observations is modelled as a single-component multivariate Gaussian. To compare the two HMM fairly, we first compare ML-HMM with EP-HMM without trimming of states or parameters, and then with trimming. The classification accuracies from 10-fold cross-validation are organized into Table 1 for one selected combination of features and classifier; because of limited space, we omit those for the other 5 combinations. The average recognition rates for the 6 methods are summarized in Table 2. The following observations can be made from these tables:

1. For our sports audio database the best combination, on average, is MPEG-7 features with EP-HMM with trimming of states and model parameters. The improvement in classification accuracy from Combination 2 to Combination 3 is due solely to the trimming of states and model parameters, especially for the ball-hit class. We found that most of the trimming occurred for this class of short-duration, impulse-like signals: the number of states needed was smaller than 10 and many of the state transitions were trimmed. With a more compact model, there were more training data per state and per parameter, allowing convergence closer to the global maximum.

2. With either MFCC or MPEG-7 audio features, trimming of states and parameters for EP-HMM improves classification accuracy. This can be observed in the improvements from Combination 2 to Combination 3 and from Combination 5 to Combination 6.

3. With ML-HMM as the classifier, MFCC outperform MPEG-7 audio features, as seen from Combinations 4 and 1. However, when EP-HMM is the classifier, with or without trimming, the performance of MFCC drops by as much as 5% compared with MPEG-7 features, as seen from Combinations 2, 3, 5 and 6.

4. All 6 combinations perform well and are comparable in performance. No method enjoys a significant advantage over the others; the choice of a particular combination would be governed by the computational complexity and memory requirements of the application.

6. DISCUSSION

1. Casey [6] shows that MPEG-7 features with EP-HMM with trimming, which is also the best combination for our database, yielded significantly better results (on average 6.5%) than MPEG-7 features with ML-HMM on a database of 1000 sounds spanning 20 audio classes. Our gain is smaller (about 4.0%), possibly due to the noisy nature of our database.

2. The small marginal advantage of MPEG-7 features over MFCC, and the result that MFCC with ML-HMM (Combination 4) outperformed Combinations 1, 2, 5 and 6 in Table 2, is a bit surprising to us. Based on the results in [6][7]

we expected that the MPEG-7 features would beat MFCC and that EP-HMM would beat ML-HMM. One factor we may have missed is that the sampling rate of our audio signals was set to 16 kHz, not 32 kHz or 44.1 kHz, so the MPEG-7 features could not capture spectral information outside the speech spectrum range. Further research is needed for a definitive comparison.

       [1]    [2]    [3]    [4]    [5]    [6]
[1]   1.000  0      0      0      0      0
[2]   0      0.923  0      0      0.077  0
[3]   0.125  0      0.875  0      0      0
[4]   0      0      0      0.944  0.056  0
[5]   0      0      0      0      0.941  0.059
[6]   0      0      0      0      0      1.000
Average recognition rate: 94.728%

Table 1. Recognition matrix (confusion matrix) on a 90%/10% training/testing split of the 6-class data set. [1] Applause; [2] Ball-Hit; [3] Cheering; [4] Music; [5] Speech; [6] Speech with Music. Results are for MPEG-7 audio features with EP-HMM with trimming of states and parameters.

#  Combination                     Accuracy
1  MPEG-7 + ML-HMM                 90.974%
2  MPEG-7 + EP-HMM, no trimming    90.974%
3  MPEG-7 + EP-HMM + trimming      94.728%
4  MFCC + ML-HMM                   94.604%
5  MFCC + EP-HMM, no trimming      88.427%
6  MFCC + EP-HMM + trimming        89.353%

Table 2. Comparison of the average recognition rates for the 6 methods.

7. CONCLUSIONS AND FUTURE DIRECTIONS

We have presented a comparison of 6 methods for classification of sports audio. Our results show that all the combinations achieve classification accuracy of around 90% and are comparable in performance, with the best and second best being MPEG-7 features with EP-HMM and MFCC with ML-HMM. Possible future directions include enlarging our database, in both the number of samples per class and the number of classes, to derive more robust audio models; studying the effect of the audio sampling rate on classification results for both MFCC and MPEG-7 audio features; and applying our findings to audio highlights extraction.

8. ACKNOWLEDGEMENT

We appreciate the help of Dr. Michael Casey and Dr. Matthew Brand of MERL Cambridge Research Lab, and of Dr. Vladimir Pavlovic and Dr. Malcolm Slaney.

9. REFERENCES

[1] E. Scheirer and M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator," Proc. ICASSP-97, Munich, Germany, April 1997.
[2] L. Lu, S. Li, and H.J. Zhang, "Content-based audio segmentation using support vector machines," Proc. ICME 2001, pp. 956-959, Tokyo, Japan, 2001.
[3] T. Zhang and C. Kuo, "Content-based classification and retrieval of audio," Proc. SPIE 43rd Annual Conference on Advanced Signal Processing Algorithms, Architectures and Implementations, vol. VIII, San Diego, CA, 1998.
[4] M.J. Carey, E.S. Parris, and H.L. Thomas, "A comparison of features for speech, music discrimination," Proc. ICASSP 1999, pp. 149-152, 1999.
[5] M. Casey, "MPEG-7 sound-recognition tools," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 6, pp. 737-747, June 2001.
[6] M. Casey, "Reduced-rank spectra and entropic priors as consistent and reliable cues for general sound recognition," Proc. Workshop on Consistent & Reliable Acoustic Cues for Sound Analysis, 2001.
[7] M. Brand, "Structure discovery in conditional probability models via an entropic prior and parameter extinction," Neural Computation, vol. 11, no. 5, pp. 1155-1183, 1999.
[8] L.R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, February 1989.
[9] E. Wold, T. Blum, D. Keislar, and J. Wheaton, "Content-based classification, search and retrieval of audio," IEEE Multimedia Magazine, vol. 3, no. 3, pp. 27-36, 1996.
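As a closing consistency check on Table 1: the average recognition rate is the unweighted mean of the per-class accuracies on the recognition matrix's diagonal, and the rounded entries shown in the table reproduce the reported 94.728% up to display precision. A small sketch (the helper name is ours, not the paper's):

```python
def average_recognition_rate(confusion):
    """Unweighted mean of the diagonal of a row-normalized confusion matrix
    (each row sums to 1 and gives one true class's classification outcomes)."""
    return sum(row[i] for i, row in enumerate(confusion)) / len(confusion)

# Table 1 (MPEG-7 features, EP-HMM with trimming); rows are true classes.
table1 = [
    [1.000, 0,     0,     0,     0,     0    ],
    [0,     0.923, 0,     0,     0.077, 0    ],
    [0.125, 0,     0.875, 0,     0,     0    ],
    [0,     0,     0,     0.944, 0.056, 0    ],
    [0,     0,     0,     0,     0.941, 0.059],
    [0,     0,     0,     0,     0,     1.000],
]
print(round(100 * average_recognition_rate(table1), 1))  # -> 94.7
```

The residual gap to the exact 94.728% comes from the three-decimal rounding of the displayed matrix entries.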