11/Dec/2008
Detection of Acoustic Events in Meeting-Room Environment
Presented by Andriy Temko
Department of Electrical and Electronic Engineering
Page 2 of 34
Content
- Introduction
- State of the Art
- Acoustic Event Classification
- Acoustic Event Detection
- Participation in International Evaluations
- Speech Activity Detection
- UPC Smart-Room Activities
Page 3 of 34
Introduction
- Need for environments (smart-rooms) in which computers require little human attention, so that humans can focus on interacting with other humans rather than with the computer, e.g. the CHIL project
- Multimodal technologies (person identification, person localization & tracking, meeting summarization)
- These require a large set of unobtrusive sensors
Page 4 of 34
Introduction
- In such environments, human activity is reflected in a rich variety of acoustic events (AEs)
- Need for perceptual components: activity detection, automatic speech recognition, speaker identification and verification, and speaker localization
- Speech is the most informative AE; however, detection and classification of other AEs may:
  - Describe the social activity that takes place in the room (e.g. laughter or clapping during a meeting, a door slam just after a meeting has started)
  - Increase the robustness of other audio technologies, e.g. speech recognition and speech activity detection
Page 5 of 34
State of the Art
- Sound taxonomy
- Applications of audio recognition
- Features and classifiers
- Methods of detection of AEs
Page 6 of 34
Sound Taxonomy
- Acoustic view of sound:
  - Continuous tone
  - Continuous non-tone
  - Non-repetitive non-continuous
  - Regular repetitive non-continuous
  - Irregular repetitive non-continuous
  - Other
- Semantic view of sound:
  - Speech noises: repetitions, corrections, false starts, mispronunciations
  - Human vocal-tract non-speech noises: breaths, laughter, throat clearing/cough, disagreement, whisper, generic mouth noises
  - Non-speech environment noises: clapping, beep, chair moving, door slams, footsteps, key jingling, music (radio, mobile phone)
Page 7 of 34
Applications
- Audio indexing and retrieval (speech, background noise, etc.)
- Audio recognition for a given environment (kitchens, shops, hospitals, etc.)
- Recognition of generic sounds (birds, animals, alarms, etc.)
- Classification of acoustic environments (street, restaurant, library, airport, etc.)
Page 8 of 34
Features and Classifiers
- Features:
  - Conventional ASR features
  - Perceptual features (energy, silence ratio, spectral tilt, SBE, spectral flux, zero-crossing rate, etc.)
  - Concatenation of ASR and perceptual features
  - Feature selection (feature-space reduction): LDA, PCA, ICA, etc.
- Classifiers:
  - Distance-based classifiers (Euclidean, Mahalanobis, k-NN, etc.)
  - Generative classifiers (conventional in ASR): GMM, HMM, etc.
  - Discriminative classifiers: ANN, decision trees, SVM, etc.
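As an illustration of the perceptual features listed above, here is a minimal sketch (not the system's actual front-end) of two of them, zero-crossing rate and short-time energy, computed over a framed signal:

```python
import numpy as np

def zero_crossing_rate(frame):
    # Fraction of consecutive-sample pairs whose sign changes.
    return np.mean(np.abs(np.diff(np.sign(frame))) > 0)

def short_time_energy(frame):
    # Mean squared amplitude of the frame.
    return np.mean(frame ** 2)

def frame_signal(x, frame_len, hop):
    # Split a 1-D signal into overlapping frames.
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

# Toy signal: a 100 Hz sine sampled at 8 kHz (values are arbitrary).
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 100 * t)
frames = frame_signal(x, frame_len=256, hop=128)
feats = np.array([[zero_crossing_rate(f), short_time_energy(f)] for f in frames])
print(feats.shape)  # one (ZCR, energy) pair per frame
```

A real front-end would stack many such measures per frame (plus spectral ones) before feeding them to the classifier.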
Page 9 of 34
Methods of Detection
- Detection by classification
- Detection and classification (classification of the segments bounded by a detection algorithm)
Page 10 of 34
Acoustic Event Classification
- SVM-based clustering schemes (multi-class problem)
- Comparison of sequence-discriminant SVMs (dynamic data problem)
- Fuzzy-integral-based information fusion for classification of highly confusable non-speech sounds (fusion and feature selection)
Page 11 of 34
SVM-based Clustering Schemes (I)
- First attempt to deal with AEC, comparing different feature sets and two distinct classifiers: SVM and GMM
- Database of AEs: 16 classes (cough, laughter, paper, door slam, steps, etc.)
- Defined acoustic feature sets: 9 feature sets (different combinations of perceptual and ASR features)
Page 12 of 34
SVM-based Clustering Schemes (II)
[Figure: AEC accuracy of the binary-tree multi-class SVM vs. GMM across the 9 feature sets. The best SVM result is 82.9% (feature set 8); the best GMM result is 78.87% (feature set 9).]
[Figure: per-class accuracy across the 9 feature sets for the classes "liquid", "sneeze", and "sniff". The best feature set differs for each class, and none of them coincides with the overall best set (8).]
Page 13 of 34
SVM-based Clustering Schemes (III)
- Outline of the variable-feature-set clustering algorithm:
  - Split the classes into two clusters with the least possible confusion, based on confusion matrices (exhaustive search)
  - The algorithm returns both the two clusters of classes and the feature set that yields the smallest number of confusions
- The resulting variable-feature-set clustering schemes introduce knowledge about class confusions into the classifier structure
[Figure: resulting clustering trees over the 16 classes, with the selected feature set at each node.]
- Result: 88.29% classification rate, a 31.5% relative reduction of the average error
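The first step of the algorithm, splitting the class set into the two least-confused clusters by exhaustive search, can be sketched as follows. This is a simplified illustration on a toy confusion matrix; the per-node feature-set selection described above is omitted, and the class names are made up:

```python
from itertools import combinations
import numpy as np

def best_split(conf, classes):
    """Exhaustively search bipartitions of `classes` for the pair of
    clusters with the fewest cross-cluster confusions.
    `conf[i, j]` = number of class-i samples labelled as class j."""
    best_clusters, best_cross = None, float("inf")
    n = len(classes)
    for r in range(1, n // 2 + 1):
        for left in combinations(range(n), r):
            right = [i for i in range(n) if i not in left]
            # Confusions crossing the cluster boundary, in both directions.
            cross = sum(conf[i, j] + conf[j, i] for i in left for j in right)
            if cross < best_cross:
                best_clusters = ({classes[i] for i in left},
                                 {classes[i] for i in right})
                best_cross = cross
    return best_clusters, best_cross

# Toy 4-class confusion matrix: A/B confuse each other, C/D likewise.
conf = np.array([[90, 8, 1, 1],
                 [7, 91, 1, 1],
                 [1, 1, 88, 10],
                 [1, 1, 9, 89]])
split, cross = best_split(conf, ["A", "B", "C", "D"])
print(split, cross)  # {A, B} vs {C, D}, 8 cross-cluster confusions
```

Applied recursively with a per-node choice of feature set, this kind of split yields the tree-structured schemes shown in the figure.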
Page 14 of 34
Acoustic Event Classification
- SVM-based clustering schemes (multi-class problem)
- Comparison of sequence-discriminant SVMs (dynamic data problem)
- Fuzzy-integral-based information fusion for classification of highly confusable non-speech sounds (fusion and feature selection)
Page 15 of 34
Comparison of Sequence-Discriminant SVMs (I)
- A drawback of SVMs when dealing with audio data is their restriction to fixed-length inputs, whereas an audio segment is a variable-size time sequence of feature vectors (N frames of feature vectors)
- Solution 1: normalize the size of the time sequence of feature vectors
- Solution 2: find a suitable kernel function that can deal with sequential data of variable length
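Solution 1 can be as simple as replacing the sequence by fixed-size statistics of its feature vectors (statistics-based systems such as the "SVMStat" one compared on the next slide follow this idea; the exact statistics used there are not shown here). A hypothetical mean-and-deviation pooling sketch:

```python
import numpy as np

def fix_length_stats(seq):
    # Map a (num_frames, num_feats) sequence to a fixed-length vector
    # by stacking the per-feature mean and standard deviation.
    return np.concatenate([seq.mean(axis=0), seq.std(axis=0)])

rng = np.random.default_rng(0)
short = rng.random((20, 13))    # 20 frames of 13 features
long_ = rng.random((300, 13))   # 300 frames of the same 13 features
print(fix_length_stats(short).shape, fix_length_stats(long_).shape)
```

Both sequences map to the same 26-dimensional vector, so a standard fixed-input SVM can be trained on them, at the cost of discarding temporal structure.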
Page 16 of 34
Comparison of Sequence-Discriminant SVMs (II)
[Figure: per-class rankings of the compared systems, shown separately for AEs with temporal structure (e.g. sneeze, music) and AEs without temporal structure (e.g. paper/pen writing, liquid pouring).]
- Comparison results (recognition rate): the Fisher kernel obtained the best overall result (88.13%); the remaining systems (likelihood ratio, outer product, PolyDTW, DTAK, GDTW, SVMStat, and the GMM baseline) ranged between roughly 76% and 84.5%
- DTW-based kernels work well for sounds that show a temporal structure
- The observed bias of the classifiers towards specific types of classes is a good condition for a successful application of fusion techniques
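Solution 2's DTW-based kernels (GDTW, DTAK) build sequence similarity from a dynamic-time-warping alignment. A toy sketch of a GDTW-style kernel follows; this is illustrative only, the evaluated kernels differ in their details, and kernels of this family are in general not guaranteed to be positive semi-definite:

```python
import numpy as np

def dtw(a, b):
    # Dynamic-time-warping distance between two sequences of vectors,
    # filled in by the standard cumulative-cost recursion.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def gdtw_kernel(a, b, gamma=1.0):
    # Gaussian-DTW style kernel: sequences that warp onto each other
    # cheaply get a similarity close to 1.
    return np.exp(-gamma * dtw(a, b))

x = np.array([[0.0], [1.0], [2.0]])
y = np.array([[0.0], [1.0], [1.0], [2.0]])  # same pattern, one repeated frame
z = np.array([[5.0], [5.0], [5.0]])         # unrelated sequence
print(gdtw_kernel(x, y), gdtw_kernel(x, z))
```

Because the alignment absorbs the repeated frame, `x` and `y` get similarity 1.0 despite their different lengths, which is exactly what a fixed-length representation cannot do.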
Page 17 of 34
Acoustic Event Classification
- SVM-based clustering schemes (multi-class problem)
- Comparison of sequence-discriminant SVMs (dynamic data problem)
- Fuzzy-integral-based information fusion for classification of highly confusable non-speech sounds (fusion and feature selection)
Page 18 of 34
Fuzzy Integral Information Fusion (I)
- Acoustic features (dimensionality in parentheses):
  1. Zero-crossing rate (1)
  2. Short-time energy (1)
  3. Fundamental frequency (1)
  4. Sub-band log energies (4)
  5. Sub-band log energy distribution (4)
  6. Sub-band log energy correlations (4)
  7. Sub-band log energy time differences (4)
  8. Spectral centroid (1)
  9. Spectral roll-off (1)
  10. Spectral bandwidth (1)
[Figure: feature-level vs. decision-level information fusion. (a) Feature level: the N feature types are concatenated and fed to a single classifier, which outputs the event hypothesis. (b) Decision level: one classifier per feature type; the N classifier outputs are combined by the fuzzy integral (FI) into the event hypothesis.]
Page 19 of 34
Fuzzy Integral and Fuzzy Measure
- Weighted arithmetic mean (WAM), where the h_i are the information sources:
  M_WAM(h) = Σ_{i∈Z} μ(i) h_i,  with Σ_{i∈Z} μ(i) = 1 and μ(i) ≥ 0 for all i ∈ Z
- Fuzzy integral (Choquet), with the sources sorted so that h_1 ≤ … ≤ h_z:
  FI_μ(h) = Σ_{i=1}^{z} [μ({i, …, z}) − μ({i+1, …, z})] h_i
- The fuzzy measure μ satisfies μ(∅) = 0, μ(Z) = 1, and μ(S) ≤ μ(T) for S ⊆ T
[Figure: lattice representation of the fuzzy measure for n = 4, from μ(∅) through the singletons μ_1 … μ_4, the pairs μ_12 … μ_34, and the triples μ_123 … μ_234, up to μ_1234.]
- The fuzzy measure (FM) can be learnt from data, provided by an expert, or calculated from fuzzy densities
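A small numeric sketch of the two fusion operators defined above, with the fuzzy measure given directly as a dictionary (toy values chosen by hand, not learnt from data):

```python
import numpy as np

def wam(h, weights):
    # Weighted arithmetic mean: the additive special case.
    return float(np.dot(h, weights))

def choquet(h, mu):
    """Choquet fuzzy integral of scores h (one per source) w.r.t. the
    fuzzy measure `mu`: a dict mapping frozensets of source indices to
    values in [0, 1], with mu[empty] = 0 and mu[all sources] = 1."""
    order = np.argsort(h)                    # ascending: h_(1) <= ... <= h_(z)
    total, remaining = 0.0, frozenset(range(len(h)))
    for i in order:
        # Coefficient of h_i: mu({i, ..., z}) - mu({i+1, ..., z}).
        total += h[i] * (mu[remaining] - mu[remaining - {i}])
        remaining = remaining - {i}
    return total

# Two sources; sub-additive singleton values model their redundancy.
mu = {frozenset(): 0.0, frozenset({0}): 0.3,
      frozenset({1}): 0.4, frozenset({0, 1}): 1.0}
print(wam([0.2, 0.8], [0.5, 0.5]), choquet([0.2, 0.8], mu))
```

Note how the Choquet result (0.44) differs from the plain mean (0.5): the measure lets the fusion weight each source according to the coalition it appears in, which is what makes importance and interaction analysis possible.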
Page 20 of 34
Fuzzy Integral Information Fusion (II)
- Performance of the 10 SVM systems, each running on one feature type, compared with the feature-level combination of the 10 feature types in a single SVM, and with decision-level fusion using the WAM and FI operators
- The fuzzy measure is learnt from data
Page 21 of 34
Fuzzy Integral Information Fusion (III)
- Importance and interaction of the feature types are extracted from the fuzzy measure
- Most important feature types: 4, 6, 7; highly redundant pairs: {4,5}, {1,8}
- Feature selection: selecting features based on the information extracted from the FM yields an improvement over the baseline
- Fusion of different classifiers: SVM with signal-level features + HMM with frame-level features
Page 22 of 34
Fuzzy Integral Information Fusion (IV)
- Fusion of several information sources with the FI formalism showed significant improvements over any single information source
- Decision-level FI fusion showed results comparable to SVM feature-level fusion
- The importance and the degree of interaction can be used for feature selection and give valuable insight into the problem
- FI may be a good choice when feature-level fusion is not an option (features of a different nature, application/technique dependence)
- Finalist for the Best Journal Paper Award of 2008 in Speech Technologies
Page 23 of 34
Acoustic Event Detection
- UPC Acoustic Event Detection System 2006
- UPC Acoustic Event Detection System 2007
Page 24 of 34
UPC Acoustic Event Detection System 2006
- Pipeline: feature extraction → SVM silence/non-silence segmentation → SVM classification
- Segmentation step: a first SVM pre-segments the audio based on silence/non-silence detection
- Classification step: a second SVM, trained on 12 AE classes + speech + unknown, labels the detected segments
Page 25 of 34
UPC Acoustic Event Detection System 2007
[Figure: detection by classification with a sliding window. One-second windows, shifted in steps of 200 ms, are each classified by an SVM; post-processing of the window decisions assigns AE labels to segments and produces the final system output.]
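The sliding-window scheme in the figure can be approximated in a few lines. This is a toy stand-in: a trivial energy rule replaces the SVM, per-frame majority voting stands in for the actual post-processing, and the window/shift sizes are arbitrary frame counts rather than 1 s / 200 ms:

```python
import numpy as np
from collections import Counter

def detect(frames, classify, win=5, hop=1):
    """Slide a window over frame-level features, classify each window,
    then label every frame by majority vote over the windows covering
    it (a simple post-processing stand-in)."""
    votes = [[] for _ in range(len(frames))]
    for start in range(0, len(frames) - win + 1, hop):
        label = classify(frames[start:start + win])
        for t in range(start, start + win):
            votes[t].append(label)
    return [Counter(v).most_common(1)[0][0] for v in votes]

# Toy per-frame "features": energy 0 -> silence, energy 1 -> event.
frames = np.array([0.0] * 10 + [1.0] * 10 + [0.0] * 10)
labels = detect(frames, lambda w: "event" if w.mean() > 0.5 else "silence")
print(labels)  # silence / event / silence, recovered per frame
```

The overlap between windows is what lets the vote smooth out isolated misclassifications before segments are formed.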
Page 26 of 34
Participation in International Evaluations
- CHIL Dry-Run Evaluations 2004: 1st place out of 2 participants
- CHIL Evaluations 2005: 1st place out of 2 participants
- CLEAR Evaluations 2006: 1st place out of 3 participants
- CLEAR Evaluations 2007: 2nd place out of 8 participants
Page 27 of 34
Speech Activity Detection
- AED is event-based; SAD is frame-based
- Dataset reduction algorithm
- Metric: NIST error rate = duration of incorrect decisions / duration of all speech segments
- Features: FF, ΔFF, ΔΔFF, and ΔlogE, plus LDA-derived long-context features: hfed (frames t−15 … t+15), lfed (frames t−9 … t+9), and ldam
[Figure: block diagram of the feature extractor, showing the frame contexts around the current frame t.]
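The metric above can be computed directly from frame-level decisions. A minimal sketch, with a fixed frame duration and no segment merging (both simplifications):

```python
def nist_sad_error(ref, hyp, frame_dur=0.01):
    """NIST-style SAD error: duration of frames where the hypothesis
    disagrees with the reference, divided by the total duration of
    reference speech. `ref` and `hyp` are per-frame booleans
    (True = speech), one entry per 10 ms frame by default."""
    wrong = sum(r != h for r, h in zip(ref, hyp)) * frame_dur
    speech = sum(ref) * frame_dur
    return wrong / speech

ref = [True] * 80 + [False] * 20
hyp = [True] * 70 + [False] * 30   # misses the last 10 speech frames
print(nist_sad_error(ref, hyp))    # 0.1 s wrong / 0.8 s speech = 0.125
```

Because the denominator is speech duration only, false alarms during long non-speech stretches can push the error above 100%, which is a known property of this metric.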
Page 28 of 34
Enhanced SVM Training for SAD
- Dataset reduction algorithm:
  - Step 1: divide all the data into chunks of 1000 samples per chunk
  - Step 2: train a PSVM (proximal SVM) on each chunk, performing 5-fold cross-validation (CV)
  - Step 3: apply an appropriate threshold to the highest/lowest CV accuracies
  - Step 4: train a classical SVM on the data selected in Step 3
[Figure: separating hyperplane H_0 with margin hyperplanes H_−1 and H_1, for the classical SVM and for the PSVM.]
- NIST SAD Evaluations 2006: best SAD system out of 10 systems, with a 4.88% error rate
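A sketch of Steps 1–3 on toy data. A nearest-centroid classifier stands in for the PSVM, and the thresholding here keeps whole chunks whose CV accuracy is high; the exact selection rule of the original algorithm may differ:

```python
import numpy as np

def centroid_cv_score(X, y, folds=5):
    # 5-fold CV accuracy of a nearest-centroid classifier
    # (a lightweight stand-in for the per-chunk PSVM of Step 2).
    idx = np.arange(len(X))
    acc = []
    for f in range(folds):
        test = idx[f::folds]
        train = np.setdiff1d(idx, test)
        c0 = X[train][y[train] == 0].mean(axis=0)
        c1 = X[train][y[train] == 1].mean(axis=0)
        pred = (np.linalg.norm(X[test] - c1, axis=1)
                < np.linalg.norm(X[test] - c0, axis=1)).astype(int)
        acc.append(np.mean(pred == y[test]))
    return float(np.mean(acc))

def reduce_dataset(X, y, chunk=1000, threshold=0.9):
    # Steps 1-3: chunk the data, score each chunk by CV accuracy,
    # keep only chunks that clear the threshold.
    keep = [(X[s:s + chunk], y[s:s + chunk])
            for s in range(0, len(X), chunk)
            if centroid_cv_score(X[s:s + chunk], y[s:s + chunk]) >= threshold]
    if not keep:
        raise ValueError("no chunk passed the threshold")
    return (np.concatenate([c[0] for c in keep]),
            np.concatenate([c[1] for c in keep]))

# One clean, separable chunk plus one chunk with random labels.
rng = np.random.default_rng(0)
X_clean = np.vstack([rng.normal(0, 0.1, (500, 2)), rng.normal(3, 0.1, (500, 2))])
y_clean = np.array([0] * 500 + [1] * 500)
X_noisy = rng.normal(0, 1, (1000, 2))
y_noisy = rng.integers(0, 2, 1000)
Xr, yr = reduce_dataset(np.vstack([X_clean, X_noisy]),
                        np.concatenate([y_clean, y_noisy]))
print(len(Xr))  # only the clean chunk survives
```

The final Step 4 would then train the classical (and much more expensive) SVM only on the surviving data.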
Page 29 of 34
UPC Smart-Room Activities
- UPC smart-room database recording
- Real-time implementation
- Demonstrations
Page 30 of 34
UPC Smart-Room
[Figure: layout of the 3.966 m × 5.245 m UPC smart-room, showing the fixed cameras Cam1–Cam6, a zenithal camera (Cam8), a PTZ camera, the 64-microphone Mark III array (MkIII_001–MkIII_064), and wall-mounted Hammer microphone clusters (including Hammer_13 to Hammer_20).]
Page 31 of 34
Real-Time Implementation
- Implementation of the AED component as a smart-flow client
Page 32 of 34
Demonstrations (I)
- Participation in several demonstrations
- Mock-up demo: journalist scenario
- Reliably detected AEs: door knock, door opening/closing, speech, applause, and key jingles
Page 33 of 34
Demonstrations (II)
- Participation in several demonstrations
- Virtual UPC smart-room: 3D person tracking, ASL, and AED
Page 34 of 34
Demonstrations (III)
- Participation in several demonstrations
- Demo of AED & ASL
[Video: real-time video shown alongside a graphical representation of the AED & ASL output.]