Detection of Acoustic Events in Meeting-Room Environment

Similar documents
ACOUSTIC EVENT DETECTION AND CLASSIFICATION IN SMART-ROOM ENVIRONMENTS: EVALUATION OF CHIL PROJECT SYSTEMS

Audiovisual Event Detection Towards Scene Understanding

Inclusion of Video Information for Detection of Acoustic Events using Fuzzy Integral

Workshop W14 - Audio Gets Smart: Semantic Audio Analysis & Metadata Standards

GYROPHONE RECOGNIZING SPEECH FROM GYROSCOPE SIGNALS. Yan Michalevsky (1), Gabi Nakibly (2) and Dan Boneh (1)

Integration of audiovisual sensors and technologies in a smart room

SOUND EVENT DETECTION AND CONTEXT RECOGNITION 1 INTRODUCTION. Toni Heittola 1, Annamaria Mesaros 1, Tuomas Virtanen 1, Antti Eronen 2

Minimal-Impact Personal Audio Archives

Discriminate Analysis

Available online Journal of Scientific and Engineering Research, 2016, 3(4): Research Article

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

Analysis of Local Appearance-based Face Recognition on FRGC 2.0 Database

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Index. Symbols. Index 353

INF 4300 Classification III Anne Solberg The agenda today:

MACHINE LEARNING: CLUSTERING, AND CLASSIFICATION. Steve Tjoa June 25, 2014

Neetha Das Prof. Andy Khong

ECG782: Multidimensional Digital Signal Processing

Maximum Likelihood Beamforming for Robust Automatic Speech Recognition

Client Dependent GMM-SVM Models for Speaker Verification

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Machine Learning based session drop prediction in LTE networks and its SON aspects

Network Traffic Measurements and Analysis

Real-time Monitoring of Participants Interaction in a Meeting using Audio-Visual sensors

Large-scale visual recognition Efficient matching

METRIC LEARNING BASED DATA AUGMENTATION FOR ENVIRONMENTAL SOUND CLASSIFICATION

Introduction to digital image classification

Fuzzy Entropy based feature selection for classification of hyperspectral data

Content-based image and video analysis. Machine learning

TWO-STEP SEMI-SUPERVISED APPROACH FOR MUSIC STRUCTURAL CLASSIFICATION. Prateek Verma, Yang-Kai Lin, Li-Fan Yu. Stanford University

K-Nearest Neighbor Classification Approach for Face and Fingerprint at Feature Level Fusion

Machine Learning. Nonparametric methods for Classification. Eric Xing , Fall Lecture 2, September 12, 2016

Deep Learning for Computer Vision

Linear Discriminant Analysis for 3D Face Recognition System

Exemplar Selection Methods to Distinguish Human from Animal Footsteps

Statistical Pattern Recognition

SPEECH FEATURE EXTRACTION USING WEIGHTED HIGHER-ORDER LOCAL AUTO-CORRELATION

MPEG-7 Audio: Tools for Semantic Audio Description and Processing

Linear Discriminant Analysis in Ottoman Alphabet Character Recognition

A Framework for Efficient Fingerprint Identification using a Minutiae Tree

Optimizing feature representation for speaker diarization using PCA and LDA

Tri-modal Human Body Segmentation

Automatic summarization of video data

Class 5: Attributes and Semantic Features

Hands On: Multimedia Methods for Large Scale Video Analysis (Lecture) Dr. Gerald Friedland,

Statistical Pattern Recognition

Unsupervised learning in Vision

Algorithms for Recognition of Low Quality Iris Images. Li Peng Xie University of Ottawa

GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC

Binary Hierarchical Classifier for Hyperspectral Data Analysis

Video Summarization Using MPEG-7 Motion Activity and Audio Descriptors

Multiple Kernel Learning for Emotion Recognition in the Wild

1 Introduction. 3 Data Preprocessing. 2 Literature Review

DUPLICATE DETECTION AND AUDIO THUMBNAILS WITH AUDIO FINGERPRINTING

CS 229 Final Project Report Learning to Decode Cognitive States of Rat using Functional Magnetic Resonance Imaging Time Series

Applying Supervised Learning

Multimedia Event Detection for Large Scale Video. Benjamin Elizalde

Learning to Recognize Faces in Realistic Conditions

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Application of Characteristic Function Method in Target Detection

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering

Baseball Game Highlight & Event Detection

Statistical Pattern Recognition

Semantic Video Indexing

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

Action recognition in videos

Speaker Diarization System Based on GMM and BIC

Content Based Classification of Audio Using MPEG-7 Features

Confidence Measures: how much we can trust our speech recognizers

Performance Degradation Assessment and Fault Diagnosis of Bearing Based on EMD and PCA-SOM

Multimedia Database Systems. Retrieval by Content

COPYRIGHTED MATERIAL. Introduction. 1.1 Introduction

Background: Bioengineering - Electronics - Signal processing. Biometry - Person Identification from brain wave detection University of «Roma TRE»

Discriminative training and Feature combination

Biometrics Technology: Multi-modal (Part 2)

Generation of Sports Highlights Using a Combination of Supervised & Unsupervised Learning in Audio Domain

A Brief Overview of Audio Information Retrieval. Unjung Nam CCRMA Stanford University

Pattern Recognition for Neuroimaging Data

Supervised vs unsupervised clustering

Features: representation, normalization, selection. Chapter e-9

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

on learned visual embedding patrick pérez Allegro Workshop Inria Rhônes-Alpes 22 July 2015

New Results in Low Bit Rate Speech Coding and Bandwidth Extension

CS570: Introduction to Data Mining

System Identification Related Problems at SMN

STC ANTI-SPOOFING SYSTEMS FOR THE ASVSPOOF 2015 CHALLENGE

Gait analysis for person recognition using principal component analysis and support vector machines

Dynamic Time Warping

Tag Recommendation for Photos

Motivation: why study sound? Motivation (2) Related Work. Our Study. Related Work (2)

Pattern recognition. Classification/Clustering GW Chapter 12 (some concepts) Textures

A Geometric Perspective on Machine Learning

EE 6882 Statistical Methods for Video Indexing and Analysis

Robust PDF Table Locator

System Identification Related Problems at

3D Object Recognition using Multiclass SVM-KNN

Dynamic Human Fatigue Detection Using Feature-Level Fusion

3 Feature Selection & Feature Extraction

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Transcription:

11/Dec/2008 Detection of Acoustic Events in Meeting-Room Environment Presented by Andriy Temko Department of Electrical and Electronic Engineering

Page 2 of 34 Content Introduction State of the Art Acoustic Event Classification Acoustic Event Detection Participation in International Evaluations Speech Activity Detection UPC Smart-Room Activities

Page 3 of 34 Introduction Need of environments (Smart-rooms) in which computers do not need much human attention and humans focus on interaction with humans (not with the computer) e.g. CHIL project Multimodality Technologies (Person ID, Person Localization & Tracking, Meeting Summarization) Need a large set of unobtrusive Sensors

Page 4 of 34 Introduction In such environments the human activity is reflected in a rich variety of acoustic events (AEs) Need of perceptual components (activity detection, automatic speech recognition, speaker identification and verification, and speaker localization) Speech is the most informative AE however Detection and classification of other AEs may: Describe social activity that takes place in the room (e.g. laughter or clapping inside meeting, door slam when a meeting just started) Increase the robustness of other (audio) technologies e.g. speech recognition / speech activity detection

Page 5 of 34 State of the art Sound taxonomy Applications of audio recognition Features and classifiers Methods of detection of AEs

Page 6 of 34 Sound Taxonomy Sound Continuous tone Continuous nontone Non-repetitive noncontinuous Acoustic Regular repetitive noncontinuous Irregular repetitive noncontinuous Other Sound Speech noises Repetitions Corrections False starts Mispronunciation Human vocal tract non-speech noises Breathes Laughter Throat/cough Disagreement Whisper Mouth (generic) Non-speech environment noises Clapping Beep Chair moving Door slams Footsteps Key jingling Music (radio, handy) Semantic

Page 7 of 34 Applications Audio indexing and retrieval (speech, background noise, etc Audio recognition for a given environment (kitchen, shops, hospitals, etc) Recognition of generic sounds (birds, animals, alarms, etc) Classification of acoustic environments (street, restaurants, library, airport, etc)

Page 8 of 34 Features Features and Classifiers Conventional ASR features Perceptual features (energy, silence ratio, spectral tilt, SBE, spectral flux, zero-crossing etc) Concatenation of the ASR and perceptual features Feature selection (feature space reduction): LDA, PCA, ICA, etc Classifiers Distance based classifiers (Euclidean, Mahalanobis, knn, etc) Generative classifiers (conventional ASR): GMM, HMM, etc Discriminative classifiers: ANN, Decision trees, SVM, etc

Page 9 of 34 Methods of Detection Detection by classification Detection and classification (classification of the segment bounded by detection algorithm)

Page 10 of 34 Acoustic Event Classification SVM-based Clustering Schemes (Multiclass problem) Comparison of Sequence Discriminant SVM (dynamic data problem) Fuzzy Integral Based Information Fusion for Classification of Highly Confusable Non-Speech Sounds (Fusion and Feature Selection)

Page 11 of 34 SVM-based Clustering Schemes (I) First attempt to deal with AEC comparing different feature sets and two distinct classifiers SVM and GMM Database of AEs: 16 classes (Cough, Laugh, Paper, Door slam, Steps, etc) Defined acoustic feature sets: 9 Feature sets (different combination of perceptual features and ASR features)

Page 12 of 34 SVM-based Clustering Schemes (II) ACC % 100 80 60 40 20 0 82.9 78.87 SVM 1 2 GMM 3 4 5 6 7 8 9 feature set Results of AEC with binary tree multi-class SVM and GMM. Best feature set for SVM is 8 and for GMM is 9. ACC % 100 50 0 1 2 3 4 5 6 7 8 9 feature set liquid sneeze sniff Best feature sets for liquid, sneeze, and sniff are different and none is the best overall (8).

Page 13 of 34 SVM-based Clustering Schemes (III) Outline of variable-feature-set clustering algorithm Split classes into two clusters with the least possible confusion based on confusion matrices (exhaustive search) The algorithm returns both the two clusters of classes and the 1 2 feature set that provide the smallest number of confusions 3 4 5 6 7 8 9 10 11 12 13 14 15 8 2 1 4 5 6 7 9 10 11 12 13 3 14 15 9 8 15 7 5 10 1 11 12 2 4 14 13 6 3 16 Variable feature set clustering schemes Introducing knowledge about class confusions 1 4 2 14 3 16 6 13 5 12 10 11 7 8 9 15 88.29 % classification rate - 31.5% relative average error reduction

Page 14 of 34 Acoustic Event Classification SVM-based Clustering Schemes (Multi-class problem) Comparison of Sequence Discriminant SVM (dynamic data problem) Fuzzy Integral Based Information Fusion for Classification of Highly Confusable Non-Speech Sounds (Fusion and Feature Selection)

Page 15 of 34 Comparison of sequence discriminant SVM (I) A drawback of SVMs when dealing with audio data is their restriction to work with a fixed-length input that represents a time sequence of feature vectors Variable-size sequence of feature vectors 1 Features... N Frames Solution 1: normalize the size of the time sequence of feature vectors Solution 2: find a suitable kernel function that can deal with sequential data of variable length

Page 16 of 34 Comparison of sequence discriminant SVM (II) Rank 1 2 3 4 5 6 7 8 Rank AEs with temporal structure sneeze, music 13 Outpr 13 PolyDTW 13 Fisher 13 Dtak 13 SVMStat 13 Fisher Ratio 13 Gdtw 70 75 80 85 90 95 100 Recognition Rate, % 7 Dtak 7 PolyDTW 7 Fisher 7 Fisher Ratio 7 Gdtw 7 GMM 7 Outpr 13 GMM 7 SVMStat AEs without temporal structure paper pen writing, liquid pouring 1 2 3 4 5 6 7 11 Outpr 10 Outpr 11 Gdtw 11 PolyDTW 10 PolyDTW 10 GMM 11 GMM 11 Fisher Ratio 10 Fisher Ratio 11 SVMStat 10 Fisher 11 Fisher 10 SVMStat 10 Dtak 90 88 86 84 82 80 78 76 74 72 70 Fisher 88.13 LikeRatio Comparison Results 84.32 Outerpr 77.36 77.13 76.21 GDTW DTAK 84.46 84.22 83.1 PolyDTW GMM SVM stat Fisher kernel obtained the best overall results DTW-based kernels work well for sounds that show a temporal structure The observed bias of the classifiers to specific types of classes is a good condition for a successful application of fusion techniques 8 10 Gdtw 11 Dtak 60 65 70 75 80 85 90 95 100 Recognition Rate, %

Page 17 of 34 Acoustic Event Classification SVM-based Clustering Schemes (Multi-class problem) Comparison of Sequence Discriminant SVM (dynamic data problem) Fuzzy Integral Based Information Fusion for Classification of Highly Confusable Non-Speech Sounds (Fusion and Feature Selection)

Page 18 of 34 Fuzzy Integral Information Fusion (I) Acoustic Features 1. Zero crossing rate (1) 2. Short-time energy (1) 3. Fundamental frequency (1) 4. Sub-band log energies (4) 5. Sub-band log energy distribution (4) 6. Sub-band log energy correlations (4) 7. Sub-band log energy time differences (4) 8. Spectral centroid (1) 9. Spectral roll-off (1) 10.Spectral bandwidth (1) Feat 1 Feat 2. Feature and decision level information fusion Feat N Classifier (a) Feat 1 Feat 2. Event hypothesis Feat N Classifier 1 Classifier 2 Classifier N (b) Out 1 Out 2 Out N FI Event hypothesis

Page 19 of 34 Fuzzy Integral and Fuzzy Measure Weighted Arithmetical Mean M = μ( i) WAM h i i Z μ ( i) = 1 i Z μ( i) 0 for all i Z μ 0 M μ 1 μ 2 μ 3 μ 4 μ 12 μ 13 μ 14 μ 23 μ 24 μ 34 Fuzzy Integral (Choquet) = [ z FI ( μ, h) μ( i,..., z) μ( i + 1,..., z) ] h i i= 1 h1... h z S T μ( S) μ( T ) μ ( ø) = 0, μ ( Z ) = 1 h i are the information sources μ 123 μ 124 μ 134 μ 234 μ 1234 The lattice representation of fuzzy measure for n = 4. FM can be leant from data provided by an expert calculated from fuzzy densities

Page 20 of 34 Fuzzy Integral Information Fusion (II) Performance for the 10 SVM systems running of each feature type, the combination of the 10 features at the feature-level with SVM, and the fusion on the decision-level with WAM and FI operators. FM is leant from data.

Page 21 of 34 Fuzzy Integral Information Fusion (III) Importance Interaction Importance and Interactions are extracted from Fuzzy Measure The most important 4, 6, 7 Highly redundant (light cell) {4,5}, {1,8} Feature selection: With feature selection based on information extracted from FM an improvement over the baseline was obtained Fusion of different classifiers: SVM with signal level features + HMM with frame-level features

Page 22 of 34 Fuzzy Integral Information Fusion (IV) Fusion of several information sources with FI formalism showed significant improvements with respect to single information source Decision-level FI fusion showed comparable results with SVM feature level fusion Importance and the degree of interaction can be used for feature selection and give a valuable insight into the problem FI may be a good choice when feature-level fusion is not an option (different nature of the involved features, application/technique dependence Finalist of the Best Journal Paper Award of 2008 in Speech Technologies

Page 23 of 34 Acoustic Event Detection UPC Acoustic Event Detection Systems 2006 UPC Acoustic Event Detection Systems 2007

Page 24 of 34 UPC Acoustic Event Detection Systems 2006 Feature extraction SVM silence / non-silence segmentation SVM classification Segmentation step Classification step 1st SVM for pre-segmentation based on silence/non-silence detection. 2nd SVM trained on 12 classes + speech + unknown for event detection.

Page 25 of 34 UPC Acoustic Event Detection Systems 2007 200ms 1 second... SVM SVM SVM SVM Post-processing Final decision windows System output AE assignment segments

Page 26 of 34 Participation in International Evaluations CHIL Dry-Run Evaluations 2004 (1 place 2 participants) CHIL Evaluations 2005 (1 place 2 participants) CLEAR Evaluations 2006 (1 place 3 participants) CLEAR Evaluations 2007 (2 place 8 participants)

Page 27 of 34 Speech Activity Detection AED is event-based, SAD is frame-based Dataset reduction algorithm Metric NIST error rate = Duration of Incorrect Decisions / Duration of all Speech Segments Features Feature Extractor FF ΔFF ΔΔFF ΔlogE lfed hfed ldam LDA t-15... t... t+15 hfed t-9... t... t+9 lfed Current frame t-9... t... t+9

Page 28 of 34 Enhanced SVM Training for SAD Dataset reduction algorithm Step 1. Divide all the data into chunks of 1000 samples per chunk. Step 2. Train a PSVM on each chunk performing 5-fold cross-validation (CV) Step 3. Apply an appropriate threshold to the highest/lowest CV accuracy Step4. Train a classical SVM on the amount of data selected in Step 3. Classical SVM PSVM H -1 H 1 H -1 H 1 Separating hyperplane H 0 Separating hyperplane H 0 NIST SAD Evaluations 2006 best SAD system out of 10 systems with 4.88 % error rate

Page 29 of 34 UPC Smart-Room Activities UPC Smart-Room Database Recording Real-Time Implementation Demonstrations

Page 30 of 34 UPC Smart-Room y Hammer_xx x Cam1 Cam6 12 Cam4...... MkIII_001 MkIII_064 Hammer_13 to Hammer_20 09 Cam8 zenithal cam 10 11 05 06 07 08 Hammer_xx 3.966 m Cam2 PTZ Cam Cam5 03 02 04 01 Cam3 5.245 m Hammer_xx

Page 31 of 34 Real-Time Implementation Implementation of AED component as a smart-flow client

Page 32 of 34 Demonstrations (I) Participation in several demonstrations Mockup demo: Journalist scenario Reliably detected AEs: door knock, door opening/closing, speech, applause, and key jingles. XIL LA LA

Page 33 of 34 Demonstrations (II) Participation in several demonstrations Virtual UPC smart-room: 3D person tracking, ASL, and AED

Page 34 of 34 Demonstrations (III) Participation in several demonstrations Demo of AED & ASL Real-time video VIDEO >> Graphical representation