
Hands On: Multimedia Methods for Large Scale Video Analysis (Lecture) Dr. Gerald Friedland, fractor@icsi.berkeley.edu 1

Today Recap: Some more Machine Learning Multimedia Systems An example Multimedia System 2

Recap: Architecture of Content Analysis Algorithms 3

Recap: Some More Machine Learning k-nearest Neighbors Neural Networks SVMs HMMs 4

k-nearest Neighbors 5

Another Magic Duo Histograms are the most widely used image models in practice. Nearest Neighbors (with Euclidean distance) is the most widely used technique for visual feature comparison. 6
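Since the slide stays at the concept level, here is a minimal sketch (not from the lecture) of this histogram-plus-nearest-neighbor duo; the 64-bin histograms, labels, and k value are all placeholder assumptions.

import numpy as np

def knn_classify(query_hist, train_hists, train_labels, k=5):
    """Classify an image histogram by majority vote among its k nearest
    training histograms under Euclidean distance (illustrative sketch)."""
    dists = np.linalg.norm(train_hists - query_hist, axis=1)  # Euclidean distance to every training sample
    nearest = np.argsort(dists)[:k]                           # indices of the k closest histograms
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)                   # majority vote

# Hypothetical usage: 100 training images described by 64-bin gray-level histograms
train_hists = np.random.rand(100, 64)
train_labels = np.random.choice(["indoor", "outdoor"], size=100).tolist()
query = np.random.rand(64)
print(knn_classify(query, train_hists, train_labels, k=5))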

Neural Networks (MLPs) 7

Linear Separation 8

Support Vector Machines 9

Hidden Markov Models a's: state transition probabilities b's: observation likelihoods 10

Hidden Markov Models 11
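To make the a's (state transitions) and b's (observation likelihoods) concrete, here is a small forward-algorithm sketch for a discrete HMM; the two-state model and all of its probabilities are invented purely for illustration.

import numpy as np

# Toy discrete HMM: a's = state transition probabilities, b's = observation likelihoods
pi = np.array([0.6, 0.4])                 # initial state distribution
A  = np.array([[0.7, 0.3],                # a[i][j] = P(state j at t+1 | state i at t)
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],                # b[i][o] = P(observation o | state i)
               [0.2, 0.8]])

def forward_likelihood(obs):
    """P(observation sequence | model), computed with the forward algorithm."""
    alpha = pi * B[:, obs[0]]             # initialize with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # propagate through the a's, weight by the b's
    return alpha.sum()

print(forward_likelihood([0, 0, 1, 1]))   # likelihood of a toy observation sequence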

Multimedia: Definition Entry: multimedia Function: noun plural but singular or plural in construction Date: 1950 A technique (as the combining of sound, video, and text) for expressing ideas (as in communication, entertainment, or art) in which several media are employed; also: something (as software) using or facilitating such a technique. (Merriam-Webster online dictionary) 12

Multimedia Content Analysis Automatic analysis of the content (semantics) contained in data directly encoded for human perception (audio, images, video, touch) and its associated metadata (natural text, computer-encoded data). 13

Multimodal Integration... is a field of cognitive psychology. Before 1960: unimodal approach. Initial results in the 1960s, recently hyped again (2003+). 14

Multimodal Integration Human psychology suggests: Multiple sensory inputs increase the speed of the output (Hershenson 1962). Uncertainty in sensory domains results in increased dependency on multisensory integration (Alais & Burr 2004). 15

Multimodal Integration In computer science: How to create systems that benefit from multimodal integration in similar ways as the brain does, i.e. they are more accurate, robust, and/or faster than the unimodal state of the art and/or offer qualitative improvements over unimodal approaches. 16

Recap: Architecture of Content Analysis Algorithms 17

Generic Scheme of a Classification Algorithm Signal -> (reduce dimensions) -> Features -> (build abstraction) -> Models -> (generate score) -> Decision -> (output decision) -> Result. Some signal is observed and reduced to the essentials relevant to the problem; statistical models are used to compute a score (e.g. probabilities) for the given observations, so that a decision function can decide on the classification. 18

Feature-Level Integration Signal 1 -> (reduce dimensions) -> Features; Signal 2 -> (reduce dimensions) -> Features; + -> (build abstraction) -> Models -> (generate score) -> Decision -> (output decision) -> Result. Features are integrated before the model layer using a function +. For example concatenation: n-dimensional vector + m-dimensional vector = (n+m)-dimensional vector. 19
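A minimal sketch of the "+" function as concatenation; the dimensionalities and values are placeholders, not taken from the lecture.

import numpy as np

audio_features = np.random.rand(19)   # e.g. an n=19-dimensional MFCC vector (placeholder values)
video_features = np.random.rand(8)    # e.g. an m=8-dimensional activity vector (placeholder values)

# Feature-level integration: concatenate before any model sees the data
fused = np.concatenate([audio_features, video_features])  # (n+m)-dimensional vector
assert fused.shape == (19 + 8,)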

Model-Level Integration Signal 1 -> Features -> Models -> scores; Signal 2 -> Features -> Models -> scores; + -> combined score -> Decision -> (output decision) -> Result. Output scores are integrated using a function +, for example weighted combined log-likelihoods. 20
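A sketch of the weighted log-likelihood combination mentioned above; the weight alpha and the per-cluster scores are made-up placeholders.

import numpy as np

def combine_scores(audio_loglik, video_loglik, alpha=0.9):
    """Model-level integration: weighted sum of per-stream log-likelihoods.
    alpha controls the relative trust in the audio stream (placeholder value)."""
    return alpha * audio_loglik + (1.0 - alpha) * video_loglik

# Hypothetical per-cluster scores for one segment: pick the best-scoring cluster
audio_ll = np.array([-120.3, -118.7, -125.0])   # log-likelihood of each speaker model (audio GMM)
video_ll = np.array([-40.1, -38.2, -41.5])      # log-likelihood of each speaker model (video GMM)
print(int(np.argmax(combine_scores(audio_ll, video_ll))))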

Decision-Level Integration Signal 1 -> Features -> Models -> Decision; Signal 2 -> Features -> Models -> Decision; output decisions are fused using a function +, for example majority voting; + -> output decision -> Result. WARNING: Meta-data fusion in general is a difficult research problem. 21
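And a sketch of decision-level fusion by majority voting over per-modality decisions; the label strings are invented for illustration.

from collections import Counter

def majority_vote(decisions):
    """Decision-level integration: each modality outputs a label, the fusion
    function '+' returns the most frequent one (ties broken arbitrarily)."""
    return Counter(decisions).most_common(1)[0][0]

# Hypothetical per-modality decisions for one video segment
print(majority_vote(["speaker_A", "speaker_B", "speaker_A"]))  # -> "speaker_A"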

Remarks Signal-level integration is rarely feasible because of intractable data dimensionality. Multi-level integration is also possible. In reality, a classification algorithm is more complicated than this scheme (e.g. feedback loops). The integration function + may also be learned automatically. 22

Example System Dialocalization: Acoustic Speaker Diarization and Visual Localization as Joint Optimization Problem G. Friedland, C. Yeo, H. Hung: "Dialocalization: Acoustic Speaker Diarization and Visual Localization as Joint Optimization Problem", ACM Transactions on Multimedia Computing, Communications, and Applications, Vol. 6, No. 4, Article 27, November 2010. 23

Current Common Sense Localization (Computer Vision Task): localization in space. Speaker Diarization (Speech Processing Task): localization in time.

Example: Speaker Diarization Audiotrack: Segmentation: Clustering: Speaker localization on timeline: who spoke when. 25

Speaker Diarization... tries to answer the question: who spoke when? using a single microphone input without prior knowledge of anything (#speakers, language, text, etc...) 26

Single Audio Stream Audio Signal -> Feature Extraction (MFCC) -> Speech/Non-Speech Detector -> MFCC (speech only) -> Diarization Engine (Segmentation + Clustering) -> Metadata 27

Bottom-Up Algorithm Initialization -> (Re-)Training -> (Re-)Alignment -> Merge two Clusters? (Yes: repeat, No: End). Start with too many clusters (initialized randomly). Purify clusters by comparing and merging similar clusters. Resegment and repeat until no more merging is needed. 28
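The slide describes the loop only at a high level; the sketch below is a heavily simplified stand-in, with single diagonal Gaussians instead of the GMM/HMM speaker models and a plain distance threshold instead of the BIC-style merge test used in the ICSI system, so it illustrates the control flow rather than the actual implementation.

import numpy as np

def train(frames, labels, k):
    """(Re-)Training: fit one diagonal Gaussian per cluster (stand-in for GMM training)."""
    models = []
    for c in range(k):
        sel = frames[labels == c]
        if len(sel) == 0:                       # empty cluster: fall back to global statistics
            sel = frames
        models.append((sel.mean(0), sel.var(0) + 1e-6))
    return models

def realign(frames, models):
    """(Re-)Alignment: assign every frame to the cluster with the highest Gaussian log-likelihood."""
    scores = np.stack([-0.5 * (((frames - m) ** 2 / v) + np.log(v)).sum(1) for m, v in models])
    return scores.argmax(0)

def diarize(frames, k=8, merge_threshold=1.0, iters=5):
    """Bottom-up sketch: start with too many clusters, purify them, merge until no merge is needed."""
    labels = np.random.randint(k, size=len(frames))            # random initialization
    while k > 1:
        for _ in range(iters):                                  # retrain and resegment
            models = train(frames, labels, k)
            labels = realign(frames, models)
        means = np.array([m for m, _ in models])
        dist = np.linalg.norm(means[:, None] - means[None, :], axis=2) + np.eye(k) * 1e9
        i, j = np.unravel_index(dist.argmin(), dist.shape)      # two most similar clusters
        if dist[i, j] > merge_threshold:                        # stand-in for the BIC merge test
            break                                               # no more merging needed -> End
        labels[labels == j] = i                                 # merge cluster j into cluster i
        labels = np.unique(labels, return_inverse=True)[1]      # relabel to 0..k-2
        k -= 1
    return labels

# Hypothetical usage on made-up 19-dimensional MFCC-like frames from three "speakers"
frames = np.vstack([np.random.randn(500, 19) + offset for offset in (0.0, 5.0, 10.0)])
print(np.unique(diarize(frames)))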

Current Accuracy

Single-Stream System       ICSI Devset 07   Eval07    VACE (AMI)
Speech/Non-Speech Error    6.4%             6.8%      12.2%
Speaker Error              11.3%            14.9%     19.89%
Diarization Error Rate     17.57%           21.24%    32.09%

ICSI Speaker Diarization Engine as it participated in NIST RT07. 29
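For reference, the Diarization Error Rate reported by the NIST scoring tool is, to my understanding, the missed-speech, false-alarm, and speaker-confusion time divided by the total scored speech time; this is why each DER value above is roughly the sum of the two error rows:

\mathrm{DER} = \frac{T_{\text{missed speech}} + T_{\text{false-alarm speech}} + T_{\text{speaker confusion}}}{T_{\text{total scored speech}}} \approx E_{\text{speech/non-speech}} + E_{\text{speaker}}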

Goals Improve robustness while increasing or at least keeping the speed. Need to identify speakers, e.g. by association with a face. Idea: multimodality could help. 30

Multimodal Speaker Diarization... tries to answer the question: who spoke when? using a single microphone and a single camera input without prior knowledge of anything (#speakers, language, text, etc.) 31

AMI Meeting Room Setup 32

AMI Meetings: Real-World Problems Close-view still not good enough for face detection. People lean back and forward, stand up, walk around, leave the room, etc. 33

Even more Problems: Single Camera View Very low resolution per participant Partial occlusions 34

Audio/Visual Correlation Assumptions Camera captures all participants, most of the time. Speaker locations have limited spatial variance. Speakers have more visual activity than non-speakers. 35

Multimodal Diarization Audio Signal -> Feature Extraction (MFCC) -> Speech/Non-Speech Detector -> MFCC (speech only) -> Diarization Engine (Segmentation + Clustering) -> "Who spoke when" events. Video Signal -> Feature Extraction -> Video Activity (speech regions only) -> Diarization Engine. 36

Video Feature Extraction MPEG-4 Video -> Detect Skin Blocks -> Avg. Motion Vectors -> Divide Frames into n Regions -> n-dimensional activity vector. Window size: 400 ms. 37
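A rough sketch of the region-activity idea described above: average motion-vector magnitudes, restricted to skin-colored blocks, over each of n frame regions within one 400 ms window. The motion and skin arrays are placeholders, not output of a real MPEG-4 decoder.

import numpy as np

def activity_vector(motion_mag, skin_mask, n_regions=4):
    """Average motion-vector magnitude of skin blocks in each vertical frame region.
    motion_mag, skin_mask: (frames_in_window, H_blocks, W_blocks) arrays for one 400 ms window."""
    masked = motion_mag * skin_mask                                    # keep motion only where skin was detected
    columns = np.array_split(np.arange(masked.shape[2]), n_regions)    # divide the frame into n regions
    return np.array([masked[:, :, cols].mean() for cols in columns])   # n-dimensional activity vector

# Hypothetical 400 ms window at 25 fps: 10 frames of 9x12 macroblock motion magnitudes
motion = np.abs(np.random.randn(10, 9, 12))
skin   = (np.random.rand(10, 9, 12) > 0.5).astype(float)
print(activity_vector(motion, skin, n_regions=4))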

Model-Level Integration Audio (MFCC) -> GMMs -> likelihoods; Video (Activity) -> GMMs -> likelihoods; + -> Decision -> Result. 38

Multimodal Diarization Results 12 meetings from the AMI corpus; VACE meetings. 39

Multimodal vs Unimodal Video features alone perform poorly!

Error / System   Four Cameras   Random
Speaker Error    68.80%         75.00%

Warning: Designing multimodal algorithms may require integrated thinking. Blackbox combination of unimodal approaches may not work. 40

Agglomerative Clustering Video activities in each region + cepstral audio features -> models containing MFCC and video activity vectors. 41

Who Spoke When? Video activities in each region + cepstral audio features -> Which model fits best? -> Speaker X 42

Where is the Speaker? Speaker X (speaker from diarization) + all possible activity locations for speakers -> Which activity location fits best? 43

Speaker Localization Audio Signal -> Feature Extraction (MFCC) -> Speech/Non-Speech Detector -> MFCC (speech only) -> Diarization Engine (Segmentation + Clustering) -> "who spoke when" events. Video Signal -> Feature Extraction -> Video Activity (speech regions only) -> Invert Visual Models -> "where the speaker was". 44

Speaker Localization and Diarization 45

Conclusion I Speaker Diarization = Speaker Localization. No need to treat them as separate problems! 46

Conclusion II Multimodal diarization with video results in: higher accuracy at low computational overhead, and speaker localization as a by-product = multimodal synergy. 47

Conclusion III It is possible to create a machine learning system that benefits from multimodal integration such that it is more accurate than the unimodal state of the art and it offers qualitative improvements over unimodal approaches (here: more semantic output) 48

Next Week (Project Meeting) Benjamin Elizalde on ICSI's TRECVID MED 2012 System

Next Week (Lecture) How to estimate computational needs 50