Hands On: Multimedia Methods for Large Scale Video Analysis (Lecture) Dr. Gerald Friedland,

Hands On: Multimedia Methods for Large Scale Video Analysis (Lecture) Dr. Gerald Friedland, fractor@icsi.berkeley.edu 1

Today Recap: Some more Machine Learning Multimedia Systems An example Multimedia System 2

Recap: Architecture of Content Analysis Algorithms 3

Recap: Some More Machine Learning k-nearest Neighbors Neural Networks SVMs HMMs 4

k-nearest Neighbors 5

Another Magic Duo Histograms are the most practically used image models Nearest Neighbors (with Euclidean Distance) is the most used technique for visual features are comparison 6

Neural Networks (MLPs) 7

Linear Separation 8

Support Vector Machines 9

Hidden Markov Models a s: State transitions b s: Likelihood observations 10

Hidden Markov Models 11

Multimedia: Definition Entry: multimedia Function: noun plural but singular or plural in construction Date: 1950 A technique (as the combining of sound, video, and text) for expressing ideas (as in communication, entertainment, or art) in which several media are employed; also: something (as software) using or facilitating such a technique. (Merriam-Webster online dictionary) 12

Multimedia Content Analysis Automatic analysis of the content (semantics) contained in data directly encoded for human perception (audio, images, video, touch) and its associated meta data (natural text, computerencoded data). 13

Multimodal Integration 14

Multimodal Integration... is a field of cognitive psychology. 14

Multimodal Integration... is a field of cognitive psychology. Before 1960: Unimodal approach 14

Multimodal Integration... is a field of cognitive psychology. Before 1960: Unimodal approach Initial results in the 1960 s, recently hyped again (2003+) 14

Multimodal Integration Human psychology suggests: 15

Multimodal Integration Human psychology suggests: Multiple sensory inputs increase the speed of the output (Hershenson 1962) 15

Multimodal Integration Human psychology suggests: Multiple sensory inputs increase the speed of the output (Hershenson 1962) Uncertainty in sensory domains results in increased dependency of multisensory integration (Alais & Burr 2004) 15

Multimodal Integration In computer science: 16

Multimodal Integration In computer science: How to create systems that benefit from multimodal integration in similar ways the brain does, i.e. they are 16

Multimodal Integration In computer science: How to create systems that benefit from multimodal integration in similar ways the brain does, i.e. they are more accurate, robust, and/or faster than unimodal state of the art and/or 16

Recap: Architecture of Content Analysis Algorithms 17

Generic Scheme of a Classification Algorithm Signal reduce dimensions Some signal is observed and reduced... Features build abstraction...to the essentials relevant to the problem,... Models generate score Decision output decision...statistical models are used to compute a score (e.g. probabilities) for the given observations,...... so that a decision function can decide on the classification. Result 18

Feature-Level Integration Signal 1 reduce dimensions Features Signal 2 reduce dimensions Features + build abstraction Models generate score Decision output decision Result Features are integrated before the model layer using a function +. For example concatenation: n-dimensional vector + m- dimensional vector = n+mdimensional vector 19

Model-Level Integration Signal 1 reduce dimensions Features build abstraction Signal 2 reduce dimensions Features build abstraction Models generate scores Models generate scores + combined score Decision output decision Output scores are integrated using a function +. For example weighted combined log-likelihoods. Result 20

Decision-Level Integration Signal 1 reduce dimensions Features build abstraction Signal 2 reduce dimensions Features build abstraction Models generate score Models generate score Decision output decision Decision output decision Output decision are fused using a function +. For example majority voting. + output decision Result WARNING: Meta-data fusion in general is a difficult research problem. 21

Remarks Signal-level integration is unlikely because of intractable data dimensionality. Multi-Level integration is also possible. In reality, a classification algorithm is more complicated than this scheme (eg. feedback loops) The integration function + may also be learned automatically. 22

Example System Dialocalization: Acoustic Speaker Diarization and Visual Localization as Joint Optimization Problem G. Friedland, C. Yeo., H. Hung: Dialocalizaton: Acoustic Speaker Diarization and Visual Localization as Joint Optimization Problem, ACM Transactions on Multimedia Computing, Communications, and Applications, Vol. 6, No. 4, Article 27, November 2010 23

Current Common Sense Localization (Computer Vision Task) Localization in space. Speaker Diarization (Speech Processing Task)

Example: Speaker Diarization Audiotrack: Speaker localization on timeline: who spoke when. 25

Example: Speaker Diarization Audiotrack: Segmentation: Speaker localization on timeline: who spoke when. 25

Example: Speaker Diarization Audiotrack: Segmentation: Clustering: Speaker localization on timeline: who spoke when. 25

Speaker Diarization... tries to answer the question: who spoke when? using a single microphone input without prior knowledge of anything (#speakers, language, text, etc...) 26

Single Audio Stream Audio Signal Feature Extraction MFCC Speech Only Segmentation Diarization Engine Metadata Speech/Non- Speech Detector Clustering 27

Bottom-Up Algorithm Cluster1 Cluster2 Cluster3 Cluster1 Cluster2 Cluster3 Cluster2 Cluster1 Cluster2 Cluster3 Start with too many clusters (initialized randomly) Purify clusters by comparing and merging similar clusters Resegment and repeat until no more merging needed 28

Bottom-Up Algorithm Initialization Start with too many clusters (initialized randomly) Purify clusters by comparing and merging similar clusters Resegment and repeat until no more merging needed 28

Bottom-Up Algorithm Initialization Cluster1 Cluster2 Cluster3 Cluster1 Cluster2 Cluster3 Start with too many clusters (initialized randomly) Purify clusters by comparing and merging similar clusters Resegment and repeat until no more merging needed 28

Bottom-Up Algorithm Initialization Cluster1 Cluster2 Cluster3 Cluster1 Cluster2 Cluster3 (Re-)Training Start with too many clusters (initialized randomly) Purify clusters by comparing and merging similar clusters Resegment and repeat until no more merging needed 28

Bottom-Up Algorithm Initialization Cluster1 Cluster2 Cluster3 Cluster1 Cluster2 Cluster3 (Re-)Training (Re-)Alignment Start with too many clusters (initialized randomly) Purify clusters by comparing and merging similar clusters Resegment and repeat until no more merging needed 28

Bottom-Up Algorithm Initialization Cluster1 Cluster2 Cluster3 Cluster1 Cluster2 Cluster3 (Re-)Training (Re-)Alignment Cluster2 Cluster1 Cluster2 Cluster3 Start with too many clusters (initialized randomly) Purify clusters by comparing and merging similar clusters Resegment and repeat until no more merging needed 28

Bottom-Up Algorithm Initialization Cluster1 Cluster2 Cluster3 Cluster1 Cluster2 Cluster3 (Re-)Training Yes Merge two Clusters? (Re-)Alignment Cluster2 Cluster1 Cluster2 Cluster3 Start with too many clusters (initialized randomly) Purify clusters by comparing and merging similar clusters Resegment and repeat until no more merging needed 28

Bottom-Up Algorithm Initialization Cluster2 Cluster1 Cluster2 Cluster2 (Re-)Training Yes Merge two Clusters? (Re-)Alignment Cluster2 Cluster1 Cluster2 Cluster3 Start with too many clusters (initialized randomly) Purify clusters by comparing and merging similar clusters Resegment and repeat until no more merging needed 28

Bottom-Up Algorithm Initialization Cluster2 Cluster1 Cluster2 Cluster2 (Re-)Training Yes Merge two Clusters? (Re-)Alignment Cluster1 Cluster2 Cluster1 Cluster2 Start with too many clusters (initialized randomly) Purify clusters by comparing and merging similar clusters Resegment and repeat until no more merging needed 28

Bottom-Up Algorithm Initialization Cluster2 Cluster1 Cluster2 Cluster2 (Re-)Training Yes No Merge two Clusters? End (Re-)Alignment Cluster1 Cluster2 Cluster1 Cluster2 Start with too many clusters (initialized randomly) Purify clusters by comparing and merging similar clusters Resegment and repeat until no more merging needed 28

Current Accuracy Single-Stream System ICSI Devset 07 Eval07 VACE (AMI) Speech/Non Speech Error 6.4% 6.8% 12.2% Speaker Error 11.3% 14.9% 19.89% Diarization Error Rate 17.57% 21.24% 32.09% ICSI Speaker Diarization Engine as participated in NIST RT07. 29

Goals 30

Goals Improve Robustness while... 30

Goals Improve Robustness while......increasing or at least keeping the speed. 30

Goals Improve Robustness while......increasing or at least keeping the speed. Need to identify speakers, eg by association with face. 30

Goals Improve Robustness while......increasing or at least keeping the speed. Need to identify speakers, eg by association with face. Idea: Multimodality could help 30

Multimodal Speaker tries to answer the question: who spoke when? using a single microphone and single camera input without prior knowledge of anything (#speakers, language, text, etc...) 31

AMI Meeting Room Setup 32

AMI Meetings: Real- World Problems Close-view still not good enough for face detection People lean back and forward, stand up, walk around, leave the room, etc... 33

Even more Problems: Single Camera View Very low resolution per participant Partial occlusions 34

Audio/Visual Correlation Assumptions Camera captures all participants, most of the time. Speaker locations have limited spatial variance. Speakers have more visual activity than non-speakers. 35

Multimodal Diarization Audio Signal Feature Extraction MFCC Speech/Non- Speech Detector MFCC (only Speech) Segmentation Diarization Engine "Who spoke when" Events Feature Extraction Video Activity (only Speech Regions) Clustering Video Signal 36

Video Feature Extraction MPEG-4 Video Detect Skin Blocks Avg. Motion Vectors Divide Frames into n Regions n-dimensional activity vector Windowsize: 400ms 37

Model-Level Integration Audio Video MFCC Activity GMMs likelihoods GMMs likelihoods + Decision Result 38

Multimodal Diarization Results 12 Meetings from AMI corpus VACE Meetings 39

Multimodal vs Unimodal Video features alone perform poorly! Error/System Four Cameras Random Speaker Error 68.80% 75.00% 40

Multimodal vs Unimodal Video features alone perform poorly! Error/System Four Cameras Random Speaker Error 68.80% 75.00% Warning: Designing multimodal algorithms may require integrated thinking. Blackbox combination of unimodal approaches may not work. 40

Agglomerative Clustering + Video activities in each region Cepstral audio features Models containing MFCC and video activity vectors 41

Who Spoke When?? + Video activities in each region Cepstral audio Features Which model fits best? Speaker X 42

Where is the Speaker? Speaker X +? Speaker from Diarization all possible activity locations for speakers Which activity location fits best? 43

Speaker Localization Audio Signal Feature Extraction Segmentation MFCC Speech/Non- Speech Detector MFCC (only Speech) Diarization Engine Clustering "who spoke when" Events Video Activity (only Speech Regions) Feature Extraction Invert Visual Models "where the speaker was" Video Signal 44

Speaker Localization and Diarization 45

Conclusion I Speaker Diarization = Speaker Localization No need to treat as separate problem! 46

Conclusion II 47

Conclusion II Multimodal diarization with video results in: 47

Conclusion II Multimodal diarization with video results in: higher accuracy at low computational overhead 47

Conclusion II Multimodal diarization with video results in: higher accuracy at low computational overhead speaker localisation as a by-product = Multimodal Synergy 47

Conclusion III 48

Conclusion III It is possible to create a machine learning system that benefits from multimodal integration such that 48

Conclusion III It is possible to create a machine learning system that benefits from multimodal integration such that it is more accurate than the unimodal state of the art and it 48

Conclusion III It is possible to create a machine learning system that benefits from multimodal integration such that it is more accurate than the unimodal state of the art and it offers qualitative improvements over unimodal approaches (here: more semantic output) 48

Next Week (Project Meeting) Benjamin Elizalde on ICSIs TRECVID MED 2012 System

Next Week (Lecture) How to estimate computational needs 50