Multiple Kernel Learning for Emotion Recognition in the Wild


Multiple Kernel Learning for Emotion Recognition in the Wild. Karan Sikka, Karmen Dykstra, Suchitra Sathyanarayana, Gwen Littlewort and Marian S. Bartlett, Machine Perception Laboratory, UCSD. EmotiW Challenge, ICMI, 2013.

Task. Emotion recognition on the Acted Facial Expressions in the Wild (AFEW) dataset. Video clips collected from Hollywood movies. Classification into 7 emotion categories: Anger, Disgust, Fear, Happiness, Neutral, Sadness and Surprise.

Challenges in AFEW. Videos capture emotions under real-world conditions. Other challenges: pose variations; occlusion; the spontaneous nature of the expressions; variations among subjects; and a small number of training samples given the complexity of the problem (~60 clips per emotion).

Our Approach. A multimodal classification system comprising: 1. Face extraction and alignment, handling non-frontal faces. 2. Feature extraction: visual and audio features. 3. Feature fusion using Multiple Kernel Learning (MKL).

Our Approach. [Pipeline overview diagram.]

Face Extraction and Alignment. Combined a state-of-the-art face detection method with a state-of-the-art tracking method. Face detection: deformable part-based model by Zhu and Ramanan (CVPR 2012), which employs a mixture-of-trees model with a shape model. Its ability to handle non-frontal head poses is critical for faces in AFEW.

Face Extraction and Alignment. Fiducial-point tracker: based on the supervised descent method of Xiong and De la Torre (CVPR 2013). Returns 49 fiducial points. The output of the detector is fed to the tracker, and the tracker is re-initialized from the detector if it fails. Faces are aligned to a reference face using an affine transform.
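
The alignment step can be sketched as a least-squares affine fit between detected fiducial points and a reference layout. This is a minimal numpy illustration, not the authors' implementation; the point sets are toy data:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2x3 affine transform mapping src (N,2) points to dst (N,2)."""
    n = src.shape[0]
    src_h = np.hstack([src, np.ones((n, 1))])   # homogeneous coordinates (N, 3)
    # Solve src_h @ X = dst in the least-squares sense; the affine matrix is X.T.
    X, *_ = np.linalg.lstsq(src_h, dst, rcond=None)
    return X.T

def apply_affine(A, pts):
    pts_h = np.hstack([pts, np.ones((pts.shape[0], 1))])
    return pts_h @ A.T

# Toy check: recover a known rotation + translation from 4 point pairs.
theta = np.deg2rad(10.0)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
src = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
dst = src @ R.T + np.array([2.0, -1.0])
A = fit_affine(src, dst)
assert np.allclose(apply_affine(A, src), dst)
```

In the real pipeline, `src` would be the 49 tracked fiducials and `dst` the corresponding landmarks of the reference face, and the fitted transform would be applied to warp the whole frame.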

Multimodal Features. Three feature modalities: facial features such as BoW and HOG; sound features; and scene or context features such as GIST.

Facial Features. 1. Bag of Words (BoW): state-of-the-art pipeline for static expression recognition by Sikka et al. (ECCV 2012). Based on multi-scale dense SIFT features (4 scales). Encoding using LLC (Locality-constrained Linear Coding, Wang et al. 2010). Spatial information encoded by pooling over spatial pyramids.
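
The LLC encoding step can be sketched with the approximated LLC solver of Wang et al. (2010): each descriptor is reconstructed from its k nearest codewords under a sum-to-one constraint. A minimal numpy sketch on random toy data (the dictionary size, k and eps here are illustrative choices, not the paper's settings):

```python
import numpy as np

def llc_encode(x, codebook, k=5, eps=1e-4):
    """Approximated LLC code for one descriptor x (d,) given a codebook (M, d).

    Returns an (M,) code that is non-zero only on the k nearest codewords
    and whose entries sum to 1.
    """
    # 1. Restrict to the k nearest codewords (locality constraint).
    dists = np.linalg.norm(codebook - x, axis=1)
    idx = np.argsort(dists)[:k]
    # 2. Solve the constrained least-squares system on those neighbors.
    z = codebook[idx] - x                       # shift neighbors to the origin
    C = z @ z.T                                 # local covariance (k, k)
    C += eps * np.trace(C) * np.eye(k)          # regularize for stability
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                                # enforce the sum-to-one constraint
    code = np.zeros(codebook.shape[0])
    code[idx] = w
    return code

rng = np.random.default_rng(0)
codebook = rng.normal(size=(200, 128))          # e.g. a 200-word dictionary
desc = rng.normal(size=128)                     # one dense-SIFT descriptor
code = llc_encode(desc, codebook)
assert np.isclose(code.sum(), 1.0) and np.count_nonzero(code) <= 5
```

The per-descriptor codes are then pooled over the cells of a spatial pyramid to form the image-level BoW feature.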

Facial Features. Video features obtained by max-pooling over per-frame BoW features (Sikka et al., AFGR 2013); more robust than Gabor and LBP features. Included multiple BoW features constructed with different dictionary sizes (200, 400, 600), motivated by recent success with multiple-dictionary classification (e.g. Aly, Munich & Perona 2011; Zhang et al. 2009).
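
Max-pooling frame-level BoW histograms into one clip-level descriptor keeps, per visual word, its strongest response across all frames. A sketch with illustrative frame count and vocabulary size:

```python
import numpy as np

rng = np.random.default_rng(1)
frame_bow = rng.random((75, 600))      # e.g. 75 frames x 600-word vocabulary
video_feat = frame_bow.max(axis=0)     # (600,) fixed-length clip descriptor

assert video_feat.shape == (600,)
# The pooled feature dominates every individual frame's histogram.
assert np.all(video_feat >= frame_bow[0])
```

Because the pooled vector has a fixed length regardless of clip duration, clips of different lengths become directly comparable.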

Facial Features. 2. LPQ-TOP (Local Phase Quantization over Three Orthogonal Planes, Päivärinta et al. 2011). A texture descriptor for videos and a robust variant of LBP-TOP. Three sets of features extracted with window sizes of 5, 7 and 9.

Facial Features. 3. HOG: Histogram of Oriented Gradients. Describes the shape of objects via the distribution of local image gradients; widely used for object detection and static facial expression analysis. 4. PHOG: a pyramid-based variant of HOG. Video features again obtained by max-pooling over frame features.
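
The core HOG step, a magnitude-weighted histogram of gradient orientations per cell, can be sketched as follows. This is a toy single-cell version: full HOG adds block normalization over grids of cells, and PHOG repeats the computation over a spatial pyramid:

```python
import numpy as np

def cell_hog(patch, n_bins=9):
    """Orientation histogram of image gradients for one cell (toy HOG step)."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)                             # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0       # unsigned orientation
    bins = np.minimum((ang / (180.0 / n_bins)).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return hist / (np.linalg.norm(hist) + 1e-8)        # L2 normalization

# A vertical edge yields horizontal gradients, i.e. orientation near 0 degrees.
patch = np.zeros((8, 8))
patch[:, 4:] = 1.0
h = cell_hog(patch)
assert h.argmax() == 0
```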

Sound Features. Audio features improve the performance of expression recognition systems (cf. the AVEC challenge). Employed paralinguistic descriptors from the audio channel, e.g. MFCCs and fundamental frequency, summarized using functionals such as max and min. 38 low-level descriptors + 21 functionals. Features provided by the challenge organizers.
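
Functionals collapse each per-frame low-level descriptor (LLD) trajectory into fixed-length statistics, so clips of different durations produce comparable feature vectors. A sketch with four illustrative functionals (the challenge set used 21) and random toy data:

```python
import numpy as np

def apply_functionals(lld):
    """Summarize per-frame LLDs (num_frames, num_descriptors) into statistics."""
    funcs = [np.max, np.min, np.mean, np.std]
    # One statistic per descriptor per functional, concatenated.
    return np.concatenate([f(lld, axis=0) for f in funcs])

rng = np.random.default_rng(2)
lld = rng.normal(size=(300, 38))     # e.g. 300 frames x 38 low-level descriptors
feat = apply_functionals(lld)
assert feat.shape == (4 * 38,)
```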

Scene or Context Features. Investigated whether scene information is relevant to recognition on AFEW. Two sets of features: 1. BoW features extracted over the entire image instead of just faces. 2. GIST features (Oliva et al.): the output of a bank of multi-scale oriented filters followed by PCA, popular for summarizing scene context.

Feature Fusion. Multiple features encode complementary information that is discriminative for a task, so combining them improves classification accuracy. Techniques for fusing features: 1. Feature concatenation. 2. Decision-level (classifier-level) fusion. 3. Multiple Kernel Learning (MKL). MKL is the most principled of the three since it couples fusion with classifier learning, e.g. with an SVM.

Multiple Kernel Learning. Used multi-label MKL (Jain et al., NIPS 2010), which estimates the optimal convex combination of multiple kernels while training the SVM. MKL is formulated as a convex optimization problem, so the solution is globally optimal. Unique kernel weights are learned for each class.
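
The central object in MKL is a convex combination of per-modality base kernels. A sketch with fixed illustrative weights (in MKL proper, the weights are learned jointly with the SVM, and separately per class); the feature matrices are random stand-ins for the real modalities:

```python
import numpy as np

def rbf_kernel(X, gamma):
    """RBF (Gaussian) kernel matrix over the rows of X."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(3)
n = 20
feats = {"bow":   rng.normal(size=(n, 50)),   # toy stand-ins for the
         "hog":   rng.normal(size=(n, 30)),   # per-modality features
         "audio": rng.normal(size=(n, 10))}
kernels = [rbf_kernel(X, gamma=1.0 / X.shape[1]) for X in feats.values()]

w = np.array([0.5, 0.3, 0.2])                 # non-negative, sums to 1
K = sum(wi * Ki for wi, Ki in zip(w, kernels))

assert np.allclose(K, K.T)                    # still a valid kernel matrix:
assert np.linalg.eigvalsh(K).min() > -1e-8    # convex combinations stay PSD
```

Because any convex combination of positive semi-definite kernels is itself positive semi-definite, the combined matrix `K` can be handed directly to an SVM solver as a precomputed kernel.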

Our Approach. Fused the different features using MKL; referred to as All-features + MKL in the results. RBF kernels were used as base kernels for all features. Employed a one-vs-all multi-class strategy instead of SVM's default one-vs-one: more training data per classifier, which improved results. Class assignment is based on the maximum probability across the per-class classifiers.
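
The one-vs-all scheme can be illustrated with any per-class scorer. Here ridge regression to ±1 targets stands in for the per-class SVMs; the data, dimensions and regularization constant are all toy assumptions:

```python
import numpy as np

def train_one_vs_all(X, y, n_classes, reg=1e-3):
    """One linear scorer per class: regress to +1 for the class, -1 for the rest."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])     # append a bias term
    A = Xb.T @ Xb + reg * np.eye(Xb.shape[1])         # shared normal equations
    W = np.zeros((n_classes, Xb.shape[1]))
    for c in range(n_classes):
        t = np.where(y == c, 1.0, -1.0)               # this class vs all others
        W[c] = np.linalg.solve(A, Xb.T @ t)
    return W

def predict(W, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ W.T).argmax(axis=1)                  # highest score wins

rng = np.random.default_rng(4)
centers = rng.normal(scale=5.0, size=(7, 20))         # 7 well-separated classes
y = rng.integers(0, 7, size=140)
X = centers[y] + rng.normal(size=(140, 20))
W = train_one_vs_all(X, y, 7)
assert (predict(W, X) == y).mean() > 0.9
```

Each of the 7 classifiers trains on all 140 samples (versus only two classes' worth under one-vs-one), which is the "more training data per classifier" point above.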

Experiments. Kernel and SVM hyper-parameters were obtained by cross-validation on the validation set. The performance metric is classification accuracy over the 7 classes.

Results: Validation Set (baseline performance)

Features                    Accuracy
Baseline video (LBP-TOP)    27.27%
Baseline sound              19.95%
Baseline video + sound      22.22%

Results: Validation Set

Features                    Accuracy
Baseline video (LBP-TOP)    27.27%
BoW-600                     33.16%

BoW gives an advantage of nearly 6 percentage points over the LBP-TOP baseline. The boost is attributed to both (1) better face alignment and (2) more discriminative BoW features.

Results: Validation Set

Features                                        Accuracy
Baseline video (LBP-TOP)                        27.27%
Baseline sound                                  19.95%
Baseline video + sound (feature concatenation)  22.22%
BoW-600                                         33.16%
BoW-600 + Sound (MKL)                           34.99%

Fusing the baseline features by concatenation lowers performance, whereas fusing with MKL raises it, highlighting the advantage of MKL.

Final Results

Validation Set
Method                      Accuracy
Baseline video (LBP-TOP)    27.27%
BoW-600 + Sound + MKL       34.99%
All features + MKL          37.08%

Test Set
Method                             Accuracy
Baseline video (LBP-TOP) + audio   27.56%
All features + MKL                 35.89%

The best accuracies are reported for the baseline approaches. All-features + MKL is the proposed approach; using multiple features gives a significant improvement over the BoW-600 and sound features alone.

Kernel Weights. [Bar chart of learned kernel weights for visual, sound and context features; mean and standard deviation are computed across the kernel weights learned for each class.]

Kernel Weights. Visual features are more discriminative than sound features; the highest weights are assigned to the HOG and BoW kernels. Context-based features: the BoW over the entire scene (including faces) receives a weight of only 0.0006, and the information in this kernel could come from both face and scene content. GIST features were not included in the final feature set because they did not improve performance; scene information may simply not be discriminative for this task.

Insights. MKL works better than naïve fusion by feature concatenation: it allows a separate γ for each RBF feature kernel, leading to better discriminability. Fusing visual and sound features improves results (multimodality). The one-vs-all multi-class strategy also improved results.

Conclusion. Proposed an approach for recognizing emotions in unconstrained settings. Combining multiple features with MKL shows significant improvement over the baseline on both the validation and test sets. Highlighted the advantage of using both (1) multiple features and (2) MKL for feature fusion. Analyzed the learned kernel weights to show the contribution of each kernel.

Thanks. Please forward any questions to ksikka@ucsd.edu. Thanks to our presenter Yale Song, graduate student, Multimodal Understanding Group, MIT.