Partial Least Squares Regression on Grassmannian Manifold for Emotion Recognition


Emotion Recognition In The Wild Challenge and Workshop (EmotiW 2013)
Partial Least Squares Regression on Grassmannian Manifold for Emotion Recognition
Mengyi Liu, Ruiping Wang, Zhiwu Huang, Shiguang Shan, Xilin Chen
Institute of Computing Technology, Chinese Academy of Sciences

Outline: Problem, Related work, Our Method, Experiments, Conclusion

Problem: emotion recognition in the wild
Challenges:
- Large data variations: head pose, illumination, partial occlusion, etc.
- Lack of labeled data: manual annotation is hard because spontaneous expressions are ambiguous in the real world.

Related work: video-based emotion recognition
- Acoustic information based: time-domain and frequency-domain features, e.g. pitch, intensity, pitch contour, Low Short-time Energy Ratio (LSTER), maximum bandwidth, ...
- Vision information based: spatial and temporal features, e.g. optical flow, 3D descriptors (LBP-TOP, HOG 3D), tracking based (AAM, CLM), probabilistic graphical models (HMM, CRF), ...

Our method: key issue
How to model the emotion video clip?
Motivation:
- Alleviate the effect of mis-alignment of facial images
- Encode the data variations among video frames
Basic idea:
- Inspired by recent progress in image set-based face recognition [1]
- Treat the video clip as an image set, i.e., a collection of frames
- Use a linear subspace to model each video (image set)
[1] R. Wang, H. Guo, L. S. Davis, and Q. Dai. Covariance discriminative learning: A natural and efficient approach to image set classification. CVPR, 2012.

Our method: an overview
- Video stream: original aligned face images -> filtering out non-faces in a PCA subspace (preprocessing) -> purified face images -> mid-level image features (feature designing) -> subspace learning on the Grassmannian manifold -> video/image-set features -> One-to-Rest PLS classification.
- Audio stream: original audio data -> clip-wise audio features extracted with the openSMILE toolkit [2] -> One-to-Rest PLS classification.
- The two streams are combined by video-audio fusion.
[2] F. Eyben, M. Wollmer, and B. Schuller. Opensmile: the munich versatile and fast open-source audio feature extractor. ACM MM, 2010.

Our method: preprocessing
- Original face alignment using MoPS [3] (provided by the organizers).
- Purification of face images:
  - Original aligned face image set: X = {x_1, x_2, ..., x_n}, x_i in R^D.
  - PCA projection W learned on X, preserving the dominant energy.
  - Mean reconstruction error of each image: MeanErr_t = (1/D) * ||x_t - W^T W x_t||^2.
  - Non-face or badly-aligned face images tend to have a large MeanErr_t.
[3] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. CVPR, 2012.
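The purification step above can be sketched in a few lines of numpy. This is a minimal sketch, not the authors' code: it assumes the data are mean-centered before PCA and that the basis W is chosen by an energy threshold (the slides do not specify either detail).

```python
import numpy as np

def mean_reconstruction_errors(X, energy=0.90):
    """Per-image PCA reconstruction error, a sketch of the purification step.

    X: (n, D) matrix of aligned face images, one flattened image per row.
    W: PCA basis (rows are principal directions) keeping `energy` of the
    variance -- the energy threshold is an assumption, not from the slides.
    Returns MeanErr_t = (1/D) * ||x_t - W^T W x_t||^2 for each image.
    """
    n, D = X.shape
    Xc = X - X.mean(axis=0)                          # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2
    k = np.searchsorted(np.cumsum(var) / var.sum(), energy) + 1
    W = Vt[:k]                                       # (k, D) principal directions
    recon = Xc @ W.T @ W                             # project, then reconstruct
    return ((Xc - recon) ** 2).sum(axis=1) / D

# Images whose error exceeds a chosen threshold are dropped as non-faces:
# keep = mean_reconstruction_errors(X) < threshold
```

Non-faces and badly-aligned faces lie off the subspace spanned by well-aligned faces, so they show up as the largest reconstruction errors.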

Our method: preprocessing
[Figure: histogram of the number of samples vs. mean reconstruction error (0 to 1.6) on the EmotiW2013 training set, with a marked threshold. The threshold on MeanErr_t is used to filter out non-faces in the PCA subspace.]

Our method: preprocessing
An example of the 100 samples with the largest mean reconstruction error: most are non-face images or mis-alignment results.

Our method: feature designing, image feature [4]
Pipeline: face image (32x32) -> convolution with 100 filters of size 6x6 -> filter maps (27x27x100) -> 3x3 max-pooling -> mid-level feature (9x9x100).
[4] M. Liu, S. Li, S. Shan, X. Chen. AU-aware Deep Networks for Facial Expression Recognition. FG, 2013.
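The shapes in the pipeline can be checked with a naive numpy implementation. A sketch only: the filters here are random (in the paper [4] they are learned by an AU-aware deep network), valid convolution is assumed from 32 - 6 + 1 = 27, and non-overlapping stride-3 pooling is assumed from 27 / 3 = 9.

```python
import numpy as np

def midlevel_feature(img, filters):
    """Sketch of the mid-level feature pipeline from the slide:
    32x32 face -> 100 6x6 filters (valid convolution) -> 27x27x100 maps
    -> 3x3 non-overlapping max-pooling -> 9x9x100 feature.
    filters: (n_filters, 6, 6); random here, learned in the paper [4].
    """
    H, W = img.shape
    n, fh, fw = filters.shape
    maps = np.empty((H - fh + 1, W - fw + 1, n))
    for k in range(n):                       # naive valid convolution
        for i in range(H - fh + 1):
            for j in range(W - fw + 1):
                maps[i, j, k] = np.sum(img[i:i+fh, j:j+fw] * filters[k])
    mh, mw, _ = maps.shape                   # 3x3 non-overlapping max-pooling
    return maps.reshape(mh // 3, 3, mw // 3, 3, n).max(axis=(1, 3))

rng = np.random.default_rng(0)
feat = midlevel_feature(rng.normal(size=(32, 32)),
                        rng.normal(size=(100, 6, 6)))
print(feat.shape)   # (9, 9, 100)
```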

Our method: feature designing, video feature
Each video clip is a set of images, denoted S_i in R^{f x n_i}, where f is the dimension of the image feature and n_i is the number of frames. The video S_i can be represented by a linear subspace P_i such that S_i S_i^T = P_i Lambda_i P_i^T. Thus all video clips can be modeled as a collection of subspaces, which are points on a Grassmannian manifold.
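The basis P_i can be computed from the thin SVD of S_i (the left singular vectors are the eigenvectors of S_i S_i^T). A sketch; the subspace dimension d is an assumed choice, as the slides do not state it.

```python
import numpy as np

def video_subspace(S, d=10):
    """Represent a video clip (image set) by a d-dimensional linear subspace.

    S: (f, n_i) matrix, one image-feature column per frame.
    Returns P_i, an (f, d) orthonormal basis: the leading eigenvectors
    of S S^T, obtained via the thin SVD of S. d = 10 is an assumption.
    """
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return U[:, :d]                  # columns span the clip's subspace

rng = np.random.default_rng(1)
P = video_subspace(rng.normal(size=(50, 30)), d=10)
assert np.allclose(P.T @ P, np.eye(10), atol=1e-8)   # orthonormal basis
```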

Our method: feature designing, video feature
[Figure: an illustration of subspaces on the Grassmannian manifold M. Video Clip 1 and Video Clip 2 each map to a subspace, i.e., a point on M, and similarity is measured between these points.]

Our method: feature designing, video feature
The similarity between two points P_i and P_j on the manifold M can be measured by a linear combination of Grassmannian kernels:
- Projection kernel [5]: k_ij^proj = ||P_i^T P_j||_F^2.
- Canonical correlation kernel [6]: k_ij^CC = max_{a_p in span(P_i)} max_{b_q in span(P_j)} a_p^T b_q.
- Linear combination: k_ij^com = k_ij^proj + alpha * k_ij^CC.
The kernel values of each point (i.e., each video) against all training points serve as its final feature representation for classification.
[5] J. Hamm, D. Lee. Grassmann discriminant analysis: a unifying view on subspace-based learning. ICML, 2008.
[6] M. Harandi, C. Sanderson, S. Shirazi, B.C. Lovell. Graph embedding discriminant analysis on Grassmannian manifolds for improved image set matching. CVPR, 2011.
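Both kernels reduce to operations on the small d x d matrix P_i^T P_j: the projection kernel is its squared Frobenius norm, and the largest canonical correlation is its top singular value. A sketch under that reading of the slide's formulas (the full kernel of [6] may differ in detail):

```python
import numpy as np

def projection_kernel(P_i, P_j):
    """k_ij^proj = ||P_i^T P_j||_F^2 (projection kernel [5])."""
    return np.linalg.norm(P_i.T @ P_j, 'fro') ** 2

def cc_kernel(P_i, P_j):
    """k_ij^CC: largest canonical correlation between span(P_i) and
    span(P_j), i.e. the top singular value of P_i^T P_j."""
    return np.linalg.svd(P_i.T @ P_j, compute_uv=False)[0]

def combined_kernel(P_i, P_j, alpha=1.0):
    """k_ij^com = k_ij^proj + alpha * k_ij^CC (alpha is tuned; see the
    experiments section)."""
    return projection_kernel(P_i, P_j) + alpha * cc_kernel(P_i, P_j)

P = np.eye(6)[:, :3]    # a 3-D subspace of R^6
Q = np.eye(6)[:, 3:]    # an orthogonal 3-D subspace
# Identical subspaces: k_proj = d and k_CC = 1; orthogonal ones: both 0.
```

For classification, each video's feature vector is then the row of combined-kernel values against all training subspaces.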

Our method: classification
Partial Least Squares (PLS) for classification [1]: maximize the covariance between observations and class labels.
- Space X: each sample is a feature vector.
- Space Y: each sample is a 0/1 class label.
Linear projection to a latent space T: X = T P^T, Y = T B C^T = X B_PLS, where T is the common latent representation.
[1] R. Wang, H. Guo, L. S. Davis, and Q. Dai. Covariance discriminative learning: A natural and efficient approach to image set classification. CVPR, 2012.

Our method: classification, One-to-Rest PLS
Suppose there are c categories and N training samples; we train c One-to-Rest PLS classifiers to predict each class simultaneously. This is effective for handling hard class pairs, e.g. Sad vs. Disgust.
Binarize: the original training label vector Y in R^{N x 1} becomes a binary training label matrix Y in R^{N x c}, i.e., separate One-to-Rest training label vectors y_1, y_2, ..., y_c in R^{N x 1}.

Our method: classification, One-to-Rest PLS training and test process
- Training: the training data X with the One-to-Rest label vectors y_1, y_2, ..., y_c yields c classifiers One-to-Rest PLS(1), ..., One-to-Rest PLS(c).
- Test: a test sample is passed through all c classifiers, producing the result Fit in R^{c x 1}.

Our method: classification, video-audio fusion for the final test output
For a given test video, using the c PLS classifiers for video and audio respectively, we obtain two prediction vectors Fit_video, Fit_audio in R^{c x 1}. We conduct a linear fusion at the decision level with a weighting parameter lambda: Fit_fusion = (1 - lambda) * Fit_video + lambda * Fit_audio. The category corresponding to the maximum value in Fit_fusion is the recognition result.
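The fusion rule above is a one-liner; a minimal sketch with toy score vectors (the lambda = 0.25 default is the Train-Val value selected in the experiments section):

```python
import numpy as np

def fuse_and_predict(fit_video, fit_audio, lam=0.25):
    """Decision-level fusion:
    Fit_fusion = (1 - lam) * Fit_video + lam * Fit_audio;
    the predicted category is the argmax entry of Fit_fusion."""
    fit = (1.0 - lam) * np.asarray(fit_video, float) \
        + lam * np.asarray(fit_audio, float)
    return int(np.argmax(fit)), fit

# At lam = 0.25 the video scores dominate the fused decision:
label, fit = fuse_and_predict([0.2, 0.7, 0.1], [0.6, 0.3, 0.1], lam=0.25)
```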

Experiments: discussion of parameters
[Figure: recognition accuracy vs. the Grassmannian kernel combination weight alpha in k_ij^com = k_ij^proj + alpha * k_ij^CC, swept from 10^-4 to 10^4, for Train-Val and Val-Train. Best settings: Train-Val at alpha = 2^5 to 2^6; Val-Train at alpha = 2^10.]

Experiments: discussion of parameters
[Figure: recognition accuracy (video) vs. the dimension of One-to-Rest PLS, up to 100, for Train-Train, Train-Val, Val-Val, and Val-Train. Best settings: Train-Val at dim = 10; Val-Train at dim = 5.]

Experiments: discussion of parameters
[Figure: recognition accuracy (audio) vs. the dimension of One-to-Rest PLS, up to 100, for Train-Train, Train-Val, Val-Val, and Val-Train. Best settings: Train-Val at dim = 5; Val-Train at dim = 5.]

Experiments: discussion of parameters
[Figure: recognition accuracy vs. the video-audio fusion weight lambda in Fit_fusion = (1 - lambda) * Fit_video + lambda * Fit_audio, swept from 0 to 1, for Train-Val and Val-Train. Best settings: Train-Val at lambda = 0.25; Val-Train at lambda = 0.85.]

Experiments: results comparison

Performance comparison (recognition accuracy):

Ours:
  Audio only, One-to-Rest PLS:                            Val 24.49%, Test --
  Video only, Grassmannian Discriminant Analysis [6]:     Val 30.81%, Test 24.04%
  Video only, Grassmannian kernels + One-to-Rest PLS:     Val 32.07%, Test --
  Audio + Video, feature-level fusion, multi-class LR:    Val 22.48%, Test --
  Audio + Video, feature-level fusion, One-to-Rest PLS:   Val 24.24%, Test 26.28%
  Audio + Video, decision-level fusion (original data):   Val 34.34%, Test 33.01%
  Audio + Video, decision-level fusion (purified data):   Val 35.86%, Test 34.61%

Baseline:
  Audio only:                                             Val 19.95%, Test 22.44%
  Video only:                                             Val 27.27%, Test 22.75%
  Audio + Video:                                          Val 22.22%, Test 27.56%

[6] M. Harandi, C. Sanderson, S. Shirazi, B.C. Lovell. Graph embedding discriminant analysis on Grassmannian manifolds for improved image set matching. CVPR, 2011.

Conclusion
Key points of the current method:
- PCA-based data purification to filter out mis-aligned faces
- Linear subspace modeling of video data variations
- Fusion of multiple video features via Grassmannian kernel combination
- Multi-modality fusion of video and audio at the decision level
Issues to address further:
- Exploration of temporal dynamics in video
- More sophisticated video modeling
- More effective fusion at the feature level

Thank you. Questions?