Multi-Modal Audio, Video, and Physiological Sensor Learning for Continuous Emotion Prediction

Youngjune Gwon 1, Kevin Brady 1, Pooya Khorrami 2, Elizabeth Godoy 1, William Campbell 1, Charlie Dagli 1, Thomas S. Huang 2

1. MIT Lincoln Laboratory, Human Language Technology Group
2. University of Illinois Urbana-Champaign, Beckman Institute

October 16, 2016

This work was sponsored by the Defense Advanced Research Projects Agency under Air Force Contract FA C. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

2 System Overview & Technical Contributions

Approach
- Leverage derived low- and high-level features and exploit the time-correlated nature of the emotional state

Novel Features
- Low-level: prosodic-based descriptors
- High-level:
  - Unsupervised sparse coding of audio and video features
  - Supervised deep learning of video and physiological features
- Kalman filtering framework with smoothing
  - Bias term compensation for non-zero mean measurement noise

3 Outline

- System Overview & Technical Contributions
- Technical Overview
- System Architecture
- Audio Processing Pipeline
- Unsupervised High-Level Features (Sparse Coding)
- Supervised High-Level Features (Deep Learning)
- Time-Varying Emotional State Estimation
- Results
- Concluding Remarks

4 System Architecture

[Block diagram: system architecture, with the following blocks]
- Sensor channels: audio, video, physiological
- Low-level audio feature extraction: baseline, MFCC, prosodic, SDC
- High-level unsupervised (sparse coding) framework
- High-level supervised (deep-learning) framework
- Baseline estimates* (all sensors)
- Kalman filtering framework
- Output: emotional state

* Extracted using supplied AVEC baseline code

MFCC: Mel-frequency cepstral coefficients
SDC: Shifted delta cepstrum

5 Audio Feature Processing Pipeline

In addition to the precomputed audio features, we perform the following feature extraction:
- MFCC: 40-dimensional Mel-frequency cepstral coefficients are computed per 10-msec audio frame using a filterbank
- SDC: 56-dimensional shifted delta cepstra are computed from stacked MFCC vectors of multiple frames
- Prosody: 7-dimensional perceptual audio features based on vocal effort, variations in intonation, and speaking rate

We consider MFCC, SDC, and prosody to be low-level audio features
- Not suitable for regressing directly, but treated as input for high-level learning
- High-level feature learning uses sparse coding

Pipeline: Audio (*.WAV) -> Speech Activity Detection -> MFCC, SDC, Prosody Feature Extraction -> High-level feature learning (Sparse Coding) -> SVM Regression -> Arousal, valence estimates

SVM: Support vector machine
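The slide states only that SDC vectors are built from stacked MFCC frames. As a hedged illustration, the sketch below computes shifted delta cepstra with the commonly used N-d-P-k parameterisation. The function name, the 7-1-3-7 configuration, and the appended static coefficients (49 + 7 = 56 dimensions) are assumptions chosen to match the stated dimensionality, not details confirmed by the slides.

```python
import numpy as np

def shifted_delta_cepstra(cep, d=1, P=3, k=7):
    """Stack k delta blocks, each shifted by P frames (hypothetical helper).

    cep: array of shape (T, N) holding N cepstral coefficients per frame.
    Block i at frame t is cep[t + i*P + d] - cep[t + i*P - d].
    Returns an array of shape (T, N * k).
    """
    T, _ = cep.shape
    # Repeat edge frames so every delta is defined at the boundaries.
    padded = np.pad(cep, ((d, d + (k - 1) * P), (0, 0)), mode="edge")
    blocks = []
    for i in range(k):
        ahead = padded[i * P + 2 * d : i * P + 2 * d + T]
        behind = padded[i * P : i * P + T]
        blocks.append(ahead - behind)
    return np.hstack(blocks)

# Assumed configuration: 7 cepstra, 7-1-3-7 SDC, static coefficients appended.
cep = np.random.default_rng(0).standard_normal((100, 7))
sdc = np.hstack([cep, shifted_delta_cepstra(cep)])  # shape (100, 56)
```

Other parameter choices can also produce 56 dimensions, so the configuration here should be read as one plausible option.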

6 Unsupervised High-Level Features: Sparse Coding

Audio & Video Channels
- Low-level feature aggregation: MFCC and SDC features are max-pooled over 40 msec before sparse coding
- For sparse coding, we used L1-regularized LARS with hyperparameters trained in K = , λ = , and average pooling over a 1-2 second window

  \min_{D, y} \| x - D y \|_2^2 + \lambda \| y \|_1

- Regression: L2-regularized L2-loss linear SVM

LARS: Least angle regression
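To make the L1-regularized objective and the pooling steps concrete, here is a minimal NumPy sketch. It solves for the sparse code y with a fixed dictionary D using ISTA (iterative soft thresholding), which is a simpler stand-in for the LARS solver named on the slide. It minimises 0.5‖x − Dy‖² + λ‖y‖₁, so λ is scaled by a constant relative to the slide's form. All function names, the iteration count, and the pooling widths are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink every entry of v toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code_ista(x, D, lam, n_iter=200):
    """Minimise 0.5 * ||x - D y||_2^2 + lam * ||y||_1 over y (D held fixed)."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2  # 1 / Lipschitz constant of the gradient
    y = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ y - x)
        y = soft_threshold(y - step * grad, step * lam)
    return y

def pool(frames, width, op):
    """Apply op (e.g. np.max or np.mean) over consecutive groups of `width` rows."""
    n = (len(frames) // width) * width  # drop the incomplete trailing group
    return op(frames[:n].reshape(-1, width, frames.shape[1]), axis=1)

# With 10-msec frames, a width of 4 corresponds to max-pooling over 40 msec.
frames = np.arange(8, dtype=float).reshape(4, 2)
pooled = pool(frames, 2, np.max)  # rows [2, 3] and [6, 7]
```

In a full pipeline the dictionary D would also be learned, and the pooled sparse codes would feed the linear SVM regressor. Both steps are omitted here.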

7 Supervised High-Level Features: Deep Learning

Video & Physiological Channels
Video Appearance Data (CNN+RNN) / Physiological Data (LSTM)

Input: a fixed-size window (W) of video frames / physiological features
1. Video (appearance): pass the extracted faces from the frame sequence through a CNN, and the resulting CNN features through a recurrent network (RNN)
2. Physiological (EDA, HRHRV): pass the baseline features from the frame sequence through an LSTM
3. Compute the desired output

[Figure: per-frame video (appearance) inputs pass through CNN blocks and then RNN blocks; HRHRV and EDA form the physiological sensor channel]

[Table: AVEC Dev Set CCC Results; columns: Arousal (Baseline, MITLL-UIUC), Valence (Baseline, MITLL-UIUC)]

CNN: Convolutional Neural Network
RNN: Recurrent Neural Network
LSTM: Long Short Term Memory
EDA: Electrodermal activity
HRHRV: Heart rate & heart rate variability
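The fixed-size input window (W) can be illustrated without any deep-learning framework. The sketch below only slices a per-frame feature sequence into overlapping windows of W frames, the kind of input a recurrent model would consume. The function name and the hop parameter are assumptions, and the CNN, RNN, and LSTM models themselves are not reproduced.

```python
import numpy as np

def sliding_windows(features, W, hop=1):
    """Cut a (T, F) feature sequence into windows of W consecutive frames.

    Returns an array of shape (num_windows, W, F).
    """
    T = features.shape[0]
    if T < W:
        raise ValueError("sequence is shorter than the window size W")
    starts = range(0, T - W + 1, hop)
    return np.stack([features[s:s + W] for s in starts])

# Five frames with two features each, windowed with W = 3.
feats = np.arange(10, dtype=float).reshape(5, 2)
windows = sliding_windows(feats, W=3)  # shape (3, 3, 2)
```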

8 Time-Varying Emotional State Estimation

System Equations
  x(k+1) = A x(k) + w(k)          (dynamic system; x is the emotional state)
  z(k) = C x(k) + \beta + v(k)    (measurement system; z holds the sensor measurements)
  Q = \mathrm{cov}(w, w)          (process noise)
  R = \mathrm{cov}(v, v)          (measurement noise)
  \beta                           (measurement bias)

System Identification (system parameters \hat{A}, \hat{C}, \hat{Q}, \hat{R}, \hat{\beta} estimated from held-out data)
  \hat{A} = (X_{2,N} X_{1,N-1}^T)(X_{1,N-1} X_{1,N-1}^T)^{-1}
  [\hat{C} \;\; \hat{\beta}] = (Z_{1,N} \bar{X}_{1,N}^T)(\bar{X}_{1,N} \bar{X}_{1,N}^T)^{-1}
  \hat{Q} = \mathrm{cov}(X_{2,N} - \hat{A} X_{1,N-1})
  \hat{R} = \mathrm{cov}(Z_{1,N} - \hat{C} X_{1,N} - \hat{\beta})
  X_{1,N} = [x_1 \cdots x_N], \quad Z_{1,N} = [z_1 \cdots z_N], \quad \bar{X}_{1,N} = \begin{bmatrix} X_{1,N} \\ 1 \cdots 1 \end{bmatrix}

- Leveraging Kalman filter estimation model (1st-order polynomial) with smoothing
- Introducing bias term to compensate for non-zero mean sensor measurement error

Sensor Measurements
  z(k) = \begin{bmatrix} z_{audio}(k) \\ z_{physiological}(k) \\ z_{video\_appearance}(k) \\ z_{video\_geometric}(k) \end{bmatrix}

[Block diagram: the innovation \upsilon = z(k) - \hat{z}(k|k-1) drives the Kalman estimator, which produces \hat{x}(k|k); dynamic prediction gives \hat{x}(k+1|k) and measurement prediction gives \hat{z}(k+1|k); the Kalman smoother outputs the emotional state \hat{x}(k|k+T)]
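A minimal NumPy sketch of the least-squares system identification and the bias-compensated Kalman filter described on this slide. The function names, the initial state x0 and covariance P0, and the synthetic-data setup a reader might test it with are assumptions. The smoothing pass is omitted for brevity.

```python
import numpy as np

def identify_system(X, Z):
    """Least-squares estimates of A, C, beta, Q, R.

    X: (n, N) reference emotional states; Z: (m, N) sensor measurements.
    """
    X1, X2 = X[:, :-1], X[:, 1:]
    A = X2 @ X1.T @ np.linalg.inv(X1 @ X1.T)
    # Append a row of ones so the last regression column is the bias.
    Xa = np.vstack([X, np.ones((1, X.shape[1]))])
    CB = Z @ Xa.T @ np.linalg.inv(Xa @ Xa.T)
    C, beta = CB[:, :-1], CB[:, -1]
    Q = np.atleast_2d(np.cov(X2 - A @ X1))
    R = np.atleast_2d(np.cov(Z - C @ X - beta[:, None]))
    return A, C, beta, Q, R

def kalman_filter(Z, A, C, beta, Q, R, x0, P0):
    """Filtered state estimates x_hat(k|k) for every column of Z."""
    n, N = A.shape[0], Z.shape[1]
    x, P = x0.astype(float), P0.astype(float)
    out = np.zeros((n, N))
    for k in range(N):
        # Measurement update; the bias is removed inside the innovation.
        innovation = Z[:, k] - (C @ x + beta)
        S = C @ P @ C.T + R
        K = P @ C.T @ np.linalg.inv(S)
        x = x + K @ innovation
        P = (np.eye(n) - K @ C) @ P
        out[:, k] = x
        # Time update (dynamic prediction).
        x = A @ x
        P = A @ P @ A.T + Q
    return out
```

Calling `identify_system` on a held-out partition and then `kalman_filter` on new measurements mirrors the two stages on the slide. A fixed-lag or Rauch-Tung-Striebel smoother could be added on top to obtain \hat{x}(k|k+T).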

9 AVEC Emotion State Estimation Results

AVEC Channel and Feature Utilization
(Each row lists low-level features, high-level features where used, then the Arousal and Valence utilization entries. A single Yes marks use for only one of the two targets.)

Audio
- Baseline: Yes, Yes
- Baseline, Sparse coding: Yes, Yes
- MFCC, Sparse coding: Yes, Yes
- Prosody, Sparse coding: Yes
- SDC, Sparse coding: Yes, Yes

Video (appearance)
- Baseline: Yes, Yes
- CNN+RNN: Yes, Yes
- Sparse coding: Yes

Video (geometric)
- Baseline: Yes, Yes
- Sparse coding: Yes

ECG
- Baseline: Yes, Yes

HRHRV
- Baseline: Yes, Yes
- Baseline, CNN+RNN: Yes, Yes

EDA
- Baseline: Yes
- Baseline, CNN+RNN: Yes, Yes

SCL
- Baseline: Yes

SCR
- Baseline

[Table: AVEC CCC Results; rows: Dev Set, Test Set; columns: Arousal (Baseline, MITLL-UIUC), Valence (Baseline, MITLL-UIUC)]

Strong Impact
Performance was strongly impacted by:
- Arousal: low-level audio MFCC & SDC features with high-level sparse coding features
- Valence: high-level deeply learned (CNN+RNN) video (appearance) features and sparse coded video (geometric) features
- General: Kalman filtering fusion framework exploiting the time-correlated signal

CCC: Concordance correlation coefficient
ECG: Electrocardiogram
SCL: Skin conductance level
SCR: Skin conductance response
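The results on this slide are reported as CCC. For reference, a short sketch of the standard concordance correlation coefficient (Lin's definition) follows. The function name is an assumption, and the numbers in the example are illustrative only, not results from the slides.

```python
import numpy as np

def concordance_cc(pred, gold):
    """Concordance correlation coefficient between two 1-D sequences.

    CCC = 2 * cov(pred, gold) / (var(pred) + var(gold) + (mean difference)^2)
    """
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    mean_p, mean_g = pred.mean(), gold.mean()
    cov = ((pred - mean_p) * (gold - mean_g)).mean()
    return 2.0 * cov / (pred.var() + gold.var() + (mean_p - mean_g) ** 2)

# A constant offset lowers CCC even though the Pearson correlation stays at 1.
gold = [0.0, 1.0, 2.0, 3.0]
shifted = [1.0, 2.0, 3.0, 4.0]
score = concordance_cc(shifted, gold)  # 2.5 / 3.5, roughly 0.714
```

Unlike Pearson correlation, CCC penalises both bias and scale mismatch.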

10 Concluding Remarks

Provided overview of MITLL-UIUC AVEC System
- Novel low-level (prosodic) and high-level (sparse coding and deep-learning) features
- Kalman filtering fusion framework with compensation for non-zero mean sensor measurement noise

Reviewed emotional state recognition results driven by:
- Arousal: high-level sparse coding of low-level MFCC & SDC features
- Valence: high-level deeply learned video (appearance) features and sparse coded video (geometric) features
- General: Kalman filtering framework exploitation of the time-correlated signal

Next steps
- Refine low-level and high-level features and apply to other sensors
- Multiple hypothesis emotional state estimation framework
- Improved train-test data partitioning & sensor channel delay compensation

Questions?