Multimodal Gesture Recognition using Multi-stream Recurrent Neural Network

1 Multimodal Gesture Recognition using Multi-stream Recurrent Neural Network
Noriki Nishida and Hideki Nakayama
Machine Perception Group, Graduate School of Information Science and Technology, The University of Tokyo
7th Pacific Rim Symposium on Image and Video Technology (PSIVT 2015), November 27, 2015
Noriki Nishida and Hideki Nakayama Multimodal Gesture Recognition using MRNN The University of Tokyo 1 / 26

2 Menu
1. Introduction
2. Proposed model
3. Experiments
4. Conclusion & Future work

3 Recent breakthroughs
- Object recognition [Krizhevsky et al., 2012]
- Object detection [Girshick et al., 2014]
- Speech recognition [Hinton et al., 2012]
- Word embedding [Mikolov et al., 2013]
- Convolutional neural networks [Krizhevsky et al., 2012]
- Recurrent neural networks with Long Short-Term Memory (LSTM) [Hochreiter et al., 1997]
- AdaDelta (optimization) [Zeiler et al., 2012]

4 Multimodal-sequential fusion is NOT solved
Figure: Examples of multiple modalities (in gesture recognition) [http://gesture.chalearn.…]
- How should we fuse multiple modalities into a common space (vector representation)?
- How should we extract sequential dynamics from multiple sequential modalities?

5 Problem with traditional methods
Hand-crafted heuristics
1. lead to a lack of generality (e.g., skin color filtering for hand detection)
2. require prior knowledge of the target gesture domains
A system with less gesture-specific engineering is preferable.

6 Our goal
1. Propose an effective approach for fusing multiple sequential modalities
2. Propose a completely data-driven model that can be optimized from end to end

7 Recurrent Neural Networks (RNNs)
$$h_t = \sigma(W_{in} x_t + W_{hh} h_{t-1} + b_{in})$$
$$y_t = f(W_{out} h_t + b_{out})$$
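The recurrence on this slide can be sketched in a few lines of plain Python. This is only an illustration: the slide does not say which functions σ and f are, so the logistic sigmoid and softmax used here are assumptions, and the toy weights and helper names (`matvec`, `rnn_step`, `rnn_output`) are made up for the example.

```python
import math

def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rnn_step(x_t, h_prev, W_in, W_hh, b_in):
    """h_t = sigma(W_in x_t + W_hh h_{t-1} + b_in); sigma = logistic sigmoid (assumed)."""
    pre = [a + b + c for a, b, c in zip(matvec(W_in, x_t), matvec(W_hh, h_prev), b_in)]
    return [1.0 / (1.0 + math.exp(-p)) for p in pre]

def rnn_output(h_t, W_out, b_out):
    """y_t = f(W_out h_t + b_out); f = softmax (assumed)."""
    z = [a + b for a, b in zip(matvec(W_out, h_t), b_out)]
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Toy parameters (hypothetical): 2 inputs, 2 hidden units, 3 output classes.
W_in = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b_in = [0.0, 0.0]
W_out = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]
b_out = [0.0, 0.0, 0.0]

# Unroll over a short input sequence; the hidden state h carries the history.
h = [0.0, 0.0]
for x_t in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h = rnn_step(x_t, h, W_in, W_hh, b_in)
    y = rnn_output(h, W_out, b_out)
```

The same weights are reused at every time step, which is what lets the network handle sequences of any length.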

8 Overall view of our multi-stream RNN (MRNN)

9 Components in our MRNN
- $I^{(m)}$: extracts feature vectors from the frame-level inputs of modality m at every time step
- $S^{(m)}$: computes the sequential dynamics of modality m
- $F$: fuses the multiple modalities while considering sequential dynamics in the multimodal space
- $O$: predicts the gesture category given the last output of $F$
The parameters of these components are what we optimize.

10 Formulation
Input video (with M modalities):
$$x = \{(x^{(1)}_1, x^{(2)}_1, \ldots, x^{(M)}_1), \ldots, (x^{(1)}_T, x^{(2)}_T, \ldots, x^{(M)}_T)\}$$
Extract feature representation $\hat{h}_t$ from $x$:
$$v^{(m)}_t = I^{(m)}(x^{(m)}_t) \quad \text{for } m = 1, \ldots, M$$
$$h^{(m)}_t = S^{(m)}(v^{(m)}_t, h^{(m)}_{t-1}) \quad \text{for } m = 1, \ldots, M$$
$$\hat{h}_t = F([h^{(1)}_t; h^{(2)}_t; \ldots; h^{(M)}_t], \hat{h}_{t-1})$$
Classification:
$$y = O(\hat{h}_T)$$
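The formulation above can be traced as a forward pass. The sketch below is a simplified illustration, not the authors' implementation: I^(m) is replaced by a single tanh layer, S^(m) and F by plain tanh RNN cells, and O by a linear layer with softmax, all without biases. The parameter layout, dimensions, and helper names are hypothetical.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rng, rows, cols, scale=0.1):
    """Small random weights; a placeholder for trained parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def dense_tanh(W, x):
    """Stand-in for I^(m): one tanh layer (assumption)."""
    return [math.tanh(v) for v in matvec(W, x)]

def rnn_cell(W_x, W_h, x, h_prev):
    """Stand-in for S^(m) and F: plain tanh RNN cell (assumption)."""
    return [math.tanh(a + b) for a, b in zip(matvec(W_x, x), matvec(W_h, h_prev))]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def mrnn_forward(video, params):
    """video: list over time of tuples (x^(1)_t, ..., x^(M)_t). Returns y = O(h_hat_T)."""
    M = len(params["I"])
    h = [[0.0] * len(params["S_h"][m]) for m in range(M)]  # per-modality states h^(m)
    h_hat = [0.0] * len(params["F_h"])                      # fused state h_hat
    for frame in video:
        for m in range(M):
            v = dense_tanh(params["I"][m], frame[m])                      # v^(m)_t
            h[m] = rnn_cell(params["S_x"][m], params["S_h"][m], v, h[m])  # h^(m)_t
        concat = [val for h_m in h for val in h_m]          # [h^(1)_t; ...; h^(M)_t]
        h_hat = rnn_cell(params["F_x"], params["F_h"], concat, h_hat)     # h_hat_t
    return softmax(matvec(params["O"], h_hat))

# Toy setup (hypothetical sizes): M = 2 modalities, T = 7 frames, 10 classes.
rng = random.Random(0)
in_dims, d_v, d_h, d_f, n_classes, T = [4, 3], 6, 5, 8, 10, 7
params = {
    "I":   [rand_matrix(rng, d_v, d) for d in in_dims],
    "S_x": [rand_matrix(rng, d_h, d_v) for _ in in_dims],
    "S_h": [rand_matrix(rng, d_h, d_h) for _ in in_dims],
    "F_x": rand_matrix(rng, d_f, d_h * len(in_dims)),
    "F_h": rand_matrix(rng, d_f, d_f),
    "O":   rand_matrix(rng, n_classes, d_f),
}
video = [tuple([rng.uniform(-1, 1) for _ in range(d)] for d in in_dims) for _ in range(T)]
y = mrnn_forward(video, params)
```

Only the final fused state is passed to O, so the prediction depends on the whole sequence through both the per-modality states and the fused state.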

15 Graphical representation of our method (M = 2)

16 Advantages
1. All free parameters of the MRNN can be trained towards end-to-end performance in a supervised manner using SGD and backpropagation. No hand-crafted engineering is required.
2. We can choose current state-of-the-art neural networks for each component:
   - ConvNet or DNN for $I^{(m)}$
   - LSTM or GRU [Cho et al., 2014] for $S^{(m)}$ and $F$
   - DNN for $O$

17 Late multimodal fusion model (M = 2)
No mechanism to consider sequential dynamics in multimodal space

18 Early multimodal fusion model (M = 2)
No mechanism to consider sequential dynamics in each single-modal space
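The two baselines can be contrasted in code. The slide figures are not available in this transcript, so the sketch below rests on a common reading of the terms: early fusion concatenates the modalities at every time step and runs one RNN over the joint sequence, while late fusion runs one RNN per modality and combines only the final states. The plain tanh cell, the sizes, and the function names are assumptions for illustration.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def rnn_cell(W_x, W_h, x, h_prev):
    """Plain tanh RNN cell (assumed stand-in for LSTM or GRU)."""
    return [math.tanh(a + b) for a, b in zip(matvec(W_x, x), matvec(W_h, h_prev))]

def early_fusion(video, W_x, W_h):
    """One RNN over the per-step concatenation: no per-modality recurrent state."""
    h = [0.0] * len(W_h)
    for frame in video:
        joint = [val for x_m in frame for val in x_m]
        h = rnn_cell(W_x, W_h, joint, h)
    return h

def late_fusion(video, W_xs, W_hs):
    """One RNN per modality, combined once at the end: no recurrent state over the fused vector."""
    finals = []
    for m, (W_x, W_h) in enumerate(zip(W_xs, W_hs)):
        h = [0.0] * len(W_h)
        for frame in video:
            h = rnn_cell(W_x, W_h, frame[m], h)
        finals.append(h)
    return [val for h_m in finals for val in h_m]

# Toy data (hypothetical sizes): M = 2 modalities with 4 and 3 inputs, 6 frames.
rng = random.Random(0)
in_dims, d_h, T = [4, 3], 5, 6
video = [tuple([rng.uniform(-1, 1) for _ in range(d)] for d in in_dims) for _ in range(T)]

early = early_fusion(video, rand_matrix(rng, d_h, sum(in_dims)), rand_matrix(rng, d_h, d_h))
late = late_fusion(video,
                   [rand_matrix(rng, d_h, d) for d in in_dims],
                   [rand_matrix(rng, d_h, d_h) for _ in in_dims])
```

In either case the resulting vector would be passed to a classifier. The MRNN keeps both kinds of recurrent state, per modality and fused.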

19 Dataset
Sheffield Kinect Gesture (SKIG) Dataset [Liu et al., 2013]
- 10 gesture classes
- Various illumination and cluttered background
- Each video consists of two modalities (RGB + Depth)
- We compute Optical Flow as an additional modality

20 Experimental Results: MRNN vs. alternatives
Table: Test accuracy (MRNN vs. alternative models)

Method                    Accuracy (%)
Early multimodal fusion   94.1
Late multimodal fusion    94.6
MRNN                      97.8

Extracting sequential dynamics in both the single-modal spaces and the multimodal space yields higher accuracy.

21 Experimental Results: MRNN vs. previous works
Table: Test accuracy (MRNN vs. state-of-the-art methods)

Method               Accuracy (%)
Liu et al. (2013)    88.7
Choi et al. (2014)   91.9
Tung et al. (2014)   96.7
MRNN                 97.8

The MRNN outperforms other state-of-the-art methods.

22 Experimental Results: multimodal vs. single modality
Table: Test accuracy (multiple modalities vs. single modality)

Method                            Accuracy (%)
MRNN (color)                      91.6
MRNN (opt flow)                   88.5
MRNN (depth)                      95.9
MRNN (color + opt flow + depth)   97.8

The MRNN successfully incorporates multiple sequential modalities.

23 Investigation of the robustness to noisy inputs
Gaussian noise with varying standard deviation σ is added to the depth information in the test set.
The MRNN maintains relatively high accuracy.
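The perturbation described on this slide can be sketched as follows. This is a minimal illustration under assumptions: depth frames are represented as flat lists of floats, the noise is zero-mean and added independently per pixel, and no clipping is applied, since the slides specify none of these details. The function name `add_gaussian_noise` and the σ values are hypothetical.

```python
import random

def add_gaussian_noise(depth_frames, sigma, seed=0):
    """Return a copy of the depth frames with N(0, sigma^2) noise added to every pixel."""
    rng = random.Random(seed)  # fixed seed so the perturbed test set is reproducible
    return [[pixel + rng.gauss(0.0, sigma) for pixel in frame] for frame in depth_frames]

# Toy depth sequence: 3 frames of 4 pixels each (made-up values).
depth = [[0.1, 0.2, 0.3, 0.4],
         [0.2, 0.3, 0.4, 0.5],
         [0.3, 0.4, 0.5, 0.6]]

# Sweep over several noise levels; each noisy copy would then be fed to the trained model.
noisy_by_sigma = {sigma: add_gaussian_noise(depth, sigma) for sigma in (0.0, 0.1, 0.5)}
```

Only the depth modality is perturbed, so the color and optical flow streams remain clean.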

24 Conclusion
- We propose the MRNN for multimodal-sequential fusion.
- We successfully applied this approach to multimodal gesture recognition.
- The MRNN achieves a new state-of-the-art result on the SKIG dataset.
- Multimodal fusion that considers sequential dynamics in both the single-modal spaces and the multimodal space is beneficial.

25 Future work
- Further theoretical analysis
- Test our model on other datasets
- Use other modalities such as skeletal or speech data
- Apply our model to other tasks that have multimodal-sequential data (e.g., speech recognition)

26 Thank you very much! Q & A
