Deep Temporal Models (Benchmarks and Applications Analysis)

1 Deep Temporal Models (Benchmarks and Applications Analysis)
Sek Chai, SRI International
Presented at: NICE 2016, March 7
SRI International

2 Project Summary
Goals: Analyze Deep Temporal Models (DTMs). Find approaches to reduce training time, lower memory size, and use low precision.
High-dimensional data set (audio, video, gesture); Benchmark; Deep Temporal Models Analysis; Benchmarks and Applications Analysis; Processor Architecture; RNN, LSTM, CRBM
SRI International: Sek Chai (PI), Mohamed Amer, David Zhang, Tim Shields
U. Guelph: Graham Taylor, Dhanesh Ramachandram
U. Montreal: Roland Memisevic, Yoshua Bengio

3 Seeing Humans: ChaLearn 2014
In this dataset, a single user is recorded in front of a depth camera, performing natural communicative gestures and speaking in fluent Italian. The dataset focuses on user-independent automatic recognition of a vocabulary of 20 Italian cultural/anthropological signs in image sequences.
Challenges:
- Multimodal visual cues (RGBD) and audio
- Multi-timescale, unreliable depth cues
- No information about the number of gestures within each sequence
- High intra-class variability of gesture samples
- Low inter-class variability for some gesture categories
- Several distractor gestures (out of the vocabulary) are present
Image: Neverova et al. (2015)
S. Escalera, et al., "ChaLearn Looking at People Challenge 2014: Dataset and Results", ECCV-W

4 DeepGesture Architecture (for ChaLearn Dataset)
[Figure: DeepGesture architecture. Labels: temporal strides; Validation Error, %; 4 Training Stage]
N. Neverova, et al. (2015), "ModDrop: Adaptive Multimodal Gesture Recognition", IEEE PAMI (In Press)
State of the art: 88.1% recognition rate
Key Insights
We adopted a strategy where like modalities are fused first; it resembles the brain's multimodal fusion strategy. Most previous work on multimodal learning has fused data either:
- at the input feature level (early fusion); or
- at the level of per-modality classifier outputs (late fusion)
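The three fusion strategies named on this slide can be contrasted with a minimal structural sketch. It is not the DeepGesture implementation: the function names are hypothetical, and each "model" is assumed to be any callable that maps a feature vector to a vector of scores or features.

```python
def concat(*vectors):
    """Join feature vectors end to end."""
    return [v for vec in vectors for v in vec]


def average(*score_vectors):
    """Element-wise mean of equally long score vectors."""
    return [sum(col) / len(score_vectors) for col in zip(*score_vectors)]


def early_fusion(model, inputs):
    """Early fusion: one model sees all modalities concatenated at the input."""
    return model(concat(*inputs))


def late_fusion(models, inputs):
    """Late fusion: one model per modality; only the outputs are combined."""
    return average(*(m(x) for m, x in zip(models, inputs)))


def staged_fusion(video_model, top_model, depth, intensity, mocap):
    """Like modalities first: depth and intensity video are fused into one
    video representation, which is then fused with motion capture."""
    video_repr = video_model(concat(depth, intensity))
    return top_model(concat(video_repr, mocap))


def toy_model(features):
    """Hypothetical stand-in for a trained network: returns [sum, length]."""
    return [sum(features), len(features)]
```

With the toy model, `early_fusion(toy_model, [[1, 2], [3]])` processes one joint vector, while `late_fusion([toy_model, toy_model], [[1, 2], [3]])` averages two separate outputs; the staged variant sits between the two.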

5 Example Training Complexity (for ChaLearn Dataset)
Columns: Classifier; Description; Modality; Training Data Size (GB); Training Time; Epochs; Time (sec)/epoch
- Motion detect: Shallow MLP used to detect the start frame and stop frame of a given gesture in an action set. Modality: motion capture. Training time: 20,174 sec (5.6 hrs).
- Skeleton: Convolutional network trained to extract features from motion capture data (Path M). Modality: motion capture. Training time: 69,344 sec (19.25 hrs).
- Video Feature: 3D convolutional layer followed by a 2D convolutional layer, using depth and intensity video (ConvC1 -> ConvC2, ConD1 -> ConvD2). Modality: intensity+depth video. Training time: 82,655 sec (22.9 hrs).
- Video: Shared hidden layers that use inputs from the previous convolutional layers (HLV1 + HLV2). Modality: intensity+depth video. Training time: 170,223 sec (47.28 hrs).
- Multimodal: The fully connected shared hidden layer where multimodal inputs are fused (HLS). Modality: all. Training time: 93,247 sec (25.9 hrs) 200, ,
Summary: Total 5 days to process 42GB training data on the Sharcnet Copper Cluster (064 GPUs, 128 CPU cores, 24 cores/node, 64 GB/node, x86, 080 TB RAID Attached Storage, InfiniBand, 4 Tesla K80s/node). Total # Parameters = 7,836,
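As a quick arithmetic check, the per-classifier training times listed on this slide can be summed and converted to days. The sketch assumes the five stages run back to back on the same cluster, which the slide does not state explicitly; the dictionary keys are just labels taken from the table.

```python
# Training times in seconds, as listed per classifier on the slide.
stage_seconds = {
    "Motion detect": 20174,
    "Skeleton": 69344,
    "Video Feature": 82655,
    "Video": 170223,
    "Multimodal": 93247,
}

SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: stages are trained sequentially, so times simply add up.
total_seconds = sum(stage_seconds.values())
total_days = total_seconds / SECONDS_PER_DAY

print(f"total: {total_seconds} sec = {total_days:.2f} days")
```

Under that assumption the sum is consistent with the "Total 5 days" figure in the summary.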

6 Low Precision Neural Networks
Needs: Memory is the main bottleneck, especially for embedded solutions. A neural network with 1B connections is estimated to consume 12 W.*
Current Approaches:
- Stochastic rounding: Gupta, et al. (arXiv 2015)
- Network Pruning. *Image: Han, et al. (NIPS 2015)
- Binary Connect. Image: Courbariaux, et al. (NIPS 2015)
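Stochastic rounding, the first approach listed, can be sketched as follows. A value is rounded to one of its two neighbours on a fixed-point grid, choosing the upper neighbour with probability equal to the fractional distance from the lower one, so the result is unbiased in expectation. This is a minimal illustration under assumptions: the function names and the `frac_bits` parameter are hypothetical, and saturation to a finite word length is omitted.

```python
import math
import random


def stochastic_round(x, frac_bits, rng):
    """Round x to a grid with step 2**-frac_bits, unbiased in expectation."""
    step = 2.0 ** -frac_bits
    scaled = x / step
    lower = math.floor(scaled)
    prob_up = scaled - lower          # fractional part, in [0, 1)
    if rng.random() < prob_up:        # round up with probability prob_up
        lower += 1
    return lower * step


def quantize_weights(weights, frac_bits, rng):
    """Apply stochastic rounding to every weight in a flat list."""
    return [stochastic_round(w, frac_bits, rng) for w in weights]


rng = random.Random(0)                # fixed seed for reproducibility
samples = [stochastic_round(0.3, 2, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
print(sorted(set(samples)), round(mean, 3))
```

With two fractional bits the grid step is 0.25, so 0.3 is rounded to either 0.25 or 0.5, and the sample mean stays close to 0.3, unlike round-to-nearest, which would always return 0.25.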

7 Subband/Wavelet Decomposition
Subband decomposition enables data reduction by discarding information about certain frequencies where the human visual system is less sensitive.*
Can we do the same for learnt representations?
* Good representation: used extensively in image compression [1], reconstruction [2], and fusion [3].
[1] Burt et al., "The Laplacian Pyramid as a Compact Image Code"
[2] Denton et al., "Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks"
[3] Stathaki, "Image Fusion: Algorithms and Applications"

8 Conventional Approach for MNIST Data
[Figure: conventional network applied to the original image I_0. Image: LeCun, et al. (1998)]
The MNIST database contains black and white handwritten digits, size-normalized to fit a 20x20 pixel box and centered in a 28x28 image. There are 60,000 training images and 10,000 testing images.

9 Separate Networks for Subband Learning
Our Approach
[Figure: Image (I_0), 28x28; Laplacian (L_0), 32x32; Gaussian (G_1), 16x16. Labels: Captures edges; Spatial information; L_0; G_1; Background cues; + Fusion]
Basic Idea: We separate imagery into different frequency bands (e.g., with different information content) so that the neural net can learn better using fewer bits.
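A one-level split into a Laplacian band L_0 and a Gaussian band G_1 can be sketched in plain Python. This is an illustration under assumptions, not the authors' pipeline: it uses a separable [1, 2, 1]/4 kernel as a small Gaussian-like filter, nearest-neighbour upsampling, and hypothetical function names, and it requires even image dimensions.

```python
def blur_121(img):
    """Separable [1, 2, 1] / 4 blur with edge clamping."""
    h, w = len(img), len(img[0])

    def clamp(i, n):
        return max(0, min(n - 1, i))

    rows = [[(img[y][clamp(x - 1, w)] + 2 * img[y][x] + img[y][clamp(x + 1, w)]) / 4.0
             for x in range(w)] for y in range(h)]
    return [[(rows[clamp(y - 1, h)][x] + 2 * rows[y][x] + rows[clamp(y + 1, h)][x]) / 4.0
             for x in range(w)] for y in range(h)]


def downsample(img):
    """Keep every second row and column."""
    return [row[::2] for row in img[::2]]


def upsample(img):
    """Nearest-neighbour upsampling by a factor of two."""
    out = []
    for row in img:
        wide = [v for v in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out


def laplacian_split(i0):
    """Return (L_0, G_1): the detail band and the half-resolution blurred band."""
    g1 = downsample(blur_121(i0))
    up = upsample(g1)
    l0 = [[a - b for a, b in zip(r0, r1)] for r0, r1 in zip(i0, up)]
    return l0, g1


def reconstruct(l0, g1):
    """Invert the split: I_0 = L_0 + upsample(G_1)."""
    up = upsample(g1)
    return [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(l0, up)]


# 8x8 test image with a vertical edge between columns 3 and 4.
edge = [[0.0] * 4 + [16.0] * 4 for _ in range(8)]
l0, g1 = laplacian_split(edge)
```

On the edge image, L_0 is zero in the flat regions and non-zero only next to the edge, while G_1 holds the coarse brightness at half resolution; adding the upsampled G_1 back to L_0 recovers the input.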

10 Validation Error vs Epochs in MNIST
[Figure: validation error (log scale) vs. epochs for four networks: Original I_0 (28x28), Laplace L_0 (32x32), Gaussian G_1 (16x16), and Fusion (+)]
Discussions: A CNN trained on the Laplacian focuses more on edges (a good feature). CNN(L_0) beats CNN(I_0) on this dataset.

11 Robustness to Low Bit Precision Weights
[Table: stochastic rounding after final epoch. Columns: weight bits (32-bit, 16-bit, 8-bit, 4-bit). Rows: Original, GBlur, Laplace, Fusion. Cell values missing.]
[Table: stochastic rounding after every epoch. Columns: weight bits (32-bit, 16-bit, 8-bit, 4-bit). Rows: Original, GBlur, Laplace, Fusion. Cell values missing.]
Discussions: Fusion results are comparable to the original, using half the number of bits. Stochastic rounding after every epoch guides the learning, and is especially useful for low precision.
Simple Fusion: equal-weighted average of softmax output scores, 0.5 * (s_1(x) + s_2(y)), where s_1(x), s_2(y) ∈ [0,1]^10 and ||s(x)||_1 = 1.
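The simple fusion rule on this slide, an equal-weighted average of two softmax score vectors, can be sketched as below. The function names and example logits are hypothetical; only the 0.5 * (s_1 + s_2) rule comes from the slide.

```python
import math


def softmax(logits):
    """Numerically stable softmax: entries in [0, 1] that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_equal(s1, s2):
    """Equal-weighted average of two score vectors of the same length."""
    if len(s1) != len(s2):
        raise ValueError("score vectors must have the same length")
    return [0.5 * (a + b) for a, b in zip(s1, s2)]


def predict(scores):
    """Index of the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical outputs of two sub-networks (e.g. Laplacian and Gaussian paths).
s1 = softmax([2.0, 1.0, 0.0])
s2 = softmax([0.0, 3.0, 0.0])
fused = fuse_equal(s1, s2)
```

Because each input sums to 1, the fused vector also sums to 1, so it can still be read as a score distribution over classes. In this toy case the first network alone prefers class 0, but the more confident second network pulls the fused decision to class 1.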

12 CIFAR-10 Data Set
The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
Discussions:
- Cluttered color images, more challenging than MNIST.
- Contains background cues and context that can help recognition (e.g., blue sky for airplane or water for ship).
- Architecture is the same as before (LeNet-5), with 30 and 60 feature maps.
- Our goal is to show comparative results for low precision. There is no data augmentation.

13 CIFAR-10 Performance
[Figure: test error vs. epoch for Original, Laplacian, Gblur, and Fusion. Annotations: "Gaussian blur removes noise in clutter"; "Laplacian enhances foreground"; "Cluttered image combined with high learning rate"]

14 Conclusion
Hybrid multimodal neural networks improve algorithmic performance.
Fusion of learnt representations is important.
Low precision networks show promise.
Stop by and visit the poster at NICE.
Chai, et al., "Low Precision Neural Networks using Subband Decomposition", CogArch, April

LEARNING DEEP MULTI-MODAL FUSION ARCHITECTURES GRAHAM TAYLOR SCHOOL OF ENGINEERING UNIVERSITY OF GUELPH Joint work with: Natalia Neverova (Facebook AI Research), Christian Wolf (INSA-Lyon) Dhanesh Ramachandram