arxiv: v1 [cs.lg] 29 Oct 2015

RATM: Recurrent Attentive Tracking Model

Samira Ebrahimi Kahou, École Polytechnique de Montréal, Canada, samira.ebrahimi-kahou@polymtl.ca
Vincent Michalski, Université de Montréal, Canada, vincent.michalski@umontreal.ca
Roland Memisevic, Université de Montréal, Canada, roland.memisevic@umontreal.ca

arXiv:1510.08660v1 [cs.LG] 29 Oct 2015

Abstract

This work presents an attention mechanism-based neural network approach for tracking objects in video. A recurrent neural network is trained to predict the position of an object at time $t + 1$ given a series of selective glimpses into video frames at time steps 1 to $t$. The proposed recurrent attentive tracking model can be trained using simple gradient-based training. Various settings are explored in experiments on artificial data to justify design choices.

1 Introduction

Traditional work on object tracking uses different sampling methods to select and evaluate multiple window candidates based on similarity in feature space [1]. In this work we introduce a tracking method that predicts the position of an object at time $t$ based only on prior information, i.e. without actually considering the content of the current frame.

Recently, a fully differentiable attention mechanism that can selectively read from and write to images was introduced and used to build a model for sequential generation of images [2]. Interestingly, in the experiments of [2] on handwritten digits, the read mechanism learned to trace contours and the write mechanism generated digits in a continuous motion of the write window, hinting at the potential of this method for tracking applications. The proposed recurrent attentive tracking model (RATM) employs a modified version of that read mechanism.

Previous work has investigated the application of attention mechanisms for tracking with neural networks [3]. In contrast to that work, we develop a fully integrated neural framework that can be trained entirely by back-propagation.

2 Method

The proposed model consists mainly of two components: an attention mechanism that extracts patches from the input images, and a recurrent neural network (RNN) that predicts the attention parameters, i.e. where to look in the next frame. Sections 2.1 and 2.2 describe the attention mechanism and the RNN, respectively.

2.1 Attention

The attention mechanism introduced in [2] extracts glimpses from the input data by applying a set of 2D Gaussian window filters. Each filter response corresponds to one pixel of the glimpse. Exploiting the separability of 2D Gaussian windows, this attention mechanism constructs an $N \times N$ grid of two-dimensional Gaussian filters by sequentially applying a set of row filters and a set of column filters.

(a) Layout of the Recurrent Attentive Tracking Model. The black arrows indicate that the hidden states affect the patch extraction in the next time step.
(b) Top: a full image. Bottom: the selected attention patch, presented as network input.

The grid has 4 parameters¹: the grid center $g_X$, $g_Y$, the isotropic standard deviation $\sigma$, and the stride between grid points $\delta$. For the filter at row $i$, column $j$ of the attention patch, the mean locations $\mu_X^i$, $\mu_Y^j$ are computed as follows:

$$\mu_X^i = g_X + (i - N/2 - 0.5)\,\delta, \qquad \mu_Y^j = g_Y + (j - N/2 - 0.5)\,\delta \tag{1}$$

The parameters are dynamically generated as a linear transformation of a vector $h$:

$$(\tilde{g}_X, \tilde{g}_Y, \tilde{\sigma}, \tilde{\delta}) = W h, \tag{2}$$

followed by normalization of the parameters:

$$g_X = \frac{\tilde{g}_X + 1}{2}, \qquad g_Y = \frac{\tilde{g}_Y + 1}{2}, \qquad \delta = \frac{\max(A, B) - 1}{N - 1}\,|\tilde{\delta}|, \qquad \sigma = |\tilde{\sigma}|, \tag{3}$$

where $A$ and $B$ are the width and height of the input image. In contrast to the original implementation in [2], we use the absolute value function to ensure positivity of $\delta$ and $\sigma$. The intuition behind this choice is the following: in our experiments we observed that the stride and standard deviation parameters tend to saturate at low values, causing the attention window to zoom in on a single pixel. This effectively inhibited gradient flow through neighboring pixels of the attention filters. By replacing the exponential function with the absolute value, we introduce a discontinuity in the derivative at $x = 0$. Piecewise linear activation functions have been shown to benefit optimization [4], and the absolute value function is a convenient trade-off between the harsh zeroing of all negative inputs of the Rectified Linear Unit (ReLU) and the extreme saturation for highly negative inputs of the exponential function.

The filters which are used to extract a patch $p$ from an image $x$ are computed as follows:

$$F_X[i, a] = \frac{1}{Z_X} \exp\left(-\frac{(a - \mu_X^i)^2}{2\sigma^2}\right), \qquad F_Y[j, b] = \frac{1}{Z_Y} \exp\left(-\frac{(b - \mu_Y^j)^2}{2\sigma^2}\right) \tag{4}$$

In Equation 4, $a$ and $b$ correspond to column and row indices in the input image.
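
As an illustration of Equations 1 and 4, the following NumPy sketch builds one bank of one-dimensional Gaussian filters. It is not the authors' implementation: the function name `filterbank`, the 1-based grid index, the use of pixel units for the center and stride, and the choice of $Z$ as the per-filter sum (as in the read mechanism of [2]) are assumptions made for the example.

```python
import numpy as np


def filterbank(center, sigma, delta, n_grid, size):
    """Return an (n_grid, size) bank of 1-D Gaussian filters and their means.

    Row i is a Gaussian over pixel indices 0..size-1 whose mean follows
    Equation 1 and whose shape follows Equation 4.
    """
    # Assumption: the grid index runs from 1 to n_grid, so the means are
    # placed symmetrically around the grid center.
    i = np.arange(1, n_grid + 1)
    mu = center + (i - n_grid / 2 - 0.5) * delta          # Equation 1
    a = np.arange(size)                                   # pixel coordinates
    f = np.exp(-((a[None, :] - mu[:, None]) ** 2) / (2.0 * sigma ** 2))
    # Assumption: Z normalizes each filter to sum to one. The small floor
    # guards against a filter that lies entirely outside the image.
    z = np.maximum(f.sum(axis=1, keepdims=True), 1e-8)
    return f / z, mu


# A 3-point grid centered on pixel 5 with stride 2, over an 11-pixel axis.
F_x, mu_x = filterbank(center=5.0, sigma=1.0, delta=2.0, n_grid=3, size=11)
```

Calling the function once with the image width and once with the image height yields $F_X$ and $F_Y$, respectively.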
The $N \times N$ patch $p$ is then extracted from the image $x$ by separate application of both filters: $p = F_Y x F_X^T$. An example of the patch extraction is shown in Figure 1b.

¹ The original read mechanism from [2] also adds a scalar intensity parameter that is multiplied with the filter responses.

2.2 Recurrent Neural Network

Advances in the optimization of RNNs enable their successful application in a wide range of tasks [5, 6, 7]. The simplest RNN consists of one input, one hidden and one output layer. In time-step $t$, the network, based on the input frame $x_t$ and the previous hidden state $h_{t-1}$, computes the new hidden state

$$h_t = \sigma(W_{in} x_t + W_{rec} h_{t-1}), \tag{5}$$

where $\sigma$ is a non-linear activation function, $W_{in}$ is the input weight matrix and $W_{rec}$ is the recurrent weight matrix. While the application of the more sophisticated long short-term memory [7] and gated recurrent units [6] is very common in recent work, we rely on a simpler variant of recurrent networks, called IRNN [8], in which $W_{rec}$ is initialized with a scaled version of the identity matrix and $\sigma$ in Equation 5 corresponds to the element-wise ReLU activation function [4]

$$\sigma(x) = \max(0, x). \tag{6}$$

In our experiments we use the IRNN implementation of [9].

2.3 Recurrent Attentive Tracking Model

The task of tracking involves the mapping of a sequence of inputs $X = (x_1, x_2, \dots, x_T)$ to a sequence of locations $Y = (y_1, y_2, \dots, y_T)$. Given that for the prediction $\hat{y}_t$ of an object's location at time $t$, the trajectory $(y_1, \dots, y_{t-1})$ usually contains highly relevant contextual information, it is suitable to choose a hidden state model which has the capacity to build a representation of this trajectory.

The proposed RATM is a fully differentiable neural network, consisting of the described attention mechanism and an RNN. The hidden state vector $h_t$ of the RNN is mapped to window parameters according to Equations 2 and 3. The resulting image patch returned by the attention mechanism is then passed as input $x_{t+1}$ to the RNN, which computes $h_{t+1}$ according to Equation 5. A sketch of the model is shown in Figure 1a.

Instead of passing the raw input patch to the RNN, a pre-trained model can be used to extract image features, e.g. a convolutional neural network (CNN). Note that the size of this feature model affects memory consumption and/or computation speed of the whole model, especially during training, because gradients have to be back-propagated through its instances in all time-steps.
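
The loop described in this section can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' code: all function and variable names are hypothetical, a bias term `b_att` is added to Equation 2 so that the first window can be initialized, the relative center of Equation 3 is assumed to be scaled to pixel coordinates, a small constant keeps $\sigma$ strictly positive, and the identity scale of 0.5 is arbitrary. Training, i.e. back-propagation through the loop, is omitted.

```python
import numpy as np


def filterbank(center, sigma, delta, n_grid, size):
    """(n_grid, size) bank of normalized 1-D Gaussians (Equations 1 and 4)."""
    i = np.arange(1, n_grid + 1)                  # assumed 1-based grid index
    mu = center + (i - n_grid / 2 - 0.5) * delta
    a = np.arange(size)
    f = np.exp(-((a[None, :] - mu[:, None]) ** 2) / (2.0 * sigma ** 2))
    return f / np.maximum(f.sum(axis=1, keepdims=True), 1e-8)


def window_params(h, W_att, b_att, n_grid, width, height):
    """Map a hidden state to (g_x, g_y, sigma, delta), Equations 2 and 3."""
    g_x, g_y, sigma, delta = W_att @ h + b_att    # Equation 2 plus assumed bias
    # Assumption: the relative center of Equation 3 is scaled to pixels.
    g_x = (g_x + 1.0) / 2.0 * (width - 1)
    g_y = (g_y + 1.0) / 2.0 * (height - 1)
    delta = (max(width, height) - 1) / (n_grid - 1) * abs(delta)
    sigma = abs(sigma) + 1e-3                     # floor avoids division by zero
    return g_x, g_y, sigma, delta


def read_patch(x, g_x, g_y, sigma, delta, n_grid):
    """Extract an n_grid x n_grid glimpse, p = F_Y x F_X^T."""
    height, width = x.shape
    f_x = filterbank(g_x, sigma, delta, n_grid, width)
    f_y = filterbank(g_y, sigma, delta, n_grid, height)
    return f_y @ x @ f_x.T


def ratm_forward(frames, W_in, W_rec, W_att, b_att, n_grid):
    """Run the attention/RNN loop and return the window center per frame."""
    height, width = frames[0].shape
    h = np.zeros(W_rec.shape[0])
    centers = []
    for x in frames:
        # The previous hidden state decides where to look in this frame.
        g_x, g_y, sigma, delta = window_params(
            h, W_att, b_att, n_grid, width, height)
        p = read_patch(x, g_x, g_y, sigma, delta, n_grid)
        # IRNN update with ReLU activation (Equations 5 and 6).
        h = np.maximum(0.0, W_in @ p.ravel() + W_rec @ h)
        centers.append((g_x, g_y))
    return centers


rng = np.random.default_rng(0)
n_grid, n_hidden = 8, 16
frames = [rng.random((32, 32)) for _ in range(5)]
W_in = rng.normal(0.0, 0.01, (n_hidden, n_grid * n_grid))
W_rec = 0.5 * np.eye(n_hidden)                    # scaled identity (IRNN-style)
W_att = np.zeros((4, n_hidden))                   # untrained: window stays fixed
b_att = np.array([0.0, 0.0, 1.0, 1.0])            # first window spans the image
centers = ratm_forward(frames, W_in, W_rec, W_att, b_att, n_grid)
```

With the bias chosen as above, the first window is centered on the image and its grid spans the longer image side, which mirrors the initialization described in the experiments.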
In Section 3, we explore various ways of generating a training signal for gradient descent training of RATM.

3 Experiments

In this section we present some of the experiments with RATM and comment on qualitative results. Unless noted otherwise, the initial parameters of the attention mechanism are set such that the resulting first patch roughly covers the whole image.

3.1 Bouncing Balls

For our initial experiment, we used the bouncing balls data set [10] with 3 frames. In the objective function, we minimized only the mean-squared error between the selected patch in the last frame and a target patch, which is simply a cropped white ball image, since the shape and color of the object are constant across the whole data set. The dynamics in this data set are very limited, and even though the tracking looks accurate, it can be explained by the RNN's potential for memorizing the set of possible trajectories. For this reason, we experimented with more challenging data sets. Figure 1 shows the results of tracking a ball in a test sequence in the top row.

3.2 MNIST

Single-Digit: To better understand the learning ability of RATM, we generated a new data set by moving one digit randomly drawn from MNIST [11]. We respected the same data split for training and testing as MNIST, i.e. digits were drawn from the training split to generate training sequences and from the testing split to generate test sequences. The generated videos have frame size $100 \times 100$ and show a digit moving in a random walk with momentum against a black background. This experiment differs from bouncing balls tracking in that the digits come from different classes, which the model has to recognize. As an objective function we used the average classification error provided by a pre-trained shallow CNN that classified the patch at each time-step. This CNN was trained on the MNIST data set, and its parameters remained fixed during RATM training. The model failed to learn when trained with the classification term alone.
To amend this, we initially tried to add a penalty on the distance between the corners of the ground truth bounding boxes and the corners of the attention grid. This again led to unsatisfactory results, and we changed the localization cost to the mean-squared error between the ground truth center coordinates and the patch center. This modification helped the model to reliably detect and track the digits. The localization term helped in the early stages of training to guide RATM to track the digits, whereas the classification term encourages the model to properly zoom into the image and maximize classification accuracy. Training is done on sequences with only 10 frames. The classification and localization penalties were applied at every time-step. At test time, the CNN is switched off and we let the model track test sequences of 30 frames. Figure 1 shows one test sample in the middle row.

Multi-Digit: To see how robust RATM is in the presence of another moving digit in the background, we generated new sequences by modifying the bouncing balls script [10]. The balls were replaced by randomly drawn MNIST digits. We also added a random walk with momentum to the motion vector. The cost function and settings are the same as in the single-digit experiment, except that we set the initial bias of the attention mechanism such that the initial window is centered on the digit to be tracked. The model was trained on 10 frames and was able to stay focused on the digit for at least 15 frames in test cases. Figure 1 shows tracking results on a test sequence in the bottom row.

Figure 1: Sequences of the tracking experiments. Top: bouncing balls, middle: single MNIST digit, bottom: two MNIST digits. Each example shows, in two rows, the original sequence and the sequence of selected patches. Each column is one time-step. The green cross marks the ground truth position and the red rectangle marks the attention window.

4 Conclusion and Future Work

We showed that the proposed method, RATM, is able to learn to track objects in experiments on artificial data.
We experimented with RATM on the Caltech Pedestrian Dataset [12, 13], but the number of long annotated training sequences that contain the same subject is quite low by deep learning standards. Moreover, due to the high frame rate, motion between frames is very limited, so the model learned to predict an almost static window. The results were inconclusive and are not presented here. We are currently experimenting with other data sets, not only ones containing pedestrians, and aim to provide quantitative results as a baseline for comparison.

As a long-term project goal, we plan to explore multi-object tracking and to investigate the representation learned by the network. In particular, we will focus on the model's ability to deal with occlusions and to compress object identity and motion trajectory in its hidden state.

Acknowledgments

The authors would like to thank the developers of Theano [14, 15] and Jörg Bornschein for helpful discussions. This work was supported by an NSERC Discovery Award and the German BMBF, project 01GQ0841. We also thank the Canadian Foundation for Innovation (CFI) for support under the Leaders program.

References

[1] Arnold W M Smeulders, Dung M Chu, Rita Cucchiara, Simone Calderara, Afshin Dehghan, and Mubarak Shah. Visual tracking: An experimental survey. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(7):1442–1468, July 2014.
[2] Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. CoRR, abs/1502.04623, 2015.
[3] Misha Denil, Loris Bazzani, Hugo Larochelle, and Nando de Freitas. Learning where to attend with deep architectures for image tracking. CoRR, abs/1109.3737, 2011.
[4] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[5] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[6] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[7] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[8] Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
[9] Samira Ebrahimi Kahou, Vincent Michalski, Kishore Konda, Roland Memisevic, and Christopher Pal. Recurrent neural networks for emotion recognition in video. In Proceedings of the 17th ACM on International Conference on Multimodal Interaction, ICMI '15, 2015.
[10] Ilya Sutskever, Geoffrey E Hinton, and Graham W Taylor. The recurrent temporal restricted Boltzmann machine. In Advances in Neural Information Processing Systems, pages 1601–1608, 2009.
[11] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998.
[12] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: A benchmark. In CVPR, June 2009.
[13] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: An evaluation of the state of the art. PAMI, 34, 2012.
[14] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud Bergeron, Nicolas Bouchard, David Warde-Farley, and Yoshua Bengio. Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590, 2012.
[15] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), volume 4, page 3. Austin, TX, 2010.