RATM: Recurrent Attentive Tracking Model Samira Ebrahimi Kahou École Polytechnique de Montréal, Canada samira.ebrahimi-kahou@polymtl.ca Vincent Michalski Université de Montréal, Canada vincent.michalski@umontreal.ca arxiv:1510.08660v1 [cs.lg] 9 Oct 015 Roland Memisevic Université de Montréal, Canada roland.memisevic@umontreal.ca Abstract This work presents an attention mechanism-based neural network approach for tracking objects in video. A recurrent neural network is trained to predict the position of an object at time t + 1 given a series of selective glimpses into video frames at time steps 1 to t. The proposed recurrent attentive tracking model can be trained using simple gradient-based training. Various settings are explored in experiments on artificial data to justify design choices. 1 Introduction Traditional work on object tracking uses different sampling methods to select and evaluate multiple window candidates based on similarity in feature space. [1]. In this work we introduce a tracking method, that predicts the position of an object at time t only based on prior information, i.e. without actually considering the content of the current frame. Recently, a fully differentiable attention mechanism, that can selectively read and write in images was introduced and used to build a model for sequential generation of images []. Interestingly, in the experiments of [] on handwritten digits, the read mechanism learned to trace contours and the write mechanism generated digits in a continuous motion of the write window, hinting at the potential of this method for tracking applications. The proposed recurrent attentive tracking model (RATM) employs a modified version of that read mechanism. Previous work has investigated the application of attention mechanisms for tracking with neural networks [3]. In contrast to that work, we develop a fully-integrated neural framework, that can completely be trained by back-propagation. Method The proposed model mainly consists of two components: an attention mechanism that extracts patches from the input images and a recurrent neural network (RNN), which predicts attention parameters, i.e. it predicts where to look in the next frame. Sections.1 and. describe the attention mechanism and the RNN, respectively..1 Attention The attention mechanism introduced in [] extracts glimpses from the input data by applying a set of D Gaussian window filters. Each filter response corresponds to one pixel of the glimpse. Exploiting the separability of D Gaussian windows, this attention mechanism constructs an N N grid of two-dimensional Gaussian filters by sequentially applying a set of row filters and a set of column 1
(b) Top: A full image, Bottom: The selected attention patch, (a) Layout of the Recurrent Attentive Tracking Model. The black arrows indicate presented as network that the hidden states affect the patch extraction in the next time step. input. filters. The grid has 4 parameters 1, the grid center g X, g Y, the isotropic standard deviation σ and the stride between grid points δ. For the filter at row i, column j in the attention patch, the mean locations µ i X, µj Y are computed as follows: µ i X = g X + (i N 0.5) δ, µj Y = g Y + (j N 0.5) δ (1) The parameters are dynamically generated as a linear transformation of a vector h: followed by normalization of the parameters: g X = g X + 1 ( g X, g Y, σ, δ) = Wh, (), g Y = g Y + 1 max(a, B) 1, δ = δ, σ = σ, (3) N 1 where A and B are the width and height of the input image. In contrast to the original implementation in [], we use the absolute value function to ensure positivity of δ and σ. The intuition behind this choice is the following: In our experiments we observed that the stride and standard deviation parameters tend to saturate at low values, causing the attention window to zoom in into a single pixel. This effectively inhibited gradient flow through neighboring pixels of the attention filters. By replacing the exponential function with the absolute value, we introduce a discontinuity at x = 0. Piecewise linear activation functions have been shown to benefit optimization [4] and the absolute value function is a convenient trade-off between the harsh zeroing of all negative inputs of the Rectified Linear Unit (ReLU) and the extreme saturation for highly negative inputs of the exponential function. The filters which are used to extract a patch p from an image x are computed as follows: F X [i, a] = 1 exp ( (a ) ( ) µi X ) Z X σ, F Y [j, b] = 1 exp (b µj Y ) Z Y σ In Equation 4, a and b correspond to column and row indices in the input image. The N N patch p is then extracted from the image x by separate application of both filters: p = F Y xf T X. An example of the patch extraction is shown in Figure 1b.. Recurrent Neural Network Advances in the optimization of RNNs enable their successful application in a wide range of tasks [5, 6, 7]. The most simple RNN consists of one input, one hidden and one output layer. In time-step 1 The original read mechanism from [] also adds a scalar intensity parameter, that is multiplied to filter responses. (4)
t, the network, based on the input frame x t and the previous hidden state h t 1, computes the new hidden state h t = σ(w in x t + W rec h t 1 ), (5) where σ is a non-linear activation function, W in is the input and W rec is the recurrent weight matrix. While the application of the more sophisticated long-short-term memory [7] and gated recurrent units [6] is very common in recent work, we rely on a simpler variant of recurrent networks, called IRNN [8], in which W rec is initialized with a scaled version of the identity matrix and σ in Equation 5 corresponds to the element-wise ReLU activation function [4] In our experiments we use the IRNN implementation of [9]..3 Recurrent Attentive Tracking Model σ(x) = max(0, x). (6) The task of tracking involves the mapping of a sequence of inputs X = (x 1, x,..., x T ) to a sequence of locations Y = (y 1, y,..., y T ). Given that for the prediction ŷ t of an object s location at time t, the trajectory (y 1,..., y t 1 ) usually contains highly relevant contextual information, it is suitable to choose a hidden state model which has the capacity of building a representation of this trajectory. The proposed RATM is a fully differentiable neural network, consisting of the described attention mechanism and an RNN. The hidden state vector h t of the RNN is mapped to window parameters according to Equations and 3. The resulting image patch returned by the attention mechanism is then passed as input x t+1 to the RNN, which computes h t+1 according to Equation 5. A sketch of the model is shown in Figure 1a. Instead of passing the raw input patch to the RNN, a pre-trained model can be used to extract image features, e.g. a convolutional neural network (CNN). Note that the size of this feature model affects memory consumption and/or computation speed of the whole model, especially during training, because gradients have to be back-propagated through its instances in all time-steps. In Section 3, we explore various ways of generating a training signal for gradient descent training of RATM. 3 Experiments In this section we present some of the experiments with RATM and comment on qualitative results. Unless noted otherwise, the initial parameters of the attention mechanism are set such that the resulting first patch roughly covers the whole image. 3.1 Bouncing Balls For our initial experiment, we used the bouncing balls data set [10] with 3 frames. In the objective function, we minimized only the mean-squared error between the selected patch in the last frame and a target patch which is simply a cropped white ball image as the shape and color of the object is constant across the whole data set. The dynamics in this data set are very limited and even though the tracking looks accurate, it can be explained by the RNNs potential for memorizing the set of possible trajectories. For this reason, we experimented with more challenging data sets. Figure 1 shows the results of tracking a ball in a test sequence in the top row. 3. MNIST Single-Digit: To better understand the learning ability of RATM, we generated a new data set by moving one digit randomly drawn from MNIST [11]. We respected the same data split for training and testing as MNIST, i.e. digits were drawn from the training split to generate training sequences and testing split for testing. The generated videos have frame size 100 100 and show a digit moving in a random walk with momentum against a black background. This experiment is different from bouncing balls tracking since digits are from different classes which the model has to recognize. As an objective function we used the average classification error provided by a pre-trained shallow CNN, that classified the patches of each time-step. This CNN is trained on the MNIST data set and the model parameters remained fixed during RATM training. The model failed to learn only based on the classification term. To amend this, we initially tried to add a penalty on the distance between 3
corners of the ground truth bounding boxes and the corners of the attention grid. This again led to unsatisfactory results and we changed the localization cost to the mean-squared error between ground truth center coordinates and patch center. This modification helped the model to reliably detect and track the digits. The localization term helped in the early stages of training to guide RATM to track the digits, whereas the classification term encourages the model to properly zoom into the image and maximize classification accuracy. Training is done on sequences with only 10 frames. The classification and localization penalties were applied at every time-step. At test time, the CNN is switched off and we let the model track test sequences of 30 frames. Figure 1 shows one test sample in the middle row. Multi-Digit: It was interesting to see how robust RATM is in presence of another moving digit in the background. Thus, we generated new sequences by modifying the bouncing balls script [10]. The balls were replaced by randomly drawn MNIST digits. We also added a random walk with momentum to the motion vector. The cost function and settings are the same as in the single digit experiment, except that we set the initial bias of the attention mechanism such that the initial window is centered on the digit to be tracked. The model was trained on 10 frames and was able to focus on digits at least for 15 frames in test cases. Figure 1 shows tracking results on a test sequence in the bottom row. Figure 1: Sequences of the tracking experiments. Top: bouncing balls, middle: single MNIST digit, bottom: two MNIST digits. Each showing in two rows the original sequence and the sequence of selected patches. Each column is one time-step. The green cross marks the ground truth position and the red rectangle marks the attention window. 4 Conclusion and Future Work We showed that the proposed method, RATM, is able to learn to track objects in experiments on artificial data. We experimented with RATM on the Caltech Pedestrian Dataset [1, 13], but the number of long annotated training sequences which contain the same subject is quite low for deep learning standards. Moreover, due to the high frame rate motion is very limited, so the model learned to predict almost a static window. The results were inconclusive and are not presented here. Currently we are experimenting with other data sets, not only containing pedestrians, and try to provide quantitative results as a baseline for comparison. As a long-term project goal, we plan to explore multi-object tracking and to investigate the representation learned by the network. In particular, we will focus on the model s ability to deal with occlusions and to compress object identity and motion trajectory in its hidden state. Acknowledgments The authors would like to thank the developers of Theano [14, 15] and Jörg Bornschein for helpful discussions. This work was supported by an NSERC Discovery Award and the German BMBF, project 01GQ0841. We also thank the Canadian Foundation for Innovation (CFI) for support under the Leaders program. 4
References [1] Arnold W M Smeulders, Dung M Chu, Rita Cucchiara, Simone Calderara, Afshin Dehghan, and Mubarak Shah. Visual tracking: An experimental survey. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(7):144 1468, July 014. [] Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. CoRR, abs/150.0463, 015. [3] Misha Denil, Loris Bazzani, Hugo Larochelle, and Nando de Freitas. Learning where to attend with deep architectures for image tracking. CoRR, abs/1109.3737, 011. [4] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 7th International Conference on Machine Learning (ICML-10), pages 807 814, 010. [5] Ilya Sutskever, Oriol Vinyals, and Quoc VV Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104 311, 014. [6] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoderdecoder for statistical machine translation. arxiv preprint arxiv:1406.1078, 014. [7] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735 1780, 1997. [8] Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. A simple way to initialize recurrent networks of rectified linear units. arxiv preprint arxiv:1504.00941, 015. [9] Samira Ebrahimi Kahou, Vincent Michalski, Kishore Konda, Roland Memisevic, and Christopher Pal. Recurrent neural networks for emotion recognition in video. In Proceedings of the 17th ACM on International Conference on Multimodal Interaction, ICMI 15, 015. [10] Ilya Sutskever, Geoffrey E Hinton, and Graham W Taylor. The recurrent temporal restricted boltzmann machine. In Advances in Neural Information Processing Systems, pages 1601 1608, 009. [11] Yann Lecun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):78 34, Nov 1998. [1] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: A benchmark. In CVPR, June 009. [13] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: An evaluation of the state of the art. PAMI, 34, 01. [14] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud Bergeron, Nicolas Bouchard, David Warde-Farley, and Yoshua Bengio. Theano: new features and speed improvements. arxiv preprint arxiv:111.5590, 01. [15] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a cpu and gpu math expression compiler. In Proceedings of the Python for scientific computing conference (SciPy), volume 4, page 3. Austin, TX, 010. 5