Single Object Tracking with Organic Optic Attenuation


1 Single Object Tracking with Organic Optic Attenuation. Ibraheem Saleh. Note: the demo GIFs have been removed because they made the presentation too large to upload to Blackboard (the other GIFs have been lossy-compressed). Download the original presentation (970 MB) at: XeXYlCstDQDBPJJdU/view?usp=sharing

2 A Report on Hierarchical Attentive Recurrent Tracking. Submitted for publication June 2017. GitHub Page: Research Paper Publication Link: Research Paper Link: Adam Kosiorek et al.

3 Problem: Class-Agnostic Object Tracking. Different from class-aware object tracking, which knows the type of object being tracked. Many challenges: not knowing relational data about the object ahead of time, varying lighting conditions, environment occlusion, subject rotation, camera movement, motion blur, etc.

4 Significance: self-driving cars, artificial intelligence, special effects rendering, security systems, detecting abandoned bags at airports. Object tracking is one of the core problems in the field of computer vision.

5 What is HART? "Inspired by the general architecture of the human visual cortex and the role of attention mechanisms, this work presents a biologically-inspired recurrent model for single object tracking in videos" [1]. Goals of HART: fast (operate in real time); data-driven; perform object tracking by mimicking human eye mechanisms.

6 Biologically Inspired by Human Vision?!


7 Biology of Human Vision: Spatial Attention (Dorsal Stream). When a human tracks an object, the eyes retrieve the world-image information and pass it to the primary visual cortex in the brain, which further subdivides the image processing into two different processing pathways: the ventral stream and the dorsal stream. The dorsal stream is responsible for determining where the object of interest is (spatial attention). It quickly takes in the images from the visual cortex and discards spatially irrelevant information. Unlike many popular object tracking algorithms, humans don't process the image in full for each frame. [Figure: green, dorsal stream; purple, ventral stream.]

8 Biology of Human Vision: Appearance Attention (Ventral Stream). The ventral stream is responsible for determining what the object of interest is. This is where our brain learns the details about the object that we are actively looking at. What was the color of the subject's shirt? His pants? His cape? Humans have limited processing capability: "Whenever more than one visual stimulus is present in the receptive field of a neuron, all the stimuli compete for computational resources due to the limited processing capacity." [1] [Figure: green, dorsal stream; purple, ventral stream.]

9 Building the Model. [Model diagram. Labels: Input Images; glimpse; Primary Visual Cortex; Determine Where the object to track is; Extract Appearance Based Features; Features; Hadamard Product; Masked Features; Learn the Object that we are Tracking; Compute the next Attention Area; Compute the next Appearance Area; Bounding Box of Object. Symbols: x_t, g_t, v_t, V_t, s_t, s̄_t, o_t, h_{t-1}, h_t, α_{t+1}, a_{t+1}, b_t.] The model computes a foreground and background segmentation of the glimpse as a spatial Bernoulli distribution: each value in s_t represents the probability of the tracked object occupying the corresponding location.

10 Understanding the Model. [Model diagram with symbols x_t, g_t, v_t, V_t, s_t, s̄_t, o_t, h_{t-1}, h_t, α_{t+1}, a_{t+1}, b_t.]

11 Understanding the Model: Spatial Attention. [Model diagram with symbols x_t, g_t, v_t, V_t, s_t, s̄_t, o_t, h_{t-1}, h_t, α_{t+1}, a_{t+1}, b_t.]

12 Understanding the Model: Spatial Attention. Given an image, the spatial attention mechanism creates two matrices, one for each image axis. Each matrix contains one Gaussian per row; the width and position of each Gaussian determine which parts of the image are extracted. The initial glimpse bounds are specified externally to the model. Subsequent changes in the stride and centers of the glimpses are taken from the output of the LSTM.
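The glimpse extraction described on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the row normalisation, and the way centers are spaced by the stride are assumptions made for the example.

```python
import numpy as np

def gaussian_attention_matrix(center, stride, sigma, glimpse_size, image_size):
    """Return a (glimpse_size, image_size) matrix with one Gaussian per row.

    Hypothetical helper: row i is centred at
    center + (i - (glimpse_size - 1) / 2) * stride, with width sigma.
    """
    offsets = np.arange(glimpse_size) - (glimpse_size - 1) / 2.0
    mus = center + offsets * stride                  # Gaussian centre per row
    pixels = np.arange(image_size)                   # pixel coordinates
    weights = np.exp(-0.5 * ((pixels[None, :] - mus[:, None]) / sigma) ** 2)
    # Normalise each row so it acts as a weighted average over pixels.
    return weights / (weights.sum(axis=1, keepdims=True) + 1e-8)

def extract_glimpse(image, params_y, params_x, glimpse_shape):
    """Extract a glimpse as A_y @ image @ A_x^T (one matrix per image axis)."""
    A_y = gaussian_attention_matrix(*params_y, glimpse_shape[0], image.shape[0])
    A_x = gaussian_attention_matrix(*params_x, glimpse_shape[1], image.shape[1])
    return A_y @ image @ A_x.T

# Toy 10x10 "image"; (center, stride, sigma) would come from the LSTM output.
image = np.arange(100, dtype=float).reshape(10, 10)
glimpse = extract_glimpse(image, (4.5, 1.0, 0.5), (4.5, 1.0, 0.5), (4, 4))
print(glimpse.shape)
```

Because the whole operation is two matrix products, it is differentiable with respect to the centers, strides, and widths, which is what lets a recurrent network learn where to look.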

13 Understanding the Model: V1 & Ventral/Dorsal Streams. [Model diagram with symbols x_t, g_t, v_t, V_t, s_t, s̄_t, o_t, h_{t-1}, h_t, α_{t+1}, a_{t+1}, b_t.]

14 Understanding the Model: V1 & Ventral/Dorsal Streams. V1 (the primary visual cortex) is implemented as a CNN: given a glimpse, it applies a number of convolutional and max-pooling layers, and its output is then passed to both the dorsal and ventral streams. The ventral stream is also a CNN and outputs feature maps that capture visual features. The dorsal stream is implemented as a DFN (Dynamic Filter Network) [2]. The filters of a DFN are computed on the fly, conditioned on input features (as opposed to a traditional CNN, whose filters remain static after training).
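The difference between a static filter and a dynamically generated one can be illustrated with a small NumPy sketch. It assumes a single-channel feature map and a linear filter-generating layer; the names `generate_filter` and `apply_filter`, the shapes, and the random "learned" parameters are all hypothetical and are not taken from the paper or its code.

```python
import numpy as np

def generate_filter(conditioning, W, b, k):
    """Filter-generating step: map a conditioning vector to a k x k filter."""
    return (W @ conditioning + b).reshape(k, k)

def apply_filter(feature_map, kernel):
    """Valid 2D cross-correlation of a single-channel map with the kernel."""
    k = kernel.shape[0]
    rows = feature_map.shape[0] - k + 1
    cols = feature_map.shape[1] - k + 1
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = np.sum(feature_map[i:i + k, j:j + k] * kernel)
    return out

rng = np.random.default_rng(0)
k, d = 3, 8
W = rng.normal(size=(k * k, d))   # stand-in for learned generator weights
b = np.zeros(k * k)
feature_map = rng.normal(size=(6, 6))

# Two different conditioning vectors yield two different filters,
# so the same feature map is processed differently each time.
out_a = apply_filter(feature_map, generate_filter(rng.normal(size=d), W, b, k))
out_b = apply_filter(feature_map, generate_filter(rng.normal(size=d), W, b, k))
print(out_a.shape, np.allclose(out_a, out_b))
```

In a static CNN only `apply_filter` exists and `kernel` is fixed after training; here the trained parameters are `W` and `b`, and the kernel itself changes with the input.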

15 Understanding the Model: V1 & Ventral/Dorsal Streams. After processing through those networks, the model takes the Hadamard (element-wise) product of the outputs of the dorsal and ventral streams. This imitates the distractor-suppressing behavior of the human brain.
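A toy example of the masking step, assuming the ventral output is a small stack of feature maps and the dorsal output is a per-location foreground probability. The array values and shapes are invented for illustration.

```python
import numpy as np

# Hypothetical ventral-stream features: 2x2 spatial grid, 3 channels.
ventral = np.arange(1.0, 13.0).reshape(2, 2, 3)

# Hypothetical dorsal-stream output: probability that each location
# belongs to the tracked object (values in [0, 1]).
dorsal = np.array([[1.0, 0.0],
                   [0.5, 0.0]])

# Hadamard product, broadcasting the mask over the channel axis.
masked = ventral * dorsal[..., None]
print(masked[0, 1], masked[1, 0])
```

Locations with probability 0 contribute nothing to the masked features, which is how background distractors are suppressed before the features reach the recurrent part of the model.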

17 Understanding the Model: LSTM and MLP. [Model diagram with symbols x_t, g_t, v_t, V_t, s_t, s̄_t, o_t, h_{t-1}, h_t, α_{t+1}, a_{t+1}, b_t.]

18 Understanding the Model: LSTM and MLP. The masked feature outputs are then fed into a Long Short-Term Memory (LSTM) network, a special type of RNN. The output of the LSTM is used to compute the predicted attention and appearance for the next frame. The LSTM is designed with the assumption that the motion of these objects is representable as a Markovian state, i.e., future states depend only on the current state and not on frames or states before it.
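For readers unfamiliar with LSTMs, here is a minimal single-step sketch in NumPy using the standard gate equations. The weight shapes, the gate ordering inside `z`, and the random inputs standing in for masked features are assumptions for the example, not details of the HART code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. Shapes: W (4n, d), U (4n, n), b (4n,)."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:n])            # input gate
    f = sigmoid(z[n:2 * n])        # forget gate
    o = sigmoid(z[2 * n:3 * n])    # output gate
    g = np.tanh(z[3 * n:4 * n])    # candidate cell update
    c = f * c_prev + i * g         # new cell state
    h = o * np.tanh(c)             # new hidden state
    return h, c

rng = np.random.default_rng(0)
d, n = 5, 3                        # feature size and hidden size (arbitrary)
W = rng.normal(scale=0.1, size=(4 * n, d))
U = rng.normal(scale=0.1, size=(4 * n, n))
b = np.zeros(4 * n)

h, c = np.zeros(n), np.zeros(n)
for frame_features in rng.normal(size=(4, d)):   # four stand-in frames
    h, c = lstm_step(frame_features, h, c, W, U, b)
print(h)
```

Note that `lstm_step` reads only the previous state `(h_prev, c_prev)` and the current input, which mirrors the Markovian assumption on the slide: everything the model knows about earlier frames has to be summarised in that state.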

19 Understanding the Model: LSTM and MLP. Benefits of using an LSTM: it can learn rotating and occluded objects on the fly!

20 Loss Functions: HART Loss Function; Tracking Loss Function; Spatial Attention Loss Function; Appearance Attention Loss Function.

21 Experimentation Dataset: KITTI. The Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) [1] dataset is a collection of real-world pedestrian and traffic video footage taken from the perspective of a car. From the dataset authors: "we hired a set of annotators, to label 3D bounding boxes as tracklets in point clouds."

22 Kosiorek et al. Results: "by introducing a set of auxiliary losses we are able to scale to challenging real world data, outperforming predecessor attempts and approaching state-of-the-art performance." [1]

23 My Experimentation Simple Pedestrian

24 My Experimentation Simple Huge Truck

25 My Experimentation Distant Car

26 My Experimentation Distant Pedestrian

27 My Experimentation Car Turning

28 My Experimentation Apple Rolling

29 My Experimentation My Cat

30 My Experimentation Me On Screen to Off

31 My Experimentation Brother Walking

32 My Experimentation Me Driving on Street

33 My Thoughts: It learns the appearance features too slowly, so the tracker loses the object during fast rotational changes or lighting shifts (a problem with appearance attention). Once the object is out of frame momentarily, it doesn't know how to search for it anymore (a problem with the spatial attention mechanism). Too hard to train!

34 Questions?

35 Primary References
[1] Adam R. Kosiorek, Alex Bewley, and Ingmar Posner. Hierarchical Attentive Recurrent Tracking. 5 Sept. 2017, arXiv: v2.
[2] Bert De Brabandere, Xu Jia, Tinne Tuytelaars, and Luc Van Gool. Dynamic Filter Networks. NIPS.
[3] Samira Ebrahimi Kahou, Vincent Michalski, and Roland Memisevic. RATM: Recurrent Attentive Tracking Model. CVPR Work., 2017.
