Fully-Convolutional Siamese Networks for Object Tracking

Luca Bertinetto*, Jack Valmadre*, João Henriques, Andrea Vedaldi and Philip Torr
www.robots.ox.ac.uk/~luca
luca.bertinetto@eng.ox.ac.uk

Tracking of single, arbitrary objects
Problem: track an arbitrary object with the sole supervision of a single bounding box in the first frame of the video. Challenges: we need to be class-agnostic, and we face the stability-plasticity dilemma [Grossberg87]: how can a learning system remain plastic in response to significant new events, yet also remain stable in response to irrelevant events?

Recent history of object tracking [2010 - today]
Tracking-by-detection paradigm. Learn a binary classifier online (+ is object, - is background). Re-detect the object at every frame and update the classifier.

Recent history of object tracking [2014 - today]
Correlation filters become the most popular choice. The sample space (all cyclic shifts of a patch) is, loosely, a circulant matrix, which the Discrete Fourier Transform diagonalizes [Henriques15]. This gives fast training and evaluation of a linear classifier in the Fourier domain. Mostly used with HOG features.
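The Fourier-domain training mentioned above can be illustrated with a toy one-dimensional example. This is only a sketch under stated assumptions, not the method of [Henriques15]: the signal is a single-channel list of numbers, the function names (`train_filter`, `respond`), the regularizer `lam` and the Gaussian target are all illustrative choices, and a naive DFT stands in for a real FFT.

```python
import cmath
import math

def dft(signal):
    """Naive O(n^2) discrete Fourier transform (fine for a toy example)."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def train_filter(patch, target, lam=1e-3):
    """Ridge regression over all cyclic shifts of `patch`, solved per frequency.

    Because the shifted samples form a circulant matrix, the solution is an
    element-wise division in the Fourier domain. `lam` is an assumed regularizer.
    """
    patch_f, target_f = dft(patch), dft(target)
    return [pf.conjugate() * tf / (abs(pf) ** 2 + lam)
            for pf, tf in zip(patch_f, target_f)]

def respond(filter_f, patch):
    """Evaluate the filter at every cyclic shift of `patch` in one pass."""
    patch_f = dft(patch)
    return [value.real for value in idft([pf * hf for pf, hf in zip(patch_f, filter_f)])]

# Toy data (hypothetical): an 8-sample patch and a Gaussian target peaked at shift 0.
n = 8
base_patch = [0.0, 1.0, 3.0, 2.0, 0.5, -1.0, 0.0, 0.5]
desired = [math.exp(-0.5 * min(i, n - i) ** 2) for i in range(n)]
filter_f = train_filter(base_patch, desired)

# Move the patch 3 samples to the right; the response peak should follow it.
shift = 3
shifted_patch = base_patch[-shift:] + base_patch[:-shift]
response = respond(filter_f, shifted_patch)
peak = max(range(n), key=lambda i: response[i])
print("estimated shift:", peak)
```

The point of the sketch is that training and evaluation both reduce to element-wise operations on spectra, which is where the speed of correlation-filter trackers comes from.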

Recent history of object tracking [2015 - today]
What about the deep learning frenzy? In tracking, deep nets took more time to become mainstream. At CVPR 15, not a single tracker used deep nets at its core, or even deep features. At CVPR 16, 50% did. No clear advantage: slow, with performance similar to methods based on legacy features. Training on benchmarks is controversial: benchmarks propose very similar scenarios, so there is a risk of overfitting and a lack of generalization.

MDNet [CVPR16, winner of VOT15]
Best results so far. Rationale: separate domain-independent information (e.g. the concept of objectness) from domain-dependent (video-specific) information. Training: a fixed common part (3 conv + 2 fc) and several one-hot fc branches. Tracking: fine-tuning of several layers, hard-negative mining, bbox regression. Trained on benchmark videos. Very slow: 1 fps. Learning Multi-Domain Convolutional Neural Networks for Visual Tracking - Hyeonseob Nam and Bohyung Han - CVPR 2016.

Our work
We wanted to use conv-nets for arbitrary object tracking, under three constraints. Nothing below real time (at least 20-25 frames per second). No benchmark videos for training. Simplicity.

Vanilla siamese conv-net for similarity learning
A siamese conv-net is trained to address a similarity learning problem in an offline phase. The conv-net learns a function that compares an exemplar z to a candidate x of the same size. The score tells us how similar the two image patches are.

Fully-Convolutional Siamese Networks for Object Tracking
Our network is fully convolutional. No padding. No fully-connected layers. Two inputs of different sizes: the smaller is the exemplar (the target object during tracking), the bigger is the search area. The output of the embedding function has spatial support. Cross-correlation layer: computes the similarity at all translated sub-windows on a dense grid in a single evaluation. The output is a score map. Forward pass: >100 Hz. CODE AVAILABLE! www.robots.ox.ac.uk/~luca/siamese-fc.html
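To make the cross-correlation layer concrete, here is a minimal sketch in plain Python. It is not the released implementation: the embedding network is replaced by random stand-in feature maps, and the sizes (2 channels, a 6x6 exemplar embedding, a 22x22 search embedding) are assumptions chosen for illustration. With no padding, a search map of side hx and an exemplar map of side hz give a score map of side hx - hz + 1.

```python
import random

def cross_correlate(exemplar, search):
    """Slide the exemplar embedding over the search embedding.

    exemplar: [C][hz][wz], search: [C][hx][wx] (nested lists).
    Returns a score map of shape [hx - hz + 1][wx - wz + 1]; each entry is
    the inner product between the exemplar and one translated sub-window.
    """
    channels = len(exemplar)
    hz, wz = len(exemplar[0]), len(exemplar[0][0])
    hx, wx = len(search[0]), len(search[0][0])
    out_h, out_w = hx - hz + 1, wx - wz + 1
    scores = [[0.0] * out_w for _ in range(out_h)]
    for u in range(out_h):
        for v in range(out_w):
            total = 0.0
            for c in range(channels):
                for i in range(hz):
                    row_z = exemplar[c][i]
                    row_x = search[c][u + i]
                    for j in range(wz):
                        total += row_z[j] * row_x[v + j]
            scores[u][v] = total
    return scores

# Stand-in features (hypothetical): a real tracker would get these from the conv-net.
rng = random.Random(0)
channels, z_size, x_size = 2, 6, 22
exemplar_feat = [[[rng.uniform(-1.0, 1.0) for _ in range(z_size)]
                  for _ in range(z_size)] for _ in range(channels)]
search_feat = [[[0.0] * x_size for _ in range(x_size)] for _ in range(channels)]

# Plant a copy of the exemplar at offset (5, 9) inside the search features.
true_row, true_col = 5, 9
for c in range(channels):
    for i in range(z_size):
        for j in range(z_size):
            search_feat[c][true_row + i][true_col + j] = exemplar_feat[c][i][j]

score_map = cross_correlate(exemplar_feat, search_feat)
side = len(score_map)
peak = max(((u, v) for u in range(side) for v in range(side)),
           key=lambda p: score_map[p[0]][p[1]])
print("score map side:", side, "peak at:", peak)
```

The explicit loops are only for readability; in a network this is a single convolution-like operation, which is why all sub-windows are scored in one evaluation.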

Training
The dataset is built by extracting two patches with +/- context for every labelled object, then resizing them to 127x127 and 255x255. Pick a random video and a random pair of frames within the video (at most N frames apart). N controls the difficulty of the problem. Loss: mean of the logistic loss at every position of the score map.
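The loss described above can be sketched as follows. The logistic loss for a label y in {+1, -1} and a score v is log(1 + exp(-y v)), averaged over all positions of the score map. How positions are labelled is not stated on the slide; the rule used here (positive within a given radius of the map centre) and the names `make_labels` and `radius` are assumptions for illustration only.

```python
import math

def logistic_loss(label, score):
    """log(1 + exp(-label * score)), written in a numerically stable form."""
    margin = -label * score
    return max(margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def make_labels(size, radius):
    """Assumed labelling rule: +1 near the centre of the score map, -1 elsewhere."""
    center = (size - 1) / 2.0
    return [[1 if math.hypot(u - center, v - center) <= radius else -1
             for v in range(size)] for u in range(size)]

def mean_logistic_loss(labels, scores):
    """Average the per-position logistic loss over the whole score map."""
    losses = [logistic_loss(y, s)
              for label_row, score_row in zip(labels, scores)
              for y, s in zip(label_row, score_row)]
    return sum(losses) / len(losses)

# Toy check on a 17x17 map (size and radius are illustrative values).
labels = make_labels(17, 2)
uninformative = [[0.0] * 17 for _ in range(17)]
confident = [[10.0 * y for y in row] for row in labels]
print(mean_logistic_loss(labels, uninformative))  # log(2): no information
print(mean_logistic_loss(labels, confident))      # close to zero
```

An all-zero score map gives a loss of log(2), while a map that agrees confidently with every label drives the loss towards zero.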

ILSVRC15-VID (ImageNet Video)
So far the tracking community could not rely on a large labelled dataset. ALOV+OTB+VOT have fewer than 600 videos in total, with some overlap, and they should be reserved for testing. ImageNet Video: the official task is object detection and classification from video. Almost 4,500 videos and 1,200,000 bounding boxes! 30 classes: mostly animals (~75%) and some vehicles (~25%). Step-by-step guide to prepare the data to train our net: https://github.com/bertinetto/siamese-fc/tree/master/ilsvrc15-curation

Tracking pipeline
Activations for the exemplar z are computed only for the first frame. The sub-window of x with maximum similarity sets the new location. That's (almost) it! No update of the target representation. No re-detection. No bbox regression. No fine-tuning: fast! Only three little tricks: a pyramid of 3 scales; the response upsampled with bicubic interpolation; a cosine window to penalize large displacements. 50-100 fps.
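The localization step with the cosine window can be sketched as below. This is a simplified illustration, not the released tracker: the scale pyramid and bicubic upsampling are left out, the score map is assumed square, and the min-max normalization and the `window_influence` weight are assumed choices rather than values taken from the slides.

```python
import math

def hann(n):
    """One-dimensional Hann (raised cosine) window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def locate_target(score_map, window_influence=0.3):
    """Return the (row, col) displacement of the best position from the map centre.

    Scores are rescaled to [0, 1] and blended with a 2-D cosine window, so
    that peaks far from the previous location (the centre) are penalized.
    """
    size = len(score_map)
    flat = [s for row in score_map for s in row]
    lowest, highest = min(flat), max(flat)
    span = (highest - lowest) or 1.0  # guard against a constant map
    window = hann(size)
    best_value, best_pos = -1.0, (0, 0)
    for u in range(size):
        for v in range(size):
            normalized = (score_map[u][v] - lowest) / span
            blended = ((1.0 - window_influence) * normalized
                       + window_influence * window[u] * window[v])
            if blended > best_value:
                best_value, best_pos = blended, (u, v)
    center = (size - 1) // 2
    return best_pos[0] - center, best_pos[1] - center

# Two equally strong peaks (hypothetical scores): one next to the centre, one far away.
scores = [[0.0] * 17 for _ in range(17)]
scores[9][8] = 1.0    # one cell below the centre (8, 8)
scores[1][15] = 1.0   # near a corner, e.g. a distractor
print(locate_target(scores))                         # window favours the nearby peak
print(locate_target(scores, window_influence=0.0))   # no window: first maximum wins
```

With the window switched off, the two peaks tie and the far one is returned simply because it is found first; with the window, the peak next to the previous location wins.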

New state of the art for real-time trackers (OTB-13)

State of the art for general trackers (VOT-15)
At 1 fps, the best tracker is almost 2 orders of magnitude slower than our method, which runs at 86 frames per second. None of the top-15 trackers operates above 20 frames per second.

Concurrent work - GOTURN [ECCV '16]
Siamese architecture trained to solve a bounding-box regression problem. Unlike ours, the network is not fully convolutional. Trained on consecutive frames. They are not strictly learning a similarity function: the method also works (albeit worse) with a single branch. Fast (100 fps), but significantly lower results compared to our method. Learning to Track at 100 FPS with Deep Regression Networks - David Held, Sebastian Thrun, Silvio Savarese - ECCV 2016.

Concurrent work - SINT [CVPR '16]
Siamese architecture trained to learn a generic similarity function. Unlike ours, their network is not fully convolutional, and they resort instead to ROI pooling to sample candidates. Results reported only on OTB-13: ~2% better than our method. BBox regression to improve tracking performance. Much slower: only 2 fps vs 50-85 fps for our method. Siamese Instance Search for Tracking - Ran Tao, Efstratios Gavves, Arnold W.M. Smeulders - CVPR 2016.

A few examples

Conclusions
ImageNet Video: a new standard for training tracking algorithms? Siamese networks allow simplistic trackers to achieve state-of-the-art results. Fully-convolutional siamese: allows very high frame rates while still achieving state-of-the-art performance. Fully-convolutional siamese: a simple and fast building block for future work, e.g. online update of the representation. Code available: www.robots.ox.ac.uk/~luca/siamese-fc.html

Thank you.