Fully-Convolutional Siamese Networks for Object Tracking

Similar documents
Fully-Convolutional Siamese Networks for Object Tracking

Visual Object Tracking. Jianan Wu Megvii (Face++) Researcher Dec 2017

Fast CNN-Based Object Tracking Using Localization Layers and Deep Features Interpolation

Internet of things that video

Deep Learning in Visual Recognition. Thanks Da Zhang for the slides

Real-Time MDNet. 1 Introduction. Ilchae Jung 1, Jeany Son 1, Mooyeol Baek 1, and Bohyung Han 2

Deep learning for object detection. Slides from Svetlana Lazebnik and many others

Object detection with CNNs

Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking

Yiqi Yan. May 10, 2017

Spatial Localization and Detection. Lecture 8-1

arxiv: v1 [cs.cv] 3 Jan 2017

Object Detection Based on Deep Learning

arxiv: v1 [cs.cv] 5 Sep 2018

Learning to Track at 100 FPS with Deep Regression Networks - Supplementary Material

Return of the Devil in the Details: Delving Deep into Convolutional Nets

Lecture 5: Object Detection

Modeling and Propagating CNNs in a Tree Structure for Visual Tracking

arxiv: v2 [cs.cv] 8 Jan 2019

Regionlet Object Detector with Hand-crafted and CNN Feature

CENG 783. Special topics in. Deep Learning. AlchemyAPI. Week 11. Sinan Kalkan

Joint Object Detection and Viewpoint Estimation using CNN features

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

MCMOT: Multi-Class Multi-Object Tracking using Changing Point Detection

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Rich feature hierarchies for accurate object detection and semantic segmentation

Cascade Region Regression for Robust Object Detection

Lecture 7: Semantic Segmentation

MOTION ESTIMATION USING CONVOLUTIONAL NEURAL NETWORKS. Mustafa Ozan Tezcan

CREST: Convolutional Residual Learning for Visual Tracking

Mask R-CNN. presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma

PIXELS TO VOXELS: MODELING VISUAL REPRESENTATION IN THE HUMAN BRAIN

arxiv: v1 [cs.cv] 5 Jul 2018

A Twofold Siamese Network for Real-Time Object Tracking

OBJECT DETECTION HYUNG IL KOO

Object Detection. CS698N Final Project Presentation AKSHAT AGARWAL SIDDHARTH TANWAR

3 Object Detection. BVM 2018 Tutorial: Advanced Deep Learning Methods. Paul F. Jaeger, Division of Medical Image Computing

Structured Siamese Network for Real-Time Visual Tracking

Know your data - many types of networks

Flow-Based Video Recognition

RoI Pooling Based Fast Multi-Domain Convolutional Neural Networks for Visual Tracking

SSD: Single Shot MultiBox Detector. Author: Wei Liu et al. Presenter: Siyu Jiang

A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS. Kuan-Chuan Peng and Tsuhan Chen

Object Detection on Self-Driving Cars in China. Lingyun Li

Tracking by recognition using neural network

Visual Tracking via Spatially Aligned Correlation Filters Network

3D model classification using convolutional neural network

Semantic Segmentation

SIAMESE RECURRENT ARCHITECTURE FOR VISUAL TRACKING. Institute of Computing Technology, CAS, Bejing, , China

arxiv: v1 [cs.cv] 28 Nov 2017

Encoder-Decoder Networks for Semantic Segmentation. Sachin Mehta

arxiv: v1 [cs.cv] 7 Sep 2018

A Deep Learning Framework for Authorship Classification of Paintings

Classification of objects from Video Data (Group 30)

Object detection using Region Proposals (RCNN) Ernest Cheung COMP Presentation

Fully Convolutional Networks for Semantic Segmentation

arxiv: v2 [cs.cv] 13 Sep 2017

CS6501: Deep Learning for Visual Recognition. Object Detection I: RCNN, Fast-RCNN, Faster-RCNN

Learning Dynamic Memory Networks for Object Tracking

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

CPSC340. State-of-the-art Neural Networks. Nando de Freitas November, 2012 University of British Columbia

Discriminative Correlation Filter with Channel and Spatial Reliability (CSR-DCF)

Introduction to Deep Learning for Facial Understanding Part III: Regional CNNs

Online Object Tracking with Proposal Selection. Yang Hua Karteek Alahari Cordelia Schmid Inria Grenoble Rhône-Alpes, France

Deep-LK for Efficient Adaptive Object Tracking

ImageNet Classification with Deep Convolutional Neural Networks

Deep Learning For Video Classification. Presented by Natalie Carlebach & Gil Sharon

End-to-end representation learning for Correlation Filter based tracking

Automatic detection of books based on Faster R-CNN

Object Detection. Computer Vision Yuliang Zou, Virginia Tech. Many slides from D. Hoiem, J. Hays, J. Johnson, R. Girshick

JOINT DETECTION AND SEGMENTATION WITH DEEP HIERARCHICAL NETWORKS. Zhao Chen Machine Learning Intern, NVIDIA

arxiv: v2 [cs.cv] 10 Apr 2017

G-CNN: an Iterative Grid Based Object Detector

End-to-End Localization and Ranking for Relative Attributes

Recognition of Animal Skin Texture Attributes in the Wild. Amey Dharwadker (aap2174) Kai Zhang (kz2213)

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

CSE 559A: Computer Vision

Structured Prediction using Convolutional Neural Networks

arxiv: v1 [cs.cv] 26 Nov 2017

Fast and Accurate Online Video Object Segmentation via Tracking Parts

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Project 3 Q&A. Jonathan Krause

A Novel Representation and Pipeline for Object Detection

Scene Text Recognition for Augmented Reality. Sagar G V Adviser: Prof. Bharadwaj Amrutur Indian Institute Of Science

Using Machine Learning for Classification of Cancer Cells

arxiv: v1 [cs.cv] 10 Nov 2017 Abstract

YOLO 9000 TAEWAN KIM

Category-level localization

WE are witnessing a rapid, revolutionary change in our

ECCV Presented by: Boris Ivanovic and Yolanda Wang CS 331B - November 16, 2016

CS468: 3D Deep Learning on Point Cloud Data. class label part label. Hao Su. image. May 10, 2017

Transfer Learning. Style Transfer in Deep Learning

Object Detection. Part1. Presenter: Dae-Yong

Robust Tracking using Manifold Convolutional Neural Networks with Laplacian Regularization

Pixel Offset Regression (POR) for Single-shot Instance Segmentation

Unveiling the Power of Deep Tracking

CS 2750: Machine Learning. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh April 13, 2016

Multi-task Correlation Particle Filter for Robust Object Tracking

Yield Estimation using faster R-CNN

Transcription:

Fully-Convolutional Siamese Networks for Object Tracking Luca Bertinetto*, Jack Valmadre*, João Henriques, Andrea Vedaldi and Philip Torr www.robots.ox.ac.uk/~luca luca.bertinetto@eng.ox.ac.uk

Tracking of single, arbitrary objects Problem. Track an arbitrary object with the sole input of a single bounding box in the first frame of the video. Challenge: we need to be class-agnostic. t=0 t=0 t=0 1/16

Tracking-by-detection paradigm Learn online a binary classifier (+ is object, - is background). Re-detect the object at every frame + update the classifier Online training and testing. 2/16

What about the deep learning frenzy? In tracking community, deep-nets took more time to become mainstream. CVPR 15 - not a single tracker was using deep-nets as a core component and not even deep features. CVPR 16-50% were. Sometimes better performance than legacy features, but Training on benchmarks controversial. Slow 1 fps Learning Multi-Domain Convolutional Neural Networks for Visual Tracking - Hyeonseob Nam and Bohyung Han - CVPR 2016. 3/16

Our work Conv-nets for arbitrary object tracking, with three constraints. 1. Real-time. 2. No benchmark videos for training. 3. Simple. 4/16

Vanilla siamese conv-net Trains a model to address a similarity learning problem. Function compares an exemplar z to a candidate of the same size x. Output score tell us how similar are the two image patches. 5/16

Our architecture Our network is fully convolutional. Two inputs of different sizes: smaller (exemplar / target-object). bigger (search area). Cross-correlation layer: computes the similarity at all translated sub-windows on a dense grid in a single evaluation. Output is a score map. Cross-correlation layer Forward pass: >100Hz CODE AVAILABLE! www.robots.ox.ac.uk/~luca/siamese-fc.html 6/16

ILSVRC15-VID (ImageNet Video) So far tracking community could not rely on large labelled dataset. ALOV+OTB+VOT in total have less than 600 video, with some overlap. Not all labelled per frame. ImageNet Video Official task is object detection from video - can be easily adapted to arbitrary object tracking. Almost 4,500 videos and 1,200,000 bounding boxes! 30 classes: mostly animals (~75%) and some vehicles (~25%) CODE AVAILABLE! www.robots.ox.ac.uk/~luca/siamese-fc.html 7/16

Training Dataset build by extracting two patches with different amount of context for every labelled object. Then resized to 127x127 and 255x255. Pick random video and random pair of frames within the video (max N frames apart). N controls the difficulty of the problem. Mean of logistic loss over all positions, CODE AVAILABLE! www.robots.ox.ac.uk/~luca/siamese-fc.html 8/16

Tracking pipeline Activations for the exemplar z only computed for first frame. Subwindow of x with max similarity sets the new location. That s (almost) it! No update of target representation. No bbox regression. No fine-tuning fast! Only three little tricks: Pyramid of 3 scales. Response upsamped with bi-cubic interpolation. Cosine window to penalize large displacements. Frame 1 Computed only once! Frame t CODE AVAILABLE! www.robots.ox.ac.uk/~luca/siamese-fc.html 50-100 fps 9/16

New state-of-the art for real-time trackers (OTB-13) 10/16

State-of-the-art for general trackers (VOT-15) At 1 fps, the best tracker is almost 2 orders of magnitude slower of our method, which runs at 86 frames per second. None among the top-15 trackers operate above 20 frames per second. 11/16

Results reflect training dataset bias

Concurrent work - GOTURN [ECCV `16] Siamese architecture trained to solve Bounding Box regression problems. Differently, network is not fully convolutional. Trained from consecutive frames. They are not strictly learning a similarity function - method works (albeit worse) also with a single branch. Fast (100fps), but much lower results compared to our method (only VOT-14 available). Learning to Track at 100 FPS with Deep Regression Networks - David Held, Sebastian Thrun, Silvio Savarese - ECCV 2016. 13/16

Concurrent work - SINT [CVPR `16] Siamese architecture trained to learn a generic similarity function. Differently, their network is not fully convolutional and they recur instead to ROI pooling to sample candidates. Results reported only on OTB-13: relative +2% better than our method. BBox regression to improve tracking performance. Much slower: only 2 fps vs 85 fps of our method. Siamese Instance Search for Tracking - Ran Tao, Efstratios Gavves, Arnold W.M. Smeulders - CVPR 2016. 14/16

Few examples 15/16

Conclusions ImageNet Video: new standard for training tracking algorithms? Fully-convolutional siamese: Generalizes well (trained on ImageNet Video). allows very high frame-rates, still achieving state-of-the-art performance. Simple+efficient building block for future work. Code available: www.robots.ox.ac.uk/~luca/siamese-fc.html 16/16

Thank you.