MCMOT: Multi-Class Multi-Object Tracking using Changing Point Detection

Transcription:

MCMOT: Multi-Class Multi-Object Tracking using Changing Point Detection
ILSVRC 2016 Object Detection from Video
Byungjae Lee¹, Songguo Jin¹, Enkhbayar Erdenee¹, Mi Young Nam², Young Gui Jung², Phill Kyu Rhee¹
Inha University¹, NaeulTech²

Results (with additional training data)
Object Detection from Video (VID): 2nd place (mAP: 73.15%)
Object Detection/Tracking from Video (VID): 2nd place (mAP: 49.09%)

Overview
I. Faster R-CNN Object Detector: + Context region, + Larger feature map, + Ensemble, + Data configuration
II. MCMOT: Multi-Class Multi-Object Tracking
Tracking by Detection. Detection: Ensemble of CNNs. Tracking: MCMOT using CPD.
S. Ren, K. He, R. Girshick, & J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. TPAMI 2016.
B. Lee, E. Erdenee, S. Jin, & P. Rhee. Multi-Class Multi-Object Tracking using Changing Point Detection. arXiv 2016.

I. Faster R-CNN Object Detector

Object Detection from Video
Challenge I: Small Object
Solution: Larger feature map
* Figure from J. Li et al. Scale-aware Fast R-CNN for Pedestrian Detection. arXiv 2015.

Object Detection from Video
Challenge II: Blurred Object
Solution: Context region
* Figure from Y. Zhu et al. segDeepM: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection. CVPR 2015.

Network Architecture
I. Faster R-CNN with VGG16
[Diagram: convolutional network (Conv1 to Conv5, with pool1 to pool4) feeding the Region Proposal Network and the RoI pooling layer, followed by two FC layers, a class layer (softmax class loss), and a bbox layer (bbox regression loss).]
K. Simonyan, & A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015.

Network Architecture
I. Faster R-CNN with VGG16
+ Larger feature map (remove pool4 layer) (+2.9% mAP)
[Diagram: convolutional network (Conv1 to Conv4', with pool1 to pool3) feeding the Region Proposal Network and the RoI pooling layer, followed by two FC layers, a class layer (softmax class loss), and a bbox layer (bbox regression loss).]
J. Li, X. Liang, S. Shen, T. Xu, & S. Yan. Scale-aware Fast R-CNN for Pedestrian Detection. arXiv 2015.
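
The effect of removing pool4 on feature map size can be sketched with simple arithmetic. This is a minimal illustration, not the authors' code: it assumes each pooling layer halves the spatial resolution (with floor rounding) and uses a hypothetical 600 x 1000 input; the function name is made up for this example.

```python
def feature_map_size(height, width, num_pools):
    """Return (stride, feature height, feature width) after num_pools 2x2 poolings.

    Assumption: every pooling layer halves the resolution and rounds down.
    """
    stride = 2 ** num_pools
    return stride, height // stride, width // stride


# Baseline: pool1..pool4 sit before the shared feature map, giving stride 16.
baseline = feature_map_size(600, 1000, num_pools=4)
# Without pool4 only three poolings remain, giving stride 8.
larger = feature_map_size(600, 1000, num_pools=3)
print(baseline, larger)
```

Under these assumptions the feature map roughly doubles in each spatial dimension, so a small object covers about four times as many feature cells.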

Network Architecture
I. Faster R-CNN with VGG16
+ Larger feature map (remove pool4 layer) (+2.9% mAP)
+ Context region (+2.6% mAP)
[Diagram: convolutional network (Conv1 to Conv4', with pool1 to pool3) feeding the Region Proposal Network and two RoI pooling branches. The RoI pooling layer pools the proposal box; the RoI pooling context layer pools the box scaled by 1.5. Each branch has two FC layers, and a concat layer joins them before the class layer (softmax class loss) and the bbox layer (bbox regression loss).]
S. Gidaris, & N. Komodakis. Object detection via a multi-region & semantic segmentation-aware CNN model. CVPR 2015.
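
The "Box * 1.5" step for the context branch might look like the sketch below. The slide gives only the scale factor, so scaling about the box center and clipping to the image are assumptions, and `context_box` is a hypothetical helper name.

```python
def context_box(box, scale=1.5, image_size=None):
    """Scale an (x1, y1, x2, y2) box about its center.

    Assumptions: the box grows symmetrically around its center and, if
    image_size=(width, height) is given, is clipped to the image bounds.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * scale / 2.0
    half_h = (y2 - y1) * scale / 2.0
    nx1, ny1, nx2, ny2 = cx - half_w, cy - half_h, cx + half_w, cy + half_h
    if image_size is not None:
        width, height = image_size
        nx1, ny1 = max(0.0, nx1), max(0.0, ny1)
        nx2, ny2 = min(float(width), nx2), min(float(height), ny2)
    return (nx1, ny1, nx2, ny2)


print(context_box((100, 100, 200, 200)))
```

The enlarged box would then be pooled by the context RoI pooling layer alongside the original proposal.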

Training Data Configuration
Overview of ILSVRC VID Dataset

ILSVRC VID   Training    Validation
Images       1,122,397   176,126
Snippets     3,862       555

Images within each snippet are redundant, so diversity is too low to train a CNN. We need more data.
* Images are from the ILSVRC VID dataset

Training Data Configuration
Data sources (in slide order): ILSVRC DET, ILSVRC VID, additional data (COCO DET)
Sampling (in slide order): sample 2K images per class (30 classes); sample 2K images per class; 100 images per class
Pipeline: ImageNet CLS pre-trained model -> fine-tune -> COCO DET pre-trained model -> fine-tune on the training dataset -> final model
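
Capping the number of images per class, as in the sampling step on this slide, could be sketched as follows. The annotation format, the function name, and the uniform random choice are assumptions for illustration; the slides do not say how images were selected.

```python
import random
from collections import defaultdict


def sample_per_class(annotations, max_per_class, seed=0):
    """Pick at most max_per_class images for every class.

    annotations: iterable of (image_id, class_name) pairs (hypothetical format).
    Returns a dict mapping class_name to a sorted list of sampled image ids.
    """
    rng = random.Random(seed)
    by_class = defaultdict(set)
    for image_id, class_name in annotations:
        by_class[class_name].add(image_id)

    sampled = {}
    for class_name, image_ids in by_class.items():
        ids = sorted(image_ids)
        if len(ids) > max_per_class:
            # Uniform sampling without replacement (an assumption).
            ids = rng.sample(ids, max_per_class)
        sampled[class_name] = sorted(ids)
    return sampled


annotations = [("img%d" % i, "car") for i in range(10)] + [("a", "dog"), ("b", "dog")]
subset = sample_per_class(annotations, max_per_class=3)
print({name: len(ids) for name, ids in subset.items()})
```

With video data, sampling frames spread across snippets rather than consecutive frames would help with the redundancy noted earlier, though that refinement is not shown here.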

Detection Components

VGG16                   mAP (%)
Baseline                70.7
+ Larger feature map    73.6
+ Context layer         76.2

ResNet-101              mAP (%)
Baseline                78.8

Combining the Faster R-CNN VGG16 and Faster R-CNN ResNet detection results, followed by MCMOT tracking: 82.3% mAP
* We used 12 anchors for the RPN
* Performance is evaluated on VID minival (only the initial 30 frames of each snippet, 16,524 images in total)
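
The slides do not say how the two detectors' outputs are combined. One common option, shown here purely as an assumption and not as the authors' method, is to pool the detections from both models and apply greedy per-class non-maximum suppression. All names and the 0.5 threshold are illustrative.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def combine_detections(detections_a, detections_b, iou_threshold=0.5):
    """Pool detections from two models and run greedy per-class NMS.

    Each detection is a (box, score, class_name) tuple.
    """
    pooled = sorted(list(detections_a) + list(detections_b),
                    key=lambda d: d[1], reverse=True)
    kept = []
    for box, score, cls in pooled:
        # Keep a box unless a higher-scoring box of the same class overlaps it.
        if all(cls != k_cls or iou(box, k_box) < iou_threshold
               for k_box, _, k_cls in kept):
            kept.append((box, score, cls))
    return kept


vgg = [((0, 0, 10, 10), 0.9, "car")]
resnet = [((1, 1, 10, 10), 0.8, "car"), ((0, 0, 10, 10), 0.7, "dog")]
merged = combine_detections(vgg, resnet)
print(merged)
```

Other schemes, such as averaging scores of overlapping boxes, would fit the same interface.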

II. MCMOT: Multi-Class Multi-Object Tracking using Changing Point Detection

Motivation
[Chart: ILSVRC Object Detection Performance, mAP by year: 2013: 0.23, 2014: 0.44, 2015: 0.62, 2016: 0.66]
Object detectors have become robust. Should we still use a complex multi-object tracking algorithm?

Motivation
Given high-performance detection, a simple and fast MOT algorithm can achieve competitive results.

Results
ILSVRC2016 Object Detection/Tracking from Video (VID): 2nd place (mAP: 49.09%) with additional training data
Our MCMOT also achieves state-of-the-art results on other MOT datasets.
[Chart: Multiple Object Tracking Benchmark 2016, MOTA (%) by method: MCMOT 62.4, NOMTwSDP16 62.2, KFILDAwSDP 57.3, EAMTT 52.5, AMPL 50.9]
[Chart: KITTI Tracking Benchmark (Car), MOTA (%) by method: MCMOT 72.11, RBPF 71.44, DuEye 70.15, NOMT 69.73, CNN-OCC 69.46]
https://motchallenge.net/results/mot16/
http://www.cvlibs.net/datasets/kitti/eval_tracking.php
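
For readers unfamiliar with the metric, MOTA as commonly defined in the CLEAR MOT metrics is written out below. The slides do not spell it out; the symbols are the usual per-frame counts of false negatives (FN), false positives (FP), identity switches (IDSW), and ground-truth objects (GT).

```latex
\mathrm{MOTA} = 1 - \frac{\sum_t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right)}{\sum_t \mathrm{GT}_t}
```

Higher is better, and the value can be negative when the total number of errors exceeds the number of ground-truth objects.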

Tracking by Detection
[Diagram: object detection is run on the frames of the sequence over time t, and multi-object tracking links the detections into trajectories.]
Object Detection: Ensemble of CNNs
Multi-Object Tracking: MCMOT using CPD

MCMOT using CPD
[Diagram: the video sequence passes through four steps over time t, producing the MCMOT trajectory result.]
(a) Likelihood calculation
(b) Track segment creation
(c) Changing point detection, where change points are located and checked with forward-backward validation
(d) Trajectory combination

Track Segment Creation
MCMC-based MOT approach: A changing number of moving objects is challenging because the high-dimensional state space requires high computational overhead.
Separating motion dynamics: The method separates the motion dynamics model of the Bayesian filter into entity transitions and motion moves. Because the birth and death moves are separated out, there is no dimension variation in the iteration loop.
H. Sakaino. Video-based tracking, learning, and recognition method for multiple moving objects. IEEE Trans. on Circuits and Systems for Video Technology. 2013.

Track Segment Creation
Estimation of entity state transitions (birth, death): The entity transitions are modeled as birth and death events. We estimate the entity prior with a data-driven approach instead of inside the MCMC loop.

Track Segment Creation
Separating motion dynamics
Pros: Since the Markov chain has no dimension variation in the iteration loop, it can reach a stationary state with less computational overhead.
Cons: Such a simple approach cannot deal with the complex situations that occur in MOT. Many track segments suffer from drift due to appearance variations.
The drift problem is addressed by a CPD algorithm.

Changing Point Detection
Two-stage learning for changing point detection
Drifts in MCMOT are investigated by detecting abrupt change points between the stationary time series that represent track segments. A possible track drift is identified by changing point detection. The second-level time series is built from the scanned average responses to reduce outliers in the time series.
* Figure from J. Takeuchi, & K. Yamanishi. A Unifying Framework for Detecting Outliers and Change Points from Time Series. TKDE 2006.
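
The two-stage idea (score each point, average the scores, then score the averaged series again) can be sketched as below. This is a simplified illustration under stated assumptions: a discounted running Gaussian stands in for the time series model of the cited framework, and the window size, discount rate, and function names are all hypothetical.

```python
import math


def discounted_gaussian_scores(series, discount=0.1, init_var=1.0):
    """Score each value by its negative log-likelihood under a running
    Gaussian whose mean and variance are updated with exponential discounting."""
    mean, var = series[0], init_var
    scores = []
    for x in series:
        scores.append(0.5 * math.log(2 * math.pi * var) + (x - mean) ** 2 / (2 * var))
        mean = (1 - discount) * mean + discount * x
        var = max((1 - discount) * var + discount * (x - mean) ** 2, 1e-6)
    return scores


def moving_average(values, window):
    """Trailing moving average; early entries average the values seen so far."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def change_point_scores(series, window=5, discount=0.1):
    """Stage 1 scores outliers and smooths them; stage 2 scores the smoothed
    series and smooths again, so isolated outliers are damped."""
    first = moving_average(discounted_gaussian_scores(series, discount), window)
    second = discounted_gaussian_scores(first, discount)
    return moving_average(second, window)


# Toy track response: stationary, then an abrupt change at index 30.
series = [0.0] * 30 + [5.0] * 30
scores = change_point_scores(series)
peak = max(range(len(scores)), key=scores.__getitem__)
print(peak)
```

In a tracker, the input series would be a per-frame response of a track segment, and a segment would be flagged as a possible drift where the score exceeds a chosen threshold.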

Changing Point Detection
[Diagram: track segments go through changing point detection, which assigns each a CPD score from low to high, and forward-backward validation, which assigns each an FB error from low to high. Confident segments are kept and unstable segments are excluded.]
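
Forward-backward validation can be illustrated with a small sketch: a segment is tracked forward, then tracked backward from its end, and the two paths are compared. The mean point-to-point distance, the segment dictionary layout, and the threshold are assumptions for illustration; the slides do not give the exact error measure.

```python
import math


def forward_backward_error(forward_points, backward_points):
    """Mean Euclidean distance between a forward trajectory and the trajectory
    obtained by tracking backward from its end.

    Both inputs are lists of (x, y) centers in ascending frame order.
    """
    if len(forward_points) != len(backward_points):
        raise ValueError("trajectories must cover the same frames")
    dists = [math.hypot(fx - bx, fy - by)
             for (fx, fy), (bx, by) in zip(forward_points, backward_points)]
    return sum(dists) / len(dists)


def keep_confident_segments(segments, max_fb_error):
    """Keep segments whose FB error does not exceed the threshold.

    segments: list of dicts with "forward" and "backward" point lists
    (a hypothetical layout).
    """
    return [s for s in segments
            if forward_backward_error(s["forward"], s["backward"]) <= max_fb_error]


stable = {"forward": [(0, 0), (1, 0), (2, 0)], "backward": [(0, 0), (1, 0), (2, 0)]}
drifting = {"forward": [(0, 0), (1, 0), (2, 0)], "backward": [(3, 4), (4, 4), (2, 0)]}
confident = keep_confident_segments([stable, drifting], max_fb_error=1.0)
print(len(confident))
```

A low error means the backward pass retraces the forward path, which suggests the segment did not drift onto another object.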

Changing Point Detection
[Example frames: MOT16-09 #170, #172, #174; MOT16-07 #179; MOT16-02 #438, #440, #442, #444]
* Images are from the MOT 2016 benchmark

Tracking Speed
Tracking performance comparison on the MOT 2016 benchmark. The timing excludes detection time. With a Titan X (Maxwell) GPU, the detector runs at approximately 3.5 FPS.

Results: ImageNet VID, MOT Benchmark, KITTI Tracking Benchmark

Acknowledgement

Thank You Email: byungjae89@gmail.com