MCMOT: Multi-Class Multi-Object Tracking using Changing Point Detection ILSVRC 2016 Object Detection from Video Byungjae Lee¹, Songguo Jin¹, Enkhbayar Erdenee¹, Mi Young Nam², Young Gui Jung², Phill Kyu Rhee¹ Inha University¹, NaeulTech²
Results with additional training data. Object Detection from Video (VID): 2nd place (mAP: 73.15%). Object Detection/Tracking from Video (VID): 2nd place (mAP: 49.09%).
Overview. I. Faster R-CNN Object Detector: + Context region + Larger feature map + Ensemble + Data configuration. II. MCMOT: Multi-Class Multi-Object Tracking: Tracking by Detection. Detection: Ensemble of CNNs. Tracking: MCMOT using CPD. S. Ren, K. He, R. Girshick, & J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. TPAMI 2016. B. Lee, E. Erdenee, S. Jin, & P. Rhee. Multi-Class Multi-Object Tracking using Changing Point Detection. arXiv 2016.
I. Faster R-CNN Object Detector S. Ren, K. He, R. Girshick, & J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. TPAMI 2016.
Object Detection from Video, Challenge I: Small Objects. * Figure from J. Li et al. Scale-aware Fast R-CNN for Pedestrian Detection. arXiv 2015.
Object Detection from Video, Challenge I: Small Objects. Solution: Larger feature map. * Figure from J. Li et al. Scale-aware Fast R-CNN for Pedestrian Detection. arXiv 2015.
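Why a larger feature map helps: removing a pooling layer halves the network stride, so a small object spans more feature-map cells and is easier to localize. A quick back-of-the-envelope sketch (the 32x32 object size is an illustrative assumption, not a number from the slides):

```python
import math

def feature_map_cells(obj_w, obj_h, stride):
    """Number of feature-map cells an object of the given pixel size
    spans at a given network stride (rounded up per axis)."""
    return math.ceil(obj_w / stride) * math.ceil(obj_h / stride)

# A 32x32-pixel object at VGG16's usual conv5 stride of 16,
# vs. stride 8 after removing pool4: 4x more cells describe it.
cells_stride16 = feature_map_cells(32, 32, 16)  # 4 cells
cells_stride8 = feature_map_cells(32, 32, 8)    # 16 cells
```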
Object Detection from Video, Challenge II: Blurred Objects. * Figure from Y. Zhu et al. segDeepM: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection. CVPR 2015.
Object Detection from Video, Challenge II: Blurred Objects. Solution: Context region. * Figure from Y. Zhu et al. segDeepM: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection. CVPR 2015.
Network Architecture. I. Faster R-CNN with VGG16. Convolutional network (Conv1, pool1, Conv2, pool2, Conv3, pool3, Conv4, pool4, Conv5) → Region Proposal Network → RoI pooling layer → FC, FC → class layer (softmax class loss) and bbox layer (bbox regression loss). Karen Simonyan & Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015. S. Ren, K. He, R. Girshick, & J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. TPAMI 2016.
Network Architecture. I. Faster R-CNN with VGG16 + larger feature map (remove pool4 layer) (+2.9% mAP). Convolutional network (Conv1, pool1, Conv2, pool2, Conv3, pool3, Conv4ˈ) → Region Proposal Network → RoI pooling layer → FC, FC → class layer (softmax class loss) and bbox layer (bbox regression loss). J. Li, X. Liang, S. Shen, T. Xu, & S. Yan. Scale-aware Fast R-CNN for Pedestrian Detection. arXiv 2015. S. Ren, K. He, R. Girshick, & J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. TPAMI 2016.
Network Architecture. I. Faster R-CNN with VGG16 + larger feature map (remove pool4 layer) (+2.9% mAP) + context region (+2.6% mAP). Convolutional network (Conv1, pool1, Conv2, pool2, Conv3, pool3, Conv4ˈ) → Region Proposal Network → two branches: RoI pooling layer → FC, FC; and RoI pooling context layer (box × 1.5) → FC, FC → concat layer → class layer (softmax class loss) and bbox layer (bbox regression loss). S. Gidaris & N. Komodakis. Object detection via a multi-region & semantic segmentation-aware CNN model. ICCV 2015. S. Ren, K. He, R. Girshick, & J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. TPAMI 2016.
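The context branch pools features from a region enlarged by a factor of 1.5 around each proposal and concatenates them with the original RoI features. A minimal sketch of how such a context box could be computed; the image bounds and the center-anchored enlargement rule are assumptions for illustration, not the authors' released code:

```python
def context_box(box, scale=1.5, img_w=1920, img_h=1080):
    """Enlarge a proposal box (x1, y1, x2, y2) around its center
    by `scale`, clipping the result to the image bounds."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    nx1 = max(0.0, cx - w / 2.0)
    ny1 = max(0.0, cy - h / 2.0)
    nx2 = min(float(img_w), cx + w / 2.0)
    ny2 = min(float(img_h), cy + h / 2.0)
    return (nx1, ny1, nx2, ny2)
```

Both the original box and its context box would then be fed through separate RoI pooling layers before the concat layer.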
Training Data Configuration. Overview of the ILSVRC VID dataset: Training: 1,122,397 images, 3,862 snippets; Validation: 176,126 images, 555 snippets. Images within each snippet are highly redundant, so the diversity is too low to train a CNN: we need more data. * Images are from the ILSVRC VID dataset
Training Data Configuration. Start from an ImageNet CLS pre-trained model and fine-tune it on COCO DET to obtain a COCO DET pre-trained model. Build the training dataset by sampling 2K images per class from ILSVRC DET (30 classes), 2K images per class from ILSVRC VID, and 100 images per class of additional data. Fine-tune the COCO DET pre-trained model on this dataset to obtain the final model.
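The per-class sampling step above can be sketched as follows (the data structure and function name are hypothetical; the slides only specify the per-class limits):

```python
import random

def sample_per_class(images_by_class, per_class_limit):
    """Draw up to `per_class_limit` images from each class without
    replacement, e.g. 2K/class for ILSVRC DET and VID, 100/class
    for the additional data."""
    sampled = []
    for cls, images in images_by_class.items():
        k = min(per_class_limit, len(images))
        sampled.extend(random.sample(images, k))
    return sampled
```

Capping each class keeps the redundant VID snippets from dominating the fine-tuning set.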
Detection Components. VGG16 mAP: Baseline 70.7%; + Larger feature map 73.6%; + Context layer 76.2%. ResNet-101 mAP: Baseline 78.8%. Combining the Faster R-CNN VGG16 and Faster R-CNN ResNet detection results yields 82.3% mAP, which is then fed to MCMOT tracking. * We used 12 anchors for the RPN. * Performance is evaluated on VID minival (only the initial 30 frames of each snippet are used; 16,524 images in total).
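A common way to combine detection results from two networks is to pool their boxes and apply per-class non-maximum suppression. The sketch below illustrates that scheme under assumed data shapes; the slides do not specify the authors' exact fusion rule:

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def combine_detections(dets_a, dets_b, iou_thresh=0.5):
    """Pool detections from two models and apply greedy per-class NMS.
    Each detection is a (class_id, score, box) tuple."""
    pooled = sorted(dets_a + dets_b, key=lambda d: -d[1])
    kept = []
    for cls, score, box in pooled:
        # Keep unless a higher-scoring same-class box already overlaps it
        if all(k[0] != cls or iou(k[2], box) < iou_thresh for k in kept):
            kept.append((cls, score, box))
    return kept
```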
II. MCMOT: Multi-Class Multi-Object Tracking using Changing Point Detection. B. Lee, E. Erdenee, S. Jin, & P. Rhee. Multi-Class Multi-Object Tracking using Changing Point Detection. arXiv 2016.
Motivation. ILSVRC object detection performance (mAP): 0.23 (2013), 0.44 (2014), 0.62 (2015), 0.66 (2016). Object detectors have become robust; do we still need a complex multi-object tracking algorithm?
Motivation. Based on high-performance detection, a simple and fast MOT algorithm can achieve competitive results.
Results. ILSVRC2016 Object Detection/Tracking from Video (VID): 2nd place (mAP: 49.09%) with additional training data. MCMOT also achieves state-of-the-art results on other MOT datasets. Multiple Object Tracking Benchmark 2016 (MOTA, %): MCMOT 62.4, NOMTwSDP16 62.2, KFILDAwSDP 57.3, EAMTT 52.5, AMPL 50.9. KITTI Tracking Benchmark, Car (MOTA, %): MCMOT 72.11, RBPF 71.44, DuEye 70.15, NOMT 69.73, CNN-OCC 69.46. https://motchallenge.net/results/mot16/ http://www.cvlibs.net/datasets/kitti/eval_tracking.php B. Lee, E. Erdenee, S. Jin, & P. Rhee. Multi-Class Multi-Object Tracking using Changing Point Detection. arXiv 2016.
Tracking by Detection. Sequence → Object Detection → Multi-Object Tracking → Trajectory. Object Detection: Ensemble of CNNs. Multi-Object Tracking: MCMOT using CPD. B. Lee, E. Erdenee, S. Jin, & P. Rhee. Multi-Class Multi-Object Tracking using Changing Point Detection. arXiv 2016.
MCMOT using CPD. Video sequence → (a) Likelihood Calculation → (b) Track Segment Creation → (c) Changing Point Detection with Forward-Backward Validation (change points detected along t) → (d) Trajectory Combination → MCMOT trajectory result. B. Lee, E. Erdenee, S. Jin, & P. Rhee. Multi-Class Multi-Object Tracking using Changing Point Detection. arXiv 2016.
Track Segment Creation. MCMC-based MOT approach: a changing number of moving objects is challenging, requiring high computational overhead due to the high-dimensional state space. Separating motion dynamics: the method separates the motion dynamic model of the Bayesian filter into entity transitions and motion moves, so there is no dimension variation in the iteration loop once the birth and death moves are separated out. H. Sakaino. Video-Based Tracking, Learning, and Recognition Method for Multiple Moving Objects. IEEE Transactions on Circuits and Systems for Video Technology, 2013. B. Lee, E. Erdenee, S. Jin, & P. Rhee. Multi-Class Multi-Object Tracking using Changing Point Detection. arXiv 2016.
Track Segment Creation. Estimation of entity state transitions (birth, death): the entity transitions are modeled as birth and death events, and we estimate the entity prior with a data-driven approach instead of inside the MCMC loop. H. Sakaino. Video-Based Tracking, Learning, and Recognition Method for Multiple Moving Objects. IEEE Transactions on Circuits and Systems for Video Technology, 2013. B. Lee, E. Erdenee, S. Jin, & P. Rhee. Multi-Class Multi-Object Tracking using Changing Point Detection. arXiv 2016.
Track Segment Creation. Separating motion dynamics. Pros: since the Markov chain has no dimension variation in the iteration loop, it can reach stationary states with less computational overhead. Cons: such a simple approach cannot deal with complex situations that occur in MOT; many trackers suffer from track drift due to appearance variations. The drift problem is addressed by a CPD algorithm. H. Sakaino. Video-Based Tracking, Learning, and Recognition Method for Multiple Moving Objects. IEEE Transactions on Circuits and Systems for Video Technology, 2013. B. Lee, E. Erdenee, S. Jin, & P. Rhee. Multi-Class Multi-Object Tracking using Changing Point Detection. arXiv 2016.
Changing Point Detection. Two-stage learning for changing point detection: drifts in MCMOT are found by detecting abrupt change points between the stationary time series that represent a track segment, so a possible track drift is signaled by a detected change point. A 2nd-level time series is built from scanned average responses to reduce outliers in the raw series. * Figure from J. Takeuchi & K. Yamanishi. A Unifying Framework for Detecting Outliers and Change Points from Time Series. TKDE 2006. B. Lee, E. Erdenee, S. Jin, & P. Rhee. Multi-Class Multi-Object Tracking using Changing Point Detection. arXiv 2016.
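A simplified stand-in for the two-stage idea: first smooth the raw per-frame responses into a 2nd-level series of windowed averages (suppressing outliers), then score each point by its deviation from the preceding window. The actual method builds on the SDAR-based framework of Takeuchi & Yamanishi; this NumPy version only illustrates the shape of the computation:

```python
import numpy as np

def change_point_scores(series, window=5):
    """Score candidate change points in a per-frame response series.
    Stage 1: moving average builds a 2nd-level series that reduces
    outliers. Stage 2: each point is scored by its normalized
    deviation from the preceding window of the smoothed series."""
    x = np.asarray(series, dtype=float)
    kernel = np.ones(window) / window
    smoothed = np.convolve(x, kernel, mode="same")
    scores = np.zeros_like(smoothed)
    for t in range(window, len(smoothed)):
        past = smoothed[t - window:t]
        mu, sigma = past.mean(), past.std() + 1e-6
        scores[t] = abs(smoothed[t] - mu) / sigma  # z-score deviation
    return scores

# A track whose appearance confidence drops abruptly mid-sequence:
conf = [0.9] * 20 + [0.3] * 20
scores = change_point_scores(conf)
drift_frame = int(np.argmax(scores))  # peaks near the drop at frame 20
```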
Changing Point Detection. Track segments → changing point detection (a low CPD score passes; a high score indicates a possible drift) → forward-backward validation (keep confident segments with low FB error; exclude unstable segments with high FB error) → confident segments. J. Takeuchi & K. Yamanishi. A Unifying Framework for Detecting Outliers and Change Points from Time Series. TKDE 2006. B. Lee, E. Erdenee, S. Jin, & P. Rhee. Multi-Class Multi-Object Tracking using Changing Point Detection. arXiv 2016.
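Forward-backward validation can be sketched as follows: a segment tracked forward in time is re-tracked backward from its end point, and the disagreement between the two tracks is the FB error (cf. the forward-backward measure of Kalal et al.). The point representation and the median aggregation below are illustrative assumptions:

```python
import math

def forward_backward_error(forward_track, backward_track):
    """Median forward-backward error of a track segment.
    `forward_track` holds positions at frames 0..N; `backward_track`
    holds positions re-tracked from frame N back to frame 0, so
    reversing it aligns the two tracks frame by frame."""
    errors = [math.dist(p, q)
              for p, q in zip(forward_track, reversed(backward_track))]
    return sorted(errors)[len(errors) // 2]  # median error
```

A segment whose median FB error exceeds a threshold would be excluded as unstable.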
Changing Point Detection. Qualitative examples: MOT16-09 #170, #172, #174; MOT16-07 #179; MOT16-02 #438, #440, #442, #444. * Images are from the MOT 2016 benchmark. J. Takeuchi & K. Yamanishi. A Unifying Framework for Detecting Outliers and Change Points from Time Series. TKDE 2006. B. Lee, E. Erdenee, S. Jin, & P. Rhee. Multi-Class Multi-Object Tracking using Changing Point Detection. arXiv 2016.
Tracking Speed. Tracking performance comparison on the MOT Benchmark 2016; the reported timing excludes detection time. With a Titan X (Maxwell) GPU, the detector runs at approximately 3.5 FPS. B. Lee, E. Erdenee, S. Jin, & P. Rhee. Multi-Class Multi-Object Tracking using Changing Point Detection. arXiv 2016.
Results: ImageNet VID, MOT Benchmark, KITTI Tracking Benchmark
Acknowledgement
Thank You Email: byungjae89@gmail.com