Internet of things that video



1 Video recognition from a sentence
Cees Snoek, Intelligent Sensory Information Systems Lab, University of Amsterdam, The Netherlands

Internet of things that video
45 billion cameras by 2022 [LDV Capital]

2 Technology: self-driving cars
Forensics: analyzing terrorist behavior

3 Well-being: elderly monitoring (CareMedia project, CMU, 2002)
Social: media monitoring


4 Goal
Understand what is happening, where and when.
Example actions: kissing, shaking hands.

Inspiration: object location in images (Hu et al., CVPR 2016)
Find the object location in an image based on language.
Later extended to object segments (Hu et al., ECCV 2016).
[Figure: example queries "window upper right" and "bottom left window", each shown with ground truth and prediction.]

5 Object segmentation from a sentence (Hu et al., ECCV 2016)
Image embedding: spatial feature map through a CNN.
Sentence embedding: final hidden state of an LSTM.
Fully convolutional classification: match the input sentence to every location on the spatial grid, then up-sample.

Tracking
Zhenyang Li, Ran Tao, Efstratios Gavves, Cees Snoek, Arnold Smeulders. Tracking by Natural Language Specification. In CVPR

6 A long-standing computer vision challenge
Object tracking? Step 1: set the box (SINT).

7 Step 2: start tracking (SINT).

Key contribution
Specify the target by language instead of by a box.
Example query: "Track the little green person with the pointy ears and the beige robe."

8 Challenges
How to obtain a tight box around an object from text?
Text ambiguity vs. object variance vs. object invariance?
What happens if the description is no longer valid?

Model I: Lingual specification only
LSTM encodes the text query.
[Figure: an LSTM unrolled over the words of the query "Little green person robe", with one hidden state per word.]

9 Model I: Lingual specification only
LSTM encodes the text query.
Dynamically generate filters from the LSTM output.
Convolve with the input frame.
[Figure: LSTM over the query "Little green person robe", followed by dynamic filter generation; the generated filter is convolved with the conv5 features of the frame.]
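The three steps of Model I (encode the query, generate a filter, convolve with the frame) can be illustrated with a small NumPy sketch. It is not the authors' implementation: the linear layer with tanh, the filter size, the random weights and the helper names generate_filter and correlate are all assumptions made for illustration, and the sentence embedding stands in for an LSTM output.

```python
import numpy as np


def generate_filter(sentence_embedding, weight, bias, channels, ksize):
    """Map a sentence embedding to a (channels, ksize, ksize) filter.

    Assumption: a single linear layer followed by tanh; the slides do not
    specify how the filter is produced from the LSTM output.
    """
    flat = np.tanh(weight @ sentence_embedding + bias)
    return flat.reshape(channels, ksize, ksize)


def correlate(feature_map, filt):
    """Cross-correlate a (C, H, W) feature map with a (C, k, k) filter.

    The map is zero-padded so the response keeps the H x W grid (odd k).
    """
    _, h, w = feature_map.shape
    k = filt.shape[-1]
    pad = k // 2
    padded = np.pad(feature_map, ((0, 0), (pad, pad), (pad, pad)))
    response = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            response[i, j] = np.sum(padded[:, i:i + k, j:j + k] * filt)
    return response


# Hypothetical sizes: 8-d sentence embedding, 4-channel 6x7 frame features.
rng = np.random.default_rng(0)
embed_dim, channels, ksize = 8, 4, 3
sentence = rng.normal(size=embed_dim)          # stand-in for the LSTM output
weight = rng.normal(size=(channels * ksize * ksize, embed_dim))
bias = np.zeros(channels * ksize * ksize)
frame_features = rng.normal(size=(channels, 6, 7))

filt = generate_filter(sentence, weight, bias, channels, ksize)
response = correlate(frame_features, filt)
# Location with the strongest response to the query on the feature grid.
best_row, best_col = np.unravel_index(np.argmax(response), response.shape)
```

Because the filter is a function of the sentence, a different query yields a different response map for the same frame, with no visual template of the target needed.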

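10 Model I: Lingual specification only
Tracking by repeated detection.
[Figure: the query "Little green person robe" is applied to every frame of the video, starting at frame 0.]

Model II: Lingual first, then visual
Use Model I for initialization, then track (Tao et al., CVPR).
[Figure: dynamic filters generated from the query "Little green person robe" localize the target in the first frame; a visual tracker follows it through the remaining frames.]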

11 Model III: Lingual & visual
Adapts the lingual specification over time.
[Figure: dynamic filters generated from the query "Little green person robe", combined with an attention module, applied across the frames of the video.]

Datasets, pre-training and evaluation
Extended existing object tracking datasets with a language description of the target box in the first frame:
Lingual OTB99: all videos from OTB51 and 48 from OTB100.
Lingual ImageNet: 4 videos for each of the 25 object categories.


12 Datasets, pre-training and evaluation
VGG pre-trained on ImageNet, fine-tuned on the lingual sets.
Lingual network pre-trained on ReferIt.
Tracker evaluation: AUC (area under curve) score.

ReferIt dataset (Kazemzadeh et al., EMNLP 2014)
Natural language expressions for segmented regions.
20,000 images, 130,525 expressions, 96,654 segmented image regions.
10K images for training and validation and 10K for testing.
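The slides name the AUC score without defining it. One common reading, assumed here, is the area under an OTB-style success plot: for each overlap threshold, the fraction of frames whose predicted box overlaps the ground truth by more than that threshold. The sketch below follows that assumption; the (x, y, width, height) box format, the 21 evenly spaced thresholds and the function names are illustrative choices, not taken from the talk.

```python
import numpy as np


def box_iou(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_auc(pred_boxes, gt_boxes, num_thresholds=21):
    """Approximate area under the success plot for one video.

    Assumption: success at threshold t is the fraction of frames with
    overlap strictly greater than t, and the area is the mean success
    rate over evenly spaced thresholds in [0, 1].
    """
    overlaps = np.array([box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    success = np.array([(overlaps > t).mean() for t in thresholds])
    return float(success.mean())


# Hypothetical three-frame track: exact hit, partial overlap, complete miss.
gt = [(10, 10, 20, 20), (12, 10, 20, 20), (14, 10, 20, 20)]
pred = [(10, 10, 20, 20), (22, 10, 20, 20), (100, 100, 20, 20)]
auc = success_auc(pred, gt)
```

A tracker that drifts off the target early scores low under this measure even if its first-frame box is exact, which is why it suits comparisons between the three models.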

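13 Target identification by language
[Table: Hu et al. CVPR16, Rohrbach et al. ECCV16, Hu et al. ECCV16 and Ours, compared on P@x (the percentage of test samples whose IoU passes a threshold x) and on IoU. Only the values 9.6, 26.7, 19.3 and 23.3 survive; their assignment to methods and metrics is not recoverable.]

Target identification by language
[Table: Hu et al. CVPR16, Hu et al. ECCV16 and Ours, compared on five P@ thresholds and on IoU; values not preserved.]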
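The P@x metric in these tables is the percentage of test samples whose IoU passes a threshold x. A minimal sketch of that computation follows; the threshold values, the use of "greater than or equal" and the function name precision_at are assumptions, since the thresholds and comparison sign did not survive in the slides.

```python
import numpy as np


def precision_at(ious, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Percentage of samples whose IoU is at or above each threshold.

    Assumption: the thresholds and the >= comparison are illustrative;
    the slides only state "percentage of test samples with IoU x".
    """
    ious = np.asarray(ious, dtype=float)
    return {t: float((ious >= t).mean() * 100.0) for t in thresholds}


# Hypothetical per-sample IoU values for four test queries.
scores = precision_at([0.95, 0.75, 0.55, 0.20])
```

Reporting several thresholds separates methods that roughly find the target from those that also delineate it tightly, which a single mean IoU would blur.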

14 Target identification by language
[Table: IoU for Hu et al. CVPR16, Hu et al. ECCV16 and Ours; values not preserved.]
Cross-modal dynamic filter generation makes sense.

Model results
Sorted by first-frame accuracy.
Models II and III profit from good initialization.
It is hard to recover from poor initialization; in that case Model I is the best choice.
[Figure legend: Lingual only; Lingual, then visual; Lingual & visual.]

15 Query: "White car on the left"
[Figure: results for Lingual only; Lingual, then visual; Lingual & visual.]

Query: "Girl in yellow shirt and purple pants"
[Figure: results for Lingual only; Lingual, then visual; Lingual & visual.]

16 Query: "The black and white dog"
[Figure: results for Lingual only; Lingual, then visual; Lingual & visual.]

Track by language and provided box
Query: "Female skater in red"
[Figure: ground truth; box specification; language and box specification.]

17 Track by language and provided box
Query: "People on the right next to a big tree"
[Figure: ground truth; box specification; language and box specification.]

New applications
Tracking objects in multiple videos simultaneously.
No first-frame requirement; live monitoring across streams.
Example query: "Man with blue pants"

18 Conclusion
New type of human-machine interaction for video recognition.
More robust tracking.
Representation enables novel application scenarios.

Vacancies in my lab: 1 assistant professor & 4 PhD students
Thank you
