Towards Large-Scale Semantic Representations for Actionable Exploitation. Prof. Trevor Darrell UC Berkeley



1 Towards Large-Scale Semantic Representations for Actionable Exploitation Prof. Trevor Darrell UC Berkeley

2 Traditional surveillance sensor; emerging crowd sensor

3 Desired capabilities: spatio-temporal reconstruction; vehicle and person reidentification; wide-area activity analysis; visual sentiment assessment; fine-grained description. Rich representations are key. Analysis/response

4 Large-scale Semantic Description

5 Large-scale Semantic Description Object Detection

6 Large-scale Semantic Description Object Detection Semantic Segmentation Pose Estimation


7 Large-scale Semantic Description. "A man with glasses and a coat, facing back, walking away"; "A blue GMC van parked, in a back view"; "An entlebucher mountain dog sitting in a bag"; "An elderly man with a hat and glasses, facing the camera and talking". Object Detection, Semantic Segmentation, Pose Estimation, Attribute Classification, Fine-Grained Recognition, Action Recognition. "a man wearing long sleeves, possibly holding a shovel"; "person last seen at 0900 in view 5"; "unusual farm worker"

8 Deformable Part Descriptors (DPDs) for Attribute Prediction [ICCV13]: Image -> Deformable Part Model Detection -> Detected Parts -> Pooled Descriptor -> Attribute Models -> Attribute Predictions

9 Deformable Part Descriptors (DPDs) for Fine-grained Recognition [ICCV13]

10 Traditional Vision Models: SIFT-VQ-BOW; Scanning Window HOG. Convolve-Quantize-Pool [Convolve-Quantize-Pool]

11 Deep (at least according to Google Image Search)

12 Learning

13 Deep Learning? (at least according to Google Image Search)

14 now, CNN ILSVRC Architecture: Input Layer, Convolutional Layers, Fully-Connected Layers. Diagram from Krizhevsky et al., 2012. [Krizhevsky et al] Convolve-Quantize-Pool [Convolve-Quantize-Pool] [[Convolve-Quantize-Pool]]. Fukushima's Neocognitron; LeCun's LeNet, 1989; Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. In Proc. NIPS, 2012.


15 Caffe: Open-Sourcing Deep Learning. Yangqing Jia, Trevor Darrell, UC Berkeley. Convolutional Architecture for Fast Feature Extraction. Seamless switching between CPU and GPU; fast computation (2.5ms / image with GPU); full training and testing capability; reference ImageNet model available. A framework to support multiple applications: Classification, Embedding, Detection, Your next Application! Publicly available at

16 Does it Generalize? SURF DeCAF

17 to Detection? "Rich feature hierarchies for accurate object detection and semantic segmentation", Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. Available on arXiv.

18 Object detection system: Regions with CNN features (R-CNN) (e.g., selective search). (With a few minor tweaks: semantic segmentation.)
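The R-CNN idea named on this slide (score each proposed region with CNN features) can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the nearest-neighbour warp, the helper names, and the toy feature and classifier functions stand in for the real proposal method, network, and per-class scoring, which the slide does not spell out.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive
Image = List[List[float]]        # grayscale image stored as rows of pixels


def warp_region(image: Image, box: Box, size: int) -> Image:
    """Crop `box` and resize it to size x size with nearest-neighbour sampling."""
    x0, y0, x1, y1 = box
    height, width = y1 - y0, x1 - x0
    return [
        [image[y0 + (r * height) // size][x0 + (c * width) // size]
         for c in range(size)]
        for r in range(size)
    ]


def score_regions(
    image: Image,
    boxes: Sequence[Box],
    extract_features: Callable[[Image], List[float]],
    classify: Callable[[List[float]], float],
    size: int = 4,
) -> List[Tuple[Box, float]]:
    """Warp every proposal to a fixed size, extract features, and score it."""
    return [
        (box, classify(extract_features(warp_region(image, box, size))))
        for box in boxes
    ]


# Hypothetical stand-ins: mean intensity as the "feature", identity as the "classifier".
def mean_feature(patch: Image) -> List[float]:
    flat = [value for row in patch for value in row]
    return [sum(flat) / len(flat)]


def toy_classifier(features: List[float]) -> float:
    return features[0]


# 8 x 8 toy image whose right half is bright.
image = [[float(x >= 4) for x in range(8)] for _ in range(8)]
proposals = [(0, 0, 4, 8), (4, 0, 8, 8)]
scored = score_regions(image, proposals, mean_feature, toy_classifier)
```

In this toy run the right-hand proposal gets the higher score because it covers the bright half. A real system would swap in a proposal generator, a trained network, and one classifier per class.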

19 CNN features for detection: region -> warped region. pool5: 6 x 6 x 256 = 9216-dim, 6.4% / 15% non-zero; fc6: 4096-dimensional, 71.2% / 20% non-zero; fc7: 4096-dimensional, 100% / 20% non-zero
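As a quick check of the arithmetic on this slide, the sketch below flattens a 6 x 6 x 256 feature map into a vector length and measures the fraction of non-zero entries in a vector. The helper names are hypothetical, and the slide does not say how its two percentages per layer are defined, so the sparsity helper only shows the basic measurement on toy data.

```python
from typing import Sequence


def flattened_dim(height: int, width: int, channels: int) -> int:
    """Length of the vector obtained by flattening a height x width x channels map."""
    return height * width * channels


def nonzero_fraction(features: Sequence[float]) -> float:
    """Fraction of entries in a feature vector that are not exactly zero."""
    return sum(1 for value in features if value != 0) / len(features)


pool5_dim = flattened_dim(6, 6, 256)       # 6 * 6 * 256 = 9216
toy_features = [0.0, 1.5, 0.0, 0.0, 2.0]   # made-up vector, not real activations
toy_sparsity = nonzero_fraction(toy_features)
```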

20 Results (columns: VOC 2007, VOC). 1. DPM 33.7%, 29.6%; 2. UVA sel. search 35.1%; 3. R-CNN pool5 40.1%; 4. R-CNN fc6 43.4%; 5. R-CNN fc7 42.6%; 6. R-CNN FT pool5 42.1%; 7. R-CNN FT fc6 47.2%; 8. R-CNN FT fc7 48.0%, 43.5%. Metric: mean average precision (higher is better)

21 pre-trained only: Results (columns: VOC 2007, VOC). 1. DPM 33.7%, 29.6%; 2. UVA sel. search 35.1%; 3. R-CNN pool5 40.1%; 4. R-CNN fc6 43.4%; 5. R-CNN fc7 42.6%; 6. R-CNN FT pool5 42.1%; 7. R-CNN FT fc6 47.2%; 8. R-CNN FT fc7 48.0%, 43.5%. Metric: mean average precision (higher is better)

22 fine-tuned: Results (columns: VOC 2007, VOC). 1. DPM 33.7%, 29.6%; 2. UVA sel. search 35.1%; 3. R-CNN pool5 40.1%; 4. R-CNN fc6 43.4%; 5. R-CNN fc7 42.6%; 6. R-CNN FT pool5 42.1%; 7. R-CNN FT fc6 47.2%; 8. R-CNN FT fc7 48.0%, 43.5%. Metric: mean average precision (higher is better)

23 fine-tuned: Results, updated Jan 14 (columns: VOC 2007, VOC). 1. DPM 33.7%, 29.6%; 2. UVA sel. search 35.1%; 3. R-CNN pool5 40.1%; 4. R-CNN fc6 43.4%; 5. R-CNN fc7 42.6%; 6. R-CNN FT pool5 47.5%; 7. R-CNN FT fc6 53.2%; 8. R-CNN FT fc7 53.8%, 43.5%. Metric: mean average precision (higher is better)
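The metric on these slides, mean average precision, can be sketched as follows. The slides do not say which AP variant was used. This sketch assumes the 11-point interpolated form commonly associated with VOC 2007, and it assumes detections have already been matched to ground truth and sorted by descending score. The function names are hypothetical.

```python
from typing import List, Sequence


def average_precision_11pt(ranked_hits: Sequence[bool], num_positives: int) -> float:
    """11-point interpolated AP for one class.

    ranked_hits[i] is True when the i-th highest-scoring detection is a
    true positive; num_positives is the number of ground-truth objects.
    """
    precisions: List[float] = []
    recalls: List[float] = []
    true_positives = 0
    for rank, hit in enumerate(ranked_hits, start=1):
        true_positives += int(hit)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_positives)

    total = 0.0
    for k in range(11):
        threshold = k / 10
        # Interpolated precision: best precision at any recall >= threshold.
        reachable = [p for p, r in zip(precisions, recalls) if r >= threshold]
        total += max(reachable, default=0.0)
    return total / 11


def mean_average_precision(per_class_ap: Sequence[float]) -> float:
    """Average the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


ap_perfect = average_precision_11pt([True, True], num_positives=2)
ap_late_hit = average_precision_11pt([False, True], num_positives=1)
toy_map = mean_average_precision([ap_perfect, ap_late_hit])
```

A class whose detections are all correct and complete gets AP 1.0. A class whose only true positive is ranked second gets 0.5 here, which shows how false positives ranked above true positives pull the score down.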

24 Top bicycle FPs (AP 62.5%)

25 Top bird FPs (AP 41.4%)

26 Visualizing features: What does pool5 learn? Recap: - pool5: max-pooled output of last conv. layer; - 6 x 6 spatial structure (with 256 channels); - receptive field size 163 x 163 (of 224 x 224). Figure labels: unit position, receptive field
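A receptive-field size like the one quoted here can be derived with the usual layer-by-layer recurrence. The sketch below assumes AlexNet-style kernel sizes and strides, which are not stated on the slide. Under those assumed parameters, 163 is the value reached at the last convolutional layer, and a following 3 x 3, stride-2 pooling step would widen it to 195. The slide's figure may therefore describe the units feeding pool5.

```python
from typing import Dict, List, Tuple

# (name, kernel size, stride). Assumed AlexNet-style values, not taken from the slide.
ALEXNET_STYLE_LAYERS: List[Tuple[str, int, int]] = [
    ("conv1", 11, 4),
    ("pool1", 3, 2),
    ("conv2", 5, 1),
    ("pool2", 3, 2),
    ("conv3", 3, 1),
    ("conv4", 3, 1),
    ("conv5", 3, 1),
    ("pool5", 3, 2),
]


def receptive_fields(layers: List[Tuple[str, int, int]]) -> Dict[str, int]:
    """Side length (in input pixels) of the receptive field after each layer."""
    field = 1  # a single input pixel sees itself
    jump = 1   # distance in input pixels between neighbouring units
    sizes: Dict[str, int] = {}
    for name, kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
        sizes[name] = field
    return sizes


fields = receptive_fields(ALEXNET_STYLE_LAYERS)
```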

27 Visualization method: select a unit in pool5; run it as a detector; show top-scoring regions. Non-parametric: lets the unit speak for itself. (Used ~10 million held-out regions.)
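The visualization procedure on this slide amounts to ranking regions by one unit's activation. A minimal sketch, assuming the activations have already been computed and stored per region (the dictionary layout and the function name are hypothetical):

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def top_regions_for_unit(
    activations: Dict[str, Sequence[float]], unit: int, k: int
) -> List[Tuple[float, str]]:
    """Return the k (activation, region_id) pairs with the highest activation of `unit`."""
    return heapq.nlargest(
        k, ((values[unit], region_id) for region_id, values in activations.items())
    )


# Toy activations for three regions and two units (made-up numbers).
toy_activations = {
    "region_a": [0.1, 0.9],
    "region_b": [0.7, 0.2],
    "region_c": [0.4, 0.5],
}
top_for_unit0 = top_regions_for_unit(toy_activations, unit=0, k=2)
```

Using a bounded heap avoids sorting every region, which matters when the pool is in the millions, as on the slide.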

38 Adding Parts to CNNs: PANDA

39 PANDA

40 Beyond Nouns: towards verbs, adjectives, ... Mind's Eye Challenge: a "Video Tweet": summarize what happened in 140 characters; success in limited domains (small verb set). Now: towards large-scale / broad coverage: "say something about anything"; getting "out of the kitchen"; beyond digging + jumping jacks. 3+ yr challenge problem? Extend to include location context, multi-view


41 Activity Recognition Datasets: (a) Hollywood Dataset (8 actions); (b) TRECVID MED (6 actions); (c) YouTube (218 actions). C&D YouTube (Chen & Dolan, ACL 2011): more than 200 verbs and 300 nouns in 2000 videos. Example descriptions: "A woman is cooking onions." "Someone is cooking in a pan." "someone preparing something" "a person coking." "racipe for katsu curry"; "A man is sitting and playing a guitar" "A man is playing guitar" "Street artists play guitar." "A man is playing a guitar." "a lady is playing the guitar."; "A girl is ballet dancing." "A girl is dancing on a stage." "A girl is performing as a ballerina." "A woman dances."; "A train is rolling by." "A train passes by Mount Fuji." "A bullet train zooms through the countryside." "A train is coming down the tracks."


42 Overall view: activity detection from videos; object detection from videos; expanded verbs using similarity; content planning (Sub + Verb + Obj); generalization (best Sub + Verb + Obj); surface realization using template-based approach. Example: "Person walking a dog"
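The last two stages of this pipeline (picking a subject, verb, object triple and realizing it with a template) can be sketched as below. Scoring a triple by the product of detector confidences and the particular template string are assumptions for illustration. The slide does not describe the actual content-planning score or templates.

```python
from itertools import product
from typing import Dict, Tuple


def best_triple(
    subjects: Dict[str, float], verbs: Dict[str, float], objects: Dict[str, float]
) -> Tuple[str, str, str]:
    """Pick the (subject, verb, object) with the highest product of confidences.

    Each dictionary maps a candidate word to a detector confidence in [0, 1].
    The product score is an illustrative assumption.
    """
    return max(
        product(subjects, verbs, objects),
        key=lambda t: subjects[t[0]] * verbs[t[1]] * objects[t[2]],
    )


def realize(subject: str, verb_gerund: str, obj: str) -> str:
    """Fill a fixed template, e.g. 'Person walking a dog'."""
    return f"{subject.capitalize()} {verb_gerund} a {obj}"


# Made-up confidences for a toy video.
triple = best_triple(
    {"person": 0.9, "car": 0.2},
    {"walking": 0.8, "riding": 0.3},
    {"dog": 0.7, "bike": 0.4},
)
sentence = realize(*triple)
```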

43 Input video. G.T.: "Man clamps knife", "Person clamps knife", "Man working". Hierarchy node labels: being, do, entity; animal, person; play, work; tool, person; man, woman; clamp, sharpen; knife, baby. Conventional output: "Woman sharpens baby". Desired output: "Person working with a tool"
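The contrast on this slide (a specific but wrong sentence versus a more general but correct one) suggests backing off to a parent concept when a specific label is uncertain. The sketch below is one simple way to express that idea. The parent table is a hypothetical reading of the slide's hierarchy, and the fixed confidence threshold is an assumption, not the method used in the talk.

```python
from typing import Dict, List, Tuple

# Hypothetical child -> parent links, loosely following the slide's node labels.
PARENT: Dict[str, str] = {
    "man": "person",
    "woman": "person",
    "clamp": "work",
    "sharpen": "work",
    "knife": "tool",
}


def generalize(word: str, confidence: float, threshold: float = 0.5) -> str:
    """Replace a low-confidence word with its parent concept, if it has one."""
    if confidence < threshold and word in PARENT:
        return PARENT[word]
    return word


def generalize_triple(
    triple: List[Tuple[str, float]], threshold: float = 0.5
) -> Tuple[str, ...]:
    """Apply the back-off independently to subject, verb, and object."""
    return tuple(generalize(word, conf, threshold) for word, conf in triple)


# Made-up confidences: every specific label is uncertain, so each backs off.
uncertain = generalize_triple([("man", 0.40), ("clamp", 0.30), ("knife", 0.45)])
# Confident labels are kept as they are.
confident = generalize_triple([("man", 0.90), ("clamp", 0.80), ("knife", 0.95)])
```

With the uncertain inputs the triple becomes (person, work, tool), which matches the spirit of the desired output "Person working with a tool".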

44 Results on YouTube Videos

45 Summary. Extreme events: car bomb, shooting, natural disaster (e.g., Boston marathon bombing). source: R4: Reconstruct / Recognize / Retrieve / Recount: understand what happened; find individuals; locate vehicles; describe succinctly. Large-scale Deep Semantic Representations Will Critically Underpin These Abilities

46 Large-scale Deep Semantic Representations: Recommendations. Near: fused deep object classification, detection, scene recognition, material classification, action recognition, pose estimation (fixed ontology and domain); efficient implementation on CPUs, GPUs, and next-generation approximate computing; 1K->10K+ RT on CPU. Mid / Far: scaling to open namespaces; modeling, mitigating, and adapting across domains; lifelong representation learning; anytime models and time-optimal c/overt attention
