Future directions in computer vision. Larry Davis, Computer Vision Laboratory, University of Maryland, College Park, MD, USA

1 Future directions in computer vision. Larry Davis, Computer Vision Laboratory, University of Maryland, College Park, MD, USA

2 Presentation overview: Future Directions Workshop on Computer Vision; object detection using CNNs without object proposals; incorporating context into detection; scale-dependent pooling to detect small object instances; resolving referring expressions using context; summary.

3 Strategic Directions Workshop on Visual Commonsense. Nov in D.C. Sponsored by OSTP in the US. Poggio, Malik, Zhu, Berg (Alex), Kohli, Hoiem, Grauman, Zitnick, Gupta, Fox, Tellex, Oliva, Scholl (absent), Domingos, Daume. Organized by me, Fei-Fei Li and Devi Parikh.

4 The computer vision landscape. Breakthroughs in CV (and AI generally) would clearly be disruptive; this has been known forever. Our field has more than doubled in size in less than a decade, and there are currently more than 175 startups in computer vision worldwide according to Crunchbase. There is a feeding frenzy in self-driving cars. So, has the field finally progressed to the point where real vision problems can be solved?

5 So, what has changed? Deep learning

6 So, what has changed? Deep learning. SFM and stereo.

7 So, what has changed? Deep learning. SFM and stereo. Human pose estimation and tracking.

8 So, what has changed? Deep learning. SFM and stereo. Human pose estimation and tracking. Computing infrastructure: big data, crowdsourcing, GPUs, cloud computing and free storage.

9 So, what has changed? Deep learning. SFM and stereo. Human pose estimation and tracking. Computing infrastructure: big data, crowdsourcing, GPUs, cloud computing and free storage. Open source software.

10 Commercial indicators. Image search: TinEye, Clarifai. Face recognition: under the hood at social media companies. Google self-driving cars: 1.5M miles and going. Driving aids and autonomous driving: Mobileye.

11 And what about the next 10? So what do you think the future of the field is? Here are some of the workshop recommendations.

12 Workshop recommendations. Develop the field of social perception: understand the internal state of people as they interact with each other and with the world; crucial for human-robot interaction. Perceptual robotics and testbeds for measurement of progress in situated vision research. Visual search: intelligent sampling of the visual world. Acquisition and representation of visual commonsense from observation and interaction. Vision and language.

13 Many useful challenges. Where to look to answer a question? How to relate existing detectors, pose estimators, attribute classifiers, etc. to this task? How to combine general knowledge with language and vision? How to test the ability to accumulate and integrate knowledge? VQA Dataset.

14 Workshop recommendations. Structured prediction: the relationship between parts, objects and scenes; the hierarchical structure of human behavior (movement, goals, actions and events). Explainable perception: don't just classify, explain your answer.

15 Workshop recommendations. Deep learning: why and when does it work? Why are all local minima created equal? Visual learning with minimal (no) supervision. Developmental learning (NEIL).

16 Are object proposals necessarily the answer? G-CNN: an iterative grid-based object detector. Mahyar Najibi and Mohammad Rastegari, CVPR 2016.

17 Object detection. Localization: bounding boxes, segmentation masks. Classification.

18 In your camera: sliding window detection. [Figure: a sliding window extracts boxes and a multi-class classifier scores each one. Example scores: horse = 0.6, person = 0.3; horse = 0.0, person = 0.5; horse = 0.0, person = 0.8; horse = 0.5, person = 0.9; horse = 0.9, person = 0.0]
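As a rough illustration of the sliding window idea on this slide, the sketch below enumerates square windows over an image and keeps the top class for each window when its score passes a threshold. The function names, the fixed square window, and the `classify` callback are all assumptions made for illustration; they are not taken from the talk.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Box = Tuple[int, int, int, int]  # (xmin, ymin, xmax, ymax)


def sliding_windows(width: int, height: int, win: int, stride: int) -> Iterator[Box]:
    """Yield square windows of side `win`, row by row, moving by `stride` pixels."""
    for y in range(0, height - win + 1, stride):
        for x in range(0, width - win + 1, stride):
            yield (x, y, x + win, y + win)


def sliding_window_detect(
    width: int,
    height: int,
    win: int,
    stride: int,
    classify: Callable[[Box], Dict[str, float]],  # hypothetical multi-class classifier
    threshold: float,
) -> List[Tuple[Box, str, float]]:
    """Score every window and keep (box, label, score) for confident windows."""
    detections = []
    for box in sliding_windows(width, height, win, stride):
        scores = classify(box)
        label = max(scores, key=scores.get)
        if scores[label] >= threshold:
            detections.append((box, label, scores[label]))
    return detections
```

A real detector would repeat this over several window sizes and aspect ratios, which is where the cost noted on the next slide comes from.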

19 Object proposals. Sliding windows are slow (scale, orientation, ...). Object proposals are (learning-based) multi-segmentation algorithms that generate fewer regions for classification (typically boxes). The consensus is that region proposals are crucial to state-of-the-art detection systems, whether they are given to the network or constructed by the network. However, localization is poor, so (class-dependent) post-processing is typically employed (regressor).

20 Object proposals and CNNs. R-CNN: push each proposal through the CNN; slow because the network is run multiple times. SPP-Net [1] computes filter responses only once for each image and pools from them to form features for the proposals. Fast R-CNN [2] builds on this and packs all stages of the system except the region proposal into one CNN. 1. He, Kaiming, et al. "Spatial pyramid pooling in deep convolutional networks for visual recognition." Computer Vision ECCV, Springer International Publishing. 2. Girshick, Ross. "Fast R-CNN." ICCV (2015).

21 Region Proposal Stage. These methods use an external object proposal stage (e.g. selective search with ~2K proposals/image). In Fast R-CNN, computing object proposals is the bottleneck, taking around 2 sec/image. Faster R-CNN [3] increases efficiency by reducing the number of proposed bounding boxes. It jointly learns the proposal generator and features, and is fast and accurate. 3. Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." NIPS (2015).

22 G-CNN Training Network Structure

23 G-CNN: Training Training set for step 1

24 G-CNN: Training Added samples for step 2

25 G-CNN Detection. Iteratively update the position of the initial bounding boxes with the regressor corresponding to the class with the highest score. [Figure: at each step the highest scoring class is car, so the car regressor is applied.]

26 G-CNN structure at detection time. To reduce detection time, the G-CNN network is divided into two parts: the global part is called only once for each image; the regression part is called S_test times, once for each step.
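The detection loop described on the last two slides can be sketched as follows: the global part runs once per image, then each step scores every box and moves it with the regressor of its highest-scoring class. This is a minimal sketch under stated assumptions, not the authors' implementation: `global_part` and `regression_part` are hypothetical callables, and the additive corner offsets are a simplification of whatever box parameterization G-CNN actually uses.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)
Offsets = Tuple[float, float, float, float]


def gcnn_style_detect(
    image: Any,
    initial_boxes: Sequence[Box],
    global_part: Callable[[Any], Any],  # hypothetical: image -> shared features
    regression_part: Callable[[Any, Box], Tuple[Dict[str, float], Dict[str, Offsets]]],
    num_steps: int,
) -> List[Tuple[Box, str]]:
    """Iteratively refine boxes using the regressor of the top-scoring class."""
    features = global_part(image)  # computed once per image
    boxes = list(initial_boxes)
    labels = [""] * len(boxes)
    for _ in range(num_steps):  # regression part runs once per step
        for i, box in enumerate(boxes):
            scores, offsets = regression_part(features, box)
            label = max(scores, key=scores.get)
            d = offsets[label]  # class-specific update (assumed additive)
            boxes[i] = (box[0] + d[0], box[1] + d[1], box[2] + d[2], box[3] + d[3])
            labels[i] = label
    return list(zip(boxes, labels))
```

Splitting the two callables mirrors the design point on the slide: the expensive shared computation is paid once, and only the lightweight per-box part is repeated.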

27 Experimental Setup. Experiments are performed on the VOC 2007 and VOC 2012 datasets. G-CNN is trained with S=3 steps over an initial grid with three scales [2, 5, 10] and overlaps [0.9, 0.8, 0.7] at each scale. At test time, a coarser grid with overlaps [0.7, 0.5, 0.0] (around 180 initial boxes) achieves, after 5 iterations, the same mAP as Fast R-CNN with around 2K bounding boxes.

28 VOC2012 using VGG16

29 How effective are the regressors? IoU histogram of the best overlapping boxes to ground truth boxes at each iteration.
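For reference, the IoU (intersection over union) used in this histogram can be computed for two axis-aligned boxes as in the sketch below. The `(xmin, ymin, xmax, ymax)` box convention is an assumption; the slide does not state one.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), assumed convention


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes; 0.0 if they do not overlap."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```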

31 How can a neural network learn and utilize context? Mahyar Najibi, Mohammad Rastegari, Abhinav Gupta, Ali Farhadi. Deep Saccadic Detectors.

32 Top choices of FRCNN are very accurate

33 Detection with GTS. Methods are trained on VOC2007 trainval. AlexNet is employed as the CNN structure. [Table: per-class results for FRCNN SS, FRCNN SS+GT and FRCNN GT over Aeroplane, Bicycle, Bird, Boat, Bottle, Bus, Car, Cat, Chair, Cow, Dining table, Dog, Horse, Motorbike, Person, Potted Plant, Sheep, Sofa, Train, TV Monitor, and Average; values missing.] FRCNN SS: Fast RCNN using selective search proposals. FRCNN SS+GT: GT boxes are added to SS boxes. FRCNN GT: Only GT boxes are used.

34 Sequential detection. This suggests a simple strategy for detection: commit to the most confident detection; use it as context for determining the next most confident detection; and so on. All integrated into one CNN architecture.
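The strategy on this slide can be sketched as a greedy loop. Note that the talk integrates the idea into a single CNN; the standalone loop below, its `score_fn` callback, and the stopping threshold are assumptions made only to illustrate the control flow.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def sequential_detect(
    candidates: Sequence[T],
    score_fn: Callable[[T, List[T]], float],  # hypothetical: score given committed context
    max_detections: int,
    min_score: float,
) -> List[T]:
    """Repeatedly commit to the most confident candidate, rescoring the rest in context."""
    committed: List[T] = []
    remaining = list(candidates)
    while remaining and len(committed) < max_detections:
        # Rescore every remaining candidate given what has been committed so far.
        scored = [(score_fn(c, committed), c) for c in remaining]
        best_score, best = max(scored, key=lambda pair: pair[0])
        if best_score < min_score:
            break
        committed.append(best)
        remaining.remove(best)
    return committed
```

For example, a `score_fn` that raises the score of "person" once "horse" has been committed lets a weak person detection survive only when its context supports it.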

35 Deep Sequential Detection. [Figure: network architecture. Labels: Input; Convolutional Layers; ROI info; ROI Pooling; Linear (fc6), ReLU, Linear (fc7), ReLU; Classifier; Regressor; Classification Output; Class-based output; Concat, Linear (h1), ReLU, Linear (h2), ReLU; Hidden State; Selector; Active Selector; Active select input; Hidden select input; MAX; NMS]

36 Pascal VOC Classes ~10K images Pascal VOC Classes ~15K images MSCOCO (2015) 80 Classes ~300K images Datasets

37 VOC 2012

38 MSCOCO Precision and Recall. Methods are trained on the train set and evaluated on the validation set. The top 2K selective search proposals are used for the methods. [Figure: class-based relative improvement]

39 Scale dependent pooling. Fan Yang (CVPR 2016). Goal: detect (even small) objects effectively and efficiently using CNNs + object proposals. Challenges: scale variance, huge number of proposals.

40 Scale-dependent pooling. Pool proposals of different scales from different conv layers (n-branch structure). Small instances of objects are well represented using features pooled from lower conv layers.

41 Scale-dependent pooling. Divide proposals into groups based on their size. Pool small proposals at lower conv layers and larger ones at higher conv layers. Train the entire system end-to-end. [Figure: pooling branches for small proposals and large proposals]
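The grouping step can be sketched as a simple routing function from proposal size to a conv layer. The height-based criterion, the pixel thresholds, and the layer names below are illustrative assumptions, not values from the talk.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

# Assumed branches: (exclusive upper bound on box height in pixels, layer to pool from).
DEFAULT_BRANCHES: List[Tuple[float, str]] = [
    (64.0, "conv3"),
    (128.0, "conv4"),
    (float("inf"), "conv5"),
]


def assign_branch(box: Box, branches: Sequence[Tuple[float, str]] = DEFAULT_BRANCHES) -> str:
    """Return the layer a proposal should be pooled from, based on its height."""
    height = box[3] - box[1]
    for max_height, layer in branches:
        if height < max_height:
            return layer
    return branches[-1][1]


def group_proposals(
    boxes: Sequence[Box], branches: Sequence[Tuple[float, str]] = DEFAULT_BRANCHES
) -> Dict[str, List[Box]]:
    """Split proposals into per-layer groups so each branch pools only its own boxes."""
    groups: Dict[str, List[Box]] = {layer: [] for _, layer in branches}
    for box in boxes:
        groups[assign_branch(box, branches)].append(box)
    return groups
```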

42 Experiments. KITTI (mAP). Inner-city (mAP).

43 Detection as a function of size, KITTI. [Table: columns are Methods, Inputs, then S1, S2, S3, S4, S for each of Car, Pedestrian and Cyclist, plus mAP. Row labels as extracted: FRCNN+AlexNet (400); FRCNN+VGG (500), (800); (400), SDP 1 (500), (800); SDP+CRC 1 (500); SDP+CRC ft (500). Values missing.]

44 Modeling Context between Objects for Understanding Referring Expressions. Varun Nagaraja, Vlad Morariu, Larry Davis, ECCV 2016.

45 Referring Expressions. Descriptions that identify a particular object instance. Examples: "Man sitting on the left holding a game controller"; "Woman in the middle sitting on the bed"; "Man wearing a red jacket and blue jeans sitting on the right".

46 Referring expressions rely on attributes and context. Examples: "Person riding a blue motorcycle"; "Blonde fluffy dog"; "Giraffe bending down"; "Tan colored sofa"; "Plant on the right side of the TV".

47 Problem Formulation. [Figure labels: Input (Sentence: "Girl wearing a red jacket"; Image I), Output]

48 Solution Framework. Hypothesize a set of region candidates. (Generation and Comprehension of Unambiguous Object Descriptions, J. Mao et al., CVPR 2016)

49 Solution Framework. Pick the region candidate with the highest probability of generating the query referring expression. (Generation and Comprehension of Unambiguous Object Descriptions, J. Mao et al., CVPR 2016)
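The selection rule on this slide is an argmax over region candidates. The sketch below assumes a hypothetical `log_prob_fn(sentence, region)` that returns the log-probability of the expression given the region (in the talk this role is played by an LSTM); the bag-of-words scorer is only a toy stand-in so the example runs.

```python
import math
from typing import Callable, Dict, Hashable, Sequence


def comprehend(
    sentence: Sequence[str],
    regions: Sequence[Hashable],
    log_prob_fn: Callable[[Sequence[str], Hashable], float],  # hypothetical scorer
) -> Hashable:
    """Return the region with the highest log-probability of generating the sentence."""
    return max(regions, key=lambda region: log_prob_fn(sentence, region))


def toy_log_prob(word_probs: Dict[Hashable, Dict[str, float]]):
    """Build a toy scorer: independent per-word probabilities for each region."""
    def score(sentence: Sequence[str], region: Hashable) -> float:
        table = word_probs[region]
        # Unknown words get a small floor probability (an arbitrary choice).
        return sum(math.log(table.get(word, 1e-6)) for word in sentence)
    return score
```

Working in log space keeps long expressions from underflowing, which is why the product of word probabilities is written as a sum.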

50 Baseline Method. Modeling referring expression probability using an LSTM. [Figure: a chain of LSTM units. Inputs: region CNN features, image CNN features, bounding box features, and word embeddings of "<BOS> Girl wearing a red jacket". Outputs: "Girl wearing a red jacket <EOS>".] (Generation and Comprehension of Unambiguous Object Descriptions, J. Mao et al., CVPR 2016)

51 Max-margin Method. The baseline method can be improved by training the model to have lower probability for negative regions. [Figure: "Girl wearing a red jacket", with the referred region and negative regions marked.] (Generation and Comprehension of Unambiguous Object Descriptions, J. Mao et al., CVPR 2016)
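One way to express "lower probability for negative regions" is a hinge penalty on the gap between the log-probability at the referred region and at each negative region. The loss below is a sketch in that spirit; the margin, the weighting, and the exact form are assumptions and may differ from the loss used by Mao et al.

```python
from typing import Sequence


def max_margin_loss(
    logp_pos: float,
    logp_negs: Sequence[float],
    margin: float = 1.0,  # assumed hyperparameter
    weight: float = 1.0,  # assumed hyperparameter
) -> float:
    """Negative log-likelihood of the referred region plus a hinge on negatives.

    Each negative region is penalized when its log-probability comes within
    `margin` of the referred region's log-probability.
    """
    hinge = sum(max(0.0, margin - logp_pos + logp_neg) for logp_neg in logp_negs)
    return -logp_pos + weight * hinge
```

With `logp_pos = -1.0` and negatives at `-5.0` and `-1.5`, only the second negative violates the margin, so the loss is 1.0 + 0.5 = 1.5.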

52 Modeling Context. Previous methods do not model locations of contextual objects. Example: "The plant on the right side of the TV".

53 Modeling Context. Baseline and max-margin architecture: word embedding, region CNN features, region BBox, image features, LSTM.

54 Modeling Context. Context model architecture: word embedding, region CNN features, region BBox, context region features, context region BBox, LSTM.

55 Modeling Context. Pair (Region1, Region2): word embedding, Region1 CNN features, Region1 BBox, Region2 CNN features, Region2 BBox, LSTM.

56 Modeling Context. Pair (Region1, Region3): word embedding, Region1 CNN features, Region1 BBox, Region3 CNN features, Region3 BBox, LSTM.

57 Modeling Context. Pair (Region1, Region4): word embedding, Region1 CNN features, Region1 BBox, Region4 CNN features, Region4 BBox, LSTM.

58 Modeling Context Pooling context from multiple pairs of regions

59 Modeling Context. We can also use noisy-or pooling, which is more robust.
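Noisy-or pooling combines per-pair probabilities p_i into 1 - prod(1 - p_i), so several moderately confident pairs reinforce each other instead of only the single best pair counting, as with max pooling. The sketch below shows both; how the per-pair probabilities are produced is outside its scope.

```python
from typing import Sequence


def noisy_or(probs: Sequence[float]) -> float:
    """Noisy-or pooling: probability that at least one pair 'fires'."""
    none_fire = 1.0
    for p in probs:
        none_fire *= 1.0 - p
    return 1.0 - none_fire


def max_pool(probs: Sequence[float]) -> float:
    """Max pooling: only the single most probable pair matters."""
    return max(probs) if probs else 0.0
```

For two pairs with probability 0.5 each, max pooling gives 0.5 while noisy-or gives 0.75.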

60 Training the Context Model. The challenge is that there are no annotations available for context objects. Example: "The plant on the right side of the TV".

61 Multiple Instance Learning. So we use a MIL-based technique and use the annotation of the referred object as weak supervision. Example: "The plant on the right side of the TV".

62 Experiments. Implemented in Caffe. Region and image features: VGG16 fc8 layer, fine-tuned. Bounding box features: scaled <xmin, ymin, xmax, ymax, area>. Word embedding size: 1024. LSTM hidden dimension: 1024. Region candidates: MCG technique. Region filtering process: obtain scores from Fast-RCNN and select regions above a threshold.
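The slide does not say how the bounding box features are scaled. One plausible reading, shown below purely as an assumption, is to divide coordinates by the image width and height and the area by the image area, so every feature falls in [0, 1].

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax) in pixels


def bbox_features(box: Box, image_width: float, image_height: float):
    """Return <xmin, ymin, xmax, ymax, area> normalized by the image size (assumed scaling)."""
    xmin, ymin, xmax, ymax = box
    area = (xmax - xmin) * (ymax - ymin)
    return (
        xmin / image_width,
        ymin / image_height,
        xmax / image_width,
        ymax / image_height,
        area / (image_width * image_height),
    )
```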

63 Google RefExp Results. A detection is considered a true positive if the IoU score is greater than 0.5. All results are from noisy-or pooling. [Table: Google RefExp validation partition; columns are GT and MCG proposals; rows are Max Likelihood [Mao et al], Max margin [Mao et al], Ours (Neg. Bag margin), Ours (Pos. & Neg. Bag margin); values missing.]

64 Google RefExp Results. [Figure columns: Groundtruth, Image context only, Noisy-or pooling. Example expressions: "The chair closest to the lady"; "A white truck in front of a yellow truck".]

65 UNC RefExp Results. [Table: TestB partition (object centric); columns are GT and MCG proposals; rows are Max Likelihood [Mao et al], Max margin [Mao et al], Ours (Neg. Bag margin), Ours (Pos. & Neg. Bag margin); values missing.]

66 UNC RefExp Results, TestB partition (object centric). [Figure columns: Groundtruth, Image context only, Noisy-or pooling. Example expressions: "Elephant towards the back"; "Food on the far back on the plate".]

67 A few closing observations. Success depends on the region proposal algorithm including candidates for the correct referred and context objects. This is much more demanding than just requiring a candidate for the referred object, but is ameliorated somewhat by having the entire image as a candidate context object. A straightforward extension to include additional context objects (language can be deeply nested) is intractable. (Methodological) We would like to evaluate performance restricted to relevant referring expressions, but it is difficult to specify correct criteria for selection.

68 Summary. The intellectual landscape of computer vision has changed dramatically over the past decade. Many of the future research directions identified by the workshop are already well underway. And there are still huge performance shortfalls on basic problems like detection and recognition (compare MSCOCO vs VOC). My favorite future research directions: context (sooner or later it has to make a difference); visual search; tasking visual surveillance systems; compositional models and video analysis (structured prediction).
