ECS 289H: Visual Recognition Fall 2014 Yong Jae Lee Department of Computer Science
Plan for today Questions? Research overview
Standard supervised visual learning: annotators provide labeled training images for category models (e.g., building, tree), which are then applied to novel images. The number of training images required can be costly, and this assumes a closed-world setting where all categories are known.
Unsupervised visual discovery: from the visual world to discovered categories
Unsupervised visual discovery: object segmentations in images and video
Unsupervised visual discovery: a storyboard visual summary (1:00 pm, 2:00 pm, 3:00 pm, 4:00 pm). No human explicitly guides the visual recognition process.
Why visual discovery? Exploring new environments
Why visual discovery? Summarization (MSR SenseCam)
Why visual discovery? Web-scale statistics: 6 billion images, 70 billion images, 1 billion images served daily, 10 billion images, 100 hours uploaded per minute. Almost 90% of web traffic is visual, and most of it is unlabeled!
Inputs today: personal photo albums; movies, news, sports; surveillance and security; medical and scientific images. Understand, organize, and index all this data! (Slide credit: Svetlana Lazebnik)
Let's first explore what we can do with big data!
Everyday use of big data: Predictive text
Predictive drawing?
Video: ShadowDraw
Research goal: visual discovery, from the visual world to discovered categories
Key challenges: simultaneously estimate segmentation and groupings; unknown variability in appearance; what is the proper distance metric?
How similar are two pictures? CLIME vs. CRIME: a Hamming distance of 1 letter. Two points: a Euclidean distance of 5 units. Two image patches: a gray-value distance of 50. But two whole pictures: ? (Slide credit: Alyosha Efros)
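These three notions of distance are easy to make concrete; a minimal sketch (the point coordinates and patch values are illustrative stand-ins for the slide's figures):

```python
import numpy as np

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    assert len(a) == len(b)
    return sum(c1 != c2 for c1, c2 in zip(a, b))

# CLIME vs. CRIME differ in one letter
print(hamming("CLIME", "CRIME"))  # 1

# Euclidean distance between two 2-D points 5 units apart
p, q = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(np.linalg.norm(p - q))  # 5.0

# Mean gray-value distance between two constant patches (values 0..255)
patch_a = np.full((8, 8), 100, dtype=np.int32)
patch_b = np.full((8, 8), 150, dtype=np.int32)
print(np.abs(patch_a - patch_b).mean())  # 50.0
```

The open question on the slide is exactly that none of these extends cleanly to whole multi-object images.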
How similar are two pictures?
Problem: clusters formed from full-image matches
Mutual relationship between foreground features and clusters: if we have only foreground features, we can form good clusters (compare clusters formed from full-image matches vs. foreground matches).
Mutual relationship between foreground features and clusters: if we have good clusters, we can detect the foreground.
Our approach: update clusters based on weighted feature matches, then refine feature weights given the current clusters. An unsupervised task that iteratively seeks the mutual support between discovered objects and their defining features. [Lee & Grauman, Foreground Focus, IJCV 2009]
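The alternation can be sketched with a toy stand-in: weighted k-means in place of the paper's pair-wise matching and normalized cuts, and inverse within-cluster variance as the feature-weight update (both simplifications are assumptions, not the paper's actual formulation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: feature 0 is a "foreground" cue that separates two object
# groups; feature 1 is noisy "background" clutter shared by all images.
fg = np.concatenate([rng.normal(0, 0.3, 50), rng.normal(5, 0.3, 50)])
bg = rng.normal(2.5, 3.0, 100)
X = np.stack([fg, bg], axis=1)

w = np.ones(X.shape[1]) / X.shape[1]      # start with uniform weights
for _ in range(5):
    Xw = X * np.sqrt(w)                   # weight features before clustering
    centers = Xw[[0, 50]].copy()          # seed one center in each toy group
    for _ in range(20):                   # 2-means (stand-in for norm. cuts)
        labels = np.argmin(((Xw[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.stack([Xw[labels == k].mean(0) for k in range(2)])
    # Refine weights: features with low within-cluster variance are the
    # ones that mutually support the current clusters.
    var = np.stack([X[labels == k].var(0) for k in range(2)]).mean(0)
    w = 1.0 / (var + 1e-8)
    w /= w.sum()

print(np.round(w, 3))  # the foreground feature receives most of the weight
```

Over the iterations, clustering and feature weighting reinforce each other: better clusters concentrate weight on the discriminative (foreground) feature, which in turn cleans up the clusters.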
Cluster and feature weight refinement, iteration 1: represent images as sets of local features, perform pair-wise matching, and apply normalized cuts clustering to obtain an initial set of clusters.
Iteration 1, continued: compute new feature weights from the current clusters.
Iteration 2: form a new set of clusters, then compute new feature weights.
Iteration 3: pair-wise matching plus normalized cuts yields the final set of clusters and new feature weights.
Quality of clusters formed: black dotted lines indicate the best possible quality that could be obtained if the ground-truth segmentation were known.
Quality of foreground detection: 10-class subset; highly weighted features shown.
Shape: invariant to lighting conditions, and relatively stable compared to intra-category appearance (texture, color) variations. Can we discover common object shapes within unlabeled multi-category collections of images?
Anchoring edge fragments to local patches: even with accurate patch matches, there's a limit to how much shape information can be captured. By anchoring edge fragments to patch features, we can produce more reliable matches and describe the object's shape.
Foreground shape discovery: prototypical shapes. Examples of discovered object contours. [Lee & Grauman, Shape Discovery, CVPR 2009]
Works well for object-centric images; complex images with multiple objects remain challenging.
Existing approaches: previous work treats unsupervised visual discovery as an appearance-grouping problem.
Our idea: how can seeing previously learned objects in novel images help to discover new categories?
Our idea: discover visual categories within unlabeled images by modeling interactions between the unfamiliar regions and familiar objects. [Lee & Grauman, Object-graphs, CVPR 2010]
Context-aware visual discovery: scenes labeled with known categories (sky, grass, house, driveway, truck, fence) alongside unknown regions marked "?". [Lee & Grauman, Object-graphs, CVPR 2010]
Pipeline: learn models → detect unknowns → object-level context → discovery. Learn known categories (e.g., tree, building, sky, road). Offline: train region-based classifiers for N known categories using labeled training data.
Identifying unknown regions. Input: an unlabeled pool of novel images; compute multiple segmentations for each unlabeled image.
Identifying unknown regions: compute the class posterior P(class | region) for each segment, and deem the segment known or unknown based on the resulting entropy; a segment with high entropy is predicted unknown.
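The entropy test can be sketched directly (the 0.5 normalized-entropy threshold below is an illustrative choice, not a value from the paper):

```python
import numpy as np

def is_unknown(posteriors, threshold=0.5):
    """Flag a segment as 'unknown' when the entropy of its class
    posterior P(class | region) exceeds a threshold.  Entropy is in
    bits, normalized by the maximum possible entropy log2(N)."""
    p = np.asarray(posteriors, dtype=float)
    p = p / p.sum()
    entropy = -np.sum(p * np.log2(p + 1e-12))
    return bool(entropy / np.log2(len(p)) > threshold)

# A confident prediction over 4 known classes -> low entropy -> "known"
print(is_unknown([0.9, 0.05, 0.03, 0.02]))  # False
# A nearly flat posterior -> high entropy -> "unknown"
print(is_unknown([0.3, 0.25, 0.25, 0.2]))   # True
```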
Object-graphs: for an unknown region within an image, model the topology of category predictions relative to the unknown (unfamiliar) region.
Object-graphs: for an unknown region s, consider its R spatially nearest regions above and below, and record the distribution over the known classes (e.g., building, tree, sky, road) at each node. The descriptor stacks these distributions, g(s) = [H_0(s), H_1(s), ..., H_R(s)], where H_0(s) is the class distribution of s itself and H_r(s) collects the distributions of the regions above and below, from the 1st nearest out to the r-th nearest.
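Assembling g(s) can be sketched as follows; the vertical-centroid nearness measure and the running average out to the r-th nearest region are illustrative choices, and the function and variable names are mine, not the paper's:

```python
import numpy as np

def object_graph(centroids, posteriors, s_idx, R=2):
    """Build g(s) for the unknown region s: stack its own class
    posterior (H_0) with the averaged posteriors of the regions above
    and below it, out to the r-th nearest, for r = 1..R.  Nearness is
    measured by vertical centroid distance (an illustrative choice)."""
    cy = centroids[:, 1]
    others = [i for i in range(len(centroids)) if i != s_idx]
    above = sorted((i for i in others if cy[i] < cy[s_idx]),
                   key=lambda i: cy[s_idx] - cy[i])
    below = sorted((i for i in others if cy[i] > cy[s_idx]),
                   key=lambda i: cy[i] - cy[s_idx])
    g = [posteriors[s_idx]]
    acc_a = np.zeros_like(posteriors[s_idx])
    acc_b = np.zeros_like(posteriors[s_idx])
    for r in range(R):
        if r < len(above):
            acc_a = acc_a + posteriors[above[r]]
        if r < len(below):
            acc_b = acc_b + posteriors[below[r]]
        g.append(acc_a / (r + 1))   # H_{r+1}: regions above, out to r+1
        g.append(acc_b / (r + 1))   # ... and the regions below
    return np.concatenate(g)

# Toy image with four regions stacked top to bottom, posteriors over
# four known classes [building, tree, sky, road]; region 2 is unknown.
centroids = np.array([[0, 0], [0, 1], [0, 2], [0, 3]], dtype=float)
posteriors = np.array([[0.10, 0.10, 0.70, 0.10],   # sky-like, on top
                       [0.60, 0.20, 0.10, 0.10],   # building-like
                       [0.25, 0.25, 0.25, 0.25],   # the unknown region s
                       [0.10, 0.10, 0.10, 0.70]])  # road-like, below
g = object_graph(centroids, posteriors, s_idx=2, R=2)
print(g.shape)  # (20,) = (1 + 2R) blocks of 4 class posteriors
```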
Example object-graphs: colors indicate the predicted known category (max posterior), e.g., building, sky, road; "unknown" marks the unfamiliar region.
Discovery: cluster the unknown regions using region-region affinities; object-level context provides more robust affinities.
Results: object discovery accuracy on MSRC-v2, PASCAL 2008, MSRC-v0, and Corel.
Example discoveries
Context-aware face discovery: the system can suggest novel people to name based on their appearance and co-occurrence with familiar people (e.g., Kate, David). [Lee & Grauman, Face discovery, BMVC 2011]
Results: context-aware face discovery. Datasets: Gallagher, Friends, Buffy; 12,542 images, 8,452 faces, and 23 unique people. Two splits: 8 unknowns and 15 unknowns. Examples show each discovered face with its co-occurring faces. [Lee & Grauman, Face discovery, BMVC 2011]
Self-paced discovery: previous work treats unsupervised visual discovery as a one-pass, batch k-way procedure.
Self-paced discovery (ours): focus on the easier instances first, discovering the single easiest category at a time, and gradually discover new models of increasing complexity. [Lee & Grauman, Self-paced discovery, CVPR 2011]
Pipeline: initialize stuff models → detect easy instances → discover a new category → expand knowledge. Easy objects are identified with an easiness score (ES) that combines objectness (Obj) and context-awareness (CA), using a familiarity map (F). Obj: how well a window contains any generic object. CA: how well the surrounding regions resemble familiar categories.
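A toy sketch of the easiness idea: both cues must be high for an instance to count as "easy" (the geometric-mean combination and the scores below are illustrative assumptions, not the paper's formulation):

```python
import numpy as np

def easiness(objectness, context_awareness):
    """Easiness score for a candidate window: high only when the window
    looks like a generic object (Obj) AND its surroundings resemble
    familiar categories (CA).  The geometric mean is one illustrative
    way to combine the two cues."""
    return np.sqrt(objectness * context_awareness)

# Hypothetical candidate windows scored by (Obj, CA); the easiest
# instance is discovered first, and harder ones deferred to later passes.
windows = {"w1": (0.9, 0.8),   # object-like AND in familiar context
           "w2": (0.9, 0.1),   # object-like but unfamiliar surroundings
           "w3": (0.2, 0.9)}   # familiar context but not object-like
scores = {k: easiness(o, c) for k, (o, c) in windows.items()}
easiest = max(scores, key=scores.get)
print(easiest)  # w1
```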
Object discovery accuracy
Unsupervised visual discovery: object segmentations in images and video
Collect-Cut: unsupervised segmentation of unlabeled multi-object images using a discovered ensemble. Examples compare the unlabeled images, Collect-Cut (ours), and the best bottom-up result (with multiple segmentations). [Lee & Grauman, Collect-Cut, CVPR 2010]
Problem: video object segmentation. How do we segment the foreground objects in video when the background is moving and changing, and the categories of the foreground objects are unknown in advance? Input: unannotated video. Desired output: segmentation of the high-ranking foreground object. Existing methods group pixels using low-level features, which can result in over-segmentation. [Brendel & Todorovic 2009; Vazquez-Reina et al. 2010; Grundmann et al. 2010; Brox & Malik 2010]
Key-segment discovery: discover a set of object-like key-segments for category-independent video object segmentation. Resist over-segmentation by detecting regions with object-like appearance and motion. [Lee, Kim, Grauman, Key-segments, ICCV 2011]
Key-segment discovery: 1) find object-like regions using appearance and motion cues; 2) group regions across the video to discover key-segment hypotheses; 3) rank the hypotheses and build color and shape segmentation models for each; 4) for a given hypothesis, segment the corresponding foreground object using the models. [Lee, Kim, Grauman, Key-segments, ICCV 2011]
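Step 1's scoring idea can be sketched as follows: combine a static object-like appearance score with a motion cue measuring how differently a region moves from its surroundings (the exponential squashing and multiplicative combination are illustrative, not the paper's exact formulation):

```python
import numpy as np

def motion_score(region_flow, surround_flow):
    """Motion cue: how differently a region moves from its surroundings,
    as the norm of the mean optical-flow difference, squashed to [0, 1)."""
    diff = np.linalg.norm(region_flow.mean(0) - surround_flow.mean(0))
    return 1.0 - np.exp(-diff)

def region_score(appearance, region_flow, surround_flow):
    """Combined object-likeness: static appearance score times the
    motion cue (multiplicative combination is an illustrative choice)."""
    return appearance * motion_score(region_flow, surround_flow)

# A region moving against a static background scores high...
moving = region_score(0.8, np.array([[3.0, 0.0]]), np.array([[0.0, 0.0]]))
# ...while a region moving together with the background scores low.
static = region_score(0.8, np.array([[1.0, 0.0]]), np.array([[1.0, 0.0]]))
print(moving > static)  # True
```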
Results: key-segment video segmentation. Detect and segment people and discovered important objects without category-specific models. Success in spite of a moving camera, background changes, and low resolution.
Results: key-segment video segmentation, Grundmann et al. 2010 vs. ours. Our method resists over-segmentation by detecting regions with object-like appearance and motion.
Results: key-segment video segmentation (segmentation error rate). Background subtraction falls apart; ours produces state-of-the-art results even when compared to supervised methods. [29]: Tsai et al., BMVC 2010; [7]: Chockalingam et al., ICCV 2009
Unsupervised visual discovery: a storyboard visual summary of the visual world (1:00 pm, 2:00 pm, 3:00 pm, 4:00 pm)
Mining first-person camera data: GoPro, Google Glass, Looxcie, Tobii, SMI, Pivothead
Mining first-person camera data: Steve Mann's 1990s life logger
Problem: summarizing egocentric videos. Input: egocentric video of the camera wearer's day, captured with a wearable camera (9:00 am, 10:00 am, 11:00 am, 12:00 pm, 1:00 pm, 2:00 pm). Output: storyboard summary of discovered important people and objects. [Lee, Ghosh, Grauman, Egocentric video summarization, CVPR 2012]
Important person/object discovery: discover important people and objects for egocentric video summarization. Important: things with which the camera wearer has significant interaction. [Lee, Ghosh, Grauman, Egocentric video summarization, CVPR 2012]
Pipeline: collect training data → learn importance → segment video into events → discover important regions → storyboard summary. Data collection: 15 fps, 320 × 480 resolution; 10 videos, 3–5 hrs in length, 37 hrs total. Four subjects: one undergraduate, two grad students, and one office worker.
Learning region importance. Egocentric features: distance to hand, distance to frame center, frequency. Object features: the candidate region's appearance and motion, the surrounding area's appearance and motion, and object-like appearance and motion. Region features: size, width, height, centroid, overlap with face detection.
Learning region importance: a regressor with learned parameters w_i over the i-th feature values x_i(r) learns and predicts a region's degree of importance I(r). Expect significant interactions between the features; e.g., a region near the hand is important only if it is object-like in appearance. For training: fit the regressor on the collected data; for testing: predict I(r) given the x_i(r)'s.
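The interaction point can be sketched with a least-squares regressor on pairwise feature products; the feature names and toy ground truth below are hypothetical, chosen so that importance is a pure interaction effect:

```python
import numpy as np

def interactions(x):
    """Expand features with all pairwise products so a linear regressor
    can capture interactions such as 'near hand AND object-like'."""
    x = np.atleast_2d(x)
    cross = np.einsum('ni,nj->nij', x, x).reshape(len(x), -1)
    return np.hstack([np.ones((len(x), 1)), x, cross])

rng = np.random.default_rng(1)
# Toy regions: x = [near_hand, object_like]; a region is important only
# when both cues are high (importance = their product).
X = rng.random((200, 2))
I = X[:, 0] * X[:, 1]

Phi = interactions(X)
w, *_ = np.linalg.lstsq(Phi, I, rcond=None)   # fit the regressor

# Predict I(r) for a new region that is near the hand AND object-like
pred = (interactions(np.array([0.9, 0.9])) @ w)[0]
print(round(pred, 3))  # 0.81
```

A plain linear model without the product terms could not represent this "only if both" behavior, which is why the interaction expansion matters here.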
Results: important region prediction, good predictions. Compared methods: ours; object-like [Carreira, 2010]; object-like [Endres, 2010]; saliency [Walther, 2006].
Results: important region prediction, failure cases, for the same methods.
Generating a storyboard summary: display event boundaries (Event 1, Event 2, Event 3, Event 4) and frames of the selected important people and objects.
Results: egocentric video summarization. Original video (3 hours) vs. our summary (12 frames).
Results: Egocentric video summarization
Fine-grained recognition
Video: AverageExplorer
Coming up: sign up for papers. Next class: "Object Recognition from Local Scale-Invariant Features", D. Lowe, ICCV 1999; "Video Google: A Text Retrieval Approach to Object Matching in Videos", J. Sivic and A. Zisserman, ICCV 2003. Read both papers and write a review for one of them.