CS395T Visual Recogni5on and Search. Gautam S. Muralidhar

Similar documents
Sampling Strategies for Object Classifica6on. Gautam Muralidhar

Deformable Part Models

Beyond bags of features: Adding spatial information. Many slides adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba

Introduc)on to Probabilis)c Latent Seman)c Analysis. NYP Predic)ve Analy)cs Meetup June 10, 2010

BoW model. Textual data: Bag of Words model

An Latent Feature Model for

ECE 5424: Introduction to Machine Learning

Search Engines. Informa1on Retrieval in Prac1ce. Annota1ons by Michael L. Nelson

Harmony Poten,als: Fusing Global and Local Scale for Seman,c Image Segmenta,on

STA 4273H: Sta-s-cal Machine Learning

Unsupervised discovery of category and object models. The task

Lecture 7: Spectral Clustering; Linear Dimensionality Reduc:on via Principal Component Analysis

Classifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao

Alignment and Image Comparison

Alignment and Image Comparison. Erik Learned- Miller University of Massachuse>s, Amherst

Improving Recognition through Object Sub-categorization

Unsupervised Discovery of Action Classes

About the Course. Reading List. Assignments and Examina5on

CS6670: Computer Vision

Machine Learning Crash Course: Part I

Rapid Extraction and Updating Road Network from LIDAR Data

Organized Segmenta.on

Detecting Salient Contours Using Orientation Energy Distribution. Part I: Thresholding Based on. Response Distribution

Bag of Words Models. CS4670 / 5670: Computer Vision Noah Snavely. Bag-of-words models 11/26/2013

Multiple-Choice Questionnaire Group C

Lecture 13: Tracking mo3on features op3cal flow

Lecture 14: Tracking mo3on features op3cal flow

Pattern recognition (3)

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin

Learning and Recognizing Visual Object Categories Without First Detecting Features

Visual Object Recognition

Unusual Activity Analysis using Video Epitomes and plsa

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin

Shape Matching. Brandon Smith and Shengnan Wang Computer Vision CS766 Fall 2007

Part based models for recognition. Kristen Grauman

Lecture 16: Object recognition: Part-based generative models

Object Classification Problem

What is Search For? CS 188: Ar)ficial Intelligence. Constraint Sa)sfac)on Problems Sep 14, 2015

Clustering CS 550: Machine Learning

Using Machine Learning to Optimize Storage Systems

Beyond Bags of features Spatial information & Shape models

Detecting Irregularities in Images and in Video

Motion illusion, rotating snakes

Image Retrieval (Matching at Large Scale)

Decision Trees, Random Forests and Random Ferns. Peter Kovesi

Lecture 18: Human Motion Recognition


Probabilistic Generative Models for Machine Vision

Patch Descriptors. CSE 455 Linda Shapiro

Previously. Part-based and local feature models for generic object recognition. Bag-of-words model 4/20/2011

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017

Lecture 12 Recognition

Probabilistic Graphical Models Part III: Example Applications

Visual Object Recognition

Introduc)on to fmri. Natalia Zaretskaya

A Taxonomy of Semi-Supervised Learning Algorithms

Fuzzy based Multiple Dictionary Bag of Words for Image Classification

Lecture 12 Recognition. Davide Scaramuzza

Stages of (Batch) Machine Learning

Recognition. Topics that we will try to cover:

Local Features and Bag of Words Models

Comparing Local Feature Descriptors in plsa-based Image Models

Joint Shape Segmentation

CS 229 Midterm Review

Object Oriented Design (OOD): The Concept

Pa#ern Recogni-on for Neuroimaging Toolbox

Text Modeling with the Trace Norm

ECS 289H: Visual Recognition Fall Yong Jae Lee Department of Computer Science

By Suren Manvelyan,

Semi-Supervised Clustering with Partial Background Information

Object Category Detection. Slides mostly from Derek Hoiem

Clustering Lecture 5: Mixture Model

CSCI 599 Class Presenta/on. Zach Levine. Markov Chain Monte Carlo (MCMC) HMM Parameter Es/mates

Informa(on Retrieval

Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words

TerraSwarm. A Machine Learning and Op0miza0on Toolkit for the Swarm. Ilge Akkaya, Shuhei Emoto, Edward A. Lee. University of California, Berkeley

Machine Learning. CS 232: Ar)ficial Intelligence Naïve Bayes Oct 26, 2015

Informa(on Retrieval

Part-based and local feature models for generic object recognition

Selection of Scale-Invariant Parts for Object Class Recognition

Vocabulary tree. Vocabulary tree supports very efficient retrieval. It only cares about the distance between a query feature and each node.

Image classification Computer Vision Spring 2018, Lecture 18

VK Multimedia Information Systems

Analysis: TextonBoost and Semantic Texton Forests. Daniel Munoz Februrary 9, 2009

Unsupervised Learning : Clustering

Discovering Visual Hierarchy through Unsupervised Learning Haider Razvi

Spatial Latent Dirichlet Allocation

Saliency Detection in Aerial Imagery

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

Informa(on Retrieval

10703 Deep Reinforcement Learning and Control

Minimum Redundancy and Maximum Relevance Feature Selec4on. Hang Xiao

Segmentation. Bottom up Segmentation Semantic Segmentation

Semi-Supervised Hierarchical Models for 3D Human Pose Reconstruction

CS 664 Flexible Templates. Daniel Huttenlocher

Dr. Ulas Bagci

CS145: INTRODUCTION TO DATA MINING

Lecture 24 EM Wrapup and VARIATIONAL INFERENCE

Learning Human Motion Models from Unsegmented Videos

Shape Co-analysis. Daniel Cohen-Or. Tel-Aviv University

Transcription:

CS395T Visual Recogni5on and Search Gautam S. Muralidhar

Today s Theme Unsupervised discovery of images Main mo5va5on behind unsupervised discovery is that supervision is expensive Common tasks include Detec5ng objects and their loca5ons Segmenta5on Ac5vity recogni5on Irregulari5es in images and videos

Detec5ng Objects and Segmenta5on From Sivic et al

Ac5on Class Recogni5on From Wang et al

Detec5ng Irregulari5es From Boiman and Irani

Recipes Usually Process images and detect interest points Extract low level features /descriptors (e.g., SIFT) Cluster the image based on the descriptors Learn sta5s5cal models to infer object categories / ac5vity classes An alterna5ve To make use of an exis5ng database as evidence for a task, for e.g., detec5ng irregulari5es

Detec5ng objects and their loca5ons in images Sivic et al

Analogy between text documents and images Text documents composed of words, Images composed of visual words Both can be represented by a bag of words approach Associated with each (visual) word is an (object) topic category Text documents mixture of topics, Images mixture of object categories

plsa The joint probability P(w i,d j,z k ) is assumed to have the following graphical model: Goal of plsa find topic specific word distribu5on P(w z), document specific mixing propor5ons P(z d) and from these, the document specific word distribu5on P(w d) From Sivic et al

plsa model Fi\ng the model involves determining the topics, which are common to all documents and mixture of coefficients, which are specific to each document Maximizing the objec5ve func5on yields the maximum likelihood es5mate of the parameters of the model that gives high probability to words that appear in a corpus From Sivic et al

Obtaining Visual Words SIFT descriptors extracted from ellipi5cal shape adapta5on about an interest point and maximally stable extremal regions SIFT has all the nice proper5es The SIFT descriptors are then vector quan5zed (k means) into visual words Total vocabulary size = 2237 words

Doublets of Visual Words Black Ellipse represents the visual word whose doublets we want to es5mate, ellipses that are red and green are candidate neighbors The large red ellipse significantly overlaps with the black ellipse and is discarded Likewise, the smaller red ellipse is too small compared to the black ellipse and is discarded The Green ellipses are returned as doublets for the black ellipse From Sivic et al

Model Learning and Baseline Method EM algorithm for plsa converges in 40 100K itera5ons For the baseline method k means was employed on the same features of the word frequency vectors for each image

Experiments and Datasets Three experiments: 1. Topic discovery categories are discovered by plsa clustering on all available images 2. Classifica5on of unseen images topics on one set of images are learnt to determine the topics in another set 3. Object detec5on to determine the loca5on and approximate segmenta5on of the objects Dataset Caltech 101 (5 categories) and MIT

Topic Discovery Experiment Case 1 Images of 4 object categories with clugered background When number of topics K = 4, 98% of the 4 different categories are accurately discovered K = 5 splits the car dataset into twp subtopics as the data consists of sets of many repeated images of the same car K = 6, splits the motorbike data into sets with plain and clugered background K = 7 and 8, discovers two more sub groups of the car data containing again other repeated images of the same/similar cars. From Sivic et al

Most probable visual words Visual words with high topic specific probability P(w i z k ) From Sivic et al

Topic Discovery Experiment Case 2, with Background Topics From Sivic et al

Classifying New Images Experiment P(w z) topic specific distribu5ons are learned from a separate set of training images When observing a new, previously unseen test image, the document specific mixing coefficients P(z test) are computed Achieved by EM with only coefficients P(z test) updated in each M step and the learned P(w z) are kept fixed

Classifica5on Results From Sivic et al

Segmenta5on Results from the Posteriors Mixing coefficients From Sivic et al

Segmenta5on Results with Doublets From Sivic et al

MIT dataset results from Sivic et al

MIT dataset results from Sivic et al

Conclusion Visual object categories can be discovered using an unsupervised approach Tasks such as segmenta5on can be performed using simple bag of features combined with sta5s5cal models However, bag of features does not take into account seman5c context Can models from sta5s5cal text literature handle context? Are models from text really appropriate? Unlike text, Images have a strong spa5al structure

Moving on. Unsupervised Discovery of Ac5on Classes By Wang et al.

Basic Idea Cluster images that depict similar ac5ons together and label these clusters (ac5on classes) Assign a new image to an ac5on class based on its distance from the centroids of the clusters

Approach Human shape as a cue to determine the ac5on The similarity measure for clustering has to take into account the deforma5ons when comparing two images of different people performing different ac5ons

Similarity Measure Requirement yield a high value on a pair of images when similar poses are depicted and a low value on dissimilar poses Spectral Clustering Affinity Matrix W (n x n) where W ij = affinity between images i and j, W ij = exp( (d ij2 + d ji2 )/2)

Deformable Template Matching Algorithm to match ac5ons in images by measuring affinity (similarity) Posed as an Integer Linear Programming Problem Computa5onally not feasible when required to compute n x n affinity measures (required to do so as the affinity measure is not symmetric) A fast pruning algorithm based on shape contexts is used to address this issue

Fast Pruning using Representa5ve Shape Contexts Slide From Grauman, Original Source from Belongie, Malik and Puzicham ICCV 2001

Pruning results A cri5cism of the RSC pruning algorithm is that it ignores spa5al structure and this leads to some errors in matching. The deformable template matching algorithms tries to rec5fy these errors For the pruning algorithm, a shortlist of length 50 was used.

Back to the Deformable Template Matching Algorithm Goal : To find the op5mal matching of sample points from one image to another The distance between two images is the minimum of the following objec5ve func5on over all possible candidate matches: From Wang et al

More on the Objec5ve Func5on The objec5ve func5on clearly shows that a set of points {f} is sought such that: the cost of matching points {f} with the {s} is minimum the points in the neighborhood of every point in {f} matches closely to the points in the neighborhood of every point in {s}

Cost Func5on The cost func5on c(s,f) is computed as follows: Convert the edge maps to gray scale images by taking the distance transform Consider small neighborhood of 9x9 pixels around each feature point in the two images Compute the normalized sum of absolute differences between the 2 patches to get c(s,f)

Linearizing the Objec5ve Func5on The first term in the objec5ve func5on measures similarity between the 2 sets of points, while the second term is a mesure of rela5ve spa5al deforma5on The objec5ve func5on is neither linear nor convex a hard op5miza5on problem

Linearizing the Objec5ve Func5on \\ contd. Lets focus on the 2 nd term first (it s easy!) Recall that the L1 norm is just the absolute value (or to be formal the sum of absolute values of all elements in the vector) An absolute value can always be replaced by the difference of two non nega5ve values

Linearizing the objec5ve func5on To linearize the first term: For the set of coordinates f, find its basis vectors Represent each fs as a linear combina5on of the basis vectors Approximate the cost func5on as the linear combina5on of the cost between the sample point s and the basis vectors

The final LP The LP can be solved using the Simplex method

Clustering Results Each row corresponds to a cluster Number of clusters fixed between 100 and 200 From Wang et al

From Wang et al

From Wang et al

Image Labeling Experiment From Wang et al

Image Labeling Experiment Accuracy of automatic labeling as evaluated subjectively by 4 naïve observers From Wang et al

Conclusions An unsupervised recipe for ac5on class recogni5on Spectral clustering vs. Topic discovery comparing pairs of images for deciding on common ac5ons vs. learning the visual codewords associated with each ac5on class A few points to note The ac5on classes considered are substan5ally different from one another What happens when ac5ons depicted on s5ll images are similar for e.g., brisk walking vs. jogging / sleeping vs. just lying down! Would there be 5es?

Moving On Detec5ng Irregulari5es in Images and in Video Boiman and Irani

Recap We just saw Unsupervised approach to learn codewords specific to object categories to detect objects in images Spectral clustering to compare images and infer about commonality in ac5ons We will now see An unsupervised approach that makes use of a database of evidence to infer what is regular and what is not in ac5ons!

Detec5ng Irregulari5es From hgp://www.wisdom.weizmann.ac.il/~vision/ Irregulari5es.html#Video%20Examples

Spa5o Temporal Saliency From hgp://www.wisdom.weizmann.ac.il/~vision/ Irregulari5es.html#Video%20Examples

Approach Analyze patches in an image and compare their rela5ve loca5on and appearance with ensemble of patches in a database Regions in the image that have a con5guous region of support in the database are considered likely A graphical model to capture the dependencies between the query patch and a patch from the ensemble

Inference by Composi5on Given a new image, can each region be explained by a large enough con5guous region of support by a database of evidence? From Boiman and Irani

Ensemble of Patches From Boiman and Irani

Ensemble of Patches Local patches at mul5ple scales to allow for local, non rigid deforma5ons Patches of similar proper5es (appearance, or behavior) and having similar geometric configura5on are searched An ensemble typically consists of hundreds of patches at mul5ple scales Patch defined in terms of a simple descriptor vector and absolute loca5on

Sta5s5cal Formula5on Goal to es5mate likelihood at every pixel ;The similarity between a pair of ensembles y and x is given by the likelihood: From Boiman and Irani Given an observed ensemble, a set of ensembles from the database that maximizes the MAP probability assignment is sought. This is solved by implemen5ng an efficient message passing algorithm.

Applica5ons Detec5ng Unusual Image Configura5ons From Boiman and Irani

Spa5al Saliency in an Image From Boiman and Irani

Detec5ng Suspicious Behavior From Boiman and Irani

Detec5ng Salient Behaviors From Boiman and Irani

To Conclude We saw applica5ons of unsupervised discovery Approaches include Topic models that learn codewords specific to object categories to detect objects in images Spectral clustering to compare pairs of images to infer common ac5on classes Evidence based detec5on of irregulari5es

Final Thoughts How can we improve on unsupervised techniques? Semi supervised? Ac5ve Annota5on? Any other thoughts?