
Learning-based Methods in Vision 16-824 Sparsity and Deep Learning

Motivation. A multitude of hand-designed features is currently in use in vision: SIFT, HoG, LBP, MSER, etc. Yet even the best approaches capture just low-level edge gradients [ Felzenszwalb, Girshick, McAllester and Ramanan, PAMI 2007 ] [ Yan & Huang ] (winner of the PASCAL 2010 classification competition). Can we learn the features? Slide adopted from Rob Fergus

Visual cortex, bottom-up/top-down. V1: primary visual cortex, with simple cells and complex cells. [ Scientific American, 1999 ] Slide adopted from Ying Nian Wu

Simple V1 cells [ Daugman, 1985 ]. Gabor wavelets: localized sine and cosine waves. V1 simple cells take a local sum over image pixels and respond to edges. Slide adopted from Ying Nian Wu

Complex V1 cells [ Riesenhuber and Poggio, 1999 ]. V1 complex cells take a local max over the V1 simple cells, which in turn take a local sum over image pixels and respond to edges. Slide adopted from Ying Nian Wu

Single Layer Architecture: Input (Image Pixels / Features) → Filter → Normalize → Pool → Output (Features / Classifier). Slide from Rob Fergus

Single Layer Architecture, alternative ordering of the stages: Input (Image Pixels / Features) → Filter → Pool → Normalize → Output (Features / Classifier). Slide from Rob Fergus
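
To make the pipeline concrete, here is a minimal sketch of the Filter → Normalize → Pool variant in plain NumPy. Everything in it (random filters, rectification as the non-linearity, L2 normalization, 2x2 max pooling) is an illustrative assumption, not the specific design of any of the cited systems.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2-D cross-correlation of one image with one kernel."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def single_layer(image, filters, pool=2):
    maps = []
    for f in filters:
        r = np.maximum(conv2d_valid(image, f), 0.0)   # filter + non-linearity (assumed: ReLU)
        r = r / (np.linalg.norm(r) + 1e-8)            # normalize (assumed: L2)
        h, w = (r.shape[0] // pool) * pool, (r.shape[1] // pool) * pool
        r = r[:h, :w].reshape(h // pool, pool, w // pool, pool).max(axis=(1, 3))  # max pool
        maps.append(r)
    return np.stack(maps)   # output features: input to a classifier or the next layer

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))
filters = rng.standard_normal((8, 7, 7))   # 8 random 7x7 filters as stand-ins
features = single_layer(image, filters)
print(features.shape)                      # (8, 13, 13)
```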

SIFT Descriptor: Image Pixels → Apply Gabor filters → Spatial pool (Sum) → Normalize to unit length → Feature Vector. Slide from Rob Fergus

Feature Learning Architecture: Pixels / Features → Filter with Dictionary (patch / tiled / convolutional) + Non-linearity → Normalization between feature responses ((Group) Sparsity; Max / Softmax; Local Contrast Normalization, Subtractive / Divisive) → Spatial/Feature pooling (Sum or Max) → Features. Slide from Rob Fergus

Spatial Pyramid Matching [ Lazebnik, Schmid, Ponce, CVPR 2006 ]: SIFT Features → Filter with Visual Words → Max → Multi-scale spatial pool (Sum) → Classifier. Slide from Rob Fergus
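
As a rough illustration of the multi-scale pooling step, the sketch below histograms hard visual-word assignments over 1x1, 2x2, and 4x4 grids and concatenates the results; the grid levels and the omission of the pyramid-match kernel weights are my simplifications of Lazebnik et al.

```python
import numpy as np

def spatial_pyramid(words, p, levels=(1, 2, 4)):
    """words: 2-D array of visual-word indices in [0, p); returns the
    concatenated per-cell histograms over successively finer grids."""
    H, W = words.shape
    feats = []
    for g in levels:
        for i in range(g):
            for j in range(g):
                cell = words[i*H//g:(i+1)*H//g, j*W//g:(j+1)*W//g]
                feats.append(np.bincount(cell.ravel(), minlength=p))
    return np.concatenate(feats)

rng = np.random.default_rng(0)
assignments = rng.integers(0, 50, size=(64, 64))   # toy word map, p = 50
print(spatial_pyramid(assignments, 50).shape)      # (50 * (1 + 4 + 16),) = (1050,)
```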

Role of Normalization. Lots of different mechanisms (e.g., max, sparsity, local contrast normalization). All induce local competition between features to explain the input (the "explaining away" property). (Figure: convolution with filters vs. convolutional sparse coding.) Zeiler et al. [CVPR 10 / ICCV 11], Kavukcuoglu et al. [NIPS 10], Yang et al. [CVPR 10]. Slide from Rob Fergus

Role of Pooling. Spatial pooling: - Invariance to small transformations (e.g., shifts) - Larger receptive fields. Pooling across features: - Gives and/or behavior (grammar) - Compositionality [ Zeiler, Taylor, Fergus, ICCV 2011 ]. Pooling with latent variables/springs [ Felzenszwalb, Girshick, McAllester, Ramanan, PAMI 2009 ] [ Chen, Zhu, Lin, Yuille, Zhang, NIPS 2007 ]. Slide from Rob Fergus

Image Restoration [ Mairal, Bach, Ponce, Sapiro, Zisserman, ICCV 2009 ]: Image Pixels → Filter with Dictionary (patch) → Sparsity → Spatial pool (Sum) → Feature Vector

Sparse Representation for Image Restoration: $y$ (observed image) $= x_{\text{orig}}$ (true image) $+\, w$ (noise). This can be cast as an energy minimization problem: $E(x) = \tfrac{1}{2}\|y - x\|_2^2 + E_{\text{prior}}(x)$, where the first term measures reconstruction of the observed image and the second is the image prior ($-\log$ prior). Or probabilistically: $p(y, x) = p(y \mid x)\, p(x)$, likelihood times prior. Classical priors: smoothness $\|Lx\|_2^2$; total variation $\|\nabla x\|_1$. Slide adopted from Julien Mairal
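
As a worked toy version of this energy view, the sketch below denoises by gradient descent on $E(x) = \tfrac{1}{2}\|y - x\|_2^2 + \lambda\|\nabla x\|_2^2$ with the quadratic smoothness prior; the discrete Laplacian, step size, and periodic boundaries are my assumptions.

```python
import numpy as np

def denoise(y, lam=1.0, steps=200, lr=0.1):
    """Gradient descent on E(x) = 1/2 ||y - x||^2 + lam * ||grad x||^2."""
    x = y.copy()
    for _ in range(steps):
        lap = (-4 * x
               + np.roll(x, 1, 0) + np.roll(x, -1, 0)
               + np.roll(x, 1, 1) + np.roll(x, -1, 1))   # discrete Laplacian
        g = (x - y) - 2 * lam * lap   # grad of data term + grad of smoothness prior
        x -= lr * g
    return x

rng = np.random.default_rng(0)
clean = np.outer(np.sin(np.linspace(0, 3, 64)), np.cos(np.linspace(0, 3, 64)))
noisy = clean + 0.2 * rng.standard_normal(clean.shape)
x = denoise(noisy, lam=0.5)
print(np.mean((noisy - clean) ** 2), ">", np.mean((x - clean) ** 2))
```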

Sparse Linear Model. Let $x \in \mathbb{R}^m$ be an image (or a signal), and let $D = [d_1, \ldots, d_p] \in \mathbb{R}^{m \times p}$ be a set of normalized basis vectors (the dictionary). We can represent $x$ with a few basis vectors, i.e., there exists a sparse vector $\alpha \in \mathbb{R}^p$ (the sparse code) such that $x \approx D\alpha$. Slide adopted from Julien Mairal
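
In code, the notation amounts to nothing more than a matrix-vector product with a mostly zero coefficient vector; this tiny sketch (all dimensions chosen arbitrarily) just instantiates it.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, k = 64, 256, 5                    # signal dim, dictionary size, sparsity
D = rng.standard_normal((m, p))
D /= np.linalg.norm(D, axis=0)          # normalized basis vectors (columns)
alpha = np.zeros(p)
alpha[rng.choice(p, k, replace=False)] = rng.standard_normal(k)  # sparse code
x = D @ alpha                           # signal as a combination of k atoms
print(np.count_nonzero(alpha), "of", p, "coefficients are non-zero")
```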

Why sparsity? A dictionary can be good for representing a class of signals, but we don't want to reconstruct noise! Any given patch looks like part of an image, and a sum of a few patches is likely to produce a reasonable patch from an image, whereas a sum of many patches can match almost anything, noise included.

Lateral Inhibition Visual neurons respond less if they are activated at the same time than if one is activated alone. So the fewer neighboring neurons stimulated, the more strongly a neuron responds. Images from Ying Nian Wu

Sparse Representation for Image Restoration. Hand-designed dictionaries: Wavelets, Curvelets, Wedgelets, Bandlets, ... [Haar, 1910], [Zweig, Morlet, Grossman ~70s], [Meyer, Mallat, Daubechies, Coifman, Donoho, Candes ~80s-today] (see [Mallat, 1999]). Learned dictionaries of patches: [Olshausen and Field, 1997], [Engan et al., 1999], [Lewicki and Sejnowski, 2000], [Aharon et al., 2006], [Roth and Black, 2005], [Lee et al., 2007], obtained by solving $\min_{\alpha_i, D} \sum_{i=1}^{N} \tfrac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda\|\alpha_i\|_1$ (reconstruction + sparsity); the L1 norm induces sparsity. Slide from Julien Mairal

Optimization for Dictionary Learning: $\min_{\alpha_i, D} \sum_{i=1}^{N} \tfrac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda\|\alpha_i\|_1$. Classical optimization does this in EM style, alternating between learning the dictionary and the sparse codes. Good results, but slow; [ Mairal et al., 2009a ] proposes online learning. The denoised image is then assembled from the reconstructed patches: $I_{\text{denoised}} = \frac{1}{M}\sum_{i=1}^{N} R_i D \alpha_i$. Slide adopted from Julien Mairal
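
A rough sketch of the alternating scheme follows: with $D$ fixed, the sparse codes are found by ISTA (iterative soft-thresholding), and with the codes fixed, $D$ takes a gradient step and is renormalized. This is my minimal stand-in, not the online algorithm of Mairal et al.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(X, D, lam, iters=50):
    """Sparse codes A minimizing 1/2 ||X - DA||^2 + lam * ||A||_1 (column-wise)."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the gradient
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(iters):
        A = soft_threshold(A - (D.T @ (D @ A - X)) / L, lam / L)
    return A

def dict_learn(X, p, lam=0.1, epochs=20, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], p))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(epochs):
        A = ista(X, D, lam)                    # E-like step: sparse codes
        D -= lr * (D @ A - X) @ A.T            # M-like step: dictionary gradient
        D /= np.linalg.norm(D, axis=0) + 1e-8  # keep atoms normalized
    return D, A
```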

Results Slide adopted from Julien Mairal

Image Classification (Bag-of-Features): SIFT Features → Filter with Visual Words → Max → Spatial pool (Sum) → Classifier

Learning Codebooks for Image Classification. An image is represented by a set of low-level (SIFT) descriptors $x_i$ at $N$ locations, identified by their index $i$. Hard quantization (with $p$ visual words): $x_i \approx D\alpha_i$, $\alpha_i \in \{0,1\}^p$, $\sum_{j=1}^{p} \alpha_i[j] = 1$. Soft quantization: $\alpha_i[j] = \mathcal{N}(x_i; d_j, \sigma^2) / \sum_{k=1}^{p} \mathcal{N}(x_i; d_k, \sigma^2)$. Sparse coding: $\min_{\alpha_i, D} \sum_{i=1}^{N} \tfrac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda\|\alpha_i\|_1$ (reconstruction + sparsity). Slide adopted from Julien Mairal
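
The three encodings differ only in how the code $\alpha_i$ is computed from the same dictionary; a small sketch (Gaussian width $\sigma$ and all dimensions assumed) is below. The sparse-coding case can reuse the `ista` function from the dictionary-learning sketch above.

```python
import numpy as np

def hard_quantize(x, D):
    """One-hot code for the nearest visual word (column of D)."""
    a = np.zeros(D.shape[1])
    a[np.argmin(np.linalg.norm(D - x[:, None], axis=0))] = 1.0
    return a

def soft_quantize(x, D, sigma=1.0):
    """Normalized Gaussian affinities to all p visual words."""
    d2 = np.sum((D - x[:, None]) ** 2, axis=0)
    w = np.exp(-d2 / (2 * sigma ** 2))
    return w / w.sum()

# Sparse coding: a_i = ista(x[:, None], D, lam).ravel()
```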

Discriminative Learning of Dictionaries [ Mairal, Bach, Ponce, Sapiro, Zisserman, CVPR 2008 ]. The same objective, $\min_{\alpha_i, D} \sum_{i=1}^{N} \tfrac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda\|\alpha_i\|_1$ (reconstruction + sparsity), is minimized separately for the positive class and for the negative class, yielding one dictionary per class.
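
One plain way to use class-specific dictionaries at test time (a simplification of the CVPR 2008 formulation, reusing the `ista` sketch above) is to encode $x$ with each class's dictionary and pick the class with the lowest reconstruction-plus-sparsity energy:

```python
import numpy as np
# Reuses ista(...) from the dictionary-learning sketch above.

def classify(x, dicts, lam=0.1):
    """Pick the class whose dictionary gives the lowest coding energy."""
    energies = []
    for D in dicts:                           # one learned dictionary per class
        a = ista(x[:, None], D, lam).ravel()  # sparse-code x against this class
        e = 0.5 * np.sum((x - D @ a) ** 2) + lam * np.abs(a).sum()
        energies.append(e)
    return int(np.argmin(energies))
```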

Learning Codebooks for Image Classification [ Mairal, Bach, Ponce, Sapiro, Zisserman, CVPR 2008 ] Slide adopted from Julien Mairal

Visual cortex. V1: primary visual cortex, with simple cells and complex cells. What is beyond V1? A hierarchical model. [ Scientific American, 1999 ] Slide adopted from Ying Nian Wu

Mid-level features, beyond edges: continuation, parallelism, junctions, corners. High-level: object parts, objects, scenes??? Slide adopted from Rob Fergus

Challenges Grouping mechanism - Want edge structures to group into more complex forms - But it is hard to define explicit rules Invariance to local distortions - Under distortions, corners, T-junctions, parallel lines, etc. can look quite different Slide adopted from Rob Fergus

Deep Feature Learning. Build a hierarchy of feature extractors (layers): - All the way from pixels to classifier - Homogeneous (simple) structure for all layers - Unsupervised training. Pipeline: Image/Video Pixels → Layer 1 → Layer 2 → Layer 3 → Simple Classifier. Numerous approaches: Restricted Boltzmann Machines [Hinton, Ng, Bengio, ...], sparse coding [Yu, Fergus, LeCun], auto-encoders [LeCun, Bengio], ICA variants [Ng, Cottrell], and many more. Slide from Rob Fergus

Hierarchical Vision Models [Jin & Geman, CVPR 2006]. Levels, from high to low: e.g. animals, trees, rocks; e.g. contours, intermediate objects; e.g. linelets, curvelets, T-junctions; e.g. discontinuities, gradient. An "animal head" part is instantiated by, e.g., a bear head. Slide adopted from Rob Fergus

Single Layer Convolutional Architecture: Input (Image Pixels / Features) → Filter → Normalize → Pool → Output (Features / Classifier). Slide from Rob Fergus

Single Deconvolutional Layer: the convolutional form of sparse coding. Slide from Rob Fergus

Toy Example (figure: feature maps and filters). Slide from Rob Fergus

Reversible Max Pooling. Pooling maps a Feature Map to Pooled Feature Maps together with the Max Locations ("switches"); Unpooling uses the switches to put each pooled value back at its original location, producing a Reconstructed Feature Map. Slide from Rob Fergus
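
A minimal sketch of the pooling/unpooling pair, with the "switches" stored as per-block argmax indices (the 2x2 block size and layout are my assumptions):

```python
import numpy as np

def pool_with_switches(fmap, s=2):
    """Max pool an (H, W) map, recording where each max came from."""
    h, w = fmap.shape[0] // s, fmap.shape[1] // s
    blocks = fmap[:h*s, :w*s].reshape(h, s, w, s).transpose(0, 2, 1, 3).reshape(h, w, s*s)
    switches = blocks.argmax(axis=2)            # location of each max within its block
    return blocks.max(axis=2), switches

def unpool(pooled, switches, s=2):
    """Scatter each pooled value back to its recorded location; zeros elsewhere."""
    h, w = pooled.shape
    out = np.zeros((h, w, s * s))
    idx = np.indices((h, w))
    out[idx[0], idx[1], switches] = pooled
    return out.reshape(h, w, s, s).transpose(0, 2, 1, 3).reshape(h * s, w * s)

fm = np.arange(16.0).reshape(4, 4)
pooled, sw = pool_with_switches(fm)
print(unpool(pooled, sw))   # maxima restored in place, zeros elsewhere
```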

Overall Architecture (1 layer) Slide from Rob Fergus

Toy Example (figure: pooled maps, feature maps, filters). Slide from Rob Fergus

Overall Architecture (2 Layers) Slide from Rob Fergus

Model Parameters 7x7 filters at all layers Slide from Rob Fergus

Layer 1 Filters 15 filters/feature maps, showing max for each map Slide from Rob Fergus

Layer 2 Filters 50 filters/feature maps, showing max for each map projected down to image Slide from Rob Fergus

Layer 3 Filters 100 filters/feature maps, showing max for each map Slide from Rob Fergus

Layer 4 Filters 150 in total; receptive field is entire image Slide from Rob Fergus

Relative Size of Receptive Fields (to scale) Slide from Rob Fergus

Restricted Boltzmann Machines. Units $v_i$ are binary (0/1). A unit is activated based on a linear combination of the other units plus a bias, through the logistic function: $p(v_i = 1 \mid \{v_j\}, j \neq i) = 1 / (1 + \exp(-b_i - \sum_j W_{ij} v_j))$.

Restricted Boltzmann Machines. Units $v_i$ are binary (0/1), and $p(v) = \exp(-E(v)) / \sum_{v'} \exp(-E(v'))$ with energy $E(v) = -\sum_i b_i v_i - \sum_{i \neq j} W_{ij} v_i v_j$; a more probable configuration has lower energy. Learning amounts to estimating the parameters $\theta = \{b_i, W_{ij}\}$ of the model by Maximum Likelihood.
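
For a toy number of units, $p(v)$ can be computed exactly by enumerating all $2^n$ configurations, which makes "lower energy = more probable" concrete; this brute-force sketch (symmetric $W$ with zero diagonal assumed) is only feasible at toy scale.

```python
import itertools
import numpy as np

def prob(v, W, b):
    """Exact p(v) for a tiny Boltzmann machine by brute-force enumeration."""
    def E(u):
        u = np.asarray(u, dtype=float)
        return -b @ u - u @ W @ u   # = -sum_i b_i u_i - sum_{i!=j} W_ij u_i u_j
    Z = sum(np.exp(-E(u)) for u in itertools.product([0, 1], repeat=len(b)))
    return np.exp(-E(v)) / Z

W = np.array([[0.0, 0.5], [0.5, 0.0]])   # symmetric, zero diagonal
b = np.array([-0.2, 0.1])
print(prob([1, 1], W, b))
```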

Maximum Likelihood Learning. Typically we assume independence of the $N$ samples: $L(\theta; x_1, x_2, \ldots, x_N) = \prod_{i=1}^{N} p(x_i)$. Take the log (which turns the product into a sum) and do gradient-based optimization with respect to the parameters. For a Boltzmann machine, with $p(v) = \exp(-E(v)) / \sum_{v'} \exp(-E(v'))$, this comes down to optimizing a sum of energy functions minus the normalizing constant.
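
The gradient quoted on the next slide follows in two lines; a sketch of the derivation:

```latex
\log p(v) = -E(v) - \log \sum_{v'} \exp(-E(v'))
\quad\Rightarrow\quad
\frac{\partial \log p(v)}{\partial W_{ij}}
  = -\frac{\partial E(v)}{\partial W_{ij}}
    + \sum_{v'} p(v')\, \frac{\partial E(v')}{\partial W_{ij}}
  = v_i v_j - \langle v_i v_j \rangle_{\text{model}}
```

Averaging the first term over the training data gives $\Delta W_{ij} \propto \langle v_i v_j \rangle_{\text{data}} - \langle v_i v_j \rangle_{\text{model}}$.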

Boltzmann Machine Learning. So, essentially, we iterate: $W_{ij}^{(t)} = W_{ij}^{(t-1)} + \Delta W_{ij}^{(t-1)}$ and $b_i^{(t)} = b_i^{(t-1)} + \Delta b_i^{(t-1)}$. Where do we get the gradients? $\Delta W_{ij} \propto \langle v_i v_j \rangle_{\text{data}} - \langle v_i v_j \rangle_{\text{model}}$ and $\Delta b_i \propto \langle v_i \rangle_{\text{data}} - \langle v_i \rangle_{\text{model}}$. The data expectations are easy: just look at the data. The model expectations require samples from the model (a Gibbs sampler run to equilibrium). By the law of large numbers, both are approximated with samples: $\Delta W_{ij} = \frac{1}{D}\sum_{d=1}^{D} v_i^{(d)} v_j^{(d)} - \frac{1}{M}\sum_{m=1}^{M} \tilde{v}_i^{(m)} \tilde{v}_j^{(m)}$.

Restricted Boltzmann Machines. Units are binary (0/1), with hidden units $h_j$ connected to visible units $v_i$ through weights $W_{ij}$. $p(v, h) = \exp(-E(v, h)) / Z$ with $E(v, h) = -\sum_{ij} W_{ij} v_i h_j - \sum_i a_i v_i - \sum_j b_j h_j$. The conditionals factorize: $p(h_j = 1 \mid v) = 1 / (1 + \exp(-b_j - \sum_i W_{ij} v_i))$ and $p(v_i = 1 \mid h) = 1 / (1 + \exp(-a_i - \sum_j W_{ij} h_j))$.
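
These two conditionals are all one needs for block Gibbs sampling. As a sketch of how the RBM is trained in practice, the snippet below uses one step of contrastive divergence (CD-1, Hinton's approximation) in place of the equilibrium Gibbs sampler that the exact ML gradient requires; the learning rate, batching, and use of probabilities in the statistics are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd1_step(V, W, a, b, lr=0.1):
    """One CD-1 update. V: batch of binary visible vectors, shape (n, num_visible)."""
    ph = sigmoid(V @ W + b)                        # p(h=1 | v) on the data
    h = (rng.random(ph.shape) < ph).astype(float)  # sample hidden units
    pv = sigmoid(h @ W.T + a)                      # p(v=1 | h): one-step reconstruction
    ph2 = sigmoid(pv @ W + b)                      # hidden probs for the "model" term
    W += lr * (V.T @ ph - pv.T @ ph2) / len(V)     # <v h>_data - <v h>_model (approx.)
    a += lr * (V - pv).mean(axis=0)
    b += lr * (ph - ph2).mean(axis=0)
```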

Restricted Boltzmann Machines. Binary units are still not very realistic, because most real-world data is continuous. With Gaussian visible units: $p(v_i \mid h) = \mathcal{N}(a_i + \sum_j W_{ij} h_j, \sigma^2)$.

Auto-encoders [ Hinton and Salakhutdinov, Science 06 ]. Encoder: 28x28 patches → 1000 neurons → 500 neurons → 250 neurons → 30; the decoder mirrors it: 30 → 250 → 500 → 1000 → 28x28. We train the auto-encoder to reproduce its input vector as its output. This forces it to compress as much information as possible into the 30 numbers in the central bottleneck. These 30 numbers are then a good way to visualize data and do classification.
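
A minimal sketch of the idea, training a much smaller autoencoder (one tanh encoder layer, a linear decoder, an 8-dimensional bottleneck, plain gradient descent, and random data standing in for patches) rather than the deep 1000-500-250-30 network of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((256, 64))                 # stand-in for 8x8 patches, values in [0, 1]

d_in, d_code = X.shape[1], 8              # bottleneck of 8 numbers (assumed)
W1 = rng.standard_normal((d_in, d_code)) * 0.1
W2 = rng.standard_normal((d_code, d_in)) * 0.1
b1, b2 = np.zeros(d_code), np.zeros(d_in)

lr = 0.05
for step in range(2000):
    H = np.tanh(X @ W1 + b1)              # encoder -> bottleneck code
    Y = H @ W2 + b2                       # linear decoder -> reconstruction
    err = Y - X                           # d(MSE)/dY up to a constant
    gW2 = H.T @ err / len(X)              # backprop through the decoder
    gH = err @ W2.T * (1 - H ** 2)        # tanh' = 1 - tanh^2
    gW1 = X.T @ gH / len(X)               # backprop through the encoder
    W2 -= lr * gW2; b2 -= lr * err.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gH.mean(axis=0)
print("reconstruction MSE:", float((err ** 2).mean()))
```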

Learning a Compositional Hierarchy of Object Structure [ Fidler & Leonardis, CVPR 07; Fidler, Boben & Leonardis, CVPR 08 ]: the parts model, the architecture, and the learned parts.

Learning a Compositional Hierarchy of Object Structure [ Fidler & Leonardis, CVPR 07; Fidler, Boben & Leonardis, CVPR 08 ]: learned compositions at Layer 2 and Layer 3.

Conclusions. An interesting paradigm, where the algorithm tries to learn everything. Right? - Patch size (8x8 or 20x20?) and learning parameters still have to be chosen by hand - Typically needs lots and lots of data - Higher levels mostly work on PASCAL or other simple datasets - It is hard to train multi-layer architectures! - Since learning is in effect unsupervised, it is difficult to debug or see what is going on.