Learning Visual Semantics: Models, Massive Computation, and Innovative Applications


Part II: Visual Features and Representations
Liangliang Cao, IBM Watson Research Center

Evolution of Visual Features
From fewer to more parameters:
- Low-level features and spatial histograms
- SIFT and bag-of-words models
- Sparse coding
- Super vector and Fisher vector
- Deep CNN
Three fundamental techniques are used extensively throughout this evolution: 1. histograms, 2. spatial gridding, 3. filters.

Low-Level Features and Spatial Pyramid

Raw Pixels as Features
- Application 1: face recognition
- Application 2: handwritten digits
Concatenate the raw pixels into a 1D vector. Tiny Image [Torralba et al. 2007]: resize an image to a 32x32 color thumbnail, which corresponds to a 3072-dimensional vector (32 x 32 x 3).
Pictures courtesy of the Face Research Lab, Antonio Torralba, and Sam Roweis
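As a hedged sketch of the Tiny Image idea (not the authors' code; the file name and helper are hypothetical), resizing to a 32x32 RGB thumbnail and flattening yields the 3072-dimensional vector mentioned above:

import numpy as np
from PIL import Image

def tiny_image_feature(path):
    # Resize to a 32x32 color thumbnail and flatten: 32 * 32 * 3 = 3072 values.
    thumb = Image.open(path).convert("RGB").resize((32, 32))
    return np.asarray(thumb, dtype=np.float32).reshape(-1)

# feat = tiny_image_feature("example.jpg")  # feat.shape == (3072,)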

From Pixels to Histograms
The color histogram [Swain and Ballard 91] models the distribution of colors in an image. Unlike raw-pixel vectors, histograms are not sensitive to misalignment, scale changes, or global rotation: two shifted or rotated views of the same scene yield similar color histograms.
The same idea extends beyond color:
- Edge histogram
- Shape context histogram
- Local binary patterns (LBP)
- Histogram of gradients
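A minimal sketch of a joint RGB color histogram (assuming an HxWx3 uint8 image; the bin count is illustrative, not the exact feature used in the talk):

import numpy as np

def color_histogram(img, bins=8):
    # Joint R/G/B histogram with bins^3 cells, L1-normalized.
    pixels = img.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                             range=((0, 256), (0, 256), (0, 256)))
    hist = hist.ravel()
    return hist / max(hist.sum(), 1)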

From Histograms to Spatialized Histograms
Problem with global histograms: no spatial information; two very different images can produce exactly the same histogram (example thanks to Erik Learned-Miller).
Remedies: histograms over spatial cells [Ojala et al., PAMI 02] and spatial pyramid matching [Lazebnik et al., CVPR 06], sketched below.
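A minimal sketch of histograms over spatial cells (one level of a spatial pyramid), reusing the color_histogram sketch above; the grid size and the use of color histograms are assumptions:

import numpy as np

def spatial_histogram(img, grid=(2, 2), bins=8):
    # Split the image into grid cells, compute a histogram per cell, concatenate.
    h, w, _ = img.shape
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = img[i * h // grid[0]:(i + 1) * h // grid[0],
                       j * w // grid[1]:(j + 1) * w // grid[1]]
            feats.append(color_histogram(cell, bins))
    return np.concatenate(feats)  # dim = bins^3 * number of cells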

IBM IMARS Spatial Gridding
First place in the 1st and 2nd ImageCLEF Medical Image Classification tasks (http://www.imageclef.org/2012/medical).
Task: determine which modality a medical image belongs to.
- Images from PubMed articles
- 31 categories (X-ray, CT, MRI, ultrasound, etc.)

Image Filters
In addition to histograms, another group of features is based on filters, for example:
1. Haar-like filters (Viola-Jones face detection)
2. Gabor filters (simple cells in the visual cortex can be modeled by Gabor functions)
Filter-based features are widely used in fingerprint, iris, OCR, texture, and face recognition.
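A small sketch of building one Gabor filter with NumPy (parameter values are illustrative); a filter bank is obtained by varying the orientation theta and wavelength lambd, and responses come from convolving the image with each kernel:

import numpy as np

def gabor_kernel(size=21, sigma=4.0, theta=0.0, lambd=10.0, gamma=0.5, psi=0.0):
    # Real part of a Gabor filter: a Gaussian envelope times a sinusoidal carrier.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t ** 2 + (gamma * y_t) ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * x_t / lambd + psi)
    return envelope * carrier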

SIFT Feature and Bag-of-Words Model
[Diagram: classical features before 1999 (raw pixels; histogram features such as color and edge histograms; frequency analysis and image filters; texture features such as LBP; scene features such as GIST; shape descriptors from edge and corner detection) versus SIFT features and beyond (HOG, SURF, DAISY, BRIEF, ORB, FAST, and detectors such as DoG, Harris-Laplace, and Hessian).]

Scale-Invariant Feature Transform (SIFT): the Descriptor
David G. Lowe:
- Distinctive image features from scale-invariant keypoints, IJCV 2004
- Object recognition from local scale-invariant features, ICCV 1999
The SIFT descriptor is a histogram of gradient orientations:
- A histogram is more robust to position than raw pixels
- Edge gradients are more distinctive than color for local patches
- Histograms from spatial cells are concatenated
David Lowe's excellent performance tuning:
- Good parameters: 8 orientation bins, 4 x 4 spatial grid (128 dimensions)
- Soft assignment to spatial bins
- Gaussian weighting over spatial location
- Reduced influence of large gradient magnitudes via thresholding and renormalization
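For reference, OpenCV ships an implementation of this descriptor; a minimal usage sketch (the file name is a placeholder, and cv2.SIFT_create requires OpenCV >= 4.4):

import cv2

gray = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)
# descriptors has shape (num_keypoints, 128): 8 orientation bins x 4x4 spatial cells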

Scale-Invariant Feature Transform (SIFT): the Detector
David G. Lowe:
- Distinctive image features from scale-invariant keypoints, IJCV 2004
- Object recognition from local scale-invariant features, ICCV 1999
The SIFT detector finds maxima and minima of the difference-of-Gaussians in scale space, then post-processes the candidates: keep corner-like points, reject low-contrast and edge points.
For generic object recognition, multiple detectors (e.g., Harris, Hessian) are often combined, or dense sampling is used for better performance. Following SIFT, many works such as SURF, BRIEF, ORB, and BRISK have been proposed for faster local feature extraction.

Histogram of Local Features and Bag-of-Words Models

Histogram of Local Features
Quantize each local descriptor to its nearest codeword and count codeword frequencies; the resulting image vector has dimension = # codewords.
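A sketch of the codebook-plus-histogram pipeline (array names such as all_descriptors are placeholders; the codebook size is illustrative):

import numpy as np
from sklearn.cluster import KMeans

k = 1000  # number of visual words
codebook = KMeans(n_clusters=k, n_init=1).fit(all_descriptors)  # stacked local descriptors

def bow_histogram(image_descriptors):
    # Hard vector quantization followed by codeword counting.
    words = codebook.predict(image_descriptors)
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / max(hist.sum(), 1)  # dim = # codewords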

Histogram of Local Features + Spatial Gridding
Computing one codeword histogram per spatial cell and concatenating them gives dimension = #codewords x #grids.

Bag-of-Words Models

Bag-of-Words Representation
[Figure: an object represented as a bag of visual words, in analogy to bag-of-words representations in text and NLP. Slide credit: Fei-Fei Li]

Topic Models for the Bag-of-Words Representation
- Unsupervised classification: Sivic et al., ICCV 2005
- Supervised classification: Fei-Fei et al., CVPR 2005
- Classification + segmentation: Cao and Fei-Fei, ICCV 2007

Pros and Cons of Bag-of-Words Models
Images differ from texts! Bag-of-words models are good at:
- Modeling prior knowledge
- Providing an intuitive interpretation
But these models suffer from:
- Loss of spatial information
- Loss of information when quantizing visual words
This motivates better coding approaches.

Sparse Coding

Sparse Coding
The naive histogram uses vector quantization (VQ), a hard assignment; sparse coding (SC) provides a soft assignment. Sparse coding solves
  min_a || x - D a ||^2 + lambda * || a ||_1,
the l1 relaxation of the l0 norm, which yields a sparse solution. SC works best with max pooling (whereas traditional VQ uses average pooling); a minimal coding-and-pooling sketch follows.
References: [M. Ranzato et al., CVPR 07], [J. Yang et al., CVPR 09], [J. Wang et al., CVPR 10], [Y. Boureau et al., CVPR 10]
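A minimal coding-and-pooling sketch using scikit-learn's sparse_encode (the dictionary D, its shape (n_codewords, descriptor_dim), and the alpha value are assumptions; this is not the implementation used in the cited papers):

import numpy as np
from sklearn.decomposition import sparse_encode

def sc_max_pooling(descriptors, D, alpha=0.15):
    # l1-regularized codes for every local descriptor, then max pooling over the image.
    codes = sparse_encode(descriptors, D, algorithm="lasso_lars", alpha=alpha)
    return np.abs(codes).max(axis=0)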

Sparse Coding + Spatial Pyramid
Yang et al., Linear Spatial Pyramid Matching Using Sparse Coding for Image Classification, CVPR 2009: sparse coding + spatial pyramid + linear SVM.

An Efficient Approach: Locality-Constrained Linear Coding (LLC)
1. Find the k nearest codewords to the query descriptor
2. Compute the code using only those k neighbors
This is significantly faster than naive sparse coding, e.g., from O(1000 a) to O(5 a) per descriptor for a 1000-word codebook and k = 5. For a further speedup, least-squares regression can replace sparse coding [J. Wang et al., CVPR 10], and the top-k search itself can be accelerated. A Matlab implementation is available at http://www.ifp.illinois.edu/~jyang29/llc.htm; a Python sketch of the coding step follows.
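A sketch of the approximated LLC coding step described above (the values of k and beta and the dense nearest-neighbor search are simplifications; the released Matlab code is the reference implementation):

import numpy as np

def llc_code(x, codebook, k=5, beta=1e-4):
    # 1. Find the k nearest codewords to the descriptor x.
    idx = np.argsort(((codebook - x) ** 2).sum(axis=1))[:k]
    B = codebook[idx]                      # local base, shape (k, d)
    # 2. Solve the small regularized least-squares system on those neighbors.
    z = B - x
    C = z @ z.T + beta * np.eye(k)
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                           # sum-to-one constraint
    code = np.zeros(len(codebook))
    code[idx] = w
    return code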

Sparse Codes Are Not Necessarily Sparse
Hard quantization gives the sparsest possible code (a single nonzero entry), while sparse coding is less sparse. Moreover, the image-level representation is not sparse at all after pooling. So is the success of sparse coding really due to sparsity?

Fisher Vector and Super Vector

Information Loss
Both VQ and sparse coding lose information: for each codeword, they summarize a descriptor's relationship to the codebook by a single scalar coefficient. A lossless coding would instead retain a function of the descriptor around each codeword. That is the significant difference: SC or VQ keeps a scalar per codeword, lossless coding keeps a function.

Lossless Coding as a Mixture of Experts
Treat each codeword as a local expert; a gating function (e.g., GMM, sparse GMM, harmonic k-means) decides how much each expert contributes to a given descriptor.

Pooling Towards an Image-Level Representation
For each mixture component, the codes of all local descriptors are pooled, and the per-component results are normalized and concatenated. Both the Fisher vector and the super vector can be written in this form (with different subtraction, normalization, and scaling factors).
Related references: Fisher vector [Perronnin et al., ECCV 10]; super vector [X. Zhou, K. Yu, T. Zhang et al., ECCV 10]; HG [X. Zhou et al., ECCV 09]

Pooling Towards an Image-Level Representation (continued)
This is a big model: the dimension becomes C (#components) x d (#feature dims). For example, with C = 1000 and d = 128 the final dimension is 128K, more than 100 times longer than an SC or VQ histogram. A simplified Fisher-vector pooling sketch follows.
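A simplified Fisher-vector pooling sketch covering only the mean-gradient terms (it assumes a GaussianMixture fitted with covariance_type='diag' and omits the second-order terms and the power/L2 normalization used in the cited papers):

import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector_means(descriptors, gmm: GaussianMixture):
    # Posteriors act as the gating function; pooling sums the weighted, whitened residuals.
    gamma = gmm.predict_proba(descriptors)               # (N, C)
    diff = descriptors[:, None, :] - gmm.means_[None]    # (N, C, d)
    diff /= np.sqrt(gmm.covariances_)[None]              # diagonal covariances, shape (C, d)
    fv = (gamma[:, :, None] * diff).sum(axis=0)          # pool over local descriptors
    fv /= descriptors.shape[0] * np.sqrt(gmm.weights_)[:, None]
    return fv.ravel()                                    # dimension = C x d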

A Very Long Vector as the Image Representation
We can generate very long image feature vectors as discussed above. The strong feature we used for ImageNet LSVRC 2010:
- Dense sampling of LBP + HOG, feature dimension 100 (after PCA)
- GMM with 1024 components
- 4 spatial grids (1 + 3x1)
- Dimension of the image feature: 100 x 1024 x 4 = 0.41M

How to solve big models?

For Small Datasets: Use the Kernel Trick
With 10K images, the kernel matrix is 10K x 10K, about 100M entries. The computational complexity then depends on the number of training samples (the size of the kernel matrix) rather than on the feature dimension, and here 10K is far smaller than the feature dimension. We tried nonlinear kernels for face verification and obtained good results on the LFW dataset: Learning Locally-Adaptive Decision Functions for Person Verification, CVPR 13 (with Z. Li, S. Chang, F. Liang, T. Huang, and J. Smith).
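An illustrative sketch of the kernel-matrix argument (a generic RBF kernel SVM with scikit-learn, not the locally-adaptive decision functions of the cited CVPR 13 paper; features, labels, and gamma are placeholders):

from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def train_kernel_svm(features, labels, gamma=1e-3):
    # The solver only sees the N x N kernel matrix, not the (possibly huge) feature dimension.
    K = rbf_kernel(features, gamma=gamma)
    clf = SVC(kernel="precomputed").fit(K, labels)
    return clf  # at test time, pass rbf_kernel(X_test, features) to clf.predict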

For Large Datasets: Use Stochastic Gradient Descent
Suppose we work on ImageNet with 0.4M-dimensional feature vectors. The total training data is about 1.2M x 0.4M ~ 0.5T real values: too big to load into memory, and too many samples for kernel tricks.
Solution: stochastic gradient descent (SGD). Idea: estimate the gradient on a single randomly picked sample, instead of the full gradient over all samples as in batch gradient descent.

SGD Can Be Very Simple to Implement
The original slide showed a roughly 10-line binary SVM solver by Shai Shalev-Shwartz (Pegasos), whose key ingredient is a decreasing learning rate.
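The Matlab code on the original slide did not survive transcription; below is a minimal Python sketch of the same Pegasos-style idea (labels in {-1, +1}; hyperparameters are illustrative), with the decreasing learning rate 1/(lambda * t):

import numpy as np

def pegasos_svm(X, y, lam=1e-4, iterations=100_000, seed=0):
    # Stochastic (sub)gradient descent on the regularized hinge loss.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, iterations + 1):
        i = rng.integers(len(y))           # one randomly picked sample
        eta = 1.0 / (lam * t)              # decreasing learning rate
        if y[i] * (X[i] @ w) < 1:          # margin violated: hinge gradient contributes
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:
            w = (1 - eta * lam) * w        # only the regularizer contributes
    return w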

Deep CNN and Related Techniques

Deep CNN: A Bigger Model
Motivated by the studies of [Krizhevsky et al., NIPS 12] and [Y. LeCun et al., Proc. IEEE 98], the deep convolutional neural network (CNN) has become the newest winner of the ImageNet competition. The most popular CNN has:
- 5 convolutional layers that learn filters
- 2 fully connected layers
- 60 million parameters
- Training by stochastic gradient descent (again)
Why can we train such a big model now, and not in the 1990s?
- The rise of big datasets (ImageNet)
- The blessing of GPU computing
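As a quick sanity check of the parameter count (using torchvision's later re-implementation of this architecture, not the original training code):

import torchvision

model = torchvision.models.alexnet()
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")  # roughly 61M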

Deep Learning Demo http://smith-gpu.pok.ibm.com:8080/

Learning Representations from Big Data
Computer vision researchers have seen a big performance jump on large-scale datasets such as ImageNet. Even earlier, researchers in speech and acoustics saw similar successes in LVCSR and related tasks. Text/NLP researchers are also moving quickly to large-scale learning; for example, the IBM Watson system used thousands of sub-systems to beat the human players in the Jeopardy! game.
Watson is hiring! www.ibm.com/watsonjobs. In particular, we are looking for winter interns to work on vision + NLP problems; contact zhou@us.ibm.com.

Conclusion

Conclusion
Big data and big models have evolved together:
- Models (bigger and bigger): histogram -> sparse coding (10K parameters) -> super vector / Fisher vector (0.4M parameters) -> deep CNN (60M parameters)
- Datasets (bigger and bigger): small (e.g., Caltech101, ~8K images) -> medium (e.g., PASCAL, 10K+ images) -> large (e.g., ImageNet, 1.2M images)
Motivating questions:
- How do we develop scalable solutions for big data?
- How do we deal with situations with limited labeled data?
Please see the following talks for the answers!