Learning Visual Semantics: Models, Massive Computation, and Innovative Applications
Part II: Visual Features and Representations
Liangliang Cao, IBM Watson Research Center
Evolution of Visual Features
From fewer parameters to more parameters:
- Low-level features and (spatial) histograms
- SIFT and bag-of-words models
- Sparse coding
- Super vector and Fisher vector
- Deep CNN
Three fundamental techniques have been used extensively throughout:
1. histogram
2. spatial gridding
3. filter
Low-Level Features and Spatial Pyramid
Raw Pixels as Feature
Concatenate the raw pixels into a 1D vector.
- Application 1: Face recognition
- Application 2: Handwritten digits
- Tiny Image [Torralba et al 2007]: resize an image to a 32x32 color thumbnail, which corresponds to a 3072-dimensional vector (32 x 32 x 3)
Pictures courtesy of Face Research Lab, Antonio Torralba and Sam Roweis
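The raw-pixel feature can be sketched in a few lines. This is a minimal illustration, not the Tiny Image authors' code: the function name `tiny_image_vector` is hypothetical, and the resize is a plain block average that assumes the image is at least 32 pixels on each side.

```python
import numpy as np

def tiny_image_vector(image, size=32):
    """Downsample an H x W x 3 image to size x size by block averaging and flatten it."""
    h, w, c = image.shape
    # Crop so that height and width are multiples of `size` (an assumption
    # made here so that the resize is a plain block average).
    bh, bw = h // size, w // size
    cropped = image[:bh * size, :bw * size].astype(np.float64)
    blocks = cropped.reshape(size, bh, size, bw, c)
    thumb = blocks.mean(axis=(1, 3))          # size x size x 3 thumbnail
    return thumb.reshape(-1)                  # concatenate pixels into a 1D vector

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(128, 96, 3))   # random stand-in for a photo
vec = tiny_image_vector(image)
print(vec.shape)  # (3072,)
```

The 3072 entries are simply 32 x 32 pixels times 3 color channels.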
From Pixels to Histograms
The color histogram [Swain and Ballard 91] was proposed to model the distribution of colors (r, g, b) in an image.
Unlike raw-pixel vectors, histograms are not sensitive to:
- misalignment
- scale transform
- global rotation
Images that differ in these ways yield similar color histogram features.
The color histogram can be extended to:
- Edge histogram
- Shape context histogram
- Local binary patterns (LBP)
- Histogram of gradients
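A minimal sketch of a joint RGB color histogram, assuming 8 bins per channel (a choice made here for illustration, not taken from the slides). The usage lines check the invariance claim for a 90-degree global rotation, which only permutes the pixels.

```python
import numpy as np

def color_histogram(image, bins_per_channel=8):
    """Joint RGB histogram, L1-normalized; image is H x W x 3 with values in [0, 255]."""
    pixels = image.reshape(-1, 3).astype(np.int64)
    # Quantize each channel into bins_per_channel levels.
    q = pixels * bins_per_channel // 256
    # Map each (r, g, b) bin triple to a single bin index.
    idx = (q[:, 0] * bins_per_channel + q[:, 1]) * bins_per_channel + q[:, 2]
    hist = np.bincount(idx, minlength=bins_per_channel ** 3).astype(np.float64)
    return hist / hist.sum()

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 48, 3))
rotated = np.rot90(image, k=1, axes=(0, 1))   # 90-degree global rotation
print(np.allclose(color_histogram(image), color_histogram(rotated)))  # True
```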
From Histogram to Spatialized Histogram
Problem of histograms: no spatial information! Two images with different layouts can have the same histogram. (Example thanks to Erik Learned-Miller)
Remedies:
- Histograms of spatial cells
- Spatial pyramid matching [Lazebnik et al, CVPR 06]
See also: Ojala et al, PAMI 02
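The "same histogram" problem and the spatial-cell remedy can be illustrated with a small sketch. It concatenates intensity histograms over 1x1, 2x2, and 4x4 grids; the per-level weights of spatial pyramid matching are omitted, and the grid sizes and bin count are assumptions chosen for the example.

```python
import numpy as np

def cell_histogram(cell, bins):
    """Intensity histogram of one cell, for values in [0, 256)."""
    hist, _ = np.histogram(cell, bins=bins, range=(0, 256))
    return hist.astype(np.float64)

def spatial_pyramid_histogram(gray, bins=16, levels=(1, 2, 4)):
    """Concatenate intensity histograms over 1x1, 2x2, and 4x4 grids of cells."""
    h, w = gray.shape
    feats = []
    for g in levels:
        rows = np.linspace(0, h, g + 1).astype(int)
        cols = np.linspace(0, w, g + 1).astype(int)
        for i in range(g):
            for j in range(g):
                cell = gray[rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
                feats.append(cell_histogram(cell, bins))
    feat = np.concatenate(feats)
    return feat / feat.sum()

# Two images with identical global histograms but different layouts.
a = np.zeros((32, 32))
a[:, :16] = 255                      # bright left half
b = np.zeros((32, 32))
b[:16, :] = 255                      # bright top half
same_global = np.array_equal(cell_histogram(a, 16), cell_histogram(b, 16))
same_pyramid = np.allclose(spatial_pyramid_histogram(a), spatial_pyramid_histogram(b))
print(same_global, same_pyramid)  # True False
```

The feature length grows with the number of cells: here 16 bins x (1 + 4 + 16) cells = 336.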
IBM IMARS Spatial Gridding
First position in the 1st and 2nd ImageCLEF Medical Imaging Classification (http://www.imageclef.org/2012/medical)
Task: determine which modality a medical image belongs to.
- Images from PubMed articles
- 31 categories (x-ray, CT, MRI, ultrasound, etc.)
Image Filters
In addition to histograms, another group of features can be represented as filters. For example:
1. Haar-like filters (Viola-Jones face detection)
2. Gabor filters (simple cells in the visual cortex can be modeled by Gabor functions)
Widely used in fingerprint, iris, OCR, texture, and face recognition.
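As a rough illustration of a filter-based feature, the sketch below builds the real part of a Gabor function and applies a small bank of orientations to a striped test image. All parameter values (`size`, `sigma`, `wavelength`, `gamma`) are assumptions picked for the example, and the filtering uses plain loops for clarity rather than speed.

```python
import numpy as np

def gabor_kernel(size=15, sigma=3.0, theta=0.0, wavelength=6.0, gamma=0.5, psi=0.0):
    """Real part of a Gabor function: Gaussian envelope times a cosine carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    # Rotate the coordinates by theta.
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * xr / wavelength + psi)
    return envelope * carrier

def filter_response(image, kernel):
    """'Valid' cross-correlation of a 2D image with a kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

x = np.arange(64)
stripes = np.tile(np.cos(2.0 * np.pi * x / 6.0), (64, 1))   # vertical stripes
thetas = [0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
energy = [np.abs(filter_response(stripes, gabor_kernel(theta=t))).max() for t in thetas]
print(int(np.argmax(energy)))  # index of the orientation that matches the stripes
```

The filter whose carrier varies along the same direction as the stripes (theta = 0) gives by far the strongest response, which is what makes a bank of orientations useful as a texture descriptor.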
SIFT Feature and Bag-of-Words Model
Timeline of features, split around 1999:
- Classical features: raw pixel; histogram features (color histogram, edge histogram); frequency analysis; image filters; texture features (LBP); scene features (GIST); shape descriptors; edge detection; corner detection
- SIFT features and beyond: SIFT, HOG, SURF, DAISY, BRIEF, DoG, Hessian detector, Laplacian of Harris, FAST, ORB
Scale-Invariant Feature Transform (SIFT)
David G. Lowe
- Distinctive image features from scale-invariant keypoints, IJCV 2004
- Object recognition from local scale-invariant features, ICCV 1999
SIFT Descriptor: histogram of gradient orientation
- A histogram is more robust to position than raw pixels
- Edge gradient is more distinctive than color for local patches
- Concatenate histograms in spatial cells
David Lowe's excellent performance tuning:
- Good parameters: 8 orientations, 4 x 4 grid (128 dimensions)
- Soft assignment to spatial bins
- Gaussian weighting over spatial location
- Reduce the influence of large gradient magnitudes: thresholding + normalization
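The descriptor steps can be sketched as follows. This is a simplified illustration rather than Lowe's implementation: it uses the standard configuration of 8 orientation bins on a 4 x 4 grid (128 dimensions) and the 0.2 clipping threshold, but it hard-assigns gradients to bins (no soft assignment) and does not rotate the patch to a dominant orientation. The function name is hypothetical.

```python
import numpy as np

def sift_like_descriptor(patch, grid=4, n_ori=8, clip=0.2):
    """Simplified SIFT-style descriptor for a square grayscale patch."""
    patch = patch.astype(np.float64)
    gy, gx = np.gradient(patch)                  # derivatives along rows, columns
    mag = np.hypot(gx, gy)
    ori = np.arctan2(gy, gx) % (2.0 * np.pi)
    # Gaussian weighting over spatial location, centered on the patch.
    n = patch.shape[0]
    c = (n - 1) / 2.0
    yy, xx = np.mgrid[0:n, 0:n]
    mag = mag * np.exp(-((xx - c) ** 2 + (yy - c) ** 2) / (2.0 * (n / 2.0) ** 2))
    # Hard assignment of each gradient to one orientation bin.
    ori_bin = np.minimum((ori / (2.0 * np.pi) * n_ori).astype(int), n_ori - 1)
    cell = n // grid
    desc = np.zeros((grid, grid, n_ori))
    for i in range(grid):
        for j in range(grid):
            sl = (slice(i * cell, (i + 1) * cell), slice(j * cell, (j + 1) * cell))
            desc[i, j] = np.bincount(ori_bin[sl].ravel(), weights=mag[sl].ravel(),
                                     minlength=n_ori)
    desc = desc.ravel()                          # concatenate the cell histograms
    desc /= np.linalg.norm(desc) + 1e-12
    desc = np.minimum(desc, clip)                # threshold large gradient magnitudes
    desc /= np.linalg.norm(desc) + 1e-12         # renormalize
    return desc

rng = np.random.default_rng(0)
patch = rng.random((16, 16))                     # random stand-in for a local patch
desc = sift_like_descriptor(patch)
print(desc.shape)  # (128,)
```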
Scale-Invariant Feature Transform (SIFT)
SIFT Detector:
- Detect maxima and minima of the difference-of-Gaussians in scale space
- Post-processing: keep corner points but reject low-contrast and edge points
In general object recognition, we may combine multiple detectors (e.g., Harris, Hessian) or use dense sampling for good performance.
Following SIFT, many works, including SURF, BRIEF, ORB, and BRISK, have been proposed for faster local feature extraction.
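A minimal sketch of the detector idea, restricted to a single octave and without sub-pixel refinement. The thresholds (`contrast=0.03`, `edge_ratio=10`) follow commonly quoted SIFT defaults, but the function names and the overall structure are assumptions made for illustration, not Lowe's code.

```python
import numpy as np

def gaussian_blur(image, sigma):
    """Separable Gaussian blur with edge padding."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    k /= k.sum()
    padded = np.pad(image, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def dog_keypoints(image, sigma0=1.6, n_scales=5, contrast=0.03, edge_ratio=10.0):
    """Return (row, col, scale_index) of difference-of-Gaussian extrema in one octave."""
    image = image.astype(np.float64) / 255.0
    k = 2.0 ** (1.0 / (n_scales - 3))
    blurred = [gaussian_blur(image, sigma0 * k ** s) for s in range(n_scales)]
    dog = np.stack([b1 - b0 for b0, b1 in zip(blurred[:-1], blurred[1:])])
    keypoints = []
    for s in range(1, dog.shape[0] - 1):
        for i in range(1, dog.shape[1] - 1):
            for j in range(1, dog.shape[2] - 1):
                v = dog[s, i, j]
                if abs(v) < contrast:
                    continue                      # reject low-contrast points
                cube = dog[s - 1:s + 2, i - 1:i + 2, j - 1:j + 2]
                if v != cube.max() and v != cube.min():
                    continue                      # not a scale-space extremum
                # 2x2 spatial Hessian of the DoG layer for the edge test.
                dxx = dog[s, i, j + 1] - 2 * v + dog[s, i, j - 1]
                dyy = dog[s, i + 1, j] - 2 * v + dog[s, i - 1, j]
                dxy = (dog[s, i + 1, j + 1] - dog[s, i + 1, j - 1]
                       - dog[s, i - 1, j + 1] + dog[s, i - 1, j - 1]) / 4.0
                det = dxx * dyy - dxy ** 2
                if det <= 0 or (dxx + dyy) ** 2 / det >= (edge_ratio + 1) ** 2 / edge_ratio:
                    continue                      # reject edge-like points
                keypoints.append((i, j, s))
    return keypoints

# A single bright blob should produce a keypoint at its center.
yy, xx = np.mgrid[0:64, 0:64]
blob = 255.0 * np.exp(-((xx - 32) ** 2 + (yy - 32) ** 2) / (2.0 * 3.0 ** 2))
keypoints = dog_keypoints(blob)
```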
Histogram of Local Features and Bag-of-Words Models
Histogram of Local Features
[Figure: histogram of codeword frequencies; x-axis: codewords, y-axis: frequency]
dim = #codewords
Histogram of Local Features + Spatial Gridding
dim = #codewords x #grids
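The two dimension formulas can be checked with a small sketch of the bag-of-words histogram. The codebook here is random for illustration only (in practice it would typically come from k-means on training descriptors), and the function names and the normalized `(x, y)` position convention are assumptions.

```python
import numpy as np

def assign_codewords(descriptors, codebook):
    """Vector quantization: index of the nearest codeword for each descriptor."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def bow_histogram(descriptors, positions, codebook, grid=(1, 1)):
    """Codeword frequencies per spatial cell; positions are (x, y) in [0, 1)."""
    k = codebook.shape[0]
    words = assign_codewords(descriptors, codebook)
    gx = np.minimum((positions[:, 0] * grid[0]).astype(int), grid[0] - 1)
    gy = np.minimum((positions[:, 1] * grid[1]).astype(int), grid[1] - 1)
    cell = gy * grid[0] + gx
    hist = np.bincount(cell * k + words, minlength=k * grid[0] * grid[1])
    return hist.astype(np.float64) / max(len(words), 1)

rng = np.random.default_rng(0)
codebook = rng.standard_normal((50, 128))        # 50 codewords (random stand-in)
descriptors = rng.standard_normal((200, 128))    # 200 local features
positions = rng.random((200, 2))                 # normalized keypoint locations
h_global = bow_histogram(descriptors, positions, codebook)
h_grid = bow_histogram(descriptors, positions, codebook, grid=(2, 2))
print(h_global.shape, h_grid.shape)  # (50,) (200,)
```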
Bag of Words Models
Bag-of-Words Representation
[Figure: an object represented as a bag of words, in computer vision and in text/NLP]
Slide credit: Fei-Fei Li
Topic Models for Bag-of-Words Representation
- Unsupervised classification: Sivic et al., ICCV 2005
- Supervised classification: Fei-Fei et al., CVPR 2005
- Classification + segmentation: Cao and Fei-Fei, ICCV 2007
Pros and Cons of Bag of Words Models
Images differ from texts!
Bag of Words models are good at:
- Modeling prior knowledge
- Providing intuitive interpretation
But these models suffer from:
- Loss of spatial information
- Loss of information in quantization of visual words
This motivates a better coding approach.
Sparse Coding
Sparse Coding
The naïve histogram uses vector quantization (VQ) as a hard assignment, while sparse coding (SC) provides a soft assignment.
Sparse coding uses an l1 penalty as an approximation of the l0 norm to obtain a sparse solution.
SC works better with max pooling (while traditional VQ uses average pooling).
References: [M. Ranzato et al, CVPR07], [J. Yang et al, CVPR09], [J. Wang et al, CVPR10], [Y. Boureau et al, CVPR10]
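A minimal sketch contrasting the two coding schemes, assuming the usual l1-regularized objective min_a 0.5 * ||x - D a||^2 + lam * ||a||_1. The solver shown is ISTA (iterative soft thresholding), chosen here for brevity; it is not necessarily the solver used in the cited papers, and the random dictionary `D` is a stand-in for a learned codebook.

```python
import numpy as np

def sparse_code(x, D, lam=0.1, n_iter=200):
    """ISTA for min_a 0.5 * ||x - D @ a||^2 + lam * ||a||_1; D is d x K."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        z = a - grad / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
    return a

def vq_code(x, D):
    """Hard assignment: a one-hot code selecting the nearest codeword."""
    a = np.zeros(D.shape[1])
    a[np.argmin(((D - x[:, None]) ** 2).sum(axis=0))] = 1.0
    return a

def max_pool(codes):
    """Image-level feature: largest absolute response per codeword."""
    return np.abs(codes).max(axis=0)

def average_pool(codes):
    """Image-level feature: mean response per codeword (a normalized histogram for VQ)."""
    return codes.mean(axis=0)

rng = np.random.default_rng(0)
D = rng.standard_normal((16, 64))
D /= np.linalg.norm(D, axis=0)               # unit-norm codewords (random stand-in)
X = rng.standard_normal((10, 16))            # 10 local descriptors
sc_codes = np.array([sparse_code(x, D) for x in X])
vq_codes = np.array([vq_code(x, D) for x in X])
image_sc = max_pool(sc_codes)                # SC + max pooling
image_vq = average_pool(vq_codes)            # VQ + average pooling
```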
Sparse Coding + Spatial Pyramid
Yang et al, Linear Spatial Pyramid Matching using Sparse Coding for Image Classification, CVPR 2009
Pipeline: sparse coding + spatial pyramid + linear SVM
Efficient Approach
Locality-constrained linear coding (LLC):
1. find the k nearest neighbors of the query in the codebook
2. compute the sparse coding with only those k neighbors
Significantly faster than naïve SC, e.g., O(1000^a) -> O(5^a)
For further speedup, we can use least-squares (LS) regression to replace SC [J. Wang et al, CVPR10]
Matlab implementation: http://www.ifp.illinois.edu/~jyang29/llc.htm
Can be further sped up for top-k search
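The two steps, with least-squares regression in place of sparse coding, can be sketched as below. This follows the commonly described approximate LLC solution (a sum-to-one constrained least-squares fit over the k nearest codewords); the function name, the regularization constant `beta`, and the random codebook are assumptions for illustration, not a port of the Matlab implementation.

```python
import numpy as np

def llc_code(x, codebook, k=5, beta=1e-4):
    """Approximate LLC code of descriptor x (d,) over a codebook (K x d)."""
    d2 = ((codebook - x) ** 2).sum(axis=1)
    nn = np.argsort(d2)[:k]                    # step 1: k nearest codewords
    z = codebook[nn] - x                       # shift the neighbors to the origin
    C = z @ z.T                                # local covariance (k x k)
    C += np.eye(k) * beta * np.trace(C)        # regularize for numerical stability
    w = np.linalg.solve(C, np.ones(k))         # step 2: small least-squares solve
    w /= w.sum()                               # enforce the sum-to-one constraint
    code = np.zeros(codebook.shape[0])
    code[nn] = w
    return code

rng = np.random.default_rng(0)
codebook = rng.standard_normal((1000, 128))    # 1000 codewords (random stand-in)
x = rng.standard_normal(128)
code = llc_code(x, codebook, k=5)
print(np.count_nonzero(code))  # at most 5 nonzero coefficients
```

Only a k x k system is solved per descriptor, regardless of the codebook size, which is where the speedup over coding against all 1000 codewords comes from.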
Sparse Coding Is Not Necessarily Sparse
- Hard quantization gives the sparsest solution: a single nonzero coefficient per descriptor.
- Sparse coding is less sparse.
- The image-level representation is not sparse after pooling.
Is the success of SC due to sparsity?
Fisher Vector and Super Vector
Information Loss
Coding with information loss:
- VQ
- Sparse coding
Lossless coding: the significant difference is what is attached to each codeword:
- SC or VQ: a scalar!
- Lossless coding: a function!
Lossless Coding as Mixture of Experts
Let's look at each codeword as a "local expert" (Expert 1, Expert 2, Expert 3, ...), combined through a gating function (e.g., GMM, sparse GMM, harmonic K-means, etc.)
Pooling Towards Image-Level Representation
Pooling: sum the contributions of the local features within each component (Component 1, Component 2, Component 3, ...), then normalize and concatenate.
Both Fisher Vector and Super Vector can be written in this form (with different subtraction and normalization factors).
Related references:
- Fisher Vector [Perronnin et al, ECCV10]
- Supervector [X. Zhou, K. Yu, T. Zhang et al, ECCV10]
- HG [X. Zhou et al, ECCV09]
Big model: the dimension becomes C (#components) x d (feature dimension).
For example, if C = 1000 and d = 128, the final dimension is 128K, which is 100+ times longer than that from SC or VQ!
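A rough sketch of this pooling form, using a diagonal-covariance GMM as the gating function and the Fisher-vector-style scaling of the mean differences. The GMM parameters below are random stand-ins (in practice they would be fit with EM), the function names are hypothetical, and the Super Vector variant would use different subtraction and normalization factors.

```python
import numpy as np

def gmm_posteriors(X, weights, means, variances):
    """Soft assignment (gating) of N descriptors to C diagonal Gaussians."""
    diff = X[:, None, :] - means[None, :, :]                     # N x C x d
    log_p = -0.5 * ((diff ** 2) / variances + np.log(2 * np.pi * variances)).sum(axis=2)
    log_p += np.log(weights)
    log_p -= log_p.max(axis=1, keepdims=True)                    # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)                      # N x C

def pooled_vector(X, weights, means, variances):
    """Pool posterior-weighted differences per component, normalize, concatenate."""
    gamma = gmm_posteriors(X, weights, means, variances)
    diff = (X[:, None, :] - means[None, :, :]) / np.sqrt(variances)
    pooled = (gamma[:, :, None] * diff).sum(axis=0)              # C x d
    pooled /= (X.shape[0] * np.sqrt(weights))[:, None]           # Fisher-style scaling
    v = pooled.reshape(-1)                                       # concatenate components
    return v / (np.linalg.norm(v) + 1e-12)                       # length C * d

rng = np.random.default_rng(0)
C, d, N = 16, 8, 100
weights = np.full(C, 1.0 / C)
means = rng.standard_normal((C, d))
variances = np.ones((C, d))
X = rng.standard_normal((N, d))
v = pooled_vector(X, weights, means, variances)
print(v.shape)  # (128,)
```

With C = 1000 components and d = 128 the same code would return a 128,000-dimensional vector, compared with 1000 dimensions for a VQ or SC histogram over the same codebook size.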
Very Long Vector as Feature Representation
We can generate a very long image feature vector, as discussed before.
The strong feature we used for ImageNet LSVRC 2010:
- Dense sampling: LBP + HOG, feature dimension = 100 (after PCA)
- GMM with 1024 components
- 4 spatial grids (1 + 3x1)
- Dimension of image feature: 100 x 1024 x 4 = 0.41M
Pipeline: LBP / HOG -> GMM -> pooling
How to solve big models?
For Small Datasets: Use Kernel Trick!
Kernel trick: 10K images => kernel matrix of 10K x 10K ~ 100M entries
Computational complexity depends on the size of the kernel matrix, which is determined by the number of images rather than by the (much larger) feature dimension.
We tried nonlinear kernels for face verification and got good performance (results on the LFW dataset).
Learning Locally-Adaptive Decision Functions for Person Verification, CVPR 13 (with Z. Li, S. Chang, F. Liang, T. Huang, and J. Smith)
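To make the size argument concrete, the sketch below builds a kernel matrix for a small sample of high-dimensional features. The RBF kernel and the default `gamma` are assumptions chosen for illustration; the slides do not say which nonlinear kernel was used.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=None):
    """n x n RBF kernel matrix for n samples of dimension d."""
    sq = (X ** 2).sum(axis=1)
    # Squared Euclidean distances; clip tiny negatives caused by rounding.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    if gamma is None:
        gamma = 1.0 / X.shape[1]
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2000))     # 50 samples, 2000-dimensional features
K = rbf_kernel_matrix(X)
print(K.shape)  # (50, 50)
```

The matrix has n x n entries no matter how long each feature vector is, so 10K images give about 100M entries even with 0.4M-dimensional features.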
For Large Dataset: Use Stochastic Gradient Descent
Suppose we are working on ImageNet data using 0.4M-dimensional feature vectors.
Total training data: 1.2M x 0.4M ~ 0.5T real values!
- Too big to load into memory
- Too many samples to use kernel tricks
Solution: Stochastic Gradient Descent (SGD)
Idea: estimate the gradient on a randomly picked sample, whereas gradient descent computes it over all samples at every step.
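The comparison between the two update rules can be written out as follows. This is a generic formulation, assuming an L2-regularized objective with per-sample loss \ell, step size \eta_t, and regularization weight \lambda; these symbols are introduced here and do not appear in the slides.

```latex
% Objective: regularizer plus average loss over n training samples
\min_{w}\; F(w) = \frac{\lambda}{2}\lVert w\rVert^2
                + \frac{1}{n}\sum_{i=1}^{n} \ell(w; x_i, y_i)

% Gradient descent: every step touches all n samples
w_{t+1} = w_t - \eta_t \Big( \lambda w_t
        + \frac{1}{n}\sum_{i=1}^{n} \nabla_w \ell(w_t; x_i, y_i) \Big)

% SGD: one randomly picked sample i_t per step
w_{t+1} = w_t - \eta_t \big( \lambda w_t
        + \nabla_w \ell(w_t; x_{i_t}, y_{i_t}) \big)
```

Each SGD step costs one sample's worth of computation and memory, so the 0.5T values never need to be held in memory at once.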
SGD Can Be Very Simple To Implement
A 10-line binary SVM solver by Shai Shalev-Shwartz, with a decreasing learning rate
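The solver itself did not survive in the text, so the following is a sketch in the spirit of a Pegasos-style SGD solver for a linear SVM, under the assumption that this is the kind of solver the slide refers to. It is not the original code; the function name, the value of `lam`, and the toy data are illustrative.

```python
import numpy as np

def pegasos_svm(X, y, lam=0.01, n_iter=10000, seed=0):
    """Pegasos-style SGD for a linear SVM; labels y must be in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, n_iter + 1):
        i = rng.integers(X.shape[0])          # randomly picked sample
        eta = 1.0 / (lam * t)                 # decreasing learning rate
        margin = y[i] * (X[i] @ w)
        w *= 1.0 - eta * lam                  # gradient step on the regularizer
        if margin < 1.0:                      # hinge loss is active
            w += eta * y[i] * X[i]
    return w

# Toy data: two well-separated 2D clusters.
rng = np.random.default_rng(1)
X_pos = rng.normal(loc=2.0, scale=0.5, size=(100, 2))
X_neg = rng.normal(loc=-2.0, scale=0.5, size=(100, 2))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(100), -np.ones(100)])
w = pegasos_svm(X, y)
accuracy = np.mean(np.sign(X @ w) == y)
```

Each iteration reads a single training sample, so the full feature matrix never has to be in memory at the same time.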
Deep CNN and Related Tech
Deep CNN: A Bigger Model
Motivated by the studies of [Krizhevsky et al, NIPS12] and [Y. LeCun et al, PIEEE98], the deep convolutional neural network (CNN) has become the newest winner in the ImageNet competition.
The most popular CNN has:
- 5 convolutional layers to learn filters
- 2 fully connected layers
- 60 million parameters
- Stochastic gradient descent (again)
Why can we train such a bigger model now (and not in the 1990s)?
- The rise of big datasets (ImageNet)
- The blessing of GPU computing
Deep Learning Demo http://smith-gpu.pok.ibm.com:8080/
Learning Representation From Big Data
Computer vision researchers have seen a big performance jump on large-scale datasets like ImageNet.
Even earlier, researchers in speech/acoustics saw similar success in LVCSR and related tasks.
In another field, text/NLP researchers are also moving quickly to large-scale learning. For example, the IBM Watson system used thousands of sub-systems to beat the human players in the Jeopardy! game.
Watson is hiring! www.ibm.com/watsonjobs
In particular, we are looking for winter interns working on vision + NLP problems. Contact: zhou@us.ibm.com
Conclusion
Conclusion
The mutual evolution of big data and big models:
- Bigger models: Histogram -> Sparse coding (10K parameters) -> Supervector, Fisher vector (0.4M parameters) -> Deep CNN (60M parameters)
- Bigger datasets: Small dataset (e.g., Caltech101, 8K images) -> Medium dataset (e.g., PASCAL, 10+K) -> Large dataset (e.g., ImageNet, 1.2M)
Motivating questions:
- How to develop scalable solutions for big data?
- How to deal with situations with limited labeled data?
Please see the following talks for the answers!