Class 6 Large-Scale Image Classification

Class 6: Large-Scale Image Classification. Liangliang Cao, March 7, 2013. EECS 6890 Topics in Information Processing, Spring 2013, Columbia University. http://rogerioferis.com/visualrecognitionandsearch Visual Recognition And Search

Outline: overview of small scale classification; challenges in large scale; techniques for large scale classification.

Small Scale Classification: Example of Small Scale Classification. Let's use 10K training images as the running example in what follows.

Small Scale Classification: Choices of Discriminant Models. Ensemble of decision trees; nearest neighbor; metric learning; support vector machines (SVMs).

Small Scale Classification: SVMs in a Nutshell. Maximum margin classifier (linear SVM). Linear SVMs in dual space. Nonlinear SVMs via the representer theorem. Of course, many details are omitted here!

Small Scale Classification: Nonlinear SVM Model: Kernel Machines. Examples of kernels: (formulas not preserved in this transcript). 10K images => kernel matrix of 10K x 10K entries, ~800 MB.
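The ~800 MB figure follows from simple arithmetic if each kernel entry is stored as an 8-byte double (that storage format is an assumption; the slide does not state it). A minimal sketch, with the hypothetical helper `kernel_matrix_bytes`, also shows how quickly the dense matrix outgrows memory as N increases:

```python
def kernel_matrix_bytes(n_samples: int, bytes_per_entry: int = 8) -> int:
    """Bytes needed for a dense n_samples x n_samples kernel matrix.

    bytes_per_entry=8 assumes double-precision storage (an assumption).
    """
    return n_samples * n_samples * bytes_per_entry


for n in (10_000, 100_000, 1_000_000):
    gigabytes = kernel_matrix_bytes(n) / 1e9
    print(f"N = {n:>9,d}: {gigabytes:,.1f} GB")
```

At N = 10K the matrix fits comfortably in memory; at N = 1M it does not, which is the point made later under "Why Large Scale Is Hard".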

Small Scale Classification: Popular Workflow of Small Scale Classification. Training: images are encoded with sparse coding or Fisher vectors and used to learn the classification model. Testing: images are encoded with sparse coding or Fisher vectors and scored by the classification model.

Small Scale Classification: Training Stage. Optimization in dual space (linear/nonlinear SVMs): complexity ~ N (train size). Optimization in primal space (linear SVMs): complexity ~ dim_feature.

Small Scale Classification: Testing Stage. Complexity ~ N_sv. Many of the dual coefficients are zero; the samples with nonzero coefficients are called support vectors. Empirically, the number of support vectors is proportional to N.

Challenges in Large Scale: Why Small Scale Is Easy. Testing is fast: if N is small, N_sv is small too. Training is affordable even for multiple categories: we can precompute the (N x N) kernel matrix K, which is small enough to hold in memory and is shared across categories (101 categories in Caltech, 20 categories in PASCAL, 1 kernel matrix).

Challenges in Large Scale: Why Large Scale Is Hard. Testing is slow: if N is big, N_sv will probably be big too. Training is challenging: large scale quadratic programming is slow. Even worse, the kernel matrix K cannot be loaded into memory!

Challenges in Large Scale: Unique Requirements in CV. Many CV applications (e.g., object detection) need to evaluate SVM models thousands or millions of times in a short period. Many visual features are dense instead of sparse. You are encouraged to borrow techniques from other fields, but you may also need to develop new techniques to address these unique problems.

Challenges in Large Scale: A Simple Philosophy. Use a linear model instead of nonlinear kernels. Optimize in primal space. Evaluation complexity is O(dim) instead of O(N_sv). But: will we suffer a performance loss? A high dimensional linear model will be good. In high dimensions, many empirical studies suggest that a linear model's performance is similar to that of nonlinear ones. In statistical analysis, the dimensionality should scale consistently with N for good performance.
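The O(dim) versus O(N_sv) claim can be made concrete with a small sketch. It assumes a linear kernel, random stand-in values for the support vectors and dual coefficients, and hypothetical function names (`score_dual`, `score_primal`); it is not taken from the lecture. With a linear kernel the dual expansion collapses into a single weight vector, so each test evaluation costs one dot product regardless of how many support vectors there are:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sv, dim = 5000, 128

# Stand-in values: in a trained SVM these come from the solver.
support_vectors = rng.normal(size=(n_sv, dim))
dual_coef = rng.normal(size=n_sv)          # alpha_i * y_i for each support vector
bias = 0.1


def score_dual(x):
    """Dual form: sum_i alpha_i y_i <s_i, x> + b, cost O(N_sv * dim) per test sample."""
    return dual_coef @ (support_vectors @ x) + bias


# Collapse the expansion once, after training: w = sum_i alpha_i y_i s_i.
w = dual_coef @ support_vectors


def score_primal(x):
    """Primal form: <w, x> + b, cost O(dim) per test sample."""
    return w @ x + bias


x = rng.normal(size=dim)
print(score_dual(x), score_primal(x))      # the two scores agree up to rounding
```

For a nonlinear kernel no such finite-dimensional `w` exists in general, which is what motivates the kernel approximation techniques that follow.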

Techniques for Large Scale Classification: kernel approximation; stochastic gradient descent; parallel computing; storage, search, distribute; more.

Kernel Approximation: Approximate Histogram-Intersection Kernel with Piecewise Function (SVM with histogram intersection kernel; Maji, Berg, Malik 2008). Evaluating each dimension naively costs O(#sv). The per-dimension sums are independent of the input and can be pre-computed; to evaluate, find the position of the input in the sorted list of support vectors, which can be done using binary search in O(log #sv) time. A piecewise polynomial approximation brings this down to O(1) time. This works for evaluating any additive kernel.
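The O(log #sv) step can be sketched as follows. This is an illustrative reading of the idea, not code from Maji et al.; the function names are hypothetical, and `coef` stands for the signed dual coefficients. Per dimension i, the contribution sum_l coef_l * min(x_i, s_li) splits into a part over support values below x_i (a pre-computed cumulative sum) and x_i times the total coefficient of the remaining ones:

```python
import numpy as np


def precompute(support_vectors, coef):
    """Sort each dimension once and store cumulative sums (independent of the input)."""
    order = np.argsort(support_vectors, axis=0)
    sorted_sv = np.take_along_axis(support_vectors, order, axis=0)
    sorted_coef = coef[order]                                  # shape (n_sv, dim)
    zeros = np.zeros((1, support_vectors.shape[1]))
    # below[r, i]: sum of coef * value over the r smallest support values in dimension i
    below = np.vstack([zeros, np.cumsum(sorted_coef * sorted_sv, axis=0)])
    # above[r, i]: sum of coef over the remaining n_sv - r support values
    total = sorted_coef.sum(axis=0, keepdims=True)
    above = total - np.vstack([zeros, np.cumsum(sorted_coef, axis=0)])
    return sorted_sv, below, above


def score_fast(x, sorted_sv, below, above):
    """One binary search per dimension: O(dim * log #sv)."""
    score = 0.0
    for i, xi in enumerate(x):
        r = np.searchsorted(sorted_sv[:, i], xi, side="right")  # count of values <= xi
        score += below[r, i] + xi * above[r, i]
    return score


def score_naive(x, support_vectors, coef):
    """Direct evaluation of sum_l coef_l * sum_i min(x_i, s_li): O(dim * #sv)."""
    return float(coef @ np.minimum(support_vectors, x).sum(axis=1))


rng = np.random.default_rng(0)
sv = rng.random((200, 16))
coef = rng.normal(size=200)
x = rng.random(16)
tables = precompute(sv, coef)
print(score_fast(x, *tables), score_naive(x, sv, coef))
```

Replacing the binary search with a lookup into a uniformly sampled table of each per-dimension function would give the O(1) piecewise approximation mentioned above, at the cost of a small approximation error.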

Kernel Approximation: Idea of Explicit Mapping. (Slide courtesy of Andrea Vedaldi.)

Kernel Approximation: Homogeneous Kernel Approximation in 1D. (Slide courtesy of Andrea Vedaldi; Vedaldi and Zisserman, '10, '11.)

Kernel Approximation: Additive Homogeneous Kernel. Limitation: not applicable for RBF and exp-chi2 kernels. (Slide courtesy of Andrea Vedaldi; Vedaldi and Zisserman, '10, '11.)
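As an illustration of an explicit map for a homogeneous kernel, the sketch below approximates the 1D chi-squared kernel k(x, y) = 2xy / (x + y) by sampling its spectrum sech(pi * lambda) at regular intervals, following the general recipe of Vedaldi and Zisserman as I understand it. The function names and the choice n = 10, L = 0.25 are assumptions made for this example, not values from the slides:

```python
import numpy as np


def chi2_kernel(x, y):
    """Exact 1D chi-squared kernel for positive scalars."""
    return 2.0 * x * y / (x + y)


def chi2_feature_map(x, n=10, L=0.25):
    """Map positive scalars (shape (m,)) to 2n+1 features whose dot product
    approximates chi2_kernel. n and L are illustrative choices."""
    x = np.asarray(x, dtype=float)
    lam = L * np.arange(1, n + 1)             # sampled frequencies
    kappa = 1.0 / np.cosh(np.pi * lam)        # chi2 spectrum: sech(pi * lambda)
    log_x = np.log(x)[:, None]                # requires strictly positive inputs
    root = np.sqrt(x)[:, None]
    first = root * np.sqrt(L)                 # lambda = 0 term, sech(0) = 1
    amplitude = root * np.sqrt(2.0 * L * kappa)
    return np.hstack([first,
                      amplitude * np.cos(lam * log_x),
                      amplitude * np.sin(lam * log_x)])


x = np.linspace(0.01, 1.0, 50)
exact = chi2_kernel(x[:, None], x[None, :])
features = chi2_feature_map(x)
approx = features @ features.T
print("max abs error:", np.abs(exact - approx).max())
```

For an additive kernel over d-dimensional histograms, the same map is applied to each dimension and the results are concatenated, so a linear SVM on the mapped features stands in for the kernel SVM.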

Kernel Approximation: Random Mapping. Limitation: the dimension is much higher! (Slide courtesy of Andrea Vedaldi.)
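One well-known instance of random mapping is the random Fourier feature construction for the RBF kernel (Rahimi and Recht); whether the slide showed exactly this variant is an assumption. The sketch below, with hypothetical function names and an arbitrary gamma, shows why the dimension has to be high: the approximation error shrinks only as roughly 1/sqrt(n_features).

```python
import numpy as np


def rbf_kernel(X, Y, gamma):
    """Exact RBF kernel exp(-gamma * ||x - y||^2)."""
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dist)


def random_fourier_features(X, n_features, gamma, rng):
    """Random cosine features whose dot products approximate the RBF kernel."""
    dim = X.shape[1]
    # Frequencies drawn from the kernel's spectrum: Gaussian with variance 2 * gamma.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(dim, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
gamma = 0.5
exact = rbf_kernel(X, X, gamma)
Z = random_fourier_features(X, n_features=8000, gamma=gamma, rng=rng)
approx = Z @ Z.T
print("max abs error:", np.abs(exact - approx).max())
```

Note that the same `W` and `b` must be reused for training and test data; here that is ensured by mapping all points in a single call.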

Kernel Approximation: Short Summary. Piecewise linear approximation for additive kernels: dramatically speeds up testing. Explicit mapping for homogeneous kernels: efficient in both training and testing. Random mapping: useful for exp(.) functions, but results in even higher dimensions.

Stochastic Gradient Descent: Recall. Super vector (Fisher vector) features for images: 128 x 1000 = 128K-dimensional vectors. Large scale image dataset: ~1M images in ImageNet LSVRC. High dimensional linear model: learn smartly from tall & fat data.

Stochastic Gradient Descent: Gradient Descent. Problem: minimize the cost function. Gradient descent follows the negative gradient of the cost. Second-order gradient descent (a variant of Newton's method) additionally scales the gradient by the (approximate) inverse of the Hessian.
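For reference, the textbook forms of these two updates can be written as below. The notation (per-sample loss Q, step size gamma_t, scaling matrix Gamma_t) is borrowed from Bottou's SGD notes and is an assumption; the slide's own symbols are not available here.

```latex
\begin{align*}
  C(w) &= \frac{1}{N} \sum_{n=1}^{N} Q(z_n, w)
    && \text{cost function over } N \text{ training samples } z_n \\
  w_{t+1} &= w_t - \gamma_t \, \nabla_w C(w_t)
    && \text{gradient descent} \\
  w_{t+1} &= w_t - \Gamma_t \, \nabla_w C(w_t), \quad \Gamma_t \approx H^{-1}
    && \text{second-order gradient descent}
\end{align*}
```

Both updates touch all N samples per step, which is what becomes expensive at large scale.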

Stochastic Gradient Descent: Stochastic Gradient Descent. Idea: estimate the gradient on a single randomly picked sample instead of the full training set (gradient descent vs. stochastic gradient descent). Requirements on the step size are needed to guarantee convergence.
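In the same assumed notation (per-sample loss Q, step size gamma_t), the stochastic update and the classical step-size conditions usually quoted for convergence read as follows. These are the standard Robbins-Monro style conditions, stated here as a reference rather than as a transcription of the slide.

```latex
\begin{align*}
  w_{t+1} &= w_t - \gamma_t \, \nabla_w Q(z_{n_t}, w_t),
    && n_t \text{ drawn at random from } \{1, \dots, N\} \\
  \sum_{t} \gamma_t^2 &< \infty, \qquad \sum_{t} \gamma_t = \infty
    && \text{step sizes shrink, but not too fast}
\end{align*}
```

Each step now costs a single gradient evaluation instead of N.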

Stochastic Gradient Descent: Stochastic Gradient Descent. Popular choices of step size: (formulas not preserved in this transcript).
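A minimal sketch of SGD for a linear SVM (hinge loss with L2 regularization) is given below. The step-size schedule gamma_t = gamma0 / (1 + lam * gamma0 * t) is one commonly used choice from Bottou's SGD code and is an assumption here, as are the function name, the hyperparameters, and the synthetic data.

```python
import numpy as np


def sgd_linear_svm(X, y, lam=1e-3, gamma0=0.1, epochs=5, seed=0):
    """SGD on lam/2 * ||w||^2 + mean(hinge loss); labels y must be in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, dim = X.shape
    w = np.zeros(dim)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):               # one randomly picked sample per step
            gamma = gamma0 / (1.0 + lam * gamma0 * t)   # assumed decaying schedule
            grad = lam * w                          # gradient of the regularizer
            if y[i] * (X[i] @ w) < 1.0:             # hinge loss is active
                grad = grad - y[i] * X[i]
            w -= gamma * grad
            t += 1
    return w


# Synthetic, linearly separable data (illustrative only).
rng = np.random.default_rng(0)
w_true = rng.normal(size=20)
X = rng.normal(size=(2000, 20))
y = np.sign(X @ w_true)

w = sgd_linear_svm(X, y)
accuracy = np.mean(np.sign(X @ w) == y)
print("training accuracy:", accuracy)
```

Each update costs O(dim), and the whole run touches every sample only `epochs` times, which is what makes the approach attractive for the 128K-dimensional, ~1M-image setting described earlier.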

Stochastic Gradient Descent. (Table courtesy of Léon Bottou; not preserved in this transcript.)

Stochastic Gradient Descent: How Good Is SGD? It converges to the global minimum for convex problems. Residual error decreasing speed: (formula not preserved in this transcript). In practice, SGD first reduces the cost function quite rapidly but then dances around the minimum, so use Average SGD instead. (Table courtesy of Léon Bottou.)

Stochastic Gradient Descent: Average SGD. SGD vs. Average SGD: Average SGD converges with the optimal asymptotic speed [Xu 2010].
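The averaging idea can be sketched on a noisy least-squares problem: run plain SGD, but report the running mean of the iterates rather than the last one. The constant step size, the choice to average from the first step, and all names below are simplifying assumptions for illustration; practical Average SGD implementations typically use a decaying step size and start averaging after an initial phase.

```python
import numpy as np


def sgd_with_averaging(X, y, gamma0=0.01, epochs=3, seed=0):
    """Least-squares SGD that also tracks the running average of the iterates."""
    rng = np.random.default_rng(seed)
    n, dim = X.shape
    w = np.zeros(dim)
    w_avg = np.zeros(dim)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            residual = X[i] @ w - y[i]
            w -= gamma0 * residual * X[i]       # plain SGD step on one sample
            t += 1
            w_avg += (w - w_avg) / t            # running mean of w_1, ..., w_t
    return w, w_avg


# Synthetic regression data with label noise (illustrative only).
rng = np.random.default_rng(1)
w_true = rng.normal(size=10)
X = rng.normal(size=(5000, 10))
y = X @ w_true + 0.5 * rng.normal(size=5000)

w_last, w_avg = sgd_with_averaging(X, y)
print("last iterate error:", np.linalg.norm(w_last - w_true))
print("averaged error:    ", np.linalg.norm(w_avg - w_true))
```

The last iterate keeps "dancing around" the minimum because of gradient noise, while the average smooths that noise out; on this kind of problem the averaged error is typically the smaller of the two.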

Stochastic Gradient Descent: Average SGD for ImageNet LSVRC 2010 [Lin et al., CVPR 2011].

Stochastic Gradient Descent: Philosophy of SGD for Large Scale. Learn from more samples in an affordable way, but do not expect the exact global minimum. Why: the cost function is just an approximation of the empirical error, so reaching its global minimum is not necessary. More data/more parameters vs. global minimum: the former is more important than the latter. This philosophy is also employed by deep learning.

SGD for Deep Learning: Perceptron. $f_i(x) = \sum_j W_{ij} x_j$, with output $y_{in} = f_i(x_n; W) = \sigma\left(\sum_j W_{ij} x_{jn} + b_i\right)$ and $\sigma(z) = \frac{1}{1 + \exp(-z)}$. The error is $\text{error} = \sum_{n=1}^{N} \sum_{i=1}^{d_{out}} \left( y_{in} - \sum_{j=1}^{d_{in}} W_{ij} x_{jn} - b_i \right)^2$. A perceptron is equivalent to logistic regression. (Slide courtesy of Max Welling.)

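SGD for Deep Learning: From Perceptron to Neural Network. The input x feeds hidden layer h1 through (W1, b1), h1 feeds h2 through (W2, b2), and h2 feeds the output y through (W3, b3): $h^1_i = g\left(\sum_j W^1_{ij} x_j + b^1_i\right)$, $h^2_i = g\left(\sum_j W^2_{ij} h^1_j + b^2_i\right)$, $\hat{y}_i = g\left(\sum_j W^3_{ij} h^2_j + b^3_i\right)$. (Slide courtesy of Max Welling.)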

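SGD for Deep Learning: SGD for Neural Network. Upward pass (x to h1 to h2 to y): $h^1_i = \sigma\left(\sum_j W^1_{ij} x_j + b^1_i\right)$, $h^2_i = \sigma\left(\sum_j W^2_{ij} h^1_j + b^2_i\right)$, $\hat{y}_i = \sigma\left(\sum_j W^3_{ij} h^2_j + b^3_i\right)$. Downward pass (y to h2 to h1 to x): $\delta^3_{in} = \hat{y}_{in} (1 - \hat{y}_{in}) \frac{d\,\text{error}_n}{d\,\hat{y}_{in}}$, $\delta^2_{jn} = h^2_{jn} (1 - h^2_{jn}) \sum_{\text{upstream } i} W^3_{ij} \delta^3_{in}$, $\delta^1_{kn} = h^1_{kn} (1 - h^1_{kn}) \sum_{\text{upstream } j} W^2_{jk} \delta^2_{jn}$. Parameter updates, shown for layer 2: $W^2_{jk} \leftarrow W^2_{jk} - \eta\, \delta^2_{jn} h^1_{kn}$ and $b^2_j \leftarrow b^2_j - \eta\, \delta^2_{jn}$. (Slide courtesy of Max Welling.)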
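The upward and downward passes can be sketched in NumPy for a single training sample. This is an illustrative implementation under stated assumptions: sigmoid units in every layer, squared error sum_i (y_hat_i - y_i)^2, arbitrary layer sizes and learning rate, and hypothetical function names. It is not code from the lecture.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(params, x):
    """Upward pass: x -> h1 -> h2 -> y_hat."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = sigmoid(W1 @ x + b1)
    h2 = sigmoid(W2 @ h1 + b2)
    y_hat = sigmoid(W3 @ h2 + b3)
    return h1, h2, y_hat


def error(params, x, y):
    """Squared error for one sample (assumed loss)."""
    return float(np.sum((forward(params, x)[2] - y) ** 2))


def backward(params, x, y):
    """Downward pass: deltas flow from the output back to the first layer."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1, h2, y_hat = forward(params, x)
    d3 = y_hat * (1 - y_hat) * 2.0 * (y_hat - y)   # d error / d y_hat = 2 (y_hat - y)
    d2 = h2 * (1 - h2) * (W3.T @ d3)               # sum over upstream units i
    d1 = h1 * (1 - h1) * (W2.T @ d2)               # sum over upstream units j
    # Gradient for each layer: (delta outer input, delta)
    return [(np.outer(d1, x), d1), (np.outer(d2, h1), d2), (np.outer(d3, h2), d3)]


def sgd_step(params, x, y, eta=0.1):
    """One SGD update: W <- W - eta * delta * input, b <- b - eta * delta."""
    grads = backward(params, x, y)
    return [(W - eta * gW, b - eta * gb) for (W, b), (gW, gb) in zip(params, grads)]


rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]                               # illustrative layer widths
params = [(rng.normal(scale=0.5, size=(o, i)), np.zeros(o))
          for i, o in zip(sizes[:-1], sizes[1:])]
x = rng.normal(size=4)
y = np.array([0.0, 1.0])

print("error before:", error(params, x, y))
for _ in range(200):
    params = sgd_step(params, x, y)
print("error after: ", error(params, x, y))
```

A finite-difference comparison against `backward` is a quick way to check that the deltas are implemented correctly before training on real data.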

SGD for Deep Learning: Tricks to Train NN. Why is SGD for NN difficult? The problem is non-convex, so there is no guarantee of reaching the global minimum. There are more parameters than training samples, so overfitting is easy. Tricks list: Average SGD; normalization of inputs; smartly manipulating the Hessian; GPU speed-up; drop-out; pre-training. References: http://leon.bottou.org/projects/sgd; Y. LeCun, L. Bottou et al., 1998; L. Bottou, 2012.

Discussions: Difficulties in large scale classification. Why shall we do large scale classification?