Class 6 Large-Scale Image Classification Liangliang Cao, March 7, 2013 EECS 6890 Topics in Information Processing Spring 2013, Columbia University http://rogerioferis.com/visualrecognitionandsearch Visual Recognition And Search 1
Outline Overview of small scale classification Challenges in large scale Techniques for large scale classification
Small Scale Classification Example of Small Scale Classification Let's choose 10K images for the example in the following
Small Scale Classification Choices of Discriminant Models Ensemble of decision trees Nearest neighbor Metric learning Support vector machines (SVMs)
Small Scale Classification SVMs in a Nutshell Maximum margin classifier (linear SVM) Linear SVMs in dual space Of course, numerous details are missing here! Nonlinear SVMs via the representer theorem
Small Scale Classification Nonlinear SVM model: Kernel Machines Examples of kernels: 10K images => kernel matrix: 10K x 10K, about 800 MB
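As a concrete check on the 800 MB figure, a minimal sketch (function name and test data are illustrative) that builds an RBF kernel matrix and computes the storage cost of a 10K x 10K one in float64:

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2), using
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

X = np.random.default_rng(0).normal(size=(100, 16))
K = rbf_kernel_matrix(X)

# Storage for a 10K x 10K kernel matrix in float64 (8 bytes per entry):
mem_mb = 10_000 * 10_000 * 8 / 1e6  # 800.0 MB, matching the slide
```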
Small Scale Classification Popular Workflow of Small Scale Classification Both training and testing images are encoded by sparse coding or Fisher vectors, then fed into the classification model
Small Scale Classification Training Stage Optimization in dual space (linear/nonlinear SVMs): complexity ~ N (training set size) Optimization in primal space (linear SVMs): complexity ~ dim_feature
Small Scale Classification Testing Stage Complexity ~ N_sv Many coefficients are zero; the samples with nonzero coefficients are called support vectors. Empirically, the number of support vectors is proportional to N.
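The O(N_sv) testing cost can be made concrete; a hypothetical sketch (names and toy numbers are illustrative) of the kernel SVM decision function, whose cost is one kernel evaluation per support vector:

```python
import numpy as np

def svm_decision(x, support_vectors, alpha_y, b, kernel):
    # f(x) = sum_i (alpha_i * y_i) K(x_i, x) + b, summed over support
    # vectors only: evaluation cost is O(N_sv) kernel calls.
    return sum(ay * kernel(s, x) for s, ay in zip(support_vectors, alpha_y)) + b

linear = lambda u, v: float(np.dot(u, v))

sv = np.array([[1.0, 0.0], [0.0, 1.0]])  # toy support vectors
alpha_y = [0.5, -0.5]                    # toy alpha_i * y_i coefficients
f = svm_decision(np.array([1.0, 0.0]), sv, alpha_y, b=0.0, kernel=linear)
```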
Challenges in Large Scale Why Small Scale Is Easy Testing is fast: if N is small, N_sv is small too Training is affordable even for multiple categories: we can precompute the (N x N) kernel matrix K Small enough to host in memory Shared across multiple categories: 101 categories in Caltech, 20 categories in PASCAL, but only 1 kernel matrix
Challenges in Large Scale Why Large Scale Is Hard Testing is slow: if N is big, N_sv will probably be big too Training is challenging: large scale quadratic programming is slow Even worse, the kernel matrix K cannot be loaded into memory!
Challenges in Large Scale Unique Requirements in CV Many CV applications (e.g., object detection) need to evaluate SVM models thousands or millions of times in a short period Many visual features are dense instead of sparse You are encouraged to borrow techniques from other fields, but you may also develop new techniques to address these unique problems.
Challenges in Large Scale A Simple Philosophy Use a linear model instead of nonlinear kernels Optimize in primal space Evaluation complexity O(dim) instead of O(N_sv) But: will we suffer a performance loss? A high dimensional linear model will be good: in high dimensions, many empirical studies suggest a linear model's performance is similar to that of nonlinear ones In statistical analysis, the dimensionality should grow consistently with N for good performance.
Techniques for Large Scale Classification Kernel approximation Stochastic gradient descent Parallel computing Storage, search, distribute More
Kernel Approximation Approximate Histogram-Intersection Kernel with Piecewise Function SVM with histogram intersection kernel (Maji, Berg, Malik 2008) Evaluating each dimension naively costs O(#sv) The per-dimension functions are independent of the input and can be pre-computed To evaluate, find the position of the input in the sorted list of support vectors; this can be done by binary search in O(log #sv) time
Kernel Approximation Approximate Histogram-Intersection Kernel with Piecewise Function SVM with histogram intersection kernel (Maji, Berg, Malik 2008) Per-dimension evaluation drops from O(#sv) to O(log #sv) with binary search, and with a piecewise polynomial approximation to O(1) time Works for evaluating any additive kernel
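A minimal sketch of the lookup-table idea behind this trick for the histogram-intersection kernel, using piecewise linear interpolation; function names, bin count, and the toy data are illustrative, not the authors' implementation:

```python
import numpy as np

def fit_piecewise_tables(alpha, sv, n_bins=400):
    # For each feature dimension j, tabulate on a uniform grid the function
    #   h_j(s) = sum_i alpha_i * min(s, sv[i, j]),
    # so the histogram-intersection SVM score is sum_j h_j(x_j) + b.
    grid = np.linspace(0.0, sv.max(), n_bins)
    tables = np.einsum('i,ijk->jk', alpha,
                       np.minimum(grid[None, None, :], sv[:, :, None]))
    return grid, tables

def approx_decision(x, grid, tables, b=0.0):
    # O(1) work per dimension: interpolate in the precomputed table
    # instead of touching every support vector.
    return sum(np.interp(x[j], grid, tables[j]) for j in range(len(x))) + b

rng = np.random.default_rng(1)
sv = rng.random((20, 5))           # toy "support vector" histograms
alpha = rng.standard_normal(20)    # toy dual coefficients (alpha_i * y_i)
grid, tables = fit_piecewise_tables(alpha, sv)
x = rng.random(5)
exact = float(alpha @ np.minimum(x[None, :], sv).sum(axis=1))
approx = approx_decision(x, grid, tables)
```

With enough bins the tabulated score matches the exact O(#sv)-per-dimension evaluation closely.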
Kernel Approximation Idea of Explicit Mapping Slide courtesy of Andrea Vedaldi
Kernel Approximation Homogeneous Kernel Approximation in 1D Slide courtesy of Andrea Vedaldi Vedaldi and Zisserman, '10, '11
Kernel Approximation Additive Homogeneous Kernel Slide courtesy of Andrea Vedaldi Limitation: not applicable to RBF and exp-chi2 kernels Vedaldi and Zisserman, '10, '11
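A sketch of the 1-D sampled homogeneous kernel map for the chi2 kernel k(x, y) = 2xy/(x+y), following Vedaldi and Zisserman's construction; the defaults n=3 and L=0.65 are illustrative choices, not prescribed values:

```python
import numpy as np

def chi2_feature_map(x, n=3, L=0.65):
    # Returns a (2n+1)-dim vector Psi(x) such that Psi(x).Psi(y)
    # approximates the 1-D chi2 kernel k(x, y) = 2xy/(x+y), x, y > 0.
    # kappa is the spectrum (signature) of the chi2 kernel.
    kappa = lambda lam: 1.0 / np.cosh(np.pi * lam)
    feats = [np.sqrt(x * L * kappa(0.0))]
    for j in range(1, n + 1):
        r = np.sqrt(2.0 * x * L * kappa(j * L))
        feats.append(r * np.cos(j * L * np.log(x)))
        feats.append(r * np.sin(j * L * np.log(x)))
    return np.array(feats)

x, y = 0.5, 0.25
exact = 2.0 * x * y / (x + y)
approx = float(chi2_feature_map(x) @ chi2_feature_map(y))
```

For an additive kernel over histograms, the same map is applied to each dimension and the per-dimension features are concatenated, after which a linear SVM can be trained in the mapped space.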
Kernel Approximation Random Mapping Slide courtesy of Andrea Vedaldi Limitation: the dimension is much higher!
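A sketch of a random mapping for the RBF kernel (random Fourier features): draw random projections so that inner products of the mapped features approximate exp(-gamma ||x - y||^2). The feature count D and the toy inputs are illustrative; note the mapped dimension D is typically much larger than the input dimension, as the slide warns:

```python
import numpy as np

def random_fourier_features(X, D=5000, gamma=0.5, seed=0):
    # z(x) = sqrt(2/D) * cos(W^T x + b), with W ~ N(0, 2*gamma*I) and
    # b ~ Uniform[0, 2*pi], gives E[z(x).z(y)] = exp(-gamma ||x - y||^2).
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = np.array([[0.1, 0.2], [0.3, -0.1]])
Z = random_fourier_features(X)
approx = float(Z[0] @ Z[1])
exact = float(np.exp(-0.5 * np.sum((X[0] - X[1]) ** 2)))
```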
Kernel Approximation Short Summary Piecewise linear approximation for additive kernels: dramatically speeds up testing Explicit mapping for homogeneous kernels: efficient in both training and testing Random mapping: useful for exp(.) kernels, but results in even higher dimensions
Techniques for Large Scale Classification Kernel approximation Stochastic gradient descent Parallel computing Storage, search, distribute More
Stochastic Gradient Descent Recall: super vector (Fisher vector) features for images: 128 x 1000 = 128K-dimensional vectors Large scale image datasets: ImageNet LSVRC: ~1M images High dimensional linear model: smartly learn from tall-and-fat data.
Stochastic Gradient Descent Gradient Descent Problem: minimize the cost function Gradient descent Second-order gradient descent (a variant of Newton's method): uses the (approximate) inverse of the Hessian
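To contrast the two update rules, a small sketch on a toy quadratic (the matrix and step size are illustrative): plain gradient descent needs many iterations, while a Newton step with the exact Hessian inverse solves a quadratic in one step.

```python
import numpy as np

# Minimize f(w) = 0.5 * w^T A w - b^T w; gradient = A w - b, Hessian = A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
w_star = np.linalg.solve(A, b)  # exact minimizer

# Plain gradient descent with a fixed step size
w = np.zeros(2)
for _ in range(200):
    w -= 0.1 * (A @ w - b)

# Second-order step: w <- w - H^{-1} grad; exact in one step for a quadratic
w_newton = np.zeros(2)
w_newton -= np.linalg.solve(A, A @ w_newton - b)
```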
Stochastic Gradient Descent Stochastic Gradient Descent Idea: estimate the gradient on a randomly picked sample Gradient descent Stochastic gradient descent Requirements (to guarantee convergence): $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$
Stochastic Gradient Descent Stochastic Gradient Descent Idea: estimate the gradient on a randomly picked sample Gradient descent Stochastic gradient descent Popular choices: $\eta_t = \eta_0 / (1 + \lambda \eta_0 t)$ or $\eta_t \propto 1/t$
Stochastic Gradient Descent Table courtesy of Léon Bottou
Stochastic Gradient Descent How Good Is SGD? Converges to the global minimum for convex problems Residual error decreasing speed: In practice, SGD first reduces the cost function rapidly but then dances around the minimum; use averaged SGD instead Table courtesy of Léon Bottou
Stochastic Gradient Descent Averaged SGD Instead of the last SGD iterate, return the average of the iterates Averaged SGD converges at the optimal asymptotic rate [Xu 2010]
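A sketch of averaged (Polyak-Ruppert) SGD on a noisy 1-D quadratic; the step-size schedule, burn-in choice, and noise level are illustrative assumptions, not the exact scheme of [Xu 2010]:

```python
import numpy as np

def averaged_sgd(grad, w0, eta0=0.5, alpha=0.75, steps=4000, seed=0):
    # Plain SGD with eta_t = eta0 * t^(-alpha), plus a running average of
    # the iterates over the second half of the run (burn-in discarded).
    rng = np.random.default_rng(seed)
    w = float(w0)
    wbar, k = 0.0, 0
    for t in range(1, steps + 1):
        w -= eta0 * t ** (-alpha) * grad(w, rng)
        if t > steps // 2:
            k += 1
            wbar += (w - wbar) / k  # incremental mean of the iterates
    return w, wbar

# Noisy gradient of 0.5 * (w - 3)^2; the minimizer is w = 3.
noisy_grad = lambda w, rng: (w - 3.0) + rng.normal(scale=1.0)
w_last, w_avg = averaged_sgd(noisy_grad, w0=0.0)
```

The averaged iterate smooths out the "dancing around" of the last iterate that the previous slide describes.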
Stochastic Gradient Descent Averaged SGD for ImageNet LSVRC 2010 [Lin et al., CVPR 2011]
Stochastic Gradient Descent Philosophy of SGD for Large Scale Learn from more samples in an affordable way, but do not expect the exact global minimum Why: The cost function is just an approximation of the empirical error (so the exact global minimum is not necessary) More data/more parameters vs. global minimum: the former is more important than the latter This philosophy is also employed by deep learning
SGD for Deep Learning Perceptron $y_{in} = f(x_n; W) = \sigma\big(\sum_j W_{ij} x_{jn} + b_i\big)$, $\sigma(z) = \frac{1}{1 + \exp(-z)}$ $\mathrm{error} = \sum_{n=1}^{N} \sum_{i=1}^{d_{\mathrm{out}}} \big(y_{in} - \sigma\big(\sum_{j=1}^{d_{\mathrm{in}}} W_{ij} x_{jn} + b_i\big)\big)^2$ A perceptron is equivalent to logistic regression Slide courtesy of Max Welling
SGD for Deep Learning From Perceptron to Neural Network $\hat y_i = g\big(\sum_j W^3_{ij} h^2_j + b^3_i\big)$, $h^2_i = g\big(\sum_j W^2_{ij} h^1_j + b^2_i\big)$, $h^1_i = g\big(\sum_j W^1_{ij} x_j + b^1_i\big)$ Slide courtesy of Max Welling
SGD for Deep Learning SGD for Neural Network Upward pass: $\hat y_{in} = \sigma\big(\sum_j W^3_{ij} h^2_{jn} + b^3_i\big)$, $h^2_{in} = \sigma\big(\sum_j W^2_{ij} h^1_{jn} + b^2_i\big)$, $h^1_{in} = \sigma\big(\sum_j W^1_{ij} x_{jn} + b^1_i\big)$ Downward pass: $\delta^3_{in} = \hat y_{in}(1 - \hat y_{in}) \frac{d\,\mathrm{error}}{d \hat y_{in}}$, $\delta^2_{jn} = h^2_{jn}(1 - h^2_{jn}) \sum_{i \in \mathrm{upstream}} W^3_{ij} \delta^3_{in}$, $\delta^1_{kn} = h^1_{kn}(1 - h^1_{kn}) \sum_{j \in \mathrm{upstream}} W^2_{jk} \delta^2_{jn}$ Updates: $W^2_{jk} \leftarrow W^2_{jk} - \eta\, \delta^2_{jn} h^1_{kn}$, $b^2_j \leftarrow b^2_j - \eta\, \delta^2_{jn}$ Slide courtesy of Max Welling
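The upward/downward passes above can be sketched as a tiny network trained by SGD; a one-hidden-layer variant (rather than the slide's two hidden layers) on the XOR toy problem, with illustrative sizes and learning rate:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR toy data
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])

# One hidden layer with 8 sigmoid units, one sigmoid output
W1 = rng.normal(size=(8, 2)); b1 = np.zeros(8)
W2 = rng.normal(size=8);      b2 = 0.0

def forward(x):
    h = sigma(W1 @ x + b1)          # upward pass
    return h, sigma(W2 @ h + b2)

def loss():
    return sum((forward(x)[1] - t) ** 2 for x, t in zip(X, y))

loss_before = loss()
eta = 0.5
for epoch in range(2000):
    for i in rng.permutation(4):    # one sample per SGD step
        x, t = X[i], y[i]
        h, yhat = forward(x)
        # Downward pass: deltas use sigma'(z) = s * (1 - s)
        d2 = (yhat - t) * yhat * (1.0 - yhat)
        d1 = h * (1.0 - h) * (W2 * d2)
        W2 -= eta * d2 * h; b2 -= eta * d2
        W1 -= eta * np.outer(d1, x); b1 -= eta * d1
loss_after = loss()
```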
SGD for Deep Learning Tricks to Train NNs Why is SGD for NNs difficult? Non-convex problem, no global minimum More parameters than training samples, easy to overfit Tricks list: Averaged SGD Normalization of inputs Smartly manipulate the Hessian GPU speed up Drop-out Pre-training References: http://leon.bottou.org/projects/sgd Y. LeCun, L. Bottou et al., 1998 L. Bottou, 2012
Discussions Difficulties in large scale classification Why should we do large scale classification?