Adobe Research, 345 Park Ave, San Jose, CA. March 15, 2013.
Outline: sparse modeling and bilevel optimization; the learning model; the learning algorithm; applications (image super-resolution, image classification, compressive sensing); conclusion.
Sparse Modeling
Many types of sensory data, e.g., images and audio, live in high-dimensional spaces but have low intrinsic dimension: they admit a sparse representation in some domain. A simple model and an effective prior.
Sparse representation: represent data in the most parsimonious terms,
$$x = D z, \quad x \in \mathbb{R}^d,\ D \in \mathbb{R}^{d \times K},\ \|z\|_0 \ll d.$$
Sparsity is a driving factor for broad applications:
Compressive sensing, low-rank matrices, etc.
Compression, denoising, deblurring, super-resolution, etc.
Recognition, subspace clustering, deep learning, etc.
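To make the sparse-coding step concrete, here is a minimal NumPy sketch (not code from the talk) that computes a sparse code $z$ for a fixed dictionary $D$ with ISTA, using the Lasso form of the problem; the dictionary size, $\lambda$, and iteration count are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding, the proximal map of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code_ista(x, D, lam=0.1, n_iter=500):
    """Solve min_z ||x - D z||_2^2 + lam * ||z||_1 with ISTA.

    x: (d,) signal; D: (d, K) dictionary with K > d (over-complete).
    """
    # Step size from the Lipschitz constant of the smooth part (2 * ||D||_2^2).
    L = 2.0 * np.linalg.norm(D, 2) ** 2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ z - x)      # gradient of the quadratic term
        z = soft_threshold(z - grad / L, lam / L)
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, K = 16, 64
    D = rng.standard_normal((d, K))
    D /= np.linalg.norm(D, axis=0)          # unit-norm atoms
    z_true = np.zeros(K)
    z_true[rng.choice(K, 3, replace=False)] = rng.standard_normal(3)
    x = D @ z_true                          # a signal that is 3-sparse in D
    z = sparse_code_ista(x, D, lam=0.05)
    print("nonzeros in recovered code:", np.count_nonzero(np.abs(z) > 1e-3))
```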
Sparse Coding: Quest for Dictionary
Signals are normally mixtures of diverse phenomena; how can we wisely choose $D$ to perform well on the given signals?
A data-driven solution: train adaptive dictionaries from the given signal instances for sparse representations.
Given training data $\{x_i\}_{i=1}^N$, the dictionary learning problem, in its most popular form, can be formulated as
$$\min_{D, \{\alpha_i\}_{i=1}^N} \sum_{i=1}^N \|x_i - D\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1, \quad \text{s.t. } \|D(:,j)\|_2 \le 1,$$
where $D \in \mathbb{R}^{d \times K}$ ($d < K$) is an over-complete dictionary.
Problem: it only cares about low-level sparse reconstruction, not the high-level task!
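For illustration only, a small alternating-minimization sketch of this formulation: a batch ISTA step for the codes and a least-squares update of $D$ followed by projection of its columns onto the unit ball. This is a generic scheme, not necessarily the solver used in the talk; all sizes and hyperparameters are assumptions.

```python
import numpy as np

def learn_dictionary(X, K=64, lam=0.1, n_outer=20, n_ista=100, seed=0):
    """Alternating minimization for
        min_{D, A} ||X - D A||_F^2 + lam * sum_i ||a_i||_1,  s.t. ||D(:,k)||_2 <= 1.

    X: (d, N) training data, one sample per column. Returns D (d, K), A (K, N).
    """
    rng = np.random.default_rng(seed)
    d, N = X.shape
    D = rng.standard_normal((d, K))
    D /= np.linalg.norm(D, axis=0)
    A = np.zeros((K, N))
    for _ in range(n_outer):
        # Sparse coding step: update all codes with D fixed (batch ISTA).
        L = 2.0 * np.linalg.norm(D, 2) ** 2
        for _ in range(n_ista):
            G = A - 2.0 * D.T @ (D @ A - X) / L
            A = np.sign(G) * np.maximum(np.abs(G) - lam / L, 0.0)
        # Dictionary step: least-squares update with A fixed, then project columns.
        D = X @ A.T @ np.linalg.pinv(A @ A.T + 1e-8 * np.eye(K))
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)  # enforce ||D(:,k)||_2 <= 1
    return D, A
```

The least-squares-then-project dictionary step is a simple approximation of the constrained update; block coordinate descent over atoms is also common.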
Quest for Dictionary
Many vision and learning tasks can be formulated based on sparse representations: image feature learning, image super-resolution, compressive sensing, image classification, etc.
We relate the low-level dictionary learning to the high-level task naturally with a bilevel formulation.
Goal: learn a more meaningful sparse representation for the given task.
Advantage: the training procedure is totally consistent with the testing objective.
Bilevel Optimization
Mathematical programs with optimization problems in the constraints:
$$\begin{aligned} \min_{x \in X,\, y}\ & F(x, y) \\ \text{s.t. }\ & G(x, y) \le 0, \\ & y = \arg\min_{y} f(x, y), \ \text{s.t. } g(x, y) \le 0. \end{aligned}$$
$F$ and $f$ are the upper-level and lower-level objective functions, respectively. $G$ and $g$ are the upper-level and lower-level constraints, respectively.
Bilevel Optimization
Simple example: toll-setting problem on a transportation network. The network manager maximizes the revenue raised from tolls; network users minimize their travel costs.
$$\begin{aligned} \max_{T, f, x}\ & \sum_{a \in \bar{A}} T_a x_a \\ \text{s.t. }\ & l_a \le T_a \le u_a, \quad a \in \bar{A}, \\ & (f, x) \in \arg\min_{f, x} \sum_{a \in A} c_a x_a + \sum_{a \in \bar{A}} T_a x_a, \quad \text{s.t. } \ldots \end{aligned}$$
The Learning Model
A generic bilevel learning model:
$$\begin{aligned} \min_{D, \Theta}\ & \frac{1}{N}\sum_{i=1}^N L(D, z_i, \Theta) \\ \text{s.t. }\ & z_i = \arg\min_{\alpha} \|\alpha\|_1, \ \text{s.t. } \|x_i - D\alpha\|_2^2 \le \epsilon, \ \forall i, \\ & G(\Theta) \le 0, \quad \|D(:,k)\|_2 \le 1, \ \forall k. \end{aligned}$$
$L$ is some smooth cost function defined by the specific task. $\Theta$ is the parameter set of a specific model. $\{x_i\}_{i=1}^N$ are training samples from the input space $\mathcal{X}$; the model may involve more than one feature space.
A Simple Example
Coupled sparse coding: relate two feature spaces by their common sparse representations.
$$\begin{aligned} \min_{D_x, D_y}\ & \frac{1}{N}\sum_{i=1}^N \|z_i^x - z_i^y\|_2^2 \\ \text{s.t. }\ & z_i^x = \arg\min_{\alpha} \|\alpha\|_1, \ \text{s.t. } \|x_i - D_x\alpha\|_2^2 \le \epsilon_x, \ \forall i, \\ & z_i^y = \arg\min_{\alpha} \|\alpha\|_1, \ \text{s.t. } \|y_i - D_y\alpha\|_2^2 \le \epsilon_y, \ \forall i, \\ & \|D_x(:,k)\|_2 \le 1, \quad \|D_y(:,k)\|_2 \le 1, \ \forall k, \end{aligned}$$
where $\{x_i, y_i\}_{i=1}^N$ are randomly sampled from the joint space $\mathcal{X} \times \mathcal{Y}$.
A Difficult Problem
Bilevel optimization: mathematical programs with optimization problems in the constraints.
$$\begin{aligned} \min_{D, \Theta}\ & \frac{1}{N}\sum_{i=1}^N L(D, z_i, \Theta) \\ \text{s.t. }\ & z_i = \arg\min_{\alpha} \|\alpha\|_1, \ \text{s.t. } \|x_i - D\alpha\|_2^2 \le \epsilon, \ \forall i, \\ & G(\Theta) \le 0, \quad \|D(:,k)\|_2 \le 1, \ \forall k. \end{aligned}$$
The optimization for $D$ is a bilevel program: $L$ is the upper-level objective and the $\ell_1$-norm minimization is the lower-level problem. Highly nonconvex and highly nonlinear.
Descent Method?
Regarding $z$ as an implicit function of $D$ through the lower-level problem, the bilevel program can be viewed solely in terms of the upper-level variable $D$.
Applying the chain rule, whenever $\nabla_D z(D)$ is well defined,
$$\nabla_D L(D, z(D), \Theta) = \partial_D L(D, z, \Theta) + \partial_z L(D, z, \Theta)\, \nabla_D z(D).$$
Problem: is the gradient $\nabla_D z(D)$ available?
$$z = \arg\min_{\alpha} \|\alpha\|_1, \quad \text{s.t. } \|x - D\alpha\|_2^2 \le \epsilon.$$
Differentiability
Lasso: the $\ell_1$-norm minimization problem can be reformulated as the Lasso problem
$$z = \arg\min_{\alpha} \|x - D\alpha\|_2^2 + \lambda\|\alpha\|_1.$$
Transition point (Efron et al. 2004): for a given response vector $x$, there is a finite sequence of $\lambda$'s, $\lambda_0 > \lambda_1 > \cdots > \lambda_K = 0$, such that if $\lambda$ lies in the interval $(\lambda_{m+1}, \lambda_m)$, the active set $\Lambda = \{k : z(k) \neq 0\}$ and the sign vector $\mathrm{sign}(z_\Lambda)$ are constant with respect to $\lambda$.
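A quick self-contained experiment (illustrative, not from the talk) that makes the transition-point picture visible: sweep $\lambda$ downward and print the active set each time it changes; between printed values the active set stays constant.

```python
import numpy as np

def lasso_ista(x, D, lam, n_iter=3000):
    """ISTA for min_z ||x - D z||_2^2 + lam * ||z||_1."""
    L = 2.0 * np.linalg.norm(D, 2) ** 2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = z - 2.0 * D.T @ (D @ z - x) / L
        z = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)
    return z

rng = np.random.default_rng(1)
d, K = 10, 30
D = rng.standard_normal((d, K))
D /= np.linalg.norm(D, axis=0)
x = rng.standard_normal(d)

prev = None
for lam in np.linspace(1.5, 0.05, 40):          # sweep lambda downward
    active = tuple(np.flatnonzero(np.abs(lasso_ista(x, D, lam)) > 1e-6))
    if active != prev:                          # a transition point was crossed
        print(f"lambda ~ {lam:.3f}: active set {active}")
        prev = active
```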
Differentiability
Theorem: Fix any $\lambda > 0$ that is not a transition point for $x$. Then the active set $\Lambda$ and the sign vector $\mathrm{sign}(z_\Lambda)$ are locally constant with respect to both $x$ and $D$.
Differentiability
If $\lambda$ is not a transition point of $x$, we have the equiangular conditions
$$\text{(a)}\quad \frac{\partial \|x - Dz\|_2^2}{\partial z(k)} + \lambda\,\mathrm{sign}(z(k)) = 0, \quad \text{for } k \in \Lambda,$$
$$\text{(b)}\quad \left|\frac{\partial \|x - Dz\|_2^2}{\partial z(k)}\right| < \lambda, \quad \text{for } k \notin \Lambda.$$
Applying implicit differentiation to Eqn. (a), we have
$$\frac{\partial z_\Lambda}{\partial D_\Lambda} = \left(D_\Lambda^T D_\Lambda\right)^{-1}\left(\frac{\partial D_\Lambda^T x}{\partial D_\Lambda} - \frac{\partial D_\Lambda^T D_\Lambda}{\partial D_\Lambda}\, z_\Lambda\right).$$
Differentiability
Let $\Omega$ denote the nonactive set. We observe:
Since $z_\Lambda$ depends only on $D_\Lambda$, a perturbation of $D_\Omega$ does not change its value. Therefore,
$$\frac{\partial z_\Lambda}{\partial D_\Omega} = 0. \qquad (1)$$
Since $\Lambda$ and $\mathrm{sign}(z_\Lambda)$ are constant under a small perturbation of $D$, $z_\Omega$ stays zero, so
$$\frac{\partial z_\Omega}{\partial D} = 0. \qquad (2)$$
Therefore, the nonzero part of $\nabla_D z(D)$ is given by $\partial z_\Lambda / \partial D_\Lambda$.
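Putting the pieces together, a sketch of the resulting Jacobian $\partial z / \partial D$: the block $\partial z_\Lambda / \partial D_\Lambda$ follows the implicit-differentiation formula above, and the blocks in Eqs. (1) and (2) are left at zero. Data layout and tolerances are assumptions.

```python
import numpy as np

def lasso_jacobian_wrt_D(x, D, z):
    """Jacobian dz/dD of the lasso solution z(D) at a non-transition point.

    Assumes z solves min_z ||x - D z||_2^2 + lam * ||z||_1 and that the active set
    Lambda = {k : z_k != 0} and sign(z_Lambda) stay constant under a small
    perturbation of D (the local-constancy theorem above).
    Returns J with J[j, i, k] = d z_j / d D[i, k]; entries outside the active
    set are zero, matching Eqs. (1) and (2).
    """
    d, K = D.shape
    Lam = np.flatnonzero(np.abs(z) > 1e-10)
    J = np.zeros((K, d, K))
    if Lam.size == 0:
        return J
    D_L, z_L = D[:, Lam], z[Lam]
    G_inv = np.linalg.inv(D_L.T @ D_L)      # (D_Lambda^T D_Lambda)^{-1}
    r = x - D_L @ z_L                       # residual
    for a, k in enumerate(Lam):             # only active columns of D matter
        for i in range(d):
            # d/dD[i,k] of (D_L^T x - D_L^T D_L z_L), with signs held fixed:
            v = -z_L[a] * D_L[i, :]
            v[a] += r[i]
            J[Lam, i, k] = G_inv @ v
    return J
```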
Stochastic Gradient Descent
Given $\nabla_D z(D)$, $\nabla_D L$ can be evaluated. Applying stochastic gradient descent,
$$D^{(n+1)} = D^{(n)} - r_n \frac{\partial L}{\partial D^{(n)}} \Big/ \left\|\frac{\partial L}{\partial D^{(n)}}\right\|_2, \qquad r_n = r_0 \,/\, (n/N + 1)^p,$$
where $p$ controls the shrinkage rate of the step size.
Project the updated dictionary onto the unit ball. The complete optimization procedure alternately optimizes over $D$ and $\Theta$.
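A sketch of one such update step, assuming the per-sample gradient $\partial L / \partial D$ has already been assembled via the chain rule; the step-size schedule shown is one plausible reading of $r_n$ and should be treated as an assumption.

```python
import numpy as np

def sgd_step_dictionary(D, grad_L_D, n, N, r0=0.01, p=0.5):
    """One projected stochastic gradient step on the dictionary.

    grad_L_D: dL/dD for the current sample (from dz/dD and the chain rule).
    r0, p and the normalized-gradient update follow the slide; the schedule
    r_n = r0 / (n/N + 1)^p is an assumed reconstruction.
    """
    r_n = r0 / (n / N + 1.0) ** p
    D_new = D - r_n * grad_L_D / (np.linalg.norm(grad_L_D) + 1e-12)
    # Project each updated atom back onto the unit ball.
    norms = np.linalg.norm(D_new, axis=0)
    D_new /= np.maximum(norms, 1.0)
    return D_new
```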
Single Frame Super-resolution
Problem: given a single low-resolution input and a set of pairs (high- and low-resolution) of training patches sampled from similar images, reconstruct a high-resolution version of the input.
Applications: photo zooming (e.g., Photoshop, Genuine Fractals), photo printing, video standard conversion, etc.
Difficulty: single-image super-resolution is an extremely ill-posed problem.
Super-resolution via Sparse Recovery
High-resolution patches have sparse representations in terms of some over-complete dictionary:
$$x = D_h z_0, \quad x \in \mathbb{R}^m,\ D_h \in \mathbb{R}^{m \times K},\ \|z_0\|_0 \ll m.$$
We do not observe the high-resolution patch $x$, but its low-resolution version $y \in \mathbb{R}^n$:
$$y = Lx = L D_h z_0 = D_l z_0,$$
where $L$ is the sampling matrix (blurring and downsampling), and $y$ gives $n$ linear measurements of the sparse coefficients $z_0$.
Sparse recovery: if we can obtain $z_0$ from $y = D_l z$ (an underdetermined linear system), we can recover $x$ as $D_h z_0$.
Super-resolution via Sparse Recovery
Assume we have the coupled dictionaries $D_h$ and $D_l$. Input: low-resolution image $Y$.
Find the sparse solution for each patch $y_p$ of $Y$ by
$$z_0 = \arg\min_z \|D_l z - y_p\|_2^2 + \lambda\|z\|_1.$$
Recover the corresponding high-resolution image patch as $x_p = D_h z_0$.
How to train $D_l$ and $D_h$ for good recovery?
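The test-time procedure above is simple enough to sketch per patch; a hedged NumPy version, with an ISTA Lasso solver and an assumed $\lambda$:

```python
import numpy as np

def super_resolve_patch(y_p, D_l, D_h, lam=0.1, n_iter=1000):
    """Recover a high-resolution patch from a low-resolution one via coupled
    dictionaries: sparse-code y_p on D_l, then reconstruct with D_h.
    D_l and D_h are assumed to share the same number of atoms K.
    """
    L = 2.0 * np.linalg.norm(D_l, 2) ** 2
    z = np.zeros(D_l.shape[1])
    for _ in range(n_iter):                 # ISTA for the lasso on D_l
        g = z - 2.0 * D_l.T @ (D_l @ z - y_p) / L
        z = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)
    return D_h @ z                          # high-resolution patch estimate x_p
```

In practice the patches of $Y$ would typically be processed with overlap and the overlapping high-resolution estimates averaged; that bookkeeping is omitted here.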
Joint Dictionary Training: Previous Approach
Our previous solution: randomly sample high- and low-resolution image patch pairs $\{x_i, y_i\}_{i=1}^N$ from the training data, and learn $D_h$, $D_l$ jointly:
$$\min_{D_h, D_l, \{z_i\}} \sum_{i=1}^N \|x_i - D_h z_i\|_2^2 + \|y_i - D_l z_i\|_2^2 + \lambda\|z_i\|_1, \quad \text{s.t. } \|D_h(:,k)\|_2 \le 1,\ \|D_l(:,k)\|_2 \le 1.$$
However, ...
Joint Dictionary Training: Problem
In training, we have
$$\min_{D_h, D_l, \{z_i\}} \sum_{i=1}^N \|x_i - D_h z_i\|_2^2 + \|y_i - D_l z_i\|_2^2 + \lambda\|z_i\|_1.$$
In testing, we only have the low-resolution patch $y_i$, so the term $\|x_i - D_h z_i\|_2^2$ drops out and the code is found by
$$\min_{z_i} \|y_i - D_l z_i\|_2^2 + \lambda\|z_i\|_1;$$
therefore, good reconstruction of $x_i$ is not guaranteed.
Bilevel Formulation
Goal: learn $D_h$ and $D_l$ such that the sparse representation $z$ of $y$ in terms of $D_l$ can well reconstruct $x$ with $D_h$.
Given high- and low-resolution training patch pairs $\{x_i, y_i\}_{i=1}^N$, the learning model is formulated as
$$\begin{aligned} \min_{D_h, D_l}\ & \frac{1}{N}\sum_{i=1}^N \|D_h z_i - x_i\|_2^2 \\ \text{s.t. }\ & z_i = \arg\min_{\alpha} \|\alpha\|_1, \ \text{s.t. } \|y_i - D_l\alpha\|_2^2 \le \epsilon, \\ & \|D_l(:,k)\|_2 \le 1, \quad \|D_h(:,k)\|_2 \le 1. \end{aligned}$$
The training process is completely consistent with testing.
Results
Setting: 100,000 high- and low-resolution 5×5 image patch pairs are sampled for training and 100,000 for testing. $D_h$ and $D_l$ are initialized from joint dictionary training. The learning algorithm converges in 5 iterations.
Pixel-wise MSE reduction compared with joint dictionary training (5×5 patch, row × column):

        1        2        3        4        5
1   21.61%   19.60%   21.89%   18.91%   20.55%
2   17.43%   15.75%   17.92%   15.69%   14.70%
3   17.15%   16.96%   19.95%   17.57%   15.99%
4   16.41%   17.78%   18.30%   16.80%   15.82%
5   20.48%   14.68%   15.52%   14.64%   20.51%
SR Results Visual comparison: Top: joint dictionary training; bottom: bilevel sparse coding.
Practical Implementation
Learn fast sparse coding approximations with a neural network; selective patch processing. Takes about 5 s to upscale an image from 200×200 to 800×800 on a single 3 GHz core with 4 GB RAM, one of the fastest SR algorithms.
Figure panels: Input, Bicubic, Ours.
Feature Representation by Pooling Sparse Codes Fig. The image feature extraction diagram.
Feature Representation by Pooling Sparse Codes
A simple two-layer network.
Coding: VQ, soft assignment, LLC, sparse coding, linear filtering.
Pooling: average, energy, max, log, $\ell_p$.
Works well on diverse recognition benchmarks: object, scene, action, face, digit, gender, expression, age estimation, and so on. Key component of the winning system for PASCAL09 image recognition.
The Feature Extraction Algorithm
1. Represent image $X$ as sets of local descriptors in a spatial pyramid:
$$X = \left[Y_{11}^0, Y_{11}^1, Y_{12}^1, \ldots, Y_{44}^2\right].$$
2. Given dictionary $D$, encode the local descriptors into sparse codes by
$$\hat{Z}_{ij}^s = \arg\min_A \|Y_{ij}^s - DA\|_2^2 + \lambda\|A\|_1,$$
and we obtain $S = [\hat{Z}_{11}^0, \hat{Z}_{11}^1, \hat{Z}_{12}^1, \ldots, \hat{Z}_{44}^2]$.
3. Max-pool over each set of sparse codes and concatenate:
$$\beta = \left[\beta_{ij}^s\right]_{s=0,\ldots,2;\ i,j=1,\ldots,2^s}, \qquad \beta_{ij}^s = \max\bigl(\hat{Z}_{ij}^s\bigr),$$
where the max is taken across the codes in each set.
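A compact sketch of steps 1-3 for a single image, assuming descriptors come with normalized spatial positions and that max pooling is taken over absolute code values (a common convention, stated here as an assumption):

```python
import numpy as np

def pooled_feature(descriptors, positions, D, lam=0.15, n_iter=500):
    """Spatial-pyramid max pooling of sparse codes.

    descriptors: (n, d) local descriptors; positions: (n, 2) coordinates in [0, 1)^2;
    D: (d, K) dictionary. Returns a feature of length K * (1 + 4 + 16).
    """
    K = D.shape[1]
    # Step 2: sparse-code every descriptor with batch ISTA.
    L = 2.0 * np.linalg.norm(D, 2) ** 2
    X = descriptors.T
    Z = np.zeros((K, X.shape[1]))
    for _ in range(n_iter):
        G = Z - 2.0 * D.T @ (D @ Z - X) / L
        Z = np.sign(G) * np.maximum(np.abs(G) - lam / L, 0.0)
    # Step 3: max pooling of |codes| over each cell of the 1x1, 2x2, 4x4 pyramid.
    beta = []
    for grid in (1, 2, 4):
        cell = np.minimum((positions * grid).astype(int), grid - 1)  # cell per descriptor
        for i in range(grid):
            for j in range(grid):
                mask = (cell[:, 0] == i) & (cell[:, 1] == j)
                pooled = np.abs(Z[:, mask]).max(axis=1) if mask.any() else np.zeros(K)
                beta.append(pooled)
    return np.concatenate(beta)
```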
Unsupervised Dictionary Learning
Randomly sample a set of local descriptors $\{x_i\}_{i=1}^N$ from the training set, and use existing sparse coding techniques to learn a dictionary $D$ that can sparsely represent the data:
$$\min_{D, \{\alpha_i\}_{i=1}^N} \sum_{i=1}^N \|x_i - D\alpha_i\|_2^2 + \lambda\|\alpha_i\|_1, \quad \text{s.t. } \|D(:,k)\|_2 \le 1.$$
Optimization is performed in an alternating fashion: fix $D$ and optimize $\{\alpha_i\}_{i=1}^N$; then fix $\{\alpha_i\}_{i=1}^N$ and optimize $D$.
Supervised Dictionary Learning
The unsupervised dictionary learning is good for reconstruction, but not necessarily effective for classification. Given training data with image labels $\{(X_i, y_i)\}_{i=1}^N$, train the dictionary together with the classifier:
$$\begin{aligned} \min_{D, w}\ & \sum_{i=1}^N \ell\bigl(y_i, f(\beta_i, w)\bigr) + \gamma\|w\|_2^2 \\ \text{s.t. }\ & \beta_i = \mathrm{pooling}(Z_i), \\ & Z_i = \arg\min_A \|X_i - DA\|_2^2 + \lambda\|A\|_1, \\ & \|D(:,k)\|_2 \le 1, \ \forall k, \end{aligned}$$
where $\ell(\cdot)$ is a loss function and $f(\cdot, w)$ is the linear prediction model. Optimization for $w$ amounts to training the classifier; optimization for $D$ is a bilevel program.
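For concreteness, one possible upper-level objective with a logistic loss standing in for $\ell(\cdot)$ and a linear predictor $f(\beta, w) = w^T\beta$; the slide leaves the loss unspecified, so this particular choice is an assumption:

```python
import numpy as np

def upper_level_objective(betas, labels, w, gamma=1e-4):
    """Upper-level objective for the supervised model.

    betas: (N, p) pooled sparse-code features; labels: (N,) in {-1, +1};
    w: (p,) linear classifier weights. Logistic loss is an illustrative choice of l(.).
    """
    margins = labels * (betas @ w)
    loss = np.log1p(np.exp(-margins)).sum()   # sum_i l(y_i, f(beta_i, w))
    return loss + gamma * np.dot(w, w)        # plus the l2 regularizer on w
```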
Face Recognition: CMU Multi-PIE Database
This dataset contains 337 subjects across simultaneous variations in pose, expression, and illumination. We use session 1 for training and sessions 2-4 for testing. The dataset is challenging due to the large number of subjects and the natural variation in subject appearance over time.
Face Recognition
Face recognition error (%) on large-scale Multi-PIE:

Algorithm     Session 2   Session 3   Session 4
LDA              50.6        55.7        52.1
NN               32.7        33.8        37.2
NS               22.4        25.7        26.6
SR                8.6         9.7         9.8
U-SC              5.4         9.0         7.5
S-SC              4.8         6.6         4.9
Improvement     11.1%       26.7%       34.7%
Gender Recognition: FRGC 2.0
The dataset contains 568 individuals and 14,714 face images in total, under various lighting conditions and backgrounds. 11,700 images from 451 randomly chosen individuals serve as the training set, and 3,014 images from the remaining 114 persons form the testing set.

Classification error (%):
Algorithm     SVM (RBF)   CNN   U-SC   S-SC   Improvement
Error rate       8.6      5.9    6.8    5.3      22.1%
Handwritten Digit Recognition
MNIST: the dataset consists of 70,000 handwritten digits, of which 60,000 are used for training and the remaining 10,000 for testing.

Algorithm                  Error rate (%)
SVM (RBF)                      1.41
L1 sparse coding               2.02
Local coordinate coding        1.90
Deep Belief Network            1.20
CNN                            0.82
U-SC                           0.98
S-SC                           0.84
Improvement                   14.3%
Formulation
Let $x$ be the original signal, $\Phi$ the sampling matrix, and $y = \Phi x$ the linear measurements. Compressive sensing recovery is done by
$$z = \arg\min_{\alpha} \|\alpha\|_1, \ \text{s.t. } y = \Phi D_x \alpha, \qquad \hat{x} = D_x z.$$
$D_x$ is important for the recovery quality. With the training samples $\{x_i\}_{i=1}^N$, learn $D_x$ by directly minimizing the compressive sensing recovery error:
$$\begin{aligned} \min_{D_x}\ & \frac{1}{N}\sum_{i=1}^N \|x_i - D_x z_i\|_2^2 \\ \text{s.t. }\ & y_i = \Phi x_i, \quad D_y = \Phi D_x, \\ & z_i = \arg\min_{\alpha} \|\alpha\|_1, \ \text{s.t. } \|y_i - D_y\alpha\|_2^2 \le \epsilon. \end{aligned}$$
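A sketch of the recovery step, recasting the equality-constrained $\ell_1$ problem as a linear program (basis pursuit) and solving it with SciPy; the problem sizes and the use of linprog are illustrative assumptions rather than the talk's implementation.

```python
import numpy as np
from scipy.optimize import linprog

def cs_recover(y, Phi, D_x):
    """Compressive sensing recovery: solve min ||alpha||_1 s.t. y = Phi D_x alpha,
    then return x_hat = D_x alpha. The l1 problem is recast as a linear program
    with alpha = u - v, u >= 0, v >= 0 (basis pursuit).
    """
    A = Phi @ D_x                            # effective sensing dictionary D_y
    n, K = A.shape
    c = np.ones(2 * K)                       # minimize sum(u) + sum(v) = ||alpha||_1
    A_eq = np.hstack([A, -A])                # A (u - v) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    alpha = res.x[:K] - res.x[K:]
    return D_x @ alpha
```

The inequality-constrained (Lasso-style) variant used elsewhere in the talk could be substituted for the exact equality constraint when the measurements are noisy.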
CS Results
Settings: 10,000 image patches of 16×16 are randomly sampled for training and 5,000 for testing from medical images. The Haar wavelet basis is used as our baseline and initialization. A Bernoulli random matrix is used as the sampling matrix.
Fig.: Left, objective value vs. iteration number for a 10% sampling rate. Right, recovery accuracy comparison (PSNR) on the test image patches, learned dictionary vs. wavelet, across sampling rates.
CS Results
Image recovery on the bone image with 20% measurements: Ground truth; Wavelet (22.8 dB); Ours (27.6 dB).
Learning a meaningful representation is critical for many applications. Many sparse coding based applications can be formulated as a bilevel program. Bilevel programs are extremely useful in many hierarchical models. More applications in computer vision and machine learning? E.g., model selection.