Bilevel Sparse Coding

Adobe Research 345 Park Ave, San Jose, CA Mar 15, 2013

Sparse Modeling
Many types of sensory data, e.g., images and audio, live in high-dimensional spaces but have low intrinsic dimension. A sparse representation in some domain is a simple model and an effective prior.
Sparse representation: represent data in the most parsimonious terms, $x = Dz$, where $x \in \mathbb{R}^d$, $D \in \mathbb{R}^{d \times K}$, and $\|z\|_0 \ll d$.
Sparsity is a driving factor for broad applications: compressive sensing, low-rank matrices, etc.; compression, denoising, deblurring, super-resolution, etc.; recognition, subspace clustering, deep learning, etc.

Sparse Coding: Quest for a Dictionary
Signals are normally mixtures of diverse phenomena; how can we wisely choose D to perform well on the given signals?
A data-driven solution: train adaptive dictionaries from the given signal instances for sparse representation.
Given training data $\{x_i\}_{i=1}^N$, the dictionary learning problem, in its most popular form, can be formulated as
$$\min_{D,\{\alpha_i\}_{i=1}^N} \sum_{i=1}^N \|x_i - D\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1, \quad \text{s.t. } \|D(:,j)\|_2 \le 1,$$
where $D \in \mathbb{R}^{d \times K}$ ($d < K$) is an over-complete dictionary.
Problem: it only cares about low-level sparse reconstruction, not the high-level task!
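
As a concrete reference point for this classical formulation, here is a minimal numpy sketch (not the authors' implementation) of the usual alternating scheme: sparse codes via ISTA with soft-thresholding, then a projected gradient step on the dictionary. The function names, step sizes, and iteration counts are illustrative choices, not anything prescribed by the slides; later snippets reuse `ista_code` from this sketch.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_code(D, x, lam, n_iter=200):
    """Solve min_a ||x - D a||_2^2 + lam*||a||_1 with ISTA (proximal gradient)."""
    step = 1.0 / (np.linalg.norm(D, 2) ** 2 + 1e-12)   # from the spectral norm of D
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a - step * (D.T @ (D @ a - x)), lam * step / 2.0)
    return a

def dictionary_learning(X, K, lam, n_epochs=20, lr=0.1):
    """Alternate between sparse coding and a projected gradient step on D.
    X: (d, N) data matrix; returns a dictionary with column norms <= 1."""
    d, N = X.shape
    rng = np.random.default_rng(0)
    D = rng.standard_normal((d, K))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_epochs):
        A = np.stack([ista_code(D, X[:, i], lam) for i in range(N)], axis=1)
        D -= lr * (D @ A - X) @ A.T / N                      # reconstruction gradient
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)      # project onto the unit ball
    return D
```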

Quest for Dictionary
Many vision and learning tasks can be formulated on top of sparse representations: image feature learning, image super-resolution, compressive sensing, image classification, etc.
We relate the low-level dictionary learning to the high-level task naturally through a bilevel formulation.
Goal: learn a more meaningful sparse representation for the given task.
Advantage: the training procedure is fully consistent with the testing objective.

Bilevel Optimization
Mathematical programs with optimization problems in the constraints:
$$\min_{x \in X,\, y} F(x, y) \quad \text{s.t. } G(x, y) \le 0, \quad y = \arg\min_{y'} f(x, y') \;\text{ s.t. } g(x, y') \le 0.$$
F and f are the upper-level and lower-level objective functions, respectively. G and g are the upper-level and lower-level constraints, respectively.

Bilevel Optimization
A simple example: the toll-setting problem on a transportation network. The network manager maximizes the revenue raised from tolls, while network users minimize their travel costs:
$$\max_{T, f, x} \sum_{a \in \bar{A}} T_a x_a \quad \text{s.t. } l_a \le T_a \le u_a, \; a \in \bar{A}, \qquad (f, x) \in \arg\min_{f, x} \sum_{a \in A} c_a x_a + \sum_{a \in \bar{A}} T_a x_a \;\text{ s.t. } \ldots$$

The Learning Model
A generic bilevel learning model:
$$\min_{D, \Theta} \frac{1}{N} \sum_{i=1}^N L(D, z_i, \Theta) \quad \text{s.t. } z_i = \arg\min_\alpha \|\alpha\|_1 \text{ s.t. } \|x_i - D\alpha\|_2^2 \le \epsilon, \; \forall i; \quad G(\Theta) \le 0; \quad \|D(:, k)\|_2 \le 1, \; \forall k.$$
L is a smooth cost function defined by the specific task, and Θ is the parameter set of the specific model. $\{x_i\}_{i=1}^N$ are training samples from the input space X. The model may involve more than one feature space.

A Simple Example
Coupled sparse coding: relate two feature spaces by their common sparse representations.
$$\min_{D_x, D_y} \frac{1}{N} \sum_{i=1}^N \|z_i^x - z_i^y\|_2^2$$
s.t. $z_i^x = \arg\min_\alpha \|\alpha\|_1$ s.t. $\|x_i - D_x \alpha\|_2^2 \le \epsilon_x, \forall i$; $\; z_i^y = \arg\min_\alpha \|\alpha\|_1$ s.t. $\|y_i - D_y \alpha\|_2^2 \le \epsilon_y, \forall i$; $\; \|D_x(:, k)\|_2 \le 1, \|D_y(:, k)\|_2 \le 1, \forall k$,
where $\{x_i, y_i\}_{i=1}^N$ are randomly sampled from the joint space $X \times Y$.

A Difficult Problem
Bilevel optimization: a mathematical program with an optimization problem in the constraints:
$$\min_{D, \Theta} \frac{1}{N} \sum_{i=1}^N L(D, z_i, \Theta) \quad \text{s.t. } z_i = \arg\min_\alpha \|\alpha\|_1 \text{ s.t. } \|x_i - D\alpha\|_2^2 \le \epsilon, \; \forall i; \quad G(\Theta) \le 0; \quad \|D(:, k)\|_2 \le 1, \; \forall k.$$
The optimization over D is a bilevel program: L is the upper-level objective, and the $\ell_1$-norm minimization is the lower-level problem. The problem is highly nonconvex and highly nonlinear.

Descent Method?
Regarding z as an implicit function of D through the lower-level problem, the bilevel program can be viewed solely in terms of the upper-level variable D.
Applying the chain rule, whenever $\nabla_D z(D)$ is well defined,
$$\nabla_D L(D, z(D), \Theta) = \partial_D L(D, z, \Theta) + \partial_z L(D, z, \Theta)\, \nabla_D z(D).$$
Problem: is the gradient $\nabla_D z(D)$ available, where
$$z = \arg\min_\alpha \|\alpha\|_1, \quad \text{s.t. } \|x - D\alpha\|_2^2 \le \epsilon\,?$$

Differentiability
Lasso: the $\ell_1$-norm minimization problem can be reformulated as the Lasso problem
$$z = \arg\min_\alpha \|x - D\alpha\|_2^2 + \lambda \|\alpha\|_1.$$
Transition points (Efron et al. 2004): for a given response vector x, there is a finite sequence of λ's, $\lambda_0 > \lambda_1 > \cdots > \lambda_K = 0$, such that whenever λ lies strictly between two consecutive transition points $\lambda_m$ and $\lambda_{m+1}$, the active set $\Lambda = \{k : z(k) \ne 0\}$ and the sign vector $\mathrm{sign}(z_\Lambda)$ are constant with respect to λ.

Differentiability
Theorem: Fix any λ > 0. If λ is not a transition point for x, then the active set Λ and the sign vector $\mathrm{sign}(z_\Lambda)$ are locally constant with respect to both x and D.

Differentiability
If λ is not a transition point of x, we have the equiangular conditions
$$\text{(a)}\;\; \frac{\partial \|x - Dz\|_2^2}{\partial z(k)} + \lambda\, \mathrm{sign}(z(k)) = 0 \;\; \text{for } k \in \Lambda, \qquad \text{(b)}\;\; \left|\frac{\partial \|x - Dz\|_2^2}{\partial z(k)}\right| < \lambda \;\; \text{for } k \notin \Lambda.$$
Applying implicit differentiation to Eqn. (a), we have
$$\frac{\partial z_\Lambda}{\partial D_\Lambda} = \left(D_\Lambda^T D_\Lambda\right)^{-1} \left(\frac{\partial D_\Lambda^T x}{\partial D_\Lambda} - \frac{\partial\, D_\Lambda^T D_\Lambda}{\partial D_\Lambda}\, z_\Lambda\right).$$

Differentiability
Let Ω denote the nonactive set. We observe that:
Since $z_\Lambda$ depends only on $D_\Lambda$, a perturbation of $D_\Omega$ does not change its value; therefore $\partial z_\Lambda / \partial D_\Omega = 0$. (1)
Since Λ and $\mathrm{sign}(z_\Lambda)$ are constant under a small perturbation of D, $z_\Omega$ stays zero, so $\partial z_\Omega / \partial D = 0$. (2)
Therefore, the nonzero part of $\nabla_D z(D)$ is given by $\partial z_\Lambda / \partial D_\Lambda$.
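
To make the implicit gradient concrete, here is a minimal numpy sketch (my own simplification, not the authors' code). Folding $\partial z_\Lambda / \partial D_\Lambda$ into the chain rule with an upper-level gradient $\partial L / \partial z$ collapses to two outer products; `ista_code` refers to the earlier sketch, and a finite-difference check is a good sanity test since the formula only holds away from transition points.

```python
def loss_grad_wrt_D(D, x, z, grad_z, tol=1e-10):
    """Gradient of an upper-level loss L w.r.t. the dictionary D, given
    grad_z = dL/dz at z = argmin_a ||x - D a||_2^2 + lam*||a||_1,
    assuming the active set and signs of z are locally constant."""
    active = np.abs(z) > tol                                  # active set Lambda
    Da, za = D[:, active], z[active]
    beta = np.linalg.solve(Da.T @ Da, grad_z[active])         # (D_A^T D_A)^{-1} dL/dz_A
    grad_D = np.zeros_like(D)
    # chain rule collapsed to outer products; columns in Omega get zero gradient
    grad_D[:, active] = np.outer(x - Da @ za, beta) - np.outer(Da @ beta, za)
    return grad_D
```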

Stochastic Gradient Descent
Given $\nabla_D z(D)$, $\nabla_D L$ can be evaluated. Applying stochastic gradient descent,
$$D^{n+1} = D^n - r_n\, \nabla_D L^n / \|\nabla_D L^n\|_2, \qquad r_n = r_0\, (n/(n+1))^p,$$
where p controls the shrinkage rate of the step size.
Project the updated dictionary onto the unit ball after each step.
The complete optimization procedure alternately optimizes over D and Θ.
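
A sketch of one such update, reusing `ista_code` and `loss_grad_wrt_D` from the earlier snippets. The gradient-normalized step and the unit-ball projection follow the slide; the batching and the `upper_grad` callback (which supplies the task-specific $\partial L/\partial z$) are my own framing for illustration.

```python
def project_columns_to_unit_ball(D):
    # enforce ||D(:,k)||_2 <= 1 for every atom
    norms = np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1.0)
    return D / norms

def sgd_step(D, batch, lam, r, upper_grad):
    """One projected, gradient-normalized SGD step on the dictionary.
    upper_grad(x, z) returns dL/dz for the task-specific upper-level loss."""
    grad = np.zeros_like(D)
    for x in batch:
        z = ista_code(D, x, lam)
        grad += loss_grad_wrt_D(D, x, z, upper_grad(x, z))
    D_new = D - r * grad / max(np.linalg.norm(grad), 1e-12)
    return project_columns_to_unit_ball(D_new)
```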

Single-Frame Super-Resolution
Problem: given a single low-resolution input and a set of high-/low-resolution training patch pairs sampled from similar images, reconstruct a high-resolution version of the input.
Applications: photo zooming (e.g., Photoshop, Genuine Fractals), photo printing, video standard conversion, etc.
Difficulty: single-image super-resolution is an extremely ill-posed problem.

Super-Resolution via Sparse Recovery
High-resolution patches have sparse representations in terms of some over-complete dictionary: $x = D_h z_0$, where $x \in \mathbb{R}^m$, $D_h \in \mathbb{R}^{m \times K}$, and $\|z_0\|_0 \ll m$.
We do not observe the high-resolution patch x, but only its low-resolution version $y \in \mathbb{R}^n$:
$$y = Lx = L D_h z_0 = D_l z_0,$$
where L is the sampling matrix (blurring and downsampling), so y gives n linear measurements of the sparse coefficients $z_0$.
Sparse recovery: if we can obtain $z_0$ from $y = D_l z$ (an underdetermined linear system), we can recover x as $D_h z_0$.

Super-Resolution via Sparse Recovery
Assume we have the coupled dictionaries $D_h$ and $D_l$. Input: a low-resolution image Y.
Find a sparse solution for each patch $y_p$ of Y by
$$z_0 = \arg\min_z \|D_l z - y_p\|_2^2 + \lambda \|z\|_1.$$
Recover the corresponding high-resolution image patch as $x_p = D_h z_0$.
How do we train $D_l$ and $D_h$ for good recovery?
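
A toy rendering of this per-patch recovery loop, assuming the dictionaries are already trained; `ista_code` is the earlier sketch, and `extract_patches` / `place_patch` are hypothetical helpers for patch extraction and overlap averaging, not part of the slides.

```python
def super_resolve(Y, D_l, D_h, lam, lr_patch=5, scale=2):
    """Upscale image Y patch by patch with the coupled dictionaries."""
    H, W = Y.shape[0] * scale, Y.shape[1] * scale
    X_hat = np.zeros((H, W))
    weight = np.zeros((H, W))                        # for averaging overlapping patches
    for y_p, loc in extract_patches(Y, lr_patch):    # hypothetical helper
        z = ista_code(D_l, y_p.ravel(), lam)         # sparse code of the LR patch
        x_p = (D_h @ z).reshape(lr_patch * scale, lr_patch * scale)
        place_patch(X_hat, weight, x_p, loc, scale)  # hypothetical helper
    return X_hat / np.maximum(weight, 1e-8)
```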

Joint Dictionary Training: Previous Approach
Our previous solution: randomly sample high- and low-resolution image patch pairs $\{x_i, y_i\}_{i=1}^N$ from the training data and learn $D_h$, $D_l$ jointly:
$$\min_{D_h, D_l, \{z_i\}} \sum_{i=1}^N \|x_i - D_h z_i\|_2^2 + \|y_i - D_l z_i\|_2^2 + \lambda \|z_i\|_1, \quad \text{s.t. } \|D_h(:, k)\|_2 \le 1, \; \|D_l(:, k)\|_2 \le 1.$$
However, ...

Joint Dictionary Training: Problem
In training, we solve
$$\min_{D_h, D_l, \{z_i\}} \sum_{i=1}^N \|x_i - D_h z_i\|_2^2 + \|y_i - D_l z_i\|_2^2 + \lambda \|z_i\|_1.$$
In testing, we only have the low-resolution patch $y_i$, so the code is found from $\min_{z_i} \|y_i - D_l z_i\|_2^2 + \lambda \|z_i\|_1$ alone (the term $\|x_i - D_h z_i\|_2^2$ is unavailable); therefore, good reconstruction of $x_i$ is not guaranteed.

Bilevel Formulation
Goal: learn $D_h$ and $D_l$ such that the sparse representation z of y in terms of $D_l$ can well reconstruct x with $D_h$.
Given high- and low-resolution training patch pairs $\{x_i, y_i\}_{i=1}^N$, the learning model is formulated as
$$\min_{D_h, D_l} \frac{1}{N} \sum_{i=1}^N \|D_h z_i - x_i\|_2^2 \quad \text{s.t. } z_i = \arg\min_\alpha \|\alpha\|_1 \text{ s.t. } \|y_i - D_l \alpha\|_2^2 \le \epsilon; \quad \|D_l(:, k)\|_2 \le 1, \; \|D_h(:, k)\|_2 \le 1.$$
The training process is completely consistent with testing.
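
For this particular upper-level loss, the two gradients needed by the SGD sketch above follow directly from $L = \|D_h z - x\|_2^2$: the gradient with respect to z feeds the implicit gradient for $D_l$, and the gradient with respect to $D_h$ is an ordinary outer product. A minimal sketch under those assumptions:

```python
def sr_upper_grads(D_h, x, z):
    """Gradients of L = ||D_h z - x||_2^2 for the bilevel SR objective."""
    r = D_h @ z - x
    grad_z  = 2.0 * D_h.T @ r        # dL/dz: passed to loss_grad_wrt_D for the D_l update
    grad_Dh = 2.0 * np.outer(r, z)   # dL/dD_h: direct gradient for the D_h update
    return grad_z, grad_Dh
```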

Results
Setting: 100,000 high- and low-resolution 5×5 image patch pairs are sampled for training and 100,000 for testing. $D_h$ and $D_l$ are initialized from joint dictionary training. The learning algorithm converges in 5 iterations.

Pixel-wise MSE reduction compared with joint dictionary training (5×5 patch positions):

        1        2        3        4        5
1    21.61%   19.60%   21.89%   18.91%   20.55%
2    17.43%   15.75%   17.92%   15.69%   14.70%
3    17.15%   16.96%   19.95%   17.57%   15.99%
4    16.41%   17.78%   18.30%   16.80%   15.82%
5    20.48%   14.68%   15.52%   14.64%   20.51%

SR Results Visual comparison: Top: joint dictionary training; bottom: bilevel sparse coding.

Practical Implementation
Learn fast sparse coding approximations with a neural network, and use selective patch processing. Upscaling an image from 200×200 to 800×800 takes about 5-6 s on a single 3 GHz core with 4 GB RAM, making it one of the fastest SR algorithms.
Visual comparison: input, bicubic interpolation, and our result.

Feature Representation by Pooling Sparse Codes
[Figure: the image feature extraction diagram.]
A simple two-layer network.
Coding: VQ, soft assignment, LLC, sparse coding, linear filtering.
Pooling: average, energy, max, log, $\ell_p$.
Works well on diverse recognition benchmarks: object, scene, action, face, digit, gender, expression, age estimation, and so on.
Key component of the winning system for PASCAL 2009 image recognition.

The Feature Extraction Algorithm
1. Represent image X as sets of local descriptors in a spatial pyramid: $X = [Y_{11}^0, Y_{11}^1, Y_{12}^1, \ldots, Y_{44}^2]$.
2. Given the dictionary D, encode the local descriptors into sparse codes by $\hat{Z}_{ij}^s = \arg\min_A \|Y_{ij}^s - DA\|_2^2 + \lambda \|A\|_1$, obtaining $S = [\hat{Z}_{11}^0, \hat{Z}_{11}^1, \hat{Z}_{12}^1, \ldots, \hat{Z}_{44}^2]$.
3. Max-pool over each set of sparse codes and concatenate:
$$\beta = \big[\beta_{ij}^s\big]_{s=0,\ldots,2;\; i,j=1,\ldots,2^s}, \quad \text{where } \beta_{ij}^s = \max\big(\hat{Z}_{ij}^s\big).$$
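
The sketch below mirrors steps 2-3 for a single image, reusing `ista_code`; pooling the absolute values and the grid helper `cell_mask` are illustrative assumptions on my part rather than details stated on the slide.

```python
def pool_cell(D, Y_cell, lam):
    """Sparse-code the descriptors in one pyramid cell and max-pool per atom.
    Y_cell: (d, n) matrix of local descriptors falling in this cell."""
    Z = np.stack([ista_code(D, y, lam) for y in Y_cell.T], axis=1)   # (K, n)
    return np.abs(Z).max(axis=1)                                     # (K,) pooled feature

def spatial_pyramid_feature(D, descriptors, positions, lam, grids=(1, 2, 4)):
    """Concatenate max-pooled sparse codes over 1x1, 2x2, and 4x4 spatial grids."""
    feats = []
    for g in grids:
        for i in range(g):
            for j in range(g):
                mask = cell_mask(positions, g, i, j)   # hypothetical helper: descriptors in cell (i, j)
                feats.append(pool_cell(D, descriptors[:, mask], lam))
    return np.concatenate(feats)
```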

Unsupervised Dictionary Learning
Randomly sample a set of local descriptors $\{x_i\}_{i=1}^N$ from the training set, and use standard sparse coding techniques to learn a dictionary D that sparsely represents the data:
$$\min_{D, \{\alpha_i\}_{i=1}^N} \sum_{i=1}^N \|x_i - D\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1, \quad \text{s.t. } \|D(:, k)\|_2 \le 1.$$
Optimization is performed in an alternating fashion: fix D and optimize $\{\alpha_i\}_{i=1}^N$; then fix $\{\alpha_i\}_{i=1}^N$ and optimize D.

Supervised Dictionary Learning
Unsupervised dictionary learning is good for reconstruction, but not necessarily effective for classification.
Training data come with image labels, $\{(X_i, y_i)\}_{i=1}^N$.
Train the dictionary together with the classifier:
$$\min_{D, w} \sum_{i=1}^N \ell\big(y_i, f(\beta_i, w)\big) + \gamma \|w\|_2^2 \quad \text{s.t. } \beta_i = \mathrm{pooling}(Z_i), \; Z_i = \arg\min_A \|X_i - DA\|_2^2 + \lambda \|A\|_1, \; \|D(:, k)\|_2 \le 1, \; \forall k,$$
where $\ell(\cdot)$ is a loss function and $f(\cdot, w)$ is the linear prediction model. Optimization over w is ordinary classifier training; optimization over D is a bilevel program.
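
The only new ingredient relative to the super-resolution case is propagating the classification loss through the pooling step before the implicit gradient takes over. A small sketch of that back-propagation, assuming the absolute-value max pooling used in the earlier feature sketch (an assumption, not something specified on the slide):

```python
def maxpool_backward(Z, grad_beta):
    """Back-propagate dL/dbeta through beta = |Z|.max(axis=1).
    Z: (K, n) sparse codes in one pooling cell; returns dL/dZ of the same shape."""
    grad_Z = np.zeros_like(Z)
    rows = np.arange(Z.shape[0])
    winners = np.abs(Z).argmax(axis=1)                     # winning descriptor per atom
    grad_Z[rows, winners] = grad_beta * np.sign(Z[rows, winners])
    return grad_Z                                          # each column feeds loss_grad_wrt_D
```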

Face Recognition: CMU Multi-PIE Database
This dataset contains 337 subjects with simultaneous variations in pose, expression, and illumination. We use session 1 for training and the remaining sessions 2-4 for testing. The dataset is challenging due to the large number of subjects and the natural variation in subject appearance over time.

Face Recognition Results
Face recognition error (%) on large-scale Multi-PIE:

Method         Session 2   Session 3   Session 4
LDA              50.6        55.7        52.1
NN               32.7        33.8        37.2
NS               22.4        25.7        26.6
SR                8.6         9.7         9.8
U-SC              5.4         9.0         7.5
S-SC              4.8         6.6         4.9
Improvement      11.1%       26.7%       34.7%

Gender Recognition: FRGC 2.0
The dataset contains 568 individuals and 14,714 face images in total, under various lighting conditions and backgrounds. 11,700 images from 451 randomly chosen individuals serve as the training set, and 3,014 images from the remaining 114 persons form the testing set.

Classification error (%):

Algorithm     SVM (RBF)   CNN   U-SC   S-SC   Improvement
Error rate       8.6      5.9    6.8    5.3      22.1%

Handwritten Digit Recognition: MNIST
The dataset consists of 70,000 handwritten digits, of which 60,000 are used for training and the remaining 10,000 for testing.

Algorithm                   Error rate (%)
SVM (RBF)                       1.41
L1 sparse coding                2.02
Local coordinate coding         1.90
Deep Belief Network             1.20
CNN                             0.82
U-SC                            0.98
S-SC                            0.84
Improvement                    14.3%

Formulation
Let x be the original signal, Φ the sampling matrix, and $y = \Phi x$ the linear measurements. Compressive sensing recovery is performed by
$$z = \arg\min_\alpha \|\alpha\|_1, \;\text{ s.t. } y = \Phi D_x \alpha, \qquad \hat{x} = D_x z.$$
$D_x$ is important for the recovery quality.
With the training samples $\{x_i\}_{i=1}^N$, learn $D_x$ by directly minimizing the compressive sensing recovery error:
$$\min_{D_x} \frac{1}{N} \sum_{i=1}^N \|x_i - D_x z_i\|_2^2 \quad \text{s.t. } y_i = \Phi x_i, \; D_y = \Phi D_x, \; z_i = \arg\min_\alpha \|\alpha\|_1 \text{ s.t. } \|y_i - D_y \alpha\|_2^2 \le \epsilon.$$
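
The recovery step itself is just sparse coding against the effective dictionary $\Phi D_x$. A minimal sketch, reusing `ista_code` and replacing the equality constraint with its usual Lasso relaxation (my choice of relaxation and regularization weight, not the slides'):

```python
def cs_recover(D_x, Phi, x, lam=0.01):
    """Sense x with Phi and reconstruct it through the learned dictionary D_x."""
    y = Phi @ x                    # linear measurements
    D_y = Phi @ D_x                # effective dictionary in the measurement domain
    z = ista_code(D_y, y, lam)     # l1 recovery (Lasso relaxation of y = Phi D_x a)
    return D_x @ z                 # estimate of the original signal
```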

CS Results
Settings: 10,000 image patches of 16×16 are randomly sampled for training and 5,000 for testing from medical images. The Haar wavelet basis is used as the baseline and as the initialization. A Bernoulli random matrix is used as the sampling matrix.
[Figures: objective value vs. iteration number at a 10% sampling rate; recovery accuracy (PSNR) vs. sampling rate (0.10-0.30) on the test image patches, comparing the learned dictionary against the wavelet baseline.]

CS Results
Image recovery on the bone image with 20% measurements: ground truth; wavelet (22.8 dB); ours (27.6 dB).

Learning a meaningful representation is critical for many applications.
Many sparse coding based applications can be formulated as a bilevel program.
Bilevel programs are extremely useful in many hierarchical models.
More applications in computer vision and machine learning? E.g., model selection.