Transfer Learning Algorithms for Image Classification


Transfer Learning Algorithms for Image Classification. Ariadna Quattoni, MIT CSAIL. Advisors: Michael Collins, Trevor Darrell.

Motivation. Goal: we want to be able to build classifiers for thousands of visual categories, and we want to exploit rich and complex feature representations. Problem: we might only have a few labeled samples per category. (Example images with caption words: team, actress, president.)

Thesis Contributions. We study efficient transfer algorithms for image classification which can exploit supervised training data from a set of related tasks: learning an image representation using supervised data from auxiliary tasks automatically derived from unlabeled images + meta-data; a feature sharing transfer algorithm based on joint regularization; and an efficient algorithm for training jointly sparse classifiers in high-dimensional feature spaces.

Outline. A joint sparse approximation model for transfer learning. Asymmetric transfer experiments. An efficient training algorithm. Symmetric transfer image annotation experiments.

Transfer Learning: A Brief Overview. The goal of transfer learning is to use labeled data from related tasks to make learning easier. Two settings. Asymmetric transfer: the resource is large amounts of supervised data for a set of related tasks; the goal is to improve performance on a target task for which training data is scarce. Symmetric transfer: the resource is a small amount of training data for a large number of related tasks; the goal is to improve average performance over all classifiers.

Transfer Learning: A Brief Overview. Three main approaches: learning intermediate latent representations [Thrun 1996, Baxter 1997, Caruana 1997, Argyriou 2006, Amit 2007]; learning priors over parameters [Raina 2006, Lawrence et al. 2004]; learning relevant shared features via joint sparse regularization [Torralba 2004, Obozinski 2006].

Feature Sharing Framework. Work with a rich representation: complex features in a high-dimensional space. Some of them will be very discriminative (hopefully); most will be irrelevant. Related problems may share relevant features. If we knew the relevant features we could learn from fewer examples and build more efficient classifiers. We can train classifiers from related problems together using a regularization penalty designed to promote joint sparsity.

[Figures (two slides): example related scene categories, Church, Airport, Grocery Store, Flower-Shop.]

Related Formulations of Joint Sparse Approximation. Torralba et al. [2004] developed a joint boosting algorithm based on the idea of learning additive models for each class that share weak learners. Obozinski et al. [2006] proposed an L1-2 joint penalty and developed a blockwise boosting scheme based on Boosted-Lasso.

Our Contribution. A new model and optimization algorithm for training jointly sparse classifiers in high-dimensional feature spaces. Previous approaches to joint sparse approximation (Torralba et al. 2004; Obozinski et al. 2006) have relied on greedy coordinate descent methods. We propose a simple and efficient global optimization algorithm with a guaranteed convergence rate of $O(1/\epsilon^2)$. Our algorithm can scale to large problems involving hundreds of problems and thousands of examples and features. We test our model on real image classification tasks, where we observe improvements in both the asymmetric and symmetric transfer settings, and we show that our algorithm can successfully recover jointly sparse solutions.

Notation. Collection of tasks: $D = \{D_1, D_2, \ldots, D_m\}$, where each task's training set is $D_k = \{(x_1, y_1), \ldots, (x_{n_k}, y_{n_k})\}$ with $x \in \mathbb{R}^d$ and $y \in \{+1, -1\}$. Joint sparse approximation: the parameters of the $m$ classifiers are collected in a matrix $W \in \mathbb{R}^{d \times m}$ whose entry $w_{i,k}$ is the coefficient of feature $i$ for task $k$:
$$W = \begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,m} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,m} \\ \vdots & & \ddots & \vdots \\ w_{d,1} & w_{d,2} & \cdots & w_{d,m} \end{bmatrix}$$

Single-Task Sparse Approximation. Consider learning a single sparse linear classifier of the form $f(x) = w \cdot x$. We want a few features with non-zero coefficients. Recent work suggests using L1 regularization:
$$\arg\min_w \; \sum_{(x,y) \in D} l(f(x), y) \; + \; Q \sum_{j=1}^{d} |w_j|$$
The first term is the classification error; the L1 term penalizes non-sparse solutions. Donoho [2004] proved (in a regression setting) that the solution with smallest L1 norm is also the sparsest solution.
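As a concrete reference point, the single-task objective can be evaluated in a few lines of NumPy; this is a sketch that instantiates the loss l as the hinge loss used later in the talk, with `Q` as the regularization constant.

```python
import numpy as np

def l1_hinge_objective(w, X, y, Q):
    """Single-task objective: hinge loss over D plus Q times the L1 norm of w."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w)).sum()   # classification error term
    return hinge + Q * np.abs(w).sum()                 # L1 penalty promotes sparsity
```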

Joint Sparse Approximation. Setting: for each task $k$ we learn a linear classifier $f_k(x) = w_k \cdot x$, and we solve jointly
$$\arg\min_{w_1, w_2, \ldots, w_m} \; \sum_{k=1}^{m} \frac{1}{|D_k|} \sum_{(x,y) \in D_k} l(f_k(x), y) \; + \; Q\, R(w_1, w_2, \ldots, w_m)$$
The first term is the average loss on the training sets; the regularizer $R$ penalizes solutions that utilize too many features.

Joint Regularization Penalty. How do we penalize solutions that use too many features? The ideal penalty counts the non-zero rows of $W$ (a row collects the coefficients of one feature across all classifiers; a column collects the coefficients of one classifier):
$$R(W) = \#\{\text{non-zero rows of } W\}$$
but this would lead to a hard combinatorial problem.

Joint Regularization Penalty. We will use an L1-∞ norm [Tropp 2006]:
$$R(W) = \sum_{i=1}^{d} \max_k |W_{i,k}|$$
This norm combines an L∞ norm on each row, which promotes non-sparsity within the row (share features), with an L1 norm on the maximum absolute values of the coefficients across tasks, which promotes sparsity (use few features). The combination of the two norms results in a solution where only a few features are used, but the features that are used contribute to solving many classification problems.
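The penalty itself is one line of NumPy: the largest absolute coefficient of each feature across tasks, summed over features. The helper name below is an illustrative assumption.

```python
import numpy as np

def l1_inf_penalty(W):
    """R(W) = sum over rows of the largest absolute coefficient across tasks."""
    return np.abs(W).max(axis=1).sum()

W = np.array([[0.0, 0.0, 0.0],
              [2.0, -1.0, 0.5]])   # feature 1 unused, feature 2 shared by all tasks
print(l1_inf_penalty(W))           # 2.0: only the used row contributes
```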

Joint Sparse Approximation. Using the L1-∞ norm we can rewrite our objective function as:
$$\min_W \; \sum_{k=1}^{m} \frac{1}{|D_k|} \sum_{(x,y) \in D_k} l(f_k(x), y) \; + \; Q \sum_{i=1}^{d} \max_k |W_{i,k}|$$
For any convex loss this is a convex objective. For the hinge loss, $l(f(x), y) = \max(0, 1 - y f(x))$, the optimization problem can be expressed as a linear program.

Joint Sparse Approximation. Linear program formulation (hinge loss). Objective:
$$\min_{W, \varepsilon, t} \; \sum_{k=1}^{m} \frac{1}{|D_k|} \sum_{j=1}^{|D_k|} \varepsilon_j^k \; + \; Q \sum_{i=1}^{d} t_i$$
Max-value constraints: for $k = 1, \ldots, m$ and $i = 1, \ldots, d$: $-t_i \le w_{i,k} \le t_i$. Slack-variable constraints: for $k = 1, \ldots, m$ and $j = 1, \ldots, |D_k|$: $y_j^k f_k(x_j^k) \ge 1 - \varepsilon_j^k$ and $\varepsilon_j^k \ge 0$.
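For concreteness, the LP can be assembled for an off-the-shelf solver. The sketch below uses scipy.optimize.linprog with variables stacked as [vec(W), slacks ε, row maxima t]; the function name and the per-task data layout (`Xs`, `ys`) are my own assumptions, and as a later slide notes this is only practical for small problems.

```python
import numpy as np
from scipy.optimize import linprog

def joint_sparse_lp(Xs, ys, Q):
    """Hinge-loss L1-inf joint sparse approximation as a linear program (sketch).

    Xs, ys: lists of per-task data (X_k of shape (n_k, d), y_k in {-1, +1}).
    Variables: [vec(W) (d*m), slacks eps (sum n_k), row maxima t (d)].
    """
    m, d = len(Xs), Xs[0].shape[1]
    ns = [X.shape[0] for X in Xs]
    N = sum(ns)
    n_var = d * m + N + d

    # Objective: sum_k (1/n_k) sum_j eps_{k,j} + Q * sum_i t_i
    c = np.zeros(n_var)
    offset = d * m
    for n_k in ns:
        c[offset:offset + n_k] = 1.0 / n_k
        offset += n_k
    c[d * m + N:] = Q

    A_ub, b_ub = [], []

    def w_idx(i, k):
        # column index of w_{i,k} in the stacked variable vector
        return k * d + i

    # Max-value constraints:  w_{i,k} <= t_i  and  -w_{i,k} <= t_i
    for k in range(m):
        for i in range(d):
            for sign in (+1.0, -1.0):
                row = np.zeros(n_var)
                row[w_idx(i, k)] = sign
                row[d * m + N + i] = -1.0
                A_ub.append(row)
                b_ub.append(0.0)

    # Slack constraints:  1 - y_j (w_k . x_j) <= eps_{k,j}
    eps_offset = d * m
    for k, (X, y) in enumerate(zip(Xs, ys)):
        for j in range(X.shape[0]):
            row = np.zeros(n_var)
            row[k * d:(k + 1) * d] = -y[j] * X[j]
            row[eps_offset + j] = -1.0
            A_ub.append(row)
            b_ub.append(-1.0)
        eps_offset += X.shape[0]

    bounds = [(None, None)] * (d * m) + [(0, None)] * N + [(0, None)] * d
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=bounds, method="highs")
    W = res.x[:d * m].reshape(m, d).T   # columns are per-task weight vectors
    return W
```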

Outline. A joint sparse approximation model for transfer learning. Asymmetric transfer experiments. An efficient training algorithm. Symmetric transfer image annotation experiments.

Setting: Asymmetric Transfer. Topics: SuperBowl, Sharon, Danish Cartoons, Academy Awards, Australian Open, Trapped Miners, Golden Globes, Grammys, Figure Skating, Iraq. Learn a representation using labeled data from 9 topics: learn the matrix $W$ using our transfer algorithm and define the set of relevant features to be $R = \{ r : \max_k |w_{r,k}| > 0 \}$. Then train a classifier for the 10th held-out topic using only the relevant features in $R$.
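In code, selecting the relevant feature set R from the learned matrix amounts to keeping the rows of W with a non-zero maximum absolute coefficient; the names `W`, `X_target`, and the small tolerance below are assumptions for illustration.

```python
import numpy as np

tol = 1e-8                                        # numerical tolerance (assumption)
R = np.flatnonzero(np.abs(W).max(axis=1) > tol)   # indices of relevant features
X_target_reduced = X_target[:, R]                 # held-out topic restricted to R
```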

Results. [Figure: asymmetric transfer, average AUC versus the number of training samples (4 to 140), comparing the baseline representation with the transferred representation.]

An Efficient Training Algorithm. The LP formulation can be optimized using standard LP solvers, and it is feasible for small problems, but it becomes intractable for larger data sets with thousands of examples and dimensions. We would also like a more general optimization algorithm that can handle arbitrary convex losses.

Outline. A joint sparse approximation model for transfer learning. Asymmetric transfer experiments. An efficient training algorithm. Symmetric transfer image annotation experiments.

L1-∞ Regularization: Constrained Convex Optimization Formulation.
$$\arg\min_W \; \sum_{k=1}^{m} \frac{1}{|D_k|} \sum_{(x,y) \in D_k} l(f_k(x), y) \quad \text{s.t.} \quad \sum_{i=1}^{d} \max_k |W_{i,k}| \le C$$
This is a convex function minimized subject to convex constraints. We will use a projected subgradient method. Main advantages: simple, scalable, guaranteed convergence rates. Projected subgradient methods have recently been proposed for L2 regularization, i.e. the SVM [Shalev-Shwartz et al. 2007], and for L1 regularization [Duchi et al. 2008].
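A minimal sketch of such a projected subgradient loop, assuming per-task data lists `Xs`, `ys` and a `project(W, C)` routine for the Euclidean projection onto the L1-∞ ball (one possible projection sketch appears after the complexity slide below); the 1/sqrt(t) step size is a common default and an assumption here, not necessarily the thesis's exact schedule.

```python
import numpy as np

def projected_subgradient(Xs, ys, C, project, n_iters=200, eta0=1.0):
    """Minimize the average hinge loss subject to the L1-inf constraint (sketch)."""
    m, d = len(Xs), Xs[0].shape[1]
    W = np.zeros((d, m))
    for t in range(1, n_iters + 1):
        G = np.zeros((d, m))
        for k, (X, y) in enumerate(zip(Xs, ys)):
            margins = y * (X @ W[:, k])
            viol = margins < 1.0                      # examples inside the margin
            # subgradient of (1/n_k) * sum_j max(0, 1 - y_j w_k.x_j)
            G[:, k] = -(X[viol] * y[viol, None]).sum(axis=0) / X.shape[0]
        W = project(W - (eta0 / np.sqrt(t)) * G, C)   # gradient step, then project
    return W
```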

Euclidean Projection onto the L1-∞ Ball. Snapshot of the idea: we map the projection to a simpler problem which involves finding new maximums for each feature across tasks and using them to truncate the original matrix. The total mass removed from a feature across tasks should be the same for all features whose coefficients don't become zero.

Euclidean Projection onto the L1-∞ Ball (continued). [Figure.]

Characterization of the Solution (1). [Figure.]

Characterization of the Solution (2). [Figure: four example features with new maximums $\mu_1, \ldots, \mu_4$; the mass removed above a row's new maximum, e.g. $\theta = \sum_{j : A_{2,j} > \mu_2} (A_{2,j} - \mu_2)$ for feature 2, is the same for every feature that is not zeroed out.]

Mapping to a Simpler Problem. We can map the projection problem to the following problem, which finds the optimal maximums $\mu$: minimize the squared truncation error $\sum_{i=1}^{d} \sum_{k=1}^{m} \max(|A_{i,k}| - \mu_i, 0)^2$ subject to $\sum_{i=1}^{d} \mu_i = C$ and $\mu_i \ge 0$.

The Efficient Algorithm, in Pictures. [Figure: worked example with 4 features and 6 problems, $C = 14$; the input matrix has $\sum_{i=1}^{d} \max_k |A_{i,k}| = 29$, and the projection finds new row maximums $\mu_1, \ldots, \mu_4$.]

Complexity. The total cost of the algorithm is dominated by sorting the entries of $A$, so the total cost is of order $O(dm \log(dm))$. Notice that we only need to consider the non-zero entries of $A$, so the computational cost is dominated by the number of non-zero entries.
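The sketch below implements the truncation characterization from the previous slides (clip each row of |A| at a new maximum μ_i, removing the same total mass θ from every row that is not zeroed out), but uses nested bisection instead of the thesis's O(dm log(dm)) sorting-based algorithm; it is an illustrative alternative under that characterization, not the exact algorithm.

```python
import numpy as np

def project_l1_inf(A, C, iters=60):
    """Approximate Euclidean projection of A onto {W : sum_i max_k |W[i,k]| <= C}."""
    signs, B = np.sign(A), np.abs(A)
    if B.max(axis=1).sum() <= C:
        return A                                   # already inside the ball

    def mu_for_theta(row, theta):
        # Find mu >= 0 with sum(max(row - mu, 0)) == theta (row is nonnegative).
        if row.sum() <= theta:
            return 0.0                             # whole row is zeroed out
        lo, hi = 0.0, row.max()
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if np.maximum(row - mid, 0.0).sum() > theta:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    # Outer bisection on theta: sum_i mu_i(theta) is nonincreasing in theta.
    lo, hi = 0.0, B.sum(axis=1).max()
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        mus = np.array([mu_for_theta(B[i], theta) for i in range(B.shape[0])])
        if mus.sum() > C:
            lo = theta
        else:
            hi = theta
    mus = np.array([mu_for_theta(B[i], 0.5 * (lo + hi)) for i in range(B.shape[0])])
    return signs * np.minimum(B, mus[:, None])     # clip each row at its new maximum
```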

Outline. A joint sparse approximation model for transfer learning. Asymmetric transfer experiments. An efficient training algorithm. Symmetric transfer image annotation experiments.

Synthetic Experiments. Generate a jointly sparse parameter matrix $W$. For every task $k$ we generate pairs $(x_i, y_i)$ where $y_i = \mathrm{sign}(w_k^T x_i)$. We compared three different types of regularization (i.e. projections): L1-∞ projection, L2 projection, and L1 projection.
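A sketch of this data generation, matching the next slide's setup of 60 problems and 200 features with 10% relevant (shared-support) features; the number of examples per task and the Gaussian sampling are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n, n_relevant = 200, 60, 100, 20            # n per task is an assumption
W = np.zeros((d, m))
rows = rng.choice(d, size=n_relevant, replace=False)
W[rows] = rng.normal(size=(n_relevant, m))        # shared support across all tasks
Xs = [rng.normal(size=(n, d)) for _ in range(m)]
ys = [np.sign(X @ W[:, k]) for k, X in enumerate(Xs)]   # labels y = sign(w_k^T x)
```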

Synthetic Experiment Results (60 problems, 200 features, 10% relevant features). [Figures: test error, and feature-selection performance (precision and recall), as a function of the number of training examples per task (10 to 640), comparing the L2, L1, and L1-∞ projections.]

Dataset: Image Annotation. Example caption words: president, actress, team; the 40 top content words are used. Raw image representation: vocabulary tree (Grauman and Darrell 2005; Nister and Stewenius 2006), 11,000 dimensions.

Experiments: Vocabulary Tree Representation. Find patches and map each patch to a feature vector. Perform hierarchical k-means. To compute a representation for an image: find patches, map each patch to its closest cluster at each level, and concatenate the counts:
$$x = [\#c_1^1, \#c_2^1, \ldots, \#c_{p_1}^1, \ldots, \#c_1^l, \#c_2^l, \ldots, \#c_{p_l}^l]$$
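A simplified sketch of this representation: it runs flat k-means with an increasing number of centers at each level and concatenates the per-level cluster counts for one image's patches, whereas a real vocabulary tree builds the levels hierarchically; the function name and parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def vocabulary_histogram(train_patches, image_patches, branch=10, levels=3):
    """Concatenate per-level cluster-count histograms for one image's patches."""
    histograms = []
    for level in range(1, levels + 1):
        km = KMeans(n_clusters=branch ** level, n_init=4).fit(train_patches)
        assignments = km.predict(image_patches)           # closest cluster per patch
        histograms.append(np.bincount(assignments, minlength=branch ** level))
    # x = [#c_1^1, ..., #c_{p_1}^1, ..., #c_1^l, ..., #c_{p_l}^l]
    return np.concatenate(histograms)
```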

Results. [Figures: symmetric transfer results on the image annotation task.]

Summary of Thesis Contributions. We presented a method that learns efficient image representations using unlabeled images + meta-data. We developed a feature sharing transfer algorithm based on performing a joint loss minimization over the training sets of related tasks with a shared regularization. Previous approaches to joint sparse approximation have relied on greedy coordinate descent methods; we propose a simple and efficient global optimization algorithm for training joint models with L1-∞ constraints. We provide a tool that makes implementing a joint sparsity regularization penalty as easy and almost as efficient as implementing the standard L1 and L2 penalties. We show the performance of our transfer algorithm on real image classification tasks in both the asymmetric and symmetric transfer settings.

Future Work. Online optimization. Task clustering. Combining feature representations. Generalization properties of L1-∞ regularized models.

Thanks!