Linear Sparsity in Machine Learning
Tyler Neylon, Tuesday 14 March 2006

Outline
- Introduction
  - Intuition for sparse linear classifiers
- Previous Work
  - Sparsity in support vector machines
  - Guarantees for L1 minimization
- New Work (note: not actually in these slides)
  - Most matrices are unsparsifiable
  - A VC dim bound for sparse linear classifiers
  - Detecting matrix sparsifiability
  - A better heuristic for max sparsity?

Outline
- Introduction: Intuition for sparse linear classifiers
  - Sparsity as subset selection
  - L1 minimization approximates L0 minimization

Advantages of sparsity
- Philosophically: Occam's Razor
- More precisely:
  - Compression
  - Faster computations (after learning)
  - Subset selection
  - Better learning
  - Fights gum disease* (*not really)

Subset selection Learning one coordinate from a subset of the others

Learning on a subset of coord's. Example: stamp collectors. Each piece of data is a vector x ∈ R^n, where coordinate x_i records how many stamps collector i has at that moment. Suppose collectors 5, 13, and 28 mainly trade stamps among themselves; then x_5 + x_13 + x_28 ≈ C for some constant C, since stamps just move around within that group.

Learning on a subset of coord's. Think of each training point as a row in a matrix X, where each column corresponds to a single stamp collector (with a final constant column of 1's, so the -C entry below acts as a bias). Let n = (0,0,0,0,1,0,...,0,1,0,...,0,1,0,-C), with 1's in positions 5, 13, and 28. Then Xn = 0 is our matrix equation, and the vector n is sparse.

Learning on a subset of coord's. If Xn = 0 and n is not all zeros, then we can single out a particular column of X: here, column 5 satisfies x_5 = C - x_13 - x_28. We can think of this as learning coordinate 5 from a small subset of the others.
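As a quick illustration (a minimal numpy sketch with made-up numbers and 0-based column indices, not from the slides), the check below builds such an X with a bias column of 1's and verifies that the sparse vector n lies in its null space, so column 5 can be read off from columns 13 and 28:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_collectors = 50, 30
X = rng.integers(0, 100, size=(m, n_collectors)).astype(float)

# Enforce the hypothetical dependency: collectors 5, 13, 28 share C = 200 stamps.
C = 200.0
X[:, 5] = C - X[:, 13] - X[:, 28]

# Append a constant bias column of 1's so the -C entry of n has something to multiply.
X = np.hstack([X, np.ones((m, 1))])

# Sparse null vector n: 1's at positions 5, 13, 28 and -C in the last slot.
n_vec = np.zeros(n_collectors + 1)
n_vec[[5, 13, 28]] = 1.0
n_vec[-1] = -C

print(np.allclose(X @ n_vec, 0))                       # True: X n = 0
print(np.allclose(X[:, 5], C - X[:, 13] - X[:, 28]))   # column 5 from two others
```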

Learning on a subset of coord's. Example: diamonds & jewelers. The number of diamonds in our data may change. Out of many diamond dealers and jewelers, maybe competitive dealers A & B work primarily with jewelers C & D. Then, if x_i is the revenue of person i, we expect something like x_A + x_B ≈ x_C + x_D. Again, we could rewrite this as Xn = 0, or Xn = x_A, etc.

Learning on a subset of coord's. Example: neurons firing. Samples are taken from unknown locations in a live human brain. Neurons react to each other, but not simultaneously (there are time lags). Key idea: take the FFT of the data first, so that phase shifts are ignored, and then look for an interdependent subset of interacting neurons.
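One way to see why the FFT helps (a tiny numpy sketch of my own, not from the slides, using an idealized circular shift): a time shift only rotates the phase of each Fourier coefficient, so the magnitude spectrum of a delayed response matches that of the original signal.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(256)   # one neuron's activity
y = np.roll(x, 7)              # another neuron reacting 7 samples later (circular shift)

# The raw signals differ sample by sample...
print(np.allclose(x, y))                                           # False

# ...but the shift only changes phases, so the magnitude spectra are identical.
print(np.allclose(np.abs(np.fft.fft(x)), np.abs(np.fft.fft(y))))   # True
```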

Learning on a subset of data. Example: support vectors. The previous examples were regression; this one is classification. Now the sparsity is in which training points we use to learn: we learn only from the border cases. Why are there only a few support vectors? Is it good to have only a few?
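For concreteness, here is a small scikit-learn sketch (my own toy data, not from the talk) showing that a kernel SVM typically ends up depending on only a handful of the training points; with heavily overlapping classes, the support-vector count grows.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n = 200
# Two well-separated Gaussian blobs in the plane.
X = np.vstack([rng.normal(-2, 1, size=(n // 2, 2)),
               rng.normal(+2, 1, size=(n // 2, 2))])
y = np.array([0] * (n // 2) + [1] * (n // 2))

clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(len(clf.support_), "support vectors out of", n, "training points")
```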

The L0 wannabe norm. Reminder: the Lp norm is ||x||_p = (Σ_i |x_i|^p)^(1/p). It's a real norm when p ≥ 1, but it is not well-defined for p = 0. Idea: take the limit of the sum Σ_i |x_i|^p as p → 0.

The L0 norm. So we define ||x||_0 = the number of nonzero entries of x, so that minimizing it maximizes sparsity. How close is it to being a norm?
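A quick numerical check of that limit (a throwaway numpy sketch, not from the slides): as p → 0, the sum Σ_i |x_i|^p approaches the count of nonzero entries, while ||2x||_0 = ||x||_0 shows the L0 "norm" fails the homogeneity a real norm requires.

```python
import numpy as np

x = np.array([0.0, 3.0, 0.0, -0.5, 0.0, 2.0])   # 3 nonzero entries

for p in [1.0, 0.5, 0.1, 0.01]:
    print(p, np.sum(np.abs(x) ** p))            # tends to 3.0 as p -> 0

print(np.count_nonzero(x), np.count_nonzero(2 * x))  # 3 and 3: ||2x||_0 != 2||x||_0
```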

A hard optimization problem. P0: minimize ||x||_0 subject to Ax = b. Shown NP-hard by Larry Stockmeyer. Essentially* shown NP-hard to approximate by Piotr Berman and Marek Karpinski in "Approximating Minimum Unsatisfiability in Linear Equations". (*But the authors don't point out the equivalence with this problem.)

An easy optimization problem. P1: minimize ||x||_1 subject to Ax = b. It can be solved with linear programming. Heuristic: the solution to P1 is often identical to the P0 solution. Whyzat?
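To make "can be solved with linear programming" concrete, here is a minimal scipy sketch on a toy instance of my own (the matrix sizes and sparse x_true are invented for illustration): split x = u - v with u, v ≥ 0, so minimizing ||x||_1 becomes minimizing the linear objective Σ(u_i + v_i) under linear equality constraints.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
m, n = 20, 40
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[[4, 17, 33]] = [2.0, -1.0, 0.5]          # a 3-sparse solution
b = A @ x_true

# P1 as an LP over (u, v) with x = u - v, u >= 0, v >= 0.
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n))
x_hat = res.x[:n] - res.x[n:]

print(np.allclose(x_hat, x_true, atol=1e-6))    # often True when x_true is sparse enough
```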

Intuition: inflating level sets. Solving P1 is like blowing up an L1-balloon* until it first touches the solution set {x : Ax = b}. The sparsest points are the most extreme (pointiest) points on the balloon. (*Contains pointy edges -- unsafe for children under 3.)

The tetherball counterexample. We're in R^3. The line {x : Ax = b} goes through the points s = (0,0,3) and t = (1,1,0). Then ||s||_1 = 3 > 2 = ||t||_1, so P1 ≠ P0 here: P1 picks t, but s is the sparser point. Replace the 3 by anything > 2^(1/p) to get a case in which Pp ≠ P0, for any p > 0 (!)

Outline
- Previous Work
  - Sparsity in support vector machines
  - Guarantees for L1 minimization

Reminder: the classic support vector machine. This is the traditional classification version: minimize (1/2)||w||^2 + C Σ_i ξ_i subject to y_i (w·x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0. Will w be sparse here? Sometimes; it depends on the data! (How?)

Classification sparsity. Some results are due to Ingo Steinwart in "Sparseness of SVM's: some asymptotically sharp bounds". We use the alternative version whose loss term is max(0, 1 - y_i f(x_i))^p: if p = 1, we call it the L1-SVM; if p = 2, we call it the L2-SVM.

Classification sparsity. Let R denote the Bayes error, and let S be the probability that a point x may be noisy; then S ≥ 2R. Let N_SVp be the number of support vectors from solving the Lp-SVM on n training points. Then N_SV1 / n → 2R and N_SV2 / n → S, in probability.

Reminder: the classic support vector machine. This is the regression version: minimize (1/2)||w||^2 + C Σ_i |y_i - f(x_i)|_ε, where f(x) = w·x + b and |t|_ε = max(0, |t| - ε) is the ε-insensitive loss.

Another version. An alternative presented by Federico Girosi in "An equivalence between sparse approximation and SVM's": approximate the target function f(x) by a sparse linear combination of kernel functions.

An equivalence. This version is equivalent (same solution!) to the traditional regression SVM when: the data is noiseless, ⟨f, 1⟩_H = 0, and ⟨1, K(x,y)⟩_H = 1 for any y. Awesome!

Why low N_SV is good. We've seen that many forms of SVMs give sparse solutions, but does that improve our generalization error? Yes! Vladimir Vapnik* gives the leave-one-out bound E[test error] ≤ E[N_SV] / (number of training examples). (*In, for example, Advances in Kernel Methods, chapter 3, edited by Bernhard Scholkopf, Christopher Burges, and Alexander Smola.)

Another version of sparse regression Useful for finding/approximating an entire basis of a sparse null space:

Other methods of approximating sparsity: MOF, MP, OMP, BOB... also known as...

Other methods of approximating sparsity: least squares (Method of Frames), Matching Pursuit, Orthogonal Matching Pursuit, and Best Orthogonal Basis. Most people nowadays agree that L1-minimization is generally best, although the other methods may be a little faster or better in other ways. A good exposition can be found in "Atomic Decomposition by Basis Pursuit" by Scott Chen, David Donoho, and Michael Saunders.
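As an illustration of one of these alternatives, here is a bare-bones Orthogonal Matching Pursuit sketch in numpy (my own simplified version under a synthetic setup, not code from the talk): greedily pick the column most correlated with the residual, then re-fit by least squares on the columns chosen so far.

```python
import numpy as np

def omp(A, b, k):
    """Orthogonal Matching Pursuit: approximate b using at most k columns of A."""
    residual = b.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Greedy step: column most correlated with the current residual.
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in support:
            support.append(j)
        # Orthogonal step: least-squares re-fit on all chosen columns.
        coef, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
        residual = b - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x

# Tiny usage check on a synthetic 3-sparse problem.
rng = np.random.default_rng(4)
A = rng.standard_normal((25, 60))
x_true = np.zeros(60)
x_true[[7, 21, 40]] = [1.0, -2.0, 0.5]
x_hat = omp(A, A @ x_true, k=3)
print(np.allclose(x_hat, x_true, atol=1e-6))   # usually True for this easy case
```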

Some L1-min guarantees. In solving P1 (min ||x||_1 s.t. Ax = b) vs. P0 (min ||x||_0 s.t. Ax = b): for a fixed aspect ratio of a sequence of matrices A_n, whose size increases without bound with n, there exists a constant fraction p so that, if there exists a solution x with density (fraction of nonzero entries) at most p, then the P1 solution is also the P0 solution.

Some L1-min guarantees. That's a result proved by David Donoho in "For most large underdetermined systems of linear equations, the minimal L1-norm solution is also the sparsest solution". That paper is hard to read, by the way.

Outline
- New Work (coming soon!)
  - Most matrices are unsparsifiable
  - A VC dim bound for sparse linear classifiers
  - Detecting matrix sparsifiability
  - A better heuristic for max sparsity?

This is the last slide for now. Thank you!