Some Thoughts on Machine Learning Software Design


Support Vector Machines 1

Some Thoughts on Machine Learning Software Design

Chih-Jen Lin
Department of Computer Science
National Taiwan University

Talk at University of Southampton, February 6, 2004

Support Vector Machines 2

About This Talk
- Machine learning software design involves interesting research issues
- Also other issues: implementation, users
- Would like to share my past experience on the software LIBSVM for discussion
- Many issues are controversial

Support Vector Machines 3

- Focus on software of one method, e.g. SVM software
- Integrated ML environments: even more complicated issues
- A bit different from data mining software
- Examples:
  Spider http://www.kyb.tuebingen.mpg.de/bs/people/spider/main.html
  PyML http://cmgm.stanford.edu/~asab/pyml/pyml.html
- So issues such as data types etc. will not be discussed

Support Vector Machines 4

Good Machine Learning Software
- Must use good methods
  (has been the focus of machine learning research)
- Issues less discussed:
  Should include tools needed by users, e.g. a simple scaling code
  Should be simple and complete, e.g. multi-class classification
  Should be numerically stable
- Efficiency may not be the only concern

Support Vector Machines 5

Tools for Users: e.g. Scaling
- I started working on SVM in 1999
- Saw SVM papers presenting excellent accuracy; decided to try by myself
- Statlog data set (http://www.liacc.up.pt/ml/statlog/), heart data:
  70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4 2.0 3.0 3.0 2
  67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6 2.0 0.0 7.0 1
  57.0 1.0 2.0 124.0 261.0 0.0 0.0 141.0 0.0 0.3 1.0 0.0 7.0 2
  64.0 1.0 4.0 128.0 263.0 0.0 0.0 105.0 1.0 0.2 2.0 1.0 7.0 1
- Result: 100% of the training points were support vectors, and bad accuracy

Support Vector Machines 6

- No idea what happened
- In few papers, one simple sentence mentions normalization or scaling to [-1, 1]
- Then I also realized why, from the SVM dual with the RBF kernel:

  \min_\alpha \frac{1}{2}\alpha^T Q \alpha - e^T \alpha
  subject to 0 \le \alpha_i \le C, \; i = 1, \dots, l, \quad y^T \alpha = 0,

  where K(x_i, x_j) = \phi(x_i)^T \phi(x_j) = e^{-\gamma \|x_i - x_j\|^2}

Support Vector Machines 7

- With unscaled data, \gamma \|x_i - x_j\|^2 is huge, so K(x_i, x_j) \approx 0 for i \ne j and Q \approx I
- If Q \approx I and C \ge 2l_1/l, the solution is
  \alpha_i \approx 2l_2/l if y_i = 1, \qquad \alpha_i \approx 2l_1/l if y_i = -1
  (l_1, l_2: numbers of positive and negative examples)
- All points are support vectors

Support Vector Machines 8

- Lesson: ML researchers know the importance of scaling; most users do not
- Such simple tools should be provided
- David Meyer (author of the R interface to LIBSVM) had exactly the same experience
- He decided to scale data by default
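
The kind of simple scaling tool meant here can be sketched as follows: linearly map each feature column to [-1, 1], as LIBSVM's svm-scale utility does. This is a minimal illustration, not LIBSVM's code; the function and variable names are made up.

```python
def scale_columns(rows, lo=-1.0, hi=1.0):
    """Linearly map each feature column to [lo, hi] (constant columns map to lo)."""
    n_feat = len(rows[0])
    mins = [min(r[j] for r in rows) for j in range(n_feat)]
    maxs = [max(r[j] for r in rows) for j in range(n_feat)]
    scaled = []
    for r in rows:
        scaled.append([
            lo if maxs[j] == mins[j]
            else lo + (hi - lo) * (r[j] - mins[j]) / (maxs[j] - mins[j])
            for j in range(n_feat)
        ])
    return scaled, mins, maxs  # keep mins/maxs to scale test data identically

# Two features with very different ranges, as in the heart data above
rows = [[70.0, 322.0], [67.0, 564.0], [57.0, 261.0], [64.0, 263.0]]
scaled, mins, maxs = scale_columns(rows)
```

Note that the same mins/maxs from the training data must be reused on test data; rescaling the test set by its own ranges is a common user mistake.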

Support Vector Machines 9

Simple and Complete: e.g. Multi-class Classification
- Many methods, when proposed, consider only the two-class case
- OK for a paper: a standard extension to multi-class exists
- But if no one implements it, the proposed method can never be useful
- I did not realize this before
- LIBSVM released in April 2000: 2-class only
- By the summer: many requests for multi-class

Support Vector Machines 10

Multi-class Implementation: One or Many
- So I was forced to implement it
- Many options: 1 vs. the rest, 1 vs. 1 (pairwise), error-correcting codes, all k classes together as one optimization formula
- Include one or many?

Support Vector Machines 11

Two Types of Numerical Software
1. Include all options and let users choose
2. Provide only one which is generally good
- The argument never ends; it also depends on the situation
- For SVM software, I prefer the 2nd
- Historical reason: I was from a numerical optimization group supporting the 2nd
- A black-box type implementation may be useful: many users have no ability to choose from different options

Support Vector Machines 12

- Need a serious comparison to find a generally good one
- Finally I chose 1 vs. 1 [Hsu and Lin, 2002]:
  Similar accuracy to the other approaches
  Shortest training time
  A bit longer on testing than 1 vs. the rest
- In scientific computing, numerical comparisons are seriously conducted and considered part of the research
- We should put more emphasis on such issues
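
To make the 1 vs. 1 strategy concrete, here is a minimal sketch (not LIBSVM's implementation): train one binary classifier per pair of classes and predict by majority vote. `train_binary` stands in for any two-class learner; the toy one used in the demo is a nearest-class-mean rule invented for illustration.

```python
from itertools import combinations
from collections import Counter

def train_one_vs_one(X, y, train_binary):
    """Train one binary classifier per pair of classes: k(k-1)/2 models."""
    classes = sorted(set(y))
    models = {}
    for a, b in combinations(classes, 2):
        idx = [i for i, lab in enumerate(y) if lab in (a, b)]
        models[(a, b)] = train_binary([X[i] for i in idx], [y[i] for i in idx])
    return models

def predict_one_vs_one(models, x):
    """Each pairwise model votes for one class; predict the majority."""
    votes = Counter(model(x) for model in models.values())
    return votes.most_common(1)[0][0]

# Toy stand-in "learner": predict the nearer of the two class means
def train_binary(X, y):
    means = {}
    for lab in set(y):
        pts = [X[i] for i, l in enumerate(y) if l == lab]
        means[lab] = [sum(c) / len(pts) for c in zip(*pts)]
    return lambda x: min(means, key=lambda l: sum((a - b) ** 2
                                                  for a, b in zip(x, means[l])))

X = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [0.0, 5.0], [0.1, 5.2]]
y = [0, 0, 1, 1, 2, 2]
models = train_one_vs_one(X, y, train_binary)  # 3 classes -> 3 pairwise models
```

Testing cost grows with k(k-1)/2 models, which is why 1 vs. 1 is a bit slower at prediction time than 1 vs. the rest, as the slide notes.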

Support Vector Machines 13

More on Completeness: Parameter Selection
- SVM: a bit sensitive to parameters
[Contour plot of accuracy over lg(C) and lg(gamma), ranging from about 97% to 98.8%]

Support Vector Machines 14

- I spent a lot of time on the loo bound:
  leave-one-out error \le f(C, \gamma), so solve \min_{C,\gamma} f(C, \gamma)
- Not stable, so for two parameters we now recommend CV + grid search
- But, unlike 1 vs. 1 for multi-class, this is still far from settled
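
The recommended CV + grid search amounts to the following outline (a generic sketch, not LIBSVM's grid tool; `cv_accuracy` stands in for running k-fold cross-validation of an SVM at the given parameters, and the demo surface below is made up):

```python
import itertools
import math

def grid_search(cv_accuracy, log2C_range, log2gamma_range):
    """Try every (C, gamma) pair on a base-2 log grid; keep the best CV accuracy."""
    best_C, best_gamma, best_acc = None, None, -float("inf")
    for lc, lg in itertools.product(log2C_range, log2gamma_range):
        C, gamma = 2.0 ** lc, 2.0 ** lg
        acc = cv_accuracy(C, gamma)
        if acc > best_acc:
            best_C, best_gamma, best_acc = C, gamma, acc
    return best_C, best_gamma, best_acc

# Fake smooth accuracy surface peaking at C = 2^3, gamma = 2^-1, for the demo
def fake_cv_accuracy(C, gamma):
    return 98.8 - (math.log2(C) - 3.0) ** 2 - (math.log2(gamma) + 1.0) ** 2

best_C, best_gamma, best_acc = grid_search(fake_cv_accuracy,
                                           range(-2, 8), range(-6, 2))
```

The exhaustive grid is exactly why this stops scaling once feature selection adds more than two parameters, which is the point of the next slide.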

Support Vector Machines 15

- OK if there is no need for feature selection
- Once feature selection is considered, the number of parameters may be > 2
- CV + grid search does not work then
- Loo bound or Bayesian evidence more suitable?
- In other words: for parameter selection alone, CV + grid may be preferable; once feature selection enters, loo bounds or Bayesian evidence may win

Support Vector Machines 16

Comparing Two Methods
If:
                            Method 1    Method 2
  Two-class classification  so so       excellent
  Multi-class               easy        complicated
  Probability output        easy        easy
  Parameter selection       easy        difficult
  Feature selection         easy        difficult
  Regression                easy        easy
Which should we use?

Support Vector Machines 17

- When comparing two methods, all aspects should be considered
- SVM is not particularly good here: each item is covered by several separate research papers
- Is there any method for which one paper provides all of this, with reasonably good results?

Support Vector Machines 18

Random Forest Is One
- 500 trees
- Each: a full tree using m_try randomly selected features
- Prediction: by voting

Support Vector Machines 19

- Multi-class: handled naturally by trees
- Probability output: the proportion of the 500 votes
- Parameter selection: m_try is the only parameter; moreover, it is not sensitive
- Feature selection: out-of-bag validation of each tree gives feature importance
- All these are discussed in Breiman's paper
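
The voting and probability-output scheme just described can be sketched in a few lines (an illustration only, not Breiman's code; the trees are stand-in callables returning a class label, and the demo "forest" of constant functions is invented):

```python
from collections import Counter

def forest_predict(trees, x):
    """Majority vote over the trees; probability = proportion of agreeing votes."""
    votes = Counter(tree(x) for tree in trees)
    label, count = votes.most_common(1)[0]
    return label, count / len(trees)

# Demo: a "forest" of 500 trivial trees, 400 voting pos and 100 voting neg
trees = [lambda x: "pos"] * 400 + [lambda x: "neg"] * 100
label, prob = forest_predict(trees, x=None)
```

This is what makes random forest "complete" out of the box: multi-class and probability output fall out of the same voting mechanism, with nothing extra to implement.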

Support Vector Machines 20

- Performance: my experience and [Meyer et al. 2003]
- Competitive with (or only a bit worse than) SVM
- Though some said comparing random forest with a single SVM is not fair: random forest should be compared with random nearest neighbor, random SVM, etc.
- RF: simple and complete
- My goal for SVM: software that is just as simple and complete

Support Vector Machines 21

Numerical Stability
- Many classification methods (e.g., SVM, neural networks) solve optimization problems
- Part of their implementations is essentially numerical software
- Numerical analysts hold their code to a high standard; we do not
- Reasonable: after much effort implementing method A, one day method B achieves higher accuracy, and the effort is wasted
- Really a dilemma

Support Vector Machines 22

Example: SMO and Linear Kernel
Selecting a working set {i, j}, solve

\min_{\alpha_i, \alpha_j} \frac{1}{2}
\begin{bmatrix} \alpha_i & \alpha_j \end{bmatrix}
\begin{bmatrix} Q_{ii} & Q_{ij} \\ Q_{ji} & Q_{jj} \end{bmatrix}
\begin{bmatrix} \alpha_i \\ \alpha_j \end{bmatrix}
+ (Q_{i,N}\alpha_N^k - 1)\alpha_i + (Q_{j,N}\alpha_N^k - 1)\alpha_j

subject to y_i \alpha_i + y_j \alpha_j = -y_N^T \alpha_N^k, \quad 0 \le \alpha_i, \alpha_j \le C

If y_i = y_j, substituting for \alpha_i via the equality constraint gives a one-variable minimization:

\alpha_j^{new} = \alpha_j + \frac{G_i - G_j}{Q_{ii} + Q_{jj} - 2Q_{ij}}   (1)

Support Vector Machines 23

where G_i \equiv (Q\alpha)_i - 1 and G_j \equiv (Q\alpha)_j - 1. Then clip \alpha_j^{new} back to the feasible segment where 0 \le \alpha_i, \alpha_j \le C and \alpha_i + \alpha_j stays constant.

Support Vector Machines 24

- Linear kernel: the matrix may be PSD but not PD
- Then Q_{ii} + Q_{jj} - 2Q_{ij} = 0 is possible: division by zero
- Some may say: check if Q_{ii} + Q_{jj} - 2Q_{ij} = 0 and, if so, add a small threshold
- But remember: floating-point equality tests (==) are not recommended in general
- Indeed, there is no need to worry about this

Support Vector Machines 25

- As long as G_i - G_j \ne 0, (1) goes to \infty or -\infty, both defined under the IEEE 754/854 floating-point standards
- Comparing C with INF is a valid IEEE operation, so the result is correctly clipped to 0 or C
- 0/0 is not defined, but |G_i - G_j| > \epsilon always holds before the stopping criterion is met, so 0/0 never happens

Support Vector Machines 26

- What if Q_{ii} + Q_{jj} - 2Q_{ij} < 0 due to numerical error, or is rounded to zero?
- Under IEEE there are two zeros, +0 and -0; dividing by -0 gives a step in the wrong direction
- Use (G_i - G_j) / \max(0, Q_{ii} + Q_{jj} - 2Q_{ij})
- A proper max(+0, -0) gives +0; java.lang.Math.max: "If one argument is positive zero and the other negative zero, the result is positive zero."
- See Goldberg, "What every computer scientist should know about floating-point arithmetic", ACM Computing Surveys, 1991
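
The two slides above can be demonstrated directly (a self-contained sketch; since Python raises an exception on float division by zero instead of following IEEE 754, a small `ieee_div` helper mimics the IEEE behavior the slides rely on, and the constants are illustrative):

```python
import math

def ieee_div(a, b):
    """Python raises ZeroDivisionError on x / 0.0; mimic IEEE 754 instead,
    where x/+0 = +inf and x/-0 = -inf for x > 0, and 0/0 = NaN."""
    if b == 0.0:
        if a == 0.0:
            return math.nan
        return math.copysign(math.inf, a) * math.copysign(1.0, b)
    return a / b

C = 5.0
G_diff = 1.0  # stands for G_i - G_j, nonzero before the stopping criterion

# Denominator rounded to +0: the step (1) is +inf, comparisons with inf
# are valid IEEE operations, so clipping to [0, C] still works.
step = ieee_div(G_diff, 0.0)
clipped = min(max(step, 0.0), C)

# Denominator rounded to NEGATIVE zero: the step flips to -inf (wrong direction).
neg_zero = -1e-300 * 1e-300        # underflow produces -0.0
bad_step = ieee_div(G_diff, neg_zero)

# The fix from the slide: pass the denominator through max(0, .).
# (Python's built-in max returns its first argument on ties, so +0 goes first.)
den = max(0.0, neg_zero)
fixed = min(max(ieee_div(G_diff, den), 0.0), C)
```

Note the aside in the comment: unlike java.lang.Math.max, Python's built-in max makes no promise about signed-zero ties, which is exactly the kind of language-level detail this slide is warning about.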

Support Vector Machines 27

Example: SMO and the tanh Kernel
- Whether it should be used or not is another issue; let's assume it is there
- Kernel matrix: non-PSD, so Q_{ii} + Q_{jj} + 2Q_{ij} < 0 is possible
- The objective value decreases but the iterates do not converge to a local minimum
- Infinite loop using LIBSVM
- At one point I had to warn users about this in the LIBSVM FAQ

Support Vector Machines 28

- Later we developed a simple strategy for all non-PSD kernels and proved convergence [Lin and Lin 2003]
- But someone said: the problem

  \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i = \frac{1}{2}\alpha^T Q \alpha + C \sum_{i=1}^{l} \ell((Q\alpha)_i - 1)

  is non-convex; change it to

  \frac{1}{2}\alpha^T \alpha + C \sum_{i=1}^{l} \ell((Q\alpha)_i - 1)

- tanh still used, but the problem is convex

Support Vector Machines 29

- Accuracy may be similar (sparsity is another issue)
- He/she is right; but I cannot force users not to use SVM + tanh
- Such issues may still need to be investigated
- Different points of view: one is from designing methods, one is from designing software

Support Vector Machines 30

There Are Many Such Issues
- For example: how to check support vectors? \alpha_i > 0 and \alpha_i < C
- A place where floating-point == may legitimately be used
- Not only numerical analysis techniques; for SVM, optimization issues as well
- Implementation of ML software can be a quite interdisciplinary issue
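
The support-vector check mentioned above can be written as follows (an illustrative sketch with invented names; the exact comparisons are defensible under the assumption, which the slide's SMO discussion supports, that the solver clips alpha to exactly 0 or exactly C at the bounds):

```python
def sv_status(alpha, C):
    """Classify each alpha: non-SV (alpha == 0), free SV (0 < alpha < C),
    or bounded SV (alpha == C).  Assumes the solver sets alphas at the
    bounds to exactly 0.0 or C, so == is safe here."""
    status = []
    for a in alpha:
        if a == 0.0:
            status.append("non-SV")
        elif a == C:
            status.append("bounded SV")
        else:
            status.append("free SV")
    return status

statuses = sv_status([0.0, 0.31, 1.0, 1.0, 0.0], C=1.0)
```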

Support Vector Machines 31

Conclusions
- ML software: many interesting research issues
- Some are traditional ML considerations; some are not
- It is rewarding to see users benefit from such research efforts