Large-scale Non-linear Classification: Algorithms and Evaluations


1 IJCAI 2013 Tutorial, Aug. 5th, 2013. Large-scale Non-linear Classification: Algorithms and Evaluations. Zhuang (John) Wang, Ph.D., IBM Global Business Services

2 About the presenter Works for IBM Global Business Services; previously with Siemens Research. Research interests: support vector machines, large-scale learning, online learning, multiple-instance learning. Around 20 papers in JMLR, MLJ, ICML, KDD, AISTATS, etc.

3 Agenda Overview; Large-scale linear classification; Large-scale non-linear classification; Parallelism; Summary

4 Real-world predictive analytics problem-solving workflow In the real world, many data analytics problems are solved by formulating them as data classification problems: Raw Data → Feature Extraction → Data Classification → Evaluation. [Fig: a data table with columns Feature 1 ... Feature k plus a Label column, and rows Example 1 ... Example n]

5 Big data Cheap, pervasive, and networked computing devices are enhancing our ability to collect data to an ever greater extent. What is big data? There is no clear definition; broadly, it is a situation in which exponentially growing, complex data can no longer easily be made sense of. To make sense of it, we need a wide variety of technologies to tackle two difficulties: storage and analysis. Large-scale classification is a technique in high demand that falls into the second category (analysis).

6 The current status The size of datasets keeps growing: what was considered large 10 years ago is no longer large by current standards. The ideal algorithm: fast training and prediction; scalable; nonlinear model; low memory; high accuracy; easy to implement; theoretically sound. The reality is far from the ideal:
Property | Linear SVM | Kernel SVM
Nonlinearity | no | yes
Training time | fast | slow
Prediction | fast | slow
Scalability | high | low
Training space | small | large
Model size | small | large

7 Large-scale Linear Classification

8 Problem Setting Training examples: $D = \{(x_i, y_i),\ i = 1, \ldots, N\}$, $x_i \in \mathbb{R}^M$, $y_i \in \{+1, -1\}$. Goal: train a linear classifier $\operatorname{sgn}(f(x)) = \operatorname{sgn}(w^T x)$, where $w \in \mathbb{R}^M$, to separate $D$. Note: we ignore the bias term in $f(x)$ for simplicity; a bias term can be implicitly incorporated by adding a constant feature to the data.

9 Perceptron Perceptron algorithm (Rosenblatt, 1957):
1. Initialize $w$
2. For each example $i$ in $D$, do $w \leftarrow w + \alpha_i x_i$, where $\alpha_i = y_i$ if $y_i f(x_i) \le 0$, and $\alpha_i = 0$ otherwise
3. Repeat step 2 until a stopping criterion is met (e.g., enough iterations)
Complexity: O(N) in time, O(M) in space*. (*: sequentially load data by chunk)
Theory: converges after finitely many steps if the data is linearly separable (Novikoff, 1962)
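A minimal NumPy sketch of the update above (the toy data, seed, and epoch count are illustrative assumptions, not from the tutorial):

import numpy as np

def perceptron(X, y, epochs=10):
    """Perceptron (Rosenblatt, 1957): on a mistake, w <- w + y_i * x_i.
    X: (N, M) feature matrix; y: (N,) labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):                  # step 3: repeat
        for x_i, y_i in zip(X, y):           # step 2: one pass over D
            if y_i * np.dot(w, x_i) <= 0:    # mistake: y_i * f(x_i) <= 0
                w += y_i * x_i
    return w

# Toy usage: two linearly separable clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])
print((np.sign(X @ perceptron(X, y)) == y).mean())  # training accuracy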

10 Perceptron (cont.) Pros: conceptually and computationally simple; constant memory consumption + online learning = scalable (to arbitrarily large data). Cons: fails to converge on non-linearly separable data; not sufficiently accurate; out of fashion.

11 Linear Support Vector Machine Train an optimal linear classifier by solving the optimization (Cortes et al., 1995).
Unconstrained form: $\min_w \frac{\lambda}{2}\|w\|^2 + \frac{1}{N}\sum_{i=1}^{N}\max(1 - y_i f(x_i), 0)$
Constrained form: $\min_{w,\xi} \frac{1}{2}\|w\|^2 + C\sum_i \xi_i$, s.t. $y_i(w^T x_i) \ge 1 - \xi_i$, $\xi_i \ge 0$
Note: in the linear case we can work explicitly on $w$ rather than through SVs, which makes our life much easier!

12 Stochastic Gradient Descent for Linear SVM SVM optimization: $\min_w Obj(w) = \frac{\lambda}{2}\|w\|^2 + \frac{1}{N}\sum_{i=1}^{N}\max(1 - y_i f(x_i), 0)$
Train SVM using gradient descent:
1. Initialize $w$
2. $w \leftarrow w - \eta\, \partial Obj(w)/\partial w$
3. Repeat step 2 until a stopping criterion is met
SGD: approximate the exact gradient by the gradient of the instantaneous objective on a single example $i$: $InsObj(w, i) = \frac{\lambda}{2}\|w\|^2 + \max(1 - y_i f(x_i), 0)$, so $w \leftarrow w - \eta\, \partial InsObj(w, i)/\partial w$.
Theory: when $i$ is i.i.d. sampled and the number of iterations is large, with high probability $w$ converges to $w^*$ (Zhang, 2004; Shalev-Shwartz et al., 2008)

13 Stochastic Gradient Descent for Linear SVM (cont.) Train a linear SVM like a perceptron (Zhang, 04; Shalev-Shwartz et al., 08):
1. Initialize $w$
2. Randomly select an example $i$ in $D$; do $w \leftarrow (1 - \eta\lambda) w + \beta_i x_i$, where $\beta_i = \eta y_i$ if $y_i f(x_i) < 1$, and $\beta_i = 0$ otherwise
3. Repeat step 2 with enough iterations
O(N) training time, O(M) training space*. (*: sequentially load data by chunk)
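A sketch of this update in NumPy; the step size $\eta_t = 1/(\lambda t)$ is the Pegasos schedule, and the iteration count is an illustrative assumption:

import numpy as np

def sgd_linear_svm(X, y, lam=0.01, iters=100000):
    """Pegasos-style SGD for linear SVM (Zhang, 04; Shalev-Shwartz et al., 08)."""
    N, M = X.shape
    w = np.zeros(M)
    for t in range(1, iters + 1):
        i = np.random.randint(N)            # randomly select example i
        eta = 1.0 / (lam * t)               # Pegasos step size
        margin = y[i] * np.dot(w, X[i])     # y_i * f(x_i) under the current w
        w *= (1 - eta * lam)                # shrink: (1 - eta*lambda) * w
        if margin < 1:                      # hinge loss active: beta_i = eta * y_i
            w += eta * y[i] * X[i]
    return w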

14 Dual Coordinate Descent for Linear SVM SVM optimization in dual form: $\max_\alpha \mathbf{1}^T\alpha - \frac{1}{2}\alpha^T Q\alpha$, where $Q_{ij} = y_i y_j x_i^T x_j$ and $w^* = \sum_i \alpha_i^* y_i x_i$. Maximize the dual objective by iteratively optimizing one alpha (i.e., coordinate) at a time while keeping the rest of the variables fixed, which leads to the update rule $w \leftarrow w + (\alpha_i^{new} - \alpha_i^{old})\, y_i x_i$, where $\alpha_i^{new}$ has a closed-form solution.

15 Dual Coordinate Descent for Linear SVM (cont.) Train a linear SVM like a perceptron (Hsieh et al., 08):
1. Initialize $w$ and $\alpha_i = 0$, $i = 1, \ldots, N$
2. For each example $i$ in $D$, do $\alpha_i^{new} = \min\left(\max\left(\alpha_i^{old} - \frac{y_i f(x_i) - 1}{\|x_i\|^2}, 0\right), C\right)$ and $w \leftarrow w + (\alpha_i^{new} - \alpha_i^{old})\, y_i x_i$
3. Repeat step 2 until a stopping criterion is met
O(N) training time, O(N+M) training space
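A NumPy sketch of this coordinate update (the epoch count is illustrative; for brevity it omits the random permutation and shrinking heuristics of the full method):

import numpy as np

def dcd_linear_svm(X, y, C=1.0, epochs=10):
    """Dual coordinate descent for L1-loss linear SVM (Hsieh et al., 08).
    Maintains w = sum_i alpha_i * y_i * x_i while updating one alpha at a time."""
    N, M = X.shape
    alpha = np.zeros(N)
    w = np.zeros(M)
    sqnorm = (X ** 2).sum(axis=1)                  # ||x_i||^2 = Q_ii, precomputed
    for _ in range(epochs):
        for i in range(N):
            g = y[i] * np.dot(w, X[i]) - 1.0       # dual gradient for coordinate i
            a_new = min(max(alpha[i] - g / sqnorm[i], 0.0), C)  # clipped closed form
            w += (a_new - alpha[i]) * y[i] * X[i]  # keep w in sync with alpha
            alpha[i] = a_new
    return w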

16 Other popular approaches Second-order stochastic gradient descent (Bordes et al., 2009); bundle approach (Teo et al., 2010); cutting plane approach (Joachims, 2006); methods for L1-regularized SVM and logistic regression. Refer to the survey paper "Recent Advances of Large-scale Linear Classification" by Yuan et al.

17 When data cannot fit into memory Training time = in-memory computation time + I/O time. Prevent unnecessary I/O operations by fully operating on in-memory data. How? Sequentially train the data by chunk (Yu et al., 2010). Not suitable for every algorithm, but it works well for SGD and DCD. Fig. Source: Yu et al., 2010.

18 Off-the-shelf tools Liblinear (Fan et al., 2008): linear SVM and logistic regression, powered by dual coordinate descent. Windows/Linux command-line tool with interfaces to many languages; a well-maintained project. Trains a few GB of data in a matter of seconds to minutes. Good for single-machine usage whether or not the data can fit into memory.

19 Empirical comparison between linear and non-linear classification Fig. Source: Yuan et al.

20 Why is the linear classifier popular? Because it is computationally cheap and delivers accuracy comparable to non-linear classifiers in some applications: carefully designed features may already capture non-linear concepts (e.g., computer vision applications); and in higher-dimensional feature spaces, data tends to be more linearly separable (e.g., document classification with a bag-of-words representation).

21 Where will the research of linear classification go? The field is maturing: there are many good algorithms for a wide variety of practical problems, and many off-the-shelf tools. Future directions: transfer the mature technologies to other learning scenarios.

22 Large-scale Non-linear Classification

23 When to use a non-linear classifier? When the data has non-linear concepts and the application is sensitive to accuracy.

24 Kernel Support Vector Machine Feature mapping: $D = \{(x_i, y_i),\ i = 1, \ldots, N\} \rightarrow D' = \{(\Phi(x_i), y_i),\ i = 1, \ldots, N\}$
SVM optimization on $D'$: $\min_w \frac{\lambda}{2}\|w\|^2 + \frac{1}{N}\sum_{i=1}^{N}\max(1 - y_i f(x_i), 0)$, where $f(x) = w^T\Phi(x)$
Primal-to-dual transformation + kernel trick: $f(x) = \sum_i \alpha_i y_i \Phi(x_i)^T \Phi(x) = \sum_i \alpha_i y_i k(x_i, x)$
Note: $w$ can only be represented implicitly by the SVs + their coefficients + the kernel function.

25 Decomposition Methods SVM dual form: $\max_\alpha \mathbf{1}^T\alpha - \frac{1}{2}\alpha^T Q\alpha$, s.t. $0 \le \alpha_i \le C\ \forall i$, where $Q_{ij} = y_i y_j k(x_i, x_j)$
Sequential Minimal Optimization (Platt, 98):
1. Smartly select a working example $i$ and update $\alpha_i$ by solving the one-variable subproblem $\max_{\alpha_i} (1 - Q_{iU}\alpha_U)\,\alpha_i - \frac{1}{2} Q_{ii}\alpha_i^2$, s.t. $0 \le \alpha_i \le C$, which has a closed-form solution for $\alpha_i$
2. Repeat step 1 until a stopping criterion is met

26 Decomposition Methods (cont.) Libsvm (Chang and Lin, 01): a highly optimized implementation of SMO (plus heuristics for fast convergence). An actively maintained open-source project; Windows/Linux command-line tool and multiple language APIs. An exact SVM solver. Scalable to a few hundred MBs (or <1M examples) of low-dimensional data*. (*: we define scalable as training time less than 10 hours.)

27 Decomposition Methods (cont.) Lasvm (Bordes et al., 05): an approximate SVM solver using an online SMO approximation. Uses less memory than Libsvm, but is less accurate. Scalable to a few GBs (or <10M examples) of low-dimensional data*. Lasvm algorithm: Online step: sequentially access examples; loosely run SMO on the new dataset S; delete some (currently) useless examples from S. Finishing step: run full SMO on S (Libsvm with a different stopping criterion).

28 Minimal Enclosing Ball Methods Minimal Enclosing Ball (MEB): the ball with the smallest radius that encloses all the points in a given set. Its dual form is a QP, and fast iterative approximate solvers are available for the MEB optimization. Fig. Source: Tsang et al., 2005.

29 Minimal Enclosing Ball Methods (cont.) CVM (Tsang et al., 2005): the square-loss SVM can be cast into an MEB problem; the square-loss SVM dual has the same form as the MEB dual under a modified kernel. Thus the SVM can be efficiently and approximately solved using an MEB solver. BVM (Tsang et al., 2007): a faster version of CVM obtained by further approximation.

30 Empirical comparison: B/CVM vs. Libsvm vs. Lasvm Fig. Source: Tsang et al.

31 Ramp Loss SVM SVM is less scalable on noisy data: the hinge loss makes all the noisy examples become SVs, and computing with a lot of SVs slows down algorithm convergence: $w^* = C\sum_i y_i H'(y_i, f(x_i))\,\Phi(x_i)$. Replacing the hinge loss with the ramp loss in the SVM optimization (Collobert et al., 06): $\min_w \frac{1}{2}\|w\|^2 + C\sum_{t=1}^{N} R(y_t, f(x_t))$

32 Ramp Loss SVM (cont.) Solve the new optimization by the ConCave-Convex Procedure (CCCP). Ramp loss SVM algorithm:
1. Initialization: train $f^{(old)}$ on a small subset of $D$
2. Calculate $y_i f^{(old)}(x_i)$ for all $i$ in $D$
3. Train $f^{(new)}$ on the subset $V = \{(x_i, y_i) : y_i f^{(old)}(x_i) > -1\}$
4. Repeat steps 2-3 until $V$ is unchanged
[Fig: two-Gaussians example, SVM solution vs. ramp loss SVM solution]

33 Ramp Loss SVM (cont.) Training a sequence of small SVMs on clean data is easier than training one big SVM on noisy data. Improves scalability by several times and generates a smaller classifier. Fig. Source: Collobert et al., 2006.

34 SGD with kernel Algorithm:
1. Initialize $w$
2. Randomly select an example $i$ in $D$; do $w \leftarrow (1 - \eta\lambda) w + \beta_i \Phi(x_i)$, where $\beta_i = \eta y_i$ if $y_i f(x_i) < 1$, and $\beta_i = 0$ otherwise
3. Repeat step 2 with enough iterations
Recall: $w$ = support vectors (SVs) + their coefficients + the kernel function. OK with <10,000 examples, but not scalable to larger data due to the curse of kernelization.
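A sketch of this kernelized update with an RBF kernel; the model is the growing list of (SV, coefficient) pairs, which is exactly the curse of kernelization (the kernel width gamma and the iteration count are illustrative assumptions):

import numpy as np

def rbf(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_sgd(X, y, lam=0.01, iters=5000, gamma=1.0):
    """Kernel SGD: w is stored implicitly as SVs + coefficients + kernel."""
    svs, coefs = [], []                     # support vectors and their coefficients
    for t in range(1, iters + 1):
        i = np.random.randint(len(X))
        f = sum(a * rbf(s, X[i], gamma) for s, a in zip(svs, coefs))
        coefs = [(1 - 1.0 / t) * a for a in coefs]   # (1 - eta*lambda) shrink, eta = 1/(lam*t)
        if y[i] * f < 1:                    # margin violation: x_i becomes a new SV
            svs.append(X[i])
            coefs.append(y[i] / (lam * t))  # beta_i = eta * y_i
    return svs, coefs                       # model size grows without bound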

35 Budgeted SGD BSGD algorithm (Wang et al., 2012):
1. Initialize $w$, set the budget $B$
2. Randomly select an example $i$ in $D$; do $w \leftarrow (1 - \eta\lambda) w + \beta_i \Phi(x_i)$, where $\beta_i = \eta y_i$ if $y_i f(x_i) < 1$, and $\beta_i = 0$ otherwise; if #SVs > B, then $w \leftarrow w - \Delta$
3. Repeat step 2 with enough iterations
Recall: $w$ = support vectors (SVs) + their coefficients + the kernel function. Budget maintenance strategy (to reduce the size of the SV set by one): removal, projection, or merging.
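A sketch of BSGD with the simplest budget maintenance, removing the SV with the smallest coefficient magnitude (one removal strategy among several discussed below; projection and merging share the same skeleton). It reuses the rbf helper from the kernel SGD sketch above:

import numpy as np

def budgeted_sgd(X, y, lam=0.01, iters=5000, gamma=1.0, B=100):
    """BSGD sketch (after Wang et al., 2012): kernel SGD + a budget of B SVs."""
    svs, coefs = [], []
    for t in range(1, iters + 1):
        i = np.random.randint(len(X))
        f = sum(a * rbf(s, X[i], gamma) for s, a in zip(svs, coefs))
        coefs = [(1 - 1.0 / t) * a for a in coefs]
        if y[i] * f < 1:
            svs.append(X[i])
            coefs.append(y[i] / (lam * t))
        if len(svs) > B:                      # budget maintenance: drop one SV
            j = int(np.argmin(np.abs(coefs))) # removal: smallest |alpha_j|
            svs.pop(j)
            coefs.pop(j)
    return svs, coefs                         # at most B SVs: constant memory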

36 BSGD (cont.) Theorem (the impact of budget maintenance): $Obj(w_N) - Obj(w^*) \le \frac{C_1 \ln N}{N} + C_2 E$, where $E = \frac{1}{N}\sum_{t=1}^{N}\|\Delta_t\|$ and $\Delta_t$ is the weight degradation introduced by the budget maintenance step $w_{t+1} \leftarrow w_{t+1} - \Delta_t$. Design philosophy: minimizing $E$ means minimizing $\|\Delta_t\|$ at each step. Budget maintenance optimization:
Removal: $\min_p \|\alpha_p \Phi(x_p)\|$
Projection: $\min_{\Delta\alpha} \|\alpha_p \Phi(x_p) - \sum_{j \in I}\Delta\alpha_j \Phi(x_j)\|$
Merging: $\min_{m,n,z} \|\alpha_m \Phi(x_m) + \alpha_n \Phi(x_n) - \alpha_z \Phi(z)\|$

37 Budgeted Online Kernel Classifiers Online learning with kernel: iteratively access example $i$ in $D$ and do $w \leftarrow w + \alpha_i \Phi(x_i)$, where $\alpha_i$ is calculated from $w$ and $(x_i, y_i)$. Online learning with a budget: iteratively access example $i$ in $D$; do $w \leftarrow w + \alpha_i \Phi(x_i)$; if #SVs > B, then $w \leftarrow w - \Delta$.

38 Budgeted Online Kernel Classifiers (cont.) Removal-based budget maintenance strategies, $w \leftarrow w - \alpha_r \Phi(x_r)$: remove a random SV (Cesa-Bianchi & Gentile, 06; Vucetic et al., 09); the oldest SV (Dekel et al., 08); the smallest SV (Cheng et al., 07); the one that would be predicted with the largest confidence after its removal (Crammer et al., 04); the one with the least validation error (Weston et al., 05; Wang and Vucetic, 09).

39 Budgeted Online Kernel Classifiers (cont.) Projection-based budget maintenance strategies: $w \leftarrow w - \alpha_r \Phi(x_r) + \sum_{i \in I}\Delta\alpha_i \Phi(x_i)$, where $r$ is the SV to be removed and $I$ is a subset of the SV set. BPA (Wang and Vucetic, 2010): the PA objective with a new constraint, $\min_w Q(w) = \frac{1}{2}\|w - w_t\|^2 + C\,H(y_t, f(x_t))$, s.t. $w = w_t - \alpha_r \Phi(x_r) + \sum_{i \in I}\beta_i \Phi(x_i)$, which has a closed-form solution. The choice of $I$ is a compromise between projection quality and computation cost: all SVs; the newest one; or the newest one plus its nearest neighbor.

40 Budgeted Online Kernel Classifiers (cont.) Refer to the survey section in "Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training" by Wang et al., 2012.

41 Linearization methods Idea: explicitly represent the data in feature space and train a linear SVM there. Exact methods: Poly2SVM (Chang et al., 2010), Coffin (Sonnenburg et al., 2010). Approximate methods: Random Features (Rahimi and Recht, 2007), LLSVM (Zhang et al., 2012).

42 Linearization methods (cont.) Exact methods, e.g., Poly2SVM (Chang et al., 2010): explicitly compute the degree-2 polynomial mapping (with r = 1, d = 2). Efficient when the mapped feature dimensionality is low, which usually occurs when the input features are sparse or low-dimensional. Approximate methods, e.g., Random Features (Rahimi and Recht, 2007): approximate the feature mapping of radial basis kernels by randomized features.
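A sketch of random Fourier features for the RBF kernel $k(x, z) = \exp(-\gamma\|x - z\|^2)$, after Rahimi and Recht (2007); the feature count D and gamma are illustrative assumptions:

import numpy as np

def random_fourier_features(X, D=500, gamma=1.0, seed=0):
    """Map X (N, M) to Z (N, D) so that Z @ Z.T approximates the RBF kernel matrix."""
    rng = np.random.default_rng(seed)
    M = X.shape[1]
    W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(M, D))  # spectral samples w ~ N(0, 2*gamma*I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)               # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# A linear SVM (e.g., the SGD sketch from slide 13) trained on these features
# then approximates an RBF-kernel SVM at linear-SVM cost.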

43 Linearization methods (cont.) LLSVM (Zhang et al., 12): casts a nonlinear SVM into an equivalent linear SVM through a decomposition of the PSD kernel matrix: $K_{N \times N} = F F^T$, where $F$ is $N \times B$ and $B$ is the rank of $K$; $K_{ij} = \Phi(x_i)^T\Phi(x_j) = f_i f_j^T$, so each row $f_i$ of $F$ acts as a $B$-dimensional virtual example. The kernel SVM $\min_{w,\xi} \frac{1}{2}\|w\|^2 + C\sum_i \xi_i$, s.t. $y_i(w^T\Phi(x_i)) \ge 1 - \xi_i$, then becomes the linear SVM $\min_{w,\xi} \frac{1}{2}\|w\|^2 + C\sum_i \xi_i$, s.t. $y_i(w^T f_i) \ge 1 - \xi_i$.

44 Linearization methods (cont.) Approximate the optimal decomposition by the Nyström method ($B \ll N$): $K_{N \times N} \approx K_{NB} K_{BB}^{-1} K_{NB}^T = (K_{NB} U M^{-1/2})(K_{NB} U M^{-1/2})^T$, using the eigenvalue decomposition $K_{BB} = U M U^T$.
LLSVM algorithm:
1. Select $B$ landmark points using sampling or k-means clustering
2. Compute the eigendecomposition of $K_{BB}$: $U M U^T$
3. Train a linear SVM on the virtual examples $F_{NB} = K_{NB} U M^{-1/2}$
O(N) time complexity
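A sketch of the Nyström virtual examples for an RBF kernel (landmark selection by uniform sampling; gamma and the eigenvalue cutoff are illustrative assumptions):

import numpy as np

def nystrom_features(X, B=100, gamma=1.0, seed=0):
    """Virtual examples F = K_NB @ U @ M^(-1/2) from B sampled landmarks."""
    rng = np.random.default_rng(seed)
    landmarks = X[rng.choice(len(X), size=B, replace=False)]  # step 1: sampling
    d_nb = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(-1)
    d_bb = ((landmarks[:, None, :] - landmarks[None, :, :]) ** 2).sum(-1)
    K_nb = np.exp(-gamma * d_nb)                              # N x B kernel block
    K_bb = np.exp(-gamma * d_bb)                              # B x B kernel block
    M_eig, U = np.linalg.eigh(K_bb)                           # step 2: K_BB = U diag(M) U^T
    keep = M_eig > 1e-10                                      # drop near-zero eigenvalues
    return K_nb @ U[:, keep] / np.sqrt(M_eig[keep])           # step 3: train a linear SVM on this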

45 Linearization methods (cont.) How does $B$ influence accuracy and training time? [Fig.]

46 Adaptive Multi-hyperplane Machine Idea: assign multiple hyperplanes to each class to increase representability (Aiolli & Sperduti, 05; Wang et al., 11). Example: class 1 has $w_{11}, w_{12}$; class 2 has $w_{21}, w_{22}$; class 3 has $w_{31}, w_{32}, w_{33}$; each class is scored by its maximal hyperplane response (e.g., $w_{11}^T x = 1.2$, $w_{12}^T x = 0.4$, ...). The loss compares the maximal prediction from the incorrect classes with the maximal prediction from the correct class. The resulting optimization is non-convex, and the number of hyperplanes is user-specified.

47 Adaptive Multi-hyperplane Machine (cont.) Solve a series of convex approximations by replacing the non-convex loss function with its convex upper bound, in which the hyperplane assignments $z$ are example-specific and fixed during optimization. Each convex sub-problem is solved by SGD, and $z$ is recalculated after solving each sub-problem.

48 AMM: Filling the Scalability and Representability Gap [Fig: error rate (%) and training time (seconds) of AMM vs. linear SVM vs. RBF SVM on a9a, ijcnn, webspam, mnist_bin, mnist_mc, rcv1_bin, and url; RBF SVM is N/A on the largest datasets.]

49 Off-the-shelf tool BudgetedSVM: a toolbox for large-scale non-linear SVM (Djuric et al., 13). Command-line (Windows/Linux), Matlab interfaces, C/C++ APIs. Includes AMM, BSGD, and LLSVM. Highly optimized for large data that cannot fit into memory. Online learning + constant memory = scalable to arbitrarily large data. Download:

50 Complexity comparison [Table: training-time complexities of the AMM/Pegasos, BSGD/RBF-SVM, and LLSVM classifiers.] Notation: N: #training examples; M: data dimensionality; C: #classes; S: average #non-zero features; I: #iterations (for Libsvm, I = O(N)~O(N^2)); B: budget size for BSGD, #hyperplanes for AMM, #landmark points for LLSVM; B << N.

51 Error rate and training time comparison [Fig: results on datasets of size 327 MB, 35 MB, and 18 GB.]

52 Data summary methods Summarize the data using meta-examples, then train the model on the meta-examples: $D = \{(x_i, y_i),\ i = 1, \ldots, N\} \rightarrow D' = \{(q_i, y_i),\ i = 1, \ldots, B\}$, with $B \ll N$, via data quantization or clustering (meta-examples $q_1, \ldots, q_4$ in the figure).

53 Data summary methods (cont.) Simple approach: pre-cluster the data, then train a weighted SVM on the cluster centers, where the example weights are determined by the size/purity of the clusters (see the sketch after this slide). Support Cluster Machine (Li et al., 07): pre-cluster the data, then train a weighted SVM on the clusters, where clusters are treated as Gaussian distributions and similarity is calculated by the probability product kernel. Training complexity depends on the clustering algorithm.
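A sketch of the simple approach above: plain k-means builds the meta-examples, and the cluster sizes become example weights (clustering each class separately and weighting by size alone are assumptions of this sketch):

import numpy as np

def kmeans_meta_examples(X, B=50, iters=20, seed=0):
    """Quantize one class's data into B centers; weight = cluster size."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=B, replace=False)].copy()
    for _ in range(iters):
        assign = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for b in range(B):
            if np.any(assign == b):          # keep old center if cluster is empty
                centers[b] = X[assign == b].mean(axis=0)
    weights = np.bincount(assign, minlength=B)
    return centers, weights

# Run once per class, then train a weighted SVM on the centers; e.g., in the
# SGD sketch from slide 13, sample each center with probability proportional
# to its weight.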

54 Data summary methods (cont.) Twin Vector Machine (Wang and Vucetic, 10b): incrementally quantize the data into twin vectors by nearest neighbor while incrementally updating an SVM on the twin vector set. A twin vector $tv_j = \{(q_j, +1, s_j^+), (q_j, -1, s_j^-)\}$ stores per-label counts, e.g., $tv_1 = (q_1,+1,1),(q_1,-1,2)$; $tv_2 = (q_2,+1,3),(q_2,-1,1)$; $tv_3 = (q_3,+1,4),(q_3,-1,1)$; $tv_4 = (q_4,+1,0),(q_4,-1,3)$. Optimization: $\min_w \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{B}\left[s_i^+ H(+1, f(q_i)) + s_i^- H(-1, f(q_i))\right]$

55 Sampling methods Train algorithms on a subset of the data. KDDCUP09 data: ~5M examples, 129 dimensions; best reported accuracy ~94%; sampling method (using only 50 examples) + SVM: accuracy ~92%, training time less than 1s. Covertype data: 500K examples, 57 dimensions. Rcv1 data: 550K examples, 47,236 dimensions. Fig. Source: C.-J. Lin, talk at the K. U. Leuven Optimization in Engineering Center.

56 Sampling methods (cont.) Accuracy can be further boosted by bagging: F(x) = ave(f_i(x)). [Fig: 2008 PASCAL large-scale learning challenge results on the alpha dataset.] Fig. Source: Jochen Garcke, presentation at the ICML'08 Workshop PASCAL Large Scale Learning Challenge.
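A sketch of this bagging scheme over linear SVMs trained on random subsamples, reusing the sgd_linear_svm sketch from slide 13 (ensemble size and subsample size are illustrative assumptions):

import numpy as np

def bagged_svm(X, y, n_models=10, sample_size=1000, seed=0):
    """Train linear SVMs on random subsets; predict with the averaged score."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        models.append(sgd_linear_svm(X[idx], y[idx]))   # one f_i per subsample
    return models

def bagged_predict(models, X):
    return np.sign(np.mean([X @ w for w in models], axis=0))  # F(x) = ave(f_i(x))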

57 Parallelism

58 Parallel SVM Cascade SVM (Graf et al., 05): distribute the data across nodes; train local SVMs and propagate only the SV sets to the next layer; converges after a few iterations. Fig. Source: Graf et al., 2005.

59 Parallel SVM (cont.) Fig. Source: Graf et al., 2005.

60 Parallel SVM (cont.) PSVM (Chang et al., 07): a parallel Interior-Point (IP) method. IP method: remove the linear constraint in the SVM's QP with a barrier function, then solve a sequence of unconstrained problems with Newton's method; O(N^3) time and O(N^2) space, dominated by inverting the kernel matrix. Parallel IP method: distribute both data loading and computation; approximate the expensive matrix manipulations using parallel computing; intense communication between nodes.

61 Parallel SVM (cont.) P-packSVM (Zhu et al., 09): parallel SGD for kernel SVM; a lot of communication between nodes; parallel computing platform: MPI. Algorithm:
1. Initialize $w$
2. All nodes randomly select the same example $i$ in $D$; do $w \leftarrow (1 - \eta\lambda) w + \beta_i \Phi(x_i)$, where $\beta_i = \eta y_i$ if $y_i f(x_i) < 1$, and $\beta_i = 0$ otherwise; $f(x_i)$ is summed up across all nodes, and $x_i$ is added to only one node
3. Repeat step 2 with enough iterations

62 Parallel SVM (cont.) Fig. Source: Zhu et al., 2009.

63 Parallel SVM (cont.) PSGD (Zinkevich et al., 10): bagging + linear SVM SGD. An approximate solver with little communication between nodes; good for MapReduce on Hadoop. Fig. Source: C.-J. Lin, talk at the K. U. Leuven Optimization in Engineering Center.

64 Parallel SVM (cont.) ADMM for SVM (Boyd et al., 11; Zhang et al., 12b): needs a fast solver for the per-node subproblem; implemented with MPI. Fig. Source: Zhang et al., 2012b.

65 Parallel SVM (cont.) Data size: 9 GB; #nodes: 8; RAM per node: 12 GB. [Fig: ADMM vs. single-machine Liblinear.] Fig. Source: Zhang et al., 2012b.

66 What I won't cover Parallel tree methods: see the tutorial "Scale up decision tree ensembles" by M. Bilenko, R. Bekkerman, and J. Langford at KDD 2011. Parallel deep networks: see the tutorial "Large scale deep learning" by M. Ranzato at the IPAM summer school.

67 Summary Linear classification: very scalable; computationally cheap; accuracy is often sufficient for some applications. Non-linear classification: online learning + constant memory = scalable to arbitrarily large data; sampling/bagging is effective for large data. Parallelism: a lot of MPI implementations but few MapReduce ones.

68 Acknowledgements Thanks to my co-authors: Koby Crammer, Nemanja Djuric, Liang Lan, Fabian Moerchen, Slobodan Vucetic, Kai Zhang. Thanks to Chih-Jen Lin for reviewing and commenting on the tutorial proposal.

69 Thank you! My homepage at: Contact me at: Download BudgetedSVM toolbox at: 69

70 References
F. Aiolli and A. Sperduti. Multi-class classification with multi-prototype support vector machines. Journal of Machine Learning Research,
A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers for online and active learning. Journal of Machine Learning Research,
A. Bordes, L. Bottou, and P. Gallinari. SGD-QN: careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research, 2009.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning,
N. Cesa-Bianchi and C. Gentile. Tracking the best hyperplane with a simple budget perceptron. In Annual Conference on Learning Theory,
C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, cjlin/libsvm
Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin. Training and testing low-degree polynomial data mappings via linear SVM. Journal of Machine Learning Research,
E. Y. Chang, K. Zhu, H. Wang, and H. Bai. PSVM: parallelizing support vector machines on distributed computers. In Advances in Neural Information Processing Systems,
L. Cheng, S. V. N. Vishwanathan, D. Schuurmans, S. Wang, and T. Caelli. Implicit online learning with kernels. In Advances in Neural Information Processing Systems, 2007.
R. Collobert, F. Sinz, J. Weston, and L. Bottou. Trading convexity for scalability. In International Conference on Machine Learning,
C. Cortes and V. Vapnik. Support-vector networks. Machine Learning,

71 References (cont.)
O. Dekel, S. Shalev-Shwartz, and Y. Singer. The forgetron: a kernel-based perceptron on a budget. SIAM Journal on Computing,
N. Djuric, L. Lan, S. Vucetic, and Z. Wang. BudgetedSVM: a toolbox for large-scale non-linear SVM.
H.-P. Graf, E. Cosatto, L. Bottou, I. Dourdanovic, and V. Vapnik. Parallel support vector machines: the cascade SVM. In Advances in Neural Information Processing Systems, 2005.
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research,
C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In International Conference on Machine Learning,
A. B. Novikoff. On convergence proofs on perceptrons. Symposium on the Mathematical Theory of Automata,
T. Joachims. Training linear SVMs in linear time. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining,
B. Li, M. Chi, J. Fan, and X. Xue. Support cluster machine. In International Conference on Machine Learning, 2007.
J. Platt. Fast training of support vector machines using sequential minimal optimization. Advances in Kernel Methods - Support Vector Learning, MIT Press,
A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems,
F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review,

72 References (cont.)
B. Schölkopf, S. Mika, C. J. C. Burges, P. Knirsch, K. Müller, G. Rätsch, and A. J. Smola. Input space versus feature space in kernel-based methods. IEEE Transactions on Neural Networks,
S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In International Conference on Machine Learning,
S. Sonnenburg and V. Franc. Coffin: a computational framework for linear SVMs. In International Conference on Machine Learning,
C. H. Teo, S. V. N. Vishwanathan, A. J. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research,
I. W. Tsang, J. T. Kwok, and P.-M. Cheung. Core vector machines: fast SVM training on very large data sets. Journal of Machine Learning Research,
I. W. Tsang, A. Kocsor, and J. T. Kwok. Simpler core vector machines with enclosing balls. In International Conference on Machine Learning,
S. Vucetic, V. Coric, and Z. Wang. Compressed kernel perceptrons. In IEEE Data Compression Conference,
Z. Wang and S. Vucetic. Tighter perceptron with improved dual use of cached data for model representation and validation. In International Joint Conference on Neural Networks,
Z. Wang and S. Vucetic. Online passive-aggressive algorithms on a budget. In International Conference on Artificial Intelligence and Statistics,
Z. Wang and S. Vucetic. Online training on a budget of support vector machines using twin prototypes. Statistical Analysis and Data Mining Journal, 2010b.

73 References (cont.)
Z. Wang, N. Djuric, K. Crammer, and S. Vucetic. Trading representability for scalability: adaptive multi-hyperplane machine for nonlinear classification. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining,
Z. Wang, K. Crammer, and S. Vucetic. Breaking the curse of kernelization: budgeted stochastic gradient descent for large-scale SVM training. Journal of Machine Learning Research,
J. Weston, A. Bordes, and L. Bottou. Online (and offline) on an even tighter budget. In International Workshop on Artificial Intelligence and Statistics,
H.-F. Yu, C.-J. Hsieh, K.-W. Chang, and C.-J. Lin. Large linear classification when data cannot fit in memory. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining,
G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Recent advances of large-scale linear classification. Proceedings of the IEEE,
K. Zhang, L. Lan, Z. Wang, and F. Moerchen. Scaling up kernel SVM on limited resources: a low-rank linearization approach. In International Conference on Artificial Intelligence and Statistics,
C. Zhang, H. Lee, and K. G. Shin. Efficient distributed linear classification algorithms via the alternating direction method of multipliers. In International Conference on Artificial Intelligence and Statistics, 2012b.
T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent. In International Conference on Machine Learning,
Z. A. Zhu, W. Chen, G. Wang, C. Zhu, and Z. Chen. P-packSVM: parallel primal gradient descent kernel SVM. In IEEE International Conference on Data Mining,
M. Zinkevich, M. Weimer, A. J. Smola, and L. Li. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems,
