CS246: Mining Massive Datasets, Jure Leskovec, Stanford University, http://cs246.stanford.edu

Course map:
High dim. data: Locality sensitive hashing, Clustering, Dimensionality reduction
Graph data: PageRank/SimRank, Community Detection, Spam Detection
Infinite data: Filtering data streams, Web advertising, Queries on streams
Machine learning: SVM, Decision Trees, Perceptron/kNN
Apps: Recommender systems, Association Rules, Duplicate document detection

Study of algorithms that improve their performance at some task with experience.

Given some data: learn a function to map from the input to the output. Given: training examples (x, y), where y = f(x) for some unknown function f. Find: a good approximation to f.

Would like to do prediction: estimate a function f(x) so that y = f(x). Here y can be: a real number (regression), categorical (classification), or a complex object (ranking of items, parse tree, etc.). The data is labeled: we have many pairs {(x, y)}, where x is a vector of binary, categorical, or real-valued features, and y is a class ({+1, -1}) or a real number.

Task: given data (X, Y), build a model f() to predict Y' based on X'. Strategy: estimate y = f(x) on the training data (X, Y) and hope that the same f(x) also works to predict the unknown Y' on the test data. The "hope" is called generalization. Overfitting: f(x) predicts Y well but is unable to predict Y'. We want to build a model that generalizes well to unseen data. "But Jure, how can we do well on data we have never seen before?!?"

1) Training data is drawn independently at random according to an unknown probability distribution P(x, y). 2) The learning algorithm analyzes the training points and produces a classifier f. Given new data (x, y) drawn from P, the classifier is given x and predicts ŷ = f(x); the loss L(ŷ, y) is then measured. Goal of the learning algorithm: find f that minimizes the expected loss E_P[L].

[Diagram: training data drawn from P(x, y) forms the training set S; the learning algorithm produces the classifier f; on a test point (x, y), f outputs the prediction ŷ, which the loss function L(ŷ, y) compares to the true y.] Why is it hard? We estimate f on training data but want f to work well on unseen future (i.e., test) data.

Goal: minimize the expected loss, min E_P[L]. But we don't have access to P, only to the training sample D: min E_D[L]. So we minimize the average loss on the training data: min_w J(w) = (1/N) Σ_{i=1}^N L(h(x_i), y_i). Problem: just memorizing the training data gives us a perfect model (with zero loss)!

Given: a set of N training examples {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(N), y^(N))} and a loss function L. Find: the weight vector w that minimizes the expected loss on the training data: J(w) = (1/N) Σ_{i=1}^N L(f_w(x^(i)), y^(i)).

Problem: step-wise constant loss function. [Plot: 0/1 loss as a function of w^(1)·x; the loss is flat except for jumps.] The derivative is either 0 or undefined.

Approximate the expected loss by a smooth function: replace the original objective by a surrogate loss function, e.g., the hinge loss: J(w) = (1/N) Σ_{i=1}^N max{0, 1 - y_i w·x_i}. [Plot: when y_i = 1, the hinge loss is 0 for w·x_i ≥ 1 and grows linearly as w·x_i decreases below 1.]

Minimize J(w) by gradient descent: start with a weight vector w^(0). Compute the gradient ∇J(w^(0)) = (∂J(w^(0))/∂w^(1), ..., ∂J(w^(0))/∂w^(d)). Compute w^(1) = w^(0) - η ∇J(w^(0)), where η is a step-size parameter. Repeat until convergence.
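
A minimal sketch of the gradient descent loop just described, applied to a toy objective J(w) = ‖w - target‖² (the objective, step size, and stopping tolerance here are hypothetical illustrations, not the SVM objective):

```python
import numpy as np

def gradient_descent(grad_J, w0, eta=0.1, tol=1e-8, max_iter=1000):
    """Iterate w <- w - eta * grad_J(w) until the step becomes tiny."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        step = eta * grad_J(w)
        w = w - step
        if np.linalg.norm(step) < tol:  # crude convergence test
            break
    return w

target = np.array([1.0, -2.0])
grad_J = lambda w: 2.0 * (w - target)         # gradient of ||w - target||^2
print(gradient_descent(grad_J, np.zeros(2)))  # converges to [1, -2]
```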

Example: spam filtering. Instance space x ∈ X (|X| = n data points). Binary or real-valued feature vector x of word occurrences; d features (words + other things, d ≈ 100,000). Class y ∈ Y: spam (+1), ham (-1).

P(x, y): distribution of email messages x and their true labels y ("spam", "ham"). Training sample: a set of email messages that have been labeled by the user. Learning algorithm: what we study! f: the classifier output by the learning algorithm. Test point: a new email x (with its true, but hidden, label y). Loss function L(ŷ, y):

                         true y = spam   true y = ham
predicted ŷ = spam             0              10
predicted ŷ = not spam         1               0

Idea: pretend we do not know the data/labels we actually do know. Build the model f(x) on the training data (minimize J), then see how well f(x) does on the validation data; if it does well, apply it also to the test set. Estimate y = f(x) on (X, Y) and hope that the same f(x) also works on the unseen X', Y'. Refinement: cross validation. Splitting into a single training/validation set is brutal, so let's split our data (X, Y) into 10 folds (buckets): take out 1 fold for validation, train on the remaining 9, repeat this 10 times, and report average performance.
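
A minimal sketch of the 10-fold protocol just described; `train` and `evaluate` are hypothetical callables standing in for any learner, and X, Y are assumed to be NumPy arrays:

```python
import numpy as np

def cross_validate(X, Y, train, evaluate, k=10, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)            # k roughly equal buckets
    scores = []
    for i in range(k):
        val = folds[i]                                   # held-out fold
        trn = np.concatenate(folds[:i] + folds[i + 1:])  # remaining k-1 folds
        model = train(X[trn], Y[trn])
        scores.append(evaluate(model, X[val], Y[val]))
    return float(np.mean(scores))             # report average performance
```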

We'll talk about the following methods: Support Vector Machines and decision trees. Main question: how to efficiently train (build a model / find model parameters)?

Want to separate "+" from "-" using a line. Data: training examples (x_1, y_1), ..., (x_n, y_n). Each example i: x_i = (x_i^(1), ..., x_i^(d)), where each x_i^(j) is real-valued and y_i ∈ {-1, +1}. Inner product: w·x = Σ_{j=1}^d w^(j) x^(j). Which is the best linear separator (defined by w)?

[Figure: points A, B, C at decreasing distance from the separating line.] Distance from the separating hyperplane corresponds to the confidence of the prediction. Example: we are more sure about the class of A and B than of C.

Margin γ: distance of the closest example from the decision line/hyperplane. The reason we define the margin this way is theoretical convenience and the existence of generalization error bounds that depend on the value of the margin.

Remember the dot product: A·B = ‖A‖ ‖B‖ cos θ, where ‖A‖ cos θ is the length of the projection of A onto B, and ‖A‖ = √(Σ_{j=1}^d (A^(j))²).

Dot product: A·B = ‖A‖ ‖B‖ cos θ. What is w·x_1, w·x_2? [Figure: three panels showing the projections of x_1 and x_2 onto w; the farther a point is from the separating line, the larger its projection onto w, and the larger the resulting γ.] So γ roughly corresponds to the margin. Bottom line: bigger γ, bigger the separation.

Distance from a point to a line. Let: line L: w·x + b = w^(1) x^(1) + w^(2) x^(2) + b = 0, with w = (w^(1), w^(2)); point A = (x_A^(1), x_A^(2)); point M on the line = (x_M^(1), x_M^(2)). Note we assume ‖w‖ = 1. With H the projection of A onto L: d(A, L) = |AH| = |(A - M)·w| = |(x_A^(1) - x_M^(1)) w^(1) + (x_A^(2) - x_M^(2)) w^(2)| = |x_A^(1) w^(1) + x_A^(2) w^(2) + b| = |A·w + b|. Remember: x_M^(1) w^(1) + x_M^(2) w^(2) = -b, since M belongs to line L.
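
A quick numeric check of this derivation, with made-up numbers: when ‖w‖ = 1, the distance from point A to the line w·x + b = 0 equals |w·A + b|.

```python
import numpy as np

w = np.array([0.6, 0.8])      # unit vector: 0.6**2 + 0.8**2 == 1
b = -1.0
M = np.array([1.0, 0.5])      # a point on the line: w.M + b == 0
A = np.array([3.0, 2.0])      # an arbitrary point

assert abs(w @ M + b) < 1e-12
print(abs(w @ (A - M)))       # |(A - M).w|  -> 2.4
print(abs(w @ A + b))         # |A.w + b|    -> same 2.4
```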

Prediction = sign(w·x + b). Confidence = (w·x + b) y. For the i-th datapoint: γ_i = (w·x_i + b) y_i. Want to solve: max_w min_i γ_i. Can rewrite as: max_{w,γ} γ s.t. ∀i, y_i (w·x_i + b) ≥ γ.

Maximize the margin: good according to intuition, theory (cf. "VC dimension"), and practice. max_{w,γ} γ s.t. ∀i, y_i (w·x_i + b) ≥ γ. [Figure: separating hyperplane w·x + b = 0 with margin γ on either side.] γ is the margin: the distance from the separating hyperplane to the closest example; we are maximizing the margin.

The separating hyperplane is defined by the support vectors: points on the +1/-1 planes from the solution. If you knew these points, you could ignore the rest. Generally there are d+1 support vectors (for d-dimensional data).

Problem: let (w·x + b) y = γ; then (2w·x + 2b) y = 2γ. Scaling w increases the margin! Solution: work with a normalized w: γ = ((w/‖w‖)·x + b) y, where ‖w‖ = √(Σ_{j=1}^d (w^(j))²). Let's also require the support vectors x_j to be on the planes defined by: w·x_j + b = ±1.

Want to maximize the margin γ! What is the relation between x_1 and x_2? x_1 = x_2 + 2γ (w/‖w‖). We also know: w·x_1 + b = +1 and w·x_2 + b = -1. So: w·x_1 + b = +1 ⇒ w·(x_2 + 2γ w/‖w‖) + b = +1 ⇒ (w·x_2 + b) + 2γ (w·w)/‖w‖ = +1 ⇒ -1 + 2γ ‖w‖ = +1 ⇒ γ = 1/‖w‖. (Note: w·w = ‖w‖².)

We started with: max_{w,γ} γ s.t. ∀i, y_i (w·x_i + b) ≥ γ. But w can be arbitrarily large! We normalized and got: arg max γ = arg max 1/‖w‖ = arg min ‖w‖ = arg min ½‖w‖². Then: min_w ½‖w‖² s.t. ∀i, y_i (w·x_i + b) ≥ 1. This is called SVM with "hard" constraints.

If the data is not separable, introduce a penalty: min_w ½‖w‖² + C·(# of training mistakes) s.t. ∀i, y_i (w·x_i + b) ≥ 1. Minimize ‖w‖² plus the number of training mistakes. Set C using cross validation. How to penalize mistakes? All mistakes are not equally bad!

Introduce slack variables ξ_i: min_{w,b,ξ_i ≥ 0} ½‖w‖² + C Σ_{i=1}^n ξ_i s.t. ∀i, y_i (w·x_i + b) ≥ 1 - ξ_i. If point x_i is on the wrong side of the margin, it gets penalty ξ_i. For each data point: if margin ≥ 1, don't care; if margin < 1, pay linear penalty.

min_w ½‖w‖² + C·(# of mistakes) s.t. ∀i, y_i (w·x_i + b) ≥ 1. What is the role of the slack penalty C? C = ∞: only want w, b that separate the data. C = 0: can set ξ_i to anything, then w = 0 (basically ignores the data). [Figure: separators for small C, good C, and big C.]

SVM in the "natural" form: arg min_{w,b} ½ w·w + C Σ_{i=1}^n max{0, 1 - y_i (w·x_i + b)}. The first term is the regularization penalty; the second is the empirical loss L (how well we fit the training data), with C the regularization parameter. SVM uses the "hinge loss" max{0, 1 - z}, where z = y_i (w·x_i + b). [Plot: 0/1 loss vs. hinge loss as a function of z, with the margin at z = 1.] This is equivalent to: min_{w,b} ½‖w‖² + C Σ_{i=1}^n ξ_i s.t. ∀i, y_i (w·x_i + b) ≥ 1 - ξ_i.
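
A minimal NumPy sketch of this "natural form" objective (the array shapes are assumptions: X is n x d, y holds ±1 labels):

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """J(w, b) = 1/2 w.w + C * sum_i max(0, 1 - y_i (w.x_i + b))."""
    z = y * (X @ w + b)                     # z_i = y_i (w.x_i + b)
    hinge = np.maximum(0.0, 1.0 - z)        # hinge loss per example
    return 0.5 * (w @ w) + C * hinge.sum()
```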

min_{w,b} ½ w·w + C Σ_{i=1}^n ξ_i s.t. ∀i, y_i (w·x_i + b) ≥ 1 - ξ_i. Want to estimate w and b! Standard way: use a solver! Solver: software for finding solutions to "common" optimization problems. Use a quadratic solver: minimize a quadratic function subject to linear constraints. Problem: solvers are inefficient for big data!

Want to estimate w, b! Alternative approach: minimize J(w, b) directly: min_{w,b} J(w, b) = ½ Σ_{j=1}^d (w^(j))² + C Σ_{i=1}^n max{0, 1 - y_i (Σ_{j=1}^d w^(j) x_i^(j) + b)}. Side note: how to minimize a convex function g(z)? Use gradient descent: min_z g(z); iterate z_{t+1} ← z_t - η ∇g(z_t). [Figure: a convex g(z).]

Want to minimize J(w, b). Compute the gradient ∇J^(j) w.r.t. w^(j): ∇J^(j) = ∂J(w, b)/∂w^(j) = w^(j) + C Σ_{i=1}^n ∂L(x_i, y_i)/∂w^(j), where the empirical loss L(x_i, y_i) has ∂L(x_i, y_i)/∂w^(j) = 0 if y_i (w·x_i + b) ≥ 1, and ∂L(x_i, y_i)/∂w^(j) = -y_i x_i^(j) otherwise.

Gradient descent: iterate until convergence: for j = 1...d, evaluate ∇J^(j) = ∂J(w, b)/∂w^(j) = w^(j) + C Σ_{i=1}^n ∂L(x_i, y_i)/∂w^(j), then update w^(j) ← w^(j) - η ∇J^(j). (n is the size of the training dataset, η the learning-rate parameter, C the regularization parameter.) Problem: computing ∇J^(j) takes O(n) time!

Stochastic gradient descent (SGD): instead of evaluating the gradient over all examples, evaluate it for each individual training example: ∇J^(j)(x_i) = w^(j) + C ∂L(x_i, y_i)/∂w^(j). We just had ∇J^(j) = w^(j) + C Σ_{i=1}^n ∂L(x_i, y_i)/∂w^(j); notice there is no summation over i anymore. Stochastic gradient descent: iterate until convergence: for i = 1...n, for j = 1...d: compute ∇J^(j)(x_i), then update w^(j) ← w^(j) - η ∇J^(j)(x_i).
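
A minimal SGD-SVM sketch following this per-example update; the constant learning rate, epoch count, and random shuffling are assumptions, while the gradient logic matches the hinge-loss derivative above:

```python
import numpy as np

def sgd_svm(X, y, C=1.0, eta=0.01, epochs=10, seed=0):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(n):          # one example at a time
            if y[i] * (X[i] @ w + b) < 1:     # margin violated: dL/dw = -y_i x_i
                w -= eta * (w - C * y[i] * X[i])
                b += eta * C * y[i]
            else:                             # no loss: only the regularizer acts
                w -= eta * w
    return w, b
```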

Example by Leon Bottou: Reuters RCV1 document corpus. Predict a category of a document; one vs. the rest classification. n = 781,000 training examples (documents), 23,000 test examples, d = 50,000 features: one feature per word; remove stop-words; remove low-frequency words.

Questions: (1) Is SGD successful at minimizing J(w, b)? (2) How quickly does SGD find the minimum of J(w, b)? (3) What is the error on a test set? [Table: training time, value of J(w, b), and test error for standard SVM, "Fast SVM", and SGD-SVM.] Findings: (1) SGD-SVM is successful at minimizing the value of J(w, b); (2) SGD-SVM is super fast; (3) SGD-SVM test set error is comparable.

SGD-SVM vs. conventional SVM. Optimization quality: J(w, b) - J(w_opt, b_opt). For optimizing J(w, b) to within reasonable quality, SGD-SVM is super fast.

SGD on the full dataset vs. conjugate gradient (CG) on a sample of n training examples. Theory says: gradient descent converges in time linear in the condition number k; conjugate gradient converges in √k. Bottom line: doing a simple (but fast) SGD update many times is better than doing a complicated (but slow) CG update a few times.

Sparse linear SVM: the feature vector x_i is sparse (contains many zeros). Do not store x_i as [0,0,0,1,0,0,0,0,5,0,0,0,0,0,0,...]; represent it as a sparse vector x_i = [(4,1), (9,5), ...]. Can we do the SGD update w ← w - η(w + C ∂L(x_i, y_i)/∂w) more efficiently? Approximate it in 2 steps: (1) w ← w - ηC ∂L(x_i, y_i)/∂w, which is cheap (x_i is sparse, so only a few coordinates j of w will be updated); (2) w ← w(1 - η), which is expensive (w is not sparse, so all coordinates need to be updated).

Solution 1: represent the vector w as the product of a scalar s and a vector v, w = s·v. The two-step update procedure then becomes: (1) v ← v - ηC ∂L(x_i, y_i)/∂w; (2) s ← s(1 - η), a single scalar operation. Solution 2: perform only step (1) for each training example; perform step (2) with lower frequency and higher η.
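
A minimal sketch of Solution 1: storing w as s·v makes step (2), the regularization shrink, an O(1) scalar multiply, while step (1) touches only the nonzero coordinates of a sparse x_i (here a list of (index, value) pairs). Dividing by s inside step (1) so that s·v tracks the true w, and re-normalizing when s gets small (not shown), are implementation details assumed here:

```python
def sparse_sgd_step(s, v, x_sparse, y_i, eta, C):
    """One update of w = s*v for an example that violates the margin."""
    for j, x_j in x_sparse:                 # step (1): cheap, sparse
        v[j] += eta * C * y_i * x_j / s     # so s*v gains eta*C*y_i*x_i
    s *= (1.0 - eta)                        # step (2): one scalar multiply
    return s, v
```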

Stopping criteria: how many iterations of SGD? Early stopping with cross validation: create a validation set, monitor the cost function on the validation set, and stop when the loss stops decreasing. Early stopping: extract two (very) small subsets of training data A and B; train on A, stopping by validating on B; the number of training epochs on A is an estimate of k; train for k epochs on the full dataset.
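
A minimal sketch of "early stopping with cross validation": run one training epoch at a time and stop as soon as the validation loss stops decreasing. `train_epoch` and `val_loss` are hypothetical callables, and stopping after a single non-improving epoch (no patience) is an assumption:

```python
def train_with_early_stopping(train_epoch, val_loss, max_epochs=100):
    best = float("inf")
    for epoch in range(max_epochs):
        train_epoch()              # one SGD pass over the training set
        loss = val_loss()          # monitor cost on the validation set
        if loss >= best:           # loss stopped decreasing: stop
            return epoch           # usable as the epoch-count estimate k
        best = loss
    return max_epochs
```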

Idea 1: one against all. Learn 3 classifiers: + vs. {o, -}; - vs. {o, +}; o vs. {+, -}. Obtain: (w_+, b_+), (w_-, b_-), (w_o, b_o). How to classify? Return class c = arg max_c (w_c·x + b_c).
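
A minimal sketch of this one-against-all decision rule: given one trained (w_c, b_c) per class (training omitted), return the class whose linear classifier is most confident on x:

```python
import numpy as np

def predict_one_vs_all(x, classifiers):
    """classifiers: dict mapping class label c -> (w_c, b_c)."""
    return max(classifiers,
               key=lambda c: classifiers[c][0] @ x + classifiers[c][1])
```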

Idea 2: learn 3 sets of weights simultaneously! For each class c estimate w_c, b_c. Want the correct class y_i to have the highest margin: w_{y_i}·x_i + b_{y_i} ≥ 1 + w_c·x_i + b_c, ∀c ≠ y_i, ∀(x_i, y_i).

Optimization problem: min_{w,b} ½ Σ_c ‖w_c‖² + C Σ_{i=1}^n ξ_i s.t. ∀c ≠ y_i, ∀i: w_{y_i}·x_i + b_{y_i} ≥ w_c·x_i + b_c + 1 - ξ_i, with ξ_i ≥ 0. To obtain the parameters w_c, b_c (for each class c) we can use similar techniques as for the 2-class SVM. SVM is widely perceived as a very powerful learning algorithm.

New setting: online learning. Allows for modeling problems where we have a continuous stream of data and we want an algorithm to learn from it and slowly adapt to the changes in the data. Idea: do slow updates to the model. SGD-SVM makes updates if it misclassifies a datapoint. So: first train the classifier on training data; then, for every example from the stream, if we misclassify, update the model (using a small learning rate).

Protocol: a user comes and tells us the origin and destination; we offer to ship the package for some amount of money ($10 - $50). Based on the price we offer, sometimes the user uses our service (y = +1), sometimes they don't (y = -1). Task: build an algorithm to optimize what price we offer to the users. The features x capture information about the user and the origin and destination. Problem: will the user accept the price?

Model whether the user will accept our price: y = f(x; w). Accept: y = +1; not accept: y = -1. Build this model with, say, Perceptron or SVM. The website runs continuously, so an online learning algorithm would do something like this: a user comes and is represented as an (x, y) pair, where x is the feature vector (including the price we offer, origin, destination) and y indicates whether they chose to use our service or not. The algorithm updates w using just the (x, y) pair. Basically, we update the parameters w every time we get some new data.
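
A minimal sketch of this streaming protocol: one SGD-SVM step per arriving (x, y) pair, with a deliberately small learning rate so the model adapts slowly (the eta and C values are assumptions):

```python
import numpy as np

def online_update(w, b, x, y, eta=0.001, C=1.0):
    """Update the model on a single streamed example."""
    if y * (x @ w + b) < 1:           # misclassified or inside the margin
        w = (1 - eta) * w + eta * C * y * x
        b = b + eta * C * y
    else:
        w = (1 - eta) * w             # only the regularizer shrinks w
    return w, b
```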

We discard this idea of a data "set"; instead we have a continuous stream of data. Further comments: for a major website with a massive stream of data, this kind of algorithm is pretty reasonable, since there is no need to deal with all the training data. If you had a small number of users, you could save their data and then run a normal algorithm on the full dataset, doing multiple passes over the data.

An online algorithm can adapt to changing user preferences. For example, over time users may become more price sensitive; the algorithm adapts and learns this, so the system is dynamic.