CS246: Mining Massive Datasets. Jure Leskovec, Stanford University. http://cs246.stanford.edu
Perceptron: y' = sign(w · x)

How to find parameters w?
- Start with w_0 = 0
- Pick training examples x_t one by one
- Predict the class of x_t using the current weights: y' = sign(w_t · x_t)
- If y' is correct (i.e., y_t = y'): no change, w_{t+1} = w_t
- If y' is wrong: adjust w_t:
  w_{t+1} = w_t + η · y_t · x_t
  - η is the learning rate parameter
  - x_t is the t-th training example
  - y_t is the true t-th class label ({+1, -1})
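A minimal sketch of this update rule in Python (NumPy). The function name, the single-pass structure, and the default learning rate are illustrative assumptions, not from the slides:

```python
import numpy as np

def perceptron_pass(X, y, eta=0.1):
    """One pass of the perceptron update over training examples (X, y)."""
    n, d = X.shape
    w = np.zeros(d)                    # start with w_0 = 0
    for t in range(n):
        y_pred = np.sign(w @ X[t])     # predict with current weights w_t
        if y_pred != y[t]:             # prediction is wrong: adjust w_t
            w = w + eta * y[t] * X[t]  # w_{t+1} = w_t + eta * y_t * x_t
    return w
```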
Issues with the perceptron:
- Overfitting
- Regularization: if the data is not separable, the weights dance around
- Mediocre generalization: finds a "barely" separating solution
Want to separate "+" from "-" using a line.

Data: Training examples (x_1, y_1) ... (x_n, y_n)
- Each example i: x_i = (x_i^(1), ..., x_i^(d)), where x_i^(j) is real valued
- y_i ∈ {-1, +1}
- Inner product: w · x = Σ_{j=1}^d w^(j) x^(j)

Which is the best linear separator (defined by w)?
(Figure: points A, B, C at different distances from a separating hyperplane.)

The distance from the separating hyperplane corresponds to the confidence of the prediction. Example: we are more sure about the class of A and B than of C.
Margin γ: the distance of the closest example from the decision line/hyperplane.

The reason we define the margin this way is theoretical convenience and the existence of generalization error bounds that depend on the value of the margin.
Remember the dot product: A · B = ‖A‖ ‖B‖ cos θ, where ‖A‖ = √(Σ_{j=1}^d (A^(j))²)
Distance from a point to a line:

Let:
- Line L: w · x + b = w^(1) x^(1) + w^(2) x^(2) + b = 0, with w = (w^(1), w^(2)) a unit vector (‖w‖ = 1)
- Point A = (x_A^(1), x_A^(2))
- Point M on the line, M = (x_M^(1), x_M^(2))

d(A, L) = |AH| = |(A − M) · w|
        = |(x_A^(1) − x_M^(1)) w^(1) + (x_A^(2) − x_M^(2)) w^(2)|
        = |x_A^(1) w^(1) + x_A^(2) w^(2) + b|
        = |A · w + b|

Remember: x_M^(1) w^(1) + x_M^(2) w^(2) = −b, since M belongs to line L.
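A quick numeric check of this formula in NumPy. The unit-norm w, the offset b, and the point A are made-up values for illustration only:

```python
import numpy as np

w = np.array([0.6, 0.8])   # unit-norm normal vector, ‖w‖ = 1
b = -2.0                   # offset of the line w·x + b = 0
A = np.array([5.0, 1.0])   # query point

dist = abs(A @ w + b)      # d(A, L) = |A·w + b| since ‖w‖ = 1
print(dist)                # 1.8
```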
Prediction = sign(w · x + b)
Confidence = (w · x + b) · y

For the i-th datapoint: γ_i = (w · x_i + b) · y_i

Want to solve: max_w min_i γ_i

Can rewrite as:
max_{w,γ} γ
s.t. ∀i, y_i (w · x_i + b) ≥ γ
Maximize the margin: good according to intuition, theory (VC dimension) & practice.

max_{w,γ} γ
s.t. ∀i, y_i (w · x_i + b) ≥ γ

γ is the margin: the distance from the separating hyperplane w · x + b = 0. (Figure: maximizing the margin.)
The separating hyperplane is defined by the support vectors:
- Points on the +1/−1 planes from the solution
- If you knew these points, you could ignore the rest
- If there are no degeneracies, d+1 support vectors (for d-dimensional data)
Problem: Let (w · x + b) y = γ. Then (2w · x + 2b) y = 2γ, so scaling w increases the margin!

Solution: work with a normalized w:
γ_i = ((w/‖w‖) · x_i + b) y_i, where ‖w‖ = √(Σ_{j=1}^d (w^(j))²)

Let's also require the support vectors x_j to be on the planes defined by:
w · x_j + b = ±1
Want to maximize margin γ! What is the relation between x_1 and x_2?
- x_1 = x_2 + 2γ (w/‖w‖)
- We also know: w · x_1 + b = +1 and w · x_2 + b = −1

So:
w · x_1 + b = +1
w · (x_2 + 2γ w/‖w‖) + b = +1
(w · x_2 + b) + 2γ (w · w)/‖w‖ = +1
−1 + 2γ ‖w‖ = +1
γ = 1/‖w‖

Note: w · w = ‖w‖²
We started with:
max_{w,γ} γ
s.t. ∀i, y_i (w · x_i + b) ≥ γ

But w can be arbitrarily large! We normalized, and then:
max γ = max 1/‖w‖ = min ‖w‖ = min ½‖w‖²

min_w ½‖w‖²
s.t. ∀i, y_i (w · x_i + b) ≥ 1

This is called SVM with "hard" constraints.
If the data is not separable, introduce a penalty:
min_w ½‖w‖² + C · (#number of mistakes)
s.t. ∀i, y_i (w · x_i + b) ≥ 1

- Minimize ‖w‖² plus the number of training mistakes
- Set C using cross validation

How to penalize mistakes? All mistakes are not equally bad!
Introduce slack variables ξ_i:
min_{w,b,ξ_i≥0} ½‖w‖² + C · Σ_{i=1}^n ξ_i
s.t. ∀i, y_i (w · x_i + b) ≥ 1 − ξ_i

If point x_i is on the wrong side of the margin, it gets penalty ξ_i.

For each datapoint:
- If margin ≥ 1, don't care
- If margin < 1, pay a linear penalty
min_w ½‖w‖² + C · (#number of mistakes)
s.t. ∀i, y_i (w · x_i + b) ≥ 1

What is the role of the slack penalty C?
- C = ∞: we only want w, b that separate the data
- C = 0: we can set ξ_i to anything, then w = 0 (basically ignores the data)

(Figure: decision boundaries for small C, "good" C, big C.)
SVM in the "natural" form:
arg min_{w,b} ½ w · w + C · Σ_{i=1}^n max{0, 1 − y_i (w · x_i + b)}

The first term is the regularization; the second is the empirical loss L (how well we fit the training data); C is the regularization parameter. Equivalently, with slack variables:
min_{w,b} ½‖w‖² + C · Σ_{i=1}^n ξ_i
s.t. ∀i, y_i (w · x_i + b) ≥ 1 − ξ_i

SVM uses the "hinge loss": max{0, 1 − z} with z = y_i (w · x_i + b), a convex surrogate for the 0/1 loss. (Figure: hinge loss vs. 0/1 loss as a function of z.)
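A minimal sketch of evaluating this objective in NumPy. The function name and the array-based interface are illustrative assumptions:

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """SVM in the natural form: 0.5 * w·w + C * sum of hinge losses."""
    z = y * (X @ w + b)                  # z_i = y_i (w·x_i + b)
    hinge = np.maximum(0.0, 1.0 - z)     # hinge loss per example
    return 0.5 * (w @ w) + C * hinge.sum()
```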
min_{w,b} ½‖w‖² + C · Σ_{i=1}^n ξ_i
s.t. ∀i, y_i (w · x_i + b) ≥ 1 − ξ_i

Want to estimate w and b!
- Standard way: use a solver!
- A solver is software for finding solutions to "common" optimization problems
- Here, use a quadratic solver: minimize a quadratic function subject to linear constraints

Problem: solvers are inefficient for big data!
Want to estimate w, b! Alternative approach: minimize f(w, b) directly:

f(w, b) = ½ Σ_{j=1}^d (w^(j))² + C · Σ_{i=1}^n max{0, 1 − y_i (Σ_{j=1}^d w^(j) x_i^(j) + b)}

How to minimize a convex function f(z)? Use gradient descent: min_z f(z).
Iterate: z_{t+1} ← z_t − η f′(z_t)
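A minimal sketch of this generic iteration in Python; the quadratic f(z) = (z − 3)² is an arbitrary convex example chosen for illustration:

```python
def gradient_descent(f_prime, z0, eta=0.1, steps=100):
    """Iterate z_{t+1} = z_t - eta * f'(z_t)."""
    z = z0
    for _ in range(steps):
        z = z - eta * f_prime(z)
    return z

# Example: f(z) = (z - 3)^2, so f'(z) = 2(z - 3); minimum at z = 3
print(gradient_descent(lambda z: 2 * (z - 3), z0=0.0))  # ~3.0
```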
Want to minimize f(w, b):

f(w, b) = ½ Σ_{j=1}^d (w^(j))² + C · Σ_{i=1}^n max{0, 1 − y_i (Σ_{j=1}^d w^(j) x_i^(j) + b)}

Compute the gradient ∇f^(j) with respect to w^(j):
∇f^(j) = ∂f(w, b)/∂w^(j) = w^(j) + C · Σ_{i=1}^n ∂L(x_i, y_i)/∂w^(j)

where L(x_i, y_i) = max{0, 1 − y_i (w · x_i + b)} is the empirical loss, and:
∂L(x_i, y_i)/∂w^(j) = 0              if y_i (w · x_i + b) ≥ 1
∂L(x_i, y_i)/∂w^(j) = −y_i x_i^(j)   else
Gradient descent:
Iterate until convergence:
- For j = 1 ... d:
  - Evaluate: ∇f^(j) = ∂f(w, b)/∂w^(j) = w^(j) + C · Σ_{i=1}^n ∂L(x_i, y_i)/∂w^(j)
  - Update: w^(j) ← w^(j) − η ∇f^(j)

(η is the learning rate parameter, C is the regularization parameter.)

Problem: computing ∇f^(j) takes O(n) time, where n is the size of the training dataset!
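A NumPy sketch of this batch gradient, vectorized over all coordinates j at once; the function names and the choice to hold b fixed are illustrative simplifications (the slide's gradient is with respect to w):

```python
import numpy as np

def batch_gradient(w, b, X, y, C):
    """Full-batch gradient of f(w,b) w.r.t. w: costs O(n) per evaluation."""
    z = y * (X @ w + b)
    active = z < 1                        # examples whose hinge loss is nonzero
    # dL/dw = -y_i * x_i for active examples, 0 for the rest
    grad_loss = -(y[active, None] * X[active]).sum(axis=0)
    return w + C * grad_loss

def train_gd(X, y, C=1.0, eta=0.01, iters=1000):
    w, b = np.zeros(X.shape[1]), 0.0      # b held fixed here for brevity
    for _ in range(iters):
        w = w - eta * batch_gradient(w, b, X, y, C)
    return w, b
```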
Stochastic Gradient Descent: instead of evaluating the gradient over all examples, evaluate it for each individual training example:

∇f_i^(j) = w^(j) + C · ∂L(x_i, y_i)/∂w^(j)

(We just had: ∇f^(j) = w^(j) + C · Σ_{i=1}^n ∂L(x_i, y_i)/∂w^(j))

Stochastic gradient descent:
Iterate until convergence:
- For i = 1 ... n:
  - For j = 1 ... d:
    - Evaluate: ∇f_i^(j)
    - Update: w^(j) ← w^(j) − η ∇f_i^(j)
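The per-example update as a NumPy sketch; the function name, the fixed learning rate, and keeping b fixed are illustrative assumptions (the slides discuss choosing η later):

```python
import numpy as np

def train_sgd(X, y, C=1.0, eta=0.01, epochs=10):
    """SGD for SVM: update w after each individual training example."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in range(n):
            grad = w.copy()               # gradient of the 0.5*‖w‖² term
            if y[i] * (X[i] @ w + b) < 1: # hinge loss active for example i
                grad -= C * y[i] * X[i]   # add C * dL(x_i, y_i)/dw
            w = w - eta * grad
    return w, b
```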
Example by Leon Bottou:
- Reuters RCV1 document corpus
- Predict the category of a document: one vs. the rest classification
- n = 781,000 training examples (documents), 23,000 test examples
- d = 50,000 features: one feature per word; stopwords and low-frequency words removed
Questions:
(1) Is SGD successful at minimizing f(w, b)?
(2) How quickly does SGD find the minimum of f(w, b)?
(3) What is the error on a test set?

(Table: standard SVM vs. fast SVM vs. SGD-SVM, comparing training time, value of f(w, b), and test error.)

Answers:
(1) SGD-SVM is successful at minimizing the value of f(w, b)
(2) SGD-SVM is super fast
(3) SGD-SVM test set error is comparable
(Figure: optimization quality f(w, b) − f(w_opt, b_opt) vs. training time, for SGD-SVM and a conventional SVM.)

For optimizing f(w, b) to within reasonable quality, SGD-SVM is super fast.
SGD on the full dataset vs. Batch Conjugate Gradient (BCG) on a sample of n training examples.

Theory says (k is the condition number):
- Gradient descent converges in time linear in k
- Conjugate gradient converges in √k

Bottom line: doing a simple (but fast) SGD update many times is better than doing a complicated (but slow) BCG update a few times.
Need to choose the learning rate η and the offset t_0:

w_{t+1} ← w_t − (η_0 / (t + t_0)) · (w_t + C · ∂L(x_t, y_t)/∂w)

Leon Bottou suggests:
- Choose t_0 so that the expected initial updates are comparable with the expected size of the weights
- Choose η_0:
  - Select a small subsample
  - Try various rates η (e.g., 10, 1, 0.1, 0.01, ...)
  - Pick the one that most reduces the cost
  - Use η for the next 100k iterations on the full dataset
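A sketch of this rate-selection recipe in Python, assuming a hypothetical helper cost_after_training(eta, subsample) that runs SGD on the small subsample and returns the resulting cost:

```python
def pick_learning_rate(cost_after_training, subsample,
                       candidates=(10, 1, 0.1, 0.01, 0.001)):
    """Try various rates on a small subsample; keep the one that most reduces the cost."""
    return min(candidates, key=lambda eta: cost_after_training(eta, subsample))
```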
Sparse Linear SVM:
- The feature vector x_i is sparse (contains many zeros)
  - Do not do: x_i = [0,0,0,1,0,0,0,0,5,0,0,0,0,0,0, ...]
  - But represent x_i as a sparse vector: x_i = [(4,1), (9,5), ...]
- Can we do the SGD update more efficiently?
  w ← w − η (w + C · ∂L(x_i, y_i)/∂w)
- Approximated in 2 steps:
  (1) w ← w − η C · ∂L(x_i, y_i)/∂w   (cheap: x_i is sparse, so few coordinates j of w will be updated)
  (2) w ← w (1 − η)                   (expensive: w is not sparse, all coordinates need to be updated)
Solution 1: represent the vector w as the product of a scalar s and a vector v, w = s · v. Then the two-step update procedure becomes:
(1) v ← v − η C · ∂L(x_i, y_i)/∂w   (still a sparse update, now to v)
(2) s ← s (1 − η)                   (a single scalar multiplication)

Solution 2: perform only step (1) for each training example; perform step (2) with lower frequency and higher η.
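A sketch of Solution 1 in Python, assuming x_i is stored as (index, value) pairs as above; the class and method names are illustrative:

```python
class ScaledWeights:
    """Represent w = s * v so the shrink step w <- w(1 - eta) costs O(1)."""
    def __init__(self, d):
        self.s = 1.0
        self.v = [0.0] * d

    def dot(self, x_sparse):
        """w · x_i for a sparse x_i given as (index, value) pairs."""
        return self.s * sum(self.v[j] * x_j for j, x_j in x_sparse)

    def sgd_update(self, x_sparse, y, eta, C):
        if y * self.dot(x_sparse) < 1:      # hinge loss active
            # step (1): sparse update, touches only the nonzero coordinates
            for j, x_j in x_sparse:
                self.v[j] += eta * C * y * x_j / self.s
        self.s *= (1.0 - eta)               # step (2): O(1) scalar shrink
```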
Stopping criteria: how many iterations of SGD?
- Early stopping with cross validation:
  - Create a validation set
  - Monitor the cost function on the validation set
  - Stop when the loss stops decreasing
- Early stopping:
  - Extract two disjoint subsamples A and B of the training data
  - Train on A, stop by validating on B
  - The number of epochs is an estimate of k
  - Train for k epochs on the full dataset
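A sketch of the first variant in Python; the callable interface (one function per SGD epoch, one returning the validation loss) is an illustrative simplification:

```python
def train_with_early_stopping(run_epoch, validation_loss, max_epochs=100):
    """Stop SGD when the loss on the validation set stops decreasing."""
    best = float("inf")
    for epoch in range(max_epochs):
        run_epoch()                  # one SGD epoch on the training set
        loss = validation_loss()     # monitor cost on the validation set
        if loss >= best:             # loss stopped decreasing: stop
            return epoch
        best = loss
    return max_epochs
```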
Idea 1: One against all. Learn 3 classifiers:
- + vs. {o, −}
- − vs. {o, +}
- o vs. {+, −}
Obtain: w_+, b_+; w_−, b_−; w_o, b_o

How to classify? Return the class c with the highest score:
c = arg max_c (w_c · x + b_c)
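A minimal sketch of this arg-max prediction rule in NumPy, assuming the per-class weights and biases have already been trained (names are illustrative):

```python
import numpy as np

def predict_one_vs_all(x, W, b, classes):
    """Return the class c maximizing w_c · x + b_c.

    W: (num_classes, d) array of per-class weights w_c
    b: (num_classes,) array of per-class biases b_c
    """
    scores = W @ x + b
    return classes[np.argmax(scores)]

# Usage: predict_one_vs_all(x, W, b, classes=["+", "-", "o"])
```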
Learn 3 sets of weights simultaneously. For each class c, estimate w_c, b_c.

Want the correct class to have the highest margin:
w_{y_i} · x_i + b_{y_i} ≥ 1 + w_c · x_i + b_c   ∀c ≠ y_i, for every example (x_i, y_i)
Optimization problem:
min_{w,b} ½ Σ_c ‖w_c‖² + C · Σ_{i=1}^n ξ_i
s.t. ∀i, ∀c ≠ y_i: w_{y_i} · x_i + b_{y_i} ≥ w_c · x_i + b_c + 1 − ξ_i,  ξ_i ≥ 0

To obtain the parameters w_c, b_c (for each class c) we can use similar techniques as for the 2-class SVM.

SVM is widely perceived as a very powerful learning algorithm.
New setting: Online Learning
- Allows for modeling problems where we have a continuous stream of data
- We want an algorithm to learn from it and slowly adapt to changes in the data

Idea: do slow updates to the model.
- Both of our methods, SVM and Perceptron, make updates only when they misclassify an example
- So: first train the classifier on training data. Then, for every example from the stream, if we misclassify it, update the model (using a small learning rate).
Example: shipping service.

Protocol:
- A user comes and tells us the origin and destination
- We offer to ship the package for some amount of money ($10 - $50)
- Based on the price we offer, sometimes the user uses our service (y_i = 1), sometimes they don't (y_i = -1)

Task: build an algorithm to optimize the price we offer to users.
Features x capture information about the user and the origin and destination.
Problem: will the user accept the price?
Model whether the user will accept our price: y = f(x; w)
- Accept: y = +1; not accept: y = -1
- Build this model with, say, Perceptron or Winnow

The website runs continuously; an online learning algorithm would do something like this:
- A user comes and is represented as an (x, y) pair, where
  - x: feature vector including the price we offer, origin, destination
  - y: whether they chose to use our service or not
- The algorithm updates w using just the (x, y) pair
- Basically, we update the parameters w every time we get some new data
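A sketch of this online loop reusing the perceptron update from earlier; the stream interface and the feature encoding are illustrative assumptions:

```python
import numpy as np

def online_perceptron(stream, d, eta=0.01):
    """Update the model one (x, y) pair at a time as users arrive."""
    w = np.zeros(d)
    for x, y in stream:              # continuous stream of (x, y) pairs
        if np.sign(w @ x) != y:      # misclassified: update the model
            w = w + eta * y * x      # small learning rate: slow adaptation
        yield w                      # current model after each user
```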
We discard the idea of a fixed dataset; instead we have a continuous stream of data.

Further comments:
- For a major website with a massive stream of data, this kind of algorithm is pretty reasonable: we don't need to deal with all the training data at once
- If you had a small number of users, you could save their data and then run a normal algorithm on the full dataset, doing multiple passes over the data
An online algorithm can adapt to changing user preferences. For example, over time users may become more price sensitive. The algorithm adapts and learns this, so the system is dynamic.