CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Transcription:

CS246: Mining Massive Datasets. Jure Leskovec, Stanford University. http://cs246.stanford.edu

Perceptron: y' = sign(w · x). How to find the parameters w?
Start with w_0 = 0. Pick training examples x_t one by one. Predict the class of x_t using the current weights: y' = sign(w_t · x_t).
If y' is correct (i.e., y_t = y'): no change, w_{t+1} = w_t.
If y' is wrong: adjust w_t: w_{t+1} = w_t + η · y_t · x_t, where η is the learning rate parameter, x_t is the t-th training example, and y_t ∈ {+1, -1} is the true t-th class label.
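A minimal sketch of this perceptron loop in Python (not the lecture's code; eta and n_epochs are illustrative choices):

```python
import numpy as np

def perceptron_train(X, y, eta=0.1, n_epochs=10):
    """X: (n, d) array of examples; y: labels in {+1, -1}."""
    w = np.zeros(X.shape[1])           # start with w_0 = 0
    for _ in range(n_epochs):
        for x_t, y_t in zip(X, y):     # pick training examples one by one
            y_pred = np.sign(w @ x_t)  # predict with the current w_t
            if y_pred != y_t:          # if wrong: w_{t+1} = w_t + eta * y_t * x_t
                w = w + eta * y_t * x_t
    return w
```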

Issues with the perceptron:
Overfitting.
Regularization: if the data is not separable, the weights dance around.
Mediocre generalization: it finds a "barely" separating solution.

Want to separate "+" from "-" using a line.
Data: training examples (x_1, y_1) … (x_n, y_n). Each example i: x_i = (x_i^(1), …, x_i^(d)), where x_i^(j) is real-valued and y_i ∈ {-1, +1}.
Inner product: w · x = Σ_{j=1}^d w^(j) · x^(j).
Which is the best linear separator (defined by w)?

Distance from the separating hyperplane corresponds to the confidence of the prediction.
Example: for three points A, B, C, we are more sure about the class of A and B than of C.

Margin: the distance of the closest example from the decision line/hyperplane.
The reason we define the margin this way is theoretical convenience and the existence of generalization error bounds that depend on the value of the margin.

Remember: dot product A · B = |A| · |B| · cos θ, where |A| = sqrt(Σ_{j=1}^d (A^(j))²).

Distance from a point to a line:
Let line L be w · x + b = w^(1) x^(1) + w^(2) x^(2) + b = 0, with w = (w^(1), w^(2)) a unit normal vector; let A = (x_A^(1), x_A^(2)) be a point, and M = (x_M^(1), x_M^(2)) a point on the line.
d(A, L) = |AH| = |(A - M) · w| = |(x_A^(1) - x_M^(1)) w^(1) + (x_A^(2) - x_M^(2)) w^(2)| = |x_A^(1) w^(1) + x_A^(2) w^(2) + b| = |A · w + b|.
Remember: x_M^(1) w^(1) + x_M^(2) w^(2) = -b, since M belongs to line L.
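A quick numeric check of this formula in Python, with made-up values for w, b, and the point A (dividing by |w| covers the general, non-unit case):

```python
import numpy as np

w = np.array([0.6, 0.8])   # unit normal vector: |w| = 1
b = -2.0
A = np.array([3.0, 4.0])

# d(A, L) = |w . A + b| when |w| = 1; divide by |w| in the general case
dist = abs(w @ A + b) / np.linalg.norm(w)
print(dist)  # 3.0 for these example values
```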

Prediction = sign(w · x + b). Confidence = (w · x + b) · y.
For the i-th datapoint: γ_i = (w · x_i + b) · y_i.
Want to solve: max_w min_i γ_i.
Can rewrite as: max_{w,γ} γ s.t. ∀i, y_i (w · x_i + b) ≥ γ.

Maximize the margin: good according to intuition, theory (VC dimension) & practice.
max_{w,γ} γ s.t. ∀i, y_i (w · x_i + b) ≥ γ.
Here w · x + b = 0 is the separating hyperplane, and γ is the margin: the distance from the separating hyperplane.

The separating hyperplane is defined by the support vectors: the points on the +1/-1 planes from the solution.
If you knew these points, you could ignore the rest.
If there are no degeneracies, there are d+1 support vectors (for d-dimensional data).

Problem: let (w · x + b) · y = γ; then (2w · x + 2b) · y = 2γ. Scaling w increases the margin!
Solution: work with a normalized w: γ = (w/|w| · x + b/|w|) · y, where |w| = sqrt(Σ_{j=1}^d (w^(j))²).
Let's also require the support vectors x_j to lie on the planes defined by w · x_j + b = ±1.

Want to maximize the margin γ! What is the relation between x_1 and x_2?
x_1 = x_2 + 2γ · w/|w|. We also know: w · x_1 + b = +1 and w · x_2 + b = -1.
So: w · x_1 + b = +1 ⟹ w · (x_2 + 2γ w/|w|) + b = +1 ⟹ (w · x_2 + b) + 2γ (w · w)/|w| = +1 ⟹ -1 + 2γ|w| = +1 ⟹ γ = 1/|w|.
Note: w · w = |w|².

We started with: max_{w,γ} γ s.t. ∀i, y_i (w · x_i + b) ≥ γ. But |w| can be arbitrarily large!
We normalized and got: max γ = max 1/|w| = min |w| = min ½|w|².
Then: min_w ½|w|² s.t. ∀i, y_i (w · x_i + b) ≥ 1.
This is called SVM with "hard" constraints.

If the data is not separable, introduce a penalty:
min_w ½|w|² + C · (#number of mistakes) s.t. ∀i, y_i (w · x_i + b) ≥ 1.
Minimize ½|w|² plus the number of training mistakes. Set C using cross-validation.
How to penalize mistakes? All mistakes are not equally bad!

Introduce slack variables ξ_i:
min_{w,b,ξ_i ≥ 0} ½|w|² + C Σ_{i=1}^n ξ_i s.t. ∀i, y_i (w · x_i + b) ≥ 1 - ξ_i.
If the point x_i is on the wrong side of the margin, we pay the penalty ξ_i.
For each datapoint: if margin ≥ 1, don't care; if margin < 1, pay a linear penalty.

min_w ½|w|² + C · (#number of mistakes) s.t. ∀i, y_i (w · x_i + b) ≥ 1.
What is the role of the slack penalty C?
C = ∞: only want w, b that separate the data.
C = 0: can set ξ_i to anything, then w = 0 (basically ignores the data).
(Figure: decision boundaries for small C, "good" C, and big C.)

SVM in the "natural" form: arg min_{w,b} ½ w · w + C Σ_{i=1}^n max{0, 1 - y_i (w · x_i + b)}.
The first term is the margin (regularization) term; the second is the empirical loss L (how well we fit the training data); C is the regularization parameter.
Equivalently: min_{w,b} ½|w|² + C Σ_{i=1}^n ξ_i s.t. ∀i, y_i (w · x_i + b) ≥ 1 - ξ_i.
SVM uses the "Hinge loss": max{0, 1 - z} with z = y_i (w · x_i + b), in contrast to the 0/1 loss.
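This "natural" form translates directly into code. A minimal sketch of the objective, assuming NumPy arrays and labels in {+1, -1}:

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """f(w, b) = 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i * (w . x_i + b))."""
    margins = y * (X @ w + b)                # confidence of each prediction
    hinge = np.maximum(0.0, 1.0 - margins)   # hinge loss per example
    return 0.5 * (w @ w) + C * hinge.sum()
```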

Announcement: HW2 is graded. We sorted it alphabetically into several piles. Please don't mess up the piles.

min_{w,b} ½|w|² + C Σ_{i=1}^n ξ_i s.t. ∀i, y_i (w · x_i + b) ≥ 1 - ξ_i.
Want to estimate w and b! Standard way: use a solver!
Solver: software for finding solutions to "common" optimization problems. Use a quadratic solver: minimize a quadratic function subject to linear constraints.
Problem: solvers are inefficient for big data!

Want to estimate w, b! Alternative approach: minimize f(w, b) directly:
f(w, b) = ½ Σ_{j=1}^d (w^(j))² + C Σ_{i=1}^n max{0, 1 - y_i (Σ_{j=1}^d w^(j) x_i^(j) + b)}.
(This is the constrained problem min_{w,b} ½|w|² + C Σ_{i=1}^n ξ_i s.t. ∀i, y_i (w · x_i + b) ≥ 1 - ξ_i rewritten in unconstrained form.)
How do we minimize convex functions f(z)? Use gradient descent: min_z f(z); iterate z_{t+1} ← z_t - η f'(z_t).

Want to minimize f(w, b). Compute the gradient ∇f^(j) w.r.t. w^(j):
∇f^(j) = ∂f(w, b)/∂w^(j) = w^(j) + C Σ_{i=1}^n ∂L(x_i, y_i)/∂w^(j),
where L(x_i, y_i) = max{0, 1 - y_i (w · x_i + b)} is the empirical loss and
∂L(x_i, y_i)/∂w^(j) = 0 if y_i (w · x_i + b) ≥ 1, and -y_i x_i^(j) otherwise.

Gradient descent: iterate until convergence:
For j = 1 … d: evaluate ∇f^(j) = ∂f(w, b)/∂w^(j) = w^(j) + C Σ_{i=1}^n ∂L(x_i, y_i)/∂w^(j); update w^(j) ← w^(j) - η ∇f^(j).
Here n is the size of the training dataset, η the learning rate parameter, and C the regularization parameter.
Problem: computing ∇f^(j) takes O(n) time!
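A minimal sketch of this batch update in Python; the hinge subgradient comes from the previous slide, and eta and the iteration count are illustrative choices:

```python
import numpy as np

def svm_batch_gd(X, y, C=1.0, eta=0.01, n_iters=100):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        viol = y * (X @ w + b) < 1         # examples with margin < 1
        # full gradient: each step sums over all n examples -- O(n) per update
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b
```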

Stochastic Gradient Descent: instead of evaluating the gradient over all examples, evaluate it for each individual training example:
∇f^(j)(x_i) = w^(j) + C · ∂L(x_i, y_i)/∂w^(j).
(We just had: ∇f^(j) = w^(j) + C Σ_{i=1}^n ∂L(x_i, y_i)/∂w^(j).)
SGD: iterate until convergence: for i = 1 … n, for j = 1 … d: evaluate ∇f^(j)(x_i) and update w^(j) ← w^(j) - η ∇f^(j)(x_i).
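The same objective with the stochastic update, one example per step instead of the full O(n) sum; again a sketch with assumed hyperparameters:

```python
import numpy as np

def svm_sgd(X, y, C=1.0, eta=0.01, n_epochs=5):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_epochs):
        for i in np.random.permutation(n):  # visit training examples one by one
            if y[i] * (X[i] @ w + b) < 1:   # hinge subgradient is active
                w -= eta * (w - C * y[i] * X[i])
                b -= eta * (-C * y[i])
            else:
                w -= eta * w                # only the regularizer contributes
    return w, b
```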

Example by Leon Bottou: Reuters RCV1 document corpus. Predict the category of a document; one-vs-the-rest classification.
n = 781,000 training examples (documents), 23,000 test examples, d = 50,000 features.
One feature per word; remove stopwords; remove low-frequency words.

Questions: (1) Is SGD successful at minimizing f(w, b)? (2) How quickly does SGD find the minimum of f(w, b)? (3) What is the error on a test set?
(Table: standard SVM vs. "fast" SVM vs. SGD-SVM, comparing training time, value of f(w, b), and test error.)
(1) SGD-SVM is successful at minimizing the value of f(w, b). (2) SGD-SVM is super fast. (3) SGD-SVM test set error is comparable.

SGD-SVM vs. conventional SVM, optimization quality: |f(w, b) - f(w_opt, b_opt)|.
For optimizing f(w, b) within reasonable quality, SGD-SVM is super fast.

SGD on the full dataset vs. Batch Conjugate Gradient (BCG) on a sample of n training examples.
Theory says: gradient descent converges in time linear in k, while conjugate gradient converges in √k, where k is the condition number.
Bottom line: doing a simple (but fast) SGD update many times is better than doing a complicated (but slow) BCG update a few times.

Need to choose the learning rate η and t_0:
w_{t+1} ← w_t - (η / (t + t_0)) · (w_t + C · ∂L(x_t, y_t)/∂w).
Leon suggests: choose t_0 so that the expected initial updates are comparable with the expected size of the weights.
Choose η: select a small subsample; try various rates η (e.g., 10, 1, 0.1, 0.01, …); pick the one that most reduces the cost; use it for the next 100k iterations on the full dataset.

Sparse linear SVM: the feature vector x_i is sparse (contains many zeros).
Do not represent it as x_i = [0, 0, 0, 1, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, …]; instead represent x_i as a sparse vector, x_i = [(4, 1), (9, 5), …].
Can we do the SGD update w ← w - η(w + C · ∂L(x_i, y_i)/∂w) more efficiently? Approximate it in 2 steps:
(1) w ← w - η C · ∂L(x_i, y_i)/∂w — cheap: x_i is sparse, so few coordinates j of w will be updated.
(2) w ← w (1 - η) — expensive: w is not sparse, so all coordinates need to be updated.

Solution 1: represent the vector w as the product of a scalar s and a vector v, w = s · v. The two-step update procedure becomes:
(1) v ← v - η C · ∂L(x_i, y_i)/∂w
(2) s ← s (1 - η)
Solution 2: perform only step (1) for each training example; perform step (2) with lower frequency and higher η.
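A sketch of Solution 1 in Python, assuming x_i is stored as a dict of (index, value) pairs as suggested on the previous slide; dividing the sparse increment by s keeps w = s · v exact (the slide's notation folds this scalar in), and eta < 1 is assumed so s stays nonzero:

```python
def sgd_sparse_step(s, v, x_i, y_i, b, eta, C):
    """One SGD step with w represented as s * v; x_i is a dict {index: value}."""
    wx = s * sum(v.get(j, 0.0) * xj for j, xj in x_i.items()) + b
    s *= (1.0 - eta)                  # step (2): shrink all of w with one scalar op
    if y_i * wx < 1:                  # hinge subgradient is active
        for j, xj in x_i.items():     # step (1): touch only the nonzero coordinates
            v[j] = v.get(j, 0.0) + eta * C * y_i * xj / s  # adds eta*C*y_i*x_i to w
    return s, v
```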

Stopping criteria: how many iterations of SGD?
Early stopping with cross-validation: create a validation set; monitor the cost function on the validation set; stop when the loss stops decreasing.
Early stopping: extract two disjoint subsamples A and B of the training data; train on A, stop by validating on B; the number of epochs is an estimate of k; then train for k epochs on the full dataset.
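A sketch of the "train on A, validate on B" recipe; train_epoch and validation_loss are assumed helper callables, not lecture code:

```python
def early_stopping_epochs(train_epoch, validation_loss, max_epochs=100):
    """Return the epoch count k at which the validation loss stops decreasing."""
    best_loss, k = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_epoch()                 # one SGD pass over subsample A
        loss = validation_loss()      # monitor the cost on subsample B
        if loss >= best_loss:
            break                     # loss stopped decreasing: stop here
        best_loss, k = loss, epoch
    return k                          # then train for k epochs on the full data
```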

Idea 1: one against all. Learn 3 classifiers: + vs. {o, -}; - vs. {o, +}; o vs. {+, -}. Obtain: w_+ b_+, w_- b_-, w_o b_o.
How to classify? Return the class c = arg max_c w_c · x + b_c.

Idea 2: learn 3 sets of weights simultaneously! For each class c estimate w_c, b_c.
Want the correct class to have the highest margin: w_{y_i} · x_i + b_{y_i} ≥ 1 + w_c · x_i + b_c, ∀c ≠ y_i, ∀i (x_i, y_i).

Optimization problem:
min_{w,b} ½ Σ_c |w_c|² + C Σ_{i=1}^n ξ_i s.t. ∀i, ∀c ≠ y_i: w_{y_i} · x_i + b_{y_i} ≥ w_c · x_i + b_c + 1 - ξ_i, ξ_i ≥ 0.
To obtain the parameters w_c, b_c (for each class c) we can use techniques similar to the 2-class SVM.
SVM is widely perceived as a very powerful learning algorithm.
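A sketch of one-against-all prediction with per-class weights (w_c, b_c); the classifiers dict could be built by training any 2-class SVM (e.g., the SGD sketch above) on ±1-relabeled data:

```python
import numpy as np

def predict_one_vs_all(classifiers, x):
    """classifiers: {class_label: (w, b)}; return arg max_c w_c . x + b_c."""
    return max(classifiers, key=lambda c: classifiers[c][0] @ x + classifiers[c][1])

# Hypothetical training loop, reusing the svm_sgd sketch from earlier:
# classifiers = {c: svm_sgd(X, np.where(y == c, 1, -1)) for c in set(y)}
```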

New setting: Online Learning. It allows modeling problems where we have a continuous stream of data and we want an algorithm to learn from it and slowly adapt to changes in the data.
Idea: do slow updates to the model. Both of our methods, SVM and Perceptron, make updates only when they misclassify an example.
So: first train the classifier on training data. Then, for every example from the stream, if we misclassify, update the model (using a small learning rate).

Protocol: a user comes and tells us the origin and destination; we offer to ship the package for some amount of money ($10 - $50). Based on the price we offer, sometimes the user uses our service (y = +1), sometimes they don't (y = -1).
Task: build an algorithm to optimize what price we offer to the users.
Features x capture: information about the user; origin and destination.
Problem: will the user accept the price?

Model whether the user will accept our price: y = f(x; w). Accept: y = +1; not accept: y = -1. Build this model with, say, Perceptron or Winnow.
The website runs continuously, so an online learning algorithm would do something like this: a user comes and is represented as an (x, y) pair, where x is the feature vector (including the price we offer, origin, destination) and y indicates whether they chose to use our service or not. The algorithm updates w using just this (x, y) pair.
Basically, we update the parameters w every time we get some new data.
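A sketch of that online loop, reusing a perceptron-style update; stream stands for any iterable of (x, y) pairs, and eta is kept small so the model adapts slowly, as the slide suggests:

```python
import numpy as np

def online_update(w, x, y, eta=0.01):
    """Update w only when the incoming (x, y) pair is misclassified."""
    if np.sign(w @ x) != y:
        w = w + eta * y * x
    return w

# w = ...  first train on an initial training set, then adapt on the stream:
# for x, y in stream:
#     w = online_update(w, x, y)
```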

We discard the idea of a data "set"; instead we have a continuous stream of data.
Further comments: for a major website with a massive stream of data, this kind of algorithm is pretty reasonable; you don't need to deal with all the training data.
If you had a small number of users, you could save their data and then run a normal algorithm on the full dataset, doing multiple passes over the data.

An online algorithm can adapt to changing user preferences. For example, over time users may become more price sensitive. The algorithm adapts and learns this, so the system is dynamic.