CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #19: Machine Learning 1


Supervised Learning. We would like to do prediction: estimate a function f(x) so that y = f(x), where y can be a real number (regression), categorical (classification), or a complex object (a ranking of items, a parse tree, etc.). The data is labeled: we have many pairs {(x, y)}, where x is a vector of binary, categorical, or real-valued features and y is a class label ({+1, -1}) or a real number. Training and test set: estimate y = f(x) on the training data X, Y and hope that the same f(x) also works on unseen data X', Y'.

Large-Scale Machine Learning. We will talk about the following methods: k-Nearest Neighbor (instance-based learning), Perceptron and Winnow algorithms, Support Vector Machines, and decision trees. The main question: how do we efficiently train (build a model / find model parameters)?

Instance-Based Learning. Example: nearest neighbor. Keep the whole training dataset {(x, y)}; when a query example (vector) q comes, find the closest example(s) x* and predict y*. This works both for regression and classification. Collaborative filtering is an example of a k-NN classifier: find the k most similar people to user x that have rated movie y, and predict the rating y_x of x as the average of their k ratings.

1-Nearest Neighbor. To make nearest neighbor work we need four things. Distance metric: Euclidean. How many neighbors to look at? One. Weighting function (optional): unused. How to fit with the local points? Just predict the same output as the nearest neighbor.

k-Nearest Neighbor. Distance metric: Euclidean. How many neighbors to look at? k. Weighting function (optional): unused. How to fit with the local points? Just predict the average output among the k nearest neighbors. (The figure shows an example fit with k = 9.)
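To make the two slides above concrete, here is a minimal NumPy sketch of brute-force k-NN prediction (k = 1 gives 1-NN; the function name and the toy data are ours, not from the lecture):

    import numpy as np

    def knn_predict(X_train, y_train, q, k=1):
        """Predict for query q as the average of the k nearest training outputs.

        With k=1 this is 1-Nearest Neighbor; for classification on {+1,-1}
        labels, take the sign of the returned average (majority vote)."""
        dists = np.linalg.norm(X_train - q, axis=1)   # Euclidean distance metric
        nearest = np.argsort(dists)[:k]               # indices of the k closest examples
        return y_train[nearest].mean()                # average output among the neighbors

    # Toy usage (hypothetical data)
    X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
    y_train = np.array([-1.0, -1.0, +1.0, +1.0])
    print(knn_predict(X_train, y_train, np.array([2.5, 2.5]), k=3))  # ~0.33 -> class +1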

Kernel Regression. Distance metric: Euclidean. How many neighbors to look at? All of them (!). Weighting function: w_i = exp(-d(x_i, q)^2 / K_w), so points near the query q are weighted more strongly; K_w is the kernel width. How to fit with the local points? Predict the weighted average: Σ_i w_i y_i / Σ_i w_i. (The figure shows the weighting curves and the resulting fits for K_w = 10, 20, and 80.)
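A corresponding sketch of the kernel-regression predictor (variable names are ours):

    import numpy as np

    def kernel_regression(X_train, y_train, q, Kw=10.0):
        """Weighted average of all training outputs; nearby points count more."""
        d2 = np.sum((X_train - q) ** 2, axis=1)   # squared distances d(x_i, q)^2
        w = np.exp(-d2 / Kw)                      # kernel weights, Kw = kernel width
        return np.dot(w, y_train) / np.sum(w)     # sum_i w_i y_i / sum_i w_i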

How to find nearest neighbors? Given: a set P of n points in R^d. Goal: given a query point q, NN: find the nearest neighbor p of q in P; Range search: find one/all points in P within distance r from q.

Algorithms for NN. Main memory: linear scan; tree-based (quadtree, kd-tree); hashing (Locality-Sensitive Hashing). Secondary storage: R-trees.
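For the main-memory case, a small sketch of NN and range search using SciPy's k-d tree (assuming SciPy is available; a linear scan would give the same answers, just more slowly for large n):

    import numpy as np
    from scipy.spatial import cKDTree

    P = np.random.rand(1000, 3)                 # a set P of n points in R^d
    tree = cKDTree(P)                           # build the kd-tree once

    q = np.array([0.5, 0.5, 0.5])
    dist, idx = tree.query(q, k=1)              # NN: nearest neighbor p of q in P
    in_range = tree.query_ball_point(q, r=0.1)  # range search: all points within distance r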

PERCEPTRON

Linear Models: Perceptron. Example: spam filtering. Instance space: x ∈ X (|X| = n data points); x is a binary or real-valued feature vector of word occurrences with d features (words + other things, d ≈ 100,000). Class: y ∈ Y, with y = Spam (+1) or Ham (-1).

Linear Models for Classification. Binary classification: f(x) = +1 if w_1 x_1 + w_2 x_2 + ... + w_d x_d ≥ θ, and -1 otherwise. Input: vectors x^(j) and labels y^(j); the vectors x^(j) are real-valued, normalized so that ||x||_2 = 1. Goal: find the vector w = (w_1, w_2, ..., w_d), where each w_i is a real number. The decision boundary is linear: w · x = 0. Note: the threshold can be absorbed by extending x ← (x, 1) and w ← (w, -θ). (The figure shows positive and negative examples separated by the hyperplane w · x = 0.)

Perceptron [Rosenblatt '58]. (Very) loose motivation: a neuron. Inputs are feature values, each feature has a weight w_i, and the activation is the sum f(x) = Σ_i w_i x_i = w · x. If f(x) is positive, predict +1; if negative, predict -1. (The figure shows a neuron summing weighted inputs against a threshold, and a decision boundary w · x = 0 separating a spam example x^(1) ("viagra", y = +1) from a ham example x^(2) ("nigeria", y = -1).)

Perceptron: Estimating w. Perceptron prediction: y' = sign(w · x). How do we find the parameters w? Start with w^(0) = 0. Pick training examples x^(t) one by one (from disk). Predict the class of x^(t) using the current weights: y' = sign(w^(t) · x^(t)). If y' is correct (i.e., y' = y^(t)), make no change: w^(t+1) = w^(t). If y' is wrong, adjust w^(t): w^(t+1) = w^(t) + η y^(t) x^(t), where η is the learning rate parameter, x^(t) is the t-th training example, and y^(t) is the true t-th class label ({+1, -1}). Note that the Perceptron is a conservative algorithm: it ignores samples that it classifies correctly. (The figure illustrates the update w^(t+1) = w^(t) + η y^(t) x^(t) for an example x^(t) with y^(t) = 1.)
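A minimal sketch of the loop just described, for labels in {+1, -1} (the threshold is assumed absorbed into a constant feature as noted earlier; names are ours):

    import numpy as np

    def train_perceptron(X, y, eta=1.0, epochs=10):
        """Online perceptron: update w only when the current prediction is wrong."""
        w = np.zeros(X.shape[1])                     # start with w = 0
        for _ in range(epochs):
            for x_t, y_t in zip(X, y):               # pick training examples one by one
                y_pred = 1 if np.dot(w, x_t) >= 0 else -1
                if y_pred != y_t:                    # mistake: adjust w
                    w = w + eta * y_t * x_t          # w <- w + eta * y_t * x_t
        return w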

Perceptron Convergence. Perceptron Convergence Theorem: if there exists a set of weights that is consistent with the data (i.e., the data is linearly separable), the Perceptron learning algorithm will converge. How long does it take to converge? Perceptron Cycling Theorem: if the training data is not linearly separable, the Perceptron learning algorithm will eventually repeat the same set of weights and therefore enter an infinite loop. How do we provide robustness and more expressivity?

Properties of Perceptron. Separability: some setting of the parameters gets the training set perfectly right. Convergence: if the training set is separable, the perceptron will converge. (Training) mistake bound: the number of mistakes is ≤ 1/γ², where γ = min_t x^(t) · u over the training examples, for a unit vector u (||u||_2 = 1). Since we assume each x has Euclidean length 1, γ is the minimum distance of any example to the separating hyperplane u.
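Spelled out a bit more explicitly (this is the standard Novikoff-style statement; the label factor y^(t) inside the margin is implicit on the slide):

    \text{If } \|x^{(t)}\|_2 \le 1 \text{ for all } t \text{ and there is a unit vector } u,\ \|u\|_2 = 1,
    \text{ such that } y^{(t)}\,(u \cdot x^{(t)}) \ge \gamma > 0 \text{ for all } t, \text{ then}
    \qquad \#\{\text{mistakes}\} \;\le\; \frac{1}{\gamma^{2}}.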

Updating the Learning Rate. If the data is not separable, the perceptron will oscillate and won't converge, so when should we stop learning? (1) Slowly decrease the learning rate η; a classic schedule is η = c_1 / (t + c_2), but then we also need to determine the constants c_1 and c_2. (2) Stop when the training error stops changing. (3) Hold out a small test dataset and stop when the test-set error stops decreasing. (4) Stop after some maximum number of passes over the data.
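A tiny sketch of option (1), the decaying schedule η = c_1/(t + c_2) (the constants here are placeholders that would have to be tuned):

    def learning_rate(t, c1=1.0, c2=10.0):
        # eta = c1 / (t + c2): decreases slowly as more examples t are seen
        return c1 / (t + c2)

    # inside the perceptron loop above, on a mistake at step t:
    # w = w + learning_rate(t) * y_t * x_t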

Multiclass Perceptron. What if there are more than 2 classes? Keep a weight vector w_c for each class c and train one class vs. the rest. Example: 3-way classification with y ∈ {A, B, C}. Train 3 classifiers: w_A: A vs. B, C; w_B: B vs. A, C; w_C: C vs. A, B. Calculate the activation for each class, f(x, c) = Σ_i w_c,i x_i = w_c · x, and the highest activation wins: c* = arg max_c f(x, c). (The figure shows the regions where w_A · x, w_B · x, and w_C · x are biggest.)
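A sketch of this one-vs-rest scheme, reusing the train_perceptron sketch from above (function names are ours):

    import numpy as np

    def train_multiclass(X, y, classes, epochs=10):
        # one weight vector w_c per class, trained as "class c vs. the rest"
        W = {}
        for c in classes:
            y_c = np.where(y == c, 1, -1)
            W[c] = train_perceptron(X, y_c, epochs=epochs)  # binary perceptron from above
        return W

    def predict_multiclass(W, x):
        # highest activation w_c . x wins: c* = arg max_c f(x, c)
        return max(W, key=lambda c: np.dot(W[c], x))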

Overfitting: Issues with Perceptrons. Regularization: if the data is not separable, the weights dance around. Mediocre generalization: the algorithm finds a barely separating solution.

Improvement: Winnow Algorithm

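The two Winnow slides carry only their titles in this transcription; as a stand-in, here is a minimal sketch of the Winnow update rules as they are summarized later in the lecture (the initialization w_i = 1 and threshold θ = d are conventional choices, not taken from the slides):

    import numpy as np

    def train_winnow(X, y, theta=None, epochs=10):
        """Winnow on Boolean features x in {0,1}^d: multiplicative weight updates."""
        n, d = X.shape
        theta = d if theta is None else theta    # conventional threshold choice
        w = np.ones(d)                           # conventional initialization
        for _ in range(epochs):
            for x_t, y_t in zip(X, y):
                y_pred = 1 if np.dot(w, x_t) >= theta else -1
                if y_t == 1 and y_pred == -1:
                    w[x_t == 1] *= 2.0           # promote active features
                elif y_t == -1 and y_pred == 1:
                    w[x_t == 1] /= 2.0           # demote active features
        return w, theta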

Extensions: Winnow. Problem: all w_i can only be > 0. Solution: for every feature x_i, introduce a new feature x_i' = -x_i and learn Winnow over the 2d features. Example: consider x = [1, .7, .4] and w = [.5, .2, -.3]; the new x and w are x = [1, .7, .4, -1, -.7, -.4] and w = [.5, .2, 0, 0, 0, .3]. Note this results in the same dot-product values as if we used the original x and w. The new algorithm is called Balanced Winnow.
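A two-line check (with the signs as reconstructed above) that the duplicated, negated features preserve the dot product:

    import numpy as np

    x  = np.array([1.0, 0.7, 0.4])
    w  = np.array([0.5, 0.2, -0.3])
    x2 = np.concatenate([x, -x])                    # [1, .7, .4, -1, -.7, -.4]
    w2 = np.array([0.5, 0.2, 0.0, 0.0, 0.0, 0.3])   # all weights now non-negative
    print(np.dot(x, w), np.dot(x2, w2))             # both ~0.52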

Extensions: Balanced Winnow. In practice we implement Balanced Winnow with 2 weight vectors w+ and w-; the effective weight is their difference.

Extensions: Thick Separator. A thick separator (aka perceptron with margin) applies both to Perceptron and Winnow: set a margin parameter γ and update if y = +1 but w · x < θ + γ, or if y = -1 but w · x > θ - γ. Note: γ is a functional margin, so its effect could disappear as w grows; nevertheless, this has been shown to be a very effective algorithmic addition. (The figure shows the thick band θ ± γ around the separator.)
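The only change relative to the plain mistake test is the margin γ; a minimal sketch of the modified update condition (names ours):

    import numpy as np

    def needs_update(w, x, y, theta, gamma):
        # update even on "barely correct" examples inside the margin band:
        # y = +1 but w.x < theta + gamma, or y = -1 but w.x > theta - gamma
        a = np.dot(w, x)
        return (y == +1 and a < theta + gamma) or (y == -1 and a > theta - gamma)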

Summary of Algorithms. Setting: examples x ∈ {0,1}^d, weights w ∈ R^d. Prediction: f(x) = +1 iff w · x ≥ θ, else -1. Perceptron: additive weight update, w ← w + η y x. If y = +1 but w · x ≤ θ, then w_i ← w_i + 1 for each x_i = 1 (promote); if y = -1 but w · x > θ, then w_i ← w_i - 1 for each x_i = 1 (demote). Winnow: multiplicative weight update, w ← w exp{η y x}. If y = +1 but w · x ≤ θ, then w_i ← 2 w_i for each x_i = 1 (promote); if y = -1 but w · x > θ, then w_i ← w_i / 2 for each x_i = 1 (demote).

Perceptron vs. Winnow. How do we compare learning algorithms? Considerations: the number of features d is very large; the instance space is sparse (only a few features per training example are non-zero); the model is sparse (decisions depend on a small subset of features, i.e., in the true model only a few w_i are non-zero); and we want to learn from a number of examples that is small relative to the dimensionality d.

Perceptron vs. Winnow. Perceptron: online (can adjust to a changing target over time). Advantages: simple; guaranteed to learn a linearly separable problem; has an advantage when there are few relevant features per training example. Limitations: only linear separations; only converges for linearly separable data; not really efficient with many features. Winnow: online (can adjust to a changing target over time). Advantages: simple; guaranteed to learn a linearly separable problem; suitable for problems with many irrelevant attributes. Limitations: only linear separations; only converges for linearly separable data; not really efficient with many features.

Online Learning. New setting: online learning allows for modeling problems where we have a continuous stream of data and we want an algorithm to learn from it and slowly adapt to changes in the data. Idea: make slow updates to the model. Both of our methods, Perceptron and Winnow, make updates only when they misclassify an example. So: first train the classifier on training data; then, for every example from the stream, if we misclassify it, update the model (using a small learning rate).

Example: Shipping Service. Protocol: a user comes and tells us the origin and destination, and we offer to ship the package for some amount of money ($10-$50). Based on the price we offer, sometimes the user uses our service (y = 1) and sometimes they don't (y = -1). Task: build an algorithm to optimize what price we offer to the users. The features x capture information about the user and the origin and destination. Problem: will the user accept the price?

Example: Shipping Service. Model whether the user will accept our price: y = f(x; w), with accept y = 1 and not accept y = -1. Build this model with, say, Perceptron or Winnow. The website runs continuously, so an online learning algorithm would do something like this: a user comes and is represented as an (x, y) pair, where x is the feature vector (including the price we offer, origin, destination) and y records whether they chose to use our service or not; the algorithm updates w using just this (x, y) pair. Basically, we update the parameters w every time we get some new data.
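A sketch of that online loop (the event stream, the feature encoding, and the helper name are hypothetical):

    import numpy as np

    def online_update(w, x, y, eta=0.01):
        # one online step: update only if the current model misclassifies (x, y)
        if y * np.dot(w, x) <= 0:
            w = w + eta * y * x   # small learning rate -> slow adaptation to the stream
        return w

    # hypothetical serving loop: each interaction yields one (x, y) pair, where x
    # encodes the offered price, origin, destination and user info, and y is +/-1
    # for x, y in event_stream():
    #     w = online_update(w, x, y)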

Example: Shipping Service. We discard the idea of a fixed dataset; instead we have a continuous stream of data. Further comments: for a major website with a massive stream of data, this kind of algorithm is pretty reasonable, and we don't need to deal with all the training data at once. If you had a small number of users, you could save their data and then run a normal algorithm on the full dataset, doing multiple passes over the data.

Online Algorithms. An online algorithm can adapt to changing user preferences. For example, over time users may become more price sensitive; the algorithm adapts and learns this, so the system is dynamic.