Learning from Data. COMP61011 : Machine Learning and Data Mining. Dr Gavin Brown, Machine Learning and Optimization Research Group


1 Learning from Data COMP61011 : Machine Learning and Data Mining Dr Gavin Brown, Machine Learning and Optimization Research Group

2 Learning from Data Data is recorded from some real-world phenomenon. What might we want to do with that data? Prediction - what can we predict about this phenomenon? Description - how can we describe/understand this phenomenon in a new way? Optimization - how can we control and optimize this phenomenon for our own objectives?

3 COMP61011 Machine Learning & Data Mining - Period 1 (Oct/Nov) - Prediction - Lecturer: Dr Gavin Brown. COMP61021 Modeling & Visualization of High Dimensional Data - Period 2 (Nov/Dec). COMP61032 Optimization for Learning, Planning & Problem Solving - Period 3 (Feb/Mar).

4 Machine Learning and Data Mining Medical Records / Novel Drugs What characteristics of a patient indicate they may react well/badly to a new drug? How can we predict whether it will potentially hurt rather than help them? AstraZeneca Project Research Bursaries Limited number of eligible MSc projects, announced Dec 2011

5 Machine Learning and Data Mining Handwriting Recognition Google Books is currently digitizing millions of books. Smartphones need to process non-European handwriting to tap into the Asian market. How can we recognize handwritten digits in a huge variety of handwriting styles, in real-time?

6 Learning from Data Where does all this fit? [Venn diagram: Artificial Intelligence, Statistics / Mathematics, Data Mining, Computer Vision, Robotics, with Learning from Data at their intersection] (No definition of a field is perfect - the diagram above is just one interpretation, mine ;-) )

7 Learn your trade

8 Learning from Data.. Prerequisites MATHEMATICS This is a mathematical subject. You must be comfortable with probabilities and algebra. Maths primer on course website for reviewing. PROGRAMMING You must be able to program, and pick up a new language relatively easily. We use Matlab for the first 2 modules. In the 3rd module, you may use any language. Module codes in this theme: COMP61011 (prediction), COMP61021 (description), COMP61032 (optimization).

9 COMP61011 topic structure Week 1: Some Data and Simple Predictors Week 2: Support Vector Machines / Model Selection Week 3: Decision Trees / Feature Selection Week 4: Bayes Theorem / Probabilistic Classifiers Week 5: Ensemble Methods / Industry Guest Lectures Week 6: No lecture.

10 COMP61011 assessment structure 50% January exam 50% coursework, broken down as 10% + 10% lab exercises (weeks 2,3) 30% mini-project (weeks 4,5,6) Lab exercises will be marked at the START of the following lab session. You should NOT be still working on the previous week's work. Extensions will require a medical note.

11 Matlab MATrix LABoratory Interactive scripting language Interpreted (i.e. no compiling) Objects possible, not compulsory Dynamically typed Flexible GUI / plotting framework Large libraries of tools Highly optimized for maths Available free from Uni, but usable only when connected to our network (e.g. via VPN) Module-specific software supported on school machines only.

12 Books (not compulsory purchase, but recommended) Introduction to Machine Learning by Ethem Alpaydin - Technical. Contains all necessary material for modules 1+2 of this theme. Very Short Introduction to Statistics by David Hand - Not technical at all. More of a motivational, big-picture read.

13 Some Data, and Simple Predictors

14 A Problem to Solve Distinguish rugby players from ballet dancers. You are provided with some data. Fallowfield rugby club (16). Rusholme ballet troupe (10). Task Generate a program which will correctly classify ANY player/dancer in the world. Hint We shouldn't fine-tune our system too much so it only works on the local clubs.

15 Taking measurements. We have to process the people with a computer, so the data needs to be in a computer-readable form. What are the distinguishing characteristics? 1. Height 2. Weight 3. Shoe size 4. Sex

16 Taking measurements.
Person  Weight  Height
1       63kg    190cm
2       55kg    185cm
3       75kg    202cm
4       50kg    180cm
5       57kg    174cm
6       85kg    150cm
7       93kg    145cm
8       75kg    130cm
9       99kg    163cm
10      100kg   171cm
[Scatter plot of the data: height vs weight]

17 The Nearest Neighbour Rule TRAINING DATA: the Person/Weight/Height table from slide 16, plus a Class column (player or dancer) for each person. [Scatter plot: height vs weight] TESTING DATA: Who's this guy - player or dancer? height = 180cm, weight = 78kg

18 The Nearest Neighbour Rule TRAINING DATA: as on slide 17. [Scatter plot: height vs weight] For the test point (height = 180cm, weight = 78kg): 1. Find nearest neighbour 2. Assign the same class

19 The K-Nearest Neighbour Classifier Testing point x: For each training datapoint x', measure distance(x, x'). End. Sort distances. Select K nearest. Assign most common class! TRAINING DATA: as on slide 17. [Scatter plot: height vs weight]

20 Quick reminder: Pythagoras' theorem - how do we measure distance(x, x')? For a right triangle with sides a, b and hypotenuse c: a^2 + b^2 = c^2, so c = sqrt(a^2 + b^2). Applied to our features, this is the Euclidean distance: distance(x, x') = sqrt( sum_i (x_i - x'_i)^2 )
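A minimal MATLAB sketch of this distance computation (the variable names and values here are illustrative, not from the lecture data):

    % Euclidean distance between a test point and one training point
    x  = [180 78];                 % test point: height (cm), weight (kg)
    xp = [190 63];                 % one training point
    d  = sqrt(sum((x - xp).^2));   % equivalent to norm(x - xp)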

21 The K-Nearest Neighbour Classifier Testing point x: For each training datapoint x', measure distance(x, x'). End. Sort distances. Select K nearest. Assign most common class! TRAINING DATA: as on slide 17. Seems sensible. But what are the disadvantages?

22 The K-Nearest Neighbour Classifier TRAINING DATA: as on slide 17. [Scatter plot: height vs weight] Here I chose k=3. What would happen if I chose k=5? What would happen if I chose k=26? (Note: 16 + 10 = 26 training points in total.)

23 The K-Nearest Neighbour Classifier TRAINING DATA: as on slide 17. [Scatter plot: height vs weight] Any point on the left of this boundary is closer to the red circles. Any point on the right of this boundary is closer to the blue crosses. This is called the decision boundary.

24 Where's the decision boundary? [Scatter plot: height vs weight] Not always a simple straight line!

25 Where's the decision boundary? [Scatter plot: height vs weight] Not always contiguous!

26 So, we have our first machine learning algorithm: The K-Nearest Neighbour Classifier. Testing point x: For each training datapoint x', measure distance(x, x'). End. Sort distances. Select K nearest. Assign most common class! Make your own notes on its advantages / disadvantages.
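As a concrete (hedged) illustration, here is a minimal k-NN sketch in MATLAB; the function name and the assumption of numeric class labels (e.g. +1 = player, -1 = dancer) are mine, not the course's lab code:

    % Save as knn_classify.m: classify test point x from training data
    function yhat = knn_classify(Xtrain, ytrain, x, K)
        % Xtrain: N-by-d feature matrix, ytrain: N-by-1 numeric labels
        d = sqrt(sum((Xtrain - x).^2, 2));   % Euclidean distance to every training point
        [~, idx] = sort(d);                  % sort distances, ascending
        yhat = mode(ytrain(idx(1:K)));       % most common class among the K nearest
    end

(Note the model is just the stored training data; `Xtrain - x` uses implicit expansion, so this needs MATLAB R2016b or later.)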

27 The most important concept in Machine Learning

28 The most important concept in Machine Learning Looks good so far

29 The most important concept in Machine Learning Looks good so far Oh no! Mistakes! What happened?

30 The most important concept in Machine Learning Looks good so far Oh no! Mistakes! What happened? We didn't have all the data. We can never assume that we do. This is called OVER-FITTING to the small dataset.

31 Overfitting Overfitting happens when the classifier is too flexible for the problem. If we'd drawn a simpler decision boundary below, maybe a straight line, we may have gotten lower error.

32 Break for 10 mins Possible uses of your break: 1. Ensure you have a working login for the computer lab this afternoon. 2. Talk to me or a demonstrator about the material. 3. Read ahead in the notes. 4. Go get a coffee.

33 A simpler, more compact rule? [Scatter plot: height vs weight, with a vertical threshold at θ] if (weight > θ) then "player" else "dancer"

34 What's an algorithm to find a good threshold? θ = 40; while (num_mistakes != 0) { θ = θ + 1; num_mistakes = error(θ) } [Scatter plot: height vs weight] if (weight > θ) then "player" else "dancer"

35 We have our second Machine Learning procedure. The threshold classifier (also known as a "Decision Stump"): if (weight > θ) then "player" else "dancer". Learned by: θ = 40; while (num_mistakes != 0) { θ = θ + 1; num_mistakes = error(θ) }
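A runnable MATLAB sketch of this procedure, under my own assumptions (made-up toy data, numeric labels +1 = player / -1 = dancer, and a step of 1kg per iteration):

    % Decision stump: search for a weight threshold with zero training error
    weights = [55 57 63 75 85 93 99];    % made-up training weights (kg)
    y       = [-1 -1 -1 -1 +1 +1 +1];    % made-up class labels
    theta = 40;
    num_mistakes = inf;
    while num_mistakes ~= 0
        theta = theta + 1;               % try the next threshold
        pred = -ones(size(weights));     % default: "dancer" (-1)
        pred(weights > theta) = +1;      % weight > theta: "player" (+1)
        num_mistakes = sum(pred ~= y);   % training errors at this theta
    end

(Beware: exactly as on the slide, this loop never terminates if no perfect threshold exists - which is precisely the problem slide 39 runs into.)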

36 Three ingredients of a Machine Learning procedure Model The final product, the thing you have to package up and send to a customer. A piece of code with some parameters that need to be set. Error function The performance criterion: the function you use to judge how well the parameters of the model are set. Learning algorithm The algorithm that optimises the model parameters, using the error function to judge how well it is doing.

37 Three ingredients of a Threshold Classifier Model: if (weight > θ) then "player" else "dancer". Error function: num_mistakes = error(θ). Learning algorithm: θ = 40; while (num_mistakes != 0) { θ = θ + 1; num_mistakes = error(θ) }

38 What's the model for the k-NN classifier? For the k-NN, the model is the training data itself! - very good accuracy :) - very computationally intensive! :( [Scatter plot: height vs weight] Testing point x: For each training datapoint x', measure distance(x, x'). End. Sort distances. Select K nearest. Assign most common class!

39 New data: what's an algorithm to find a good threshold? [Scatter plot: height vs weight, with threshold θ] if (weight > θ) then "player" else "dancer" - the best threshold still makes 1 mistake. Our model does not match the problem!

40 New data: what's an algorithm to find a good threshold? [Scatter plot: height vs weight, with a sloped boundary separating the classes] A sloped boundary would work :) But our current model, if (weight > θ) then "player" else "dancer", cannot represent this :(

41 We need a more sophisticated model. Old rule: if (weight > θ) then "player" else "dancer". New rule: if (f(x) > θ) then "player" else "dancer", where x_1 = height (cm) and x_2 = weight (kg). The Linear Classifier: f(x) = (w_1 * x_1) + (w_2 * x_2) = sum_{i=1..d} w_i x_i [Scatter plot: height vs weight]

42 The Linear Classifier if f(x) > θ then "player" else "dancer", with f(x) = (w_1 * x_1) + (w_2 * x_2) = sum_{i=1..d} w_i x_i [Two scatter plots: height vs weight, each with a different decision boundary] Changing w_1, w_2 and θ changes the position of the DECISION BOUNDARY.
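A small MATLAB sketch of this decision rule (all parameter values here are illustrative, not learned):

    % Linear classifier: f(x) = w1*x1 + w2*x2, compared against threshold theta
    w = [0.5 1.0];            % one weight per feature [height weight]
    theta = 150;              % decision threshold
    x = [180 78];             % test point: height 180cm, weight 78kg
    f = sum(w .* x);          % f(x) = w1*x1 + w2*x2
    if f > theta, disp('player'), else, disp('dancer'), end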

43 Geometry of the Linear Classifier (1) if f(x) > θ then +1 ("player") else -1 ("dancer"). In 2-d, the decision boundary is a line; in higher dimensions, it is a decision hyper-plane. Any point on the plane evaluates to 0. Points not on the plane evaluate to +/-. [Figure: regions f(x) > 0 and f(x) < 0 either side of the line f(x) = 0, with weight vector [w_1, w_2]] The decision boundary is always ORTHOGONAL to the weight vector. See if you can prove this for yourself before going to the notes.

44 Geometry of the Linear Classifier (2) We can rearrange the decision rule: if sum_{i=1..d} w_i x_i > θ then +1 else -1. Equivalently: sum_{i=1..d} w_i x_i - θ > 0, i.e. sum_{i=1..d} w_i x_i + (-1 * w_0) > 0 with w_0 = θ. Defining an extra input x_0 = -1, the rule becomes: if sum_{i=0..d} w_i x_i > 0 then +1 else -1.
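In code, this threshold-absorbing trick is one line each way; a sketch with made-up numbers:

    % Fold the threshold into the weight vector via a constant extra input
    w = [0.5 1.0]; theta = 150;   % illustrative values, as before
    x = [180 78];
    w0 = [theta, w];              % w_0 = theta, prepended
    x0 = [-1, x];                 % x_0 = -1, prepended
    f  = sum(w0 .* x0);           % equals sum(w.*x) - theta
    % decision rule is now simply: f > 0 => +1, else -1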

45 Geometry of the Linear Classifier (3) On the plane: f(x) = sum_{i=1..d} w_i x_i - θ = 0. In 2 dimensions: f(x) = w_1 x_1 + w_2 x_2 - θ = 0, so (w_1/w_2) x_1 + x_2 = θ/w_2, i.e. x_2 = -(w_1/w_2) x_1 + θ/w_2. [Figure: regions f(x) > 0 and f(x) < 0 either side of f(x) = 0, with weight vector [w_1, w_2]] This now follows the geometry of a straight line y = mx + c, with m = -(w_1/w_2) and c = θ/w_2.
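That y = mx + c form makes the boundary easy to draw; a MATLAB sketch with illustrative parameters:

    % Plot the decision boundary x2 = m*x1 + c
    w = [0.5 1.0]; theta = 150;          % illustrative, not learned
    x1 = linspace(120, 210, 100);        % x1 = height (cm), per slide 41
    x2 = -(w(1)/w(2)).*x1 + theta/w(2);  % m = -w1/w2, c = theta/w2
    plot(x1, x2); xlabel('x_1 (height)'); ylabel('x_2 (weight)');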

46 The Linear Classifier Model: if sum_{i=0..d} w_i x_i > 0 then ŷ = +1 else ŷ = -1. Error function: e = (1/2)(f(x) - y)^2. Learning algorithm: ???... need to optimise the w values... [Scatter plot: height vs weight] Note the terminology: x = inputs, y = class. See notes for details!!

47 Gradient Descent With e = (1/2)(f(x) - y)^2, the chain rule gives de/dw_i = (de/df)(df/dw_i) = (f(x) - y) x_i. Follow the NEGATIVE gradient!

48 Stochastic Gradient Descent
initialise weight values to random numbers in range -1 to +1
for n = 1 to NUM_ITERATIONS
  for each training example (x,y)
    calculate f(x)
    for each weight i: w_i = w_i - α(f(x) - y) x_i
  end
end
α = a small constant, the learning rate.
Convergence theorem: If the data is linearly separable, then application of the learning rule will find a separating decision boundary, within a finite number of iterations.
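Here is that pseudocode as a runnable MATLAB sketch; the toy data, labels in {-1,+1}, and the tiny learning rate (the features are unscaled, so a larger α would diverge) are all my assumptions:

    % Stochastic gradient descent for the linear classifier, squared error
    X = [190 63; 185 55; 174 57; 150 85; 145 93];   % made-up [height weight] rows
    y = [-1; -1; -1; +1; +1];                       % made-up class labels
    X = [-ones(size(X,1),1) X];                     % absorb threshold: x_0 = -1
    w = 2*rand(1, size(X,2)) - 1;                   % random weights in [-1, +1]
    alpha = 1e-6;                                   % learning rate
    for n = 1:1000                                  % NUM_ITERATIONS
        for j = 1:size(X,1)                         % each training example (x,y)
            f = sum(w .* X(j,:));                   % calculate f(x)
            w = w - alpha*(f - y(j))*X(j,:);        % w_i = w_i - alpha*(f(x)-y)*x_i
        end
    end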

49 A problem initialise weight values to random numbers in range -1 to +1... [Scatter plot: height vs weight, with several valid boundaries] Depending on the random initialisation, the linear classifier will converge to one of the valid boundaries - but randomly!

50 Break for 30 mins Possible uses of your break: 1. Ensure you have a working login for the computer lab this afternoon. 2. Talk to me or a demonstrator about the material. 3. Read ahead in the notes. 4. Go get a coffee.

51 Another model: logistic regression Our model f(x) has range plus/minus INFINITY! Is this really necessary? What is the confidence of our decisions? Can we estimate PROBABILITIES? Logistic regression estimates p(y=1 | x). Output in range [0,1]. Sigmoid function: p(y=1 | x) = f(x) = 1 / (1 + e^(-(w^T x - θ)))
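A short MATLAB sketch of the sigmoid output (parameter values illustrative, as before):

    % Sigmoid squashes the linear score into a probability in [0,1]
    w = [0.5 1.0]; theta = 150; x = [180 78];   % illustrative values
    sigmoid = @(a) 1 ./ (1 + exp(-a));
    p = sigmoid(sum(w .* x) - theta);           % estimated p(y = 1 | x)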

52 Another error: cross entropy e = - sum_{j=1..N} [ y_j ln f(x_j) + (1 - y_j) ln(1 - f(x_j)) ] Above we assume y is either 0 or 1. Derived from the statistical principle of Likelihood. We'll see this again in a few weeks.
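As a MATLAB sketch (fx holds the model outputs f(x_j) and y the 0/1 labels; both made up here):

    % Cross-entropy error over a whole dataset
    fx = [0.1; 0.2; 0.3; 0.9; 0.8];                % illustrative model outputs
    y  = [0; 0; 0; 1; 1];                          % labels, 0 or 1
    e  = -sum( y.*log(fx) + (1-y).*log(1-fx) );    % smaller is better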

53 Gradient Descent For the cross-entropy error e = - sum_{j=1..N} [ y_j ln f(x_j) + (1 - y_j) ln(1 - f(x_j)) ], the chain rule again gives de/dw_i = (de/df)(df/dw_i) = (f(x) - y) x_i. Follow the NEGATIVE gradient! SAME update as for squared error!

54 Stochastic Gradient Descent
initialise weight values to random numbers in range -1 to +1
for n = 1 to NUM_ITERATIONS
  for each training example (x,y)
    calculate f(x)
    for each weight i: w_i = w_i - α(f(x) - y) x_i
  end
end
α = a small constant, the learning rate
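Since only f changes, the earlier SGD sketch needs just one modified line; here it is in full, again with made-up data (note the labels are now 0/1):

    % Stochastic gradient descent for logistic regression
    X = [190 63; 185 55; 174 57; 150 85; 145 93];   % made-up [height weight] rows
    y = [0; 0; 0; 1; 1];                            % labels now in {0,1}
    X = [-ones(size(X,1),1) X];                     % absorb threshold: x_0 = -1
    w = 2*rand(1, size(X,2)) - 1;                   % random weights in [-1, +1]
    alpha = 1e-4;                                   % sigmoid bounds f, so alpha can be larger
    for n = 1:1000
        for j = 1:size(X,1)
            f = 1/(1 + exp(-sum(w .* X(j,:))));     % sigmoid: p(y=1|x)
            w = w - alpha*(f - y(j))*X(j,:);        % identical update rule
        end
    end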

55 A natural pairing of error function to model Cross-entropy error e = - sum_{j=1..N} [ y_j ln f(x_j) + (1 - y_j) ln(1 - f(x_j)) ] pairs with the sigmoid model f(x) = 1 / (1 + e^(-(w^T x - θ))). Squared error e = (1/2)(f(x) - y)^2 pairs with the linear model f(x) = sum_{i=1..d} w_i x_i - θ. In both cases: de/dw_i = (de/df)(df/dw_i) = (f(x) - y) x_i.

56 Still a problem initialise weight values to random numbers in range -1 to +1... [Scatter plot: height vs weight, with several valid boundaries] Depending on the random initialisation, the logistic regression classifier will converge to one of the valid boundaries - but randomly!

57 Geometry of Linear Models (see notes)

58 Another problem - new data: non-linearly separable. [Scatter plot: height vs weight] Our model does not match the problem! We'll deal with this next week!

59 End of Day 1 Now read the notes. Read the "Surrounded by Statistics" chapter in the handouts. The fog will clear. This afternoon: learn MATLAB. This week's exercise is unassessed, but you are highly advised to get as much practice in as you can.
