Supervised Learning (contd) Linear Separation. Mausam (based on slides by UW-AI faculty)

1 Supervised Learning (contd) Linear Separation Mausam (based on slides by UW-AI faculty)

2 Images as Vectors. Binary handwritten characters; greyscale images. Treat an image as a high-dimensional vector (e.g., by reading pixel values left to right, top row to bottom row): $I = (p_1, p_2, \ldots, p_{N-1}, p_N)^T$. Pixel value $p_i$ can be 0 or 1 (binary image) or 0 to 255 (greyscale).

3 The human brain is extremely good at classifying images. Can we develop classification methods by emulating the brain?

5 Neurons communicate via spikes. Inputs arrive at the neuron; the output is a spike (electrical pulse). The output spike depends roughly on whether the sum of all inputs reaches a threshold.

6 Neurons as Threshold Units. Artificial neuron: m binary inputs (-1 or +1), 1 output (-1 or +1). Synaptic weights $w_{ji}$, threshold $\mu_i$. The inputs $u_j$ (-1 or +1) are weighted and summed, then thresholded to give the output $v_i$ (-1 or +1): $v_i = \Theta\left(\sum_j w_{ji} u_j - \mu_i\right)$, where $\Theta(x) = +1$ if $x > 0$ and $-1$ if $x \le 0$.
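
A minimal sketch of the threshold unit described on this slide, assuming inputs and outputs coded as -1/+1 and a threshold function that returns +1 only for strictly positive arguments. The names `theta` and `neuron_output` and the example weights are illustrative and do not come from the slides.

```python
def theta(x):
    """Threshold function: +1 if x > 0, otherwise -1."""
    return 1 if x > 0 else -1


def neuron_output(weights, inputs, threshold):
    """Output of one unit: theta(sum_j w_j * u_j - threshold)."""
    weighted_sum = sum(w * u for w, u in zip(weights, inputs))
    return theta(weighted_sum - threshold)


# Three inputs with equal (made-up) weights and a threshold of 0.5.
print(neuron_output([1.0, 1.0, 1.0], [1, -1, 1], 0.5))   # weighted sum is 1.0
print(neuron_output([1.0, 1.0, 1.0], [1, -1, -1], 0.5))  # weighted sum is -1.0
```

The unit fires (+1) only when the weighted sum exceeds the threshold; a sum exactly equal to the threshold gives -1 under the convention assumed here.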

7 Perceptrons for Classification. "Perceptron" is a fancy name for a type of layered feed-forward network (no loops). Uses artificial neurons ("units") with binary inputs and outputs. Two kinds: single-layer and multilayer.

8 Perceptrons and Classification. Consider a single-layer perceptron. The weighted sum, together with the threshold $\mu_i$, forms a linear hyperplane: $\sum_j w_{ji} u_j - \mu_i = 0$. Everything on one side of this hyperplane is in class 1 (output = +1) and everything on the other side is in class 2 (output = -1). Any function that is linearly separable can be computed by a perceptron.

9 Linear Separability. Example: AND is linearly separable. The linear hyperplane $u_1 + u_2 - 1.5 = 0$ separates the input (1,1) from the other three inputs: $v = 1$ iff $u_1 + u_2 - 1.5 > 0$. Similarly for OR and NOT.
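
The AND rule above can be checked by enumerating all inputs. The sketch below assumes -1/+1 input coding; the AND weights follow the slide's rule $u_1 + u_2 - 1.5 > 0$, while the OR and NOT weights are one possible choice picked for illustration, not values taken from the slides.

```python
from itertools import product


def theta(x):
    """Threshold function: +1 if x > 0, otherwise -1."""
    return 1 if x > 0 else -1


def and_unit(u1, u2):
    # Rule from the slide: output +1 iff u1 + u2 - 1.5 > 0.
    return theta(u1 + u2 - 1.5)


def or_unit(u1, u2):
    # Assumed weights: only (-1, -1) gives a non-positive sum.
    return theta(u1 + u2 + 1.5)


def not_unit(u):
    # Assumed weights: a single weight of -1 and threshold 0.
    return theta(-u)


for u1, u2 in product([-1, 1], repeat=2):
    print(u1, u2, "AND:", and_unit(u1, u2), "OR:", or_unit(u1, u2))
```

Each function is a single threshold unit, so each corresponds to one separating line (or point, for NOT) in input space.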

10 How do we learn the appropriate weights given only examples of (input, output)? Idea: Change the weights to decrease the error in the output.

11 Perceptron Training Rule
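
The body of this slide is not available here, so the following is only a sketch of the common textbook form of the perceptron rule, $w \leftarrow w + \eta\,(t - o)\,u$, and may differ from what the slide showed. The bias `b` plays the role of a negated threshold; `train_perceptron`, `eta`, and `max_epochs` are illustrative names and parameters.

```python
def theta(x):
    """Threshold function: +1 if x > 0, otherwise -1."""
    return 1 if x > 0 else -1


def predict(w, b, u):
    return theta(sum(wi * ui for wi, ui in zip(w, u)) + b)


def train_perceptron(examples, eta=0.1, max_epochs=100):
    """examples: list of (input_tuple, target) with targets in {-1, +1}."""
    w = [0.0] * len(examples[0][0])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for u, target in examples:
            out = predict(w, b, u)
            if out != target:
                errors += 1
                step = eta * (target - out)   # error-driven update
                w = [wi + step * ui for wi, ui in zip(w, u)]
                b += step
        if errors == 0:                        # every example classified correctly
            break
    return w, b


# AND with -1/+1 coding is linearly separable, so training settles on a separator.
and_examples = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
w, b = train_perceptron(and_examples)
print(w, b)
```

Weights change only on misclassified examples, which is the sense in which the rule changes the weights to decrease the output error.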

12 What about the XOR function? For inputs $u_1, u_2$ in {-1, +1}: (-1,-1) gives -1, (1,-1) gives +1, (-1,1) gives +1, (1,1) gives -1. Can a perceptron separate the +1 outputs from the -1 outputs?

13 Linear Inseparability. A perceptron with threshold units fails if the classification task is not linearly separable. Example: XOR. No single line can separate the "yes" (+1) outputs from the "no" (-1) outputs! Minsky and Papert's book showing such negative results put a damper on neural networks research for over a decade!

14 How do we deal with linear inseparability?

15 Idea 1: Multilayer Perceptrons. Removes limitations of single-layer networks. Can solve XOR. Example: two-layer perceptron that computes XOR of inputs x and y. Output is +1 if and only if $x + y - 2\,\Theta(x + y - 1.5) - 0.5 > 0$.
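
A short sketch of the two-layer XOR network, assuming the output rule reads $x + y - 2\,\Theta(x + y - 1.5) - 0.5 > 0$, with inputs coded as -1/+1 and $\Theta$ returning -1/+1. The function name `xor_net` is illustrative.

```python
def theta(x):
    """Threshold function: +1 if x > 0, otherwise -1."""
    return 1 if x > 0 else -1


def xor_net(x, y):
    # Hidden unit: fires only when both inputs are +1 (an AND unit).
    hidden = theta(x + y - 1.5)
    # Output unit: sums the inputs and subtracts twice the hidden activation.
    return theta(x + y - 2 * hidden - 0.5)


for x in (-1, 1):
    for y in (-1, 1):
        print(x, y, xor_net(x, y))
```

The hidden unit removes the one case, (1,1), that a single threshold on x + y would get wrong.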

16-19 Multilayer Perceptron: What does it do? (Figure sequence: a network with inputs x and y and output "out", built up step by step next to the x-y plane, with regions marked = 1 and = -1.)

20 Perceptrons as Constraint Satisfaction Networks. (Figure: network with inputs x and y and output "out", next to the x-y plane with regions marked = 1 and = -1.)

21 Back to Linear Separability. Recall: the weighted sum in a perceptron forms a linear hyperplane $\sum_i w_i x_i + b = 0$. Due to the threshold function, everything on one side of this hyperplane is labeled as class 1 (output = +1) and everything on the other side is labeled as class 2 (output = -1).

22 Separating Hyperplane. $\sum_i w_i x_i + b = 0$. (Figure: Class 1 and Class 2 points on either side of the hyperplane; one marker denotes +1 output, the other -1 output.) Need to choose w and b based on training data.

23 Separating Hyperplanes. Different choices of w and b give different hyperplanes. (Figure: Class 1 and Class 2 points; one marker denotes +1 output, the other -1 output.) (This and the next few slides adapted from Andrew Moore's.)

24 Which hyperplane is best? (Figure: Class 1 and Class 2 points; one marker denotes +1 output, the other -1 output.)

25 How about the one right in the middle? Intuitively, this boundary seems good. It avoids misclassification of new test points if they are generated from the same distribution as the training points.

26 Margin. Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

27 Maximum Margin and Support Vector Machine. Support vectors are those datapoints that the margin pushes up against. The maximum margin classifier is called a Support Vector Machine (in this case, a Linear SVM or LSVM).
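
To make the margin and support-vector definitions concrete, the sketch below measures them for a given hyperplane $w \cdot x + b = 0$. It assumes labels in {-1, +1} and a hyperplane that already separates the data; it does not search for the maximum-margin hyperplane. The function names and the toy points are illustrative.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin_and_support_vectors(w, b, points, labels, tol=1e-9):
    """Return (margin width, points closest to the boundary)."""
    # Multiplying by the label makes the distance positive for correct points.
    dists = [y * signed_distance(w, b, x) for x, y in zip(points, labels)]
    closest = min(dists)
    support = [x for x, d in zip(points, dists) if abs(d - closest) <= tol]
    # The boundary can be widened by `closest` on each side.
    return 2 * closest, support


points = [(0, 0), (-1, 1), (2, 0), (3, 1)]
labels = [-1, -1, 1, 1]
width, support = margin_and_support_vectors((1, 0), -1, points, labels)
print(width, support)
```

For the vertical boundary $x_1 = 1$ used here, the two closest points sit one unit away on either side, so they are the points the margin "pushes up against".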

28 Why Maximum Margin? Robust to small perturbations of data points near the boundary. There exists theory showing this is best for generalization to new points. Empirically works great.

29 What if data is not linearly separable? Outliers (due to noise).

30 Approach 1: Soft Margin SVMs. Allow errors $\xi_i$ (deviations from margin). Trade off margin with errors. Minimize: $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to: $y_i (w \cdot x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$, $\forall i$.
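
A small sketch of how the slack variables and the trade-off work, assuming the standard soft-margin form $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ with $\xi_i = \max(0,\, 1 - y_i (w \cdot x_i + b))$. It only evaluates the objective for a fixed w and b; it is not a solver. The function names and data are illustrative.

```python
def slacks(w, b, points, labels):
    """xi_i = max(0, 1 - y_i (w . x_i + b)); zero for points on or outside the margin."""
    result = []
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        result.append(max(0.0, 1.0 - y * score))
    return result


def soft_margin_objective(w, b, points, labels, C):
    """0.5 * ||w||^2 + C * (sum of slacks)."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    return margin_term + C * sum(slacks(w, b, points, labels))


# Two well-separated points plus one outlier on the wrong side of the boundary.
points = [(0, 0), (2, 0), (1.5, 0)]
labels = [-1, 1, -1]
w, b = (1, 0), -1
print(slacks(w, b, points, labels))
print(soft_margin_objective(w, b, points, labels, C=1))
print(soft_margin_objective(w, b, points, labels, C=10))
```

A larger C makes the outlier's slack more expensive, which pushes an optimizer toward boundaries that fit the training points more tightly at the cost of a narrower margin.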

31 What if data is not linearly separable: Other ideas? (Figure: data that is not linearly separable.)

32 What if data is not linearly separable? Approach 2: Map the original input space to a higher-dimensional feature space; use a linear classifier in the higher-dimensional space: $x \to \varphi(x)$. Kernel: additional bias to convert into high-d space.
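
As an illustration of the mapping idea, the sketch below uses a hand-picked feature map $\varphi(x_1, x_2) = (x_1, x_2, x_1 x_2)$ under which XOR (with -1/+1 coding) becomes linearly separable. The map, the weights, and the function names are assumptions chosen for the example; the slides do not specify a particular $\varphi$.

```python
def theta(x):
    """Threshold function: +1 if x > 0, otherwise -1."""
    return 1 if x > 0 else -1


def phi(x1, x2):
    """Hypothetical feature map: append the product of the two inputs."""
    return (x1, x2, x1 * x2)


def linear_classifier(w, b, features):
    return theta(sum(wi * fi for wi, fi in zip(w, features)) + b)


# In feature space the plane -x1*x2 = 0 separates XOR: weights (0, 0, -1), bias 0.
w, b = (0, 0, -1), 0
for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, linear_classifier(w, b, phi(x1, x2)))
```

No line in the original two-dimensional space separates these four points, but a single plane does once the third coordinate is added.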

33 Face Detection using SVMs. Kernel used: polynomial of degree 2 (Osuna, Freund, Girosi, 1998).

34 K-Nearest Neighbors. A simple non-parametric classification algorithm. Idea: Look around you to see how your neighbors classify data. Classify a new data point according to a majority vote of its k nearest neighbors.

35 Distance Metric. How do we measure what it means to be a neighbor (what is "close")? The appropriate distance metric depends on the problem. Examples: x discrete (e.g., strings): Hamming distance, $d(x, x')$ = number of features on which x and x' differ. x continuous (e.g., vectors over reals): Euclidean distance, $d(x, x') = \|x - x'\|$ = square root of the sum of squared differences between corresponding elements of the data vectors.
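
The two example metrics can be written down directly. This is a minimal sketch with illustrative function names; both functions assume the two arguments have the same length.

```python
import math


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for ai, bi in zip(a, b) if ai != bi)


def euclidean_distance(a, b):
    """Square root of the sum of squared element-wise differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


print(hamming_distance("karolin", "kathrin"))   # differs at 3 positions
print(euclidean_distance((0, 0), (3, 4)))       # 3-4-5 right triangle
```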

36 Example. Input data: 2-D points $(x_1, x_2)$. Two classes: $C_1$ and $C_2$. New data point: +. K = 4: look at the 4 nearest neighbors of +. 3 of them are in the same class, so + is assigned to that class.
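
A minimal k-nearest-neighbors sketch for an example of this kind, assuming Euclidean distance and made-up 2-D points (the slide's actual data is in a figure and is not reproduced here). `knn_classify` is an illustrative name; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_classify(train, query, k):
    """train: list of (point, label). Returns the majority label of the k nearest points."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    # Ties are broken in favor of the label encountered first among the neighbors.
    return votes.most_common(1)[0][0]


train = [
    ((1, 1), "C1"), ((1, 2), "C1"), ((2, 1), "C1"),
    ((2.5, 2.5), "C2"), ((5, 5), "C2"), ((6, 5), "C2"),
]
query = (2, 2)
print(knn_classify(train, query, k=4))  # 3 of the 4 nearest neighbors are C1
print(knn_classify(train, query, k=1))  # the single nearest neighbor is C2
```

The same query point gets a different label for k = 1 and k = 4, which shows how the choice of k smooths out the influence of a single nearby point.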

37 Decision Boundary using K-NN. Some points near the boundary may be misclassified (but they may be noise).

38 What if we want to learn continuous-valued functions? (Figure: plot of output against input.)

39 Regression. K-nearest neighbor: take the average of the k closest points. Linear/non-linear regression: fit parameters (gradient descent) minimizing the regression error/loss. Neural networks: remove the threshold function; learning multi-layer networks: backpropagation.
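
The first two options on this slide can be sketched in a few lines for one-dimensional inputs. This assumes mean squared error as the regression loss and plain batch gradient descent; the function names, learning rate, and step count are illustrative choices.

```python
def knn_regress(train, x, k):
    """Predict by averaging the outputs of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def fit_line(train, lr=0.01, steps=5000):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    w = b = 0.0
    n = len(train)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in train) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in train) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


train = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9)]   # points on y = 2x + 1
print(knn_regress(train, 2.1, k=3))
print(fit_line(train))
```

The k-NN predictor stores the data and averages locally, while the linear fit compresses the data into two parameters; on this noise-free line the second recovers the slope and intercept closely.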

40 Large Feature Spaces. Easy to overfit. Regularization: add a penalty for large weights; prefer weights that are zero or close to zero. Minimize: regression error + C · regularization penalty.
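
A one-weight sketch of the regularized objective, assuming squared error as the regression error and a squared-weight (L2) penalty; the slide does not say which penalty is meant. For the model $y \approx w x$, the minimizer of $\sum_i (w x_i - y_i)^2 + C w^2$ has the closed form $w = \sum_i x_i y_i / (\sum_i x_i^2 + C)$. Function names and data are illustrative.

```python
def regularized_loss(w, data, C):
    """Squared regression error plus C times the squared weight."""
    error = sum((w * x - y) ** 2 for x, y in data)
    return error + C * w * w


def best_weight(data, C):
    """Closed-form minimizer of regularized_loss for a single weight."""
    sum_xy = sum(x * y for x, y in data)
    sum_xx = sum(x * x for x, _ in data)
    return sum_xy / (sum_xx + C)


data = [(1, 2), (2, 4), (3, 6)]          # points on y = 2x
for C in (0, 14, 100):
    w = best_weight(data, C)
    print(C, w, regularized_loss(w, data, C))
```

With C = 0 the fit recovers the unregularized slope; as C grows, the preferred weight shrinks toward zero, which is the behavior the slide describes.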
