Linear Models. Lecture Outline: Numeric Prediction: Linear Regression. Linear Classification. The Perceptron. Support Vector Machines
- Raymond Randall
1 Linear Models Lecture Outline: Numeric Prediction: Linear Regression. Linear Classification. The Perceptron. Support Vector Machines. Reading: Chapter 4.6 of Witten and Frank, 2nd ed.; Chapter 4 of Mitchell; Solving Least Squares Problems, C.L. Lawson & R.J. Hanson, SIAM; An Introduction to Support Vector Machines, N. Cristianini & J. Shawe-Taylor, Cambridge. COM3250 /
5 Numeric Prediction So far we have primarily focused on concept learning, i.e. binary classification. For example: credit-worthy loan applications; mushroom data: edible vs poisonous (= edible vs non-edible). However, most algorithms extend easily to n-ary classification, e.g. the zoo data (7 classes). The key characteristic of these problems is that the target attribute is a nominal attribute. In most cases the non-target attributes have also been nominal, and we have seen how numeric attributes can be converted to nominal attributes using a variety of discretization approaches. What if the target attribute is numeric? For example: heuristic evaluation functions for board games, such as checkers; numeric functions relating one physical quantity to others: temperature/pressure, lean body mass/muscle strength, etc. In such cases the non-target attributes are usually also numeric.
9 Linear Regression If target and non-target attributes are numeric then a classic technique to consider is linear regression. The output class/target attribute x is expressed as a linear combination of the other attributes a_1,...,a_n with predetermined weights w_0,...,w_n:

x = w_0 + w_1 a_1 + w_2 a_2 + ... + w_n a_n

The machine learning challenge is to compute the weights from the training data, i.e. view an assignment to the weights w_i as a hypothesis, and pick the hypothesis that best fits the training data. In linear regression the technique used to do this chooses the w_i so as to minimize the sum of the squares of the differences between the actual and predicted values of the target attribute over the training data; this is called least squares approximation. Note that if the difference between the actual and predicted target value is viewed as an error, then least squares approximation minimizes the total squared error across the training data.
10 Linear Regression: Example 1 Estimating the pressure of a fixed amount of gas in a tank, given its temperature. Under these conditions (fixed volume) the pressure of a gas is proportional to its temperature, which can be used to determine a line based on the true parameters. Given a set of temperature/pressure data points, linear regression can be used to derive a line based on estimated parameters. Source: NIST Engineering Statistics Handbook, section on Least Squares.
11 Linear Regression (cont) While linear regression is frequently thought of as fitting a line/plane to a set of data points, it can be used to fit the data with any function of the form

f(x; β) = β_0 + β_1 x_1 + β_2 x_2 + ...

in which 1. each explanatory variable (x_i) in the function is multiplied by an unknown parameter (β_i), 2. there is at most one unknown parameter with no corresponding explanatory variable (β_0), and 3. all of the individual terms are summed to produce the final function value. So quadratic curves, straight-line models in log(x), and polynomials in sin(x) are linear in the statistical sense so long as they are linear in the parameters β_i, even though they are not linear in the explanatory variables.
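As a quick illustration of "linear in the parameters", a quadratic curve can be fitted by ordinary least squares simply by treating x and x^2 as two separate explanatory variables. The data below are hypothetical (generated from known coefficients plus noise), and this is a sketch rather than part of the lecture's examples:

```python
import numpy as np

# Hypothetical data: a quadratic trend y = 1 + 2x - 0.5x^2 with small noise.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 0.1, x.size)

# f(x; beta) = b0 + b1*x + b2*x^2 is nonlinear in x but linear in the
# parameters beta, so least squares applies: each column of the design
# matrix holds one explanatory term.
A = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

print(beta)  # approximately [1.0, 2.0, -0.5]
```

The same pattern handles log(x) or sin(x) terms: only the columns of the design matrix change, never the fitting procedure.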
12 Linear Regression: Example 2 Detecting craters (ellipses/circles) on Mars from 2D image data. Randomly sample dark points in images and estimate linear parameters a, b, c, d, e for conic sections: ax^2 + bxy + cy^2 + dx + ey = 1
15 Least Squares Approximation Suppose there are m training examples, where each instance is represented by values for n numeric non-target attributes a_1,...,a_n, where the value of the j-th attribute for the i-th example is denoted a_{i,j} (with a_{i,0} = 1 for 1 <= i <= m), and a value for the target attribute x, denoted x_i for the i-th example. We wish to learn weights w_0, w_1, ..., w_n so as to minimize

sum_{i=1}^{m} ( x_i - sum_{j=0}^{n} w_j a_{i,j} )^2

The problem is naturally represented in matrix notation. Ideally we would like to find a column vector of weights w_0,...,w_n such that:

| a_{1,0} a_{1,1} a_{1,2} ... a_{1,n} |  | w_0 |     | x_1 |
| a_{2,0} a_{2,1} a_{2,2} ... a_{2,n} |  | w_1 |     | x_2 |
|   ...                               |  | ... |  =  | ... |
| a_{m,0} a_{m,1} a_{m,2} ... a_{m,n} |  | w_n |     | x_m |

i.e. such that Aw = x. Failing this, we want a vector of weights w that minimizes ||Aw - x||.
18 Least Squares Approximation (cont) A vector w that minimizes ||Aw - x|| is called a least squares solution of Aw = x. Such a solution is given by:

w = (A^T A)^{-1} A^T x    (1)

A proof of (1) can be arrived at in various ways: by reasoning about projections onto the column space of A (i.e. using linear algebra), or by differentiating the sum of squares error expression with respect to the weights w and computing the value of w for which this derivative is 0. Consider the latter. The error function Err(w) = sum_{i=1}^{m} (x_i - sum_{j=0}^{n} w_j a_{i,j})^2 can be written:

Err(w) = (x - Aw)^T (x - Aw)                  (2)
       = x^T x - 2 w^T A^T x + w^T A^T A w    (3)

So, differentiating with respect to w:

dErr(w)/dw = -2 A^T x + 2 A^T A w             (4)

Setting (4) = 0 yields

A^T A w = A^T x                               (5)

So, if the inverse of A^T A exists we have:

w = (A^T A)^{-1} A^T x                        (6)
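The closed-form solution of equation (6) can be sketched in a few lines of NumPy. The data here are hypothetical (generated from known weights plus noise); the second solver, np.linalg.lstsq, is shown for comparison since it avoids explicitly inverting A^T A:

```python
import numpy as np

# Hypothetical training data: m = 100 examples, n = 2 attributes,
# with a_{i,0} = 1 supplying the intercept column.
rng = np.random.default_rng(1)
A = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
true_w = np.array([0.5, 2.0, -1.0])
x = A @ true_w + rng.normal(0, 0.05, 100)   # targets with small noise

# Normal equations, as in (6): w = (A^T A)^{-1} A^T x
w_normal = np.linalg.inv(A.T @ A) @ A.T @ x

# In practice an orthogonal-factorisation solver is preferred: it gives
# the same least squares solution but is stable even when A^T A is
# ill-conditioned.
w_lstsq, *_ = np.linalg.lstsq(A, x, rcond=None)

print(w_normal, w_lstsq)  # both close to [0.5, 2.0, -1.0]
```

This mirrors the slide's remark about Matlab: the normal-equation form is the textbook derivation, while library solvers choose a factorisation suited to the characteristics of A.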
19 Least Squares Approximation (cont) How do we compute a least squares solution? Considerable work has been put into developing efficient solutions, given the wide range of applications. We can compute (6) using matrix manipulation packages such as Matlab, using the \, inverse, pseudo-inverse, or QR decomposition operators, depending on the characteristics of A and A^T A. An extensive treatment of algorithms for least squares can be found in Lawson & Hanson. A simple algorithm which converges to the least squares solution is the Widrow-Hoff algorithm (here w = w_1,...,w_n and the bias b = w_0 is explicit):

Given training set S and learning rate η ∈ R+
  w ← 0; b ← 0
  repeat
    for i = 1 to n
      (w, b) ← (w, b) - η(⟨w · a_i⟩ + b - x_i)(a_i, 1)
    end for
  until convergence criterion satisfied
  return (w, b)

(Cristianini & Shawe-Taylor, p. 23)
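The Widrow-Hoff loop above can be sketched directly in Python. The data are hypothetical (noiseless targets from known weights) and the learning rate and epoch count are illustrative choices, not values from the lecture:

```python
import numpy as np

def widrow_hoff(examples, targets, eta=0.01, epochs=500):
    """Iterative least squares as sketched above: repeatedly nudge
    (w, b) against the error on one example at a time."""
    w = np.zeros(examples.shape[1])
    b = 0.0
    for _ in range(epochs):
        for a_i, x_i in zip(examples, targets):
            err = w @ a_i + b - x_i   # prediction error on example i
            w -= eta * err * a_i      # (w, b) <- (w, b) - eta*err*(a_i, 1)
            b -= eta * err
    return w, b

# Hypothetical data drawn exactly from x = 3*a1 - 2*a2 + 1.
rng = np.random.default_rng(2)
A = rng.normal(size=(50, 2))
x = A @ np.array([3.0, -2.0]) + 1.0
w, b = widrow_hoff(A, x)
print(w, b)  # converges towards w = [3, -2], b = 1
```

A fixed epoch count stands in for the slide's "convergence criterion satisfied"; a practical implementation would instead stop when the weight updates become negligibly small.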
21 Linear Classification Linear regression (or any regression technique) can be used for classification in domains with numeric attributes: perform a regression for each class, setting the output to 1 for training instances in the class and 0 for the others. The result is a linear expression for each class. For a test instance, calculate the value of each linear expression and assign the class whose linear expression yields the largest value. This approach is called multiresponse linear regression. Another technique for multiclass classification (i.e. more than two classes) is pairwise classification: build a classifier for every pair of classes, using only training instances from those classes; the output for a test instance is the class which receives the most votes (across classifiers). If there are k classes this method results in k(k-1)/2 classifiers, but it is not overly computationally expensive, since each classifier is trained on just the subset of instances belonging to its two classes. Note this technique can be used with any classification algorithm.
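Multiresponse linear regression can be sketched as one least squares fit per class against 0/1 indicator targets. The three-class blob data below are hypothetical, invented purely to make the argmax step visible:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical 3-class data: Gaussian blobs around distinct centres.
centres = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 4.0]])
X = np.vstack([c + rng.normal(0, 0.5, (30, 2)) for c in centres])
y = np.repeat([0, 1, 2], 30)

# One regression per class: target 1 for members of the class,
# 0 for everything else.
A = np.column_stack([np.ones(len(X)), X])   # bias column
T = np.eye(3)[y]                            # 0/1 indicator targets
W, *_ = np.linalg.lstsq(A, T, rcond=None)   # one weight column per class

# Classify by evaluating each class's linear expression and assigning
# the class whose expression yields the largest value.
pred = np.argmax(A @ W, axis=1)
print((pred == y).mean())  # well-separated blobs: accuracy near 1.0
```

Solving all three regressions in a single lstsq call (a matrix of targets) is just a convenience; it is equivalent to fitting each class's linear expression separately.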
22 Other Linear Classifiers: The Perceptron If the training instances are linearly separable into two classes, i.e. there is a hyperplane that separates them, then a simple algorithm that separates them is the perceptron learning rule. The perceptron, ancestor of the neural net, can be pictured as a two-layer network (graph) of neurons (nodes): the input layer has one node per attribute plus an extra node (the bias) fixed at 1; the output layer is a single node; each input node is linked to the output node via a weighted connection. When an instance is presented to the input layer its attribute values activate the input layer; the input activations are multiplied by the weights and summed; if the weighted sum > 0 then the output signal is 1, otherwise the output is -1. [Figure: input layer of attribute nodes a_1,...,a_n plus a bias node fixed at 1, connected by weights w_1,...,w_n and b (= w_0) to a single output-layer node.]
23 The Perceptron Learning Rule Basic idea: incorrectly classified +ve examples lead to a small increase in the weights; incorrectly classified -ve examples lead to a small decrease in the weights.

Given a linearly separable training set S and learning rate η ∈ R+
  w_0 ← 0; b_0 ← 0; k ← 0
  R ← max_{1≤i≤n} ||a_i||
  repeat
    for i = 1 to n
      if x_i(⟨w_k · a_i⟩ + b_k) ≤ 0 then    (incorrect classification)
        w_{k+1} ← w_k + η x_i a_i
        b_{k+1} ← b_k + η x_i R^2
        k ← k + 1
      end if
    end for
  until no mistakes made within the for loop
  return (w_k, b_k), where k is the number of mistakes

(Cristianini & Shawe-Taylor, p. 12) More on this in the next lecture...
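The learning rule above translates almost line for line into Python. The two blobs of data are hypothetical and deliberately well separated, since the loop only terminates on linearly separable data:

```python
import numpy as np

def perceptron(examples, labels, eta=0.1):
    """Perceptron learning rule as given above; labels are +1/-1.
    Terminates only if the data really are linearly separable."""
    w = np.zeros(examples.shape[1])
    b = 0.0
    R = max(np.linalg.norm(a) for a in examples)
    mistakes = 0
    while True:
        errors = False
        for a_i, x_i in zip(examples, labels):
            if x_i * (w @ a_i + b) <= 0:   # incorrect classification
                w = w + eta * x_i * a_i
                b = b + eta * x_i * R**2
                mistakes += 1
                errors = True
        if not errors:                     # a full pass with no mistakes
            return w, b, mistakes

# Hypothetical linearly separable data: two shifted blobs.
rng = np.random.default_rng(4)
pos = rng.normal(0, 0.5, (20, 2)) + [2, 2]
neg = rng.normal(0, 0.5, (20, 2)) + [-2, -2]
X = np.vstack([pos, neg])
y = np.array([1] * 20 + [-1] * 20)
w, b, k = perceptron(X, y)
print(w, b, k)  # a separating hyperplane, found after k mistakes
```

Note the returned hyperplane is merely *a* separator, not the best one; the maximum margin hyperplane of the SVM slides that follow addresses exactly this.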
24 Support Vector Machines A limitation of the simple linear classifiers above (the perceptron, linear regression with lines) is that they can only represent linear class boundaries, which makes them too simple for many applications. Support vector machines (SVMs) use linear models to implement non-linear class boundaries by transforming the input using a non-linear mapping: the instance space is mapped into a new space, and a non-linear class boundary in the original space maps onto a linear boundary in the new space.
25 Support Vector Machines (cont) For example, suppose we replace the original set of n attributes by a set including all products of k factors that can be constructed from these attributes, i.e. we move from a linear expression in n variables to a multivariate polynomial of degree k. So, if we started with a linear model with two attributes and two weights,

x = w_1 a_1 + w_2 a_2

we would move to one with four synthetic attributes and four weights:

x = w_1 a_1^3 + w_2 a_1^2 a_2 + w_3 a_1 a_2^2 + w_4 a_2^3

To generate a linear model in the space spanned by these products of factors, each training instance is mapped into the new space by computing all possible 3-factor products of its two attribute values, and the learning algorithm is applied to the transformed instances. To classify a test instance, it is transformed prior to classification. Problems: computational complexity: 5 factors of 10 attributes give > 2000 coefficients; overfitting: if the number of coefficients is large relative to the number of training instances, the model will overfit the training data (it is too nonlinear).
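The synthetic-attribute mapping and its combinatorial blow-up can be sketched with the standard library alone. The helper name is ours, not the lecture's:

```python
from itertools import combinations_with_replacement
from math import prod, comb

def product_features(a, k):
    """All products of k factors drawn (with repetition) from the
    attributes of instance a -- the synthetic attributes described above."""
    return [prod(t) for t in combinations_with_replacement(a, k)]

# Two attributes, k = 3: the four synthetic attributes
# a1^3, a1^2*a2, a1*a2^2, a2^3.
print(product_features([2.0, 3.0], 3))   # [8.0, 12.0, 18.0, 27.0]

# The combinatorial blow-up: the number of products of 5 factors
# drawn from 10 attributes is a multiset count, C(10+5-1, 5).
print(comb(10 + 5 - 1, 5))               # 2002 coefficients
```

The 2002 figure matches the slide's "> 2000 coefficients" for 5 factors of 10 attributes, and makes concrete why computing the mapping explicitly quickly becomes infeasible.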
26 Support Vector Machines (cont) SVMs solve both problems using a linear model called the maximum margin hyperplane. The maximum margin hyperplane gives the greatest separation between the classes: it is the perpendicular bisector of the shortest line connecting the convex hulls (the tightest enclosing convex polygons) of the sets of points in each class. The instances closest to the maximum margin hyperplane are called support vectors (there is at least one per class). The support vectors uniquely define the maximum margin hyperplane: given them we can construct the maximum margin hyperplane, and all other instances can be discarded.
27 Support Vector Machines (cont) SVMs are unlikely to overfit, as overfitting is caused by too much flexibility in the decision boundary. The maximum margin hyperplane is relatively stable: it only changes if training instances that are support vectors are added or removed, and there are usually few support vectors (they can be thought of as global representatives of the training set), which gives little flexibility. Nor are SVMs computationally infeasible. To classify a test instance, the vector dot product of the test instance with all support vectors must be calculated. A dot product involves one multiplication and one addition per attribute, which is expensive in the new high-dimensional space resulting from the nonlinear mapping. However, the dot product can be computed on the original attribute set before mapping: e.g. if using a high-dimensional feature space based on products of k factors, take the dot product of the vectors in the low-dimensional space and raise it to the power k. A function doing this is called a polynomial kernel. Choosing k: usually start with k = 1 (a linear model) and increase k until there is no reduction in the estimated error.
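The kernel shortcut can be checked numerically for k = 2 and two attributes. The explicit feature map below (with its sqrt(2) weighting on the cross term) is one standard choice that makes the identity exact; the vectors u and v are arbitrary hypothetical instances:

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-attribute instance; the
    sqrt(2) weighting makes the mapped dot product equal the kernel."""
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

def poly_kernel(u, v, k=2):
    """Polynomial kernel: the dot product in the original space,
    raised to the power k."""
    return (u @ v) ** k

u = np.array([1.0, 2.0])
v = np.array([3.0, 0.5])

# The kernel computed in the low-dimensional space...
print(poly_kernel(u, v))   # (1*3 + 2*0.5)^2 = 16.0
# ...equals the dot product in the high-dimensional space.
print(phi(u) @ phi(v))     # also 16.0
```

The point of the trick is that poly_kernel never constructs the mapped vectors: the cost stays proportional to the original number of attributes, regardless of how large the implicit feature space is.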
28 Support Vector Machines (cont) Other kernel functions can be used to implement different nonlinear mappings: the radial basis function (RBF) kernel yields an RBF neural network; the sigmoid kernel yields a multilayer perceptron with one hidden layer. The choice of kernel depends on the application, though there may not be much difference in practice. SVMs can be generalised to cases where the training data is not linearly separable. SVMs are slow during training compared to other algorithms, such as decision trees; however, they can produce very accurate classifiers. The best results on text classification tasks are now typically obtained using SVMs.
32 Summary In learning from examples where the attributes are numeric, it is natural to start with linear models. When target and non-target attributes are numeric, i.e. the task is numeric prediction, the problem is referred to as linear regression: the goal is to fit a line to the training instances and then predict a value for a test instance using the induced linear equation. A common computational technique is least squares approximation, which selects the line that minimizes the sum of the squared errors. Linear models can be used for classification by finding lines that separate the classes: linear regression can be used to perform linear classification (multiresponse linear regression), and binary linear classification can also be performed using the perceptron. In cases where the classes are not linearly separable in the initial attribute space, a linear model may be found in a higher-dimensional space arrived at by a nonlinear mapping from the initial space: support vector machines are computationally efficient algorithms for mapping instances into higher-dimensional feature spaces and finding hyperplanes in these spaces to perform classification.
Lecture 4 14. 5 Release Introduction to ANSYS DesignXplorer 1 2013 ANSYS, Inc. September 27, 2013 s are functions of different nature where the output parameters are described in terms of the input parameters
More informationSupport Vector Machines (a brief introduction) Adrian Bevan.
Support Vector Machines (a brief introduction) Adrian Bevan email: a.j.bevan@qmul.ac.uk Outline! Overview:! Introduce the problem and review the various aspects that underpin the SVM concept.! Hard margin
More informationAM 221: Advanced Optimization Spring 2016
AM 221: Advanced Optimization Spring 2016 Prof. Yaron Singer Lecture 2 Wednesday, January 27th 1 Overview In our previous lecture we discussed several applications of optimization, introduced basic terminology,
More informationKernel SVM. Course: Machine Learning MAHDI YAZDIAN-DEHKORDI FALL 2017
Kernel SVM Course: MAHDI YAZDIAN-DEHKORDI FALL 2017 1 Outlines SVM Lagrangian Primal & Dual Problem Non-linear SVM & Kernel SVM SVM Advantages Toolboxes 2 SVM Lagrangian Primal/DualProblem 3 SVM LagrangianPrimalProblem
More informationSVM Classification in Multiclass Letter Recognition System
Global Journal of Computer Science and Technology Software & Data Engineering Volume 13 Issue 9 Version 1.0 Year 2013 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals
More informationCS 229 Midterm Review
CS 229 Midterm Review Course Staff Fall 2018 11/2/2018 Outline Today: SVMs Kernels Tree Ensembles EM Algorithm / Mixture Models [ Focus on building intuition, less so on solving specific problems. Ask
More informationAnnouncements. CS 188: Artificial Intelligence Spring Classification: Feature Vectors. Classification: Weights. Learning: Binary Perceptron
CS 188: Artificial Intelligence Spring 2010 Lecture 24: Perceptrons and More! 4/20/2010 Announcements W7 due Thursday [that s your last written for the semester!] Project 5 out Thursday Contest running
More informationOptimization Methods for Machine Learning (OMML)
Optimization Methods for Machine Learning (OMML) 2nd lecture Prof. L. Palagi References: 1. Bishop Pattern Recognition and Machine Learning, Springer, 2006 (Chap 1) 2. V. Cherlassky, F. Mulier - Learning
More informationFMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu
FMA901F: Machine Learning Lecture 3: Linear Models for Regression Cristian Sminchisescu Machine Learning: Frequentist vs. Bayesian In the frequentist setting, we seek a fixed parameter (vector), with value(s)
More information5 Machine Learning Abstractions and Numerical Optimization
Machine Learning Abstractions and Numerical Optimization 25 5 Machine Learning Abstractions and Numerical Optimization ML ABSTRACTIONS [some meta comments on machine learning] [When you write a large computer
More informationMachine Learning. Topic 5: Linear Discriminants. Bryan Pardo, EECS 349 Machine Learning, 2013
Machine Learning Topic 5: Linear Discriminants Bryan Pardo, EECS 349 Machine Learning, 2013 Thanks to Mark Cartwright for his extensive contributions to these slides Thanks to Alpaydin, Bishop, and Duda/Hart/Stork
More informationApplying Supervised Learning
Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains
More informationContent-based image and video analysis. Machine learning
Content-based image and video analysis Machine learning for multimedia retrieval 04.05.2009 What is machine learning? Some problems are very hard to solve by writing a computer program by hand Almost all
More informationSUPPORT VECTOR MACHINES
SUPPORT VECTOR MACHINES Today Reading AIMA 18.9 Goals (Naïve Bayes classifiers) Support vector machines 1 Support Vector Machines (SVMs) SVMs are probably the most popular off-the-shelf classifier! Software
More informationMore on Learning. Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization
More on Learning Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization Neural Net Learning Motivated by studies of the brain. A network of artificial
More information6. Linear Discriminant Functions
6. Linear Discriminant Functions Linear Discriminant Functions Assumption: we know the proper forms for the discriminant functions, and use the samples to estimate the values of parameters of the classifier
More informationClassification: Feature Vectors
Classification: Feature Vectors Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just # free YOUR_NAME MISSPELLED FROM_FRIEND... : : : : 2 0 2 0 PIXEL 7,12
More informationLecture 10: SVM Lecture Overview Support Vector Machines The binary classification problem
Computational Learning Theory Fall Semester, 2012/13 Lecture 10: SVM Lecturer: Yishay Mansour Scribe: Gitit Kehat, Yogev Vaknin and Ezra Levin 1 10.1 Lecture Overview In this lecture we present in detail
More informationLECTURE 5: DUAL PROBLEMS AND KERNELS. * Most of the slides in this lecture are from
LECTURE 5: DUAL PROBLEMS AND KERNELS * Most of the slides in this lecture are from http://www.robots.ox.ac.uk/~az/lectures/ml Optimization Loss function Loss functions SVM review PRIMAL-DUAL PROBLEM Max-min
More informationSupport Vector Machines
Support Vector Machines About the Name... A Support Vector A training sample used to define classification boundaries in SVMs located near class boundaries Support Vector Machines Binary classifiers whose
More informationNeural Networks. Theory And Practice. Marco Del Vecchio 19/07/2017. Warwick Manufacturing Group University of Warwick
Neural Networks Theory And Practice Marco Del Vecchio marco@delvecchiomarco.com Warwick Manufacturing Group University of Warwick 19/07/2017 Outline I 1 Introduction 2 Linear Regression Models 3 Linear
More informationCSE 573: Artificial Intelligence Autumn 2010
CSE 573: Artificial Intelligence Autumn 2010 Lecture 16: Machine Learning Topics 12/7/2010 Luke Zettlemoyer Most slides over the course adapted from Dan Klein. 1 Announcements Syllabus revised Machine
More informationRegularization and model selection
CS229 Lecture notes Andrew Ng Part VI Regularization and model selection Suppose we are trying select among several different models for a learning problem. For instance, we might be using a polynomial
More informationAssignment 2. Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions
ENEE 739Q: STATISTICAL AND NEURAL PATTERN RECOGNITION Spring 2002 Assignment 2 Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions Aravind Sundaresan
More informationCLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS
CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of
More informationSUPPORT VECTOR MACHINE ACTIVE LEARNING
SUPPORT VECTOR MACHINE ACTIVE LEARNING CS 101.2 Caltech, 03 Feb 2009 Paper by S. Tong, D. Koller Presented by Krzysztof Chalupka OUTLINE SVM intro Geometric interpretation Primal and dual form Convexity,
More informationMachine Learning in Biology
Università degli studi di Padova Machine Learning in Biology Luca Silvestrin (Dottorando, XXIII ciclo) Supervised learning Contents Class-conditional probability density Linear and quadratic discriminant
More informationLARGE MARGIN CLASSIFIERS
Admin Assignment 5 LARGE MARGIN CLASSIFIERS David Kauchak CS 451 Fall 2013 Midterm Download from course web page when you re ready to take it 2 hours to complete Must hand-in (or e-mail in) by 11:59pm
More informationCS229 Final Project: Predicting Expected Response Times
CS229 Final Project: Predicting Expected Email Response Times Laura Cruz-Albrecht (lcruzalb), Kevin Khieu (kkhieu) December 15, 2017 1 Introduction Each day, countless emails are sent out, yet the time
More informationCS6375: Machine Learning Gautam Kunapuli. Mid-Term Review
Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes
More informationThe Curse of Dimensionality
The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more
More informationBagging for One-Class Learning
Bagging for One-Class Learning David Kamm December 13, 2008 1 Introduction Consider the following outlier detection problem: suppose you are given an unlabeled data set and make the assumptions that one
More informationUnivariate and Multivariate Decision Trees
Univariate and Multivariate Decision Trees Olcay Taner Yıldız and Ethem Alpaydın Department of Computer Engineering Boğaziçi University İstanbul 80815 Turkey Abstract. Univariate decision trees at each
More information11/14/2010 Intelligent Systems and Soft Computing 1
Lecture 7 Artificial neural networks: Supervised learning Introduction, or how the brain works The neuron as a simple computing element The perceptron Multilayer neural networks Accelerated learning in
More informationEquation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation.
Equation to LaTeX Abhinav Rastogi, Sevy Harris {arastogi,sharris5}@stanford.edu I. Introduction Copying equations from a pdf file to a LaTeX document can be time consuming because there is no easy way
More informationCOMP 551 Applied Machine Learning Lecture 14: Neural Networks
COMP 551 Applied Machine Learning Lecture 14: Neural Networks Instructor: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551 Unless otherwise noted, all material posted for this course
More informationLecture on Modeling Tools for Clustering & Regression
Lecture on Modeling Tools for Clustering & Regression CS 590.21 Analysis and Modeling of Brain Networks Department of Computer Science University of Crete Data Clustering Overview Organizing data into
More informationIntroduction to object recognition. Slides adapted from Fei-Fei Li, Rob Fergus, Antonio Torralba, and others
Introduction to object recognition Slides adapted from Fei-Fei Li, Rob Fergus, Antonio Torralba, and others Overview Basic recognition tasks A statistical learning approach Traditional or shallow recognition
More informationLarge Margin Classification Using the Perceptron Algorithm
Large Margin Classification Using the Perceptron Algorithm Yoav Freund Robert E. Schapire Presented by Amit Bose March 23, 2006 Goals of the Paper Enhance Rosenblatt s Perceptron algorithm so that it can
More informationPerceptron Introduction to Machine Learning. Matt Gormley Lecture 5 Jan. 31, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Perceptron Matt Gormley Lecture 5 Jan. 31, 2018 1 Q&A Q: We pick the best hyperparameters
More informationLecture 7: Support Vector Machine
Lecture 7: Support Vector Machine Hien Van Nguyen University of Houston 9/28/2017 Separating hyperplane Red and green dots can be separated by a separating hyperplane Two classes are separable, i.e., each
More informationLarge-Scale Lasso and Elastic-Net Regularized Generalized Linear Models
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models DB Tsai Steven Hillion Outline Introduction Linear / Nonlinear Classification Feature Engineering - Polynomial Expansion Big-data
More informationPerceptron Learning Algorithm
Perceptron Learning Algorithm An iterative learning algorithm that can find linear threshold function to partition linearly separable set of points. Assume zero threshold value. 1) w(0) = arbitrary, j=1,
More informationLearning and Generalization in Single Layer Perceptrons
Learning and Generalization in Single Layer Perceptrons Neural Computation : Lecture 4 John A. Bullinaria, 2015 1. What Can Perceptrons do? 2. Decision Boundaries The Two Dimensional Case 3. Decision Boundaries
More informationParallel & Scalable Machine Learning Introduction to Machine Learning Algorithms
Parallel & Scalable Machine Learning Introduction to Machine Learning Algorithms Dr. Ing. Morris Riedel Adjunct Associated Professor School of Engineering and Natural Sciences, University of Iceland Research
More informationPractice EXAM: SPRING 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE
Practice EXAM: SPRING 0 CS 6375 INSTRUCTOR: VIBHAV GOGATE The exam is closed book. You are allowed four pages of double sided cheat sheets. Answer the questions in the spaces provided on the question sheets.
More informationAnnouncements. CS 188: Artificial Intelligence Spring Generative vs. Discriminative. Classification: Feature Vectors. Project 4: due Friday.
CS 188: Artificial Intelligence Spring 2011 Lecture 21: Perceptrons 4/13/2010 Announcements Project 4: due Friday. Final Contest: up and running! Project 5 out! Pieter Abbeel UC Berkeley Many slides adapted
More information