Linear Regression & Gradient Descent
1 Linear Regression
2 Regression
Given: n data points X = {x^(1), ..., x^(n)} where x^(i) ∈ R^d
Corresponding labels y = {y^(1), ..., y^(n)} where y^(i) ∈ R
[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. year, with linear and quadratic regression fits. Data from G. Witt, Journal of Statistics Education, Volume 21, Number 1 (2013)]
3 Prostate Cancer Dataset
97 samples, partitioned into 67 train / 30 test.
Eight predictors (features): 6 continuous (4 log transforms), 1 binary, 1 ordinal.
Continuous outcome variable: lpsa = log(prostate specific antigen level).
Based on slide by Jeff Howbert
4 Hypothesis: Linear Regression
h_θ(x) = θ_0 + θ_1 x_1 + ... + θ_d x_d
Assume x_0 = 1, so that h_θ(x) = Σ_{j=0}^d θ_j x_j.
Fit the model by minimizing the sum of squared errors.
Figures are courtesy of Greg Shakhnarovich
5 Least Squares Linear Regression
Cost function: J(θ) = 1/(2n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i))²
Fit by solving min_θ J(θ).
6 Intuition Behind Cost Function
J(θ) = 1/(2n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i))²
For insight on J(θ), let's assume x ∈ R, so θ = [θ_0, θ_1].
Based on example by Andrew Ng
7 Intuition Behind Cost Function
J(θ) = 1/(2n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i))²
For insight on J(θ), let's assume x ∈ R, so θ = [θ_0, θ_1].
Left: for fixed θ, h_θ(x) is a function of x. Right: J is a function of the parameter θ.
[Figure: data points and hypothesis line on the left; J(θ_1) curve on the right.]
Based on example by Andrew Ng
8 Intuition Behind Cost Function
J(θ) = 1/(2n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i))²
For insight on J(θ), let's assume x ∈ R, so θ = [θ_0, θ_1].
Left: for fixed θ, h_θ(x) is a function of x. Right: J is a function of the parameter θ.
Example with points (1, 1), (2, 2), (3, 3) and θ = [0, 0.5]:
J([0, 0.5]) = 1/(2·3) [(0.5 − 1)² + (1 − 2)² + (1.5 − 3)²] ≈ 0.58
Based on example by Andrew Ng
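The arithmetic on this slide can be checked with a short script. This is only an illustrative sketch; the three data points (1, 1), (2, 2), (3, 3) are inferred from the slide's residuals.

```python
import numpy as np

def cost(theta, X, y):
    """Least squares cost J(theta) = 1/(2n) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    n = len(y)
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * n)

# Toy data inferred from the slide: points (1,1), (2,2), (3,3),
# with a leading column of ones for the intercept feature x_0 = 1.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

print(cost(np.array([0.0, 0.5]), X, y))  # (0.25 + 1 + 2.25) / 6
print(cost(np.array([0.0, 1.0]), X, y))  # perfect fit: 0.0
```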
9 Intuition Behind Cost Function
J(θ) = 1/(2n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i))²
For insight on J(θ), let's assume x ∈ R, so θ = [θ_0, θ_1].
Left: for fixed θ, h_θ(x) is a function of x. Right: J is a function of the parameter θ; the point J([0, 0]) is marked on the curve.
J(θ) is convex.
Based on example by Andrew Ng
10–14 Intuition Behind Cost Function
[Animation over several slides: the left panel shows h_θ(x) for a fixed θ as a function of x; the right panel shows J as a function of the parameters θ, drawn as a contour plot. Different choices of θ correspond to different lines on the left and different points on the contours on the right. Slides by Andrew Ng]
15–16 Basic Search Procedure
Choose an initial value for θ.
Until we reach a minimum: choose a new value for θ to reduce J(θ).
[Figure: surface plot of J(θ_0, θ_1), with the search path descending the surface. Figure by Andrew Ng]
17 Basic Search Procedure
Since the least squares objective function is convex, we don't need to worry about local minima.
[Figure by Andrew Ng]
18 Gradient Descent
Initialize θ.
Repeat until convergence: θ_j ← θ_j − α ∂J(θ)/∂θ_j   (simultaneous update for j = 0 ... d)
α is the learning rate (a small constant).
19 Gradient Descent
Initialize θ. Repeat until convergence: θ_j ← θ_j − α ∂J(θ)/∂θ_j   (simultaneous update for j = 0 ... d)
For linear regression:
∂J(θ)/∂θ_j = ∂/∂θ_j [ 1/(2n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i))² ]
20 Gradient Descent
= ∂/∂θ_j [ 1/(2n) Σ_{i=1}^n (Σ_{k=0}^d θ_k x_k^(i) − y^(i))² ]
21 Gradient Descent
= 1/n Σ_{i=1}^n (Σ_{k=0}^d θ_k x_k^(i) − y^(i)) · ∂/∂θ_j (Σ_{k=0}^d θ_k x_k^(i) − y^(i))
22 Gradient Descent
= 1/n Σ_{i=1}^n (Σ_{k=0}^d θ_k x_k^(i) − y^(i)) · x_j^(i)
23 Gradient Descent for Linear Regression
Initialize θ.
Repeat until convergence: θ_j ← θ_j − α (1/n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i)) x_j^(i)   (simultaneous update for j = 0 ... d)
To achieve the simultaneous update: at the start of each GD iteration, compute h_θ(x^(i)) for every instance; use these stored values in the update step loop.
Assume convergence when ‖θ_new − θ_old‖₂ < ε, where the L₂ norm is ‖v‖₂ = √(Σ_i v_i²) = √(v · v).
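The update rule and convergence test above can be sketched in a few lines of NumPy. The function name, learning rate, and tolerance below are hypothetical choices, not values from the slides.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, tol=1e-8, max_iters=10000):
    """Batch gradient descent for least squares linear regression.

    X is n x (d+1) with a leading column of ones; returns theta in R^(d+1).
    """
    n, d1 = X.shape
    theta = np.zeros(d1)
    for _ in range(max_iters):
        predictions = X @ theta                 # h_theta(x^(i)) for all i, stored first
        gradient = X.T @ (predictions - y) / n  # (1/n) sum_i (h - y) x_j^(i), all j at once
        new_theta = theta - alpha * gradient    # simultaneous update of every theta_j
        if np.linalg.norm(new_theta - theta) < tol:  # ||theta_new - theta_old||_2 < eps
            return new_theta
        theta = new_theta
    return theta
```

Computing all predictions before touching θ is exactly the "simultaneous update" the slide asks for: no component of θ changes until the full gradient has been evaluated.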
24–32 Gradient Descent
[Animation over several slides: the left panel shows the current hypothesis h_θ(x) for fixed θ as a function of x; the right panel shows the contour plot of J as a function of the parameters θ. Each gradient descent step moves θ toward the minimum of J, and the fitted line on the left improves accordingly. Slides by Andrew Ng]
33 Choosing α
α too small: slow convergence.
α too large: J(θ) may increase at some iterations; may overshoot the minimum; may fail to converge; may even diverge.
To see if gradient descent is working, print out J(θ) each iteration. The value should decrease at each iteration; if it doesn't, adjust α.
34 Extending Linear Regression to More Complex Models
The inputs X for linear regression can be:
- Original quantitative inputs
- Transformations of quantitative inputs, e.g. log, exp, square root, square, etc.
- Polynomial transformations, example: y = β_0 + β_1 x + β_2 x² + β_3 x³
- Basis expansions
- Dummy coding of categorical inputs
- Interactions between variables, example: x_3 = x_1 · x_2
This allows use of linear regression techniques to fit non-linear datasets.
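As an illustrative sketch of the transformations above, the function name and the particular expansion here are my own: it builds a design matrix with polynomial terms of one input plus an interaction term, after which ordinary linear regression applies unchanged.

```python
import numpy as np

def expand_features(x1, x2):
    """Hypothetical expansion: polynomial terms of x1 plus an interaction x1*x2.

    Returns the design matrix [1, x1, x1^2, x1^3, x1*x2]. The model stays
    linear in the coefficients even though the fit is non-linear in x1.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.column_stack([np.ones_like(x1), x1, x1**2, x1**3, x1 * x2])

print(expand_features([2.0], [3.0]))  # [[1. 2. 4. 8. 6.]]
```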
35 Linear Basis Function Models
Generally, h_θ(x) = Σ_{j=0}^d θ_j φ_j(x), where φ_j(x) is a basis function.
Typically, φ_0(x) = 1 so that θ_0 acts as a bias.
In the simplest case, we use linear basis functions: φ_j(x) = x_j.
Based on slide by Christopher Bishop (PRML)
36 Linear Basis Function Models
Polynomial basis functions: φ_j(x) = x^j. These are global; a small change in x affects all basis functions.
Gaussian basis functions: φ_j(x) = exp(−(x − μ_j)² / (2s²)). These are local; a small change in x only affects nearby basis functions. μ_j and s control location and scale (width).
Based on slide by Christopher Bishop (PRML)
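A minimal sketch of the Gaussian basis expansion above; the centers and width passed in below are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_basis(x, centers, s):
    """Design matrix with phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)).

    Each row corresponds to one scalar input; a constant phi_0(x) = 1
    column is prepended so theta_0 acts as the bias.
    """
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    centers = np.asarray(centers, dtype=float).reshape(1, -1)
    phi = np.exp(-((x - centers) ** 2) / (2 * s**2))
    return np.hstack([np.ones((x.shape[0], 1)), phi])
```

At x = μ_j the basis function takes its maximum value 1, and it decays toward 0 as x moves away, which is exactly the locality the slide describes.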
37 Linear Basis Function Models
Sigmoidal basis functions: φ_j(x) = σ((x − μ_j)/s), where σ(a) = 1/(1 + exp(−a)).
These are also local; a small change in x only affects nearby basis functions. μ_j and s control location and scale (slope).
Based on slide by Christopher Bishop (PRML)
38 Example of Fitting a Polynomial Curve with a Linear Model
y = θ_0 + θ_1 x + θ_2 x² + ... + θ_p x^p = Σ_{j=0}^p θ_j x^j
39 Linear Basis Function Models
Basic linear model: h_θ(x) = Σ_{j=0}^d θ_j x_j
Generalized linear model: h_θ(x) = Σ_{j=0}^d θ_j φ_j(x)
Once we have replaced the data by the outputs of the basis functions, fitting the generalized model is exactly the same problem as fitting the basic model.
Unless we use the kernel trick (more on that when we cover support vector machines).
Therefore, there is no point in cluttering the math with basis functions.
Based on slide by Geoff Hinton
40 Linear Algebra Concepts
A vector in R^d is an ordered set of d real numbers, e.g., v = [1, 6, 3, 4] is in R^4.
[1, 6, 3, 4] is a column vector, as opposed to the row vector (1 6 3 4).
An m-by-n matrix is an object with m rows and n columns, where each entry is a real number.
Based on slides by Joseph Bradley
41 Linear Algebra Concepts
Transpose: reflect the vector/matrix across its main diagonal. A row vector (a b) transposes to a column vector, and [a b; c d]^T = [a c; b d].
Note: (Ax)^T = x^T A^T. (We'll define multiplication soon.)
Vector norms: the L_p norm of v = (v_1, ..., v_k) is ‖v‖_p = (Σ_i |v_i|^p)^(1/p).
Common norms: L_1, L_2; L_∞ = max_i |v_i|.
The length of a vector v is its L_2 norm.
Based on slides by Joseph Bradley
42 Linear Algebra Concepts
Vector dot product: u · v = (u_1, u_2) · (v_1, v_2) = u_1 v_1 + u_2 v_2.
Note: the dot product of u with itself equals length(u)² = ‖u‖₂².
Matrix product: for A ∈ R^(m×n) and B ∈ R^(n×p), entry (i, j) of AB is the dot product of row i of A with column j of B:
[a_11 a_12; a_21 a_22] [b_11 b_12; b_21 b_22] = [a_11 b_11 + a_12 b_21, a_11 b_12 + a_12 b_22; a_21 b_11 + a_22 b_21, a_21 b_12 + a_22 b_22]
Based on slides by Joseph Bradley
43 Linear Algebra Concepts
Vector products:
Dot product: u · v = u^T v = (u_1 u_2) (v_1; v_2) = u_1 v_1 + u_2 v_2
Outer product: u v^T = (u_1; u_2) (v_1 v_2) = [u_1 v_1, u_1 v_2; u_2 v_1, u_2 v_2]
Based on slides by Joseph Bradley
44 Benefits of vectorizajon More compact equajons VectorizaJon Faster code (using opjmized matrix libraries) Consider our model: Let = d h(x) = dx j=0 j x j x = 1 x 1... x d Can write the model in vectorized form as h(x) = x 45
45 Vectorization
Consider our model for n instances: h_θ(x^(i)) = Σ_{j=0}^d θ_j x_j^(i)
Let θ ∈ R^(d+1), and let X ∈ R^(n×(d+1)) be the matrix whose i-th row is [1, x_1^(i), ..., x_d^(i)].
Can write the model in vectorized form as h_θ(x) = Xθ.
46 Vectorization
Let y = [y^(1); y^(2); ...; y^(n)].
The linear regression cost function becomes:
J(θ) = 1/(2n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i))² = 1/(2n) Σ_{i=1}^n (θ^T x^(i) − y^(i))² = 1/(2n) (Xθ − y)^T (Xθ − y)
with (Xθ − y)^T ∈ R^(1×n), (Xθ − y) ∈ R^(n×1), X ∈ R^(n×(d+1)), and θ ∈ R^((d+1)×1).
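To make the equivalence above concrete, here is a sketch comparing the per-instance sum with the vectorized matrix form; the test data are made up.

```python
import numpy as np

def cost_loop(theta, X, y):
    """J(theta) computed instance by instance, as in the summation form."""
    n = len(y)
    return sum((X[i] @ theta - y[i]) ** 2 for i in range(n)) / (2 * n)

def cost_vectorized(theta, X, y):
    """The same J(theta), written as 1/(2n) (X theta - y)^T (X theta - y)."""
    r = X @ theta - y
    return (r @ r) / (2 * len(y))
```

Both return the same number; the vectorized form simply lets the matrix library do the loop.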
47 Closed Form Solution
Instead of using GD, solve for the optimal θ analytically.
Notice that the solution is where ∂J(θ)/∂θ = 0.
Derivation:
J(θ) = 1/(2n) (Xθ − y)^T (Xθ − y) ∝ θ^T X^T X θ − 2 θ^T X^T y + y^T y
Take the derivative with respect to θ, set it equal to 0, and solve for θ:
2 X^T X θ − 2 X^T y = 0  ⇒  (X^T X) θ = X^T y
Closed form: θ = (X^T X)^(−1) X^T y
48 Closed Form Solution
Can obtain θ by simply plugging X and y into θ = (X^T X)^(−1) X^T y.
If X^T X is not invertible (i.e., singular), we may need to:
- Use the pseudo-inverse instead of the inverse; in Python, numpy.linalg.pinv(a)
- Remove redundant (not linearly independent) features
- Remove extra features to ensure that d ≤ n
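A sketch of the closed form solution, using the pseudo-inverse as the slide suggests so that a singular X^T X does not crash the computation; the function name is mine.

```python
import numpy as np

def fit_closed_form(X, y):
    """Normal equations theta = (X^T X)^(-1) X^T y, via the pseudo-inverse.

    numpy.linalg.pinv handles the case where X^T X is singular (e.g.,
    linearly dependent features), matching the slide's advice.
    """
    return np.linalg.pinv(X.T @ X) @ (X.T @ y)
```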
49 Gradient Descent vs Closed Form
Gradient Descent: requires multiple iterations; need to choose α; works well when n is large; can support incremental learning.
Closed Form Solution: non-iterative; no need for α; slow when the number of features is large, since computing (X^T X)^(−1) is roughly cubic in the number of features.
50 Improving Learning: Feature Scaling
Idea: ensure that features have similar scales. This makes gradient descent converge much faster.
[Figures: contours of J(θ) before feature scaling (elongated) and after feature scaling (roughly circular).]
51 Feature Standardization
Rescales features to have zero mean and unit variance.
Let μ_j be the mean of feature j: μ_j = (1/n) Σ_{i=1}^n x_j^(i)
Replace each value with: x_j^(i) ← (x_j^(i) − μ_j) / s_j   for j = 1 ... d (not x_0!)
where s_j is the standard deviation of feature j.
Could also use the range of feature j (max_j − min_j) for s_j.
Must apply the same transformation to instances for both training and prediction.
Outliers can cause problems.
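The standardization recipe above can be sketched as two small helpers (the names are mine); returning (μ, s) is what lets the identical transformation be reused at prediction time, as the slide requires. X here is assumed to hold only the d raw feature columns, without the constant x_0 column.

```python
import numpy as np

def standardize_train(X):
    """Zero-mean, unit-variance scaling of each feature column.

    Returns the scaled matrix plus (mu, s) so the SAME transformation
    can be applied to test instances later.
    """
    mu = X.mean(axis=0)
    s = X.std(axis=0)
    return (X - mu) / s, mu, s

def standardize_apply(X, mu, s):
    """Apply a previously fitted standardization to new instances."""
    return (X - mu) / s
```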
52 Quality of Fit
[Figures: price vs. size with three fits: underfitting (high bias), correct fit, overfitting (high variance).]
Overfitting: the learned hypothesis may fit the training set very well (J(θ) ≈ 0) but fails to generalize to new examples.
Based on example by Andrew Ng
53 Regularization
A method for automatically controlling the complexity of the learned hypothesis.
Idea: penalize large values of θ_j. Can incorporate this penalty into the cost function.
Works well when we have a lot of features, each of which contributes a bit to predicting the label.
Can also address overfitting by eliminating features (either manually or via model selection).
54 Regularization
Linear regression objective function:
J(θ) = 1/(2n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i))² + λ/2 Σ_{j=1}^d θ_j²
The first term measures the model's fit to the data; the second is the regularization term.
λ is the regularization parameter (λ ≥ 0).
No regularization on θ_0!
55 Understanding Regularization
J(θ) = 1/(2n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i))² + λ/2 Σ_{j=1}^d θ_j²
Note that Σ_{j=1}^d θ_j² = ‖θ_{1:d}‖₂². This is the squared magnitude of the feature coefficient vector!
We can also think of this as Σ_{j=1}^d (θ_j − 0)² = ‖θ_{1:d} − 0⃗‖₂².
L₂ regularization pulls the coefficients toward 0.
56–57 Understanding Regularization
J(θ) = 1/(2n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i))² + λ/2 Σ_{j=1}^d θ_j²
What happens if we set λ to be huge? The penalty dominates, so all θ_j for j ≥ 1 are driven toward 0, and the fit approaches the constant hypothesis h(x) = θ_0.
[Figures: price vs. size; as λ grows, the fitted curve flattens toward a horizontal line.]
Based on example by Andrew Ng
58 Regularized Linear Regression
Cost function: J(θ) = 1/(2n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i))² + λ/2 Σ_{j=1}^d θ_j²
Fit by solving min_θ J(θ).
Gradient update:
θ_0 ← θ_0 − α (1/n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i))
θ_j ← θ_j − α (1/n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i)) x_j^(i) − α λ θ_j   for j ≥ 1
where the final term comes from the regularization.
59 Regularized Linear Regression
J(θ) = 1/(2n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i))² + λ/2 Σ_{j=1}^d θ_j²
θ_j ← θ_j − α (1/n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i)) x_j^(i) − α λ θ_j
We can rewrite the gradient step as:
θ_j ← θ_j (1 − α λ) − α (1/n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i)) x_j^(i)
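One step of the weight-decay form of the update can be sketched as follows, vectorized over all θ_j at once, with the bias θ_0 excluded from the decay; the helper name and the test values are mine.

```python
import numpy as np

def ridge_gd_step(theta, X, y, alpha, lam):
    """One regularized GD step in the 'weight decay' form:
    theta_j <- theta_j * (1 - alpha*lam) - alpha * (1/n) sum_i (h - y) x_j^(i),
    with no shrinkage applied to the bias theta_0.
    """
    n = len(y)
    grad = X.T @ (X @ theta - y) / n       # unregularized gradient, all j at once
    decay = np.full_like(theta, 1.0 - alpha * lam)
    decay[0] = 1.0                         # no regularization on theta_0
    return theta * decay - alpha * grad
```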
60 Regularized Linear Regression
To incorporate regularization into the closed form solution:
θ = (X^T X + nλ E)^(−1) X^T y
where E ∈ R^((d+1)×(d+1)) is the identity matrix with its upper-left entry (the one corresponding to θ_0) set to 0, so that θ_0 is not penalized.
61 Regularized Linear Regression
Can derive this the same way as before, by solving ∂J(θ)/∂θ = 0.
Can prove that for λ > 0, the inverse always exists.
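A sketch of the regularized closed form; the nλ scaling is my reading of how the λ/2 penalty combines with the 1/(2n) data term on these slides, and the function name is mine.

```python
import numpy as np

def fit_ridge_closed_form(X, y, lam):
    """Regularized closed form: theta = (X^T X + n*lam*E)^(-1) X^T y.

    E is the identity with its (0,0) entry zeroed, so the bias theta_0
    is not penalized. np.linalg.solve avoids forming an explicit inverse.
    """
    n, d1 = X.shape
    E = np.eye(d1)
    E[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + n * lam * E, X.T @ y)
```

With λ = 0 this reduces to the ordinary normal equations; as λ grows, the non-bias coefficients shrink toward 0.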
These slides were assembled by Byron Boots, with grateful acknowledgement to Eric Eaton and the many others who made their course materials freely available online.
More informationContent-based image and video analysis. Machine learning
Content-based image and video analysis Machine learning for multimedia retrieval 04.05.2009 What is machine learning? Some problems are very hard to solve by writing a computer program by hand Almost all
More informationInstance-based Learning
Instance-based Learning Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 15 th, 2007 2005-2007 Carlos Guestrin 1 1-Nearest Neighbor Four things make a memory based learner:
More informationCPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017
CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2017 Assignment 3: 2 late days to hand in tonight. Admin Assignment 4: Due Friday of next week. Last Time: MAP Estimation MAP
More informationPenalizied Logistic Regression for Classification
Penalizied Logistic Regression for Classification Gennady G. Pekhimenko Department of Computer Science University of Toronto Toronto, ON M5S3L1 pgen@cs.toronto.edu Abstract Investigation for using different
More informationLocally Weighted Learning for Control. Alexander Skoglund Machine Learning Course AASS, June 2005
Locally Weighted Learning for Control Alexander Skoglund Machine Learning Course AASS, June 2005 Outline Locally Weighted Learning, Christopher G. Atkeson et. al. in Artificial Intelligence Review, 11:11-73,1997
More informationLecture #11: The Perceptron
Lecture #11: The Perceptron Mat Kallada STAT2450 - Introduction to Data Mining Outline for Today Welcome back! Assignment 3 The Perceptron Learning Method Perceptron Learning Rule Assignment 3 Will be
More informationData Analysis 3. Support Vector Machines. Jan Platoš October 30, 2017
Data Analysis 3 Support Vector Machines Jan Platoš October 30, 2017 Department of Computer Science Faculty of Electrical Engineering and Computer Science VŠB - Technical University of Ostrava Table of
More information1 Case study of SVM (Rob)
DRAFT a final version will be posted shortly COS 424: Interacting with Data Lecturer: Rob Schapire and David Blei Lecture # 8 Scribe: Indraneel Mukherjee March 1, 2007 In the previous lecture we saw how
More informationInstance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2015
Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2015 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows K-Nearest
More informationLarge-Scale Lasso and Elastic-Net Regularized Generalized Linear Models
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models DB Tsai Steven Hillion Outline Introduction Linear / Nonlinear Classification Feature Engineering - Polynomial Expansion Big-data
More informationCPSC 340: Machine Learning and Data Mining. Regularization Fall 2016
CPSC 340: Machine Learning and Data Mining Regularization Fall 2016 Assignment 2: Admin 2 late days to hand it in Friday, 3 for Monday. Assignment 3 is out. Due next Wednesday (so we can release solutions
More informationRecap: Gaussian (or Normal) Distribution. Recap: Minimizing the Expected Loss. Topics of This Lecture. Recap: Maximum Likelihood Approach
Truth Course Outline Machine Learning Lecture 3 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Probability Density Estimation II 2.04.205 Discriminative Approaches (5 weeks)
More informationOverfitting. Machine Learning CSE546 Carlos Guestrin University of Washington. October 2, Bias-Variance Tradeoff
Overfitting Machine Learning CSE546 Carlos Guestrin University of Washington October 2, 2013 1 Bias-Variance Tradeoff Choice of hypothesis class introduces learning bias More complex class less bias More
More informationMachine Learning Lecture 3
Many slides adapted from B. Schiele Machine Learning Lecture 3 Probability Density Estimation II 26.04.2016 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course
More informationLearning via Optimization
Lecture 7 1 Outline 1. Optimization Convexity 2. Linear regression in depth Locally weighted linear regression 3. Brief dips Logistic Regression [Stochastic] gradient ascent/descent Support Vector Machines
More informationInstance-based Learning
Instance-based Learning Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February 19 th, 2007 2005-2007 Carlos Guestrin 1 Why not just use Linear Regression? 2005-2007 Carlos Guestrin
More informationLecture on Modeling Tools for Clustering & Regression
Lecture on Modeling Tools for Clustering & Regression CS 590.21 Analysis and Modeling of Brain Networks Department of Computer Science University of Crete Data Clustering Overview Organizing data into
More informationClustering algorithms and autoencoders for anomaly detection
Clustering algorithms and autoencoders for anomaly detection Alessia Saggio Lunch Seminars and Journal Clubs Université catholique de Louvain, Belgium 3rd March 2017 a Outline Introduction Clustering algorithms
More informationConvexization in Markov Chain Monte Carlo
in Markov Chain Monte Carlo 1 IBM T. J. Watson Yorktown Heights, NY 2 Department of Aerospace Engineering Technion, Israel August 23, 2011 Problem Statement MCMC processes in general are governed by non
More information6 Model selection and kernels
6. Bias-Variance Dilemma Esercizio 6. While you fit a Linear Model to your data set. You are thinking about changing the Linear Model to a Quadratic one (i.e., a Linear Model with quadratic features φ(x)
More informationNatural Language Processing with Deep Learning CS224N/Ling284. Christopher Manning Lecture 4: Backpropagation and computation graphs
Natural Language Processing with Deep Learning CS4N/Ling84 Christopher Manning Lecture 4: Backpropagation and computation graphs Lecture Plan Lecture 4: Backpropagation and computation graphs 1. Matrix
More informationChapter 7: Numerical Prediction
Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases SS 2016 Chapter 7: Numerical Prediction Lecture: Prof. Dr.
More informationData Preprocessing. Javier Béjar. URL - Spring 2018 CS - MAI 1/78 BY: $\
Data Preprocessing Javier Béjar BY: $\ URL - Spring 2018 C CS - MAI 1/78 Introduction Data representation Unstructured datasets: Examples described by a flat set of attributes: attribute-value matrix Structured
More informationData Preprocessing. Javier Béjar AMLT /2017 CS - MAI. (CS - MAI) Data Preprocessing AMLT / / 71 BY: $\
Data Preprocessing S - MAI AMLT - 2016/2017 (S - MAI) Data Preprocessing AMLT - 2016/2017 1 / 71 Outline 1 Introduction Data Representation 2 Data Preprocessing Outliers Missing Values Normalization Discretization
More informationconvolution shift invariant linear system Fourier Transform Aliasing and sampling scale representation edge detection corner detection
COS 429: COMPUTER VISON Linear Filters and Edge Detection convolution shift invariant linear system Fourier Transform Aliasing and sampling scale representation edge detection corner detection Reading:
More information1 Training/Validation/Testing
CPSC 340 Final (Fall 2015) Name: Student Number: Please enter your information above, turn off cellphones, space yourselves out throughout the room, and wait until the official start of the exam to begin.
More informationNeural Networks. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani
Neural Networks CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Biological and artificial neural networks Feed-forward neural networks Single layer
More information