Linear Regression & Gradient Descent

1 Linear Regression & Gradient Descent These slides were assembled by Byron Boots, with grateful acknowledgement to Eric Eaton and the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Robot Image Credit: Viktoriya Sukhanova 123RF.com

2 Regression. Given: data $X = \{x^{(1)}, \ldots, x^{(n)}\}$ where $x^{(i)} \in \mathbb{R}^d$, and corresponding labels $y = \{y^{(1)}, \ldots, y^{(n)}\}$ where $y^{(i)} \in \mathbb{R}$. [Figure: September Arctic Sea Ice Extent (1,000,000 sq km) versus Year, with Linear Regression and Quadratic Regression fits.] Data from G. Witt. Journal of Statistics Education, Volume 21, Number 1 (2013)

3 Linear Regression. Hypothesis: $y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_d x_d = \sum_{j=0}^{d} \theta_j x_j$. Assume $x_0 = 1$. Fit model by minimizing sum of squared errors. Figures are courtesy of Greg Shakhnarovich
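
The hypothesis on this slide can be sketched in a few lines of plain Python. This is only an illustration: the function name `predict` and the numeric values are assumptions, not part of the slides.

```python
def predict(theta, x):
    """Return h_theta(x) = sum_j theta_j * x_j, with x_0 = 1 prepended."""
    features = [1.0] + list(x)            # x_0 = 1 carries the intercept theta_0
    if len(features) != len(theta):
        raise ValueError("theta needs exactly one more entry than x")
    return sum(t * f for t, f in zip(theta, features))


# Illustrative parameters (hypothetical, chosen for easy arithmetic).
theta = [1.0, 2.0, -0.5]
print(predict(theta, [3.0, 4.0]))         # 1 + 2*3 - 0.5*4
```

Prepending the constant feature keeps the intercept inside the same sum as the other coefficients, which is what lets the later slides write every update with a single index $j = 0 \ldots d$.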

4 Least Squares Linear Regression. Cost function: $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$. Fit by solving $\min_\theta J(\theta)$.
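
As a rough sketch, the cost function can be evaluated directly from its definition. The name `cost` and the toy data are assumptions made for illustration only.

```python
def cost(theta, X, y):
    """Least squares cost J(theta) = 1/(2n) * sum_i (h_theta(x_i) - y_i)^2."""
    n = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        # h_theta(x_i) with the constant feature x_0 = 1 prepended
        h = sum(t * f for t, f in zip(theta, [1.0] + list(x_i)))
        total += (h - y_i) ** 2
    return total / (2 * n)


# Hypothetical toy data lying exactly on the line y = 2x.
X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 6.0]
print(cost([0.0, 2.0], X, y))   # perfect fit, so the cost is zero
print(cost([0.0, 1.0], X, y))   # errors -1, -2, -3 give 14 / 6
```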

5 Intuition Behind Cost Function. Slide by Andrew Ng

6 Intuition Behind Cost Function. [Figure: left panel shows $h_\theta(x)$ (for fixed $\theta$, this is a function of $x$); right panel shows $J(\theta)$ (function of the parameters $\theta$).] Slide by Andrew Ng

10 Basic Search Procedure. Choose initial value for $\theta$. Until we reach a minimum: choose a new value for $\theta$ to reduce $J(\theta)$. [Figure: surface of $J(\theta_0, \theta_1)$ over $\theta_0$ and $\theta_1$.] Figure by Andrew Ng

12 Basic Search Procedure. Choose initial value for $\theta$. Until we reach a minimum: choose a new value for $\theta$ to reduce $J(\theta)$. [Figure: surface of $J(\theta_0, \theta_1)$ over $\theta_0$ and $\theta_1$.] Since the least squares objective function is convex, we don't need to worry about local minima in linear regression. Figure by Andrew Ng

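13 Gradient Descent. Initialize $\theta$. Repeat until convergence: $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$, with simultaneous update for $j = 0 \ldots d$. Here $\alpha$ is the learning rate (small).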

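14 Gradient Descent. Initialize $\theta$. Repeat until convergence: $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$, with simultaneous update for $j = 0 \ldots d$. For linear regression:
$$
\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta)
&= \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \\
&= \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)^2 \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \times \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) x_j^{(i)}
\end{aligned}
$$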
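
One way to sanity-check the derivative of the least squares cost is to compare the closed form $\frac{1}{n} \sum_i (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$ against a central finite difference of $J(\theta)$. The sketch below assumes rows that already contain the constant feature $x_0 = 1$; all names and data are hypothetical.

```python
def cost(theta, rows, y):
    """J(theta) = 1/(2n) * sum of squared residuals; rows already include x_0 = 1."""
    n = len(rows)
    return sum((sum(t * f for t, f in zip(theta, row)) - y_i) ** 2
               for row, y_i in zip(rows, y)) / (2 * n)


def analytic_gradient(theta, rows, y):
    """Closed-form gradient: (1/n) * sum_i residual_i * x_j for each j."""
    n = len(rows)
    residuals = [sum(t * f for t, f in zip(theta, row)) - y_i
                 for row, y_i in zip(rows, y)]
    return [sum(r * row[j] for r, row in zip(residuals, rows)) / n
            for j in range(len(theta))]


def numeric_gradient(theta, rows, y, h=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grad = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += h
        down[j] -= h
        grad.append((cost(up, rows, y) - cost(down, rows, y)) / (2 * h))
    return grad


rows = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]   # toy inputs, x_0 = 1 in column 0
y = [1.0, 2.0, 4.0]
theta = [0.5, -1.0]
print(analytic_gradient(theta, rows, y))
print(numeric_gradient(theta, rows, y))
```

The two printed vectors should agree to several decimal places; a large mismatch usually points to a sign or indexing slip in the hand-derived gradient.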

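15 Gradient Descent for Linear Regression. Initialize $\theta$. Repeat until convergence: $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$, with simultaneous update for $j = 0 \ldots d$. To achieve simultaneous update: at the start of each GD iteration, compute $h_\theta(x^{(i)})$; use this stored value in the update step loop. Assume convergence when $\| \theta_{new} - \theta_{old} \|_2 < \epsilon$. L2 norm: $\| v \|_2 = \sqrt{\sum_i v_i^2} = \sqrt{v_1^2 + v_2^2 + \ldots + v_{|v|}^2}$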
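
The procedure on this slide could look roughly like the following in plain Python. It is a minimal sketch: the function name, default values for `alpha`, `eps`, and `max_iters`, and the toy data are all assumptions rather than anything prescribed by the slides.

```python
import math


def gradient_descent(X, y, alpha=0.1, eps=1e-9, max_iters=100_000):
    """Batch gradient descent for least squares linear regression."""
    n = len(X)
    rows = [[1.0] + list(x) for x in X]       # prepend x_0 = 1
    theta = [0.0] * len(rows[0])
    for _ in range(max_iters):
        # Compute every residual h_theta(x_i) - y_i once, using the old theta,
        # so that all theta_j are updated simultaneously.
        residuals = [sum(t * f for t, f in zip(theta, row)) - y_i
                     for row, y_i in zip(rows, y)]
        new_theta = [theta[j] - alpha * sum(r * row[j] for r, row in zip(residuals, rows)) / n
                     for j in range(len(theta))]
        # Converged when the L2 norm of the parameter change drops below eps.
        step = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_theta, theta)))
        theta = new_theta
        if step < eps:
            break
    return theta


# Hypothetical data generated from y = 1 + 2x.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]
print(gradient_descent(X, y))   # should be close to [1.0, 2.0]
```

Storing the residuals before touching `theta` is what makes the update simultaneous; updating `theta[0]` first and then reusing it for `theta[1]` would be a different (and unintended) algorithm.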

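16 Gradient Descent. [Figure sequence: left panel shows the current hypothesis $h_\theta(x)$ (for fixed $\theta$, this is a function of $x$); right panel shows $J(\theta)$ (function of the parameters $\theta$).] Slide by Andrew Ng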

25 Choosing α. α too small: slow convergence. α too large: increasing value for $J(\theta)$; may overshoot the minimum; may fail to converge; may even diverge. To see if gradient descent is working, print out $J(\theta)$ each iteration. The value should decrease at each iteration. If it doesn't, adjust α.
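
The effect of the learning rate can be seen by recording $J(\theta)$ at every iteration for a few values of α. The sketch below uses hypothetical toy data and hand-picked α values (0.001, 0.1, 0.6); which values count as "too small" or "too large" depends entirely on the data.

```python
def run_gd(rows, y, alpha, iters):
    """Run batch gradient descent; return J(theta) measured at the start of each iteration."""
    n = len(rows)
    theta = [0.0] * len(rows[0])
    history = []
    for _ in range(iters):
        residuals = [sum(t * f for t, f in zip(theta, row)) - y_i
                     for row, y_i in zip(rows, y)]
        history.append(sum(r * r for r in residuals) / (2 * n))
        theta = [theta[j] - alpha * sum(r * row[j] for r, row in zip(residuals, rows)) / n
                 for j in range(len(theta))]
    return history


# Toy data from y = 1 + 2x, with x_0 = 1 already in column 0 (illustrative only).
rows = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

small = run_gd(rows, y, alpha=0.001, iters=50)   # decreases, but very slowly
good = run_gd(rows, y, alpha=0.1, iters=50)      # decreases steadily
large = run_gd(rows, y, alpha=0.6, iters=20)     # J grows: the steps overshoot

for name, hist in [("small", small), ("good", good), ("large", large)]:
    print(name, hist[0], hist[-1])
```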

26 Extending Linear Regression to More Complex Models. The inputs X for linear regression can be: original quantitative inputs; transformations of quantitative inputs (e.g. log, exp, square root, square, etc.); polynomial transformations (example: $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$); basis expansions; dummy coding of categorical inputs; interactions between variables (example: $x_3 = x_1 \cdot x_2$). This allows use of linear regression techniques to fit non-linear datasets.

27 Linear Basis Function Models. Generally, $h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$, where each $\phi_j$ is a basis function. Typically, $\phi_0(x) = 1$ so that $\theta_0$ acts as a bias. In the simplest case, we use linear basis functions: $\phi_j(x) = x_j$. Based on slide by Christopher Bishop (PRML)

28 Linear Basis Function Models. Polynomial basis functions: $\phi_j(x) = x^j$. These are global; a small change in $x$ affects all basis functions. Gaussian basis functions: $\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)$. These are local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (width). Based on slide by Christopher Bishop (PRML)

29 Linear Basis Function Models. Sigmoidal basis functions: $\phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$. These are also local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (slope). Based on slide by Christopher Bishop (PRML)
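
The Gaussian and sigmoidal basis functions can be sketched as below. The formulas are the standard forms from Bishop's PRML, $\exp(-(x - \mu_j)^2 / (2 s^2))$ and $\sigma((x - \mu_j)/s)$ with $\sigma(a) = 1/(1 + e^{-a})$; they are assumed here to be what the slides intend, and the function names and sample values are hypothetical.

```python
import math


def gaussian_basis(x, mu, s):
    """Gaussian bump centred at mu with width s (assumed PRML form)."""
    return math.exp(-((x - mu) ** 2) / (2 * s ** 2))


def sigmoid_basis(x, mu, s):
    """Logistic sigmoid centred at mu; s sets how steep the slope is."""
    return 1.0 / (1.0 + math.exp(-(x - mu) / s))


def design_row(x, centres, s):
    """phi_0(x) = 1 (bias) followed by one Gaussian feature per centre."""
    return [1.0] + [gaussian_basis(x, mu, s) for mu in centres]


print(gaussian_basis(2.0, 2.0, 0.5))    # at the centre the bump is at its peak
print(gaussian_basis(10.0, 0.0, 1.0))   # far from the centre it is nearly zero
print(sigmoid_basis(2.0, 2.0, 0.5))     # at the centre the sigmoid is halfway
print(design_row(0.0, [-1.0, 0.0, 1.0], 1.0))
```

Once each input is mapped through `design_row`, the model is still linear in $\theta$, so the same least squares machinery applies unchanged.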

30 Example of Fitting a Polynomial Curve with a Linear Model. $y = \theta_0 + \theta_1 x + \theta_2 x^2 + \ldots + \theta_p x^p = \sum_{j=0}^{p} \theta_j x^j$
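
To illustrate, a quadratic curve can be fitted with the ordinary linear regression update once each $x$ is expanded into powers of itself. This is a sketch under assumptions: the helper names, the learning rate, the iteration count, and the noise-free toy data are all made up for the example.

```python
def poly_features(x, p):
    """Map a scalar x to [x^0, x^1, ..., x^p]; x^0 = 1 plays the role of x_0."""
    return [x ** j for j in range(p + 1)]


def fit_gd(rows, y, alpha, iters):
    """Plain batch gradient descent on the least squares cost."""
    n = len(rows)
    theta = [0.0] * len(rows[0])
    for _ in range(iters):
        residuals = [sum(t * f for t, f in zip(theta, row)) - y_i
                     for row, y_i in zip(rows, y)]
        theta = [theta[j] - alpha * sum(r * row[j] for r, row in zip(residuals, rows)) / n
                 for j in range(len(theta))]
    return theta


# Hypothetical noise-free data from y = 1 + 2x^2.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [1.0 + 2.0 * x ** 2 for x in xs]
rows = [poly_features(x, 2) for x in xs]

theta = fit_gd(rows, ys, alpha=0.5, iters=2000)
print(theta)   # should be close to [1.0, 0.0, 2.0]
```

The model is non-linear in $x$ but linear in $\theta$, which is why nothing in `fit_gd` had to change.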

31 Quality of Fit. [Figure: three panels of Price versus Size: underfitting (high bias), correct fit, overfitting (high variance).] Overfitting: the learned hypothesis may fit the training set very well ($J(\theta) \approx 0$), but fails to generalize to new examples. Based on example by Andrew Ng

32 Regularization. A method for automatically controlling the complexity of the learned hypothesis. Idea: penalize large values of $\theta_j$. Can incorporate into the cost function. Works well when we have a lot of features, each of which contributes a bit to predicting the label. Can also address overfitting by eliminating features (either manually or via model selection).

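33 Regularization. Linear regression objective function: $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$. The first term measures model fit to data; the second term is the regularization. $\lambda$ is the regularization parameter ($\lambda \geq 0$). No regularization on $\theta_0$!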
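
A minimal sketch of this objective, assuming the form $\frac{1}{2n} \sum_i (h_\theta(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$; the function name and the numbers are hypothetical and chosen so the arithmetic is easy to follow by hand.

```python
def regularized_cost(theta, X, y, lam):
    """Least squares fit term plus an L2 penalty that skips theta_0."""
    n = len(X)
    fit = 0.0
    for x_i, y_i in zip(X, y):
        h = theta[0] + sum(t * f for t, f in zip(theta[1:], x_i))
        fit += (h - y_i) ** 2
    penalty = sum(t * t for t in theta[1:])    # theta_0 is deliberately left out
    return fit / (2 * n) + (lam / 2) * penalty


# One training point that theta fits exactly, so only the penalty remains.
theta = [5.0, 3.0, 4.0]
print(regularized_cost(theta, [[0.0, 0.0]], [5.0], lam=2.0))   # (2/2) * (9 + 16)
print(regularized_cost(theta, [[0.0, 0.0]], [5.0], lam=0.0))   # no penalty
```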

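34 Understanding Regularization. $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$. Note that $\sum_{j=1}^{d} \theta_j^2 = \| \theta_{1:d} \|_2^2$. This is the magnitude of the feature coefficient vector! We can also think of this as: $\sum_{j=1}^{d} (\theta_j - 0)^2 = \| \theta_{1:d} - \vec{0} \|_2^2$. L2 regularization pulls coefficients toward 0.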

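35 Understanding Regularization. $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$. What happens if we set $\lambda$ to be huge? [Figure: Price versus Size.] Based on example by Andrew Ng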

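36 Regularized Linear Regression. Cost function: $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$. Fit by solving $\min_\theta J(\theta)$. Gradient update: from $\frac{\partial}{\partial \theta_0} J(\theta)$, $\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$; from $\frac{\partial}{\partial \theta_j} J(\theta)$, $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j$, where the final term $\alpha \lambda \theta_j$ comes from the regularization.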

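37 Regularized Linear Regression. $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$. Updates: $\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$ and $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j$. We can rewrite the gradient step as: $\theta_j \leftarrow \theta_j (1 - \alpha \lambda) - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$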
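
The rewritten step, in which each $\theta_j$ with $j \geq 1$ is first shrunk by the factor $(1 - \alpha \lambda)$ and then moved along the usual least squares gradient, might be implemented along these lines. The function name, hyperparameter values, and toy data are assumptions for illustration.

```python
def regularized_gd(X, y, lam, alpha=0.1, iters=5000):
    """Gradient descent for L2-regularized least squares; theta_0 is not penalized."""
    n = len(X)
    rows = [[1.0] + list(x) for x in X]       # prepend x_0 = 1
    theta = [0.0] * len(rows[0])
    for _ in range(iters):
        residuals = [sum(t * f for t, f in zip(theta, row)) - y_i
                     for row, y_i in zip(rows, y)]
        grads = [sum(r * row[j] for r, row in zip(residuals, rows)) / n
                 for j in range(len(theta))]
        new_theta = [theta[0] - alpha * grads[0]]              # no shrinkage on theta_0
        for j in range(1, len(theta)):
            # Shrink by (1 - alpha * lam), then take the usual gradient step.
            new_theta.append(theta[j] * (1 - alpha * lam) - alpha * grads[j])
        theta = new_theta
    return theta


# Hypothetical data from y = 1 + 2x.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]

plain = regularized_gd(X, y, lam=0.0)
shrunk = regularized_gd(X, y, lam=1.0)
print(plain)    # slope close to 2
print(shrunk)   # slope pulled toward 0, intercept adjusts to compensate
```

On this toy data the penalized slope settles near 10/9 instead of 2, which shows the "pulls coefficients toward 0" behaviour; the step is only stable while $\alpha \lambda$ and $\alpha$ stay small enough for the updates not to overshoot.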
