DS 4400 Machine Learning and Data Mining I. Alina Oprea, Associate Professor, CCIS, Northeastern University


1 DS 4400 Machine Learning and Data Mining I
Alina Oprea, Associate Professor, CCIS, Northeastern University
September

2 Review
- Solution for multiple linear regression can be computed in closed form
  - Matrix inversion is computationally intensive
- In practice, several techniques can help generate more robust models
  - Outlier removal
  - Feature scaling
- Gradient descent is an efficient algorithm for optimization and for training LR
  - The most widely used algorithm in ML!

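3 Multiple Linear Regression
- Dataset: $x^{(i)} \in \mathbb{R}^d$, $y^{(i)} \in \mathbb{R}$
- Hypothesis: $h_\theta(x) = \theta^T x$
- Loss / cost: $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(\theta^T x^{(i)} - y^{(i)}\right)^2$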
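To make the hypothesis and the MSE loss concrete, here is a minimal sketch in plain Python. The toy data and the helper names `predict` and `mse` are made up for illustration; the first feature is a constant 1 so that the first parameter plays the role of an intercept.

```python
def predict(theta, x):
    """Hypothesis h_theta(x) = theta^T x for a single example x."""
    return sum(t * xj for t, xj in zip(theta, x))


def mse(theta, X, y):
    """Mean squared error: (1/n) * sum_i (theta^T x_i - y_i)^2."""
    n = len(X)
    return sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / n


# Toy data (illustrative only): y = 1 + 2 * x, with a constant-1 feature.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]

perfect = mse([1.0, 2.0], X, y)  # the true parameters give zero loss
worse = mse([0.0, 2.0], X, y)    # every prediction is off by exactly 1
```

With these numbers `perfect` is 0 and `worse` is 1, since each residual in the second case equals -1.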

4 Outline
- Gradient Descent
  - Derivation for simple and multiple Linear Regression
  - Issues with Gradient Descent
  - Comparison with closed-form solution
- Regularization
  - Ridge and Lasso regression
  - Lab example

5 How to optimize J(θ)?
- Move in the direction of steepest descent

6 Gradient Descent
- Gradient = slope of the line tangent to the curve at the current point

7 Gradient Descent
- What happens when θ reaches a local minimum?
- The slope is 0, and gradient descent converges!

8 Gradient Descent
- As you approach the minimum, the slope gets smaller, and GD takes smaller steps
- It converges to a local minimum (which is the global minimum for convex functions)!
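The shrinking-step behaviour can be checked on a one-dimensional convex function. This is a minimal sketch, assuming the illustrative cost J(θ) = (θ - 3)^2, a start at θ = 0, and a fixed learning rate of 0.1; none of these values come from the slides.

```python
def grad(theta):
    """Derivative of the illustrative cost J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


theta, alpha = 0.0, 0.1
steps = []  # absolute size of each update
for _ in range(50):
    step = alpha * grad(theta)
    theta -= step
    steps.append(abs(step))

# The learning rate never changes, yet the steps shrink because the slope does.
```

Each step is 0.8 times the previous one here, so the iterate approaches the minimizer θ = 3 without any change to the learning rate.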

9 GD Converges to Local Minimum
- Solution: start from multiple random locations
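A small sketch of the restart idea, assuming an illustrative non-convex function f(θ) = θ^4 - 3θ^2 + θ (chosen here, not taken from the slides). It has a local minimum near θ ≈ 1.13 and a lower, global minimum near θ ≈ -1.30. The seed, the number of restarts, and the sampling interval are arbitrary choices.

```python
import random


def f(theta):
    """Illustrative non-convex cost with two minima."""
    return theta ** 4 - 3.0 * theta ** 2 + theta


def f_prime(theta):
    return 4.0 * theta ** 3 - 6.0 * theta + 1.0


def descend(theta, alpha=0.01, iters=2000):
    """Plain gradient descent from a given starting point."""
    for _ in range(iters):
        theta -= alpha * f_prime(theta)
    return theta


# A single run started at 1.0 slides into the nearby local minimum.
single_run = descend(1.0)

# Restart from several random locations and keep the best result.
random.seed(0)
starts = [random.uniform(-2.0, 2.0) for _ in range(20)]
candidates = [descend(s) for s in starts]
best_theta = min(candidates, key=f)
```

Keeping the candidate with the lowest cost recovers the deeper minimum as long as at least one start falls in its basin, which is why more restarts help on harder problems.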

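10 GD for Simple Linear Regression
$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\right)^2$
$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{2}{n} \sum_{i=1}^{n} \left(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\right)$
$\frac{\partial J(\theta)}{\partial \theta_1} = \frac{2}{n} \sum_{i=1}^{n} \left(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\right) x^{(i)}$
- Update of each parameter component depends on all training data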
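The two partial derivatives translate directly into code. The following is a minimal sketch, assuming a fixed learning rate `alpha`, a fixed iteration count, and noise-free toy data; those choices are illustrative and not taken from the slides.

```python
def gradient(theta0, theta1, xs, ys):
    """Partial derivatives of J = (1/n) * sum (theta0 + theta1*x - y)^2."""
    n = len(xs)
    residuals = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    d_theta0 = 2.0 / n * sum(residuals)
    d_theta1 = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
    return d_theta0, d_theta1


def fit(xs, ys, alpha=0.05, iters=5000):
    """Batch gradient descent; both parameters are updated simultaneously."""
    theta0 = theta1 = 0.0
    for _ in range(iters):
        d0, d1 = gradient(theta0, theta1, xs, ys)
        theta0, theta1 = theta0 - alpha * d0, theta1 - alpha * d1
    return theta0, theta1


# Toy data generated from y = 1 + 2x (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = fit(xs, ys)
```

Note that every call to `gradient` loops over the whole training set, which is what "depends on all training data" means in practice.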

11 GD for Multiple Linear Regression
[Equations not recovered]
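For reference, the gradient for the multiple-regression case follows from the MSE definition on slide 3 in the same way as the simple case. What follows is a reconstruction under that definition, not a transcription of the slide; the learning rate $\alpha$ and the design matrix $X$ (whose rows are the $x^{(i)T}$) are notation assumed here.

```latex
\begin{aligned}
J(\theta) &= \frac{1}{n}\sum_{i=1}^{n}\left(\theta^T x^{(i)} - y^{(i)}\right)^2 \\
\frac{\partial J(\theta)}{\partial \theta_j} &= \frac{2}{n}\sum_{i=1}^{n}\left(\theta^T x^{(i)} - y^{(i)}\right) x_j^{(i)} \\
\theta_j &\leftarrow \theta_j - \alpha\,\frac{2}{n}\sum_{i=1}^{n}\left(\theta^T x^{(i)} - y^{(i)}\right) x_j^{(i)}
  \quad \text{(simultaneously for all } j\text{)} \\
\nabla_\theta J(\theta) &= \frac{2}{n}\, X^T (X\theta - y)
\end{aligned}
```

Setting $d = 1$ with a constant feature recovers the two partial derivatives of the simple linear regression case.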

12 GD for Linear Regression
- Can also bound number of iterations
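One way to combine a convergence check with a bound on the number of iterations is sketched below. The function name `gd_linear_regression`, the gradient-based stopping rule, the tolerance, and the toy data are all assumptions made for illustration; the slide's own algorithm was not recovered.

```python
def gd_linear_regression(X, y, alpha=0.05, max_iters=10000, tol=1e-10):
    """Batch GD on the MSE; stops when the gradient is tiny or the cap is hit."""
    n, d = len(X), len(X[0])
    theta = [0.0] * d
    for it in range(1, max_iters + 1):
        residuals = [sum(t * xj for t, xj in zip(theta, xi)) - yi
                     for xi, yi in zip(X, y)]
        grad = [2.0 / n * sum(r * xi[j] for r, xi in zip(residuals, X))
                for j in range(d)]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
        if max(abs(g) for g in grad) < tol:
            break  # converged before reaching the iteration bound
    return theta, it


# Toy data: y = 1 + 2*x1 - x2, with a leading constant-1 feature.
X = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 2.0, 1.0],
     [1.0, 3.0, 3.0], [1.0, 4.0, 2.0]]
y = [0.0, 3.0, 4.0, 4.0, 7.0]
theta, iters_used = gd_linear_regression(X, y)
```

The cap guarantees termination even when the learning rate is poorly chosen, while the tolerance avoids wasting iterations once the gradient is effectively zero.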

13-21 GD Example
[Figures not recovered]

22 Choosing learning rate
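The slide's figure is not available, so here is a small numerical sketch of why the learning rate matters. It assumes the illustrative cost J(θ) = θ^2 (gradient 2θ) and three hand-picked rates; each update multiplies θ by (1 - 2α).

```python
def run_gd(alpha, theta=1.0, iters=20):
    """Gradient descent on J(theta) = theta^2, whose gradient is 2 * theta."""
    for _ in range(iters):
        theta -= alpha * 2.0 * theta
    return theta


too_small = run_gd(0.01)   # factor 0.98 per step: converges, but slowly
well_chosen = run_gd(0.4)  # factor 0.2 per step: converges quickly
too_large = run_gd(1.1)    # factor -1.2 per step: overshoots and diverges
```

After 20 steps the small rate has barely moved from the start, the moderate rate is essentially at the minimum, and the large rate has grown in magnitude with an alternating sign.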

23 Feature Scaling
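The slide body was not recovered. One common form of feature scaling is standardization (subtract the column mean, divide by the column standard deviation), sketched below as an assumption about what is meant. The helper `standardize` and the two-feature data are hypothetical, and the sketch assumes no column is constant.

```python
def standardize(X):
    """Rescale each column to mean 0 and standard deviation 1."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    stds = [(sum((row[j] - means[j]) ** 2 for row in X) / n) ** 0.5
            for j in range(d)]
    scaled = [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in X]
    return scaled, means, stds


# Hypothetical features on very different scales.
X = [[1000.0, 1.0], [2000.0, 2.0], [3000.0, 3.0]]
scaled, means, stds = standardize(X)
```

The returned `means` and `stds` should be reused to scale test data, so that training and test features go through the same transformation.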

24 Gradient Descent vs Closed Form
[Table with columns "Gradient Descent" and "Closed form"; entries not recovered]
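Since the table entries are missing, the sketch below only checks numerically that the two approaches agree on a small problem. It assumes simple linear regression, where the closed form reduces to the usual covariance-over-variance expression; the data, learning rate, and iteration count are illustrative.

```python
def closed_form(xs, ys):
    """Least-squares solution for y = theta0 + theta1 * x, no iteration needed."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    theta1 = sxy / sxx
    return y_bar - theta1 * x_bar, theta1


def gradient_descent(xs, ys, alpha=0.05, iters=5000):
    """Iterative solution; needs a learning rate and many passes over the data."""
    n = len(xs)
    theta0 = theta1 = 0.0
    for _ in range(iters):
        residuals = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        d0 = 2.0 / n * sum(residuals)
        d1 = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        theta0, theta1 = theta0 - alpha * d0, theta1 - alpha * d1
    return theta0, theta1


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]  # roughly y = 1 + 2x with small noise
exact = closed_form(xs, ys)
approx = gradient_descent(xs, ys)
```

Both routes reach the same parameters here; the trade-off is that the closed form needs a matrix inversion in the general case, while gradient descent needs a tuned learning rate and multiple iterations.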

25 Issues with Gradient Descent
- Might get stuck in local optimum and not converge to global optimum
  - Restart from multiple initial points
- Only works with differentiable loss functions
- Small or large gradients
  - Feature scaling helps
- Tune learning rate
  - Can use line search for determining optimal learning rate
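The slide does not say which line search is meant. A simple and widely used variant is backtracking line search with a sufficient-decrease (Armijo) test, sketched below as one possible choice. The quadratic cost, the constants `alpha0`, `shrink`, and `c`, and the function names are assumptions made for illustration.

```python
def f(theta):
    """Illustrative cost with different curvature along each coordinate."""
    return (theta[0] - 1.0) ** 2 + 10.0 * (theta[1] + 2.0) ** 2


def grad_f(theta):
    return [2.0 * (theta[0] - 1.0), 20.0 * (theta[1] + 2.0)]


def backtracking_step(theta, alpha0=1.0, shrink=0.5, c=0.5, min_alpha=1e-12):
    """Shrink the step size until the cost decreases by at least c*alpha*|g|^2."""
    g = grad_f(theta)
    g_norm_sq = sum(gi * gi for gi in g)
    alpha = alpha0
    while alpha > min_alpha:
        candidate = [t - alpha * gi for t, gi in zip(theta, g)]
        if f(candidate) <= f(theta) - c * alpha * g_norm_sq:
            return candidate, alpha
        alpha *= shrink
    return theta, 0.0  # no acceptable step found, e.g. already at the minimum


theta = [0.0, 0.0]
alphas = []  # step size accepted at each iteration
for _ in range(200):
    theta, alpha = backtracking_step(theta)
    alphas.append(alpha)
```

The step size is chosen afresh at every iteration, so no single learning rate has to be tuned by hand; the cost is a few extra function evaluations per step.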

26 Outline
- Gradient Descent
  - Derivation for simple and multiple Linear Regression
  - Issues with Gradient Descent
  - Comparison with closed-form solution
- Regularization
  - Ridge and Lasso regression
  - Lab example

27 Generalization in ML
[Figure panels: Simple model, Complex model]
- Goal is to generalize well on new testing data
- Risk of overfitting to training data
  - MSE on training data close to 0, but performs poorly on test data

28 Bias-Variance Tradeoff
[Figure panels: Under-fitting, Over-fitting]
- Bias = difference between estimated and true models
- Variance = model difference on different training sets
- Expected MSE decomposes into Bias² + Variance (plus irreducible noise)

29 Regularization
- Reduce model complexity
- Reduce model variance

30 Ridge regression
[Objective equation not recovered]
- If λ = 0, we train linear regression
- If λ is large, the coefficients will shrink close to 0
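Both statements about λ can be checked on the smallest possible case. The sketch below assumes a single feature with no intercept and the objective J(θ) = ½ Σ (θx - y)² + (λ/2) θ², which is an assumed form consistent with the ridge update rule on slide 33 rather than a recovered equation. Setting the derivative to zero gives θ = Σxy / (Σx² + λ).

```python
def ridge_1d(xs, ys, lam):
    """Closed-form ridge solution for one feature and no intercept.

    Minimizes 0.5 * sum (theta*x - y)^2 + 0.5 * lam * theta^2 (assumed form).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Toy data from y = 2x (illustrative only).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

no_penalty = ridge_1d(xs, ys, 0.0)      # ordinary least squares
some_penalty = ridge_1d(xs, ys, 14.0)   # coefficient is halved
huge_penalty = ridge_1d(xs, ys, 1e6)    # coefficient is close to 0
```

With λ = 0 the least-squares slope is recovered, and as λ grows the denominator dominates and the coefficient shrinks toward 0 without ever reaching it exactly.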

31 Bias-Variance Tradeoff
[Figure: MSE, bias, and variance curves, with linear regression, the optimal ridge regression, and the direction of reduced model complexity marked]
- Ridge performs better when linear regression has high variance
- Example: d (dimension) is close to n (training set size)

32 Coefficient shrinkage
- Predict credit card balance

33-34 GD for Ridge Regression
$\theta_j \leftarrow \theta_j (1 - \alpha\lambda) - \alpha \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
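A minimal sketch of gradient descent with the shrink-then-step ridge update, θ_j ← θ_j (1 - αλ) - α Σ (h_θ(x) - y) x_j. It assumes that the data term is summed over all examples, that every coefficient is penalized (no unpenalized intercept), and that the toy data, learning rate, and iteration count are acceptable stand-ins; none of these details are confirmed by the slides.

```python
def ridge_gd(X, y, lam, alpha=0.01, iters=5000):
    """Gradient descent with the update theta_j*(1 - alpha*lam) - alpha*grad_j."""
    d = len(X[0])
    theta = [0.0] * d
    for _ in range(iters):
        residuals = [sum(t * xj for t, xj in zip(theta, xi)) - yi
                     for xi, yi in zip(X, y)]
        data_grad = [sum(r * xi[j] for r, xi in zip(residuals, X))
                     for j in range(d)]
        theta = [t * (1.0 - alpha * lam) - alpha * g
                 for t, g in zip(theta, data_grad)]
    return theta


def norm(theta):
    return sum(t * t for t in theta) ** 0.5


# Toy data from y = x1 + 2*x2 (illustrative only).
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 5.0]]
y = [5.0, 4.0, 9.0, 14.0]

theta_ols = ridge_gd(X, y, lam=0.0)     # no penalty: plain least squares
theta_ridge = ridge_gd(X, y, lam=10.0)  # penalized: smaller coefficient norm
```

The factor (1 - αλ) shrinks each coefficient a little on every iteration before the usual data-driven step is applied, which is where the smaller norm comes from.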

35 Lasso regression
$J(\theta) = \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{d} |\theta_j|$
(first term: squared residuals; second term: regularization)
- L1 norm for regularization
- No closed-form solution
- Algorithms based on quadratic programming or other optimization techniques
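As an example of the "other optimization techniques", the sketch below uses cyclic coordinate descent with soft-thresholding, a standard approach for lasso that is swapped in here and not necessarily the method used in the course. For the objective Σ (h_θ(x) - y)² + λ Σ |θ_j|, minimizing over one coordinate gives θ_j = S(ρ_j, λ/2) / z_j, where ρ_j is the correlation of feature j with the residual that excludes it and z_j = Σ x_j². The data and the sweep count are illustrative.

```python
def soft_threshold(rho, t):
    """Shrink rho toward 0 by t; values within [-t, t] become exactly 0."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def lasso_cd(X, y, lam, sweeps=200):
    """Cyclic coordinate descent for sum of squared residuals + lam * L1 norm."""
    d = len(X[0])
    theta = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            rho = 0.0
            for xi, yi in zip(X, y):
                partial = sum(theta[k] * xi[k] for k in range(d) if k != j)
                rho += xi[j] * (yi - partial)
            z = sum(xi[j] ** 2 for xi in X)
            theta[j] = soft_threshold(rho, lam / 2.0) / z
    return theta


# Toy data: y is roughly 2 * x1; the second feature is nearly irrelevant.
X = [[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0]]
y = [2.1, 3.9, 6.2, 7.8]

theta_ls = lasso_cd(X, y, lam=0.0)     # no penalty: both coefficients nonzero
theta_lasso = lasso_cd(X, y, lam=2.0)  # penalty zeroes the weak coefficient
```

The thresholding step is what produces exact zeros: the weak second coefficient is set to 0 outright, while the strong first coefficient is only shrunk slightly.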

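36 Alternative Formulations
- Ridge (L2 regularization): $\min_\theta \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$ subject to $\sum_{j=1}^{d} \theta_j^2 \le \varepsilon$
- Lasso (L1 regularization): $\min_\theta \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$ subject to $\sum_{j=1}^{d} |\theta_j| \le \varepsilon$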

37 Lasso vs Ridge
- Ridge shrinks all coefficients
- Lasso sets some coefficients to 0 (sparse solution)
  - Performs feature selection
[Figure: two panels, Lasso and Ridge, each with axes θ_0 and θ_1, showing the unconstrained estimate θ̂ and the constrained optimum]

38 Lasso vs Ridge

39 Lab example

40 Ridge regression
- Data processing (omit N/A)
- Fit ridge regression
- Coefficient values
- Coefficient norm

41 Ridge regression
- Fit ridge regression
- Coefficient values
- Coefficient norm
- λ controls parameter size

42 Lasso regression
- Fit Lasso regression
- 13 coefficients set to zero
- Coefficient norm

43 Acknowledgements
Slides made using resources from:
- Andrew Ng
- Eric Eaton
- David Sontag
Thanks!
