DS 4400 Machine Learning and Data Mining I. Alina Oprea, Associate Professor, CCIS, Northeastern University

1 DS 4400 Machine Learning and Data Mining I Alina Oprea Associate Professor, CCIS Northeastern University January

2 Outline
- Practical issues in Linear Regression
  - Outliers
  - Categorical variables
- Lab: Linear Regression
- Gradient descent
  - Efficient algorithm for optimizing loss function
  - Training Linear Regression with Gradient Descent
  - Comparison with closed-form solution

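3 Simple Linear Regression
- Model: $y = \theta_0 + \theta_1 x$
- Figure labels: residual, data point $(x^{(i)}, y^{(i)})$, intercept $\theta_0$, slope $\theta_1 = \Delta y / \Delta x$
- Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
- Loss: $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$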

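4 Simple Linear Regression
- Dataset: $x^{(i)} \in \mathbb{R}$, $y^{(i)} \in \mathbb{R}$
- Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
- MSE / Loss: $J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2$
- Solution of min loss:
  - $\theta_0 = \bar{y} - \theta_1 \bar{x}$
  - $\theta_1 = \frac{\sum_{i=1}^{n} (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sum_{i=1}^{n} (x^{(i)} - \bar{x})^2}$ (numerator: co-variance of x and y; denominator: variance of x)
  - $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x^{(i)}$, $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y^{(i)}$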
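The closed-form solution on this slide can be sketched in a few lines of plain Python. The function names `fit_simple_lr` and `mse` and the toy data are illustrative assumptions, not part of the lecture.

```python
def fit_simple_lr(xs, ys):
    """Closed-form least-squares fit of y = theta0 + theta1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of theta1 (the common 1/n factor cancels).
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    theta1 = sxy / sxx
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1


def mse(theta0, theta1, xs, ys):
    """Mean squared error J(theta) of the fitted line."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data lying exactly on y = 1 + 2x (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = fit_simple_lr(xs, ys)
loss = mse(theta0, theta1, xs, ys)
```

Because the toy points lie exactly on a line, the fit should recover that line and the loss should be zero up to rounding.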

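5 Multiple Linear Regression
- Dataset: $x^{(i)} \in \mathbb{R}^d$, $y^{(i)} \in \mathbb{R}$
- Hypothesis: $h_\theta(x) = \theta^T x$
- Loss / cost: $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \theta^T x^{(i)} - y^{(i)} \right)^2$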
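As a small sketch of the hypothesis and loss above, the following plain-Python functions evaluate $\theta^T x$ and the MSE. It assumes the common convention that each $x$ starts with a constant 1 so that $\theta_0$ acts as the intercept. The names `predict` and `mse` and the numbers are hypothetical.

```python
def predict(theta, x):
    """Hypothesis h_theta(x) = theta^T x (dot product)."""
    return sum(t * xj for t, xj in zip(theta, x))


def mse(theta, X, y):
    """Mean squared error over the dataset (rows of X paired with y)."""
    return sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / len(y)


# Two made-up examples; the leading 1.0 in each row is the assumed bias feature.
X = [[1.0, 2.0, 0.0],
     [1.0, 0.0, 1.0]]
y = [5.0, 0.0]
theta = [1.0, 2.0, -3.0]

loss = mse(theta, X, y)  # predictions are 5.0 and -2.0, so the errors are 0 and -2
```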

6 Feature standardization/normalization
- Goal is to have individual features on the same scale
- A pre-processing step in most learning algorithms
- Necessary for linear models and Gradient Descent
- Different options:
  - Feature standardization
  - Feature min-max rescaling
  - Mean normalization
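The slide names the three options without giving formulas. The sketch below uses the commonly cited definitions, which are an assumption here: standardization $(x - \mu)/\sigma$, min-max rescaling $(x - \min)/(\max - \min)$, and mean normalization $(x - \mu)/(\max - \min)$. Function names and data are illustrative.

```python
import statistics


def standardize(values):
    """Feature standardization: zero mean, unit (population) standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max_rescale(values):
    """Min-max rescaling: map the feature to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def mean_normalize(values):
    """Mean normalization: center on the mean, divide by the range."""
    mu = statistics.mean(values)
    lo, hi = min(values), max(values)
    return [(v - mu) / (hi - lo) for v in values]


# Made-up feature column (e.g. income in thousands).
incomes = [20.0, 40.0, 60.0, 80.0]
standardized = standardize(incomes)
rescaled = min_max_rescale(incomes)
normalized = mean_normalize(incomes)
```

All three assume the feature is not constant. A constant column would divide by zero and is usually dropped before scaling.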

7 Outliers
- Dashed model is without outlier point
- Linear regression is not resilient to outliers!
- Outliers can be eliminated based on residual value
- Other techniques for outlier detection
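One way to sketch elimination based on residual value is to fit a line, compute the residuals, and drop points whose residual is large compared with the typical residual. The threshold of `k = 2` times the root-mean-square residual is an assumption chosen for illustration, not a rule from the lecture. The data and function names are made up.

```python
def fit_line(xs, ys):
    """Closed-form simple linear regression; returns (theta0, theta1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    theta1 = sxy / sxx
    return y_bar - theta1 * x_bar, theta1


def drop_large_residuals(xs, ys, k=2.0):
    """Keep only points whose |residual| is at most k times the RMS residual."""
    theta0, theta1 = fit_line(xs, ys)
    residuals = [y - (theta0 + theta1 * x) for x, y in zip(xs, ys)]
    rms = (sum(r * r for r in residuals) / len(residuals)) ** 0.5
    kept = [(x, y) for x, y, r in zip(xs, ys, residuals) if abs(r) <= k * rms]
    return [x for x, _ in kept], [y for _, y in kept]


# Points on y = 2x + 1, with one value replaced by an artificial outlier.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
ys[5] = 50.0

slope_with_outlier = fit_line(xs, ys)[1]
clean_xs, clean_ys = drop_large_residuals(xs, ys)
clean_theta0, clean_theta1 = fit_line(clean_xs, clean_ys)
```

Here the single outlier pulls the slope away from 2, and refitting on the filtered data recovers the underlying line.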

8 Categorical variables
- Predict credit card balance from:
  - Age
  - Income
  - Number of cards
  - Credit limit
  - Credit rating
- Categorical variables
  - Student (Yes/No)
  - State (50 different levels)

9 Indicator Variables
- Binary (two-level) variable
  - Add new feature $x_j = 1$ if student and 0 otherwise
- Multi-level variable
  - State: 50 values
  - $x_{MA} = 1$ if State = MA and 0 otherwise
  - $x_{NY} = 1$ if State = NY and 0 otherwise
  - How many indicator variables are needed?
- Disadvantages: data becomes too sparse for a large number of levels
  - Will discuss feature selection later in class
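The indicator encoding described above can be sketched as follows. The helper `indicator_columns` and its `drop_first` flag are hypothetical names. Dropping one baseline level is a common convention when the model also has an intercept, and is shown here as an option rather than as the lecture's answer.

```python
def indicator_columns(values, drop_first=False):
    """Return (levels, rows): one 0/1 column per level of a categorical variable."""
    levels = sorted(set(values))
    if drop_first:
        # Treat the first level as the baseline encoded by all zeros.
        levels = levels[1:]
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Binary variable: a single indicator is enough.
student = ["Yes", "No", "Yes"]
x_student = [1 if s == "Yes" else 0 for s in student]

# Multi-level variable (three made-up states instead of 50).
states = ["MA", "NY", "MA", "CA"]
levels, rows = indicator_columns(states)
base_levels, base_rows = indicator_columns(states, drop_first=True)
```

With many levels most entries in each row are zero, which is the sparsity the slide warns about.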

10 Lab example

11 Simple LR

12 Residual plot
- Estimated responses

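13 Simple LR
- Coefficient is not zero!
- $\mathrm{RSE} = \sqrt{\mathrm{MSE}}$ (up to the degrees-of-freedom correction)
- $R^2$ measures the strength of the linear relationship between X and Y (equal to the squared correlation coefficient for simple LR)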
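To make the quantities on this slide concrete, the sketch below computes MSE, RSE, $R^2$ and the correlation coefficient for a simple LR fit. It assumes one common definition of the residual standard error, $\sqrt{\mathrm{RSS}/(n-2)}$, and uses made-up data. For simple LR, $R^2$ coincides with the square of the correlation coefficient.

```python
import math


def simple_lr_summary(xs, ys):
    """Fit y = theta0 + theta1 * x and return basic goodness-of-fit numbers."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)  # total sum of squares
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    theta1 = sxy / sxx
    theta0 = y_bar - theta1 * x_bar
    rss = sum((y - (theta0 + theta1 * x)) ** 2 for x, y in zip(xs, ys))
    return {
        "mse": rss / n,
        "rse": math.sqrt(rss / (n - 2)),  # assumed definition, n - 2 degrees of freedom
        "r2": 1.0 - rss / syy,
        "corr": sxy / math.sqrt(sxx * syy),
    }


# Made-up, roughly linear data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
summary = simple_lr_summary(xs, ys)
```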

14 Multiple LR

15 What Strategy to Use?

16 Follow the Slope
- Follow the direction of steepest descent!

17 How to optimize J(θ)?

18 How to optimize J(θ)?
- Different starting point

19 Gradient Descent
- Gradient = slope of the line tangent to the curve
- The function decreases fastest in the direction of the negative gradient
- Larger learning rate => larger step

20 Gradient Descent

21 Gradient Descent
- As you approach the minimum, the slope gets smaller, and GD will take smaller steps
- It converges to a local minimum (which is the global minimum for convex functions)!

22 Gradient Descent
- What happens when θ reaches a local minimum?
- The slope is 0, and gradient descent converges!
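The shrinking steps can be seen in a minimal one-dimensional sketch. The loss $J(\theta) = (\theta - 3)^2$, the learning rate of 0.1 and the function names are assumptions picked for illustration.

```python
def grad_J(theta):
    """Derivative of the toy convex loss J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def gradient_descent_1d(grad, theta, learning_rate, n_iters):
    """Plain gradient descent; also records the size of every step taken."""
    step_sizes = []
    for _ in range(n_iters):
        step = learning_rate * grad(theta)
        theta -= step  # move against the gradient
        step_sizes.append(abs(step))
    return theta, step_sizes


theta_final, step_sizes = gradient_descent_1d(
    grad_J, theta=0.0, learning_rate=0.1, n_iters=100
)
```

With a fixed learning rate the step is proportional to the slope, so the recorded step sizes decrease as θ approaches the minimum at 3 and the updates effectively stop once the slope is near 0.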

23 GD Converges to Local Minimum
- Solution: start from multiple random locations
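A sketch of the random-restart idea on a non-convex function follows. The function $f(\theta) = \theta^4 - 3\theta^2 + \theta$, which has two local minima, is an assumption chosen for illustration, as are the learning rate, the interval for the starting points and the number of restarts.

```python
import random


def f(theta):
    """Toy non-convex loss with two local minima."""
    return theta ** 4 - 3.0 * theta ** 2 + theta


def grad_f(theta):
    return 4.0 * theta ** 3 - 6.0 * theta + 1.0


def gradient_descent(theta, learning_rate=0.01, n_iters=2000):
    for _ in range(n_iters):
        theta -= learning_rate * grad_f(theta)
    return theta


rng = random.Random(0)  # fixed seed so the run is repeatable
starts = [rng.uniform(-2.0, 2.0) for _ in range(10)]
candidates = [gradient_descent(s) for s in starts]

# Keep the restart that reached the lowest loss.
best = min(candidates, key=f)
```

Each run ends in whichever local minimum its starting point leads to. Comparing the final loss across restarts picks the best one found, which is not guaranteed to be the global minimum in general.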

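24 GD for Simple Linear Regression
- $J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2$
- $\frac{\partial J(\theta)}{\partial \theta_0} = \frac{2}{n} \sum_{i=1}^{n} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)$
- $\frac{\partial J(\theta)}{\partial \theta_1} = \frac{2}{n} \sum_{i=1}^{n} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right) x^{(i)}$
- Update of each parameter component depends on all training data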
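The two partial derivatives translate directly into a training loop. This is a minimal sketch: the learning rate, iteration count, zero initialization and toy data are assumptions, not values from the lecture.

```python
def gd_simple_lr(xs, ys, learning_rate=0.05, n_iters=5000):
    """Gradient descent for y = theta0 + theta1 * x under the MSE loss."""
    n = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        # Errors use the current parameters, so both updates are simultaneous.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = 2.0 / n * sum(errors)
        grad1 = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1


# Toy data lying exactly on y = 1 + 2x (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = gd_simple_lr(xs, ys)
```

Every iteration sums over all training points for both gradients, which is what "depends on all training data" means in practice.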

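25 GD for Multiple Linear Regression
- Loss: $J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( \theta^T x^{(i)} - y^{(i)} \right)^2$
- Gradient component: $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{2}{n} \sum_{i=1}^{n} \left( \theta^T x^{(i)} - y^{(i)} \right) x_j^{(i)}$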

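26 GD for Linear Regression
- Update, simultaneously for all $j$, with learning rate $\alpha$: $\theta_j \leftarrow \theta_j - \alpha \frac{2}{n} \sum_{i=1}^{n} \left( \theta^T x^{(i)} - y^{(i)} \right) x_j^{(i)}$
- Stop when $\| \theta_{new} - \theta_{old} \| < \epsilon$ or iterations == MAX_ITER
- Can also bound number of iterations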
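A sketch of gradient descent for linear regression with the stopping rule from this slide is given below. It stops when the parameter change falls below a tolerance or when the iteration cap is reached. The values of `epsilon`, `max_iter` and the learning rate, the bias feature $x_0 = 1$ and the toy data are all assumptions.

```python
import math


def gd_linear_regression(X, y, learning_rate=0.1, epsilon=1e-10, max_iter=100_000):
    """Batch gradient descent on the MSE loss; returns (theta, iterations used)."""
    n, d = len(X), len(X[0])
    theta = [0.0] * d
    for iteration in range(1, max_iter + 1):
        errors = [sum(t * xj for t, xj in zip(theta, xi)) - yi for xi, yi in zip(X, y)]
        grad = [2.0 / n * sum(e * xi[j] for e, xi in zip(errors, X)) for j in range(d)]
        new_theta = [t - learning_rate * g for t, g in zip(theta, grad)]
        change = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_theta, theta)))
        theta = new_theta
        if change < epsilon:  # ||theta_new - theta_old|| < epsilon
            break
    return theta, iteration


# Made-up data generated from y = 1 + 2*x1 - 3*x2; first column is the bias feature.
X = [[1.0, 0.0, 0.0],
     [1.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [1.0, 1.0, 1.0],
     [1.0, 2.0, 1.0]]
y = [1.0, 3.0, -2.0, 0.0, 2.0]

theta, n_iterations = gd_linear_regression(X, y)
```

On this noise-free data the result can be compared with the generating coefficients, which is also what the closed-form solution would return.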

27-35 GD Example

36 Choosing learning rate

37 Feature Scaling

38 Issues with Gradient Descent
- Might get stuck in local optimum and not converge to global optimum
  - Restart from multiple initial points
- Only works with differentiable loss functions
- Small or large gradients
  - Feature scaling helps
- Tune learning rate
  - Can use line search for determining optimal learning rate
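The slide does not say which line search is meant. As one possibility, the sketch below uses backtracking line search with the Armijo sufficient-decrease condition: start from a step of 1 and halve it until the loss decreases enough. The badly scaled quadratic and the constants `c` and `shrink` are assumptions for illustration.

```python
def f(theta):
    """Toy loss whose two coordinates have very different curvature."""
    a, b = theta
    return a * a + 10.0 * b * b


def grad_f(theta):
    a, b = theta
    return [2.0 * a, 20.0 * b]


def backtracking_gd(theta, c=0.3, shrink=0.5, max_iter=5000, tol=1e-20):
    """Gradient descent where each step size is chosen by backtracking."""
    for _ in range(max_iter):
        g = grad_f(theta)
        g_norm_sq = sum(gi * gi for gi in g)
        if g_norm_sq < tol:  # gradient is essentially zero: stop
            break
        t = 1.0
        # Shrink t until f decreases by at least c * t * ||g||^2 (Armijo condition).
        while t > 1e-12 and (
            f([x - t * gi for x, gi in zip(theta, g)]) > f(theta) - c * t * g_norm_sq
        ):
            t *= shrink
        theta = [x - t * gi for x, gi in zip(theta, g)]
    return theta


theta_min = backtracking_gd([5.0, 1.0])
```

The step size adapts at every iteration, so no single learning rate has to be tuned by hand, at the cost of extra loss evaluations per step.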

39 Review
- In practice several techniques can help generate more robust models
  - Outlier removal
  - Feature scaling
- Gradient descent is an efficient algorithm for optimization and training LR
  - The most widely used algorithm in ML!
  - Often much faster than the closed-form solution when the number of features is large
- The main issues with Gradient Descent are convergence and getting stuck in local optima

40 Acknowledgements
- Slides made using resources from: Andrew Ng, Eric Eaton, David Sontag
- Thanks!
