DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University


1 DS 4400 Machine Learning and Data Mining I Alina Oprea Associate Professor, CCIS Northeastern University January

2 Logistics
- HW 1 is due on Friday 01/25
- Project proposal: due Feb 21
  - 1-page description of the problem you will solve, the dataset, and the ML algorithms
  - Individual project
  - Project template and potential ideas are on Piazza
- Project milestone: due March 21
  - 2-page description of progress
- Project report at the end of the semester and project presentations in class (10 minutes per project)

3 Outline
- Gradient Descent
  - Comparison with closed-form solution
- Non-linear regression
- Regularization
  - Ridge and Lasso regression
  - Lab example
- Classification
  - K Nearest Neighbors (kNN)
  - Cross-validation

4 Gradient Descent
- Gradient = slope of the line tangent to the curve
- Function decreases fastest in the direction of the negative gradient
- Larger learning rate => larger step

5 GD for Linear Regression
- Stop when $\|\theta_{new} - \theta_{old}\| < \epsilon$ or iterations == MAX_ITER
- Can also bound number of iterations
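A minimal sketch of batch gradient descent for linear regression with the stopping rule above. The loss scaling $J(\theta) = \frac{1}{n} \sum_i (h_\theta(x^{(i)}) - y^{(i)})^2$, the function names, the default `alpha`, `eps`, and `max_iter`, and the toy data are all assumptions of this sketch, not taken from the slides.

```python
import math

def predict(theta, x):
    """h_theta(x) = theta_0 + theta_1 * x_1 + ... + theta_d * x_d."""
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))

def gradient_descent(X, y, alpha=0.05, eps=1e-9, max_iter=100_000):
    """Batch GD on the (assumed) loss J = (1/n) * sum_i (h(x_i) - y_i)^2."""
    n, d = len(X), len(X[0])
    theta = [0.0] * (d + 1)
    for _ in range(max_iter):                     # iterations == MAX_ITER bound
        errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
        grad = [2.0 / n * sum(errors)]            # partial derivative w.r.t. theta_0
        for j in range(d):
            grad.append(2.0 / n * sum(e * xi[j] for e, xi in zip(errors, X)))
        new_theta = [t - alpha * g for t, g in zip(theta, grad)]
        step = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_theta, theta)))
        theta = new_theta
        if step < eps:                            # ||theta_new - theta_old|| < eps
            break
    return theta

# Hypothetical toy data generated from y = 1 + 2x.
X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [1.0 + 2.0 * x[0] for x in X]
theta = gradient_descent(X, y)
```

All coordinates of `theta` are updated simultaneously from the same `errors` vector, so no coordinate sees a partially updated parameter vector.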

6 Gradient Descent vs Closed Form
- Gradient Descent
- Closed form: $O(d^3)$
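For comparison, a sketch of the closed-form solution via the normal equations $A^T A \theta = A^T y$, where $A$ is the data matrix with a leading column of ones. The Gaussian-elimination solver is a stand-in chosen for this sketch so that it needs no external library; its cubic cost in the number of parameters is where the $O(d^3)$ term comes from. Function names and data are hypothetical.

```python
def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    m = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, m):
            factor = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * m
    for r in range(m - 1, -1, -1):                 # back substitution
        s = sum(M[r][c] * z[c] for c in range(r + 1, m))
        z[r] = (M[r][m] - s) / M[r][r]
    return z

def closed_form(X, y):
    """Least-squares theta from the normal equations (assumes A^T A is invertible)."""
    A = [[1.0] + list(row) for row in X]           # prepend intercept column
    k = len(A[0])
    AtA = [[sum(a[i] * a[j] for a in A) for j in range(k)] for i in range(k)]
    Aty = [sum(a[i] * yi for a, yi in zip(A, y)) for i in range(k)]
    return solve_linear_system(AtA, Aty)

# Hypothetical toy data generated from y = 1 + 2x.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = closed_form(X, y)
```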

7 Issues with Gradient Descent
- Might get stuck in a local optimum and not converge to the global optimum
  - Restart from multiple initial points
- Only works with differentiable loss functions
- Small or large gradients
  - Feature scaling helps
- Tune learning rate
  - Can use line search to determine the optimal learning rate
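One common form of feature scaling is standardization: shift each feature to mean 0 and rescale it to standard deviation 1, so that no single feature dominates the gradient. The sketch below assumes every feature has non-zero spread; the function name and data are hypothetical.

```python
import math

def standardize(X):
    """Return (scaled rows, per-feature means, per-feature standard deviations)."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n)
        for j in range(d)
    ]
    # Assumption: no feature is constant, so every std is non-zero.
    scaled = [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in X]
    return scaled, means, stds

# Two features on very different scales (hypothetical values).
X = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled, means, stds = standardize(X)
```

The returned `means` and `stds` should be reused to scale test data, so that training and test points go through the same transformation.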

8 Beyond Linearity
- Most datasets are not perfectly linear
  - Linear regression results in high MSE
- Generalized Additive Models

9 Polynomial Regression
- Polynomial basis function: $h_\theta(x) = \theta_0 + \theta_1 x + \dots + \theta_d x^d$

10 Polynomial Regression
- Typically $d \le 4$, to avoid overfitting
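The polynomial model is still linear in the parameters: each input x is expanded into the features 1, x, ..., x^d, and ordinary linear regression is fit on those. A small sketch of the expansion and the resulting hypothesis, with hypothetical function names and coefficients:

```python
def polynomial_features(x, degree):
    """Map a scalar x to the basis [1, x, x^2, ..., x^degree]."""
    return [x ** k for k in range(degree + 1)]

def h(theta, x):
    """h_theta(x) = theta_0 + theta_1 * x + ... + theta_d * x^d."""
    features = polynomial_features(x, len(theta) - 1)
    return sum(t * f for t, f in zip(theta, features))

theta = [1.0, 0.0, 2.0]        # hypothetical coefficients: 1 + 2 x^2
value = h(theta, 3.0)          # 1 + 2 * 9
```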

11 Other Regression

12 Splines
- Fit polynomial regression on each region (regions are separated by knots)
- Spline: continuous and differentiable function at the region boundaries
- Natural spline: linear function at the boundary

13 Generalization in ML
- Figure: simple model vs. complex model
- Goal is to generalize well on new testing data
- Risk of overfitting to training data
  - MSE close to 0 on training data, but performs poorly on test data

14 Bias-Variance Tradeoff
- Figure: under-fitting vs. over-fitting
- Bias = difference between estimated and true models
- Variance = model difference on different training sets
- MSE = Bias$^2$ + Variance

15 Bias-Variance Decomposition
- Let $\hat{f}$ be the trained model
- Expected MSE at a test point $(x_0, y_0)$: $E[(y_0 - \hat{f}(x_0))^2]$
- Variance: $Var[\hat{f}(x_0)] = E[\hat{f}(x_0)^2] - E^2[\hat{f}(x_0)]$
  - Variance of prediction over training data
- Bias: $Bias[\hat{f}(x_0)] = E[\hat{f}(x_0)] - y_0$
  - Bias of prediction over training data
- Verify that: $MSE(x_0, y_0) = Var[\hat{f}(x_0)] + Bias^2[\hat{f}(x_0)]$
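One way to carry out the verification, assuming $y_0$ is a fixed value (no label noise) and the expectation is taken over training sets: expand the square, then add and subtract $E^2[\hat{f}(x_0)]$.

```latex
\begin{align*}
E[(y_0 - \hat{f}(x_0))^2]
  &= y_0^2 - 2 y_0 E[\hat{f}(x_0)] + E[\hat{f}(x_0)^2] \\
  &= \left( E[\hat{f}(x_0)^2] - E^2[\hat{f}(x_0)] \right)
   + \left( E^2[\hat{f}(x_0)] - 2 y_0 E[\hat{f}(x_0)] + y_0^2 \right) \\
  &= Var[\hat{f}(x_0)] + \left( E[\hat{f}(x_0)] - y_0 \right)^2 \\
  &= Var[\hat{f}(x_0)] + Bias^2[\hat{f}(x_0)]
\end{align*}
```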

16 Regularization
- Reduce model complexity
- Reduce model variance

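17 Ridge regression
- $J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$
- If $\lambda = 0$, we train linear regression
- If $\lambda$ is large, the coefficients will shrink close to 0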

18 Bias-Variance Tradeoff
- Figure: MSE, bias, and variance curves, marking linear regression, the optimal ridge regression, and the direction of reduced model complexity
- Ridge performs better when linear regression has high variance
  - Example: d (dimension) is close to n (training set size)

19 Coefficient shrinkage
- Predict credit card balance

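20-21 GD for Ridge Regression
- $J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$
- $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$
- $\theta_j \leftarrow \theta_j - \alpha \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} - \alpha \lambda \theta_j$
- $\theta_j \leftarrow \theta_j (1 - \alpha \lambda) - \alpha \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$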
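A sketch of the ridge update $\theta_j \leftarrow \theta_j (1 - \alpha \lambda) - \alpha \sum_i (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$. To keep it short, the sketch assumes a model with no intercept and penalizes every coefficient; the function name, step size, iteration count, and data are hypothetical.

```python
def ridge_gd(X, y, lam, alpha=0.01, max_iter=1000):
    """Gradient descent for ridge regression (no intercept, all theta_j penalized)."""
    d = len(X[0])
    theta = [0.0] * d
    for _ in range(max_iter):
        # Prediction errors h_theta(x_i) - y_i under the current theta.
        errors = [
            sum(t * xj for t, xj in zip(theta, xi)) - yi
            for xi, yi in zip(X, y)
        ]
        # Shrink each coefficient by (1 - alpha*lam), then take the usual GD step.
        theta = [
            theta[j] * (1.0 - alpha * lam)
            - alpha * sum(e * xi[j] for e, xi in zip(errors, X))
            for j in range(d)
        ]
    return theta

# Hypothetical one-feature data with y = 2x, so sum(x^2) = 14 and sum(x*y) = 28.
X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 6.0]
theta_unreg = ridge_gd(X, y, lam=0.0)    # plain linear regression
theta_ridge = ridge_gd(X, y, lam=14.0)   # coefficient shrinks toward 0
```

For this one-feature case the fixed point of the update is $\theta = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$, which gives 2 for $\lambda = 0$ and 1 for $\lambda = 14$: the penalty halves the coefficient.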

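22 Lasso regression
- $J(\theta) = \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{d} |\theta_j|$
  - First term: squared residuals; second term: regularization
- L1 norm for regularization
- Cannot compute gradients everywhere ($|\theta_j|$ is not differentiable at 0)
- Algorithms based on quadratic programming or other optimization techniques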
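The slides do not name a specific solver. As one example of "other optimization techniques", the sketch below uses coordinate descent with soft-thresholding, a technique chosen for this illustration rather than taken from the slides. It minimizes $\sum_i (\theta \cdot x^{(i)} - y^{(i)})^2 + \lambda \sum_j |\theta_j|$ with no intercept and assumes no feature column is all zeros; names and data are hypothetical.

```python
def soft_threshold(rho, t):
    """Shrink rho toward 0 by t, returning exactly 0 when |rho| <= t."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for the lasso objective (no intercept)."""
    d = len(X[0])
    theta = [0.0] * d
    for _ in range(n_sweeps):
        for j in range(d):
            # Residuals with feature j's own contribution left out.
            partial = [
                yi - sum(theta[k] * xi[k] for k in range(d) if k != j)
                for xi, yi in zip(X, y)
            ]
            rho = sum(xi[j] * r for xi, r in zip(X, partial))
            z = sum(xi[j] ** 2 for xi in X)
            # Exact minimizer of the one-dimensional problem in theta_j.
            theta[j] = soft_threshold(rho, lam / 2.0) / z
    return theta

# Hypothetical data: feature 0 is strongly predictive, feature 1 only weakly.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.2, 3.0, 0.2]
theta = lasso_coordinate_descent(X, y, lam=1.0)
```

Here the weak coefficient is set exactly to 0 while the strong one is only shrunk (from 3 to 2.75), which is the sparse behavior that distinguishes Lasso from Ridge.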

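23 Alternative Formulations
- Ridge (L2 regularization): $\min_\theta \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)})^2$ subject to $\sum_{j=1}^{d} \theta_j^2 \le \epsilon$
- Lasso (L1 regularization): $\min_\theta \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)})^2$ subject to $\sum_{j=1}^{d} |\theta_j| \le \epsilon$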

24 Lasso vs Ridge
- Ridge shrinks all coefficients
- Lasso sets some coefficients to 0 (sparse solution)
  - Performs feature selection
- Figure: Lasso and Ridge panels in the $(\theta_0, \theta_1)$ plane, each with $\hat{\theta}$ and the optimum marked

25 Lasso vs Ridge

26 Lab example

27 Ridge regression
- Data processing (omit N/A)
- Fit ridge regression
- Coefficient values
- Coefficient norm

28 Ridge regression
- Fit ridge regression
- Coefficient values
- Coefficient norm
- $\lambda$ controls parameter size

29 Lasso regression
- Fit Lasso regression
- 13 coefficients set to zero
- Coefficient norm

30 Outline
- Gradient Descent
  - Comparison with closed-form solution
- Non-linear regression
- Regularization
  - Ridge and Lasso regression
  - Lab example
- Classification
  - K Nearest Neighbors (kNN)
  - Cross-validation

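31 Supervised learning
- Training data: $\{x^{(i)}, y^{(i)}\}$, for $i = 1, \dots, n$
- Learn $\hat{f}$ such that $\hat{f}(x^{(i)}) \approx y^{(i)}$

32 Binary or discrete
- $x^{(1)}, \dots, x^{(n)}$ and $y^{(1)}, \dots, y^{(n)}$, with $x^{(i)} \in \mathbb{R}^d$, $y^{(i)} \in \{-1, 1\}$
- $f(x^{(i)}) = y^{(i)}$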


33 Example 1: Classifying spam email
- Content-related features
  - Use of certain words
  - Word frequencies
  - Language
  - Sentence
- Structural features
  - Sender IP address
  - IP blacklist
  - DNS information
  - Email server
  - URL links (non-matching)
- Binary classification: SPAM or HAM

34 Example 2: Handwritten Digit Recognition
- Multi-class classification

35 Example 3: Image classification
- Multi-class classification

36 Supervised Learning Process
- Training: Data (labeled, typically) -> Pre-processing (normalization, standardization) -> Feature extraction (feature selection) -> Learning model (classification, regression)
- Testing: New data (unlabeled) -> Learning model -> Predictions (classification: malicious or benign; regression: risk score)


38 K-Nearest-Neighbours for multi-class classification
- Vote among multiple classes

39 Vector distances
- Vector norms: a norm of a vector x is informally a measure of the length of the vector
- Common norms: $L_1$, $L_2$ (Euclidean)
- A norm can be used as the distance between vectors x and y: $\|x - y\|_p$

40 Distance norms
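The distance $\|x - y\|_p$ can be computed directly from its definition, $(\sum_j |x_j - y_j|^p)^{1/p}$. A short sketch, with a hypothetical function name and example vectors:

```python
def lp_distance(x, y, p=2):
    """L_p distance between two equal-length vectors: (sum_j |x_j - y_j|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

origin, point = [0.0, 0.0], [3.0, 4.0]
manhattan = lp_distance(origin, point, p=1)   # L1: |3| + |4|
euclidean = lp_distance(origin, point, p=2)   # L2: sqrt(9 + 16)
```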

41 kNN
- Algorithm (to classify point x)
  - Find the k nearest points to x (according to a distance metric)
  - Perform majority voting to predict the class of x
- Properties
  - Does not learn any model in training!
  - Instance learner (needs all data at testing time)
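The two steps of the algorithm (find the k nearest points, then take a majority vote) can be sketched as follows. Euclidean distance is used as the default metric, and ties in the vote go to whichever tied label was seen first among the nearest neighbors; both are choices made for this sketch. The names and the toy data are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_X, train_y, x, k=3, distance=euclidean):
    """Classify x by majority vote among its k nearest training points."""
    # There is no training step: the whole training set is scanned at query time.
    by_distance = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], x))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-cluster data.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["ham", "ham", "ham", "spam", "spam", "spam"]
near_origin = knn_predict(train_X, train_y, [0.5, 0.5], k=3)
far_corner = knn_predict(train_X, train_y, [5.5, 5.5], k=3)
```

Because the vote is over arbitrary labels, the same function handles multi-class problems without modification.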

42 Overfitting! How to choose k (hyper-parameter)?

43 How to choose k (hyper-parameter)?

44 How to choose k (hyper-parameter)?
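The outline lists cross-validation next to kNN, so one plausible way to choose k is to compare held-out accuracy across candidate values. The sketch below is an illustration under that assumption, not the procedure shown on the slides: it uses interleaved folds, Euclidean distance, and hypothetical function names and data.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    def dist(a):
        return math.sqrt(sum((ai - xi) ** 2 for ai, xi in zip(a, x)))
    by_distance = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

def cross_val_accuracy(X, y, k, n_folds=5):
    """Fraction of points classified correctly when each fold is held out in turn."""
    n = len(X)
    correct = 0
    for fold in range(n_folds):
        held_out = set(range(fold, n, n_folds))    # interleaved fold membership
        train_X = [X[i] for i in range(n) if i not in held_out]
        train_y = [y[i] for i in range(n) if i not in held_out]
        for i in held_out:
            correct += knn_predict(train_X, train_y, X[i], k) == y[i]
    return correct / n

def choose_k(X, y, candidates, n_folds=5):
    """Return the candidate k with the highest cross-validated accuracy."""
    return max(candidates, key=lambda k: cross_val_accuracy(X, y, k, n_folds))

# Hypothetical, well-separated one-dimensional data.
X = [[i * 0.1] for i in range(10)] + [[5 + i * 0.1] for i in range(10)]
y = [0] * 10 + [1] * 10
accuracy_k3 = cross_val_accuracy(X, y, k=3)
best_k = choose_k(X, y, candidates=[1, 3, 5])
```

On real data the points should be shuffled before the folds are formed, since interleaved folds on sorted data can give each fold an unrepresentative class mix.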

45 Review
- Gradient descent is an efficient algorithm for optimization and for training linear regression
  - The most widely used algorithm in ML!
- More complex regression models exist
  - Polynomial, spline regression
- Regularization is a general method to reduce model complexity and avoid overfitting
  - Add penalty to loss function
  - Ridge and Lasso regression

46 Acknowledgements
- Slides made using resources from: Andrew Ng, Eric Eaton, David Sontag
- Thanks!
