Machine Learning. Chao Lan



1 Machine Learning Chao Lan

2 Machine Learning Prediction Models Regression Model - linear regression (least square, ridge regression, Lasso) Classification Model - naive Bayes, logistic regression, Gaussian discriminant analysis, k-nearest neighbor, linear support vector machine (LSVM), decision tree, neural network, (Bayesian network) - multi-class classification with binary classifier Advanced Model - kernel, ensemble Online Model - SGD learner, HMM


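3 Machine Learning Prediction Models An example X (e.g., an email) is represented by a feature vector with one entry per word (google, lottery, cat, transport, ..., book): x = (x1, x2, x3, x4, x5, ..., xp) = (1, 1, 0, 1, 0, ..., 0). The model maps X to a label Y: spam or ham.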
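The word-presence encoding on this slide can be sketched in a few lines of Python. The five-word vocabulary, the helper name `to_feature_vector`, and the sample email are assumptions made for illustration; the slide does not show the full vocabulary.

```python
# Hypothetical vocabulary built from the words visible on the slide.
VOCABULARY = ["google", "lottery", "cat", "transport", "book"]


def to_feature_vector(text, vocabulary=VOCABULARY):
    """Return a binary vector: 1 if the word occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]


# Made-up email text, chosen to reproduce the pattern (1, 1, 0, 1, 0).
email = "Google lottery winner claim your transport voucher"
x = to_feature_vector(email)
print(x)  # [1, 1, 0, 1, 0]
```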

4 Machine Learning Prediction Models Regression Model - linear regression (least square, ridge regression, Lasso) Classification Model - naive Bayes, logistic regression, Gaussian discriminant analysis, Bayesian network, k-nearest Neighbor, linear support vector machine (LSVM), decision tree, neural network Advanced Model - kernel, ensemble Online Model - SGD learner, HMM

5 Linear Regression Model The linear regression model has the following form: $y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_p x_p$ - the w's are regression coefficients - w0 is the bias term - y is the label (e.g., how likely x is spam) Feature vector: x = (x1, x2, x3, x4, x5, ..., xp) = (1, 1, 0, 1, 0, ..., 0), with one entry per word (google, lottery, cat, transport, ..., book).

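6 Geometric Interpretation of Linear Regression The linear regression model $y = w_0 + w_1 x_1 + \dots + w_p x_p$ is a line when there is one feature (y against x1) and a plane when there are two (y against x1 and x2), e.g., x1 = google = 1 and x2 = lottery = 0. - the w's are regression coefficients - w0 is the bias term - y is the label (e.g., how likely x is spam) [Figures: y plotted against x1; y plotted against x1 and x2.]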

7 Geometric Interpretation of the Bias Term w0 is the bias term. It allows the model to shift along the y axis (to better fit the data). [Figures: two plots of y against xi.]

8 Three Learners of the Linear Regression Model Three ways to learn w in $y = w_0 + \sum_{j=1}^{p} w_j x_j$ from data: 1. least squares 2. ridge regression 3. Lasso.

9 Three Learners of the Linear Regression Model Three ways to learn w in $y = w_0 + \sum_{j=1}^{p} w_j x_j$ from data: 1. least squares: find w that minimizes the total squared error $\sum_i \big(y_i - w_0 - \sum_j w_j x_{ij}\big)^2$ 2. ridge regression 3. Lasso.

10 Three Learners of the Linear Regression Model Three ways to learn w in $y = w_0 + \sum_{j=1}^{p} w_j x_j$ from data: 1. least squares: find w that minimizes the total squared error 2. ridge regression: find w in a restricted range that minimizes the total squared error; equivalently, minimize the total squared error plus the penalty $\lambda \sum_j w_j^2$ 3. Lasso. How does λ affect shrinkage?

11 Three Learners of the Linear Regression Model Three ways to learn w in $y = w_0 + \sum_{j=1}^{p} w_j x_j$ from data: 1. least squares: find w that minimizes the total squared error 2. ridge regression: find w in a restricted range that minimizes the total squared error 3. Lasso: find w with some elements eliminated that minimizes the total squared error; equivalently, minimize the total squared error plus the penalty $\lambda \sum_j |w_j|$. How does λ affect elimination?
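One way to explore how λ affects shrinkage and elimination is the single-feature case without a bias term, where all three learners have closed-form solutions. The sketch below assumes the penalized objectives "total squared error + λ w²" (ridge) and "total squared error + λ |w|" (Lasso); the data and function names are made up for illustration.

```python
import math


def least_squares_1d(xs, ys):
    """Minimize sum (y - w*x)^2 over a single coefficient w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / sxx


def ridge_1d(xs, ys, lam):
    """Minimize sum (y - w*x)^2 + lam * w^2: w shrinks but stays non-zero."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def lasso_1d(xs, ys, lam):
    """Minimize sum (y - w*x)^2 + lam * |w| via soft-thresholding."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)  # becomes exactly 0 for large lam
    return math.copysign(shrunk, sxy) / sxx


# Hypothetical data with a true slope of about 1.
xs = [1.0, 2.0, 3.0]
ys = [1.1, 1.9, 3.2]

for lam in (0.0, 1.0, 14.0, 30.0):
    print(lam, ridge_1d(xs, ys, lam), lasso_1d(xs, ys, lam))
```

In this setting a larger λ pulls the ridge coefficient toward zero without ever reaching it, while the Lasso coefficient becomes exactly zero once λ/2 exceeds |Σ x·y|, which is the elimination effect.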

12 Coefficients Obtained by the Three Learners

13 Lasso: Automatically Eliminate Features by L1-Norm

14 Bias-Variance Tradeoff Simple model often has small variance but large bias (in estimation). Complex model often has small bias but large variance (a.k.a., overfitting).

15 Recap: Simple Model has Shrunken Range of w A model with a shrunken range of unknown parameters is usually simpler. [Figures: M = 9 with λ ≈ 10^-8; M = 9 with λ = 1.]

16 Recap: Larger λ Learns a Simpler Model [Figures: M = 9 fits for λ = 0, λ = 10^-8, and λ = 1.]

17 Bias-Variance Tradeoff Simple model often has small variance but large bias (in estimation). Complex model often has small bias but large variance (a.k.a., overfitting). λ ~ 13 (simplest)

18 Bias-Variance Tradeoff Simple model often has small variance but large bias (in estimation). Complex model often has small bias but large variance (a.k.a., overfitting). λ ~ 0.7 (simple)

19 Bias-Variance Tradeoff Simple model often has small variance but large bias (in estimation). Complex model often has small bias but large variance (a.k.a., overfitting). λ ~ 0.09 (complex)

20 Bias-Variance Tradeoff We also have theoretical evidence:

21 Regression Methods in Python

22 Machine Learning Prediction Models Regression Model - linear regression (least square, ridge regression, Lasso) Classification Model - naive Bayes, logistic regression, Gaussian discriminant analysis, k-nearest neighbor, linear support vector machine (LSVM), decision tree, neural network, (Bayesian network) - multi-class classification with binary classifier Advanced Model - kernel, ensemble Online Model - SGD learner, HMM

23 A Probabilistic View of Classifier A classifier can be a table of p(y | x): given an example x, what is the distribution of label y? For a blue example: p(y = spam | x = blue) = 0.75 and p(y = ham | x = blue) = 0.25. For a red example: p(y = spam | x = red) = 0.4 and p(y = ham | x = red) = 0.6. Spam or ham?

24 1. Naive Bayes Estimate p(y | x) using (1) Bayes' rule and (2) the assumption that the features are independent given the label.

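25 Example of Bayes Classifier with One Feature Goal: p(y = spam | x = blue). From the spam and ham examples: - p(x = blue | y = spam) = 1/4 - p(x = blue) = 2/6 = 1/3 - p(y = spam) = 4/6 = 2/3. By Bayes' rule: $p(y = \text{spam} \mid x = \text{blue}) = \dfrac{p(x = \text{blue} \mid y = \text{spam})\, p(y = \text{spam})}{p(x = \text{blue})} = \dfrac{(1/4)(2/3)}{1/3} = \dfrac{1}{2}$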
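The arithmetic on this slide can be checked with exact fractions. The variable names are illustrative; the three input probabilities are the ones stated on the slide.

```python
from fractions import Fraction

# Quantities read off the slide (6 emails: 4 spam, 2 blue, 1 blue spam).
p_blue_given_spam = Fraction(1, 4)
p_spam = Fraction(4, 6)
p_blue = Fraction(2, 6)

# Bayes' rule: p(spam | blue) = p(blue | spam) * p(spam) / p(blue)
p_spam_given_blue = p_blue_given_spam * p_spam / p_blue
print(p_spam_given_blue)  # 1/2
```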

26 Example of Naive Classifier p(x = blue & google & lottery | y = spam) = p(x is blue | y = spam) * p(x has word google | y = spam) * p(x has word lottery | y = spam). Of course, we can also construct conditional probabilities for continuous variables. Feature vector: x = (x1, x2, x3, x4, ..., xp) = (1, 1, 0, 1, ..., 0), with one entry per word (google, lottery, cat, ..., book).

27 Naive Bayes in Python
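The original code for this slide did not survive extraction. As a stand-in, here is a minimal pure-Python sketch of a naive Bayes classifier for binary word features. The add-one (Laplace) smoothing, the toy data, and the function names are assumptions, not the deck's own implementation.

```python
import math


def train_naive_bayes(X, y):
    """Estimate p(y) and p(x_j = 1 | y) for binary features."""
    model = {}
    n_features = len(X[0])
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        # Add-one smoothing (an assumption) avoids zero probabilities.
        probs = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                 for j in range(n_features)]
        model[c] = (prior, probs)
    return model


def predict(model, x):
    """Pick the class with the largest log p(y) + sum_j log p(x_j | y)."""
    def log_joint(c):
        prior, probs = model[c]
        total = math.log(prior)
        for xj, pj in zip(x, probs):
            total += math.log(pj if xj == 1 else 1.0 - pj)
        return total
    return max(model, key=log_joint)


# Hypothetical training set; columns are (google, lottery, cat).
X = [[1, 1, 0], [1, 1, 0], [0, 1, 0], [1, 0, 1], [0, 0, 1], [0, 0, 0]]
y = ["spam", "spam", "spam", "ham", "ham", "ham"]

model = train_naive_bayes(X, y)
print(predict(model, [1, 1, 0]))  # spam
print(predict(model, [0, 0, 1]))  # ham
```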

28 2. Gaussian Discriminant Analysis Similar to naive Bayes, but does not assume that the features are independent. Assume p(x | y = k) is normal for each class k and estimate it from data. Quadratic DA (QDA) assumes each class has its own distribution (mean and covariance). Linear DA (LDA) assumes the covariance is shared by all classes.

29 Example of GDA p(x = blue & google & lottery | y = spam), with x = (color, google, lottery), $\mu = (\text{mean}[c], \text{mean}[g], \text{mean}[l])^\top$ and $\Sigma = \begin{pmatrix} \text{cov}[c,c] & \text{cov}[c,g] & \text{cov}[c,l] \\ \text{cov}[g,c] & \text{cov}[g,g] & \text{cov}[g,l] \\ \text{cov}[l,c] & \text{cov}[l,g] & \text{cov}[l,l] \end{pmatrix}$. (μ, Σ) can be estimated by MLE. Quadratic DA (QDA) assumes different classes have different (μ, Σ). Linear DA (LDA) assumes Σ is shared by all classes.

30 LDA/QDA in Python

31 3. Logistic Regression Directly construct p(y | x) with unknown parameters β and estimate them by MLE: $p(y = 1 \mid x) = \dfrac{1}{1 + \exp(-\beta^\top x)}$. Logistic regression typically works with binary classification, i.e., y = 0 or 1. β can be solved numerically by Newton's method.

32 Regularized Logistic Regression Logistic regression learned with a restricted range of the parameter β. - Q: what kind of model does a larger λ give?

33 Logistic Regression in Python
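The original code for this slide is not available. The following is a minimal sketch of unregularized logistic regression with one feature and an intercept, fitted by Newton's method as mentioned earlier in the deck. The data, the fixed iteration count, and the names `fit_logistic_newton`, `b0`, and `b1` are assumptions; the data are deliberately not linearly separable so that the maximum-likelihood estimate is finite.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_newton(xs, ys, n_iter=25):
    """Maximize the log-likelihood of p(y=1|x) = sigmoid(b0 + b1*x)."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # Hessian of the negative log-likelihood
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            g0 += y - p
            g1 += (y - p) * x
            w = p * (1.0 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step: beta <- beta + H^{-1} g, with an explicit 2x2 inverse.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical, non-separable data (y = 1 tends to occur for larger x).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic_newton(xs, ys)
print(sigmoid(b0 + b1 * 0.0), sigmoid(b0 + b1 * 5.0))
```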

34 4. k-nearest Neighbor Q: how to classify the green test data?

35 4. k-nearest Neighbor Q: how to classify the green test data? k-nn assigns an example to class c if most of its k nearest neighbors are from class c. We could select k nearest neighbors, or choose a distance threshold to find nearest neighbors.

36 Impact of Neighborhood Size Q: how to classify the green test data? k-nn assigns an example to class c if most of its k nearest neighbors are from class c. Q: will k=3 and k=5 give the same result?

37 Impact of Neighborhood Size Q: how to classify the green test data? k-nn assigns an example to class c if most of its k nearest neighbors are from class c. Q: will k=3 and k=5 give the same result? Classification results depend on k.

38 Model Complexity of k-nn Q: how to classify the green test data? k-nn assigns an example to class c if most of its k nearest neighbors are from class c. Q: will k=3 and k=5 give the same result? Classification results depend on k. Q: how does k affect model complexity?


39 Model Complexity of k-nn Q: how to classify the green test data? k-nn assigns an example to class c if most of its k nearest neighbors are from class c. Q: will k=3 and k=5 give the same result? Classification results depend on k. Q: how does k affect model complexity? Smaller k gives more complex model.

40 Impact of k on Model Complexity

41 k-nn in Python
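The original code for this slide is missing. A minimal pure-Python sketch of k-nn with Euclidean distance and majority vote is given below. The points and the "red"/"blue" labels are made up so that k = 3 and k = 5 disagree, which illustrates that the classification result depends on k.

```python
from collections import Counter


def knn_predict(train, query, k):
    """train: list of ((x1, x2), label). Majority vote among the k nearest."""
    def squared_distance(item):
        (x1, x2), _ = item
        return (x1 - query[0]) ** 2 + (x2 - query[1]) ** 2

    nearest = sorted(train, key=squared_distance)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training points around the query (0, 0).
train = [
    ((0.5, 0.0), "red"),
    ((0.0, 0.6), "red"),
    ((0.7, 0.0), "blue"),
    ((1.0, 0.0), "blue"),
    ((0.0, 1.1), "blue"),
]
query = (0.0, 0.0)

print(knn_predict(train, query, k=3))  # red  (2 red vs 1 blue)
print(knn_predict(train, query, k=5))  # blue (2 red vs 3 blue)
```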

42 5. Linear Support Vector Machine (LSVM) Find a linear decision boundary to maximally separate instances from two classes. LSVM typically works with binary classification, i.e., y = 0 or 1.

43 Learning LSVM Find a linear decision boundary to maximally separate instances from two classes.

44 Soft-Margin LSVM The boundary can tolerate misbehaved points (soft-margin).

45 LSVM in Python

46 6. Decision Tree Build a tree where each node filters some feature(s) for one-step classification.

47 Learning/Constructing a Decision Tree Keep splitting nodes until leaf nodes are sufficiently pure or we run out of features.

48 Another Example Build a tree to detect spam emails. Features and labels: X1 = color, X2 = google, X4 = lottery; R1, R3, R4 = spam; R2, R5 = ham.

49 Another Example We want to have pure nodes. Features and labels: X1 = color, X2 = google, X4 = lottery; R1, R3, R4 = spam; R2, R5 = ham. [Figure labels: pure, not-so-pure.]
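Node purity can be quantified in several ways. The sketch below uses Gini impurity, which is an assumption since the slides do not name a measure. The labels for R1 to R5 come from the slide, while the two candidate splits are hypothetical because the feature values are not recoverable.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted average impurity of the two child nodes."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


labels = {"R1": "spam", "R2": "ham", "R3": "spam", "R4": "spam", "R5": "ham"}

# Hypothetical split A sends R1, R3, R4 left and R2, R5 right (both pure).
split_a = split_impurity([labels[r] for r in ("R1", "R3", "R4")],
                         [labels[r] for r in ("R2", "R5")])
# Hypothetical split B sends R1, R2 left and R3, R4, R5 right (both mixed).
split_b = split_impurity([labels[r] for r in ("R1", "R2")],
                         [labels[r] for r in ("R3", "R4", "R5")])

print(split_a, split_b)  # split A has the lower impurity and would be preferred
```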

50 Decision Tree in Python

51 7. Neural Network A network of neurons for classification.

52 Neuron A neuron is (typically) a nonlinear activation function applied to a linear function of its inputs. [Figure labels: weights, activation functions.]
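A single neuron can be sketched as a weighted sum followed by an activation function. The sigmoid activation and the example weights are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias, activation=sigmoid):
    """Apply an activation function to a linear function of the inputs."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return activation(z)


# Hypothetical weights: the two active inputs cancel, so z = 0.
out = neuron([1, 1, 0], [0.5, -0.5, 2.0], bias=0.0)
print(out)  # 0.5
```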

53 Backpropagation BP is a popular algorithm to learn weights of a neural network. (~gradient descent)

54 Backpropagation First, we can write output of a neural network as a function of input features.

55 Backpropagation Then, we take the derivative of the function w.r.t. each weight. [Figure label: output g(x).]

56 Neural Network in Python

57 [+] Bayesian Network Apply a Bayesian network to infer p(y = cancer | x = (smoker, X-ray, Dyspnoea, Asia, ...)).

58 Multi-Class Classification with Binary Classifiers We can apply binary classifier by one-versus-all or one-versus-one strategies. 1-vs-all 1-vs-1

59 Machine Learning Prediction Models Regression Model - linear regression (least square, ridge regression, Lasso) Classification Model - naive Bayes, logistic regression, Gaussian discriminant analysis, k-nearest neighbor, linear support vector machine (LSVM), decision tree, neural network, (Bayesian network) - multi-class classification with binary classifier Advanced Model - kernel, ensemble Online Model - SGD learner, HMM

60 1. Kernel Methods Project data to a higher-dimensional space (implicitly) and do linear analysis there. [Figure: original features (google, lottery) mapped to new features (gottery, gogog, loogle, ...).]

61 Demo

62 Implicit Mapping and Kernel Function No need to know the new feature representation; we only need to know the inner product of instances. [Figure: a kernel takes two instances and returns their inner product in the new feature space, e.g., 0.4.]
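A small numerical check makes the implicit mapping concrete. The sketch assumes the degree-2 polynomial kernel k(x, z) = (x · z)² on two-dimensional inputs, whose explicit feature map is φ(x) = (x1², √2·x1·x2, x2²). The slides do not say which kernel the figure uses, and the input vectors are made up.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel in 2-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly_kernel(x, z):
    """Inner product in the mapped space, computed without calling phi."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, 0.5)
explicit = dot(phi(x), phi(z))   # map first, then take the inner product
implicit = poly_kernel(x, z)     # kernel only; no mapping needed
print(explicit, implicit)        # both are 16 up to rounding
```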

63 Representer Theorem The optimal model (for a proper, suitably regularized objective function) is a linear combination of the (mapped) training instances.

64 Kernel Functions

65 Example Kernel Methods (in Python) Kernel ridge regression Kernel LSVM (a.k.a. SVM) Kernel PCA Kernel K-means Gaussian Process

66 2. Ensemble Method Combine several weak models to form a strong model. Two popular ways to learn weak models - bagging: bootstrapping + average - boosting: weighting + weighted average

67 Bagging Bagging has three steps 1. bootstrap training set to get subsets 2. learn one model from each subset 3. average model prediction
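The three steps can be sketched with one-dimensional decision stumps as the weak models and majority vote as the averaging step. The stump learner, the toy data, the number of models, and the seed are all assumptions made for illustration.

```python
import random


def fit_stump(sample):
    """Fit a threshold classifier on (x, y) pairs with y in {0, 1}."""
    best = None
    for threshold, _ in sample:
        for left_label in (0, 1):
            errors = sum(
                (left_label if x <= threshold else 1 - left_label) != y
                for x, y in sample
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label)
    _, threshold, left_label = best
    return lambda x: left_label if x <= threshold else 1 - left_label


def bagging(data, n_models=25, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        subset = [rng.choice(data) for _ in data]   # 1. bootstrap a subset
        models.append(fit_stump(subset))            # 2. learn one model from it

    def predict(x):                                 # 3. average (majority vote)
        votes = sum(model(x) for model in models)
        return 1 if 2 * votes > len(models) else 0

    return predict


# Hypothetical data: class 0 for small x, class 1 for large x.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
predict = bagging(data)
print(predict(0), predict(10))
```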

68 An Example of Bagging: Random Forest random forest = decision tree + bagging + feature bootstrapping

69 Boosting Loop - weight training instances - train a model on the weighted sample Get weighted average of model prediction

70 An Example of Boosting: AdaBoost Mis-classified instances receive higher weights; accurate models receive higher weights.
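One round of the weight update can be sketched as follows. This uses the standard binary AdaBoost formulas with labels in {-1, +1}, which is an assumption since the slide shows no formulas. The five-instance example is made up.

```python
import math


def adaboost_round(weights, predictions, labels):
    """Return (model weight alpha, re-normalized instance weights).

    Assumes labels and predictions are in {-1, +1} and that the weighted
    error is strictly between 0 and 1.
    """
    err = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    alpha = 0.5 * math.log((1.0 - err) / err)   # accurate model -> larger alpha
    unnormalized = [
        w * math.exp(-alpha * y * p)            # misclassified -> weight grows
        for w, p, y in zip(weights, predictions, labels)
    ]
    total = sum(unnormalized)
    return alpha, [w / total for w in unnormalized]


# Hypothetical round: five equally weighted instances, the last one is wrong.
weights = [0.2] * 5
labels = [1, 1, -1, -1, 1]
predictions = [1, 1, -1, -1, -1]

alpha, new_weights = adaboost_round(weights, predictions, labels)
print(alpha, new_weights)
```

In this example the weighted error is 0.2, so alpha is ln 2 (about 0.69), the misclassified instance ends up with weight 0.5, and each correctly classified instance ends up with weight 0.125.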

71 A Demo of AdaBoost

72 Example Ensemble Methods (in Python)

73 Machine Learning Prediction Models Regression Model - linear regression (least square, ridge regression, Lasso) Classification Model - naive Bayes, logistic regression, Gaussian discriminant analysis, k-nearest neighbor, linear support vector machine (LSVM), decision tree, neural network, (Bayesian network) - multi-class classification with binary classifier Advanced Model - kernel, ensemble Online Model - SGD learner, HMM

74 Learning from a Sequence of Data How to learn a spam detection model from a sequence of emails? - situation 1: have features and labels of received emails - situation 2: only have labels of received emails

75 S1: Stochastic Gradient Descent We can continually update the model based on new emails, using stochastic gradient descent.

76 Example: Perceptron A single neuron that only updates itself when an instance is mis-classified.
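A minimal sketch of this mistake-driven update is shown below. It assumes labels in {-1, +1}, a learning rate of 1, and a made-up stream of emails with two binary features.

```python
def train_perceptron(stream, n_features, learning_rate=1.0):
    """Process (x, y) pairs one at a time; update only on mistakes."""
    w = [0.0] * n_features
    b = 0.0
    mistakes = 0
    for x, y in stream:  # y is +1 (spam) or -1 (ham)
        score = b + sum(wi * xi for wi, xi in zip(w, x))
        predicted = 1 if score > 0 else -1
        if predicted != y:
            w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
            b += learning_rate * y
            mistakes += 1
    return w, b, mistakes


# Hypothetical stream; features are (google, lottery).
stream = [([1, 1], 1), ([0, 0], -1), ([1, 1], 1), ([0, 0], -1)]
w, b, mistakes = train_perceptron(stream, n_features=2)
print(w, b, mistakes)  # [1.0, 1.0] 0.0 2
```

The first two emails are misclassified and trigger updates; the last two are classified correctly, so the weights stay unchanged.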

77 A Demo of Perceptron

78 S2: Need Special Sequential Prediction Model Q: how do we predict tomorrow's weather based solely on historical weather records?

79 S2: Markov and Hidden Markov Models
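For the weather question, a first-order Markov chain over observed states can be estimated by counting transitions. The sketch below covers only the plain Markov model, not the hidden-state case, and the weather history is made up.

```python
from collections import Counter, defaultdict


def fit_markov_chain(sequence):
    """Estimate p(next state | current state) from transition counts."""
    counts = defaultdict(Counter)
    for today, tomorrow in zip(sequence, sequence[1:]):
        counts[today][tomorrow] += 1
    return {
        state: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for state, nexts in counts.items()
    }


def predict_next(transitions, today):
    """Return the most probable next state given today's state."""
    return max(transitions[today], key=transitions[today].get)


# Hypothetical weather record.
history = ["sunny", "sunny", "rainy", "rainy", "rainy",
           "sunny", "sunny", "sunny"]

transitions = fit_markov_chain(history)
print(transitions["sunny"])                  # sunny: 0.75, rainy: 0.25
print(predict_next(transitions, "sunny"))    # sunny
```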

80 Example Online Learners (in Python)
