Machine Learning. Chao Lan


Machine Learning Prediction Models
Regression Model
- linear regression (least squares, ridge regression, Lasso)
Classification Model
- naive Bayes, logistic regression, Gaussian discriminant analysis, k-nearest neighbor, linear support vector machine (LSVM), decision tree, neural network, (Bayesian network)
- multi-class classification with binary classifiers
Advanced Model
- kernel, ensemble
Online Model
- SGD learner, HMM

Machine Learning Prediction Models
Example: an email is represented by a feature vector x = (x1, ..., xp) of binary word indicators (google = 1, lottery = 1, cat = 0, email = 1, transport = 0, ..., book = 0); the model maps x to a label y, spam or ham.

Machine Learning Prediction Models
Regression Model
- linear regression (least squares, ridge regression, Lasso)
Classification Model
- naive Bayes, logistic regression, Gaussian discriminant analysis, Bayesian network, k-nearest neighbor, linear support vector machine (LSVM), decision tree, neural network
Advanced Model
- kernel, ensemble
Online Model
- SGD learner, HMM

Linear Regression Model
The linear regression model has the form y = w0 + w1 x1 + w2 x2 + ... + wp xp, where
- the w's are regression coefficients
- w0 is the bias term
- y is the label (e.g., how likely x is spam)
(Figure: the email feature vector x = (x1, ..., xp) with binary word indicators such as google = 1, lottery = 1, cat = 0, ..., book = 0.)

Geometric Interpretation of Linear Regression
(Figure: the linear regression model plotted against the features; with one feature x1 it is a line in the (x1, y) plane, and with two features x1, x2 it is a plane. Example feature vector: google = 1 (x1), lottery = 0 (x2).)

Geometric Interpretation of the Bias Term
w0 is the bias term. It allows the model to shift along the y axis (to better fit the data).
(Figure: two fitted lines over a feature xi, with and without the bias term.)

Three Learners of the Linear Regression Model
Three ways to learn the weights w of the linear regression model from data:
1. least squares
2. ridge regression
3. Lasso

Three Learners of the Linear Regression Model
Three ways to learn the weights w of the linear regression model from data:
1. least squares: find w that minimizes the total squared error
2. ridge regression
3. Lasso

Three Learners of the Linear Regression Model
Three ways to learn the weights w of the linear regression model from data:
1. least squares: find w that minimizes the total squared error
2. ridge regression: find w in a restricted range that minimizes the total squared error
3. Lasso
Q: how would λ affect shrinkage?

Three Learners of the Linear Regression Model
Three ways to learn the weights w of the linear regression model from data:
1. least squares: find w that minimizes the total squared error
2. ridge regression: find w in a restricted range that minimizes the total squared error
3. Lasso: find w, with some elements eliminated, that minimizes the total squared error
Q: how would λ affect elimination?
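For reference, the three objectives can be written in the usual textbook notation (X is the data matrix, y the labels, λ the regularization weight; this notation is assumed here rather than taken from the slides):

```latex
\hat{w}_{\mathrm{LS}}    = \arg\min_w \; \|y - Xw\|_2^2 \qquad
\hat{w}_{\mathrm{ridge}} = \arg\min_w \; \|y - Xw\|_2^2 + \lambda \|w\|_2^2 \qquad
\hat{w}_{\mathrm{lasso}} = \arg\min_w \; \|y - Xw\|_2^2 + \lambda \|w\|_1
```

A larger λ shrinks the ridge coefficients more strongly and drives more Lasso coefficients exactly to zero.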

Coefficients Obtained by the Three Learners

Lasso: Automatically Eliminates Features via the L1 Norm

Bias-Variance Tradeoff A simple model often has small variance but large bias (in estimation). A complex model often has small bias but large variance (a.k.a. overfitting).

Recap: A Simple Model Has a Shrunken Range of w
A model with a shrunken range of unknown parameters is usually simpler.
(Figure: fits with M = 9, comparing λ ≈ 10^-8 and λ = 1.)

Recap: Larger λ Learns a Simpler Model
(Figure: M = 9 fits with λ = 0, λ = 10^-8, and λ = 1; larger λ gives a simpler fit.)

Bias-Variance Tradeoff A simple model often has small variance but large bias (in estimation). A complex model often has small bias but large variance (a.k.a. overfitting). (Figure: fit with λ ≈ 13, the simplest model.)

Bias-Variance Tradeoff A simple model often has small variance but large bias (in estimation). A complex model often has small bias but large variance (a.k.a. overfitting). (Figure: fit with λ ≈ 0.7, a simple model.)

Bias-Variance Tradeoff A simple model often has small variance but large bias (in estimation). A complex model often has small bias but large variance (a.k.a. overfitting). (Figure: fit with λ ≈ 0.09, a complex model.)

Bias-Variance Tradeoff We also have theoretical evidence: for squared loss, the expected prediction error decomposes into (bias)^2 + variance + irreducible noise.

Regression Methods in Python
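As a rough sketch of what a slide like this might show, here is a minimal scikit-learn example of the three learners; the toy email features, labels, and parameter values below are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Toy data: 6 emails x 4 binary word features (google, lottery, cat, book); made up for illustration.
X = np.array([[1, 1, 0, 0],
              [1, 0, 0, 1],
              [0, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 1, 1, 0],
              [0, 0, 0, 1]])
y = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])  # how likely each email is spam

for model in (LinearRegression(),   # least squares
              Ridge(alpha=1.0),     # L2 penalty; alpha plays the role of lambda
              Lasso(alpha=0.1)):    # L1 penalty; can drive coefficients exactly to zero
    model.fit(X, y)
    print(type(model).__name__, model.coef_, model.intercept_)
```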

Machine Learning Prediction Models
Regression Model
- linear regression (least squares, ridge regression, Lasso)
Classification Model
- naive Bayes, logistic regression, Gaussian discriminant analysis, k-nearest neighbor, linear support vector machine (LSVM), decision tree, neural network, (Bayesian network)
- multi-class classification with binary classifiers
Advanced Model
- kernel, ensemble
Online Model
- SGD learner, HMM

A Probabilistic View of a Classifier
A classifier can be a table of p(y | x): given an example x, what is the distribution of the label y?
- p(y = spam | x = blue email) = 0.75, p(y = ham | x = blue email) = 0.25. Spam or ham?
- p(y = spam | x = red email) = 0.4, p(y = ham | x = red email) = 0.6. Spam or ham?

1. Naive Bayes Estimate p(y | x) using (1) Bayes' rule and (2) the assumption that features are independent given the class.

Example of a Bayes Classifier with One Feature
(Figure: training emails labeled spam or ham.)
Goal: p(y = spam | x = blue email)
- p(x = blue email | y = spam) = 1/4
- p(x = blue email) = 2/6 = 1/3
- p(y = spam) = 4/6 = 2/3
- p(y = spam | x = blue email) = p(x = blue email | y = spam) * p(y = spam) / p(x = blue email) = (1/4 * 2/3) / (1/3) = 1/2

Example of a Naive Bayes Classifier
p(x = blue email & google & lottery | y = spam)
  = p(x is blue | y = spam) * p(x has word google | y = spam) * p(x has word lottery | y = spam)
Of course, we can also construct conditional probabilities for continuous features.
(Figure: the email feature vector x = (x1, ..., xp).)

Naive Bayes in Python
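A minimal sketch of naive Bayes in scikit-learn, assuming the binary word-presence features from the earlier slides (data made up for illustration); BernoulliNB estimates each p(x_j | y) independently, matching the naive assumption.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Binary word-presence features (google, lottery, cat, book); labels 1 = spam, 0 = ham.
X = np.array([[1, 1, 0, 0], [1, 0, 0, 1], [0, 1, 0, 0],
              [0, 0, 1, 1], [1, 1, 1, 0], [0, 0, 0, 1]])
y = np.array([1, 0, 1, 0, 1, 0])

clf = BernoulliNB()
clf.fit(X, y)
print(clf.predict([[1, 1, 0, 0]]))        # predicted class
print(clf.predict_proba([[1, 1, 0, 0]]))  # p(y | x), obtained via Bayes' rule
```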

2. Gaussian Discriminant Analysis Similar to naive Bayes, but does not assume the features are independent. Assume p(x | y = k) is a (multivariate) normal distribution for each class k and estimate it from data. Quadratic DA (QDA) lets classes have different distributions; Linear DA (LDA) assumes the covariance is shared by all classes.

Example of GDA
Goal: p(x = blue email & google & lottery | y = spam), with x = (color, google, lottery).
μ = (mean of color, mean of google, mean of lottery)
Σ = [ cov[c,c]  cov[c,g]  cov[c,l] ]
    [ cov[g,c]  cov[g,g]  cov[g,l] ]
    [ cov[l,c]  cov[l,g]  cov[l,l] ]
(μ, Σ) can be estimated by MLE. Quadratic DA (QDA) lets different classes have different (μ, Σ); Linear DA (LDA) assumes Σ is shared by all classes.

LDA/QDA in Python
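A minimal LDA/QDA sketch in scikit-learn; the two continuous features and labels below are made up for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

# Two continuous features per email (e.g., word frequencies); 1 = spam, 0 = ham.
X = np.array([[2.0, 1.5], [1.8, 1.2], [2.2, 1.7], [2.1, 1.1],
              [0.3, 0.4], [0.6, 0.2], [0.1, 0.7], [0.4, 0.6]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

lda = LinearDiscriminantAnalysis().fit(X, y)     # shared covariance across classes
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # per-class covariance
print(lda.predict([[1.9, 1.4]]), qda.predict([[1.9, 1.4]]))
```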

3. Logistic Regression Directly model p(y | x) = 1 / (1 + exp(-βᵀx)) with unknown parameters β and estimate them by MLE. Logistic regression typically works with binary classification, i.e., y = 0 or 1, and β can be solved numerically by Newton's method.

Regularized Logistic Regression It is logistic regression learned with a restricted range of the parameter β. Q: a larger λ gives a ___ model?

Logistic Regression in Python
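A minimal regularized logistic regression sketch in scikit-learn; note that the C parameter is the inverse regularization strength (roughly 1/λ), so a smaller C means a larger λ. Data is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1, 1, 0, 0], [1, 0, 0, 1], [0, 1, 0, 0],
              [0, 0, 1, 1], [1, 1, 1, 0], [0, 0, 0, 1]])
y = np.array([1, 0, 1, 0, 1, 0])

# L2-regularized logistic regression; smaller C (larger lambda) gives a simpler model.
clf = LogisticRegression(penalty='l2', C=1.0)
clf.fit(X, y)
print(clf.coef_, clf.intercept_)
print(clf.predict_proba([[1, 1, 0, 0]]))  # p(y | x)
```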

4. k-nearest Neighbor Q: how to classify the green test data?

4. k-nearest Neighbor Q: how to classify the green test data? k-nn assigns an example to class c if most of its k nearest neighbors are from class c. We could select k nearest neighbors, or choose a distance threshold to find nearest neighbors.

Impact of Neighborhood Size Q: how to classify the green test data? k-nn assigns an example to class c if most of its k nearest neighbors are from class c. Q: will k=3 and k=5 give the same result?

Impact of Neighborhood Size Q: how to classify the green test data? k-nn assigns an example to class c if most of its k nearest neighbors are from class c. Q: will k=3 and k=5 give the same result? Classification results depend on k.

Model Complexity of k-nn Q: how to classify the green test data? k-nn assigns an example to class c if most of its k nearest neighbors are from class c. Q: will k=3 and k=5 give the same result? Classification results depend on k. Q: how does k affect model complexity?

Model Complexity of k-nn Q: how to classify the green test data? k-nn assigns an example to class c if most of its k nearest neighbors are from class c. Q: will k=3 and k=5 give the same result? Classification results depend on k. Q: how does k affect model complexity? Smaller k gives more complex model.

Impact of k on Model Complexity

k-nn in Python
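A minimal k-NN sketch in scikit-learn, varying k to show that the classification can change; data made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[2.0, 1.5], [1.8, 1.2], [2.2, 1.7], [2.1, 1.1],
              [0.3, 0.4], [0.6, 0.2], [0.1, 0.7], [0.4, 0.6]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

for k in (1, 3, 5):  # smaller k gives a more complex decision boundary
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, clf.predict([[1.0, 0.9]]))
```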

5. Linear Support Vector Machine (LSVM) Find a linear decision boundary to maximally separate instances from two classes. LSVM typically works with binary classification, i.e., y = 0 or 1.

Learning LSVM Find a linear decision boundary to maximally separate instances from two classes.

Soft-Margin LSVM The boundary can tolerate misbehaved points (soft-margin).

LSVM in Python
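A minimal soft-margin LSVM sketch in scikit-learn; C controls the soft margin (a smaller C tolerates more margin violations). Data made up for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[2.0, 1.5], [1.8, 1.2], [2.2, 1.7], [2.1, 1.1],
              [0.3, 0.4], [0.6, 0.2], [0.1, 0.7], [0.4, 0.6]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

clf = LinearSVC(C=1.0, max_iter=10000)  # linear decision boundary with a soft margin
clf.fit(X, y)
print(clf.coef_, clf.intercept_)        # the learned boundary w, w0
print(clf.predict([[1.0, 0.9]]))
```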

6. Decision Tree Build a tree where each internal node tests some feature(s), performing one step of the classification.

Learning/Constructing a Decision Tree Keep splitting nodes until leaf nodes are sufficiently pure or we run out of features.

Another Example Build a tree to detect spam emails. Features and Labels: X1 = color, X2 = google, X4 = lottery; R1, R3, R4 = spam; R2, R5 = ham

Another Example We want each node to be pure. (Figure: one split yields a pure node and a not-so-pure node.) Features and Labels: X1 = color, X2 = google, X4 = lottery; R1, R3, R4 = spam; R2, R5 = ham

Decision Tree in Python
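A minimal decision tree sketch in scikit-learn for the spam example; since the original feature table only appears in a figure, the feature values below are made up, keeping only the labels R1, R3, R4 = spam and R2, R5 = ham.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up binary features: color (1 = blue), google, lottery, for emails R1..R5.
X = np.array([[1, 1, 1],   # R1
              [0, 0, 1],   # R2
              [1, 1, 0],   # R3
              [1, 0, 1],   # R4
              [0, 0, 0]])  # R5
y = np.array([1, 0, 1, 1, 0])  # 1 = spam, 0 = ham

clf = DecisionTreeClassifier()  # splits nodes until they are pure or no features remain
clf.fit(X, y)
print(export_text(clf, feature_names=['color', 'google', 'lottery']))
```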

7. Neural Network A network of neurons for classification.

Neuron A neuron is (typically) a nonlinear activation function applied to a weighted linear combination of its inputs. (Figure: inputs, weights, and the activation function.)

Backpropagation BP is a popular algorithm for learning the weights of a neural network (essentially gradient descent).

Backpropagation First, we write the output of the neural network as a function of the input features.

Backpropagation Then, we take the derivative of that function with respect to each weight. (Figure: the network output g(x).)

Neural Network in Python
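A minimal neural network sketch in scikit-learn; one hidden layer of neurons whose weights are learned from gradients computed by backpropagation. Data made up for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[1, 1, 0, 0], [1, 0, 0, 1], [0, 1, 0, 0],
              [0, 0, 1, 1], [1, 1, 1, 0], [0, 0, 0, 1]])
y = np.array([1, 0, 1, 0, 1, 0])

# One hidden layer of 8 neurons with a logistic activation; gradients come from backpropagation.
clf = MLPClassifier(hidden_layer_sizes=(8,), activation='logistic',
                    solver='lbfgs', max_iter=2000, random_state=0)
clf.fit(X, y)
print(clf.predict([[1, 1, 0, 0]]))
```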

[+] Bayesian Network Apply a Bayesian network to infer p(y = cancer | x = (smoker, X-ray, Dyspnoea, Asia, ...)).

Multi-Class Classification with Binary Classifiers We can apply binary classifiers to a multi-class problem via the one-versus-all or one-versus-one strategy. (Figure: decision regions produced by 1-vs-all and 1-vs-1.)
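A minimal sketch of the two strategies using scikit-learn's wrappers around a binary classifier; the three-class toy data is made up for illustration.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

X = np.array([[0.1, 0.2], [0.2, 0.1], [1.0, 1.1], [1.1, 0.9], [2.0, 0.1], [2.1, 0.2]])
y = np.array([0, 0, 1, 1, 2, 2])  # three classes

ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)  # one binary classifier per class
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)   # one per pair of classes
print(ovr.predict([[1.05, 1.0]]), ovo.predict([[1.05, 1.0]]))
```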

Machine Learning Prediction Models
Regression Model
- linear regression (least squares, ridge regression, Lasso)
Classification Model
- naive Bayes, logistic regression, Gaussian discriminant analysis, k-nearest neighbor, linear support vector machine (LSVM), decision tree, neural network, (Bayesian network)
- multi-class classification with binary classifiers
Advanced Model
- kernel, ensemble
Online Model
- SGD learner, HMM

1. Kernel Methods Project the data to a higher-dimensional space (implicitly) and do linear analysis there. (Figure: original features such as google and lottery are mapped to many new, not necessarily interpretable, features.)

Demo

Implicit Mapping and Kernel Function There is no need to know the new feature representation; we only need the inner products between instances, which the kernel function supplies. (Figure: the kernel function returns the inner product of two implicitly mapped instances, e.g., a value of 0.4.)

Representer Theorem The optimal model (under a proper objective function) is a linear combination of the training data, i.e., f(x) = Σi αi k(xi, x).

Kernel Functions

Example Kernel Methods (in Python)
- kernel ridge regression
- kernel LSVM (a.k.a. SVM)
- kernel PCA
- kernel k-means
- Gaussian process
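A minimal sketch of two of these kernel methods in scikit-learn (kernel ridge regression and the kernelized SVM); data and parameters are made up for illustration.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVC

X = np.array([[2.0, 1.5], [1.8, 1.2], [2.2, 1.7], [2.1, 1.1],
              [0.3, 0.4], [0.6, 0.2], [0.1, 0.7], [0.4, 0.6]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# The RBF kernel maps x to a high-dimensional space implicitly; only inner products are used.
krr = KernelRidge(kernel='rbf', alpha=1.0).fit(X, y)  # kernel ridge regression (real-valued output)
svm = SVC(kernel='rbf', C=1.0).fit(X, y)              # kernel LSVM, i.e., the usual SVM
print(krr.predict([[1.0, 0.9]]), svm.predict([[1.0, 0.9]]))
```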

2. Ensemble Method
Combine several weak models to form a strong model. Two popular ways to learn the weak models:
- bagging: bootstrapping + averaging
- boosting: weighting + weighted averaging

Bagging
Bagging has three steps:
1. bootstrap the training set to get subsets
2. learn one model from each subset
3. average the models' predictions

An Example of Bagging: Random Forest random forest = decision tree + bagging + feature bootstrapping

Boosting
Loop:
- weight the training instances
- train a model on the weighted sample
Finally, take a weighted average of the models' predictions.

An Example of Boosting: AdaBoost Mis-classified instances receive higher weights; accurate models receive higher weights.

A Demo of AdaBoost

Example Ensemble Methods (in Python)
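A minimal sketch of one bagging ensemble and one boosting ensemble in scikit-learn; data and parameters are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

X = np.array([[2.0, 1.5], [1.8, 1.2], [2.2, 1.7], [2.1, 1.1],
              [0.3, 0.4], [0.6, 0.2], [0.1, 0.7], [0.4, 0.6]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Bagging: random forest = decision trees + bootstrapped samples + random feature subsets.
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# Boosting: AdaBoost reweights misclassified instances and weights the models' votes.
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(rf.predict([[1.0, 0.9]]), ada.predict([[1.0, 0.9]]))
```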

Machine Learning Prediction Models
Regression Model
- linear regression (least squares, ridge regression, Lasso)
Classification Model
- naive Bayes, logistic regression, Gaussian discriminant analysis, k-nearest neighbor, linear support vector machine (LSVM), decision tree, neural network, (Bayesian network)
- multi-class classification with binary classifiers
Advanced Model
- kernel, ensemble
Online Model
- SGD learner, HMM

Learning from a Sequence of Data
How to learn a spam detection model from a sequence of emails?
- situation 1: have features and labels of received emails
- situation 2: only have labels of received emails

S1: Stochastic Gradient Descent We can continually update the model based on new emails, using stochastic gradient descent.

Example: Perceptron A single neuron that only updates itself when an instance is mis-classified.

A Demo of Perceptron

S2: We Need a Special Sequential Prediction Model Q: how do we predict tomorrow's weather based solely on historical weather records?

S2: Markov and Hidden Markov Models

Example Online Learners (in Python)
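A minimal online-learning sketch in scikit-learn for situation 1: a logistic-loss SGD learner updated one email at a time via partial_fit, plus a perceptron; data made up for illustration. (On older scikit-learn versions the loss name is 'log' rather than 'log_loss'.)

```python
import numpy as np
from sklearn.linear_model import SGDClassifier, Perceptron

X = np.array([[1, 1, 0, 0], [1, 0, 0, 1], [0, 1, 0, 0],
              [0, 0, 1, 1], [1, 1, 1, 0], [0, 0, 0, 1]])
y = np.array([1, 0, 1, 0, 1, 0])
classes = np.array([0, 1])

# Situation 1: update the model one email at a time with stochastic gradient descent.
sgd = SGDClassifier(loss='log_loss')
for xi, yi in zip(X, y):
    sgd.partial_fit(xi.reshape(1, -1), [yi], classes=classes)

# Perceptron: a single neuron that updates only on misclassified instances.
per = Perceptron().fit(X, y)
print(sgd.predict([[1, 1, 0, 0]]), per.predict([[1, 1, 0, 0]]))
```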