
DS 4400 Machine Learning and Data Mining I
Alina Oprea, Associate Professor, CCIS, Northeastern University
January 24, 2019

Logistics
- HW 1 is due on Friday 01/25
- Project proposal: due Feb 21
  - 1-page description of the problem you will solve, the dataset, and the ML algorithms
  - Individual project
  - Project template and potential ideas are on Piazza
- Project milestone: due March 21
  - 2-page description of progress
- Project report due at the end of the semester, with project presentations in class (10 minutes per project)

Outline
- Gradient descent: comparison with the closed-form solution
- Non-linear regression
- Regularization: ridge and lasso regression
- Lab example
- Classification: k nearest neighbors (kNN)
- Cross-validation

Gradient Descent
- The gradient is the slope of the line tangent to the curve
- The function decreases fastest in the negative direction of the gradient
- Larger learning rate => larger step

GD for Linear Regression
Minimize J(θ) = (1/2n) Σ_{i=1..n} (h_θ(x^(i)) − y^(i))² by repeating, for all j simultaneously:
  θ_j ← θ_j − (α/n) Σ_{i=1..n} (h_θ(x^(i)) − y^(i)) x_j^(i)
until ‖θ_new − θ_old‖ < ϵ or iterations == MAX_ITER
- Can also bound the number of iterations
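A minimal NumPy sketch of this update rule (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def gd_linear_regression(X, y, alpha=0.01, eps=1e-6, max_iter=10000):
    """Batch gradient descent for linear regression."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(max_iter):
        residuals = X @ theta - y    # h_theta(x^(i)) - y^(i) for all i
        grad = X.T @ residuals / n   # gradient of (1/2n) * sum of squared residuals
        theta_new = theta - alpha * grad
        if np.linalg.norm(theta_new - theta) < eps:  # ||theta_new - theta_old|| < eps
            return theta_new
        theta = theta_new
    return theta                     # stopped at max_iter
```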

Gradient Descent vs Closed Form
- Gradient descent: iterative, repeating the update above until convergence
- Closed form: solve the normal equations θ = (X^T X)^(−1) X^T y, which costs O(d³) for the d×d matrix inversion
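For comparison, a sketch of the closed-form solution that the O(d³) cost refers to; np.linalg.solve is used instead of an explicit inverse, though solving the d×d system is still cubic in d:

```python
import numpy as np

def closed_form_linear_regression(X, y):
    # Solve the normal equations (X^T X) theta = X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)
```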

Issues with Gradient Descent
- Might get stuck in a local optimum and not converge to the global optimum
  - Restart from multiple initial points
- Only works with differentiable loss functions
- Small or large gradients
  - Feature scaling helps
- The learning rate must be tuned
  - Can use line search to determine the optimal learning rate

Beyond Linearity
- Most datasets are not perfectly linear
- Linear regression then results in high MSE
- Generalized Additive Models

Polynomial Regression
Polynomial basis function:
  h_θ(x) = θ_0 + θ_1 x + θ_2 x² + … + θ_d x^d
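Polynomial regression is linear regression on an expanded basis; a minimal sketch (the data and degree are illustrative) that builds the design matrix [1, x, x², …, x^d] and reuses ordinary least squares:

```python
import numpy as np

def polynomial_design_matrix(x, d):
    # Column j holds x^j, so a linear fit on these columns gives
    # h_theta(x) = theta_0 + theta_1 x + ... + theta_d x^d
    return np.vander(x, N=d + 1, increasing=True)

x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(50)  # toy nonlinear data
X = polynomial_design_matrix(x, d=3)
theta, *_ = np.linalg.lstsq(X, y, rcond=None)          # least-squares fit
```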

Polynomial Regression
- Typically d ≤ 4 to avoid overfitting

Other Regression

Splines
- Fit a separate polynomial regression in each region between knots
- Spline: the fitted function is continuous and differentiable at the knot boundaries
- Natural spline: linear beyond the boundary knots
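One way to see how the boundary conditions are enforced is the truncated power basis for a cubic regression spline; a sketch with illustrative knots and data (this basis makes the fit cubic in each region while keeping the function and its derivatives continuous at the knots):

```python
import numpy as np

def cubic_spline_basis(x, knots):
    cols = [np.ones_like(x), x, x**2, x**3]       # global cubic terms
    for k in knots:
        cols.append(np.maximum(0.0, x - k) ** 3)  # one (x - k)_+^3 term per knot
    return np.column_stack(cols)

x = np.linspace(0, 10, 200)
y = np.where(x < 5, np.sin(x), 0.5 * x - 2) + 0.2 * np.random.randn(200)
B = cubic_spline_basis(x, knots=[2.5, 5.0, 7.5])
theta, *_ = np.linalg.lstsq(B, y, rcond=None)     # least-squares spline fit
```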

Generalization in ML
- Simple vs. complex models
- Goal: generalize well to new test data
- Complex models risk overfitting the training data: training MSE close to 0, but poor performance on test data

Bias-Variance Tradeoff
- Under-fitting vs. over-fitting
- Bias = difference between the estimated and true models
- Variance = difference in the fitted model across different training sets
- Test MSE decomposes into Bias² + Variance

Bias-Variance Decomposition
Let f̂ be the trained model, and consider a test point (x_0, y_0). Expected MSE:
  E[(y_0 − f̂(x_0))²]
Variance (of the prediction over training sets):
  Var[f̂(x_0)] = E[f̂(x_0)²] − (E[f̂(x_0)])²
Bias (of the prediction over training sets):
  Bias[f̂(x_0)] = E[f̂(x_0)] − y_0
Verify that:
  MSE(x_0, y_0) = Var[f̂(x_0)] + Bias²[f̂(x_0)]
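A small simulation of this decomposition (the setup is illustrative): draw many training sets from the same distribution, refit the model each time, and estimate the variance and squared bias of f̂(x_0):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)                  # true function
x0, sigma, trials = 1.0, 0.3, 2000
preds = np.empty(trials)
for t in range(trials):
    x = rng.uniform(0, 2, 30)                # fresh training set each trial
    y = f(x) + sigma * rng.normal(size=30)
    theta = np.polyfit(x, y, 2)              # f_hat: degree-2 polynomial fit
    preds[t] = np.polyval(theta, x0)

variance = preds.var()
bias_sq = (preds.mean() - f(x0)) ** 2
# Up to the irreducible noise sigma^2,
# E[(y_0 - f_hat(x_0))^2] ≈ variance + bias_sq
```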

Regularization
- Reduces model complexity
- Reduces model variance

Ridge regression
  J(θ) = (1/2) Σ_{i=1..n} (h_θ(x^(i)) − y^(i))² + (λ/2) Σ_{j=1..d} θ_j²
- If λ = 0, we train ordinary linear regression
- If λ is large, the coefficients shrink close to 0
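Ridge also admits a closed form: the L2 penalty turns the normal equations into (X^T X + λI) θ = X^T y. A minimal sketch (the intercept is usually left unpenalized; that detail is omitted here):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    d = X.shape[1]
    # lam = 0 recovers ordinary linear regression, as on the slide
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```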

Bias-Variance Tradeoff
[Figure: MSE, bias, and variance of ridge regression as model complexity is reduced, with the optimal point marked, compared to linear regression.]
- Ridge performs better when linear regression has high variance
- Example: d (dimension) is close to n (training set size)

Coefficient shrinkage
[Figure: ridge coefficient paths for predicting credit card balance.]

GD for Ridge Regression
  θ_j ← θ_j (1 − αλ) − (α/n) Σ_{i=1..n} (h_θ(x^(i)) − y^(i)) x_j^(i)
- The factor (1 − αλ) shrinks θ_j at every step; the remaining term is the usual gradient step
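The same update in NumPy (a sketch with illustrative defaults): identical to the linear-regression step except that θ is first multiplied by the shrinkage factor (1 − αλ):

```python
import numpy as np

def gd_ridge(X, y, lam, alpha=0.01, n_iter=5000):
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n                  # gradient of the data term
        theta = theta * (1 - alpha * lam) - alpha * grad  # shrink, then step
    return theta
```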

Lasso regression
  J(θ) = Σ_{i=1..n} (h_θ(x^(i)) − y^(i))² + λ Σ_{j=1..d} |θ_j|
  (squared residuals + regularization)
- Uses the L1 norm for regularization
- The L1 term is not differentiable at 0, so we cannot simply compute gradients
- Algorithms are based on quadratic programming or other optimization techniques
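One standard optimization technique for this objective (a sketch, not necessarily the one used in the course) is proximal gradient descent: take a gradient step on the squared residuals, then apply soft-thresholding, which is the operation that sets coefficients exactly to 0:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_proximal_gd(X, y, lam, alpha=0.001, n_iter=5000):
    # alpha must be small enough for the iteration to converge
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)  # gradient of (1/2) * squared residuals
        theta = soft_threshold(theta - alpha * grad, alpha * lam)
    return theta
```

This minimizes (1/2) of the squared-residual term plus λ times the L1 norm, which matches the slide's objective up to a rescaling of λ.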

Alternative Formulations
Ridge (L2 regularization):
  min_θ Σ_{i=1..n} (h_θ(x^(i)) − y^(i))²  subject to  Σ_{j=1..d} θ_j² ≤ ε
Lasso (L1 regularization):
  min_θ Σ_{i=1..n} (h_θ(x^(i)) − y^(i))²  subject to  Σ_{j=1..d} |θ_j| ≤ ε

Lasso vs Ridge
- Ridge shrinks all coefficients
- Lasso sets some coefficients to 0 (a sparse solution), performing feature selection
[Figure: L1 (lasso) and L2 (ridge) constraint regions in the (θ_0, θ_1) plane, each with the unconstrained optimum θ̂ marked.]

Lab example

Ridge regression
- Data processing (omit N/A values)
- Fit ridge regression
- Inspect coefficient values and the coefficient norm

Ridge regression
- Fit ridge regression
- Inspect coefficient values and the coefficient norm
- λ controls the parameter size

Lasso regression
- Fit lasso regression
- 13 coefficients are set to zero
- Inspect the coefficient norm
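The lab's own code is not reproduced in this transcription; a rough scikit-learn equivalent of the steps described (the file name and target column are placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge, Lasso

df = pd.read_csv("data.csv").dropna()   # data processing: omit N/A rows
X = df.drop(columns="target").values    # "target" is a placeholder column name
y = df["target"].values

for lam in [0.01, 1.0, 100.0]:
    ridge = Ridge(alpha=lam).fit(X, y)  # scikit-learn's alpha plays the role of lambda
    lasso = Lasso(alpha=lam).fit(X, y)
    print(lam,
          np.linalg.norm(ridge.coef_),  # coefficient norm shrinks as lambda grows
          np.sum(lasso.coef_ == 0))     # lasso sets some coefficients to zero
```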

Outline
- Gradient descent: comparison with the closed-form solution
- Non-linear regression
- Regularization: ridge and lasso regression
- Lab example
- Classification: k nearest neighbors (kNN)
- Cross-validation

Supervised learning
Given training data {(x^(i), y^(i))}, for i = 1, …, n, learn a function f̂ such that f̂(x^(i)) ≈ y^(i).

Classification
The labels are binary or discrete: given x^(1), …, x^(n) and y^(1), …, y^(n), with x^(i) ∈ R^d and y^(i) ∈ {−1, 1}, learn f such that f(x^(i)) = y^(i).

Example 1: Classifying spam email
- Content-related features: use of certain words, word frequencies, language, sentence structure
- Structural features: sender IP address, IP blacklist, DNS information, email server, URL links (non-matching)
- Binary classification: SPAM or HAM

Example 2: Handwritten digit recognition
- Multi-class classification

Example 3: Image classification
- Multi-class classification

Supervised Learning Process
Training (on labeled data):
- Data preprocessing: (typically) normalization, standardization
- Feature extraction: feature selection
- Learning model: classification or regression
Testing (on new, unlabeled data):
- Apply the learned model to obtain predictions
- Classification: e.g., malicious vs. benign; regression: e.g., a risk score
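A sketch of this train-then-predict flow with scikit-learn (the model choice and toy data are illustrative): preprocessing and the model are fit on labeled training data, then both are applied to new, unlabeled data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))        # toy labeled training data
y_train = (X_train[:, 0] > 0).astype(int)
X_new = rng.normal(size=(5, 3))            # new, unlabeled data

# Normalization/standardization followed by a classifier
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)                # training phase
predictions = model.predict(X_new)         # testing phase
```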

K-Nearest Neighbors for multi-class classification
- Vote among multiple classes

Vector distances
- Vector norms: a norm of a vector x is, informally, a measure of the length of the vector
- Common norms: L1, L2 (Euclidean)
- A norm can be used as the distance between vectors x and y: ‖x − y‖_p

Distance norms
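Computing these p-norm distances with NumPy (a minimal sketch with made-up vectors):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])
l1 = np.linalg.norm(x - y, ord=1)  # L1 (Manhattan) distance: 3.0
l2 = np.linalg.norm(x - y, ord=2)  # L2 (Euclidean) distance: sqrt(5)
```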

kNN Algorithm (to classify a point x)
- Find the k nearest points to x (according to the distance metric)
- Perform majority voting to predict the class of x
Properties:
- Does not learn any model during training!
- Instance-based learner (needs all the data at testing time)
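A minimal implementation of exactly these two steps (names are illustrative): there is no training phase, just store the data and vote among the k nearest points at query time:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    dists = np.linalg.norm(X_train - x, axis=1)  # L2 distance from x to every stored point
    nearest = np.argsort(dists)[:k]              # indices of the k nearest points
    votes = Counter(y_train[nearest])            # majority vote among their labels
    return votes.most_common(1)[0][0]
```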

Overfitting!
- How do we choose k (a hyper-parameter)?

How to choose k (hyper-parameter)?
- Small k gives flexible boundaries but risks overfitting; large k gives smoother boundaries but risks underfitting
- Select k by cross-validation
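Selecting k by cross-validation, the outline's final topic (a scikit-learn sketch; the data is a toy placeholder):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))            # toy dataset
y = (X[:, 0] + X[:, 1] > 0).astype(int)

scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in [1, 3, 5, 9, 15]}
best_k = max(scores, key=scores.get)     # k with the highest mean CV accuracy
```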

Review
- Gradient descent is an efficient algorithm for optimization and for training linear regression
  - The most widely used algorithm in ML!
- More complex regression models exist: polynomial and spline regression
- Regularization is a general method to reduce model complexity and avoid overfitting
  - Add a penalty to the loss function
  - Ridge and lasso regression

Acknowledgements
Slides made using resources from: Andrew Ng, Eric Eaton, David Sontag. Thanks!