Lecture Notes for Chapter 5

Similar documents
Lecture Notes for Chapter 5

DATA MINING LECTURE 10B. Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines

CS570: Introduction to Data Mining

LABORATORY OF DATA SCIENCE. Reminds on Data Mining. Data Science & Business Informatics Degree

9 Classification: KNN and SVM

Cloud-Based Machine and Deep Learning. Machine Learning and AI Algorithm

Machine Learning. Chao Lan

Data Mining Classification: Alternative Techniques. Lecture Notes for Chapter 4. Instance-Based Learning. Introduction to Data Mining, 2 nd Edition

Ensemble Learning: An Introduction. Adapted from Slides by Tan, Steinbach, Kumar

CS 584 Data Mining. Classification 3

CS 584 Data Mining. Classification 1

CS570: Introduction to Data Mining

Applying Supervised Learning

Classification and Regression

CISC 4631 Data Mining

Data Mining. Lecture 03: Nearest Neighbor Learning

ECG782: Multidimensional Digital Signal Processing

Naïve Bayes for text classification

Ensemble Learning. Another approach is to leverage the algorithms we have via ensemble methods

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines

Statistics 202: Statistical Aspects of Data Mining

Data Mining Concepts & Techniques

CSE4334/5334 DATA MINING

Recognition Part I: Machine Learning. CSE 455 Linda Shapiro

Data Mining Classification - Part 1 -

More Learning. Ensembles Bayes Rule Neural Nets K-means Clustering EM Clustering WEKA

Contents. Preface to the Second Edition

Metrics Overfitting Model Evaluation Research directions. Classification. Practical Issues. Huiping Cao. lassification-issues, Slide 1/57

Network Traffic Measurements and Analysis

UVA CS 6316/4501 Fall 2016 Machine Learning. Lecture 15: K-nearest-neighbor Classifier / Bias-Variance Tradeoff. Dr. Yanjun Qi. University of Virginia

VECTOR SPACE CLASSIFICATION

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA

Topic 1 Classification Alternatives

UVA CS 4501: Machine Learning. Lecture 10: K-nearest-neighbor Classifier / Bias-Variance Tradeoff. Dr. Yanjun Qi. University of Virginia

Performance Evaluation of Various Classification Algorithms

Data Preprocessing. Supervised Learning

Machine Learning Lecture 3

Statistics 202: Statistical Aspects of Data Mining

STA 4273H: Statistical Machine Learning

Instance-Based Representations. k-nearest Neighbor. k-nearest Neighbor. k-nearest Neighbor. exemplars + distance measure. Challenges.

Supervised vs unsupervised clustering

Random Forest A. Fornaser

Logical Rhythm - Class 3. August 27, 2018

Machine Learning Techniques for Data Mining

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1

Classification Algorithms in Data Mining

Slides for Data Mining by I. H. Witten and E. Frank

Data Mining. Neural Networks

Announcements. CS 188: Artificial Intelligence Spring Generative vs. Discriminative. Classification: Feature Vectors. Project 4: due Friday.

Oliver Dürr. Statistisches Data Mining (StDM) Woche 12. Institut für Datenanalyse und Prozessdesign Zürcher Hochschule für Angewandte Wissenschaften

CS 229 Midterm Review

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule.

Based on Raymond J. Mooney s slides

Data Mining Course Overview

Data Mining. 3.3 Rule-Based Classification. Fall Instructor: Dr. Masoud Yaghini. Rule-Based Classification

Machine Learning 13. week

Generative and discriminative classification techniques

Lecture 9: Support Vector Machines

Chapter 4: Algorithms CS 795

Lecture 3. Oct

7. Boosting and Bagging Bagging

Supervised Learning Classification Algorithms Comparison

Support Vector Machines + Classification for IR

COMPUTATIONAL INTELLIGENCE SEW (INTRODUCTION TO MACHINE LEARNING) SS18. Lecture 6: k-nn Cross-validation Regularization

Data Mining and Data Warehousing Classification-Lazy Learners

Chapter 2: Classification & Prediction

Machine Learning Classifiers and Boosting

Code Mania Artificial Intelligence: a. Module - 1: Introduction to Artificial intelligence and Python:

Machine Learning. Topic 5: Linear Discriminants. Bryan Pardo, EECS 349 Machine Learning, 2013

Data Mining Part 5. Prediction

Analysis: TextonBoost and Semantic Texton Forests. Daniel Munoz Februrary 9, 2009

Machine Learning / Jan 27, 2010

Classification: Basic Concepts, Decision Trees, and Model Evaluation

Data Mining Practical Machine Learning Tools and Techniques. Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank

Information Management course

CSEP 573: Artificial Intelligence

Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models

Semi-supervised learning and active learning

Chapter 3: Supervised Learning

Supervised Learning (contd) Linear Separation. Mausam (based on slides by UW-AI faculty)

Probabilistic Approaches

Part I. Instructor: Wei Ding

Classification Lecture Notes cse352. Neural Networks. Professor Anita Wasilewska

R (2) Data analysis case study using R for readily available data set using any one machine learning algorithm.

Classification: Basic Concepts and Techniques

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Statistical Learning Part 2 Nonparametric Learning: The Main Ideas. R. Moeller Hamburg University of Technology

Classification: Basic Concepts, Decision Trees, and Model Evaluation

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

CSC411 Fall 2014 Machine Learning & Data Mining. Ensemble Methods. Slides by Rich Zemel

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation. Lecture Notes for Chapter 4. Introduction to Data Mining

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering

Advanced Video Content Analysis and Video Compression (5LSH0), Module 8B

Bioinformatics - Lecture 07

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation. Lecture Notes for Chapter 4. Introduction to Data Mining

CS Machine Learning

Transcription:

Classification - Alternative Techniques Lecture tes for Chapter 5 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler Look for accompanying R code on the course web site.

Topics Rule-Based Classifier Nearest Neighbor Classifier Naive Bayes Classifier Artificial Neural Networks Support Vector Machines Ensemble Methods

Rule-Based Classifier Classify records by using a collection of if then rules Rule: (Condition) y - where Condition is a conjunctions of attributes y is the class label - LHS: rule antecedent or condition - RHS: rule consequent - Examples of classification rules: (Blood Type=Warm) (Lay Eggs=Yes) Birds (Taxable Income < 50K) (Refund=Yes) Evade=

Rule-based Classifier (Example) Name R1 human python salmon whale frog komodo bat pigeon cat leopard shark turtle penguin porcupine eel salamander gila monster platypus owl dolphin eagle Blood Type warm cold cold warm cold cold warm warm warm cold cold warm warm cold cold cold warm warm warm warm Give Birth yes yes yes yes yes yes yes Can Fly yes yes yes yes Live in Water yes yes sometimes yes sometimes sometimes yes sometimes yes Class mammals reptiles fishes mammals amphibians reptiles mammals birds mammals fishes reptiles birds mammals fishes amphibians reptiles mammals birds mammals birds R1: (Give Birth = ) (Can Fly = yes) Birds R2: (Give Birth = ) (Live in Water = yes) Fishes R3: (Give Birth = yes) (Blood Type = warm) Mammals R4: (Give Birth = ) (Can Fly = ) Reptiles R5: (Live in Water = sometimes) Amphibians

Application of Rule-Based Classifier A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule R1: (Give Birth = ) (Can Fly = yes) Birds R2: (Give Birth = ) (Live in Water = yes) Fishes R3: (Give Birth = yes) (Blood Type = warm) Mammals R4: (Give Birth = ) (Can Fly = ) Reptiles R5: (Live in Water = sometimes) Amphibians Name hawk grizzly bear Blood Type warm warm Give Birth Can Fly Live in Water Class yes yes?? The rule R1 covers a hawk => Bird The rule R3 covers the grizzly bear => Mammal

Ordered Rule Set vs. Voting Rules are rank ordered according to their priority - An ordered rule set is kwn as a decision list When a test record is presented to the classifier - It is assigned to the class label of the highest ranked rule it has triggered - If ne of the rules fired, it is assigned to the default class R1: (Give Birth = ) (Can Fly = yes) Birds R2: (Give Birth = ) (Live in Water = yes) Fishes R3: (Give Birth = yes) (Blood Type = warm) Mammals R4: (Give Birth = ) (Can Fly = ) Reptiles R5: (Live in Water = sometimes) Amphibians Name turtle Blood Type cold Give Birth Can Fly Live in Water Class sometimes? Alternative: (weighted) voting by all matching rules.

Rule Coverage and Accuracy Coverage of a rule: - Fraction of records that satisfy the antecedent of a rule Accuracy of a rule: - Fraction of records that satisfy both the antecedent and consequent of a rule Tid Refund Marital Status Taxable Income Class 1 Yes Single 125K 2 Married 100K 3 Single 70K 4 Yes Married 120K 5 Divorced 95K Yes 6 Married 7 Yes Divorced 220K 8 Single 85K Yes 9 Married 75K 10 Single 90K Yes 60K 10 (Status=Single) Coverage = 40%, Accuracy = 50%

Rules From Decision Trees Aquatic Creature = was pruned Rules are mutually exclusive and exhaustive (cover all training cases) Rule set contains as much information as the tree Rules can be simplified (similar to pruning of the tree) Example: C4.5rules

Direct Methods of Rule Generation Extract rules directly from the data Sequential Covering (Example: try to cover class +) d R1 R1 c... R2 a b (ii) Step 1 (iii) Step 2 x R1: a>x>b c>y>d class + (iv) Step 3

Advantages of Rule-Based Classifiers As highly expressive as decision trees Easy to interpret Easy to generate Can classify new instances rapidly Performance comparable to decision trees

Topics Rule-Based Classifier Nearest Neighbor Classifier Naive Bayes Classifier Artificial Neural Networks Support Vector Machines Ensemble Methods

Nearest Neighbor Classifiers Basic idea: - If it walks like a duck, quacks like a duck, then it s probably a duck Compute Distance Training Records Choose k of the nearest records Test Record

Nearest-Neighbor Classifiers y Unkwn record k=3 + ++ + + ++ ++ ++ + + x Requires three things - The set of stored records - The value of k, the number of nearest neighbors to retrieve Distance Metric to compute distance between records To classify an unkwn record: - Compute distance to other training records - Identify k nearest neighbors Use class labels of nearest neighbors to determine the class label of unkwn record (e.g., by taking majority vote)

Definition of Nearest Neighbor X (a) 1-nearest neighbor X X (b) 2-nearest neighbor (c) 3-nearest neighbor K-nearest neighbors of a record x are data points that have the k smallest distance to x

Nearest Neighbor Classification Compute distance between two points: - Euclidean distance d ( p, q )= 2 ( p q ) i i i Determine the class from nearest neighbor list - take the majority vote of class labels among the k-nearest neighbors - Weigh the vote according to distance weight factor, w = 1/d 2

Choosing the value of k: - If k is too small, sensitive to ise points - If k is too large, neighborhood may include points from other classes y Nearest Neighbor Classification + ++ ++ + + ++ ++ + + x k is too large!

Scaling issues Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes Example: height of a person may vary from 1.5m to 1.8m weight of a person may vary from 90lb to 300lb income of a person may vary from $10K to $1M -> Income will dominate Euclidean distance Solution: scaling/standardization (Z-Score) Z= X barx sd ( X)

Nearest neighbor Classification k-nn classifiers are lazy learners - It does t build models explicitly (unlike eager learners such as decision trees) - Needs to store all the training data - Classifying unkwn records are relatively expensive (find the knearest neighbors) Advantage: Can create n-linear decision boundaries y + + ++ ++ + + + k=1 x

Topics Rule-Based Classifier Nearest Neighbor Classifier Naive Bayes Classifier Artificial Neural Networks Support Vector Machines Ensemble Methods

Bayes Classifier A probabilistic framework for solving classification problems Conditional Probability: P( A,C ) P(C A )= P( A ) P( A,C ) P( A C )= P(C ) Bayes theorem: P( A C ) P( C ) P(C A )= P( A ) C and A are events. A is called evidence.

Example of Bayes Theorem - A doctor kws that meningitis causes stiff neck 50% of the time P(S M)=.5 - Prior probability of any patient having meningitis is P(M) = 1/50,000=0.00002 - Prior probability of any patient having stiff neck is P(S) = 1/20=0.05 If a patient has stiff neck, what s the probability he/she has meningitis? P( S M ) P( M ) 0. 5 1 /50000 P( M S )= = =0. 0002 1/ 20 P( S ) Increases the probability by x10!

Bayesian Classifiers Consider each attribute and class label as random variables Given a record with attributes (A, A,,A ) - Goal is to predict class C - Specifically, we want to find the value of C that 1 2 maximizes P(C A1, A2,,An ) n

Bayesian Classifiers compute the posterior probability P(C A, A,, A ) for all values of C using the Bayes theorem P (C A 1 A2 A n )= 1 2 n P ( A1 A 2 A n C ) P ( C ) P ( A1 A 2 A n ) Choose value of C that maximizes P(C A1, A2,, An) this is a constant! Equivalent to choosing value of C that maximizes P(A1, A2,, An C) P(C) How to estimate P(A, A,, A 1 2 n C)?

Naïve Bayes Classifier Assume independence among attributes Ai when class is given: - P(A1, A2,, An C) = P(A1 Cj) P(A2 Cj) P(An Cj) - Can estimate P(Ai Cj) for all Ai and Cj. - New point is classified to C j such that: max j ( P (C j ) P( A j C j ))

How to Estimate Probabilities from Data? Class: - e.g., P(C) = Nc / N P(C=) = 7/10, P(C=Yes) = 3/10 For discrete attributes: P(Ai Ck) = Aik / Nc - where A is number of ik instances having attribute Ai and belongs to class Ck - e.g. P(Status=Married C=) = 4/7 P(Refund=Yes C=Yes)=0 10 Tid Refund Marital Status Taxable Income Class 1 Yes Single 125K 2 Married 100K 3 Single 70K 4 Yes Married 120K 5 Divorced 95K Yes 6 Married 7 Yes Divorced 220K 8 Single 85K Yes 9 Married 75K 10 Single 90K Yes 60K

How to Estimate Probabilities from Data? For continuous attributes: - Discretize the range into bins one ordinal attribute per bin violates independence assumption - Two-way split: (A < v) or (A > v) choose only one of the two splits as new attribute - Probability density estimation: Assume attribute follows a rmal distribution Use data to estimate parameters of distribution (e.g., mean and standard deviation) Once probability distribution is kwn, can use it to estimate the conditional probability P(Ai c)

Example of Naïve Bayes Classifier Given a Test Record what is the most likely class? X =( Refund=,Married,Income=120K ) P(X Class=) = P(Refund= Class=) * P(Married Class=) * P(Income=120K Class=) = 4/7 * 4/7 * 0.0072 = 0.0024 P(X Class=Yes) = P(Refund= Class=Yes) * P(Married Class=Yes) * P(Income=120K Class=Yes) = 1 * 0 * 1.2 * 10-9 = 0 Since P(X )P() > P(X Yes)P(Yes) Therefore P( X) > P(Yes X) => Class =

Naïve Bayes Classifier If one of the conditional probability is zero, then the entire expression becomes zero Probability estimation: Original: P ( Ai C )= Laplace: P ( Ai C )= N ic Nc N ic +1 N c +c N ic + mp m-estimate: P ( A i C )= N c+ m c: number of classes p: prior probability m: parameter

Naïve Bayes (Summary) Robust to isolated ise points Handle missing values by igring the instance during probability estimate calculations Robust to irrelevant attributes Independence assumption may t hold for some attributes - Use other techniques such as Bayesian Belief Networks (BBN)

Topics Rule-Based Classifier Nearest Neighbor Classifier Naive Bayes Classifier Artificial Neural Networks Support Vector Machines Ensemble Methods

Artificial Neural Networks (ANN) Model is an assembly of inter-connected des and weighted links Output de sums up each of its input value according to the weights of its links Compare output de against some threshold t Perceptron Model Y =I ( w i X i t ) or i Y =sign( w i X i t ) i

Artificial Neural Networks (ANN) Y =I (0. 3X 1 +0. 3X 2 +0.3X 3 0. 4>0 ) 1 where I ( z )= 0 { if z is true otherwise

General Structure of ANN Training ANN means learning the weights of the neurons

Algorithm for learning ANN Initialize the weights (w, w,, w ) 0 1 k Adjust the weights in such a way that the output of ANN is consistent with class labels of training examples 2 Objective function: E= [ Y i f ( w i, X i ) ] i - Find the weights w s that minimize the above i objective function. Methods: backpropagation algorithm, gradient descend

Deep Learning / Deep Neural Networks Needs lots of data + computation (GPU) Tools: Tensorflow and many others Applications: computer vision, speech recognition, natural language processing, audio recognition, machine translation, bioinformatics, Related: Deep belief networks, recurrent neural networks (RNN), convolutional neural network (CNN),

Topics Rule-Based Classifier Nearest Neighbor Classifier Naive Bayes Classifier Artificial Neural Networks Support Vector Machines Ensemble Methods

Support Vector Machines Find a linear hyperplane (decision boundary) that will separate the data

Support Vector Machines One Possible Solution

Support Vector Machines Ather possible solution

Support Vector Machines Other possible solutions

Support Vector Machines B1 B2 Which one is better? B1 or B2? How do you define better?

Support Vector Machines B1 B2 margin Find hyperplane maximizes the margin => B1 is better than B2 Larger margin = more robust = less expected generalization error

Support Vector Machines What if the problem is t linearly separable? slack Use slack variables to account for violations Use hyperplane that minimizes slack

nlinear Support Vector Machines What if decision boundary is t linear?

nlinear Support Vector Machines Project data into higher dimensional space projection Using the Kernel trick!

Topics Rule-Based Classifier Nearest Neighbor Classifier Naive Bayes Classifier Artificial Neural Networks Support Vector Machines Ensemble Methods

Ensemble Methods Construct a set of (possibly weak) classifiers from the training data Predict class label of previously unseen records by aggregating predictions made by multiple classifiers Improve the stability and often also the accuracy of classifiers. Reduces variance in the prediction Reduces overfitting

General Idea sampling weak learners voting

Why does it work? Suppose there are 25 base classifiers - Each classifier has error rate, = 0.35 - Assume classifiers are independent (different features and/or training data) - Probability that the ensemble classifier makes a wrong prediction: 25 ( i=13 25 ε i ( 1 ε )25 i =0. 06 i ) = Probability that 13 or more classifier make the wrong decision tes 13 is the majority vote The bimial coefficient gives the number of of ways you can choose i out of 25

Examples of Ensemble Methods How to generate an ensemble of classifiers? - Bagging - Boosting - Random Forests

Bagging (Bootstrap Aggregation) 1.Sampling with replacement (bootstrap sampling) Original Data Bagging (Round 1) Bagging (Round 2) Bagging (Round 3) 1 7 1 1 2 8 4 8 3 10 9 5 4 8 1 10 5 2 2 5 6 5 3 5 7 10 2 9 8 10 7 6 9 5 3 3 10 9 2 7 te: some objects are chosen multiple times in a bootstrap sample while others are t chosen! A typical bootstrap sample contains about 63% of the objects in the original data. 2.Build classifier on each bootstrap sample (classifiers are hopefully independent since they are learned from different subsets of the data) 3.Aggregate the classifiers' results by averaging or voting

Boosting Records that are incorrectly classified in one round will have their weights increased in the next Records that are classified correctly will have their weights decreased Original Data Boosting (Round 1) Boosting (Round 2) Boosting (Round 3) 1 7 5 4 2 3 4 4 3 2 9 8 4 8 4 10 5 7 2 4 6 9 5 5 7 4 1 4 8 10 7 6 9 6 4 3 10 3 2 4 Example 4 is hard to classify Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds Popular algorithm: AdaBoost (Adaptive Boosting) typically uses decision trees as the weak lerner.

Random Forests Introduce two sources of randomness: Bagging and Random input vectors Bagging method: each tree is grown using a bootstrap sample of training data Random vector method: At each de, best split is chosen only from a random sample of the m possible attributes.

Other Popular Approaches Logistic Regression Linear Discriminant Analysis Regularized Models (Shrinkage) Stacking