Instance-Based Representations. k-Nearest Neighbor. Exemplars + distance measure. Challenges. (CPSC 444: Artificial Intelligence, Spring 2019)


Instance-Based Representations
- exemplars + a distance measure
- algorithm: IB1 (classify based on the majority class of the k nearest neighbors)
- the learned structure is not explicitly represented

k-Nearest Neighbor: Idea
- find the k nearest neighbors of the item in the dataset
- choose the majority class of those neighbors as the class for the item

k-Nearest Neighbor: Challenges
- choosing k
  - too low: the result can be sensitive to noise
  - too high: the neighborhood may include too many items from other classes
- choice of distance measure
  - it should have the property that a smaller distance means a greater likelihood of belonging to the same class
  - the specific measure may depend on the domain
  - Euclidean distance becomes less discriminating as the number of attributes increases
  - attribute values may need to be scaled to avoid having some dominate
- combining class labels
  - a plain majority vote can be problematic if the neighbors vary widely in distance, since all are given the same weight
  - a weighted vote weights each neighbor's vote by its distance d, commonly as 1/d^2
- classifying an item is relatively expensive, since the k nearest neighbors must be located
  - there are improvements, e.g. condensing (eliminating stored items) and proximity graphs (to find neighbors quickly)
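The following is a minimal sketch of the ideas above in Python with NumPy: Euclidean distance, a k-nearest-neighbor search, and an optional 1/d^2 weighted vote. The toy data and the function names are illustrative, not part of the course materials.

    import numpy as np
    from collections import defaultdict

    def knn_classify(X_train, y_train, x, k=3, weighted=True):
        """Classify x by a (distance-weighted) vote of its k nearest neighbors."""
        # Euclidean distance from x to every stored exemplar
        # (attributes are assumed to be on comparable scales; otherwise rescale first)
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]          # indices of the k closest exemplars
        votes = defaultdict(float)
        for i in nearest:
            # weighted vote: 1/d^2 (small epsilon guards against d == 0);
            # unweighted vote: every neighbor counts equally
            votes[y_train[i]] += 1.0 / (dists[i] ** 2 + 1e-12) if weighted else 1.0
        return max(votes, key=votes.get)

    # toy numeric data: two attributes, two classes
    X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.5, 4.8]])
    y_train = np.array(["a", "a", "b", "b"])
    print(knn_classify(X_train, y_train, np.array([1.1, 0.9]), k=3))   # -> "a"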

k-Nearest Neighbor: Strengths
- easy to understand and implement
- fast to build the model
- can perform well in many situations in spite of its simplicity

Support Vector Machines: Idea (binary classification, two classes)
- find the hyperplane that maximizes the margin between the two classes
- margin = the shortest distance between the plane and the item closest to it
- [figure (https://en.wikipedia.org/wiki/support-vector_machine): H1 does not separate the classes, H2 separates them with only a small margin, H3 separates them with the maximum margin]

Example: a linear SVM on the weather data (Weka SMO output).
Kernel used: Linear Kernel: K(x,y) = <x,y>

    0.543 * outlook=sunny + 1.0266 * outlook=overcast + 0.4837 * outlook=rainy
    + 0.2834 * temperature=hot + 0.2614 * temperature=mild + 0.0219 * temperature=cool
    + 1.0219 * humidity=normal + 0.7872 * windy=false + 0.1354

- each attribute=value term is 1 if the input instance matches the specified value, 0 if not
- a result < 0 denotes one class, a result > 0 the other

    a b   <-- classified as
    7 2 | a = yes
    3 2 | b = no
    Correctly Classified Instances   9   64.2857 %

SVM output for each training instance:

    outlook   temperature  humidity  windy  play   output
    rainy     mild         normal    FALSE  yes    -2.4169
    overcast  hot          normal    FALSE  yes    -1.935
    rainy     cool         normal    FALSE  yes    -1.4514
    rainy     mild         high      FALSE  yes    -1.395
    overcast  hot          high      FALSE  yes    -1.2119
    overcast  cool         normal    TRUE   yes    -1.1526
    sunny     cool         normal    FALSE  yes    -1.1526
    overcast  mild         high      TRUE   yes    -0.6049
    sunny     mild         normal    TRUE   yes    -0.4295
    rainy     mild         high      TRUE   no     -0.4247
    rainy     cool         normal    TRUE   no     -0.3702
    sunny     mild         high      FALSE  no      0.1746
    sunny     hot          high      TRUE   no      0.3577
    sunny     hot          high      FALSE  no      0.9618
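As a sketch of how such a linear decision function is evaluated (in Python): each attribute=value term contributes its coefficient when the instance matches, the constant is added, and the sign of the total selects the class. The coefficients are copied from the slide output purely for illustration (their signs may not have survived transcription), so the numbers should not be read as the actual trained model.

    # Evaluate a linear SVM decision function over 0/1 attribute=value indicators.
    # NOTE: coefficients are illustrative placeholders taken from the slide text.
    weights = {
        "outlook=sunny": 0.543, "outlook=overcast": 1.0266, "outlook=rainy": 0.4837,
        "temperature=hot": 0.2834, "temperature=mild": 0.2614, "temperature=cool": 0.0219,
        "humidity=normal": 1.0219, "windy=false": 0.7872,
    }
    bias = 0.1354

    def decision(instance):
        """instance: dict mapping attribute -> value; returns the signed SVM output."""
        total = bias
        for attr, value in instance.items():
            # the indicator is 1 if the instance matches the attribute=value term, else 0
            total += weights.get(f"{attr}={value}", 0.0)
        return total

    x = {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "false"}
    print(decision(x))   # the sign of the value determines the predicted class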

Support Vector Machines: Extensions
- use a soft margin to handle errors: allow some items to be on the wrong side of the plane
- with different kernel functions, SVMs can be used when the classes aren't linearly separable

Example: a polynomial kernel on the weather data (Weka SMO output).
Kernel used: Poly Kernel: K(x,y) = <x,y>^2.0

    0.8235 * <0 0 1 0 1 0 0 0 > * X]
    0.3287 * <1 0 0 0 1 0 1 0 > * X]
    + 0.5026 * <1 0 0 0 1 0 0 1 > * X]
    0.0933 * <0 1 0 0 0 1 1 0 > * X]
    + 0.1628 * <1 0 0 1 0 0 0 0 > * X]
    0.661 * <0 0 1 0 1 0 0 1 > * X]
    0.1146 * <1 0 0 0 0 1 1 1 > * X]
    0.3832 * <0 1 0 0 1 0 0 0 > * X]
    0.088 * <0 1 0 1 0 0 0 1 > * X]
    + 0.1799 * <0 0 1 0 0 1 1 0 > * X]
    + 0.3784

- <x,y> denotes the dot product of the vectors x and y (the sum of the pairwise products of their components)
- X is the input instance to be classified
- <0 0 1 0 1 0 0 0 > * X refers to K(<0 0 1 0 1 0 0 0>, X)

    a b   <-- classified as
    6 3 | a = yes
    3 2 | b = no
    Correctly Classified Instances   8   57.1429 %

SVM output for each training instance:

    outlook   temperature  humidity  windy  play   output
    rainy     mild         normal    FALSE  yes    -1.8843
    overcast  hot          normal    FALSE  yes    -1.7728
    rainy     cool         normal    FALSE  yes    -1.1417
    rainy     mild         high      FALSE  yes    -1.0008
    overcast  hot          high      FALSE  yes    -1.0003
    overcast  cool         normal    TRUE   yes    -1
    sunny     cool         normal    FALSE  yes    -0.9994
    overcast  mild         high      TRUE   yes    -0.9993
    sunny     mild         normal    TRUE   yes    -0.9992
    rainy     mild         high      TRUE   no      0.999
    rainy     cool         normal    TRUE   no      0.9997
    sunny     mild         high      FALSE  no      0.9997
    sunny     hot          high      TRUE   no      1.0009
    sunny     hot          high      FALSE  no      1.2399

Support Vector Machines: Multiple Classes
- for more than two classes, use pairwise classification (one-vs-one) or the one-against-all method
  - pairwise: train a separate classifier for each pairing of classes; pick the majority classification
  - one-against-all: train a separate classifier for each class to distinguish that class from everything else; pick the highest-confidence classification
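The two multi-class strategies can be sketched with scikit-learn's wrappers around a binary linear SVM; this illustrates the idea on the iris data and is not the Weka setup used on the next slides.

    from sklearn.datasets import load_iris
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
    from sklearn.svm import LinearSVC

    X, y = load_iris(return_X_y=True)

    # pairwise (one-vs-one): one binary SVM per pair of classes, majority vote wins
    ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)

    # one-against-all (one-vs-rest): one binary SVM per class, highest confidence wins
    ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)

    print("one-vs-one accuracy on the training data: ", ovo.score(X, y))
    print("one-vs-rest accuracy on the training data:", ovr.score(X, y))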

Example: a linear SVM on the iris data, using pairwise classifiers (Weka SMO output).
Kernel used: Linear Kernel: K(x,y) = <x,y>

    Classifier for classes: Iris setosa, Iris versicolor
      0.0459 * sepallength + 0.5219 * sepalwidth + 1.0031 * petallength + 0.4641 * petalwidth   1.4491
    Classifier for classes: Iris setosa, Iris virginica
      0.0095 * sepallength + 0.1796 * sepalwidth + 0.5367 * petallength + 0.2946 * petalwidth   1.5143
    Classifier for classes: Iris versicolor, Iris virginica
      0.5962 * sepallength + 0.972 * sepalwidth + 2.0313 * petallength + 2.008 * petalwidth   6.786

    a  b  c   <-- classified as
    50 0  0 | a = Iris setosa
    0 47  3 | b = Iris versicolor
    0  2 48 | c = Iris virginica
    Correctly Classified Instances   145   96.6667 %

Support Vector Machines: Strengths
- has proven to be robust and accurate in many cases
- does not require large training sets
- not sensitive to the number of dimensions
- efficient training methods
- solid theoretical foundation

Naive Bayes: Idea (binary classification, two classes)
- based on the posterior probability: the probability of an outcome given the evidence
- assume the attributes are independent
- example: for the instance (outlook = rainy, temperature = mild, humidity = normal, windy = FALSE), consider separately, for "yes" outcomes, the probability of a rainy outlook, a mild temperature, a normal humidity, and not windy
  - for independent attributes, the probability of all of these things happening at once is the product of the individual probabilities
  - also factor in the likelihood of a "yes" outcome
  - compare against the same computation for "no" outcomes

Naive Bayes: Formulation
- compute ln( P(1|x) / P(0|x) ) = ln( P(x|1) P(1) / ( P(x|0) P(0) ) )
  - P(i|x) = the probability of x belonging to class i
  - P(i) = the probability of an object belonging to class i
  - P(x|i) = the probability of x within class i
    - if the components of x are independent, it can be estimated as the product of P(x_j|i) over the components x_j of x
  - the sign of the log indicates whether the probability of x belonging to class 1 is larger or smaller than the probability of x belonging to class 0, so the sign of the result indicates the class
- challenge: if the probabilities are estimated from the training set, it could happen that some P(x_j|i) = 0
  - solution: use Laplace smoothing, i.e. use count+1 and total+(number of possible values) instead of the raw count and total
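As a concrete check of the smoothing rule (a one-line Python sketch): in the standard 14-instance weather data, 2 of the 9 "yes" instances have outlook = sunny, and outlook has 3 possible values, so the smoothed estimate is (2+1)/(9+3), which is exactly the 3.0 / 12.0 entry that appears in the Weka output below.

    def laplace(count, total, n_values):
        """Laplace-smoothed estimate of P(value | class)."""
        return (count + 1) / (total + n_values)

    # P(outlook = sunny | play = yes) on the weather data
    print(laplace(2, 9, 3))   # 3/12 = 0.25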

Naive Bayes: Example (weather data, Weka NaiveBayes output)

                      Class
    Attribute           yes      no
                      (0.63)  (0.38)
    ================================
    outlook
      sunny              3.0     4.0
      overcast           5.0     1.0
      rainy              4.0     3.0
      [total]           12.0     8.0
    temperature
      hot                3.0     3.0
      mild               5.0     3.0
      cool               4.0     2.0
      [total]           12.0     8.0
    humidity
      high               4.0     5.0
      normal             7.0     2.0
      [total]           11.0     7.0
    windy
      TRUE               4.0     4.0
      FALSE              7.0     3.0
      [total]           11.0     7.0

- the output uses Laplace smoothing, so counts are increased by 1 and totals are increased by the number of possible values for the attribute (this avoids 0s when there are no training instances with a given value)

    a b   <-- classified as
    7 2 | a = yes
    4 1 | b = no
    Correctly Classified Instances   8   57.1429 %

Value computed for each instance of the weather data (the sign indicates the predicted class):

    outlook   temperature  humidity  windy  play   value
    sunny     hot          high      TRUE   no     -1.7149004637
    sunny     hot          high      FALSE  no     -0.8676026033
    rainy     mild         high      TRUE   no     -0.628710695
    sunny     mild         high      FALSE  no     -0.3567769795
    rainy     mild         high      FALSE  yes     0.2185871654
    sunny     mild         normal    TRUE   yes     0.2718316799
    overcast  mild         high      TRUE   yes     0.6930451449
    overcast  hot          high      FALSE  yes     1.0295173816
    rainy     cool         normal    TRUE   no      1.0295173816
    sunny     cool         normal    FALSE  yes     1.3014510971
    rainy     mild         normal    FALSE  yes     1.6944936852
    rainy     cool         normal    FALSE  yes     1.876815242
    overcast  cool         normal    TRUE   yes     2.3512732216
    overcast  hot          normal    FALSE  yes     2.5054239014

Naive Bayes: More Than Two Classes
- compute P(x|i) P(i) for each class i
- choose the class i that maximizes P(x|i) P(i)

Example (contact lenses data, Weka NaiveBayes output)

                            Class
    Attribute              soft    hard    none
                          (0.22)  (0.19)  (0.59)
    ============================================
    age
      young                 3.0     3.0     5.0
      pre-presbyopic        3.0     2.0     6.0
      presbyopic            2.0     2.0     7.0
      [total]               8.0     7.0    18.0
    spectacle-prescrip
      myope                 3.0     4.0     8.0
      hypermetrope          4.0     2.0     9.0
      [total]               7.0     6.0    17.0
    astigmatism
      no                    6.0     1.0     8.0
      yes                   1.0     5.0     9.0
      [total]               7.0     6.0    17.0
    tear-prod-rate
      reduced               1.0     1.0    13.0
      normal                6.0     5.0     4.0
      [total]               7.0     6.0    17.0

    a b  c   <-- classified as
    4 0  1 | a = soft
    0 1  3 | b = hard
    1 2 12 | c = none
    Correctly Classified Instances   17   70.8333 %
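A short sketch applying the choose-the-largest-P(x|i)P(i) rule to the contact lenses counts above, for the instance (age = young, spectacle-prescrip = hypermetrope, astigmatism = yes, tear-prod-rate = normal). The priors are taken as the smoothed values (count+1)/(24+3), which round to the 0.22 / 0.19 / 0.59 shown; the three scores reproduce, after rounding, the first row of the per-instance table in the next section.

    # score(i) = P(x | i) * P(i), using the Laplace-smoothed counts above
    priors = {"soft": 6/27, "hard": 5/27, "none": 16/27}
    likelihood = {
        "soft": (3/8) * (4/7) * (1/7) * (6/7),      # young, hypermetrope, yes, normal | soft
        "hard": (3/7) * (2/6) * (5/6) * (5/6),      # ... | hard
        "none": (5/18) * (9/17) * (9/17) * (4/17),  # ... | none
    }
    scores = {c: likelihood[c] * priors[c] for c in priors}
    print({c: round(s, 4) for c, s in scores.items()})
    # {'soft': 0.0058, 'hard': 0.0184, 'none': 0.0109} -> the maximum picks "hard"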

Example: class scores for each instance of the contact lenses data

    age             spectacle-prescrip  astigmatism  tear-prod-rate  contact-lenses    soft    hard    none
    young           hypermetrope        yes          normal          hard            0.0058  0.0184  0.0109
    pre-presbyopic  myope               yes          normal          hard            0.0044  0.0245  0.0116
    presbyopic      myope               yes          normal          hard            0.0029  0.0245  0.0135
    young           myope               yes          normal          hard            0.0044  0.0367  0.0096
    young           myope               no           reduced         none            0.0044  0.0015  0.0279
    young           myope               yes          reduced         none            0.0007  0.0073  0.0314
    young           hypermetrope        no           reduced         none            0.0058  0.0007  0.0314
    pre-presbyopic  myope               no           reduced         none            0.0044  0.0010  0.0335
    young           hypermetrope        yes          reduced         none            0.0010  0.0037  0.0353
    pre-presbyopic  myope               yes          reduced         none            0.0007  0.0049  0.0376
    pre-presbyopic  hypermetrope        no           reduced         none            0.0058  0.0005  0.0376
    presbyopic      myope               no           reduced         none            0.0029  0.0010  0.0390
    pre-presbyopic  hypermetrope        yes          normal          none            0.0058  0.0122  0.0130
    presbyopic      hypermetrope        yes          normal          none            0.0039  0.0122  0.0152
    pre-presbyopic  hypermetrope        yes          reduced         none            0.0010  0.0024  0.0423
    presbyopic      myope               yes          reduced         none            0.0005  0.0049  0.0439
    presbyopic      hypermetrope        no           reduced         none            0.0039  0.0005  0.0439
    presbyopic      hypermetrope        yes          reduced         none            0.0006  0.0024  0.0494
    presbyopic      myope               no           normal          none            0.0175  0.0049  0.0120
    presbyopic      hypermetrope        no           normal          soft            0.0233  0.0024  0.0135
    young           myope               no           normal          soft            0.0262  0.0073  0.0086
    pre-presbyopic  myope               no           normal          soft            0.0262  0.0049  0.0103
    young           hypermetrope        no           normal          soft            0.0350  0.0037  0.0096
    pre-presbyopic  hypermetrope        no           normal          soft            0.0350  0.0024  0.0116

Naive Bayes: Strengths
- easy to implement
- easy to interpret / understand the resulting classification
- can be applied to large datasets
- tends to perform well
- frequently used in text classification and spam filtering
- many extensions / modifications

Naive Bayes: Observations
- the assumption that the attributes are independent is not necessarily a problem
  - one can start with attribute selection to eliminate highly correlated attributes
  - even with correlated attributes, results based on the independence assumption aren't necessarily wrong
- for numeric values, either discretize, or assume a normal distribution and compute the probabilities from its density f(x) = (1 / (σ √(2π))) exp(-(x - μ)² / (2σ²)), using the per-class mean μ and standard deviation σ

Example (iris data, numeric attributes, Weka NaiveBayes output)

                      Class
    Attribute       Iris setosa  Iris versicolor  Iris virginica
                      (0.33)         (0.33)           (0.33)
    =============================================================
    sepallength
      mean            4.9913         5.9379           6.5795
      std. dev.       0.355          0.5042           0.6353
      weight sum     50             50               50
      precision       0.1059         0.1059           0.1059
    sepalwidth
      mean            3.4015         2.7687           2.9629
      std. dev.       0.3925         0.3038           0.3088
      weight sum     50             50               50
      precision       0.1091         0.1091           0.1091
    petallength
      mean            1.4694         4.2452           5.5516
      std. dev.       0.1782         0.4712           0.5529
      weight sum     50             50               50
      precision       0.1405         0.1405           0.1405

Ensemble Learning: Idea
- use multiple classifiers to improve on the performance of any single one
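The following is a minimal sketch of that idea using scikit-learn's majority-vote ensemble; the particular base classifiers and the iris data are illustrative choices, not something prescribed by the slides.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import VotingClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # three different classifiers; "hard" voting takes the majority class
    ensemble = VotingClassifier(
        estimators=[
            ("knn", KNeighborsClassifier(n_neighbors=3)),
            ("nb", GaussianNB()),
            ("tree", DecisionTreeClassifier(max_depth=3)),
        ],
        voting="hard",
    )
    ensemble.fit(X, y)
    print(ensemble.score(X, y))   # accuracy of the combined vote on the training data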

AdaBoost
- works with weak classifiers (accuracy just above random chance), often decision stumps (single-level decision trees)
- simple algorithm
- accurate
- often does not overfit (but it can)
- solid theoretical foundation

AdaBoost: Algorithm
- assign equal weights (1/n) to each training instance, where n is the size of the training set
- repeat for T rounds or until there is no further improvement:
  - train a classifier using the current training-set weights
    - if the classifier algorithm can't deal with weights directly, choose training elements in accordance with their weights
  - test the classifier on the training examples and determine its error
  - adjust the weights based on the error: increase the weights of incorrectly classified examples
- to classify, use weighted majority voting among the classifiers from each round
  - each classifier's vote weight is based on its error: more accurate models are given higher weights
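The following is a compact sketch of the algorithm above (a discrete AdaBoost variant using decision stumps from scikit-learn). The specific vote weight alpha = 0.5 * ln((1 - err) / err) and the exponential weight update are the standard textbook formulation, assumed here rather than taken from the slides.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, T=20):
        """Discrete AdaBoost with decision stumps; y must contain +1/-1 labels."""
        n = len(y)
        w = np.full(n, 1.0 / n)                    # start with equal weights 1/n
        stumps, alphas = [], []
        for _ in range(T):
            stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            err = w[pred != y].sum()               # weighted training error
            if err >= 0.5:                         # no better than chance: stop
                break
            alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))   # more accurate -> bigger vote
            stumps.append(stump)
            alphas.append(alpha)
            if err == 0:                           # nothing misclassified, nothing to reweight
                break
            w *= np.exp(-alpha * y * pred)         # increase weights of misclassified examples
            w /= w.sum()                           # renormalize to a distribution
        return stumps, alphas

    def adaboost_predict(stumps, alphas, X):
        votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
        return np.sign(votes)                      # weighted majority vote

    # toy usage
    X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
    y = np.array([1, 1, -1, -1, 1, 1])
    stumps, alphas = adaboost_fit(X, y, T=10)
    print(adaboost_predict(stumps, alphas, X))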