Prediction. What is Prediction. Simple methods for Prediction. Classification by decision tree induction. Classification and regression evaluation


Prediction

Prediction
- What is Prediction
- Simple methods for Prediction
- Classification by decision tree induction
- Classification and regression evaluation

Prediction
Goal: to predict the value of a given variable (called the target or objective variable).
If the target variable is:
- Categorical: we have a classification problem;
- Numeric: we have a regression problem.
For each record in the dataset, prediction determines the value of the target attribute. A model is constructed from the training set; the model is then used to predict new data.

Prediction
[Figure: an example of classification. A table of attributes describing customers, with a "customer class" column: the class is known for the examples used to train the model, and unknown for new customers, whose class must be predicted.]

Typical Applications: Classification
- Credit approval: classify a credit application as low risk, high risk, or average risk
- Determine whether a local access on a computer is legal or illegal
- Target marketing (send a catalogue or not?)
- Medical diagnosis
- Text classification (spam vs. not spam)
- Text recognition (optical character recognition)

Typical Applications: Regression
- Stock market: predict future share prices
- Energy demand at a dam
- Wind speed: wind (eolic) energy production
- Travel time prediction: for planning in transport companies
- Water level in a river: for safety and prevention
- Tax income: public budgeting

A two-step process
1. Model construction:
- Each sample (or record, or object, or instance, or example) is assumed to have a value for the target attribute
- The set of samples used for model construction is the training set
- The model is represented as classification rules, decision trees, or mathematical formulae

Model construction
Training data are fed to the prediction algorithm, which produces a model.

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned model: IF rank = professor OR years > 6 THEN tenured = yes

A two-step process
2. Model usage: for predicting future or unknown objects
- Estimate the accuracy of the model: the known target of each test sample is compared with the model's prediction
- For classification: the accuracy rate is the percentage of test-set samples that are correctly classified by the model
- For regression: the mean squared error is the average of the squared errors between each prediction and the corresponding target value
- The test set must be independent of the training set; otherwise over-fitting will occur
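The two evaluation measures can be made concrete with a short Python sketch (an illustration added here, not part of the original slides; the function names are our own):

def accuracy(y_true, y_pred):
    """Fraction of test samples whose predicted class matches the known class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average squared difference between predictions and target values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

print(accuracy(["yes", "no", "yes"], ["yes", "yes", "yes"]))  # 0.666...
print(mean_squared_error([3.0, 5.0], [2.5, 5.5]))             # 0.25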

Model usage
The model is applied first to the testing data, then to unseen data such as (Jeff, Professor, 4): tenured?

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Dataset subsets
- Training set: used in model construction
- Test set: used in model validation
- Pruning set: used in model construction (30% of the training set)
- Train/test split: 70% / 30%
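A rough sketch of such a random 70/30 split (illustrative only; the helper name and seed are assumptions):

import random

def train_test_split(records, train_fraction=0.7, seed=42):
    """Shuffle the records and split them into a training and a test set."""
    rng = random.Random(seed)
    shuffled = records[:]  # copy so the original order is preserved
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(10))
train, test = train_test_split(data)
print(len(train), len(test))  # 7 3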

Overfitting
If your goal is to choose a model in order to predict new values, which one would you choose: b? c? d? (Figure taken from: http://cg.postech.ac.kr/publications/images/ylee06b.png)
Is the model that best fits the training data (the red dots) the one that predicts best? There is no such guarantee! That is why a model should be evaluated on samples that have not been used to train it.

ALGORITHMS

Simplicity first
Simple algorithms often work very well! There are many kinds of simple structure, e.g.:
- One attribute does all the work
- All attributes contribute equally and independently
- A weighted linear combination might do
- Instance-based: use a few prototypes
- Simple logical rules
The success of each method depends on the domain.

1-R ALGORITHM

Inferring rudimentary rules
1R learns a 1-level decision tree (we only discuss classification), i.e. a set of rules that classify an object on the basis of a single attribute.
Basic version:
- One branch for each value of the attribute
- Each branch assigns the most frequent class
- Error rate: proportion of instances that don't belong to the majority class of their branch
- Choose the attribute with the lowest error rate (assumes nominal attributes)

Pseudo-code for 1R
For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
    Calculate the error rate of the attribute's rules
Choose the attribute whose rules have the smallest error rate
Note: "missing" is treated as a separate attribute value.
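The pseudo-code translates almost directly into Python. The sketch below is only an illustration (the function one_r and its data layout, a list of dicts, are our own assumptions, not from the slides):

from collections import Counter, defaultdict

def one_r(instances, attributes, class_attr):
    """Return (best_attribute, rules, errors) following the 1R pseudo-code."""
    best = None
    for attr in attributes:
        # count how often each class appears for each value of this attribute
        counts = defaultdict(Counter)
        for inst in instances:
            counts[inst[attr]][inst[class_attr]] += 1
        # each branch assigns the most frequent class; sum up the errors
        rules, errors = {}, 0
        for value, class_counts in counts.items():
            majority_class, majority_count = class_counts.most_common(1)[0]
            rules[value] = majority_class
            errors += sum(class_counts.values()) - majority_count
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# With the weather data, one_r(weather, ["Outlook", "Temp", "Humidity", "Windy"], "Play")
# would report 4/14 errors for Outlook (Humidity ties with 4 errors as well).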

Evaluating the weather attributes

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No
(Play is the target variable)

Attribute  Rule              Errors  Total errors
Outlook    Sunny -> No       2/5     4/14
           Overcast -> Yes   0/4
           Rainy -> Yes      2/5
Temp       Hot -> No*        2/4     5/14
           Mild -> Yes       2/6
           Cool -> Yes       1/4
Humidity   High -> No        3/7     4/14
           Normal -> Yes     1/7
Windy      False -> Yes      2/8     5/14
           True -> No*       3/6
* indicates a tie

1R rule for this example:
IF overcast THEN Play
ELSE IF sunny THEN Don't Play
ELSE IF rain THEN Play

Dealing with numeric attributes
Discretize numeric attributes (as seen in a previous class):
- Divide each attribute's range into intervals
- Sort instances according to the attribute's values
- Place breakpoints where the (majority) class changes
- This minimizes the total error
Example: temperature from the weather data
Temperature: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
Play:        Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No
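A small sketch of this class-change discretization (our own illustration, not from the slides; the handling of repeated values such as the two 72s is simplified):

def breakpoints_on_class_change(values, classes):
    """Sort (value, class) pairs and place a breakpoint midway wherever the class changes."""
    pairs = sorted(zip(values, classes))
    cuts = []
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if c1 != c2 and v1 != v2:
            cuts.append((v1 + v2) / 2)
    return cuts

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play  = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No",
         "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"]
print(breakpoints_on_class_change(temps, play))
# [64.5, 66.5, 70.5, 77.5, 80.5, 84.0]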

The problem of overfitting
- This procedure is very sensitive to noise: one instance with an incorrect class label will probably produce a separate interval
- High probability of overfitting
- Also: a time-stamp attribute would have zero errors
- Simple solution: enforce a minimum number of instances of the majority class per interval

Discretization example
Example (with min = 3): final partition for the temperature attribute, with a single breakpoint at 77.5:
Temperature: 64  65  68  69  70  71  72  72  75  75 | 80  81  83  85
Play:        Yes No  Yes Yes Yes No  No  Yes Yes Yes | No  Yes Yes No

With overfitting avoidance
Resulting rule set:

Attribute    Rule                       Errors  Total errors
Outlook      Sunny -> No                2/5     4/14
             Overcast -> Yes            0/4
             Rainy -> Yes               2/5
Temperature  <= 77.5 -> Yes             3/10    5/14
             > 77.5 -> No*              2/4
Humidity     <= 82.5 -> Yes             1/7     3/14
             > 82.5 and <= 95.5 -> No   2/6
             > 95.5 -> Yes              0/1
Windy        False -> Yes               2/8     5/14
             True -> No*                3/6

Discussion of 1R
1R was described in a paper by Holte (1993):
- It contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data)
- The minimum number of instances per interval was set to 6 after some experimentation
- 1R's simple rules performed not much worse than much more complex decision trees
Simplicity first pays off!
Reference: "Very Simple Classification Rules Perform Well on Most Commonly Used Datasets", Robert C. Holte, Computer Science Department, University of Ottawa.

BAYESIAN CLASSIFICATION

Bayesian (statistical) modeling
Opposite of 1R: use all the attributes.
Two assumptions: attributes are
- equally important
- statistically independent (given the class value), i.e., knowing the value of one attribute says nothing about the value of another (if the class is known)
The independence assumption is almost never correct! But this scheme works well in practice.

Bayes Theorem
Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem:
P(h|D) = P(D|h) P(h) / P(D)
The goal is to identify the hypothesis with the maximum posterior probability.
Practical difficulty: requires initial knowledge of many probabilities, and has a significant computational cost.

Bayesian Classification
We want to determine P(C|X), that is, the probability that the record X = <x1, ..., xk> is of class C.
E.g. X = <rain, hot, high, light>: PlayTennis?

Day  Outlook   Temperature  Humidity  Wind    PlayTennis?
1    Sunny     Hot          High      Light   No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Light   Yes
4    Rain      Mild         High      Light   Yes
5    Rain      Cool         Normal    Light   Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Light   No
9    Sunny     Cool         Normal    Light   Yes
10   Rain      Mild         Normal    Light   Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Light   Yes
14   Rain      Mild         High      Strong  No

Bayesian Classification
The classification problem may be formalized using a-posteriori probabilities:
P(C|X) = probability that the record X = <x1, ..., xk> is of class C,
e.g. P(class = N | outlook = sunny, windy = light, ...).
Thus: assign to sample X the class label C such that P(C|X) is maximal.

Estimating a-posteriori probabilities
Bayes theorem: P(C|X) = P(X|C) P(C) / P(X)
- P(X) is constant for all classes
- P(C) = relative frequency of class C samples
- The C such that P(C|X) is maximum is the C such that P(X|C) P(C) is maximum
Problem: computing P(X|C) = P(x1, ..., xk|C) directly is infeasible!

Naive Bayesian Classification
Naïve assumption: attribute independence
P(x1, ..., xk|C) = P(x1|C) P(x2|C) ... P(xk|C)
- If the i-th attribute is categorical: P(xi|C) is estimated as the relative frequency of samples having value xi for the i-th attribute within class C
- If the i-th attribute is continuous: P(xi|C) is estimated through a Gaussian density function
Computationally easy in both cases.
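A compact sketch of this scheme for categorical attributes (an illustration under our own naming, not from the slides; the Gaussian case and the smoothing discussed below are omitted):

from collections import Counter, defaultdict

def train_naive_bayes(instances, attributes, class_attr):
    """Estimate P(C) and P(x_i|C) as relative frequencies from the training set."""
    class_counts = Counter(inst[class_attr] for inst in instances)
    cond_counts = defaultdict(Counter)  # (attribute, class) -> counts of values
    for inst in instances:
        for attr in attributes:
            cond_counts[(attr, inst[class_attr])][inst[attr]] += 1
    priors = {c: n / len(instances) for c, n in class_counts.items()}
    return priors, cond_counts, class_counts

def classify(x, priors, cond_counts, class_counts, attributes):
    """Pick the class C maximizing P(C) * product over i of P(x_i|C)."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for attr in attributes:
            score *= cond_counts[(attr, c)][x[attr]] / class_counts[c]
        scores[c] = score
    return max(scores, key=scores.get), scores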

Naive Bayesian Classifier
The assumption that attributes are conditionally independent greatly reduces the computational cost. Given a training set, we can compute probabilities such as P(sunny|Play):

Outlook   P    N
sunny     2/9  3/5
overcast  4/9  0
rain      3/9  2/5

Play-tennis example: estimating P(C)
Using the play-tennis training data above (9 Yes, 5 No):
P(P) = 9/14
P(N) = 5/14

Play-tennis example: estimating P(xi|C)
From the same training data:

Outlook:      P(sunny|p)    = 2/9   P(sunny|n)    = 3/5
              P(overcast|p) = 4/9   P(overcast|n) = 0
              P(rain|p)     = 3/9   P(rain|n)     = 2/5
Temperature:  P(hot|p)      = 2/9   P(hot|n)      = 2/5
              P(mild|p)     = 4/9   P(mild|n)     = 2/5
              P(cool|p)     = 3/9   P(cool|n)     = 1/5
Humidity:     P(high|p)     = 3/9   P(high|n)     = 4/5
              P(normal|p)   = 6/9   P(normal|n)   = 2/5
Windy:        P(strong|p)   = 3/9   P(strong|n)   = 3/5
              P(light|p)    = 6/9   P(light|n)    = 2/5

Play-tennis example: classifying X
An unseen sample X = <rain, hot, high, light>:
P(play|X) ∝ P(X|play) P(play)
          = P(rain|play) P(hot|play) P(high|play) P(light|play) P(p)
          = 3/9 * 2/9 * 3/9 * 6/9 * 9/14
          = 0.010582

Play-tennis example: classifying X
P(don't play|X) ∝ P(X|don't play) P(don't play)
               = P(rain|don't play) P(hot|don't play) P(high|don't play) P(light|don't play) P(don't play)
               = 2/5 * 2/5 * 4/5 * 2/5 * 5/14
               = 0.018286 > 0.010582
Sample X is classified in class n (don't play).
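The two scores can be checked with a few lines of Python (a quick verification added here, not part of the slides):

# Conditional probabilities read off the play-tennis tables above.
p_yes = (3/9) * (2/9) * (3/9) * (6/9) * (9/14)   # rain, hot, high, light | play
p_no  = (2/5) * (2/5) * (4/5) * (2/5) * (5/14)   # rain, hot, high, light | don't play
print(round(p_yes, 6), round(p_no, 6))            # 0.010582 0.018286
print("don't play" if p_no > p_yes else "play")   # don't play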

The zero-frequency problem
What if an attribute value doesn't occur with some class value? (e.g. "Humidity = high" never occurring with class "yes")
- The conditional probability will be zero: Pr[Humidity = High | yes] = 0
- The a-posteriori probability will also be zero, no matter how likely the other values are: Pr[yes | <..., humidity = high>] = 0
Remedy: add 1 to the count of every attribute value-class combination (Laplace estimator).
Result: probabilities will never be zero! (Also stabilizes probability estimates.)
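In a frequency-based implementation such as the sketch above, the Laplace estimator is a one-line change (illustrative only; n_values, the number of distinct values the attribute can take, is our own parameter name):

def smoothed_conditional(count_value_class, count_class, n_values):
    """Laplace-smoothed estimate of P(x_i|C): add 1 to every value-class count."""
    return (count_value_class + 1) / (count_class + n_values)

# Pr[overcast | n] with raw counts 0/5 and 3 possible outlook values:
print(smoothed_conditional(0, 5, 3))  # 0.125 instead of 0.0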

Modified probability estimates
In some cases adding a constant different from 1 might be more appropriate.
Example: attribute Outlook for class "yes", with a constant µ split equally over the three values:
Sunny: (2 + µ/3) / (9 + µ)    Overcast: (4 + µ/3) / (9 + µ)    Rainy: (3 + µ/3) / (9 + µ)
The weights don't need to be equal (but they must sum to 1):
Sunny: (2 + µ p1) / (9 + µ)    Overcast: (4 + µ p2) / (9 + µ)    Rainy: (3 + µ p3) / (9 + µ)

Missing values
- Training: the instance is not included in the frequency count for that attribute value-class combination
- Classification: the attribute is omitted from the calculation
Example: <Outlook = ?, Temp = Cool, Humidity = High, Windy = Strong>
Likelihood of yes = 3/9 * 3/9 * 3/9 * 9/14 = 0.0238
Likelihood of no = 1/5 * 4/5 * 3/5 * 5/14 = 0.0343
P("yes") = 0.0238 / (0.0238 + 0.0343) = 41%
P("no") = 0.0343 / (0.0238 + 0.0343) = 59%

Numeric attributes
Usual assumption: attributes have a normal (Gaussian) probability distribution, given the class.
The probability density function of the normal distribution is defined by two parameters:
- Sample mean: µ = (1/n) Σ_{i=1..n} x_i
- Standard deviation: σ = sqrt( (1/(n-1)) Σ_{i=1..n} (x_i - µ)^2 )
Then the density function is
f(x) = (1 / sqrt(2π σ^2)) exp( -(x - µ)^2 / (2σ^2) )
(Carl Friedrich Gauss, 1777-1855, great German mathematician)
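A small sketch of this density estimate (illustrative only; it reproduces the example density value shown on the weather-statistics slide below):

import math

def gaussian_density(x, mu, sigma):
    """Normal probability density with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# f(temperature = 66 | yes) with mu = 73, sigma = 6.2:
print(round(gaussian_density(66, 73, 6.2), 4))  # 0.034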

Probability densities
Relationship between probability and density:
Pr[c - ε/2 < x < c + ε/2] ≈ ε · f(c)
But this doesn't change the calculation of a-posteriori probabilities, because ε cancels out.
Exact relationship:
Pr[a ≤ x ≤ b] = ∫_a^b f(t) dt

Statistics for weather data

Outlook:
  Sunny:    Yes 2 (2/9)   No 3 (3/5)
  Overcast: Yes 4 (4/9)   No 0 (0/5)
  Rainy:    Yes 3 (3/9)   No 2 (2/5)
Temperature (numeric): Yes µ = 73, σ = 6.2;  No µ = 75, σ = 7.9
Humidity (numeric):    Yes µ = 79, σ = 10.2; No µ = 86, σ = 9.7
Windy:
  False: Yes 6 (6/9)   No 2 (2/5)
  True:  Yes 3 (3/9)   No 3 (3/5)
Play: Yes 9 (9/14), No 5 (5/14)

Example density value:
f(temperature = 66 | yes) = (1 / (sqrt(2π) · 6.2)) · exp( -(66 - 73)^2 / (2 · 6.2^2) ) = 0.0340

Classifying a new day
A new day: <Outlook = Sunny, Temperature = 66, Humidity = 90, Windy = true>, Play = ?
Likelihood of yes = 2/9 * 0.0340 * 0.0221 * 3/9 * 9/14 = 0.000036
Likelihood of no = 3/5 * 0.0291 * 0.0380 * 3/5 * 5/14 = 0.000136
P("yes") = 0.000036 / (0.000036 + 0.000136) = 20.9%
P("no") = 0.000136 / (0.000036 + 0.000136) = 79.1%
Missing values during training are not included in the calculation of the mean and standard deviation.
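The same calculation can be approximated in code by combining the categorical fractions with gaussian_density from the sketch above (a rough check with the rounded means and standard deviations; the result is close to, but not exactly, the slide's 20.9%):

prior_yes, prior_no = 9/14, 5/14
like_yes = (2/9) * gaussian_density(66, 73, 6.2) * gaussian_density(90, 79, 10.2) * (3/9) * prior_yes
like_no  = (3/5) * gaussian_density(66, 75, 7.9) * gaussian_density(90, 86, 9.7) * (3/5) * prior_no
print(round(like_yes / (like_yes + like_no), 2))  # 0.22 with these rounded parameters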

Naïve Bayes: discussion
- Naïve Bayes works surprisingly well (even when the independence assumption is clearly violated)
- Why? Because classification doesn't require accurate probability estimates as long as the maximum probability is assigned to the correct class
- However: adding too many redundant attributes will cause problems (e.g. identical attributes)
- Note also: many numeric attributes are not normally distributed

Improvements: Naïve Bayes extensions
- Select the best attributes (e.g. with greedy search): often works as well or better with just a fraction of all attributes
- Bayesian networks: combine Bayesian reasoning with causal relationships between attributes; a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG)

Summary
- OneR uses rules based on just one attribute
- Naïve Bayes uses all attributes and Bayes' rule to estimate the probability of the class given an instance
- Simple methods frequently work well, but complex methods can be better (as we will see)

K-NEAREST NEIGHBOR CLASSIFIER

Different Learning Methods
Instance-based learning:
- Learning = storing all training instances
- Classification = assigning the target function value to a new instance
- Processing is delayed until a new instance must be classified; referred to as "lazy learning"
[Cartoon contrasting instance-based learning ("I saw a mouse! It's very similar to a Desktop!!") with eager learning ("Any random movement => it's a mouse").]

The k-Nearest Neighbor
- Nearest neighbor classifiers are based on learning by analogy
- All instances correspond to points in an n-dimensional space
- The nearest neighbors are defined in terms of Euclidean distance (attributes need to be normalized):
d(X, Y) = sqrt( Σ_{i=1..n} (x_i - y_i)^2 )
[Figure: decision regions of a 1-Nearest Neighbor and a 3-Nearest Neighbor classifier.]
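A minimal k-NN sketch (our own illustration, not from the slides; it assumes numeric, already-normalized attributes):

import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two numeric attribute vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_classify(query, training, k=3):
    """training is a list of (attribute_vector, class_label) pairs."""
    neighbors = sorted(training, key=lambda item: euclidean(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

data = [((0.1, 0.2), "A"), ((0.0, 0.3), "A"), ((0.9, 0.8), "B"), ((1.0, 0.7), "B")]
print(knn_classify((0.2, 0.25), data, k=3))  # "A"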

Example
Simple 2-D case: each instance is described by only two values (x, y co-ordinates); the class is one of two symbols shown in the figure (one of the points may be noise).
We need to consider:
1) Similarity (how to calculate distance)
2) The number (and weight) of similar (near) instances

The k-Nearest Neighbor
The target function may be discrete- or real-valued. For a discrete-valued target, k-NN returns the most common value among the k training examples nearest to xq.
- k = 1 leads to an unstable classifier
- A large k means that some of the k points may not be very close to xq
[Figure: query point xq surrounded by training points labeled + and -.]

Discussion on the k-NN Algorithm
- For continuous-valued target functions: return the mean value of the k nearest neighbors
- Distance-weighted nearest neighbor: weight the contribution of each of the k neighbors according to its distance to the query point xq, giving greater weight to closer neighbors, e.g. w = 1 / d(xq, xi)^2 (similarly for real-valued target functions; see the sketch below)
- Robust to noisy data by averaging over the k nearest neighbors
Advantages: training is very fast; can learn complex target functions
Disadvantages: slow at query time; easily fooled by irrelevant attributes
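A hedged sketch of the distance-weighted variant, reusing euclidean and data from the k-NN sketch above (illustrative only):

from collections import defaultdict

def weighted_knn_classify(query, training, k=3):
    """Vote with weight 1/d^2; an exact match (d == 0) wins outright."""
    neighbors = sorted(training, key=lambda item: euclidean(query, item[0]))[:k]
    weights = defaultdict(float)
    for point, label in neighbors:
        d = euclidean(query, point)
        if d == 0:
            return label                 # identical stored instance: take its class
        weights[label] += 1.0 / (d * d)  # closer neighbors get greater weight
    return max(weights, key=weights.get)

print(weighted_knn_classify((0.2, 0.25), data, k=4))  # "A": the two close A's outweigh the far B's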

Remarks on Lazy vs. Eager Learning
Instance-based learning: lazy evaluation. Decision-tree and Bayesian classification: eager evaluation.
Key differences:
- A lazy method may consider the query instance when deciding how to generalize beyond the training data D; an eager method cannot, since it has already chosen its global approximation by the time it sees the query
- Efficiency: lazy methods spend less time training but more time predicting
- Accuracy: a lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function; an eager method must commit to a single hypothesis that covers the entire instance space

References
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2000
- Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 1999
- Tom M. Mitchell, Machine Learning, 1997

Thank you!!!