Data Mining Algorithms: Basic Methods

1 Algorithms: The basic methods Inferring rudimentary rules Data Mining Algorithms: Basic Methods Chapter 4 of Data Mining Statistical modeling Constructing decision trees Constructing rules Association rule learning Linear models Instance-based learning Clustering 2 Simplicity first Simple algorithms often work very well! There are many kinds of simple structure, e.g.: One attribute does all the work All attributes contribute equally & independently A weighted linear combination might do fine Instance-based: use a few prototypes Use simple logical rules Success of method depends on the domain Review: Classification Learning Classification-learning algorithms: take a set of already classified training examples also known as training instances learn a model that can classify previously unseen examples The resulting model works like this: input attributes (everything but the class) model output attribute/class

2 Review: Classification Learning (cont.)
Recall our medical-diagnosis example. Training examples/instances (class/output attribute = Diagnosis):
Patient ID# | Sore Throat | Fever | Swollen Glands | Congestion | Headache | Diagnosis
1 Strep throat
2 Allergy
3 Cold
4 Strep throat
5 Cold
6 Allergy
7 Strep throat
8 Allergy
9 Cold
10 Cold
The learned model:
if Swollen Glands = Yes then Diagnosis = Strep Throat
if Swollen Glands = No and Fever = Yes then Diagnosis = Cold
if Swollen Glands = No and Fever = No then Diagnosis = Allergy

Example Problem: Credit-Card Promotions
A credit-card company wants to determine which customers should be sent promotional materials for a life insurance offer. It needs a model that predicts whether a customer will accept the offer: age, sex, income range, credit-card insurance* -> model -> Yes (will accept the offer) or No (will not accept the offer)
* note: credit-card insurance is a yes/no attribute specifying whether the customer accepted a similar offer for insurance on their credit card

Example Problem: Credit-Card Promotions
15 training examples (Table 3.1 of Roiger & Geatz), class/output attribute: 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K

1R: Learning Simple Classification Rules
Presented by R.C. Holte, University of Ottawa, in the following paper: "Very simple classification rules perform well on most commonly used datasets", Machine Learning, 11 (1993), 63-91.
The paper contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data). The minimum number of instances was set to 6 after some experimentation. 1R's simple rules performed not much worse than much more complex decision trees. Simplicity first pays off!
Why is it called 1R? "R" because the algorithm learns a set of Rules; "1" because the rules are based on only 1 input attribute.
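To make the input-to-output behaviour of such a model concrete, here is a minimal Python sketch of the diagnosis rule set above as a classifier function. The Yes/No attribute values and the dictionary format for an instance are assumptions made for illustration, not part of the original data file.

def diagnose(patient):
    """Apply the learned rule set to one instance (a dict of attribute values)."""
    # Rule 1: swollen glands alone decide strep throat
    if patient["Swollen Glands"] == "Yes":
        return "Strep throat"
    # Rules 2 and 3: otherwise fever separates cold from allergy
    if patient["Fever"] == "Yes":
        return "Cold"
    return "Allergy"

# Example: a previously unseen patient
print(diagnose({"Sore Throat": "No", "Fever": "Yes",
                "Swollen Glands": "No", "Congestion": "Yes", "Headache": "No"}))
# -> Cold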

3 1R: Learning Simple Classification Rules The rules that 1R learns look like this: <attribute-name>: <attribute-val1> <class value> <attribute-val2> <class value> To see how 1R learns the rules, let's consider an example. Applying 1R to the Credit-Card Promotion Data 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K Let's start by determining the rules based on. Applying 1R to the Credit-Card Promotion Data 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K Let's start by determining the rules based on. To do so, we ask the following: when =, what is the most frequent class? when =, what is the most frequent class? Applying 1R to the Credit-Card Promotion Data 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K Let's start by determining the rules based on. To do so, we ask the following: when =, what is the most frequent class? (it appears in 6 out of 7 of those examples) when =, what is the most frequent class?

4 Applying 1R to the Credit-Card Promotion Data 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K Let's start by determining the rules based on. To do so, we ask the following: when =, what is the most frequent class? (it appears in 6 out of 7 of those examples) when =, what is the most frequent class? Applying 1R to the Credit-Card Promotion Data 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K Let's start by determining the rules based on. To do so, we ask the following: when =, what is the most frequent class? (it appears in 6 out of 7 of those examples) when =, what is the most frequent class? (it appears in 5 out of 8 of those examples) Applying 1R to the Credit-Card Promotion Data (cont.) Thus, we end up with the following rules based on : : (6 out of 7) (5 out of 8) 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K Pseudocode for the 1R Algorithm for each input attribute A: for each value V of A: count how often each class appears together with V find the most frequent class F add the rule A = V F to the rules for A calculate and store the accuracy of the rules learned for A choose the rules with the highest overall accuracy So far, we've learned the rules for the attribute : : (6 out of 7) (5 out of 8) overall accuracy =?

5 Pseudocode for the 1R Algorithm for each input attribute A: for each value V of A: count how often each class appears together with V find the most frequent class F add the rule A = V F to the rules for A calculate and store the accuracy of the rules learned for A choose the rules with the highest overall accuracy Pseudocode for the 1R Algorithm for each input attribute A: for each value V of A: count how often each class appears together with V find the most frequent class F add the rule A = V F to the rules for A calculate and store the accuracy of the rules learned for A choose the rules with the highest overall accuracy So far, we've learned the rules for the attribute : : (6 out of 7) (5 out of 8) overall accuracy = (6 + 5)/(7 + 8) = 11/15 = 73% So far, we've learned the rules for the attribute : : (6 out of 7) (5 out of 8) overall accuracy = (6 + 5)/(7 + 8) = 11/15 = 73% Equivalently, we can focus on the error rate and minimize it. error rate of rules above =? Pseudocode for the 1R Algorithm for each input attribute A: for each value V of A: count how often each class appears together with V find the most frequent class F add the rule A = V F to the rules for A calculate and store the accuracy of the rules learned for A choose the rules with the highest overall accuracy So far, we've learned the rules for the attribute : : (6 out of 7) (5 out of 8) overall accuracy = (6 + 5)/(7 + 8) = 11/15 = 73% Applying 1R to the Credit-Card Promotion Data (cont.) What rules would be produced for Credit Card Insurance? Credit Card Insurance: 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K Equivalently, we can focus on the error rate and minimize it. error rate of rules above = = %
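The pseudocode above translates almost directly into Python. The sketch below is one possible rendering, assuming nominal attributes and a training set given as a list of attribute dictionaries plus a parallel list of class labels; the function name one_r is mine, not from the slides.

from collections import Counter, defaultdict

def one_r(instances, labels):
    """Return (best_attribute, rules, accuracy) for 1R over nominal attributes.

    instances: list of dicts mapping attribute name -> value
    labels:    list of class values, parallel to instances
    """
    best = None
    for attr in instances[0]:
        # count how often each class appears together with each value V of attr
        counts = defaultdict(Counter)
        for inst, cls in zip(instances, labels):
            counts[inst[attr]][cls] += 1
        # for each value, the rule predicts its most frequent class
        rules = {val: cls_counts.most_common(1)[0][0]
                 for val, cls_counts in counts.items()}
        correct = sum(cls_counts[rules[val]] for val, cls_counts in counts.items())
        accuracy = correct / len(instances)
        # keep the attribute whose rules have the highest overall accuracy
        if best is None or accuracy > best[2]:
            best = (attr, rules, accuracy)
    return best

Calling one_r on the 15 credit-card examples would reproduce the per-attribute accuracies worked out on these slides (e.g. 11/15 = 73% for the first attribute considered).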

6 Applying 1R to the Credit-Card Promotion Data (cont.) What rules would be produced for Credit Card Insurance? Credit Card Insurance: 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K Applying 1R to the Credit-Card Promotion Data (cont.) 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K What rules would be produced for Credit Card Insurance? Credit Card Insurance: (3 out of 3) Applying 1R to the Credit-Card Promotion Data (cont.) 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K What rules would be produced for Credit Card Insurance? Credit Card Insurance: (3 out of 3) Applying 1R to the Credit-Card Promotion Data (cont.) 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K What rules would be produced for Credit Card Insurance? Credit Card Insurance: (3 out of 3) * (6 out of 12) * when Credit Card Insurance =, the two classes are equally likely, but we choose because otherwise the model would always predict

7 Applying 1R to the Credit-Card Promotion Data (cont.) What rules would be produced for Income Range? Income Range: 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K 20-30K 30-K -50K 50-60K Applying 1R to the Credit-Card Promotion Data (cont.) What rules would be produced for Income Range? Income Range: 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K 20-30K 30-K -50K 50-60K Applying 1R to the Credit-Card Promotion Data (cont.) 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K What rules would be produced for Income Range? Income Range: 20-30K * (2 out of 4) 30-K -50K 50-60K (* would also be a valid choice for 20-30K) Applying 1R to the Credit-Card Promotion Data (cont.) 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K What rules would be produced for Income Range? Income Range: 20-30K * (2 out of 4) 30-K -50K 50-60K (* would also be a valid choice for 20-30K)

8 Applying 1R to the Credit-Card Promotion Data (cont.) 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K What rules would be produced for Income Range? Income Range: 20-30K * (2 out of 4) 30-K (4 out of 5) -50K 50-60K (* would also be a valid choice for 20-30K) Applying 1R to the Credit-Card Promotion Data (cont.) 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K What rules would be produced for Income Range? Income Range: 20-30K * (2 out of 4) 30-K (4 out of 5) -50K 50-60K (* would also be a valid choice for 20-30K) Applying 1R to the Credit-Card Promotion Data (cont.) 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K What rules would be produced for Income Range? Income Range: 20-30K * (2 out of 4) 30-K (4 out of 5) -50K (3 out of 4) 50-60K (* would also be a valid choice for 20-30K) Applying 1R to the Credit-Card Promotion Data (cont.) 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K What rules would be produced for Income Range? Income Range: 20-30K * (2 out of 4) 30-K (4 out of 5) -50K (3 out of 4) 50-60K (* would also be a valid choice for 20-30K)

9 Applying 1R to the Credit-Card Promotion Data (cont.)
50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K
What rules would be produced for Income Range?
Income Range: 20-30K * (2 out of 4) 30-K (4 out of 5) -50K (3 out of 4) 50-60K (2 out of 2)
(* would also be a valid choice for 20-30K)

Another Example: Evaluating the Weather Attributes

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

1R rules and errors for each weather attribute (* indicates a tie):

Attribute  Rules              Errors  Total errors
Outlook    Sunny -> No        2/5     4/14
           Overcast -> Yes    0/4
           Rainy -> Yes       2/5
Temp       Hot -> No*         2/4     5/14
           Mild -> Yes        2/6
           Cool -> Yes        1/4
Humidity   High -> No         3/7     4/14
           Normal -> Yes      1/7
Windy      False -> Yes       2/8     5/14
           True -> No*        3/6

Handling Numeric Attributes
To handle numeric attributes, we need to discretize the range of possible values into subranges called bins or buckets. One way is to sort the training instances by age and look for the binary (two-way) split that leads to the most accurate rules.
50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K

Handling Numeric Attributes (cont.)
Here's one possible binary split for age (instances sorted by age):
: Life Ins: Y N Y Y Y Y Y Y N Y Y N N N N
the corresponding rules are: : <= (5 out of 6) > (5 out of 9)
overall accuracy: 10/15 = 67%

10 Handling Numeric Attributes (cont.) Here's one possible binary split for age: : Life Ins: Y N Y Y Y Y Y Y N Y Y N N N N the corresponding rules are: : <= (5 out of 6) > (5 out of 9) The following is one of the splits with the best overall accuracy: : Life Ins: Y N Y Y Y Y Y Y N Y Y N N N N the corresponding rules are: : <= (9 out of 12) > (3 out of 3) overall accuracy: 10/15 = 67% overall accuracy: 12/15 = 80% Summary of 1R Results : (6 out of 7) (5 out of 8) Cred.Card Ins: (3 out of 3) * (6 out of 12) Income Range: 20-30K * (2 out of 4) 30-K (4 out of 5) -50K (3 out of 4) 50-60K (2 out of 2) : <= (9 out of 12) > (3 out of 3) overall accuracy: 11/15 = 73% overall accuracy: 9/15 = 60% overall accuracy: 11/15 = 73% overall accuracy: 12/15 = 80% Because the rules based on have the highest overall accuracy on the training data, 1R selects them as the model. Special Case: Many-Valued Attributes 1R does not tend to work well with attributes that have many possible values. When such an attribute is present, 1R often ends up selecting its rules. each rule applies to only a small number of examples, which tends to give them a high accuracy However, the rules learned for a many-valued attribute tend not to generalize well. what is this called? Special Case: Many-Valued Attributes 1R does not tend to work well with attributes that have many possible values. When such an attribute is present, 1R often ends up selecting its rules. each rule applies to only a small number of examples, which tends to give them a high accuracy However, the rules learned for a many-valued attribute tend not to generalize well. what is this called? overfitting the training data
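A sketch of the binary-split search just described: sort the instances on the numeric attribute, try each midpoint between consecutive distinct values as a threshold, and score each candidate split by the accuracy of its two majority-class rules. The helper below and its toy data are illustrative assumptions, not the actual credit-card ages.

from collections import Counter

def best_binary_split(values, labels):
    """Find the threshold t maximizing the accuracy of the rule pair (<= t, > t)."""
    pairs = sorted(zip(values, labels))
    best_t, best_acc = None, -1.0
    for i in range(len(pairs) - 1):
        lo, hi = pairs[i][0], pairs[i + 1][0]
        if lo == hi:
            continue                      # no split point between equal values
        t = (lo + hi) / 2
        left = Counter(c for v, c in pairs if v <= t)
        right = Counter(c for v, c in pairs if v > t)
        correct = left.most_common(1)[0][1] + right.most_common(1)[0][1]
        acc = correct / len(pairs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

ages = [21, 25, 30, 35, 44, 50]                 # synthetic ages, not the real data
labels = ["Yes", "Yes", "Yes", "No", "No", "No"]
print(best_binary_split(ages, labels))          # (39.5, 1.0)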

11 Special Case: Many-Valued Attributes (cont.) Patient Sore Swollen ID# Throat Fever Glands Congestion Headache Diagnosis 1 Strep throat 2 Allergy 3 Cold 4 Strep throat 5 Cold 6 Allergy 7 Strep throat 8 Allergy 9 Cold 10 Cold Example: let's say we used 1R on this dataset. what would be the accuracy of rules based on Patient ID#? Special Case: Many-Valued Attributes (cont.) Patient Sore Swollen ID# Throat Fever Glands Congestion Headache Diagnosis 1 Strep throat 2 Allergy 3 Cold 4 Strep throat 5 Cold 6 Allergy 7 Strep throat 8 Allergy 9 Cold 10 Cold Example: let's say we used 1R on this dataset. what would be the accuracy of rules based on Patient ID#? 100%! because Patient ID# is a unique identifier, we get one rule for each ID, which correctly classifies its example! We need to remove identifier fields before running 1R. Special Case: Numeric Attributes Special Case: Numeric Attributes The standard way of handling numeric attributes in 1R is a bit more complicated than the method we presented earlier. allows for more than two bins/buckets place breakpoints where the class changes maximizes total accuracy / minimizes the total error possible alternate discretization: : Life Ins: Y N Y Y Y Y Y Y N Y Y N N N N what's the problem with this discretization? Another example: temperature from weather data Outlook Temperature Humidity Windy Sunny False Sunny True Overcast False Rainy False Play To avoid overfitting, you can specify a minimum bucket size the smallest number of examples allowed in a given bucket.

12 The Problem of Overfitting Example (with minimum bucket size = 3): Resulting rule set: With Overfitting Avoidance : Life Ins: Y N Y Y Y Y Y Y N Y Y N N N N Weather data: Attribute Outlook Temperature Humidity Windy Rules Sunny Overcast Rainy 77.5 > 77.5 * 82.5 > 82.5 and 95.5 > 95.5 False True * Errors 2/5 0/4 2/5 3/10 2/4 1/7 2/6 0/1 2/8 3/6 Total errors 4/14 5/14 3/14 5/14 Limitation of 1R 1R won't work well if many of the input attributes have fewer possible values than the class/output attribute does. Example: our medical diagnosis dataset Patient Sore Swollen ID# Throat Fever Glands Congestion Headache Diagnosis 1 Strep throat 2 Allergy 3 Cold 4 Strep throat 5 Cold 6 Allergy 7 Strep throat 8 Allergy 9 Cold 10 Cold Using 1R as a Baseline When performing classification learning, 1R, can serve as a useful baseline. compare the models from more complex algorithms to the model it produces if a model has a lower accuracy than 1R, it probably isn't worth keeping It also gives insight into which of the input attributes has the most impact on the output attribute. there are three possible classes: Strep Throat, Cold, Allergy binary attributes such as Fever produce rules that predict at most two of these classes: Fever: Cold Allergy

13 0R: Another Useful Baseline The 0R algorithm learns a model that considers none of the input attributes! It simply predicts the majority class in the training data. 0R: Another Useful Baseline The 0R algorithm learns a model that considers none of the input attributes! It simply predicts the majority class in the training data. Example: the credit-card training data 9 examples in which the output is 6 examples in which the output is thus, the 0R model would always predict. gives an accuracy of 9/15 = 60% 0R: Another Useful Baseline The 0R algorithm learns a model that considers none of the input attributes! It simply predicts the majority class in the training data. Example: the credit-card training data 9 examples in which the output is 6 examples in which the output is thus, the 0R model would always predict. gives an accuracy of 9/15 = 60% When performing classification learning, you should use the results of this algorithm to put your results in context. if the 0R accuracy is high, you may want to create training data that is less skewed at the very least, you should include the class breakdown of your training and test sets in your report Statistical modeling Opposite of 1R: use all the attributes Two assumptions: Attributes are equally important statistically independent (given the class value) i.e., knowing the value of one attribute says nothing about the value of another (if the class is known) Independence assumption is never correct! But this scheme works well in practice
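The 0R baseline is only a few lines of Python. The class labels below are placeholders standing in for the two class values of the credit-card data.

from collections import Counter

def zero_r(labels):
    """0R: return the majority class and its accuracy on the training data."""
    counts = Counter(labels)
    majority, count = counts.most_common(1)[0]
    return majority, count / len(labels)

# 9 of the 15 credit-card examples share one class, so 0R scores 9/15 = 60%
print(zero_r(["A"] * 9 + ["B"] * 6))   # ('A', 0.6)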

14 Probabilities for the weather data (counts and fractions, shown as yes | no):
Outlook: Sunny 2 | 3 (2/9 | 3/5), Overcast 4 | 0 (4/9 | 0/5), Rainy 3 | 2 (3/9 | 2/5)
Temperature: Hot 2 | 2 (2/9 | 2/5), Mild 4 | 2 (4/9 | 2/5), Cool 3 | 1 (3/9 | 1/5)
Humidity: High 3 | 4 (3/9 | 4/5), Normal 6 | 1 (6/9 | 1/5)
Windy: False 6 | 2 (6/9 | 2/5), True 3 | 3 (3/9 | 3/5)
Play: 9 | 5 (9/14 | 5/14)

A new day: Outlook = Sunny, Temp. = Cool, Humidity = High, Windy = True, Play = ?
Likelihood of the two classes:
For yes = 2/9 x 3/9 x 3/9 x 3/9 x 9/14 = 0.0053
For no = 3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.0206
Conversion into a probability by normalization:
P(yes) = 0.0053 / (0.0053 + 0.0206) = 20.5%
P(no) = 0.0206 / (0.0053 + 0.0206) = 79.5%

Bayes rule
Thomas Bayes. Born: 1702 in London, England. Died: 1761 in Tunbridge Wells, Kent, England.
For evidence E and event (hypothesis) H:
Pr[H | E] = Pr[E | H] Pr[H] / Pr[E]

Naïve Bayes for classification
Classification learning: what's the probability of the class given an instance?
Evidence E = the instance; event H = the class value for the instance.
Naïve assumption: the evidence splits into parts (i.e. attributes) that are independent given the class, so
Pr[H | E] = Pr[E1 | H] Pr[E2 | H] ... Pr[En | H] Pr[H] / Pr[E]
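A minimal sketch of this calculation in Python, using the counts from the table above; the nested-dictionary layout and function name are assumptions made for readability.

def naive_bayes_posterior(counts, class_counts, instance):
    """counts[cls][attr][val] = training count; returns normalized posteriors."""
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total                         # prior Pr[H]
        for attr, val in instance.items():
            score *= counts[cls][attr][val] / n_cls   # Pr[Ei | H]
        scores[cls] = score
    norm = sum(scores.values())
    # note: no smoothing here, so a zero count zeroes out that class (see the next slide)
    return {cls: s / norm for cls, s in scores.items()}

counts = {
    "yes": {"Outlook": {"Sunny": 2, "Overcast": 4, "Rainy": 3},
            "Temperature": {"Hot": 2, "Mild": 4, "Cool": 3},
            "Humidity": {"High": 3, "Normal": 6},
            "Windy": {"False": 6, "True": 3}},
    "no":  {"Outlook": {"Sunny": 3, "Overcast": 0, "Rainy": 2},
            "Temperature": {"Hot": 2, "Mild": 2, "Cool": 1},
            "Humidity": {"High": 4, "Normal": 1},
            "Windy": {"False": 2, "True": 3}},
}
class_counts = {"yes": 9, "no": 5}
new_day = {"Outlook": "Sunny", "Temperature": "Cool", "Humidity": "High", "Windy": "True"}
print(naive_bayes_posterior(counts, class_counts, new_day))
# roughly {'yes': 0.205, 'no': 0.795}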

15 Weather data example / The zero-frequency problem

What if an attribute value doesn't occur with every class value? (e.g. Outlook = Overcast for class "no".)
Then that conditional probability will be zero: Pr[Outlook = Overcast | no] = 0,
and the a posteriori probability for that class will also be zero, no matter how likely the other values are: Pr[no | E] = 0.
Remedy: add 1 to the count for every attribute value-class combination (the Laplace estimator).
Result: probabilities will never be zero! For Outlook given class "no" this gives
Pr[Sunny | no] = 4/8, Pr[Overcast | no] = 1/8, Pr[Rainy | no] = 3/8.

Modified probability estimates
In some cases adding a constant different from 1 might be more appropriate. Example: attribute Outlook for class "yes", with a constant mu divided among the values by weights p1, p2, p3:
Sunny: (2 + mu*p1) / (9 + mu)   Overcast: (4 + mu*p2) / (9 + mu)   Rainy: (3 + mu*p3) / (9 + mu)
The weights don't need to be equal (but they must sum to 1).

Missing values
Training: the instance is not included in the frequency count for that attribute; probability ratios are based on the number of values that actually occur rather than the total number of instances.
Classification: the attribute is simply omitted from the calculation.
Example: Outlook = ?, Temp. = Cool, Humidity = High, Windy = True, Play = ?
Likelihood of yes = 3/9 x 3/9 x 3/9 x 9/14 = 0.0238
Likelihood of no = 1/5 x 4/5 x 3/5 x 5/14 = 0.0343
P(yes) = 0.0238 / (0.0238 + 0.0343) = 41%
P(no) = 0.0343 / (0.0238 + 0.0343) = 59%
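The Laplace remedy is a one-line change to how each conditional probability is computed. The sketch below uses mu = 1 added per attribute value, which reproduces the Outlook-given-no estimates quoted above; unequal weights p1, p2, p3 would replace the even split.

def smoothed_probability(count, class_total, n_values, mu=1.0):
    """Laplace-style estimate: add mu per value so no probability is ever zero."""
    return (count + mu) / (class_total + mu * n_values)

# Outlook given class "no": counts Sunny = 3, Overcast = 0, Rainy = 2 out of 5
for value, count in [("Sunny", 3), ("Overcast", 0), ("Rainy", 2)]:
    print(value, smoothed_probability(count, 5, 3))   # 4/8, 1/8, 3/8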

16 Numeric attributes

Usual assumption: numeric attributes have a normal or Gaussian probability distribution (given the class). The distribution is defined by two parameters, the sample mean mu and the standard deviation sigma, and its probability density function is

mu = (1/n) * sum of the x_i
sigma = sqrt( (1/(n-1)) * sum of (x_i - mu)^2 )
f(x) = 1 / (sqrt(2*pi) * sigma) * exp( -(x - mu)^2 / (2 * sigma^2) )

Statistics for the weather data (temperature and humidity now numeric; figures shown as yes | no):
Outlook: Sunny 2/9 | 3/5, Overcast 4/9 | 0/5, Rainy 3/9 | 2/5
Temperature: values 64, 68, 65, 71, 69, 70, 72, 80, 72, 85, ...; mean = 73 | 75, sigma = 6.2 | 7.9
Humidity: values 65, 70, 70, 85, 70, 75, 90, 91, 80, 95, ...; mean = 79 | 86, sigma = 10.2 | 9.7
Windy: False 6/9 | 2/5, True 3/9 | 3/5
Play: 9/14 | 5/14

Example density value: f(temperature = 66 | yes) = 1 / (sqrt(2*pi) * 6.2) * exp( -(66 - 73)^2 / (2 * 6.2^2) ) = 0.034

Classifying a new day: Outlook = Sunny, Temp. = 66, Humidity = 90, Windy = true, Play = ?
Likelihood of yes = 2/9 x f(66 | yes) x f(90 | yes) x 3/9 x 9/14
Likelihood of no = 3/5 x f(66 | no) x f(90 | no) x 3/5 x 5/14
P(yes) = 25%   P(no) = 75%
Missing values during training are not included in the calculation of the mean and standard deviation.

Naïve Bayes: discussion
Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated). Why? Because classification doesn't require accurate probability estimates, as long as the maximum probability is assigned to the correct class. However, adding too many redundant attributes will cause problems (e.g. identical attributes). Note also: many numeric attributes are not normally distributed (use kernel density estimators instead).
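A sketch of the density calculation in Python, using the means and standard deviations quoted above. The final class percentages are not asserted here because they depend on how the intermediate densities are rounded.

import math

def gaussian_density(x, mean, std):
    """Normal probability density f(x) for the given mean and standard deviation."""
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# density of temperature = 66 under the "yes" class (mean 73, std 6.2)
print(round(gaussian_density(66, 73, 6.2), 4))   # about 0.034

# unnormalized likelihood of "yes" for the new day: Sunny, temperature 66, humidity 90, windy true
likelihood_yes = (2 / 9) * gaussian_density(66, 73, 6.2) \
               * gaussian_density(90, 79, 10.2) * (3 / 9) * (9 / 14)
print(likelihood_yes)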

17 Review: Decision Trees for Classification We've already seen examples of decision-tree models. example: the tree for our medical-diagnosis dataset: Review: Decision Trees for Classification We've already seen examples of decision-tree models. example: the tree for our medical-diagnosis dataset: Swollen Glands Swollen Glands Strep Throat Fever Strep Throat Fever Cold Allergy Cold Allergy what class would this decision tree assign to the following instance? Patient Sore Swollen ID# Throat Fever Glands Congestion Headache 21 Diagnosis? what class would this decision tree assign to the following instance? Patient Sore Swollen ID# Throat Fever Glands Congestion Headache 21 Diagnosis Cold 1R and Decision Trees We can view the models learned by 1R as simple decision trees with only one decision. here is the model that we learned for the credit-card data: <= > 1R and Decision Trees We can view the models learned by 1R as simple decision trees with only one decision. here is the model that we learned for the credit-card data: <= > here are the rules based on Income Range: Income Range 20-30K 30-K -50K 50-60K
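The tree above can be represented directly as nested data and classified by walking from the root to a leaf. The nested-tuple encoding and the Yes/No branch labels in this Python sketch are assumptions made for illustration.

# Each internal node is (attribute, {value: subtree}); each leaf is a class label.
diagnosis_tree = ("Swollen Glands", {
    "Yes": "Strep throat",
    "No": ("Fever", {"Yes": "Cold", "No": "Allergy"}),
})

def tree_classify(tree, instance):
    """Walk the tree until a leaf (a plain string) is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree

print(tree_classify(diagnosis_tree, {"Swollen Glands": "No", "Fever": "Yes"}))  # Cold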

18 Building Decision Trees How can we build decision trees that use multiple attributes? Here's the basic algorithm: 1. apply 1R to the full set of attributes, but choose the attribute that "best divides" the examples into subgroups Building Decision Trees How can we build decision trees that use multiple attributes? Here's the basic algorithm: 1. apply 1R to the full set of attributes, but choose the attribute that "best divides" the examples into subgroups 2. create a decision based on that attribute and put it in the appropriate place in the existing tree (if any) <= > Building Decision Trees How can we build decision trees that use multiple attributes? Here's the basic algorithm: 1. apply 1R to the full set of attributes, but choose the attribute that "best divides" the examples into subgroups 2. create a decision based on that attribute and put it in the appropriate place in the existing tree (if any) 3. for each subgroup created by the new decision: if the classifications of its examples are "accurate enough" or if there are no remaining attributes to use, do nothing otherwise, repeat the process for the examples in the subgroup <= > Building Decision Trees (cont.) What does it mean to choose the attribute that "best divides" the training instances? overall accuracy still plays a role however, it's not as important, since subsequent decisions can improve the model's accuracy in addition, we want to avoid letting the tree get too large, to prevent overfitting

19 Building Decision Trees (cont.) What does it mean to choose the attribute that "best divides" the training instances? overall accuracy still plays a role however, it's not as important, since subsequent decisions can improve the model's accuracy in addition, we want to avoid letting the tree get too large, to prevent overfitting We'll compute a goodness score for each attribute's rules: goodness = overall accuracy / N Building Decision Trees (cont.) What does it mean to choose the attribute that "best divides" the training instances? overall accuracy still plays a role however, it's not as important, since subsequent decisions can improve the model's accuracy in addition, we want to avoid letting the tree get too large, to prevent overfitting We'll compute a goodness score for each attribute's rules: goodness = overall accuracy / N where N = the number of subgroups that would need to be subdivided further if we chose this attribute. <= > where N = the number of subgroups that would need to be subdivided further if we chose this attribute. dividing by N should help to create a smaller tree Special case: if N == 0 for an attribute, we'll select that attribute. Building a Decision Tree for the Credit-Card Data Here are the rules we obtained for each attribute using 1R: : (6 out of 7) (5 out of 8) Cred.Card Ins: (3 out of 3) * (6 out of 12) accuracy: 11/15 = 73% goodness:? accuracy: 9/15 = 60% goodness:? Building a Decision Tree for the Credit-Card Data Here are the rules we obtained for each attribute using 1R: : (6 out of 7) (5 out of 8) Cred.Card Ins: (3 out of 3) * (6 out of 12) accuracy: 11/15 = 73% goodness: 73/2 = 36.5 accuracy: 9/15 = 60% goodness:? Income Rng: 20-30K * (2 out of 4) 30-K (4 out of 5) -50K (3 out of 4) 50-60K (2 out of 2) accuracy: 11/15 = 73% goodness:? Income Rng: 20-30K * (2 out of 4) 30-K (4 out of 5) -50K (3 out of 4) 50-60K (2 out of 2) accuracy: 11/15 = 73% goodness:? : <= (9 out of 12) > (3 out of 3) accuracy: 12/15 = 80% goodness:? : <= (9 out of 12) > (3 out of 3) accuracy: 12/15 = 80% goodness:?

20 Building a Decision Tree for the Credit-Card Data Here are the rules we obtained for each attribute using 1R: : (6 out of 7) (5 out of 8) Cred.Card Ins: (3 out of 3) * (6 out of 12) accuracy: 11/15 = 73% goodness: 73/2 = 36.5 accuracy: 9/15 = 60% goodness: 60/1 = 60 Building a Decision Tree for the Credit-Card Data Here are the rules we obtained for each attribute using 1R: : (6 out of 7) (5 out of 8) Cred.Card Ins: (3 out of 3) * (6 out of 12) accuracy: 11/15 = 73% goodness: 73/2 = 36.5 accuracy: 9/15 = 60% goodness: 60/1 = 60 Income Rng: 20-30K * (2 out of 4) 30-K (4 out of 5) -50K (3 out of 4) 50-60K (2 out of 2) accuracy: 11/15 = 73% goodness:? Income Rng: 20-30K * (2 out of 4) 30-K (4 out of 5) -50K (3 out of 4) 50-60K (2 out of 2) accuracy: 11/15 = 73% goodness: 73/3 = 24.3 : <= (9 out of 12) > (3 out of 3) accuracy: 12/15 = 80% goodness:? : <= (9 out of 12) > (3 out of 3) accuracy: 12/15 = 80% goodness:? Building a Decision Tree for the Credit-Card Data Here are the rules we obtained for each attribute using 1R: : (6 out of 7) (5 out of 8) Cred.Card Ins: (3 out of 3) * (6 out of 12) accuracy: 11/15 = 73% goodness: 73/2 = 36.5 accuracy: 9/15 = 60% goodness: 60/1 = 60 Building a Decision Tree for the Credit-Card Data (cont.) Because has the highest goodness score, we use it as the first decision in the tree: <= > Income Rng: 20-30K * (2 out of 4) 30-K (4 out of 5) -50K (3 out of 4) 50-60K (2 out of 2) : <= (9 out of 12) > (3 out of 3) accuracy: 11/15 = 73% goodness: 73/3 = 24.3 accuracy: 12/15 = 80% goodness: 80/1 = 80 9 out of 12 3 out of 3 thing further needs to be done to the > subgroup.
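A small helper that reproduces the goodness scores worked out on these slides. In this sketch a subgroup counts toward N when its rule still misclassifies at least one training example, an attribute whose rules are all perfect (N = 0) is selected immediately (represented by an infinite score), and accuracy is rounded to a whole percentage as the slides do. The function name and the list-of-pairs input format are mine.

def goodness(rule_stats):
    """rule_stats: one (correct, total) pair per subgroup created by the attribute."""
    correct = sum(c for c, t in rule_stats)
    total = sum(t for c, t in rule_stats)
    accuracy = round(100 * correct / total)        # whole-percent accuracy, as on the slides
    n = sum(1 for c, t in rule_stats if c < t)     # subgroups that would need further splitting
    return float("inf") if n == 0 else accuracy / n

print(goodness([(6, 7), (5, 8)]))                   # 73 / 2 = 36.5
print(goodness([(3, 3), (6, 12)]))                  # 60 / 1 = 60.0
print(goodness([(2, 4), (4, 5), (3, 4), (2, 2)]))   # 73 / 3 = 24.3...
print(goodness([(9, 12), (3, 3)]))                  # 80 / 1 = 80.0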

21 Building a Decision Tree for the Credit-Card Data (cont.) Because has the highest goodness score, we use it as the first decision in the tree: <= > thing further needs to be done to the > subgroup. We return to step 2 and apply the same procedure to the <= subgroup. this is an example of recursion: applying the same algorithm to a smaller version of the original problem 9 out of 12 3 out of 3 Building a Decision Tree for the Credit-Card Data (cont.) Here are the 12 examples in the <= subgroup: 30 K 50K 30 K 50 60K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 20 30K As before, we sort the examples by and find the most accurate binary split. we'll use a minimum bucket size of 3 : Life Ins: Y N Y Y Y Y Y Y N Y Y N Building a Decision Tree for the Credit-Card Data (cont.) Here are the rules obtained for these 12 examples: : (6 out of 6) * (3 out of 6) Cred.Card Ins: (3 out of 3) (6 out of 9) Income Rng: 20-30K (2 out of 3) 30-K (4 out of 5) -50K * (1 out of 2) 50-60K (2 out of 2) : <= (7 out of 8) > * (2 out of 4) accuracy: 9/12 = 75% goodness: 75/1 = 75 accuracy: 9/12 = 75% goodness: 75/1 = 75 accuracy: 9/12 = 75% goodness: 75/3 = 25 accuracy: 9/12 = 75% goodness: 75/2 = 37.5 Building a Decision Tree for the Credit-Card Data (cont.) Here's the tree that splits the <= subgroup: 6 out of 6 3 out of 6 and Credit Card Insurance are tied for the highest goodness score. We'll pick since it has more examples in the subgroup that doesn't need to be subdivided further.

22 Building a Decision Tree for the Credit-Card Data (cont.) Here's the tree that splits the <= subgroup: It replaces the classification for that subgroup in the earlier tree: <= > 9 out of 12 3 out of 3 6 out of 6 3 out of 6 <= > 3 out of 3 Building a Decision Tree for the Credit-Card Data (cont.) We now recursively apply the same procedure to the 6 examples in the ( <=, = ) subgroup: sort by : Life Ins: N Y Y N Y N We no longer consider. Why? 50K 30 K 30 K 20 30K 30 K 20 30K the only binary split with a minimum bucket size of 3 6 out of 6 3 out of 6 Building a Decision Tree for the Credit-Card Data (cont.) We now recursively apply the same procedure to the 6 examples in the ( <=, = ) subgroup: sort by 50K 30 K 30 K 20 30K 30 K 20 30K : Life Ins: N Y Y N Y N the only binary split with a minimum bucket size of 3 Building a Decision Tree for the Credit-Card Data (cont.) Here are the rules obtained for these 6 examples: Cred.Card Ins: (2 out of 2) (3 out of 4) Income Rng: 20-30K * (1 out of 2) 30-K (2 out of 3) -50K (1 out of 1) 50-60K? (none) : <= (2 out of 3) > (2 out of 3) accuracy: 5/6 = 83.3% goodness:? accuracy: 4/6 = 66.7% goodness:? accuracy: 4/6 = 66.7% goodness:? We no longer consider. Why? because all of the examples have the same value for it

23 Building a Decision Tree for the Credit-Card Data (cont.) Here are the rules obtained for these 6 examples: Cred.Card Ins: (2 out of 2) (3 out of 4) Income Rng: 20-30K * (1 out of 2) 30-K (2 out of 3) -50K (1 out of 1) 50-60K? (none) : <= (2 out of 3) > (2 out of 3) accuracy: 5/6 = 83.3% goodness: 83.3/1 = 83.3 accuracy: 4/6 = 66.7% goodness:? accuracy: 4/6 = 66.7% goodness:? Building a Decision Tree for the Credit-Card Data (cont.) Here are the rules obtained for these 6 examples: Cred.Card Ins: (2 out of 2) (3 out of 4) Income Rng: 20-30K * (1 out of 2) 30-K (2 out of 3) -50K (1 out of 1) 50-60K? (none) : <= (2 out of 3) > (2 out of 3) accuracy: 5/6 = 83.3% goodness: 83.3/1 = 83.3 accuracy: 4/6 = 66.7% goodness: 66.7/2 = 33.3 accuracy: 4/6 = 66.7% goodness:? Building a Decision Tree for the Credit-Card Data (cont.) Here are the rules obtained for these 6 examples: Cred.Card Ins: (2 out of 2) (3 out of 4) Income Rng: 20-30K * (1 out of 2) 30-K (2 out of 3) -50K (1 out of 1) 50-60K? (none) : <= (2 out of 3) > (2 out of 3) Credit Card Insurance has the highest goodness score, so we pick it and create the partial tree at right: accuracy: 5/6 = 83.3% goodness: 83.3/1 = 83.3 accuracy: 4/6 = 66.7% goodness: 66.7/2 = 33.3 accuracy: 4/6 = 66.7% goodness: 66.7/2 = 33.3 Credit Card Insurance 2 out of 2 3 out of 4 Building a Decision Tree for the Credit-Card Data (cont.) This new tree replaces the classification for the ( <=, = ) subgroup in the previous tree: <= > 6 out of 6 3 out of 6 3 out of 3 <= > 3 out of 3 Credit Card Insurance 6 out of 6 2 out of 2 3 out of 4

24 Building a Decision Tree for the Credit-Card Data (cont.) Here are the four instances in the ( <=, =, Cred.Card Ins = ) subgroup: 50K 20 30K 30 K 20 30K Building a Decision Tree for the Credit-Card Data (cont.) Here are the four instances in the ( <=, =, Cred.Card Ins = ) subgroup: 50K 20 30K 30 K 20 30K sort by : Life Ins: N Y N N The only remaining attributes are and Income Range. Income Range won't help, because there are two instances with Income Range = 20-30K, one with Life Ins = class and one with Life Ins =. The only remaining attributes are and Income Range. Income Range won't help, because there are two instances with Income Range = 20-30K, one with Life Ins = class and one with Life Ins =. won't help, because we can't make a binary split that separates the Life Ins = and Life Ins = instances. Building a Decision Tree for the Credit-Card Data (cont.) Here are the four instances in the ( <=, =, Cred.Card Ins = ) subgroup: sort by : Life Ins: N Y N N 50K 20 30K 30 K 20 30K The only remaining attributes are and Income Range. Income Range won't help, because there are two instances with Income Range = 20-30K, one with Life Ins = class and one with Life Ins =. won't help, because we can't make a binary split that separates the Life Ins = and Life Ins = instances. Thus, the algorithm stops here. Building a Decision Tree for the Credit-Card Data (cont.) Here's the final model: <= > 3 out of 3 Credit Card Insurance 6 out of 6 2 out of 2 3 out of 4 It manages to correctly classify all but one training example.

25 Building a Decision Tree for the Credit-Card Data (cont.) How would it classify the following instance? K? Building a Decision Tree for the Credit-Card Data (cont.) How would it classify the following instance? K <= > <= > Credit Card Insurance Credit Card Insurance Other Algorithms for Learning Decision Trees ID3 uses a different goodness score based on a field of study known as information theory doesn t handle numeric attributes C4.5 makes a series of improvements to ID3: the ability to handle numeric input attributes the ability to handle missing values measures that prune the tree after it is built making it smaller to improve its ability to generalize (i.e., to handle noise) Decision Tree Results in Weka Weka's output window gives the tree in text form that looks something like this: total # of examples J48 pruned tree in this subgroup = Credit Card Ins. = : (6.0/1.0) Credit Card Ins. = : (2.0) # that are misclassified = : (7.0/1.0) Both ID3 and C4.5 were developed by Ross Quinlan of the University of Sydney. Weka's implementation of C4.5 is called J48.

26 Decision Tree Results in Weka (cont.)
Right-clicking the name of the model in the result list allows you to view the tree in graphical form.

From Decision Trees to Classification Rules
Any decision tree can be turned into a set of rules of the following form:
if <test1> and <test2> and ... then <class> = <value>
where the condition is formed by combining the tests used to get from the top of the tree to one of the leaves.

From Decision Trees to Classification Rules (cont.)
Here are the rules for this tree:
if > then Life Ins =
if <= and = then Life Ins =
if <= and = and Cred Card Ins = then Life Ins =
if <= and = and Cred Card Ins = then Life Ins =

Advantages and Disadvantages of Decision Trees
Advantages: easy to understand; can be converted to a set of rules, which makes it easier to actually use the model for classification; can handle both nominal and numeric input attributes (except for ID3, which is limited to nominal attributes).
Disadvantages: the class attribute must be nominal; slight changes in the set of training examples can produce a significantly different decision tree (we say that the tree-building algorithm is unstable).
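Converting a tree to rules is a depth-first walk that accumulates the tests along each path from the root to a leaf. The sketch below reuses the nested-tuple tree encoding from the earlier sketch and the diagnosis tree as its example; names are mine.

def tree_to_rules(tree, conditions=()):
    """Yield (conditions, class) pairs, one per leaf of the tree."""
    if isinstance(tree, str):                       # leaf: emit one rule
        yield list(conditions), tree
        return
    attribute, branches = tree
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attribute, value),))

diagnosis_tree = ("Swollen Glands", {
    "Yes": "Strep throat",
    "No": ("Fever", {"Yes": "Cold", "No": "Allergy"}),
})
for conds, cls in tree_to_rules(diagnosis_tree):
    test = " and ".join(f"{a} = {v}" for a, v in conds)
    print(f"if {test} then Diagnosis = {cls}")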

27 Practice Building a Decision Tree Let's apply our decision-tree algorithm to the diagnosis dataset. to allow us to practice with numeric attributes, I've replaced Fever with Temp the person's body temperature Patient Sore Swollen ID# Throat Temp Glands Congestion Headache Diagnosis Strep throat Allergy Cold Strep throat Cold Allergy Strep throat Allergy Cold Cold Practice Building a Decision Tree Let's apply our decision-tree algorithm to the diagnosis dataset. to allow us to practice with numeric attributes, I've replaced Fever with Temp the person's body temperature Patient Sore Swollen ID# Throat Temp Glands Congestion Headache Diagnosis Allergy Cold Cold Allergy Allergy Cold Cold Practice Building a Decision Tree Patient Sore Swollen ID# Throat Temp Glands Congestion Headache Diagnosis Strep throat Allergy Strep throat Allergy Strep throat Allergy Review: Rule Sets Patient Sore Swollen ID# Throat Fever Glands Congestion Headache Diagnosis 1 Strep throat 2 Allergy 3 Cold 4 Strep throat 5 Cold 6 Allergy 7 Strep throat 8 Allergy 9 Cold 10 Cold One possible model that could be used for classifying other patients is a set of rules such as the following: if Swollen Glands == then Diagnosis = Strep Throat if Swollen Glands == and Fever == then Diagnosis = Cold if Swollen Glands == and Fever == then Diagnosis = Allergy Diagnosis? Patient Sore Swollen ID# Throat Fever Glands Congestion Headache 11 Diagnosis?

28 Review: Rule Sets Covering algorithms If tear production rate = reduced then recommendation = none If age = young and astigmatic = no and tear production rate = normal then recommendation = soft If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard If age young and astigmatic = yes and tear production rate = normal then recommendation = hard If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none Recall: may convert a decision tree into a rule set Straightforward, but rule set overly complex More effective conversions are not trivial Instead, can generate rule set directly for each class in turn find rule set that covers all instances in it (excluding instances not in the class) Called a covering approach: at each stage a rule is identified that covers some of the instances Spectacle Tear Production Recommended Prescription Rate Astigmatism Lenses myope young normal? Example: generating a rule Rules vs. trees If true then class = a If x > 1.2 then class = a Possible rule set for class b : If x > 1.2 and y > 2.6 then class = a If x 1.2 then class = b If x > 1.2 and y 2.6 then class = b Could add more rules, get perfect rule set Corresponding decision tree: (produces exactly the same predictions) But: rule sets can be more clear when decision trees suffer from replicated subtrees Also: in multiclass situations, covering algorithm concentrates on one class at a time whereas decision tree learner takes all classes into account

29 Simple covering algorithm Selecting a test Generates a rule by adding tests that maximize rule s accuracy Similar to situation in decision trees: problem of selecting an attribute to split on But: decision tree algorithm maximizes overall purity Each new test reduces rule s coverage: Goal: maximize accuracy t total number of instances covered by rule p positive examples of the class covered by rule t p number of errors made by rule Select test that maximizes the ratio p/t We are finished when p/t = 1 or the set of instances can t be split any further PRISM algorithm for rule induction Example: contact lens data Possible tests: If? then recommendation = hard Rule we seek: Modified rule and resulting data Rule with best test added: If astigmatism = yes then recommendation = hard = = Pre-presbyopic = Presbyopic Spectacle prescription = Myope Spectacle prescription = Hypermetrope Astigmatism = no Astigmatism = yes Tear production rate = Tear production rate = rmal 2/8 1/8 1/8 3/12 1/12 0/12 4/12 0/12 4/12 Instances covered by modified rule: Spectacle prescription Astigmatism Tear production rate Myope Myope rmal Hypermetrope Hypermetrope rmal Pre-presbyopic Myope Pre-presbyopic Myope rmal Pre-presbyopic Hypermetrope Pre-presbyopic Hypermetrope rmal Presbyopic Myope Presbyopic Myope rmal Presbyopic Hypermetrope Presbyopic Hypermetrope rmal Recommended lenses ne Hard ne hard ne Hard ne ne ne Hard ne ne
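The test-selection step can be sketched as follows: for every (attribute, value) pair not yet used in the rule, compute p/t over the currently covered instances and keep the best, breaking ties in favour of larger p (greater coverage). The function below is an illustrative sketch with names of my choosing; on the contact-lens data it would pick astigmatism = yes (4/12) as the first test for the "hard" class, matching the slide.

def best_test(instances, labels, target_class, used_attributes=()):
    """Pick the (attribute, value) test maximizing p/t for the target class."""
    best, best_key = None, None
    attributes = [a for a in instances[0] if a not in used_attributes]
    for attr in attributes:
        for value in {inst[attr] for inst in instances}:
            covered = [cls for inst, cls in zip(instances, labels) if inst[attr] == value]
            t = len(covered)
            p = sum(1 for cls in covered if cls == target_class)
            key = (p / t, p)            # accuracy first, coverage breaks ties
            if best_key is None or key > best_key:
                best, best_key = (attr, value), key
    return best, best_key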

30 Further refinement Modified rule and resulting data Current state: Possible tests: If astigmatism = yes and? then recommendation = hard Rule with best test added: If astigmatism = yes and tear production rate = normal then recommendation = hard = = Pre-presbyopic = Presbyopic Spectacle prescription = Myope Spectacle prescription = Hypermetrope Tear production rate = Tear production rate = rmal 2/4 1/4 1/4 3/6 1/6 0/6 4/6 Instances covered by modified rule: Spectacle prescription Astigmatism Tear production rate Myope rmal Hypermetrope rmal Pre-presbyopic Myope rmal Pre-presbyopic Hypermetrope rmal Presbyopic Myope rmal Presbyopic Hypermetrope rmal Recommended lenses Hard hard Hard ne Hard ne Further refinement The result Current state: If astigmatism = yes and tear production rate = normal and? then recommendation = hard Possible tests: = = Pre-presbyopic = Presbyopic Spectacle prescription = Myope Spectacle prescription = Hypermetrope 2/2 1/2 1/2 3/3 1/3 If astigmatism = yes and tear production rate = normal and spectacle prescription Final rule: = myope then recommendation = hard p/t = 3/3 = 1, so this rule is finished But 1 instance still isn t covered so we start a new rule Tie between the first and the fourth test We choose the one with greater coverage

31 Remove instances of rule #1 from dataset Spectacle prescription Astigmatism Tear production rate Recommended lenses Myope ne Myope rmal Soft Myope ne Hypermetrope ne Hypermetrope rmal Soft Hypermetrope ne Hypermetrope rmal hard Pre-presbyopic Myope ne Pre-presbyopic Myope rmal Soft Pre-presbyopic Myope ne Pre-presbyopic Hypermetrope ne Pre-presbyopic Hypermetrope rmal Soft Pre-presbyopic Hypermetrope ne Pre-presbyopic Hypermetrope rmal ne Presbyopic Myope ne Presbyopic Myope rmal ne Presbyopic Myope ne Presbyopic Hypermetrope ne Presbyopic Hypermetrope rmal Soft Presbyopic Hypermetrope ne Presbyopic Hypermetrope rmal ne 121 Possible tests: PRISM algorithm for second rule If? then recommendation = hard Rule we seek: = 1/7 = Pre-presbyopic 0/7 = Presbyopic 0/7 Spectacle prescription = Myope 0/9 Spectacle prescription = Hypermetrope 1/12 Astigmatism = no 0/12 Astigmatism = yes 1/9 Tear production rate = 0/12 Tear production rate = rmal 1/9 Modified rule #2 and resulting data Further refinement of rule #2 Rule #2 with best test added: If age = young then recommendation = hard p/t = 1/7 so not done with rule Instances covered by modified rule: Spectacle prescription Myope Myope Myope Hypermetrope Hypermetrope Hypermetrope Hypermetrope Astigmatism Tear production rate rmal rmal rmal Recommended lenses ne Soft ne ne Soft ne hard Current state: Possible tests: Astigmatism = Astigmatism = Spectacle prescription = Myope Spectacle prescription = Hypermetrope Tear production rate = Tear production rate = rmal If age = young and? then recommendation = hard 1/3 0/4 0/3 1/4 0/4 1/3

32 Modified rule #2 and resulting data Further refinement Current state: Rule #2 with best test added: If age = young and astigmatism = yes then recommendation = hard p/t = 1/3, so continue Instances covered by modified rule: Spectacle prescription Myope Hypermetrope Hypermetrope Astigmatism Tear production rate rmal Recommended lenses ne ne hard If age = young and astigmatism = yes and? then recommendation = hard Possible tests: Tear production rate = Tear Production rate = rmal Spectacle prescription = Myope Spectacle prescription = Hypermetrope 0/2 1/1 0/1 1/2 The result for rule #2 If age = young and astigmatism = yes and tear production Final rule: rate = normal then recommendation = hard p/t = 1/1 = 1, so this rule is finished All four hard instances now covered Another example 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K These two rules cover all hard lenses : Process is repeated with other two classes Starting rule: if? then Life Insurance = Can use our previous 1R work to save time in the first step of the algorithm te that PRISM requires all attributes to be nominal, so will have to be discretized before the algorithm begins

33 Possible Tests : (1 out of 7) (5 out of 8) Cred.Card Ins: (0 out of 3) (6 out of 12) Income Range: 20-30K (2 out of 4) 30-K (1 out of 5) -50K (3 out of 4) 50-60K (0 out of 2) : <= (3 out of 12) > (3 out of 3) *** The rule is thus refined to: if > then Life Insurance = p/t = 3/3 = 1 Therefore the covering algorithm ends with no further refinement But this covers only ½ of the NOs need another rule Repeat Algorithm for New Rule Remove the instances covered by the first rule and repeat the algorithm Possible tests: : (0 out of 6) (3 out of 6) *** Cred.Card Ins: (0 out of 3) (3 out of 9) Income Range: 20-30K (1 out of 3) 30-K (1 out of 5) -50K (1 out of 2) 50-60K (0 out of 2) 30 K 50K 30 K 50 60K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 20 30K : <= (3 out of 12) > (0 out of 0) New Rule This gives the new rule: if = then Life Insurance = p/t = 3/6 =.5 So we re not done yet Next consider only the instances to which this rule applies 50K 30 K 30 K 20 30K 30 K 20 30K Possible tests: Cred.Card Ins: (0 out of 2) (3 out of 4) Income Range: 20-30K (1 out of 2) 30-K (1 out of 3) -50K (1 out of 1) *** 50-60K (0 out of 0) : <= (3 out of 6) > (0 out of 0) Further Refinement of New Rule This gives the new rule: if = and Income Range=-50K then Life Insurance = p/t = 1/1 = 1 Are we done now? With this rule, yes. But we ve covered only 4 instances of NO We need a third rule, so we begin again with the remaining 11 instances: 30 K 30 K 50 60K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 20 30K

34 Possible tests: Repeat Algorithm for Third Rule 30 K 30 K 50 60K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 20 30K : (0 out of 6) (2 out of 5) *** Cred.Card Ins: (0 out of 3) (2 out of 8) Income Range: 20-30K (1 out of 3) 30-K (1 out of 5) -50K (0 out of 1) 50-60K (0 out of 2) : <= (2 out of 11) > (0 out of 0) Third Rule This gives the new rule: if = then Life Insurance = p/t = 2/5 =.4 So we re not done yet Next consider only the instances to which this rule applies 30 K 30 K 20 30K 30 K 20 30K Possible tests: Cred.Card Ins: (0 out of 2) (2 out of 3) *** Income Range: 20-30K (1 out of 2) 30-K (1 out of 3) -50K (0 out of 0) 50-60K (0 out of 0) : <= (2 out of 5) > (0 out of 0) Further Refinement of Third Rule This gives the new rule: if = and Credit Card Insurance= then Life Insurance = p/t = 2/3 =.667 Continue to develop the rule Consider only the instances to which this rule applies: 20 30K 30 K 20 30K Income Range: 20-30K (1 out of 2) 30-K (1 out of 1) *** -50K (0 out of 0) 50-60K (0 out of 0) : <= (2 out of 3) > (0 out of 0) Further Refinement of Third Rule This gives the new rule: if = and Credit Card Insurance= and Income Range=30-K then Life Insurance = p/t = 1/1 = 1 Are we done now? With this rule, yes. But we ve covered only 5 instances of NO Stop to avoid overfitting? Possibly. PRISM says to go on. We need a fourth rule, so we begin again with the remaining 10 instances: 30 K 30 K 50 60K 30 K 20 30K 30 K 50K 20 30K 50 60K 20 30K

35 Possible tests: Repeat Algorithm for Fourth Rule 30 K 30 K 50 60K 30 K 20 30K 30 K 50K 20 30K 50 60K 20 30K : (0 out of 6) (1 out of 4) Cred.Card Ins: (0 out of 3) (1 out of 7) Income Range: 20-30K (1 out of 3) *** 30-K (0 out of 4) -50K (0 out of 1) 50-60K (0 out of 2) : <= (1 out of 10) > (0 out of 0) Fourth Rule This gives the new rule: if Income Range=20-30K then Life Insurance = p/t = 1/3 =.333 Next consider only the instances to which this rule applies 20 30K 20 30K 20 30K Possible tests: Cred.Card Ins: (0 out of 1) (1 out of 2) *** : (0 out of 1) (1 out of 2) : <= (1 out of 3) > (0 out of 0) Here we can clearly see that there will be no way to get p/t=1. So this rule is abandoned to avoid overfitting. Conclusion That makes the rule set: if > then Life Insurance = if = and Income Range=-50K then Life Insurance = if = and Credit Card Insurance= and Income Range=30-K then Life Insurance = One instance is still not covered Attempt to make a fourth rule failed outlier? May have made judgement call not even to try Pseudo-code for PRISM For each class C Initialize E to the instance set While E contains instances in class C Create a rule R with an empty left-hand side that predicts class C Until R is perfect (or there are no more attributes to use) do For each attribute A not mentioned in R, and each value v, Consider adding the condition A = v to the left-hand side of R Select A and v to maximize the accuracy p/t (break ties by choosing the condition with the largest p) Add A = v to R Remove the instances covered by R from E Practice on your own: Derive the rules for Life Insurance = Derive the rules for Lense Recommendation = Soft Derive the rules for Lense Recommentation = ne
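The PRISM pseudocode above maps onto Python roughly as follows. This is a sketch rather than Weka's implementation: it assumes nominal attributes, represents instances as dictionaries, and grows each rule until it is perfect or runs out of attributes, exactly as the pseudocode says.

def prism(instances, labels):
    """Learn a rule list grouped by class: [(class, [(attr, value), ...]), ...]."""
    rules = []
    for target in sorted(set(labels)):
        data = list(zip(instances, labels))          # E: start from the full instance set
        while any(cls == target for _, cls in data):
            conditions, covered = [], data
            # grow the rule until it is perfect or there are no more attributes to use
            while (any(cls != target for _, cls in covered)
                   and len(conditions) < len(instances[0])):
                used = {a for a, _ in conditions}
                best, best_key = None, None
                for attr in instances[0]:
                    if attr in used:
                        continue
                    for value in {inst[attr] for inst, _ in covered}:
                        subset = [(i, c) for i, c in covered if i[attr] == value]
                        p = sum(1 for _, c in subset if c == target)
                        key = (p / len(subset), p)   # maximize p/t, break ties on p
                        if best_key is None or key > best_key:
                            best, best_key = (attr, value), key
                conditions.append(best)
                covered = [(i, c) for i, c in covered if i[best[0]] == best[1]]
            rules.append((target, conditions))
            # remove the instances covered by the rule from E and repeat
            data = [(i, c) for i, c in data
                    if not all(i[a] == v for a, v in conditions)]
    return rules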

Instance-Based Representations. k-nearest Neighbor. k-nearest Neighbor. k-nearest Neighbor. exemplars + distance measure. Challenges.

Instance-Based Representations. k-nearest Neighbor. k-nearest Neighbor. k-nearest Neighbor. exemplars + distance measure. Challenges. Instance-Based Representations exemplars + distance measure Challenges. algorithm: IB1 classify based on majority class of k nearest neighbors learned structure is not explicitly represented choosing k
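A minimal k-nearest-neighbour sketch: exemplars are stored as numeric vectors, the distance measure is Euclidean, and classification is by majority vote among the k closest exemplars. The toy data and names are illustrative assumptions.

import math
from collections import Counter

def knn_classify(train, labels, query, k=3):
    """Predict the majority class among the k nearest training exemplars."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbours = sorted(zip(train, labels), key=lambda tl: distance(tl[0], query))[:k]
    return Counter(cls for _, cls in neighbours).most_common(1)[0][0]

# toy 2-D exemplars
train = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
labels = ["a", "a", "b", "b"]
print(knn_classify(train, labels, (1.1, 1.0), k=3))   # 'a'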


More information

Representing structural patterns: Reading Material: Chapter 3 of the textbook by Witten

Representing structural patterns: Reading Material: Chapter 3 of the textbook by Witten Representing structural patterns: Plain Classification rules Decision Tree Rules with exceptions Relational solution Tree for Numerical Prediction Instance-based presentation Reading Material: Chapter

More information

BITS F464: MACHINE LEARNING

BITS F464: MACHINE LEARNING BITS F464: MACHINE LEARNING Lecture-16: Decision Tree (contd.) + Random Forest Dr. Kamlesh Tiwari Assistant Professor Department of Computer Science and Information Systems Engineering, BITS Pilani, Rajasthan-333031

More information

Machine Learning in Real World: C4.5

Machine Learning in Real World: C4.5 Machine Learning in Real World: C4.5 Industrial-strength algorithms For an algorithm to be useful in a wide range of realworld applications it must: Permit numeric attributes with adaptive discretization

More information

Data Mining and Knowledge Discovery: Practice Notes

Data Mining and Knowledge Discovery: Practice Notes Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 2016/01/12 1 Keywords Data Attribute, example, attribute-value data, target variable, class, discretization

More information

Unsupervised: no target value to predict

Unsupervised: no target value to predict Clustering Unsupervised: no target value to predict Differences between models/algorithms: Exclusive vs. overlapping Deterministic vs. probabilistic Hierarchical vs. flat Incremental vs. batch learning

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

9/6/14. Our first learning algorithm. Comp 135 Introduction to Machine Learning and Data Mining. knn Algorithm. knn Algorithm (simple form)

9/6/14. Our first learning algorithm. Comp 135 Introduction to Machine Learning and Data Mining. knn Algorithm. knn Algorithm (simple form) Comp 135 Introduction to Machine Learning and Data Mining Our first learning algorithm How would you classify the next example? Fall 2014 Professor: Roni Khardon Computer Science Tufts University o o o

More information

Normalization and denormalization Missing values Outliers detection and removing Noisy Data Variants of Attributes Meta Data Data Transformation

Normalization and denormalization Missing values Outliers detection and removing Noisy Data Variants of Attributes Meta Data Data Transformation Preprocessing Data Normalization and denormalization Missing values Outliers detection and removing Noisy Data Variants of Attributes Meta Data Data Transformation Reading material: Chapters 2 and 3 of

More information

Lecture 5 of 42. Decision Trees, Occam s Razor, and Overfitting

Lecture 5 of 42. Decision Trees, Occam s Razor, and Overfitting Lecture 5 of 42 Decision Trees, Occam s Razor, and Overfitting Friday, 01 February 2008 William H. Hsu, KSU http://www.cis.ksu.edu/~bhsu Readings: Chapter 3.6-3.8, Mitchell Lecture Outline Read Sections

More information

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018 MIT 801 [Presented by Anna Bosman] 16 February 2018 Machine Learning What is machine learning? Artificial Intelligence? Yes as we know it. What is intelligence? The ability to acquire and apply knowledge

More information

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio

More information

CS513-Data Mining. Lecture 2: Understanding the Data. Waheed Noor

CS513-Data Mining. Lecture 2: Understanding the Data. Waheed Noor CS513-Data Mining Lecture 2: Understanding the Data Waheed Noor Computer Science and Information Technology, University of Balochistan, Quetta, Pakistan Waheed Noor (CS&IT, UoB, Quetta) CS513-Data Mining

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

What is Learning? CS 343: Artificial Intelligence Machine Learning. Raymond J. Mooney. Problem Solving / Planning / Control.

What is Learning? CS 343: Artificial Intelligence Machine Learning. Raymond J. Mooney. Problem Solving / Planning / Control. What is Learning? CS 343: Artificial Intelligence Machine Learning Herbert Simon: Learning is any process by which a system improves performance from experience. What is the task? Classification Problem

More information

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A.

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Output: Knowledge representation Tables Linear models Trees Rules

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Decision Tree Example Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short} Class: Country = {Gromland, Polvia} CS4375 --- Fall 2018 a

More information

7. Decision or classification trees

7. Decision or classification trees 7. Decision or classification trees Next we are going to consider a rather different approach from those presented so far to machine learning that use one of the most common and important data structure,

More information

Decision tree learning

Decision tree learning Decision tree learning Andrea Passerini passerini@disi.unitn.it Machine Learning Learning the concept Go to lesson OUTLOOK Rain Overcast Sunny TRANSPORTATION LESSON NO Uncovered Covered Theoretical Practical

More information

Data Mining and Machine Learning: Techniques and Algorithms

Data Mining and Machine Learning: Techniques and Algorithms Instance based classification Data Mining and Machine Learning: Techniques and Algorithms Eneldo Loza Mencía eneldo@ke.tu-darmstadt.de Knowledge Engineering Group, TU Darmstadt International Week 2019,

More information

Decision Tree Learning

Decision Tree Learning Decision Tree Learning 1 Simple example of object classification Instances Size Color Shape C(x) x1 small red circle positive x2 large red circle positive x3 small red triangle negative x4 large blue circle

More information

Induction of Decision Trees

Induction of Decision Trees Induction of Decision Trees Blaž Zupan and Ivan Bratko magixfriuni-ljsi/predavanja/uisp An Example Data Set and Decision Tree # Attribute Class Outlook Company Sailboat Sail? 1 sunny big small yes 2 sunny

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

Data Mining and Knowledge Discovery Practice notes: Numeric Prediction, Association Rules

Data Mining and Knowledge Discovery Practice notes: Numeric Prediction, Association Rules Keywords Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 06/0/ Data Attribute, example, attribute-value data, target variable, class, discretization Algorithms

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Output: Knowledge representation Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter of Data Mining by I. H. Witten and E. Frank Decision tables Decision trees Decision rules

More information

COMP 465: Data Mining Classification Basics

COMP 465: Data Mining Classification Basics Supervised vs. Unsupervised Learning COMP 465: Data Mining Classification Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Supervised

More information

Association Rules. Charles Sutton Data Mining and Exploration Spring Based on slides by Chris Williams and Amos Storkey. Thursday, 8 March 12

Association Rules. Charles Sutton Data Mining and Exploration Spring Based on slides by Chris Williams and Amos Storkey. Thursday, 8 March 12 Association Rules Charles Sutton Data Mining and Exploration Spring 2012 Based on slides by Chris Williams and Amos Storkey The Goal Find patterns : local regularities that occur more often than you would

More information

WEKA: Practical Machine Learning Tools and Techniques in Java. Seminar A.I. Tools WS 2006/07 Rossen Dimov

WEKA: Practical Machine Learning Tools and Techniques in Java. Seminar A.I. Tools WS 2006/07 Rossen Dimov WEKA: Practical Machine Learning Tools and Techniques in Java Seminar A.I. Tools WS 2006/07 Rossen Dimov Overview Basic introduction to Machine Learning Weka Tool Conclusion Document classification Demo

More information

1. make a scenario and build a bayesian network + conditional probability table! use only nominal variable!

1. make a scenario and build a bayesian network + conditional probability table! use only nominal variable! Project 1 140313 1. make a scenario and build a bayesian network + conditional probability table! use only nominal variable! network.txt @attribute play {yes, no}!!! @graph! play -> outlook! play -> temperature!

More information

Data Mining Classification - Part 1 -

Data Mining Classification - Part 1 - Data Mining Classification - Part 1 - Universität Mannheim Bizer: Data Mining I FSS2019 (Version: 20.2.2018) Slide 1 Outline 1. What is Classification? 2. K-Nearest-Neighbors 3. Decision Trees 4. Model

More information

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10) CMPUT 391 Database Management Systems Data Mining Textbook: Chapter 17.7-17.11 (without 17.10) University of Alberta 1 Overview Motivation KDD and Data Mining Association Rules Clustering Classification

More information

DATA MINING LECTURE 10B. Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines

DATA MINING LECTURE 10B. Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines DATA MINING LECTURE 10B Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines NEAREST NEIGHBOR CLASSIFICATION 10 10 Illustrating Classification Task Tid Attrib1

More information

CPSC 340: Machine Learning and Data Mining. Probabilistic Classification Fall 2017

CPSC 340: Machine Learning and Data Mining. Probabilistic Classification Fall 2017 CPSC 340: Machine Learning and Data Mining Probabilistic Classification Fall 2017 Admin Assignment 0 is due tonight: you should be almost done. 1 late day to hand it in Monday, 2 late days for Wednesday.

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic

Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic SEMANTIC COMPUTING Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic TU Dresden, 23 November 2018 Overview Unsupervised Machine Learning overview Association

More information

Jue Wang (Joyce) Department of Computer Science, University of Massachusetts, Boston Feb Outline

Jue Wang (Joyce) Department of Computer Science, University of Massachusetts, Boston Feb Outline Learn to Use Weka Jue Wang (Joyce) Department of Computer Science, University of Massachusetts, Boston Feb-09-2010 Outline Introduction of Weka Explorer Filter Classify Cluster Experimenter KnowledgeFlow

More information

Data Mining and Machine Learning. Instance-Based Learning. Rote Learning k Nearest-Neighbor Classification. IBL and Rule Learning

Data Mining and Machine Learning. Instance-Based Learning. Rote Learning k Nearest-Neighbor Classification. IBL and Rule Learning Data Mining and Machine Learning Instance-Based Learning Rote Learning k Nearest-Neighbor Classification Prediction, Weighted Prediction choosing k feature weighting (RELIEF) instance weighting (PEBLS)

More information

Data Engineering. Data preprocessing and transformation

Data Engineering. Data preprocessing and transformation Data Engineering Data preprocessing and transformation Just apply a learner? NO! Algorithms are biased No free lunch theorem: considering all possible data distributions, no algorithm is better than another

More information

Classification: Decision Trees

Classification: Decision Trees Metodologie per Sistemi Intelligenti Classification: Decision Trees Prof. Pier Luca Lanzi Laurea in Ingegneria Informatica Politecnico di Milano Polo regionale di Como Lecture outline What is a decision

More information

Bayes Net Learning. EECS 474 Fall 2016

Bayes Net Learning. EECS 474 Fall 2016 Bayes Net Learning EECS 474 Fall 2016 Homework Remaining Homework #3 assigned Homework #4 will be about semi-supervised learning and expectation-maximization Homeworks #3-#4: the how of Graphical Models

More information

Nearest neighbor classification DSE 220

Nearest neighbor classification DSE 220 Nearest neighbor classification DSE 220 Decision Trees Target variable Label Dependent variable Output space Person ID Age Gender Income Balance Mortgag e payment 123213 32 F 25000 32000 Y 17824 49 M 12000-3000

More information

CS 188: Artificial Intelligence Fall Machine Learning

CS 188: Artificial Intelligence Fall Machine Learning CS 188: Artificial Intelligence Fall 2007 Lecture 23: Naïve Bayes 11/15/2007 Dan Klein UC Berkeley Machine Learning Up till now: how to reason or make decisions using a model Machine learning: how to select

More information

Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3

Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3 Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3 January 25, 2007 CSE-4412: Data Mining 1 Chapter 6 Classification and Prediction 1. What is classification? What is prediction?

More information

Outline. RainForest A Framework for Fast Decision Tree Construction of Large Datasets. Introduction. Introduction. Introduction (cont d)

Outline. RainForest A Framework for Fast Decision Tree Construction of Large Datasets. Introduction. Introduction. Introduction (cont d) Outline RainForest A Framework for Fast Decision Tree Construction of Large Datasets resented by: ov. 25, 2004 1. 2. roblem Definition 3. 4. Family of Algorithms 5. 6. 2 Classification is an important

More information

Part I. Instructor: Wei Ding

Part I. Instructor: Wei Ding Classification Part I Instructor: Wei Ding Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Classification: Definition Given a collection of records (training set ) Each record contains a set

More information

Machine Learning: Symbolische Ansätze

Machine Learning: Symbolische Ansätze Machine Learning: Symbolische Ansätze Learning Rule Sets Introduction Learning Rule Sets Terminology Coverage Spaces Separate-and-Conquer Rule Learning Covering algorithm Top-Down Hill-Climbing Rule Evaluation

More information

Exam Advanced Data Mining Date: Time:

Exam Advanced Data Mining Date: Time: Exam Advanced Data Mining Date: 11-11-2010 Time: 13.30-16.30 General Remarks 1. You are allowed to consult 1 A4 sheet with notes written on both sides. 2. Always show how you arrived at the result of your

More information

Data Mining. 1.3 Input. Fall Instructor: Dr. Masoud Yaghini. Chapter 3: Input

Data Mining. 1.3 Input. Fall Instructor: Dr. Masoud Yaghini. Chapter 3: Input Data Mining 1.3 Input Fall 2008 Instructor: Dr. Masoud Yaghini Outline Instances Attributes References Instances Instance: Instances Individual, independent example of the concept to be learned. Characterized

More information

Nominal Data. May not have a numerical representation Distance measures might not make sense. PR and ANN

Nominal Data. May not have a numerical representation Distance measures might not make sense. PR and ANN NonMetric Data Nominal Data So far we consider patterns to be represented by feature vectors of real or integer values Easy to come up with a distance (similarity) measure by using a variety of mathematical

More information

An Empirical Study on feature selection for Data Classification

An Empirical Study on feature selection for Data Classification An Empirical Study on feature selection for Data Classification S.Rajarajeswari 1, K.Somasundaram 2 Department of Computer Science, M.S.Ramaiah Institute of Technology, Bangalore, India 1 Department of

More information

Data Mining. Part 1. Introduction. 1.3 Input. Fall Instructor: Dr. Masoud Yaghini. Input

Data Mining. Part 1. Introduction. 1.3 Input. Fall Instructor: Dr. Masoud Yaghini. Input Data Mining Part 1. Introduction 1.3 Fall 2009 Instructor: Dr. Masoud Yaghini Outline Instances Attributes References Instances Instance: Instances Individual, independent example of the concept to be

More information

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers James P. Biagioni Piotr M. Szczurek Peter C. Nelson, Ph.D. Abolfazl Mohammadian, Ph.D. Agenda Background

More information

Classification: Basic Concepts, Decision Trees, and Model Evaluation

Classification: Basic Concepts, Decision Trees, and Model Evaluation Classification: Basic Concepts, Decision Trees, and Model Evaluation Data Warehousing and Mining Lecture 4 by Hossen Asiful Mustafa Classification: Definition Given a collection of records (training set

More information

Introduction to Rule-Based Systems. Using a set of assertions, which collectively form the working memory, and a set of

Introduction to Rule-Based Systems. Using a set of assertions, which collectively form the working memory, and a set of Introduction to Rule-Based Systems Using a set of assertions, which collectively form the working memory, and a set of rules that specify how to act on the assertion set, a rule-based system can be created.

More information

CSCE 478/878 Lecture 6: Bayesian Learning and Graphical Models. Stephen Scott. Introduction. Outline. Bayes Theorem. Formulas

CSCE 478/878 Lecture 6: Bayesian Learning and Graphical Models. Stephen Scott. Introduction. Outline. Bayes Theorem. Formulas ian ian ian Might have reasons (domain information) to favor some hypotheses/predictions over others a priori ian methods work with probabilities, and have two main roles: Optimal Naïve Nets (Adapted from

More information

Rule induction. Dr Beatriz de la Iglesia

Rule induction. Dr Beatriz de la Iglesia Rule induction Dr Beatriz de la Iglesia email: b.iglesia@uea.ac.uk Outline What are rules? Rule Evaluation Classification rules Association rules 2 Rule induction (RI) As their name suggests, RI algorithms

More information

Extra readings beyond the lecture slides are important:

Extra readings beyond the lecture slides are important: 1 Notes To preview next lecture: Check the lecture notes, if slides are not available: http://web.cse.ohio-state.edu/~sun.397/courses/au2017/cse5243-new.html Check UIUC course on the same topic. All their

More information

Tillämpad Artificiell Intelligens Applied Artificial Intelligence Tentamen , , MA:8. 1 Search (JM): 11 points

Tillämpad Artificiell Intelligens Applied Artificial Intelligence Tentamen , , MA:8. 1 Search (JM): 11 points Lunds Tekniska Högskola EDA132 Institutionen för datavetenskap VT 2017 Tillämpad Artificiell Intelligens Applied Artificial Intelligence Tentamen 2016 03 15, 14.00 19.00, MA:8 You can give your answers

More information

CS 188: Artificial Intelligence Fall 2011

CS 188: Artificial Intelligence Fall 2011 CS 188: Artificial Intelligence Fall 2011 Lecture 21: ML: Naïve Bayes 11/10/2011 Dan Klein UC Berkeley Example: Spam Filter Input: email Output: spam/ham Setup: Get a large collection of example emails,

More information

ARTIFICIAL INTELLIGENCE (CS 370D)

ARTIFICIAL INTELLIGENCE (CS 370D) Princess Nora University Faculty of Computer & Information Systems ARTIFICIAL INTELLIGENCE (CS 370D) (CHAPTER-18) LEARNING FROM EXAMPLES DECISION TREES Outline 1- Introduction 2- know your data 3- Classification

More information

CSC411/2515 Tutorial: K-NN and Decision Tree

CSC411/2515 Tutorial: K-NN and Decision Tree CSC411/2515 Tutorial: K-NN and Decision Tree Mengye Ren csc{411,2515}ta@cs.toronto.edu September 25, 2016 Cross-validation K-nearest-neighbours Decision Trees Review: Motivation for Validation Framework:

More information

MACHINE LEARNING Example: Google search

MACHINE LEARNING Example: Google search MACHINE LEARNING Lauri Ilison, PhD Data Scientist 20.11.2014 Example: Google search 1 27.11.14 Facebook: 350 million photo uploads every day The dream is to build full knowledge of the world and know everything

More information

IBL and clustering. Relationship of IBL with CBR

IBL and clustering. Relationship of IBL with CBR IBL and clustering Distance based methods IBL and knn Clustering Distance based and hierarchical Probability-based Expectation Maximization (EM) Relationship of IBL with CBR + uses previously processed

More information

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai Decision Trees Decision Tree Decision Trees (DTs) are a nonparametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target

More information

Lazy Decision Trees Ronny Kohavi

Lazy Decision Trees Ronny Kohavi Lazy Decision Trees Ronny Kohavi Data Mining and Visualization Group Silicon Graphics, Inc. Joint work with Jerry Friedman and Yeogirl Yun Stanford University Motivation: Average Impurity = / interesting

More information

Data Mining. Decision Tree. Hamid Beigy. Sharif University of Technology. Fall 1396

Data Mining. Decision Tree. Hamid Beigy. Sharif University of Technology. Fall 1396 Data Mining Decision Tree Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 24 Table of contents 1 Introduction 2 Decision tree

More information

Decision Tree CE-717 : Machine Learning Sharif University of Technology

Decision Tree CE-717 : Machine Learning Sharif University of Technology Decision Tree CE-717 : Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Some slides have been adapted from: Prof. Tom Mitchell Decision tree Approximating functions of usually discrete

More information

Implementation of Classification Rules using Oracle PL/SQL

Implementation of Classification Rules using Oracle PL/SQL 1 Implementation of Classification Rules using Oracle PL/SQL David Taniar 1 Gillian D cruz 1 J. Wenny Rahayu 2 1 School of Business Systems, Monash University, Australia Email: David.Taniar@infotech.monash.edu.au

More information

Data Mining. Part 1. Introduction. 1.4 Input. Spring Instructor: Dr. Masoud Yaghini. Input

Data Mining. Part 1. Introduction. 1.4 Input. Spring Instructor: Dr. Masoud Yaghini. Input Data Mining Part 1. Introduction 1.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Instances Attributes References Instances Instance: Instances Individual, independent example of the concept to be

More information

Data Mining and Knowledge Discovery: Practice Notes

Data Mining and Knowledge Discovery: Practice Notes Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 2016/11/16 1 Keywords Data Attribute, example, attribute-value data, target variable, class, discretization

More information

Classification. 1 o Semestre 2007/2008

Classification. 1 o Semestre 2007/2008 Classification Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 Single-Class

More information

ISSUES IN DECISION TREE LEARNING

ISSUES IN DECISION TREE LEARNING ISSUES IN DECISION TREE LEARNING Handling Continuous Attributes Other attribute selection measures Overfitting-Pruning Handling of missing values Incremental Induction of Decision Tree 1 DECISION TREE

More information