Data Mining Algorithms: Basic Methods

1 Algorithms: The basic methods Inferring rudimentary rules Data Mining Algorithms: Basic Methods Chapter 4 of Data Mining Statistical modeling Constructing decision trees Constructing rules Association rule learning Linear models Instance-based learning Clustering 2 Simplicity first Simple algorithms often work very well! There are many kinds of simple structure, e.g.: One attribute does all the work All attributes contribute equally & independently A weighted linear combination might do fine Instance-based: use a few prototypes Use simple logical rules Success of method depends on the domain Review: Classification Learning Classification-learning algorithms: take a set of already classified training examples also known as training instances learn a model that can classify previously unseen examples The resulting model works like this: input attributes (everything but the class) model output attribute/class

2 Review: Classification Learning (cont.)
Recall our medical-diagnosis example. Training examples/instances (class/output attribute = Diagnosis):
Patient ID# | Sore Throat | Fever | Swollen Glands | Congestion | Headache | Diagnosis
1 Strep throat
2 Allergy
3 Cold
4 Strep throat
5 Cold
6 Allergy
7 Strep throat
8 Allergy
9 Cold
10 Cold
The learned model:
if Swollen Glands = Yes then Diagnosis = Strep Throat
if Swollen Glands = No and Fever = Yes then Diagnosis = Cold
if Swollen Glands = No and Fever = No then Diagnosis = Allergy

Example Problem: Credit-Card Promotions
A credit-card company wants to determine which customers should be sent promotional materials for a life insurance offer. It needs a model that predicts whether a customer will accept the offer: age, sex, income range, credit-card insurance* -> model -> Yes (will accept the offer) or No (will not accept the offer)
* note: credit-card insurance is a yes/no attribute specifying whether the customer accepted a similar offer for insurance on their credit card

Example Problem: Credit-Card Promotions
15 training examples (Table 3.1 of Roiger & Geatz), class/output attribute: 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K

1R: Learning Simple Classification Rules
Presented by R.C. Holte, University of Ottawa, in the following paper: "Very simple classification rules perform well on most commonly used datasets", Machine Learning, 11 (1993), 63-91.
The paper contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data). The minimum number of instances was set to 6 after some experimentation. 1R's simple rules performed not much worse than much more complex decision trees. Simplicity first pays off!
Why is it called 1R? "R" because the algorithm learns a set of Rules; "1" because the rules are based on only 1 input attribute.
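To make the input-to-output behaviour of such a model concrete, here is a minimal Python sketch of the diagnosis rule set above as a classifier function. The Yes/No attribute values and the dictionary format for an instance are assumptions made for illustration, not part of the original data file.

def diagnose(patient):
    """Apply the learned rule set to one instance (a dict of attribute values)."""
    # Rule 1: swollen glands alone decide strep throat
    if patient["Swollen Glands"] == "Yes":
        return "Strep throat"
    # Rules 2 and 3: otherwise fever separates cold from allergy
    if patient["Fever"] == "Yes":
        return "Cold"
    return "Allergy"

# Example: a previously unseen patient
print(diagnose({"Sore Throat": "No", "Fever": "Yes",
                "Swollen Glands": "No", "Congestion": "Yes", "Headache": "No"}))
# -> Cold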

3 1R: Learning Simple Classification Rules The rules that 1R learns look like this: <attribute-name>: <attribute-val1> <class value> <attribute-val2> <class value> To see how 1R learns the rules, let's consider an example. Applying 1R to the Credit-Card Promotion Data 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K Let's start by determining the rules based on. Applying 1R to the Credit-Card Promotion Data 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K Let's start by determining the rules based on. To do so, we ask the following: when =, what is the most frequent class? when =, what is the most frequent class? Applying 1R to the Credit-Card Promotion Data 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K Let's start by determining the rules based on. To do so, we ask the following: when =, what is the most frequent class? (it appears in 6 out of 7 of those examples) when =, what is the most frequent class?

4 Applying 1R to the Credit-Card Promotion Data 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K Let's start by determining the rules based on. To do so, we ask the following: when =, what is the most frequent class? (it appears in 6 out of 7 of those examples) when =, what is the most frequent class? Applying 1R to the Credit-Card Promotion Data 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K Let's start by determining the rules based on. To do so, we ask the following: when =, what is the most frequent class? (it appears in 6 out of 7 of those examples) when =, what is the most frequent class? (it appears in 5 out of 8 of those examples) Applying 1R to the Credit-Card Promotion Data (cont.) Thus, we end up with the following rules based on : : (6 out of 7) (5 out of 8) 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K Pseudocode for the 1R Algorithm for each input attribute A: for each value V of A: count how often each class appears together with V find the most frequent class F add the rule A = V F to the rules for A calculate and store the accuracy of the rules learned for A choose the rules with the highest overall accuracy So far, we've learned the rules for the attribute : : (6 out of 7) (5 out of 8) overall accuracy =?

5 Pseudocode for the 1R Algorithm for each input attribute A: for each value V of A: count how often each class appears together with V find the most frequent class F add the rule A = V F to the rules for A calculate and store the accuracy of the rules learned for A choose the rules with the highest overall accuracy Pseudocode for the 1R Algorithm for each input attribute A: for each value V of A: count how often each class appears together with V find the most frequent class F add the rule A = V F to the rules for A calculate and store the accuracy of the rules learned for A choose the rules with the highest overall accuracy So far, we've learned the rules for the attribute : : (6 out of 7) (5 out of 8) overall accuracy = (6 + 5)/(7 + 8) = 11/15 = 73% So far, we've learned the rules for the attribute : : (6 out of 7) (5 out of 8) overall accuracy = (6 + 5)/(7 + 8) = 11/15 = 73% Equivalently, we can focus on the error rate and minimize it. error rate of rules above =? Pseudocode for the 1R Algorithm for each input attribute A: for each value V of A: count how often each class appears together with V find the most frequent class F add the rule A = V F to the rules for A calculate and store the accuracy of the rules learned for A choose the rules with the highest overall accuracy So far, we've learned the rules for the attribute : : (6 out of 7) (5 out of 8) overall accuracy = (6 + 5)/(7 + 8) = 11/15 = 73% Applying 1R to the Credit-Card Promotion Data (cont.) What rules would be produced for Credit Card Insurance? Credit Card Insurance: 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K Equivalently, we can focus on the error rate and minimize it. error rate of rules above = = %
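The pseudocode above translates almost directly into Python. The sketch below is one possible rendering, assuming nominal attributes and a training set given as a list of attribute dictionaries plus a parallel list of class labels; the function name one_r is mine, not from the slides.

from collections import Counter, defaultdict

def one_r(instances, labels):
    """Return (best_attribute, rules, accuracy) for 1R over nominal attributes.

    instances: list of dicts mapping attribute name -> value
    labels:    list of class values, parallel to instances
    """
    best = None
    for attr in instances[0]:
        # count how often each class appears together with each value V of attr
        counts = defaultdict(Counter)
        for inst, cls in zip(instances, labels):
            counts[inst[attr]][cls] += 1
        # for each value, the rule predicts its most frequent class
        rules = {val: cls_counts.most_common(1)[0][0]
                 for val, cls_counts in counts.items()}
        correct = sum(cls_counts[rules[val]] for val, cls_counts in counts.items())
        accuracy = correct / len(instances)
        # keep the attribute whose rules have the highest overall accuracy
        if best is None or accuracy > best[2]:
            best = (attr, rules, accuracy)
    return best

Calling one_r on the 15 credit-card examples would reproduce the per-attribute accuracies worked out on these slides (e.g. 11/15 = 73% for the first attribute considered).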

6 Applying 1R to the Credit-Card Promotion Data (cont.) What rules would be produced for Credit Card Insurance? Credit Card Insurance: 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K Applying 1R to the Credit-Card Promotion Data (cont.) 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K What rules would be produced for Credit Card Insurance? Credit Card Insurance: (3 out of 3) Applying 1R to the Credit-Card Promotion Data (cont.) 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K What rules would be produced for Credit Card Insurance? Credit Card Insurance: (3 out of 3) Applying 1R to the Credit-Card Promotion Data (cont.) 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K What rules would be produced for Credit Card Insurance? Credit Card Insurance: (3 out of 3) * (6 out of 12) * when Credit Card Insurance =, the two classes are equally likely, but we choose because otherwise the model would always predict

7 Applying 1R to the Credit-Card Promotion Data (cont.) What rules would be produced for Income Range? Income Range: 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K 20-30K 30-K -50K 50-60K Applying 1R to the Credit-Card Promotion Data (cont.) What rules would be produced for Income Range? Income Range: 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K 20-30K 30-K -50K 50-60K Applying 1R to the Credit-Card Promotion Data (cont.) 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K What rules would be produced for Income Range? Income Range: 20-30K * (2 out of 4) 30-K -50K 50-60K (* would also be a valid choice for 20-30K) Applying 1R to the Credit-Card Promotion Data (cont.) 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K What rules would be produced for Income Range? Income Range: 20-30K * (2 out of 4) 30-K -50K 50-60K (* would also be a valid choice for 20-30K)

8 Applying 1R to the Credit-Card Promotion Data (cont.) 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K What rules would be produced for Income Range? Income Range: 20-30K * (2 out of 4) 30-K (4 out of 5) -50K 50-60K (* would also be a valid choice for 20-30K) Applying 1R to the Credit-Card Promotion Data (cont.) 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K What rules would be produced for Income Range? Income Range: 20-30K * (2 out of 4) 30-K (4 out of 5) -50K 50-60K (* would also be a valid choice for 20-30K) Applying 1R to the Credit-Card Promotion Data (cont.) 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K What rules would be produced for Income Range? Income Range: 20-30K * (2 out of 4) 30-K (4 out of 5) -50K (3 out of 4) 50-60K (* would also be a valid choice for 20-30K) Applying 1R to the Credit-Card Promotion Data (cont.) 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K What rules would be produced for Income Range? Income Range: 20-30K * (2 out of 4) 30-K (4 out of 5) -50K (3 out of 4) 50-60K (* would also be a valid choice for 20-30K)

9 Applying 1R to the Credit-Card Promotion Data (cont.)
50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K
What rules would be produced for Income Range?
Income Range: 20-30K * (2 out of 4) 30-K (4 out of 5) -50K (3 out of 4) 50-60K (2 out of 2)
(* would also be a valid choice for 20-30K)

Another Example: Evaluating the Weather Attributes

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

1R rules and errors for each weather attribute (* indicates a tie):

Attribute  Rules              Errors  Total errors
Outlook    Sunny -> No        2/5     4/14
           Overcast -> Yes    0/4
           Rainy -> Yes       2/5
Temp       Hot -> No*         2/4     5/14
           Mild -> Yes        2/6
           Cool -> Yes        1/4
Humidity   High -> No         3/7     4/14
           Normal -> Yes      1/7
Windy      False -> Yes       2/8     5/14
           True -> No*        3/6

Handling Numeric Attributes
To handle numeric attributes, we need to discretize the range of possible values into subranges called bins or buckets. One way is to sort the training instances by age and look for the binary (two-way) split that leads to the most accurate rules.
50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K

Handling Numeric Attributes (cont.)
Here's one possible binary split for age (instances sorted by age):
: Life Ins: Y N Y Y Y Y Y Y N Y Y N N N N
the corresponding rules are: : <= (5 out of 6) > (5 out of 9)
overall accuracy: 10/15 = 67%

10 Handling Numeric Attributes (cont.) Here's one possible binary split for age: : Life Ins: Y N Y Y Y Y Y Y N Y Y N N N N the corresponding rules are: : <= (5 out of 6) > (5 out of 9) The following is one of the splits with the best overall accuracy: : Life Ins: Y N Y Y Y Y Y Y N Y Y N N N N the corresponding rules are: : <= (9 out of 12) > (3 out of 3) overall accuracy: 10/15 = 67% overall accuracy: 12/15 = 80% Summary of 1R Results : (6 out of 7) (5 out of 8) Cred.Card Ins: (3 out of 3) * (6 out of 12) Income Range: 20-30K * (2 out of 4) 30-K (4 out of 5) -50K (3 out of 4) 50-60K (2 out of 2) : <= (9 out of 12) > (3 out of 3) overall accuracy: 11/15 = 73% overall accuracy: 9/15 = 60% overall accuracy: 11/15 = 73% overall accuracy: 12/15 = 80% Because the rules based on have the highest overall accuracy on the training data, 1R selects them as the model. Special Case: Many-Valued Attributes 1R does not tend to work well with attributes that have many possible values. When such an attribute is present, 1R often ends up selecting its rules. each rule applies to only a small number of examples, which tends to give them a high accuracy However, the rules learned for a many-valued attribute tend not to generalize well. what is this called? Special Case: Many-Valued Attributes 1R does not tend to work well with attributes that have many possible values. When such an attribute is present, 1R often ends up selecting its rules. each rule applies to only a small number of examples, which tends to give them a high accuracy However, the rules learned for a many-valued attribute tend not to generalize well. what is this called? overfitting the training data
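A sketch of the binary-split search just described: sort the instances on the numeric attribute, try each midpoint between consecutive distinct values as a threshold, and score each candidate split by the accuracy of its two majority-class rules. The helper below and its toy data are illustrative assumptions, not the actual credit-card ages.

from collections import Counter

def best_binary_split(values, labels):
    """Find the threshold t maximizing the accuracy of the rule pair (<= t, > t)."""
    pairs = sorted(zip(values, labels))
    best_t, best_acc = None, -1.0
    for i in range(len(pairs) - 1):
        lo, hi = pairs[i][0], pairs[i + 1][0]
        if lo == hi:
            continue                      # no split point between equal values
        t = (lo + hi) / 2
        left = Counter(c for v, c in pairs if v <= t)
        right = Counter(c for v, c in pairs if v > t)
        correct = left.most_common(1)[0][1] + right.most_common(1)[0][1]
        acc = correct / len(pairs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

ages = [21, 25, 30, 35, 44, 50]                 # synthetic ages, not the real data
labels = ["Yes", "Yes", "Yes", "No", "No", "No"]
print(best_binary_split(ages, labels))          # (39.5, 1.0)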

11 Special Case: Many-Valued Attributes (cont.) Patient Sore Swollen ID# Throat Fever Glands Congestion Headache Diagnosis 1 Strep throat 2 Allergy 3 Cold 4 Strep throat 5 Cold 6 Allergy 7 Strep throat 8 Allergy 9 Cold 10 Cold Example: let's say we used 1R on this dataset. what would be the accuracy of rules based on Patient ID#? Special Case: Many-Valued Attributes (cont.) Patient Sore Swollen ID# Throat Fever Glands Congestion Headache Diagnosis 1 Strep throat 2 Allergy 3 Cold 4 Strep throat 5 Cold 6 Allergy 7 Strep throat 8 Allergy 9 Cold 10 Cold Example: let's say we used 1R on this dataset. what would be the accuracy of rules based on Patient ID#? 100%! because Patient ID# is a unique identifier, we get one rule for each ID, which correctly classifies its example! We need to remove identifier fields before running 1R. Special Case: Numeric Attributes Special Case: Numeric Attributes The standard way of handling numeric attributes in 1R is a bit more complicated than the method we presented earlier. allows for more than two bins/buckets place breakpoints where the class changes maximizes total accuracy / minimizes the total error possible alternate discretization: : Life Ins: Y N Y Y Y Y Y Y N Y Y N N N N what's the problem with this discretization? Another example: temperature from weather data Outlook Temperature Humidity Windy Sunny False Sunny True Overcast False Rainy False Play To avoid overfitting, you can specify a minimum bucket size the smallest number of examples allowed in a given bucket.

12 The Problem of Overfitting Example (with minimum bucket size = 3): Resulting rule set: With Overfitting Avoidance : Life Ins: Y N Y Y Y Y Y Y N Y Y N N N N Weather data: Attribute Outlook Temperature Humidity Windy Rules Sunny Overcast Rainy 77.5 > 77.5 * 82.5 > 82.5 and 95.5 > 95.5 False True * Errors 2/5 0/4 2/5 3/10 2/4 1/7 2/6 0/1 2/8 3/6 Total errors 4/14 5/14 3/14 5/14 Limitation of 1R 1R won't work well if many of the input attributes have fewer possible values than the class/output attribute does. Example: our medical diagnosis dataset Patient Sore Swollen ID# Throat Fever Glands Congestion Headache Diagnosis 1 Strep throat 2 Allergy 3 Cold 4 Strep throat 5 Cold 6 Allergy 7 Strep throat 8 Allergy 9 Cold 10 Cold Using 1R as a Baseline When performing classification learning, 1R, can serve as a useful baseline. compare the models from more complex algorithms to the model it produces if a model has a lower accuracy than 1R, it probably isn't worth keeping It also gives insight into which of the input attributes has the most impact on the output attribute. there are three possible classes: Strep Throat, Cold, Allergy binary attributes such as Fever produce rules that predict at most two of these classes: Fever: Cold Allergy

13 0R: Another Useful Baseline The 0R algorithm learns a model that considers none of the input attributes! It simply predicts the majority class in the training data. 0R: Another Useful Baseline The 0R algorithm learns a model that considers none of the input attributes! It simply predicts the majority class in the training data. Example: the credit-card training data 9 examples in which the output is 6 examples in which the output is thus, the 0R model would always predict. gives an accuracy of 9/15 = 60% 0R: Another Useful Baseline The 0R algorithm learns a model that considers none of the input attributes! It simply predicts the majority class in the training data. Example: the credit-card training data 9 examples in which the output is 6 examples in which the output is thus, the 0R model would always predict. gives an accuracy of 9/15 = 60% When performing classification learning, you should use the results of this algorithm to put your results in context. if the 0R accuracy is high, you may want to create training data that is less skewed at the very least, you should include the class breakdown of your training and test sets in your report Statistical modeling Opposite of 1R: use all the attributes Two assumptions: Attributes are equally important statistically independent (given the class value) i.e., knowing the value of one attribute says nothing about the value of another (if the class is known) Independence assumption is never correct! But this scheme works well in practice
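The 0R baseline is only a few lines of Python. The class labels below are placeholders standing in for the two class values of the credit-card data.

from collections import Counter

def zero_r(labels):
    """0R: return the majority class and its accuracy on the training data."""
    counts = Counter(labels)
    majority, count = counts.most_common(1)[0]
    return majority, count / len(labels)

# 9 of the 15 credit-card examples share one class, so 0R scores 9/15 = 60%
print(zero_r(["A"] * 9 + ["B"] * 6))   # ('A', 0.6)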

14 Probabilities for the weather data (counts and fractions, shown as yes | no):
Outlook: Sunny 2 | 3 (2/9 | 3/5), Overcast 4 | 0 (4/9 | 0/5), Rainy 3 | 2 (3/9 | 2/5)
Temperature: Hot 2 | 2 (2/9 | 2/5), Mild 4 | 2 (4/9 | 2/5), Cool 3 | 1 (3/9 | 1/5)
Humidity: High 3 | 4 (3/9 | 4/5), Normal 6 | 1 (6/9 | 1/5)
Windy: False 6 | 2 (6/9 | 2/5), True 3 | 3 (3/9 | 3/5)
Play: 9 | 5 (9/14 | 5/14)

A new day: Outlook = Sunny, Temp. = Cool, Humidity = High, Windy = True, Play = ?
Likelihood of the two classes:
For yes = 2/9 x 3/9 x 3/9 x 3/9 x 9/14 = 0.0053
For no = 3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.0206
Conversion into a probability by normalization:
P(yes) = 0.0053 / (0.0053 + 0.0206) = 20.5%
P(no) = 0.0206 / (0.0053 + 0.0206) = 79.5%

Bayes rule
Thomas Bayes. Born: 1702 in London, England. Died: 1761 in Tunbridge Wells, Kent, England.
For evidence E and event (hypothesis) H:
Pr[H | E] = Pr[E | H] Pr[H] / Pr[E]

Naïve Bayes for classification
Classification learning: what's the probability of the class given an instance?
Evidence E = the instance; event H = the class value for the instance.
Naïve assumption: the evidence splits into parts (i.e. attributes) that are independent given the class, so
Pr[H | E] = Pr[E1 | H] Pr[E2 | H] ... Pr[En | H] Pr[H] / Pr[E]
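A minimal sketch of this calculation in Python, using the counts from the table above; the nested-dictionary layout and function name are assumptions made for readability.

def naive_bayes_posterior(counts, class_counts, instance):
    """counts[cls][attr][val] = training count; returns normalized posteriors."""
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total                         # prior Pr[H]
        for attr, val in instance.items():
            score *= counts[cls][attr][val] / n_cls   # Pr[Ei | H]
        scores[cls] = score
    norm = sum(scores.values())
    # note: no smoothing here, so a zero count zeroes out that class (see the next slide)
    return {cls: s / norm for cls, s in scores.items()}

counts = {
    "yes": {"Outlook": {"Sunny": 2, "Overcast": 4, "Rainy": 3},
            "Temperature": {"Hot": 2, "Mild": 4, "Cool": 3},
            "Humidity": {"High": 3, "Normal": 6},
            "Windy": {"False": 6, "True": 3}},
    "no":  {"Outlook": {"Sunny": 3, "Overcast": 0, "Rainy": 2},
            "Temperature": {"Hot": 2, "Mild": 2, "Cool": 1},
            "Humidity": {"High": 4, "Normal": 1},
            "Windy": {"False": 2, "True": 3}},
}
class_counts = {"yes": 9, "no": 5}
new_day = {"Outlook": "Sunny", "Temperature": "Cool", "Humidity": "High", "Windy": "True"}
print(naive_bayes_posterior(counts, class_counts, new_day))
# roughly {'yes': 0.205, 'no': 0.795}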

15 Weather data example / The zero-frequency problem

What if an attribute value doesn't occur with every class value? (e.g. Outlook = Overcast for class "no".)
Then that conditional probability will be zero: Pr[Outlook = Overcast | no] = 0,
and the a posteriori probability for that class will also be zero, no matter how likely the other values are: Pr[no | E] = 0.
Remedy: add 1 to the count for every attribute value-class combination (the Laplace estimator).
Result: probabilities will never be zero! For Outlook given class "no" this gives
Pr[Sunny | no] = 4/8, Pr[Overcast | no] = 1/8, Pr[Rainy | no] = 3/8.

Modified probability estimates
In some cases adding a constant different from 1 might be more appropriate. Example: attribute Outlook for class "yes", with a constant mu divided among the values by weights p1, p2, p3:
Sunny: (2 + mu*p1) / (9 + mu)   Overcast: (4 + mu*p2) / (9 + mu)   Rainy: (3 + mu*p3) / (9 + mu)
The weights don't need to be equal (but they must sum to 1).

Missing values
Training: the instance is not included in the frequency count for that attribute; probability ratios are based on the number of values that actually occur rather than the total number of instances.
Classification: the attribute is simply omitted from the calculation.
Example: Outlook = ?, Temp. = Cool, Humidity = High, Windy = True, Play = ?
Likelihood of yes = 3/9 x 3/9 x 3/9 x 9/14 = 0.0238
Likelihood of no = 1/5 x 4/5 x 3/5 x 5/14 = 0.0343
P(yes) = 0.0238 / (0.0238 + 0.0343) = 41%
P(no) = 0.0343 / (0.0238 + 0.0343) = 59%
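The Laplace remedy is a one-line change to how each conditional probability is computed. The sketch below uses mu = 1 added per attribute value, which reproduces the Outlook-given-no estimates quoted above; unequal weights p1, p2, p3 would replace the even split.

def smoothed_probability(count, class_total, n_values, mu=1.0):
    """Laplace-style estimate: add mu per value so no probability is ever zero."""
    return (count + mu) / (class_total + mu * n_values)

# Outlook given class "no": counts Sunny = 3, Overcast = 0, Rainy = 2 out of 5
for value, count in [("Sunny", 3), ("Overcast", 0), ("Rainy", 2)]:
    print(value, smoothed_probability(count, 5, 3))   # 4/8, 1/8, 3/8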

16 Numeric attributes

Usual assumption: numeric attributes have a normal or Gaussian probability distribution (given the class). The distribution is defined by two parameters, the sample mean mu and the standard deviation sigma, and its probability density function is

mu = (1/n) * sum of the x_i
sigma = sqrt( (1/(n-1)) * sum of (x_i - mu)^2 )
f(x) = 1 / (sqrt(2*pi) * sigma) * exp( -(x - mu)^2 / (2 * sigma^2) )

Statistics for the weather data (temperature and humidity now numeric; figures shown as yes | no):
Outlook: Sunny 2/9 | 3/5, Overcast 4/9 | 0/5, Rainy 3/9 | 2/5
Temperature: values 64, 68, 65, 71, 69, 70, 72, 80, 72, 85, ...; mean = 73 | 75, sigma = 6.2 | 7.9
Humidity: values 65, 70, 70, 85, 70, 75, 90, 91, 80, 95, ...; mean = 79 | 86, sigma = 10.2 | 9.7
Windy: False 6/9 | 2/5, True 3/9 | 3/5
Play: 9/14 | 5/14

Example density value: f(temperature = 66 | yes) = 1 / (sqrt(2*pi) * 6.2) * exp( -(66 - 73)^2 / (2 * 6.2^2) ) = 0.034

Classifying a new day: Outlook = Sunny, Temp. = 66, Humidity = 90, Windy = true, Play = ?
Likelihood of yes = 2/9 x f(66 | yes) x f(90 | yes) x 3/9 x 9/14
Likelihood of no = 3/5 x f(66 | no) x f(90 | no) x 3/5 x 5/14
P(yes) = 25%   P(no) = 75%
Missing values during training are not included in the calculation of the mean and standard deviation.

Naïve Bayes: discussion
Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated). Why? Because classification doesn't require accurate probability estimates, as long as the maximum probability is assigned to the correct class. However, adding too many redundant attributes will cause problems (e.g. identical attributes). Note also: many numeric attributes are not normally distributed (use kernel density estimators instead).
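A sketch of the density calculation in Python, using the means and standard deviations quoted above. The final class percentages are not asserted here because they depend on how the intermediate densities are rounded.

import math

def gaussian_density(x, mean, std):
    """Normal probability density f(x) for the given mean and standard deviation."""
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# density of temperature = 66 under the "yes" class (mean 73, std 6.2)
print(round(gaussian_density(66, 73, 6.2), 4))   # about 0.034

# unnormalized likelihood of "yes" for the new day: Sunny, temperature 66, humidity 90, windy true
likelihood_yes = (2 / 9) * gaussian_density(66, 73, 6.2) \
               * gaussian_density(90, 79, 10.2) * (3 / 9) * (9 / 14)
print(likelihood_yes)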

17 Review: Decision Trees for Classification We've already seen examples of decision-tree models. example: the tree for our medical-diagnosis dataset: Review: Decision Trees for Classification We've already seen examples of decision-tree models. example: the tree for our medical-diagnosis dataset: Swollen Glands Swollen Glands Strep Throat Fever Strep Throat Fever Cold Allergy Cold Allergy what class would this decision tree assign to the following instance? Patient Sore Swollen ID# Throat Fever Glands Congestion Headache 21 Diagnosis? what class would this decision tree assign to the following instance? Patient Sore Swollen ID# Throat Fever Glands Congestion Headache 21 Diagnosis Cold 1R and Decision Trees We can view the models learned by 1R as simple decision trees with only one decision. here is the model that we learned for the credit-card data: <= > 1R and Decision Trees We can view the models learned by 1R as simple decision trees with only one decision. here is the model that we learned for the credit-card data: <= > here are the rules based on Income Range: Income Range 20-30K 30-K -50K 50-60K
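The tree above can be represented directly as nested data and classified by walking from the root to a leaf. The nested-tuple encoding and the Yes/No branch labels in this Python sketch are assumptions made for illustration.

# Each internal node is (attribute, {value: subtree}); each leaf is a class label.
diagnosis_tree = ("Swollen Glands", {
    "Yes": "Strep throat",
    "No": ("Fever", {"Yes": "Cold", "No": "Allergy"}),
})

def tree_classify(tree, instance):
    """Walk the tree until a leaf (a plain string) is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree

print(tree_classify(diagnosis_tree, {"Swollen Glands": "No", "Fever": "Yes"}))  # Cold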

18 Building Decision Trees How can we build decision trees that use multiple attributes? Here's the basic algorithm: 1. apply 1R to the full set of attributes, but choose the attribute that "best divides" the examples into subgroups Building Decision Trees How can we build decision trees that use multiple attributes? Here's the basic algorithm: 1. apply 1R to the full set of attributes, but choose the attribute that "best divides" the examples into subgroups 2. create a decision based on that attribute and put it in the appropriate place in the existing tree (if any) <= > Building Decision Trees How can we build decision trees that use multiple attributes? Here's the basic algorithm: 1. apply 1R to the full set of attributes, but choose the attribute that "best divides" the examples into subgroups 2. create a decision based on that attribute and put it in the appropriate place in the existing tree (if any) 3. for each subgroup created by the new decision: if the classifications of its examples are "accurate enough" or if there are no remaining attributes to use, do nothing otherwise, repeat the process for the examples in the subgroup <= > Building Decision Trees (cont.) What does it mean to choose the attribute that "best divides" the training instances? overall accuracy still plays a role however, it's not as important, since subsequent decisions can improve the model's accuracy in addition, we want to avoid letting the tree get too large, to prevent overfitting

19 Building Decision Trees (cont.) What does it mean to choose the attribute that "best divides" the training instances? overall accuracy still plays a role however, it's not as important, since subsequent decisions can improve the model's accuracy in addition, we want to avoid letting the tree get too large, to prevent overfitting We'll compute a goodness score for each attribute's rules: goodness = overall accuracy / N Building Decision Trees (cont.) What does it mean to choose the attribute that "best divides" the training instances? overall accuracy still plays a role however, it's not as important, since subsequent decisions can improve the model's accuracy in addition, we want to avoid letting the tree get too large, to prevent overfitting We'll compute a goodness score for each attribute's rules: goodness = overall accuracy / N where N = the number of subgroups that would need to be subdivided further if we chose this attribute. <= > where N = the number of subgroups that would need to be subdivided further if we chose this attribute. dividing by N should help to create a smaller tree Special case: if N == 0 for an attribute, we'll select that attribute. Building a Decision Tree for the Credit-Card Data Here are the rules we obtained for each attribute using 1R: : (6 out of 7) (5 out of 8) Cred.Card Ins: (3 out of 3) * (6 out of 12) accuracy: 11/15 = 73% goodness:? accuracy: 9/15 = 60% goodness:? Building a Decision Tree for the Credit-Card Data Here are the rules we obtained for each attribute using 1R: : (6 out of 7) (5 out of 8) Cred.Card Ins: (3 out of 3) * (6 out of 12) accuracy: 11/15 = 73% goodness: 73/2 = 36.5 accuracy: 9/15 = 60% goodness:? Income Rng: 20-30K * (2 out of 4) 30-K (4 out of 5) -50K (3 out of 4) 50-60K (2 out of 2) accuracy: 11/15 = 73% goodness:? Income Rng: 20-30K * (2 out of 4) 30-K (4 out of 5) -50K (3 out of 4) 50-60K (2 out of 2) accuracy: 11/15 = 73% goodness:? : <= (9 out of 12) > (3 out of 3) accuracy: 12/15 = 80% goodness:? : <= (9 out of 12) > (3 out of 3) accuracy: 12/15 = 80% goodness:?

20 Building a Decision Tree for the Credit-Card Data Here are the rules we obtained for each attribute using 1R: : (6 out of 7) (5 out of 8) Cred.Card Ins: (3 out of 3) * (6 out of 12) accuracy: 11/15 = 73% goodness: 73/2 = 36.5 accuracy: 9/15 = 60% goodness: 60/1 = 60 Building a Decision Tree for the Credit-Card Data Here are the rules we obtained for each attribute using 1R: : (6 out of 7) (5 out of 8) Cred.Card Ins: (3 out of 3) * (6 out of 12) accuracy: 11/15 = 73% goodness: 73/2 = 36.5 accuracy: 9/15 = 60% goodness: 60/1 = 60 Income Rng: 20-30K * (2 out of 4) 30-K (4 out of 5) -50K (3 out of 4) 50-60K (2 out of 2) accuracy: 11/15 = 73% goodness:? Income Rng: 20-30K * (2 out of 4) 30-K (4 out of 5) -50K (3 out of 4) 50-60K (2 out of 2) accuracy: 11/15 = 73% goodness: 73/3 = 24.3 : <= (9 out of 12) > (3 out of 3) accuracy: 12/15 = 80% goodness:? : <= (9 out of 12) > (3 out of 3) accuracy: 12/15 = 80% goodness:? Building a Decision Tree for the Credit-Card Data Here are the rules we obtained for each attribute using 1R: : (6 out of 7) (5 out of 8) Cred.Card Ins: (3 out of 3) * (6 out of 12) accuracy: 11/15 = 73% goodness: 73/2 = 36.5 accuracy: 9/15 = 60% goodness: 60/1 = 60 Building a Decision Tree for the Credit-Card Data (cont.) Because has the highest goodness score, we use it as the first decision in the tree: <= > Income Rng: 20-30K * (2 out of 4) 30-K (4 out of 5) -50K (3 out of 4) 50-60K (2 out of 2) : <= (9 out of 12) > (3 out of 3) accuracy: 11/15 = 73% goodness: 73/3 = 24.3 accuracy: 12/15 = 80% goodness: 80/1 = 80 9 out of 12 3 out of 3 thing further needs to be done to the > subgroup.
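A small helper that reproduces the goodness scores worked out on these slides. In this sketch a subgroup counts toward N when its rule still misclassifies at least one training example, an attribute whose rules are all perfect (N = 0) is selected immediately (represented by an infinite score), and accuracy is rounded to a whole percentage as the slides do. The function name and the list-of-pairs input format are mine.

def goodness(rule_stats):
    """rule_stats: one (correct, total) pair per subgroup created by the attribute."""
    correct = sum(c for c, t in rule_stats)
    total = sum(t for c, t in rule_stats)
    accuracy = round(100 * correct / total)        # whole-percent accuracy, as on the slides
    n = sum(1 for c, t in rule_stats if c < t)     # subgroups that would need further splitting
    return float("inf") if n == 0 else accuracy / n

print(goodness([(6, 7), (5, 8)]))                   # 73 / 2 = 36.5
print(goodness([(3, 3), (6, 12)]))                  # 60 / 1 = 60.0
print(goodness([(2, 4), (4, 5), (3, 4), (2, 2)]))   # 73 / 3 = 24.3...
print(goodness([(9, 12), (3, 3)]))                  # 80 / 1 = 80.0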

21 Building a Decision Tree for the Credit-Card Data (cont.) Because has the highest goodness score, we use it as the first decision in the tree: <= > thing further needs to be done to the > subgroup. We return to step 2 and apply the same procedure to the <= subgroup. this is an example of recursion: applying the same algorithm to a smaller version of the original problem 9 out of 12 3 out of 3 Building a Decision Tree for the Credit-Card Data (cont.) Here are the 12 examples in the <= subgroup: 30 K 50K 30 K 50 60K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 20 30K As before, we sort the examples by and find the most accurate binary split. we'll use a minimum bucket size of 3 : Life Ins: Y N Y Y Y Y Y Y N Y Y N Building a Decision Tree for the Credit-Card Data (cont.) Here are the rules obtained for these 12 examples: : (6 out of 6) * (3 out of 6) Cred.Card Ins: (3 out of 3) (6 out of 9) Income Rng: 20-30K (2 out of 3) 30-K (4 out of 5) -50K * (1 out of 2) 50-60K (2 out of 2) : <= (7 out of 8) > * (2 out of 4) accuracy: 9/12 = 75% goodness: 75/1 = 75 accuracy: 9/12 = 75% goodness: 75/1 = 75 accuracy: 9/12 = 75% goodness: 75/3 = 25 accuracy: 9/12 = 75% goodness: 75/2 = 37.5 Building a Decision Tree for the Credit-Card Data (cont.) Here's the tree that splits the <= subgroup: 6 out of 6 3 out of 6 and Credit Card Insurance are tied for the highest goodness score. We'll pick since it has more examples in the subgroup that doesn't need to be subdivided further.

22 Building a Decision Tree for the Credit-Card Data (cont.) Here's the tree that splits the <= subgroup: It replaces the classification for that subgroup in the earlier tree: <= > 9 out of 12 3 out of 3 6 out of 6 3 out of 6 <= > 3 out of 3 Building a Decision Tree for the Credit-Card Data (cont.) We now recursively apply the same procedure to the 6 examples in the ( <=, = ) subgroup: sort by : Life Ins: N Y Y N Y N We no longer consider. Why? 50K 30 K 30 K 20 30K 30 K 20 30K the only binary split with a minimum bucket size of 3 6 out of 6 3 out of 6 Building a Decision Tree for the Credit-Card Data (cont.) We now recursively apply the same procedure to the 6 examples in the ( <=, = ) subgroup: sort by 50K 30 K 30 K 20 30K 30 K 20 30K : Life Ins: N Y Y N Y N the only binary split with a minimum bucket size of 3 Building a Decision Tree for the Credit-Card Data (cont.) Here are the rules obtained for these 6 examples: Cred.Card Ins: (2 out of 2) (3 out of 4) Income Rng: 20-30K * (1 out of 2) 30-K (2 out of 3) -50K (1 out of 1) 50-60K? (none) : <= (2 out of 3) > (2 out of 3) accuracy: 5/6 = 83.3% goodness:? accuracy: 4/6 = 66.7% goodness:? accuracy: 4/6 = 66.7% goodness:? We no longer consider. Why? because all of the examples have the same value for it

23 Building a Decision Tree for the Credit-Card Data (cont.) Here are the rules obtained for these 6 examples: Cred.Card Ins: (2 out of 2) (3 out of 4) Income Rng: 20-30K * (1 out of 2) 30-K (2 out of 3) -50K (1 out of 1) 50-60K? (none) : <= (2 out of 3) > (2 out of 3) accuracy: 5/6 = 83.3% goodness: 83.3/1 = 83.3 accuracy: 4/6 = 66.7% goodness:? accuracy: 4/6 = 66.7% goodness:? Building a Decision Tree for the Credit-Card Data (cont.) Here are the rules obtained for these 6 examples: Cred.Card Ins: (2 out of 2) (3 out of 4) Income Rng: 20-30K * (1 out of 2) 30-K (2 out of 3) -50K (1 out of 1) 50-60K? (none) : <= (2 out of 3) > (2 out of 3) accuracy: 5/6 = 83.3% goodness: 83.3/1 = 83.3 accuracy: 4/6 = 66.7% goodness: 66.7/2 = 33.3 accuracy: 4/6 = 66.7% goodness:? Building a Decision Tree for the Credit-Card Data (cont.) Here are the rules obtained for these 6 examples: Cred.Card Ins: (2 out of 2) (3 out of 4) Income Rng: 20-30K * (1 out of 2) 30-K (2 out of 3) -50K (1 out of 1) 50-60K? (none) : <= (2 out of 3) > (2 out of 3) Credit Card Insurance has the highest goodness score, so we pick it and create the partial tree at right: accuracy: 5/6 = 83.3% goodness: 83.3/1 = 83.3 accuracy: 4/6 = 66.7% goodness: 66.7/2 = 33.3 accuracy: 4/6 = 66.7% goodness: 66.7/2 = 33.3 Credit Card Insurance 2 out of 2 3 out of 4 Building a Decision Tree for the Credit-Card Data (cont.) This new tree replaces the classification for the ( <=, = ) subgroup in the previous tree: <= > 6 out of 6 3 out of 6 3 out of 3 <= > 3 out of 3 Credit Card Insurance 6 out of 6 2 out of 2 3 out of 4

24 Building a Decision Tree for the Credit-Card Data (cont.) Here are the four instances in the ( <=, =, Cred.Card Ins = ) subgroup: 50K 20 30K 30 K 20 30K Building a Decision Tree for the Credit-Card Data (cont.) Here are the four instances in the ( <=, =, Cred.Card Ins = ) subgroup: 50K 20 30K 30 K 20 30K sort by : Life Ins: N Y N N The only remaining attributes are and Income Range. Income Range won't help, because there are two instances with Income Range = 20-30K, one with Life Ins = class and one with Life Ins =. The only remaining attributes are and Income Range. Income Range won't help, because there are two instances with Income Range = 20-30K, one with Life Ins = class and one with Life Ins =. won't help, because we can't make a binary split that separates the Life Ins = and Life Ins = instances. Building a Decision Tree for the Credit-Card Data (cont.) Here are the four instances in the ( <=, =, Cred.Card Ins = ) subgroup: sort by : Life Ins: N Y N N 50K 20 30K 30 K 20 30K The only remaining attributes are and Income Range. Income Range won't help, because there are two instances with Income Range = 20-30K, one with Life Ins = class and one with Life Ins =. won't help, because we can't make a binary split that separates the Life Ins = and Life Ins = instances. Thus, the algorithm stops here. Building a Decision Tree for the Credit-Card Data (cont.) Here's the final model: <= > 3 out of 3 Credit Card Insurance 6 out of 6 2 out of 2 3 out of 4 It manages to correctly classify all but one training example.

25 Building a Decision Tree for the Credit-Card Data (cont.) How would it classify the following instance? K? Building a Decision Tree for the Credit-Card Data (cont.) How would it classify the following instance? K <= > <= > Credit Card Insurance Credit Card Insurance Other Algorithms for Learning Decision Trees ID3 uses a different goodness score based on a field of study known as information theory doesn t handle numeric attributes C4.5 makes a series of improvements to ID3: the ability to handle numeric input attributes the ability to handle missing values measures that prune the tree after it is built making it smaller to improve its ability to generalize (i.e., to handle noise) Decision Tree Results in Weka Weka's output window gives the tree in text form that looks something like this: total # of examples J48 pruned tree in this subgroup = Credit Card Ins. = : (6.0/1.0) Credit Card Ins. = : (2.0) # that are misclassified = : (7.0/1.0) Both ID3 and C4.5 were developed by Ross Quinlan of the University of Sydney. Weka's implementation of C4.5 is called J48.

26 Decision Tree Results in Weka (cont.)
Right-clicking the name of the model in the result list allows you to view the tree in graphical form.

From Decision Trees to Classification Rules
Any decision tree can be turned into a set of rules of the following form:
if <test1> and <test2> and ... then <class> = <value>
where the condition is formed by combining the tests used to get from the top of the tree to one of the leaves.

From Decision Trees to Classification Rules (cont.)
Here are the rules for this tree:
if > then Life Ins =
if <= and = then Life Ins =
if <= and = and Cred Card Ins = then Life Ins =
if <= and = and Cred Card Ins = then Life Ins =

Advantages and Disadvantages of Decision Trees
Advantages: easy to understand; can be converted to a set of rules, which makes it easier to actually use the model for classification; can handle both nominal and numeric input attributes (except for ID3, which is limited to nominal attributes).
Disadvantages: the class attribute must be nominal; slight changes in the set of training examples can produce a significantly different decision tree (we say that the tree-building algorithm is unstable).
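Converting a tree to rules is a depth-first walk that accumulates the tests along each path from the root to a leaf. The sketch below reuses the nested-tuple tree encoding from the earlier sketch and the diagnosis tree as its example; names are mine.

def tree_to_rules(tree, conditions=()):
    """Yield (conditions, class) pairs, one per leaf of the tree."""
    if isinstance(tree, str):                       # leaf: emit one rule
        yield list(conditions), tree
        return
    attribute, branches = tree
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attribute, value),))

diagnosis_tree = ("Swollen Glands", {
    "Yes": "Strep throat",
    "No": ("Fever", {"Yes": "Cold", "No": "Allergy"}),
})
for conds, cls in tree_to_rules(diagnosis_tree):
    test = " and ".join(f"{a} = {v}" for a, v in conds)
    print(f"if {test} then Diagnosis = {cls}")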

27 Practice Building a Decision Tree Let's apply our decision-tree algorithm to the diagnosis dataset. to allow us to practice with numeric attributes, I've replaced Fever with Temp the person's body temperature Patient Sore Swollen ID# Throat Temp Glands Congestion Headache Diagnosis Strep throat Allergy Cold Strep throat Cold Allergy Strep throat Allergy Cold Cold Practice Building a Decision Tree Let's apply our decision-tree algorithm to the diagnosis dataset. to allow us to practice with numeric attributes, I've replaced Fever with Temp the person's body temperature Patient Sore Swollen ID# Throat Temp Glands Congestion Headache Diagnosis Allergy Cold Cold Allergy Allergy Cold Cold Practice Building a Decision Tree Patient Sore Swollen ID# Throat Temp Glands Congestion Headache Diagnosis Strep throat Allergy Strep throat Allergy Strep throat Allergy Review: Rule Sets Patient Sore Swollen ID# Throat Fever Glands Congestion Headache Diagnosis 1 Strep throat 2 Allergy 3 Cold 4 Strep throat 5 Cold 6 Allergy 7 Strep throat 8 Allergy 9 Cold 10 Cold One possible model that could be used for classifying other patients is a set of rules such as the following: if Swollen Glands == then Diagnosis = Strep Throat if Swollen Glands == and Fever == then Diagnosis = Cold if Swollen Glands == and Fever == then Diagnosis = Allergy Diagnosis? Patient Sore Swollen ID# Throat Fever Glands Congestion Headache 11 Diagnosis?

28 Review: Rule Sets Covering algorithms If tear production rate = reduced then recommendation = none If age = young and astigmatic = no and tear production rate = normal then recommendation = soft If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard If age young and astigmatic = yes and tear production rate = normal then recommendation = hard If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none Recall: may convert a decision tree into a rule set Straightforward, but rule set overly complex More effective conversions are not trivial Instead, can generate rule set directly for each class in turn find rule set that covers all instances in it (excluding instances not in the class) Called a covering approach: at each stage a rule is identified that covers some of the instances Spectacle Tear Production Recommended Prescription Rate Astigmatism Lenses myope young normal? Example: generating a rule Rules vs. trees If true then class = a If x > 1.2 then class = a Possible rule set for class b : If x > 1.2 and y > 2.6 then class = a If x 1.2 then class = b If x > 1.2 and y 2.6 then class = b Could add more rules, get perfect rule set Corresponding decision tree: (produces exactly the same predictions) But: rule sets can be more clear when decision trees suffer from replicated subtrees Also: in multiclass situations, covering algorithm concentrates on one class at a time whereas decision tree learner takes all classes into account

29 Simple covering algorithm Selecting a test Generates a rule by adding tests that maximize rule s accuracy Similar to situation in decision trees: problem of selecting an attribute to split on But: decision tree algorithm maximizes overall purity Each new test reduces rule s coverage: Goal: maximize accuracy t total number of instances covered by rule p positive examples of the class covered by rule t p number of errors made by rule Select test that maximizes the ratio p/t We are finished when p/t = 1 or the set of instances can t be split any further PRISM algorithm for rule induction Example: contact lens data Possible tests: If? then recommendation = hard Rule we seek: Modified rule and resulting data Rule with best test added: If astigmatism = yes then recommendation = hard = = Pre-presbyopic = Presbyopic Spectacle prescription = Myope Spectacle prescription = Hypermetrope Astigmatism = no Astigmatism = yes Tear production rate = Tear production rate = rmal 2/8 1/8 1/8 3/12 1/12 0/12 4/12 0/12 4/12 Instances covered by modified rule: Spectacle prescription Astigmatism Tear production rate Myope Myope rmal Hypermetrope Hypermetrope rmal Pre-presbyopic Myope Pre-presbyopic Myope rmal Pre-presbyopic Hypermetrope Pre-presbyopic Hypermetrope rmal Presbyopic Myope Presbyopic Myope rmal Presbyopic Hypermetrope Presbyopic Hypermetrope rmal Recommended lenses ne Hard ne hard ne Hard ne ne ne Hard ne ne
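The test-selection step can be sketched as follows: for every (attribute, value) pair not yet used in the rule, compute p/t over the currently covered instances and keep the best, breaking ties in favour of larger p (greater coverage). The function below is an illustrative sketch with names of my choosing; on the contact-lens data it would pick astigmatism = yes (4/12) as the first test for the "hard" class, matching the slide.

def best_test(instances, labels, target_class, used_attributes=()):
    """Pick the (attribute, value) test maximizing p/t for the target class."""
    best, best_key = None, None
    attributes = [a for a in instances[0] if a not in used_attributes]
    for attr in attributes:
        for value in {inst[attr] for inst in instances}:
            covered = [cls for inst, cls in zip(instances, labels) if inst[attr] == value]
            t = len(covered)
            p = sum(1 for cls in covered if cls == target_class)
            key = (p / t, p)            # accuracy first, coverage breaks ties
            if best_key is None or key > best_key:
                best, best_key = (attr, value), key
    return best, best_key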

30 Further refinement Modified rule and resulting data Current state: Possible tests: If astigmatism = yes and? then recommendation = hard Rule with best test added: If astigmatism = yes and tear production rate = normal then recommendation = hard = = Pre-presbyopic = Presbyopic Spectacle prescription = Myope Spectacle prescription = Hypermetrope Tear production rate = Tear production rate = rmal 2/4 1/4 1/4 3/6 1/6 0/6 4/6 Instances covered by modified rule: Spectacle prescription Astigmatism Tear production rate Myope rmal Hypermetrope rmal Pre-presbyopic Myope rmal Pre-presbyopic Hypermetrope rmal Presbyopic Myope rmal Presbyopic Hypermetrope rmal Recommended lenses Hard hard Hard ne Hard ne Further refinement The result Current state: If astigmatism = yes and tear production rate = normal and? then recommendation = hard Possible tests: = = Pre-presbyopic = Presbyopic Spectacle prescription = Myope Spectacle prescription = Hypermetrope 2/2 1/2 1/2 3/3 1/3 If astigmatism = yes and tear production rate = normal and spectacle prescription Final rule: = myope then recommendation = hard p/t = 3/3 = 1, so this rule is finished But 1 instance still isn t covered so we start a new rule Tie between the first and the fourth test We choose the one with greater coverage

31 Remove instances of rule #1 from dataset Spectacle prescription Astigmatism Tear production rate Recommended lenses Myope ne Myope rmal Soft Myope ne Hypermetrope ne Hypermetrope rmal Soft Hypermetrope ne Hypermetrope rmal hard Pre-presbyopic Myope ne Pre-presbyopic Myope rmal Soft Pre-presbyopic Myope ne Pre-presbyopic Hypermetrope ne Pre-presbyopic Hypermetrope rmal Soft Pre-presbyopic Hypermetrope ne Pre-presbyopic Hypermetrope rmal ne Presbyopic Myope ne Presbyopic Myope rmal ne Presbyopic Myope ne Presbyopic Hypermetrope ne Presbyopic Hypermetrope rmal Soft Presbyopic Hypermetrope ne Presbyopic Hypermetrope rmal ne 121 Possible tests: PRISM algorithm for second rule If? then recommendation = hard Rule we seek: = 1/7 = Pre-presbyopic 0/7 = Presbyopic 0/7 Spectacle prescription = Myope 0/9 Spectacle prescription = Hypermetrope 1/12 Astigmatism = no 0/12 Astigmatism = yes 1/9 Tear production rate = 0/12 Tear production rate = rmal 1/9 Modified rule #2 and resulting data Further refinement of rule #2 Rule #2 with best test added: If age = young then recommendation = hard p/t = 1/7 so not done with rule Instances covered by modified rule: Spectacle prescription Myope Myope Myope Hypermetrope Hypermetrope Hypermetrope Hypermetrope Astigmatism Tear production rate rmal rmal rmal Recommended lenses ne Soft ne ne Soft ne hard Current state: Possible tests: Astigmatism = Astigmatism = Spectacle prescription = Myope Spectacle prescription = Hypermetrope Tear production rate = Tear production rate = rmal If age = young and? then recommendation = hard 1/3 0/4 0/3 1/4 0/4 1/3

32 Modified rule #2 and resulting data Further refinement Current state: Rule #2 with best test added: If age = young and astigmatism = yes then recommendation = hard p/t = 1/3, so continue Instances covered by modified rule: Spectacle prescription Myope Hypermetrope Hypermetrope Astigmatism Tear production rate rmal Recommended lenses ne ne hard If age = young and astigmatism = yes and? then recommendation = hard Possible tests: Tear production rate = Tear Production rate = rmal Spectacle prescription = Myope Spectacle prescription = Hypermetrope 0/2 1/1 0/1 1/2 The result for rule #2 If age = young and astigmatism = yes and tear production Final rule: rate = normal then recommendation = hard p/t = 1/1 = 1, so this rule is finished All four hard instances now covered Another example 50K 30 K 50K 30 K 50 60K 20 30K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 50K 20 30K These two rules cover all hard lenses : Process is repeated with other two classes Starting rule: if? then Life Insurance = Can use our previous 1R work to save time in the first step of the algorithm te that PRISM requires all attributes to be nominal, so will have to be discretized before the algorithm begins

33 Possible Tests : (1 out of 7) (5 out of 8) Cred.Card Ins: (0 out of 3) (6 out of 12) Income Range: 20-30K (2 out of 4) 30-K (1 out of 5) -50K (3 out of 4) 50-60K (0 out of 2) : <= (3 out of 12) > (3 out of 3) *** The rule is thus refined to: if > then Life Insurance = p/t = 3/3 = 1 Therefore the covering algorithm ends with no further refinement But this covers only ½ of the NOs need another rule Repeat Algorithm for New Rule Remove the instances covered by the first rule and repeat the algorithm Possible tests: : (0 out of 6) (3 out of 6) *** Cred.Card Ins: (0 out of 3) (3 out of 9) Income Range: 20-30K (1 out of 3) 30-K (1 out of 5) -50K (1 out of 2) 50-60K (0 out of 2) 30 K 50K 30 K 50 60K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 20 30K : <= (3 out of 12) > (0 out of 0) New Rule This gives the new rule: if = then Life Insurance = p/t = 3/6 =.5 So we re not done yet Next consider only the instances to which this rule applies 50K 30 K 30 K 20 30K 30 K 20 30K Possible tests: Cred.Card Ins: (0 out of 2) (3 out of 4) Income Range: 20-30K (1 out of 2) 30-K (1 out of 3) -50K (1 out of 1) *** 50-60K (0 out of 0) : <= (3 out of 6) > (0 out of 0) Further Refinement of New Rule This gives the new rule: if = and Income Range=-50K then Life Insurance = p/t = 1/1 = 1 Are we done now? With this rule, yes. But we ve covered only 4 instances of NO We need a third rule, so we begin again with the remaining 11 instances: 30 K 30 K 50 60K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 20 30K

34 Possible tests: Repeat Algorithm for Third Rule 30 K 30 K 50 60K 30 K 20 30K 30 K 30 K 50K 20 30K 50 60K 20 30K : (0 out of 6) (2 out of 5) *** Cred.Card Ins: (0 out of 3) (2 out of 8) Income Range: 20-30K (1 out of 3) 30-K (1 out of 5) -50K (0 out of 1) 50-60K (0 out of 2) : <= (2 out of 11) > (0 out of 0) Third Rule This gives the new rule: if = then Life Insurance = p/t = 2/5 =.4 So we re not done yet Next consider only the instances to which this rule applies 30 K 30 K 20 30K 30 K 20 30K Possible tests: Cred.Card Ins: (0 out of 2) (2 out of 3) *** Income Range: 20-30K (1 out of 2) 30-K (1 out of 3) -50K (0 out of 0) 50-60K (0 out of 0) : <= (2 out of 5) > (0 out of 0) Further Refinement of Third Rule This gives the new rule: if = and Credit Card Insurance= then Life Insurance = p/t = 2/3 =.667 Continue to develop the rule Consider only the instances to which this rule applies: 20 30K 30 K 20 30K Income Range: 20-30K (1 out of 2) 30-K (1 out of 1) *** -50K (0 out of 0) 50-60K (0 out of 0) : <= (2 out of 3) > (0 out of 0) Further Refinement of Third Rule This gives the new rule: if = and Credit Card Insurance= and Income Range=30-K then Life Insurance = p/t = 1/1 = 1 Are we done now? With this rule, yes. But we ve covered only 5 instances of NO Stop to avoid overfitting? Possibly. PRISM says to go on. We need a fourth rule, so we begin again with the remaining 10 instances: 30 K 30 K 50 60K 30 K 20 30K 30 K 50K 20 30K 50 60K 20 30K

35 Possible tests: Repeat Algorithm for Fourth Rule 30 K 30 K 50 60K 30 K 20 30K 30 K 50K 20 30K 50 60K 20 30K : (0 out of 6) (1 out of 4) Cred.Card Ins: (0 out of 3) (1 out of 7) Income Range: 20-30K (1 out of 3) *** 30-K (0 out of 4) -50K (0 out of 1) 50-60K (0 out of 2) : <= (1 out of 10) > (0 out of 0) Fourth Rule This gives the new rule: if Income Range=20-30K then Life Insurance = p/t = 1/3 =.333 Next consider only the instances to which this rule applies 20 30K 20 30K 20 30K Possible tests: Cred.Card Ins: (0 out of 1) (1 out of 2) *** : (0 out of 1) (1 out of 2) : <= (1 out of 3) > (0 out of 0) Here we can clearly see that there will be no way to get p/t=1. So this rule is abandoned to avoid overfitting. Conclusion That makes the rule set: if > then Life Insurance = if = and Income Range=-50K then Life Insurance = if = and Credit Card Insurance= and Income Range=30-K then Life Insurance = One instance is still not covered Attempt to make a fourth rule failed outlier? May have made judgement call not even to try Pseudo-code for PRISM For each class C Initialize E to the instance set While E contains instances in class C Create a rule R with an empty left-hand side that predicts class C Until R is perfect (or there are no more attributes to use) do For each attribute A not mentioned in R, and each value v, Consider adding the condition A = v to the left-hand side of R Select A and v to maximize the accuracy p/t (break ties by choosing the condition with the largest p) Add A = v to R Remove the instances covered by R from E Practice on your own: Derive the rules for Life Insurance = Derive the rules for Lense Recommendation = Soft Derive the rules for Lense Recommentation = ne
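The PRISM pseudocode above maps onto Python roughly as follows. This is a sketch rather than Weka's implementation: it assumes nominal attributes, represents instances as dictionaries, and grows each rule until it is perfect or runs out of attributes, exactly as the pseudocode says.

def prism(instances, labels):
    """Learn a rule list grouped by class: [(class, [(attr, value), ...]), ...]."""
    rules = []
    for target in sorted(set(labels)):
        data = list(zip(instances, labels))          # E: start from the full instance set
        while any(cls == target for _, cls in data):
            conditions, covered = [], data
            # grow the rule until it is perfect or there are no more attributes to use
            while (any(cls != target for _, cls in covered)
                   and len(conditions) < len(instances[0])):
                used = {a for a, _ in conditions}
                best, best_key = None, None
                for attr in instances[0]:
                    if attr in used:
                        continue
                    for value in {inst[attr] for inst, _ in covered}:
                        subset = [(i, c) for i, c in covered if i[attr] == value]
                        p = sum(1 for _, c in subset if c == target)
                        key = (p / len(subset), p)   # maximize p/t, break ties on p
                        if best_key is None or key > best_key:
                            best, best_key = (attr, value), key
                conditions.append(best)
                covered = [(i, c) for i, c in covered if i[best[0]] == best[1]]
            rules.append((target, conditions))
            # remove the instances covered by the rule from E and repeat
            data = [(i, c) for i, c in data
                    if not all(i[a] == v for a, v in conditions)]
    return rules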

Instance-Based Representations. k-nearest Neighbor. k-nearest Neighbor. k-nearest Neighbor. exemplars + distance measure. Challenges.

Instance-Based Representations. k-nearest Neighbor. k-nearest Neighbor. k-nearest Neighbor. exemplars + distance measure. Challenges. Instance-Based Representations exemplars + distance measure Challenges. algorithm: IB1 classify based on majority class of k nearest neighbors learned structure is not explicitly represented choosing k
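A minimal k-nearest-neighbour sketch: exemplars are stored as numeric vectors, the distance measure is Euclidean, and classification is by majority vote among the k closest exemplars. The toy data and names are illustrative assumptions.

import math
from collections import Counter

def knn_classify(train, labels, query, k=3):
    """Predict the majority class among the k nearest training exemplars."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbours = sorted(zip(train, labels), key=lambda tl: distance(tl[0], query))[:k]
    return Counter(cls for _, cls in neighbours).most_common(1)[0][0]

# toy 2-D exemplars
train = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
labels = ["a", "a", "b", "b"]
print(knn_classify(train, labels, (1.1, 1.0), k=3))   # 'a'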


More information

Representing structural patterns: Reading Material: Chapter 3 of the textbook by Witten

Representing structural patterns: Reading Material: Chapter 3 of the textbook by Witten Representing structural patterns: Plain Classification rules Decision Tree Rules with exceptions Relational solution Tree for Numerical Prediction Instance-based presentation Reading Material: Chapter

More information

BITS F464: MACHINE LEARNING

BITS F464: MACHINE LEARNING BITS F464: MACHINE LEARNING Lecture-16: Decision Tree (contd.) + Random Forest Dr. Kamlesh Tiwari Assistant Professor Department of Computer Science and Information Systems Engineering, BITS Pilani, Rajasthan-333031

More information

Machine Learning in Real World: C4.5

Machine Learning in Real World: C4.5 Machine Learning in Real World: C4.5 Industrial-strength algorithms For an algorithm to be useful in a wide range of realworld applications it must: Permit numeric attributes with adaptive discretization

More information

Data Mining and Knowledge Discovery: Practice Notes

Data Mining and Knowledge Discovery: Practice Notes Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 2016/01/12 1 Keywords Data Attribute, example, attribute-value data, target variable, class, discretization

More information

Unsupervised: no target value to predict

Unsupervised: no target value to predict Clustering Unsupervised: no target value to predict Differences between models/algorithms: Exclusive vs. overlapping Deterministic vs. probabilistic Hierarchical vs. flat Incremental vs. batch learning

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

9/6/14. Our first learning algorithm. Comp 135 Introduction to Machine Learning and Data Mining. knn Algorithm. knn Algorithm (simple form)

9/6/14. Our first learning algorithm. Comp 135 Introduction to Machine Learning and Data Mining. knn Algorithm. knn Algorithm (simple form) Comp 135 Introduction to Machine Learning and Data Mining Our first learning algorithm How would you classify the next example? Fall 2014 Professor: Roni Khardon Computer Science Tufts University o o o

More information

Normalization and denormalization Missing values Outliers detection and removing Noisy Data Variants of Attributes Meta Data Data Transformation

Normalization and denormalization Missing values Outliers detection and removing Noisy Data Variants of Attributes Meta Data Data Transformation Preprocessing Data Normalization and denormalization Missing values Outliers detection and removing Noisy Data Variants of Attributes Meta Data Data Transformation Reading material: Chapters 2 and 3 of

More information

Lecture 5 of 42. Decision Trees, Occam s Razor, and Overfitting

Lecture 5 of 42. Decision Trees, Occam s Razor, and Overfitting Lecture 5 of 42 Decision Trees, Occam s Razor, and Overfitting Friday, 01 February 2008 William H. Hsu, KSU http://www.cis.ksu.edu/~bhsu Readings: Chapter 3.6-3.8, Mitchell Lecture Outline Read Sections

More information

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018 MIT 801 [Presented by Anna Bosman] 16 February 2018 Machine Learning What is machine learning? Artificial Intelligence? Yes as we know it. What is intelligence? The ability to acquire and apply knowledge

More information

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio

More information

CS513-Data Mining. Lecture 2: Understanding the Data. Waheed Noor

CS513-Data Mining. Lecture 2: Understanding the Data. Waheed Noor CS513-Data Mining Lecture 2: Understanding the Data Waheed Noor Computer Science and Information Technology, University of Balochistan, Quetta, Pakistan Waheed Noor (CS&IT, UoB, Quetta) CS513-Data Mining

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

What is Learning? CS 343: Artificial Intelligence Machine Learning. Raymond J. Mooney. Problem Solving / Planning / Control.

What is Learning? CS 343: Artificial Intelligence Machine Learning. Raymond J. Mooney. Problem Solving / Planning / Control. What is Learning? CS 343: Artificial Intelligence Machine Learning Herbert Simon: Learning is any process by which a system improves performance from experience. What is the task? Classification Problem

More information

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A.

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Output: Knowledge representation Tables Linear models Trees Rules

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Decision Tree Example Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short} Class: Country = {Gromland, Polvia} CS4375 --- Fall 2018 a

More information

7. Decision or classification trees

7. Decision or classification trees 7. Decision or classification trees Next we are going to consider a rather different approach from those presented so far to machine learning that use one of the most common and important data structure,

More information

Decision tree learning

Decision tree learning Decision tree learning Andrea Passerini passerini@disi.unitn.it Machine Learning Learning the concept Go to lesson OUTLOOK Rain Overcast Sunny TRANSPORTATION LESSON NO Uncovered Covered Theoretical Practical

More information

Data Mining and Machine Learning: Techniques and Algorithms

Data Mining and Machine Learning: Techniques and Algorithms Instance based classification Data Mining and Machine Learning: Techniques and Algorithms Eneldo Loza Mencía eneldo@ke.tu-darmstadt.de Knowledge Engineering Group, TU Darmstadt International Week 2019,

More information

Decision Tree Learning

Decision Tree Learning Decision Tree Learning 1 Simple example of object classification Instances Size Color Shape C(x) x1 small red circle positive x2 large red circle positive x3 small red triangle negative x4 large blue circle

More information

Induction of Decision Trees

Induction of Decision Trees Induction of Decision Trees Blaž Zupan and Ivan Bratko magixfriuni-ljsi/predavanja/uisp An Example Data Set and Decision Tree # Attribute Class Outlook Company Sailboat Sail? 1 sunny big small yes 2 sunny

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

Data Mining and Knowledge Discovery Practice notes: Numeric Prediction, Association Rules

Data Mining and Knowledge Discovery Practice notes: Numeric Prediction, Association Rules Keywords Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 06/0/ Data Attribute, example, attribute-value data, target variable, class, discretization Algorithms

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Output: Knowledge representation Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter of Data Mining by I. H. Witten and E. Frank Decision tables Decision trees Decision rules

More information

COMP 465: Data Mining Classification Basics

COMP 465: Data Mining Classification Basics Supervised vs. Unsupervised Learning COMP 465: Data Mining Classification Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Supervised

More information

Association Rules. Charles Sutton Data Mining and Exploration Spring Based on slides by Chris Williams and Amos Storkey. Thursday, 8 March 12

Association Rules. Charles Sutton Data Mining and Exploration Spring Based on slides by Chris Williams and Amos Storkey. Thursday, 8 March 12 Association Rules Charles Sutton Data Mining and Exploration Spring 2012 Based on slides by Chris Williams and Amos Storkey The Goal Find patterns : local regularities that occur more often than you would

More information

WEKA: Practical Machine Learning Tools and Techniques in Java. Seminar A.I. Tools WS 2006/07 Rossen Dimov

WEKA: Practical Machine Learning Tools and Techniques in Java. Seminar A.I. Tools WS 2006/07 Rossen Dimov WEKA: Practical Machine Learning Tools and Techniques in Java Seminar A.I. Tools WS 2006/07 Rossen Dimov Overview Basic introduction to Machine Learning Weka Tool Conclusion Document classification Demo

More information

1. make a scenario and build a bayesian network + conditional probability table! use only nominal variable!

1. make a scenario and build a bayesian network + conditional probability table! use only nominal variable! Project 1 140313 1. make a scenario and build a bayesian network + conditional probability table! use only nominal variable! network.txt @attribute play {yes, no}!!! @graph! play -> outlook! play -> temperature!

More information

Data Mining Classification - Part 1 -

Data Mining Classification - Part 1 - Data Mining Classification - Part 1 - Universität Mannheim Bizer: Data Mining I FSS2019 (Version: 20.2.2018) Slide 1 Outline 1. What is Classification? 2. K-Nearest-Neighbors 3. Decision Trees 4. Model

More information

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10) CMPUT 391 Database Management Systems Data Mining Textbook: Chapter 17.7-17.11 (without 17.10) University of Alberta 1 Overview Motivation KDD and Data Mining Association Rules Clustering Classification

More information

DATA MINING LECTURE 10B. Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines

DATA MINING LECTURE 10B. Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines DATA MINING LECTURE 10B Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines NEAREST NEIGHBOR CLASSIFICATION 10 10 Illustrating Classification Task Tid Attrib1

More information

CPSC 340: Machine Learning and Data Mining. Probabilistic Classification Fall 2017

CPSC 340: Machine Learning and Data Mining. Probabilistic Classification Fall 2017 CPSC 340: Machine Learning and Data Mining Probabilistic Classification Fall 2017 Admin Assignment 0 is due tonight: you should be almost done. 1 late day to hand it in Monday, 2 late days for Wednesday.

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic

Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic SEMANTIC COMPUTING Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic TU Dresden, 23 November 2018 Overview Unsupervised Machine Learning overview Association

More information

Jue Wang (Joyce) Department of Computer Science, University of Massachusetts, Boston Feb Outline

Jue Wang (Joyce) Department of Computer Science, University of Massachusetts, Boston Feb Outline Learn to Use Weka Jue Wang (Joyce) Department of Computer Science, University of Massachusetts, Boston Feb-09-2010 Outline Introduction of Weka Explorer Filter Classify Cluster Experimenter KnowledgeFlow

More information

Data Mining and Machine Learning. Instance-Based Learning. Rote Learning k Nearest-Neighbor Classification. IBL and Rule Learning

Data Mining and Machine Learning. Instance-Based Learning. Rote Learning k Nearest-Neighbor Classification. IBL and Rule Learning Data Mining and Machine Learning Instance-Based Learning Rote Learning k Nearest-Neighbor Classification Prediction, Weighted Prediction choosing k feature weighting (RELIEF) instance weighting (PEBLS)

More information

Data Engineering. Data preprocessing and transformation

Data Engineering. Data preprocessing and transformation Data Engineering Data preprocessing and transformation Just apply a learner? NO! Algorithms are biased No free lunch theorem: considering all possible data distributions, no algorithm is better than another

More information

Classification: Decision Trees

Classification: Decision Trees Metodologie per Sistemi Intelligenti Classification: Decision Trees Prof. Pier Luca Lanzi Laurea in Ingegneria Informatica Politecnico di Milano Polo regionale di Como Lecture outline What is a decision

More information

Bayes Net Learning. EECS 474 Fall 2016

Bayes Net Learning. EECS 474 Fall 2016 Bayes Net Learning EECS 474 Fall 2016 Homework Remaining Homework #3 assigned Homework #4 will be about semi-supervised learning and expectation-maximization Homeworks #3-#4: the how of Graphical Models

More information

Nearest neighbor classification DSE 220

Nearest neighbor classification DSE 220 Nearest neighbor classification DSE 220 Decision Trees Target variable Label Dependent variable Output space Person ID Age Gender Income Balance Mortgag e payment 123213 32 F 25000 32000 Y 17824 49 M 12000-3000

More information

CS 188: Artificial Intelligence Fall Machine Learning

CS 188: Artificial Intelligence Fall Machine Learning CS 188: Artificial Intelligence Fall 2007 Lecture 23: Naïve Bayes 11/15/2007 Dan Klein UC Berkeley Machine Learning Up till now: how to reason or make decisions using a model Machine learning: how to select

More information

Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3

Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3 Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3 January 25, 2007 CSE-4412: Data Mining 1 Chapter 6 Classification and Prediction 1. What is classification? What is prediction?

More information

Outline. RainForest A Framework for Fast Decision Tree Construction of Large Datasets. Introduction. Introduction. Introduction (cont d)

Outline. RainForest A Framework for Fast Decision Tree Construction of Large Datasets. Introduction. Introduction. Introduction (cont d) Outline RainForest A Framework for Fast Decision Tree Construction of Large Datasets resented by: ov. 25, 2004 1. 2. roblem Definition 3. 4. Family of Algorithms 5. 6. 2 Classification is an important

More information

Part I. Instructor: Wei Ding

Part I. Instructor: Wei Ding Classification Part I Instructor: Wei Ding Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Classification: Definition Given a collection of records (training set ) Each record contains a set

More information

Machine Learning: Symbolische Ansätze

Machine Learning: Symbolische Ansätze Machine Learning: Symbolische Ansätze Learning Rule Sets Introduction Learning Rule Sets Terminology Coverage Spaces Separate-and-Conquer Rule Learning Covering algorithm Top-Down Hill-Climbing Rule Evaluation

More information

Exam Advanced Data Mining Date: Time:

Exam Advanced Data Mining Date: Time: Exam Advanced Data Mining Date: 11-11-2010 Time: 13.30-16.30 General Remarks 1. You are allowed to consult 1 A4 sheet with notes written on both sides. 2. Always show how you arrived at the result of your

More information

Data Mining. 1.3 Input. Fall Instructor: Dr. Masoud Yaghini. Chapter 3: Input

Data Mining. 1.3 Input. Fall Instructor: Dr. Masoud Yaghini. Chapter 3: Input Data Mining 1.3 Input Fall 2008 Instructor: Dr. Masoud Yaghini Outline Instances Attributes References Instances Instance: Instances Individual, independent example of the concept to be learned. Characterized

More information

Nominal Data. May not have a numerical representation Distance measures might not make sense. PR and ANN

Nominal Data. May not have a numerical representation Distance measures might not make sense. PR and ANN NonMetric Data Nominal Data So far we consider patterns to be represented by feature vectors of real or integer values Easy to come up with a distance (similarity) measure by using a variety of mathematical

More information

An Empirical Study on feature selection for Data Classification

An Empirical Study on feature selection for Data Classification An Empirical Study on feature selection for Data Classification S.Rajarajeswari 1, K.Somasundaram 2 Department of Computer Science, M.S.Ramaiah Institute of Technology, Bangalore, India 1 Department of

More information

Data Mining. Part 1. Introduction. 1.3 Input. Fall Instructor: Dr. Masoud Yaghini. Input

Data Mining. Part 1. Introduction. 1.3 Input. Fall Instructor: Dr. Masoud Yaghini. Input Data Mining Part 1. Introduction 1.3 Fall 2009 Instructor: Dr. Masoud Yaghini Outline Instances Attributes References Instances Instance: Instances Individual, independent example of the concept to be

More information

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers James P. Biagioni Piotr M. Szczurek Peter C. Nelson, Ph.D. Abolfazl Mohammadian, Ph.D. Agenda Background

More information

Classification: Basic Concepts, Decision Trees, and Model Evaluation

Classification: Basic Concepts, Decision Trees, and Model Evaluation Classification: Basic Concepts, Decision Trees, and Model Evaluation Data Warehousing and Mining Lecture 4 by Hossen Asiful Mustafa Classification: Definition Given a collection of records (training set

More information

Introduction to Rule-Based Systems. Using a set of assertions, which collectively form the working memory, and a set of

Introduction to Rule-Based Systems. Using a set of assertions, which collectively form the working memory, and a set of Introduction to Rule-Based Systems Using a set of assertions, which collectively form the working memory, and a set of rules that specify how to act on the assertion set, a rule-based system can be created.

More information

CSCE 478/878 Lecture 6: Bayesian Learning and Graphical Models. Stephen Scott. Introduction. Outline. Bayes Theorem. Formulas

CSCE 478/878 Lecture 6: Bayesian Learning and Graphical Models. Stephen Scott. Introduction. Outline. Bayes Theorem. Formulas ian ian ian Might have reasons (domain information) to favor some hypotheses/predictions over others a priori ian methods work with probabilities, and have two main roles: Optimal Naïve Nets (Adapted from

More information

Rule induction. Dr Beatriz de la Iglesia

Rule induction. Dr Beatriz de la Iglesia Rule induction Dr Beatriz de la Iglesia email: b.iglesia@uea.ac.uk Outline What are rules? Rule Evaluation Classification rules Association rules 2 Rule induction (RI) As their name suggests, RI algorithms

More information

Extra readings beyond the lecture slides are important:

Extra readings beyond the lecture slides are important: 1 Notes To preview next lecture: Check the lecture notes, if slides are not available: http://web.cse.ohio-state.edu/~sun.397/courses/au2017/cse5243-new.html Check UIUC course on the same topic. All their

More information

Tillämpad Artificiell Intelligens Applied Artificial Intelligence Tentamen , , MA:8. 1 Search (JM): 11 points

Tillämpad Artificiell Intelligens Applied Artificial Intelligence Tentamen , , MA:8. 1 Search (JM): 11 points Lunds Tekniska Högskola EDA132 Institutionen för datavetenskap VT 2017 Tillämpad Artificiell Intelligens Applied Artificial Intelligence Tentamen 2016 03 15, 14.00 19.00, MA:8 You can give your answers

More information

CS 188: Artificial Intelligence Fall 2011

CS 188: Artificial Intelligence Fall 2011 CS 188: Artificial Intelligence Fall 2011 Lecture 21: ML: Naïve Bayes 11/10/2011 Dan Klein UC Berkeley Example: Spam Filter Input: email Output: spam/ham Setup: Get a large collection of example emails,

More information

ARTIFICIAL INTELLIGENCE (CS 370D)

ARTIFICIAL INTELLIGENCE (CS 370D) Princess Nora University Faculty of Computer & Information Systems ARTIFICIAL INTELLIGENCE (CS 370D) (CHAPTER-18) LEARNING FROM EXAMPLES DECISION TREES Outline 1- Introduction 2- know your data 3- Classification

More information

CSC411/2515 Tutorial: K-NN and Decision Tree

CSC411/2515 Tutorial: K-NN and Decision Tree CSC411/2515 Tutorial: K-NN and Decision Tree Mengye Ren csc{411,2515}ta@cs.toronto.edu September 25, 2016 Cross-validation K-nearest-neighbours Decision Trees Review: Motivation for Validation Framework:

More information

MACHINE LEARNING Example: Google search

MACHINE LEARNING Example: Google search MACHINE LEARNING Lauri Ilison, PhD Data Scientist 20.11.2014 Example: Google search 1 27.11.14 Facebook: 350 million photo uploads every day The dream is to build full knowledge of the world and know everything

More information

IBL and clustering. Relationship of IBL with CBR

IBL and clustering. Relationship of IBL with CBR IBL and clustering Distance based methods IBL and knn Clustering Distance based and hierarchical Probability-based Expectation Maximization (EM) Relationship of IBL with CBR + uses previously processed

More information

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai Decision Trees Decision Tree Decision Trees (DTs) are a nonparametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target

More information

Lazy Decision Trees Ronny Kohavi

Lazy Decision Trees Ronny Kohavi Lazy Decision Trees Ronny Kohavi Data Mining and Visualization Group Silicon Graphics, Inc. Joint work with Jerry Friedman and Yeogirl Yun Stanford University Motivation: Average Impurity = / interesting

More information

Data Mining. Decision Tree. Hamid Beigy. Sharif University of Technology. Fall 1396

Data Mining. Decision Tree. Hamid Beigy. Sharif University of Technology. Fall 1396 Data Mining Decision Tree Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 24 Table of contents 1 Introduction 2 Decision tree

More information

Decision Tree CE-717 : Machine Learning Sharif University of Technology

Decision Tree CE-717 : Machine Learning Sharif University of Technology Decision Tree CE-717 : Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Some slides have been adapted from: Prof. Tom Mitchell Decision tree Approximating functions of usually discrete

More information

Implementation of Classification Rules using Oracle PL/SQL

Implementation of Classification Rules using Oracle PL/SQL 1 Implementation of Classification Rules using Oracle PL/SQL David Taniar 1 Gillian D cruz 1 J. Wenny Rahayu 2 1 School of Business Systems, Monash University, Australia Email: David.Taniar@infotech.monash.edu.au

More information

Data Mining. Part 1. Introduction. 1.4 Input. Spring Instructor: Dr. Masoud Yaghini. Input

Data Mining. Part 1. Introduction. 1.4 Input. Spring Instructor: Dr. Masoud Yaghini. Input Data Mining Part 1. Introduction 1.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Instances Attributes References Instances Instance: Instances Individual, independent example of the concept to be

More information

Data Mining and Knowledge Discovery: Practice Notes

Data Mining and Knowledge Discovery: Practice Notes Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 2016/11/16 1 Keywords Data Attribute, example, attribute-value data, target variable, class, discretization

More information

Classification. 1 o Semestre 2007/2008

Classification. 1 o Semestre 2007/2008 Classification Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 Single-Class

More information

ISSUES IN DECISION TREE LEARNING

ISSUES IN DECISION TREE LEARNING ISSUES IN DECISION TREE LEARNING Handling Continuous Attributes Other attribute selection measures Overfitting-Pruning Handling of missing values Incremental Induction of Decision Tree 1 DECISION TREE

More information