Homework 2, Chapter 4 exercises, Hongying Du


Note: lg denotes log2 throughout this file.

3. Consider the training examples shown in Table 4.8 for a binary classification problem.

(a) The entropy of this collection of training examples with respect to the positive class:

E(S) = -(4/9 * lg(4/9) + 5/9 * lg(5/9)) = 0.9911

(b) Information gains of a1 and a2 relative to these training examples.

Classification by a1 or a2: [table of class counts for a1 = T, a1 = F, a2 = T, a2 = F; cell values not preserved]

For a1:

E(S_T) = -(3/4 * lg(3/4) + 1/4 * lg(1/4)) = 0.8113
E(S_F) = -(1/5 * lg(1/5) + 4/5 * lg(4/5)) = 0.7219
G(S, a1) = E(S) - 4/9 * E(S_T) - 5/9 * E(S_F) = 0.2294

For a2:

E(S_T) = -(2/5 * lg(2/5) + 3/5 * lg(3/5)) = 0.97
E(S_F) = -(2/4 * lg(2/4) + 2/4 * lg(2/4)) = 1
G(S, a2) = E(S) - 5/9 * E(S_T) - 4/9 * E(S_F) = 0.0072

(c) For a3, compute the information gain for every possible split.

There are six possible splits:

[Table: Left leaf of the tree | Right leaf of the tree | E(left leaf) | E(right leaf) | Information gain; cell values not preserved]

So if you decide to split by a3, you should use 2 as the split point.

(d) What is the best split (among a1, a2 and a3) according to the information gain?

By information gain, a1 is the best split.

(e) What is the best split (among a1 and a2) according to the classification error rate?

Classification error rate for a leaf t in the tree: 1 - max(p(i|t)), where the class i is + or -. Compute it for each leaf and average over the leaves, weighted by leaf size.

          a1 = T   a1 = F   a2 = T   a2 = F
error     1/4      1/5      2/5      2/4

Error(a1) = 4/9 * 1/4 + 5/9 * 1/5 = 2/9
Error(a2) = 5/9 * 2/5 + 4/9 * 2/4 = 4/9

So by error rate, a1 is the best split.

(f) What is the best split (among a1 and a2) according to the Gini index?

For each leaf t, Gini(t) = 1 - sum(p(i|t)^2), where the class i is + or -.

          a1 = T   a1 = F   a2 = T   a2 = F
Gini      3/8      8/25     12/25    1/2

Gini(a1) = 4/9 * 3/8 + 5/9 * 8/25 = 0.344
Gini(a2) = 5/9 * 12/25 + 4/9 * 1/2 = 0.489

So by Gini index, a1 is the best split.

8. Consider the decision tree.

(a) Generalization error rate of the tree using the optimistic approach.

Predicted class: 1st leaf: +, 2nd leaf: -, 3rd leaf: +, 4th leaf: -
Actual class: [per-leaf counts not preserved]

So there are 5 errors in total.
Error rate = 5/10 = 0.5

(b) Generalization error rate of the tree using the pessimistic approach.

Each of the 4 leaves adds a penalty of 0.5 to the 5 training errors:
Error rate = (5 + 4 * 0.5)/10 = 0.7

(c) Generalization error rate of the tree using the validation set.

Predicted class: 1st leaf: +, 2nd leaf: -, 3rd leaf: +, 4th leaf: -
Actual class: [per-leaf counts not preserved]

Error rate = 1/5 = 0.2

9.

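Cost for left tree: Cost(tree1, data) = 2 * lg 16 + 3 * lg 3 + 7 * lg n
Cost for right tree: Cost(tree2, data) = 4 * lg 16 + 5 * lg 3 + 4 * lg n

The two costs are equal when
2 * lg 16 + 3 * lg 3 + 7 * lg n = 4 * lg 16 + 5 * lg 3 + 4 * lg n,
that is, when 3 * lg n = 2 * lg 16 + 2 * lg 3, which gives n of about 13.2.

So when n <= 13 the left decision tree is better (lower cost), and when n > 13 the right one is better.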
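
The arithmetic in exercises 3, 8 and 9 can be checked with a short script. The following is a minimal Python sketch, not part of the original homework. All function names are made up for illustration. The per-leaf class counts are read off the fractions used in exercise 3, and `mdl_cost` simply takes the three coefficients of lg 16, lg 3 and lg n from the cost formulas above.

```python
from math import log2

def entropy(counts):
    """Base-2 entropy of a list of class counts (zero counts are skipped)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gini(counts):
    """Gini index: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def error_rate(counts):
    """Classification error: 1 minus the largest class proportion."""
    return 1 - max(counts) / sum(counts)

def weighted(measure, leaves):
    """Average an impurity measure over the leaves, weighted by leaf size."""
    n = sum(sum(leaf) for leaf in leaves)
    return sum(sum(leaf) / n * measure(leaf) for leaf in leaves)

def info_gain(parent, leaves):
    """Parent entropy minus the weighted entropy of the leaves."""
    return entropy(parent) - weighted(entropy, leaves)

def pessimistic_error(train_errors, n_leaves, n_train, penalty=0.5):
    """Training errors plus a fixed penalty per leaf, divided by the training size."""
    return (train_errors + penalty * n_leaves) / n_train

def mdl_cost(lg16_terms, lg3_terms, lgn_terms, n):
    """Cost of the form a * lg 16 + b * lg 3 + c * lg n used in exercise 9."""
    return lg16_terms * log2(16) + lg3_terms * log2(3) + lgn_terms * log2(n)

# Exercise 3: class counts per leaf, taken from the fractions in the text.
parent = [4, 5]
a1 = [[3, 1], [1, 4]]  # a1 = T, a1 = F
a2 = [[2, 3], [2, 2]]  # a2 = T, a2 = F
for name, leaves in (("a1", a1), ("a2", a2)):
    print(name,
          round(info_gain(parent, leaves), 4),
          round(weighted(error_rate, leaves), 4),
          round(weighted(gini, leaves), 4))

# Exercise 8: optimistic and pessimistic estimates.
print(5 / 10, pessimistic_error(5, 4, 10))

# Exercise 9: smallest n at which the right tree becomes cheaper.
n = 2
while mdl_cost(2, 3, 7, n) <= mdl_cost(4, 5, 4, n):
    n += 1
print(n)  # 14, so the left tree is cheaper for n <= 13
```

All three impurity measures rank a1 ahead of a2, and the loop stops at n = 14, which matches the n <= 13 threshold stated above.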

Classification of e-tailer customers using REPTree and random forest

Note: The confusion matrix uses this form:

          Class 0   Class 1
Class 0
Class 1

For precision, recall and ROC area, the three values for each result represent Class 0, Class 1 and the average of that result, respectively.

Result:

For REPTree:

[Table: Tuning | Confusion Matrix, with rows Raw data; Raw data with balanced data; Normalize attributes; Supervised Attribute Select filter. Followed by results on test data (using selected attribute data): Precision, Recall, MCC, ROC area. Cell values not preserved.]

For random forest:

[Table: Tuning | Confusion Matrix, with rows Raw data; Raw data with balanced data; Normalize attributes; Supervised Attribute Select filter. Followed by results on test data (using selected attribute data): Precision, Recall, MCC, ROC area. Cell values not preserved.]

ReplaceMissingValues: replaces all missing values for nominal and numeric attributes in a dataset with the modes and means from the training data.

Normalization: normalizes all data values into [0, 1].

Attribute Selection: uses the CfsSubsetEval option. For more information, see Hall, M. A. (1998). Correlation-based Feature Subset Selection for Machine Learning.

REPTree: a fast decision tree learner. It builds a decision/regression tree using information gain/variance and prunes it using reduced-error pruning (with back-fitting). The algorithm sorts values for numeric attributes only once. Missing values are dealt with by splitting the corresponding instances into pieces. (From: [link not preserved])

Random forest: an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees (from Wikipedia).
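
To make the ReplaceMissingValues and Normalize descriptions concrete, here is a minimal Python sketch of mean/mode imputation and min-max scaling on a single column. It is not Weka's implementation. The function names and the toy columns are invented for illustration, and missing values are assumed to be represented as `None`.

```python
from statistics import mean, mode

def replace_missing(column, nominal=False):
    """Fill None entries with the mode (nominal) or mean (numeric) of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mode(observed) if nominal else mean(observed)
    return [fill if v is None else v for v in column]

def normalize(column):
    """Min-max scale a numeric column into [0, 1]; a constant column maps to 0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Toy numeric column with one missing value: the mean of 2, 4 and 6 is 4.
numeric = replace_missing([2.0, None, 4.0, 6.0])
print(numeric)             # [2.0, 4.0, 4.0, 6.0]
print(normalize(numeric))  # [0.0, 0.5, 0.5, 1.0]

# Toy nominal column: the most frequent observed value is "a".
print(replace_missing(["a", None, "b", "a"], nominal=True))  # ['a', 'a', 'b', 'a']
```

In this sketch the fill values and the min/max come from the column being transformed. When a separate test set is processed, the values computed on the training data would normally be reused, in line with the "from the training data" wording above.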

I did the following steps in the experiment.

For the training data set:

1) Combine the feature data file with the label data file and save the result as a .csv file. Add a first line that gives the feature names.
2) Check whether there are missing values. If so, use Weka's unsupervised attribute filter ReplaceMissingValues.
3) Use the unsupervised attribute filter NumericToNominal to convert the label column to a nominal attribute.
4) To balance the data: use the unsupervised instance filter RemoveWithValues to remove the customer instances, then use the supervised instance filter Resample to keep 10% of the non-customer data, so that the number of non-customer instances equals the number of customer instances, and save the data. Use RemoveWithValues again with different parameters to get the customer data. Combining these two parts of the data gives the balanced data set.
5) To normalize attributes: use the unsupervised attribute filter Normalize on the previously processed data set.
6) To select attributes: use the supervised attribute filter AttributeSelection on the previously processed data set. This reduces the features to the following: F2, F4, F11, F18, F32, F59, F81, F110, F117, F125, F132, F140, F143, F195, F196, F230, F232, F238, F295, F296, F299, F310, F327, F334, Label.
7) Use the REPTree and Random Forest classifiers to classify each data set. Save the REPTree models as REPtree1st (for the data in row 1 of the results table), REPtree2nd (for the data in row 2), REPtree3rd (for the data in row 3) and REPtree4th (for the data in row 4). I did the same for the Random Forest algorithm, except that the names start with RFtree.

For the test data set:

8) Load the model.

9) For the test set, I added a column after the original data holding a fake label, the first 900 values of which were 0 and the rest 1. Both 0 and 1 are needed to keep the file consistent with the training set .arff file once it is saved as an .arff file, which means the attribute headers in the two files are the same. Add a feature-name line as the first line. I first applied steps 3) and 5) to the original test set, and then kept only the attributes that remain after step 6). Save the processed test set.
10) Supply the test set, then right-click on the model -> "Re-evaluate model on current test set".
11) Right-click on the model -> "Visualize classifier errors" -> "Save" saves the classifier result in, e.g., REPtreePredict. Open this file and remove all attributes except predicted_label, then save it as, e.g., REPtreePredictOnlyLabel. If I instead supply the test set and click Start, a prompt appears saying "Problem evaluating classifier: Train and test set are not compatible."
12) Open the saved label file in text mode and copy it into Excel. Combine the true labels and predicted labels into one Excel file and copy its content as the input for Matlab.
13) Calculate MCC using calculatemcc.m in Matlab. Calculate precision, recall and MCC for the test data set using predict.m in Matlab.
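
The contents of calculatemcc.m and predict.m are not shown, so the following is only a Python sketch of how precision, recall and MCC (Matthews correlation coefficient) can be computed from a column of true labels and a column of predicted labels, using the standard definitions. The function names and the toy label lists are invented for illustration, and class 1 is assumed to be the positive class.

```python
from math import sqrt

def confusion_counts(true_labels, predicted_labels, positive=1):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = tn = fn = 0
    for t, p in zip(true_labels, predicted_labels):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def precision_recall_mcc(tp, fp, tn, fn):
    """Return (precision, recall, MCC); a zero denominator yields 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc

# Toy labels, not taken from the experiment.
true_labels = [1, 1, 1, 0, 0, 0, 0, 0]
predicted_labels = [1, 1, 0, 1, 0, 0, 0, 0]

counts = confusion_counts(true_labels, predicted_labels)
print(counts)  # (2, 1, 4, 1)
precision, recall, mcc = precision_recall_mcc(*counts)
print(round(precision, 4), round(recall, 4), round(mcc, 4))  # 0.6667 0.6667 0.4667
```

Passing `positive=0` to `confusion_counts` gives the per-class values for Class 0, which is how the "Class 0, Class 1, average" triples in the results could be assembled.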
