IV B. Tech I semester (JNTUH-R13)

Size: px
Start display at page:

Download "IV B. Tech I semester (JNTUH-R13)"

Transcription

1 St. MARTIN s ENGINERING COLLEGE Dhulapally(V), Qutbullapur(M), Secunderabad COMPUTER SCIENCE AND ENGINEERING LAB MANUAL OF DATAWAREHOUSE AND DATAMINING IV B. Tech I semester (JNTUH-R13) Prepared by P.CHANDRASHAKER REDDY Associate Professor S.SRILAXMI Assistant Professor G. SATISH Assistant Professor G. PUSPHA RAJITHA Assistant Professor

2 SNo Experiment Name Date Listing of categorical attributes and the real-valued attributes separately. Rules for identifying attributes. Training a decision tree. Test on classification of decision tree. Testing on the training set. Using cross validation for training. Significance of attributes in decision tree. Trying generation of decision tree with various number of decision tree. Find out differences in results using decision tree and cross-validation on a data set. Decision trees. Reduced error pruning for training Decision Trees using cross-validation Convert a Decision Trees into "if-then-else rules". Roll No. Page 1

3 WEKA INTRODUCTION Weka is a collection machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU general Public License. Weka stands for the Waikato Environment for Knowledge Analysis, which was developed at the University of Waikato in New Zealand. WEKA is extensible and has become a collection of machine learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost every platform. WEKA is easy to use and to be applied at several different levels. You can access the WEKA class library from your own Java Program, and implemented new machine learning algorithms. There are three major implemented schemes, WEKA. 1. Implemented schemes for classification, 2. Implemented schemes for numeric prediction, 3. Implemented mete-schemes. Besides actual learning schemes, WEKA also contains a large variety tools that can be used for pre-processing datasets, so that you can focus on your algorithm without considering too much details as reading the data from files, implementing filtering algorithm and providing code to evaluate the results. The weka GUI chooser provides a starting point launching Weka s main GUI applications and supporting tools. If one prefers a MDI ( Multiple Document Interface ) appearance, then this is provided by an alternative launcher called Main (class weka.gui.main). Roll No. Page 2

4 The GUI Chooser consists of four buttons-one for each of four major Weka applications and four for menus. The buttons can be used to start the following applications: Explorer An environment for exploring data with WEKA. Experimenter An environment for supporting experiments and conducting statistical tests between learning schemes. Knowledge Flow this environment supports essentially the same functions as the Explorer but with a drag and drop interface. One advantage is that is supports incremental learning. Simple CLI Provides a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command line interface. Weka GUI Chooser Roll No. Page 3

5 Credit Risk Assessment Description: The business of banks is making loans. Assessing the credit worthiness of an applicant is of crucial importance. You have to develop a system to help a loan officer decide whether the credit of a customer is good, or bad. A bank s business rules regarding loans must consider two opposing factors. On the one hand, a bank wants to make as many loans as possible. Interest on these loans is the ban s profit source. On the other hand, a bank cannot afford to make too many bad loans. Too many bad loans could lead to the collapse of the bank. The bank s loan policy must involve a compromise not too strict, and not too lenient. To do the assignment, you first and foremost need some knowledge about the world of credit. You can acquire such knowledge in a number of ways. 1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview her and try to represent her knowledge in the form of production rules. 2. Books. Find some training manuals for loan officers or perhaps a suitable textbook on finance. Translate this knowledge from text form to production rule form. 3. Common sense. Imagine yourself as a loan officer and make up reasonable rules which can be used to judge the credit worthiness of a loan applicant. 4. Case histories. Find records of actual cases where competent loan officers correctly judged when not to, approve a loan application. The German Credit Data: Actual historical credit data is not always easy to come by because of confidentiality rules. Here is one such dataset (original) Excel spreadsheet version of the German credit data (download from web). In spite of the fact that the data is German, you should probably make use of it for this assignment, (Unless you really can consult a real loan officer!) A few notes on the German dataset: DM stands for Deutsche Mark, the unit of currency, worth about 90 cents Canadian (but looks and acts like a quarter). Owns telephone: German phone rates are much higher than in Canada so fewer people own telephones. Foreign worker: There are millions of these in Germany (many from Turkey). It is very hard to get German citizenship if you were not born of German parents. There are 20 attributes used in judging a loan applicant. The goal is the classify the applicant into one of two categories, good or bad. Roll No. Page 4

6 EXPERIMENT-1 List all the categorical (or nominal) attributes and the real-valued attributes separately. Aim: To list all the categorical (or nominal) attributes and the real valued attributes using Weka mining tool. Tools/ Apparatus: Weka Mining tool. Procedure: 1) Open the Weka GUI Chooser. 2) Select EXPLORER present in Applications. 3) Select Preprocess Tab. 4) Go to OPEN file and browse the file that is already stored in the system German credit data.arff 5) Clicking on any attribute in the left panel will show the basic statistics on that selected attribute. Sample Output: Roll No. Page 5

7 6) Select choose button then select weka->filters->unsupervised->attribute and here select the option as remove type 7) Click anywhere on Remove type white box which is beside the choose button to open weka.gui.genericobjecteditor window. 8) From this select the attribute type as delete nominal attributes and select the option for invertselection as true. 9) Click on ok button and then click on Apply button, then it will show the nominal attributes as follows Roll No. Page 6

8 10) Click on undo button and then repeat the above steps for displaying all numerical attributes by selecting the attribute type as Delete numeric Attributes. Output: Categorical/ Nominal Attributes: 1. Checking_status 2.Credit_history 3.Purpose 4.Savings_status 5. Employment 6. Personal_status 7.Other_parties 8.Property_magnitude 9. Other_payment_plans 10. Housing 11.Job 12.own_telephone 13. foreign_worker 14. Class Numeric Attributes: 1. Duration 2. Credit_amount 3. Installment_commitment 4. Residece_since 5. Age 6. Existing_credits 7. Num_dependents. Result: Hence all categorical and Numerical attributes are displayed. Roll No. Page 7

9 EXPERIMENT-2 What attributes do you think might be crucial in making the credit assessment? Come up with some simple rules in plain English using your selected attributes. Aim: To identify the rules with some of the important attributes by a) Manually and b) Using Weka Tools/ Apparatus: Weka mining tool. According to me the following attributes may be crucial in making the credit risk assessment. 1. Credit_history 2. Employment 3. Property_magnitude 4. job 5. duration 6. crdit_amount 7. installment 8. existing credit Based on the above attributes, we can make a decision whether to give credit or not. Theory: Association rule mining is defined as: Let be a set of n binary attributes called items. Let be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X=>Y where X, Y C I and X Π Y=Φ. The sets of items (for short item sets) X and Y are called antecedent (left hand side or LHS) and consequent (righthandside or RHS) of the rule respectively. The set of items is I = {milk, bread, butter, beer} and a small database containing the items (1 codes presence and 0 absence of an item in a transaction) is shown in the table to the right. An example rule for the supermarket could be meaning that if milk and bread is bought, customers also buy butter. Roll No. Page 8

10 To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best known constraints are minimum thresholds on support and confidence. The support supp(x) of an item set X is defined as the proportion of transactions in the data set which contain the item set. In the example database, the item set {milk, bread} has a support of 2 / 5 = 0.4 since it occurs in 40% of all transactions (2 out of 5 transactions). The confidence of a rule is defined. For example, the rule has a confidence of 0.2 / 0.4 = 0.5 in the database, which means that for 50% of the transactions containing milk and bread the rule is correct. Confidence can be interpreted as an estimate of the probability P(Y X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS. ALGORITHM: Association rule mining is to find out association rules that satisfy the predefined minimum support and confidence from a given database. The problem is usually decomposed into two sub problems. One is to find those item sets whose occurrences exceed a predefined threshold in the database; Those item sets are called frequent or large item sets. The second problem is to generate association rules from those large item sets with the constraints of minimal confidence. Suppose one of the large item sets is Lk, Lk = {I1, I2,, Ik}, association rules with this item sets are generated in the following way: the first rule is {I1, I2,, Ik1} and {Ik}, by checking the confidence this rule can be determined as interesting or not. Then other rule are generated by deleting the last items in the antecedent and inserting it to the consequent, further the confidences of the new rules are checked to determine the interestingness of them. Those processes iterated until the antecedent becomes empty. Since the second sub problem is quite straight forward, most of the researches focus on the first sub problem. The Apriori algorithm finds the frequent sets L In Database D. Find frequent set Lk 1. Join Step..Ck is generated by joining Lk 1with itself Prune Step. Roll No. Page 9

11 .Any (k 1) item set that is not frequent cannot be a subset of a frequent k item set, hence should be removed, Where. (Ck: Candidate item set of size k) (Lk: frequent item set of size k) Apriori Pseudo code Apriori (T, ) L< {Large 1itemsets that appear in more than transactions } K<2 While L(k1) Φ C(k)<Generate( Lk 1) for transactions t T C(t)Subset(Ck,t) for candidates c C(t) count[c]<count[ c]+1 L(k)<{ c C(k) count[c] K<K+ 1 return Ụ L(k) k Procedure: 1) Given the Bank database for mining. 2) Select EXPLORER in WEKA GUI Chooser. 3) Load Bank.csv in Weka by Open file in Preprocess tab. 4) Select only Nominal values. 5) Go to Associate Tab. 6) Select Apriori algorithm from Choose button present in Associator weka.associations.apriori -N 10 -T 0 -C 0.9 -D U 1.0 -M 0.1 -S c -1 7) Select Start button Roll No. Page 10

12 8) Now we can see the sample rules. Sample Output: Roll No. Page 11

13 EXPERIMENT-3 One type of model that you can create is a Decision Tree - train a Decision Tree using the complete dataset as the training data. Report the model obtained after training. Aim: To create a Decision tree by training data set using Weka mining tool. Tools/ Apparatus: Weka mining tool. Theory: Classification is a data mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. For example, a classification model could be used to identify loan applicants as low, medium, or high credit risks. A classification task begins with a data set in which the class assignments are known. For example, a classification model that predicts credit risk could be developed based on observed data for many loan applicants over a period of time. In addition to the historical credit rating, the data might track employment history, home ownership or rental, years of residence, number and type of investments, and so on. Credit rating would be the target, the other attributes would be the predictors, and the data for each customer would constitute a case. Classifications are discrete and do not imply order. Continuous, floating point values would indicate a numerical, rather than a categorical, target. A predictive model with a numerical target uses a regression algorithm, not a classification algorithm. The simplest type of classification problem is binary classification. In binary classification, the target attribute has only two possible values: for example, high credit rating or low credit rating. Multiclass targets have more than two values: for example, low, medium, high, or unknown credit rating. In the model build (training) process, a classification algorithm finds relationships between the values of the predictors and the values of the target. Different classification algorithms use different techniques for finding relationships. These relationships are summarized in a model, which can then be applied to a different data set in which the class assignments are unknown. Classification models are tested by comparing the predicted values to known target values in a set of test data. The historical data for a classification project is typically divided into two data Roll No. Page 12

14 sets: one for building the model; The other for testing the model. Scoring a classification model results in class assignments and probabilities for each case. For example, a model that classifies customers as low, medium, or high value would also predict the probability of each classification for each customer. Classification has many applications in customer segmentation, business modeling, marketing, credit analysis, and biomedical and drug response modeling. Different Classification Algorithms Oracle Data Mining provides the following algorithms for classification: Decision Tree Decision trees automatically generate rules, which are conditional statements that reveal the logic used to build the tree. Naive Bayes Naive Bayes uses Bayes' Theorem, a formula that calculates a probability by counting the frequency of values and combinations of values in the historical data. Procedure: 1) Open Weka GUI Chooser. 2) Select EXPLORER present in Applications. 3) Select Preprocess Tab. 4) Go to OPEN file and browse the file that is already stored in the system bank.csv. 5) Go to Classify tab. 6) Here the c4.5 algorithm has been chosen which is entitled as j48 in Java and can be selected by clicking the button choose 7) and select tree j48 8) Select Test options Use training set 9) if need select attribute. 10) Click Start. 11) Now we can see the output details in the Classifier output. Roll No. Page 13

15 12) Right click on the result list and select visualize tree option. Sample output: Roll No. Page 14

16 Roll No. Page 15

17 EXPERIMENT-4 Suppose you use your above model trained on the complete dataset, and classify credit good/bad for each of the examples in the dataset. What % of examples can you classify correctly? (This is also called testing on the training set) Why do you think you cannot get 100 % training accuracy? Aim: To find the percentage of examples that are classified correctly by using the above created decision tree model? ie. Testing on the training set. Tools/ Apparatus: Weka mining tool. Theory: Naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even though these features depend on the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple. An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire covariance matrix The naive Bayes probabilistic model : DESCRIPTION: Classification Predicts categorical class labels (discrete or nominal) Classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data Prediction Models continuous-valued functions, i.e., predicts unknown or missing values Typical applications Credit approval Target marketing Medical diagnosis Fraud detection Roll No. Page 16

18 Training Data Data objects whose class labels are known. Accuracy The Accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier. Procedure: 1) Given the Bank database for mining. 2) Use the Weka GUI Chooser. 3) Select EXPLORER present in Applications. 4) Select Preprocess Tab. 5) Go to OPEN file and browse the file that is already stored in the system bank.csv. 6) Go to Classify tab. 7) Choose Classifier Tree 8) Select NBTree i.e., Navie Baysiean tree. 9) Select Test options Use training set 10) if need select attribute. 11) now Start weka. 12) now we can see the output details in the Classifier output. Roll No. Page 17

19 Sample Output: === Evaluation on training set === === Summary === Correctly Classified Instances % Incorrectly Classified Instances % Kappa statistic Mean absolute error Root mean squared error Relative absolute error % Root relative squared error % Total Number of Instances 600 === Detailed Accuracy By Class === TP Rate FP Rate Precision Recall F-Measure ROC Area Class YES NO Weighted Avg === Confusion Matrix === a b <-- classified as a = YES b = NO Roll No. Page 18

20 EXPERIMENT-5 Suppose you use your above model trained on the complete dataset, and classify credit good/bad for each of the examples in the dataset. What % of examples can you classify correctly? (This is also called testing on the training set) Why do you think you cannot get 100 % training accuracy? Aim: To create a Decision tree by cross validation training data set using Weka mining tool. Tools/ Apparatus: Weka mining tool.. Theory: Decision tree learning, used in data mining and machine learning, uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value In these tree structures, leaves represent classifications and branches represent conjunctions of features that lead to those classifications. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data but not decisions; Rather the resulting classification tree can be an input for decision making. This page deals with decision trees in data mining. Decision tree learning is a common method used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. Each interior node corresponds to one of the input variables; There are edges to children for each of the possible values of that input variable. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf. A tree can be "learned" by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subset at a node all has the same value of the target variable, or when splitting no longer adds value to the predictions. In data mining, trees can be described also as the combination of mathematical and computational techniques to aid the description, categorization and generalization of a given set of data. Data comes in records of the form: (x, y) = (x1, x2, x3..., xk, y) Roll No. Page 19

21 Procedure: 1) Given the Bank database for mining. 2) Use the Weka GUI Chooser. 3) Select EXPLORER present in Applications. 4) Select Preprocess Tab. 5) Go to OPEN file and browse the file that is already stored in the system bank.csv. 6) Go to Classify tab. 7) Choose Classifier Tree 8) Select J48 9) Select Test options Cross-validation. 10) Set Folds Ex:10 11) if need select attribute. 12) now Start weka. 13) now we can see the output details in the Classifier output. 14) Compare the output results with that of the 4 th experiment 15) check whether the accuracy increased or decreased? Sample Output: Roll No. Page 20

22 === Stratified cross-validation === === Summary === Correctly Classified Instances % Incorrectly Classified Instances % Kappa statistic Mean absolute error Root mean squared error Relative absolute error % Root relative squared error % Total Number of Instances 600 === Detailed Accuracy By Class === TP Rate FP Rate Precision Recall F-Measure ROC Area Class YES NO Weighted Avg === Confusion Matrix === a b <-- classified as a = YES b = NO Roll No. Page 21

23 EXPERIMENT-6 Is testing on the training set as you did above a good idea? Why? Why not? Bad idea, if take all the data into training set. Then how to test the above classification is correctly or not? According to the rules, for the maximum accuracy, we have to take 2/3 of the dataset as training set and the remaining 1/3 as test set. But here in the above model we have taken complete dataset as training set which results only 85.5% accuracy. This is done for the analyzing and training of the unnecessary attributes which does not make a crucial role in credit risk assessment. And by this complexity is increasing and finally it leads to the minimum accuracy. If some part of the dataset is used as a training set and the remaining as test set then it leads to the accurate results and the time for computation will be less. This is why; we prefer not to take complete dataset as training set. Use Training Set Result for the table GermanCreditData: Correctly Classified Instances % Incorrectly Classified Instances % Kappa statistic Mean absolute error Root mean squared error 0.34 Relative absolute error % Root relative squared error % Total Number of Instances 1000 Roll No. Page 22

24 EXPERIMENT-7 One approach for solving the problem encountered in the previous question is using cross-validation? Describe what cross-validation is briefly. Train a Decision Tree again using cross-validation and report your results. Does your accuracy increase/decrease? Why? Aim: To develop a decision tree and cross validate accuracy of data set is increasing or decreasing. Description: Cross validation is a common technique for assessing classifier accuracy, based on randomly sampled partitions of the given data. In k-fold cross validation the initial data are randomly portioned into k mutually exclusive subsets or folds, s1,s2..sk, each of approximately equal size. Training and testing is performed k times. In iteration I, the subset Si is reserved as the test set, and the remaining subset are collectively used to train the classifier. That is the classifier of the first iteration is trained on subsets s2,..sk and tested on S2 and so on.the accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of samples in the initial data. Procedure: In weka tool using cross validation testing method, first once the dataset is been given to the tool using explorer and select all attributes of the dataset using ALL button and select Classify menu option, select classifier as J48, Use the test option as cross validation where folds will be 10 by default selection by weka and click on start and we can see the results of this method. Here accuracy of this method using the same data set when compared to decision tree does not vary. We can try out for various combinations of this option of cross validation with different folding values like 2, 5, 8, 10(default), and see the accuracy. Finally with this method we can come to a conclusion that in cross validation the folds should not be varies neither high nor very low. Roll No. Page 23

25 Sample Output: Number of Leaves : 103 Size of the tree : 140 Time taken to build model: 0.1 seconds === Stratified cross-validation === === Summary === Correctly Classified Instances % Incorrectly Classified Instances % Kappa statistic Mean absolute error Root mean squared error Relative absolute error % Root relative squared error % Coverage of cases (0.95 level) 92.8 % Mean rel. region size (0.95 level)91.7 % Total Number of Instances 1000 Roll No. Page 24

26 EXPERIMENT-8 Check to see if the data shows a bias against foreign workers (attribute 20), OR personal status (attribute 9). One way to do this (perhaps rather simple method) is to remove these attributes from the dataset and see if the decision tree created in those cases is significantly different from the full dataset case which you have already done. To remove an attribute you can use the preprocess tab in weka s GUI Explorer. Did removing these attributes have any significant effect? Discuss AIM: To check accuracy of German credit data after removing foreign_workers and personal_status. PROCEDURE: 1. Open the Weka GUI Chooser. 2. Select Explorer present Applications. 3. Select Preprocess Tab. 4. Go to Open file and browse the file that is already stored in the system German credit data.arff. 5. Clicking on any attribute in the left panel will show the basic statistics on that selected attribute then we can list the following. Roll No. Page 25

27 Sample Output: 6) Check foreign_workers and personal_status and remove those attributes. Roll No. Page 26

28 7. Select choose button then select weka->trees->j48, select Use training set option and click start. 8. Accuracy before removing foreign_worker and personal_status. Correctly Classified Instances % Incorrectly Classified Instances % 9. After removing foreign_worker and personal_status. Output: Correctly Classified Instances % Incorrectly Classified Instances % Result: By removing foreign_workers and personal_status from German credit data accuracy is increased and time has been reduced to some extent. Roll No. Page 27

29 Roll No. Page 28

30 EXPERIMENT 9 Another question might be, do you really need to input so many attributes to get good results? Maybe only a few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, 17 (and 21, the class attribute (naturally)). Try out some combinations. (You had removed two attributes in previous problem Remember to reload the ARFF data file to get all the attributes initially before you start selecting the ones you want.) AIM: To check accuracy of German credit data after removing selected attributes like 2, 3, 5,7,10,17 and 21. PROCEDURE: 1. Open the Weka GUI Chooser. 2. Select Explorer present Applications. 3. Select Preprocess Tab. 4. Go to Open file and browse the file that is already stored in the system German credit data.arff. 5. Clicking on any attribute in the left panel will show the basic statistics on that selected attribute then we can list the following. 36 P a g e

31 6. Click on invert tab. 37 P a g e

32 7. 2,3,5,7,10,17,21 except these attribute all attributes are removed and check accuracy for these attributes. 8. Select choose button then select weka->trees->j48, select 38 P a g e Use training set option and click start.

33 Correctly Classified Instances % Incorrectly Classified Instances % Here Accuacy is decreased. OUTPUT Here Accuracy is decreased and check randomly selected attributes also. RESULT: Hence accuracy is decreased by removing selected/random attributed. 39 P a g e

34 EXPERIMENT 10 Sometimes, the cost of rejecting an applicant who actually has a good credit Case 1. Might be higher than accepting an applicant who has bad credit Case 2. Instead of counting the misclassifications equally in both cases, give a higher cost to the first case (say cost 5) and lower cost to the second case. You can do this by using a cost matrix in WEKA. Train your Decision Tree again and report the Decision Tree and cross-validation results. Are they significantly different from results obtained in problem 9 (using equal cost)? AIM: To generate a report for decision tree and cross validation results by using two cases for cost of rejecting. Case 1: Might be higher than accepting an applicant who has bad credit say cost 5. Case 2: Instead of counting the misclassifications equally in both cases, gives a higher cost to first case and lower cost to the second case. DESCRIPTION: In the Problem 9, we used equal cost and we trained the decision tree. But here, we consider two cases with different cost. Let us take cost 5 in case 1 and cost 2 in case 2. When we give such costs in both cases and after training the decision tree, we can observe that almost equal to that of the decision tree obtained in problem 9. Case1 (cost 5) Case2 (cost 5) 40 P a g e Total Cost Average Cost We don t find this cost factor in problem 9. As there we use equal cost. This is the major difference between the results of problem 9 and problem 10. The cost matrices we used here: Case 1: Case 2: PROCEDURE:

35 Step1: Select classify tab. Step2: Select More Option from Test Option. Step3: Tick on cost sensitive Evaluation and go to set. Step4: Set classes as 2. Step5: Click on Resize and then we ll get cost matrix. Step6: Then change the 2nd entry in 1st row and 2nd entry in 1st column to 5.0 Step7: Then confusion matrix will be generated and you can find out the difference between good and bad attribute. Step8: Check accuracy whether it s changing or not. OUTPUT 41 P a g e

36 42 P a g e

37 Case 1: Total Cost 4852 Average Cost P a g e

38 Case 2: Total Cost 1963 Average Cost P a g e

39 Result: Hence Two cases cost are tested and no change/ variation in accuracy and dataset is not effected. 45 P a g e

40 EXPERIMENT 11 Do you think it is a good idea to prefer simple decision trees instead of having long complex decision trees? How does the complexity of a Decision Tree relate to the bias of the model? AIM: To check making simple decision tree is good idea instead of making complex decision. Description: When we consider long complex decision trees, we will have many unnecessary attributes in the tree which results in increase of the bias of the model. Because of this, the accuracy of the model can also effect. This problem can be reduced by considering simple decision tree. The attributes will be less and it decreases the bias of the model. Due to this the result will be more accurate. So it is a good idea to prefer simple decision trees instead of long complex trees. Procedure: Open any existing ARFF file e.g labor.arff. 1. Open any existing Arff file. E.g: labour.arff 2. In preprocess tab, select ALL to select all the attributes. 3. Go to classify tab and then use traning set with J48 algorithm. 46 P a g e

41 4. To generate the decision tree, right click on the result list and select visualize tree option, by which the decision tree will be generated 5. Right click on J48 algorithm to get Generic Object Editor window 6. In this,make the unpruned option as true. 7. Then press OK and then start. we find the tree will become more complex if not pruned 47 P a g e

42 8. The tree has become more complex 48 P a g e

43 Result: Yes, It is a good idea to prefer simple decision trees instead of having long complex decision trees. 49 P a g e

44 EXPERIMENT 12 You can make your Decision Trees simpler by pruning the nodes. One approach is to use Reduced Error Pruning - Explain this idea briefly. Try reduced error pruning for training your Decision Trees using cross-validation (you can do this in WEKA) and report the Decision Tree you obtain? Also, report your accuracy using the pruned model. Does your accuracy increase? AIM: To make a decision tree simpler by pruning the nodes. Description: Reduced-error pruning:- The idea of using a separate pruning set for pruning which is applicable to decision trees as well as rule sets is called reduced-error pruning. The variant described previously prunes a rule immediately after it has been grown and is called incremental reduced-error pruning. Another possibility is to build a full, unpruned rule set first, pruning it afterwards by discarding individual tests. However, this method is much slower. Of course, there are many different ways to assess the worth of a rule based on the pruning set. A simple measure is to consider how well the rule would do at discriminating the predicted class from other classes if it were the only rule in the theory, operating under the closed world assumption. If it gets p instances right out of the t instances that it covers, and there are P instances of this class out of a total T of instances altogether, then it gets positive instances right. The instances that it does not cover include N - n negative ones, where n = t p is the number of negative instances that the rule covers and N = T - P is the total number of negative instances. Thus the rule has an overall success ratio of [p +(N - n)] T, and this quantity, evaluated on the test set, has been used to evaluate the success of a rule when using reduced-error pruning. Or Tree pruning is a method which eliminates redundancy of the nodes structure, outliers of a decision trees. Pruning makes use of two methods like 50 P a g e

45 Pre Pruning Post Pruning Prepruning is a method where we apply some criterions or test conditions before the construction of decision tree induction where we can expect the elimination of outliers and redundancy. Where as in post pruning method we apply some criterions once the decision tree construction is derived. The given task-11 comes under prepruning method where we can apply various test conditions in weka explorer like Use training set Supplied test set Cross validation using folds Percentage split The above test conditions are used as part of prepruning and can expect good classifier accuracy. Procedure 1. Right click on J48 algorithm to get Generic Object Editor window 2. In this,make reduced error pruning option as true and also the unpruned option as true. 3. Then press OK and then start. 4. We find that the accuracy has been increased by selecting the reduced error pruning option. 51 P a g e

46 52 P a g e

47 Result: The above test conditions are used as part of prepruning and can expect good classifier accuracy. 53 P a g e

DATA MINING LAB MANUAL

DATA MINING LAB MANUAL DATA MINING LAB MANUAL Subtasks : 1. List all the categorical (or nominal) attributes and the real-valued attributes seperately. Attributes:- 1. checking_status 2. duration 3. credit history 4. purpose

More information

Data Mining Laboratory Manual

Data Mining Laboratory Manual Data Mining Laboratory Manual Department of Information Technology MLR INSTITUTE OF TECHNOLOGY Marri Laxman Reddy Avenue, Dundigal, Gandimaisamma (M), R.R. Dist. Data Mining Laboratory Manual Prepared

More information

SWETHA ENGINEERING COLLEGE (Approved by AICTE, New Delhi, Affiliated to JNTUA) DATA MINING USING WEKA

SWETHA ENGINEERING COLLEGE (Approved by AICTE, New Delhi, Affiliated to JNTUA) DATA MINING USING WEKA SWETHA ENGINEERING COLLEGE (Approved by AICTE, New Delhi, Affiliated to JNTUA) DATA MINING USING WEKA LAB RECORD N99A49G70E68S51H Data Mining using WEKA 1 WEKA [ Waikato Environment for Knowledge Analysis

More information

WEKA homepage.

WEKA homepage. WEKA homepage http://www.cs.waikato.ac.nz/ml/weka/ Data mining software written in Java (distributed under the GNU Public License). Used for research, education, and applications. Comprehensive set of

More information

Data Mining. 3.3 Rule-Based Classification. Fall Instructor: Dr. Masoud Yaghini. Rule-Based Classification

Data Mining. 3.3 Rule-Based Classification. Fall Instructor: Dr. Masoud Yaghini. Rule-Based Classification Data Mining 3.3 Fall 2008 Instructor: Dr. Masoud Yaghini Outline Using IF-THEN Rules for Classification Rules With Exceptions Rule Extraction from a Decision Tree 1R Algorithm Sequential Covering Algorithms

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.4. Spring 2010 Instructor: Dr. Masoud Yaghini Outline Using IF-THEN Rules for Classification Rule Extraction from a Decision Tree 1R Algorithm Sequential Covering Algorithms

More information

The Explorer. chapter Getting started

The Explorer. chapter Getting started chapter 10 The Explorer Weka s main graphical user interface, the Explorer, gives access to all its facilities using menu selection and form filling. It is illustrated in Figure 10.1. There are six different

More information

Supervised and Unsupervised Learning (II)

Supervised and Unsupervised Learning (II) Supervised and Unsupervised Learning (II) Yong Zheng Center for Web Intelligence DePaul University, Chicago IPD 346 - Data Science for Business Program DePaul University, Chicago, USA Intro: Supervised

More information

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others Naïve Bayes Classification Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others Things We d Like to Do Spam Classification Given an email, predict

More information

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis CHAPTER 3 BEST FIRST AND GREEDY SEARCH BASED CFS AND NAÏVE BAYES ALGORITHMS FOR HEPATITIS DIAGNOSIS 3.1 Introduction

More information

Enterprise Miner Tutorial Notes 2 1

Enterprise Miner Tutorial Notes 2 1 Enterprise Miner Tutorial Notes 2 1 ECT7110 E-Commerce Data Mining Techniques Tutorial 2 How to Join Table in Enterprise Miner e.g. we need to join the following two tables: Join1 Join 2 ID Name Gender

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:

More information

Association Pattern Mining. Lijun Zhang

Association Pattern Mining. Lijun Zhang Association Pattern Mining Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction The Frequent Pattern Mining Model Association Rule Generation Framework Frequent Itemset Mining Algorithms

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

An Introduction to WEKA Explorer. In part from: Yizhou Sun 2008

An Introduction to WEKA Explorer. In part from: Yizhou Sun 2008 An Introduction to WEKA Explorer In part from: Yizhou Sun 2008 What is WEKA? Waikato Environment for Knowledge Analysis It s a data mining/machine learning tool developed by Department of Computer Science,,

More information

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10) CMPUT 391 Database Management Systems Data Mining Textbook: Chapter 17.7-17.11 (without 17.10) University of Alberta 1 Overview Motivation KDD and Data Mining Association Rules Clustering Classification

More information

What is Learning? CS 343: Artificial Intelligence Machine Learning. Raymond J. Mooney. Problem Solving / Planning / Control.

What is Learning? CS 343: Artificial Intelligence Machine Learning. Raymond J. Mooney. Problem Solving / Planning / Control. What is Learning? CS 343: Artificial Intelligence Machine Learning Herbert Simon: Learning is any process by which a system improves performance from experience. What is the task? Classification Problem

More information

CS249: ADVANCED DATA MINING

CS249: ADVANCED DATA MINING CS249: ADVANCED DATA MINING Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu April 24, 2017 Homework 2 out Announcements Due May 3 rd (11:59pm) Course project proposal

More information

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá INTRODUCTION TO DATA MINING Daniel Rodríguez, University of Alcalá Outline Knowledge Discovery in Datasets Model Representation Types of models Supervised Unsupervised Evaluation (Acknowledgement: Jesús

More information

A Comparative Study of Selected Classification Algorithms of Data Mining

A Comparative Study of Selected Classification Algorithms of Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.220

More information

COMP 465: Data Mining Classification Basics

COMP 465: Data Mining Classification Basics Supervised vs. Unsupervised Learning COMP 465: Data Mining Classification Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Supervised

More information

Data Mining and Knowledge Discovery: Practice Notes

Data Mining and Knowledge Discovery: Practice Notes Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 2016/01/12 1 Keywords Data Attribute, example, attribute-value data, target variable, class, discretization

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

Classification by Association

Classification by Association Classification by Association Cse352 Ar*ficial Intelligence Professor Anita Wasilewska Generating Classification Rules by Association When mining associa&on rules for use in classifica&on we are only interested

More information

Data Mining and Knowledge Discovery: Practice Notes

Data Mining and Knowledge Discovery: Practice Notes Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 2013/12/09 1 Practice plan 2013/11/11: Predictive data mining 1 Decision trees Evaluating classifiers 1: separate

More information

Data Mining and Knowledge Discovery: Practice Notes

Data Mining and Knowledge Discovery: Practice Notes Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 2016/01/12 1 Keywords Data Attribute, example, attribute-value data, target variable, class, discretization

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 08: Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu October 24, 2017 Learnt Prediction and Classification Methods Vector Data

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

Weka ( )

Weka (  ) Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised

More information

Basic Data Mining Technique

Basic Data Mining Technique Basic Data Mining Technique What is classification? What is prediction? Supervised and Unsupervised Learning Decision trees Association rule K-nearest neighbor classifier Case-based reasoning Genetic algorithm

More information

Using Weka for Classification. Preparing a data file

Using Weka for Classification. Preparing a data file Using Weka for Classification Preparing a data file Prepare a data file in CSV format. It should have the names of the features, which Weka calls attributes, on the first line, with the names separated

More information

Comparative Study of J48, Naive Bayes and One-R Classification Technique for Credit Card Fraud Detection using WEKA

Comparative Study of J48, Naive Bayes and One-R Classification Technique for Credit Card Fraud Detection using WEKA Advances in Computational Sciences and Technology ISSN 0973-6107 Volume 10, Number 6 (2017) pp. 1731-1743 Research India Publications http://www.ripublication.com Comparative Study of J48, Naive Bayes

More information

Practical Data Mining COMP-321B. Tutorial 1: Introduction to the WEKA Explorer

Practical Data Mining COMP-321B. Tutorial 1: Introduction to the WEKA Explorer Practical Data Mining COMP-321B Tutorial 1: Introduction to the WEKA Explorer Gabi Schmidberger Mark Hall Richard Kirkby July 12, 2006 c 2006 University of Waikato 1 Setting up your Environment Before

More information

Credit card Fraud Detection using Predictive Modeling: a Review

Credit card Fraud Detection using Predictive Modeling: a Review February 207 IJIRT Volume 3 Issue 9 ISSN: 2396002 Credit card Fraud Detection using Predictive Modeling: a Review Varre.Perantalu, K. BhargavKiran 2 PG Scholar, CSE, Vishnu Institute of Technology, Bhimavaram,

More information

Data Mining and Knowledge Discovery Practice notes: Numeric Prediction, Association Rules

Data Mining and Knowledge Discovery Practice notes: Numeric Prediction, Association Rules Keywords Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si Data Attribute, example, attribute-value data, target variable, class, discretization Algorithms

More information

What Is Data Mining? CMPT 354: Database I -- Data Mining 2

What Is Data Mining? CMPT 354: Database I -- Data Mining 2 Data Mining What Is Data Mining? Mining data mining knowledge Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data CMPT

More information

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University Cse634 DATA MINING TEST REVIEW Professor Anita Wasilewska Computer Science Department Stony Brook University Preprocessing stage Preprocessing: includes all the operations that have to be performed before

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates?

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Model Evaluation Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to

More information

Data Mining and Knowledge Discovery Practice notes: Numeric Prediction, Association Rules

Data Mining and Knowledge Discovery Practice notes: Numeric Prediction, Association Rules Keywords Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 06/0/ Data Attribute, example, attribute-value data, target variable, class, discretization Algorithms

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART V Credibility: Evaluating what s been learned 10/25/2000 2 Evaluation: the key to success How

More information

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts Chapter 28 Data Mining Concepts Outline Data Mining Data Warehousing Knowledge Discovery in Databases (KDD) Goals of Data Mining and Knowledge Discovery Association Rules Additional Data Mining Algorithms

More information

WEKA Explorer User Guide for Version 3-4

WEKA Explorer User Guide for Version 3-4 WEKA Explorer User Guide for Version 3-4 Richard Kirkby Eibe Frank July 28, 2010 c 2002-2010 University of Waikato This guide is licensed under the GNU General Public License version 2. More information

More information

Data Mining Concepts & Techniques

Data Mining Concepts & Techniques Data Mining Concepts & Techniques Lecture No. 03 Data Processing, Data Mining Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, statistics foundations 5 Introduction to D3, visual analytics

More information

6.034 Design Assignment 2

6.034 Design Assignment 2 6.034 Design Assignment 2 April 5, 2005 Weka Script Due: Friday April 8, in recitation Paper Due: Wednesday April 13, in class Oral reports: Friday April 15, by appointment The goal of this assignment

More information

Study on Classifiers using Genetic Algorithm and Class based Rules Generation

Study on Classifiers using Genetic Algorithm and Class based Rules Generation 2012 International Conference on Software and Computer Applications (ICSCA 2012) IPCSIT vol. 41 (2012) (2012) IACSIT Press, Singapore Study on Classifiers using Genetic Algorithm and Class based Rules

More information

DATA ANALYSIS WITH WEKA. Author: Nagamani Mutteni Asst.Professor MERI

DATA ANALYSIS WITH WEKA. Author: Nagamani Mutteni Asst.Professor MERI DATA ANALYSIS WITH WEKA Author: Nagamani Mutteni Asst.Professor MERI Topic: Data Analysis with Weka Course Duration: 2 Months Objective: Everybody talks about Data Mining and Big Data nowadays. Weka is

More information

Association Rules. Berlin Chen References:

Association Rules. Berlin Chen References: Association Rules Berlin Chen 2005 References: 1. Data Mining: Concepts, Models, Methods and Algorithms, Chapter 8 2. Data Mining: Concepts and Techniques, Chapter 6 Association Rules: Basic Concepts A

More information

CPSC 340: Machine Learning and Data Mining. Probabilistic Classification Fall 2017

CPSC 340: Machine Learning and Data Mining. Probabilistic Classification Fall 2017 CPSC 340: Machine Learning and Data Mining Probabilistic Classification Fall 2017 Admin Assignment 0 is due tonight: you should be almost done. 1 late day to hand it in Monday, 2 late days for Wednesday.

More information

Nearest neighbor classification DSE 220

Nearest neighbor classification DSE 220 Nearest neighbor classification DSE 220 Decision Trees Target variable Label Dependent variable Output space Person ID Age Gender Income Balance Mortgag e payment 123213 32 F 25000 32000 Y 17824 49 M 12000-3000

More information

Classification using Weka (Brain, Computation, and Neural Learning)

Classification using Weka (Brain, Computation, and Neural Learning) LOGO Classification using Weka (Brain, Computation, and Neural Learning) Jung-Woo Ha Agenda Classification General Concept Terminology Introduction to Weka Classification practice with Weka Problems: Pima

More information

Data Mining Concepts

Data Mining Concepts Data Mining Concepts Outline Data Mining Data Warehousing Knowledge Discovery in Databases (KDD) Goals of Data Mining and Knowledge Discovery Association Rules Additional Data Mining Algorithms Sequential

More information

Tree-based methods for classification and regression

Tree-based methods for classification and regression Tree-based methods for classification and regression Ryan Tibshirani Data Mining: 36-462/36-662 April 11 2013 Optional reading: ISL 8.1, ESL 9.2 1 Tree-based methods Tree-based based methods for predicting

More information

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München Evaluation Measures Sebastian Pölsterl Computer Aided Medical Procedures Technische Universität München April 28, 2015 Outline 1 Classification 1. Confusion Matrix 2. Receiver operating characteristics

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

Chapter 4 Data Mining A Short Introduction

Chapter 4 Data Mining A Short Introduction Chapter 4 Data Mining A Short Introduction Data Mining - 1 1 Today's Question 1. Data Mining Overview 2. Association Rule Mining 3. Clustering 4. Classification Data Mining - 2 2 1. Data Mining Overview

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3

Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3 Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3 January 25, 2007 CSE-4412: Data Mining 1 Chapter 6 Classification and Prediction 1. What is classification? What is prediction?

More information

Data Mining and Knowledge Discovery Practice notes 2

Data Mining and Knowledge Discovery Practice notes 2 Keywords Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si Data Attribute, example, attribute-value data, target variable, class, discretization Algorithms

More information

Classification with Decision Tree Induction

Classification with Decision Tree Induction Classification with Decision Tree Induction This algorithm makes Classification Decision for a test sample with the help of tree like structure (Similar to Binary Tree OR k-ary tree) Nodes in the tree

More information

Building Intelligent Learning Database Systems

Building Intelligent Learning Database Systems Building Intelligent Learning Database Systems 1. Intelligent Learning Database Systems: A Definition (Wu 1995, Wu 2000) 2. Induction: Mining Knowledge from Data Decision tree construction (ID3 and C4.5)

More information

Extra readings beyond the lecture slides are important:

Extra readings beyond the lecture slides are important: 1 Notes To preview next lecture: Check the lecture notes, if slides are not available: http://web.cse.ohio-state.edu/~sun.397/courses/au2017/cse5243-new.html Check UIUC course on the same topic. All their

More information

1 Document Classification [60 points]

1 Document Classification [60 points] CIS519: Applied Machine Learning Spring 2018 Homework 4 Handed Out: April 3 rd, 2018 Due: April 14 th, 2018, 11:59 PM 1 Document Classification [60 points] In this problem, you will implement several text

More information

International Journal of Software and Web Sciences (IJSWS)

International Journal of Software and Web Sciences (IJSWS) International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) ISSN (Print): 2279-0063 ISSN (Online): 2279-0071 International

More information

Association Rule Mining. Entscheidungsunterstützungssysteme

Association Rule Mining. Entscheidungsunterstützungssysteme Association Rule Mining Entscheidungsunterstützungssysteme Frequent Pattern Analysis Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

More information

Data Mining. 3.5 Lazy Learners (Instance-Based Learners) Fall Instructor: Dr. Masoud Yaghini. Lazy Learners

Data Mining. 3.5 Lazy Learners (Instance-Based Learners) Fall Instructor: Dr. Masoud Yaghini. Lazy Learners Data Mining 3.5 (Instance-Based Learners) Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction k-nearest-neighbor Classifiers References Introduction Introduction Lazy vs. eager learning Eager

More information

Jarek Szlichta

Jarek Szlichta Jarek Szlichta http://data.science.uoit.ca/ Approximate terminology, though there is some overlap: Data(base) operations Executing specific operations or queries over data Data mining Looking for patterns

More information

ISSN: (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies

ISSN: (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

A Program demonstrating Gini Index Classification

A Program demonstrating Gini Index Classification A Program demonstrating Gini Index Classification Abstract In this document, a small program demonstrating Gini Index Classification is introduced. Users can select specified training data set, build the

More information

Data Mining and Knowledge Discovery: Practice Notes

Data Mining and Knowledge Discovery: Practice Notes Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 8.11.2017 1 Keywords Data Attribute, example, attribute-value data, target variable, class, discretization

More information

Data mining techniques for actuaries: an overview

Data mining techniques for actuaries: an overview Data mining techniques for actuaries: an overview Emiliano A. Valdez joint work with Banghee So and Guojun Gan University of Connecticut Advances in Predictive Analytics (APA) Conference University of

More information

Ripple Down Rule learner (RIDOR) Classifier for IRIS Dataset

Ripple Down Rule learner (RIDOR) Classifier for IRIS Dataset Ripple Down Rule learner (RIDOR) Classifier for IRIS Dataset V.Veeralakshmi Department of Computer Science Bharathiar University, Coimbatore, Tamilnadu veeralakshmi13@gmail.com Dr.D.Ramyachitra Department

More information

Data Mining Clustering

Data Mining Clustering Data Mining Clustering Jingpeng Li 1 of 34 Supervised Learning F(x): true function (usually not known) D: training sample (x, F(x)) 57,M,195,0,125,95,39,25,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0 0

More information

Data Structures. Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali Association Rules: Basic Concepts and Application

Data Structures. Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali Association Rules: Basic Concepts and Application Data Structures Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali 2009-2010 Association Rules: Basic Concepts and Application 1. Association rules: Given a set of transactions, find

More information

COMP s1 - Getting started with the Weka Machine Learning Toolkit

COMP s1 - Getting started with the Weka Machine Learning Toolkit COMP9417 16s1 - Getting started with the Weka Machine Learning Toolkit Last revision: Thu Mar 16 2016 1 Aims This introduction is the starting point for Assignment 1, which requires the use of the Weka

More information

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others Naïve Bayes Classification Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others Things We d Like to Do Spam Classification Given an email, predict

More information

Mining Association Rules in Large Databases

Mining Association Rules in Large Databases Mining Association Rules in Large Databases Association rules Given a set of transactions D, find rules that will predict the occurrence of an item (or a set of items) based on the occurrences of other

More information

Lecture 25: Review I

Lecture 25: Review I Lecture 25: Review I Reading: Up to chapter 5 in ISLR. STATS 202: Data mining and analysis Jonathan Taylor 1 / 18 Unsupervised learning In unsupervised learning, all the variables are on equal standing,

More information

Chapter 7: Frequent Itemsets and Association Rules

Chapter 7: Frequent Itemsets and Association Rules Chapter 7: Frequent Itemsets and Association Rules Information Retrieval & Data Mining Universität des Saarlandes, Saarbrücken Winter Semester 2013/14 VII.1&2 1 Motivational Example Assume you run an on-line

More information

SOCIAL MEDIA MINING. Data Mining Essentials

SOCIAL MEDIA MINING. Data Mining Essentials SOCIAL MEDIA MINING Data Mining Essentials Dear instructors/users of these slides: Please feel free to include these slides in your own material, or modify them as you see fit. If you decide to incorporate

More information

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai Decision Trees Decision Tree Decision Trees (DTs) are a nonparametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target

More information

Evaluating Machine-Learning Methods. Goals for the lecture

Evaluating Machine-Learning Methods. Goals for the lecture Evaluating Machine-Learning Methods Mark Craven and David Page Computer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Some of the slides in these lectures have been adapted/borrowed from

More information

IEE 520 Data Mining. Project Report. Shilpa Madhavan Shinde

IEE 520 Data Mining. Project Report. Shilpa Madhavan Shinde IEE 520 Data Mining Project Report Shilpa Madhavan Shinde Contents I. Dataset Description... 3 II. Data Classification... 3 III. Class Imbalance... 5 IV. Classification after Sampling... 5 V. Final Model...

More information

Classification Algorithms in Data Mining

Classification Algorithms in Data Mining August 9th, 2016 Suhas Mallesh Yash Thakkar Ashok Choudhary CIS660 Data Mining and Big Data Processing -Dr. Sunnie S. Chung Classification Algorithms in Data Mining Deciding on the classification algorithms

More information

Machine Learning and Bioinformatics 機器學習與生物資訊學

Machine Learning and Bioinformatics 機器學習與生物資訊學 Molecular Biomedical Informatics 分子生醫資訊實驗室 機器學習與生物資訊學 Machine Learning & Bioinformatics 1 Evaluation The key to success 2 Three datasets of which the answers must be known 3 Note on parameter tuning It

More information

IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER

IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER N. Suresh Kumar, Dr. M. Thangamani 1 Assistant Professor, Sri Ramakrishna Engineering College, Coimbatore, India 2 Assistant

More information

Data Mining. Lab 1: Data sets: characteristics, formats, repositories Introduction to Weka. I. Data sets. I.1. Data sets characteristics and formats

Data Mining. Lab 1: Data sets: characteristics, formats, repositories Introduction to Weka. I. Data sets. I.1. Data sets characteristics and formats Data Mining Lab 1: Data sets: characteristics, formats, repositories Introduction to Weka I. Data sets I.1. Data sets characteristics and formats The data to be processed can be structured (e.g. data matrix,

More information

Nesnelerin İnternetinde Veri Analizi

Nesnelerin İnternetinde Veri Analizi Nesnelerin İnternetinde Veri Analizi Bölüm 3. Classification in Data Streams w3.gazi.edu.tr/~suatozdemir Supervised vs. Unsupervised Learning (1) Supervised learning (classification) Supervision: The training

More information

Chapter 3: Supervised Learning

Chapter 3: Supervised Learning Chapter 3: Supervised Learning Road Map Basic concepts Evaluation of classifiers Classification using association rules Naïve Bayesian classification Naïve Bayes for text classification Summary 2 An example

More information

Improved Frequent Pattern Mining Algorithm with Indexing

Improved Frequent Pattern Mining Algorithm with Indexing IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VII (Nov Dec. 2014), PP 73-78 Improved Frequent Pattern Mining Algorithm with Indexing Prof.

More information

Machine Learning: Symbolische Ansätze

Machine Learning: Symbolische Ansätze Machine Learning: Symbolische Ansätze Unsupervised Learning Clustering Association Rules V2.0 WS 10/11 J. Fürnkranz Different Learning Scenarios Supervised Learning A teacher provides the value for the

More information

COMPARISON OF DIFFERENT CLASSIFICATION TECHNIQUES

COMPARISON OF DIFFERENT CLASSIFICATION TECHNIQUES COMPARISON OF DIFFERENT CLASSIFICATION TECHNIQUES USING DIFFERENT DATASETS V. Vaithiyanathan 1, K. Rajeswari 2, Kapil Tajane 3, Rahul Pitale 3 1 Associate Dean Research, CTS Chair Professor, SASTRA University,

More information

CSE4334/5334 DATA MINING

CSE4334/5334 DATA MINING CSE4334/5334 DATA MINING Lecture 4: Classification (1) CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai Li (Slides courtesy

More information

Predictive Analysis: Evaluation and Experimentation. Heejun Kim

Predictive Analysis: Evaluation and Experimentation. Heejun Kim Predictive Analysis: Evaluation and Experimentation Heejun Kim June 19, 2018 Evaluation and Experimentation Evaluation Metrics Cross-Validation Significance Tests Evaluation Predictive analysis: training

More information

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio

More information

Decision Tree CE-717 : Machine Learning Sharif University of Technology

Decision Tree CE-717 : Machine Learning Sharif University of Technology Decision Tree CE-717 : Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Some slides have been adapted from: Prof. Tom Mitchell Decision tree Approximating functions of usually discrete

More information

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa KNIME TUTORIAL Anna Monreale KDD-Lab, University of Pisa Email: annam@di.unipi.it Outline Introduction on KNIME KNIME components Exercise: Data Understanding Exercise: Market Basket Analysis Exercise:

More information

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data. Code No: M0502/R05 Set No. 1 1. (a) Explain data mining as a step in the process of knowledge discovery. (b) Differentiate operational database systems and data warehousing. [8+8] 2. (a) Briefly discuss

More information