Data Mining Laboratory Manual


Department of Information Technology
MLR INSTITUTE OF TECHNOLOGY
Marri Laxman Reddy Avenue, Dundigal, Gandimaisamma (M), R.R. Dist.

Prepared by: Mr. A. Ramachandra Reddy, Assistant Professor, CSE
Document No: MLRIT/IT/LAB MANUAL/DATA MINING
Date of Issue:                Date of Revision:
Faculty name: A. RAMACHANDRA REDDY
Authorized by: HOD
Verified by:

INDEX
1. Preface
2. Lab Code
3. JNTU Syllabus
4. List of Experiments and Problem Statements
5. Fundamentals of Data Mining
6. Introduction to WEKA
7. Launching WEKA
8. The WEKA Explorer (Preprocessing; Classification; Clustering; Association; Selecting Attributes; Visualization)
9. Working with WEKA File Formats
10. Credit Risk Management Lab Cycle Tasks
11-22. JNTU Experiments 1-12
23-24. Additional Experiments 1-2
25. Viva-voce Questions

PREFACE
Data mining is one of the important subjects included in the fourth-year curriculum by JNTUH. In addition to the theory subject, the curriculum includes a Data Mining laboratory based on the WEKA (Waikato Environment for Knowledge Analysis) tool. WEKA is a collection of machine learning algorithms for data mining tasks; the algorithms can either be applied directly to a dataset or called from your own Java code. WEKA was developed at the University of Waikato, New Zealand, and is named after a flightless bird found only in New Zealand. Because it is written in Java, an object-oriented programming language in wide use today, WEKA is platform independent and runs on Windows, Linux, and Solaris. The WEKA (pronounced "Way-Kuh") workbench contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to this functionality. WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, and visualization, and it is also well suited for developing new machine learning schemes.
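As a small illustration of the second route, here is a minimal sketch of training a WEKA classifier from your own Java code (the dataset path is an illustrative assumption; weka.jar must be on the classpath):

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TrainJ48 {
        public static void main(String[] args) throws Exception {
            // Load a dataset from an ARFF file (illustrative path).
            Instances data = DataSource.read("data/iris.arff");
            // By WEKA convention, the class attribute is the last one.
            data.setClassIndex(data.numAttributes() - 1);
            // Build a C4.5-style decision tree and print the learned model.
            J48 tree = new J48();
            tree.buildClassifier(data);
            System.out.println(tree);
        }
    }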

LAB CODE
1. Students should report to the concerned lab as per the time table.
2. Students who turn up late to the labs will in no case be permitted to do the program scheduled for the day.
3. After completion of the program, certification by the concerned staff in-charge in the observation book is necessary.
4. Students should bring a notebook of 100 pages and should enter the readings/observations into the notebook while performing the experiment.
5. The record of observations, along with the detailed experimental procedure of the experiment performed in the immediately preceding session, should be submitted to and certified by the staff member in-charge.
6. Not more than three students in a group are permitted to perform the experiment on a set.
7. The group-wise division made in the beginning should be adhered to, and no mix-up of students among different groups will be permitted.
8. The components required for the experiment should be collected from the stores in-charge after duly filling in the requisition form.
9. When the experiment is completed, students should disconnect the setup made by them and return all the components/instruments taken for the purpose.
10. Any damage to equipment or burnt-out components will be viewed seriously, either by imposing a penalty or by dismissing the whole group of students from the lab for the semester/year.
11. Students should be present in the labs for the total scheduled duration.
12. Students are required to prepare thoroughly to perform the experiment before coming to the laboratory.
13. Procedure sheets/data sheets provided to the student groups should be maintained neatly and returned after the experiment.

JNTUH Syllabus

Description: The business of banks is making loans, so assessing the creditworthiness of an applicant is of crucial importance. You have to develop a system to help a loan officer decide whether the credit of a customer is good or bad. A bank's business rules regarding loans must balance two opposing factors. On the one hand, a bank wants to make as many loans as possible: interest on these loans is the bank's source of profit. On the other hand, a bank cannot afford to make too many bad loans; too many bad loans could lead to the collapse of the bank. The bank's loan policy must therefore involve a compromise: not too strict, and not too lenient.

To do the assignment, you first and foremost need some knowledge about the world of credit. You can acquire such knowledge in a number of ways:
1. Knowledge engineering. Find a loan officer who is willing to talk. Interview her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers, or perhaps a suitable textbook on finance. Translate this knowledge from text form to production-rule form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable rules which can be used to judge the creditworthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers correctly judged when, and when not, to approve a loan application.

The German Credit Data: Actual historical credit data is not always easy to come by because of confidentiality rules. Here is one such dataset, consisting of 1000 actual cases collected in Germany, available both in its original form and as an Excel spreadsheet. In spite of the fact that the data is German, you should make use of it for this assignment (unless you really can consult a real loan officer!).

A few notes on the German dataset:
- DM stands for Deutsche Mark, the unit of currency, worth about 90 cents Canadian (but looks and acts like a quarter).
- owns_telephone: German phone rates are much higher than in Canada, so fewer people own telephones.
- foreign_worker: there are millions of these in Germany (many from Turkey). It is very hard to get German citizenship if you were not born of German parents.

There are 20 attributes used in judging a loan applicant. The goal is to classify the applicant into one of two categories, good or bad.

Subtasks (turn in your answers to the following tasks):
1. List all the categorical (or nominal) attributes and the real-valued attributes separately.
2. What attributes do you think might be crucial in making the credit assessment? Come up with some simple rules in plain English using your selected attributes.

3. One type of model that you can create is a decision tree. Train a decision tree using the complete dataset as the training data. Report the model obtained after training. (10 marks)
4. Suppose you use your above model, trained on the complete dataset, to classify credit as good/bad for each of the examples in the dataset. What percentage of examples can you classify correctly? (This is also called testing on the training set.) Why do you think you cannot get 100% training accuracy? (10 marks)
5. Is testing on the training set, as you did above, a good idea? Why or why not? (10 marks)
6. One approach to solving the problem encountered in the previous question is to use cross-validation. Describe briefly what cross-validation is. Train a decision tree again using cross-validation and report your results. Does your accuracy increase or decrease? Why? (10 marks)
7. Check to see whether the data shows a bias against "foreign workers" (attribute 20) or "personal-status" (attribute 9). One way to do this (perhaps rather simple-minded) is to remove these attributes from the dataset and see if the decision tree created in those cases is significantly different from the one for the full dataset, which you have already built. To remove an attribute you can use the Preprocess tab in WEKA's GUI Explorer. Did removing these attributes have any significant effect? Discuss. (10 marks)
8. Another question might be: do you really need to input so many attributes to get good results? Maybe only a few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, and 17 (and 21, the class attribute, naturally). Try out some combinations. (You removed two attributes in problem 7; remember to reload the ARFF data file to get all the attributes back before you start selecting the ones you want.) (10 marks)
9. Sometimes the cost of rejecting an applicant who actually has good credit (case 1) might be higher than that of accepting an applicant who has bad credit (case 2). Instead of counting the misclassifications equally in both cases, give a higher cost to the first case (say 5) and a lower cost to the second. You can do this by using a cost matrix in WEKA. Train your decision tree again and report the decision tree and cross-validation results. Are they significantly different from the results obtained in problem 6 (using equal costs)? (10 marks)
10. Do you think it is a good idea to prefer simple decision trees instead of long, complex ones? How does the complexity of a decision tree relate to the bias of the model? (10 marks)
11. You can make your decision trees simpler by pruning the nodes. One approach is to use Reduced Error Pruning; explain this idea briefly. Try reduced error pruning for training your decision trees using cross-validation (you can do this in WEKA), and report the decision tree you obtain as well as the accuracy of the pruned model. Does your accuracy increase? (10 marks)
12. (Extra credit) How can you convert a decision tree into if-then-else rules? Make up your own small decision tree consisting of 2-3 levels and convert it into a set of rules. There also exist classifiers that output the model in the form of rules; one such classifier in WEKA is rules.PART. Train this model and report the set of rules obtained. Sometimes just one attribute can be good enough to make the decision, yes, just one! Can you predict what attribute that might be in this dataset? The OneR classifier uses a single attribute to make decisions (it chooses the attribute based on minimum error). Report the rule obtained by training a OneR classifier, and rank the performance of J48, PART, and OneR. (10 marks)
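For reference, the cross-validation of subtask 6 can also be run outside the GUI; here is a minimal sketch, assuming the German credit data has been saved as an ARFF file named german_credit.arff (an illustrative name):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidate {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("german_credit.arff"); // illustrative path
            data.setClassIndex(data.numAttributes() - 1);
            // 10-fold cross-validation of a J48 decision tree.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }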

LIST OF EXPERIMENTS AND PROBLEM STATEMENTS

Each experiment below is first carried out on the German Credit Data (the JNTU task). Problems P1 onwards then repeat the same task on the standard WEKA sample datasets, in this order: contact-lenses.arff (P1), cpu.arff (P2), diabetes.arff (P3), glass.arff (P4), ionosphere.arff (P5), iris.arff (P6), labor.arff (P7), ReutersGrain-test.arff (P8), weather.arff (P9), and vote.arff (P10). Experiments 1 and 2 additionally include supermarket.arff as P10, with vote.arff as P11. For Experiment 2, the "credit assessment" in the problem statement becomes the corresponding analysis for each dataset (contact lenses, CPU, diabetes, and so on).

Experiment 1. List all the categorical (or nominal) attributes and the real-valued attributes separately using the WEKA mining tool.

Experiment 2. What attributes do you think might be crucial in making the credit assessment? Come up with some simple rules in plain English using your selected attributes.

Experiment 3. One type of model that you can create is a decision tree. Train a decision tree using the complete dataset as the training data, and report the model obtained after training.

Experiment 4. Suppose you use your above model, trained on the complete dataset, to classify each of the examples in the dataset. What percentage of examples can you classify correctly? (This is also called testing on the training set.) Why do you think you cannot get 100% training accuracy?

Experiment 5. Is testing on the training set, as you did above, a good idea? Why or why not?

Experiment 6. One approach to solving the problem encountered in the previous question is to use cross-validation. Describe briefly what cross-validation is. Train a decision tree again using cross-validation and report your results. Does your accuracy increase or decrease? Why?

Experiment 7. Check whether the data shows a bias against "foreign workers" (attribute 20) or "personal-status" (attribute 9). One simple way is to remove these attributes from the dataset (using the Preprocess tab in WEKA's Explorer) and see whether the resulting decision tree differs significantly from the one built on the full dataset. Did removing these attributes have any significant effect? Discuss.

Experiment 8. Do you really need to input so many attributes to get good results? Maybe only a few would do: for example, try keeping just attributes 2, 3, 5, 7, 10, and 17 (plus 21, the class attribute). Try out some combinations. (You removed two attributes in Experiment 7; remember to reload the ARFF file to restore all the attributes before selecting the ones you want.)

Experiment 9. Sometimes the cost of rejecting an applicant who actually has good credit (case 1) is higher than the cost of accepting an applicant who has bad credit (case 2). Instead of counting both kinds of misclassification equally, give a higher cost to the first case (say 5) and a lower cost to the second. You can do this by using a cost matrix in WEKA. Train your decision tree again and report the tree and the cross-validation results. Are they significantly different from the results obtained in Experiment 6 (using equal costs)?

Experiment 10. Do you think it is a good idea to prefer simple decision trees to long, complex ones? How does the complexity of a decision tree relate to the bias of the model?

Experiment 11. You can make your decision trees simpler by pruning nodes. One approach is Reduced Error Pruning; explain the idea briefly. Train your decision trees with reduced error pruning under cross-validation (you can do this in WEKA), report the tree you obtain, and report the accuracy of the pruned model. Does the accuracy increase?

Experiment 12 (Extra Credit). How can you convert a decision tree into if-then-else rules? Make up your own small decision tree of two or three levels and convert it into a set of rules. There also exist classifiers that output their model directly as rules; one such classifier in WEKA is rules.PART. Train this model and report the set of rules obtained. Sometimes just one attribute is good enough to make the decision; can you predict which attribute that might be in this dataset? The OneR classifier uses a single attribute to make decisions (it chooses the attribute with minimum error). Report the rule obtained by training a OneR classifier, and rank the performance of J48, PART, and OneR.

Additional Experiments (both on the German Credit Data):

Experiment 13. Generate association rules for the given transactional database using the Apriori algorithm.

Experiment 14. Generate classification rules for the given database using a decision tree (J48).
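For Experiment 13, the Apriori associator can be run from the Explorer's Associate tab or, as in this minimal sketch, from Java (the file name is illustrative; any transactional dataset in ARFF form will do):

    import weka.associations.Apriori;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class MineRules {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("supermarket.arff"); // illustrative path
            // Mine association rules with Apriori's default
            // minimum support and confidence settings.
            Apriori apriori = new Apriori();
            apriori.buildAssociations(data);
            System.out.println(apriori);
        }
    }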

Fundamentals of Data Mining

Definition of Data Mining: Data mining refers to extracting or mining knowledge from large amounts of data. It is also referred to as knowledge mining from data, knowledge extraction, data archeology, and data dredging.

Applications of Data Mining: business intelligence, insurance, banking, medicine, retail/marketing, etc.

Functionalities of Data Mining: These functionalities specify the kinds of patterns to be found in data mining tasks. Data mining tasks can be classified into two categories: descriptive and predictive. The main functionalities are:
- Concept/class description (characterization and discrimination): generalize, summarize, and contrast data characteristics.
- Mining frequent patterns, associations, and correlations: frequent patterns are patterns (such as itemsets, subsequences, or substructures) that appear frequently in a data set.
- Classification and prediction: construct models that describe and distinguish classes or concepts for future prediction, and predict unknown or missing numerical values.
- Cluster analysis: the class label is unknown; group data to form new classes, maximizing intra-class similarity and minimizing inter-class similarity.
- Outlier analysis: an outlier is a data object that does not comply with the general behavior of the data. Outliers are often treated as noise or exceptions, but they are quite useful in fraud detection and rare-event analysis.

Introduction to WEKA

WEKA (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. WEKA is an open-source application that is freely available under the GNU General Public License. Originally written in C, WEKA has been completely rewritten in Java and is compatible with almost every computing platform. It is user friendly, with a graphical interface that allows quick setup and operation. WEKA operates on the assumption that the user data is available as a flat file or relation: each data object is described by a fixed number of attributes, usually of a specific type (normally nominal or numeric values). WEKA gives novice users a tool to identify hidden information in databases and file systems, with simple-to-use options and visual interfaces.

The WEKA workbench contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to this functionality. The original version was primarily designed as a tool for analyzing data from agricultural domains, but the more recent, fully Java-based version (WEKA 3), whose development started in 1997, is now used in many different application areas, in particular for education and research.

ADVANTAGES OF WEKA
- A whole range of data preparation, feature selection, and data mining algorithms are integrated. This means that only one data format is needed, and trying out and comparing different approaches becomes really easy. The package also comes with a GUI, which makes it easier to use.
- Portability: since it is fully implemented in Java, it runs on almost any modern computing platform.
- A comprehensive collection of data preprocessing and modeling techniques.
- Ease of use due to its graphical user interfaces.

WEKA supports several standard data mining tasks, more specifically data preprocessing, clustering, classification, regression, visualization, and feature selection. All of WEKA's techniques are predicated on the assumption that the data is available as a single flat file or relation, where each data point is described by a fixed number of attributes (normally numeric or nominal, although some other attribute types are also supported). WEKA provides access to SQL databases using Java Database Connectivity (JDBC) and can process the result returned by a database query. It is not capable of multi-relational data mining, but there is separate software for converting a collection of linked database tables into a single table suitable for processing with WEKA. Another important area that WEKA's built-in algorithms do not cover is sequence modeling.
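A minimal sketch of the JDBC route, using WEKA's InstanceQuery class (the connection URL, credentials, and query are illustrative assumptions; a suitable JDBC driver must be on the classpath):

    import weka.core.Instances;
    import weka.experiment.InstanceQuery;

    public class LoadFromDatabase {
        public static void main(String[] args) throws Exception {
            InstanceQuery query = new InstanceQuery();
            query.setDatabaseURL("jdbc:mysql://localhost/bank"); // illustrative URL
            query.setUsername("user");                           // illustrative credentials
            query.setPassword("pass");
            query.setQuery("SELECT * FROM loans");               // illustrative query
            // Run the query and convert the result set into WEKA instances.
            Instances data = query.retrieveInstances();
            System.out.println(data.numInstances() + " rows loaded");
        }
    }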

that is currently not covered is sequence modeling.

Attribute Relation File Format (ARFF) is the text file format used by WEKA to store data in a database. The ARFF file contains two sections: the header and the data section. The first line of the header gives the relation name (@relation). Then there is the list of the attributes (@attribute ...). Each attribute is associated with a unique name and a type. The latter describes the kind of data contained in the variable and what values it can have. The variable types are: numeric, nominal, string and date. The class attribute is by default the last one of the list. In the header section there can also be some comment lines, identified by a '%' at the beginning, which can describe the database content or give the reader information about the author. After that there is the data itself (@data); each line stores the attributes of a single entry, separated by commas.

WEKA's main user interface is the Explorer, but essentially the same functionality can be accessed through the component-based Knowledge Flow interface and from the command line. There is also the Experimenter, which allows the systematic comparison of the predictive performance of WEKA's machine learning algorithms on a collection of datasets.

Launching WEKA

The WEKA GUI Chooser window is used to launch WEKA's graphical environments. At the bottom of the window are four buttons:
1. Simple CLI. Provides a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command-line interface.
2. Explorer. An environment for exploring data with WEKA.
3. Experimenter. An environment for performing experiments and conducting statistical tests between learning schemes.
4. Knowledge Flow. This environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One advantage is that it supports incremental learning.
If you launch WEKA from a terminal window, some text begins scrolling in the terminal. Ignore this text unless something goes wrong, in which case it can help in tracking down the cause. This User Manual focuses on using the Explorer but does not explain the individual data preprocessing tools and learning algorithms in WEKA. For more information on the various filters and learning methods in WEKA, see the book Data Mining (Witten and Frank, 2005).
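As noted above, WEKA's algorithms can also be called from your own Java code instead of the GUI. The following minimal sketch is illustrative only: it assumes a file named weather.arff in the working directory and the WEKA 3 jar on the classpath. It loads a dataset, builds a J48 decision tree, and evaluates it with 10-fold cross-validation, mirroring what the Explorer does.

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ExplorerFromCode {
        public static void main(String[] args) throws Exception {
            // Load the dataset (the format is detected from the file extension).
            Instances data = DataSource.read("weather.arff");
            // Tell WEKA which attribute is the class; by convention the last one.
            data.setClassIndex(data.numAttributes() - 1);

            // Build a J48 decision tree on the full training set.
            J48 tree = new J48();
            tree.buildClassifier(data);
            System.out.println(tree);

            // 10-fold cross-validation, as in the Explorer's Test options.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }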

The WEKA Explorer

Section Tabs
At the very top of the window, just below the title bar, is a row of tabs. When the Explorer is first started only the first tab is active; the others are grayed out. This is because it is necessary to open (and potentially pre-process) a dataset before starting to explore the data. The tabs are as follows:
1. Preprocess. Choose and modify the data being acted on.
2. Classify. Train and test learning schemes that classify or perform regression.
3. Cluster. Learn clusters for the data.
4. Associate. Learn association rules for the data.
5. Select attributes. Select the most relevant attributes in the data.
6. Visualize. View an interactive 2D plot of the data.
Once the tabs are active, clicking on them flicks between different screens, on which the respective actions can be performed. The bottom area of the window (including the status box, the log button, and the WEKA bird) stays visible regardless of which section you are in.

Status Box
The status box appears at the very bottom of the window. It displays messages that keep you informed about what's going on. For example, if the Explorer is busy loading a file, the status box will say that.
TIP: right-clicking the mouse anywhere inside the status box brings up a little menu. The menu gives two options:
1. Available memory. Display in the log box the amount of memory available to WEKA.
2. Run garbage collector. Force the Java garbage collector to search for memory that is no longer needed and free it up, allowing more memory for new tasks. Note that the garbage collector is constantly running as a background task anyway.

Log Button
Clicking on this button brings up a separate window containing a scrollable text field. Each line of text is stamped with the time it was entered into the log. As you perform actions in WEKA, the log keeps a record of what has happened.

WEKA Status Icon
To the right of the status box is the WEKA status icon. When no processes are running, the bird sits down and takes a nap. The number beside the symbol gives the number of concurrent processes running. When the system is idle it is zero, but it increases as the number of processes increases. When any process is started, the bird gets up and starts moving around. If it's standing but stops moving for a long time, it's sick:

something has gone wrong! In that case you should restart the WEKA Explorer.

1. Preprocessing

Opening files
The first three buttons at the top of the preprocess section enable you to load data into WEKA:
1. Open file... Brings up a dialog box allowing you to browse for the data file on the local file system.
2. Open URL... Asks for a Uniform Resource Locator address for where the data is stored.
3. Open DB... Reads data from a database. (Note that to make this work you might have to edit the file weka/experiment/DatabaseUtils.props.)
Using the Open file... button you can read files in a variety of formats: WEKA's ARFF format, CSV format, C4.5 format, or serialized Instances format. ARFF files typically have a .arff extension, CSV files a .csv extension, C4.5 files .data and .names extensions, and serialized Instances objects a .bsi extension.

The Current Relation
Once some data has been loaded, the Preprocess panel shows a variety of information. The Current relation box (the current relation is the currently loaded data, which can be interpreted as a single relational table in database terminology) has three entries:
1. Relation. The name of the relation, as given in the file it was loaded from. Filters (described below) modify the name of a relation.
2. Instances. The number of instances (data points/records) in the data.
3. Attributes. The number of attributes (features) in the data.

Working with Attributes
Below the Current relation box is a box titled Attributes. There are three buttons, and beneath them is a list of the attributes in the current relation. The list has three columns:
1. No. A number that identifies the attribute in the order in which it is specified in the data file.
2. Selection tick boxes. These allow you to select which attributes are present in the relation.
3. Name. The name of the attribute, as it was declared in the data file.
When you click on different rows in the list of attributes, the fields change in the box to the right titled Selected attribute. This box displays the characteristics of the currently highlighted attribute in the list:
1. Name. The name of the attribute, the same as that given in the attribute list.
2. Type. The type of attribute, most commonly Nominal or Numeric.
3. Missing. The number (and percentage) of instances in the data for which this attribute is missing (unspecified).

4. Distinct. The number of different values that the data contains for this attribute.
5. Unique. The number (and percentage) of instances in the data having a value for this attribute that no other instances have.
Below these statistics is a list showing more information about the values stored in this attribute, which differs depending on its type. If the attribute is nominal, the list consists of each possible value for the attribute along with the number of instances that have that value. If the attribute is numeric, the list gives four statistics describing the distribution of values in the data: the minimum, maximum, mean and standard deviation. Below these statistics there is a colored histogram, color-coded according to the attribute chosen as the Class using the box above the histogram. (This box will bring up a drop-down list of available selections when clicked.) Note that only nominal Class attributes will result in a color-coding. Finally, after pressing the Visualize All button, histograms for all the attributes in the data are shown in a separate window.
Returning to the attribute list, to begin with all the tick boxes are unticked. They can be toggled on/off by clicking on them individually. The three buttons above can also be used to change the selection:
1. All. All boxes are ticked.
2. None. All boxes are cleared (unticked).
3. Invert. Boxes that are ticked become unticked and vice versa.
Once the desired attributes have been selected, they can be removed by clicking the Remove button below the list of attributes. Note that this can be undone by clicking the Undo button, which is located next to the Save button in the top-right corner of the Preprocess panel.

Working with Filters
The preprocess section allows filters to be defined that transform the data in various ways. The Filter box is used to set up the filters that are required. At the left of the Filter box is a Choose button. By clicking this button it is possible to select one of the filters in WEKA. Once a filter has been selected, its name and options are shown in the field next to the Choose button. Clicking on this box brings up a GenericObjectEditor dialog box.

The GenericObjectEditor Dialog Box
The GenericObjectEditor dialog box lets you configure a filter. The same kind of dialog box is used to configure other objects, such as classifiers and clusterers (see below). The fields in the window reflect the available options. Clicking on any of these gives an opportunity to alter the filter's settings. For example, the setting may take a text string, in which case you type the string into the text field provided. Or it may give a drop-down box listing several states to choose from. Or it may do something else, depending on the information required. Information on the options is provided in a tool tip if you let the mouse pointer hover over the corresponding field. More information on the filter and its options can be obtained by clicking on the More button in the About panel at the top of the GenericObjectEditor window. Some objects display a brief description of what they do in an About box, along with a More button. Clicking on the More button brings up a window describing what the different options do.

At the bottom of the GenericObjectEditor dialog are four buttons. The first two, Open... and Save..., allow object configurations to be stored for future use. The Cancel button backs out without remembering any changes that have been made. Once you are happy with the object and settings you have chosen, click OK to return to the main Explorer window.

Applying Filters
Once you have selected and configured a filter, you can apply it to the data by pressing the Apply button at the right end of the Filter panel in the Preprocess panel. The Preprocess panel will then show the transformed data. The change can be undone by pressing the Undo button. You can also use the Edit... button to modify your data manually in a dataset editor. Finally, the Save... button at the top right of the Preprocess panel saves the current version of the relation in the same formats available for loading data, allowing it to be kept for future use.
Note: Some of the filters behave differently depending on whether a class attribute has been set or not (using the box above the histogram, which will bring up a drop-down list of possible selections when clicked). In particular, the supervised filters require a class attribute to be set, and some of the unsupervised attribute filters will skip the class attribute if one is set. Note that it is also possible to set Class to None, in which case no class is set.

2. Classification

Selecting a Classifier
At the top of the classify section is the Classifier box. This box has a text field that gives the name of the currently selected classifier, and its options. Clicking on the text box brings up a GenericObjectEditor dialog box, just the same as for filters, that you can use to configure the options of the current classifier. The Choose button allows you to choose one of the classifiers that are available in WEKA.

Test Options
The result of applying the chosen classifier will be tested according to the options that are set by clicking in the Test options box. There are four test modes:
1. Use training set. The classifier is evaluated on how well it predicts the class of the instances it was trained on.
2. Supplied test set. The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test on.
3. Cross-validation. The classifier is evaluated by cross-validation, using the number of folds that are entered in the Folds text field.
4. Percentage split. The classifier is evaluated on how well it predicts a certain percentage of the data which is held out for testing. The amount of data held out depends on the value entered in the % field.

Note: No matter which evaluation method is used, the model that is output is always the one built from all the training data. Further testing options can be set by clicking on the More options... button:
1. Output model. The classification model on the full training set is output so that it can be viewed, visualized, etc. This option is selected by default.
2. Output per-class stats. The precision/recall and true/false statistics for each class are output. This option is also selected by default.
3. Output entropy evaluation measures. Entropy evaluation measures are included in the output. This option is not selected by default.
4. Output confusion matrix. The confusion matrix of the classifier's predictions is included in the output. This option is selected by default.
5. Store predictions for visualization. The classifier's predictions are remembered so that they can be visualized. This option is selected by default.
6. Output predictions. The predictions on the evaluation data are output. Note that in the case of a cross-validation the instance numbers do not correspond to the location in the data!
7. Cost-sensitive evaluation. The errors are evaluated with respect to a cost matrix. The Set... button allows you to specify the cost matrix used.
8. Random seed for xval / % Split. This specifies the random seed used when randomizing the data before it is divided up for evaluation purposes.

The Class Attribute
The classifiers in WEKA are designed to be trained to predict a single 'class attribute', which is the target for prediction. Some classifiers can only learn nominal classes; others can only learn numeric classes (regression problems); still others can learn both. By default, the class is taken to be the last attribute in the data. If you want to train a classifier to predict a different attribute, click on the box below the Test options box to bring up a drop-down list of attributes to choose from.

Training a Classifier
Once the classifier, test options and class have all been set, the learning process is started by clicking on the Start button. While the classifier is busy being trained, the little bird moves around. You can stop the training process at any time by clicking on the Stop button. When training is complete, several things happen. The Classifier output area to the right of the display is filled with text describing the results of training and testing. A new entry appears in the Result list box. We look at the result list below; but first we investigate the text that has been output.

The Classifier Output Text
The text in the Classifier output area has scroll bars allowing you to browse the results. Of course, you can also resize the Explorer window to get a larger display area. The output is split into several sections:

1. Run information. A list of information giving the learning scheme options, relation name, instances, attributes and test mode that were involved in the process.
2. Classifier model (full training set). A textual representation of the classification model that was produced on the full training data.
The results of the chosen test mode are then broken down as follows:
3. Summary. A list of statistics summarizing how accurately the classifier was able to predict the true class of the instances under the chosen test mode.
4. Detailed Accuracy By Class. A more detailed per-class breakdown of the classifier's prediction accuracy.
5. Confusion Matrix. Shows how many instances have been assigned to each class. Elements show the number of test examples whose actual class is the row and whose predicted class is the column.

The Result List
After training several classifiers, the result list will contain several entries. Left-clicking the entries flicks back and forth between the various results that have been generated. Right-clicking an entry invokes a menu containing these items:
1. View in main window. Shows the output in the main window (just like left-clicking the entry).
2. View in separate window. Opens a new independent window for viewing the results.
3. Save result buffer. Brings up a dialog allowing you to save a text file containing the textual output.
4. Load model. Loads a pre-trained model object from a binary file.
5. Save model. Saves a model object to a binary file. Objects are saved in Java serialized object form.
6. Re-evaluate model on current test set. Takes the model that has been built and tests its performance on the data set that has been specified with the Set... button under the Supplied test set option.
7. Visualize classifier errors. Brings up a visualization window that plots the results of classification. Correctly classified instances are represented by crosses, whereas incorrectly classified ones show up as squares.
8. Visualize tree or Visualize graph. Brings up a graphical representation of the structure of the classifier model, if possible (i.e. for decision trees or Bayesian networks). The graph visualization option only appears if a Bayesian network classifier has been built. In the tree visualizer, you can bring up a menu by right-clicking a blank area, pan around by dragging the mouse, and see the training instances at each node by clicking on it. CTRL-clicking zooms the view out, while SHIFT-dragging a box zooms the view in. The graph visualizer should be self-explanatory.
9. Visualize margin curve. Generates a plot illustrating the prediction margin. The margin is defined as the

difference between the probability predicted for the actual class and the highest probability predicted for the other classes. For example, boosting algorithms may achieve better performance on test data by increasing the margins on the training data.
10. Visualize threshold curve. Generates a plot illustrating the trade-offs in prediction that are obtained by varying the threshold value between classes. For example, with the default threshold value of 0.5, the predicted probability of positive must be greater than 0.5 for the instance to be predicted as positive. The plot can be used to visualize the precision/recall trade-off, for ROC curve analysis (true positive rate vs false positive rate), and for other types of curves.
11. Visualize cost curve. Generates a plot that gives an explicit representation of the expected cost, as described by Drummond and Holte (2000).
Options are greyed out if they do not apply to the specific set of results.

3. Clustering

Selecting a Clusterer
By now you will be familiar with the process of selecting and configuring objects. Clicking on the clustering scheme listed in the Clusterer box at the top of the window brings up a GenericObjectEditor dialog with which to choose a new clustering scheme.

Cluster Modes
The Cluster mode box is used to choose what to cluster and how to evaluate the results. The first three options are the same as for classification: Use training set, Supplied test set and Percentage split (Section 4), except that now the data is assigned to clusters instead of trying to predict a specific class. The fourth mode, Classes to clusters evaluation, compares how well the chosen clusters match up with a pre-assigned class in the data. The drop-down box below this option selects the class, just as in the Classify panel.
An additional option in the Cluster mode box, the Store clusters for visualization tick box, determines whether or not it will be possible to visualize the clusters once training is complete. When dealing with datasets that are so large that memory becomes a problem, it may be helpful to disable this option.

Ignoring Attributes
Often, some attributes in the data should be ignored when clustering. The Ignore attributes button brings up a small window that allows you to select which attributes are ignored. Clicking on an attribute in the window highlights it, holding down the SHIFT key selects a range of consecutive attributes, and holding down CTRL toggles individual attributes on and off. To cancel the selection, back out with the Cancel button. To activate it, click the Select button. The next time clustering is invoked, the selected attributes are ignored.

Learning Clusters
The Cluster section, like the Classify section, has Start/Stop buttons, a result text area and a result list. These all behave just like their classification counterparts. Right-clicking an entry in the result list brings up a similar menu, except that it shows only two visualization options: Visualize cluster assignments and Visualize tree. The latter is grayed out when it is not applicable.
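The Cluster panel's functionality is also reachable from Java code. The sketch below is illustrative, not prescriptive: it assumes an iris.arff file and uses WEKA's SimpleKMeans. The class attribute is removed first, mirroring the Ignore attributes step, since clusterers work on unlabeled data.

    import weka.clusterers.ClusterEvaluation;
    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class ClusterSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff");

            // Drop the class attribute (the last one), as with Ignore attributes.
            Remove remove = new Remove();
            remove.setAttributeIndices("last");
            remove.setInputFormat(data);
            Instances unlabeled = Filter.useFilter(data, remove);

            // Build k-means with 3 clusters.
            SimpleKMeans kmeans = new SimpleKMeans();
            kmeans.setNumClusters(3);
            kmeans.buildClusterer(unlabeled);

            // Evaluate cluster assignments on the same data.
            ClusterEvaluation eval = new ClusterEvaluation();
            eval.setClusterer(kmeans);
            eval.evaluateClusterer(unlabeled);
            System.out.println(eval.clusterResultsToString());
        }
    }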

4. Association

Setting Up
This panel contains schemes for learning association rules, and the learners are chosen and configured in the same way as the clusterers, filters, and classifiers in the other panels.

Learning Associations
Once appropriate parameters for the association rule learner have been set, click the Start button. When complete, right-clicking on an entry in the result list allows the results to be viewed or saved.

5. Selecting Attributes

Searching and Evaluating
Attribute selection involves searching through all possible combinations of attributes in the data to find which subset of attributes works best for prediction. To do this, two objects must be set up: an attribute evaluator and a search method. The evaluator determines what method is used to assign a worth to each subset of attributes. The search method determines what style of search is performed.

Options
The Attribute Selection Mode box has two options:
1. Use full training set. The worth of the attribute subset is determined using the full set of training data.
2. Cross-validation. The worth of the attribute subset is determined by a process of cross-validation. The Fold and Seed fields set the number of folds to use and the random seed used when shuffling the data.
As with Classify (Section 4), there is a drop-down box that can be used to specify which attribute to treat as the class.

Performing Selection
Clicking Start starts running the attribute selection process. When it is finished, the results are output into the result area, and an entry is added to the result list. Right-clicking on the result list gives several options. The first three (View in main window, View in separate window and Save result buffer) are the same as for the Classify panel. It is also possible to Visualize reduced data, or, if you have used an attribute transformer such as Principal Components, Visualize transformed data.

6. Visualizing

WEKA's visualization section allows you to visualize 2D plots of the current relation.

The scatter plot matrix
When you select the Visualize panel, it shows a scatter plot matrix for all the attributes, color-coded according to the currently selected class. It is possible to change the size of each individual 2D plot and the

point size, and to randomly jitter the data (to uncover obscured points). It is also possible to change the attribute used to color the plots, to select only a subset of attributes for inclusion in the scatter plot matrix, and to subsample the data. Note that changes will only come into effect once the Update button has been pressed.

Selecting an individual 2D scatter plot
When you click on a cell in the scatter plot matrix, this will bring up a separate window with a visualization of the scatter plot you selected. (We described above how to visualize particular results in a separate window, for example classifier errors; the same visualization controls are used here.) Data points are plotted in the main area of the window. At the top are two drop-down list buttons for selecting the axes to plot. The one on the left shows which attribute is used for the x-axis; the one on the right shows which is used for the y-axis.
Beneath the x-axis selector is a drop-down list for choosing the colour scheme. This allows you to colour the points based on the attribute selected. Below the plot area, a legend describes what values the colours correspond to. If the values are discrete, you can modify the colour used for each one by clicking on them and making an appropriate selection in the window that pops up.
To the right of the plot area is a series of horizontal strips. Each strip represents an attribute, and the dots within it show the distribution of values of the attribute. These values are randomly scattered vertically to help you see concentrations of points. You can choose what axes are used in the main graph by clicking on these strips. Left-clicking an attribute strip changes the x-axis to that attribute, whereas right-clicking changes the y-axis. The 'X' and 'Y' written beside the strips show what the current axes are ('B' is used for both X and Y).
Above the attribute strips is a slider labeled Jitter, which is a random displacement given to all points in the plot. Dragging it to the right increases the amount of jitter, which is useful for spotting concentrations of points. Without jitter, a million instances at the same point would look no different to just a single lonely instance.

Selecting Instances
There may be situations where it is helpful to select a subset of the data using the visualization tool. (A special case of this is the User Classifier in the Classify panel, which lets you build your own classifier by interactively selecting instances.) Below the y-axis selector button is a drop-down list button for choosing a selection method. A group of data points can be selected in four ways:
1. Select Instance. Clicking on an individual data point brings up a window listing its attributes. If more than one point appears at the same location, more than one set of attributes is shown.
2. Rectangle. You can create a rectangle, by dragging, that selects the points inside it.
3. Polygon. You can build a free-form polygon that selects the points inside it. Left-click to add vertices to the polygon, right-click to complete it. The polygon will always be closed off by connecting the first point to the last.
4. Polyline. You can build a polyline that distinguishes the points on one side from those on the other. Left-click to add vertices to the polyline, right-click to finish. The resulting shape is open (as opposed to a polygon, which is always closed).
Once an area of the plot has been selected using Rectangle, Polygon or Polyline, it turns grey.
At this point, clicking the Submit button removes all instances from the plot except those within the grey selection area.

Clicking on the Clear button erases the selected area without affecting the graph. Once any points have been removed from the graph, the Submit button changes to a Reset button. This button undoes all previous removals and returns you to the original graph with all points included. Finally, clicking the Save button allows you to save the currently visible instances to a new ARFF file.

WEKA (Waikato Environment for Knowledge Analysis)

Supported File Formats:
ARFF (Attribute Relation File Format)
CSV (Comma Separated Values)
C4.5
Binary Files (True/False or Yes/No, e.g. Buys TV or Not)

Confusion Matrix
A confusion matrix shows how well your classifier can recognize tuples of different classes.
True Positives: the positive tuples that were correctly labeled by the classifier.
True Negatives: the negative tuples that were correctly labeled by the classifier.
False Positives: the negative tuples that were incorrectly labeled by the classifier.
False Negatives: the positive tuples that were incorrectly labeled by the classifier.

Fig 1. Weka GUI Chooser
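To make these definitions concrete, here is a small worked sketch; the counts are made up purely for illustration. It derives accuracy, precision and recall from the four cells of a two-class confusion matrix.

    public class ConfusionMatrixDemo {
        public static void main(String[] args) {
            // Hypothetical counts for a two-class problem (e.g. good/bad credit).
            int tp = 80;  // positive tuples correctly labeled positive
            int tn = 60;  // negative tuples correctly labeled negative
            int fp = 40;  // negative tuples incorrectly labeled positive
            int fn = 20;  // positive tuples incorrectly labeled negative

            double total = tp + tn + fp + fn;            // 200 instances
            double accuracy = (tp + tn) / total;         // (80+60)/200 = 0.70
            double precision = (double) tp / (tp + fp);  // 80/120 = 0.67 (approx.)
            double recall = (double) tp / (tp + fn);     // 80/100 = 0.80

            System.out.printf("Accuracy=%.2f Precision=%.2f Recall=%.2f%n",
                    accuracy, precision, recall);
        }
    }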

Weka Application Interfaces
Explorer: preprocessing, attribute selection, learning, visualization
Experimenter: testing and evaluating machine learning algorithms
Knowledge Flow: visual design of the KDD process
Simple Command-line: a simple interface for typing commands

Fig 2. Weka Application Interfaces

Weka Functions and Tools
Preprocessing filters
Attribute selection
Classification/Regression
Clustering
Association discovery
Visualization
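Most of these functions have programmatic counterparts in WEKA's Java packages. As a small illustration (the file name is an assumption), the sketch below applies the unsupervised Normalize filter, the code-level equivalent of choosing filters > unsupervised > attribute > Normalize in the Preprocess panel and pressing Apply.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Normalize;

    public class PreprocessSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff");

            // Normalize scales every numeric attribute into [0, 1].
            Normalize normalize = new Normalize();
            normalize.setInputFormat(data);
            Instances normalized = Filter.useFilter(data, normalize);

            System.out.println(normalized.toSummaryString());
        }
    }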

Load data file
Load data files in these formats: ARFF, CSV, C4.5, binary. Data can also be imported from a URL or from an SQL database (using JDBC).

Attribute Relation File Format (ARFF)
An ARFF file consists of two distinct sections:
The Header section defines the relation name and the attributes: it starts with @relation <relation-name>, followed by one @attribute <attribute-name> <type> (or {range}) line per attribute.
The Data section lists the data records: it starts with @data, followed by the list of data instances.
Any line starting with % is a comment.
Data types supported by ARFF: numeric, string, nominal specification, date.

Sample data (SNO, NAME, AGE, CITY, BRANCH, MARKS, CLASS):
1,DEEPIKA,22,HYD,CSE,76,PASS
2,RADHIKA,23,DELHI,IT,34,FAIL
3,PRADEEP,21,MUMBAI,EEE,45,PASS
4,KRISHNA,22,HYD,ECE,23,FAIL
5,RISHI,21,DELHI,IT,88,PASS
6,SHARAN,21,MUMBAI,EEE,92,PASS
7,SHREYANSH,22,HYD,CSE,26,FAIL
8,SUGUNA,23,MUMBAI,ECE,65,PASS
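Putting the header and data sections together, the records above could be written as the following ARFF file. The attribute types shown are reasonable assumptions for this sample; adjust the nominal ranges to your own data.

    @relation student

    @attribute SNO numeric
    @attribute NAME string
    @attribute AGE numeric
    @attribute CITY {HYD,DELHI,MUMBAI}
    @attribute BRANCH {CSE,IT,EEE,ECE}
    @attribute MARKS numeric
    @attribute CLASS {PASS,FAIL}

    @data
    1,DEEPIKA,22,HYD,CSE,76,PASS
    2,RADHIKA,23,DELHI,IT,34,FAIL
    3,PRADEEP,21,MUMBAI,EEE,45,PASS
    4,KRISHNA,22,HYD,ECE,23,FAIL
    5,RISHI,21,DELHI,IT,88,PASS
    6,SHARAN,21,MUMBAI,EEE,92,PASS
    7,SHREYANSH,22,HYD,CSE,26,FAIL
    8,SUGUNA,23,MUMBAI,ECE,65,PASS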

Write the file in Notepad and save it with a .arff extension (choose Save as type: All Files).

CSV (Comma Separated Values)

The CSV File Format
Each record is one line. Fields are separated with commas.
Example: John,Doe,120 any st.,"Anytown, WW",08123
Leading and trailing space characters adjacent to comma field separators are ignored. So John, Doe,... resolves to "John" and "Doe", etc. Space characters can be spaces or tabs.
Fields with embedded commas must be delimited with double-quote characters. In the above example, "Anytown, WW" had to be delimited in double quotes because it had an embedded comma.
Fields that contain double-quote characters must be surrounded by double-quotes, and the embedded double-quotes must each be represented by a pair of consecutive double quotes. So John "Da Man" Doe would convert to "John ""Da Man""",Doe,120 any st.,...
A field that contains embedded line breaks must be surrounded by double-quotes, for example:
John,Doe,"120 any st.,
Anytown, WW",08123
Note that this is a single CSV record, even though it takes up more than one line in the CSV file. This works because the line breaks are embedded inside the double quotes of the field.
Fields with leading or trailing spaces must be delimited with double-quote characters. So, to preserve the leading and trailing spaces around the last name above: John," Doe ",...
The delimiters themselves are always discarded.
The first record in a CSV file may be a header record containing the column (field) names:
SNO,NAME,AGE,CITY,BRANCH,MARKS,CLASS
1,DEEPIKA,22,HYD,CSE,76,PASS
2,RADHIKA,23,DELHI,IT,34,FAIL
3,PRADEEP,21,MUMBAI,EEE,45,PASS
4,KRISHNA,22,HYD,ECE,23,FAIL
5,RISHI,21,DELHI,IT,88,PASS
6,SHARAN,21,MUMBAI,EEE,92,PASS
7,SHREYANSH,22,HYD,CSE,26,FAIL
8,SUGUNA,23,MUMBAI,ECE,65,PASS
Write the file in Notepad and save it with a .csv extension (choose Save as type: All Files).
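WEKA can load such a CSV file directly. A short sketch follows, assuming the student file above has been saved as student.csv; it uses WEKA's CSVLoader, which converts the CSV records into the same Instances object an ARFF file would produce, and optionally writes them back out as ARFF.

    import java.io.File;

    import weka.core.Instances;
    import weka.core.converters.ArffSaver;
    import weka.core.converters.CSVLoader;

    public class CsvToArff {
        public static void main(String[] args) throws Exception {
            // Read the CSV file; the first record is taken as the header.
            CSVLoader loader = new CSVLoader();
            loader.setSource(new File("student.csv"));
            Instances data = loader.getDataSet();

            // Write the same data out in ARFF format.
            ArffSaver saver = new ArffSaver();
            saver.setInstances(data);
            saver.setFile(new File("student.arff"));
            saver.writeBatch();
        }
    }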

Credit Risk Assessment

Description: The business of banks is making loans. Assessing the credit worthiness of an applicant is of crucial importance. You have to develop a system to help a loan officer decide whether the credit of a customer is good or bad. A bank's business rules regarding loans must consider two opposing factors. On the one hand, a bank wants to make as many loans as possible. Interest on these loans is the bank's profit source. On the other hand, a bank cannot afford to make too many bad loans. Too many bad loans could lead to the collapse of the bank. The bank's loan policy must involve a compromise: not too strict, and not too lenient.
To do the assignment, you first and foremost need some knowledge about the world of credit. You can acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable textbook on finance. Translate this knowledge from text form to production rule form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable rules which can be used to judge the credit worthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers correctly judged when, and when not, to approve a loan application.

The German Credit Data:
Actual historical credit data is not always easy to come by because of confidentiality rules. Here is one such dataset, consisting of 1000 actual cases collected in Germany: the German credit dataset (original), also available as an Excel spreadsheet version (download from the web). In spite of the fact that the data is German, you should probably make use of it for this assignment. (Unless you really can consult a real loan officer!)

A few notes on the German dataset:
DM stands for Deutsche Mark, the unit of currency, worth about 90 cents Canadian (but looks and acts like a quarter).
Owns telephone: German phone rates are much higher than in Canada, so fewer people own telephones.
Foreign worker: there are millions of these in Germany (many from Turkey). It is very hard to get German citizenship if you were not born of German parents.
There are 21 attributes used in judging a loan applicant. The goal is to classify the applicant into one of two categories, good or bad.

Procedure
Download the German dataset from the internet (save the data in ARFF format). The description of the data is as follows.

Description of the German credit dataset

1. Title: German Credit data
2. Source Information: Professor Dr. Hans Hofmann, Institut für Statistik und Ökonometrie, Universität Hamburg, FB Wirtschaftswissenschaften, Von-Melle-Park, Hamburg
3. Number of Instances: 1000
4. Number of Attributes (german): 21 (7 numerical, 14 categorical)
5. Attribute description for german

List of Attributes:
Attribute 1: (qualitative) Status of existing checking account
A11 : ... < 0 DM
A12 : 0 <= ... < 200 DM
A13 : ... >= 200 DM / salary assignments for at least 1 year
A14 : no checking account
Attribute 2: (numerical) Duration in month
Attribute 3: (qualitative) Credit history
A30 : no credits taken / all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account / other credits existing (not at this bank)
Attribute 4: (qualitative) Purpose
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television

A44 : domestic appliances
A45 : repairs
A46 : education
A47 : (vacation - does not exist?)
A48 : retraining
A49 : business
A410 : others
Attribute 5: (numerical) Credit amount
Attribute 6: (qualitative) Savings account/bonds
A61 : ... < 100 DM
A62 : 100 <= ... < 500 DM
A63 : 500 <= ... < 1000 DM
A64 : ... >= 1000 DM
A65 : unknown / no savings account
Attribute 7: (qualitative) Present employment since
A71 : unemployed
A72 : ... < 1 year
A73 : 1 <= ... < 4 years
A74 : 4 <= ... < 7 years
A75 : ... >= 7 years
Attribute 8: (numerical) Installment rate in percentage of disposable income
Attribute 9: (qualitative) Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single
Attribute 10: (qualitative) Other debtors / guarantors
A101 : none
A102 : co-applicant
A103 : guarantor
Attribute 11: (numerical) Present residence since
Attribute 12: (qualitative) Property

A121 : real estate
A122 : if not A121 : building society savings agreement / life insurance
A123 : if not A121/A122 : car or other, not in attribute 6
A124 : unknown / no property
Attribute 13: (numerical) Age in years
Attribute 14: (qualitative) Other installment plans
A141 : bank
A142 : stores
A143 : none
Attribute 15: (qualitative) Housing
A151 : rent
A152 : own
A153 : for free
Attribute 16: (numerical) Number of existing credits at this bank
Attribute 17: (qualitative) Job
A171 : unemployed / unskilled - non-resident
A172 : unskilled - resident
A173 : skilled employee / official
A174 : management / self-employed / highly qualified employee / officer
Attribute 18: (numerical) Number of people being liable to provide maintenance for
Attribute 19: (qualitative) Telephone
A191 : none
A192 : yes, registered under the customer's name
Attribute 20: (qualitative) Foreign worker
A201 : yes
A202 : no
Attribute 21: (qualitative) Class
A211 : Good
A212 : Bad
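Before starting the lab tasks, it can help to sanity-check the downloaded data from code. A minimal sketch, assuming the dataset has been saved as german_credit.arff (the file name is an assumption), prints the instance and attribute counts and the distribution of the class attribute:

    import weka.core.AttributeStats;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class GermanCreditSummary {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("german_credit.arff");
            data.setClassIndex(data.numAttributes() - 1);

            System.out.println("Instances: " + data.numInstances());   // expected: 1000
            System.out.println("Attributes: " + data.numAttributes()); // expected: 21

            // Distribution of the class attribute (good/bad).
            AttributeStats stats = data.attributeStats(data.classIndex());
            for (int i = 0; i < data.classAttribute().numValues(); i++) {
                System.out.println(data.classAttribute().value(i) + ": "
                        + stats.nominalCounts[i]);
            }
        }
    }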

LAB CYCLE TASKS

JNTUH Experiment 1:

AIM: List all the categorical (or nominal) attributes and the real-valued attributes separately.
THEORY: Categorical (or nominal) attributes contain values in categorical form (characters/words). Real-valued attributes contain numeric data.
HARDWARE/SOFTWARE REQUIREMENTS: WEKA mining tool.
ALGORITHM/PROCEDURE:
1) Open the WEKA GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select the Preprocess tab.
4) Go to OPEN file and browse the file that is already stored in the system: "german credit data.csv".
5) Clicking on any attribute in the left panel will show the basic statistics on that selected attribute.
6) Click on the Edit button to edit the data.
SOURCE CODE/SOLUTION:
Number of attributes in the German credit data: 21
Categorical (or nominal) attributes: 14
Numerical attributes: 7
List of Attributes:
1. checking_status
2. duration
3. credit history
4. purpose
5. credit amount
6. savings_status
7. employment duration
8. installment rate
9. personal status
10. debtors
11. residence_since
12. property
13. age in years
14. installment plans
15. housing
16. existing credits
17. job
18. num_dependents
19. telephone
20. foreign worker
21. class

OUTPUT:
Categorical or nominal attributes:
1. checking_status
2. credit history
3. purpose
4. savings_status
5. employment_since
6. personal status
7. debtors
8. property
9. installment plans
10. housing
11. job
12. telephone
13. foreign worker
14. class label
Real-valued attributes:
1. duration
2. credit amount
3. installment rate
4. residence_since
5. age
6. existing credits
7. num_dependents
In Weka Preprocessing, click on the Edit button to edit the data.
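The same separation can also be produced programmatically. A minimal sketch, assuming the dataset has been saved as german_credit.arff (a hypothetical file name), walks the attribute list and prints nominal and numeric attributes separately:

    import weka.core.Attribute;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ListAttributeTypes {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("german_credit.arff");

            System.out.println("Categorical (nominal) attributes:");
            for (int i = 0; i < data.numAttributes(); i++) {
                Attribute att = data.attribute(i);
                if (att.isNominal()) {
                    System.out.println("  " + att.name());
                }
            }

            System.out.println("Real-valued (numeric) attributes:");
            for (int i = 0; i < data.numAttributes(); i++) {
                Attribute att = data.attribute(i);
                if (att.isNumeric()) {
                    System.out.println("  " + att.name());
                }
            }
        }
    }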

PROBLEM DEFINITIONS FOR JNTU EXP. 1

P1:
AIM: List all the categorical (or nominal) attributes and the real-valued attributes separately using "contact-lenses.arff".
THEORY: Categorical (or nominal) attributes contain values in categorical form (characters/words). Real-valued attributes contain numeric data.
HARDWARE/SOFTWARE REQUIREMENTS: WEKA mining tool.
ALGORITHM/PROCEDURE:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select the Preprocess tab.
4) Go to OPEN file and browse the file that is already stored in the system: "contact-lenses.arff".
5) Clicking on any attribute in the left panel will show the basic statistics on that selected attribute.
6) Click on the Edit button to edit the data.
SOURCE CODE:
OUTPUT:
List of categorical attributes:
1. age
2. spectacle-prescrip
3. astigmatism
4. tear-prod-rate
5. contact-lenses

List of real-valued attributes: NIL

P2:
AIM: List all the categorical (or nominal) attributes and the real-valued attributes separately using "cpu.arff".
THEORY: Categorical (or nominal) attributes contain values in categorical form (characters/words). Real-valued attributes contain numeric data.
HARDWARE/SOFTWARE REQUIREMENTS: WEKA mining tool.
ALGORITHM/PROCEDURE:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select the Preprocess tab.
4) Go to OPEN file and browse the file that is already stored in the system: "cpu.arff".
5) Clicking on any attribute in the left panel will show the basic statistics on that selected attribute.
6) Click on the Edit button to edit the data.
SOURCE CODE:

OUTPUT:
List of categorical attributes: NIL
List of real-valued attributes:
1. MYCT
2. MMIN
3. MMAX
4. CACH
5. CHMIN
6. CHMAX
7. CLASS
The number of row values (tuples) in cpu.arff is 209.

P3:
AIM: List all the categorical (or nominal) attributes and the real-valued attributes separately using "pima_diabetes.arff".
THEORY: Categorical (or nominal) attributes contain values in categorical form (characters/words). Real-valued attributes contain numeric data.
HARDWARE/SOFTWARE REQUIREMENTS:

WEKA mining tool.
ALGORITHM/PROCEDURE:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select the Preprocess tab.
4) Go to OPEN file and browse the file that is already stored in the system: "pima_diabetes.arff".
5) Clicking on any attribute in the left panel will show the basic statistics on that selected attribute.
6) Click on the Edit button to edit the data.
SOURCE CODE:
OUTPUT:
List of categorical attributes:
1. class
List of real-valued attributes:
1. preg
2. plas
3. pres
4. skin
5. insu
6. mass
7. pedi
8. age
The number of row values (tuples) in pima_diabetes.arff is 768.

P4:
AIM: List all the categorical (or nominal) attributes and the real-valued attributes separately using "glass.arff".
THEORY: Categorical (or nominal) attributes contain values in categorical form (characters/words). Real-valued attributes contain numeric data.
HARDWARE/SOFTWARE REQUIREMENTS: WEKA mining tool.

ALGORITHM/PROCEDURE:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select the Preprocess tab.
4) Go to OPEN file and browse the file that is already stored in the system: "glass.arff".
5) Clicking on any attribute in the left panel will show the basic statistics on that selected attribute.
6) Click on the Edit button to edit the data.
SOURCE CODE:
OUTPUT:
List of categorical attributes:
1. type
List of real-valued attributes:
1. RI
2. Na
3. Mg
4. Al
5. Si
6. K
7. Ca
8. Ba
9. Fe

P5:
AIM: List all the categorical (or nominal) attributes and the real-valued attributes separately using "ionosphere.arff".
THEORY: Categorical (or nominal) attributes contain values in categorical form (characters/words). Real-valued attributes contain numeric data.
HARDWARE/SOFTWARE REQUIREMENTS: WEKA mining tool.

ALGORITHM/PROCEDURE:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select the Preprocess tab.
4) Go to OPEN file and browse the file that is already stored in the system: "ionosphere.arff".
5) Clicking on any attribute in the left panel will show the basic statistics on that selected attribute.
6) Click on the Edit button to edit the data.
SOURCE CODE:
OUTPUT:
List of categorical attributes (number of attributes: 1):
1. class
List of real-valued attributes (number of attributes: 34):
a01, a02, a03, ..., a34

P6:
AIM: List all the categorical (or nominal) attributes and the real-valued attributes separately using "iris.arff".
THEORY: Categorical (or nominal) attributes contain values in categorical form (characters/words). Real-valued attributes contain numeric data.
HARDWARE/SOFTWARE REQUIREMENTS: WEKA mining tool.
ALGORITHM/PROCEDURE:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.

3) Select the Preprocess tab.
4) Go to OPEN file and browse the file that is already stored in the system: "iris.arff".
5) Clicking on any attribute in the left panel will show the basic statistics on that selected attribute.
6) Click on the Edit button to edit the data.
SOURCE CODE:
OUTPUT:
Number of attributes: 5
List of categorical attributes (number of attributes: 1):
1. class
List of real-valued attributes (number of attributes: 4):
1. sepallength
2. sepalwidth
3. petallength
4. petalwidth

P7:
AIM: List all the categorical (or nominal) attributes and the real-valued attributes separately using "labor.arff".
THEORY: Categorical (or nominal) attributes contain values in categorical form (characters/words). Real-valued attributes contain numeric data.
HARDWARE/SOFTWARE REQUIREMENTS: WEKA mining tool.
ALGORITHM/PROCEDURE:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select the Preprocess tab.
4) Go to OPEN file and browse the file that is already stored in the system: "labor.arff".
5) Clicking on any attribute in the left panel will show the basic statistics on that

selected attribute.
6) Click on the Edit button to edit the data.
SOURCE CODE:
OUTPUT:
Number of attributes: 17
List of categorical attributes (number of attributes: 9):
1. cost-of-living-adjustment
2. pension
3. education-allowance
4. vacation
5. longterm-disability-assistance
6. contribution-to-dental-plan
7. bereavement-assistance
8. contribution-to-health-plan
9. class
List of real-valued attributes (number of attributes: 8):
1. wage-increase-first-year
2. wage-increase-second-year
3. wage-increase-third-year
4. working-hours
5. standby-pay
6. shift-differential
7. statutory-holidays
8. duration

P8:
AIM: List all the categorical (or nominal) attributes and the real-valued attributes separately using "ReutersGrain-test.arff".
THEORY: Categorical (or nominal) attributes contain values in categorical form (characters/words). Real-valued attributes contain numeric data.
HARDWARE/SOFTWARE REQUIREMENTS: WEKA mining tool.
ALGORITHM/PROCEDURE:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select the Preprocess tab.
4) Go to OPEN file and browse the file that is already stored in the system:

"ReutersGrain-test.arff".
5) Clicking on any attribute in the left panel will show the basic statistics on that selected attribute.
6) Click on the Edit button to edit the data.
SOURCE CODE:
OUTPUT:
Number of attributes: 2
Number of instances: 604
Number of weights: 604
List of categorical attributes (number of attributes: 1):
1. Text
List of real-valued attributes (number of attributes: 1):
1. class

P9:
AIM: List all the categorical (or nominal) attributes and the real-valued attributes separately using "weather.arff".
THEORY: Categorical (or nominal) attributes contain values in categorical form (characters/words). Real-valued attributes contain numeric data.
HARDWARE/SOFTWARE REQUIREMENTS: WEKA mining tool.
ALGORITHM/PROCEDURE:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select the Preprocess tab.
4) Go to OPEN file and browse the file that is already stored in the system: "weather.arff".
5) Clicking on any attribute in the left panel will show the basic statistics on that selected attribute.
6) Click on the Edit button to edit the data.
SOURCE CODE:

OUTPUT:
Number of attributes: 5
Number of instances: 14
Number of weights: 14
List of categorical attributes (number of attributes: 3):
1. outlook
2. windy
3. play
List of real-valued attributes (number of attributes: 2):
1. temperature
2. humidity
Click on the Edit button in Preprocessing and save the data.

P10:
AIM: List all the categorical (or nominal) attributes and the real-valued attributes separately using "vote.arff".
THEORY: Categorical (or nominal) attributes contain values in categorical form (characters/words). Real-valued attributes contain numeric data.
HARDWARE/SOFTWARE REQUIREMENTS: WEKA mining tool.
ALGORITHM/PROCEDURE:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select the Preprocess tab.
4) Go to OPEN file and browse the file that is already stored in the system: "vote.arff".
5) Clicking on any attribute in the left panel will show the basic statistics on that selected attribute.
6) Click on the Edit button to edit the data.

SOURCE CODE:
OUTPUT:
Number of attributes: 17
Number of instances: 435
Number of weights: 435
List of categorical attributes (number of attributes: 17):
1. handicapped-infants
2. water-project-cost-sharing
3. adoption-of-the-budget-resolution
4. physician-fee-freeze
5. el-salvador-aid
6. religious-groups-in-schools
7. anti-satellite-test-ban
8. aid-to-nicaraguan-contras
9. mx-missile
10. immigration
11. synfuels-corporation-cutback
12. education-spending
13. superfund-right-to-sue
14. crime
15. duty-free-exports
16. export-administration-act-south-africa
17. Class
List of real-valued attributes (number of attributes: 0): NIL

JNTUH Experiment 2:

AIM: What attributes do you think might be crucial in making the credit assessment? Come up with some simple rules in plain English using your selected attributes.
THEORY: The business of banks is making loans. Assessing the credit worthiness of an applicant is of crucial importance. You have to develop a system to help a loan officer decide whether the credit of a customer is good or bad. A bank's business rules regarding loans must consider two opposing factors. On the one hand, a bank wants to make as many loans as possible. Interest on these loans is the bank's profit source. On the other hand, a bank cannot afford to make too many bad loans. Too many bad loans could lead to the collapse of the bank. The bank's

loan policy must involve a compromise: not too strict, and not too lenient.
To do the assignment, you first and foremost need some knowledge about the world of credit. You can acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable textbook on finance. Translate this knowledge from text form to production rule form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable rules which can be used to judge the credit worthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers correctly judged when, and when not, to approve a loan application.
The German Credit Data: Actual historical credit data is not always easy to come by because of confidentiality rules. Here is one such dataset, consisting of 1000 actual cases collected in Germany: the German credit dataset (original), also available as an Excel spreadsheet version (download from the web). In spite of the fact that the data is German, you should probably make use of it for this assignment.
A few notes on the German dataset:
DM stands for Deutsche Mark, the unit of currency, worth about 90 cents Canadian (but looks and acts like a quarter).
Owns telephone: German phone rates are much higher than in Canada, so fewer people own telephones.
Foreign worker: there are millions of these in Germany (many from Turkey). It is very hard to get German citizenship if you were not born of German parents.
There are 21 attributes used in judging a loan applicant. The goal is to classify the applicant into one of two categories, good or bad.
HARDWARE/SOFTWARE REQUIREMENTS: WEKA mining tool.
ALGORITHM/PROCEDURE:
1. Open the German credit data file.
2. Read the description of the data.
3. Use domain knowledge to analyze the data.
4. List all the crucial attributes.
SOLUTION:
Input Data: German Credit Data.csv
Number of attributes in the German credit data: 21

Important attributes: 9
OUTPUT: The following attributes appear crucial in making the credit risk assessment:
1. credit_history
2. employment
3. property_magnitude
4. job
5. duration
6. credit_amount
7. installment
8. existing credits
9. class
To decide whether or not to give credit, we must analyze the above important attributes. For example, simple rules in plain English could be: if the credit history is critical and the credit amount is high, then the risk is bad; if the applicant has been employed for several years and owns property, then the risk is good; the longer the duration of the loan, the higher the risk.

PROBLEM DEFINITIONS FOR JNTU EXP 2

P1:
AIM: What attributes do you think might be crucial in making the analysis of contact lenses? Come up with some simple rules in plain English using your selected attributes, using the "contact-lenses.arff" database.
THEORY: To do the assignment, you first and foremost need some knowledge about the given data. You can acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a domain expert who is willing to talk. Interview her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals or perhaps a suitable textbook on the domain. Translate this knowledge from text form to production rule form.
3. Common sense. Imagine yourself as a domain expert and make up reasonable rules which can be used to judge the given data.
4. Case histories. Find records of actual cases where competent experts judged correctly.
HARDWARE/SOFTWARE REQUIREMENTS: WEKA mining tool.
ALGORITHM/PROCEDURE:
1. Open the data file.
2. Read the description of the data.

3. Use domain knowledge to analyze the data.
4. List all the crucial attributes.
SOLUTION:
Input Data: contact-lenses.arff
Number of attributes: 5
Important attributes: 3
OUTPUT: The following attributes appear crucial in making the analysis of contact lenses:
1. spectacle-prescrip
2. tear-prod-rate
3. contact-lenses

P2:
AIM: What attributes do you think might be crucial in making the CPU analysis? Come up with some simple rules in plain English using your selected attributes, using the "cpu.arff" database.
THEORY: To do the assignment, you first and foremost need some knowledge about the given data. You can acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a domain expert who is willing to talk. Interview her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals or perhaps a suitable textbook on the domain. Translate this knowledge from text form to production rule form.
3. Common sense. Imagine yourself as a domain expert and make up reasonable rules which can be used to judge the given data.
4. Case histories. Find records of actual cases where competent experts judged correctly.
HARDWARE/SOFTWARE REQUIREMENTS: WEKA mining tool.
ALGORITHM/PROCEDURE:
1. Open the data file.
2. Read the description of the data.
3. Use domain knowledge to analyze the data.
4. List all the crucial attributes.
SOLUTION:

Input Data: cpu.arff
Number of attributes: 7
Important attributes: 4
OUTPUT: The following attributes appear crucial in making the CPU data assessment:
1. MMIN
2. CACH
3. CHMIN
4. CLASS

P3:
AIM: What attributes do you think might be crucial in making the assessment of diabetes? Come up with some simple rules in plain English using your selected attributes, using the "pima-diabetes.arff" database.
THEORY: To do the assignment, you first and foremost need some knowledge about the given data. You can acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a domain expert who is willing to talk. Interview her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals or perhaps a suitable textbook on the domain. Translate this knowledge from text form to production rule form.
3. Common sense. Imagine yourself as a domain expert and make up reasonable rules which can be used to judge the given data.
4. Case histories. Find records of actual cases where competent experts judged correctly.
HARDWARE/SOFTWARE REQUIREMENTS: WEKA mining tool.
ALGORITHM/PROCEDURE:
1. Open the data file.
2. Read the description of the data.
3. Use domain knowledge to analyze the data.
4. List all the crucial attributes.
SOLUTION:
Input Data: pima-diabetes.arff
Number of attributes: 9
Important attributes: 5

OUTPUT: The following attributes appear crucial in making the assessment of diabetes:
1. pres
2. skin
3. insu
4. age
5. class

P4:
AIM: What attributes do you think might be crucial in making the assessment of glass sales? Come up with some simple rules in plain English using your selected attributes, using the "glass.arff" database.
THEORY: To do the assignment, you first and foremost need some knowledge about the given data. You can acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a domain expert who is willing to talk. Interview her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals or perhaps a suitable textbook on the domain. Translate this knowledge from text form to production rule form.
3. Common sense. Imagine yourself as a domain expert and make up reasonable rules which can be used to judge the given data.
4. Case histories. Find records of actual cases where competent experts judged correctly.
HARDWARE/SOFTWARE REQUIREMENTS: WEKA mining tool.
ALGORITHM/PROCEDURE:
1. Open the data file.
2. Read the description of the data.
3. Use domain knowledge to analyze the data.
4. List all the crucial attributes.
SOLUTION:
Input Data: glass.arff
Number of attributes: 10
Important attributes: 5
OUTPUT: The following attributes appear crucial in making the assessment of glass sales:

1. type
2. RI
3. Mg
4. K
5. Fe

P5:
AIM: What attributes do you think might be crucial in making the assessment of the ionosphere data? Come up with some simple rules in plain English using your selected attributes using the "ionosphere.arff" database.

THEORY: To do the assignment, you first and foremost need some knowledge about the given data. You can acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable textbook on finance. Translate this knowledge from text form to production-rule form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable rules which can be used to judge the creditworthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers correctly judged when, and when not, to approve a loan application.

HARDWARE/SOFTWARE REQUIREMENTS: WEKA mining tool.

ALGORITHM/Procedure:
1. Open the data file
2. Read the description of the data
3. Use the domain knowledge to analyze the data
4. List all the crucial attributes

SOLUTION:
Input Data: ionosphere.arff
Number of Attributes: 35
Important Attributes: 5

OUTPUT: The following attributes may be crucial in making the assessment of the ionosphere data.
1. class
2. a01
3. a02
4. a33
5. a34
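For any of these tasks, the attribute list can also be inspected outside the Explorer with a few lines of Java against the WEKA API. The sketch below is illustrative (the file path is an assumption; point it at any of the ARFF files used in these tasks); it loads a data set and lists every attribute so the crucial ones can be short-listed:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ListAttributes {
    public static void main(String[] args) throws Exception {
        // Illustrative path - adjust to the location of your WEKA data files
        DataSource source = new DataSource("data/iris.arff");
        Instances data = source.getDataSet();

        System.out.println("Relation: " + data.relationName());
        System.out.println("Number of attributes: " + data.numAttributes());

        // Print each attribute's name and type
        for (int i = 0; i < data.numAttributes(); i++) {
            System.out.println((i + 1) + ". " + data.attribute(i).name()
                    + (data.attribute(i).isNominal() ? " (nominal)" : " (numeric)"));
        }
    }
}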

P6:
AIM: What attributes do you think might be crucial in making the assessment of iris? Come up with some simple rules in plain English using your selected attributes using the "iris.arff" database.

THEORY: To do the assignment, you first and foremost need some knowledge about the given data. You can acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable textbook on finance. Translate this knowledge from text form to production-rule form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable rules which can be used to judge the creditworthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers correctly judged when, and when not, to approve a loan application.

HARDWARE/SOFTWARE REQUIREMENTS: WEKA mining tool.

ALGORITHM/Procedure:
1. Open the data file
2. Read the description of the data
3. Use the domain knowledge to analyze the data
4. List all the crucial attributes

SOLUTION:
Input Data: iris.arff
Number of Attributes: 5
Important Attributes: 3

OUTPUT: The following attributes may be crucial in making the assessment of iris.
1. class
2. sepallength
3. petalwidth

P7:
AIM: What attributes do you think might be crucial in making the assessment of labor? Come up with some simple rules in plain English using your selected attributes using the "labor.arff" database.

THEORY: To do the assignment, you first and foremost need some knowledge about the given data. You can acquire such knowledge in a number of ways.

1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable textbook on finance. Translate this knowledge from text form to production-rule form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable rules which can be used to judge the creditworthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers correctly judged when, and when not, to approve a loan application.

HARDWARE/SOFTWARE REQUIREMENTS: WEKA mining tool.

ALGORITHM/Procedure:
1. Open the data file
2. Read the description of the data
3. Use the domain knowledge to analyze the data
4. List all the crucial attributes

SOLUTION:
Input Data: labor.arff
Number of Attributes: 17
Important Attributes: 8

OUTPUT: The following attributes may be crucial in making the assessment of labor.
1. cost-of-living-adjustment
2. pension
3. longterm-disability-assistance
4. contribution-to-health-plan
5. class
6. wage-increase-first-year
7. working-hours
8. duration

P8:
AIM: What attributes do you think might be crucial in making the assessment of the Reuters grain test data? Come up with some simple rules in plain English using your selected attributes using the "ReutersGrain-test.arff" database.

THEORY: To do the assignment, you first and foremost need some knowledge about the given data. You can acquire such knowledge in a number of ways.

1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable textbook on finance. Translate this knowledge from text form to production-rule form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable rules which can be used to judge the creditworthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers correctly judged when, and when not, to approve a loan application.

ALGORITHM/Procedure:
1. Open the data file
2. Read the description of the data
3. Use the domain knowledge to analyze the data
4. List all the crucial attributes

SOLUTION:
Input Data: ReutersGrain-test.arff
Number of Attributes: 2
Important Attributes: 2

OUTPUT: The following attributes may be crucial in making the assessment of the Reuters grain test data.
1. Text
2. class

P9:
AIM: What attributes do you think might be crucial in making the assessment of weather? Come up with some simple rules in plain English using your selected attributes using the "weather.arff" database.

THEORY: To do the assignment, you first and foremost need some knowledge about the given data. You can acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable textbook on finance. Translate this knowledge from text form to production-rule form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable rules which can be used to judge the creditworthiness of a loan applicant.

4. Case histories. Find records of actual cases where competent loan officers correctly judged when, and when not, to approve a loan application.

HARDWARE/SOFTWARE REQUIREMENTS: WEKA mining tool.

ALGORITHM/Procedure:
1. Open the data file
2. Read the description of the data
3. Use the domain knowledge to analyze the data
4. List all the crucial attributes

SOLUTION:
Input Data: weather.arff
Number of Attributes: 5
Important Attributes: 3

OUTPUT: The following attributes may be crucial in making the assessment of weather.
1. humidity
2. windy
3. temperature

P10:
AIM: What attributes do you think might be crucial in making the assessment of supermarket data? Come up with some simple rules in plain English using your selected attributes using the "supermarket.arff" database.

THEORY: To do the assignment, you first and foremost need some knowledge about the given data. You can acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable textbook on finance. Translate this knowledge from text form to production-rule form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable rules which can be used to judge the creditworthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers correctly judged when, and when not, to approve a loan application.

HARDWARE/SOFTWARE REQUIREMENTS: WEKA mining tool.

ALGORITHM/Procedure:
1. Open the data file
2. Read the description of the data
3. Use the domain knowledge to analyze the data
4. List all the crucial attributes

SOLUTION:
Input Data: supermarket.arff
Number of Attributes: 217
Important Attributes: 7

OUTPUT: The following attributes may be crucial in making the assessment of supermarket data.
1. department1
2. grocery misc
3. baby needs
4. baking needs
5. coupons
6. vegetables
7. total

JNTU Experiment 3:

AIM: One type of model that you can create is a Decision Tree. Train a Decision Tree using the complete dataset as the training data. Report the model obtained after training.

Theory: A decision tree is a flowchart-like tree structure where each internal (non-leaf) node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf (terminal) node holds a class label. Decision trees can be easily converted into classification rules; example algorithms are ID3, C4.5 and CART. Classification is a data mining function that assigns items in a collection to target categories or

classes. The goal of classification is to accurately predict the target class for each case in the data. For example, a classification model could be used to identify loan applicants as low, medium, or high credit risks.

A classification task begins with a data set in which the class assignments are known. For example, a classification model that predicts credit risk could be developed based on observed data for many loan applicants over a period of time. Different classification algorithms use different techniques for finding relationships. These relationships are summarized in a model, which can then be applied to a different data set in which the class assignments are unknown. Classification models are tested by comparing the predicted values to known target values in a set of test data. The historical data for a classification project is typically divided into two data sets: one for building the model, the other for testing the model. Classification has many applications in customer segmentation, business modeling, marketing, credit analysis, and biomedical and drug response modeling.

Different Classification Algorithms: Oracle Data Mining provides the following algorithms for classification:
1. Decision Tree - decision trees automatically generate rules, which are conditional statements that reveal the logic used to build the tree.
2. Naive Bayes - naive Bayes uses Bayes' Theorem, a formula that calculates a probability by counting the frequency of values and combinations of values in the historical data.

HARDWARE/SOFTWARE REQUIREMENTS: WEKA mining tool.

Algorithm/Procedure:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select the Preprocess tab.
4) Go to OPEN file and browse the file that is already stored in the system, German Credit Data.csv.
5) Go to the Classify tab.
6) Here the C4.5 algorithm has been chosen; it is implemented as J48 in WEKA and can be selected by clicking the Choose button.
7) Select trees -> J48.
8) Select Test options: Use training set.
9) If needed, select attributes.
10) Click Start.
11) Now we can see the output details in the Classifier output.
12) Right click on the result list and select the visualize tree option.

Source Code:
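The experiment can equally be scripted against the WEKA Java API instead of the Explorer. The following is a minimal sketch (the file name is an assumption; it presumes the German credit data has been saved in ARFF format):

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainJ48 {
    public static void main(String[] args) throws Exception {
        // Load the credit data (assumed file name)
        Instances data = new DataSource("GermanCreditData.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // "class" is the last attribute

        // Build a C4.5-style pruned tree (J48) with the default options -C 0.25 -M 2
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree); // prints the pruned tree, as in the output below

        // Evaluate on the training data itself ("Use training set" in the Explorer)
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(tree, data);
        System.out.println(eval.toSummaryString());
    }
}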

The decision tree constructed by using the implemented C4.5 algorithm:

OUTPUT: J48 pruned tree

=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     German Credit Data
Instances:    1000
Attributes:   21
              checking_status
              duration
              credit_history
              purpose
              credit_amount
              savings_status
              employment
              installment commitment
              personal status
              other parties
              residence_since
              property_magnitude
              age
              other_payment_plans
              housing
              existing_credits
              job
              num_dependents
              own_telephone
              foreign
              class
Test mode:    evaluate on training data

=== Classifier model (full training set) ===

J48 pruned tree
------------------

checking_status = <0
|   foreign = yes
|   |   duration <= 11
|   |   |   existing_credits <= 1
|   |   |   |   property_magnitude = real estate: good (8.0/1.0)
|   |   |   |   property_magnitude = life insurance
|   |   |   |   |   own_telephone = yes: good (4.0)
|   |   |   |   |   own_telephone = none: bad (2.0)
|   |   |   |   property_magnitude = no known property: bad (3.0)
|   |   |   |   property_magnitude = car: good (2.0/1.0)
|   |   |   existing_credits > 1: good (14.0)
|   |   duration > 11
|   |   |   job = skilled
|   |   |   |   other parties = none
|   |   |   |   |   duration <= 30
|   |   |   |   |   |   savings_status = no known savings
|   |   |   |   |   |   |   existing_credits <= 1
|   |   |   |   |   |   |   |   own_telephone = yes: good (4.0/1.0)
|   |   |   |   |   |   |   |   own_telephone = none: bad (9.0/1.0)
|   |   |   |   |   |   |   existing_credits > 1: good (2.0)
|   |   |   |   |   |   savings_status = <100
|   |   |   |   |   |   |   credit_history = critical/other existing credit: good (14.0/4.0)
|   |   |   |   |   |   |   credit_history = existing paid
|   |   |   |   |   |   |   |   own_telephone = yes: bad (5.0)
|   |   |   |   |   |   |   |   own_telephone = none
|   |   |   |   |   |   |   |   |   existing_credits <= 1
|   |   |   |   |   |   |   |   |   |   property_magnitude = real estate
|   |   |   |   |   |   |   |   |   |   |   age <= 26: bad (5.0)
|   |   |   |   |   |   |   |   |   |   |   age > 26: good (2.0)
|   |   |   |   |   |   |   |   |   |   property_magnitude = life insurance: bad (7.0/2.0)
|   |   |   |   |   |   |   |   |   |   property_magnitude = no known property: good (2.0)
|   |   |   |   |   |   |   |   |   |   property_magnitude = car
|   |   |   |   |   |   |   |   |   |   |   credit_amount <= 1386: bad (3.0)
|   |   |   |   |   |   |   |   |   |   |   credit_amount > 1386: good (11.0/1.0)
|   |   |   |   |   |   |   |   |   existing_credits > 1: bad (3.0)
|   |   |   |   |   |   |   credit_history = delayed previously: bad (4.0)
|   |   |   |   |   |   |   credit_history = no credits/all paid: bad (8.0/1.0)
|   |   |   |   |   |   |   credit_history = all paid: bad (6.0)
|   |   |   |   |   |   savings_status = 500<=X<1000: good (4.0/1.0)
|   |   |   |   |   |   savings_status = >=1000: good (4.0)
|   |   |   |   |   |   savings_status = 100<=X<500
|   |   |   |   |   |   |   credit_history = critical/other existing credit: good (2.0)
|   |   |   |   |   |   |   credit_history = existing paid: bad (3.0)
|   |   |   |   |   |   |   credit_history = delayed previously: good (0.0)
|   |   |   |   |   |   |   credit_history = no credits/all paid: good (0.0)
|   |   |   |   |   |   |   credit_history = all paid: good (1.0)
|   |   |   |   |   duration > 30: bad (30.0/3.0)
|   |   |   |   other parties = guarantor: good (12.0/3.0)
|   |   |   |   other parties = co applicant: bad (7.0/1.0)
|   |   |   job = unskilled resident
|   |   |   |   purpose = radio/tv
|   |   |   |   |   existing_credits <= 1: bad (10.0/3.0)
|   |   |   |   |   existing_credits > 1: good (2.0)
|   |   |   |   purpose = education: bad (1.0)
|   |   |   |   purpose = furniture/equipment
|   |   |   |   |   employment = >=7: good (2.0)
|   |   |   |   |   employment = 1<=X<4: good (4.0)
|   |   |   |   |   employment = 4<=X<7: good (1.0)
|   |   |   |   |   employment = unemployed: good (0.0)
|   |   |   |   |   employment = <1: bad (3.0)
|   |   |   |   purpose = new car
|   |   |   |   |   own_telephone = yes: good (2.0)
|   |   |   |   |   own_telephone = none: bad (10.0/2.0)
|   |   |   |   purpose = used car: bad (1.0)
|   |   |   |   purpose = business: good (3.0)
|   |   |   |   purpose = domestic appliance: bad (1.0)
|   |   |   |   purpose = repairs: bad (1.0)
|   |   |   |   purpose = other: good (1.0)
|   |   |   |   purpose = retraining: good (1.0)
|   |   |   job = high qualif/self emp/mgmt: good (30.0/8.0)
|   |   |   job = unemp/unskilled non res: bad (5.0/1.0)
|   foreign = no: good (15.0/2.0)
checking_status = 0<=X<200
|   credit_amount <= 9857
|   |   savings_status = no known savings: good (41.0/5.0)
|   |   savings_status = <100
|   |   |   other parties = none
|   |   |   |   duration <= 42
|   |   |   |   |   personal status = male single: good (52.0/15.0)
|   |   |   |   |   personal status = female div/dep/mar
|   |   |   |   |   |   purpose = radio/tv: good (8.0/2.0)
|   |   |   |   |   |   purpose = education: good (4.0/2.0)
|   |   |   |   |   |   purpose = furniture/equipment
|   |   |   |   |   |   |   duration <= 10: bad (3.0)
|   |   |   |   |   |   |   duration > 10
|   |   |   |   |   |   |   |   duration <= 21: good (6.0/1.0)
|   |   |   |   |   |   |   |   duration > 21: bad (2.0)
|   |   |   |   |   |   purpose = new car: bad (5.0/1.0)
|   |   |   |   |   |   purpose = used car: bad (1.0)
|   |   |   |   |   |   purpose = business
|   |   |   |   |   |   |   residence_since <= 2: good (3.0)
|   |   |   |   |   |   |   residence_since > 2: bad (2.0)
|   |   |   |   |   |   purpose = domestic appliance: good (0.0)
|   |   |   |   |   |   purpose = repairs: good (1.0)
|   |   |   |   |   |   purpose = other: good (0.0)
|   |   |   |   |   |   purpose = retraining: good (0.0)
|   |   |   |   |   personal status = male div/sep: bad (8.0/2.0)
|   |   |   |   |   personal status = male mar/wid
|   |   |   |   |   |   duration <= 10: good (6.0)
|   |   |   |   |   |   duration > 10: bad (10.0/3.0)
|   |   |   |   duration > 42: bad (7.0)
|   |   |   other parties = guarantor
|   |   |   |   purpose = radio/tv: good (18.0/1.0)
|   |   |   |   purpose = education: good (0.0)
|   |   |   |   purpose = furniture/equipment: good (0.0)
|   |   |   |   purpose = new car: bad (2.0)
|   |   |   |   purpose = used car: good (0.0)
|   |   |   |   purpose = business: good (0.0)
|   |   |   |   purpose = domestic appliance: good (0.0)
|   |   |   |   purpose = repairs: good (0.0)
|   |   |   |   purpose = other: good (0.0)
|   |   |   |   purpose = retraining: good (0.0)
|   |   |   other parties = co applicant: good (2.0)
|   |   savings_status = 500<=X<1000: good (11.0/3.0)
|   |   savings_status = >=1000: good (13.0/3.0)
|   |   savings_status = 100<=X<500
|   |   |   purpose = radio/tv: bad (8.0/2.0)
|   |   |   purpose = education: good (0.0)
|   |   |   purpose = furniture/equipment: bad (4.0/1.0)
|   |   |   purpose = new car: bad (15.0/5.0)
|   |   |   purpose = used car: good (3.0)
|   |   |   purpose = business
|   |   |   |   housing = own: good (6.0)
|   |   |   |   housing = for free: bad (1.0)
|   |   |   |   housing = rent
|   |   |   |   |   existing_credits <= 1: good (2.0)
|   |   |   |   |   existing_credits > 1: bad (2.0)
|   |   |   purpose = domestic appliance: good (0.0)
|   |   |   purpose = repairs: good (2.0)
|   |   |   purpose = other: good (1.0)
|   |   |   purpose = retraining: good (0.0)
|   credit_amount > 9857: bad (20.0/3.0)
checking_status = no checking: good (394.0/46.0)
checking_status = >=200: good (63.0/14.0)

Number of Leaves  : 98
Size of the tree  : 135

Time taken to build model: 0.06 seconds

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances          855              85.5   %
Incorrectly Classified Instances        145              14.5   %
Kappa statistic
Mean absolute error
Root mean squared error                   0.34
Relative absolute error
Root relative squared error
Coverage of cases (0.95 level)          100      %
Mean rel. region size (0.95 level)       93.3    %
Total Number of Instances              1000

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                                                                               good
                                                                               bad
Weighted Avg.

=== Confusion Matrix ===

   a   b   <-- classified as
           a = good
           b = bad

Then we will be getting the confusion matrix as follows:

=== Confusion Matrix ===

   a   b   <-- classified as
           a =
           b =

Visualize threshold curve:

Cost benefit analysis:

Visualize cost curve:

JNTU Experiment 4:

AIM: Suppose you use your above model trained on the complete dataset, and classify credit good/bad for each of the examples in the dataset. What % of examples can you classify correctly? (This is also called testing on the training set.) Why do you think you cannot get 100% training accuracy?

Theory: A naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even though these features depend on the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.

An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire covariance matrix.

The naive Bayes probabilistic model: The probability model for a classifier is a conditional model P(C | F1, ..., Fn) over a dependent class variable C with a small number of outcomes or classes, conditional on several feature variables F1 through Fn. The problem is that if the number of features n is large or when a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable. Using Bayes' theorem, we write

    P(C | F1, ..., Fn) = P(C) P(F1, ..., Fn | C) / P(F1, ..., Fn)

In plain English the above equation can be written as

    posterior = (prior x likelihood) / evidence

In practice we are only interested in the numerator of that fraction, since the denominator does not depend on C and the values of the features Fi are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model p(C, F1, ..., Fn), which can be rewritten as follows, using repeated applications of the definition of conditional probability:

    p(C, F1, ..., Fn)
      = p(C) p(F1, ..., Fn | C)
      = p(C) p(F1 | C) p(F2, ..., Fn | C, F1)
      = p(C) p(F1 | C) p(F2 | C, F1) p(F3, ..., Fn | C, F1, F2)
      = ...
      = p(C) p(F1 | C) p(F2 | C, F1) ... p(Fn | C, F1, F2, ..., Fn-1)

Now the "naive" conditional independence assumptions come into play: assume that each feature Fi is conditionally independent of every other feature Fj for j != i. This means that

    p(Fi | C, Fj) = p(Fi | C)

and so the joint model can be expressed as

    p(C, F1, ..., Fn) = p(C) p(F1 | C) p(F2 | C) ... = p(C) ∏ p(Fi | C)

This means that under the above independence assumptions, the conditional distribution over the class variable C can be expressed like this:

    p(C | F1, ..., Fn) = (1/Z) p(C) ∏ p(Fi | C)

where Z is a scaling factor dependent only on F1, ..., Fn, i.e., a constant if the values of the feature variables are known. Models of this form are much more manageable, since they factor into a so-called class prior p(C) and independent probability distributions p(Fi | C). If there are k classes and if a model for each p(Fi | C = c) can be expressed in terms of r parameters, then the corresponding naive Bayes model has (k - 1) + n r k parameters. In practice, often k = 2 (binary classification) and r = 1 (Bernoulli variables as features) are common, and so the total number of parameters of the naive Bayes model is 2n + 1, where n is the number of binary features used for prediction.

Bayes' theorem:

    P(h | D) = P(D | h) P(h) / P(D)

where
    P(h)     : prior probability of hypothesis h
    P(D)     : prior probability of training data D
    P(h | D) : probability of h given D
    P(D | h) : probability of D given h

Naive Bayes Classifier - Derivation:
- D is a set of tuples; each tuple is an n-dimensional attribute vector X : (x1, x2, x3, ..., xn).
- Let there be m classes: C1, C2, C3, ..., Cm.
- The NB classifier predicts that X belongs to class Ci iff P(Ci | X) > P(Cj | X) for 1 <= j <= m, j != i.
- Maximum posteriori hypothesis: P(Ci | X) = P(X | Ci) P(Ci) / P(X). Maximize P(X | Ci) P(Ci), as P(X) is constant.
- With many attributes, it is computationally expensive to evaluate P(X | Ci). Under the naive assumption of class-conditional independence,

    P(X | Ci) = ∏ (k = 1 to n) P(xk | Ci)
              = P(x1 | Ci) * P(x2 | Ci) * ... * P(xn | Ci)

HARDWARE/SOFTWARE REQUIREMENTS: WEKA mining tool.

Algorithm/Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select the Preprocess tab.
5) Go to OPEN file and browse the file that is already stored in the system, German Credit Data.csv.
6) Go to the Classify tab.
7) Choose Classifier: Trees.
8) Select NBTree, i.e., the naive Bayesian tree.
9) Select Test options: Use training set.
10) If needed, select attributes.
11) Now start Weka.
12) Now we can see the output details in the Classifier output.

Source Code:
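A minimal Java sketch of the same experiment (shown here with weka.classifiers.bayes.NaiveBayes, which implements the model derived above; the Explorer's NBTree can be substituted in the same way; the file name is an assumption):

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainNaiveBayes {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("GermanCreditData.arff").getDataSet(); // assumed file
        data.setClassIndex(data.numAttributes() - 1);

        // Estimate the class prior and the per-attribute conditional distributions
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);

        // Test on the training set itself, as the experiment requires
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(nb, data);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString()); // confusion matrix
    }
}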

Output: In this way we classified each of the examples in the dataset. We classified 85.5% of the examples correctly, and the remaining 14.5% of the examples are incorrectly classified. We can't get 100% training accuracy because, out of the 20 attributes, we have some unnecessary attributes which are also analyzed and trained. Due to this the accuracy is affected, and hence we can't get 100% training accuracy.

5. Is testing on the training set as you did above a good idea? Why or why not?

SOLUTION: According to the rules, for the maximum accuracy we have to take 2/3 of the dataset as the training set and the remaining 1/3 as the test set. But here in the above model we have taken the complete dataset as the training set, which results in only 85.5% accuracy. This is due to the analyzing and training of the unnecessary attributes, which do not play a crucial role in credit risk assessment. By this the complexity increases, and finally it leads to lower accuracy. If some part of the dataset is used as a training set and the remaining as a test set, then it leads to more accurate results and the time for computation will be less. This is why we prefer not to take the complete dataset as the training set. In some cases it is acceptable, but in general it is better to go with cross-validation.

x-fold cross-validation (train on N - N/x instances, test on N/x instances):
- Cross-validation is used to prevent the overlap of the test sets.
- First step: split the data into x disjoint subsets of equal size.
- Second step: use each subset in turn for testing and the remainder for training (repeated cross-validation).
- As resulting rules (if applicable) we take the sum of all rules.
- The error (predictive accuracy) estimates are averaged to yield an overall error (predictive accuracy) estimate.

Standard cross-validation is 10-fold cross-validation. Why 10? Extensive experiments have shown that this is the best choice to get an accurate estimate. There is also some theoretical evidence for this.

Tools/Apparatus: Weka mining tool.

Procedure:
1) In Test options, select the Supplied test set radio button.
2) Click Set.
3) Choose the file which contains records that were not in the training set we used to create the model.
4) Click Start. (WEKA will run this test data set through the model we already created.)
5) Compare the output results with that of the 4th experiment.
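The Supplied test set option corresponds to the following API calls. The two file names are assumptions; they presume the 2/3-1/3 split described above has been saved as separate ARFF files beforehand:

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SuppliedTestSet {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("credit-train.arff").getDataSet(); // assumed file
        Instances test  = new DataSource("credit-test.arff").getDataSet();  // assumed file
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(train); // the model sees only the training set

        // "Supplied test set" in the Explorer: evaluate on unseen records
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.println(eval.toSummaryString());
    }
}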

Sample output: This can be experienced through the different problem solutions while doing practice. The important numbers to focus on here are the numbers next to the "Correctly Classified Instances" (92.3 percent) and the "Incorrectly Classified Instances" (7.6 percent). Other important numbers are in the "ROC Area" column, in the first row (the 0.936). Finally, the "Confusion Matrix" shows the number of false positives and false negatives. The false positives are 29, and the false negatives are 17 in this matrix. Based on our accuracy rate of 92.3 percent, we say that upon initial analysis this is a good model. One final step in validating our classification tree is to run our test set through the model and check the accuracy of the model. Comparing the "Correctly Classified Instances" from this test set with the "Correctly Classified Instances" from the training set shows the accuracy of the model, which indicates that the model will not break down with unknown data, or when future data is applied to it.

6. One approach for solving the problem encountered in the previous question is using cross-validation. Describe what cross-validation is briefly. Train a Decision Tree again using cross-validation and report your results. Does your accuracy increase/decrease? Why?

Cross-validation: In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or "folds" D1, D2, D3, ..., Dk, each of approximately equal size. Training and testing are performed k times. In iteration i, partition Di is reserved as the test set and the remaining partitions are collectively used to train the model. That is, in the first iteration subsets D2, D3, ..., Dk collectively serve as the training set in order to obtain the first model, which is tested on D1; the second iteration is trained on subsets D1, D3, ..., Dk and tested on D2; and so on.

Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select the Preprocess tab.
5) Go to OPEN file and browse the file that is already stored in the system, bank.csv.
6) In the "Filter" panel, click on the "Choose" button. This will show a popup window with the list of available filters.
7) Select weka.filters.unsupervised.attribute.Remove.
8) Next, click on the text box immediately to the right of the "Choose" button.
9) In the resulting dialog box enter the index of the attribute to be filtered out (make sure that the "Invert Selection" option is set to false).
10) Then click "OK". Now, in the filter box you will see "Remove -R 1".
11) Click the "Apply" button to apply this filter to the data. This will remove the "id" attribute and create a new working relation.
12) To save the new working relation as an ARFF file, click on the save button in the top panel.
13) Go to OPEN file and browse the file that was newly saved (the attribute-deleted file).
14) Go to the Classify tab.
15) Choose Classifier: Trees.
16) Select the J48 tree.
17) Select Test options: Cross-validation, Folds 10.
18) If needed, select attributes.
19) Now start Weka.
20) We can see the output details in the Classifier output.
21) Right click on the result list and select the visualize tree option.
22) Compare the output results with that of the 4th experiment.
23) Check whether the accuracy increased or decreased.
24) Check whether removing these attributes has any significant effect.
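The same 10-fold cross-validation can be run from Java. A minimal sketch (file name assumed, as before):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidateJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("GermanCreditData.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Each fold is held out once for testing while the other nine train the model
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString("=== Stratified cross-validation ===", false));
    }
}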

Output: J48 pruned tree

checking_status = <0
|   foreign_worker = yes
|   |   duration <= 11
|   |   |   existing_credits <= 1
|   |   |   |   property_magnitude = real estate: good (8.0/1.0)
|   |   |   |   property_magnitude = life insurance
|   |   |   |   |   own_telephone = none: bad (2.0)
|   |   |   |   |   own_telephone = yes: good (4.0)
|   |   |   |   property_magnitude = car: good (2.0/1.0)
|   |   |   |   property_magnitude = no known property: bad (3.0)
|   |   |   existing_credits > 1: good (14.0)
|   |   duration > 11
|   |   |   job = unemp/unskilled non res: bad (5.0/1.0)
|   |   |   job = unskilled resident
|   |   |   |   purpose = new car
|   |   |   |   |   own_telephone = none: bad (10.0/2.0)
|   |   |   |   |   own_telephone = yes: good (2.0)
|   |   |   |   purpose = used car: bad (1.0)
|   |   |   |   purpose = furniture/equipment
|   |   |   |   |   employment = unemployed: good (0.0)
|   |   |   |   |   employment = <1: bad (3.0)
|   |   |   |   |   employment = 1<=X<4: good (4.0)
|   |   |   |   |   employment = 4<=X<7: good (1.0)
|   |   |   |   |   employment = >=7: good (2.0)
|   |   |   |   purpose = radio/tv
|   |   |   |   |   existing_credits <= 1: bad (10.0/3.0)
|   |   |   |   |   existing_credits > 1: good (2.0)
|   |   |   |   purpose = domestic appliance: bad (1.0)
|   |   |   |   purpose = repairs: bad (1.0)
|   |   |   |   purpose = education: bad (1.0)
|   |   |   |   purpose = vacation: bad (0.0)
|   |   |   |   purpose = retraining: good (1.0)
|   |   |   |   purpose = business: good (3.0)
|   |   |   |   purpose = other: good (1.0)
|   |   |   job = skilled
|   |   |   |   other_parties = none
|   |   |   |   |   duration <= 30
|   |   |   |   |   |   savings_status = <100
|   |   |   |   |   |   |   credit_history = no credits/all paid: bad (8.0/1.0)
|   |   |   |   |   |   |   credit_history = all paid: bad (6.0)
|   |   |   |   |   |   |   credit_history = existing paid
|   |   |   |   |   |   |   |   own_telephone = none
|   |   |   |   |   |   |   |   |   existing_credits <= 1
|   |   |   |   |   |   |   |   |   |   property_magnitude = real estate
|   |   |   |   |   |   |   |   |   |   |   age <= 26: bad (5.0)
|   |   |   |   |   |   |   |   |   |   |   age > 26: good (2.0)
|   |   |   |   |   |   |   |   |   |   property_magnitude = life insurance: bad (7.0/2.0)
|   |   |   |   |   |   |   |   |   |   property_magnitude = car
|   |   |   |   |   |   |   |   |   |   |   credit_amount <= 1386: bad (3.0)
|   |   |   |   |   |   |   |   |   |   |   credit_amount > 1386: good (11.0/1.0)
|   |   |   |   |   |   |   |   |   |   property_magnitude = no known property: good (2.0)
|   |   |   |   |   |   |   |   |   existing_credits > 1: bad (3.0)
|   |   |   |   |   |   |   |   own_telephone = yes: bad (5.0)
|   |   |   |   |   |   |   credit_history = delayed previously: bad (4.0)
|   |   |   |   |   |   |   credit_history = critical/other existing credit: good (14.0/4.0)
|   |   |   |   |   |   savings_status = 100<=X<500
|   |   |   |   |   |   |   credit_history = no credits/all paid: good (0.0)
|   |   |   |   |   |   |   credit_history = all paid: good (1.0)
|   |   |   |   |   |   |   credit_history = existing paid: bad (3.0)
|   |   |   |   |   |   |   credit_history = delayed previously: good (0.0)
|   |   |   |   |   |   |   credit_history = critical/other existing credit: good (2.0)
|   |   |   |   |   |   savings_status = 500<=X<1000: good (4.0/1.0)
|   |   |   |   |   |   savings_status = >=1000: good (4.0)
|   |   |   |   |   |   savings_status = no known savings
|   |   |   |   |   |   |   existing_credits <= 1
|   |   |   |   |   |   |   |   own_telephone = none: bad (9.0/1.0)
|   |   |   |   |   |   |   |   own_telephone = yes: good (4.0/1.0)
|   |   |   |   |   |   |   existing_credits > 1: good (2.0)
|   |   |   |   |   duration > 30: bad (30.0/3.0)
|   |   |   |   other_parties = co applicant: bad (7.0/1.0)
|   |   |   |   other_parties = guarantor: good (12.0/3.0)
|   |   |   job = high qualif/self emp/mgmt: good (30.0/8.0)
|   foreign_worker = no: good (15.0/2.0)
checking_status = 0<=X<200
|   credit_amount <= 9857
|   |   savings_status = <100
|   |   |   other_parties = none
|   |   |   |   duration <= 42
|   |   |   |   |   personal_status = male div/sep: bad (8.0/2.0)
|   |   |   |   |   personal_status = female div/dep/mar
|   |   |   |   |   |   purpose = new car: bad (5.0/1.0)
|   |   |   |   |   |   purpose = used car: bad (1.0)
|   |   |   |   |   |   purpose = furniture/equipment
|   |   |   |   |   |   |   duration <= 10: bad (3.0)
|   |   |   |   |   |   |   duration > 10
|   |   |   |   |   |   |   |   duration <= 21: good (6.0/1.0)
|   |   |   |   |   |   |   |   duration > 21: bad (2.0)
|   |   |   |   |   |   purpose = radio/tv: good (8.0/2.0)
|   |   |   |   |   |   purpose = domestic appliance: good (0.0)
|   |   |   |   |   |   purpose = repairs: good (1.0)
|   |   |   |   |   |   purpose = education: good (4.0/2.0)
|   |   |   |   |   |   purpose = vacation: good (0.0)
|   |   |   |   |   |   purpose = retraining: good (0.0)
|   |   |   |   |   |   purpose = business
|   |   |   |   |   |   |   residence_since <= 2: good (3.0)
|   |   |   |   |   |   |   residence_since > 2: bad (2.0)
|   |   |   |   |   |   purpose = other: good (0.0)
|   |   |   |   |   personal_status = male single: good (52.0/15.0)
|   |   |   |   |   personal_status = male mar/wid
|   |   |   |   |   |   duration <= 10: good (6.0)
|   |   |   |   |   |   duration > 10: bad (10.0/3.0)
|   |   |   |   |   personal_status = female single: good (0.0)
|   |   |   |   duration > 42: bad (7.0)
|   |   |   other_parties = co applicant: good (2.0)
|   |   |   other_parties = guarantor
|   |   |   |   purpose = new car: bad (2.0)
|   |   |   |   purpose = used car: good (0.0)
|   |   |   |   purpose = furniture/equipment: good (0.0)
|   |   |   |   purpose = radio/tv: good (18.0/1.0)
|   |   |   |   purpose = domestic appliance: good (0.0)
|   |   |   |   purpose = repairs: good (0.0)
|   |   |   |   purpose = education: good (0.0)
|   |   |   |   purpose = vacation: good (0.0)
|   |   |   |   purpose = retraining: good (0.0)
|   |   |   |   purpose = business: good (0.0)
|   |   |   |   purpose = other: good (0.0)
|   |   savings_status = 100<=X<500
|   |   |   purpose = new car: bad (15.0/5.0)
|   |   |   purpose = used car: good (3.0)
|   |   |   purpose = furniture/equipment: bad (4.0/1.0)
|   |   |   purpose = radio/tv: bad (8.0/2.0)
|   |   |   purpose = domestic appliance: good (0.0)
|   |   |   purpose = repairs: good (2.0)
|   |   |   purpose = education: good (0.0)
|   |   |   purpose = vacation: good (0.0)
|   |   |   purpose = retraining: good (0.0)
|   |   |   purpose = business
|   |   |   |   housing = rent
|   |   |   |   |   existing_credits <= 1: good (2.0)
|   |   |   |   |   existing_credits > 1: bad (2.0)
|   |   |   |   housing = own: good (6.0)
|   |   |   |   housing = for free: bad (1.0)
|   |   |   purpose = other: good (1.0)
|   |   savings_status = 500<=X<1000: good (11.0/3.0)
|   |   savings_status = >=1000: good (13.0/3.0)
|   |   savings_status = no known savings: good (41.0/5.0)
|   credit_amount > 9857: bad (20.0/3.0)
checking_status = >=200: good (63.0/14.0)
checking_status = no checking: good (394.0/46.0)

Number of Leaves  : 103
Size of the tree  : 140

Time taken to build model: 0.07 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances              1000

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                                                                               good
                                                                               bad
Weighted Avg.

=== Confusion Matrix ===

   a   b   <-- classified as
           a = good
           b = bad

7. Check to see if the data shows a bias against "foreign workers" (attribute 20) or "personal status" (attribute 9). One way to do this (perhaps rather simple-minded) is to remove these attributes from the dataset and see if the decision tree created in those cases is significantly different from the full-dataset case, which you have already done. To remove an attribute you can use the Preprocess tab in Weka's GUI Explorer. Did removing these attributes have any significant effect? Discuss.

The increase in accuracy is because these two attributes are not very important in training and analyzing. By removing them, the time taken has been reduced to some extent, and this results in an increase in accuracy. The decision tree created previously is very large compared to the decision tree which we have trained now. This is the main difference between these two decision trees.

Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select the Preprocess tab.
5) Go to OPEN file and browse the file that is already stored in the system, German Credit Data.csv.

6) In the "Filter" panel, click on the "Choose" button. This will show a popup window with the list of available filters.
7) Select weka.filters.unsupervised.attribute.Remove.
8) Next, click on the text box immediately to the right of the "Choose" button.
9) In the resulting dialog box enter the index of the attribute to be filtered out (make sure that the "Invert Selection" option is set to false).
10) Then click "OK". Now, in the filter box you will see "Remove -R 1".
11) Click the "Apply" button to apply this filter to the data. This will remove the selected attribute and create a new working relation.
12) To save the new working relation as an ARFF file, click on the save button in the top panel.
13) Go to OPEN file and browse the file that was newly saved (the attribute-deleted file).
14) Go to the Classify tab.
15) Choose Classifier: Trees.
16) Select the J48 tree.
17) Select Test options: Use training set.
18) If needed, select attributes.
19) Now start Weka.
20) We can see the output details in the Classifier output.
21) Right click on the result list and select the visualize tree option.
22) Compare the output results with that of the 4th experiment.
23) Check whether the accuracy increased or decreased.
24) Check whether removing these attributes has any significant effect.

Visualize results:

Visualize classifier errors:

Visualize tree:

The difference we observed is that the accuracy improved.

8. Another question might be, do you really need to input so many attributes to get good results? Maybe only a few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, 17 (and 21, the class attribute, naturally). Try out some combinations. (You had removed two attributes in problem 7. Remember to reload the ARFF data file to get all the attributes initially before you start selecting the ones you want.)

Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select the Preprocess tab.
5) Go to OPEN file and browse the file that is already stored in the system, bank.csv.
6) Select some of the attributes from the attributes list which are to be removed. With this step only the attributes necessary for classification are left in the attributes panel.
7) Then go to the Classify tab.
8) Choose Classifier: Trees.
9) Select J48.
10) Select Test options: Use training set.
11) If needed, select attributes.
12) Now start Weka.
13) Now we can see the output details in the Classifier output.
14) Right click on the result list and select the visualize tree option.
15) Compare the output results with that of the 4th experiment.
16) Check whether the accuracy increased or decreased.
17) Check whether removing these attributes has any significant effect.

=== Classifier model (full training set) ===

J48 pruned tree
------------------

credit_history = no credits/all paid: bad (40.0/15.0)
credit_history = all paid
|   employment = unemployed
|   |   duration <= 36: bad (3.0)
|   |   duration > 36: good (2.0)
|   employment = <1
|   |   duration <= 26: bad (7.0/1.0)
|   |   duration > 26: good (2.0)
|   employment = 1<=X<4: good (15.0/6.0)
|   employment = 4<=X<7: bad (10.0/4.0)
|   employment = >=7
|   |   job = unemp/unskilled non res: bad (0.0)
|   |   job = unskilled resident: good (3.0)
|   |   job = skilled: bad (3.0)
|   |   job = high qualif/self emp/mgmt: bad (4.0)
credit_history = existing paid
|   credit_amount <= 8648
|   |   duration <= 40: good (476.0/130.0)
|   |   duration > 40: bad (27.0/8.0)
|   credit_amount > 8648: bad (27.0/7.0)
credit_history = delayed previously
|   employment = unemployed
|   |   credit_amount <= 2186: bad (4.0/1.0)
|   |   credit_amount > 2186: good (2.0)
|   employment = <1
|   |   duration <= 18: good (2.0)
|   |   duration > 18: bad (10.0/2.0)
|   employment = 1<=X<4: good (33.0/6.0)
|   employment = 4<=X<7
|   |   credit_amount <= 4530
|   |   |   credit_amount <= 1680: good (3.0)
|   |   |   credit_amount > 1680: bad (3.0)
|   |   credit_amount > 4530: good (11.0)
|   employment = >=7
|   |   job = unemp/unskilled non res: good (0.0)
|   |   job = unskilled resident: good (2.0/1.0)
|   |   job = skilled: good (14.0/4.0)
|   |   job = high qualif/self emp/mgmt: bad (4.0/1.0)
credit_history = critical/other existing credit: good (293.0/50.0)

Number of Leaves  : 27
Size of the tree  : 40

Time taken to build model: 0.01 seconds

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances              1000
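Attribute selection and retraining can also be scripted. The sketch below keeps only attributes 2, 3, 5, 7, 10, 17 and the class (21) by inverting a Remove filter; the file name is an assumption:

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class AttributeSubset {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("GermanCreditData.arff").getDataSet(); // assumed file

        // Invert selection: the listed indices are kept, everything else is removed
        Remove remove = new Remove();
        remove.setAttributeIndices("2,3,5,7,10,17,21");
        remove.setInvertSelection(true);
        remove.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, remove);
        reduced.setClassIndex(reduced.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(reduced);
        Evaluation eval = new Evaluation(reduced);
        eval.evaluateModel(tree, reduced); // training-set evaluation, as above
        System.out.println(eval.toSummaryString());
    }
}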

9. Sometimes, the cost of rejecting an applicant who actually has good credit (case 1) might be higher than accepting an applicant who has bad credit (case 2). Instead of counting the misclassifications equally in both cases, give a higher cost to the first case (say, cost 5) and a lower cost to the second case. You can do this by using a cost matrix in Weka. Train your Decision Tree again and report the Decision Tree and cross-validation results. Are they significantly different from the results obtained in problem 6 (using equal cost)?

In problem 6, we used equal cost and we trained the decision tree. But here, we consider the two cases with different costs. Let us take cost 5 in case 1 and cost 2 in case 2. When we give such costs in both cases and train the decision tree again, we can observe that the resulting tree is almost equal to that of the decision tree obtained in problem 6. But we find a difference in the cost factor reported in the evaluation summary:

                 Case 1 (cost 5)    Case 2 (cost 2)
Total Cost
Average cost

We don't find this cost factor in problem 6, as there we used equal cost. This is the major difference between the results of problem 6 and problem 9. The cost matrices we used here:

Case 1:  5  1
Case 2:  2  1

Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select the Preprocess tab.
5) Go to OPEN file and browse the file that is already stored in the system, bank.csv.
6) Go to the Classify tab.
7) Choose Classifier: Trees.
8) Select J48.
9) Select Test options: Use training set.
10) Click on More options.
11) Select Cost sensitive evaluation and click on the Set button.
12) Set the matrix values and click on Resize. Then close the window.
13) Click OK.
14) Click Start.
15) We can see the output details in the Classifier output.
16) Select Test options: Cross-validation.
17) Set Folds, e.g. 10.
18) If needed, select attributes.
19) Now start Weka.
20) Now we can see the output details in the Classifier output.
21) Compare the results of the 15th and 20th steps.
22) Compare the results with those of experiment 6.

Sample output:
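The case 1 cost matrix can also be supplied programmatically. A sketch (the cell positions assume "good" is the first class value in this data set; the file name is an assumption):

import java.util.Random;
import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CostSensitiveEval {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("GermanCreditData.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Rejecting a good applicant costs 5, accepting a bad applicant costs 1
        CostMatrix costs = new CostMatrix(2);
        costs.setCell(0, 1, 5.0); // actual good, classified bad
        costs.setCell(1, 0, 1.0); // actual bad, classified good

        // Cost-sensitive 10-fold cross-validation
        Evaluation eval = new Evaluation(data, costs);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println("Total cost:   " + eval.totalCost());
        System.out.println("Average cost: " + eval.avgCost());
    }
}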

10. Do you think it is a good idea to prefer simple decision trees instead of having long, complex decision trees? How does the complexity of a Decision Tree relate to the bias of the model?

When we consider long, complex decision trees, we will have many unnecessary attributes in the tree, which increases the bias of the model. Because of this, the accuracy of the model can also be affected. This problem can be reduced by considering a simple decision tree: the attributes will be fewer, which decreases the bias of the model. Due to this the result will be more accurate. So it is a good idea to prefer simple decision trees instead of long, complex trees.

Aim: Are small rules better than long rules? Check the bias by training the data set using the Weka mining tool.

Tools/Apparatus: Weka mining tool.

Procedure: This will be based on the attribute set and the relationships among attributes we want to study. This can be decided based on the database and the user requirement.

11. You can make your Decision Trees simpler by pruning the nodes. One approach is to use Reduced Error Pruning. Explain this idea briefly. Try reduced error pruning for training your Decision Trees using cross-validation (you can do this in Weka) and report the Decision Tree you obtain. Also, report your accuracy using the pruned model. Does your accuracy increase?

Reduced-error pruning: The idea of using a separate pruning set for pruning, which is applicable to decision trees as well as rule sets, is called reduced-error pruning. The variant described previously prunes a rule immediately after it has been grown and is called incremental reduced-error pruning. Another possibility is to build a full, unpruned rule set first, pruning it afterwards by discarding individual tests. However, this method is much slower. Of course, there are many different ways to assess the worth of a rule based on the pruning set. A simple measure is to consider how well the rule would do at discriminating the predicted class from the other classes if it were the only rule in the theory, operating under the closed-world assumption. If it gets p instances right out of the t instances that it covers, and there are P instances of this class out of a total of T instances altogether, then it gets p positive instances right. The instances that it does not cover include N - n negative ones, where n = t - p is the number of negative instances that the rule covers and N = T - P is the total number of negative instances. Thus the rule has an overall success ratio of [p + (N - n)] / T, and this quantity, evaluated on the test set, has been used to evaluate the success of a rule when using reduced-error pruning.

Aim: To create a Decision Tree in pruned mode with Reduced Error Pruning and show the accuracy for a cross-validation-trained data set using the Weka mining tool.

Tools/Apparatus: Weka mining tool.

Theory: Reduced-error pruning
- Each node of the (over-fit) tree is examined for pruning.
- A node is pruned (removed) only if the resulting pruned tree performs no worse than the original over the validation set.
- Pruning a node consists of:
  - removing the sub-tree rooted at the pruned node,
  - making the pruned node a leaf node,
  - assigning the pruned node the most common classification of the training instances attached to that node.
- Pruning nodes iteratively:
  - always select a node whose removal most increases the DT accuracy over the validation set,
  - stop when further pruning decreases the DT accuracy over the validation set.

Example rule: IF (children = yes) ∧ (income >= 30000) THEN (car = yes)
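Reduced-error pruning is available as an option on J48 itself, so the pruned tree can also be produced from Java. A minimal sketch (file name assumed):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ReducedErrorPruningJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("GermanCreditData.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Part of the training data (1/numFolds) is held out internally as the pruning set
        J48 tree = new J48();
        tree.setReducedErrorPruning(true);
        tree.setNumFolds(3);

        // Report accuracy of the pruned model with 10-fold cross-validation
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}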

Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select the Preprocess tab.
5) Go to OPEN file and browse the file that is already stored in the system, bank.csv.
6) Select some of the attributes from the attributes list.
7) Go to the Classify tab.
8) Choose Classifier: Trees.
9) Select J48.
10) Select Test options: Use training set.
11) Right click on the text box beside the Choose button and select Show properties.
12) Set the reducedErrorPruning option to true.
13) Change the number of pruning folds as needed.
14) If needed, select attributes.
15) Now start Weka.
16) Now we can see the output details in the Classifier output.
17) Right click on the result list and select the visualize tree option.

Sample output:

J48 pruned tree
------------------

checking_status = <0
|   foreign_worker = yes
|   |   credit_history = no credits/all paid: bad (11.0/3.0)
|   |   credit_history = all paid: bad (9.0/1.0)
|   |   credit_history = existing paid
|   |   |   other_parties = none
|   |   |   |   savings_status = <100
|   |   |   |   |   existing_credits <= 1
|   |   |   |   |   |   purpose = new car: bad (17.0/4.0)
|   |   |   |   |   |   purpose = used car: good (3.0/1.0)
|   |   |   |   |   |   purpose = furniture/equipment: good (22.0/11.0)
|   |   |   |   |   |   purpose = radio/tv: good (18.0/8.0)
|   |   |   |   |   |   purpose = domestic appliance: bad (2.0)
|   |   |   |   |   |   purpose = repairs: bad (1.0)
|   |   |   |   |   |   purpose = education: bad (5.0/1.0)
|   |   |   |   |   |   purpose = vacation: bad (0.0)
|   |   |   |   |   |   purpose = retraining: bad (0.0)
|   |   |   |   |   |   purpose = business: good (3.0/1.0)
|   |   |   |   |   |   purpose = other: bad (0.0)
|   |   |   |   |   existing_credits > 1: bad (5.0)
|   |   |   |   savings_status = 100<=X<500: bad (8.0/3.0)
|   |   |   |   savings_status = 500<=X<1000: good (1.0)
|   |   |   |   savings_status = >=1000: good (2.0)
|   |   |   |   savings_status = no known savings
|   |   |   |   |   job = unemp/unskilled non res: bad (0.0)
|   |   |   |   |   job = unskilled resident: good (2.0)
|   |   |   |   |   job = skilled
|   |   |   |   |   |   own_telephone = none: bad (4.0)
|   |   |   |   |   |   own_telephone = yes: good (3.0/1.0)
|   |   |   |   |   job = high qualif/self emp/mgmt: bad (3.0/1.0)
|   |   |   other_parties = co applicant: good (4.0/2.0)
|   |   |   other_parties = guarantor: good (8.0/1.0)
|   |   credit_history = delayed previously: bad (7.0/2.0)
|   |   credit_history = critical/other existing credit: good (38.0/10.0)
|   foreign_worker = no: good (12.0/2.0)
checking_status = 0<=X<200
|   other_parties = none
|   |   credit_history = no credits/all paid
|   |   |   other_payment_plans = bank: good (2.0/1.0)
|   |   |   other_payment_plans = stores: bad (0.0)
|   |   |   other_payment_plans = none: bad (7.0)
|   |   credit_history = all paid: bad (10.0/4.0)
|   |   credit_history = existing paid
|   |   |   credit_amount <= 8858: good (70.0/21.0)
|   |   |   credit_amount > 8858: bad (8.0)
|   |   credit_history = delayed previously: good (25.0/6.0)
|   |   credit_history = critical/other existing credit: good (26.0/7.0)
|   other_parties = co applicant: bad (7.0/1.0)
|   other_parties = guarantor: good (18.0/4.0)
checking_status = >=200: good (44.0/9.0)
checking_status = no checking
|   other_payment_plans = bank: good (30.0/10.0)
|   other_payment_plans = stores: good (12.0/2.0)
|   other_payment_plans = none
|   |   credit_history = no credits/all paid: good (4.0)
|   |   credit_history = all paid: good (1.0)
|   |   credit_history = existing paid
|   |   |   existing_credits <= 1: good (92.0/7.0)
|   |   |   existing_credits > 1
|   |   |   |   installment_commitment <= 2: bad (4.0/1.0)
|   |   |   |   installment_commitment > 2: good (5.0)
|   |   credit_history = delayed previously: good (22.0/6.0)
|   |   credit_history = critical/other existing credit: good (92.0/3.0)

Number of Leaves  : 47
Size of the tree  : 64

Time taken to build model: 0.49 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
Total Number of Instances

12. (Extra Credit): How can you convert a Decision Tree into "if-then-else" rules? Make up your own small Decision Tree consisting of 2-3 levels and convert it into a set of rules. There also exist different classifiers that output the model in the form of rules; one such classifier in Weka is rules.PART. Train this model and report the set of rules obtained. Sometimes just one attribute can be good enough in making the decision, yes, just one! Can you predict what attribute that might be in this dataset? The OneR classifier uses a single attribute to make decisions (it chooses the attribute based on minimum error). Report the rule obtained by training a OneR classifier. Rank the performance of J48, PART and OneR.

In Weka, rules.PART is one of the classifiers which convert decision trees into IF-THEN-ELSE rules.

Converting decision trees into IF-THEN-ELSE rules using the rules.PART classifier:

PART decision list
------------------

outlook = overcast: yes (4.0)

windy = TRUE: no (4.0/1.0)

outlook = sunny: no (3.0/1.0)

: yes (3.0)

Number of Rules : 4

Yes, sometimes just one attribute can be good enough in making the decision. In this dataset (weather), the single attribute for making the decision is outlook:

outlook:
    sunny    -> no
    overcast -> yes
    rainy    -> yes
(10/14 instances correct)

With respect to time, the OneR classifier has the highest ranking, J48 is in 2nd place and PART gets 3rd place:

              J48    PART   OneR
TIME (sec)
RANK          II     III    I

But if you consider the accuracy, the J48 classifier has the highest ranking, PART gets second place and OneR gets last place:

              J48    PART   OneR
ACCURACY (%)                66.8
RANK          I      II     III

Aim: To compare the OneR classifier, which uses a single attribute and rule, with the J48 and PART classifiers, by training the data set using the Weka mining tool.

Tools/Apparatus: Weka mining tool.

Procedure for J48:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select the Preprocess tab.
5) Go to OPEN file and browse the file that is already stored in the system, bank.csv.
6) Select some of the attributes from the attributes list.
7) Go to the Classify tab.
8) Choose Classifier: Trees.
9) Select J48.
10) Select Test options: Use training set.
11) If needed, select attributes.
12) Now start Weka.
13) Now we can see the output details in the Classifier output.
14) Right click on the result list and select the visualize tree option.
(or)
java weka.classifiers.trees.J48 -t c:\temp\bank.arff

Procedure for OneR:
1) Given the Bank database for mining.

2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select the Preprocess tab.
5) Go to OPEN file and browse the file that is already stored in the system, bank.csv.
6) Select some of the attributes from the attributes list.
7) Go to the Classify tab.
8) Choose Classifier: Rules.
9) Select OneR.
10) Select Test options: Use training set.
11) If needed, select attributes.
12) Now start Weka.
13) Now we can see the output details in the Classifier output.

Procedure for PART:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select the Preprocess tab.
5) Go to OPEN file and browse the file that is already stored in the system, bank.csv.
6) Select some of the attributes from the attributes list.
7) Go to the Classify tab.
8) Choose Classifier: Rules.
9) Select PART.
10) Select Test options: Use training set.
11) If needed, select attributes.
12) Now start Weka.
13) Now we can see the output details in the Classifier output.

Attribute relevance with respect to the class - relevant attribute (science):
IF accounting = 1 THEN class = A (Error = 0, Coverage = 7 instances)
IF accounting = 0 THEN class = B (Error = 4/13, Coverage = 13 instances)

Sample output: J48

java weka.classifiers.trees.J48 -t c:/temp/bank.arff

OneR:

PART:
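The three classifiers can also be built and timed from a single program, which makes the ranking above easy to reproduce. A sketch (the ARFF file name is an assumption):

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareRuleLearners {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("bank.arff").getDataSet(); // assumed file
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] models = { new J48(), new PART(), new OneR() };
        for (Classifier model : models) {
            long start = System.currentTimeMillis();
            model.buildClassifier(data);
            long elapsed = System.currentTimeMillis() - start;

            Evaluation eval = new Evaluation(data);
            eval.evaluateModel(model, data); // training-set accuracy, as in this task
            System.out.printf("%-6s accuracy: %5.1f%%  build time: %d ms%n",
                    model.getClass().getSimpleName(), eval.pctCorrect(), elapsed);
        }
    }
}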

Extra Experiments:

13. Generate association rules for the following transactional database using the Apriori algorithm.

TID     List of Items
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3

Step-1: Create an Excel document of the above data and save it as a CSV (comma-delimited) file:

Tid     I1    I2    I3    I4    I5
T100    yes   yes   no    no    yes
T200    no    yes   no    yes   no
T300    no    yes   yes   no    no
T400    yes   yes   no    yes   no
T500    yes   no    yes   no    no
T600    no    yes   yes   no    no
T700    yes   no    yes   no    no
T800    yes   yes   yes   no    yes
T900    yes   yes   yes   no    no

Step-2: Open the WEKA tool and click on Explorer.
Step-3: Click on the Open file tab and take the file from the desired location; the file should be the .csv file which was saved earlier.
Step-4: Click on the Associate tab, which is in the top header tabs, then choose Apriori and click OK.
Step-5: Then click the Start button and see the generated association rule results:
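The CSV file created in Step-1 can also be loaded (and converted to ARFF for later sessions) programmatically. A sketch with assumed file names:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Read the transaction table saved from Excel
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("transactions.csv")); // assumed file name
        Instances data = loader.getDataSet();

        // Save it as ARFF so the Explorer and scripts can reuse it
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("transactions.arff"));
        saver.writeBatch();
    }
}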

=== Run information ===

Scheme:       weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation:     customer
Instances:    9
Attributes:   6
              Tid
              I1
              I2
              I3
              I4
              I5

=== Associator model (full training set) ===

Apriori
=======

Minimum support: 0.35 (3 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 13

Generated sets of large itemsets:

Size of set of large itemsets L(1): 7
Size of set of large itemsets L(2): 13
Size of set of large itemsets L(3): 9
Size of set of large itemsets L(4): 2

Best rules found:

 1. I3=yes 6 ==> I4=no 6    conf:(1)
 2. I4=no I5=no 5 ==> I3=yes 5    conf:(1)
 3. I3=yes I5=no 5 ==> I4=no 5    conf:(1)
 4. I1=yes I3=yes 4 ==> I4=no 4    conf:(1)
 5. I2=yes I3=yes 4 ==> I4=no 4    conf:(1)
 6. I1=no 3 ==> I2=yes 3    conf:(1)
 7. I1=no 3 ==> I5=no 3    conf:(1)
 8. I3=no 3 ==> I2=yes 3    conf:(1)
 9. I1=no I5=no 3 ==> I2=yes 3    conf:(1)
10. I1=no I2=yes 3 ==> I5=no 3    conf:(1)
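The same rules can be generated from Java. A minimal sketch, assuming the transaction table was saved as transactions.arff with the yes/no coding shown above:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriRules {
    public static void main(String[] args) throws Exception {
        // All attributes must be nominal (yes/no) for Apriori
        Instances data = new DataSource("transactions.arff").getDataSet(); // assumed file

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);   // -N 10
        apriori.setMinMetric(0.9); // -C 0.9, minimum confidence
        apriori.buildAssociations(data);
        System.out.println(apriori); // prints the best rules found
    }
}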

14. Generate classification rules for the following database using a decision tree (J48).

RID   Age           Income   Student   Credit_rating   Class: buys_computer
1     Youth         High     No        Fair            No
2     Youth         High     No        Excellent       No
3     Middle_aged   High     No        Fair            Yes
4     Senior        Medium   No        Fair            Yes
5     Senior        Low      Yes       Fair            Yes
6     Senior        Low      Yes       Excellent       No
7     Middle_aged   Low      Yes       Excellent       Yes
8     Youth         Medium   No        Fair            Yes
9     Youth         Low      Yes       Fair            No
10    Senior        Medium   Yes       Fair            Yes
11    Youth         Medium   Yes       Excellent       Yes
12    Middle_aged   Medium   No        Excellent       Yes
13    Middle_aged   High     Yes       Fair            Yes
14    Senior        Medium   No        Excellent       No

Step-1: Create an Excel document of the above data and save it as a CSV (comma-delimited) file.
Step-2: Open the WEKA tool and click on Explorer.
Step-3: Click on the Open file tab and take the file from the desired location; the file should be the .csv file which was saved earlier.
Step-4: Click on the Classify tab, which is in the top header tabs, then choose the classifier J48 and select the test option "Use training set".
Step-5: Then click the Start button and see the generated classification rule results:

=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     buys
Instances:    14
Attributes:   6
              Rid
              age
              income
              student
              credit_rating
              class:buys_computer
Test mode:    evaluate on training data

=== Classifier model (full training set) ===

J48 pruned tree
------------------

age = youth
|   student = no: no (3.0)
|   student = yes: yes (2.0)
age = middle_aged: yes (4.0)
age = senior
|   credit_rating = fair: yes (3.0)
|   credit_rating = excellent: no (2.0)

Number of Leaves  : 5
Size of the tree  : 8

Time taken to build model: 0.02 seconds

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances          14              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1
Mean absolute error                      0
Root mean squared error                  0
Relative absolute error                  0      %
Root relative squared error              0      %
Total Number of Instances               14

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 1         0          1          1         1           1       no
                 1         0          1          1         1           1       yes
Weighted Avg.    1         0          1          1         1           1

=== Confusion Matrix ===

 a b   <-- classified as
 5 0 | a = no
 0 9 | b = yes

Step-6:

Right click on the result list to visualize the tree.

Output:


More information

(CS423) SOFTWARE PROJECT MANAGEMENT

(CS423) SOFTWARE PROJECT MANAGEMENT (CS423) SOFTWARE PROJECT MANAGEMENT COURSE OBJECTIVES: The students will be able to: 1. Prescribe the conventional and evolution of software. 2. Resolve the process of managing a software from conventional

More information

Classification using Weka (Brain, Computation, and Neural Learning)

Classification using Weka (Brain, Computation, and Neural Learning) LOGO Classification using Weka (Brain, Computation, and Neural Learning) Jung-Woo Ha Agenda Classification General Concept Terminology Introduction to Weka Classification practice with Weka Problems: Pima

More information

Function Algorithms: Linear Regression, Logistic Regression

Function Algorithms: Linear Regression, Logistic Regression CS 4510/9010: Applied Machine Learning 1 Function Algorithms: Linear Regression, Logistic Regression Paula Matuszek Fall, 2016 Some of these slides originated from Andrew Moore Tutorials, at http://www.cs.cmu.edu/~awm/tutorials.html

More information

Basic Concepts Weka Workbench and its terminology

Basic Concepts Weka Workbench and its terminology Changelog: 14 Oct, 30 Oct Basic Concepts Weka Workbench and its terminology Lecture Part Outline Concepts, instances, attributes How to prepare the input: ARFF, attributes, missing values, getting to know

More information

Data Mining With Weka A Short Tutorial

Data Mining With Weka A Short Tutorial Data Mining With Weka A Short Tutorial Dr. Wenjia Wang School of Computing Sciences University of East Anglia (UEA), Norwich, UK Content 1. Introduction to Weka 2. Data Mining Functions and Tools 3. Data

More information

Lab Exercise Two Mining Association Rule with WEKA Explorer

Lab Exercise Two Mining Association Rule with WEKA Explorer Lab Exercise Two Mining Association Rule with WEKA Explorer 1. Fire up WEKA to get the GUI Chooser panel. Select Explorer from the four choices on the right side. 2. To get a feel for how to apply Apriori,

More information

6.034 Design Assignment 2

6.034 Design Assignment 2 6.034 Design Assignment 2 April 5, 2005 Weka Script Due: Friday April 8, in recitation Paper Due: Wednesday April 13, in class Oral reports: Friday April 15, by appointment The goal of this assignment

More information

Course Syllabus MIS Foundation of Information Systems Spring Semester, Credit Hours

Course Syllabus MIS Foundation of Information Systems Spring Semester, Credit Hours Course Syllabus MIS 2749-503 Foundation of Information Systems Spring Semester, 2016 3.0 Credit Hours Instructor: E-mail: Office Hours: Dr. Mark Doran pmdoran@memphis.edu Office Hours by appointment. Course

More information

Dr. Prof. El-Bahlul Emhemed Fgee Supervisor, Computer Department, Libyan Academy, Libya

Dr. Prof. El-Bahlul Emhemed Fgee Supervisor, Computer Department, Libyan Academy, Libya Volume 5, Issue 1, January 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Performance

More information

Practical Data Mining COMP-321B. Tutorial 1: Introduction to the WEKA Explorer

Practical Data Mining COMP-321B. Tutorial 1: Introduction to the WEKA Explorer Practical Data Mining COMP-321B Tutorial 1: Introduction to the WEKA Explorer Gabi Schmidberger Mark Hall Richard Kirkby July 12, 2006 c 2006 University of Waikato 1 Setting up your Environment Before

More information

Syllabus for HPE 451 Directed Study 1-3 Credit Hours Spring 2014

Syllabus for HPE 451 Directed Study 1-3 Credit Hours Spring 2014 Syllabus for HPE 451 Directed Study 1-3 Credit Hours Spring 2014 I. COURSE DESCRIPTION The study of an approved topic, project, or practicum. Intended to supplement a subject already studied in a HPE class

More information

Course Syllabus MIS Foundation of Information Systems Spring Semester, Credit Hours. (Last updated: 1/16/2016)

Course Syllabus MIS Foundation of Information Systems Spring Semester, Credit Hours. (Last updated: 1/16/2016) Course Syllabus MIS 2749-004 Foundation of Information Systems Spring Semester, 2016 3.0 Credit Hours (Last updated: 1/16/2016) Instructor: Dr. Wade Jackson, Associate Professor Phone1: 901-678-4550 E-mail:

More information

Syllabus Course: MIS Foundation of Information Systems Fall Semester, Credit Hours

Syllabus Course: MIS Foundation of Information Systems Fall Semester, Credit Hours Syllabus Course: MIS 2749-001 Foundation of Information Systems Fall Semester, 2015 3.0 Credit Hours Instructor: Cindricka L. Arrington Phone: 901-598-3093 E-mail: carrngtn@memphis.edu Office: Virtual

More information

GradeConnect.com. User Manual

GradeConnect.com. User Manual GradeConnect.com User Manual Version 2.0 2003-2006, GradeConnect, Inc. Written by Bernie Salvaggio Edited by Charles Gallagher & Beth Giuliano Contents Teachers...5 Account Basics... 5 Register Your School

More information

COMP33111: Tutorial/lab exercise 2

COMP33111: Tutorial/lab exercise 2 COMP33111: Tutorial/lab exercise 2 Part 1: Data cleaning, profiling and warehousing Note: use lecture slides and additional materials (see Blackboard and COMP33111 web page). 1. Explain why legacy data

More information

Syllabus for CIT 442 Information System Security 3 Credit Hours Spring 2015

Syllabus for CIT 442 Information System Security 3 Credit Hours Spring 2015 Syllabus for CIT 442 Information System Security 3 Credit Hours Spring 2015 I. COURSE DESCRIPTION An overview of information system security to include managing security, protecting information technology

More information

CIS 3308 Web Application Programming Syllabus

CIS 3308 Web Application Programming Syllabus CIS 3308 Web Application Programming Syllabus (Upper Level CS Elective) Course Description This course explores techniques that are used to design and implement web applications both server side and client

More information

Syllabus for HPE 016 Beginning Badminton and Fitness 1 Credit Hour Spring 2014

Syllabus for HPE 016 Beginning Badminton and Fitness 1 Credit Hour Spring 2014 Syllabus for HPE 016 Beginning Badminton and Fitness 1 Credit Hour Spring 2014 I. COURSE DESCRIPTION Designed for the student who has little or no experience in the game of badminton. The course places

More information

Course Syllabus MIS Foundation of Information Systems Spring Semester, Credit Hours

Course Syllabus MIS Foundation of Information Systems Spring Semester, Credit Hours Course Syllabus MIS 2749-007 Foundation of Information Systems Spring Semester, 2016 3.0 Credit Hours Instructor: Vicki R. Robertson E-mail: vrobrtsn@memphis.edu Course Overview This course is an introduction

More information

WEKA Explorer User Guide for Version 3-4

WEKA Explorer User Guide for Version 3-4 WEKA Explorer User Guide for Version 3-4 Richard Kirkby Eibe Frank July 28, 2010 c 2002-2010 University of Waikato This guide is licensed under the GNU General Public License version 2. More information

More information

User instructions Questionnaire CBS-DNB Finances of Enterprises and Balance of Payments

User instructions Questionnaire CBS-DNB Finances of Enterprises and Balance of Payments User instructions Questionnaire CBS-DNB Finances of Enterprises and Balance of Payments 1 Table of contents 1. Getting familiar with the new quarterly survey... 3 1.1 Structure of the questionnaire...

More information

Short instructions on using Weka

Short instructions on using Weka Short instructions on using Weka G. Marcou 1 Weka is a free open source data mining software, based on a Java data mining library. Free alternatives to Weka exist as for instance R and Orange. The current

More information

Syllabus for HPE 120 Dance Aerobic Proficiency 0.0 Credit Hour Spring 2012

Syllabus for HPE 120 Dance Aerobic Proficiency 0.0 Credit Hour Spring 2012 I. COURSE DESCRIPTION Syllabus for HPE 120 Dance Aerobic Proficiency 0.0 Credit Hour Spring 2012 Designed for dance majors, the course helps students to develop and implement a personal fitness exercise

More information

M. Tech. (Power Electronics and Power System) (Semester I) Course Plan for Each Week (Hrs)

M. Tech. (Power Electronics and Power System) (Semester I) Course Plan for Each Week (Hrs) No. 3 Advanced Power Electronics Computer Application in Power System Modelling and Analysis of Electrical Machines M. Tech. (Power Electronics and Power System) (Semester I) Plan for Each Week (Hrs) Credits

More information

Using external Media such as CD's and Pen Drives. Reading from a CD and writing to a pen drive Contributers: Aruna Prabahala Std: III Grade Level

Using external Media such as CD's and Pen Drives. Reading from a CD and writing to a pen drive Contributers: Aruna Prabahala Std: III Grade Level Title: Using external Media such as CD's and Pen Drives. Reading from a CD and writing to a pen drive Contributers: Aruna Prabahala Std: III Grade Level Submission Date: Brief Description Goal Pre Requisites

More information

Document Control Information

Document Control Information Document Control Information Document Details Document Name Purpose of Document Document Version Number 3.1 Document Status Document Owner Prepared By The ITIL Intermediate Qualification: Service Operation

More information

Chapter 8 The C 4.5*stat algorithm

Chapter 8 The C 4.5*stat algorithm 109 The C 4.5*stat algorithm This chapter explains a new algorithm namely C 4.5*stat for numeric data sets. It is a variant of the C 4.5 algorithm and it uses variance instead of information gain for the

More information

Perform the following steps to set up for this project. Start out in your login directory on csit (a.k.a. acad).

Perform the following steps to set up for this project. Start out in your login directory on csit (a.k.a. acad). CSC 458 Data Mining and Predictive Analytics I, Fall 2017 (November 22, 2017) Dr. Dale E. Parson, Assignment 4, Comparing Weka Bayesian, clustering, ZeroR, OneR, and J48 models to predict nominal dissolved

More information

CHAPTER 6 EXPERIMENTS

CHAPTER 6 EXPERIMENTS CHAPTER 6 EXPERIMENTS 6.1 HYPOTHESIS On the basis of the trend as depicted by the data Mining Technique, it is possible to draw conclusions about the Business organization and commercial Software industry.

More information

Chemistry & Physics Department

Chemistry & Physics Department California University of Pennsylvania Chemistry & Physics Department ORDERING PROCEDURES In order to maintain a consistently updated chemical inventory, instrumentation, and supplies list, ordering procedures

More information

Enterprise Miner Tutorial Notes 2 1

Enterprise Miner Tutorial Notes 2 1 Enterprise Miner Tutorial Notes 2 1 ECT7110 E-Commerce Data Mining Techniques Tutorial 2 How to Join Table in Enterprise Miner e.g. we need to join the following two tables: Join1 Join 2 ID Name Gender

More information

Syllabus for HPE 038 Scuba Open Water II and Fitness.5-1 Credit Hour Spring 2014

Syllabus for HPE 038 Scuba Open Water II and Fitness.5-1 Credit Hour Spring 2014 Syllabus for HPE 038 Scuba Open Water II and Fitness.5-1 Credit Hour Spring 2014 I. COURSE DESCRIPTION Designed to prepare the student for open-water scuba training and/or NAUI Advanced Certification.

More information

Syllabus for HPE 076 Varsity Volleyball (Women) Sports 1 Credit Hour Fall 2014

Syllabus for HPE 076 Varsity Volleyball (Women) Sports 1 Credit Hour Fall 2014 I. COURSE DESCRIPTION Syllabus for HPE 076 Varsity Volleyball (Women) Sports 1 Credit Hour Fall 2014 Designed for the student who is a member of the Oral Roberts University Varsity team. The course places

More information

Fundamentals of Database Systems Prof. Arnab Bhattacharya Department of Computer Science and Engineering Indian Institute of Technology, Kanpur

Fundamentals of Database Systems Prof. Arnab Bhattacharya Department of Computer Science and Engineering Indian Institute of Technology, Kanpur Fundamentals of Database Systems Prof. Arnab Bhattacharya Department of Computer Science and Engineering Indian Institute of Technology, Kanpur Lecture - 18 Database Indexing: Hashing We will start on

More information

Practical Data Mining COMP-321B. Tutorial 6: Association Rules

Practical Data Mining COMP-321B. Tutorial 6: Association Rules Practical Data Mining COMP-321B Tutorial 6: Association Rules Gabi Schmidberger Mark Hall September 11, 2006 c 2006 University of Waikato 1 Introduction This tutorial is about association rule learners.

More information

The GKIDS Roster Upload Process Frequently Asked Questions (FAQ)

The GKIDS Roster Upload Process Frequently Asked Questions (FAQ) The GKIDS Roster Upload Process Frequently Asked Questions (FAQ) Is the new Roster Upload process required of districts, or can students be manually entered in the GKIDS database as in the past? The roster

More information

Administrator Level Prism Training Manual

Administrator Level Prism Training Manual Administrator Level Prism Training Manual Table of Contents Major topics to be addressed... 3 Brief description of each topic... 3 How to register on PRiSM if you are a staff member... 3 Brief Introduction

More information

Department of Chemistry

Department of Chemistry Department of Chemistry Guidelines for the Confocal Laser Scanning Microscopes The Department of Chemistry (CHEM) has two confocal laser scanning microscopes (confocal microscope) for imaging uses: 1.

More information

Create a new form. To create a form from a new or existing spreadsheet: 1. Click the Tools drop down menu and select Create a form.

Create a new form. To create a form from a new or existing spreadsheet: 1. Click the Tools drop down menu and select Create a form. Create a new form You can choose Google Forms when creating a new doc from Google Drive. You can also create a new form from a Google Sheet or from a template. To create a form within Google Drive: Click

More information

MiraCosta CurricUNET User Manual

MiraCosta CurricUNET User Manual MiraCosta CurricUNET User Manual Building the Worldwide Curriculum Network Page 1 Contents Log In... 5 Courses... 6 Programs... 9 Create Proposal... 10 Deactivate Credit Course... 11 Deactivate Noncredit

More information

Decision Making Procedure: Applications of IBM SPSS Cluster Analysis and Decision Tree

Decision Making Procedure: Applications of IBM SPSS Cluster Analysis and Decision Tree World Applied Sciences Journal 21 (8): 1207-1212, 2013 ISSN 1818-4952 IDOSI Publications, 2013 DOI: 10.5829/idosi.wasj.2013.21.8.2913 Decision Making Procedure: Applications of IBM SPSS Cluster Analysis

More information

JAVA PROGRAMMING LAB MANUALS EBOOK

JAVA PROGRAMMING LAB MANUALS EBOOK 05 March, 2018 JAVA PROGRAMMING LAB MANUALS EBOOK Document Filetype: PDF 267.58 KB 0 JAVA PROGRAMMING LAB MANUALS EBOOK In undergoing this life, many people always try to do and get the best. Introduction

More information

Non-Linear Least Squares Analysis with Excel

Non-Linear Least Squares Analysis with Excel Non-Linear Least Squares Analysis with Excel 1. Installation An add-in package for Excel, which performs certain specific non-linear least squares analyses, is available for use in Chem 452. The package,

More information

Using Weka for Classification. Preparing a data file

Using Weka for Classification. Preparing a data file Using Weka for Classification Preparing a data file Prepare a data file in CSV format. It should have the names of the features, which Weka calls attributes, on the first line, with the names separated

More information

IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER

IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER N. Suresh Kumar, Dr. M. Thangamani 1 Assistant Professor, Sri Ramakrishna Engineering College, Coimbatore, India 2 Assistant

More information

Q &A on Entity Relationship Diagrams. What is the Point? 1 Q&A

Q &A on Entity Relationship Diagrams. What is the Point? 1 Q&A 1 Q&A Q &A on Entity Relationship Diagrams The objective of this lecture is to show you how to construct an Entity Relationship (ER) Diagram. We demonstrate these concepts through an example. To break

More information

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa KNIME TUTORIAL Anna Monreale KDD-Lab, University of Pisa Email: annam@di.unipi.it Outline Introduction on KNIME KNIME components Exercise: Data Understanding Exercise: Market Basket Analysis Exercise:

More information

Course Syllabus. Course Information

Course Syllabus. Course Information Course Syllabus Course Information Course: MIS 6V99 Special Topics Programming for Data Science Section: 5U1 Term: Summer 2017 Meets: Friday, 6:00 pm to 10:00 pm, JSOM 2.106 Note: Beginning Fall 2017,

More information

Notebook Assignments

Notebook Assignments Notebook Assignments These six assignments are a notebook using techniques from class in the single concrete context of graph theory. This is supplemental to your usual assignments, and is designed for

More information

Course File Leaf (Theory) For the Academic Year (Odd/Even Semester)

Course File Leaf (Theory) For the Academic Year (Odd/Even Semester) Nadar Saraswathi College of Engineering and Technology, Vadapudupatti, Theni - 625 531 (Approved by AICTE, New Delhi and Affiliated to Anna University, Chennai) Course File Leaf (Theory) For the Academic

More information

Error Analysis, Statistics and Graphing

Error Analysis, Statistics and Graphing Error Analysis, Statistics and Graphing This semester, most of labs we require us to calculate a numerical answer based on the data we obtain. A hard question to answer in most cases is how good is your

More information

COURSE OUTLINE. SCHOOL: School of Engineering Technology and Applied Science. DEPARTMENT: Information and Communication Engineering Technology (ICET)

COURSE OUTLINE. SCHOOL: School of Engineering Technology and Applied Science. DEPARTMENT: Information and Communication Engineering Technology (ICET) COURSE OUTLINE SCHOOL: School of Engineering Technology and Applied Science DEPARTMENT: Information and Communication Engineering Technology (ICET) PROGRAM (if applicable): Electronics Engineering Technology

More information

Helping the Compiler Help You. Thomas Dy

Helping the Compiler Help You. Thomas Dy Helping the Compiler Help You Thomas Dy Programming do { programmer.write_code(); if(lazy) { sleep(); } compile_code(); } while(compiler.has_errors()); Compiler: Me no speaky English Programmer: Compiler,

More information

UNIVERSITY OF THE PACIFIC COURSE ApPROVAL FORM REVISION. Revised catalog description (attach additional sheet if necessary).

UNIVERSITY OF THE PACIFIC COURSE ApPROVAL FORM REVISION. Revised catalog description (attach additional sheet if necessary). UNIVERSITY OF THE PACIFIC COURSE ApPROVAL FORM REVISION \...J. Please fill in all information. Required signatures are on page 2 of this form. Please return to: Academic Affairs Committee, Office of the

More information

Table of Contents. Overview of the TEA Login Application Features Roles in Obtaining Application Access Approval Process...

Table of Contents. Overview of the TEA Login Application Features Roles in Obtaining Application Access Approval Process... TEAL Help Table of Contents Overview of the TEA Login Application... 7 Features... 7 Roles in Obtaining Application Access... 7 Approval Process... 8 Processing an Application Request... 9 The Process

More information

15-780: Problem Set #2

15-780: Problem Set #2 15-780: Problem Set #2 February 19, 2014 1. Constraint satisfaction problem (CSP) [20pts] A common problem at universities is to schedule rooms for exams. The basic structure of this problem is to divide

More information

Math for Liberal Studies

Math for Liberal Studies Math for Liberal Studies An identification number is a sequence of letters and/or numbers that identifies an object, person, place, or concept The number should unambiguously identify something: no two

More information

Database Concepts Using Microsoft Access

Database Concepts Using Microsoft Access lab Database Concepts Using Microsoft Access 9 Objectives: Upon successful completion of Lab 9, you will be able to Understand fundamental concepts including database, table, record, field, field name,

More information

What is KNIME? workflows nodes standard data mining, data analysis data manipulation

What is KNIME? workflows nodes standard data mining, data analysis data manipulation KNIME TUTORIAL What is KNIME? KNIME = Konstanz Information Miner Developed at University of Konstanz in Germany Desktop version available free of charge (Open Source) Modular platform for building and

More information

Resource Booking System

Resource Booking System Resource Booking System This Resource Booking System is developed for users to centrally manage and book resources that are available in UTS Careers. The access to this system is authenticated with UTS

More information

I am Stephen LeTourneau from Sandia National Laboratories Sandia s National Security Missions include: Nuclear Weapons Defense Systems & Assessments

I am Stephen LeTourneau from Sandia National Laboratories Sandia s National Security Missions include: Nuclear Weapons Defense Systems & Assessments I am Stephen LeTourneau from Sandia National Laboratories Sandia s National Security Missions include: Nuclear Weapons Defense Systems & Assessments Energy, Climate & Infrastructure Security International,

More information

Information Retrieval CS6200. Jesse Anderton College of Computer and Information Science Northeastern University

Information Retrieval CS6200. Jesse Anderton College of Computer and Information Science Northeastern University Information Retrieval CS6200 Jesse Anderton College of Computer and Information Science Northeastern University What is Information Retrieval? You have a collection of documents Books, web pages, journal

More information

B.C.A (5 th Semester) Assessment Policy

B.C.A (5 th Semester) Assessment Policy Theory Parameters B.C.A (5 th Semester) 030010514: DSE6 Fundamentals of Web Application Development Policy Assessm ent Code Type Duration Number of questions Marks of each Weightage in CIE of 40 marks

More information

TEACHING & ASSESSMENT (T & A) PLAN College of Economics Management and Information Systems Department of Information Systems

TEACHING & ASSESSMENT (T & A) PLAN College of Economics Management and Information Systems Department of Information Systems 1 UoN/AA-003/FORM-QTLMS/V2/2017 TEACHING & ASSESSMENT (T & A) PLAN College of Economics Management and Information Systems Department of Information Systems Semester: Summer 2017 Academic Year 2016-17

More information

Syllabus for HPE 045 Pilates for Christians and Fitness 1 Credit Hour Spring 2016

Syllabus for HPE 045 Pilates for Christians and Fitness 1 Credit Hour Spring 2016 I. COURSE DESCRIPTION Syllabus for HPE 045 Pilates for Christians and Fitness 1 Credit Hour Spring 2016 Approach Pilates from a Christian perspective of wellness. Develop balance, strength, and flexibility

More information

My IRIS Web Portal Inbox Electronic Certification of a Travel Expense Report

My IRIS Web Portal Inbox Electronic Certification of a Travel Expense Report My IRIS Web Portal Inbox Electronic Certification of a Travel Expense Report Steps to certify or reject your own Travel Expense Report in the My IRIS Web Portal: Note 1: If you enter and submit your own

More information

ITNW 1425 Fundamentals of Networking Technologies Course Syllabus fall 2012

ITNW 1425 Fundamentals of Networking Technologies Course Syllabus fall 2012 ITNW 1425 Fundamentals of Networking Technologies Course Syllabus fall 2012 Instructor Course Reference Number (CRN) Course Description: Course Prerequisite(s) Course Semester Credit Hours (SCH) (Lecture,

More information

hondadreamfactory User Guide

hondadreamfactory User Guide hondadreamfactory User Guide Contents 1. Preparatory Steps 3 System Requirements 3 Accessing hondadreamfactory 4 Navigating the Portal 5 2. Training Courses 6 Accessing an online training course 8 Exiting

More information

International Journal of Software and Web Sciences (IJSWS)

International Journal of Software and Web Sciences (IJSWS) International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) ISSN (Print): 2279-0063 ISSN (Online): 2279-0071 International

More information

TUBE: Command Line Program Calls

TUBE: Command Line Program Calls TUBE: Command Line Program Calls March 15, 2009 Contents 1 Command Line Program Calls 1 2 Program Calls Used in Application Discretization 2 2.1 Drawing Histograms........................ 2 2.2 Discretizing.............................

More information

Tutorial Case studies

Tutorial Case studies 1 Topic Wrapper for feature subset selection Continuation. This tutorial is the continuation of the preceding one about the wrapper feature selection in the supervised learning context (http://data-mining-tutorials.blogspot.com/2010/03/wrapper-forfeature-selection.html).

More information

Classification Key Concepts

Classification Key Concepts http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Classification Key Concepts Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech 1 How will

More information

Syllabus for HPE 099 Aerobic Proficiency 1 Credit Hour Fall 2012

Syllabus for HPE 099 Aerobic Proficiency 1 Credit Hour Fall 2012 Syllabus for HPE 099 Aerobic Proficiency 1 Credit Hour Fall 2012 I. COURSE DESCRIPTION Designed for seniors who are presently maintaining a physically active lifestyle and can pass the running, cycling,

More information

ISO LEAD AUDITOR TRAINING

ISO LEAD AUDITOR TRAINING FINAL CERTIFICATION AWARDED BY PECB CANADA ISO 22301 LEAD AUDITOR TRAINING & CERTIFICATION (Business Continuity Management) Master the Audit of Business Continuity Management System (BCMS) based on ISO

More information

Windows Server 2008 Applications Infrastructure Configuration (ITMT 2322)

Windows Server 2008 Applications Infrastructure Configuration (ITMT 2322) Windows Server 2008 Applications Infrastructure Configuration (ITMT 2322) Credit: 3 semester credit hours (2 hours lecture, 4 hours lab) Prerequisite/Co-requisite: None Course Description A course in the

More information

Give a short answer (i.e. 1 to 3 sentences) for the following questions:

Give a short answer (i.e. 1 to 3 sentences) for the following questions: CS35 Sample Final Exam Questions These questions are pulled from several previous finals. Your actual final will include fewer questions and may include different topics since we did not cover the same

More information

Deep Learning for Visual Computing Prof. Debdoot Sheet Department of Electrical Engineering Indian Institute of Technology, Kharagpur

Deep Learning for Visual Computing Prof. Debdoot Sheet Department of Electrical Engineering Indian Institute of Technology, Kharagpur Deep Learning for Visual Computing Prof. Debdoot Sheet Department of Electrical Engineering Indian Institute of Technology, Kharagpur Lecture - 05 Classification with Perceptron Model So, welcome to today

More information

Robert B. Zajonc Experimental Laboratories

Robert B. Zajonc Experimental Laboratories Robert B. Zajonc Experimental Laboratories The Research Center for Group Dynamics Running experiments in the RCGD Labs What you need to know... http://www.rcgd.isr.umich.edu/rcgdlabs.htm Running in the

More information

COURSE OUTLINE. School of Engineering Technology and Applied Science PRE-REQUISITES/CO-REQUISITES: CNET201

COURSE OUTLINE. School of Engineering Technology and Applied Science PRE-REQUISITES/CO-REQUISITES: CNET201 COURSE OUTLINE SCHOOL: School of Engineering Technology and Applied Science DEPARTMENT: Information and Communication Engineering Technology (ICET) PROGRAM: Computer Systems and Networks Technician/ Technology

More information

Employee Expense Submission Guide

Employee Expense Submission Guide Employee Expense Submission Guide Expense Submission 2.4 Published October 2018 v2.4 Table of Contents First-Time Registration... 3 Submitting Expenses... 4 Employee Expenses... 6 Entering Your Expense

More information

Classification Key Concepts

Classification Key Concepts http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Classification Key Concepts Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech Parishit

More information

INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad

INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad-500043 MECHANICAL ENGINEERING ASSIGNMENT QUESTIONS Title Code Class Structure Coordinator Team of Instructors CAD/CAM A70328-R15 IV

More information

BACHELOR OF COMPUTER APPLICATIONS (BCA)

BACHELOR OF COMPUTER APPLICATIONS (BCA) BACHELOR OF COMPUTER APPLICATIONS (BCA) (Three year full time (six semesters) Programme) Poojya Dr. Sharanbasawappa Appa proclaims to provide value based Hi-Tech education. Today Japan, USA, Germany, France,

More information

TIPA Lead Assessor for ITIL

TIPA Lead Assessor for ITIL TIPA Lead Assessor for ITIL Course Syllabus Fifalde Consulting Inc. +1-613-699-3005 ITIL is a Registered Trade Mark of the Office of Government Commerce in the United Kingdom and other countries 2017 Fifalde

More information

COMPARISON OF DIFFERENT CLASSIFICATION TECHNIQUES

COMPARISON OF DIFFERENT CLASSIFICATION TECHNIQUES COMPARISON OF DIFFERENT CLASSIFICATION TECHNIQUES USING DIFFERENT DATASETS V. Vaithiyanathan 1, K. Rajeswari 2, Kapil Tajane 3, Rahul Pitale 3 1 Associate Dean Research, CTS Chair Professor, SASTRA University,

More information