CSC 458 Data Mining and Predictive Analytics I, Fall 2017 (November 22, 2017)
Dr. Dale E. Parson, Assignment 4: Comparing Weka Bayesian, clustering, ZeroR, OneR, and J48 models to predict nominal dissolved oxygen levels in an extension of Assignments 2 and 3.

Due by 11:59 PM on Friday December 8 via make turnitin. I will not accept late solutions after the end of Sunday December 10 because I need to post my solution to help with your exam preparation; assignments coming in after December 10 earn 0%. If you are not accustomed to using the Linux acad system, see me during office hours or an in-class lab session, or consult a graduate assistant in Old Main 257. I will not accept student work via D2L for this assignment.

You can do all of your work on your own machine or on the campus PCs, obtaining the starting files via S:\ComputerScience\Parson\Weka on November 27. You can also log into acad and perform the steps below to retrieve the same files. You can use the FileZilla client utility or a similar file transfer program to copy files from acad and to place your solution files back onto acad. Assignment 3's handout shows how to install and use FileZilla with acad.

There will be at least one in-class work session for this assignment, and unless you are registered for the 100% on-line sections, I expect you to attend with questions, either in the room or at class time via Ultra. 100% on-line students are encouraged to attend in Old Main 158 or nearby labs at class time if schedules permit.

Perform the following steps to set up for this project. Start out in your login directory on csit (a.k.a. acad).

    cd $HOME
    mkdir DataMine      # This should already be there from assignment 2.
    cp ~parson/datamine/bayes458fall2017.problem.zip DataMine/bayes458fall2017.problem.zip
    cd ./DataMine
    unzip bayes458fall2017.problem.zip
    cd ./bayes458fall2017

This is the directory from which you must run make turnitin by the project deadline to avoid a 10% per day late penalty.

If you run out of file space in your account, you can perform the following steps from within your DataMine/ directory. Be extremely careful, and do NOT use any file name wildcards. This will discard your results from previous assignments. If you wish to keep those, do not remove directories prepdata1, ruletree458fall2017, or linear458fall2017.

    rm -rf prepdata1.problem.zip prepdata1.solution.zip prepdata1
    rm -rf ruletree458fall2017.problem.zip ruletree458fall2017.solution.zip ruletree458fall2017
    rm -rf linear458fall2017.problem.zip linear458fall2017.solution.zip linear458fall2017

You will see the following files in this bayes458fall2017 directory:

    readme.txt                                Your answers to Q1 through Q10 below go here,
                                              in the required format.
    csc458fall2017assn4trainingset49k.arff    The ARFF file created by assignment 3.
    makefile, checkfiles.sh, makelib          Files needed to make turnitin to get your
                                              solution to me.
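
If you prefer a command-line transfer to FileZilla, scp works from most systems' terminals. This is a sketch only: the host name below is an assumption (use whatever host name you normally ssh into for acad), and YOURLOGIN stands for your own account name.

    # Pull a file from acad down to the current directory on your own machine.
    scp YOURLOGIN@acad.kutztown.edu:DataMine/bayes458fall2017/readme.txt .

    # Push your edited solution file back into the project directory on acad.
    scp readme.txt YOURLOGIN@acad.kutztown.edu:DataMine/bayes458fall2017/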

How can you avoid running out of memory in Weka?

1. Run Weka using a command line or batch script that sets the memory size. I run it this way on my Mac:

    java -server -Xmx4000M -jar /Applications/weka-3-8-0/weka.jar

That requires having the Java runtime environment (not necessarily the Java compiler) installed on your machine (true of campus PCs), and locating the path to the weka.jar Java archive that contains the Weka class libraries and other resources. The -Xmx4000M flag sets a maximum heap of 4,000 megabytes (about 4 gigabytes) for Weka. As for assignment 2, I have created batch file S:\ComputerScience\WEKA\WekaWith2GBcampus.bat for campus PCs, with handout data files in S:\ComputerScience\Parson\Weka\. I plan to create a 4-gigabyte script S:\ComputerScience\WEKA\WekaWith4GBcampus.bat after I return to campus on November 8. Try using that. It will contain this command line:

    java -Xmx4096M -jar "S:\ComputerScience\WEKA\weka.jar"

2. Right-click result buffers in the Weka -> Classify window, or use Alt-click on Mac (control-click on PC), to Delete result buffer after you are done with one. They take up space. You can also save these results to text files via this menu.

3. Some of these models take a long time to execute. I have noted that condition in these instructions. In such cases, it may save time just to exit Weka and restart it via the command line or a batch file with a large memory limit, rather than just deleting result buffers.

PART I: Preparing your ARFF file. (30% of project grade.) Answer the questions at steps 4 & 5.

1. Open csc458fall2017assn4trainingset49k.arff in Weka's Preprocess tab.

2. Remove TimeOfYear because it is redundant with MinuteOfYear and MinuteFromNewYear. We are leaving month in the attribute set for now. (Note: Some machine learning algorithms such as J48 and other decision trees may perform better using partially redundant attributes. A low-resolution attribute such as TimeOfYear may contribute to a more general tree that is less prone to over-fitting than a high-resolution attribute such as MinuteFromNewYear; also, a redundant attribute may help to fine-tune a complex tree. However, the NaiveBayes statistical technique assumes statistical independence of the non-class attributes, and may be more accurate after removing redundant attributes.) We are keeping MinuteFromNewYear because we can always coarsen its resolution later via discretization. Once an attribute such as MinuteFromNewYear is in low-resolution form such as the 4-valued TimeOfYear, it is impossible to get the high resolution of MinuteFromNewYear back.
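
The removals in steps 2 through 5 can also be scripted with Weka's unsupervised Remove filter if you prefer the command line to the Preprocess tab. This is a hedged sketch: the weka.jar path is the Mac path from the memory discussion above, and the attribute index 2 for TimeOfYear is hypothetical, so check the attribute numbers in your own ARFF header first.

    # Remove attribute number 2 (hypothetically TimeOfYear) and write a new ARFF file.
    java -cp /Applications/weka-3-8-0/weka.jar \
         weka.filters.unsupervised.attribute.Remove -R 2 \
         -i csc458fall2017assn4trainingset49k.arff -o step2.arff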

3. Remove TimeOfDay because it is redundant with MinuteOfDay and MinuteFromMidnite. The reasoning is similar to that in step 2.

4. Remove MinuteOfYear because it is redundant with MinuteFromNewYear, and it correlates nonlinearly with a remaining numeric attribute that is not derived from the datetime of the water sample, while MinuteFromNewYear correlates linearly with that same attribute. You can use Weka's Visualize tab to decide which numeric attribute not derived from datetime correlates linearly with MinuteFromNewYear (but not linearly with MinuteOfYear), or you can use your knowledge gained from assignments 2 and 3. What is this numeric attribute that is not derived from a datetime attribute? (5 of the 30% for this question)

5. Remove MinuteOfDay because it is redundant with MinuteFromMidnite. We are keeping MinuteFromMidnite because it correlates positively with an underlying mechanism for increasing dissolved oxygen found in the assignment 2 readings. What is this underlying mechanism? (5 of the 30% for this question)

6. Create a new derived attribute HourFromMidnite by using the Weka unsupervised -> attribute filter AddExpression to divide MinuteFromMidnite by the number of minutes in an hour. (A command-line sketch of steps 6 through 9 appears after step 9.) Look at the statistics and graph on the right side of the Weka Preprocess tab to ensure that these attributes have the same distribution. After verifying that HourFromMidnite is an accurate representation of MinuteFromMidnite in terms of hours, remove MinuteFromMidnite. We are doing this because HourFromMidnite is easier to think about. There are only 12 possible hours from the closest midnight (before or after the sample datetime), in contrast to 720 minutes. HourFromMidnite preserves the fine-grain resolution of MinuteFromMidnite in its fractional part.

7. Create a new derived attribute DayFromNewYear by using the Weka unsupervised -> attribute filter AddExpression to divide MinuteFromNewYear by the number of minutes in a day. Look at the statistics and graph on the right side of the Weka Preprocess tab to ensure that these attributes have the same distribution. After verifying that DayFromNewYear is an accurate representation of MinuteFromNewYear, remove MinuteFromNewYear. We are doing this because DayFromNewYear is easier to think about. There are only 183 possible days from midnight on the closest January 1 (before or after the sample datetime), in contrast to 263,520 minutes. DayFromNewYear preserves the fine-grain resolution of MinuteFromNewYear in its fractional part.

8. Discretize OxygenMgPerLiter into 10 discrete bins as in assignment 2. Bayesian analysis requires a nominal target attribute (a.k.a. class). Keep useEqualFrequency as False. Do NOT discretize any other numeric attributes at this time.

9. Reorder the attributes to put OxygenMgPerLiter in the last (target) position, without disturbing the relative order of the other attributes. At the end of this step you MUST have these attributes in this order.
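
Here is the promised command-line sketch of steps 6 through 9, using the corresponding unsupervised attribute filters. All attribute indices below (a4 for MinuteFromMidnite, attribute 5 for OxygenMgPerLiter in a hypothetical 7-attribute file) are assumptions for illustration; read the attribute numbers out of your own ARFF header before running anything like this.

    # Step 6: append HourFromMidnite = MinuteFromMidnite / 60
    # (a4 hypothetically names the 4th attribute; AddExpression appends the result last).
    java -cp weka.jar weka.filters.unsupervised.attribute.AddExpression \
         -E "a4/60.0" -N HourFromMidnite -i in.arff -o step6.arff
    # ...then drop MinuteFromMidnite with the Remove filter, as in step 2.

    # Step 8: 10 equal-width bins for the class (useEqualFrequency is off by default).
    java -cp weka.jar weka.filters.unsupervised.attribute.Discretize \
         -B 10 -R 5 -i step7.arff -o step8.arff

    # Step 9: move attribute 5 to the last position, preserving the order of the rest.
    java -cp weka.jar weka.filters.unsupervised.attribute.Reorder \
         -R 1-4,6-last,5 -i step8.arff -o step9.arff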

10. Randomize the order of instances using your unique seed value as in Assignments 2 & 3. Save this as ARFF file csc458fall2017assn4nominaltrainingset49k.arff; it is the name of the input ARFF file with the word Nominal inserted. You must put this file into your bayes458fall2017/ project directory before you run make turnitin. Work with csc458fall2017assn4nominaltrainingset49k.arff throughout the remainder of this assignment. We are using 10-fold cross-validation with these 49K instances as the training & test dataset in this assignment.

Each of Q1 through Q10 is worth 7% of the total project grade.

Q1: On this initial set of attributes in this 49K set of measurements, run the following classifiers in the order shown below, and record only these results in your answer. See the footnote after the list for an explanation of the Kappa statistic.

ZeroR:
    Relative absolute error          %
    Root relative squared error      %

OneR:
    Relative absolute error          %
    Root relative squared error      %

J48:
    Relative absolute error          %
    Root relative squared error      %

NaiveBayes:
    Relative absolute error          %
    Root relative squared error      %

BayesNet:
    Relative absolute error          %
    Root relative squared error      %

Footnote on the Kappa statistic, from https://stats.stackexchange.com/questions/82162/cohens-kappa-in-plain-english: The kappa statistic (or kappa value) is a metric that compares an Observed Accuracy with an Expected Accuracy (random chance). The kappa statistic is used not only to evaluate a single classifier, but also to evaluate classifiers amongst themselves. In addition, it takes into account random chance (agreement with a random classifier), which generally means it is less misleading than simply using accuracy as a metric (an Observed Accuracy of 80% is a lot less impressive with an Expected Accuracy of 75% versus an Expected Accuracy of 50%).

    Kappa = (observed accuracy - expected accuracy) / (1 - expected accuracy)

Not only can this kappa statistic shed light into how the classifier itself performed, the kappa statistic for one model is directly comparable to the kappa statistic for any other model used for the same classification task. Parson's example: If you had a 6-sided die that had the value 1 on 5 sides, and 0 on the other, the random-chance expected accuracy of rolling a 1 would be 5/6 = 83.3%. Since the ZeroR classifier simply picks the most statistically likely class without respect to the other (non-target) attributes, it would pick an expected die value of 1 in this case, giving a random observed accuracy of 83.3%, and a Kappa of (.833 - .833) / (1 - .833) = 0. Also from the linked site: Landis and Koch considers 0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1 as almost perfect. Fleiss considers kappas > 0.75 as excellent, 0.40-0.75 as fair to good, and < 0.40 as poor. It is important to note that both scales are somewhat arbitrary. At least two further considerations should be taken into account when interpreting the kappa statistic. First, the kappa statistic should always be compared with an accompanied confusion matrix if possible to obtain the most accurate interpretation. Second, acceptable kappa statistic values vary on the context. For instance, in many inter-rater reliability studies with easily observable behaviors, kappa statistic values below 0.70 might be considered low. However, in studies using machine learning to explore unobservable phenomena like cognitive states such as day dreaming, kappa statistic values above 0.40 might be considered exceptional.

Examine the conditional probability table in the output of NaiveBayes and the graph of BayesNet. You can see the latter, partially illustrated below, by Alt-clicking BayesNet in the Classify tab's result list and selecting Visualize graph. Clicking a node in the graph shows its conditional probabilities. BayesNet is sometimes more accurate than NaiveBayes because NaiveBayes assumes statistical independence of the non-class attributes, while BayesNet does not. BayesNet attempts to model statistical interdependence among these attributes. In the BayesNet illustration below, clicking OxygenMgPerLiter reveals the probability distribution of its 10 discretized bins. Clicking other nodes that are successors (downstream) in the directed acyclic graph reveals more complicated tables. In the illustrated table for TempCelsius below, BayesNet auto-discretizes TempCelsius, and then gives conditional probabilities for OxygenMgPerLiter's bins, given discrete bins for TempCelsius. Note how the probability for the low-level (4.44-6.61] bin of OxygenMgPerLiter changes going left-to-right from lower-to-higher TempCelsius, and the probability for the high-level (13.12-15.29] bin of OxygenMgPerLiter also changes with increases in TempCelsius. BayesNet takes all of the probabilities in all graph nodes for a given bin of OxygenMgPerLiter, multiplies them together, normalizes the result into the range 0%-100%, and uses this number to predict the probability of that bin of the class (target attribute), given all other attribute value bins. While the graph below auto-generates from OxygenMgPerLiter as the class, it is possible to use expertise to hand-design a graph. Again, the main benefit of BayesNet over NaiveBayes in some cases is BayesNet's non-assumption of conditional independence among the non-class attributes.

[Figure omitted: BayesNet graph with the conditional probability table for TempCelsius.]
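
If the Explorer runs short of memory on these runs (see the notes at the top of this handout), the same five classifiers can be run from a command line. With -t and no separate test file, Weka reports stratified 10-fold cross-validation by default, including Correctly Classified Instances, the Kappa statistic, and the two error figures requested in Q1 and Q6. A sketch, assuming weka.jar sits in the current directory:

    java -cp weka.jar weka.classifiers.rules.ZeroR      -t csc458fall2017assn4nominaltrainingset49k.arff
    java -cp weka.jar weka.classifiers.rules.OneR       -t csc458fall2017assn4nominaltrainingset49k.arff
    java -cp weka.jar weka.classifiers.trees.J48        -t csc458fall2017assn4nominaltrainingset49k.arff
    java -cp weka.jar weka.classifiers.bayes.NaiveBayes -t csc458fall2017assn4nominaltrainingset49k.arff
    java -cp weka.jar weka.classifiers.bayes.BayesNet   -t csc458fall2017assn4nominaltrainingset49k.arff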

Q2: From NaiveBayes, copy & paste the mean row for HourFromMidnite as it correlates with OxygenMgPerLiter in the 10 columns.

    Attribute          '(range]' '(range]' '(range]' '(range]' '(range]' '(range]' '(range]' '(range]' '(range]' '(range)'
    HourFromMidnite
      mean

What change-in-value pattern does class attribute OxygenMgPerLiter exhibit as it goes left-to-right across increasing distances in HourFromMidnite, particularly for late morning through afternoon? From the analyses of assignments 2 and 3, what is the underlying physical or chemical cause of this pattern?

Q3: From the BayesNet graph node for month, what probability-of-occurrence pattern does the low-level (4.44-6.61] bin of OxygenMgPerLiter exhibit as it goes left-to-right across increasing months from 1 (January) through 12 (December)? From the analyses of assignments 2 and 3, what is the underlying physical or chemical cause of this pattern?

Alt-click each result except NaiveBayes in the Classify tab's result list and Delete result buffer to recover some storage. Note the value of Correctly Classified Instances for NaiveBayes with this full attribute set. Then, for each of the non-class attributes, starting at ph and working your way, one at a time, down through DayFromNewYear, perform the following steps in a loop (a scripted sketch of this loop appears after step E):

A. Remove the next non-class attribute and run NaiveBayes.

B. If Correctly Classified Instances increases or stays the same after this removal, leave that attribute removed; otherwise (Correctly Classified Instances has decreased from its maximum NaiveBayes value so far), execute Undo to restore the attribute.

C. Note which attributes you have removed without a subsequent Undo to restore them.

D. You can use Delete result buffer to recover some storage. I kept only the NaiveBayes result with the greatest Correctly Classified Instances so far to help me keep track of this maximum.

E. Repeat steps A-D, one attribute at a time, until you have removed, tested, and conditionally restored each non-class attribute, one at a time, through DayFromNewYear, which is the last non-class attribute.
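
The scripted sketch promised above: a rough bash rendering of the A-E loop using Weka's command-line RemoveByName filter, so that attribute indices do not shift as attributes disappear. Everything in it is an assumption for illustration, including the weka.jar location, the working file names, the list of attribute names (taken from the tables in this handout and possibly incomplete), and the use of tail to grab the cross-validation summary rather than the training-data summary. The GUI procedure above is the authoritative method.

    #!/bin/sh
    # Hypothetical sketch of the greedy A-E loop, not the required procedure.
    DATA=current.arff    # working copy of csc458fall2017assn4nominaltrainingset49k.arff
    cv_accuracy() {      # percent Correctly Classified Instances, cross-validation section
        java -cp weka.jar weka.classifiers.bayes.NaiveBayes -t "$1" \
            | grep 'Correctly Classified' | tail -1 | awk '{print $5}'
    }
    BEST=$(cv_accuracy $DATA)
    for NAME in ph TempCelsius Conductance month HourFromMidnite DayFromNewYear ; do
        # Step A: remove the attribute by name and rerun NaiveBayes.
        java -cp weka.jar weka.filters.unsupervised.attribute.RemoveByName \
             -E "^${NAME}\$" -i $DATA -o trial.arff
        ACC=$(cv_accuracy trial.arff)
        # Step B: keep the removal only if accuracy did not drop; else discard (the "Undo").
        if [ "$(echo "$ACC >= $BEST" | bc)" -eq 1 ] ; then
            mv trial.arff $DATA
            BEST=$ACC
            echo "permanently removed $NAME, accuracy now $ACC"   # Step C bookkeeping
        fi
    done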

Q4: After completing the above steps, which attribute or attributes did you permanently remove?

Q5: Which of the permanently removed attribute(s) of Q4, if any, correlate with a remaining attribute, based on the analyses of assignments 2 and 3? With which of the remaining non-class attributes do these removed attribute(s) correlate? Other removed attributes simply do not correlate well with OxygenMgPerLiter, so their removal decreases error in NaiveBayes. The removed attributes of Q5, on the other hand, violate the statistical independence assumption of NaiveBayes, and so their removal reduces error introduced by violating this assumption.

Q6: Repeat step Q1 with this reduced attribute set and record the same results here for those same exact classifiers ZeroR, OneR, J48, NaiveBayes, and BayesNet.

Q7: In going from the full attribute set of Q1 to the reduced attribute set of Q6, which classifier(s) improved accuracy in terms of Correctly Classified Instances? Why did it or they improve?

Q8: In going from the full attribute set of Q1 to the reduced attribute set of Q6, which classifier(s) show decreased accuracy in terms of Correctly Classified Instances? Why did it or they get worse?

Q9: In going from the full attribute set of Q1 to the reduced attribute set of Q6, which classifier(s) show no change in accuracy in terms of Correctly Classified Instances? Why did it or they show no change?

Q10: Run SimpleKMeans clustering with 3 clusters and complete the table below by using Cut and Paste from the Weka results. Make a pairwise comparison between the Full Data centroids and Clusters 0, 1, and 2, i.e., pair Full Data with each of the others in turn and compare changes from the overall centroids of Full Data. Describe any correlations you see in changes for TempCelsius and OxygenMgPerLiter in going from Full Data to the respective Cluster 0, 1, and 2. Do any of the other non-class attributes ph, Conductance, or HourFromMidnite show a similarly clear correlation with OxygenMgPerLiter?

Final cluster centroids:

                                       Cluster#
    Attribute            Full Data            0            1            2
                         (49189.0)          (N)          (N)          (N)
    ======================================================================
    ph
    TempCelsius
    Conductance
    HourFromMidnite
    OxygenMgPerLiter
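
As with the classifiers, SimpleKMeans has a command-line form if memory becomes a problem; -N sets the number of clusters and -S the random seed. This sketch assumes weka.jar in the current directory and Weka's default seed of 10. Match whatever seed the Explorer shows so your centroids agree with the GUI run.

    # 3-cluster SimpleKMeans over the reduced ARFF; output includes the centroid table.
    java -cp weka.jar weka.clusterers.SimpleKMeans -N 3 -S 10 \
         -t csc458fall2017assn4nominaltrainingset49k.arff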

Added NOTE 11/26/2017: In some cases a BayesNet will create a graph node for an attribute whose conditional probability table is just the constant 1 (the original handout shows a screen shot of such a node). In that case you should remove the attribute from the set of attributes, since a constant multiplier of 1 contributes nothing to the conditional probability calculation for the attribute being estimated.