CSC 458 Data Mining and Predictive Analytics I, Fall 2017 (November 22, 2017)
Dr. Dale E. Parson, Assignment 4: Comparing Weka Bayesian, clustering, ZeroR, OneR, and J48 models to predict nominal dissolved oxygen levels, in an extension of Assignments 2 and 3.

Due by 11:59 PM on Friday, December 8 via make turnitin. I will not accept late solutions after the end of Sunday, December 10, because I need to post my solution to help with your exam preparation; assignments coming in after December 10 earn 0%.

If you are not accustomed to using the Linux acad system, see me during office hours or an in-class lab session, or consult a graduate assistant in Old Main 257. I will not accept student work via D2L for this assignment. You can do all of your work on your own machine or on the campus PCs, obtaining the starting files via S:\ComputerScience\Parson\Weka on November 27. You can also log into acad and perform the following steps to retrieve the same files. You can use the FileZilla client utility or a similar file transfer program to copy files from acad and to place your solution files back onto acad. Assignment 3's handout shows how to install and use FileZilla with acad.

There will be at least one in-class work session for this assignment, and unless you are registered for the 100% on-line sections, I expect you to attend with questions, either in the room or at class time via Ultra. 100% on-line students are encouraged to attend in Old Main 158 or nearby labs at class time if schedules permit.

Perform the following steps to set up for this project. Start out in your login directory on csit (a.k.a. acad).

cd $HOME
mkdir DataMine        # This should already be there from assignment 2.
cp ~parson/datamine/bayes458fall2017.problem.zip DataMine/bayes458fall2017.problem.zip
cd ./DataMine
unzip bayes458fall2017.problem.zip
cd ./bayes458fall2017

This is the directory from which you must run make turnitin by the project deadline to avoid a 10% per day late penalty.
If you run out of file space in your account, you can perform the following steps from within your DataMine/ directory. Be extremely careful, and do NOT use any file name wildcards. This will discard your results from previous assignments. If you wish to keep those, do not remove directories prepdata1, ruletree458fall2017, or linear458fall2017.

rm -rf prepdata1.problem.zip prepdata1.solution.zip prepdata1
rm -rf ruletree458fall2017.problem.zip ruletree458fall2017.solution.zip ruletree458fall2017
rm -rf linear458fall2017.problem.zip linear458fall2017.solution.zip linear458fall2017

You will see the following files in this bayes458fall2017 directory:

readme.txt                              Your answers to Q1 through Q10 below go here, in the required format.
csc458fall2017assn4trainingset49k.arff  The ARFF file created by assignment 3.
makefile                                Files needed to make turnitin to get your solution to me.
checkfiles.sh
makelib
How can you avoid running out of memory in Weka?

1. Run Weka using a command line or batch script that sets memory size. I run it this way on my Mac:

java -server -Xmx4000M -jar /Applications/weka-3-8-0/weka.jar

That requires having the Java runtime environment (not necessarily the Java compiler) installed on your machine (true of campus PCs), and locating the path to the weka.jar Java archive that contains the Weka class libraries and other resources. The -Xmx4000M flag allocates 4,000 megabytes (about 4 gigabytes) of heap storage for Weka. As for assignment 2, I have created batch file S:\ComputerScience\WEKA\WekaWith2GBcampus.bat for campus PCs, with handout data files in S:\ComputerScience\Parson\Weka\. I plan to create a 4-gigabyte script S:\ComputerScience\WEKA\WekaWith4GBcampus.bat after I return to campus on November 8. Try using that. It will contain this command line:

java -Xmx4096M -jar "S:\ComputerScience\WEKA\weka.jar"

2. Right-click results buffers in the Weka -> Classify window, or use Alt-click on Mac (control-click on PC), to Delete result buffer after you are done with one. They take up space. You can also save these results to text files via this menu.

3. Some of these models take a long time to execute. I have noted that condition in these instructions. In such cases, it may save time just to exit Weka and restart it via the command line or a batch file with a large memory limit, rather than just deleting result buffers.

PART I: Preparing your ARFF file. (30% of project grade.) Answer questions at steps 4 & 5.

1. Open csc458fall2017assn4trainingset49k.arff in Weka's Preprocess tab.

2. Remove TimeOfYear because it is redundant with MinuteOfYear and MinuteFromNewYear. We are leaving month in the attribute set for now. (Note: Some machine learning algorithms such as J48 and other decision trees may perform better using partially redundant attributes.
A low-resolution attribute such as TimeOfYear may contribute to a more general tree that is less prone to over-fitting than a high-resolution attribute such as MinuteFromNewYear; also, a redundant attribute may help to fine-tune a complex tree. However, the NaiveBayes statistical technique assumes statistical independence of non-class attributes, and may be more accurate after removing redundant attributes.) We are keeping MinuteFromNewYear because we can always coarsen its resolution later via discretization. Once an attribute such as MinuteFromNewYear is in
low-resolution form such as the 4-valued TimeOfYear, it is impossible to get the high resolution of MinuteFromNewYear back.

3. Remove TimeOfDay because it is redundant with MinuteOfDay and MinuteFromMidnite. Reasoning is similar to that in step 2.

4. Remove MinuteOfYear because it is redundant with MinuteFromNewYear, and it correlates nonlinearly with a remaining numeric attribute that is not derived from the datetime of the water sample, while MinuteFromNewYear correlates linearly with that same attribute that is not derived from datetime. You can use Weka's Visualize tab to decide which numeric attribute that is not derived from datetime correlates linearly with MinuteFromNewYear (but not linearly with MinuteOfYear), or you can use your knowledge gained from assignments 2 and 3. What is this numeric attribute that is not derived from datetime? (5 of the 30% for this question)

5. Remove MinuteOfDay because it is redundant with MinuteFromMidnite. We are keeping MinuteFromMidnite because it correlates positively with an underlying mechanism for increasing dissolved oxygen found in the assignment 2 readings. What is this underlying mechanism? (5 of the 30% for this question)

6. Create a new derived attribute HourFromMidnite by using the Weka unsupervised -> attribute filter AddExpression that divides MinuteFromMidnite by the number of minutes in an hour. Look at the statistics and graph in the right side of the Weka Preprocess tab to ensure that these attributes have the same distribution. After verifying that HourFromMidnite is an accurate representation of MinuteFromMidnite in terms of hours, remove MinuteFromMidnite. We are doing this because HourFromMidnite is easier to think about. There are only 12 possible hours from the closest midnight (before or after the sample datetime), in contrast to 720 minutes. HourFromMidnite preserves the fine-grain resolution of MinuteFromMidnite in its fractional part.

7.
Create a new derived attribute DayFromNewYear by using the Weka unsupervised -> attribute filter AddExpression that divides MinuteFromNewYear by the number of minutes in a day. Look at the statistics and graph in the right side of the Weka Preprocess tab to ensure that these attributes have the same distribution. After verifying that DayFromNewYear is an accurate representation of MinuteFromNewYear, remove MinuteFromNewYear. We are doing this because DayFromNewYear is easier to think about. There are only 183 possible days from midnight on the closest January 1 (before or after the sample datetime), in contrast to 263,520 minutes. DayFromNewYear preserves the fine-grain resolution of MinuteFromNewYear in its fractional part.

8. Discretize OxygenMgPerLiter into 10 discrete bins as in assignment 2. Bayesian analysis requires a nominal target attribute (a.k.a. class). Keep useEqualFrequency as False. Do NOT discretize any other numeric attributes at this time.

9. Reorder the attributes to put OxygenMgPerLiter in the last (target) position, without disturbing the relative order of the other attributes. At the end of this step you MUST have these attributes in this order.
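The derived-attribute and binning steps above can be sketched in plain Python to check your intuition; in Weka they run as the AddExpression and Discretize GUI filters, and the sample values below are invented for illustration only.

```python
# Plain-Python sketch of the AddExpression and Discretize steps.
# Sample values are made up; Weka performs these as Preprocess filters.

def minutes_to_hours(minute_from_midnite):
    """Step 6 AddExpression equivalent: divide by minutes per hour."""
    return minute_from_midnite / 60.0

def minutes_to_days(minute_from_new_year):
    """Step 7 AddExpression equivalent: divide by minutes per day."""
    return minute_from_new_year / (24.0 * 60.0)

def equal_width_bins(values, num_bins=10):
    """Step 8: equal-width discretization (useEqualFrequency = False)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    # Map each value to a bin index 0..num_bins-1; the maximum value
    # would otherwise land in bin num_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]

minute_from_midnite = [0, 90, 360, 719]    # 0..720 by construction
print([minutes_to_hours(m) for m in minute_from_midnite])

oxygen = [4.5, 6.6, 8.0, 13.2, 15.2]       # invented mg/L readings
print(equal_width_bins(oxygen))
```

Note how the fractional part of the quotient is what preserves minute-level resolution inside HourFromMidnite and DayFromNewYear.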
10. Randomize the order of instances using your unique seed value as in Assignments 2 & 3. Save this as ARFF file csc458fall2017assn4nominaltrainingset49k.arff. It is the name of the input ARFF file with the word nominal inserted. You must put this into your bayes458fall2017/ project directory before you run make turnitin.

Work with csc458fall2017assn4nominaltrainingset49k.arff throughout the remainder of this assignment. We are using 10-fold cross validation with these 49K instances as the training & test dataset in this assignment. Each of Q1 through Q10 is worth 7% of the total project grade.

Q1: On this initial set of attributes in this 49K set of measurements, run the following classifiers in the order shown below, and record only these results in your answer. See footnote 1 for the Kappa statistic.

ZeroR:
Relative absolute error %
Root relative squared error %

Footnote 1, from https://stats.stackexchange.com/questions/82162/cohens-kappa-in-plain-english: The kappa statistic (or kappa value) is a metric that compares an Observed Accuracy with an Expected Accuracy (random chance). The kappa statistic is used not only to evaluate a single classifier, but also to evaluate classifiers amongst themselves. In addition, it takes into account random chance (agreement with a random classifier), which generally means it is less misleading than simply using accuracy as a metric (an Observed Accuracy of 80% is a lot less impressive with an Expected Accuracy of 75% versus an Expected Accuracy of 50%).

Kappa = (observed accuracy - expected accuracy) / (1 - expected accuracy)

Not only can this kappa statistic shed light on how the classifier itself performed, the kappa statistic for one model is directly comparable to the kappa statistic for any other model used for the same classification task.

Parson's example: If you had a 6-sided die that had the value 1 on 5 sides, and 0 on the other, the random-chance expected accuracy of rolling a 1 would be 5/6 = 83.3%.
Since the ZeroR classifier simply picks the most statistically likely class without respect to the other (non-target) attributes, it would pick an expected die value of 1 in this case, giving a random observed accuracy of 83.3%, and a Kappa of (.833 - .833) / (1 - .833) = 0.

Also from this linked site: Landis and Koch consider 0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1 as almost perfect. Fleiss considers kappas > 0.75 as excellent, 0.40-0.75 as fair to good, and < 0.40 as poor. It is important to note that both scales are somewhat arbitrary. At least two further considerations should be taken into account when interpreting the kappa statistic. First, the kappa statistic should always be compared with an accompanying confusion matrix if possible to obtain the most accurate interpretation. Second, acceptable kappa statistic values vary with the context. For instance, in many inter-rater reliability studies with easily observable behaviors, kappa statistic values below 0.70 might be considered low. However, in studies using machine learning to explore unobservable phenomena like cognitive states such as day dreaming, kappa statistic values above 0.40 might be considered exceptional.
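The kappa formula and the loaded-die example in this footnote can be checked with a few lines of Python:

```python
def kappa(observed_accuracy, expected_accuracy):
    """Cohen's kappa: chance-corrected agreement, as defined above."""
    return (observed_accuracy - expected_accuracy) / (1.0 - expected_accuracy)

# Loaded-die example: ZeroR always guesses the majority value 1, so its
# observed accuracy equals the chance accuracy of 5/6, and kappa is 0.
print(kappa(5/6, 5/6))     # 0.0

# An 80%-accurate classifier is far more impressive against 50% chance
# than against 75% chance, even though the raw accuracy is identical.
print(kappa(0.80, 0.50))   # about 0.6
print(kappa(0.80, 0.75))   # about 0.2
```

This is why comparing the kappa values of two models on the same task is more informative than comparing their raw accuracies.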
OneR:
Relative absolute error %
Root relative squared error %

J48:
Relative absolute error %
Root relative squared error %

NaiveBayes:
Relative absolute error %
Root relative squared error %

BayesNet:
Relative absolute error %
Root relative squared error %

Examine the conditional probability table in the output of NaiveBayes and the graph of BayesNet. You can see the latter, partially illustrated on the next page, by Alt-clicking BayesNet in the Classify tab's result list and selecting Visualize graph. Clicking a node in the graph shows its conditional probabilities. BayesNet is sometimes more accurate than NaiveBayes because NaiveBayes assumes statistical independence of the non-class attributes, while BayesNet does not. BayesNet attempts to model statistical interdependence among these attributes.

In the BayesNet illustration below, clicking OxygenMgPerLiter reveals the probability distribution of its 10 discretized bins. Clicking other nodes that are successors (downstream) in the directed acyclic graph reveals more complicated tables. In the illustrated table for TempCelsius below, BayesNet auto-discretizes TempCelsius, and then gives conditional probabilities for OxygenMgPerLiter's bins, given discrete bins for TempCelsius. Note how the probability for the low-level (4.44-6.61] bin of OxygenMgPerLiter changes going left-to-right from lower-to-higher TempCelsius, and the probability for the high-level (13.12-15.29] bin of OxygenMgPerLiter also changes with increases in TempCelsius.

BayesNet takes all of the probabilities in all graph nodes for a given bin of OxygenMgPerLiter, multiplies them together, normalizes the result into the range 0%-100%, and uses this number to predict the probability of that bin of the class (target attribute), given all other attribute value bins. While the graph below auto-generates from OxygenMgPerLiter as the class, it is possible to use expertise to hand-design a graph.
Again, the main benefit of BayesNet over NaiveBayes in some cases is BayesNet's non-assumption of conditional independence among the non-class attributes.
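The multiply-and-normalize step described above can be sketched in Python in its simpler NaiveBayes form, with one conditional probability per attribute node; every number below is invented purely for illustration.

```python
def posterior(prior, conditionals):
    """Multiply each class bin's prior by the conditional probabilities
    of the observed attribute bins, then normalize so they sum to 1."""
    scores = {}
    for cls, p in prior.items():
        for cond in conditionals[cls]:
            p *= cond
        scores[cls] = p
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

# Two hypothetical OxygenMgPerLiter bins, given the observed bins of two
# other attributes (say TempCelsius and month); probabilities are made up.
prior = {"lowO2": 0.5, "highO2": 0.5}
conditionals = {
    "lowO2":  [0.7, 0.6],   # P(warm bin | lowO2),  P(summer bin | lowO2)
    "highO2": [0.2, 0.3],   # P(warm bin | highO2), P(summer bin | highO2)
}
print(posterior(prior, conditionals))   # lowO2 dominates for this sample
```

The normalization is what turns the raw products into the 0%-100% probabilities Weka reports for each class bin.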
Q2: From NaiveBayes, copy & paste the mean row for HourFromMidnite as it correlates with OxygenMgPerLiter in the 10 columns.

Attribute        '(range]' '(range]' '(range]' '(range]' '(range]' '(range]' '(range]' '(range]' '(range]' '(range)'
HourFromMidnite
  mean

What change-in-value pattern does class attribute OxygenMgPerLiter exhibit as it goes left-to-right across increasing distances in HourFromMidnite, particularly for late morning through afternoon? From the analyses of assignments 2 and 3, what is the underlying physical or chemical cause of this pattern?

Q3: From the BayesNet graph node for month, what probability-of-occurrence pattern does the low-level (4.44-6.61] bin of OxygenMgPerLiter exhibit as it goes left-to-right across increasing months from 1 (January) through 12 (December)? From the analyses of assignments 2 and 3, what is the underlying physical or chemical cause of this pattern?

Alt-click each result except NaiveBayes in the Classify tab's result list and Delete result buffer to recover some storage. Note the value of Correctly Classified Instances for NaiveBayes with this full attribute set. Then, for each of the non-class attributes, starting at ph and working your way, one at a time, down through DayFromNewYear, perform the following steps in a loop:

A. Remove the next non-class attribute and run NaiveBayes.
B. If Correctly Classified Instances increases or stays the same after this removal, leave that attribute removed; otherwise (Correctly Classified Instances has decreased from its maximum NaiveBayes value so far), execute Undo to restore the attribute.
C. Note which attributes you have removed without a subsequent Undo to restore them.
D. You can use Delete result buffer to recover some storage. I kept only the NaiveBayes result with the greatest Correctly Classified Instances so far to help me keep track of this maximum.
E. Repeat steps A-D, one attribute at a time, until you have removed, tested, and conditionally restored each non-class attribute, one at a time, through DayFromNewYear, which is the last non-class attribute.

Q4: After completing the above steps, which attribute or attributes did you permanently remove?

Q5: Which of the permanently removed attribute(s) of Q4, if any, correlate with a remaining attribute, based on the analyses of assignments 2 and 3? With which of the remaining non-class attributes do these removed attribute(s) correlate? Other removed attributes simply do not correlate well with OxygenMgPerLiter, so their removal decreases error in NaiveBayes. The removed attributes of Q5, on the other hand, violate the statistical independence assumption of NaiveBayes, and so their removal reduces error introduced by violating this assumption.

Q6: Repeat step Q1 with this reduced attribute set and record the same results here for those same exact classifiers ZeroR, OneR, J48, NaiveBayes, and BayesNet.

Q7: In going from the full attribute set of Q1 to the reduced attribute set of Q6, which classifier(s) improved accuracy in terms of Correctly Classified Instances? Why did it or they improve?

Q8: In going from the full attribute set of Q1 to the reduced attribute set of Q6, which classifier(s) show decreased accuracy in terms of Correctly Classified Instances? Why did it or they get worse?

Q9: In going from the full attribute set of Q1 to the reduced attribute set of Q6, which classifier(s) show no change in accuracy in terms of Correctly Classified Instances? Why did it or they show no change?

Q10: Run SimpleKMeans clustering with 3 clusters and complete the table below by using Cut and Paste from the Weka results. Make a pairwise comparison between the Full Data centroids and Clusters 0, 1, and 2, i.e., pair Full Data with each of the others in turn and compare changes from the overall centroids of Full Data.
Describe any correlations you see in changes for TempCelsius and OxygenMgPerLiter in going from Full Data to the respective Cluster 0, 1, and 2. Do any of the other non-class attributes ph, Conductance, or HourFromMidnite show a similarly clear correlation with OxygenMgPerLiter?

Final cluster centroids:
                                  Cluster#
Attribute           Full Data            0            1            2
                    (49189.0)          (N)          (N)          (N)
==================================================================================
ph
TempCelsius
Conductance
HourFromMidnite
OxygenMgPerLiter
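The pairwise comparison Q10 asks for amounts to subtracting the Full Data centroid from each cluster centroid, attribute by attribute; a small Python sketch with invented centroid numbers (the real values come from your SimpleKMeans output) makes the idea concrete.

```python
def centroid_shifts(full_data, cluster):
    """Signed change in each attribute's centroid relative to Full Data."""
    return {attr: cluster[attr] - full_data[attr] for attr in full_data}

# Invented centroids for two of the attributes; substitute the values
# that SimpleKMeans prints for Full Data and for Cluster 0, 1, and 2.
full_data = {"TempCelsius": 14.0, "OxygenMgPerLiter": 9.5}
cluster0  = {"TempCelsius": 24.0, "OxygenMgPerLiter": 6.8}

print(centroid_shifts(full_data, cluster0))
# A warmer-than-average cluster with below-average oxygen would illustrate
# the inverse temperature/oxygen relationship from assignment 2.
```

Repeating this subtraction for each cluster, and for ph, Conductance, and HourFromMidnite as well, gives the change patterns the question asks you to describe.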
Added NOTE 11/26/2017: In some cases a BayesNet will create a graph node for an attribute whose conditional probability table is just a constant multiplier of 1. In that case you should remove the attribute from the set of attributes, since a constant multiplier of 1 contributes nothing to the conditional probability calculation for the attribute being estimated.