2012 Fall, CENG 514 Data Mining, Homework 3 Key by Dilek Önal


SOLUTIONS

Task 1 (Data conversion 15 points, Weka commands 10 points = 25 points)

You should have implemented a piece of code that converts the provided sequence database to arff format so that it can be processed by the Apriori algorithm of Weka.

Given Data Format: (not reproduced here)

Arff Format:
  relation: seq
  attributes: pg1, pg2, pg3, pg4, pg5, each with values {TRUE,FALSE}
  data:
    FALSE,TRUE,TRUE,TRUE,TRUE
    TRUE,TRUE,FALSE,FALSE,TRUE
    FALSE,FALSE,TRUE,TRUE,TRUE

Table 1 Arff Conversion Example For Task 1

Each sequence should be converted to the set of pages occurring in it, without considering their order. The order is irrelevant for this task because association rules are to be found. The set can be represented in arff with one attribute per item (page), taking the values TRUE and FALSE. Note that there are 10 different item values (pages) in the given sequence databases.

After you convert the data to arff format, you can load the file with Open File in the Preprocess tab of the Explorer window of Weka. As the second step, choose Apriori among the options in the Associate tab. Set the minimum support and confidence in the parameters window of the Apriori algorithm.
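The conversion described for Task 1 can be sketched roughly as follows. This is only an illustration, not the expected solution: the function name `sequences_to_arff` is hypothetical, the input is assumed to be already parsed into lists of page names (the layout of the provided sequence file is not reproduced here), and the three sample sequences are those from the Task 2 example rather than the homework data.

```python
def sequences_to_arff(sequences, pages, relation="seq"):
    """Build ARFF text with one TRUE/FALSE attribute per page (hypothetical helper)."""
    lines = [f"@relation {relation}", ""]
    for page in pages:
        lines.append(f"@attribute {page} {{TRUE,FALSE}}")
    lines += ["", "@data"]
    for seq in sequences:
        present = set(seq)  # order and repetition are irrelevant for association rules
        lines.append(",".join("TRUE" if p in present else "FALSE" for p in pages))
    return "\n".join(lines) + "\n"


# Assumed, already-parsed input: one list of visited pages per sequence.
pages = ["pg1", "pg2", "pg3", "pg4", "pg5"]
sequences = [
    ["pg3", "pg2", "pg5", "pg4"],
    ["pg1", "pg5"],
    ["pg5", "pg4", "pg4"],
]
arff_text = sequences_to_arff(sequences, pages)
print(arff_text)
```

For the real data the `pages` list would hold all 10 page names, so each data row would have 10 values.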

Figure 1 Apriori Parameters Window

When you run Apriori, the algorithm produces 728 rules. Only the rules in which every value is TRUE represent association rules between pages.

Of the 728 rules, 183 contain only TRUE values.

Task 2: (Data conversion 10 points, Weka commands 10 points = 15 points)

You should have implemented a piece of code that converts the provided sequence database to the arff format of Weka. Sample sequential data in order to figure out the Weka format can be downloaded from:

Below is an example of how your arff files should look:

Given Data Format: (not reproduced here)

Arff:
  attributes: sid, page
  data:
    1,pg3
    1,pg2
    1,pg5
    1,pg4
    2,pg1
    2,pg5
    3,pg5
    3,pg4
    3,pg4

After obtaining the arff file:
1. Load the arff file with the Open File command from the Explorer window.
2. In the Associate tab, click the Choose button and select GeneralizedSequentialPatterns.
3. Click on GeneralizedSequentialPatterns next to the Choose button and enter the desired minimum support value.
4. Run the algorithm by clicking Start.
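A minimal sketch of the Task 2 conversion is given below. It writes one `sid,page` row per page visit, in visit order, which reproduces the data rows of the example above. The helper name `sequences_to_gsp_arff` is hypothetical, the input is assumed to be already parsed into lists, and declaring both `sid` and `page` as nominal attributes is an assumption about what GeneralizedSequentialPatterns expects, so check it against the sample file.

```python
def sequences_to_gsp_arff(sequences, relation="seq"):
    """Build ARFF text with one (sid, page) row per page visit (hypothetical helper)."""
    sids = [str(i) for i in range(1, len(sequences) + 1)]
    pages = sorted({page for seq in sequences for page in seq})
    lines = [
        f"@relation {relation}",
        "",
        # Assumption: both attributes are declared nominal.
        "@attribute sid {" + ",".join(sids) + "}",
        "@attribute page {" + ",".join(pages) + "}",
        "",
        "@data",
    ]
    for sid, seq in zip(sids, sequences):
        for page in seq:  # keep the visit order; repeated pages stay repeated
            lines.append(f"{sid},{page}")
    return "\n".join(lines) + "\n"


# Assumed, already-parsed input matching the example rows above.
sequences = [
    ["pg3", "pg2", "pg5", "pg4"],
    ["pg1", "pg5"],
    ["pg5", "pg4", "pg4"],
]
gsp_arff = sequences_to_gsp_arff(sequences)
data_rows = gsp_arff.strip().split("@data\n")[1].splitlines()
print(gsp_arff)
```

Unlike the Task 1 conversion, order and repetition are preserved here, because sequential patterns depend on them.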

- 1-sequences
  [1] <{pg1}> (67)
  [2] <{pg4}> (94)
  [3] <{pg5}> (90)
- 2-sequences
  [1] <{pg1}{pg4}> (52)
  [2] <{pg4}{pg4}> (84)
  [3] <{pg4}{pg5}> (72)
  [4] <{pg5}{pg4}> (75)
  [5] <{pg5}{pg5}> (66)
- 3-sequences
  [1] <{pg4}{pg4}{pg4}> (53)
  [2] <{pg4}{pg4}{pg5}> (53)
  [3] <{pg4}{pg5}{pg4}> (52)
  [4] <{pg5}{pg4}{pg4}> (51)

Table 2 Frequent Sequences With Min Support = [value missing]

- 1-sequences
  [1] <{pg4}> (94)
  [2] <{pg5}> (90)
- 2-sequences
  [1] <{pg4}{pg4}> (84)

Table 3 Frequent Sequences With Min Support = 0.8

Support | Number Of Patterns | Maximum Pattern Length | Average Pattern Length | Standard Deviation of Pattern Length
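To make the numbers in these tables concrete, the sketch below shows one way to count the support of a sequential pattern and to compute the pattern-length statistics named in the summary header. It is a simplified illustration, not Weka's implementation: each pattern element is a single page, the helper names are hypothetical, the three input sequences are the Task 2 example rows rather than the homework data, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def contains_subsequence(sequence, pattern):
    """True if the pattern's pages occur in the sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    # `page in remaining` consumes the iterator up to the first match,
    # so later pattern items are only searched for after earlier ones.
    return all(page in remaining for page in pattern)


def support(sequences, pattern):
    """Fraction of sequences that contain the pattern."""
    return sum(contains_subsequence(s, pattern) for s in sequences) / len(sequences)


# Assumed toy input (the Task 2 example rows).
sequences = [
    ["pg3", "pg2", "pg5", "pg4"],
    ["pg1", "pg5"],
    ["pg5", "pg4", "pg4"],
]
sup_pg5 = support(sequences, ["pg5"])
sup_pg5_pg4 = support(sequences, ["pg5", "pg4"])
sup_pg4_pg4 = support(sequences, ["pg4", "pg4"])

# Pattern lengths read off Table 2: three 1-sequences, five 2-sequences, four 3-sequences.
table2_lengths = [1] * 3 + [2] * 5 + [3] * 4
summary = {
    "number_of_patterns": len(table2_lengths),
    "max_length": max(table2_lengths),
    "average_length": mean(table2_lengths),
    "std_length": pstdev(table2_lengths),  # population std; an assumption
}
print(sup_pg5, sup_pg5_pg4, sup_pg4_pg4, summary)
```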

Task 3 (25 points)

You may have chosen any one of the four data set files:

adult+stretch.data
adult-stretch.data
yellow-small+adult-stretch.data
yellow-small.data

It is sufficient to add the header given below to the top of the chosen file.

color {YELLOW, PURPLE}
size {LARGE, SMALL}
act {STRETCH, DIP}
age {ADULT, CHILD}
CLASS_LABEL

Table 4 Header for converting balloon data files to arff format

color = YELLOW
|  size = LARGE
|  |  act = STRETCH
|  |  |  age = ADULT: T
|  |  |  age = CHILD: F
|  |  act = DIP: F
|  size = SMALL: T
color = PURPLE
|  act = STRETCH
|  |  age = ADULT: T
|  |  age = CHILD: F
|  act = DIP: F

Table 5 ID3 Decision Tree For yellow-small+adult-stretch.data

act = STRETCH
|  age = ADULT: T
|  age = CHILD: F
act = DIP: F

Table 6 ID3 Decision Tree for adult+stretch.data

act = STRETCH: T
act = DIP
|  age = ADULT: T
|  age = CHILD: F

Table 7 ID3 Decision Tree For adult-stretch.data

color = YELLOW
|  size = LARGE: F
|  size = SMALL: T
color = PURPLE: F

Table 8 ID3 Decision Tree for yellow-small.data
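ID3 chooses the attribute to split on by information gain. The sketch below illustrates that calculation on hypothetical rows: every combination of the four attribute values is enumerated once and labelled T exactly when act = STRETCH and age = ADULT, mirroring the tree in Table 6. These rows are an assumption made for illustration and are not the contents of adult+stretch.data.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr, label="class"):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    base = entropy([row[label] for row in rows])
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [row[label] for row in rows if row[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical rows: all 16 value combinations, labelled by the Table 6 rule.
values = {
    "color": ["YELLOW", "PURPLE"],
    "size": ["LARGE", "SMALL"],
    "act": ["STRETCH", "DIP"],
    "age": ["ADULT", "CHILD"],
}
rows = []
for combo in product(*values.values()):
    row = dict(zip(values, combo))
    row["class"] = "T" if row["act"] == "STRETCH" and row["age"] == "ADULT" else "F"
    rows.append(row)

gains = {attr: information_gain(rows, attr) for attr in values}
print(gains)
```

On these rows color and size carry no information, while act and age tie for the highest gain, which is in line with a tree that tests only act and age.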

You can see that the decision trees are consistent with the information given on the data set page.

Task 4 (25 points = Arff Conversion + clustering (20) + Comments on clusters and classes (5))

For arff conversion:
1. Prepend the header given below to the top of the file.
2. Replace the tab characters with commas in the data.
3. Remove the original class identifier from each row, so that a row contains only the seven attribute values, e.g. 15.26,14.84,0.871,5.763,3.312,2.221,5.22

area
perimeter
compactness
klength
kwidth
asymcof
groovelen

Table 9 Header for Arff Conversion

The k-means clustering algorithm can be run by choosing SimpleKMeans in the Cluster tab. Don't forget to set k=3 in the parameters window for SimpleKMeans. Weka returns the following clusters when run with k=3 and Euclidean distance as the distance metric.

kmeans
======

Number of iterations: 5
Within cluster sum of squared errors:
Missing values globally replaced with mean/mode

Cluster centroids:
                          Cluster#
Attribute     Full Data          0          1          2
                  (210)       (64)       (77)       (69)
=========================================================
area
perimeter
compactness
klength
kwidth
asymcof
groovelen

Time taken to build model (full training data) : 0.01 seconds

=== Model and evaluation on training set ===

Clustered Instances

0       64 ( 30%)
1       77 ( 37%)
2       69 ( 33%)

Table 10 Kmeans output on Seeds data set by Weka

In our data set, there are 3 classes with 70 samples each. The output clusters, at least in terms of size, do not match the classes perfectly but come close. You can visualize the clusters by right-clicking the result set in the "Result list" panel and choosing Visualize cluster assignments from the menu that appears.
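The three arff conversion steps for Task 4 can be sketched as below. This is a sketch under stated assumptions: the helper name `seeds_to_arff` is hypothetical, the class identifier is assumed to be the last field of each row, the attributes are assumed to be declared `numeric`, and the sample input line (including its trailing class value 1) is illustrative.

```python
ATTRIBUTES = ["area", "perimeter", "compactness", "klength",
              "kwidth", "asymcof", "groovelen"]


def seeds_to_arff(raw_text, relation="seeds"):
    """Convert tab-separated rows to ARFF text, dropping the class column (hypothetical helper)."""
    lines = [f"@relation {relation}", ""]
    # Step 1: header. Declaring the attributes as numeric is an assumption.
    lines += [f"@attribute {name} numeric" for name in ATTRIBUTES]
    lines += ["", "@data"]
    for raw in raw_text.splitlines():
        fields = raw.split()  # splits on tabs, including repeated tabs
        if not fields:
            continue  # skip blank lines
        # Steps 2 and 3: join with commas and drop the last field (assumed class id).
        lines.append(",".join(fields[:-1]))
    return "\n".join(lines) + "\n"


# Assumed sample row; the trailing "1" stands for the original class identifier.
sample = "15.26\t14.84\t0.871\t5.763\t3.312\t2.221\t5.22\t1\n"
seeds_arff = seeds_to_arff(sample)
print(seeds_arff)
```

Removing the class column before clustering keeps k-means from using the labels; they can be kept in a separate list for the comparison afterwards.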

When you click the Save button, you can save the results to an arff file. Unlike the original arff file, this file also includes the instance number and the cluster number of each sample. With some simple manipulation, this data set can easily be converted to a form that is more usable for additional analysis or processing. For example, when we convert this arff file to csv and compare it with the original classes of the samples, we can see that the sample pointed to by the arrows is assigned to a different cluster from the other samples in its original class. You can compute precision, sensitivity and recall by considering these original classes and the resulting clusters.
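One possible way to carry out this comparison is sketched below: each cluster is matched to the original class that is most frequent in it, and precision and recall are computed for that pairing (sensitivity is another name for recall). The function name and the toy labels are hypothetical; with the real data, the two lists would come from the original class column and the cluster column of the saved file.

```python
from collections import Counter


def cluster_precision_recall(classes, clusters):
    """Map each cluster to its majority class and report (class, precision, recall)."""
    pair_counts = Counter(zip(clusters, classes))
    cluster_sizes = Counter(clusters)
    class_sizes = Counter(classes)
    results = {}
    for cluster, size in cluster_sizes.items():
        # Majority class: the original class seen most often inside this cluster.
        majority = max(class_sizes, key=lambda c: pair_counts[(cluster, c)])
        hits = pair_counts[(cluster, majority)]
        results[cluster] = (majority, hits / size, hits / class_sizes[majority])
    return results


# Hypothetical toy labels, not the Seeds results.
classes = [1, 1, 1, 2, 2, 2, 3, 3, 3]
clusters = [0, 0, 1, 1, 1, 1, 2, 2, 2]
scores = cluster_precision_recall(classes, clusters)
print(scores)
```

In the toy labels, one sample of class 1 lands in cluster 1, which lowers the recall of class 1 and the precision of cluster 1, the same kind of mismatch the arrows point out.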
