SRI CHANDRASEKHARENDRA SARASWATHI VISWA MAHAVIDYALAYA (UNIVERSITY ESTABLISHED UNDER SECTION 3 OF UGC ACT 1956) ENATHUR, KANCHIPURAM


SRI CHANDRASEKHARENDRA SARASWATHI VISWA MAHAVIDYALAYA
(UNIVERSITY ESTABLISHED UNDER SECTION 3 OF UGC ACT 1956)
ENATHUR, KANCHIPURAM

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

DATA WAREHOUSING AND DATA MINING LAB
LABORATORY RECORD

Name    :
Reg. No :
Class   : III BE (CSE)
Subject : EBC6AP123 Data Warehousing and Data Mining Lab

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

BONAFIDE CERTIFICATE

This is to certify that this is the bonafide record of work done by Mr/Ms. ________, with Reg. No. ________ of III Year B.E. (Computer Science and Engineering) in the Data Warehousing and Data Mining Laboratory (EBC6AP123) course during the year ________.

Station:
Date:

Staff-in-charge                    Head of the Department

Submitted for the Practical examination held on ________.

Internal Examiner                  External Examiner

INDEX

S.No.  Date  Title  Page No.  Staff Initials
1. Exploring Weka Tool
2. a) Defining Weather Relation Data Set in ARFF format
   b) Defining Student Relation Data Set in CSV format
3. Exploring weather relation using Weka Preprocessor & Cross Validation Techniques
4. Exploring employee relation using Weka Classifier
5. Exploring labor relation using Weka Clustering
6. Exploring student relation using Weka Associator
7. Experimenting Vehicle Relation using Weka Experimenter
8. Setting up a flow to load an arff file (batch mode) and perform a cross validation using J48
9. Design a knowledge flow layout to load attribute selection, normalize the attributes and store the result in a CSV saver

Exp.No:1    Date

EXPLORING WEKA TOOL

Aim:
Implementation of Data Mining Algorithms by Attribute Relation File formats.

Procedure:

Introduction to Weka (Data Mining Tool)
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset (using the GUI) or called from your own Java code (using the Weka Java library). Tools (or functions) in Weka include:
a. Data preprocessing (e.g., data filters),
b. Classification (e.g., BayesNet, KNN, C4.5 decision tree, neural networks, SVM),
c. Regression (e.g., linear regression, isotonic regression, SVM for regression),
d. Clustering (e.g., simple k-means, Expectation Maximization (EM)),
e. Association rules (e.g., Apriori algorithm, predictive accuracy, confirmation guided),
f. Feature selection (e.g., CFS subset evaluation, information gain, chi-squared statistic), and
g. Visualization (e.g., viewing different two-dimensional plots of the data).

Launching WEKA
The Weka GUI Chooser (class weka.gui.GUIChooser) provides a starting point for launching Weka's main GUI applications and supporting tools. If one prefers an MDI (multiple document interface) appearance, this is provided by an alternative launcher called Main (class weka.gui.Main). The GUI Chooser consists of four buttons, one for each of the four major Weka applications, and four menus. The buttons can be used to start the following applications:
i. Explorer - an environment for exploring data with WEKA (the rest of this documentation deals with this application in more detail).
ii. Experimenter - an environment for performing experiments and conducting statistical tests between learning schemes.
iii. Knowledge Flow - this environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One advantage is that it supports incremental learning.
iv. Simple CLI - provides a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command line interface.

Exp.No: 2[a]    Date

WEATHER RELATION

Aim:
To define the weather relation data set in ARFF format and explore it in Weka.

Procedure:-
Open Notepad and create a new text file.
Declare the relation name with @relation and the attribute names with @attribute.
Give data values for the attributes under the @data section.
Save the file with the .arff extension.
Load the data into the Weka tool and explore the data.

% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
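The three ARFF tokens above (@relation, @attribute, @data) are all Weka needs to interpret the file. As an illustration only, the following minimal Python sketch shows how such a file is read; this is not Weka's own parser, which lives in weka.core and handles many more cases.

```python
# Minimal sketch of ARFF parsing (illustrative only; Weka's real
# parser in weka.core is far more complete).
def parse_arff(text):
    relation, attributes, data = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):   # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith('@relation'):
            relation = line.split(None, 1)[1]
        elif lower.startswith('@attribute'):
            _, name, typ = line.split(None, 2)
            attributes.append((name, typ))
        elif lower.startswith('@data'):
            in_data = True
        elif in_data:
            data.append([v.strip() for v in line.split(',')])
    return relation, attributes, data

weather = """\
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
"""
rel, attrs, rows = parse_arff(weather)
print(rel, len(attrs), len(rows))   # weather 5 3
```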

Sample Screen Shot:- Attributes and data values in the text file

Exp.No: 2[b]    Date

DEFINING STUDENT RELATION DATA SET IN CSV FORMAT

Aim:
To define the student relation data set in CSV format.

Procedure:
Open a Microsoft Office Excel sheet (File -> New).
Create the student details data set with the following columns: Name, Register Number, sub1, sub2 and sub3.
Go to Save As and save the file in .csv format.

Screen Shots
Shows the student details in the Excel sheet

Converting the Excel sheet data into comma separated values

Result:-
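Saving from Excel as .csv just produces plain comma-separated text, one row per line. A small sketch producing the same format programmatically, using the column names from the procedure above (the student rows themselves are made up for illustration):

```python
import csv, io

# Write the student relation as comma-separated values, the same
# format Excel's "Save As -> CSV" produces.
students = [
    ("john", "101", 85, 80, 90),   # illustrative rows, not from the record
    ("tony", "102", 75, 70, 65),
]
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Name", "Register Number", "sub1", "sub2", "sub3"])
writer.writerows(students)
print(buf.getvalue())
```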

Exp.No:3    Date

DATA PREPROCESSING

Aim:
To explore the weather relation using the Weka preprocessor.

% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes

Procedure:-
PREPROCESSING:
In order to experiment with the application, the data set needs to be presented to WEKA in a format the program understands. There are rules for the type of data that WEKA will accept and three options for loading data into the program.
Open File - allows the user to select files residing on the local machine or recorded medium.
Open URL - provides a mechanism to locate a file or data source from a different location specified by the user.
Open Database - allows the user to retrieve files or data from a database source provided by the user.

Application of Filters:-
Select Weka -> filters -> unsupervised -> instance -> RemovePercentage
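The RemovePercentage filter drops a given percentage of the loaded instances (or keeps only that portion when the selection is inverted). A sketch of the idea, not Weka's implementation; the 30% figure here is illustrative, not taken from the record:

```python
def remove_percentage(instances, percentage, invert=False):
    # Sketch of weka.filters.unsupervised.instance.RemovePercentage:
    # drop the first `percentage` percent of instances, or keep only
    # that portion when inverted.
    cut = round(len(instances) * percentage / 100.0)
    return instances[:cut] if invert else instances[cut:]

data = list(range(10))                     # stand-in for 10 loaded instances
print(remove_percentage(data, 30))         # [3, 4, 5, 6, 7, 8, 9]
print(remove_percentage(data, 30, True))   # [0, 1, 2]
```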

After Filter Application - RemovePercentage

VISUALIZATION:
The last tab in the window is the visualization tab. Using the other tabs in the program, calculations and comparisons have occurred on the data set, and attributes and methods of manipulation have been chosen. The final piece of the puzzle is looking at the information that has been derived throughout the process. The user can now see the data displayed in a two-dimensional representation of the information. The first screen the user sees on selecting the visualization option is a matrix of plots representing the different attributes within the data set plotted against the other attributes. If many attributes are selected, there is a scroll bar to view all of the produced plots. The user can select a specific plot from the matrix to analyze its contents in a larger pop-up window. A grid pattern of the plots allows the user to arrange the attribute positioning to their liking for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another.

Result:-

Exp.No:4    Date

CLASSIFICATION

Aim:
To explore the employee relation using the Weka classifier.

Employee Relation (INPUT):
% ARFF file for employee data with some numeric features
@relation employee
@attribute ename {john, tony}
@attribute eid numeric
@attribute esal numeric
@attribute edept {sales, admin}
@data
john, 85, 8500, sales
tony, 85, 9500, admin
john, 85, 8500, sales

Procedure:

PREPROCESSING:
In order to experiment with the application, the data set needs to be presented to WEKA in a format the program understands. There are rules for the type of data that WEKA will accept and three options for loading data into the program.
Open File - allows the user to select files residing on the local machine or recorded medium.
Open URL - provides a mechanism to locate a file or data source from a different location specified by the user.
Open Database - allows the user to retrieve files or data from a database source provided by the user.

CLASSIFICATION:
The Classify tab opens the process that is used to build a model that predicts the class attribute from the remaining attributes. The test options determine how the model is evaluated: use training set, supplied test set, cross-validation and percentage split. Cross-validation splits the data into folds, repeatedly trains on all folds but one and tests on the held-out fold; the run below uses 10-fold cross-validation with the ZeroR classifier, which always predicts the most frequent class value.

OUTPUT:
=== Run information ===
Scheme: weka.classifiers.rules.ZeroR
Relation: employee
Instances: 3
Attributes: 4
  ename
  eid
  esal
  edept
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===
ZeroR predicts class value: sales
Time taken to build model: 0 seconds

VISUALIZATION:
The last tab in the window is the visualization tab. Using the other tabs in the program, calculations and comparisons have occurred on the data set, and attributes and methods of manipulation have been chosen. The final piece of the puzzle is looking at the information that has been derived throughout the process. The user can now see the data displayed in a two-dimensional representation of the information. The first screen the user sees on selecting the visualization option is a matrix of plots representing the different attributes within the data set plotted against the other attributes. If many attributes are selected, there is a scroll bar to view all of the produced plots. The user can select a specific plot from the matrix to analyze its contents in a larger pop-up window. A grid pattern of the plots allows the user to arrange the attribute positioning to their liking for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another.

Result:-
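ZeroR ignores every attribute and always predicts the most frequent class value; with the three employee rows above (two sales, one admin) that is "sales", matching the output. A minimal sketch of the rule, not Weka's implementation:

```python
from collections import Counter

def zero_r(class_values):
    # ZeroR: predict the most common class value, ignoring all other attributes.
    return Counter(class_values).most_common(1)[0][0]

edept = ["sales", "admin", "sales"]   # class column of the employee relation
print(zero_r(edept))   # sales
```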

Exp.No:5    Date

CLUSTERING

Aim:
To explore the student relation using Weka clustering.

STUDENT RELATION
% ARFF file for student data with some numeric features
@relation student
@attribute sname {john, tony}
@attribute sid numeric
@attribute sbranch {ECE, CSE, IT}
@attribute sage numeric
@data
john, 285, ECE, 19
tony, 385, IT, admin
john, 485, ECE, 1

PREPROCESSING:
In order to experiment with the application, the data set needs to be presented to WEKA in a format the program understands. There are rules for the type of data that WEKA will accept and three options for loading data into the program.
Open File - allows the user to select files residing on the local machine or recorded medium.
Open URL - provides a mechanism to locate a file or data source from a different location specified by the user.
Open Database - allows the user to retrieve files or data from a database source provided by the user.

CLASSIFICATION:
The Classify tab opens the process that is used to build a model that predicts the class attribute from the remaining attributes. The test options determine how the model is evaluated: use training set, supplied test set, cross-validation and percentage split; the run below uses 2-fold cross-validation with the ZeroR classifier.

OUTPUT:
=== Run information ===
Scheme: weka.classifiers.rules.ZeroR
Relation: student
Instances: 3
Attributes: 4
  sname
  sid
  sbranch
  sage
Test mode: 2-fold cross-validation

=== Classifier model (full training set) ===
ZeroR predicts class value:
Time taken to build model: 0 seconds

=== Cross-validation ===
=== Summary ===
Correlation coefficient          -0.5
Mean absolute error               0.5
Root mean squared error
Relative absolute error           100 %
Root relative squared error       100 %
Total Number of Instances         3

CLUSTERING:
The Cluster tab opens the process that is used to identify commonalities or clusters of occurrences within the data set and produce information for the user to analyze. There are a few options within the cluster window that are similar to those described in the Classify tab: use training set, supplied test set and percentage split. The fourth option is classes to clusters evaluation, which compares how well the data matches a pre-assigned class within the data. While in cluster mode, users have the option of ignoring some of the attributes from the data set. This can be useful if there are specific attributes causing the results to be out of range, or for large data sets.
Figure 6 shows the Cluster window and some of its options.

OUTPUT:
=== Run information ===
Scheme: weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: weather
Instances: 14
Attributes: 5

  outlook
  temperature
  humidity
  windy
  play
Test mode: evaluate on training data

=== Model and evaluation on training set ===

EM
==
Number of clusters selected by cross validation: 1

                Cluster
Attribute             0
                    (1)
======================
outlook
  sunny             6
  overcast          5
  rainy             6
  [total]          17
temperature
  mean
  std. dev.
humidity
  mean
  std. dev.
windy
  TRUE              7
  FALSE             9
  [total]          16
play
  yes              10
  no                6
  [total]          16

Clustered Instances
0      14 (100%)

Log likelihood:

VISUALIZATION:
The last tab in the window is the visualization tab. Using the other tabs in the program, calculations and comparisons have occurred on the data set, and attributes and methods of manipulation have been chosen. The final piece of the puzzle is looking at the information that has been derived throughout the process. The user can now see the data displayed in a two-dimensional representation of the information. The first screen the user sees on selecting the visualization option is a matrix of plots representing the different attributes within the data set plotted against the other attributes. If many attributes are selected, there is a scroll bar to view all of the produced plots. The user can select a specific plot from the matrix to analyze its contents in a larger pop-up window. A grid pattern of the plots allows the user to arrange the attribute positioning to their liking for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another.
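Weka's EM clusterer fits a probabilistic mixture model, which is more elaborate than can be sketched briefly; the underlying idea of grouping similar instances can, however, be illustrated with the simpler k-means procedure (this is the approach of Weka's SimpleKMeans, not of EM itself). The (temperature, humidity) pairs below are illustrative stand-ins:

```python
def kmeans(points, k=2, iters=10):
    # Plain k-means on 2-D points: assign each point to the nearest
    # centroid, then move each centroid to the mean of its points.
    centroids = points[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        centroids = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

# (temperature, humidity) pairs, illustrative values only
pts = [(85, 85), (80, 90), (83, 86), (70, 96), (68, 80), (64, 65)]
print(kmeans(pts))
```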

Result:-

Exp.No:6    Date

ASSOCIATION

Aim:-
To explore the cars relation using the Weka associator.

Data (cars relation, resampled with the weka.filters.unsupervised.instance.Resample filter):
buying, maint, doors, persons, lugboot, safety, class
vhigh, vhigh, 2, 2, small, high, unacc
vhigh, vhigh, 2, 2, med, low, unacc
vhigh, vhigh, 2, 4, small, med, unacc

ASSOCIATION:
The Associate tab opens a window to select the options for associations within the data set. The user selects one of the choices and presses Start to yield the results. There are few options for this window; one of the most popular schemes, the Apriori algorithm, is shown below.

Output:-
=== Run information ===
Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: cars-weka.filters.unsupervised.instance.Resample-S0-Z30.0-no-replacement
Instances: 518
Attributes: 7
  buying
  maint
  doors

  persons
  lugboot
  safety
  class

=== Associator model (full training set) ===

Apriori
=======
Minimum support: 0.1 (52 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 18

Generated sets of large itemsets:
Size of set of large itemsets L(1): 23
Size of set of large itemsets L(2): 55
Size of set of large itemsets L(3): 10

Best rules found:
1. persons=2 171 ==> class=unacc 171    <conf:(1)> lift:(1.41) lev:(0.1) [49] conv:(49.85)
2. safety=low 169 ==> class=unacc 169    <conf:(1)> lift:(1.41) lev:(0.1) [49] conv:(49.26)
3. persons=2 lugboot=med 69 ==> class=unacc 69    <conf:(1)> lift:(1.41) lev:(0.04) [20] conv:(20.11)
4. persons=2 safety=med 68 ==> class=unacc 68    <conf:(1)> lift:(1.41) lev:(0.04) [19] conv:(19.82)
5. persons=4 safety=low 59 ==> class=unacc 59    <conf:(1)> lift:(1.41) lev:(0.03) [17] conv:(17.2)
6. lugboot=med safety=low 59 ==> class=unacc 59    <conf:(1)> lift:(1.41) lev:(0.03) [17] conv:(17.2)
7. persons=2 lugboot=big 56 ==> class=unacc 56    <conf:(1)> lift:(1.41) lev:(0.03) [16] conv:(16.32)
8. persons=more safety=low 56 ==> class=unacc 56    <conf:(1)> lift:(1.41) lev:(0.03) [16] conv:(16.32)

9. lugboot=big safety=low 56 ==> class=unacc 56    <conf:(1)> lift:(1.41) lev:(0.03) [16] conv:(16.32)
10. persons=2 safety=low 54 ==> class=unacc 54    <conf:(1)> lift:(1.41) lev:(0.03) [15] conv:(15.74)

Result:-
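The rule metrics in the output above follow directly from counts: confidence is the fraction of instances matching the left-hand side that also match the right-hand side, and lift is confidence divided by the consequent's baseline frequency. A sketch with the counts from rule 1 (persons=2 covers 171 of 518 instances, all unacc); the unacc total of 367 is an assumption chosen to be consistent with the reported lift of 1.41, not a figure from the record:

```python
def rule_metrics(n_total, n_lhs, n_both, n_rhs):
    # confidence = P(rhs | lhs); lift = confidence / P(rhs)
    confidence = n_both / n_lhs
    lift = confidence / (n_rhs / n_total)
    return confidence, lift

conf, lift = rule_metrics(n_total=518, n_lhs=171, n_both=171, n_rhs=367)
print(round(conf, 2), round(lift, 2))   # 1.0 1.41
```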

Exp.No:7    Date

EXPERIMENTING VEHICLE RELATIONS USING WEKA EXPERIMENTER

Aim:-
To define, save and run an experiment using the Weka Experimenter.

Procedure:-
Defining an Experiment
When the Experimenter is started, the Setup window (actually a pane) is displayed. Click New to initialize an experiment. This causes default parameters to be defined for the experiment. To define the dataset to be processed by a scheme, first select "Use relative paths" in the Datasets panel of the Setup window and then click "Add New" to open a dialog box below.

Select iris.arff and click Open to select the iris dataset. The dataset name is now displayed in the Datasets panel of the Setup window.

Saving the Results of the Experiment
To identify a dataset to which the results are to be sent, click on the "CSVResultListener" entry in the Destination panel. Note that this window (and other similar windows in Weka) is not initially expanded and some of the information in the window is not visible. Drag the bottom right-hand corner of the window to resize the window until the scroll bars disappear. The output file parameter is near the bottom of the window, beside the text "outputFile". Click on this parameter to display a file selection window.

Type the name of the output file, click Select, and then click close (x). The file name is displayed in the outputFile panel. Click on OK to close the window. The dataset name is displayed in the Destination panel of the Setup window.

Saving the Experiment Definition
The experiment definition can be saved at any time. Select "Save" at the top of the Setup window. Type the dataset name with the extension ".exp" (or select the dataset name if the experiment definition dataset already exists). The experiment can be restored by selecting Open in the Setup window and then selecting Experiment1.exp in the dialog window.

RUNNING AN EXPERIMENT:
To run the current experiment, click the Run tab at the top of the Experiment Environment window. The current experiment performs 10 randomized train and test runs on the Iris dataset, using 66%

of the patterns for training and 34% for testing, and using the ZeroR scheme. Click Start to run the experiment. If the experiment was defined correctly, the 3 messages shown above will be displayed in the Log panel.

Output:-
The results of the experiment are saved to the dataset Experiment1.txt. Each record contains the fields:
Dataset, Run, Scheme, Scheme_options, Scheme_version_ID,

Date_time, Number_of_instances, Number_correct, Number_incorrect, Number_unclassified, Percent_correct, Percent_incorrect, Percent_unclassified, Mean_absolute_error, Root_mean_squared_error, Relative_absolute_error, Root_relative_squared_error, SF_prior_entropy, SF_scheme_entropy, SF_entropy_gain, SF_mean_prior_entropy, SF_mean_scheme_entropy, SF_mean_entropy_gain, KB_information, KB_mean_information, KB_relative_information, True_positive_rate, Num_true_positives, False_positive_rate, Num_false_positives, True_negative_rate, Num_true_negatives, False_negative_rate, Num_false_negatives, IR_precision, IR_recall, F_measure, Summary

iris,1,weka.classifiers.ZeroR,'', , e7,51.0,15.0,36.0,0.0, , ,0.0, , ,100.0,100.0, , ,0.0, , ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
iris,2,weka.classifiers.ZeroR,'', , E7,51.0,11.0,40.0,0.0, , ,0.0, , ,100.0,100.0, , ,0.0, , ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?

Result:-
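The experiment's 10 randomized 66%/34% train/test runs can be sketched in plain Python. This is illustrative only: ZeroR stands in for the scheme, and a 150-instance three-class list stands in for the iris file, which is not read here. Note that 150 instances split 66/34 gives 51 test instances per run, matching the instance counts in the output records above.

```python
import random
from collections import Counter

def run_experiment(instances, runs=10, train_fraction=0.66, seed=1):
    # Repeat: shuffle, split 66/34, "train" ZeroR on the training part
    # (find the majority class), then measure accuracy on the test part.
    accuracies = []
    rng = random.Random(seed)
    for _ in range(runs):
        data = instances[:]
        rng.shuffle(data)
        cut = int(len(data) * train_fraction)
        train, test = data[:cut], data[cut:]
        majority = Counter(c for _, c in train).most_common(1)[0][0]
        correct = sum(1 for _, c in test if c == majority)
        accuracies.append(correct / len(test))
    return accuracies

# Stand-in for a 150-instance, three-class dataset like iris: (features, class)
data = [((i,), cls) for cls in ("a", "b", "c") for i in range(50)]
accs = run_experiment(data)
print(len(accs))   # 10
```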

KNOWLEDGE FLOW:
The Knowledge Flow provides an alternative to the Explorer as a graphical front end to Weka's core algorithms. The Knowledge Flow is a work in progress, so some of the functionality from the Explorer is not yet available. On the other hand, there are things that can be done in the Knowledge Flow but not in the Explorer. The Knowledge Flow presents a "data-flow" inspired interface to Weka. The user can select Weka components from a tool bar, place them on a layout canvas and connect them together in order to form a "knowledge flow" for processing and analyzing data. At present, all of Weka's classifiers, filters, clusterers, loaders and savers are available in the Knowledge Flow along with some extra tools.

Features of the Knowledge Flow:
* intuitive data flow style layout
* process data in batches or incrementally
* process multiple batches or streams in parallel (each separate flow executes in its own thread)
* chain filters together
* view models produced by classifiers for each fold in a cross validation
* visualize performance of incremental classifiers during processing (scrolling plots of classification accuracy, RMS error, predictions etc.)

Components available in the Knowledge Flow:
DataSources: all of Weka's loaders are available
DataSinks: all of Weka's savers are available
Filters: all of Weka's filters are available
Classifiers: all of Weka's classifiers are available
Clusterers: all of Weka's clusterers are available
Evaluation:
TrainingSetMaker - make a data set into a training set
TestSetMaker - make a data set into a test set
CrossValidationFoldMaker - split any data set, training set or test set into folds

TrainTestSplitMaker - split any data set, training set or test set into a training set and a test set
ClassAssigner - assign a column to be the class for any data set, training set or test set
ClassValuePicker - choose a class value to be considered as the "positive" class. This is useful when generating data for ROC style curves (see below)
ClassifierPerformanceEvaluator - evaluate the performance of batch trained/tested classifiers
IncrementalClassifierEvaluator - evaluate the performance of incrementally trained classifiers
ClustererPerformanceEvaluator - evaluate the performance of batch trained/tested clusterers
PredictionAppender - append classifier predictions to a test set. For discrete class problems, can either append predicted class labels or probability distributions

VISUALIZATION:
DataVisualizer - component that can pop up a panel for visualizing data in a single large 2D scatter plot
ScatterPlotMatrix - component that can pop up a panel containing a matrix of small scatter plots (clicking on a small plot pops up a large scatter plot)
AttributeSummarizer - component that can pop up a panel containing a matrix of histogram plots, one for each of the attributes in the input data
ModelPerformanceChart - component that can pop up a panel for visualizing threshold (i.e. ROC style) curves
TextViewer - component for showing textual data. Can show data sets, classification performance statistics etc.
GraphViewer - component that can pop up a panel for visualizing tree based models
StripChart - component that can pop up a panel that displays a scrolling plot of data (used for viewing the online performance of incremental classifiers)

LAUNCHING THE KNOWLEDGE FLOW
The Weka GUI Chooser window is used to launch Weka's graphical environments. Select the button labeled "Knowledge Flow" to start the Knowledge Flow. Alternatively, you can launch the Knowledge Flow from a terminal window by typing "java weka.gui.beans.KnowledgeFlow".
At the top of the Knowledge Flow window are seven tabs: DataSources, DataSinks, Filters, Classifiers, Clusterers, Evaluation and Visualization. The names are largely self-explanatory.

COMPONENTS
Components available in the Knowledge Flow:

DataSources
All of WEKA's loaders are available.

DataSinks
All of WEKA's savers are available.

Filters
All of WEKA's filters are available.

Classifiers
All of WEKA's classifiers are available.

Clusterers
All of WEKA's clusterers are available.

Evaluation
TrainingSetMaker - make a data set into a training set.
TestSetMaker - make a data set into a test set.
CrossValidationFoldMaker - split any data set, training set or test set into folds.
TrainTestSplitMaker - split any data set, training set or test set into a training set and a test set.
ClassAssigner - assign a column to be the class for any data set, training set or test set.
ClassValuePicker - choose a class value to be considered as the "positive" class. This is useful when generating data for ROC style curves (see ModelPerformanceChart below and example 6.4.2).
ClassifierPerformanceEvaluator - evaluate the performance of batch trained/tested classifiers.
IncrementalClassifierEvaluator - evaluate the performance of incrementally trained classifiers.
ClustererPerformanceEvaluator - evaluate the performance of batch trained/tested clusterers.
PredictionAppender - append classifier predictions to a test set. For discrete class problems, can either append predicted class labels or probability distributions.

Visualization
DataVisualizer - component that can pop up a panel for visualizing data in a single large 2D scatter plot.
ScatterPlotMatrix - component that can pop up a panel containing a matrix of small scatter plots (clicking on a small plot pops up a large scatter plot).
AttributeSummarizer - component that can pop up a panel containing a matrix of histogram plots, one for each of the attributes in the input data.
ModelPerformanceChart - component that can pop up a panel for visualizing threshold (i.e. ROC style) curves.
TextViewer - component for showing textual data. Can show data sets, classification performance statistics etc.
GraphViewer - component that can pop up a panel for visualizing tree based models.
StripChart - component that can pop up a panel that displays a scrolling plot of data (used for viewing the online performance of incremental classifiers).

Exp.No:8    Date

DESIGN A KNOWLEDGE FLOW LAYOUT FOR CROSS VALIDATION USING J48

Aim:
Setting up a flow to load an arff file (batch mode) and perform a cross validation using J48 (Weka's C4.5 implementation).

Procedure:-
First start the KnowledgeFlow. Next click on the DataSources tab and choose "ArffLoader" from the toolbar (the mouse pointer will change to a "cross hairs"). Next place the ArffLoader component on the layout area by clicking somewhere on the layout (a copy of the ArffLoader icon will appear on the layout area). Next specify an arff file to load by first right clicking the mouse over the ArffLoader icon on the layout. A pop-up menu will appear. Select "Configure" under "Edit" in the list from this menu and browse to the location of your arff file.

Alternatively, you can double-click on the icon to bring up the configuration dialog.

Next click the "Evaluation" tab at the top of the window and choose the "ClassAssigner" (allows you to choose which column is to be the class) component from the toolbar. Place this on the layout. Now connect the ArffLoader to the ClassAssigner: first right click over the ArffLoader and select "dataSet" under "Connections" in the menu. A "rubber band" line will appear.

Move the mouse over the ClassAssigner component and left click - a red line labeled "dataSet" will connect the two components. Next right click over the ClassAssigner and choose "Configure" from the menu. This will pop up a window from which you can specify which column is the class in your data (last is the default).

Next grab a "CrossValidationFoldMaker" component from the Evaluation toolbar and place it on

the layout. Connect the ClassAssigner to the CrossValidationFoldMaker by right clicking over "ClassAssigner" and selecting "dataSet" from under "Connections" in the menu.

Next click on the "Classifiers" tab at the top of the window and scroll along the toolbar until you reach the "J48" component in the "trees" section. Place a J48 component on the layout. Connect the CrossValidationFoldMaker to J48 TWICE by first choosing "trainingSet" and then "testSet" from the pop-up menu for the CrossValidationFoldMaker.
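The CrossValidationFoldMaker's job, turning one dataset into train/test pairs with one pair per fold, can be sketched as follows (illustrative only; Weka's component also shuffles and stratifies):

```python
def cross_validation_folds(instances, k=10):
    # Yield (train, test) pairs: fold i is the test set, the rest is the
    # training set, which is what the fold maker feeds to J48 twice.
    for i in range(k):
        test = instances[i::k]   # every k-th instance, starting at i
        train = [x for j, x in enumerate(instances) if j % k != i]
        yield train, test

data = list(range(20))
folds = list(cross_validation_folds(data, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))   # 10 18 2
```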


Next go back to the "Evaluation" tab and place a "ClassifierPerformanceEvaluator" component on the layout.

Connect J48 to this component by selecting the "batchClassifier" entry from the pop-up menu for J48.

Next go to the "Visualization" toolbar and place a "TextViewer" component on the layout. Connect the ClassifierPerformanceEvaluator to the TextViewer by selecting the "text" entry from the pop-up menu for the ClassifierPerformanceEvaluator.

Now start the flow executing by selecting "Start loading" from the pop-up menu for the ArffLoader.

When finished, you can view the results by choosing "Show results" from the pop-up menu for the TextViewer component.

Result:

Exp.No:9    Date

DESIGN A KNOWLEDGE FLOW LAYOUT TO LOAD ATTRIBUTE SELECTION, NORMALIZE AND STORE IN CSV FORMAT

Aim:-
To design a knowledge flow layout that performs attribute selection, normalizes the attributes and stores the result with a CSV saver.

Procedure:
1. Click on "KnowledgeFlow" from the Weka GUI Chooser.
2. It opens a window called "Weka Knowledge Flow Environment".
3. Click on "DataSources" and select "ArffLoader" to read data from an ARFF source.
4. Now click on the knowledge flow layout area, which places the ArffLoader in the layout.
5. Click on "Filters" and select "AttributeSelection" from the supervised filters. Place it on the design layout.
6. Now select another filter, "Normalize", from the unsupervised filters, to normalize the numeric attribute values. Place it on the design layout.
7. Click on "DataSinks" and choose "CSVSaver", which writes to a destination in CSV format. Place it on the design layout of the knowledge flow.
8. Now right click on "ArffLoader" and click on "dataSet" to direct the flow to "AttributeSelection".
9. Now right click on "AttributeSelection" and select "dataSet" to direct the flow to "Normalize", from which the flow is directed to the CSVSaver in the same way.
10. Right click on the CSVSaver and click on "Configure" to specify the destination where the results are to be stored.
11. Now right click on "ArffLoader" and select "Configure" to specify the source data; here the iris relation has been selected.
12. Now again right click on the "ArffLoader" and click on "Start loading", which executes the knowledge flow layout.

We can observe the results of the above process by opening the saved CSV file in Notepad, which displays the results in comma-separated value form.

Output:-
Petal length, Petal width, Class
Iris-setosa
Iris-setosa
Iris-setosa
Iris-versicolor
Iris-virginica
Iris-virginica

Result:-
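The Normalize filter rescales each numeric attribute to the range [0, 1]; combined with the CSVSaver step, the flow's effect can be sketched as below. This is not the Weka implementation, and the petal measurements are hypothetical values, not taken from the record's output:

```python
import csv, io

def normalize_columns(rows):
    # Rescale every numeric column to [0, 1]: (x - min) / (max - min),
    # which is what the Normalize filter does to numeric attributes.
    cols = list(zip(*rows))
    out = []
    for col in cols:
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0   # avoid division by zero for constant columns
        out.append([(x - lo) / span for x in col])
    return [list(r) for r in zip(*out)]

# Hypothetical petal measurements for illustration
rows = [[1.4, 0.2], [4.7, 1.4], [6.0, 2.5]]
norm = normalize_columns(rows)

# Write the normalized values the way the CSVSaver step would
buf = io.StringIO()
w = csv.writer(buf)
w.writerow(["petallength", "petalwidth"])
for r in norm:
    w.writerow([round(v, 3) for v in r])
print(buf.getvalue())
```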

Lab Experiments Case Study

1. List all the categorical (or nominal) attributes and the real-valued attributes separately.

From the German Credit Assessment Case Study given to us, the following attributes are found to be applicable for Credit-Risk Assessment:

Total Valid Attributes
1. checking_status
2. duration
3. credit history
4. purpose
5. credit amount
6. savings_status
7. employment duration
8. installment rate
9. personal status
10. debtors
11. residence_since
12. property
13. age
14. installment plans
15. housing
16. existing credits
17. job
18. num_dependents
19. telephone
20. foreign worker

Categorical or Nominal attributes (which take true/false, etc. values)
1. checking_status
2. credit history
3. purpose
4. savings_status
5. employment
6. personal status
7. debtors
8. property
9. installment plans
10. housing
11. job
12. telephone
13. foreign worker

Real valued attributes
1. duration
2. credit amount
3. installment rate
4. residence
5. age
6. existing credits
7. num_dependents

2. What attributes do you think might be crucial in making the credit assessment? Come up with some simple rules in plain English using your selected attributes.

In my view, the following attributes may be crucial in making the credit risk assessment:
1. credit_history
2. employment
3. property_magnitude
4. job
5. duration
6. credit_amount
7. installment
8. existing_credits

Based on the above attributes, we can make a decision whether to give credit or not:

checking_status = no checking AND other_payment_plans = none AND credit_history = critical/other existing credit: good
checking_status = no checking AND existing_credits <= 1 AND other_payment_plans = none AND purpose = radio/tv: good
checking_status = no checking AND foreign_worker = yes AND employment = 4<=X<7: good
foreign_worker = no AND personal_status = male single: good
checking_status = no checking AND purpose = used car AND other_payment_plans = none: good
duration <= 15 AND other_parties = guarantor: good
duration <= 11 AND credit_history = critical/other existing credit: good
checking_status = >=200 AND num_dependents <= 1 AND property_magnitude = car: good
checking_status = no checking AND property_magnitude = real estate AND other_payment_plans = none AND age > 23: good
savings_status = >=1000 AND property_magnitude = real estate: good
savings_status = 500<=X<1000 AND employment = >=7: good
credit_history = no credits/all paid AND housing = rent: bad
savings_status = no known savings AND checking_status = 0<=X<200 AND existing_credits > 1: good

checking_status = >=200 AND num_dependents <= 1 AND property_magnitude = life insurance: good
installment_commitment <= 2 AND other_parties = co applicant AND existing_credits > 1: bad
installment_commitment <= 2 AND credit_history = delayed previously AND existing_credits > 1 AND residence_since > 1: good
installment_commitment <= 2 AND credit_history = delayed previously AND existing_credits <= 1: good
duration > 30 AND savings_status = 100<=X<500: bad
credit_history = all paid AND other_parties = none AND other_payment_plans = bank: bad
duration > 30 AND savings_status = no known savings AND num_dependents > 1: good
duration > 30 AND credit_history = delayed previously: bad
duration > 42 AND savings_status = <100 AND residence_since > 1: bad
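Rules like the ones above can be checked mechanically against a record. Below is a minimal Python sketch (the dictionary layout and the sample applicant are hypothetical illustrations, not part of the dataset files) that applies three of the listed rules and answers "unknown" when none of them fires:

```python
def assess(applicant):
    """Apply a few of the hand-written credit rules; fall back to 'unknown'."""
    if (applicant["checking_status"] == "no checking"
            and applicant["other_payment_plans"] == "none"
            and applicant["credit_history"] == "critical/other existing credit"):
        return "good"
    if applicant["duration"] <= 15 and applicant["other_parties"] == "guarantor":
        return "good"
    if applicant["duration"] > 30 and applicant["savings_status"] == "100<=X<500":
        return "bad"
    return "unknown"

# A made-up applicant record for illustration.
sample = {"checking_status": "no checking",
          "other_payment_plans": "none",
          "credit_history": "critical/other existing credit",
          "duration": 24, "other_parties": "none",
          "savings_status": "<100"}
print(assess(sample))  # -> good
```

A real rule learner would order and prune such rules automatically; this sketch only shows how a learned rule list is applied.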

3. One type of model that you can create is a decision tree. Train a decision tree using the complete dataset as the training data and report the model obtained after training.

A decision tree is a flowchart-like tree structure in which each internal (non-leaf) node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf (terminal) node holds a class label. Decision trees can easily be converted into classification rules. Examples of decision-tree algorithms are ID3, C4.5 and CART; J48 is Weka's implementation of C4.5.

J48 pruned tree:
1. Using the WEKA tool, we can generate a decision tree by selecting the Classify tab.
2. In the Classify tab, click Choose; a list of classifiers, including several decision trees, is shown. From that list select J48.
3. Under Test options, select the "Use training set" option.
4. The resulting window in WEKA is as follows:

5. To generate the decision tree, right-click on the entry in the result list and select the Visualize tree option; the decision tree will be displayed.
6. The decision tree obtained for credit-risk assessment is too large to fit on the screen.
7. The tree is unclear because of the large number of attributes.
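J48 (C4.5) chooses which attribute to test at each node by how much the split reduces class entropy. The following is a small, self-contained Python sketch of that criterion, evaluated on the standard 14-instance weather data (sunny: 2 yes / 3 no, overcast: 4 yes, rainy: 3 yes / 2 no) rather than the large credit dataset:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Entropy reduction obtained by splitting on one attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

outlook = ["sunny"] * 5 + ["overcast"] * 4 + ["rainy"] * 5
play = ["no", "no", "no", "yes", "yes"] + ["yes"] * 4 + ["yes", "yes", "yes", "no", "no"]
print(round(info_gain(outlook, play), 3))  # -> 0.247
```

C4.5 actually uses gain ratio (information gain divided by the split's own entropy), but the gain computation above is the core of both.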

4. Suppose you use your above model, trained on the complete dataset, and classify credit good/bad for each of the examples in the dataset. What % of examples can you classify correctly? (This is also called testing on the training set.) Why do you think you cannot get 100% training accuracy?

In the above model we trained on the complete dataset and classified credit good/bad for each of the examples in the dataset. For example:

IF purpose = vacation THEN credit = bad
ELSE IF purpose = business THEN credit = good

In this way we classified each of the examples in the dataset. We classified 85.5% of the examples correctly, and the remaining 14.5% were classified incorrectly. We cannot get 100% training accuracy because, out of the 20 attributes, some unnecessary attributes are also analyzed and trained on. This affects the accuracy, and hence 100% training accuracy is not reached.
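The percentages above follow directly from the instance counts; a quick check of the arithmetic (the count 855 is implied by the 85.5 % figure on the 1000-instance dataset):

```python
total = 1000   # instances in the German credit data
correct = 855  # count implied by the 85.5 % training accuracy
accuracy = correct / total
error_rate = (total - correct) / total
print(f"accuracy {accuracy:.1%}, error rate {error_rate:.1%}")
```

This prints "accuracy 85.5%, error rate 14.5%", matching the result reported above.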

5. Is testing on the training set as you did above a good idea? Why or why not?

It is a bad idea to put all the data into the training set: there is then no independent data left to test whether the classification is correct. As a rule of thumb, for good accuracy estimates we take about 2/3 of the dataset as the training set and the remaining 1/3 as the test set. In the above model we took the complete dataset as the training set, which yields only 85.5% accuracy. This comes from analyzing and training on unnecessary attributes that play no crucial role in credit-risk assessment; the complexity increases and the accuracy suffers. If part of the dataset is used as the training set and the remainder as the test set, the results are more reliable and the computation time is smaller. This is why we prefer not to take the complete dataset as the training set.

"Use training set" result for the GermanCreditData table:
Correctly Classified Instances 855 (85.5 %)
Incorrectly Classified Instances 145 (14.5 %)
Kappa statistic
Mean absolute error
Root mean squared error 0.34
Relative absolute error %
Root relative squared error %
Total Number of Instances 1000

6. One approach for solving the problem encountered in the previous question is cross-validation. Describe briefly what cross-validation is. Train a decision tree again using cross-validation and report your results. Does your accuracy increase or decrease? Why?

Cross-validation: in k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets, or folds, D1, D2, D3, ..., Dk, each of approximately equal size. Training and testing are performed k times. In iteration i, partition Di is reserved as the test set and the remaining partitions are collectively used to train the model. That is, in the first iteration subsets D2, D3, ..., Dk collectively serve as the training set to obtain the first model, which is tested on D1; the second model is trained on D1, D3, ..., Dk and tested on D2; and so on.

1. Select the Classify tab and the J48 decision tree; under Test options select the Cross-validation radio button with the number of folds set to 10.
2. The number of folds is the number of partitions of the instance set.
3. A Kappa statistic near 1 indicates close to 100% accuracy, in which case all the errors would be zeroed out; in reality, though, no training set gives 100% accuracy.
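The partitioning described above can be sketched in a few lines of Python (index-based, and without the random shuffling Weka applies first):

```python
def k_fold_splits(n, k):
    """Split indices 0..n-1 into k folds; yield (train, test) index lists,
    with each fold used exactly once as the test set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 1000 instances and 10 folds: each iteration tests on 100 instances
# and trains on the remaining 900.
for train, test in k_fold_splits(1000, 10):
    assert len(test) == 100 and len(train) == 900
```

Every instance appears in exactly one test fold, so each instance is predicted exactly once across the k runs.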

Cross-validation result at folds = 10 for the GermanCreditData table:
Correctly Classified Instances %
Incorrectly Classified Instances %
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error %
Root relative squared error %
Total Number of Instances 1000

Here there are 1000 instances, with 100 instances per partition.

Cross-validation result at folds = 20 for the GermanCreditData table:
Correctly Classified Instances %
Incorrectly Classified Instances %
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error %
Root relative squared error %
Total Number of Instances 1000

Cross-validation result at folds = 50 for the GermanCreditData table:
Correctly Classified Instances %
Incorrectly Classified Instances %
Kappa statistic
Mean absolute error
Root mean squared error

Cross-validation result at folds = 100 for the GermanCreditData table:
Correctly Classified Instances %
Incorrectly Classified Instances %
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error %
Root relative squared error %
Total Number of Instances 1000

Percentage split does not allow 100%; it allows only up to 99.9%.

Percentage-split result at 50%:
Correctly Classified Instances %
Incorrectly Classified Instances %
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error %
Root relative squared error %
Total Number of Instances 500

Percentage-split result at 99.9%:
Correctly Classified Instances 0 (0 %)
Incorrectly Classified Instances 1 (100 %)
Kappa statistic 0
Mean absolute error
Root mean squared error
Relative absolute error %
Root relative squared error %
Total Number of Instances 1
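The split sizes reported above are simple arithmetic on the 1000 instances. A sketch of that arithmetic (using truncation; Weka's exact rounding of the training share may differ by one instance):

```python
n = 1000  # instances in GermanCreditData
for pct in (50.0, 66.7, 99.9):
    train = int(n * pct / 100)  # training share, truncated to whole instances
    test = n - train
    print(f"{pct}% split -> {train} train, {test} test")
```

At 99.9 % only a single instance remains for testing, which is why the 99.9 % result above is computed from just one instance and cannot be a meaningful accuracy estimate.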

7. Sometimes, the cost of rejecting an applicant who actually has good credit (case 1) might be higher than that of accepting an applicant who has bad credit (case 2). Instead of counting the misclassifications equally in both cases, give a higher cost to the first case (say cost 5) and a lower cost to the second case. You can do this by using a cost matrix in WEKA. Train your decision tree again and report the decision tree and cross-validation results. Are they significantly different from the results obtained in problem 6 (using equal cost)?

In problem 6 we used equal costs when training the decision tree. Here we consider the two cases with different costs: cost 5 for case 1 and cost 2 for case 2. When we assign these costs and train the decision tree again, we can observe that the resulting tree is almost equal to the one obtained in problem 6.

Case 1 (cost 5) | Case 2 (cost 2) | Total Cost | Average Cost

We do not find this cost factor in problem 6, as equal costs were used there; this is the major difference between the results of problem 6 and this problem. The cost matrices we used here are set for Case 1 and Case 2 as follows:

1. Select the Classify tab.
2. Select More options... under Test options.
3. Tick Cost-sensitive evaluation and click Set...

4. Set the number of classes to 2.
5. Click Resize; we then get the cost matrix.
6. Change the 2nd entry in the 1st row and the 2nd entry in the 1st column to the chosen costs (5 and 2).
7. The confusion matrix is then generated, and you can find out the difference between the good and bad classes.
8. Check whether the accuracy changes or not.
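Cost-sensitive evaluation simply weights each cell of the confusion matrix by the corresponding cost-matrix entry. The following sketch uses the costs from this exercise; the confusion counts are made-up placeholders for illustration, not WEKA output:

```python
# rows = actual class (good, bad); columns = predicted class (good, bad)
confusion = [[588, 112],
             [183, 117]]  # hypothetical counts, for illustration only
# cost 5 for rejecting an actually good applicant (case 1),
# cost 2 for accepting an actually bad one (case 2)
cost = [[0, 5],
        [2, 0]]

total_cost = sum(confusion[i][j] * cost[i][j] for i in range(2) for j in range(2))
n = sum(map(sum, confusion))
print(total_cost, total_cost / n)  # -> 926 0.926
```

With equal costs (all off-diagonal entries 1) the total cost is just the number of misclassifications, which is why problem 6 reports no separate cost figure.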

8. Do you think it is a good idea to prefer simple decision trees instead of long, complex decision trees? How does the complexity of a decision tree relate to the bias of the model?

When we consider long, complex decision trees, we have many unnecessary attributes in the tree, which increases the bias of the model; because of this, the accuracy of the model can also be affected. This problem can be reduced by considering a simple decision tree: the attributes are fewer, the bias of the model decreases, and the result is more accurate. So it is a good idea to prefer simple decision trees instead of long, complex ones.

1. Open an existing ARFF file, e.g. labor.arff.
2. In the Preprocess tab, click All to select all the attributes.
3. Go to the Classify tab and run the J48 algorithm with "Use training set".

4. To generate the decision tree, right-click on the entry in the result list and select the Visualize tree option; the decision tree will be displayed.

5. Right-click on the J48 entry to open the Generic Object Editor window.
6. In this window, set the unpruned option to True.
7. Press OK and then Start; we find that the tree becomes more complex if it is not pruned. Visualize the tree again.

8. The tree has become more complex.

9. (Extra credit): How can you convert a decision tree into "if-then-else" rules? Make up your own small decision tree consisting of 2-3 levels and convert it into a set of rules. There also exist classifiers that output the model directly in the form of rules; one such classifier in WEKA is rules.PART. Train this model and report the set of rules obtained. Sometimes just one attribute can be good enough in making the decision, yes, just one! Can you predict which attribute that might be in this dataset? The OneR classifier uses a single attribute to make decisions (it chooses the attribute based on minimum error). Report the rule obtained by training a OneR classifier. Rank the performance of J48, PART and OneR.

In WEKA, rules.PART is one of the classifiers that converts decision trees into "IF-THEN-ELSE" rules.

Converting decision trees into IF-THEN-ELSE rules using the rules.PART classifier:

PART decision list
outlook = overcast: yes (4.0)
windy = TRUE: no (4.0/1.0)
outlook = sunny: no (3.0/1.0)
: yes (3.0)
Number of Rules: 4

Yes, sometimes just one attribute can be good enough in making the decision. In this (weather) dataset, the single attribute used for making the decision is outlook:

outlook:
sunny -> no
overcast -> yes
rainy -> yes
(10/14 instances correct)

With respect to time, the OneR classifier has the highest ranking, J48 is in 2nd place and PART gets 3rd place:

Time (sec) rank: J48 - II, PART - III, OneR - I

But if you consider the accuracy, the J48 classifier has the highest ranking, PART gets second place and OneR comes last:

Accuracy rank: J48 - I, PART - II, OneR - III (66.8 %)

1. Open an existing file, e.g. weather.nominal.arff.
2. Click All to select all attributes.
3. Go to the Classify tab.
4. Click Start.
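OneR itself is simple enough to sketch in a few lines: for a single attribute, predict the majority class of each attribute value and count the errors. Run on the 14-instance weather data, this sketch reproduces the outlook rule quoted above (10/14 correct):

```python
from collections import Counter, defaultdict

def one_r(values, labels):
    """OneR for one attribute: map each attribute value to its majority
    class; return the rule and its number of training errors."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    rule = {v: Counter(ys).most_common(1)[0][0] for v, ys in groups.items()}
    errors = sum(y != rule[v] for v, y in zip(values, labels))
    return rule, errors

outlook = ["sunny"] * 5 + ["overcast"] * 4 + ["rainy"] * 5
play = ["no", "no", "no", "yes", "yes"] + ["yes"] * 4 + ["yes", "yes", "yes", "no", "no"]
rule, errors = one_r(outlook, play)
print(rule)
print(f"{len(play) - errors}/{len(play)} correct")  # -> 10/14 correct
```

The full OneR algorithm simply runs this per-attribute procedure for every attribute and keeps the one with the fewest errors.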

Here the accuracy is 100%.

The tree corresponds to if-then-else rules:

If outlook = overcast then play = yes.
If outlook = sunny and humidity = high then play = no, else play = yes.
If outlook = rainy and windy = true then play = no, else play = yes.

To extract the rules:
1. Go to Choose, click Rules and select PART.
2. Click Save and then Start.
3. Proceed similarly for the OneR algorithm.

If outlook = overcast then play = yes.
If outlook = sunny and humidity = high then play = no.
If outlook = sunny and humidity = normal then play = yes.
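Written as code, the rules above are just a nested if-then-else. A sketch (attribute values follow weather.nominal.arff):

```python
def play(outlook, humidity, windy):
    """The small weather decision tree expressed as nested if-then-else."""
    if outlook == "overcast":
        return "yes"
    if outlook == "sunny":
        return "no" if humidity == "high" else "yes"
    # remaining case: outlook == "rainy"
    return "no" if windy else "yes"

print(play("sunny", "high", False))    # -> no
print(play("rainy", "normal", True))   # -> no
print(play("overcast", "high", True))  # -> yes
```

Each root-to-leaf path of the tree becomes one rule, which is exactly the conversion this exercise asks for.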


More information

Transaction Validity Detection Density Based Clustering

Transaction Validity Detection Density Based Clustering International Journal of Progressive Sciences and Technologies (IJPSAT) ISSN: 2509-0119. 2018International Journals of Sciences and High Technologies http://ijpsat.ijsht-journals.org Vol. 9 No. 1 June

More information

Comparative Study of J48, Naive Bayes and One-R Classification Technique for Credit Card Fraud Detection using WEKA

Comparative Study of J48, Naive Bayes and One-R Classification Technique for Credit Card Fraud Detection using WEKA Advances in Computational Sciences and Technology ISSN 0973-6107 Volume 10, Number 6 (2017) pp. 1731-1743 Research India Publications http://www.ripublication.com Comparative Study of J48, Naive Bayes

More information

Predictive Analysis: Evaluation and Experimentation. Heejun Kim

Predictive Analysis: Evaluation and Experimentation. Heejun Kim Predictive Analysis: Evaluation and Experimentation Heejun Kim June 19, 2018 Evaluation and Experimentation Evaluation Metrics Cross-Validation Significance Tests Evaluation Predictive analysis: training

More information

Excel Assignment 4: Correlation and Linear Regression (Office 2016 Version)

Excel Assignment 4: Correlation and Linear Regression (Office 2016 Version) Economics 225, Spring 2018, Yang Zhou Excel Assignment 4: Correlation and Linear Regression (Office 2016 Version) 30 Points Total, Submit via ecampus by 8:00 AM on Tuesday, May 1, 2018 Please read all

More information

Data Science with R Decision Trees with Rattle

Data Science with R Decision Trees with Rattle Data Science with R Decision Trees with Rattle Graham.Williams@togaware.com 9th June 2014 Visit http://onepager.togaware.com/ for more OnePageR s. In this module we use the weather dataset to explore the

More information

Introduction to the workbook and spreadsheet

Introduction to the workbook and spreadsheet Excel Tutorial To make the most of this tutorial I suggest you follow through it while sitting in front of a computer with Microsoft Excel running. This will allow you to try things out as you follow along.

More information

Designed by Jason Wagner, Course Web Programmer, Office of e-learning NOTE ABOUT CELL REFERENCES IN THIS DOCUMENT... 1

Designed by Jason Wagner, Course Web Programmer, Office of e-learning NOTE ABOUT CELL REFERENCES IN THIS DOCUMENT... 1 Excel Essentials Designed by Jason Wagner, Course Web Programmer, Office of e-learning NOTE ABOUT CELL REFERENCES IN THIS DOCUMENT... 1 FREQUENTLY USED KEYBOARD SHORTCUTS... 1 FORMATTING CELLS WITH PRESET

More information

Introduction to Stata: An In-class Tutorial

Introduction to Stata: An In-class Tutorial Introduction to Stata: An I. The Basics - Stata is a command-driven statistical software program. In other words, you type in a command, and Stata executes it. You can use the drop-down menus to avoid

More information

Weka ( )

Weka (  ) Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised

More information

BerkeleyImageSeg User s Guide

BerkeleyImageSeg User s Guide BerkeleyImageSeg User s Guide 1. Introduction Welcome to BerkeleyImageSeg! This is designed to be a lightweight image segmentation application, easy to learn and easily automated for repetitive processing

More information

Quick Start Guide Jacob Stolk PhD Simone Stolk MPH November 2018

Quick Start Guide Jacob Stolk PhD Simone Stolk MPH November 2018 Quick Start Guide Jacob Stolk PhD Simone Stolk MPH November 2018 Contents Introduction... 1 Start DIONE... 2 Load Data... 3 Missing Values... 5 Explore Data... 6 One Variable... 6 Two Variables... 7 All

More information

Model Selection Introduction to Machine Learning. Matt Gormley Lecture 4 January 29, 2018

Model Selection Introduction to Machine Learning. Matt Gormley Lecture 4 January 29, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Model Selection Matt Gormley Lecture 4 January 29, 2018 1 Q&A Q: How do we deal

More information

Computing a Gain Chart. Comparing the computation time of data mining tools on a large dataset under Linux.

Computing a Gain Chart. Comparing the computation time of data mining tools on a large dataset under Linux. 1 Introduction Computing a Gain Chart. Comparing the computation time of data mining tools on a large dataset under Linux. The gain chart is an alternative to confusion matrix for the evaluation of a classifier.

More information

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York A Systematic Overview of Data Mining Algorithms Sargur Srihari University at Buffalo The State University of New York 1 Topics Data Mining Algorithm Definition Example of CART Classification Iris, Wine

More information

HOUR 12. Adding a Chart

HOUR 12. Adding a Chart HOUR 12 Adding a Chart The highlights of this hour are as follows: Reasons for using a chart The chart elements The chart types How to create charts with the Chart Wizard How to work with charts How to

More information

GeoVISTA Studio Tutorial. What is GeoVISTA Studio? Why is it part of the map making and visualization workshop?

GeoVISTA Studio Tutorial. What is GeoVISTA Studio? Why is it part of the map making and visualization workshop? GeoVISTA Studio Tutorial What is GeoVISTA Studio? Why is it part of the map making and visualization workshop? GeoVISTA Studio is a Java-based environment for visually assembling JavaBeans software components

More information

Tanagra: An Evaluation

Tanagra: An Evaluation Tanagra: An Evaluation Jessica Enright Jonathan Klippenstein November 5th, 2004 1 Introduction to Tanagra Tanagra was written as an aid to education and research on data mining by Ricco Rakotomalala [1].

More information

Barchard Introduction to SPSS Marks

Barchard Introduction to SPSS Marks Barchard Introduction to SPSS 22.0 3 Marks Purpose The purpose of this assignment is to introduce you to SPSS, the most commonly used statistical package in the social sciences. You will create a new data

More information

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis CHAPTER 3 BEST FIRST AND GREEDY SEARCH BASED CFS AND NAÏVE BAYES ALGORITHMS FOR HEPATITIS DIAGNOSIS 3.1 Introduction

More information

User Services Spring 2008 OBJECTIVES Introduction Getting Help Instructors

User Services Spring 2008 OBJECTIVES  Introduction Getting Help  Instructors User Services Spring 2008 OBJECTIVES Use the Data Editor of SPSS 15.0 to to import data. Recode existing variables and compute new variables Use SPSS utilities and options Conduct basic statistical tests.

More information

Nearest Neighbor Classification

Nearest Neighbor Classification Nearest Neighbor Classification Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms January 11, 2017 1 / 48 Outline 1 Administration 2 First learning algorithm: Nearest

More information

CRITERION Vantage 3 Admin Training Manual Contents Introduction 5

CRITERION Vantage 3 Admin Training Manual Contents Introduction 5 CRITERION Vantage 3 Admin Training Manual Contents Introduction 5 Running Admin 6 Understanding the Admin Display 7 Using the System Viewer 11 Variables Characteristic Setup Window 19 Using the List Viewer

More information

1. Basic Steps for Data Analysis Data Editor. 2.4.To create a new SPSS file

1. Basic Steps for Data Analysis Data Editor. 2.4.To create a new SPSS file 1 SPSS Guide 2009 Content 1. Basic Steps for Data Analysis. 3 2. Data Editor. 2.4.To create a new SPSS file 3 4 3. Data Analysis/ Frequencies. 5 4. Recoding the variable into classes.. 5 5. Data Analysis/

More information

Spreadsheet Concepts: Creating Charts in Microsoft Excel

Spreadsheet Concepts: Creating Charts in Microsoft Excel Spreadsheet Concepts: Creating Charts in Microsoft Excel lab 6 Objectives: Upon successful completion of Lab 6, you will be able to Create a simple chart on a separate chart sheet and embed it in the worksheet

More information

Exploring Data. This guide describes the facilities in SPM to gain initial insights about a dataset by viewing and generating descriptive statistics.

Exploring Data. This guide describes the facilities in SPM to gain initial insights about a dataset by viewing and generating descriptive statistics. This guide describes the facilities in SPM to gain initial insights about a dataset by viewing and generating descriptive statistics. 2018 by Minitab Inc. All rights reserved. Minitab, SPM, SPM Salford

More information

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München Evaluation Measures Sebastian Pölsterl Computer Aided Medical Procedures Technische Universität München April 28, 2015 Outline 1 Classification 1. Confusion Matrix 2. Receiver operating characteristics

More information

Perform the following steps to set up for this project. Start out in your login directory on csit (a.k.a. acad).

Perform the following steps to set up for this project. Start out in your login directory on csit (a.k.a. acad). CSC 458 Data Mining and Predictive Analytics I, Fall 2017 (November 22, 2017) Dr. Dale E. Parson, Assignment 4, Comparing Weka Bayesian, clustering, ZeroR, OneR, and J48 models to predict nominal dissolved

More information

Skills Exam Objective Objective Number

Skills Exam Objective Objective Number Overview 1 LESSON SKILL MATRIX Skills Exam Objective Objective Number Starting Excel Create a workbook. 1.1.1 Working in the Excel Window Customize the Quick Access Toolbar. 1.4.3 Changing Workbook and

More information

Introduction to Data Science

Introduction to Data Science Introduction to Data Science Lab 4 Introduction to Machine Learning Overview In the previous labs, you explored a dataset containing details of lemonade sales. In this lab, you will use machine learning

More information

TexRAD Research Version Client User Guide Version 3.9

TexRAD Research Version Client User Guide Version 3.9 Imaging tools for medical decision makers Cambridge Computed Imaging Ltd Grange Park Broadway Bourn Cambridge CB23 2TA UK TexRAD Research Version Client User Guide Version 3.9 Release date 23/05/2016 Number

More information

SAS Visual Analytics 8.2: Getting Started with Reports

SAS Visual Analytics 8.2: Getting Started with Reports SAS Visual Analytics 8.2: Getting Started with Reports Introduction Reporting The SAS Visual Analytics tools give you everything you need to produce and distribute clear and compelling reports. SAS Visual

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

You are to turn in the following three graphs at the beginning of class on Wednesday, January 21.

You are to turn in the following three graphs at the beginning of class on Wednesday, January 21. Computer Tools for Data Analysis & Presentation Graphs All public machines on campus are now equipped with Word 2010 and Excel 2010. Although fancier graphical and statistical analysis programs exist,

More information

Studying in the Sciences

Studying in the Sciences Organising data and creating figures (charts and graphs) in Excel What is in this guide Familiarisation with Excel (for beginners) Setting up data sheets Creating a chart (graph) Formatting the chart Creating

More information

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

Supervised and Unsupervised Learning (II)

Supervised and Unsupervised Learning (II) Supervised and Unsupervised Learning (II) Yong Zheng Center for Web Intelligence DePaul University, Chicago IPD 346 - Data Science for Business Program DePaul University, Chicago, USA Intro: Supervised

More information

Introduction to Microsoft Excel 2010

Introduction to Microsoft Excel 2010 Introduction to Microsoft Excel 2010 This class is designed to cover the following basics: What you can do with Excel Excel Ribbon Moving and selecting cells Formatting cells Adding Worksheets, Rows and

More information

Lab Exercise Two Mining Association Rule with WEKA Explorer

Lab Exercise Two Mining Association Rule with WEKA Explorer Lab Exercise Two Mining Association Rule with WEKA Explorer 1. Fire up WEKA to get the GUI Chooser panel. Select Explorer from the four choices on the right side. 2. To get a feel for how to apply Apriori,

More information