SRI CHANDRASEKHARENDRA SARASWATHI VISWA MAHAVIDYALAYA (UNIVERSITY ESTABLISHED UNDER SECTION 3 OF UGC ACT 1956) ENATHUR, KANCHIPURAM


SRI CHANDRASEKHARENDRA SARASWATHI VISWA MAHAVIDYALAYA
(UNIVERSITY ESTABLISHED UNDER SECTION 3 OF UGC ACT 1956)
ENATHUR, KANCHIPURAM

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

DATA WAREHOUSING AND DATA MINING LAB
LABORATORY RECORD

Name    :
Reg. No :
Class   : III BE (CSE)
Subject : EBC6AP123 Data Warehousing and Data Mining Lab

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

BONAFIDE CERTIFICATE

This is to certify that this is the bonafide record of work done by Mr/Ms. ________, with Reg. No. ________ of III Year B.E. (Computer Science and Engineering) in the Data Warehousing and Data Mining Laboratory (EBC6AP123) course during the year ________.

Station:
Date:

Staff-in-charge                    Head of the Department

Submitted for the Practical examination held on ________.

Internal Examiner                  External Examiner

INDEX

S.No.  Date  Title  Page No.  Staff Initials
1. Exploring Weka Tool
2. a) Defining Weather Relation Data Set in ARFF format
   b) Defining Student Relation Data Set in CSV format
3. Exploring weather relation using Weka Preprocessor & Cross Validation Techniques
4. Exploring employee relation using Weka Classifier
5. Exploring labor relation using Weka Clustering
6. Exploring student relation using Weka Associator
7. Experimenting Vehicle Relation using Weka Experimenter
8. Setting up a flow to load an arff file (batch mode) and perform a cross validation using J48
9. Design a knowledge flow layout to load attribute selection, normalize the attributes and store the result in a CSV saver

Exp.No:1    Date

EXPLORING WEKA TOOL

Aim:
Implementation of Data Mining Algorithms by Attribute Relation File formats.

Procedure:

Introduction to Weka (Data Mining Tool)
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset (using the GUI) or called from your own Java code (using the Weka Java library). Tools (or functions) in Weka include:
a. Data preprocessing (e.g., data filters),
b. Classification (e.g., BayesNet, KNN, C4.5 decision tree, neural networks, SVM),
c. Regression (e.g., linear regression, isotonic regression, SVM for regression),
d. Clustering (e.g., simple k-means, Expectation Maximization (EM)),
e. Association rules (e.g., Apriori algorithm, predictive accuracy, confirmation guided),
f. Feature selection (e.g., CFS subset evaluation, information gain, chi-squared statistic), and
g. Visualization (e.g., viewing different two-dimensional plots of the data).

Launching WEKA
The Weka GUI Chooser (class weka.gui.GUIChooser) provides a starting point for launching Weka's main GUI applications and supporting tools. If one prefers an MDI (multiple document interface) appearance, this is provided by an alternative launcher called Main (class weka.gui.Main). The GUI Chooser consists of four buttons, one for each of the four major Weka applications, and four menus. The buttons can be used to start the following applications:
i. Explorer - an environment for exploring data with WEKA (the rest of this documentation deals with this application in more detail).
ii. Experimenter - an environment for performing experiments and conducting statistical tests between learning schemes.
iii. Knowledge Flow - this environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One advantage is that it supports incremental learning.
iv. Simple CLI - provides a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command line interface.

Exp.No: 2[a]    Date

WEATHER RELATION

Aim:
To define the weather relation data set in ARFF format and explore it in Weka.

Procedure:-
Open Notepad and create a new text file.
Declare the relation name with @relation and the attribute names with @attribute.
Give data values for the attributes under the @data section.
Save the file with the .arff extension.
Load the data into the Weka tool and explore the data.

% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
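The three ARFF tokens above (@relation, @attribute, @data) are all Weka needs to interpret the file. As an illustration only, the following minimal Python sketch shows how such a file is read; this is not Weka's own parser, which lives in weka.core and handles many more cases.

```python
# Minimal sketch of ARFF parsing (illustrative only; Weka's real
# parser in weka.core is far more complete).
def parse_arff(text):
    relation, attributes, data = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):   # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith('@relation'):
            relation = line.split(None, 1)[1]
        elif lower.startswith('@attribute'):
            _, name, typ = line.split(None, 2)
            attributes.append((name, typ))
        elif lower.startswith('@data'):
            in_data = True
        elif in_data:
            data.append([v.strip() for v in line.split(',')])
    return relation, attributes, data

weather = """\
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
"""
rel, attrs, rows = parse_arff(weather)
print(rel, len(attrs), len(rows))   # weather 5 3
```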

Sample Screen Shot:- Attributes and data values in the text file

Exp.No: 2[b]    Date

DEFINING STUDENT RELATION DATA SET IN CSV FORMAT

Aim:
To define the student relation data set in CSV format.

Procedure:
Open a Microsoft Office Excel sheet (File -> New).
Create the student details data set with the following columns: Name, Register Number, sub1, sub2 and sub3.
Go to Save As and save the file in .csv format.

Screen Shots
Shows the student details in the Excel sheet

Converting the Excel sheet data into comma separated values

Result:-
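Saving from Excel as .csv just produces plain comma-separated text, one row per line. A small sketch producing the same format programmatically, using the column names from the procedure above (the student rows themselves are made up for illustration):

```python
import csv, io

# Write the student relation as comma-separated values, the same
# format Excel's "Save As -> CSV" produces.
students = [
    ("john", "101", 85, 80, 90),   # illustrative rows, not from the record
    ("tony", "102", 75, 70, 65),
]
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Name", "Register Number", "sub1", "sub2", "sub3"])
writer.writerows(students)
print(buf.getvalue())
```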

Exp.No:3    Date

DATA PREPROCESSING

Aim:
To explore the weather relation using the Weka preprocessor.

% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes

Procedure:-
PREPROCESSING:
In order to experiment with the application, the data set needs to be presented to WEKA in a format the program understands. There are rules for the type of data that WEKA will accept and three options for loading data into the program.
Open File - allows the user to select files residing on the local machine or recorded medium.
Open URL - provides a mechanism to locate a file or data source from a different location specified by the user.
Open Database - allows the user to retrieve files or data from a database source provided by the user.

Application of Filters:-
Select Weka -> filters -> unsupervised -> instance -> RemovePercentage
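The RemovePercentage filter drops a given percentage of the loaded instances (or keeps only that portion when the selection is inverted). A sketch of the idea, not Weka's implementation; the 30% figure here is illustrative, not taken from the record:

```python
def remove_percentage(instances, percentage, invert=False):
    # Sketch of weka.filters.unsupervised.instance.RemovePercentage:
    # drop the first `percentage` percent of instances, or keep only
    # that portion when inverted.
    cut = round(len(instances) * percentage / 100.0)
    return instances[:cut] if invert else instances[cut:]

data = list(range(10))                     # stand-in for 10 loaded instances
print(remove_percentage(data, 30))         # [3, 4, 5, 6, 7, 8, 9]
print(remove_percentage(data, 30, True))   # [0, 1, 2]
```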

After Filter Application - RemovePercentage

VISUALIZATION:
The last tab in the window is the visualization tab. Using the other tabs in the program, calculations and comparisons have occurred on the data set, and attributes and methods of manipulation have been chosen. The final piece of the puzzle is looking at the information that has been derived throughout the process. The user can now see the data displayed in a two-dimensional representation of the information. The first screen the user sees on selecting the visualization option is a matrix of plots representing the different attributes within the data set plotted against the other attributes. If many attributes are selected, there is a scroll bar to view all of the produced plots. The user can select a specific plot from the matrix to analyze its contents in a larger pop-up window. A grid pattern of the plots allows the user to arrange the attribute positioning to their liking for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another.

Result:-

Exp.No:4    Date

CLASSIFICATION

Aim:
To explore the employee relation using the Weka classifier.

Employee Relation (INPUT):
% ARFF file for employee data with some numeric features
@relation employee
@attribute ename {john, tony}
@attribute eid numeric
@attribute esal numeric
@attribute edept {sales, admin}
@data
john, 85, 8500, sales
tony, 85, 9500, admin
john, 85, 8500, sales

Procedure:

PREPROCESSING:
In order to experiment with the application, the data set needs to be presented to WEKA in a format the program understands. There are rules for the type of data that WEKA will accept and three options for loading data into the program.
Open File - allows the user to select files residing on the local machine or recorded medium.
Open URL - provides a mechanism to locate a file or data source from a different location specified by the user.
Open Database - allows the user to retrieve files or data from a database source provided by the user.

CLASSIFICATION:
The Classify tab opens the process that is used to build a model that predicts the class attribute from the remaining attributes. The test options determine how the model is evaluated: use training set, supplied test set, cross-validation and percentage split. Cross-validation splits the data into folds, repeatedly trains on all folds but one and tests on the held-out fold; the run below uses 10-fold cross-validation with the ZeroR classifier, which always predicts the most frequent class value.

OUTPUT:
=== Run information ===
Scheme: weka.classifiers.rules.ZeroR
Relation: employee
Instances: 3
Attributes: 4
  ename
  eid
  esal
  edept
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===
ZeroR predicts class value: sales
Time taken to build model: 0 seconds

VISUALIZATION:
The last tab in the window is the visualization tab. Using the other tabs in the program, calculations and comparisons have occurred on the data set, and attributes and methods of manipulation have been chosen. The final piece of the puzzle is looking at the information that has been derived throughout the process. The user can now see the data displayed in a two-dimensional representation of the information. The first screen the user sees on selecting the visualization option is a matrix of plots representing the different attributes within the data set plotted against the other attributes. If many attributes are selected, there is a scroll bar to view all of the produced plots. The user can select a specific plot from the matrix to analyze its contents in a larger pop-up window. A grid pattern of the plots allows the user to arrange the attribute positioning to their liking for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another.

Result:-
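ZeroR ignores every attribute and always predicts the most frequent class value; with the three employee rows above (two sales, one admin) that is "sales", matching the output. A minimal sketch of the rule, not Weka's implementation:

```python
from collections import Counter

def zero_r(class_values):
    # ZeroR: predict the most common class value, ignoring all other attributes.
    return Counter(class_values).most_common(1)[0][0]

edept = ["sales", "admin", "sales"]   # class column of the employee relation
print(zero_r(edept))   # sales
```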

Exp.No:5    Date

CLUSTERING

Aim:
To explore the student relation using Weka clustering.

STUDENT RELATION
% ARFF file for student data with some numeric features
@relation student
@attribute sname {john, tony}
@attribute sid numeric
@attribute sbranch {ECE, CSE, IT}
@attribute sage numeric
@data
john, 285, ECE, 19
tony, 385, IT, admin
john, 485, ECE, 1

PREPROCESSING:
In order to experiment with the application, the data set needs to be presented to WEKA in a format the program understands. There are rules for the type of data that WEKA will accept and three options for loading data into the program.
Open File - allows the user to select files residing on the local machine or recorded medium.
Open URL - provides a mechanism to locate a file or data source from a different location specified by the user.
Open Database - allows the user to retrieve files or data from a database source provided by the user.

CLASSIFICATION:
The Classify tab opens the process that is used to build a model that predicts the class attribute from the remaining attributes. The test options determine how the model is evaluated: use training set, supplied test set, cross-validation and percentage split; the run below uses 2-fold cross-validation with the ZeroR classifier.

OUTPUT:
=== Run information ===
Scheme: weka.classifiers.rules.ZeroR
Relation: student
Instances: 3
Attributes: 4
  sname
  sid
  sbranch
  sage
Test mode: 2-fold cross-validation

=== Classifier model (full training set) ===
ZeroR predicts class value:
Time taken to build model: 0 seconds

=== Cross-validation ===
=== Summary ===
Correlation coefficient          -0.5
Mean absolute error               0.5
Root mean squared error
Relative absolute error           100 %
Root relative squared error       100 %
Total Number of Instances         3

CLUSTERING:
The Cluster tab opens the process that is used to identify commonalities or clusters of occurrences within the data set and produce information for the user to analyze. There are a few options within the cluster window that are similar to those described in the Classify tab: use training set, supplied test set and percentage split. The fourth option is classes to clusters evaluation, which compares how well the data matches a pre-assigned class within the data. While in cluster mode, users have the option of ignoring some of the attributes from the data set. This can be useful if there are specific attributes causing the results to be out of range, or for large data sets.
Figure 6 shows the Cluster window and some of its options.

OUTPUT:
=== Run information ===
Scheme: weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: weather
Instances: 14
Attributes: 5

  outlook
  temperature
  humidity
  windy
  play
Test mode: evaluate on training data

=== Model and evaluation on training set ===

EM
==
Number of clusters selected by cross validation: 1

                Cluster
Attribute             0
                    (1)
======================
outlook
  sunny             6
  overcast          5
  rainy             6
  [total]          17
temperature
  mean
  std. dev.
humidity
  mean
  std. dev.
windy
  TRUE              7
  FALSE             9
  [total]          16
play
  yes              10
  no                6
  [total]          16

Clustered Instances
0      14 (100%)

Log likelihood:

VISUALIZATION:
The last tab in the window is the visualization tab. Using the other tabs in the program, calculations and comparisons have occurred on the data set, and attributes and methods of manipulation have been chosen. The final piece of the puzzle is looking at the information that has been derived throughout the process. The user can now see the data displayed in a two-dimensional representation of the information. The first screen the user sees on selecting the visualization option is a matrix of plots representing the different attributes within the data set plotted against the other attributes. If many attributes are selected, there is a scroll bar to view all of the produced plots. The user can select a specific plot from the matrix to analyze its contents in a larger pop-up window. A grid pattern of the plots allows the user to arrange the attribute positioning to their liking for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another.
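Weka's EM clusterer fits a probabilistic mixture model, which is more elaborate than can be sketched briefly; the underlying idea of grouping similar instances can, however, be illustrated with the simpler k-means procedure (this is the approach of Weka's SimpleKMeans, not of EM itself). The (temperature, humidity) pairs below are illustrative stand-ins:

```python
def kmeans(points, k=2, iters=10):
    # Plain k-means on 2-D points: assign each point to the nearest
    # centroid, then move each centroid to the mean of its points.
    centroids = points[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        centroids = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

# (temperature, humidity) pairs, illustrative values only
pts = [(85, 85), (80, 90), (83, 86), (70, 96), (68, 80), (64, 65)]
print(kmeans(pts))
```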

Result:-

Exp.No:6    Date

ASSOCIATION

Aim:-
To explore the cars relation using the Weka associator.

Data (cars relation, resampled with the weka.filters.unsupervised.instance.Resample filter):
buying, maint, doors, persons, lugboot, safety, class
vhigh, vhigh, 2, 2, small, high, unacc
vhigh, vhigh, 2, 2, med, low, unacc
vhigh, vhigh, 2, 4, small, med, unacc

ASSOCIATION:
The Associate tab opens a window to select the options for associations within the data set. The user selects one of the choices and presses Start to yield the results. There are few options for this window; one of the most popular schemes, the Apriori algorithm, is shown below.

Output:-
=== Run information ===
Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: cars-weka.filters.unsupervised.instance.Resample-S0-Z30.0-no-replacement
Instances: 518
Attributes: 7
  buying
  maint
  doors

  persons
  lugboot
  safety
  class

=== Associator model (full training set) ===

Apriori
=======
Minimum support: 0.1 (52 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 18

Generated sets of large itemsets:
Size of set of large itemsets L(1): 23
Size of set of large itemsets L(2): 55
Size of set of large itemsets L(3): 10

Best rules found:
1. persons=2 171 ==> class=unacc 171    <conf:(1)> lift:(1.41) lev:(0.1) [49] conv:(49.85)
2. safety=low 169 ==> class=unacc 169    <conf:(1)> lift:(1.41) lev:(0.1) [49] conv:(49.26)
3. persons=2 lugboot=med 69 ==> class=unacc 69    <conf:(1)> lift:(1.41) lev:(0.04) [20] conv:(20.11)
4. persons=2 safety=med 68 ==> class=unacc 68    <conf:(1)> lift:(1.41) lev:(0.04) [19] conv:(19.82)
5. persons=4 safety=low 59 ==> class=unacc 59    <conf:(1)> lift:(1.41) lev:(0.03) [17] conv:(17.2)
6. lugboot=med safety=low 59 ==> class=unacc 59    <conf:(1)> lift:(1.41) lev:(0.03) [17] conv:(17.2)
7. persons=2 lugboot=big 56 ==> class=unacc 56    <conf:(1)> lift:(1.41) lev:(0.03) [16] conv:(16.32)
8. persons=more safety=low 56 ==> class=unacc 56    <conf:(1)> lift:(1.41) lev:(0.03) [16] conv:(16.32)

9. lugboot=big safety=low 56 ==> class=unacc 56    <conf:(1)> lift:(1.41) lev:(0.03) [16] conv:(16.32)
10. persons=2 safety=low 54 ==> class=unacc 54    <conf:(1)> lift:(1.41) lev:(0.03) [15] conv:(15.74)

Result:-
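The rule metrics in the output above follow directly from counts: confidence is the fraction of instances matching the left-hand side that also match the right-hand side, and lift is confidence divided by the consequent's baseline frequency. A sketch with the counts from rule 1 (persons=2 covers 171 of 518 instances, all unacc); the unacc total of 367 is an assumption chosen to be consistent with the reported lift of 1.41, not a figure from the record:

```python
def rule_metrics(n_total, n_lhs, n_both, n_rhs):
    # confidence = P(rhs | lhs); lift = confidence / P(rhs)
    confidence = n_both / n_lhs
    lift = confidence / (n_rhs / n_total)
    return confidence, lift

conf, lift = rule_metrics(n_total=518, n_lhs=171, n_both=171, n_rhs=367)
print(round(conf, 2), round(lift, 2))   # 1.0 1.41
```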

Exp.No:7    Date

EXPERIMENTING VEHICLE RELATIONS USING WEKA EXPERIMENTER

Aim:-
To define, save and run an experiment using the Weka Experimenter.

Procedure:-
Defining an Experiment
When the Experimenter is started, the Setup window (actually a pane) is displayed. Click New to initialize an experiment. This causes default parameters to be defined for the experiment. To define the dataset to be processed by a scheme, first select "Use relative paths" in the Datasets panel of the Setup window and then click "Add New" to open a dialog box below.

Select iris.arff and click Open to select the iris dataset. The dataset name is now displayed in the Datasets panel of the Setup window.

Saving the Results of the Experiment
To identify a dataset to which the results are to be sent, click on the "CSVResultListener" entry in the Destination panel. Note that this window (and other similar windows in Weka) is not initially expanded and some of the information in the window is not visible. Drag the bottom right-hand corner of the window to resize the window until the scroll bars disappear. The output file parameter is near the bottom of the window, beside the text "outputFile". Click on this parameter to display a file selection window.

Type the name of the output file, click Select, and then click close (x). The file name is displayed in the outputFile panel. Click on OK to close the window. The dataset name is displayed in the Destination panel of the Setup window.

Saving the Experiment Definition
The experiment definition can be saved at any time. Select "Save" at the top of the Setup window. Type the dataset name with the extension ".exp" (or select the dataset name if the experiment definition dataset already exists). The experiment can be restored by selecting Open in the Setup window and then selecting Experiment1.exp in the dialog window.

RUNNING AN EXPERIMENT:
To run the current experiment, click the Run tab at the top of the Experiment Environment window. The current experiment performs 10 randomized train and test runs on the Iris dataset, using 66%

of the patterns for training and 34% for testing, and using the ZeroR scheme. Click Start to run the experiment. If the experiment was defined correctly, the 3 messages shown above will be displayed in the Log panel.

Output:-
The results of the experiment are saved to the dataset Experiment1.txt. Each record contains the fields:
Dataset, Run, Scheme, Scheme_options, Scheme_version_ID,

Date_time, Number_of_instances, Number_correct, Number_incorrect, Number_unclassified, Percent_correct, Percent_incorrect, Percent_unclassified, Mean_absolute_error, Root_mean_squared_error, Relative_absolute_error, Root_relative_squared_error, SF_prior_entropy, SF_scheme_entropy, SF_entropy_gain, SF_mean_prior_entropy, SF_mean_scheme_entropy, SF_mean_entropy_gain, KB_information, KB_mean_information, KB_relative_information, True_positive_rate, Num_true_positives, False_positive_rate, Num_false_positives, True_negative_rate, Num_true_negatives, False_negative_rate, Num_false_negatives, IR_precision, IR_recall, F_measure, Summary

iris,1,weka.classifiers.ZeroR,'', , e7,51.0,15.0,36.0,0.0, , ,0.0, , ,100.0,100.0, , ,0.0, , ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
iris,2,weka.classifiers.ZeroR,'', , E7,51.0,11.0,40.0,0.0, , ,0.0, , ,100.0,100.0, , ,0.0, , ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?

Result:-
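The experiment's 10 randomized 66%/34% train/test runs can be sketched in plain Python. This is illustrative only: ZeroR stands in for the scheme, and a 150-instance three-class list stands in for the iris file, which is not read here. Note that 150 instances split 66/34 gives 51 test instances per run, matching the instance counts in the output records above.

```python
import random
from collections import Counter

def run_experiment(instances, runs=10, train_fraction=0.66, seed=1):
    # Repeat: shuffle, split 66/34, "train" ZeroR on the training part
    # (find the majority class), then measure accuracy on the test part.
    accuracies = []
    rng = random.Random(seed)
    for _ in range(runs):
        data = instances[:]
        rng.shuffle(data)
        cut = int(len(data) * train_fraction)
        train, test = data[:cut], data[cut:]
        majority = Counter(c for _, c in train).most_common(1)[0][0]
        correct = sum(1 for _, c in test if c == majority)
        accuracies.append(correct / len(test))
    return accuracies

# Stand-in for a 150-instance, three-class dataset like iris: (features, class)
data = [((i,), cls) for cls in ("a", "b", "c") for i in range(50)]
accs = run_experiment(data)
print(len(accs))   # 10
```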

KNOWLEDGE FLOW:
The Knowledge Flow provides an alternative to the Explorer as a graphical front end to Weka's core algorithms. The Knowledge Flow is a work in progress, so some of the functionality from the Explorer is not yet available. On the other hand, there are things that can be done in the Knowledge Flow but not in the Explorer. The Knowledge Flow presents a "data-flow" inspired interface to Weka. The user can select Weka components from a tool bar, place them on a layout canvas and connect them together in order to form a "knowledge flow" for processing and analyzing data. At present, all of Weka's classifiers, filters, clusterers, loaders and savers are available in the Knowledge Flow along with some extra tools.

Features of the Knowledge Flow:
* intuitive data flow style layout
* process data in batches or incrementally
* process multiple batches or streams in parallel (each separate flow executes in its own thread)
* chain filters together
* view models produced by classifiers for each fold in a cross validation
* visualize performance of incremental classifiers during processing (scrolling plots of classification accuracy, RMS error, predictions etc.)

Components available in the Knowledge Flow:
DataSources: all of Weka's loaders are available
DataSinks: all of Weka's savers are available
Filters: all of Weka's filters are available
Classifiers: all of Weka's classifiers are available
Clusterers: all of Weka's clusterers are available
Evaluation:
TrainingSetMaker - make a data set into a training set
TestSetMaker - make a data set into a test set
CrossValidationFoldMaker - split any data set, training set or test set into folds

TrainTestSplitMaker - split any data set, training set or test set into a training set and a test set
ClassAssigner - assign a column to be the class for any data set, training set or test set
ClassValuePicker - choose a class value to be considered as the "positive" class. This is useful when generating data for ROC style curves (see below)
ClassifierPerformanceEvaluator - evaluate the performance of batch trained/tested classifiers
IncrementalClassifierEvaluator - evaluate the performance of incrementally trained classifiers
ClustererPerformanceEvaluator - evaluate the performance of batch trained/tested clusterers
PredictionAppender - append classifier predictions to a test set. For discrete class problems, can either append predicted class labels or probability distributions

VISUALIZATION:
DataVisualizer - component that can pop up a panel for visualizing data in a single large 2D scatter plot
ScatterPlotMatrix - component that can pop up a panel containing a matrix of small scatter plots (clicking on a small plot pops up a large scatter plot)
AttributeSummarizer - component that can pop up a panel containing a matrix of histogram plots, one for each of the attributes in the input data
ModelPerformanceChart - component that can pop up a panel for visualizing threshold (i.e. ROC style) curves
TextViewer - component for showing textual data. Can show data sets, classification performance statistics etc.
GraphViewer - component that can pop up a panel for visualizing tree based models
StripChart - component that can pop up a panel that displays a scrolling plot of data (used for viewing the online performance of incremental classifiers)

LAUNCHING THE KNOWLEDGE FLOW
The Weka GUI Chooser window is used to launch Weka's graphical environments. Select the button labeled "Knowledge Flow" to start the Knowledge Flow. Alternatively, you can launch the Knowledge Flow from a terminal window by typing "java weka.gui.beans.KnowledgeFlow".
At the top of the Knowledge Flow window are seven tabs: DataSources, DataSinks, Filters, Classifiers, Clusterers, Evaluation and Visualization. The names are largely self-explanatory.

COMPONENTS
Components available in the Knowledge Flow:

DataSources
All of WEKA's loaders are available.

DataSinks
All of WEKA's savers are available.

Filters
All of WEKA's filters are available.

Classifiers
All of WEKA's classifiers are available.

Clusterers
All of WEKA's clusterers are available.

Evaluation
TrainingSetMaker - make a data set into a training set.
TestSetMaker - make a data set into a test set.
CrossValidationFoldMaker - split any data set, training set or test set into folds.
TrainTestSplitMaker - split any data set, training set or test set into a training set and a test set.
ClassAssigner - assign a column to be the class for any data set, training set or test set.
ClassValuePicker - choose a class value to be considered as the "positive" class. This is useful when generating data for ROC style curves (see ModelPerformanceChart below and example 6.4.2).
ClassifierPerformanceEvaluator - evaluate the performance of batch trained/tested classifiers.
IncrementalClassifierEvaluator - evaluate the performance of incrementally trained classifiers.
ClustererPerformanceEvaluator - evaluate the performance of batch trained/tested clusterers.
PredictionAppender - append classifier predictions to a test set. For discrete class problems, can either append predicted class labels or probability distributions.

Visualization
DataVisualizer - component that can pop up a panel for visualizing data in a single large 2D scatter plot.
ScatterPlotMatrix - component that can pop up a panel containing a matrix of small scatter plots (clicking on a small plot pops up a large scatter plot).
AttributeSummarizer - component that can pop up a panel containing a matrix of histogram plots, one for each of the attributes in the input data.
ModelPerformanceChart - component that can pop up a panel for visualizing threshold (i.e. ROC style) curves.
TextViewer - component for showing textual data. Can show data sets, classification performance statistics etc.
GraphViewer - component that can pop up a panel for visualizing tree based models.
StripChart - component that can pop up a panel that displays a scrolling plot of data (used for viewing the online performance of incremental classifiers).

Exp.No:8    Date

DESIGN A KNOWLEDGE FLOW LAYOUT FOR CROSS VALIDATION USING J48

Aim:
Setting up a flow to load an arff file (batch mode) and perform a cross validation using J48 (Weka's C4.5 implementation).

Procedure:-
First start the KnowledgeFlow. Next click on the DataSources tab and choose "ArffLoader" from the toolbar (the mouse pointer will change to a "cross hairs"). Next place the ArffLoader component on the layout area by clicking somewhere on the layout (a copy of the ArffLoader icon will appear on the layout area). Next specify an arff file to load by first right clicking the mouse over the ArffLoader icon on the layout. A pop-up menu will appear. Select "Configure" under "Edit" in the list from this menu and browse to the location of your arff file.

Alternatively, you can double-click on the icon to bring up the configuration dialog.

Next click the "Evaluation" tab at the top of the window and choose the "ClassAssigner" (allows you to choose which column is to be the class) component from the toolbar. Place this on the layout. Now connect the ArffLoader to the ClassAssigner: first right click over the ArffLoader and select "dataSet" under "Connections" in the menu. A "rubber band" line will appear.

Move the mouse over the ClassAssigner component and left click - a red line labeled "dataSet" will connect the two components. Next right click over the ClassAssigner and choose "Configure" from the menu. This will pop up a window from which you can specify which column is the class in your data (last is the default).

Next grab a "CrossValidationFoldMaker" component from the Evaluation toolbar and place it on

the layout. Connect the ClassAssigner to the CrossValidationFoldMaker by right clicking over "ClassAssigner" and selecting "dataSet" from under "Connections" in the menu.

Next click on the "Classifiers" tab at the top of the window and scroll along the toolbar until you reach the "J48" component in the "trees" section. Place a J48 component on the layout. Connect the CrossValidationFoldMaker to J48 TWICE by first choosing "trainingSet" and then "testSet" from the pop-up menu for the CrossValidationFoldMaker.
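The CrossValidationFoldMaker's job, turning one dataset into train/test pairs with one pair per fold, can be sketched as follows (illustrative only; Weka's component also shuffles and stratifies):

```python
def cross_validation_folds(instances, k=10):
    # Yield (train, test) pairs: fold i is the test set, the rest is the
    # training set, which is what the fold maker feeds to J48 twice.
    for i in range(k):
        test = instances[i::k]   # every k-th instance, starting at i
        train = [x for j, x in enumerate(instances) if j % k != i]
        yield train, test

data = list(range(20))
folds = list(cross_validation_folds(data, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))   # 10 18 2
```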


Next go back to the "Evaluation" tab and place a "ClassifierPerformanceEvaluator" component on the layout.

Connect J48 to this component by selecting the "batchClassifier" entry from the pop-up menu for J48.

Next go to the "Visualization" toolbar and place a "TextViewer" component on the layout. Connect the ClassifierPerformanceEvaluator to the TextViewer by selecting the "text" entry from the pop-up menu for the ClassifierPerformanceEvaluator.

Now start the flow executing by selecting "Start loading" from the pop-up menu for the ArffLoader.

When finished, you can view the results by choosing "Show results" from the pop-up menu for the TextViewer component.

Result:

Exp.No:9    Date

DESIGN A KNOWLEDGE FLOW LAYOUT TO LOAD ATTRIBUTE SELECTION, NORMALIZE AND STORE IN CSV FORMAT

Aim:-
To design a knowledge flow layout that performs attribute selection, normalizes the attributes and stores the result with a CSV saver.

Procedure:
1. Click on "KnowledgeFlow" from the Weka GUI Chooser.
2. It opens a window called "Weka Knowledge Flow Environment".
3. Click on "DataSources" and select "ArffLoader" to read data from an ARFF source.
4. Now click on the knowledge flow layout area, which places the ArffLoader in the layout.
5. Click on "Filters" and select "AttributeSelection" from the supervised filters. Place it on the design layout.
6. Now select another filter, "Normalize", from the unsupervised filters, to normalize the numeric attribute values. Place it on the design layout.
7. Click on "DataSinks" and choose "CSVSaver", which writes to a destination in CSV format. Place it on the design layout of the knowledge flow.
8. Now right click on "ArffLoader" and click on "dataSet" to direct the flow to "AttributeSelection".
9. Now right click on "AttributeSelection" and select "dataSet" to direct the flow to "Normalize", from which the flow is directed to the CSVSaver in the same way.
10. Right click on the CSVSaver and click on "Configure" to specify the destination where the results are to be stored.
11. Now right click on "ArffLoader" and select "Configure" to specify the source data; here the iris relation has been selected.
12. Now again right click on the "ArffLoader" and click on "Start loading", which executes the knowledge flow layout.

We can observe the results of the above process by opening the saved CSV file in Notepad, which displays the results in comma-separated value form.

Output:-
Petal length, Petal width, Class
Iris-setosa
Iris-setosa
Iris-setosa
Iris-versicolor
Iris-virginica
Iris-virginica

Result:-
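The Normalize filter rescales each numeric attribute to the range [0, 1]; combined with the CSVSaver step, the flow's effect can be sketched as below. This is not the Weka implementation, and the petal measurements are hypothetical values, not taken from the record's output:

```python
import csv, io

def normalize_columns(rows):
    # Rescale every numeric column to [0, 1]: (x - min) / (max - min),
    # which is what the Normalize filter does to numeric attributes.
    cols = list(zip(*rows))
    out = []
    for col in cols:
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0   # avoid division by zero for constant columns
        out.append([(x - lo) / span for x in col])
    return [list(r) for r in zip(*out)]

# Hypothetical petal measurements for illustration
rows = [[1.4, 0.2], [4.7, 1.4], [6.0, 2.5]]
norm = normalize_columns(rows)

# Write the normalized values the way the CSVSaver step would
buf = io.StringIO()
w = csv.writer(buf)
w.writerow(["petallength", "petalwidth"])
for r in norm:
    w.writerow([round(v, 3) for v in r])
print(buf.getvalue())
```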

Lab Experiments Case Study

1. List all the categorical (or nominal) attributes and the real-valued attributes separately.

From the German Credit Assessment Case Study given to us, the following attributes are found to be applicable for Credit-Risk Assessment:

Total Valid Attributes
1. checking_status
2. duration
3. credit history
4. purpose
5. credit amount
6. savings_status
7. employment duration
8. installment rate
9. personal status
10. debtors
11. residence_since
12. property
13. age
14. installment plans
15. housing
16. existing credits
17. job
18. num_dependents
19. telephone
20. foreign worker

Categorical or Nominal attributes (which take true/false, etc. values)
1. checking_status
2. credit history
3. purpose
4. savings_status
5. employment
6. personal status
7. debtors
8. property
9. installment plans
10. housing
11. job
12. telephone
13. foreign worker

Real valued attributes
1. duration
2. credit amount
3. installment rate
4. residence
5. age
6. existing credits
7. num_dependents

2. What attributes do you think might be crucial in making the credit assessment? Come up with some simple rules in plain English using your selected attributes.

In my view, the following attributes may be crucial in making the credit risk assessment:
1. credit_history
2. employment
3. property_magnitude
4. job
5. duration
6. credit_amount
7. installment
8. existing_credits

Based on the above attributes, we can make a decision whether to give credit or not:

checking_status = no checking AND other_payment_plans = none AND credit_history = critical/other existing credit: good
checking_status = no checking AND existing_credits <= 1 AND other_payment_plans = none AND purpose = radio/tv: good
checking_status = no checking AND foreign_worker = yes AND employment = 4<=X<7: good
foreign_worker = no AND personal_status = male single: good
checking_status = no checking AND purpose = used car AND other_payment_plans = none: good
duration <= 15 AND other_parties = guarantor: good
duration <= 11 AND credit_history = critical/other existing credit: good
checking_status = >=200 AND num_dependents <= 1 AND property_magnitude = car: good
checking_status = no checking AND property_magnitude = real estate AND other_payment_plans = none AND age > 23: good
savings_status = >=1000 AND property_magnitude = real estate: good
savings_status = 500<=X<1000 AND employment = >=7: good
credit_history = no credits/all paid AND housing = rent: bad
savings_status = no known savings AND checking_status = 0<=X<200 AND existing_credits > 1: good

checking_status = >=200 AND num_dependents <= 1 AND property_magnitude = life insurance: good
installment_commitment <= 2 AND other_parties = co applicant AND existing_credits > 1: bad
installment_commitment <= 2 AND credit_history = delayed previously AND existing_credits > 1 AND residence_since > 1: good
installment_commitment <= 2 AND credit_history = delayed previously AND existing_credits <= 1: good
duration > 30 AND savings_status = 100<=X<500: bad
credit_history = all paid AND other_parties = none AND other_payment_plans = bank: bad
duration > 30 AND savings_status = no known savings AND num_dependents > 1: good
duration > 30 AND credit_history = delayed previously: bad
duration > 42 AND savings_status = <100 AND residence_since > 1: bad
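Rules like the ones above can be checked mechanically against a record. Below is a minimal Python sketch (the dictionary layout and the sample applicant are hypothetical illustrations, not part of the dataset files) that applies three of the listed rules and answers "unknown" when none of them fires:

```python
def assess(applicant):
    """Apply a few of the hand-written credit rules; fall back to 'unknown'."""
    if (applicant["checking_status"] == "no checking"
            and applicant["other_payment_plans"] == "none"
            and applicant["credit_history"] == "critical/other existing credit"):
        return "good"
    if applicant["duration"] <= 15 and applicant["other_parties"] == "guarantor":
        return "good"
    if applicant["duration"] > 30 and applicant["savings_status"] == "100<=X<500":
        return "bad"
    return "unknown"

# A made-up applicant record for illustration.
sample = {"checking_status": "no checking",
          "other_payment_plans": "none",
          "credit_history": "critical/other existing credit",
          "duration": 24, "other_parties": "none",
          "savings_status": "<100"}
print(assess(sample))  # -> good
```

A real rule learner would order and prune such rules automatically; this sketch only shows how a learned rule list is applied.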

3. One type of model that you can create is a decision tree. Train a decision tree using the complete dataset as the training data and report the model obtained after training.

A decision tree is a flowchart-like tree structure in which each internal (non-leaf) node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf (terminal) node holds a class label. Decision trees can easily be converted into classification rules. Examples of decision-tree algorithms are ID3, C4.5 and CART; J48 is Weka's implementation of C4.5.

J48 pruned tree:
1. Using the WEKA tool, we can generate a decision tree by selecting the Classify tab.
2. In the Classify tab, click Choose; a list of classifiers, including several decision trees, is shown. From that list select J48.
3. Under Test options, select the "Use training set" option.
4. The resulting window in WEKA is as follows:

5. To generate the decision tree, right-click on the entry in the result list and select the Visualize tree option; the decision tree will be displayed.
6. The decision tree obtained for credit-risk assessment is too large to fit on the screen.
7. The tree is unclear because of the large number of attributes.
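J48 (C4.5) chooses which attribute to test at each node by how much the split reduces class entropy. The following is a small, self-contained Python sketch of that criterion, evaluated on the standard 14-instance weather data (sunny: 2 yes / 3 no, overcast: 4 yes, rainy: 3 yes / 2 no) rather than the large credit dataset:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Entropy reduction obtained by splitting on one attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

outlook = ["sunny"] * 5 + ["overcast"] * 4 + ["rainy"] * 5
play = ["no", "no", "no", "yes", "yes"] + ["yes"] * 4 + ["yes", "yes", "yes", "no", "no"]
print(round(info_gain(outlook, play), 3))  # -> 0.247
```

C4.5 actually uses gain ratio (information gain divided by the split's own entropy), but the gain computation above is the core of both.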

4. Suppose you use your above model, trained on the complete dataset, and classify credit good/bad for each of the examples in the dataset. What % of examples can you classify correctly? (This is also called testing on the training set.) Why do you think you cannot get 100% training accuracy?

In the above model we trained on the complete dataset and classified credit good/bad for each of the examples in the dataset. For example:

IF purpose = vacation THEN credit = bad
ELSE IF purpose = business THEN credit = good

In this way we classified each of the examples in the dataset. We classified 85.5% of the examples correctly, and the remaining 14.5% were classified incorrectly. We cannot get 100% training accuracy because, out of the 20 attributes, some unnecessary attributes are also analyzed and trained on. This affects the accuracy, and hence 100% training accuracy is not reached.
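The percentages above follow directly from the instance counts; a quick check of the arithmetic (the count 855 is implied by the 85.5 % figure on the 1000-instance dataset):

```python
total = 1000   # instances in the German credit data
correct = 855  # count implied by the 85.5 % training accuracy
accuracy = correct / total
error_rate = (total - correct) / total
print(f"accuracy {accuracy:.1%}, error rate {error_rate:.1%}")
```

This prints "accuracy 85.5%, error rate 14.5%", matching the result reported above.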

5. Is testing on the training set as you did above a good idea? Why or why not?

It is a bad idea to put all the data into the training set: there is then no independent data left to test whether the classification is correct. As a rule of thumb, for good accuracy estimates we take about 2/3 of the dataset as the training set and the remaining 1/3 as the test set. In the above model we took the complete dataset as the training set, which yields only 85.5% accuracy. This comes from analyzing and training on unnecessary attributes that play no crucial role in credit-risk assessment; the complexity increases and the accuracy suffers. If part of the dataset is used as the training set and the remainder as the test set, the results are more reliable and the computation time is smaller. This is why we prefer not to take the complete dataset as the training set.

"Use training set" result for the GermanCreditData table:
Correctly Classified Instances 855 (85.5 %)
Incorrectly Classified Instances 145 (14.5 %)
Kappa statistic
Mean absolute error
Root mean squared error 0.34
Relative absolute error %
Root relative squared error %
Total Number of Instances 1000

6. One approach for solving the problem encountered in the previous question is cross-validation. Describe briefly what cross-validation is. Train a decision tree again using cross-validation and report your results. Does your accuracy increase or decrease? Why?

Cross-validation: in k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets, or folds, D1, D2, D3, ..., Dk, each of approximately equal size. Training and testing are performed k times. In iteration i, partition Di is reserved as the test set and the remaining partitions are collectively used to train the model. That is, in the first iteration subsets D2, D3, ..., Dk collectively serve as the training set to obtain the first model, which is tested on D1; the second model is trained on D1, D3, ..., Dk and tested on D2; and so on.

1. Select the Classify tab and the J48 decision tree; under Test options select the Cross-validation radio button with the number of folds set to 10.
2. The number of folds is the number of partitions of the instance set.
3. A Kappa statistic near 1 indicates close to 100% accuracy, in which case all the errors would be zeroed out; in reality, though, no training set gives 100% accuracy.
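The partitioning described above can be sketched in a few lines of Python (index-based, and without the random shuffling Weka applies first):

```python
def k_fold_splits(n, k):
    """Split indices 0..n-1 into k folds; yield (train, test) index lists,
    with each fold used exactly once as the test set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 1000 instances and 10 folds: each iteration tests on 100 instances
# and trains on the remaining 900.
for train, test in k_fold_splits(1000, 10):
    assert len(test) == 100 and len(train) == 900
```

Every instance appears in exactly one test fold, so each instance is predicted exactly once across the k runs.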

Cross-validation result at folds = 10 for the GermanCreditData table:
Correctly Classified Instances %
Incorrectly Classified Instances %
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error %
Root relative squared error %
Total Number of Instances 1000

Here there are 1000 instances, with 100 instances per partition.

Cross-validation result at folds = 20 for the GermanCreditData table:
Correctly Classified Instances %
Incorrectly Classified Instances %
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error %
Root relative squared error %
Total Number of Instances 1000

Cross-validation result at folds = 50 for the GermanCreditData table:
Correctly Classified Instances %
Incorrectly Classified Instances %
Kappa statistic
Mean absolute error
Root mean squared error

Cross-validation result at folds = 100 for the GermanCreditData table:
Correctly Classified Instances %
Incorrectly Classified Instances %
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error %
Root relative squared error %
Total Number of Instances 1000

Percentage split does not allow 100%; it allows only up to 99.9%.

Percentage-split result at 50%:
Correctly Classified Instances %
Incorrectly Classified Instances %
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error %
Root relative squared error %
Total Number of Instances 500

Percentage-split result at 99.9%:
Correctly Classified Instances 0 (0 %)
Incorrectly Classified Instances 1 (100 %)
Kappa statistic 0
Mean absolute error
Root mean squared error
Relative absolute error %
Root relative squared error %
Total Number of Instances 1
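The split sizes reported above are simple arithmetic on the 1000 instances. A sketch of that arithmetic (using truncation; Weka's exact rounding of the training share may differ by one instance):

```python
n = 1000  # instances in GermanCreditData
for pct in (50.0, 66.7, 99.9):
    train = int(n * pct / 100)  # training share, truncated to whole instances
    test = n - train
    print(f"{pct}% split -> {train} train, {test} test")
```

At 99.9 % only a single instance remains for testing, which is why the 99.9 % result above is computed from just one instance and cannot be a meaningful accuracy estimate.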

7. Sometimes, the cost of rejecting an applicant who actually has good credit (case 1) might be higher than that of accepting an applicant who has bad credit (case 2). Instead of counting the misclassifications equally in both cases, give a higher cost to the first case (say cost 5) and a lower cost to the second case. You can do this by using a cost matrix in WEKA. Train your decision tree again and report the decision tree and cross-validation results. Are they significantly different from the results obtained in problem 6 (using equal cost)?

In problem 6 we used equal costs when training the decision tree. Here we consider the two cases with different costs: cost 5 for case 1 and cost 2 for case 2. When we assign these costs and train the decision tree again, we can observe that the resulting tree is almost equal to the one obtained in problem 6.

Case 1 (cost 5) | Case 2 (cost 2) | Total Cost | Average Cost

We do not find this cost factor in problem 6, as equal costs were used there; this is the major difference between the results of problem 6 and this problem. The cost matrices we used here are set for Case 1 and Case 2 as follows:

1. Select the Classify tab.
2. Select More options... under Test options.
3. Tick Cost-sensitive evaluation and click Set...

4. Set the number of classes to 2.
5. Click Resize; we then get the cost matrix.
6. Change the 2nd entry in the 1st row and the 2nd entry in the 1st column to the chosen costs (5 and 2).
7. The confusion matrix is then generated, and you can find out the difference between the good and bad classes.
8. Check whether the accuracy changes or not.
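Cost-sensitive evaluation simply weights each cell of the confusion matrix by the corresponding cost-matrix entry. The following sketch uses the costs from this exercise; the confusion counts are made-up placeholders for illustration, not WEKA output:

```python
# rows = actual class (good, bad); columns = predicted class (good, bad)
confusion = [[588, 112],
             [183, 117]]  # hypothetical counts, for illustration only
# cost 5 for rejecting an actually good applicant (case 1),
# cost 2 for accepting an actually bad one (case 2)
cost = [[0, 5],
        [2, 0]]

total_cost = sum(confusion[i][j] * cost[i][j] for i in range(2) for j in range(2))
n = sum(map(sum, confusion))
print(total_cost, total_cost / n)  # -> 926 0.926
```

With equal costs (all off-diagonal entries 1) the total cost is just the number of misclassifications, which is why problem 6 reports no separate cost figure.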

8. Do you think it is a good idea to prefer simple decision trees instead of long, complex decision trees? How does the complexity of a decision tree relate to the bias of the model?

When we consider long, complex decision trees, we have many unnecessary attributes in the tree, which increases the bias of the model; because of this, the accuracy of the model can also be affected. This problem can be reduced by considering a simple decision tree: the attributes are fewer, the bias of the model decreases, and the result is more accurate. So it is a good idea to prefer simple decision trees instead of long, complex ones.

1. Open an existing ARFF file, e.g. labor.arff.
2. In the Preprocess tab, click All to select all the attributes.
3. Go to the Classify tab and run the J48 algorithm with "Use training set".

4. To generate the decision tree, right-click on the entry in the result list and select the Visualize tree option; the decision tree will be displayed.

5. Right-click on the J48 entry to open the Generic Object Editor window.
6. In this window, set the unpruned option to True.
7. Press OK and then Start; we find that the tree becomes more complex if it is not pruned. Visualize the tree again.

8. The tree has become more complex.

9. (Extra credit): How can you convert a decision tree into "if-then-else" rules? Make up your own small decision tree consisting of 2-3 levels and convert it into a set of rules. There also exist classifiers that output the model directly in the form of rules; one such classifier in WEKA is rules.PART. Train this model and report the set of rules obtained. Sometimes just one attribute can be good enough in making the decision, yes, just one! Can you predict which attribute that might be in this dataset? The OneR classifier uses a single attribute to make decisions (it chooses the attribute based on minimum error). Report the rule obtained by training a OneR classifier. Rank the performance of J48, PART and OneR.

In WEKA, rules.PART is one of the classifiers that converts decision trees into "IF-THEN-ELSE" rules.

Converting decision trees into IF-THEN-ELSE rules using the rules.PART classifier:

PART decision list
outlook = overcast: yes (4.0)
windy = TRUE: no (4.0/1.0)
outlook = sunny: no (3.0/1.0)
: yes (3.0)
Number of Rules: 4

Yes, sometimes just one attribute can be good enough in making the decision. In this (weather) dataset, the single attribute used for making the decision is outlook:

outlook:
sunny -> no
overcast -> yes
rainy -> yes
(10/14 instances correct)

With respect to time, the OneR classifier has the highest ranking, J48 is in 2nd place and PART gets 3rd place:

Time (sec) rank: J48 - II, PART - III, OneR - I

But if you consider the accuracy, the J48 classifier has the highest ranking, PART gets second place and OneR comes last:

Accuracy rank: J48 - I, PART - II, OneR - III (66.8 %)

1. Open an existing file, e.g. weather.nominal.arff.
2. Click All to select all attributes.
3. Go to the Classify tab.
4. Click Start.
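OneR itself is simple enough to sketch in a few lines: for a single attribute, predict the majority class of each attribute value and count the errors. Run on the 14-instance weather data, this sketch reproduces the outlook rule quoted above (10/14 correct):

```python
from collections import Counter, defaultdict

def one_r(values, labels):
    """OneR for one attribute: map each attribute value to its majority
    class; return the rule and its number of training errors."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    rule = {v: Counter(ys).most_common(1)[0][0] for v, ys in groups.items()}
    errors = sum(y != rule[v] for v, y in zip(values, labels))
    return rule, errors

outlook = ["sunny"] * 5 + ["overcast"] * 4 + ["rainy"] * 5
play = ["no", "no", "no", "yes", "yes"] + ["yes"] * 4 + ["yes", "yes", "yes", "no", "no"]
rule, errors = one_r(outlook, play)
print(rule)
print(f"{len(play) - errors}/{len(play)} correct")  # -> 10/14 correct
```

The full OneR algorithm simply runs this per-attribute procedure for every attribute and keeps the one with the fewest errors.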

Here the accuracy is 100%.

The tree corresponds to if-then-else rules:

If outlook = overcast then play = yes.
If outlook = sunny and humidity = high then play = no, else play = yes.
If outlook = rainy and windy = true then play = no, else play = yes.

To extract the rules:
1. Go to Choose, click Rules and select PART.
2. Click Save and then Start.
3. Proceed similarly for the OneR algorithm.

If outlook = overcast then play = yes.
If outlook = sunny and humidity = high then play = no.
If outlook = sunny and humidity = normal then play = yes.
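Written as code, the rules above are just a nested if-then-else. A sketch (attribute values follow weather.nominal.arff):

```python
def play(outlook, humidity, windy):
    """The small weather decision tree expressed as nested if-then-else."""
    if outlook == "overcast":
        return "yes"
    if outlook == "sunny":
        return "no" if humidity == "high" else "yes"
    # remaining case: outlook == "rainy"
    return "no" if windy else "yes"

print(play("sunny", "high", False))    # -> no
print(play("rainy", "normal", True))   # -> no
print(play("overcast", "high", True))  # -> yes
```

Each root-to-leaf path of the tree becomes one rule, which is exactly the conversion this exercise asks for.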


More information

Transaction Validity Detection Density Based Clustering

Transaction Validity Detection Density Based Clustering International Journal of Progressive Sciences and Technologies (IJPSAT) ISSN: 2509-0119. 2018International Journals of Sciences and High Technologies http://ijpsat.ijsht-journals.org Vol. 9 No. 1 June

More information

Comparative Study of J48, Naive Bayes and One-R Classification Technique for Credit Card Fraud Detection using WEKA

Comparative Study of J48, Naive Bayes and One-R Classification Technique for Credit Card Fraud Detection using WEKA Advances in Computational Sciences and Technology ISSN 0973-6107 Volume 10, Number 6 (2017) pp. 1731-1743 Research India Publications http://www.ripublication.com Comparative Study of J48, Naive Bayes

More information

Predictive Analysis: Evaluation and Experimentation. Heejun Kim

Predictive Analysis: Evaluation and Experimentation. Heejun Kim Predictive Analysis: Evaluation and Experimentation Heejun Kim June 19, 2018 Evaluation and Experimentation Evaluation Metrics Cross-Validation Significance Tests Evaluation Predictive analysis: training

More information

Excel Assignment 4: Correlation and Linear Regression (Office 2016 Version)

Excel Assignment 4: Correlation and Linear Regression (Office 2016 Version) Economics 225, Spring 2018, Yang Zhou Excel Assignment 4: Correlation and Linear Regression (Office 2016 Version) 30 Points Total, Submit via ecampus by 8:00 AM on Tuesday, May 1, 2018 Please read all

More information

Data Science with R Decision Trees with Rattle

Data Science with R Decision Trees with Rattle Data Science with R Decision Trees with Rattle Graham.Williams@togaware.com 9th June 2014 Visit http://onepager.togaware.com/ for more OnePageR s. In this module we use the weather dataset to explore the

More information

Introduction to the workbook and spreadsheet

Introduction to the workbook and spreadsheet Excel Tutorial To make the most of this tutorial I suggest you follow through it while sitting in front of a computer with Microsoft Excel running. This will allow you to try things out as you follow along.

More information

Designed by Jason Wagner, Course Web Programmer, Office of e-learning NOTE ABOUT CELL REFERENCES IN THIS DOCUMENT... 1

Designed by Jason Wagner, Course Web Programmer, Office of e-learning NOTE ABOUT CELL REFERENCES IN THIS DOCUMENT... 1 Excel Essentials Designed by Jason Wagner, Course Web Programmer, Office of e-learning NOTE ABOUT CELL REFERENCES IN THIS DOCUMENT... 1 FREQUENTLY USED KEYBOARD SHORTCUTS... 1 FORMATTING CELLS WITH PRESET

More information

Introduction to Stata: An In-class Tutorial

Introduction to Stata: An In-class Tutorial Introduction to Stata: An I. The Basics - Stata is a command-driven statistical software program. In other words, you type in a command, and Stata executes it. You can use the drop-down menus to avoid

More information

Weka ( )

Weka (  ) Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised

More information

BerkeleyImageSeg User s Guide

BerkeleyImageSeg User s Guide BerkeleyImageSeg User s Guide 1. Introduction Welcome to BerkeleyImageSeg! This is designed to be a lightweight image segmentation application, easy to learn and easily automated for repetitive processing

More information

Quick Start Guide Jacob Stolk PhD Simone Stolk MPH November 2018

Quick Start Guide Jacob Stolk PhD Simone Stolk MPH November 2018 Quick Start Guide Jacob Stolk PhD Simone Stolk MPH November 2018 Contents Introduction... 1 Start DIONE... 2 Load Data... 3 Missing Values... 5 Explore Data... 6 One Variable... 6 Two Variables... 7 All

More information

Model Selection Introduction to Machine Learning. Matt Gormley Lecture 4 January 29, 2018

Model Selection Introduction to Machine Learning. Matt Gormley Lecture 4 January 29, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Model Selection Matt Gormley Lecture 4 January 29, 2018 1 Q&A Q: How do we deal

More information

Computing a Gain Chart. Comparing the computation time of data mining tools on a large dataset under Linux.

Computing a Gain Chart. Comparing the computation time of data mining tools on a large dataset under Linux. 1 Introduction Computing a Gain Chart. Comparing the computation time of data mining tools on a large dataset under Linux. The gain chart is an alternative to confusion matrix for the evaluation of a classifier.

More information

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York A Systematic Overview of Data Mining Algorithms Sargur Srihari University at Buffalo The State University of New York 1 Topics Data Mining Algorithm Definition Example of CART Classification Iris, Wine

More information

HOUR 12. Adding a Chart

HOUR 12. Adding a Chart HOUR 12 Adding a Chart The highlights of this hour are as follows: Reasons for using a chart The chart elements The chart types How to create charts with the Chart Wizard How to work with charts How to

More information

GeoVISTA Studio Tutorial. What is GeoVISTA Studio? Why is it part of the map making and visualization workshop?

GeoVISTA Studio Tutorial. What is GeoVISTA Studio? Why is it part of the map making and visualization workshop? GeoVISTA Studio Tutorial What is GeoVISTA Studio? Why is it part of the map making and visualization workshop? GeoVISTA Studio is a Java-based environment for visually assembling JavaBeans software components

More information

Tanagra: An Evaluation

Tanagra: An Evaluation Tanagra: An Evaluation Jessica Enright Jonathan Klippenstein November 5th, 2004 1 Introduction to Tanagra Tanagra was written as an aid to education and research on data mining by Ricco Rakotomalala [1].

More information

Barchard Introduction to SPSS Marks

Barchard Introduction to SPSS Marks Barchard Introduction to SPSS 22.0 3 Marks Purpose The purpose of this assignment is to introduce you to SPSS, the most commonly used statistical package in the social sciences. You will create a new data

More information

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis CHAPTER 3 BEST FIRST AND GREEDY SEARCH BASED CFS AND NAÏVE BAYES ALGORITHMS FOR HEPATITIS DIAGNOSIS 3.1 Introduction

More information

User Services Spring 2008 OBJECTIVES Introduction Getting Help Instructors

User Services Spring 2008 OBJECTIVES  Introduction Getting Help  Instructors User Services Spring 2008 OBJECTIVES Use the Data Editor of SPSS 15.0 to to import data. Recode existing variables and compute new variables Use SPSS utilities and options Conduct basic statistical tests.

More information

Nearest Neighbor Classification

Nearest Neighbor Classification Nearest Neighbor Classification Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms January 11, 2017 1 / 48 Outline 1 Administration 2 First learning algorithm: Nearest

More information

CRITERION Vantage 3 Admin Training Manual Contents Introduction 5

CRITERION Vantage 3 Admin Training Manual Contents Introduction 5 CRITERION Vantage 3 Admin Training Manual Contents Introduction 5 Running Admin 6 Understanding the Admin Display 7 Using the System Viewer 11 Variables Characteristic Setup Window 19 Using the List Viewer

More information

1. Basic Steps for Data Analysis Data Editor. 2.4.To create a new SPSS file

1. Basic Steps for Data Analysis Data Editor. 2.4.To create a new SPSS file 1 SPSS Guide 2009 Content 1. Basic Steps for Data Analysis. 3 2. Data Editor. 2.4.To create a new SPSS file 3 4 3. Data Analysis/ Frequencies. 5 4. Recoding the variable into classes.. 5 5. Data Analysis/

More information

Spreadsheet Concepts: Creating Charts in Microsoft Excel

Spreadsheet Concepts: Creating Charts in Microsoft Excel Spreadsheet Concepts: Creating Charts in Microsoft Excel lab 6 Objectives: Upon successful completion of Lab 6, you will be able to Create a simple chart on a separate chart sheet and embed it in the worksheet

More information

Exploring Data. This guide describes the facilities in SPM to gain initial insights about a dataset by viewing and generating descriptive statistics.

Exploring Data. This guide describes the facilities in SPM to gain initial insights about a dataset by viewing and generating descriptive statistics. This guide describes the facilities in SPM to gain initial insights about a dataset by viewing and generating descriptive statistics. 2018 by Minitab Inc. All rights reserved. Minitab, SPM, SPM Salford

More information

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München Evaluation Measures Sebastian Pölsterl Computer Aided Medical Procedures Technische Universität München April 28, 2015 Outline 1 Classification 1. Confusion Matrix 2. Receiver operating characteristics

More information

Perform the following steps to set up for this project. Start out in your login directory on csit (a.k.a. acad).

Perform the following steps to set up for this project. Start out in your login directory on csit (a.k.a. acad). CSC 458 Data Mining and Predictive Analytics I, Fall 2017 (November 22, 2017) Dr. Dale E. Parson, Assignment 4, Comparing Weka Bayesian, clustering, ZeroR, OneR, and J48 models to predict nominal dissolved

More information

Skills Exam Objective Objective Number

Skills Exam Objective Objective Number Overview 1 LESSON SKILL MATRIX Skills Exam Objective Objective Number Starting Excel Create a workbook. 1.1.1 Working in the Excel Window Customize the Quick Access Toolbar. 1.4.3 Changing Workbook and

More information

Introduction to Data Science

Introduction to Data Science Introduction to Data Science Lab 4 Introduction to Machine Learning Overview In the previous labs, you explored a dataset containing details of lemonade sales. In this lab, you will use machine learning

More information

TexRAD Research Version Client User Guide Version 3.9

TexRAD Research Version Client User Guide Version 3.9 Imaging tools for medical decision makers Cambridge Computed Imaging Ltd Grange Park Broadway Bourn Cambridge CB23 2TA UK TexRAD Research Version Client User Guide Version 3.9 Release date 23/05/2016 Number

More information

SAS Visual Analytics 8.2: Getting Started with Reports

SAS Visual Analytics 8.2: Getting Started with Reports SAS Visual Analytics 8.2: Getting Started with Reports Introduction Reporting The SAS Visual Analytics tools give you everything you need to produce and distribute clear and compelling reports. SAS Visual

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

You are to turn in the following three graphs at the beginning of class on Wednesday, January 21.

You are to turn in the following three graphs at the beginning of class on Wednesday, January 21. Computer Tools for Data Analysis & Presentation Graphs All public machines on campus are now equipped with Word 2010 and Excel 2010. Although fancier graphical and statistical analysis programs exist,

More information

Studying in the Sciences

Studying in the Sciences Organising data and creating figures (charts and graphs) in Excel What is in this guide Familiarisation with Excel (for beginners) Setting up data sheets Creating a chart (graph) Formatting the chart Creating

More information

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

Supervised and Unsupervised Learning (II)

Supervised and Unsupervised Learning (II) Supervised and Unsupervised Learning (II) Yong Zheng Center for Web Intelligence DePaul University, Chicago IPD 346 - Data Science for Business Program DePaul University, Chicago, USA Intro: Supervised

More information

Introduction to Microsoft Excel 2010

Introduction to Microsoft Excel 2010 Introduction to Microsoft Excel 2010 This class is designed to cover the following basics: What you can do with Excel Excel Ribbon Moving and selecting cells Formatting cells Adding Worksheets, Rows and

More information

Lab Exercise Two Mining Association Rule with WEKA Explorer

Lab Exercise Two Mining Association Rule with WEKA Explorer Lab Exercise Two Mining Association Rule with WEKA Explorer 1. Fire up WEKA to get the GUI Chooser panel. Select Explorer from the four choices on the right side. 2. To get a feel for how to apply Apriori,

More information