COMP9417 16s1 - Getting started with the Weka Machine Learning Toolkit


Last revision: Thu Mar 16 2016

1 Aims

This introduction is the starting point for Assignment 1, which requires the use of the Weka Machine Learning Toolkit. The aim of the assignment is to enable you to compare the performance of different machine learning algorithms on a varied range of datasets and produce an interpretation of the empirical results, a standard data mining task. By completing this introduction you will become sufficiently familiar with running Weka to complete the assignment. The assignment questions will be released separately.

2 Introduction to Weka

This section is an introduction to the Weka toolkit using the two Weka GUIs used in the assignment, the Weka Explorer and Experimenter. If you are already familiar with Weka you can skip this. Note: this section of the assignment is just an introduction and is not assessed, so you don't have to submit anything! The objective is to familiarise yourself with some of the capabilities of Weka that you will use for the assessed section.

2.1 Setting up Weka

The entire Weka package is around 25-35 MB, depending on platform, and can be downloaded from the Weka website. If you are working on a non-CSE machine, or you have enough disk space, you can do this. Otherwise, if you need to save space on your CSE machine, you can link to the Weka software.

Note: on this course we will use the most recent stable release of Weka, which is version 3-6. This is the book version and corresponds to the descriptions in Witten, Frank and Hall (2011); the latest release of this version is Weka 3-6-13 and can be downloaded from the Weka website.

Although there should be few, if any, differences between various Weka 3-6 versions of the core algorithms, for the purposes of this assignment version 3-6-13 has been used for testing.

At CSE: In your home directory do the following. First, make a symbolic link from the CSE Weka directory to your home directory. Call it something like myweka:

    $ ln -s ~weka/weka-3-6-13 myweka

Do a quick check to see if this link looks OK:

    $ ls -l myweka

You should see something like:

    myweka -> /import/adams/1/weka/weka-3-6-13/

Then change to your local Weka directory:

    $ cd myweka

And have a look at what there is:

    $ ls

You should see a number of Java .jar files, plus some other stuff including a README file and a PDF file: WekaManual.pdf. This is a comprehensive and well-written document, although its organisation is a little strange (most users will probably not start off with the command-line interface), but there is a lot of good information in here, especially the sections on the main user interfaces we will use (Explorer and Experimenter), plus details on the Weka file format (.arff) and much else besides. This, plus the resources at the Weka project website, are all useful for this assignment. Alternatively, you can click on documentation.html to get to the manuals, etc.

You can either run Weka in your myweka directory:

    $ java -jar weka.jar &

or set up your environment so you can run it from other directories. Here is one way to do the latter (assumes you are using bash):

    $ WEKAHOME=~weka/weka-3-6-13
    $ export WEKAHOME
    $ CLASSPATH="$CLASSPATH:$WEKAHOME/weka.jar"
    $ export CLASSPATH

Then check that Weka runs for you:

    $ java -jar $WEKAHOME/weka.jar &

You should see a small window, the Weka GUI Chooser, indicating that Weka is running. (You can ignore the error messages about missing database drivers, since we will not be using them in this assignment.) You can choose different tools from this window: we will be using the Explorer to start with.

At home: These instructions should work:

1. go to the Weka home page http://www.cs.waikato.ac.nz/~ml/ and download the latest stable version weka-3-6-13 from the Sourceforge site
2. follow the instructions and install it on your machine
3. if you're on Windows or Mac, check the instructions given on the Web page

2.2 Getting started

Next you will need to copy the compressed tar file of datasets for this introduction, ~/myweka/datasets_intro.tgz or ~/myweka/datasets_intro.zip, to a suitable destination under your home directory. Here's one way. First make a directory at the same level as myweka and call it, say, experiments:

    $ mkdir experiments
    $ cd experiments

Now copy the datasets into this directory:

    $ cp ../myweka/datasets_intro.tgz .

Next uncompress and unpack the datasets archive file:

    $ tar xvfz datasets_intro.tgz
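The datasets you have just unpacked are in Weka's own ARFF format, which is documented in WekaManual.pdf. For orientation, an ARFF file is plain text: a header declaring each attribute and its values, followed by one comma-separated line per instance. Here is a sketch in the style of the weather.nominal data used below (abridged; see the actual file for the full set of instances):

```
@relation weather.nominal
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
```

Numeric attributes (as in cpu.arff) are declared with "@attribute name numeric" instead of a list of values.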

2.3 Introduction to Weka Explorer

To start the Weka Explorer just click on the Explorer button in the Weka GUI Chooser. An introduction to this tool is in Section 4 of the file WekaManual.pdf included in the Weka download package, which you can view like this:

    $ xpdf ~/myweka/WekaManual.pdf

To test your knowledge of the Explorer, here are some things you can do:

1) [Preprocess tab] Load the iris data (iris.arff). How many values does the class attribute have in this data set?

2) [Classify tab] Choose the rules classifier ConjunctiveRule, leaving the parameters at their default settings. Select a 70% split of the data into training and test sets. Run ConjunctiveRule. How many incorrectly classified instances are there in the test set?

3) [Classify tab] Choose the decision tree classifier J48. Again, leave the parameters at their default settings and run J48 with the same percentage training/test split as before. Does this give the same Confusion Matrix as you got with ConjunctiveRule? What has changed?

4) [Preprocess and Associate tabs] Load the dataset weather.nominal.arff. Switch to the Associate tab and run Apriori with the default parameters. According to the rules discovered, if the outlook is sunny and the humidity is high, should you play? What kind of rule is this?

5) [Preprocess and Classify tabs] Load the dataset cpu.arff. Click to select the attribute vendor in the Attributes panel, then click the Remove button under this panel; the attribute should have gone away. Switch to the Classify tab and choose LinearRegression from the set of functions. Select Cross-validation under Test options and click Start. Now have a look at the Linear Regression Model. From this model, which attribute is the most important (i.e., has the largest weight)?

The answers: 1) 3; 2) 15; 3) no, now only two incorrectly classified Iris-virginica examples; 4) no, it is a classification rule for play; 5) CHMAX (1.1868).
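The classifiers in the exercises above can also be run from the shell rather than the Explorer, which is handy for scripting. The invocation below is a sketch, assuming the CLASSPATH setup from Section 2.1 and that iris.arff is in the current directory:

```shell
# Run the J48 scheme directly from the command line.
# -t names the training file; with no separate test file (-T),
# Weka evaluates by cross-validation, and -x sets the number of folds.
java weka.classifiers.trees.J48 -t iris.arff -x 10
```

You should see the pruned tree followed by evaluation statistics similar to the Explorer's output; WekaManual.pdf documents the full set of command-line options.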

2.4 Introduction to Weka Experimenter

The Weka Experimenter is an alternative GUI for Weka. It complements the Weka Explorer, which is designed for exploratory data analysis. The Weka Experimenter allows you to set up, run and analyse comparative experiments in which different machine learning algorithms can be evaluated by their performance on a range of different datasets. This is the main interface to the software you will be using for the assignment, although the Weka Explorer will also be used. Help is available in the file WekaManual.pdf. Follow the instructions below to get started.

2.4.1 Start Weka Experimenter

1. click on the Experimenter button in the Weka GUI Chooser
2. click on the Setup tab at the top of the Experiment Environment window
3. click on the Simple button to select Experiment Configuration Mode
4. click New to initialize the experiment

2.4.2 Adding datasets

Do the following in the Datasets panel of the Setup window:

1. check the Use relative... checkbox
2. click on Add new... to open a dialog window
3. navigate to your experiments directory
4. select anneal.arff and click Open to add it to the list
5. you should see anneal.arff displayed in the Datasets panel
6. now add, in the same way, two more datasets: audiology.arff and vote.arff
7. if you need to delete a dataset, select it in the Datasets panel and just click Delete Selected

2.4.3 Save and load experiment definitions

If you are setting up a complicated experiment, you should regularly save the experiment definition so you can reload it to go back to a previous point in your work. Here's how:

1. select Save... at the top of the Setup window
2. navigate to your experiments directory
3. choose a name, such as myexpt.exp (the .exp extension is required)
4. click on Save
5. to restore the experiment definition at any time, first select Open in the Setup window, then select myexpt.exp (or whatever you called it) in the dialog window

2.4.4 Results

You need a file to save the actual results from each run of a machine learning algorithm on a dataset:

1. in the Results Destination panel select ARFF file
2. click on Browse... and navigate to your experiments directory
3. choose a name, such as myexpt.arff (the .arff extension is required)
4. click on Save
5. you should now see your file in the Results Destination panel

2.4.5 Evaluation: setting up 10-fold cross-validation

The results of interest to us will be generated under the control of parameters set via the Experiment Type and Iteration Control panels.

1. in the Experiment Type panel the default selection is Cross-Validation; leave this unchanged
2. the default number of folds is 10; leave this unchanged
3. the default method is Classification; leave this unchanged
4. in the Iteration Control panel the default number of repetitions is 10; leave this unchanged
5. the Data sets first button in the Iteration Control panel should be selected

2.4.6 Execution: additional schemes

Now you are ready to run some machine learning algorithms. In Weka terminology these are referred to as schemes. Add the following schemes via the Algorithms panel:

1. click on Add new...
2. the default scheme is ZeroR
3. click on OK to add ZeroR
4. you should see that ZeroR has been added
5. if we didn't want to use ZeroR we could click on Delete to remove it
6. since we do want to use it, we won't do that!
7. instead, click on Add new... again
8. this time click on Choose
9. select OneR under classifiers/rules, then click OK
10. repeat these steps to add J48 (find it under classifiers/trees)
11. now would be a good time to save your experiment configuration (see above)

Notice that some schemes, like ZeroR, have no parameters. Others, like J48, have a number of parameters which you can change in the dialog box. However, for now we will just use the default parameter settings.

Now we have finished setting up the experiment, so let's run it. This will be 10 repetitions of 10-fold cross-validation for each scheme on each of the datasets.

1. click on the Run tab at the top of the Experiment Environment window
2. click Start
3. in the Status bar at the bottom of the Experiment Environment window you should see an indication of the progress of the experiment
4. on completion, you should see in the Log window something like:

    10:27:26: Started
    10:27:37: Finished
    10:27:37: There were 0 errors

On your machine the runtime may be different, but it shouldn't take very long. Note: some algorithms can take a long time to run on certain datasets, so running a complete cross-validation experiment can be time-consuming. However, you can set it running and come back from time to time to check progress.

2.4.7 Analysis: compare schemes

Now we can use the Weka Experimenter to compare the schemes:

1. click on the Analyse tab at the top of the Experiment Environment window
2. click on Experiment to analyse the results of the current experiment
3. in the Source panel you should see Got 900 results (10 repetitions of 10-fold cross-validation for 3 algorithms on 3 datasets: 10 x 10 x 3 x 3 = 900)
4. the next operations are applied in the Configure test panel
5. in the Comparison field box select Percent correct
6. click Select in the Test base field, and in the popup box select ZeroR
7. click on the Show std. deviations check box (you should always do this!)
8. click on Perform test to automatically generate the comparisons with statistical significance estimates

In the Test output window you should see a table of the comparisons for each of the three schemes in the experiment. The baseline scheme is ZeroR; this tells you the default predictive accuracy, i.e., how accurately you can classify the test data just by predicting that each instance belongs to the most frequent class in the training set. For each of the other schemes in the table you should see an entry containing the percent correct, the standard deviation in brackets, and possibly an annotation symbol indicating a statistically significant result:

- A v symbol means the result is significantly higher than the baseline classifier.
- A * symbol means the result is significantly lower than the baseline classifier.

The default statistical significance level is 0.05. At the bottom of each column (apart from the first) is a count of the number of datasets on which the scheme in the column was significantly above/equal/below the baseline.

2.4.8 Saving results

To save the current contents of the Test output window, if you want to start a new file:

1. click on Save output
2. navigate to your experiments directory
3. enter a file name such as myexpt.res
4. click on Save

Or, if you already have a result file, then:

1. select it in the window, or
2. enter its file name, such as myexpt.res
3. click on Save
4. click on Append to add results to the designated output file

In addition to the percentage correct scores, which are recorded with respect to the baseline scheme, we might want to save a summary of results for each classifier against all the others:

1. click Select in the Test base field, and in the popup box select Summary
2. click on Perform test

In the Test output window you should see an all-against-all table. Each scheme is designated by a letter. Each row is read as a summary comparison of that scheme against all other schemes. An entry n in this table at a particular row and column indicates that the column scheme is better than the row scheme on n of the datasets, with the number in brackets (m) meaning it was statistically significantly better m times. A 0 indicates no (statistically significant) difference between the column and row schemes on any of the datasets. A dash - indicates the null comparison of a scheme against itself.

The summary table gives some extra information over the test against a baseline scheme: for example, it turns out that although J48 was better than OneR on all 3 datasets, it was statistically significantly better on only 2. The test for statistical significance is explained further in WekaManual.pdf and the Weka textbook.

Save the summary output to the output file as you did for the percentage correct scores:

1. click on Save output
2. select myexpt.res
3. click on Save
4. a File query box will be displayed; click on Append

The Experimenter is quite powerful and will often be the main tool you use in Weka; you might want to play around a bit more to get a feel for what you can do with it.
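As a final sanity check on the ZeroR baseline used in the analysis above: its accuracy is just the relative frequency of the most common class, so you can reproduce the idea with standard shell tools. The five-line label file below is a made-up stand-in, not one of the assignment datasets:

```shell
# ZeroR predicts the majority class for every instance, so its
# accuracy is the proportion of instances in the majority class.
printf 'yes\nyes\nyes\nno\nno\n' > labels.txt

# Count each class, keep the largest count, report it as a percentage.
awk '{count[$1]++; n++}
     END {best = 0
          for (k in count) if (count[k] > best) best = count[k]
          printf "%.0f%%\n", 100 * best / n}' labels.txt
# prints 60% for this toy file (3 of the 5 labels are "yes")
```

To apply this to a real dataset you would first extract the class column from the .arff data section, but within Weka the same number is simply ZeroR's "Percent correct" entry.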