An Introduction to WEKA Explorer. In part from: Yizhou Sun 2008

Similar documents
Prototyping DM Techniques with WEKA and YALE Open-Source Software

Classification using Weka (Brain, Computation, and Neural Learning)

Data Mining With Weka A Short Tutorial

DATA ANALYSIS WITH WEKA. Author: Nagamani Mutteni Asst.Professor MERI

Association Rule Mining. Entscheidungsunterstützungssysteme

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Association mining rules

Chapter 4: Mining Frequent Patterns, Associations and Correlations

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the

Data Structures. Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali Association Rules: Basic Concepts and Application

Chapter 4 Data Mining A Short Introduction

Frequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar

Machine Learning: Symbolische Ansätze

Chapter 7: Frequent Itemsets and Association Rules

Supervised and Unsupervised Learning (II)

Nesnelerin İnternetinde Veri Analizi

Practical Data Mining COMP-321B. Tutorial 6: Association Rules

Data Mining Clustering

Fundamental Data Mining Algorithms

WEKA: Practical Machine Learning Tools and Techniques in Java. Seminar A.I. Tools WS 2006/07 Rossen Dimov

Machine Learning Techniques for Data Mining

Frequent Pattern Mining

Data Mining Algorithms

Decision Support Systems

Association Rules Apriori Algorithm

A Novel method for Frequent Pattern Mining

Data Mining. Lab 1: Data sets: characteristics, formats, repositories Introduction to Weka. I. Data sets. I.1. Data sets characteristics and formats

PESIT- Bangalore South Campus Hosur Road (1km Before Electronic city) Bangalore

CS 8520: Artificial Intelligence. Weka Lab. Paula Matuszek Fall, CSC 8520 Fall Paula Matuszek

Improved Frequent Pattern Mining Algorithm with Indexing

Data Mining: Concepts and Techniques. Chapter 5. SS Chung. April 5, 2013 Data Mining: Concepts and Techniques 1

Jarek Szlichta

Rule induction. Dr Beatriz de la Iglesia

Study on Classifiers using Genetic Algorithm and Class based Rules Generation

WEKA homepage.

Slides for Data Mining by I. H. Witten and E. Frank

Association Rule Discovery

Mining Association Rules in Large Databases

Association Rules Apriori Algorithm

Data Mining Course Overview

ANU MLSS 2010: Data Mining. Part 2: Association rule mining

Tutorial on Association Rule Mining

Weka: Practical machine learning tools and techniques with Java implementations

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

Performance Analysis of Data Mining Classification Techniques

Data warehouse and Data Mining

Classification by Association

WEKA Explorer User Guide for Version 3-4

Data Mining Concept. References. Why Mine Data? Commercial Viewpoint. Why Mine Data? Scientific Viewpoint

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 6

Lecture notes for April 6, 2005

Introduction to Artificial Intelligence

Temporal Weighted Association Rule Mining for Classification

Association Pattern Mining. Lijun Zhang

Basic Concepts Weka Workbench and its terminology

Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values

Jue Wang (Joyce) Department of Computer Science, University of Massachusetts, Boston Feb Outline

The Explorer. chapter Getting started

Attribute Discretization and Selection. Clustering. NIKOLA MILIKIĆ UROŠ KRČADINAC

CS570 Introduction to Data Mining

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Tutorial on Machine Learning Tools

Implementing Breiman s Random Forest Algorithm into Weka

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A.

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Chapter 4 Data Mining A Short Introduction. 2005/6, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis Data Mining - 1

Association rules. Marco Saerens (UCL), with Christine Decaestecker (ULB)

Understanding Rule Behavior through Apriori Algorithm over Social Network Data

DATA ANALYSIS I. Types of Attributes Sparse, Incomplete, Inaccurate Data

Machine Learning: Algorithms and Applications Mockup Examination

What Is Data Mining? CMPT 354: Database I -- Data Mining 2

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

Short instructions on using Weka

Association Rules and

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

Road Map. Objectives. Objectives. Frequent itemsets and rules. Items and transactions. Association Rules and Sequential Patterns

Data Mining. 3.3 Rule-Based Classification. Fall Instructor: Dr. Masoud Yaghini. Rule-Based Classification

Data Mining and Soft Computing

Association Rule Mining: FP-Growth

COMP 6838 Data MIning

Chapter 7: Frequent Itemsets and Association Rules

Keywords- Classification algorithm, Hypertensive, K Nearest Neighbor, Naive Bayesian, Data normalization

Association Rule Learning

Mining Association Rules in Large Databases

Frequent Pattern Mining S L I D E S B Y : S H R E E J A S W A L

Data Mining Tools. Jean-Gabriel Ganascia LIP6 University Pierre et Marie Curie 4, place Jussieu, Paris, Cedex 05

WEKA A Machine Learning Workbench for Data Mining

Introduction to Data Mining. Yücel SAYGIN

CSE 5243 INTRO. TO DATA MINING

Data Mining Part 5. Prediction

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

WEKA KnowledgeFlow Tutorial for Version 3-5-6

An Improved Apriori Algorithm for Association Rules

COMPARISON OF DIFFERENT CLASSIFICATION TECHNIQUES

Data Mining. 3.5 Lazy Learners (Instance-Based Learners) Fall Instructor: Dr. Masoud Yaghini. Lazy Learners

SOCIAL MEDIA MINING. Data Mining Essentials

Association Rules. A. Bellaachia Page: 1

Interestingness Measurements

Transcription:

An Introduction to WEKA Explorer In part from: Yizhou Sun 2008

What is WEKA? Waikato Environment for Knowledge Analysis It s a data mining/machine learning tool developed by Department of Computer Science,, New Zealand. Weka is also a bird found only on the islands of New Zealand. 2

Download and Install WEKA Website: http://www.cs.waikato.ac.nz/~ml/weka/index.html Support multiple platforms (written in java): Windows, Mac OS X and Linux 3

Main Features 49 data preprocessing tools 76 classification/regression algorithms 8 clustering algorithms 3 algorithms for finding association rules 15 attribute/subset evaluators + 10 search algorithms for feature selection 4

Main GUI Three graphical user interfaces The Explorer (exploratory data analysis) The Experimenter (experimental environment) The KnowledgeFlow (new process model inspired interface) Simple CLI- provides users without a graphic interface option the ability to execute commands from a terminal window 5

Explorer The Explorer: Preprocess data Classification Clustering Association Rules Attribute Selection Data Visualization References and Resources 6

Explorer: pre-processing the data Data can be imported from a file in various formats: ARFF, CSV, C4.5, binary Data can also be read from a URL or from an SQL database (using JDBC) Pre-processing tools in WEKA are called filters WEKA contains filters for: Discretization, normalization, resampling, attribute selection, transforming and combining attributes, 7

WEKA only deals with flat files @relation heart-disease-simplified @attribute age numeric @attribute sex { female, male} @attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina} @attribute cholesterol numeric @attribute exercise_induced_angina { no, yes} @attribute class { present, not_present} @data 63,male,typ_angina,233,no,not_present 67,male,asympt,286,yes,present 67,male,asympt,229,yes,present 38,female,non_anginal,?,no,not_present... 8

WEKA only deals with flat files @relation heart-disease-simplified @attribute age numeric @attribute sex { female, male} @attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina} @attribute cholesterol numeric @attribute exercise_induced_angina { no, yes} @attribute class { present, not_present} @data 63,male,typ_angina,233,no,not_present 67,male,asympt,286,yes,present 67,male,asympt,229,yes,present 38,female,non_anginal,?,no,not_present... 9

10

11

IRIS dataset 5 attributes, one is the classification 3 classes: setosa, versicolor, virginica

15

Attribute data Min, max and average value of attributes distribution of values :number of items for which: a i = v j a i A,v j V class: distribution of attribute values in the classes

18

19

20

21

Filtering attributes Once the initial data has been selected and loaded the user can select options for refining the experimental data. The options in the preprocess window include selection of optional filters to apply and the user can select or remove different attributes of the data set as necessary to identify specific information. The user can modify the attribute selection and change the relationship among the different attributes by deselecting different choices from the original data set. There are many different filtering options available within the preprocessing window and the user can select the different options based on need and type of data present.

23

24

25

26

27

28

29

30

Discretizes in 10 bins of equal frequency 31

Discretizes in 10 bins of equal frequency 32

Discretizes in 10 bins of equal frequency 33

34

35

36

Explorer: building classifiers Classifiers in WEKA are machine learning algorithmsfor predicting nominal or numeric quantities Implemented learning algorithms include: Conjunctive rules, decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes nets, 37

Explore Conjunctive Rules learner Need a simple dataset with few attributes, let s select the weather dataset

Select a Classifier

Select training method

Right-click to select parameters numantds= number of antecedents, -1= empty rule

Select numantds=10

Results are shown in the right window (can be scrolled)

Can change the right hand side variable

Performance data

Decision Trees with WEKA

47

48

49

50

51

52

53

54

55

56

57

58

59

60

61

62

63

64

65

66

67

68

Explorer: clustering data WEKA contains clusterers for finding groups of similar instances in a dataset Implemented schemes are: k-means, EM, Cobweb, X-means, FarthestFirst Clusters can be visualized and compared to true clusters (if given) Evaluation based on loglikelihood if clustering scheme produces a probability distribution 69

The K-Means Clustering Method Given k, the k-means algorithm is implemented in four steps: Partition objects into k nonempty subsets Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster) Assign each object to the cluster with the nearest seed point Go back to Step 2, stop when no more new assignment 70 marzo 11, 2014

right click: visualize cluster assignement

Explorer: finding associations WEKA contains an implementation of the Apriori algorithm for learning association rules Works only with discrete data Can identify statistical dependencies between groups of attributes: milk, butter bread, eggs (with confidence 0.9 and support 2000) Apriori can compute all rules that have a given minimum support and exceed a given confidence 73

Basic Concepts: Frequent Patterns Tid Items bought 10 Beer, Nuts, Diaper 20 Beer, Coffee, Diaper 30 Beer, Diaper, Eggs 40 Nuts, Eggs, Milk 50 Nuts, Coffee, Diaper, Eggs, Milk Customer buys both Customer buys diaper itemset: A set of one or more items k-itemset X = {x 1,, x k } (absolute) support, or, support count of X: Frequency or occurrence of an itemset X (relative) support, s, is the fraction of transactions that contains X (i.e., the probability that a transaction contains X) An itemset X is frequent if X s support is no less than a minsup threshold 74 Customer buys beer marzo 11, 2014

Basic Concepts: Association Rules Tid 10 20 30 40 50 75 Customer buys beer Items bought Beer, Nuts, Diaper Beer, Coffee, Diaper Beer, Diaper, Eggs Nuts, Eggs, Milk Nuts, Coffee, Diaper, Eggs, Milk Customer buys both Customer buys diaper Find all the rules X à Y with minimum support and confidence support, s, probability that a transaction contains X Y confidence, c, conditional probability that a transaction having X also contains Y Let minsup = 50%, minconf = 50% Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3 n Association rules: (many more!) n Beer à Diaper (60%, 100%) n Diaper à Beer (60%, 75%) marzo 11, 2014

77

78

79

80

81

1. adoption-of-the-budget-resolution=y physician-fee-freeze=n 219 ==> Class=democrat 219 conf:(1) 2. adoption-of-the-budget-resolution=y physician-fee-freeze=n aid-tonicaraguan-contras=y 198 ==> Class=democrat 198 conf:(1) 3. physician-fee-freeze=n aid-to-nicaraguan-contras=y 211 ==> Class=democrat 210 conf:(1) ecc.

Explorer: attribute selection Panel that can be used to investigate which (subsets of) attributes are the most predictive ones Attribute selection methods contain two parts: A search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking An evaluation method: correlation-based, wrapper, information gain, chi-squared, Very flexible: WEKA allows (almost) arbitrary combinations of these two 83

84

85

86

87

88

89

90

91

Explorer: data visualization Visualization very useful in practice: e.g. helps to determine difficulty of the learning problem WEKA can visualize single attributes (1-d) and pairs of attributes (2-d) To do: rotating 3-d visualizations (Xgobi-style) Color-coded class values Jitter option to deal with nominal attributes (and to detect hidden data points) Zoom-in function 92

94

95

96

97

click on a cell 98

99

100

101

102

103

References and Resources References: WEKA website: http://www.cs.waikato.ac.nz/~ml/weka/index.html WEKA Tutorial: Machine Learning with WEKA: A presentation demonstrating all graphical user interfaces (GUI) in Weka. A presentation which explains how to use Weka for exploratory data mining. WEKA Data Mining Book: Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition) WEKA Wiki: http://weka.sourceforge.net/wiki/index.php/ Main_Page Others: Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd ed.