WEKA: Practical Machine Learning Tools and Techniques in Java. Seminar A.I. Tools WS 2006/07 Rossen Dimov

Similar documents
Weka: Practical machine learning tools and techniques with Java implementations

Basic Concepts Weka Workbench and its terminology

An Empirical Study on feature selection for Data Classification

Jue Wang (Joyce) Department of Computer Science, University of Massachusetts, Boston Feb Outline

DATA ANALYSIS WITH WEKA. Author: Nagamani Mutteni Asst.Professor MERI

Data Mining. Part 1. Introduction. 1.4 Input. Spring Instructor: Dr. Masoud Yaghini. Input

Data Mining. Part 1. Introduction. 1.3 Input. Fall Instructor: Dr. Masoud Yaghini. Input

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A.

Practical Data Mining COMP-321B. Tutorial 1: Introduction to the WEKA Explorer

Data Mining. 1.3 Input. Fall Instructor: Dr. Masoud Yaghini. Chapter 3: Input

Data Mining and Analytics

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

Machine Learning Chapter 2. Input

An Introduction to WEKA Explorer. In part from: Yizhou Sun 2008

The Explorer. chapter Getting started

Weka ( )

Decision Tree Learning

IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER

Prediction. What is Prediction. Simple methods for Prediction. Classification by decision tree induction. Classification and regression evaluation

WEKA Explorer User Guide for Version 3-4

CONCEPT FORMATION AND DECISION TREE INDUCTION USING THE GENETIC PROGRAMMING PARADIGM

Decision Trees In Weka,Data Formats

Summary. Machine Learning: Introduction. Marcin Sydow

TUBE: Command Line Program Calls

Decision Trees Using Weka and Rattle

Data Mining Practical Machine Learning Tools and Techniques

WEKA homepage.

Horizontal Aggregation in SQL to Prepare Dataset for Generation of Decision Tree using C4.5 Algorithm in WEKA

Data Mining. Lab 1: Data sets: characteristics, formats, repositories Introduction to Weka. I. Data sets. I.1. Data sets characteristics and formats

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique

Analysis of a Population of Diabetic Patients Databases in Weka Tool P.Yasodha, M. Kannan

Input: Concepts, Instances, Attributes

Machine Learning Techniques for Data Mining

Representing structural patterns: Reading Material: Chapter 3 of the textbook by Witten

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

Data Mining Algorithms: Basic Methods

WEKA A Machine Learning Workbench for Data Mining

Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic

CSIS. Pattern Recognition. Prof. Sung-Hyuk Cha Fall of School of Computer Science & Information Systems. Artificial Intelligence CSIS

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others

Chapter 4: Algorithms CS 795

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

COMPARISON OF DIFFERENT CLASSIFICATION TECHNIQUES

Classification with Decision Tree Induction

COMP33111: Tutorial and lab exercise 7

Data Mining With Weka A Short Tutorial

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44

Weka VotedPerceptron & Attribute Transformation (1)

AMOL MUKUND LONDHE, DR.CHELPA LINGAM

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others

Using Google s PageRank Algorithm to Identify Important Attributes of Genes

Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees

ISSN: (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies

Sabbatical Leave Report

Comparative Study of J48, Naive Bayes and One-R Classification Technique for Credit Card Fraud Detection using WEKA

Association Rule Learning

Practical Data Mining COMP-321B. Tutorial 5: Article Identification

WEKA KnowledgeFlow Tutorial for Version 3-5-6

Attribute Discretization and Selection. Clustering. NIKOLA MILIKIĆ UROŠ KRČADINAC

Supervised and Unsupervised Learning (II)

ESERCITAZIONE PIATTAFORMA WEKA. Croce Danilo Web Mining & Retrieval 2015/2016

Chapter 4: Algorithms CS 795

Outline. RainForest A Framework for Fast Decision Tree Construction of Large Datasets. Introduction. Introduction. Introduction (cont d)

Information Retrieval & Text Mining

Performance Analysis of Data Mining Classification Techniques

Epitopes Toolkit (EpiT) Yasser EL-Manzalawy August 30, 2016

Data Engineering. Data preprocessing and transformation

The Role of Biomedical Dataset in Classification

A neural-networks associative classification method for association rule mining

Slides for Data Mining by I. H. Witten and E. Frank

Introduction to Machine Learning CANB 7640

What is Data Mining? Data Mining. Data Mining Architecture. Illustrative Applications. Pharmaceutical Industry. Pharmaceutical Industry

Classification using Weka (Brain, Computation, and Neural Learning)

1. make a scenario and build a bayesian network + conditional probability table! use only nominal variable!

New ensemble methods for evolving data streams

Large Scale Data Analysis Using Deep Learning

Data Mining: An experimental approach with WEKA on UCI Dataset

CHAPTER 3 MACHINE LEARNING MODEL FOR PREDICTION OF PERFORMANCE

Short instructions on using Weka

An Empirical Study of Lazy Multilabel Classification Algorithms

Data Mining and Machine Learning: Techniques and Algorithms

A Wrapper for Reweighting Training Instances for Handling Imbalanced Data Sets

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci

Tutorial on Machine Learning Tools

What is Data Mining? Data Mining. Data Mining Architecture. Illustrative Applications. Pharmaceutical Industry. Pharmaceutical Industry

In this project, I examined methods to classify a corpus of s by their content in order to suggest text blocks for semi-automatic replies.

Comparative Study of Instance Based Learning and Back Propagation for Classification Problems

Classification Algorithms in Data Mining

BITS F464: MACHINE LEARNING

WEKA Explorer User Guide for Version 3-5-5

K Nearest Neighbor Wrap Up K- Means Clustering. Slides adapted from Prof. Carpuat

Tillämpad Artificiell Intelligens Applied Artificial Intelligence Tentamen , , MA:8. 1 Search (JM): 11 points

Data Mining and Machine Learning. Instance-Based Learning. Rote Learning k Nearest-Neighbor Classification. IBL and Rule Learning

Data Set. What is Data Mining? Data Mining (Big Data Analytics) Illustrative Applications. What is Knowledge Discovery?

Decision Trees: Discussion

Model Selection Introduction to Machine Learning. Matt Gormley Lecture 4 January 29, 2018

What Is Data Mining? CMPT 354: Database I -- Data Mining 2

Machine Learning & Data Mining

CSC411/2515 Tutorial: K-NN and Decision Tree

Transcription:

WEKA: Practical Machine Learning Tools and Techniques in Java Seminar A.I. Tools WS 2006/07 Rossen Dimov

Overview Basic introduction to Machine Learning Weka Tool Conclusion Document classification Demo

What is Machine Learning Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

What is Machine Learning T playing chess P percentage of wins E 1000 recorded whole games

Basic definitions Outlook Temperature Humidity Windy Surfing 1 Sunny Mild Normal True Yes 2 Sunny Hot High False No 3 Rainy Mild High False No 4 Overcast Cool Normal True Yes

Basic definitions Attributes Outlook Temperature Humidity Windy Surfing 1 Sunny Mild Normal True Yes 2 Sunny Hot High False No 3 Rainy Mild High False No 4 Overcast Cool Normal True Yes

Basic definitions Special Attribute Class Attribute Outlook Temperature Humidity Windy Surfing 1 Sunny Mild Normal True Yes 2 Sunny Hot High False No 3 Rainy Mild High False No 4 Overcast Cool Normal True Yes

Basic definitions Instance Outlook Temperature Humidity Windy Surfing 1 Sunny Mild Normal True Yes 2 Sunny Hot High False No 3 Rainy Mild High False No 4 Overcast Cool Normal True Yes

Basic definitions Dataset Outlook Temperature Humidity Windy Surfing 1 Sunny Mild Normal True Yes 2 Sunny Hot High False No 3 Rainy Mild High False No 4 Overcast Cool Normal True Yes

Basic definitions T test set: the class attribute of every instance has no value, and it should be predicted atr1 attr2 attr3 cl_attr E training set: the class attribute of every instance has a value, inserted by expert or with experiment atr1 attr2 attr3 cl_attr a1_v1 a2_v3 a3_v2 cl_1 a1_v2 a2_v2 a3_v1 cl_2 a1_v1 a2_v1 a3_v1? a1_v2 a2_v2 a3_v2? a1_v3 a2_v3 a3_v3? a1_v3 a2_v5 a3_v2 cl_1

Basic definitions Hypothesis consist of conjunction of constraints on the instance attributes < Outlook, Temperature, Humidity, Windy > <?, Cold, Ø, Strong >

When to apply Machine Learning Dependencies and correlations can not be obvious - the instances in training and test set usually have huge number of attributes The algorithms need to evolute in the changing environment Some problems are better defined with examples - OCR

Disciplines with influence on ML AI ML in general is search problem using prior knowledge Bayesian methods Bayes theorem as the basis for calculating probabilities of hypothesis Statistics characterization of errors that occur when estimating the accuracy of a hypothesis based on a limited sample of data

Disciplines with influence on ML Psychology simulation of the law of practice Neurobiology neurobiological studies motivate creating a simple models of biological neurons. Control theory procedures for optimizing predefined objectives

Categorization based on the desired outcome of the algorithm Supervised learning - technique for creating a function from training data Unsupervised learning - method where a model is fit to observations Semi-supervised learning - combines both labeled and unlabeled examples to generate an appropriate function

Categorization based on the desired outcome of the algorithm Reinforcement learning an agent exploring an environment in which perceives its current state and takes actions. Learning to learn - where the algorithm learns its own inductive bias based on previous experience.

Some ML algorithm types Concept learning Decision tree learning Neural networks Genetic algorithms Instance based learning Bayesian learning Clustering

WEKA The Weka is an endemic bird of New Zealand or.. W(aikato) E(nvironment) for K(nowlegde) A(nalysis)

Project Weka Developed by the University of Waikato in New Zealand http://www.cs.waikato.ac.nz/~ml/index.html

What is WEKA? Comprehensive suite of Java class libraries Implement many state-of-the-art machine learning and data mining algorithms

WEKA consists of Explorer Experimenter Knowledge flow Simple Command Line Interface Java interface

Explorer WEKA s main graphical user interface Each of the major weka packages Filters, Classifiers, Clusterers, Associations, and Attribute Selection is represented along with a Visualization tool

Explorer Data pre-processing ARFF, CSV, C4.5 or binary data Data loaded from URL or DB Preprocessing routines in WEKA are called filters MergeAttributeValuesFilter, NominalToBinaryFilter, DiscretiseFilter, ReplaceMissingValuesFilter

Explorer train Classifier The process of creating a function or data structure, that will be used for classifying of new instances A set of user defined options is used to refine the result of training

Explorer train Classifier How a trained classifier looks like?

Explorer evaluate Classifiers Train set Test set

Explorer evaluate Classifiers Train set Test set The amount of the data is enough 1/3 Test Set Train Set 2/3

Explorer evaluate Classifiers Train set Test set The amount of the data is limited Cross Validation

Explorer Classification results Confusion matrix TPR matrix dogs ski scuba ml cars 26 0 0 0 0 dogs 0 24 0 0 1 ski 0 0 24 0 1 scuba 0 0 0 25 0 ml 0 0 0 0 25 cars dogs ski scuba ml cars 100 0 0 0 0 dogs 0 96 0 0 4 ski 0 0 96 0 4 scuba 0 0 0 100 0 ml 0 0 0 0 100 cars

Explorer Meta Classifiers Methods that enhance the performance or extend the capabilities of the basic classifiers The Meta Classifiers will be discussed in more details in the talk next week

Explorer Association Rules Weka contains an implementation of the Apriori learner for generating association rules outlook=sunny humidity=high 3 surfing=no 3

Explorer Clustering Unsupervised learning

Explorer Clustering Unsupervised learning Implies metric to calculate the similarity between the instances.

Explorer - Attributes selection Relevant attributes for classification

Explorer - Attributes selection Relevant attributes for classification Finding which subset of attributes works best for prediction attr1. attr4 attr13 class a1v1 a4v1 a13v1 cl1 a1v2 a4v2 a13v2 cl2 a1v3 a4v3 a13v3 cl1

Explorer - Visualize Visualization of the dataset A matrix for every pair of attributes

Experimenter Comparing different learning algorithms on different datasets with various parameter settings and analyzing the performance statistics

Knowledge flow The KnowledgeFlow provides an alternative to the Explorer as a graphical front end to Weka's core algorithms. The KnowledgeFlow is a work in progress so some of the functionality from the Explorer is not yet available.

Simple command line interface All implementations of the algorithms have a uniform command-line interface. java weka.classifiers.trees.j48 -t weather.arff

Java Interface Classifier class public abstract class Classifier Classifier buildclassifier a routine which generates a classifier model from a training dataset classifyinstance or distributionforinstance

Java Interface Classifier class public abstract class Classifier Classifier buildclassifier classifyinstance or distributionforinstance routine which evaluates the generated model on an unseen test dataset

Java Interface Classifier class public abstract class Classifier Classifier buildclassifier classifyinstance or distributionforinstance a routine which generates a probability distribution for all classes

Java Interface Instances data = new Instances( "data.arff"); // loading data data.setclassindex(position); // setting class attribute Remove remove = new Remove(); // new instance of filter remove.setoptions("-r"); // set options remove.setinputformat(data); // to inform filter about dataset Instances newdata = Filter.useFilter(data, remove); // apply filter J48 tree = new J48(); // new instance of tree tree.setoptions("-u"); // set the options tree.buildclassifier(data); // build classifier

Java Interface // using 10 times 10-fold cross-validation. Evaluation eval = new Evaluation(newData); eval.crossvalidatemodel( tree, newdata, 10, newdata.getrandomnumbergenerator(1)); Instances unlabeled = new Instances( unlabeled.arff" ); // unlabeled data unlabeled.setclassindex( position); // set class attribute Instances labeled = new Instances(unlabeled); // create copy // label instances for (int i = 0; i < unlabeled.numinstances(); i++) { clslabel = tree.classifyinstance(unlabeled.instance(i)); labeled.instance(i).setclassvalue(clslabel); }

Conclusion Weka is a collection of machine learning algorithms for solving real-world data mining problems It is written in Java and runs on almost any platform The algorithms can either be applied directly to a dataset or called from your own Java code.

Conclusion License - GNU General Public License (GPL) So possible to study how the algorithms works and to modify them.

Demo Document classification five different categories Car maintaining Machine learning Dogs breeding Scuba diving Skiing

Demo Every category has 25 documents and every document has ca. 200 words Before pre-processing every document is represented by two attributes class attribute and the next attribute contains the whole document

Demo Used filters StringToWordVector NumericToBinary StringToWordVector with IDFTransform option Attribute Selection method ChiSquaredAttributeEval

Demo Used classifiers J48( C4.5) Naive Bayes IBk (knn)

Demo Results J48 NB 1NN 3NN StringToWordVector 96.80% 97.60% 35.20% - StringToWordVector with IDFTransform 96.80% 100% - 75.20% NumericToBinary 96.80% 99.20% - 75.20% with smaller set of attributes StringToWordVector 98.41% 100% 96.83% - StringToWordVector with IDFTransform 97.60% 100% 99.20% - NumericToBinary 97.60% 100% 99.20% -

References Mitchell, T. Machine Learning, 1997 McGraw Hill. Ian H. Witten, Eibe Frank, Len Trigg, Mark Hall, Geoffrey Holmes, and Sally Jo Cunningham (1999). Weka: Practical machine learning tools and techniques with Java implementations. Ian H. Witten, Eibe Frank (2005). Data Mining: Practical Machine Learning Tools and Techniques (Second Edition, 2005). San Francisco: Morgan Kaufmann