LABORATORY OF DATA SCIENCE. Reminders on Data Mining. Data Science & Business Informatics Degree


BI Architecture
- 6 lessons: data access
- 4 lessons: data quality & ETL
- 1 lesson: analytic SQL
- 5 lessons: OLAP and reporting
- 4 lessons: data mining

Data Mining Techniques
- Preprocessing and visualization
- Classification/Regression
- Association Rule Discovery
- Clustering
- Sequential Pattern Discovery
- Deviation Detection
- Text Mining
- Web Mining
- Social Network Analysis

Tools for data mining
- From the DBMS world: SQL Server Analysis Services, Oracle Data Miner, IBM DB2 Intelligent Miner (discontinued)
- From statistical analysis: IBM Modeler (formerly SPSS Clementine), SAS Miner
- From machine learning: KNIME, Weka
- An updated list: http://www.kdnuggets.com/software/index.html

Standards
- XML representation of data mining models: Predictive Model Markup Language (PMML)
- APIs for accessing data mining services: Microsoft OLE DB for DM, Java JDM
- SQL extensions for data mining: standard SQL/MM Part 6 (Data Mining); Oracle, DB2 and SQL Server have non-standard extensions; SSAS offers the DMX query language for data mining queries

Weka

Weka
- Suite for machine learning / data mining, developed in Java
- Distributed under a GNU GPL licence; since 2006 it is part of the Pentaho BI suite
- References: Data Mining by Witten & Frank, 3rd ed., 2011; online docs at http://www.cs.waikato.ac.nz/ml/weka/index.html
- Features / limits:
  - A complete set of tools for pre-processing, classification, clustering, association rules, visualization
  - Extensible (documented APIs)
  - Not very efficient / scalable (data are kept in main memory)

Weka versions
- Download: http://www.cs.waikato.ac.nz/~ml/weka
- May 2015: Weka 3.7.12 (developer version)
- Patch distributed by the teacher, to be copied into the Weka installation directory. It includes settings for: larger memory allocation (the Java default is 80 MB), data types for the SQL Server RDBMS, the JDBC driver
- Weka Light: minimal version 3.7.12, patch already included

Weka interfaces: the GUI chooser, and a console with errors/warnings

Weka interfaces: Explorer. A GUI with distinct panels for preprocessing, classification, clustering, etc.

Weka interfaces: KnowledgeFlow. A GUI based on data flows.

Weka interfaces: Simple CLI. A simple command-line interface for running Weka commands directly.

Weka interfaces: Experimenter. Automation of large experiments by varying datasets, algorithms, parameters, etc.

Details: see the Weka manual, in the installation directory or on the Weka website.

Filters
- Conversions: MakeIndicator, NominalToBinary, NumericToBinary, NumericToNominal
- Selections: RemovePercentage, RemoveRange, RemoveWithValues, SubsetByExpression
- Sampling: Resample, SpreadSubsample, StratifiedRemoveFolds
- Normalization: Center, Normalize, Standardize
- Discretization: Discretize
- Cleaning: NumericCleaner, Remove, RemoveByType, RemoveUseless
- Missing values: ReplaceMissingValues
- Transformation: Add, AddExpression, AddNoise, AddValues
(A sketch of applying a filter programmatically follows.)
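As an illustration, here is a minimal sketch of applying one of these filters through Weka's Java API. The dataset file name is an assumption (weather.numeric.arff ships in the data directory of the Weka distribution):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class FilterDemo {
    public static void main(String[] args) throws Exception {
        // Load a dataset (weather.numeric.arff ships with Weka)
        Instances data = DataSource.read("weather.numeric.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Discretize all numeric attributes into 5 equal-width bins
        Discretize discretize = new Discretize();
        discretize.setBins(5);
        discretize.setInputFormat(data); // must be called before useFilter
        Instances discretized = Filter.useFilter(data, discretize);

        System.out.println(discretized.toSummaryString());
    }
}
```

The same pattern (configure the filter, call setInputFormat, then Filter.useFilter) applies to all the filters listed above.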

Reminders on classification

Who are my best customers?
Given their age and frequency of visits: good customers = top buyers, who buy more than X.
[Figure: scatter plot of customers; X axis: Age (Eta'), Y axis: number of visits (NVisits/Visite). Good customers (Molti_Acquisti, "many purchases") fall in the region bounded by Age <= 26, Age >= 38 and NVisits > 20, i.e. 26 < Age < 38 with NVisits > 20; the remaining customers are bad (Pochi_Acquisti, "few purchases").]

...described with a decision tree!

age <= 26: bad
age > 26:
  nvisits <= 20: bad
  nvisits > 20:
    age <= 38: good
    age > 38: bad

Classification: input
- A set of examples (or instances, or cases) which describe a concept or event (the class) by means of predictive attributes (or features)
- Attributes can be either continuous or discrete (possibly after discretization); the class is discrete

outlook   temperature  humidity  windy  class
sunny     85           85        false  Don't Play
sunny     80           90        true   Don't Play
overcast  83           78        false  Play
rain      70           96        false  Play
rain      68           80        false  Play
rain      65           70        true   Don't Play
overcast  64           65        true   Play
sunny     72           95        false  Don't Play
sunny     69           70        false  Play
rain      75           80        false  Play
sunny     75           70        true   Play
overcast  72           90        true   Play
overcast  81           75        false  Play
rain      71           80        true   Don't Play

Classification: output
- A function f(sample) = class, called a classification model, that describes/predicts the class value given the feature values of a sample; it is obtained by generalizing the samples of the training set
- Usage of a classification model:
  - Descriptively: which customers have abandoned?
  - Predictively, over a score set of samples with unknown class value: which customers will respond to this offer? (see the sketch below)
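A minimal sketch of the predictive usage with the Weka API; the two ARFF file names are hypothetical:

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ScoringDemo {
    public static void main(String[] args) throws Exception {
        // Training set: class value known
        Instances train = DataSource.read("customers_train.arff"); // hypothetical file
        train.setClassIndex(train.numAttributes() - 1);

        J48 model = new J48(); // a decision-tree classification model
        model.buildClassifier(train);

        // Score set: same attributes, class value unknown
        Instances score = DataSource.read("customers_score.arff"); // hypothetical file
        score.setClassIndex(score.numAttributes() - 1);

        for (int i = 0; i < score.numInstances(); i++) {
            double predicted = model.classifyInstance(score.instance(i));
            System.out.println(score.classAttribute().value((int) predicted));
        }
    }
}
```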

How to evaluate a classification model?
- Holdout method: split the available data into two sets
  - The training set is used to build the model
  - The test set is used to evaluate the quality of the model
  - Typically, training is 2/3 of the data and test is 1/3
[Diagram: the data is split into a training set, used for model extraction, and a test set, used for the quality measure.]
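A minimal holdout sketch with the Weka API, following the 2/3 vs 1/3 split on the slide (the dataset file name is again an assumption):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class HoldoutDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.numeric.arff");
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(42)); // shuffle before splitting

        // 2/3 training, 1/3 test
        int trainSize = (int) Math.round(data.numInstances() * 2.0 / 3.0);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

        J48 tree = new J48();
        tree.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.println(eval.toSummaryString());
    }
}
```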

How good is a classification model?
- Stratified holdout: the available data is divided by stratified sampling wrt the class distribution
- (Stratified) n-fold cross-validation:
  - The available data is divided into n parts of equal size
  - For i = 1..n, the i-th part is used as test set and the rest as training set for building a classifier
  - The average quality measure of the n classifiers is statistically more significant than the holdout method
  - The FINAL classifier is the one trained from all the available data
- Cross-validation is useful when data is scarce or attribute distributions are skewed
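A minimal 10-fold cross-validation sketch, assuming the same weather dataset as above:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.numeric.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation (Weka stratifies the folds
        // automatically when the class attribute is nominal)
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());

        // The FINAL classifier is then trained on ALL the available data
        J48 finalModel = new J48();
        finalModel.buildClassifier(data);
    }
}
```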

Quality measures: accuracy
- Accuracy: percentage of cases in the test set that are correctly predicted by the model
  - E.g., an accuracy of 80% means that in 8 cases out of 10 in the test set the predicted class is the same as the actual class
- Misclassification % = 100 - accuracy
- Lower bound on accuracy: the majority classifier
  - A trivial classifier for which f(case) = majority class value
  - Its accuracy is the percentage of the majority class
  - E.g., with two classes, fraud 2% and legal 98%, it is hard to beat the 98% accuracy (see the baseline sketch below)
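In Weka the majority classifier is available as ZeroR. A minimal sketch of using it as an accuracy baseline, with train/test sets built as in the holdout example above:

```java
import weka.classifiers.Evaluation;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;

public class BaselineDemo {
    // ZeroR always predicts the majority class of the training set,
    // so its accuracy is the percentage of the majority class
    static void printBaseline(Instances train, Instances test) throws Exception {
        ZeroR majority = new ZeroR();
        majority.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(majority, test);
        System.out.println("Baseline accuracy: " + eval.pctCorrect() + "%");
        System.out.println("Misclassification %: " + eval.pctIncorrect() + "%");
    }
}
```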

Quality measures: confusion matrix
[Figure: confusion matrix crossing the actual class with the predicted class; the off-diagonal cells are the misclassified cases.]

Quality measures: precision
- Precision: accuracy of predicting class C
  Precision = (# cases predicted Class=C and with real Class=C) / (# cases predicted Class=C)
- E.g., 76% of the predictions ">50K" are correct

Quality measures: recall
- Recall: coverage of predicting class C
  Recall = (# cases predicted Class=C and with real Class=C) / (# cases with real Class=C)
- E.g., 59.4% of the cases with real class ">50K" are found by the predictions (see the sketch below)
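Both measures can be read off a Weka Evaluation object. A minimal sketch, assuming an eval built as in the holdout example above:

```java
import weka.classifiers.Evaluation;
import weka.core.Instances;

public class PrecisionRecallDemo {
    // Print one precision/recall pair per class value
    static void printPerClass(Evaluation eval, Instances data) {
        for (int c = 0; c < data.numClasses(); c++) {
            String label = data.classAttribute().value(c);
            System.out.printf("%s: precision=%.3f recall=%.3f%n",
                    label, eval.precision(c), eval.recall(c));
        }
    }
}
```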

Measures: lift chart
- Classifier: f(sample, class) = confidence, and then f(sample) = argmax_class f(sample, class)
  - E.g., f(sample, play) = 0.3, f(sample, don't play) = 0.7
- Samples in the test set can be ranked according to a fixed class
  - E.g., rank customers on the basis of the classifier confidence that they will respond to an offer
- Lift chart:
  - X axis: ranked sample of the test set
  - Y axis: percentage of the total cases in the test set with the actual class value included in the ranked sample (i.e., recall)
  - Plots: performance of the classifier vs a random ranking
- Useful when resources (e.g., budget) are limited

Example: contacting only 50% of the customers reaches 80% of those who respond, so Lift = 80/50 = 1.6.
[Figure: lift chart; X axis: % of customers, taken in ranked order; Y axis: % of customers who responded; two curves: classifier ranking vs random order.]

Lift chart variants
- Lift(X) = recall(X): estimation of the lift over a random classifier
  - Previous example: Lift(50%) = 80%
- LiftRatio(X) = recall(X) / X: ratio of the lift over the random order
  - Previous example: LiftRatio(50%) = 80% / 50% = 1.6
- Profit chart: given a cost/benefit model, the Y axis represents the total cost/gain when contacting X and not contacting TestSet \ X
(A sketch of computing these measures follows.)
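A minimal plain-Java sketch (not Weka code) that computes Lift(X) and LiftRatio(X) from hypothetical test-set predictions:

```java
import java.util.Arrays;
import java.util.Comparator;

public class LiftChartDemo {
    // confidence[i]: classifier confidence that case i belongs to the fixed class
    // actual[i]: whether case i really belongs to that class
    static void printLiftPoints(double[] confidence, boolean[] actual) {
        Integer[] order = new Integer[confidence.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Rank test cases by decreasing confidence
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -confidence[i]));

        int totalPositives = 0;
        for (boolean a : actual) if (a) totalPositives++;

        int found = 0;
        for (int k = 0; k < order.length; k++) {
            if (actual[order[k]]) found++;
            double x = (k + 1) / (double) order.length;    // fraction of cases contacted
            double lift = found / (double) totalPositives; // Lift(X) = recall(X)
            System.out.printf("X=%.2f  Lift=%.2f  LiftRatio=%.2f%n", x, lift, lift / x);
        }
    }
}
```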

The unbalancing problem
- For unbalanced class values it is difficult to obtain a good model
  - E.g., fraud = 2%, normal = 98%: the majority classifier is 98% accurate but useless
- Oversampling and undersampling
  - Select a training set with a more balanced distribution of the class values A and B, e.g., 60-70% for class A and 30-40% for class B
  - Either by increasing the number of cases with class B (oversampling) or by reducing those with class A (undersampling)
  - The training algorithm then has a better chance of distinguishing the characteristics of A vs B
  - The test set MUST keep the original distribution of values (see the sketch below)
- Cost-sensitive classifiers and ensembles (bagging, boosting, stacking): weight the errors, or build several classifiers and average their predictions
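A minimal sketch, assuming Weka's supervised Resample filter, which can bias the sample toward a uniform class distribution:

```java
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.instance.Resample;

public class BalancingDemo {
    static Instances balance(Instances train) throws Exception {
        Resample resample = new Resample();
        resample.setBiasToUniformClass(1.0); // 0 = original distribution, 1 = uniform
        resample.setSampleSizePercent(100);  // same size as the input, resampled
        resample.setInputFormat(train);
        // NOTE: rebalance the TRAINING set only; the test set must
        // keep the original class distribution
        return Filter.useFilter(train, resample);
    }
}
```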

Rule-based classification

Rule-Based Classifier
- Classify records by using a collection of "if... then..." rules
- Rule: (Condition) → y, where
  - Condition is a conjunction of attribute tests
  - y is the class label
- LHS: rule antecedent or condition
- RHS: rule consequent
- Examples of classification rules:
  - (Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds
  - (Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No

Rule-based Classifier (Example)

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
human          warm        yes         no       no             mammals
python         cold        no          no       no             reptiles
salmon         cold        no          no       yes            fishes
whale          warm        yes         no       yes            mammals
frog           cold        no          no       sometimes      amphibians
komodo         cold        no          no       no             reptiles
bat            warm        yes         yes      no             mammals
pigeon         warm        no          yes      no             birds
cat            warm        yes         no       no             mammals
leopard shark  cold        yes         no       yes            fishes
turtle         cold        no          no       sometimes      reptiles
penguin        warm        no          no       sometimes      birds
porcupine      warm        yes         no       no             mammals
eel            cold        no          no       yes            fishes
salamander     cold        no          no       sometimes      amphibians
gila monster   cold        no          no       no             reptiles
platypus       warm        no          no       no             mammals
owl            warm        no          yes      no             birds
dolphin        warm        yes         no       yes            mammals
eagle          warm        no          yes      no             birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Application of Rule-Based Classifier
- A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule (rules R1-R5 as above)

Name          Blood Type  Give Birth  Can Fly  Live in Water  Class
hawk          warm        no          yes      no             ?
grizzly bear  warm        yes         no       no             ?

- Rule R1 covers the hawk => bird
- Rule R3 covers the grizzly bear => mammal

Rule Coverage and Accuracy
- Coverage of a rule: fraction of records that satisfy the antecedent of the rule
- Accuracy of a rule: fraction of records that satisfy both the antecedent and the consequent, over those that satisfy the antecedent

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Status=Single) → No: Coverage = 40% (4 of the 10 records are Single), Accuracy = 50% (2 of those 4 have class No)

How does a Rule-Based Classifier Work? (rules R1-R5 as above)

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
lemur          warm        yes         no       no             ?
turtle         cold        no          no       sometimes      ?
dogfish shark  cold        yes         no       yes            ?

- A lemur triggers rule R3, so it is classified as a mammal
- A turtle triggers both R4 and R5
- A dogfish shark triggers none of the rules

Characteristics of Rule-Based Classifiers
- Mutually exclusive rules: the classifier contains mutually exclusive rules if the rules are independent of each other; every record is covered by at most one rule
- Exhaustive rules: the classifier has exhaustive coverage if it accounts for every possible combination of attribute values; every record is covered by at least one rule

From Decision Trees to Rules
[Figure: decision tree. Root: Refund. Refund = Yes: NO. Refund = No: Marital Status. Marital Status in {Single, Divorced}: Taxable Income (< 80K: NO; > 80K: YES). Marital Status in {Married}: NO.]

Classification rules:
- (Refund=Yes) ==> No
- (Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
- (Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
- (Refund=No, Marital Status={Married}) ==> No

The rules are mutually exclusive and exhaustive; the rule set contains as much information as the tree.

Rules Can Be Simplified
[Figure: the same decision tree and the same 10 training records as above, with class attribute Cheat.]

Initial rule: (Refund=No) ∧ (Status=Married) → No
Simplified rule: (Status=Married) → No

Effect of Rule Simplification
- Rules are no longer mutually exclusive: a record may trigger more than one rule
  - Solution? An ordered rule set, or an unordered rule set with a voting scheme
- Rules are no longer exhaustive: a record may not trigger any rule
  - Solution? Use a default class
(Both solutions are illustrated in the sketch below.)
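A minimal plain-Java sketch (not Weka code; Java 16+ for records) of an ordered rule set with a default class, using the animal rules above:

```java
import java.util.List;
import java.util.function.Predicate;

public class RuleSetDemo {
    record Animal(String bloodType, boolean givesBirth, boolean canFly, String liveInWater) {}
    record Rule(Predicate<Animal> condition, String label) {}

    static final List<Rule> RULES = List.of(
        new Rule(a -> !a.givesBirth() && a.canFly(), "birds"),                     // R1
        new Rule(a -> !a.givesBirth() && a.liveInWater().equals("yes"), "fishes"), // R2
        new Rule(a -> a.givesBirth() && a.bloodType().equals("warm"), "mammals"),  // R3
        new Rule(a -> !a.givesBirth() && !a.canFly(), "reptiles"),                 // R4
        new Rule(a -> a.liveInWater().equals("sometimes"), "amphibians")           // R5
    );

    // Ordered rule set: the FIRST rule that covers the instance wins;
    // the default class handles records that trigger no rule at all
    static String classify(Animal a, String defaultClass) {
        for (Rule r : RULES)
            if (r.condition().test(a)) return r.label();
        return defaultClass;
    }

    public static void main(String[] args) {
        Animal turtle = new Animal("cold", false, false, "sometimes");
        // The turtle triggers both R4 and R5; the ordering resolves it to R4
        System.out.println(classify(turtle, "unknown")); // prints "reptiles"
    }
}
```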

Building Classification Rules
- Direct method: extract rules directly from data
  - E.g.: RIPPER, CN2, Holte's 1R
- Indirect method: extract rules from other classification models (e.g., decision trees, neural networks)
  - E.g.: C4.5rules
(Weka implementations of both approaches are sketched below.)
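A minimal sketch using Weka's rule learners: JRip is Weka's RIPPER implementation (direct method), while PART derives rules from partial C4.5 trees (an indirect method in the spirit of C4.5rules). The dataset file name is an assumption (weather.nominal.arff ships with Weka):

```java
import weka.classifiers.rules.JRip;
import weka.classifiers.rules.PART;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RuleLearners {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        JRip ripper = new JRip(); // direct rule induction (RIPPER)
        ripper.buildClassifier(data);
        System.out.println(ripper); // prints the induced rule list

        PART part = new PART();   // rules derived from partial decision trees
        part.buildClassifier(data);
        System.out.println(part);
    }
}
```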