Data Mining with Weka

Similar documents
Basic Concepts Weka Workbench and its terminology

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A.

Machine Learning Chapter 2. Input

Data Mining Practical Machine Learning Tools and Techniques

Data Engineering. Data preprocessing and transformation

CS513-Data Mining. Lecture 2: Understanding the Data. Waheed Noor

Machine Learning Techniques for Data Mining

Slides for Data Mining by I. H. Witten and E. Frank

WEKA: Practical Machine Learning Tools and Techniques in Java. Seminar A.I. Tools WS 2006/07 Rossen Dimov

Overview of Big Data

Input: Concepts, Instances, Attributes

Data Mining Introduction

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Infographics and Visualisation (or: Beyond the Pie Chart) LSS: ITNPBD4, 1 November 2016

Privacy, Security & Ethical Issues

PRIVACY POLICY. 3.1 This policy does not apply to the collection, holding, use or disclosure of personal information that is an employee record.

This guide is for informational purposes only. Please do not treat it as a substitute of a professional legal

If you have any questions about this notice, please contact the Head Master.

Analysis of a Population of Diabetic Patients Databases in Weka Tool P.Yasodha, M. Kannan

Data Mining Algorithms: Basic Methods

Data Mining Practical Machine Learning Tools and Techniques

DATA ANALYSIS WITH WEKA. Author: Nagamani Mutteni Asst.Professor MERI

More on Neural Networks. Read Chapter 5 in the text by Bishop, except omit Sections 5.3.3, 5.3.4, 5.4, 5.5.4, 5.5.5, 5.5.6, 5.5.7, and 5.

An Introduction to WEKA Explorer. In part from: Yizhou Sun 2008

How App Ratings and Reviews Impact Rank on Google Play and the App Store

ECLT 5810 Evaluation of Classification Quality

COMP 465 Special Topics: Data Mining

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Dr. Prof. El-Bahlul Emhemed Fgee Supervisor, Computer Department, Libyan Academy, Libya

CPSC 340: Machine Learning and Data Mining. Recommender Systems Fall 2017

Applying Supervised Learning

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Machine Learning Techniques for Data Mining

Function Algorithms: Linear Regression, Logistic Regression

Cognizant Careers Portal Terms of Use and Privacy Policy ( Policy )

About the information we collect We collect and process personal data including but not limited to:-

Data Science Tutorial

The New Data Protection Law a Basic Guide

Representing structural patterns: Reading Material: Chapter 3 of the textbook by Witten

Data Mining. Lab 1: Data sets: characteristics, formats, repositories Introduction to Weka. I. Data sets. I.1. Data sets characteristics and formats

Privacy Overview and Data Mining CSC 301 Spring 2018 Howard Rosenthal

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44

A Comparative Study of Selected Classification Algorithms of Data Mining

Applying Machine Learning to Real Problems: Why is it Difficult? How Research Can Help?

1 Privacy Statement INDEX

WEKA Explorer User Guide for Version 3-4

Data Preparation. Nikola Milikić. Jelena Jovanović

Task: Design an ER diagram for that problem. Specify key attributes of each entity type.

RippleMatch Privacy Policy

Data Preparation. UROŠ KRČADINAC URL:

University of Florida CISE department Gator Engineering. Data Preprocessing. Dr. Sanjay Ranka

Bayes Net Learning. EECS 474 Fall 2016

A Two-level Learning Method for Generalized Multi-instance Problems

Privacy Preserving Data Mining: An approach to safely share and use sensible medical data

ISSN: (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies

Usability Testing! Hall of Fame! Usability Testing!

Machine Learning with MATLAB --classification

Hands on Datamining & Machine Learning with Weka

Data Preprocessing. Data Preprocessing

Privacy Preserving Data Publishing: From k-anonymity to Differential Privacy. Xiaokui Xiao Nanyang Technological University

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

CPSC 340: Machine Learning and Data Mining. Feature Selection Fall 2016

IJSRD - International Journal for Scientific Research & Development Vol. 4, Issue 05, 2016 ISSN (online):

Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees

Frequently Asked Questions

Knowledge Discovery and Data Mining

CPSC 340: Machine Learning and Data Mining. Probabilistic Classification Fall 2017

Contents. Management. Client. Choosing One 1/20/17

Assignment 4 (Sol.) Introduction to Data Analytics Prof. Nandan Sudarsanam & Prof. B. Ravindran

Prediction. What is Prediction. Simple methods for Prediction. Classification by decision tree induction. Classification and regression evaluation

2.1 Ethics in an Information Society

Tree-based methods for classification and regression

The digital copy of this thesis is protected by the Copyright Act 1994 (New Zealand).

VIACOM INC. PRIVACY SHIELD PRIVACY POLICY

Jue Wang (Joyce) Department of Computer Science, University of Massachusetts, Boston Feb Outline

The Problem of Overfitting with Maximum Likelihood

Pre-Requisites: CS2510. NU Core Designations: AD

Attribute Discretization and Selection. Clustering. NIKOLA MILIKIĆ UROŠ KRČADINAC

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others

7. Boosting and Bagging Bagging

Domino s Pizza Enterprises Ltd. The Business Partner. Code of Practice

Implementing Breiman s Random Forest Algorithm into Weka

CS 229 Project Report:

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others

Data protection legal jungle or common sense Susan Healy. Religious Archives Group 22 Mar 2010

You will see lots of references in the Checklist to the GDPR Pack if you would like to purchase this, go to

PRIVACY POLICY POLICY KEY DEFINITIONS: PROCESSING OF YOUR PERSONAL DATA

Model selection and validation 1: Cross-validation

Data Mining. Part 1. Introduction. 1.4 Input. Spring Instructor: Dr. Masoud Yaghini. Input

Application of a Genetic Algorithm to Variable Selection in Fuzzy Clustering

Data Linkage Methods: Overview of Computer Science Research

LIFEWAY PREMARITAL INFORMATION FORM LIFEWAY REFERRAL INFORMATION

CSE 258 Lecture 6. Web Mining and Recommender Systems. Community Detection

Data Protection Policy

Example of Focus Group Discussion Guide (Caregiver)

Best Practices for Data Science Projects CMSC320: Introduction to Data Science

Please initial each page and mail, FAX, or your completed application to: Milwaukee NARI W. Dearbourn Ave Wauwatosa, WI

BBC Trust service review: BBC Online and BBC Red Button Consultation Questions

Polemic is a business involved in the collection of personal data in the course of its business activities and on behalf of its clients.

Xpress Super may collect and hold the following personal information about you: contact details including addresses and phone numbers;

Transcription:

Data Mining with Weka Class 5 Lesson 1 The data mining process Ian H. Witten Department of Computer Science University of Waikato New Zealand weka.waikato.ac.nz

Lesson 5.1 The data mining process Class 1 Getting started with Weka Class 2 Evaluation Class 3 Simple classifiers Class 4 More classifiers Class 5 Putting it all together Lesson 5.1 The data mining process Lesson 5.2 Pitfalls and pratfalls Lesson 5.3 Data mining and ethics Lesson 5.4 Summary

Lesson 5.1 The data mining process Data Weka Cool result

Lesson 5.1 The data mining process Data Weka Cool result

Lesson 5.1 The data mining process Collect data Ask question Clean data Weka Deploy Define new features

Lesson 5.1 The data mining process Ask a question what do you want to know? tell me something cool about the data is not enough! Gather data there s soooo much around but we need (expert?) classifications more data beats a clever algorithm Clean the data real data is very mucky Define new features feature engineering the key to data mining Deploy the result technical implementation convince your boss!

Lesson 5.1 The data mining process (Selected) filters for feature engineering AddExpression (MathExpression) Apply a math expression to existing attributes to create new one (or modify existing one) Center (Normalize) (Standardize) Transform numeric attributes to have zero mean (or into a given numeric range) (or to have zero mean and unit variance) Discretize (also supervised discretization) Discretize numeric attributes to have nominal values PrincipalComponents Perform a principal components analysis/transformation of the data RemoveUseless Remove attributes that do not vary at all, or vary too much TimeSeriesDelta, TimeSeriesTranslate Replace attribute values with successive differences between this instance and the next

Lesson 5.1 The data mining process Weka is only a small part (unfortunately) and it s the easy part may all your problems be technical ones old programmer s blessing Course text Section 1.3 Fielded applications

Data Mining with Weka Class 5 Lesson 2 Pitfalls and pratfalls Ian H. Witten Department of Computer Science University of Waikato New Zealand weka.waikato.ac.nz

Lesson 5.2 Pitfalls and pratfalls Class 1 Getting started with Weka Class 2 Evaluation Class 3 Simple classifiers Class 4 More classifiers Class 5 Putting it all together Lesson 5.1 The data mining process Lesson 5.2 Pitfalls and pratfalls Lesson 5.3 Data mining and ethics Lesson 5.4 Summary

Lesson 5.2 Pitfalls and pratfalls Pitfall: A hidden or unsuspected danger or difficulty Pratfall: A stupid and humiliating action

Lesson 5.2 Pitfalls and pratfalls Be skeptical In data mining, it s very easy to cheat whether consciously or unconsciously For reliable tests, use a completely fresh sample of data that has never been seen before Overfitting has many faces Don t test on the training set (of course!) Data that has been used for development (in any way) is tainted Leave some evaluation data aside for the very end

Lesson 5.2 Pitfalls and pratfalls Missing values Missing means what Unknown? Unrecorded? Irrelevant? Should you: 1. Omit instances where the attribute value is missing? or 2. Treat missing as a separate possible value? Is there significance in the fact that a value is missing? Most learning algorithms deal with missing values but they may make different assumptions about them

Lesson 5.2 Pitfalls and pratfalls OneR and J48 deal with missing values in different ways Load weather nominal.arff OneR gets 43%, J48 gets 50% (using 10 fold cross validation) Change the outlook value to unknown on the first four no instances OneR gets 93%, J48 still gets 50% Look at OneR s rules: it uses? as a fourth value for outlook

Lesson 5.2 Pitfalls and pratfalls No free lunch 2 class problem with 100 binary attributes Say you know a million instances, and their classes (training set) You don t know the classes of 2 100 10 6 examples! (that s 99.9999 % of the data set) How could you possibly figure them out? In order to generalize, every learner must embody some knowledge or assumptions beyond the data it s given A learning algorithm implicitly provides a set of assumptions There can be no universal best algorithm (no free lunch) Data mining is an experimental science

Lesson 5.2 Pitfalls and pratfalls Be skeptical Overfitting has many faces Missing values different assumptions No universal best learning algorithm Data mining is an experimental science It s very easy to be misled

Data Mining with Weka Class 5 Lesson 3 Data mining and ethics Ian H. Witten Department of Computer Science University of Waikato New Zealand weka.waikato.ac.nz

Lesson 5.3 Data mining and ethics Class 1 Getting started with Weka Class 2 Evaluation Class 3 Simple classifiers Class 4 More classifiers Class 5 Putting it all together Lesson 5.1 The data mining process Lesson 5.2 Pitfalls and pratfalls Lesson 5.3 Data mining and ethics Lesson 5.4 Summary

Lesson 5.3 Data mining and ethics Information privacy laws (in Europe, but not US) A purpose must be stated for any personal information collected Such information must not be disclosed to others without consent Records kept on individuals must be accurate and up to date To ensure accuracy, individuals should be able to review data about themselves Data must be deleted when it is no longer needed for the stated purpose Personal information must not be transmitted to locations where equivalent data protection cannot be assured Some data is too sensitive to be collected, except in extreme circumstances (e.g., sexual orientation, religion)

Lesson 5.3 Data mining and ethics Anonymization is harder than you think When Massachusetts released medical records summarizing every state employee s hospital record in the mid 1990s, the governor gave a public assurance that it had been anonymized by removing all identifying information such as name, address, and social security number. He was surprised to receive his own health records (which included diagnoses and prescriptions) in the mail. Reidentification techniques. Using publicly available records: 50% of Americans can be identified from city, birth date, and sex 85% can be identified if you include the 5 digit zip code as well Netflix movie database: 100 million records of movie ratings (1 5) Can identify 99% of people in the database if you know their ratings for 6 movies and approximately when they saw the movies ( one week) Can identify 70% if you know their ratings for 2 movies and roughly when they saw them

Lesson 5.3 Data mining and ethics The purpose of data mining is to discriminate who gets the loan who gets the special offer Certain kinds of discrimination are unethical, and illegal racial, sexual, religious,

Lesson 5.3 Data mining and ethics The purpose of data mining is to discriminate who gets the loan who gets the special offer Certain kinds of discrimination are unethical, and illegal racial, sexual, religious, But it depends on the context sexual discrimination is usually illegal except for doctors, who are expected to take gender into account and information that appears innocuous may not be ZIP code correlates with race membership of certain organizations correlates with gender

Lesson 5.3 Data mining and ethics Correlation does not imply causation As icecream sales increase, so does the rate of drownings. Therefore icecream consumption causes drowning??? Data mining reveals correlation, not causation but really, we want to predict the effects of our actions

Lesson 5.3 Data mining and ethics Privacy of personal information Anonymization is harder than you think Reidentification from supposedly anonymized data Data mining and discrimination Correlation does not imply causation Course text Section 1.6 Data mining and ethics

Data Mining with Weka Class 5 Lesson 4 Summary Ian H. Witten Department of Computer Science University of Waikato New Zealand weka.waikato.ac.nz

Lesson 5.4 Summary Class 1 Getting started with Weka Class 2 Evaluation Class 3 Simple classifiers Class 4 More classifiers Class 5 Putting it all together Lesson 5.1 The data mining process Lesson 5.2 Pitfalls and pratfalls Lesson 5.3 Data mining and ethics Lesson 5.4 Summary

Lesson 5.4 Summary There s no magic in data mining Instead, a huge array of alternative techniques There s no single universal best method It s an experimental science! What works best on your problem? Weka makes it easy maybe too easy? There are many pitfalls You need to understand what you re doing! Focus on evaluation and significance Different algorithms differ in performance but is it significant?

Lesson 5.4 Summary What have we missed? Filtered classifiers Filter training data but not test data during cross validation Cost sensitive evaluation and classification Evaluate and minimize cost, not error rate Attribute selection Select a subset of attributes to use when learning Clustering Learn something even when there s no class value Association rules Find associations between attributes, when no class is specified Text classification Handling textual data as words, characters, n grams Weka Experimenter Calculating means and standard deviations automatically + more

Lesson 5.4 Summary What have we missed? Filtered classifiers Filter training data but not test data during cross validation Cost sensitive evaluation and classification Evaluate and minimize cost, not error rate Attribute selection Select a subset of attributes to use when learning Clustering Learn something even when there s no class value Association rules Find associations between attributes, when no class is specified Text classification Handling textual data as words, characters, n grams Weka Experimenter Calculating means and standard deviations automatically + more

Lesson 5.4 Summary Data recorded facts Information patterns, or expectations, that underlie them Knowledge the accumulation of your set of expectations Wisdom the value attached to knowledge

Data Mining with Weka Department of Computer Science University of Waikato New Zealand Creative Commons Attribution 3.0 Unported License creativecommons.org/licenses/by/3.0/ weka.waikato.ac.nz