PSS718 - Data Mining


Lecture 5 - Hacettepe University October 23, 2016

Data Issues

Improving the performance of a model
To improve the performance of a model, we mostly improve the data:
Source additional data
Clean up the data
Deal with missing values
Transform the data
Analyze the data to choose better variables

Data Issues

The ACM KDD Cup
Building models from the right data is crucial to the success of a data mining project. The ACM KDD Cup, an annual Data Mining and Knowledge Discovery competition, is often won by a team that has put a lot of effort into preprocessing the data supplied.

Data Issues

ACM KDD 2009
Orange supplied data related to customer relationship management:
50,000 observations with much missing data
Each observation recorded values for 15,000 (anonymous) variables
Three target variables were to be modeled
You really need to preprocess this before mining!

Data Issues

Data cleaning
When collecting data, it is not possible to ensure that it is perfect. There are many reasons for the data to be dirty:
Simple data entry errors
Decimal points can be incorrectly placed
There can be inherent error in any counting or measuring device
External factors that cause errors to change over time

Data Issues

A number of simple steps
Most cleaning will be done during exploration:
Check frequency counts and histograms for anomalies
Check very low category counts for categoric variables
Check names and addresses, which usually have many versions

Example
Genc vs Genç
Çırağan vs Cırağan vs Ciragan vs ...
Hacettepe Üniversitesi vs Hacettepe University vs University of Hacettepe
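As a sketch of these checks in plain R (the data frame, column names, and values below are made up for illustration; Rattle's Explore tab produces similar views):

```r
# Hypothetical toy data frame: a city column with near-duplicate spellings
# and an age column with one suspicious entry (142)
df <- data.frame(
  city = c("Genc", "Genç", "Ciragan", "Genc", "Cırağan"),
  age  = c(25, 31, 142, 28, 30)
)

# Frequency counts reveal near-duplicate category names
print(table(df$city))

# summary() (or a histogram) flags numeric anomalies such as age = 142
print(summary(df$age))

# Very low category counts are candidates for merging or correction
low <- names(which(table(df$city) < 2))
print(low)
```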

Data Issues

Missing data
Missing data is a common feature of any dataset:
Sometimes there is no information available to populate some value
Sometimes the data has simply been lost
Sometimes the data is purposefully missing because it does not apply to a particular observation
For whatever reason the data is missing, we need to understand and possibly deal with it

Data Issues

Some examples of sentinels used for missing data
Symbolic values: 999, 1 Jan 1900, Earth (for an address)
Negative values where a positive is necessary: -1
Special characters: *, #, $, %, -
Simply missing data: an empty field
Character replacements: None, Missing, Null, Absent
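Before modeling, such sentinels should be converted to R's proper missing-value marker. A minimal sketch, assuming a hypothetical income vector where 999 and -1 are sentinels:

```r
# Hypothetical vector where 999 and -1 are sentinel codes for "missing"
income <- c(35000, 999, 42000, -1, 999, 51000)

# Replace the sentinel codes with NA, R's genuine missing-value marker
income[income %in% c(999, -1)] <- NA

print(income)
# Now summaries and models can treat these values as truly missing
print(sum(is.na(income)))
```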

Data Issues

Outliers
Definition (Outlier)
An outlier is an observation that has values for the variables that are quite different from most other observations.

Example
> summary(rawdata$alan)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     10      95     120     591     160 1111000

Data Issues

Outlier vs high variance
Hawkins (1980): An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.
Extreme weather conditions
Extremely rich people
Extremely short people
We have to be careful in deciding what is an outlier and what is not
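One common heuristic (not specific to Rattle) is to flag values more than 1.5 × IQR beyond the quartiles, the same rule boxplots use. A sketch on hypothetical data shaped like the summary above:

```r
# Hypothetical values with one extreme entry, echoing the alan summary
alan <- c(10, 95, 100, 120, 140, 160, 200, 1111000)

# Quartiles and interquartile range
q     <- quantile(alan, c(0.25, 0.75))
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Values outside the whiskers are candidate outliers
outliers <- alan[alan < lower | alan > upper]
print(outliers)
```

Whether 1111000 is a data-entry error or a genuinely different mechanism still requires judgment, as the slide warns.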

Data Issues

Looking for outliers
Sometimes outliers are what we are looking for:
Fraud in income tax
Fraud in insurance
Fraud in medical payment and medication expenses
Marketing fraud

Data Issues

Variable selection
By removing irrelevant variables from the modeling process, the resulting models can be made more robust. Some variables will also be found to be quite related to other variables.
Various techniques exist:
Random subset selection
Principal component analysis
Variable importance measures of random forests
...
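One simple illustration of "variables quite related to other variables" is a correlation filter: compute pairwise correlations and flag near-duplicates. A sketch on synthetic data (variable names are made up):

```r
set.seed(42)

# Synthetic data: x2 is almost an exact copy of x1, x3 is independent
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)
x3 <- rnorm(100)
df <- data.frame(x1, x2, x3)

# Flag variable pairs with very high absolute correlation
cors <- cor(df)
high <- which(abs(cors) > 0.95 & upper.tri(cors), arr.ind = TRUE)
print(high)  # x1/x2 are near-duplicates; one of them can be dropped
```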

Data Issues

Rattle and transformation
Rattle supports many techniques for transforming data.

Data Issues

Don't forget: Load Data, Explore, Transform, Save!

Data Issues

An alternative way of saving: log saving
Save your log to a script! This will allow you to rerun the script to generate the modified dataset, or to apply the modification to another dataset. You can also modify the way the dataset is generated.

Example
Different model builders make different assumptions about the data from which the models are built. When building a cluster, ensure all variables have the same scale.
Observation 1: Income -> $10,000 and Age -> 30
Observation 2: Income -> $10,500 and Age -> 70
Observation 3: Income -> $9,000 and Age -> 32
Which two observations are closest?
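We can check the answer in base R. On the raw numbers, Income dominates the Euclidean distance simply because dollars span a much wider range than years; after standardising each column, Age gets an equal say and the closest pair changes:

```r
# The three observations above: Income in dollars, Age in years
m <- rbind(c(10000, 30),
           c(10500, 70),
           c( 9000, 32))

# Raw Euclidean distances: Income's scale swamps Age,
# so observations 1 and 2 look closest
print(dist(m))

# After z-scoring each column both variables contribute comparably,
# and observations 1 and 3 (ages 30 and 32) become the closest pair
print(dist(scale(m)))
```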

Normalization

Normalization
Recenter: a so-called Z score, which subtracts the mean and divides by the standard deviation
Scale [0-1]: rescale the data to be in the range from 0 to 1
Median/MAD: a robust rescaling around zero using the median and the median absolute deviation
Log 10: take the logarithm in base 10
Matrix: transform multiple variables with one divisor
Rank: order and rank the observations
Interval: group the observations into a predefined number of bins

Recenter
Definition (Recenter)
This is a common normalisation that re-centres and rescales our data. The usual approach is to subtract the mean value of the variable from each observation's value (to re-centre the variable) and then divide by the standard deviation, which rescales the variable to a range of a few units around zero.

Example
> df$rrc_temp3pm <- scale(df$temp3pm)

Scale [0-1]
Definition (Scale [0-1])
This is done by subtracting the minimum value of the variable from each observation's value and then dividing by the difference between the maximum and the minimum values.

Example
> library(reshape)
> df$r01_temp3pm <- rescaler(df$temp3pm, "range")

Median/MAD
Definition (Median/MAD)
This option for re-centering and rescaling our data is regarded as a robust (to outliers) version of the standard Recenter option. Instead of using the mean and standard deviation, we subtract the median and divide by the so-called median absolute deviation (MAD).

Example
> library(reshape)
> df$rmd_temp3pm <- rescaler(df$temp3pm, "robust")

Natural Logarithm
Used when the distribution of a variable is quite skewed
Maps a very broad range into a narrower range
Outliers are easily handled
Default is to log in base e (natural logarithm)
Be careful: log 0 = -Inf, and the log of a negative value is undefined (NaN)

Example
> df$rlg_temp3pm <- log(df$temp3pm)
> df$rlg_temp3pm[df$rlg_temp3pm == -Inf] <- NA

Rank
Definition (Rank)
The Rank transform will convert each observation's numeric value for the identified variable into a ranking in relation to all other observations in the dataset. A rank is simply a list of integers, starting from 1, that is mapped from the minimum value of the variable, progressing by integer until we reach the maximum value of the variable.

Example
> library(reshape)
> df$rrk_temp3pm <- rescaler(df$temp3pm, "rank")

Interval
Definition (Interval)
An Interval transform recodes the values of a variable into a rank order between 0 and 100.

What is Imputation?
Definition (Imputation)
Imputation is the process of filling in the gaps (or missing values) in data. Data is missing for many different reasons, and it is important to understand why. Imputation can be questionable because, after all, we are inventing data.

How to do it?
Replace with a particular value (helps with regression)
Add an additional variable to denote when values are missing (helps with models identifying the importance of the missing values)
Some models are not troubled by missing values, e.g., decision trees
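The second strategy (a missingness indicator) can be sketched in a few lines of base R. The sunshine values below are hypothetical; the point is to record where the gaps were before filling them:

```r
# Hypothetical variable with gaps
sunshine <- c(7.2, NA, 5.1, NA, 9.4)

# Indicator variable: 1 where the value was missing, 0 otherwise
sunshine_missing <- as.integer(is.na(sunshine))

# Fill the gaps (here with the mean) AFTER recording where they were,
# so a model can still learn whether missingness itself is informative
sunshine[is.na(sunshine)] <- mean(sunshine, na.rm = TRUE)

print(sunshine_missing)
print(sunshine)
```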

How does Rattle do it?

Zero/Missing
Definition (Zero/Missing)
The simplest imputations involve replacing all missing values for a variable with a single value. This makes the most sense when we know that the missing values actually indicate that the value is 0 rather than unknown.

Example
> df$izr_sunshine <- df$sunshine
> df$izr_sunshine[is.na(df$izr_sunshine)] <- 0

Mean/Median/Mode
Definition (Mean/Median/Mode)
Often a simple, if not always satisfactory, choice for missing values that are known not to be zero is to use some central value of the variable. This is often the mean, median, or mode, and thus usually has limited impact on the distribution:
no skewness -> use mean
skewed -> use median
categoric variables -> use mode (the most frequent category)

Example
> df$imn_sunshine <- df$sunshine
> df$imn_sunshine[is.na(df$imn_sunshine)] <- mean(df$sunshine, na.rm=TRUE)

Constant
Definition (Constant)
This choice allows us to provide our own default value to fill in the gaps. This might be an integer or real number for numeric variables, or else a special marker or the choice of something other than the majority category for categoric variables.

Example
> df$icn_sunshine <- df$sunshine
> df$icn_sunshine[is.na(df$icn_sunshine)] <- 0

What is Recode?
Definition (Recode)
Recode provides numerous remapping operations, including binning and transformations of the type of the data.

Binning
Definition (Binning)
Binning is the operation of transforming a continuous numeric variable into a specific set of categoric values based on the numeric values:
Age into AgeGroups
Temperature into Low, Medium, High
Humidity into Dry, Normal, Humid

Binning
Quantile: each bin will have approximately the same number of observations. If the Data tab includes a Weight variable, then the observations are weighted when performing the binning.
KMeans: a k-means clustering will be used for binning.
Equal Width: the min to max range will be divided into equal-width bins.
We can also change the number of bins to be created for each method.
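Equal-width and quantile binning can be sketched directly with base R's cut() (Rattle wraps similar operations; the temperature values and bin labels below are made up):

```r
# Hypothetical temperatures, deliberately skewed toward the low end
temp <- c(5, 6, 7, 8, 9, 10, 12, 20, 33)

# Equal Width: the min-max range is split into 3 bins of equal width,
# so the skewed data piles up in the first bin
equal_width <- cut(temp, breaks = 3, labels = c("Low", "Medium", "High"))
print(table(equal_width))

# Quantile: break points are the tertiles, so each bin gets
# roughly the same number of observations
quantile_bins <- cut(temp,
                     breaks = quantile(temp, probs = seq(0, 1, length.out = 4)),
                     include.lowest = TRUE,
                     labels = c("Low", "Medium", "High"))
print(table(quantile_bins))
```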

Indicator variables
Some model builders do not directly handle categoric variables. One way to fix this is indicator (or dummy) variables:
For each level of a categoric variable, create a new indicator variable
If for an observation the value of that categoric variable is that specific level, then the indicator value becomes 1, otherwise 0
Note that for a categoric variable with k unique levels, k-1 new variables will actually do. Why?
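In R this encoding can be sketched with model.matrix() (the wind variable below is hypothetical). Note how the default formula produces only k-1 indicator columns, with the remaining level absorbed into the baseline:

```r
# Hypothetical categoric variable with k = 3 levels: E, N, S
wind <- factor(c("N", "E", "S", "N", "S"))

# Default encoding: an intercept plus k-1 = 2 indicator columns;
# level "E" is the baseline, implied when both indicators are 0
ind <- model.matrix(~ wind)
print(ind)

# Without the intercept we get all k columns, one of which is
# redundant: the three indicators always sum to 1
ind_full <- model.matrix(~ wind - 1)
print(ind_full)
```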

Join Categorics Definition (Join Categorics) The Join Categorics option provides a convenient way to stratify the dataset based on multiple categoric variables. It is a simple mechanism that creates a new variable from the combination of all of the values of the two constituent variables selected in the Rattle interface.
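The combination of two categoric variables can be sketched in base R with paste() or interaction() (the gender and region variables below are made up for illustration):

```r
# Two hypothetical categoric variables
gender <- factor(c("F", "M", "F", "M"))
region <- factor(c("East", "East", "West", "West"))

# Join them by pasting the values into one new categoric variable
joined <- factor(paste(gender, region, sep = "_"))
print(levels(joined))

# interaction() does the same job and enumerates all level combinations
joined2 <- interaction(gender, region, sep = "_")
print(levels(joined2))
```

The new variable can then be used to stratify the dataset, e.g. as a grouping variable in plots or models.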

Type Conversion
Definition (Type Conversion)
The As Categoric and As Numeric options will, respectively, convert a numeric variable to categoric and vice versa.

Example
> df$tfc_cloud3pm <- as.factor(df$cloud3pm)
> df$tnm_raintoday <- as.numeric(df$raintoday)

Cleanup
It is quite easy to get our dataset's variable count up to significant numbers. The Cleanup option allows us to tell Rattle to actually delete columns from the dataset.
Options:
Delete Ignored: remove any variable that is ignored
Delete Selected: remove any variables we select
Delete Missing: remove any variables that have missing values
Delete Obs with Missing: remove observations (rather than variables) that have missing values