Data Preprocessing
Lecture 3/DMBI/IKI83403T/MTI/UI
Yudho Giri Sucahyo, Ph.D., CISA (yudho@cs.ui.ac.id)
Faculty of Computer Science, Universitas Indonesia

Objectives
- Motivation: why preprocess the data?
- Data preprocessing techniques: data cleaning, data integration and transformation, and data reduction.

Why Preprocess the Data?
- Quality decisions must be based on quality data, but real-world data can be incomplete, noisy, and inconsistent.
- A data warehouse needs consistent integration of quality data.

Why Preprocess the Data? (2)
- Incomplete: lacking attribute values or certain attributes of interest, or containing only aggregate data.
  Causes: equipment malfunctions; data not entered due to misunderstanding; data not considered important at the time of entry; data inconsistent with other recorded data and therefore deleted.
- Noisy: containing errors, or outlier values that deviate from the expected.
  Causes: faulty data collection instruments; human or computer errors occurring at data entry; errors in data transmission.
- Inconsistent: for example, containing discrepancies in the department codes used to categorize items.

Why Preprocess the Data? (3)
- Clean the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
- Examples of inconsistencies: customer_id vs. cust_id as attribute names, or Bill vs. William vs. B. as values.
- Some attributes may be inferred from others, so data cleaning also includes detecting and removing the redundancies that may result.

Data Preprocessing Techniques
- Data cleaning: removes noise and corrects inconsistencies in the data.
- Data integration: merges data from multiple sources into a coherent data store, such as a data warehouse or a data cube.
- Data transformation: e.g., normalization, which can improve the accuracy and efficiency of mining algorithms involving distance measurements (such as neural networks and nearest-neighbor methods).
- Data discretization.
- Data reduction.

Data Preprocessing Techniques (2)
- Data reduction: a warehouse may store terabytes of data, and complex data analysis or mining may take a very long time to run on the complete data set. Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
- Strategies for data reduction: data aggregation (e.g., building a data cube); dimension reduction (e.g., removing irrelevant attributes through correlation analysis); data compression (e.g., encoding schemes such as minimum-length encoding or wavelets); numerosity reduction; generalization.

Data Cleaning: Missing Values
1. Ignore the tuple. Usually done when the class label is missing (in classification); not effective when the missing values are spread across many attributes and tuples.
2. Fill in the missing value manually: tedious and often infeasible.
3. Use a global constant such as "unknown" to fill in the missing value. The mining program may mistakenly think the filled-in tuples form an interesting concept, since they all share a value in common, so this is not recommended.
4. Use the attribute mean to fill in the missing value (e.g., the average income).
5. Use the attribute mean of all samples belonging to the same class as the given tuple (e.g., the same credit-risk category).
6. Use the most probable value to fill in the missing value, determined with regression, inference-based tools such as a Bayesian formalism, or decision tree induction.

Methods 3 to 6 bias the data: the filled-in value may not be correct. Method 6 is nevertheless a popular strategy, since it uses the most information from the present data to predict missing values, so there is a greater chance that the relationships between income and the other attributes are preserved. (Methods 4 and 5 are sketched in code after the binning example below.)

Data Cleaning: Noisy Data
- Noise is a random error or variance in a measured variable. How can we smooth out the data to remove the noise?
- Binning methods smooth a sorted data value by consulting its neighborhood, that is, the values around it. The sorted values are distributed into a number of buckets, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.
- Binning is also used as a discretization technique (discussed later).

Data Cleaning: Binning Methods
- Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
- Partition into three equal-depth bins of four values each:
  Bin 1: 4, 8, 9, 15
  Bin 2: 21, 21, 24, 25
  Bin 3: 26, 28, 29, 34
- Smoothing by bin means (each value is replaced by its bin's mean):
  Bin 1: 9, 9, 9, 9
  Bin 2: 23, 23, 23, 23
  Bin 3: 29, 29, 29, 29
- Smoothing by bin boundaries (each value is replaced by the closest bin boundary; the larger the bin width, the greater the effect):
  Bin 1: 4, 4, 4, 15
  Bin 2: 21, 21, 25, 25
  Bin 3: 26, 26, 26, 34
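The following is a minimal sketch, in plain Python, of imputation methods 4 and 5 above; the records, income values, and risk categories are all invented for illustration.

```python
# Fill a missing income with the attribute mean (method 4) or with the mean
# of tuples in the same class, e.g. the same credit-risk category (method 5).
records = [
    {"income": 30, "risk": "low"},
    {"income": 50, "risk": "low"},
    {"income": 20, "risk": "high"},
    {"income": None, "risk": "high"},  # missing value to fill in
]

known = [r["income"] for r in records if r["income"] is not None]
overall_mean = sum(known) / len(known)  # method 4: attribute mean

for r in records:
    if r["income"] is None:
        same_class = [s["income"] for s in records
                      if s["risk"] == r["risk"] and s["income"] is not None]
        # method 5: class-conditional mean, falling back to the overall mean
        r["income"] = sum(same_class) / len(same_class) if same_class else overall_mean
```

A second sketch reproduces the binning example above; the prices and the bin depth come from the slide, with bin means rounded to whole dollars as on the slide.

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

depth = 4  # equal-depth (equal-frequency) bins of four values each
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closer of the
# bin's minimum and maximum value.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(bins)       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```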

Data Cleaning: Smoothing by Clustering
- Similar values are organized into groups, or clusters. Values that fall outside of the set of clusters may be considered outliers.

Data Cleaning: Smoothing by Regression
- Data can be smoothed by fitting the data to a function, such as with regression.
- Linear regression involves finding the best line to fit two variables, so that one variable can be used to predict the other.
- Multiple linear regression extends this to more than two variables; the data are fit to a multidimensional surface.
(Figure: data points in the X1-Y1 plane fitted by the regression line y = x + 1.)

Data Smoothing vs. Data Reduction
- Many methods for data smoothing are also methods for data reduction involving discretization. Examples:
- Binning techniques reduce the number of distinct values per attribute. This is useful for decision tree induction, which repeatedly makes value comparisons on sorted data.
- Concept hierarchies are also a form of data discretization that can be used for data smoothing, e.g., mapping real prices into inexpensive, moderately_priced, and expensive, thereby reducing the number of data values to be handled by the mining process.

Data Cleaning: Inconsistent Data
- May be corrected manually. Errors made at data entry may be corrected by performing a paper trace, coupled with routines designed to help correct the inconsistent use of codes.
- Tools can also be used to detect violations of known data constraints.
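As a small illustration of smoothing by regression, the sketch below fits a least-squares line with NumPy; the noisy y-values are invented so that the fitted line comes out close to the slide's y = x + 1 example.

```python
import numpy as np

# Invented sample points that lie roughly on y = x + 1 plus noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.8, 3.1, 4.2, 4.9, 6.1])

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line fit
y_smoothed = slope * x + intercept          # replace noisy values by the fit

print(round(slope, 2), round(intercept, 2))  # both close to 1.0
print(y_smoothed)
```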

Data Integration and Transformation
- Data integration combines data from multiple data stores.
- Schema integration: integrate metadata from different sources. The entity identification problem is identifying the same real-world entities across multiple data sources, e.g., A.cust-id vs. B.cust-#.
- Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ. Possible reasons include different representations or different scales (e.g., feet vs. metres).

Data Transformation
- Data are transformed into forms appropriate for mining. Methods:
- Smoothing: binning, clustering, and regression.
- Aggregation: summarization, data cube construction.
- Generalization: low-level or raw data are replaced by higher-level concepts through the use of concept hierarchies, e.g., street generalized to city or country, or numeric age mapped to young, middle-aged, senior.
- Normalization: attribute data are scaled so as to fall within a small specified range, such as 0.0 to 1.0. Useful for classification involving neural networks, or for distance measurements such as nearest-neighbor classification and clustering.

Data Transformation (2): Normalization
- Min-max normalization: v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
- Z-score normalization: v' = (v - mean_A) / stand_dev_A
- Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
(All three are sketched in code below, after the data cube example.)

Data Reduction: Data Cube Aggregation
- Suppose the data consist of sales per quarter for several years, but the user is interested in the annual sales (the total per year). The data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter.
- The resulting data set is smaller in volume, without loss of the information necessary for the analysis task. See Figure 3.4 [JH].
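A minimal sketch of the three normalization formulas above, in plain Python. The sample values (an income of 73,600 with min 12,000, max 98,000, mean 54,000, and standard deviation 16,000, plus the list [-917, 12, 986]) are illustrative only.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # v' = (v - min_A) / (max_A - min_A) * (new_max - new_min) + new_min
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, stdev_a):
    # v' = (v - mean_A) / stand_dev_A
    return (v - mean_a) / stdev_a

def decimal_scaling(values):
    # v' = v / 10^j, with j the smallest integer making max(|v'|) < 1
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(min_max(73600, 12000, 98000))      # income 73,600 mapped to about 0.716
print(z_score(73600, 54000, 16000))      # about 1.225
print(decimal_scaling([-917, 12, 986]))  # j = 3 -> [-0.917, 0.012, 0.986]
```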

Dimensionality Reduction
- Datasets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task, or redundant.
- Leaving out relevant attributes, or keeping irrelevant attributes, can confuse the mining algorithm and lead to poor quality of discovered patterns. The added volume of irrelevant or redundant attributes can also slow down the mining process.
- Dimensionality reduction reduces the data set size by removing such attributes from it.

Dimensionality Reduction (2)
- The goal of attribute subset selection (also known as feature selection) is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
- For d attributes, there are 2^d possible subsets, so exhaustive search is usually impractical.
- The best (and worst) attributes are typically determined using tests of statistical significance. Attribute evaluation measures such as information gain can also be used (sketched in code below).
- Heuristic methods: stepwise forward selection; stepwise backward selection (or a combination of both); decision tree induction.

Dimensionality Reduction (3): Example of Decision Tree Induction
- Initial attribute set: {A1, A2, A3, A4, A5, A6}.
- The induced tree tests A4 at the root, then A1 and A6 at the next level, with leaves labeled Class 1 and Class 2.
- Attributes that do not appear in the tree are assumed irrelevant, so the reduced attribute set is {A1, A4, A6}.

Data Compression
- Data encodings or transformations are applied so as to obtain a reduced or compressed representation of the original data.
- Lossless data compression: the original data can be reconstructed from the compressed data without any loss of information.
- Lossy data compression: only an approximation of the original data can be reconstructed.
- Two popular and effective methods of lossy data compression: wavelet transforms and principal components analysis.
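Below is a hedged sketch of scoring attributes by information gain, one of the evaluation measures named above, as one greedy step of forward selection. The toy tuples, attribute names, and class labels are invented; A4 is constructed to split the classes perfectly, mirroring its position at the root of the example tree.

```python
from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a list of class labels.
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, labels):
    # Entropy of the whole set minus the weighted entropy after splitting
    # on the given attribute.
    gain = entropy(labels)
    n = len(rows)
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

rows = [{"A1": "x", "A4": "p"}, {"A1": "x", "A4": "q"},
        {"A1": "y", "A4": "p"}, {"A1": "y", "A4": "q"}]
labels = ["Class1", "Class2", "Class1", "Class2"]

# Forward selection starts from the empty set and greedily adds the
# best-scoring attribute; here A4 has gain 1.0 and A1 has gain 0.0.
best = max(["A1", "A4"], key=lambda a: info_gain(rows, a, labels))
print(best)  # A4
```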

Data Compression (2)
(Figure: the original data are compressed; a lossless technique reconstructs the original data exactly, while a lossy technique reconstructs only an approximation.)

Numerosity Reduction
- Parametric methods: assume the data fit some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers). Examples: regression (discussed earlier) and log-linear models, which obtain the value at a point in m-dimensional space as a product over appropriate marginal subspaces.
- Non-parametric methods: no model is assumed. Three major families: clustering (discussed earlier), histograms, and sampling.

Numerosity Reduction: Histograms
- A popular data reduction technique: divide the data into buckets and store the average (or sum) for each bucket.
- Partitioning rules include equi-width and equi-depth, among others.
(Figure: an equi-width histogram of prices, with bucket counts plotted for values from 10,000 to 90,000.)

Numerosity Reduction: Sampling
- Sampling allows a large data set to be represented by a much smaller random sample (or subset) of the data.
- Choose a representative subset of the data: simple random sampling may have very poor performance in the presence of skew, so adaptive sampling methods have been developed.
- Stratified sampling approximates the percentage of each class (or subpopulation of interest) in the overall database; it is used in conjunction with skewed data.
- Variants of simple random sampling: simple random sample without replacement (SRSWOR) and simple random sample with replacement (SRSWR). (Both are sketched in code below.)
(Figures: raw data reduced by SRSWOR and SRSWR, and by a cluster/stratified sample.)
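The sketch below draws SRSWOR, SRSWR, and a stratified sample using only the Python standard library; the toy population of 90 "low" and 10 "high" records is invented to show how stratified sampling preserves a skewed class ratio.

```python
import random
from collections import defaultdict

population = [("low", i) for i in range(90)] + [("high", i) for i in range(10)]

srswor = random.sample(population, 10)    # SRSWOR: without replacement
srswr = random.choices(population, k=10)  # SRSWR: with replacement

# Stratified sample: keep each class's share (here 90% / 10%) in the sample.
strata = defaultdict(list)
for cls, item in population:
    strata[cls].append((cls, item))

stratified = []
for cls, members in strata.items():
    k = round(len(members) / len(population) * 10)  # proportional allocation
    stratified.extend(random.sample(members, k))

print(len(srswor), len(srswr), len(stratified))  # 10 10 10
```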

Numerosity Reduction Sampling (2) Numerosity Reduction Sampling (3) Raw Data Cluster/Stratified Sample Raw Data 29 30 Discretization and Concept Hierarchy Discretization can be used to reduce the number of values for a given continuous attribute, by dividing the range of the attribute t into intervals. Interval labels l can then be used to replace actual data values. Concept hierarchies can be used to reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior). Discretization and concept hierarchy generation for numeric data Binning Histogram analysis Clustering analysis Entropy-based discretization Segmentation by natural partitioning 3-4-5 rule 31 32

Example of the 3-4-5 Rule
- Step 1: For the attribute profit, compute Min = -$351, Low (the 5th percentile) = -$159, High (the 95th percentile) = $1,838, and Max = $4,700.
- Step 2: The most significant digit (msd) of Low and High is at the $1,000 position, so round Low down to -$1,000 and High up to $2,000, giving the interval (-$1,000 ... $2,000).
- Step 3: This interval spans three $1,000 units, so partition it into three equi-width subintervals: (-$1,000 ... $0], ($0 ... $1,000], ($1,000 ... $2,000].
- Step 4: Adjust the boundaries using Min and Max, and recurse. Since Min = -$351 lies inside the first subinterval, its lower boundary shrinks to -$400; since Max = $4,700 exceeds $2,000, a new interval ($2,000 ... $5,000] is added. Each interval is then partitioned further:
  (-$400 ... $0] into four: (-$400 ... -$300], (-$300 ... -$200], (-$200 ... -$100], (-$100 ... $0]
  ($0 ... $1,000] into five: ($0 ... $200], ($200 ... $400], ($400 ... $600], ($600 ... $800], ($800 ... $1,000]
  ($1,000 ... $2,000] into five: ($1,000 ... $1,200], ($1,200 ... $1,400], ($1,400 ... $1,600], ($1,600 ... $1,800], ($1,800 ... $2,000]
  ($2,000 ... $5,000] into three: ($2,000 ... $3,000], ($3,000 ... $4,000], ($4,000 ... $5,000]

Concept Hierarchy Generation for Categorical Data
- Categorical data are discrete: they have a finite number of distinct values, with no ordering among the values. Examples: location, job category.
- Specification of a set of attributes: a concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.
- Example: country (15 distinct values) > province_or_state (65 distinct values) > city (3,567 distinct values) > street (674,339 distinct values).

Conclusion
- Data preparation is a big issue for both warehousing and mining.
- Data preparation includes: data cleaning; data integration and data transformation; data reduction and feature selection; discretization.
- A lot of methods have been developed, but this is still an active area of research.

References
[JH] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.