Principles of Knowledge Discovery in Data, Fall 2004
Chapter 3: Data Preprocessing
Dr. Osmar R. Zaïane, University of Alberta

Summary of Last Chapter
- What is a data warehouse and what is it for?
- What is the multi-dimensional data model?
- What is the difference between OLAP and OLTP?
- What is the general architecture of a data warehouse?
- How can we implement a data warehouse?
- Are there issues related to data cube technology?
- Can we mine data warehouses?

Course Content
- Introduction to Data Mining
- Data warehousing and OLAP
- Data cleaning
- Data mining operations
- Data summarization
- Association analysis
- Classification and prediction
- Clustering
- Web Mining
- Similarity Search
- Other topics if time permits

Chapter 3 Objectives
- Realize the importance of data preprocessing for real-world data, before data mining or the construction of data warehouses.
- Get an overview of some data preprocessing issues and techniques.

Data Preprocessing Outline
- Data Cleaning
- Data Integration
- Data Transformation
- Data Reduction

Motivation
In real-world applications, data can be inconsistent, incomplete, and/or noisy. Errors can happen because of:
- Faulty data collection instruments
- Data entry problems
- Human misjudgment during data entry
- Data transmission problems
- Technology limitations
- Discrepancies in naming conventions
The results are duplicated records, incomplete data, and contradictions in the data.

Motivation (Cont'd)
Data -> Data Warehouse -> Data Mining -> Decision
What happens when the data cannot be trusted? Can the resulting decisions be trusted? Decision making is jeopardized. There is a better chance of discovering useful knowledge when the data is clean.

Data Cleaning
Real-world application data can be incomplete, noisy, and inconsistent:
- No recorded values for some attributes
- Values not considered at time of entry
- Random errors
Data cleaning attempts to fill in missing values, smooth out noisy data, and correct inconsistencies.

Solving Missing Data
- Ignore the tuple with missing values;
- Fill in the missing values manually;
- Use a global constant to fill in missing values (NULL, unknown, etc.);
- Use the attribute mean to fill in missing values of that attribute;
- Use the attribute mean of all samples belonging to the same class to fill in the missing values;
- Infer the most probable value to fill in the missing value.
(The sketch after the Smoothing Noisy Data slide illustrates the constant- and mean-based fills.)

Smoothing Noisy Data
The purpose of data smoothing is to eliminate noise. This can be done by:
- Binning
- Clustering
- Regression
Regression consists of fitting the data to a function. Linear regression, for instance, finds the line that best fits two variables so that one variable can predict the other. More variables can be involved in multiple linear regression. [Figure: scatter plot of points in the (x, y) plane fitted by the line y = x + 1.]
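A minimal sketch of the constant- and mean-based fills in Python with pandas; the column names age and label are hypothetical, not from the slides:

    import pandas as pd

    # Toy table with two missing ages.
    df = pd.DataFrame({
        "age":   [25.0, None, 31.0, None, 40.0, 22.0],
        "label": ["A", "A", "B", "B", "B", "A"],
    })

    # Global constant: replace missing values with a sentinel.
    df["age_const"] = df["age"].fillna(-1)

    # Attribute mean: replace missing values with the column mean.
    df["age_mean"] = df["age"].fillna(df["age"].mean())

    # Class-conditional mean: mean over samples of the same class.
    df["age_class_mean"] = df["age"].fillna(
        df.groupby("label")["age"].transform("mean"))

    print(df)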

Binning
Binning smooths the data by consulting each value's neighbourhood.
- First, the data is sorted so that values sit next to their neighbours.
- Second, the data is distributed into equi-depth bins. Ex: 4, 8, 15, 21, 21, 24, 25, 28, 34 in bins of depth 3: Bin1: 4, 8, 15; Bin2: 21, 21, 24; Bin3: 25, 28, 34.
- Third, local smoothing is applied (a sketch follows the Data Integration slide):
  - Smoothing by bin means: Bin1: 9, 9, 9; Bin2: 22, 22, 22; Bin3: 29, 29, 29
  - Smoothing by bin medians
  - Smoothing by bin boundaries (each value moves to its closest bin boundary): Bin1: 4, 4, 15; Bin2: 21, 21, 24; Bin3: 25, 25, 34

Clustering
Data is organized into groups of similar values. Rare values that fall outside these groups are considered outliers and are discarded.

Data Integration
Data analysis may require combining data from multiple sources into a coherent data store. There are many challenges:
- Schema integration: CID, C_number, Cust-id, cust#
- Semantic heterogeneity
- Data value conflicts (different representations or scales, etc.)
- Redundant records
- Redundant attributes (an attribute is redundant if it can be derived from other attributes); correlation analysis helps: P(A ∧ B)/(P(A)P(B)) = 1: independent, > 1: positive correlation, < 1: negative correlation
- Metadata is often necessary
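A minimal sketch of equi-depth binning with smoothing by bin means and by bin boundaries, reproducing the example from the Binning slide (plain Python, no external libraries):

    def equi_depth_bins(values, depth):
        """Sort the values and split them into bins of `depth` elements."""
        s = sorted(values)
        return [s[i:i + depth] for i in range(0, len(s), depth)]

    def smooth_by_means(bins):
        """Replace every value by the (rounded) mean of its bin."""
        return [[round(sum(b) / len(b))] * len(b) for b in bins]

    def smooth_by_boundaries(bins):
        """Replace every value by the closest of its bin's two boundaries."""
        out = []
        for b in bins:
            lo, hi = b[0], b[-1]
            out.append([lo if v - lo <= hi - v else hi for v in b])
        return out

    data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
    bins = equi_depth_bins(data, 3)
    print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
    print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]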

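The correlation test on the Data Integration slide is the lift P(A ∧ B)/(P(A)P(B)). A minimal sketch over two hypothetical boolean attributes:

    def lift(a, b):
        """P(A and B) / (P(A) * P(B)) over two parallel 0/1 columns."""
        n = len(a)
        p_a = sum(a) / n
        p_b = sum(b) / n
        p_ab = sum(1 for x, y in zip(a, b) if x and y) / n
        return p_ab / (p_a * p_b)

    # Hypothetical columns: 1 means the attribute holds for that record.
    buys_milk  = [1, 1, 0, 1, 0, 1, 1, 0]
    buys_bread = [1, 1, 0, 1, 0, 0, 1, 0]
    print(lift(buys_milk, buys_bread))  # 1.6 > 1: positively correlated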
Data Transformation
Data is sometimes in a form not appropriate for mining: the algorithm at hand cannot handle it, the form of the data is not regular, or the data itself is not specific enough. Typical transformations:
- Normalization (to compare carrots with carrots)
- Smoothing
- Aggregation (a summary operation applied to the data)
- Generalization (low-level data is replaced with higher-level concepts through a concept hierarchy)

Normalization
- Min-max normalization: a linear transformation from v to v':
  v' = (v - min)/(max - min) * (new_max - new_min) + new_min
  Ex: transforming $30,000 in [10,000..45,000] into [0..1]: (30,000 - 10,000)/35,000 * (1 - 0) + 0 = 0.571
- Z-score normalization: normalizes v into v' based on the attribute's mean and standard deviation:
  v' = (v - mean)/standard_deviation
- Normalization by decimal scaling: moves the decimal point of v by j positions, where j is the minimum number of positions needed to make the absolute maximum value fall in [0..1]:
  v' = v / 10^j
  Ex: if v ranges between 56 and 9,976, then j = 4 and v' ranges between 0.0056 and 0.9976.
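A minimal sketch of the three normalizations, reproducing the two examples above using only the standard library:

    import math
    import statistics

    def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
        """Linear map of v from [lo, hi] onto [new_lo, new_hi]."""
        return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

    def z_score(v, values):
        """Center on the attribute mean, scale by its standard deviation."""
        return (v - statistics.mean(values)) / statistics.stdev(values)

    def decimal_scaling(values):
        """Divide by the smallest power of 10 that brings max |v| into [0, 1]."""
        j = math.ceil(math.log10(max(abs(v) for v in values)))
        return [v / 10 ** j for v in values]

    print(round(min_max(30_000, 10_000, 45_000), 3))  # 0.571
    print(decimal_scaling([56, 9_976]))               # [0.0056, 0.9976]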

Data Reduction
The data is often too large; reducing it can improve performance. Data reduction consists of reducing the representation of the data set while producing the same (or almost the same) analytical results. Data reduction includes:
- Data cube aggregation
- Dimension reduction
- Data compression
- Discretization
- Numerosity reduction: regression, histograms, clustering, sampling

Data Cube Aggregation
Reduce the data to the concept level needed in the analysis. [Figure: a sales cube over Time (quarters Q1-Q4), Location (city: Edmonton, Montreal, Toronto, Vancouver), and Category (Drama, Comedy, Horror, Sci-Fi) is reduced to a cube over Time (years 1997-1999), Location (province: Maritimes, Quebec, Ontario, Prairies, Western Provinces; Canada), and the same categories.] Queries regarding aggregated information should be answered using the data cube when possible.

Dimensionality Reduction
Feature selection (i.e., attribute subset selection): select only the necessary attributes. The goal is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. Since there is an exponential number of possible subsets, heuristics are used to select the locally best (or most pertinent) attribute at each step (information gain, etc.):
- Step-wise forward selection, e.g. {} -> {A1} -> {A1, A3} -> {A1, A3, A5} (a sketch follows this slide)
- Step-wise backward elimination, e.g. {A1, A2, A3, A4, A5} -> {A1, A3, A4, A5} -> {A1, A3, A5}
- Combining forward selection and backward elimination
- Decision-tree induction: build a tree on the data; attributes that do not appear in the tree are considered irrelevant. [Figure: initial attribute set {A1, A2, A3, A4, A5}; a tree testing A1?, A3?, A5? with Class 1 / Class 2 leaves; reduced attribute set {A1, A3, A5}.]
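A minimal sketch of step-wise forward selection; the score function is a hypothetical stand-in for a real criterion such as information gain:

    def forward_selection(attributes, score, k):
        """Greedily add the attribute that most improves score(), k times."""
        selected, remaining = [], list(attributes)
        while remaining and len(selected) < k:
            best = max(remaining, key=lambda a: score(selected + [a]))
            selected.append(best)
            remaining.remove(best)
        return selected

    # Hypothetical scorer: pretend A1, A3, A5 carry all the class signal.
    useful = {"A1": 3, "A3": 2, "A5": 1}
    score = lambda attrs: sum(useful.get(a, 0) for a in attrs)
    print(forward_selection(["A1", "A2", "A3", "A4", "A5"], score, k=3))
    # ['A1', 'A3', 'A5']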

Data Compression
Data compression reduces the size of the data; it saves storage space and communication time. There is lossless compression and lossy compression, and compression is used for all sorts of data: some methods are data-specific, others are versatile. For data mining, data compression is beneficial if the mining algorithms can manipulate the compressed data directly, without uncompressing it.

Numerosity Reduction
Data volume can be reduced by choosing alternative forms of data representation:
- Parametric: regression (a model or function estimating the distribution instead of the data)
- Non-parametric: histograms, clustering, sampling

Reduction with Histograms
A popular data reduction technique: divide the data into buckets and store a representation of each bucket (sum, count, etc.). A sketch contrasting the first two variants follows the Reduction with Clustering slide.
- Equi-width: bars have the same width (value range)
- Equi-depth: bars have the same height (count)
- V-Optimal: the histogram with the least variance (weighting each bucket by count_b * value_b)
- MaxDiff: bucket boundaries placed at the largest differences between adjacent values, up to a user-specified threshold

Reduction with Clustering
Partition the data into clusters based on closeness in space, then retain representatives of the clusters (centroids) and the outliers.
- Effectiveness depends upon the distribution of the data
- Hierarchical clustering is possible (multi-resolution)
- Related to the quantization problem
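A minimal sketch contrasting equi-width and equi-depth buckets, reusing the nine values from the Binning slide (plain Python):

    def equi_width(values, k):
        """k buckets of equal value range; store (low, high, count) per bucket."""
        lo, hi = min(values), max(values)
        w = (hi - lo) / k
        counts = [0] * k
        for v in values:
            i = min(int((v - lo) / w), k - 1)  # clamp the maximum into the last bucket
            counts[i] += 1
        return [(lo + i * w, lo + (i + 1) * w, counts[i]) for i in range(k)]

    def equi_depth(values, k):
        """k buckets holding (roughly) the same number of sorted values."""
        s = sorted(values)
        n = len(s)
        return [s[i * n // k:(i + 1) * n // k] for i in range(k)]

    data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
    print(equi_width(data, 3))  # [(4.0, 14.0, 2), (14.0, 24.0, 3), (24.0, 34.0, 4)]
    print(equi_depth(data, 3))  # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]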

Reduction with Sampling
Sampling allows a large data set to be represented by a much smaller random sample (subset) of the data. How should the random sample be selected? Will the patterns in the sample represent the patterns in the data?
- Simple random sample without replacement (SRSWOR)
- Simple random sample with replacement (SRSWR)
- Cluster sample (SRSWOR or SRSWR applied within clusters)
- Stratified sample (a stratum is a group based on an attribute value)
Random sampling can produce poor results, so this remains an area of active research. (A sketch of the first two schemes plus a stratified sample follows the Discretization slide.)

Discretization
Discretization is used to reduce the number of values of a given continuous attribute by dividing the range of the attribute into intervals. Interval labels then replace the actual data values. Some data mining algorithms accept only categorical attributes and cannot handle a range of continuous attribute values. Discretization can reduce the data set, and can also be used to generate concept hierarchies automatically.
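A minimal sketch of SRSWOR, SRSWR, and a stratified sample using only the standard library; the region attribute is hypothetical:

    import random

    data = list(range(100))

    # SRSWOR: each record can be drawn at most once.
    srswor = random.sample(data, 10)

    # SRSWR: records may be drawn repeatedly.
    srswr = [random.choice(data) for _ in range(10)]

    # Stratified sample: group records by an attribute value, sample per stratum.
    records = [{"region": r, "id": i} for i, r in enumerate(["east", "west"] * 50)]
    strata = {}
    for rec in records:
        strata.setdefault(rec["region"], []).append(rec)
    stratified = [rec for group in strata.values() for rec in random.sample(group, 5)]

    print(len(srswor), len(srswr), len(stratified))  # 10 10 10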

Discretization and Concept Hierarchies
For numerical data there is a wide diversity of possible range values and there are frequent updates, so it is difficult to construct concept hierarchies for numerical attributes by hand. Concept hierarchies (C.H.) can instead be generated automatically, based on an analysis of the data distribution:
- Binning (replace each bin by a representative, mean or median, and bin recursively -> C.H.)
- Histogram analysis (recursive, with a minimum interval size -> C.H.)
- Clustering (recursive clustering -> C.H.)
- Entropy-based (binary partitioning evaluated with information gain -> C.H.)
- 3-4-5 data segmentation (uniform intervals with rounded boundaries)

3-4-5 Partitioning
- Sort the data to get Min and Max.
- Determine the 5th and 95th percentiles to get Low and High.
- Determine the most significant digit (msd) and round Low and High to it, giving Low' and High'. Let n = (High' - Low')/msd:
  - if n is 3, 6, 7, or 9, partition into 3 intervals;
  - if n is 2, 4, or 8, partition into 4 intervals;
  - if n is 1, 5, or 10, partition into 5 intervals.
- Adjust both end intervals so that 100% of the data is included.
- Repeat recursively on each interval. (A sketch of one level of this rule follows the example.)

3-4-5 Partitioning Example (profit attribute)
- Step 1: Min = -$351, Low (5th percentile) = -$159, High (95th percentile) = $1,838, Max = $4,700.
- Step 2: msd = 1,000, so Low' = -$1,000 and High' = $2,000.
- Step 3: (High' - Low')/msd = 3, so partition (-$1,000..$2,000) into 3 intervals: (-$1,000..$0), ($0..$1,000), ($1,000..$2,000).
- Step 4: adjust the end intervals to cover Min and Max, giving (-$400..$5,000) as (-$400..$0), ($0..$1,000), ($1,000..$2,000), ($2,000..$5,000).
- Step 5: recurse on each interval: (-$400..$0) into (-$400..-$300), (-$300..-$200), (-$200..-$100), (-$100..$0); ($0..$1,000) into ($0..$200), ($200..$400), ($400..$600), ($600..$800), ($800..$1,000); ($1,000..$2,000) into ($1,000..$1,200), ($1,200..$1,400), ($1,400..$1,600), ($1,600..$1,800), ($1,800..$2,000); ($2,000..$5,000) into ($2,000..$3,000), ($3,000..$4,000), ($4,000..$5,000).
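A minimal sketch of one level of the 3-4-5 rule, reproducing Steps 2 and 3 of the example; the recursion and the end-interval adjustment of Step 4 are left out:

    import math

    def three_4_5(low, high):
        """One level of 3-4-5 partitioning between low and high (5%/95% tiles)."""
        msd = 10 ** math.floor(math.log10(max(abs(low), abs(high))))
        lo = math.floor(low / msd) * msd   # round Low down to the msd -> Low'
        hi = math.ceil(high / msd) * msd   # round High up to the msd  -> High'
        n = round((hi - lo) / msd)
        parts = 3 if n in (3, 6, 7, 9) else 4 if n in (2, 4, 8) else 5
        width = (hi - lo) / parts
        return [(lo + i * width, lo + (i + 1) * width) for i in range(parts)]

    # Low = -$159, High = $1,838, as in Step 1 of the example.
    print(three_4_5(-159, 1838))
    # [(-1000.0, 0.0), (0.0, 1000.0), (1000.0, 2000.0)]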