Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality


Data Preprocessing
MIT-652 Data Mining Applications
Thimaporn Phetkaew
School of Informatics, Walailak University

Chapter 3: Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Why Data Preprocessing?
Data in the real world is dirty:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data, e.g., occupation = ""
- noisy: containing errors or outliers, e.g., salary = -10
- inconsistent: containing discrepancies in codes or names, e.g., Age = 42 but Birthday = 03/07/1997
No quality data, no quality mining results!
- Quality decisions must be based on quality data
- A data warehouse needs consistent integration of quality data
- Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Interpretability
- Accessibility

Major Tasks in Data Preprocessing
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: integration of multiple databases, data cubes, or files
- Data transformation: normalization and aggregation
- Data reduction: obtains a reduced representation in volume but produces the same or similar analytical results
- Data discretization: part of data reduction but with particular importance, especially for numerical data

Data Cleaning
Data cleaning tasks:
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data

Missing Data
Data is not always available
- E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to
- equipment malfunction
- data that was inconsistent with other recorded data and was therefore deleted
- data not entered due to misunderstanding
- certain data not being considered important at the time of entry
- history or changes of the data not being registered
Missing data may need to be inferred.

How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification)
- Fill in the missing value manually: tedious + infeasible?
- Use a global constant to fill in the missing value: e.g., "unknown", a new class?!
- Use the attribute mean to fill in the missing value
- Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
- Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree

Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitation
- inconsistency in naming convention

How to Handle Noisy Data?
- Binning method: first sort the data and partition it into (equi-depth) bins; then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
- Clustering: detect and remove outliers
- Combined computer and human inspection: detect suspicious values and have a human check them
- Regression: smooth by fitting the data to regression functions
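As an illustration of two of the missing-value strategies listed above (attribute mean, and attribute mean within the same class), here is a minimal Python sketch. The records, the class labels "low" and "high", and the function names are hypothetical and not taken from the slides; the sketch also assumes every class has at least one known value.

```python
from statistics import mean

# Hypothetical toy records: (class_label, income); None marks a missing value.
records = [
    ("low", 20.0), ("low", None), ("low", 30.0),
    ("high", 80.0), ("high", 100.0), ("high", None),
]

def fill_with_attribute_mean(rows):
    """Replace each missing value with the mean of all known values."""
    known = [v for _, v in rows if v is not None]
    overall = mean(known)
    return [(c, overall if v is None else v) for c, v in rows]

def fill_with_class_mean(rows):
    """Replace each missing value with the mean of the known values in the same class."""
    by_class = {}
    for c, v in rows:
        if v is not None:
            by_class.setdefault(c, []).append(v)
    # Assumption: every class has at least one known value.
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [(c, class_means[c] if v is None else v) for c, v in rows]

global_filled = fill_with_attribute_mean(records)
class_filled = fill_with_class_mean(records)
```

With this toy data the global mean pulls the missing "low" income toward the "high" group, while the class-conditional mean stays within each group, which is why the slide calls the latter "smarter".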

Binning Methods
Equal-width (distance) partitioning:
- Divides the range into N intervals of equal size
- If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
- The most straightforward method
- But outliers may dominate the presentation
- Skewed data is not handled well
Equal-depth (frequency) partitioning:
- Divides the range into N intervals, each containing approximately the same number of samples
- Good data scaling

Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15 (mean = 9)
  - Bin 2: 21, 21, 24, 25 (mean = 22.75)
  - Bin 3: 26, 28, 29, 34 (mean = 29.25)
* Smoothing by bin means (rounded to the nearest integer):
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34

Cluster Analysis
[Figure: cluster analysis]

Regression
[Figure: data points with the fitted line y = x + 1; axes x and y, with the observed value Y1 and its fitted value Y1' marked at X1]
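The price example above can be reproduced with a short Python sketch. The function names are hypothetical, and the tie-breaking rule in boundary smoothing (a value equally close to both boundaries goes to the lower one) is an assumption, since the slides do not specify it. The bin means are left unrounded here, whereas the slide shows them rounded to integers.

```python
from statistics import mean

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # sorted data from the slide

def equal_depth_bins(sorted_values, n_bins):
    """Split sorted values into n_bins bins of (nearly) equal size."""
    size = len(sorted_values) // n_bins
    bins = [sorted_values[i * size:(i + 1) * size] for i in range(n_bins - 1)]
    bins.append(sorted_values[(n_bins - 1) * size:])  # last bin takes any remainder
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean (unrounded)."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value by the closer of the bin's min and max.

    Assumption: ties go to the lower boundary.
    """
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

def equal_width(values, n_bins):
    """Interval width W = (B - A) / N for equal-width partitioning."""
    return (max(values) - min(values)) / n_bins

bins = equal_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
width = equal_width(prices, 3)
```

For the same data, equal-width partitioning with N = 3 would use intervals of width (34 - 4)/3 = 10 instead of four values per bin.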

Inconsistent Data
Inconsistent: containing discrepancies in naming conventions or in the data codes used to categorize items
To handle inconsistent data:
- correct it manually using external references, e.g., by performing a paper trace
- known functional dependencies between attributes can be used
Other data problems which require data cleaning:
- duplicate records
- incomplete data

Data Integration
Data integration: combines data from multiple sources into a coherent store
Schema integration:
- integrate metadata from different sources
- Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts:
- for the same real-world entity, attribute values from different sources are different
- possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundant Data in Data Integration
Redundant data occur often when multiple databases are integrated:
- The same attribute may have different names in different databases
- One attribute may be a derived attribute in another table, e.g., annual revenue
Redundant data may be detected by correlation analysis
Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality

Correlation Analysis
Given two attributes, correlation analysis can measure how strongly one attribute implies the other.
The correlation between attributes A and B can be measured by

$$\gamma_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n - 1)\,\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the mean values of A and B,

$$\bar{A} = \frac{\sum A}{n}, \qquad \bar{B} = \frac{\sum B}{n}$$

and $\sigma_A$ and $\sigma_B$ are the standard deviations of A and B,

$$\sigma_A = \sqrt{\frac{\sum (A - \bar{A})^2}{n - 1}}, \qquad \sigma_B = \sqrt{\frac{\sum (B - \bar{B})^2}{n - 1}}$$

Correlation Analysis (cont.)
- $\gamma_{A,B} > 0$: A and B are positively correlated. The higher the value, the more each attribute implies the other, so A (or B) may be removed as a redundancy.
- $\gamma_{A,B} < 0$: A and B are negatively correlated.
- $\gamma_{A,B} = 0$: A and B are uncorrelated (no linear relationship; this alone does not establish independence).
It can also detect duplication at the tuple level.

[Figure: Positively and Negatively Correlated Data]
[Figure: Not Correlated Data]
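The correlation measure above can be computed directly from its definition. The following is a minimal Python sketch with a hypothetical function name and toy data; it assumes both attributes have the same number of values, at least two tuples, and non-zero variance.

```python
from math import sqrt

def correlation(a, b):
    """Correlation between attributes a and b, using the (n - 1) form from the slide."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sample standard deviations (divide by n - 1).
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / ((n - 1) * sd_a * sd_b)

# Hypothetical toy attributes.
a = [1, 2, 3, 4]
r_pos = correlation(a, [2, 4, 6, 8])     # b grows with a
r_neg = correlation(a, [8, 6, 4, 2])     # b shrinks as a grows
r_zero = correlation(a, [1, -1, -1, 1])  # no linear relationship
```

The last pair shows why a value of zero only rules out a linear relationship: the second attribute still follows a clear (symmetric) pattern in the first.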

Data Transformation
- Smoothing: remove noise from data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction: new attributes constructed from the given ones

Data Transformation: Normalization
min-max normalization:

$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$

z-score normalization:

$$v' = \frac{v - mean_A}{stand\_dev_A}$$

normalization by decimal scaling:

$$v' = \frac{v}{10^j}$$

where j is the smallest integer such that $\max(|v'|) < 1$

Data Transformation: Attribute/Feature Construction
Adding attributes that represent relationships in the data that we know from experience are likely to be important can increase the chance that the mining process will yield useful results:
- density = population / area
- ΔBal = currentbal - previousbal
- area = height * width
- obesityindex = (height / weight) * c
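The three normalization methods named above can be sketched in a few lines of Python. The function names and the sample numbers are hypothetical, and the decimal-scaling helper restricts j to non-negative integers, which is an assumption made for simplicity.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer such that max(|v'|) < 1.

    Assumption: j is searched from 0 upward, so j >= 0.
    """
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

# Hypothetical values for illustration.
mm = min_max(73600, 12000, 98000)        # income range mapped to [0, 1]
zs = z_score(73600, 54000, 16000)        # assumed mean 54000, std 16000
scaled, j = decimal_scaling([-986, 917])
```

Min-max needs the attribute's range up front and is sensitive to outliers at the extremes; z-score only needs the mean and standard deviation, which is convenient when the true minimum and maximum are unknown.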

Data Reduction Strategies
A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set
Data reduction:
- Obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data reduction strategies:
- Data cube aggregation
- Dimension reduction, e.g., remove unimportant attributes
- Data compression
- Numerosity reduction, e.g., fit data into models

Data Cube Aggregation
The lowest level of a data cube: the base cuboid
- the aggregated data for an individual entity of interest, e.g., sales or customer
Multiple levels of aggregation in data cubes
- Further reduce the size of the data to deal with
Reference appropriate levels
- Use the smallest representation which is enough to solve the task
Queries regarding aggregated information should be answered using the data cube, when possible

Dimensionality Reduction
Feature selection (i.e., attribute subset selection):
- Select a minimum set of attributes such that the probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes
- Reduces the number of attributes in the patterns, making them easier to understand
There are 2^d possible subsets of d attributes
Heuristic methods (greedy):
- step-wise forward selection
- step-wise backward elimination
- combining forward selection and backward elimination

Data Compression
[Figure: lossless compression recovers the Original Data from the Compressed Data; lossy compression recovers only an approximation (Original Data Approximated)]

Data Compression
String compression
- There are extensive theories and well-tuned algorithms
- Typically lossless
- But only limited manipulation is possible without expansion
Audio/video compression
- Typically lossy compression, with progressive refinement
- Sometimes small fragments of the signal can be reconstructed without reconstructing the whole

Numerosity Reduction
Parametric methods
- Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (outliers may also be stored)
Non-parametric methods
- Do not assume models
- Major families: histograms, clustering, sampling

Regression
Linear regression: data are modeled to fit a straight line
- Often uses the least-squares method to fit the line
Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector

Regression Analysis
Linear regression: Y = α + β X
- The two parameters, α and β, specify the line and are estimated from the data at hand
- by applying the least-squares criterion to the known values of Y1, Y2, ..., X1, X2, ...
Multiple regression: Y = b0 + b1 X1 + b2 X2
- Many nonlinear functions can be transformed into the above
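As a sketch of the parametric idea above (fit a model, keep only its parameters), the following Python code estimates α and β for Y = α + β X with the standard closed-form least-squares solution. The function name and the toy data are hypothetical; the data are chosen to lie exactly on y = x + 1, and the sketch assumes the X values are not all identical.

```python
def fit_line(xs, ys):
    """Least-squares estimates of alpha and beta in Y = alpha + beta * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)  # assumed non-zero
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Hypothetical toy data lying exactly on y = x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 4.0, 5.0]
alpha, beta = fit_line(xs, ys)

# Numerosity reduction: keep only (alpha, beta) and predict values on demand.
predicted = [alpha + beta * x for x in xs]
```

Storing two numbers instead of every (X, Y) pair is the reduction; how much accuracy is lost depends on how well the line actually fits the data.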

Histograms
A popular data reduction technique
Divide data into buckets and store frequencies for each bucket
Partitioning rules, e.g.,
- Equi-width
- Equi-depth
[Figure: bar chart of bucket frequencies (0 to 40) over values from 10000 to 100000]

Clustering
Partition the data set into clusters, and one can store the cluster representation only
Cluster representations of the data are used to replace the actual data
Can be very effective if the data is clustered but not if the data is smeared
Can have hierarchical clustering and be stored in multi-dimensional index tree structures

Sampling
Allow a large data set to be represented by a much smaller random sample (or subset) of data
Choose a representative subset of the data
- Simple random sampling may have very poor performance in the presence of skew
Develop adaptive sampling methods
- Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database
- Used in conjunction with skewed data
Sampling may not reduce database I/Os (page at a time)

Sampling
[Figure: Raw Data (N = 9) sampled by SRSWOR (simple random sample without replacement, n = 3) and by SRSWR (simple random sample with replacement, n = 3)]
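The sampling schemes named above (SRSWOR, SRSWR, and stratified sampling) can be sketched with Python's standard `random` module. The function names, the strata labels "a", "b", "c", and the sampling fraction are hypothetical; taking at least one item per stratum is also an assumption, not something the slides state.

```python
import random

def srswor(data, n, rng):
    """Simple random sample without replacement: no item is drawn twice."""
    return rng.sample(data, n)

def srswr(data, n, rng):
    """Simple random sample with replacement: an item may be drawn repeatedly."""
    return [rng.choice(data) for _ in range(n)]

def stratified_sample(rows, key, fraction, rng):
    """Sample roughly the same fraction from every stratum.

    Assumption: each stratum contributes at least one row.
    """
    strata = {}
    for row in rows:
        strata.setdefault(key(row), []).append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)          # fixed seed so runs are repeatable
data = list(range(9))           # N = 9, as in the figure
without = srswor(data, 3, rng)
with_repl = srswr(data, 3, rng)

# Hypothetical skewed data: strata of size 4, 8, and 2.
rows = [("a", i) for i in range(4)] + [("b", i) for i in range(8)] + [("c", i) for i in range(2)]
strat = stratified_sample(rows, key=lambda r: r[0], fraction=0.5, rng=rng)
```

With a plain random sample of 7 rows, the two-row stratum "c" could easily be missed entirely; the stratified version guarantees it is represented.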

Sampling
[Figure: Raw Data and the corresponding Cluster/Stratified Sample]

Sampling
[Figure: stratified sample by age group.
Raw Data: T38, T256, T307, T391 (young); T96, T117, T138, T263, T290, T308, T326, T387 (age group between young and senior); T69, T284 (senior).
Stratified Sample: T38, T391 (young); T117, T138, T290, T326 (age group between young and senior); T69 (senior)]

Discretization
Three types of attributes:
- Categorical/discrete attributes
  - Nominal: values from an unordered set, e.g., color
  - Ordinal: values from an ordered set, e.g., academic rank
- Numeric/continuous attributes: integer or real numbers

Discretization
Discretization:
- Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
- Interval labels can then be used to replace actual data values
Some classification algorithms only accept categorical attributes, e.g., decision trees
Prepare for further analysis

Concept hierarchy
Concept hierarchies:
- Define a sequence of mappings from a set of low-level concepts to higher-level (more general) concepts
- Reduce the data by collecting and replacing low-level concepts by higher-level concepts, e.g., replace numeric values for the attribute age by labels such as young or senior

Discretization and concept hierarchy generation for numeric data
- Binning (see sections before)
- Histogram analysis (see sections before)
- Clustering analysis (see sections before)

Concept hierarchy generation for categorical data (cont.)
Specification of a set of attributes, but not of their partial ordering:
- country: 15 distinct values
- province_or_state: 65 distinct values
- city: 3567 distinct values
- street: 674,339 distinct values
Specification of only a partial set of attributes:
- street < city
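When a set of attributes is given without an ordering, one common heuristic is to place the attribute with the fewest distinct values at the top of the hierarchy. The slides list the distinct-value counts but do not spell out this rule, so treat it as an assumption; the function name below is hypothetical. A minimal Python sketch using the counts from the slide:

```python
# Distinct-value counts as listed on the slide.
distinct_counts = {
    "street": 674339,
    "country": 15,
    "city": 3567,
    "province_or_state": 65,
}

def order_hierarchy(counts):
    """Order attributes from most general (fewest distinct values) to most specific.

    Assumption: fewer distinct values means a higher level in the hierarchy.
    """
    return sorted(counts, key=counts.get)

hierarchy = order_hierarchy(distinct_counts)
# Render the result in the slide's notation, lowest level first.
as_ordering = " < ".join(reversed(hierarchy))
```

The heuristic is not foolproof: an attribute such as a year could have more distinct values than a month-of-year attribute while still sitting above it, so the generated ordering may need manual review.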

Summary
Data preparation is an important issue for both warehousing and mining
Data preparation includes:
- Data cleaning -> fill in missing values, smooth noisy data, identify or remove outliers, resolve inconsistencies
- Data integration -> schema integration, correlation analysis, data conflict detection
- Data transformation -> smoothing, aggregation, generalization, normalization, attribute construction
- Data reduction -> data cube aggregation, dimension reduction, data compression, numerosity reduction, discretization