Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395


Table of contents
1 Introduction
2 Data preprocessing
3 Data cleaning
4 Data integration
5 Data transformation
6 Reading


Data mining process

Real-world databases are highly susceptible to noisy, missing, and inconsistent data because of their typically huge size and their likely origin from multiple, heterogeneous sources. Low-quality data will lead to low-quality mining results.

Data have quality if they satisfy the requirements of the intended use. Factors comprising data quality are:
- Accuracy (the data do not contain errors)
- Completeness (all attributes of interest are filled in)
- Consistency
- Timeliness
- Believability
- Interpretability


Data preprocessing

How can the data be preprocessed in order to help improve the quality of the data and, consequently, of the mining results? How can the data be preprocessed so as to improve the efficiency and ease of the mining process?

There are several data preprocessing techniques:
- Data cleaning can be applied to remove noise and correct inconsistencies in data.
- Data integration merges data from multiple sources into a coherent data store such as a data warehouse.
- Data reduction can reduce data size by aggregating, eliminating redundant features, or clustering.
- Data transformations (e.g., normalization) may be applied, where data are scaled to fall within a smaller range like 0.0 to 1.0. This can improve the accuracy and efficiency of mining algorithms involving distance measurements.

These techniques are not mutually exclusive; they may work together.


Data cleaning

In summary, real-world data tend to be dirty, incomplete, and inconsistent. Data preprocessing techniques can improve data quality, thereby helping to improve the accuracy and efficiency of the subsequent mining process. Data preprocessing is an important step in the knowledge discovery process, because quality decisions must be based on quality data. Detecting data anomalies, rectifying them early, and reducing the data to be analyzed can lead to huge payoffs for decision making.

Data cleaning routines attempt to clean the data by:
- Filling in missing values
- Smoothing out noisy data
- Identifying or removing outliers
- Correcting inconsistencies in the data (for example, the attribute for customer identification may be referred to as customer-id in one data store and cust-id in another)

Filling missing values

In real-world data, many tuples have no recorded value for several attributes. How can you go about filling in the missing values for such an attribute?
- Ignore the tuple.
- Fill in the missing value manually.
- Use a global constant (such as "unknown") to fill in the missing value.
- Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value.
- Use the attribute mean or median for all samples belonging to the same class as the given tuple.
- Use the most probable value to fill in the missing value (using regression, inference-based tools using a Bayesian formalism, or a decision tree).
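As a minimal sketch of two of the strategies above (a global measure of central tendency, and the mean within the same class), the following uses only the Python standard library. The attribute `income`, the class label `risk`, and all values are made up for illustration; missing values are assumed to be represented as `None`.

```python
from collections import defaultdict
from statistics import mean, median


def fill_with_central_tendency(values, use_median=False):
    """Replace None entries with the mean (or median) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if use_median else mean(observed)
    return [fill if v is None else v for v in values]


def fill_with_class_mean(values, labels):
    """Replace None entries with the mean of observed values in the same class.

    Assumes every class has at least one observed value.
    """
    by_class = defaultdict(list)
    for v, c in zip(values, labels):
        if v is not None:
            by_class[c].append(v)
    class_mean = {c: mean(vs) for c, vs in by_class.items()}
    return [class_mean[c] if v is None else v for v, c in zip(values, labels)]


# Hypothetical data: income with two missing entries, and a class label per tuple.
income = [30, None, 50, 70, None, 90]
risk = ["low", "low", "low", "high", "high", "high"]

filled_global = fill_with_central_tendency(income)
filled_by_class = fill_with_class_mean(income, risk)
```

The class-conditional fill keeps the two groups apart (40 for "low", 80 for "high"), whereas the global mean (60) pulls both missing entries toward the middle.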

Smooth out noisy data

What is noise? Noise is a random error or variance in a measured variable. Given a numeric attribute, how can we smooth out the data to remove the noise?

Binning: Binning methods smooth a sorted data value by consulting its neighborhood.
- Data partitioning: equal-frequency versus equal-width
- Smoothing methods: smoothing by bin means versus bin medians and bin boundaries

Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
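The price example can be reproduced with a short sketch. The function names are illustrative, and the partitioning helper assumes the number of values is divisible by the number of bins, as it is here.

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split already-sorted values into n_bins bins of equal size.

    Assumes len(sorted_values) is divisible by n_bins.
    """
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum.

    Ties go to the lower boundary (an arbitrary choice for this sketch).
    """
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # sorted, from the slide
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Smoothing by bin medians would follow the same pattern, with `statistics.median` in place of the mean.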

Smooth out noisy data (cont.)

How can we smooth out the data to remove the noise?
- Regression: Data smoothing can also be done by regression, that is, by fitting the data values to a function.
- Outlier analysis: Outliers may be detected by clustering. Intuitively, values that fall outside of the set of clusters may be considered outliers.

Figure 3.3: A 2-D customer data plot with respect to customer locations in a city, showing three ...
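Smoothing by regression can be illustrated with a least-squares line fitted to one attribute as a function of another. This is a sketch under the assumption that a straight line is an adequate model; the slide does not prescribe a particular regression form, and the sample values are invented.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx  # undefined if all xs are equal
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each y by the value predicted from x by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy measurements that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
smoothed = smooth_by_regression(xs, ys)
```

The smoothed values lie exactly on the fitted line, so the random scatter around the trend is removed while the trend itself is kept.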

Data cleaning as a process

Missing values, noise, and inconsistencies contribute to inaccurate data. The first step of the data cleaning process is discrepancy detection.

Discrepancies can be caused by several factors, including:
- Poorly designed data entry forms with many optional fields
- Human error in data entry
- Data decay (e.g., outdated addresses)
- Inconsistent data representation
- Inconsistent use of codes
- Errors in instrumentation devices

Detecting and correcting discrepancies:
- As a starting point, use any domain knowledge, for example the expected date format.
- Examine the data regarding the unique rule (each value of the attribute must be unique).
- Examine the data regarding the consecutive rule (there can be no missing values between the lowest and highest values for the attribute, and all values must also be unique).
- Examine the data regarding the null rule (it specifies the use of blanks, question marks, special characters, and how such values should be handled).
- Some data inconsistencies may be corrected manually using external references (e.g., using a paper trace).
- Most errors will require data transformation (define and apply a series of transformations to correct the given attribute).
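The unique, consecutive, and null rules lend themselves to simple automated checks. The sketch below is one possible reading of them; the set of null markers is an assumption, since in practice it comes from the data's documentation.

```python
def duplicated_values(values):
    """Unique rule: return the set of values that occur more than once."""
    seen, dupes = set(), set()
    for v in values:
        if v in seen:
            dupes.add(v)
        seen.add(v)
    return dupes


def missing_in_range(values):
    """Consecutive rule: return integers missing between the lowest and highest value."""
    present = set(values)
    return [v for v in range(min(present), max(present) + 1) if v not in present]


# Assumed markers for "no value"; adjust to the conventions of the data set.
NULL_MARKERS = {"", "?", "N/A"}


def null_positions(values, markers=NULL_MARKERS):
    """Null rule: return the positions whose value is None or a null marker."""
    return [i for i, v in enumerate(values)
            if v is None or str(v).strip() in markers]


# Hypothetical check numbers and city names.
check_numbers = [101, 102, 102, 104]
dupes = duplicated_values(check_numbers)  # violates the unique rule
gaps = missing_in_range(check_numbers)    # violates the consecutive rule
nulls = null_positions(["Tehran", "?", None, " ", "Shiraz"])
```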


Data integration

Data mining often requires data integration (the merging of data from multiple data stores). Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set. This can help to improve the accuracy and speed of the subsequent data mining process.

Issues in data integration:
- Entity identification: Schema integration and object matching can be tricky.
- Redundancy and correlation analysis: An attribute (such as annual revenue) may be redundant if it can be derived from another attribute or set of attributes.
- Tuple duplication: Two or more records may refer to the same object.
- Data value conflict detection and resolution: For the same real-world entity, attribute values from different sources may differ (e.g., telephone number). This may be due to differences in representation, scaling, or encoding.
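Two of these issues can be sketched in a few lines. For redundancy, the Pearson correlation coefficient is one common measure for numeric attributes (the slide names correlation analysis but not a specific coefficient); for tuple duplication, exact duplicates can be dropped while preserving order. All attribute names and values are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)  # undefined if either attribute is constant


def drop_exact_duplicates(records):
    """Remove repeated tuples, keeping the first occurrence of each."""
    return list(dict.fromkeys(records))


# Hypothetical attributes: annual revenue is derived from monthly revenue.
monthly_revenue = [10, 12, 9, 15]
annual_revenue = [12 * m for m in monthly_revenue]
r = pearson(monthly_revenue, annual_revenue)

# Hypothetical records merged from two sources.
records = [("c1", "Tehran"), ("c2", "Shiraz"), ("c1", "Tehran")]
deduplicated = drop_exact_duplicates(records)
```

A coefficient close to 1 or -1 suggests that one of the two attributes may be redundant. Records that refer to the same object but differ in spelling or encoding are not caught by exact matching and need entity identification first.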

Data reduction

The given data set may be huge, and data analysis may take a long time. Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.

Figure: Data reduction shrinks a table with attributes A1, A2, A3, ..., A126 and transactions T1, T2, T3, T4, ..., T2000 to one with attributes A1, A3, ..., A115 and transactions T1, T4, ..., T1456. Data transformation maps the values 2, 32, 100, 59, 48 to 0.02, 0.32, 1.00, 0.59, 0.48.

Data reduction (cont.)

Data reduction strategies:
- Dimensionality reduction: the process of reducing the number of attributes under consideration.
  - Feature extraction (PCA, MDS, ...)
  - Feature selection
- Numerosity reduction: these techniques replace the original data volume by alternative, smaller forms of data representation.
  - Linear regression
  - Histograms
  - Clustering
  - Sampling
  - Data cube aggregation
- Data compression: transformations are applied so as to obtain a reduced or compressed representation of the original data.
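Two of the numerosity reduction techniques, histograms and sampling, can be sketched with the standard library. The equal-width bucketing and the fixed random seed are choices made for this illustration; the price values are reused from the binning example.

```python
import random


def equal_width_histogram(values, n_buckets):
    """Summarize values as (low, high, count) triples over equal-width buckets.

    Assumes the values are not all equal (otherwise the width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        # The maximum value falls into the last bucket.
        idx = min(int((v - lo) / width), n_buckets - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c)
            for i, c in enumerate(counts)]


def simple_random_sample(rows, k, seed=0):
    """Draw k rows without replacement; the seed makes the sketch repeatable."""
    return random.Random(seed).sample(rows, k)


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
histogram = equal_width_histogram(prices, 3)
sample = simple_random_sample(prices, 4)
```

The histogram replaces nine values with three bucket counts, and the sample keeps four of the nine values; both are smaller stand-ins for the original data.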


Data transformation

In this step, the data are transformed or consolidated so that the resulting mining process may be more efficient, and the patterns found may be easier to understand.

Data transformation strategies:
- Smoothing (binning, regression, clustering)
- Attribute construction (new attributes are constructed to help the mining process)
- Aggregation
- Normalization (min-max normalization, ...)
- Discretization (binning, histogram, decision tree, clustering)
- Concept hierarchy generation for nominal data
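Min-max normalization, the one normalization method named above, linearly rescales an attribute so that its minimum maps to a new minimum and its maximum to a new maximum. A minimal sketch, with the target range [0.0, 1.0] as an assumed default:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that min -> new_min and max -> new_max."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        raise ValueError("all values are equal; min-max normalization is undefined")
    return [(v - old_min) / span * (new_max - new_min) + new_min
            for v in values]


values = [2, 32, 100, 59, 48]  # sample values from the data reduction figure
normalized = min_max_normalize(values)
```

Here the smallest value 2 maps to 0.0 and the largest value 100 maps to 1.0. This differs from the figure's 0.02, 0.32, 1.00, 0.59, 0.48, which are consistent with simply dividing each value by 100.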

Concept hierarchy generation

Attributes such as street can be generalized to higher-level concepts, like city or country. The following four methods can be used to generate concept hierarchies for nominal data:
- Specification of a partial ordering of attributes explicitly at the schema level by users or experts. A user or expert can easily define a concept hierarchy by specifying a partial or total ordering of the attributes at the schema level.
- Specification of a portion of a hierarchy by explicit data grouping. In a large database, it is unrealistic to define an entire concept hierarchy by explicit value enumeration.
- Specification of a set of attributes, but not of their partial ordering. A user may specify a set of attributes forming a concept hierarchy, but omit to explicitly state their partial ordering. The system can then try to automatically generate the attribute ordering so as to construct a meaningful concept hierarchy.
- Specification of only a partial set of attributes. Sometimes a user can be careless when defining a hierarchy, or have only a vague idea about what should be included in a hierarchy.

Figure 3.13: Automatic generation of a schema concept hierarchy based on the number ... The levels shown, from top to bottom, are:
country (15 distinct values)
province_or_state (365 distinct values)
city (3567 distinct values)
street (674,339 distinct values)
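The ordering in the figure can be reproduced by sorting the attributes by their number of distinct values, with the attribute that has the fewest distinct values at the top. This is a sketch of that heuristic only; it is not foolproof, since an attribute with fewer distinct values is not always the more general concept, so the result should be reviewed by a user or expert.

```python
def order_by_distinct_values(distinct_counts):
    """Return attribute names from highest to lowest hierarchy level.

    Heuristic: the fewer distinct values an attribute has, the higher it sits.
    """
    return sorted(distinct_counts, key=distinct_counts.get)


# Distinct-value counts taken from the figure; the input order is arbitrary.
distinct_counts = {
    "street": 674_339,
    "country": 15,
    "city": 3567,
    "province_or_state": 365,
}
hierarchy = order_by_distinct_values(distinct_counts)
```

For a real table, the counts could be obtained as `len(set(column))` for each attribute.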


Reading

Read Chapter 3 of the following book:
J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2012.