UNIT 2. DATA PREPROCESSING AND ASSOCIATION RULES

Data Pre-processing: Data Cleaning, Integration, Transformation, Reduction, Discretization and Concept Hierarchies - Concept Description: Data Generalization and Summarization-Based Characterization - Mining Association Rules in Large Databases.

Need for Data Preprocessing
Data in the real world is dirty:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
- noisy: containing errors or outliers
- inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!
- Quality decisions must be based on quality data
- A data warehouse needs consistent integration of quality data

Major Tasks in Data Preprocessing
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: integration of multiple databases, data cubes, or files
- Data transformation: normalization and aggregation
- Data reduction: obtains a representation reduced in volume that produces the same or similar analytical results
- Data discretization: part of data reduction, of particular importance for numerical data

Data Cleaning
Data cleaning tasks:
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data

How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious and often infeasible
- Use a global constant to fill in the missing value: e.g., "unknown", or a new class
- Use the attribute mean to fill in the missing value (see the sketch below)
- Use the attribute mean of all samples belonging to the same class to fill in the missing value: smarter
- Forecast the missing value: use the most probable value, versus using the value with the least impact on the further analysis
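As a minimal illustration of mean and class-conditional mean filling, here is a pandas sketch; the DataFrame, its column names, and the values are made up for illustration:

    import numpy as np
    import pandas as pd

    # Hypothetical data with missing prices (NaN); column names are illustrative only.
    df = pd.DataFrame({
        "category": ["A", "A", "B", "B", "B"],
        "price":    [4.0, np.nan, 26.0, np.nan, 34.0],
    })

    # Fill with the overall attribute mean.
    df["price_mean"] = df["price"].fillna(df["price"].mean())

    # Smarter: fill with the mean of samples belonging to the same class.
    df["price_class_mean"] = df["price"].fillna(
        df.groupby("category")["price"].transform("mean")
    )

    print(df)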
How to Handle Noisy Data?
Binning method: first sort the data and partition it into (equi-depth) bins; then smooth by bin means, bin medians, or bin boundaries (a code sketch follows the example below).

Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
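The same smoothing steps can be sketched in plain Python; the helper function names below are illustrative, not part of any standard library:

    def equi_depth_bins(values, n_bins):
        """Sort values and split them into bins of (roughly) equal size."""
        data = sorted(values)
        size = len(data) // n_bins
        return [data[i * size:(i + 1) * size] for i in range(n_bins)]

    def smooth_by_means(bins):
        """Replace every value in a bin with the (rounded) bin mean."""
        return [[round(sum(b) / len(b))] * len(b) for b in bins]

    def smooth_by_boundaries(bins):
        """Replace every value with the closest bin boundary (min or max)."""
        return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

    prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
    bins = equi_depth_bins(prices, 3)
    print(bins)                       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
    print(smooth_by_means(bins))      # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
    print(smooth_by_boundaries(bins)) # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]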
Other approaches to handling noisy data:
- Clustering: detect and remove outliers
- Regression: smooth by fitting the data to regression functions

Correlation
Examine the degree to which the values of two variables behave similarly, measured by the correlation coefficient r (Pearson's product-moment coefficient):
r(A, B) = sum_i (a_i - mean(A)) * (b_i - mean(B)) / (n * stddev(A) * stddev(B))
- r = 1: perfect correlation
- r = -1: perfect but opposite correlation
- r = 0: no correlation
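A small NumPy sketch of computing r for two illustrative variables (the sample values are made up):

    import numpy as np

    # Illustrative paired measurements for two variables A and B.
    a = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
    b = np.array([1.0, 3.0, 5.0, 9.0, 11.0])

    # Pearson correlation coefficient r.
    r = np.corrcoef(a, b)[0, 1]
    print(round(r, 3))  # close to 1.0: the two variables move together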

Data Integration
Data integration combines data from multiple sources into a coherent store.
- Schema integration: integrate metadata from different sources
- Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id = B.cust-#
- Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ; possible reasons are different representations or different scales, e.g., metric vs. British units

Data Transformation
- Smoothing: remove noise from the data
- Aggregation: summarization, data cube construction (a form of data reduction)
- Generalization: concept hierarchy climbing
- Normalization: attribute values scaled to fall within a small, specified range (measurement scales: nominal, ordinal and interval); the three schemes below are sketched in code after this list
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction: new attributes constructed from the given ones
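A short Python sketch of the three normalization schemes listed above; the sample values and the target range [0.0, 1.0] are illustrative assumptions:

    import math

    values = [200.0, 300.0, 400.0, 600.0, 1000.0]  # illustrative attribute values

    # Min-max normalization to a new range [new_min, new_max].
    def min_max(v, new_min=0.0, new_max=1.0):
        lo, hi = min(v), max(v)
        return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min for x in v]

    # Z-score normalization: subtract the mean, divide by the standard deviation.
    def z_score(v):
        mean = sum(v) / len(v)
        std = math.sqrt(sum((x - mean) ** 2 for x in v) / len(v))
        return [(x - mean) / std for x in v]

    # Decimal scaling: divide by 10^j so that the largest absolute value is below 1.
    def decimal_scaling(v):
        j = len(str(int(max(abs(x) for x in v))))
        return [x / (10 ** j) for x in v]

    print(min_max(values))         # [0.0, 0.125, 0.25, 0.5, 1.0]
    print(z_score(values))
    print(decimal_scaling(values)) # [0.02, 0.03, 0.04, 0.06, 0.1]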

Data Cube Aggregation
- The lowest level of a data cube (the base cuboid) holds the aggregated data for an individual entity of interest, e.g., a customer in a phone-calling data warehouse
- Multiple levels of aggregation in data cubes further reduce the size of the data to deal with
- Reference appropriate levels: use the smallest representation which is enough to solve the task
- Queries regarding aggregated information should be answered using the data cube, when possible

Attribute Subset Selection
Feature selection (i.e., attribute subset selection): select a minimum set of features such that the probability distribution of the different classes given the values of those features is as close as possible to the original distribution given the values of all features.
- Reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand
Heuristic methods (because the number of attribute subsets grows exponentially):
- Step-wise forward selection (sketched in code below)
- Step-wise backward elimination
- Combining forward selection and backward elimination
- Decision-tree induction
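A possible sketch of step-wise forward selection using scikit-learn (not referenced in these notes); the synthetic data, the decision-tree estimator, and the accuracy-based stopping rule are illustrative choices:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic data standing in for a table with six attributes A1..A6.
    X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                               n_redundant=0, random_state=0)

    selected, remaining = [], list(range(X.shape[1]))
    best_score = 0.0
    # Greedily add the attribute that improves cross-validated accuracy the most.
    while remaining:
        scores = {f: cross_val_score(DecisionTreeClassifier(random_state=0),
                                     X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score:
            break  # no remaining attribute improves the score any further
        best_score = scores[f_best]
        selected.append(f_best)
        remaining.remove(f_best)

    print("Selected attribute indices:", selected, "score:", round(best_score, 3))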

Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}. A decision tree induced on the full data uses only A4 (at the root), A1 and A6 as test nodes, with leaves labelled Class 1 or Class 2, so the reduced attribute set is {A1, A4, A6}.

Data Reduction Strategies
A warehouse may store terabytes of data, so complex data analysis/mining may take a very long time to run on the complete data set. Data reduction obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
Data reduction strategies:
- Data cube aggregation
- Dimensionality reduction
- Numerosity reduction
- Discretization and concept hierarchy generation

Dimensionality Reduction: Principal Component Analysis (PCA)
Given N data vectors in n dimensions, find k <= n orthogonal vectors (principal components) that can best be used to represent the data.
Steps:
- Normalize the input data, so that each attribute falls within the same range
- Compute k orthonormal (unit) vectors, i.e., the principal components
- Each input data vector is a linear combination of the k principal component vectors
- The principal components are sorted in order of decreasing significance (strength)
- Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance; using the strongest principal components, it is possible to reconstruct a good approximation of the original data
PCA works for numeric data only and is used when the number of dimensions is large.
(Figure: principal components Y1 and Y2 of a point cloud plotted against the original axes X1 and X2.)
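A compact NumPy sketch of the PCA steps above; the 2-D sample points and the choice k = 1 are made up for illustration:

    import numpy as np

    # Illustrative 2-D data; rows are data vectors.
    X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
                  [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

    # 1. Normalize (here: center each attribute on its mean).
    Xc = X - X.mean(axis=0)

    # 2. Eigen-decompose the covariance matrix to get orthonormal components.
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)

    # 3. Sort components by decreasing variance (significance) and keep the strongest k.
    order = np.argsort(eigvals)[::-1]
    k = 1
    components = eigvecs[:, order[:k]]

    # 4. Project the data onto the k strongest components (reduced representation).
    Z = Xc @ components
    print(Z.shape)  # (8, 1): 2 dimensions reduced to 1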

Mining Association Rules in Large Databases
- Association rule mining
- Mining single-dimensional Boolean association rules from transactional databases
- Mining multilevel association rules from transactional databases
- Mining multidimensional association rules from transactional databases and data warehouses
- From association mining to correlation analysis
- Constraint-based association mining

What Is Association Mining?
Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.

Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Examples (rule form: Body => Head [support, confidence]):
- buys(x, "computer") => buys(x, "software") [0.5%, 60%]
- major(x, "CS") ^ takes(x, "DB") => grade(x, "A") [1%, 75%]

Rule Measures: Support and Confidence
Find all the rules X & Y => Z with minimum confidence and support:
- support, s: the probability that a transaction contains {X U Y U Z}
- confidence, c: the conditional probability that a transaction having {X U Y} also contains Z (both measures are sketched in code after this section)
For example, in a database of 10 transactions over items such as Pencil, Crayons and Books, if an itemset occurs in 5 transactions its support is 5/10 = 0.5; if the rule's antecedent occurs in 8 transactions, the rule's confidence is 5/8 = 0.625.
With a minimum support of 50% and a minimum confidence of 50%, rules such as the following qualify:
- A => C (50%, 66.6%)
- C => A (50%, 100%)
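A small Python sketch that computes support and confidence over a toy transaction list; the transactions themselves are made up and are not the example from the notes:

    # Toy transaction database; each transaction is a set of items.
    transactions = [
        {"Pencil", "Crayons"}, {"Pencil", "Crayons"}, {"Crayons", "Books"},
        {"Pencil", "Crayons"}, {"Pencil", "Books"}, {"Books"},
    ]

    def support(itemset):
        """Fraction of transactions containing every item in itemset."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(antecedent, consequent):
        """Conditional probability that a transaction with the antecedent also has the consequent."""
        return support(antecedent | consequent) / support(antecedent)

    print(support({"Pencil", "Crayons"}))       # 0.5
    print(confidence({"Pencil"}, {"Crayons"}))  # 0.75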

Association Rule Mining
- Boolean vs. quantitative associations:
  - buys(x, "SQLServer") ^ buys(x, "DMBook") => buys(x, "DBMiner") [0.2%, 60%]
  - age(x, "30..39") ^ income(x, "42..48K") => buys(x, "PC") [1%, 75%]
- Single-dimensional vs. multidimensional associations
- Single-level vs. multiple-level analysis, e.g., what brands of beers are associated with what brands of diapers?
- Various extensions: correlation and causality analysis (association does not necessarily imply correlation or causality), constraint-enforced mining

Concept Description: Characterization and Comparison
Concept description:
- Characterization: provides a concise and succinct summarization of a given collection of data
- Comparison: provides descriptions comparing two or more collections of data

Review Summary
Data preparation (preprocessing) is a big issue for both data warehousing and data mining, and descriptive data summarization is needed for quality preprocessing. Data preparation includes:
- Data cleaning and data integration
- Data reduction and feature selection
- Discretization

Key Terms
Data cleaning, data integration, data transformation, data reduction, discretization, characterization, PCA, support, confidence, association rule, decision tree, correlation, regression.

Multiple Choice Questions
1) ________ routines attempt to fill in missing values.
   (a) data cleaning  (b) data transformation  (c) data integration  (d) none of the above
2) Which method is the best for filling missing values?
   (a) ignore the tuple  (b) manual filling  (c) by global constant  (d) use attribute mean
3) What are the methods for handling noisy data?
   (a) binning  (b) regression  (c) clustering  (d) all of the above
4) Rules for examining the data are
   (a) unique rule  (b) consecutive rule  (c) null rule  (d) all of the above
5) ________ are commercial tools for discrepancy detection.
   (a) data scrubbing tools  (b) data mining tools  (c) data auditing tools  (d) all of the above
6) Redundancy can be detected by
   (a) correlation analysis  (b) entity identification problem  (c) min-max normalization  (d) all of the above
7) Scaling attribute data to fall within a specified range such as 0.0 to 1.0 is called
   (a) integration  (b) normalization  (c) generalization  (d) none of the above
8) Concept hierarchy climbing is also called
   (a) generalization  (b) normalization  (c) integration  (d) none of the above
9) Smoothing can be performed by
   (a) binning  (b) clustering  (c) regression  (d) all of the above
10) If the minimum and maximum values are given, we can use ________ normalization.
   (a) min-max  (b) z-score  (c) decimal scaling  (d) all of the above
11) If the mean and standard deviation of the values are given, we can use ________ normalization.
   (a) min-max  (b) z-score  (c) decimal scaling  (d) all of the above
12) If the recorded values are given in a range, we can use ________ normalization.
   (a) min-max  (b) z-score  (c) decimal scaling  (d) all of the above
13) Attribute construction is also known as
   (a) feature construction  (b) feature selection  (c) attribute selection  (d) none of the above
14) In ________, irrelevant attributes can be detected and removed.
   (a) attribute selection  (b) data transformation  (c) data integration  (d) data cleaning
15) ________ methods can also be applied for data reduction.
   (a) data cleaning  (b) data transformation  (c) data smoothing  (d) all of the above

Review Questions
Part A
1. Define data cleaning.
2. Define data transformation.
3. Define smoothing.
4. Define binning.
5. How do you handle noisy data?
6. What are the methods for filling missing values?
7. Define clustering.
8. Define data reduction.
9. Define attribute selection.
10. Define numerosity reduction.
11. What is an association rule?
12. What is data generalization?
13. What is support?
14. What is confidence?
15. What is smoothing?

Part B
1. Explain the data pre-processing techniques in detail.
2. Explain the smoothing techniques.
3. Explain data transformation in detail.
4. Explain normalization in detail.
5. Explain data reduction.
6. Explain parametric and non-parametric methods of data reduction.
7. Explain data discretization and concept hierarchy generation.
8. Explain data mining primitives.
9. Explain attribute-oriented induction.

References
1. Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2002.
2. Alex Berson, Stephen J. Smith, "Data Warehousing, Data Mining, & OLAP", Tata McGraw-Hill, 2004.

For Further Reference
1. www.cssu-bg.org/old/seminars/ppt/cssu_dw_dm.ppt
2. http://ai.arizona.edu/mis510/slides/12_dm-part1-2004.ppt
3. http://jisuanji.jyu.edu.cn/db/jishuqianyan/acm_introtodw-data warehousing.ppt