UNIT 2. DATA PREPROCESSING AND ASSOCIATION RULES

Syllabus: Data Pre-processing - Data Cleaning, Integration, Transformation, Reduction, Discretization and Concept Hierarchies - Concept Description: Data Generalization and Summarization-Based Characterization - Mining Association Rules in Large Databases.

Need for Data Preprocessing
Data in the real world is dirty:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
- noisy: containing errors or outliers
- inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!
- Quality decisions must be based on quality data
- A data warehouse needs consistent integration of quality data

Major Tasks in Data Preprocessing
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: integration of multiple databases, data cubes, or files
- Data transformation: normalization and aggregation
- Data reduction: obtains a representation reduced in volume that produces the same or similar analytical results
- Data discretization: part of data reduction, but of particular importance for numerical data

Data Cleaning
Data cleaning tasks:
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data

How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious and often infeasible
- Use a global constant to fill in the missing value: e.g., "unknown", a new class
- Use the attribute mean to fill in the missing value
- Use the attribute mean of all samples belonging to the same class to fill in the missing value: smarter
- Predict the missing value: use the most probable value, or the value with the least impact on further analysis

How to Handle Noisy Data?
Binning method: first sort the data and partition it into (equi-depth) bins; then smooth by bin means, bin medians, bin boundaries, etc.

Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
Clustering: detect and remove outliers
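The binning example above can be sketched in Python. This is a minimal illustration, not a library routine: the helper names are made up for this sketch, and bin means are rounded to whole dollars so the output matches the worked example (e.g., the mean of Bin 2 is 22.75, shown as 23 in the text).

```python
# Sketch of equi-depth binning and smoothing for the worked example above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

def equi_depth_bins(values, n_bins):
    """Partition sorted values into bins of equal frequency (equi-depth)."""
    depth = len(values) // n_bins
    return [values[i:i + depth] for i in range(0, len(values), depth)]

def smooth_by_means(bins):
    """Replace every value with its bin's mean (rounded to a whole dollar)."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the nearest bin boundary (bin min or bin max)."""
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
            for b in bins]

bins = equi_depth_bins(prices, 3)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

Note that smoothing by boundaries keeps values at the bin edges unchanged, which is why 15 and 34 survive while interior values collapse to the nearer edge.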
Regression: smooth by fitting the data to regression functions

Correlation
Examine the degree to which the values of two variables behave similarly, using the correlation coefficient r. For attributes A and B with n values, means mean_A and mean_B, and standard deviations sigma_A and sigma_B:
  r(A, B) = sum((a_i - mean_A)(b_i - mean_B)) / (n * sigma_A * sigma_B)
Interpretation:
- r = 1: perfect correlation
- r = -1: perfect but opposite correlation
- r = 0: no correlation

Data Integration
Data integration combines data from multiple sources into a coherent store.
- Schema integration: integrate metadata from different sources
- Entity identification problem: identify real-world entities across multiple data sources, e.g., A.cust-id and B.cust-# refer to the same attribute
- Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources differ; possible reasons include different representations or different scales, e.g., metric vs. British units

Data Transformation
- Smoothing: remove noise from the data
- Aggregation (data reduction): summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: attribute values scaled to fall within a small, specified range (scales: nominal, ordinal, and interval)
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction: new attributes constructed from the given ones

Data Cube Aggregation
- The lowest level of a data cube (the base cuboid) holds the aggregated data for an individual entity of interest, e.g., a customer in a phone-calling data warehouse
- Data cubes provide multiple levels of aggregation, further reducing the size of the data to deal with
- Reference the appropriate level: use the smallest representation sufficient to solve the task
- Queries about aggregated information should be answered using the data cube, when possible

Attribute Subset Selection
Feature selection (i.e., attribute subset selection):
- Select a minimum set of features such that the probability distribution of the classes given the values of those features is as close as possible to the distribution given the values of all features
- Reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand
Heuristic methods (needed because the number of attribute subsets is exponential):
- Step-wise forward selection
- Step-wise backward elimination
- Combination of forward selection and backward elimination
- Decision-tree induction

Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: a decision tree with internal test nodes A4?, A1?, and A6? and leaves labeled Class 1 / Class 2; attributes not tested in the tree are discarded.]
Reduced attribute set: {A1, A4, A6}

Data Reduction Strategies
A warehouse may store terabytes of data, so complex data analysis/mining may take a very long time to run on the complete data set.
Data reduction obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
Data reduction strategies:
- Data cube aggregation
- Dimensionality reduction
- Numerosity reduction
- Discretization and concept hierarchy generation

Dimensionality Reduction: Principal Component Analysis (PCA)
Steps:
- Given N data vectors in n dimensions, find k <= n orthogonal vectors (the principal components) that can best be used to represent the data
- Normalize the input data so that each attribute falls within the same range
- Compute k orthonormal (unit) vectors, the principal components; each input data vector is a linear combination of the k principal component vectors
- The principal components are sorted in order of decreasing significance or strength
- Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance; using only the strongest principal components, it is possible to reconstruct a good approximation of the original data
- Works for numeric data only
- Used when the number of dimensions is large

[Figure: PCA rotates the original axes X1, X2 into the principal axes Y1, Y2.]

Mining Association Rules in Large Databases
- Association rule mining
- Mining single-dimensional Boolean association rules from transactional databases
- Mining multilevel association rules from transactional databases
- Mining multidimensional association rules from transactional databases and data warehouses
- From association mining to correlation analysis
- Constraint-based association mining

What Is Association Mining?
Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction
databases, relational databases, and other information repositories.
Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Examples (rule form: Body => Head [support, confidence]):
- buys(x, "computer") => buys(x, "software") [0.5%, 60%]
- major(x, "CS") ^ takes(x, "DB") => grade(x, "A") [1%, 75%]

Rule Measures: Support and Confidence
Find all rules X ^ Y => Z with minimum support and minimum confidence:
- support, s: the probability that a transaction contains {X, Y, Z}
- confidence, c: the conditional probability that a transaction containing {X, Y} also contains Z
Example transactions (partial list): {Pencil, Crayons}; {Pencil, Crayons}; {Crayons, Books}; {Crayons, Books}; {Pencil, Crayons}; {Pencil, Book}; {Book}; {Pencil, Books}
- Support for {Pencil, Crayons} = 5/10 = 0.5
- Confidence for Pencil => Crayons = 5/8 = 0.625
With minimum support 50% and minimum confidence 50%, we have:
- A => C (50%, 66.6%)
- C => A (50%, 100%)
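The support and confidence computations above can be sketched in Python. The four transactions below are an assumed dataset (the text does not list the transactions behind the A/C example); they are chosen so that the rules A => C and C => A come out with the stated support and confidence.

```python
# Sketch: computing support and confidence of association rules
# over an assumed four-transaction dataset.

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(body, head, transactions):
    """Conditional probability that a transaction with `body` also has `head`."""
    return support(body | head, transactions) / support(body, transactions)

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

print(support({"A", "C"}, transactions))       # 0.5   -> support 50%
print(confidence({"A"}, {"C"}, transactions))  # 0.666... -> confidence 66.6%
print(confidence({"C"}, {"A"}, transactions))  # 1.0   -> confidence 100%
```

Note that confidence(A => C) = support({A, C}) / support({A}), which is why the two rules share a support of 50% but differ in confidence.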
Association Rule Mining
Boolean vs. quantitative associations:
- buys(x, "SQLServer") ^ buys(x, "DMBook") => buys(x, "DBMiner") [0.2%, 60%]
- age(x, "30..39") ^ income(x, "42..48K") => buys(x, "PC") [1%, 75%]
Single-dimensional vs. multidimensional associations
Single-level vs. multiple-level analysis:
- e.g., what brands of beers are associated with what brands of diapers?
Various extensions:
- Correlation and causality analysis (association does not necessarily imply correlation or causality)
- Constraints enforced

Concept Description: Characterization and Comparison
Concept description:
- Characterization: provides a concise and succinct summarization of a given collection of data
- Comparison: provides descriptions comparing two or more collections of data

Review Summary
- Data preparation, or preprocessing, is a big issue for both data warehousing and data mining
- Descriptive data summarization is needed for quality data preprocessing
- Data preparation includes data cleaning and data integration, data reduction and feature selection, and discretization

Key terms
Data cleaning, data integration, data transformation, data reduction, discretization, characterization, PCA, support, confidence, association rule, decision tree, correlation, regression

Multiple choice questions
1) ____ routines attempt to fill in missing values
   (a) data cleaning (b) data transformation (c) data integration (d) none of the above
2) Which method is the best for filling in missing values?
   (a) ignore the tuple (b) manual filling (c) by global constant (d) use the attribute mean
3) What are the methods for handling noisy data?
   (a) binning (b) regression (c) clustering (d) all of the above
4) Rules for examining the data are
   (a) unique rule (b) consecutive rule (c) null rule (d) all of the above
5) Commercial tools for discrepancy detection are
   (a) data scrubbing tools (b) data mining tools (c) data auditing tools (d) all of the above
6) Redundancy can be detected by
   (a) correlation analysis (b) entity identification problem (c) min-max normalization (d) all of the above
7) Scaling attribute data to fall within a specified range, e.g., 0.0 to 1.0, is called
   (a) integration (b) normalization (c) generalization (d) none of the above
8) Concept hierarchy is also called
   (a) generalization (b) normalization (c) integration (d) none of the above
9) Smoothing can be performed by
   (a) binning (b) clustering (c) regression (d) all of the above
10) If the minimum and maximum values are given, we can use ____ normalization
   (a) min-max (b) z-score (c) decimal scaling (d) all of the above
11) If the mean and standard deviation of the values are given, we can use ____ normalization
   (a) min-max (b) z-score (c) decimal scaling (d) all of the above
12) If the recorded values fall within a given range, we can use ____ normalization
   (a) min-max (b) z-score (c) decimal scaling (d) all of the above
13) Attribute construction is also known as
   (a) feature construction (b) feature selection (c) attribute selection (d) none of the above
14) In ____, irrelevant attributes can be detected and removed
   (a) attribute selection (b) data transformation (c) data integration (d) data cleaning
15) ____ methods can also be applied for data reduction
   (a) data cleaning (b) data transformation (c) data smoothing (d) all of the above

Review Questions
Part A
1. Define data cleaning.
2. Define data transformation.
3. Define smoothing.
4. Define binning.
5. How do you handle noisy data?
6. What are the methods for filling in missing values?
7. Define clustering.
8. Define data reduction.
9. Define attribute selection.
10. Define numerosity reduction.
11. What is an association rule?
12. What is data generalization?
13. What is support?
14. What is confidence?
15. What is smoothing?

Part B
1. Explain the data preprocessing techniques in detail.
2. Explain smoothing techniques.
3. Explain data transformation in detail.
4. Explain normalization in detail.
5. Explain data reduction.
6. Explain parametric and non-parametric methods of data reduction.
7. Explain data discretization and concept hierarchy generation.
8. Explain data mining primitives.
9. Explain attribute-oriented induction.

References
1. Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2002.
2. Alex Berson and Stephen J. Smith, "Data Warehousing, Data Mining, & OLAP", Tata McGraw-Hill, 2004.

For further reference
1. www.cssu-bg.org/old/seminars/ppt/cssu_dw_dm.ppt
2. http://ai.arizona.edu/mis510/slides/12_dm-part1-2004.ppt
3. http://jisuanji.jyu.edu.cn/db/jishuqianyan/acm_introtodw-data warehousing.ppt