Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality

Data Preprocessing
MIT-652 Data Mining Applications
Thimaporn Phetkaew
School of Informatics, Walailak University
MIT-652: DM 3: Data Preprocessing

Chapter 3: Data Preprocessing
* Why preprocess the data?
* Data integration and transformation
* Summary

Why Data Preprocessing?
* Data in the real world is dirty:
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data, e.g., occupation = ""
  - noisy: containing errors or outliers, e.g., salary = -10
  - inconsistent: containing discrepancies in codes or names, e.g., Age = 42 but Birthday = 03/07/1997
* No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - A data warehouse needs consistent integration of quality data
* Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

Multi-Dimensional Measure of Data Quality
* A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Interpretability
  - Accessibility

Major Tasks in Data Preprocessing
* Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
* Data integration
  - Integration of multiple databases, data cubes, or files
* Data transformation
  - Normalization and aggregation
* Data reduction
  - Obtains a representation that is reduced in volume but produces the same or similar analytical results
* Data discretization
  - Part of data reduction but of particular importance, especially for numerical data

[Figure: Major Tasks in Data Preprocessing]

Data Cleaning
* Data cleaning tasks:
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data

Missing Data
* Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
* Missing data may be due to:
  - equipment malfunction
  - data that was inconsistent with other recorded data and was therefore deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - history or changes of the data not being registered
* Missing data may need to be inferred

How to Handle Missing Data?
* Ignore the tuple: usually done when the class label is missing (assuming the task is classification)
* Fill in the missing value manually: tedious and often infeasible
* Use a global constant to fill in the missing value, e.g., "unknown" (which may end up acting as a new class)
* Use the attribute mean to fill in the missing value
* Use the attribute mean of all samples belonging to the same class to fill in the missing value: smarter
* Use the most probable value to fill in the missing value: inference-based, e.g., a Bayesian formula or a decision tree

Noisy Data
* Noise: random error or variance in a measured variable
* Incorrect attribute values may be due to:
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions

How to Handle Noisy Data?
* Binning method:
  - first sort the data and partition it into (equi-depth) bins
  - then smooth by bin means, by bin medians, by bin boundaries, etc.
* Clustering
  - detect and remove outliers
* Combined computer and human inspection
  - detect suspicious values automatically and have a human check them
* Regression
  - smooth by fitting the data to regression functions
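The two mean-based strategies from "How to Handle Missing Data?" can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the record layout (dictionaries with `None` for a missing value), the attribute names `income` and `risk`, and the sample values are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def fill_with_attribute_mean(rows, attr):
    """Replace None in `attr` with the mean of all observed values of `attr`."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    fill = mean(observed)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]


def fill_with_class_mean(rows, attr, label):
    """Replace None in `attr` with the mean of observed values in the same class.

    Assumes every class has at least one observed value; otherwise a
    KeyError is raised for that class.
    """
    by_class = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            by_class[r[label]].append(r[attr])
    class_mean = {c: mean(vals) for c, vals in by_class.items()}
    return [
        {**r, attr: class_mean[r[label]] if r[attr] is None else r[attr]}
        for r in rows
    ]


# Hypothetical customer records; None marks a missing income.
customers = [
    {"income": 30000, "risk": "low"},
    {"income": 50000, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 90000, "risk": "high"},
    {"income": None, "risk": "high"},
]

overall = fill_with_attribute_mean(customers, "income")
per_class = fill_with_class_mean(customers, "income", "risk")
```

With these records the overall mean fills both gaps with the same value, while the class-conditional mean fills the "low" gap with 40000 and the "high" gap with 90000, which is why the slide calls the second strategy smarter.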

Binning Methods
* Equal-width (distance) partitioning:
  - It divides the range into N intervals of equal size
  - If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N
  - The most straightforward method
  - But outliers may dominate the presentation
  - Skewed data is not handled well
* Equal-depth (frequency) partitioning:
  - It divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling

Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15 (mean = 9)
  - Bin 2: 21, 21, 24, 25 (mean = 22.75)
  - Bin 3: 26, 28, 29, 34 (mean = 29.25)
* Smoothing by bin means (rounded to the nearest dollar):
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34

Cluster Analysis
[Figure: Cluster Analysis]

Regression
[Figure: Regression. Axes x and y; fitted line y = x + 1; point labels X1 and Y1]
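The price example from "Binning Methods for Data Smoothing" can be reproduced with a short sketch. The function names are hypothetical, the equi-depth split assumes the number of values is divisible by the number of bins, and ties in boundary smoothing are assigned to the lower boundary (the slides do not say how ties are broken).

```python
def equi_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    data = sorted(values)
    if len(data) % n_bins != 0:
        raise ValueError("this sketch assumes len(values) is divisible by n_bins")
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value by its bin mean, rounded to the nearest integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        # A tie goes to the lower boundary (an assumption of this sketch).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


def equal_width_edges(values, n_bins):
    """Interval edges for equal-width partitioning, using W = (B - A) / N."""
    a, b = min(values), max(values)
    width = (b - a) / n_bins
    return [a + i * width for i in range(n_bins + 1)]


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
edges = equal_width_edges(prices, 3)  # A = 4, B = 34, so W = 10
```

For comparison, equal-width partitioning of the same prices into three intervals gives edges at 4, 14, 24 and 34, so the first interval would hold only three values while the last would hold six.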

Inconsistent Data
* Inconsistent: containing discrepancies in naming conventions or in the data codes used to categorize items
* To handle inconsistent data:
  - correct it manually using external references, e.g., by performing a paper trace
  - use known functional dependencies between attributes
* Other data problems that require data cleaning:
  - duplicate records
  - incomplete data

Data Integration
* Data integration: combines data from multiple sources into a coherent store
* Schema integration
  - integrate metadata from different sources
  - Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
* Detecting and resolving data value conflicts
  - for the same real-world entity, attribute values from different sources are different
  - possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundant Data in Data Integration
* Redundant data occur often when multiple databases are integrated
  - The same attribute may have different names in different databases
  - One attribute may be a derived attribute in another table, e.g., annual revenue
* Redundant data may be detected by correlation analysis
* Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality

Correlation Analysis
* Given two attributes, correlation analysis can measure how strongly one attribute implies the other
* The correlation between attributes A and B can be measured by

$$\gamma_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n - 1)\,\sigma_A \sigma_B}$$

  - n = number of tuples
  - the mean values of A and B:

$$\bar{A} = \frac{\sum A}{n}, \qquad \bar{B} = \frac{\sum B}{n}$$

  - the standard deviations of A and B:

$$\sigma_A = \sqrt{\frac{\sum (A - \bar{A})^2}{n - 1}}, \qquad \sigma_B = \sqrt{\frac{\sum (B - \bar{B})^2}{n - 1}}$$

Correlation Analysis
* Correlation analysis:
  - $\gamma_{A,B} > 0$ -> A and B are positively correlated; the higher the value, the more each attribute implies the other, so A (or B) may be removed as a redundancy
  - $\gamma_{A,B} < 0$ -> A and B are negatively correlated
  - $\gamma_{A,B} = 0$ -> A and B are uncorrelated (no linear relationship; this alone does not guarantee that they are independent)
* Duplication can also be detected at the tuple level

Positively and Negatively Correlated Data
[Figure: Positively and Negatively Correlated Data]

Not Correlated Data
[Figure: Not Correlated Data]
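The correlation coefficient from the "Correlation Analysis" slide, with sample standard deviations (the n - 1 form), can be computed directly. This is a small sketch with hypothetical data; it assumes both attributes have at least two values and are not constant (a constant attribute would make a standard deviation zero).

```python
from math import sqrt


def correlation(a, b):
    """Correlation of attributes a and b using the (n - 1) formulas from the slide."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two equally long attributes with at least 2 tuples")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sample standard deviations (divide by n - 1).
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / ((n - 1) * sd_a * sd_b)


positive = correlation([1, 2, 3, 4], [2, 4, 6, 8])   # B = 2A
negative = correlation([1, 2, 3, 4], [8, 6, 4, 2])   # B falls as A rises
# B = A squared: fully determined by A, yet the coefficient is zero.
zero = correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
```

The last case is a reminder that the coefficient only measures linear association: a value of zero means "no linear relationship", which is weaker than independence.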

Data Transformation
* Smoothing: remove noise from data
* Aggregation: summarization, data cube construction
* Generalization: concept hierarchy climbing
* Normalization: values scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
* Attribute/feature construction
  - New attributes constructed from the given ones

Data Transformation: Attribute/Feature construction
* Adding attributes that represent relationships in the data that we know from experience are likely to be important can increase the chance that the mining process will yield useful results
  - density = population / area
  - ΔBal = currentbal - previousbal
  - area = height * width
  - obesityindex = (height / weight) * c

Data Transformation: Normalization
* min-max normalization

$$v' = \frac{v - \mathrm{min}_A}{\mathrm{max}_A - \mathrm{min}_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A$$

* z-score normalization

$$v' = \frac{v - \mathrm{mean}_A}{\mathrm{stand\_dev}_A}$$

* normalization by decimal scaling

$$v' = \frac{v}{10^{j}}$$

  where j is the smallest integer such that $\max(|v'|) < 1$
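The three normalization methods from "Data Transformation: Normalization" can be sketched in a few lines. Some choices here are assumptions rather than something the slides specify: the z-score uses the sample standard deviation (`statistics.stdev`), the decimal-scaling search starts at j = 0 (so it returns the smallest non-negative j), and the input values are made up for illustration.

```python
from statistics import mean, stdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization into [new_min, new_max]; requires max > min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Z-score normalization using the sample standard deviation (an assumption)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def decimal_scaling(values):
    """Divide by 10**j for the smallest j >= 0 that makes max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


scaled = min_max([10, 20, 30])                  # [0.0, 0.5, 1.0]
standardized = z_score([10, 20, 30])            # mean 20, sample std dev 10
shifted, j = decimal_scaling([-986, 917])       # j = 3
```

Min-max normalization preserves the relative spacing of the values but is sensitive to outliers, since a single extreme value sets the minimum or maximum; the z-score is usually less affected by that.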

Data Reduction Strategies
* A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set
* Data reduction
  - Obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
* Data reduction strategies
  - Data cube aggregation
  - Dimension reduction, e.g., remove unimportant attributes
  - Data compression
  - Numerosity reduction, e.g., fit data into models

Data Cube Aggregation
* The lowest level of a data cube: the base cuboid
  - the aggregated data for an individual entity of interest, e.g., sales or customer
* Multiple levels of aggregation in data cubes
  - Further reduce the size of the data to deal with
* Reference appropriate levels
  - Use the smallest representation that is enough to solve the task
* Queries regarding aggregated information should be answered using the data cube, when possible

Dimensionality Reduction
* Feature selection (i.e., attribute subset selection):
  - Select a minimum set of attributes such that the probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes
  - Reduces the number of attributes in the patterns, making them easier to understand
* There are 2^d possible attribute subsets of d attributes
* Heuristic methods: greedy
  - step-wise forward selection
  - step-wise backward elimination
  - combining forward selection and backward elimination

Data Compression
[Figure: Data Compression. Original Data -> Compressed Data (lossless); Original Data -> Approximated (lossy)]
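Step-wise forward selection, one of the greedy heuristics listed under "Dimensionality Reduction", can be sketched as below. The scoring function is the part a real application has to supply (for example, classifier accuracy on held-out data); the `toy_score` function, its weights and the attribute names are purely hypothetical and exist only to make the example runnable.

```python
from typing import Callable, FrozenSet, List, Sequence


def forward_selection(
    attributes: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
) -> List[str]:
    """Greedily add the attribute that improves the score most; stop when none does."""
    selected: List[str] = []
    remaining = list(attributes)
    best = score(frozenset(selected))
    while remaining:
        # Evaluate every one-attribute extension of the current subset.
        candidate_score, candidate = max(
            ((score(frozenset(selected + [a])), a) for a in remaining),
            key=lambda pair: pair[0],
        )
        if candidate_score <= best:
            break  # no extension improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected


# Hypothetical usefulness of each attribute, minus a small cost per attribute.
WEIGHTS = {"age": 0.5, "income": 0.3, "zip_code": 0.0, "customer_id": 0.0}


def toy_score(subset: FrozenSet[str]) -> float:
    return sum(WEIGHTS[a] for a in subset) - 0.05 * len(subset)


chosen = forward_selection(list(WEIGHTS), toy_score)
n_subsets = 2 ** len(WEIGHTS)  # exhaustive search would consider all 16 subsets
```

The greedy search evaluates at most d + (d - 1) + ... + 1 candidate subsets instead of all 2^d, at the price of possibly missing the best subset when attributes are only useful in combination.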

Data Compression
* String compression
  - There are extensive theories and well-tuned algorithms
  - Typically lossless
  - But only limited manipulation is possible without expansion
* Audio/video compression
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole

Numerosity Reduction
* Parametric methods
  - Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (outliers may also be stored)
* Non-parametric methods
  - Do not assume models
  - Major families: histograms, clustering, sampling

Regression
* Linear regression: data are modeled to fit a straight line
  - Often uses the least-squares method to fit the line
* Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector

Regression Analysis
* Linear regression: Y = α + β X
  - The two parameters, α and β, specify the line and are estimated from the data at hand
  - by applying the least-squares criterion to the known values of Y1, Y2, ..., X1, X2, ...
* Multiple regression: Y = b0 + b1 X1 + b2 X2
  - Many nonlinear functions can be transformed into the above
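For the linear model Y = α + β X, the least-squares estimates have a closed form: β is the sum of cross-deviations divided by the sum of squared X deviations, and α is the mean of Y minus β times the mean of X. The sketch below applies this to a small made-up data set and then keeps only the two parameters, in the spirit of parametric numerosity reduction; the function name and data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates (alpha, beta) for Y = alpha + beta * X."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Hypothetical points lying exactly on y = x + 1.
xs = [0, 1, 2, 3]
ys = [1, 2, 3, 4]
alpha, beta = fit_line(xs, ys)

# Parametric reduction: store (alpha, beta) and regenerate values on demand.
reconstructed = [alpha + beta * x for x in xs]
```

Here four data points are replaced by two numbers. On real data the reconstruction is only approximate, which is why the slides note that outliers may be stored alongside the parameters.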

Histograms
* A popular data reduction technique
* Bar chart
* Divide the data into buckets and store the frequencies for each bucket
* Partitioning rules, e.g.
  - Equi-width
  - Equi-depth
[Figure: histogram; horizontal axis from 10000 to 100000, vertical axis from 0 to 40]

Clustering
* Partition the data set into clusters, and store the cluster representation only
* Cluster representations of the data are used to replace the actual data
* Can be very effective if the data is clustered, but not if the data is smeared
* Can have hierarchical clustering, stored in multi-dimensional index tree structures

Sampling
* Allows a large data set to be represented by a much smaller random sample (or subset) of the data
* Choose a representative subset of the data
  - Simple random sampling may have very poor performance in the presence of skew
* Develop adaptive sampling methods
  - Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database
  - Used in conjunction with skewed data
* Sampling may not reduce database I/Os (page at a time)

Sampling
[Figure: Raw Data (N = 9); SRSWOR (simple random sample without replacement), n = 3; SRSWR (simple random sample with replacement), n = 3]
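The two simple random sampling schemes (SRSWOR and SRSWR) and an equi-width histogram can each be sketched with the standard library. The data values, the seed and the bucket count are assumptions chosen for illustration; the sizes N = 9 and n = 3 follow the sampling figure.

```python
import random


def srswor(data, n, rng):
    """Simple random sample WITHOUT replacement: no tuple is drawn twice."""
    return rng.sample(data, n)


def srswr(data, n, rng):
    """Simple random sample WITH replacement: a tuple may be drawn repeatedly."""
    return rng.choices(data, k=n)


def equi_width_histogram(values, n_buckets):
    """Count values per bucket, all buckets having the same width; needs max > min."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        # The maximum value would land one past the end, so clamp it.
        idx = min(int((v - lo) / width), n_buckets - 1)
        counts[idx] += 1
    return counts


rng = random.Random(42)            # fixed seed so the run is repeatable
raw = list(range(1, 10))           # N = 9 hypothetical tuple ids
without_replacement = srswor(raw, 3, rng)
with_replacement = srswr(raw, 3, rng)

values = [5, 12, 18, 25, 25, 31, 47, 50]   # hypothetical attribute values
counts = equi_width_histogram(values, 5)    # bucket width (50 - 5) / 5 = 9
```

Storing only `counts` (plus the range and bucket width) in place of `values` is the data reduction step; an equi-depth histogram would instead choose bucket edges so that each bucket holds roughly the same number of values.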

Sampling
[Figure: Raw Data; Cluster/Stratified Sample]

Sampling
[Figure: Raw Data and Stratified Sample]
* Raw Data:
  - T38, T256, T307, T391: young
  - T96, T117, T138, T263, T290, T308, T326, T387 (age label missing in the source)
  - T69, T284: senior
* Stratified Sample:
  - T38, T391: young
  - T117, T138, T290, T326 (age label missing in the source)
  - T69: senior

Discretization
* Three types of attributes:
  - Categorical/discrete attributes
    Nominal: values from an unordered set, e.g., color
    Ordinal: values from an ordered set, e.g., academic rank
  - Numeric/continuous attributes: integer or real numbers
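Stratified sampling draws from each stratum separately so that every subpopulation keeps roughly its share of the sample. The sketch below uses proportional allocation with at least one tuple per stratum; the record ids, the stratum label "middle" and the stratum sizes (4, 8 and 2) are hypothetical values chosen for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, stratum_of, fraction, rng):
    """Sample about `fraction` of each stratum, keeping at least one tuple per stratum."""
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))  # without replacement inside a stratum
    return sample


# Hypothetical (id, age group) records: 4 young, 8 middle, 2 senior.
records = (
    [(f"y{i}", "young") for i in range(4)]
    + [(f"m{i}", "middle") for i in range(8)]
    + [(f"s{i}", "senior") for i in range(2)]
)

sample = stratified_sample(records, lambda r: r[1], 0.5, random.Random(0))
per_stratum = Counter(label for _, label in sample)
```

With a 50% fraction, the sample holds 2, 4 and 1 tuples from the three strata, so the small "senior" group is guaranteed to be represented. A simple random sample of the same size could miss it entirely, which is the skew problem the slides mention.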

Discretization
* Discretization:
  - Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
  - Interval labels can then be used to replace the actual data values
* Some classification algorithms only accept categorical attributes, e.g., decision trees
* Prepares the data for further analysis

Concept hierarchy
* Concept hierarchies:
  - Define a sequence of mappings from a set of low-level concepts to higher-level (more general) concepts
  - Reduce the data by collecting low-level concepts and replacing them with higher-level concepts, e.g., replacing numeric values of the attribute age with labels such as young or senior

Discretization and concept hierarchy generation for numeric data
* Binning (see sections before)
* Histogram analysis (see sections before)
* Clustering analysis (see sections before)

Concept hierarchy generation for categorical data (cont.)
* Specification of a set of attributes, but not of their partial ordering
  - country: 15 distinct values
  - province_or_state: 65 distinct values
  - city: 3567 distinct values
  - street: 674,339 distinct values
* Specification of only a partial set of attributes
  - street < city
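Two of the ideas on these slides lend themselves to a short sketch: replacing a numeric value by an interval label, and ordering a set of categorical attributes into a hierarchy when only the attributes are given. For the second, the sketch uses a common heuristic, assumed here rather than stated on the slide: the attribute with the fewest distinct values goes to the top of the hierarchy. The cut points 30 and 60 for age are hypothetical.

```python
from bisect import bisect_right


def interval_label(value, cut_points):
    """Map a numeric value to a half-open interval label such as '[30, 60)'.

    cut_points must be sorted in increasing order.
    """
    i = bisect_right(cut_points, value)
    lo = "-inf" if i == 0 else str(cut_points[i - 1])
    hi = "+inf" if i == len(cut_points) else str(cut_points[i])
    return f"[{lo}, {hi})"


def order_hierarchy(distinct_counts):
    """Order attributes from most general (fewest distinct values) to most specific."""
    return sorted(distinct_counts, key=distinct_counts.get)


ages = [12, 30, 42, 75]
labels = [interval_label(a, [30, 60]) for a in ages]

# Distinct-value counts taken from the slide.
location_counts = {
    "street": 674339,
    "country": 15,
    "city": 3567,
    "province_or_state": 65,
}
hierarchy = order_hierarchy(location_counts)
```

The distinct-value heuristic recovers street < city < province_or_state < country for this example, but it is not always right: an attribute such as weekday has only seven distinct values yet does not sit above month or year, so the generated ordering should be checked by someone who knows the data.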

Summary
* Data preparation is an important issue for both warehousing and mining
* Data preparation includes:
  - Data cleaning -> fill in missing values, smooth noisy data, identify or remove outliers, resolve inconsistencies
  - Data integration -> schema integration, correlation analysis, data conflict detection
  - Data transformation -> smoothing, aggregation, generalization, normalization, attribute construction
  - Data reduction -> data cube aggregation, dimension reduction, data compression, numerosity reduction, discretization