Data Mining Data preprocessing Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 15

Table of contents
1 Introduction
2 Data preprocessing
3 Data cleaning
4 Data integration
5 Data transformation
6 Reading

Data mining process
Real-world databases are highly susceptible to noisy, missing, and inconsistent data because of their typically huge size and their likely origin from multiple, heterogeneous sources.
Low-quality data lead to low-quality mining results.
Data have quality if they satisfy the requirements of the intended use.
Factors comprising data quality:
- Accuracy (the data do not contain errors)
- Completeness (all attributes of interest are filled in)
- Consistency
- Timeliness
- Believability
- Interpretability

Data preprocessing
How can the data be preprocessed to improve the quality of the data and, consequently, of the mining results?
How can the data be preprocessed to improve the efficiency and ease of the mining process?
There are several data preprocessing techniques:
- Data cleaning can be applied to remove noise and correct inconsistencies in the data.
- Data integration merges data from multiple sources into a coherent data store, such as a data warehouse.
- Data reduction can reduce the data size by aggregating, eliminating redundant features, or clustering.
- Data transformations (e.g., normalization) may be applied, where data are scaled to fall within a smaller range such as 0.0 to 1.0. This can improve the accuracy and efficiency of mining algorithms that involve distance measurements.
These techniques are not mutually exclusive; they may work together.

Data cleaning
In summary, real-world data tend to be dirty, incomplete, and inconsistent. Data preprocessing techniques can improve data quality, thereby helping to improve the accuracy and efficiency of the subsequent mining process. Data preprocessing is an important step in the knowledge discovery process, because quality decisions must be based on quality data. Detecting data anomalies, rectifying them early, and reducing the data to be analyzed can lead to huge payoffs for decision making.
Data cleaning routines attempt to clean the data by
- Filling in missing values
- Smoothing out noisy data
- Identifying or removing outliers
- Correcting inconsistencies in the data (for example, the attribute for customer identification may be referred to as customer-id in one data store and cust-id in another)

Filling missing values
In real-world data, many tuples have no recorded value for several attributes.
How can you go about filling in the missing values for such an attribute?
- Ignore the tuple.
- Fill in the missing value manually.
- Use a global constant, such as "unknown", to fill in the missing value.
- Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value.
- Use the attribute mean or median of all samples belonging to the same class as the given tuple.
- Use the most probable value to fill in the missing value (determined by regression, inference-based tools using a Bayesian formalism, or decision tree induction).
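
The fill-in strategies listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fill_missing`, the use of `None` to mark a missing value, and the toy `customers` records are all hypothetical and do not come from the slides.

```python
from statistics import mean, median


def fill_missing(rows, attr, strategy="mean", class_attr=None, constant="unknown"):
    """Return copies of `rows` in which a None value of `attr` is replaced.

    strategy   -- "mean", "median", or "constant" (hypothetical names)
    class_attr -- if given, use the mean/median of the tuple's own class
    """
    if strategy == "constant":
        # Global constant: every missing value gets the same label.
        return [{**r, attr: constant} if r[attr] is None else dict(r) for r in rows]

    def central(values):
        return mean(values) if strategy == "mean" else median(values)

    # Central tendency over all observed values of the attribute.
    overall = central([r[attr] for r in rows if r[attr] is not None])

    # Central tendency per class, computed only when a class attribute is given.
    by_class = {}
    if class_attr is not None:
        groups = {}
        for r in rows:
            if r[attr] is not None:
                groups.setdefault(r[class_attr], []).append(r[attr])
        by_class = {label: central(vals) for label, vals in groups.items()}

    filled = []
    for r in rows:
        if r[attr] is None:
            # Fall back to the overall value if the class has no observed values.
            value = by_class.get(r[class_attr], overall) if class_attr else overall
            filled.append({**r, attr: value})
        else:
            filled.append(dict(r))
    return filled


# Toy data (made up for illustration): income in thousands, class label "risk".
customers = [
    {"income": 30, "risk": "low"},
    {"income": 50, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 100, "risk": "high"},
    {"income": None, "risk": "high"},
]

by_mean = fill_missing(customers, "income")                      # overall mean
by_class = fill_missing(customers, "income", class_attr="risk")  # mean per class
```

The class-conditional variant usually gives a closer estimate than the global mean when the attribute differs strongly between classes, as it does in this toy data.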

Smooth out noisy data
What is noise? Noise is a random error or variance in a measured variable.
Given a numeric attribute, how can we smooth out the data to remove the noise?
Binning: binning methods smooth a sorted data value by consulting its neighborhood, that is, the values around it.
- Data partitioning: equal-frequency versus equal-width bins
- Smoothing methods: smoothing by bin means, by bin medians, or by bin boundaries
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
  Bin 1: 4, 8, 15
  Bin 2: 21, 21, 24
  Bin 3: 25, 28, 34
Smoothing by bin means:
  Bin 1: 9, 9, 9
  Bin 2: 22, 22, 22
  Bin 3: 29, 29, 29
Smoothing by bin boundaries:
  Bin 1: 4, 4, 15
  Bin 2: 21, 21, 24
  Bin 3: 25, 25, 34
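
The price example can be reproduced with a short Python sketch. The function names are hypothetical, the partitioning assumes the number of values is divisible by the number of bins, and ties in boundary smoothing are resolved toward the lower boundary (an assumption; the slides do not say how ties are handled).

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted values into n_bins bins of equal size.

    Assumes len(sorted_values) is divisible by n_bins.
    """
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Ties go to the lower boundary (an assumption of this sketch).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
means = smooth_by_means(bins)
boundaries = smooth_by_boundaries(bins)
```

Here `bins` holds the three bins of the example, and `means` and `boundaries` hold the two smoothed versions shown on the slide.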

Smooth out noisy data (cont.)
How can we smooth out the data to remove the noise?
- Regression: data smoothing can also be done by regression, which fits the data values to a function (for example, a line).
- Outlier analysis: outliers may be detected by clustering. Intuitively, values that fall outside of the set of clusters may be considered outliers.
[Figure 3.3: A 2-D customer data plot with respect to customer locations in a city, showing three ...]
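
As an illustration of smoothing by regression, the sketch below fits a least-squares line to one attribute as a function of another and replaces each observed value by the fitted one. The function names and the four data points are made up for this example; the slides do not prescribe a particular regression model.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # requires at least two distinct x values
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each y by the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Noisy toy measurements (hypothetical values).
xs = [0, 1, 2, 3]
ys = [1, 3, 2, 4]
smoothed = smooth_by_regression(xs, ys)
```

For these points the fitted line has slope 0.8 and intercept 1.3, so the jagged sequence 1, 3, 2, 4 is replaced by values that rise steadily along the line.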

Data cleaning as a process
Missing values, noise, and inconsistencies contribute to inaccurate data.
The first step in the data cleaning process is discrepancy detection.
Discrepancies can be caused by several factors, including
- Poorly designed data entry forms with many optional fields
- Human error in data entry
- Data decay (e.g., outdated addresses)
- Inconsistent data representation
- Inconsistent use of codes
- Errors in instrumentation devices
As a starting point, use any domain knowledge you have about the data, for example the expected date format.
The data should also be examined with respect to the following rules:
- Unique rule: each value of the given attribute must be different from all other values of that attribute.
- Consecutive rule: there can be no missing values between the lowest and highest values of the attribute, and all values must also be unique.
- Null rule: specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition, and how such values should be handled.
Some data inconsistencies may be corrected manually using external references (e.g., using a paper trace).
Most errors will require data transformations (define and apply a series of transformations to correct the given attribute).
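
The three rules can be checked mechanically. The following Python sketch shows one possible way to do so; the function names and the default set of null markers (`""`, `"?"`, `"N/A"`) are assumptions made for illustration, since in practice the null rule is defined per data set.

```python
def satisfies_unique_rule(values):
    """True if no value of the attribute occurs more than once."""
    return len(set(values)) == len(values)


def satisfies_consecutive_rule(values):
    """True if integer values are unique and leave no gap between min and max."""
    if not values:
        return True
    return (satisfies_unique_rule(values)
            and set(values) == set(range(min(values), max(values) + 1)))


def find_null_like(values, markers=("", "?", "N/A")):
    """Return the positions of values that the null rule treats as missing.

    `markers` is a hypothetical default; None always counts as missing.
    """
    return [i for i, v in enumerate(values) if v is None or v in markers]


check_numbers = [1003, 1001, 1002]      # made-up example values
phone_field = ["555-1234", "?", "", None, "555-9876"]

ok_unique = satisfies_unique_rule(check_numbers)
ok_consecutive = satisfies_consecutive_rule(check_numbers)
null_positions = find_null_like(phone_field)
```

Checks like these only detect discrepancies; deciding how to repair them still needs domain knowledge or a transformation step.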

Data integration
Data mining often requires data integration (the merging of data from multiple data stores).
Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set.
This can help improve the accuracy and speed of the subsequent data mining process.
Issues in data integration:
- Entity identification: schema integration and object matching can be tricky.
- Redundancy and correlation analysis: an attribute (such as annual revenue) may be redundant if it can be derived from another attribute or set of attributes.
- Tuple duplication: two or more records may refer to the same object.
- Data value conflict detection and resolution: for the same real-world entity, attribute values from different sources may differ (e.g., telephone number). This may be due to differences in representation, scaling, or encoding.
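
For numeric attributes, one common way to carry out the correlation analysis mentioned above is the Pearson correlation coefficient; a value close to +1 or -1 suggests that one attribute may be redundant given the other. The sketch below is a minimal illustration: the function name and the revenue figures are hypothetical, and the slides do not fix a particular correlation measure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either attribute is constant.
    return cov / sqrt(var_x * var_y)


# Made-up data: annual revenue is derived from monthly revenue.
monthly_revenue = [10, 12, 9, 15]
annual_revenue = [12 * m for m in monthly_revenue]

r = pearson(monthly_revenue, annual_revenue)
```

Because `annual_revenue` is an exact multiple of `monthly_revenue`, the coefficient is 1 up to floating-point error, which flags one of the two attributes as a candidate for removal.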

Data reduction
The given data set may be huge, and data analysis may take a long time.
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.
[Figure, data reduction: attributes A1, A2, A3, ..., A126 are reduced to A1, A3, ..., A115; transactions T1, T2, T3, T4, ..., T2000 are reduced to T1, T4, ..., T1456.]
[Figure, data transformation: the values 2, 32, 100, 59, 48 become 0.02, 0.32, 1.00, 0.59, 0.48.]

Data reduction (cont.)
Data reduction strategies:
- Dimensionality reduction: the process of reducing the number of attributes under consideration.
  - Feature extraction (PCA, MDS, ...)
  - Feature selection
- Numerosity reduction: these techniques replace the original data volume by an alternative, smaller form of data representation.
  - Linear regression
  - Histograms
  - Clustering
  - Sampling
  - Data cube aggregation
- Data compression: transformations are applied so as to obtain a reduced or compressed representation of the original data.
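
Two of the numerosity reduction techniques, histograms and sampling, are easy to sketch in Python. The example reuses the sorted price list from the binning slide; the function names, the bucket width of 10, and the random seed are assumptions chosen for illustration.

```python
import random
from collections import Counter


def equal_width_histogram(values, width):
    """Map each bucket's lower bound to the number of values in [lower, lower + width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


def simple_random_sample(values, n, seed=None):
    """Simple random sample of n values without replacement."""
    rng = random.Random(seed)  # a seed makes the sketch reproducible
    return rng.sample(values, n)


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

histogram = equal_width_histogram(prices, 10)
sample = simple_random_sample(prices, 4, seed=0)
```

The histogram stores four bucket counts instead of nine values, and the sample keeps four of the nine values; both are smaller representations that approximate the original distribution.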

Data transformation
In this step, the data are transformed or consolidated so that the resulting mining process may be more efficient and the patterns found may be easier to understand.
Data transformation strategies:
- Smoothing (binning, regression, clustering)
- Attribute construction (new attributes are constructed to help the mining process)
- Aggregation
- Normalization (min-max normalization, ...)
- Discretization (binning, histogram, decision tree, clustering)
- Concept hierarchy generation for nominal data
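
Min-max normalization, named in the list above, maps a value v of an attribute with observed minimum min and maximum max to v' = (v - min) / (max - min) * (new_max - new_min) + new_min. A minimal Python sketch follows; the function name and the sample values are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that min maps to new_min and max to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # The formula divides by (max - min), so a constant attribute is rejected.
        raise ValueError("min-max normalization is undefined for a constant attribute")
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]


ages = [20, 30, 40, 60]          # made-up attribute values
normalized = min_max_normalize(ages)
rescaled = min_max_normalize(ages, new_min=-1.0, new_max=1.0)
```

Min-max normalization preserves the relationships among the original values, but a later value outside the original [min, max] range would fall outside the target range.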

Concept hierarchy generation
Attributes such as street can be generalized to higher-level concepts, like city or country.
There are four methods for generating concept hierarchies for nominal data:
- Specification of a partial ordering of attributes explicitly at the schema level by users or experts. A user or expert can easily define a concept hierarchy by specifying a partial or total ordering of the attributes at the schema level.
- Specification of a portion of a hierarchy by explicit data grouping. In a large database, it is unrealistic to define an entire concept hierarchy by explicit value enumeration.
- Specification of a set of attributes, but not of their partial ordering. A user may specify a set of attributes forming a concept hierarchy, but omit to explicitly state their partial ordering. The system can then try to automatically generate the attribute ordering so as to construct a meaningful concept hierarchy.
- Specification of only a partial set of attributes. Sometimes a user can be careless when defining a hierarchy, or have only a vague idea about what should be included in a hierarchy.
Example hierarchy, from the top level down:
  country (15 distinct values)
  province_or_state (365 distinct values)
  city (3567 distinct values)
  street (674,339 distinct values)
[Figure 3.13: Automatic generation of a schema concept hierarchy based on the number ...]
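
When only a set of attributes is given, the ordering in the country/street example can be generated by sorting the attributes by their number of distinct values, with the fewest distinct values at the top of the hierarchy. The sketch below assumes the distinct-value counts are already known; the function name is hypothetical.

```python
def order_by_distinct_values(distinct_counts):
    """Return attribute names from the top of the hierarchy (fewest values) down."""
    return sorted(distinct_counts, key=distinct_counts.get)


# Distinct-value counts taken from the example on the slide.
location_attributes = {
    "street": 674_339,
    "country": 15,
    "city": 3567,
    "province_or_state": 365,
}

hierarchy = order_by_distinct_values(location_attributes)
# Top level first: country, province_or_state, city, street
```

This is a heuristic rather than a guarantee: an attribute with fewer distinct values is not always the more general concept, so the generated ordering should be reviewed by a user or expert.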

Reading
Read Chapter 3 of the following book:
J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2012.