
CC: BY NC ND

Table of Contents

Chapter 2. Data Preprocessing
  2.1 Basic Representation for Data: Database Viewpoint
  2.2 Data Preprocessing in the Database Point of View
  2.3 Data Cleaning
  2.4 Data Integration, Transformation, and Reduction
  2.5 Data Transformation: Attribute Construction and Normalization
      Attribute Construction
      Attribute Normalization
      Time-dependent Attribute Transformation (Feature Construction)
  2.6 Data Reduction
      Reduction in the number of attributes
      Reduction in the number of tuples
      Reduction in the number of possible values
  2.7 Dimensionality Reduction Techniques
      Discrete Wavelet Transforms (DWT)
      Principal Components Analysis
  2.8 Summary
  Historical Bibliography
  Exercise

Chapter 2. Data Preprocessing

Mining knowledge is a kind of discovery of relationships that exist in a database, which models situations in domains such as the physical, business, and scientific domains. There are four basic types of learning (or mining) processes in data and text mining applications. In classification learning, the learning scheme uses a set of classified examples, each composed of attribute values and a target attribute (concept) value, to create a model for classifying unseen examples. In association learning, unlike classification, the process aims to find strong associations among attributes (features), not just ones that predict a particular class value. In clustering, without any class or target attribute, the set of examples is grouped according to similarity. In numeric prediction, the outcome to be predicted is not a discrete class but a numeric quantity. To find such relationships via the learning process, we need to understand the domain and the semantics of the data. Therefore, it is necessary to clarify what to represent and how to represent it.

In data and text mining, data preprocessing and feature processing are important factors that affect the quality of the mined knowledge. Real-world databases often include noisy, missing, and inconsistent data since they typically come from multiple heterogeneous sources. Such low-quality data will lead to invalid mining results. A practical solution is to preprocess the data in order to improve its quality and thereby obtain better mining results or a more efficient process. From a database viewpoint, typical preprocessing techniques include data cleaning, data integration, data transformation, and data reduction. Data cleaning aims to remove noise and resolve inconsistencies in the data. Data integration consolidates data from multiple sources into a single integrated data store. Data transformation adjusts the input into a suitable format with reasonable scaling in order to improve the accuracy and efficiency of mining algorithms. Data reduction compresses the data by sampling, aggregating, eliminating redundant features, or clustering, for better representation. These data preprocessing techniques can be applied complementarily to substantially improve the overall quality of the mined patterns and/or the time required for the actual mining.

From a pattern-recognition point of view, data preprocessing refers to feature processing, where observations or objects are modeled by a set of features. To obtain accurate recognition or prediction from such observations and objects, we need to model them with proper features. This task comprises a set of specific activities that modify features in order to improve data representation. Common activities are feature extraction and encoding, feature combination, and feature selection. As the first step, we extract features from observations or objects and then encode them in a form suitable for processing. Optionally, extracted features can be combined to obtain new, more appropriate features. Instead of considering all obtained features, selecting only some of them for mining can, in many cases, improve performance or accuracy.
This chapter gives both the database-oriented and the feature-oriented background for data preprocessing and feature processing, in order to bridge these two different notions and give novice readers a common understanding.

2.1 Basic Representation for Data: Database Viewpoint

In most cases, it is not an overstatement that, to understand a process or a system, it is more important to clarify what the inputs and outputs are than to know what goes on inside the process or the system. Data mining and text mining are no exception. Although in general the input to data mining may be represented in various forms, it is possible to formulate the input in the form of a table. With the table representation, the mining concept, the process, and its associated theory can be simplified. Under this assumption, the input can be reconstructed into the form of a table with columns specifying concepts or attributes, and rows

indicating instances or examples. In some fields, the output is called a concept description. Regardless of the type of knowledge we are going to mine, the object to be mined is called the concept, and the output produced by a mining scheme is called the concept description. For example, in the case of classification, given the Play-Tennis data in Figure 2-1, the problem is to learn a model that predicts the tennis-match status (column 5, called the concept) from the Play-Tennis information (columns 1 to 4, called features or attributes) for a new event (the test case), i.e., whether the answer should be yes or no.

[Figure 2-1 lists 14 training cases, each described by the attributes Outlook, Temp., Humidity, Windy, and Sponsor together with the concept Play (yes/no), followed by one test case whose Play value is to be predicted.]

Figure 2-1: The tennis-match playing dataset (the training cases and the test case)

In contrast to the tabular representation, an alternative representation of a dataset is a formal description in a mathematical schema. In this schema, a dataset D is composed of n instance objects o_1, ..., o_n, each of which, o_i, is represented by a vector of p attribute values a_i1, ..., a_ip over the attributes A_1, ..., A_p. In other words, a dataset can be viewed as a matrix with n rows and p columns, called a data matrix. For classification datasets, the last attribute (column) usually points to the target concept. Figure 2-2 shows the training dataset of Figure 2-1 in the form of a matrix; in this example, n = 14 and p = 5.

  D = [ (o_1, o_2, ..., o_14)^T, C ]    the dataset: data objects plus the concept
  o_i = (a_i1, a_i2, ..., a_i5)         an object characterized by its attribute values
  C = (c_1, c_2, ..., c_14)^T           the concept (class) column

Figure 2-2: The tennis-match playing dataset in the form of a matrix

Each instance object is characterized by the values of attributes that express different aspects of the instance, and it can be represented in the form of a table, as shown in Figure 2-1, or a matrix, as shown in Figure 2-2. Although there are many different types of attributes, we can roughly divide them into categorical and numeric ones. Categorical attributes include binary, nominal, and ordinal attributes, while numeric attributes cover interval and ratio attributes.

As the most basic attribute type, a binary (sometimes called Boolean) attribute models a discrete state that takes only two values, such as true/false, one/zero, or high/low. If an attribute can take only two values, it is naturally binary. For example, in the tennis-match playing data, the attribute windy can take only either true or false. A nominal attribute takes a value from a prespecified, finite set of symbols, which usually serve just as labels or names; there is no relation implied among these symbols. For example, in the tennis-match playing data, the attribute sponsor of a tennis match can be Sony, HP, or Ford. This sponsor attribute is nominal since it does not make sense to perform arithmetic operations, such as addition, multiplication, or size comparison, on its values; only equality or inequality can be tested. In contrast to a nominal attribute, an ordinal attribute allows us to compare or rank its possible values. However, although there is a notion of ordering, there is no exact description of distance. For example, in the tennis-match playing data, the attribute outlook may take a value of sunny, overcast, or rainy. The point is that the value overcast lies between the other two, sunny and rainy. Although it makes sense to compare two values, it does not make sense to add, subtract, multiply, or divide them. Moreover, it is uncommon to compare the difference between sunny and overcast to the difference between overcast and rainy.

As a conventional numeric attribute, an interval attribute takes continuous values, which are not only ordered but also measured in fixed and equal units. For example, in the tennis-match playing data, temperature can be expressed as a continuous number in units of either Celsius or Fahrenheit. It is common to consider the difference between two temperatures but unusual to think about their summation, multiplication, or division. Another example is dates, especially years. The value of years in dates is an integer. It makes sense to think about the difference between two dates, say 10 years between 2000 and 2010, but not their summation, multiplication, or division. The zero (or starting) points for temperatures and dates are completely arbitrary, assigned by convention. On the other hand, as another numeric attribute, a ratio attribute has an inherently defined zero point. It is suitable to consider the distance (subtraction) from one object to others, and also their summation, multiplication, and division. For example, in the tennis-match playing data, the attribute humidity can be measured as a percentage, ranging between 0% and 100%; it has an absolute zero. However, it is sometimes unclear whether an attribute is interval or ratio, since these two types are distinguished only by the definition of the zero point.
For example, while the zero for temperatures in Celsius or Fahrenheit is arbitrary, the absolute zero for temperatures in the Kelvin unit defines the theoretical absence of all thermal energy. There are many cases like this in real situations. Moreover, note that integer and continuous attributes are very similar in terms of mining models, since both can be ordered; therefore, they are grouped together.

2.2 Data Preprocessing in the Database Point of View

It is quite usual that a database includes errors, unusual values, and inconsistencies in the data recorded for some transactions. Existing data are incomplete (some records may lack attribute values or certain attributes of interest, or contain only summarized data), noisy (some

records may contain errors due to machine or human mistakes, outlier values deviating from the norm, or duplicated records due to faulty instruments), and inconsistent (some attributes may vary in their units or follow different coding standards). Moreover, in several situations, it is necessary to consider data from several sources, instead of one single source, to find useful knowledge and to improve data quality by some kind of transformation. Such faulty or isolated data may reduce the quality and/or reliability of the output of the mining process. From the database point of view, typical preprocessing techniques include data cleaning, data integration, data transformation, and data reduction. Although most mining routines have some procedures for dealing with such data, they are not very robust. This section describes how to preprocess such data in real-world situations.

Common procedures in data cleaning are, for example, filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies in an erroneous database. Integrating data from several sources may add value to each single isolated source of data. Data reduction produces a reduced representation of an original data set, which can be used to generate a predictive model with comparable or better performance. Some data reduction tasks are data aggregation (e.g., summarizing along some dimensions), attribute subset selection (e.g., removing irrelevant attributes via correlation analysis), dimensionality reduction (e.g., translating the original dimensions into a smaller number of dimensions using some encoding scheme such as minimum-length encoding or wavelets), and numerosity reduction (e.g., reducing the variety of data values by grouping values into clusters or some parametric model).

2.3 Data Cleaning

The purpose of data cleaning includes filling in missing values, smoothing or eliminating noise, identifying outliers, and correcting inconsistencies in the data.

Missing Values

When a record has no value for some of its attributes, the following are possible ways to handle the missing values. Figure 2-3 shows an example for each case.

1. If some values are missing from a record (tuple) in the data, it is possible to simply ignore that record. This is usually done when some important attributes or the class label are missing (especially for the classification task). For example, the second record in Figure 2-3 might be ignored since it has no value for humidity, sponsor, and play. The more attribute values are missing from the record (tuple), the more reasonable this method is. It is not a good choice if the percentage of missing values per attribute is high.

2. A more accurate but less practical approach is to fill in the missing value manually. For example, the missing outlook of the third record in the tennis-match playing data can be filled in manually with overcast after careful checking, as shown for the third record in Figure 2-3. This method is in general time-consuming and may not be feasible when the data set is large with a high portion of missing values.

3. A more practical but less accurate approach is to assign a constant value, say unknown, for the missing value. However, while this label may recover the usability of records with missing values, as a side effect the unknown value may mislead us into treating it as an interesting concept. This approach is simple but it is not foolproof.
For example, we can replace the missing sponsor in the tennis-match playing data with unknown, as in the fourth record of Figure 2-3.

4. A more reasonable approach is to use statistical or information-theoretic techniques to recover the values, as follows.

a. Replace the missing value with the attribute mean. For example, we can replace the missing humidity in the tennis-match playing data with the average humidity over the twelve observed values (66.54), as in the fifth record of Figure 2-3.

b. Replace the missing value with the attribute mean computed over all samples belonging to the same class as the value-missing record. For example, we can replace the missing temperature in the tennis-match playing data with the average temperature of the class play = no (28.33, averaged over the three such records with observed temperature), as in the sixth record of Figure 2-3.

c. Replace the missing value with the most probable value, calculated by regression, Bayesian inference, or decision tree induction. For example, it is possible to predict the missing value of windy by using a decision tree constructed from the other attributes, as in the seventh record of Figure 2-3. This inference approach is popular and less biased than the other methods, since it uses the information present in the data to predict the missing values.

[Figure 2-3 repeats the training cases of Figure 2-1 with several values missing; the parenthesized entries, e.g. (overcast) and (true), and the label unknown show how each missing value is recovered by the methods above.]

Figure 2-3: Examples of handling missing data in the tennis-match playing dataset
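To make item 4 concrete, the following is a minimal sketch of mean imputation (4a) and class-conditional mean imputation (4b). The records, the column names, and the helper functions are illustrative assumptions, not the actual Figure 2-3 data.

```python
# Minimal sketch of mean and class-conditional mean imputation (items 4a and 4b).
# The records and column names are illustrative, not the actual Figure 2-3 data.
from statistics import mean

records = [
    {"temp": 21.0, "play": "yes"},
    {"temp": None, "play": "no"},   # missing temperature to be imputed
    {"temp": 30.0, "play": "no"},
    {"temp": 27.0, "play": "no"},
    {"temp": 18.0, "play": "yes"},
]

def impute_mean(rows, attr):
    """4a: replace missing values with the overall attribute mean."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    fill = mean(observed)
    return [dict(r, **{attr: r[attr] if r[attr] is not None else fill}) for r in rows]

def impute_class_mean(rows, attr, cls):
    """4b: replace missing values with the mean of the records in the same class."""
    out = []
    for r in rows:
        if r[attr] is None:
            same = [s[attr] for s in rows if s[cls] == r[cls] and s[attr] is not None]
            r = dict(r, **{attr: mean(same)})
        out.append(r)
    return out

print(impute_mean(records, "temp"))
print(impute_class_mean(records, "temp", "play"))
```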

Noisy Data

Noisy data are inherent in many real-life and industrial applications. Several solutions to noisy data are possible, as follows.

1. Binning: The simplest approach to data smoothing is binning. Binning smooths a sorted data value by consulting its neighboring values (local smoothing). Two common variants are equal-width binning and equal-frequency (equal-depth) binning.

a. Equal-width binning. This method first divides the range under consideration into n equal-width intervals (bins), forming a uniform grid. All data within each range are then replaced by the representative value (e.g., the mean) of that range. Figure 2-4 (a) shows an example of equal-width binning on the temperature attribute. Although this method is the most straightforward, it is not good at handling skewed data and outliers, since they may dominate the final representation.

b. Equal-frequency (equal-depth) binning. This method splits the range into n intervals (bins), each containing approximately the same number of samples. All data within each bin are then replaced by the mean, median, or closest boundary value of the bin. This method is good at handling skewed data. Figure 2-4 (b) shows an example of equal-frequency binning on the temperature attribute. In smoothing by bin means, the mean of the values 7, 9, and 10 in Bin 1 is 8.7; therefore, each original value in this bin is replaced by 8.7. Similarly, in smoothing by bin medians, each bin value is replaced by the bin median, that is, 9. For smoothing by bin boundaries, the minimum and maximum values of each bin are first identified as the bin boundaries; each bin value is then replaced by the closest boundary value. In general, the deeper the bin, the larger the smoothing effect.

(a) Equal-width binning with min = 0, max = 50, and 5 bins

  Bin     Range      Average
  Bin 1   [0, 10)     5
  Bin 2   [10, 20)   15
  Bin 3   [20, 30)   25
  Bin 4   [30, 40)   35
  Bin 5   [40, 50]   45

(b) Equal-frequency (equal-depth) binning with bin depth = 3

  Sorted data for temperature: 7, 9, 10, 12, 22, 23, 24, 24, 25, 26, 32, 35, 37, 40

  Bin     Bin content   Bin mean           Bin median    Bin boundary
  Bin 1   7, 9, 10      8.7, 8.7, 8.7      9, 9, 9       7, 10, 10
  Bin 2   12, 22, 23    19.0, 19.0, 19.0   22, 22, 22    12, 23, 23
  Bin 3   24, 24, 25    24.3, 24.3, 24.3   24, 24, 24    24, 24, 25
  Bin 4   26, 32, 35    31.0, 31.0, 31.0   32, 32, 32    26, 35, 35
  Bin 5   37, 40        38.5, 38.5         38.5, 38.5    37, 40

Figure 2-4: (a) equal-width, and (b) equal-frequency (equal-depth) methods
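The binning of Figure 2-4 can be sketched as follows. This is a minimal illustration using the sorted temperature values above; the function names and the midpoint/mean conventions are our own choices.

```python
# Equal-width and equal-frequency binning with smoothing by bin representatives.
# Uses the sorted temperature values of Figure 2-4; helper names are illustrative.
temps = [7, 9, 10, 12, 22, 23, 24, 24, 25, 26, 32, 35, 37, 40]

def equal_width_bins(values, lo, hi, n_bins):
    """Replace each value by the midpoint of its fixed-width interval."""
    width = (hi - lo) / n_bins
    smoothed = []
    for v in values:
        idx = min(int((v - lo) // width), n_bins - 1)   # clamp the maximum into the last bin
        smoothed.append(lo + (idx + 0.5) * width)       # bin midpoint, e.g. 5, 15, 25, ...
    return smoothed

def equal_freq_bins(values, depth):
    """Partition sorted values into bins of about `depth` items and smooth by bin mean."""
    values = sorted(values)
    smoothed = []
    for i in range(0, len(values), depth):
        bin_vals = values[i:i + depth]
        m = round(sum(bin_vals) / len(bin_vals), 1)
        smoothed.extend([m] * len(bin_vals))
    return smoothed

print(equal_width_bins(temps, 0, 50, 5))   # midpoints 5, 15, 25, 35, 45
print(equal_freq_bins(temps, 3))           # 8.7, 8.7, 8.7, 19.0, ...
```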

2. Regression: A simple parametric method for data smoothing is regression. It is possible to smooth the data of one attribute (column) by fitting them to a predefined function of other attributes (columns). The simplest case is linear regression, where the function involves only two attributes, so one attribute can be used to predict the other. In higher dimensions, multiple linear regression extends linear regression to more than two attributes. While the regression approach seems reasonable, it can be hard to decide which attribute should depend on which other attributes in order to perform the smoothing. Regression also results in data reduction. Figure 2-5 shows a case where Y and X depend on each other; we can therefore smooth the data by fitting them to the regression line.

Figure 2-5: Regression

3. Clustering: Besides smoothing noisy values, it may be necessary to eliminate noisy records. Noisy records can be recognized as outliers via clustering, which organizes similar records into groups (clusters). Records that fall outside every cluster may be considered outliers, as shown in Figure 2-6.

Figure 2-6: Clusters and outliers

2.4. Data Integration, Transformation, and Reduction

Besides data cleaning, in several cases we need to combine data from several data sources, a task known as data integration. These data may also need to be transformed into appropriate forms before mining. While data integration combines data from multiple sources into a coherent single data store, it may run into the problem of schema mismatch, causing redundancy and inconsistency. An attribute (e.g., annual salary = base salary + overtime) may be redundant if it can be calculated from another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set (e.g., full name vs. abbreviation). Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data. For numeric attributes, we can evaluate the correlation between two attributes, X and Y, by computing the correlation coefficient (Pearson's product-moment coefficient):

  r_XY = Σ_i (x_i − x̄)(y_i − ȳ) / (N σ_X σ_Y)

where N is the number of records (tuples), x_i and y_i are the respective values of X and Y in the i-th record, x̄ and ȳ are the respective mean values of X and Y, and σ_X and σ_Y are the respective standard deviations of X and Y. Naturally, −1 ≤ r_XY ≤ +1. When r_XY is greater than 0, X and Y are positively correlated: the values of X decrease as the values of Y decrease, and the values of X increase as the values of Y increase. If r_XY is less than 0, X and Y are negatively correlated: the values of X decrease as the values of Y increase, and vice versa. In both positive and negative correlation, the higher the absolute value of r_XY, the stronger the correlation, i.e., the more strongly each attribute implies the other. Hence, a high absolute value may indicate that X (or Y) can be removed as a redundancy. If the resulting value equals zero, then X and Y are independent and there is no correlation between them.

Figure 2-7 shows an example of the correlation calculation between A-B, A-C, and A-D (sample size N = 6). According to the correlation results, A and B are completely correlated (r = 1.00), C is negatively correlated with A (r = −0.988), and D has almost no correlation with A (r = −0.004). In general, it is possible to visualize the correlation among attributes by drawing scatter plots. Figure 2-8 shows the scatter plots of A-B (upper graph) and of A-C and A-D (lower graph), confirming that B is completely correlated with A, C is negatively correlated with A, and D has almost no correlation with A. This result tells us that we can ignore the attributes B and C if we keep the attribute A, while we retain the attribute D since it cannot be inferred from A.

Note that correlation does not imply causality. That is, if X and Y are correlated, this does not necessarily mean that X causes Y or that Y causes X. For example, in analyzing a demographic database, we may find that attributes representing the number of hospitals and the number of car thefts in a region are correlated. This does not mean that one causes the other. Intuitively, both attributes relate to population, another important attribute.

[Figure 2-7 lists the six records of the attributes A (Sales in Euro), B (Sales in Dollar), C (Investment Risk), and D (Humidity), with means of roughly 240, 345.6, 77.17, and 55, respectively, together with the detailed calculation of each attribute's standard deviation, the cross-product sums of (A − Ā) with (B − B̄), (C − C̄), and (D − D̄), and the resulting correlation coefficients SUM/(N·σ_A·σ_B), SUM/(N·σ_A·σ_C), and SUM/(N·σ_A·σ_D) for N = 6.]

Figure 2-7: Correlation Analysis

Figure 2-8: Scatter plots of A-B, A-C and A-D
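The correlation coefficient defined above can be computed directly. The sketch below uses small illustrative columns, not the actual values of Figure 2-7, and the function name `pearson` is our own.

```python
# Pearson product-moment correlation between two numeric attributes X and Y.
# The sample values are illustrative, not the actual columns of Figure 2-7.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)   # population standard deviation
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return cov / (sx * sy)

a = [200, 220, 240, 260, 280, 300]       # e.g. sales in euro
b = [x * 1.44 for x in a]                # a perfectly correlated attribute
c = [90, 85, 78, 74, 70, 66]             # an attribute that decreases as A increases
print(round(pearson(a, b), 3))           # 1.0 -> redundant attribute
print(round(pearson(a, c), 3))           # close to -1 -> strong negative correlation
```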

Apart from numeric data, the correlation relationship between two categorical attributes, say X and Y, can be examined with a χ² (chi-square) test, using a table of chi-square critical values such as the one below, where DF is the degrees of freedom and P is the significance level.

        Significance level (P)
  DF    0.05     0.01     0.001
  1     3.841    6.635    10.828
  2     5.991    9.210    13.816
  3     7.815    11.345   16.266
  4     9.488    13.277   18.467
  5     11.070   15.086   20.515

Suppose X has p distinct values, x_1, x_2, ..., x_p, and Y has q distinct values, y_1, y_2, ..., y_q. The co-occurrence frequencies between the x values and the y values can be shown as a contingency table, with the p values of X as the columns and the q values of Y as the rows. The χ² value (also known as the Pearson χ² statistic) of these two attributes is computed as

  χ² = Σ_i Σ_j (o_ij − e_ij)² / e_ij

where the summation runs over all p distinct values of X and q distinct values of Y. Here, o_ij denotes the observed frequency with which the attribute X takes the value x_i while the attribute Y takes the value y_j, and e_ij is the expected frequency of x_i and y_j co-occurring under the independence assumption, computed as

  e_ij = count(X = x_i) × count(Y = y_j) / N

where count(X = x_i) and count(Y = y_j) are the frequencies of x_i and y_j, respectively, and N is the number of data tuples. For instance, consider the two contingency tables in Figure 2-9 (b) and (c), which are derived from Figure 2-9 (a); the numbers in parentheses are the expected frequencies. The χ² statistic is used to test the independence hypothesis. The test of whether X and Y are independent involves a significance level, with (r−1)×(c−1) degrees of freedom, where r is the number of possible values of the attribute X and c is the number of possible values of the attribute Y. For example, the calculation in Figure 2-9 (b) tests the hypothesis that Windy and Humidity are independent of each other. Assume that we use the significance level of 0.001. In this case, since the degree of freedom is 1 (r = 2 and c = 2), the χ² bound is 10.828, as shown in the last column (P = 0.001) of the first row (DF = 1) of the chi-square table. Since the calculated χ² value is 0.0, which is lower than 10.828, Windy and Humidity are considered independent. As another example, in Figure 2-9 (c) we can check the independence of Outlook and Temperature. With 4 degrees of freedom (r = 3 and c = 3), the χ² bound is 18.467, as shown in the last column (P = 0.001) of the row DF = 4. As the calculated χ² value is 33.25, which is above the bound, we can reject the hypothesis that the attributes Outlook and Temperature are independent and conclude that these two attributes are correlated or associated.

More advanced treatments note that, in practice, the chi-square test is problematic when frequencies are low; several sources advise against using it when the maximum expected frequency is below 10. In such cases, Yates' correction for continuity or Fisher's exact test is applied instead (commonly for expected frequencies below 10 and below 5, respectively).

In addition to detecting redundancies between attributes, duplication may occur at the tuple level, for instance when there are two or more identical tuples for a given entry. Moreover, another hard problem in data integration is detecting and resolving conflicts in data values inside the database. In real-world situations, integration of attribute values from different sources may suffer from differences in representation, such as encoding, scaling, and so on. For instance, a length attribute may be stored in metric units in one system and in British imperial units in another. One source may use abbreviated names while another uses full names for organizations or addresses. Resolving such semantic and structural heterogeneity of the data is very challenging.
We need to perform careful integration of data from multiple sources to reduce and avoid redundancies and inconsistencies in the resulting data set. This effort helps improve the accuracy and speed of the subsequent mining process.

(a) The Play-Tennis data set

  Outlook    Temp.   Humidity   Windy   Play
  sunny      hot     high       false   no
  sunny      hot     high       true    no
  overcast   hot     high       false   yes
  rainy      mild    high       false   yes
  rainy      cool    normal     false   yes
  rainy      cool    normal     true    no
  overcast   cool    normal     true    yes
  sunny      mild    high       false   no
  sunny      cool    normal     false   yes
  rainy      mild    normal     false   yes
  sunny      mild    normal     true    yes
  overcast   mild    high       true    yes
  overcast   hot     normal     false   yes
  rainy      mild    high       true    no

(b) A 2 x 2 contingency table (degrees of freedom = (r−1)×(c−1) = (2−1)×(2−1) = 1)

  Humidity \ Windy   TRUE      FALSE     Total
  High               30 (30)   40 (40)   70
  Normal             30 (30)   40 (40)   70
  Total              60        80        140

(c) A 3 x 3 contingency table (degrees of freedom = (r−1)×(c−1) = (3−1)×(3−1) = 4)

  Temp. \ Outlook    sunny       overcast    rainy       Total
  hot                20 (14.3)   20 (11.4)   0 (14.3)    40
  mild               20 (21.4)   10 (17.1)   30 (21.4)   60
  cool               10 (14.3)   10 (11.4)   20 (14.3)   40
  Total              50          40          50          140

Figure 2-9: χ² (chi-square) test: (a) the sample data set (the Play-Tennis data set, Witten et al., 2003), (b) the 2 x 2 contingency table for the χ² test on windy and humidity, (c) the 3 x 3 contingency table for the χ² test on outlook and temperature. For (b) and (c), the numbers in parentheses are the expected frequencies.
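The χ² calculation for the Outlook-Temperature table of Figure 2-9 (c) can be reproduced in a few lines of code. This is a sketch; the dictionary-based layout is simply one convenient representation of the contingency table.

```python
# Pearson chi-square statistic for the Outlook x Temperature table of Figure 2-9 (c).
observed = {
    ("hot",  "sunny"): 20, ("hot",  "overcast"): 20, ("hot",  "rainy"): 0,
    ("mild", "sunny"): 20, ("mild", "overcast"): 10, ("mild", "rainy"): 30,
    ("cool", "sunny"): 10, ("cool", "overcast"): 10, ("cool", "rainy"): 20,
}

rows = {"hot", "mild", "cool"}
cols = {"sunny", "overcast", "rainy"}
n = sum(observed.values())
row_tot = {r: sum(observed[(r, c)] for c in cols) for r in rows}
col_tot = {c: sum(observed[(r, c)] for r in rows) for c in cols}

chi2 = 0.0
for r in rows:
    for c in cols:
        expected = row_tot[r] * col_tot[c] / n          # e_ij under independence
        chi2 += (observed[(r, c)] - expected) ** 2 / expected

dof = (len(rows) - 1) * (len(cols) - 1)
print(round(chi2, 2), dof)   # about 33.25 with 4 degrees of freedom
```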

2.5 Data Transformation: Attribute Construction and Normalization

In several situations, the data themselves are free of errors, noise, inconsistency, and incompleteness, but they may still need to be transformed into a form that can be used easily, efficiently, and effectively. Typical transformations include attribute construction and normalization, where attribute combination or scaling operations are applied to reform the data.

Attribute Construction

In attribute construction, new attributes are constructed from the existing attributes and added to the original attribute set to help improve the structural representation of the data, resulting in higher-dimensional data. By combining attributes, some missing information about the relationships between data attributes can be manually appended to improve the result of knowledge discovery. For example, an attribute named area can be added by multiplying the two existing attributes height and width, as in the third column of Figure 2-10.

[Figure 2-10 lists records with the attributes Width, Height, the constructed attribute Area (= Width x Height), and the target attribute Size taking the values small, middle, and large.]

Figure 2-10: The area attribute is added by the multiplication of width and height

Attribute Normalization

Normalization, on the other hand, scales attribute data so that they fall within an appropriate range, such as from 0 to 1. Normalization is usually used to improve data representation and thereby increase the quality of mining results. For example, it can improve the accuracy of classification algorithms such as neural networks, and of distance-based methods such as k-nearest-neighbor classification, centroid-based classification, and clustering. Its advantages include speeding up the learning phase and preventing attributes with wide ranges (e.g., sales amount, distance) from overweighting attributes with narrower ranges (e.g., age, temperature). Among several normalization methods, three common ones are min-max normalization, z-score normalization, and normalization by decimal scaling. Figure 2-11 shows examples of these three normalizations.

Min-max normalization

The min-max normalization performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute A. The min-max normalization maps min_A and max_A to the new minimum new_min and the new maximum new_max of the target range [new_min, new_max], respectively; a value a between min_A and max_A is mapped linearly to its new value a' in that range:

  a' = ((a − min_A) / (max_A − min_A)) × (new_max − new_min) + new_min

Generally, new_min and new_max are set to 0 and 1, respectively. In principle, the min-max normalization preserves the relationships among the original data values, but it runs into an out-of-bounds error if a future input case for normalization falls

outside the original data range for A. We can prevent this error by estimating a lower bound and an upper bound instead of using the actual minimum and maximum values of the attribute in the dataset. The estimate can be obtained by simply adding a fixed margin to the range or by using domain knowledge to decide the range. Figure 2-11 (a) shows an example of the min-max normalization.

[Figure 2-11 lists the monthly humidity values for January to December together with their min-max, z-score, and decimal-scaling normalized values; in (a) the minimum 40 is mapped to 0.0 and the maximum 90 to 1.0.]

Figure 2-11: Three normalization methods; min-max, z-score and decimal-scaling

Z-score normalization (or zero-mean normalization)

The z-score normalization (or zero-mean normalization) transforms the values of an attribute A to new values based on the mean and standard deviation of A. A value a of A is normalized to a' by computing the z score as follows:

  a' = (a − Ā) / σ_A

Here, Ā and σ_A are respectively the mean and the standard deviation of the attribute A. Compared to the min-max normalization, the z-score normalization is effective when the actual minimum and maximum of the attribute A are unknown, or when there are outliers that would dominate the min-max normalization. As its name suggests, a z-score normalized attribute has zero mean. In most cases, a normalized value falls roughly in the range of −1.5 to +1.5. Figure 2-11 (b) shows an example of the z-score normalization.

Decimal-scaling normalization

The decimal-scaling normalization shifts the decimal point of the values of an attribute A. The number of decimal places moved depends on the maximum absolute value of A. A value a of A is normalized to a' so that all normalized values fall in the range between −1 and +1:

  a' = a / 10^i

where i is the smallest integer such that max(|a'|) < 1. The decimal-scaling and the min-max normalization are similar in that both map the data into a definite range: the former maps into [−1, +1] while the latter maps into [0, 1] or [new_min, new_max]. The z-score normalization has no bound on the minimum and maximum values. Figure 2-11 (c) shows an example of the decimal-scaling normalization.
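The three normalizations can be sketched as follows. The humidity values are illustrative (Figure 2-11 itself uses a minimum of 40 and a maximum of 90); the helper names are our own, and the decimal-scaling index assumes values with absolute value of at least 1.

```python
# Min-max, z-score, and decimal-scaling normalization of one numeric attribute.
# The humidity values are illustrative; Figure 2-11 uses min = 40 and max = 90.
import math
from statistics import mean, pstdev

humidity = [40, 55, 62, 70, 78, 90]

def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def decimal_scaling(values):
    max_abs = max(abs(v) for v in values)
    i = math.floor(math.log10(max_abs)) + 1   # smallest i with max|v| / 10^i < 1
    return [v / 10 ** i for v in values]

print(min_max(humidity))          # 40 -> 0.0, 90 -> 1.0
print(z_score(humidity))          # zero mean, unit standard deviation
print(decimal_scaling(humidity))  # 90 -> 0.9 (divided by 10^2)
```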

Time-dependent Attribute Transformation (Feature Construction)

Unlike static attributes, a time-dependent attribute (such as a time series) involves a variable Y related to a time sequence, such as monthly sales amount, daily stock price, and so forth. It can be viewed as a function of time t, that is, Y = F(t). It is possible to extend the utilization of a time-series attribute by comparing or relating values within the series. Common constructions include (1) difference values, (2) ratio values, (3) moving averages, (4) trends, and (5) trends with seasonal adjustment. Letting v_t be the current attribute value at time t, they can be formulated as follows.

  Difference:      d_t = v_t − v_{t−1}
  Ratio:           r_t = v_t / v_{t−1}
  Moving average:  ma_t = (1/k) (v_t + v_{t−1} + ... + v_{t−k+1})   (the example below uses k = 2)
  Trend:           t_t = v_t − ma_t  (difference trend)  or  v_t / ma_t  (ratio trend)
  Seasonal value:  s_t = v_t − cma_t  (difference seasoning)  or  v_t / cma_t  (ratio seasoning)

where cma_t is a centered moving average around period t.

Adding such time-dependent attributes enables us to analyse time-series characteristics for modeling time series (i.e., to gain insight into the mechanisms or underlying forces that generate the series) and for forecasting (i.e., to predict future values of the time-series variables). The following table shows an example of difference values, ratio values, moving averages, trends, and trends with seasonal adjustment.

[The accompanying table lists, for monthly periods from 1/2011 onward, the columns Sales, Diff, Ratio, Moving average, Trend (difference and ratio), Centered moving average, and Seasoning (difference and ratio); its numeric entries are illustrated by the worked examples below.]

For example, the difference for the 2/2011 period is 140 − 100 = 40, where 100 and 140 are the sales of the 1/2011 and 2/2011 periods, respectively. The ratio for the 2/2011 period is 140/100 = 1.4. The moving-average sales for the 2/2011 period is (100 + 140)/2 = 120. The difference trend for the 2/2011 period is 140 − 120 = 20, where 140 and 120 are the sales and the moving-average sales of the 2/2011 period, respectively. The ratio trend for the 2/2011 period is 140/120 = 1.167. The centered moving average is computed as a weighted average of the neighboring periods; for example, (100/2 + 140 + 200/2)/2 = 145. The difference seasoning for the 2/2012 period is 160 − 145 = 15, where 160 and 145 are the sales and the centered moving average of the 2/2012 period, respectively. The ratio seasoning for the 2/2012 period is 160/145 = 1.103.
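The derived time-dependent attributes can be computed directly from a sales series. The series below is illustrative except for the first three values, which follow the 100, 140, 200 of the worked example; the two-period window for the moving average is an assumption matching that example.

```python
# Difference, ratio, moving-average, and trend features for a monthly sales series.
# The sales figures are illustrative; 100, 140, 200 follow the worked example above.
sales = [100, 140, 200, 160, 180, 220]

diff  = [None] + [sales[t] - sales[t - 1] for t in range(1, len(sales))]
ratio = [None] + [sales[t] / sales[t - 1] for t in range(1, len(sales))]
# moving average over the current and previous period, as in the example (window k = 2)
ma    = [None] + [(sales[t - 1] + sales[t]) / 2 for t in range(1, len(sales))]
trend_d = [None if m is None else v - m for v, m in zip(sales, ma)]   # difference trend
trend_r = [None if m is None else v / m for v, m in zip(sales, ma)]   # ratio trend

print(diff[1], ratio[1], ma[1], trend_d[1], trend_r[1])
# 40  1.4  120.0  20.0  1.1666...  (matches the 2/2011 calculations above)
```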

2.6. Data Reduction

In several cases, it is necessary to reduce the size of the data set in order to improve the representation of the data or to resolve the problems of large computational complexity and impractical or infeasible analysis caused by large data. Data reduction techniques can be applied to obtain a reduced representation of the data set that occupies a smaller volume but maintains the semantics and integrity of the original data. Mining on the reduced data set is expected to require less computational time yet produce the same (or almost the same) analytical results, or results from different points of view. There are three types of data reduction: reduction in the number of attributes (columns), of tuples (rows), and of possible values (cells). They are described in order below.

Reduction in the number of attributes

Normally, data sets for analysis may contain hundreds of attributes, several of which may be irrelevant to the mining task or redundant. For example, in classifying customers into classes based on their purchase patterns, their telephone numbers are likely to be irrelevant, while attributes such as age or the types of purchased goods seem important for classification. The existence of irrelevant or redundant attributes may confuse the mining process and degrade it, producing unrelated, poor discovered patterns. The task of reducing the number of attributes by selecting only effective attributes is known as feature selection. Feature selection selects an attribute subset from the complete set in order to remove irrelevant, weakly relevant, or redundant attributes (dimensions or columns). As advantages of feature selection, we expect not only reduced computational complexity but also improved accuracy. While it may be possible to manually eliminate useless attributes or retain useful ones, this is a difficult and time-consuming task, especially when the characteristics of the data are unclear or hidden. Redundant attributes may be detected by correlation analysis, while useless attributes may be eliminated by attribute subset selection with performance testing. Correlation analysis finds how strongly one attribute implies another, based on the available data; the detailed method can be found in Section 2.4.

In attribute subset selection, we aim to find a minimum set of attributes whose probability distribution is similar to the original distribution obtained from all attributes. To search for such a subset, we may need to explore all possible combinations of attributes. For n attributes, there are 2^n possible subsets, since each attribute is either selected or not selected (i.e., two choices each). An exhaustive search for the optimal subset of attributes can be extremely expensive, especially when the number of attributes n is large. To cope with this, heuristic methods can be applied to explore a reduced search space. Typical methods are greedy: while searching through the attribute space, they always select the best choice at each step. This may lead to a so-called local optimum. However, in several cases a local optimum is sufficient, and sometimes it later leads to the global optimum. Such greedy methods are effective in practice and yield near-optimal solutions. Common solutions can be divided into three main groups: filter methods, wrapper methods, and dimensionality reduction methods, as shown in Figure 2-12.

Figure 2-12: Wrapper, Filter, and Dimensionality Reduction Approaches

As the first approach, the filter methods employ an evaluation function that is independent of the data mining (DM) algorithm; they search for a suitable subset before applying the DM algorithm. As the second approach, the wrapper methods use the DM algorithm itself as the evaluation function: each candidate subset is tested by running the DM algorithm and is evaluated on the basis of its performance. In contrast, the dimensionality reduction methods, instead of selecting a suitable subset from the original feature set, transform the features of the original space into a different space where better mining performance can be obtained.

1. Filter Approach

The filter approach evaluates the performance of two situations, one with an attribute and one without it, to determine whether the attribute should be kept. Conceptually, this forms a search space. Some common filter-based variants are as follows.

i. Stepwise forward selection: The method begins with an empty set of attributes as the reduced set. Then, from the original attributes, the best attribute is determined and added to the reduced set. This selection of the best attribute is repeated, adding attributes one by one, until no improvement is found or a stopping condition is satisfied (a code sketch of this greedy procedure appears at the end of this subsection).

ii. Stepwise backward elimination: In reverse, this method starts with the full set of attributes and removes the worst attribute, one by one, from the remaining set until no improvement is found or a stopping condition is satisfied.

iii. Stepwise hybrid of forward selection and backward elimination: The combined method performs both stepwise forward selection and backward elimination so that, at each step, the procedure selects the best attribute to add or removes the worst among the remaining attributes.

2. Wrapper Approach

Instead of searching with a separate measure, the wrapper approach performs the data mining task of interest and evaluates the set of features at the same time. A well-known wrapper approach is decision tree induction. Decision tree algorithms, such as ID3, C4.5, and CART, were originally intended for classification, but as a by-product they perform feature selection. Decision tree induction creates a tree-like structure where each nonleaf (internal) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each leaf node expresses a class prediction. At each node, the algorithm chooses the best attribute to partition the data into individual classes. In the end, the final decision tree may include only a subset of the attributes; all attributes that do not appear in the tree are assumed to be irrelevant. In the procedure, a threshold on the measure may be employed to determine when to stop the attribute selection process.

3. Dimensionality Reduction Approach

While in several cases simply eliminating ineffective attributes is enough, it may not be when the features are related to each other. To handle this, it is necessary to find their relations and group them. Several methods have been developed for this purpose, such as the discrete wavelet transform (DWT), singular value decomposition (SVD), principal components analysis (PCA), and latent semantic indexing (LSI). Such methods treat each feature as an element of a vector space and transform the features by implicitly considering the correlation among them before reducing the dimension. In this approach, data transformations are applied to obtain a reduced or compressed representation of the original data. In general, two options are possible: lossless and lossy. A lossless transformation enables perfect reconstruction of the original data from the compressed data without any loss of information, whereas a lossy transformation gives only an approximation of the original data. While the lossless methods seem preferable, they may still limit how the data can be manipulated, since the original space is collapsed onto another space.
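As a concrete illustration of variant i, the sketch below performs greedy stepwise forward selection. The `evaluate` function is a placeholder for any scoring criterion: a filter measure in the filter setting, or the accuracy of the DM algorithm in the wrapper setting. The toy scoring function at the end is purely illustrative.

```python
# Greedy stepwise forward selection.  `evaluate(subset)` stands in for any scoring
# function: a filter measure (e.g. correlation with the class) or, in the wrapper
# setting, the accuracy of the data mining algorithm trained on that subset.
def forward_selection(attributes, evaluate, min_gain=1e-6):
    selected, best_score = [], float("-inf")
    remaining = list(attributes)
    while remaining:
        # try adding each remaining attribute and keep the one that helps most
        scored = [(evaluate(selected + [a]), a) for a in remaining]
        score, attr = max(scored)
        if score <= best_score + min_gain:
            break                      # no (sufficient) improvement: stop searching
        selected.append(attr)
        remaining.remove(attr)
        best_score = score
    return selected

# toy scoring function: pretend only 'outlook' and 'humidity' carry signal
useful = {"outlook": 0.6, "humidity": 0.3}
score = lambda subset: sum(useful.get(a, -0.01) for a in subset)
print(forward_selection(["outlook", "temp", "humidity", "windy"], score))
# ['outlook', 'humidity']
```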
Reduction in the number of tuples

Besides reducing the number of attributes, scaling down the number of tuples can help us not only save computational cost but also eliminate noisy data. Sampling is a popular method for lowering the number of tuples: it allows a large data set to be represented by a more compact version consisting of representative data selected from the whole set. As another efficient way to omit noisy or useless tuples, we can cluster the tuples into groups and eliminate those that are far from the norm, since they are likely to be noise. It is also possible to sample tuples based on their clustered groups in order to obtain a balanced subset across different clusters.

Sampling lets a large data set be represented by a much smaller random sample (a subset of the original data set). Hence, the sampling complexity is potentially sublinear in the

size of the data. Figures 2-13 to 2-15 show three alternatives for sampling, called simple random sampling without replacement (SRSWOR), simple random sampling with replacement (SRSWR), and simple random sampling with clustering and stratification, respectively. Suppose that a large data set D contains N tuples. The most common ways to sample n tuples from the data set D, for data reduction into a subset S, are listed below. Naturally, the cost of obtaining a sample is proportional to the size of the sample, n, rather than to N, the size of the data set.

(1) Simple random sample without replacement (SRSWOR). This method draws n tuples from the N tuples of D (n < N), where the probability of drawing any tuple of D is 1/N; all tuples are equally likely to be sampled. The number of possible subsets of n tuples drawn from the N tuples under this constraint is the binomial coefficient C(N, n) = N! / (n! (N − n)!). See Figure 2-13.

(2) Simple random sample with replacement (SRSWR). This method is similar to SRSWOR, but each time a tuple is drawn from D it is recorded and then placed back into D, so that it may be drawn again later. Therefore, counting the number of possible samples of n tuples from the N tuples is more involved; it can be enumerated by cases such as the following (see Figure 2-14):

  - all selected tuples are distinct;
  - one tuple is selected twice and the others once;
  - one tuple is selected three times and the others once;
  - two tuples are each selected twice and the others once;
  - three tuples are each selected twice and the others once; and so on.

(3) Simple random sample with clustering and stratified sampling. This method groups the tuples of D into K mutually disjoint clusters based on their similarity and performs simple random sampling on each cluster, taking the size of each cluster into account. A reduced data representation is obtained by applying SRSWOR or SRSWR to each cluster, resulting in a cluster sample of the tuples. Stratified sampling is applied simultaneously by selecting tuples according to the size of each group: more samples are taken from larger clusters and fewer from smaller clusters, preserving the ratio of samples across clusters in the reduced data set. See Figure 2-15.
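The three sampling schemes can be sketched with the standard library's random module. The data, the clustering rule, and the sample sizes below are illustrative.

```python
# SRSWOR, SRSWR, and cluster/stratified sampling sketches.
# The data set, the cluster rule, and the sample sizes are illustrative.
import random
from collections import defaultdict

data = list(range(100))                       # N = 100 tuples, identified by index

srswor = random.sample(data, 10)              # without replacement: 10 distinct tuples
srswr  = [random.choice(data) for _ in range(10)]   # with replacement: duplicates possible

# stratified / cluster sampling: sample from each cluster in proportion to its size
clusters = defaultdict(list)
for x in data:
    clusters["small" if x < 20 else "large"].append(x)   # toy clustering rule

sample_size = 10
stratified = []
for label, members in clusters.items():
    k = max(1, round(sample_size * len(members) / len(data)))
    stratified.extend(random.sample(members, k))

print(len(srswor), len(srswr), len(stratified))
```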

Figure 2-13: Simple random sample without replacement (SRSWOR)

Figure 2-14: Simple random sample with replacement (SRSWR)

Figure 2-15: Simple random sample with cluster and stratified sampling

Reduction in the number of possible values

Reducing the variety of possible values not only improves the representation of the data but also smooths noise. Data discretization and concept generalization replace the original raw values of attributes with range representatives or with values at a higher conceptual level. They are powerful tools for data mining in that they allow mining at multiple levels of abstraction. This variety reduction may not help in terms of computational complexity, but it is useful for noise reduction and better representation. Discretization may involve binning, histogram analysis, and information-theoretic methods. For example, the sales data of villages may be aggregated into the sales of cities, or daily sales data may be summarized into monthly and annual totals. This step typically involves analyzing the data at multiple granularities. Under the related concept of data generalization, low-level or primitive (raw) data are replaced by higher-level concepts through the use of concept hierarchies. Figure 2-16 shows an aggregation that transfers a detailed data set to summarized data.

[Figure 2-16 shows three views of the same sales data: daily rows with Date, Product, Location, Unit, and Amount; a monthly roll-up by product category (Softdrink, Fruit) and city (Bangkok, Tokyo); and a further roll-up by product category and region (Asia).]

Figure 2-16: Data aggregation in order to mine at a higher level.

Another approach to reducing the number of possible values is binning, previously described in Section 2.3 as a technique for handling noisy data. It can be used to map a value in a range to the range's average or boundary. Moreover, discretization techniques can be categorized based on how the discretization is performed, for example whether it uses class information or in which direction it proceeds (top-down vs. bottom-up). If the discretization process uses class information, it is called supervised discretization; otherwise, it is unsupervised.
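The roll-up of Figure 2-16 can be sketched as a simple group-and-sum. The rows, amounts, and the product-to-category map below are illustrative, not the actual figure values.

```python
# Rolling daily, product-level sales up to monthly, category-level totals,
# in the spirit of Figure 2-16.  The rows and the category map are illustrative.
from collections import defaultdict

rows = [   # (date, product, location, amount)
    ("1-Jan-2012", "Coke",   "Bangkok", 120.0),
    ("3-Jan-2012", "Pepsi",  "Bangkok",  80.0),
    ("9-Jan-2012", "Orange", "Bangkok",  60.0),
    ("2-Feb-2012", "Coke",   "Bangkok", 150.0),
    ("7-Feb-2012", "Apple",  "Tokyo",    90.0),
]
category = {"Coke": "Softdrink", "Pepsi": "Softdrink", "Orange": "Fruit", "Apple": "Fruit"}

monthly = defaultdict(float)
for date, product, location, amount in rows:
    month = date.split("-")[1]                               # generalize day -> month
    monthly[(month, category[product], location)] += amount  # generalize product -> category

for key, total in sorted(monthly.items()):
    print(key, total)
```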

2.7 Dimensionality Reduction Techniques

As an approach to reducing the number of attributes, dimensionality reduction transforms the original feature space into a more appropriate one in order to extract the relationships among features and then eliminate inefficient or ineffective transformed features. Common techniques are the discrete wavelet transform (DWT), principal components analysis (PCA), singular value decomposition (SVD), and latent semantic indexing (LSI). More details on DWT and PCA are given below.

Discrete Wavelet Transforms (DWT)

Following the Discrete Fourier Transform (DFT) of numerical and functional analysis, a Discrete Wavelet Transform (DWT) transforms one space into another in order to extract frequency and location information. Compared with the DFT, the DWT, being a wavelet transform, captures not only frequency but also location information (temporal resolution). Like the DFT, the DWT is a linear signal processing technique: it transforms an original data vector X into a vector X' in a new space, each element of which corresponds to a wavelet coefficient. Although the two vectors are of the same length, the DWT can be used to reduce data dimensionality as follows. Each tuple is an n-dimensional data vector X = [x_1, x_2, ..., x_n] depicting n measured values over n attributes. The DWT transforms this original vector into an n-dimensional vector in a new space, and some dimensions of the wavelet-transformed data are then truncated. After truncation, the compressed approximation of the data stores only some of the strongest wavelet coefficients, e.g., keeping only the wavelet coefficients larger than some user-specified threshold and setting the small coefficients to 0. With this threshold cutoff, the resulting data representation becomes sparse, so operations that can take advantage of data sparsity are computationally very fast if performed in wavelet space. The technique also removes noise without smoothing out the main features of the data, making it effective for data cleaning as well. Given a set of coefficients, an approximation of the original data can be reconstructed by applying the inverse of the DWT used.

The DWT is closely related to the discrete Fourier transform (DFT), a signal processing technique involving sines and cosines. The DWT achieves better lossy compression than the DFT: given an original data vector and the same number of coefficients, the DWT usually provides a more accurate approximation of the original data. In other words, the DWT requires less space than the DFT, since wavelets are quite localized in space and thus help conserve local detail. While there is essentially only one DFT, there are several families of DWTs; popular wavelet transforms include the Haar and Daubechies-4 transforms, shown in Figure 2-17.

The Haar wavelet (Figure 2-17 (a)) is recognized as the first known wavelet. These simple functions were used to give an example of a countable orthonormal system for the space of square-integrable functions on the real line. The Haar wavelet's mother wavelet function ψ(t) can be described as

  ψ(t) = 1 for 0 ≤ t < 1/2,  −1 for 1/2 ≤ t < 1,  0 otherwise

and its scaling function φ(t) can be described as

  φ(t) = 1 for 0 ≤ t < 1,  0 otherwise.

The Daubechies wavelets are a family of orthogonal wavelets defining a discrete wavelet transform and characterized by a maximal number of vanishing moments for a given support. With each wavelet of this class there is a scaling function (also called the father wavelet) that generates an orthogonal multiresolution analysis. The Daubechies wavelets are not defined in terms of the resulting scaling and wavelet functions; in fact, they cannot be written down in closed form. Figure 2-17 (b) shows the Daubechies-4 wavelet (mother) and scaling (father) functions.

Figure 2-17: Common wavelets: the Haar (Daubechies-2) transform (a) and the Daubechies-4 transform (b)

To apply a discrete wavelet transform, we use a hierarchical pyramid algorithm that halves the data at each iteration, resulting in fast computation. The method is as follows:

1. First, the input data vector of length L is modified so that its length is an integer power of 2. To do this, zeros are appended, as necessary, at the end of the original data vector; the modified vector has length 2^n, where 2^(n-1) < L ≤ 2^n.

2. The wavelet transformation uses two functions. The first applies some data smoothing, such as a sum or a weighted average. The second performs a weighted difference, which acts to bring out the detailed features of the data.

3. The two functions are applied to pairs of data points in X, that is, to all pairs of measurements (x_2i, x_2i+1). This operation creates two sets of data of length L/2, one being a smoothed or low-frequency component of the input data and the other the high-frequency component.

4. The two functions are recursively applied to the data sets obtained in the previous iteration, until the resulting data sets have length 2.

5. In each iteration, appropriate wavelet coefficients are applied for the smoothed-average and smoothed-difference computations.

6. All the results of these two functions are kept as the output of the transformation.

7. The transformed output can be inversely transformed back to (an approximation of) the original data vector.

Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes. Lossy compression by wavelets is reportedly better than JPEG compression, the current commercial standard. Wavelet transforms have many real-world applications, including compression of fingerprint images, computer vision, analysis of time-series data, and data cleaning.
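The pyramid algorithm above can be sketched for the Haar wavelet as follows: pairwise scaled averages and differences are applied recursively to the smoothed half, and small coefficients are then zeroed out for dimensionality reduction. This is a minimal sketch; in practice one would typically rely on a wavelet library such as PyWavelets.

```python
# One-dimensional Haar wavelet transform using the pyramid algorithm described above:
# pairwise scaled averages and differences, applied recursively to the averages.
import math

def haar_dwt(x):
    """Return the Haar wavelet coefficients of a vector whose length is a power of 2."""
    x = list(x)
    coeffs = []
    while len(x) > 1:
        avg  = [(x[2 * i] + x[2 * i + 1]) / math.sqrt(2) for i in range(len(x) // 2)]
        diff = [(x[2 * i] - x[2 * i + 1]) / math.sqrt(2) for i in range(len(x) // 2)]
        coeffs = diff + coeffs        # keep the detail (high-frequency) coefficients
        x = avg                       # recurse on the smoothed (low-frequency) half
    return x + coeffs                 # final average followed by all detail coefficients

signal = [2, 2, 0, 2, 3, 5, 4, 4]     # length 8 = 2^3; pad with zeros otherwise
coeffs = haar_dwt(signal)

# Dimensionality reduction: keep only the strongest coefficients, zero out the rest
threshold = 1.0
compressed = [c if abs(c) >= threshold else 0.0 for c in coeffs]
print([round(c, 2) for c in compressed])
```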

Summary

Since real-world data tend to be noisy, inconsistent, and incomplete, it is necessary to preprocess the data before finding knowledge in them. Data preprocessing includes data cleaning, data integration, data transformation, and data reduction. Although data may be represented in several forms, expressing data in the form of a table is simple and gives us insight. Data cleaning preprocesses the input by filling in missing values, smoothing or eliminating noise, identifying outliers, and correcting inconsistencies in the data. Statistical or information-theory-based techniques can be applied to estimate missing values. Binning, regression, and clustering can be used to smooth noisy data.
