K236: Basis of Data Science
Lecture 6: Data Preprocessing
Lecturer: Tu Bao Ho and Hieu Chi Dam
TA: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai

Schedule of K236
1. Introduction to data science (6/9)
2. Introduction to data science (6/13)
3. Data and databases (6/16)
4. Review of univariate statistics (6/20)
5. Review of linear algebra (6/23)
6. Data mining software (6/27)
7. Data preprocessing (6/30)
8. Classification and prediction (1) (7/4)
9. Knowledge evaluation (7/7)
10. Classification and prediction (2) (7/11)
11. Classification and prediction (3) (7/14)
12. Mining association rules (1) (7/18)
13. Mining association rules (2) (7/21)
14. Cluster analysis (7/25)
15. Review and Examination (the date is not fixed) (7/27)

The data analysis process
[Figure: the stages of the data analysis process. Create/select the target database (data organized by function; data warehousing); select a sampling technique and sample the data; preprocess the data (supply missing values, eliminate noisy data, normalize values, transform values, create derived attributes, find important attributes and value ranges, transform to a different representation), covered in this lecture (Lecture 6); select DM task(s) and DM method(s) (Lectures 7-9, 10-14); extract, test, and refine knowledge (Lecture 8); query & report generation, aggregation & sequences, advanced methods.]

Outline
1. Why Preprocess the Data?
2. Data Cleaning
3. Data Integration
4. Data Reduction
5. Data Transformation
Why preprocess the data?
Common properties of large real-world databases:
- Incomplete: lacking attribute values or certain attributes of interest
- Noisy: containing errors or outliers
- Inconsistent: containing discrepancies in codes or names
This is the veracity problem: no quality data, no quality analysis results!

Major tasks in data preprocessing
1. Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
2. Data integration: integration of multiple databases, data cubes, or files
3. Data reduction (of instances and dimensions): obtains a representation reduced in volume that produces the same or similar analytical results
4. Data transformation: normalization and aggregation
5. Data discretization: part of data reduction, but of particular importance, especially for numerical data
Data cleaning tasks
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data

Missing data
Data is not always available; e.g., many tuples have no recorded value for several attributes, such as customer income in sales data. Missing data may be due to:
- equipment malfunction
- data inconsistent with other recorded data and thus deleted
- data not entered due to misunderstanding
- certain data not considered important at the time of entry
- no registered history or changes of the data
Missing data may need to be inferred.

Missing values in databases
Missing values may hide a true answer underlying the data, and many data mining programs cannot be applied to data that includes missing values.
Methods:
1. Ignore the tuple
2. Fill in the missing value manually (tedious + infeasible?)
3. Use a global constant to fill in the missing value
4. Use the attribute mean to fill in the missing values
5. Use the attribute mean (or mode for a categorical attribute) of all samples belonging to the same class as the given tuple
6. Use the most probable value to fill in the missing value
7. Others

[Figure: an example table whose class attribute takes the values norm, <norm, >norm; the other six attributes all have missing values (marked "unknown" or "dna"), and a different method (2-6) is applied to each attribute.]
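Methods 4 and 5 above can be sketched in a few lines of plain Python. The records below are hypothetical toy data (class label plus an income attribute, with `None` marking a missing value), not the table from the slide:

```python
# Toy records (hypothetical): each row is (class_label, income); None = missing.
rows = [("norm", 30.0), ("norm", None), (">norm", 50.0),
        (">norm", None), ("norm", 34.0), (">norm", 54.0)]

def mean(xs):
    return sum(xs) / len(xs)

# Method 4: fill with the overall attribute mean.
known = [inc for _, inc in rows if inc is not None]
global_fill = [(c, inc if inc is not None else mean(known)) for c, inc in rows]

# Method 5: fill with the mean of the samples in the same class as the tuple.
def class_mean(label):
    return mean([inc for c, inc in rows if c == label and inc is not None])

class_fill = [(c, inc if inc is not None else class_mean(c)) for c, inc in rows]
```

Method 5 usually gives a better estimate than method 4 because it conditions the fill value on the class: here the missing "norm" income is filled with the norm-class mean (32.0) rather than the global mean (42.0).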
Noisy data
Noise: random error or variance in a measured variable. Incorrect attribute values may be due to:
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitations
- inconsistency in naming conventions
Other data problems which require data cleaning:
- duplicate records
- incomplete data
- inconsistent data

How to handle noisy data?
- Binning: first sort the data and partition it into (equi-depth) bins; then smooth by bin means, bin medians, bin boundaries, etc.
- Clustering: detect and remove outliers
- Combined computer and human inspection: detect suspicious values and have a human check them
- Regression: smooth by fitting the data to regression functions

Binning smooths a sorted data value by consulting its neighborhood, that is, the values around it (local smoothing):
- Smoothing by bin means: each value in a bin is replaced by the mean value of the bin
- Smoothing by bin medians: each bin value is replaced by the bin median
- Smoothing by bin boundaries: the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is replaced by the closest boundary

Example. The original data: 9, 21, 24, 21, 4, 26, 28, 34, 29, 8, 15, 25
Sort the data in increasing order and partition into (equi-depth) bins of four values each:
4, 8, 9, 15 | 21, 21, 24, 25 | 26, 28, 29, 34
Smoothing by bin means (rounded down): 9, 9, 9, 9, 22, 22, 22, 22, 29, 29, 29, 29
Smoothing by bin boundaries (each value replaced by the closest boundary): 4, 4, 4, 15, 21, 21, 25, 25, 26, 26, 26, 34
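The worked binning example above can be reproduced in a short Python sketch. Note the slide's bin means appear truncated to integers, which the code mirrors with floor division:

```python
# Smoothing by bin means and bin boundaries on the lecture's example data.
data = [9, 21, 24, 21, 4, 26, 28, 34, 29, 8, 15, 25]
values = sorted(data)
bins = [values[i:i + 4] for i in range(0, len(values), 4)]  # equi-depth bins of 4

# Bin means (floor division reproduces the slide's truncated values).
by_means = [sum(b) // len(b) for b in bins for _ in b]

# Bin boundaries: replace each value by the closer of the bin's min and max.
by_bounds = [min((b[0], b[-1]), key=lambda bd: abs(v - bd))
             for b in bins for v in b]
```

Running this yields `by_means == [9, 9, 9, 9, 22, 22, 22, 22, 29, 29, 29, 29]` and `by_bounds == [4, 4, 4, 15, 21, 21, 25, 25, 26, 26, 26, 34]`, matching the slide.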
How to handle noisy data? (continued)
- Clustering: outliers may be detected by cluster analysis; values that fall outside the set of clusters may be considered outliers.
- Combined computer and human inspection: output patterns with surprising content to a list; a human can identify the actual garbage ones.
- Regression: smooth by fitting the data to a function, e.g., linear regression (fitting a line such as y = x + 1 to two variables), or multiple linear regression, which involves more than two variables and fits the data to a multidimensional surface.

[Figure: a scatter plot of clustered points with outliers lying outside the clusters, and a plot of a fitted regression line y = x + 1.]

Data integration
Data integration combines data from multiple sources (multiple DBs, data cubes, flat files) into a coherent data store.
- Schema integration (the entity identification problem): how can equivalent entities from multiple data sources be matched up?
- Redundancy: an attribute may be redundant if it can be derived from another table.
Data integration (continued)
Redundancy can be detected by correlation analysis (the correlation coefficient), e.g., how strongly one attribute implies another:

    r_{A,B} = Σ (A − Ā)(B − B̄) / ((n − 1) σ_A σ_B)

where Ā and B̄ are the means and σ_A and σ_B the standard deviations of attributes A and B. Data integration also requires detection and resolution of data value conflicts.

Strategies for data reduction
- Data cube aggregation
- Dimension reduction
- Data compression
- Numerosity reduction
- Discretization and concept hierarchy generation

Data cube aggregation
Aggregation operations are applied to the data in the construction of a data cube. For example, where the raw data show sales per quarter, the cube can aggregate them to provide the annual sales.
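The correlation coefficient r_{A,B} used for redundancy detection can be computed directly from its definition. The two attributes below are hypothetical toy values chosen to be perfectly linearly related, so r should come out as 1:

```python
# Sample correlation coefficient r_{A,B}, as used for redundancy detection.
A = [2.0, 4.0, 6.0, 8.0]   # toy attribute values (hypothetical)
B = [1.0, 3.0, 5.0, 7.0]   # B = A - 1, i.e., B is fully derivable from A

n = len(A)
mean_a, mean_b = sum(A) / n, sum(B) / n
# Sample standard deviations (divide by n - 1, matching the (n - 1) in r).
sd_a = (sum((a - mean_a) ** 2 for a in A) / (n - 1)) ** 0.5
sd_b = (sum((b - mean_b) ** 2 for b in B) / (n - 1)) ** 0.5

r = sum((a - mean_a) * (b - mean_b)
        for a, b in zip(A, B)) / ((n - 1) * sd_a * sd_b)
```

Here r evaluates to 1 (up to floating-point error): B carries no information beyond A, so one of the two attributes is redundant and can be dropped.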
Data cube aggregation (continued)
A data cube supports multidimensional analysis of the data, e.g., annual sales per item type for each branch of a company.

Data compression: attribute selection
Attribute subset selection (also called feature selection):
- Stepwise forward selection
- Stepwise backward elimination
- Combination of forward selection and backward elimination
- Many other methods

Data compression: wavelet transforms
Discrete wavelet transformation (DWT): a linear signal processing technique that, when applied to a data vector D, transforms it to a numerically different vector D' of wavelet coefficients. Only a small fraction of the strongest wavelet coefficients needs to be stored.

Data compression: PCA
Principal Components Analysis (PCA) transforms data points from k dimensions into c dimensions (c ≤ k) with minimum loss of information. PCA searches for c orthogonal vectors that can best be used to represent the data; the original data are thus projected onto a much smaller space of c dimensions (the c principal components). PCA is only used for numerical data.
[Figure: five points O1-O5 in the X-Y plane with two candidate projection directions Z1 and Z2. Question: for reduction to one dimension, which of Z1 and Z2 is better?]
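A minimal PCA sketch via eigendecomposition of the covariance matrix, using NumPy on hypothetical toy 2-D data (reduction to c = 1 dimension, as in the figure's question):

```python
import numpy as np

# Toy 2-D data (hypothetical); rows are points, columns are attributes.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

# 1. Center the data.
Xc = X - X.mean(axis=0)
# 2. Eigendecomposition of the sample covariance matrix.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
# 3. Keep the c = 1 principal component with the largest eigenvalue.
pc = eigvecs[:, np.argmax(eigvals)]
# 4. Project: k = 2 dimensions -> c = 1 dimension.
Z = Xc @ pc
```

The variance of the projected data `Z` equals the largest eigenvalue, which is exactly why that direction "best represents" the data: no other single direction preserves more of the variance.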
Numerosity reduction
Can we reduce the data volume by choosing alternative, smaller forms of data representation?
- Parametric methods: a model is used to estimate the data, so that typically only the model parameters need be stored instead of the actual data, e.g., regression (y = αx + β) and log-linear models.
- Non-parametric methods store reduced representations of the data: histograms, clustering, sampling.

Numerosity reduction: histograms
Singleton buckets: each bucket represents one price-value/frequency pair. An equi-width histogram aggregates values so that each bucket has a uniform width (e.g., $10).

Numerosity reduction: clustering
[Figure: a 2-D plot of customer data with respect to customer locations in a city, showing three data clusters; each cluster center is marked with a "+".]

Numerosity reduction: sampling
- Simple random sample without replacement of size n (SRSWOR)
- Simple random sample with replacement of size n (SRSWR)
- Cluster sample
- Stratified sample: draw an equal proportion (e.g., ½) from each stratum
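The sampling schemes above map directly onto Python's standard `random` module. The population and strata below are hypothetical:

```python
import random

random.seed(0)                    # for reproducibility
population = list(range(100))     # hypothetical tuple IDs
n = 10

# SRSWOR: simple random sample without replacement of size n.
srswor = random.sample(population, n)

# SRSWR: simple random sample with replacement of size n.
srswr = [random.choice(population) for _ in range(n)]

# Stratified sample: draw an equal proportion (1/2) from each stratum,
# so that even the smaller stratum is fairly represented.
strata = {"young": list(range(0, 40)), "senior": list(range(40, 100))}
stratified = {name: random.sample(group, len(group) // 2)
              for name, group in strata.items()}
```

SRSWOR guarantees distinct tuples; SRSWR may repeat them; the stratified sample preserves the relative sizes of the strata (20 of 40 "young", 30 of 60 "senior").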
Data transformation
- Smoothing: remove noise from the data
- Aggregation: summary or aggregation operations are applied to the data
- Generalization: low-level or primitive data are replaced by higher-level concepts through the use of concept hierarchies
- Normalization: attribute data are scaled so as to fall within a small specified range, say 0.0 to 1.0
- Attribute construction: new attributes are constructed and added from the given set of attributes to help the mining process, e.g., from continuous to discrete (discretization) and from discrete to continuous (word embedding)

Min-max and z-score normalization
Min-max normalization: suppose min_A and max_A are the minimum and maximum values of attribute A. We map a value v of A to v' in the range [new_min_A, new_max_A] by

    v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A

Example: suppose min_A and max_A of income are $12,000 and $98,000, and we want to map income to the range [0.0, 1.0]. Then $73,600 is transformed to
(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 = 0.716

Z-score normalization: the values of an attribute A are normalized based on the mean Ā and standard deviation σ_A of A:

    v' = (v − Ā) / σ_A

Example: if the mean and standard deviation of income are $54,000 and $16,000, then $73,600 is transformed to (73,600 − 54,000) / 16,000 = 1.225

Discretization
Three types of attributes:
- Nominal (categorical): red, yellow, blue, green
- Ordinal: small, middle, large, extremely large
- Continuous: real numbers
Discretization divides the range of a continuous attribute into intervals:
- Some classification algorithms only accept categorical attributes
- It reduces the data size
- It prepares the data for further analysis
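The two normalization formulas, with the lecture's income examples as a check:

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization: map v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Z-score normalization based on the attribute's mean and std deviation."""
    return (v - mean_a) / std_a

income_mm = min_max(73600, 12000, 98000)   # -> ~0.716
income_z = z_score(73600, 54000, 16000)    # -> 1.225
```

Min-max needs the range known in advance and is distorted by outliers at min_A or max_A; z-score is the usual fallback when the true minimum and maximum are unknown or unreliable.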
Discretization methods
- Binning
- Histogram analysis
- Cluster analysis
- Entropy-based discretization
- Segmentation by natural partitioning

Entropy-based discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

    E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)

The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization. The process is applied recursively to the partitions obtained until some stopping criterion is met, e.g., until the information gain falls below a threshold δ:

    Ent(S) − E(S, T) < δ

Experiments show that entropy-based discretization may reduce data size and improve classification accuracy.

What is word embedding?
Word embedding: mapping a word (or phrase) from its original high-dimensional input space to a lower-dimensional numerical vector space.
Word2vec is a group of related models that are used to produce word embeddings:
- These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words.
- Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned such that words that share common contexts in the corpus are located in close proximity to one another.

Some more complex data transformations
[Figure: latent semantic indexing and topic models as examples of a mapping φ: X → F from the input space X to a feature space F where the problem can be solved. A normalized word-document co-occurrence matrix C is factorized into words × dims and dims × documents matrices (U, D, V) for LSI, or into words × topics and topics × documents matrices for topic models.]
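Returning to entropy-based discretization, a minimal sketch of one level of the recursion: compute Ent(S), evaluate E(S, T) at each candidate boundary (midpoints between consecutive sorted values), and pick the minimizing T. The (value, class) samples are hypothetical toy data:

```python
from math import log2

def ent(labels):
    """Class entropy Ent(S) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n)
                for c in (labels.count(l) for l in set(labels)))

def best_boundary(pairs):
    """Return (T, E(S, T)) minimizing E over midpoints of sorted values."""
    pairs = sorted(pairs)
    n = len(pairs)
    best = None
    for i in range(1, n):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2   # candidate boundary T
        s1 = [l for v, l in pairs if v <= t]
        s2 = [l for v, l in pairs if v > t]
        e = len(s1) / n * ent(s1) + len(s2) / n * ent(s2)
        if best is None or e < best[1]:
            best = (t, e)
    return best

# Toy (value, class) samples (hypothetical): low values -> "a", high -> "b".
samples = [(1, "a"), (2, "a"), (3, "a"), (8, "b"), (9, "b"), (10, "b")]
t, e = best_boundary(samples)
```

Here the chosen boundary T = 5.5 separates the two classes perfectly, so E(S, T) = 0 and the gain Ent(S) − E(S, T) equals Ent(S); a full implementation would recurse on each interval until the gain drops below δ.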
Summary
- Data preprocessing is an important issue, as real-world data tend to be incomplete, noisy, and inconsistent.
- Data cleaning routines can be used to fill in missing values, smooth noisy data, identify outliers, and correct data inconsistencies.
- Data integration combines data from multiple sources to form a coherent data store.
- Data transformation routines convert the data into appropriate forms for analysis.
- Data reduction techniques can be used to obtain a reduced representation of the data while minimizing the loss of information content.
- Automatic generation of concept hierarchies can involve different techniques for numeric data, and may be based on the number of distinct values of attributes for categorical data.
- Data preprocessing remains an active area of research.

Homework
The labor.arff data set provided by WEKA has 57 instances, 16 descriptive attributes, and a class attribute with two values, bad and good. The attributes of labor.arff have many missing values. Do the following:
(1) Use the methods in Lecture 6 to treat the missing values of all attributes in labor.arff.
(2) Explain why the method you used for each attribute is appropriate.
Submit the written report (pdf) by July 7, 2017.
Hints:
1. You can use the ARFF-Viewer in the Tools menu of WEKA to visualize labor.arff.
2. You have at least two ways to work on the labor data (labor.arff):
   - Use the tool arff2csv.zip at our website http://www.jaist.ac.jp/~bao/k236/ to convert the data into Excel format, and use the data represented in Excel for your preprocessing, or
   - Take the labor data from UCI: http://archive.ics.uci.edu/ml/machinelearningdatabases/labor-negotiations/c4.5/ and store it in Excel format (or whatever you like) to process.