CS570: Introduction to Data Mining


1 CS570: Introduction to Data Mining
Fall 2013
Reading: Chapter 3 Han, Chapter 2 Tan
Anca Doloc-Mihu, Ph.D.
Some slides courtesy of Li Xiong, Ph.D. and © 2011 Han, Kamber & Pei. Data Mining. Morgan Kaufmann.

2 Data Exploration and Data Preprocessing
- Data and Attributes
- Data exploration
- Data pre-processing
  - Data cleaning
  - Data integration
  - Data transformation
  - Data reduction

3 Data Transformation
- Aggregation: summarization (data reduction)
  - E.g. daily sales -> monthly sales
- (Statistical) Normalization: values scaled to fall within a small, specified range
  - E.g. income vs. age
- Discretization and generalization
  - E.g. age -> youth, middle-aged, senior
- Attribute construction: construct new attributes from given ones
  - E.g. birthday -> age
September 5,

4 Data Aggregation
- A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
- Data cubes store multidimensional aggregated information
- Multiple levels of aggregation for analysis at multiple granularities

5 Normalization
Values are scaled to fall within a small, specified range.
- Min-max normalization: maps \([min_A, max_A]\) to \([new\_min_A, new\_max_A]\)
  \[ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A \]
  - Ex. Let income range over [$12,000, $98,000], normalized to [0.0, 1.0]. Then $73,600 is mapped to
    \[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
- Z-score normalization (μ: mean, σ: standard deviation)
  \[ v' = \frac{v - \mu_A}{\sigma_A} \]
  - Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to
    \[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
- Normalization by decimal scaling
  \[ v' = \frac{v}{10^{j}} \]
  where j is the smallest integer such that \(\max(|v'|) < 1\)
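The three normalization rules on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, the decimal-scaling search assumes a non-negative exponent j, and the values -986 and 917 are illustrative inputs that do not come from the slides.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Rescale v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean, std):
    """Express v as a number of standard deviations away from the mean."""
    return (v - mean) / std


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest non-negative integer
    (an assumption of this sketch) such that max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


# Income example from the slide.
income_scaled = min_max(73_600, 12_000, 98_000)
income_z = z_score(73_600, 54_000, 16_000)

# Illustrative values (not from the slides) for decimal scaling.
scaled, j = decimal_scaling([-986, 917])
```

With the slide's income figures, `income_scaled` is about 0.716 and `income_z` is 1.225.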

6 Discretization and Generalization
- Discretization: transform a continuous attribute into discrete counterparts (intervals)
  - Supervised vs. unsupervised
  - Split (top-down) vs. merge (bottom-up)
- Generalization: replace low-level concepts (such as age ranges) by higher-level concepts (such as young, middle-aged, or senior)

7 Discretization Methods
- Binning or histogram analysis
  - Unsupervised, top-down split
- Clustering analysis
  - Unsupervised, either top-down split or bottom-up merge
- Entropy-based discretization
  - Supervised, top-down split


8 Entropy-Based Discretization
- Entropy is based on the class distribution of the samples in a set \(S_1\): with m classes, \(p_i\) is the probability of class i in \(S_1\)
  \[ \mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i) \]
- Given a set of samples S, if S is partitioned into two intervals \(S_1\) and \(S_2\) using boundary T, the class entropy after partitioning is
  \[ I(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2) \]
- The boundary that minimizes the entropy function is selected for binary discretization
- The process is recursively applied to the resulting partitions
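A minimal sketch of one binary split may make the boundary search concrete. It assumes candidate boundaries are the midpoints between consecutive distinct sorted values (a common choice, not stated on the slide). The `ages` and `buys` data are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Class entropy of a list of labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_boundary(values, labels):
    """Return (T, I(S, T)) for the boundary T that minimizes the
    weighted class entropy of the two resulting intervals."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2  # assumed: midpoint candidates
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if best is None or info < best[1]:
            best = (t, info)
    return best


# Hypothetical toy data: the class flips between age 30 and age 41.
ages = [22, 25, 30, 41, 45, 52]
buys = ["no", "no", "no", "yes", "yes", "yes"]
boundary, info = best_boundary(ages, buys)
```

Applying `best_boundary` again to each side of the chosen boundary gives the recursive, top-down procedure; a stopping rule (for example a minimum entropy gain) would have to be added.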

9 Generalization for Categorical Attributes
- Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
  - street < city < state < country
- Specification of a hierarchy for a set of values by explicit data grouping
  - {Atlanta, Savannah, Columbus} < Georgia
- Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values
  - E.g., for a set of attributes: {street, city, state, country}

10 Automatic Concept Hierarchy Generation
- Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
  - The attribute with the most distinct values is placed at the lowest level of the hierarchy
  - Exceptions, e.g., weekday, month, quarter, year
- Example hierarchy, from highest to lowest level:
  - country: 15 distinct values
  - province_or_state: 365 distinct values
  - city: 3567 distinct values
  - street: 674,339 distinct values
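The distinct-value heuristic amounts to a sort. The sketch below uses the counts from the slide; the function name is hypothetical, and the heuristic would misorder the exceptions noted above (weekday, month, quarter, year).

```python
def hierarchy_by_distinct_counts(distinct_counts):
    """Order attributes from the highest level (fewest distinct values)
    to the lowest level (most distinct values)."""
    return sorted(distinct_counts, key=distinct_counts.get)


# Distinct-value counts taken from the slide.
counts = {
    "street": 674_339,
    "country": 15,
    "city": 3_567,
    "province_or_state": 365,
}
levels = hierarchy_by_distinct_counts(counts)
```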

11 Data Exploration and Data Preprocessing
- Data and Attributes
- Data exploration
- Data pre-processing
  - Data cleaning
  - Data integration
  - Data transformation
  - Data reduction
Data Mining: Concepts and Techniques

12 Data Reduction
- Why data reduction?
  - A database/data warehouse may store terabytes of data
    - Number of data points
    - Number of dimensions
  - Complex data analysis/mining may take a very long time to run on the complete data set
- Data reduction
  - Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

13 Data Reduction
- Instance reduction
  - Sampling (instance selection)
  - Aggregation
  - Parametric reduction
- Dimension reduction
  - Feature selection
  - Feature extraction

14 Instance Reduction: Sampling
- Sampling: obtaining a small representative sample s to represent the whole data set N
  - A sample is representative if it has approximately the same property (of interest) as the original set of data
- Statisticians sample because obtaining the entire set of data is too expensive or time consuming
- Data miners sample because processing the entire set of data is too expensive or time consuming
- Issues:
  - Sampling method
  - Sampling size


15 Why sampling
A statistics professor was describing sampling theory.
Student: I don't believe it, why not study the whole population in the first place?
The professor continued explaining sampling methods, the central limit theorem, etc.
Student: Too much theory, too risky, I couldn't trust just a few numbers in place of ALL of them.
The professor explained the Nielsen television ratings.
Student: You mean that just a sample of a few thousand can tell us exactly what over 250 MILLION people are doing?
Professor: Well, the next time you go to the campus clinic and they want to do a blood test, tell them that's not good enough; tell them to TAKE IT ALL!!

16 Sampling Methods
- Simple random sampling
  - There is an equal probability of selecting any particular item
- Stratified sampling
  - Split the data into several partitions (strata); then draw random samples from each partition
- Cluster sampling
  - Used when "natural" groupings are evident in a statistical population
- Sampling without replacement
  - As each item is selected, it is removed from the population
- Sampling with replacement
  - Objects are not removed from the population as they are selected for the sample, so the same object can be picked more than once
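The sampling schemes above can be sketched with Python's standard `random` module. The helper names are hypothetical, the `records` data is made up, and the stratified version assumes proportional allocation with at least one item per stratum, which is one reasonable choice among several.

```python
import random
from collections import defaultdict


def srswor(data, n, rng):
    """Simple random sample WITHOUT replacement: no item repeats."""
    return rng.sample(data, n)


def srswr(data, n, rng):
    """Simple random sample WITH replacement: items may repeat."""
    return [rng.choice(data) for _ in range(n)]


def stratified(data, key, fraction, rng):
    """Split data into strata by key(item), then draw the same fraction
    (at least one item) from each stratum without replacement."""
    strata = defaultdict(list)
    for item in data:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(100))
without_replacement = srswor(population, 10, rng)
with_replacement = srswr(population, 10, rng)

# Hypothetical records: 20 "youth" and 5 "senior" customers.
records = [("youth", i) for i in range(20)] + [("senior", i) for i in range(5)]
stratified_sample = stratified(records, key=lambda r: r[0], fraction=0.2, rng=rng)
```

Stratifying keeps the small "senior" group represented, which a simple random sample of the same size does not guarantee.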

17 Simple random sampling without or with replacement
[Figure: raw data reduced to final data by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]

18 Stratified Sampling Illustration
[Figure: raw data and the resulting stratified sample]

19 Sampling size

20 Sampling Size
[Figure: the same data set drawn with 8000 points, 2000 points, and 500 points]

21 Data Reduction
- Instance reduction
  - Sampling (instance selection)
  - Numerosity reduction
- Dimension reduction
  - Feature selection
  - Feature extraction

22 Numerosity Reduction
- Reduce data volume by choosing alternative, smaller forms of data representation
- Parametric methods
  - Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)
  - Regression
- Non-parametric methods
  - Do not assume models
  - Major families: histograms, clustering

23 Regression Analysis
- Assume the data fits some model and estimate model parameters
  - Multiple linear regression: \(Y = b_0 + b_1 X_1 + \dots + b_P X_P\)
  - Line fitting: \(Y = b_1 X + b_0\)
  - Polynomial fitting: \(Y = b_2 x^2 + b_1 x + b_0\)
- Regression techniques
  - Least squares fitting
    - Vertical vs. perpendicular offsets
    - Outliers
  - Robust regression (when there are many outliers)
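For the line-fitting case, least squares with vertical offsets has a closed form, sketched below. The function name and the data points are made up for illustration; storing only `b0` and `b1` in place of the points is the parametric reduction idea.

```python
def fit_line(xs, ys):
    """Least squares fit of Y = b1 * X + b0 using vertical offsets."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx           # slope
    b0 = mean_y - b1 * mean_x  # intercept
    return b0, b1


# Hypothetical points lying exactly on Y = 2X + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
b0, b1 = fit_line(xs, ys)
```

This sketch makes no attempt at robustness: a single extreme outlier can pull the fitted line noticeably, which is the motivation for the robust regression bullet.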

24 Instance Reduction: Histograms
- Divide data into buckets (bins) and store the average (or sum) for each bucket
- Partitioning rules:
  - Equi-width: equal bucket range
  - Equi-depth: equal frequency
  - V-optimal: with the least frequency variance
Histograms/v-opt1.html Histograms/v-opt2.html Histograms/v-opt3.html
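The difference between the equi-width and equi-depth rules is easiest to see on a small example. The sketch below uses hypothetical helper names and made-up price values; it assumes the values are not all identical (otherwise the equi-width bucket width would be zero). V-optimal partitioning is not covered.

```python
def equi_width_buckets(values, k):
    """k buckets that each span an equal range of values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    buckets = [[] for _ in range(k)]
    for v in values:
        idx = min(int((v - lo) / width), k - 1)  # put the maximum in the last bucket
        buckets[idx].append(v)
    return buckets


def equi_depth_buckets(values, k):
    """k buckets that each hold (roughly) the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


def bucket_means(buckets):
    """The reduced representation: one average per non-empty bucket."""
    return [sum(b) / len(b) for b in buckets if b]


# Hypothetical price data.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
width_buckets = equi_width_buckets(prices, 3)
depth_buckets = equi_depth_buckets(prices, 3)
width_means = bucket_means(width_buckets)
depth_means = bucket_means(depth_buckets)
```

Here the equi-width buckets hold 2, 3, and 4 values, while the equi-depth buckets hold 3 values each.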

25 Instance Reduction: Clustering
- Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)
- Can be very effective if data is clustered but not if data is smeared
- Can have hierarchical clustering and be stored in multi-dimensional index tree structures
- Cluster analysis will be studied in depth later

26 Data Reduction
- Instance reduction
  - Sampling (instance selection)
  - Numerosity reduction
- Dimension reduction
  - Feature selection
  - Feature extraction

27 Feature Subset Selection
- Select a subset of features such that mining the reduced data gives the same (or almost the same) result
- Redundant features
  - Duplicate much or all of the information contained in one or more other attributes
  - Example: purchase price of a product and the amount of sales tax paid
- Irrelevant features
  - Contain no information that is useful for the data mining task at hand
  - Example: students' ID is often irrelevant to the task of predicting students' GPA

28 Correlation between attributes
- Correlation measures the linear relationship between attributes

29 Correlation Analysis (Numerical Data)
- Correlation coefficient (also called Pearson's product moment coefficient)
  \[ r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B} \]
  where n is the number of tuples, \(\bar{A}\) and \(\bar{B}\) are the respective means of A and B, \(\sigma_A\) and \(\sigma_B\) are the respective standard deviations of A and B, and \(\sum(AB)\) is the sum of the AB cross-product.
- \(r_{A,B} > 0\): A and B are positively correlated (A's values increase as B's do)
- \(r_{A,B} = 0\): uncorrelated (no linear relationship; this alone does not imply independence)
- \(r_{A,B} < 0\): negatively correlated
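A short sketch of the cross-product form of the coefficient follows. It assumes the standard deviations are sample standard deviations (denominator n - 1), which is what makes the (n - 1) factor in the formula consistent. The function name and data are made up.

```python
import statistics


def pearson(a, b):
    """Pearson correlation via the cross-product form:
    (sum(AB) - n * mean(A) * mean(B)) / ((n - 1) * sd(A) * sd(B)),
    with sample standard deviations (assumed)."""
    n = len(a)
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    cross = sum(x * y for x, y in zip(a, b))
    return (cross - n * mean_a * mean_b) / (
        (n - 1) * statistics.stdev(a) * statistics.stdev(b)
    )


# Hypothetical attributes: B rises with A, C falls as A rises.
attr_a = [1, 2, 3, 4, 5]
attr_b = [2, 4, 6, 8, 10]
attr_c = [10, 8, 6, 4, 2]
r_ab = pearson(attr_a, attr_b)
r_ac = pearson(attr_a, attr_c)
```

A coefficient near +1 or -1, as for these perfectly linear toy attributes, flags a redundant pair: one attribute can be dropped with little loss of information.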

30 Visually Evaluating Correlation
[Figure: scatter plots showing the Pearson correlation from -1 to 1]

31 Correlation Analysis (Categorical Data)
- Χ² (chi-square) test
  \[ \chi^2 = \sum \frac{(Observed - Expected)^2}{Expected} \]
- The larger the Χ² value, the more likely the variables are related
- The cells that contribute the most to the Χ² value are those whose actual count is very different from the expected count

32 Chi-Square Calculation: An Example

                            Play chess   Not play chess   Sum (row)
Like science fiction        250 (90)     200 (360)        450
Not like science fiction    50 (210)     1000 (840)       1050
Sum (col.)                  300          1200             1500

- Χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories)
  \[ \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} \approx 507.94 \]
- It shows that like_science_fiction and play_chess are correlated in the group (the value is far above what is needed to reject the independence hypothesis)
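The expected counts and the Χ² statistic for this table can be recomputed with a short sketch. The function name is hypothetical; each expected count is taken as (row sum × column sum) / total, the usual estimate under independence.

```python
def chi_square(observed):
    """Return (chi2, expected) for a contingency table given as rows.
    Expected count of a cell = row sum * column sum / grand total."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    expected = [[r * c / total for c in col_sums] for r in row_sums]
    chi2 = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(observed, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    return chi2, expected


# Observed counts from the slide:
# rows = like / not like science fiction, columns = play / not play chess.
observed = [[250, 200],
            [50, 1000]]
chi2, expected = chi_square(observed)
```

The expected counts come out as 90, 360, 210, and 840, matching the numbers in parentheses on the slide. The cell (like science fiction, play chess) contributes the most to the statistic because its observed count is furthest, in relative terms, from its expected count.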

33 Feature Selection
- Brute-force approach:
  - Try all possible feature subsets
- Heuristic methods
  - Step-wise forward selection
  - Step-wise backward elimination
  - Combining forward selection and backward elimination
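Step-wise forward selection can be sketched as a greedy loop around a scoring function. Everything specific below is an assumption made for illustration: the `score` callback stands in for whatever quality measure is used (for example cross-validated accuracy), and `toy_score` with its `gains` table is invented.

```python
def forward_selection(features, score, max_features=None):
    """Greedy step-wise forward selection: start from the empty set and
    repeatedly add the feature that most improves score(subset);
    stop when no remaining feature improves the score."""
    selected = []
    best_score = score(selected)
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        candidate, cand_score = max(
            ((f, score(selected + [f])) for f in remaining),
            key=lambda pair: pair[1],
        )
        if cand_score <= best_score:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = cand_score
    return selected, best_score


# Invented scoring function: each feature has a fixed usefulness,
# and every selected feature costs a small penalty.
gains = {"income": 0.30, "age": 0.20, "student_id": 0.0, "sales_tax": 0.01}


def toy_score(subset):
    return sum(gains[f] for f in subset) - 0.05 * len(subset)


selected, final_score = forward_selection(list(gains), toy_score)
```

Backward elimination is the mirror image: start from all features and repeatedly drop the one whose removal helps most. Both evaluate far fewer subsets than the brute-force search over all 2^n subsets, at the cost of possibly missing the best one.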

34 Feature Selection
- Filter approaches:
  - Features are selected independently of the data mining algorithm (before it runs)
  - E.g. minimal pair-wise correlation/dependence, top k information entropy
- Wrapper approaches:
  - Use the data mining algorithm as a black box to find the best subset
  - E.g. best classification accuracy
- Embedded approaches:
  - Feature selection occurs naturally as part of the data mining algorithm: the algorithm decides which attributes to select
  - E.g. decision tree classification

35 Data Reduction
- Instance reduction
  - Sampling
  - Aggregation
- Dimension reduction
  - Feature selection
  - Feature extraction/creation

36 Feature Extraction
- Create new features (attributes) by combining/mapping existing ones
- Methods
  - Principal Component Analysis
  - Data compression methods: Discrete Wavelet Transform
  - Regression analysis

37 Principal Component Analysis (PCA)
- Principal component analysis: find the dimensions that capture the most variance
  - A linear mapping of the data to a new coordinate system such that the greatest variance lies on the first coordinate (the first principal component), the second greatest variance on the second coordinate, and so on
- Steps
  - Normalize input data: each attribute falls within the same range
  - Compute k orthonormal (unit) vectors, i.e., principal components; each input data vector is a linear combination of the k principal component vectors
  - The principal components are sorted in order of decreasing significance
  - Weak components can be eliminated, i.e., those with low variance
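To make the first step concrete, the sketch below finds only the first principal component, using power iteration on the covariance matrix in place of a full eigendecomposition. That substitution, the function names, and the data are all assumptions for illustration. The data is mean-centered but not rescaled, and power iteration can fail if the starting vector happens to be orthogonal to the dominant direction.

```python
import math


def covariance_matrix(rows):
    """Mean-center the rows; return (sample covariance matrix, centered rows)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    return cov, centered


def first_principal_component(cov, iterations=200):
    """Power iteration: repeatedly apply cov to a vector and renormalize;
    the result converges to the unit direction of greatest variance."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical 2-D data lying exactly on the line x2 = 2 * x1.
rows = [(1, 2), (2, 4), (3, 6), (4, 8)]
cov, centered = covariance_matrix(rows)
pc1 = first_principal_component(cov)

# Reduce each 2-D point to a single coordinate along pc1.
projected = [sum(c * p for c, p in zip(row, pc1)) for row in centered]
```

Because these points lie on one line, `pc1` captures all of the variance and the second component could be dropped without loss. On real data the discarded components carry some variance, so the reduction is lossy.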

38 Illustration of Principal Component Analysis
[Figure: data plotted on original axes X1 and X2, with principal component axes Y1 and Y2]

39 Example of Principal Component Analysis for biological data

40 Data Compression
- Data compression: reduced representation of original data
  - Lossless vs. lossy
- Common lossless techniques (string)
  - Run-length encoding
  - Entropy encoding: Huffman encoding, arithmetic encoding
- Common lossy techniques (audio/video)
  - Discrete cosine transform
  - Wavelet transform
[Figure: original data <-> compressed data (lossless); original data approximated (lossy)]
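Run-length encoding is the simplest of the lossless techniques listed and fits in a few lines. The function names and the input string are made up for illustration.

```python
from itertools import groupby


def rle_encode(text):
    """Run-length encode a string as a list of (symbol, run length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(text)]


def rle_decode(pairs):
    """Invert rle_encode: expand each (symbol, run length) pair."""
    return "".join(symbol * count for symbol, count in pairs)


encoded = rle_encode("AAAABBBCCD")
decoded = rle_decode(encoded)
```

Decoding reproduces the input exactly, which is what makes the scheme lossless. It only saves space when the data contains long runs.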

41 Wavelet Transformation
- Discrete wavelet transform (DWT): a linear signal processing technique that divides a signal into different frequency components
- Data compression/reduction: store only a small fraction of the strongest of the wavelet coefficients
- Discrete wavelet functions
  - Haar wavelet
  - Daubechies wavelets

42 DWT Algorithm
- Pyramid algorithm: averaging and differencing method
  - Input data of length L (an integer power of 2)
  - Each transform has 2 functions: smoothing (sum, avg), then (weighted) differencing
  - Applies to pairs of data, resulting in two sets of data of length L/2
  - Applies the two functions recursively until the desired length is reached
  - Select coefficients by threshold
- Haar Wavelet Transform
  - Haar matrix (sum and difference)
  - Example: (4,6,10,8,1,9,5,3)
- Filtering of data
  - Low pass filter (averaging)
  - High pass filter (differencing)
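The pyramid algorithm can be traced on the slide's example vector with a short sketch. It assumes the unnormalized convention of pairwise average (a + b) / 2 and difference (a - b) / 2, which is one common choice; the slide's own Haar matrix is not recoverable, so the scaling may differ. Function names are hypothetical.

```python
def haar_transform(data):
    """Full pyramid decomposition by pairwise averaging and differencing.
    Returns [overall average, coarsest detail, ..., finest details]."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of 2")
    approx = list(data)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]  # high pass (differencing)
        approx = [(a + b) / 2 for a, b in pairs]         # low pass (averaging)
        details = level_details + details
    return approx + details


def haar_inverse(coeffs):
    """Rebuild the signal: each (average, detail) pair expands to
    (average + detail, average - detail)."""
    approx = coeffs[:1]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(approx)]
        pos += len(approx)
        approx = [x for a, d in zip(approx, details) for x in (a + d, a - d)]
    return approx


def threshold(coeffs, t):
    """Keep only coefficients with magnitude >= t; zero out the rest."""
    return [c if abs(c) >= t else 0.0 for c in coeffs]


signal = [4, 6, 10, 8, 1, 9, 5, 3]  # example vector from the slide
coeffs = haar_transform(signal)
restored = haar_inverse(coeffs)
approximated = haar_inverse(threshold(coeffs, 1.0))
```

Under this convention the example decomposes to (5.75, 1.25, -2, 0.5, -1, 1, -4, 1), and inverting the full coefficient list restores the signal exactly. Zeroing small coefficients before inverting gives a lossy approximation that needs fewer stored values.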

43 Example of DWT Based Image Compression
[Figure: DWT compression for test image Lenna (threshold = 1)]

44 Summary
- Data Exploration and Data Preprocessing
  - Data and Attributes
  - Data exploration
    - Descriptive statistics
    - Data visualization
  - Data pre-processing
    - Data cleaning
    - Data integration
    - Data transformation
    - Data reduction
- Next lecture
  - Frequent itemsets mining and association analysis
