Data Warehousing and Machine Learning
1 Data Warehousing and Machine Learning: Preprocessing. Thomas D. Nielsen, Aalborg University, Department of Computer Science, Spring 2008.
2 Preprocessing

Before you can start on the actual data mining, the data may require some preprocessing:

- Attributes may be redundant.
- Values may be missing.
- The data contains outliers.
- The data is not in a suitable format.
- The values appear inconsistent.

Garbage in, garbage out.
3-13 Preprocessing: Data Cleaning

An example customer table with the columns ID, Zip, Gender, Income, Age, Marital status, and Transaction amount (with values such as the zip code J2S7K7, genders M/F, marital statuses M, W, S, D, and a transaction amount of 3000) illustrates typical data quality problems:

- Is J2S7K7 a correct zip code?
- A missing value.
- An error/outlier.
- An error.
- Unexpected precision.
- A categorical value where a number is expected?
- An error or a missing value?

Other issues:

- What are the semantics of the marital status?
- What is the unit of measure for the transaction field?
14 Preprocessing: Missing Values

In many real-world databases you will be faced with the problem of missing data:

Id.  Savings  Assets  Income ($1000s)  Credit Risk
1    Medium   High    75               Good
2    Low      Low     50               Bad
3    ?        ?       25               Bad
4    Medium   Medium  ?                Good
5    Low      Medium  100              Good
6    High     High    25               Good
7    Low      ?       25               Bad
8    Medium   Medium  75               Good

By simply discarding the records with missing data we might unintentionally bias the data.
15-18 Preprocessing: Missing Values

Possible strategies for handling missing data:

- Use a predefined constant.
- Use the mean (for numerical variables) or the mode (for categorical variables).
- Use a value drawn randomly from the observed distribution.

Applied to the table above:

- Savings for record 3 is set to Low, a mode (note that both Low and Medium are modes for Savings).
- Assets for records 3 and 7 are set to High and Medium, drawn randomly from the observed distribution for Assets.
- Income for record 4 is set to 54, the mean of the observed incomes.
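The three strategies can be sketched in plain Python on the credit-risk table above (a minimal sketch: the field names, the rounding of the mean, and the seeded random generator are assumptions; ties for the mode are resolved arbitrarily):

```python
import random
from collections import Counter

# The credit-risk records from the slide; None marks a missing value.
records = [
    {"savings": "Medium", "assets": "High",   "income": 75,   "risk": "Good"},
    {"savings": "Low",    "assets": "Low",    "income": 50,   "risk": "Bad"},
    {"savings": None,     "assets": None,     "income": 25,   "risk": "Bad"},
    {"savings": "Medium", "assets": "Medium", "income": None, "risk": "Good"},
    {"savings": "Low",    "assets": "Medium", "income": 100,  "risk": "Good"},
    {"savings": "High",   "assets": "High",   "income": 25,   "risk": "Good"},
    {"savings": "Low",    "assets": None,     "income": 25,   "risk": "Bad"},
    {"savings": "Medium", "assets": "Medium", "income": 75,   "risk": "Good"},
]

def observed(field):
    """All non-missing values of a field."""
    return [r[field] for r in records if r[field] is not None]

def fill_with_mean(field):
    """Numerical attribute: replace missing values by the observed mean."""
    mean = round(sum(observed(field)) / len(observed(field)))
    for r in records:
        if r[field] is None:
            r[field] = mean

def fill_with_mode(field):
    """Categorical attribute: replace missing values by an observed mode."""
    mode = Counter(observed(field)).most_common(1)[0][0]
    for r in records:
        if r[field] is None:
            r[field] = mode

def fill_randomly(field, rng=random.Random(0)):
    """Replace missing values by draws from the observed distribution."""
    pool = observed(field)
    for r in records:
        if r[field] is None:
            r[field] = rng.choice(pool)

fill_with_mode("savings")   # tie between Low and Medium resolved arbitrarily
fill_randomly("assets")
fill_with_mean("income")    # (75+50+25+100+25+25+75)/7 rounds to 54
```

Note that the random-draw strategy keeps the marginal distribution of the attribute roughly intact, which mean/mode imputation does not.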
19 Preprocessing: Discretization

Some data mining algorithms can only handle discrete attributes. Possible solution: divide the continuous range into intervals.

Example: (Income, Risk) = (25, B), (25, B), (50, G), (51, B), (54, G), (75, G), (75, G), (100, G), (100, G)

Unsupervised discretization:

Equal width binning (width 25):
- Bin 1, [25, 50): 25, 25
- Bin 2, [50, 75): 50, 51, 54
- Bin 3, [75, 100]: 75, 75, 100, 100

Equal frequency binning (bin density 3):
- Bin 1, [25, 50.5): 25, 25, 50
- Bin 2, [50.5, 87.5): 51, 54, 75, 75
- Bin 3, [87.5, 100]: 100, 100
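Both binning schemes can be sketched as follows (function names are assumptions, as is the equal-frequency tie rule that keeps equal values in the same bin, which reproduces the bins on the slide):

```python
def equal_width_bins(values, n_bins):
    """Split the value range [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        idx = min(int((v - lo) / width), n_bins - 1)  # max(values) goes in the last bin
        bins[idx].append(v)
    return bins

def equal_frequency_bins(values, density):
    """Greedily fill bins with `density` values each, never splitting ties."""
    bins = [[]]
    for v in sorted(values):
        if len(bins[-1]) >= density and v != bins[-1][-1]:
            bins.append([])
        bins[-1].append(v)
    return bins

incomes = [25, 25, 50, 51, 54, 75, 75, 100, 100]
print(equal_width_bins(incomes, 3))      # [[25, 25], [50, 51, 54], [75, 75, 100, 100]]
print(equal_frequency_bins(incomes, 3))  # [[25, 25, 50], [51, 54, 75, 75], [100, 100]]
```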
20 Preprocessing: Supervised Discretization

Take the class distribution into account when selecting the intervals. For example, recursively bisect the interval by selecting the split point v giving the highest information gain,

Gain(S, v) = Ent(S) - ( |S_<=v|/|S| * Ent(S_<=v) + |S_>v|/|S| * Ent(S_>v) ),

until some stopping criterion is met.

(Income, Risk) = (25, B), (25, B), (50, G), (51, B), (54, G), (75, G), (75, G), (100, G), (100, G)

Ent(S) = -(3/9) log2(3/9) - (6/9) log2(6/9) ≈ 0.918

Split                   E-Ent
(-∞, 25], (25, ∞)       0.460
(-∞, 50], (50, ∞)       0.739
(-∞, 51], (51, ∞)       0.361
(-∞, 54], (54, ∞)       0.539
(-∞, 75], (75, ∞)       0.766

The split at 51 gives the lowest expected entropy and hence the highest gain.
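The entropy and gain computations above can be reproduced directly (a sketch; taking the candidate splits to be the distinct income values except the maximum is an assumption):

```python
from math import log2
from collections import Counter

data = [(25, "B"), (25, "B"), (50, "G"), (51, "B"), (54, "G"),
        (75, "G"), (75, "G"), (100, "G"), (100, "G")]

def entropy(labels):
    """Ent(S) = -sum over classes of p * log2(p)."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(data, v):
    """Gain(S, v): entropy reduction from splitting S at income value v."""
    n = len(data)
    left = [y for x, y in data if x <= v]
    right = [y for x, y in data if x > v]
    return entropy([y for _, y in data]) - (
        len(left) / n * entropy(left) + len(right) / n * entropy(right))

candidates = sorted({x for x, _ in data})[:-1]  # a split at max(x) is trivial
best = max(candidates, key=lambda v: gain(data, v))
print(best, round(gain(data, best), 3))         # 51 0.558
```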
21-23 Preprocessing: Data Transformation

Some data mining tools tend to give variables with a large range a higher significance than variables with a smaller range, for example age versus income. The typical approach is to standardize the scales:

- Min-max normalization: X' = (X - min(X)) / (max(X) - min(X)).
- Z-score standardization: X' = (X - mean(X)) / SD(X).
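Both rescalings in a few lines of Python (a minimal sketch; using the population standard deviation for SD(X) is an assumption):

```python
def min_max(values):
    """Min-max normalization: map the values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Z-score standardization: subtract the mean, divide by the SD."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5  # population SD
    return [(v - mean) / sd for v in values]

ages = [25, 38, 46, 51, 60]
print(min_max(ages))   # all values now lie in [0, 1]
print(z_score(ages))   # mean 0, standard deviation 1
```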
24-25 Preprocessing: Outliers

Data: 1, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8, 8, ...

Summary statistics:
- First quartile (1Q), 25% of the data: 4.
- Second quartile (2Q), 50% of the data: 6.
- Third quartile (3Q), 75% of the data: 7.
- Interquartile range IQR = 3Q - 1Q = 7 - 4 = 3.

A data point may be an outlier if:
- It is lower than 1Q - 1.5 * IQR = 4 - 4.5 = -0.5.
- It is higher than 3Q + 1.5 * IQR = 7 + 4.5 = 11.5.
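The IQR rule as a small function (a sketch; the quartiles are passed in explicitly to match the slide's values, since quartile conventions differ slightly between tools):

```python
def iqr_outliers(values, q1, q3):
    """Return the values outside the fences q1 - 1.5*IQR and q3 + 1.5*IQR."""
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

data = [1, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8, 8]
print(iqr_outliers(data, q1=4, q3=7))         # fences at -0.5 and 11.5: no outliers
print(iqr_outliers(data + [15], q1=4, q3=7))  # 15 exceeds the upper fence
```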
26 Data Warehousing and Machine Learning: Clustering. Thomas D. Nielsen, Aalborg University, Department of Computer Science, Spring 2008.

Clustering: partitional and hierarchical.
27 Clustering: Unlabeled Data

The Iris data with class labels removed: attributes SL, SW, PL, PW (sepal length, sepal width, petal length, petal width). Unlabeled data in general: (discrete or continuous) attributes, no class variable.
28 Clustering

A clustering of the data S = {s_1, ..., s_N} consists of a set C = {c_1, ..., c_k} of cluster labels and a cluster assignment ca: S -> C. Example: clustering Iris with C = {blue, red}.

Note: a clustering partitions the data points, not necessarily the instance space. When the cluster labels have no particular significance, we can also identify a clustering with the partition S = S_1 ∪ ... ∪ S_k, where S_i = ca^{-1}(c_i).
29 Clustering: Clustering Goal

A candidate clustering (indicated by colors) of data cases in instance space; arrows indicate selected between-cluster and within-cluster distances.

General goal: find a clustering with large between-cluster variation (sum of between-cluster distances) and small within-cluster variation (sum of within-cluster distances). The concrete goal varies according to the exact distance definition.
30 Clustering: Examples

- Group plants/animals into families or related species, based on morphological or molecular features.
- Identify types of customers based on attributes in a database (who can then be targeted by special advertising campaigns).
- Web mining: group web pages according to content.
31 Clustering: Clustering vs. Classification

The cluster label can be interpreted as a hidden class variable that is never observed, whose number of states is unknown, and on which the distribution of attribute values depends. Clustering is therefore often called unsupervised learning, in contrast to the supervised learning of classifiers: in supervised learning, correct class labels for the training data are provided to the learning algorithm by a supervisor, or teacher.

One key problem in clustering is determining the right number of clusters. Two different approaches:

- Partition-based clustering
- Hierarchical clustering

All clustering methods require a distance measure on the instance space!
32 Clustering: Partition-based Clustering

The number k of clusters is fixed (user defined); partition the data into k clusters.

k-means clustering. Assume that there is a distance function d(s, s') defined between data items, and that we can compute the mean value of a collection {s_1, ..., s_l} of data items.

Initialize: randomly pick initial cluster centers c = c_1, ..., c_k from S
repeat
    for i = 1, ..., k
        S_i := {s ∈ S | c_i = argmin_{c' ∈ c} d(c', s)}
        c_old,i := c_i
        c_i := mean(S_i)
        ca(s) := c_i (for all s ∈ S_i)
until c = c_old
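The loop above translates almost line by line into Python (a sketch for points in R^d with Euclidean distance; the seeded generator and the handling of empty clusters, which simply keep their old center, are assumptions):

```python
import random

def kmeans(points, k, rng=random.Random(0)):
    """Partition `points` (tuples in R^d) into k clusters."""
    def dist2(a, b):                       # squared Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def mean(cluster):                     # componentwise mean of a cluster
        return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

    centers = rng.sample(points, k)        # initial centers picked from S
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                   # assign each point to its nearest center
            i = min(range(k), key=lambda j: dist2(p, centers[j]))
            clusters[i].append(p)
        new_centers = [mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:         # c = c_old: converged
            return centers, clusters
        centers = new_centers

centers, clusters = kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], k=2)
print(sorted(map(len, clusters)))          # [3, 3]: the two visible blobs
```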
33-42 Clustering: Example

k = 3: a sequence of slides traces the algorithm on a scatter plot: initial centers c_1, c_2, c_3 are picked, the points are assigned to clusters S_1, S_2, S_3, the centers are recomputed, and assignment and recomputation alternate until the clustering no longer changes.
43 Clustering: Example (cont.)

Result for clustering the same data with k = 2: clusters S_1, S_2 with centers c_1, c_2. The result can depend on the choice of initial cluster centers!
44 Clustering: Outliers

The result of partitional clustering can be skewed by outliers (example with k = 2). Useful preprocessing: outlier detection and elimination (but be careful not to eliminate interesting outliers!).
45-46 Clustering: k-means as Optimization

With a Euclidean distance function dist we can use the sum of squared errors for evaluating a clustering:

SSE = Σ_{i=1}^{k} Σ_{x ∈ C_i} dist(x, c_i)²

k-means directly tries to minimize this error:

Initialize: randomly pick initial cluster centers c = c_1, ..., c_k from S
repeat
    for i = 1, ..., k
        S_i := {s ∈ S | c_i = argmin_{c' ∈ c} d(c', s)}   // minimizes the SSE for the current centers
        c_old,i := c_i
        c_i := mean(S_i)                                   // the centroid that minimizes the SSE for the assigned objects
        ca(s) := c_i (for all s ∈ S_i)
until c = c_old

Only guaranteed to find a local minimum.
47-48 Reducing SSE

Choosing initial centroids:
- Perform multiple runs with random initializations.
- Initialize centroids based on results from another algorithm (e.g. hierarchical clustering).
- ...

Postprocessing:
- Split a cluster.
- Disperse a cluster (choose the one whose removal increases the SSE the least).
- Merge two clusters (the two with the closest centroids, or the two whose merge increases the SSE the least).
49-52 Hierarchical Clustering

The right number of clusters may not only be unknown, it may also be quite ambiguous: a sequence of slides shows the same data set grouped at several different granularities. Hierarchical clusterings provide an explicit representation of nested clusterings of different granularity.
53 Hierarchical Clustering: Agglomerative Hierarchical Clustering

Extend the distance function d(s, s') to a distance function D(S, S') between sets of data items. Two out of many possibilities:

D_average(S, S') := (1 / (|S| |S'|)) Σ_{s ∈ S, s' ∈ S'} d(s, s')
D_min(S, S') := min_{s ∈ S, s' ∈ S'} d(s, s')

for i = 1, ..., N: S_i := {s_i}
while the current partition S_1, ..., S_k of S contains more than one element
    (i, j) := argmin_{i ≠ j} D(S_i, S_j)
    form a new partition by merging S_i and S_j

When D_average is used, this is also called average link clustering; for D_min, single link clustering.
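The loop above can be sketched directly, with both set distances (a naive quadratic search per merge, fine for small data; the function names and the `until` stopping parameter are assumptions):

```python
from itertools import combinations

def d(s, t):
    """Euclidean distance between two points (tuples)."""
    return sum((x - y) ** 2 for x, y in zip(s, t)) ** 0.5

def D_min(S, T):        # single link
    return min(d(s, t) for s in S for t in T)

def D_average(S, T):    # average link
    return sum(d(s, t) for s in S for t in T) / (len(S) * len(T))

def agglomerative(points, D, until=1):
    """Start from singleton clusters and repeatedly merge the closest pair
    under the set distance D, until `until` clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > until:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: D(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))
    return clusters

pts = [(0.0,), (1.0,), (5.0,), (6.0,)]
print(agglomerative(pts, D_min, until=2))  # [[(0.0,), (1.0,)], [(5.0,), (6.0,)]]
```

Swapping `D_min` for `D_average` changes only which pair is considered closest; on this tiny data set both give the same 2-clustering.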
54-66 Hierarchical Clustering

A sequence of slides traces the successive merges on a small data set.
67 Hierarchical Clustering: Dendrogram Representation of Hierarchical Clustering

A dendrogram represents the hierarchical clustering; its vertical axis shows the distance of the merged components.
68 Hierarchical Clustering: Dendrogram Representation of Hierarchical Clustering

Cutting the dendrogram at different distances yields, for example, a 3-clustering or a 5-clustering. The length of the distance interval corresponding to a specific clustering can be interpreted as a measure of the significance of that particular clustering.
69-73 Hierarchical Clustering: Single Link vs. Average Link

A sequence of slides compares the two linkage criteria on one data set: the 4-clustering for single link and average link, the single link 2-clustering, and the average link 2-clustering. Generally, single link will produce rather elongated, linear clusters; average link produces more convex clusters.
74-76 Hierarchical Clustering: Another Example

Another data set, shown with its single link 2-clustering and its average link 2-clustering (or similar).
77 Data Warehousing and Machine Learning: Self Organizing Maps. Thomas D. Nielsen, Aalborg University, Department of Computer Science, Spring 2008.
78 Self Organizing Maps: SOMs as Special Neural Networks

- A neural network structure without hidden layers: an input layer and an output layer.
- The output neurons are structured as a two-dimensional array.
- The connection from the i-th input to the j-th output has weight w_{i,j}.
- No activation function for the output nodes.
79 Self Organizing Maps: Kohonen Learning

Given: unlabeled data a_1, ..., a_N ∈ R^n; a distance measure d_n(·, ·) on R^n; a distance measure d_out(·, ·) on the output neurons; an update function η(t, d): N × R -> R, decreasing in t and d.

1. Initialize weight vectors w_j^(0) for the output nodes o_j
2. t := 0
3. repeat
4.     t := t + 1
5.     for i = 1, ..., N
6.         let o_j be the output neuron minimizing d_n(w_j, a_i)
7.         for all output nodes o_h:
8.             w_h^(t) := w_h^(t-1) + η(t, d_out(o_h, o_j)) (a_i - w_h^(t-1))
9. until termination condition applies
80 Self Organizing Maps: Distances etc.

Possible choices:
- d_n: Euclidean distance.
- d_out(o_j, o_h): e.g. 1 if o_j, o_h are neighbors (rectangular or hexagonal layout), or the Euclidean distance on the grid indices.
- η(t, d): e.g. α(t) exp(-d² / (2σ²(t))) with α(t), σ(t) decreasing in t.
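A minimal Kohonen-learning sketch under these choices (Euclidean d_n, grid-index Euclidean d_out, Gaussian η; the concrete schedules α(t) = 1/t and σ(t) = max(rows, cols)/t, the epoch count, and the seeded initialization are assumptions):

```python
import math
import random

def som_train(data, rows, cols, epochs=20, rng=random.Random(0)):
    """Fit a rows x cols grid of weight vectors to `data` (tuples in R^n)."""
    dim = len(data[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    w = {o: [rng.random() for _ in range(dim)] for o in grid}

    def d_n(u, v):                                  # squared Euclidean distance
        return sum((a - b) ** 2 for a, b in zip(u, v))

    def d_out(o, p):                                # Euclidean on grid indices
        return math.hypot(o[0] - p[0], o[1] - p[1])

    for t in range(1, epochs + 1):
        alpha = 1.0 / t                             # alpha(t), decreasing in t
        sigma = max(rows, cols) / t                 # sigma(t), decreasing in t
        for a in data:
            bmu = min(grid, key=lambda o: d_n(w[o], a))   # winning neuron o_j
            for o in grid:                          # move every unit toward a,
                eta = alpha * math.exp(-d_out(o, bmu) ** 2 / (2 * sigma ** 2))
                w[o] = [wi + eta * (ai - wi) for wi, ai in zip(w[o], a)]
    return w

weights = som_train([(0.0, 0.0), (0.1, 0.1), (0.9, 0.9), (1.0, 1.0)], rows=2, cols=2)
```

Because each step forms a convex combination of the old weight vector and the presented data case, the weights stay inside the hull of the initialization and the data.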
81 Self Organizing Maps: Intuition

SOM learning can be understood as fitting a two-dimensional surface (spanned by output neurons such as o_{0,0}, o_{0,1}, o_{1,0}, o_{1,1}) to the data. Colors indicate association with different output neurons, not data attributes. Some output neurons may not have any associated data cases.
82 Self Organizing Maps: Example (from Tan et al.)

Data: word occurrence data (?) from 3204 articles from the Los Angeles Times, with (hidden) section labels Entertainment, Financial, Foreign, Metro, National, Sports.

Result of SOM clustering on a 4 × 4 hexagonal grid (output nodes labelled with the majority label of their associated cases and colored by the number of associated cases, low to high; densities fictional):

Sports         Sports   Metro      Metro
Sports         Sports   Metro      Foreign
Entertainment  Metro    Metro      National
Entertainment  Metro    Financial  Financial
83 Self Organizing Maps: SOMs and k-means

In spite of their roots in neural networks, SOMs are more closely related to k-means clustering:

- The weight vectors w_j are cluster centers.
- Kohonen updating associates data cases with cluster centers and repositions the cluster centers to fit the associated data cases.

Differences:

- A 2-dimensional spatial relationship among the cluster centers.
- Data cases are associated with more than one cluster center.
- On-line updating (one case at a time).
84 Self Organizing Maps: Pros and Cons

+ Provides more insight than a basic clustering (i.e. a partitioning of the data).
+ Can produce intuitive representations of clustering results.
- No well-defined objective function that is optimized.
More informationCMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)
CMPUT 391 Database Management Systems Data Mining Textbook: Chapter 17.7-17.11 (without 17.10) University of Alberta 1 Overview Motivation KDD and Data Mining Association Rules Clustering Classification
More information3. Data Preprocessing. 3.1 Introduction
3. Data Preprocessing Contents of this Chapter 3.1 Introduction 3.2 Data cleaning 3.3 Data integration 3.4 Data transformation 3.5 Data reduction SFU, CMPT 740, 03-3, Martin Ester 84 3.1 Introduction Motivation
More informationExploratory Data Analysis using Self-Organizing Maps. Madhumanti Ray
Exploratory Data Analysis using Self-Organizing Maps Madhumanti Ray Content Introduction Data Analysis methods Self-Organizing Maps Conclusion Visualization of high-dimensional data items Exploratory data
More informationHierarchical Clustering
Hierarchical Clustering Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram A tree-like diagram that records the sequences of merges
More informationINF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22
INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task
More informationUNIT 2 Data Preprocessing
UNIT 2 Data Preprocessing Lecture Topic ********************************************** Lecture 13 Why preprocess the data? Lecture 14 Lecture 15 Lecture 16 Lecture 17 Data cleaning Data integration and
More information2.1 Objectives. Math Chapter 2. Chapter 2. Variable. Categorical Variable EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES
EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES Chapter 2 2.1 Objectives 2.1 What Are the Types of Data? www.managementscientist.org 1. Know the definitions of a. Variable b. Categorical versus quantitative
More informationCluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1
Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods
More informationFigure (5) Kohonen Self-Organized Map
2- KOHONEN SELF-ORGANIZING MAPS (SOM) - The self-organizing neural networks assume a topological structure among the cluster units. - There are m cluster units, arranged in a one- or two-dimensional array;
More informationK-Means Clustering 3/3/17
K-Means Clustering 3/3/17 Unsupervised Learning We have a collection of unlabeled data points. We want to find underlying structure in the data. Examples: Identify groups of similar data points. Clustering
More informationAND NUMERICAL SUMMARIES. Chapter 2
EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES Chapter 2 2.1 What Are the Types of Data? 2.1 Objectives www.managementscientist.org 1. Know the definitions of a. Variable b. Categorical versus quantitative
More informationCS570: Introduction to Data Mining
CS570: Introduction to Data Mining Fall 2013 Reading: Chapter 3 Han, Chapter 2 Tan Anca Doloc-Mihu, Ph.D. Some slides courtesy of Li Xiong, Ph.D. and 2011 Han, Kamber & Pei. Data Mining. Morgan Kaufmann.
More informationMachine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
More informationData Mining Algorithms
for the original version: -JörgSander and Martin Ester - Jiawei Han and Micheline Kamber Data Management and Exploration Prof. Dr. Thomas Seidl Data Mining Algorithms Lecture Course with Tutorials Wintersemester
More informationStatistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1
Week 8 Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Part I Clustering 2 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/004 1
More informationClustering. Unsupervised Learning
Clustering. Unsupervised Learning Maria-Florina Balcan 11/05/2018 Clustering, Informal Goals Goal: Automatically partition unlabeled data into groups of similar datapoints. Question: When and why would
More informationUniversity of Florida CISE department Gator Engineering. Clustering Part 5
Clustering Part 5 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville SNN Approach to Clustering Ordinary distance measures have problems Euclidean
More informationarxiv: v1 [physics.data-an] 27 Sep 2007
Classification of Interest Rate Curves Using Self-Organising Maps arxiv:0709.4401v1 [physics.data-an] 27 Sep 2007 M.Kanevski a,, M.Maignan b, V.Timonin a,1, A.Pozdnoukhov a,1 a Institute of Geomatics and
More informationBasic Data Mining Technique
Basic Data Mining Technique What is classification? What is prediction? Supervised and Unsupervised Learning Decision trees Association rule K-nearest neighbor classifier Case-based reasoning Genetic algorithm
More informationCSE 40171: Artificial Intelligence. Learning from Data: Unsupervised Learning
CSE 40171: Artificial Intelligence Learning from Data: Unsupervised Learning 32 Homework #6 has been released. It is due at 11:59PM on 11/7. 33 CSE Seminar: 11/1 Amy Reibman Purdue University 3:30pm DBART
More informationData Mining Concepts & Techniques
Data Mining Concepts & Techniques Lecture No. 03 Data Processing, Data Mining Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology
More informationAnalytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.
Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied
More informationINF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering
INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Erik Velldal & Stephan Oepen Language Technology Group (LTG) September 23, 2015 Agenda Last week Supervised vs unsupervised learning.
More informationClustering. Unsupervised Learning
Clustering. Unsupervised Learning Maria-Florina Balcan 03/02/2016 Clustering, Informal Goals Goal: Automatically partition unlabeled data into groups of similar datapoints. Question: When and why would
More informationWorking with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan
Working with Unlabeled Data Clustering Analysis Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan chanhl@mail.cgu.edu.tw Unsupervised learning Finding centers of similarity using
More informationUnsupervised Learning : Clustering
Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/28/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.
More informationData Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science
Data Mining Dr. Raed Ibraheem Hamed University of Human Development, College of Science and Technology Department of Computer Science 2016 201 Road map What is Cluster Analysis? Characteristics of Clustering
More informationCOMP 551 Applied Machine Learning Lecture 13: Unsupervised learning
COMP 551 Applied Machine Learning Lecture 13: Unsupervised learning Associate Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551
More informationClustering Part 3. Hierarchical Clustering
Clustering Part Dr Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Hierarchical Clustering Two main types: Agglomerative Start with the points
More informationLecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 7 Introduction to Data Mining, nd Edition by Tan, Steinbach, Karpatne, Kumar What is Cluster Analysis? Finding groups
More informationUnsupervised Learning I: K-Means Clustering
Unsupervised Learning I: K-Means Clustering Reading: Chapter 8 from Introduction to Data Mining by Tan, Steinbach, and Kumar, pp. 487-515, 532-541, 546-552 (http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf)
More informationClustering & Classification (chapter 15)
Clustering & Classification (chapter 5) Kai Goebel Bill Cheetham RPI/GE Global Research goebel@cs.rpi.edu cheetham@cs.rpi.edu Outline k-means Fuzzy c-means Mountain Clustering knn Fuzzy knn Hierarchical
More informationKeywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.
Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Survey of Clustering
More informationData Mining. Kohonen Networks. Data Mining Course: Sharif University of Technology 1
Data Mining Kohonen Networks Data Mining Course: Sharif University of Technology 1 Self-Organizing Maps Kohonen Networks developed in 198 by Tuevo Kohonen Initially applied to image and sound analysis
More informationClustering Part 4 DBSCAN
Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of
More informationClustering part II 1
Clustering part II 1 Clustering What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods 2 Partitioning Algorithms:
More informationCOSC 6397 Big Data Analytics. Fuzzy Clustering. Some slides based on a lecture by Prof. Shishir Shah. Edgar Gabriel Spring 2015.
COSC 6397 Big Data Analytics Fuzzy Clustering Some slides based on a lecture by Prof. Shishir Shah Edgar Gabriel Spring 215 Clustering Clustering is a technique for finding similarity groups in data, called
More informationCSE4334/5334 DATA MINING
CSE4334/5334 DATA MINING Lecture 4: Classification (1) CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai Li (Slides courtesy
More informationPreprocessing Short Lecture Notes cse352. Professor Anita Wasilewska
Preprocessing Short Lecture Notes cse352 Professor Anita Wasilewska Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept
More informationUnsupervised Learning. Pantelis P. Analytis. Introduction. Finding structure in graphs. Clustering analysis. Dimensionality reduction.
March 19, 2018 1 / 40 1 2 3 4 2 / 40 What s unsupervised learning? Most of the data available on the internet do not have labels. How can we make sense of it? 3 / 40 4 / 40 5 / 40 Organizing the web First
More informationData Informatics. Seon Ho Kim, Ph.D.
Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu Clustering Overview Supervised vs. Unsupervised Learning Supervised learning (classification) Supervision: The training data (observations, measurements,
More informationClustering COMS 4771
Clustering COMS 4771 1. Clustering Unsupervised classification / clustering Unsupervised classification Input: x 1,..., x n R d, target cardinality k N. Output: function f : R d {1,..., k} =: [k]. Typical
More informationNetwork Traffic Measurements and Analysis
DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,
More informationAPPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE
APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE Sundari NallamReddy, Samarandra Behera, Sanjeev Karadagi, Dr. Anantha Desik ABSTRACT: Tata
More informationDS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li
Welcome to DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li Time: 6:00pm 8:50pm Thu Location: AK 232 Fall 2016 High Dimensional Data v Given a cloud of data points we want to understand
More informationKapitel 4: Clustering
Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.
More informationBased on Raymond J. Mooney s slides
Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit
More informationComputational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions
Computational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions Thomas Giraud Simon Chabot October 12, 2013 Contents 1 Discriminant analysis 3 1.1 Main idea................................
More informationData Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality
Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation = noisy: containing
More informationAcquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.
Summary Statistics Acquisition Description Exploration Examination what data is collected Characterizing properties of data. Exploring the data distribution(s). Identifying data quality problems. Selecting
More informationINF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering
INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Murhaf Fares & Stephan Oepen Language Technology Group (LTG) September 27, 2017 Today 2 Recap Evaluation of classifiers Unsupervised
More informationData Mining. Moustafa ElBadry. A thesis submitted in fulfillment of the requirements for the degree of Bachelor of Arts in Mathematics
Data Mining Moustafa ElBadry A thesis submitted in fulfillment of the requirements for the degree of Bachelor of Arts in Mathematics Department of Mathematics and Computer Science Whitman College 2016
More information