Data Exploration with PCA and Unsupervised Learning with Clustering Paul Rodriguez, PhD PACE SDSC

Size: px

Start display at page:

Download "Data Exploration with PCA and Unsupervised Learning with Clustering Paul Rodriguez, PhD PACE SDSC"

Marlene Thomas
5 years ago
Views:

1 Data Exploration with PCA and Unsupervised Learning with Clustering Paul Rodriguez, PhD PACE SDSC

2 Clustering Idea Given a set of data can we find a natural grouping? Essential R commands: D =rnorm(12,0,1) #generate 12 #random normal X1 =matrix(d,6,2) #put into 6x2 matrix X1[,1]=X1[,1]+4; #shift center X1[,2]=X1[,2]+2; X 2 #repeat for another set of points #bind data points and plot plot(rbind(x1,x2), xlim=c(-10,10),ylim=c(-10,10)); X 1

3 Why Clustering A good grouping implies some structure In other words, given a good grouping, we can then: Interpret and label clusters Identify important features Characterize new points by the closest cluster (or nearest neighbors) Use the cluster assignments as a compression or summary of the data

4 Clustering Objective Objective: find subsets that are similar within cluster and dissimilar between clusters Similarity defined by distance measures Euclidean distance Manhattan distance Mahalanobis (Euclidean w/dimensions rescaled by variance)

5 Kmeans Clustering A simple, effective, and standard method Start with K initial cluster centers Loop: Assign each data point to nearest cluster center Calculate mean of cluster for new center Stop when assignments don t change Issues: How to choose K? How to choose initial centers? Will it always stop?

6 Kmeans Example For K=1, using Euclidean distance, where will the cluster center be? X 2 X 1

7 Kmeans Example For K=1, the overall mean minimizes Sum Squared Error (SSE), aka Euclidean distance Essential R commands: Kresult = kmeans(x,1,10,1) #choose 1 data point as initial K centers #10 is max loop iterations #1 is number of initial sets to try #Kresult is an R object with subfields Kresult$cluster #cluster assignments Kresults$tot.withinss # tot within SSE

8 Kmeans Example K=1 K=2 Essential R commands: inds=which(kresult$cluster==k) plot(x[inds,],col2use= red ); K=3 K=4

9 Kmeans Example K=1 K=2 Essential R commands: inds=which(kresult$cluster==k) plot(x[inds,],col2use= red ); K=3 K=4 As K increases individual points get a cluster

10 Choosing K for Kmeans Total Within Cluster SSE Essential R commands: for (num_k in 1:10) { Kres=kmeans(X,num_k,10,1); Save and then plot Kres$tot.withinss K=1 to 10 - Not much improvement after K=2 ( elbow )

11 Kmeans Example more points How many clusters should there be?

12 Choosing K for Kmeans Total Within Cluster SSE K=1 to 10 - Smooth decrease at K 2, harder to choose - In general, smoother decrease => less structure

13 Kmeans Guidelines Choosing K: Elbow in total-within-cluster SSE as K=1 N Cross-validation: hold out points, compare fit as K=1 N Choosing initial starting points: take K random data points, do several Kmeans, take best fit Stopping: may converge to sub-optimal clusters may get stuck or have slow convergence (point assignments bounce around), 10 iterations is often good

14 Kmeans Example uniform K=1 K=2 K=3 K=4

15 Choosing K - uniform Total Within Cluster SSE K=1 to 10 - Smooth decrease across K => less structure

16 Kmeans Clustering Issues Scale: Dimensions with large numbers may dominate distance metrics Outliers: Outliers can pull cluster mean, K-mediods uses median instead of mean

17 Soft Clustering Methods Fuzzy Clustering Use weighted assignments to all clusters Weights depend on relative distance Find min weighted SSE Expectation-Maximization: Mixture of multivariate Gaussian distributions Mixture weights are missing Find most likely means & variances, for the expectations of the data given the weights

18 Kmeans unequal cluster variance Can you guess K?

19 Kmeans unequal cluster variance

20 Choosing K unequal distributions Total Within Cluster SSE K=1 to 10 - Smooth decrease across K => less structure

21 EM clustering Selects K=2 (either by Information Criterion= min of SSE+ K*logN, Or by cross-validation) Handles unequal variance R: library( mclust ) em_fit=mclust(x); plot(em_fit);

22 Kmeans computations Distance of each point to each cluster center For N points, D dimensions: each loop requires N*D*K operations Update Cluster centers only track points that change, get change in cluster center But for EM errors to each cluster center update a probability function

23 Kmeans vs EM performance 1 Gordon compute node, normal random matrices R: system.time(mclust()) Wall Time (secs) 15 min 10 min EM: 1000 pts Number of Points (i.e. rows) Kmeans:1000 pts 1K 4K 8K Number of Dimensions (i.e. columns in data matrix)

24 Other distance measures Cosine of the angle between points (equals Euc. Distance normalized by vector lengths) A D C C is closer to A because ABCº<ABDº B Jaccard (over sets A,B): 1- ( A B / AUB ) means number of elments

25 Other distance measures Hamming distance: count 1 if values different e.g. appropriate for binary strings Total difference is 2

26 Exercise: Clustering Athlete s Data in Weka Is there a relationship between athlete s physical attributes and their sport? Let s filter data to try a few, good, candidate sport categories.

27 Weka not nice for picking out instances with multi-criteria, so I used excel: (Excel, data -> advanced -> click on column and then drop down selection menu. Include only a few sports) Download AHW_1_filteredsports.csv from pace.sdsc.edu, open in weka

28 Notice the number of sports and their nominal values

29 Visualize correlation with sport as class any problems, any promising relationships?

30 Are there one or more variables we should ignore?

31 Select Ignore attributes -> ctrl-space to select Total and Sport. Should we ignore sex?

32 select cluster tab; choose -> simple Kmeans ; choose 5 classes, accept other defaults

33 Results end up in output panel. Take note of sum squared error. Any other message to note?

34 Rt click on model result, visualize cluster assignment Compare sport and cluster number how do they correspond? Are the clusters useful to distinguish (or predict) an athlete s sport?

35 Try rerunning with different random seed, you get different clustering, but same SSE

36 What is the relationship of sex variable to cluster and sport assignments?

37 Try EM algorithm, accept defaults and start

38 Rt click on model result, visualize cluster assignment, Compare sport and cluster number

39 Rerun with 5 clusters, compare sport and cluster number

40 Exercise: For simple Kmeans Get a full sweep of K ie change N to 1,3,4,5, what K would you choose?

41 Earth System Sciences Example An Investigation Using SiroSOM for the Analysis of QUEST Stream-Sediment and Lake-Sediment Geochemical Data (Fraser & Hodgkinson 2009) at (e.g. Quest_Out_Data.csv) Summary: identify patterns and relationships anomalous data promote resource assessment and exploration Data: 15,000 rows x 42 measures (and metadata)

42 Data location and spreadsheet

43 N=15,000 x P=42 correlation plot In R: corrplot(cor(d2use))

44 Standard Deviation of Columns Standard Deviation In R: corrplot(cor(d2use)) sdl=lapply(d2use,sd);

45 Kmeans SSE as K=1..30 D=read.table('QUEST_out_data.csv',header=T,sep=",") D2use=D[,11:dim(D)[2]-1] res_win_bet=matrix(0,30,2); for (num_k in 1:30){ kres=kmeans(d2use,num_k,iter.max=10,nstart=1); res_win_bet[num_k,1:2]=c(kres$tot.withinss,kres$betweenss);} plot(1:num_k,res_win_bet[,1],pch=8,main="sse") K=1 30; Authors used 20

46 K=20, heat map of centroids Minerals R: install.packages('fields') library('fields') dev.new() plot.new() image.plot(kres$clusters) dev.print(file= paste( centers.jpeg"), device=jpeg,width=600) K=1 20

$cols2use=colors()[seq(21,631,30)]; for (i in 1:num_k) { class_inds=which(kres$cluster==i);$

47 Plotting in Physical Space Kmeans cluster assignments #get distinct color map values cols2use=colors()[seq(21,631,30)]; for (i in 1:num_k) { class_inds=which(kres$cluster==i); plot(d[class_inds,1:2],col=cols2use[i],xlim=xl, ylim=yl);}

49 Pause

50 Incremental & Hierarchical Clustering Start with 1 cluster (all instances) and do splits OR Start with N clusters (1 per instance) and do merges Can be greedy & expensive in its search some algorithms might merge & split algorithms need to store and recalculate distances Need distance between groups in constrast to K-means

51 Incremental & Hierarchical Clustering Result is a hierarchy of clusters displayed as a dendrogram tree Useful for tree-like interpretations syntax (e.g. word co-occurences) concepts (e.g. classification of animals) topics (e.g. sorting Enron s) spatial data (e.g. city distances) genetic expression (e.g. possible biological networks) exploratory analysis

52 Incremental & Hierarchical Clustering Clusters are merged/split according to distance or utility measure Euclidean distance (squared differences) conditional probabilities (for nominal features) Options to choose which clusters to Link single linkage, mean, average (w.r.t. points in clusters) (may lead to different trees, depending on spreads) Ward method (smallest increase within cluster variance) change in probability of features for given clusters

53 Linkage options e.g. single linkage (closest to any cluster instance) Cluster1 Cluster2 e.g. mean (closest to mean of all cluster instances) Cluster1 Cluster2

54 Linkage options (cont ) e.g. average (mean of pairwise distances) Cluster1 Cluster2 e.g. Ward s method (find new cluster with min. variance) Cluster1 Cluster2

55 Hierarchical Clustering Demo 3888 Interactions among 685 proteins From Hu et.al. TAP dataset b0009 b b0009 b b0014 b b0014 b b0014 b b0014 b b0014 b b0015 b R: d=read.table("core_interactions.txt"); #Create adjacency matrix, for each interaction create an edge C =matrix(0,p,p); #Connection matrix (aka Adjacency matrix) IJ =cbind(d[,1],d[,2]) #factor level not text is saved as Nx2 for (i in 1:N) { C[IJ[i,1],IJ[i,2]]=1; }

56 Interactions as connections structure is hard to see install.packages('igraph') library('igraph') gc=graph.adjacency(c,mode="directed") plot(gc,vertex.size=3,edge.arrow.size=0,vertex.label=na)

57 Hierarchical Clustering Demo hclust with single distance: chaining d2use=dist(c,method="binary") fit <- hclust(d2use, method= single") plot(fit) the cluster distance when 2 are combined Items that cluster first

58 Hierarchical Clustering Demo hclust with Ward distance: spherical clusters the cluster distance when 2 are combined

59 Hierarchical Clustering Demo Where height change looks big, cut off tree groups <- cutree(fit, k=7) rect.hclust(fit, k=7, border="red")

60 Hierarchical Clustering Demo Kmeans vs Hierarchical: Total Within Cluster Sum Sqrd Error Number of K R: emfit=mclust(c_mncntr); # 9 clusters were selected

61 Hierarchical Clustering Demo Kmeans (k=7) vs Hierarchical: Lots of overlap despite that Kmeans not have binary distance option Kmeans cluster assigment Hierarchical Group Assigment R: groups <- cutree(fit, k=7) ; Kresult=kmeans(C,7,10,1); table(kresult$cluster,groups)

62 Principle Components vs Clustering PCA reduces dimensions, Clustering reduces to categorical groups In some cases, k PCs k clusters

63 Principle Components vs Kmeans & Hierarchical Total Variance of Data On Principle Components Number of Principle Components R: C_mncntr=scale(C,center=TRUE,scale=F) P=princomp(C_mncntr); #columns of P$loadings are eigenvectors plot(p$sdev[1:50])

64 pause

65 Microbiome exploration metagenomic data on microbes in gut for diseased and normals many small values so use cut off ( std>.004)

66 Biome subjects correlation Crohn s Ulcer Colitis normals

67 Biome Subject Hierarchical Clustering

68 Clustering Test EM clustering selected 4 classes Kmeans 2-4 classes was reasonble Actual vs EM class assignments (EM class 2,3,4 isolate some subjects Actual\EM emres=mclust(xbt2) table(t,emres$classification)

69 Actual Classes on first 2 PCs

70 EM classes on first 2 PCs

71 Summary Having no label doesn t stop you from finding structure in data Unsupervised methods are somewhat related

72 End

Hierarchical Clustering 4/5/17

Hierarchical Clustering 4/5/17 Hypothesis Space Continuous inputs Output is a binary tree with data points as leaves. Useful for explaining the training data. Not useful for making new predictions. Direction