Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms Padhraic Smyth Department of Computer Science Bren School of Information and Computer Sciences University of California, Irvine
Assignment 5 Refer to the Wiki page Due noon on Monday February 12th to EEE dropbox Note: due before class (by 2pm) Questions? Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 2
What is Exploratory Data Analysis? EDA = {visualization, clustering, dimension reduction, ...} For small numbers of variables, EDA = visualization For large numbers of variables, we need to be cleverer: clustering, dimension reduction, embedding algorithms These are techniques that essentially reduce high-dimensional data to something we can look at Today's lecture: Finish up visualization Overview of clustering algorithms
Tufte's Principles of Visualization Graphical excellence: is the well-designed presentation of interesting data; is a matter of substance, of statistics, and of design; consists of complex ideas communicated with clarity, precision, and efficiency; is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space; requires telling the truth about the data
Different Ways of Presenting the Same Data From Karl Broman, via www.cs.princeton.edu/
Principle of Proportional Ink (or How to Lie with Visualization)
Potentially Misleading Scales on the X-axis
Example: Visualization of Napoleon's 1812 March Illustrates size of army, direction, location, temperature, and date, all on one chart
Data Journalism From the New York Times, Feb 2, 2018
Exploratory Data Analysis: Clustering
Example: Clustering Vectors in a 2-Dimensional Space Each point (a 2-dimensional vector) in the (x1, x2) plane represents a document
Example: Possible Clusters Scatterplot in the (x1, x2) plane showing two groups, Cluster 1 and Cluster 2
Example: How many Clusters? The same scatterplot, now showing three groups: Cluster 1, Cluster 2, and Cluster 3
Cluster Structure in Real-World Data Scatterplot of 1500 subjects, two measurements per subject (signal T vs. signal C) Figure from Prof Zhaoxia Yu, Statistics Department, UC Irvine
The same scatterplot with points labeled by genotype (CC, CT, TT) Figure from Prof Zhaoxia Yu, Statistics Department, UC Irvine
Issues in Clustering Representation: How do we represent our examples as data vectors? Distance: How do we want to define distance between vectors? Algorithm: What type of algorithm do we want to use to search for clusters? What is the time and space complexity of the algorithm? Number of Clusters: How many clusters do we want? No right answer to these questions in general; it depends on the application
Cluster Analysis vs Classification Cluster analysis: data are unlabeled; the number of clusters is unknown; unsupervised learning; goal is to find unknown structures Classification: the labels for the training data are known; the number of classes is known; supervised learning; goal is to allocate new observations, whose labels are unknown, to one of the known classes
Clustering: The K-Means Algorithm
Notation N documents Represent each document as a vector of T terms (e.g., counts or tf-idf) The vector for the ith document is: x_i = (x_i1, x_i2, ..., x_ij, ..., x_iT), i = 1, ..., N Document-Term matrix: x_ij is the entry in the ith row, jth column; columns correspond to terms, rows correspond to documents We can think of our documents as being in a T-dimensional space, with clusters as clouds of points
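As a concrete illustration, here is a minimal sketch of building a document-term count matrix in pure Python; the three example documents (and hence the vocabulary) are hypothetical, not from the lecture:

```python
# Minimal sketch of a document-term count matrix (hypothetical documents).
docs = ["the cat sat", "the dog sat", "the cat and the dog"]

# Vocabulary: the T distinct terms across all N documents, in sorted order.
terms = sorted({w for d in docs for w in d.split()})

# N x T matrix: X[i][j] = count of term terms[j] in document i.
X = [[d.split().count(t) for t in terms] for d in docs]
```

In practice one would use a library vectorizer (and often tf-idf weights rather than raw counts), but the resulting matrix has exactly this rows-are-documents, columns-are-terms shape.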
The K-Means Clustering Algorithm Input: N vectors x_1, ..., x_N of dimension D; K = number of clusters (K > 1) Output: K cluster centers, c_1, ..., c_K, each center a vector of dimension D (Equivalently) A list of cluster assignments (values 1 to K) for each of the N input vectors Note: In K-means each input vector x is assigned to one and only one cluster k, or cluster center c_k The K-means algorithm partitions the N data vectors into K disjoint groups
Example of K-Means Output with 2 Clusters Scatterplot in the (x1, x2) plane: blue circles are documents, red circles are the cluster centers c_1 and c_2 of Cluster 1 and Cluster 2
Squared Error Distance Consider two vectors, each with T components (i.e., dimension T): x = (x_1, x_2, ..., x_T), y = (y_1, y_2, ..., y_T) A common distance metric is squared error distance: d_E(x, y) = Σ_{j=1}^{T} (x_j - y_j)^2 In two dimensions the square root of this is the usual notion of spatial distance, i.e., Euclidean distance
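The two distances above can be written directly; a minimal Python sketch:

```python
import math

# Squared-error distance d_E(x, y) = sum_j (x_j - y_j)^2 for T-dimensional vectors.
def squared_error(x, y):
    return sum((xj - yj) ** 2 for xj, yj in zip(x, y))

# Euclidean distance is the square root of the squared-error distance.
def euclidean(x, y):
    return math.sqrt(squared_error(x, y))
```

For example, squared_error([0, 0], [3, 4]) is 25 and euclidean([0, 0], [3, 4]) is 5.0, the familiar 3-4-5 right triangle.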
Squared Errors and Cluster Centers Squared error (distance) between a data point x and a cluster center c: dist[x, c] = Σ_j (x_j - c_j)^2, where the sum is over the D components/dimensions of the vectors
Total squared error between a cluster center c_k and all N_k points assigned to that cluster: S_k = Σ_i dist[x_i, c_k], where the sum is over the N_k points assigned to cluster k
Total squared error summed across the K clusters: SSE = Σ_k S_k
K-means Objective Function K-means: minimize the total squared error, i.e., find the K cluster centers c_k, and assignments, that minimize SSE = Σ_k S_k = Σ_k ( Σ_i dist[x_i, c_k] ) K-means seeks the cluster centers such that the sum of squared errors is smallest; it will place cluster centers strategically to cover the data This is similar to data compression (and is in fact used in data compression algorithms)
K-Means Algorithm Random initialization: Select the initial K centers randomly from the N input vectors Or, assign each of the N vectors randomly to one of the K clusters Iterate: Assignment step: Assign each of the N input vectors to its closest center Update step: Compute the updated centers (K of them), each the average value of the vectors assigned to that cluster: new c_k = (1/N_k) Σ_i x_i, where the sum is over the N_k points assigned to cluster k Convergence: Did any points get reassigned? No: terminate (converged) Yes: return to the Iterate step
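The initialize/assign/update/converge loop above can be sketched in pure Python. This is a minimal illustration, not the lecture's reference implementation, and the toy data at the end is hypothetical:

```python
import random

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-means on a list of equal-length numeric vectors."""
    rng = random.Random(seed)
    # Random initialization: pick k distinct input vectors as the initial centers.
    centers = [list(v) for v in rng.sample(X, k)]
    assign = [None] * len(X)
    for _ in range(iters):
        # Assignment step: each point goes to its closest center (squared error).
        new_assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centers[c])))
            for x in X
        ]
        if new_assign == assign:  # no point was reassigned: converged
            break
        assign = new_assign
        # Update step: each center becomes the mean of its assigned points.
        for c in range(k):
            members = [x for x, a in zip(X, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign

# Toy usage: two well-separated groups (hypothetical data).
data = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers, labels = kmeans(data, 2)
```

On this toy data the loop recovers the two groups and places the centers at their means, [0.0, 0.5] and [10.0, 10.5], regardless of which two points are chosen as initial centers.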
Pseudocode for the K-means Algorithm From Chapter 16 in Manning, Raghavan, and Schutze
Example of K-Means Clustering Original Data (scatterplot, Dimension 1 vs Dimension 2)
Example of K-Means Clustering Iteration 1: Mean Squared Error = 3.45
Example of K-Means Clustering Iteration 2: Mean Squared Error = 1.93
Example of K-Means Clustering Iteration 3: Mean Squared Error = 1.25
Example of K-Means Clustering Iteration 5 (converged): Mean Squared Error = 1.21
K-means 1. Pick number of clusters (e.g. K=5) 2. Randomly guess K cluster Center locations 3. Each datapoint finds out which Center it's closest to. 4. Each Center finds the centroid of the points it owns 5. New Centers => new boundaries 6. Repeat until no change Figure/slide from Andrew Moore, CMU
The Iris Data Collected by R.A. Fisher A famous early data set in multivariate data analysis Four features: sepal length in cm, sepal width in cm, petal length in cm, petal width in cm Three different species: Setosa, Versicolor, Virginica
K-Means Clustering on the Iris Data
K-Means for Image Compression
An Example of Data where K-Means does not work well Ideal Clustering of Data in 2 Dimensions
An Example of Data where K-Means does not work well Ideal Clustering of Data in 2 Dimensions vs. K-means Clustering Result, K = 2
From: http://scikit-learn.org/stable/modules/clustering.html#
Properties of the K-Means Algorithm Time complexity? N = number of data points, K = number of clusters, D = dimension of data points (number of variables) O(N K D) time per iteration This is good: linear time in each input parameter Does K-means always find a global minimum, i.e., the set of K centers that minimizes the SSE? No: it always converges to *some* local minimum, but not necessarily the best; this depends on the starting point chosen We can prove that the SSE on each iteration must either decrease, or not change (in which case we have converged) [Think about how you might prove this]
Summary of K-means Input: N vectors Output: K clusters Each cluster is represented by a cluster mean (a vector) Assigns each data point to its closest cluster center Strengths: Fast: time complexity is O(N D K), i.e., linear in N, D, and K Simple to implement Weaknesses: Not guaranteed to find the best solution (the global minimum of the SSE) Assumes a fixed number of clusters K Uses Euclidean distance, which is not necessarily ideal
Number of Clusters? Generally no right answer; it depends on the application We can think of clustering as a type of data compression technique: as K, the number of clusters, grows, we compress the data better, e.g., lower overall squared error But this does not mean larger K is always better: the larger the value of K, the harder it is for humans to understand the clustering results Options? Pick a value of K based on intuition/heuristics, e.g., relatively small K (e.g., K=5 or 10) if we are showing the results to a human Or evaluate different values of K if we have some ground truth for evaluation, and select the best value of K using the task-specific evaluation measure
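A small numeric illustration of the compression view: on toy data (hypothetical values), the SSE for two well-placed centers is much lower than for one, even though neither choice of K is "right" in itself:

```python
def sse(X, centers, assign):
    # Total sum of squared errors of points to their assigned centers.
    return sum(
        sum((a - b) ** 2 for a, b in zip(x, centers[c]))
        for x, c in zip(X, assign)
    )

# Two well-separated toy groups (hypothetical data).
X = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]

# K = 1: the single best center is the overall mean of the data.
mean = [sum(col) / len(X) for col in zip(*X)]
sse1 = sse(X, [mean], [0] * len(X))

# K = 2: one center per group, each at the group mean.
c2 = [[0.0, 0.5], [10.0, 10.5]]
sse2 = sse(X, c2, [0, 0, 1, 1])
```

Here sse1 = 201.0 and sse2 = 1.0: more clusters always allow a lower SSE, which is exactly why SSE alone cannot choose K.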
Hierarchical Clustering Algorithms
Dendrogram of the Iris species: Setosa, Virginica, Versicolor
Hierarchical Clustering The number of clusters is not required in advance Gives a tree-based representation of the observations, called a dendrogram Each leaf represents an observation The leaves most similar to each other are merged, then the internal nodes (clusters) most similar to each other are merged The process continues recursively until all nodes are merged at the root node
Basic Concept of Hierarchical Clustering Example with five points a, b, c, d, e: first a and b are merged and d and e are merged, then c joins {d, e}, and finally {a, b} merges with {c, d, e} Merge data points, and then clusters, in a bottom-up fashion, until all data points are in 1 cluster. Requires that we can define distance/similarity between sets of points
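The bottom-up merging can be sketched as follows (a minimal pure-Python illustration using single linkage and squared-error distance; the example points are hypothetical):

```python
def single_link_merges(points):
    """Bottom-up agglomerative clustering (single linkage).
    Returns the sequence of cluster pairs merged, until one cluster remains."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    # Squared-error distance between two individual points.
    d = lambda i, j: sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
    while len(clusters) > 1:
        # Single linkage: cluster distance = distance of the closest member pair.
        a, b = min(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: min(d(i, j) for i in clusters[pq[0]] for j in clusters[pq[1]]),
        )
        merges.append((sorted(clusters[a]), sorted(clusters[b])))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return merges

# Toy usage: four 1-d points; the two closest merge first (hypothetical data).
pts = [[0.0], [0.1], [1.0], [5.0]]
merge_order = single_link_merges(pts)
```

This naive version recomputes all pairwise cluster distances at every step, which is where the poor scaling discussed below comes from; swapping the inner min for max or a mean gives complete or average linkage.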
Simple Example of Hierarchical Clustering Scatterplot in two dimensions (Dimension 1 vs Dimension 2) with the corresponding dendrogram
Complete-link clustering of Reuters news stories Figure from Chapter 17 of Manning, Raghavan, and Schutze
Distance between Two Branches/Clusters Single linkage Complete linkage Average linkage Many other options
Complexity of Hierarchical Clustering Time complexity (N = number of docs, T = dimensionality): Time to compute all pairwise distances: O(N^2 T) Time to create the tree: O(N^3) Overall time complexity is O(N^3 + N^2 T) Space complexity: O(N^2) This is a significant weakness of hierarchical clustering: it scales poorly in N One practical option is to first run K-means with, e.g., K = 20 or 100 or 500 clusters, and then hierarchically cluster the resulting K-means clusters
Automatically Clustering Languages in Linguistics
Hierarchical Clustering based on user votes for favorite beers Based on centroid method From data.ranker.com
Heat-Map Representation (human data)
Discovering Structure from a HeatMap of Brain Network Data From https://seaborn.pydata.org/examples/structured_heatmap.html
Summary of Clustering Algorithms Used for exploring data Can answer questions such as: are there subgroups? Different clustering algorithms: K-means Simple, fast, easy to interpret Tends to find circular clusters, can fail on complex structure Number of clusters K is fixed ahead of time Hierarchical agglomerative clustering Produces a tree of clusters (dendrogram) Number of clusters is not fixed Computational complexity is high; does not scale well to large N Clustering is useful for exploration, but one should be careful: there is no gold standard to compare to, and many different methods can give different results
Assignment 5 Refer to the Wiki page Due noon on Monday February 12th to EEE dropbox Note change: due before class (by 2pm)