Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms

Stats 170A: Project in Data Science. Exploratory Data Analysis: Clustering Algorithms. Padhraic Smyth, Department of Computer Science, Bren School of Information and Computer Sciences, University of California, Irvine

Assignment 5: Refer to the Wiki page. Due noon on Monday, February 12th, to the EEE dropbox. Note: due before class (by 2 pm). Questions?

What is Exploratory Data Analysis? EDA = {visualization, clustering, dimension reduction, ...}. For small numbers of variables, EDA = visualization. For large numbers of variables, we need to be cleverer: clustering, dimension reduction, and embedding algorithms. These are techniques that essentially reduce high-dimensional data to something we can look at. Today's lecture: finish up visualization, then an overview of clustering algorithms.

Tufte's Principles of Visualization. Graphical excellence is the well-designed presentation of interesting data: a matter of substance, of statistics, and of design. It consists of complex ideas communicated with clarity, precision, and efficiency. It is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space. And it requires telling the truth about the data.

Different Ways of Presenting the Same Data. From Karl Broman, via www.cs.princeton.edu/

Principle of Proportional Ink (or How to Lie with Visualization)

Potentially Misleading Scales on the X-axis

Example: Visualization of Napoleon's 1812 March. Illustrates the size of the army, direction, location, temperature, and date, all on one chart.

Data Journalism. From the New York Times, Feb 2, 2018.

Exploratory Data Analysis: Clustering

Example: Clustering Vectors in a 2-Dimensional Space. Each point (or 2-d vector) represents a document. (Scatter plot of documents in the x_1-x_2 plane.)

Example: Possible Clusters. (The same scatter plot in the x_1-x_2 plane, with two groups of points marked Cluster 1 and Cluster 2.)

Example: How Many Clusters? (The same scatter plot in the x_1-x_2 plane, now with three candidate groups marked Cluster 1, Cluster 2, and Cluster 3.)

Cluster Structure in Real-World Data: 1500 subjects, two measurements per subject. (Scatter plot of signal C versus signal T; figure from Prof. Zhaoxia Yu, Statistics Department, UC Irvine.)

(The same scatter plot of signal C versus signal T, now with the three groups labeled CC, CT, and TT. Figure from Prof. Zhaoxia Yu, Statistics Department, UC Irvine.)

Issues in Clustering. Representation: how do we represent our examples as data vectors? Distance: how do we want to define distance between vectors? Algorithm: what type of algorithm do we want to use to search for clusters, and what is its time and space complexity? Number of clusters: how many clusters do we want? There is no right answer to these questions in general; it depends on the application.

Cluster Analysis vs. Classification. Cluster analysis: the data are unlabeled, the number of clusters is unknown, and learning is unsupervised; the goal is to find unknown structure. Classification: the labels for the training data are known, the number of classes is known, and learning is supervised; the goal is to allocate new observations, whose labels are unknown, to one of the known classes.

Clustering: The K-Means Algorithm

Notation. We have N documents, and we represent each document as a vector of T terms (e.g., counts or tf-idf). The vector for the ith document is x_i = (x_i1, x_i2, ..., x_ij, ..., x_iT), i = 1, ..., N. In the document-term matrix, x_ij is the entry in the ith row and jth column; rows correspond to documents and columns correspond to terms. We can think of our documents as points in a T-dimensional space, with clusters as clouds of points.
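As a concrete illustration (not from the slides), here is a minimal sketch of building a document-term matrix in Python with scikit-learn; the three toy documents are hypothetical, and the tf-idf weighting is just one of the options mentioned above.

```python
# Minimal sketch: build an N x T document-term matrix with tf-idf weights.
# The example documents are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)            # sparse matrix, shape (N, T)

print(X.shape)                                # (3, number of distinct terms)
print(vectorizer.get_feature_names_out())     # the T terms (columns)
```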

The K-Means Clustering Algorithm. Input: N vectors x_1, ..., x_N of dimension D, and K = the number of clusters (K > 1). Output: K cluster centers c_1, ..., c_K, each a vector of dimension D; equivalently, a list of cluster assignments (values 1 to K) for each of the N input vectors. Note: in K-means each input vector x is assigned to one and only one cluster k, i.e., one cluster center c_k. The K-means algorithm therefore partitions the N data vectors into K disjoint groups.

Example of K-Means Output with 2 Clusters. (Scatter plot in the x_1-x_2 plane: blue circles are documents, red circles are the cluster centers c_1 and c_2 of Cluster 1 and Cluster 2.)

Squared Error Distance. Consider two vectors, each with T components (i.e., of dimension T): x = (x_1, x_2, ..., x_T) and y = (y_1, y_2, ..., y_T). A common distance metric is the squared error distance: d_E(x, y) = Σ_{j=1..T} (x_j - y_j)². In two dimensions, the square root of this is the usual notion of spatial distance, i.e., Euclidean distance.
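A short sketch of this distance in Python (numpy is my assumption; the slides do not specify tooling):

```python
# Squared error distance d_E(x, y) = sum_j (x_j - y_j)^2 between two T-dimensional vectors.
import numpy as np

def squared_error_distance(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum((x - y) ** 2)

# In two dimensions, the square root of this is ordinary Euclidean distance.
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
print(squared_error_distance(x, y))           # 25.0
print(np.sqrt(squared_error_distance(x, y)))  # 5.0
```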

Squared Errors and Cluster Centers. Squared error (distance) between a data point x and a cluster center c: dist[x, c] = Σ_j (x_j - c_j)², where the index j runs over the D components/dimensions of the vectors (this is squared Euclidean distance). Total squared error between a cluster center c_k and the N_k points assigned to that cluster: S_k = Σ_i dist[x_i, c_k], where the sum is over the N_k points assigned to cluster k. Total squared error summed across all K clusters: SSE = Σ_k S_k.

K-Means Objective Function. K-means: minimize the total squared error, i.e., find the K cluster centers c_k, and the assignments, that minimize SSE = Σ_k S_k = Σ_k (Σ_i dist[x_i, c_k]). In seeking to minimize SSE, K-means places cluster centers strategically to "cover" the data; this is similar to data compression (and K-means is in fact used in data compression algorithms).
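For concreteness, here is a small sketch (my own, using numpy) of how the SSE objective can be computed from a set of data points, cluster centers, and assignments:

```python
# Total sum of squared errors (SSE) for a given set of centers and assignments.
import numpy as np

def total_sse(X, centers, assignments):
    """X: (N, D) data; centers: (K, D); assignments: length-N array of values in 0..K-1."""
    diffs = X - centers[assignments]   # difference between each point and its assigned center
    return float(np.sum(diffs ** 2))   # SSE = sum over clusters of within-cluster squared error
```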

K-Means Algorithm. Random initialization: select the initial K centers randomly from the N input vectors, or assign each of the N vectors randomly to one of the K clusters. Iterate: (1) Assignment step: assign each of the N input vectors to its closest mean. (2) Update the K mean vectors: compute each updated center as the average of the vectors assigned to that cluster, new c_k = (1/N_k) Σ_i x_i, where the sum is over the N_k points assigned to cluster k. Convergence check: did any points get reassigned? If yes, return to the Iterate step; if no, terminate.
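The following is a minimal, unoptimized sketch of these steps in numpy (my own illustration, not the course's reference implementation); it initializes the centers from randomly chosen data points and stops when no assignment changes.

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Basic K-means: X is an (N, D) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    centers = X[rng.choice(N, size=K, replace=False)].astype(float)  # random initialization
    assignments = np.full(N, -1)
    for _ in range(max_iters):
        # Assignment step: each point goes to its closest center (squared Euclidean distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # shape (N, K)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: each center becomes the mean of the points assigned to it.
        for k in range(K):
            if np.any(assignments == k):
                centers[k] = X[assignments == k].mean(axis=0)
    return centers, assignments
```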

Pseudocode for the K-Means Algorithm. From Chapter 16 of Manning, Raghavan, and Schütze.

Example of K-Means Clustering. (A sequence of scatter plots of the data in two dimensions, Dimension 1 versus Dimension 2: the original data, then the cluster assignments after Iteration 1 (mean squared error = 3.45), Iteration 2 (MSE = 1.93), Iteration 3 (MSE = 1.25), and Iteration 5, at which point the algorithm has converged (MSE = 1.21).)

K-means (figure/slide from Andrew Moore, CMU): 1. Pick the number of clusters (e.g., K = 5). 2. Randomly guess K cluster center locations. 3. Each data point finds out which center it is closest to. 4. Each center finds the centroid of the points it owns. 5. New centers => new boundaries. 6. Repeat until no change.

The Iris Data. Collected by R. A. Fisher; a famous early data set in multivariate data analysis. Four features: sepal length (cm), sepal width (cm), petal length (cm), petal width (cm). Three different species: Setosa, Versicolor, Virginica.

K-Means Clustering on the Iris Data
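A sketch of how this might be reproduced (assuming scikit-learn and pandas are available; the slides do not specify the tooling): run K-means with K = 3 on the four Iris measurements and cross-tabulate the resulting cluster labels against the known species.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import pandas as pd

iris = load_iris()
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris.data)

# Compare clusters to species: Setosa typically separates cleanly,
# while Versicolor and Virginica overlap somewhat.
print(pd.crosstab(km.labels_, iris.target,
                  rownames=["cluster"], colnames=["species"]))
```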

K-Means for Image Compression
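The idea here is to cluster the pixel colors of an image into K representative colors and replace each pixel with its cluster center. A hedged sketch (the image file name and the choice of K = 16 are hypothetical, and PIL and scikit-learn are assumptions):

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("photo.jpg"))              # (height, width, 3) RGB array
pixels = img.reshape(-1, 3).astype(float)              # one row per pixel

K = 16                                                  # compress to 16 colors
km = KMeans(n_clusters=K, n_init=4, random_state=0).fit(pixels)

# Replace each pixel by the center of the color cluster it belongs to.
compressed = km.cluster_centers_[km.labels_].reshape(img.shape).astype(np.uint8)
Image.fromarray(compressed).save("photo_16_colors.png")
```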

An Example of Data Where K-Means Does Not Work Well. (Two panels: the ideal clustering of the data in 2 dimensions, versus the K-means clustering result with K = 2.)

From: http://scikit-learn.org/stable/modules/clustering.html#

Properties of the K-Means Algorithm. Time complexity: with N = number of data points, K = number of clusters, and D = dimension of the data points (number of variables), K-means takes O(N K D) time per iteration. This is good: linear in each input parameter. Does K-means always find a global minimum, i.e., the set of K centers that minimizes the SSE? No: it always converges to *some* local minimum, but not necessarily the best one; the result depends on the starting point chosen. One can prove that the SSE on each iteration must either decrease or not change (in which case we have converged). [Think about how you might prove this.]
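Because the result depends on the starting point, a standard workaround is to run the algorithm from several random initializations and keep the solution with the lowest SSE (scikit-learn's KMeans does this via its n_init parameter). A sketch, reusing the kmeans() function from the earlier sketch:

```python
import numpy as np

def kmeans_with_restarts(X, K, n_restarts=10):
    """Run K-means from several random starts and keep the lowest-SSE solution."""
    best_sse, best = np.inf, None
    for seed in range(n_restarts):
        centers, assignments = kmeans(X, K, seed=seed)      # kmeans() as sketched earlier
        sse = float(np.sum((X - centers[assignments]) ** 2))
        if sse < best_sse:
            best_sse, best = sse, (centers, assignments)
    return best
```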

Summary of K-Means. Input: N vectors. Output: K clusters, each represented by a cluster mean (a vector); each data point is assigned to its closest cluster center. Strengths: fast (time complexity is O(N D K), i.e., linear in N, D, and K) and simple to implement. Weaknesses: not guaranteed to find the best solution (the global minimum of the SSE); assumes a fixed number of clusters K; uses Euclidean distance, which is not necessarily ideal.

Number of Clusters? Generally there is no right answer; it depends on the application. We can think of clustering as a type of data compression technique: as K, the number of clusters, grows, we compress the data better, e.g., lower overall squared error. But this does not mean larger K is always better: the larger the value of K, the harder it is for humans to understand the clustering results. Options? Pick a value of K based on intuition/heuristics, e.g., a relatively small K (say K = 5 or 10) if we are showing the results to a human. Or evaluate different values of K, if we have some ground truth for evaluation, and select the best value of K using the task-specific evaluation measure.
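One common way to inspect different values of K is to plot the SSE as a function of K and look for an "elbow" where the improvement levels off; the SSE itself always decreases as K grows, so it cannot be used alone to pick K. A sketch on synthetic data (scikit-learn assumed; inertia_ is scikit-learn's name for the total within-cluster SSE):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)   # synthetic example data

for K in range(2, 11):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    print(K, round(km.inertia_, 1))   # SSE keeps dropping; look for where it levels off
```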

Hierarchical Clustering Algorithms

(Figure: the Iris data with the three species Setosa, Versicolor, and Virginica labeled.)

Hierarchical Clustering. The number of clusters is not required in advance. It gives a tree-based representation of the observations, called a dendrogram. Each leaf represents an observation; the leaves most similar to each other are merged first, then the internal nodes most similar to each other are merged, and the process continues recursively until all nodes are merged at the root node.
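A small sketch of building and plotting a dendrogram with scipy (an assumption; the slides do not name a library), on synthetic two-dimensional data. The method argument selects the linkage criterion discussed below (single, complete, average, ...):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),     # two synthetic groups of points
               rng.normal(5, 1, (20, 2))])

Z = linkage(X, method="complete")             # bottom-up (agglomerative) merging
dendrogram(Z)                                 # tree of merges; leaves are observations
plt.show()
```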

Basic Concept of Hierarchical Clustering. (Diagram: five points a, b, c, d, e are merged step by step; first a with b, then d with e, then c with {d, e}, and finally all five into a single cluster.) Merge data points, and then clusters, in a bottom-up fashion, until all data points are in one cluster. Requires that we can define a distance/similarity between sets of points.

Simple Example of Hierarchical Clustering. (Scatter plot of points in two dimensions, Dimension 1 versus Dimension 2.)

Complete-link clustering of Reuters news stories. Figure from Chapter 17 of Manning, Raghavan, and Schütze.

Distance between Two Branches/Clusters: single linkage, complete linkage, average linkage, and many other options.

Complexity of Hierarchical Clustering. Time complexity (N = number of documents, T = dimensionality): time to compute all pairwise distances is O(N²T); time to create the tree is O(N³); so the overall time complexity is O(N³ + N²T). Space complexity is O(N²). This is a significant weakness of hierarchical clustering: it scales poorly in N. One practical option is to first run K-means with, e.g., K = 20 or 100 or 500 clusters, and then hierarchically cluster the clusters from K-means (see the sketch below).
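A sketch of that two-stage idea (my own illustration on synthetic data, with scikit-learn and scipy assumed): run K-means to get a moderate number of centers, cluster those centers hierarchically, and then map the result back to the original points.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import linkage, fcluster

X, _ = make_blobs(n_samples=20000, centers=10, random_state=0)   # synthetic data

# Stage 1: K-means with a moderate K to summarize the data by 100 centers.
km = KMeans(n_clusters=100, n_init=4, random_state=0).fit(X)

# Stage 2: hierarchical clustering on the 100 centers (cheap, since 100 << N).
Z = linkage(km.cluster_centers_, method="average")
center_labels = fcluster(Z, t=10, criterion="maxclust")   # cut the tree into 10 groups

# Map each original point to the group of its K-means center.
point_labels = center_labels[km.labels_]
```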

Automatically Clustering Languages in Linguistics

Hierarchical clustering based on user votes for favorite beers, using the centroid method. From data.ranker.com.

Heat-Map Representation (human data)

Discovering Structure from a Heat Map of Brain Network Data. From https://seaborn.pydata.org/examples/structured_heatmap.html
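The linked seaborn example uses a clustered heat map: hierarchical clustering is run on both the rows and the columns, and the heat map is reordered so that similar rows and columns end up adjacent. A minimal sketch on synthetic stand-in data (seaborn assumed, not specified by the slides):

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(30, 12)))    # synthetic stand-in for the real matrix

# clustermap clusters rows and columns hierarchically and reorders the heat map.
sns.clustermap(data, method="average", metric="euclidean")
```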

Summary of Clustering Algorithms. Clustering is used for exploring data; it can answer questions such as: are there subgroups? Different clustering algorithms: K-means is simple, fast, and easy to interpret, but it tends to find circular clusters, can fail on complex structure, and fixes the number of clusters K ahead of time. Hierarchical agglomerative clustering produces a tree of clusters (a dendrogram) and does not fix the number of clusters, but its computational complexity is high and it does not scale well to large N. Clustering is useful for exploration, but one should be careful: there is no gold standard to compare against, and different methods can give different results.

Assignment 5: Refer to the Wiki page. Due noon on Monday, February 12th, to the EEE dropbox. Note change: due before class (by 2 pm).