Clustering 2. Nov 3, 2008

HAC Algorithm Start with all objects in their own cluster. Until there is only one cluster: among the current clusters, determine the two clusters, c_i and c_j, that are most similar, and replace c_i and c_j with a single merged cluster c_i ∪ c_j. To compute the distance between two clusters: Single Link: distance between the two closest members of the clusters; Complete Link: distance between the two furthest members; Average Link: average distance between all pairs of members, one from each cluster.
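As a concrete illustration of these three linkage criteria, here is a minimal sketch using SciPy's agglomerative clustering routines; the toy array X and the choice of cutting the tree into two flat clusters are assumptions made for this example, not part of the slides.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two loose groups (assumed for illustration only).
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])

# Build the merge tree with each linkage criterion from the slide.
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                     # each row of Z: (cluster a, cluster b, merge distance, size)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 flat clusters
    print(method, labels)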

Comments on HAC HAC is a fast tool that often provides interesting views of a dataset. HAC can primarily be viewed as an intuitively appealing clustering procedure for data analysis/exploration. We can create clusterings of different granularity by stopping at different levels of the dendrogram. HAC is often used together with a visualization of the dendrogram to decide how many clusters exist in the data.

HAC creates a Dendrogram The dendrogram draws the tree such that the height of a branch equals the distance D(c_i, c_j) between the two clusters merged at that step. The merge distances are always monotonically increasing. This can provide some understanding of how many natural groups there are in the data: a drastic height change indicates that we are merging two very different clusters together, so it may be a good stopping point.
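To make the "drastic height change" heuristic concrete, the sketch below continues the hypothetical SciPy example above: it inspects the merge heights stored in the linkage matrix and cuts the tree just below the largest jump. The variable names and the reuse of the toy data X are assumptions for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

Z = linkage(X, method="average")       # X: toy data from the earlier sketch
heights = Z[:, 2]                      # merge distances, monotonically increasing
gaps = np.diff(heights)                # height change between consecutive merges
cut = heights[gaps.argmax()] + 1e-9    # cut just below the largest jump
labels = fcluster(Z, t=cut, criterion="distance")
print("number of clusters at the knee:", labels.max())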

Flat vs. hierarchical clustering Hierarchical clustering generates nested clusterings. Flat clustering generates a single partition of the data without resorting to a hierarchical procedure. Representative example: K-means clustering.

Mean Squared Error Objective Assume instances are real-valued vectors. Given a clustering of the data into k clusters c_1, ..., c_k, we can compute the centroid μ_i of each cluster. If we use the mean of each cluster to represent all data points in the cluster, our mean squared error (MSE) is $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{k}\sum_{x \in c_i}\|x - \mu_i\|^2$. One possible objective of clustering is to find a clustering that minimizes this error. Note that this is a combinatorial optimization problem: it is difficult to find the exact solution, and it is commonly solved using an iterative approach.
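As a small illustration of this objective, the sketch below computes the MSE of a given assignment with NumPy; the toy data, the labels, and the normalization by n follow the formula above and are assumptions made for the example.

import numpy as np

def mse(X, labels):
    # Mean squared error of representing each point by its cluster centroid.
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        mu = members.mean(axis=0)                # centroid of cluster c
        total += ((members - mu) ** 2).sum()     # squared distances to the centroid
    return total / len(X)

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])
print(mse(X, labels))   # 0.25: every point lies 0.5 from its cluster mean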

Basic idea We assume that the number of desired clusters, k, is given. Randomly choose k examples as seeds, one per cluster. Form initial clusters based on these seeds. Iterate by repeatedly reallocating instances to different clusters to improve the overall clustering. Stop when the clustering converges or after a fixed number of iterations.

K-means algorithm (MacQueen 1967) Input: data to be clustered and the desired number of clusters k. Output: k mutually exclusive clusters that cover all examples.
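The slide gives only the input/output specification, so the following is a minimal NumPy sketch of the procedure described in the Basic idea slide above (random seeds, reassign, recompute centroids, stop on convergence or after a fixed number of iterations); the function name and default arguments are assumptions.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random seeds, one per cluster
    for _ in range(max_iter):
        # Reassign: every point joins the cluster of its nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Re-center: each centroid becomes the mean of its cluster (kept unchanged if the cluster is empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged: no centroid moved
            break
        centroids = new_centroids
    return labels, centroids

Running kmeans(X, k=2) on a small two-group dataset reproduces the sequence of reassign/re-center steps walked through in the example frames below.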

K-Means Example (K=2): a sequence of animation frames illustrating the algorithm. Pick seeds; reassign clusters; compute centroids; reassign clusters; compute centroids; reassign clusters; converged!

Monotonicity of K-means Monotonicity Property: each iteration of K-means strictly decreases the MSE until convergence. The following lemma is key to the proof. Lemma: given a finite set C of data points, the value of μ that minimizes the MSE $\sum_{x \in C} \|x - \mu\|^2$ is the mean $\mu = \frac{1}{|C|}\sum_{x \in C} x$.
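The slides state the lemma without its derivation; one standard way to see it (added here for completeness, not taken from the slides) is to set the gradient of the objective with respect to μ to zero:

\[
\frac{\partial}{\partial \mu} \sum_{x \in C} \|x - \mu\|^2
  \;=\; \sum_{x \in C} -2\,(x - \mu) \;=\; 0
  \quad\Longrightarrow\quad
  |C|\,\mu \;=\; \sum_{x \in C} x
  \quad\Longrightarrow\quad
  \mu \;=\; \frac{1}{|C|} \sum_{x \in C} x .
\]

Since the objective is a convex quadratic in μ, this stationary point is its unique minimizer.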

Proof of monotonicity Given a current set of clusters c_1, ..., c_k with their means μ_1, ..., μ_k, the MSE is given by $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{k}\sum_{x \in c_i}\|x - \mu_i\|^2$. Consider the reassignment step: since each point is only reassigned if it is closer to some other cluster's mean than to its current one, the reassignment step can only decrease the MSE. Consider the re-center step: from the lemma we know that μ_i minimizes the distortion of c_i, which implies that the resulting MSE is again decreased. Combining the two, we know that K-means always decreases the MSE.

K-means properties K-means always converges in a finite number of steps, and it typically converges very fast (often in fewer iterations than the number of points). Time complexity: assume computing the distance between two instances is O(d), where d is the dimensionality of the vectors. Reassigning clusters: O(kn) distance computations, i.e. O(knd). Computing centroids: each instance vector gets added once to some centroid: O(nd). Assume these two steps are each done once for each of I iterations: O(Iknd). This is linear in all relevant factors, so, assuming a fixed number of iterations, K-means is more efficient than O(n²) HAC.

More Comments K-means is highly sensitive to the initial seeds. This is because the MSE has many locally minimal solutions, i.e., solutions that cannot be improved by locally reassigning any particular points.

Solutions Run multiple trials and choose the one with the best MSE; this is typically done in practice. Heuristics: try to choose initial centers to be far apart, e.g. using furthest-first traversal: start with a random initial center, set the second center to be the point furthest from the first center, the third center to be the point furthest from the first two centers, and so on (a sketch is given below). One can also initialize with the results of another clustering method and then apply K-means.
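Here is a minimal NumPy sketch of the furthest-first traversal heuristic just described; interpreting "furthest from the first two centers" as maximizing the distance to the nearest already-chosen center is an assumption spelling out the slide's wording, and the function name is hypothetical.

import numpy as np

def furthest_first_centers(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                        # first center: a random data point
    for _ in range(k - 1):
        # Distance from every point to its nearest already-chosen center.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])                          # next center: the point furthest from all chosen centers
    return np.array(centers)

In practice these centers would replace the random seeds in the K-means sketch above.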

Even more comments K-means is exhaustive: it clusters every data point, with no notion of an outlier. Outliers may cause problems. Why? Because outliers strongly impact the cluster centers. Alternative: K-medoids, where instead of computing the mean of each cluster, we find the medoid of each cluster, i.e., the data point that is on average closest to the other objects in the cluster.
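As a sketch of the medoid idea (an illustrative addition, not the full K-medoids algorithm), the following picks the member of a cluster with the smallest average distance to the other members; the toy cluster with one outlier is assumed for the example.

import numpy as np

def medoid(points):
    # The point with the smallest average distance to the other points.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)   # pairwise distances
    return points[d.mean(axis=1).argmin()]

cluster = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]])    # one outlier
print(medoid(cluster))   # [0. 1.], a central member, while the mean [2.75 2.75] is dragged toward the outlier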

Deciding k: a model selection problem What if we don't know how many clusters there are in the data? Can we use the MSE to decide k, by choosing the k that gives the smallest MSE? No: we will always favor larger k values. Any quick solutions? Find the knee of the MSE-vs-k curve (see the sketch below). We will see some other model selection techniques later.
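To illustrate the "find the knee" heuristic, the sketch below runs the kmeans and mse functions from the earlier sketches over a range of k values and looks at where the improvement in MSE flattens out; the blob data and the "largest drop in improvement" rule are assumptions, one simple choice among several.

import numpy as np

rng = np.random.default_rng(0)
# Toy data: three well-separated blobs (assumed for illustration).
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in ([0, 0], [5, 5], [0, 5])])

errors = []
for k in range(1, 8):
    labels, _ = kmeans(X, k)          # kmeans() from the earlier sketch
    errors.append(mse(X, labels))     # mse() from the earlier sketch

improvement = -np.diff(errors)                                  # how much each additional cluster helps
knee = int(np.argmax(improvement[:-1] - improvement[1:])) + 2   # k after which the gains flatten most
print("MSE per k:", np.round(errors, 3))
print("knee at k =", knee)   # typically 3 here; a single run can land in a poor local minimum,
                             # which is exactly why the multiple-trials advice above matters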