Clustering: Overview and K-means algorithm

Informal goal
Given a set of objects and a measure of similarity between them, group similar objects together.
(K-means illustrations thanks to 2006 student Martin Makowiecki.)
What do we mean by "similar"? What is a good grouping? There is a computation time / quality tradeoff.

General types of clustering
Soft versus hard clustering
- Hard: partition the objects; each object is in exactly one partition.
- Soft: assign a degree to which each object is in each cluster; view it as a probability or a score.
Flat versus hierarchical clustering
- Hierarchical = clusters within clusters.

Applications
Many: biology, astronomy, computer-aided design of circuits, information organization, marketing.

Clustering in information search and analysis
Group information objects: discover topics? Other groupings may be desirable.
Clustering versus classifying
- Classifying: have pre-determined classes with example members.
- Clustering: get groups of similar objects, with the added problem of labeling clusters by topic, e.g. by the common terms within a cluster of documents.

Example applications in search
Query evaluation: cluster pruning (Section 7.1.6)
- Cluster all documents and choose a representative for each cluster.
- Evaluate the query with respect to the cluster representatives.
- Evaluate the query for the documents in the cluster(s) having the most similar cluster representative(s).
(A small code sketch of this idea follows below.)
Results presentation: labeled clusters
- Cluster only the query results, e.g. Clusty.com (metasearch).
- Hard / soft? Flat / hierarchical?
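Purely as an illustration of the cluster-pruning idea above (not from the slides), here is a minimal Python sketch; the function and parameter names are hypothetical, and any similarity function, such as cosine similarity, can be plugged in:

```python
def cluster_pruned_search(query, reps, clusters, sim, top_clusters=1):
    """Evaluate the query only against documents in the cluster(s) whose
    representative is most similar to the query.

    reps         : one representative vector per cluster
    clusters     : clusters[i] is the list of document vectors represented by reps[i]
    sim          : similarity function, e.g. cosine similarity
    top_clusters : how many of the closest clusters to search
    """
    # Rank clusters by how similar their representative is to the query.
    ranked = sorted(range(len(reps)), key=lambda i: sim(query, reps[i]), reverse=True)
    # Score only the documents inside the most similar cluster(s).
    scored = []
    for i in ranked[:top_clusters]:
        for doc in clusters[i]:
            scored.append((sim(query, doc), doc))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```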

Issues
What attributes represent the items for clustering purposes?
What is the measure of similarity between items?
- General objects: a matrix of pairwise similarities.
- Objects with specific properties allow other specifications of the measure.
- Most common: objects are d-dimensional vectors, compared by Euclidean distance or cosine similarity (a small sketch of both follows below).
What is the measure of similarity between clusters?

Issues, continued
Cluster goals: number of clusters? Flat or hierarchical clustering? Cohesiveness of clusters?
How to evaluate cluster results? This relates to the measure of closeness between clusters.
Efficiency of clustering algorithms: large data sets may require external storage.
Maintain clusters in a dynamic setting?
Clustering methods? MANY!

General types of clustering methods
Agglomerative versus divisive algorithms
- Agglomerative = bottom-up: build up clusters from single objects.
- Divisive = top-down: break up the cluster containing all objects into smaller clusters.
Both agglomerative and divisive give hierarchies. The hierarchy can be trivial, adding one object per level:
  1  ( . . ) . . .
  2  ( ( . . ) . ) . .
  3  ( ( ( . . ) . ) . ) .
  4  ( ( ( ( . . ) . ) . ) . )

General types of clustering methods, continued
Constructive versus iterative improvement
- Constructive: decide in which cluster each object belongs and don't change it; often faster.
- Iterative improvement: start with a clustering and move objects around to see if the clustering can be improved; often slower but better.

Quality of clustering
In applications, the quality of a clustering depends on how well it solves the problem at hand.
An algorithm uses a measure of quality that can be optimized, but that measure may or may not do a good job of capturing the application's needs.
The underlying graph-theoretic problems are usually NP-complete, e.g. graph partitioning.
Usually the algorithm is not finding an optimal clustering.

Distance between clusters
Possible definitions:
I. The distance between the closest pair of objects, one in each cluster (called single link).
II. The distance between the furthest pair of objects, one from each cluster (called complete linkage).
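To make the two most common vector measures above concrete, here is a minimal sketch assuming NumPy; the example vectors are made up for illustration:

```python
import numpy as np

def euclidean_distance(a, b):
    # Smaller value = more similar; sensitive to vector length.
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    # Larger value = more similar; depends only on direction, not length.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 3-dimensional document vectors.
x = np.array([1.0, 2.0, 0.0])
y = np.array([2.0, 4.0, 0.0])   # same direction as x, twice as long

print(euclidean_distance(x, y))  # > 0: the points are apart in space
print(cosine_similarity(x, y))   # 1.0: identical direction
```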

Distance between clusters, continued
Possible definitions:
III. The average of the pairwise distances between all pairs of objects, one from each cluster (more computation).
Generally there is no representative point for a cluster; with Euclidean distance in a vector model, one can use the centroid or a bounding box.

K-means algorithm
Well known, well used.
Flat clustering; the number of clusters is picked ahead of time.
Iterative improvement.
Uses the notion of a centroid.
Typically uses Euclidean distance.

K-means overview
Choose k points among the set to cluster; call them the k centroids.
For each point not selected, assign it to its closest centroid. These assignments give the initial clustering.
Until happy, do:
- Recompute the centroids of the clusters (the new centroids may not be points of the original set).
- Reassign all points to their closest centroid; this updates the clusters.
(A code sketch of this loop follows below.)

(Illustrations: start by choosing centroids and clustering; recompute the centroids; re-cluster around the new centroids.)
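The loop above can be written out directly. The following is a minimal sketch, assuming NumPy, Euclidean distance, and a fixed number of iterations as the stopping criterion; it is an illustration, not an optimized implementation:

```python
import numpy as np

def kmeans(points, k, iterations=20, seed=0):
    """Basic K-means sketch: points is an (n, m) array, k the number of clusters."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Choose k of the points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Assign every point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its cluster's points.
        for i in range(k):
            members = points[labels == i]
            if len(members) > 0:          # leave an empty cluster's centroid unchanged
                centroids[i] = members.mean(axis=0)
    return labels, centroids
```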

(Illustrations continued: second recomputation of centroids and re-clustering; third and final recomputation and re-clustering.)

Details for K-means
Need a definition of the centroid: for the i-th cluster $C_i$ containing objects $x$,
  $c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$
  (this requires a notion of the sum of objects).
Need a definition of distance to (similarity to) the centroid.
Typically a vector model with Euclidean distance is used, minimizing the sum of squared distances of each point to its centroid, the Residual Sum of Squares:
  $\mathrm{RSS} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(c_i, x)^2$
(A code sketch of the centroid and RSS computation follows below.)

K-means performance
One can prove that RSS decreases with each iteration, so the algorithm converges.
It can reach a local optimum: no change in the centroids.
The running time depends on how demanding the stopping criteria are.
Works well in practice, in both speed and quality.

Time complexity of K-means
Let t_dist be the time to calculate the distance between two objects.
Each iteration has time complexity O(K*n*t_dist), where n is the number of objects.
Bounding the number of iterations by I gives O(I*K*n*t_dist).
For m-dimensional vectors this is O(I*K*n*m); m is large and the centroids are not sparse.

Space complexity of K-means
Store the points and the centroids; in the vector model: O((n + K)*m).
External algorithm versus internal? Store the K centroids in memory and run through the points from external storage each iteration.
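As a small illustration of the centroid and RSS definitions above, the following sketch (assuming NumPy and the cluster labels produced by a K-means run such as the one sketched earlier) computes both quantities:

```python
import numpy as np

def centroids_and_rss(points, labels, k):
    """Compute each cluster's centroid (the mean of its members) and the
    Residual Sum of Squares: the sum over all clusters of the squared
    distances of every member to its centroid."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    centroids = np.zeros((k, points.shape[1]))
    rss = 0.0
    for i in range(k):
        members = points[labels == i]
        if len(members) == 0:
            continue                                      # skip empty clusters
        centroids[i] = members.mean(axis=0)               # c_i = (1/|C_i|) * sum of x in C_i
        rss += ((members - centroids[i]) ** 2).sum()      # adds dist(c_i, x)^2 for each x
    return centroids, rss
```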

Choosing initial centroids
Bad initialization leads to poor results.
(Illustration: an optimal clustering versus a non-optimal clustering reached from bad seeds.)

Choosing initial centroids, continued
Many people have spent much time examining how to choose seeds.
- Random: fast and easy, but often poor results.
- Run random seeding multiple times and take the best result: slower, and still no guarantee of good results (a small sketch follows below).
- Pre-conditioning: remove outliers first.
- Choose seeds algorithmically: run hierarchical clustering on a sample of the points and use the resulting centroids as seeds; works well on small samples and for few initial centroids.

(Illustrations: non-globular clusters; the wrong number of clusters; outliers and empty clusters.)

Real cases tend to be harder
Different attributes of the feature vector can have vastly different scales, e.g. the size of a star versus its color.
Different features can be weighted, but how they are weighted greatly affects the outcome.
These difficulties can be overcome.
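One of the seeding strategies above, running random initialization several times and keeping the clustering with the lowest RSS, can be sketched as follows, assuming the hypothetical kmeans and centroids_and_rss helpers sketched earlier:

```python
def best_of_random_restarts(points, k, restarts=10):
    """Run K-means from several random seeds and keep the lowest-RSS result."""
    best = None
    for seed in range(restarts):
        labels, centroids = kmeans(points, k, seed=seed)
        _, rss = centroids_and_rss(points, labels, k)
        if best is None or rss < best[0]:
            best = (rss, labels, centroids)
    return best   # (rss, labels, centroids) of the best run
```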