Clustering: Overview and K-means algorithm

Similar documents
Clustering: Overview and K-means algorithm

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Information Retrieval and Organisation

Flat Clustering. Slides are mostly from Hinrich Schütze. March 27, 2017

University of Florida CISE department Gator Engineering. Clustering Part 2

Cluster Analysis. Ying Shen, SSE, Tongji University

CSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16)

Introduction to Information Retrieval

Introduction to Information Retrieval

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

PV211: Introduction to Information Retrieval

Text Documents clustering using K Means Algorithm

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Information Retrieval and Web Search Engines

Clustering Algorithms for general similarity measures

Clustering Results. Result List Example. Clustering Results. Information Retrieval

Unsupervised Learning : Clustering

CSE 5243 INTRO. TO DATA MINING

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

Introduction to Information Retrieval

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University

Clustering (COSC 416) Nazli Goharian. Document Clustering.

CSE 5243 INTRO. TO DATA MINING

Clustering Part 1. CSC 4510/9010: Applied Machine Learning. Dr. Paula Matuszek

MI example for poultry/export in Reuters. Overview. Introduction to Information Retrieval. Outline.

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22

Clustering CE-324: Modern Information Retrieval Sharif University of Technology

Information Retrieval and Web Search Engines

Introduction to Data Mining

Clustering and Visualisation of Data

CS47300: Web Information Search and Management

Clustering (COSC 488) Nazli Goharian. Document Clustering.

Lecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar

What to come. There will be a few more topics we will cover on supervised learning


9/17/2009. Wenyan Li (Emily Li) Sep. 15, Introduction to Clustering Analysis

Clustering. Partition unlabeled examples into disjoint subsets of clusters, such that:

Cluster Analysis: Basic Concepts and Algorithms

Data Informatics. Seon Ho Kim, Ph.D.

Administrative. Machine learning code. Supervised learning (e.g. classification) Machine learning: Unsupervised learning" BANANAS APPLES

Unsupervised Learning Partitioning Methods

Clustering Basic Concepts and Algorithms 1

Document Clustering: Comparison of Similarity Measures

Gene Clustering & Classification

High Dimensional Indexing by Clustering

Based on Raymond J. Mooney s slides

Cluster Evaluation and Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University

Clustering. Unsupervised Learning

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Clustering Part 2. A Partitional Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

Cluster Analysis: Basic Concepts and Algorithms

Note Set 4: Finite Mixture Models and the EM Algorithm

Unsupervised Learning. Supervised learning vs. unsupervised learning. What is Cluster Analysis? Applications of Cluster Analysis

Clustering CS 550: Machine Learning

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Slides From Lecture Notes for Chapter 8. Introduction to Data Mining

DD2475 Information Retrieval Lecture 10: Clustering. Document Clustering. Recap: Classification. Today

Introduction to Machine Learning CMU-10701

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms

Unsupervised Learning

Clustering. K-means clustering

CSE 494 Project C. Garrett Wolf

Clustering Algorithms. Margareta Ackerman

Clustering. Unsupervised Learning

CLUSTERING. Quiz information. today 11/14/13% ! The second midterm quiz is on Thursday (11/21) ! In-class (75 minutes!)

Clustering Part 3. Hierarchical Clustering

Clust Clus e t ring 2 Nov

Today s topic CS347. Results list clustering example. Why cluster documents. Clustering documents. Lecture 8 May 7, 2001 Prabhakar Raghavan

CSE 494/598 Lecture-11: Clustering & Classification

4. Ad-hoc I: Hierarchical clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

Clustering. Bruno Martins. 1 st Semester 2012/2013

K-Means. Oct Youn-Hee Han

Introduction to Mobile Robotics

Introduction to Clustering

Hierarchical Clustering 4/5/17

A Comparison of Document Clustering Techniques

Big Data Analytics! Special Topics for Computer Science CSE CSE Feb 9

CHAPTER 4: CLUSTER ANALYSIS

Data Mining Algorithms

Machine Learning using MapReduce

CSE 5243 INTRO. TO DATA MINING

DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/09/2018)

Unsupervised Learning and Clustering

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest)

CLUSTERING. CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16

Semi-Supervised Clustering with Partial Background Information

Clustering Lecture 3: Hierarchical Methods

Behavioral Data Mining. Lecture 18 Clustering

ECLT 5810 Clustering

CHAPTER 4 K-MEANS AND UCAM CLUSTERING ALGORITHM

CS 8520: Artificial Intelligence. Machine Learning 2. Paula Matuszek Fall, CSC 8520 Fall Paula Matuszek

Lesson 3. Prof. Enza Messina

Introduction to Computer Science

K-Means. Algorithmic Methods of Data Mining M. Sc. Data Science Sapienza University of Rome. Carlos Castillo

Cluster analysis. Agnieszka Nowak - Brzezinska

Clustering. Unsupervised Learning

CS 8520: Artificial Intelligence

Transcription:

Clustering: Overview and K-means algorithm Informal goal Given set of objects and measure of similarity between them, group similar objects together K-Means illustrations thanks to 2006 student Martin Makowiecki 1 What mean by similar? What is good grouping? Computation time / quality tradeoff 2 General types of clustering Applications: Soft versus hard clustering Hard: partition the objects each object in exactly one partition Soft: assign degree to which object in cluster view as probability or score flat versus hierarchical clustering hierarchical = clusters within clusters Many biology astronomy computer aided design of circuits information organization marketing 3 4 Clustering in information search and analysis Group information objects discover topics? other groupings desirable Clustering versus classifying classifying: have pre-determined classes with example members clustering: get groups of similar objects added problem of labeling clusters by topic e.g. common terms within cluster of docs. 5 Example applications in search Query evaluation: cluster pruning ( 7.1.6) cluster all documents choose representative for each cluster evaluate query w.r.t. cluster reps. evaluate query for docs in cluster(s) having most similar cluster rep.(s) Results presentation: labeled clusters cluster only query results e.g. Clusty.com (metasearch) hard / soft? flat / hier? 6 1

Issues What attributes represent items for clustering purposes? What is measure of similarity between items? General objects and matrix of pairwise similarities Objects with specific properties that allow other specifications of measure Most common: Objects are d-dimensional vectors» Euclidean distance» cosine similarity What is measure of similarity between clusters? Issues continued Cluster goals? Number of clusters? flat or hierarchical clustering? cohesiveness of clusters? How evaluate cluster results? relates to measure of closeness between clusters Efficiency of clustering algorithms large data sets => external storage Maintain clusters in dynamic setting? Clustering methods? - MANY! 7 8 Quality of clustering In applications quality of clustering depends on how well solves problem at hand Algorithm uses measure of quality that can be optimized, but that may or may not do a good job of capturing application needs. Underlying graph-theoretic problems usually NPcomplete e.g. graph partitioning Usually algorithm not finding optimal clustering 9 General types of clustering methods constructive versus iterative improvement constructive: decide in what cluster each object belongs and don t change often faster iterative improvement: start with a clustering and move objects around to see if can improve clustering often slower but better 10 Vector model: K- means algorithm Well known, well used Flat clustering Number of clusters picked ahead of time Iterative improvement Uses notion of centroid Typically uses Euclidean distance 11 K-means overview Choose k points among set to cluster Call them k centroids For each point not selected, assign it to its closest centroid All assignment give initial clustering Until happy do: Recompute centroids of clusters New centroids may not be points of original set Reassign all points to closest centroid Updates clusters 12 2

start: choose centroids and cluster recompute centroids 13 14 re-cluster around new centroids 2 nd recompute centroids and re-cluster 15 16 3rd (final) recompute and re-cluster Details for K-means Need definition of centroid c i = 1/ C i for i th cluster C i containing objects x x C i notion of sum of objects? Need definition of distance to (similarity to) centroid Typically vector model with Euclidean distance minimizing sum of squared distances of each point to its centroid = Residual Sum of Squares K RSS = dist(c i,x) 2 i=1 x C i 17 18 3

K-means performance Can prove RSS decreases with each iteration, so converge Can achieve local optimum No change in centroids Running time depends on how demanding stopping criteria Works well in practice speed quality Time Complexity of K-means Let t dist be the time to calculate the distance between two objects Each iteration time complexity: O(K*n*t dist ) n = number of objects Bound number of iterations I giving O(I*K*n*t dist ) for m-dimensional vectors: O(I*K*n*m) m large and centroids not sparse 19 20 Space Complexity of K-means Store points and centroids vector model: O((n + K)m) External algorithm versus internal? Space Complexity of K-means Store points and centroids vector model: O((n + K)m) External algorithm versus internal? store k centroids in memory run through points each iteration 21 22 Choosing Initial Centroids Bad initialization leads to poor results Optimal Not Optimal 23 Choosing Initial Centroids Many people spent much time examining how to choose seeds Random Fast and easy, but often poor results Run random multiple times, take best Slower, and still no guarantee of results Pre-conditioning remove outliers Choose seeds algorithmically run hierarchical clustering on sample points and use resulting centroids Works well on small samples and for few initial centroids 24 4

K-means weakness Non-globular clusters K-means weakness Wrong number of clusters 25 26 K-means weakness Outliers and empty clusters Real cases tend to be harder Different attributes of the feature vector have vastly different sizes size of star versus color Can weight different features how weight greatly affects outcome Difficulties can be overcome 27 28 5