CLUSTERING ALGORITHMS


Number of possible clusterings

Let $X = \{x_1, x_2, \ldots, x_N\}$. Question: in how many ways can the N points be assigned into m groups? Answer: the Stirling number of the second kind,

$$S(N, m) = \frac{1}{m!} \sum_{i=0}^{m} (-1)^{m-i} \binom{m}{i} i^N$$

Examples: S(15, 3) = 2,375,101; S(20, 4) = 45,232,115,901; S(100, 5) ≈ 10^68.

Recursion: $S(N, m) = m \, S(N-1, m) + S(N-1, m-1)$
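
To make the counting concrete, here is a minimal Python sketch (the function names stirling2_closed and stirling2_rec are ours, purely for illustration) implementing both the closed form and the recursion; the two agree on the examples above:

    from math import comb, factorial

    def stirling2_closed(n, m):
        # S(n, m) = (1/m!) * sum_{i=0}^{m} (-1)^(m-i) * C(m, i) * i^n
        # The sum is always divisible by m!, so integer division is exact.
        return sum((-1) ** (m - i) * comb(m, i) * i ** n
                   for i in range(m + 1)) // factorial(m)

    def stirling2_rec(n, m):
        # S(n, m) = m * S(n-1, m) + S(n-1, m-1), with S(0, 0) = 1
        if m == 0:
            return 1 if n == 0 else 0
        if n == 0:
            return 0
        return m * stirling2_rec(n - 1, m) + stirling2_rec(n - 1, m - 1)

    print(stirling2_closed(15, 3))   # 2375101
    print(stirling2_rec(15, 3))      # 2375101
    print(stirling2_closed(20, 4))   # 45232115901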

A way out: consider only a small fraction of all possible clusterings of X and select a sensible one among them. Question 1: which fraction of the clusterings is considered? Question 2: what does "sensible" mean? The answers depend on the specific clustering algorithm and on the criteria adopted.

MAJOR CATEGORIES OF CLUSTERING ALGORITHMS

- Sequential: a single clustering is produced, using one or a few sequential passes on the data.
- Hierarchical: a sequence of (nested) clusterings is produced.
  - Agglomerative
  - Divisive
  - Combinations of the above (e.g., the Chameleon algorithm)
- Cost function optimization: in most cases a single clustering is obtained.
- Other schemes.

SEQUENTIAL CLUSTERING ALGORITHMS

The common traits shared by these algorithms are:
- One or very few passes on the data are required.
- The number of clusters is not known a priori, except (possibly) for an upper bound q.
- The clusters are defined with the aid of:
  - an appropriately defined distance d(x, C) of a point from a cluster, and
  - a threshold Θ associated with this distance.

Basic Sequential Clustering Algorithm (BSAS)

    m = 1                          // number of clusters
    C_m = {x_1}
    For i = 2 to N
        Find C_k : d(x_i, C_k) = min_{1 <= j <= m} d(x_i, C_j)
        If (d(x_i, C_k) > Θ) AND (m < q) then
            m = m + 1
            C_m = {x_i}
        Else
            C_k = C_k ∪ {x_i}
            Where necessary, update the cluster representative (*)
        End {If}
    End {For}

(*) When the mean vector $m_C$ is used as the representative of a cluster C with $n_C$ elements, the update in the light of a newly assigned vector x becomes $m_C^{new} = (n_C m_C + x) / (n_C + 1)$.
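
As a concrete illustration, the following is a minimal Python sketch of BSAS, assuming Euclidean distance to mean-vector representatives; the function and parameter names (bsas, theta, q) are ours, not part of the original statement:

    import numpy as np

    def bsas(X, theta, q):
        # Basic sequential scheme: a single pass over the data.
        # X: (N, d) array; theta: dissimilarity threshold; q: max clusters.
        labels = np.empty(len(X), dtype=int)
        means = [X[0].astype(float)]       # representative of the first cluster
        counts = [1]
        labels[0] = 0
        for i in range(1, len(X)):
            d = [np.linalg.norm(X[i] - m) for m in means]
            k = int(np.argmin(d))
            if d[k] > theta and len(means) < q:
                means.append(X[i].astype(float))   # x_i starts a new cluster
                counts.append(1)
                labels[i] = len(means) - 1
            else:
                # assign x_i to C_k and update its mean: (n*m + x) / (n + 1)
                means[k] = (counts[k] * means[k] + X[i]) / (counts[k] + 1)
                counts[k] += 1
                labels[i] = k
        return labels, means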

Remarks:
- The order in which the data are presented plays an important role in the clustering results: different presentation orders may lead to totally different clusterings, in terms of both the number of clusters and the clusters themselves.
- In BSAS the decision for a vector x is reached prior to the final cluster formation.
- BSAS performs a single pass on the data; its complexity is O(N).
- If clusters are represented by point representatives, compact clusters are favored.

Estimating the number of clusters in the data set: let BSAS(Θ) denote the BSAS algorithm run with dissimilarity threshold Θ.

    For Θ = a to b step c
        Run BSAS(Θ) s times, each time presenting the data in a different order.
        Estimate the number of clusters m_Θ as the most frequent number
        resulting from the s runs of BSAS(Θ).
    Next Θ

Plot m_Θ versus Θ and identify the number of clusters m as the one corresponding to the widest flat region of this graph.

[Figure: plot of m_Θ versus Θ; the widest flat region indicates the number of clusters.]
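
A sketch of this threshold sweep in the same vein, reusing the bsas() sketch above (s, the number of random presentation orders per threshold, is a free parameter; the loop structure and names are ours):

    import numpy as np

    def estimate_num_clusters(X, thetas, s=10, q=None, seed=0):
        # For each threshold, run BSAS s times on shuffled data and keep
        # the most frequent cluster count; plot the result versus theta
        # and read off the widest flat region.
        rng = np.random.default_rng(seed)
        q = q if q is not None else len(X)
        m_theta = []
        for theta in thetas:
            counts = []
            for _ in range(s):
                order = rng.permutation(len(X))
                _, means = bsas(X[order], theta, q)
                counts.append(len(means))
            m_theta.append(max(set(counts), key=counts.count))  # mode
        return m_theta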

MBSAS, a modification of BSAS

In BSAS a decision for a data vector x is reached prior to the final cluster formation, which is determined only after all vectors have been presented to the algorithm. MBSAS deals with this drawback, at the cost of presenting the data to the algorithm twice. MBSAS consists of:
- A cluster determination phase (first pass on the data), which is the same as BSAS except that no vector is assigned to an already formed cluster. At the end of this phase, each cluster consists of a single element.
- A pattern classification phase (second pass on the data), where each unassigned vector is assigned to its closest cluster.
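
A two-pass Python sketch of MBSAS along the same lines, again assuming Euclidean distance and mean-vector representatives (names are illustrative):

    import numpy as np

    def mbsas(X, theta, q):
        # Pass 1 (cluster determination): only create new clusters;
        # no vector joins an already formed cluster.
        means, counts = [X[0].astype(float)], [1]
        labels = np.full(len(X), -1)
        labels[0] = 0
        for i in range(1, len(X)):
            d = min(np.linalg.norm(X[i] - m) for m in means)
            if d > theta and len(means) < q:
                means.append(X[i].astype(float))
                counts.append(1)
                labels[i] = len(means) - 1
        # Pass 2 (pattern classification): assign the remaining vectors
        # to their closest cluster, updating the means incrementally.
        for i in range(len(X)):
            if labels[i] == -1:
                d = [np.linalg.norm(X[i] - m) for m in means]
                k = int(np.argmin(d))
                means[k] = (counts[k] * means[k] + X[i]) / (counts[k] + 1)
                counts[k] += 1
                labels[i] = k
        return labels, means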

Remarks:
- In MBSAS, the decision for a vector x during the pattern classification phase is reached taking all clusters into account.
- MBSAS is sensitive to the order of presentation of the vectors.
- MBSAS requires two passes on the data; its complexity is O(N).

The maxmin algorithm

Let W be the set of all points that have been chosen to form clusters up to the current iteration step. The formation of clusters is carried out as follows:

    For each x ∈ X − W, determine d_x = min_{z ∈ W} d(x, z)
    Determine y: d_y = max_{x ∈ X − W} d_x
    If d_y is greater than a prespecified threshold then
        this vector forms a new cluster
    else
        the cluster determination phase of the algorithm terminates
    End {If}

(*) After the formation of the clusters, each unassigned vector is assigned to its closest cluster, as in MBSAS.
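
A Python sketch of the cluster determination phase of maxmin (the helper name maxmin_seeds is ours, and starting from the first point is a choice the slides leave open; the subsequent classification phase proceeds as in MBSAS):

    import numpy as np

    def maxmin_seeds(X, theta):
        # Repeatedly pick the point whose distance to its nearest chosen
        # representative is largest; stop when that max-min distance falls
        # below the threshold.
        W = [0]                          # indices of chosen representatives
        while len(W) < len(X):
            rest = [i for i in range(len(X)) if i not in W]
            # d_x = min_{z in W} d(x, z) for every remaining point x
            d = {i: min(np.linalg.norm(X[i] - X[w]) for w in W) for i in rest}
            y = max(d, key=d.get)        # the max-min point
            if d[y] > theta:
                W.append(y)              # x_y forms a new cluster
            else:
                break                    # determination phase terminates
        return W   # afterwards, classify unassigned vectors as in MBSAS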

Remarks:
- The maxmin algorithm is more computationally demanding than MBSAS, but it is expected to produce better clustering results.