
Data Clustering. Algorithmic Thinking. Luay Nakhleh, Department of Computer Science, Rice University

Data clustering is the task of partitioning a set of objects into groups such that the similarity of objects within each group is higher than that of objects across groups.

Clustering helps to reveal hidden patterns in the data and to summarize large data sets. It has numerous applications in bioinformatics, image analysis, social networks, and more.

objects in the 2D space

clusters: points within a cluster are closer (more similar) to each other than to points in other clusters

The Problem
Input: A set P of points, a distance measure d, and a positive integer k (k ≤ |P|).
Output: A partition of P into k subsets C1, ..., Ck, each of which we call a cluster, such that the similarity of points within each cluster is higher than that across clusters.

To cluster the data, we need:
A distance measure (to quantify how similar or dissimilar two objects are)
An algorithm for clustering the data based on the distance measure

For data in the 2D space, an obvious distance measure is the Euclidean distance.
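A minimal sketch of this measure in Python (the function name euclidean_distance is our own choice, not from the slides):

import math

def euclidean_distance(p, q):
    # straight-line (Euclidean) distance between points p = (px, py) and q = (qx, qy)
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)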

Two Clustering Algorithms
Hierarchical clustering
k-means clustering

Hierarchical Clustering

Input: A set P of points, and a number of clusters k
Initialization: Put each point in a cluster by itself
Repeat until k clusters remain: Find the two closest clusters and merge them into one

Q: How do we find a closest pair of clusters?
A: In our case, using the closest-pair algorithm.
Q: But the closest-pair algorithm works on points, where each point is given by its (x,y) coordinates and the distance between two points is the Euclidean distance. What do we do about clusters?
A: Represent each cluster by its center, and take the distance between two clusters to be the distance between their centers.

Let (x_1, y_1), (x_2, y_2), ..., (x_m, y_m) be the coordinates of the m points in a cluster C. The center of cluster C is given by the point (x_0, y_0), where

x_0 = (x_1 + x_2 + ... + x_m) / m
y_0 = (y_1 + y_2 + ... + y_m) / m
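A one-function sketch of this computation in Python (the helper name center is ours):

def center(cluster):
    # center of a cluster: the averages of the x and y coordinates of its points
    m = len(cluster)
    x0 = sum(x for x, y in cluster) / m
    y0 = sum(y for x, y in cluster) / m
    return (x0, y0)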

The distance between two clusters is then the Euclidean distance between their centers.
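Putting the pieces together, here is a minimal sketch of the hierarchical procedure, reusing euclidean_distance and center from above. One assumption of ours: the closest pair of cluster centers is found by a simple quadratic scan rather than by the faster closest-pair algorithm mentioned earlier.

def hierarchical_clustering(points, k):
    # initialization: put each point in a cluster by itself
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # find the pair of clusters whose centers are closest (quadratic scan)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = euclidean_distance(center(clusters[i]), center(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge the two closest clusters...
        del clusters[j]                  # ...reducing the number of clusters by one
    return clusters

Recomputing centers inside the double loop keeps the sketch short but slow; caching each cluster's center between merges would be the natural optimization.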

Let's consider an example of 11 points (1, 2, ..., 11) that we want to group into 4 clusters.

[Figure: the 11 points plotted in the plane; both axes in cm.]

Each point is given by its (x,y) coordinates.

[Figure: the same plot with coordinates annotated; e.g., point 1 is at (1,6) and point 6 is at (6,1).]

(1) Form 11 clusters, C1, C2, ..., C11, each of which contains exactly one of the 11 points.

[Figure: each of the 11 points in its own cluster, C1 through C11.]

(2) Find the two closest clusters

[Figure: the two closest clusters highlighted.]

... and merge them into one (thus, the number of clusters would decrease by one after this step)

[Figure: after the first merge, 10 clusters remain.]

(3) Find the two closest clusters

[Figure: the next two closest clusters highlighted.]

... and merge them into one (thus, the number of clusters would decrease by one after this step)

[Figure: after the second merge, 9 clusters remain.]

(4) Find the two closest clusters and merge them into one (thus, the number of clusters would decrease by one after this step)

[Figure: after the third merge, 8 clusters remain.]

(5) Find the two closest clusters and merge them into one (thus, the number of clusters would decrease by one after this step)

[Figure: after the fourth merge, 7 clusters remain; points 6 and 8 now form one cluster (C68).]

(6) Find the two closest clusters and merge them into one (thus, the number of clusters would decrease by one after this step)

[Figure: after the fifth merge, 6 clusters remain; points 7 and 9 now form one cluster (C79).]

(7) Find the two closest clusters and merge them into one (thus, the number of clusters would decrease by one after this step)

[Figure: after the sixth merge, 5 clusters remain; points 6, 7, 8, and 9 now form one cluster (C6879).]

(8) Find the two closest clusters and merge them into one (thus, the number of clusters would decrease by one after this step)

[Figure: after the seventh merge, 4 clusters remain.]

(9) The pre-specified number of clusters (4) has been reached, so we return the four clusters, each as a set containing the points in it: {1,2,3,4,5} {6,7,8,9} {10} {11}

k-means Clustering

Input: A set P of points, a number of clusters k, and a number of iterations q
Initialize k centers
Iterate q times:
    Initialize k empty clusters
    Add each of the points in P to the cluster with the closest center
    Recompute each cluster's center using its new set of points
Return the set of k clusters
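A minimal sketch of this loop in Python, again reusing euclidean_distance and center from the hierarchical-clustering discussion. Two assumptions of ours, not from the slides: the k initial centers are simply the first k points, and an empty cluster keeps its old center rather than breaking the averaging step.

def k_means(points, k, q):
    centers = list(points[:k])  # assumption: the first k points serve as initial centers
    for _ in range(q):          # iterate q times
        clusters = [[] for _ in range(k)]  # k empty clusters
        for p in points:
            # add each point to the cluster with the closest center
            closest = min(range(k), key=lambda i: euclidean_distance(p, centers[i]))
            clusters[closest].append(p)
        # recompute each cluster's center from its new set of points
        # (an empty cluster keeps its previous center)
        centers = [center(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters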

Let's consider an example of 11 points (1, 2, ..., 11) that we want to group into 4 clusters.

[Figure: the 11 points plotted in the plane; both axes in cm.]

Each point is given by its (x,y) coordinates.

[Figure: the same plot with coordinates annotated; e.g., point 1 is at (1,6) and point 6 is at (6,1).]

First step: Choose center points for the four clusters

Four centers c1, c2, c3, c4, each with its own coordinates. [Figure: the four initial centers plotted among the 11 points, with their coordinates annotated.]

Second step: For each of the 11 points, find the closest center (of the four) to it.

[Figure: the four centers among the points; a line connects each point to its closest center.]

Third step: Form a cluster for each set of points that belong to the same center (thus, forming four clusters).

[Figure: the four clusters C1, C2, C3, C4, one formed around each center.]

Fourth step: For each cluster, recompute its center as shown previously (to get the x and y coordinates of a center, average the x coordinates and average the y coordinates, respectively, of all points that belong to the center's cluster).

[Figure: the recomputed centers; each center has moved to the average of the x and y coordinates of the points in its cluster.]

Now, repeat from the second step: For each point, find its closest center (of the new four centers), form four clusters, and continue...

After running the procedure for a (pre-specified) number of iterations, we return the four clusters, each as a set containing the points in it: {1,2,3,4,5} {6,7,8,9} {10} {11}