What is Clustering? Clustering. Characterizing Cluster Methods. Clusters. Cluster Validity. Basic Clustering Methodology


Topics: unsupervised learning, generating classes, distance/similarity measures, agglomerative methods, divisive methods.

What is Clustering?
Clustering is a form of unsupervised learning: there is no information from a teacher. It is the process of partitioning a set of data into a set of meaningful (hopefully) sub-classes, called clusters. A cluster is a collection of data points that are similar to one another and collectively should be treated as a group, and that, as a collection, are sufficiently different from other groups.

Clusters
(figure: examples of clusters)

Characterizing Cluster Methods
- Class: the label applied by the clustering algorithm.
- Hard versus fuzzy: hard - a point either is or is not a member of a cluster; fuzzy - a point is a member of a cluster with some probability.
- Distance (similarity) measure: a value indicating how similar two data points are.
- Deterministic versus stochastic: deterministic - the same clusters are produced every time; stochastic - different clusters may result on different runs.
- Hierarchical: points are connected into clusters using a hierarchical structure.

Basic Clustering Methodology
Two approaches:
- Agglomerative: pairs of items/clusters are successively linked to produce larger clusters.
- Divisive (partitioning): items are initially placed in one cluster and successively divided into separate groups.

Cluster Validity
One difficult question: how good are the clusters produced by a particular algorithm? It is difficult to develop an objective measure. Some approaches:
- External assessment: compare the clustering to an a priori clustering.
- Internal assessment: determine whether the clustering is intrinsically appropriate for the data.
- Relative assessment: compare the results of one clustering method to those of another.

Basic Questions
- Data preparation: getting/setting up the data for clustering (extraction, normalization).
- Similarity/distance measure: how is the distance between points defined?
- Use of domain knowledge: prior knowledge can influence both data preparation and the choice of similarity/distance measure.
- Efficiency: how to construct the clusters in a reasonable amount of time.

The key to grouping points is distance (the inverse of similarity), often based on a representation of the objects as feature vectors.

An Employee DB

  ID  Gender  Age   Salary
  1   F       27    19,000
  2   M       51    64,000
  3   M       52   100,000
  4   F       33    55,000
  5   M       45    45,000

Term Frequencies for Documents

        T1  T2  T3  T4  T5  T6
  Doc1   0   4   0   0   0   2
  Doc2   3   1   4   3   1   2
  Doc3   3   0   0   0   3   0
  Doc4   0   1   0   3   0   0
  Doc5   2   2   2   3   1   4

Which objects are more similar?

Distance Measures
Properties of measures (based on the feature values x_{i,f} of object x_i on feature f), for all objects x_i, x_j, x_k:
- dist(x_i, x_j) >= 0
- dist(x_i, x_j) = dist(x_j, x_i)
- dist(x_i, x_i) = 0
- dist(x_i, x_j) <= dist(x_i, x_k) + dist(x_k, x_j)

Common measures:
- Manhattan distance: dist(x_i, x_j) = sum_f |x_{i,f} - x_{j,f}|
- Euclidean distance: dist(x_i, x_j) = sqrt( sum_f (x_{i,f} - x_{j,f})^2 )
- Minkowski distance (p): dist(x_i, x_j) = ( sum_f |x_{i,f} - x_{j,f}|^p )^(1/p)
- Mahalanobis distance: dist(x_i, x_j) = sqrt( (x_i - x_j)^T Sigma^(-1) (x_i - x_j) ), where Sigma^(-1) is the inverse of the covariance matrix of the patterns.

More complex measures exist, e.g. the Mutual Neighbor Distance (MND), based on a count of the number of shared neighbors.

Distance (Similarity) Matrix
Based on the distance or similarity measure, we can construct a symmetric matrix of distance (or similarity) values, where the (i, j) entry is the distance (similarity) between items i and j:

        I1    I2   ...   In
  I1    -    d12   ...  d1n
  I2   d21    -    ...  d2n
  ...
  In   dn1   dn2   ...   -

Here d_ij is the similarity (or distance) of item I_i to item I_j. Note that d_ij = d_ji, i.e., the matrix is symmetric, so we only need the lower triangle of the matrix. The diagonal is all 1's (similarity) or all 0's (distance).

Example: Term Similarities in Documents

        T1  T2  T3  T4  T5  T6  T7  T8
  Doc1   0   4   0   0   0   2   1   3
  Doc2   3   1   4   3   1   2   0   1
  Doc3   3   0   0   0   3   0   3   0
  Doc4   0   1   0   3   0   0   2   0
  Doc5   2   2   2   3   1   4   0   2

Term-Term Similarity Matrix: sim(T_i, T_j) = sum_{k=1}^{N} (w_ik * w_jk), where w_ik is the weight of term i in document k and N is the number of documents.
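The distance measures and the term-term similarity above can be sketched in Python (a minimal NumPy sketch; the document-term weights are the ones from the example above, while the small 2-D dataset in the usage note for Mahalanobis is a made-up illustration):

```python
import numpy as np

def manhattan(x, y):
    # sum_f |x_f - y_f|
    return float(np.abs(x - y).sum())

def minkowski(x, y, p):
    # ( sum_f |x_f - y_f|^p )^(1/p); p=1 is Manhattan, p=2 is Euclidean
    return float((np.abs(x - y) ** p).sum() ** (1.0 / p))

def mahalanobis(x, y, patterns):
    # sqrt( (x-y)^T Sigma^{-1} (x-y) ), Sigma = covariance matrix of the patterns
    cov_inv = np.linalg.inv(np.cov(patterns, rowvar=False))
    d = x - y
    return float(np.sqrt(d @ cov_inv @ d))

# Document-term weights from the term-similarity example (rows = documents)
W = np.array([
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
])

# Term-term similarity matrix: sim(Ti, Tj) = sum_k w_ik * w_jk
term_sim = W.T @ W

# Document-document Euclidean distances answer "which objects are more similar?"
doc_dist = np.array([[minkowski(a, b, 2) for b in W] for a in W])
```

Here `term_sim` is symmetric with the self-similarities on the diagonal, and `doc_dist[i, j]` is the Euclidean distance between documents i and j, so the smallest off-diagonal entries identify the most similar pairs.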

Similarity (Distance) Thresholds
A similarity (distance) threshold may be used to mark pairs that are sufficiently similar, e.g., using a threshold value of 10 in the previous example.

Graph Representation
The similarity matrix can be visualized as an undirected graph: each item is represented by a node, and edges represent the fact that two items are similar (a one in the thresholded similarity matrix). If no threshold is used, the matrix can be represented as a weighted graph.

Agglomerative Single-Link
Single-link: connect together all points that are within a threshold distance of each other.
Algorithm:
1. Place all points in the graph.
2. Pick a point to start a cluster.
3. For each point in the current cluster, add all points within the threshold that are not already in the cluster; repeat until no more items are added to the cluster.
4. Remove the points in the current cluster from the graph.
5. Repeat from step 2 until no more points remain in the graph.
Example: all points except one end up in a single cluster.

Agglomerative Complete-Link (Clique)
Complete-link (clique): all of the points in a cluster must be within the threshold distance of one another. In the thresholded distance matrix, a cluster is a complete graph (clique). Algorithms are based on finding maximal cliques (once a point is chosen, pick the largest clique it is part of), which is not an easy problem.
Example: different clusters are possible depending on where the cliques start.
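The single-link algorithm above can be sketched in plain Python (a minimal sketch; the point coordinates and threshold in the usage note are made-up illustrations, not the example from the slides):

```python
import math

def single_link_clusters(points, threshold):
    """Single-link clustering: a point joins the current cluster if it is
    within the threshold distance of ANY point already in the cluster."""
    remaining = list(range(len(points)))
    clusters = []
    while remaining:                      # step 5: until no points remain
        cluster = [remaining.pop(0)]      # step 2: pick a point to start a cluster
        grew = True
        while grew:                       # step 3: grow until nothing is added
            grew = False
            for i in remaining[:]:
                if any(math.dist(points[i], points[j]) <= threshold
                       for j in cluster):
                    cluster.append(i)
                    remaining.remove(i)
                    grew = True
        clusters.append(sorted(cluster))  # step 4: remove cluster from the graph
    return clusters
```

With points [(0, 0), (1, 0), (2, 0), (10, 10), (11, 10)] and threshold 1.5, the chain of nearby points merges into one cluster even though its endpoints are farther apart than the threshold — the characteristic single-link "chaining" behavior.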

Hierarchical Methods
Based on some method of representing a hierarchy of the data points. One idea: a hierarchical dendrogram, which connects points based on similarity.

Hierarchical Agglomerative
1. Compute the distance matrix.
2. Put each data point in its own cluster.
3. Find the most similar pair of clusters:
   - merge the pair of clusters (showing the merger in the dendrogram);
   - update the proximity matrix.
4. Repeat step 3 until all patterns are in one cluster.

Partitional Methods
Divide the data points into a number of clusters. Difficult questions: how many clusters? how to divide the points? how to represent a cluster? Representing a cluster is often done in terms of a centroid; the centroid of a cluster minimizes the squared distance between the centroid and all points in the cluster.

k-Means Clustering
1. Choose k cluster centers (randomly pick k data points as centers, or randomly distribute the centers in the space).
2. Assign each pattern to the closest cluster center.
3. Recompute the cluster centers using the current cluster memberships (moving the centers may change memberships).
4. If a convergence criterion is not met, go to step 2.
Convergence criteria: no reassignment of patterns, or minimal change in the cluster centers.

k-Means Variations
What if there are too many or not enough clusters? After some convergence:
- any cluster with too large a distance between members is split;
- any clusters too close together are combined;
- any cluster not corresponding to any points is moved;
- the thresholds are decided empirically.
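The four k-means steps above can be sketched with NumPy (a minimal sketch; it seeds the centers with the first k data points for reproducibility, a stand-in for the random seeding the slide describes, and uses "no reassignment" as the convergence criterion):

```python
import numpy as np

def kmeans(X, k, max_iter=100):
    centers = X[:k].astype(float).copy()   # step 1: pick k data points as centers
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # step 2: assign each pattern to the closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                          # convergence: no reassignment
        labels = new_labels
        # step 3: recompute the centers from the current memberships
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

On four points forming two well-separated pairs, e.g. (0,0), (0,1), (10,10), (10,11) with k=2, the centers converge to the two pair midpoints in a couple of iterations.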

An Incremental Clustering Algorithm
1. Assign the first data point to a cluster.
2. Consider the next data point: either assign it to an existing cluster or create a new cluster for it, with the assignment based on a threshold.
3. Repeat step 2 until all points are clustered.
Useful for efficient (one-pass) clustering.

Clustering Summary
- Unsupervised learning method: generation of classes.
- Based on a similarity/distance measure: Manhattan, Euclidean, Minkowski, Mahalanobis, etc.; distance matrix; thresholded distance matrix.
- Hierarchical representation: hierarchical dendrogram.
- Agglomerative methods: single link; complete link (clique).
- Partitional methods: representing clusters (centroids and error); k-means clustering; combining/splitting k-means.
- Incremental clustering: one-pass clustering.
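The one-pass algorithm above can be sketched in plain Python (a minimal sketch; comparing the threshold against the distance to a running cluster centroid is an assumption, since the slide does not say what the threshold is tested against):

```python
import math

def incremental_cluster(points, threshold):
    """One-pass clustering: each point joins the nearest existing cluster if
    its centroid is within the threshold; otherwise it starts a new cluster."""
    clusters = []  # each cluster: {"centroid": tuple, "members": [point indices]}
    for idx, p in enumerate(points):
        best, best_dist = None, None
        for c in clusters:
            d = math.dist(p, c["centroid"])
            if best_dist is None or d < best_dist:
                best, best_dist = c, d
        if best is not None and best_dist <= threshold:
            best["members"].append(idx)
            m = len(best["members"])
            # update the running centroid with the new member
            best["centroid"] = tuple((cc * (m - 1) + x) / m
                                     for cc, x in zip(best["centroid"], p))
        else:
            clusters.append({"centroid": tuple(p), "members": [idx]})
    return [c["members"] for c in clusters]
```

Each point is examined exactly once, which is what makes the method efficient; note that the resulting clusters can depend on the order in which the points arrive.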