Clustering algorithms 6CCS3WSN-7CCSMWAL
1 Clustering algorithms 6CCS3WSN-7CCSMWAL
2 Contents Introduction: types of clustering; Hierarchical clustering; Spatial clustering (k-means etc.); Community detection (next week)
3 What are we trying to cluster and why? What is the data? Vector (terms in documents) Graph based (who follows whom on Twitter) What do we want? Group together the similar items Separate the items which are clearly different
9 How are we trying to cluster and why? We consider unsupervised techniques Various heuristics are possible Vector data: possible approaches? Agglomerative or divisive Agglomerative. Hierarchical clustering: group all data into a tree based on distance between data points Divisive. Centroid: split the data into a fixed number of regions based on distance to the regional centers Graph based data: possible approaches? Separate the graph into subgraphs based on communities
16 Theory and Practice The discussion proceeds by example. It is best to try the techniques out for yourself in R.
17 Example: Major cities of the UK. Clustering data vectors: how to present data in a meaningful way. How could we cluster these cities? If we choose geographic position (latitude, longitude) as our data, how would you think they divide up? The file UKCITYDATA.txt contains columns North and West for the cities London, Bristol, Leeds, Sheffield, Bradford, Manchester, Liverpool, Birmingham, Glasgow, Edinburgh, Cardiff, Belfast and Newcastle.
18 We look at two methods: hierarchical clustering and k-means clustering. Both are based on distance between data points, but they analyze and present the data in different ways.
19 Here is a picture of how things might be clustered
20 k-means clustering [scatter plot of the cities on the North/West axes, coloured by cluster]. We made an arbitrary decision to choose 3 clusters.
21 Hierarchical clustering [cluster dendrograms of the cities: d, hclust(*, "ward.D")]. In the second figure we made an arbitrary decision to choose 3 clusters. How does it compare with the k-means result?
22 R for this
require(graphics)
cdata = read.csv("ukcitydata.txt", header=TRUE, row.names=1)
cities <- as.matrix(cdata)
# run hierarchical clustering using Ward's method
d = dist(cities)
groups <- hclust(d, method="ward.D")
# plot dendrogram; use hang to ensure that labels fall below the tree
plot(groups, hang=-1)
# cut into 3 subtrees (draw rectangles on the plot)
rect.hclust(groups, 3)
# k-means clustering
colnames(cities) <- c("North", "West")
cl <- kmeans(cities, 3)  # make 3 clusters
plot(cities, col = cl$cluster, xlim=c(50,58))  # plot clusters
points(cl$centers, col = 1:3, pch = 8, cex = 2)  # insert cluster centers
text(cities, row.names(cities), cex=0.6, pos=4, col="blue")  # label cities
23 Details (type cl in R to get the k-means details): K-means clustering with 3 clusters of sizes 4, 3, 6. The output lists the cluster means (North, West), the clustering vector for London, Bristol, Leeds, Sheffield, Bradford, Manchester, Liverpool, Birmingham, Glasgow, Edinburgh, Cardiff, Belfast, Newcastle, and the within-cluster sum of squares by cluster (between_SS / total_SS = 72.9%).
24 Agglomerative Hierarchical Clustering (HAC) Need a measure of distance between data points. Merge the two nearest clusters until there is a single cluster. The results are presented as a dendrogram showing the hierarchy. Prune the dendrogram to give the required number of clusters. Distance: e.g. Euclidean distance d(a, b) = \sqrt{\sum_{i=1}^n (a_i - b_i)^2} = \|a - b\| (1). The notation \|a - b\| is standard for Euclidean distance; a, b are vectors a = (a_1, a_2, ..., a_n), b = (b_1, b_2, ..., b_n).
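For concreteness, equation (1) can be computed directly. The lecture's own code uses R's dist(); this is a small Python sketch for illustration only.

```python
import math

def euclidean(a, b):
    # d(a, b) = sqrt(sum_i (a_i - b_i)^2) = ||a - b||
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Example: distance between two 2-D points
print(euclidean((0, 0), (3, 4)))  # → 5.0
```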
25 Agglomerative Hierarchical Clustering: Detail (1) Assign each data point to its own (single member) cluster (2) Repeat steps 3 and 4 until you have a single cluster containing all data points (3) Find the pair of clusters that are closest to each other and merge them, reducing the number of clusters by one (4) Compute distances between the new cluster and each of the old clusters
26 Distance between two clusters There are many methods. Three common ones are:
Complete-linkage. For each pair of clusters A, B (or clusters and data points) calculate d(A, B) = max{d(x, y) : x ∈ A, y ∈ B}. Merge the two clusters for which d(A, B) is smallest.
Single-linkage. For each pair of clusters A, B (or clusters and data points) calculate d(A, B) = min{d(x, y) : x ∈ A, y ∈ B}. Merge the two clusters for which d(A, B) is smallest.
Ward's method (Ward's minimum variance method). Merge the two clusters which lead to the smallest increase in total within-cluster variance. Intuitively the method tries to put together the two clusters whose means are closest.
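The three merge criteria can be sketched as follows. This is illustrative Python with function names of my own (not from the slides); Ward's criterion is expressed here as the increase in within-cluster sum of squares, for 1-D data.

```python
def d_complete(A, B, d):
    # complete-linkage: largest pairwise distance between the clusters
    return max(d(x, y) for x in A for y in B)

def d_single(A, B, d):
    # single-linkage: smallest pairwise distance between the clusters
    return min(d(x, y) for x in A for y in B)

def ward_increase(A, B):
    # increase in total within-cluster sum of squares if A and B merge
    def ss(C):
        mu = sum(C) / len(C)
        return sum((x - mu) ** 2 for x in C)
    return ss(A + B) - ss(A) - ss(B)

dist = lambda x, y: abs(x - y)
print(d_complete([1, 2], [4], dist))  # → 3
print(d_single([1, 2], [4], dist))    # → 2
print(ward_increase([1, 2], [4]))     # ≈ 4.17
```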
30 Merge clusters Figure from Page 351, Chapter 17 of Introduction to IR book
31 Both Ward's method and complete-linkage gave the same three groups in the dendrogram for the UK cities, but single-linkage gave a different answer. However, the dendrograms of complete-linkage and Ward's look different. Ward's method is considered to give a nice flat clustering. [Cluster dendrograms of the cities: d, hclust(*, "complete") and d, hclust(*, "single")]
32 Cophenetic distance The y-axis of the dendrogram (Height). The cophenetic distance between two observations that have been clustered is defined to be the intergroup dissimilarity at which the two observations are first combined into a single cluster. [Cluster dendrogram of the cities: d, hclust(*, "complete")]
33 Example We cluster the numbers 1, 2, 4, 8. If we ask for 3 clusters, hopefully they will be {1, 2}, {4}, {8}. We use complete-linkage to merge clusters.
Max distance matrix:
      C1  C2  C4  C8
C1     0
C2     1   0
C4     3   2   0
C8     7   6   4   0
The clusters with the smallest max distance are C1, C2. Merge these into C12.
34 Max distance matrix. Distance from C12 to C4: max(d(1, 4), d(2, 4)) = d(1, 4) = 3
      C12  C4  C8
C12     0
C4      3   0
C8      7   4   0
The clusters with the smallest max distance are C12, C4. Merge these into C124.
Max distance matrix:
       C124  C8
C124      0
C8        7   0
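The worked example can be checked with a small complete-linkage HAC loop. This is a pure-Python sketch for 1-D data mirroring the tables above, not the slides' R code.

```python
def hac_complete(points):
    """Merge clusters by smallest complete-linkage distance; record merge heights."""
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # complete linkage: largest pairwise distance between clusters
                d = max(abs(x - y) for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
        heights.append(d)
    return heights

print(hac_complete([1, 2, 4, 8]))  # → [1, 3, 7], matching the merge tables
```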
35 A plot of the dendrogram. The clusters C1, C2 → C12; C12, C4 → C124; C124, C8 → C1248 were merged at complete-linkage intercluster distances 1, 3, 7. This is recorded on the Height axis. [Cluster dendrogram: d, hclust(*, "complete")] Exercise. Data 1, 2, 5, 9, 11.
36 k-means clustering Colour quantization: reduce the number of colours used. Figures from Wikipedia.
37 k-means clustering The number k of clusters is an input to the algorithm, which then generates k centers and assigns each data point to the nearest center. The aim is to find some good clusters, but that is not always easy. How do we define what we mean by good? We want to partition the data points into k sets (the clusters) in such a way that we minimize the squared distance to the centers of the clusters. The center (or centroid µ) of a cluster is the average of the point positions.
40 k-means clustering If there are m points x_1, ..., x_m then µ = (1/m) \sum_{i=1}^m x_i. Typically the x_i are vectors, in which case µ is calculated component-wise.
41 This is a wish list. In practice some starting centers are given; if not, we generate some random ones. In either case the answer may not be exactly what we want. Assuming we do not have any starting centers: (1) Assign the data points (randomly) into k groups (2) Compute the centroid of each group (3) For each data point, compute the distance to each centroid and assign the data point to the nearest centroid (4) If the clusters are unchanged then STOP, else go to step 2. If we use random starting centers, the final answer may vary.
42 Example Divide 1, 2, 4, 5, 8, 9 into 3 clusters with starting centers 2, 6, 10.
data                      1 2 4 5 8 9
assign to nearest center  (1 2 4) (5) (8 9)   [4 is tied between centers 2 and 6; 8 is tied between 6 and 10]
new centroids             7/3, 5, 17/2
assign to nearest center  (1 2) (4 5) (8 9)
new centroids             3/2, 9/2, 17/2
assign to nearest center  (1 2) (4 5) (8 9)
No change: STOP
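The iteration above can be replayed with a short Lloyd-style k-means. This is a 1-D Python sketch (the slides use R's kmeans); it assumes no cluster ever becomes empty, and the hypothetical `tie_high` flag breaks distance ties toward the later center.

```python
def kmeans_1d(points, centers, tie_high=False):
    """Lloyd's algorithm on 1-D data from given starting centers."""
    centers = list(centers)
    while True:
        clusters = [[] for _ in centers]
        for p in points:
            dists = [abs(p - c) for c in centers]
            best = min(dists)
            idxs = [i for i, d in enumerate(dists) if d == best]
            clusters[idxs[-1] if tie_high else idxs[0]].append(p)
        new = [sum(c) / len(c) for c in clusters]  # assumes no empty cluster
        if new == centers:
            return clusters
        centers = new

print(kmeans_1d([1, 2, 4, 5, 8, 9], [2, 6, 10]))  # → [[1, 2], [4, 5], [8, 9]]
```

Starting from centers (2, 6, 10), the intermediate clusterings depend on how the ties for 4 and 8 are broken, but either tie-breaking rule converges to the same final clusters (1, 2), (4, 5), (8, 9).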
43 Ex1. What would have happened if we had broken the distance ties for 4 and 8 the other way in the first round? Ex2. Where do you think the cluster centers should be for the following set of points, for k = 2, 3? (1, 1), (1.5, 1.5), (2, 2), (2, 3), (3, 2), (3, 3) Check your answers by using them as the initial centers for the k-means algorithm.
44 Partitioning Around Medoids (PAM) Algorithm This is like k-means but the centers have to be part of the data set. The algorithm tries to find a k-partition of the n data points which minimizes the dissimilarity F = \sum_{i=1}^n \sum_{j=1}^n d(i, j) z_{i,j}, where z_{i,j} = 1 if i, j are in the same cluster and zero otherwise. The minimization is carried out subject to the constraint that all k clusters are non-empty. This is obviously harder to do, but makes more sense than k-means. Example: Divide 1, 2, 4, 5, 8, 9 into 3 clusters around medoids. Ans: (1, 2), (4, 5), (8, 9); either point in each cluster can act as a medoid.
45 Cities: Partitioning Around Medoids require(cluster) meds=pam(cities,3) clusplot(meds,labels=2)
46 Within cluster sum of squares (WCSS) The main limitation of the k-means method is that the solution found by the algorithm is often a local rather than a global minimum: the algorithm can't improve things, but the answer is not the best possible. It is important to run the algorithm a number of times with different start centers and choose the result with the minimum WCSS; keep running the algorithm until there is no significant improvement in WCSS. This is the reason for using random starting centers. For a given set of clusters S = (S_1, ..., S_k) with centers (µ_1, ..., µ_k), the within cluster sum of squares (WCSS) is defined as WCSS = \sum_{i=1}^k \sum_{x \in S_i} \|x - µ_i\|^2. Here \|z\|^2 = \sum_i z_i^2 is the squared Euclidean distance of z = (z_1, ..., z_n).
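The WCSS definition, for 1-D clusters, reads directly as code (a small Python sketch mirroring the formula; not from the slides):

```python
def wcss(clusters):
    """Within-cluster sum of squares for 1-D clusters (centroid = mean)."""
    total = 0.0
    for c in clusters:
        mu = sum(c) / len(c)                     # cluster centroid
        total += sum((x - mu) ** 2 for x in c)   # squared distances to it
    return total

print(wcss([[1, 2], [6]]))  # → 0.5
```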
47 k-means: More detail
> clus = kmeans(c(1,2,6), 2)
> clus
K-means clustering with 2 clusters of sizes 2, 1
Cluster means: ...
Clustering vector: ... (for data points 1, 2, 6 respectively)
Within cluster sum of squares by cluster: ...
(between_SS / total_SS = 96.4 %)
> clus$totss
[1] 14
> clus$betweenss
[1] 13.5
48 Total sum of squares (TSS) If a cluster has m points x_1, ..., x_m, its centroid is µ = (1/m) \sum_{i=1}^m x_i. For a given set of clusters S = (S_1, ..., S_k) the within cluster sum of squares (WCSS) is defined as WCSS = \sum_{i=1}^k \sum_{x \in S_i} \|x - µ_i\|^2, where \|z\|^2 is the squared Euclidean distance as before. The mean of all n data points is M = (1/n) \sum_{i=1}^n x_i, and TSS = \sum_{i=1}^n \|x_i - M\|^2.
49 Example: Explanation Divide 1, 2, 6 into 2 clusters. Ans: (1, 2) and (6). Means: (1 + 2)/2 = 1.5 and 6. Overall mean M = (1 + 2 + 6)/3 = 3. WCSS = (1 - 1.5)^2 + (2 - 1.5)^2 + (6 - 6)^2 = 0.5. TSS = (1 - 3)^2 + (2 - 3)^2 + (6 - 3)^2 = 14. BCSS = TSS - WCSS = 13.5. BCSS/TSS = 13.5/14 = 96.4%. This is a good fit because the BCSS (between cluster sum of squares) explains 96.4% of the data variation, while the WCSS (within cluster sum of squares) is only 3.6% of the data variation: the data points are close to their cluster centers.
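The arithmetic above can be replayed directly (a Python sketch of the same computation):

```python
data = [1, 2, 6]
clusters = [[1, 2], [6]]

M = sum(data) / len(data)                # overall mean = 3
tss = sum((x - M) ** 2 for x in data)    # total sum of squares = 14

def ss(c):
    mu = sum(c) / len(c)
    return sum((x - mu) ** 2 for x in c)

wcss = sum(ss(c) for c in clusters)      # within-cluster SS = 0.5
bcss = tss - wcss                        # between-cluster SS = 13.5
print(round(100 * bcss / tss, 1))        # → 96.4
```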
57 FBook example: Social Network Clustering Analysis This analysis uses a dataset representing a random sample of U.S. high school students who had profiles on a well-known social network from 2006 to 2009. From the top 500 words appearing across all pages, 36 words were chosen to represent five categories of interests, namely extracurricular activities, fashion, religion, romance, and antisocial behavior. The 36 words include terms such as football, sexy, kissed, bible, shopping, death, and drugs. The final dataset indicates, for each person, how many times each word appeared in the person's profile. The aim is to cluster the document corpus (FB pages) according to text content.
58 R program
require(cluster)
# raw.githubusercontent.com/brenden17/sklearnlab/master/facebook/snsdata.csv
teens <- read.csv("snsdata.csv")  # download from above and put in the working directory
apply(teens[5:40], 2, sum)
interests <- teens[5:40]  # throw out columns 1-4 of the data: gradyear, gender, age, friends (on FBook)
interests_z <- as.data.frame(lapply(interests, scale))
teen_clusters <- kmeans(interests_z, 5)
teen_clusters$size
# The cluster characterization can be obtained with pie charts:
pie(colSums(interests[teen_clusters$cluster==1,]), cex=0.5)
pie(colSums(interests[teen_clusters$cluster==2,]), cex=0.5)
pie(colSums(interests[teen_clusters$cluster==3,]), cex=0.5)
pie(colSums(interests[teen_clusters$cluster==4,]), cex=0.5)
pie(colSums(interests[teen_clusters$cluster==5,]), cex=0.5)
59 The output
> apply(teens[5:40], 2, sum)
basketball football soccer softball volleyball swimming cheerleading baseball tennis sports cute sex sexy hot kissed dance band marching music rock god church jesus bible hair dress blonde mall shopping clothes hollister abercrombie die death drunk drugs
... 1813
> teen_clusters$size
[1] ...
60 The five clusters are presented as pie charts. It's impossible to represent 36 dimensions (basketball, ..., drugs) on a page otherwise. The final answers are not fully reproducible (random start clusters are used). The largest 5 segments, in order within each group:
Group 5 (5523 points): music, shopping, dance, god, hair
Group 4 (22258 points): music, god, dance, hair, band
Group 3 (1039 points): hair, sex, music, kissed, die
Group 2 (594 points): baseball, football, basketball, music, rock
Group 1 (586 points): sexy, music, hair, dance, cute
61 The plot of cluster 5 [pie chart over the 36 interest terms]
62 The plots of cluster 4 [pie chart over the 36 interest terms]
63 The plots of cluster 3 [pie chart over the 36 interest terms]
64 The plots of cluster 2 [pie chart over the 36 interest terms]
65 The plots of cluster 1 [pie chart over the 36 interest terms]
Clustering in Data Mining Classification Vs Clustering When the distribution is based on a single parameter and that parameter is known for each object, it is called classification. E.g. Children, young,
More informationClustering: K-means and Kernel K-means
Clustering: K-means and Kernel K-means Piyush Rai Machine Learning (CS771A) Aug 31, 2016 Machine Learning (CS771A) Clustering: K-means and Kernel K-means 1 Clustering Usually an unsupervised learning problem
More informationLecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic
SEMANTIC COMPUTING Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic TU Dresden, 23 November 2018 Overview Unsupervised Machine Learning overview Association
More informationNetwork Traffic Measurements and Analysis
DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,
More informationUnsupervised Learning
Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised
More informationMachine Learning (BSMC-GA 4439) Wenke Liu
Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning
More informationUnsupervised Learning Hierarchical Methods
Unsupervised Learning Hierarchical Methods Road Map. Basic Concepts 2. BIRCH 3. ROCK The Principle Group data objects into a tree of clusters Hierarchical methods can be Agglomerative: bottom-up approach
More informationCluster Analysis. Angela Montanari and Laura Anderlucci
Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a
More informationComputing with large data sets
Computing with large data sets Richard Bonneau, spring 2009 Lecture 8(week 5): clustering 1 clustering Clustering: a diverse methods for discovering groupings in unlabeled data Because these methods don
More informationClustering. Unsupervised Learning
Clustering. Unsupervised Learning Maria-Florina Balcan 11/05/2018 Clustering, Informal Goals Goal: Automatically partition unlabeled data into groups of similar datapoints. Question: When and why would
More information21 The Singular Value Decomposition; Clustering
The Singular Value Decomposition; Clustering 125 21 The Singular Value Decomposition; Clustering The Singular Value Decomposition (SVD) [and its Application to PCA] Problems: Computing X > X takes (nd
More informationLecture 4 Hierarchical clustering
CSE : Unsupervised learning Spring 00 Lecture Hierarchical clustering. Multiple levels of granularity So far we ve talked about the k-center, k-means, and k-medoid problems, all of which involve pre-specifying
More informationIntroduction to Data Mining
Introduction to Data Mining Lecture #14: Clustering Seoul National University 1 In This Lecture Learn the motivation, applications, and goal of clustering Understand the basic methods of clustering (bottom-up
More information2. Find the smallest element of the dissimilarity matrix. If this is D lm then fuse groups l and m.
Cluster Analysis The main aim of cluster analysis is to find a group structure for all the cases in a sample of data such that all those which are in a particular group (cluster) are relatively similar
More informationHierarchical and Ensemble Clustering
Hierarchical and Ensemble Clustering Ke Chen Reading: [7.8-7., EA], [25.5, KPM], [Fred & Jain, 25] COMP24 Machine Learning Outline Introduction Cluster Distance Measures Agglomerative Algorithm Example
More informationCluster Analysis. Summer School on Geocomputation. 27 June July 2011 Vysoké Pole
Cluster Analysis Summer School on Geocomputation 27 June 2011 2 July 2011 Vysoké Pole Lecture delivered by: doc. Mgr. Radoslav Harman, PhD. Faculty of Mathematics, Physics and Informatics Comenius University,
More informationMultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A
MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI
More informationCluster analysis. Agnieszka Nowak - Brzezinska
Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that
More informationMultivariate Analysis
Multivariate Analysis Cluster Analysis Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Unsupervised Learning Cluster Analysis Natural grouping Patterns in the data
More informationUNSUPERVISED LEARNING IN R. Introduction to hierarchical clustering
UNSUPERVISED LEARNING IN R Introduction to hierarchical clustering Hierarchical clustering Number of clusters is not known ahead of time Two kinds: bottom-up and top-down, this course bottom-up Hierarchical
More informationCS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 16
CS434a/541a: Pattern Recognition Prof. Olga Veksler Lecture 16 Today Continue Clustering Last Time Flat Clustring Today Hierarchical Clustering Divisive Agglomerative Applications of Clustering Hierarchical
More informationClustering and Visualisation of Data
Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some
More informationChapter VIII.3: Hierarchical Clustering
Chapter VIII.3: Hierarchical Clustering 1. Basic idea 1.1. Dendrograms 1.2. Agglomerative and divisive 2. Cluster distances 2.1. Single link 2.2. Complete link 2.3. Group average and Mean distance 2.4.
More informationClustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search
Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2
More informationHierarchical clustering
Aprendizagem Automática Hierarchical clustering Ludwig Krippahl Hierarchical clustering Summary Hierarchical Clustering Agglomerative Clustering Divisive Clustering Clustering Features 1 Aprendizagem Automática
More informationHierarchical clustering
Hierarchical clustering Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Description Produces a set of nested clusters organized as a hierarchical tree. Can be visualized
More informationClustering Algorithms. Margareta Ackerman
Clustering Algorithms Margareta Ackerman A sea of algorithms As we discussed last class, there are MANY clustering algorithms, and new ones are proposed all the time. They are very different from each
More informationUniversity of Florida CISE department Gator Engineering. Clustering Part 2
Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical
More informationData Mining and Data Warehousing Henryk Maciejewski Data Mining Clustering
Data Mining and Data Warehousing Henryk Maciejewski Data Mining Clustering Clustering Algorithms Contents K-means Hierarchical algorithms Linkage functions Vector quantization SOM Clustering Formulation
More informationTypes of general clustering methods. Clustering Algorithms for general similarity measures. Similarity between clusters
Types of general clustering methods Clustering Algorithms for general similarity measures agglomerative versus divisive algorithms agglomerative = bottom-up build up clusters from single objects divisive
More informationClustering Lecture 3: Hierarchical Methods
Clustering Lecture 3: Hierarchical Methods Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced
More informationhttp://www.xkcd.com/233/ Text Clustering David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt Administrative 2 nd status reports Paper review
More informationCS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample
CS 1675 Introduction to Machine Learning Lecture 18 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem:
More informationHierarchical Clustering 4/5/17
Hierarchical Clustering 4/5/17 Hypothesis Space Continuous inputs Output is a binary tree with data points as leaves. Useful for explaining the training data. Not useful for making new predictions. Direction
More informationAdministrative. Machine learning code. Supervised learning (e.g. classification) Machine learning: Unsupervised learning" BANANAS APPLES
Administrative Machine learning: Unsupervised learning" Assignment 5 out soon David Kauchak cs311 Spring 2013 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt Machine
More informationData Mining Concepts & Techniques
Data Mining Concepts & Techniques Lecture No 08 Cluster Analysis Naeem Ahmed Email: naeemmahoto@gmailcom Department of Software Engineering Mehran Univeristy of Engineering and Technology Jamshoro Outline
More informationMachine Learning (BSMC-GA 4439) Wenke Liu
Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised
More informationSTATS306B STATS306B. Clustering. Jonathan Taylor Department of Statistics Stanford University. June 3, 2010
STATS306B Jonathan Taylor Department of Statistics Stanford University June 3, 2010 Spring 2010 Outline K-means, K-medoids, EM algorithm choosing number of clusters: Gap test hierarchical clustering spectral
More informationWhat is Clustering? Clustering. Characterizing Cluster Methods. Clusters. Cluster Validity. Basic Clustering Methodology
Clustering Unsupervised learning Generating classes Distance/similarity measures Agglomerative methods Divisive methods Data Clustering 1 What is Clustering? Form o unsupervised learning - no inormation
More informationPart I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a
Week 9 Based in part on slides from textbook, slides of Susan Holmes Part I December 2, 2012 Hierarchical Clustering 1 / 1 Produces a set of nested clusters organized as a Hierarchical hierarchical clustering
More informationData Mining Algorithms
for the original version: -JörgSander and Martin Ester - Jiawei Han and Micheline Kamber Data Management and Exploration Prof. Dr. Thomas Seidl Data Mining Algorithms Lecture Course with Tutorials Wintersemester
More informationINF4820, Algorithms for AI and NLP: Hierarchical Clustering
INF4820, Algorithms for AI and NLP: Hierarchical Clustering Erik Velldal University of Oslo Sept. 25, 2012 Agenda Topics we covered last week Evaluating classifiers Accuracy, precision, recall and F-score
More informationCSE 255 Lecture 5. Data Mining and Predictive Analytics. Dimensionality Reduction
CSE 255 Lecture 5 Data Mining and Predictive Analytics Dimensionality Reduction Course outline Week 4: I ll cover homework 1, and get started on Recommender Systems Week 5: I ll cover homework 2 (at the
More informationMATH5745 Multivariate Methods Lecture 13
MATH5745 Multivariate Methods Lecture 13 April 24, 2018 MATH5745 Multivariate Methods Lecture 13 April 24, 2018 1 / 33 Cluster analysis. Example: Fisher iris data Fisher (1936) 1 iris data consists of
More informationMachine Learning and Data Mining. Clustering. (adapted from) Prof. Alexander Ihler
Machine Learning and Data Mining Clustering (adapted from) Prof. Alexander Ihler Overview What is clustering and its applications? Distance between two clusters. Hierarchical Agglomerative clustering.
More informationINF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering
INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,
More informationClust Clus e t ring 2 Nov
Clustering 2 Nov 3 2008 HAC Algorithm Start t with all objects in their own cluster. Until there is only one cluster: Among the current clusters, determine the two clusters, c i and c j, that are most
More informationLecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 7 Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar Hierarchical Clustering Produces a set
More information