Unsupervised Learning I: Clustering (9/7/2017)


Unsupervised Learning

Clustering:
- Centroid models (K-means)
- Connectivity models (hierarchical clustering)
- Density models (DBSCAN)
- Graph-based models
- Subspace models (biclustering)

Feature extraction techniques:
- Principal component analysis
- Independent component analysis
- Singular value decomposition

Feature selection techniques

What is Clustering?

Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. A loose definition of clustering is "the process of organizing objects into groups whose members are similar in some way". A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.

What is Cluster Analysis?

Cluster analysis deals with separating data into groups whose identities are not known in advance. In general, even the correct number of groups into which the data should be sorted is not known in advance. (Daniel S. Wilks, on the use of cluster analysis in the weather and climate literature.)

Examples:
- Market segmentation: the market is divided into smaller groups that are more homogeneous in their needs.
- Taxonomy clustering: differentiation between species of animals or plants according to their physical similarity.
- Microarray clustering.

Cluster analysis means finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups:
- Intra-cluster distances are minimized.
- Inter-cluster distances are maximized.

Centroid Models

Given K, find a partition into K clusters that optimizes the chosen partitioning criterion.
- Globally optimal: exhaustively enumerate all partitions.
- Heuristic method: the K-means algorithm (MacQueen '67), in which each cluster is represented by the centre of the cluster and the algorithm converges to stable cluster centres.

K-means Algorithm

Given the cluster number K, the K-means algorithm is carried out in three steps:
1) Initialisation: set K seed points.
2) Assign each object to the cluster with the nearest seed point.
3) Compute the seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e. mean point, of the cluster).
Go back to Step 2); stop when there are no more new assignments.

Within- and Between-Cluster Criteria

Consider the total point scatter for a set of N data points:

    T = (1/2) \sum_{i=1}^{N} \sum_{j=1}^{N} d(x_i, x_j)

where d(x_i, x_j) is the distance between two points. T can be rewritten as

    T = (1/2) \sum_{k=1}^{K} \sum_{C(i)=k} ( \sum_{C(j)=k} d(x_i, x_j) + \sum_{C(j) \ne k} d(x_i, x_j) ) = W(C) + B(C)

with the within-cluster scatter

    W(C) = (1/2) \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(j)=k} d(x_i, x_j)

and the between-cluster scatter

    B(C) = (1/2) \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(j) \ne k} d(x_i, x_j)

If d is the squared Euclidean distance, then

    W(C) = \sum_{k=1}^{K} N_k \sum_{C(i)=k} ||x_i - m_k||^2   and   B(C) = \sum_{k=1}^{K} N_k ||m_k - \bar{m}||^2

where m_k is the centroid of cluster k and \bar{m} is the grand mean. Since T is fixed, minimizing W(C) is equivalent to maximizing B(C).
(webdocs.cs.ualberta.ca/~nray1/cmput466_551/clustering.ppt)

Problem

Suppose we have 4 types of medicines, each with two attributes (pH and weight index). Our goal is to group these objects into K = 2 groups of medicine.

    Medicine  Weight  pH Index
    A         1       1
    B         2       1
    C         4       3
    D         5       4

Step 1: Use initial seed points for partitioning, c_1 = A = (1, 1) and c_2 = B = (2, 1), with the Euclidean distance. For example, for D = (5, 4):

    d(D, c_1) = sqrt((5 - 1)^2 + (4 - 1)^2) = 5
    d(D, c_2) = sqrt((5 - 2)^2 + (4 - 1)^2) = 4.24

Assign each object to the cluster with the nearest seed point.
(webdocs.cs.ualberta.ca/~nray1/cmput466_551/clustering.ppt)
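The scatter decomposition above can be checked numerically. The sketch below (with an arbitrary two-blob example partition) verifies that T = W(C) + B(C) and that, for squared Euclidean distance, the pairwise within-cluster scatter equals its centroid form \sum_k N_k \sum_{C(i)=k} ||x_i - m_k||^2:

```python
import random

def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def centroid(pts):
    return [sum(c) / len(pts) for c in zip(*pts)]

random.seed(0)
# Two labelled 2-D blobs: an arbitrary example partition C with K = 2.
points = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(10)] \
       + [[random.gauss(5, 1), random.gauss(5, 1)] for _ in range(10)]
labels = [0] * 10 + [1] * 10
n = len(points)

# Total point scatter T = 1/2 * sum_i sum_j d(x_i, x_j)
T = 0.5 * sum(sq_dist(points[i], points[j]) for i in range(n) for j in range(n))
# Within-cluster scatter W(C): pairs in the same cluster
W = 0.5 * sum(sq_dist(points[i], points[j]) for i in range(n) for j in range(n)
              if labels[i] == labels[j])
# Between-cluster scatter B(C): pairs in different clusters
B = 0.5 * sum(sq_dist(points[i], points[j]) for i in range(n) for j in range(n)
              if labels[i] != labels[j])
assert abs(T - (W + B)) < 1e-6   # T is fixed, so minimizing W maximizes B

# Centroid form of W(C) for squared Euclidean distance
W_centroid = 0.0
for k in (0, 1):
    members = [p for p, l in zip(points, labels) if l == k]
    m_k = centroid(members)
    W_centroid += len(members) * sum(sq_dist(p, m_k) for p in members)
assert abs(W - W_centroid) < 1e-6
```

Because T does not depend on the partition, any algorithm that decreases W(C) automatically increases B(C), which is exactly what K-means exploits.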

Step 2: Compute the new centroids of the current partition:

    c_1 = (1, 1)
    c_2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (11/3, 8/3) = (3.67, 2.67)

Step 3: Repeat the first two steps until convergence. Knowing the members of each cluster, we compute the new centroid of each group based on the new memberships:

    c_1 = ((1 + 2)/2, (1 + 1)/2) = (1.5, 1)
    c_2 = ((4 + 5)/2, (3 + 4)/2) = (4.5, 3.5)

Compute the distance of all objects to the new centroids. Stop, since there are no new assignments.

Exercise

For the medicine data set, use K-means with the Euclidean distance metric for clustering analysis, setting K = 2 and initializing the seeds as c_1 = A and c_2 = C. Answer three questions:
1. How many steps are needed for convergence?
2. What are the memberships of the two clusters?
3. What are the centroids of the two clusters after convergence?

    Medicine  Weight  pH Index
    A         1       1
    B         2       1
    C         4       3
    D         5       4
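The three steps above can be sketched in a few lines. This minimal K-means reproduces the worked medicine example with seeds A and B; changing the `seeds` argument to `[(1, 1), (4, 3)]` lets you check your answers to the exercise (seeds A and C):

```python
def kmeans(points, seeds, max_iter=100):
    """Plain K-means: assign each point to the nearest centroid,
    recompute centroids, repeat until no assignment changes."""
    centroids = [list(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(x, centroids[k])))
            for x in points]
        if new_assignment == assignment:   # no new assignments -> stop
            break
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [x for x, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = [sum(c) / len(members) for c in zip(*members)]
    return assignment, centroids

# Medicines A, B, C, D with (weight, pH index) attributes; K = 2, seeds A and B.
data = [(1, 1), (2, 1), (4, 3), (5, 4)]
assignment, centroids = kmeans(data, seeds=[(1, 1), (2, 1)])
print(assignment)   # [0, 0, 1, 1] -- clusters {A, B} and {C, D}
print(centroids)    # [[1.5, 1.0], [4.5, 3.5]]
```

The squared distance is used for the assignment step; since the square root is monotonic, this gives the same nearest centroid as the Euclidean distance in the slides.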

How K-means Partitions

When the K centroids are set/fixed, they partition the whole space into K subspaces, which constitutes a partitioning. Changing the positions of the centroids leads to a new partitioning. A partitioning amounts to a Voronoi diagram.

K-means Demo

1. The user sets the number of clusters they'd like (e.g. K = 5).
2. Randomly guess K cluster centre locations.
3. Each data point finds out which centre it is closest to (thus each centre owns a set of data points).

4. Each centre finds the centroid of the points it owns...
5. ...and jumps there.
6. Repeat until terminated!

Relevant Issues

- Efficient in computation: O(tKn), where n is the number of objects, K the number of clusters, and t the number of iterations. Normally K, t << n.
- Local optimum: sensitive to the initial seed points; may converge to a local optimum that is an unwanted solution.
- Other problems: the number of clusters K must be specified in advance; unable to handle noisy data and outliers (cf. the K-medoids algorithm).

Relevant Issues: Cluster Validity

With different initial conditions, the K-means algorithm may produce different partitions for a given data set. Which partition is the best one for the given data set? In theory there is no answer to this question, as there is no ground truth available in unsupervised learning (an ill-posed problem!). Nevertheless, there are several cluster validity criteria to assess the quality of a clustering analysis from different perspectives. A common cluster validity criterion is the ratio of the total between-cluster distance to the total within-cluster distance:
- Between-cluster distance (BCD): the distance between the means of two clusters.
- Within-cluster distance (WCD): the sum of all distances between the data points and the mean in a specific cluster.
A large BCD/WCD ratio suggests good compactness inside clusters and good separability among different clusters!

Limitations of K-means: Non-globular Shapes

[Figure: original points vs. the K-means result (2 clusters) on a non-globular data set.]

Conclusion

The K-means algorithm is a simple yet popular method for clustering analysis. Its performance is determined by the initialisation and by an appropriate distance measure. There are several variants of K-means that overcome its weaknesses:
- K-medoids: resistance to noise and/or outliers.
- K-modes: extension to categorical data clustering analysis.
- CLARA: dealing with large data sets.
- Mixture models (EM algorithm): handling uncertainty about clusters.

Hierarchical Clustering

Hierarchical Clustering Approach
- A typical clustering analysis approach that partitions the data set sequentially.
- Constructs nested partitions layer by layer by grouping objects into a tree of clusters (without the need to know the number of clusters in advance).
- Uses a distance matrix as the clustering criterion.

Agglomerative vs. Divisive

Two sequential clustering strategies for constructing a tree of clusters:
- Agglomerative: a bottom-up strategy. Initially each data object is in its own (atomic) cluster; these clusters are then merged into larger and larger clusters.
- Divisive: a top-down strategy. Initially all objects are in one single cluster; the cluster is then subdivided into smaller and smaller clusters.

Phylogenetic Tree: a Hierarchical Clustering

A phylogenetic tree represents the graphical relation between organisms, species, or genomic sequences; in bioinformatics it is based on genomic sequence.
- Root: origin of evolution.
- Leaves: current organisms, species, or genomic sequences (MSA).
- Branches: relationships between organisms, species, or genomic sequences.
- Branch length: evolutionary time.

Illustrative Introduction

Agglomerative and divisive clustering on the data set {a, b, c, d, e}: agglomerative clustering reads from Step 0 (five singleton clusters) to Step 4 (one cluster {a, b, c, d, e}), merging at each step; divisive clustering reads the same sequence in reverse, from Step 4 back to Step 0. A cluster distance measure and a termination condition drive the process.

Cluster Distance Measures

- Single link: the smallest distance between an element in one cluster and an element in the other, i.e. d(C_i, C_j) = min{ d(x_ip, x_jq) }.
- Complete link: the largest distance between an element in one cluster and an element in the other, i.e. d(C_i, C_j) = max{ d(x_ip, x_jq) }.
- Average (UPGMA: Unweighted Pair Group Method with Arithmetic Mean): the average distance between elements in one cluster and elements in the other, i.e. d(C_i, C_j) = avg{ d(x_ip, x_jq) }.

Example: given a data set of five objects characterised by a single feature, assume two clusters, C1 = {a, b} and C2 = {c, d, e}, with feature values a = 1, b = 2, c = 4, d = 5, e = 6.

1. Calculate the distance matrix:

        a  b  c  d  e
    a   0  1  3  4  5
    b   1  0  2  3  4
    c   3  2  0  1  2
    d   4  3  1  0  1
    e   5  4  2  1  0

2. Calculate the three cluster distances between C1 and C2.

Single link:
    dist(C1, C2) = min{ d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e) } = min{3, 4, 5, 2, 3, 4} = 2
Complete link:
    dist(C1, C2) = max{ d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e) } = max{3, 4, 5, 2, 3, 4} = 5
Average:
    dist(C1, C2) = (d(a,c) + d(a,d) + d(a,e) + d(b,c) + d(b,d) + d(b,e)) / 6 = (3 + 4 + 5 + 2 + 3 + 4) / 6 = 3.5

Agglomerative Algorithm

The agglomerative algorithm is carried out in three steps:
1) Convert the object attributes to a distance matrix.
2) Set each object as a cluster (thus if we have N objects, we will have N clusters at the beginning).
3) Repeat until the number of clusters is one (or a known number of clusters): merge the two closest clusters and update the distance matrix.
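The three cluster distances from the worked example can be computed directly. The sketch below uses the one-dimensional feature values a = 1, b = 2, c = 4, d = 5, e = 6 (inferred from the distances stated in the example):

```python
# Feature values for the worked example: clusters C1 = {a, b}, C2 = {c, d, e}.
C1 = [1, 2]        # a, b
C2 = [4, 5, 6]     # c, d, e

# All pairwise distances between an element of C1 and an element of C2.
pair_dists = [abs(x - y) for x in C1 for y in C2]

single   = min(pair_dists)                      # single link (min)
complete = max(pair_dists)                      # complete link (max)
average  = sum(pair_dists) / len(pair_dists)    # average link (UPGMA)

print(single, complete, average)   # 2 5 3.5
```

Note that the three measures can disagree substantially: here the single-link distance (2) is less than half the complete-link distance (5), which is why the choice of linkage changes the resulting dendrogram.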

Problem: Clustering Analysis with the Agglomerative Algorithm

[Figures: the data matrix is converted into a distance matrix using the Euclidean distance; the two closest clusters are merged (iteration 1); the distance matrix is updated (iteration 1); the two closest clusters are merged again (iteration 2).]

[Figures, continued: the distance matrix is updated (iteration 2); the two closest clusters are merged and the distance matrix updated (iterations 3 and 4); the final result is reached when the termination condition is met.]

Dendrogram: Tree Representation

The vertical axis of a dendrogram shows the lifetime (merge distance) of each cluster; the horizontal axis lists the objects.
1. In the beginning we have 6 clusters: A, B, C, D, E and F.
2. We merge clusters D and F into cluster (D, F).
3. We merge clusters A and B into (A, B).
4. We merge clusters E and (D, F) into ((D, F), E).
5. We merge clusters ((D, F), E) and C into (((D, F), E), C).
6. We merge clusters (((D, F), E), C) and (A, B) into ((((D, F), E), C), (A, B)).
7. The last cluster contains all the objects, which concludes the computation.

[Figure: different dendrograms.]

Phylogenetic trees can be:
- Rooted: a single node is designated as the root, representing a common ancestor, with a unique path leading from it through evolutionary time to any other node.
- Unrooted: specifies only the interrelations of the nodes and says nothing about the direction in which evolution occurred.
Roots can be artificially assigned to unrooted trees by means of an outgroup: a species that has unambiguously separated early from the other species being considered.

Phylogenetic Tree: Distance-Based Construction

Calculate all the distances between leaves (taxa, from an MSA) and construct a tree based on these distances.
- Good for continuous characters.
- Not very accurate, but the fastest approach.
- Methods: UPGMA (average linkage) and Neighbor-joining.

Neighbor-joining

Suppose 1, 2 and 3 are known species, and we know d_12, d_13 and d_23. We want to add an internal node n to the tree such that the distance between any two species equals the sum of the lengths of the branches along the path between them:

    d_1n + d_2n = d_12;  d_1n + d_3n = d_13;  d_2n + d_3n = d_23

Solving these simultaneous equations gives

    d_1n = (d_12 + d_13 - d_23)/2;  d_2n = (d_12 + d_23 - d_13)/2;  d_3n = (d_13 + d_23 - d_12)/2

How to construct a tree with the Neighbor-joining method:
- Step 1: For each taxon x, sum all distances from x and divide by (leaves - 2): S_x = (sum of all D_x) / (leaves - 2).
- Step 2: Find the pair with the smallest M, where M_ij = D_ij - S_i - S_j.
- Step 3: Create a node U that joins the pair with the lowest M_ij, with branch length S_iU = (D_ij / 2) + (S_i - S_j) / 2.
- Step 4: Join i and j via U, and connect all other taxa to U in the form of a star.
- Step 5: Recalculate the new distance matrix of all other taxa to U with D_xU = (D_ix + D_jx - D_ij) / 2.
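The three-point branch-length formulas above follow from solving the simultaneous equations, and they recover additive distances exactly. A minimal check (the distances 5, 6, 7 are an arbitrary example, built from branches of length 2, 3 and 4):

```python
def three_point_branches(d12, d13, d23):
    """Branch lengths from leaves 1, 2, 3 to the internal node n, solving
    d1n + d2n = d12,  d1n + d3n = d13,  d2n + d3n = d23."""
    d1n = (d12 + d13 - d23) / 2
    d2n = (d12 + d23 - d13) / 2
    d3n = (d13 + d23 - d12) / 2
    return d1n, d2n, d3n

# Distances from a star tree with branch lengths 2, 3, 4 (so d12 = 2+3, etc.)
d1n, d2n, d3n = three_point_branches(d12=5, d13=6, d23=7)
print(d1n, d2n, d3n)   # 2.0 3.0 4.0
```

Adding any pair of the three equations and subtracting the third isolates twice one branch length, which is where the /2 comes from.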

Example of Neighbor-joining

Distance matrix for six taxa A-F (lower triangle; entries reconstructed to be consistent with the S values below):

    D    A   B   C   D   E
    B    5
    C    4   7
    D    7  10   7
    E    6   9   6   5
    F    8  11   8   9   8

Step 1: S calculation, S_x = (sum of all D_x) / (leaves - 2):

    S(A) = (5 + 4 + 7 + 6 + 8) / 4 = 7.5
    S(B) = (5 + 7 + 10 + 9 + 11) / 4 = 10.5
    S(C) = (4 + 7 + 7 + 6 + 8) / 4 = 8
    S(D) = (7 + 10 + 7 + 5 + 9) / 4 = 9.5
    S(E) = (6 + 9 + 6 + 5 + 8) / 4 = 8.5
    S(F) = (8 + 11 + 8 + 9 + 8) / 4 = 11

Example of Neighbor-joining, cont. 1

Step 2: Calculate the pair with the smallest M, M_ij = D_ij - S_i - S_j. The smallest are

    M(AB) = d(A,B) - S(A) - S(B) = 5 - 7.5 - 10.5 = -13
    M(DE) = 5 - 9.5 - 8.5 = -13

Example of Neighbor-joining, cont. 2

Step 3: Create a node U1 that joins A and B:

    S(A,U1) = d(A,B)/2 + (S(A) - S(B))/2 = 5/2 + (7.5 - 10.5)/2 = 1
    S(B,U1) = d(A,B)/2 + (S(B) - S(A))/2 = 5/2 + (10.5 - 7.5)/2 = 4

Example of Neighbor-joining, cont. 3

Step 4: Join A and B according to S, and connect all other taxa to U1 in the form of a star. Branches in black are of unknown length; branches in red are of known length.

Example of Neighbor-joining, cont. 4

Step 5: Calculate the new distance matrix, D_xU = (D_ix + D_jx - D_ij) / 2:

    d(C,U1) = (d(A,C) + d(B,C) - d(A,B)) / 2 = (4 + 7 - 5) / 2 = 3
    d(D,U1) = (d(A,D) + d(B,D) - d(A,B)) / 2 = 6

and the same for d(E,U1) and d(F,U1). Then we get the new distance matrix:

    D    U1  C   D   E
    C    3
    D    6   7
    E    5   6   5
    F    7   8   9   8

Example of Neighbor-joining, cont. 5

Repeat Steps 1 to 5 until all branches are done.

The NJ method produces an unrooted, additive tree. "Additive" means that the distance between species equals the distances summed along the internal branches. NJ generates only one possible tree, and only an unrooted tree.

[Figure: primate phylogeny (Human, Chimp, PygmyChimp, Gorilla, Orangutan, Gibbon, Baboon, SakiMonkey, Marmoset, Tarsier, Lemur, Mouse); the tree has been rooted using the Mouse as outgroup.]

Hierarchical Clustering on Microarray Data

[Figure.]
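The first NJ iteration of the example can be reproduced mechanically. In the sketch below, matrix entries that were not legible on the slide are assumptions chosen to be consistent with the S values and the d(C,U1), d(D,U1) results shown there:

```python
taxa = ["A", "B", "C", "D", "E", "F"]
# Pairwise distances (some entries are reconstructed assumptions; see lead-in).
D = {("A", "B"): 5, ("A", "C"): 4, ("A", "D"): 7, ("A", "E"): 6, ("A", "F"): 8,
     ("B", "C"): 7, ("B", "D"): 10, ("B", "E"): 9, ("B", "F"): 11,
     ("C", "D"): 7, ("C", "E"): 6, ("C", "F"): 8,
     ("D", "E"): 5, ("D", "F"): 9, ("E", "F"): 8}

def d(x, y):
    return D[(x, y)] if (x, y) in D else D[(y, x)]

# Step 1: S_x = (sum of all distances from x) / (leaves - 2)
S = {x: sum(d(x, y) for y in taxa if y != x) / (len(taxa) - 2) for x in taxa}
assert S == {"A": 7.5, "B": 10.5, "C": 8.0, "D": 9.5, "E": 8.5, "F": 11.0}

# Step 2: M_ij = d(i, j) - S_i - S_j; the smallest M wins (here a tie)
M = {(x, y): d(x, y) - S[x] - S[y] for x in taxa for y in taxa if x < y}
assert M[("A", "B")] == -13 and M[("D", "E")] == -13

# Step 3: join A and B at the new node U1
S_AU = d("A", "B") / 2 + (S["A"] - S["B"]) / 2   # branch length A-U1
S_BU = d("A", "B") / 2 + (S["B"] - S["A"]) / 2   # branch length B-U1
assert (S_AU, S_BU) == (1.0, 4.0)

# Step 5: distances from the remaining taxa to U1
for x in ("C", "D", "E", "F"):
    print(x, (d("A", x) + d("B", x) - d("A", "B")) / 2)   # C 3.0, D 6.0, E 5.0, F 7.0
```

Repeating the same five steps on the reduced matrix (with U1 in place of A and B) grows the tree one internal node at a time.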

Community Structure

In the study of complex networks, a network is said to have community structure if its nodes can be easily grouped into (potentially overlapping) sets of nodes such that each set is densely connected internally. In the particular case of non-overlapping community finding, this implies that the network divides naturally into groups of nodes with dense connections internally and sparser connections between groups; but overlapping communities are also allowed. The more general definition is based on the principle that pairs of nodes are more likely to be connected if they are both members of the same community, and less likely to be connected if they do not share communities.

Community Detection

Find groups of vertices within which connections are dense but between which they are sparser:
- Within-group (intra-group) edges: high density.
- Between-group (inter-group) edges: low density.

Real-World Networks
- Internet
- Citation networks
- Transportation networks
- Social networks
- Biochemical networks

Finding Community Structures

Divide the network into non-empty groups (communities) in such a way that every vertex belongs to one of the communities. Many possible divisions could be made; we need a good division, and hence a measurement of how good a division is.

Finding Community Structure in Very Large Networks

Consider edges that fall within a community versus between a community and the rest of the network, and define the modularity

    Q = (1/2m) \sum_{vw} [ A_{vw} - k_v k_w / (2m) ] \delta(c_v, c_w)

where A_{vw} is the adjacency matrix, k_v the degree of vertex v, m the number of edges, and \delta(c_v, c_w) = 1 if vertices v and w are in the same community. Under the null model, the probability of an edge between two vertices is proportional to their degrees, so k_v k_w / (2m) is the expected number of edges between them.
- For a random network, Q = 0: the number of edges within a community is no different from what you would expect.
- Values 0.3 < Q < 0.7 indicate significant community structure.
- A greedy approach is used to maximize Q.

Algorithm:
- Start with all vertices as isolates.
- Follow a greedy strategy: successively join the pair of clusters with the greatest increase ΔQ in modularity.
- Stop when joining any two clusters would give ΔQ <= 0.
- Successfully used to find community structure in a graph with > 400,000 nodes and > 2 million edges (Amazon's "people who bought this also bought that").

Newman Fast Algorithm: Modularity Measure

    Q = \sum_i ( e_{ii} - a_i^2 )

where e_{ij} is the fraction of edges that join vertices in community i to vertices in community j, and a_i is the fraction of edge ends that are attached to vertices in community i. This is equivalent to Q = (1/2m) \sum_{vw} [ A_{vw} - k_v k_w / (2m) ] \delta(c_v, c_w).
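The modularity formula can be evaluated directly from an adjacency matrix. The sketch below uses an assumed toy graph (two triangles joined by a single edge) and scores the natural split into the two triangles:

```python
# Q = (1/2m) * sum_vw [A_vw - k_v*k_w/(2m)] * delta(c_v, c_w)
# Toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
community = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

n = 6
A = [[0] * n for _ in range(n)]
for v, w in edges:
    A[v][w] = A[w][v] = 1

k = [sum(row) for row in A]   # degrees
two_m = sum(k)                # 2m = sum of degrees = twice the edge count

# Sum over all ordered pairs (v, w) in the same community.
Q = sum(A[v][w] - k[v] * k[w] / two_m
        for v in range(n) for w in range(n)
        if community[v] == community[w]) / two_m
print(round(Q, 4))   # 0.3571
```

Q here is 5/14 ≈ 0.357, i.e. within the 0.3-0.7 range the slides quote for significant community structure; putting all six vertices in one community instead gives Q = 0.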

Newman Fast Algorithm

1. Start by placing each of the n vertices in its own community.
2. Calculate ΔQ for all possible community pairs.
3. Merge the pair giving the largest increase in Q.
4. Repeat steps 2 and 3 until all communities are merged into one community.
5. Cut the dendrogram where Q is maximum.

Note: when merging two communities i and j,

    ΔQ = (e_{ii} + e_{ij} + e_{ji} + e_{jj} - (a_i + a_j)^2) - (e_{ii} - a_i^2 + e_{jj} - a_j^2) = e_{ij} + e_{ji} - 2 a_i a_j

Extension to Weighted Networks

(Analysis of weighted networks, M. E. J. Newman.) The modularity

    Q = (1/2m) \sum_{vw} [ A_{vw} - k_v k_w / (2m) ] \delta(c_v, c_w)

carries over with A_{vw} the weighted edge, k_i = \sum_j A_{ij}, and m = (1/2) \sum_{vw} A_{vw}.

Conclusions

- The hierarchical algorithm is a sequential clustering algorithm: it uses a distance matrix to construct a tree of clusters (dendrogram), giving a hierarchical representation without the need to know the number of clusters (a termination condition can be set when the number of clusters is known).
- Major weaknesses of agglomerative clustering methods: they can never undo what was done previously; they are sensitive to the cluster distance measure and to noise/outliers; and they are less efficient, at O(n^2), where n is the number of objects.
- There are several variants that overcome these weaknesses:
  - BIRCH: uses a clustering-feature tree and incrementally adjusts the quality of sub-clusters; scales well for large data sets.
  - CHAMELEON: hierarchical clustering using dynamic modeling, which integrates the hierarchical method with other clustering methods.

Density Models

DBSCAN and OPTICS define clusters as connected dense regions in the data space. Major features:
- Discover clusters of arbitrary shape.
- Handle noise.
- One scan.
- Need density parameters as a termination condition.

Density-Based Clustering: Density Concepts

Clustering based on density (a local cluster criterion), such as density-connected points: each cluster has a considerably higher density of points than the area outside of the cluster.

Two global parameters:
- Eps: the maximum radius of the neighbourhood.
- MinPts: the minimum number of points in an Eps-neighbourhood of a point.

Core object: an object with at least MinPts objects within a radius Eps (its Eps-neighbourhood). Border object: an object that is on the border of a cluster.

Density-Based Clustering: Background

The Eps-neighbourhood of p is N_Eps(p) = { q ∈ D | dist(p, q) <= Eps }.

Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
1) p belongs to N_Eps(q), and
2) |N_Eps(q)| >= MinPts (the core point condition).
(Example: MinPts = 5, Eps = 1 cm.)

Density-reachable: a point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p_1, ..., p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i.

Density-connected: a point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points. Discovers clusters of arbitrary shape in spatial databases with noise. (Example parameters: Eps = 1 cm, MinPts = 5; points are classified as core, border, or outlier.)

Algorithm
- Arbitrarily select a point p.
- Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
- If p is a core point, a cluster is formed.
- If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
- Continue the process until all of the points have been processed.

Advantages
- DBSCAN does not require one to specify the number of clusters.
- DBSCAN can find arbitrarily shaped clusters. It can even find a cluster completely surrounded by (but not connected to) a different cluster.
- DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points in the database.
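The algorithm above can be sketched in pure Python. This minimal version labels each point with a cluster id (or -1 for noise) and expands clusters only through core points, so border points are absorbed but never used for expansion; the example data (two dense blobs and one isolated point) is an assumption for illustration:

```python
from collections import deque
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns a cluster id per point, -1 for noise."""
    def neighbours(i):
        # Eps-neighbourhood of point i (includes the point itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:      # not a core point: noise, for now
            labels[i] = -1
            continue
        labels[i] = cluster          # start a new cluster at core point i
        queue = deque(nbrs)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:      # previously noise: becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:   # j is core: expand through it
                queue.extend(j_nbrs)
        cluster += 1
    return labels

pts = [(0, 0), (0, 0.5), (0.5, 0), (0.5, 0.5),
       (5, 5), (5, 5.5), (5.5, 5), (5.5, 5.5),
       (10, 10)]
print(dbscan(pts, eps=1.0, min_pts=3))   # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The brute-force neighbourhood search makes this O(n^2); practical implementations use a spatial index to retrieve Eps-neighbourhoods faster.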

Disadvantages
- Sensitive to its parameters.
- DBSCAN cannot cluster data sets well when there are large differences in densities.

Graph-Based Models

A clique, i.e. a subset of nodes in a graph such that every two nodes in the subset are connected by an edge, can be considered a prototypical form of cluster.

Corrupted Cliques Problem
- Input: a graph G.
- Output: the smallest number of additions and removals of edges that will transform G into a clique graph.

Distance Graphs
- Turn the distance matrix into a distance graph: genes are represented as vertices in the graph.
- Choose a distance threshold θ; if the distance between two vertices is below θ, draw an edge between them.
- The resulting graph may contain cliques; these cliques represent clusters of closely located data points!
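Building the distance graph is a one-line thresholding step. The sketch below (gene names and distance values are illustrative assumptions) thresholds a distance matrix and then reads off the clusters as connected components; in a clique graph the connected components are exactly the cliques:

```python
# Pairwise distances between genes (illustrative values) and a threshold theta.
dist = {("g1", "g2"): 2.0, ("g1", "g3"): 3.0, ("g2", "g3"): 2.5,
        ("g1", "g4"): 9.0, ("g2", "g4"): 8.0, ("g3", "g4"): 8.5,
        ("g4", "g5"): 1.5, ("g1", "g5"): 9.5, ("g2", "g5"): 8.7,
        ("g3", "g5"): 9.1}
theta = 7

genes = sorted({g for pair in dist for g in pair})
edges = {pair for pair, dv in dist.items() if dv < theta}   # distance graph

def components(vertices, edge_set):
    """Connected components via depth-first search."""
    adj = {v: set() for v in vertices}
    for u, v in edge_set:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for v in vertices:
        if v in seen:
            continue
        stack, comp = [v], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            seen.add(u)
            stack.extend(adj[u] - comp)
        comps.append(comp)
    return comps

print(components(genes, edges))   # two clusters: {g1, g2, g3} and {g4, g5}
```

With these illustrative distances the thresholded graph already is a clique graph; on real data some edges must first be added or removed (the corrupted cliques problem) before the components become cliques.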

Transforming a Distance Graph into a Clique Graph

The distance graph (threshold θ = 7) is transformed into a clique graph after removing the two highlighted edges. After transforming the distance graph into the clique graph, the dataset is partitioned into three clusters.

Overlapping Clustering

Also known as soft clustering or fuzzy clustering: a data object can be assigned to more than one cluster. The motivation is that many real-world data sets have inherently overlapping clusters; for example, a gene can be part of multiple functional modules (clusters).

Subspace Models

In biclustering (also known as co-clustering or two-mode clustering), clusters are modeled with both cluster members and relevant attributes.

Biclustering

Microarray data can be viewed as an N×M matrix:
- Each of the N rows represents a gene (or a clone, ORF, etc.).
- Each of the M columns represents a condition (a sample, a time point, etc.).
- Each entry represents the expression level of a gene under a condition; it can be either an absolute value (e.g. Affymetrix GeneChip) or a relative expression ratio (e.g. cDNA microarrays).
A row/column is sometimes referred to as the expression profile of the gene/condition.

Biclustering

It is common to visualize a gene expression dataset (N genes × M conditions) as a color plot:
- Red spots: high expression values (the gene has produced many copies of the mRNA).
- Green spots: low expression values.
- Gray spots: missing values.

If two genes are related (have similar functions or are co-regulated), their expression profiles should be similar (e.g. low Euclidean distance or high correlation). However, they may have similar expression patterns only under some conditions (e.g. they respond similarly to a certain external stimulus, but each of them has some distinct functions at other times). Similarly, for two related conditions, some genes may exhibit different expression patterns (e.g. two tumor samples of different sub-types).

Biclustering

As a result, each cluster may involve only a subset of genes and a subset of conditions, which form a checkerboard structure. In reality, each gene/condition may participate in multiple clusters.

Types of Biclusters

Different biclustering algorithms have different definitions of a bicluster:

    Constant values:       Constant values on rows:    Constant values on columns:
    a a a a                a   a   a   a               a a+i a+j a+k
    a a a a                a+i a+i a+i a+i             a a+i a+j a+k
    a a a a                a+j a+j a+j a+j             a a+i a+j a+k
    a a a a                a+k a+k a+k a+k             a a+i a+j a+k

    Coherent evolutions (additive):    Coherent evolutions (multiplicative):
    a   b   c   d                      a    b    c    d
    a+i b+i c+i d+i                    a*i  b*i  c*i  d*i
    a+j b+j c+j d+j                    a*j  b*j  c*j  d*j
    a+k b+k c+k d+k                    a*k  b*k  c*k  d*k

Cheng and Church

Model: a bicluster is represented by a submatrix A of the whole expression matrix (the involved rows and columns need not be contiguous in the original matrix). Each entry A_ij in the bicluster is the superposition (summation) of:
1. the background level,
2. the row (gene) effect, and
3. the column (condition) effect.
A dataset contains a number of biclusters, which are not necessarily disjoint.

For any submatrix C_IJ, where I and J are subsets of the genes and conditions, the mean squared residue score is

    H(I, J) = (1/(|I||J|)) \sum_{i∈I, j∈J} ( a_ij - a_iJ - a_Ij + a_IJ )^2

where a_iJ is the mean of row i over J, a_Ij the mean of column j over I, and a_IJ the mean of the whole submatrix. A bicluster is a submatrix C_IJ that has a low mean squared residue score.

Cheng and Church

Goal: find biclusters with minimum mean squared residue

    H(I, J) = (1/(|I||J|)) \sum_{i∈I, j∈J} ( a_ij - a_iJ - a_Ij + a_IJ )^2

For an ideal bicluster, H(I, J) = 0. [Figure: biclusters (a) and (b) fit the definition of the MSR.]

Constraints: 1×M and N×1 matrices always give zero residue, as do constant matrices. We therefore look for biclusters of maximum size whose residue is not more than a threshold δ (the largest δ-biclusters).

Finding the largest δ-bicluster: the problem of finding the largest square δ-bicluster (|I| = |J|) is NP-hard. The objective function for heuristic methods (to minimize) is H(I, J) above; it is a sum of components from each row and column, which suggests simple greedy algorithms that evaluate each row and column independently.

Greedy method, Algorithm 1: single node deletion.
- Parameter(s): δ (the maximum mean squared residue).
- Initialization: the bicluster contains all rows and columns.
- Iteration: 1. Compute all a_iJ, a_Ij, a_IJ and H(I, J) for reuse. 2. Remove the row or column that gives the maximum decrease of H.
- Termination: when no action will decrease H, or H <= δ.
- Time complexity: O(MN).
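The mean squared residue is easy to compute directly from its definition. The sketch below (with an assumed toy matrix) checks that an additive bicluster, a_ij = row effect + column effect, has H(I, J) = 0 exactly, and that perturbing one entry makes the residue positive:

```python
def msr(A, rows, cols):
    """Mean squared residue H(I, J) of the submatrix given by rows I and cols J."""
    a_iJ = {i: sum(A[i][j] for j in cols) / len(cols) for i in rows}   # row means
    a_Ij = {j: sum(A[i][j] for i in rows) / len(rows) for j in cols}   # col means
    a_IJ = sum(A[i][j] for i in rows for j in cols) / (len(rows) * len(cols))
    return sum((A[i][j] - a_iJ[i] - a_Ij[j] + a_IJ) ** 2
               for i in rows for j in cols) / (len(rows) * len(cols))

# Additive (coherent-evolution) bicluster: entry = row effect + column effect.
A = [[r + c for c in (0, 10, 20, 5)] for r in (0, 1, 7, 3)]
print(msr(A, rows=[0, 1, 2, 3], cols=[0, 1, 2, 3]))   # 0.0 -- an ideal bicluster

# Breaking the additive structure makes the residue positive.
A[0][0] += 4
print(msr(A, rows=[0, 1, 2, 3], cols=[0, 1, 2, 3]) > 0)   # True
```

The single-node-deletion algorithm repeatedly calls exactly this kind of computation, removing whichever row or column lowers H the most until H drops below δ.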

Issues with Biclustering Methods

- Computational complexity: most methods use heuristics to limit the search space. Although this helps find most of the TMs in a reasonable amount of time, the greedy nature of most biclustering algorithms makes them suffer from the local-minima problem.
- Difficult to compare: different biclustering methods use different definitions of biclusters and use different optimization techniques and implementations to solve them (see Madeira et al. 2004 and Prelic et al. 2006 for details).
- Difficult to choose an appropriate biclustering algorithm: different algorithms exhibit significant variations in their robustness and sensitivity to noise in the gene expression data (Prelic et al. 2006).


More information

University of Florida CISE department Gator Engineering. Clustering Part 2

University of Florida CISE department Gator Engineering. Clustering Part 2 Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical

More information

Statistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes.

Statistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes. Clustering Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group will be similar (or

More information

DS504/CS586: Big Data Analytics Big Data Clustering II

DS504/CS586: Big Data Analytics Big Data Clustering II Welcome to DS504/CS586: Big Data Analytics Big Data Clustering II Prof. Yanhua Li Time: 6pm 8:50pm Thu Location: AK 232 Fall 2016 More Discussions, Limitations v Center based clustering K-means BFR algorithm

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

University of Florida CISE department Gator Engineering. Clustering Part 4

University of Florida CISE department Gator Engineering. Clustering Part 4 Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Data Mining Algorithms

Data Mining Algorithms for the original version: -JörgSander and Martin Ester - Jiawei Han and Micheline Kamber Data Management and Exploration Prof. Dr. Thomas Seidl Data Mining Algorithms Lecture Course with Tutorials Wintersemester

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques Chapter 7.1-4 March 8, 2007 Data Mining: Concepts and Techniques 1 1. What is Cluster Analysis? 2. Types of Data in Cluster Analysis Chapter 7 Cluster Analysis 3. A

More information

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology 9/9/ I9 Introduction to Bioinformatics, Clustering algorithms Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Data mining tasks Predictive tasks vs descriptive tasks Example

More information

Intelligent Image and Graphics Processing

Intelligent Image and Graphics Processing Intelligent Image and Graphics Processing 智能图像图形处理图形处理 布树辉 bushuhui@nwpu.edu.cn http://www.adv-ci.com Clustering Clustering Attach label to each observation or data points in a set You can say this unsupervised

More information

Distance-based Methods: Drawbacks

Distance-based Methods: Drawbacks Distance-based Methods: Drawbacks Hard to find clusters with irregular shapes Hard to specify the number of clusters Heuristic: a cluster must be dense Jian Pei: CMPT 459/741 Clustering (3) 1 How to Find

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1 Week 8 Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Part I Clustering 2 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan Working with Unlabeled Data Clustering Analysis Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan chanhl@mail.cgu.edu.tw Unsupervised learning Finding centers of similarity using

More information

What is Cluster Analysis? COMP 465: Data Mining Clustering Basics. Applications of Cluster Analysis. Clustering: Application Examples 3/17/2015

What is Cluster Analysis? COMP 465: Data Mining Clustering Basics. Applications of Cluster Analysis. Clustering: Application Examples 3/17/2015 // What is Cluster Analysis? COMP : Data Mining Clustering Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, rd ed. Cluster: A collection of data

More information

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22 INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task

More information

CHAPTER 4 AN IMPROVED INITIALIZATION METHOD FOR FUZZY C-MEANS CLUSTERING USING DENSITY BASED APPROACH

CHAPTER 4 AN IMPROVED INITIALIZATION METHOD FOR FUZZY C-MEANS CLUSTERING USING DENSITY BASED APPROACH 37 CHAPTER 4 AN IMPROVED INITIALIZATION METHOD FOR FUZZY C-MEANS CLUSTERING USING DENSITY BASED APPROACH 4.1 INTRODUCTION Genes can belong to any genetic network and are also coordinated by many regulatory

More information

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest)

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Medha Vidyotma April 24, 2018 1 Contd. Random Forest For Example, if there are 50 scholars who take the measurement of the length of the

More information

Lesson 3. Prof. Enza Messina

Lesson 3. Prof. Enza Messina Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

Gene expression & Clustering (Chapter 10)

Gene expression & Clustering (Chapter 10) Gene expression & Clustering (Chapter 10) Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species Dynamic programming Approximate pattern matching

More information

Clustering. Lecture 6, 1/24/03 ECS289A

Clustering. Lecture 6, 1/24/03 ECS289A Clustering Lecture 6, 1/24/03 What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

ECS 234: Data Analysis: Clustering ECS 234

ECS 234: Data Analysis: Clustering ECS 234 : Data Analysis: Clustering What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Exploratory data analysis for microarrays

Exploratory data analysis for microarrays Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA

More information

DS504/CS586: Big Data Analytics Big Data Clustering II

DS504/CS586: Big Data Analytics Big Data Clustering II Welcome to DS504/CS586: Big Data Analytics Big Data Clustering II Prof. Yanhua Li Time: 6pm 8:50pm Thu Location: KH 116 Fall 2017 Updates: v Progress Presentation: Week 15: 11/30 v Next Week Office hours

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Unsupervised Learning Partitioning Methods

Unsupervised Learning Partitioning Methods Unsupervised Learning Partitioning Methods Road Map 1. Basic Concepts 2. K-Means 3. K-Medoids 4. CLARA & CLARANS Cluster Analysis Unsupervised learning (i.e., Class label is unknown) Group data to form

More information

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017)

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017) 1 Notes Reminder: HW2 Due Today by 11:59PM TA s note: Please provide a detailed ReadMe.txt file on how to run the program on the STDLINUX. If you installed/upgraded any package on STDLINUX, you should

More information

A Review on Cluster Based Approach in Data Mining

A Review on Cluster Based Approach in Data Mining A Review on Cluster Based Approach in Data Mining M. Vijaya Maheswari PhD Research Scholar, Department of Computer Science Karpagam University Coimbatore, Tamilnadu,India Dr T. Christopher Assistant professor,

More information

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM. Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu Clustering Overview Supervised vs. Unsupervised Learning Supervised learning (classification) Supervision: The training data (observations, measurements,

More information

Clustering in Ratemaking: Applications in Territories Clustering

Clustering in Ratemaking: Applications in Territories Clustering Clustering in Ratemaking: Applications in Territories Clustering Ji Yao, PhD FIA ASTIN 13th-16th July 2008 INTRODUCTION Structure of talk Quickly introduce clustering and its application in insurance ratemaking

More information

Clustering Techniques

Clustering Techniques Clustering Techniques Bioinformatics: Issues and Algorithms CSE 308-408 Fall 2007 Lecture 16 Lopresti Fall 2007 Lecture 16-1 - Administrative notes Your final project / paper proposal is due on Friday,

More information

Knowledge Discovery in Databases

Knowledge Discovery in Databases Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Lecture notes Knowledge Discovery in Databases Summer Semester 2012 Lecture 8: Clustering

More information

Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Clustering Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 31 Table of contents 1 Introduction 2 Data matrix and

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised

More information

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data

More information

Kapitel 4: Clustering

Kapitel 4: Clustering Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.

More information

Online Social Networks and Media. Community detection

Online Social Networks and Media. Community detection Online Social Networks and Media Community detection 1 Notes on Homework 1 1. You should write your own code for generating the graphs. You may use SNAP graph primitives (e.g., add node/edge) 2. For the

More information

DATA MINING - 1DL105, 1Dl111. An introductory class in data mining

DATA MINING - 1DL105, 1Dl111. An introductory class in data mining 1 DATA MINING - 1DL105, 1Dl111 Fall 007 An introductory class in data mining http://user.it.uu.se/~udbl/dm-ht007/ alt. http://www.it.uu.se/edu/course/homepage/infoutv/ht07 Kjell Orsborn Uppsala Database

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 15: Microarray clustering http://compbio.pbworks.com/f/wood2.gif Some slides were adapted from Dr. Shaojie Zhang (University of Central Florida) Microarray

More information

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06 Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

Foundations of Machine Learning CentraleSupélec Fall Clustering Chloé-Agathe Azencot

Foundations of Machine Learning CentraleSupélec Fall Clustering Chloé-Agathe Azencot Foundations of Machine Learning CentraleSupélec Fall 2017 12. Clustering Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr Learning objectives

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/28/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

COMP 465: Data Mining Still More on Clustering

COMP 465: Data Mining Still More on Clustering 3/4/015 Exercise COMP 465: Data Mining Still More on Clustering Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Describe each of the following

More information

Clustering. Chapter 10 in Introduction to statistical learning

Clustering. Chapter 10 in Introduction to statistical learning Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/004 1

More information

Evolutionary tree reconstruction (Chapter 10)

Evolutionary tree reconstruction (Chapter 10) Evolutionary tree reconstruction (Chapter 10) Early Evolutionary Studies Anatomical features were the dominant criteria used to derive evolutionary relationships between species since Darwin till early

More information

DBSCAN. Presented by: Garrett Poppe

DBSCAN. Presented by: Garrett Poppe DBSCAN Presented by: Garrett Poppe A density-based algorithm for discovering clusters in large spatial databases with noise by Martin Ester, Hans-peter Kriegel, Jörg S, Xiaowei Xu Slides adapted from resources

More information

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity

More information

Clustering Techniques

Clustering Techniques Clustering Techniques Marco BOTTA Dipartimento di Informatica Università di Torino botta@di.unito.it www.di.unito.it/~botta/didattica/clustering.html Data Clustering Outline What is cluster analysis? What

More information

Lecture Notes for Chapter 8. Introduction to Data Mining

Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/4 What

More information

A SURVEY ON CLUSTERING ALGORITHMS Ms. Kirti M. Patil 1 and Dr. Jagdish W. Bakal 2

A SURVEY ON CLUSTERING ALGORITHMS Ms. Kirti M. Patil 1 and Dr. Jagdish W. Bakal 2 Ms. Kirti M. Patil 1 and Dr. Jagdish W. Bakal 2 1 P.G. Scholar, Department of Computer Engineering, ARMIET, Mumbai University, India 2 Principal of, S.S.J.C.O.E, Mumbai University, India ABSTRACT Now a

More information

Based on Raymond J. Mooney s slides

Based on Raymond J. Mooney s slides Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit

More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/09/2018)

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/09/2018) 1 Notes Reminder: HW2 Due Today by 11:59PM TA s note: Please provide a detailed ReadMe.txt file on how to run the program on the STDLINUX. If you installed/upgraded any package on STDLINUX, you should

More information

Clustering Lecture 4: Density-based Methods

Clustering Lecture 4: Density-based Methods Clustering Lecture 4: Density-based Methods Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced

More information

2. Background. 2.1 Clustering

2. Background. 2.1 Clustering 2. Background 2.1 Clustering Clustering involves the unsupervised classification of data items into different groups or clusters. Unsupervised classificaiton is basically a learning task in which learning

More information

Supervised vs. Unsupervised Learning

Supervised vs. Unsupervised Learning Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now

More information

CSE 347/447: DATA MINING

CSE 347/447: DATA MINING CSE 347/447: DATA MINING Lecture 6: Clustering II W. Teal Lehigh University CSE 347/447, Fall 2016 Hierarchical Clustering Definition Produces a set of nested clusters organized as a hierarchical tree

More information

Unsupervised Learning Hierarchical Methods

Unsupervised Learning Hierarchical Methods Unsupervised Learning Hierarchical Methods Road Map. Basic Concepts 2. BIRCH 3. ROCK The Principle Group data objects into a tree of clusters Hierarchical methods can be Agglomerative: bottom-up approach

More information

Clustering. Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester / 238

Clustering. Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester / 238 Clustering Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester 2015 163 / 238 What is Clustering? Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Cluster Analysis: Basic Concepts and Algorithms

Cluster Analysis: Basic Concepts and Algorithms Cluster Analysis: Basic Concepts and Algorithms Data Warehousing and Mining Lecture 10 by Hossen Asiful Mustafa What is Cluster Analysis? Finding groups of objects such that the objects in a group will

More information

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask Machine Learning and Data Mining Clustering (1): Basics Kalev Kask Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand patterns of

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 4th, 2014 Wolf-Tilo Balke and José Pinto Institut für Informationssysteme Technische Universität Braunschweig The Cluster

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 09: Vector Data: Clustering Basics Instructor: Yizhou Sun yzsun@cs.ucla.edu October 27, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Classification

More information

11/17/2009 Comp 590/Comp Fall

11/17/2009 Comp 590/Comp Fall Lecture 20: Clustering and Evolution Study Chapter 10.4 10.8 Problem Set #5 will be available tonight 11/17/2009 Comp 590/Comp 790-90 Fall 2009 1 Clique Graphs A clique is a graph with every vertex connected

More information

TELCOM2125: Network Science and Analysis

TELCOM2125: Network Science and Analysis School of Information Sciences University of Pittsburgh TELCOM2125: Network Science and Analysis Konstantinos Pelechrinis Spring 2015 2 Part 4: Dividing Networks into Clusters The problem l Graph partitioning

More information

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Team 2 Prof. Anita Wasilewska CSE 634 Data Mining All Sources Used for the Presentation Olson CF. Parallel algorithms

More information