Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394


1 Data Mining Clustering Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

2 Table of contents
1 Introduction
2 Data matrix and dissimilarity matrix
3 Proximity Measures
4 Clustering methods
  - Partitioning methods
  - Hierarchical methods
  - Model-based clustering
  - Density based clustering
  - Grid-based clustering
5 Cluster validation and assessment

4 Introduction
Clustering is the process of grouping a set of data objects into multiple groups, or clusters, so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters. Dissimilarities and similarities are assessed from the attribute values describing the objects and often involve distance measures. Clustering as a data mining tool has its roots in many application areas, such as biology, security, business intelligence, and Web search.

5 Requirements for cluster analysis
Clustering is a challenging research field, and the following are its typical requirements:
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noisy data
- Incremental clustering and insensitivity to input order
- Capability of clustering high-dimensionality data
- Constraint-based clustering
- Interpretability and usability

6 Comparing clustering methods
Clustering methods can be compared on the following aspects:
- The partitioning criteria: In some methods, all the objects are partitioned so that no hierarchy exists among the clusters; other methods partition the objects hierarchically.
- Separation of clusters: In some methods, the data are partitioned into mutually exclusive clusters, while in others the clusters may not be exclusive, that is, a data object may belong to more than one cluster.
- Similarity measure: Some methods determine the similarity between two objects by the distance between them, while in other methods the similarity may be defined by connectivity based on density or contiguity.
- Clustering space: Many clustering methods search for clusters within the entire data space. These methods are useful for low-dimensionality data sets. With high-dimensional data, however, there can be many irrelevant attributes, which can make similarity measurements unreliable. Consequently, clusters found in the full space are often meaningless. It is often better to search instead for clusters within different subspaces of the same data set.

8 Data matrix and dissimilarity matrix
Suppose that we have $n$ objects described by $p$ attributes. The objects are $x_1 = (x_{11}, x_{12}, \ldots, x_{1p})$, $x_2 = (x_{21}, x_{22}, \ldots, x_{2p})$, and so on, where $x_{ij}$ is the value for object $x_i$ of the $j$th attribute. For brevity, we hereafter refer to object $x_i$ as object $i$. The objects may be tuples in a relational database, and are also referred to as data samples or feature vectors.
Main memory-based clustering and nearest-neighbor algorithms typically operate on either of the following two data structures:
Data matrix: This structure stores the $n$ objects in the form of a table or $n \times p$ matrix.
\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]
Dissimilarity matrix: This structure stores a collection of proximities that are available for all pairs of objects. It is often represented by an $n \times n$ matrix or table:
\[
\begin{bmatrix}
0       & d(1,2) & d(1,3) & \cdots & d(1,n) \\
d(2,1)  & 0      & d(2,3) & \cdots & d(2,n) \\
\vdots  & \vdots & \vdots & \ddots & \vdots \\
d(n,1)  & d(n,2) & d(n,3) & \cdots & 0
\end{bmatrix}
\]
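As a minimal sketch of how the two structures relate, the following Python builds the $n \times n$ dissimilarity matrix from a data matrix stored as a list of rows. The function names and the toy data are illustrative, and Euclidean distance is assumed here only as one possible choice of $d(i, j)$.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dissimilarity_matrix(data, dist):
    """Return the symmetric n x n matrix of pairwise dissimilarities d(i, j)."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # d(i, j) = d(j, i), and the diagonal stays 0
            d[i][j] = d[j][i] = dist(data[i], data[j])
    return d


# Toy data matrix with n = 3 objects and p = 2 attributes (made-up values).
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = dissimilarity_matrix(data, euclidean)
```

Only the upper triangle is computed, since the matrix is symmetric with a zero diagonal.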

10 Proximity Measures
Proximity measures for nominal attributes: Let the number of states of a nominal attribute be $M$. The dissimilarity between two objects $i$ and $j$ can be computed based on the ratio of mismatches:
\[ d(i, j) = \frac{p - m}{p} \]
where $m$ is the number of matches and $p$ is the total number of attributes describing the objects.
Proximity measures for binary attributes: Binary attributes are either symmetric or asymmetric. Treating binary attributes as if they were numeric can be misleading, so dissimilarity is computed from the contingency table below, where $q$ is the number of attributes that equal 1 for both objects $i$ and $j$, $r$ is the number of attributes that equal 1 for object $i$ but 0 for object $j$, $s$ is the number of attributes that equal 0 for object $i$ but 1 for object $j$, and $t$ is the number of attributes that equal 0 for both objects. The total number of attributes is $p = q + r + s + t$.

Table 2.3 Contingency table for binary attributes
                     Object j
                     1        0        sum
Object i     1       q        r        q + r
             0       s        t        s + t
             sum     q + s    r + t    p

For symmetric binary attributes, dissimilarity is calculated as
\[ d(i, j) = \frac{r + s}{q + r + s + t} \]
For asymmetric binary attributes, when the number of negative matches, $t$, is unimportant and the number of positive matches, $q$, is important, dissimilarity is calculated as
\[ d(i, j) = \frac{r + s}{q + r + s} \]
The coefficient $1 - d(i, j)$ is called the Jaccard coefficient.
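The counts $q$, $r$, $s$, $t$ and the resulting dissimilarities can be sketched in a few lines of Python. The function names and the two example objects are made up for illustration. Returning 0.0 when both objects are all zeros in the asymmetric case is an assumed convention, since the formula is undefined there.

```python
def contingency(x, y):
    """Count (q, r, s, t) for two equal-length 0/1 vectors."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return q, r, s, t


def symmetric_binary_dissimilarity(x, y):
    q, r, s, t = contingency(x, y)
    return (r + s) / (q + r + s + t)


def asymmetric_binary_dissimilarity(x, y):
    q, r, s, _ = contingency(x, y)  # negative matches t are ignored
    if q + r + s == 0:
        return 0.0  # assumed convention: two all-zero objects are identical
    return (r + s) / (q + r + s)


def nominal_dissimilarity(x, y):
    """Ratio of mismatches (p - m) / p over p nominal attributes."""
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p


# Two hypothetical objects described by six binary attributes.
obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 1, 0, 1, 0]
d_sym = symmetric_binary_dissimilarity(obj_i, obj_j)
d_asym = asymmetric_binary_dissimilarity(obj_i, obj_j)
jaccard = 1 - d_asym
```

For these two objects $q = 2$, $r = 0$, $s = 1$, $t = 3$, so the asymmetric measure ignores the three shared zeros and reports a larger dissimilarity than the symmetric one.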

11 Proximity Measures (cont.)
Dissimilarity of numeric attributes: The most popular distance measure is the Euclidean distance
\[ d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2} \]
Another well-known measure is the Manhattan distance
\[ d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| \]
The Minkowski distance is a generalization of the Euclidean and Manhattan distances
\[ d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h} \]
with $h = 1$ giving the Manhattan distance and $h = 2$ the Euclidean distance.
Dissimilarity of ordinal attributes: We first replace each $x_{if}$ by its corresponding rank $r_{if} \in \{1, \ldots, M_f\}$ and then normalize it using
\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
Dissimilarity can then be computed using the distance measures for numeric attributes applied to $z_{if}$.
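A short Python sketch of the Minkowski distance and the ordinal rank normalization above; the function names are illustrative, not part of the lecture.

```python
def minkowski(x, y, h):
    """Minkowski distance of order h; h = 1 is Manhattan, h = 2 is Euclidean."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)


def normalize_rank(rank, num_states):
    """Map a rank in {1, ..., M_f} onto [0, 1] via z = (r - 1) / (M_f - 1)."""
    return (rank - 1) / (num_states - 1)


x, y = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(x, y, 1)
euclid = minkowski(x, y, 2)

# An ordinal attribute with M_f = 3 ordered states, e.g. low < medium < high.
z_values = [normalize_rank(r, 3) for r in (1, 2, 3)]
```

After normalization, ordinal values lie on the same $[0, 1]$ scale and can be passed to `minkowski` like any numeric attribute.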

12 Proximity Measures (cont.)
Dissimilarity for attributes of mixed types: A preferable approach is to process all attribute types together, performing a single analysis.
\[ d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]
where the indicator $\delta_{ij}^{(f)} = 0$ if either (1) $x_{if}$ or $x_{jf}$ is missing, or (2) $x_{if} = x_{jf} = 0$ and attribute $f$ is asymmetric binary; otherwise $\delta_{ij}^{(f)} = 1$. The distance $d_{ij}^{(f)}$ is computed based on the type of attribute $f$.
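The weighted average above can be sketched as follows. The per-attribute distances are assumptions, since the slide only says they depend on the attribute type: a range-normalized absolute difference for numeric attributes and a 0/1 mismatch for nominal and binary attributes. Missing values are represented by `None`, and all names are hypothetical.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Mixed-type dissimilarity d(i, j) as a delta-weighted average.

    kinds[f] is "numeric", "nominal", or "asymmetric_binary";
    ranges[f] is max - min of attribute f (used for numeric attributes only).
    """
    numerator = 0.0
    denominator = 0
    for f, kind in enumerate(kinds):
        a, b = x[f], y[f]
        if a is None or b is None:
            continue  # delta = 0: a value is missing
        if kind == "asymmetric_binary" and a == 0 and b == 0:
            continue  # delta = 0: negative match on an asymmetric attribute
        if kind == "numeric":
            d_f = abs(a - b) / ranges[f]  # assumed: range-normalized difference
        else:
            d_f = 0.0 if a == b else 1.0  # assumed: simple mismatch
        numerator += d_f
        denominator += 1
    # Assumed convention: no comparable attribute means dissimilarity 0.
    return numerator / denominator if denominator else 0.0


kinds = ["numeric", "nominal", "asymmetric_binary"]
ranges = [10.0, None, None]
d_a = mixed_dissimilarity((2.0, "red", 0), (7.0, "blue", 0), kinds, ranges)
d_b = mixed_dissimilarity((None, "red", 1), (7.0, "red", 0), kinds, ranges)
```

In the first pair the binary attribute is skipped (both values are 0), leaving $(0.5 + 1)/2 = 0.75$; in the second the numeric attribute is skipped as missing, leaving $(0 + 1)/2 = 0.5$.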

14 Clustering methods
There are many clustering algorithms in the literature. It is difficult to provide a crisp categorization of clustering methods because these categories may overlap, so that a method may have features from several categories. In general, the major fundamental clustering methods can be classified into the following categories.

Method: General characteristics
Partitioning methods:
- Find mutually exclusive clusters of spherical shape
- Distance-based
- May use mean or medoid (etc.) to represent cluster center
- Effective for small- to medium-size data sets
Hierarchical methods:
- Clustering is a hierarchical decomposition (i.e., multiple levels)
- Cannot correct erroneous merges or splits
- May incorporate other techniques like microclustering or consider object linkages
Density-based methods:
- Can find arbitrarily shaped clusters
- Clusters are dense regions of objects in space that are separated by low-density regions
- Cluster density: each point must have a minimum number of points within its neighborhood
- May filter out outliers
Grid-based methods:
- Use a multiresolution grid data structure
- Fast processing time (typically independent of the number of data objects, yet dependent on grid size)

15 Partitioning methods
The simplest and most fundamental version of cluster analysis is partitioning, which organizes the objects of a set into several exclusive groups or clusters. Formally, given a data set, $D$, of $n$ objects, and $k$, the number of clusters to form, a partitioning algorithm organizes the objects into $k$ partitions ($k \le n$), where each partition represents a cluster. The clusters are formed to optimize an objective partitioning criterion, such as a dissimilarity function based on distance, so that the objects within a cluster are similar to one another and dissimilar to objects in other clusters in terms of the data set attributes.

16 k-means clustering algorithm
Suppose a data set, $D$, contains $n$ objects in Euclidean space. Partitioning methods distribute the objects in $D$ into $k$ clusters, $C_1, \ldots, C_k$, that is, $C_i \subset D$ and $C_i \cap C_j = \emptyset$ for $1 \le i, j \le k$, $i \ne j$. An objective function is used to assess the partitioning quality so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. That is, the objective function aims for high intracluster similarity and low intercluster similarity.
A centroid-based partitioning technique uses the centroid of a cluster, $C_i$, to represent that cluster. The difference between an object $p \in C_i$ and $\mu_i$, the representative of the cluster, is measured by $\lVert p - \mu_i \rVert$. The quality of cluster $C_i$ can be measured by the within-cluster variation, which is the sum of squared errors between all objects in $C_i$ and the centroid $\mu_i$. Summed over all $k$ clusters, it is defined as
\[ E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - \mu_i \rVert^2 \]
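The following is a minimal sketch of the standard k-means iteration (often called Lloyd's algorithm) that tries to reduce $E$: assign every object to its nearest centroid, recompute the centroids, and repeat until nothing changes. The data and initial centers are made up, the function names are illustrative, and an empty cluster simply keeps its previous center, which is an assumed convention.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Component-wise mean of a non-empty list of points (tuples)."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, centers, max_iter=100):
    """Return (centers, clusters, E) after at most max_iter >= 1 iterations."""
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest current center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: recompute each center as the mean of its cluster.
        new_centers = [centroid(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # converged: assignments can no longer change
        centers = new_centers
    # Within-cluster variation E (sum of squared errors).
    sse = sum(squared_distance(p, centers[i])
              for i, c in enumerate(clusters) for p in c)
    return centers, clusters, sse


points = [(2.0,), (3.0,), (4.0,), (10.0,), (11.0,), (12.0,)]
centers, clusters, sse = kmeans(points, [(2.0,), (4.0,)])
```

On this one-dimensional toy data the centers move from 2 and 4 to 3 and 11, at which point the assignments stop changing.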

17 k-means clustering algorithm (cont.)
[Figure: k-means on a one-dimensional dataset, showing the means $\mu_1$ and $\mu_2$ at each step. Panels: (a) initial dataset; (b) through (f) successive iterations, converging at iteration t = 5.]

18 k-means clustering algorithm (cont.)
The k-means method is not guaranteed to converge to the global optimum and often terminates at a local optimum. The results may depend on the initial random selection of cluster centers. To obtain good results in practice, it is common to run the k-means algorithm multiple times with different initial cluster centers.
The time complexity of the k-means algorithm is $O(nkt)$, where $n$ is the total number of objects, $k$ is the number of clusters, and $t$ is the number of iterations. Normally, $k \ll n$ and $t \ll n$. Therefore, the method is relatively scalable and efficient in processing large data sets.
There are several variants of the k-means method. These can differ in the selection of the initial $k$ means, the calculation of dissimilarity, and the strategies for calculating cluster means.
The k-modes method is a variant of k-means that extends the k-means paradigm to cluster nominal data by replacing the means of clusters with modes.
Partitioning around medoids (PAM) is a realization of the k-medoids method.

19 Hierarchical methods
A hierarchical clustering method works by grouping data objects into a hierarchy or tree of clusters.
Hierarchical clustering methods:
- Agglomerative hierarchical clustering
- Divisive hierarchical clustering
[Figure 10.6: Agglomerative (AGNES) and divisive (DIANA) hierarchical clustering on data objects {a, b, c, d, e}. AGNES runs from Step 0 to Step 4, merging the objects until the single cluster abcde remains; DIANA runs the same steps in reverse.]

20 Distance measures in hierarchical methods
Whether using an agglomerative method or a divisive method, a core need is to measure the distance between two clusters, where each cluster is generally a set of objects.
Four widely used measures for the distance between clusters are as follows, where $|p - q|$ is the distance between two objects or points, $p$ and $q$; $\mu_i$ is the mean for cluster $C_i$; and $n_i$ is the number of objects in $C_i$. They are also known as linkage measures.
Minimum distance
\[ d_{min}(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} |p - q| \]
Maximum distance
\[ d_{max}(C_i, C_j) = \max_{p \in C_i,\, q \in C_j} |p - q| \]
Mean distance
\[ d_{mean}(C_i, C_j) = |\mu_i - \mu_j| \]
Average distance
\[ d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i,\, q \in C_j} |p - q| \]
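The four linkage measures translate directly into code. This sketch assumes Euclidean distance between points and uses two made-up clusters; the function names mirror the subscripts above but are otherwise illustrative.

```python
import math


def dist(p, q):
    """Euclidean distance |p - q| between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(cluster):
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def d_min(ci, cj):
    return min(dist(p, q) for p in ci for q in cj)


def d_max(ci, cj):
    return max(dist(p, q) for p in ci for q in cj)


def d_mean(ci, cj):
    return dist(mean_point(ci), mean_point(cj))


def d_avg(ci, cj):
    return sum(dist(p, q) for p in ci for q in cj) / (len(ci) * len(cj))


c1 = [(0.0, 0.0), (0.0, 2.0)]
c2 = [(4.0, 0.0), (4.0, 2.0)]
linkages = {
    "min": d_min(c1, c2),
    "max": d_max(c1, c2),
    "mean": d_mean(c1, c2),
    "avg": d_avg(c1, c2),
}
```

For these clusters the minimum and mean distances are both 4, the maximum is $\sqrt{20}$, and the average lies between them, which illustrates that the measures generally disagree.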

21 Hierarchical methods
[Figure: Agglomerative and divisive hierarchical clustering on data objects {a, b, c, d, e}.]
[Figure: Dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}, with levels l = 0 through l = 4 and a similarity scale.]

22 Model-based clustering
k-means is closely related to a probabilistic model known as the Gaussian mixture model.
\[ p(x) = \sum_{k} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \]
$\pi_k$, $\mu_k$, $\Sigma_k$ are parameters. The $\pi_k$ are called mixing proportions, and each Gaussian is called a mixture component.
The model is simply a weighted sum of Gaussians, but it is much more powerful than a single Gaussian because it can model multi-modal distributions.
Note that for $p(x)$ to be a probability distribution, we require that $\sum_k \pi_k = 1$ and that for all $k$ we have $\pi_k > 0$. Thus, we may interpret the $\pi_k$ as probabilities themselves.
Set of parameters: $\theta = \{\{\pi_k\}, \{\mu_k\}, \{\Sigma_k\}\}$
[Figure: Gaussian mixture model example, a mixture of three Gaussians. Source: Roland Memisevic, Machine Learning.]

23 Model-based clustering (cont.)
Let $X = \{x_1, \ldots, x_n\}$ be drawn i.i.d. from a mixture of Gaussians. The log-likelihood of the observations equals
\[ \ln p(X \mid \theta) = \sum_{i=1}^{n} \ln \sum_{j=1}^{k} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j) \]
Taking the derivative of $\ln p(X \mid \theta)$ with respect to $\mu_j$ and setting it equal to zero, we obtain
\[ 0 = \sum_{i=1}^{n} \frac{\pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{l=1}^{k} \pi_l \, \mathcal{N}(x_i \mid \mu_l, \Sigma_l)} \, \Sigma_j^{-1} (x_i - \mu_j) \]
Let
\[ \gamma(z_{ij}) = \frac{\pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{l=1}^{k} \pi_l \, \mathcal{N}(x_i \mid \mu_l, \Sigma_l)} \]

24 Model-based clustering (cont.)
We had
\[ 0 = \sum_{i=1}^{n} \gamma(z_{ij}) \, \Sigma_j^{-1} (x_i - \mu_j) \]
Multiplying by $\Sigma_j$ and rearranging, we obtain
\[ \mu_j = \frac{1}{n_j} \sum_{i=1}^{n} \gamma(z_{ij}) \, x_i, \qquad n_j = \sum_{i=1}^{n} \gamma(z_{ij}) \]
Similar to the above step, we obtain
\[ \Sigma_j = \frac{1}{n_j} \sum_{i=1}^{n} \gamma(z_{ij}) (x_i - \mu_j)(x_i - \mu_j)^T, \qquad \pi_j = \frac{n_j}{n} \]
Please read Section 9.2 of Bishop.
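The update equations suggest an iterative scheme: compute the responsibilities $\gamma(z_{ij})$ from the current parameters, then re-estimate $\mu_j$, $\Sigma_j$, and $\pi_j$ from them. Below is a minimal sketch of this expectation-maximization loop for the one-dimensional case, where each $\Sigma_j$ reduces to a scalar variance. The data, the initial parameters, and the fixed iteration count are illustrative assumptions, and the sketch does not guard against numerical underflow or collapsing variances.

```python
import math


def gaussian_pdf(x, mu, var):
    """Density of a univariate Gaussian N(x | mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(xs, pis, mus, variances, n_iter=20):
    """Run n_iter EM iterations for a 1-D Gaussian mixture."""
    pis, mus, variances = list(pis), list(mus), list(variances)
    n, k = len(xs), len(pis)
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i][j] = gamma(z_ij).
        gamma = []
        for x in xs:
            weighted = [pis[j] * gaussian_pdf(x, mus[j], variances[j])
                        for j in range(k)]
            total = sum(weighted)
            gamma.append([w / total for w in weighted])
        # M-step: re-estimate parameters from the responsibilities.
        for j in range(k):
            n_j = sum(gamma[i][j] for i in range(n))
            mus[j] = sum(gamma[i][j] * xs[i] for i in range(n)) / n_j
            variances[j] = sum(gamma[i][j] * (xs[i] - mus[j]) ** 2
                               for i in range(n)) / n_j
            pis[j] = n_j / n
    return pis, mus, variances


xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
pis, mus, variances = em_gmm_1d(xs, [0.5, 0.5], [0.0, 13.0], [1.0, 1.0])
```

With two well-separated groups and these starting values, the means settle near 2 and 11 with equal mixing proportions. Like k-means, the result in general depends on the initialization.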

25 Density based clustering
[Figure: Clusters of arbitrary shape.]
The general idea of these methods is to continue growing a given cluster as long as the density in the neighborhood exceeds some threshold.
How can we find dense regions in density-based clustering? The density of an object $x$ can be measured by the number of objects close to $x$.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core objects, that is, objects that have dense neighborhoods. It connects core objects and their neighborhoods to form dense regions as clusters.

26 Density based clustering (cont.)
How does DBSCAN quantify the neighborhood of an object? A user-specified parameter $\epsilon > 0$ is used to specify the radius of the neighborhood we consider for every object.
Definition (ɛ-neighborhood)
The ɛ-neighborhood of an object $x$ is the space within a radius $\epsilon$ centered at $x$.
Due to the fixed neighborhood size parameterized by $\epsilon$, the density of a neighborhood can be measured simply by the number of objects in the neighborhood. To decide whether a neighborhood is dense, DBSCAN uses a second user-specified parameter, MinPts.
Definition (Core object)
An object is a core object if the ɛ-neighborhood of the object contains at least MinPts objects.
[Figure: Density-based dataset.]
[Figure, panels (a) and (b): the ɛ-neighborhood of a point x, with further points y and z.]

27 Density based clustering (cont.)
Given a set, $D$, of objects, we can identify all core objects with respect to the given parameters, $\epsilon$ and MinPts. The clustering task is thereby reduced to using core objects and their neighborhoods to form dense regions, where the dense regions are clusters.
Definition (Directly density-reachable)
For a core object $q$ and an object $p$, we say that $p$ is directly density-reachable from $q$ (with respect to $\epsilon$ and MinPts) if $p$ is within the ɛ-neighborhood of $q$.
An object $p$ is directly density-reachable from another object $q$ if and only if $q$ is a core object and $p$ is in the ɛ-neighborhood of $q$.
[Figure: Density-reachability and density-connectivity in density-based clustering, with points q, m, p, s, r, o and MinPts = 3.]

28 Density based clustering (cont.)
How can we assemble a large dense region using small dense regions centered on core objects?
Definition (Density-reachable)
An object $p$ is density-reachable from $q$ (with respect to $\epsilon$ and MinPts in $D$) if there is a chain of objects $p_1, \ldots, p_n$, such that $p_1 = q$, $p_n = p$, and $p_{i+1}$ is directly density-reachable from $p_i$ with respect to $\epsilon$ and MinPts, for $1 \le i < n$, $p_i \in D$.
[Figure: Density-reachability and density-connectivity in density-based clustering, with points q, m, p, s, r, o and MinPts = 3. Source: Based on Ester, Kriegel, Sander, and Xu [EKSX96].]

29 Density based clustering (cont.)
To connect core objects as well as their neighbors in a dense region, DBSCAN uses the notion of density-connectedness.
Definition (Density-connected)
Two objects $p_1, p_2 \in D$ are density-connected with respect to $\epsilon$ and MinPts if there is an object $q \in D$ such that both $p_1$ and $p_2$ are density-reachable from $q$ with respect to $\epsilon$ and MinPts.
[Figure: Density-reachability and density-connectivity in density-based clustering, with points q, m, p, s, r, o and MinPts = 3. Source: Based on Ester, Kriegel, Sander, and Xu [EKSX96].]

30 Density based clustering (cont.)
How does DBSCAN find clusters?
1. Initially, all objects in data set $D$ are marked as unvisited.
2. It randomly selects an unvisited object $p$, marks $p$ as visited, and checks whether $p$ is a core point.
3. If $p$ is not a core point, then $p$ is marked as a noise point. Otherwise, a new cluster $C$ is created for $p$, and all the objects in the ɛ-neighborhood of $p$ are added to a candidate set $N$.
4. DBSCAN iteratively adds to $C$ those objects in $N$ that do not belong to any cluster.
5. In this process, for an object $p' \in N$ that carries the label unvisited, DBSCAN marks it as visited and checks its ɛ-neighborhood.
6. If $p'$ is a core point, then the objects in its ɛ-neighborhood are added to $N$.
7. DBSCAN continues adding objects to $C$ until $C$ can no longer be expanded, that is, until $N$ is empty. At this point, cluster $C$ is complete and is output.
8. To find the next cluster, DBSCAN randomly selects an unvisited object from the remaining ones.
9. The clustering process continues until all objects are visited.
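The steps above can be sketched in plain Python as follows. This is an illustrative sketch, not a reference implementation: it scans objects in index order rather than selecting them at random (so the output is reproducible), it uses a brute-force neighborhood search, and it assumes Euclidean distance. A point first marked as noise may later be absorbed into a cluster as a border point, which matches step 4 because noise points do not belong to any cluster.

```python
import math

NOISE = -1


def neighborhood(points, i, eps):
    """Indices of all points within distance eps of points[i], including i."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)      # None: not yet in any cluster
    visited = [False] * len(points)
    cluster_id = 0
    for p in range(len(points)):       # index order instead of random choice
        if visited[p]:
            continue
        visited[p] = True
        neighbors = neighborhood(points, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = NOISE          # not a core point
            continue
        labels[p] = cluster_id         # start a new cluster C at core point p
        candidates = list(neighbors)   # the candidate set N
        while candidates:
            q = candidates.pop()
            if not visited[q]:
                visited[q] = True
                q_neighbors = neighborhood(points, q, eps)
                if len(q_neighbors) >= min_pts:
                    candidates.extend(q_neighbors)  # q is core: grow N
            if labels[q] is None or labels[q] == NOISE:
                labels[q] = cluster_id  # q joins C if it has no cluster yet
        cluster_id += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

On this toy data the two squares of four points form two clusters, and the isolated point at (5, 5) is reported as noise.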

31 Density based clustering (example)
[Figure: density-based clustering example on a two-dimensional dataset.]

32 Grid-based clustering
The grid-based clustering approach uses a multiresolution grid data structure.
[Figure: Hierarchical structure for STING clustering, showing the first layer, the (i − 1)st layer, and the ith layer.]

34 Cluster validation and assessment
Cluster evaluation assesses the feasibility of clustering analysis on a data set and the quality of the results generated by a clustering method. The major tasks of clustering evaluation include the following:
1. Assessing clustering tendency: In this task, for a given data set, we assess whether a nonrandom structure exists in the data. Clustering analysis on a data set is meaningful only when there is a nonrandom structure in the data.
2. Determining the number of clusters in a data set: Algorithms such as k-means require the number of clusters in a data set as a parameter. Moreover, the number of clusters can be regarded as an interesting and important summary statistic of a data set. Therefore, it is desirable to estimate this number even before a clustering algorithm is used to derive detailed clusters. A simple method is to set the number of clusters to about $\sqrt{n/2}$ for a data set of $n$ points.
3. Measuring clustering quality: After applying a clustering method on a data set, we want to assess how good the resulting clusters are. There are also measures that score clusterings and thus can compare two sets of clustering results on the same data set.

35 Assessing clustering tendency

36 Cluster validation and assessment
How good is the clustering generated by a method, and how can we compare the clusterings generated by different methods?
1. Internal criterion: Typical objective functions in clustering formalize the goal of attaining high intra-cluster similarity and low inter-cluster similarity. But good scores on an internal criterion do not necessarily translate into good effectiveness in an application. An alternative to internal criteria is direct evaluation in the application of interest.
2. External criterion: An external criterion evaluates how well the clustering matches the gold standard classes. The Rand index measures the percentage of pairwise decisions that are correct, where a decision is whether two objects are placed in the same cluster or in different clusters.
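As a small illustration of the external criterion, the Rand index can be computed by checking, for every pair of objects, whether the clustering and the gold standard agree on putting the pair together or apart. The function name and the labels below are made up for the example.

```python
from itertools import combinations


def rand_index(predicted, gold):
    """Fraction of object pairs on which the clustering and gold classes agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(predicted)), 2):
        same_cluster = predicted[i] == predicted[j]
        same_class = gold[i] == gold[j]
        if same_cluster == same_class:
            agree += 1  # a correct decision: together in both or apart in both
        total += 1
    return agree / total


predicted = [0, 0, 1, 1, 1]
gold = ["a", "a", "a", "b", "b"]
ri = rand_index(predicted, gold)
```

Here 6 of the 10 pairs are decided correctly, so the Rand index is 0.6. A clustering identical to the gold standard, up to a renaming of the cluster labels, scores 1.0.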


More information

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017)

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017) 1 Notes Reminder: HW2 Due Today by 11:59PM TA s note: Please provide a detailed ReadMe.txt file on how to run the program on the STDLINUX. If you installed/upgraded any package on STDLINUX, you should

More information

University of Florida CISE department Gator Engineering. Clustering Part 5

University of Florida CISE department Gator Engineering. Clustering Part 5 Clustering Part 5 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville SNN Approach to Clustering Ordinary distance measures have problems Euclidean

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Cluster Analysis. CSE634 Data Mining

Cluster Analysis. CSE634 Data Mining Cluster Analysis CSE634 Data Mining Agenda Introduction Clustering Requirements Data Representation Partitioning Methods K-Means Clustering K-Medoids Clustering Constrained K-Means clustering Introduction

More information

Cluster Analysis: Basic Concepts and Methods

Cluster Analysis: Basic Concepts and Methods HAN 17-ch10-443-496-9780123814791 2011/6/1 3:44 Page 443 #1 10 Cluster Analysis: Basic Concepts and Methods Imagine that you are the Director of Customer Relationships at AllElectronics, and you have five

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster

More information

DS504/CS586: Big Data Analytics Big Data Clustering II

DS504/CS586: Big Data Analytics Big Data Clustering II Welcome to DS504/CS586: Big Data Analytics Big Data Clustering II Prof. Yanhua Li Time: 6pm 8:50pm Thu Location: KH 116 Fall 2017 Updates: v Progress Presentation: Week 15: 11/30 v Next Week Office hours

More information

CS Introduction to Data Mining Instructor: Abdullah Mueen

CS Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts

More information

Lecture 7 Cluster Analysis: Part A

Lecture 7 Cluster Analysis: Part A Lecture 7 Cluster Analysis: Part A Zhou Shuigeng May 7, 2007 2007-6-23 Data Mining: Tech. & Appl. 1 Outline What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Road map. Basic concepts

Road map. Basic concepts Clustering Basic concepts Road map K-means algorithm Representation of clusters Hierarchical clustering Distance functions Data standardization Handling mixed attributes Which clustering algorithm to use?

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

PAM algorithm. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methods. Partitioning i Methods. Hierarchical Methods

PAM algorithm. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methods. Partitioning i Methods. Hierarchical Methods Whatis Cluster Analysis? Clustering Types of Data in Cluster Analysis Clustering part II A Categorization of Major Clustering Methods Partitioning i Methods Hierarchical Methods Partitioning i i Algorithms:

More information

Data Mining Algorithms

Data Mining Algorithms for the original version: -JörgSander and Martin Ester - Jiawei Han and Micheline Kamber Data Management and Exploration Prof. Dr. Thomas Seidl Data Mining Algorithms Lecture Course with Tutorials Wintersemester

More information

7.1 Euclidean and Manhattan distances between two objects The k-means partitioning algorithm... 21

7.1 Euclidean and Manhattan distances between two objects The k-means partitioning algorithm... 21 Contents 7 Cluster Analysis 7 7.1 What Is Cluster Analysis?......................................... 7 7.2 Types of Data in Cluster Analysis.................................... 9 7.2.1 Interval-Scaled

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning

More information

Chapter 4: Text Clustering

Chapter 4: Text Clustering 4.1 Introduction to Text Clustering Clustering is an unsupervised method of grouping texts / documents in such a way that in spite of having little knowledge about the content of the documents, we can

More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

Clustering algorithms

Clustering algorithms Clustering algorithms Machine Learning Hamid Beigy Sharif University of Technology Fall 1393 Hamid Beigy (Sharif University of Technology) Clustering algorithms Fall 1393 1 / 22 Table of contents 1 Supervised

More information

University of Florida CISE department Gator Engineering. Clustering Part 4

University of Florida CISE department Gator Engineering. Clustering Part 4 Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Community Detection. Jian Pei: CMPT 741/459 Clustering (1) 2

Community Detection. Jian Pei: CMPT 741/459 Clustering (1) 2 Clustering Community Detection http://image.slidesharecdn.com/communitydetectionitilecturejune0-0609559-phpapp0/95/community-detection-in-social-media--78.jpg?cb=3087368 Jian Pei: CMPT 74/459 Clustering

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

COMP 465: Data Mining Still More on Clustering

COMP 465: Data Mining Still More on Clustering 3/4/015 Exercise COMP 465: Data Mining Still More on Clustering Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Describe each of the following

More information

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms. Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Survey of Clustering

More information

Sponsored by AIAT.or.th and KINDML, SIIT

Sponsored by AIAT.or.th and KINDML, SIIT CC: BY NC ND Table of Contents Chapter 4. Clustering and Association Analysis... 171 4.1. Cluster Analysis or Clustering... 171 4.1.1. Distance and similarity measurement... 173 4.1.2. Clustering Methods...

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Clustering (Basic concepts and Algorithms) Entscheidungsunterstützungssysteme

Clustering (Basic concepts and Algorithms) Entscheidungsunterstützungssysteme Clustering (Basic concepts and Algorithms) Entscheidungsunterstützungssysteme Why do we need to find similarity? Similarity underlies many data science methods and solutions to business problems. Some

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1 Week 8 Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Part I Clustering 2 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/004 1

More information

Analysis and Extensions of Popular Clustering Algorithms

Analysis and Extensions of Popular Clustering Algorithms Analysis and Extensions of Popular Clustering Algorithms Renáta Iváncsy, Attila Babos, Csaba Legány Department of Automation and Applied Informatics and HAS-BUTE Control Research Group Budapest University

More information

Finding Clusters 1 / 60

Finding Clusters 1 / 60 Finding Clusters Types of Clustering Approaches: Linkage Based, e.g. Hierarchical Clustering Clustering by Partitioning, e.g. k-means Density Based Clustering, e.g. DBScan Grid Based Clustering 1 / 60

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Slides From Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Slides From Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Slides From Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining

More information

Lesson 3. Prof. Enza Messina

Lesson 3. Prof. Enza Messina Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

More information

A Review on Cluster Based Approach in Data Mining

A Review on Cluster Based Approach in Data Mining A Review on Cluster Based Approach in Data Mining M. Vijaya Maheswari PhD Research Scholar, Department of Computer Science Karpagam University Coimbatore, Tamilnadu,India Dr T. Christopher Assistant professor,

More information

Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY

Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY Clustering Algorithm Clustering is an unsupervised machine learning algorithm that divides a data into meaningful sub-groups,

More information

Unsupervised Learning Hierarchical Methods

Unsupervised Learning Hierarchical Methods Unsupervised Learning Hierarchical Methods Road Map. Basic Concepts 2. BIRCH 3. ROCK The Principle Group data objects into a tree of clusters Hierarchical methods can be Agglomerative: bottom-up approach

More information

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample CS 1675 Introduction to Machine Learning Lecture 18 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem:

More information

Machine Learning. Unsupervised Learning. Manfred Huber

Machine Learning. Unsupervised Learning. Manfred Huber Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training

More information

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM. Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)

More information

Hierarchical Clustering

Hierarchical Clustering What is clustering Partitioning of a data set into subsets. A cluster is a group of relatively homogeneous cases or observations Hierarchical Clustering Mikhail Dozmorov Fall 2016 2/61 What is clustering

More information

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest)

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Medha Vidyotma April 24, 2018 1 Contd. Random Forest For Example, if there are 50 scholars who take the measurement of the length of the

More information

Clustering in Ratemaking: Applications in Territories Clustering

Clustering in Ratemaking: Applications in Territories Clustering Clustering in Ratemaking: Applications in Territories Clustering Ji Yao, PhD FIA ASTIN 13th-16th July 2008 INTRODUCTION Structure of talk Quickly introduce clustering and its application in insurance ratemaking

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

Knowledge Discovery in Databases

Knowledge Discovery in Databases Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Lecture notes Knowledge Discovery in Databases Summer Semester 2012 Lecture 8: Clustering

More information

What is Cluster Analysis? COMP 465: Data Mining Clustering Basics. Applications of Cluster Analysis. Clustering: Application Examples 3/17/2015

What is Cluster Analysis? COMP 465: Data Mining Clustering Basics. Applications of Cluster Analysis. Clustering: Application Examples 3/17/2015 // What is Cluster Analysis? COMP : Data Mining Clustering Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, rd ed. Cluster: A collection of data

More information

Cluster Analysis. Outline. Motivation. Examples Applications. Han and Kamber, ch 8

Cluster Analysis. Outline. Motivation. Examples Applications. Han and Kamber, ch 8 Outline Cluster Analysis Han and Kamber, ch Partitioning Methods Hierarchical Methods Density-Based Methods Grid-Based Methods Model-Based Methods CS by Rattikorn Hewett Texas Tech University Motivation

More information

Kapitel 4: Clustering

Kapitel 4: Clustering Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.

More information

Clustering Techniques

Clustering Techniques Clustering Techniques Marco BOTTA Dipartimento di Informatica Università di Torino botta@di.unito.it www.di.unito.it/~botta/didattica/clustering.html Data Clustering Outline What is cluster analysis? What

More information

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a Week 9 Based in part on slides from textbook, slides of Susan Holmes Part I December 2, 2012 Hierarchical Clustering 1 / 1 Produces a set of nested clusters organized as a Hierarchical hierarchical clustering

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Team 2 Prof. Anita Wasilewska CSE 634 Data Mining All Sources Used for the Presentation Olson CF. Parallel algorithms

More information

Mixture Models and the EM Algorithm

Mixture Models and the EM Algorithm Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Finite Mixture Models Say we have a data set D = {x 1,..., x N } where x i is

More information

Data Mining 4. Cluster Analysis

Data Mining 4. Cluster Analysis Data Mining 4. Cluster Analysis 4.5 Spring 2010 Instructor: Dr. Masoud Yaghini Introduction DBSCAN Algorithm OPTICS Algorithm DENCLUE Algorithm References Outline Introduction Introduction Density-based

More information

Clustering Lecture 4: Density-based Methods

Clustering Lecture 4: Density-based Methods Clustering Lecture 4: Density-based Methods Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced

More information

Introduction to Mobile Robotics

Introduction to Mobile Robotics Introduction to Mobile Robotics Clustering Wolfram Burgard Cyrill Stachniss Giorgio Grisetti Maren Bennewitz Christian Plagemann Clustering (1) Common technique for statistical data analysis (machine learning,

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

DATA MINING - 1DL105, 1Dl111. An introductory class in data mining

DATA MINING - 1DL105, 1Dl111. An introductory class in data mining 1 DATA MINING - 1DL105, 1Dl111 Fall 007 An introductory class in data mining http://user.it.uu.se/~udbl/dm-ht007/ alt. http://www.it.uu.se/edu/course/homepage/infoutv/ht07 Kjell Orsborn Uppsala Database

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

Clustering. Chapter 10 in Introduction to statistical learning

Clustering. Chapter 10 in Introduction to statistical learning Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What

More information

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity

More information

Clustering Tips and Tricks in 45 minutes (maybe more :)

Clustering Tips and Tricks in 45 minutes (maybe more :) Clustering Tips and Tricks in 45 minutes (maybe more :) Olfa Nasraoui, University of Louisville Tutorial for the Data Science for Social Good Fellowship 2015 cohort @DSSG2015@University of Chicago https://www.researchgate.net/profile/olfa_nasraoui

More information

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

COMPARISON OF DENSITY-BASED CLUSTERING ALGORITHMS

COMPARISON OF DENSITY-BASED CLUSTERING ALGORITHMS COMPARISON OF DENSITY-BASED CLUSTERING ALGORITHMS Mariam Rehman Lahore College for Women University Lahore, Pakistan mariam.rehman321@gmail.com Syed Atif Mehdi University of Management and Technology Lahore,

More information

Clustering. Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester / 238

Clustering. Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester / 238 Clustering Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester 2015 163 / 238 What is Clustering? Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester

More information

Data Mining: Concepts and Techniques. Chapter 7 Jiawei Han. University of Illinois at Urbana-Champaign. Department of Computer Science

Data Mining: Concepts and Techniques. Chapter 7 Jiawei Han. University of Illinois at Urbana-Champaign. Department of Computer Science Data Mining: Concepts and Techniques Chapter 7 Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign www.cs.uiuc.edu/~hanj 6 Jiawei Han and Micheline Kamber, All rights reserved

More information

DBSCAN. Presented by: Garrett Poppe

DBSCAN. Presented by: Garrett Poppe DBSCAN Presented by: Garrett Poppe A density-based algorithm for discovering clusters in large spatial databases with noise by Martin Ester, Hans-peter Kriegel, Jörg S, Xiaowei Xu Slides adapted from resources

More information

Clustering and Dissimilarity Measures. Clustering. Dissimilarity Measures. Cluster Analysis. Perceptually-Inspired Measures

Clustering and Dissimilarity Measures. Clustering. Dissimilarity Measures. Cluster Analysis. Perceptually-Inspired Measures Clustering and Dissimilarity Measures Clustering APR Course, Delft, The Netherlands Marco Loog May 19, 2008 1 What salient structures exist in the data? How many clusters? May 19, 2008 2 Cluster Analysis

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Scalable Clustering Methods: BIRCH and Others Reading: Chapter 10.3 Han, Chapter 9.5 Tan Cengiz Gunay, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei.

More information

d(2,1) d(3,1 ) d (3,2) 0 ( n, ) ( n ,2)......

d(2,1) d(3,1 ) d (3,2) 0 ( n, ) ( n ,2)...... Data Mining i Topic: Clustering CSEE Department, e t, UMBC Some of the slides used in this presentation are prepared by Jiawei Han and Micheline Kamber Cluster Analysis What is Cluster Analysis? Types

More information

COMP 551 Applied Machine Learning Lecture 13: Unsupervised learning

COMP 551 Applied Machine Learning Lecture 13: Unsupervised learning COMP 551 Applied Machine Learning Lecture 13: Unsupervised learning Associate Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551

More information

Chapter DM:II. II. Cluster Analysis

Chapter DM:II. II. Cluster Analysis Chapter DM:II II. Cluster Analysis Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained Cluster Analysis DM:II-1

More information

SGN (4 cr) Chapter 11

SGN (4 cr) Chapter 11 SGN-41006 (4 cr) Chapter 11 Clustering Jussi Tohka & Jari Niemi Department of Signal Processing Tampere University of Technology February 25, 2014 J. Tohka & J. Niemi (TUT-SGN) SGN-41006 (4 cr) Chapter

More information