Clustering in Data Mining

1 Clustering in Data Mining

2 Classification Vs Clustering
When objects are grouped by a single parameter whose value is known for each object, the task is called classification. For example, customers can be labelled child, young, adult or old based on their age. When many parameters describe the profile of an object, it is difficult to segregate objects by fixed ranges, and a clustering tool is needed.
gdeepak.com

3 Clustering
Clustering is the process of grouping a set of data objects such that objects within a cluster are highly similar to one another (intra-cluster cohesiveness) and highly dissimilar to objects in other clusters (inter-cluster distinctiveness).

4 Few points
Different clustering algorithms may produce different clusterings. Applying several clustering algorithms to the same data can be useful because it may reveal previously unknown groupings within the data. Applications are very diverse: image patterns, biological patterns, search patterns, attack or hacking patterns, clustering documents by topic, clustering letters by handwriting, finding teams or cells in social networking sites, and so on.

5 Supervised Vs Unsupervised learning
In supervised learning, a few initial entries are labelled by a human to indicate the class they belong to. It is prevalent in machine learning. Clustering is learning by observation and is unsupervised, because it looks for similarities among the objects without any labels.

6 Characteristics of good clustering algorithm
Does your algorithm work only for a few hundred data objects, or does it scale to billions of objects? Real-life data volumes are very large. Are you dealing only with numbers? Your algorithm should also handle images, documents, sequences and graphs. Does your algorithm produce only regularly shaped clusters? Complex and critical applications may also have clusters of arbitrary shape.

7 Characteristics of good clustering algorithm
Does your algorithm depend on domain expertise, or does it handle clustering independently? Is your algorithm online or offline? This determines whether it can offer incremental updates. Does your algorithm discard or moderate inaccurate, noisy or erroneous data? How many dimensions can it deal with? Giving a semantic interpretation of the clusters as a post-processing step is helpful for the user.

8 Orthogonality in clustering
Clusters may be at the same level, e.g. schools, colleges and universities. Clusters may form a hierarchy, e.g. colleges contain Engineering, Pharmacy, Law etc. Clusters may be exclusive, meaning each object belongs to one cluster only. Clusters may be non-exclusive, e.g. a few colleges may offer Engineering, MBA and Diploma programmes together.

9 Partitioning Technique
Case 1: The number of clusters to be formed is also determined by the algorithm. Case 2: The number of clusters to be formed is given initially.

10 Centroid based K-means Algorithm
c_i is the centroid of cluster C_i and acts as its center point. It is the mean of all the objects assigned to that cluster. The objective is to minimize the distance between the objects of a cluster and the representative object of that cluster:
$E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - c_i \rVert^2$
For each object in each cluster, the distance from the object to its cluster center is squared, and the squared distances are summed. Minimizing E tries to make the resulting clusters as compact and as separate as possible.
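The within-cluster sum of squared errors can be checked numerically. The following is a minimal sketch, assuming Euclidean distance and clusters given as lists of coordinate tuples; the function name `sse` and the sample data are illustrative and not part of the slides.

```python
def sse(clusters):
    """Sum over clusters of squared distances from each point to the cluster mean."""
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # Centroid = coordinate-wise mean of the points in this cluster.
        centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        for p in points:
            total += sum((p[d] - centroid[d]) ** 2 for d in range(dim))
    return total

# Same four points, grouped two different ways (hypothetical data).
compact = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
mixed = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
print(sse(compact))  # 4.0
print(sse(mixed))    # 100.0
```

The compact grouping has the lower error, which is what the objective rewards.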

11 K-means algorithm
The problem is NP-hard for k clusters even in 2-D Euclidean space. One practical heuristic is the k-means algorithm. The number of clusters k is given as input, and the output is k clusters together with the objects they contain. Randomly choose k points from the data set D as the initial cluster centers. Assign each object to the cluster whose center is nearest, then recalculate the mean of each cluster. Repeat the assignment and update step until the clusters no longer change.
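The steps on this slide can be sketched in a few lines of Python. This is an illustrative version only: it assumes squared Euclidean distance, draws the initial centers with a seeded random generator so runs are repeatable, and keeps the previous center if a cluster becomes empty. The names `kmeans`, `max_iter` and `seed` are assumptions, not taken from the slides.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of coordinate tuples; returns (centers, assignment)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # randomly choose k points as centers
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:     # no change in the clusters: stop
            break
        assignment = new_assignment
        # Update step: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                      # assumption: keep old center if empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignment = kmeans(points, k=2)
print(assignment)
```

Which cluster receives label 0 depends on the random initial centers, so only the grouping, not the label values, is meaningful.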

12 Issues in k-means algorithm
It requires the value of k to be supplied. It is not suitable for irregularly shaped clusters. It does not always converge to the global optimum and may stop at a local optimum. The result depends on the initial selection of cluster centers. Complexity is O(nkt), where n is the number of objects, k is the number of clusters and t is the number of iterations. Outliers can distort the cluster means. Different researchers have suggested different methods for selecting the initial k points, and the choice also depends on the domain and application area.

13 K-Medoid Algorithm
Instead of taking the mean value of the cluster points, an actual point is chosen as the representative element (medoid) of the cluster. Choose an initial set of medoids that may suitably represent each cluster. Then try each of the remaining objects as a replacement for each existing medoid. A replacement is accepted if it reduces the absolute error. PAM (Partitioning Around Medoids) is a popular implementation of the k-medoids algorithm.
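The swap idea behind k-medoids can be illustrated with a short greedy sketch. It is not the reference PAM implementation: it assumes Euclidean distance, takes the first k points as initial medoids (a naive choice made only for illustration), and accepts any swap that lowers the total absolute error. The names `total_cost` and `pam_sketch` are hypothetical.

```python
from math import dist

def total_cost(points, medoids):
    """Absolute error: sum of distances from each point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam_sketch(points, k):
    medoids = list(points[:k])               # assumption: naive initial medoids
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                # Try replacing medoid i with a non-medoid object.
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:              # keep the swap only if error drops
                    medoids, best, improved = trial, cost, True
    return medoids, best

points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
medoids, cost = pam_sketch(points, k=2)
print(sorted(medoids), cost)  # [(1.0, 1.0), (8.0, 8.0)] 4.0
```

Every returned medoid is one of the input points, which is the difference from k-means, where a center is a mean that need not coincide with any object.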

14 Issues in K-Medoid algorithm
Complexity is O(k(n-k)^2) per iteration, where k is the number of clusters and n is the total number of data objects. The algorithm becomes computationally intensive for large values of n and k and does not scale well. It is less sensitive to outliers than the k-means algorithm. Another implementation that scales well for some applications is CLARA (Clustering LARge Applications). It draws a sample of the data and applies PAM to the sample rather than to all objects. If an object is one of the best k medoids but is not selected during sampling, CLARA will never find the best clustering.

15 Hierarchical Methods
The bottom-up (agglomerative) approach starts with each object in its own cluster and merges clusters at every iteration. The top-down (divisive) approach starts with one cluster containing all objects and splits clusters at every iteration.

16 Agglomerative Clustering
Dendrograms are used to represent the merges. The method requires a termination criterion, otherwise it ends with a single cluster containing all the objects. The distance measure is important for determining the similarity or dissimilarity of clusters. Merge operations cannot be undone, which may lead to locally optimal results. Complexity is at least O(n^2), so it does not scale well to large volumes. AGNES (Agglomerative Nesting) is one of the implementations.
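A bottom-up merge loop with an explicit termination criterion can be sketched as follows. This is a naive illustration, not AGNES itself: it assumes the minimum (single-link) distance between clusters, stops when k clusters remain, and rescans all pairs on every merge. The function name `agglomerative` and the stopping rule are assumptions made for the example.

```python
from math import dist

def agglomerative(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]         # start: every object is its own cluster
    while len(clusters) > k:                 # termination criterion (assumed)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-link distance between cluster i and cluster j.
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # The merge is final; there is no step that undoes it later.
        clusters[i].extend(clusters.pop(j))
    return clusters

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
print(agglomerative(points, k=3))
```

Without the `k` stopping rule the loop would continue until a single cluster holds every object.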

17 Divisive Clustering
There are exponentially many ways in which a cluster can be partitioned into smaller clusters. This makes splitting more difficult and challenging than merging. Divisive methods do not backtrack on their partitioning decisions. DIANA (Divisive Analysis) is one of the implementations.

18 Different Distance Measures
Here t_ip denotes an element of cluster K_i and t_jq an element of cluster K_j.
Minimum distance (single link): $dist_{min}(K_i, K_j) = \min_{p,q} dist(t_{ip}, t_{jq})$, the smallest distance between an element in one cluster and an element in the other.
Maximum distance (complete link): $dist_{max}(K_i, K_j) = \max_{p,q} dist(t_{ip}, t_{jq})$, the largest distance between an element in one cluster and an element in the other.
Average distance: $dist_{avg}(K_i, K_j) = \operatorname{avg}_{p,q} dist(t_{ip}, t_{jq})$, the average distance between an element in one cluster and an element in the other.

19 More distance measures
Centroid: $dist(K_i, K_j) = dist(C_i, C_j)$, the distance between the centroids of the two clusters.
Medoid: $dist(K_i, K_j) = dist(M_i, M_j)$, the distance between the medoids of the two clusters.
The centroid of a cluster can be calculated as $C_m = \frac{1}{N} \sum_{i=1}^{N} t_{ip}$, where the sum runs over the N objects of the cluster.
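The inter-cluster measures listed on these two slides can be compared side by side. The sketch below assumes Euclidean distance between individual objects and represents a cluster as a list of coordinate tuples; the function names are illustrative only.

```python
from itertools import product
from math import dist

def single_link(a, b):
    """Minimum distance over all cross-cluster pairs."""
    return min(dist(p, q) for p, q in product(a, b))

def complete_link(a, b):
    """Maximum distance over all cross-cluster pairs."""
    return max(dist(p, q) for p, q in product(a, b))

def average_link(a, b):
    """Average distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def centroid(cluster):
    """Coordinate-wise mean of the objects in the cluster."""
    return tuple(sum(x) / len(cluster) for x in zip(*cluster))

def centroid_link(a, b):
    """Distance between the two cluster centroids."""
    return dist(centroid(a), centroid(b))

a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]
print(single_link(a, b))    # 3.0
print(complete_link(a, b))  # sqrt(13), about 3.606
print(centroid_link(a, b))  # 3.0
```

For these two clusters the average-link value lies between the single-link and complete-link values, as it must for any pair of clusters.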

20 Multiphase clustering
Here hierarchical clustering is combined with other techniques to make the algorithms less complex and more sensitive. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) and Chameleon are two popular implementations.

21 BIRCH
It is scalable and can undo clustering decisions made in intermediate steps. It uses the cluster feature (CF) concept and a cluster feature tree (CF-tree) to represent the cluster hierarchy. For a cluster of n d-dimensional data objects, the CF is a three-component vector summarizing the cluster, CF = <n, LS, SS>, where LS is the linear sum of the n data points and SS is the square sum of the n data points. The cluster diameter is
$D = \sqrt{\frac{\sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_j)^2}{n(n-1)}}$
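A CF entry is useful because it is additive and because the diameter can be recovered from it without revisiting the points. The sketch below assumes SS is stored as a single scalar (the sum of squared norms) and uses the identity sum_i sum_j (x_i - x_j)^2 = 2n·SS - 2·|LS|^2; the function names are illustrative and not from any BIRCH library.

```python
from math import sqrt

def cf_of(points):
    """Build CF = (n, LS, SS) from a list of coordinate tuples."""
    n = len(points)
    ls = tuple(sum(x) for x in zip(*points))       # linear sum, per dimension
    ss = sum(x * x for p in points for x in p)     # scalar square sum (assumed form)
    return n, ls, ss

def cf_merge(a, b):
    """CFs are additive: merging two clusters adds their components."""
    return a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2]

def cf_diameter(cf):
    """Average pairwise distance (root mean square) computed from the CF alone."""
    n, ls, ss = cf
    if n < 2:
        return 0.0
    squared = (2 * n * ss - 2 * sum(x * x for x in ls)) / (n * (n - 1))
    return sqrt(max(squared, 0.0))                 # guard against rounding below zero

left = cf_of([(0.0, 0.0)])
right = cf_of([(2.0, 0.0)])
merged = cf_merge(left, right)
print(merged)               # (2, (2.0, 0.0), 4.0)
print(cf_diameter(merged))  # 2.0
```

This additivity is what allows a non-leaf node to summarize its children by summing their CFs.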

22 CF-Tree
[Figure: CF-tree]

23 CF Tree features
It is a height-balanced tree that stores the features of the clusters. A non-leaf node has children and stores the sums of the CFs of its children. A CF-tree has two parameters: the branching factor (the maximum number of children per node) and the threshold (the maximum diameter of the sub-clusters stored at the leaves).

24 BIRCH algorithm
Phase 1: It scans the database to build an initial in-memory CF-tree, which can be seen as a multilevel compression of the data that tries to preserve the data's inherent clustering structure. Phase 2: It applies another clustering algorithm to cluster the leaf nodes of the CF-tree, which removes sparse clusters as outliers and groups dense clusters into larger ones.

25 Building CF tree
For each point in the input, find the closest leaf entry. Add the point to that leaf entry and update its CF. If the entry diameter exceeds max_diameter, split the leaf. If required, split the parent as well, and so on up the tree.

26 Disadvantages of BIRCH
Natural clusters are not formed, because the size of the leaf nodes is fixed. Because radius and diameter parameters are used, clusters tend to be spherical. The insertion order of the data points may change the clustering. However, complexity is O(n) and the algorithm scales well with the number of objects.

27 Chameleon
It is based on dynamic modelling of the graph. It is more adaptive and natural with respect to the number of clusters. It uses a k-nearest-neighbor approach to construct a sparse graph, where each vertex represents a data object and there is an edge between two vertices if one object is among the k most similar objects to the other. The edge weight represents the similarity between the objects. It partitions the graph into a large number of small subgraphs, which may represent sub-clusters, such that the edge cut is minimized.
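The sparse graph described on this slide can be built directly from the definition. The following sketch covers only the graph construction step, not the partitioning or merging; it assumes Euclidean distance and uses 1 / (1 + distance) as the similarity weight, which is an illustrative choice and not the weighting prescribed by Chameleon.

```python
from math import dist

def knn_graph(points, k):
    """Return {(i, j): weight} with i < j; an edge exists if either object
    is among the k nearest neighbors of the other."""
    edges = {}
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        for j in others[:k]:
            edge = (min(i, j), max(i, j))
            # Assumed similarity: closer objects get a larger weight.
            edges[edge] = 1.0 / (1.0 + dist(p, points[j]))
    return edges

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
print(knn_graph(points, k=1))  # {(0, 1): 0.5, (2, 3): 0.5}
```

With k = 1 the two well-separated pairs are already disconnected, so the graph has no edge to cut between them.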

28 Illustration
[Figure: Chameleon illustration]

29 Chameleon
After partitioning, it uses agglomerative hierarchical clustering to merge sub-clusters based on their similarity. Similarity is calculated from the relative interconnectivity (the connectivity between c1 and c2 relative to their internal connectivity) and the relative closeness (the closeness between c1 and c2 relative to their internal closeness). Its advantage lies in discovering arbitrarily shaped clusters of high quality. However, its worst-case complexity of O(n^2) may not scale well for high-dimensional data.

30 Flaws in distance based algorithms
It is very difficult to choose a good distance measure. Missing attribute values in data objects are not allowed. The methods are based on local search, so the optimization goal is not clear.

31 Probabilistic hierarchical clustering
It uses probabilistic models to measure the distance between clusters. Data objects are regarded as samples of an underlying data generation mechanism that is to be analyzed. It can handle missing data values and is easy to comprehend. Generally, Gaussian or Bernoulli distributions are assumed.

32 Drawbacks
It outputs only one hierarchy for the chosen probabilistic model. Real-life data sets generally admit multiple hierarchies that fit the same data. Bayesian tree-structured models can address this drawback, but they are very complex to handle.

33 Density Based Techniques
These methods look for dense regions of the data space that are separated by sparse regions. The neighborhood concept is used to find the neighbors of a data object within a given radius. The density of a neighborhood can be measured by the number of objects in the given region.

34 Density Based Spatial Clustering of Applications with Noise (DBSCAN)
It uses two parameters: ε (the maximum radius of the neighborhood) and MinPts (the minimum density threshold of a region). An object is a core object if its ε-neighborhood contains at least MinPts objects. The first step is to find all core objects in the given data set. An object is directly density-reachable from a core object if it lies within distance ε of that core object.

35 Density terminology
An object p is density-reachable from a core object q if there is a chain of objects leading from q to p in which each object is directly density-reachable from the previous one. Two objects are density-connected with respect to ε and MinPts if both are density-reachable from one common object. A cluster is defined as a maximal set of density-connected points.

36 DBSCAN Algorithm
Mark all points as unvisited. Randomly select an unvisited point and mark it as visited. Check whether the ε-neighborhood of this point contains at least MinPts points. If yes, a new cluster C is formed and all the points in the ε-neighborhood are added to a candidate set N; otherwise the point is designated as a noise point. Iteratively add to C those objects in N that do not yet belong to a cluster; if the ε-neighborhood of such an object also contains at least MinPts points, add those points to N as well. Repeat this for the remaining data objects until all points have been visited.
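The procedure on this slide can be sketched compactly. This version makes a few simplifying assumptions: points are visited in index order rather than at random, neighborhoods are found by a linear scan (hence the quadratic cost), the ε-neighborhood of a point includes the point itself, and noise is labelled -1. The names `region_query`, `dbscan` and `NOISE` are illustrative.

```python
from math import dist

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)            # None means unvisited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE                # may later become a border point
            continue
        cluster += 1                         # i is a core object: new cluster
        labels[i] = cluster
        candidates = list(neighbors)         # candidate set N
        while candidates:
            j = candidates.pop()
            if labels[j] == NOISE:
                labels[j] = cluster          # border point: joins, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also core: grow the cluster
                candidates.extend(j_neighbors)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point at (50, 50) has only itself in its neighborhood, so it is reported as noise.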

37 DBSCAN Complexity
The general complexity is O(n^2), but with the use of spatial indexing it can be improved to O(n log n). One drawback is that it is the user's responsibility to select ε and MinPts, which is difficult in general. Slight variations in these values may lead to entirely different clusterings.

38 Ordering Points To Identify the Clustering Structure (OPTICS)
It provides a linear ordering of the data objects in which objects that are close in terms of density are placed near each other. This ordering can be based on different parameter settings. The user can even try different settings and inspect the clustering structure visually. It uses two parameters (ε and MinPts).

39 OPTICS process
The core distance of an object p is the smallest value ε' such that the ε'-neighborhood of p contains at least MinPts objects. So ε' is the minimum distance threshold that makes p a core object. The reachability distance to object p from q is the minimum radius that makes p directly density-reachable from q, which requires q to be a core object. Therefore the reachability distance from q to p is max{core-distance(q), dist(p, q)}. p may have multiple reachability distances with respect to different core objects. We are interested in the smallest reachability distance of p because it gives the shortest path by which p is connected to a dense cluster.
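The two distances defined on this slide translate directly into code. The sketch below assumes Euclidean distance, counts an object as a member of its own neighborhood, and returns `None` where the slide would say "undefined"; the function names are illustrative.

```python
from math import dist

def core_distance(points, p, eps, min_pts):
    """Smallest radius at which p has min_pts objects in its neighborhood,
    or None if p is not a core object within eps."""
    within = sorted(dist(p, q) for q in points if dist(p, q) <= eps)
    return within[min_pts - 1] if len(within) >= min_pts else None

def reachability_distance(points, p, q, eps, min_pts):
    """Reachability distance of p from q: max(core-distance(q), dist(p, q)),
    or None if q is not a core object."""
    cd = core_distance(points, q, eps, min_pts)
    return None if cd is None else max(cd, dist(p, q))

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (5.0, 0.0)]
print(core_distance(points, (0.0, 0.0), eps=3.0, min_pts=3))                      # 2.0
print(reachability_distance(points, (1.0, 0.0), (0.0, 0.0), eps=3.0, min_pts=3))  # 2.0
print(core_distance(points, (5.0, 0.0), eps=3.0, min_pts=3))                      # None
```

The object at (1, 0) is closer to (0, 0) than the core distance, so its reachability distance is clamped up to the core distance of 2.0.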

40 OPTICS algorithm
Start with an arbitrary object p from the data set. Determine its core distance by finding its ε-neighborhood, and set its reachability distance to undefined. Object p is written to the output. If p is not a core object, move to the next object in the OrderSeeds list; if that list is empty, take the next object from the input database. If p is a core object, then for each object q in its neighborhood update the reachability distance from p and insert q into the OrderSeeds list. Continue until the input list as well as the OrderSeeds list is empty.

41 DENCLUE (Density Based Clustering)
It uses proven mathematical functions for density estimation. Its time complexity is better than that of other algorithms. Data sets with a large amount of noise are also accommodated. In the other density-based algorithms, the results depend on the choice of ε.

42 Key functions
Kernel density estimation treats each observed object as an indicator of high probability density in its surrounding region. The probability density at a point depends on the distances from this point to the observed objects. DENCLUE uses a Gaussian kernel to estimate the density from the given set of objects to be clustered. A point is called a density attractor if it is a local maximum of the estimated density function.
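Gaussian kernel density estimation and the search for a density attractor can be illustrated in one dimension. This is a sketch under stated assumptions: the bandwidth `h` is a hypothetical smoothing parameter, and the hill climb uses a mean-shift style fixed-point update (the kernel-weighted mean of the data), which reaches a local maximum of the Gaussian estimate but is not claimed to be the exact climbing procedure of DENCLUE.

```python
from math import exp, pi, sqrt

def gaussian_density(x, data, h):
    """Kernel density estimate at x from 1-D observations with bandwidth h."""
    norm = len(data) * h * sqrt(2 * pi)
    return sum(exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / norm

def density_attractor(x, data, h, tol=1e-9, max_iter=1000):
    """Climb from x towards a local maximum of the estimated density.
    Assumes x starts close enough to the data that the weights do not all underflow."""
    for _ in range(max_iter):
        weights = [exp(-0.5 * ((x - xi) / h) ** 2) for xi in data]
        new_x = sum(w * xi for w, xi in zip(weights, data)) / sum(weights)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x

data = [0.8, 1.0, 1.2, 4.9, 5.0, 5.1]
print(gaussian_density(1.0, data, h=0.5) > gaussian_density(3.0, data, h=0.5))  # True
print(round(density_attractor(1.5, data, h=0.5), 2))  # 1.0
print(round(density_attractor(4.6, data, h=0.5), 2))  # 5.0
```

Objects whose climbs end at the same attractor would be assigned to that attractor, which is how the two groups in the sample data separate.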

43 DENCLUE
A cluster is a set of density attractors X together with a set of input objects C such that each object in C is assigned to a density attractor in X and there exists a path between every pair of density attractors along which the density stays above the threshold ξ. In this way clusters of arbitrary shape can be found.

44 Grid-based techniques
The space under consideration is divided into cells irrespective of the actual data points, and the cells are then processed further. Complexity depends on the number of cells.

45 STING (Statistical Information Grid)
The space is divided into rectangular cells in a hierarchical manner. A cell at a higher level points to a number of cells at the next lower level. The mean, median and other statistical parameters of each cell are precomputed.

46 Query algorithm
Processing starts from a selected layer of cells and proceeds top-down. For each cell in the layer, a confidence interval (estimated probability range) is calculated, which reflects the relevance of the cell to the given query. At the next lower level, only the relevant cells are examined, after the irrelevant cells have been removed. This is repeated until the lowest layer is reached.

47 Pros and Cons
It is easy to parallelize and supports incremental updates. Query-processing complexity is linear in the number of cells at the lowest layer. The disadvantage is that cluster boundaries are predefined by the grid.

48 Clustering in Quest (CLIQUE)
It exploits the monotonicity of dense cells with respect to dimensionality: a cell can be dense in k dimensions only if its projections onto lower-dimensional subspaces are also dense. Step 1: It partitions the d-dimensional data space into non-overlapping rectangular units. It finds the dense cells, i.e. those containing at least l points, where l is the density threshold. Then it joins dense cells across dimensions to generate candidate cells. The iteration terminates when no candidates can be generated or no candidate cells are dense.
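The first step can be illustrated for two dimensions. The sketch below is a simplified illustration, not the full algorithm: it assumes a uniform cell width, finds the dense one-dimensional units on each axis, and then keeps only those two-dimensional candidates (formed by joining dense 1-D units) that are themselves dense. The names `dense_units_1d`, `dense_units_2d`, `width` and `threshold` are hypothetical.

```python
from collections import Counter
from itertools import product

def dense_units_1d(points, dim, width, threshold):
    """Cells along one axis that contain at least `threshold` points."""
    counts = Counter(int(p[dim] // width) for p in points)
    return {cell for cell, n in counts.items() if n >= threshold}

def dense_units_2d(points, width, threshold):
    """Join dense 1-D units into 2-D candidates and keep the dense ones."""
    dense_x = dense_units_1d(points, 0, width, threshold)
    dense_y = dense_units_1d(points, 1, width, threshold)
    counts = Counter((int(p[0] // width), int(p[1] // width)) for p in points)
    # Monotonicity: a 2-D cell can only be dense if both projections are dense,
    # so only pairs of dense 1-D units need to be checked.
    return {cell for cell in product(dense_x, dense_y) if counts[cell] >= threshold}

points = [(0.1, 0.2), (0.5, 0.5), (0.9, 0.1),
          (5.5, 5.5), (5.1, 5.9), (5.2, 5.3), (0.5, 5.5)]
print(sorted(dense_units_2d(points, width=1.0, threshold=3)))  # [(0, 0), (5, 5)]
```

The candidates (0, 5) and (5, 0) are generated by the join but rejected because they contain fewer than three points.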

49 Clustering in Quest (CLIQUE)
Step 2: For each cluster, determine the maximal regions that cover the cluster of connected dense units. The disadvantage is that meaningful clustering depends on tuning the grid size, which is not easy to predict. The advantage is that it scales well with data size as well as with the number of dimensions. It also does not presume any particular distribution and is insensitive to the order of the inputs.

50 Other clustering techniques
Clustering high-dimensional data. Clustering graph and network data.

51 Questions, Comments and Suggestions

52 Question 1
Which is the popular implementation of the k-medoid algorithm?

53 Question 2
What is the full form of OPTICS?

54 Question 3
Chameleon is based on ______ of the graph.

What is Cluster Analysis? COMP 465: Data Mining Clustering Basics. Applications of Cluster Analysis. Clustering: Application Examples 3/17/2015

What is Cluster Analysis? COMP 465: Data Mining Clustering Basics. Applications of Cluster Analysis. Clustering: Application Examples 3/17/2015 // What is Cluster Analysis? COMP : Data Mining Clustering Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, rd ed. Cluster: A collection of data

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

University of Florida CISE department Gator Engineering. Clustering Part 4

University of Florida CISE department Gator Engineering. Clustering Part 4 Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

A Review on Cluster Based Approach in Data Mining

A Review on Cluster Based Approach in Data Mining A Review on Cluster Based Approach in Data Mining M. Vijaya Maheswari PhD Research Scholar, Department of Computer Science Karpagam University Coimbatore, Tamilnadu,India Dr T. Christopher Assistant professor,

More information

Clustering part II 1

Clustering part II 1 Clustering part II 1 Clustering What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods 2 Partitioning Algorithms:

More information

COMP 465: Data Mining Still More on Clustering

COMP 465: Data Mining Still More on Clustering 3/4/015 Exercise COMP 465: Data Mining Still More on Clustering Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Describe each of the following

More information

PAM algorithm. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methods. Partitioning i Methods. Hierarchical Methods

PAM algorithm. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methods. Partitioning i Methods. Hierarchical Methods Whatis Cluster Analysis? Clustering Types of Data in Cluster Analysis Clustering part II A Categorization of Major Clustering Methods Partitioning i Methods Hierarchical Methods Partitioning i i Algorithms:

More information

Unsupervised Learning Hierarchical Methods

Unsupervised Learning Hierarchical Methods Unsupervised Learning Hierarchical Methods Road Map. Basic Concepts 2. BIRCH 3. ROCK The Principle Group data objects into a tree of clusters Hierarchical methods can be Agglomerative: bottom-up approach

More information

A Comparative Study of Various Clustering Algorithms in Data Mining

A Comparative Study of Various Clustering Algorithms in Data Mining Available Online at International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 6.017 IJCSMC,

More information

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data

More information

Hierarchy. No or very little supervision Some heuristic quality guidances on the quality of the hierarchy. Jian Pei: CMPT 459/741 Clustering (2) 1

Hierarchy. No or very little supervision Some heuristic quality guidances on the quality of the hierarchy. Jian Pei: CMPT 459/741 Clustering (2) 1 Hierarchy An arrangement or classification of things according to inclusiveness A natural way of abstraction, summarization, compression, and simplification for understanding Typical setting: organize

More information


CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

Data Mining 4. Cluster Analysis

Data Mining 4. Cluster Analysis Data Mining 4. Cluster Analysis 4.5 Spring 2010 Instructor: Dr. Masoud Yaghini Introduction DBSCAN Algorithm OPTICS Algorithm DENCLUE Algorithm References Outline Introduction Introduction Density-based

More information

Clustering in Ratemaking: Applications in Territories Clustering

Clustering in Ratemaking: Applications in Territories Clustering Clustering in Ratemaking: Applications in Territories Clustering Ji Yao, PhD FIA ASTIN 13th-16th July 2008 INTRODUCTION Structure of talk Quickly introduce clustering and its application in insurance ratemaking

More information

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017)

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017) 1 Notes Reminder: HW2 Due Today by 11:59PM TA s note: Please provide a detailed ReadMe.txt file on how to run the program on the STDLINUX. If you installed/upgraded any package on STDLINUX, you should

More information

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest)

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Medha Vidyotma April 24, 2018 1 Contd. Random Forest For Example, if there are 50 scholars who take the measurement of the length of the

More information

CS Data Mining Techniques Instructor: Abdullah Mueen

CS Data Mining Techniques Instructor: Abdullah Mueen CS 591.03 Data Mining Techniques Instructor: Abdullah Mueen LECTURE 6: BASIC CLUSTERING Chapter 10. Cluster Analysis: Basic Concepts and Methods Cluster Analysis: Basic Concepts Partitioning Methods Hierarchical

More information

Lesson 3. Prof. Enza Messina

Lesson 3. Prof. Enza Messina Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Scalable Clustering Methods: BIRCH and Others Reading: Chapter 10.3 Han, Chapter 9.5 Tan Cengiz Gunay, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei.

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links:

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Clustering Techniques

Clustering Techniques Clustering Techniques Marco BOTTA Dipartimento di Informatica Università di Torino Data Clustering Outline What is cluster analysis? What

More information

Unsupervised Learning Partitioning Methods

Unsupervised Learning Partitioning Methods Unsupervised Learning Partitioning Methods Road Map 1. Basic Concepts 2. K-Means 3. K-Medoids 4. CLARA & CLARANS Cluster Analysis Unsupervised learning (i.e., Class label is unknown) Group data to form

More information

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI

More information

Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Clustering Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 31 Table of contents 1 Introduction 2 Data matrix and

More information

Cluster Analysis: Basic Concepts and Methods

Cluster Analysis: Basic Concepts and Methods HAN 17-ch10-443-496-9780123814791 2011/6/1 3:44 Page 443 #1 10 Cluster Analysis: Basic Concepts and Methods Imagine that you are the Director of Customer Relationships at AllElectronics, and you have five

More information

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/09/2018)

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/09/2018) 1 Notes Reminder: HW2 Due Today by 11:59PM TA s note: Please provide a detailed ReadMe.txt file on how to run the program on the STDLINUX. If you installed/upgraded any package on STDLINUX, you should

More information

DS504/CS586: Big Data Analytics Big Data Clustering II

DS504/CS586: Big Data Analytics Big Data Clustering II Welcome to DS504/CS586: Big Data Analytics Big Data Clustering II Prof. Yanhua Li Time: 6pm 8:50pm Thu Location: AK 232 Fall 2016 More Discussions, Limitations v Center based clustering K-means BFR algorithm

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm DATA MINING LECTURE 7 Hierarchical Clustering, DBSCAN The EM Algorithm CLUSTERING What is a Clustering? In general a grouping of objects such that the objects in a group (cluster) are similar (or related)

More information

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Team 2 Prof. Anita Wasilewska CSE 634 Data Mining All Sources Used for the Presentation Olson CF. Parallel algorithms

More information


CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/28/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information
