Data Clustering: Hierarchical Clustering, Density-Based Clustering, Grid-Based Clustering
Team 2, Prof. Anita Wasilewska, CSE 634 Data Mining

All Sources Used for the Presentation
Olson, C. F. Parallel algorithms for hierarchical clustering. Parallel Computing, 21(8):1313-1325, 1995.
Agrawal, R., et al. Automatic subspace clustering of high dimensional data for data mining applications. Vol. 27, No. 2, ACM, 1998.
Hierarchical Clustering. www.saedsayad.com/clustering_hierarchical.htm
https://nlp.stanford.edu/ir-book/html/htmledition/time-complexity-of-hac-1.html
https://en.wikipedia.org/wiki/hierarchical_clustering
https://cognitiveclass.ai/courses/machine-learning-with-python/
https://blog.dominodatalab.com/topology-and-density-based-clustering/
https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68

Overview: Hierarchical Clustering, Density-Based Clustering, Grid-Based Clustering, Related Papers

Hierarchical Clustering

Hierarchical Clustering
In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis) is a method of cluster analysis that seeks to build a hierarchy of clusters.

Two Types of Hierarchical Clustering
Agglomerative (bottom-up):
Assign each observation to its own cluster.
Compute the similarity (e.g., distance) between each pair of clusters and join the two most similar clusters.
Repeat until the termination condition is reached.
Divisive (top-down):
Assign all the observations to a single cluster.
Partition the cluster into the two least similar clusters.
Proceed recursively on each cluster until the termination condition is reached.

How to measure the distance between clusters?
Single linkage -- the shortest distance between any two points, one from each cluster.
Complete linkage -- the longest distance between any two points, one from each cluster.
Average linkage -- the average distance from each point in one cluster to every point in the other cluster.
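As an illustration of the three linkage choices above, here is a minimal sketch using SciPy's hierarchical clustering routines; the toy data and variable names are our own, not from the presentation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy 2-D data: two visually separated groups.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])

# Build the agglomerative merge tree under each linkage criterion and look at
# the merge distances; the last (largest) merge joins the two groups.
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)   # (n-1) x 4 matrix: merged ids, distance, size
    print(method, np.round(Z[:, 2], 3))
```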

Example (a sequence of example slides; figures not captured in the transcription)

Termination Condition (Agglomerative)
Method 1: pick a number n; the algorithm stops when n clusters have been formed.
Method 2: stop when the next merge would create a cluster with low cohesion. Example: define cohesion as a distance threshold between two clusters; if the distance between the two closest clusters is larger than this threshold, the algorithm stops.
Method 3: let the algorithm run until all the points are merged into a single cluster.
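A minimal sketch of how the first two stopping rules look in practice, again using SciPy; the threshold value and the data are illustrative assumptions, not from the slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])
Z = linkage(X, method="average")

# Method 1: stop when a fixed number of clusters (here n = 2) is reached.
labels_by_count = fcluster(Z, t=2, criterion="maxclust")

# Method 2: stop when the next merge would exceed a distance threshold
# (the value 1.0 is an arbitrary illustrative choice).
labels_by_distance = fcluster(Z, t=1.0, criterion="distance")

print(labels_by_count, labels_by_distance)
```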

Termination Condition (Divisive)
Method 1: pick a depth n; the algorithm stops when n splits have been performed.
Method 2: as in agglomerative clustering, set a cohesion threshold.
Method 3: let the algorithm run until every cluster has been separated into single-point leaves.

Complexity
The standard algorithm for hierarchical agglomerative clustering (HAC) has a time complexity of O(n^3) and requires O(n^2) memory, because we exhaustively scan the N x N similarity matrix for the largest similarity in each of the N - 1 iterations. Divisive clustering with an exhaustive search is O(2^n), but with the help of faster heuristics (such as k-means) to choose the splits, the time complexity can be reduced considerably.
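To make the O(n^3) cost concrete, here is a deliberately naive sketch of agglomerative clustering with single linkage: each of the N - 1 merges rescans the full pairwise-distance matrix, which is exactly the exhaustive scan described above. The function name and toy data are our own.

```python
import numpy as np

def naive_single_link_hac(X, n_clusters):
    """Naive agglomerative clustering: O(n^2) memory, roughly O(n^3) time."""
    n = len(X)
    # O(n^2) memory: the full pairwise-distance matrix.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = [[i] for i in range(n)]           # start: every point is its own cluster

    while len(clusters) > n_clusters:
        # Exhaustive scan of the distance matrix for the closest pair (single link).
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])           # merge cluster b into cluster a
        del clusters[b]
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
print(naive_single_link_hac(X, n_clusters=2))
```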

DENSITY BASED CLUSTERING

Density Based Clustering (DBSCAN) was published by M. Ester, H.-P. Kriegel, J. Sander, and X. Xu at KDD '96.

Density Based Clustering Minimum Points

Density Based Clustering: core points, border points, and outliers (example with Min Points = 5, ε = 1 cm). Figure from https://cognitiveclass.ai/courses/machine-learning-with-python/

Density Reachability
An object y is directly density-reachable from an object x if x is a core object and y is in x's ε-neighborhood.
A point y is said to be density-reachable from x if there is a path p1, ..., pn with p1 = x and pn = y, where each pi+1 is directly density-reachable from pi; every point on the path except possibly pn must be a core point.
Example in the figure (Minimum Points = 4; core points, border points, and outliers are marked):
1. a is directly density-reachable from b.
2. b is directly density-reachable from c.
3. a is indirectly density-reachable from d.
4. d is not density-reachable from a, since a is not a core point.
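A small sketch of the two definitions above in code; the function names and parameters are ours, chosen just to illustrate the checks (this is not from the presentation).

```python
import numpy as np

def neighborhood(X, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

def is_core(X, i, eps, min_pts):
    """x is a core point if its eps-neighborhood holds at least min_pts points."""
    return len(neighborhood(X, i, eps)) >= min_pts

def directly_density_reachable(X, x, y, eps, min_pts):
    """y is directly density-reachable from x iff x is core and y is in x's eps-neighborhood."""
    return is_core(X, x, eps, min_pts) and y in neighborhood(X, x, eps)
```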

How does density-based clustering work? (Min points = 3, ε = 1 cm; core, border, and outlier points are marked in the figure.)
Step 1: pick a random point that has not yet been assigned to a cluster or designated as an outlier. Determine whether it is a core point; if not, label it as an outlier.
Step 2: once a core point has been found, add all points directly reachable from it to its cluster. Then perform "neighborhood jumps" to each of those reachable points and add their reachable points to the cluster. If a point previously labeled as an outlier is added this way, relabel it as a border point.
Step 3: repeat these steps until every point is assigned to a cluster or labeled as an outlier.
https://blog.dominodatalab.com/topology-and-density-based-clustering/
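A minimal sketch of the procedure above using scikit-learn's DBSCAN implementation; the eps/min_samples values and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point that should come out as noise.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.0], [5.1, 5.1], [5.0, 5.2], [5.2, 5.0],
              [10.0, 0.0]])

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # cluster ids; -1 marks outliers (noise)
```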

DBSCAN Smiley Face Clustering https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68

Density based clustering vs K-means

Pros and Cons
Pros:
Discovers clusters of arbitrary shape.
Handles noise.
Needs only a density parameter as the termination condition.

Pros and Cons
Cons:
Cannot handle clusters of varying density.
Sensitive to its parameters (it is hard to determine the correct set of parameters).
https://www.ethz.ch/content/dam/ethz/special-interest/gess/computational-social-science-dam/documents/education/spring2015/datascience/clustering2.pdf

GRID BASED CLUSTERING

Grid based clustering
Divide the space into grid cells.
Evaluate the density in each cell.
Form clusters from dense neighbouring cells.
We will discuss the CLIQUE algorithm.

CLIQUE
Divide the space into cells.
Cell frequency: the number of data points inside a cell.
Use the Apriori principle, treating a cell like an itemset: any subset of a frequent itemset is itself frequent, so a cell can only be dense in a higher-dimensional subspace if its lower-dimensional projections are also dense.
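To illustrate the grid idea (not CLIQUE's full subspace search), here is a minimal sketch that bins points into grid cells and flags the dense ones; the grid size and density threshold are arbitrary illustrative choices.

```python
import numpy as np

def dense_cells(X, n_bins=5, density_threshold=3):
    """Bin points into an equal-width grid and return the cells holding at
    least `density_threshold` points (the dense units a grid-based method
    would then join into clusters)."""
    counts, edges = np.histogramdd(X, bins=n_bins)
    return np.argwhere(counts >= density_threshold), edges

X = np.vstack([np.random.normal(0, 0.3, size=(50, 2)),
               np.random.normal(4, 0.3, size=(50, 2))])
cells, _ = dense_cells(X)
print(cells)   # indices of dense grid cells along each dimension
```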

PROS
Efficient: reduces time complexity.
Finds clusters of arbitrary size and shape.

CONS
Non-uniformity
Locality
Dimensionality

Consider the problem of determining the structure of clustered data without prior knowledge of the number of clusters or any other information about their composition. Cluster analysis means the partitioning of data into meaningful subgroups when the number of subgroups and other information about their composition may be unknown.

Paper: Hierarchical Clustering of WWW Image Search Results Using Visual, Textual and Link Information. Deng Cai, Xiaofei He, Zhiwei Li, Wei-Ying Ma and Ji-Rong Wen. Proceedings of the 12th Annual ACM International Conference on Multimedia, pages 952-959, 2004.

Introduction: top 8 results for the query "Pluto" in Google's and AltaVista's search engines [2].

Web-page information
A web-page can be divided into semantic blocks. For each image, there is a smallest block that contains that image; we call it the image block. The image block contains information that might be useful for describing the image. Text, links and visual information can be used to process the image [1].

Features
Three kinds of representations:
Visual feature based representation
Textual feature based representation
Link graph based representation

Visual feature based representation
Color correlogram
Color moments
Color histogram
Texture features

Textual Feature Based Representation
File name of the image file
URL of the image
Image ALT (alternate text) in the web page source
Title of the web page

Link graph based representation: which URLs are connected to which URLs.

Clustering
Combining these features results in a feature space, and by using hierarchical clustering the images of a web-page can be put into different clusters.
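A rough sketch of what "combining the features and clustering" could look like; the feature arrays here are random placeholders standing in for the visual, textual and link representations described above, not the paper's actual pipeline.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

n_images = 20
# Placeholder feature blocks standing in for the three representations.
visual  = np.random.rand(n_images, 64)   # e.g. color/texture descriptors
textual = np.random.rand(n_images, 30)   # e.g. bag-of-words from ALT text, title
link    = np.random.rand(n_images, 10)   # e.g. features from the link graph

# Concatenate into a single feature space and cluster hierarchically.
features = np.hstack([visual, textual, link])
Z = linkage(features, method="average")
clusters = fcluster(Z, t=4, criterion="maxclust")   # 4 clusters is an arbitrary choice
print(clusters)
```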

Results

Conclusion
A method to organize WWW image search results.
First, three representations were proposed.
Then, spectral techniques were applied to cluster the search results into different semantic categories.
The reorganization of each cluster based on visual features makes the clusters more comfortable for the users.

We are Team 2

Paper: How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis. Chris Fraley and Adrian E. Raftery, Department of Statistics, University of Washington, USA. The Computer Journal, 41(8):578-588, 1998. https://www.stat.washington.edu/raftery/research/pdf/fraley1998.pdf

Hierarchical methods proceed by stages producing a sequence of partitions, each corresponding to a different number of clusters. They can be either agglomerative, meaning that groups are merged, or divisive, in which one or more groups are split at each stage. Hierarchical procedures that use subdivision are not practical unless the number of possible splittings can somehow be restricted.

In agglomerative hierarchical clustering, however, the number of stages is bounded by the number of groups in the initial partition. It is common practice to begin with each observation in a cluster by itself, although the procedure could be initialized from a coarser partition if some groupings are known. A drawback of agglomerative methods is that those that are practical in terms of time efficiency require memory usage proportional to the square of the number of groups in the initial partition. At each stage of hierarchical clustering, the splitting or merging is chosen so as to optimize some criterion.

In model-based clustering, it is assumed that the data are generated by a mixture of underlying probability distributions in which each component represents a different group or cluster.

This forms the basis for a more general model-based strategy for clustering:
1. Determine a maximum number of clusters to consider (M) and a set of candidate parametrizations of the Gaussian model to consider. In general, M should be as small as possible.
2. Do agglomerative hierarchical clustering for the unconstrained Gaussian model, and obtain the corresponding classifications for up to M groups.
3. Do EM for each parametrization and each number of clusters 2, ..., M, starting from the classification obtained by hierarchical clustering.
4. Compute the BIC for the one-cluster model for each parametrization, and for the mixture likelihood with the optimal parameters from EM for 2, ..., M clusters. This gives a matrix of BIC values corresponding to each possible combination of parametrization and number of clusters.
5. Plot the BIC values for each model. A decisive first local maximum indicates strong evidence for a model (parametrization + number of clusters).
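A compressed sketch of the BIC-selection part of this strategy using scikit-learn's Gaussian mixture models; scikit-learn's default initialisation replaces the paper's hierarchical-clustering initialisation, so this is an approximation of the procedure, not a reproduction.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: two well-separated Gaussian groups.
X = np.vstack([np.random.normal(0, 1, size=(100, 2)),
               np.random.normal(6, 1, size=(100, 2))])

M = 6                                          # maximum number of clusters to consider
covariances = ("spherical", "diag", "full")    # candidate parametrizations
bic = np.zeros((len(covariances), M))

for i, cov in enumerate(covariances):
    for k in range(1, M + 1):
        gm = GaussianMixture(n_components=k, covariance_type=cov,
                             random_state=0).fit(X)
        bic[i, k - 1] = gm.bic(X)              # matrix of BIC values

# scikit-learn's BIC is defined so that lower is better, hence argmin.
best = np.unravel_index(np.argmin(bic), bic.shape)
print("best parametrization:", covariances[best[0]], "best k:", best[1] + 1)
```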

Paper: Parallel Algorithms for Hierarchical Clustering. Clark F. Olson, Computer Science Department, Cornell University. Parallel Computing, 21(8):1313-1325, 1995. ISSN 0167-8191. https://doi.org/10.1016/0167-8191(95)00017-i

Parallel algorithms for hierarchical clustering
Clustering of multidimensional data is required in many fields. One popular method of performing such clustering is hierarchical clustering. This method starts with a set of distinct points, each of which is considered a separate cluster. The two clusters that are closest according to some metric are agglomerated. This is repeated until all of the points belong to one hierarchically constructed cluster. The final hierarchical cluster structure is called a dendrogram, which is simply a tree that shows which clusters were agglomerated at each step; it shows how the clusters are merged hierarchically. [1]

Some metric is needed to determine the distance between pairs of clusters:
1. Graph metrics: consider a completely connected graph where the vertices are the points we wish to cluster and the edge costs are the Euclidean distances between the points:
Single link
Average link
Complete link

2. Geometric metrics: these metrics define a cluster center for each cluster and use these centers to determine the distances between clusters:
Centroid
Median
Minimum variance
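The same SciPy routine used earlier also implements these geometric criteria (centroid and median linkage, and Ward's minimum-variance method), so a brief sketch suffices; as before, the data are only illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])

# Geometric cluster-distance criteria: centroid/median work on cluster centers,
# while Ward merges the pair that minimises the increase in within-cluster variance.
for method in ("centroid", "median", "ward"):
    Z = linkage(X, method=method)   # requires raw Euclidean observations
    print(method, np.round(Z[:, 2], 3))
```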