# Clustering Part 4 DBSCAN

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of points within a specified radius (Eps) A point is a core point if it has more than specified number of points (MinPts) within Eps Core point is in the interior of a cluster A border point has fewer than MinPts within Eps but is in neighborhood of a core point A noise point is any point that is neither a core point nor a border point Data Mining Sanjay Ranka Fall

2 DBSCAN: Core, Border and Noise points Data Mining Sanjay Ranka Fall When DBSCAN works well Original Dataset Clusters found by DBSCAN Data Mining Sanjay Ranka Fall

3 DBSCAN: Core, Border and Noise points Original Points Eps = 10, Minpts = 4 Point types: Core Border Noise Data Mining Sanjay Ranka Fall DBSCAN: Determining Eps and MinPts Idea is that for points in a cluster, there k th nearest neighbors are at roughly the same distance Noise points have the k th nearest neighbor at at farther distance So, plot sorted distance of every point to its k th nearest neighbor. (k=4 used for 2D points) Data Mining Sanjay Ranka Fall

4 Where DBSCAN doesn t work well Minpts = 4, Eps = 9.75 Original Points MinPts = 4, Eps = 9.92 Data Mining Sanjay Ranka Fall DENCLUE DENsity CLUstEring is a density clustering approach that models the overall density of a set of points as the sum of influence functions associated with each point DENCLUE is based on kernel density estimation. The goal of kernel density estimation is to describe the distribution of data by a function For kernel density estimation, the contribution of each point to the overall density function is expressed by an influence (kernel) function. The overall density is then merely the sum of the influence functions associated with each point Data Mining Sanjay Ranka Fall

5 DENCLUE The resulting overall density functions will have local peaks, i.e. local density maxima, and these local peaks can be used to define clusters For each point, a hill climbing algorithm finds the nearest peak associated with that point, and set of all data points associated with a peak form a cluster However, if the density at a local peak is too low, then the points in the associated cluster are labeled as noise and discarded Similarly, if two peaks are connected by a path of data points, and the density at each point on the path is above a minimum density threshold, then the clusters associated with these two peaks are merged Data Mining Sanjay Ranka Fall DENCLUE: Kernel Function Typically the kernel function is symmetric and its value decreases as the distance from the point increases. The Gaussian function is often used as a kernel function. Data Mining Sanjay Ranka Fall

6 Subspace clustering Instead of using all the attributes (features) of a dataset, if we consider only subset of the features (subspace of the data), then the clusters that we find can be quite different from one subspace to another The clusters we find depend on the subset of the attributes that we consider Data Mining Sanjay Ranka Fall Subspace clustering Data Mining Sanjay Ranka Fall

7 Subspace clustering Data Mining Sanjay Ranka Fall Subspace Clustering Data Mining Sanjay Ranka Fall

8 CLIQUE CLIQUE is a grid based clustering algorithm CLIQUE splits each dimension (attribute) in to a fixed number ( ) of equal length intervals. This partitions the data space in to rectangular units of equal volume We can measure the density of each unit by the fraction of points it contains A unit is considered dense if its density > user specified threshold A cluster is a group of contiguous (touching) dense units Data Mining Sanjay Ranka Fall CLIQUE: Example Data Mining Sanjay Ranka Fall

9 CLIQUE CLIQUE starts by finding all the dense areas in the one dimensional spaces associated with each attribute Then it generates the set of two dimensional cells that might possibly be dense by looking at pairs of dense one dimensional cells In general, CLIQUE generates the possible set of k-dimensional cells that might possibly be dense by looking at dense (k-1)-dimensional cells. This is similar to APRIORI algorithm for finding frequent item sets It then finds clusters finds clusters by taking union of all adjacent high density cells Data Mining Sanjay Ranka Fall MAFIA Merging of Adaptive Finite Intervals (And more than a CLIQUE) MAFIA is a modification of CLIQUE that runs faster and finds better quality clusters. There is also pmafia which is a parallel version of MAFIA The main modification over CLIQUE is the use of an adaptive grid Data Mining Sanjay Ranka Fall

10 MAFIA Initially each dimension is partitioned into a large number of intervals. A histogram is generated that shows the number of data points in each interval Groups of adjacent intervals are grouped in to windows, and the maximum number of points in the window s intervals becomes the value associated with the window Data Mining Sanjay Ranka Fall MAFIA Adjacent windows are grouped together if the values of the two windows are close As a special case, if all windows are combined into one window, the dimensions is partitioned in to a fixed number of cells and the threshold for being considered a dense unit is increased for that dimension Data Mining Sanjay Ranka Fall

11 Limitations of CLIQUE and MAFIA Time complexity is exponential in the number of dimensions Will have difficulty if too many dense units are generated at lower stages May fail if clusters are of widely differing densities, since the threshold is fixed Determining the appropriate and for a variety of data sets can be challenging It is not typically possible to find all clusters using the same threshold Data Mining Sanjay Ranka Fall Clustering Scalability for Large Datasets One very common solution is sampling, but the sampling could miss small clusters. Data is sometimes not organized to make valid sampling easy or efficient. Another approach is to compress the data or portions of the data. Any such approach must ensure that not too much information is lost. (Scaling Clustering Algorithms to Large Databases, Bradley, Fayyad and Reina.) Data Mining Sanjay Ranka Fall

12 Scalable Clustering: BIRCH BIRCH (Balanced and Iterative Reducing and Clustering using Hierarchies) BIRCH can efficiently cluster data with a single pass and can improve that clustering in additional passes. Can work with a number of different distance metrics. BIRCH can also deal effectively with outliers. Data Mining Sanjay Ranka Fall Scaleable Clustering: BIRCH BIRCH is based on the notion of a clustering feature (CF) and a CF tree. A cluster of data points (vectors) can be represented by a triplet of numbers (N, LS, SS) N is the number of points in the cluster LS is the linear sum of the points SS is the sum of squares of the points. Points are processed incrementally. Each point is placed in the leaf node corresponding to the closest cluster (CF). Clusters (CFs) are updated. Data Mining Sanjay Ranka Fall

13 Scaleable Clustering: BIRCH Basic steps of BIRCH Load the data into memory by creating a CF tree that summarizes the data. Perform global clustering. Produces a better clustering than the initial step. An agglomerative, hierarchical technique was selected. Redistribute the data points using the centroids of clusters discovered in the global clustering phase, and thus, discover a new (and hopefully better) set of clusters. Data Mining Sanjay Ranka Fall Scaleable Clustering: CURE Clustering Using Representatives Uses a number of points to represent a cluster Representative points are found by selecting a constant number of points from a cluster and then shrinking them toward the center of the cluster Cluster similarity is the similarity of the closest pair of representative points from different clusters Shrinking representative points toward the center helps avoid problems with noise and outliers CURE is better able to handle clusters of arbitrary shapes and sizes Data Mining Sanjay Ranka Fall

14 Experimental results: CURE Data Mining Sanjay Ranka Fall Experimental Results: CURE Data Mining Sanjay Ranka Fall

15 CURE can not handle differing densities Original Points CURE Data Mining Sanjay Ranka Fall Graph Based Clustering Graph-Based clustering uses the proximity graph Start with the proximity matrix Consider each point as a node in a graph Each edge between two nodes has a weight which is the proximity between the two points Initially the proximity graph is fully connected MIN (single-link) and MAX (complete-link) can be viewed as starting with this graph In the most simple case, clusters are connected components in the graph Data Mining Sanjay Ranka Fall

16 Graph Based Clustering: Sparsification The amount of data that needs to be processed is drastically reduced Sparsification can eliminate more than 99% of the entries in a similarity matrix The amount of time required to cluster the data is drastically reduced The size of the problems that can be handled is increased Data Mining Sanjay Ranka Fall Sparsification Clustering may work better Sparsification techniques keep the connections to the most similar (nearest) neighbors of a point while breaking the connections to less similar points. The nearest neighbors of a point tend to belong to the same class as the point itself. This reduces the impact of noise and outliers and sharpens the distinction between clusters. Sparsification facilitates the use of graph partitioning algorithms (or algorithms based on graph partitioning algorithms. Chameleon and Hypergraph-based Clustering Data Mining Sanjay Ranka Fall

17 Sparsification Data Mining Sanjay Ranka Fall Limitations of Current Merging Schemes Existing merging schemes are static in nature Data Mining Sanjay Ranka Fall

18 Chameleon: Clustering Using Dynamic Modeling Adapt to the characteristics of the data set to find the natural clusters. Use a dynamic model to measure the similarity between clusters. Main property is the relative closeness and relative inter-connectivity of the cluster. Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters. The merging scheme preserves self-similarity. One of the areas of application is spatial data. Data Mining Sanjay Ranka Fall Characteristics of Spatial Datasets Clusters are defined as densely populated regions of the space Clusters have arbitrary shapes, orientation, and non-uniform sizes Difference in densities across clusters and variation in density within clusters Existence of special artifacts (streaks) and noise The clustering algorithm must address the above characteristics and also require minimal supervision Data Mining Sanjay Ranka Fall

19 Chameleon Preprocessing Step: Represent the Data by a Graph Given a set of points, we construct the k-nearestneighbor (k-nn) graph to capture the relationship between a point and its k nearest neighbors. Phase 1: Use a multilevel graph partitioning algorithm on the graph to find a large number of clusters of well-connected vertices. Each cluster should contain mostly points from one true cluster, i.e., is a sub-cluster of a real cluster. Graph algorithms take into account global structure. Data Mining Sanjay Ranka Fall Chameleon Phase 2: Use Hierarchical Agglomerative Clustering to merge sub-clusters. Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters. Two key properties are used to model cluster similarity: Relative Interconnectivity: Absolute interconnectivity of two clusters normalized by the internal connectivity of the clusters. Relative Closeness: Absolute closeness of two clusters normalized by the internal closeness of the clusters. Data Mining Sanjay Ranka Fall

20 Experimental Results: Chameleon Data Mining Sanjay Ranka Fall Experimental Results: Chameleon Data Mining Sanjay Ranka Fall

21 Experimental Results: CURE (10 clusters) Data Mining Sanjay Ranka Fall Experimental Results: CURE (15 clusters) Data Mining Sanjay Ranka Fall

22 Experimental Results: Chameleon Data Mining Sanjay Ranka Fall Experimental Results: CURE (9 clusters) Data Mining Sanjay Ranka Fall

23 Experimental Results: CURE (15 clusters) Data Mining Sanjay Ranka Fall

### Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining, 2 nd Edition

Data Mining Cluster Analysis: Advanced Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar Outline Prototype-based Fuzzy c-means

### Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017)

1 Notes Reminder: HW2 Due Today by 11:59PM TA s note: Please provide a detailed ReadMe.txt file on how to run the program on the STDLINUX. If you installed/upgraded any package on STDLINUX, you should

### DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

DATA MINING LECTURE 7 Hierarchical Clustering, DBSCAN The EM Algorithm CLUSTERING What is a Clustering? In general a grouping of objects such that the objects in a group (cluster) are similar (or related)

### Clustering Part 3. Hierarchical Clustering

Clustering Part Dr Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Hierarchical Clustering Two main types: Agglomerative Start with the points

### University of Florida CISE department Gator Engineering. Clustering Part 5

Clustering Part 5 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville SNN Approach to Clustering Ordinary distance measures have problems Euclidean

### CS570: Introduction to Data Mining

CS570: Introduction to Data Mining Scalable Clustering Methods: BIRCH and Others Reading: Chapter 10.3 Han, Chapter 9.5 Tan Cengiz Gunay, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei.

### University of Florida CISE department Gator Engineering. Clustering Part 2

Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical

### CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/28/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

### Data Mining 4. Cluster Analysis

Data Mining 4. Cluster Analysis 4.5 Spring 2010 Instructor: Dr. Masoud Yaghini Introduction DBSCAN Algorithm OPTICS Algorithm DENCLUE Algorithm References Outline Introduction Introduction Density-based

### Clustering Lecture 4: Density-based Methods

Clustering Lecture 4: Density-based Methods Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced

### DS504/CS586: Big Data Analytics Big Data Clustering II

Welcome to DS504/CS586: Big Data Analytics Big Data Clustering II Prof. Yanhua Li Time: 6pm 8:50pm Thu Location: AK 232 Fall 2016 More Discussions, Limitations v Center based clustering K-means BFR algorithm

### DBSCAN. Presented by: Garrett Poppe

DBSCAN Presented by: Garrett Poppe A density-based algorithm for discovering clusters in large spatial databases with noise by Martin Ester, Hans-peter Kriegel, Jörg S, Xiaowei Xu Slides adapted from resources

### Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

### Clustering Algorithms for Spatial Databases: A Survey

Clustering Algorithms for Spatial Databases: A Survey Erica Kolatch Department of Computer Science University of Maryland, College Park CMSC 725 3/25/01 kolatch@cs.umd.edu 1. Introduction Spatial Database

### MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI

### CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling

To Appear in the IEEE Computer CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling George Karypis Eui-Hong (Sam) Han Vipin Kumar Department of Computer Science and Engineering University

### COMP 465: Data Mining Still More on Clustering

3/4/015 Exercise COMP 465: Data Mining Still More on Clustering Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Describe each of the following

### PAM algorithm. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methods. Partitioning i Methods. Hierarchical Methods

Whatis Cluster Analysis? Clustering Types of Data in Cluster Analysis Clustering part II A Categorization of Major Clustering Methods Partitioning i Methods Hierarchical Methods Partitioning i i Algorithms:

### A Comparative Study of Various Clustering Algorithms in Data Mining

Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 6.017 IJCSMC,

### Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY

Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY Clustering Algorithm Clustering is an unsupervised machine learning algorithm that divides a data into meaningful sub-groups,

### Knowledge Discovery in Databases

Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Lecture notes Knowledge Discovery in Databases Summer Semester 2012 Lecture 8: Clustering

### Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

### What is Cluster Analysis? COMP 465: Data Mining Clustering Basics. Applications of Cluster Analysis. Clustering: Application Examples 3/17/2015

// What is Cluster Analysis? COMP : Data Mining Clustering Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, rd ed. Cluster: A collection of data

### Lecture 7 Cluster Analysis: Part A

Lecture 7 Cluster Analysis: Part A Zhou Shuigeng May 7, 2007 2007-6-23 Data Mining: Tech. & Appl. 1 Outline What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering

### d(2,1) d(3,1 ) d (3,2) 0 ( n, ) ( n ,2)......

Data Mining i Topic: Clustering CSEE Department, e t, UMBC Some of the slides used in this presentation are prepared by Jiawei Han and Micheline Kamber Cluster Analysis What is Cluster Analysis? Types

### Clustering Part 2. A Partitional Clustering

Universit of Florida CISE department Gator Engineering Clustering Part Dr. Sanja Ranka Professor Computer and Information Science and Engineering Universit of Florida, Gainesville Universit of Florida

### Clustering Techniques

Clustering Techniques Marco BOTTA Dipartimento di Informatica Università di Torino botta@di.unito.it www.di.unito.it/~botta/didattica/clustering.html Data Clustering Outline What is cluster analysis? What

### CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/004 What

### DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li

Welcome to DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li Time: 6:00pm 8:50pm Thu Location: AK 232 Fall 2016 High Dimensional Data v Given a cloud of data points we want to understand

### 2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.

Code No: M0502/R05 Set No. 1 1. (a) Explain data mining as a step in the process of knowledge discovery. (b) Differentiate operational database systems and data warehousing. [8+8] 2. (a) Briefly discuss

### Hierarchical Clustering

Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram A tree like diagram that records the sequences of merges or splits 0 0 0 00

### Analyzing Outlier Detection Techniques with Hybrid Method

Analyzing Outlier Detection Techniques with Hybrid Method Shruti Aggarwal Assistant Professor Department of Computer Science and Engineering Sri Guru Granth Sahib World University. (SGGSWU) Fatehgarh Sahib,

### Scalable Varied Density Clustering Algorithm for Large Datasets

J. Software Engineering & Applications, 2010, 3, 593-602 doi:10.4236/jsea.2010.36069 Published Online June 2010 (http://www.scirp.org/journal/jsea) Scalable Varied Density Clustering Algorithm for Large

### Data Preprocessing. Data Preprocessing

Data Preprocessing Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville ranka@cise.ufl.edu Data Preprocessing What preprocessing step can or should

### What is Cluster Analysis?

Cluster Analysis What is Cluster Analysis? Finding groups of objects (data points) such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the

### Clustering fundamentals

Elena Baralis, Tania Cerquitelli Politecnico di Torino What is Cluster Analsis? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar (modified by Predrag Radivojac, 07) Old Faithful Geyser Data

### Large-Scale Flight Phase identification from ADS-B Data Using Machine Learning Methods

Large-Scale Flight Phase identification from ADS-B Data Using Methods Junzi Sun 06.2016 PhD student, ATM Control and Simulation, Aerospace Engineering Large-Scale Flight Phase identification from ADS-B

### UNIVERSITY OF CALGARY. Distributed Gaussian Mixture Model Summarization Using the MapReduce Framework. Arina Esmaeilpour A THESIS

UNIVERSITY OF CALGARY Distributed Gaussian Mixture Model Summarization Using the MapReduce Framework by Arina Esmaeilpour A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES IN PARTIAL FULFILLMENT OF

### Data Mining Concepts & Techniques

Data Mining Concepts & Techniques Lecture No 08 Cluster Analysis Naeem Ahmed Email: naeemmahoto@gmailcom Department of Software Engineering Mehran Univeristy of Engineering and Technology Jamshoro Outline

### On Clustering Validation Techniques

On Clustering Validation Techniques Maria Halkidi, Yannis Batistakis, Michalis Vazirgiannis Department of Informatics, Athens University of Economics & Business, Patision 76, 0434, Athens, Greece (Hellas)

### A Survey on Clustering Algorithms for Data in Spatial Database Management Systems

A Survey on Algorithms for Data in Spatial Database Management Systems Dr.Chandra.E Director Department of Computer Science DJ Academy for Managerial Excellence Coimbatore, India Anuradha.V.P Research

### COMPARISON OF DENSITY-BASED CLUSTERING ALGORITHMS

COMPARISON OF DENSITY-BASED CLUSTERING ALGORITHMS Mariam Rehman Lahore College for Women University Lahore, Pakistan mariam.rehman321@gmail.com Syed Atif Mehdi University of Management and Technology Lahore,

### Kapitel 4: Clustering

Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.

### Faster Clustering with DBSCAN

Faster Clustering with DBSCAN Marzena Kryszkiewicz and Lukasz Skonieczny Institute of Computer Science, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland Abstract. Grouping data

### Segmentation and Grouping

Segmentation and Grouping How and what do we see? Fundamental Problems ' Focus of attention, or grouping ' What subsets of pixels do we consider as possible objects? ' All connected subsets? ' Representation

### Performance Analysis of Video Data Image using Clustering Technique

Indian Journal of Science and Technology, Vol 9(10), DOI: 10.17485/ijst/2016/v9i10/79731, March 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Performance Analysis of Video Data Image using Clustering

### Hierarchical clustering

Hierarchical clustering Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Description Produces a set of nested clusters organized as a hierarchical tree. Can be visualized

### Big Data SONY Håkan Jonsson Vedran Sekara

Big Data 2016 - SONY Håkan Jonsson Vedran Sekara Schedule 09:15-10:00 Cluster analysis, partition-based clustering 10.00 10.15 Break 10:15 12:00 Exercise 1: User segmentation based on app usage 12:00-13:15

### Clustering. Bruno Martins. 1 st Semester 2012/2013

Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 Motivation Basic Concepts

### 5/15/16. Computational Methods for Data Analysis. Massimo Poesio UNSUPERVISED LEARNING. Clustering. Unsupervised learning introduction

Computational Methods for Data Analysis Massimo Poesio UNSUPERVISED LEARNING Clustering Unsupervised learning introduction 1 Supervised learning Training set: Unsupervised learning Training set: 2 Clustering

### University of Florida CISE department Gator Engineering. Data Preprocessing. Dr. Sanjay Ranka

Data Preprocessing Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville ranka@cise.ufl.edu Data Preprocessing What preprocessing step can or should

### Course Content. What is Classification? Chapter 6 Objectives

Principles of Knowledge Discovery in Data Fall 007 Chapter 6: Data Clustering Dr. Osmar R. Zaïane University of Alberta Course Content Introduction to Data Mining Association Analysis Sequential Pattern

### CS 412 Intro. to Data Mining Chapter 10. Cluster Analysis: Basic Concepts and Methods

CS 412 Intro. to Data Mining Chapter 10. Cluster Analysis: Basic Concepts and Methods Jiawei Han, Computer Science, Univ. Illinois at Urbana -Champaign, 2017 1 2 Chapter 10. Cluster Analysis: Basic Concepts

### Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s]

Fast, single-pass K-means algorithms Fredrik Farnstrom Computer Science and Engineering Lund Institute of Technology, Sweden arnstrom@ucsd.edu James Lewis Computer Science and Engineering University of

### Cluster Analysis: Basic Concepts and Algorithms

7 Cluster Analysis: Basic Concepts and Algorithms Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If meaningful groups are the goal, then the clusters should

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar What is Cluster Analsis? Finding groups of objects such that the

### UNIT V CLUSTERING, APPLICATIONS AND TRENDS IN DATA MINING. Clustering is unsupervised classification: no predefined classes

UNIT V CLUSTERING, APPLICATIONS AND TRENDS IN DATA MINING What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other

### Document Clustering: Comparison of Similarity Measures

Document Clustering: Comparison of Similarity Measures Shouvik Sachdeva Bhupendra Kastore Indian Institute of Technology, Kanpur CS365 Project, 2014 Outline 1 Introduction The Problem and the Motivation

### Machine Learning. Unsupervised Learning. Manfred Huber

Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training

### CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 1 Instructor: Yizhou Sun yzsun@ccs.neu.edu October 30, 2013 Announcement Homework 1 due next Monday (10/14) Course project proposal due next

### ISSN: (Online) Volume 2, Issue 2, February 2014 International Journal of Advance Research in Computer Science and Management Studies

ISSN: 2321-7782 (Online) Volume 2, Issue 2, February 2014 International Journal of Advance Research in Computer Science and Management Studies Research Article / Paper / Case Study Available online at:

### Unsupervised Learning

Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

### Clustering Algorithms for general similarity measures

Types of general clustering methods Clustering Algorithms for general similarity measures general similarity measure: specified by object X object similarity matrix 1 constructive algorithms agglomerative

### CS 664 Segmentation. Daniel Huttenlocher

CS 664 Segmentation Daniel Huttenlocher Grouping Perceptual Organization Structural relationships between tokens Parallelism, symmetry, alignment Similarity of token properties Often strong psychophysical

### 6. Object Identification L AK S H M O U. E D U

6. Object Identification L AK S H M AN @ O U. E D U Objects Information extracted from spatial grids often need to be associated with objects not just an individual pixel Group of pixels that form a real-world

### Unsupervised Learning

Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

### Centroid Based Text Clustering

Centroid Based Text Clustering Priti Maheshwari Jitendra Agrawal School of Information Technology Rajiv Gandhi Technical University BHOPAL [M.P] India Abstract--Web mining is a burgeoning new field that

### MOSAIC: A Proximity Graph Approach for Agglomerative Clustering 1

MOSAIC: A Proximity Graph Approach for Agglomerative Clustering Jiyeon Choo, Rachsuda Jiamthapthaksin, Chun-sheng Chen, Oner Ulvi Celepcikay, Christian Giusti, and Christoph F. Eick Computer Science Department,

### Chapter 5: Outlier Detection

Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases SS 2016 Chapter 5: Outlier Detection Lecture: Prof. Dr.

### Exploratory data analysis for microarrays

Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA

### Unsupervised Learning

Networks for Pattern Recognition, 2014 Networks for Single Linkage K-Means Soft DBSCAN PCA Networks for Kohonen Maps Linear Vector Quantization Networks for Problems/Approaches in Machine Learning Supervised

### CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

CPSC 340: Machine Learning and Data Mining Finding Similar Items Fall 2017 Assignment 1 is due tonight. Admin 1 late day to hand in Monday, 2 late days for Wednesday. Assignment 2 will be up soon. Start

### Clustering. Subhransu Maji. CMPSCI 689: Machine Learning. 2 April April 2015

Clustering Subhransu Maji CMPSCI 689: Machine Learning 2 April 2015 7 April 2015 So far in the course Supervised learning: learning with a teacher You had training data which was (feature, label) pairs

### Midterm Examination CS 540-2: Introduction to Artificial Intelligence

Midterm Examination CS 54-2: Introduction to Artificial Intelligence March 9, 217 LAST NAME: FIRST NAME: Problem Score Max Score 1 15 2 17 3 12 4 6 5 12 6 14 7 15 8 9 Total 1 1 of 1 Question 1. [15] State

### Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

### A Survey on DBSCAN Algorithm To Detect Cluster With Varied Density.

A Survey on DBSCAN Algorithm To Detect Cluster With Varied Density. Amey K. Redkar, Prof. S.R. Todmal Abstract Density -based clustering methods are one of the important category of clustering methods

### Clustering and Visualisation of Data

Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

### Unsupervised Learning. Supervised learning vs. unsupervised learning. What is Cluster Analysis? Applications of Cluster Analysis

7 Supervised learning vs unsupervised learning Unsupervised Learning Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute These patterns are then

### PATENT DATA CLUSTERING: A MEASURING UNIT FOR INNOVATORS

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 6367(Print) ISSN 0976 6375(Online) Volume 1 Number 1, May - June (2010), pp. 158-165 IAEME, http://www.iaeme.com/ijcet.html

### CLUSTERING BIG DATA USING NORMALIZATION BASED k-means ALGORITHM

Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,

### Application of Clustering Techniques to Energy Data to Enhance Analysts Productivity

Application of Clustering Techniques to Energy Data to Enhance Analysts Productivity Wendy Foslien, Honeywell Labs Valerie Guralnik, Honeywell Labs Steve Harp, Honeywell Labs William Koran, Honeywell Atrium

### Segmentation of Images

Segmentation of Images SEGMENTATION If an image has been preprocessed appropriately to remove noise and artifacts, segmentation is often the key step in interpreting the image. Image segmentation is a

### Density Based Clustering Using Mutual K-nearest. Neighbors

Density Based Clustering Using Mutual K-nearest Neighbors A thesis submitted to the Graduate School of the University of Cincinnati in partial fulfillment of the requirements for the degree of Master of

### Introduction to Clustering

Introduction to Clustering Ref: Chengkai Li, Department of Computer Science and Engineering, University of Texas at Arlington (Slides courtesy of Vipin Kumar) What is Cluster Analysis? Finding groups of

### Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter Introduction to Data Mining b Tan, Steinbach, Kumar What is Cluster Analsis? Finding groups of objects such that the

### Analytics and Visualization of Big Data

1 Analytics and Visualization of Big Data Fadel M. Megahed Lecture 15: Clustering (Discussion Class) Department of Industrial and Systems Engineering Spring 13 Refresher: Hierarchical Clustering 2 Key

### Density-based clustering algorithms DBSCAN and SNN

Density-based clustering algorithms DBSCAN and SNN Version 1.0, 25.07.2005 Adriano Moreira, Maribel Y. Santos and Sofia Carneiro {adriano, maribel, sofia}@dsi.uminho.pt University of Minho - Portugal 1.

### Types of general clustering methods. Clustering Algorithms for general similarity measures. Similarity between clusters

Types of general clustering methods Clustering Algorithms for general similarity measures agglomerative versus divisive algorithms agglomerative = bottom-up build up clusters from single objects divisive

### Data Mining: Concepts and Techniques. Chapter 7 Jiawei Han. University of Illinois at Urbana-Champaign. Department of Computer Science

Data Mining: Concepts and Techniques Chapter 7 Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign www.cs.uiuc.edu/~hanj 6 Jiawei Han and Micheline Kamber, All rights reserved

### Community detection algorithms survey and overlapping communities. Presented by Sai Ravi Kiran Mallampati

Community detection algorithms survey and overlapping communities Presented by Sai Ravi Kiran Mallampati (sairavi5@vt.edu) 1 Outline Various community detection algorithms: Intuition * Evaluation of the

### 3. Data Structures for Image Analysis L AK S H M O U. E D U

3. Data Structures for Image Analysis L AK S H M AN @ O U. E D U Different formulations Can be advantageous to treat a spatial grid as a: Levelset Matrix Markov chain Topographic map Relational structure

### Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Exploratory data analysis tasks Examine the data, in search of structures

### Market basket analysis

Market basket analysis Find joint values of the variables X = (X 1,..., X p ) that appear most frequently in the data base. It is most often applied to binary-valued data X j. In this context the observations

### Robust Shape Retrieval Using Maximum Likelihood Theory

Robust Shape Retrieval Using Maximum Likelihood Theory Naif Alajlan 1, Paul Fieguth 2, and Mohamed Kamel 1 1 PAMI Lab, E & CE Dept., UW, Waterloo, ON, N2L 3G1, Canada. naif, mkamel@pami.uwaterloo.ca 2

### University of Florida CISE department Gator Engineering. Visualization

Visualization Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida What is visualization? Visualization is the process of converting data (information) in to

### Clustering Billions of Images with Large Scale Nearest Neighbor Search

Clustering Billions of Images with Large Scale Nearest Neighbor Search Ting Liu, Charles Rosenberg, Henry A. Rowley IEEE Workshop on Applications of Computer Vision February 2007 Presented by Dafna Bitton