! Introduction
! Partitioning methods
! Hierarchical methods
! Model-based methods
! Density-based methods
! Scalability


Lecture: Clustering

Preview
! Introduction
! Partitioning methods
! Hierarchical methods
! Model-based methods
! Density-based methods

What is Clustering?
! Cluster: a collection of data objects
! Similar to one another within the same cluster
! Dissimilar to the objects in other clusters
! Cluster analysis
! Grouping a set of data objects into clusters
! Clustering is unsupervised classification: no predefined classes
! Typical applications
! As a stand-alone tool to get insight into data distribution
! As a preprocessing step for other algorithms

Examples of Clustering Applications
! Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
! Land use: Identification of areas of similar land use in an earth observation database
! Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
! Urban planning: Identifying groups of houses according to their house type, value, and geographical location
! Seismology: Observed earthquake epicenters should be clustered along continental faults

What Is a Good Clustering?
! A good clustering method will produce clusters with
! High intra-class similarity
! Low inter-class similarity (a small numeric sketch follows this slide group)
! Precise definition of clustering quality is difficult
! Application-dependent
! Ultimately subjective

Requirements for Clustering in Data Mining
! Scalability
! Ability to deal with different types of attributes
! Discovery of clusters with arbitrary shape
! Minimal domain knowledge required to determine input parameters
! Ability to deal with noise and outliers
! Insensitivity to the order of input records
! Robustness wrt high dimensionality
! Incorporation of user-specified constraints
! Interpretability and usability
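To make the "high intra-class similarity, low inter-class similarity" criterion concrete, here is a minimal sketch that compares average within-cluster and between-cluster distances. This is an illustration only, not code from the lecture; the toy points, labels, and the function name intra_inter_distances are invented for the example.

```python
# Sketch of the "good clustering" criterion: average intra-cluster distance
# (want it low) vs. average inter-cluster distance (want it high).
import numpy as np
from itertools import combinations

def intra_inter_distances(X, labels):
    intra, inter = [], []
    for i, j in combinations(range(len(X)), 2):
        d = np.linalg.norm(X[i] - X[j])
        (intra if labels[i] == labels[j] else inter).append(d)
    return np.mean(intra), np.mean(inter)

X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]], dtype=float)
labels = [0, 0, 0, 1, 1, 1]
# Prints (average intra-class distance, average inter-class distance)
print(intra_inter_distances(X, labels))
```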

Similarity and Dissimilarity Between Objects
! Same as we used for IBL (e.g., Lp norm)
! Euclidean distance (p = 2): d(i,j) = \sqrt{(x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 + \cdots + (x_{ip}-x_{jp})^2}
! Properties of a metric d(i,j):
! d(i,j) >= 0
! d(i,i) = 0
! d(i,j) = d(j,i)
! d(i,j) <= d(i,k) + d(k,j)

Major Clustering Approaches
! Partitioning: Construct various partitions and then evaluate them by some criterion
! Hierarchical: Create a hierarchical decomposition of the set of objects using some criterion
! Model-based: Hypothesize a model for each cluster and find the best fit of models to data
! Density-based: Guided by connectivity and density functions

Partitioning Algorithms
! Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
! Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
! Global optimum: exhaustively enumerate all partitions
! Heuristic methods: k-means and k-medoids algorithms
! k-means (MacQueen, 1967): Each cluster is represented by the center of the cluster
! k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw, 1987): Each cluster is represented by one of the objects in the cluster

K-Means Clustering
! Given k, the k-means algorithm consists of four steps (a minimal code sketch follows below):
! Select initial centroids at random.
! Assign each object to the cluster with the nearest centroid.
! Compute each centroid as the mean of the objects assigned to it.
! Repeat the previous steps until no change.

K-Means Clustering (contd.)
! (Figure: worked example of the k-means iterations)

Comments on the K-Means Method
! Strengths
! Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
! Often terminates at a local optimum. The global optimum may be found using techniques such as simulated annealing and genetic algorithms.
! Weaknesses
! Applicable only when the mean is defined (what about categorical data?)
! Need to specify k, the number of clusters, in advance
! Trouble with noisy data and outliers
! Not suitable for discovering clusters with non-convex shapes
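The following is a minimal k-means sketch in Python/NumPy that follows the four steps above. It is an illustration under the assumption of numeric feature vectors; the function name kmeans, the iteration cap, and the toy data are invented for the example and are not from the lecture.

```python
# Minimal k-means sketch (illustration only).
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: select initial centroids at random from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned objects
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy usage: two well-separated blobs
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
labels, centroids = kmeans(X, k=2)
print(centroids)
```

Because only means are computed, this sketch also illustrates the weaknesses listed above: it needs k in advance and is pulled off by outliers.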

Hierarchical Clustering
! Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.
! (Figure: objects a, b, c, d, e merged step by step into ab, de, cde, abcde; read left to right for agglomerative (AGNES), right to left for divisive (DIANA))

AGNES (Agglomerative Nesting)
! Produces a tree of clusters (nodes)
! Initially: each object is a cluster (leaf)
! Recursively merges the nodes that have the least dissimilarity (see the single-link sketch below)
! Criteria: min distance, max distance, avg distance, center distance
! Eventually all nodes belong to the same cluster (root)

A Dendrogram Shows How the Clusters are Merged Hierarchically
! Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
! A clustering of the data objects is obtained by cutting the dendrogram at the desired level. Then each connected component forms a cluster.

DIANA (Divisive Analysis)
! Inverse order of AGNES
! Start with a root cluster containing all objects
! Recursively divide into subclusters
! Eventually each cluster contains a single object

Other Hierarchical Clustering Methods
! Major weaknesses of agglomerative clustering methods
! Do not scale well: time complexity of at least O(n^2), where n is the total number of objects
! Can never undo what was done previously
! Integration of hierarchical with distance-based clustering
! BIRCH: uses a CF-tree and incrementally adjusts the quality of sub-clusters
! CURE: selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction

BIRCH
! BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies (Zhang, Ramakrishnan & Livny, 1996)
! Incrementally constructs a CF (Clustering Feature) tree
! Parameters: max diameter, max children
! Phase 1: scan the DB to build an initial in-memory CF tree (each node: # points, sum, sum of squares)
! Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
! Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
! Weaknesses: handles only numeric data, sensitive to the order of data records
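Here is a small AGNES-style agglomerative sketch using the "min distance" (single-link) criterion named on the slide. It is an illustration only, a naive O(n^3) implementation suitable for tiny examples; the function name agnes and the toy data are invented for the example.

```python
# Minimal agglomerative (AGNES-style) sketch with single-link distance.
import numpy as np

def agnes(X, num_clusters=1, linkage=min):
    # Initially: each object is its own cluster (a leaf of the tree)
    clusters = [[i] for i in range(len(X))]
    merges = []                      # records the merge order (the dendrogram)
    while len(clusters) > num_clusters:
        best = None
        # Find the pair of clusters with the least dissimilarity
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(np.linalg.norm(X[i] - X[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]   # merge cluster b into cluster a
        del clusters[b]
    return clusters, merges

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]])
print(agnes(X, num_clusters=2)[0])
```

Swapping the linkage argument to max or to an average function gives the other merge criteria listed above; cutting the recorded merges at a chosen dissimilarity corresponds to cutting the dendrogram.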

Clustering Feature Vector
! Clustering Feature: CF = (N, LS, SS)
! N: number of data points
! LS: \sum_{i=1}^{N} X_i (linear sum of the points)
! SS: \sum_{i=1}^{N} X_i^2 (square sum of the points)
! (Figure: an example CF computed from five 2-D points; a sketch of the CF triple follows at the end of this slide group)

CF Tree
! (Figure: a CF tree with branching factor B and leaf threshold L; the root and non-leaf nodes hold CF entries with child pointers, and leaf nodes hold CF entries chained by prev/next pointers)

CURE (Clustering Using REpresentatives)
! CURE: handles non-spherical clusters, robust wrt outliers
! Uses multiple representative points to evaluate the distance between clusters
! Stops the creation of a cluster hierarchy if a level consists of k clusters

Drawbacks of Distance-Based Methods
! Drawbacks of square-error-based clustering methods
! Consider only one point as representative of a cluster
! Good only for convex clusters of similar size and density, and only if k can be reasonably estimated

Cure: The Algorithm
! Draw a random sample s
! Partition the sample into p partitions of size s/p
! Partially cluster each partition into s/(pq) clusters
! Cluster the partial clusters, shrinking representatives towards the centroid
! Label the data on disk

Data Partitioning and Clustering
! (Figure: example of partitioning a sample of size s into p partitions and partially clustering each into s/(pq) clusters)
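Returning to BIRCH's clustering feature, the sketch below shows the CF triple and why it is useful: CFs are additive, so subclusters can be merged without re-reading the points, and summary statistics such as the centroid and radius can be derived from (N, LS, SS) alone. This is an illustration only; it assumes SS is stored as the scalar sum of squared norms, and the class name CF and the toy points are invented. A real CF-tree additionally enforces the branching factor B and leaf threshold L.

```python
# Sketch of BIRCH's clustering feature (CF) triple and its additive merge.
import numpy as np

class CF:
    def __init__(self, points):
        pts = np.asarray(points, dtype=float)
        self.N = len(pts)              # number of points
        self.LS = pts.sum(axis=0)      # linear sum of the points
        self.SS = (pts ** 2).sum()     # sum of squared norms of the points

    def merge(self, other):
        # CFs are additive: subclusters combine without re-reading the data
        merged = CF([])
        merged.N = self.N + other.N
        merged.LS = self.LS + other.LS
        merged.SS = self.SS + other.SS
        return merged

    def centroid(self):
        return self.LS / self.N

    def radius(self):
        # average squared distance of member points from the centroid, rooted
        c = self.centroid()
        return np.sqrt(max(self.SS / self.N - c.dot(c), 0.0))

a = CF([[1, 1], [2, 2]])
b = CF([[8, 8], [9, 9]])
print(a.merge(b).centroid(), a.radius())
```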

Cure: Shrinking Representative Points
! Shrink the multiple representative points towards the gravity center by a fraction α
! Multiple representatives capture the shape of the cluster

Model-Based Clustering
! Basic idea: Clustering as probability estimation
! One model for each cluster
! Generative model:
! Probability of selecting a cluster
! Probability of generating an object in the cluster
! Find the max. likelihood or MAP model
! Missing information: Cluster membership
! Use the EM algorithm
! Quality of clustering: Likelihood of test objects

Mixtures of Gaussians
! Cluster model: Normal distribution (mean, covariance)
! Assume: diagonal covariance, known variance, same for all clusters
! Max. likelihood: mean = avg. of samples
! But which points are samples of a given cluster?
! Estimate the prob. that a point belongs to each cluster
! Mean = weighted avg. of points, weight = prob.
! But to estimate the probs. we need the model
! Chicken-and-egg problem: use the EM algorithm (see the sketch below)

EM Algorithm for Mixtures
! Initialization: Choose means at random
! E step:
! For all points and means, compute Prob(point | mean)
! Prob(mean | point) = Prob(mean) Prob(point | mean) / Prob(point)
! M step:
! Each mean = weighted avg. of points
! Weight = Prob(mean | point)
! Repeat until convergence

EM Algorithm (contd.)
! Guaranteed to converge to a local optimum
! K-means is a special case

AutoClass
! Developed at NASA (Cheeseman & Stutz, 1996)
! Mixture of Naïve Bayes models
! Variety of possible models for Prob(attribute | class)
! Missing information: Class of each example
! Apply the EM algorithm as before
! Special case of learning a Bayes net with missing values
! Widely used in practice
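The sketch below implements EM for a mixture of spherical Gaussians under the simplifying assumptions stated on the slide (known variance, shared by all clusters). It is an illustration only; the fixed variance value, the function name em_gaussian_means, and the toy data are invented for the example.

```python
# Minimal EM sketch for a mixture of spherical Gaussians with fixed shared variance.
import numpy as np

def em_gaussian_means(X, k, var=1.0, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: choose means at random from the data
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    priors = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E step: Prob(point | mean) under an isotropic Gaussian, then
        # Prob(mean | point) via Bayes' rule (normalize over the k means)
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        lik = np.exp(-sq / (2.0 * var))
        resp = lik * priors
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: each mean becomes the responsibility-weighted average of points
        means = (resp.T @ X) / resp.sum(axis=0)[:, None]
        priors = resp.mean(axis=0)
    return means, priors

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 4])
print(em_gaussian_means(X, k=2)[0])
```

If the soft responsibilities are replaced by hard 0/1 assignments to the nearest mean, the updates reduce to k-means, which is the "special case" noted above.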

COBWEB
! Grows a tree of clusters (Fisher, 1987)
! Each node contains: P(cluster), and P(attribute | cluster) for each attribute
! Objects are presented sequentially
! Options: add to a node, create a new node, merge, split
! Quality measure: category utility: increase in predictability of attributes / # clusters

A COBWEB Tree
! (Figure: an example COBWEB tree)

Neural Network Approaches
! Neuron = cluster = centroid in instance space
! Layer = level of the hierarchy
! Several competing sets of clusters in each layer

Competitive Learning
! Objects are sequentially presented to the network
! Within each set, neurons compete to win the object
! The winning neuron is moved towards the object
! Can be viewed as mapping from low-level features to high-level ones

Self-Organizing Feature Maps
! Clustering is also performed by having several units compete for the current object
! The unit whose weight vector is closest to the current object wins
! The winner and its neighbors learn by having their weights adjusted (see the sketch below)
! SOMs are believed to resemble processing that can occur in the brain
! Useful for visualizing high-dimensional data in 2- or 3-D space

Density-Based Clustering
! Clustering based on density (a local cluster criterion), such as density-connected points
! Major features:
! Discover clusters of arbitrary shape
! Handle noise
! One scan
! Need density parameters as a termination condition
! Representative algorithms:
! DBSCAN (Ester et al., 1996)
! DENCLUE (Hinneburg & Keim, 1998)
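The sketch below illustrates the competitive-learning / SOM update described above on a 1-D grid of units: the closest unit wins the current object, and the winner plus its grid neighbors move towards it. It is an illustration only; the grid size, learning rate, neighborhood radius, and the function name train_som_1d are invented for the example.

```python
# Minimal sketch of SOM-style competitive learning on a 1-D grid of units.
import numpy as np

def train_som_1d(X, n_units=10, n_epochs=20, lr=0.3, radius=1, seed=0):
    rng = np.random.default_rng(seed)
    # Each unit has a weight vector in the input space (a "centroid")
    weights = X[rng.choice(len(X), size=n_units, replace=False)].astype(float)
    for _ in range(n_epochs):
        for x in X[rng.permutation(len(X))]:
            # Competition: the unit whose weight vector is closest wins the object
            winner = np.linalg.norm(weights - x, axis=1).argmin()
            # Cooperation: the winner and its grid neighbors move toward the object
            for j in range(max(0, winner - radius), min(n_units, winner + radius + 1)):
                weights[j] += lr * (x - weights[j])
    return weights

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 5])
print(train_som_1d(X))
```

Because nearby units on the grid are dragged toward similar objects, the grid positions give the low-dimensional layout that makes SOMs useful for visualization.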

Definitions (I)
! Two parameters:
! Eps: maximum radius of the neighborhood
! MinPts: minimum number of points in an Eps-neighborhood of a point
! N_Eps(p) = {q ∈ D | dist(p,q) <= Eps}
! Directly density-reachable: A point p is directly density-reachable from a point q wrt. Eps, MinPts iff
! 1) p belongs to N_Eps(q)
! 2) q is a core point: |N_Eps(q)| >= MinPts
! (Figure: points p and q with their Eps-neighborhoods, with example values of Eps and MinPts)

Definitions (II)
! Density-reachable:
! A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p_1, ..., p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i
! Density-connected:
! A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts
! (Figure: chains of points illustrating density-reachability and density-connectivity)

DBSCAN: Density Based Spatial Clustering of Applications with Noise
! Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
! Discovers clusters of arbitrary shape in spatial databases with noise
! (Figure: core, border, and outlier points for example values of Eps and MinPts)

DBSCAN: The Algorithm
! Arbitrarily select a point p
! Retrieve all points density-reachable from p wrt Eps and MinPts
! If p is a core point, a cluster is formed
! If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database
! Continue the process until all of the points have been processed (a minimal sketch follows below)

DENCLUE: Using Density Functions
! DENsity-based CLUstEring (Hinneburg & Keim, 1998)
! Major features
! Good for data sets with large amounts of noise
! Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
! Significantly faster than other algorithms such as DBSCAN
! But needs a large number of parameters

DENCLUE
! Uses grid cells, but only keeps information about the grid cells that actually contain data points, and manages these cells in a tree-based access structure
! Influence function: describes the impact of a data point within its neighborhood
! The overall density of the data space can be calculated as the sum of the influence functions of all data points
! Clusters can be determined mathematically by identifying density attractors
! Density attractors are local maxima of the overall density function
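The sketch below follows the slide's definitions: core points have at least MinPts neighbors within Eps, clusters grow through density-reachable points, and unreachable points remain noise. It is an illustration only, with naive O(n^2) neighborhood queries; the function name dbscan and the toy data (two blobs plus an outlier) are invented for the example.

```python
# Minimal DBSCAN sketch.
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)               # -1 marks noise / unassigned points
    # Precompute Eps-neighborhoods: N_Eps(p) = {q in D | dist(p, q) <= Eps}
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    cluster = 0
    for p in range(n):
        if labels[p] != -1 or len(neighbors[p]) < min_pts:
            continue                       # already assigned, or not a core point
        # Start a new cluster from core point p and expand it
        labels[p] = cluster
        frontier = list(neighbors[p])
        while frontier:
            q = frontier.pop()
            if labels[q] == -1:
                labels[q] = cluster        # border or core point joins the cluster
                if len(neighbors[q]) >= min_pts:
                    frontier.extend(neighbors[q])   # core points keep expanding
        cluster += 1
    return labels

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 8, [[100.0, 100.0]]])
print(dbscan(X, eps=1.5, min_pts=4))      # the far-away point stays labeled -1 (noise)
```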

Influence Functions
! Example: Gaussian influence function, the overall density it induces, and its gradient (a numeric sketch follows at the end):
! f_{Gaussian}(x, y) = e^{-\frac{d(x,y)^2}{2\sigma^2}}
! f^{D}_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}
! \nabla f^{D}_{Gaussian}(x, x_i) = \sum_{i=1}^{N} (x_i - x) \, e^{-\frac{d(x, x_i)^2}{2\sigma^2}}

Density Attractors
! (Figure: density attractors as local maxima of the overall density function)

Center-Defined & Arbitrary Clusters
! (Figure: center-defined clusters vs. arbitrary-shape clusters found by DENCLUE)

Clustering: Summary
! Introduction
! Partitioning methods
! Hierarchical methods
! Model-based methods
! Density-based methods
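To tie the influence-function formulas above to something executable, here is a small sketch of the Gaussian density and its gradient, plus a mean-shift style hill climb toward a density attractor (a local maximum of the density). It is an illustration only, not the exact procedure from the lecture; sigma, the stopping tolerance, and the toy data are invented for the example.

```python
# Sketch of DENCLUE's Gaussian influence/density functions and a hill climb
# toward a density attractor.
import numpy as np

def density(x, data, sigma):
    # Overall density = sum of Gaussian influences of all data points
    sq = ((data - x) ** 2).sum(axis=1)
    return np.exp(-sq / (2.0 * sigma ** 2)).sum()

def density_gradient(x, data, sigma):
    # Gradient of the density: sum_i (x_i - x) * exp(-d(x, x_i)^2 / (2 sigma^2))
    sq = ((data - x) ** 2).sum(axis=1)
    w = np.exp(-sq / (2.0 * sigma ** 2))
    return (w[:, None] * (data - x)).sum(axis=0)

def climb_to_attractor(x, data, sigma, n_steps=100, tol=1e-6):
    # Mean-shift style hill climb: repeatedly move x to the Gaussian-weighted
    # mean of the data; the fixed point it converges to is a density attractor.
    x = np.asarray(x, dtype=float)
    for _ in range(n_steps):
        sq = ((data - x) ** 2).sum(axis=1)
        w = np.exp(-sq / (2.0 * sigma ** 2))
        x_new = (w[:, None] * data).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

data = np.vstack([np.random.randn(40, 2), np.random.randn(40, 2) + 6])
start = np.array([0.5, 0.5])
print(climb_to_attractor(start, data, sigma=1.0), density(start, data, sigma=1.0))
```

Points whose hill climbs end at the same attractor belong to the same (center-defined) cluster; arbitrary-shape clusters are obtained by joining attractors connected by paths of sufficient density.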