Clustering Part 2: Partitional Clustering


University of Florida CISE Department, Gator Engineering
Clustering Part 2
Dr. Sanjay Ranka
Professor, Computer and Information Science and Engineering
University of Florida, Gainesville
Data Mining, Sanjay Ranka, Fall

Partitional Clustering
[Figure: a set of original points and a partitional clustering of them]

Hierarchical Clustering
[Figure: a traditional hierarchical clustering and the corresponding traditional dendrogram]

Characteristics of Clustering Algorithms
Type of clustering the algorithm produces:
- Partitional versus hierarchical
- Overlapping versus non-overlapping
- Fuzzy versus non-fuzzy
- Complete versus partial

Characteristics of Clustering Algorithms
Type of clusters the algorithm seeks:
- Well-separated, center-based, density-based, or contiguity-based
- Are the clusters found in the entire space or in a subspace?
- Are the clusters relatively similar to one another, or are they of differing sizes, shapes, and densities?

Type of data the algorithm can handle:
- Some clustering algorithms need a data matrix. The K-means algorithm assumes that it is meaningful to take the mean (average) of a set of data objects. This makes sense for data that has continuous attributes and for document data, but not for record data that has categorical attributes.
- Some clustering algorithms start from a proximity matrix, and typically assume it is symmetric.
- Does the data have noise and outliers?
- Is the data high dimensional?

Characteristics of Clustering Algorithms
How the algorithm operates:
- Minimizing or maximizing a global objective function: enumerate all possible ways of dividing the points into clusters and evaluate the goodness of each potential set of clusters using the given objective function. (This exhaustive search is NP-hard.)
- Objectives can be global or local. Hierarchical clustering algorithms typically have local objectives; partitional algorithms typically have global objectives.
- A variation of the global objective function approach is to fit the data to a parameterized model, with the parameters for the model determined from the data. Mixture models, for example, assume that the data is a mixture of a number of statistical distributions.
- Map the clustering problem to a different domain and solve a related problem in that domain. The proximity matrix defines a weighted graph, where the nodes are the points being clustered and the weighted edges represent the proximities between points. Clustering is then equivalent to breaking the graph into connected components, one for each cluster.
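As a minimal sketch of this graph-based view (the `eps` threshold and the function name are my own choices, not from the slides): threshold the proximity matrix into an adjacency matrix, then read off connected components, one cluster per component.

```python
import numpy as np

def graph_clusters(points, eps):
    """Threshold the proximity matrix into a graph (edge iff distance <= eps)
    and return a cluster label per point, one cluster per connected component."""
    n = len(points)
    d = np.linalg.norm(points[:, None] - points[None], axis=2)  # proximity matrix
    adj = d <= eps                                              # weighted graph -> 0/1 graph
    labels = [-1] * n
    cluster = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]            # depth-first search over the thresholded graph
        labels[start] = cluster
        while stack:
            u = stack.pop()
            for v in np.nonzero(adj[u])[0]:
                if labels[v] == -1:
                    labels[v] = cluster
                    stack.append(int(v))
        cluster += 1
    return labels
```

With two well-separated pairs of points and a small `eps`, each pair becomes its own component, i.e. its own cluster.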

K-means Clustering
- Partitional clustering approach
- Each cluster is associated with a centroid (center point)
- Each point is assigned to the cluster with the closest centroid
- The number of clusters, K, must be specified
- The basic algorithm is very simple

K-means Clustering Details
- Initial centroids are often chosen randomly, so the clusters produced vary from one run to another.
- The centroid is (typically) the mean of the points in the cluster.
- Closeness is measured by Euclidean distance, cosine similarity, correlation, etc.
- K-means will converge for the common similarity measures mentioned above. Most of the convergence happens in the first few iterations, so the stopping condition is often relaxed to "until relatively few points change clusters".
- Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, d = number of attributes.
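The basic algorithm described above can be sketched in a few lines of NumPy (a minimal sketch; the function name, the `init` parameter, and the convergence test are my own choices): pick K data points as initial centroids, then alternate the assignment and update steps until no centroid moves.

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0, init=None):
    """Basic K-means: assign each point to its closest centroid,
    then recompute each centroid as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # Initial centroids: chosen randomly from the data points by default
    centroids = (points[rng.choice(len(points), size=k, replace=False)]
                 if init is None else np.asarray(init, dtype=float))
    for _ in range(max_iters):
        # Assignment step: closest centroid by Euclidean distance
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels
```

Each iteration costs O(n * K * d) for the distance computations, which over I iterations matches the O(n * K * I * d) complexity quoted above.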

Evaluating K-means Clusters
- The most common measure is the Sum of the Squared Error (SSE). For each point, the error is the distance to the nearest cluster centroid; to get SSE, we square these errors and sum them.
- Given two clusterings, we can choose the one with the smallest error.
- One easy way to reduce SSE is to increase K, the number of clusters. Yet a good clustering with smaller K can have a lower SSE than a poor clustering with higher K.

Two Different K-means Clusterings
[Figure: the same original points clustered two ways, an optimal clustering and a sub-optimal clustering]
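In symbols, SSE = sum over clusters C_i of sum over points x in C_i of dist(c_i, x)^2, where c_i is the centroid of C_i. A minimal sketch (the function name is my own):

```python
import numpy as np

def sse(points, centroids, labels):
    """Sum of squared Euclidean distances from each point
    to the centroid of the cluster it is assigned to."""
    diffs = points - centroids[labels]   # per-point error vectors
    return float((diffs ** 2).sum())
```

For example, two points at distance 1 on either side of their centroid contribute 1 + 1 = 2 to the SSE.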

Importance of Choosing Initial Centroids
[Figure: one K-means run shown over iterations 1 through 6, converging to a good clustering]

Importance of Choosing Initial Centroids
[Figure: a second K-means run on the same data, shown over iterations 1 through 5, converging to a different, poorer clustering because of the initial centroids]

Problems with Selecting Initial Points
- If there are K "real" clusters, the chance of selecting one initial centroid from each cluster is small, and becomes very small as K grows.
- If the clusters are the same size, n, the probability of one centroid per cluster is K! * n^K / (K * n)^K = K!/K^K. For example, if K = 10, the probability is 10!/10^10 = 0.00036.
- Sometimes the initial centroids readjust themselves in the "right" way, and sometimes they don't.
- Consider an example of five pairs of clusters.

10 Clusters Example
[Figure: iteration 1 of K-means on five pairs of clusters]
Starting with two initial centroids in one cluster of each pair of clusters.
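The probability quoted above is easy to check numerically (assuming equal-size clusters and initial centroids drawn uniformly at random; the function name is my own):

```python
from math import factorial

def prob_one_per_cluster(k):
    # K! orderings place exactly one centroid in each of the K clusters,
    # out of K**K equally likely ways the K picks can land in clusters
    return factorial(k) / k ** k

# For K = 10 this gives 10!/10**10, roughly 0.00036
```

Even for K = 2 the probability is only 1/2, and it decays rapidly from there.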

10 Clusters Example
[Figure: iterations 1 through 4 of K-means on five pairs of clusters]
Starting with two initial centroids in one cluster of each pair of clusters.

10 Clusters Example
[Figure: iteration 1 of K-means on five pairs of clusters, with unevenly distributed initial centroids]
Starting with some pairs of clusters having three initial centroids, while others have only one.

10 Clusters Example
[Figure: iterations 1 through 4 of K-means on five pairs of clusters, with unevenly distributed initial centroids]
Starting with some pairs of clusters having three initial centroids, while others have only one.

Solutions to the Initial Centroids Problem
- Multiple runs: helps, but probability is not on our side.
- Bisecting K-means: not as susceptible to initialization issues.
- Sample the data and use hierarchical clustering to determine the initial centroids.
- Select more than K initial centroids, then select the most widely separated among them.
- Post-processing.
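The "multiple runs" strategy in the list above can be sketched directly: run K-means from several random initializations and keep the result with the lowest SSE (a minimal sketch; the helper names and fixed iteration count are my own choices):

```python
import numpy as np

def kmeans_once(points, k, rng, iters=50):
    """One K-means run from a random initialization; returns (centroids, labels, SSE)."""
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        labels = np.linalg.norm(points[:, None] - centroids[None], axis=2).argmin(1)
        centroids = np.array([points[labels == j].mean(0)
                              if (labels == j).any() else centroids[j]
                              for j in range(k)])
    labels = np.linalg.norm(points[:, None] - centroids[None], axis=2).argmin(1)
    sse = float(((points - centroids[labels]) ** 2).sum())
    return centroids, labels, sse

def best_of_runs(points, k, n_runs=10, seed=0):
    """Multiple-runs strategy: keep the clustering with the lowest SSE."""
    rng = np.random.default_rng(seed)
    return min((kmeans_once(points, k, rng) for _ in range(n_runs)),
               key=lambda run: run[2])
```

As the slide warns, this only raises the odds of a good initialization; it does not guarantee one.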

Handling Empty Clusters
- The basic K-means algorithm can yield empty clusters.
- Several strategies for picking a replacement centroid: choose the point that contributes most to SSE, or choose a point from the cluster with the highest SSE.
- If there are several empty clusters, the above can be repeated several times.

Updating Centers Incrementally
- In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid. An alternative is to update the centroids after each assignment.
- May need to update two centroids per assignment; more expensive.
- Introduces an order dependency.
- Never produces an empty cluster.
- Can use weights to change the impact of each point.
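One pass of the incremental variant can be sketched as follows (a simplified, MacQueen-style sketch of my own, not the slides' exact procedure: each point's new centroid is updated immediately as a running mean, so the result depends on the order in which points arrive):

```python
import numpy as np

def incremental_assign(points, seed_centroids):
    """One pass over the points with incremental centroid updates:
    each centroid is the running mean of the points assigned to it so far."""
    centroids = np.array(seed_centroids, dtype=float)
    counts = np.zeros(len(centroids))
    labels = []
    for x in points:
        j = int(np.linalg.norm(centroids - x, axis=1).argmin())
        counts[j] += 1
        # running-mean update: c_new = c_old + (x - c_old) / n
        centroids[j] += (x - centroids[j]) / counts[j]
        labels.append(j)
    return centroids, labels
```

Note the contrast with the batch algorithm: here a centroid moves as soon as a point is assigned, which is what introduces the order dependency mentioned above.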

Pre-processing and Post-processing
Pre-processing:
- Normalize the data, so that no single attribute dominates the distance computations.
- Eliminate outliers.
Post-processing:
- Eliminate small clusters that may represent outliers.
- Split "loose" clusters, i.e., clusters with relatively high SSE.
- Merge clusters that are close and that have relatively low SSE.
- These steps can also be used during the clustering process, as in ISODATA.

Bisecting K-means
- A variant of K-means that can produce either a partitional or a hierarchical clustering.
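The bisecting idea can be sketched as follows (a simplified sketch of my own: each split is a single 2-means run seeded deterministically with the farthest-apart pair of points, whereas practical implementations usually try several trial bisections and keep the best):

```python
import numpy as np

def two_means(points, iters=50):
    """Split one cluster in two with 2-means, seeding the centroids
    with the pair of points that are farthest apart (deterministic)."""
    d = np.linalg.norm(points[:, None] - points[None], axis=2)
    i, j = np.unravel_index(d.argmax(), d.shape)
    centroids = points[[i, j]].astype(float)
    for _ in range(iters):
        labels = np.linalg.norm(points[:, None] - centroids[None], axis=2).argmin(1)
        centroids = np.array([points[labels == m].mean(0)
                              if (labels == m).any() else centroids[m]
                              for m in range(2)])
    labels = np.linalg.norm(points[:, None] - centroids[None], axis=2).argmin(1)
    return labels

def bisecting_kmeans(points, k):
    """Start with all points in one cluster; repeatedly take the cluster
    with the highest SSE and bisect it, until K clusters remain."""
    clusters = [points]
    while len(clusters) < k:
        sses = [((c - c.mean(0)) ** 2).sum() for c in clusters]
        worst = clusters.pop(int(np.argmax(sses)))
        lab = two_means(worst)
        clusters += [worst[lab == 0], worst[lab == 1]]
    return clusters
```

Recording the sequence of splits, rather than only the final K clusters, yields the hierarchical clustering the slide mentions.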

Bisecting K-means Example
[Figure: the sequence of splits produced by bisecting K-means on a sample data set]

Limitations of K-means
- K-means has problems when clusters are of differing sizes, densities, or non-globular shapes.
- K-means has problems when the data contains outliers.
- One solution is to use many clusters: this finds parts of the natural clusters, which then need to be put back together.

Limitations of K-means: Differing Sizes
[Figure: original points versus the K-means clusters when the natural clusters have differing sizes]

Limitations of K-means: Differing Density
[Figure: original points versus the K-means clusters when the natural clusters have differing densities]

Limitations of K-means: Non-globular Shapes
[Figure: original points versus the K-means clusters when the natural clusters are non-globular]

Overcoming K-means Limitations
[Figure: original points versus the K-means clusters when many clusters are used on data with differing sizes]

Overcoming K-means Limitations
[Figure: original points versus the K-means clusters when many clusters are used on data with differing densities]

Overcoming K-means Limitations
[Figure: original points versus the K-means clusters when many clusters are used on non-globular data]