Cluster Analysis. CSE634 Data Mining


Agenda
Introduction
Clustering Requirements
Data Representation
Partitioning Methods
K-Means Clustering
K-Medoids Clustering
Constrained K-Means Clustering

Introduction to Cluster Analysis The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters.

Formal Definition
Cluster analysis: a statistical method for grouping a set of data objects into clusters. A good clustering method produces high-quality clusters with high intra-class similarity and low inter-class similarity.
Cluster: a collection of data objects.
(Intra-class similarity) Objects are similar to objects in the same cluster.
(Inter-class dissimilarity) Objects are dissimilar to objects in other clusters.
Clustering is unsupervised classification.

Supervised vs. Unsupervised Learning
Unsupervised learning (clustering): the class labels of the training data are unknown. Given a set of measurements, observations, etc., the goal is to establish the existence of clusters in the data.
Supervised learning (classification): supervision means the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation. New data is classified based on the training set.

Clustering vs. Classification
Clustering (learning by observation): unsupervised. Input: a clustering algorithm, a similarity measure, and the number of clusters; no class information is given for the data.
Classification (learning by examples): supervised. It starts from class-labeled training examples and builds a classifier that assigns data objects to one of the classes.

Clustering vs. Classification
Class label attribute: loan_decision. Learning of the classifier is supervised: it is told to which class each training tuple (sample) belongs. (Data Mining: Concepts and Techniques, Chapter 6, Page 287.)

Clustering vs. Classification
In clustering, the class label of each training tuple is not known, and the number or set of classes to be learned may not be known in advance. For example, if loan_decision data were not available, we would use clustering, not classification, to determine groups of like tuples. These groups of like tuples may eventually correspond to risk groups within the loan application data.

Typical Requirements of Clustering
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noisy data
Incremental clustering and insensitivity to the order of input records
Constraint-based clustering
Interpretability and usability

Examples of Clustering Applications
Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
Insurance: identify groups of motor insurance policy holders with a high average claim cost.
City planning: identify groups of houses according to their house type, value, and geographical location.
Earthquake studies: observed earthquake epicenters should be clustered along continent faults.
Fraud detection: detect credit card fraud and monitor criminal activities in electronic commerce.

Data Representation
Data matrix (two-mode): n objects described by p attributes.
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p}\\ \vdots & & \vdots & & \vdots\\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip}\\ \vdots & & \vdots & & \vdots\\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$
Dissimilarity matrix (one-mode): d(i,j) is the dissimilarity between objects i and j.
$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
(Data Mining: Concepts and Techniques, Chapter 7, Pages 386-387.)
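To make the two representations concrete, here is a minimal sketch (illustrative code, not from the slides; function and variable names are ours) that turns an n x p data matrix into the corresponding n x n dissimilarity matrix using Euclidean distance.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two p-dimensional objects."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dissimilarity_matrix(data):
    """Turn an n x p data matrix (two-mode) into an n x n dissimilarity
    matrix (one-mode), where entry [i][j] = d(i, j)."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            d[i][j] = d[j][i] = euclidean(data[i], data[j])
    return d

# Example: 4 objects with 2 attributes each
data = [[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]]
for row in dissimilarity_matrix(data):
    print(["%.2f" % v for v in row])
```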

Types of Data in Cluster Analysis
Interval-Scaled Variables
Binary Variables
Categorical, Ordinal, and Ratio-Scaled Variables
Variables of Mixed Types

Interval-Scaled Variables
Continuous measurements on a roughly linear scale, e.g. weight, height, temperature.
For example, height may be recorded on a metre or a foot scale, and weight on a kilogram or a pound scale, so different scales can be used to express the same absolute measurement; the values therefore need to be standardized.

Using Interval-Scaled Values
Step 1: Standardize the data, to give all attributes equal weight and to bring different scales onto a uniform, single scale. This is not always needed: sometimes we deliberately give attributes unequal weights.
Step 2: Compute the dissimilarity between records, using the Euclidean, Manhattan, or Minkowski distance.
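One common standardization for interval-scaled variables (the one described in Han and Kamber's Chapter 7) converts each value to a z-score using the mean absolute deviation. The sketch below is an illustrative version of that step (helper names are ours), applied to one attribute at a time.

```python
def standardize(values):
    """Standardize one interval-scaled attribute:
    z = (x - mean) / mean_absolute_deviation.
    The mean absolute deviation is more robust to outliers
    than the standard deviation."""
    n = len(values)
    m = sum(values) / n                      # mean of the attribute
    s = sum(abs(x - m) for x in values) / n  # mean absolute deviation
    return [(x - m) / s for x in values]

# Heights in centimetres and weights in kilograms live on very
# different scales; after standardization both get equal weight.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
weights = [50.0, 60.0, 70.0, 80.0, 120.0]
print(standardize(heights))
print(standardize(weights))
```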

Data Types and Distance Metrics
Distances are normally used to measure the similarity or dissimilarity between two data objects.
Minkowski distance:
$$d(i,j) = \sqrt[q]{|x_{i1}-x_{j1}|^{q} + |x_{i2}-x_{j2}|^{q} + \cdots + |x_{ip}-x_{jp}|^{q}}$$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects and q is a positive integer.

Data Types and Distance Metrics (Cont'd)
If q = 1, d is the Manhattan distance:
$$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}|$$
If q = 2, d is the Euclidean distance:
$$d(i,j) = \sqrt{|x_{i1}-x_{j1}|^{2} + |x_{i2}-x_{j2}|^{2} + \cdots + |x_{ip}-x_{jp}|^{2}}$$

Data Types and Distance Metrics (Cont'd)
Properties:
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
A weighted distance, or another dissimilarity measure, can also be used:
$$d(i,j) = \sqrt[q]{w_{1}|x_{i1}-x_{j1}|^{q} + w_{2}|x_{i2}-x_{j2}|^{q} + \cdots + w_{p}|x_{ip}-x_{jp}|^{q}}$$
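As a worked illustration of these formulas (code and names are ours, not from the slides), the function below computes the Minkowski distance with optional attribute weights; q = 1 and q = 2 recover the Manhattan and Euclidean distances above.

```python
def minkowski(i, j, q=2, weights=None):
    """Minkowski distance between two p-dimensional objects i and j.
    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
    If weights are given, the weighted form
    d(i,j) = (w1|xi1-xj1|^q + ... + wp|xip-xjp|^q)^(1/q) is used."""
    if weights is None:
        weights = [1.0] * len(i)
    s = sum(w * abs(a - b) ** q for w, a, b in zip(weights, i, j))
    return s ** (1.0 / q)

x = [1.0, 2.0, 3.0]
y = [4.0, 0.0, 3.0]
print(minkowski(x, y, q=1))   # Manhattan: 5.0
print(minkowski(x, y, q=2))   # Euclidean: ~3.61
```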

Binary Attributes
A contingency table for binary data counts, over the p binary attributes of objects i and j:
a = number of attributes equal to 1 for both i and j
b = number of attributes equal to 1 for i and 0 for j
c = number of attributes equal to 0 for i and 1 for j
d = number of attributes equal to 0 for both i and j
so that a + b + c + d = p.
Simple matching coefficient (applicable only when all binary attributes are symmetric):
$$d(i,j) = \frac{b + c}{a + b + c + d}$$
Jaccard coefficient (applicable only when all binary attributes are asymmetric):
$$d(i,j) = \frac{b + c}{a + b + c}, \qquad sim(i,j) = \frac{a}{a + b + c}$$
For mixed binary attributes, refer to Data Mining: Concepts and Techniques (Chapter 7, Section 7.2.4).

Binary Attributes: Example

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

Points to be considered (refer to Chapter 7 of the book for this example):
In the book, gender is treated as a symmetric binary attribute and the remaining attributes are treated as asymmetric (Y or P is coded 1, N is coded 0).
The book sets the gender attribute aside and continues with the other attributes.
The formulas defined above for similarity and dissimilarity are applicable only when all attributes under consideration are asymmetric, or all are symmetric.
Calculating similarity and dissimilarity when a combination of asymmetric and symmetric attributes is involved is explained in Section 7.2.4.

Dissimilarity between Binary Attributes
Since the table above combines symmetric and asymmetric attributes, we now omit Gender, the symmetric attribute, from consideration:

Name  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  Y      N      P       N       N       N
Mary  Y      N      P       N       P       N
Jim   Y      P      N       N       N       N

We are left with the asymmetric attributes Fever, Cough, Test-1, Test-2, Test-3, and Test-4. The dissimilarity over these asymmetric attributes is calculated with the Jaccard coefficient as follows. (Table 7.2, Data Mining: Concepts and Techniques, Chapter 7, Page 391.)

Dissimilarity between Binary Attributes (Cont'd)
Using the contingency-table counts a, b, c defined earlier and the Jaccard coefficient d(i,j) = (b + c) / (a + b + c):
d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary) = (1 + 2) / (1 + 1 + 2) = 0.75
Considering d(jack, mary): a = 2 (two attributes on which Jack and Mary are both positive), b = 0 (no attribute on which Jack is positive and Mary negative), c = 1 (one attribute on which Jack is negative and Mary positive), so d(jack, mary) = 0.33.
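The same calculation can be scripted. In this illustrative sketch (names are ours, not from the slides) the asymmetric attributes are coded 1 for Y/P and 0 for N, the counts a, b, c are accumulated, and the Jaccard dissimilarity is applied.

```python
def jaccard_dissimilarity(i, j):
    """Jaccard dissimilarity d(i,j) = (b + c) / (a + b + c) for two
    objects described by asymmetric binary attributes (1 = positive)."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1 .. Test-4, with Y/P encoded as 1 and N as 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(jaccard_dissimilarity(jack, mary), 2))  # 0.33
print(round(jaccard_dissimilarity(jack, jim), 2))   # 0.67
print(round(jaccard_dissimilarity(jim, mary), 2))   # 0.75
```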

Categorical Attributes
A generalization of the binary attribute: it can take more than two states, e.g. red, yellow, blue, green.
Method 1: simple matching. With m the number of attributes on which the two records match and p the total number of attributes,
$$d(i,j) = \frac{p - m}{p}$$
Method 2: rewrite the database, creating a new binary attribute for each of the states. For an object with color yellow, the yellow attribute is set to 1, while the remaining color attributes are set to 0.
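A short illustrative sketch of Method 1, simple matching (names are ours):

```python
def simple_matching_dissimilarity(i, j):
    """d(i,j) = (p - m) / p, where p is the number of categorical
    attributes and m the number of attributes on which i and j match."""
    p = len(i)
    m = sum(1 for x, y in zip(i, j) if x == y)
    return (p - m) / p

obj1 = ["red", "small", "round"]
obj2 = ["red", "large", "round"]
print(simple_matching_dissimilarity(obj1, obj2))  # 1/3 ~ 0.33
```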

Partitioning Algorithms: Basic Concept
Aim: classify the data into k groups that together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group.
Partitioning method: construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized.
Given a value of k, find a partition of k clusters that optimizes the chosen partitioning criterion.
Global optimum: exhaustively enumerate all partitions.
Heuristic methods: the k-means and k-medoids algorithms.
k-means: each cluster is represented by the center (mean) of the cluster.
k-medoids, or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster.

The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps.
Algorithm: k-means. The k-means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster.
Input: k, the number of clusters; D, a data set containing n objects.
Output: a set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is most similar, based on the mean value of the objects in the cluster;
(4) update the cluster means, i.e., calculate the mean value of the objects in each cluster;
(5) until no change;
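The following is a minimal, illustrative implementation of these steps in pure Python (the use of Euclidean distance and all names are our choices, not part of the slides):

```python
import math, random

def kmeans(points, k, max_iter=100):
    """Basic k-means: arbitrarily pick k initial centers, then alternate
    between (re)assigning points to the nearest center and recomputing
    each cluster's mean, until the assignment no longer changes."""
    centers = random.sample(points, k)           # step (1)
    assignment = None
    for _ in range(max_iter):                    # steps (2)-(5)
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points                      # step (3): nearest center
        ]
        if new_assignment == assignment:         # step (5): no change
            break
        assignment = new_assignment
        for c in range(k):                       # step (4): update means
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment

points = [[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5], [3.5, 4.5]]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```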

The K-Means Clustering Method: Example (http://www-faculty.cs.uiuc.edu/~hanj/bk2/07.ppt, page 31). [Figure: with K = 2, arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar center, update the cluster means, then reassign and update repeatedly until no change.]

The k-means algorithm
The basic step of k-means clustering is simple. Iterate until stable (i.e., there is no change in the clusters of objects):
Determine the centroid coordinates.
Determine the distance of each object to the centroids.
Group the objects based on minimum distance.

What Is the Problem of the K-Means Method?
The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster. (http://www-faculty.cs.uiuc.edu/~hanj/bk2/07.ppt, page 34) [Figure omitted: two example plots from the source slide.]

K-Medoids Clustering Method
Find representative objects, called medoids, in clusters.
PAM (Partitioning Around Medoids, 1987) starts from an initial set of medoids and iteratively replaces one of the medoids with one of the non-medoids if this improves the total distance of the resulting clustering.
PAM works effectively for small data sets, but does not scale well to large data sets.

Algorithm: K-Medoids
Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on medoid or central objects.
Input: k, the number of clusters; D, a data set containing n objects.
Output: a set of k clusters.
Method:
(1) arbitrarily choose k objects in D as the initial representative objects (seeds);
(2) repeat
(3) assign each remaining object to the cluster with the nearest representative object;
(4) randomly select a non-representative object, O_random;
(5) compute the total cost, S, of swapping a representative object, O_j, with O_random;
(6) if S < 0 then swap O_j with O_random to form the new set of k representative objects;
(7) until no change;
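Below is a compact, illustrative PAM-style sketch (names are ours). Instead of sampling one random non-medoid per pass as in step (4), it tries every candidate swap and applies the best cost-reducing one, repeating until no swap improves the clustering; this captures the same idea and is meant for small data sets only.

```python
import math, random

def total_cost(points, medoids):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k, max_iter=100):
    """PAM-style k-medoids: start from k arbitrary objects as medoids and
    repeatedly swap a medoid with a non-medoid whenever the swap lowers
    the total cost, until no improving swap exists."""
    medoids = random.sample(points, k)                       # step (1)
    cost = total_cost(points, medoids)
    for _ in range(max_iter):
        best_swap, best_cost = None, cost
        for mi in range(k):                                  # medoid to replace
            for o in points:                                 # candidate non-medoid
                if o in medoids:
                    continue
                candidate = medoids[:mi] + [o] + medoids[mi + 1:]
                c = total_cost(points, candidate)            # step (5): cost of swap
                if c < best_cost:
                    best_swap, best_cost = candidate, c
        if best_swap is None:                                # step (7): no change
            break
        medoids, cost = best_swap, best_cost                 # step (6): accept swap
    clusters = [min(range(k), key=lambda i: math.dist(p, medoids[i])) for p in points]
    return medoids, clusters

points = [[1, 2], [2, 1], [2, 2], [8, 8], [8, 9], [25, 80]]
print(pam(points, k=2))
```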

A Typical K-Medoids Algorithm (PAM)
[Figure (http://www-faculty.cs.uiuc.edu/~hanj/bk2/07.ppt, page 36): with K = 2, arbitrarily choose k objects as the initial medoids (total cost = 20) and assign each remaining object to the nearest medoid; then loop: randomly select a non-medoid object O_random, compute the total cost of swapping (here 26), and swap O_j with O_random if the quality is improved; repeat until no change.]

What Is the Problem with PAM?
PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.
PAM works efficiently for small data sets but does not scale well to large data sets: the cost is O(k(n - k)^2) per iteration, where n is the number of data objects and k the number of clusters.
This motivates a sampling-based method: CLARA (Clustering LARge Applications).

The K-Means Clustering Method: Example (from "Maschinelles Lernen and Data Mining", pages 3-11). [Sequence of figures stepping k-means through several iterations on a sample data set.]

Comments on the K-Means Method
Strength: relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n.
Weaknesses:
Applicable only when a mean is defined; what about categorical data?
Need to specify k, the number of clusters, in advance.
Unable to handle noisy data and outliers.
Not suitable for discovering clusters with non-convex shapes.

Variations of the K-Means Method
A few variants of k-means differ in:
the selection of the initial k means,
the dissimilarity calculations,
the strategies used to calculate cluster means.
Handling categorical data: k-modes
replaces the means of clusters with modes,
uses new dissimilarity measures to deal with categorical objects,
uses a frequency-based method to update the modes of clusters.

Cases of the Cost Function for K-Medoids

CLARA (Clustering LARge Applications)
CLARA (Kaufmann and Rousseeuw, 1990) is built into statistical analysis packages such as S+.
It draws multiple samples of the data set, applies PAM to each sample, and returns the best clustering as the output.
Strength: deals with larger data sets than PAM.
Weakness: efficiency depends on the sample size, and a good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased.
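The sampling idea can be sketched as follows (illustrative code, not CLARA's original implementation; the default of five samples of size 40 + 2k follows the commonly cited CLARA defaults, and the inner PAM search is a simplified greedy version repeated here so the example is self-contained). Each sample is clustered with PAM, the resulting medoids are scored against the whole data set, and the best-scoring medoids are returned.

```python
import math, random

def total_cost(points, medoids):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k, max_iter=50):
    """Simplified PAM search: greedily accept any cost-reducing swap."""
    medoids = random.sample(points, k)
    cost = total_cost(points, medoids)
    for _ in range(max_iter):
        improved = False
        for mi in range(k):
            for o in points:
                if o in medoids:
                    continue
                cand = medoids[:mi] + [o] + medoids[mi + 1:]
                c = total_cost(points, cand)
                if c < cost:
                    medoids, cost, improved = cand, c, True
        if not improved:
            break
    return medoids

def clara(points, k, n_samples=5, sample_size=None):
    """CLARA idea: run PAM on several random samples and keep the set of
    medoids that gives the lowest total cost over the WHOLE data set."""
    if sample_size is None:
        sample_size = 40 + 2 * k              # commonly cited CLARA default
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = random.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k)
        cost = total_cost(points, medoids)    # evaluate on all objects
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost

# Two well-separated clouds of points; CLARA should place one medoid in each.
pts = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)] + \
      [[random.gauss(10, 1), random.gauss(10, 1)] for _ in range(200)]
print(clara(pts, k=2))
```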