Unsupervised Learning Partitioning Methods


Road Map 1. Basic Concepts 2. K-Means 3. K-Medoids 4. CLARA & CLARANS

Cluster Analysis Unsupervised learning (i.e., the class label is unknown). Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns. Principle: maximize intra-cluster similarity and minimize inter-cluster similarity. Typical applications: WWW, social networks, marketing, biology, libraries, etc.

Partitioning Methods Given a data set of n objects and k, the number of clusters to form, organize the objects into k partitions (k <= n), where each partition represents a cluster. The clusters are formed to optimize an objective partitioning criterion: objects within a cluster are similar, and objects of different clusters are dissimilar.

Road Map 1. Basic Concepts 2. K-Means 3. K-Medoids 4. CLARA & CLARANS

K-Means Choose 3 objects as initial cluster centroids. Goal: create 3 clusters (partitions). Assign each object to the closest centroid to form clusters, then update the cluster centroids.

K-Means Recompute the clusters and update the centroids; if the centroids are stable (no longer change), then stop.

K-Means Algorithm Input: K, the number of clusters; D, a data set containing n objects. Output: a set of k clusters. Method: (1) Arbitrarily choose k objects from D as the initial cluster centers (2) Repeat (3) Reassign each object to the most similar cluster, based on the mean value of the objects in the cluster (4) Update the cluster means (5) Until no change
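As a rough illustration of the pseudocode above, here is a minimal k-means sketch in Python with NumPy; the toy data, the random seeding, and the convergence test are illustrative assumptions rather than part of the slides.

```python
import numpy as np

def k_means(D, k, max_iter=100, seed=0):
    """Minimal k-means sketch: D is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # (1) arbitrarily choose k objects from D as initial cluster centers
    centers = D[rng.choice(len(D), size=k, replace=False)]
    for _ in range(max_iter):
        # (3) reassign each object to the cluster with the closest mean
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (4) update the cluster means (keep the old center if a cluster is empty)
        new_centers = np.array([D[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        # (5) stop when the means no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy usage on hypothetical 2-D data:
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9], [9.0, 1.0], [8.8, 1.2]])
labels, centers = k_means(X, k=3)
print(labels, centers)
```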

K-Means Properties The algorithm attempts to determine k partitions that minimize the square-error function E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2, where E is the sum of the squared error for all objects in the data set, p is the point in space representing an object, and m_i is the mean of cluster C_i. It works well when the clusters are compact clouds that are rather well separated from one another.

K-Means: Advantages K-means is relatively scalable and efficient in processing large data sets. The computational complexity of the algorithm is O(nkt), where n is the total number of objects, k the number of clusters, and t the number of iterations. Normally, k << n and t << n.

K-Means: Disadvantages Can be applied only when the mean of a cluster is defined. Users need to specify k. K-means is not suitable for discovering clusters with nonconvex shapes or clusters of very different sizes. It is sensitive to noise and outlier data points (a few such points can strongly influence the mean value).


Variations of K-Means A few variants of k-means differ in the selection of the initial k means, the dissimilarity calculations, and the strategies used to calculate cluster means. How can we change k-means to deal with categorical data? Handling categorical data: k-modes (Huang 98) replaces the means of clusters with modes, uses new dissimilarity measures to deal with categorical objects, and uses a frequency-based method to update the modes of clusters. These ideas also extend to a mixture of categorical and numerical data.
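A rough sketch of the k-modes idea described above (simple-matching dissimilarity and frequency-based mode updates), assuming small categorical toy data; the function name and tie-breaking are ours.

```python
import numpy as np
from collections import Counter

def k_modes(D, k, max_iter=20, seed=0):
    """k-modes for categorical data: D is an (n, d) array of category labels."""
    rng = np.random.default_rng(seed)
    modes = D[rng.choice(len(D), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # dissimilarity = number of mismatched attributes (simple matching)
        dists = (D[:, None, :] != modes[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_modes = modes.copy()
        for i in range(k):
            members = D[labels == i]
            if len(members) == 0:
                continue
            # frequency-based update: per attribute, take the most common category
            new_modes[i] = [Counter(col).most_common(1)[0][0] for col in members.T]
        if np.array_equal(new_modes, modes):
            break
        modes = new_modes
    return labels, modes

# Toy categorical data (hypothetical):
X = np.array([["red", "small"], ["red", "medium"], ["blue", "large"], ["blue", "large"]])
print(k_modes(X, k=2))
```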

Road Map 1. Basic Concepts 2. K-Means 3. K-Medoids 4. CLARA & CLARANS

K-Medoids Method Minimize the sensitivity of k-means to outliers: pick actual objects to represent clusters instead of mean values. Each remaining object is clustered with the representative object (medoid) to which it is the most similar. The algorithm minimizes the sum of the dissimilarities between each object and its corresponding representative object: E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - o_i|, where E is the sum of absolute error for all objects in the data set, p is the point in space representing an object, and o_i is the representative object of cluster C_i.

K-Medoids: The idea Initial representatives are chosen randomly. The iterative process of replacing representative objects by non-representative objects continues as long as the quality of the clustering is improved. For each representative object O and each non-representative object R, swap O and R, and choose the configuration with the lowest cost. The cost function is the difference in absolute error value if a current representative object is replaced by a non-representative object.

K-Medoids: Example Data objects (attributes A1, A2): O1 (2, 6), O2 (3, 4), O3 (3, 8), O4 (4, 7), O5 (6, 2), O6 (6, 4), O7 (7, 3), O8 (7, 4), O9 (8, 5), O10 (7, 6). Goal: create two clusters. Choose randomly two medoids: O2 = (3, 4) and O8 = (7, 4).

K-Medoids: Example Assign each object to the closest representative object. Using the L1 metric (Manhattan distance), we form the following clusters: Cluster1 = {O1, O2, O3, O4} and Cluster2 = {O5, O6, O7, O8, O9, O10}.

K-Medoids: Example Compute the absolute-error criterion for the set of medoids (O2, O8): E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - o_i| = (|O_1 - O_2| + |O_3 - O_2| + |O_4 - O_2|) + (|O_5 - O_8| + |O_6 - O_8| + |O_7 - O_8| + |O_9 - O_8| + |O_{10} - O_8|).

K-Medoids: Example The absolute-error criterion for the set of medoids (O2, O8) is E = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20.

K-Medoids: Example Choose a random object O7 and swap O8 and O7. Compute the absolute-error criterion for the set of medoids (O2, O7): E = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22.

K-Medoids: Example Compute the cost function: S = absolute error for (O2, O7) minus absolute error for (O2, O8) = 22 - 20 = 2. Since S > 0, it is a bad idea to replace O8 by O7.
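The arithmetic of this worked example can be checked with a short script; the helper function is ours, but the coordinates, the medoid choices, and the L1 metric come from the slides.

```python
points = {1: (2, 6), 2: (3, 4), 3: (3, 8), 4: (4, 7), 5: (6, 2),
          6: (6, 4), 7: (7, 3), 8: (7, 4), 9: (8, 5), 10: (7, 6)}

def absolute_error(medoids):
    """Sum of Manhattan (L1) distances from every object to its closest medoid."""
    total = 0
    for p in points.values():
        total += min(sum(abs(a - b) for a, b in zip(p, points[m])) for m in medoids)
    return total

E_28 = absolute_error([2, 8])   # medoids O2, O8 -> 20
E_27 = absolute_error([2, 7])   # medoids O2, O7 -> 22
S = E_27 - E_28                 # swap cost: 2 > 0, so keep O8
print(E_28, E_27, S)
```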

K-Medoids: Example In this example, changing the medoid of cluster 2 did not change the assignments of objects to clusters. What are the possible cases when we replace a medoid by another object?

K-Medoids Reassignment cases when a representative object is replaced (A is a medoid that keeps its role, B is the medoid being replaced by a new representative, P is a non-representative object). First case: P is currently assigned to A; the assignment of P to A does not change. Second case: P is currently assigned to B; P is reassigned to A.

K-Medoids (continued) Third case: P is currently assigned to B; P is reassigned to the new representative. Fourth case: P is currently assigned to A; P is reassigned to the new representative.

PAM: Partitioning Around Medoids Input: K, the number of clusters; D, a data set containing n objects. Output: a set of k clusters. Method: (1) Arbitrarily choose k objects from D as representative objects (seeds) (2) Repeat (3) Assign each remaining object to the cluster with the nearest representative object (4) For each representative object O_j (5) Randomly select a non-representative object O_random (6) Compute the total cost S of swapping representative object O_j with O_random (7) If S < 0 then replace O_j with O_random (8) Until no change
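A compact sketch of the PAM pseudocode above in Python with NumPy, assuming the L1 (Manhattan) metric used in the earlier example; the randomized swap selection and the stopping condition are simplifications for illustration.

```python
import numpy as np

def total_cost(D, medoid_idx):
    """Sum of L1 distances from each object to its nearest medoid (absolute error E)."""
    dists = np.abs(D[:, None, :] - D[medoid_idx][None, :, :]).sum(axis=2)
    return dists.min(axis=1).sum()

def pam(D, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = list(rng.choice(n, size=k, replace=False))    # (1) arbitrary seeds
    best = total_cost(D, medoids)
    for _ in range(max_iter):                                # (2) repeat
        improved = False
        for j in range(k):                                   # (4) each representative O_j
            o_random = int(rng.integers(n))                  # (5) random non-representative
            if o_random in medoids:
                continue
            candidate = medoids.copy()
            candidate[j] = o_random
            cost = total_cost(D, candidate)                  # (6) total cost of the swap
            if cost < best:                                  # (7) S < 0: accept the swap
                medoids, best, improved = candidate, cost, True
        if not improved:                                     # (8) until no change
            break
    # (3) final assignment of each object to its nearest medoid
    labels = np.abs(D[:, None, :] - D[medoids][None, :, :]).sum(axis=2).argmin(axis=1)
    return medoids, labels, best
```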

PAM Properties The complexity of each iteration is O(k(n-k)^2); for large values of n and k, such computation becomes very costly. Advantages: the k-medoids method is more robust than k-means in the presence of noise and outliers. Disadvantages: k-medoids is more costly than k-means; like k-means, k-medoids requires the user to specify k; and it does not scale well for large data sets.

Road Map 1. Basic Concepts 2. K-Means 3. K-Medoids 4. CLARA & CLARANS

CLARA CLARA (Clustering LARge Applications) uses a sampling-based method to deal with large data sets: apply PAM to a sample rather than the whole data set. A random sample should closely represent the original data, so the chosen medoids will likely be similar to what would have been chosen from the whole data set.

CLARA Draw multiple samples of the data set (sample 1, sample 2, ..., sample m), apply PAM to each sample to obtain its clusters, and return the best clustering.
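A hedged sketch of the CLARA scheme just described, reusing the pam() and total_cost() helpers from the PAM sketch above; the sample size and the number of samples are illustrative defaults, not values prescribed by the slides.

```python
import numpy as np
# assumes pam() and total_cost() from the PAM sketch above

def clara(D, k, n_samples=5, sample_size=40, seed=0):
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        # draw a random sample and run PAM on it
        idx = rng.choice(len(D), size=min(sample_size, len(D)), replace=False)
        sample_medoids, _, _ = pam(D[idx], k, seed=int(rng.integers(1 << 30)))
        medoids = list(idx[sample_medoids])       # map back to indices in the full data set
        cost = total_cost(D, medoids)             # evaluate the medoids on the whole data set
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```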

CLARA Properties The complexity of each iteration is O(ks^2 + k(n-k)), where s is the size of the sample, k the number of clusters, and n the number of objects. PAM finds the best k medoids among the given data, while CLARA finds the best k medoids among the selected samples. Problems: the best k medoids may not be selected during the sampling process, in which case CLARA will never find the best clustering; if the sampling is biased, we cannot obtain a good clustering. There is thus a trade-off between efficiency and clustering quality.

CLARANS CLARANS (Clustering Large Applications based upon RANdomized Search) was proposed to improve the quality and the scalability of CLARA. It combines sampling techniques with PAM, but it does not confine itself to any fixed sample at a given time: it draws a sample with some randomness in each step of the search.

CLARANS: The idea [Figure: the clustering view of the search, showing the current medoids and candidate medoid sets with their costs (Cost = -10, 5, -15, 20, 2, 3, 1); keep the current medoids.]

CLARANS: The idea CLARA draws a sample of nodes at the beginning of the search; neighbors are taken from the chosen sample, which restricts the search to a specific area of the original data. [Figure: in the first and second steps of the search, the neighbors of the current medoids are drawn from the chosen sample.]

CLARANS: The idea CLARANS does not confine the search to a localized area; it stops the search when a local minimum is found, and it finds several local optima and outputs the clustering with the best local optimum. [Figure: in each step of the search, a random sample of neighbors of the current medoids is drawn from the original data; the number of neighbors sampled is specified by the user.]
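A rough sketch of the CLARANS search; the parameter names numlocal and maxneighbor follow the usual description of the algorithm, while the cost function and the way neighbors are drawn are simplifications for illustration.

```python
import numpy as np
# assumes total_cost() from the PAM sketch above

def clarans(D, k, numlocal=2, maxneighbor=20, seed=0):
    rng = np.random.default_rng(seed)
    n = len(D)
    best_medoids, best_cost = None, np.inf
    for _ in range(numlocal):                       # search for several local optima
        current = list(rng.choice(n, size=k, replace=False))
        current_cost = total_cost(D, current)
        tried = 0
        while tried < maxneighbor:
            # a "neighbor" differs from the current medoid set in exactly one medoid
            j = int(rng.integers(k))
            o_random = int(rng.integers(n))
            if o_random in current:
                continue                            # not a valid neighbor; pick again
            neighbor = current.copy()
            neighbor[j] = o_random
            cost = total_cost(D, neighbor)
            if cost < current_cost:                 # move to the better neighbor
                current, current_cost, tried = neighbor, cost, 0
            else:
                tried += 1                          # count unsuccessful neighbors
        if current_cost < best_cost:                # keep the best local optimum found
            best_medoids, best_cost = current, current_cost
    return best_medoids, best_cost
```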

CLARANS Properties Advantages: experiments show that CLARANS is more effective than both PAM and CLARA, and it handles outliers. Disadvantages: the computational complexity of CLARANS is O(n^2), where n is the number of objects, and the clustering quality depends on the sampling method.

Summary Partitioning methods find sphere-shaped clusters. K-means is efficient for large data sets but sensitive to outliers. PAM uses actual objects as cluster centers instead of means. CLARA and CLARANS are used for clustering large databases.

Task 1: Image Compression Assume you have a 538-pixel by 538-pixel image. In a straightforward 24-bit color representation of this image, each pixel is represented as three 8-bit numbers (ranging from 0 to 255) that specify the red, green, and blue intensity values. Our bird photo contains thousands of colors, but we would like to reduce that number to 16. By making this reduction, it would be possible to represent the photo more efficiently by storing only the RGB values of the 16 colors and, for each pixel, an index into that small palette. How would you solve this problem?
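One possible approach to Task 1, sketched with scikit-learn's KMeans: cluster the pixels' RGB values into 16 groups and replace each pixel by the centroid of its cluster. The file name bird.png and the output file are hypothetical.

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# load the image and flatten it to an (n_pixels, 3) array of RGB values
img = np.asarray(Image.open("bird.png").convert("RGB"))     # hypothetical file name
pixels = img.reshape(-1, 3).astype(float)

# cluster the colors into 16 groups
km = KMeans(n_clusters=16, n_init=10, random_state=0).fit(pixels)

# each pixel now only needs a 4-bit index into a palette of 16 RGB centroids
palette = km.cluster_centers_.astype(np.uint8)
compressed = palette[km.labels_].reshape(img.shape)
Image.fromarray(compressed).save("bird_16colors.png")
```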

Task 2: Document Categorization We want to cluster a set of documents based on their structural hierarchy: documents that belong to the same topic should belong to the same cluster, documents that belong to the same subtopic should belong to the same cluster, and so on. Documents are described by the set of terms they contain. Assume that similar documents belong to the same topic. Propose a clustering algorithm based on k-means that can achieve this task. Now imagine you have a knowledge base that gives information about topic hierarchies, where each topic is described by a set of terms. How would you cluster documents in this case?

Task 3: Search Result Diversification Query Google for images of Barbara Liskov and look at the results. We need to diversify search results. How could you approach this problem using clustering?

Task 4: Fraudulent Credit-Card Transactions We aim at identifying frauds in transactions. Assume for each transaction we have the following information: customer ID, amount, the list and types of bought items, and location. How would you achieve this task?