Clustering (Basic Concepts and Algorithms), Entscheidungsunterstützungssysteme (Decision Support Systems)


Why do we need to find similarity? Similarity underlies many data science methods and solutions to business problems. Some examples: a company wants to find companies that are similar to its best business customers, in order to have the sales staff look at them as prospects. Modern retailers such as Amazon and Netflix use similarity to provide recommendations of similar products or from similar people. Whenever you see statements like "People who like X also like Y" or "Customers with your browsing history have also looked at ...", similarity is being applied. A doctor may reason about a new, difficult case by recalling a similar case (either treated personally or documented in a journal) and its diagnosis.

Similarity and Distance The closer two objects are in the space defined by the features, the more similar they are. Consider two instances of a credit card application. We want to find the similarity between the two cases, so our objective is to convert three dimensions (Age, Years at current address, and Residential status) into a distance. There are many ways to measure the similarity between two instances; the simplest is the Euclidean distance.

Distance measures The Minkowski distance of order p between two points x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n) is given as: d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}. The 1-norm distance is called the Manhattan distance and the 2-norm distance is the Euclidean distance. [Figure: Manhattan vs. Euclidean path between the points (x_1, x_2) and (y_1, y_2).]
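To make the formula concrete, here is a minimal Python sketch (not from the slides); the function name minkowski_distance and the two sample points are our own illustration.

def minkowski_distance(x, y, p=2):
    # Minkowski distance of order p between two equal-length vectors:
    # p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / p)

x = (1.0, 2.0)
y = (4.0, 6.0)
print(minkowski_distance(x, y, p=1))  # Manhattan distance: 7.0
print(minkowski_distance(x, y, p=2))  # Euclidean distance: 5.0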

Similarity and Distance When an object is described by n features, i.e., n dimensions (d_1, d_2, ..., d_n), the general equation for the Euclidean distance in n dimensions is: d(A, B) = \sqrt{(d_{1,A} - d_{1,B})^2 + (d_{2,A} - d_{2,B})^2 + \cdots + (d_{n,A} - d_{n,B})^2}. So the distance between two persons A and B can be calculated by plugging their feature values into this formula; the distance between these examples is about 19. This distance is just a number: it has no units and no meaningful interpretation. It is only really useful for comparing the similarity of one pair of instances to that of another pair.
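The slides do not list the feature values behind the "about 19" result, so the numbers in the sketch below are invented purely to show the n-dimensional formula in use.

# Hypothetical numeric feature vectors for persons A and B (e.g. age, years at
# current address, income in thousands); these are made-up values, not the
# ones behind the "about 19" figure in the slides.
person_a = (35, 5, 50)
person_b = (23, 2, 40)

euclidean = sum((a - b) ** 2 for a, b in zip(person_a, person_b)) ** 0.5
print(round(euclidean, 1))  # about 15.9 for these made-up values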

Clustering Clustering is another application of our fundamental notion of similarity. The basic idea is that we want to find groups of objects (consumers, businesses, whiskeys, etc.) where the objects within a group are similar, but the objects in different groups are not so similar. E.g., revisiting our whiskey example: let's say that we run a small shop in a well-to-do neighborhood, and as part of our business strategy we want to be known as the place to go for single-malt scotch whiskeys. We may not be able to have the largest selection, given our limited space and ability to invest, but we might choose a strategy of having a broad and eclectic collection. If we understood how the single malts grouped by taste, we could (for example) choose from each taste group a popular member and a lesser-known member, or an expensive member and a more affordable member.

Quality: What Is Good Clustering? A good clustering method will produce high-quality clusters: high intra-class similarity (cohesive within clusters) and low inter-class similarity (distinctive between clusters). The quality of a clustering method depends on the similarity measure used by the method, its implementation, and its ability to discover some or all of the hidden patterns.

Hierarchical Clustering Consider the figure. Here six points have been grouped into clusters based on their similarity, calculated using the Euclidean distance. Notice that the only overlap between clusters is when one cluster contains other clusters. Because of this structure, the circles actually represent a hierarchy of clusterings. The most general (highest-level) clustering is just the single cluster that contains everything (cluster 5 in the example). The lowest-level clustering is what we get when we remove all the circles: the points themselves are six (trivial) clusters.

Hierarchical Clustering This graph is called a dendrogram, and it shows the hierarchy of the clusters explicitly. Along the x axis the individual data points are arranged; the y axis represents the distance between clusters. At the bottom (y = 0) each point is in a separate cluster. As y increases, different groupings of clusters fall within the distance constraint: first A and C are clustered together, then B and E, then BE with D, and so on, until all clusters are merged. An advantage of hierarchical clustering is that it allows the data analyst to see the groupings before deciding on the number of clusters to extract.
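As an illustration, the sketch below builds such a dendrogram with SciPy's agglomerative hierarchical clustering; the six 2-D points standing in for A-F are invented, so the merge order will not match the figure exactly.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Six invented 2-D points standing in for the labeled points in the figure
labels = ["A", "B", "C", "D", "E", "F"]
points = np.array([[1.0, 1.0], [5.0, 4.0], [1.2, 1.3],
                   [6.5, 5.0], [5.2, 4.3], [9.0, 9.0]])

# Agglomerative clustering: repeatedly merge the two closest clusters
Z = linkage(points, method="single", metric="euclidean")

# The dendrogram makes the hierarchy explicit; the y axis is the merge distance
dendrogram(Z, labels=labels)
plt.ylabel("distance")
plt.show()

Cutting the dendrogram at a chosen height yields the corresponding flat clustering.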

Centroid-based clustering The most common way of representing a cluster is by its center, called the centroid. In the figure, we have three clusters whose instances are represented by the circles. Each cluster has a centroid, represented by the solid-lined star. The star is not necessarily one of the instances; it is the geometric center of the group of instances.

K-means clustering The most popular centroid-based clustering algorithm is k-means clustering. In k-means, the "means" are the centroids, represented by the arithmetic means (averages) of the values along each dimension for the instances in the cluster. So, in the previous figure, to compute the centroid of each cluster we would average all the x values of the points in the cluster to form the x coordinate of the centroid, and all the y values to form the y coordinate. The k in k-means is simply the number of clusters that one would like to find in the data.
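As a tiny illustration of this averaging (with invented points), the centroid of a cluster is simply the per-dimension mean:

import numpy as np

# Three invented 2-D instances belonging to one cluster
cluster = np.array([[1.0, 2.0],
                    [2.0, 4.0],
                    [3.0, 3.0]])

# The centroid is the arithmetic mean along each dimension (x and y here)
centroid = cluster.mean(axis=0)
print(centroid)  # [2. 3.]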

K-means clustering The k-means algorithm starts by creating k initial cluster centers, usually randomly, but sometimes by choosing k of the actual data points, or by being given specific initial starting points by the user. As shown in the previous figure, the clusters corresponding to these cluster centers are formed by determining which center is closest to each point. Next, for each of these clusters, its center is recalculated by finding the actual centroid of the points in the cluster. The cluster centers typically shift, as shown in the next figure. This procedure keeps iterating until there is no change in the clusters.
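The loop below is a minimal NumPy sketch of these steps (pick k data points as initial centers, assign each point to its closest center, recompute centroids, repeat until assignments stop changing). It is written for illustration only: it does not handle empty clusters, and the toy data is invented.

import numpy as np

def k_means(points, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize the k centers by picking k of the actual data points
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    assignment = np.full(len(points), -1)
    while True:
        # Step 2: assign every point to its closest center (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = distances.argmin(axis=1)
        # Step 4: stop when no point changes cluster
        if np.array_equal(new_assignment, assignment):
            return centroids, assignment
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its assigned points
        centroids = np.array([points[assignment == j].mean(axis=0) for j in range(k)])

# Invented toy data: three 2-D blobs
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in (0.0, 5.0, 10.0)])
centroids, labels = k_means(data, k=3)
print(centroids)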

K-means clustering Above, the figure on the left shows a data set of 90 points, and the figure on the right shows the final result of clustering after 16 iterations. The three (erratic) lines show the path from each centroid's initial (random) location to its final location.

K-means clustering There is no guarantee that a single run of the k-means algorithm will result in a good clustering. The result of a single clustering run will find a local optimum, a locally best clustering, but this will be dependent upon the initial centroid locations. For this reason, k-means is usually run many times, starting with different random centroids each time. The results can be compared by examining the clusters, or by a numeric measure such as the clusters' distortion, which is the sum of the squared differences between each data point and its corresponding centroid. The clustering with the lowest distortion value can be deemed the best clustering.
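A hedged sketch of this restart-and-compare procedure using scikit-learn, whose KMeans exposes the distortion as the inertia_ attribute (sum of squared distances of points to their closest centroid); the toy data is invented.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2))
               for c in ((0.0, 0.0), (5.0, 5.0), (0.0, 5.0))])

# Run k-means several times from different random centroids and keep the
# clustering with the lowest distortion
best = None
for seed in range(10):
    km = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
    if best is None or km.inertia_ < best.inertia_:
        best = km
print(best.inertia_)

In practice, KMeans's n_init parameter already performs such restarts internally and keeps the lowest-distortion run.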

Determining the value of k A common concern with centroid algorithms such as k-means is how to determine a good value for k. One answer is simply to experiment with different values of k and see which ones generate good results. The value of k can be decreased if some clusters are too small and overly specific, and increased if some clusters are too broad and diffuse. For a more objective measure, the analyst can experiment with increasing values of k and graph various metrics of the quality of the resulting clusters. As k increases, the quality metrics should eventually stabilize or plateau, either bottoming out if the metric is to be minimized or topping out if it is to be maximized.

Determining the Number of Clusters Elbow method: use the turning point in the curve of the sum of within-cluster variance with respect to the number of clusters, as sketched below. Cross-validation method: divide a given data set into m parts; use m - 1 parts to obtain a clustering model; use the remaining part to test the quality of the clustering. E.g., for each point in the test set, find the closest centroid, and use the sum of squared distances between all points in the test set and their closest centroids to measure how well the model fits the test set. For any k > 0, repeat this m times, compare the overall quality measure with respect to different values of k, and find the number of clusters that fits the data best.
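Below is a small sketch of the elbow method with scikit-learn (invented data): we fit k-means for a range of k values and print the within-cluster sum of squares, looking for the k at which the curve turns.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(40, 2))
               for c in ((0.0, 0.0), (6.0, 0.0), (3.0, 5.0))])

# Within-cluster sum of squared distances for k = 1..8; the "elbow" (around
# k = 3 for this toy data) is where adding clusters stops paying off
for k in range(1, 9):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 1))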

Understanding the results of Clustering Let us take the whiskey example and consider that the following are two of the clusters formed from the data. Group A, Scotches: Aberfeldy, Glenugie, Laphroaig, Scapa. Group H, Scotches: Bruichladdich, Deanston, Fettercairn, Glenfiddich, Glen Mhor, Glen Spey, Glentauchers, Ladyburn, Tobermory. Thus, to examine the clusters, we can look at the whiskeys in each cluster. Even if we had had massive numbers of whiskeys, we still could have sampled whiskeys from each cluster to show the composition of each. In this case, the names of the data points are meaningful in and of themselves, and convey meaning to an expert in the field. But if we are clustering customers of a large retailer, a list of the names of the customers in a cluster would probably have little meaning, so this technique for understanding the result of clustering would not be useful.

Understanding the results of Clustering What can we do in cases where we cannot simply show the names of our data points, or for which showing the names does not give sufficient understanding? Let's look again at our whiskey clusters, but this time at more information about the clusters. Group A: Scotches: Aberfeldy, Glenugie, Laphroaig, Scapa. The best of its class: Laphroaig (Islay), 10 years, 86 points. Average characteristics: full gold; fruity, salty; medium; oily, salty, sherry. Group H: Scotches: Bruichladdich, Deanston, Fettercairn, Glenfiddich, Glen Mhor, Glen Spey, Glentauchers, Ladyburn, Tobermory. The best of its class: Bruichladdich (Islay), 10 years, 76 points. Average characteristics: white wine, pale; sweet; smooth, light; sweet, dry, fruity, smoky; dry, light.

Understanding the results of Clustering First, in addition to listing the members, an exemplar member is listed; here it is the "best of its class" whiskey. These techniques can be especially useful when there are massive numbers of instances in each cluster, so randomly sampling some may not be as telling as carefully selecting exemplars. The example also illustrates a different way of understanding the result of the clustering: it shows the average characteristics of the members of the cluster; essentially, it shows the cluster centroid. Showing the centroid can be applied to any clustering; whether it is meaningful depends on whether the data values themselves are meaningful.

Requirements and Challenges Scalability: clustering all the data instead of only samples. Ability to deal with different types of attributes: numerical, binary, categorical, ordinal, linked, and mixtures of these. Constraint-based clustering: the user may give inputs on constraints; use domain knowledge to determine input parameters. Interpretability and usability. Others: discovery of clusters with arbitrary shape; ability to deal with noisy data; incremental clustering and insensitivity to input order; high dimensionality.

Strengths and Weaknesses of Each Algorithm K-means: fast and efficient, and applicable to a continuous n-dimensional space; but it often terminates at a locally optimal solution, needs the number of clusters to be specified in advance, is sensitive to noisy data and outliers, and is weak at clustering non-convex shapes. Hierarchical: provides a dendrogram; but smaller clusters may be generated, it is not very scalable (the distance matrix can be huge to compute and store), and it cannot undo what was done previously.

Measuring Clustering Quality Two methods: extrinsic vs. intrinsic. Extrinsic: supervised, i.e., the ground truth is available; compare a clustering against the ground truth using some clustering quality measure, e.g., the BCubed precision and recall metrics. Intrinsic: unsupervised, i.e., the ground truth is unavailable; evaluate the goodness of a clustering by considering how well the clusters are separated and how compact they are, e.g., the silhouette coefficient.

Silhouette Coefficient For each datum i: a(i) is the average dissimilarity of i to all other data within the same cluster (the smaller the better; it indicates cohesiveness), and b(i) is the lowest average dissimilarity of i to any other cluster of which i is not a member (the larger the better; it indicates distinctiveness). The cluster with the lowest such dissimilarity is called the neighboring cluster of i. Then s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} = \begin{cases} 1 - a(i)/b(i) & \text{if } a(i) < b(i) \\ 0 & \text{if } a(i) = b(i) \\ b(i)/a(i) - 1 & \text{if } a(i) > b(i) \end{cases}. We have -1 \le s(i) \le 1; s(i) close to 1 indicates that the datum is clustered well, while a negative s(i) indicates that it would be more appropriate if it were assigned to its neighboring cluster. The average s(i) is therefore an overall measure that captures how good a clustering result is in terms of both cohesiveness and distinctiveness.
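A hedged sketch of the silhouette computation using scikit-learn, whose silhouette_samples and silhouette_score follow this definition; the two-blob data and the choice of k = 2 are invented.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2))
               for c in ((0.0, 0.0), (4.0, 4.0))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# s(i) for every datum: values near 1 mean the datum is clustered well,
# negative values mean it sits closer to its neighboring cluster
s = silhouette_samples(X, labels)
print(s.min(), s.max())

# The mean s(i) summarizes cohesiveness and distinctiveness of the clustering
print(silhouette_score(X, labels))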

Sources F. Provost and T. Fawcett, Data Science for Business. J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. P. J. Rousseeuw (1987), "Silhouettes: a graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, 20, 53-65.