Chapter 4: Clustering

Ludwig-Maximilians-Universität München, Institut für Informatik, Lehr- und Forschungseinheit für Datenbanksysteme
Knowledge Discovery in Databases, winter semester 2017/18
Chapter 4: Clustering
Lecture: Prof. Dr. Peer Kröger
Tutorials: Anna Beer, Florian Richter

Contents
1) Introduction to Clustering
2) Partitioning Methods
   - K-Means
   - Variants: K-Medoid, K-Mode, K-Median
   - Choice of parameters: Initialization, Silhouette coefficient
3) Probabilistic Model-Based Clusters: Expectation Maximization
4) Density-based Methods: DBSCAN
5) Hierarchical Methods
   - Agglomerative and Divisive Hierarchical Clustering
   - Density-based hierarchical clustering: OPTICS
6) Evaluation of Clustering Results
7) Further Clustering Topics
   - Ensemble Clustering
   - Discussion: an alternative view on DBSCAN

What is Clustering?
- Grouping a set of data objects into clusters
- Cluster: a collection of data objects that are
  1) similar to one another within the same cluster
  2) dissimilar to the objects in other clusters
- Clustering = unsupervised classification (no predefined classes)
- Typical usage:
  - as a stand-alone tool to get insight into the data distribution
  - as a preprocessing step for other algorithms

General Applications of Clustering
- Preprocessing as a data reduction (instead of sampling), e.g.
  - image databases (color histograms for filter distances)
  - stream clustering (handle endless data sets for offline clustering)
- Pattern Recognition and Image Processing
- Spatial Data Analysis
  - create thematic maps in Geographic Information Systems by clustering feature spaces
  - detect spatial clusters and explain them in spatial data mining
- Business Intelligence (especially market research)
- WWW (documents for Web Content Mining, web logs for Web Usage Mining, ...)
- Biology
  - clustering of gene expression data

An Application Example: Thematic Maps
- Satellite images of a region in different wavelengths (bands)
- Each point on the surface maps to a high-dimensional feature vector $p = (x_1, \dots, x_d)$, where $x_i$ is the recorded intensity at the surface point in band $i$.
- Assumption: each different land use reflects and emits light of different wavelengths in a characteristic way.
[Figure: points on the surface of the earth mapped into a feature space with axes Band 1 and Band 2, where they form Cluster 1, Cluster 2, and Cluster 3]

An Application Example: Downsampling Images
- Reassign color values to k distinct colors
- Cluster pixels using color difference, not spatial data
[Figure: the same image reduced to 65536, 256, 16, 8, 4, and 2 colors; file sizes shown: 58483 KB, 19496 KB, 9748 KB]

Major Clustering Approaches
- Partitioning algorithms: find k partitions, minimizing some objective function
- Probabilistic Model-Based Clustering (EM)
- Density-based: find clusters based on connectivity and density functions
- Hierarchical algorithms: create a hierarchical decomposition of the set of objects
- Other methods
  - Grid-based
  - Neural networks (SOMs)
  - Graph-theoretical methods
  - Subspace Clustering
  - ...

Partitioning Algorithms: Basic Concept
- Goal: Construct a partition of a database $D$ of $n$ objects into a set of $k$ ($k < n$) clusters $C_1, \dots, C_k$ ($C_i \subseteq D$, $C_i \cap C_j = \emptyset$ for $i \neq j$, $\bigcup_i C_i = D$) minimizing an objective function.
- Exhaustively enumerating all possible partitions into $k$ sets in order to find the global minimum is too expensive.
- Popular heuristic methods:
  - Choose $k$ representatives for clusters, e.g., randomly.
  - Improve these initial representatives iteratively:
    - Assign each object to the cluster it fits best in the current clustering.
    - Compute new cluster representatives based on these assignments.
    - Repeat until the change in the objective function from one iteration to the next drops below a threshold.
- Examples of representatives for clusters:
  - k-means: each cluster is represented by the center of the cluster
  - k-medoid: each cluster is represented by one of its objects

Contents
1) Introduction to clustering
2) Partitioning Methods
   - K-Means
   - Variants: K-Medoid, K-Mode, K-Median
   - Choice of parameters: Initialization, Silhouette coefficient
3) Probabilistic Model-Based Clusters: Expectation Maximization
4) Density-based Methods: DBSCAN
5) Hierarchical Methods
   - Agglomerative and Divisive Hierarchical Clustering
   - Density-based hierarchical clustering: OPTICS
6) Evaluation of Clustering Results
7) Further Clustering Topics
   - Scaling Up Clustering Algorithms
   - Outlier Detection

K-Means Clustering: Basic Idea
- Idea of K-means: find a clustering such that the within-cluster variation of each cluster is small, and use the centroid of a cluster as its representative.
- Objective: For a given k, form k groups so that the sum of the (squared) distances between the mean of each group and its elements is minimal.
[Figure: a poor clustering (large sum of distances between the cluster means μ and their elements) next to the optimal clustering (minimal sum of distances), with the centroids marked]
References:
- S. P. Lloyd: Least squares quantization in PCM. IEEE Transactions on Information Theory, 1982 (original version: technical report, Bell Labs, 1957).
- J. MacQueen: Some methods for classification and analysis of multivariate observations. In Proc. of the 5th Berkeley Symp. on Math. Statist. and Prob., 1967.

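K-Means Clustering: Basic Notions
- Objects $p = (p_1, \dots, p_d)$ are points in a $d$-dimensional vector space (the mean $\mu_S$ of a set of points $S$ must be defined: $\mu_S = \frac{1}{|S|} \sum_{p \in S} p$).
- $C(p)$: the cluster that $p$ is assigned to.
- Measure for the compactness of a cluster $C_j$ (sum of squared errors): $SSE(C_j) = \sum_{p \in C_j} dist(p, \mu_{C_j})^2$
- Measure for the compactness of a clustering $\mathcal{C}$: $SSE(\mathcal{C}) = \sum_{C_j \in \mathcal{C}} SSE(C_j) = \sum_{p \in D} dist(p, \mu_{C(p)})^2$
- Optimal partitioning: $\operatorname{argmin}_{\mathcal{C}} SSE(\mathcal{C})$
- Optimizing the within-cluster variation is computationally challenging (NP-hard), so efficient heuristic algorithms are used.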
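The compactness measures can be checked on a toy example. The following is a minimal sketch, assuming Euclidean distance and points stored as tuples; the function names and the four sample points are illustrative and not taken from the lecture.

```python
from math import dist


def mean(points):
    """Component-wise mean of a non-empty list of d-dimensional points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sse_cluster(cluster):
    """Sum of squared Euclidean distances between a cluster's points and its mean."""
    mu = mean(cluster)
    return sum(dist(p, mu) ** 2 for p in cluster)


def sse_clustering(clusters):
    """Compactness of a whole clustering: the sum of the per-cluster values."""
    return sum(sse_cluster(c) for c in clusters)


# Hypothetical data: two tight pairs of points, far apart from each other.
good = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
poor = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]

print(sse_clustering(good), sse_clustering(poor))
```

Grouping the nearby points together gives a much smaller value than grouping points across the gap, which is what the argmin in the optimal partitioning selects for.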

K-Means Clustering: Algorithm
k-means algorithm (Lloyd's algorithm): Given k, the k-means algorithm is implemented in 2 main steps.
- Initialization: Choose k arbitrary representatives.
- Repeat until the representatives do not change:
  1. Assign each object to the cluster with the nearest representative.
  2. Compute the centroids of the clusters of the current partitioning.
[Figure: five scatter plots showing the loop. Init: arbitrary representatives; assign objects: new clustering candidate; compute new means: centroids of current partition; assign objects: new clustering candidate; compute new means: centroids of current partition; repeat until stable]
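The two steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance, picks the initial representatives by random sampling from the data, keeps the old centroid if a cluster becomes empty, and stops when the assignment no longer changes (which implies that the centroids no longer change either). The function name, the `seed` and `max_iter` parameters, and the sample points are assumptions made for this example.

```python
import random
from math import dist


def k_means(points, k, seed=0, max_iter=100):
    """Lloyd-style k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialization: k arbitrary representatives
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign each object to the cluster with the nearest representative.
        new_assignment = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # stable: the centroids would not change any more
        assignment = new_assignment
        # Step 2: compute the centroids of the current partitioning.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment


# Hypothetical data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = k_means(data, k=2)
print(sorted(centroids), labels)
```

On this small data set every choice of two initial representatives leads to the same partition; on real data the result depends on the initialization.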

K-Means Clustering: Discussion
- Strengths
  - Relatively efficient: O(tkn), where n = # objects, k = # clusters, and t = # iterations; typically k, t << n
  - Easy implementation
- Weaknesses
  - Applicable only when the mean is defined
  - Need to specify k, the number of clusters, in advance
  - Sensitive to noisy data and outliers
  - Clusters are forced to convex space partitions (Voronoi cells)
  - Result and runtime strongly depend on the initial partition; often terminates at a local optimum. However, methods for a good initialization exist.
- Several variants of the k-means method exist, e.g., ISODATA
  - Extends k-means by methods to eliminate very small clusters and to merge and split clusters; the user has to specify additional parameters

K-Medoid, K-Modes, K-Median Clustering: Basic Idea
- Problems with K-Means:
  - Applicable only when the mean is defined (vector space)
  - Outliers have a strong influence on the result
- The influence of outliers is intensified by the use of the squared error, so use the absolute error (total distance) instead: $TD(C) = \sum_{p \in C} dist(p, m_C)$ and $TD = \sum_{i=1}^{k} TD(C_i)$, where $m_C$ is the representative of cluster $C$.
- Three alternatives to using the mean as representative:
  - Medoid: representative object "in the middle"
  - Mode: value that appears most often
  - Median: (artificial) representative object "in the middle"
- Objective as for k-means: Find k representatives so that the sum of the distances between objects and their closest representative is minimal.
[Figure: a data set, a poor clustering, and the optimal clustering]
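A small one-dimensional example can make the effect of the squared error visible. The sketch below uses made-up values with a single outlier and compares the mean and the median as representatives under both error measures; the helper names are illustrative.

```python
from statistics import mean, median

# Hypothetical one-dimensional cluster with a single outlier (100.0).
values = [1.0, 2.0, 3.0, 4.0, 100.0]


def squared_error(rep, xs):
    """Sum of squared distances to the representative (k-means objective)."""
    return sum((x - rep) ** 2 for x in xs)


def total_distance(rep, xs):
    """Sum of absolute distances to the representative (TD)."""
    return sum(abs(x - rep) for x in xs)


m = mean(values)      # pulled towards the outlier
med = median(values)  # stays with the bulk of the data

print(m, med)
print(total_distance(m, values), total_distance(med, values))
print(squared_error(m, values), squared_error(med, values))
```

The mean is the better representative under the squared error, but it sits far away from four of the five values; under the total distance the median wins.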

K-Medoid Clustering: PAM Algorithm
Partitioning Around Medoids [Kaufman and Rousseeuw, 1990]
Given k, the k-medoid algorithm is implemented in 3 steps.
- Initialization: Select k objects arbitrarily as initial medoids (representatives); assign each remaining (non-medoid) object to the cluster with the nearest representative, and compute $TD_{current}$.
- Repeat
  1. For each pair (medoid M, non-medoid N), compute the value $TD_{N \leftrightarrow M}$, i.e., the value of TD for the partition that results when swapping M with N.
  2. Select the best pair (M, N), for which $TD_{N \leftrightarrow M}$ is minimal.
  3. If $TD_{N \leftrightarrow M} < TD_{current}$: swap N with M and set $TD_{current} := TD_{N \leftrightarrow M}$.
- Until nothing changes
- Problem of PAM: high complexity, $O(t \cdot k \cdot (n-k)^2)$ for $t$ iterations
Reference: Kaufman L., Rousseeuw P. J.: Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
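The swap loop can be sketched as follows. This is a naive illustration that recomputes TD from scratch for every candidate swap; it assumes Euclidean distance and uses the first k objects as the arbitrary initial medoids. The function names and the sample points are assumptions made for this example.

```python
from itertools import product
from math import dist


def total_distance(points, medoids):
    """TD: sum of the distances between each object and its closest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def pam(points, k):
    """Naive PAM: repeat the best (medoid, non-medoid) swap while it lowers TD."""
    medoids = list(points[:k])  # assumption: the first k objects are the initial medoids
    td_current = total_distance(points, medoids)
    while True:
        best_td, best_swap = td_current, None
        # Step 1: evaluate TD for every pair (medoid M at index i, non-medoid N).
        for i, n in product(range(k), points):
            if n in medoids:
                continue
            candidate = medoids[:i] + [n] + medoids[i + 1:]
            td = total_distance(points, candidate)
            # Step 2: remember the best pair seen so far.
            if td < best_td:
                best_td, best_swap = td, (i, n)
        # Step 3: swap only if the best pair improves on the current TD.
        if best_swap is None:
            return medoids, td_current
        i, n = best_swap
        medoids[i] = n
        td_current = best_td


# Hypothetical data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
medoids, td = pam(data, k=2)
print(medoids, td)
```

Both initial medoids lie in the same group here, and a single swap moves one of them into the other group. Every iteration examines all k(n-k) pairs and each evaluation touches all objects, which is where the high cost comes from.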

K-Medoid Clustering: CLARANS Algorithm
Optimization: CLARANS [Ng & Han 1994]
- Trading accuracy for speed
- Two additional tuning parameters: maxneighbor and numlocal
- At most maxneighbor pairs (M, N) are considered in each iteration (Step 1)
- Best first: take the first pair (M, N) that reduces the TD value instead of evaluating all (maxneighbor) pairs (worst case: still maxneighbor)
- Termination after numlocal iterations, even if convergence is not yet reached
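A randomized variant of the swap loop, following the description on this slide, might look like the sketch below. It is an illustration of the slide's wording only and does not claim to reproduce the published CLARANS procedure in detail; the function names, the `seed` parameter, and the sample points are assumptions.

```python
import random
from math import dist


def total_distance(points, medoids):
    """TD: sum of the distances between each object and its closest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def clarans_sketch(points, k, maxneighbor, numlocal, seed=0):
    """Randomized medoid swapping with a cap on pairs and on iterations."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)
    td_current = total_distance(points, medoids)
    for _ in range(numlocal):  # stop after numlocal iterations at the latest
        improved = False
        for _ in range(maxneighbor):  # look at no more than maxneighbor random pairs
            i = rng.randrange(k)
            n = rng.choice(points)
            if n in medoids:
                continue
            candidate = medoids[:i] + [n] + medoids[i + 1:]
            td = total_distance(points, candidate)
            if td < td_current:  # best first: accept the first improving swap
                medoids, td_current = candidate, td
                improved = True
                break
        if not improved:
            break  # no improving pair among the sampled ones
    return medoids, td_current


# Hypothetical data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
medoids, td = clarans_sketch(data, k=2, maxneighbor=10, numlocal=5)
print(medoids, td)
```

Because only a random sample of pairs is inspected, the result is not guaranteed to match the one found by an exhaustive swap search.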

K-Median Clustering
- Problem: Sometimes, data is not numerical.
- Idea: If there is an ordering on the data $X = \{x_1, x_2, \dots, x_n\}$, use the median instead of the mean:
  - $Median(\{x\}) = x$
  - $Median(\{x, y\}) \in \{x, y\}$
  - $Median(X) = Median(X \setminus \{\min X, \max X\})$, if $|X| > 2$
- A median is computed in each dimension independently and can thus be a combination of multiple instances.
- The median can be computed efficiently for ordered data.
- Different strategies to determine the "middle" in an array of even length are possible.
[Figure: a data set with the positions of the median and the mean marked]
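The remark that a median can be a combination of multiple instances is easy to reproduce. The sketch below computes the median per dimension with `statistics.median_low`, which picks the lower of the two middle values for even-length input; that choice is one of several possible strategies and is an assumption here, as are the function name and the sample objects.

```python
from statistics import median_low


def dimensionwise_median(objects):
    """Median of each attribute, computed independently per dimension."""
    return tuple(median_low(column) for column in zip(*objects))


# Hypothetical cluster of three two-dimensional objects.
cluster = [(1, 9), (2, 1), (9, 2)]
representative = dimensionwise_median(cluster)

print(representative, representative in cluster)
```

The first coordinate of the result comes from the second object and the second coordinate from the third object, so the representative is an artificial object that does not occur in the cluster.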

K-Mode Clustering: First Approach [Huang 1997]
- Given: $X \subseteq \Omega = A_1 \times \dots \times A_d$ is a set of $n$ objects, each described by $d$ categorical attributes $A_i$ ($1 \le i \le d$).
- Mode: a mode of $X$ is a vector $M = [m_1, m_2, \dots, m_d] \in \Omega$ that minimizes $d(M, X) = \sum_{x_i \in X} d(x_i, M)$, where $d$ is a distance function for categorical values (e.g. Hamming distance).
- Note: $M$ is not necessarily an element of $X$.
- For Hamming distance: the mode is determined by the most frequent value in each attribute.
Reference: Huang, Z.: A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining. In DMKD, 1997.

K-Mode Clustering: First Approach [Huang 1997]
- Theorem to determine a mode: Let $f(c, j, X) = \frac{1}{n} \cdot |\{x \in X \mid x[j] = c\}|$ be the relative frequency of category $c$ of attribute $A_j$ in the data. Then $d(M, X)$ is minimal $\Leftrightarrow \forall j \in \{1, \dots, d\}: \forall c \in A_j: f(m_j, j, X) \ge f(c, j, X)$.
- This allows the k-means paradigm to be used to cluster categorical data without losing its efficiency.
- Note: the mode of a data set might not be unique.
- The K-Modes algorithm proceeds similarly to the k-means algorithm.

K-Mode Clustering: Example

Employee-ID   Profession   Household Pets
#133          Technician   Cat
#134          Manager      None
#135          Cook         Cat
#136          Programmer   Dog
#137          Programmer   None
#138          Technician   Cat
#139          Programmer   Snake
#140          Cook         Cat
#141          Advisor      Dog

- Profession: (Programmer: 3, Technician: 2, Cook: 2, Advisor: 1, Manager: 1)
- Household Pet: (Cat: 4, Dog: 2, None: 2, Snake: 1)
- Mode is (Programmer, Cat)
- Remark: (Programmer, Cat) is not an object of the data set.
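The example can be reproduced with a few lines of Python. The sketch below takes the most frequent value per attribute, which yields a mode under Hamming distance, and then checks the remark that this mode is not one of the employees. The helper names are illustrative; the data is the table from the example.

```python
from collections import Counter

# (Profession, Household Pet) for employees #133 to #141, as in the example table.
employees = [
    ("Technician", "Cat"), ("Manager", "None"), ("Cook", "Cat"),
    ("Programmer", "Dog"), ("Programmer", "None"), ("Technician", "Cat"),
    ("Programmer", "Snake"), ("Cook", "Cat"), ("Advisor", "Dog"),
]


def mode(objects):
    """Most frequent value of each attribute, taken independently per attribute."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*objects))


def hamming(x, y):
    """Number of attributes in which two objects differ."""
    return sum(a != b for a, b in zip(x, y))


m = mode(employees)
total = sum(hamming(x, m) for x in employees)

print(m, m in employees, total)
```

If several values of an attribute are equally frequent, `most_common` returns only one of them, which matches the note that a mode need not be unique.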

K-Means/Medoid/Mode/Median overview

Employee-ID   Profession   Shoe size   Age
#133          Technician   42          28
#134          Manager      41          45
#135          Cook         46          32
#136          Programmer   40          35
#137          Programmer   41          49
#138          Technician   43          41
#139          Programmer   39          29
#140          Cook         38          33
#141          Advisor      40          56

[Figure: scatter plots of Age against Shoe size marking the mean, the medoid, and the median; for the mode the annotated values are Profession: Programmer, Shoe size: 40/41, Age: n.a.]

K-Means/Median/Mode/Medoid Clustering: Discussion

Method     Data                         Efficiency   Sensitivity to outliers
k-means    numerical data (mean)        high         high
k-median   ordered attribute data       high         low
k-mode     categorical attribute data   high         low
k-medoid   metric data                  low          low

- Strength
  - Easy implementation (many variations and optimizations in the literature)
- Weaknesses
  - Need to specify k, the number of clusters, in advance
  - Clusters are forced to convex space partitions (Voronoi cells)
  - Result and runtime strongly depend on the initial partition; often terminates at a local optimum. However, methods for a good initialization exist.

Voronoi Model for convex cluster regions
- Definition: Voronoi diagram
  - For a given set of points $P = \{p_i \mid i = 1, \dots, k\}$ (here: cluster representatives), a Voronoi diagram partitions the data space into Voronoi cells, one cell per point.
  - The cell of a point $p \in P$ covers all points in the data space for which $p$ is the nearest neighbor among the points from $P$.
- Observations
  - The Voronoi cells of two neighboring points $p_i, p_j \in P$ are separated by the perpendicular bisecting hyperplane between $p_i$ and $p_j$.
  - As Voronoi cells are intersections of half spaces, they are convex regions.
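Both observations can be illustrated with a short sketch, assuming Euclidean distance. The first function returns the Voronoi cell of a point as the index of its nearest representative. The second evaluates the bisecting hyperplane between two representatives $p$ and $q$, which follows from expanding $\|x - p\|^2 < \|x - q\|^2$ into $(q - p) \cdot x < (\|q\|^2 - \|p\|^2) / 2$. The function names and the sample coordinates are assumptions made for this example.

```python
from math import dist


def voronoi_cell(x, representatives):
    """Index of the representative whose Voronoi cell contains x."""
    return min(range(len(representatives)), key=lambda i: dist(x, representatives[i]))


def bisector_side(x, p, q):
    """Negative if x is closer to p, positive if closer to q, zero on the bisector."""
    w = [qi - pi for pi, qi in zip(p, q)]  # normal vector q - p
    b = (sum(qi * qi for qi in q) - sum(pi * pi for pi in p)) / 2
    return sum(wi * xi for wi, xi in zip(w, x)) - b


# Hypothetical representatives and query points.
reps = [(0.0, 0.0), (4.0, 0.0)]
for x in [(1.0, 5.0), (3.0, -2.0), (2.0, 7.0)]:
    print(x, voronoi_cell(x, reps), bisector_side(x, reps[0], reps[1]))
```

Each cell is the set of points lying on the near side of every such hyperplane, that is, an intersection of half spaces, which is why the cells are convex.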

Voronoi Model for convex cluster regions (2)
[Figure: the Voronoi parcellation of the data space shown together with the convex hull of each cluster]