Chapter DM:II: Cluster Analysis (Stein, 2002-2018)

Contents:
- Cluster Analysis Basics
- Hierarchical Cluster Analysis
- Iterative Cluster Analysis
- Density-Based Cluster Analysis
- Cluster Evaluation
- Constrained Cluster Analysis

Cluster analysis is the unsupervised classification of a set of objects into groups, pursuing the following objectives:
1. maximize the similarities within the groups (intra-group)
2. minimize the similarities between the groups (inter-group)

Applications:
- identification of similar groups of buyers
- higher-level image processing: object recognition
- search for similar gene profiles
- specification of syndromes
- analysis of traffic data in computer networks
- visualization of complex graphs
- text categorization in information retrieval

Remarks:

The setting of a cluster analysis is the reverse of the setting of a variance analysis: a variance analysis verifies whether a nominal feature defines groups such that the members of the different groups differ significantly with regard to a numerical feature. That is, the nominal feature is in the role of the independent variable, while the numerical feature(s) is (are) in the role of the dependent variable(s). Example: the type of a product's packaging (the independent variable) may determine the number of customers (the dependent variable) who look at the product in a supermarket.

A cluster analysis, in turn, can be used to identify such a nominal feature, namely by constructing a suitable feature domain for the nominal variable: each cluster corresponds implicitly to one value of the domain. Example: equivalent but differently presented products in a supermarket are clustered with regard to the number of customers who buy them (= the impact of the product packaging is identified).

Cluster analysis is a tool for structure generation: nearly nothing is known about the nominal variable that is to be identified; in particular, there is no knowledge about the number of domain values (the number of clusters). Variance analysis, in contrast, is a tool for structure verification.

Let x_1, ..., x_n denote the p-dimensional feature vectors of n objects:

        Feature 1   Feature 2   ...   Feature p   |  Target concept
x_1     x_11        x_12        ...   x_1p        |  c_1
x_2     x_21        x_22        ...   x_2p        |  c_2
...     ...         ...         ...   ...         |  ...
x_n     x_n1        x_n2        ...   x_np        |  c_n

Unlike in classification, the target concept column c_1, ..., c_n is not available: there is no target concept.

[Figure: scatter plot of 30 two-dimensional feature vectors (n = 30, p = 2); axes: Feature 1, Feature 2.]
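To make this setup concrete, the following sketch builds such an n × p feature matrix in Python with NumPy. The three Gaussian blobs and all numeric parameters are illustrative assumptions chosen only to mimic the slide's example of 30 two-dimensional vectors.

```python
import numpy as np

# n objects, each described by a p-dimensional feature vector,
# collected row-wise in an n x p matrix X.
rng = np.random.default_rng(seed=0)

n_per_group, p = 10, 2
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])  # assumed blob centers

# Ten points around each center -> n = 30 two-dimensional feature vectors.
# Note: no target concept column is stored; clustering receives only X.
X = np.vstack([c + rng.normal(scale=0.8, size=(n_per_group, p)) for c in centers])

print(X.shape)  # (30, 2)
```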

Definition 3 (Exclusive Clustering [splitting])

Let X be a set of feature vectors. An exclusive clustering C of X, C = {C_1, C_2, ..., C_k}, C_i ⊆ X, is a partitioning of X into non-empty, mutually exclusive subsets C_i with $\bigcup_{C_i \in C} C_i = X$. (A direct transcription of this definition into a checking routine is sketched below.)

Algorithms for cluster analysis are unsupervised learning methods:
- the learning process is self-organized
- there is no (external) teacher
- the optimization criterion is task- and domain-independent

Supervised learning, in contrast:
- a learning objective such as the target concept is provided
- the optimization criterion is defined by the task or the domain
- information is provided about how the optimization criterion can be maximized (keyword: instructive feedback)
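The checking routine announced above; representing feature vectors as a set of tuples is an assumption made only to keep the example runnable and self-contained.

```python
def is_exclusive_clustering(clusters, X):
    """Check Definition 3: non-empty, mutually exclusive subsets covering X.

    `clusters` is a list of sets; `X` is a set of (hashable) feature vectors.
    """
    if any(len(c) == 0 for c in clusters):   # subsets must be non-empty
        return False
    union = set()
    for c in clusters:
        if union & c:                        # subsets must be mutually exclusive
            return False
        union |= c
    return union == X                        # subsets must cover X

# Usage: feature vectors as tuples, so they are hashable.
X = {(0, 0), (0, 1), (5, 5), (6, 5)}
C = [{(0, 0), (0, 1)}, {(5, 5), (6, 5)}]
print(is_exclusive_clustering(C, X))  # True
```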

Main Stages of a Cluster Analysis

Objects → feature extraction & preprocessing → similarity computation → merging → clusters

Feature Extraction and Preprocessing [cluster analysis stages]

Required are (possibly new) features of high variance. Approaches:
- analysis of spreading parameters
- dimension reduction: PCA, factor analysis, MDS
- visual inspection: scatter plots, box plots [Webis 2012, VDM tool]

Note that feature standardization can dampen the structure and make things worse.
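To illustrate the warning, the sketch below applies z-score standardization (one common choice; the slide does not prescribe a particular method) to data whose cluster structure lives entirely in the scale of a single feature. The example data are assumed.

```python
import numpy as np

def standardize(X):
    """z-score standardization: zero mean, unit variance per feature (column)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Feature 1 is pure noise; feature 2 alone separates two clear groups.
X = np.array([[1.0, 0.0], [1.2, 0.1], [0.9, 10.0], [1.1, 10.2]])

# Before standardization, distances are dominated by the informative feature 2;
# afterwards both features contribute equally, so the noise feature dilutes
# the cluster structure -- the dampening effect the slide warns about.
Z = standardize(X)
print(Z.std(axis=0))  # [1. 1.]
```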

Computation of Distances or Similarities [cluster analysis stages]

From the n × p feature matrix of the vectors x_1, ..., x_n, a symmetric n × n distance matrix is computed:

        x_1    x_2            ...   x_n
x_1     0      d(x_1, x_2)    ...   d(x_1, x_n)
x_2     -      0              ...   d(x_2, x_n)
...
x_n     -      -              ...   0

(The dashes mark entries omitted by symmetry.)
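A minimal sketch of this stage, assuming the Euclidean distance (the general Minkowski family follows below):

```python
import numpy as np

def distance_matrix(X):
    """Symmetric n x n matrix of Euclidean distances d(x_i, x_j).

    A direct O(n^2 p) transcription of the matrix above: zero diagonal,
    symmetric off-diagonal entries.
    """
    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = np.linalg.norm(X[i] - X[j])
    return D

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
print(distance_matrix(X))
# [[ 0.  5. 10.]
#  [ 5.  0.  5.]
#  [10.  5.  0.]]
```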

Remarks:

Usually, the distance matrix is defined implicitly by a metric on the feature space.

The distance matrix can be understood as the adjacency matrix of a weighted, undirected graph G = (V, E, w): the set X of feature vectors is mapped one-to-one (a bijection) onto a set of nodes V, and the distance d(x_i, x_j) corresponds to the weight w({u, v}) of the edge {u, v} ∈ E between the nodes u and v associated with x_i and x_j respectively.
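This graph view can be sketched with plain dictionaries; representing nodes as indices and edges as frozensets is an illustrative assumption, not a prescribed data structure.

```python
def distance_graph(D):
    """Read a distance matrix D as a weighted, undirected graph G = (V, E, w)."""
    n = len(D)
    V = list(range(n))                     # node i <-> feature vector x_i
    w = {}                                 # edge {u, v} -> weight d(x_u, x_v)
    for u in range(n):
        for v in range(u + 1, n):
            w[frozenset((u, v))] = D[u][v]
    return V, w

V, w = distance_graph([[0, 5, 10], [5, 0, 5], [10, 5, 0]])
print(w[frozenset((0, 2))])  # 10 -- the weight of edge {0, 2}
```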

Computation of Distances or Similarities (continued)

Properties of a distance function:
1. d(x_1, x_2) ≥ 0
2. d(x_1, x_1) = 0
3. d(x_1, x_2) = d(x_2, x_1)
4. d(x_1, x_3) ≤ d(x_1, x_2) + d(x_2, x_3)

Minkowski metric for features with interval-based measurement scales:

$d(x_1, x_2) = \left( \sum_{i=1}^{p} |x_{1i} - x_{2i}|^r \right)^{1/r}$

where
- r = 1: Manhattan or Hamming distance, L_1 norm
- r = 2: Euclidean distance, L_2 norm
- r = ∞: maximum distance, L_∞ norm or L_max norm
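A direct transcription of the Minkowski metric, covering the three listed special cases:

```python
import numpy as np

def minkowski(x1, x2, r):
    """Minkowski metric d(x1, x2) = (sum_i |x1_i - x2_i|^r)^(1/r).

    r=1: Manhattan/Hamming distance; r=2: Euclidean distance;
    r=inf: maximum distance (L_max norm).
    """
    diff = np.abs(np.asarray(x1) - np.asarray(x2))
    if np.isinf(r):
        return diff.max()
    return (diff ** r).sum() ** (1.0 / r)

x1, x2 = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x1, x2, 1))       # 7.0 (Manhattan)
print(minkowski(x1, x2, 2))       # 5.0 (Euclidean)
print(minkowski(x1, x2, np.inf))  # 4.0 (maximum)
```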

Computation of Distances or Similarities (continued)

Cluster analysis does not presume a particular measurement scale. Omitting the triangle inequality generalizes the distance function towards a (dis)similarity function: (dis)similarities can be quantified between all kinds of features, irrespective of the given levels of measurement.

Similarity coefficients for two feature vectors x_1, x_2 with binary features:

Simple Matching Coefficient: SMC = (f_11 + f_00) / (f_11 + f_00 + f_01 + f_10)

Jaccard Coefficient: J = f_11 / (f_11 + f_01 + f_10)

where
- f_11 = number of features with a value of 1 in both x_1 and x_2
- f_00 = number of features with a value of 0 in both x_1 and x_2
- f_01 = number of features with value 0 in x_1 and value 1 in x_2
- f_10 = number of features with value 1 in x_1 and value 0 in x_2
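Both coefficients reduce to counting the four frequencies; a minimal sketch:

```python
def smc_and_jaccard(x1, x2):
    """Simple Matching Coefficient and Jaccard coefficient for two
    binary feature vectors, via the counts f_11, f_00, f_01, f_10."""
    f11 = sum(a == 1 and b == 1 for a, b in zip(x1, x2))
    f00 = sum(a == 0 and b == 0 for a, b in zip(x1, x2))
    f01 = sum(a == 0 and b == 1 for a, b in zip(x1, x2))
    f10 = sum(a == 1 and b == 0 for a, b in zip(x1, x2))
    smc = (f11 + f00) / (f11 + f00 + f01 + f10)
    jaccard = f11 / (f11 + f01 + f10) if (f11 + f01 + f10) else 0.0
    return smc, jaccard

x1 = [1, 0, 0, 1, 1, 0]
x2 = [1, 1, 0, 0, 1, 0]
print(smc_and_jaccard(x1, x2))  # (0.666..., 0.5): f11=2, f00=2, f01=1, f10=1
```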

Remarks:

The definitions of the above similarity coefficients can be extended towards features with a nominal measurement scale.

Dedicated heterogeneous metrics have been developed, such as HEOM and HVDM, which allow the combined computation of feature values from different measurement scales.

Computing the correlation between all features of two feature vectors (not: between two features over all feature vectors) allows one to compare feature profiles. Example: the Q correlation coefficient.

The development of a suitable, realistic, and expressive similarity measure is the biggest challenge within a cluster analysis task. Typical problems:
- (unwanted) structure dampening due to normalization
- (unwanted) sensitivity concerning outliers
- (unrecognized) feature correlations
- (neglected) varying feature importance

Similarity measures can be transformed straightforwardly into dissimilarity measures and vice versa (see the sketch below).
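The transformations below are common conventions for such a conversion; they are illustrative choices, not ones prescribed by the slides.

```python
def sim_to_dissim(s):
    """Similarity -> dissimilarity, for similarities scaled to [0, 1]."""
    return 1.0 - s

def dissim_to_sim(d):
    """Dissimilarity -> similarity: maps any non-negative distance into (0, 1]."""
    return 1.0 / (1.0 + d)

print(sim_to_dissim(0.5))   # 0.5
print(dissim_to_sim(3.0))   # 0.25
```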

Merging Principles [cluster analysis stages]

- hierarchical
  - agglomerative: single link, group average
  - divisive: MinCut
- iterative
  - exemplar-based: k-means, k-medoid
  - exchange-based: Kernighan-Lin
- density-based
  - point-density-based: DBSCAN
  - attraction-based: MajorClust
- meta-search-controlled
  - gradient-based, competitive: simulated annealing, evolutionary strategies
- stochastic: Gaussian mixtures
- ...
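As one runnable representative of this taxonomy, here is a minimal sketch of k-means (iterative, exemplar-based): Lloyd's algorithm in its simplest form, without convergence tests or restarts, intended as an illustration rather than production code.

```python
import numpy as np

def kmeans(X, k, iterations=100, seed=0):
    """Minimal k-means: assign points to nearest center, move centers to means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initial exemplars
    for _ in range(iterations):
        # distances of every point to every center -> nearest-center labels
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):                 # skip emptied clusters
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels, centers = kmeans(X, k=2)
print(labels)  # two clusters, e.g. [0 0 1 1]
```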