Clustering & Bootstrapping
Jelena Prokić
University of Groningen, The Netherlands
March 25, 2009, Groningen

Overview
- What is clustering?
- Various clustering algorithms
- Bootstrapping
- Application in dialectometry

Introduction
- Cluster analysis: the study of algorithms and methods for grouping objects
- Objects are classified based on perceived similarities
- An object is described by a set of measurements, or by relationships between the object and other objects
- Clustering algorithms are used to find structure in the data

Hierarchical and flat clustering
- Hierarchical clustering: produces a sequence of nested partitions
- Flat clustering: determines a partition of the patterns into K clusters

Hierarchical and flat clustering (cont.)

Hard and soft clustering
- Hard clustering: each object is assigned to one and only one cluster; hierarchical clustering is usually hard
- Soft (fuzzy) clustering: allows degrees of membership, and membership in multiple clusters; flat clustering can be both hard and soft

Distance measure
- Euclidean distance: the distance between two points that one would measure with a ruler: $d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \dots + (p_n - q_n)^2}$
- Manhattan distance: the sum of absolute distances between the feature values of two instances: $d(p, q) = |p_1 - q_1| + |p_2 - q_2| + \dots + |p_n - q_n|$
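As a concrete illustration (not part of the original slides; the example points are invented), both measures take only a few lines of Python:

```python
import math

def euclidean(p, q):
    """Straight-line ("ruler") distance between two points."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute differences between the coordinates."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```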

Euclidean vs. Manhattan distance

Hierarchical clustering
Hierarchical clustering can be top-down or bottom-up.
- Top-down (divisive): starts with one group (all objects belong to one cluster) and divides it into groups so as to maximize within-group similarity
- Bottom-up (agglomerative): starts with a separate cluster for each object; in each step, the two most similar clusters are determined and merged into a new cluster

Cluster similarity
How do we determine the similarity between two clusters?
- Single-link clustering:
  - the similarity between two clusters is the similarity of the two closest objects in the clusters
  - checks all pairs of objects that belong to different clusters and selects the pair with the greatest similarity
  - produces clusters with good local coherence

Cluster similarity (cont.)
- Complete-link clustering: focuses on global cluster quality
  - the similarity between two clusters is the similarity of the two most dissimilar objects in the clusters
  - merges the two clusters with the smallest maximum pairwise distance
- Group-average agglomerative clustering: in each iteration merges the pair of clusters with the highest cohesion
  - looks at the average similarity between the objects in different clusters
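The three criteria differ only in how they reduce the set of cross-cluster pairwise distances. A minimal sketch in Python, phrased in terms of distances (smallest distance = most similar); the function names are my own:

```python
def cross_distances(cluster_a, cluster_b, dist):
    """All distances between objects that belong to different clusters."""
    return [dist(a, b) for a in cluster_a for b in cluster_b]

def single_link(cluster_a, cluster_b, dist):
    # distance of the two closest objects
    return min(cross_distances(cluster_a, cluster_b, dist))

def complete_link(cluster_a, cluster_b, dist):
    # distance of the two most dissimilar objects
    return max(cross_distances(cluster_a, cluster_b, dist))

def group_average(cluster_a, cluster_b, dist):
    # average distance over all cross-cluster pairs
    ds = cross_distances(cluster_a, cluster_b, dist)
    return sum(ds) / len(ds)
```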

Single link clustering

Complete link clustering

Average similarity clustering

General scheme
- Estimate pairwise distances
- Put the distance information into a matrix:

      A    B           C          D
  A   0    0.00717223  0.003664   0.00628
  B        0           0.00299    0.006288
  C                    0          0.00066
  D                               0

General scheme (cont.)
- Find the shortest distance in the matrix
- Fuse the two closest points
- Calculate the distance between the newly formed node and the rest of the nodes (matrix updating algorithms)
- Repeat until there are no more nodes to be fused
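This loop can be sketched directly in Python, using the distance matrix from the slide above and single-link updating (the node labels and printing are just for illustration):

```python
# Pairwise distances from the matrix above (symmetric, zero diagonal)
dist = {
    ("A", "B"): 0.00717223, ("A", "C"): 0.003664, ("A", "D"): 0.00628,
    ("B", "C"): 0.00299,    ("B", "D"): 0.006288,
    ("C", "D"): 0.00066,
}

def d(x, y):
    return dist[(x, y)] if (x, y) in dist else dist[(y, x)]

nodes = ["A", "B", "C", "D"]
while len(nodes) > 1:
    # find the shortest distance in the matrix
    i, j = min(((a, b) for a in nodes for b in nodes if a < b),
               key=lambda p: d(*p))
    print(f"fuse {i} and {j} at distance {d(i, j)}")
    fused = f"({i},{j})"
    # distance from the new node to every remaining node (single link: minimum)
    for k in nodes:
        if k not in (i, j):
            dist[(fused, k)] = min(d(i, k), d(j, k))
    nodes = [k for k in nodes if k not in (i, j)] + [fused]
```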

Matrix updating algorithms
- Single link: $d_{k[ij]} = \min(d_{ki}, d_{kj})$
- Complete link: $d_{k[ij]} = \max(d_{ki}, d_{kj})$
- Unweighted Pair Group Method using Arithmetic averages (UPGMA): $d_{k[ij]} = \frac{n_i}{n_i + n_j} d_{ki} + \frac{n_j}{n_i + n_j} d_{kj}$
- Weighted Pair Group Method using Arithmetic averages (WPGMA): $d_{k[ij]} = \frac{1}{2} d_{ki} + \frac{1}{2} d_{kj}$
- Unweighted Pair Group Method using Centroids (UPGMC): $d_{k[ij]} = \frac{n_i}{n_i + n_j} d_{ki} + \frac{n_j}{n_i + n_j} d_{kj} - \frac{n_i n_j}{(n_i + n_j)^2} d_{ij}$
- Weighted Pair Group Method using Centroids (WPGMC): $d_{k[ij]} = \frac{1}{2} d_{ki} + \frac{1}{2} d_{kj} - \frac{1}{4} d_{ij}$
- Ward's method: $d_{k[ij]} = \frac{n_k + n_i}{n_k + n_i + n_j} d_{ki} + \frac{n_k + n_j}{n_k + n_i + n_j} d_{kj} - \frac{n_k}{n_k + n_i + n_j} d_{ij}$
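All of these update rules are instances of the Lance–Williams recurrence $d_{k[ij]} = \alpha_i d_{ki} + \alpha_j d_{kj} + \beta d_{ij} + \gamma |d_{ki} - d_{kj}|$, where each method just fixes its own coefficients (e.g. single link is $\alpha_i = \alpha_j = 1/2$, $\beta = 0$, $\gamma = -1/2$). A sketch with a few of the schemes as wrappers; the function names are my own, not from the slides:

```python
def lance_williams(d_ki, d_kj, d_ij, a_i, a_j, beta=0.0, gamma=0.0):
    """General matrix-updating formula; each scheme fixes its own coefficients."""
    return a_i * d_ki + a_j * d_kj + beta * d_ij + gamma * abs(d_ki - d_kj)

def upgma(d_ki, d_kj, d_ij, n_i, n_j):
    n = n_i + n_j
    return lance_williams(d_ki, d_kj, d_ij, n_i / n, n_j / n)

def upgmc(d_ki, d_kj, d_ij, n_i, n_j):
    n = n_i + n_j
    return lance_williams(d_ki, d_kj, d_ij, n_i / n, n_j / n,
                          beta=-(n_i * n_j) / n**2)

def ward(d_ki, d_kj, d_ij, n_i, n_j, n_k):
    n = n_k + n_i + n_j
    return lance_williams(d_ki, d_kj, d_ij, (n_k + n_i) / n, (n_k + n_j) / n,
                          beta=-n_k / n)
```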

Flat clustering
- Starts with a partition based on randomly selected seeds
- Makes several passes, reallocating objects to the currently best cluster
- The number of clusters can be given in advance; more often, the optimal number of clusters has to be determined
- Minimum Description Length as a measure of goodness: how well the objects fit into the clusters, and how many clusters there are

K-means
- A hard clustering algorithm
- Starts by partitioning the input points into k initial sets
- Calculates the mean point, or centroid, of each set
- Constructs a new partition by associating each point with the closest centroid
- Repeats the last two steps until the objects no longer switch clusters
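A compact sketch of those steps for 2-D points (the function and the toy data are assumptions of this example, not part of the slides):

```python
import random

def kmeans(points, k, max_iter=100):
    """Plain k-means: assign each point to its nearest centroid, recompute, repeat."""
    centroids = random.sample(points, k)  # k initial seeds drawn from the data
    for _ in range(max_iter):
        # assignment step: each point joins the cluster of its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # update step: move each centroid to the mean point of its cluster
        new = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
               if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:  # objects no longer switch clusters
            break
        centroids = new
    return clusters, centroids

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
clusters, centroids = kmeans(points, k=2)
```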

K-means (cont.)

Problems
- There is no single best clustering algorithm: every algorithm has its own bias
- Success depends on the data set the algorithm is used on
- Small differences in input can lead to substantial differences in output

Traditional division of sites
Figure 1: Two-fold division. Figure 2: Six-fold division.

Two-fold division of sites: UPGMA, WPGMA, complete link, Ward's method

Two-fold division of sites (cont.): single link, UPGMC, WPGMC

Six-fold division of sites: UPGMA, WPGMA, complete link, Ward's method

Six-fold division of sites (cont.): single link, UPGMC, WPGMC

K-means
Figure 3: Two-fold division. Figure 4: Six-fold division.

Jackknife and bootstrapping
- Two general-purpose techniques for empirically estimating the variability of an estimate
- Jackknife: drop one observation at a time from one's sample and calculate the estimate each time
- Bootstrapping: resample from one's sample with replacement, making each fictional sample the same size as the original
- Both set us free from the need for Normal data and large samples

Jackknife
- Compute the desired sample statistic $St$ based upon the complete sample (of size $n$)
- Compute the corresponding statistics $St_{-i}$ based upon the sample data with each of the observations $i$ ignored in turn
- Compute the so-called pseudo-values $\phi_i$ as follows: $\phi_i = n\,St - (n - 1)\,St_{-i}$

Jackknife (cont.)
- The jackknifed estimate of the statistic is: $\hat{St} = \frac{\sum_i \phi_i}{n} = \bar{\phi}$
- The approximate standard error of $\hat{St}$ is: $s_{\hat{St}} = \sqrt{\frac{s_\phi^2}{n}} = \sqrt{\frac{\sum_i (\phi_i - \bar{\phi})^2}{n(n - 1)}}$
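Putting the pseudo-value formulas together, a small Python sketch that works for any statistic (the sample data here are made up):

```python
import math

def jackknife(sample, statistic):
    """Jackknifed estimate and approximate standard error of a statistic."""
    n = len(sample)
    st = statistic(sample)  # St on the complete sample
    # pseudo-values: phi_i = n*St - (n-1)*St_-i, with observation i left out
    pseudo = [n * st - (n - 1) * statistic(sample[:i] + sample[i + 1:])
              for i in range(n)]
    est = sum(pseudo) / n  # jackknifed estimate: the mean of the pseudo-values
    se = math.sqrt(sum((p - est) ** 2 for p in pseudo) / (n * (n - 1)))
    return est, se

data = [8.4, 7.9, 9.1, 8.7, 8.0]  # invented observations
print(jackknife(data, lambda xs: sum(xs) / len(xs)))
```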

Bootstrapping
- A related technique for obtaining standard errors and confidence limits
- Assumes the set of observations comes from an independent and identically distributed population

Step 1: Resampling
- In place of many samples from the population, create many resamples
- Each resample is obtained by random sampling with replacement from the original data set
- Each resample is the same size as the original random sample
- Sampling with replacement: after we randomly draw an observation from the original sample, we put it back before drawing the next one
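Sampling with replacement is essentially a one-liner in Python (the helper name is my own):

```python
import random

def resample(sample):
    """One bootstrap resample: n random draws with replacement from the sample."""
    return [random.choice(sample) for _ in sample]

print(resample([1, 2, 3, 4, 5]))  # e.g. [3, 1, 3, 5, 2]; duplicates are expected
```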

Resampling idea

Step 2: Bootstrap distribution
- The bootstrap distribution of a statistic collects its values from the many resamples
- The bootstrap distribution gives information about the sampling distribution
- Statistically, bootstrapped data sets contain the variation that you would get from collecting new data sets
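Collecting the statistic over many resamples gives the bootstrap distribution, and its standard deviation is the bootstrap standard error. A self-contained sketch (the data are invented, not the repair-time sample from the next slide):

```python
import random
import statistics

def bootstrap_distribution(sample, statistic, resamples=1000):
    """Values of the statistic over many resamples drawn with replacement."""
    return [statistic([random.choice(sample) for _ in sample])
            for _ in range(resamples)]

data = [8.4, 7.9, 9.1, 8.7, 8.0]  # invented observations
boot = bootstrap_distribution(data, lambda xs: sum(xs) / len(xs))
print(statistics.stdev(boot))     # bootstrap standard error of the mean
```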

Random sample distribution
- a random sample of 1644 telephone repair times
- mean: 8.41 hours

Bootstrap distribution
- nearly Normal distribution
- we get the distribution of the estimator
- we get statistics of the estimator
- bootstrap standard error: 0.367; theory-based estimate: 0.360

Bootstrapping in phylogenetics

Bootstrapping in dialectometry

References
- Anil K. Jain and Richard C. Dubes (1988). Algorithms for Clustering Data. Prentice Hall: New Jersey.
- David S. Moore and George McCabe (1993). Introduction to the Practice of Statistics. 5th edition. Freeman: New York.
- Robert R. Sokal and F. James Rohlf (1995). Biometry: The Principles and Practices of Statistics in Biological Research. 3rd edition. Freeman: New York.