Hierarchical agglomerative. Cluster Analysis. Christine Siedle Clustering 1

Hierarchical agglomerative Cluster Analysis. Christine Siedle, 19-3-2004. Clustering 1

Classification
- Basic (unconscious & conscious) human strategy to reduce complexity
- Always biased
Cluster analysis
- to find or confirm types in data
- to uncover relations between objects
- The more entities and the more attributes, the more difficult it is to classify them manually: computer-based cluster analysis

Cluster analysis overview
- Selection of objects to be classified
- Selection of relevant attributes of these objects
- Calculation of distances between objects
- Cluster analysis
- Check of results
- (Modifications + rerun analysis)

Objects
- Selection of objects depends on intention
- If clusters are expected: number of objects should be balanced
- Many objects = large distance matrix: $\frac{n(n-1)}{2}$ values (e.g. 200 objects = 19900 distance values)
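The growth of the distance matrix can be checked with a few lines. This is a minimal sketch; the function name `n_distance_values` is made up for illustration.

```python
def n_distance_values(n: int) -> int:
    """Number of pairwise distances among n objects: n(n-1)/2."""
    return n * (n - 1) // 2

# The count grows quadratically with the number of objects.
for n in (10, 200, 2000):
    print(n, "objects:", n_distance_values(n), "distance values")
```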

Attributes
- Selection of attributes depends on intention
- Not: "The more attributes, the surer groups will appear"
- Avoid correlations between attributes
- Values of attributes have to be comparable
- Treat missing values
- (Weight attributes to influence clustering)
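One common way to make attribute values comparable is z-score standardization. The slides do not say which standardization was used, so the sketch below is an assumption; the numbers are hypothetical and are not the values behind the slides.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-scores: subtract the mean, divide by the (population) standard deviation."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]

# Hypothetical attribute values on very different scales (not from the slides).
mg = [10.0, 20.0, 30.0, 40.0]
k = [100.0, 300.0, 200.0, 600.0]

mg_z = standardize(mg)
k_z = standardize(k)
print(mg_z)
print(k_z)
```

After this step both attributes have mean 0 and standard deviation 1, so neither dominates the distance calculation just because of its unit.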

Attributes example
[Figure: position of selected fruits/vegetables in the 2 dimensions magnesium & potassium. Axes: K (in mg), ticks 0 to 600, and Mg (in mg), ticks -10 to 50. Plotted items: avocado, parsnip, fennel, dandelion, passion fruit, peach, watermelon, apple, strawberry, pear, blueberry, kiwi fruit, elderberry, peas, banana, papaya.]

Distance measures
Based on the attribute values, the distances between the objects have to be determined. Distance measures have to ensure:
- Symmetry: $d(x, y) = d(y, x)$
- Triangle inequality: $d(x, y) \le d(x, z) + d(y, z)$
- Distinguishability of nonidenticals: if $d(x, y) \ne 0$, then $x \ne y$
- Indistinguishability of identicals: $d(x, x') = 0$
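These requirements can be tested numerically on a small set of points. The sketch below is illustrative only: the helper `check_properties` and the three one-dimensional points are made up. It shows that the Manhattan distance passes all checks, while the squared Euclidean distance fails the triangle inequality.

```python
from itertools import permutations

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def check_properties(points, d, tol=1e-12):
    """Check symmetry, triangle inequality and d(x, x) = 0 on the given points."""
    return {
        "symmetry": all(abs(d(x, y) - d(y, x)) <= tol
                        for x, y in permutations(points, 2)),
        "triangle": all(d(x, y) <= d(x, z) + d(y, z) + tol
                        for x, y, z in permutations(points, 3)),
        "identicals": all(d(x, x) == 0 for x in points),
    }

# Hypothetical points on a line (not from the slides).
points = [(0.0,), (1.0,), (2.0,)]
print("Manhattan:", check_properties(points, manhattan))
print("Squared Euclidean:", check_properties(points, squared_euclidean))
```

For the points 0, 1 and 2, the squared distance from 0 to 2 is 4, which exceeds the sum 1 + 1 of the two intermediate steps.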

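Distance measures examples
Distance measures:
- (Squared) Euclidean distance, shown in its squared form: $\delta(X, Y) = \sum_{i=1}^{n} (X_i - Y_i)^2$
- Manhattan distance: $\delta(X, Y) = \sum_{i=1}^{n} |X_i - Y_i|$
Similarity measures:
- Pearson's correlation coefficient: $r(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$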
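The three measures translate directly into code. A minimal sketch in plain Python; the vectors `x` and `y` are hypothetical attribute profiles, not data from the slides.

```python
from math import sqrt

def squared_euclidean(x, y):
    """Sum of squared attribute differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def manhattan(x, y):
    """Sum of absolute attribute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson(x, y):
    """Pearson's correlation coefficient (a similarity, not a distance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

# Hypothetical attribute vectors of two objects.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 7.0]
print(squared_euclidean(x, y), manhattan(x, y), pearson(x, y))
```

Note the different directions: for the two distance measures small values mean similar objects, for the correlation coefficient large values do.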

Squared Euclidean distance example
Distances of selected fruits/vegetables based on (standardized) content of Mg & K
Proximity Matrix (Squared Euclidean Distance). This is a dissimilarity matrix.
Case          1:banana  2:avocado  3:parsnip  4:dandelion
1:banana      0.000     1.250      1.477      0.183
2:avocado     1.250     0.000      0.346      0.578
3:parsnip     1.477     0.346      0.000      1.070
4:dandelion   0.183     0.578      1.070      0.000

Cluster analysis
Here discussed (because most common):
- Sequential Agglomerative Hierarchical Nonoverlapping (SAHN)
Other approaches for clustering:
- Hierarchic divisive
- Iterative partitioning
- Factor analytic
- Clumping
- ...

Cluster analysis
- Iterative process
- $n - 1$ steps necessary to cluster all objects
- At every step the two most similar objects or clusters are merged, until all are aggregated in one cluster

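Cluster analysis example
            banana  avocado  parsnip  dandelion
banana              1.25     1.477    0.183
avocado                      0.346    0.578
parsnip                               1.07
dandelion
The smallest distance is between banana and dandelion (0.183), so these two are merged first. The distance of avocado to the new cluster is:
$d_{\text{avocado},[\text{banana},\text{dandelion}]} = \frac{d_{\text{avocado},\text{banana}}}{2} + \frac{d_{\text{avocado},\text{dandelion}}}{2} = \frac{1.25}{2} + \frac{0.578}{2} = 0.914$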

Cluster analysis example
                     avocado  parsnip  [banana,dandelion]
avocado                       0.346    0.914
parsnip                                1.2735
[banana,dandelion]
                     [banana,dandelion]  [avocado,parsnip]
[banana,dandelion]                       1.09375
[avocado,parsnip]
$d_{[\text{banana},\text{dandelion}][\text{avocado},\text{parsnip}]} = \frac{d_{[\text{banana},\text{dandelion}],\text{avocado}}}{2} + \frac{d_{[\text{banana},\text{dandelion}],\text{parsnip}}}{2} = \frac{0.914}{2} + \frac{1.2735}{2} = 1.09375$
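The whole example can be replayed in a short script. This is a sketch, not the software used for the slides: it starts from the four-object distance matrix of the example and updates distances by halving and adding, as in the formulas of the example. The function name `sahn_average` and the bracketed label format are made up for illustration.

```python
from itertools import combinations

def sahn_average(dist):
    """Agglomerate until one cluster is left; return (cluster_i, cluster_j, merge value) per step."""
    d = {frozenset(pair): value for pair, value in dist.items()}
    clusters = sorted({name for pair in d for name in pair})
    merges = []
    while len(clusters) > 1:
        # Find the two most similar clusters (smallest distance).
        i, j = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merges.append((i, j, d[frozenset((i, j))]))
        new = f"[{i},{j}]"
        others = [k for k in clusters if k not in (i, j)]
        # Matrix update: distance of every other cluster k to the new cluster.
        for k in others:
            d[frozenset((k, new))] = d[frozenset((k, i))] / 2 + d[frozenset((k, j))] / 2
        clusters = others + [new]
    return merges

# Squared Euclidean distances from the fruit/vegetable example.
example = {
    ("banana", "avocado"): 1.25,
    ("banana", "parsnip"): 1.477,
    ("banana", "dandelion"): 0.183,
    ("avocado", "parsnip"): 0.346,
    ("avocado", "dandelion"): 0.578,
    ("parsnip", "dandelion"): 1.07,
}
merges = sahn_average(example)
for i, j, value in merges:
    print(f"merge {i} + {j} at {value:.5f}")
```

With four objects the loop runs three times (n - 1 steps): banana and dandelion are merged first, then avocado and parsnip, and finally the two pairs.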

Matrix updating algorithms
- There are several SAHN clustering algorithms
- They differ in how they calculate the distances of newly formed clusters to the other elements
- Not every algorithm is equally suitable for every situation
- Results can be very different!

Matrix updating algorithms
- Single linkage
- Complete linkage
- Unweighted average linkage
- Weighted average linkage
- (Un)Weighted centroid linkage
- Ward's method

Single linkage
$d_{k[ij]} = \min(d_{ki}, d_{kj})$
- Nearest neighbor
- Distance between new cluster and other elements equals the smallest distance occurring in the cluster to the other elements
- Tendency to very differently sized clusters (outliers!)

Complete linkage
$d_{k[ij]} = \max(d_{ki}, d_{kj})$
- Furthest neighbor
- Distance between new cluster and other elements equals the largest distance occurring in the cluster
- Clusters are only merged when dissimilarity is small
- Balanced and equally sized clusters
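The nearest-neighbor and furthest-neighbor updates differ only in taking the minimum or the maximum. A small sketch, using the distances of avocado (k) to banana (i) and dandelion (j) from the fruit/vegetable example; the function names are illustrative.

```python
def single_linkage(d_ki, d_kj):
    """Nearest neighbor: smallest distance to any member of the new cluster."""
    return min(d_ki, d_kj)

def complete_linkage(d_ki, d_kj):
    """Furthest neighbor: largest distance to any member of the new cluster."""
    return max(d_ki, d_kj)

# avocado to banana = 1.25, avocado to dandelion = 0.578
d_ki, d_kj = 1.25, 0.578
print("single:  ", single_linkage(d_ki, d_kj))
print("complete:", complete_linkage(d_ki, d_kj))
```

The same merge therefore leaves avocado at 0.578 from the new cluster under nearest neighbor and at 1.25 under furthest neighbor, which is why the two rules can produce very different dendrograms.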

Unweighted average linkage
$d_{k[ij]} = \frac{n_i}{n_i + n_j} d_{ki} + \frac{n_j}{n_i + n_j} d_{kj}$
- UPGMA, Baverage, linkage between groups
- Uses averages instead of extreme values
- Number of elements in clusters is taken into account

Weighted average linkage
$d_{k[ij]} = \frac{d_{ki}}{2} + \frac{d_{kj}}{2}$
- WPGMA, Waverage, linkage within groups
- Equals UPGMA, but the number of elements in clusters is not taken into account
- Can be necessary when the size of supposed clusters or the object density in them differs
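The difference between the two average linkage rules only shows up when the merged clusters have different sizes. A minimal sketch with made-up numbers (a cluster i of three objects merged with a single object j):

```python
def upgma(d_ki, d_kj, n_i, n_j):
    """Unweighted average: every object counts equally, so cluster sizes enter."""
    return (n_i * d_ki + n_j * d_kj) / (n_i + n_j)

def wpgma(d_ki, d_kj):
    """Weighted average: both merged clusters count equally, whatever their size."""
    return d_ki / 2 + d_kj / 2

# Hypothetical values: cluster i has 3 objects, cluster j has 1.
d_ki, d_kj, n_i, n_j = 1.0, 2.0, 3, 1
print("UPGMA:", upgma(d_ki, d_kj, n_i, n_j))
print("WPGMA:", wpgma(d_ki, d_kj))
```

With equal cluster sizes both functions return the same value, which is why the two rules agree when two single objects are merged.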

(Un)Weighted centroid linkage
- Unweighted: $d_{k[ij]} = \frac{n_i}{n_i + n_j} d_{ki} + \frac{n_j}{n_i + n_j} d_{kj} - \frac{n_i n_j}{(n_i + n_j)^2} d_{ij}$
- Weighted: $d_{k[ij]} = \frac{d_{ki}}{2} + \frac{d_{kj}}{2} - \frac{d_{ij}}{4}$
- Centroid of cluster is calculated
- Distance to new cluster equals distance to centroid

Ward's method
$d_{k[ij]} = \frac{n_k + n_i}{n_k + n_i + n_j} d_{ki} + \frac{n_k + n_j}{n_k + n_i + n_j} d_{kj} - \frac{n_k}{n_k + n_i + n_j} d_{ij}$
- Minimum variance
- Idea: heterogeneity is not a reasonable feature of clusters, so minimize variance
- To be used only with quantitative attributes and squared Euclidean distance!
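The unweighted centroid update and Ward's update can be checked against a direct calculation on points. The sketch below uses three hypothetical one-dimensional objects and squared Euclidean distances; function names are illustrative. For the centroid rule the result should equal the squared distance from k to the centroid of the merged cluster.

```python
def centroid_update(d_ki, d_kj, d_ij, n_i, n_j):
    """Unweighted centroid linkage, applied to squared Euclidean distances."""
    n = n_i + n_j
    return n_i / n * d_ki + n_j / n * d_kj - n_i * n_j / n ** 2 * d_ij

def ward_update(d_ki, d_kj, d_ij, n_k, n_i, n_j):
    """Ward's method, applied to squared Euclidean distances."""
    total = n_k + n_i + n_j
    return ((n_k + n_i) * d_ki + (n_k + n_j) * d_kj - n_k * d_ij) / total

# Hypothetical single objects on a line: k at 0, i at 2, j at 4.
k, i, j = 0.0, 2.0, 4.0
d_ki, d_kj, d_ij = (k - i) ** 2, (k - j) ** 2, (i - j) ** 2

centroid_ij = (i + j) / 2
print("centroid update:", centroid_update(d_ki, d_kj, d_ij, 1, 1))
print("direct squared distance to centroid:", (k - centroid_ij) ** 2)
print("Ward update:", ward_update(d_ki, d_kj, d_ij, 1, 1, 1))
```

Both rules subtract a share of $d_{ij}$, the distance between the two merged clusters, which the linkage rules based on minimum, maximum or averages do not use.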

Matrix updating algorithms
Types of algorithms:
- Space-contracting (Single & Centroid (?) linkage)
  - Unequally sized clusters
  - Outliers visible
- Space-dilating (Complete linkage & Ward's method)
  - Balanced clustering
  - Clusters are often not easy to interpret
- Space-conserving (Average linkage)
  - No unnaturally blown up clusters
  - Appearing clusters are often interpretable

Space-contracting example 1
[Figure: dendrogram generated by single linkage]

Space-contracting example 2
[Figure: two panels, Single linkage and WPGMC, each showing the cities Kiel, Rostock, Hamburg, Emden, Bremen, Berlin, Hannover, Magdeburg, Münster, Cottbus, Dresden, Köln, Erfurt, Marburg, Frankfurt, Trier, Nürnberg, Saarbrücken, Regensburg, Stuttgart, München, Freiburg]

Space-dilating example 1
[Figure: dendrogram generated by Ward's method]

Space-dilating example 2
[Figure: two panels, Ward's method and Complete linkage, each showing 22 German cities from Kiel to Freiburg]

Space-conserving example 1
[Figure: dendrogram generated by UPGMA]

Space-conserving example 2
[Figure: two panels, UPGMA and WPGMA, each showing 22 German cities from Kiel to Freiburg]

Matrix updating algorithms
Which should be used?
- Outliers shall be visible: Single linkage
- Unequally sized clusters expected: not space-dilating methods
- Differing object density in expected clusters: WPGMA
- No-idea-just-try order: space-conserving > space-dilating > space-contracting

Number of clusters
How many natural classes has cluster analysis generated?
- Subjective decision of researcher
- Analysis of merging values
  - Large step = rather dissimilar clusters = stop
- Plot number of clusters against merging values
  - Graph flattens = no new information = stop
- Ward's method: significance test possible
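The "large step" rule can be sketched as a search for the biggest jump between consecutive merging values. The function name `suggest_n_clusters` is made up, and the rule assumes merging values that do not decrease from step to step. The merging values are the three from the fruit/vegetable example (0.183, 0.346, 1.09375).

```python
def suggest_n_clusters(merge_values, n_objects):
    """Stop before the largest jump in merging values; return the cluster count at that point."""
    steps = [b - a for a, b in zip(merge_values, merge_values[1:])]
    largest = max(range(len(steps)), key=steps.__getitem__)
    # Merges 0..largest are carried out; each merge reduces the cluster count by one.
    return n_objects - (largest + 1)

merge_values = [0.183, 0.346, 1.09375]  # merging values of the four-object example
n_clusters = suggest_n_clusters(merge_values, n_objects=4)
print("suggested number of clusters:", n_clusters)
```

This only automates one heuristic; the decision stays subjective, as stated above.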

Validation of results
- Results should be stable
- Plausible interpretation possible
- Repeat cluster analysis with different samples of the same population
  - Different results = both invalid, but
  - Same results = not necessarily valid, and not always possible due to lack of data
- Cophenetic correlation, but
  - Normal distribution (wrongly?) assumed
  - In dendrogram fewer (different) values
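The cophenetic correlation compares the original distances with the distances read off the dendrogram (the merging value at which two objects first end up in the same cluster). A minimal sketch for the four-object fruit/vegetable example, with the dendrogram values 0.183, 0.346 and 1.09375 from that example written out by hand:

```python
from math import sqrt

pairs = [("banana", "avocado"), ("banana", "parsnip"), ("banana", "dandelion"),
         ("avocado", "parsnip"), ("avocado", "dandelion"), ("parsnip", "dandelion")]

# Original squared Euclidean distances, in the order of `pairs`.
original = [1.25, 1.477, 0.183, 0.346, 0.578, 1.07]

# Cophenetic distances: banana/dandelion join at 0.183, avocado/parsnip at 0.346,
# all other pairs only meet in the final merge at 1.09375.
cophenetic = [1.09375, 1.09375, 0.183, 0.346, 1.09375, 1.09375]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

r = pearson(original, cophenetic)
print(f"cophenetic correlation: {r:.3f}")
```

The list `cophenetic` contains only three different values for six pairs, which illustrates the point that the dendrogram holds fewer (different) values than the distance matrix.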

Validation of results
- Significance tests
  - Used attributes: useless because always significant
  - Not used (but relevant) attributes: useful, but only possible when knowledge about classes already exists
- Monte Carlo procedures
  - A data set is created which has the same global properties as the original data but contains no classes
  - Both sets are clustered & results compared
  - Significant differences => results valid
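The slides do not say how the class-free data set is built. One possible reading, shown here purely as an assumption, is to shuffle every attribute independently: each attribute keeps its own distribution, but any joint group structure is destroyed. The function name `null_dataset` and the data are hypothetical.

```python
import random

def null_dataset(rows, rng):
    """Shuffle each attribute column independently (assumed null model, not from the slides)."""
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)
    return [tuple(values) for values in zip(*columns)]

# Hypothetical data with two obvious groups (not from the slides).
rows = [(1.0, 10.0), (1.2, 11.0), (5.0, 50.0), (5.3, 52.0)]
rng = random.Random(0)  # fixed seed so the run is repeatable
shuffled = null_dataset(rows, rng)
print(shuffled)
```

Both `rows` and many such shuffled sets would then be clustered with the same settings and the results compared.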

Attention!
- A lot of factors determine the results of cluster analysis
- Very careful selection of objects, attributes, (dis)similarity measure, cluster method and matrix updating algorithm
- Cluster analysis will always output clusters, whether there are natural classes or not!