Cluster Analysis and Visualization. Workshop on Statistics and Machine Learning 2004/2/6

Similar documents
An Introduction to Cluster Analysis. Zhaoxia Yu Department of Statistics Vice Chair of Undergraduate Affairs

ESANN'2001 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2001, D-Facto public., ISBN ,

Dimension reduction : PCA and Clustering

Cluster Analysis for Microarray Data

MATH5745 Multivariate Methods Lecture 13

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

Data Mining: Exploring Data. Lecture Notes for Chapter 3

Data Mining: Exploring Data. Lecture Notes for Chapter 3

Understanding Clustering Supervising the unsupervised

COMBINED METHOD TO VISUALISE AND REDUCE DIMENSIONALITY OF THE FINANCIAL DATA SETS

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining

University of Florida CISE department Gator Engineering. Clustering Part 2

Introduction to R and Statistical Data Analysis

Data Mining - Data. Dr. Jean-Michel RICHER Dr. Jean-Michel RICHER Data Mining - Data 1 / 47

The Curse of Dimensionality

Graph projection techniques for Self-Organizing Maps

Tutorial 3. Chiun-How Kao 高君豪

Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms

/00/$10.00 (C) 2000 IEEE

Unsupervised Learning

Performance Measure of Hard c-means,fuzzy c-means and Alternative c-means Algorithms

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Machine Learning (BSMC-GA 4439) Wenke Liu

BL5229: Data Analysis with Matlab Lab: Learning: Clustering

Experimental Design + k- Nearest Neighbors

Unsupervised Learning and Clustering

KTH ROYAL INSTITUTE OF TECHNOLOGY. Lecture 14 Machine Learning. K-means, knn

Introduction to Artificial Intelligence

Distances, Clustering! Rafael Irizarry!

Chemometrics. Description of Pirouette Algorithms. Technical Note. Abstract

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Concept Tree Based Clustering Visualization with Shaded Similarity Matrices

Cluster Analysis, Multidimensional Scaling and Graph Theory. Cluster Analysis, Multidimensional Scaling and Graph Theory

3. Multidimensional Information Visualization II Concepts for visualizing univariate to hypervariate data

Machine Learning (BSMC-GA 4439) Wenke Liu

Recognizing Handwritten Digits Using the LLE Algorithm with Back Propagation

Analytical Techniques for Anomaly Detection Through Features, Signal-Noise Separation and Partial-Value Association

Unsupervised Learning and Clustering

Courtesy of Prof. Shixia University

k Nearest Neighbors Super simple idea! Instance-based learning as opposed to model-based (no pre-processing)

Dimension Reduction CS534

Multivariate Analysis

Week 7 Picturing Network. Vahe and Bethany

Dynamic Thresholding for Image Analysis

10701 Machine Learning. Clustering

Selecting Models from Videos for Appearance-Based Face Recognition

Unsupervised learning

University of Florida CISE department Gator Engineering. Clustering Part 5

Visualisation in Multidimensional Space: case of Robust Design. Valentino Pediroda, Dipartimento di Ingegneria Meccanica Università di Trieste

Artificial Neural Networks Unsupervised learning: SOM

Data Mining: Exploring Data

劉介宇 國立台北護理健康大學 護理助產研究所 / 通識教育中心副教授 兼教師發展中心教師評鑑組長 Nov 19, 2012

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

CIE L*a*b* color model

Lecture Topic Projects

Cluster Analysis using Spherical SOM

Alankrita Aggarwal et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 2 (5), 2011,

Clustering CS 550: Machine Learning

MSA220 - Statistical Learning for Big Data

Clustering and Visualisation of Data

Evgeny Maksakov Advantages and disadvantages: Advantages and disadvantages: Advantages and disadvantages: Advantages and disadvantages:

Measure of Distance. We wish to define the distance between two objects Distance metric between points:

STATS306B STATS306B. Clustering. Jonathan Taylor Department of Statistics Stanford University. June 3, 2010

Exploratory data analysis for microarrays

Chapter 1. Using the Cluster Analysis. Background Information

Chapter DM:II. II. Cluster Analysis

Multivariate analyses in ecology. Cluster (part 2) Ordination (part 1 & 2)

Nonparametric Regression

The Use of Biplot Analysis and Euclidean Distance with Procrustes Measure for Outliers Detection

MULTIVARIATE ANALYSIS USING R

Unsupervised Learning

10601 Machine Learning. Hierarchical clustering. Reading: Bishop: 9-9.2

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York

University of Florida CISE department Gator Engineering. Visualization

CS 584 Data Mining. Classification 1

Machine Learning with MATLAB --classification

arxiv: v1 [physics.data-an] 27 Sep 2007

Unsupervised Learning

Free Projection SOM: A New Method For SOM-Based Cluster Visualization

Unsupervised Learning

Summer School in Statistics for Astronomers & Physicists June 15-17, Cluster Analysis

Multiresponse Sparse Regression with Application to Multidimensional Scaling

Iterated Consensus Clustering: A Technique We Can All Agree On

Image Analysis Lecture Segmentation. Idar Dyrdal

SOMSN: An Effective Self Organizing Map for Clustering of Social Networks

Data Mining and Data Warehousing Henryk Maciejewski Data Mining Clustering

Locality Preserving Projections (LPP) Abstract

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan

A vector field visualization technique for Self-Organizing Maps

Neural Network and Adaptive Feature Extraction Technique for Pattern Recognition

7/19/09 A Primer on Clustering: I. Models and Algorithms 1

Comparison of Clustering Techniques for Cluster Analysis

4. Cluster Analysis. Francesc J. Ferri. Dept. d Informàtica. Universitat de València. Febrer F.J. Ferri (Univ. València) AIRF 2/ / 1

Data Exploration and Preparation Data Mining and Text Mining (UIC Politecnico di Milano)

Methods for Intelligent Systems

Simulation of Back Propagation Neural Network for Iris Flower Classification

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Multivariate Methods

Transcription:

Cluster Analysis and Visualization Workshop on Statistics and Machine Learning 2004/2/6

Outlines Introduction Stages in Clustering Clustering Analysis and Visualization One/two-dimensional Data Histogram, Scatterplot, Dendrogram High-dimensional Data : Dimension Reduction Techniques Principal Component Analysis (PCA) Multidimensional Scaling (MDS) Self-Organizing Maps (SOM) High-dimensional Data: Dimension-free Visualization Block Clustering Data Image Generalized Association Plots (GAP) 2

Stages in Clustering What is clustering? Cluster analysis is the organization of a collection of patterns into clusters based on similarity. The problem is to group a given collection of unlabeled patterns into meaningful clusters. 3

Clustering Analysis Hierarchical Clustering Algorithm Partitional Algorithm: k-means Mixture-Resolving and Mode-Seeking Algorithm Nearest Neighbor Clustering Fuzzy Clustering Artificial Neural Networks for Clustering Clustering Large data sets 4

Data/Information Visualization What is Visualization? To visualize = to make visible, to transform into pictures. Making things/processes visible that are not directly accessible by the human eye. Transformation of an abstraction to a picture. Computer aided extraction and display of information from data. Data/Information Visualization Exploiting the human visual system to extract information from data. Provides an overview of complex data sets. Identifies structure, patterns, trends, anomalies, and relationships in data. Assists in identifying the areas of interest. Tegarden, D. P. (1999). Business Information Visualization. Communications of AIS 1, 1-38. 5

The Iris Data (Anderson 1935; Fisher 1936) Images source: http://www.stat.auckland.ac.nz/~ihaka/120/lectures/lecture27.pdf The iris data published by Fisher (1936) have been widely used for examples in discriminant analysis and cluster analysis. The sepal length, sepal width, petal length, and petal width are measured in centimeters on fifty iris specimens from each of three species, Iris setosa, I. versicolor, and I. virginica. 6

(IRIS), Iris " ".. " ",,,,.,,, " ". 7

Histograms The histogram graphically shows: 1. center of the data (location) 2. spread of the data (scale) 3. skewness of the data 4. presence of outliers 5. presence of multiple modes in the data. Two important properties of a clustering definition: 1. Most of data has been organized into non-overlapping clusters. 2. Each cluster has a within variance and one between variance for each of the other clusters. A good cluster should have a small within variance and large between variance. 8

Scatterplot Images source: http://www.stat.auckland.ac.nz/~ihaka/120/lectures/lecture27.pdf 9

Dendrogram (Kaufman and Rousseeuw, 1990) Hierarchical Clustering Example: Agglomerative algorithm + Average linkage clustering 10

Visualizing and Clustering High-dimensional Data: dimension reduction techniques Principal Component Analysis (PCA) Multidimensional Scaling (MDS) Self-Organizing Maps (SOM) Dimension reduction visualization is often adopted for presenting grouping structure for methods such as K-means. 11

Principal Component Analysis (PCA) (Pearson 1901; Hotelling 1933; Jolliffe 2002) The PCA summaries the dispersion of data points as data cloud in a small number of major axes (principal components) of varition among the variables. Software: Splus 12

Multidimensional Scaling (MDS) (Torgerson 1952; Cox and Cox 2001) Classical MDS Multidimensional scaling takes a set of dissimilarities and returns a set of points such that the distances between the points are approximately equal to the dissimilarities. 2D MDS configuration plot Note that if the input-space distances are Euclidean, classical MDS is equivalent to PCA. (Mardia et al. 1979) Software: R with mva package (cmdscale) 13

Self-Organizing Maps (SOM) (kohonen 2001) SOMs were developed by Kohonen in the early 1980's, original area was in the area of speech recognition. SOM is unique in the sense that it combines both aspects. It can be used at the same time both to reduce the amount of data by clustering, and to construct a nonlinear projection of the data onto a low-dimensional display. Figures source from: SCIpath Home http://www.ucl.ac.uk/oncology/microcore/tutorial.htm (Organise data on the basis of similarity by putting entities geometrically close to each other) 14

Overview of SOM Figures source from: SCIpath Home http://www.ucl.ac.uk/oncology/microcore/tutorial.htm 15

SOM - Initialization SOM initialization means to give each weight of the output node a random (or determined) vector value. The dimensionality of the vector values put in must match the dimensionality of the raw data! So if the raw data consists of 5 arrays, then the vectors must have 5 elements (dimensions). 16

SOM Algorithm The SOM algorithm then goes on to interrogate the map for similar vectors Figures source from: SCIpath Home http://www.ucl.ac.uk/oncology/microcore/tutorial.htm 17

SOM Algorithm: neighborhood functions The winner node's weight is modified such that it becomes even more similar to the original input node's vector. The neighborhood value has a two-fold character - a size and a function of distance to influence. One could even define a further third character - the shape of the neighborhood (in this case, a square - highlighted in blue). The peak of the Gaussian function would be the location of the winner node. As one moves out from that location, the r value decreases. Figures source from: SCIpath Home http://www.ucl.ac.uk/oncology/microcore/tutorial.htm 18

SOM Algorithm: neighborhood functions and learning rate functions 19

Summary of SOM Figures source from: SCIpath Home http://www.ucl.ac.uk/oncology/microcore/tutorial.htm 20

Possible parameters used in SOM analysis 21

SOM: iris example Software: R: The som Package http://cran.r-project.org/src/contrib/som_0.2-7.tar.gz > iris.data.n <- normalize(iris.data, byrow=f) > iris.som <- som(iris.data.n, xdim=4, ydim=4, topol="rect", neigh="gaussian") > plot.som(iris.som) 22

U-matrix: Unified Matrix Method (Ultsch and Siemon 1989, Ultsch 1993) U-matrix representation of SOM visualizes the distance between the neurons. The distance between the adjacent neurons is calculated and presented with different colorings between the adjacent nodes. U-matrix representation of the SOM 23

SOM: iris example Software: SOM Toolbox 2.0 for Matlab Source from technical Report on SOM Toolbox 2.0 for Matlab 24

Visualizing and Clustering High-dimensional Data: dimension-free visualization Block Clustering Data Image Generalized Association Plots (GAP) 25

Block clustering (Hartigan, 1972) Reorders rows and columns to produce a matrix with homogenous blocks. Algorithm: rows (columns) are sorted by row (column) mean. Start with entire data in one block. Choose the row or column split (of all existing blocks) that reduces total within- block- variance the most Continue until a large number of blocks are obtained. Recombine blocks by pruning. Stopping rule: The maximum gap approach. 26

Data Image (Minnotte and Webster 1999) Column condition: For each variable Random permute rows Splus code from Michael C. Minnotte: http://math.usu.edu/~minnotte/research/pubs.html 27

Relativity of a Statistical Graph (Chang et al. 2002) Concept: placing similar (different) objects at closer (distant) positions Permuted with good Orders Permuted with bad Orders 28

GAP Generalized Association Plots (Chen, 2002) 29

GAP: iris example Species Ordering color spectrum variable transformation similarity measure Software (1) Xlispstat (2) C++ codes & (3) GapLite (to appear) 30

GAP: iris example Random Permutation color spectrum variable transformation similarity measure 31

GAP: iris example Agglomerative Clustering with average linkage 32

Seriation Robinson matrix Tree seriation Source from Chen (2002). 33

Seriation: iris example Source from Chen (2002). 34

More on GAP Web site: http://gap.stat.sinica.edu.tw/ (categorical) (dependent) (clustered) (missing value) 35

Reference Dr. Alexander Strehl: http://www.lans.ece.utexas.edu/~strehl/ Michael Friendly's Home Page:http://www.math.yorku.ca/SCS/friendly.html Chen, C. H. (2002), Generalized Association Plots: Information Visualization via Iteratively Generated Correlation Matrices, Statistica Sinica, 12, 7-29. Cox, T. F. and Cox, M.A.A. (2001), Multidimensional Scaling, London: Chapman & Hall. Hartigan, J. (1972), Direct Clustering of a Data Matrix. Journal of the American Statistical Association, 67(337):123-129. Hartigan, J. (1975), Clustering Algorithms, John Wiley and Sons, New York. Jacoby, W. G. (1998), Statistical Graphics for Visualizing Multivariate Data, Thousand Oaks, Calif. : Sage Publications. Kaufman, L. and Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York. Kohonen, T. (2001), Self-Organizing Maps, Berlin: Springer. Minnotte, M. C. and West, R. W., (1999), "The Data Image: a Tool for Exploring High Dimensional Data Sets,". 1998 Proceedings of the ASA Section on Statistical Graphics, in press. 36