Cutting the Dendrogram through Permutation Tests


Dario Bruzzese (1) and Domenico Vistocco (2)

(1) Dipartimento di Medicina Preventiva, Università di Napoli - Federico II, Via S. Pansini 5, Napoli, Italy, dario.bruzzese@unina.it
(2) Dipartimento di Scienze Economiche, Università di Cassino, Via S. Angelo S.N., Cassino, Italy, vistocco@unicas.it

Abstract. This paper introduces a novel approach for detecting a sub-optimal partition starting from the dendrogram produced by a hierarchical clustering technique. The approach exploits permutation tests and can be used regardless of the agglomeration method and distance measure adopted in the classification process, since it relies on the same criteria used to grow the tree. Moreover, the proposed approach can detect partitions that are not identifiable through a traditional cut, as the resulting clusters may correspond to different heights of the tree.

Keywords: Hierarchical clustering, Permutation tests, Cluster detection

1 Introduction and motivation

Hierarchical clustering is one of the most widespread analytical approaches to classification problems, mainly owing to the visual power of its associated graphical representation, the dendrogram, and to the directness of the cluster generation process. All this aside, properly choosing the optimal number of clusters still represents the main difficulty for the final user. Various (semi)automatic criteria can be devised to reach the final classification; very often, the informal solution adopted is to find the height in the dendrogram where large changes in fusion level occur. Broadly speaking, all these criteria determine a threshold value of the ultrametric used to grow the dendrogram, such that all the units with a dissimilarity below this threshold belong to the same cluster. Such an approach, however, searches for the solution only within a small subset of the whole family of partitions housed in the dendrogram: those which stem from horizontal cuts. This induces a one-to-one relation between the number k of clusters and the partition, so that fixing one element of the relation (e.g. the number of clusters) univocally determines the other. It could happen, however, that clusters differ in terms of their internal coherence in such a way that the same threshold value would not be suitable for all of them. Figure 1 shows the dendrogram obtained on a simulated dataset: it contains 4 different clusters of the same cardinality, generated from multivariate normal distributions with different mean vectors and variance-covariance matrices. The Ward criterion with the Euclidean distance was used to grow the tree.

Fig. 1. Two different partitions in 4 clusters of a simulated dataset. The solid line refers to a traditional horizontal criterion, while the dashed line refers to a possible solution offered by the proposed algorithm.

The solid line in Figure 1 highlights the 4-cluster solution obtained by cutting the dendrogram with a traditional horizontal criterion. This partition, which is actually the only one that can produce a 4-cluster solution, isolates a very small cluster on the left side of the dendrogram, while leaving ungrouped the two clusters on the right that, on the contrary, contain units belonging to different populations. A different solution, among those that still comply with the hierarchical classification process, could thus be the one described by the dashed line, characterized by two local thresholds located at different heights; it turns out that this non-conventional cut better recovers the original cluster structure (according to the misclassification index, the first partition produces an error rate of 0.58, while the second is characterized by an error rate of 0.40).

The possibility of merging clusters at different heights (thus departing from the horizontal-cut principle previously described) makes it necessary to implement a procedure able to automatically explore the complete set of partitions, placing partial thresholds wherever two clusters plainly reflect specific characteristics. The proposed algorithm exploits the theoretical framework of permutation tests to reach this goal. The most important by-product of this approach is the automatic identification of the number of clusters.

The paper is organized as follows: the idea used for detecting the partition is introduced in Section 2, and the notation and the proposed procedure are detailed in Section 3; Section 4 shows some results on a genetic dataset, together with a simulation study exploring the influence of the tuning parameters on the algorithm output. Finally, some concluding remarks and future work directions close the paper.

2 The basic intuition

The proposed algorithm exploits a permutation test approach to automatically detect a partition starting from the dendrogram produced by a hierarchical clustering. The algorithm retraces the tree downward, starting from the root, where all objects are classified in a unique cluster, and moving a partial threshold down until a link joining two clusters is encountered. A permutation test is then performed in order to verify whether the two clusters must be accounted as a unique group (the null hypothesis) or not (the alternative one). If the null cannot be rejected, the corresponding branch becomes a cluster of the final partition and none of its sub-branches is processed any further; otherwise, each of them will be visited in the course of the procedure. In both cases, the partial threshold continues its path and the next branch of the dendrogram is processed. The algorithm stops when there are no more branches that withstand the test (i.e. the null cannot be rejected any more).

The permutation test on which the whole procedure is based can be summarized as follows. Under the null, if all the units belonging to the two clusters are mixed together and then randomly split up, with the only constraint of preserving the group cardinalities, the distance between the shuffled clusters should not be very different from the original one. Repeating the shuffling m times, a Monte Carlo p-value can be computed from the number of permuted distances at least as extreme as the original one. The whole algorithm is detailed in the next section.

3 The algorithm

Let n denote the number of objects to classify, let C_L^k and C_R^k denote the two classes merged at level k (k = 1, ..., n-1), and let h(C_L^k ∪ C_R^k) denote the height necessary to merge C_L^k and C_R^k. Finally, let h(C_j^k) denote the height at which C_j^k has been obtained (j ∈ {L, R}). In Figure 2 the adopted formalism is superimposed, for k ∈ {1, 2}, on the dendrogram shown in the previous section.

For each k, the difference between max_{j ∈ {L,R}} h(C_j^k) and min_{j ∈ {L,R}} h(C_j^k) can be considered as the minimum cost necessary to merge the two classes: minimum because the dissimilarity measure used in the agglomeration process must rise at least from min_{j ∈ {L,R}} h(C_j^k) to max_{j ∈ {L,R}} h(C_j^k) in order to merge the two clusters.
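To make the notation concrete, the following R sketch (our illustration, not the authors' code; merge_heights is a hypothetical helper) recovers the three heights involved at any merge step of an hclust object. Note that hclust numbers merge steps bottom-up, whereas the paper indexes them starting from the root.

# Hypothetical helper: for the merge performed at step k of an hclust object,
# recover h(C_L), h(C_R) and h(C_L u C_R) from the $merge and $height
# components (see ?hclust). Singletons enter the tree at height 0.
merge_heights <- function(hc, k) {
  kids  <- hc$merge[k, ]  # negative entries are singletons, positive ones earlier merges
  h_kid <- vapply(kids, function(j) if (j < 0) 0 else hc$height[j], numeric(1))
  c(h_left = h_kid[1], h_right = h_kid[2], h_union = hc$height[k])
}

hc <- hclust(dist(USArrests), method = "ward.D2")
merge_heights(hc, nrow(USArrests) - 1)  # the last step is the root merge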

Fig. 2. Exemplification of the main notation adopted, with respect to the dendrogram reported in Figure 1.

The difference between h(C_L^k ∪ C_R^k) and max_{j ∈ {L,R}} h(C_j^k) can instead be considered as the cost actually incurred for merging C_L^k and C_R^k. The ratio between these two costs,

  cost(C_L^k ∪ C_R^k) = [h(C_L^k ∪ C_R^k) - max_{j ∈ {L,R}} h(C_j^k)] / [max_{j ∈ {L,R}} h(C_j^k) - min_{j ∈ {L,R}} h(C_j^k)],

is thus a measure that characterizes the aggregation process resulting in the new class C_L^k ∪ C_R^k, and it is indeed the statistic used in the permutation test approach for automatically detecting the clusters.

The proposed procedure is detailed in Algorithm 1. In particular, we denote with aggregationLevelsToVisit a vector containing the heights of the dendrogram still to be explored, and with permClusters an object storing the clusters detected by the procedure. The permutation test step is embedded in row 6 of Algorithm 1: for each k, a permutation test is designed to test the null hypothesis that the two groups C_L^k and C_R^k really belong to the same cluster, i.e. H_0: C_L^k ≡ C_R^k. Under H_0, mixing up (i.e. permuting) the statistical units of C_L^k and C_R^k should not alter the aggregation process resulting in their merging.
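Under the same reading of the formula, the ratio can be transcribed directly in R, building on the hypothetical merge_heights helper sketched above (again an illustration, not the published implementation):

# Cost ratio for the merge at step k: incurred cost over minimum cost.
# The ratio is undefined when both children appear at the same height
# (e.g. two singletons), a degenerate case a full implementation must handle.
cost_ratio <- function(hc, k) {
  h <- merge_heights(hc, k)
  unname((h["h_union"] - max(h["h_left"], h["h_right"])) /
           (max(h["h_left"], h["h_right"]) - min(h["h_left"], h["h_right"])))
}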

Input: A dataset and its related dendrogram
Output: A partition of the dataset
 1. initialization:
 2.   aggregationLevelsToVisit <- h(C_L^1 ∪ C_R^1)
 3.   permClusters <- [ ]
 4.   i <- 1
 5. repeat
 6.   if C_L^i ≡ C_R^i
 7.     add C_L^i ∪ C_R^i to permClusters
 8.   else
 9.     add h(C_L^i) and h(C_R^i) to aggregationLevelsToVisit
10.     sort aggregationLevelsToVisit in descending order
11.   end
12.   remove the first element from aggregationLevelsToVisit
13.   i <- i + 1
14. until aggregationLevelsToVisit is empty

Algorithm 1: The proposed PermClust algorithm

Let mC_L^k and mC_R^k be the two new classes obtained by permuting the elements of C_L^k and C_R^k. As a matter of fact, the hierarchical clustering process is invariant with respect to permutations of the original observations, so growing a single dendrogram on the permuted set would simply re-establish the same structure. For this reason, after mC_L^k and mC_R^k have been obtained, a new dendrogram is generated for each of them; the heights at which the two classes are built up again clearly correspond to the heights of the root nodes of the corresponding dendrograms. The ratio

  cost(mC_L^k ∪ mC_R^k) = [h(C_L^k ∪ C_R^k) - max_{j ∈ {L,R}} h(mC_j^k)] / [max_{j ∈ {L,R}} h(mC_j^k) - min_{j ∈ {L,R}} h(mC_j^k)]

is then a measure that characterizes the aggregation process resulting in the new (potential) class mC_L^k ∪ mC_R^k. Under H_0, the aggregation process resulting in the new cluster C_L^k ∪ C_R^k should be very similar to the one that would potentially have produced mC_L^k ∪ mC_R^k; thus the two values cost(C_L^k ∪ C_R^k) and cost(mC_L^k ∪ mC_R^k) should be close enough. The permutation procedure is repeated M times, and each time a new couple mC_L^k, mC_R^k is obtained. The Monte Carlo p-value (Good, 1994) is then computed as:

  p = [#{cost(mC_L^k ∪ mC_R^k) ≥ cost(C_L^k ∪ C_R^k)} + 1] / M

4 Some results

The PermClust algorithm has been applied both to real and to synthetic datasets; in the following, the main results are presented.
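One possible rendering of this test in R is sketched below. It is a hedged reconstruction from the description above (the paper does not reproduce the published implementation), it assumes both classes contain at least two units, and it takes the observed heights h(C_L^k), h(C_R^k) and h(C_L^k ∪ C_R^k) from the original tree:

# Hedged sketch of the permutation test for H0: C_L = C_R.
# x_left, x_right: data matrices of the two classes; h_left, h_right, h_union:
# the observed heights from the original dendrogram; M: number of permutations.
perm_test <- function(x_left, x_right, h_left, h_right, h_union,
                      M = 999, method = "ward.D2") {
  cost <- function(hl, hr) (h_union - max(hl, hr)) / (max(hl, hr) - min(hl, hr))
  obs  <- cost(h_left, h_right)
  # h(mC_j): root height of a dendrogram re-grown on each permuted class
  root_h <- function(x) max(hclust(dist(x), method = method)$height)
  pooled <- rbind(x_left, x_right)
  n_l    <- nrow(x_left)
  perm_costs <- replicate(M, {
    idx <- sample(nrow(pooled))  # mix the units of the two classes...
    cost(root_h(pooled[idx[1:n_l], , drop = FALSE]),      # ...mC_L keeps |C_L| units
         root_h(pooled[idx[-(1:n_l)], , drop = FALSE]))   # ...mC_R keeps |C_R| units
  })
  (sum(perm_costs >= obs) + 1) / M  # Monte Carlo p-value as in the formula above
}

In the traversal of Algorithm 1, the branch under examination would then be accepted as a final cluster of the partition whenever the returned p-value exceeds the chosen significance level.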

In all the computations, the dendrograms have been generated with the Euclidean distance and the Ward agglomeration criterion (Maechler et al., 2005). Unless otherwise specified, p-values less than 0.01 were considered statistically significant in the permutation test step.

Figure 3(a) shows (a zoom of) the dendrogram obtained on the Yeast galactose dataset, which describes a subset of 205 genes reflecting four functional categories of the Gene Ontology (Ideker et al., 2001). [Footnote 1: For this application the algorithm, written in the R language, takes almost 50 seconds on an Intel Core 2 Duo 2.26 GHz machine with 4 GB of RAM. More efficiency could be achieved by optimizing the code and implementing it in a compiled language.]

Fig. 3. (a) The dendrogram obtained on the Yeast galactose dataset with the partition selected by the PermClust algorithm; numbers refer to the p-values of the associated permutation tests. (b) Visual representation of the confusion matrix resulting from the PermClust algorithm and (c) from a k-means with k = 4.

The obtained partition is highlighted using red rectangles and clearly reveals the 4-cluster structure originally contained in the dataset. Panels (b) and (c) show the confusion matrices related to the proposed algorithm (b) and to a k-means procedure (c) with k equal to 4. The different sub-panels depict the original clusters in the dataset, while the different bars refer to the clusters detected by the classification procedures. It can be noticed that the proposed algorithm correctly assigns the units to the first and the fourth class, while a small fraction of units belonging to the second cluster is misclassified into the third cluster. K-means, on the contrary, is unable to grasp the second cluster (whose units are misclassified into the third cluster), and small misclassification rates also characterize the first and the fourth clusters. The overall misclassification rate was 1.5% for PermClust and 8.3% for the k-means procedure. It is worth noticing that the partition selected by the proposed algorithm agrees with the orthodox 4-cluster solution, but here it has been detected automatically.
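For reference, trees of the kind described here can be grown in R along these lines (a sketch: geneExpr is a placeholder name for the 205-gene expression matrix, and "ward.D2" is the name under which the Ward criterion on Euclidean distances is available in current versions of hclust):

# Sketch of the dendrogram construction used throughout the experiments.
d  <- dist(geneExpr, method = "euclidean")  # 'geneExpr' is a placeholder name
hc <- hclust(d, method = "ward.D2")         # Ward criterion on Euclidean distances
plot(hc, labels = FALSE)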

The PermClust algorithm has also been tested on artificial datasets. In particular, Figure 4 shows the results of the algorithm on artificial datasets generated according to the random cluster generation method proposed in Qiu and Joe (2006a, 2009). The generated data differ in terms of the number of clusters (k = 2, 3, 4, 5, 6, 7) and of the number of variables (p = 5, 10). [Footnote 2: With p = 15 the performance of the algorithm is almost equal to p = 10; the corresponding figure is not reported for the sake of brevity.] The artificial data have been generated using a value of 0.01 for the separation index (Qiu and Joe, 2006b) between any cluster and its nearest neighboring cluster, which reflects a close cluster structure. For each combination of k and p, s = 100 different datasets have been generated.

Fig. 4. Distribution of the number of clusters detected by the PermClust algorithm for artificial datasets in the case of 5 variables (a) and 10 variables (b).

Figure 4(a) shows the number of clusters composing the partition detected by the PermClust algorithm using p = 5 variables. The different columns of the figure depict the different values of k, while the rows refer to the significance level used in the permutation test step of the algorithm (see row 6 of Algorithm 1). The barplots in each panel show the distribution of the number of clusters detected by the algorithm over the s simulations. The same structure is used for the case of a dataset with 10 variables (Figure 4(b)). As can be noticed, the stability of the algorithm strictly depends on the combination of the significance level, the cardinality of the cluster structure and the number of variables. In particular, while a significance level of 0.01 (last row of Figures 4(a) and (b)) always allows the best results to be achieved, the accuracy of the solution is inversely proportional to the ratio between k and p. In the case of a simple cluster structure (k = 2, 3), the algorithm seems to fail even with a large number of available variables.
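Simulated data of this kind can be produced with the clusterGeneration package cited above. The following sketch shows one plausible call for a single scenario of the design; the argument names follow the genRandomClust() help page and should be checked against the installed package version:

library(clusterGeneration)  # Qiu and Joe (2009)
set.seed(1)
# One scenario of the design: k = 4 clusters, p = 5 (non-noisy) variables,
# separation index 0.01 between each cluster and its nearest neighbor.
sim <- genRandomClust(numClust = 4, sepVal = 0.01, numNonNoisy = 5,
                      numReplicate = 100, fileName = "permclust_sim")
str(sim$datList[[1]])  # the first of the s = 100 generated datasets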

5 Concluding remarks and further developments

The output of hierarchical clustering methods is typically displayed as a dendrogram describing a family of partitions indexed by an ultrametric distance. After the tree structure of the dendrogram has been set up, the trickiest problem is that of cutting the tree with a suitable threshold in order to extract a sub-optimal classification. Several (more or less) objective criteria may be used to achieve this goal, e.g. the deepest step, but most often the partition relies on a subjective choice led by interpretation issues. Additionally, whatever the chosen criterion, only one solution can be obtained for each desired granularity, i.e. the one where clusters are joined at consecutive heights starting from the adopted threshold.

In this paper we propose an algorithm, exploiting the methodological framework of permutation tests, that automatically finds a sub-optimal partition whose clusters do not necessarily obey the afore-mentioned principle. The algorithm allows us to explore partitions which are not directly achievable using a standard cut-level approach.

Further work should concern a comparison of the obtained partition with partitions of the same dataset derived from common partitioning methods; a comparison in terms of widely used quality indexes (Rand, 1971) should strengthen the proposal. Furthermore, the study of the stability of the obtained partitions with respect to the tuning parameters used in the permutation test procedure, and the study of the computational complexity, are topics of interest for further research.

References

GOOD, P. (1994). Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer, New York.
IDEKER, T., THORSSON, V., RANISH, J.A., CHRISTMAS, R., BUHLER, J., ENG, J.K., BUMGARNER, R.E., GOODLETT, D.R., AEBERSOLD, R. and HOOD, L. (2001). Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science, 292.
MAECHLER, M., ROUSSEEUW, P., STRUYF, A. and HUBERT, M. (2005). Cluster Analysis Basics and Extensions. Unpublished.
QIU, W.L. and JOE, H. (2006a). Generation of Random Clusters with Specified Degree of Separation. Journal of Classification, 23(2).
QIU, W.L. and JOE, H. (2006b). Separation Index and Partial Membership for Clustering. Computational Statistics and Data Analysis, 50.
QIU, W.L. and JOE, H. (2009). clusterGeneration: Random Cluster Generation (with Specified Degree of Separation). R package.
R DEVELOPMENT CORE TEAM (2009). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
RAND, W.M. (1971). Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association, 66(336).
