Cutting the Dendrogram through Permutation Tests
Dario Bruzzese (1) and Domenico Vistocco (2)

(1) Dipartimento di Medicina Preventiva, Università di Napoli - Federico II, Via S. Pansini 5, Napoli, Italy, dario.bruzzese@unina.it
(2) Dipartimento di Scienze Economiche, Università di Cassino, Via S. Angelo S.N., Cassino, Italy, vistocco@unicas.it

Abstract. This paper introduces an innovative approach for detecting a suboptimal partition starting from the dendrogram produced by a hierarchical clustering technique. The approach exploits permutation tests and can be used regardless of the agglomeration method and distance measure adopted in the classification process, because it relies on the same criteria used to produce the dendrogram. Moreover, the proposed approach can detect partitions that are not necessarily identifiable with a traditional cut, as the resulting clusters may correspond to different heights of the tree.

Keywords: Hierarchical clustering, Permutation tests, Cluster detection

1 Introduction and motivation

Hierarchical clustering is one of the most widespread approaches to classification problems, mainly because of the visual power of its associated graphical representation, the dendrogram, and the directness of the cluster generation process. All this aside, properly choosing the number of clusters still represents the main difficulty for the final user. Different (semi)automatic criteria can be devised to reach the final classification; very often, the informal solution adopted is to find the height in the dendrogram where large changes in fusion level occur.
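A conventional way to extract a partition from a dendrogram is to cut it horizontally at a threshold height. The sketch below is a minimal illustration of that idea, not code from the paper: it assumes a SciPy-style merge list (the i-th merge joins two existing nodes at a given height and creates internal node n + i) and groups the leaves joined by merges below the threshold.

```python
def cut_dendrogram(n, merges, threshold):
    """Horizontal cut: merges with height <= threshold are applied,
    the rest are ignored; surviving roots define the clusters."""
    parent = list(range(n + len(merges)))  # union-find over all nodes

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, (a, b, h) in enumerate(merges):
        if h <= threshold:                 # only merges below the cut survive
            new = n + i
            parent[find(a)] = new
            parent[find(b)] = new

    clusters = {}                          # group leaves by surviving root
    for leaf in range(n):
        clusters.setdefault(find(leaf), []).append(leaf)
    return list(clusters.values())

# 4 leaves: {0,1} merge at h=1, {2,3} at h=2, everything merges at h=5
merges = [(0, 1, 1.0), (2, 3, 2.0), (4, 5, 5.0)]
print(cut_dendrogram(4, merges, threshold=3.0))  # → [[0, 1], [2, 3]]
```

Note that a single threshold produces exactly one partition per granularity, which is the limitation discussed next.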
Broadly speaking, all these criteria determine a threshold value of the ultrametric used to grow the dendrogram, such that all the units with a dissimilarity below this threshold belong to the same cluster. However, such an approach searches for the solution only within a small subset of the whole family of partitions housed in the dendrogram: those which stem from horizontal cuts. This induces a one-to-one relation between the number k of clusters and the partition, so that by fixing one element of the relation (e.g. the number of clusters) the other is uniquely determined. It could happen, however, that clusters differ in terms of their internal coherence in such a way that the same threshold value would not be suitable for all
of them. Figure 1 shows the dendrogram obtained on a simulated dataset. The dataset contains 4 different clusters of the same cardinality, generated from multivariate normal distributions with different mean vectors and variance-covariance matrices. The Ward criterion with the Euclidean distance was used to grow the tree.

Fig. 1. Two different partitions in 4 clusters of a simulated dataset. The solid line refers to a traditional horizontal criterion, while the dashed line refers to a possible solution offered by the proposed algorithm.

The solid line in Figure 1 highlights the 4-cluster solution obtained by cutting the dendrogram with a traditional horizontal criterion; this partition, which is actually the only one that can produce a 4-cluster solution, isolates a very small cluster on the left side of the dendrogram while leaving ungrouped the two clusters on the right that, on the contrary, contain units belonging to different populations. A different solution, among those that still comply with the hierarchical classification process, could thus be the one described by the dashed line and characterized by two local thresholds located at different heights; it turns out that this non-conventional cut better recovers the original cluster structure (according to the misclassification index, the first partition produces an error rate of 0.58, while the second is characterized by an error rate of 0.40). The possibility of merging clusters at different heights (thus conflicting with the single-threshold principle previously described) makes mandatory a procedure able to automatically explore the complete set of partitions, tracing partial thresholds whenever two clusters plainly reflect specific characteristics. The proposed algorithm exploits the theoretical framework of permutation tests to reach this goal.
The most important by-product of such an approach is the automatic identification of the number of clusters. The paper is organized as follows: the idea used for detecting the partition is introduced in Section 2; the notation and the proposed procedure
are detailed in Section 3; Section 4 shows some results on a genetic dataset and a simulation study exploring the influence of the tuning parameters on the algorithm output. Finally, some concluding remarks and future work directions follow in the final section.

2 The basic intuition

The proposed algorithm exploits a permutation test approach to automatically detect a partition starting from the dendrogram resulting from a hierarchical clustering. The algorithm retraces the tree downward, starting from the root of the dendrogram, where all objects are classified in a unique cluster, and moving down a partial threshold until a link joining two clusters is encountered. A permutation test is then performed to verify whether the two clusters must be counted as a unique group (the null hypothesis) or not (the alternative one). If the null cannot be rejected, the corresponding branch becomes a cluster of the final partition and none of its sub-branches is processed any further; otherwise, each of them will be visited later in the course of the procedure. In both cases, the partial threshold continues its path and the next branch of the dendrogram is processed. The algorithm stops when there are no more branches to test.

The permutation test on which the whole procedure is based can be summarized as follows. Under the null, if all the units belonging to the two clusters are mixed up together and then randomly split up, with the only constraint being the group cardinalities, the distance between the shuffled clusters should not be very different from the original one. Repeating the shuffling m times, a Monte Carlo p-value can be computed as the proportion of permuted distances at least as extreme as the original one. The whole algorithm is detailed in the next section.
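The shuffling scheme just described can be illustrated with a toy permutation test. This is only a sketch of the general idea, not the paper's statistic: it uses the Euclidean distance between group centroids in place of the cost ratio defined in the next section, and the common (hits + 1)/(m + 1) Monte Carlo convention.

```python
import random

def centroid(points):
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def perm_pvalue(group_l, group_r, m=999, seed=0):
    """Monte Carlo p-value for H0: the two groups form a single cluster.
    Units are pooled, reshuffled, and split again with the original
    cardinalities; 'at least as extreme' here means a shuffled
    between-centroid distance >= the observed one."""
    rng = random.Random(seed)
    observed = distance(centroid(group_l), centroid(group_r))
    pooled = list(group_l) + list(group_r)
    hits = 0
    for _ in range(m):
        rng.shuffle(pooled)
        left, right = pooled[:len(group_l)], pooled[len(group_l):]
        if distance(centroid(left), centroid(right)) >= observed:
            hits += 1
    return (hits + 1) / (m + 1)

# two well-separated groups vs. two overlapping ones
a = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5)]
b = [(9, 9), (9, 10), (10, 9), (10, 10), (9.5, 9.5)]
c = [(0, 1), (1, 1), (1, 0), (0, 0.5), (0.5, 1)]
print(perm_pvalue(a, b), perm_pvalue(a, c))  # small p vs. large p
```

Well-separated groups almost never reproduce the observed separation under shuffling, so the p-value is small and the null is rejected; overlapping groups yield a large p-value and are kept together.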
3 The algorithm

Let n denote the number of objects to classify, let C^k_L and C^k_R be the two classes merged at level k (k = 1, ..., n-1), and let h(C^k_L ∪ C^k_R) be the height at which C^k_L and C^k_R are merged. Finally, we denote with h(C^k_j) the height at which C^k_j was formed (j ∈ {L, R}). In Figure 2 the adopted formalism is superimposed, for k ∈ {1, 2}, on the dendrogram shown in the previous section.

For each k, the difference between max_{j ∈ {L,R}} h(C^k_j) and min_{j ∈ {L,R}} h(C^k_j) can be considered the minimum cost necessary to merge the two classes: minimum because the dissimilarity measure used in the agglomeration process must rise at least from min_{j ∈ {L,R}} h(C^k_j) to max_{j ∈ {L,R}} h(C^k_j) in order to merge the two clusters.
Fig. 2. Exemplification of the main notation adopted, with respect to the dendrogram reported in Figure 1 (the heights h(C^2_L ∪ C^2_R) and h(C^2_R) and the classes C^2_L, C^2_R, C^1_L, C^1_R are labelled on the tree).

The difference between h(C^k_L ∪ C^k_R) and max_{j ∈ {L,R}} h(C^k_j) can instead be considered the cost actually incurred for merging C^k_L and C^k_R. The ratio between these two costs,

  cost(C^k_L ∪ C^k_R) = ( max_{j ∈ {L,R}} h(C^k_j) - min_{j ∈ {L,R}} h(C^k_j) ) / ( h(C^k_L ∪ C^k_R) - max_{j ∈ {L,R}} h(C^k_j) ),

is thus a measure that characterizes the aggregation process resulting in the new class C^k_L ∪ C^k_R, and it is indeed the statistic used in the permutation test approach for automatically detecting the clusters. The proposed procedure is detailed in Algorithm 1. In particular, we denote with aggregationLevelsToVisit a vector containing the heights of the dendrogram still to be explored and with permClusters an object storing the clusters detected by the procedure. The permutation test step is embedded in row 6 of Algorithm 1: for each k, a permutation test is designed to test the null hypothesis that the two groups C^k_L and C^k_R really belong to the same cluster, i.e. H0: C^k_L ≡ C^k_R. Under H0, mixing up (i.e. permuting) the statistical units of C^k_L and C^k_R should not alter the aggregation process resulting in their merging.
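The cost ratio (the minimum cost of a merge divided by the cost actually incurred) depends only on three heights, so it is straightforward to compute. The helper below is a hypothetical illustration, not the authors' code, and assumes the merge height is strictly above both children.

```python
def merge_cost_ratio(h_merge, h_left, h_right):
    """Ratio between the minimum merging cost (the gap between the heights
    at which the two children were formed) and the cost actually incurred
    (the gap between the merge height and the higher child).
    Assumes h_merge > max(h_left, h_right)."""
    hi, lo = max(h_left, h_right), min(h_left, h_right)
    return (hi - lo) / (h_merge - hi)

# children formed at heights 2 and 4, merged at height 10:
print(merge_cost_ratio(10.0, 2.0, 4.0))  # (4 - 2) / (10 - 4) ≈ 0.333
```

A well-separated merge (large actual cost) gives a small ratio, while a merge that happens just above its children gives a large one.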
Input: A dataset and its related dendrogram
Output: A partition of the dataset
 1. initialization:
 2. aggregationLevelsToVisit <- [ h(C^1_L ∪ C^1_R) ]
 3. permClusters <- [ ]
 4. i <- 1
 5. repeat
 6.   if H0: C^i_L ≡ C^i_R cannot be rejected
 7.     add C^i_L ∪ C^i_R to permClusters
 8.   else
 9.     add h(C^i_L) and h(C^i_R) to aggregationLevelsToVisit
10.     sort aggregationLevelsToVisit in descending order
11.   end
12.   remove the first element from aggregationLevelsToVisit
13.   i <- i + 1
14. until aggregationLevelsToVisit is empty

Algorithm 1: The proposed PermClust algorithm

Let mC^k_L and mC^k_R be the two new classes obtained by permuting the elements in C^k_L and C^k_R. The hierarchical clustering process is invariant with respect to a permutation of the original observations, so growing a single dendrogram on the permuted set would simply re-establish the same structure. For this reason, after mC^k_L and mC^k_R have been obtained, a new dendrogram is generated for each of them. The heights at which the two classes are built up again clearly correspond to the heights of the root nodes of the corresponding dendrograms. The ratio

  cost(mC^k_L ∪ mC^k_R) = ( max_{j ∈ {L,R}} h(mC^k_j) - min_{j ∈ {L,R}} h(mC^k_j) ) / ( h(C^k_L ∪ C^k_R) - max_{j ∈ {L,R}} h(mC^k_j) )

is thus a measure that characterizes the aggregation process resulting in the new (potential) class mC^k_L ∪ mC^k_R. Under H0, the aggregation process resulting in the new cluster C^k_L ∪ C^k_R should be very similar to the one that would have produced mC^k_L ∪ mC^k_R; thus the two values cost(C^k_L ∪ C^k_R) and cost(mC^k_L ∪ mC^k_R) should be close enough. The permutation procedure is repeated M times and each time a new couple mC^k_L, mC^k_R is obtained. The Monte Carlo p-value (Good, 1994) is then computed as:

  p = ( #{ cost(mC^k_L ∪ mC^k_R) ≥ cost(C^k_L ∪ C^k_R) } + 1 ) / M

4 Some results

The PermClust algorithm has been applied both to real and synthetic datasets; in the following, the main results are presented.
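Before turning to the results, the top-down scheme of Algorithm 1 can be rendered compactly as a recursion. This is an equivalent sketch, not the authors' implementation: a branch whose test retains H0 becomes a cluster of the final partition, otherwise its two sub-branches are visited in turn.

```python
def permclust(node, passes_test):
    """Recursive rendering of the top-down traversal in Algorithm 1.
    A node is either a leaf (a list of unit ids) or a dict
    {'left': ..., 'right': ..., 'height': h}. `passes_test` plays the
    role of the permutation test: it returns True when the null
    hypothesis (the two children form a single cluster) is retained."""
    if isinstance(node, list):              # a leaf is a cluster by itself
        return [node]
    if passes_test(node):                   # H0 not rejected: stop here
        return [collect_leaves(node)]
    # H0 rejected: visit both sub-branches
    return permclust(node['left'], passes_test) + \
           permclust(node['right'], passes_test)

def collect_leaves(node):
    if isinstance(node, list):
        return node
    return collect_leaves(node['left']) + collect_leaves(node['right'])

# toy dendrogram: units {0,1} and {2,3} are tight, the root merge is loose
tree = {'left': {'left': [0], 'right': [1], 'height': 1.0},
        'right': {'left': [2], 'right': [3], 'height': 1.2},
        'height': 9.0}
# stand-in test: retain the merge only when it happens at a low height
print(permclust(tree, lambda nd: nd['height'] < 2.0))  # → [[0, 1], [2, 3]]
```

Because each branch can stop at its own depth, the resulting clusters may correspond to different heights of the tree, exactly the non-horizontal cuts the paper targets.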
In all the computations,
the dendrograms have been generated with the Euclidean distance and the Ward agglomeration criterion (Maechler et al., 2005). Unless differently specified, p-values less than 0.01 were considered statistically significant in the permutation test step.

Figure 3(a) shows (a zoom of) the dendrogram obtained on the Yeast galactose dataset, which describes a subset of 205 genes reflecting four functional categories of the Gene Ontology (Ideker et al., 2001) (1).

Fig. 3. (a) The dendrogram obtained on the Yeast Galactose dataset with the partition selected by the PermClust algorithm; numbers refer to the p-values of the associated permutation tests. (b) Visual representation of the confusion matrix resulting from the PermClust algorithm and (c) from a k-means with k=4.

The obtained partition is highlighted using red rectangles and clearly reveals the 4-cluster structure originally contained in the dataset. Panels (b) and (c) show the confusion matrices related to the proposed algorithm (b) and to a k-means procedure (c) with k equal to 4. The different sub-panels depict the original clusters in the dataset, while the different bars refer to the clusters detected by the classification procedures. It can be noticed that the proposed algorithm correctly assigns the units of the first and the fourth class, while a small fraction of units belonging to the second cluster is misclassified into the third cluster. K-means, on the contrary, is unable to grasp the second cluster (whose units are misclassified into the third cluster); small misclassification rates also characterize the first and the fourth cluster. The misclassification rate was 1.5% for PermClust and 8.3% for the k-means procedure. It is worth noticing that the partition selected by the proposed algorithm agrees with the orthodox 4-cluster solution, but it has been detected automatically.
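Misclassification rates like those quoted above can be computed from a confusion matrix once detected clusters are matched to the true classes. The helper below is a hypothetical illustration, not the index used by the authors: it brute-forces the class-to-cluster matching, which is adequate for small k and square confusion matrices.

```python
from itertools import permutations

def misclassification_rate(confusion):
    """Smallest error rate over all matchings between true classes (rows)
    and detected clusters (columns). Brute force over the k! permutations,
    so only suitable for small square matrices."""
    total = sum(sum(row) for row in confusion)
    k = len(confusion)
    best_hits = max(sum(confusion[i][p[i]] for i in range(k))
                    for p in permutations(range(k)))
    return 1 - best_hits / total

# toy confusion matrix: 4 true classes vs. 4 detected clusters
conf = [[50, 0, 0, 0],
        [0, 45, 5, 0],
        [0, 0, 50, 0],
        [0, 0, 0, 50]]
print(misclassification_rate(conf))  # → 0.025 (5 misplaced units of 200)
```

Matching before counting matters because cluster labels are arbitrary: a perfect clustering with permuted labels should still score an error rate of zero.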
(1) For this application, the algorithm, written in the R language, takes almost 50 seconds on an Intel Core 2 Duo 2.26 GHz machine with 4 GB of RAM. More efficiency could be achieved by optimizing the code and implementing it in a compiled language.

The PermClust algorithm has also been tested on artificial datasets. In particular, Figure 4 shows the results of the algorithm on artificial datasets
generated according to the random cluster generation method proposed in Qiu and Joe (2006a, 2009). The generated data differ in terms of the number of clusters (k = 2, 3, 4, 5, 6, 7) and of the number of variables (p = 5, 10) (2). The artificial data have been generated using a value of 0.01 for the separation index (Qiu and Joe, 2006b) between any cluster and its nearest neighbour cluster, which reflects a close cluster structure. For each combination of k and p, s = 100 different datasets have been generated.

Fig. 4. Distribution of the number of clusters detected by the PermClust algorithm for artificial datasets in the case of 5 variables (a) and 10 variables (b).

Figure 4(a) shows the number of clusters composing the partition detected by the PermClust algorithm using p = 5 variables. Different columns of the figure depict the different values of k, while the rows refer to the significance level used in the permutation test step of the algorithm (see row 6 of Algorithm 1). The barplots in each panel show the distribution of the numbers of clusters detected by the algorithm in the s simulations. The same structure is used for the case of a dataset with 10 variables (Figure 4(b)). As can be noticed, the stability of the algorithm strictly depends on the combination of the significance level, the cardinality of the cluster structure and the number of variables. In particular, while a significance level of 0.01 (last row of Figures 4(a) and (b)) always achieves the best results, the accuracy of the solution is inversely proportional to the ratio between k and p. In case of a simple cluster structure (k = 2, 3), the algorithm seems to fail even with a large number of available variables.

(2) With p = 15 the performance of the algorithm is almost equal to that with p = 10.
The corresponding figure is not reported for the sake of brevity.
5 Concluding remarks and further developments

The output of hierarchical clustering methods is typically displayed as a dendrogram describing a family of partitions indexed by an ultrametric distance. After the tree structure of the dendrogram has been set up, the trickiest problem is that of cutting the tree at a suitable threshold in order to extract a sub-optimal classification. Several (more or less) objective criteria may be used to achieve this goal, e.g. the deepest step, but most often the partition relies on a subjective choice led by interpretation issues. Additionally, whatever the chosen criterion, only one solution can be obtained for each desired granularity, i.e. the one where clusters are joined at consecutive heights starting from the adopted threshold.

In this paper we proposed an algorithm, exploiting the methodological framework of permutation tests, that automatically finds a sub-optimal partition whose clusters do not necessarily obey the afore-mentioned principle. The algorithm allows us to explore partitions which are not directly achievable using a standard cut-level approach. Further work should concern a comparison of the obtained partition with partitions of the same dataset derived from common partitioning methods; a comparison in terms of common quality indexes (Rand, 1971) would strengthen the proposal. Furthermore, the study of the stability of the obtained partitions with respect to the tuning parameters used in the permutation test procedure, and the study of the computational complexity, are topics of interest for further research.

References

GOOD P. (1994). Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer, New York.
IDEKER T., THORSSON V., RANISH J.A., CHRISTMAS R., BUHLER J., ENG J.K., BUMGARNER R.E., GOODLETT D.R., AEBERSOLD R., HOOD L.
(2001). Integrated genomic and proteomic analyses of a systemically perturbed metabolic network. Science, 292.
MAECHLER M., ROUSSEEUW P., STRUYF A., HUBERT M. (2005). Cluster Analysis Basics and Extensions. Unpublished.
QIU W.L., JOE H. (2006a). Generation of Random Clusters with Specified Degree of Separation. Journal of Classification, 23(2).
QIU W.L., JOE H. (2006b). Separation Index and Partial Membership for Clustering. Computational Statistics and Data Analysis, 50.
QIU W.L., JOE H. (2009). clusterGeneration: random cluster generation (with specified degree of separation). R package.
R Development Core Team (2009). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
RAND W.M. (1971). Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association, 66(336).
CLUSTER ANALYSIS V. K. Bhatia I.A.S.R.I., Library Avenue, New Delhi-110 012 In multivariate situation, the primary interest of the experimenter is to examine and understand the relationship amongst the
More informationAN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION
AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION WILLIAM ROBSON SCHWARTZ University of Maryland, Department of Computer Science College Park, MD, USA, 20742-327, schwartz@cs.umd.edu RICARDO
More informationINF4820, Algorithms for AI and NLP: Hierarchical Clustering
INF4820, Algorithms for AI and NLP: Hierarchical Clustering Erik Velldal University of Oslo Sept. 25, 2012 Agenda Topics we covered last week Evaluating classifiers Accuracy, precision, recall and F-score
More informationUsing the Kolmogorov-Smirnov Test for Image Segmentation
Using the Kolmogorov-Smirnov Test for Image Segmentation Yong Jae Lee CS395T Computational Statistics Final Project Report May 6th, 2009 I. INTRODUCTION Image segmentation is a fundamental task in computer
More informationWRAPPER feature selection method with SIPINA and R (RWeka package). Comparison with a FILTER approach implemented into TANAGRA.
1 Topic WRAPPER feature selection method with SIPINA and R (RWeka package). Comparison with a FILTER approach implemented into TANAGRA. Feature selection. The feature selection 1 is a crucial aspect of
More informationInformation Retrieval and Web Search Engines
Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework
More information7. Decision or classification trees
7. Decision or classification trees Next we are going to consider a rather different approach from those presented so far to machine learning that use one of the most common and important data structure,
More informationClustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani
Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation
More informationComparative Study of Subspace Clustering Algorithms
Comparative Study of Subspace Clustering Algorithms S.Chitra Nayagam, Asst Prof., Dept of Computer Applications, Don Bosco College, Panjim, Goa. Abstract-A cluster is a collection of data objects that
More informationHierarchical Clustering
What is clustering Partitioning of a data set into subsets. A cluster is a group of relatively homogeneous cases or observations Hierarchical Clustering Mikhail Dozmorov Fall 2016 2/61 What is clustering
More informationHierarchical Clustering of Process Schemas
Hierarchical Clustering of Process Schemas Claudia Diamantini, Domenico Potena Dipartimento di Ingegneria Informatica, Gestionale e dell'automazione M. Panti, Università Politecnica delle Marche - via
More informationAutomated Clustering-Based Workload Characterization
Automated Clustering-Based Worload Characterization Odysseas I. Pentaalos Daniel A. MenascŽ Yelena Yesha Code 930.5 Dept. of CS Dept. of EE and CS NASA GSFC Greenbelt MD 2077 George Mason University Fairfax
More informationIteration Reduction K Means Clustering Algorithm
Iteration Reduction K Means Clustering Algorithm Kedar Sawant 1 and Snehal Bhogan 2 1 Department of Computer Engineering, Agnel Institute of Technology and Design, Assagao, Goa 403507, India 2 Department
More informationUsing Machine Learning to Optimize Storage Systems
Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation
More informationThe Application of K-medoids and PAM to the Clustering of Rules
The Application of K-medoids and PAM to the Clustering of Rules A. P. Reynolds, G. Richards, and V. J. Rayward-Smith School of Computing Sciences, University of East Anglia, Norwich Abstract. Earlier research
More informationBOSS. Quick Start Guide For research use only. Blackrock Microsystems, LLC. Blackrock Offline Spike Sorter. User s Manual. 630 Komas Drive Suite 200
BOSS Quick Start Guide For research use only Blackrock Microsystems, LLC 630 Komas Drive Suite 200 Salt Lake City UT 84108 T: +1 801 582 5533 www.blackrockmicro.com support@blackrockmicro.com 1 2 1.0 Table
More informationText Categorization. Foundations of Statistic Natural Language Processing The MIT Press1999
Text Categorization Foundations of Statistic Natural Language Processing The MIT Press1999 Outline Introduction Decision Trees Maximum Entropy Modeling (optional) Perceptrons K Nearest Neighbor Classification
More informationClassification. Instructor: Wei Ding
Classification Decision Tree Instructor: Wei Ding Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Preliminaries Each data record is characterized by a tuple (x, y), where x is the attribute
More informationLecture 5 Finding meaningful clusters in data. 5.1 Kleinberg s axiomatic framework for clustering
CSE 291: Unsupervised learning Spring 2008 Lecture 5 Finding meaningful clusters in data So far we ve been in the vector quantization mindset, where we want to approximate a data set by a small number
More informationA geometric non-existence proof of an extremal additive code
A geometric non-existence proof of an extremal additive code Jürgen Bierbrauer Department of Mathematical Sciences Michigan Technological University Stefano Marcugini and Fernanda Pambianco Dipartimento
More informationKeywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.
Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Survey of Clustering
More informationClusteringon NGS data learning. Sarah, est ce que je dois mettre un nom de personnes ou des noms de personnes?
Clusteringon NGS data learning Sarah, est ce que je dois mettre un nom de personnes ou des noms de personnes? To know about clustering There are two main methods: Classification = supervised method: Bring
More informationCHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM
96 CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM Clustering is the process of combining a set of relevant information in the same group. In this process KM algorithm plays
More informationEnhancing K-means Clustering Algorithm with Improved Initial Center
Enhancing K-means Clustering Algorithm with Improved Initial Center Madhu Yedla #1, Srinivasa Rao Pathakota #2, T M Srinivasa #3 # Department of Computer Science and Engineering, National Institute of
More informationINF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22
INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task
More informationAdvanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret
Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret Greedy Algorithms (continued) The best known application where the greedy algorithm is optimal is surely
More informationFigure 1: Workflow of object-based classification
Technical Specifications Object Analyst Object Analyst is an add-on package for Geomatica that provides tools for segmentation, classification, and feature extraction. Object Analyst includes an all-in-one
More informationCHAPTER 8 COMPOUND CHARACTER RECOGNITION USING VARIOUS MODELS
CHAPTER 8 COMPOUND CHARACTER RECOGNITION USING VARIOUS MODELS 8.1 Introduction The recognition systems developed so far were for simple characters comprising of consonants and vowels. But there is one
More informationSGN (4 cr) Chapter 11
SGN-41006 (4 cr) Chapter 11 Clustering Jussi Tohka & Jari Niemi Department of Signal Processing Tampere University of Technology February 25, 2014 J. Tohka & J. Niemi (TUT-SGN) SGN-41006 (4 cr) Chapter
More informationNearest Neighbor Predictors
Nearest Neighbor Predictors September 2, 2018 Perhaps the simplest machine learning prediction method, from a conceptual point of view, and perhaps also the most unusual, is the nearest-neighbor method,
More informationEquations and Functions, Variables and Expressions
Equations and Functions, Variables and Expressions Equations and functions are ubiquitous components of mathematical language. Success in mathematics beyond basic arithmetic depends on having a solid working
More informationDESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES
EXPERIMENTAL WORK PART I CHAPTER 6 DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES The evaluation of models built using statistical in conjunction with various feature subset
More informationCMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)
CMPUT 391 Database Management Systems Data Mining Textbook: Chapter 17.7-17.11 (without 17.10) University of Alberta 1 Overview Motivation KDD and Data Mining Association Rules Clustering Classification
More informationCSE 158 Lecture 6. Web Mining and Recommender Systems. Community Detection
CSE 158 Lecture 6 Web Mining and Recommender Systems Community Detection Dimensionality reduction Goal: take high-dimensional data, and describe it compactly using a small number of dimensions Assumption:
More informationUnderstanding Clustering Supervising the unsupervised
Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data
More informationPart I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a
Week 9 Based in part on slides from textbook, slides of Susan Holmes Part I December 2, 2012 Hierarchical Clustering 1 / 1 Produces a set of nested clusters organized as a Hierarchical hierarchical clustering
More informationh=[3,2,5,7], pos=[2,1], neg=[4,4]
2D1431 Machine Learning Lab 1: Concept Learning & Decision Trees Frank Hoffmann e-mail: hoffmann@nada.kth.se November 8, 2002 1 Introduction You have to prepare the solutions to the lab assignments prior
More information1 Topic. Image classification using Knime.
1 Topic Image classification using Knime. The aim of image mining is to extract valuable knowledge from image data. In the context of supervised image classification, we want to assign automatically a
More informationMultivariate analyses in ecology. Cluster (part 2) Ordination (part 1 & 2)
Multivariate analyses in ecology Cluster (part 2) Ordination (part 1 & 2) 1 Exercise 9B - solut 2 Exercise 9B - solut 3 Exercise 9B - solut 4 Exercise 9B - solut 5 Multivariate analyses in ecology Cluster
More informationBelief Hierarchical Clustering
Belief Hierarchical Clustering Wiem Maalel, Kuang Zhou, Arnaud Martin and Zied Elouedi Abstract In the data mining field many clustering methods have been proposed, yet standard versions do not take base
More informationA Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis
A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis Meshal Shutaywi and Nezamoddin N. Kachouie Department of Mathematical Sciences, Florida Institute of Technology Abstract
More informationClustering Using Graph Connectivity
Clustering Using Graph Connectivity Patrick Williams June 3, 010 1 Introduction It is often desirable to group elements of a set into disjoint subsets, based on the similarity between the elements in the
More informationCollaborative Rough Clustering
Collaborative Rough Clustering Sushmita Mitra, Haider Banka, and Witold Pedrycz Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India {sushmita, hbanka r}@isical.ac.in Dept. of Electrical
More information