2. Background. 2.1 Clustering
- Rosalind Morgan
Clustering involves the unsupervised classification of data items into different groups, or clusters. Unsupervised classification is a learning task in which learning is done with respect to the environment rather than from a teacher that specifies the action to be taken in any given state. In one definition, a valid cluster is one in which data items are more similar to each other than they are to the data items in other clusters. Figure 1 shows an example of a simple clustering problem.

[Figure 1: A simple clustering problem, plotted with Input Feature 1 and Input Feature 2 as axes. A clustering algorithm must find a way to separate the examples (each shown as a star) into meaningful groups (the circles around the groups).]

Here we can see that the data items fall into groups of similar points, and that each group lies far from the points of the other groups.
Clustering is useful in many areas. For example, suppose we have a large set of data about genes and we want to find out which genes are closely associated with each other; clustering addresses exactly this kind of problem. Another such problem may be the classification of species into sub-species and the development of complete taxonomies of species. Exploratory pattern analysis, grouping, decision making, data mining, document retrieval, image segmentation and pattern classification are just a few of the fields in which clustering can be utilized. In many of these cases there is hardly any prior information available about the data, and few assumptions can be made. This is the typical situation in which clustering proves useful for finding the relationships between data items.

Clustering can be broken down into three main steps:

1. Pattern representation, feature extraction/selection.
2. Defining a pattern proximity measure suitable for the dataset.
3. Clustering the patterns together.

The steps in clustering are shown in Figure 2.

[Figure 2: Steps in clustering. Patterns -> Feature Selection/Extraction -> Pattern Representations -> Inter-Pattern Similarity -> Grouping -> Clusters, with a feedback loop from the grouping step back to the earlier steps.]
Here the feedback loop depicts the situation in which the output of the grouping step affects the subsequently performed feature extraction and inter-pattern similarity steps. Below we discuss each of these steps briefly.

Pattern Representation, Feature Selection and Extraction

Pattern representation refers to setting up the problem, including defining the number of available patterns and the number, type and scale of the available features. Feature selection involves selecting the best subset of features to be used for clustering. When transformations are performed on features to produce new features, this is termed feature construction.

Pattern representation is a difficult task because most often this process is not controllable by the user. What the user can do in this step is gather as much information about the data as possible and, if needed, perform feature selection and/or construction, and in this way create the data to be clustered. If done carefully, this step can lead to a clustering that is simple and easy to understand. If done poorly, it can lead to a clustering whose structure is complex and difficult to comprehend. For example, in Figure 3 the data points are equidistant from the center. Taking the Cartesian coordinate system as our pattern representation would yield a different result than choosing the polar coordinate system, because in polar coordinates all of these points share the same radius.
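The effect of the representation can be illustrated with a small sketch. The eight points and the radius of 2 below are hypothetical values chosen only for illustration; they are not taken from Figure 3.

```python
import math

def to_polar(x, y):
    """Convert a Cartesian point to (radius, angle in radians)."""
    return math.hypot(x, y), math.atan2(y, x)

# Eight hypothetical points on a circle of radius 2 centred at the origin.
points = [(2 * math.cos(2 * math.pi * i / 8), 2 * math.sin(2 * math.pi * i / 8))
          for i in range(8)]

xs = [x for x, _ in points]
radii = [to_polar(x, y)[0] for x, y in points]

# In Cartesian coordinates the x feature is spread over the interval [-2, 2].
x_spread = max(xs) - min(xs)
# In polar coordinates the radius feature is constant up to rounding error.
r_spread = max(radii) - min(radii)

print(f"spread of x: {x_spread:.3f}, spread of radius: {r_spread:.3f}")
```

With the radius as a feature, all points look identical to a distance-based method, whereas the Cartesian features spread them out. Which behavior is desirable depends on the structure one hopes to find.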
[Figure 3: A cluster whose data points are equidistant from the origin. Here different pattern representations will yield different results.]

Duda and Hart (1973) mention that patterns are conventionally represented as multidimensional vectors, where each dimension is a single feature. As an example, if we have two features, age and sex, then (24, Male) is the pattern representation of a 24-year-old male. Various types of features can be used. Gowda and Diday (1992) distinguish quantitative features, with (a) continuous values, (b) discrete values and (c) interval values, from qualitative features, which are (a) nominal/unordered or (b) ordinal. Another type is the structured feature (Michalski and Stepp 1983), in which features are represented as trees with child nodes generalized by parent nodes.

Proximity Measures

Choosing an apt proximity measure is very important in clustering. Because there are various feature types, clustering will not produce the expected output unless the proximity measure is chosen carefully.
The most common way to find the proximity between two data points is to calculate the dissimilarity between them using a distance measure. For continuous features, the Euclidean distance is one of the most popular distance measures. For two d-dimensional patterns $x_i$ and $x_j$, it is given by

\[ d_2(x_i, x_j) = \left( \sum_{k=1}^{d} (x_{i,k} - x_{j,k})^2 \right)^{1/2} = \lVert x_i - x_j \rVert_2 \]

This proximity measure works well when the clusters produced are compact and isolated (Mao and Jain 1996). Other proximity measures, such as those proposed by Diday and Simon (1976) and Ichino and Yaguchi (1994), handle both qualitative and quantitative types of features.

A popular measure for finding similarity between patterns is the cosine measure:

\[ s^{(c)}(x_a, x_b) = \frac{x_a^{T} x_b}{\lVert x_a \rVert_2 \, \lVert x_b \rVert_2} \]

This measure works well with text data because it is easy to interpret and simple to compute for sparse vectors.

A similarity measure that has been used successfully for various clustering applications is the one proposed by Gowda and Krishna (1977), called the mutual neighbor distance (MND). The MND formula is shown below.
\[ \mathrm{MND}(x_i, x_j) = \mathrm{NN}(x_i, x_j) + \mathrm{NN}(x_j, x_i) \]

where $\mathrm{NN}(x_i, x_j)$ is the neighbor number of $x_j$ with respect to $x_i$.

Clustering Techniques

A variety of clustering algorithms have been proposed over the years. They can be divided into two basic sub-types: (1) hierarchical and (2) partitional.

Hierarchical Clustering

Hierarchical clustering is based on some method of representing data points in a hierarchical structure. A very common way to do this is to use a dendrogram. A typical dendrogram hierarchy is shown in Figure 7.

[Figure 7: Dendrogram hierarchy over the data points E1 to E9, with leaves in the order E8, E4, E5, E1, E9, E3, E6, E7, E2.]

In the dendrogram, the nodes that are most closely related to each other are joined together at a lower level of the hierarchy, and the nodes that are not closely related to each other are joined together at a higher level of the hierarchy.

Hierarchical clustering can be further divided into two sub-categories: (1) agglomerative clustering and (2) divisive clustering.
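Returning briefly to the proximity measures defined earlier in this section, the following is a minimal sketch of the Euclidean distance, the cosine measure and the mutual neighbor distance. The function names and the four sample points are hypothetical, and the neighbor number is computed here with Euclidean distance, which is an assumption of this sketch.

```python
import math

def euclidean(a, b):
    """Euclidean distance d_2 between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine measure; undefined (division by zero) if either vector is all zeros."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

def neighbor_number(data, i, j):
    """Rank of data[j] among the neighbours of data[i] (1 = nearest)."""
    others = sorted((k for k in range(len(data)) if k != i),
                    key=lambda k: euclidean(data[i], data[k]))
    return others.index(j) + 1

def mnd(data, i, j):
    """Mutual neighbor distance: NN(x_i, x_j) + NN(x_j, x_i)."""
    return neighbor_number(data, i, j) + neighbor_number(data, j, i)

# Hypothetical one-dimensional layout embedded in 2-D: two tight pairs.
data = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (4.5, 0.0)]

print(euclidean(data[0], data[1]))                 # distance within the first pair
print(cosine_similarity((1.0, 2.0), (2.0, 4.0)))   # parallel vectors
print(mnd(data, 0, 1), mnd(data, 1, 2))            # within a pair vs. across pairs
```

Unlike the Euclidean distance, the MND depends on the surrounding points: adding or removing a pattern can change the neighbor numbers of patterns that did not move.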
In agglomerative clustering, each data point starts as its own cluster, and clusters are successively merged until a point is reached where no more merging can be performed (the stop condition). There are various ways to select the clusters to be merged. One simple approach is to repeatedly select the closest pair of clusters (based on their closest members) and merge them. Examples of agglomerative clustering algorithms include Single Link Clustering, Complete Link Clustering, and Agglomerative hierarchical clustering. Figure 8 shows a basic agglomerative clustering algorithm.

[Figure 8: A simple agglomerative clustering algorithm. (1) Put each data object in its own cluster. (2) Compare all clusters and find the two that are closest to each other. (3) If their distance is below the threshold, merge the two clusters and return to step 2; otherwise clustering ends.]

In divisive clustering, all the data points start in one cluster, and that cluster is split into smaller clusters until no more splitting can be done (the stop condition). An example of a divisive clustering algorithm is the Distributional noun algorithm (Pereira et al 1993).
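The agglomerative procedure of Figure 8 can be sketched as follows. This is a minimal, unoptimized illustration that assumes Euclidean distance and the closest-member (single link) criterion described above; the function names, the sample points and the threshold value are hypothetical.

```python
import math
from itertools import combinations

def single_link(c1, c2):
    """Distance between two clusters, measured between their closest members."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(points, threshold):
    """Merge the closest pair of clusters until that pair is at least `threshold` apart."""
    clusters = [[p] for p in points]              # each data object in its own cluster
    while len(clusters) > 1:
        # Find the indices (i < j) of the two closest clusters.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        if single_link(clusters[i], clusters[j]) >= threshold:
            break                                 # stop condition: nothing close enough
        clusters[i] = clusters[i] + clusters[j]   # merge the two clusters
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
for cluster in agglomerate(points, threshold=2.0):
    print(cluster)
```

Recomputing all pairwise cluster distances on every pass keeps the sketch short but is expensive; practical implementations usually cache a distance matrix and update it after each merge.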
Partitional Clustering

Partitional clustering is based on the idea that the data set should be directly decomposed into a set of disjoint clusters. The major difficulties that arise in this type of clustering are questions such as how many clusters there should be, how the points should be divided and how the clusters should be represented. In this type of clustering a cluster is often represented by its centroid, the point that minimizes the sum of squared distances to all points in the cluster. One of the most widely used partitional algorithms is the k-means algorithm. It works as follows:

1. Choose k cluster centers randomly.
2. Assign each pattern to the closest center.
3. Recompute the centers using the current cluster elements.
4. If there is minimal change in the clusters or no reassignment of patterns, stop. Otherwise go to step 2.

Figure 10 shows a taxonomy (Jain et al, 1999) of the clustering algorithms discussed here.

[Figure 10: Taxonomy of clustering algorithms. Clustering divides into Hierarchical and Partitional; Hierarchical divides into Agglomerative (Single Link, Complete Link) and Divisive (Distributional Noun); Partitional includes K-Means.]
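The four k-means steps listed above can be sketched as follows. This is a minimal illustration assuming Euclidean distance and initial centers drawn at random from the data points; the `max_iter` and `seed` parameters and the sample points are hypothetical additions for reproducibility.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Return (centers, assignment) for a list of equal-length coordinate tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                 # step 1: random initial centers
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each pattern to its closest center.
        new_assignment = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                          for p in points]
        if new_assignment == assignment:            # step 4: no reassignment, stop
            break
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                             # leave an empty cluster's center as is
                centers[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centers, assignment

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = kmeans(points, k=2)
print(centers)
print(assignment)
```

The result depends on the initial centers, so in practice the algorithm is often run several times with different seeds and the partition with the lowest total squared distance is kept.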
2.2 Biclustering

Even though clustering has long been known to give good results, there are applications for which it is not an adequate solution. Biclustering was introduced by Cheng and Church (2000) for discovering knowledge from gene expression data. Biclustering in their case meant clustering both genes and conditions simultaneously to gather usable knowledge from the gene expression data.

Why is biclustering needed? When clustering genes according to the conditions they respond to, any clustering algorithm assumes that related genes behave similarly no matter what the condition. If the dataset is large, this might not be true. Also, clustering often partitions the genes into disjoint sets, i.e. a single gene is associated with a single process or function, which in most cases is not true. These are the classic cases where biclustering proves helpful. The most common use of biclustering today is in clustering the microarray data that is produced by genes in response to certain conditions. However, biclustering has been performed in various ways. In this section we discuss some of these methods.

Bicluster Structure

Any biclustering algorithm makes one of the following assumptions: (1) there is only one bicluster in the data matrix, or (2) there are K biclusters in the data matrix. While the most popular assumption is that of K biclusters, there are algorithms which assume that there is only one bicluster in the whole data matrix. According to Madeira and Oliveira (2004), if K biclusters are assumed in the data matrix, then the various types of structures that can be obtained are:
1. Exclusive row and column biclusters.
2. Non-overlapping biclusters with checkerboard structure.
3. Exclusive rows biclusters.
4. Exclusive columns biclusters.
5. Non-overlapping biclusters with tree structure.
6. Non-overlapping non-exclusive biclusters.
7. Overlapping biclusters with hierarchical structure.
8. Arbitrarily positioned overlapping biclusters.

Figure 11 shows a representation of these bicluster structures.

[Figure 11: Bicluster structure. Panels: (a) single bicluster, (b) exclusive row and column, (c) checkerboard, (d) exclusive rows, (e) exclusive columns, (f) non-overlapping with tree structure, (g) non-overlapping non-exclusive, (h) overlapping hierarchical, (i) arbitrarily positioned.]

While most of the algorithms discussed here use the arbitrarily positioned overlapping structure for creating their biclusters, the algorithm proposed by Sheng et al (2003) uses the exclusive row structure.
2.2.2 Bicluster Types

Biclustering algorithms can be classified according to the type of biclusters they are able to find. According to Madeira and Oliveira (2004) there are four major classes of biclusters:

1. Biclusters with constant values.
2. Biclusters with constant values on rows or columns.
3. Biclusters with coherent values.
4. Biclusters with coherent evolutions.

Figure 12 shows examples of the different types of biclusters.

[Figure 12: An example of different types of biclustering. Panels: (a) constant value, (b) constant row, (c) constant column, (d) coherent value, (e) coherent evolution.]

Constant Value Biclusters

Constant value biclustering is the simplest form of biclustering, in which the algorithms try to find subsets of rows and columns with constant values. In gene expression data, a constant value bicluster represents a subset of genes with similar expression values across a subset of conditions.
Another way to approach biclustering is to look for subsets of rows and columns with constant values on the rows or on the columns of the data matrix. Various algorithms implement this type of biclustering. The Coupled Two-Way Clustering algorithm of Getz, Levine and Domany (2000) and the Gibbs sampling biclustering of Sheng, Moreau and Moor (2003) are two examples of such constant value biclustering algorithms.

Coherent Value Biclusters

Coherent value biclustering algorithms look for biclusters that have coherent values on both the rows and the columns. In these types of biclustering algorithms, a more sophisticated analysis of variance between groups is performed to find biclusters of good quality. The FLOC (Flexible Overlapped Biclustering) algorithm (Yang et al 2003), the Interrelated Two-Way Clustering algorithm and the Cheng and Church (2000) algorithm are a few examples of coherent value biclustering algorithms.

The FLOC algorithm presented by Yang et al (2003) simultaneously produces k biclusters whose mean residues are less than a predefined limit. The algorithm moves a row or column out of or into a bicluster, depending on whether that row or column is already included in the bicluster. It then chooses the row or column that gives the best gain in score and evaluates the relative reduction of the bicluster's residue. This is done for all rows and columns, the bicluster with the minimum mean residue is kept, and the whole process is repeated.

The Cheng and Church algorithm produces one co-cluster at a time. They use a low mean squared residue plus a large variation from the constant as their criteria for identifying a bicluster. They apply a sequence of row/column deletions and additions on the gene-condition matrix while keeping the mean squared residue under a given threshold. After creating a bicluster like this, they replace the elements of
the bicluster with random numbers and repeat the process on the modified matrix to generate another bicluster, until the required number of biclusters has been found.

Coherent Evolutions Biclusters

These types of algorithms find coherent evolutions across rows and/or columns of the data matrix regardless of their exact values. In the case of gene expression data, we might look at whether a subset of genes is up-regulated or down-regulated across a subset of conditions, regardless of the actual expression values. Algorithms like order-preserving sub-matrix (Ben-Dor et al 2002), OP-cluster (Liu and Wang 2003), xMotif (Murali and Kasif 2003) and SAMBA (Tanay et al 2002) are examples of coherent evolutions biclustering algorithms.

Biclustering Approaches

As discussed earlier, the main question facing anyone writing a biclustering algorithm is whether to identify one bicluster or a given number of biclusters. The problem is very complex, and various heuristic methods have been used to solve it. Biclustering algorithms can be divided into five categories according to the heuristic they use:

1. Iterative Row and Column Clustering Combination
2. Divide and Conquer
3. Greedy Iterative Search
4. Exhaustive Bicluster Enumeration
5. Distribution Parameter Identification

Iterative Row and Column Clustering Combination

This method is relatively easy: it applies existing clustering methods to the columns and rows of the data matrix to get clusters, and then combines the results to obtain biclusters. The Coupled Two-Way Clustering (Getz et al 2000) and the
Interrelated Two-Way Clustering (Tang et al 2001) are two examples of this type of approach.

Divide and Conquer

In the divide and conquer approach, the biclustering problem is broken down into several sub-problems which are then solved recursively. The solutions obtained are combined to create the end result. Though these algorithms can be very fast, they are likely to miss good biclusters. Block Clustering (Hartigan 1975) is an example of a divide and conquer approach. Modifications to this algorithm have also been suggested by Duffy and Quiroz (1991).

Greedy Iterative Search

The greedy search method creates biclusters by adding rows or columns to them, or removing rows or columns from them, using a local maximum as its selection criterion. Like the divide and conquer approach, this approach tends to be very fast but is prone to making wrong decisions. FLOC (Yang et al 2003), order-preserving sub-matrix (Ben-Dor et al 2002) and the Cheng and Church algorithm (Cheng and Church 2000) are a few good examples of this method.

Exhaustive Bicluster Enumeration

This method is based on the idea that the best biclusters can only be found through an exhaustive search of all the possible biclusters of the data matrix. The complexity of these algorithms is very high: either they take a long time to run, or they have to impose a restriction on the size of the data matrix. The SAMBA algorithm (Tanay et al 2002), the Maximum Dimension Sets algorithm (Wang et al 2002) and the OPC-tree algorithm (Liu and Wang 2003) are a few good examples of these types of algorithms.
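Two of the greedy algorithms named above, FLOC and the Cheng and Church algorithm, score a candidate bicluster by its mean squared residue. The sketch below assumes the usual definition, in which the residue of an element is its value minus its row mean minus its column mean plus the overall mean of the bicluster; the function name and the example matrix are hypothetical.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the bicluster given by row and column index lists."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    # Residue of each element: value - row mean - column mean + overall mean.
    total = sum((sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
                for i in range(n_rows) for j in range(n_cols))
    return total / (n_rows * n_cols)

# Hypothetical gene-condition matrix: the top-left 3x3 block is additively
# coherent (each row is a shifted copy of the first), the rest is noise.
matrix = [
    [1, 2, 3, 9],
    [2, 3, 4, 0],
    [4, 5, 6, 7],
    [5, 1, 8, 2],
]

print(mean_squared_residue(matrix, rows=[0, 1, 2], cols=[0, 1, 2]))
print(mean_squared_residue(matrix, rows=[0, 1, 2, 3], cols=[0, 1, 2, 3]))
```

A perfectly coherent block scores zero up to rounding error, while the full matrix scores much higher. A greedy search can therefore delete or add rows and columns while tracking whether the score stays below its threshold.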
Distribution Parameter Identification

In this approach, a statistical model is assumed and the distribution parameters used to generate the data are then identified. The plaid model algorithm (Lazzeroni and Owen 2000) is one example of this approach.

Biclustering Applications

Besides its applications in biology, biclustering has been used in other fields as well. These are listed in Table 2.

Application | Use | Examples
E-Commerce/Target Marketing | Identify subgroups of customers who have similar preferences towards a subset of products. | Yang et al 2002
Information Retrieval / Text Mining | Identify subgroups of documents with similar properties relative to a subgroup of attributes. | Dhillon 2001; Dhillon et al 2003; Berkin et al 2002
Politics | Identify a subgroup of people with the same political ideas and electoral behavior over a subset of attributes. | Hartigan 1972
Databases | Reduce the dimensionality of tables with thousands of rows and hundreds of columns. | Aggarwal et al 1998

Table 2: Some Biclustering Applications
Gene expression & Clustering (Chapter 10) Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species Dynamic programming Approximate pattern matching
More informationCS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample
CS 1675 Introduction to Machine Learning Lecture 18 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem:
More informationClustering. CS294 Practical Machine Learning Junming Yin 10/09/06
Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,
More informationClustering (COSC 416) Nazli Goharian. Document Clustering.
Clustering (COSC 416) Nazli Goharian nazli@cs.georgetown.edu 1 Document Clustering. Cluster Hypothesis : By clustering, documents relevant to the same topics tend to be grouped together. C. J. van Rijsbergen,
More informationData Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University
Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data
More information2. Department of Electronic Engineering and Computer Science, Case Western Reserve University
Chapter MINING HIGH-DIMENSIONAL DATA Wei Wang 1 and Jiong Yang 2 1. Department of Computer Science, University of North Carolina at Chapel Hill 2. Department of Electronic Engineering and Computer Science,
More informationCluster analysis. Agnieszka Nowak - Brzezinska
Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that
More informationCLUSTER ANALYSIS. V. K. Bhatia I.A.S.R.I., Library Avenue, New Delhi
CLUSTER ANALYSIS V. K. Bhatia I.A.S.R.I., Library Avenue, New Delhi-110 012 In multivariate situation, the primary interest of the experimenter is to examine and understand the relationship amongst the
More informationCS573 Data Privacy and Security. Li Xiong
CS573 Data Privacy and Security Anonymizationmethods Li Xiong Today Clustering based anonymization(cont) Permutation based anonymization Other privacy principles Microaggregation/Clustering Two steps:
More informationINF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering
INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,
More informationClassification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University
Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate
More informationUnsupervised Learning
Unsupervised Learning A review of clustering and other exploratory data analysis methods HST.951J: Medical Decision Support Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision
More informationUnsupervised learning, Clustering CS434
Unsupervised learning, Clustering CS434 Unsupervised learning and pattern discovery So far, our data has been in this form: We will be looking at unlabeled data: x 11,x 21, x 31,, x 1 m x 12,x 22, x 32,,
More informationUsing the Kolmogorov-Smirnov Test for Image Segmentation
Using the Kolmogorov-Smirnov Test for Image Segmentation Yong Jae Lee CS395T Computational Statistics Final Project Report May 6th, 2009 I. INTRODUCTION Image segmentation is a fundamental task in computer
More informationCS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample
Lecture 9 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem: distribute data into k different groups
More informationClustering CE-324: Modern Information Retrieval Sharif University of Technology
Clustering CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch. 16 What
More informationClustering. Lecture 6, 1/24/03 ECS289A
Clustering Lecture 6, 1/24/03 What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed
More informationA Memetic Heuristic for the Co-clustering Problem
A Memetic Heuristic for the Co-clustering Problem Mohammad Khoshneshin 1, Mahtab Ghazizadeh 2, W. Nick Street 1, and Jeffrey W. Ohlmann 1 1 The University of Iowa, Iowa City IA 52242, USA {mohammad-khoshneshin,nick-street,jeffrey-ohlmann}@uiowa.edu
More informationInternational Journal of Advanced Research in Computer Science and Software Engineering
Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:
More informationHierarchical Clustering
Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram A tree like diagram that records the sequences of merges or splits 0 0 0 00
More informationPattern Clustering with Similarity Measures
Pattern Clustering with Similarity Measures Akula Ratna Babu 1, Miriyala Markandeyulu 2, Bussa V R R Nagarjuna 3 1 Pursuing M.Tech(CSE), Vignan s Lara Institute of Technology and Science, Vadlamudi, Guntur,
More informationFlat Clustering. Slides are mostly from Hinrich Schütze. March 27, 2017
Flat Clustering Slides are mostly from Hinrich Schütze March 7, 07 / 79 Overview Recap Clustering: Introduction 3 Clustering in IR 4 K-means 5 Evaluation 6 How many clusters? / 79 Outline Recap Clustering:
More informationCHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION
CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant
More informationClustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search
Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2
More informationMachine Learning. Unsupervised Learning. Manfred Huber
Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training
More informationCustomer Clustering using RFM analysis
Customer Clustering using RFM analysis VASILIS AGGELIS WINBANK PIRAEUS BANK Athens GREECE AggelisV@winbank.gr DIMITRIS CHRISTODOULAKIS Computer Engineering and Informatics Department University of Patras
More informationBased on Raymond J. Mooney s slides
Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit
More informationFinding Clusters 1 / 60
Finding Clusters Types of Clustering Approaches: Linkage Based, e.g. Hierarchical Clustering Clustering by Partitioning, e.g. k-means Density Based Clustering, e.g. DBScan Grid Based Clustering 1 / 60
More informationData Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science
Data Mining Dr. Raed Ibraheem Hamed University of Human Development, College of Science and Technology Department of Computer Science 06 07 Department of CS - DM - UHD Road map Cluster Analysis: Basic
More informationClustering and Visualisation of Data
Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some
More informationEECS730: Introduction to Bioinformatics
EECS730: Introduction to Bioinformatics Lecture 15: Microarray clustering http://compbio.pbworks.com/f/wood2.gif Some slides were adapted from Dr. Shaojie Zhang (University of Central Florida) Microarray
More informationMining Deterministic Biclusters in Gene Expression Data
Mining Deterministic Biclusters in Gene Expression Data Zonghong Zhang 1 Alvin Teo 1 BengChinOoi 1,2 Kian-Lee Tan 1,2 1 Department of Computer Science National University of Singapore 2 Singapore-MIT-Alliance
More informationCSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16)
CSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16) Michael Hahsler Southern Methodist University These slides are largely based on the slides by Hinrich Schütze Institute for
More informationLesson 3. Prof. Enza Messina
Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical
More informationMultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A
MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI
More informationLecture-17: Clustering with K-Means (Contd: DT + Random Forest)
Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Medha Vidyotma April 24, 2018 1 Contd. Random Forest For Example, if there are 50 scholars who take the measurement of the length of the
More information10701 Machine Learning. Clustering
171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among
More informationWhat is Unsupervised Learning?
Clustering What is Unsupervised Learning? Unlike in supervised learning, in unsupervised learning, there are no labels We simply a search for patterns in the data Examples Clustering Density Estimation
More informationCHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION
CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster)
More informationSYDE Winter 2011 Introduction to Pattern Recognition. Clustering
SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned
More informationClustering and Dimensionality Reduction
Clustering and Dimensionality Reduction Some material on these is slides borrowed from Andrew Moore's excellent machine learning tutorials located at: Data Mining Automatically extracting meaning from
More informationCluster Analysis. Angela Montanari and Laura Anderlucci
Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a
More informationStats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms
Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms Padhraic Smyth Department of Computer Science Bren School of Information and Computer Sciences University of California,
More informationClustering Part 4 DBSCAN
Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of
More informationLecture 15 Clustering. Oct
Lecture 15 Clustering Oct 31 2008 Unsupervised learning and pattern discovery So far, our data has been in this form: x 11,x 21, x 31,, x 1 m y1 x 12 22 2 2 2,x, x 3,, x m y We will be looking at unlabeled
More informationCOMS 4771 Clustering. Nakul Verma
COMS 4771 Clustering Nakul Verma Supervised Learning Data: Supervised learning Assumption: there is a (relatively simple) function such that for most i Learning task: given n examples from the data, find
More information