2. Background

2.1 Clustering

Clustering involves the unsupervised classification of data items into different groups, or clusters. Unsupervised classification is a learning task in which learning is done with respect to the environment and not from a teacher that specifies the action to be taken in any given state. In one definition, a valid cluster is one in which data items are more similar to each other than they are to the data items in other clusters. Figure 1 shows an example of a simple clustering problem.

Figure 1: A simple clustering problem (axes: Input Feature 1 and Input Feature 2). A clustering algorithm must find a way to separate the examples (shown as stars) into meaningful groups (the circles around the groups).

Here we can see that the data items belonging to the same cluster form groups of similar points that lie far from the points of the other clusters.

There are various areas in which clustering is very useful. For example, suppose we have a large set of data about genes and we want to find out which genes are closely associated with each other. Clustering is used for such a problem. Another such problem may be the classification of species into sub-species and the development of complete taxonomies of species. Clustering algorithms prove to be very useful in tackling such problems. Exploratory pattern analysis, grouping, decision making, data mining, document retrieval, image segmentation and pattern classification are just a few of the fields in which clustering can be utilized. In many of these cases there is hardly any prior information available about the data, and few assumptions can be made about it. This is the typical situation in which clustering proves to be very useful for finding the relationships between data items.

Clustering can be broken down into three main steps:

1. Pattern representation, feature extraction/selection.
2. Defining a pattern proximity measure suitable for the dataset.
3. Clustering the patterns together.

The steps in clustering are shown in Figure 2.

Figure 2: Steps in clustering (Patterns -> Feature Selection/Extraction -> Pattern Representations -> Inter-Pattern Similarity -> Grouping -> Clusters, with a feedback loop from Grouping back to the earlier steps).

Here the feedback loop depicts the situation in which the output of the grouping step can affect the subsequently performed feature extraction and inter-pattern similarity steps. Below we discuss each of these steps briefly.

Pattern Representation, Feature Selection and Extraction

Pattern representation refers to setting up the problem, including defining the number of available patterns and the number, type and scale of the features available. Feature selection involves selecting the best subset of features to be used for clustering. When transformations are performed on features to produce new features, this is termed feature extraction (or feature construction). Pattern representation is a difficult task because most often this process is not controllable by the user. What the user can do in this step is gather as much information about the data as possible and, if needed, perform feature selection and/or construction, and in this way create the data that is to be clustered. If done carefully, this step can lead to a clustering that is simple and easy to understand. If done poorly, it can lead to a clustering whose structure is very complex and difficult to comprehend. For example, in Figure 3 the data points are equidistant from the center. If we take the Cartesian coordinate system as our pattern representation, it would yield a different result than if we choose the polar coordinate system, because in polar coordinates all the points share nearly the same radius.

Figure 3: A cluster whose data points are equidistant from the origin. Here different pattern representations will yield different results.

Duda and Hart (1973) mention that patterns are conventionally represented as multidimensional vectors, where each dimension is a single feature. As an example, if we have two features, age and sex, then (24, Male) is the pattern representation of a 24-year-old male. There are various types of features that can be used. Gowda and Diday (1992) describe quantitative features such as (a) continuous values, (b) discrete values and (c) interval values, and qualitative features such as (a) nominal/unordered and (b) ordinal. Another type of feature is the structured feature (Michalski and Stepp 1983), in which the features are represented as trees with child nodes generalized by parent nodes.

Proximity Measures

Choosing an apt proximity measure is very important in clustering. This is because the feature types vary, and unless the proximity measure is chosen to suit them, clustering will not produce the expected output.

The most common way to find the proximity between two data points is to calculate the dissimilarity between the two using a distance measure. For continuous features, the Euclidean distance is one of the most popular distance measures. The equation for the Euclidean distance between two patterns with d features is given below.

$$d_2(x_i, x_j) = \left( \sum_{k=1}^{d} (x_{i,k} - x_{j,k})^2 \right)^{1/2} = \lVert x_i - x_j \rVert_2$$

This proximity measure works well when the clusters produced are compact and isolated (Mao and Jain 1996). There are various other proximity measures, such as those proposed by Diday and Simon (1976) and Ichino and Yaguchi (1994), for both qualitative and quantitative types of features. A popular measure for finding similarity between patterns is the cosine measure. The equation for the cosine measure is given below.

$$s^{(c)}(x_a, x_b) = \frac{x_a^{T} x_b}{\lVert x_a \rVert_2 \, \lVert x_b \rVert_2}$$

This measure is very good when working with text data because it is easy to interpret and simple to compute for sparse vectors. A similarity measure that has been used successfully for various clustering applications is the one proposed by Gowda and Krishna (1977), called the mutual neighbor distance (MND). The MND formula is shown below.

$$\mathrm{MND}(x_i, x_j) = \mathrm{NN}(x_i, x_j) + \mathrm{NN}(x_j, x_i)$$

Here NN(x_i, x_j) is the neighbor number of x_j with respect to x_i.

Clustering Techniques

A variety of clustering algorithms have been proposed over the years. They can be divided into two basic subtypes: (1) hierarchical and (2) partitional.

Hierarchical Clustering

Hierarchical clustering is based on some method of representing data points in a hierarchical structure. A very common way to do this is to use a dendrogram. A typical dendrogram hierarchy is shown in Figure 7.

Figure 7: Dendrogram hierarchy (leaves, in the order shown: E8, E4, E5, E1, E9, E3, E6, E7, E2).

The data points in the figure are arranged in a dendrogram in which the nodes that are most closely related to each other are joined at a lower level of the hierarchy, and the nodes that are not closely related to each other are joined at a higher level of the hierarchy. Hierarchical clustering can be further divided into two subcategories: (1) agglomerative clustering and (2) divisive clustering.
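Before turning to the individual hierarchical methods, the proximity measures of the previous subsection can be made concrete. The following is a minimal sketch in Python, not taken from any of the cited works. The function names are our own, and two points are assumptions: that the mutual neighbor distance is the sum of the two neighbor numbers, and that the nearest neighbor has rank 1 with ties broken by index.

```python
import math


def euclidean_distance(x, y):
    """Square root of the summed squared differences between two patterns."""
    if len(x) != len(y):
        raise ValueError("patterns must have the same number of features")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Dot product of two patterns divided by the product of their norms."""
    if len(x) != len(y):
        raise ValueError("patterns must have the same number of features")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


def neighbor_number(points, i, j):
    """Rank of points[j] among the neighbors of points[i] (nearest = 1).

    Assumption: ties are broken by index order.
    """
    others = [k for k in range(len(points)) if k != i]
    others.sort(key=lambda k: euclidean_distance(points[i], points[k]))
    return others.index(j) + 1


def mutual_neighbor_distance(points, i, j):
    """Sum of the two neighbor numbers of a pair of distinct points."""
    return neighbor_number(points, i, j) + neighbor_number(points, j, i)
```

Note that the mutual neighbor distance depends on the whole data set and not only on the two points being compared, which is why the sketch takes the full list of points as an argument.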

In agglomerative clustering, each data point is initially considered a cluster of its own, and clusters are successively merged until a point is reached where no more merging can be performed (the stop condition). There are various ways of selecting the clusters to be merged. One simple approach is to repeatedly select the closest pair of clusters (based on their closest members) and merge them. Examples of agglomerative clustering algorithms include Single Link Clustering, Complete Link Clustering, and Agglomerative hierarchical clustering. Figure 8 shows a basic agglomerative clustering algorithm.

Figure 8: A simple agglomerative clustering algorithm (flowchart):
1. Put each of the data objects in its own cluster.
2. Compare all clusters and find the two clusters that are closest to each other.
3. If their distance is below the threshold, merge the two clusters and return to step 2; otherwise, end the clustering.

In divisive clustering, all the data points start in one cluster, and that cluster is split into smaller clusters until no more splitting can be done (the stop condition). An example of a divisive clustering algorithm is the Distributional noun algorithm (Pereira et al 1993).
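The agglomerative procedure of Figure 8 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions and not a reference implementation: the function names are hypothetical, the distance between two clusters is taken to be the Euclidean distance between their closest members (the single link criterion), and the threshold is supplied by the caller.

```python
import math


def single_link_distance(cluster_a, cluster_b):
    """Distance between the closest pair of members of two clusters."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)


def agglomerative_clustering(points, threshold):
    """Merge the closest pair of clusters until none is closer than threshold."""
    # Step 1: put each data object in its own cluster.
    clusters = [[tuple(p)] for p in points]
    while len(clusters) > 1:
        # Step 2: compare all clusters and find the closest pair.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_link_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Step 3: stop when the closest pair is not below the threshold.
        if d >= threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Recomputing every pairwise distance in each pass keeps the sketch short but is slow on large data sets; practical implementations usually maintain a distance matrix that is updated after each merge.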

Partitional Clustering

Partitional clustering is based on the idea that the data set should be directly decomposed into a set of disjoint clusters. The major difficulties that arise in this type of clustering are questions such as how many clusters there should be, how the points should be divided and how the clusters should be represented. In this type of clustering a cluster is often represented by its centroid, which is the point that minimizes the sum of squared distances to all points in the cluster. One of the most widely used partitional algorithms is the k-means algorithm. Here is how k-means works:

1. Choose k cluster centers randomly.
2. Assign each pattern to the closest center.
3. Recompute the centers using the current cluster elements.
4. If there is minimal change in the clusters or no reassignment of patterns, then stop. Otherwise go to step 2.

A taxonomy of the clustering algorithms discussed here (after Jain et al, 1999) is shown in Figure 10.

Figure 10: Taxonomy of clustering algorithms.
Clustering
  Hierarchical
    Agglomerative: Single Link, Complete Link
    Divisive: Distributional Noun
  Partitional: K-Means
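The four k-means steps listed earlier in this subsection can be sketched as follows. This is a minimal illustration under our own assumptions, not a definitive implementation: the function name, the seed and max_iterations parameters, the choice of initial centers by sampling k of the input points, and the decision to leave an empty cluster's center unchanged are all choices made for this sketch. It stops as soon as no pattern is reassigned.

```python
import math
import random


def k_means(points, k, max_iterations=100, seed=0):
    """Return (centers, assignment) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Step 1: choose k cluster centers randomly (here: k of the input points).
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iterations):
        # Step 2: assign each pattern to the closest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Step 4: stop when no pattern changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old center
                centers[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centers, assignment
```

Because the initial centers are chosen at random, different seeds can lead to different final clusters on data that is less clearly separated, which is one reason the number and placement of clusters is listed above as a major difficulty.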

2.2 Biclustering

Even though clustering has long been known to give good results, there are applications for which clustering does not prove to be an adequate solution. Biclustering was introduced by Cheng and Church (2000) for discovering knowledge from gene expression data. Biclustering in their case meant clustering both genes and conditions simultaneously to gather usable knowledge from the gene expression data.

Why is biclustering needed? When clustering genes according to the conditions they respond to, any clustering algorithm assumes that related genes behave similarly across all conditions. If the dataset is large, this might not be true. Also, clustering often partitions the genes into disjoint sets, i.e. a single gene is associated with a single process or function, which in most cases is not true. These are the classic cases where biclustering proves helpful. The most common use of biclustering today is in clustering the microarray data that records how genes respond to certain conditions. However, biclustering has been performed in various ways. In this section we discuss some of these methods.

Bicluster Structure

Any biclustering algorithm makes one of the following assumptions: (1) there is only one bicluster in the data matrix, or (2) there are K biclusters in the data matrix. While the most popular assumption is that of K biclusters, there are algorithms which assume that there is only one bicluster in the whole data matrix. According to Madeira and Oliveira (2004), if K biclusters are assumed in the data matrix, then the various types of structures that can be obtained are:

1. Exclusive row and column biclusters.
2. Non-overlapping biclusters with checkerboard structure.
3. Exclusive rows biclusters.
4. Exclusive columns biclusters.
5. Non-overlapping biclusters with tree structure.
6. Non-overlapping non-exclusive biclusters.
7. Overlapping biclusters with hierarchical structure.
8. Arbitrarily positioned overlapping biclusters.

Figure 11 shows a representation of these bicluster structures.

Figure 11: Bicluster structure. Panels: (a) single bicluster, (b) exclusive row and column, (c) checkerboard, (d) exclusive rows, (e) exclusive columns, (f) non-overlapping with tree structure, (g) non-overlapping non-exclusive, (h) overlapping with hierarchical structure, (i) arbitrarily positioned overlapping.

While most of the algorithms discussed in this chapter use the arbitrarily positioned overlapping structure for creating their biclusters, the algorithm proposed by Sheng et al (2003) uses the exclusive rows structure.

2.2.2 Bicluster Types

Biclustering algorithms can be classified according to the type of biclusters they are able to find. According to Madeira and Oliveira (2004) there are four major classes of biclusters:

1. Biclusters with constant values.
2. Biclusters with constant values on rows or columns.
3. Biclusters with coherent values.
4. Biclusters with coherent evolutions.

Figure 12 shows examples of the different types of biclusters.

Figure 12: An example of different types of biclustering. Panels: (a) constant value, (b) constant row, (c) constant column, (d) coherent value, (e) coherent evolution.

Constant Value Biclusters

Constant value biclustering is the simplest form of biclustering, in which the algorithms try to find subsets of rows and columns with constant values. In gene expression data a constant value bicluster represents a subset of genes with similar expression values across a subset of conditions.
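As a small illustration of the constant value case, the sketch below checks whether a chosen subset of rows and columns of a data matrix forms a constant value bicluster. The function names and the tolerance parameter are our own assumptions; the tolerance is there because real expression values are only approximately equal, and with the default of 0.0 the check requires exactly equal values.

```python
def submatrix(matrix, rows, cols):
    """Extract the entries of matrix at the given row and column indices."""
    return [[matrix[r][c] for c in cols] for r in rows]


def is_constant_value_bicluster(matrix, rows, cols, tolerance=0.0):
    """True if all selected entries lie within tolerance of each other."""
    values = [v for row in submatrix(matrix, rows, cols) for v in row]
    return max(values) - min(values) <= tolerance
```

For example, in a matrix whose rows are genes and whose columns are conditions, calling is_constant_value_bicluster with the indices of three genes and two conditions tests whether those genes have similar expression values across just those two conditions, regardless of the rest of the matrix.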

Another way to approach biclustering is to look for subsets of rows and columns with constant values on the rows or on the columns of the data matrix. There are various algorithms that implement this type of biclustering. The Coupled Two-Way Clustering algorithm of Getz, Levine and Domany (2000) and the Gibbs sampling biclustering of Sheng, Moreau and De Moor (2003) are two examples of constant value biclustering algorithms.

Coherent Value Biclusters

Coherent value biclustering algorithms look for biclusters that have coherent values on both the rows and the columns. In these types of biclustering algorithms, a more sophisticated analysis of variance between groups is performed to find biclusters of good quality. The FLOC (Flexible Overlapped Biclustering) algorithm (Yang et al 2003), the Interrelated Two-Way Clustering algorithm and the Cheng and Church (2000) algorithm are a few examples of coherent value biclustering algorithms.

The FLOC algorithm presented by Yang et al (2003) simultaneously produces k biclusters whose mean residues are less than a predefined limit. For each row and column, the algorithm considers moving it out of a bicluster if it is already included, or into the bicluster if it is not. It then chooses the action that gives the best gain, evaluated as the relative reduction of the bicluster's residue. This is done for all rows and columns, the biclustering with the minimum mean residue is kept, and the whole process is repeated.

The Cheng and Church algorithm produces one bicluster at a time. They use a low mean squared residue together with a large variation from the constant as their criteria for identifying a bicluster. They apply a sequence of row/column deletions and additions to the gene-condition matrix while keeping the mean squared residue under a given threshold. After creating a bicluster in this way, they replace the elements of the bicluster with random numbers and repeat the process on the modified matrix to generate another bicluster, until the required number of biclusters is found.

Coherent Evolutions Biclusters

These types of algorithms find coherent evolutions across rows and/or columns of the data matrix regardless of their exact values. In the case of gene expression data, we might look at whether a subset of genes is up-regulated or down-regulated across a subset of conditions, regardless of the actual expression values. Algorithms like order-preserving sub-matrix (Ben-Dor et al 2002), OP-cluster (Liu and Wang 2003), xmotif (Murali and Kasif 2003) and SAMBA (Tanay et al 2002) are examples of coherent evolutions biclustering algorithms.

Biclustering Approaches

As we discussed earlier, the main question facing anyone who writes a biclustering algorithm is whether to identify one bicluster or a given number of biclusters. This problem is very complex, and various heuristic methods have been used to solve it. The biclustering algorithms can be divided into five categories according to the heuristic they use:

1. Iterative Row and Column Clustering Combination
2. Divide and Conquer
3. Greedy Iterative Search
4. Exhaustive Bicluster Enumeration
5. Distribution Parameter Identification

Iterative Row and Column Clustering Combination

This method is relatively easy. It applies existing clustering methods to the rows and to the columns of the data matrix separately and then combines the resulting clusters to obtain biclusters. The Coupled Two-Way Clustering (Getz et al 2000) and the Interrelated Two-Way Clustering (Tang et al 2001) are two examples of this type of approach.

Divide and Conquer

In the divide and conquer approach the biclustering problem is broken down into several sub-problems, which are then solved recursively. The solutions obtained are combined to create the end result. Though these algorithms can be very fast, they are likely to miss good biclusters. Block Clustering (Hartigan 1975) is an example of a divide and conquer approach. Modifications to this algorithm have also been suggested by Duffy and Quiroz (1991).

Greedy Iterative Search

The greedy search method creates biclusters by adding rows/columns to them or removing rows/columns from them, choosing at each step the change that is locally best. Like the divide and conquer approach, this approach tends to be very fast but is prone to making wrong decisions, since a locally best choice need not lead to the best bicluster. FLOC (Yang et al 2003), order-preserving sub-matrix (Ben-Dor et al 2002) and the Cheng and Church algorithm (Cheng and Church 2000) are a few good examples of this method.

Exhaustive Bicluster Enumeration

This method is based on the idea that the best biclusters can only be found through an exhaustive search of all the possible biclusters of the data matrix. The complexity of these algorithms is very high, so they either take a long time to run or have to impose a restriction on the size of the data matrix. The SAMBA algorithm (Tanay et al 2002), the Maximum Dimension Sets algorithm (Wang et al 2002) and the OPC-tree algorithm (Liu and Wang 2003) are a few good examples of these types of algorithms.
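The residue score on which the Cheng and Church algorithm and FLOC rely in their greedy searches can be made concrete. The sketch below computes the mean squared residue of a bicluster in the form usually attributed to Cheng and Church: each entry's residue is the entry minus its row mean minus its column mean plus the overall mean of the submatrix, and the score is the average of the squared residues. The function name and the list-of-lists matrix layout are assumptions made for this illustration.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the bicluster given by row and column indices."""
    sub = [[matrix[r][c] for c in cols] for r in rows]
    n_rows, n_cols = len(sub), len(sub[0])
    # Means of each bicluster row, each bicluster column, and all entries.
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [
        sum(sub[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)
    ]
    overall_mean = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall_mean
            total += residue ** 2
    return total / (n_rows * n_cols)
```

A score near zero means the selected rows rise and fall together across the selected columns, which is what a coherent value bicluster requires. A greedy search in this style would repeatedly evaluate this score for candidate row or column additions and deletions and keep the change that lowers it most.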

Distribution Parameter Identification

In this approach, a given statistical model is assumed, and the distribution parameters used to generate the data are then identified. The plaid model algorithm (Lazzeroni and Owen 2000) is one example of this approach.

Biclustering Applications

Besides its applications in biology, biclustering has been used in other fields as well. These are listed in Table 2 below.

Application                          | Use                                                                                              | Examples
-------------------------------------+--------------------------------------------------------------------------------------------------+--------------------------------------------------
E-Commerce / Target Marketing        | Identify subgroups of customers who have similar preferences towards a subset of products.      | Yang et al 2002
Information Retrieval / Text Mining  | Identify subgroups of documents with similar properties relative to a subgroup of attributes.   | Dhillon 2001; Dhillon et al 2003; Berkin et al 2002
Politics                             | Identify a subgroup of people with the same political ideas and electoral behavior over a subset of attributes. | Hartigan 1972
Databases                            | Reduce the dimensionality of tables with thousands of rows and hundreds of columns.             | Aggarwal et al 1998

Table 2: Some Biclustering Applications


A Review on Cluster Based Approach in Data Mining A Review on Cluster Based Approach in Data Mining M. Vijaya Maheswari PhD Research Scholar, Department of Computer Science Karpagam University Coimbatore, Tamilnadu,India Dr T. Christopher Assistant professor,

More information

Text Documents clustering using K Means Algorithm

Text Documents clustering using K Means Algorithm Text Documents clustering using K Means Algorithm Mrs Sanjivani Tushar Deokar Assistant professor sanjivanideokar@gmail.com Abstract: With the advancement of technology and reduced storage costs, individuals

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

Computational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions

Computational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions Computational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions Thomas Giraud Simon Chabot October 12, 2013 Contents 1 Discriminant analysis 3 1.1 Main idea................................

More information

Exploratory data analysis for microarrays

Exploratory data analysis for microarrays Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA

More information

Clustering Results. Result List Example. Clustering Results. Information Retrieval

Clustering Results. Result List Example. Clustering Results. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Presenting Results Clustering Clustering Results! Result lists often contain documents related to different aspects of the query topic! Clustering is used to

More information

Hierarchical Clustering

Hierarchical Clustering What is clustering Partitioning of a data set into subsets. A cluster is a group of relatively homogeneous cases or observations Hierarchical Clustering Mikhail Dozmorov Fall 2016 2/61 What is clustering

More information

Exploratory Analysis: Clustering

Exploratory Analysis: Clustering Exploratory Analysis: Clustering (some material taken or adapted from slides by Hinrich Schutze) Heejun Kim June 26, 2018 Clustering objective Grouping documents or instances into subsets or clusters Documents

More information

Distance-based Methods: Drawbacks

Distance-based Methods: Drawbacks Distance-based Methods: Drawbacks Hard to find clusters with irregular shapes Hard to specify the number of clusters Heuristic: a cluster must be dense Jian Pei: CMPT 459/741 Clustering (3) 1 How to Find

More information

Order Preserving Clustering by Finding Frequent Orders in Gene Expression Data

Order Preserving Clustering by Finding Frequent Orders in Gene Expression Data Order Preserving Clustering by Finding Frequent Orders in Gene Expression Data Li Teng and Laiwan Chan Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong Abstract.

More information

Gene expression & Clustering (Chapter 10)

Gene expression & Clustering (Chapter 10) Gene expression & Clustering (Chapter 10) Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species Dynamic programming Approximate pattern matching

More information

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample CS 1675 Introduction to Machine Learning Lecture 18 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem:

More information

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06 Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,

More information

Clustering (COSC 416) Nazli Goharian. Document Clustering.

Clustering (COSC 416) Nazli Goharian. Document Clustering. Clustering (COSC 416) Nazli Goharian nazli@cs.georgetown.edu 1 Document Clustering. Cluster Hypothesis : By clustering, documents relevant to the same topics tend to be grouped together. C. J. van Rijsbergen,

More information

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data

More information

2. Department of Electronic Engineering and Computer Science, Case Western Reserve University

2. Department of Electronic Engineering and Computer Science, Case Western Reserve University Chapter MINING HIGH-DIMENSIONAL DATA Wei Wang 1 and Jiong Yang 2 1. Department of Computer Science, University of North Carolina at Chapel Hill 2. Department of Electronic Engineering and Computer Science,

More information

Cluster analysis. Agnieszka Nowak - Brzezinska

Cluster analysis. Agnieszka Nowak - Brzezinska Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that

More information

CLUSTER ANALYSIS. V. K. Bhatia I.A.S.R.I., Library Avenue, New Delhi

CLUSTER ANALYSIS. V. K. Bhatia I.A.S.R.I., Library Avenue, New Delhi CLUSTER ANALYSIS V. K. Bhatia I.A.S.R.I., Library Avenue, New Delhi-110 012 In multivariate situation, the primary interest of the experimenter is to examine and understand the relationship amongst the

More information

CS573 Data Privacy and Security. Li Xiong

CS573 Data Privacy and Security. Li Xiong CS573 Data Privacy and Security Anonymizationmethods Li Xiong Today Clustering based anonymization(cont) Permutation based anonymization Other privacy principles Microaggregation/Clustering Two steps:

More information

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning A review of clustering and other exploratory data analysis methods HST.951J: Medical Decision Support Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision

More information

Unsupervised learning, Clustering CS434

Unsupervised learning, Clustering CS434 Unsupervised learning, Clustering CS434 Unsupervised learning and pattern discovery So far, our data has been in this form: We will be looking at unlabeled data: x 11,x 21, x 31,, x 1 m x 12,x 22, x 32,,

More information

Using the Kolmogorov-Smirnov Test for Image Segmentation

Using the Kolmogorov-Smirnov Test for Image Segmentation Using the Kolmogorov-Smirnov Test for Image Segmentation Yong Jae Lee CS395T Computational Statistics Final Project Report May 6th, 2009 I. INTRODUCTION Image segmentation is a fundamental task in computer

More information

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample Lecture 9 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem: distribute data into k different groups

More information

Clustering CE-324: Modern Information Retrieval Sharif University of Technology

Clustering CE-324: Modern Information Retrieval Sharif University of Technology Clustering CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch. 16 What

More information

Clustering. Lecture 6, 1/24/03 ECS289A

Clustering. Lecture 6, 1/24/03 ECS289A Clustering Lecture 6, 1/24/03 What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

A Memetic Heuristic for the Co-clustering Problem

A Memetic Heuristic for the Co-clustering Problem A Memetic Heuristic for the Co-clustering Problem Mohammad Khoshneshin 1, Mahtab Ghazizadeh 2, W. Nick Street 1, and Jeffrey W. Ohlmann 1 1 The University of Iowa, Iowa City IA 52242, USA {mohammad-khoshneshin,nick-street,jeffrey-ohlmann}@uiowa.edu

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:

More information

Hierarchical Clustering

Hierarchical Clustering Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram A tree like diagram that records the sequences of merges or splits 0 0 0 00

More information

Pattern Clustering with Similarity Measures

Pattern Clustering with Similarity Measures Pattern Clustering with Similarity Measures Akula Ratna Babu 1, Miriyala Markandeyulu 2, Bussa V R R Nagarjuna 3 1 Pursuing M.Tech(CSE), Vignan s Lara Institute of Technology and Science, Vadlamudi, Guntur,

More information

Flat Clustering. Slides are mostly from Hinrich Schütze. March 27, 2017

Flat Clustering. Slides are mostly from Hinrich Schütze. March 27, 2017 Flat Clustering Slides are mostly from Hinrich Schütze March 7, 07 / 79 Overview Recap Clustering: Introduction 3 Clustering in IR 4 K-means 5 Evaluation 6 How many clusters? / 79 Outline Recap Clustering:

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

Machine Learning. Unsupervised Learning. Manfred Huber

Machine Learning. Unsupervised Learning. Manfred Huber Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training

More information

Customer Clustering using RFM analysis

Customer Clustering using RFM analysis Customer Clustering using RFM analysis VASILIS AGGELIS WINBANK PIRAEUS BANK Athens GREECE AggelisV@winbank.gr DIMITRIS CHRISTODOULAKIS Computer Engineering and Informatics Department University of Patras

More information

Based on Raymond J. Mooney s slides

Based on Raymond J. Mooney s slides Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit

More information

Finding Clusters 1 / 60

Finding Clusters 1 / 60 Finding Clusters Types of Clustering Approaches: Linkage Based, e.g. Hierarchical Clustering Clustering by Partitioning, e.g. k-means Density Based Clustering, e.g. DBScan Grid Based Clustering 1 / 60

More information

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science Data Mining Dr. Raed Ibraheem Hamed University of Human Development, College of Science and Technology Department of Computer Science 06 07 Department of CS - DM - UHD Road map Cluster Analysis: Basic

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 15: Microarray clustering http://compbio.pbworks.com/f/wood2.gif Some slides were adapted from Dr. Shaojie Zhang (University of Central Florida) Microarray

More information

Mining Deterministic Biclusters in Gene Expression Data

Mining Deterministic Biclusters in Gene Expression Data Mining Deterministic Biclusters in Gene Expression Data Zonghong Zhang 1 Alvin Teo 1 BengChinOoi 1,2 Kian-Lee Tan 1,2 1 Department of Computer Science National University of Singapore 2 Singapore-MIT-Alliance

More information

CSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16)

CSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16) CSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16) Michael Hahsler Southern Methodist University These slides are largely based on the slides by Hinrich Schütze Institute for

More information

Lesson 3. Prof. Enza Messina

Lesson 3. Prof. Enza Messina Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

More information

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI

More information

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest)

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Medha Vidyotma April 24, 2018 1 Contd. Random Forest For Example, if there are 50 scholars who take the measurement of the length of the

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

What is Unsupervised Learning?

What is Unsupervised Learning? Clustering What is Unsupervised Learning? Unlike in supervised learning, in unsupervised learning, there are no labels We simply a search for patterns in the data Examples Clustering Density Estimation

More information

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster)

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

Clustering and Dimensionality Reduction

Clustering and Dimensionality Reduction Clustering and Dimensionality Reduction Some material on these is slides borrowed from Andrew Moore's excellent machine learning tutorials located at: Data Mining Automatically extracting meaning from

More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms

Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms Padhraic Smyth Department of Computer Science Bren School of Information and Computer Sciences University of California,

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Lecture 15 Clustering. Oct

Lecture 15 Clustering. Oct Lecture 15 Clustering Oct 31 2008 Unsupervised learning and pattern discovery So far, our data has been in this form: x 11,x 21, x 31,, x 1 m y1 x 12 22 2 2 2,x, x 3,, x m y We will be looking at unlabeled

More information

COMS 4771 Clustering. Nakul Verma

COMS 4771 Clustering. Nakul Verma COMS 4771 Clustering Nakul Verma Supervised Learning Data: Supervised learning Assumption: there is a (relatively simple) function such that for most i Learning task: given n examples from the data, find

More information