Clustering Techniques


1 Clustering Techniques Bioinformatics: Issues and Algorithms CSE Fall 2007 Lecture 16 Lopresti Fall 2007 Lecture

2 Administrative notes Your final project / paper proposal is due on Friday, November 9 at 5:00 pm. The proposal just needs to be a couple of paragraphs telling me the problem area you plan to work on and some of the references you'll probably use. If there's a possible connection between the work you'd like to do and the topics you've heard Professor Marzillier talk about, I'll discuss your proposal with her to get her feedback and suggestions (e.g., other papers you might read, datasets you might use for testing code you develop, etc.). I'll send you feedback on your proposal by the middle of the following week, and then you're off and running!

3 Outline: DNA Microarrays; Hierarchical Clustering; K-Means Clustering; Conservative & Greedy K-Means Clustering; Corrupted Cliques Problem; CAST Clustering Algorithm.

4 Applications of clustering Motivation for clustering (from a general perspective): Viewing and analyzing vast amounts of biological data in its unstructured entirety can be perplexing. It is easier to interpret data if it is organized into clusters that combine similar (i.e., related) data points. From a biological perspective, applications include: Analyzing data from DNA microarray experiments (expression analysis, i.e., determining which genes are switched on or off under certain conditions of interest). Building and understanding phylogenetic (evolutionary) trees based on genomic or other data.

5 Inferring gene functionality What's the problem? Biologists want to know the functions of newly sequenced genes. Simply comparing a new gene sequence to known DNA sequences often does not reveal the function of the new gene. For 40% of sequenced genes, functionality cannot be ascertained by comparing to sequences of known genes. Microarrays allow biologists to infer gene function even when sequence similarity alone is insufficient to infer it.

6 Life: a recipe for making proteins

7 Recall the Central Dogma

8 Hybridization is central

9 Microarrays: the concept Measure the level of transcription for a very large number of genes in a single experiment.

10 Microarrays and expression analysis Microarrays measure the activity (expression level) of genes under varying conditions and/or points in time. Expression level is estimated by measuring the amount of mRNA for that particular gene: A gene is active if it is being transcribed. More mRNA usually indicates more gene activity.

11 Microarrays: how?

12 Stanford microarrays: production

13 Stanford microarrays: production

14 Stanford microarrays: production Coating: 1. Rinse slides in NaOH and EtOH (2 h, shaking). 2. Wash with water. 3. Coat slides with poly-L-lysine (1 h, shaking). 4. Wash and dry. Attach probes: 1. Produce probes (oligos, cDNA library, PCR products). 2. Print using a robot.

15 Stanford microarrays: production Spotting: mechanical deposition of probes.

16 Stanford microarrays: production

17 Stanford microarrays: production

18 Stanford microarrays: production Microarrayer

19 Microarray experiments Steps: Produce cDNA from mRNA (DNA is more stable). Attach phosphor to cDNA to see when a gene is expressed. Different color phosphors are available to compare many samples at once. Hybridize cDNA over the microarray. Scan the microarray with a phosphor-illuminating laser: illumination reveals transcribed genes. Scan the microarray multiple times for different color phosphors.

20 Microarray experiments [Figure annotations:] Then, instead of staining, laser illumination can be used. Phosphors can be added here instead.

21 Using microarrays Track a sample over a period of time to see how gene expression changes. Track two different samples under the same conditions to see differences in gene expression. [Figure: each box represents one gene's expression over time.]

22 Using microarrays Interpreting colors: Green: expressed only from control. Red: expressed only from experimental cell. Yellow: equally expressed in both samples. Black: NOT expressed in either control or experimental cells.

23 Microarray data What does a biologist do with microarray data? Microarray data is usually transformed into an intensity matrix. The intensity matrix allows biologists to find correlations between different genes (even if they are dissimilar) and to understand how gene functions might be related. Clustering comes into play! [Table: intensity matrix with one row per gene (Gene 1, ...) and one column per time point (Time X, Time Y, Time Z); each entry is the intensity (expression level) of the gene at the measured time; annotated "Similar behavior?". Values not recoverable.]

24 Clustering microarray data Plot each gene as a data point in N-dimensional space. Build a matrix of distances between every pair of gene points. Genes separated by a small distance share the same expression patterns and might be functionally related or similar. Clustering reveals groups of functionally related genes. [Figure: different genes that express similarly.] From "Cluster analysis and display of genome-wide expression patterns" by Eisen, Spellman, Brown, and Botstein, Proc. Natl. Acad. Sci. USA, Vol. 95, pp , December
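The distance-matrix step on this slide can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and an intensity matrix stored as a list of rows (one row per gene); the function name `pairwise_distances` and the sample values are hypothetical, not taken from the lecture.

```python
import math

def pairwise_distances(intensity):
    """Return the n x n matrix of Euclidean distances between gene rows.

    intensity: list of rows, one per gene; each row holds that gene's
    expression level at each measured time or condition.
    """
    return [[math.dist(g, h) for h in intensity] for g in intensity]

# Made-up expression values for three genes at three time points.
example = [[0, 0, 0], [3, 4, 0], [0, 0, 5]]
d = pairwise_distances(example)
print(d[0][1], d[0][2])  # distance of gene 0 to genes 1 and 2
```

Other distance measures (for example, one based on correlation) could be substituted for `math.dist` without changing the rest of the pipeline.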

25 Clustering microarray data [Figure panels: intensity matrix; pairwise distance matrix; expression patterns as points in 3-D space; three different clusters.]

26 Homogeneity and Separation Principles All approaches to clustering are guided by two basic principles: Homogeneity: elements within a given cluster are close. Separation: elements in different clusters are further apart. Note that clustering is not an easy task! (Don't be misled by simple illustrative examples.) Given these points, a clustering algorithm might make two distinct clusters as follows...

27 Bad clustering This clustering violates both the Homogeneity and Separation Principles: points in separate clusters are close together, and points in the same cluster are far apart.

28 Good clustering This clustering satisfies both the Homogeneity and Separation Principles.

29 Clustering techniques Agglomerative: start with every element in its own cluster, and iteratively join clusters together. Divisive: start with one cluster and iteratively divide it into smaller clusters. Hierarchical: organize elements into a tree whose leaves represent genes and whose path lengths between leaves represent distances between genes. Similar genes lie within the same subtrees.

30 Hierarchical clustering

31 Hierarchical clustering Hierarchical clustering is often used to reveal evolutionary history.

32 Hierarchical clustering algorithm
HierarchicalClustering(d, n)
    form n clusters, each with one element
    construct a graph T by assigning one vertex to each cluster
    while there is more than one cluster
        find the two closest clusters C1 and C2
        merge C1 and C2 into a new cluster C with |C1| + |C2| elements
        compute the distance from C to all other clusters
        add a new vertex C to T and connect it to vertices C1 and C2
        remove the rows and columns of d corresponding to C1 and C2
        add a row and column to d corresponding to the new cluster C
    return T
The algorithm takes an n x n matrix d of pairwise distances between points as input.

33 Hierarchical clustering algorithm The pseudocode for HierarchicalClustering(d, n) is the same as on slide 32. Different ways to define distances between clusters may lead to different clusterings!
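The HierarchicalClustering pseudocode can be sketched in Python as below. This is an illustrative version under stated assumptions, not the lecture's implementation: the tree T is represented as a list of merges rather than an explicit graph, the distance matrix is a list of lists, and the `linkage` parameter (a hypothetical name) chooses between smallest-distance and average-distance updates.

```python
from itertools import combinations

def hierarchical_clustering(d, linkage="min"):
    """Agglomerative clustering on an n x n distance matrix d.

    Returns the merges in order as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a frozenset of original element indices. The merge
    list encodes the tree T of the pseudocode.
    """
    n = len(d)
    clusters = [frozenset([i]) for i in range(n)]
    # Current distance between every unordered pair of clusters.
    dist = {frozenset([clusters[i], clusters[j]]): d[i][j]
            for i, j in combinations(range(n), 2)}
    merges = []
    while len(clusters) > 1:
        # Find the two closest clusters.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: dist[frozenset(pair)])
        d_ab = dist.pop(frozenset([a, b]))
        merged = a | b
        others = [c for c in clusters if c not in (a, b)]
        # Replace the rows/columns for a and b by one for the merged cluster.
        for c in others:
            d_ac = dist.pop(frozenset([a, c]))
            d_bc = dist.pop(frozenset([b, c]))
            if linkage == "min":
                new = min(d_ac, d_bc)
            else:
                # Size-weighted mean equals the average over all pairs.
                new = (len(a) * d_ac + len(b) * d_bc) / (len(a) + len(b))
            dist[frozenset([merged, c])] = new
        clusters = others + [merged]
        merges.append((a, b, d_ab))
    return merges

points = [0, 1, 5, 6]  # made-up one-dimensional data
d = [[abs(p - q) for q in points] for p in points]
for a, b, height in hierarchical_clustering(d, "min"):
    print(sorted(a), sorted(b), height)
```

Running the example with `"avg"` instead of `"min"` gives the same tree shape here but a different height for the final merge, which illustrates how the choice of cluster distance affects the result.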

34 Computing distances
$d_{\min}(C, C^*) = \min_{x \in C,\, y \in C^*} d(x, y)$: the distance between two clusters is the smallest distance between any pair of elements.
$d_{\mathrm{avg}}(C, C^*) = \frac{1}{|C|\,|C^*|} \sum_{x \in C,\, y \in C^*} d(x, y)$: the distance between two clusters is the average distance between all pairs of elements.
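The two cluster distances translate directly into code. The sketch below assumes clusters are given as lists of elements and that `d` is any function returning the distance between two elements; the names `d_min` and `d_avg` simply mirror the slide's notation.

```python
def d_min(C, C_star, d):
    """Smallest distance between any element of C and any element of C_star."""
    return min(d(x, y) for x in C for y in C_star)

def d_avg(C, C_star, d):
    """Average distance over all pairs (x, y) with x in C and y in C_star."""
    total = sum(d(x, y) for x in C for y in C_star)
    return total / (len(C) * len(C_star))

# Made-up one-dimensional example with absolute difference as the distance.
def abs_diff(x, y):
    return abs(x - y)

print(d_min([0, 1], [5, 6], abs_diff))
print(d_avg([0, 1], [5, 6], abs_diff))
```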

35 Squared-error distortion Given a data point v and a set of points X, define the distance from v to X, d(v, X), as the (Euclidean) distance from v to the closest point in X. Given a set of n data points $V = \{v_1, \ldots, v_n\}$ and a set of k points X, define the squared-error distortion as:
$d(V, X) = \frac{1}{n} \sum_{i=1}^{n} d(v_i, X)^2$
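As a small worked example, the distortion can be computed as below. This is a sketch assuming points are tuples of coordinates; the function names are illustrative and the data are made up.

```python
import math

def dist_to_set(v, X):
    """Euclidean distance from point v to the closest point in X."""
    return min(math.dist(v, x) for x in X)

def squared_error_distortion(V, X):
    """Mean of the squared distances from each data point to its closest center."""
    return sum(dist_to_set(v, X) ** 2 for v in V) / len(V)

V = [(0, 0), (2, 0), (10, 0)]
X = [(1, 0), (10, 0)]
# Distances to the nearest center are 1, 1, and 0, so the distortion is 2/3.
print(squared_error_distortion(V, X))
```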

36 Clustering microarray data: k-means clustering K-means clustering is one way to organize this data: Given a set of n data points and an integer k, we want to find a set of k points that minimizes the mean squared distance from each data point to its nearest cluster center. Sketch of algorithm: Choose k initial center points randomly and cluster the data. Calculate new centers for each cluster using the points in that cluster. Re-cluster all data using the new center points. Repeat the last two steps until no data points are moved from one cluster to another or some other convergence criterion is met.

37 Formal definition of K-Means Clustering The K-Means Clustering Problem. Input: A set V consisting of n points, along with a parameter k. Output: A set X consisting of k points (cluster centers) that minimizes the squared-error distortion d(V, X) over all possible choices of X. A (trivially) simple variation, 1-means clustering: The 1-Means Clustering Problem. Input: A set V consisting of n points. Output: A single point x (cluster center) that minimizes the squared-error distortion d(V, x) over all possible choices of x. 1-means clustering is easy. General k-means clustering is NP-hard, however.
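One way to see why 1-means clustering is easy: for Euclidean distance, the sum of squared distances to a single center is minimized by the centroid (center of gravity) of the points. The sketch below computes the centroid and compares its distortion with that of a few other candidate centers; the function names and data are illustrative assumptions.

```python
import math

def centroid(V):
    """Coordinate-wise mean of the points in V (the 1-means solution)."""
    n = len(V)
    return tuple(sum(coords) / n for coords in zip(*V))

def distortion_single_center(V, x):
    """Squared-error distortion of V with respect to one center x."""
    return sum(math.dist(v, x) ** 2 for v in V) / len(V)

V = [(0, 0), (2, 0), (4, 6)]  # made-up data
c = centroid(V)
print(c, distortion_single_center(V, c))

# Any other candidate center gives a distortion at least as large.
for other in [(0, 0), (2, 0), (3, 3), (2.1, 2.0)]:
    assert distortion_single_center(V, other) >= distortion_single_center(V, c)
```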

38 Clustering microarray data: k-means clustering Pick k = 2 centers at random. Cluster data around these center points. Re-calculate centers based on current clusters. From Data Analysis Tools for DNA Microarrays by Sorin Draghici.

39 Clustering microarray data: k-means clustering Re-cluster data around new center points. Repeat last two steps until no more data points are moved into a different cluster. From Data Analysis Tools for DNA Microarrays by Sorin Draghici.

40 K-means clustering: Lloyd's algorithm
KMeansClustering(k)
    arbitrarily assign k cluster centers
    while cluster centers keep changing
        assign each data point to the cluster C_i corresponding to the closest cluster representative (center) (1 ≤ i ≤ k)
        after assignment of all data points, compute new cluster representatives according to the cluster centers of gravity, i.e., the new representative of cluster C is $\sum_{v \in C} v / |C|$
    output final cluster centers
Note that this may only lead to a locally optimal clustering.
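Lloyd's algorithm can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: it assumes points are tuples, picks the initial centers by sampling k data points (one possible reading of "arbitrarily"), leaves the center of an empty cluster unchanged, and caps the number of iterations. The parameters `seed` and `max_iter` are additions for the example.

```python
import math
import random

def lloyd_kmeans(V, k, seed=0, max_iter=100):
    """Return (centers, clusters) after running Lloyd's algorithm on V."""
    rng = random.Random(seed)
    centers = rng.sample(V, k)  # assumption: initial centers are data points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        clusters = [[] for _ in range(k)]
        for v in V:
            i = min(range(k), key=lambda j: math.dist(v, centers[j]))
            clusters[i].append(v)
        # Update step: each center moves to its cluster's center of gravity.
        new_centers = [
            tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centers[i]
            for i, pts in enumerate(clusters)
        ]
        if new_centers == centers:  # centers stopped changing
            break
        centers = new_centers
    return centers, clusters

# Made-up data: two well-separated groups of three points.
V = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = lloyd_kmeans(V, 2)
print(centers)
print(clusters)
```

Because the result depends on the initial centers, a common practice is to run the algorithm from several random starts and keep the clustering with the lowest distortion.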

41-44 K-means clustering: another example [Figure sequence over four slides: scatter plots with axis labels "expression in condition 1" and "expression in condition" (second label truncated), each showing cluster centers x1, x2, x3.]

45 Conservative k-means clustering Observations: This algorithm, known as Lloyd's algorithm, is fast, but in each iteration it moves many data points, which does not necessarily lead to better convergence. A more conservative method would be to move one point at a time, and only if the move improves the overall clustering cost. The smaller the clustering cost of a partition of data points, the better that clustering is. Different methods (e.g., squared-error distortion) can be used to measure this clustering cost.

46 Greedy k-means clustering
ProgressiveGreedyKMeans(k)
    select an arbitrary partition P into k clusters
    while forever
        bestChange ← 0
        for every cluster C
            for every element i not in C
                if moving i to cluster C reduces the clustering cost
                    if cost(P) - cost(P_{i→C}) > bestChange
                        bestChange ← cost(P) - cost(P_{i→C})
                        i* ← i
                        C* ← C
        if bestChange > 0
            change partition P by moving i* to C*
        else
            return P
Here P_{i→C} denotes the partition obtained from P by moving element i to cluster C.
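A Python sketch of the greedy procedure follows. It makes several assumptions that the pseudocode leaves open: the clustering cost is the squared-error distortion with each cluster's centroid as its center, a cluster is never emptied, and a small tolerance `EPS` replaces the comparison with 0 to guard against floating-point noise.

```python
import math

EPS = 1e-12  # assumption: tolerance standing in for "bestChange > 0"

def cost(partition):
    """Squared-error distortion, using each cluster's centroid as its center."""
    n = sum(len(c) for c in partition)
    total = 0.0
    for cluster in partition:
        center = tuple(sum(x) / len(cluster) for x in zip(*cluster))
        total += sum(math.dist(v, center) ** 2 for v in cluster)
    return total / n

def progressive_greedy_kmeans(partition):
    """Repeatedly apply the single move that reduces the cost the most."""
    P = [list(c) for c in partition]
    while True:
        best_change, best_move = EPS, None
        base = cost(P)
        for target in range(len(P)):
            for source in range(len(P)):
                # Assumption: never move the last element out of a cluster.
                if source == target or len(P[source]) <= 1:
                    continue
                for idx in range(len(P[source])):
                    v = P[source].pop(idx)       # tentatively move v
                    P[target].append(v)
                    change = base - cost(P)
                    P[target].pop()              # undo the tentative move
                    P[source].insert(idx, v)
                    if change > best_change:
                        best_change, best_move = change, (source, idx, target)
        if best_move is None:
            return P
        source, idx, target = best_move
        P[target].append(P[source].pop(idx))

# Made-up data with a deliberately poor starting partition.
start = [[(0, 0), (10, 10), (0, 1)], [(1, 0), (10, 11), (11, 10)]]
print(progressive_greedy_kmeans(start))
```

Each pass evaluates every possible single-point move, so this version trades speed for the guarantee that the cost decreases at every step.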

47 Clique graphs A more structured view of clustering: A clique is a graph with every vertex connected to every other vertex. A clique graph is a graph where each connected component is a clique. [Figure labels: clique of size 3; clique of size 5; clique of size 6; clique graph with 3 connected components.]
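The definition of a clique graph can be checked mechanically: find the connected components, then verify that every vertex is adjacent to all other vertices in its component. The sketch below assumes an undirected graph without self-loops, given as a vertex list and an edge list; the function names are illustrative.

```python
def connected_components(vertices, edges):
    """Return (components, adjacency) for an undirected graph."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # depth-first search from start
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        components.append(comp)
    return components, adj

def is_clique_graph(vertices, edges):
    """True if every connected component of the graph is a clique."""
    components, adj = connected_components(vertices, edges)
    return all(adj[v] == comp - {v} for comp in components for v in comp)

# A triangle plus a separate edge: two components, both cliques.
print(is_clique_graph([1, 2, 3, 4, 5], [(1, 2), (2, 3), (1, 3), (4, 5)]))
# A path on three vertices is connected but is missing edge (1, 3).
print(is_clique_graph([1, 2, 3], [(1, 2), (2, 3)]))
```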

48 Transformation into a clique graph Any graph can be transformed into a clique graph by adding or removing edges. What can we do here? [Figure: example graph; one answer is to delete 2 edges.]

49 Transformation into a clique graph As with the edit distance we studied earlier, there are many possible transformations. [Figure: one transformation adds 2 edges; another deletes 4 edges.]

50 Formal definition of Corrupted Cliques Problem The Corrupted Cliques Problem. Input: A graph, G. Output: The smallest number of additions and removals of edges that will transform G into a clique graph. Our ultimate goal is to have: Vertices represent data points. Edges represent relationships between data points. Cliques represent meaningful groupings (i.e., clusters).

51 Distance graphs Transform a distance matrix into a distance graph: Genes are represented as vertices in the graph. Choose a distance threshold θ. If the distance between two vertices is below θ, draw an edge between them. The resulting graph may contain cliques. These cliques represent clusters of similar data points!
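Building the distance graph from a distance matrix takes only a few lines. In this sketch the matrix is a list of lists, vertices are gene indices, and the graph is returned as a set of index pairs; the example matrix is made up and is not the one from the lecture.

```python
def distance_graph(d, theta):
    """Edges (i, j), i < j, for every pair of genes with distance below theta."""
    n = len(d)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if d[i][j] < theta}

# Made-up symmetric distance matrix for four genes.
d = [
    [0, 2, 9, 8],
    [2, 0, 6, 9],
    [9, 6, 0, 3],
    [8, 9, 3, 0],
]
print(sorted(distance_graph(d, 7)))  # [(0, 1), (1, 2), (2, 3)]
```

With θ = 7 the example yields a path rather than a clique graph, which is the kind of "corrupted" structure the following slides address.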

52 Transforming distance graph into clique graph [Figure panels: distance matrix d; distance graph for θ = 7; clique graph.] The distance graph for θ = 7 is not quite a clique graph. However, it can be transformed into one by removing edges (g1, g10) and (g1, g9).

53 Heuristics for Corrupted Cliques Problem The Corrupted Cliques Problem is NP-hard, but some heuristics exist to solve it approximately. For example, CAST (Cluster Affinity Search Technique) is a practical and fast algorithm for the Corrupted Cliques Problem. CAST is based on the notion of genes being close to a given cluster C, or distant from cluster C. The distance between gene i and cluster C is defined as: d(i, C) = average distance between i and each gene in C. Gene i is close to cluster C if d(i, C) < θ, and distant otherwise.

54 CAST algorithm
CAST(S, G, θ)
    P ← ∅
    while S ≠ ∅
        v ← vertex of maximal degree in the distance graph G
        C ← {v}
        while a close gene i not in C or a distant gene i in C exists
            find the nearest close gene i not in C and add it to C
            remove the farthest distant gene i in C
        add cluster C to partition P
        S ← S \ C
        remove vertices of cluster C from the distance graph G
    return P
S = set of elements, G = distance graph, θ = distance threshold.
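The CAST pseudocode can be sketched in Python as below. Several details are assumptions made for the example: the distance graph is derived on the fly from the distance matrix `d` and θ rather than passed in, d(i, C) for a member of C averages over all of C including the gene itself, ties in degree are broken by lowest index, and a `max_steps` cap is added because the inner loop is not guaranteed to settle on every input.

```python
def cast(d, theta, max_steps=10_000):
    """Partition gene indices 0..n-1 into clusters with the CAST heuristic."""
    S = set(range(len(d)))
    P = []

    def avg_dist(i, C):
        # d(i, C): average distance between gene i and each gene in C.
        return sum(d[i][j] for j in C) / len(C)

    while S:
        # Degree of each remaining gene in the distance graph restricted to S.
        degree = {i: sum(1 for j in S if j != i and d[i][j] < theta) for i in S}
        v = max(sorted(S), key=degree.get)
        C = {v}
        for _ in range(max_steps):
            close_outside = [i for i in S - C if avg_dist(i, C) < theta]
            distant_inside = [i for i in C if avg_dist(i, C) >= theta]
            if not close_outside and not distant_inside:
                break
            if close_outside:
                # Add the nearest close gene.
                C.add(min(close_outside, key=lambda i: avg_dist(i, C)))
            distant_inside = [i for i in C if avg_dist(i, C) >= theta]
            if distant_inside and len(C) > 1:
                # Remove the farthest distant gene.
                C.remove(max(distant_inside, key=lambda i: avg_dist(i, C)))
        P.append(C)
        S -= C
    return P

# Made-up one-dimensional data: two obvious groups.
points = [0, 1, 2, 10, 11]
d = [[abs(p - q) for q in points] for p in points]
print(cast(d, theta=3))  # [{0, 1, 2}, {3, 4}]
```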

55 Wrap-up Readings for next time: BBP Chapters and 20 (tools, datasets, and applications). Remember: Come to class having done the readings. Check Blackboard regularly for updates.


More information

CSE 494 Project C. Garrett Wolf

CSE 494 Project C. Garrett Wolf CSE 494 Project C Garrett Wolf Introduction The main purpose of this project task was for us to implement the simple k-means and buckshot clustering algorithms. Once implemented, we were asked to vary

More information

L9: Hierarchical Clustering

L9: Hierarchical Clustering L9: Hierarchical Clustering This marks the beginning of the clustering section. The basic idea is to take a set X of items and somehow partition X into subsets, so each subset has similar items. Obviously,

More information

Clustering. k-mean clustering. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein

Clustering. k-mean clustering. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein Clustering k-mean clustering Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein A quick review The clustering problem: homogeneity vs. separation Different representations

More information

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM. Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)

More information

University of Florida CISE department Gator Engineering. Clustering Part 2

University of Florida CISE department Gator Engineering. Clustering Part 2 Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical

More information

Network Clustering. Balabhaskar Balasundaram, Sergiy Butenko

Network Clustering. Balabhaskar Balasundaram, Sergiy Butenko Network Clustering Balabhaskar Balasundaram, Sergiy Butenko Department of Industrial & Systems Engineering Texas A&M University College Station, Texas 77843, USA. Introduction Clustering can be loosely

More information

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748 CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; Phone: x3748 3/3/08 CAP5510 1 Gene g Probe 1 Probe 2 Probe N 3/3/08 CAP5510

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

Unsupervised Learning: Clustering

Unsupervised Learning: Clustering Unsupervised Learning: Clustering Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein & Luke Zettlemoyer Machine Learning Supervised Learning Unsupervised Learning

More information

Clustering. Unsupervised Learning

Clustering. Unsupervised Learning Clustering. Unsupervised Learning Maria-Florina Balcan 11/05/2018 Clustering, Informal Goals Goal: Automatically partition unlabeled data into groups of similar datapoints. Question: When and why would

More information

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample Lecture 9 Clustering Milos Hauskrecht 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem: distribute data into k different groups

More information

Clustering. Unsupervised Learning

Clustering. Unsupervised Learning Clustering. Unsupervised Learning Maria-Florina Balcan 04/06/2015 Reading: Chapter 14.3: Hastie, Tibshirani, Friedman. Additional resources: Center Based Clustering: A Foundational Perspective. Awasthi,

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask Machine Learning and Data Mining Clustering (1): Basics Kalev Kask Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand patterns of

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework

More information

Clustering: K-means and Kernel K-means

Clustering: K-means and Kernel K-means Clustering: K-means and Kernel K-means Piyush Rai Machine Learning (CS771A) Aug 31, 2016 Machine Learning (CS771A) Clustering: K-means and Kernel K-means 1 Clustering Usually an unsupervised learning problem

More information


CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

e-ccc-biclustering: Related work on biclustering algorithms for time series gene expression data

e-ccc-biclustering: Related work on biclustering algorithms for time series gene expression data : Related work on biclustering algorithms for time series gene expression data Sara C. Madeira 1,2,3, Arlindo L. Oliveira 1,2 1 Knowledge Discovery and Bioinformatics (KDBIO) group, INESC-ID, Lisbon, Portugal

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Distances, Clustering! Rafael Irizarry!

Distances, Clustering! Rafael Irizarry! Distances, Clustering! Rafael Irizarry! Heatmaps! Distance! Clustering organizes things that are close into groups! What does it mean for two genes to be close?! What does it mean for two samples to

More information


MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS Mathematical and Computational Applications, Vol. 5, No. 2, pp. 240-247, 200. Association for Scientific Research MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS Volkan Uslan and Đhsan Ömür Bucak

More information

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering Sequence clustering Introduction Data clustering is one of the key tools used in various incarnations of data-mining - trying to make sense of large datasets. It is, thus, natural to ask whether clustering

More information

Homework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19 th, 2014, 2:30pm, in class hard-copy please)

Homework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19 th, 2014, 2:30pm, in class hard-copy please) Virginia Tech. Computer Science CS 5614 (Big) Data Management Systems Fall 2014, Prakash Homework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19 th, 2014, 2:30pm, in

More information

Lecture 10: Semantic Segmentation and Clustering

Lecture 10: Semantic Segmentation and Clustering Lecture 10: Semantic Segmentation and Clustering Vineet Kosaraju, Davy Ragland, Adrien Truong, Effie Nehoran, Maneekwan Toyungyernsub Department of Computer Science Stanford University Stanford, CA 94305

More information

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample CS 1675 Introduction to Machine Learning Lecture 18 Clustering Milos Hauskrecht 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem:

More information

CS 534: Computer Vision Segmentation and Perceptual Grouping

CS 534: Computer Vision Segmentation and Perceptual Grouping CS 534: Computer Vision Segmentation and Perceptual Grouping Ahmed Elgammal Dept of Computer Science CS 534 Segmentation - 1 Outlines Mid-level vision What is segmentation Perceptual Grouping Segmentation

More information

Missing Data Estimation in Microarrays Using Multi-Organism Approach

Missing Data Estimation in Microarrays Using Multi-Organism Approach Missing Data Estimation in Microarrays Using Multi-Organism Approach Marcel Nassar and Hady Zeineddine Progress Report: Data Mining Course Project, Spring 2008 Prof. Inderjit S. Dhillon April 02, 2008

More information


SUPPLEMENTARY INFORMATION Supplementary Discussion 1: Rationale for Clustering Algorithm Selection Introduction: The process of machine learning is to design and employ computer programs that are capable to deduce patterns, regularities

More information

Finding Clusters 1 / 60

Finding Clusters 1 / 60 Finding Clusters Types of Clustering Approaches: Linkage Based, e.g. Hierarchical Clustering Clustering by Partitioning, e.g. k-means Density Based Clustering, e.g. DBScan Grid Based Clustering 1 / 60

More information

An Unsupervised Technique for Statistical Data Analysis Using Data Mining

An Unsupervised Technique for Statistical Data Analysis Using Data Mining International Journal of Information Sciences and Application. ISSN 0974-2255 Volume 5, Number 1 (2013), pp. 11-20 International Research Publication House An Unsupervised Technique

More information

Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/18/14

Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/18/14 600.363 Introduction to Algorithms / 600.463 Algorithms I Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/18/14 23.1 Introduction We spent last week proving that for certain problems,

More information

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity

More information

Clustering Lecture 3: Hierarchical Methods

Clustering Lecture 3: Hierarchical Methods Clustering Lecture 3: Hierarchical Methods Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced

More information

Pattern Recognition Lecture Sequential Clustering

Pattern Recognition Lecture Sequential Clustering Pattern Recognition Lecture Prof. Dr. Marcin Grzegorzek Research Group for Pattern Recognition Institute for Vision and Graphics University of Siegen, Germany Pattern Recognition Chain patterns sensor

More information Text Clustering David Kauchak cs160 Fall 2009 adapted from: Administrative 2 nd status reports Paper review

More information

Triclustering in Gene Expression Data Analysis: A Selected Survey

Triclustering in Gene Expression Data Analysis: A Selected Survey Triclustering in Gene Expression Data Analysis: A Selected Survey P. Mahanta, H. A. Ahmed Dept of Comp Sc and Engg Tezpur University Napaam -784028, India Email:,

More information

Visual Representations for Machine Learning

Visual Representations for Machine Learning Visual Representations for Machine Learning Spectral Clustering and Channel Representations Lecture 1 Spectral Clustering: introduction and confusion Michael Felsberg Klas Nordberg The Spectral Clustering

More information

Introduction to Computer Science

Introduction to Computer Science DM534 Introduction to Computer Science Clustering and Feature Spaces Richard Roettger: About Me Computer Science (Technical University of Munich and thesis at the ICSI at the University of California at

More information

Based on Raymond J. Mooney s slides

Based on Raymond J. Mooney s slides Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit

More information

CSE 527. CAST: a clustering method with a graph-theoretic basis. Larry Ruzzo

CSE 527. CAST: a clustering method with a graph-theoretic basis. Larry Ruzzo CSE 527 CAST: a clustering method with a graph-theoretic basis Larry Ruzzo Talks this week Today - Dr. Terry Hwa, Professor of Physics, UC San Diego "Complex Transcriptional Logics From Simple Molecular

More information

Hybrid Fuzzy C-Means Clustering Technique for Gene Expression Data

Hybrid Fuzzy C-Means Clustering Technique for Gene Expression Data Hybrid Fuzzy C-Means Clustering Technique for Gene Expression Data 1 P. Valarmathie, 2 Dr MV Srinath, 3 Dr T. Ravichandran, 4 K. Dinakaran 1 Dept. of Computer Science and Engineering, Dr. MGR University,

More information

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan Working with Unlabeled Data Clustering Analysis Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan Unsupervised learning Finding centers of similarity using

More information

Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic

Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic SEMANTIC COMPUTING Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic TU Dresden, 23 November 2018 Overview Unsupervised Machine Learning overview Association

More information

INF4820, Algorithms for AI and NLP: Hierarchical Clustering

INF4820, Algorithms for AI and NLP: Hierarchical Clustering INF4820, Algorithms for AI and NLP: Hierarchical Clustering Erik Velldal University of Oslo Sept. 25, 2012 Agenda Topics we covered last week Evaluating classifiers Accuracy, precision, recall and F-score

More information

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18 601.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18 22.1 Introduction We spent the last two lectures proving that for certain problems, we can

More information