Micro-array Image Analysis using Clustering Methods


Mrs Rekha A Kulkarni
PICT, Pune
kulkarni_rekha@hotmail.com

Abstract

Micro-array imaging is an emerging technology, and several experimental procedures have been developed that produce images with different characteristics. Micro-array images are processed using a multi-step procedure involving image segmentation, information extraction and normalization. The images are structured, with high-intensity spots located on a grid; each spot corresponds to a gene. Spots are roughly circular, though some deviate significantly from this shape because of experimental variation. Clustering methods group a set of genes or arrays according to their similarity to each other. Clustering techniques such as K-means and Partitioning Around Medoids (PAM) have recently been used for micro-array image segmentation and classification. K-means is a partitioning algorithm with a prefixed number k of clusters; it tries to minimize the sum of within-cluster variances. PAM is a partitioning algorithm, a generalization of K-means.

Key Words: Micro-Array, Gridding, Clustering, K-Means, PAM

1. Introduction

Micro-array technologies, which can measure the expression of thousands of genes in a single experiment, have developed over the past decade and now produce huge amounts of data. New techniques for looking at genetic variation in large human populations, and for identifying interactions between sets of proteins in cells, are pouring data onto file servers around the world. Bioinformatics is charged with managing and making sense of all of this data, keeping pace with both data production and technology development. There is plenty of work to go around.

A micro-array is a glass microscope slide with a large number of ordered target sequences on it. These target sequences normally consist of cDNA or RNA sequences. They are single stranded, as opposed to DNA, which is double stranded.
There are thousands of target sequences on the micro-array. Micro-array technology, as a high-throughput approach to differential gene expression studies, efficiently generates massive amounts of gene regulation data, helping scientists to quickly identify which gene candidates to follow up with functional characterization. Traditional techniques for the study of gene expression allow investigators to study only one or a few genes at a time.

Genomic projects aimed at cloning, mapping and sequencing the genomes of various organisms have generated large amounts of sequence data. However, the function, expression and regulation of more than 80% of the sequenced genes were unknown. The next phase of the human genome project will place strong emphasis on assigning function to these genes. Of the methods by which function can be assigned to genes, DNA micro-array analysis is widely used to extract patterns of gene expression. Although both cDNA micro-arrays and oligonucleotide arrays are capable of analyzing patterns of gene expression, fundamental differences exist between the methods.

2. Existing segmentation techniques

Fixed circle segmentation
- Fits a circle with a constant diameter to all spots in the image.
- Easy to implement.
- The spots need to be of the same shape and size.

Adaptive circle segmentation
- The circle diameter is estimated separately for each spot.

Adaptive shape segmentation
- Requires the specification of starting points, or seeds. The geometry of the array is already known, which helps in placing them.
- Regions grow outwards from the seed points, preferentially according to the difference between a pixel's value and the running mean of the values in the adjoining region.

Histogram segmentation
- Uses a target mask chosen to be larger than any spot.
- Foreground and background intensities are determined from the histogram of pixel values for the pixels within the masked area.

3. Intensity based segmentation

The images are structured, with high-intensity spots, corresponding to the probes, located on a grid. The spots are roughly circular, though some deviate significantly from this shape because of experimental variation in the spotting procedure. The underlying principle in micro-array image analysis is that the spot intensity is a measure of gene expression. This implicitly assumes the gene expression of a spot to be governed entirely by the distribution of pixel intensities. Clustering-based segmentation is used to extract the intensity of the spots. The approximate boundaries of the spots in the micro-array are determined by adjusting rectilinear grids. K-means and Partitioning Around Medoids are then used to generate a binary partition of the pixel intensities.

Spots in the image appear in four colors: red, green, yellow and black. A red spot implies that a particular gene is expressed in the experimental channel. A green spot indicates high expression in the control channel. A yellow spot indicates that the gene is expressed in both channels, and a black spot that it is expressed in neither.

To estimate the differential gene expression, the high-intensity regions in the image corresponding to each probe have to be identified. This is done as part of image segmentation. Then the local background noise has to be estimated and removed, which corresponds to background correction. An example of a spot or gene summary statistic for cDNA arrays is the ratio of background-corrected mean intensities. We denote by R_i the pixel intensities from the red fluorescent scan and by G_i the pixel intensities from the green fluorescent scan.
The differential expression level R/G is then calculated as the ratio of the background-corrected mean intensities over the region of interest (ROI):

\[
\frac{R}{G} = \frac{\frac{1}{S}\sum_{i=1}^{S} R_i - \mu_R}{\frac{1}{S}\sum_{i=1}^{S} G_i - \mu_G}
\]

where \mu_R and \mu_G are the estimates of the local background in the red and green channels and S is the number of ROI pixels.

4. Micro-array image analysis

Image analysis involves three stages. First, the arrayed genes must be distinguished from spurious signals that can arise from precipitated probe, other hybridization artifacts or dust on the surface of the slide. After gridding, the spot intensity (real signal) and the background (noise) have to be calculated for each spot. It is always better to calculate the background locally for each spot rather than globally for the entire image. The next step in image processing is the extraction of signal, noise and quality control measures for the spot.

Image processing steps:
1. Addressing/Gridding: Find the areas in the image that belong to spots. The combined area of a spot and its background is called the target area.
2. Segmentation: Partition the target area into foreground and background.
3. Reduction: Extract two scalar values R and G for the red and green intensities and assign one value for the relative abundance.

4.1. Gridding

Gridding is the process of assigning coordinates to each of the spots. Automating this part of the procedure permits high-throughput analysis. Ideally, spots should be equally spaced across the array. However, the robot arm that prints the spots often introduces deviations from these ideal positions. The approximate boundaries of the spots are determined by drawing a rectilinear grid and adjusting it. As a result, each spot is enclosed in a rectangular box.

4.2. Segmentation

Segmentation classifies pixels as belonging either to a spot of interest or to the background. It involves partitioning the image into disjoint sets. Consider an image I partitioned into L regions r_i, i = 1, ..., L. Then I is the union of the regions r_i, i = 1, ..., L, and each r_i satisfies a predicate.
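The background-corrected ratio R/G defined in Section 3 can be illustrated with a minimal Python sketch. The function name `expression_ratio`, its argument names and the sample pixel values are hypothetical and chosen only for illustration; the local background estimates are assumed to be given.

```python
from statistics import mean


def expression_ratio(red_roi, green_roi, red_bg, green_bg):
    """Return the background-corrected R/G ratio for one spot.

    red_roi, green_roi: ROI pixel intensities (same S pixels in both scans).
    red_bg, green_bg:   local background estimates (mu_R and mu_G).
    """
    if not red_roi or len(red_roi) != len(green_roi):
        raise ValueError("both channels need the same, non-empty set of ROI pixels")
    red_signal = mean(red_roi) - red_bg        # (1/S) * sum(R_i) - mu_R
    green_signal = mean(green_roi) - green_bg  # (1/S) * sum(G_i) - mu_G
    if green_signal <= 0:
        # The ratio is undefined when the green signal does not exceed background.
        raise ValueError("green signal does not exceed the local background")
    return red_signal / green_signal


# Hypothetical spot: mean red 1250, mean green 625, backgrounds 250 and 125.
ratio = expression_ratio([1200, 1300, 1250], [600, 650, 625], 250, 125)
print(ratio)  # 2.0
```

A ratio above 1 suggests higher expression in the red (experimental) channel, a ratio below 1 higher expression in the green (control) channel. How spots whose signal falls below the background should be treated is a design choice that this sketch leaves to the caller by raising an error.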
Image segmentation can be either texture based or intensity based. While the former is governed by spatial characteristics, the latter is governed entirely by the distribution of pixel intensities. The choice of segmentation technique depends on the problem at hand. Since a gene expression value is proportional to the intensity of the spot, segmentation based on intensity is appropriate. The objective is to separate the foreground and background pixel intensities inside each grid. The one-dimensional distributions of the foreground and background pixel intensities can be represented by f(f) and f(b), respectively. The distribution of pixel intensities in a grid can be assumed to be the superposition of f(f) and f(b). The following cases can arise:
1) f(f) and f(b) are narrowly distributed with no overlap.
2) f(f) is spread out, whereas f(b) is narrowly distributed.
3) f(f) and f(b) exhibit significant overlap.

4.3. Reduction

Rfg, Gfg and Rbg, Gbg are the cluster means with high and low intensities, respectively. The final relative abundance estimate is (Rfg - Rbg)/(Gfg - Gbg).

Fig. 1: Flow chart of Microarray Image Segmentation
1. Start with the micro-array image file in TIFF format and the number of rows and columns, i.e. the total number of spots.
2. Manually align the row and column grids to determine the approximate boundary of each spot in the control channel (Cy3).
3. Retain the coordinates for the experimental channel (Cy5).
4. Set i = 1.
5. Map the image matrix I of the ith grid into a one-dimensional vector v.
6. Apply the K-means clustering technique to obtain a binary partition of the vector v.
7. Compute f(i) = median of the foreground pixels, b(i) = median of the background pixels, and t(i) = f(i) - b(i).
8. Repeat from step 5 for the next grid until all grids have been processed, then stop.

5. Clustering methods

Clustering techniques such as K-means and PAM are useful for detecting patterns in data generated by unknown processes and have recently been used for micro-array segmentation. The input to the K-means and PAM clustering algorithms is the set of two-dimensional values (Ri, Gi), where Ri and Gi represent the ith pixel intensity of a given spot in the Cy5 (red) and Cy3 (green) channels, respectively.

Cluster algorithms: k-means

K-means is a partitioning algorithm with a prefixed number k of clusters. It tries to minimize the sum of within-cluster variances. The algorithm chooses a random sample of k different objects as initial cluster midpoints. Then it alternates between two steps until convergence:
1. Assign each object to the closest of the k midpoints with respect to Euclidean distance.
2. Calculate k new midpoints as the averages of all points assigned to the old midpoints, respectively.

K-means is a randomized algorithm; two runs usually produce different results.
Thus it has to be applied a few times to the same data set, and the result with the minimal sum of within-cluster variances should be chosen.

PxKmeans: pixel clustering with k-means
1. Construct initial representatives: the starting midpoints are m1 = (Rfg, Gfg) and m2 = (Rbg, Gbg), where Rfg, Gfg are the highest intensity values and Rbg, Gbg the lowest.
2. Find a local optimum of the cluster problem (k-means). Repeat, alternating until convergence:
- Assign each data point to the closest of the two midpoints.
- Calculate two new midpoints as the means of all points assigned to the old midpoints, respectively.
3. Reduction: Rfg, Gfg and Rbg, Gbg are the cluster means with high and low intensities, respectively; (Rfg - Rbg)/(Gfg - Gbg) is the final relative abundance estimate.

Cluster algorithms: PAM

PAM (Partitioning Around Medoids; Kaufman and Rousseeuw) is a partitioning algorithm, a generalization of k-means.
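The PxKmeans steps listed above (fixed starting midpoints, alternating assignment and update, then reduction) can be sketched in Python as follows. This is a minimal illustration and not the author's implementation: the names `px_kmeans` and `relative_abundance`, the iteration cap `max_iter` and the sample pixels are assumptions, and an empty cluster simply keeps its previous midpoint.

```python
def _sq_dist(p, q):
    """Squared Euclidean distance between two (R, G) points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def _mean_point(points, fallback):
    """Mean of a list of (R, G) points; keep the old midpoint if the list is empty."""
    if not points:
        return fallback
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def px_kmeans(pixels, max_iter=100):
    """Two-cluster k-means on (R, G) pixels; label 0 = foreground, 1 = background."""
    reds = [p[0] for p in pixels]
    greens = [p[1] for p in pixels]
    fg = (max(reds), max(greens))  # step 1: m1 = highest intensity values
    bg = (min(reds), min(greens))  #         m2 = lowest intensity values
    labels = None
    for _ in range(max_iter):      # step 2: alternate until the labels are stable
        new_labels = [0 if _sq_dist(p, fg) <= _sq_dist(p, bg) else 1 for p in pixels]
        if new_labels == labels:
            break
        labels = new_labels
        fg = _mean_point([p for p, l in zip(pixels, labels) if l == 0], fg)
        bg = _mean_point([p for p, l in zip(pixels, labels) if l == 1], bg)
    return fg, bg, labels


def relative_abundance(fg, bg):
    """Step 3 (reduction): (Rfg - Rbg) / (Gfg - Gbg); assumes Gfg != Gbg."""
    return (fg[0] - bg[0]) / (fg[1] - bg[1])


# Hypothetical spot with three bright and three dark pixels.
pixels = [(1000, 500), (1100, 540), (900, 460), (100, 100), (120, 90), (80, 110)]
fg, bg, labels = px_kmeans(pixels)
print(labels, relative_abundance(fg, bg))  # [0, 0, 0, 1, 1, 1] 2.25
```

Because the starting midpoints are fixed at the intensity extremes rather than drawn at random, this variant gives the same result on every run, so no restarts are needed for a single spot.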

For an arbitrary dissimilarity matrix d, PAM tries to minimize the sum (over all objects) of the distances to the closest of k prototypes (medoids).

Objective function (d: Manhattan, correlation, etc.):

\[
\sum_{i=1}^{n} \min_{j=1,\dots,k} d(x_i, m_j)
\]

BUILD phase: choose the initial medoids.
SWAP phase: repeat until convergence: consider all pairs of objects (i, j), where i is a medoid and j is not, and make the swap of i and j (if any) that decreases the objective function most.

PxPAM: pixel clustering with PAM
0. Calculation of the dissimilarity matrix of the spot pixels: calculate the Manhattan distances between all pairs of pixels, d_ij = d(x_i, x_j) = |R_i - R_j| + |G_i - G_j|.
1. Construct two initial representatives (PAM BUILD phase): define m1 as the object with the smallest \sum_{i=1}^{n} d(x_i, m1), and m2 as the object that decreases the objective as much as possible.
2. Find a local optimum of the cluster problem (PAM SWAP phase).
3. Reduction: Rfg and Gfg are the values of the medoid pixel with the higher intensities; Rbg and Gbg are those of the other one. (Rfg - Rbg)/(Gfg - Gbg) is the final relative abundance estimate.

6. Background correction and normalization

Estimation of the background intensity is generally considered necessary for performing background correction. The motivation for background correction is that a spot's measured fluorescence intensity includes a contribution that is not specifically due to the hybridization of the mRNA samples to the spotted DNA. Background correction of the spot intensities is usually performed by subtracting the background estimate from the red and green foreground values, with the aim of improving accuracy, that is, reducing the bias. In some cases, background adjustment can substantially reduce the precision, that is, increase the variability, of low spot values. Spot quality scores may include measures of spot size or shape, or measures of background intensity relative to foreground intensity. The ratio of the fluorescence intensities for each spot is indicative of the relative abundance of the corresponding DNA sequence in the two nucleic acid samples.
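The PxPAM steps of Section 5 (Manhattan dissimilarities, BUILD phase, SWAP phase, reduction) can be sketched in Python as follows. This is a brute-force illustration for the two-medoid case under stated assumptions, not the author's implementation: the names `manhattan`, `cost` and `px_pam` and the sample pixels are hypothetical, and the objective is recomputed from scratch for every candidate, which is only practical for the small number of pixels in one spot.

```python
def manhattan(p, q):
    """Step 0: d_ij = |R_i - R_j| + |G_i - G_j|."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def cost(pixels, medoids):
    """Objective: sum over all pixels of the distance to the closest medoid."""
    return sum(min(manhattan(p, pixels[m]) for m in medoids) for p in pixels)


def px_pam(pixels):
    """Two-medoid PAM on (R, G) pixels; returns (foreground, background) medoids."""
    n = len(pixels)
    # Step 1 (BUILD): m1 minimizes the total distance to all pixels,
    # m2 is the pixel that lowers the objective most when added.
    m1 = min(range(n), key=lambda c: cost(pixels, [c]))
    m2 = min((c for c in range(n) if c != m1), key=lambda c: cost(pixels, [m1, c]))
    medoids = [m1, m2]
    best = cost(pixels, medoids)
    # Step 2 (SWAP): apply the single best improving medoid/non-medoid swap,
    # and repeat until no swap lowers the objective.
    while True:
        best_swap = None
        for pos in range(2):
            for j in range(n):
                if j in medoids:
                    continue
                trial = list(medoids)
                trial[pos] = j
                trial_cost = cost(pixels, trial)
                if trial_cost < best:
                    best, best_swap = trial_cost, trial
        if best_swap is None:
            break
        medoids = best_swap
    # Step 3 (reduction): the medoid with the higher intensities is the foreground.
    fg, bg = sorted((pixels[m] for m in medoids), key=sum, reverse=True)
    return fg, bg


# Hypothetical spot with three bright and three dark pixels.
pixels = [(1000, 500), (1100, 540), (900, 460), (100, 100), (120, 90), (80, 110)]
fg, bg = px_pam(pixels)
print(fg, bg, (fg[0] - bg[0]) / (fg[1] - bg[1]))  # (1000, 500) (100, 100) 2.25
```

Unlike the k-means midpoints, the two representatives returned here are actual pixels of the spot, which is what makes PAM applicable to arbitrary dissimilarities such as correlation.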
There are various types of noise that can affect the final signal produced by the scanner. These can be divided into two categories: source noise and detector noise. Examples of source noise are photon noise, dust on the slides and the treatment of the glass slides. Detector noise includes features of the amplification and digitization process. A perfect image should reflect only measures of the fluorescence intensities for the dye of interest. In practice, however, the image passes through an acquisition system and usually also contains a combination of undesired signals.

7. Gene clustering

The need for gene clustering:
- Genes clustered together have the same pattern of expression, which indicates co-regulated genes.
- Co-regulated genes are functionally correlated and might be involved in the same metabolic pathways.

Steps in gene clustering:
- For each gene, look at x_i, the expression of the ith gene through m experiments.
- Measure the distances between the expression profiles x_i of the genes.
- Clustering is based on similarity or distance metrics: Euclidean distance, vector angle and correlation coefficient.
- Clustering groups a set of genes or arrays into a tree. The branches of the tree represent the similarity of genes to each other.

Methods/Algorithms:
- Hierarchical clustering
  - Single linkage: nearest distance
  - Complete linkage: maximum distance
  - Average linkage: average distance between all points in the clusters
- K-means clustering

Hierarchical clustering

The clustering solution is represented by a dendrogram, which is a rooted weighted tree with leaves corresponding to the objects. The length of an edge reflects the dissimilarity between that cluster and the remaining clusters.

Hierarchical clustering steps:
1. Filter the data: remove genes that have many missing values or low-quality spots.
2. Pre-process the data: 1) log transformation, 2) mean/median centering, 3) normalization.
3. Create similarity metrics based on a distance measure.
4. Create distance matrices.

5. Scan the distance matrix and find the smallest distance (for single linkage).
6. Create a node/branch of a tree linking the two genes with the smallest distance. Set the length of the branch to the distance between the two genes.
7. Average the values of the two genes and replace the two genes with a new item.
8. Calculate the distance of the new item to the other genes.
9. Repeat the process n-1 times (n is the number of genes or arrays).

K-means clustering
- Requires prior knowledge of k (the number of clusters).
- Initialize the cluster centroids by one of: data centroid based search, evenly spaced profiles, or randomly generated profiles.
- Calculate the cluster centroids using Euclidean distance (the distance between two data points in N dimensions) or correlation.

K-means clustering steps:
1. Randomly assign the genes to K clusters.
2. Compute the mean vector of all genes in each cluster.
3. Assign each gene to the cluster whose center (mean vector) is closest to the gene's expression vector.
4. Repeat steps 2 and 3 until the maximum number of cycles is reached or a steady state is reached.

8. Conclusion

DNA micro-array analysis allows comparisons to be made between the expression levels of certain genes across different tissues and pathological conditions. Micro-array technology can give us an understanding of the temporal and spatial patterns of expression of all the genes involved in the developmental processes of an organism. Micro-array analysis will improve our understanding of diagnosis and prognosis. Transcription profiling using DNA micro-arrays has great potential as a systematic approach for discovering new classes of tumors, for assigning known tumors to classes, and for predicting response to therapy. Thus, it is perceived that gene expression monitoring could provide new insights into many aspects of tumor pathology, including cell of origin, stage, grade, clinical course and response to treatment.
Other applications include identification of targets for drug development, diagnosis and prognosis, number detection, risk assessment and the study of mutation.

Clustering large, high-dimensional data collections is a challenging task. The problem is more complex if, in addition to clustering, one is also interested in learning cluster-dependent feature relevance weights. One possible way to alleviate this problem is to use partial supervision to guide the search process and narrow down the space of possible solutions. Recently, semi-supervised learning has emerged as a new research direction in machine learning that improves the performance of unsupervised learning using some supervised information.



More information

Supervised vs unsupervised clustering

Supervised vs unsupervised clustering Classification Supervised vs unsupervised clustering Cluster analysis: Classes are not known a- priori. Classification: Classes are defined a-priori Sometimes called supervised clustering Extract useful

More information


MR IMAGE SEGMENTATION MR IMAGE SEGMENTATION Prepared by : Monil Shah What is Segmentation? Partitioning a region or regions of interest in images such that each region corresponds to one or more anatomic structures Classification

More information

Clustering & Classification (chapter 15)

Clustering & Classification (chapter 15) Clustering & Classification (chapter 5) Kai Goebel Bill Cheetham RPI/GE Global Research goebel@cs.rpi.edu cheetham@cs.rpi.edu Outline k-means Fuzzy c-means Mountain Clustering knn Fuzzy knn Hierarchical

More information

Introduction to digital image classification

Introduction to digital image classification Introduction to digital image classification Dr. Norman Kerle, Wan Bakx MSc a.o. INTERNATIONAL INSTITUTE FOR GEO-INFORMATION SCIENCE AND EARTH OBSERVATION Purpose of lecture Main lecture topics Review

More information

Clustering: - (a) k-means (b)kmedoids(c). DBSCAN

Clustering: - (a) k-means (b)kmedoids(c). DBSCAN COMPARISON OF K MEANS, K MEDOIDS, DBSCAN ALGORITHMS USING DNA MICROARRAY DATASET C.Kondal raj CPA college of Arts and science, Theni(Dt), Tamilnadu, India E-mail : kondalrajc@gmail.com Abstract Data mining

More information

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology 9/9/ I9 Introduction to Bioinformatics, Clustering algorithms Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Data mining tasks Predictive tasks vs descriptive tasks Example

More information

cse 252c Fall 2004 Project Report: A Model of Perpendicular Texture for Determining Surface Geometry

cse 252c Fall 2004 Project Report: A Model of Perpendicular Texture for Determining Surface Geometry cse 252c Fall 2004 Project Report: A Model of Perpendicular Texture for Determining Surface Geometry Steven Scher December 2, 2004 Steven Scher SteveScher@alumni.princeton.edu Abstract Three-dimensional

More information

ECS 234: Data Analysis: Clustering ECS 234

ECS 234: Data Analysis: Clustering ECS 234 : Data Analysis: Clustering What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Chapter 6: Cluster Analysis

Chapter 6: Cluster Analysis Chapter 6: Cluster Analysis The major goal of cluster analysis is to separate individual observations, or items, into groups, or clusters, on the basis of the values for the q variables measured on each

More information

A Dendrogram. Bioinformatics (Lec 17)

A Dendrogram. Bioinformatics (Lec 17) A Dendrogram 3/15/05 1 Hierarchical Clustering [Johnson, SC, 1967] Given n points in R d, compute the distance between every pair of points While (not done) Pick closest pair of points s i and s j and

More information

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Dimension reduction : PCA and Clustering

Dimension reduction : PCA and Clustering Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental

More information

Biclustering Bioinformatics Data Sets. A Possibilistic Approach

Biclustering Bioinformatics Data Sets. A Possibilistic Approach Possibilistic algorithm Bioinformatics Data Sets: A Possibilistic Approach Dept Computer and Information Sciences, University of Genova ITALY EMFCSC Erice 20/4/2007 Bioinformatics Data Sets Outline Introduction

More information

Nature Publishing Group

Nature Publishing Group Figure S I II III 6 7 8 IV ratio ssdna (S/G) WT hr hr hr 6 7 8 9 V 6 6 7 7 8 8 9 9 VII 6 7 8 9 X VI XI VIII IX ratio ssdna (S/G) rad hr hr hr 6 7 Chromosome Coordinate (kb) 6 6 Nature Publishing Group

More information

Microarray data analysis

Microarray data analysis Microarray data analysis Computational Biology IST Technical University of Lisbon Ana Teresa Freitas 016/017 Microarrays Rows represent genes Columns represent samples Many problems may be solved using

More information

Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

More information

DNA microarrays [1] are used to measure the expression

DNA microarrays [1] are used to measure the expression IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. 24, NO. 7, JULY 2005 901 Mixture Model Analysis of DNA Microarray Images K. Blekas*, Member, IEEE, N. P. Galatsanos, Senior Member, IEEE, A. Likas, Senior Member,

More information

Clustering in Data Mining

Clustering in Data Mining Clustering in Data Mining Classification Vs Clustering When the distribution is based on a single parameter and that parameter is known for each object, it is called classification. E.g. Children, young,

More information

Digital Image Processing

Digital Image Processing Digital Image Processing Part 9: Representation and Description AASS Learning Systems Lab, Dep. Teknik Room T1209 (Fr, 11-12 o'clock) achim.lilienthal@oru.se Course Book Chapter 11 2011-05-17 Contents

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 15: Microarray clustering http://compbio.pbworks.com/f/wood2.gif Some slides were adapted from Dr. Shaojie Zhang (University of Central Florida) Microarray

More information

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM. Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information

Cluster Analysis. CSE634 Data Mining

Cluster Analysis. CSE634 Data Mining Cluster Analysis CSE634 Data Mining Agenda Introduction Clustering Requirements Data Representation Partitioning Methods K-Means Clustering K-Medoids Clustering Constrained K-Means clustering Introduction

More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information


CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES 6.1 INTRODUCTION The exploration of applications of ANN for image classification has yielded satisfactory results. But, the scope for improving

More information

Multivariate analyses in ecology. Cluster (part 2) Ordination (part 1 & 2)

Multivariate analyses in ecology. Cluster (part 2) Ordination (part 1 & 2) Multivariate analyses in ecology Cluster (part 2) Ordination (part 1 & 2) 1 Exercise 9B - solut 2 Exercise 9B - solut 3 Exercise 9B - solut 4 Exercise 9B - solut 5 Multivariate analyses in ecology Cluster

More information

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science Data Mining Dr. Raed Ibraheem Hamed University of Human Development, College of Science and Technology Department of Computer Science 2016 201 Road map What is Cluster Analysis? Characteristics of Clustering

More information

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,

More information

Goal-oriented Schema in Biological Database Design

Goal-oriented Schema in Biological Database Design Goal-oriented Schema in Biological Database Design Ping Chen Department of Computer Science University of Helsinki Helsinki, Finland 00014 EMAIL: pchen@cs.helsinki.fi Abstract In this paper, I reviewed

More information

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data Nian Zhang and Lara Thompson Department of Electrical and Computer Engineering, University

More information

An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures

An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures José Ramón Pasillas-Díaz, Sylvie Ratté Presenter: Christoforos Leventis 1 Basic concepts Outlier

More information

Community Detection. Jian Pei: CMPT 741/459 Clustering (1) 2

Community Detection. Jian Pei: CMPT 741/459 Clustering (1) 2 Clustering Community Detection http://image.slidesharecdn.com/communitydetectionitilecturejune0-0609559-phpapp0/95/community-detection-in-social-media--78.jpg?cb=3087368 Jian Pei: CMPT 74/459 Clustering

More information

Course on Microarray Gene Expression Analysis

Course on Microarray Gene Expression Analysis Course on Microarray Gene Expression Analysis ::: Normalization methods and data preprocessing Madrid, April 27th, 2011. Gonzalo Gómez ggomez@cnio.es Bioinformatics Unit CNIO ::: Introduction. The probe-level

More information

Bioconductor s stepnorm package

Bioconductor s stepnorm package Bioconductor s stepnorm package Yuanyuan Xiao 1 and Yee Hwa Yang 2 October 18, 2004 Departments of 1 Biopharmaceutical Sciences and 2 edicine University of California, San Francisco yxiao@itsa.ucsf.edu

More information

Efficient Image Compression of Medical Images Using the Wavelet Transform and Fuzzy c-means Clustering on Regions of Interest.

Efficient Image Compression of Medical Images Using the Wavelet Transform and Fuzzy c-means Clustering on Regions of Interest. Efficient Image Compression of Medical Images Using the Wavelet Transform and Fuzzy c-means Clustering on Regions of Interest. D.A. Karras, S.A. Karkanis and D. E. Maroulis University of Piraeus, Dept.

More information

Pattern recognition. Classification/Clustering GW Chapter 12 (some concepts) Textures

Pattern recognition. Classification/Clustering GW Chapter 12 (some concepts) Textures Pattern recognition Classification/Clustering GW Chapter 12 (some concepts) Textures Patterns and pattern classes Pattern: arrangement of descriptors Descriptors: features Patten class: family of patterns

More information

Methodology for spot quality evaluation

Methodology for spot quality evaluation Methodology for spot quality evaluation Semi-automatic pipeline in MAIA The general workflow of the semi-automatic pipeline analysis in MAIA is shown in Figure 1A, Manuscript. In Block 1 raw data, i.e..tif

More information

Missing Data Estimation in Microarrays Using Multi-Organism Approach

Missing Data Estimation in Microarrays Using Multi-Organism Approach Missing Data Estimation in Microarrays Using Multi-Organism Approach Marcel Nassar and Hady Zeineddine Progress Report: Data Mining Course Project, Spring 2008 Prof. Inderjit S. Dhillon April 02, 2008

More information

3. Cluster analysis Overview

3. Cluster analysis Overview Université Laval Multivariate analysis - February 2006 1 3.1. Overview 3. Cluster analysis Clustering requires the recognition of discontinuous subsets in an environment that is sometimes discrete (as

More information


CHAPTER-6 WEB USAGE MINING USING CLUSTERING CHAPTER-6 WEB USAGE MINING USING CLUSTERING 6.1 Related work in Clustering Technique 6.2 Quantifiable Analysis of Distance Measurement Techniques 6.3 Approaches to Formation of Clusters 6.4 Conclusion

More information

CS4733 Class Notes, Computer Vision

CS4733 Class Notes, Computer Vision CS4733 Class Notes, Computer Vision Sources for online computer vision tutorials and demos - http://www.dai.ed.ac.uk/hipr and Computer Vision resources online - http://www.dai.ed.ac.uk/cvonline Vision

More information

Analyzing ICAT Data. Analyzing ICAT Data

Analyzing ICAT Data. Analyzing ICAT Data Analyzing ICAT Data Gary Van Domselaar University of Alberta Analyzing ICAT Data ICAT: Isotope Coded Affinity Tag Introduced in 1999 by Ruedi Aebersold as a method for quantitative analysis of complex

More information

Automatic Grayscale Classification using Histogram Clustering for Active Contour Models

Automatic Grayscale Classification using Histogram Clustering for Active Contour Models Research Article International Journal of Current Engineering and Technology ISSN 2277-4106 2013 INPRESSCO. All Rights Reserved. Available at http://inpressco.com/category/ijcet Automatic Grayscale Classification

More information

DI TRANSFORM. The regressive analyses. identify relationships

DI TRANSFORM. The regressive analyses. identify relationships July 2, 2015 DI TRANSFORM MVstats TM Algorithm Overview Summary The DI Transform Multivariate Statistics (MVstats TM ) package includes five algorithm options that operate on most types of geologic, geophysical,

More information

Microarray Data Analysis (V) Preprocessing (i): two-color spotted arrays

Microarray Data Analysis (V) Preprocessing (i): two-color spotted arrays Microarray Data Analysis (V) Preprocessing (i): two-color spotted arrays Preprocessing Probe-level data: the intensities read for each of the components. Genomic-level data: the measures being used in

More information

Clustering and Dissimilarity Measures. Clustering. Dissimilarity Measures. Cluster Analysis. Perceptually-Inspired Measures

Clustering and Dissimilarity Measures. Clustering. Dissimilarity Measures. Cluster Analysis. Perceptually-Inspired Measures Clustering and Dissimilarity Measures Clustering APR Course, Delft, The Netherlands Marco Loog May 19, 2008 1 What salient structures exist in the data? How many clusters? May 19, 2008 2 Cluster Analysis

More information

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010 Cluster Analysis Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX 7575 April 008 April 010 Cluster Analysis, sometimes called data segmentation or customer segmentation,

More information

SVM Classification in -Arrays

SVM Classification in -Arrays SVM Classification in -Arrays SVM classification and validation of cancer tissue samples using microarray expression data Furey et al, 2000 Special Topics in Bioinformatics, SS10 A. Regl, 7055213 What

More information

Supplementary text S6 Comparison studies on simulated data

Supplementary text S6 Comparison studies on simulated data Supplementary text S Comparison studies on simulated data Peter Langfelder, Rui Luo, Michael C. Oldham, and Steve Horvath Corresponding author: shorvath@mednet.ucla.edu Overview In this document we illustrate

More information

3. Cluster analysis Overview

3. Cluster analysis Overview Université Laval Analyse multivariable - mars-avril 2008 1 3.1. Overview 3. Cluster analysis Clustering requires the recognition of discontinuous subsets in an environment that is sometimes discrete (as

More information

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing Unsupervised Data Mining: Clustering Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 1. Supervised Data Mining Classification Regression Outlier detection

More information



More information