Iterative Signature Algorithm for the Analysis of Large-Scale Gene Expression Data. By S. Bergmann, J. Ihmels, N. Barkai

Similar documents
FISA: Fast Iterative Signature Algorithm for the analysis of large-scale gene expression data

Biclustering Algorithms for Gene Expression Analysis

Exploratory data analysis for microarrays

Biclustering for Microarray Data: A Short and Comprehensive Tutorial

Orange3 Data Fusion Documentation. Biolab

Clustering Jacques van Helden

Dimension reduction : PCA and Clustering

Biclustering algorithms ISA and SAMBA

Contents. ! Data sets. ! Distance and similarity metrics. ! K-means clustering. ! Hierarchical clustering. ! Evaluation of clustering results

Clustering. RNA-seq: What is it good for? Finding Similarly Expressed Genes. Data... And Lots of It!

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

DM and Cluster Identification Algorithm

A Web Page Recommendation system using GA based biclustering of web usage data

BIMAX. Lecture 11: December 31, Introduction Model An Incremental Algorithm.

Triclustering in Gene Expression Data Analysis: A Selected Survey

Gene expression & Clustering (Chapter 10)

Linear Equation Systems Iterative Methods

Algorithms for Bounded-Error Correlation of High Dimensional Data in Microarray Experiments

Distance-based Methods: Drawbacks

CS Introduction to Data Mining Instructor: Abdullah Mueen

Comparisons and validation of statistical clustering techniques for microarray gene expression data. Outline. Microarrays.

A Memetic Heuristic for the Co-clustering Problem

Computational Genomics and Molecular Biology, Fall

Controlling and visualizing the precision-recall tradeoff for external performance indices

University of Florida CISE department Gator Engineering. Clustering Part 2

e-ccc-biclustering: Related work on biclustering algorithms for time series gene expression data

MSCBIO 2070/02-710: Computational Genomics, Spring A4: spline, HMM, clustering, time-series data analysis, RNA-folding

ECS 234: Data Analysis: Clustering ECS 234

CHAPTER 4 AN IMPROVED INITIALIZATION METHOD FOR FUZZY C-MEANS CLUSTERING USING DENSITY BASED APPROACH

Missing Data Estimation in Microarrays Using Multi-Organism Approach

Segmentation of MR Images of a Beating Heart

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Analysis of Biological Networks. 1. Clustering 2. Random Walks 3. Finding paths

Differential Expression Analysis at PATRIC

MSA220 - Statistical Learning for Big Data

Chapter 15 Introduction to Linear Programming

Homework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19 th, 2014, 2:30pm, in class hard-copy please)

Multiple View Geometry in Computer Vision

Weka ( )

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science

Multiple View Geometry in Computer Vision

Genetic Algorithm for Circuit Partitioning

Stereo Vision. MAN-522 Computer Vision

Chapter Introduction

Fundamentals of Operations Research. Prof. G. Srinivasan. Department of Management Studies. Indian Institute of Technology Madras.

Properties of Biological Networks

Package ibbig. R topics documented: December 24, 2018

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

Biclustering Bioinformatics Data Sets. A Possibilistic Approach

Annotated multitree output

Iterative Methods for Solving Linear Problems

MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS

Clustering Techniques

Gene Clustering & Classification

PARALLEL CLASSIFICATION ALGORITHMS

Statistical Methods and Optimization in Data Mining

CompClustTk Manual & Tutorial

Lecture 5: May 24, 2007

ISA internals. October 18, Speeding up the ISA iteration 2

Cluster Analysis in Data Mining

Subspace Clustering with Global Dimension Minimization And Application to Motion Segmentation

1 Projective Geometry

Dimension Reduction CS534

Estimating Error-Dimensionality Relationship for Gene Expression Based Cancer Classification

Concept Tree Based Clustering Visualization with Shaded Similarity Matrices

CURVE SKETCHING EXAM QUESTIONS

Unsupervised learning in Vision

A Review on Cluster Based Approach in Data Mining

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

DREM. Dynamic Regulatory Events Miner (v1.0.9b) User Manual

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Calculation of the fundamental matrix

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES

COMP5318 Knowledge Management & Data Mining Assignment 1

BOSS. Quick Start Guide For research use only. Blackrock Microsystems, LLC. Blackrock Offline Spike Sorter. User s Manual. 630 Komas Drive Suite 200

A Summary of Projective Geometry

Clustering. SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic

Clustering k-mean clustering

An idea which can be used once is a trick. If it can be used more than once it becomes a method

K-Means Clustering With Initial Centroids Based On Difference Operator

Foundations of Machine Learning CentraleSupélec Fall Clustering Chloé-Agathe Azencot

Particle Swarm Optimization applied to Pattern Recognition

Collaborative Filtering for Netflix

Distributed CUR Decomposition for Bi-Clustering

Unsupervised Learning : Clustering

Statistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes.

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010

Information Retrieval and Web Search Engines

Clustering CS 550: Machine Learning

Modern GPUs (Graphics Processing Units)

How do microarrays work

EECS730: Introduction to Bioinformatics

calibrated coordinates Linear transformation pixel coordinates

Biclustering with δ-pcluster John Tantalo. 1. Introduction

Bipartite Perfect Matching in O(n log n) Randomized Time. Nikhil Bhargava and Elliot Marx

Multiple Views Geometry

Similarity measures in clustering time series data. Paula Silvonen

Factorization with Missing and Noisy Data

Clustering with Reinforcement Learning

Nature Methods: doi: /nmeth Supplementary Figure 1

Transcription:

Iterative Signature Algorithm for the Analysis of Large-Scale Gene Expression Data By S. Bergmann, J. Ihmels, N. Barkai

Reasoning Both clustering and Singular Value Decomposition(SVD) are useful tools but have draw backs Analysis based on expression over all conditions, but only a small number are important SVD is sensitive to noise in the data

Goal Provide a method for large-scale gene expression analysis without the draw backs of clustering and SVD Does not require any prior knowledge beyond the expression data

Transcription Modules Defines the concept of Transcription Module(TM) A TM is a bicluster with some additional constraints TM may be associated with particular cellular function Ideally there is one transcription factor per TM, but since different transcription factors can regulate the same gene distinct TM can share common genes and conditions

Transcription Modules Each module M contains a subset of genes and conditions Maximize similarity of the gene expression when compared over the conditions Similarity is determined by two threshold parameters T G, T C Mathematical definition of TM:

Matrix Normalization What are E G cg and E c cg? E G and E C are normalized matrices, derived from the original expression matrix E g, c are a gene and a condition More on normalization, E can be written as:

Matrix Normalization Rows are conditions and columns are genes, row vector g contains the expression profile for that condition, and vice versa The normalized matrices are defined by:

Matrix Normalization With Matrices E C, E G compute the projection of a gene set g m or condition set c m onto the normalized martrices

Redefining TM With this information a revised definition of TM can be created, with f t (x) the threshold function

The Threshold Function The concept of a threshold is very important as we will see later The function is made up of two parts a weight w(x i ) and a step function Θ(x i ~ -t) w(x) in this paper is either the sign(x), or x, but could have other values

More on f t (x) x i ~ = [ x i - mean(x)]/stddev(x) The step function sets to zero all elements that do not exceed the mean by at least t*stddev(x) Can capture down regulation by using the absolute value x i Applying the threshold function to C m proj results in a non zero component in c m if the corresponding normalized gene profile is sufficiently aligned with g m

Iterative Signature Algorithm Now with all of this information we can define the iterative signature algorithm A randomly chosen gene set (or condition set) called g i0 is needed to begin, then the following equations can be used.

Iterative Signature Algorithm Using the initial gene set C i1 can be computed, then G i 1, and so on After some number of iterations the g i n values quickly converge to a fixed point g (*),that will satisfy the equation below for some given error

Iterative Signature Algorithm So a general application of ISA would 1)Generate large sample of seeds or g i0 s 2)Find the fixed points corresponding to each seed using iterations 3)Collect distinct fixed points in order to decompose into modules

Importance of Thresholds Lower thresholds will produce larger modules whose coregulation is loose, and higher thresholds give smaller more closely coregulated modules Choosing the threshold low enough will create a super module that contains all of the conditions and genes. Choosing the threshold higher will produce the intersection of modules, and choosing the threshold high enough will give you empty modules

Importance of Thresholds

Importance of Thresholds Thresholds can greatly reduce the noise in expression data, more on this later. The paper compares a simple iterative method that is similar to ISA but without threshold much like SVD Also tested how ISA handles noise at with different number of modules

Importance of Thresholds Created tests sets and artificially added noise Example: t=2 removes 98% of genes outside the module, while only removing 16% inside the module The result of these tests show that ISA in combination with the threshold handles noise well compared with no threshold

Importance of Thresholds

Comparing ISA to other methods Compared ISA to SVD and Hierarchical Clustering Created experimental data and artificially added noise, data had 25 modules Superimposed noise from a random distribution on to the expression set Even as noise became large ISA preformed considerably better than SVD or Hierarchial Clustering

Comparing ISA to other methods

Comparing ISA to other methods

ISA and Yeast Expression Data Used a diverse set of 1000 DNA-chip experiments that were obtained by different groups Tried 20,000 randomly created initial sets, and tried values of T G = 1.8, 1.9...4.0,T C set to 2.0 With T G = 1.8 found 5 know modules of Yeast, and at higher values found TM that had not been previously identified

Conclusion ISA clearly out preforms both hierarchical clustering and SVD when dealing with noise in data sets The use of thresholds in ISA is a very powerful tool enabling structures at different levels to become clear References: all figures in this presentation were taken from the paper. Any Questions?