Biclustering algorithms ISA and SAMBA


1 Biclustering algorithms: ISA and SAMBA. Slides with Amos Tanay.

2 Biclustering
Clusters: global; a partition of genes according to a common expression pattern across all conditions.
Genes have multiple functions.
Conditions may be diverse.
Bicluster: a subset of genes together with a subset of conditions.
Finer, local analysis.
(Figure: expression matrix; axes: genes, conditions.)

3 In this lecture
Two current biclustering methodologies:
Iterative Signature Algorithm (ISA): simple, randomized.
SAMBA: combinatorial basis, fast.
And maybe a little more.

4 What makes a biclustering algorithm?
Define what a bicluster is; define a score.
Alg for finding one bicluster.
Alg for finding all (many) biclusters.
Important themes: normalization, redundancies.

5 The Iterative Signature Algorithm (ISA)
Developed at Naama Barkai's lab at WIS (J. Ihmels, S. Bergmann).
Motivation: a bicluster is a set of genes and a set of conditions that mutually define each other.
An approximate bicluster can be refined by stabilizing it.

6 Normalization
Can we normalize simultaneously for both gene-dependent and condition-dependent trends? In the ISA we do not try to.
Given a genes x conditions matrix E with condition set U and gene set V, define:
$E^C$: E with each condition normalized to mean 0, std 1.
$E^G$: E with each gene normalized to mean 0, std 1.
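The two normalizations can be sketched in NumPy as follows. This is a minimal illustration, not code from the ISA authors; the function name `normalize_isa` and the choice of the population standard deviation are assumptions.

```python
import numpy as np

def normalize_isa(E):
    """Return (E_C, E_G) for a genes x conditions matrix E.

    Assumes no row or column of E is constant (otherwise std is 0).
    """
    E = np.asarray(E, dtype=float)
    # E_C: every condition (column) has mean 0 and std 1 across genes.
    E_C = (E - E.mean(axis=0, keepdims=True)) / E.std(axis=0, keepdims=True)
    # E_G: every gene (row) has mean 0 and std 1 across conditions.
    E_G = (E - E.mean(axis=1, keepdims=True)) / E.std(axis=1, keepdims=True)
    return E_C, E_G

rng = np.random.default_rng(0)
E = rng.normal(loc=5.0, scale=2.0, size=(6, 4))  # toy data: 6 genes, 4 conditions
E_C, E_G = normalize_isa(E)
```

Keeping two separately normalized copies avoids having to find one matrix that is standardized along both axes at once.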

7 What is a bicluster
Assume all columns are independent. What is the distribution of $\sum_{j \in U'} e^G_{ij}$ for a random condition set $U'$ and a gene $i$? Mean = 0, std = $\sqrt{|U'|}$.
The same holds for $\sum_{i \in V'} e^C_{ij}$ and a random gene set $V'$, with std = $\sqrt{|V'|}$.
In a bicluster, we expect independence not to hold.

8 What is a bicluster (2)
Given a set of conditions $U'$, define: $ISA(U') = \{ v \in V : \sum_{j \in U'} e^G_{vj} > T_G \sigma_{U'} \}$
Given a set of genes $V'$, define: $ISA(V') = \{ u \in U : \sum_{i \in V'} e^C_{iu} > T_C \sigma_{V'} \}$
$T_G, T_C$: threshold parameters. $\sigma_{U'}, \sigma_{V'}$: standard deviations, estimated from the data.
A (perfect) bicluster is a pair $(U', V')$ such that $ISA(V') = U'$ and $ISA(U') = V'$.
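A minimal sketch of the two set operators, assuming the normalized matrices are given as NumPy arrays with genes as rows. The function names are hypothetical, and the default sigma (the square root of the set size, i.e. the std under the independence assumption) is an assumption; the slides only say sigma is estimated from the data.

```python
import numpy as np

def isa_genes(E_G, conds, t_g, sigma=None):
    """ISA(U'): genes whose summed E_G score over `conds` exceeds t_g * sigma."""
    conds = np.asarray(sorted(conds), dtype=int)
    scores = E_G[:, conds].sum(axis=1)
    if sigma is None:
        sigma = np.sqrt(len(conds))  # assumed null std under independence
    return set(np.flatnonzero(scores > t_g * sigma).tolist())

def isa_conds(E_C, genes, t_c, sigma=None):
    """ISA(V'): conditions whose summed E_C score over `genes` exceeds t_c * sigma."""
    genes = np.asarray(sorted(genes), dtype=int)
    scores = E_C[genes, :].sum(axis=0)
    if sigma is None:
        sigma = np.sqrt(len(genes))  # assumed null std under independence
    return set(np.flatnonzero(scores > t_c * sigma).tolist())

# Toy matrix (treated as already normalized, for illustration only):
# genes 0-1 are high in conditions 0-1, everything else is 0.
toy = np.zeros((5, 4))
toy[:2, :2] = 2.0
genes = isa_genes(toy, {0, 1}, t_g=1.0)
conds = isa_conds(toy, genes, t_c=1.0)
```

In this toy case the pair (conditions {0, 1}, genes {0, 1}) maps to itself under both operators, which is the fixed-point property that defines a perfect bicluster.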

9 Searching for biclusters
Define a directed graph: nodes = condition subsets and gene subsets; arc $X \to Y$ iff $ISA(X) = Y$.
A bicluster is a cycle of two nodes, $U' \leftrightarrow V'$.
An approximate bicluster is a larger cycle (but not too large).
Alg: start from a random or known gene set and apply ISA repeatedly until converging to an approximate bicluster: $V_i = ISA(U_{i-1})$, $U_i = ISA(V_i)$.
Converged at $i$ when, for all $j > i - m$: $|U_i \setminus U_j| / |U_i \cup U_j| < \varepsilon$.
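The iteration and its stopping rule can be sketched as below. This is an illustrative version under stated assumptions: sigma is taken as the square root of the set size, the default values of `m`, `eps` and `max_iter` are arbitrary, and the function name `isa_iterate` is hypothetical. The stopping rule follows the set-difference form on the slide; a symmetric difference would be a stricter variant.

```python
import numpy as np

def isa_iterate(E_G, E_C, seed_genes, t_g=1.0, t_c=1.0, m=3, eps=0.05, max_iter=100):
    """Alternate the two ISA operators from a seed gene set until U stabilizes."""
    def select(scores, t, n):
        # Keep indices whose score exceeds t * sqrt(n) (assumed null std).
        return frozenset(np.flatnonzero(scores > t * np.sqrt(n)).tolist())

    def dissimilarity(a, b):
        union = a | b
        return len(a - b) / len(union) if union else 0.0

    V = frozenset(seed_genes)
    U = frozenset()
    history = []
    for _ in range(max_iter):
        U = select(E_C[sorted(V), :].sum(axis=0), t_c, len(V))  # U_i = ISA(V_i)
        history.append(U)
        # Converged when U_i is close to each of the last m condition sets.
        if len(history) >= m and all(dissimilarity(U, Uj) < eps for Uj in history[-m:]):
            break
        V = select(E_G[:, sorted(U)].sum(axis=1), t_g, len(U))  # V_{i+1} = ISA(U_i)
    return set(V), set(U)

# Toy matrix (treated as already normalized): genes 0-2 are high in conditions 0-1.
toy = np.zeros((6, 5))
toy[:3, :2] = 2.0
genes, conds = isa_iterate(toy, toy, seed_genes={0, 1})
```

Starting from only two of the three planted genes, the iteration recovers the third one after the first round and then stays put.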

10 ISA

11 Adding weights
Instead of sets, use vectors of gene and condition weights.
The ISA operator is generalized to a matrix multiplication followed by a threshold function.
Set version: gene set -> compute averages on conditions -> compute Z-scores of conditions -> keep conditions that survive the threshold.
Weighted version: gene weights -> multiply by the gene expression matrix -> compute Z-scores of conditions -> nullify weights below the threshold.
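One weighted update can be sketched as a matrix product followed by a Z-score and a threshold. The slide spells out only the gene-to-condition half; the condition-to-gene half below is written by symmetry and is an assumption, as are the function names and default thresholds.

```python
import numpy as np

def zscore(x):
    """Standardize a score vector; returns zeros if all scores are equal."""
    s = x.std()
    return (x - x.mean()) / s if s > 0 else np.zeros_like(x)

def weighted_isa_step(E_G, E_C, gene_w, t_c=1.0, t_g=1.0):
    """Gene weights -> condition weights -> updated gene weights."""
    # Multiply gene weights by the expression matrix, then Z-score the conditions.
    cond_scores = zscore(E_C.T @ gene_w)
    # Nullify condition weights below the threshold.
    cond_w = np.where(cond_scores > t_c, cond_scores, 0.0)
    # Same steps in the other direction (assumed by symmetry).
    gene_scores = zscore(E_G @ cond_w)
    new_gene_w = np.where(gene_scores > t_g, gene_scores, 0.0)
    return new_gene_w, cond_w

# Toy matrix (treated as already normalized): genes 0-1 are high in conditions 0-1.
toy = np.zeros((6, 5))
toy[:2, :2] = 2.0
seed = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # start from gene 0 only
new_w, cond_w = weighted_isa_step(toy, toy, seed)
```

With weights, a gene or condition that barely passes the threshold contributes less to the next step than one that passes it clearly.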

12 Handling redundancy
Starting from different seeds yields different fixed points (bics).
Using different thresholds changes the graph structure and gives more bics.
Need to filter similar solutions and report a short, non-redundant list of significant bics.

13 ISA applications
Start from sets of genes with a known functional annotation.
Start from genes with binding sites of a transcription factor.
Start from a set of sequence orthologs.
See: Ihmels et al. Nat Gen 2002, Bergmann et al. Phy Rev Letter 2003, Bergmann et al. PLoS

14 The basic signature algorithm (Nat. Genetics 02)

15 Using recurrence to evaluate solutions
A bad initial gene set will also lead to some module. How can we identify the good modules?
Idea: for an input gene set A and a random set R of other genes, compute ISA(A) and ISA(R ∪ A) and compare them.
If A represents part of a real transcription module, expect a large overlap between the resulting solutions.
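The recurrence idea can be sketched as a small test harness around any signature function. Everything here is illustrative: `signature` stands in for an ISA run, the Jaccard index is an assumed overlap measure (the slides do not fix one), and `toy_signature` is a made-up stand-in used only to exercise the code.

```python
import random

def jaccard(a, b):
    """Overlap of two gene sets; two empty sets count as identical."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def recurrence_score(signature, A, all_genes, n_added, n_trials=20, seed=0):
    """Average overlap between signature(A) and signature(A plus random genes)."""
    rng = random.Random(seed)
    A = set(A)
    others = sorted(set(all_genes) - A)
    reference = signature(A)
    overlaps = []
    for _ in range(n_trials):
        R = set(rng.sample(others, n_added))  # random extra genes
        overlaps.append(jaccard(reference, signature(A | R)))
    return sum(overlaps) / len(overlaps)

# Hypothetical stand-in for ISA: returns a fixed module when the input
# contains at least two of its genes, and echoes the input otherwise.
MODULE = {0, 1, 2, 3}
def toy_signature(genes):
    genes = set(genes)
    return set(MODULE) if len(genes & MODULE) >= 2 else genes

good = recurrence_score(toy_signature, {0, 1}, range(20), n_added=5)
bad = recurrence_score(toy_signature, {10, 11}, range(20), n_added=5)
```

A seed that sits inside a real module keeps producing the same output when noise genes are added, while an arbitrary seed gives outputs that drift with the noise.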

16 Using recurrence (2)
a, A reference set of Ncore co-regulated genes was composed of genes encoding either ribosomal proteins (dashed lines) or proteins involved in amino acid biosynthesis (dashed/dotted line). The recurrent signature method was applied to this set as follows. First, a collection of input sets was derived by randomly adding genes to the reference set. Second, the signature algorithm was applied to the reference set and to the derived sets; this generates a reference signature and a collection of perturbed signatures, respectively. Last, the overlaps between the reference signature and the perturbed signatures were calculated. Shown is the average overlap as a function of the number of genes added to the reference set. The different lines correspond to different choices of Ncore, shown in parentheses.
b, The recurrent signature method was applied to three sequence-related reference sets. These sets include all of the genes that contain the binding sequences CGGN11CCG (for Gal4), TGACTC (for Gcn4) or TTN9GGAAA (for Mcm1) in a region of 600 bp upstream. Shown is the fraction of perturbed signatures whose overlap with the reference signature is greater than some threshold, as a function of this threshold. Note the large number of highly overlapping outputs for all three reference sets. By contrast, the profile corresponding to a random sequence is distinctly different, with no large overlaps. Thus, the recurrence profile gives a clear indication of whether a given sequence functions as a regulatory control element.

17 A global analysis in yeast
1000 expression profiles.
Applied the signature algorithm (SA) with these input gene sets:
All target sets of 6-mers, 7-mers, 8-mers (~86K sets).
All functional groups in MIPS.
All clusters in a hierarchical clustering of all genes.
Accepted only recurring modules.
Results: 86 modules covering 2241 genes.

18 Genes in most modules participate in a module-specific cellular process.

19 ISA Pros/Cons
Pros: simple, quite fast; elegant solution to the normalization problem; good empirical results in several cases.
Cons: threshold setting; finding good seeds; redundancies; non-normal behaviors.

20 SAMBA: Statistical and Algorithmic Method for Bicluster Analysis
Developed here (Tanay, Sharan, Shamir, Bioinformatics 02).
Outline:
Develop efficient combinatorial techniques for biclustering large datasets.
Employ a statistical model for biclusters.
Allow integration of heterogeneous data.

21 The SAMBA model
Matrix view: the goal is to find high-similarity submatrices.
Graph view: a bipartite graph G=(U,V,E) with an edge or no edge for each gene-condition pair; the goal is to find dense subgraphs.

22 The SAMBA approach
Normalization: translate the gene expression (GE) matrix into a weighted bipartite graph, using a statistical model for the data.
Bicluster model: heavy subgraphs.
How to find biclusters: combined hashing and local optimization.
Redundancies: find many biclusters at once, filter them in post-processing.

23 From a statistical model to edge weights: a simple example
Background model: independent edges, each present with probability $p < 1/2$.
$H$: a subgraph with $n$ genes, $m$ conditions and $k$ edges.
P-value = tail of the binomial distribution:
$$p(H) = \sum_{k'=k}^{nm} \binom{nm}{k'} p^{k'} (1-p)^{nm-k'} \le 2^{nm} p^{k} (1-p)^{nm-k}$$
$$\log p(H) \le nm + k \log p + (nm - k) \log(1-p)$$
(logs are base 2).
Weight the graph: each edge gets $-(1 + \log p) > 0$, each non-edge gets $-(1 + \log(1-p)) < 0$. The subgraph weight is then a lower bound on $-\log$ p-value, so heavy subgraphs are significant.
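The bound can be checked numerically with a short sketch. The sign convention is an assumption made here so that heavier means more significant: edges weigh -(1 + log2 p) and non-edges weigh -(1 + log2(1 - p)). The function names and the example numbers are illustrative only.

```python
from math import comb, log2

def binomial_tail(n_cells, k, p):
    """P(X >= k) for X ~ Binomial(n_cells, p)."""
    return sum(comb(n_cells, j) * p**j * (1 - p)**(n_cells - j)
               for j in range(k, n_cells + 1))

def subgraph_weight(n, m, k, p):
    """Total weight of an n x m subgraph with k edges (assumed sign convention)."""
    edge_w = -1 - log2(p)          # positive when p < 1/2
    non_edge_w = -1 - log2(1 - p)  # negative when p < 1/2
    return k * edge_w + (n * m - k) * non_edge_w

# Example: 4 genes x 3 conditions, 10 of the 12 possible edges present.
n, m, k, p = 4, 3, 10, 0.2
weight = subgraph_weight(n, m, k, p)
neg_log_pvalue = -log2(binomial_tail(n * m, k, p))
```

Because the weight is a sum over individual edges and non-edges, searching for significant subgraphs turns into searching for heavy subgraphs.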

24 Limitations of the uniform probability model
Not all dense subgraphs are statistically significant.
Different genes/conditions have dissimilar noise characteristics.
Noisy genes/conditions have a high probability of forming dense subgraphs.
An extended model, the likelihood ratio: (bicluster random subgraph model) / (background random graph model).
Will show: the likelihood model translates to a sum of weights over edges and non-edges.

25 A degree-based random graph model
(Figure legend: low-prob edges, medium-prob edges, high-prob edges.)
Each edge $(u,v)$ occurs independently with probability $p(u,v)$.
$p(u,v)$ depends on the degrees of both $u$ and $v$.
$\Gamma = \{ G' = (U, V, E') : \deg(w, E') = \deg(w, E) \text{ for all } w \in U \cup V \}$: the set of degree-preserving graphs on the same node sets.
$p(u,v) = \Pr((u,v) \in E' \mid G' \in \Gamma)$, approximated using a Monte Carlo process.
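One common way to sample degree-preserving bipartite graphs is a chain of random edge swaps; the sketch below uses that approach as an assumption, since the slides only say "Monte Carlo process". The function name, the number of samples and the number of swaps between samples are all illustrative.

```python
import random
import numpy as np

def estimate_edge_probs(edges, n_u, n_v, n_samples=200, swaps_per_sample=50, seed=0):
    """Estimate p(u, v) as the fraction of sampled graphs that contain edge (u, v).

    `edges` is a list of distinct (u, v) index pairs with at least two entries.
    """
    rng = random.Random(seed)
    edge_list = list(edges)
    edge_set = set(edge_list)
    counts = np.zeros((n_u, n_v))
    for _ in range(n_samples):
        for _ in range(swaps_per_sample):
            i, j = rng.sample(range(len(edge_list)), 2)
            (u1, v1), (u2, v2) = edge_list[i], edge_list[j]
            a, b = (u1, v2), (u2, v1)
            if a in edge_set or b in edge_set:
                continue  # the swap would duplicate an existing edge
            # Swap endpoints: every node keeps its degree.
            edge_set -= {edge_list[i], edge_list[j]}
            edge_set |= {a, b}
            edge_list[i], edge_list[j] = a, b
        for u, v in edge_list:
            counts[u, v] += 1
    return counts / n_samples

P = estimate_edge_probs([(0, 0), (0, 1), (1, 0), (2, 2)], n_u=3, n_v=3)
```

Since every sampled graph has the same degrees as the input, the estimated probabilities in each row and column sum to the corresponding node degree.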

26 Likelihood ratio model
Bicluster model assumption: edges occur independently with probability $p_c$.
Likelihood ratio score of a bicluster $B = (U', V', E')$, where $(u,v)$ ranges over the pairs in $U' \times V'$:
$$L(B) = \prod_{(u,v) \in E'} \frac{p_c}{p(u,v)} \prod_{(u,v) \notin E'} \frac{1 - p_c}{1 - p(u,v)}$$
$$\log L(B) = \sum_{(u,v) \in E'} \log \frac{p_c}{p(u,v)} + \sum_{(u,v) \notin E'} \log \frac{1 - p_c}{1 - p(u,v)}$$
Subgraph weight = log likelihood ratio.
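The log likelihood ratio is a sum of per-pair weights, which a short NumPy sketch makes concrete. The function names, the matrix layout (rows for one node side, columns for the other) and the example probabilities are assumptions for illustration.

```python
import numpy as np

def samba_weights(adj, p_bg, p_c):
    """Per-pair weights: log(p_c / p) on edges, log((1 - p_c) / (1 - p)) on non-edges."""
    adj = np.asarray(adj, dtype=bool)
    p_bg = np.asarray(p_bg, dtype=float)
    return np.where(adj, np.log(p_c / p_bg), np.log((1 - p_c) / (1 - p_bg)))

def log_likelihood_ratio(weights, rows, cols):
    """Sum the weights over the sub-matrix induced by the chosen rows and columns."""
    return weights[np.ix_(sorted(rows), sorted(cols))].sum()

adj = np.array([[1, 1],
                [1, 0]])
p_bg = np.full((2, 2), 0.5)  # background edge probabilities p(u, v), made up
W = samba_weights(adj, p_bg, p_c=0.9)
score_full = log_likelihood_ratio(W, {0, 1}, {0, 1})
score_row0 = log_likelihood_ratio(W, {0}, {0, 1})
```

In this toy case dropping the row with the missing edge raises the score, which is the kind of move a local optimization step can exploit.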

27 Heaviest bipartite subgraph
Finding it is NP-complete (Dawande et al. 97, Hochbaum 98). (But: node biclique is polynomial!)
Assumption: the degree on the V side is bounded by d.
Start by finding heavy bicliques.
Alg: use hashing to discover heavy subsets of conditions.

28 Finding the heaviest biclique
Takes $O(n 2^d)$ time and space.
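The hashing idea can be sketched for the simplest case, where every edge weighs 1 and the weight of a biclique is its number of edges; that simplification and all names below are assumptions. Each gene of degree at most d contributes every non-empty subset of its neighborhood to a hash table, giving on the order of n 2^d entries.

```python
from itertools import combinations
from collections import defaultdict

def max_edge_biclique(neighbors, d):
    """neighbors: dict mapping each gene to its set of adjacent conditions.

    Returns (conditions, genes) of a biclique with the most edges among
    genes of degree at most d.
    """
    support = defaultdict(set)  # condition subset -> genes adjacent to all of it
    for gene, conds in neighbors.items():
        if len(conds) > d:
            continue  # bounded-degree assumption: skip high-degree genes
        for size in range(1, len(conds) + 1):
            for subset in combinations(sorted(conds), size):
                support[frozenset(subset)].add(gene)
    if not support:
        return set(), set()
    best_conds, best_genes = max(support.items(),
                                 key=lambda kv: len(kv[0]) * len(kv[1]))
    return set(best_conds), set(best_genes)

graph = {
    "g1": {"a", "b", "c"},
    "g2": {"a", "b", "c"},
    "g3": {"a", "b"},
    "g4": {"d"},
    "g5": {"a", "b"},
}
conds, genes = max_edge_biclique(graph, d=3)
```

Here the condition subset {a, b} is shared by four genes (8 edges), which beats {a, b, c} shared by two genes (6 edges).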

29 Using bicliques to find the heaviest biclusters
Assume: edge weight = 1, non-edge weight = -1.
Note: $w((U', V')) = \sum_{u \in U'} w(u, V')$.
Lemma: if $B = (U', V')$ is a maximum weight subgraph and $X \subseteq U'$, then there exists $v \in V'$ s.t. $|N(v) \cap X| \ge |X|/2$.
Pf: $0 < w((X, V')) = \sum_{v \in V'} |N(v) \cap X| - \sum_{v \in V'} |X \setminus N(v)| = \sum_{v \in V'} (2|N(v) \cap X| - |X|)$.
Corollary: if $B = (U', V')$ is a maximum weight subgraph, then $|U'| \le 2d$.

30 Using bicliques to find the heaviest biclusters
Assume: edge weight = 1, non-edge weight = -1.
Lemma: in a max weight subgraph $(U^*, V^*)$, for every $X \subseteq U^*$ there exists $Y \subseteq X$ with $|Y| \ge |X|/2$ s.t. $Y \subseteq N(v)$ for some $v \in V^*$.
Corollary: in a max weight subgraph $(U^*, V^*)$, $U^*$ can be covered by at most $\log(2d)$ sets, each contained in the neighborhood of some vertex in $V^*$.

31 Using bicliques to find the heaviest biclusters
The set of conditions in a maximal bicluster is the union of up to $\log(2d)$ subsets of gene neighborhoods.
Exhaustive $O((n 2^d)^{\log(2d)})$ time alg: hash bicliques, then enumerate all combinations of $\log(2d)$ subsets of neighborhoods N(v).
Can be generalized to arbitrary edge/non-edge weights.

32 SAMBA's implementation
Phase I: find heavy bicliques. For each gene of degree < d, hash all subsets of its neighbors of size 4-6.
Phase II: greedy expansion of the heaviest bicliques containing each gene/condition.
Phase III: filter overlapping biclusters.

33 Evaluating specificity
Suppose the conditions partition into k classes of sizes $c_1, \dots, c_k$, with $\sum_i c_i = m$.
A bic has b conditions, $b_i$ of them from class i.
If $b_t = \max_i b_i$, assign the bic to class t.
How good is the match of the bic to the classification?
Hypergeometric score (here $k$ is the summation index):
$$\Pr(B) = \sum_{k = b_t}^{b} \binom{c_t}{k} \binom{m - c_t}{b - k} \Big/ \binom{m}{b}$$
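The hypergeometric score can be computed directly with exact binomial coefficients. This is a minimal sketch; the function name and the example numbers are illustrative, and the sum is capped at min(b, c_t) because larger terms are zero.

```python
from math import comb

def hypergeom_tail(m, c_t, b, b_t):
    """Probability that a random set of b conditions (out of m) contains
    at least b_t conditions from a class of size c_t."""
    upper = min(b, c_t)
    total = sum(comb(c_t, j) * comb(m - c_t, b - j) for j in range(b_t, upper + 1))
    return total / comb(m, b)

# Example: 10 conditions, a class of size 4, and a bic with 3 conditions
# that all fall in that class.
score = hypergeom_tail(m=10, c_t=4, b=3, b_t=3)
```

A small score means the bic's conditions concentrate in one class far more than a random condition set of the same size would.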

34 Specificity test
GE data: Alizadeh et al. (00); 4026 genes, 96 human tissues; 9 classes of lymphoma, normal.
(Figure: fraction of biclusters vs. log(p-value) for SAMBA, Cheng-Church 00 and random; annotation: better fit to true classification.)

35 Specificity (2)
Generate a random bipartite graph with the same degree sequence as the Alizadeh data; compute biclusters; plot p-value and likelihood (weight).
(Figure: log likelihood vs. log p-value; + lymphoma data (Alizadeh et al.), x shuffled data.)

36 Heterogeneous data (Tanay, Sharan, Kupiec, Shamir, PNAS 04)
Levels: transcription level, protein level, phenotype level.
Data sources shown: two-hybrid, mRNA profiling, protein complex identification using mass spec, ChIP-chip, barcoded deletion libraries, synthetic lethality, and so many more.

37 Unified modeling of biological information: genes/proteins, properties, modules.

38 A heterogeneous collection of yeast genomic information
Gene expression: ~1000 conditions, 27 publications.
TF binding profiles: 110 profiles from growth on YPD (Lee et al.).
Phenotype profiles: 6 (30) profiles (Giaever et al.).
Two-hybrid interactions: ~1000 (Uetz et al.).
Protein complex interactions: ~4000 (Ho et al.).
MIPS interactions: ~

39 From experiments to properties
(Figure: experimental measurements for a gene g are translated into discrete properties p1, p2, ...)
Properties shown: strong / medium complex binding to protein P; strong induction, medium induction, medium repression, strong repression; strong / medium binding to TF T; high / medium sensitivity; high / medium confidence interaction.

40 A SAMBA module
(Figure: properties x genes, with GO annotations; labeled genes include CPA1 and CPA2.)

41 Modular organization in yeast
Ovals = modules; edges = module overlaps.
Map generated automatically by SAMBA.
Clustered organization: cluster = process.
Hierarchical bridges.

42 TF-function map

43 SAMBA Pros/Cons
Pros: fast; allows simultaneous normalization of genes and conditions; allows integration of heterogeneous data; well suited for query-based usage.
Cons: discretization; redundancies.

44 Biclustering: interim summary
A general data mining problem.
The key point: defining what a bicluster is.
Algorithms vary, depending on the nature of the bicluster model.
Open issues: What is the best objective / bic criterion? Searching for bics in really huge matrices. Handling redundancies.


Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a Week 9 Based in part on slides from textbook, slides of Susan Holmes Part I December 2, 2012 Hierarchical Clustering 1 / 1 Produces a set of nested clusters organized as a Hierarchical hierarchical clustering

More information

Graph Theory S 1 I 2 I 1 S 2 I 1 I 2

Graph Theory S 1 I 2 I 1 S 2 I 1 I 2 Graph Theory S I I S S I I S Graphs Definition A graph G is a pair consisting of a vertex set V (G), and an edge set E(G) ( ) V (G). x and y are the endpoints of edge e = {x, y}. They are called adjacent

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu SPAM FARMING 2/11/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 2/11/2013 Jure Leskovec, Stanford

More information

Integrated Analysis of Gene Expression and Other Data

Integrated Analysis of Gene Expression and Other Data Analysis of DNA Chips and Gene Networks Fall Semester, 2010 Lecture 14: January 21, 2010 Lecturer: Prof. Ron Shamir Scribe: David Ze evi Integrated Analysis of Gene Expression and Other Data 14.1 Introduction

More information

CS473-Algorithms I. Lecture 13-A. Graphs. Cevdet Aykanat - Bilkent University Computer Engineering Department

CS473-Algorithms I. Lecture 13-A. Graphs. Cevdet Aykanat - Bilkent University Computer Engineering Department CS473-Algorithms I Lecture 3-A Graphs Graphs A directed graph (or digraph) G is a pair (V, E), where V is a finite set, and E is a binary relation on V The set V: Vertex set of G The set E: Edge set of

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

Data mining, 4 cu Lecture 8:

Data mining, 4 cu Lecture 8: 582364 Data mining, 4 cu Lecture 8: Graph mining Spring 2010 Lecturer: Juho Rousu Teaching assistant: Taru Itäpelto Frequent Subgraph Mining Extend association rule mining to finding frequent subgraphs

More information

Clustering Using Graph Connectivity

Clustering Using Graph Connectivity Clustering Using Graph Connectivity Patrick Williams June 3, 010 1 Introduction It is often desirable to group elements of a set into disjoint subsets, based on the similarity between the elements in the

More information

Parsimony-Based Approaches to Inferring Phylogenetic Trees

Parsimony-Based Approaches to Inferring Phylogenetic Trees Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 www.biostat.wisc.edu/bmi576.html Mark Craven craven@biostat.wisc.edu Fall 0 Phylogenetic tree approaches! three general types! distance:

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

Machine Learning in Biology

Machine Learning in Biology Università degli studi di Padova Machine Learning in Biology Luca Silvestrin (Dottorando, XXIII ciclo) Supervised learning Contents Class-conditional probability density Linear and quadratic discriminant

More information

Dynamic Programming: Sequence alignment. CS 466 Saurabh Sinha

Dynamic Programming: Sequence alignment. CS 466 Saurabh Sinha Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha DNA Sequence Comparison: First Success Story Finding sequence similarities with genes of known function is a common approach to infer a newly

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Database Searching Using BLAST

Database Searching Using BLAST Mahidol University Objectives SCMI512 Molecular Sequence Analysis Database Searching Using BLAST Lecture 2B After class, students should be able to: explain the FASTA algorithm for database searching explain

More information

by conservation of flow, hence the cancelation. Similarly, we have

by conservation of flow, hence the cancelation. Similarly, we have Chapter 13: Network Flows and Applications Network: directed graph with source S and target T. Non-negative edge weights represent capacities. Assume no edges into S or out of T. (If necessary, we can

More information

CME 305: Discrete Mathematics and Algorithms Instructor: Reza Zadeh HW#3 Due at the beginning of class Thursday 03/02/17

CME 305: Discrete Mathematics and Algorithms Instructor: Reza Zadeh HW#3 Due at the beginning of class Thursday 03/02/17 CME 305: Discrete Mathematics and Algorithms Instructor: Reza Zadeh (rezab@stanford.edu) HW#3 Due at the beginning of class Thursday 03/02/17 1. Consider a model of a nonbipartite undirected graph in which

More information

Mean Square Residue Biclustering with Missing Data and Row Inversions

Mean Square Residue Biclustering with Missing Data and Row Inversions Mean Square Residue Biclustering with Missing Data and Row Inversions Stefan Gremalschi a, Gulsah Altun b, Irina Astrovskaya a, and Alexander Zelikovsky a a Department of Computer Science, Georgia State

More information

A Brief Look at Optimization

A Brief Look at Optimization A Brief Look at Optimization CSC 412/2506 Tutorial David Madras January 18, 2018 Slides adapted from last year s version Overview Introduction Classes of optimization problems Linear programming Steepest

More information

Integrating Probabilistic Reasoning with Constraint Satisfaction

Integrating Probabilistic Reasoning with Constraint Satisfaction Integrating Probabilistic Reasoning with Constraint Satisfaction IJCAI Tutorial #7 Instructor: Eric I. Hsu July 17, 2011 http://www.cs.toronto.edu/~eihsu/tutorial7 Getting Started Discursive Remarks. Organizational

More information

Data Mining Technologies for Bioinformatics Sequences

Data Mining Technologies for Bioinformatics Sequences Data Mining Technologies for Bioinformatics Sequences Deepak Garg Computer Science and Engineering Department Thapar Institute of Engineering & Tecnology, Patiala Abstract Main tool used for sequence alignment

More information

Addendum to the proof of log n approximation ratio for the greedy set cover algorithm

Addendum to the proof of log n approximation ratio for the greedy set cover algorithm Addendum to the proof of log n approximation ratio for the greedy set cover algorithm (From Vazirani s very nice book Approximation algorithms ) Let x, x 2,...,x n be the order in which the elements are

More information

Coverage Approximation Algorithms

Coverage Approximation Algorithms DATA MINING LECTURE 12 Coverage Approximation Algorithms Example Promotion campaign on a social network We have a social network as a graph. People are more likely to buy a product if they have a friend

More information

Sparse and large-scale learning with heterogeneous data

Sparse and large-scale learning with heterogeneous data Sparse and large-scale learning with heterogeneous data February 15, 2007 Gert Lanckriet (gert@ece.ucsd.edu) IEEE-SDCIS In this talk Statistical machine learning Techniques: roots in classical statistics

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2015 1 Sequence Alignment Dannie Durand Pairwise Sequence Alignment The goal of pairwise sequence alignment is to establish a correspondence between the

More information

Discrete mathematics , Fall Instructor: prof. János Pach

Discrete mathematics , Fall Instructor: prof. János Pach Discrete mathematics 2016-2017, Fall Instructor: prof. János Pach - covered material - Lecture 1. Counting problems To read: [Lov]: 1.2. Sets, 1.3. Number of subsets, 1.5. Sequences, 1.6. Permutations,

More information

Web Structure Mining Community Detection and Evaluation

Web Structure Mining Community Detection and Evaluation Web Structure Mining Community Detection and Evaluation 1 Community Community. It is formed by individuals such that those within a group interact with each other more frequently than with those outside

More information

Transfer String Kernel for Cross-Context Sequence Specific DNA-Protein Binding Prediction. by Ritambhara Singh IIIT-Delhi June 10, 2016

Transfer String Kernel for Cross-Context Sequence Specific DNA-Protein Binding Prediction. by Ritambhara Singh IIIT-Delhi June 10, 2016 Transfer String Kernel for Cross-Context Sequence Specific DNA-Protein Binding Prediction by Ritambhara Singh IIIT-Delhi June 10, 2016 1 Biology in a Slide DNA RNA PROTEIN CELL ORGANISM 2 DNA and Diseases

More information

CME 305: Discrete Mathematics and Algorithms Instructor: Reza Zadeh HW#3 Due at the beginning of class Thursday 02/26/15

CME 305: Discrete Mathematics and Algorithms Instructor: Reza Zadeh HW#3 Due at the beginning of class Thursday 02/26/15 CME 305: Discrete Mathematics and Algorithms Instructor: Reza Zadeh (rezab@stanford.edu) HW#3 Due at the beginning of class Thursday 02/26/15 1. Consider a model of a nonbipartite undirected graph in which

More information

Computational Molecular Biology

Computational Molecular Biology Computational Molecular Biology Erwin M. Bakker Lecture 3, mainly from material by R. Shamir [2] and H.J. Hoogeboom [4]. 1 Pairwise Sequence Alignment Biological Motivation Algorithmic Aspect Recursive

More information

Algorithms for Data Science

Algorithms for Data Science Algorithms for Data Science CSOR W4246 Eleni Drinea Computer Science Department Columbia University Thursday, October 1, 2015 Outline 1 Recap 2 Shortest paths in graphs with non-negative edge weights (Dijkstra

More information

Graph Mining: Overview of different graph models

Graph Mining: Overview of different graph models Graph Mining: Overview of different graph models Davide Mottin, Konstantina Lazaridou Hasso Plattner Institute Graph Mining course Winter Semester 2016 Lecture road Anomaly detection (previous lecture)

More information

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI

More information

Introduction to Computer Science

Introduction to Computer Science DM534 Introduction to Computer Science Clustering and Feature Spaces Richard Roettger: About Me Computer Science (Technical University of Munich and thesis at the ICSI at the University of California at

More information

Distribution-Free Models of Social and Information Networks

Distribution-Free Models of Social and Information Networks Distribution-Free Models of Social and Information Networks Tim Roughgarden (Stanford CS) joint work with Jacob Fox (Stanford Math), Rishi Gupta (Stanford CS), C. Seshadhri (UC Santa Cruz), Fan Wei (Stanford

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

Discrete Mathematics Course Review 3

Discrete Mathematics Course Review 3 21-228 Discrete Mathematics Course Review 3 This document contains a list of the important definitions and theorems that have been covered thus far in the course. It is not a complete listing of what has

More information

1 More stochastic block model. Pr(G θ) G=(V, E) 1.1 Model definition. 1.2 Fitting the model to data. Prof. Aaron Clauset 7 November 2013

1 More stochastic block model. Pr(G θ) G=(V, E) 1.1 Model definition. 1.2 Fitting the model to data. Prof. Aaron Clauset 7 November 2013 1 More stochastic block model Recall that the stochastic block model (SBM is a generative model for network structure and thus defines a probability distribution over networks Pr(G θ, where θ represents

More information

Review: Identification of cell types from single-cell transcriptom. method

Review: Identification of cell types from single-cell transcriptom. method Review: Identification of cell types from single-cell transcriptomes using a novel clustering method University of North Carolina at Charlotte October 12, 2015 Brief overview Identify clusters by merging

More information