Biclustering algorithms ISA and SAMBA

1 Biclustering algorithms: ISA and SAMBA
Slides with Amos Tanay

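2 Biclustering
Clustering is global: genes are partitioned according to a common expression pattern across all conditions. But genes have multiple functions, and conditions may be diverse.
A bicluster is a subset of genes together with a subset of conditions, which allows a finer, local analysis.
(Figure: expression matrix with genes on one axis and conditions on the other.)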

3 In this lecture
Two current biclustering methodologies:
Iterative Signature Algorithm (ISA): simple, randomized.
SAMBA: combinatorial basis, fast.
And maybe a little more.

4 What makes a biclustering algorithm?
A definition of what a bicluster is, with a score.
An algorithm for finding one bicluster.
An algorithm for finding all (or many) biclusters.
Important themes: normalization and redundancies.

5 The Iterative Signature Algorithm (ISA)
Developed at Naama Barkai's lab at WIS (J. Ihmels, S. Bergmann).
Motivation: a bicluster is a set of genes and a set of conditions that mutually define each other.
An approximate bicluster can be refined by stabilizing it.

6 Normalization
Can we normalize simultaneously for both gene-dependent and condition-dependent trends? In the ISA we do not try to.
Given a genes x conditions matrix $E$ with condition set $U$ and gene set $V$, define:
$E^C$: each condition normalized to mean 0 and standard deviation 1.
$E^G$: each gene normalized to mean 0 and standard deviation 1.
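
A minimal sketch of the two normalizations, assuming the matrix is stored as a list of gene rows and that the population standard deviation is used; the function names and the handling of constant vectors are illustrative choices, not taken from the slides.

```python
from statistics import mean, pstdev

def zscore(values):
    """Scale a vector to mean 0 and (population) standard deviation 1."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        # Assumed convention: a constant vector carries no signal.
        return [0.0 for _ in values]
    return [(x - mu) / sd for x in values]

def normalize_genes(E):
    """E^G: normalize every gene (row) across the conditions."""
    return [zscore(row) for row in E]

def normalize_conditions(E):
    """E^C: normalize every condition (column) across the genes."""
    cols = [zscore(list(col)) for col in zip(*E)]
    return [list(row) for row in zip(*cols)]

# Toy matrix: 3 genes (rows) x 3 conditions (columns).
E = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 9.0],
     [0.0, 1.0, 5.0]]
EG = normalize_genes(E)
EC = normalize_conditions(E)
```

Keeping two separately normalized copies avoids having to remove gene and condition trends at the same time: gene scores are read from one copy and condition scores from the other.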

7 What is a bicluster
Assume all columns are independent. What is the distribution of $\sum_{j \in U'} e^G_{ij}$ for a random condition set $U'$ and a gene $i$? Mean $= 0$, std $= \sqrt{|U'|}$.
The same holds for $\sum_{i \in V'} e^C_{ij}$ and a random gene set $V'$, with std $\sqrt{|V'|}$.
In a bicluster, we expect independence not to hold.

8 What is a bicluster (2)
Given a set of conditions $U' \subseteq U$, define: $\mathrm{ISA}(U') = \{ v \in V : \sum_{j \in U'} e^G_{vj} > T_G \sigma_{U'} \}$.
Given a set of genes $V' \subseteq V$, define: $\mathrm{ISA}(V') = \{ u \in U : \sum_{i \in V'} e^C_{iu} > T_C \sigma_{V'} \}$.
$T_G, T_C$ are threshold parameters; $\sigma_{U'}, \sigma_{V'}$ are standard deviations, estimated from the data.
A (perfect) bicluster is a pair $(U', V')$ such that $\mathrm{ISA}(V') = U'$ and $\mathrm{ISA}(U') = V'$.
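
The two set operators can be sketched as follows. This is an illustration under stated assumptions: the standard deviations are taken to be $\sqrt{|U'|}$ and $\sqrt{|V'|}$ (the independence value from the previous slide) instead of being estimated from the data, and the function names and toy matrix are hypothetical.

```python
from math import sqrt

def isa_genes(EG, conds, t_g):
    """ISA(U'): genes whose summed score over the conditions exceeds T_G * sigma."""
    if not conds:
        return frozenset()
    threshold = t_g * sqrt(len(conds))  # assumed sigma = sqrt(|U'|)
    return frozenset(i for i, row in enumerate(EG)
                     if sum(row[j] for j in conds) > threshold)

def isa_conds(EC, genes, t_c):
    """ISA(V'): conditions whose summed score over the genes exceeds T_C * sigma."""
    if not genes:
        return frozenset()
    threshold = t_c * sqrt(len(genes))  # assumed sigma = sqrt(|V'|)
    return frozenset(j for j in range(len(EC[0]))
                     if sum(EC[i][j] for i in genes) > threshold)

# Toy scores (3 genes x 4 conditions); the same matrix stands in for E^G and E^C.
M = [[1.5, 1.4, -0.2, -0.1],
     [1.3, 1.6,  0.1, -0.3],
     [-0.2, 0.1, 0.3, -0.4]]

genes = isa_genes(M, {0, 1}, t_g=1.0)
conds = isa_conds(M, genes, t_c=1.0)
is_perfect = (conds == {0, 1})  # (U', V') maps onto itself in both directions
```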

9 Searching for biclusters
Define a directed graph: nodes are condition subsets and gene subsets; there is an arc $X \to Y$ iff $\mathrm{ISA}(X) = Y$.
A bicluster is a cycle of two nodes, $U' \leftrightarrow V'$. An approximate bicluster is a larger cycle (but not too large).
Algorithm: start from a random or known gene set and apply ISA repeatedly until converging to an approximate bicluster: $V_i = \mathrm{ISA}(U_{i-1})$, $U_i = \mathrm{ISA}(V_i)$.
Converge at step $i$ when, for all $j$ with $i - m < j < i$, $|U_i \setminus U_j| / |U_i \cup U_j| < \varepsilon$.
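
The iteration and its stopping rule can be sketched as below. The sketch assumes $\sigma = \sqrt{\text{set size}}$ instead of a data-driven estimate, treats two empty sets as being at distance 0, and adds a `max_iter` safety cap; all names and default values are illustrative.

```python
from math import sqrt

def isa_genes(EG, conds, t_g):
    if not conds:
        return frozenset()
    thr = t_g * sqrt(len(conds))  # assumed sigma = sqrt(|U'|)
    return frozenset(i for i, row in enumerate(EG)
                     if sum(row[j] for j in conds) > thr)

def isa_conds(EC, genes, t_c):
    if not genes:
        return frozenset()
    thr = t_c * sqrt(len(genes))  # assumed sigma = sqrt(|V'|)
    return frozenset(j for j in range(len(EC[0]))
                     if sum(EC[i][j] for i in genes) > thr)

def set_distance(a, b):
    """|a \\ b| / |a u b|, with 0 for two empty sets (assumed convention)."""
    union = a | b
    return len(a - b) / len(union) if union else 0.0

def isa_iterate(EG, EC, seed_genes, t_g=1.0, t_c=1.0, m=3, eps=0.05, max_iter=100):
    """Alternate V_i = ISA(U_{i-1}), U_i = ISA(V_i) until U_i is stable."""
    if m < 2:
        raise ValueError("m must be at least 2")
    genes = frozenset(seed_genes)
    conds = isa_conds(EC, genes, t_c)          # U_0 from the seed gene set
    history = [conds]
    for _ in range(max_iter):
        genes = isa_genes(EG, conds, t_g)      # V_i
        conds = isa_conds(EC, genes, t_c)      # U_i
        recent = history[-(m - 1):]            # U_j for i - m < j < i
        history.append(conds)
        if len(recent) == m - 1 and all(set_distance(conds, u) < eps for u in recent):
            break
    return genes, conds

M = [[1.5, 1.4, -0.2, -0.1],
     [1.3, 1.6,  0.1, -0.3],
     [-0.2, 0.1, 0.3, -0.4]]
genes, conds = isa_iterate(M, M, seed_genes={0})
```

Starting from the single gene 0, the toy run settles on genes {0, 1} and conditions {0, 1}.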

10 ISA

11 Adding weights
Instead of sets, use vectors of gene weights and condition weights. The ISA operator is generalized to a matrix multiplication followed by a threshold function.
Set version: gene set -> compute averages over conditions -> compute Z-scores of the conditions -> keep the conditions that survive the threshold.
Weighted version: gene weights -> multiply by the gene expression matrix -> compute Z-scores of the conditions -> nullify weights below the threshold.
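
One weighted half-step (gene weights to condition weights) might look like the sketch below. It assumes the Z-score is taken across the resulting condition scores and that weights at or below the threshold are set to exactly 0; the function name and the toy data are hypothetical.

```python
from statistics import mean, pstdev

def weighted_step(weights, matrix, threshold):
    """Multiply a row-weight vector by the matrix, Z-score the column
    scores, and nullify every score that does not exceed the threshold."""
    n_cols = len(matrix[0])
    raw = [sum(w * row[j] for w, row in zip(weights, matrix))
           for j in range(n_cols)]
    mu, sd = mean(raw), pstdev(raw)
    if sd == 0:
        return [0.0] * n_cols  # assumed convention for a flat score vector
    z = [(x - mu) / sd for x in raw]
    return [s if s > threshold else 0.0 for s in z]

M = [[1.5, 1.4, -0.2, -0.1],
     [1.3, 1.6,  0.1, -0.3],
     [-0.2, 0.1, 0.3, -0.4]]
cond_weights = weighted_step([1.0, 1.0, 0.0], M, threshold=0.5)
```

The opposite half-step (condition weights to gene weights) would apply the same function to the transposed matrix.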

12 Handling redundancy
Starting from different seeds yields different fixed points (biclusters).
Using different thresholds changes the graph structure and gives more biclusters.
We need to filter similar solutions and report a short, non-redundant list of significant biclusters.

13 ISA applications
Start from a set of genes with a known functional annotation.
Start from genes with binding sites of a transcription factor.
Start from a set of sequence orthologs.
See: Ihmels et al., Nat. Genet. 2002; Bergmann et al., Phys. Rev. E 2003; Bergmann et al., PLoS

14 The basic signature algorithm (Nat. Genetics 02)

15 Using recurrence to evaluate solutions
A bad initial gene set will also lead to some module. How can we identify the good modules?
Idea: for an input gene set $A$ and a random set $R$ of other genes, compute $\mathrm{ISA}(A)$ and $\mathrm{ISA}(A \cup R)$ and compare them.
If $A$ represents part of a real transcription module, expect a large overlap between the resulting solutions.
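
The recurrence test can be sketched independently of how signatures are computed. Below, `signature` is any function from a gene set to a gene set; the overlap measure (intersection over union) and the toy `signature` used in the example are assumptions made for illustration, not definitions from the slides.

```python
import random

def overlap(a, b):
    """Intersection over union of two gene sets (assumed overlap measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def recurrence(signature, seed, universe, n_added, trials=20, rng=None):
    """Average overlap between signature(A) and signature(A u R) over random R."""
    rng = rng or random.Random(0)
    seed = frozenset(seed)
    reference = signature(seed)
    others = sorted(set(universe) - seed)
    scores = []
    for _ in range(trials):
        extra = frozenset(rng.sample(others, n_added))
        scores.append(overlap(reference, signature(seed | extra)))
    return sum(scores) / len(scores)

# Hypothetical stand-in for the signature algorithm: inputs containing at least
# two genes of a planted module are pulled to that module, others stay as given.
MODULE = frozenset({0, 1, 2, 3})
def toy_signature(genes):
    return MODULE if len(genes & MODULE) >= 2 else frozenset(genes)

good = recurrence(toy_signature, {0, 1, 2}, range(20), n_added=3)
bad = recurrence(toy_signature, {10, 11, 12}, range(20), n_added=3)
```

In this toy setting the seed taken from the planted module recurs perfectly, while the arbitrary seed does not.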

16 Using recurrence (2)
(a) A reference set of $N_{core}$ co-regulated genes was composed of genes encoding either ribosomal proteins (dashed lines) or proteins involved in amino acid biosynthesis (dashed/dotted line). The recurrent signature method was applied to this set as follows. First, a collection of input sets was derived by randomly adding genes to the reference set. Second, the signature algorithm was applied to the reference set and to the derived sets; this generates a reference signature and a collection of perturbed signatures, respectively. Last, the overlaps between the reference signature and the perturbed signatures were calculated. Shown is the average overlap as a function of the number of genes added to the reference set. The different lines correspond to different choices of $N_{core}$, shown in parentheses.
(b) The recurrent signature method was applied to three sequence-related reference sets. These sets include all of the genes that contain the binding sequences CGGN11CCG (for Gal4), TGACTC (for Gcn4) or TTN9GGAAA (for Mcm1) in a region of 600 bp upstream. Shown is the fraction of perturbed signatures whose overlap with the reference signature is greater than some threshold, as a function of this threshold. Note the large number of highly overlapping outputs for all three reference sets. By contrast, the profile corresponding to a random sequence is distinctly different, with no large overlaps. Thus, the recurrence profile gives a clear indication of whether a given sequence functions as a regulatory control element.

17 A global analysis in yeast
1000 expression profiles.
Applied the signature algorithm (SA) with the following input gene sets: all target sets of 6-mers, 7-mers and 8-mers (~86K sets); all functional groups in MIPS; all clusters in a hierarchical clustering of all genes.
Accepted only recurring modules.
Results: 86 modules covering 2241 genes.

18 Genes in most modules participate in a module-specific cellular process

19 ISA pros and cons
Pros: simple and quite fast; an elegant solution to the normalization problem; good empirical results in several cases.
Cons: threshold setting; finding good seeds; redundancies; non-normal behaviors.

20 SAMBA: Statistical and Algorithmic Method for Bicluster Analysis
Developed here (Tanay, Sharan, Shamir, Bioinformatics 02).
Outline:
Develop efficient combinatorial techniques for biclustering large datasets.
Employ a statistical model for biclusters.
Allow integration of heterogeneous data.

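21 The SAMBA model
Matrix view: the goal is to find high-similarity submatrices.
Graph view: the goal is to find dense subgraphs of the bipartite graph $G = (U, V, E)$.
(Figure: matrix and bipartite graph; legend: edge, no edge; axis label: conditions.)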

22 The SAMBA approach
Normalization: translate the gene expression (GE) matrix into a weighted bipartite graph, using a statistical model for the data.
Bicluster model: heavy subgraphs.
How to find biclusters: combined hashing and local optimization.
Redundancies: find many biclusters at once, then filter them in post-processing.

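23 From a statistical model to edge weights: a simple example
Background model: independent edges, each present with probability $p < 1/2$.
$H$: a subgraph with $n$ genes, $m$ conditions and $k$ edges.
P-value = tail of the binomial distribution:
$$p(H) = \sum_{k'=k}^{nm} \binom{nm}{k'} p^{k'} (1-p)^{nm-k'} \le 2^{nm} p^{k} (1-p)^{nm-k}$$
$$\log p(H) \le nm + k \log p + (nm - k) \log(1-p)$$
(logarithms to base 2).
Weight the graph: edges get $-(1 + \log p)$, non-edges get $-(1 + \log(1-p))$. Then the subgraph weight is $-\left(nm + k \log p + (nm-k)\log(1-p)\right) \le -\log p(H)$, so heavy subgraphs have small p-values.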
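
A small numeric check of this example, written as a sketch: it computes the exact binomial tail, the bound $2^{nm} p^k (1-p)^{nm-k}$, and the subgraph weight under the sign convention that heavy subgraphs are the significant ones (edges weighted $-(1+\log_2 p)$, non-edges $-(1+\log_2(1-p))$; this sign convention is an assumption). Function names and the numbers in the example are illustrative.

```python
from math import comb, log2

def binomial_tail(n, m, k, p):
    """Exact P(at least k of the n*m possible edges are present)."""
    nm = n * m
    return sum(comb(nm, j) * p**j * (1 - p)**(nm - j) for j in range(k, nm + 1))

def tail_bound(n, m, k, p):
    """Upper bound 2^(nm) * p^k * (1-p)^(nm-k), valid for p < 1/2."""
    nm = n * m
    return 2**nm * p**k * (1 - p)**(nm - k)

def subgraph_weight(n, m, k, p):
    """Sum of edge and non-edge weights (assumed sign: heavy = significant)."""
    edge_w = -(1 + log2(p))          # positive when p < 1/2
    non_edge_w = -(1 + log2(1 - p))  # negative when p < 1/2
    return k * edge_w + (n * m - k) * non_edge_w

n, m, k, p = 3, 4, 10, 0.2   # 3 genes, 4 conditions, 10 of 12 edges present
tail = binomial_tail(n, m, k, p)
bound = tail_bound(n, m, k, p)
weight = subgraph_weight(n, m, k, p)
```

The weight equals $-\log_2$ of the bound, so it never exceeds $-\log_2$ of the exact p-value.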

24 Limitations of the uniform probability model
Not all dense subgraphs are statistically significant. Different genes and conditions have dissimilar noise characteristics, and noisy genes or conditions have a high probability of forming dense subgraphs.
An extended likelihood ratio model: likelihood ratio = (bicluster random subgraph model) / (background random graph model).
Will show: the likelihood model translates to a sum of weights over edges and non-edges.

25 A degree-based random graph model
Each edge $(u,v)$ occurs independently with probability $p(u,v)$, which depends on the degrees of both $u$ and $v$.
$\Gamma = \{ G' = (U, V, E') : \deg(w, E') = \deg(w, E) \text{ for all } w \in U \cup V \}$ is the set of degree-preserving graphs on the same node sets.
$p(u,v) = \Pr((u,v) \in E' \mid G' \in \Gamma)$, approximated using a Monte Carlo process.
(Figure legend: low-probability, medium-probability and high-probability edges.)
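
The slide only says that $p(u,v)$ is approximated by a Monte Carlo process. One common way to sample degree-preserving bipartite graphs is a random edge-swap chain, sketched below; whether this matches the authors' procedure is an assumption, and the function name, sample counts and toy graph are illustrative.

```python
import random
from collections import Counter

def estimate_edge_probs(edges, n_samples=200, swaps_per_sample=50, seed=0):
    """Estimate p(u, v) as the fraction of sampled degree-preserving graphs
    that contain the edge (u, v). Edges are (condition, gene) pairs."""
    rng = random.Random(seed)
    edge_list = list(edges)
    edge_set = set(edge_list)
    counts = Counter()
    for _ in range(n_samples):
        for _ in range(swaps_per_sample):
            i = rng.randrange(len(edge_list))
            j = rng.randrange(len(edge_list))
            (u1, v1), (u2, v2) = edge_list[i], edge_list[j]
            new1, new2 = (u1, v2), (u2, v1)
            # Skip swaps that would change nothing or create a duplicate edge.
            if u1 == u2 or v1 == v2 or new1 in edge_set or new2 in edge_set:
                continue
            edge_set -= {edge_list[i], edge_list[j]}
            edge_set |= {new1, new2}
            edge_list[i], edge_list[j] = new1, new2
        counts.update(edge_set)
    return {e: c / n_samples for e, c in counts.items()}

edges = [("c1", "g1"), ("c1", "g2"), ("c2", "g1"), ("c3", "g3")]
probs = estimate_edge_probs(edges)
deg_c1 = sum(p for (u, _), p in probs.items() if u == "c1")
```

Every swap keeps all node degrees fixed, so the estimated probabilities around a node sum to its degree; here `deg_c1` stays at 2 up to floating-point rounding.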

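26 Likelihood ratio model
Bicluster model assumption: edges occur independently with probability $p_c$.
Likelihood ratio score of a bicluster $B = (U', V', E')$, where $\bar{E'}$ denotes the non-edges of $B$:
$$L(B) = \frac{\prod_{(u,v) \in E'} p_c \prod_{(u,v) \in \bar{E'}} (1 - p_c)}{\prod_{(u,v) \in E'} p(u,v) \prod_{(u,v) \in \bar{E'}} (1 - p(u,v))} = \prod_{(u,v) \in E'} \frac{p_c}{p(u,v)} \prod_{(u,v) \in \bar{E'}} \frac{1 - p_c}{1 - p(u,v)}$$
$$\log L(B) = \sum_{(u,v) \in E'} \log \frac{p_c}{p(u,v)} + \sum_{(u,v) \in \bar{E'}} \log \frac{1 - p_c}{1 - p(u,v)}$$
Subgraph weight = log likelihood ratio.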
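
As a sketch of how the log likelihood ratio turns into per-pair weights: each edge contributes $\log(p_c / p(u,v))$ and each non-edge contributes $\log((1-p_c)/(1-p(u,v)))$. The value $p_c = 0.9$, the uniform background probabilities and all names below are assumptions chosen for the example.

```python
from math import log

def edge_weight(p_uv, p_c):
    """Contribution of a present edge to log L(B)."""
    return log(p_c / p_uv)

def non_edge_weight(p_uv, p_c):
    """Contribution of a missing edge to log L(B)."""
    return log((1 - p_c) / (1 - p_uv))

def bicluster_weight(conds, genes, edges, prob, p_c=0.9):
    """Sum the weights over all (condition, gene) pairs of the subgraph."""
    total = 0.0
    for u in conds:
        for v in genes:
            p = prob[(u, v)]
            if (u, v) in edges:
                total += edge_weight(p, p_c)
            else:
                total += non_edge_weight(p, p_c)
    return total

conds, genes = ["c1", "c2"], ["g1", "g2"]
edges = {("c1", "g1"), ("c1", "g2"), ("c2", "g1")}   # one pair is missing
prob = {(u, v): 0.1 for u in conds for v in genes}    # background p(u, v)
w = bicluster_weight(conds, genes, edges, prob)
```

With these numbers each edge adds $\log 9$ and the single non-edge subtracts $\log 9$, so the weight is $2 \log 9$.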

27 Heaviest bipartite subgraph: NP-complete (Dawande et al. 97, Hochbaum 98). (But: the node biclique problem is polynomial!)
Assumption: the degree on the $V$ side is bounded by $d$.
Start by finding heavy bicliques.
Algorithm: use hashing to discover heavy subsets of conditions.

28 Finding the heaviest biclique
Takes $O(n 2^d)$ time and space.
GE Ron Shamir

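29 Using bicliques to find the heaviest biclusters
Assume: edge weight $= 1$, non-edge weight $= -1$.
Note: $w((U', V')) = \sum_{u \in U'} \sum_{v \in V'} w(u, v)$.
Lemma: if $B = (U', V')$ is a maximum weight subgraph and $X \subseteq U'$, then there exists $v \in V'$ such that $|N(v) \cap X| \ge |X|/2$.
Proof: $0 < w((X, V')) = \sum_{v \in V'} \left( |N(v) \cap X| - |X \setminus N(v)| \right) = \sum_{v \in V'} \left( 2|N(v) \cap X| - |X| \right)$, so at least one term of the sum is positive.
Corollary: if $B = (U', V')$ is a maximum weight subgraph, then $|U'| \le 2d$ (take $X = U'$ and use $|N(v)| \le d$).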

30 Using bicliques to find the heaviest biclusters
Assume: edge weight $= 1$, non-edge weight $= -1$.
Lemma: in a maximum weight subgraph $(U^*, V^*)$, for every $X \subseteq U^*$ there exists $Y \subseteq X$ with $|Y| \ge |X|/2$ such that $Y \subseteq N(v)$ for some $v \in V^*$.
Corollary: in a maximum weight subgraph $(U^*, V^*)$, $U^*$ can be covered by at most $\log(2d)$ sets, each contained in the neighborhood of some vertex in $V^*$.

31 Using bicliques to find the heaviest biclusters
The set of conditions in a maximal bicluster is the union of up to $\log(2d)$ subsets of gene neighborhoods.
Exhaustive $O((n 2^d)^{\log(2d)})$ time algorithm: hash the bicliques, then enumerate all combinations of $\log(2d)$ subsets of neighborhoods $N(v)$.
Can be generalized to arbitrary edge and non-edge weights.

32 SAMBA's implementation
Phase I: find heavy bicliques. For each gene of degree < d, hash all subsets of its neighbors of size 4-6.
Phase II: greedy expansion of the heaviest bicliques containing each gene or condition.
Phase III: filter overlapping biclusters.
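
Phase I can be illustrated with a plain dictionary as the hash table: every small subset of a gene's neighboring conditions is a key, and the genes that share that subset accumulate under it, forming a biclique. This is a simplified sketch; the subset sizes are shrunk to 2-3 to suit the toy data, and the function name and input format are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def hash_bicliques(neighbors, d, sizes=(2, 3)):
    """Map each small condition subset to the set of genes adjacent to all of it.

    neighbors: dict gene -> set of conditions (the gene's neighborhood).
    Genes of degree >= d are skipped, following the slide's 'deg < d'.
    """
    table = defaultdict(set)
    for gene, conds in neighbors.items():
        if len(conds) >= d:
            continue
        for size in sizes:
            for subset in combinations(sorted(conds), size):
                table[subset].add(gene)
    return table

neighbors = {
    "g1": {"a", "b", "c"},
    "g2": {"a", "b", "c"},
    "g3": {"a", "b"},
    "g4": {"q", "w", "x", "y", "z"},   # degree 5, skipped when d = 5
}
table = hash_bicliques(neighbors, d=5)
# With unit edge weights, a biclique's weight is (#conditions) * (#genes).
weights = {key: len(key) * len(genes) for key, genes in table.items()}
```

Restricting the enumeration to small subsets keeps the table manageable; the greedy expansion of Phase II is what grows these seeds into full biclusters.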

33 Evaluating specificity
Suppose the conditions partition into $k$ classes of sizes $c_1, \ldots, c_k$, with $\sum_i c_i = m$.
A bicluster has $b$ conditions, $b_i$ of them from class $i$. If $b_t = \max_i b_i$, assign the bicluster to class $t$.
How good is the match of the bicluster to the classification? Hypergeometric score:
$$\Pr(B) = \sum_{j=b_t}^{b} \frac{\binom{c_t}{j} \binom{m - c_t}{b - j}}{\binom{m}{b}}$$
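
The hypergeometric score is a direct tail sum and can be computed exactly with integer binomials. A minimal sketch, with a hypothetical function name and made-up example numbers:

```python
from math import comb

def hypergeometric_score(m, c_t, b, b_t):
    """Probability that a random set of b conditions (out of m) contains
    at least b_t conditions from a class of size c_t."""
    upper = min(b, c_t)
    hits = sum(comb(c_t, j) * comb(m - c_t, b - j) for j in range(b_t, upper + 1))
    return hits / comb(m, b)

# 10 conditions, a class of size 4, and a bicluster whose 3 conditions
# all come from that class.
score = hypergeometric_score(m=10, c_t=4, b=3, b_t=3)
```

Smaller scores mean the bicluster's conditions are concentrated in one class more than chance would explain.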

34 Specificity test
GE data: Alizadeh et al. (00): 4026 genes, 96 human tissues; 9 classes of lymphoma, normal.
(Figure: fraction of biclusters versus log(p-value) for SAMBA, Cheng-Church 00 and random; annotation: better fit to the true classification.)

35 Specificity (2)
Generate a random bipartite graph with the same degree sequence as the Alizadeh data, compute biclusters, and plot p-value against likelihood (weight).
(Figure: log likelihood versus log p-value; + lymphoma data (Alizadeh et al.), x shuffled data.)

36 Heterogeneous data
Tanay, Sharan, Kupiec, Shamir, PNAS 04.
Levels: transcription level, protein level, phenotype level.
Data types: mRNA profiling, ChIP-chip, two-hybrid, protein complexes (identification using mass spec), barcoded deletion libraries, synthetic lethality, and so many more.

37 Unified modeling of biological information
(Figure labels: genes/proteins, properties, modules.)

38 A heterogeneous collection of yeast genomic information
Gene expression: ~1000 conditions, 27 publications.
TF binding profiles: 110 profiles from growth on YPD (Lee et al.).
Phenotype profiles: 6 (30) profiles (Giaever et al.).
Two-hybrid interactions: ~1000 (Uetz et al.).
Protein complex interactions: ~4000 (Ho et al.).
MIPS interactions: ~

39 From experiments to properties
(Figure: each experiment is translated into discrete properties of a gene g.)
Four-level properties (p1 to p4): strong induction, medium induction, medium repression, strong repression.
Two-level properties (p1, p2): strong or medium complex binding to protein P; strong or medium binding to TF T; high or medium sensitivity; high-confidence or medium-confidence interaction.

40 A SAMBA module
(Figure labels: properties, genes, GO annotations; genes shown include CPA1 and CPA2.)

41 Modular organization in yeast
Ovals = modules; edges = module overlaps. Map generated automatically by SAMBA.
Clustered organization: cluster = process. Hierarchical bridges.

42 TF-function map

43 SAMBA pros and cons
Pros: fast; allows simultaneous normalization of genes and conditions; allows integration of heterogeneous data; well suited for query-based usage.
Cons: discretization; redundancies.

44 Biclustering: interim summary
A general data mining problem.
The key point: defining what a bicluster is.
Algorithms vary, depending on the nature of the bicluster model.
Open issues: What is the best objective or bicluster criterion? Searching for biclusters in really huge matrices. Handling redundancies.
