Biclustering algorithms ISA and SAMBA
1 Biclustering algorithms: ISA and SAMBA. Slides with Amos Tanay.
2 Biclustering. Clusters are global: a partition of the genes according to a common expression pattern across all conditions. But genes have multiple functions, and conditions may be diverse. Bicluster: a subset of genes together with a subset of conditions, giving a finer, local analysis.
3 In this lecture. Two current biclustering methodologies: the Iterative Signature Algorithm (ISA), which is simple and randomized; and SAMBA, which has a combinatorial basis and is fast. And maybe a little more.
4 What makes a biclustering algorithm? (1) A definition of what a bicluster is, with a score. (2) An algorithm for finding one bicluster. (3) An algorithm for finding all (or many) biclusters. Important themes: normalization and redundancies.
5 The Iterative Signature Algorithm (ISA). Developed at Naama Barkai's lab at WIS (J. Ihmels, S. Bergmann). Motivation: a bicluster is a set of genes and a set of conditions that mutually define each other. It is possible to refine an approximate bicluster by stabilizing it.
6 Normalization. Can we normalize simultaneously for both gene-dependent and condition-dependent trends? In the ISA we do not try to. Given a genes x conditions matrix E with condition set U and gene set V, define: $E^C$, in which each condition is normalized to mean 0 and std 1; and $E^G$, in which each gene is normalized to mean 0 and std 1.
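The two normalizations can be sketched in a few lines of NumPy. This is a minimal illustration on a made-up matrix, not the authors' code; the helper name `zscore` and the choice to leave constant rows or columns at zero are assumptions.

```python
import numpy as np

def zscore(E, axis):
    """Scale E to zero mean and unit std along `axis` (hypothetical helper)."""
    mean = E.mean(axis=axis, keepdims=True)
    std = E.std(axis=axis, keepdims=True)
    std[std == 0] = 1.0  # assumption: constant rows/columns are left at zero
    return (E - mean) / std

rng = np.random.default_rng(0)
E = rng.normal(size=(6, 4))   # toy matrix: 6 genes (rows) x 4 conditions (columns)

E_C = zscore(E, axis=0)       # each condition (column) normalized over the genes
E_G = zscore(E, axis=1)       # each gene (row) normalized over the conditions
```

Keeping two separately normalized copies, rather than one doubly normalized matrix, is what lets each ISA step score genes and conditions on their own scale.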
7 What is a bicluster? Assume all columns are independent. What is the distribution of $\sum_{j \in U'} e^G_{ij}$ for a random condition set $U' \subseteq U$ and a gene $i$? Mean = 0, std = $\sqrt{|U'|}$. The same holds for $\sum_{i \in V'} e^C_{ij}$ and a random gene set $V' \subseteq V$, with std = $\sqrt{|V'|}$. In a bicluster, we expect independence not to hold.
8 What is a bicluster (2). Given a set of conditions $U'$, define $\mathrm{ISA}(U') = \{v \in V : \sum_{j \in U'} e^G_{vj} > T_G \sigma_{U'}\}$. Given a set of genes $V'$, define $\mathrm{ISA}(V') = \{u \in U : \sum_{i \in V'} e^C_{iu} > T_C \sigma_{V'}\}$. $T_G, T_C$ are threshold parameters; $\sigma_{U'}, \sigma_{V'}$ are standard deviations, estimated from the data. A (perfect) bicluster is a pair $(U', V')$ such that $\mathrm{ISA}(V') = U'$ and $\mathrm{ISA}(U') = V'$.
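A single application of each ISA operator might look as follows. This is a sketch under stated assumptions: the function names are hypothetical, sigma is estimated as the empirical spread of the summed scores (the slide only says it is estimated from the data), and the toy matrix with a planted block is invented for illustration.

```python
import numpy as np

def zscore(E, axis):
    mean = E.mean(axis=axis, keepdims=True)
    std = E.std(axis=axis, keepdims=True)
    std[std == 0] = 1.0  # assumption: constant rows/columns stay at zero
    return (E - mean) / std

def isa_genes(E_G, conds, t_g):
    """Genes whose summed score over the condition set exceeds t_g * sigma."""
    scores = E_G[:, sorted(conds)].sum(axis=1)
    sigma = scores.std()  # assumption: sigma taken as the spread of the scores
    return set(np.flatnonzero(scores > t_g * sigma).tolist())

def isa_conds(E_C, genes, t_c):
    """Conditions whose summed score over the gene set exceeds t_c * sigma."""
    scores = E_C[sorted(genes), :].sum(axis=0)
    sigma = scores.std()
    return set(np.flatnonzero(scores > t_c * sigma).tolist())

# Toy data: 50 genes x 20 conditions with a planted block (genes 0-9, conditions 0-4).
E = np.zeros((50, 20))
E[:10, :5] = 1.0
E_G, E_C = zscore(E, axis=1), zscore(E, axis=0)

U = {0, 1, 2, 3, 4}
V = isa_genes(E_G, U, t_g=2.0)           # genes selected by the condition set
U_back = isa_conds(E_C, V, t_c=2.0)      # conditions selected by those genes
is_perfect = (U_back == U)               # fixed point in both directions
```

On this noise-free block the pair is a fixed point of both operators, which is exactly the "perfect bicluster" condition; on real data one would expect only approximate stability.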
9 Searching for biclusters. Define a directed graph: nodes = condition subsets and gene subsets; there is an arc $X \to Y$ iff $\mathrm{ISA}(X) = Y$. A bicluster is a cycle of two nodes $U' \leftrightarrow V'$. An approximate bicluster is a larger cycle (but not too large). Algorithm: start from a random or known gene set and apply ISA repeatedly until converging to an approximate bicluster: $V_i = \mathrm{ISA}(U_{i-1})$, $U_i = \mathrm{ISA}(V_i)$. Declare convergence at iteration $i$ when, for all $j > i - m$, $|U_i \setminus U_j| / |U_i \cup U_j| < \varepsilon$.
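The iteration and its stopping rule can be sketched as below. The names `select` and `isa_search`, the default parameter values, and the empirical estimate of sigma are assumptions made for the example; the toy matrix is invented.

```python
import numpy as np

def zscore(E, axis):
    mean = E.mean(axis=axis, keepdims=True)
    std = E.std(axis=axis, keepdims=True)
    std[std == 0] = 1.0
    return (E - mean) / std

def select(Z, idx, axis, t):
    """Sum the rows (axis=0) or columns (axis=1) of Z listed in idx,
    then keep the entries scoring above t * std of the sums."""
    scores = np.take(Z, sorted(idx), axis=axis).sum(axis=axis)
    return set(np.flatnonzero(scores > t * scores.std()).tolist())

def isa_search(E, seed_genes, t_g=2.0, t_c=2.0, eps=0.05, m=3, max_iter=100):
    E_G, E_C = zscore(E, axis=1), zscore(E, axis=0)
    genes, history = set(seed_genes), []
    for _ in range(max_iter):
        conds = select(E_C, genes, axis=0, t=t_c)   # conditions from genes
        if not conds:
            return set(), set()
        genes = select(E_G, conds, axis=1, t=t_g)   # genes from conditions
        if not genes:
            return set(), set()
        history.append(conds)
        recent = history[-m:]
        # Converged when the last m condition sets differ by less than eps.
        if len(recent) == m and all(
            len(conds - u) / len(conds | u) < eps for u in recent
        ):
            break
    return genes, conds

E = np.zeros((50, 20))
E[:10, :5] = 1.0                                    # planted block
genes, conds = isa_search(E, seed_genes={0, 1, 2})  # seed is only part of the block
```

Starting from three of the ten planted genes, the first pass recovers the block's conditions, the second recovers all ten genes, and further passes leave both sets unchanged.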
10 ISA
11 Adding weights. Instead of sets, use vectors of gene weights and condition weights. The ISA operator is generalized to a matrix multiplication followed by a threshold function. With sets: gene set -> compute averages over the conditions -> compute z-scores of the conditions -> keep the conditions that survive the threshold. With weights: gene weights -> multiply by the gene expression matrix -> compute z-scores of the conditions -> nullify weights below the threshold.
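In vector form, one round of the weighted operator is a matrix product followed by z-scoring and thresholding. The sketch below assumes hypothetical names (`threshold_z`), an invented toy matrix, and a threshold of 1.5 chosen only so the small example converges.

```python
import numpy as np

def zscore(E, axis):
    mean = E.mean(axis=axis, keepdims=True)
    std = E.std(axis=axis, keepdims=True)
    std[std == 0] = 1.0
    return (E - mean) / std

def threshold_z(scores, t):
    """Z-score a score vector and nullify the entries that fall below t."""
    sd = scores.std()
    if sd == 0:
        return np.zeros_like(scores)
    z = (scores - scores.mean()) / sd
    return np.where(z > t, z, 0.0)

E = np.zeros((50, 20))
E[:10, :5] = 1.0                       # planted block: genes 0-9, conditions 0-4
E_G, E_C = zscore(E, axis=1), zscore(E, axis=0)

gene_w = np.zeros(50)
gene_w[:3] = 1.0                       # seed: unit weight on three block genes
for _ in range(5):
    cond_w = threshold_z(E_C.T @ gene_w, t=1.5)   # gene weights -> condition weights
    gene_w = threshold_z(E_G @ cond_w, t=1.5)     # condition weights -> gene weights
```

Compared with the set version, surviving genes and conditions keep a graded weight, so strongly participating members contribute more to the next round.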
12 Handling redundancy. Starting from different seeds yields different fixed points (biclusters). Using different thresholds changes the graph structure and gives more biclusters. We need to filter similar solutions and report a short, non-redundant list of significant biclusters.
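One simple way to filter similar solutions is a greedy pass over the biclusters in decreasing score order. The slides do not specify the filter used, so everything here is an assumption: the similarity measure (Jaccard overlap of matrix cells), the cutoff `max_overlap`, and the function names.

```python
def cell_jaccard(a, b):
    """Jaccard overlap of two biclusters, counted in matrix cells."""
    (g1, c1), (g2, c2) = a, b
    inter = len(g1 & g2) * len(c1 & c2)
    union = len(g1) * len(c1) + len(g2) * len(c2) - inter
    return inter / union if union else 1.0

def filter_redundant(bics, max_overlap=0.5):
    """bics: list of (score, genes, conds). Keep the best-scoring biclusters
    that do not overlap an already kept one by more than max_overlap."""
    kept = []
    for score, genes, conds in sorted(bics, key=lambda b: -b[0]):
        if all(cell_jaccard((genes, conds), (g, c)) <= max_overlap
               for _, g, c in kept):
            kept.append((score, genes, conds))
    return kept

bics = [
    (10, {1, 2, 3}, {1, 2}),
    (9, {1, 2, 3}, {1, 2, 3}),   # shares 6 of 9 cells with the first one
    (5, {7, 8}, {9}),
]
kept = filter_redundant(bics)
```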
13 ISA: applications. Start from sets of genes with a known functional annotation; from genes with binding sites of a transcription factor; or from a set of sequence orthologs. See: Ihmels et al. Nat Gen 2002, Bergmann et al. Phy Rev Letter 2003, Bergmann et al. PLoS
14 The basic signature algorithm (Nat. Genetics 02)
15 Using recurrence to evaluate solutions. A bad initial gene set will also lead to some module. How can we identify the good modules? Idea: for an input gene set A and a random set R of other genes, apply ISA(A) and ISA($R \cup A$) and compare the results. If A represents part of a real transcription module, we expect a large overlap between the resulting solutions.
16 Using recurrence (2). a, A reference set of Ncore co-regulated genes was composed of genes encoding either ribosomal proteins (dashed lines) or proteins involved in amino acid biosynthesis (dashed/dotted line). The recurrent signature method was applied to this set as follows. First, a collection of input sets was derived by randomly adding genes to the reference set. Second, the signature algorithm was applied to the reference set and to the derived sets; this generates a reference signature and a collection of perturbed signatures, respectively. Last, the overlaps between the reference signature and the perturbed signatures were calculated. Shown is the average overlap as a function of the number of genes added to the reference set. The different lines correspond to different choices of Ncore, shown in parentheses. b, The recurrent signature method was applied to three sequence-related reference sets. These sets include all of the genes that contain the binding sequences CGGN11CCG (for Gal4), TGACTC (for Gcn4) or TTN9GGAAA (for Mcm1) in a region of 600 bp upstream. Shown is the fraction of perturbed signatures whose overlap with the reference signature is greater than some threshold, as a function of this threshold. Note the large number of highly overlapping outputs for all three reference sets. By contrast, the profile corresponding to a random sequence is distinctly different, with no large overlaps. Thus, the recurrence profile gives a clear indication of whether a given sequence functions as a regulatory control element.
17 A global analysis in yeast. 1000 expression profiles. Applied the SA with these input gene sets: all target sets of 6-mers, 7-mers and 8-mers (~86K sets); all functional groups in MIPS; all clusters in a hierarchical clustering of all genes. Accepted only recurring modules. Results: 86 modules covering 2241 genes.
18 Genes in most modules participate in a module-specific cellular process.
19 ISA pros/cons. Pros: simple and quite fast; elegant solution to the normalization problem; good empirical results in several cases. Cons: threshold setting; finding good seeds; redundancies; non-normal behaviors.
20 SAMBA: Statistical and Algorithmic Method for Bicluster Analysis. Developed here (Tanay, Sharan, Shamir, Bioinformatics 02). Outline: develop efficient combinatorial techniques for biclustering large datasets; employ a statistical model for biclusters; allow integration of heterogeneous data.
21 The SAMBA model. [Figure labels: conditions; edge; no edge.] Goal on the matrix: find high-similarity submatrices. Goal on the graph G=(U,V,E): find dense subgraphs.
22 The SAMBA approach. Normalization: translate the GE matrix to a weighted bipartite graph using a statistical model for the data. Bicluster model: heavy subgraphs. How to find biclusters: combined hashing and local optimization. Redundancies: find many biclusters at once, then filter them in post-processing.
23 From a statistical model to edge weights: a simple example. Background model: independent edges, each present with probability p < 1/2. Let H be a subgraph with n genes, m conditions and k edges. P-value = tail of the binomial distribution: $p(H) = \sum_{k' \ge k} \binom{nm}{k'} p^{k'} (1-p)^{nm-k'} \le 2^{nm} p^{k} (1-p)^{nm-k}$, so $\log p(H) \le nm + k \log p + (nm-k) \log(1-p)$ (logs base 2). Weight the graph: edges get $-(1+\log p)$, non-edges get $-(1+\log(1-p))$. The subgraph weight is then minus this bound on the log p-value, so heavy subgraphs are the significant ones.
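The relation between the binomial tail, its bound, and the edge weights can be checked numerically. This sketch assumes base-2 logarithms and the sign convention in which edges receive positive weight, so that the subgraph weight equals minus the bound; the function names and the example numbers are invented.

```python
from math import comb, log2

def log2_pvalue(n, m, k, p):
    """Exact log2 of the binomial tail: at least k edges among n*m pairs."""
    cells = n * m
    tail = sum(comb(cells, j) * p**j * (1 - p)**(cells - j)
               for j in range(k, cells + 1))
    return log2(tail)

def log2_bound(n, m, k, p):
    """Upper bound nm + k log p + (nm - k) log(1 - p), valid for p < 1/2."""
    cells = n * m
    return cells + k * log2(p) + (cells - k) * log2(1 - p)

def subgraph_weight(n, m, k, p):
    """Sum of per-pair weights (assumed sign convention: edges positive)."""
    edge_w = -1 - log2(p)          # positive when p < 1/2
    non_edge_w = -1 - log2(1 - p)  # negative when p < 1/2
    return k * edge_w + (n * m - k) * non_edge_w

n, m, k, p = 3, 4, 10, 0.1         # toy subgraph: 10 of 12 possible edges present
exact = log2_pvalue(n, m, k, p)
bound = log2_bound(n, m, k, p)
weight = subgraph_weight(n, m, k, p)
```

Because the weight is a plain sum over pairs, searching for a significant subgraph under this model reduces to searching for a heavy one.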
24 Limitations of the uniform probability model. Not all dense subgraphs are statistically significant. Different genes/conditions have dissimilar noise characteristics. Noisy genes/conditions have a high probability of forming dense subgraphs. An extended likelihood ratio model: likelihood ratio = (bicluster random subgraph model) / (background random graph model). Will show: the likelihood model translates to a sum of weights over edges and non-edges.
25 A degree-based random graph model. [Figure legend: low-prob edges, medium-prob edges, high-prob edges.] Each edge (u,v) occurs independently with probability p(u,v), which depends on the degrees of both u and v. Let $\Gamma = \{G' = (U,V,E') : \deg(w,E') = \deg(w,E) \text{ for all } w \in U \cup V\}$ be the set of degree-preserving graphs on the same node sets. Then $p(u,v) = \Pr((u,v) \in E' \mid G' \in \Gamma)$, approximated using a Monte Carlo process.
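The slide does not say which Monte Carlo process is used. A common choice, shown here purely as an assumption, is to randomize the graph with degree-preserving edge swaps and record how often each pair appears as an edge. Function names, sample counts, and the toy graph are all illustrative.

```python
import random
from collections import Counter

def swap_once(edges, edge_list, rng):
    """Try one degree-preserving swap: (u1,v1),(u2,v2) -> (u1,v2),(u2,v1)."""
    i, j = rng.randrange(len(edge_list)), rng.randrange(len(edge_list))
    (u1, v1), (u2, v2) = edge_list[i], edge_list[j]
    if u1 == u2 or v1 == v2:
        return
    if (u1, v2) in edges or (u2, v1) in edges:
        return  # swap would create a duplicate edge
    edges.difference_update([(u1, v1), (u2, v2)])
    edges.update([(u1, v2), (u2, v1)])
    edge_list[i], edge_list[j] = (u1, v2), (u2, v1)

def estimate_edge_probs(edges, n_samples=200, swaps_per_sample=50, seed=0):
    """Estimate p(u,v) as the fraction of sampled graphs containing (u,v)."""
    rng = random.Random(seed)
    current = set(edges)
    edge_list = list(current)
    counts = Counter()
    for _ in range(n_samples):
        for _ in range(swaps_per_sample):
            swap_once(current, edge_list, rng)
        counts.update(current)
    return {e: c / n_samples for e, c in counts.items()}

edges = [(0, "a"), (0, "b"), (1, "a"), (2, "c")]   # toy bipartite graph
probs = estimate_edge_probs(edges)
```

Since every sampled graph keeps every node's degree, the estimated probabilities around a node always sum to that node's degree, which is a convenient sanity check.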
26 Likelihood ratio model. Bicluster model assumption: edges occur independently with probability $p_c$. Likelihood ratio score of a subgraph $B = (U', V', E')$, with products and sums taken over pairs $(u,v) \in U' \times V'$: $L(B) = \prod_{(u,v) \in E'} \frac{p_c}{p(u,v)} \prod_{(u,v) \notin E'} \frac{1-p_c}{1-p(u,v)}$, so $\log L(B) = \sum_{(u,v) \in E'} \log \frac{p_c}{p(u,v)} + \sum_{(u,v) \notin E'} \log \frac{1-p_c}{1-p(u,v)}$. Subgraph weight = log likelihood ratio.
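The log likelihood ratio is a sum of per-pair weights, which the following sketch computes directly. The function names, the value of `p_c`, and the toy background probabilities are assumptions for illustration.

```python
from math import log

def edge_weight(p_uv, p_c):
    """Weight of a present edge: log(p_c / p(u,v))."""
    return log(p_c / p_uv)

def non_edge_weight(p_uv, p_c):
    """Weight of a missing edge: log((1 - p_c) / (1 - p(u,v)))."""
    return log((1 - p_c) / (1 - p_uv))

def bicluster_weight(conds, genes, edges, p, p_c):
    """Log likelihood ratio of the subgraph induced by conds x genes."""
    total = 0.0
    for u in conds:
        for v in genes:
            if (u, v) in edges:
                total += edge_weight(p[u, v], p_c)
            else:
                total += non_edge_weight(p[u, v], p_c)
    return total

conds, genes = ["c1", "c2"], ["g1", "g2"]
edges = {("c1", "g1"), ("c1", "g2"), ("c2", "g1")}   # one pair is a non-edge
p = {(u, v): 0.2 for u in conds for v in genes}      # toy background probabilities
w = bicluster_weight(conds, genes, edges, p, p_c=0.9)
```

With $p_c$ above the background probability, edges contribute positive weight and non-edges negative weight, so dense subgraphs of low-probability edges score highest.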
27 Heaviest bipartite subgraph. The problem is NP-complete (Dawande et al. 97, Hochbaum 98). (But: node biclique is polynomial!) Assumption: the degree on the V side is bounded by d. Start by finding heavy bicliques. Algorithm: use hashing to discover heavy subsets of conditions.
28 Finding the heaviest biclique. Takes $O(n 2^d)$ time and space. GE Ron Shamir
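29 Using bicliques to find the heaviest biclusters. Assume: edge weight = 1, non-edge weight = -1. Note: $w((U',V')) = \sum_{u \in U'} w(u, V')$. Lemma: if $B = (U',V')$ is a maximum weight subgraph and $X \subseteq U'$, then there exists $v \in V'$ such that $|N(v) \cap X| \ge |X|/2$. Proof: $0 < w((X,V')) = \sum_{v \in V'} (|N(v) \cap X| - |X \setminus N(v)|) = \sum_{v \in V'} (2|N(v) \cap X| - |X|)$, so some term of the sum is positive. Corollary: if $B = (U',V')$ is a maximum weight subgraph, then $|U'| \le 2d$.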
30 Using bicliques to find the heaviest biclusters. Assume: edge weight = 1, non-edge weight = -1. Lemma: in a maximum weight subgraph $(U^*,V^*)$, for every $X \subseteq U^*$ there is $Y \subseteq X$ with $|Y| \ge |X|/2$ such that $Y \subseteq N(v)$ for some $v \in V^*$. Corollary: in a maximum weight subgraph $(U^*,V^*)$, $U^*$ can be covered by at most $\log(2d)$ sets, each contained in the neighborhood of some vertex in $V^*$.
31 Using bicliques to find the heaviest biclusters. The set of conditions in a maximal bicluster is the union of up to $\log(2d)$ subsets of gene neighborhoods. Exhaustive $O((n 2^d)^{\log(2d)})$-time algorithm: hash the bicliques, then enumerate all combinations of $\log(2d)$ subsets of neighborhoods N(v). Can be generalized to arbitrary edge/non-edge weights.
32 SAMBA's implementation. Phase I: find heavy bicliques; for each gene of degree < d, hash all subsets of its neighbors of size 4-6. Phase II: greedy expansion of the heaviest bicliques containing each gene/condition. Phase III: filter overlapping biclusters.
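Phase I can be illustrated with a small hash table keyed by condition subsets. This is a simplified sketch, not SAMBA's implementation: it assumes unit edge weights, uses subset sizes 2-3 instead of 4-6 so the toy example stays readable, and the function name and data are invented.

```python
from collections import defaultdict
from itertools import combinations

def heavy_bicliques(neighbors, d, sizes=(2, 3), top=5):
    """neighbors: gene -> set of conditions it has an edge to.
    Hash every small subset of each gene's neighborhood; genes that land in
    the same bucket form a biclique with that condition subset."""
    table = defaultdict(set)
    for gene, conds in neighbors.items():
        if len(conds) >= d:
            continue  # only genes of degree < d are hashed
        for size in sizes:
            for subset in combinations(sorted(conds), size):
                table[subset].add(gene)
    # Unit weights (an assumption): biclique weight = #conditions * #genes.
    ranked = sorted(table.items(),
                    key=lambda kv: (-len(kv[0]) * len(kv[1]), kv[0]))
    return ranked[:top]

neighbors = {
    "g1": {1, 2, 3},
    "g2": {1, 2, 3},
    "g3": {2, 3, 4},
    "g4": {5},
}
best_conds, best_genes = heavy_bicliques(neighbors, d=5)[0]
```

Bounding the gene degree by d is what keeps the table small: each gene contributes at most $2^d$ subsets, matching the $O(n 2^d)$ cost quoted for the biclique step.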
33 Evaluating specificity. Suppose the conditions partition into k classes of sizes $c_1, \ldots, c_k$, with $\sum_i c_i = m$. A bicluster has b conditions, $b_i$ of them from class i. If $b_t = \max_i b_i$, assign the bicluster to class t. How good is the match of the bicluster to the classification? Hypergeometric score: $\Pr(B) = \sum_{j=b_t}^{b} \binom{c_t}{j} \binom{m-c_t}{b-j} / \binom{m}{b}$.
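The hypergeometric score is straightforward to compute with exact binomial coefficients. The function name and the example class sizes below are hypothetical.

```python
from math import comb

def specificity_pvalue(class_sizes, bic_counts):
    """Hypergeometric tail: probability that a random set of b conditions
    contains at least b_t conditions from the bicluster's majority class."""
    m = sum(class_sizes)                 # total number of conditions
    b = sum(bic_counts)                  # conditions in the bicluster
    t = max(range(len(bic_counts)), key=lambda i: bic_counts[i])
    c_t, b_t = class_sizes[t], bic_counts[t]
    hits = sum(comb(c_t, j) * comb(m - c_t, b - j) for j in range(b_t, b + 1))
    return hits / comb(m, b)

# Toy example: two classes of sizes 4 and 6; the bicluster has 3 conditions,
# all from the first class.
pv = specificity_pvalue([4, 6], [3, 0])
```

A small value means the bicluster's conditions are concentrated in one class far more than a random condition set of the same size would be.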
34 Specificity test. GE data: Alizadeh et al. (00); 4026 genes, 96 human tissues; 9 classes of lymphoma, normal. [Plot: fraction of biclusters vs. log(p-value) for SAMBA, Cheng-Church 00 and random; annotation: better fit to true classification.]
35 Specificity (2). Generate a random bipartite graph with the same degree sequence as the Alizadeh data; compute biclusters; plot p-value and likelihood (weight). [Plot: log likelihood vs. log p-value; + lymphoma data (Alizadeh et al.), x shuffled data.]
36 Heterogeneous data (Tanay, Sharan, Kupiec, Shamir, PNAS 04). Levels: transcription, protein, phenotype. Data sources: mRNA profiling, ChIP-chip, two-hybrid, protein complex identification using mass spec, barcoded deletion libraries, synthetic lethality, and so many more.
37 Unified modeling of biological information: genes/proteins, properties, modules.
38 A heterogeneous collection of yeast genomic information. Gene expression: ~1000 conditions, 27 publications. TF binding profiles: 110 profiles from growth on YPD (Lee et al.). Phenotype profiles: 6 (30) profiles (Giaever et al.). Two-hybrid interactions: ~1000 (Uetz et al.). Protein complex interactions: ~4000 (Ho et al.). MIPS interactions: ~
39 From experiments to properties. Each experiment type is translated into discrete properties (p1, p2, ...) of a gene g: strong or medium complex binding to protein P; strong induction, medium induction, medium repression or strong repression; strong or medium binding to TF T; high or medium sensitivity; high-confidence or medium-confidence interaction.
40 A SAMBA module. [Figure labels: properties, genes, GO annotations; CPA1, CPA2.]
41 Modular organization in yeast. Ovals = modules; edges = module overlaps. Map generated automatically by SAMBA. Clustered organization: cluster = process. Hierarchical bridges.
42 TF-function map
43 SAMBA pros/cons. Pros: fast; allows simultaneous normalization of genes and conditions; allows integration of heterogeneous data; well suited for query-based usage. Cons: discretization; redundancies.
44 Biclustering: interim summary. A general data mining problem. The key point: defining what a bicluster is. Algorithms vary, depending on the nature of the bicluster model. Open issues: What is the best objective / bicluster criterion? Searching for biclusters in really huge matrices. Handling redundancies.
More informationMean Square Residue Biclustering with Missing Data and Row Inversions
Mean Square Residue Biclustering with Missing Data and Row Inversions Stefan Gremalschi a, Gulsah Altun b, Irina Astrovskaya a, and Alexander Zelikovsky a a Department of Computer Science, Georgia State
More informationA Brief Look at Optimization
A Brief Look at Optimization CSC 412/2506 Tutorial David Madras January 18, 2018 Slides adapted from last year s version Overview Introduction Classes of optimization problems Linear programming Steepest
More informationIntegrating Probabilistic Reasoning with Constraint Satisfaction
Integrating Probabilistic Reasoning with Constraint Satisfaction IJCAI Tutorial #7 Instructor: Eric I. Hsu July 17, 2011 http://www.cs.toronto.edu/~eihsu/tutorial7 Getting Started Discursive Remarks. Organizational
More informationData Mining Technologies for Bioinformatics Sequences
Data Mining Technologies for Bioinformatics Sequences Deepak Garg Computer Science and Engineering Department Thapar Institute of Engineering & Tecnology, Patiala Abstract Main tool used for sequence alignment
More informationAddendum to the proof of log n approximation ratio for the greedy set cover algorithm
Addendum to the proof of log n approximation ratio for the greedy set cover algorithm (From Vazirani s very nice book Approximation algorithms ) Let x, x 2,...,x n be the order in which the elements are
More informationCoverage Approximation Algorithms
DATA MINING LECTURE 12 Coverage Approximation Algorithms Example Promotion campaign on a social network We have a social network as a graph. People are more likely to buy a product if they have a friend
More informationSparse and large-scale learning with heterogeneous data
Sparse and large-scale learning with heterogeneous data February 15, 2007 Gert Lanckriet (gert@ece.ucsd.edu) IEEE-SDCIS In this talk Statistical machine learning Techniques: roots in classical statistics
More informationComputational Genomics and Molecular Biology, Fall
Computational Genomics and Molecular Biology, Fall 2015 1 Sequence Alignment Dannie Durand Pairwise Sequence Alignment The goal of pairwise sequence alignment is to establish a correspondence between the
More informationDiscrete mathematics , Fall Instructor: prof. János Pach
Discrete mathematics 2016-2017, Fall Instructor: prof. János Pach - covered material - Lecture 1. Counting problems To read: [Lov]: 1.2. Sets, 1.3. Number of subsets, 1.5. Sequences, 1.6. Permutations,
More informationWeb Structure Mining Community Detection and Evaluation
Web Structure Mining Community Detection and Evaluation 1 Community Community. It is formed by individuals such that those within a group interact with each other more frequently than with those outside
More informationTransfer String Kernel for Cross-Context Sequence Specific DNA-Protein Binding Prediction. by Ritambhara Singh IIIT-Delhi June 10, 2016
Transfer String Kernel for Cross-Context Sequence Specific DNA-Protein Binding Prediction by Ritambhara Singh IIIT-Delhi June 10, 2016 1 Biology in a Slide DNA RNA PROTEIN CELL ORGANISM 2 DNA and Diseases
More informationCME 305: Discrete Mathematics and Algorithms Instructor: Reza Zadeh HW#3 Due at the beginning of class Thursday 02/26/15
CME 305: Discrete Mathematics and Algorithms Instructor: Reza Zadeh (rezab@stanford.edu) HW#3 Due at the beginning of class Thursday 02/26/15 1. Consider a model of a nonbipartite undirected graph in which
More informationComputational Molecular Biology
Computational Molecular Biology Erwin M. Bakker Lecture 3, mainly from material by R. Shamir [2] and H.J. Hoogeboom [4]. 1 Pairwise Sequence Alignment Biological Motivation Algorithmic Aspect Recursive
More informationAlgorithms for Data Science
Algorithms for Data Science CSOR W4246 Eleni Drinea Computer Science Department Columbia University Thursday, October 1, 2015 Outline 1 Recap 2 Shortest paths in graphs with non-negative edge weights (Dijkstra
More informationGraph Mining: Overview of different graph models
Graph Mining: Overview of different graph models Davide Mottin, Konstantina Lazaridou Hasso Plattner Institute Graph Mining course Winter Semester 2016 Lecture road Anomaly detection (previous lecture)
More informationMultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A
MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI
More informationIntroduction to Computer Science
DM534 Introduction to Computer Science Clustering and Feature Spaces Richard Roettger: About Me Computer Science (Technical University of Munich and thesis at the ICSI at the University of California at
More informationDistribution-Free Models of Social and Information Networks
Distribution-Free Models of Social and Information Networks Tim Roughgarden (Stanford CS) joint work with Jacob Fox (Stanford Math), Rishi Gupta (Stanford CS), C. Seshadhri (UC Santa Cruz), Fan Wei (Stanford
More informationClustering CS 550: Machine Learning
Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf
More informationDiscrete Mathematics Course Review 3
21-228 Discrete Mathematics Course Review 3 This document contains a list of the important definitions and theorems that have been covered thus far in the course. It is not a complete listing of what has
More information1 More stochastic block model. Pr(G θ) G=(V, E) 1.1 Model definition. 1.2 Fitting the model to data. Prof. Aaron Clauset 7 November 2013
1 More stochastic block model Recall that the stochastic block model (SBM is a generative model for network structure and thus defines a probability distribution over networks Pr(G θ, where θ represents
More informationReview: Identification of cell types from single-cell transcriptom. method
Review: Identification of cell types from single-cell transcriptomes using a novel clustering method University of North Carolina at Charlotte October 12, 2015 Brief overview Identify clusters by merging
More information