
Gene regulation
- DNA is merely the blueprint
- Shared spatially (among all tissues) and temporally
- But cells manage to differentiate
  - especially, but not only, during the developmental stage
- And cells respond to external conditions and/or messages from other cells
- Much of this dynamic response is attained through protein or gene regulation: how much and which variant of the gene is present

The central dogma

Mechanisms of gene regulation
- Pre-transcription: accessibility of the gene
  - the chromatin structure which packs the DNA is dynamic
- Transcription: rate
- Post-transcription: mRNA degradation rate
- Translation: rate
- Post-translation:
  - modifications
  - rate of degradation

Transcription factors
- Bind to specific DNA sites: Transcription Factor Binding Sites (TFBSs)
- Typically with a downstream effect on the mRNA transcription rate

Transcription rate

Motif finding
- Motif finding is the computational problem of identifying TFBSs
- Implicit assumption: different TFBSs of the same TF should be similar to one another
  - hence the name motif
- Two related tasks:
  - Given a specific model of the TF motif, compiled from a known list of TFBSs, find additional sites (scanning)
  - Identify the unknown motif given only the DNA sequences

Modelling motifs
Discovered sites:

    TACGAT
    TATAAT
    TATAAT
    GATACT
    TATGAT
    TATATT

How do we model the motif? (important for finding additional sites)
- Consensus pattern: TATAAT (generalizes to regular expressions)
- Positional profile of counts (. = 0):

        1  2  3  4  5  6
    A   .  6  .  4  4  .
    C   .  .  1  .  1  .
    G   1  .  .  2  .  .
    T   5  .  5  .  1  6

Generative models
- Consensus pattern: each instance is a randomly mutated version of the consensus
  - substitutions only: the same TF binds to the various sites, so indels are unlikely to occur as the DNA-TF contact region remains the same
- Profile: instances are drawn according to the probability implied by the positional profile
  - assuming each position is drawn independently
  - pseudocounts are typically added to avoid excluding unseen letters
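
As a concrete illustration of the profile generative model, here is a minimal Python sketch that draws an instance one independent position at a time. The frequencies are the pseudocount-corrected profile derived on the next slide; the function name is illustrative.

    import random

    # Positional frequency profile f[i][k]: one dict per motif position
    # (the pseudocount-corrected frequencies from the next slide).
    PROFILE = [
        {"A": 0.1, "C": 0.1, "G": 0.2, "T": 0.6},
        {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        {"A": 0.1, "C": 0.2, "G": 0.1, "T": 0.6},
        {"A": 0.5, "C": 0.1, "G": 0.3, "T": 0.1},
        {"A": 0.5, "C": 0.2, "G": 0.1, "T": 0.2},
        {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    ]

    def sample_instance(profile):
        """Draw one motif instance: each position is sampled independently."""
        return "".join(
            random.choices(list(col), weights=list(col.values()))[0]
            for col in profile
        )

    print(sample_instance(PROFILE))  # e.g. 'TATAAT'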

Counts to frequencies profile

Counts (. = 0):

        1  2  3  4  5  6
    A   .  6  .  4  4  .
    C   .  .  1  .  1  .
    G   1  .  .  2  .  .
    T   5  .  5  .  1  6

Frequencies:

          1    2    3    4    5    6
    A   0.1  0.7  0.1  0.5  0.5  0.1
    C   0.1  0.1  0.2  0.1  0.2  0.1
    G   0.2  0.1  0.1  0.3  0.1  0.1
    T   0.6  0.1  0.6  0.1  0.2  0.7

What is the pseudocount in this example?
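
The conversion itself is one line per column; a small Python sketch (with n sites and pseudocount p, each frequency is (n_ik + p) / (n + 4p); with the 6 sites above, p = 1 reproduces the frequency table):

    def counts_to_frequencies(counts, n_sites, pseudocount=1.0):
        """f_ik = (n_ik + p) / (n + 4p): add the pseudocount once per letter."""
        total = n_sites + 4 * pseudocount
        return {k: (counts.get(k, 0) + pseudocount) / total for k in "ACGT"}

    # Column 1 of the count profile above: G appears once, T five times.
    print(counts_to_frequencies({"G": 1, "T": 5}, n_sites=6))
    # {'A': 0.1, 'C': 0.1, 'G': 0.2, 'T': 0.6}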

The fitness of a TFBS
How well does a putative TFBS w fit the model?
- For a consensus model we typically use $s_C(w) = d_H(C, w)$, the Hamming distance to the consensus pattern C
  - It is convenient to work with, but it is more appropriate for a sample with uniform nucleotide composition
- For a profile parametrized by $M = (f_{ik})_{i=1:l,\,k=1:4}$, it is natural to use the likelihood score: $s_M(w) = P_M(w) = \prod_{i=1}^{l} f_{i w_i}$
- Better: use the LLR (log-likelihood ratio) score
  $$s_M(w) = \log \frac{P_M(w)}{P_B(w)} = \sum_{i=1}^{l} \log \frac{f_{i w_i}}{b_{w_i}},$$
  where B specifies an iid background model with nucleotide frequencies $(b_k)_{k=1}^{4}$, typically taken from the organism or the scanned sample
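
A Python sketch of the LLR score under an assumed uniform iid background, reusing the PROFILE from the sampling sketch above:

    import math

    BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform

    def llr_score(word, profile=PROFILE, background=BACKGROUND):
        """s_M(w) = sum_i log(f_{i,w_i} / b_{w_i})."""
        return sum(
            math.log(profile[i][c] / background[c]) for i, c in enumerate(word)
        )

    print(round(llr_score("TATAAT"), 2))  # ~5.2: the consensus word scores highest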

Scanning for TFBSs
- Given a parametrized motif model and an associated fitness function, looking for additional sites is algorithmically trivial
- However, setting a cutoff score typically requires carefully analyzing the FP (false positive) rates
- These FP rates are estimated using a model of random sequences:
  - Markov chains
  - shuffling
  - using random chunks of DNA
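
The scanning step itself, as a sketch: slide a window along the sequence and keep every position scoring at or above the cutoff (the cutoff would be calibrated on random sequences as described above).

    def scan(sequence, motif_len, score_fn, cutoff):
        """Report (position, word, score) for every window scoring >= cutoff."""
        hits = []
        for i in range(len(sequence) - motif_len + 1):
            word = sequence[i:i + motif_len]
            s = score_fn(word)
            if s >= cutoff:
                hits.append((i, word, s))
        return hits

    # e.g. scan("ACGTTATAATGC", 6, llr_score, cutoff=3.0)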

Motif finding
Do these sequences share a common TFBS?

    tagcttcatcgttgacttctgcagaaagcaagctcctgagtagctggccaagcgagc
    tgcttgtgcccggctgcggcggttgtatcctgaatacgccatgcgccctgcagctgc
    tagaccctgcagccagctgcgcctgatgaaggcgcaacacgaaggaaagacgggacc
    agggcgacgtcctattaaaagataatcccccgaacttcatagtgtaatctgcagctg
    ctcccctacaggtgcaggcacttttcggatgctgcagcggccgtccggggtcagttg
    cagcagtgttacgcgaggttctgcagtgctggctagctcgacccggattttgacgga
    ctgcagccgattgatggaccattctattcgtgacacccgacgagaggcgtccccccg
    gcaccaggccgttcctgcaggggccaccctttgagttaggtgacatcattcctatgt
    acatgcctcaaagagatctagtctaaatactacctgcagaacttatggatctgaggg
    agaggggtactctgaaaagcgggaacctcgtgtttatctgcagtgtccaaatcctat

If only life could be that simple
- The binding sites are almost never exactly the same
- A more likely sample is:

    tagcttcatcgttgacttttgaagaaagcaagctcctgagtagctggccaagcgagc
    tgcttgtgcccggctgcggcggttgtatcctgaatacgccatgcgccctggagctgc
    tagaccctgcagccagctgcgcctgatgaaggcgcaacacgaaggaaagacgggacc
    agggcgacgtcctattaaaagataatcccccgaacttcatagtgtaatctgcagctg
    ctcccctacaggtgcaggcacttttcggatgctgcttcggccgtccggggtcagttg
    cagcagtgttacgcgaggttctacagtgctggctagctcgacccggattttgacgga
    CTGCAGccgattgatggaccattctattcgtgacacccgacgagaggcgtccccccg
    gcaccaggccgttcctacaggggccaccctttgagttaggtgacatcattcctatgt
    acatgcctcaaagagatctagtctaaatactacctacagaacttatggatctgaggg
    agaggggtactctgaaaagcgggaacctcgtgtttatttgcattgtccaaatcctat

Searching for motifs
- Simultaneously looking for a motif model and sites that will optimize a scoring function is significantly more difficult
- Assume for simplicity the OOPS model (One Occurrence Per Sequence): $w_m \in S_m$ for $m = 1:n$
- A natural way to score a putative combination of a motif M and sites $(w_m)_{m=1}^{n}$ is by summing the fitness scores of all sites:
  $$s(M; w_1, \ldots, w_n) := \sum_{m=1}^{n} s_M(w_m)$$
- Thus, our goal is to search the joint space of motifs M (consensus or profile) and alignments $w_m \in S_m$ so as to optimize this score
- Fortunately, for both models this can be done sequentially, so we do not have to optimize simultaneously over the alignment and the motif

Optimizing the motif or the alignment
- Once we choose the alignment, $w_m \in S_m$ for $m = 1:n$, the optimal motif for that alignment is trivial:
  - For the consensus model it is a consensus word, as it clearly minimizes the total distance to the words in the alignment
  - For the profile model we find, with a little more effort, that the best model is the one which coincides with how we define a profile: $f_{ik} = \frac{n_{ik}}{n}$, where $n_{ik}$ is the number of occurrences of the letter k at position i
- Conversely, if we know the model we can find the optimal sites for the putative motif by linearly scanning the sequences
- Often a motif finder will combine both the motif's and the alignment's optimizations, and indeed they are in some sense equivalent (see the sketch below)
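
A minimal Python sketch of the two directions: the optimal profile for a fixed alignment (column counts, here with pseudocounts as before), and the optimal site for a fixed profile (the max-LLR window per sequence). Function names are illustrative.

    import math

    def profile_from_alignment(words, pseudocount=1.0):
        """Optimal profile for a fixed alignment: f_ik = (n_ik + p) / (n + 4p)."""
        n = len(words)
        profile = []
        for col in zip(*words):
            counts = {k: pseudocount for k in "ACGT"}
            for c in col:
                counts[c] += 1
            profile.append({k: v / (n + 4 * pseudocount) for k, v in counts.items()})
        return profile

    def best_site(sequence, profile, background):
        """Optimal site for a fixed profile: the window maximizing the LLR."""
        l = len(profile)

        def llr(w):
            return sum(math.log(profile[i][c] / background[c]) for i, c in enumerate(w))

        return max(
            (sequence[i:i + l] for i in range(len(sequence) - l + 1)), key=llr
        )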

Heuristic vs. guaranteed optimizations
- Assume for now that l is known (we can enumerate over possible values of l), and let $N_m$ be the length of $S_m$
- By considering all, roughly, $\prod_{m=1}^{n} N_m$ gapless alignments made of $w_m \in S_m$ we are guaranteed to find the optimal alignment under both possible motif models
- Unfortunately, this number is prohibitively large for all but a few cases

Finding an optimal pattern
- Consistent with our previous discussion, under the OOPS model the score of a consensus word C is often the total distance:
  $$TD(C) := \sum_{m=1}^{n} d_H(C, S_m) = \sum_{m=1}^{n} \min_{w \in S_m} d_H(C, w)$$
- Problem: find a word C that minimizes the total distance
- Naive solution: enumerate all $4^l$ possible consensus words
  - Complexity: $O(4^l D)$, where D is the size of the sample
- While this approach is feasible for a larger set of parameters than the one available for alignment enumeration, it is still often too expensive
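
The total-distance objective is straightforward to state in code; a direct Python sketch:

    def hamming(a, b):
        """Hamming distance between two equal-length words."""
        return sum(x != y for x, y in zip(a, b))

    def total_distance(consensus, sequences):
        """TD(C): for each sequence, the distance of its best-matching window."""
        l = len(consensus)
        return sum(
            min(hamming(consensus, s[i:i + l]) for i in range(len(s) - l + 1))
            for s in sequences
        )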

Heuristic approaches: Sample Driven
- Most of the $4^l$ patterns we explore in the exhaustive enumeration have little to do with our sample
- Sample-driven approach: compute TD(w) only for words w in the sample
- Complexity: $O(D^2)$, where $D = \sum_{m=1}^{n} N_m$ is the size of the sample
- Analysis: fast, but can miss the optimal pattern if it is missing from the sample
- More sophisticated methods were developed based on the sample-driven approach
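
A sketch of the sample-driven search, reusing total_distance from above: D candidate words, each scored by an O(D) pass, giving O(D^2) overall.

    def sample_driven_consensus(sequences, l):
        """Use only words occurring in the sample as candidate consensus words."""
        candidates = {s[i:i + l] for s in sequences for i in range(len(s) - l + 1)}
        return min(candidates, key=lambda w: total_distance(w, sequences))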

CONSENSUS - greedy profile search (Hertz & Stormo 99)
- Assume the OOPS model and that l is given
  - There is a version that does not assume l is given (WCONSENSUS)
- CONSENSUS follows a greedy strategy, looking first for the best alignment of just two sites:
  - For each $i \neq j$, and $w \in S_i$, $w' \in S_j$, compute the information content of the alignment made of w and w':
    $$I = \sum_{i=1}^{l} \sum_{k=1}^{4} n_{ik} \log \frac{n_{ik}/2}{b_k}$$
  - Keep the top $q_2$ alignments (matrices)

- It then greedily adds one word at a time from the sequences that are not already represented in the alignment:
  - Let m := 3 denote the number of rows in the alignments built next
  - While $m \leq n$:
    - for each of the top saved $q_{m-1}$ alignments A of m-1 rows, compute I for all extended alignments $\binom{A}{w}$, where w comes from a sequence not already in A
    - keep the best $q_m$ alignments and set m := m + 1
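
A compact Python paraphrase of this greedy beam search, under two simplifying assumptions: a single beam width q at every level (rather than per-level $q_m$), and the information-content score written out from the formula above. This illustrates the strategy; it is not the actual CONSENSUS implementation.

    import math
    from itertools import combinations

    def information_content(words, background):
        """I = sum_i sum_k n_ik * log((n_ik / m) / b_k) for an m-row alignment."""
        m = len(words)
        total = 0.0
        for col in zip(*words):
            for k in "ACGT":
                n = col.count(k)
                if n:
                    total += n * math.log((n / m) / background[k])
        return total

    def greedy_consensus(sequences, l, q, background):
        """Beam search over gapless alignments, one word per sequence."""
        def windows(s):
            return [s[i:i + l] for i in range(len(s) - l + 1)]

        def score(entry):
            return information_content(entry[1], background)

        # Seed: all 2-row alignments from pairs of distinct sequences.
        beam = [({i, j}, [w, w2])
                for i, j in combinations(range(len(sequences)), 2)
                for w in windows(sequences[i])
                for w2 in windows(sequences[j])]
        beam = sorted(beam, key=score, reverse=True)[:q]
        # Greedily add one word at a time from sequences not yet represented.
        for _ in range(len(sequences) - 2):
            extended = [(used | {i}, words + [w])
                        for used, words in beam
                        for i in range(len(sequences)) if i not in used
                        for w in windows(sequences[i])]
            beam = sorted(extended, key=score, reverse=True)[:q]
        return beam[0][1]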

MEME (Bailey & Elkan 94)
- MEME: Multiple EM for Motif Elicitation
  - the "multiple" part is for dealing with multiple motifs
- Probabilistic generative model, deterministic algorithm
- Recall that given the motif model we can linearly scan the sequences for instances
- Conversely, given the instances, deducing the profile is trivial
- MEME alternates between the two tasks

MEME's outline
- Starting from a heuristically chosen initial profile
  - sample driven: the profile is derived from the word in the sample that has a minimal total distance
- MEME iterates the following two steps until convergence:
  - score each word according to how well it fits the current profile
  - update the profile by taking a weighted average of all the words
- The EM in MEME stands for Expectation Maximization (Dempster, Laird & Rubin 77), which MEME's two-step procedure follows
- EM is guaranteed to monotonically converge to a local maximum (an intelligent choice of a starting point is crucial)
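
A toy Python sketch of one such iteration under the OOPS model: the "score each word" step weights every window by its likelihood ratio (normalized within each sequence), and the "update" step re-estimates the profile as the weighted average of all windows. This paraphrases the idea; it is not MEME itself.

    import math

    def em_iteration(sequences, profile, background, pseudocount=0.1):
        """One E+M step for a single motif under the OOPS model (toy version)."""
        l = len(profile)
        new_counts = [{k: pseudocount for k in "ACGT"} for _ in range(l)]
        for s in sequences:
            windows = [s[i:i + l] for i in range(len(s) - l + 1)]
            # E-step: posterior weight of each window is proportional to its
            # likelihood ratio (one occurrence per sequence, uniform site prior).
            lr = [
                math.prod(profile[i][c] / background[c] for i, c in enumerate(w))
                for w in windows
            ]
            z = sum(lr)
            # M-step contribution: letter counts weighted by the posteriors.
            for w, weight in zip(windows, lr):
                for i, c in enumerate(w):
                    new_counts[i][c] += weight / z
        n = len(sequences) + 4 * pseudocount
        return [{k: v / n for k, v in col.items()} for col in new_counts]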

Gibbs Sampler (Lawrence et al. 93)
- Probabilistic framework, randomized algorithm
- Assumes the OOPS model for simplicity (there are many more variants)
- Suppose we selected putative instances $w_i \in S_i$; these define the profile, or motif model, as in MEME
- As in the EM context, we compute the LR score of every word in the sample: $L_w = \frac{P_M(w)}{P_B(w)}$
- In EM we use a soft assignment of words to the list of selected sites (instances); alternatively, we can iterate using a hard assignment:
  - e.g., we can choose $w_i = \operatorname{argmax}_{w \in S_i} L_w$
  - or, we can randomly choose a site with probability proportional to its LR score (Gibbs sampling)

Gibbs outline
- Start with a random choice of $w_i \in S_i$
- While there has been an improvement in the total LLR score (over all sites) in the last L iterations (the plateau period), do:
  - For i = 1...n: remove word $w_i$ from the motif model M and randomly pick a new site from sequence $S_i$ with probability proportional to $L_w$
  - (an iteration is one such loop)
- Note that there is no convergence in the naive sense
- There is an alternative formulation in terms of a Gibbs Sampler: the goal is to randomly sample alignments from a distribution where each alignment's probability is roughly proportional to its LR score. The Gibbs Sampler defines an MCMC that converges to a stationary distribution with that property, thus allowing us to sample from this distribution.
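
A Python sketch of this loop, with two simplifications: a fixed number of sweeps instead of the plateau-period stopping rule, and profile_from_alignment reused from the earlier sketch. Again an illustration, not the original sampler.

    import math
    import random

    def gibbs_motif(sequences, l, background, n_sweeps=500, pseudocount=1.0):
        """Toy Gibbs-style motif search under the OOPS model."""
        # Random initial site in each sequence.
        sites = [random.randrange(len(s) - l + 1) for s in sequences]
        for _ in range(n_sweeps):
            for i, s in enumerate(sequences):
                # Profile built from all sites except sequence i's.
                others = [sequences[j][p:p + l]
                          for j, p in enumerate(sites) if j != i]
                profile = profile_from_alignment(others, pseudocount)
                # LR score of every window of sequence i.
                lr = [
                    math.prod(profile[k][c] / background[c]
                              for k, c in enumerate(s[p:p + l]))
                    for p in range(len(s) - l + 1)
                ]
                # Resample sequence i's site proportionally to its LR score.
                sites[i] = random.choices(range(len(lr)), weights=lr)[0]
        return [s[p:p + l] for s, p in zip(sequences, sites)]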

Is this significant?
- Motif finders always find something
- A high-scoring motif reported by CONSENSUS:

The alignment:

    ccgataaggtaag
    TCGATAAGGaGAG
    TCGATAAaGTaAG
    TCGATAAGGTcAG
    TCGATAAGGTGAG

The profile (. = 0):

        1  2  3  4  5  6  7  8  9 10 11 12 13
    A   .  .  .  5  .  5  5  1  .  1  2  5  .
    C   1  5  .  .  .  .  .  .  .  .  1  .  .
    G   .  .  5  .  .  .  .  4  5  .  2  .  5
    T   4  .  .  .  5  .  .  .  .  4  .  .  .

Assessing the significance
How likely are we to see such alignments, or better, by chance? Clearly you need more information:
- What is the null model, or how are random (chance) sequences generated?
  - typically iid (independent, identically distributed, or 0-th order Markov)
- What is the size of the search space:
  - How many input sequences are there and how long are they?
  - What was CONSENSUS instructed to look for?
- What is a better alignment, or how do we score a motif?
  - information content (Stormo 88): $\sum_{i=1}^{L} \sum_{j=1}^{A} n_{ij} \log \frac{n_{ij}/n}{b_j}$

Quantifying the significance: the E-value / p-value
- The E-value of an alignment with score s is:
  - the expected number of random alignments with score $\geq s$, given the size of the search space
  - an E-value of 0.01 is better than 100
  - it is computed by multiplying the size of the search space by the p-value of the alignment
- The p-value of an alignment with score s is:
  - the probability that the score of a random alignment of the same width and depth is $\geq s$
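
A worked example of the relationship: suppose an alignment's score has p-value $10^{-6}$ and the finder effectively examined $10^4$ candidate alignments of that width and depth; then

$$E = 10^{4} \times 10^{-6} = 10^{-2},$$

i.e. we expect about 0.01 random alignments scoring at least this well, so the motif is unlikely to be a chance finding.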

Phylogenetically aware finders
- So far we looked at de novo motif finders: the only input is the set of presumably co-regulated sequences
- Binding sites are functional, and functional elements tend to be more conserved
- Therefore we should look for conserved words in phylogenetically related species
- This can be done when we search for a known motif (MONKEY - Moses et al. 2004)
- Or when looking for an unknown motif (PhyloGibbs - Siddharthan et al. 2005)
- Other finders might add ChIP-chip data (MDscan - Liu et al. 2002)