EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science


EECS 730 Introduction to Bioinformatics: Sequence Alignment. Luke Huan, Electrical Engineering and Computer Science. http://people.eecs.ku.edu/~huan/

HMM
Π is a set of states.
Transition probabilities: $a_{kl} = \Pr(\pi_i = l \mid \pi_{i-1} = k)$, the probability of a transition from state k to state l.
Emission probabilities: $e_k(b) = \Pr(x_i = b \mid \pi_i = k)$, the probability of emitting character b in state k.
HMM topology: a fully connected graph (i.e., a clique) contains too many parameters.
2011/10/22 EECS 730 2

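HMM example
$\Pi = \{S_i, \text{begin}, \text{end}\}$, $i = [1, 2]$.
Transition probabilities $a_{kl}$: $a_{11} = 0.5$, $a_{12} = 0.3$, $a_{1e} = 0.2$, $a_{21} = 0.7$, $a_{22} = 0.2$, $a_{2e} = 0.1$.
Emission probabilities $e_k(b)$: $S_1$ emits 0 with probability 0.8 and 1 with probability 0.2; $S_2$ emits 0 with probability 0.1 and 1 with probability 0.9.
(Figure: state diagram connecting Begin, $S_1$, $S_2$, and End.)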

Components of profile HMMs
From Bioinformatics.
(Figure: the transition structure of a profile HMM, with delete states $D_j$, insert states $I_j$, and match states $M_j$ between Begin and End.)

Trivial questions:
What is the probability that we will observe the state path (path) b, $S_1$, $S_2$, e?
Given a path b, $S_1$, $S_2$, e, what is the probability that we will observe the sequence 01?
(Figure: the two-state example HMM. Transitions: $a_{11} = 0.5$, $a_{12} = 0.3$, $a_{1e} = 0.2$, $a_{21} = 0.7$, $a_{22} = 0.2$, $a_{2e} = 0.1$. Emissions: $S_1$ emits 0 with 0.8 and 1 with 0.2; $S_2$ emits 0 with 0.1 and 1 with 0.9.)

Slightly involved questions:
What is the probability that we will observe the sequence 01 by going through the path b, $S_1$, $S_2$, e?
What is the probability that we will observe the sequence 01 with M?
What is the most likely path when we observe the sequence 01 from M?
(Figure: the two-state example HMM, labeled M. Transitions: $a_{11} = 0.5$, $a_{12} = 0.3$, $a_{1e} = 0.2$, $a_{21} = 0.7$, $a_{22} = 0.2$, $a_{2e} = 0.1$. Emissions: $S_1$ emits 0 with 0.8 and 1 with 0.2; $S_2$ emits 0 with 0.1 and 1 with 0.9.)
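For a model this small, all three questions can be answered by brute force. The sketch below enumerates every state path for the sequence 01. The transition and emission values are the ones on the slide; the begin transitions are not given there, so `BEGIN` (0.5 each) is purely an assumption made for illustration, as are all function names.

```python
from itertools import product

# Transition and emission probabilities as listed for the example HMM.
TRANS = {("S1", "S1"): 0.5, ("S1", "S2"): 0.3, ("S1", "end"): 0.2,
         ("S2", "S1"): 0.7, ("S2", "S2"): 0.2, ("S2", "end"): 0.1}
EMIT = {"S1": {"0": 0.8, "1": 0.2}, "S2": {"0": 0.1, "1": 0.9}}
# ASSUMPTION: the slides do not state the begin transitions; 0.5/0.5 is hypothetical.
BEGIN = {"S1": 0.5, "S2": 0.5}


def emission_probability(seq, path):
    """P(seq | path): product of the emission probabilities along a fixed path."""
    prob = 1.0
    for state, symbol in zip(path, seq):
        prob *= EMIT[state][symbol]
    return prob


def joint_probability(seq, path):
    """P(seq, path | M): begin transition, emissions, transitions, end transition."""
    prob = BEGIN[path[0]] * EMIT[path[0]][seq[0]]
    for prev, state, symbol in zip(path, path[1:], seq[1:]):
        prob *= TRANS[(prev, state)] * EMIT[state][symbol]
    return prob * TRANS[(path[-1], "end")]


def enumerate_paths(seq):
    """Score every possible state path of the same length as seq."""
    return {path: joint_probability(seq, path)
            for path in product(EMIT, repeat=len(seq))}


scores = enumerate_paths("01")
total = sum(scores.values())          # P(01 | M), summed over all paths
best = max(scores, key=scores.get)    # most likely path for 01
print(emission_probability("01", ("S1", "S2")), total, best)
```

Enumeration costs time exponential in the sequence length, which is why dynamic programming is used for anything longer than a toy example.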

A hard question:
Given a set of sequences (assuming they are generated by an HMM), how do we estimate the parameters (and the structure) of the related HMM?

Why do we care? Assign membership
Given an HMM M, built from a protein family P, and a new sequence s, the probability $P(s \mid M)$ tells us how likely it is that s belongs to P and hence has the same function as the proteins in P.
Question: what if we have two families $P_1$ and $P_2$ and we are not sure which family the sequence should be assigned to?

Why do we care? Find the alignment
Given an HMM M, built from a protein family P, and a new sequence s, the most likely path of events $T = \arg\max_{\pi} P(s, \pi \mid M)$ (where $\pi$ is a valid path in M) tells us how we should align s to M.

Why do we care? Build an HMM
Given a set of protein sequences S, build the HMM that most likely generates S.

Three Important Questions
How likely is a given sequence? The Forward algorithm.
What is the most probable path for generating a given sequence? The Viterbi algorithm.
How can we learn the HMM parameters given a set of sequences? The Forward-Backward (Baum-Welch) algorithm.
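The first two questions can be sketched on the two-state example HMM from the earlier slides. This is a minimal illustration, not the course's implementation: the begin transitions (`BEGIN`, 0.5 each) are an assumption because the slides do not state them, and the function names are made up for this example.

```python
# Parameters of the two-state example HMM from the slides.
STATES = ("S1", "S2")
TRANS = {("S1", "S1"): 0.5, ("S1", "S2"): 0.3, ("S1", "end"): 0.2,
         ("S2", "S1"): 0.7, ("S2", "S2"): 0.2, ("S2", "end"): 0.1}
EMIT = {"S1": {"0": 0.8, "1": 0.2}, "S2": {"0": 0.1, "1": 0.9}}
# ASSUMPTION: begin transitions are not given in the slides; 0.5/0.5 is hypothetical.
BEGIN = {"S1": 0.5, "S2": 0.5}


def forward(seq):
    """Forward algorithm: P(seq | M), summing over all state paths."""
    f = {k: BEGIN[k] * EMIT[k][seq[0]] for k in STATES}
    for symbol in seq[1:]:
        f = {l: EMIT[l][symbol] * sum(f[k] * TRANS[(k, l)] for k in STATES)
             for l in STATES}
    return sum(f[k] * TRANS[(k, "end")] for k in STATES)


def viterbi(seq):
    """Viterbi algorithm: probability and states of the single best path."""
    v = {k: BEGIN[k] * EMIT[k][seq[0]] for k in STATES}
    backpointers = []
    for symbol in seq[1:]:
        pointers, new_v = {}, {}
        for l in STATES:
            prev = max(STATES, key=lambda k: v[k] * TRANS[(k, l)])
            pointers[l] = prev
            new_v[l] = EMIT[l][symbol] * v[prev] * TRANS[(prev, l)]
        backpointers.append(pointers)
        v = new_v
    last = max(STATES, key=lambda k: v[k] * TRANS[(k, "end")])
    prob = v[last] * TRANS[(last, "end")]
    # Trace back from the final state, as in Needleman-Wunsch.
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return prob, path[::-1]


print(forward("01"))
print(viterbi("01"))
```

The two functions differ only in replacing the sum over predecessor states with a max and recording which predecessor won.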

Searching with profile HMMs
Main usage of profile HMMs: detecting potential membership in a family by matching a sequence to the profile HMMs.
Viterbi algorithm: based on dynamic programming, maintaining the log-odds ratio compared with the random model $P(x \mid R) = \prod_i q_{x_i}$.
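The random model above is just a product of background frequencies, so its log probability is a sum. A small sketch, with hypothetical function names and a uniform nucleotide background chosen only to keep the example short:

```python
import math


def null_log_prob(seq, background):
    """log P(x | R) = sum_i log q_{x_i} for the random (background) model R."""
    return sum(math.log(background[x]) for x in seq)


def log_odds(model_log_prob, seq, background):
    """Log-odds score: log P(x | M) - log P(x | R)."""
    return model_log_prob - null_log_prob(seq, background)


# Hypothetical uniform background over four symbols.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(null_log_prob("ACGT", background))
```

A positive log-odds score means the sequence is better explained by the model than by the background.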

Viterbi Algorithm
The best way to get to E is either:
to go to N5 via the best way to it from S and then to E, or
to go to N6 via the best way to it from S and then to E, or
to go to N7 via the best way to it from S and then to E.
The best way to get to N5 is either:
to go to N2 via the best way to it from S and then to N5,
etc.
In practice: calculate the best route to N1, then N2, N3, N4, N5, N6, N7, and E.
(Figure: directed graph over the nodes S, N1 to N7, and E.)

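Viterbi equation

$$V^{M}_{j}(i) = \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \max\left\{ V^{M}_{j-1}(i-1) + \log a_{M_{j-1}M_j},\; V^{I}_{j-1}(i-1) + \log a_{I_{j-1}M_j},\; V^{D}_{j-1}(i-1) + \log a_{D_{j-1}M_j} \right\}$$

$$V^{I}_{j}(i) = \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \max\left\{ V^{M}_{j}(i-1) + \log a_{M_jI_j},\; V^{I}_{j}(i-1) + \log a_{I_jI_j},\; V^{D}_{j}(i-1) + \log a_{D_jI_j} \right\}$$

$$V^{D}_{j}(i) = \max\left\{ V^{M}_{j-1}(i) + \log a_{M_{j-1}D_j},\; V^{I}_{j-1}(i) + \log a_{I_{j-1}D_j},\; V^{D}_{j-1}(i) + \log a_{D_{j-1}D_j} \right\}$$

Initialization: $V^{M}_{0}(0) = 0$.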

Example Calculation
(Figure: weighted graph over the nodes S, N1 to N7, and E.)
Best path to N1 scores max{0.1*0.1} = 0.01, from S.
Best path to N2 scores max{0.01*1, 0.2*1} = 0.2, from S.
Best path to N3 scores max{0.7*0.5, 0.01*0.9*0.5} = 0.35, from S.
Best path to N4 scores max{0.35*0.1, 0.2*0.1} = 0.035, from N3.
And so on.
As with Needleman-Wunsch, we must record the nodes from which the best path came.
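The node-by-node calculation can be sketched as a best-path search over a directed acyclic graph with multiplicative scores. The graph below is a hypothetical one with made-up weights, not the graph from the slide; the function and variable names are likewise illustrative.

```python
def best_path(nodes, edges, source, sink):
    """Best multiplicative-score path in a DAG.

    nodes must be listed in topological order; edges maps (pred, node) to a weight.
    """
    best = {source: 1.0}
    back = {}  # node -> predecessor on its best path
    for node in nodes:
        if node == source:
            continue
        candidates = [(best[pred] * weight, pred)
                      for (pred, target), weight in edges.items()
                      if target == node and pred in best]
        if candidates:
            best[node], back[node] = max(candidates)
    # Trace back from the sink using the recorded predecessors.
    path = [sink]
    while path[-1] != source:
        path.append(back[path[-1]])
    return best[sink], path[::-1]


# HYPOTHETICAL graph and weights, for illustration only.
nodes = ["S", "A", "B", "C", "E"]
edges = {("S", "A"): 0.5, ("S", "B"): 0.4,
         ("A", "C"): 0.2, ("B", "C"): 0.9,
         ("A", "E"): 0.1, ("C", "E"): 0.5}
score, path = best_path(nodes, edges, "S", "E")
print(score, path)
```

Each node is scored once, after all of its predecessors, which is exactly the "best route to N1, then N2, ..." ordering described on the slide.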

HMMs from multiple alignments
Key idea behind profile HMMs: use the same structure, with different transition and emission probabilities, to capture specific information about each position in the multiple alignment of the whole family.
The model represents the consensus for the family, not the sequence of any particular member.

HBA_HUMAN   ...VGA--HAGEY...
HBB_HUMAN   ...V----NVDEV...
MYG_PHYCA   ...VEA--DVAGH...
GLB3_CHITP  ...VKG------D...
GLB5_PETMA  ...VYS--TYETS...
LGB2_LUPLU  ...FNA--NIPKH...
GLB1_GLYDI  ...IAGADNGAGV...
               ***  *****

Ten columns from the multiple alignment of seven globin protein sequences. The starred columns are the ones that will be treated as matches in the profile HMM.

Multiple alignment by profile HMM training: multiple alignment with a known profile HMM
Before we estimate a model and a multiple alignment simultaneously, we consider the simpler problem of obtaining a multiple alignment from a known model.
This situation arises when we have a multiple alignment and a model of a small representative set of sequences in a family, and we wish to use that model to align a large number of other family members together.

Multiple alignment by profile HMM training: multiple alignment with a known profile HMM
We know how to align a sequence to a profile HMM: the Viterbi algorithm.
Constructing a multiple alignment just requires calculating a Viterbi alignment for each individual sequence.
Residues aligned to the same profile HMM match state are aligned in columns.
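Once each sequence has a Viterbi path, assembling the alignment is bookkeeping. The sketch below assumes each path is given as a list of `(kind, j)` pairs with `kind` in `"M"`, `"I"`, `"D"`; this representation, the function name, and the three example paths are all hypothetical. Showing insert residues in lower case and padding insert regions with `.` is a display choice made here.

```python
def build_msa(sequences, paths, model_length):
    """Lay out sequences in columns according to their state paths."""
    match_rows, insert_rows = [], []
    for seq, path in zip(sequences, paths):
        matches = ["-"] * model_length        # one column per match state
        inserts = [""] * (model_length + 1)   # insert region after state j
        pos = 0
        for kind, j in path:
            if kind == "M":
                matches[j - 1] = seq[pos]
                pos += 1
            elif kind == "I":
                inserts[j] += seq[pos].lower()
                pos += 1
            # "D" consumes no residue; the match column keeps its "-".
        if pos != len(seq):
            raise ValueError("path does not consume the whole sequence")
        match_rows.append(matches)
        insert_rows.append(inserts)
    # Every insert region is as wide as the longest insertion seen there.
    widths = [max(len(ins[j]) for ins in insert_rows)
              for j in range(model_length + 1)]
    rows = []
    for matches, inserts in zip(match_rows, insert_rows):
        row = inserts[0].ljust(widths[0], ".")
        for j in range(1, model_length + 1):
            row += matches[j - 1] + inserts[j].ljust(widths[j], ".")
        rows.append(row)
    return rows


# HYPOTHETICAL Viterbi paths for three short sequences and a 4-state model.
sequences = ["ACSA", "AST", "ACCST"]
paths = [
    [("M", 1), ("M", 2), ("M", 3), ("M", 4)],
    [("M", 1), ("D", 2), ("M", 3), ("M", 4)],
    [("M", 1), ("M", 2), ("I", 2), ("M", 3), ("M", 4)],
]
for row in build_msa(sequences, paths, model_length=4):
    print(row)
```

Insert residues are only left-justified within their region, not aligned to one another, which mirrors the point that a profile HMM does not attempt to align inserts.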

Multiple alignment by profile HMM training: multiple alignment with a known profile HMM
Important difference from other MSA programs:
The Viterbi path through the HMM identifies inserts.
The profile HMM does not align inserts; other multiple alignment algorithms align the whole sequences.
The HMM does not attempt to align residues assigned to insert states.
The insert-state residues usually represent parts of the sequences that are atypical, unconserved, and not meaningfully alignable.
This is a biologically realistic view of multiple alignment.

Example: alignment, given a learned HMM, for 3 sequences
Sequences, in the order they are aligned: ACSA, AST, ACCST.
(Figure: the best path of each sequence through the HMM.)
MSA so far, after ACSA:
ACSA
MSA so far, after AST:
ACSA
A-ST
MSA so far, after ACCST:
AC-SA
A--ST
ACCST

Another Example
Sequences, in the order they are aligned: ATSA, ACCA, ACAST.
(Figure: the best path of each sequence through the HMM.)
MSA so far, after ATSA:
ATSA
MSA so far, after ACCA:
AT-SA
A-CCA
MSA so far, after ACAST:
AT--SA
A-C-CA
AC-AST

HMMs from multiple alignments
Basic profile HMM parameterization
Aim: making the distribution peak around members of the family.
Parameters:
the probability values (emission probabilities, transition probabilities);
the length of the model (chosen by heuristics or in a systematic way).

Training from an existing alignment
Start with a predetermined number of states in your HMM.
For each position in the model, assign a column in the multiple alignment that is relatively conserved.
Emission probabilities are set according to amino acid counts in columns.
Transition probabilities are set according to how many sequences make use of a given delete or insert state.

$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(a) = \frac{E_k(a)}{\sum_{a'} E_k(a')}$$
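Both estimates are counts divided by their row total. A minimal sketch, using the first column of the seven-sequence globin alignment shown earlier for the emissions; the transition counts are hypothetical numbers chosen for illustration, and the function names are not from the slides.

```python
from collections import Counter


def normalize(counts):
    """Turn raw counts into probabilities: value / sum of all values."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def column_counts(alignment, col):
    """Count the residues (ignoring gaps) in one alignment column."""
    return Counter(row[col] for row in alignment if row[col] != "-")


alignment = ["VGA--HAGEY", "V----NVDEV", "VEA--DVAGH", "VKG------D",
             "VYS--TYETS", "FNA--NIPKH", "IAGADNGAGV"]

# Emission estimate e_k(a) for the first match column.
emissions = normalize(column_counts(alignment, 0))

# HYPOTHETICAL transition counts A_kl out of one match state.
transitions = normalize({"M": 5, "I": 1, "D": 1})

print(emissions)
print(transitions)
```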

More on estimation of probabilities (1)
Maximum likelihood (ML) estimation, given the observed frequency $c_{ja}$ of residue a in position j:

$$e_{M_j}(a) = \frac{c_{ja}}{\sum_{a'} c_{ja'}}$$

Problem of ML estimation: what if some residues are never observed? Their estimated probability is zero. This matters especially when the observed examples are few.

More on estimation of probabilities (2)
Simple pseudocounts, where $q_a$ is the background distribution and A is a weight factor:

$$e_{M_j}(a) = \frac{c_{ja} + A q_a}{\sum_{a'} c_{ja'} + A}$$

Laplace's rule: $A q_a = 1$ for all a.
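A short sketch of the pseudocount formula, assuming a uniform background over the 20 amino acids so that a weight of 20 reproduces Laplace's rule. The function name and the example column (five V, one F, one I) are illustrative.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pseudocount_emissions(counts, background, weight):
    """e(a) = (c_a + A * q_a) / (sum of all c_a' + A)."""
    total = sum(counts.get(a, 0) for a in background) + weight
    return {a: (counts.get(a, 0) + weight * q) / total
            for a, q in background.items()}


# ASSUMPTION: uniform background q_a = 1/20; with A = 20 this gives A * q_a = 1.
uniform = {a: 1 / 20 for a in AMINO_ACIDS}
counts = Counter("VVVVVFI")
laplace = pseudocount_emissions(counts, uniform, weight=20)
print(laplace["V"], laplace["A"])
```

Residues that never appear in the column, such as A here, now receive a small nonzero probability instead of zero.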

A simple example
Chose six positions in the model. The highlighted area was selected to be modeled by an insert due to its variability.

Profile HMM training from unaligned sequences
A harder problem: estimating both a model and a multiple alignment from initially unaligned sequences.
Initialization: choose the length of the profile HMM and initialize the parameters.
Training: estimate the model using the Baum-Welch algorithm or the Viterbi alternative.
Multiple alignment: align all sequences to the final model using the Viterbi algorithm and build a multiple alignment as described in the previous section.

Profile HMM training from unaligned sequences
Initial model
The only decision that must be made in choosing an initial structure for Baum-Welch estimation is the length of the model M.
A commonly used rule is to set M to the average length of the training sequences.
We need some randomness in the initial parameters to avoid local maxima.
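The initialization rule can be sketched as follows. This is a simplified illustration under stated assumptions: each model position gets one random emission distribution and one random distribution over the next state type (`"M"`, `"I"`, `"D"`), which is coarser than a full profile HMM parameterization. The function name and the seeding scheme are hypothetical.

```python
import random


def initial_model(sequences, alphabet, seed=0):
    """Model length = average sequence length; parameters drawn at random."""
    rng = random.Random(seed)
    length = round(sum(len(s) for s in sequences) / len(sequences))

    def random_distribution(keys):
        # Small offset keeps every probability strictly positive.
        weights = [rng.random() + 1e-3 for _ in keys]
        total = sum(weights)
        return {key: w / total for key, w in zip(keys, weights)}

    emissions = [random_distribution(alphabet) for _ in range(length)]
    transitions = [random_distribution(["M", "I", "D"]) for _ in range(length)]
    return length, emissions, transitions


length, emissions, transitions = initial_model(["ACSA", "AST", "ACCST"], "ACST")
print(length)
```

Different seeds give different starting points, which is what makes repeated training runs from different initial models possible.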

Find appropriate parameters: Baum-Welch algorithm
An instance of the EM (Expectation-Maximization) algorithms.
Flow of the B-W algorithm: set the initial parameters $\theta$ at random; update the parameters; if the increase of the likelihood is below a threshold $\varepsilon$, output the parameters; otherwise update again.
The update never decreases the likelihood $P(X \mid \theta)$, where X is a set of sequences.
(Figure: likelihood $P(X \mid \theta)$ plotted against the number of updates.)

Find appropriate parameters: the Viterbi alternative
Start with a model whose length matches the average length of the sequences and with random emission and transition probabilities.
Align all the sequences to the model.
Use the alignment to alter the emission and transition probabilities.
Repeat. Continue until the model stops changing.

Multiple alignment by profile HMM training
Avoiding local maxima
The Baum-Welch algorithm is only guaranteed to find a LOCAL maximum.
Models are usually quite long and there are many opportunities to get stuck in a wrong solution.
Multidimensional dynamic programming finds global optima, but is not practical.
Solutions:
Start again many times from different initial models.
Use some form of stochastic search algorithm, e.g. simulated annealing.
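The restart strategy does not depend on the optimizer, so it can be illustrated without an HMM. In the sketch below a simple hill climber on a one-dimensional function with two peaks stands in for Baum-Welch; the objective, step size, and all names are assumptions chosen only to show a local optimizer getting stuck and restarts recovering the better peak.

```python
import random


def objective(x):
    """Toy stand-in for a likelihood: a lower peak near -0.9, a higher one near 1.1."""
    return -(x * x - 1) ** 2 + 0.5 * x


def hill_climb(f, x, step=0.1, iters=1000):
    """Local optimizer: move to the better neighbor until none improves."""
    for _ in range(iters):
        best = max((x - step, x, x + step), key=f)
        if best == x:
            break
        x = best
    return x


def random_restarts(f, n_starts, low, high, seed=0):
    """Run the local optimizer from several random starts and keep the best result."""
    rng = random.Random(seed)
    starts = [rng.uniform(low, high) for _ in range(n_starts)]
    return max((hill_climb(f, s) for s in starts), key=f)


stuck = hill_climb(objective, -1.5)                   # one unlucky start
best = random_restarts(objective, 20, -2.0, 2.0)      # many starts
print(stuck, objective(stuck))
print(best, objective(best))
```

With an HMM, `hill_climb` would be replaced by a full training run and `objective` by the likelihood of the training sequences under the trained model.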

Profile HMM training from unaligned sequences
Advantages:
You take full advantage of the expressiveness of your HMM.
You might not have a multiple alignment on hand.
Disadvantages:
HMM training methods are local optimizers, so you may not get the best alignment or the best model unless you are very careful.
This can be alleviated by starting from a logical model instead of a random one.

Profile HMM Summary
Advantages:
Very expressive profiling method.
Transparent method: you can view and interpret the model produced.
A consistent theory behind gap and insertion scores.
Very effective at detecting remote homologs.
Disadvantages:
Slow: a full search on a database of 400,000 sequences can take 15 hours.
Have to avoid over-fitting and locally optimal models.