Regular Expression Constrained Sequence Alignment

Regular Expression Constrained Sequence Alignment
By Abdullah N. Arslan, Department of Computer Science, University of Vermont
Presented by Tomer Heber & Raz Nissim

Motivation When comparing two proteins, it may be important to take into account a specific structure that they share. Families of similar protein sequences include a conserved region called a motif. The PROSITE database is a collection of these motifs, represented as regular expressions. Our problem concentrates on the sequence alignment of proteins that contain these motifs and need to be aligned according to them.

Formalization of the problem The Optimal Sequence Alignment Problem: String Alignment: Given two strings s and t, a global alignment is obtained by inserting spaces into s and t so that the characters of the resulting strings can be put in one-to-one correspondence with each other.

H - A S K E L L
P A A - C A - L

* Two characters correspond to (match) each other. The optimal sequence alignment is a string alignment that has the maximum number of characters that correspond to each other.

Example: Given two strings: S1 = TGFPSVGKTKDDA, S2 = TFSVAKDDDGKSA. The optimal sequence alignment for these two strings has score 8. * By adding gaps to both strings, 8 characters correspond to each other: T, F, S, V, K, D, D, A. There is no alignment with 9 or more corresponding characters, so this is the optimal solution.
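The count in this example can be checked with a short dynamic program. The following is a minimal sketch, assuming that "corresponding characters" means identical characters placed in the same column, so that the optimum equals the length of a longest common subsequence. The function name max_correspondences is hypothetical.

```python
def max_correspondences(s: str, t: str) -> int:
    """Length of a longest common subsequence of s and t.

    Assumption: only identical aligned characters count, so the best
    alignment score equals the LCS length.
    """
    prev = [0] * (len(t) + 1)          # row for the previous prefix of s
    for a in s:
        cur = [0]                      # aligning against the empty prefix of t
        for j, b in enumerate(t, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)            # extend a match
            else:
                cur.append(max(prev[j], cur[j - 1]))   # skip a character
        prev = cur
    return prev[-1]


print(max_correspondences("TGFPSVGKTKDDA", "TFSVAKDDDGKSA"))  # 8
```

Only two rows of the table are kept at a time, which is enough when the score alone is needed.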

Regular Expression Constrained Sequence Alignment (RECSA) Definition: Given strings S1, S2 and a regular expression R, RECSA is the problem of finding the optimal global sequence alignment between S1 and S2 such that there exist a substring s1 of S1 and a substring s2 of S2 satisfying the following constraints: 1. Substring s1 is aligned with substring s2 (local alignment). 2. Both s1 and s2 match the expression R, i.e. s1, s2 ∈ L(R).

Example: Given two strings and a regular expression: S1 = TGFPSVGKTKDDA, S2 = TFSVAKDDDGKSA. The optimal global sequence alignment constrained by the regular expression R has score 4. * s1 = GFPSVGKT is a substring of S1, and s2 = AKDDDGKS is a substring of S2. s1 and s2 are aligned with each other and both match the regular expression R. 4 characters correspond to (match) each other in strings S1 and S2.
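The second constraint can be tested mechanically. The sketch below is illustrative only: the transcription does not state R, so the pattern used here is an assumption, a PROSITE-style motif chosen because both quoted substrings match it. The helper name satisfies_constraint is hypothetical.

```python
import re

# ASSUMPTION: R is not given in the slides. This pattern is picked for
# illustration because s1 and s2 from the example both match it.
R = re.compile(r"[AG].{4}GK[ST]")


def satisfies_constraint(S: str, s: str, pattern: re.Pattern) -> bool:
    """True if s is a substring of S and the whole of s matches pattern."""
    return s in S and pattern.fullmatch(s) is not None


S1, S2 = "TGFPSVGKTKDDA", "TFSVAKDDDGKSA"
s1, s2 = "GFPSVGKT", "AKDDDGKS"

print(satisfies_constraint(S1, s1, R), satisfies_constraint(S2, s2, R))
```

This checks only that the two substrings are in L(R); it does not check that they are aligned with each other, which is the job of the alignment algorithm itself.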

Needleman-Wunsch algorithm This algorithm computes the maximum global alignment score of two strings of lengths n and m in O(nm) time.

Needleman-Wunsch algorithm (continued) Let A(i, j) denote the optimal global alignment score of the prefixes S1[1..i] and S2[1..j]. Insertions and deletions are referred to as gaps.
score(S1[i] → ε) = the price of adding a gap to S2.
score(ε → S2[j]) = the price of adding a gap to S1.
score(S1[i] → S2[j]) = the price of a match or a mismatch of a character.
A(i, j) = max{ A(i-1, j) + score(S1[i] → ε), A(i, j-1) + score(ε → S2[j]), A(i-1, j-1) + score(S1[i] → S2[j]) }.
Question: Can we improve the runtime for finding an optimal global alignment? Hint: LCS and optimal sequence alignment problems are equivalent.
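The standard Needleman-Wunsch recurrence can be sketched as follows. The scoring parameters match, mismatch and gap are assumptions for illustration, since the slides do not give concrete values. With match = 1 and the other two set to 0, the score is the number of identical aligned characters, which is the LCS connection mentioned in the hint.

```python
def needleman_wunsch(s: str, t: str, match: int = 1,
                     mismatch: int = 0, gap: int = 0) -> int:
    """Optimal global alignment score of s and t in O(len(s) * len(t)) time.

    The scoring values are illustrative assumptions, not taken from the slides.
    """
    n, m = len(s), len(t)
    A = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):              # s[:i] against the empty string
        A[i][0] = A[i - 1][0] + gap
    for j in range(1, m + 1):              # empty string against t[:j]
        A[0][j] = A[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            A[i][j] = max(A[i - 1][j] + gap,       # gap added to t
                          A[i][j - 1] + gap,       # gap added to s
                          A[i - 1][j - 1] + sub)   # match or mismatch
    return A[n][m]


print(needleman_wunsch("TGFPSVGKTKDDA", "TFSVAKDDDGKSA"))  # 8
```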

Reminder Given two strings S1, S2 and a regular expression R, we want to find two substrings s1 and s2 (substrings of S1 and S2) that uphold these constraints: 1. s1 is aligned with s2. 2. Both s1 and s2 match the regular expression R. Previous example:

Review of Automata (State Machines)
An automaton is represented by the 5-tuple (Q, Σ, δ, q0, F), where:
Q - Set of states.
Σ - Finite set of symbols, which we will call the alphabet of the language the automaton accepts.
δ - Transition function, that is, δ : Q × Σ → Q.
q0 - Start state, that is, the state in which the automaton is when no input has been processed yet (obviously, q0 ∈ Q).
F - Set of states of Q (i.e. F ⊆ Q), called accept states.

Review of Automata (cont.) There exists a strong equivalence: for every regular language there is a finite state automaton (DFA or NFA) that accepts it, and every finite state automaton accepts a regular language. Example: this automaton accepts the regular language A (C + G)* (S + T).
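Since the automaton figure is not part of the transcription, here is a minimal sketch of one automaton for A (C + G)* (S + T) together with a generic NFA simulator. The state names q0, q1, q2 and the function name accepts are hypothetical.

```python
def accepts(delta, start, finals, word):
    """Simulate an NFA without epsilon-moves on word.

    delta maps (state, symbol) to a set of successor states.
    """
    current = {start}
    for ch in word:
        # Follow every available transition from every active state.
        current = {q for p in current for q in delta.get((p, ch), ())}
    return bool(current & finals)


# One possible automaton for A (C + G)* (S + T); state names are assumptions.
DELTA = {
    ("q0", "A"): {"q1"},
    ("q1", "C"): {"q1"},
    ("q1", "G"): {"q1"},
    ("q1", "S"): {"q2"},
    ("q1", "T"): {"q2"},
}
FINALS = {"q2"}

for w in ("ACGS", "AT", "A", "CS"):
    print(w, accepts(DELTA, "q0", FINALS, w))
```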

Example of an N × N (product) automaton:

Constructing our Weighted Finite Automaton We construct M from a given regular expression R in several steps: (1) First, given a regular expression R, we construct a nondeterministic finite automaton N = (Q, Σ, δ, q0, F), with no ε-moves, such that L(N) = L(R). N accepts exactly the set of strings described by the regular expression R.

Constructing our Weighted Finite Automaton (cont.) (2) We define a weighted N × N automaton as the finite automaton M = (Q', W, Σ', q0', F'), which we construct as follows: Q' = Q × Q is the set of states. Each state of M corresponds to a pair of states in N. M remembers in each state what part of the regular expression has been seen in S1 and S2. W : Q' → ℝ ∪ {-∞} is a function that assigns real weights to each state in Q'. Initially all weights are -∞.

Step 2 (cont.) Σ' = ((Σ ∪ {ε}) × (Σ ∪ {ε})) \ {ε → ε}. The alphabet of M is the set of edit operations, which does not include ε → ε. q0' = (q0, q0) is the start state, whose weight is 0 and always stays 0. F' = F × F is the set of final states. If M is in a final state, then M has processed an alignment that satisfies the regular expression constraint (there are substrings s1 of S1 and s2 of S2 that are aligned, and both s1 and s2 take N to final states).

Step 2 The Transition Function δ' : Σ' × Q' → Q'. M moves on edit operations. Once an alignment satisfies the regular expression constraint, i.e. once a final state is reached in M, the rest of the alignment does not alter the satisfaction of the constraint. (M has the option of staying in a final state on any input after that final state is reached.)
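The transition rules themselves are not preserved in the transcription. The sketch below therefore assumes the usual product construction: an operation x → y advances both components of a state pair, x → ε advances only the first, and ε → y advances only the second. The self-loop on final states follows the remark on this slide. All names (product_transitions, EPS, the toy automaton) are hypothetical.

```python
from itertools import product

EPS = None  # stands for the empty symbol in an edit operation


def product_transitions(delta, states, alphabet, finals):
    """Build transitions of a pair automaton over edit operations (x, y).

    ASSUMPTION: the standard product construction is used; the slides do
    not show the exact rules. delta maps (state, symbol) to a set of states.
    """
    symbols = list(alphabet) + [EPS]
    trans = {}
    for x, y in product(symbols, repeat=2):
        if x is EPS and y is EPS:
            continue                      # the operation eps -> eps is excluded
        for p, q in product(states, repeat=2):
            ps = {p} if x is EPS else delta.get((p, x), set())
            qs = {q} if y is EPS else delta.get((q, y), set())
            targets = {(p2, q2) for p2 in ps for q2 in qs}
            if p in finals and q in finals:
                targets.add((p, q))       # a final pair may stay where it is
            if targets:
                trans[((x, y), (p, q))] = targets
    return trans


# Toy NFA for A C* S, used only to exercise the construction.
delta = {("q0", "A"): {"q1"}, ("q1", "C"): {"q1"}, ("q1", "S"): {"q2"}}
trans = product_transitions(delta, ["q0", "q1", "q2"], "ACS", {"q2"})
print(trans[(("A", "A"), ("q0", "q0"))])
```

The number of pair states is the square of the number of states of N, which is where the later running-time bound comes from.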

The main idea

Denotations For any given weighted N × N automaton M, we denote by M^{x → y}, for any edit operation x → y, a copy of the automaton M after making the move on x → y. Given two weighted N × N automata M_1 and M_2, we define a commutative and associative operation max-M such that max-M{M_1, M_2} is a weighted N × N automaton with state weights calculated as follows: for all (p, q) ∈ Q', W(p, q) = max{ W_1(p, q), W_2(p, q) }, where W_1 and W_2 are the weight functions of M_1 and M_2.
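The state-wise maximum is simple to express in code. This is a minimal sketch, assuming each weighted automaton is represented only by a dictionary from state pairs to weights, with missing states treated as -∞. The function name max_m is hypothetical.

```python
NEG_INF = float("-inf")


def max_m(*weightings):
    """State-wise maximum of several weight functions.

    Each argument is a dict mapping a state pair to its weight; a state
    missing from a dict is treated as having weight -infinity (assumption).
    """
    states = set().union(*weightings)
    return {s: max(w.get(s, NEG_INF) for w in weightings) for s in states}


w1 = {("q0", "q0"): 0, ("q1", "q3"): 70}
w2 = {("q0", "q0"): 0, ("q1", "q3"): NEG_INF, ("q1", "q1"): 90}
print(max_m(w1, w2))
```

Because max is commutative and associative, the order of the arguments does not matter, which matches the property stated for the operation.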

Denotations (cont.) We denote by S[i..j] the substring of S from position i to position j, i ≤ j. Let S[i] denote the i-th symbol of string S. We denote by M_{i,j} the optimally weighted automaton for S1[1..i] and S2[1..j].

The Algorithm Let |S1| = n and |S2| = m, let N be a nondeterministic automaton with no ε-moves equivalent to the regular expression R, and let M be a weighted N × N automaton constructed from N as described. For all i, j with 0 ≤ i ≤ n and 0 ≤ j ≤ m, M_{i,0} and M_{0,j} are identical weighted N × N automata whose state weights are all -∞ (except for the weight of the start state (q0, q0), which is always 0).

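Algorithm (cont.) Optimal automata formula: the automata M_{i,j} are computed with
M_{i,j} = max-M{ M_{i-1,j}^{S1[i] → ε}, M_{i,j-1}^{ε → S2[j]}, M_{i-1,j-1}^{S1[i] → S2[j]} }.
Here M_{i-1,j}^{S1[i] → ε} is the copy of the optimally weighted automaton M_{i-1,j} after making the move S1[i] → ε. (We'll see an example later on.)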

Algorithm (cont.) After we finish calculating M_{i,j} for all i, j, we focus on M_{n,m} (the optimally weighted automaton for the full strings S1[1..n] = S1 and S2[1..m] = S2). We output the weight max{ W_{n,m}(p, q) | (p, q) ∈ F' }, i.e. the maximum of the weights of the accepting states in M_{n,m}.

Calculating the weights Upon making a move, these two steps must be followed:

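Calculating the weights (example) The parents (q0,q1) with V = 80 and (q1,q1) with V = 90 both reach (q1,q3), whose value V is to be computed, on the move a, where score(a) = -20. Step 1: (q1,q3) gets the value max{ W(q0,q1) + score(a), W(q1,q1) + score(a) }. In this case, max{80 - 20, 90 - 20} = 70.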

Calculating the weights (example cont.) The parents (q0,q1) and (q1,q1) both have V = -∞, and both reach (q1,q3) on the move b → a; (q1,q3) currently has V = 240, which must be updated. Step 2: If (p,q) is reachable with the move b → a only from parents that have a weight of -∞, its weight is updated to -∞ as well. It is imperative that this step is done only after step 1 is completed, since otherwise some updates could be missed.
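Both steps can be reproduced in a few lines. The sketch below is one way to do it, not necessarily how the presenters implemented it: it builds the copy after a move from scratch, starting every state at -∞, so a state whose parents all have weight -∞ ends at -∞ without a separate second pass. The names make_move and arcs are hypothetical, and the numbers are the ones from the two example slides.

```python
NEG_INF = float("-inf")


def make_move(weights, arcs, move_score, start=None):
    """Weights of the automaton copy after one edit operation.

    weights:    dict state -> weight before the move
    arcs:       (parent, child) pairs available for this edit operation
    move_score: score of the edit operation
    start:      optional start state, pinned to 0 as the slides require
    """
    new = {s: NEG_INF for s in weights}
    for parent, child in arcs:
        # Step 1: best parent weight plus the score of the move.
        # -inf + score stays -inf, which also covers step 2.
        new[child] = max(new[child], weights[parent] + move_score)
    if start is not None:
        new[start] = 0
    return new


arcs = [(("q0", "q1"), ("q1", "q3")), (("q1", "q1"), ("q1", "q3"))]

before = {("q0", "q1"): 80, ("q1", "q1"): 90, ("q1", "q3"): NEG_INF}
print(make_move(before, arcs, -20)[("q1", "q3")])   # 70

dead = {("q0", "q1"): NEG_INF, ("q1", "q1"): NEG_INF, ("q1", "q3"): 240}
print(make_move(dead, arcs, -20)[("q1", "q3")])     # -inf
```

An in-place update would instead need the explicit two-pass order described on the slide, because overwriting a weight early could hide it from a later maximum.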

Optimally Weighted - definition M_{i,j} is optimally weighted for S1[1..i] and S2[1..j] if the following two properties hold:

Proof of correctness Claim: For all i, j, the automaton M_{i,j} computed by the algorithm is optimally weighted. Proof: By induction on the nodes (i, j).

Proof of correctness (cont.) Base case: for i = 0 and all j, the weights of M_{0,j} are -∞ (or 0 for the start state). The two properties hold, and the claim is true. Inductive step: assuming the claim is true for M_{i-1,j}, M_{i,j-1} and M_{i-1,j-1}, we will show that the following automata are optimal:

Proof of correctness (Intuition) The values of M_{i,j} are computed by taking, for each state, the maximum of its parents' values and adding the score of one edit operation. Optimality of M_{i,j} follows from the assumed optimality of its parents: an optimal constrained alignment at node (i, j) uses one of these arcs, and we compute the maximum score over all possible alignments.

Run-Time Analysis Computing each M_{i,j} takes time O(r), where r is the size of the transition function of M. * We note that r = O(t^4), where t is the number of states in the automaton N accepting the language L(R). This is because M has O(t^2) states and the transition function of a nondeterministic automaton is of size O(states^2). We need to compute O(nm) automata, therefore we have a running time of O(nmr).
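Written out step by step, the bound is the following. Here |Q_M| is shorthand introduced for the number of states of the weighted N × N automaton, and T(n, m) for the total running time; both symbols are assumptions for this derivation only.

```latex
\begin{aligned}
|Q_M| &= |Q \times Q| = O(t^2), \\
r &= O\!\left(|Q_M|^2\right) = O\!\left((t^2)^2\right) = O(t^4), \\
T(n, m) &= O(nm) \cdot O(r) = O(nmr) = O(nm\,t^4).
\end{aligned}
```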

Optimal alignment path reconstruction In order to reconstruct the optimal path, we can store, for each active state of each automaton, the neighboring automaton and the state in that automaton from which the optimal score is obtained. This is possible by altering the max-M function. Starting at M_{n,m}, we generate the alignment path in reverse.

Questions?? A detailed example will be uploaded for further reference. Thanks