COMBINATORIAL PATTERN MATCHING


Genomic Repeats
Example of repeats: ATGGTCTAGGTCCTAGTGGTC
Motivation to find them:
  Genomic rearrangements are often associated with repeats
  Trace evolutionary secrets
  Many tumors are characterized by an explosion of repeats

Genomic Repeats
The problem is often more difficult: ATGGTCTAGGACCTAGTGTTC

l-mer Repeats
Long repeats are difficult to find
Short repeats are easy to find (e.g., hashing)
Simple approach to finding long repeats:
  Find exact repeats of short l-mers (l is usually 10 to 13)
  Use l-mer repeats to potentially extend into longer, maximal repeats

l-mer Repeats (cont'd)
There are typically many locations where an l-mer is repeated:
GCTTACAGATTCAGTCTTACAGATGGT
The 4-mer TTAC starts at locations 3 and 17

Extending l-mer Repeats
GCTTACAGATTCAGTCTTACAGATGGT
Extend these 4-mer matches to the left and to the right for as long as the two copies agree
Maximal repeat: CTTACAGAT (starting at locations 2 and 16)
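
The extension step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name extend_repeat and its 0-based start arguments are hypothetical, and the sketch does not handle copies that grow into each other. On the slide's sequence, the extension picks up the C that precedes both copies of TTAC as well as the four bases that follow them.

```python
def extend_repeat(genome, i, j, l):
    """Extend an exact l-mer match at 0-based starts i < j to a maximal repeat.

    Hypothetical helper: returns (start_i, start_j, repeat) with 0-based starts.
    """
    # Grow to the left while the bases just before both copies agree.
    left = 0
    while i - left - 1 >= 0 and genome[i - left - 1] == genome[j - left - 1]:
        left += 1
    # Grow to the right while the bases just after both copies agree.
    right = 0
    while j + l + right < len(genome) and genome[i + l + right] == genome[j + l + right]:
        right += 1
    start_i, start_j = i - left, j - left
    length = left + l + right
    return start_i, start_j, genome[start_i:start_i + length]


genome = "GCTTACAGATTCAGTCTTACAGATGGT"
# TTAC starts at 1-based locations 3 and 17, i.e. 0-based 2 and 16.
result = extend_repeat(genome, 2, 16, 4)
```

Here result is (1, 15, "CTTACAGAT"), i.e. the two copies start at 1-based locations 2 and 16.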

Maximal Repeats
To find maximal repeats in this way, we need ALL start locations of all l-mers in the genome
Hashing lets us find repeats quickly in this manner

Hashing DNA sequences
Each l-mer can be translated into a binary string (A, T, C, G can be represented as 00, 01, 10, 11)
After assigning a unique integer per l-mer it is easy to get all start locations of each l-mer in a genome
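
As a small sketch of this encoding, the function below packs an l-mer into an integer two bits at a time, using the A, T, C, G to 00, 01, 10, 11 mapping from the slide. The names CODE and lmer_to_int are illustrative, not part of the lecture.

```python
# Two-bit code per base, as given on the slide.
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}


def lmer_to_int(lmer):
    """Translate an l-mer into the integer whose binary digits are its 2-bit codes."""
    value = 0
    for base in lmer:
        value = (value << 2) | CODE[base]  # shift in the next two bits
    return value
```

For example, TTAC becomes the bit string 01 01 00 10, which is the integer 82. For a fixed l, distinct l-mers always receive distinct integers.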

Hashing: Maximal Repeats
To find repeats in a genome:
  For all l-mers in the genome, note the start position and the sequence
  Generate a hash table index for each unique l-mer sequence
  In each index of the hash table, store all genome start locations of the l-mer which generated that index
  Extend l-mer repeats to maximal repeats
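
The indexing steps above can be sketched with Python's built-in dictionary standing in for the hash table (it takes care of hashing and collisions internally). The function names index_lmers and repeated_lmers are assumptions made for this illustration; start locations are reported 1-based to match the slides.

```python
from collections import defaultdict


def index_lmers(genome, l):
    """Map every l-mer in the genome to the list of its 1-based start locations."""
    table = defaultdict(list)
    for start in range(len(genome) - l + 1):
        table[genome[start:start + l]].append(start + 1)
    return table


def repeated_lmers(genome, l):
    """Keep only the l-mers that occur at two or more locations."""
    return {lmer: starts for lmer, starts in index_lmers(genome, l).items()
            if len(starts) > 1}


genome = "GCTTACAGATTCAGTCTTACAGATGGT"
ttac_starts = index_lmers(genome, 4)["TTAC"]
```

For the example genome, ttac_starts is [3, 17]. Each repeated l-mer found this way is a seed that can then be extended to a maximal repeat.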

Hashing: Collisions
Dealing with collisions: Chain all start locations of l-mers (linked list)
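
To make the chaining idea concrete, here is a minimal sketch of a fixed-size hash table in which every bucket holds a chain of (l-mer, start locations) entries. The class name ChainedLmerTable, the bucket count, and the use of Python lists in place of linked lists are all assumptions made for the illustration.

```python
CODE = {"A": 0, "T": 1, "C": 2, "G": 3}  # 2-bit code per base


class ChainedLmerTable:
    """Hash table with chaining: l-mers that collide share one bucket."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, lmer):
        # Hash = integer encoding of the l-mer, reduced modulo the table size.
        value = 0
        for base in lmer:
            value = value * 4 + CODE[base]
        return self.buckets[value % len(self.buckets)]

    def add(self, lmer, start):
        bucket = self._bucket(lmer)
        for key, positions in bucket:
            if key == lmer:            # l-mer already chained: extend its list
                positions.append(start)
                return
        bucket.append((lmer, [start]))  # new entry at the end of the chain

    def starts(self, lmer):
        for key, positions in self._bucket(lmer):
            if key == lmer:
                return positions
        return []


genome = "GCTTACAGATTCAGTCTTACAGATGGT"
table = ChainedLmerTable(num_buckets=4)
for s in range(len(genome) - 4 + 1):
    table.add(genome[s:s + 4], s + 1)  # 1-based start locations
```

With only four buckets, several different 4-mers necessarily land in the same bucket, yet table.starts("TTAC") still returns [3, 17] because each chain entry keeps its own key.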

Hashing: Summary
When finding genomic repeats from l-mers:
  Generate a hash table index for each l-mer sequence
  In each index, store all genome start locations of the l-mer which generated that index
  Extend l-mer repeats to maximal repeats

Pattern Matching
What if, instead of finding repeats in a genome, we want to find all sequences in a database that contain a given pattern?
This leads us to a different problem, the Pattern Matching Problem

Pattern Matching Problem
Goal: Find all occurrences of a pattern in a text
Input: Pattern p = p_1 … p_n and text t = t_1 … t_m
Output: All positions 1 ≤ i ≤ m − n + 1 such that the n-letter substring of t starting at i matches p
Motivation: Searching a database for a known pattern

Exact Pattern Matching: A Brute-Force Algorithm
PatternMatching(p, t)
  n ← length of pattern p
  m ← length of text t
  for i ← 1 to (m − n + 1)
    if t_i … t_{i+n−1} = p
      output i
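
A runnable sketch of the brute-force algorithm, following the notation of the problem statement (n for the pattern length, m for the text length). The function name pattern_matching is illustrative; positions are returned 1-based.

```python
def pattern_matching(p, t):
    """Return all 1-based positions where pattern p occurs in text t."""
    n, m = len(p), len(t)
    positions = []
    for i in range(m - n + 1):      # every possible alignment of p against t
        if t[i:i + n] == p:         # compare the n-letter substring with p
            positions.append(i + 1)
    return positions


matches = pattern_matching("GCAT", "CGCATC")
```

For the pattern GCAT and the text CGCATC, matches is [2]. Each of the m − n + 1 alignments costs up to n comparisons, which is where the O(nm) bound comes from.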

Exact Pattern Matching: An Example
PatternMatching algorithm for:
  Pattern GCAT
  Text CGCATC
(Figure: the pattern GCAT aligned against each successive position of the text CGCATC.)

Exact Pattern Matching: Running Time
PatternMatching runtime: O(nm)
Better solution: suffix trees
  Can solve the problem in O(n + m) time
  Conceptually related to keyword trees

Keyword Trees: Example
Keyword tree (also known as a trie), built by adding one keyword at a time:
  Apple
  Apropos
  Banana
  Bandana
  Orange

Keyword Trees: Properties
Stores a set of keywords in a rooted labeled tree
Each edge is labeled with a letter from an alphabet
Any two edges coming out of the same vertex have distinct labels
Every keyword stored can be spelled on a path from the root to some leaf
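
These properties map naturally onto nested dictionaries: each vertex is a dictionary whose keys are the labels of its outgoing edges, so two edges leaving the same vertex cannot share a label. The sketch below is one possible representation, not the lecture's; the END marker and the function name build_keyword_tree are assumptions.

```python
END = None  # hypothetical marker key: "a keyword ends at this vertex"


def build_keyword_tree(keywords):
    """Build a keyword tree (trie) as nested dicts, one edge per letter."""
    root = {}
    for word in keywords:
        node = root
        for letter in word:
            # Reuse the edge if it exists, otherwise create a new child vertex.
            node = node.setdefault(letter, {})
        node[END] = word
    return root


tree = build_keyword_tree(["apple", "apropos", "banana", "bandana", "orange"])
```

In this tree the root has three outgoing edges (a, b, o), and the path a, p branches into p (towards apple) and r (towards apropos). Building takes time proportional to the total length of the keywords.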

Keyword Trees: Threading (cont'd)
Thread "appeal" through the keyword tree
Thread "apple" through the keyword tree

Multiple Pattern Matching Problem
Goal: Given a set of patterns and a text, find all occurrences of any of the patterns in the text
Input: k patterns p_1, …, p_k, and text t = t_1 … t_m
Output: Positions 1 ≤ i ≤ m where the substring of t starting at i matches p_j for some 1 ≤ j ≤ k
Motivation: Searching a database for known multiple patterns

Multiple Pattern Matching: Straightforward Approach
Can solve as k Pattern Matching Problems
Runtime: O(kmn) using the PatternMatching algorithm k times
  m: length of the text
  n: average length of the pattern

Multiple Pattern Matching: Keyword Tree Approach
Or, we could use keyword trees:
  Build the keyword tree in O(N) time; N is the total length of all patterns
  With naive threading: O(N + nm)
  Aho-Corasick algorithm: O(N + m)

Keyword Trees: Threading
To match patterns in a text using a keyword tree:
  Build the keyword tree of the patterns
  Thread the text through the keyword tree

Keyword Trees: Threading (cont'd)
Threading is complete when we reach a leaf in the keyword tree
When threading is complete, we've found a pattern in the text
Problem: High memory requirement when N is large
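
Naive threading can be sketched as follows: for every start position in the text, walk down the keyword tree letter by letter and report a pattern whenever its end is reached. This is an illustrative sketch of the naive method only (not Aho-Corasick); the names build_keyword_tree, thread_text, and the END marker are assumptions. Unlike the slide's leaf-only rule, the END marker also reports a pattern that is a prefix of another pattern.

```python
END = None  # hypothetical marker key: "a pattern ends at this vertex"


def build_keyword_tree(patterns):
    """Keyword tree as nested dicts, one edge per letter."""
    root = {}
    for pattern in patterns:
        node = root
        for letter in pattern:
            node = node.setdefault(letter, {})
        node[END] = pattern
    return root


def thread_text(patterns, text):
    """Return (1-based start, pattern) for every pattern occurrence in text."""
    tree = build_keyword_tree(patterns)
    matches = []
    for start in range(len(text)):           # thread from each text position
        node = tree
        for pos in range(start, len(text)):
            if text[pos] not in node:         # fell off the tree: no match here
                break
            node = node[text[pos]]
            if END in node:                   # a complete pattern was spelled
                matches.append((start + 1, node[END]))
    return matches


found = thread_text(["GCAT", "CAT", "AT"], "CGCATC")
```

Here found is [(2, "GCAT"), (3, "CAT"), (4, "AT")]. Each of the m start positions may walk up to the length of the longest pattern, which matches the O(N + nm) bound quoted for naive threading.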

Suffix Trees = Collapsed Keyword Trees
Similar to keyword trees, except edges that form paths are collapsed
Each edge is labeled with a substring of the text
All internal nodes have at least two outgoing edges
Leaves are labeled by the start index of the corresponding suffix

Suffix Tree of a Text
The suffix tree of a text is constructed for all its suffixes:
  ATCATG
  TCATG
  CATG
  ATG
  TG
  G
(Figure: the keyword tree of these suffixes and the suffix tree obtained by collapsing it.)
How much time does it take to build the keyword tree?
Time is linear in the total size of all suffixes, i.e., it is quadratic in the length of the text

Suffix tree (Example)
Let s be a string; a suffix tree of s is a compressed trie of all suffixes of s

Trivial algorithm to build a suffix tree
Put the largest suffix in
Then put each of the remaining suffixes in, one at a time

We will also label each leaf with the starting point of the corresponding suffix.
Trivial algorithm: O(n^2) time for a string of length n
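
The trivial algorithm can be sketched as below: suffixes are inserted one at a time, and an edge is split wherever a new suffix diverges from an existing edge label. Everything here is an assumption made for illustration: the Node class, the function names, and the use of "$" as a terminator that does not occur in the text. It is the quadratic method, not a linear-time construction.

```python
class Node:
    def __init__(self, label=None):
        self.children = {}   # first character -> (edge string, child Node)
        self.label = label   # 1-based suffix start, set on leaves only


def insert_suffix(root, suffix, label):
    node = root
    while True:
        first = suffix[0]
        if first not in node.children:
            # No edge starts with this character: hang a new leaf here.
            node.children[first] = (suffix, Node(label))
            return
        edge, child = node.children[first]
        k = 0  # length of the common prefix of the edge and the suffix
        while k < len(edge) and k < len(suffix) and edge[k] == suffix[k]:
            k += 1
        if k < len(edge):
            # Mismatch inside the edge: split it at position k.
            middle = Node()
            middle.children[edge[k]] = (edge[k:], child)
            node.children[first] = (edge[:k], middle)
            child = middle
        node, suffix = child, suffix[k:]


def build_suffix_tree(text, terminator="$"):
    text += terminator  # assumed terminator; makes the suffixes prefix-free
    root = Node()
    for start in range(len(text)):  # largest suffix first
        insert_suffix(root, text[start:], start + 1)
    return root


def leaf_labels(node):
    """Collect the start positions stored in the leaves below node."""
    if not node.children:
        return [node.label]
    labels = []
    for _, child in node.children.values():
        labels.extend(leaf_labels(child))
    return labels


root = build_suffix_tree("ATCATG")
```

For ATCATG the root has five outgoing edges, and the suffixes ATCATG and ATG share a collapsed edge labeled AT before branching. Each insertion can walk a path as long as the suffix itself, hence the O(n^2) total.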

Suffix Trees: Advantages
The suffix tree of a text is constructed for all its suffixes
Suffix trees build faster than keyword trees
  Keyword tree of the suffixes: quadratic
  Suffix tree: linear (Weiner suffix tree algorithm)

Use of Suffix Trees
Suffix trees hold all suffixes of a text
  e.g., ATCGC: ATCGC, TCGC, CGC, GC, C
  Builds in O(m) time for a text of length m
To find any pattern of length n in a text:
  Build a suffix tree for the text
  Thread the pattern through the suffix tree
  Can find the pattern in the text in O(n) time!
O(n + m) time for the Pattern Matching Problem
  Build the suffix tree and look up the pattern

Pattern Matching with Suffix Trees
SuffixTreePatternMatching(p, t)
  Build suffix tree for text t
  Thread pattern p through suffix tree
  if threading is complete
    output positions of all p-matching leaves in the tree
  else
    output "Pattern does not appear in text"
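
The same procedure can be sketched in Python. To keep the example short it swaps in an uncompressed suffix trie (quadratic to build) for the linear-time suffix tree, so it illustrates the threading and leaf-collection steps rather than the stated running time. The function names, the LEAF marker, and the "$" terminator are assumptions.

```python
LEAF = None  # hypothetical marker key holding a suffix's 1-based start position


def build_suffix_trie(text, terminator="$"):
    """Uncompressed trie of all suffixes of text (stand-in for a suffix tree)."""
    text += terminator  # assumed terminator that does not occur in the text
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start + 1
    return root


def suffix_trie_pattern_matching(p, t):
    """Return the sorted 1-based positions of p in t, or [] if p does not appear."""
    node = build_suffix_trie(t)
    for ch in p:                      # thread the pattern from the root
        if ch not in node:
            return []                 # threading failed: pattern is absent
        node = node[ch]
    positions = []
    stack = [node]                    # threading complete: collect leaves below
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is LEAF:
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


hits = suffix_trie_pattern_matching("AT", "ATCATG")
```

Here hits is [1, 4]: every leaf below the vertex reached by threading AT corresponds to a suffix that begins with AT, and the leaf labels are exactly the match positions.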

Suffix Trees: Exmple

Generalized suffix tree
Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of every s ∈ S
To make these suffixes prefix-free we add a special terminator character at the end of s
To associate each suffix with a unique string in S, add a different special character to each s

Generalized suffix tree (Example)
(Figure: a generalized suffix tree for two strings s_1 and s_2, with each leaf labeled by the start position of its suffix.)
Application: matching a pattern against a database of strings

Longest common substring of two strings
Every node with a leaf descendant from string s_1 and a leaf descendant from string s_2 represents a maximal common substring, and vice versa
Find such a node with the largest string depth
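
The idea can be sketched with an uncompressed generalized suffix trie in which every vertex records which of the two strings has a suffix passing through it; a vertex marked by both strings spells a common substring, and the deepest such vertex gives the answer. This is a quadratic illustration, not the linear-time suffix tree method, and the function name and dictionary layout are assumptions. Because vertices are marked directly, no terminator characters are needed in this simplified version.

```python
def longest_common_substring(s1, s2):
    """Return one longest substring shared by s1 and s2 ('' if none)."""
    root = {"kids": {}, "src": set()}
    for source_id, s in enumerate((s1, s2)):
        for start in range(len(s)):          # insert every suffix of s
            node = root
            for ch in s[start:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "src": set()})
                node["src"].add(source_id)   # a suffix of s passes through here
    best = ""
    stack = [(root, "")]
    while stack:                             # explore only vertices seen in both
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for ch, child in node["kids"].items():
            if len(child["src"]) == 2:
                stack.append((child, path + ch))
    return best


lcs = longest_common_substring("ATCGC", "TTCGA")
```

For ATCGC and TTCGA, lcs is "TCG". When several common substrings tie for the greatest length, this sketch returns whichever it reaches first.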

Multiple Pattern Matching: Summary
Keyword and suffix trees are used to find patterns in a text
Keyword trees: Build the keyword tree of the patterns, and thread the text through it
Suffix trees: Build the suffix tree of the text, and thread the patterns through it