CMPUT 403: Strings. Zachary Friggstad. March 11, 2016

Similar documents
Strings. Zachary Friggstad. Programming Club Meeting

Exact String Matching. The Knuth-Morris-Pratt Algorithm

SORTING. Practical applications in computing require things to be in order. To consider: Runtime. Memory Space. Stability. In-place algorithms???

Applied Databases. Sebastian Maneth. Lecture 14 Indexed String Search, Suffix Trees. University of Edinburgh - March 9th, 2017

CSCI 104 Tries. Mark Redekopp David Kempe

String Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi.

COMP4128 Programming Challenges

Indexing and Searching

String Matching Algorithms

Cache-Oblivious String Dictionaries

Lecture 18 April 12, 2005

Applications of Suffix Tree

Chapter 7. Space and Time Tradeoffs. Copyright 2007 Pearson Addison-Wesley. All rights reserved.

4. Suffix Trees and Arrays

Lecture 5: Suffix Trees

1 Introduciton. 2 Tries /651: Algorithms CMU, Spring Lecturer: Danny Sleator

Crit-bit Trees. Adam Langley (Version )

Chapter 20: Binary Trees

4. Suffix Trees and Arrays

Lecture Notes on Tries

An introduction to suffix trees and indexing

Data Structure and Algorithm Midterm Reference Solution TA

Crit-bit Trees. Adam Langley (Version )

DATA STRUCTURES + SORTING + STRING

Lecture 7 February 26, 2010

Parallel and Sequential Data Structures and Algorithms Lecture (Spring 2012) Lecture 25 Suffix Arrays

Inexact Matching, Alignment. See Gusfield, Chapter 9 Dasgupta et al. Chapter 6 (Dynamic Programming)

String matching algorithms

Binary Search Trees Part Two

Suffix links are stored for compact trie nodes only, but we can define and compute them for any position represented by a pair (u, d):

CSED233: Data Structures (2017F) Lecture12: Strings and Dynamic Programming

Discussion 2C Notes (Week 8, February 25) TA: Brian Choi Section Webpage:

CSE100 Practice Final Exam Section C Fall 2015: Dec 10 th, Problem Topic Points Possible Points Earned Grader

CS 270 Algorithms. Oliver Kullmann. Binary search. Lists. Background: Pointers. Trees. Implementing rooted trees. Tutorial

String Patterns and Algorithms on Strings

COMP4128 Programming Challenges

PRACTICE FINAL EXAM 2

CMSC423: Bioinformatic Algorithms, Databases and Tools. Exact string matching: Suffix trees Suffix arrays

University of Waterloo CS240R Winter 2018 Help Session Problems

CSE 530A. B+ Trees. Washington University Fall 2013

CSE 2123: Collections: Priority Queues. Jeremy Morris

Lecture Notes on Tries

Discussion 2C Notes (Week 10, March 11) TA: Brian Choi Section Webpage:

String Matching Algorithms

Suffix trees and applications. String Algorithms

University of Waterloo CS240R Fall 2017 Review Problems

Lecture L16 April 19, 2012

58093 String Processing Algorithms. Lectures, Autumn 2013, period II

CS106X Handout 39 Autumn 2012 November 28 th, 2012 CS106X Midterm Examination II

CS61B Fall 2015 Guerrilla Section 3 Worksheet. 8 November 2015

CMSC 341 Lecture 7 Lists

PAPER Constructing the Suffix Tree of a Tree with a Large Alphabet

void insert( Type const & ) void push_front( Type const & )

Clever Linear Time Algorithms. Maximum Subset String Searching

CS106X Handout 35 Winter 2018 March 12 th, 2018 CS106X Midterm Examination

speller.c dictionary contains valid words, one per line 1. calls load on the dictionary file

CMSC 341 Lecture 10 Binary Search Trees

Text search. CSE 392, Computers Playing Jeopardy!, Fall

Indexing and Searching

8. Binary Search Tree

Trees. Chapter 6. strings. 3 Both position and Enumerator are similar in concept to C++ iterators, although the details are quite different.

Binary Trees and Binary Search Trees

COS 226 Algorithms and Data Structures Fall Final

Project Proposal. ECE 526 Spring Modified Data Structure of Aho-Corasick. Benfano Soewito, Ed Flanigan and John Pangrazio

CSCI-1200 Data Structures Spring 2018 Lecture 7 Order Notation & Basic Recursion

Data structures for string pattern matching: Suffix trees

CS 315 Data Structures mid-term 2

Clever Linear Time Algorithms. Maximum Subset String Searching. Maximum Subrange

CSE 2123 Recursion. Jeremy Morris

COS 226 Algorithms and Data Structures Spring Second Written Exam

Array Elements as Function Parameters

Indexing and Searching

CMSC423: Bioinformatic Algorithms, Databases and Tools. Exact string matching: introduction

University of Waterloo CS240 Spring 2018 Help Session Problems

CMSC 341 Priority Queues & Heaps. Based on slides from previous iterations of this course

Knuth-Morris-Pratt. Kranthi Kumar Mandumula Indiana State University Terre Haute IN, USA. December 16, 2011

CSC152 Algorithm and Complexity. Lecture 7: String Match

speller.c dictionary contains valid words, one per line 1. calls load on the dictionary file

Giri Narasimhan. COT 6936: Topics in Algorithms. The String Matching Problem. Approximate String Matching

O T E Ū P O K O O T E I K A A M Ā U I U N I V E R S I T Y O F W E L L I N G T O N EXAMINATIONS 2015 TRIMESTER 1 *** WITH SOLUTIONS ***

Ū P O K O O T E I K A A M Ā U I U N I V E R S I T Y O F W E L L I N G T O N EXAMINATIONS 2015 TRIMESTER 1 *** WITH SOLUTIONS ***

Suffix Trees and Arrays

CS419: Computer Networks. Lecture 6: March 7, 2005 Fast Address Lookup:

Test #2. Login: 2 PROBLEM 1 : (Balance (6points)) Insert the following elements into an AVL tree. Make sure you show the tree before and after each ro

Technical University of Denmark

Heaps, Heap Sort, and Priority Queues.

B-Trees. nodes with many children a type node a class for B-trees. an elaborate example the insertion algorithm removing elements

Trees. Carlos Moreno uwaterloo.ca EIT

Suffix Trees. Martin Farach-Colton Rutgers University & Tokutek, Inc

An Efficient Algorithm for Identifying the Most Contributory Substring. Ben Stephenson Department of Computer Science University of Western Ontario

! Tree: set of nodes and directed edges. ! Parent: source node of directed edge. ! Child: terminal node of directed edge

A string is a sequence of characters. In the field of computer science, we use strings more often as we use numbers.

CS106X Handout 35 Autumn 2012 November 12 th, 2012 Section Handout

Topics Applications Most Common Methods Serial Search Binary Search Search by Hashing (next lecture) Run-Time Analysis Average-time analysis Time anal

Topics. Sorting. Sorting. 1) How can we sort data in an array? a) Selection Sort b) Insertion Sort

BackTracking Introduction

String matching algorithms تقديم الطالب: سليمان ضاهر اشراف المدرس: علي جنيدي

An abstract tree stores data that is hierarchically ordered. Operations that may be performed on an abstract tree include:

PRACTICE MIDTERM EXAM #2

Transcription:

CMPUT 403: Strings Zachary Friggstad March 11, 2016

Outline Tries Suffix Arrays Knuth-Morris-Pratt Pattern Matching

Tries Given a dictionary D of strings and a query string s, determine if s is in D. Using a trie, this can be done in O( s ) time after some preprocessing that takes O(T ) time and space where T is the total size of D. Example A trie for dictionary {rat, rate, rant, at}. root r a t n e t a t

A trie is simply a tree whose children can be quickly indexed by characters. Each node also stores a boolean value indicating it is a terminal node (i.e. it represents a string in D). Can store other useful info too. struct TrieNode { bool word ; unordered map<char, TrieNode> c h i l d ; T r i e ( ) : word ( false ) {} } ; // initializing :) TrieNode r o o t ;

Building the Trie Add each dictionary string, one at a time. To add a string, just go as far in the Trie with the string as possible and then start creating new nodes. void a d d s t r i n g ( TrieNode root, const s t r i n g& s ) { // constructs a new TrieNode if the child didn t exist for ( auto c : s ) r o o t = &root >c h i l d [ c ] ; root >word = true ; // record this string } v e c t o r <s t r i n g > d i c t ; for ( const auto& s : d i c t ) a d d s t r i n g (& root, s ) ; Adding one string s takes O(s) time. So adding all strings takes O(T ) time.

Querying Given a trie representing a dictionary D and given a string s, is s D? bool query ( TrieNode root, const s t r i n g& s ) { for ( auto c : s ) { auto i t = root >c h i l d. f i n d ( c ) ; // no child is indexed by c if ( i t == root >c h i l d. end ( ) ) return false ; r o o t = &i t >second ; } // done traversing, check that we ended at a terminal return root >word ; } Other Query Examples How many strings in D begin with s? Need to keep an int at each node v storing # of strings represented below v.

Suffix Arrays Sort all of the suffixes of a string lexicographically. bananaban aban an anaban ananaban ban bananaban n naban nanaban

Obvious how to do it: O(n 2 log n) - create all suffixes and sort them. Can actually get O(n) time! This is overkill and a bit technical, let s see an O(n log 2 n) algorithm. Suffix Array An array of indices of the start positions of the suffixes in sorted order. Example For string bananaban int sarray[] = {5, 7, 3, 1, 6, 0, 8, 4, 2};

Idea: For i = 0,..., log 2 n, sort the suffixes just by their first 2 i characters. ananaban anaban aban an bananaban ban nanaban naban n Can do in O(n log n) time (recall we are actually just sorting the indices, not the whole suffixes).

Next, sort the suffixes by their length 2 prefixes. aban ananaban anaban an bananaban ban n nanaban naban

Next, sort the suffixes by their length 4 prefixes. aban an anaban ananaban ban bananaban n naban nanaban To check if nanaban < naban, just look up the 2nd half of the red parts to see how they were ordered last step.

Generally, to sort the suffixes by their length 2 i+1 prefixes we check < using the ordering based on length 2 i prefixes. Check how the first 2 i characters of two suffixes a, b compare using the previous ordering. If they are different then just return that result. If they are the same, check how the second 2 i characters of a, b compare again using the previous ordering. Example nanaban vs. naban. Length-2 prefixes are the same (na), but next 2 characters (na vs. ba) show the answer is >. Sorting based on length 2 i prefixes then takes only O(n log n) time. Since i ranges up to log 2 n, then overall time is O(n log 2 n).

Can also quickly compute the longest common prefix between adjacent suffixes in the array. Example naban and nanaban are adjacent suffixes in the suffix array. Their common prefix length is 2. This information can easily be construct along with the construction of the suffix array itself.

Knuth-Morris-Pratt Given a source string s and a pattern string p, does p appear as a substring of a? More generally, record all positions i such that p appears as a substring of a starting at position i. Example s = findmatchingmatches p = match Then p appears as a substring of s at indices 4 and 12 (highlighted).

An obvious algorithm is to try all locations of s and linearly scan to see if p matches there. Can take Ω( s p ) time. The Knuth-Morris-Pratt (KMP) algorithm only takes O( s + p ) time! Main Idea: For each index i into p, let π[i] denote the length of the longest proper suffix of p 0 p 1... p i that is also a prefix of p. Confusing? Example! p = acabaca The longest proper suffix of acabac that is also a prefix is ac. pi = {0, 0, 1, 0, 1, 2, 3};

Slide the pattern p over s. acabaca acacabacabaca Match as many symbols as possible acabaca acacabacabaca When stuck, slide the pattern to the next partial match. acabaca acacabacabaca Distance to slide encoded by prefix table π.

Continue matching acabaca acacabacabaca Found a match, record it! Slide pattern over to the next partial match. acabaca acacabacabaca Continue matching acabaca acacabacabaca Another match, record it!

Slide pattern over to next partial match. acabaca acacabacabaca Quit, the pattern is past the end of the string. Runs in O( s + p ) time because each step increases the matched pointer or slides the pattern over. Each slide takes O(1) time using the π values.

void kmp( const s t r i n g& s, const s t r i n g& p ) { v e c t o r <int> p i ; c o m p u t e p r e f i x ( p, p i ) ; // next two slides :) // hit = # matched, i = position of pattern int h i t = 0, i = 0 ; while ( i + p. l e n g t h ( ) < s. l e n g t h ( ) ) { // match as much as possible while ( h i t < p. l e n g t h ( ) && p [ h i t ] == s [ i + h i t ] ) ++h i t ; // record a match if ( h i t == p. l e n g t h ( ) ) cout << "Match:" << i << e n d l ; } } // shift pattern i += h i t p i [ h i t ] ; h i t = p i [ h i t ] ;

How to compute π? Basically the same idea! p = acabaca Note, a suffix of π[i] that is also a prefix of p comes from a suffix of π[i 1] that is a prefix of p. acabac: Note a is a suffix of acaba that is also a prefix of p. Overall idea: slide the pattern over itself! acabaca acabaca When computing π[i], if the current slide matches then nothing to do (see above). Otherwise shift pattern over until there is a match.

void c o m p u t e p r e f i x ( const s t r i n g& p, v e c t o r <int>& p i ) { p i. r e s i z e ( p. l e n g t h ( ) ) ; p i [ 0 ] = 0 ; } for ( int i = 1, h i t = 0 ; i < p. l e n g t h ( ) ; ++i ) { while ( i < p. l e n g t h ( ) ) { while ( h i t > 0 && p [ h i t ]!= p [ i ] ) h i t = p i [ h i t ] ; if ( p [ h i t ]!= p [ i ] ) p i [ i ] = 0 ; // hit == 0 here else p [ i ] = ++h i t ; }

Missing Topics Suffix Trees Manachar s Algorithm: find all maximal palindromes in linear time. Next Week Bipartite matching and network flow. Remember to vote on your preferred extra topics for the weeks following!